
After this mapping, you can think of a single example as a vector in a high-dimensional feature space. If you have D-many features (after expanding categorical features), then this feature vector will have D-many components. We will denote feature vectors as x = ⟨x1, x2, ..., xD⟩, so that xd denotes the value of the dth feature of x. Since these are vectors with real-valued components in D dimensions, we say that they belong to the space R^D.

For D = 2, our feature vectors are just points in the plane, like in Figure 3.1. For D = 3 this is three-dimensional space. For D > 3 it becomes quite hard to visualize. (You should resist the temptation to think of D = 4 as "time" – this will just make things confusing.) Unfortunately, for the sorts of problems you will encounter in machine learning, D ≈ 20 is considered "low dimensional," D ≈ 1000 is "medium dimensional" and D ≈ 100000 is "high dimensional."

? Can you think of problems (perhaps ones already mentioned in this book!) that are low dimensional? That are medium dimensional? That are high dimensional?

3.2 K-Nearest Neighbors

The biggest advantage to thinking of examples as vectors in a high-dimensional space is that it allows us to apply geometric concepts to machine learning. For instance, one of the most basic things that one can do in a vector space is compute distances. In two-dimensional space, the distance between ⟨2, 3⟩ and ⟨6, 1⟩ is given by √((2 − 6)² + (3 − 1)²) = √20 ≈ 4.47. In general, in D-dimensional space, the Euclidean distance between vectors a and b is given by Eq (3.1) (see Figure 3.2 for geometric intuition in three dimensions):

    d(a, b) = [ Σ_{d=1}^{D} (a_d − b_d)² ]^(1/2)        (3.1)
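As a quick sanity check, here is a minimal Python sketch of Eq (3.1); the function name is mine, and math.dist (Python 3.8+) computes the same quantity:

import math

def euclidean_distance(a, b):
    # Eq (3.1): square root of the sum of squared per-feature differences
    assert len(a) == len(b), "both vectors must live in the same R^D"
    return math.sqrt(sum((ad - bd) ** 2 for ad, bd in zip(a, b)))

print(euclidean_distance([2, 3], [6, 1]))                # ~4.47, the two-dimensional example above
print(euclidean_distance([0, 0.4, 0.5], [0.6, 1, 0.8]))  # 0.9, the Figure 3.2 example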

Figure 3.2: A figure showing Euclidean distance in three dimensions. The lengths of the green segments are 0.6, 0.6 and 0.3, respectively, along the x-, y- and z-axes. The total distance between the red dot at ⟨0, 0.4, 0.5⟩ and the orange dot at ⟨0.6, 1, 0.8⟩ is therefore √(0.6² + 0.6² + 0.3²) = 0.9.

? Verify that d from Eq (3.1) gives the same result (4.47) for the previous computation.

Now that you have access to distances between examples, you can start thinking about what it means to learn again. Consider Figure 3.3. We have a collection of training data consisting of positive examples and negative examples. There is a test point marked by a question mark. Your job is to guess the correct label for that point.

Most likely, you decided that the label of this test point is positive. One reason why you might have thought that is that you believe that the label for an example should be similar to the label of nearby points. This is an example of a new form of inductive bias.

The nearest neighbor classifier is built upon this insight. In comparison to decision trees, the algorithm is ridiculously simple. At training time, we simply store the entire training set. At test time, we get a test example x̂. To predict its label, we find the training example x that is most similar to x̂. In particular, we find the training example x that minimizes d(x, x̂). Since x is a training example, it has a corresponding label, y. We predict that the label of x̂ is also y.
Despite its simplicity, this nearest neighbor classifier is incredibly effective. (Some might say frustratingly effective.) However, it is particularly prone to overfitting label noise. Consider the data in Figure 3.4. You would probably want to label the test point positive. Unfortunately, its nearest neighbor happens to be negative. Since the nearest neighbor algorithm only looks at the single nearest neighbor, it cannot consider the "preponderance of evidence" that this point should probably actually be a positive example. It will make an unnecessary error.

Figure 3.4: A figure showing an easy NN classification problem where the test point is a ? and should be positive, but its NN is actually a negative point that's noisy.

A solution to this problem is to consider more than just the single nearest neighbor when making a classification decision. We can consider the K-nearest neighbors and let them vote on the correct class for this test point. If you consider the 3-nearest neighbors of the test point in Figure 3.4, you will see that two of them are positive and one is negative. Through voting, positive would win.

? Why is it a good idea to use an odd number for K?

The full algorithm for K-nearest neighbor classification is given in Algorithm 3.2. Note that there actually is no "training" phase for K-nearest neighbors. In this algorithm we have introduced five new conventions:

1. The training data is denoted by D.

2. We assume that there are N-many training examples.

3. These examples are pairs (x1, y1), (x2, y2), ..., (xN, yN). (Warning: do not confuse xn, the nth training example, with xd, the dth feature for example x.)

4. We use [ ] to denote an empty list and ⊕ · to append · to that list.

5. Our prediction on x̂ is called ŷ.

The first step in this algorithm is to compute distances from the test point to all training points (lines 2-4). The data points are then sorted according to distance. We then apply a clever trick of summing the class labels for each of the K nearest neighbors (lines 6-10) and using the sign of this sum as our prediction.

? Why is the sign of the sum computed in lines 6-10 the same as the majority vote of the associated training examples?

The big question, of course, is how to choose K. As we've seen, with K = 1, we run the risk of overfitting. On the other hand, if K is large (for instance, K = N), then KNN-Predict will always predict the majority class. Clearly that is underfitting. So, K is a hyperparameter of the KNN algorithm that allows us to trade off between overfitting (small value of K) and underfitting (large value of K).

? Why can't you simply pick the value of K that does best on the training data? In other words, why do we have to treat it like a hyperparameter rather than just a parameter?

Algorithm 3 KNN-Predict(D, K, x̂)
1: S ← [ ]
2: for n = 1 to N do
3:   S ← S ⊕ ⟨d(xn, x̂), n⟩    // store distance to training example n
4: end for
5: S ← sort(S)    // put lowest-distance objects first
6: ŷ ← 0
7: for k = 1 to K do
8:   ⟨dist, n⟩ ← Sk    // n is the kth closest data point
9:   ŷ ← ŷ + yn    // vote according to the label for the nth training point
10: end for
11: return sign(ŷ)    // return +1 if ŷ > 0 and −1 if ŷ < 0
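A minimal Python sketch of KNN-Predict, assuming labels are ±1 and examples are plain lists of floats (math.dist plays the role of d from Eq (3.1)); a tie vote is broken toward −1 here, which is one reason to prefer an odd K:

import math

def knn_predict(train, K, x_hat):
    # train is a list of (x, y) pairs with y in {-1, +1}; x_hat is a feature vector
    scored = [(math.dist(x, x_hat), y) for x, y in train]   # lines 2-4: distance to every training example
    scored.sort(key=lambda pair: pair[0])                   # line 5: lowest-distance examples first
    y_hat = sum(y for _, y in scored[:K])                   # lines 6-10: the K nearest neighbors vote
    return 1 if y_hat > 0 else -1                           # line 11: the sign of the vote is the prediction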

Figure 3.5: A figure of a ski and a snowboard.

One aspect of inductive bias that we've seen for KNN is that it assumes that nearby points should have the same label. Another aspect, which is quite different from decision trees, is that all features are equally important! Recall that for decision trees, the key question was: which features are most useful for classification? The whole learning algorithm for a decision tree hinged on finding a small set of good features. This is all thrown away in KNN classifiers: every feature is used, and they are all used the same amount. This means that if you have data with only a few relevant features and lots of irrelevant features, KNN is likely to do poorly.

Figure 3.6: Classification data for ski vs snowboard in 2d.

A related issue with KNN is feature scale. Suppose that we are trying to classify whether some object is a ski or a snowboard (see Figure 3.5). We are given two features about this data: the width and the height. As is standard in skiing, width is measured in millimeters and height is measured in centimeters. Since there are only two features, we can actually plot the entire training set; see Figure 3.6, where ski is the positive class. Based on this data, you might guess that a KNN classifier would do well.

Suppose, however, that the width had instead been recorded in centimeters. This yields the data shown in Figure 3.7. Since the width values are now tiny in comparison to the height values, a KNN classifier will effectively ignore the width values and classify almost purely based on height. The predicted class for the displayed test point has changed because of this feature scaling.

Figure 3.7: Classification data for ski vs snowboard in 2d, with width rescaled to centimeters.

We will discuss feature scaling more in Chapter 5. For now, it is just important to keep in mind that KNN does not have the power to decide which features are important.

MATH REVIEW | VECTOR ARITHMETIC AND VECTOR NORMS

A (real-valued) vector is just an array of real values; for instance x = ⟨1, 2.5, −6⟩ is a three-dimensional vector. In general, if x = ⟨x1, x2, ..., xD⟩, then xd is its dth component. So x3 = −6 in the previous example.

Vector sums are computed pointwise, and are only defined when dimensions match, so ⟨1, 2.5, −6⟩ + ⟨2, −2.5, 3⟩ = ⟨3, 0, −3⟩. In general, if c = a + b, then cd = ad + bd for all d. Vector addition can be viewed geometrically as taking a vector a, then tacking b on to the end of it; the new end point is exactly c.

Vectors can be scaled by real values; for instance 2⟨1, 2.5, −6⟩ = ⟨2, 5, −12⟩; this is called scalar multiplication. In general, ax = ⟨ax1, ax2, ..., axD⟩.

The norm of a vector x, written ||x||, is its length. Unless otherwise specified, this is its Euclidean length, namely ||x|| = √( Σ_d x_d² ).

Figure 3.8: Math review: vector arithmetic and vector norms.
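The same operations in a few lines of Python, using plain lists as vectors (variable names are mine):

import math

x = [1.0, 2.5, -6.0]
y = [2.0, -2.5, 3.0]

vector_sum = [xd + yd for xd, yd in zip(x, y)]   # pointwise sum: [3.0, 0.0, -3.0]
scaled = [2 * xd for xd in x]                    # scalar multiplication: [2.0, 5.0, -12.0]
norm_x = math.sqrt(sum(xd ** 2 for xd in x))     # Euclidean norm ||x|| = sqrt(43.25), about 6.58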

3.3 Decision Boundaries

The standard way that we've been thinking about learning algorithms up to now is in the query model. Based on training data, you learn something. I then give you a query example and you have to guess its label.

An alternative, less passive, way to think about a learned model is to ask: what sort of test examples will it classify as positive, and what sort will it classify as negative? In Figure 3.9, we have a set of training data. The background of the image is colored blue in regions that would be classified as positive (if a query were issued there) and colored red in regions that would be classified as negative. This coloring is based on a 1-nearest neighbor classifier.

Figure 3.9: decision boundary for 1NN.

In Figure 3.9, there is a solid line separating the positive regions from the negative regions. This line is called the decision boundary for this classifier. It is the line with positive land on one side and negative land on the other side.

Decision boundaries are useful ways to visualize the complexity of a learned model. Intuitively, a learned model with a decision boundary that is really jagged (like the coastline of Norway) is really complex and prone to overfitting. A learned model with a decision boundary that is really simple (like the boundary between Arizona and Utah) is potentially underfit.

Figure 3.10: decision boundary for KNN with K = 3.

Now that you know about decision boundaries, it is natural to ask: what do decision boundaries for decision trees look like? In order to answer this question, we have to be a bit more formal about how to build a decision tree on real-valued features. (Remember that the algorithm you learned in the previous chapter implicitly assumed binary feature values.) The idea is to allow the decision tree to ask questions of the form: "is the value of feature 5 greater than 0.2?" That is, for real-valued features, the decision tree nodes are parameterized by a feature and a threshold for that feature. An example decision tree for classifying skis versus snowboards is shown in Figure 3.11.

Figure 3.11: decision tree for ski vs. snowboard.

Now that a decision tree can handle feature vectors, we can talk about decision boundaries. By example, the decision boundary for the decision tree in Figure 3.11 is shown in Figure 3.12. In the figure, space is first split in half according to the first query along one axis. Then, depending on which half of the space you look at, it is either split again along the other axis, or simply classified.

Figure 3.12: decision boundary for the decision tree in the previous figure.

Figure 3.12 is a good visualization of decision boundaries for decision trees in general. Their decision boundaries are axis-aligned cuts. The cuts must be axis-aligned because nodes can only query on a single feature at a time. In this case, since the decision tree was so shallow, the decision boundary was relatively simple.

? What sort of data might yield a very simple decision boundary with a decision tree and a very complex decision boundary with 1-nearest neighbor? What about the other way around?
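To make the (feature, threshold) idea concrete, here is a toy depth-two decision tree in Python; the specific thresholds and leaf labels are invented for illustration and are not the tree from Figure 3.11:

def classify_ski_vs_snowboard(width_mm, height_cm):
    # Toy depth-two decision tree with (feature, threshold) nodes.
    if width_mm > 200.0:             # root node: compare one feature against a threshold
        return "snowboard"           # this half of the space is classified immediately
    elif height_cm > 120.0:          # the other half is split again, along the other axis
        return "ski"
    else:
        return "snowboard"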

3.4 K-Means Clustering

Up through this point, you have learned all about supervised learning (in particular, binary classification). As another example of the use of geometric intuitions and data, we are going to temporarily consider an unsupervised learning problem. In unsupervised learning, our data consists only of examples xn and does not contain corresponding labels. Your job is to make sense of this data, even though no one has provided you with correct labels. The particular notion of "making sense of" that we will talk about now is the clustering task.

Consider the data shown in Figure 3.13. Since this is unsupervised learning and we do not have access to labels, the data points are simply drawn as black dots. Your job is to split this data set into three clusters. That is, you should label each data point as A, B or C in whatever way you want.

Figure 3.13: simple clustering data with clusters in the upper-left, upper-right and bottom-center.

For this data set, it's pretty clear what you should do. You probably labeled the upper-left set of points A, the upper-right set of points B and the bottom set of points C. Or perhaps you permuted these labels. But chances are your clusters were the same as mine.

The K-means clustering algorithm is a particularly simple and effective approach to producing clusters on data like you see in Figure 3.13. The idea is to represent each cluster by its cluster center. Given cluster centers, we can simply assign each point to its nearest center. Similarly, if we know the assignment of points to clusters, we can compute the centers. This introduces a chicken-and-egg problem. If we knew the clusters, we could compute the centers. If we knew the centers, we could compute the clusters. But we don't know either. The general computer science answer to chicken-and-egg problems is iteration. We will start with a guess of the cluster centers. Based on that guess, we will assign each data point to its closest center. Given these new assignments, we can recompute the cluster centers. We repeat this process until the clusters stop moving. The first few iterations of the K-means algorithm are shown in Figure 3.14. In this example, the clusters converge very quickly.

Figure 3.14: first few iterations of K-means running on the previous data set.

Algorithm 3.4 spells out the K-means clustering algorithm in detail. The cluster centers are initialized randomly. In line 6, data point xn is compared against each cluster center µk. It is assigned to cluster k if k is the center with the smallest distance. (That is the "argmin" step.) The variable zn stores the assignment (a value from 1 to K) of example n. In lines 8-12, the cluster centers are re-computed. First, Xk stores all examples that have been assigned to cluster k. The center of cluster k, µk, is then computed as the mean of the points assigned to it. This process repeats until the centers converge.
An obvious question about this algorithm is: does it converge? A second question is: how long does it take to converge? The first question is actually easy to answer. Yes, it does. And in practice, it usually converges quite quickly (usually fewer than 20 iterations). In Chapter 15, we will actually prove that it converges. The question of how long it takes to converge is actually a really interesting question. Even though the K-means algorithm dates back to the mid 1950s, the best known convergence rates were terrible for a long time. Here, terrible means exponential in the number of data points. This was a sad situation, because empirically we knew that it converged very quickly. New algorithm analysis techniques called "smoothed analysis" were invented in 2001 and have been used to show very fast convergence for K-means (among other algorithms). These techniques are well beyond the scope of this book (and this author!), but suffice it to say that K-means is fast in practice and is provably fast in theory.

It is important to note that although K-means is guaranteed to converge and guaranteed to converge quickly, it is not guaranteed to converge to the "right answer." The key problem with unsupervised learning is that we have no way of knowing what the "right answer" is. Convergence to a bad solution is usually due to poor initialization.

? What is the difference between unsupervised and supervised learning that means that we know what the "right answer" is for supervised learning but not for unsupervised learning?

Algorithm 4 K-Means(D, K)
1: for k = 1 to K do
2:   µk ← some random location    // randomly initialize center for kth cluster
3: end for
4: repeat
5:   for n = 1 to N do
6:     zn ← argmink ||µk − xn||    // assign example n to closest center
7:   end for
8:   for k = 1 to K do
9:     Xk ← { xn : zn = k }    // points assigned to cluster k
10:    µk ← mean(Xk)    // re-estimate center of cluster k
11:  end for
12: until µs stop changing
13: return z    // return cluster assignments
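A minimal Python sketch of this procedure, assuming examples are plain lists of floats; initializing centers from K random data points and using math.dist for distances are choices of this sketch, not mandated by the book:

import math
import random

def kmeans(data, K, max_iters=100):
    # data: list of points (lists of floats). Returns (assignments, centers).
    centers = [list(c) for c in random.sample(data, K)]      # initialize: K randomly chosen data points
    z = None
    for _ in range(max_iters):
        # assignment step (line 6): each example goes to its closest center
        new_z = [min(range(K), key=lambda k: math.dist(centers[k], x)) for x in data]
        if new_z == z:                                       # assignments (hence centers) stopped changing
            break
        z = new_z
        # update step (lines 8-11): each center becomes the mean of its assigned points
        for k in range(K):
            members = [x for x, zn in zip(data, z) if zn == k]
            if members:                                      # guard against an empty cluster
                centers[k] = [sum(coord) / len(members) for coord in zip(*members)]
    return z, centers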

3.5 Warning: High Dimensions are Scary

Visualizing one hundred dimensional space is incredibly difficult for humans. After huge amounts of training, some people have reported that they can visualize four dimensional space in their heads. But beyond that seems impossible.¹

¹ If you want to try to get an intuitive sense of what four dimensions looks like, I highly recommend the short 1884 book Flatland: A Romance of Many Dimensions by Edwin Abbott Abbott. You can even read it online at gutenberg.org/ebooks/201.

In addition to being hard to visualize, there are at least two additional problems in high dimensions, both referred to as the curse of dimensionality. One is computational, the other is mathematical.

From a computational perspective, consider the following problem. For K-nearest neighbors, the speed of prediction is slow for a very large data set. At the very least you have to look at every training example every time you want to make a prediction. To speed things up you might want to create an indexing data structure. You can break the plane up into a grid like that shown in Figure 3.15. Now, when the test point comes in, you can quickly identify the grid cell in which it lies. Now, instead of considering all training points, you can limit yourself to training points in that grid cell (and perhaps the neighboring cells). This can potentially lead to huge computational savings.

Figure 3.15: 2d KNN with an overlaid grid; the cell containing the test point is highlighted.

In two dimensions, this procedure is effective. If we want to break space up into a grid whose cells are 0.2×0.2, we can clearly do this with 25 grid cells in two dimensions (assuming the range of the features is 0 to 1 for simplicity). In three dimensions, we'll need 125 = 5×5×5 grid cells. In four dimensions, we'll need 625. By the time we get to "low dimensional" data in 20 dimensions, we'll need 95,367,431,640,625 grid cells (that's 95 trillion, which is about 6 to 7 times the US national debt as of January 2011). So if you're in 20 dimensions, this gridding technique will only be useful if you have at least 95 trillion training examples.
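A quick check of these counts in Python:

cells_per_axis = 5                    # 0.2-wide cells on features that range from 0 to 1
for D in (2, 3, 4, 20):
    print(D, cells_per_axis ** D)     # 25, 125, 625, 95367431640625 (about 95 trillion)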

For "medium dimensional" data (approximately 1000 dimensions), the number of grid cells is a 9 followed by 698 digits before the decimal point. For comparison, the number of atoms in the universe is approximately 1 followed by 80 zeros. So even if each atom yielded a googol training examples, we'd still have far fewer examples than grid cells. For "high dimensional" data (approximately 100000 dimensions), we have a 1 followed by just under 70,000 zeros. Far too big a number to even really comprehend.

Suffice it to say that for even moderately high dimensions, the amount of computation involved in these problems is enormous.

? How does the above analysis relate to the number of data points you would need to fill out a full decision tree with D-many features? What does this say about the importance of shallow trees?

In addition to the computational difficulties of working in high dimensions, there are a large number of strange mathematical occurrences there. In particular, many of the intuitions that you've built up from working in two and three dimensions just do not carry over to high dimensions. We will consider two effects, but there are countless others. The first is that high dimensional spheres look more like porcupines than like balls.² The second is that distances between points in high dimensions are all approximately the same.

² This result was related to me by Mark Reid, who heard about it from Marcus Hutter.
Let's start in two dimensions, as in Figure 3.16. We'll start with four green spheres, each of radius one and each touching exactly two other green spheres. (Remember that in two dimensions a "sphere" is just a "circle.") We'll place a small blue sphere in the middle so that it touches all four green spheres. We can easily compute the radius of this blue sphere. The Pythagorean theorem says that 1² + 1² = (1 + r)², so solving for r we get r = √2 − 1 ≈ 0.41. Thus, by calculation, the blue sphere lies entirely within the cube (cube = square) that contains the green spheres. (Yes, this is also obvious from the picture, but perhaps you can see where this is going.)

Figure 3.16: 2d spheres in spheres.

Now we can do the same experiment in three dimensions, as shown in Figure 3.17. Again, we can use the Pythagorean theorem to compute the radius of the blue sphere. Now, we get 1² + 1² + 1² = (1 + r)², so r = √3 − 1 ≈ 0.73. This is still entirely enclosed in the cube of width four that holds all eight green spheres.

Figure 3.17: 3d spheres in spheres.

At this point it becomes difficult to produce figures, so you'll have to apply your imagination. In four dimensions, we would have 16 green spheres (called hyperspheres), each of radius one. They would still be inside a cube (called a hypercube) of width four. The blue hypersphere would have radius r = √4 − 1 = 1. Continuing to five dimensions, the blue hypersphere embedded in 32 green hyperspheres would have radius r = √5 − 1 ≈ 1.24, and so on.

In general, in D-dimensional space, there will be 2^D green hyperspheres of radius one. Each green hypersphere will touch exactly D-many other hyperspheres. The blue hypersphere in the middle will touch them all and will have radius r = √D − 1.

Think about this for a moment. As the number of dimensions grows, the radius of the blue hypersphere grows without bound! For example, in 9 dimensions the radius of the blue hypersphere is now √9 − 1 = 2. But with a radius of two, the blue hypersphere is now "squeezing" between the green hyperspheres and touching the edges of the hypercube. In 10-dimensional space, the radius is approximately 2.16 and it pokes outside the cube.
The second strange fact we will consider has to do with the distances between points in high dimensions. We start by considering random points in one dimension. That is, we generate a fake data set consisting of 100 random points between zero and one. We can do the same in two dimensions and in three dimensions. See Figure ?? for data distributed uniformly on the unit hypercube in different dimensions.

Now, pick two of these points at random and compute the distance between them. Repeat this process for all pairs of points and average the results. For the data shown in Figure ??, the average distance between points in one dimension is about 0.346; in two dimensions it is about 0.518; and in three dimensions it is about 0.615. The fact that these increase as the dimension increases is not surprising. The furthest apart two points can be in a 1-dimensional hypercube (line) is 1; the furthest in a 2-dimensional hypercube (square) is √2 (opposite corners); the furthest in a 3-d hypercube is √3; and so on. In general, the furthest apart two points can be in a D-dimensional hypercube is √D.
You can actually analyze these values mathematically. Write Uni_D for the uniform distribution on the unit hypercube in D dimensions. The quantity we are interested in computing is:

    avgDist(D) = E_{a∼Uni_D} [ E_{b∼Uni_D} [ ||a − b|| ] ]        (3.2)

The squared distance is easy to handle in closed form: each coordinate contributes E[(a_d − b_d)²] = 1/6, so E[ ||a − b||² ] = D/6, and for large D the average distance avgDist(D) concentrates around √(D/6) ≈ 0.41 √D. Because we know that the maximum distance between two points grows like √D, this says that the ratio between average distance and maximum distance converges to a constant of about 0.41.

What is more interesting, however, is the variance of the distribution of distances. You can show that in D dimensions this variance stays approximately constant (about 1/18), independent of D. This means that when you look at distances divided by the maximum distance √D, the variance behaves like 1/(18D): the spread of the normalized distances continues to shrink as D grows.³

³ Brin 1995
When I first saw and re-proved this result, I was skeptical, as I imagine you are. So I implemented it. In Figure 3.18 you can see the results. This presents a histogram of distances between random points in D dimensions for D ∈ {2, 8, 32, 128, 512}. As you can see, all of these distances begin to concentrate around 0.4 √D, even for moderate values of D.

Figure 3.18: histogram of distances between uniformly random points, for D = 2, 8, 32, 128 and 512 dimensions; the x-axis is distance / √(dimensionality) and the y-axis is the number of pairs of points at that distance.
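A quick simulation of this experiment in Python (100 points per dimension and the particular dimensions printed are arbitrary choices of this sketch):

import math
import random

def avg_pairwise_distance(D, num_points=100):
    # average Euclidean distance over all pairs of points drawn uniformly from the unit hypercube
    pts = [[random.random() for _ in range(D)] for _ in range(num_points)]
    dists = [math.dist(a, b) for i, a in enumerate(pts) for b in pts[i + 1:]]
    return sum(dists) / len(dists)

for D in (1, 2, 3, 10, 100):
    print(D, round(avg_pairwise_distance(D) / math.sqrt(D), 3))   # ratios creep up toward about 0.41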

Algorithm 8 CrossValidate(LearningAlgorithm, Data, K)
1: ê ← ∞    // store lowest error encountered so far
2: α̂ ← unknown    // store the hyperparameter setting that yielded it
3: for all hyperparameter settings α do
4:   err ← [ ]    // keep track of the K-many error estimates
5:   for k = 1 to K do
6:     train ← {(xn, yn) ∈ Data : n mod K ≠ k − 1}
7:     test ← {(xn, yn) ∈ Data : n mod K = k − 1}    // test every Kth example
8:     model ← Run LearningAlgorithm on train
9:     err ← err ⊕ error of model on test    // add current error to list of errors
10:  end for
11:  avgErr ← mean of set err
12:  if avgErr < ê then
13:    ê ← avgErr    // remember these settings
14:    α̂ ← α    // because they're the best so far
15:  end if
16: end for

An alternative is the idea of cross validation. In cross validation, you break your training data up into 10 equally-sized partitions. You train a learning algorithm on 9 of them and test it on the remaining 1. You do this 10 times, each time holding out a different partition as the "development" part. You can then average your performance over all ten parts to get an estimate of how well your model will perform in the future. You can repeat this process for every possible choice of hyperparameters to get an estimate of which one performs best. The general K-fold cross validation technique is shown in Algorithm 5.6, where K = 10 in the preceding discussion.

In fact, the development data approach can be seen as an approximation to cross validation, wherein only one of the K loops (line 5 in Algorithm 5.6) is executed.
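A minimal Python sketch of K-fold cross validation over hyperparameter settings, mirroring CrossValidate above; the learning_algorithm(train, alpha) and model.error_on(test) interface is a hypothetical stand-in, not an API from the book:

import statistics

def cross_validate(learning_algorithm, data, K, hyperparameter_settings):
    # returns the hyperparameter setting with the lowest average error across the K folds
    best_error, best_alpha = float("inf"), None
    for alpha in hyperparameter_settings:
        fold_errors = []
        for k in range(K):
            train = [ex for n, ex in enumerate(data) if n % K != k]   # everything except fold k
            test = [ex for n, ex in enumerate(data) if n % K == k]    # every Kth example
            model = learning_algorithm(train, alpha)
            fold_errors.append(model.error_on(test))
        avg_error = statistics.mean(fold_errors)
        if avg_error < best_error:                                    # remember the best setting so far
            best_error, best_alpha = avg_error, alpha
    return best_alpha, best_error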
Typical choices for K are 2, 5, 10 and N. By far the most common is K = 10: 10-fold cross validation. Sometimes 5 is used for efficiency reasons. And sometimes 2 is used for subtle statistical reasons, but that is quite rare. In the case that K = N, this is known as leave-one-out cross validation (abbreviated as LOO cross validation). After running cross validation, you have two choices. You can either select one of the K trained models as your final model to make predictions with, or you can train a new model on all of the data, using the hyperparameters selected by cross-validation. If you have the time, the latter is probably a better option.

It may seem that LOO cross validation is prohibitively expensive to run. This is true for most learning algorithms except for K-nearest neighbors. For KNN, leave-one-out is actually very natural. We loop through each training point and ask ourselves whether this example would be correctly classified for all different possible values of K.

Algorithm 9 KNN-Train-LOO(D)
1: errk ← 0, ∀ 1 ≤ k ≤ N − 1    // errk stores how well you do with k-NN
2: for n = 1 to N do
3:   Sm ← ⟨||xn − xm||, m⟩, ∀ m ≠ n    // compute distances to other points
4:   S ← sort(S)    // put lowest-distance objects first
5:   ŷ ← 0    // current label prediction
6:   for k = 1 to N − 1 do
7:     ⟨dist, m⟩ ← Sk
8:     ŷ ← ŷ + ym    // let kth closest point vote
9:     if sign(ŷ) ≠ yn then
10:      errk ← errk + 1    // one more error for k-NN
11:    end if
12:  end for
13: end for
14: return argmink errk    // return the K that achieved lowest error
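A minimal Python sketch mirroring KNN-Train-LOO, assuming ±1 labels and list-of-floats examples (a tie vote is treated as a −1 prediction here, an arbitrary choice of this sketch):

import math

def knn_train_loo(train):
    # train: list of (x, y) pairs with y in {-1, +1}; returns the number of neighbors
    # K with the lowest leave-one-out error
    N = len(train)
    err = [0] * N                                            # err[k]: LOO errors when voting with k neighbors
    for n, (xn, yn) in enumerate(train):
        others = [(math.dist(xn, xm), ym) for m, (xm, ym) in enumerate(train) if m != n]
        others.sort(key=lambda pair: pair[0])                # closest points first
        vote = 0
        for k, (_, ym) in enumerate(others, start=1):        # let the kth closest point vote
            vote += ym
            if (1 if vote > 0 else -1) != yn:                # prediction with k neighbors is wrong
                err[k] += 1
    return min(range(1, N), key=lambda k: err[k])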

This requires only as much computation as computing the K nearest neighbors for the highest value of K. This is such a popular and effective approach for KNN classification that it is spelled out in Algorithm 5.6.

Overall, the main advantage of cross validation over development data is robustness. The main advantage of development data is speed.
One warning to keep in mind is that the goal of both cross validation and development data is to estimate how well you will do in the future. This is a question of statistics, and holds only if your test data really looks like your training data. That is, it is drawn from the same distribution. In many practical cases, this is not entirely true.

For example, in person identification, we might try to classify every pixel in an image based on whether it contains a person or not. If we have 100 training images, each with 10,000 pixels, then we have a total of 1m training examples. The classification for a pixel in image 5 is highly dependent on the classification for a neighboring pixel in the same image. So if one of those pixels happens to fall in training data, and the other in development (or cross validation) data, your model will do unreasonably well. In this case, it is important that when you cross validate (or use development data), you do so over images, not over pixels. The same goes for text problems where you sometimes want to classify things at a word level, but are handed a collection of documents. The important thing to keep in mind is that it is the images (or documents) that are drawn independently from your data distribution, and not the pixels (or words), which are drawn dependently.

5.7 Hypothesis Testing and Statistical Significance

Suppose that you've presented a machine learning solution to your boss that achieves 7% error on cross validation. Your nemesis, Gabe, gives a solution to your boss that achieves 6.9% error on cross validation. How impressed should your boss be? It depends. If this 0.1% improvement was measured over 1000 examples, perhaps not too impressed. It would mean that Gabe got exactly one more example right than you did. (In fact, they probably got 15 more right and 14 more wrong.) If this 0.1% improvement was measured over 1,000,000 examples, perhaps it is more impressive.
This is one of the most fundamental questions in statistics. You have a scientific hypothesis of the form "Gabe's algorithm is better than mine." You wish to test whether this hypothesis is true. You are testing it against the null hypothesis, which is that Gabe's algorithm is no better than yours. You've collected data (either 1000 or 1m data points) to measure the strength of this hypothesis. You want to ensure that the difference in performance of these two algorithms is statistically significant: i.e., is probably not just due to random luck. (A more common question statisticians ask is whether one drug treatment is better than another, where "another" is either a placebo or the competitor's drug.)

There are about ∞-many ways of doing hypothesis testing. Like evaluation metrics and the number of folds of cross validation, this is something that is very discipline specific. Here, we will discuss two popular tests: the paired t-test and bootstrapping. These tests, and other statistical tests, have underlying assumptions (for instance, assumptions about the distribution of observations) and strengths (for instance, small or large samples). In most cases, the goal of hypothesis testing is to compute a p-value: namely, the probability that the observed difference in performance was by chance. The standard way of reporting results is to say something like "there is a 95% chance that this difference was not by chance." The value 95% is arbitrary, and occasionally people use weaker (90%) or stronger (99.5%) tests.
The t-test is an example of a parametric test. It is applicable when the null hypothesis states that the difference between two responses has mean zero and unknown variance. The t-test actually assumes that data is distributed according to a Gaussian distribution, which is probably not true of binary responses. Fortunately, for large samples (at least a few hundred), binary samples are well approximated by a Gaussian distribution. So long as your sample is sufficiently large, the t-test is reasonable either for regression or classification problems.

Table 5.3: Significance values for the t-test.
t ≥ 1.28: 90.0%
t ≥ 1.64: 95.0%
t ≥ 1.96: 97.5%
t ≥ 2.58: 99.5%

Suppose that you evaluate two algorithms on N-many examples.
