
Self Reading - KNN - Notes

The document discusses the concept of representing data as points in high-dimensional space and introduces the K-nearest neighbor (KNN) algorithm for classification. It explains how to compute distances between data points and emphasizes the importance of feature vectors in machine learning. Additionally, it addresses challenges such as the curse of dimensionality and the impact of feature scaling on the KNN classifier's performance.


3 | Geometry and Nearest Neighbors

Learning Objectives:
• Describe a data set as points in a high-dimensional space.
• Explain the curse of dimensionality.
• Compute distances between points in high-dimensional space.
• Implement a K-nearest neighbor model of learning.
• Draw decision boundaries.
• Implement the K-means algorithm for clustering.

Dependencies: Chapter 1

"Our brains have evolved to get us out of the rain, find where the berries are, and keep us from getting killed. Our brains did not evolve to help us grasp really large numbers or to look at things in a hundred thousand dimensions." – Ronald Graham

You can think of prediction tasks as mapping inputs (course reviews) to outputs (course ratings). As you learned in the previous chapter, decomposing an input into a collection of features (e.g., words that occur in the review) forms a useful abstraction for learning. Therefore, inputs are nothing more than lists of feature values. This suggests a geometric view of data, where we have one dimension for every feature. In this view, examples are points in a high-dimensional space.

Once we think of a data set as a collection of points in high-dimensional space, we can start performing geometric operations on this data. For instance, suppose you need to predict whether Alice will like Algorithms. Perhaps we can try to find another student who is most "similar" to Alice, in terms of favorite courses. Say this student is Jeremy. If Jeremy liked Algorithms, then we might guess that Alice will as well. This is an example of a nearest neighbor model of learning. By inspecting this model, we'll see a completely different set of answers to the key learning questions we discovered in Chapter 1.

3.1 From Data to Feature Vectors

An example is just a collection of feature values about that example, for instance the data in Table 1 from the Appendix. To a person, these features have meaning. One feature might count how many times the reviewer wrote "excellent" in a course review. Another might count the number of exclamation points. A third might tell us if any text is underlined in the review.

To a machine, the features themselves have no meaning. Only the feature values, and how they vary across examples, mean something to the machine. From this perspective, you can think about an example as being represented by a feature vector consisting of one "dimension" for each feature, where each dimension is simply some real value.

Consider a review that said "excellent" three times, had one exclamation point and no underlined text. This could be represented by the feature vector ⟨3, 1, 0⟩. An almost identical review that happened to have underlined text would have the feature vector ⟨3, 1, 1⟩.

Note, here, that we have imposed the convention that for binary features (yes/no features), the corresponding feature values are 0 and 1, respectively. This was an arbitrary choice. We could have made them 0.92 and 16.1 if we wanted. But 0/1 is convenient and helps us interpret the feature values. When we discuss practical issues in Chapter 5, you will see other reasons why 0/1 is a good choice.
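As a concrete illustration (a minimal sketch of our own, not from the text), the three features above could be extracted from a raw review string like this; the helper name and the way underlining is passed in are assumptions:

```python
def review_to_features(text: str, underlined: bool) -> list[float]:
    """Map a raw review to the three example features:
    (# of 'excellent', # of '!', any underlined text?)."""
    excellent_count = text.lower().count("excellent")
    exclamation_count = text.count("!")
    return [float(excellent_count), float(exclamation_count), 1.0 if underlined else 0.0]

# "excellent" three times, one '!', no underlining -> [3.0, 1.0, 0.0]
review = "Excellent lectures, excellent notes, excellent exams!"
print(review_to_features(review, underlined=False))
```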
Figure 3.1 shows the data from Table 1 in three views. These three views are constructed by considering two features at a time in different pairs. In all cases, the plusses denote positive examples and the minuses denote negative examples. In some cases, the points fall on top of each other, which is why you cannot see 20 unique points in all figures.

Figure 3.1: A figure showing projections of the data in two dimensions in three ways – see text. Top: the horizontal axis corresponds to the first feature (easy?) and the vertical axis corresponds to the second feature (AI?); Middle: horizontal is the second feature and vertical is the third (systems?); Bottom: horizontal is the first and vertical is the third. Truly, the data points would lie exactly on (0, 0) or (1, 0), etc., but they have been perturbed slightly to show duplicates.

Question: Match the example ids from Table 1 with the points in Figure 3.1.

The mapping from feature values to vectors is straightforward in the case of real-valued features (trivial) and binary features (mapped to zero or one). It is less clear what to do with categorical features. For example, if our goal is to identify whether an object in an image is a tomato, blueberry, cucumber or cockroach, we might want to know its color: is it Red, Blue, Green or Black?

One option would be to map Red to a value of 0, Blue to a value of 1, Green to a value of 2 and Black to a value of 3. The problem with this mapping is that it turns an unordered set (the set of colors) into an ordered set (the set {0, 1, 2, 3}). In itself, this is not necessarily a bad thing. But when we go to use these features, we will measure examples based on their distances to each other. By doing this mapping, we are essentially saying that Red and Blue are more similar (distance of 1) than Red and Black (distance of 3). This is probably not what we want to say!
A solution is to turn a categorical feature that can take four different values (say: Red, Blue, Green and Black) into four binary features (say: IsItRed?, IsItBlue?, IsItGreen? and IsItBlack?). In general, if we start from a categorical feature that takes V values, we can map it to V-many binary indicator features.

Question: The computer scientist in you might be saying: actually we could map it to log2 V-many binary features! Is this a good idea or not?

With that, you should be able to take a data set and map each example to a feature vector through the following mapping (a small code sketch follows the list):

• Real-valued features get copied directly.

• Binary features become 0 (for false) or 1 (for true).

• Categorical features with V possible values get mapped to V-many binary indicator features.
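Here is a small sketch of that mapping (our own, with invented feature names) for a single example with one real-valued feature, one binary feature, and one categorical feature taking V = 4 values:

```python
COLORS = ["Red", "Blue", "Green", "Black"]  # the V = 4 categorical values

def to_feature_vector(weight: float, is_ripe: bool, color: str) -> list[float]:
    """Real-valued features are copied, binary features become 0/1,
    and the categorical color becomes V-many binary indicator features."""
    indicators = [1.0 if color == c else 0.0 for c in COLORS]
    return [weight, 1.0 if is_ripe else 0.0] + indicators

# A ripe, 150-gram, red object -> [150.0, 1.0, 1.0, 0.0, 0.0, 0.0]
print(to_feature_vector(150.0, True, "Red"))
```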
After this mapping, you can think of a single example as a vector in a high-dimensional feature space. If you have D-many features (after expanding categorical features), then this feature vector will have D-many components. We will denote feature vectors as x = ⟨x_1, x_2, ..., x_D⟩, so that x_d denotes the value of the dth feature of x. Since these are vectors with real-valued components in D dimensions, we say that they belong to the space R^D.

For D = 2, our feature vectors are just points in the plane, like in Figure 3.1. For D = 3 this is three-dimensional space. For D > 3 it becomes quite hard to visualize. (You should resist the temptation to think of D = 4 as "time" – this will just make things confusing.) Unfortunately, for the sorts of problems you will encounter in machine learning, D ≈ 20 is considered "low dimensional," D ≈ 1000 is "medium dimensional" and D ≈ 100000 is "high dimensional."

Question: Can you think of problems (perhaps ones already mentioned in this book!) that are low dimensional? That are medium dimensional? That are high dimensional?

3.2 K-Nearest Neighbors

The biggest advantage to thinking of examples as vectors in a high-dimensional space is that it allows us to apply geometric concepts to machine learning. For instance, one of the most basic things that one can do in a vector space is compute distances. In two-dimensional space, the distance between ⟨2, 3⟩ and ⟨6, 1⟩ is given by \sqrt{(2-6)^2 + (3-1)^2} = \sqrt{20} \approx 4.47. In general, in D-dimensional space, the Euclidean distance between vectors a and b is given by Eq (3.1) (see Figure 3.2 for geometric intuition in three dimensions):

    d(a, b) = \left[ \sum_{d=1}^{D} (a_d - b_d)^2 \right]^{1/2}    (3.1)
Now that you have access to distances between examples, you can start thinking about what it means to learn again. Consider Figure 3.3. We have a collection of training data consisting of positive examples and negative examples. There is a test point marked by a question mark. Your job is to guess the correct label for that point.

Figure 3.2: A figure showing Euclidean distance in three dimensions. The lengths of the green segments are 0.6, 0.6 and 0.3 respectively, in the x-, y- and z-axes. The total distance between the red dot and the orange dot is therefore \sqrt{0.6^2 + 0.6^2 + 0.3^2} = 0.9.

Question: Verify that d from Eq (3.1) gives the same result (4.47) for the previous computation.

Most likely, you decided that the label of this test point is positive. One reason why you might have thought that is that you believe that the label for an example should be similar to the label of nearby points. This is an example of a new form of inductive bias.

The nearest neighbor classifier is built upon this insight. In comparison to decision trees, the algorithm is ridiculously simple. At training time, we simply store the entire training set. At test time, we get a test example x̂. To predict its label, we find the training example x that is most similar to x̂. In particular, we find the training example x that minimizes d(x, x̂). Since x is a training example, it has a corresponding label, y. We predict that the label of x̂ is also y.
Despite its simplicity, this nearest neighbor classifier is incredibly effective. (Some might say frustratingly effective.) However, it is particularly prone to overfitting label noise. Consider the data in Figure 3.4. You would probably want to label the test point positive. Unfortunately, its nearest neighbor happens to be negative. Since the nearest neighbor algorithm only looks at the single nearest neighbor, it cannot consider the "preponderance of evidence" that this point should probably actually be a positive example. It will make an unnecessary error.

Figure 3.4: A figure showing an easy NN classification problem where the test point is a ? and should be positive, but its NN is actually a negative point that's noisy.

A solution to this problem is to consider more than just the single nearest neighbor when making a classification decision. We can consider the K-nearest neighbors and let them vote on the correct class for this test point. If you consider the 3-nearest neighbors of the test point in Figure 3.4, you will see that two of them are positive and one is negative. Through voting, positive would win.

Question: Why is it a good idea to use an odd number for K?

The full algorithm for K-nearest neighbor classification is given in Algorithm 3. Note that there actually is no "training" phase for K-nearest neighbors. In this algorithm we have introduced five new conventions:

1. The training data is denoted by D.

2. We assume that there are N-many training examples.

3. These examples are pairs (x_1, y_1), (x_2, y_2), ..., (x_N, y_N). (Warning: do not confuse x_n, the nth training example, with x_d, the dth feature for example x.)

4. We use [] to denote an empty list and ⊕ to append an element to that list.

5. Our prediction on x̂ is called ŷ.

The first step in this algorithm is to compute distances from the test point to all training points (lines 2-4). The data points are then sorted according to distance. We then apply a clever trick of summing the class labels for each of the K nearest neighbors (lines 6-10) and using the sign of this sum as our prediction.

Question: Why is the sign of the sum computed in lines 6-10 the same as the majority vote of the associated training examples?

The big question, of course, is how to choose K. As we've seen, with K = 1, we run the risk of overfitting. On the other hand, if K is large (for instance, K = N), then KNN-Predict will always predict the majority class. Clearly that is underfitting. So, K is a hyperparameter of the KNN algorithm that allows us to trade off between overfitting (small value of K) and underfitting (large value of K).

Question: Why can't you simply pick the value of K that does best on the training data? In other words, why do we have to treat it like a hyperparameter rather than just a parameter?

Algorithm 3 KNN-Predict(D, K, x̂)

1:  S ← []
2:  for n = 1 to N do
3:      S ← S ⊕ ⟨d(x_n, x̂), n⟩      // store distance to training example n
4:  end for
5:  S ← sort(S)                      // put lowest-distance objects first
6:  ŷ ← 0
7:  for k = 1 to K do
8:      ⟨dist, n⟩ ← S_k              // n is the kth closest data point
9:      ŷ ← ŷ + y_n                  // vote according to the label for the nth training point
10: end for
11: return sign(ŷ)                   // return +1 if ŷ > 0 and −1 if ŷ < 0
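For reference, here is a minimal Python sketch of KNN-Predict following the same conventions (labels y_n ∈ {−1, +1}); the function name and the toy data are our own, not part of the pseudocode:

```python
import math

def knn_predict(data, k, x_hat):
    """data: list of (x_n, y_n) pairs with y_n in {-1, +1}; x_hat: query point.
    Mirrors Algorithm 3: compute all distances, sort, sum the K nearest labels."""
    # lines 2-4: store (distance to training example n, index n)
    scored = [(math.dist(x_n, x_hat), n) for n, (x_n, _) in enumerate(data)]
    scored.sort()                      # line 5: lowest-distance objects first
    y_hat = 0.0                        # line 6
    for dist, n in scored[:k]:         # lines 7-10: vote with the K nearest labels
        y_hat += data[n][1]
    return 1 if y_hat > 0 else -1      # line 11: sign of the vote

# Tiny example: three training points in 2-d, query near the two positives.
train = [((0.0, 0.0), +1), ((0.1, 0.2), +1), ((3.0, 3.0), -1)]
print(knn_predict(train, k=3, x_hat=(0.2, 0.1)))  # -> +1
```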

Figure 3.5: A figure of a ski and a snowboard.

One aspect of inductive bias that we've seen for KNN is that it assumes that nearby points should have the same label. Another aspect, which is quite different from decision trees, is that all features are equally important! Recall that for decision trees, the key question was which features are most useful for classification? The whole learning algorithm for a decision tree hinged on finding a small set of good features. This is all thrown away in KNN classifiers: every feature is used, and they are all used the same amount. This means that if you have data with only a few relevant features and lots of irrelevant features, KNN is likely to do poorly.

A related issue with KNN is feature scale. Suppose that we are trying to classify whether some object is a ski or a snowboard (see Figure 3.5). We are given two features about this data: the width and the height. As is standard in skiing, width is measured in millimeters and height is measured in centimeters. Since there are only two features, we can actually plot the entire training set; see Figure 3.6, where ski is the positive class. Based on this data, you might guess that a KNN classifier would do well.

Figure 3.6: Classification data for ski vs snowboard in 2d.

Suppose, however, that our measurement of the width was computed in centimeters (instead of millimeters). This yields the data shown in Figure 3.7. Since the width values are now tiny in comparison to the height values, a KNN classifier will effectively ignore the width values and classify almost purely based on height. The predicted class for the displayed test point has changed because of this feature scaling.

Figure 3.7: Classification data for ski vs snowboard in 2d, with width rescaled to cm.

We will discuss feature scaling more in Chapter 5. For now, it is just important to keep in mind that KNN does not have the power to decide which features are important.
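To see the scaling issue numerically, here is a small sketch with made-up measurements (not the data behind Figures 3.6 and 3.7) showing that the nearest neighbor of a query can flip when one feature's units change:

```python
import math

# (width, height) pairs; label +1 = ski, -1 = snowboard (illustrative numbers only)
train = [((100.0, 162.0), +1), ((180.0, 150.0), -1)]
query = (110.0, 151.0)

def nearest_label(examples, q):
    """Return the label of the single nearest training example to q."""
    return min(examples, key=lambda ex: math.dist(ex[0], q))[1]

print(nearest_label(train, query))  # widths on a comparable scale: ski (+1) wins

# Rescale width by 1/10 (e.g., mm -> cm): width differences become negligible.
train_cm = [((w / 10, h), y) for (w, h), y in train]
query_cm = (query[0] / 10, query[1])
print(nearest_label(train_cm, query_cm))  # height now dominates: snowboard (-1) wins
```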

MATH REVIEW | VECTOR ARITHMETIC AND VECTOR NORMS

A (real-valued) vector is just an array of real values, for instance x = ⟨1, 2.5, 6⟩ is a three-dimensional vector. In general, if x = ⟨x_1, x_2, ..., x_D⟩, then x_d is its dth component. So x_3 = 6 in the previous example.

Vector sums are computed pointwise, and are only defined when dimensions match, so ⟨1, 2.5, 6⟩ + ⟨2, −2.5, −3⟩ = ⟨3, 0, 3⟩. In general, if c = a + b then c_d = a_d + b_d for all d. Vector addition can be viewed geometrically as taking a vector a, then tacking on b to the end of it; the new end point is exactly c.

Vectors can be scaled by real values; for instance 2⟨1, 2.5, 6⟩ = ⟨2, 5, 12⟩; this is called scalar multiplication. In general, ax = ⟨ax_1, ax_2, ..., ax_D⟩.

The norm of a vector x, written ||x||, is its length. Unless otherwise specified, this is its Euclidean length, namely: ||x|| = \sqrt{\sum_d x_d^2}.
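These operations map directly onto array libraries; here is a minimal NumPy sketch (our own) of the sum, scalar multiplication and norm above:

```python
import numpy as np

x = np.array([1.0, 2.5, 6.0])
y = np.array([2.0, -2.5, -3.0])

print(x + y)              # pointwise sum: [3. 0. 3.]
print(2 * x)              # scalar multiplication: [ 2.  5. 12.]
print(np.linalg.norm(x))  # Euclidean norm: sqrt(1 + 6.25 + 36) ≈ 6.58
```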

3.3 Decision Boundaries

The standard way that we've been thinking about learning algorithms up to now is in the query model. Based on training data, you learn something. I then give you a query example and you have to guess its label.

An alternative, less passive, way to think about a learned model is to ask: what sort of test examples will it classify as positive, and what sort will it classify as negative? In Figure 3.9, we have a set of training data. The background of the image is colored blue in regions that would be classified as positive (if a query were issued there) and colored red in regions that would be classified as negative. This coloring is based on a 1-nearest neighbor classifier.

Figure 3.9: Decision boundary for 1-NN.

In Figure 3.9, there is a solid line separating the positive regions from the negative regions. This line is called the decision boundary for this classifier. It is the line with positive land on one side and negative land on the other side.

Decision boundaries are useful ways to visualize the complexity of a learned model. Intuitively, a learned model with a decision boundary that is really jagged (like the coastline of Norway) is really complex and prone to overfitting. A learned model with a decision boundary that is really simple (like the boundary between Arizona and Utah) is potentially underfit.

Figure 3.10: Decision boundary for KNN with K = 3.
Now that you know about decision boundaries, it is natural to ask: what do decision boundaries for decision trees look like? In order to answer this question, we have to be a bit more formal about how to build a decision tree on real-valued features. (Remember that the algorithm you learned in the previous chapter implicitly assumed binary feature values.) The idea is to allow the decision tree to ask questions of the form: "is the value of feature 5 greater than 0.2?" That is, for real-valued features, the decision tree nodes are parameterized by a feature and a threshold for that feature. An example decision tree for classifying skis versus snowboards is shown in Figure 3.11.

Figure 3.11: Decision tree for ski vs. snowboard.

Now that a decision tree can handle feature vectors, we can talk about decision boundaries. By example, the decision boundary for the decision tree in Figure 3.11 is shown in Figure 3.12. In the figure, space is first split in half according to the first query along one axis. Then, depending on which half of the space you look at, it is either split again along the other axis, or simply classified.

Figure 3.12: Decision boundary for the decision tree in the previous figure.

Figure 3.12 is a good visualization of decision boundaries for decision trees in general. Their decision boundaries are axis-aligned cuts. The cuts must be axis-aligned because nodes can only query on a single feature at a time. In this case, since the decision tree was so shallow, the decision boundary was relatively simple.

Question: What sort of data might yield a very simple decision boundary with a decision tree and a very complex decision boundary with 1-nearest neighbor? What about the other way around?
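To make the threshold-node idea above concrete, here is a tiny hand-written sketch of a depth-2 tree of the kind Figure 3.11 describes; the thresholds and units are invented for illustration, not taken from the figure:

```python
def tree_predict(width_mm: float, height_cm: float) -> str:
    """A tiny depth-2 decision tree with axis-aligned threshold tests.
    Thresholds are made up; a real tree would learn them from data."""
    if width_mm > 150.0:         # first query: split along the width axis
        return "snowboard"
    elif height_cm > 120.0:      # second query: split the remaining half along height
        return "ski"
    else:
        return "snowboard"

print(tree_predict(width_mm=100.0, height_cm=165.0))  # -> "ski"
```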

3.4 K-Means Clustering

Up through this point, you have learned all about supervised learning (in particular, binary classification). As another example of the use of geometric intuitions and data, we are going to temporarily consider an unsupervised learning problem. In unsupervised learning, our data consists only of examples x_n and does not contain corresponding labels. Your job is to make sense of this data, even though no one has provided you with correct labels. The particular notion of "making sense of" that we will talk about now is the clustering task.

Consider the data shown in Figure 3.13. Since this is unsupervised learning and we do not have access to labels, the data points are simply drawn as black dots. Your job is to split this data set into three clusters. That is, you should label each data point as A, B or C in whatever way you want.

Figure 3.13: Simple clustering data, with clusters in the upper-left, upper-right and bottom-center.

For this data set, it's pretty clear what you should do. You probably labeled the upper-left set of points A, the upper-right set of points B and the bottom set of points C. Or perhaps you permuted these labels. But chances are your clusters were the same as mine.

The K-means clustering algorithm is a particularly simple and effective approach to producing clusters on data like you see in Figure 3.13. The idea is to represent each cluster by its cluster center. Given cluster centers, we can simply assign each point to its nearest center.
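The excerpt ends here; as a generic illustration of the cluster-center idea (assign each point to its nearest center, then recompute each center as the mean of its assigned points), here is a minimal K-means sketch. It is a standard formulation, not code from this chapter:

```python
import math
import random

def kmeans(points, k, iterations=20):
    """points: list of (x, y) tuples. Returns (centers, assignments)."""
    centers = random.sample(points, k)  # initialize centers from the data
    for _ in range(iterations):
        # assignment step: each point goes to its nearest center
        assign = [min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points]
        # update step: each center becomes the mean of its assigned points
        for j in range(k):
            members = [p for p, a in zip(points, assign) if a == j]
            if members:
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centers, assign

# Three obvious blobs, like Figure 3.13 (upper-left, upper-right, bottom).
data = [(0, 5), (0.2, 5.1), (5, 5), (5.1, 4.9), (2.5, 0), (2.6, 0.2)]
centers, assign = kmeans(data, k=3)
print(assign)  # points in the same blob end up with the same cluster id
```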
