Self Reading - KNN - Notes
Our brains have evolved to get us out of the rain, find where the berries are, and keep us from getting killed. Our brains did not evolve to help us grasp really large numbers or to look at things in a hundred thousand dimensions. – Ronald Graham

Learning Objectives:
• Describe a data set as points in a high dimensional space.
• Explain the curse of dimensionality.
• Compute distances between points in high dimensional space.
• Implement a K-nearest neighbor model of learning.
• Draw decision boundaries.
• Implement the K-means algorithm for clustering.

Dependencies: Chapter 1

You can think of prediction tasks as mapping inputs (course reviews) to outputs (course ratings). As you learned in the previous chapter, decomposing an input into a collection of features (e.g., words that occur in the review) forms a useful abstraction for learning. Therefore, inputs are nothing more than lists of feature values. This suggests a geometric view of data, where we have one dimension for every feature. In this view, examples are points in a high-dimensional space.

Once we think of a data set as a collection of points in high-dimensional space, we can start performing geometric operations on this data. For instance, suppose you need to predict whether Alice will like Algorithms. Perhaps we can try to find another student who is most “similar” to Alice, in terms of favorite courses. Say this student is Jeremy. If Jeremy liked Algorithms, then we might guess that Alice will as well. This is an example of a nearest neighbor model of learning. By inspecting this model, we’ll see a completely different set of answers to the key learning questions we discovered in Chapter 1.
Note, here, that we have imposed the convention that for binary features (yes/no features), the corresponding feature values are 0 and 1, respectively. This was an arbitrary choice. We could have made them 0.92 and 16.1 if we wanted. But 0/1 is convenient and helps us interpret the feature values. When we discuss practical issues in Chapter 5, you will see other reasons why 0/1 is a good choice.
Figure 3.1 shows the data from Table 1 in three views. These three views are constructed by considering two features at a time in different pairs. In all cases, the plusses denote positive examples and the minuses denote negative examples. In some cases, the points fall on top of each other, which is why you cannot see 20 unique points in all figures.
The mapping from feature values to vectors is straightforward in the case of real valued features (trivial) and binary features (mapped to zero or one). It is less clear what to do with categorical features. For example, if our goal is to identify whether an object in an image is a tomato, blueberry, cucumber or cockroach, we might want to know its color: is it Red, Blue, Green or Black?

One option would be to map Red to a value of 0, Blue to a value of 1, Green to a value of 2 and Black to a value of 3. The problem with this mapping is that it turns an unordered set (the set of colors) into an ordered set (the set {0, 1, 2, 3}). In itself, this is not necessarily a bad thing. But when we go to use these features, we will measure examples based on their distances to each other. By doing this mapping, we are essentially saying that Red and Blue are more similar (distance of 1) than Red and Black (distance of 3). This is probably not what we want to say!

Figure 3.1: A figure showing projections of data in two dimensions in three ways – see text. Top: horizontal axis corresponds to the first feature (easy) and the vertical axis corresponds to the second feature (AI?); Middle: horizontal is second feature and vertical is third (systems?); Bottom: horizontal is first and vertical is third. Truly, the data points would lie exactly on (0, 0) or (1, 0), etc., but they have been perturbed slightly to show duplicates.

Match the example ids from Table 1 with the points in Figure 3.1.
A solution is to turn a categorical feature that can take four different values (say: Red, Blue, Green and Black) into four binary features (say: IsItRed?, IsItBlue?, IsItGreen? and IsItBlack?). In general, if we start from a categorical feature that takes V values, we can map it to V-many binary indicator features.
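To make this concrete, here is a minimal Python sketch of mapping a single example to a feature vector: a real value is copied, a yes/no answer becomes 0/1, and a V-valued color becomes V binary indicators. The function name to_feature_vector, the feature names, and the example record are made up for illustration, not taken from the text.

    # A minimal sketch of mapping an example to a feature vector.
    # The record layout and names below are hypothetical.

    COLORS = ["Red", "Blue", "Green", "Black"]  # the V categorical values

    def to_feature_vector(rating, is_easy, color):
        """Map one example to a list of feature values."""
        features = [float(rating)]                 # real-valued feature: copied directly
        features.append(1.0 if is_easy else 0.0)   # binary feature: 0 or 1
        # categorical feature: V binary indicators, exactly one of which is 1
        features.extend(1.0 if color == c else 0.0 for c in COLORS)
        return features

    print(to_feature_vector(rating=4.5, is_easy=True, color="Green"))
    # -> [4.5, 1.0, 0.0, 0.0, 1.0, 0.0]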
With that, you should be able to take a data set and map each example to a feature vector through the following mapping:

• Real-valued features get copied directly.
• Binary features become 0 (for false) or 1 (for true).
• Categorical features that take V values become V-many binary indicator features, exactly one of which is 1.

The computer scientist in you might be saying: actually we could map it to log2 V-many binary features! Is this a good idea or not?
A standard choice of distance between two points a and b in D-dimensional space is the Euclidean distance:

d(a, b) = \left[ \sum_{d=1}^{D} (a_d - b_d)^2 \right]^{1/2}    (3.1)
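As a quick sanity check of Eq (3.1), here is a minimal Python sketch of the Euclidean distance between two feature vectors; the function name euclidean_distance is only illustrative.

    import math

    def euclidean_distance(a, b):
        """Euclidean distance of Eq (3.1): square root of the sum of squared differences."""
        assert len(a) == len(b), "distance is only defined when dimensions match"
        return math.sqrt(sum((ad - bd) ** 2 for ad, bd in zip(a, b)))

    print(euclidean_distance([0.0, 0.0], [3.0, 4.0]))  # -> 5.0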
One aspect of inductive bias that we’ve seen for KNN is that it assumes that nearby points should have the same label. Another aspect, which is quite different from decision trees, is that all features are equally important! Recall that for decision trees, the key question was which features are most useful for classification? The whole learning algorithm for a decision tree hinged on finding a small set of good features.

Figure 3.6: Classification data for ski vs snowboard in 2d
Vector sums are computed pointwise, and are only defined when dimensions match, so ⟨1, 2.5, −6⟩ + ⟨2, −2.5, 3⟩ = ⟨3, 0, −3⟩. In general, if c = a + b then c_d = a_d + b_d for all d. Vector addition can be viewed geometrically as taking a vector a, then tacking on b to the end of it; the new end point is exactly c.

Vectors can be scaled by real values; for instance 2⟨1, 2.5, −6⟩ = ⟨2, 5, −12⟩; this is called scalar multiplication. In general, ax = ⟨ax_1, ax_2, ..., ax_D⟩.
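These two operations are easy to mirror in code. The following small Python sketch (the helper names vector_add and vector_scale are made up here) reproduces the worked examples above.

    def vector_add(a, b):
        """Pointwise sum; only defined when dimensions match."""
        assert len(a) == len(b)
        return [ad + bd for ad, bd in zip(a, b)]

    def vector_scale(alpha, x):
        """Scalar multiplication: scale every coordinate by alpha."""
        return [alpha * xd for xd in x]

    print(vector_add([1, 2.5, -6], [2, -2.5, 3]))  # -> [3, 0.0, -3]
    print(vector_scale(2, [1, 2.5, -6]))           # -> [2, 5.0, -12]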
The standard way that we’ve been thinking about learning algorithms up to now is in the query model. Based on training data, you learn something. I then give you a query example and you have to guess its label.
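Under the query model, a K-nearest-neighbor learner simply memorizes the training data and answers each query by a majority vote over the K closest training points. Here is a minimal Python sketch of that idea; the function name knn_predict and the toy data are illustrative, not from the text.

    import math
    from collections import Counter

    def knn_predict(train, query, k=3):
        """Return the majority label among the k training points closest to query.
        train is a list of (feature_vector, label) pairs."""
        dist = lambda a, b: math.sqrt(sum((ad - bd) ** 2 for ad, bd in zip(a, b)))
        neighbors = sorted(train, key=lambda ex: dist(ex[0], query))[:k]
        votes = Counter(label for _, label in neighbors)
        return votes.most_common(1)[0][0]

    train = [([0.0, 0.0], "-"), ([0.1, 0.2], "-"), ([1.0, 1.0], "+"), ([0.9, 1.1], "+")]
    print(knn_predict(train, [0.95, 1.0], k=3))  # -> "+"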
An alternative, less passive, way to think about a learned model is to ask: what sort of test examples will it classify as positive, and what sort will it classify as negative. In Figure 3.9, we have a set of training data. The background of the image is colored blue in regions that would be classified as positive (if a query were issued there) and colored red in regions that would be classified as negative.

Figure 3.9: decision boundary for 1nn.
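One way to draw such a decision boundary is to issue a query at every point of a dense grid and mark each grid cell with the predicted label. The text-mode Python sketch below does this for a 1-nearest-neighbor model on a made-up two-point training set; all names and data here are illustrative.

    import math

    train = [([0.25, 0.25], "+"), ([0.75, 0.75], "-")]  # hypothetical toy training set

    def nn_label(query):
        """1-nearest-neighbor prediction: the label of the single closest training point."""
        dist = lambda a, b: math.sqrt(sum((ad - bd) ** 2 for ad, bd in zip(a, b)))
        return min(train, key=lambda ex: dist(ex[0], query))[1]

    # Sweep a coarse grid over [0, 1] x [0, 1]; the printed characters show the two
    # decision regions, and the place where they meet is the decision boundary.
    steps = 20
    for row in range(steps, -1, -1):
        y = row / steps
        print("".join(nn_label([col / steps, y]) for col in range(steps + 1)))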
Figure 3.12: decision boundary for dt in previous figure

What sort of data might yield a very simple decision boundary with a decision tree and very complex decision boundary with 1-nearest neighbor? What about the other way around?

Up through this point, you have learned all about supervised learning (in particular, binary classification). As another example of the use of geometric intuitions and data, we are going to temporarily consider an unsupervised learning problem. In unsupervised learning, our data consists only of examples x_n and does not contain corresponding labels. Your job is to make sense of this data, even though no one has provided you with correct labels. The particular notion of “making sense of” that we will talk about now is the clustering task.
Consider the data shown in Figure 3.13. Since this is unsupervised
learning and we do not have access to labels, the data points are
simply drawn as black dots. Your job is to split this data set into
three clusters. That is, you should label each data point as A, B or C
in whatever way you want.
For this data set, it’s pretty clear what you should do. You probably labeled the upper-left set of points A, the upper-right set of points B and the bottom set of points C. Or perhaps you permuted these labels. But chances are your clusters were the same as mine.
The K-means clustering algorithm is a particularly simple and effective approach to producing clusters on data like you see in Figure 3.13. The idea is to represent each cluster by its cluster center. Given cluster centers, we can simply assign each point to its nearest cluster center.
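Given an assignment of points to clusters, the centers can in turn be recomputed as the mean of each cluster’s points; K-means simply alternates these two steps. The following is a minimal Python sketch of that loop; the function name kmeans, the random initialization, and the fixed iteration count are simplifying assumptions, not prescribed by the text.

    import math
    import random

    def kmeans(points, k, iterations=20, seed=0):
        """A bare-bones K-means: alternate assignment and center-update steps."""
        rng = random.Random(seed)
        centers = rng.sample(points, k)  # assumption: initialize with k random data points
        dist = lambda a, b: math.sqrt(sum((ad - bd) ** 2 for ad, bd in zip(a, b)))
        for _ in range(iterations):
            # assignment step: each point goes to its nearest center
            clusters = [[] for _ in range(k)]
            for p in points:
                nearest = min(range(k), key=lambda j: dist(p, centers[j]))
                clusters[nearest].append(p)
            # update step: each center moves to the mean of the points assigned to it
            for j, cluster in enumerate(clusters):
                if cluster:  # keep the old center if a cluster ends up empty
                    centers[j] = [sum(coord) / len(cluster) for coord in zip(*cluster)]
        return centers, clusters

    data = [[0.0, 5.0], [0.2, 5.1], [5.0, 5.0], [5.1, 4.8], [2.5, 0.0], [2.6, 0.2]]
    centers, clusters = kmeans(data, k=3)
    print(centers)

On well-separated blobs like the three groups in Figure 3.13, a loop of this form typically settles on one center per blob after only a few iterations.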