
K-Nearest Neighbors

Nipun Batra
July 5, 2020
IIT Gandhinagar
CLASSIFICATION

[Hand-drawn slides: apples and oranges plotted against ATTRIBUTES such as REDNESS. A new, unlabelled point is classified as an ORANGE because it is most SIMILAR to the nearby oranges. The idea: label a point by looking at the labels of the points closest to it in attribute space.]
REGRESSION

[Hand-drawn slides: house PRICE (around 100k) plotted against AGE. To predict the price of a new home, look at the prices of HOMES OF THAT AGE — the prediction is LIKELY to be close to the prices of its neighbours.]
VORONOI DIAGRAM FOR 1-NN

[Hand-drawn slides: points P1, P2, P3, P4 plotted in FEATURE 1 vs FEATURE 2 space. For any two points, the boundary between their regions is the perpendicular bisector through the MIDPOINT of the LINE JOINING the 2 POINTS; one side becomes the REGION LABELLED BLUE, the other the REGION LABELLED RED. Repeating this for every pair of neighbouring points partitions the plane into Voronoi cells, so the 1-NN DECISION BOUNDARY IS PIECEWISE LINEAR.]
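As a rough illustration of the slide above, here is a minimal sketch (with made-up points and labels, not taken from the slides) that assigns every grid cell the label of its nearest training point; printing the grid makes the piecewise-linear Voronoi regions visible.

# 1-NN label assignment over a grid: each training point "wins" its Voronoi cell,
# and the boundaries between cells are pieces of perpendicular bisectors.
import numpy as np

# A few hypothetical labelled points in (feature 1, feature 2) space
points = np.array([[1.0, 1.0], [3.0, 1.5], [2.0, 3.0], [4.0, 3.5]])
labels = np.array(["R", "B", "R", "B"])

# Evaluate the 1-NN label on a coarse grid and print it as a text "map"
xs = np.linspace(0, 5, 26)
ys = np.linspace(0, 4, 17)
for y in ys[::-1]:
    row = ""
    for x in xs:
        d = np.linalg.norm(points - np.array([x, y]), axis=1)  # distances to all points
        row += labels[np.argmin(d)]                            # label of the nearest point
    print(row)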
KNN CLASSIFICATION

[Hand-drawn slides: '+' and '-' points with a TEST POINT. With K = 1, the test point takes the label of its single nearest neighbour; with K = 3, it takes the majority label among its 3 NEIGHBOURS.]
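A small sketch of the same contrast using scikit-learn (assumed available); the points and the test point below are made up for illustration, and the prediction can flip between K = 1 and K = 3.

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[1, 1], [2, 1], [1.5, 2], [4, 4], [5, 4], [4.5, 5]])
y = np.array(["+", "+", "+", "-", "-", "-"])
test_point = np.array([[3.0, 2.8]])

for k in (1, 3):
    clf = KNeighborsClassifier(n_neighbors=k).fit(X, y)
    print(f"K = {k}: predicted label = {clf.predict(test_point)[0]}")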
LINEAR REGRESSION vs 1-NN REGRESSION

[Hand-drawn slides: a small 1-D dataset (x1, y1), (x2, y2), (x3, y3) fit once with a straight line (linear regression) and once with 1-NN regression. For 1-NN, the prediction at a query x is the y-value of the nearest training x (queries nearest to x1 get y1, queries nearest to x2 get y2, and so on), so the fit is piecewise constant rather than a single line.]
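A minimal sketch of this comparison, using scikit-learn and an illustrative three-point dataset of my own choosing: the linear model returns one straight line, while 1-NN returns the y-value of whichever training x is closest.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

X = np.array([[1.0], [2.0], [3.0]])   # x1, x2, x3
y = np.array([1.0, 3.0, 2.0])         # y1, y2, y3

lin = LinearRegression().fit(X, y)
nn1 = KNeighborsRegressor(n_neighbors=1).fit(X, y)

queries = np.array([[0.5], [1.4], [2.6], [3.5]])
print("linear:", lin.predict(queries))   # a single straight-line fit
print("1-NN:  ", nn1.predict(queries))   # piecewise constant: y of the nearest xi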
KNN IS NON-PARAMETRIC

[Hand-drawn slides: a LINEAR MODEL y = mx + c has a fixed number of parameters (2), while the KNN (K = 1) DECISION BOUNDARY is defined by the training points themselves. When we ADD DATA, the linear model still needs just 2 parameters, but the KNN decision boundary becomes more complex — its effective number of parameters grows with the dataset.]
Parametric vs Non-Parametric Models

               Parametric                            Non-Parametric
Parameters     Number of parameters is fixed         Number of parameters grows with
               w.r.t. dataset size                   an increase in dataset size
Speed          Quicker (as the number of             Slower (as the number of
               parameters is smaller)                parameters grows with the data)
Assumptions    Strong assumptions (like              Very few (sometimes no)
               linearity in Linear Regression)       assumptions
Examples       Linear Regression                     KNN, Decision Tree
Lazy vs Eager Strategies

               Lazy                                  Eager
Train Time     0                                     ≠ 0
Test Time      Long (due to comparison with          Quick (as only "parameters"
               train data)                           are involved)
Memory         Store/memorise entire data            Store only learnt parameters
Utility        Useful for online settings
Examples       KNN                                   Linear Regression, Decision Tree
Important Considerations

• What are the features that will be considered for data similarity?
• What is the distance metric that will be used to calculate data similarity?
• What is the aggregation function that is going to be used?
• How many neighbors are you going to take into consideration?
• What is the computational complexity of the algorithm that you are implementing?
Important Considerations: Distance Metric

The distance metric acts as a measure of similarity between the points. Common choices include:

• Euclidean Distance
• Manhattan Distance
• Hamming Distance
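A minimal sketch of the three metrics listed above, written as plain functions on numpy vectors (the example vectors are made up):

import numpy as np

def euclidean(a, b):
    return np.sqrt(np.sum((a - b) ** 2))   # straight-line distance

def manhattan(a, b):
    return np.sum(np.abs(a - b))            # sum of per-coordinate differences

def hamming(a, b):
    return np.sum(a != b)                    # number of positions that differ (categorical/binary features)

a, b = np.array([1.0, 2.0, 3.0]), np.array([2.0, 0.0, 3.0])
print(euclidean(a, b), manhattan(a, b), hamming(a, b))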
Important Considerations: Value of K

Choosing the correct value of K is difficult.

Low values of K will result in each point having a very high influence on the final output ⇒ noise will influence the result.

High values of K will result in smoother decision boundaries ⇒ lower variance but also higher bias.
Important Considerations: Value of K

[Figures: the same 2-D dataset (X vs Y) fit with different values of K. K = 1 gives a very jagged decision boundary (High Variance); K = 3 is smoother; K = 9 is smoother still (High Bias).]
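One common way to pick K in practice is cross-validation. The sketch below (synthetic data of my own, not the dataset in the figures) scores a few values of K with scikit-learn; very small K tends to overfit the noise, very large K tends to underfit.

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] + 0.5 * rng.normal(size=200) > 0).astype(int)  # noisy labels

for k in (1, 3, 9, 25):
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5)
    print(f"K = {k:2d}: mean CV accuracy = {scores.mean():.3f}")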
Aggregating data

There are different ways to go about aggregating the data from the K nearest neighbors:

• Median
• Mean
• Mode
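A minimal sketch of these aggregation choices applied to the targets of the K nearest neighbours (the neighbour values below are hypothetical):

import numpy as np
from collections import Counter

neighbour_targets = np.array([2.0, 3.0, 3.0, 10.0])          # y-values of the K neighbours (regression)
neighbour_labels = ["orange", "apple", "orange", "orange"]    # class labels of the K neighbours (classification)

print("mean:  ", np.mean(neighbour_targets))    # common choice for regression
print("median:", np.median(neighbour_targets))  # regression, more robust to outliers
print("mode:  ", Counter(neighbour_labels).most_common(1)[0][0])  # majority vote for classification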
KNN Algorithm

• Keep the entire dataset: (x, y)
• For a query vector q:
  1. Find the k-closest data point(s) x∗
  2. Predict y∗
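A from-scratch sketch of this algorithm for classification (class and variable names are my own): "training" just memorises the data, and prediction finds the k closest stored points and takes a majority vote over their labels.

import numpy as np
from collections import Counter

class SimpleKNNClassifier:
    def __init__(self, k=3):
        self.k = k

    def fit(self, X, y):
        # Keep the entire dataset: train time is essentially zero.
        self.X, self.y = np.asarray(X, dtype=float), np.asarray(y)
        return self

    def predict_one(self, q):
        d = np.linalg.norm(self.X - q, axis=1)                 # distances to every stored point
        nearest = np.argsort(d)[: self.k]                      # indices of the k closest points x*
        return Counter(self.y[nearest]).most_common(1)[0][0]   # majority vote gives y*

    def predict(self, Q):
        return np.array([self.predict_one(q) for q in np.asarray(Q, dtype=float)])

# Usage with made-up data
X = [[1, 1], [2, 1], [4, 4], [5, 5]]
y = ["+", "+", "-", "-"]
print(SimpleKNNClassifier(k=3).fit(X, y).predict([[1.5, 1.5], [4.5, 4.2]]))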
Curse of Dimensionality

With an increase in the number of dimensions:

1. the distance between points starts to increase
2. the variation in distances between points starts to decrease

[Figures: for a uniformly random dataset, the mean distance between two points grows with the number of dimensions d, while the ratio of the maximum to the minimum pairwise distance shrinks as d grows.]

Due to this, distance metrics lose their efficacy as a similarity metric.
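The effect is easy to reproduce empirically. A small sketch (uniformly random points, parameters of my own choosing): as d grows, the mean pairwise distance increases and the max/min distance ratio shrinks, so "nearest" and "farthest" become hard to tell apart.

import numpy as np

rng = np.random.default_rng(0)
n = 100
for d in (2, 5, 10, 50, 200):
    X = rng.uniform(size=(n, d))
    diffs = X[:, None, :] - X[None, :, :]              # all pairwise differences
    dist = np.sqrt((diffs ** 2).sum(-1))
    upper = dist[np.triu_indices(n, k=1)]              # distances for distinct pairs only
    print(f"d = {d:3d}: mean distance = {upper.mean():.2f}, "
          f"max/min ratio = {upper.max() / upper.min():.1f}")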
Approximate Nearest Neighbors

Doing an exhaustive search over all the points is time consuming, especially if you have a large number of data points.

[Figure: example of a big dataset — a dense 2-D scatter of points.]

If you are willing to sacrifice some accuracy, there are algorithms that can give you speed-ups of orders of magnitude. Such techniques include:

• Locality sensitive hashing
• Vector approximation files
• Greedy search in proximity neighborhood graphs
Locality sensitive hashing

Normal hash functions H(x) try to keep the collision of points across bins uniform. A locality sensitive hash (LSH) function L(x) would instead be designed such that similar values are mapped to similar bins. All elements in a bin can then be given the same label, which again can be decided on the basis of different aggregation methods.

[Figure: example of a big dataset.]
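A minimal sketch of one classic LSH family (not necessarily the construction the slides have in mind): random-projection (SimHash-style) hashing, where each bit of the bucket id records which side of a random hyperplane a point falls on, so nearby points tend to share buckets.

import numpy as np

rng = np.random.default_rng(0)
d, n_bits = 2, 4
planes = rng.normal(size=(n_bits, d))            # random hyperplanes through the origin

def lsh_bucket(x):
    bits = (planes @ x > 0).astype(int)          # one bit per hyperplane
    return "".join(map(str, bits))

p, q, r = np.array([1.0, 1.0]), np.array([1.1, 0.9]), np.array([-2.0, 3.0])
print(lsh_bucket(p), lsh_bucket(q), lsh_bucket(r))  # p and q will often share a bucket; r usually will not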
