New Data Science Module Nearest Neighbors

The document provides an overview of the k-Nearest Neighbors (kNN) classification method, explaining its general concept of classifying points based on the majority class of their neighbors. It includes examples of how to assign labels to data points, the importance of choosing an appropriate value for k, and the necessity of scaling features for accurate distance calculations. Additionally, it demonstrates the implementation of kNN in Python using the scikit-learn library with practical examples and visualizations.


BU MET CS-677: Data Science With Python, v.2.0

kNN - Nearest Neighbors Classification


General Idea

• points in the same class are usually "neighbors"
• assign the class held by the majority of a point's neighbors
• requires a notion of distance between points
• requires choosing k, the number of neighbors to consult
• note: with two classes, k must be odd for a simple majority to always exist
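
A minimal pure-Python sketch of this idea, using the same toy points as the scikit-learn illustration a few slides later (illustrative only, not the course's reference implementation):

from collections import Counter
import math

# toy labeled points: (x, y, label)
points = [(1, 2, "green"), (6, 4, "red"), (7, 5, "red"),
          (10, -1, "green"), (10, 2, "green"), (15, 2, "red")]

def knn_label(query, points, k):
    # sort the labeled points by Euclidean distance to the query
    by_distance = sorted(points, key=lambda p: math.dist(query, p[:2]))
    # majority vote among the labels of the k nearest points
    votes = Counter(label for _, _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

print(knn_label((3, 2), points, k=3))   # -> 'red'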


Example of kNN

[Figure: scatter plot of six labeled points (green and red) in the X-Y plane, with two unlabeled query points A and B]

• what labels for A and B?



Assigning a Label for A



[Figure: zoomed view of query point A and the five nearest labeled points x1-x5]

point   k   neighbors              majority
A       1   x1                     green
A       3   x1, x2, x3             red
A       5   x1, x2, x3, x4, x5     green


Assigning a Label for B



[Figure: zoomed view of query point B and the five nearest labeled points x1-x5]

point   k   neighbors              majority
B       1   x2                     red
B       3   x2, x3, x5             red
B       5   x1, x2, x3, x4, x5     green


How to Choose k
[Figure: the scatter plot from the kNN example, with query points A and B and their candidate neighbors x1-x5]

point   k   neighbors              majority
A       1   x1                     green
A       3   x1, x2, x3             red
A       5   x1, x2, x3, x4, x5     green
B       1   x2                     red
B       3   x2, x3, x5             red
B       5   x1, x2, x3, x4, x5     green
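
• the predicted label can flip as k changes (A: green, red, green for k = 1, 3, 5), so k should be chosen by measuring error on held-out data, as done in "Calculating k" below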


Illustration in Python
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

data = pd.DataFrame(
    {"id": [1, 2, 3, 4, 5, 6],
     "Label": ["green", "red", "red",
               "green", "green", "red"],
     "X": [1, 6, 7, 10, 10, 15],
     "Y": [2, 4, 5, -1, 2, 2]},
    columns=["id", "Label", "X", "Y"])
X = data[["X", "Y"]].values
Y = data["Label"].values                # 1-D label vector
knn_classifier = KNeighborsClassifier(n_neighbors=3)
knn_classifier.fit(X, Y)
new_instance = np.array([[3, 2]])       # one row, two features
prediction = knn_classifier.predict(new_instance)

ipdb> prediction[0]
'red'
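
To see which training points cast the votes, scikit-learn's kneighbors method returns the distances and row indices of the k nearest neighbors (a quick check, assuming the objects above are still in scope):

distances, indices = knn_classifier.kneighbors(new_instance)
# indices refer to rows of X; for the query (3, 2) the three nearest
# rows are ids 1, 2, 3 with labels green, red, red -> majority 'red'
print(indices[0], data["Label"].values[indices[0]])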


A Numerical Example

object xi   Height (H)   Weight (W)   Foot (F)   Label (L)
x1          5.00         100           6         green
x2          5.50         150           8         green
x3          5.33         130           7         green
x4          5.75         150           9         green
x5          6.00         180          13         red
x6          5.92         190          11         red
x7          5.58         170          12         red
x8          5.92         165          10         red

• note the very different scales of the three features


What is the Label?

[Figure: 3D scatter of Height, Weight, and Foot for the eight labeled points]

(H=6, W=160, F=10) ↦ ?


kNN in Python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

data = pd.DataFrame(
    {"id": [1, 2, 3, 4, 5, 6, 7, 8],
     "Label": ["green", "green", "green", "green",
               "red", "red", "red", "red"],
     "Height": [5.00, 5.50, 5.33, 5.75, 6.00, 5.92, 5.58, 5.92],
     "Weight": [100, 150, 130, 150, 180, 190, 170, 165],
     "Foot": [6, 8, 7, 9, 13, 11, 12, 10]},
    columns=["id", "Height", "Weight", "Foot", "Label"])

X = data[["Height", "Weight", "Foot"]].values
Y = data["Label"].values            # 1-D label vector

# scale features to zero mean and unit variance
scaler = StandardScaler().fit(X)
X = scaler.transform(X)

knn_classifier = KNeighborsClassifier(n_neighbors=3)
knn_classifier.fit(X, Y)

# the new instance must be scaled with the *same* fitted scaler
new_instance = np.array([[6, 160, 10]])
new_instance_scaled = scaler.transform(new_instance)
prediction = knn_classifier.predict(new_instance_scaled)

ipdb> prediction[0]
'red'
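
The scale-then-classify steps can also be bundled with scikit-learn's make_pipeline, so the scaler is re-applied automatically at prediction time; a sketch, assuming the raw (unscaled) features are rebuilt from the DataFrame above:

from sklearn.pipeline import make_pipeline

X_raw = data[["Height", "Weight", "Foot"]].values
model = make_pipeline(StandardScaler(),
                      KNeighborsClassifier(n_neighbors=3))
model.fit(X_raw, Y)                    # fits the scaler, then the classifier
print(model.predict(np.array([[6, 160, 10]])))   # -> 'red', as above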

Result Without Scaling


import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

data = pd.DataFrame(
    {"id": [1, 2, 3, 4, 5, 6, 7, 8],
     "Label": ["green", "green", "green", "green",
               "red", "red", "red", "red"],
     "Height": [5.00, 5.50, 5.33, 5.75, 6.00, 5.92, 5.58, 5.92],
     "Weight": [100, 150, 130, 150, 180, 190, 170, 165],
     "Foot": [6, 8, 7, 9, 13, 11, 12, 10]},
    columns=["id", "Height", "Weight", "Foot", "Label"])

X = data[["Height", "Weight", "Foot"]].values   # raw, unscaled features
Y = data["Label"].values

knn_classifier = KNeighborsClassifier(n_neighbors=3)
knn_classifier.fit(X, Y)

new_instance = np.array([[6, 160, 10]])
prediction = knn_classifier.predict(new_instance)

ipdb> prediction[0]
'red'
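
• the prediction happens to agree with the scaled result here, but the two models consult different neighbor sets; the next two slides show why scaling matters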


Why Scaling?

[Figure: 3D scatter of the unscaled data; the points spread almost entirely along the Weight axis]

• (Euclidean) distances d(·) are dominated by one dimension: Weight, whose numeric range (100-190) dwarfs those of Height and Foot


Effect of Scaling

[Figure: 3D scatter of the same data after standardization; all three axes now span roughly -2 to 2]

• without scaling: d(x7, x8) < d(x4, x8)

• with scaling: d(x7, x8) > d(x4, x8)
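
A quick numerical check of these two inequalities, using the eight points from the numerical example (a sketch; printed distances are rounded):

import numpy as np
from sklearn.preprocessing import StandardScaler

# rows x1..x8: Height, Weight, Foot
X = np.array([[5.00, 100,  6], [5.50, 150,  8], [5.33, 130,  7],
              [5.75, 150,  9], [6.00, 180, 13], [5.92, 190, 11],
              [5.58, 170, 12], [5.92, 165, 10]])
X_scaled = StandardScaler().fit_transform(X)

def d(A, i, j):
    return np.linalg.norm(A[i] - A[j])

print(d(X, 6, 7), d(X, 3, 7))                # ~5.40 < ~15.03  (raw)
print(d(X_scaled, 6, 7), d(X_scaled, 3, 7))  # ~1.38 > ~0.88   (scaled)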



Calculating k
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

data = pd.DataFrame(
    {"id": [1, 2, 3, 4, 5, 6, 7, 8],
     "Label": ["green", "green", "green", "green",
               "red", "red", "red", "red"],
     "Height": [5.00, 5.50, 5.33, 5.75, 6.00, 5.92, 5.58, 5.92],
     "Weight": [100, 150, 130, 150, 180, 190, 170, 165],
     "Foot": [6, 8, 7, 9, 13, 11, 12, 10]},
    columns=["id", "Height", "Weight", "Foot", "Label"])

X = data[["Height", "Weight", "Foot"]].values
Y = data["Label"].values    # 1-D, so pred_k != Y_test compares elementwise

scaler = StandardScaler().fit(X)
X = scaler.transform(X)

X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.5, random_state=0)

error_rate = []
for k in [1, 3]:
    knn_classifier = KNeighborsClassifier(n_neighbors=k)
    knn_classifier.fit(X_train, Y_train)
    pred_k = knn_classifier.predict(X_test)
    error_rate.append(np.mean(pred_k != Y_test))

ipdb> error_rate
[0.5, 0.5]
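
• with only 8 points (4 in the test split), each misclassified point moves the error rate by 0.25, so these estimates are very coarse; the IRIS example below repeats the procedure on a larger dataset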

Calculating k for IRIS


import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

url = r'https://archive.ics.uci.edu/ml/' + \
      r'machine-learning-databases/iris/iris.data'

iris_feature_names = ['sepal-length', 'sepal-width',
                      'petal-length', 'petal-width']

data = pd.read_csv(url, names=iris_feature_names + ['Class'])

# keep only two of the three classes for a binary problem
class_labels = ['Iris-versicolor', 'Iris-virginica']
data = data[data['Class'].isin(class_labels)]

X = data[iris_feature_names].values

scaler = StandardScaler()
scaler.fit(X)
X = scaler.transform(X)

# encode the string class labels as integers (0/1)
le = LabelEncoder()
Y = le.fit_transform(data['Class'].values)
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.5, random_state=3)


Calculating k for IRIS (cont'd)

from matplotlib.ticker import MaxNLocator

error_rate = []
for k in range(1, 21, 2):
    knn_classifier = KNeighborsClassifier(n_neighbors=k)
    knn_classifier.fit(X_train, Y_train)
    pred_k = knn_classifier.predict(X_test)
    error_rate.append(np.mean(pred_k != Y_test))

plt.figure(figsize=(10, 4))
ax = plt.gca()
ax.xaxis.set_major_locator(MaxNLocator(integer=True))   # integer ticks for k
plt.plot(range(1, 21, 2), error_rate, color='red', linestyle='dashed',
         marker='o', markerfacecolor='black', markersize=10)
plt.title('Error Rate vs. k for Iris Subset')
plt.xlabel('number of neighbors: k')
plt.ylabel('Error Rate')
plt.show()


Calculating k for IRIS


[Figure: Error Rate vs. k for Iris-versicolor and Iris-virginica; the error rate varies between roughly 0.02 and 0.06 over the odd values k = 1..19]


k for IRIS
[Figure: 3D scatter of sepal-length, sepal-width, and petal-length for Iris-setosa, Iris-versicolor, and Iris-virginica; setosa is clearly separated from the other two classes]

[Figure: Error Rate vs. k for Iris-setosa and Iris-virginica; the error rate is essentially 0 for every k, reflecting that separation]


A Categorical Dataset

Day   Weather    Temperature   Wind   Play
1     sunny      hot           low    no
2     rainy      mild          high   yes
3     sunny      cold          low    yes
4     rainy      cold          high   no
5     sunny      cold          high   yes
6     overcast   mild          low    yes
7     sunny      hot           low    yes
8     overcast   hot           high   yes
9     rainy      hot           high   no
10    rainy      mild          low    yes

• what label for x* = (sunny, cold, low)?
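
One-hot encoding (next slide) makes kNN applicable here: each differing attribute contributes 1² + 1² = 2 to the squared Euclidean distance between two encoded rows, so ranking neighbors by Euclidean distance is equivalent to ranking by the number of differing attributes. Note that x* agrees with Day 3 on all three attributes, so its nearest neighbor lies at distance 0.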


Python Code
import pandas as pd
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import LabelEncoder

data = pd.DataFrame(
    {'Day': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
     'Weather': ['sunny', 'rainy', 'sunny', 'rainy', 'sunny', 'overcast',
                 'sunny', 'overcast', 'rainy', 'rainy'],
     'Temperature': ['hot', 'mild', 'cold', 'cold', 'cold', 'mild',
                     'hot', 'hot', 'hot', 'mild'],
     'Wind': ['low', 'high', 'low', 'high', 'high', 'low', 'low',
              'high', 'high', 'low'],
     'Play': ['no', 'yes', 'yes', 'no', 'yes', 'yes', 'yes',
              'yes', 'no', 'yes']},
    columns=['Day', 'Weather', 'Temperature', 'Wind', 'Play'])

input_data = data[['Weather', 'Temperature', 'Wind']]
# one-hot encode each attribute; dummy columns come out alphabetically:
# overcast, rainy, sunny | cold, hot, mild | high, low
dummies = [pd.get_dummies(data[c]) for c in input_data.columns]
binary_data = pd.concat(dummies, axis=1)

X = binary_data.values

le = LabelEncoder()
Y = le.fit_transform(data['Play'].values)   # no -> 0, yes -> 1

knn_classifier = KNeighborsClassifier(n_neighbors=3)
knn_classifier.fit(X, Y)

# x* = (sunny, cold, low) in the dummy encoding above
new_instance = np.array([[0, 0, 1, 1, 0, 0, 0, 1]])
prediction = knn_classifier.predict(new_instance)

ipdb> prediction[0]
1
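
The encoded prediction can be mapped back to the original label with the fitted LabelEncoder (assuming le from the code above):

le.inverse_transform(prediction)   # -> array(['yes'], ...): the answer is 'yes'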

kNN: IRIS
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

url = r'https://archive.ics.uci.edu/ml/' + \
      r'machine-learning-databases/iris/iris.data'

iris_feature_names = ['sepal-length', 'sepal-width',
                      'petal-length', 'petal-width']
data = pd.read_csv(url, names=iris_feature_names + ['Class'])
class_labels = ['Iris-versicolor', 'Iris-virginica']
data = data[data['Class'].isin(class_labels)]

X = data[iris_feature_names].values

scaler = StandardScaler()
scaler.fit(X)
X = scaler.transform(X)
le = LabelEncoder()
Y = le.fit_transform(data['Class'].values)

X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.5, random_state=3)
knn_classifier = KNeighborsClassifier(n_neighbors=15)
knn_classifier.fit(X_train, Y_train)
prediction = knn_classifier.predict(X_test)
error_rate = np.mean(prediction != Y_test)

ipdb> error_rate
0.06
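
• the two-class subset has 100 samples, so an error rate of 0.06 on the 50-point test set corresponds to 3 misclassified flowers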


Concepts Check:

(a) distances and neighbors
(b) nearest neighbor intuition
(c) the need for feature scaling
(d) how to choose k
(e) analyzing categorical data

