Nearest Neighbors Classification
BU MET CS-677: Data Science With Python, v.2.0
General Idea

Given a new observation, find the k labeled training points closest to it (usually under Euclidean distance) and predict the label that holds the majority among those k neighbors.
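A minimal from-scratch sketch of this idea (plain NumPy, not the course's own code): compute the distance from the query to every labeled training point, take the k smallest, and return the majority label.

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, query, k=3):
    # Euclidean distance from the query to every training point
    distances = np.linalg.norm(X_train - query, axis=1)
    # indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # majority vote among their labels
    votes = [y_train[i] for i in nearest]
    return Counter(votes).most_common(1)[0][0]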
Example of kNN
[Figure: scatter of two-class training points in the (X, Y) plane; two new, unlabeled query points are marked $ and %.]
[Figure: the same plot with the nearest neighbors (N) of each query point highlighted; each query point receives the majority label of its neighbors.]
How to Choose k
[Figure: the same (X, Y) scatter; the label assigned to a query point can change as k grows, which is why the choice of k matters.]
Illustration in Python
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

data = pd.DataFrame(
    {"id": [1, 2, 3, 4, 5, 6],
     "Label": ["green", "red", "red",
               "green", "green", "red"],
     "X": [1, 6, 7, 10, 10, 15],
     "Y": [2, 4, 5, -1, 2, 2]},
    columns=["id", "Label", "X", "Y"])

X = data[["X", "Y"]].values
Y = data["Label"].values           # 1-D array of labels

knn_classifier = KNeighborsClassifier(n_neighbors=3)
knn_classifier.fit(X, Y)

new_instance = np.array([[3, 2]])  # one sample with two features
prediction = knn_classifier.predict(new_instance)

ipdb> prediction[0]
'red'
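To inspect which training points cast the votes, the fitted classifier's kneighbors method returns the distances to, and row indices of, the k nearest training points; a quick check building on the block above:

distances, indices = knn_classifier.kneighbors(new_instance)
print(distances[0])                      # Euclidean distances to the 3 neighbors
print(data["Label"].values[indices[0]])  # the labels that voted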
A Numerical Example
[Figure: 3-D scatter of the example data, Height vs. Weight vs. Foot size, colored by class.]
kNN in Python
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

data = pd.DataFrame(
    {"id": [1, 2, 3, 4, 5, 6, 7, 8],
     "Label": ["green", "green", "green", "green",
               "red", "red", "red", "red"],
     "Height": [5, 5.5, 5.33, 5.75, 6.00, 5.92, 5.58, 5.92],
     "Weight": [100, 150, 130, 150, 180, 190, 170, 165],
     "Foot": [6, 8, 7, 9, 13, 11, 12, 10]},
    columns=["id", "Height", "Weight", "Foot", "Label"])

X = data[["Height", "Weight", "Foot"]].values
Y = data["Label"].values
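The slide omits the steps between building X, Y and the prediction shown below; a sketch of the missing lines, where the query values are a hypothetical new person (the slide does not give the actual query):

knn_classifier = KNeighborsClassifier(n_neighbors=3)
knn_classifier.fit(X, Y)
# hypothetical query: Height 5.75, Weight 160, Foot 11 (not on the slide)
new_instance = [[5.75, 160, 11]]
prediction = knn_classifier.predict(new_instance)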
ipdb> prediction[0]
'red'
Why Scaling?
[Figure: the same 3-D scatter on raw units; Weight spans roughly 100-200 while Height spans roughly 4.8-6.0, so unscaled Euclidean distances are dominated by Weight.]
Effect of Scaling
[Figure: the same 3-D scatter after standard scaling; all three axes now run from about -2 to 2, so each feature contributes comparably to the distance.]
Calculating k
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

data = pd.DataFrame(
    {"id": [1, 2, 3, 4, 5, 6, 7, 8],
     "Label": ["green", "green", "green", "green",
               "red", "red", "red", "red"],
     "Height": [5, 5.5, 5.33, 5.75, 6.00, 5.92, 5.58, 5.92],
     "Weight": [100, 150, 130, 150, 180, 190, 170, 165],
     "Foot": [6, 8, 7, 9, 13, 11, 12, 10]},
    columns=["id", "Height", "Weight", "Foot", "Label"])

X = data[["Height", "Weight", "Foot"]].values
Y = data["Label"].values
ipdb> error_rate
[0.5, 0.5]
le = LabelEncoder()
Y = le.fit_transform(data["Label"].values)   # 'green' -> 0, 'red' -> 1
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.5, random_state=3)
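StandardScaler is imported above, but the extracted slides never show it applied; the usual pattern is a sketch like the following, fitting the scaler on the training split only so no test-set information leaks into the scaling:

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)   # learn mean and std from training data
X_test = scaler.transform(X_test)         # reuse those statistics on the test data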
error_rate = []
for k in range(1, 21, 2):   # odd k: 1, 3, ..., 19
    knn_classifier = KNeighborsClassifier(n_neighbors=k)
    knn_classifier.fit(X_train, Y_train)
    pred_k = knn_classifier.predict(X_test)
    error_rate.append(np.mean(pred_k != Y_test))
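The figure below presumably comes from plotting this list; the plotting code itself is not on the slides, so this is a minimal matplotlib sketch:

import matplotlib.pyplot as plt

k_values = list(range(1, 21, 2))
plt.plot(k_values, error_rate, marker="o")
plt.xlabel("number of neighbors: k")
plt.ylabel("Error Rate")
plt.show()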
[Figure: test error rate vs. number of neighbors k (odd k from 1 to 19); the error rate ranges from about 0.02 to 0.055.]
k for IRIS
[Figure: 3-D scatter of the Iris dataset (sepal-length, sepal-width, petal-length) colored by species: Iris-setosa, Iris-versicolor, Iris-virginica.]
[Figure: test error rate vs. number of neighbors k (odd k from 1 to 19) for the Iris data.]
A Categorical Dataset
kNN computes distances between feature vectors, so categorical attributes must first be converted to numbers, e.g. with one-hot (dummy) encoding.
Python Code
import pandas as pd
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import LabelEncoder

data = pd.DataFrame(
    {"Day": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
     "Weather": ["sunny", "rainy", "sunny", "rainy", "sunny", "overcast",
                 "sunny", "overcast", "rainy", "rainy"],
     "Temperature": ["hot", "mild", "cold", "cold", "cold", "mild",
                     "hot", "hot", "hot", "mild"],
     "Wind": ["low", "high", "low", "high", "high", "low", "low",
              "high", "high", "low"],
     "Play": ["no", "yes", "yes", "no", "yes", "yes", "yes",
              "yes", "no", "yes"]},
    columns=["Day", "Weather", "Temperature", "Wind", "Play"])

input_data = data[["Weather", "Temperature", "Wind"]]
# one 0/1 dummy column per category of each feature
dummies = [pd.get_dummies(data[c]) for c in input_data.columns]
binary_data = pd.concat(dummies, axis=1)
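The slide jumps from the dummy encoding straight to the prediction shown below; a sketch of the missing steps, where the query day (sunny, mild, low wind) is hypothetical, not taken from the slide. LabelEncoder maps 'no' to 0 and 'yes' to 1, so the output 1 below reads as "play":

le = LabelEncoder()
Y = le.fit_transform(data["Play"].values)   # 'no' -> 0, 'yes' -> 1

knn_classifier = KNeighborsClassifier(n_neighbors=3)
knn_classifier.fit(binary_data.values, Y)

# hypothetical new day: sunny, mild, low wind, one-hot encoded in the same
# column order as binary_data (overcast, rainy, sunny | cold, hot, mild | high, low)
new_day = [[0, 0, 1, 0, 0, 1, 0, 1]]
prediction = knn_classifier.predict(new_day)[0]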
ipdb> prediction
1
kNN: IRIS
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
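The body between these imports and the result below is missing from the extracted slide; a minimal sketch of the presumable pipeline, assuming the data comes from sklearn.datasets.load_iris (the course may instead read a CSV) and assuming k = 5:

from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data      # sepal/petal length and width
Y = iris.target    # species already encoded as 0, 1, 2

X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.5, random_state=3)

# scale with training-set statistics only
scaler = StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

knn_classifier = KNeighborsClassifier(n_neighbors=5)   # assumed k
knn_classifier.fit(X_train, Y_train)
pred = knn_classifier.predict(X_test)
error_rate = np.mean(pred != Y_test)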
ipdb> error_rate
0.06
Concepts Check: