K-Nearest Neighbor: General Gist
General gist:
• Find the Euclidean distance (distance formula) between the new sample and each of the previous samples.
• Sort the distances obtained in ascending order.
• Pick the class of the new sample according to the classes of its nearest neighbors.
This algorithm is also known as a lazy learner algorithm because it does not actually train a model on the data set. We only need to compute the distances and then look at the nearest neighbors to classify a new sample.
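The test code below calls two helper functions, euclidean_distance and predict_classification, whose definitions appear earlier in the chapter. As a reminder, here is a minimal sketch of what they are assumed to look like (a plain-Python version working on lists of numbers):

from math import sqrt

def euclidean_distance(row1, row2):
    # Sum the squared differences over every feature column,
    # skipping the last column, which holds the class label.
    distance = 0.0
    for i in range(len(row1) - 1):
        distance += (row1[i] - row2[i]) ** 2
    return sqrt(distance)

def predict_classification(train, test_row, num_neighbors):
    # Distance from the test row to every training row; keep the
    # num_neighbors closest rows and vote on their class labels.
    distances = [(row, euclidean_distance(test_row, row)) for row in train]
    distances.sort(key=lambda pair: pair[1])
    neighbors = [pair[0] for pair in distances[:num_neighbors]]
    labels = [row[-1] for row in neighbors]
    return max(set(labels), key=labels.count)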
Now we will test all our functions by executing the main part of the code:
# Test the prediction function (main part of the code)
dataset = [[2.7810836, 2.550537003, 0],
[1.465489372, 2.362125076, 0],
[3.396561688, 4.400293529, 0],
[1.38807019, 1.850220317, 0],
[3.06407232, 3.005305973, 0],
[7.627531214, 2.759262235, 1],
[5.332441248, 2.088626775, 1],
[6.922596716, 1.77106367, 1],
[8.675418651, -0.242068655, 1],
[7.673756466, 3.508563011, 1]] # Sample data set
prediction = predict_classification(dataset, dataset[0], 3) # Call the predict_classification function
print('Expected %d, Got %d.' % (dataset[0][-1], prediction))
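With the helper functions sketched above, running this should print something like "Expected 0, Got 0.", since the first row's own class (0) dominates its three nearest neighbors.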
We pass the whole dataset as the first argument; this is the 'train' parameter. dataset[0] means we are sending the first row of the dataset as the new test sample.
Pandas is used for data manipulation and analysis. We will use pandas to read a CSV (comma separated values) file, which holds the data for three different types of flowers. We will import the pandas library under the alias 'pd'.
NumPy is a powerful scientific library designed for array operations, giving Python much of the mathematical edge that MATLAB provides. We will import numpy under the alias 'np'.
The operator module is used when sorting an iterable data type, to specify which column should be the subject of the sort.
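In code, these three imports look like:

import pandas as pd   # data manipulation and CSV reading
import numpy as np    # array operations (square, sqrt)
import operator       # itemgetter, used when sorting the distances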
The first step is to read the CSV file correctly. First download the .CSV file and place it in the working directory of the Python project.
When we open the .CSV file we will notice that the comma separated values do not have any column names, so we will provide the column names through code.
col = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'type']
iris = pd.read_csv("IrisDataset.csv", names=col)
print("First Five Entries:")
print(iris.head())
print("Columns of data set: ", iris.columns)
print("Dimension: ", iris.shape)
print("Size: ", iris.size)
Lists work like arrays in Python, so we put the column names as strings in a list. pd.read_csv creates an object which we store in the variable 'iris'. The first argument to this method is the file name of the .CSV file we placed in the directory, and the 'col' list is passed as the names argument, which labels the columns according to the comma separated values in the .CSV file.
We can print the first five data rows of the .CSV file by calling the head method on the object 'iris' as 'iris.head()'.
We can view the columns in a similar fashion as 'iris.columns'.
The dimensions and size of the data set can also be inspected. All of these methods and attributes are defined in the pandas library.
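Assuming the standard 150-row Iris data, 'iris.shape' would print (150, 5) and 'iris.size' would print 750 (150 rows times 5 columns).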
Once the data set is read into the 'iris' object, we can proceed to the first step of the algorithm: finding the distance between the sample point and each entry of the dataset.
test_set = [[7.9, 3.8, 6.4, 2.0]]
test_sample = pd.DataFrame(test_set)  # convert the new sample to a pandas DataFrame
length = test_sample.shape[1]         # number of feature columns
distances = {}
for x in range(len(dataset)):
    dist = euclidean_distance(test_sample, dataset.iloc[x], length)
    distances[x] = dist[0]
sorted_d = sorted(distances.items(), key=operator.itemgetter(1))
We first set the values at which we want to predict the type of the flower. To do this we make a list, set the values, and assign it to 'test_set'. The whole algorithm works on pandas objects, so we have to convert 'test_set' to a pandas DataFrame in the same shape as the 'iris' data we read earlier, with each value representing a separate column.
It is passed into an enclosing function where we find the length (number of columns) of the test_sample; this length is later passed into the euclidean_distance function.
We use a for loop to send each row of the dataset one by one. The loop runs once per row, hence 'len(dataset)' returns the number of rows. In the call to the euclidean_distance function we pass 'test_sample' as argument 1 and 'dataset.iloc[x]' as argument 2. iloc is a pandas indexer which stands for integer location; it returns the x-th row, where x is the argument of iloc. This loop runs for every row of 'dataset'. The third argument is 'length', which holds the length of the test_sample.
In the euclidean_distance function we first assign 0.0 to make distance a float variable.
The for loop runs up to the length of the test_sample we passed earlier. This is because we don't want to include the type of the flower while calculating distances; we only want to perform the calculation on the four features.
So the loop runs four times, producing the following equation:
Distance = √((X − x)² + (Y − y)² + (Z − z)² + (W − w)²)
where X, Y, Z, W are the four features of the 'test_sample' and x, y, z, w are the features of each row passed in as row2. This is achieved using numpy's square and sqrt methods, and the function returns the distance.
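The exact function body is not reproduced in this section; a minimal sketch matching the three-argument call used in the loop above (the names test_sample, row2 and length follow the description) could look like this:

def euclidean_distance(test_sample, row2, length):
    distance = 0.0                                  # start as a float
    for i in range(length):                         # only the four feature columns, not the type
        distance += np.square(test_sample[i] - row2.iloc[i])
    return np.sqrt(distance)                        # one-element result, hence dist[0] in the caller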
This function returns a distance for each of the 'len(dataset)' rows. Each value is put into a dictionary one by one; this dictionary is named 'distances'.
The next step is to sort the distances to the 'test_sample' in ascending order. This is achieved with the sorted function, where we take the items of the distances dictionary and sort with the distance values as the subject. This is specified by 'key=operator.itemgetter(1)', where 1 indicates the values in the dictionary. Likewise, itemgetter(0) would indicate the keys in the dictionary and sort with the keys as the subject, as in the small example below.
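For instance, with a small hypothetical distances dictionary:

distances = {0: 2.5, 1: 0.7, 2: 1.9}
sorted_d = sorted(distances.items(), key=operator.itemgetter(1))
# sorted_d is now [(1, 0.7), (2, 1.9), (0, 2.5)]: (key, value) tuples ordered by
# the distance values; itemgetter(0) would instead order them by the keys.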
Once we have the sorted dictionary, we move on to finding the nearest neighbors of the 'test_sample'.
neighbors = []
for x in range(k):
    neighbors.append(sorted_d[x][0])
We define a variable 'neighbors' as a list. The loop runs over the range defined by k: we take the first k keys (row indices) of the closest values from the sorted distances variable 'sorted_d', and we have our neighbors.
The last step in this process is to determine the class of the neighbors. We do this by first making a dictionary which holds all the types of flower present in the dataset.
This is stored in the variable 'counts'. A for loop then determines the type of each neighbor. This is done by taking the iloc value at the index of the closest neighbor; [-1] reads the row from the end, and the last column is the type of the flower. The whole statement assigns the flower type of the nearest neighbor to the 'response' variable.
We then increment the value in 'counts' each time a specific type of flower is encountered.
We reverse-sort the items of 'counts' in descending order to get the maximum count at the very first key: value position.
Finally we return the type of flower with the maximum count, which makes up our prediction.
knn is the enclosing function, which also makes the calls to euclidean_distance. The counting and voting code is not reproduced in this section; a sketch of how the pieces described above fit together is given below.
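A minimal sketch of the enclosing knn function, assuming the variable names used in the explanation (counts, response, sorted_d, neighbors) and the 'type' column name from the col list:

def knn(dataset, test_sample, k):
    length = test_sample.shape[1]                        # number of feature columns
    # Step 1: distance from the test sample to every row of the dataset
    distances = {}
    for x in range(len(dataset)):
        dist = euclidean_distance(test_sample, dataset.iloc[x], length)
        distances[x] = dist[0]
    # Step 2: sort the distances in ascending order
    sorted_d = sorted(distances.items(), key=operator.itemgetter(1))
    # Step 3: keep the row indices of the k nearest neighbors
    neighbors = []
    for x in range(k):
        neighbors.append(sorted_d[x][0])
    # Step 4: count the flower type of each neighbor
    counts = {flower: 0 for flower in dataset['type'].unique()}
    for x in range(len(neighbors)):
        response = dataset.iloc[neighbors[x]].iloc[-1]   # last column holds the type
        counts[response] += 1
    # Step 5: reverse-sort the counts and return the most frequent type
    sorted_counts = sorted(counts.items(), key=operator.itemgetter(1), reverse=True)
    return sorted_counts[0][0]

With the 'iris' DataFrame read earlier and the 'test_sample' built above, calling knn(iris, test_sample, 5) would return the predicted flower type; the value of k is a free choice.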