
K-Nearest Neighbor

The k-nearest neighbor (KNN) algorithm predicts the class of an input sample by
comparing its distance to the samples whose classes are already known. The new
sample is assigned the same class as the training samples that lie nearest to it.

General gist:
• Find the Euclidean distance (distance formula) between the
new sample and every previous sample.
• Sort the distances obtained in ascending order.
• Assign the new sample the class that is most common among
its k nearest neighbors.

K is the value that defines the number of neighbors we compare our new sample
with.

• A larger value of K means a smoother boundary of separation,
resulting in a less complex model
• A smaller value of K tends to overfit the data, resulting in
a more complex model
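
As a toy illustration (with hypothetical class labels, not taken from any data set used later), the same sorted neighborhood voted with different values of K can give different answers:

# class labels of the training samples, already sorted by distance (hypothetical)
labels_sorted_by_distance = [1, 0, 0, 1, 1]
for k in (1, 3, 5):
    votes = labels_sorted_by_distance[:k]        # take the k nearest labels
    print(k, max(set(votes), key=votes.count))   # majority vote: 1 -> 1, 3 -> 0, 5 -> 1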

This algorithm is also known as a lazy learner because it does not actually
train a model on the data set; we only compare distances, and subsequently
the neighbors, to classify a sample.

To understand the working, I first applied KNN without any specialized
libraries such as NumPy and scikit-learn, and with only two features.
Python Code:
# 18EC028- Internship- KNN
from math import sqrt

# calculate the Euclidean distance between two vectors
def euclidean_distance(row1, row2):
    distance = 0.0
    for i in range(len(row1) - 1):  # skip the last column (the class label)
        distance += pow((row1[i] - row2[i]), 2)  # distance formula
    return sqrt(distance)

# Locate the most similar neighbors
def get_neighbors(train, test_row, num_neighbors):
    distances = list()
    for train_row in train:
        dist = euclidean_distance(test_row, train_row)
        distances.append((train_row, dist))
    distances.sort(key=lambda tup: tup[1])  # sort by distance, ascending
    neighbors = list()
    for i in range(num_neighbors):
        neighbors.append(distances[i][0])
    return neighbors

# Make a classification prediction with neighbors
def predict_classification(train, test_row, num_neighbors):
    neighbors = get_neighbors(train, test_row, num_neighbors)
    output_values = [row[-1] for row in neighbors]
    prediction = max(set(output_values), key=output_values.count)
    return prediction

# Test the algorithm (main part of the code)
dataset = [[2.7810836, 2.550537003, 0],
           [1.465489372, 2.362125076, 0],
           [3.396561688, 4.400293529, 0],
           [1.38807019, 1.850220317, 0],
           [3.06407232, 3.005305973, 0],
           [7.627531214, 2.759262235, 1],
           [5.332441248, 2.088626775, 1],
           [6.922596716, 1.77106367, 1],
           [8.675418651, -0.242068655, 1],
           [7.673756466, 3.508563011, 1]]  # sample data set
prediction = predict_classification(dataset, dataset[0], 3)  # call predict_classification
print('Expected %d, Got %d.' % (dataset[0][-1], prediction))
First, we start by making a small sample data set so that we can verify the
working of our functions.

# calculate the Euclidean distance between two vectors
def euclidean_distance(row1, row2):
    distance = 0.0
    for i in range(len(row1) - 1):  # skip the last column (the class label)
        distance += pow((row1[i] - row2[i]), 2)  # distance formula
    return sqrt(distance)

We start with a function that calculates the distance between the test sample
and every other sample present in the data set. The function takes two rows as
input parameters. A 'distance' variable is created and initialized to zero to
hold the Euclidean distance. As the data set contains two features plus their
output/class, the length of a row is 3. '(len(row1) - 1)' excludes the
output/class value and focuses only on the two features present in the row.
Those two features act like coordinates in a 2D plane, so calculating the
distance between them reduces to the simple distance formula. Since distance
is symmetric, it does not matter which row is passed first. 'distance'
accumulates the following terms:
(Feature 1 of row 1 − Feature 1 of row 2)² in the first iteration
(Feature 2 of row 1 − Feature 2 of row 2)² in the second iteration
Adding these terms across iterations yields the squared Euclidean distance,
so the function returns its square root, using the sqrt function from the
math library.
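
As a quick sanity check, here is a minimal sketch using the first two rows of the sample data set defined above:

row0 = [2.7810836, 2.550537003, 0]
row1 = [1.465489372, 2.362125076, 0]
print(euclidean_distance(row0, row1))  # ~1.3290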
Up next is a function that finds the neighbors of the new sample we add.

# Locate the most similar neighbors
def get_neighbors(train, test_row, num_neighbors):
    distances = list()
    for train_row in train:
        dist = euclidean_distance(test_row, train_row)
        distances.append((train_row, dist))
    distances.sort(key=lambda tup: tup[1])  # sort by distance, ascending
    neighbors = list()
    for i in range(num_neighbors):
        neighbors.append(distances[i][0])
    return neighbors

This function takes three parameters as input:
• 'train', which is the data set we provide as a whole
• 'test_row', the new sample we want to run our algorithm on to
get neighbors and a prediction
• 'num_neighbors', which is the K of the KNN algorithm: the number of
neighbors we want returned to classify our new sample
A local variable 'distances' is created as a list. It stores the distances we
calculate between each row of the data set and our new sample ('test_row').
The loop iterates through every row of the data set passed through the 'train'
parameter and appends a tuple of the current row and the distance between the
new sample and that row.
Next, we sort the list in ascending order, using the distance in each tuple
as the key, so the smallest distances start from index 0.
We then create a new variable 'neighbors' to hold the closest rows from the
distances list, depending on the value of K: the for loop iterates K times,
appending one row per iteration.
The function returns the neighbors of the new sample.
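
A small check of this function on the sample data set defined above: the three nearest neighbors of dataset[0] are itself (distance 0), row 4 and row 1.

for neighbor in get_neighbors(dataset, dataset[0], 3):
    print(neighbor)
# [2.7810836, 2.550537003, 0]
# [3.06407232, 3.005305973, 0]
# [1.465489372, 2.362125076, 0]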
The last step in applying KNN is to predict the class of the new sample.

# Make a classification prediction with neighbors
def predict_classification(train, test_row, num_neighbors):
    neighbors = get_neighbors(train, test_row, num_neighbors)
    output_values = [row[-1] for row in neighbors]
    prediction = max(set(output_values), key=output_values.count)
    return prediction

The last function is predict_classification. It takes the same three input
parameters as get_neighbors, because it must call get_neighbors to obtain the
list of neighbors. We build 'output_values' by taking only the last column of
each neighboring row (row[-1]), producing a flat list of class labels.
The prediction then makes a set out of the output values to obtain only the
unique class labels, and 'key=output_values.count' ranks each unique label by
how often it occurs among the neighbors returned by get_neighbors. If class 1
occurs more often than class 0, the max over these counts returns 1; if class
0 occurs more often, it returns 0.
With that, we have classified our new sample.
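
A toy illustration of this majority vote, with hypothetical neighbor labels:

output_values = [0, 0, 1]  # class labels of the 3 nearest neighbors (hypothetical)
print(max(set(output_values), key=output_values.count))  # prints 0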

Now we test all our functions by executing the main part of the code:

# Test the algorithm (main part of the code)
dataset = [[2.7810836, 2.550537003, 0],
           [1.465489372, 2.362125076, 0],
           [3.396561688, 4.400293529, 0],
           [1.38807019, 1.850220317, 0],
           [3.06407232, 3.005305973, 0],
           [7.627531214, 2.759262235, 1],
           [5.332441248, 2.088626775, 1],
           [6.922596716, 1.77106367, 1],
           [8.675418651, -0.242068655, 1],
           [7.673756466, 3.508563011, 1]]  # sample data set
prediction = predict_classification(dataset, dataset[0], 3)  # call predict_classification
print('Expected %d, Got %d.' % (dataset[0][-1], prediction))

We pass the whole data set as the first argument; this is the 'train'
parameter. 'dataset[0]' means we send the first row of the data set as the
new test sample:

[2.7810836, 2.550537003, 0]  # dataset[0]

We select K as 3, which means we compare our sample point with its 3 closest
neighbors. Since the test sample is the first row of the data set, the
expected class is 0, and the output we get is:

Expected 0, Got 0.

This implementation is limited to only 2 features. For multiple features it
is far more convenient to use libraries: pandas makes loading and manipulating
the data easy, and NumPy carries out vectorized mathematics, which is very
fast; we can also optimize the code using various built-in functions. The
version below uses pandas and NumPy on the four-feature Iris data set.

KNN using libraries

Python Code:
import pandas as pd
import numpy as np
import operator

col = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'type']
iris = pd.read_csv("IrisDataset.csv", names=col)
print("First Five Entries:")
print(iris.head())
print("Columns of data set: ", iris.columns)
print("Dimension: ", iris.shape)
print("Size: ", iris.size)

def euclidean_distance(row1, row2, length):
    distance = 0.0
    for i in range(length):  # excludes the 'type' column of the iris data set
        distance += np.square(row1[i] - row2.iloc[i])  # distance formula
    return np.sqrt(distance)

def knn(dataset, test_sample, k):
    distances = {}
    length = test_sample.shape[1]  # number of feature columns
    for x in range(len(dataset)):
        dist = euclidean_distance(test_sample, dataset.iloc[x], length)
        distances[x] = dist[0]
    sorted_d = sorted(distances.items(), key=operator.itemgetter(1))
    neighbors = []
    for x in range(k):
        neighbors.append(sorted_d[x][0])
    counts = {"Setosa": 0, "Versicolor": 0, "Virginica": 0}
    for x in range(len(neighbors)):
        response = dataset.iloc[neighbors[x]].iloc[-1]  # type of the neighbor
        if response in counts:
            counts[response] += 1
        else:
            counts[response] = 1
    max_counts = sorted(counts.items(), key=operator.itemgetter(1), reverse=True)
    return max_counts[0][0]

test_set = [[1.4, 3.6, 3.4, 1.2]]
test = pd.DataFrame(test_set)
k = 4
predict = knn(iris, test, k)
print("The predicted type is: ", predict)

Running this prints the predicted flower type for the test sample. With a
changed test_sample:

test_set = [[7.9, 3.8, 6.4, 2.0]]
test = pd.DataFrame(test_set)
k = 4
predict = knn(iris, test, k)
print("The predicted type is: ", predict)

the prediction changes accordingly.


The libraries used in this algorithm are:
import pandas as pd
import numpy as np
import operator

pandas is used for data manipulation and analysis. We use it to read a CSV
(comma-separated values) file, which holds the data for three different types
of flowers; we import the pandas library under the alias 'pd'.
NumPy is a powerful scientific library designed for array operations, giving
Python much of the mathematical convenience that MATLAB provides; we import
it under the alias 'np'.
The operator module helps in sorting iterable data types by letting us specify
which element of each item to use as the sort key.
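
For instance, a minimal illustration of what operator.itemgetter provides:

import operator
get_second = operator.itemgetter(1)  # a callable returning element 1 of its argument
print(get_second(('Setosa', 0.7)))   # 0.7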

The first step is to read the CSV file correctly. Download the .CSV file and
place it in the working directory of the Python project.
When we open the .CSV file, we notice that the comma-separated values do not
have any column names, so we provide the column names through code.
col = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'type']
iris = pd.read_csv("IrisDataset.csv", names=col)
print("First Five Entries:")
print(iris.head())
print("Columns of data set: ", iris.columns)
print("Dimension: ", iris.shape)
print("Size: ", iris.size)

Lists work like arrays in Python, so we put the column names as strings in a
list. pd.read_csv creates a DataFrame object, which we store in the variable
'iris'. The first argument of this function is the file name of the .CSV file
stored in the directory, and the 'col' list is passed as the 'names' argument,
which assigns our names to the columns split at each comma in the .CSV file.
We can print the first five data rows of the .CSV file by calling the head
method on the object 'iris', as 'iris.head()'.
We can view the columns in similar fashion with 'iris.columns'.
The dimensions and size of the data set can also be inspected; all of these
methods are defined by the pandas library.
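
Assuming the standard 150-row Iris file, the last two prints would report:

print("Dimension: ", iris.shape)  # Dimension:  (150, 5) -> 150 rows, 5 columns
print("Size: ", iris.size)        # Size:  750           -> rows x columns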

Once the data set is read into the 'iris' object, we can proceed with the
first step of the algorithm: finding the distance between the sample point and
each entry of the data set.

test_set = [[7.9, 3.8, 6.4, 2.0]]
test = pd.DataFrame(test_set)

length = test_sample.shape[1]
for x in range(len(dataset)):
    dist = euclidean_distance(test_sample, dataset.iloc[x], length)
    distances[x] = dist[0]
sorted_d = sorted(distances.items(), key=operator.itemgetter(1))

def euclidean_distance(row1, row2, length):
    distance = 0.0
    for i in range(length):  # excludes the 'type' column of the iris data set
        distance += np.square(row1[i] - row2.iloc[i])  # distance formula
    return np.sqrt(distance)

We first set the values at which we want to predict the type of the flower.
To do this, we make a list of the values and assign it to 'test_set'. The
whole algorithm works on pandas objects, so we convert 'test_set' into a
DataFrame, in the same format as the 'iris' data we read earlier, with each
value representing a separate column. It is then passed to the enclosing knn
function, where we find the length (number of feature columns) of the test
sample; this length is later passed on to the euclidean_distance function.
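
A quick look at what that conversion produces:

test = pd.DataFrame([[7.9, 3.8, 6.4, 2.0]])
print(test.shape)     # (1, 4) -> one row, four feature columns
print(test.shape[1])  # 4, the 'length' later used by euclidean_distance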
We use a for loop to send each row of the data set one by one; the loop runs
once per row, so 'len(dataset)' returns the number of rows. In the call to
euclidean_distance we pass 'test_sample' as argument 1 and 'dataset.iloc[x]'
as argument 2. iloc is a pandas indexer that stands for integer location: it
returns the x-th row when x is its argument. This call runs for every row of
'dataset'. The third argument is 'length', which holds the number of feature
columns of the test sample.
In the euclidean_distance function we first assign 0.0 to make 'distance' a
float variable.
The for loop runs up to the 'length' we passed earlier. This is because we do
not want to include the type of the flower while calculating distances; we
only perform the calculation on the four features.
So the loop runs four times, building the following equation:
distance = √((X − x)² + (Y − y)² + (Z − z)² + (W − w)²)
where X, Y, Z, W are the four features of 'test_sample' and x, y, z, w are
the features of each row passed as row2. This is achieved using NumPy's
square and sqrt functions, and the function returns the distance.
This function is called once for each of the 'len(dataset)' rows, and each
returned value is put one by one into the dictionary named 'distances',
keyed by the row index.
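
The reason 'dist[0]' is needed: because 'test_sample' is a one-row DataFrame, row1[i] inside euclidean_distance is a length-1 Series, so the returned distance is also a length-1 Series rather than a plain number. A minimal sketch:

row = pd.DataFrame([[7.9, 3.8, 6.4, 2.0]])
d = np.sqrt(np.square(row[0] - 5.0))  # a length-1 Series, not a scalar
print(d[0])                           # ~2.9, the plain float we store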
The next step is to sort the distances to the 'test_sample' in ascending
order. This is achieved with the sorted function: we take the items of the
distances dictionary and sort them by the distance values. This is specified
by 'key=operator.itemgetter(1)', where 1 indicates the values in each
(key, value) pair; likewise, itemgetter(0) would indicate the keys and sort
by key instead.
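
A small sketch of this sort, with hypothetical distances:

import operator
distances = {0: 2.5, 1: 0.7, 2: 1.9}  # row index -> distance (hypothetical)
print(sorted(distances.items(), key=operator.itemgetter(1)))
# [(1, 0.7), (2, 1.9), (0, 2.5)] -- nearest row first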

Once we have the sorted distances, we move on to finding the nearest
neighbors of the 'test_sample'.

neighbors = []
for x in range(k):
    neighbors.append(sorted_d[x][0])

We define a variable 'neighbors' as a list. The loop runs over the range
defined by k: we take the keys (row indices) of the first k closest values
from the sorted distances variable 'sorted_d', and we have our neighbors.
The last step in this process is to determine the class of the neighbors. We
do this by first making a dictionary that holds all the types of flower
present in the data set.

counts = {"Setosa": 0, "Versicolor": 0, "Virginica": 0}
for x in range(len(neighbors)):
    response = dataset.iloc[neighbors[x]].iloc[-1]  # type of the neighbor
    if response in counts:
        counts[response] += 1
    else:
        counts[response] = 1
max_counts = sorted(counts.items(), key=operator.itemgetter(1), reverse=True)
return max_counts[0][0]

This dictionary is stored in the variable 'counts'. The for loop determines
the type of each neighbor: we take the row at the index of the closest
neighbor via iloc, and '.iloc[-1]' reads the row from the end, i.e. the last
column, which is the type of the flower. This whole statement assigns the
flower type of the nearest neighbor to the 'response' variable.
We then increment the corresponding value in 'counts' each time a specific
type of flower is encountered.
Finally, we reverse-sort the items of 'counts' in descending order to get the
maximum count at the very first key: value position, and we return the type
of flower with the maximum count, which makes up our prediction.
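
A toy run of this vote-counting step, with hypothetical tallies:

import operator
counts = {"Setosa": 0, "Versicolor": 1, "Virginica": 3}  # hypothetical tallies
max_counts = sorted(counts.items(), key=operator.itemgetter(1), reverse=True)
print(max_counts[0][0])  # Virginica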

This is the main code that controls the flow:

test_set = [[7.9, 3.8, 6.4, 2.0]]
test = pd.DataFrame(test_set)
k = 4
predict = knn(iris, test, k)
print("The predicted type is: ", predict)

knn is the enclosing function, which in turn calls euclidean_distance.
