K-Nearest Neighbors Implementation in R
Lecture - 47
In this lecture, we introduce a case study and use it to explain how to implement the knn
algorithm in R. We will start with the problem statement of the case study and then show how to
solve it using R.
In the process, we will show how to read data from a .csv file, how to understand the data
that is loaded into the R workspace, and how to implement the K-nearest neighbours
algorithm in R using the knn function. We will also talk about how to interpret the results
that the knn algorithm gives us.
(Refer Slide Time: 01:12)
Before we jump into the case study, let us review some key points from Prof. Raghu's previous
lecture. If you remember, knn is primarily used as a classification algorithm, and it is a
supervised learning algorithm. Supervised learning means that the data provided to you has to be
labelled data. knn is also a non-parametric method. What we mean by non-parametric is that no
parameters of the classifier are extracted from the data itself, and there is no explicit
training phase involved in the knn algorithm.
The knn algorithm is a lazy learning algorithm, because it does not do any computation
until you ask it to classify a point. Because we are dealing with the K nearest neighbours, the
notion of distance is important in this algorithm. The knn algorithm works by majority voting:
given a test point, we calculate the distance of the test point from all the data points in the
given data and arrange the distances in ascending order. We then choose the first k nearest
neighbours, and based on the votes that each of them casts through its label, we assign
a class to the test data point. That is essentially how knn works.
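The voting procedure just described can be sketched in a few lines of base R. This is only an illustration under assumptions: the function name knn_vote, the toy attributes a and b, and the Euclidean distance are choices made here, not something taken from the lecture slides.

```r
# Classify one test point by majority vote among its k nearest training points.
knn_vote <- function(train_x, train_labels, test_point, k = 3) {
  # Euclidean distance from the test point to every training row
  diffs <- sweep(as.matrix(train_x), 2, as.numeric(test_point))
  dists <- sqrt(rowSums(diffs^2))
  # Indices of the k smallest distances (ascending order)
  nearest <- order(dists)[seq_len(k)]
  # Tally the labels of those neighbours and return the most frequent one
  votes <- table(train_labels[nearest])
  names(votes)[which.max(votes)]
}

# Made-up toy data: two numeric attributes and a No/Yes label
train_x <- data.frame(a = c(1, 2, 1.5, 8, 9, 8.5),
                      b = c(1, 1.5, 2, 8, 9, 9.5))
train_labels <- factor(c("No", "No", "No", "Yes", "Yes", "Yes"))

knn_vote(train_x, train_labels, c(2, 2), k = 3)  # all three neighbours are "No"
knn_vote(train_x, train_labels, c(8, 8), k = 3)  # all three neighbours are "Yes"
```

Note that this sketch resolves a tied vote by taking the first level, whereas the knn function in the class package breaks ties at random.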
Now, let us define the case study problem statement. We have named this case study the
automotive service company case study.
(Refer Slide Time: 03:07)
Let us look at the problem statement. An automotive service chain is launching its brand new
service station this weekend. They offer service to a wide variety of cars. The current capacity
of the station is to check 315 cars thoroughly per day. As an inaugural offer, they claim
to check all the cars that arrive on their launch day free of charge and to report whether each
car needs servicing or not.
Unexpectedly, they got 450 cars. Since they have the facility for thoroughly testing only 315
cars, they will not be able to check all 450 cars thoroughly, and the
servicemen will not work longer than the normal working hours.
So, they have hired a data analyst to help them out of the situation. If
you are the data analyst hired by this automotive service station, how can you
save the day for the new service station? That is the problem statement.
(Refer Slide Time: 04:34)
Now, let us see how a data scientist can save the day for the service station. Since the service
station has the capacity to thoroughly check 315 cars, they have thoroughly checked 315 cars
and given the data in the service train data dot csv file. For the rest of the cars among the
450, they cannot do a thorough check, so they have checked only those attributes which are
easily measurable and given them in the service test data dot csv file. So, essentially, the
data scientist has training data which contains a few attributes along with a label saying
whether a service is needed or not.
He also has test data in which all the other attributes are present but which does not have the
column saying whether service is needed or not. The question is how to use the
service train data to say, for the readings present in the service test data,
whether service is needed or not. So, the idea is to use the knn
classification technique to classify the cars in the service test data file, which cannot be
tested thoroughly, and say whether service is needed or not.
First, you have to get things ready. By that I mean you have to set the working
directory to the directory in which the given data files are available. You can do that using
the setwd command with the corresponding path, or you can
use the GUI option to set the working directory. This command here is used to clear all the
variables in the environment of R. You can also use the brush button in the
Environment/History pane to clear the variables in the workspace.
Another important thing: for this knn implementation, we need two external
packages, caret and class. You have to install the caret and class packages if you have
not installed them already. We have explained how to install packages in our R modules:
you can install them from the console using the install.packages command with the package
name and dependencies = TRUE, or you can use the GUI.
So, please install the caret and class packages. Once they are installed, you
can load them using the library command, as we have explained already. We will see
why these packages are important as we go along in this lecture.
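A minimal setup sketch along these lines is given below. The path in the commented setwd call is a placeholder, and rm(list = ls()) is assumed to be the workspace-clearing command referred to above, since the transcript does not spell it out. The sketch only reports missing packages rather than installing them, so it can be run without network access.

```r
# setwd("path/to/your/data")   # placeholder path; uncomment and edit for your machine

# Clear all variables from the current environment
rm(list = ls())

# Check which of the two packages are not yet installed
pkgs <- c("caret", "class")
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!installed]

if (length(missing_pkgs) > 0) {
  # To install them, run: install.packages(missing_pkgs, dependencies = TRUE)
  message("Please install: ", paste(missing_pkgs, collapse = ", "))
} else {
  library(caret)
  library(class)
}
```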
The caret library is for generating the confusion matrix, which Prof. Raghu would have talked
about when discussing the performance metrics of a classifier. The class library
contains different classification algorithms, and here we are going to use it for
implementing knn. Now, let us see how to read the data
from the given files. In this case, the data is provided in two files, as we have already
seen: service train data dot csv and service test data dot csv. To read the data from the
csv files, the function we use is read.csv. Let us look at what the read.csv function
takes and what it returns.
(Refer Slide Time: 08:59)
The read.csv function reads a file in table format and creates a data frame from it. The syntax
is as follows: read.csv takes the file name and the row names. Let us
look at what the input arguments file and row.names mean. file is the name of
the file from which you have to read the data. row.names is either a vector giving the
actual row names or a single value which specifies which column
of the data set holds the row names.
As we have seen, the data has been given in two .csv files, so we can use the read.csv
function to read it. As we have seen in the syntax of read.csv, we have to give the
file name; here I give the name of the service train data file from which I want to load the
data. I am assigning the result to a variable called service train. When you execute this
command, it reads the data from the service train data file and assigns it to this
variable, which is a data frame.
Similarly, you read the data from the service test data file and assign it to the variable
service test, which is again a data frame. In the R environment, once you execute these
commands, you will see two data frames, service train and service test, which have 315
observations of 6 variables and 135 observations of 6 variables respectively.
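As a hedged illustration of this step, the sketch below writes a tiny made-up file and reads it back with read.csv. The file name toyServiceTrain.csv, the column names and the three rows are all invented here, because the transcript does not show the real file contents. One thing worth knowing: from R 4.0.0 onwards read.csv keeps text columns as character by default, so stringsAsFactors = TRUE is passed to get a factor label column like the one the lecture describes.

```r
# Hypothetical file name and made-up rows, written to a temporary folder
csv_path <- file.path(tempdir(), "toyServiceTrain.csv")
writeLines(c(
  "OilQual,EnginePerf,NormMileage,TyreWear,HVACwear,Service",
  "103.4,102.1,104.0,102.9,101.5,No",
  "26.9,27.2,29.1,51.0,48.3,Yes",
  "99.8,98.7,101.2,103.3,100.9,No"
), csv_path)

# Read the file into a data frame; text columns become factors
toy_train <- read.csv(csv_path, stringsAsFactors = TRUE)

class(toy_train)  # a data frame
dim(toy_train)    # number of observations and number of variables
```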
Remember why it is 315: 315 is the number of cars that they can thoroughly check. Of the 6
variables given for these 315 cars, 5 are the attributes which are easily measurable, and one
column says whether service is needed or not. The 135 cars also have 6 variables: they have
measured all 5 attributes which are important, and the 6th attribute is also given here. We will
see why the 6th attribute is given as we go on in this lecture.
Now, let us see what is in the service train and service test data. One way to do that
is to use the View command.
(Refer Slide Time: 11:42)
The View command helps you to see the data frames. For example, if you want to see what is
in the service train data frame, View of service train will show a
table like this in your editor environment. You can see how many attributes there are: 1,
2, 3, 4, 5, 6 attributes. These are the five attributes which are measured for testing
whether service is needed or not, and this last attribute says whether service is needed or
not.
Similarly, you can view the service test data set, which is shown here. For now, we
will act as if we do not know this column, and we will come back to it later. If
you observe here, there are 135 entries for cars which they have not thoroughly checked; they
have just measured these 5 quantities, and they want to figure out whether service is needed or
not using the knn algorithm. That is the whole idea.
Now that you have viewed what is in the service test and service train data sets, the next
question that comes to mind is whether there is any way to know the data types of the
attributes in them. Let us understand the
data in a little more detail.
(Refer Slide Time: 13:12)
What we have seen till now is that service train contains 315 observations of 6 variables and
service test contains 135 observations of 6 variables. The variables present in the data sets
are oil quality, engine performance, normal mileage, tyre wear, HVAC wear and service. As I
mentioned earlier, the first 5 are attributes that tell us about the condition of the car, and
the last attribute simply says whether service is needed or not.
So the first five columns are details about the car, and the last column is the label which says
whether a service is needed or not. Now, let us ask this question: what are the data types of
each of these attributes, and how does one get them?
Since we have now understood the data, let us look at the structure of the data.
(Refer Slide Time: 14:11)
By the structure of data, we mean which variables are in the data set
and what their data types are. The way you get the structure of data in
R is using the structure function, str. The str function
compactly displays the internal structure of an R object. Its syntax is as
follows: str takes one input argument, which is an object. This
object is any R object about which you want to have some information.
Now, let us see the structure of the two data frames that we have read from the two .csv files.
(Refer Slide Time: 14:58)
You can see the structure of the service train data frame here. If you execute the command
str of service train, it gives the following information: service train is a
data frame which contains 315 observations of 6 variables. The variables are oil quality,
engine performance and so on. It says that the data type of the first five attributes is
numeric, and the last attribute, service, is a factor with two levels, which means we have yes
or no in this attribute. The 1 and 2 shown here are the codes for each entry: 1 corresponds to
no, 2 corresponds to yes, and so on.
Let us use the str command on the service test data and see what it has.
(Refer Slide Time: 15:54)
This is the output you see when you execute this command. It says that service test is also a
data frame, which contains 135 observations of 6 variables. These are the variables that are
available. The first 5 variables are numeric, and the service variable is a factor
with two levels, yes and no.
Now that we have seen the structure of the data, let us ask this question: is there any way to
get a summary of the data which I have read?
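To see what this kind of output looks like without the lecture's files, here is a small sketch on a made-up data frame. The column names and values are invented for illustration. It also shows how the integer codes 1 and 2 line up with the factor levels.

```r
# Made-up data frame with two numeric attributes and a factor label
toy <- data.frame(
  OilQual = c(103.4, 26.9, 99.8),
  TyreWear = c(102.9, 51.0, 103.3),
  Service = factor(c("No", "Yes", "No"))
)

str(toy)                 # compact display: dimensions, column types, first values
levels(toy$Service)      # the two levels, in alphabetical order
as.integer(toy$Service)  # integer codes stored for each entry (1 = first level)
```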
(Refer Slide Time: 16:27)
The answer is yes. The summary of data is obtained with the summary function.
It invokes particular methods depending upon the class of the
argument passed to it. For example, the summary function gives a five-point summary, along with
the mean, for numeric attributes in the data. The syntax is as follows:
the summary function takes one argument, which is an object. This object is any R object about
which you want to get some information.
Let us use the summary function on the data frames which we have loaded and see what the
results are.
(Refer Slide Time: 17:15)
When we execute the command summary of service train, we get the details of all
the numeric variables, which are five-point summaries along with the mean. For the service
variable, which is a categorical variable, it gives how many no values and how many yes values
there are in that attribute.
Let us keep these numbers in mind: we have 99 no values and 36 yes values in service test. As I
said earlier, we are going to act as if we do not know the true yes and no values, and
we use knn to predict which of them are yes and which of them are no.
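The behaviour of summary on numeric and factor columns can be tried on a small made-up data frame, as in the sketch below. The values are invented and are not the lecture's data.

```r
# Made-up data frame: one numeric attribute and one factor label
toy <- data.frame(
  OilQual = c(103.4, 26.9, 99.8, 30.2, 101.1),
  Service = factor(c("No", "Yes", "No", "Yes", "No"))
)

summary(toy)          # per-column summary: quartiles and mean, or level counts
summary(toy$OilQual)  # Min., 1st Qu., Median, Mean, 3rd Qu., Max.
table(toy$Service)    # how many "No" and how many "Yes" entries
```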
Now, let us do the important task as far as this lecture is concerned, which is the
implementation of K-nearest neighbours in R. As I said earlier, the function which we use to
implement K-nearest neighbours is the knn function. It takes several arguments, but I have
listed the few which are most important as far as this course is concerned: train,
test, cl and k.
Let us see what each of these means. train is a matrix or data frame of the training set
cases; in this case, it is our service train data frame.
test is a matrix or data frame of the test set cases; in this case, it is our service test data
frame. cl is a factor of true
classifications of the training set, and k is the important parameter: the number of
neighbours to be considered by the algorithm, which works on the
majority voting criterion.
Now, let us implement knn on our data. The way you do it is as
follows.
There are certain comments here; let us study them. As we have seen in
the previous lecture, K-nearest neighbours is a lazy algorithm and can do prediction directly
with the testing data set. It takes the training and testing data sets, the class variable of
interest (that is, the outcome categorical variable) and the parameter k which, as I have
mentioned, specifies the number of nearest neighbours to be considered for the classification.
I implement the knn algorithm through the knn command. As the training data set, I
give the service train data set. Remember that I have a negative 6 here; I will talk about it
in a while. As the test data set, I have given the attributes in service test except the
6th column. And as the class variable, I have given the 6th column of service train as my
classification factor.
Let us say I want to build a knn which takes the number of nearest neighbours as 3. So, these
are the input arguments for the knn function. When I execute this whole command, it will
calculate the labels for the test data set and store them in predicted knn. I will show you the
results in the coming slide.
Meanwhile, let us interpret service train, square bracket, comma, minus 6. If you
remember from the data frames lecture, since service train is a data frame, this statement
means: in the service train data frame, take all the rows and exclude column 6.
So this command gives the information in service train except the last column. Similarly, this
command gives the information in service test except the last column, and service train
dollar service gives the last column of the training data as the classification factor for the
algorithm.
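A runnable sketch of the call described above follows. Since the lecture's csv files are not reproduced here, it generates made-up data. The helper make_cars, the column names, and the assumption that cars needing no service score near 100 while the others score near 30 are all invented for illustration. Only the shape of the knn call (train, test, cl, k, and the [, -6] indexing) follows the lecture.

```r
library(class)  # provides knn()

set.seed(1)  # knn() breaks ties at random, so fix the seed for repeatability

# Hypothetical generator: five numeric attributes plus a Service label in column 6
make_cars <- function(n_no, n_yes) {
  n <- n_no + n_yes
  centre <- rep(c(100, 30), times = c(n_no, n_yes))
  data.frame(
    OilQual = rnorm(n, centre, 3),
    EnginePerf = rnorm(n, centre, 3),
    NormMileage = rnorm(n, centre, 3),
    TyreWear = rnorm(n, centre, 3),
    HVACwear = rnorm(n, centre, 3),
    Service = factor(rep(c("No", "Yes"), times = c(n_no, n_yes)))
  )
}
toy_train <- make_cars(20, 10)
toy_test <- make_cars(6, 4)

# [, -6] keeps all rows and drops column 6 (the label)
predicted <- knn(train = toy_train[, -6],
                 test = toy_test[, -6],
                 cl = toy_train$Service,
                 k = 3)
predicted  # one predicted label per test row
```

Because the two made-up groups are far apart, every test car here is classified correctly. Real data is usually less clean.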
Once you give all these parameters, execute the command. knn will classify the test data points
and store the labels in predicted knn. Let us look at the results and at what predicted knn
contains.
As we have seen in the earlier slide, predicted knn is the output from the algorithm, a
categorical variable with values yes or no indicating whether service is needed or not for each
case in the test data. When you print predicted knn, this is the output you see. It says that
among these 135 values, for the first car no service is needed, for the second car no service
is needed, for the 23rd car service is needed, and so on.
That is what the knn algorithm does, and you have actually finished your job of classifying
the test cars by whether service is needed or not. When you do not have the luxury of
knowing the true values, this is where you stop. But in our case, we already have
the true values of whether service is needed or not for this data set. When you
have this luxury of knowing the true classes, you can generate what is called a confusion matrix
and see how well your classifier is performing.
There are two ways of generating the confusion matrix. One is to generate it
manually; the other is to use the caret package, which can generate the confusion matrix
along with a lot of the other measures that Prof. Raghu has talked about in his performance
metrics lecture. Let us see how to generate the confusion matrix manually. predicted knn
holds the labels predicted using the knn algorithm. And
this command here gives the last column of the service test data frame, which holds the true
labels of whether service is needed or not.
When I call table on these two, it generates a contingency table, and I store the result in
confusion matrix. When I print confusion matrix, the result is as follows. These are the
predicted no and yes, and these are the true no and yes. Recall that in the test data,
service is not needed for 99 cars and service is needed for 36 cars. knn has predicted
all of them correctly; that is what this confusion matrix shows.
That is the way you generate the confusion matrix manually. Once you have
the confusion matrix, you can calculate the accuracy.
How do you calculate the accuracy? The formula is given in Prof. Raghu's
performance metrics lecture. Essentially, I take the sum of the diagonal elements, which are the
correctly predicted values, and divide it by the total number of entries in service test.
Here 99 plus 36 is 135, and nrow of service test is also 135. The
command diag of confusion matrix takes the elements 99 and 36, and the sum command
adds them up. When you divide that by the number of rows in service test, that is
135 by 135, you get the value of knn accuracy as 1.
Since knn has managed to predict all the no cases correctly as no and all the yes cases
correctly as yes, your accuracy is 1. This is how you generate the confusion matrix manually.
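The manual route can be sketched as follows. The eight labels are made up and deliberately contain one mistake, so that an off-diagonal count appears. They are not the lecture's 135 cars, where the accuracy was 1.

```r
# Made-up true labels and predictions for 8 cars (the third car is misclassified)
truth <- factor(c("No", "No", "No", "Yes", "Yes", "No", "Yes", "No"))
predicted <- factor(c("No", "No", "Yes", "Yes", "Yes", "No", "Yes", "No"),
                    levels = levels(truth))

# Contingency table: rows are predicted labels, columns are true labels
conf_matrix <- table(predicted, truth)
conf_matrix

# Accuracy = correctly predicted cases (the diagonal) / total number of cases
accuracy <- sum(diag(conf_matrix)) / length(truth)
accuracy
```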
Now, let us see how to generate the confusion matrix using the caret package.
The command in the caret package that generates the confusion matrix is confusionMatrix.
The input arguments that you need to give are the predicted labels and the true labels.
When you pass these two arguments, this is the confusion matrix that is generated, and along
with it a whole lot of other measures. We have already calculated
the accuracy manually and seen that it is 1. You can check that the confusionMatrix
function also gives the accuracy as 1.
The reason why you have sensitivity equal to 1 and specificity equal to 1 in this case is
that all the positive classes are correctly classified and all the negative classes are also
correctly classified. That is why you have the ideal values of 1 and 1 for sensitivity and
specificity.
The balanced accuracy is sensitivity plus specificity divided by 2, which is 2 by 2, so it is
1. So, this is how one can implement the knn algorithm in R.
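To make these measures concrete, the sketch below computes sensitivity, specificity and balanced accuracy by hand on ten made-up labels with one mistake, and prints caret's report only if the package happens to be installed. It treats "No" as the positive class, because confusionMatrix takes the first factor level as positive for two-class problems unless told otherwise. The data are invented and are not the lecture's results.

```r
# Made-up labels for 10 cars; the last car needs service but is predicted "No"
truth <- factor(c("No", "No", "No", "No", "No", "No", "Yes", "Yes", "Yes", "Yes"))
predicted <- factor(c("No", "No", "No", "No", "No", "No", "Yes", "Yes", "Yes", "No"),
                    levels = levels(truth))

cm <- table(predicted, truth)  # rows: predicted, columns: truth

# With "No" as the positive class:
sensitivity <- as.numeric(cm["No", "No"] / sum(cm[, "No"]))     # true "No" found
specificity <- as.numeric(cm["Yes", "Yes"] / sum(cm[, "Yes"]))  # true "Yes" found
balanced_accuracy <- (sensitivity + specificity) / 2

c(sensitivity = sensitivity,
  specificity = specificity,
  balanced_accuracy = balanced_accuracy)

# Same information from caret, if it is available
if (requireNamespace("caret", quietly = TRUE)) {
  print(caret::confusionMatrix(predicted, truth))
}
```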
(Refer Slide Time: 27:46)
In summary, what we have seen in this lecture is how to read .csv files, how to use the
str and summary functions to know the data types and the summary of
R objects, and how to implement the K-nearest neighbours algorithm, which is a supervised
learning algorithm that needs labelled data, in R using the knn function. We have also seen how
to judge its results using the confusion matrix.
With this we end this tutorial session on how to implement the knn algorithm in R. In the next
lecture, Prof. Raghu will talk about the k-means clustering algorithm, after which I will come
back with a case study on how to implement k-means clustering.
Thank you.