Unit 3 (Classification)
Algorithm Selection: This involves determining the structure of the learning function and the corresponding
learning algorithm. This is the most critical step in building a supervised learning model. On the basis of various
parameters, the best algorithm for the given problem is chosen.
Training: The learning algorithm identified in the previous step is run on the gathered training set for further
fine-tuning. Some supervised learning algorithms require the user to set specific control parameters, which are
given as inputs to the algorithm. These parameters may also be adjusted by optimizing performance on a subset
of the training set called the validation set.
Evaluation with the Test Data Set: The test data is run through the trained model, and the model's performance is
measured here. If a suitable result is not obtained, further tuning of the parameters may be required.
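The training/validation split described above can be sketched as follows. This is a minimal illustration with a made-up toy data set; the 80/20 split ratio and the variable names are assumptions, not part of the original text.

```python
import random

# Toy labelled data set: each record is (feature_1, feature_2, class_label).
data = [(i, i * 2, i % 2) for i in range(100)]

# Hold out 20% of the training records as a validation set.
# The validation set is used to tune control parameters
# (such as 'k' in kNN) without touching the final test data.
random.seed(42)
random.shuffle(data)
split = int(len(data) * 0.8)
train_set, validation_set = data[:split], data[split:]

print(len(train_set), len(validation_set))  # 80 20
```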
COMMON CLASSIFICATION ALGORITHMS
The kNN algorithm raises two questions:
1. What is the basis of this similarity, i.e. when can we say that two data elements are similar?
2. How many similar elements should be considered for deciding the class label of each test
data element?
For the first question, the most common approach adopted by kNN to measure similarity between two data
elements is Euclidean distance.
Considering a very simple data set having two features (say f1 and f2), the Euclidean distance between two data
elements d1 and d2 can be measured by

distance(d1, d2) = sqrt((f11 − f12)^2 + (f21 − f22)^2)

where f11 = value of feature f1 for data element d1
f12 = value of feature f1 for data element d2
f21 = value of feature f2 for data element d1
f22 = value of feature f2 for data element d2
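The Euclidean distance calculation can be sketched in a few lines. The function name and the example points are illustrative choices, not from the original text; the formula itself is the one defined above.

```python
from math import sqrt

def euclidean_distance(d1, d2):
    """Euclidean distance between two data elements given as feature tuples.

    For two-feature elements d1 = (f11, f21) and d2 = (f12, f22),
    this computes sqrt((f11 - f12)^2 + (f21 - f22)^2).
    """
    return sqrt(sum((a - b) ** 2 for a, b in zip(d1, d2)))

# Classic 3-4-5 triangle: the distance from (3, 4) to the origin is 5.
print(euclidean_distance((3, 4), (0, 0)))  # 5.0
```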
The answer to the second question, i.e. how many similar elements should be considered, lies in the value of
‘k’, which is a user-defined parameter given as an input to the algorithm.
In the kNN algorithm, the value of ‘k’ indicates the number of neighbours that need to be
considered.
For example, if the value of k is 3, only three nearest neighbours or three training data
elements closest to the test data element are considered.
Out of these three data elements, the predominant class is assigned as the class label of the test data. If the
value of k is 1, only the closest training data element is considered, and its class label is directly assigned to
the test data element.
But it is often a tricky decision to decide the value of k. The reasons are as follows:
If the value of k is very large (in the extreme case equal to the total number of records in the training data),
the class label of the majority class of the training data set will be assigned to the test data regardless of the
class labels of the neighbours nearest to the test data.
If the value of k is very small (in the extreme case equal to 1), the class value of a noisy data or outlier in the
training data set which is the nearest neighbour to the test data will be assigned to the test data.
The best k value is somewhere between these two extremes.
A few strategies, highlighted below, are adopted by machine learning practitioners to arrive at a value for k.
One common practice is to set k equal to the square root of the number of training records.
An alternative approach is to test several k values on a variety of test data sets and choose the one that
delivers the best performance.
Another interesting approach is to choose a larger value of k, but apply a weighted voting process in which the
vote of close neighbours is considered more influential than the vote of distant neighbours.
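The weighted voting strategy in the last bullet can be sketched as below. The inverse-distance weighting scheme (1/distance) and the small epsilon guard are common illustrative choices, not something the original text prescribes.

```python
from collections import defaultdict

def weighted_vote(neighbours):
    """Pick a class label from a list of (distance, class_label) pairs
    for the k nearest neighbours. Each vote is weighted by 1/distance,
    so closer neighbours are more influential than distant ones."""
    scores = defaultdict(float)
    for dist, label in neighbours:
        # A tiny epsilon avoids division by zero for an exact match.
        scores[label] += 1.0 / (dist + 1e-9)
    return max(scores, key=scores.get)

# One very close 'A' neighbour outvotes two distant 'B' neighbours,
# even though 'B' holds the simple majority among k = 3.
print(weighted_vote([(0.1, 'A'), (2.0, 'B'), (2.5, 'B')]))  # A
```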
kNN algorithm
Input: Training data set, test data set (or data points), value of ‘k’ (i.e. number of nearest neighbours to be
considered)
Steps:
1. Calculate the distance (usually Euclidean distance) of the test data point from the different training data points.
2. Find the closest ‘k’ training data points, i.e. the training data points whose distances from the test data point are the least.
3. If k = 1
      Then assign the class label of that training data point to the test data point
   Else
      Assign the class label predominantly present among the ‘k’ training data points to the test data point
   End if
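The steps above can be put together into a small working sketch. The function name, the toy training set, and the 'red'/'blue' labels are illustrative assumptions; the logic follows the algorithm as stated: compute distances, take the k nearest, and assign the predominant class.

```python
from collections import Counter
from math import sqrt

def knn_classify(train, test_point, k):
    """train: list of (features, class_label) pairs; test_point: feature tuple.

    Step 1: compute the Euclidean distance to every training point.
    Step 2: keep the k nearest training points.
    Step 3: return the predominant class label among them
            (with k = 1 this is simply the closest point's label).
    """
    distances = sorted(
        (sqrt(sum((a - b) ** 2 for a, b in zip(feats, test_point))), label)
        for feats, label in train
    )
    k_nearest = [label for _, label in distances[:k]]
    return Counter(k_nearest).most_common(1)[0][0]

# Two 'red' points near the origin, two 'blue' points further away.
train = [((1, 1), 'red'), ((1, 2), 'red'), ((5, 5), 'blue'), ((6, 5), 'blue')]
print(knn_classify(train, (1.5, 1.5), 3))  # red
```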
Why is the kNN algorithm called a lazy learner?
Eager learners follow the general steps of machine learning, i.e. they perform an abstraction of
the information obtained from the input data and then follow it with a generalization
step. However, as we have seen, the kNN algorithm skips these steps completely.
It simply stores the training data and directly applies the philosophy of nearest-neighbour
finding to arrive at the classification. So, for kNN, no learning happens
in the real sense. Therefore, kNN falls under the category of lazy learners.