Unit 3 (Classification)

The document discusses the steps involved in classification learning including problem identification, data identification and preprocessing, defining training and test data sets, algorithm selection, training, and evaluation. It then provides details on the k-nearest neighbors algorithm, including how it works, determining the value of k, and its strengths and weaknesses.


CLASSIFICATION LEARNING STEPS

Problem Identification: Identifying the problem is the first step in the supervised learning model. The problem needs to be well formed, i.e. a problem with well-defined goals and benefits, which has a long-term impact.

Identification of Required Data: On the basis of the problem identified above, the data set that precisely represents that problem needs to be identified. For example, if the problem is to predict whether a tumour is malignant or benign, then the corresponding patient data sets relating to malignant and benign tumours are to be identified.
Data Pre-processing: This step involves cleaning and transforming the identified data set before feeding it into the algorithm, and ensures that all unnecessary or irrelevant data elements are removed. Because the data is gathered from different sources, it is usually collected in a raw format that is not suitable for immediate analysis; pre-processing makes it ready to be fed into the machine learning algorithm.
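One common pre-processing transformation, sketched below under the assumption of purely numeric features, is min-max scaling, which rescales every feature to the [0, 1] range. This matters especially for distance-based classifiers such as kNN, where a feature with a large raw range would otherwise dominate the distance.

```python
def min_max_scale(column):
    """Rescale a list of numbers to the [0, 1] range (illustrative sketch)."""
    lo, hi = min(column), max(column)
    if hi == lo:                       # constant feature: map everything to 0
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]

print(min_max_scale([10, 20, 15, 30]))  # [0.0, 0.5, 0.25, 1.0]
```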
Definition of Training Data Set: Before starting the analysis, the user should decide what kind of data set is to be used as the training set. In the case of signature analysis, for example, the training data set might be a single handwritten alphabet, an entire handwritten word (i.e. a group of alphabets) or an entire line of handwriting (i.e. sentences or a group of words). Thus, a set of 'input meta-objects' and corresponding 'output meta-objects' is gathered. The training set needs to be representative of the real-world use of the given scenario. Accordingly, a set of data inputs (X) and corresponding outputs (Y) is gathered either from human experts or from experiments.

Algorithm Selection: This involves determining the structure of the learning function and the corresponding
learning algorithm. This is the most critical step of supervised learning model. On the basis of various parameters,
the best algorithm for a given problem is chosen.

Training: The learning algorithm identified in the previous step is run on the gathered training set for further fine
tuning. Some supervised learning algorithms require the user to determine specific control parameters (which are
given as inputs to the algorithm). These parameters (inputs given to algorithm) may also be adjusted by
optimizing performance on a subset (called as validation set) of the training set.

Evaluation with the Test Data Set: The test data set is run through the trained algorithm, and its performance is measured here. If a suitable result is not obtained, further training or adjustment of the parameters may be required.
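The hold-out evaluation described above can be sketched as follows: split the labelled data into training and test portions, fit on one, and score on the other. The function names here are illustrative, not a fixed API.

```python
import random

def train_test_split(data, test_fraction=0.25, seed=42):
    """Shuffle the records and split them into training and test portions."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

def accuracy(predicted, actual):
    """Fraction of test records whose predicted label matches the true label."""
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)

print(accuracy(['a', 'b', 'b'], ['a', 'b', 'a']))  # 2 correct out of 3
```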
COMMON CLASSIFICATION ALGORITHMS

1. k-Nearest Neighbour (kNN)


2. Decision tree
3. Random forest
4. Support Vector Machine (SVM)
5. Naïve Bayes classifier
k-Nearest Neighbour (kNN)
The kNN algorithm is a simple but extremely powerful classification algorithm.
The name of the algorithm originates from the underlying philosophy of kNN – i.e. people
having similar background or mindset tend to stay close to each other.
In other words, neighbours in a locality have a similar background. In the same way, as a part of
the kNN algorithm, the unknown and unlabelled data which comes for a prediction problem is
judged on the basis of the training data set elements which are similar to the unknown element.
So, the class label of the unknown element is assigned on the basis of the class labels of the
similar training data set elements (metaphorically can be considered as neighbours of the
unknown element).
 K Nearest Neighbor (KNN) is a very simple, easy-to-understand, and versatile machine learning
algorithm. It's used in many different areas, such as handwriting detection, image recognition,
and video recognition.
 KNN can achieve high accuracy in a wide variety of prediction-type problems, and because it
has no explicit training phase, it is easy to apply whenever a labelled training set is available.
 KNN is an instance-based algorithm: rather than learning a global model, it approximates the
target function locally, using the training records closest to each query point, to the desired
precision and accuracy.
 For an unknown input, the algorithm finds its neighbourhood, i.e. the training elements at the
smallest distance from it, and uses their class labels to predict the unknown value.
In the kNN algorithm, the class label of the test data elements is decided by the class label of
the training data elements which are neighbouring, i.e. similar in nature. But there are two
challenges:

 What is the basis of this similarity or when can we say that two data elements are similar?
 How many similar elements should be considered for deciding the class label of each test
data element?

For the first one, the most common approach adopted by kNN to measure similarity between two data elements is Euclidean distance.
Considering a very simple data set having two features (say f1 and f2), the Euclidean distance between two data elements d1 and d2 can be measured by

    distance(d1, d2) = sqrt((f11 − f12)² + (f21 − f22)²)

where f11 = value of feature f1 for data element d1
f12 = value of feature f1 for data element d2
f21 = value of feature f2 for data element d1
f22 = value of feature f2 for data element d2
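The same distance, written as a small function. The sketch generalizes the two-feature formula above to any number of features:

```python
import math

def euclidean_distance(d1, d2):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(d1, d2)))

print(euclidean_distance([1, 2], [4, 6]))  # 5.0 (a 3-4-5 triangle)
```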
The answer to the second question, i.e. how many similar elements should be considered, lies in the value of 'k', which is a user-defined parameter given as an input to the algorithm.
 In the kNN algorithm, the value of ‘k’ indicates the number of neighbours that need to be
considered.
 For example, if the value of k is 3, only three nearest neighbours or three training data
elements closest to the test data element are considered.
 Out of the three data elements, the class which is predominant is considered as the class
label to be assigned to the test data. In case the value of k is 1, only the closest training
data element is considered. The class label of that data element is directly assigned to the
test data element
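The predominant-class rule described above can be sketched with `collections.Counter`: with k = 3 and neighbour labels ['spam', 'ham', 'spam'], 'spam' wins the vote 2 to 1.

```python
from collections import Counter

def majority_label(neighbour_labels):
    """Return the class label that appears most often among the neighbours."""
    return Counter(neighbour_labels).most_common(1)[0][0]

print(majority_label(['spam', 'ham', 'spam']))  # spam
```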
But it is often a tricky decision to decide the value of k. The reasons are as follows:

 If the value of k is very large (in the extreme case equal to the total number of records in the training data),
the class label of the majority class of the training data set will be assigned to the test data regardless of the
class labels of the neighbours nearest to the test data.
 If the value of k is very small (in the extreme case equal to 1), the class value of a noisy data or outlier in the
training data set which is the nearest neighbour to the test data will be assigned to the test data.
The best k value is somewhere between these two extremes.

A few strategies, highlighted below, are adopted by machine learning practitioners to arrive at a value for k.

 One common practice is to set k equal to the square root of the number of training records.
 An alternative approach is to test several k values on a variety of test data sets and choose the one that
delivers the best performance.
 Another interesting approach is to choose a larger value of k, but apply a weighted voting process in which the
vote of close neighbours is considered more influential than the vote of distant neighbours.
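Two of these strategies can be sketched in code. The square-root rule gives a heuristic starting point (forced odd here, an assumption made to avoid voting ties in two-class problems); distance-weighted voting lets a larger k be used without distant neighbours dominating.

```python
import math
from collections import defaultdict

def sqrt_rule_k(n_training_records):
    """Heuristic: k ≈ sqrt(number of training records), forced odd."""
    k = max(1, round(math.sqrt(n_training_records)))
    return k if k % 2 == 1 else k + 1

def weighted_vote(neighbours):
    """neighbours: list of (distance, label); weight each vote by 1/distance."""
    votes = defaultdict(float)
    for dist, label in neighbours:
        votes[label] += 1.0 / (dist + 1e-9)   # epsilon guards against dist == 0
    return max(votes, key=votes.get)

print(sqrt_rule_k(100))                                     # 11
print(weighted_vote([(0.5, 'a'), (2.0, 'b'), (3.0, 'b')]))  # a (closest neighbour outweighs two distant ones)
```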
kNN algorithm

Input: Training data set, test data set (or data points), value of ‘k’ (i.e. number of nearest neighbours to be
considered)

Steps:

Do for all test data points
    Calculate the distance (usually Euclidean distance) of the test data point from the different training data points.
    Find the closest 'k' training data points, i.e. the training data points whose distances from the test data point are least.
    If k = 1
        Then assign the class label of that training data point to the test data point
    Else
        Assign the class label predominantly present among the 'k' training data points to the test data point
End do
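The steps above can be written as a runnable sketch. Training data is assumed to be a list of (feature_vector, label) pairs; this is an illustrative implementation, not a library API.

```python
import math
from collections import Counter

def knn_classify(training_data, test_point, k):
    # 1. Distance of the test point from every training point.
    distances = [
        (math.dist(features, test_point), label)
        for features, label in training_data
    ]
    # 2. Keep the k closest training points.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # 3. k = 1: copy the label of the single closest point;
    #    otherwise take the predominant class among the k neighbours.
    labels = [label for _, label in nearest]
    return labels[0] if k == 1 else Counter(labels).most_common(1)[0][0]

training = [([1, 1], 'A'), ([1, 2], 'A'), ([6, 6], 'B'), ([7, 7], 'B')]
print(knn_classify(training, [2, 2], 3))  # A
```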
Why is the kNN algorithm called a lazy learner?

Eager learners follow the general steps of machine learning, i.e. they perform an abstraction of
the information obtained from the input data and then follow it up with a generalization
step. However, as we have seen in the case of the kNN algorithm, these steps are completely
skipped. It stores the training data and directly applies the philosophy of nearest-neighbour
finding to arrive at the classification. So, for kNN, there is no learning
happening in the real sense. Therefore, kNN falls under the category of lazy learners.

Strengths of the kNN algorithm


 Extremely simple algorithm – easy to understand
 Very effective in certain situations, e.g. for recommender system design
 Very fast or almost no time required for the training phase
Weaknesses of the kNN algorithm
 Does not learn anything in the real sense. Classification is done completely on the
basis of the training data. So, it has a heavy reliance on the training data. If the
training data does not represent the problem domain comprehensively, the algorithm
fails to make an effective classification.
 Because no model is trained in the real sense and the classification is done
completely on the basis of the training data, the classification process is very slow.
 Also, a large amount of computational space is required to load the training data for
classification.

Application of the kNN algorithm


The kNN algorithm is widely adopted in
 recommender systems
 searching for documents/content similar to a given document/content.
