Workbook of Pattern Recognition
PECCS702B
Workbook
Semester - 7
Prof. Bavrabi Ghosh
What is a feature?
The input variables that we give to our machine learning models are called features. Each
column in our dataset constitutes a feature.
To train an optimal model, we need to make sure that we use only the essential features. If
we have too many features, the model can capture unimportant patterns and learn from
noise. The method of choosing the important features of our data is called Feature
Selection.
Machine learning models follow a simple rule: whatever goes in, comes out. If we put
garbage into our model, we can expect the output to be garbage too. In this case, garbage
refers to noise in our data.
To train a model, we collect enormous quantities of data to help the machine learn better.
Usually, a good portion of the data collected is noise, while some of the columns of our
dataset might not contribute significantly to the performance of our model. Further, having
a lot of irrelevant data can slow down the training process and make the resulting model
slower to use. The model may also learn from this irrelevant data and become inaccurate.
Consider, for example, a dataset of used cars in which each row records the car's model, the
year of manufacture, the miles it has travelled, and the name of its previous owner, along
with a label saying whether the car is old enough to be crushed. The model, the year of
manufacture, and the mileage are pretty important for deciding if the car should be crushed.
However, the name of the previous owner does not decide whether the car should be crushed;
worse, it can confuse the algorithm into finding spurious patterns between names and the
other features. Hence we can drop that column.
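As an illustration, here is a minimal sketch of dropping the irrelevant column before training. The pandas DataFrame, its column names, and its values are all made up for this example:

```python
import pandas as pd

# Hypothetical used-car dataset; the rows and column names are illustrative only.
cars = pd.DataFrame({
    "model": ["Alto", "Swift", "Nano"],
    "year": [2004, 2015, 2010],
    "miles": [180000, 40000, 95000],
    "previous_owner": ["A. Rao", "B. Sen", "C. Das"],
    "crush": [1, 0, 0],          # target label: should the car be crushed?
})

# The previous owner's name carries no information about the label,
# so we drop it (and the label itself) from the input features.
X = cars.drop(columns=["previous_owner", "crush"])
y = cars["crush"]
print(X.columns.tolist())        # ['model', 'year', 'miles']
```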
Elaborate on the concept of feature selection.
Feature Selection is the method of reducing the number of input variables to your model by
using only relevant data and getting rid of the noise in the data.
It is the process of automatically choosing relevant features for your machine learning
model based on the type of problem you are trying to solve. We do this by including or
excluding important features without changing them. It helps in cutting down the noise in
our data and reducing the size of our input data.
Supervised Models: Supervised feature selection refers to methods that use the output label
class for feature selection. They use the target variable to identify the features that can
increase the efficiency of the model.
Unsupervised Models: Unsupervised feature selection refers to methods that do not need the
output label class for feature selection. We use them for unlabelled data.
What is the filter method of feature selection?
Filter Method: In this method, features are dropped based on their relation to the output,
that is, on how strongly they correlate with the output. We use correlation to check whether
the features are positively or negatively correlated to the output labels and drop features
accordingly. E.g.: Information Gain, Chi-Square Test, Fisher's Score, etc.
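A minimal sketch of the filter approach using the chi-square test, assuming scikit-learn and its built-in Iris dataset; the dataset and the choice of keeping k=2 features are illustrative only:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

# Filter method: score each feature against the labels with the
# chi-square test and keep only the top-k scoring features.
X, y = load_iris(return_X_y=True)           # 4 non-negative features
selector = SelectKBest(score_func=chi2, k=2)
X_selected = selector.fit_transform(X, y)

print(selector.scores_)                     # chi-square score per feature
print(X_selected.shape)                     # (150, 2)
```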
What is the wrapper method of feature selection?
Wrapper Method: We split our data into subsets and train a model using them. Based on the
output of the model, we add and remove features and train the model again. It forms the
subsets using a greedy approach and evaluates the accuracy of the possible combinations of
features. E.g.: Forward Selection, Backward Elimination, etc.
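A minimal sketch of greedy forward selection, assuming scikit-learn 0.24 or newer (which provides SequentialFeatureSelector); the estimator, dataset, and number of features to keep are illustrative choices:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

# Wrapper method: repeatedly retrain the model on candidate feature
# subsets and greedily add the feature that improves CV accuracy most.
X, y = load_iris(return_X_y=True)
estimator = KNeighborsClassifier(n_neighbors=5)
sfs = SequentialFeatureSelector(estimator, n_features_to_select=2,
                                direction="forward", cv=5)
sfs.fit(X, y)
print(sfs.get_support())       # boolean mask of the selected features
```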
What is the intrinsic method of feature selection?
This method combines the qualities of both the Filter and Wrapper methods to create the
best subset. It handles the iterative training of the model while keeping the computational
cost to a minimum. E.g.: Lasso and Ridge Regression.
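A minimal sketch of an embedded (intrinsic) method using Lasso, assuming scikit-learn; the diabetes dataset and the penalty strength alpha are illustrative choices:

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Embedded / intrinsic method: the L1 penalty of Lasso drives the
# coefficients of unhelpful features to exactly zero during training,
# so feature selection happens as a by-product of fitting the model.
X, y = load_diabetes(return_X_y=True)
X = StandardScaler().fit_transform(X)

lasso = Lasso(alpha=1.0).fit(X, y)
kept = np.flatnonzero(lasso.coef_)     # indices of features with non-zero weight
print(kept)
```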
The overall process is relatively simple: the choice of feature selection model depends on
the types of the input and output variables.
[Figure: the types of the input variables and the output variable determine which feature selection model to use.]
What is the K-Nearest Neighbour (KNN) algorithm?
The KNN algorithm simply stores the dataset during the training phase; when it gets new
data, it classifies that data into the category that is most similar to the new data.
Example: Suppose we have an image of a creature that looks similar to both a cat and a dog,
and we want to know whether it is a cat or a dog. For this identification we can use the KNN
algorithm, since it works on a similarity measure. Our KNN model will find the features of
the new image that are most similar to the cat and dog images and, based on the most similar
features, will put it in either the cat or the dog category.
Suppose there are two categories, Category A and Category B, and we have a new data point
x1; we want to know which of these categories the point belongs to. To solve this type of
problem, we need the K-NN algorithm. With the help of K-NN, we can easily identify the
category or class of a particular data point.
How does the KNN algorithm work?
The working of K-NN can be explained by the following steps. Suppose we have a new data
point and we need to put it in the required category:
o Firstly, we will choose the number of neighbours; here we choose k = 5.
o Next, we will calculate the Euclidean distance between the new data point and the
existing data points. The Euclidean distance is the distance between two points, which we
have already studied in geometry. For two points (x1, y1) and (x2, y2) it can be calculated
as d = √((x2 − x1)² + (y2 − y1)²).
o By calculating the Euclidean distance we get the nearest neighbours: three nearest
neighbours in category A and two nearest neighbours in category B.
o As we can see, 3 of the 5 nearest neighbours are from category A, hence this new data
point must belong to category A. (A small code sketch of this procedure follows the list.)
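A minimal sketch of these steps in plain NumPy; the toy points and their labels are invented for illustration and arranged so that 3 of the 5 nearest neighbours fall in category A:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=5):
    """Classify one point by majority vote among its k nearest neighbours."""
    # Euclidean distance from the new point to every training point.
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Indices of the k closest training points.
    nearest = np.argsort(distances)[:k]
    # Majority vote among their labels.
    votes = Counter(y_train[nearest])
    return votes.most_common(1)[0][0]

# Toy data: category 'A' clustered near (1, 1), category 'B' near (5, 5).
X_train = np.array([[1, 1], [1, 2], [2, 1], [5, 5], [5, 6], [6, 5]])
y_train = np.array(["A", "A", "A", "B", "B", "B"])
print(knn_predict(X_train, y_train, np.array([2, 2]), k=5))   # -> A
```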
How do we select the value of K in the KNN algorithm?
There is no particular way to determine the best value for K, so we need to try several
values and pick the best among them. The most commonly used value for K is 5.
A very low value of K, such as K = 1 or K = 2, can be noisy and make the model sensitive to
outliers.
Larger values of K reduce the effect of noise, but they can blur the boundary between
categories and make the computation slower.
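A minimal sketch of picking K by cross-validation, assuming scikit-learn and its Iris dataset; the candidate values of K are arbitrary:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Try several candidate values of K and keep the one with the best
# cross-validated accuracy, rather than guessing a single value.
X, y = load_iris(return_X_y=True)
scores = {}
for k in [1, 3, 5, 7, 9, 11]:
    model = KNeighborsClassifier(n_neighbors=k)
    scores[k] = cross_val_score(model, X, y, cv=5).mean()

best_k = max(scores, key=scores.get)
print(scores, best_k)
```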
Advantages of the KNN algorithm:
o It is simple to implement.
o It is robust to noisy training data.
o It can be more effective if the training data is large.
Disadvantages of the KNN algorithm:
o We always need to determine the value of K, which may be complex at times.
o The computation cost is high because the distance to every training sample must be
calculated.
What are some variants of the KNN algorithm?
Locally Adaptive KNN - In the standard KNN algorithm a single global value of the input
parameter k is used. This variant instead uses different values of k for different portions
of the input space. Each time a query is classified, the value of k is determined by
applying cross-validation in the query's local neighbourhood.
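A minimal sketch of the idea, assuming NumPy and scikit-learn; the local neighbourhood size m, the candidate values of k, and the leave-one-out evaluation are illustrative choices rather than the exact procedure of the original proposal:

```python
import numpy as np
from collections import Counter
from sklearn.datasets import load_iris

def locally_adaptive_knn(X_train, y_train, x_new, k_candidates=(1, 3, 5, 7), m=25):
    """Pick k for this query by leave-one-out CV inside its local neighbourhood."""
    d = np.linalg.norm(X_train - x_new, axis=1)
    local = np.argsort(d)[:m]                     # indices of the m nearest points
    best_k, best_acc = k_candidates[0], -1.0
    for k in k_candidates:
        correct = 0
        for i in local:                           # leave-one-out over the local set
            others = local[local != i]
            di = np.linalg.norm(X_train[others] - X_train[i], axis=1)
            votes = Counter(y_train[others[np.argsort(di)[:k]]])
            correct += votes.most_common(1)[0][0] == y_train[i]
        if correct / len(local) > best_acc:
            best_acc, best_k = correct / len(local), k
    # Classify the query with the locally chosen k, using the full training set.
    votes = Counter(y_train[np.argsort(d)[:best_k]])
    return votes.most_common(1)[0][0], best_k

X, y = load_iris(return_X_y=True)
print(locally_adaptive_knn(X, y, X[0]))           # (predicted class, chosen k)
```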
Weight-Adjusted KNN - In the standard KNN algorithm all the attributes have equal
importance: every attribute contributes equally to the classification of new tuples. But not
all the attributes in a data set are equally important. A weight-adjusted KNN algorithm
first learns a weight for each attribute, and each attribute then influences the
classification only in proportion to the weight it has been assigned.
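A minimal sketch of this idea, using mutual information as one plausible way to learn attribute weights (the weighting scheme of the original proposal may differ); assumes NumPy and scikit-learn:

```python
import numpy as np
from collections import Counter
from sklearn.datasets import load_iris
from sklearn.feature_selection import mutual_info_classif

# Weight each attribute by its mutual information with the class, then
# use those weights inside the distance so that informative attributes
# dominate the neighbour search.
X, y = load_iris(return_X_y=True)
weights = mutual_info_classif(X, y, random_state=0)

def weighted_knn_predict(x_new, k=5):
    d = np.sqrt((weights * (X - x_new) ** 2).sum(axis=1))
    nearest = np.argsort(d)[:k]
    return Counter(y[nearest]).most_common(1)[0][0]

print(weighted_knn_predict(X[0]))   # should recover the class of sample 0
```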
Improved KNN for Text Categorization - The value of the input parameter k strongly
influences the performance of the KNN algorithm, so it is crucial to choose it
appropriately. In general the classes are not evenly distributed in the data set, so using a
fixed value of k for all classes would bias the result towards the class with the larger
number of tuples. One can therefore use different values of k for different classes
according to their class distribution: a larger number of neighbours is used when
classifying a new tuple into a class that has a large number of tuples.
Adaptive KNN - Rather than using a fixed value of k, this variant uses a non-fixed number of
nearest neighbours. A large value of the parameter k also increases the computational cost
and time for large data sets. To address this, three heuristics are applied that allow the
algorithm to terminate early: when a fixed condition is fulfilled, the algorithm breaks out,
saving computational time.
KNN with Shared Nearest Neighbours - This is another variant of the KNN algorithm, which
uses shared nearest neighbours to classify documents. To find the neighbours of a novel
tuple, it uses the BM25 similarity measure. A threshold is set so that only that many
nearest neighbours can vote for the classification of an unknown tuple.
KNN with K-Means - One of the shortcomings of the KNN algorithm is its high computational
complexity. To alleviate this drawback, the KNN algorithm has been combined with the K-Means
clustering algorithm. First, clusters are formed for the different categories in the
training data set; the centres of these newly formed clusters then act as the new training
samples. To classify an unknown tuple, its distance to these new training tuples is
computed, and it is assigned to the class of the tuple to which it is closest. The benefit
of this variant of KNN is that there is no need to pass the input parameter k, as we have to
do in standard KNN.
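A minimal sketch of this variant, assuming scikit-learn; the dataset and the number of clusters per class (3) are illustrative choices, not part of the original proposal:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

# Replace each class's training tuples with a few K-Means centres and
# classify a new tuple by the nearest centre, so no k is needed.
X, y = load_iris(return_X_y=True)
centres, centre_labels = [], []
for cls in np.unique(y):
    km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X[y == cls])
    centres.append(km.cluster_centers_)
    centre_labels.extend([cls] * 3)
centres = np.vstack(centres)
centre_labels = np.array(centre_labels)

def classify(x_new):
    d = np.linalg.norm(centres - x_new, axis=1)
    return centre_labels[np.argmin(d)]

print(classify(X[0]))   # expected: class 0
```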
KNN with the Mahalanobis Metric - The performance of the KNN algorithm largely depends on
the distance metric used to find the distance between any two tuples. The Mahalanobis
distance metric can be used instead of the Euclidean one. It transforms the whole input
space using a linear transformation; in this transformed space the Euclidean distance
between any two data points equals their Mahalanobis distance in the original space.
Euclidean distance is the distance between any two points, whereas Mahalanobis distance is
the distance between a point and a distribution; if the point is the mean of the
distribution, the Mahalanobis distance is zero. The main benefit of the Mahalanobis distance
metric over the Euclidean distance metric is that it also accounts for the correlation
between the data attributes.
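A minimal sketch of the Mahalanobis distance itself, assuming NumPy and SciPy; the synthetic data is invented only to show that correlations enter through the inverse covariance matrix and that the distance from the mean of the distribution is zero:

```python
import numpy as np
from scipy.spatial.distance import mahalanobis

# Draw correlated 2-D data so the covariance matrix is not diagonal.
rng = np.random.default_rng(0)
data = rng.multivariate_normal(mean=[0, 0], cov=[[2, 1], [1, 2]], size=500)

mean = data.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(data, rowvar=False))   # inverse covariance

point = np.array([1.0, 2.0])
print(mahalanobis(point, mean, cov_inv))   # distance from the point to the distribution
print(mahalanobis(mean, mean, cov_inv))    # 0: the mean of the distribution itself
```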