ML Unit 3
Classification
Topics to Be Covered
• What is Classification?
• General approach to Classification
• K-Nearest Neighbor Algorithm
• Logistic regression
• Decision Trees
• Naive Bayesian
• Support Vector Machine (SVM)
What is Classification?
• Classification is a supervised machine learning method where the model tries to predict the correct label for given input data.
• The target feature is of a categorical type.
General Approach to Classification
1. Problem Identification
2. Identification of Required Data
3. Data Pre-processing
4. Definition of Training Data Set
5. Algorithm Selection
6. Training
7. Evaluation with the Test Data Set
• Problem Identification: Identifying the problem is the first step in the supervised learning model. The problem needs to be well formed, i.e. a problem with well-defined goals and benefits, which has a long-term impact.
• Identification of Required Data: On the basis of the problem identified above, the data set that precisely represents the identified problem needs to be identified. For example, if the problem is to predict whether a tumour is malignant or benign, then the corresponding patient data sets related to malignant and benign tumours are to be identified.
• Data Pre-processing: This step involves cleaning and transforming the identified data before feeding it into the algorithm, and it ensures that all unnecessary or irrelevant data elements are removed. Because the data is gathered from different sources, it is usually collected in a raw format that is not ready for immediate analysis; pre-processing makes the data ready to be fed into the machine learning algorithm.
• Definition of Training Data Set: Before starting the analysis, the user should decide what kind of data set is to be used as the training set. In the case of signature analysis, for example, the training data set might be a single handwritten alphabet, an entire handwritten word (i.e. a group of alphabets), or an entire line of handwriting (i.e. sentences or a group of words). Thus, a set of 'input meta-objects' and corresponding 'output meta-objects' is gathered. The training set needs to be representative of the real-world use of the given scenario; thus, a set of data inputs (X) and corresponding outputs (Y) is gathered either from human experts or from experiments.
• Algorithm Selection: This involves determining the structure of the learning function and the corresponding learning algorithm, and it is the most critical step of the supervised learning model. On the basis of various parameters, the best algorithm for a given problem is chosen.
• Training: The learning algorithm identified in the previous step is run on the gathered training set for further fine-tuning. Some supervised learning algorithms require the user to determine specific control parameters, which are given as inputs to the algorithm. These parameters may also be adjusted by optimizing performance on a subset of the training set (called the validation set).
• Evaluation with the Test Data Set: The test data set is run through the trained model, and its performance is measured here. If a suitable result is not obtained, further training or tuning of parameters may be required. A minimal sketch of this end-to-end workflow follows.
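As an illustration of steps 2-7, here is a minimal sketch in Python. The use of scikit-learn, the breast-cancer data set (standing in for the tumour example above), and kNN as the chosen algorithm are all assumptions made for the sake of the example, not part of the original material.

    # Minimal end-to-end supervised classification workflow (illustrative sketch)
    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.metrics import accuracy_score

    X, y = load_breast_cancer(return_X_y=True)            # identification of required data

    X_train, X_test, y_train, y_test = train_test_split(  # definition of training data set
        X, y, test_size=0.2, random_state=42)

    model = KNeighborsClassifier(n_neighbors=5)           # algorithm selection
    model.fit(X_train, y_train)                           # training

    y_pred = model.predict(X_test)                        # evaluation with the test data set
    print("Accuracy:", accuracy_score(y_test, y_pred))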
Algorithms for Classification
• k-Nearest Neighbour (kNN)
• Logistic Regression
• Decision Tree
• Support Vector Machine (SVM)
• Naive Bayes
• Random Forest
How Does kNN Work?
It is often a tricky decision to choose the value of k. The reasons are as follows:
❖ If the value of k is very large (in the extreme case equal to the total number of
records in the training data), the class label of the majority class of the training
data set will be assigned to the test data regardless of the class labels of the
neighbours nearest to the test data.
❖ If the value of k is very small (in the extreme case equal to 1), the class value of a noisy data point or outlier in the training data set that happens to be the nearest neighbour to the test data will be assigned to the test data.
The best k value is somewhere between these two extremes. A few strategies, highlighted below, are adopted by machine learning practitioners to arrive at a value for k.
• One common practice is to set k equal to the square root of the number of
training records.
• An alternative approach is to test several k values on a variety of test data sets
and choose the one that delivers the best performance.
• Another interesting approach is to choose a larger value of k, but apply a weighted voting process in which the votes of close neighbours are considered more influential than the votes of distant neighbours. These strategies are sketched in the code below.
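A short sketch of the three strategies in Python. scikit-learn and the iris data set are assumptions used only to make the example runnable.

    # Three common ways to arrive at k for kNN (illustrative sketch)
    import math
    from sklearn.datasets import load_iris
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_iris(return_X_y=True)

    # Strategy 1: k = square root of the number of training records
    k_sqrt = round(math.sqrt(len(X)))                 # 150 records -> k = 12

    # Strategy 2: test several k values and keep the best performer
    scores = {k: cross_val_score(KNeighborsClassifier(n_neighbors=k),
                                 X, y, cv=5).mean()
              for k in range(1, 21)}
    best_k = max(scores, key=scores.get)

    # Strategy 3: a larger k, but with distance-weighted voting so that
    # close neighbours count more than distant ones
    weighted_knn = KNeighborsClassifier(n_neighbors=15, weights="distance")

    print("sqrt rule:", k_sqrt, "| best by validation:", best_k)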
Why kNN is a Lazy Learner
• kNN is called a lazy learner because:
• It does not abstract or generalize any model from the training data.
• Instead, it stores the data and uses it at prediction time.
• There is no explicit training phase or model built.
How the kNN Algorithm Works
1. Input required:
→ A training dataset with input features and labeled output.
→ A test data point for which the class is to be predicted.
→ A value of 'k', which defines the number of neighbors to consider.
2. Process (written out in the sketch below):
→ Calculate the distance (usually Euclidean) between the test point and all points in the training dataset.
→ Identify the k closest (smallest-distance) training points.
→ Perform majority voting among those k neighbors to assign the class label to the test point.
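A from-scratch sketch of this process in plain Python. The toy points and labels are made up for illustration.

    # kNN classification written out directly (toy data, k = 3)
    import math
    from collections import Counter

    def knn_predict(train_X, train_y, test_point, k=3):
        # Step 1: Euclidean distance from the test point to every training point
        distances = [(math.dist(x, test_point), label)
                     for x, label in zip(train_X, train_y)]
        # Step 2: the k closest training points
        k_nearest = sorted(distances)[:k]
        # Step 3: majority vote among the k neighbours
        votes = Counter(label for _, label in k_nearest)
        return votes.most_common(1)[0][0]

    train_X = [(1, 1), (2, 1), (8, 9), (9, 8)]
    train_y = ["A", "A", "B", "B"]
    print(knn_predict(train_X, train_y, (2, 2)))   # -> 'A'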
Strengths of kNN
• Simple to implement and understand.
• No training time required.
• Works well in recommender systems and some classification tasks.
• Adapts naturally to multi-class problems.
Weaknesses of kNN
• No real learning happens – relies completely on training data.
• Slow prediction time, especially with large datasets.
• High memory usage as the entire training data must be stored.
• Performance degrades if irrelevant features or unscaled features are
used.
Applications of kNN
1. Recommender Systems: suggest items (movies, products) based on what similar users liked.
2. Information Retrieval: find documents or articles similar to a given query.
3. Pattern Recognition: handwriting, face, or voice recognition based on the closest match.
4. Medical Diagnosis: classify patient data based on past patient records.
Metrics to Evaluate ML Classification Algorithms
• True positives (TP): the number of positive observations the model correctly predicted as positive.
• True negatives (TN): the number of negative observations the model correctly predicted as negative.
• False positives (FP): the number of negative observations the model incorrectly predicted as positive.
• False negatives (FN): the number of positive observations the model incorrectly predicted as negative.
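A small sketch computing these counts for a binary classifier. The labels are made-up toy values, and scikit-learn is assumed.

    # Confusion-matrix counts for a binary classifier (toy labels)
    from sklearn.metrics import confusion_matrix, accuracy_score

    y_true = [1, 0, 1, 1, 0, 1, 0, 0]
    y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    print("TP:", tp, "TN:", tn, "FP:", fp, "FN:", fn)
    print("Accuracy:", accuracy_score(y_true, y_pred))   # (TP + TN) / total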
Decision Trees
• The goal of decision tree learning is to create a model that predicts the value of the output variable based on the input variables in the feature vector.
• Every decision node picks a feature to split the data, and the branches represent possible values (or value ranges) of that feature. This process continues until a leaf node is reached, which gives the final prediction.
• The tree terminates at different leaf nodes (or terminal nodes), where each leaf node represents a possible value for the output variable.
• The output variable is determined by following a path that starts at the root and is guided by the values of the input variables.
• A decision tree is usually represented in the format depicted in Figure 7.8.
• In the process of building a decision tree, the algorithm keeps splitting the dataset into smaller groups (partitions) based on certain feature values. After a split happens:
→ if each partition contains data from only one class (i.e., only 'Yes' or only 'No'),
→ then we say the split has resulted in pure partitions.
• Let us say S is the sample set of training examples. Then Entropy(S), measuring the impurity of S, is defined as

Entropy(S) = − Σ_{i=1..c} p_i · log2(p_i)

where c is the number of classes and p_i is the proportion of examples in S that belong to class i.
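A direct translation of this formula into Python (the 9 'Yes' / 5 'No' split is just a familiar worked example):

    # Entropy of a sample set S, straight from the formula above
    import math
    from collections import Counter

    def entropy(labels):
        n = len(labels)
        return -sum((c / n) * math.log2(c / n)
                    for c in Counter(labels).values())

    print(entropy(["Yes"] * 9 + ["No"] * 5))   # ~0.940 bits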
How a Decision Tree Is Built
1. If all examples belong to the same class, return a leaf node with that class.
2. If the feature set is empty, return a leaf node with the majority class.
3. Otherwise, select the attribute that best splits the data (e.g., the one with the highest information gain).
4. Make that attribute a decision node, break the dataset into smaller subsets (one per branch), and repeat the procedure recursively on each subset (see the sketch below).
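In practice the recursion is usually delegated to a library. Here is a minimal sketch with scikit-learn, using an entropy-based split criterion; the iris data set is an assumption chosen to keep the example self-contained:

    # Training an entropy-based decision tree and printing its rules
    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier, export_text

    X, y = load_iris(return_X_y=True)
    tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)
    print(export_text(tree, feature_names=load_iris().feature_names))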
Strengths of Decision Trees
• Can work well with both small and large training data sets.
Weaknesses of Decision Trees
• Decision trees are prone to errors in classification problems with many classes and a relatively small number of training examples.
Real-World Applications:
• Medical diagnosis
• Fraud detection
Logistic Regression (Binary/Discrete classification)
• Logistic regression is both a classification and a regression technique, depending on the scenario in which it is used.
• Logistic Regression is a Machine Learning classification algorithm that is used to predict the
probability of a categorical dependent variable.
• In logistic regression, the dependent variable is a binary variable that contains data coded as
1 (yes, success, etc.) or 0 (no, failure, etc.).
• In the logistic regression model, a chi-square test is used to measure how well the logistic
regression model fits the data.
• The goal of logistic regression is to predict the likelihood that Y is equal to 1 given certain
values of X.
Logistic regression is a statistical method used when the output variable is categorical (like Yes/No, True/False, or 1/0). It tells you the probability of a certain event happening. Instead of drawing a straight line (as in Linear Regression), Logistic Regression draws an S-shaped curve (called a sigmoid or logistic curve), which fits the probability of Y = 1 given X.
• Let us say we have a model that can predict whether a person is
male or female on the basis of their height.
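A minimal sketch of such a model. scikit-learn is assumed, and the heights and labels are made-up illustration data, not real measurements:

    # Sigmoid function and a tiny logistic-regression fit (toy data)
    import math
    from sklearn.linear_model import LogisticRegression

    def sigmoid(z):
        return 1.0 / (1.0 + math.exp(-z))    # maps any real z into (0, 1)

    heights = [[150], [155], [160], [175], [180], [185]]   # cm
    labels  = [0, 0, 0, 1, 1, 1]                           # 0 = female, 1 = male

    model = LogisticRegression().fit(heights, labels)
    print(sigmoid(0))                        # 0.5, the decision midpoint
    print(model.predict_proba([[170]]))      # [P(female), P(male)] at 170 cm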
Naive Bayes Classifier
• Naive Bayes applies Bayes' theorem under the 'naive' assumption that the features are conditionally independent given the class. Here x1, x2, …, xn represent the features; in the worked example below they map to Color, Type, and Origin.
• By substituting for X and expanding using the chain rule, we get

P(y | x1, …, xn) = [P(x1 | y) · P(x2 | y) · … · P(xn | y) · P(y)] / [P(x1) · P(x2) · … · P(xn)]

• For all entries in the dataset, the denominator does not change; it remains static. Therefore, the denominator can be removed and proportionality injected:

P(y | x1, …, xn) ∝ P(y) · P(x1 | y) · P(x2 | y) · … · P(xn | y)
• In our case, the class variable (y) has only two outcomes, yes or no, although there could be cases where the classification is multi-class. Either way, we have to find the class variable (y) with the maximum probability:

y = argmax_y P(y) · P(x1 | y) · … · P(xn | y)

• Using the above function, we can obtain the class, given the predictors/features.
• Step 1: First construct a frequency table. A frequency table is drawn for each attribute against the target outcome.
• Since 0.144 > 0.048, given the features Red, SUV, and Domestic, our example gets classified as 'No': the car is not stolen.
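A sketch of this scoring step in Python. The conditional probabilities are taken from the classic stolen-car frequency table that this example usually accompanies; since the slides do not reproduce the table, treat the numbers as an assumption:

    # Naive Bayes scoring for the stolen-car example
    def nb_score(likelihoods, prior=1.0):
        # Equal class priors (0.5 each) cancel in the comparison,
        # so they are omitted; the products match the slide's values.
        score = prior
        for p in likelihoods:
            score *= p                  # multiply P(feature | class)
        return score                    # proportional to P(class | features)

    score_yes = nb_score([0.6, 0.2, 0.4])  # P(Red|Yes)·P(SUV|Yes)·P(Dom|Yes) = 0.048
    score_no  = nb_score([0.4, 0.6, 0.6])  # P(Red|No)·P(SUV|No)·P(Dom|No)   = 0.144

    print("Stolen:", "Yes" if score_yes > score_no else "No")   # -> No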
Applications of Naive Bayes Algorithms
• Real-time Prediction
• Multi-class Prediction
• Recommendation System
Support Vector Machine (SVM)
• A Support Vector Machine (SVM) is a supervised machine learning algorithm
that can be employed for both classification and regression purposes.
• It uses a non-linear mapping to transform the original training data into a higher dimension.
• Let us assume for the sake of simplicity that the data instances are linearly separable. In this case, when mapped in a two-dimensional space, the instances of the two classes can be separated by a straight line.
• In other words, the goal of the SVM analysis is to find a plane, or rather a hyperplane,
which separates the instances on the basis of their classes.
• New examples (i.e. new instances) are then mapped into that same space and
predicted to belong to a class on the basis of which side of the gap the new instance
will fall on.
• In summary, in the overall training process, the SVM algorithm analyses the input data and identifies a surface in the multi-dimensional feature space, called the hyperplane, that separates the classes.
Hyperplane:
In a p-dimensional space, a hyperplane is defined by the equation

b0 + b1·x1 + b2·x2 + … + bp·xp = 0

A point X = (x1, x2, …, xp) is on the hyperplane if it satisfies this equation; points for which the left-hand side is greater than or less than zero lie on either side of it.
Separating Hyperplane
Even if a separating hyperplane does exist, there are instances in which such a classifier might not be desirable. A classifier based on a separating hyperplane will necessarily perfectly classify all of the training observations, which can make it overly sensitive to individual observations and can leave only a tiny margin between the hyperplane and the nearest training points.
• Support Vectors: Support vectors are the data points (representing classes) that lie nearest to the identified hyperplane; they are the critical components of the data set, because removing them would alter the position of the dividing hyperplane.
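A minimal sketch of a linear SVM on toy, linearly separable data. scikit-learn and the specific points are assumptions for illustration:

    # Linear SVM on linearly separable toy data
    from sklearn.svm import SVC

    X = [[1, 2], [2, 3], [2, 1], [6, 5], [7, 7], [8, 6]]
    y = [0, 0, 0, 1, 1, 1]

    clf = SVC(kernel="linear").fit(X, y)
    print("Support vectors:", clf.support_vectors_)   # points nearest the hyperplane
    print("Prediction for (3, 3):", clf.predict([[3, 3]]))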