ML Unit3b

The document discusses supervised learning techniques, focusing on regression and classification algorithms such as k-Nearest Neighbour, Support Vector Machine, and Decision Trees. It explains regression as a method for predicting continuous variables by establishing relationships between dependent and independent variables, and outlines the steps for regression modeling. Additionally, it covers classification processes, including binary and multi-class classifiers, and highlights the K-NN algorithm's principles and distance metrics used for classification.

Supervised Learning

 Regression,

 Classifying with k-Nearest Neighbour classifier,

 Support vector machine classifier,

 Decision Tree classifier,

 Naive Bayes classifier,

 Bagging,

 Boosting,

 Improving classification with the AdaBoost algorithm.


Regression

• Regression is a process of finding the correlations between dependent and


independent variables. It helps in predicting the continuous variables such
as prediction of Market Trends, prediction of House prices, etc.
• The task of the Regression algorithm is to find the mapping function to map
the input variable (x) to the continuous output variable (y).
• Example: Suppose we want to do weather forecasting, so for this, we will
use the Regression algorithm. In weather prediction, the model is trained on
the past data, and once the training is completed, it can easily predict the
weather for future days.
 Regression in machine learning refers to a supervised learning technique
where the goal is to predict a continuous numerical value based on one or
more independent features.

 It finds relationships between variables so that predictions can be made.


We have two types of variables present in regression:

1. Dependent Variable (Target): The variable we are trying to predict, e.g.
house price.
2. Independent Variables (Features): The input variables that influence the
prediction, e.g. locality, number of rooms.

 A regression analysis problem arises when the output variable is a real or
continuous value, such as “salary” or “weight”. Many different regression
models can be used, but the simplest among them is linear regression.
 Regression can be classified into different types based on the number of
predictor variables and the nature of the relationship between variables.

 Many different regression models can be used but the simplest model in
them is linear regression

 In Machine Learning, Linear Regression is a supervised machine learning


algorithm.

 It tries to find out the best linear relationship that describes the data you
have.

 It assumes that there exists a linear relationship between a dependent variable


and independent variable(s).

 The value of the dependent variable of a linear regression model is a


continuous value i.e. real numbers.

 This means that the change in the dependent variable is proportional to the
change in the independent variables. For example predicting the price of a
house based on its size.
Representing Linear Regression Model
 Linear regression model represents the linear relationship between a
dependent variable and independent variable(s) via a sloped straight line.

The sloped straight line representing the linear relationship that fits the
given data best is called as a regression line. It is also called as best fit line.
Based on the given data points, we attempt to plot a line that fits the points
the best.
Based on the number of independent variables, there are two types of linear
regression-

1.Simple Linear Regression


2.Multiple Linear Regression
 The goal of the algorithm is to find the best Fit Line equation that
can predict the values based on the independent variables.

 In regression set of records are present with X and Y values and these
values are used to learn a function so if you want to predict Y from an
unknown X this learned function can be used.

 In regression we have to find the value of Y, So, a function is required


that predicts continuous Y in the case of regression given X as
independent features.
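As a minimal sketch of this idea (not from the slides), assuming hypothetical house-size data and scikit-learn's LinearRegression, learning such a function and using it for prediction might look like this:

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: house size in square feet (X) and price (y)
X = np.array([[650], [800], [1100], [1450], [1800]])
y = np.array([30, 38, 52, 66, 80])

model = LinearRegression()
model.fit(X, y)                      # learn the best-fit line y = m*x + c

print("slope (m):", model.coef_[0])
print("intercept (c):", model.intercept_)
print("predicted price for 1000 sq ft:", model.predict([[1000]])[0])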
Regression Modeling Steps
Regression analysis is a statistical method that estimates the relationship
between variables. It can be used to model future relationships between
variables.

 Define problem or question


 Specify model
 Collect data
 Do descriptive data analysis
 Estimate unknown parameters
 Evaluate model
 Use model for prediction
The steps of regression modeling include:

Define the problem: Identify the variables and the problem you are trying to
solve
Collect data: Gather data on the variables you're interested in
Check for outliers: Remove any outliers that could skew your results
Check for linearity: Plot the data to see if there is a linear relationship between
the variables
Choose a model: Select a regression model that's appropriate for your data and
goals
Run the regression: Use the data to calculate the regression equation
Evaluate the results: Examine the results and determine the significance of the
independent variables
Interpret the results: Use the results to answer your research question or make
predictions
Relate to your hypothesis: Compare the results to your original hypothesis and
decide whether to accept, reject, or revise it
Simple vs. Multiple

Simple regression:
• β represents the unit change in Y per unit change in X.
• Does not take into account any other variable besides the single
independent variable.

Multiple regression:
• βi represents the unit change in Y per unit change in Xi.
• Takes into account the effect of the other Xi's.
• βi is called the “net regression coefficient.”
Least squares method
 The most common method to find the best fit line is called "least squares
regression," which mathematically calculates the line that minimizes the
squared distances between the line and data points.
 The least-squares regression method is a technique commonly used in
regression analysis. It is a mathematical method used to find the best fit line
that represents the relationship between an independent (predictor) variable
and a dependent (target) variable.
 It is based on the idea that the square of the errors obtained must be
minimized to the most possible extent and hence the name least squares
method.
 The least-squares method is one of the most effective ways used to draw the
line of best fit.
 Line of Best Fit : Line of best fit is drawn to represent the relationship
between 2 or more variables. To be more specific, the best fit line is drawn
across a scatter plot of data points in order to represent a relationship between
those data points.
Least Squares method is a statistical technique used to find the equation of the
best-fitting curve or line for a set of data points by minimizing the sum of the
squared differences between the observed values and the values predicted by
the model.

 This method aims at minimizing the sum of squares of deviations as much as
possible. The line obtained from such a method is called a regression line or
line of best fit.
The Method of Least Squares
Formula for Least Square Method
The least square method formula is used to find the best-fitting line through a set
of data points. For a simple linear regression, which is a line of the form y = mx + c,
where y is the dependent variable, x is the independent variable, m is the slope of
the line, and c is the y-intercept, the slope (m) and intercept (c) of the line are
obtained from:

m = (n Σxy − Σx Σy) / (n Σx² − (Σx)²)
c = (Σy − m Σx) / n

where n is the number of data points.
The steps to find the line of best fit by using the least square method are discussed below.
Consider an example. Tom who is the owner of a retail shop, found the price of
different T-shirts vs the number of T-shirts sold at his shop over a period of one
week.
https://youtu.be/h8cTBrYHWqA
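Since the original price/sales table for Tom's T-shirt example is not reproduced here, the following sketch uses hypothetical numbers and applies the least squares formulas above to compute the slope m and intercept c:

import numpy as np

# Hypothetical version of the T-shirt example: price (x) vs. number sold (y)
x = np.array([2, 3, 5, 7, 9], dtype=float)
y = np.array([4, 5, 7, 10, 15], dtype=float)

n = len(x)
# Least squares estimates for the line y = m*x + c
m = (n * np.sum(x * y) - np.sum(x) * np.sum(y)) / (n * np.sum(x**2) - np.sum(x)**2)
c = (np.sum(y) - m * np.sum(x)) / n

print("slope m =", m)
print("intercept c =", c)
# Cross-check against numpy's built-in least-squares polynomial fit
print("np.polyfit:", np.polyfit(x, y, 1))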
Classification
 Classification is a process of finding
a function which helps in dividing the
dataset into classes based on different
parameters.
 In Classification, a computer program
is trained on the training dataset and
based on that training, it categorizes
the data into different classes.
 The task of the classification
algorithm is to find the mapping
function to map the input (x) to the
discrete output (y).
 Binary Classifier: This type of classifier is used when there are only two
possible outputs to a classification task.

YES or NO, MALE or FEMALE, SPAM or NOT SPAM, CAT or DOG,


and so on are some examples.
 Multi-class Classifier: A Multi-class Classifier is used when a classification
task involves more than two outcomes.
Classifications of different sorts of crops, for example, or classifications of
different types of music.
 Example: The best example to understand the Classification problem is Email
Spam Detection. The model is trained on the basis of millions of emails on
different parameters, and whenever it receives a new email, it identifies
whether the email is spam or not. If the email is spam, then it is moved to the
Spam folder.
Learners in Classification Problems: There are two sorts of learners in
classification problems:

 Lazy Learners: A lazy learner saves the training dataset first and then
waits for the test dataset. In the case of the lazy learner, classification is
based on the most closely related data in the training dataset. Training
takes less time, but predictions take longer.

 Example: the K-NN algorithm and case-based reasoning.

 Eager Learners: Before receiving a test dataset, eager learners create a
classification model based on the training dataset. Eager learners, in
contrast to lazy learners, spend more time learning and less time
predicting. Decision Trees, Naïve Bayes, and ANN are some examples.
Types of ML Classification Algorithms:

 Classification Algorithms can be further divided into the following types:


 Linear Models
1. Logistic Regression
2. Support Vector Machines

 Non-linear Models
1. K-Nearest Neighbours
2. Kernel SVM
3. Naïve Bayes
4. Decision Tree Classification
5. Random Forest Classification
K-Nearest Neighbor(KNN) Algorithm for Machine Learning
 K-Nearest Neighbour is one of the simplest Machine Learning algorithms
based on Supervised Learning technique.

 K-NN algorithm assumes the similarity between the new data and available
data and puts the new data into the category that is most similar to the
available categories.

 In the k-Nearest Neighbours (k-NN) algorithm k is just a number that tells the
algorithm how many nearby points (neighbours) to look at when it makes a
decision.

 K-NN algorithm stores all the available data and classifies a new data point
based on the similarity. This means that when new data appears, it can be
easily classified into a well-suited category by using the K-NN algorithm.

 K-NN algorithm can be used for Regression as well as for Classification but
mostly it is used for the Classification problems.
 K-NN is a non-parametric algorithm, which means it does not make any
assumption on underlying data.
 It is also called a lazy learner algorithm because it does not learn from the
training set immediately instead it stores the dataset and at the time of
classification, it performs an action on the dataset.
 KNN algorithm at the training phase just stores the dataset and when it gets new
data, then it classifies that data into a category that is much similar to the new
data.
• Example: Suppose, we have an image of a creature that looks similar to
cat and dog, but we want to know either it is a cat or dog. So for this
identification, we can use the KNN algorithm, as it works on a similarity
measure. Our KNN model will find the similar features of the new data set
to the cats and dogs images and based on the most similar features it will
put it in either cat or dog category.
Distance Metrics Used in KNN Algorithm
 Euclidean distance is defined as the straight-line distance between two points
in a plane or space. You can think of it like the shortest path you would walk if
you were to go directly from one point to another.

 Manhattan Distance is the total distance you would travel if you could only
move along horizontal and vertical lines (like a grid or city streets). It’s also
called “taxicab distance” because a taxi can only drive along the grid-like
streets of a city.

 Minkowski distance is a family of distances that includes both
Euclidean and Manhattan distances as special cases. For points x and y it is
defined as D(x, y) = (Σi |xi − yi|^p)^(1/p).

 When p = 2 it is the same as the formula for the Euclidean distance, and
when p = 1 we obtain the formula for the Manhattan distance.
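A small sketch of the three distance metrics in Python (illustrative only; the point coordinates are made up):

import numpy as np

def euclidean(a, b):
    # Straight-line distance: square root of the sum of squared differences
    return np.sqrt(np.sum((a - b) ** 2))

def manhattan(a, b):
    # Taxicab distance: sum of absolute differences
    return np.sum(np.abs(a - b))

def minkowski(a, b, p):
    # General form; p = 1 gives Manhattan, p = 2 gives Euclidean
    return np.sum(np.abs(a - b) ** p) ** (1 / p)

a, b = np.array([1.0, 2.0]), np.array([4.0, 6.0])
print(euclidean(a, b))        # 5.0
print(manhattan(a, b))        # 7.0
print(minkowski(a, b, 2))     # 5.0, same as Euclidean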
How to select the value of K in the K-NN Algorithm?
 In the k-nearest neighbors (KNN) algorithm, k is a variable that specifies the
number of nearest neighbors to consider when classifying a query point.
 Choosing the right k: The value of k can impact the accuracy of the
algorithm. A small k makes the result more sensitive to noise in individual
neighbours, while a larger k leads to a smoother decision boundary.
Tips for choosing k
 Use an odd number for k to avoid ties in classification
 Use cross-validation to find the optimal k for your dataset
 Use the elbow method to plot the model's error rate or accuracy for different
values of k
 Use grid search to find the best value of k
Working of KNN algorithm
The K-Nearest Neighbors (KNN) algorithm operates on the principle of
similarity where it predicts the label or value of a new data point by
considering the labels or values of its K nearest neighbors in the training
dataset.
Step 1: Selecting the optimal value of K
K represents the number of nearest neighbors that needs to be considered
while making prediction.
Step 2: Calculating distance
To measure the similarity between target and training data points Euclidean
distance is used. Distance is calculated between data points in the dataset and
target point.
Step 3: Finding Nearest Neighbors
The k data points with the smallest distances to the target point are nearest
neighbors.
Step 4: Voting for Classification or Taking Average for Regression
When you want to classify a data point into a category (like spam or not
spam), the K-NN algorithm looks at the K closest points in the dataset. These
closest points are called neighbors. The algorithm then looks at which
category the neighbors belong to and picks the one that appears the most. This
is called majority voting.
In regression, the algorithm still looks for the K closest points. But instead of
voting for a class in classification, it takes the average of the values of those K
neighbors. This average is the predicted value for the new point for the
algorithm.
This is illustrated by a test point classified from its nearest neighbours: the
algorithm identifies the closest k data points (5 in the example) and assigns the
test point the majority class label among them.
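A minimal illustration of these steps using scikit-learn's KNeighborsClassifier on the built-in iris dataset (the dataset choice is an assumption for demonstration; any labelled dataset would do):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

knn = KNeighborsClassifier(n_neighbors=5)   # k = 5 nearest neighbours
knn.fit(X_train, y_train)                   # "training" just stores the data

print("test accuracy:", knn.score(X_test, y_test))
print("predicted class for first test sample:", knn.predict(X_test[:1]))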
Pros: High accuracy, insensitive to outliers, no assumptions about data

Cons: Computationally expensive, requires a lot of memory


Works with: Numeric values, nominal values
Comparison of Linear Regression with K-Nearest Neighbors
 Linear regression is an example of a parametric approach because it
assumes a linear functional form for f(X).
 K-Nearest Neighbors (KNN), is a non-parametric method. The KNN
algorithm does not make assumptions about the data it is analyzing.

• Parametric methods : Parametric models make strong assumptions


about the functional form, or shape, of the relationship between the
variables in the data. These models are characterized by having a fixed
number of parameters, which are estimated from the training data and
used to make predictions.
Advantages :Easy to fit. One needs to estimate a small number of
coefficients. Often easy to interpret.
Disadvantages: They make strong assumptions about the form of f (X).
• If a linear relationship between X and Y is assumed but the true relationship
is far from linear, then the resulting model will provide a poor fit to the
data, and any conclusions drawn from it will be suspect.
Parametric Models

 Assumptions: Parametric models make strong assumptions about the


form of the relationship between the features and the target variable.
These assumptions often involve the data following a specific
probability distribution (e.g., normal distribution).
 Fixed Parameters: They have a fixed number of parameters that are
learned during training. These parameters summarize the knowledge
extracted from the data.
 Examples: Linear Regression, Logistic Regression
 Characteristics:
 Simpler: Easier to understand and implement.
 Faster Training: Generally train faster, especially with smaller
datasets.
 Less Flexible: Can only model relationships that fit the assumed
form.
 Potentially Inaccurate: If the assumptions are incorrect, the
model's performance can suffer.
• Non-parametric models does not make assumptions about the data it is
analyzing.
Advantages: They do not assume an explicit form for f(X),
providing a more flexible approach.
Disadvantages : They can be often more complex to understand
and interpret.
If there is only a small number of observations per predictor, then
parametric methods tend to work better.
Non-Parametric Models

 Fewer Assumptions: Non-parametric models make minimal or no


assumptions about the underlying data distribution. They are more flexible
and can adapt to various data patterns.
 Variable Parameters: The number of parameters in these models can grow
with the size of the training data.
 Examples: K-Nearest Neighbors (KNN), Decision Trees, Support Vector
Machines (with kernel trick)
 Characteristics:
 More Complex: Can be more challenging to understand and
implement.
 Slower Training: Training can be slower, especially with large
datasets.
 More Flexible: Can model a wider range of relationships, including
non-linear ones.
 Potentially More Accurate: When the data doesn't fit pre-defined
assumptions, they can provide better accuracy.
Linear Regression

•Type: Parametric
•Goal: To find the best-fitting linear relationship between independent and
dependent variables.
•How it works: Finds the line (or hyperplane in higher dimensions) that
minimizes the sum of squared errors between predicted and actual values.
•Training: Estimates coefficients for the linear equation.
•Prediction: Uses the equation to predict values for new data points.
•Assumptions: Assumes a linear relationship between variables.
•Advantages:
•Simple to understand and implement.
•Computationally efficient.
•Provides interpretable results (coefficients show the relationship between
variables).
•Disadvantages:
•Can only model linear relationships.
•Sensitive to outliers.
K-Nearest Neighbors (KNN)
•Type: Non-parametric
•Goal: To predict the value of a data point based on the values of its 'k' nearest
neighbors in the training data.
•How it works:
1.Finds the 'k' closest data points in the training set to the new data point.
2.Predicts the value based on the majority class (for classification) or
average value (for regression) of those neighbors.
•Training: Essentially memorizes the training data.
•Prediction: Calculates distances to all training points for each new data point.

•Assumptions: None about the underlying data distribution.


•Advantages:
•Simple to understand and implement.
•Can model complex, non-linear relationships.
•Disadvantages:
•Computationally expensive for large datasets (needs to calculate distances
to all training points).
•Sensitive to irrelevant features.
•Performance depends heavily on the choice of 'k'.
Naive Bayes Classifier Algorithm

 Naive Bayes algorithm is a supervised learning algorithm, which is


based on Bayes theorem and used for solving classification problems.
 It is mainly used in text classification that includes a high-dimensional
training dataset.
 Naïve Bayes Classifier is one of the simple and most effective
Classification algorithms which helps in building the fast machine
learning models that can make quick predictions.
 It is a probabilistic classifier, which means it predicts on the basis of the
probability of an object.
 Some popular examples of Naive Bayes Algorithm are spam filtration,
Sentimental analysis, and classifying articles.
Why is it called Naïve Bayes?
The Naïve Bayes algorithm is comprised of two words Naive and Bayes, which
can be described as:
 Naive: It is called Naive because it assumes that the occurrence of a certain
feature is independent of the occurrence of other features.
 For example, if a fruit is identified on the basis of colour, shape, and taste,
then a red, spherical, and sweet fruit is recognized as an apple.
 Hence each feature individually contributes to identifying it as an apple,
without depending on the others.

 Bayes: It is called Bayes because it depends on the principle of Bayes'


Theorem.
Bayes' Theorem
 Bayes' theorem is also known as Bayes' Rule or Bayes' law, which is used to
determine the probability of a hypothesis with prior knowledge. It depends on
the conditional probability.
 The formula for Bayes' theorem is given as:

P(A|B) = P(B|A) · P(A) / P(B)
 P(A|B) is Posterior probability: Probability of hypothesis A on the


observed event B.
 P(B|A) is Likelihood probability: Probability of the evidence given
that the probability of a hypothesis is true.
 P(A) is Prior Probability: Probability of hypothesis before observing
the evidence.
 P(B) is Marginal Probability: Probability of Evidence.
Naive Bayes algorithm?

The name Naive Bayes consists of two words:
1. Naive: because it assumes independence between traits or features.
2. Bayes: because it is based on Bayes’ theorem.

To use the algorithm:

1. We must convert the presented data set into frequency tables.


2. Then create a probability table by finding the probabilities of certain
features.
3. Then use Bayes’ theorem in order to calculate the posterior
probability.

For example, let’s solve the following problem: If the weather is sunny,
then the Player should play or not?
•𝑃(𝑌𝑒𝑠│𝑆𝑢𝑛𝑛𝑦) > 𝑃(𝑁𝑜│𝑆𝑢𝑛𝑛𝑦) ⇒ So on a sunny day, the player can play the game.
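Because the frequency and probability tables from the slides are not reproduced here, the following sketch assumes hypothetical counts (10 Play=Yes days and 4 Play=No days out of 14, with 3 of the Yes days and 2 of the No days being Sunny) and applies Bayes' theorem to them:

# Hypothetical frequency counts (the original table is not reproduced above)
p_yes, p_no = 10 / 14, 4 / 14          # priors P(Yes), P(No)
p_sunny_given_yes = 3 / 10             # likelihood P(Sunny | Yes)
p_sunny_given_no = 2 / 4               # likelihood P(Sunny | No)
p_sunny = 5 / 14                       # evidence P(Sunny)

# Bayes' theorem: P(class | Sunny) = P(Sunny | class) * P(class) / P(Sunny)
p_yes_given_sunny = p_sunny_given_yes * p_yes / p_sunny
p_no_given_sunny = p_sunny_given_no * p_no / p_sunny

print(round(p_yes_given_sunny, 2), round(p_no_given_sunny, 2))   # ~0.6 vs ~0.4
# Since P(Yes|Sunny) > P(No|Sunny), the classifier predicts "play".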
Advantages of Naïve Bayes Classifier
 Naïve Bayes is one of the fast and easy ML algorithms to predict a class of
datasets.
 It can be used for Binary as well as Multi-class Classifications.
 It performs well in Multi-class predictions as compared to the other
Algorithms.
 It is the most popular choice for text classification problems.
Disadvantages of Naïve Bayes Classifier
 Naive Bayes assumes that all features are independent or unrelated, so it
cannot learn the relationship between features.
Applications of Naïve Bayes Classifier
 It is used for Credit Scoring.
 It is used in medical data classification.
 It can be used in real-time predictions because Naïve Bayes Classifier is an
eager learner.
 It is used in Text classification such as Spam filtering and Sentiment
analysis.
Types of Naïve Bayes Model

There are three types of Naive Bayes Model, which are given below:
 Gaussian: The Gaussian model assumes that features follow a normal
distribution. This means if predictors take continuous values instead of discrete,
then the model assumes that these values are sampled from the Gaussian
distribution.

 Multinomial: The Multinomial Naïve Bayes classifier is used when the data is
multinomial distributed. It is primarily used for document classification
problems, it means a particular document belongs to which category such as
Sports, Politics, education, etc.

 Bernoulli: The Bernoulli classifier works similarly to the Multinomial classifier,
but the predictor variables are independent Boolean variables, such as whether a
particular word is present or not in a document. This model is also well known for
document classification tasks.
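A brief sketch of the three variants using scikit-learn (the tiny datasets below are made up purely to show which kind of features each model expects):

import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

# Gaussian NB: continuous features assumed to follow a normal distribution
X = np.array([[1.0, 2.1], [1.2, 1.9], [3.8, 4.0], [4.1, 3.9]])
y = np.array([0, 0, 1, 1])
print(GaussianNB().fit(X, y).predict([[1.1, 2.0]]))              # -> class 0

# Multinomial NB: count features, e.g. word counts per document
docs = np.array([[2, 0, 1], [0, 3, 1], [1, 0, 2], [0, 2, 0]])
labels = np.array([0, 1, 0, 1])
print(MultinomialNB().fit(docs, labels).predict([[1, 0, 1]]))    # -> class 0

# Bernoulli NB: binary features, e.g. word present (1) or absent (0)
print(BernoulliNB().fit((docs > 0).astype(int), labels).predict([[1, 0, 1]]))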
Support vector machines (SVM)

 Support Vector Machine or SVM is one of the most popular Supervised


Learning algorithms, which is used for Classification as well as Regression
problems, primarily, it is used for Classification problems in Machine
Learning.
 The goal of the SVM algorithm is to create the best line or decision
boundary that can segregate n-dimensional space into classes so that we
can easily put the new data point in the correct category in the future.
 This best decision boundary is called a hyperplane.
 SVM chooses the extreme points/vectors that help in creating the
hyperplane.
 These extreme cases are called as support vectors, and hence algorithm is
termed as Support Vector Machine.
• Consider the below diagram in which there are two different categories that
are classified using a decision boundary or hyperplane:
Important terms in Support Vector Machine

Some of the important concepts in SVM which will be used frequently are as
follows.

 Hyperplane − It is a decision plane or space which is divided between a set


of objects having different classes.
 Support Vectors − Data points that are closest to the hyperplane are called
support vectors. The separating line will be defined with the help of these
data points.
 Kernel – A kernel is a function used in SVM to transform the input data into a
higher-dimensional space where a linear separator can be found; it provides a
shortcut that avoids computing the transformation explicitly.
 Margin − It may be defined as the gap between the two lines drawn through the
closest data points of different classes. A large margin is considered a good margin
and a small margin is considered a bad margin.
Hyperplane and Support Vectors in the SVM algorithm

Hyperplane:
 There can be multiple lines/decision boundaries to segregate the classes in
n-dimensional space, but we need to find out the best decision boundary
that helps to classify the data points. This best boundary is known as the
hyperplane of SVM.
 The dimensions of the hyperplane depend on the features present in the
dataset, which means if there are 2 features (as shown in image), then
hyperplane will be a straight line. And if there are 3 features, then
hyperplane will be a 2-dimension plane.
 We always create a hyperplane that has a maximum margin, which means
the maximum distance between the data points.
Support Vectors:
 The data points or vectors that are the closest to the hyperplane and which
affect the position of the hyperplane are termed as Support Vector.
 Since these vectors support the hyperplane, hence called a Support vector.
Example: SVM can be understood with the example that we have used in the
KNN classifier. Suppose we see a strange cat that also has some features of
dogs, so if we want a model that can accurately identify whether it is a cat or
dog, so such a model can be created by using the SVM algorithm. We will first
train our model with lots of images of cats and dogs so that it can learn about
different features of cats and dogs, and then we test it with this strange
creature. So as support vector creates a decision boundary between these two
data (cat and dog) and choose extreme cases (support vectors), it will see the
extreme case of cat and dog. On the basis of the support vectors, it will classify
it as a cat. Consider the below diagram:
Example of Support Vector Machine
 SVM algorithm can be understood better with the following example. Suppose we
want to build a model that can accurately identify whether the given fruit is an
apple or banana.
 We will first train our model with lots of images of apples and bananas so that it
can learn about the different features of apples and bananas, and then we test it
with new fruit. So as the support vector creates a decision boundary between these
two data i.e., apple and banana, and chooses support vectors. On the basis of the
support vectors, it will classify its category as an apple or banana. We can
understand the example with the below diagram.
Types of SVM

 Linear SVM: Linear SVM is used for linearly separable data.
If a dataset can be classified into two classes by using a single straight
line, then such data is termed linearly separable data, and the classifier
used is called the Linear SVM classifier.

 Non-linear SVM: Non-Linear SVM is used for non-linearly separable data.
If a dataset cannot be classified by using a straight line, then such data
is termed non-linear data, and the classifier used is called the Non-linear
SVM classifier.
Linear SVM

The best hyperplane is the plane that has the maximum distance from both
classes, and this is the main aim of SVM. This is done by finding the different
hyperplanes which classify the labels in the best way, then choosing the one
which is farthest from the data points, i.e. the one which has the maximum
margin.
How does SVM work?

Linear SVM:
 Let’s consider two independent variables x1, x2 and one dependent variable
which is either a blue circle or a red circle.
 From the figure above its very clear that there are multiple lines (our
hyperplane here is a line because we are considering only two input features
x1, x2) that segregates our data points or does a classification between red
and blue circles. So how do we choose the best line or in general the best
hyperplane that segregates our data points.
 SVM algorithm helps to find the best line or decision boundary; this best
boundary or region is called as a hyperplane.
 SVM algorithm finds the closest point of the lines from both the classes.
These points are called support vectors.
 The distance between the vectors and the hyperplane is called as margin.
And the goal of SVM is to maximize this margin.
 The hyperplane with maximum margin is called the optimal
hyperplane.
Selecting the best hyper-plane:
 One reasonable choice as the best hyperplane is the one that represents the
largest separation or margin between the two classes.
 So we choose the hyperplane whose distance from it to the nearest data
point on each side is maximized.
 If such a hyperplane exists it is known as the maximum-margin
hyperplane/hard margin. So from the above figure, we choose L2.
 Here we have one blue ball in the
boundary of the red ball. So how does
SVM classify the data? It’s simple!
 The blue ball in the boundary of red
ones is an outlier of blue balls. The
SVM algorithm has the characteristics
to ignore the outlier and finds the best
hyperplane that maximizes the margin.
SVM is robust to outliers.
What to do if data are not linearly separable?
 SVM solves this by creating a new
variable using a kernel.
 We call a point xi on the line and we
create a new variable yi as a function of
distance from origin o. So if we plot
this we get something like as shown
below.
 In this case, the new variable y is
created as a function of distance from
the origin.
 A non-linear function that creates a new
variable is referred to as kernel.
Non-linear SVM
 Non-Linear SVM is used in the case of non-linearly separated data. It means
if a straight line cannot classify a dataset, then such data is termed non-linear
data, and we can use the Non-linear SVM classifier.
 Nonlinear SVM was introduced when the data cannot be separated by a linear
decision boundary in the original feature space. The kernel function computes
the similarity between data points allowing SVM to capture complex patterns
and nonlinear relationships between features. This enables nonlinear SVM to
form curved or circular decision boundaries with help of kernel.

Popular kernel functions in SVM


 Radial Basis Function (RBF): Captures patterns in data by measuring the
distance between points and is ideal for circular or spherical relationships.
 Linear Kernel: Works for data that is linearly separable problem without
complex transformations.
 Polynomial Kernel: Models more complex relationships using polynomial
equations.
 Sigmoid Kernel: Mimics neural network behaviour using sigmoid function
and is suitable for specific non-linear problems.
The standard forms of these kernels (for input vectors x and y) are:
Radial Basis Function (RBF): K(x, y) = exp(−γ ‖x − y‖²)
Polynomial Kernel: K(x, y) = (x · y + c)^d
Sigmoid Kernel: K(x, y) = tanh(γ (x · y) + c)
SVM Kernel:
 The SVM kernel is a function that takes low dimensional input
space and transforms it into higher-dimensional space, i.e. it
converts not separable problem to separable problem. It is mostly
useful in non-linear separation problems.
 Simply put the kernel, it does some extremely complex data
transformations then finds out the process to separate the data
based on the labels or outputs defined.
The most interesting feature of SVM is that it can even work with a non-
linear dataset and for this, we use the “Kernel Trick”, which makes it easier to
classify the points. Suppose we have a dataset like this
Here we see we cannot draw a single line or say hyperplane which can
classify the points correctly. So what we do is try converting this lower
dimension space to a higher dimension space using some quadratic
functions which will allow us to find a decision boundary that clearly
divides the data points. These functions which help us do this are called
Kernels and which kernel to use is purely determined by hyperparameter
tuning.
import numpy as np
from sklearn import datasets as ds
from sklearn import svm
import matplotlib.pyplot as plt
%matplotlib inline

# Generate a non-linearly separable dataset: two concentric circles
X, y = ds.make_circles(n_samples=500, noise=0.06)
plt.scatter(X[:, 0], X[:, 1], c=y, marker='.')
plt.show()

# Fit a non-linear SVM with the RBF kernel
classifier_non_linear = svm.SVC(kernel='rbf', C=1.0)
classifier_non_linear.fit(X, y)

def boundary_plot(m, axis=None):
    # Draw the decision boundary (decision_function == 0) of classifier m
    if axis is None:
        axis = plt.gca()
    limit_x = axis.get_xlim()
    limit_y = axis.get_ylim()
    x_lines = np.linspace(limit_x[0], limit_x[1], 30)
    y_lines = np.linspace(limit_y[0], limit_y[1], 30)
    YY, XX = np.meshgrid(y_lines, x_lines)
    xy = np.vstack([XX.ravel(), YY.ravel()]).T
    P = m.decision_function(xy).reshape(XX.shape)
    axis.contour(XX, YY, P, levels=[0], alpha=0.6, linestyles=['-'])

# Plot the data, the learned boundary, and highlight the support vectors
plt.scatter(X[:, 0], X[:, 1], c=y, s=55)
boundary_plot(classifier_non_linear)
plt.scatter(classifier_non_linear.support_vectors_[:, 0],
            classifier_non_linear.support_vectors_[:, 1],
            s=55, lw=1, facecolors='none')
plt.show()
Advantages of SVM:

 Support vector machine works comparably well when there is a clear
margin of separation between classes.
 It is more productive in high-dimensional spaces.
 It is effective in instances where the number of dimensions is larger than
the number of specimens.
 Support vector machine is comparatively memory efficient.
Further advantages of SVM include:
 Handling high-dimensional data: SVMs are effective in handling high-
dimensional data, which is common in many applications such as image
and text classification.
 Handling small datasets: SVMs can perform well with small datasets, as
they only require a small number of support vectors to define the boundary.
 Modeling non-linear decision boundaries: SVMs can model non-linear
decision boundaries by using the kernel trick, which maps the data into a
higher-dimensional space where the data becomes linearly separable.
 Robustness to noise: SVMs are robust to noise in the data, as the decision
boundary is determined by the support vectors, which are the closest data
points to the boundary.
 Generalization: SVMs have good generalization performance, which
means that they are able to classify new, unseen data well.
 Versatility: SVMs can be used for both classification and regression tasks,
and it can be applied to a wide range of applications such as natural
language processing, computer vision, and bioinformatics.
 Sparse solution: SVMs have sparse solutions, which means that they only
use a subset of the training data to make predictions. This makes the
algorithm more efficient and less prone to overfitting.
 Regularization: SVMs can be regularized, which means that the algorithm
can be modified to avoid overfitting.
Disadvantages of support vector machine:
 Support vector machine algorithm is not suitable for large data sets.
 It does not perform very well when the data set has more noise, i.e. when the
target classes are overlapping.
 In cases where the number of features for each data point exceeds the
number of training data specimens, the support vector machine will
underperform.
 As the support vector classifier works by placing data points above and
below the classifying hyperplane, there is no probabilistic explanation for the
classification.
Further disadvantages of SVM include:
 Computationally expensive: SVMs can be computationally expensive for
large datasets, as the algorithm requires solving a quadratic optimization
problem.
 Choice of kernel: The choice of kernel can greatly affect the performance of
an SVM, and it can be difficult to determine the best kernel for a given
dataset.
 Sensitivity to the choice of parameters: SVMs can be sensitive to the choice
of parameters, such as the regularization parameter, and it can be difficult to
determine the optimal parameter values for a given dataset.
 Memory-intensive: SVMs can be memory-intensive, as the algorithm
requires storing the kernel matrix, which can be large for large datasets.
 Limited to two-class problems: SVMs are primarily used for two-class
problems, although multi-class problems can be solved by using one-versus-
one or one-versus-all strategies.
 Lack of probabilistic interpretation: SVMs do not provide a probabilistic
interpretation of the decision boundary, which can be a disadvantage in some
applications.
 Not suitable for large datasets with many features: SVMs can be very slow
and can consume a lot of memory when the dataset has many features.
 Not suitable for datasets with missing values: SVMs require complete
datasets with no missing values; they cannot handle missing values directly.
Applications of support vector machine:
 Face observation – It is used for detecting the face according to the classifier
and model.
 Text and hypertext arrangement – In this, the categorization technique is
used to find important information or you can say required information for
arranging text.
 Grouping of portrayals – It is also used in the Grouping of portrayals for
grouping or you can say by comparing the piece of information and take an
action accordingly.
 Bioinformatics – It is also used for medical science as well like in laboratory,
DNA, research, etc.
 Handwriting recognition – SVMs are used for recognizing handwritten text.
 Protein fold and remote homology spotting – It is used for spotting or you
can say the classification class into functional and structural classes given their
amino acid sequences. It is one of the problems in bioinformatics.
 Generalized predictive control (GPC) – SVMs are also used in generalized
predictive control, which relies on predictive control using a multilayer
feed-forward network as the plant's linear model.
 SVMs are popular in various applications such as image classification,
natural language processing, bioinformatics, and more.

 Facial Expression Classification – Support vector machines (SVMs) is a


binary classification technique. The face Expression Classification model
determines the precise face expression by modeling differences between two
facial images. Validation techniques include the leave-one-out methods and
the K-fold test methods.

 Speech Recognition – The transcription of speech into text is called speech


recognition. Mel Frequency Cepstral Coefficients (MFCC)-based features
are used to train Support Vector Machines (SVM), which are used for
figuring out speech. Speech recognition is a challenging classification
problem that is categorized using a variety of mathematical techniques,
including support vector machines, pattern recognition techniques, etc.
Decision Tree Classification Algorithm

It models decisions as a tree-like structure where internal nodes represent attribute


tests, branches represent attribute values, and leaf nodes represent final decisions or
predictions. Decision trees are versatile, interpretable, and widely used in machine
learning for predictive modeling.
Decision Trees Example
 Figure shows a flowchart, which is a decision tree.
 It has decision blocks (rectangles) and terminating blocks (ovals) where
some conclusion has been reached.
 The right and left arrows coming out of the decision blocks are known as
branches, and they can lead to other decision blocks or to a terminating
block.
Decision Trees Example

 Example: email classification system


- first checks the domain of the sending email address.
 If this is equal to myEmployer.com, it will classify the email as “Email
to read when bored.”
 If it isn’t from that domain, it checks to see if the body of the email
contains the word hockey.
 If the email contains the word hockey, then this email is classified as
“Email from friends; read immediately”; if the body doesn’t contain the
word hockey, then it gets classified as “Spam; don’t read.”
Advantages
• kNN algorithm did a great job of classifying, but it didn’t lead to any major
insights about the data.
• One of the best things about decision trees is that humans can easily
understand the data.
• The decision tree does a great job of distilling data into knowledge.
• With this, you can take a set of unfamiliar data and extract a set of rules.
• The machine learning will take place as the machine creates these rules
from the dataset.
• Decision trees are often used in expert systems, and the results obtained by
using them are often comparable to those from a human expert with
decades of experience in a given field.
Procedure

• To build a decision tree, you need to make a first decision on the dataset to
dictate which feature is used to split the data.
• To determine this, you try every feature and measure which split will give you
the best results. After that, you’ll split the dataset into subsets.
• The subsets will then traverse down the branches of the first decision node.
• If the data on the branches is the same class, then you’ve properly classified it
and don’t need to continue splitting it.
• If the data isn’t the same, then you need to repeat the splitting process on this
subset.
• The decision on how to split this subset is done the same way as the original
dataset, and you repeat this process until you’ve classified all the data.
Pseudo-code for a function called createBranch() would look like this:

Check if every item in the dataset is in the same class:
    If so, return the class label
    Else
        find the best feature to split the data
        split the dataset
        create a branch node
        for each split
            call createBranch() and add the result to the branch node
        return branch node
Algorithm
A common measure used to select the best splitting feature is the Gini index:

Gini Index = 1 − Σj (Pj)²

where Pj denotes the probability of an element being classified for a distinct class.
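A minimal sketch of computing the Gini index for a set of class labels (illustrative only):

import numpy as np

def gini_index(labels):
    # Gini = 1 - sum(p_j^2), where p_j is the fraction of samples in class j
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

print(gini_index([0, 0, 0, 0]))      # 0.0 -> pure node, no further split needed
print(gini_index([0, 0, 1, 1]))      # 0.5 -> maximally impure for two classes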
Applications

• Classification: Decision trees can categorize objects based on their


features.
• Prediction: Decision trees can predict outcomes for future data.
• Data mining: Decision trees can be used to solve classification
problems and categorize objects.
• Regression: Decision trees can be used for regression problems or
to predict continuous outcomes from unforeseen data.
• Medical research: Decision tree models can be used to describe
research findings.
Ensemble Learning
 Ensemble learning in machine learning helps enhance the performance
of machine learning models.

 Multiple machine learning models are combined to obtain a more


accurate model.

 Stacking, bagging, and boosting are the three most popular ensemble
learning techniques.

 Each of these techniques offers a unique approach to improving


predictive accuracy.

 Each technique is used for a different purpose, with the use of each
depending on varying factors.
Bias and Variance in Machine Learning

 There are various ways to evaluate a machine-learning model.

 We can use MSE (Mean Squared Error) for Regression; Precision, Recall,
and ROC (Receiver operating characteristics) for a Classification Problem
along with Absolute Error.

 In a similar way, Bias and Variance help us in parameter tuning and deciding
better-fitted models among several built.
Bias
Bias is simply defined as the inability of the model to capture the true
relationship, because of which there is some difference or error between the
model’s predicted value and the actual value.
These differences between the actual or expected values and the predicted values
are known as bias error or error due to bias.

Bias is a systematic error that occurs due to wrong assumptions in the


machine learning process.

Low Bias: Low bias value means fewer assumptions are taken to build the
target function. In this case, the model will closely match the training dataset.
High Bias: High bias value means more assumptions are taken to build the
target function. In this case, the model will not match the training dataset
closely.
Ways to reduce high bias in Machine Learning:

 Use a more complex model: One of the main reasons for high bias is an overly
simplified model that cannot capture the complexity of the data. In such
cases, we can make our model more complex by increasing the number of hidden
layers in the case of a deep neural network, or we can use a more complex model
like polynomial regression for non-linear datasets, CNNs for image processing,
and RNNs for sequence learning.

 Increase the number of features: Adding more features to the training dataset
increases the complexity of the model and improves its ability to capture the
underlying patterns in the data.

 Reduce regularization of the model: Regularization techniques such as L1 or L2
regularization can help to prevent overfitting and improve the generalization ability
of the model. If the model has a high bias, reducing the strength of regularization or
removing it altogether can help to improve its performance.

 Increase the size of the training data: Increasing the size of the training data can
help to reduce bias by providing the model with more examples to learn from.
Variance
 Variance is the measure of spread in data from its mean position.

 In machine learning variance is the amount by which the performance of a


predictive model changes when it is trained on different subsets of the training
data.

 More specifically, variance is the variability of the model that how much it is
sensitive to another subset of the training dataset. i.e. how much it can adjust on
the new subset of the training dataset.

 In machine learning, variance is the variability of model prediction on different


datasets.

 The variance shows how much model prediction varies when there is a slight
variation in data. If model accuracies on training and test data vary greatly, the
model has high variance.

 A model with high variance can even fit noises on training data but lacks
generalization to new, unseen data.
Trade-off between bias and variance?
High Bias, Low Variance: A model with high bias and low variance is said to be
underfitting.
High Variance, Low Bias: A model with high variance and low bias is said to be
overfitting.
High-Bias, High-Variance: A model has both high bias and high variance, which means
that the model is not able to capture the underlying patterns in the data (high bias) and is
also too sensitive to changes in the training data (high variance). As a result, the model
will produce inconsistent and inaccurate predictions on average.
Low Bias, Low Variance: A model that has low bias and low variance means that the
model is able to capture the underlying patterns in the data (low bias) and is not too
sensitive to changes in the training data (low variance).
A model with balanced bias and variance is said to have optimal generalization
performance. This means that the model is able to capture the underlying patterns in
the data without overfitting or underfitting.

The model is likely to be just complex enough to capture the complexity of the data,
but not too complex to overfit the training data.

This can happen when the model has been carefully tuned to achieve a good balance
between bias and variance, by adjusting the hyperparameters and selecting an
appropriate model architecture.
The total error is the sum of the bias error and the variance error (commonly
written as Total Error = Bias² + Variance + Irreducible Error). The optimal region
is where bias and variance are balanced, i.e. the model complexity at which the
total error is minimum.
Bagging
 Ensemble is a machine learning concept in which multiple models are
trained using the same learning algorithm.
 Bagging is a way to decrease the variance in the prediction by generating
additional data for training from dataset using combinations with
repetitions to produce multi-sets of the original data.
 Bagging is a popular ensemble learning technique that focuses on reducing
variance and improving the stability of machine learning models.
 The term “bagging” is derived from the idea of creating multiple subsets or
bags of the training data through a process known as bootstrapping.
 Bootstrapping involves randomly sampling the dataset with replacement
to generate multiple subsets of the same size as the original data. Each of
these subsets is then used to train a base learner independently.
• Bagging is used when the goal is to reduce the variance of a decision tree
classifier.
• Here the objective is to create several subsets of data from training
sample chosen randomly with replacement.
• Each collection of subset data is used to train their decision trees.
• As a result, we get an ensemble of different models. Average of all the
predictions from different trees are used which is more robust than a single
decision tree classifier.
Bagging Steps:

 Suppose there are N observations and M features in training data set. A


sample from training data set is taken randomly with replacement.

 A subset of M features are selected randomly and whichever feature gives


the best split is used to split the node iteratively.

 The tree is grown to its largest extent.

 Above steps are repeated n times and prediction is given based on the
aggregation of predictions from n number of trees.
 Bagging (Bootstrap Aggregating) is an ensemble learning technique designed
to improve the accuracy and stability of machine learning algorithms. It
involves the following steps:
 Data Sampling: Creating multiple subsets of the training dataset using
bootstrap sampling (random sampling with replacement).
 Model Training: Training a separate model on each subset of the data.
 Aggregation: Combining the predictions from all individual models
(averaged for regression or majority voting for classification) to produce
the final output.
 Key Benefits:
Reduces Variance: By averaging multiple predictions, bagging reduces
the variance of the model and helps prevent overfitting.
Improves Accuracy: Combining multiple models usually leads to
better performance than individual models.

 Example of Bagging Algorithms:


 Random Forests (an extension of bagging applied to decision trees)
The steps of bagging are as follows:

1. We have an initial training dataset containing n-number of instances.


2. We create a m-number of subsets of data from the training set. We take a
subset of N sample points from the initial dataset for each subset. Each
subset is taken with replacement. This means that a specific data point
can be sampled more than once.
3. For each subset of data, we train the corresponding weak learners
independently. These models are homogeneous, meaning that they are of
the same type.
4. Each model makes a prediction.
5. Aggregating the predictions into a single prediction, using either max voting
or averaging.
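As an illustrative sketch (not part of the original slides), these steps can be tried with scikit-learn's BaggingClassifier, whose default base learner is a decision tree; the dataset choice here is just an example:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Bagging: 50 trees, each trained on a bootstrap sample of the training data;
# predictions are combined by majority vote.
bagging = BaggingClassifier(n_estimators=50, bootstrap=True, random_state=0)
bagging.fit(X_train, y_train)

single_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("single tree accuracy:", single_tree.score(X_test, y_test))
print("bagged ensemble accuracy:", bagging.score(X_test, y_test))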
Advantages:
 Reduces over-fitting of the model.
 Handles higher dimensionality data very well.
 Maintains accuracy for missing data.
Disadvantages:
 Since final prediction is based on the mean predictions from subset
trees, it won’t give precise values for the classification and
regression model.
Boosting
 Boosting is an ensemble learning technique that sequentially combines
multiple weak classifiers to create a strong classifier. It is done by training
a model using training data and is then evaluated.
 Next model is built on that which tries to correct the errors present in the
first model. This procedure is continued and models are added until either
the complete training data set is predicted correctly or predefined number
of iterations is reached.
 Boosting involves sequentially training weak learners. Here, each
subsequent learner improves the errors of previous learners in the
sequence. A sample of data is first taken from the initial dataset.
 Using this sample to train the first model, and the model makes its
prediction. The samples can either be correctly or incorrectly predicted.
 The samples that are wrongly predicted are reused for training the next
model. In this way, subsequent models can improve on the errors of
previous models.
 Unlike bagging, which aggregates prediction results at the end, boosting aggregates the
results at each step.

 Weighted averaging involves giving all models different weights depending on their
predictive power.

 In other words, it gives more weight to the model with the highest predictive power.
This is because the learner with the highest predictive power is considered the most
important.
 Boosting is used to create a collection of predictors.

 In this technique, learners are learned sequentially with early learners fitting
simple models to the data and then analysing data for errors.

 Consecutive trees (random sample) are fit and at every step, the goal is to improve
the accuracy from the prior tree.

 When an input is misclassified by a hypothesis, its weight is increased so that next


hypothesis is more likely to classify it correctly. This process converts weak learners
into better performing model.
Boosting Steps:

 Draw a random subset of training samples d1 without replacement from


the training set D to train a weak learner C1

 Draw second random training subset d2 without replacement from the


training set and add 50 percent of the samples that were previously
falsely classified/misclassified to train a weak learner C2

 Find the training samples d3 in the training set D on which C1 and C2


disagree to train a third weak learner C3

 Combine all the weak learners via majority voting.


Types Of Boosting Algorithms

There are several types of boosting algorithms; some of the most famous and
useful models are as follows:

 Gradient Boosting – Gradient Boosting constructs models in a sequential


manner where each weak learner minimizes the residual error of the previous
one using gradient descent. Instead of adjusting sample weights like
AdaBoost Gradient Boosting reduces error directly by optimizing a loss
function.

 XGBoost – XGBoost is an optimized implementation of Gradient Boosting


that uses regularization to prevent overfitting. It is faster and more efficient
than standard Gradient Boosting and supports handling both numerical and
categorical variables.

 CatBoost – CatBoost is particularly effective for datasets with categorical


features. It employs symmetric decision trees and a unique encoding method
that considers target values, making it superior in handling categorical data
without preprocessing.
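A short illustrative sketch of gradient boosting using scikit-learn's GradientBoostingClassifier on synthetic data (the dataset and hyperparameter values are assumptions for demonstration):

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each new tree is fit to the residual errors of the ensemble built so far
gb = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
                                max_depth=3, random_state=0)
gb.fit(X_train, y_train)
print("test accuracy:", gb.score(X_test, y_test))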
Advantages of Boosting :

Improved Accuracy: By combining multiple weak learners it enhances


predictive accuracy for both classification and regression tasks.
Handles Imbalanced Data Well: It prioritizes misclassified points making it
effective for imbalanced datasets.
Better Interpretability: The sequential nature of boosting helps break down
decision-making, making the model more interpretable.

Disadvantages:
Prone to over-fitting.
Requires careful tuning of different hyper-parameters.
AdaBoost
 AdaBoost is short for adaptive boosting.
 AdaBoost is an ensemble learning method (also known as “meta-
learning”) initially created to increase the efficiency of binary classifiers.
 It is done by building a model by using weak models in series.
 AdaBoost uses an iterative approach to learn from the mistakes of weak
classifiers, and turn them into strong ones.
 The basic concept behind Adaboost is to set the weights of classifiers and
training the data sample in each iteration such that it ensures the
accurate predictions of unusual observations.
 Any machine learning algorithm can be used as base classifier if it accepts
weights on the training set.
 This algorithm updates the weights attached to each of the misclassified
training data samples and of the corresponding weak learners.

 Once an individual base model is trained, the sample next in sequence is


assigned a weight that signifies the prediction accuracy.

 The weighted sample is then used to train the next base learner which
would intuitively focus more on the samples with greater weight
assigned to them and try to make better predictions.

 The results would be re-weighted for the misclassified samples and fed
into the next individual learner.
• We see that the accuracy differs when we build different models on the
same dataset.
• But what if we use combinations of all these algorithms for making the
final prediction?
• We’ll get more accurate results by taking the average of results from
these models. We can increase the prediction power in this way.
• Boosting algorithms works in a similar way, it combines multiple
models (weak learners) to reach the final output (strong learners).
• Predictions are made by calculating the weighted average of the
weak classifiers.
AdaBoost works this way:
 A weight is applied to every example in the training data.
 We’ll call the weight vector D. Initially, these weights are all equal.
 A weak classifier is first trained on the training data.
 The errors from the weak classifier are calculated, and the weak
classifier is trained a second time with the same dataset.
 This second time the weak classifier is trained, the weights of the
training set are adjusted so the examples properly classified the first
time are weighted less and the examples incorrectly classified in the
first iteration are weighted more.
 To get one answer from all of these weak classifiers, AdaBoost assigns
alpha values to each of the classifiers.
 The alpha values are based on the error of each weak classifier.
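An illustrative sketch of this procedure using scikit-learn's AdaBoostClassifier (the synthetic dataset and number of estimators are assumptions; the default weak learner is a one-level decision tree):

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# 100 weak learners; after each round the misclassified samples get larger
# weights, and each learner receives an alpha weight based on its error.
ada = AdaBoostClassifier(n_estimators=100, random_state=1)
ada.fit(X_train, y_train)

print("test accuracy:", ada.score(X_test, y_test))
print("per-learner alpha weights (first 5):", ada.estimator_weights_[:5])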
Random Forest Algorithm

 Random Forest is a popular machine learning algorithm that belongs to the


supervised learning technique.
 It can be used for both Classification and Regression problems in ML.
 It is based on the concept of ensemble learning
 It is a process of combining multiple classifiers to solve a complex problem
and to improve the performance of the model.
 Random Forest is a classifier that contains a number of decision trees on
various subsets of the given dataset and takes the average to improve the
predictive accuracy of that dataset.
 Random Forest is an ensemble technique capable of performing both
regression and classification tasks with the use of multiple decision trees and
a technique called Bootstrap and Aggregation, commonly known as bagging.
 The basic idea behind this is to combine multiple decision trees in
determining the final output rather than relying on individual decision trees.
• Instead of relying on one decision tree, the random forest takes the
prediction from each tree and based on the majority votes of predictions,
and it predicts the final output.
• The greater number of trees in the forest leads to higher accuracy and
prevents the problem of overfitting.
• Random Forest has multiple decision trees as base learning models.
• We randomly perform row sampling and feature sampling from the dataset
forming sample datasets for every model. This part is called Bootstrap.
• the final output is taken by using the majority voting classifier and this part
is called Aggregation.
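A minimal sketch with scikit-learn's RandomForestClassifier on the iris dataset (the dataset and parameter choices are assumptions for illustration):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# 100 trees, each trained on a bootstrap sample (row sampling) and considering
# a random subset of features at every split (feature sampling).
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=42)
forest.fit(X_train, y_train)

print("test accuracy:", forest.score(X_test, y_test))
print("prediction for first test flower:", forest.predict(X_test[:1]))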
