ML Unit3b
Regression, Bagging, Boosting
Many different regression models can be used, but the simplest among them is linear regression.
It tries to find the best linear relationship that describes the given data.
This means that the change in the dependent variable is proportional to the change in the independent variables, for example, predicting the price of a house based on its size.
Representing the Linear Regression Model
A linear regression model represents the linear relationship between a dependent variable and independent variable(s) via a sloped straight line.
The sloped straight line representing the linear relationship that fits the given data best is called the regression line, also known as the best-fit line.
Based on the given data points, we attempt to plot the line that fits the points best.
Based on the number of independent variables, there are two types of linear regression: simple linear regression (one independent variable) and multiple linear regression (more than one).
In regression, a set of records with X and Y values is available, and these values are used to learn a function; if you then want to predict Y for an unknown X, this learned function can be used.
Define the problem: Identify the variables and the problem you are trying to
solve
Collect data: Gather data on the variables you're interested in
Check for outliers: Remove any outliers that could skew your results
Check for linearity: Plot the data to see if there is a linear relationship between
the variables
Choose a model: Select a regression model that's appropriate for your data and
goals
Run the regression: Use the data to calculate the regression equation
Evaluate the results: Examine the results and determine the significance of the
independent variables
Interpret the results: Use the results to answer your research question or make
predictions
Relate to your hypothesis: Compare the results to your original hypothesis and
decide whether to accept, reject, or revise it
Simple vs. Multiple
For a simple linear regression, which is a line of the form y=mx+c, where y is
the dependent variable, x is the independent variable, m is the slope of the
line, and c is the y-intercept, the formulas to calculate the slope (m) and
intercept (c) of the line are derived from the following equations:
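Using the least squares criterion (minimizing the sum of squared residuals), the standard expressions for a dataset of n points are:

$$m = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - \left(\sum x\right)^2}, \qquad c = \bar{y} - m\bar{x} = \frac{\sum y - m\sum x}{n}$$

where $\bar{x}$ and $\bar{y}$ are the means of the x and y values.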
The steps to find the line of best fit using the least squares method are illustrated with the following example.
Consider an example. Tom who is the owner of a retail shop, found the price of
different T-shirts vs the number of T-shirts sold at his shop over a period of one
week.
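As a minimal sketch of this calculation (the actual numbers from Tom's table are not reproduced here, so the price and sales values below are illustrative), the least squares line can be computed in Python as follows:

import numpy as np

# Illustrative data: T-shirt price (x) vs. number of T-shirts sold (y)
x = np.array([2, 3, 5, 7, 9], dtype=float)
y = np.array([4, 5, 7, 10, 15], dtype=float)

n = len(x)
# Least-squares slope and intercept
m = (n * np.sum(x * y) - np.sum(x) * np.sum(y)) / (n * np.sum(x ** 2) - np.sum(x) ** 2)
c = y.mean() - m * x.mean()

print(f"Best-fit line: y = {m:.3f}x + {c:.3f}")
print("Predicted sales at price 6:", m * 6 + c)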
https://youtu.be/h8cTBrYHWqA
Classification
Classification is a process of finding a function which helps in dividing the dataset into classes based on different parameters.
In classification, a computer program is trained on the training dataset and, based on that training, it categorizes the data into different classes.
The task of the classification algorithm is to find the mapping function to map the input (x) to the discrete output (y).
Binary Classifier: This type of classifier is used when there are only two possible outputs of a classification task.
Lazy Learners: A lazy learner stores the training dataset first and then waits for the test dataset. In the case of a lazy learner, classification is based on the most closely related data in the training dataset. Training takes less time, but prediction takes longer.
Non-linear Models
1. K-Nearest Neighbours
2. Kernel SVM
3. Naïve Bayes
4. Decision Tree Classification
5. Random Forest Classification
K-Nearest Neighbor(KNN) Algorithm for Machine Learning
K-Nearest Neighbour is one of the simplest Machine Learning algorithms
based on Supervised Learning technique.
The K-NN algorithm assumes similarity between the new data and the available data and puts the new data into the category that is most similar to the available categories.
In the k-Nearest Neighbours (k-NN) algorithm k is just a number that tells the
algorithm how many nearby points (neighbours) to look at when it makes a
decision.
The K-NN algorithm stores all the available data and classifies a new data point based on similarity. This means that when new data appears, it can easily be classified into a well-suited category using the K-NN algorithm.
K-NN algorithm can be used for Regression as well as for Classification but
mostly it is used for the Classification problems.
K-NN is a non-parametric algorithm, which means it does not make any
assumption on underlying data.
It is also called a lazy learner algorithm because it does not learn from the training set immediately; instead, it stores the dataset and, at the time of classification, performs an action on it.
At the training phase, the KNN algorithm just stores the dataset; when it gets new data, it classifies that data into the category most similar to the new data.
• Example: Suppose we have an image of a creature that looks similar to both a cat and a dog, and we want to know whether it is a cat or a dog. For this identification, we can use the KNN algorithm, as it works on a similarity measure. Our KNN model will find the features of the new image that are most similar to the cat and dog images and, based on the most similar features, put it into either the cat or the dog category.
Distance Metrics Used in KNN Algorithm
Euclidean distance is defined as the straight-line distance between two points
in a plane or space. You can think of it like the shortest path you would walk if
you were to go directly from one point to another.
Manhattan Distance is the total distance you would travel if you could only
move along horizontal and vertical lines (like a grid or city streets). It’s also
called “taxicab distance” because a taxi can only drive along the grid-like
streets of a city.
Both are special cases of the Minkowski distance: when p = 2 it reduces to the formula for the Euclidean distance, and when p = 1 we obtain the formula for the Manhattan distance.
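For reference, for two points $x = (x_1, \dots, x_n)$ and $y = (y_1, \dots, y_n)$, the standard formulas are:

$$d_{\text{Euclidean}}(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}, \qquad d_{\text{Manhattan}}(x, y) = \sum_{i=1}^{n} |x_i - y_i|, \qquad d_{\text{Minkowski}}(x, y) = \left(\sum_{i=1}^{n} |x_i - y_i|^p\right)^{1/p}$$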
How to select the value of K in the K-NN Algorithm?
In the k-nearest neighbors (KNN) algorithm, k is a variable that specifies the
number of nearest neighbors to consider when classifying a query point.
Choosing the right k: The value of k can impact the accuracy of the algorithm. With a small k, each individual neighbour has a larger influence on the result (so the prediction is more sensitive to noise), while a larger k leads to a smoother decision boundary.
Tips for choosing k
Use an odd number for k to avoid ties in classification
Use cross-validation to find the optimal k for your dataset
Use the elbow method to plot the model's error rate or accuracy for different
values of k
Use grid search to find the best value of k
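As a minimal sketch of the cross-validation approach (using scikit-learn on a synthetic dataset; the dataset and the range of k values are illustrative):

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Illustrative synthetic dataset
X, y = make_classification(n_samples=300, n_features=5, random_state=42)

# Grid search over odd values of k with 5-fold cross-validation
param_grid = {"n_neighbors": [1, 3, 5, 7, 9, 11, 13, 15]}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(X, y)

print("Best k:", search.best_params_["n_neighbors"])
print("Cross-validated accuracy:", search.best_score_)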
Working of KNN algorithm
The K-Nearest Neighbors (KNN) algorithm operates on the principle of similarity: it predicts the label or value of a new data point by considering the labels or values of its K nearest neighbours in the training dataset.
Step 1: Selecting the optimal value of K
K represents the number of nearest neighbours that need to be considered while making a prediction.
Step 2: Calculating distance
To measure the similarity between the target point and the training data points, Euclidean distance is used. The distance is calculated between each data point in the dataset and the target point.
Step 3: Finding Nearest Neighbors
The k data points with the smallest distances to the target point are nearest
neighbors.
Step 4: Voting for Classification or Taking Average for Regression
When you want to classify a data point into a category (like spam or not
spam), the K-NN algorithm looks at the K closest points in the dataset. These
closest points are called neighbors. The algorithm then looks at which
category the neighbors belong to and picks the one that appears the most. This
is called majority voting.
In regression, the algorithm still looks for the K closest points. But instead of
voting for a class in classification, it takes the average of the values of those K
neighbors. This average is the predicted value for the new point for the
algorithm.
The figure shows how a test point is classified based on its nearest neighbours. As the test point moves, the algorithm identifies the closest k data points (5 in this case) and assigns the test point the majority class label among them (the grey class here).
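A minimal NumPy sketch of these four steps for classification (the training data and the choice of k are illustrative):

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=5):
    # Step 2: Euclidean distance from the query point to every training point
    distances = np.sqrt(((X_train - x_query) ** 2).sum(axis=1))
    # Step 3: indices of the k nearest neighbours
    nearest = np.argsort(distances)[:k]
    # Step 4: majority vote among the neighbours' labels
    votes = Counter(y_train[nearest])
    return votes.most_common(1)[0][0]

# Illustrative data: two 2-D clusters labelled 0 and 1
X_train = np.array([[1, 1], [1, 2], [2, 1], [6, 6], [7, 6], [6, 7]], dtype=float)
y_train = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_train, y_train, np.array([1.5, 1.5]), k=3))  # expected: 0
print(knn_predict(X_train, y_train, np.array([6.5, 6.5]), k=3))  # expected: 1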
Pros of KNN: High accuracy, insensitive to outliers, no assumptions about the data.
Linear Regression
•Type: Parametric
•Goal: To find the best-fitting linear relationship between independent and
dependent variables.
•How it works: Finds the line (or hyperplane in higher dimensions) that
minimizes the sum of squared errors between predicted and actual values.
•Training: Estimates coefficients for the linear equation.
•Prediction: Uses the equation to predict values for new data points.
•Assumptions: Assumes a linear relationship between variables.
•Advantages:
•Simple to understand and implement.
•Computationally efficient.
•Provides interpretable results (coefficients show the relationship between
variables).
•Disadvantages:
•Can only model linear relationships.
•Sensitive to outliers.
K-Nearest Neighbors (KNN)
•Type: Non-parametric
•Goal: To predict the value of a data point based on the values of its 'k' nearest
neighbors in the training data.
•How it works:
1.Finds the 'k' closest data points in the training set to the new data point.
2.Predicts the value based on the majority class (for classification) or
average value (for regression) of those neighbors.
•Training: Essentially memorizes the training data.
•Prediction: Calculates distances to all training points for each new data point.
Naïve Bayes Classifier
For example, let's solve the following problem: if the weather is sunny, should the player play or not?
• P(Yes | Sunny) > P(No | Sunny) ⇒ So on a sunny day, the player can play the game.
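As a worked sketch of this comparison (the counts are illustrative, assuming 14 recorded days with 10 "Yes" and 4 "No", and 5 sunny days of which 3 were "Yes"):

$$P(\text{Yes}\mid\text{Sunny}) = \frac{P(\text{Sunny}\mid\text{Yes})\,P(\text{Yes})}{P(\text{Sunny})} = \frac{(3/10)(10/14)}{5/14} = 0.60$$

$$P(\text{No}\mid\text{Sunny}) = \frac{P(\text{Sunny}\mid\text{No})\,P(\text{No})}{P(\text{Sunny})} = \frac{(2/4)(4/14)}{5/14} = 0.40$$

Since 0.60 > 0.40, the classifier predicts that the player can play.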
Advantages of Naïve Bayes Classifier
Naïve Bayes is one of the fastest and easiest ML algorithms to use for predicting the class of a dataset.
It can be used for Binary as well as Multi-class Classifications.
It performs well in Multi-class predictions as compared to the other
Algorithms.
It is the most popular choice for text classification problems.
Disadvantages of Naïve Bayes Classifier
Naive Bayes assumes that all features are independent or unrelated, so it
cannot learn the relationship between features.
Applications of Naïve Bayes Classifier
It is used for Credit Scoring.
It is used in medical data classification.
It can be used in real-time predictions because Naïve Bayes Classifier is an
eager learner.
It is used in Text classification such as Spam filtering and Sentiment
analysis.
Types of Naïve Bayes Model
There are three types of Naïve Bayes model, which are given below:
Gaussian: The Gaussian model assumes that features follow a normal distribution. This means that if the predictors take continuous values instead of discrete ones, the model assumes these values are sampled from a Gaussian distribution.
Multinomial: The Multinomial Naïve Bayes classifier is used when the data is multinomially distributed. It is primarily used for document classification problems, i.e., determining which category a particular document belongs to, such as sports, politics, education, etc.
Bernoulli: The Bernoulli classifier works similarly to the Multinomial classifier, but the predictor variables are independent Boolean (binary) variables, for example whether or not a particular word is present in a document.
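A minimal scikit-learn sketch of a Gaussian Naïve Bayes classifier (the synthetic dataset is illustrative):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# Illustrative dataset with continuous features (suits the Gaussian model)
X, y = make_classification(n_samples=500, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = GaussianNB()
model.fit(X_train, y_train)

print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))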
Some of the important concepts in SVM which will be used frequently are as
follows.
Hyperplane:
There can be multiple lines/decision boundaries to segregate the classes in
n-dimensional space, but we need to find out the best decision boundary
that helps to classify the data points. This best boundary is known as the
hyperplane of SVM.
The dimensions of the hyperplane depend on the number of features present in the dataset: if there are 2 features (as shown in the image), the hyperplane will be a straight line, and if there are 3 features, the hyperplane will be a 2-dimensional plane.
We always create a hyperplane that has a maximum margin, which means the maximum distance between the hyperplane and the nearest data points.
Support Vectors:
The data points or vectors that are closest to the hyperplane and which affect the position of the hyperplane are termed support vectors. Since these vectors support the hyperplane, they are called support vectors.
Example: SVM can be understood with the example used for the KNN classifier. Suppose we see a strange cat that also has some features of dogs, and we want a model that can accurately identify whether it is a cat or a dog; such a model can be created using the SVM algorithm. We first train our model with many images of cats and dogs so that it can learn their different features, and then test it with this strange creature. The SVM creates a decision boundary between the two classes (cat and dog) and chooses the extreme cases (support vectors), so it will see the extreme cases of cat and dog. On the basis of the support vectors, it will classify the creature as a cat. Consider the below diagram:
Example of Support Vector Machine
SVM algorithm can be understood better with the following example. Suppose we
want to build a model that can accurately identify whether the given fruit is an
apple or banana.
We first train our model with many images of apples and bananas so that it can learn their different features, and then test it with a new fruit. The SVM creates a decision boundary between the two classes (apple and banana) and chooses the support vectors; on the basis of these support vectors, it classifies the new fruit as an apple or a banana. We can understand the example with the below diagram.
Types of SVM
Linear SVM:
Let’s consider two independent variables x1, x2 and one dependent variable
which is either a blue circle or a red circle.
From the figure above it is clear that there are multiple lines (our hyperplane here is a line because we are considering only two input features x1, x2) that segregate our data points, i.e., perform a classification between the red and blue circles. So how do we choose the best line, or in general the best hyperplane, that segregates our data points?
The SVM algorithm helps to find the best line or decision boundary; this best boundary or region is called a hyperplane.
The SVM algorithm finds the closest points of the lines from both classes. These points are called support vectors.
The distance between the support vectors and the hyperplane is called the margin, and the goal of SVM is to maximize this margin.
The hyperplane with maximum margin is called the optimal
hyperplane.
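A minimal scikit-learn sketch of a linear SVM on a linearly separable toy dataset (the blob data and the value of C are illustrative):

from sklearn.datasets import make_blobs
from sklearn import svm

# Two well-separated clusters, so a linear hyperplane suffices
X, y = make_blobs(n_samples=200, centers=2, cluster_std=1.0, random_state=6)

clf = svm.SVC(kernel='linear', C=1.0)
clf.fit(X, y)

print("Support vectors:\n", clf.support_vectors_)
print("Hyperplane coefficients (w):", clf.coef_, "intercept (b):", clf.intercept_)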
Selecting the best hyper-plane:
One reasonable choice as the best hyperplane is the one that represents the
largest separation or margin between the two classes.
So we choose the hyperplane whose distance from it to the nearest data
point on each side is maximized.
If such a hyperplane exists it is known as the maximum-margin
hyperplane/hard margin. So from the above figure, we choose L2.
Here we have one blue ball in the boundary of the red balls. So how does SVM classify the data? It's simple! The blue ball in the boundary of the red ones is an outlier of the blue balls. The SVM algorithm has the characteristic of ignoring the outlier and finding the best hyperplane that maximizes the margin. SVM is robust to outliers.
What to do if data are not linearly separable?
SVM solves this by creating a new variable using a kernel. We call a point xi on the line and we create a new variable yi as a function of its distance from the origin o. If we plot this, we get something like the figure shown below. In this case, the new variable y is created as a function of distance from the origin. A non-linear function that creates such a new variable is referred to as a kernel.
Non-linear SVM
Non-Linear SVM is used in the case of non-linearly separated data. It means
if a straight line cannot classify a dataset, then such data is termed non-linear
data, and we can use the Non-linear SVM classifier.
Nonlinear SVM was introduced when the data cannot be separated by a linear
decision boundary in the original feature space. The kernel function computes
the similarity between data points allowing SVM to capture complex patterns
and nonlinear relationships between features. This enables non-linear SVM to form curved or circular decision boundaries with the help of kernel functions.
Polynomial Kernel
SVM Kernel:
The SVM kernel is a function that takes low dimensional input
space and transforms it into higher-dimensional space, i.e. it
converts not separable problem to separable problem. It is mostly
useful in non-linear separation problems.
Simply put, the kernel performs some fairly complex data transformations and then finds the process to separate the data based on the labels or outputs defined.
The most interesting feature of SVM is that it can even work with a non-linear dataset; for this, we use the "Kernel Trick", which makes it easier to classify the points. Suppose we have a dataset like this:
Here we see we cannot draw a single line or say hyperplane which can
classify the points correctly. So what we do is try converting this lower
dimension space to a higher dimension space using some quadratic
functions which will allow us to find a decision boundary that clearly
divides the data points. These functions which help us do this are called
Kernels and which kernel to use is purely determined by hyperparameter
tuning.
import numpy as np
from sklearn import datasets as ds
from sklearn import svm
import matplotlib.pyplot as plt
%matplotlib inline

# Generate a non-linearly separable dataset (two concentric circles)
X, y = ds.make_circles(n_samples=500, noise=0.06)
plt.scatter(X[:, 0], X[:, 1], c=y, marker='.')
plt.show()

# Train an SVM with the RBF kernel
classifier_non_linear = svm.SVC(kernel='rbf', C=1.0)
classifier_non_linear.fit(X, y)

def boundary_plot(m, axis=None):
    # Draw the decision boundary (level 0 of the decision function) on the current axes
    if axis is None:
        axis = plt.gca()
    limit_x = axis.get_xlim()
    limit_y = axis.get_ylim()
    x_lines = np.linspace(limit_x[0], limit_x[1], 30)
    y_lines = np.linspace(limit_y[0], limit_y[1], 30)
    YY, XX = np.meshgrid(y_lines, x_lines)
    xy = np.vstack([XX.ravel(), YY.ravel()]).T
    Plot = m.decision_function(xy).reshape(XX.shape)
    axis.contour(XX, YY, Plot, levels=[0], alpha=0.6, linestyles=['-'])

# Plot the data first so the axis limits reflect it, then overlay the boundary
plt.scatter(X[:, 0], X[:, 1], c=y, s=55)
boundary_plot(classifier_non_linear)
# Highlight the support vectors
plt.scatter(classifier_non_linear.support_vectors_[:, 0],
            classifier_non_linear.support_vectors_[:, 1],
            s=55, lw=1, facecolors='none')
plt.show()
Advantages of SVM:
• SVM works well when there is a clear margin of separation and is effective in high-dimensional spaces.
• It is memory-efficient, since the decision function uses only a subset of the training points (the support vectors).
• It is versatile: different kernel functions can be specified for the decision function.
Decision Tree Classification
• To build a decision tree, you need to make a first decision on the dataset to dictate which feature is used to split the data.
• To determine this, you try every feature and measure which split will give you
the best results. After that, you’ll split the dataset into subsets.
• The subsets will then traverse down the branches of the first decision node.
• If the data on the branches is the same class, then you’ve properly classified it
and don’t need to continue splitting it.
• If the data isn’t the same, then you need to repeat the splitting process on this
subset.
• The decision on how to split this subset is done the same way as the original
dataset, and you repeat this process until you’ve classified all the data.
Pseudo-code for a function called createBranch() would look like this:
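A sketch of the recursive procedure, following the steps described above:

Check if every item in the dataset is in the same class:
    If so, return the class label
    Else
        Find the best feature to split the data
        Split the dataset
        Create a branch node
        For each split
            Call createBranch() and add the result to the branch node
        Return the branch node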
Stacking, bagging, and boosting are the three most popular ensemble
learning techniques.
Each technique is used for a different purpose, with the use of each
depending on varying factors.
Bias and Variance in Machine Learning
To evaluate models, we can use MSE (Mean Squared Error) and Mean Absolute Error for regression, and Precision, Recall, and ROC (Receiver Operating Characteristic) curves for a classification problem.
In a similar way, Bias and Variance help us in parameter tuning and deciding
better-fitted models among several built.
Bias
Bias is the inability of the model to capture the true relationship in the data, because of which there is some difference or error between the model's predicted value and the actual value.
These differences between the actual (or expected) values and the predicted values are known as bias error, or error due to bias.
Low Bias: A low bias value means fewer assumptions are made about the form of the target function. In this case, the model will closely match the training dataset.
High Bias: A high bias value means more assumptions are made about the form of the target function. In this case, the model will not match the training dataset closely.
Ways to reduce high bias in Machine Learning:
Use a more complex model: One of the main reasons for high bias is an overly simplified model that cannot capture the complexity of the data. In such cases, we can make our model more complex, for example by increasing the number of hidden layers in a deep neural network, or by using a more expressive model such as polynomial regression for non-linear datasets, a CNN for image processing, or an RNN for sequence learning.
Increase the number of features: Adding more features to the training data increases the complexity of the model and improves its ability to capture the underlying patterns in the data.
Increase the size of the training data: Increasing the size of the training data can
help to reduce bias by providing the model with more examples to learn from the
dataset.
Variance
Variance is the measure of spread in data from its mean position.
More specifically, variance is the variability of the model, i.e., how sensitive it is to a different subset of the training dataset, or how much its predictions change when it is fit on a new subset of the training data.
Variance shows how much the model's predictions vary when there is a slight variation in the data. If model accuracies on training and test data differ greatly, the model has high variance.
A model with high variance can even fit the noise in the training data, but it lacks generalization to new, unseen data.
Trade-off between bias and variance?
High Bias, Low Variance: A model with high bias and low variance is said to be
underfitting.
High Variance, Low Bias: A model with high variance and low bias is said to be
overfitting.
High-Bias, High-Variance: A model has both high bias and high variance, which means
that the model is not able to capture the underlying patterns in the data (high bias) and is
also too sensitive to changes in the training data (high variance). As a result, the model
will produce inconsistent and inaccurate predictions on average.
Low Bias, Low Variance: A model that has low bias and low variance means that the
model is able to capture the underlying patterns in the data (low bias) and is not too
sensitive to changes in the training data (low variance).
A model with balanced bias and variance is said to have optimal generalization
performance. This means that the model is able to capture the underlying patterns in
the data without overfitting or underfitting.
The model is likely to be just complex enough to capture the complexity of the data,
but not too complex to overfit the training data.
This can happen when the model has been carefully tuned to achieve a good balance
between bias and variance, by adjusting the hyperparameters and selecting an
appropriate model architecture.
The total error is the sum of the bias error and the variance error (plus irreducible noise). The optimal region is the area where bias and variance are balanced, corresponding to the model complexity with minimum total error.
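For squared-error loss, this trade-off can be written as the standard bias-variance decomposition, where $\sigma^2$ is the irreducible noise:

$$\mathbb{E}\left[(y - \hat{f}(x))^2\right] = \left(\mathrm{Bias}[\hat{f}(x)]\right)^2 + \mathrm{Var}[\hat{f}(x)] + \sigma^2$$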
Bagging
Ensemble is a machine learning concept in which multiple models are
trained using the same learning algorithm.
Bagging is a way to decrease the variance in the prediction by generating additional data for training from the original dataset, using sampling with replacement to produce multiple sets (multi-sets) of the original data.
Bagging is a popular ensemble learning technique that focuses on reducing
variance and improving the stability of machine learning models.
The term “bagging” is derived from the idea of creating multiple subsets or
bags of the training data through a process known as bootstrapping.
Bootstrapping involves randomly sampling the dataset with replacement
to generate multiple subsets of the same size as the original data. Each of
these subsets is then used to train a base learner independently.
• Bagging is used when the goal is to reduce the variance of a decision tree
classifier.
• Here the objective is to create several subsets of data from the training sample, chosen randomly with replacement.
• Each subset of data is used to train its own decision tree.
• As a result, we get an ensemble of different models. The average of all the predictions from the different trees is used, which is more robust than a single decision tree classifier.
Bagging Steps:
The sampling and training steps described below are repeated n times, and the prediction is given based on the aggregation of the predictions from the n trees.
Bagging (Bootstrap Aggregating) is an ensemble learning technique designed
to improve the accuracy and stability of machine learning algorithms. It
involves the following steps:
Data Sampling: Creating multiple subsets of the training dataset using
bootstrap sampling (random sampling with replacement).
Model Training: Training a separate model on each subset of the data.
Aggregation: Combining the predictions from all individual models
(averaged for regression or majority voting for classification) to produce
the final output.
Key Benefits:
Reduces Variance: By averaging multiple predictions, bagging reduces
the variance of the model and helps prevent overfitting.
Improves Accuracy: Combining multiple models usually leads to
better performance than individual models.
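A minimal scikit-learn sketch of bagging with decision trees as base learners (the dataset and number of estimators are illustrative):

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# 50 base learners (decision trees by default), each trained on a bootstrap sample
bagging = BaggingClassifier(n_estimators=50, bootstrap=True, random_state=1)
bagging.fit(X_train, y_train)

print("Test accuracy:", bagging.score(X_test, y_test))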
Weighted averaging involves giving all models different weights depending on their
predictive power.
In other words, it gives more weight to the model with the highest predictive power.
This is because the learner with the highest predictive power is considered the most
important.
Boosting is used to create a collection of predictors.
In this technique, learners are trained sequentially, with early learners fitting simple models to the data and then analysing the data for errors.
Consecutive trees (fit on re-weighted or re-sampled data) are added, and at every step the goal is to improve the accuracy relative to the prior tree.
There are several types of boosting algorithms; some of the most famous and useful models are AdaBoost, Gradient Boosting, and XGBoost.
Disadvantages:
Prone to over-fitting.
Requires careful tuning of different hyper-parameters.
AdaBoost
AdaBoost is short for adaptive boosting.
AdaBoost is an ensemble learning method (also known as “meta-
learning”) initially created to increase the efficiency of binary classifiers.
It is done by building a model by using weak models in series.
AdaBoost uses an iterative approach to learn from the mistakes of weak
classifiers, and turn them into strong ones.
The basic concept behind AdaBoost is to set the weights of the classifiers and re-weight the training samples in each iteration so that accurate predictions are ensured even for unusual (hard-to-classify) observations.
Any machine learning algorithm can be used as base classifier if it accepts
weights on the training set.
This algorithm updates the weights attached to each of the misclassified
training data samples and of the corresponding weak learners.
The weighted sample is then used to train the next base learner which
would intuitively focus more on the samples with greater weight
assigned to them and try to make better predictions.
The results would be re-weighted for the misclassified samples and fed
into the next individual learner.
• We see that the accuracy differs when we build different models on the same dataset.
• But what if we use combinations of all these algorithms for making the
final prediction?
• We’ll get more accurate results by taking the average of results from
these models. We can increase the prediction power in this way.
• Boosting algorithms work in a similar way: they combine multiple models (weak learners) to reach the final output (a strong learner).
• Predictions are made by calculating the weighted average of the
weak classifiers.
AdaBoost works this way:
A weight is applied to every example in the training data.
We’ll call the weight vector D. Initially, these weights are all equal.
A weak classifier is first trained on the training data.
The errors from the weak classifier are calculated, and the weak
classifier is trained a second time with the same dataset.
This second time the weak classifier is trained, the weights of the
training set are adjusted so the examples properly classified the first
time are weighted less and the examples incorrectly classified in the
first iteration are weighted more.
To get one answer from all of these weak classifiers, AdaBoost assigns
alpha values to each of the classifiers.
The alpha values are based on the error of each weak classifier.
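In the standard AdaBoost formulation, for a weak classifier with weighted error rate $\varepsilon$, the classifier weight is

$$\alpha = \frac{1}{2}\ln\left(\frac{1-\varepsilon}{\varepsilon}\right)$$

and the example weights are updated (before renormalizing) as $D_i \leftarrow D_i\,e^{-\alpha}$ for correctly classified examples and $D_i \leftarrow D_i\,e^{+\alpha}$ for misclassified ones. A minimal scikit-learn sketch (the dataset and number of estimators are illustrative):

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, random_state=2)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=2)

# 100 sequentially trained weak learners (decision stumps by default)
booster = AdaBoostClassifier(n_estimators=100, random_state=2)
booster.fit(X_train, y_train)

print("Test accuracy:", booster.score(X_test, y_test))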
Random Forest Algorithm