Unit-3(1)
Definition
Ensemble learning combines multiple base models to reduce error-causing factors, thereby ensuring the accuracy and stability of machine learning (ML) algorithms.
Ensemble learning helps improve machine learning results by combining
several models. This approach allows the production of better predictive
performance compared to a single model. The basic idea is to learn a set of classifiers (experts) and to allow them to vote.
Ensemble Methods
Bagging
Boosting
Stacking Classifier
Voting Classifier

Voting Classifier:
Hard Voting: Voting is calculated on the predicted output class (majority vote).
Soft Voting: Voting is calculated on the predicted probability of the output class.
Fig: Left: Hard Voting, Right: Soft Voting
Implementation:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

clf1 = LogisticRegression(random_state=42)
clf2 = RandomForestClassifier(random_state=42)
clf3 = GaussianNB()
clf4 = SVC(probability=True, random_state=42)  # probability=True is required for soft voting

# soft voting averages the predicted class probabilities, with RF weighted twice as much
eclf = VotingClassifier(
    estimators=[('LR', clf1), ('RF', clf2), ('GNB', clf3), ('SVC', clf4)],
    voting='soft',
    weights=[1, 2, 1, 1])
eclf.fit(X_train, y_train)
Comparing the scores, the voting classifier boosts performance compared to its individual base estimators.
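One quick way to reproduce such a comparison is cross-validation accuracy for each base estimator and for the ensemble. A minimal sketch, assuming X_train and y_train are defined as above:

from sklearn.model_selection import cross_val_score

# compare each base estimator with the soft-voting ensemble
for name, clf in [('LR', clf1), ('RF', clf2), ('GNB', clf3), ('SVC', clf4), ('Voting', eclf)]:
    scores = cross_val_score(clf, X_train, y_train, cv=5, scoring='accuracy')
    print(f"{name}: mean accuracy = {scores.mean():.3f}")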
Bagging and Pasting
Bagging and pasting are techniques used to create varied subsets of the training data. The subsets produced by these techniques are then used to train the predictors of an ensemble.
For example, let's say we have a set of observations: [2, 4, 32, 8, 16]. If we want each bootstrap sample to contain n observations, the following are valid samples:
n=3: [32, 4, 4], [8, 16, 2], [2, 2, 2], …
Since we draw data with replacement, an observation can appear more than once in a single sample.
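A bootstrap sample can be drawn with a couple of lines of NumPy (a small illustrative sketch, not part of the original notes):

import numpy as np

observations = np.array([2, 4, 32, 8, 16])
# sample n=3 observations WITH replacement (bootstrap); duplicates are allowed
bootstrap_sample = np.random.choice(observations, size=3, replace=True)
# sample WITHOUT replacement (as used in pasting); every value is unique
pasting_sample = np.random.choice(observations, size=3, replace=False)
print(bootstrap_sample, pasting_sample)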
Bagging is an ensemble method in which we first bootstrap our data and, for each bootstrap sample, we train one model. After that, we aggregate the models with equal weights. When sampling is done without replacement, the method is called pasting.
Bootstrap Aggregating, also known as bagging, is a machine learning
ensemble meta-algorithm designed to improve the stability and
accuracy of machine learning algorithms used in statistical
classification and regression. It decreases the variance and helps to avoid overfitting. It is usually applied to decision tree methods. Bagging is a special case of the model averaging approach.
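A minimal Scikit-Learn sketch of bagging versus pasting (X_train and y_train are assumed to exist; the decision tree base estimator and hyperparameter values are illustrative choices):

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# bootstrap=True -> bagging (sampling with replacement)
bag_clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100,
                            max_samples=0.8, bootstrap=True, n_jobs=-1)
# bootstrap=False -> pasting (sampling without replacement)
paste_clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100,
                              max_samples=0.8, bootstrap=False, n_jobs=-1)
bag_clf.fit(X_train, y_train)
paste_clf.fit(X_train, y_train)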
Out-of-Bag Scoring
When sampling with replacement, if the training set is big enough, about 37% of its samples are never selected for a given predictor, and we can use them to test our model. This is called Out-of-Bag scoring, or OOB scoring.
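In Scikit-Learn, OOB evaluation can be requested with oob_score=True (a small sketch; the base estimator and training data are assumed as above):

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

oob_clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100,
                            bootstrap=True, oob_score=True, n_jobs=-1)
oob_clf.fit(X_train, y_train)
# accuracy estimated using, for each training instance, only the predictors
# that never saw it (its out-of-bag predictors)
print(oob_clf.oob_score_)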
Random Patches and Random Subspaces
The BaggingClassifier also supports sampling the input features, controlled by the max_features and bootstrap_features hyperparameters. Sampling both training instances and features is called the Random Patches method; keeping all training instances but sampling only the features is called the Random Subspaces method.
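A sketch of both options with BaggingClassifier (hyperparameter values are illustrative, and X_train/y_train are assumed as before):

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Random Patches: sample training instances AND features
patches_clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100,
                                max_samples=0.7, bootstrap=True,
                                max_features=0.7, bootstrap_features=True, n_jobs=-1)
# Random Subspaces: keep all instances, sample only the features
subspaces_clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100,
                                  max_samples=1.0, bootstrap=False,
                                  max_features=0.7, bootstrap_features=True, n_jobs=-1)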
Bagging, also known as Bootstrap Aggregation, is the ensemble technique used by random forest. Bagging chooses a random sample/random subset from the entire data set. Hence each model is generated from the samples (Bootstrap Samples) provided by the original data, drawn with replacement; this step of row sampling with replacement is called bootstrap. Each model is then trained independently and generates its own result. The final output is based on majority voting after combining the results of all models. This step, which involves combining all the results and generating an output based on majority voting, is known as aggregation.
Now let's look at an example by breaking it down with the help of the following figure. Here the bootstrap samples are taken from the actual data (Bootstrap sample 01, Bootstrap sample 02, and Bootstrap sample 03) with replacement, which means there is a high possibility that each sample won't contain only unique data. The models (Model 01, Model 02, and Model 03) obtained from these bootstrap samples are trained independently, and each model generates a result. In the figure, the Happy emoji has a majority when compared to the Sad emoji, so the final output obtained by majority voting is the Happy emoji.
Steps Involved in Random Forest Algorithm
Step 1: In the Random Forest model, a subset of data points and a subset of features is selected for constructing each decision tree. Simply put, n random records and m features are taken from the data set having k number of records.
Step 2: Individual decision trees are constructed for each sample.
Step 3: Each decision tree generates an output.
Step 4: The final output is considered based on majority voting (for classification) or averaging (for regression).
For example, consider a fruit basket as the data, as shown in the figure below. Now n number of samples are taken from the fruit basket, and an individual decision tree is constructed for each sample. Each decision tree will generate an output, as shown in the figure. The final output is considered based on majority voting; in the below figure, you can see that the class predicted by the majority of the decision trees is taken as the final output.
Important Features of Random Forest
Parallelization: Each tree is created independently out of different data and attributes. This means we can fully use the CPU to build random forests.
Train-test split: In a random forest, we don't have to segregate the data for train and test, as there will always be about 30% of the data which is not seen by a given decision tree.
Stability: Stability arises because the result is based on majority voting/averaging.

from sklearn.ensemble import RandomForestClassifier, BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

rnd_clf = RandomForestClassifier(n_estimators=500, max_leaf_nodes=16, n_jobs=-1)
rnd_clf.fit(X_train, y_train)
y_pred_rf = rnd_clf.predict(X_test)

# a Random Forest is roughly equivalent to a BaggingClassifier of randomized decision trees
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(splitter="random", max_leaf_nodes=16),
    n_estimators=500, bootstrap=True, n_jobs=-1)

Feature Importance
Yet another great quality of Random Forests is that they make it easy to measure the relative importance of each feature.
Scikit-Learn measures a feature’s importance by looking at how much
the tree nodes that use that feature reduce impurity on average (across
all trees in the forest). More precisely, it is a weighted average, where
each node’s weight is equal to the number of training samples that are
associated with it.
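For example, the importances can be inspected on the iris dataset (a small sketch; the dataset choice is illustrative):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
rnd_clf = RandomForestClassifier(n_estimators=500, n_jobs=-1)
rnd_clf.fit(iris["data"], iris["target"])
# feature_importances_ sums to 1.0 across all features
for name, score in zip(iris["feature_names"], rnd_clf.feature_importances_):
    print(name, round(score, 3))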
Boosting
Boosting (originally called hypothesis boosting) refers to any ensemble method that can combine several weak learners into a strong learner. The general idea is to train predictors sequentially, each trying to correct its predecessor.
Boosting Methods
1. AdaBoost
2. Gradient Boosting

There are several boosting algorithms; AdaBoost was the first really successful one. One way for a new predictor to correct its predecessor is to pay a bit more attention to the training instances that the predecessor underfitted. This results in new predictors focusing more and more on the hard cases. This is the technique used by AdaBoost. For example, the algorithm first trains a base classifier and uses it to make predictions on the training set. It then increases the relative weight of the misclassified training instances, trains a second classifier using the updated weights, and again makes predictions on the training set, and so on.
The predictor’s weight α is then computed using Equation 7-2, where η is the learning rate hyperparameter (defaults to 1). The more accurate the predictor is, the higher its weight will be; if it is just guessing randomly, its weight will be close to zero.
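For reference, the weighted error rate and the predictor weight referred to above (Equation 7-2 in the textbook) take roughly the following form, with $r_j$ the weighted error rate of the j-th predictor, $w^{(i)}$ the instance weights, and $\eta$ the learning rate:

$$ r_j = \frac{\sum_{i:\,\hat{y}_j^{(i)} \neq y^{(i)}} w^{(i)}}{\sum_{i=1}^{m} w^{(i)}}, \qquad \alpha_j = \eta \,\log\frac{1 - r_j}{r_j} $$

A predictor that only guesses randomly ($r_j \approx 0.5$) therefore gets a weight close to zero, while a more accurate predictor gets a larger weight.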
Scikit-Learn uses a multiclass version of AdaBoost called SAMME (Stagewise Additive Modeling using a Multiclass Exponential loss function). When there are just two classes, SAMME is equivalent to AdaBoost. If the predictors can estimate class probabilities (i.e., if they have a predict_proba() method), Scikit-Learn can use a variant of SAMME called SAMME.R (the R stands for “Real”), which relies on class probabilities rather than predictions and generally performs better. The usual base estimator is a Decision Stump, i.e. a Decision Tree with max_depth=1: a single decision node plus two leaf nodes. This is the default base estimator for the AdaBoostClassifier class.
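A corresponding Scikit-Learn sketch (X_train and y_train are assumed; 200 estimators and the learning rate are illustrative values):

from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# 200 decision stumps (trees with max_depth=1)
ada_clf = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),
                             n_estimators=200, learning_rate=0.5)
ada_clf.fit(X_train, y_train)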
Boosting is an ensemble modeling technique that attempts to
build a strong classifier from the number of weak classifiers. It
is done by building a model by using weak models in series.
Firstly, a model is built from the training data. Then the second
model is built which tries to correct the errors present in the
first model. This procedure is continued and models are added
until either the complete training data set is predicted correctly
or the maximum number of models is added. AdaBoost was
the first really successful boosting algorithm
developed for the purpose of binary classification. AdaBoost is
short for Adaptive Boosting and is a very popular boosting
technique that combines multiple “weak classifiers” into a single
“strong classifier”.
Algorithm:
1. Initialise the dataset and assign equal weight to each of the data points.
2. Provide this as input to the model and identify the wrongly classified data points.
3. Increase the weight of the wrongly classified data points.
4. If (got required results): go to step 5; else: go to step 2.
5. End
The above diagram explains the AdaBoost algorithm in a very simple way, following the stepwise process listed above.
Difference between AdaBoost and Gradient Boosting
The main difference between AdaBoost and Gradient Boosting lies in how each new model corrects its predecessor: AdaBoost increases the weights of the training instances that the previous predictor misclassified, whereas Gradient Boosting fits each new predictor to the residual errors (gradients of the loss) made by the previous predictor.
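A minimal Gradient Boosting sketch in Scikit-Learn (illustrative hyperparameters; X_train and y_train assumed as before):

from sklearn.ensemble import GradientBoostingClassifier

# each new tree is fitted to the pseudo-residuals (negative gradient) of the current ensemble
gb_clf = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3)
gb_clf.fit(X_train, y_train)
print(gb_clf.score(X_train, y_train))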
Support Vector Machine(SVM)
Support Vector Machine or SVM is one of the most popular Supervised
Learning algorithms, which is used for Classification as well as
Regression problems. Primarily, it is used for Classification problems in
Machine Learning. The goal of the SVM algorithm is to create the best
line or decision boundary that can segregate n-dimensional space into
classes so that we can easily put the new data point in the correct
category in the future. This best decision boundary is called a
hyperplane. SVM chooses the extreme points/vectors that help in
creating the hyperplane. These extreme cases are called support vectors, and hence the algorithm is termed Support Vector Machine.
Consider the below diagram in which there are two different categories
that are classified using a decision boundary or hyperplane:
Example:
Suppose we see a strange cat that also has some features of dogs. If we want a model that can accurately identify whether it is a cat or a dog, such a model can be created by using the SVM algorithm. We will first train our model with lots of images of cats and dogs so that it can learn about the different features of cats and dogs, and then we test it with this strange creature. Since the support vector machine creates a decision boundary between these two classes (cat and dog) and chooses the extreme cases (support vectors), it will look at the extreme cases of cats and dogs. On the basis of the support vectors, it will classify it as a cat.
Types of SVM
SVM can be of two types:
o Linear SVM: Linear SVM is used for linearly separable data, which means if a dataset can be classified into two classes by using a single straight line, then such data is termed linearly separable data, and the classifier used is called the Linear SVM classifier.
o Non-linear SVM: Non-linear SVM is used for non-linearly separable data, which means if a dataset cannot be classified by using a straight line, then such data is termed non-linear data, and the classifier used is called the Non-linear SVM classifier.
The data points or vectors that are closest to the hyperplane and which affect the position of the hyperplane are termed Support Vectors. Since these vectors support the hyperplane, they are called support vectors.
How does SVM work?
Suppose we have a dataset that has two tags (green and blue), and the
dataset has two features x1 and x2. We want a classifier that can classify
the pair(x1, x2) of coordinates in either green or blue. Consider the below
image:
As it is a 2-D space, by just using a straight line we can easily separate these two classes. But there can be multiple lines that can separate these classes. Consider the below image:
Hence, the SVM algorithm helps to find the best line or decision boundary; this best boundary or region is called a hyperplane. The SVM algorithm finds the closest points of the lines from both classes. These points are called support vectors. The distance between the vectors and the hyperplane is called the margin, and the goal of SVM is to maximize this margin. The hyperplane with the maximum margin is called the optimal hyperplane.
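In equation form, and under the assumption that the data is linearly separable, maximizing the margin can be written as the following (hard-margin) optimization problem, where the hyperplane is $w \cdot x + b = 0$ and $y_i \in \{-1, +1\}$ are the class labels:

$$ \min_{w,\,b}\ \tfrac{1}{2}\lVert w \rVert^2 \quad \text{subject to} \quad y_i\,(w \cdot x_i + b) \ge 1 \ \text{for all training points } x_i $$

The margin width equals $2/\lVert w \rVert$, so minimizing $\lVert w \rVert$ is the same as maximizing the margin.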
How to choose the Correct SVM
But the Support Vector Machine would choose HP1 even though it has a narrower margin, because although HP2 has the maximum margin, it goes against the constraint that each data point must lie on the correct side of the margin and there should be no misclassification. This constraint is the hard constraint.
Margins
Hard and Soft SVM
We can now clearly state that HP1 is a Hard SVM (left side) while HP2 is a Soft SVM (right side).
– As the value of C increases, the margin decreases, giving a Hard SVM.
– If the value of C is very small, the margin increases, giving a Soft SVM.
– A large value of C can cause overfitting; therefore we need to select the correct value using hyperparameter tuning.
The following Scikit-Learn code loads the iris dataset, scales the features,
and then trains a linear SVM model (using the LinearSVC class with C=1
and the hinge loss function, described shortly) to detect Iris virginica
flowers:
import numpy as np
from sklearn import datasets
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

iris = datasets.load_iris()
X = iris["data"][:, (2, 3)]  # petal length, petal width
y = (iris["target"] == 2).astype(np.float64)  # Iris virginica

svm_clf = Pipeline([
    ("scaler", StandardScaler()),
    ("linear_svc", LinearSVC(C=1, loss="hinge")),
])
svm_clf.fit(X, y)
So to separate these data points, we need to add one more dimension. For
linear data, we have used two dimensions x and y, so for non-linear data,
we will add a third dimension z. It can be calculated as:
z = x² + y²
By adding the third dimension, the sample space will become as below
image:
So now, SVM will divide the datasets into classes in the following way.
Consider the below image:
Since we are in 3-D space, it looks like a plane parallel to the x-axis. If we convert it back to 2-D space with z = 1, it becomes a circle: hence we get a circumference of radius 1 in the case of non-linear data.
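The same idea can be sketched in code: adding the explicit feature z = x² + y² makes circular data linearly separable, which is what kernels such as RBF do implicitly. The dataset below is synthetic and purely illustrative:

import numpy as np
from sklearn.svm import SVC

# two concentric "rings": inner ring = class 0, outer ring = class 1
rng = np.random.default_rng(42)
angles = rng.uniform(0, 2 * np.pi, 200)
radii = np.concatenate([rng.uniform(0, 1, 100), rng.uniform(2, 3, 100)])
X = np.column_stack([radii * np.cos(angles), radii * np.sin(angles)])
y = np.concatenate([np.zeros(100), np.ones(100)])

# explicit third dimension z = x^2 + y^2 makes the classes linearly separable
z = (X ** 2).sum(axis=1, keepdims=True)
lin_clf = SVC(kernel="linear").fit(np.hstack([X, z]), y)

# an RBF-kernel SVM learns an equivalent non-linear boundary without the manual feature
rbf_clf = SVC(kernel="rbf", gamma="scale").fit(X, y)
print(lin_clf.score(np.hstack([X, z]), y), rbf_clf.score(X, y))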
Adding Polynomial Features
Polynomial Kernel
Similarity Features
Gaussian RBF Kernel
Read the above topics by referring to the textbook, pages 218–223 (Hands-On Machine Learning).
Support Vector Regression
Support Vector Regression as the name suggests is a regression algorithm
that supports both linear and non-linear regressions. This method works
on the principle of the Support Vector Machine. SVR differs from SVM in
the way that SVM is a classifier used for predicting discrete categorical labels, while SVR is a regressor used for predicting continuous ordered variables.

In simple regression, the idea is to minimize the error rate, while in SVR the idea is to fit the error inside a certain threshold, which means the work of SVR is to approximate the best value within a given margin called the epsilon-tube.

Some important terms in SVR:
1. Kernel: A function that maps lower-dimensional data into higher-dimensional data. When the data points are not separable in the current dimension, we need a function that maps them into a higher dimension.
2. Hyperplane: In SVM this is the separating line between the data classes; in SVR it is the line that is used to predict the continuous target value.
3. Boundary Lines: These are the two lines that are drawn around the hyperplane at a certain distance, forming the margin (tube) within which errors are tolerated.
4. Support Vector: The vectors that are used to define the hyperplane, or we can say these are the extreme data points in the dataset that help in defining the hyperplane. These data points lie close to the boundary. In SVM the support vectors were used to define the hyperplane, but in SVR they are used to define the linear regression.
Working of SVR
SVR works on the principle of SVM with a few minor differences. Given data points, it tries to find a curve, but since it is a regression algorithm, instead of using the curve as a decision boundary it uses the curve to find the match between the vectors and the position of the curve. The support vectors help in determining the closest match between the data points and the function that represents them.
Consider these two red lines as the decision boundary and the
green line as the hyperplane. Our objective, when we are
moving on with SVR, is to basically consider the points
that are within the decision boundary line. Our best fit line
is the hyperplane that has a maximum number of points.
wx + b = +a
wx + b = -a
Our main aim here is to decide a decision boundary at ‘a’ distance
from the original hyperplane such that data points closest to the
hyperplane or the support vectors are within that boundary line.
Hence, we are going to take only those points that are within the
decision boundary and have the least error rate, or are within the
Margin of Tolerance. This gives us a better fitting model.
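In equation form, the idea described above can be sketched (ignoring slack variables for points that fall outside the tube) as:

$$ \min_{w,\,b}\ \tfrac{1}{2}\lVert w \rVert^2 \quad \text{subject to} \quad \lvert y_i - (w \cdot x_i + b) \rvert \le a \ \text{for all training points} $$

where $a$ (often written as $\varepsilon$) is the half-width of the tube formed by the two boundary lines wx + b = +a and wx + b = -a.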
What is a Support Vector Machine (SVM)?
So what exactly is Support Vector Machine (SVM)? We’ll start by
understanding SVM in simple terms. Let’s say we have a plot of
two label classes as shown in the figure below:
Can you decide what the separating line will be? You might have
come up with this:
The line fairly separates the classes. This is what SVM essentially
does – simple class separation. Now, what if the data was like this:
When we transform this line back to the original plane, it maps
to the circular boundary as I’ve shown here:
Implementing Support Vector Regression (SVR) in Python
Here, we have to predict the salary of an employee given a few
independent variables. A classic HR analytics project!
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

dataset = pd.read_csv('Position_Salaries.csv')
X = dataset.iloc[:, 1:2].values
y = dataset.iloc[:, 2].values

# scale the feature and the target separately
sc_X = StandardScaler()
sc_y = StandardScaler()
X = sc_X.fit_transform(X)
y = sc_y.fit_transform(y.reshape(-1, 1)).ravel()

regressor = SVR(kernel='rbf')
regressor.fit(X, y)

# predict for level 6.5: scale the input, predict, then un-scale the output
y_pred = regressor.predict(sc_X.transform([[6.5]]))
y_pred = sc_y.inverse_transform(y_pred.reshape(-1, 1))
# visualise the SVR fit on a fine grid of levels
X_grid = np.arange(X.min(), X.max(), 0.01).reshape(-1, 1)
plt.scatter(X, y, color = 'red')
plt.plot(X_grid, regressor.predict(X_grid), color = 'blue')
plt.title('Truth or Bluff (SVR)')
plt.xlabel('Position level')
plt.ylabel('Salary')
plt.show()

Let us now walk through the same example step by step.
dataset = pd.read_csv('/content/drive/MyDrive/Position_Salaries.csv')
# our dataset in this implementation is small, and thus we can print it all instead of viewing only the end
print(dataset)
The only significant feature in this dataset is the Level column. The Position column is just a description of the Level column, and therefore it adds no value to our analysis. Therefore, we will separate the dataset into a set of features and study variables.
As discussed above, we only have one feature in this
dataset. Therefore, we carry out our feature-study variable
separation as shown in the code below:
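The separation code itself is not reproduced in the notes; a plausible sketch (assuming the standard Position_Salaries.csv layout, where column 1 is Level and the last column is Salary) is:

# feature (Level) as a 2D array, study/target variable (Salary) as a 1D array
X_l = dataset.iloc[:, 1:-1].values
y_p = dataset.iloc[:, -1].values
print(X_l)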
Output:
[[ 1]
 [ 2]
 [ 3]
 [ 4]
 [ 5]
 [ 6]
 [ 7]
 [ 8]
 [ 9]
 [10]]
From this output, it’s clear that the X_l variable is a 2D
array. Similarly, we can have a look at the y_p variable:
print(y_p)
Output:
It’s seen from the output above that the y_p variable is a
vector, i.e., a 1D array.
The StandardScaler class, which we use to scale the data, takes in a 2D array; otherwise, it returns an error. Due to this, we have to reshape the y_p variable into a 2D array. We then create an object of the StandardScaler class for each variable and scale up the X_l and y_p variables separately as shown:
from sklearn.preprocessing import StandardScaler
StdS_X = StandardScaler()
StdS_y = StandardScaler()
X_l = StdS_X.fit_transform(X_l)
# reshape y_p into a 2D column vector before scaling
y_p = StdS_y.fit_transform(y_p.reshape(-1, 1))
Let’s simultaneously print and check if our two variables
were scaled.
print("Scaled X_l:")
print(X_l)
print("Scaled y_p:")
print(y_p)
Output:
Scaled X_l:
[[-1.5666989 ]
 [-1.21854359]
 [-0.87038828]
 [-0.52223297]
 [-0.17407766]
 [ 0.17407766]
 [ 0.52223297]
 [ 0.87038828]
 [ 1.21854359]
 [ 1.5666989 ]]
Scaled y_p:
[[-0.72004253]
 [-0.70243757]
 [-0.66722767]
 [-0.59680786]
 [-0.49117815]
 [-0.35033854]
 [-0.17428902]
 [ 0.17781001]
 [ 0.88200808]
 [ 2.64250325]]
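The plotting code is not included in the notes; a minimal sketch using the scaled variables would be:

import matplotlib.pyplot as plt

plt.scatter(X_l, y_p, color='red')
plt.xlabel('Level (scaled)')
plt.ylabel('Salary (scaled)')
plt.show()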
The plot shows a non-linear relationship between
the Levels and Salary.
Implementing SVR
To implement our model, first we need to import it from scikit-learn and create an object of it. Since our data is non-linear, we will use the Radial Basis Function (RBF) kernel. After declaring the kernel function, we fit the model on the data. The following program performs these steps:
# import the model
from sklearn.svm import SVR
# create the model object
regressor = SVR(kernel = 'rbf')
# fit the model on the data
regressor.fit(X_l, y_p)
Since the model is now ready, we can use it and make
predictions as shown:
A = regressor.predict(StdS_X.transform([[6.5]]))
print(A)
Output:
array([-0.27861589])
Since inverse_transform() expects a 2D array, we first reshape the prediction:
A = A.reshape(-1, 1)
print(A)
Output:
array([[-0.27861589]])
It is clear from the output above that it is now a 2D array. Using the inverse_transform() function of the target scaler, we can convert it to an unscaled value in the original dataset as shown:
# Taking the inverse of the scaled value
A_pred = StdS_y.inverse_transform(A)
print(A_pred)
Output:
array([[170370.0204065]])
Naïve Bayes Classifier
o Naïve: It is called Naïve because it assumes that the occurrence of a certain feature is independent of the occurrence of other features. For example, if a fruit is identified on the basis of color, shape, and taste, then a red, spherical, and sweet fruit is recognized as an apple. Hence each feature individually contributes to identifying that it is an apple, without depending on the others.
o Bayes: It is called Bayes because it depends on the principle of Bayes'
Theorem.
Bayes' Theorem:
o Bayes' theorem is also known as Bayes' Rule or Bayes' law, which is
used to determine the probability of a hypothesis with prior knowledge. It
depends on the conditional probability.
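The theorem states:

$$ P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)} $$

where P(A|B) is the posterior probability of hypothesis A given the observed evidence B, P(B|A) is the likelihood, P(A) is the prior probability of the hypothesis, and P(B) is the marginal probability of the evidence.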
Steps
Example
X is given as X = (x_1, x_2, ..., x_n).
Here x_1, x_2, ..., x_n represent the features, i.e. they can be mapped to outlook, temperature, humidity and windy. By substituting for X and expanding using the chain rule we get (where y is the class variable)

P(y | x_1, ..., x_n) = P(x_1 | y) P(x_2 | y) ... P(x_n | y) P(y) / (P(x_1) P(x_2) ... P(x_n))

Now, you can obtain the values for each term by looking at the dataset and substitute them into the equation. For all entries in the dataset, the denominator does not change; it remains static. Therefore, the denominator can be removed and a proportionality can be introduced.
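Written out, this gives the Naïve Bayes decision rule (y denotes the class variable and x_1, ..., x_n the features):

$$ P(y \mid x_1, \dots, x_n) \propto P(y)\prod_{i=1}^{n} P(x_i \mid y), \qquad \hat{y} = \arg\max_{y} P(y)\prod_{i=1}^{n} P(x_i \mid y) $$

In Scikit-Learn, the Gaussian variant of this classifier takes only a few lines; the sketch below uses the iris dataset purely as an illustrative stand-in for the weather-style data discussed above:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
nb_clf = GaussianNB()
nb_clf.fit(X_train, y_train)
print(nb_clf.score(X_test, y_test))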