
Unit-3(1)

dwdm

Uploaded by

tinaktm2004

Unit-III

Ensemble Learning and Random Forests

Suppose you ask a complex question to thousands of random people, then aggregate their answers. In many cases you will find that this aggregated answer is better than an expert's answer. This is called the wisdom of the crowd. Similarly, if you aggregate the predictions of a group of predictors (such as classifiers or regressors), you will often get better predictions than with the best individual predictor. A group of predictors is called an ensemble; thus, this technique is called Ensemble Learning, and an Ensemble Learning algorithm is called an Ensemble method.

For example, you can train a group of Decision Tree classifiers, each on a different random subset of the training set. To make predictions, you obtain the predictions of all individual trees, then predict the class that gets the most votes. Such an ensemble of Decision Trees is called a Random Forest, and despite its simplicity, this is one of the most powerful Machine Learning algorithms available today.

Definition

Ensemble learning is the process by which multiple models, such as classifiers or experts, are strategically generated and combined to solve a particular computational intelligence problem. Ensemble learning is primarily used to improve the performance of a model (in classification, prediction, function approximation, etc.), or to reduce the likelihood of an unfortunate selection of a poor one. Other applications of ensemble learning include assigning a confidence to the decision made by the model, selecting optimal (or near-optimal) features, data fusion, incremental learning, nonstationary learning, and error correcting. In learning models, noise, variance, and bias are the major sources of error. Ensemble methods help reduce these error-causing factors (variance and bias in particular), thereby improving the accuracy and stability of machine learning (ML) algorithms.

Example 1: If you are planning to buy an air-conditioner, would you enter a showroom and buy the air-conditioner that the salesperson shows you? The answer is probably no. In this day and age, you are likely to ask your friends, family, and colleagues for an opinion, do research on various portals about different models, and visit a few review sites before making a purchase decision. In a nutshell, you would not come to a conclusion directly. Instead, you would try to make a more informed decision after considering diverse opinions and reviews. The same principle applies in ensemble learning.

Ensemble learning helps improve machine learning results by combining several models. This approach gives better predictive performance than a single model. The basic idea is to learn a set of classifiers (experts) and to allow them to vote.

Ensemble Methods

• Bagging (Bootstrap Aggregation) and Pasting
• Boosting
• Stacking Classifier
• Voting Classifier

Voting Classifier:

A voting classifier is a machine learning estimator that trains several base models (estimators) and predicts by aggregating the outputs of each base estimator. The aggregation criterion is a vote over the estimators' outputs. The voting can be of two types:

• Hard Voting: each estimator votes for its predicted class, and the class with the most votes wins.
• Soft Voting: the predicted class probabilities are averaged across estimators, and the class with the highest average probability wins.

How can a Voting Classifier improve performance? The voting classifier aggregates the predicted classes (hard voting) or the predicted probabilities (soft voting). If we feed it a variety of base models, the errors made by any one model tend to be outvoted by the others, provided the models do not all make the same mistakes.

Fig: Left: Hard Voting, Right: Soft Voting
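
The two voting rules can be sketched in a few lines of plain Python. The probabilities below are made up for illustration, and the helper names hard_vote and soft_vote are hypothetical (they are not Scikit-Learn functions). The example is chosen so that the two rules disagree: two lukewarm classifiers outvote one confident classifier under hard voting, but not under soft voting.

```python
def hard_vote(predicted_labels):
    """Return the label predicted by the most classifiers."""
    counts = {}
    for label in predicted_labels:
        counts[label] = counts.get(label, 0) + 1
    return max(counts, key=counts.get)

def soft_vote(probabilities, weights=None):
    """Average per-class probabilities (optionally weighted); return best class and averages."""
    if weights is None:
        weights = [1.0] * len(probabilities)
    total = sum(weights)
    n_classes = len(probabilities[0])
    averaged = [
        sum(w * p[c] for w, p in zip(weights, probabilities)) / total
        for c in range(n_classes)
    ]
    return averaged.index(max(averaged)), averaged

# Made-up probabilities from three classifiers for classes 0 and 1.
probs = [[0.55, 0.45], [0.60, 0.40], [0.10, 0.90]]
labels = [p.index(max(p)) for p in probs]  # each classifier's hard prediction

hard_result = hard_vote(labels)
soft_result, soft_avg = soft_vote(probs)
print("hard voting picks class", hard_result)
print("soft voting picks class", soft_result, "with averages", soft_avg)
```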

Implementation:

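Scikit-learn offers an implementation of the Voting Classifier:

from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# Four diverse base estimators
clf1 = LogisticRegression(random_state=42)
clf2 = RandomForestClassifier(random_state=42)
clf3 = GaussianNB()
clf4 = SVC(probability=True, random_state=42)  # probability=True is needed for soft voting

# Soft voting, with the Random Forest counted twice as heavily
eclf = VotingClassifier(
    estimators=[('LR', clf1), ('RF', clf2), ('GNB', clf3), ('SVC', clf4)],
    voting='soft',
    weights=[1, 2, 1, 1])
eclf.fit(X_train, y_train)  # X_train, y_train: the training data, defined elsewhere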

For our sample classification dataset, we are training 4 base estimators: Logistic Regression, Random Forest, Gaussian Naive Bayes, and Support Vector Classifier.

The parameter voting='soft' or voting='hard' lets developers switch between soft and hard voting. The parameter weights can be tuned to give the better-performing base estimators more influence. It is a sequence of weights applied to the occurrences of predicted class labels (hard voting) or to the class probabilities before averaging (soft voting).

We are using a soft voting classifier and a weight distribution of [1,2,1,1], where twice the weight is assigned to the Random Forest model. Now let's observe the benchmark performance of each of the base estimators vis-a-vis the voting classifier.

According to the benchmark table, the voting classifier performs better than each of its base estimators.

Bagging & Pasting

Bagging and pasting are techniques used to create varied subsets of the training data. The subsets produced by these techniques are then used to train the predictors of an ensemble.

Bagging, short for bootstrap aggregating, creates each subset by sampling the training set with replacement. Pasting creates each subset by sampling the training set without replacement.

Bootstrapping

In statistics, bootstrapping refers to a resampling method that consists of repeatedly drawing samples, with replacement, from the data to form other datasets, called bootstrap samples. It is as if the bootstrapping method were running a set of simulations on our original dataset, so that in some cases we can estimate statistics such as the mean and the standard deviation, and how much they vary.

For example, let's say we have a set of observations: [2, 4, 32, 8, 16]. If we want each bootstrap sample to contain n observations, the following are valid samples:

• n=3: [32, 4, 4], [8, 16, 2], [2, 2, 2]…
• n=4: [2, 32, 4, 16], [2, 4, 2, 8], [8, 32, 4, 2]…

Since we draw data with replacement, an observation can appear more than once in a single sample.
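
As a small sketch of the difference between sampling with and without replacement, the following uses only Python's standard random module on the observations from the example. The function names and the seed are illustrative choices, not part of any library.

```python
import random

def bootstrap_sample(data, n, rng):
    """Draw n observations from data with replacement (bagging)."""
    return [rng.choice(data) for _ in range(n)]

def paste_sample(data, n, rng):
    """Draw n observations from data without replacement (pasting)."""
    return rng.sample(data, n)

observations = [2, 4, 32, 8, 16]
rng = random.Random(0)  # fixed seed so the run is repeatable

boot = [bootstrap_sample(observations, 3, rng) for _ in range(3)]
paste = [paste_sample(observations, 3, rng) for _ in range(3)]
print("bootstrap samples:", boot)   # values may repeat within a sample
print("pasting samples:  ", paste)  # values never repeat within a sample
```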

Bagging & Pasting

Bagging stands for bootstrap + aggregating. It is an ensemble method in which we first bootstrap our data and train one model on each bootstrap sample. After that, we aggregate the models' predictions with equal weights. When sampling is done without replacement, the method is called pasting.

Bootstrap aggregating, also known as bagging, is a machine learning ensemble meta-algorithm designed to improve the stability and accuracy of machine learning algorithms used in statistical classification and regression. It decreases the variance and helps to avoid overfitting. It is usually applied to decision tree methods. Bagging is a special case of the model averaging approach.

Implementation Steps of Bagging

Step 1: Multiple subsets are created from the original data set, each with the same number of tuples, by selecting observations with replacement.
Step 2: A base model is created on each of these subsets.
Step 3: The models are trained in parallel, each on its own training set and independently of the others.
Step 4: The final predictions are determined by combining the predictions from all the models.

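The bagging steps can be tried out with Scikit-Learn's BaggingClassifier. This is a minimal sketch: the iris dataset, the 100 estimators, and max_samples=0.6 are illustrative assumptions, not values from these notes. Setting bootstrap=False turns the same class into a pasting ensemble.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Bagging: each tree is trained on a subset drawn with replacement.
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=100,
    max_samples=0.6, bootstrap=True, random_state=42)

# Pasting: same idea, but each subset is drawn without replacement.
paste_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=100,
    max_samples=0.6, bootstrap=False, random_state=42)

bag_acc = bag_clf.fit(X_train, y_train).score(X_test, y_test)
paste_acc = paste_clf.fit(X_train, y_train).score(X_test, y_test)
print("bagging accuracy:", bag_acc)
print("pasting accuracy:", paste_acc)
```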
Out-of-Bag Scoring

If we are using bagging, there is a chance that a given sample is never selected, while others may be selected multiple times. In a single draw, the probability of not selecting a specific sample is (1 - 1/n), where n is the number of samples. Therefore, the probability that this sample is not picked in any of n draws is (1 - 1/n)^n. When n is large, this probability approaches 1/e, which is approximately 0.3679. This means that when the dataset is big enough, about 37% of its samples are never selected for a given predictor, and we can use them to test that predictor. This is called Out-of-Bag scoring, or OOB scoring.

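The 1/e limit is easy to check numerically. The sketch below uses only the standard library; the sample sizes and the seed are arbitrary illustrative choices.

```python
import math
import random

def prob_never_selected(n):
    """Probability that one given sample is missed in n draws with replacement."""
    return (1 - 1 / n) ** n

for n in (10, 100, 10_000):
    print(n, round(prob_never_selected(n), 4))
print("1/e =", round(1 / math.e, 4))

# Empirical check: draw one bootstrap sample and count the untouched indices.
rng = random.Random(0)
n = 10_000
drawn = {rng.randrange(n) for _ in range(n)}
oob_fraction = 1 - len(drawn) / n
print("out-of-bag fraction:", oob_fraction)
```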
Random Patches and Random Subspaces

The BaggingClassifier class supports sampling the features as well. Sampling is controlled by two hyperparameters: max_features and bootstrap_features. They work the same way as max_samples and bootstrap, but for feature sampling instead of instance sampling. Thus, each predictor will be trained on a random subset of the input features.

Random Patches samples both the training instances and the features. In BaggingClassifier() this corresponds to settings such as:

max_samples = 0.6, bootstrap_features = True, max_features = 0.6

Random Subspaces keeps all the training instances but samples the features:

bootstrap = False, max_samples = 1.0, bootstrap_features = True, max_features = 0.6
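
A hedged sketch of both configurations with BaggingClassifier follows. The iris dataset and the number of estimators are assumptions made for illustration; with 4 input features, max_features=0.6 gives each tree 2 features.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)  # 150 instances, 4 features

# Random Patches: sample both instances and features.
patches = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=50,
    max_samples=0.6, bootstrap=True,
    max_features=0.6, bootstrap_features=True, random_state=42)

# Random Subspaces: keep every instance, sample only the features.
subspaces = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=50,
    max_samples=1.0, bootstrap=False,
    max_features=0.6, bootstrap_features=True, random_state=42)

patches.fit(X, y)
subspaces.fit(X, y)

# estimators_features_ lists the feature indices each tree was given.
print("patches, first tree features:  ", patches.estimators_features_[0])
print("subspaces, first tree features:", subspaces.estimators_features_[0])
```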

Random Forest Algorithm

Random forest is a supervised machine learning algorithm that is widely used in classification and regression problems. It builds decision trees on different samples and takes their majority vote for classification, or their average in the case of regression.

One of the most important features of the Random Forest algorithm is that it can handle data sets containing continuous variables, as in the case of regression, and categorical variables, as in the case of classification. It often performs well on both classification and regression tasks.

Working of Random Forest Algorithm

Before understanding the working of the random forest algorithm, we must look into the ensemble learning technique. Ensemble simply means combining multiple models. Thus a collection of models is used to make predictions rather than an individual model.

Ensemble learning uses two main types of methods:

1. Bagging: It creates different training subsets from the training data by sampling with replacement, and the final output is based on majority voting. Example: Random Forest.
2. Boosting: It combines weak learners into a strong learner by building models sequentially, with each model trying to correct the errors of the previous ones, so that the final model has high accuracy. Examples: AdaBoost, XGBoost.

Bagging, also known as Bootstrap Aggregation, is the ensemble technique used by random forest. Bagging chooses a random sample (a random subset) from the entire data set. Each model is therefore trained on a sample (bootstrap sample) drawn from the original data with replacement, which is known as row sampling. This step of row sampling with replacement is called bootstrap. Each model is then trained independently and generates its results. The final output is based on majority voting after combining the results of all models. This step, which combines all the results and generates the output by majority voting, is known as aggregation.

Now let's look at an example by breaking it down with the help of the following figure. Here the bootstrap samples (Bootstrap sample 01, Bootstrap sample 02, and Bootstrap sample 03) are drawn from the actual data with replacement, which means there is a high possibility that a sample contains repeated data points. The models (Model 01, Model 02, and Model 03) are each trained independently on one of these bootstrap samples. Each model generates a result as shown. The Happy emoji has a majority over the Sad emoji, so based on majority voting the final output is the Happy emoji.

Steps Involved in Random Forest Algorithm

Step 1: In the Random Forest model, a subset of data points and a subset of features is selected for constructing each decision tree. Simply put, n random records and m features are taken from a data set having k records.
Step 2: Individual decision trees are constructed for each sample.
Step 3: Each decision tree will generate an output.
Step 4: The final output is based on majority voting for classification or averaging for regression.

For example, consider the fruit basket as the data, as shown in the figure below. n samples are taken from the fruit basket, and an individual decision tree is constructed for each sample. Each decision tree generates an output, as shown in the figure. The final output is based on majority voting. In the figure below, most of the decision trees output apple rather than banana, so the final output is apple.

Important Features of Random Forest

• Diversity: Not all attributes/variables/features are considered while making an individual tree; each tree is different.
• Less affected by the curse of dimensionality: Since each tree does not consider all the features, the feature space each tree works in is reduced.
• Parallelization: Each tree is created independently out of different data and attributes. This means we can make full use of the CPU to build random forests.
• Train-Test split: In a random forest, we do not strictly have to set aside separate test data, as roughly one third of the data is not seen by any given decision tree (its out-of-bag samples).
• Stability: Stability arises because the result is based on majority voting/averaging.

A Random Forest is an ensemble of Decision Trees, generally trained via the bagging method (or sometimes pasting), typically with max_samples set to the size of the training set. Instead of building a BaggingClassifier and passing it a DecisionTreeClassifier, you can use the RandomForestClassifier class, which is more convenient and optimized for Decision Trees:

from sklearn.ensemble import RandomForestClassifier

# 500 trees, each limited to 16 leaf nodes, trained on all CPU cores
rnd_clf = RandomForestClassifier(n_estimators=500, max_leaf_nodes=16, n_jobs=-1)
rnd_clf.fit(X_train, y_train)
y_pred_rf = rnd_clf.predict(X_test)

The following BaggingClassifier is roughly equivalent to the previous RandomForestClassifier:

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

bag_clf = BaggingClassifier(
    DecisionTreeClassifier(splitter="random", max_leaf_nodes=16),
    n_estimators=500, max_samples=1.0, bootstrap=True, n_jobs=-1)

Feature Importance

Yet another great quality of Random Forests is that they make it easy to measure the relative importance of each feature. Scikit-Learn measures a feature's importance by looking at how much the tree nodes that use that feature reduce impurity on average (across all trees in the forest). More precisely, it is a weighted average, where each node's weight is equal to the number of training samples that are associated with it.
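
As an illustration, a fitted RandomForestClassifier exposes these scores through its feature_importances_ attribute. The iris dataset and the number of trees below are assumptions chosen to keep the sketch small.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
rnd_clf = RandomForestClassifier(n_estimators=100, random_state=42)
rnd_clf.fit(iris["data"], iris["target"])

# feature_importances_ holds one score per feature; the scores sum to 1.
importances = rnd_clf.feature_importances_
for name, score in zip(iris["feature_names"], importances):
    print(name, round(score, 3))
```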

Boosting

Boosting is one of the techniques that use the concept of ensemble learning. A boosting algorithm combines multiple simple models (also known as weak learners or base estimators) to generate the final output. It does so by building a model from weak models in series.

Boosting (originally called hypothesis boosting) refers to any Ensemble method that can combine several weak learners into a strong learner.

Boosting Methods

1. AdaBoost (Adaptive Boosting)
2. Gradient Boosting

There are several boosting algorithms; AdaBoost was the first really successful boosting algorithm developed for the purpose of binary classification. AdaBoost is an abbreviation for Adaptive Boosting and is a prevalent boosting technique that combines multiple "weak classifiers" into a single "strong classifier."

AdaBoost

One way for a new predictor to correct its predecessor is to pay a bit more attention to the training instances that the predecessor underfitted. This results in new predictors focusing more and more on the hard cases. This is the technique used by AdaBoost. For example, when training an AdaBoost classifier, the algorithm first trains a base classifier (such as a Decision Tree) and uses it to make predictions on the training set. The algorithm then increases the relative weight of misclassified training instances. Then it trains a second classifier, using the updated weights, and again makes predictions on the training set, updates the instance weights, and so on.

Each instance weight w is initially set to 1/m, where m is the number of training instances. A first predictor is trained, and its weighted error rate r is computed on the training set.

The predictor's weight α is then computed using Equation 7-2, where η is the learning rate hyperparameter (defaults to 1). The more accurate the predictor is, the higher its weight will be.
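
For reference, the standard AdaBoost formulas for these quantities can be written as follows. The index j for the j-th predictor and the superscript (i) for the i-th training instance are notation assumed here; the formulas are the usual textbook ones, which is presumably what the equation number above refers to.

```latex
% Weighted error rate of the j-th predictor
r_j = \frac{\sum_{i=1,\ \hat{y}_j^{(i)} \neq y^{(i)}}^{m} w^{(i)}}{\sum_{i=1}^{m} w^{(i)}}

% Predictor weight, with learning rate \eta
\alpha_j = \eta \, \log \frac{1 - r_j}{r_j}

% Weight update for each instance i, followed by normalisation of all weights
w^{(i)} \leftarrow
\begin{cases}
w^{(i)} & \text{if } \hat{y}_j^{(i)} = y^{(i)} \\
w^{(i)} \exp(\alpha_j) & \text{if } \hat{y}_j^{(i)} \neq y^{(i)}
\end{cases}
```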

Scikit-Learn uses a multiclass version of AdaBoost called SAMME (which stands for Stagewise Additive Modeling using a Multiclass Exponential loss function). When there are just two classes, SAMME is equivalent to AdaBoost. If the predictors can estimate class probabilities (i.e., if they have a predict_proba() method), Scikit-Learn can use a variant of SAMME called SAMME.R (the R stands for "Real"), which relies on class probabilities rather than predictions and generally performs better.

The following code trains an AdaBoost classifier based on 200 Decision Stumps using Scikit-Learn's AdaBoostClassifier class (as you might expect, there is also an AdaBoostRegressor class). A Decision Stump is a Decision Tree with max_depth=1, in other words, a tree composed of a single decision node plus two leaf nodes. This is the default base estimator for the AdaBoostClassifier class:
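
A minimal sketch of such a classifier is given here. It assumes the iris data as a stand-in dataset and relies on the class's default decision-stump base estimator; the learning_rate value is an illustrative choice, and the boosting variant is left at whatever the installed Scikit-Learn version uses by default.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# No base estimator is passed, so the default Decision Stump is used.
ada_clf = AdaBoostClassifier(n_estimators=200, learning_rate=0.5, random_state=42)
ada_clf.fit(X_train, y_train)

ada_acc = ada_clf.score(X_test, y_test)
print("test accuracy:", ada_acc)
```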

Boosting is an ensemble modeling technique that attempts to build a strong classifier from a number of weak classifiers. It is done by building a model by using weak models in series. Firstly, a model is built from the training data. Then a second model is built which tries to correct the errors present in the first model. This procedure is continued and models are added until either the complete training data set is predicted correctly or the maximum number of models is added. AdaBoost was the first really successful boosting algorithm developed for the purpose of binary classification. AdaBoost is short for Adaptive Boosting and is a very popular boosting technique that combines multiple "weak classifiers" into a single "strong classifier".

Algorithm:

1. Initialise the dataset and assign equal weight to each of the data points.
2. Provide this as input to the model and identify the wrongly classified data points.
3. Increase the weight of the wrongly classified data points.
4. If the required results are obtained, go to step 5; else go to step 2.
5. End

The above diagram explains the AdaBoost algorithm in a very simple way. Let's try to understand it in a stepwise process:

• B1 consists of 10 data points of two types, plus (+) and minus (-): 5 are plus (+) and the other 5 are minus (-), and each one is assigned equal weight initially. The first model tries to classify the data points and generates a vertical separator line, but it wrongly classifies 3 plus (+) as minus (-).
• B2 consists of the same 10 data points, in which the 3 wrongly classified plus (+) are weighted more so that the current model tries harder to classify these pluses (+) correctly. This model generates a vertical separator line that correctly classifies the previously misclassified pluses (+), but in this attempt it wrongly classifies three minuses (-).
• B3 consists of the same 10 data points, in which the 3 wrongly classified minuses (-) are weighted more so that the current model tries harder to classify these minuses (-) correctly. This model generates a horizontal separator line that correctly classifies the previously misclassified minuses (-).
• B4 combines B1, B2, and B3 to build a strong prediction model which is much better than any individual model used.
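
The reweighting from B1 to B2 can be sketched numerically. The helper below is hypothetical and applies the standard AdaBoost update (error rate, predictor weight, boosted and normalised instance weights) to 10 equally weighted points of which 3 are misclassified; it is an illustration of the idea, not the exact procedure behind the diagram.

```python
import math

def adaboost_reweight(weights, misclassified, learning_rate=1.0):
    """One AdaBoost round: return (alpha, new normalised weights)."""
    total = sum(weights)
    r = sum(w for w, miss in zip(weights, misclassified) if miss) / total
    alpha = learning_rate * math.log((1 - r) / r)
    boosted = [w * math.exp(alpha) if miss else w
               for w, miss in zip(weights, misclassified)]
    norm = sum(boosted)
    return alpha, [w / norm for w in boosted]

# B1: 10 points with equal weight; the first model misclassifies 3 of them.
weights = [0.1] * 10
misclassified = [True] * 3 + [False] * 7

alpha, new_weights = adaboost_reweight(weights, misclassified)
print("alpha:", round(alpha, 3))
print("weight of a misclassified point:", round(new_weights[0], 4))
print("weight of a correct point:", round(new_weights[-1], 4))
```

After the update, the three misclassified points together carry half of the total weight, which is why the next model concentrates on them.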

Gradient Boosting

Gradient Boosting is a powerful boosting algorithm that combines several weak learners into a strong learner. Each new model is trained to reduce the loss function (such as mean squared error or cross-entropy) of the current ensemble using gradient descent. In each iteration, the algorithm computes the gradient of the loss function with respect to the predictions of the current ensemble and then trains a new weak model to fit the negative of this gradient. The predictions of the new model are then added to the ensemble, and the process is repeated until a stopping criterion is met.

In contrast to AdaBoost, the weights of the training instances are not tweaked; instead, each predictor is trained using the residual errors of its predecessor as labels. There is a technique called Gradient Boosted Trees whose base learner is CART (Classification and Regression Trees). The diagram below explains how gradient-boosted trees are trained for regression problems.

y(pred) = y1 + (eta * r1) + (eta * r2) + ... + (eta * rN)

Fig: Gradient Boosted Trees for Regression

The ensemble consists of M trees. Tree1 is trained using the feature matrix X and the labels y. Its predictions, labeled y1(hat), are used to determine the training set residual errors r1. Tree2 is then trained using the feature matrix X and the residual errors r1 of Tree1 as labels. The predicted results r1(hat) are then used to determine the residual r2. The process is repeated until all the M trees forming the ensemble are trained.

There is an important parameter used in this technique known as Shrinkage. Shrinkage refers to the fact that the prediction of each tree in the ensemble is shrunk by multiplying it by the learning rate (eta), which ranges between 0 and 1. There is a trade-off between eta and the number of estimators: decreasing the learning rate needs to be compensated by increasing the number of estimators in order to reach a certain model performance. Once all trees are trained, predictions can be made. Each tree predicts a value, and the final prediction is given by the formula

y(pred) = y1 + (eta * r1) + (eta * r2) + ... + (eta * rN)

where y1 is the prediction of Tree1 and r1, r2, ..., rN are the residuals predicted by the subsequent trees (so N = M - 1).

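The residual-fitting loop can be sketched in plain Python. This is a toy illustration under stated assumptions: the base learner is a hand-written one-split regression stump (fit_stump is a hypothetical helper, not a library function), the data is a made-up 1-D curve, and eta and the number of trees are arbitrary. It follows the formula above: the first tree is used as is, and every later tree is scaled by eta.

```python
def fit_stump(xs, targets):
    """Least-squares regression stump (a one-split tree) on 1-D inputs."""
    best = None
    points = sorted(set(xs))
    for lo, hi in zip(points, points[1:]):
        threshold = (lo + hi) / 2
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((t - left_mean) ** 2 for t in left)
               + sum((t - right_mean) ** 2 for t in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def mse(ys, preds):
    """Mean squared error between labels and predictions."""
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)

xs = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [x ** 2 for x in xs]        # toy regression target
eta, n_trees = 0.5, 20           # illustrative values

tree1 = fit_stump(xs, ys)        # Tree1 is trained on the labels y
preds = [tree1(x) for x in xs]   # y1 in the formula
mse_first = mse(ys, preds)

for _ in range(n_trees - 1):
    residuals = [y - p for y, p in zip(ys, preds)]           # current residual errors
    tree = fit_stump(xs, residuals)                          # next tree fits the residuals
    preds = [p + eta * tree(x) for p, x in zip(preds, xs)]   # shrunken update

mse_final = mse(ys, preds)
print("MSE after Tree1:", mse_first)
print("MSE after all trees:", mse_final)
```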
Difference between AdaBoost and Gradient Boosting

The differences between AdaBoost and gradient boosting are as follows:

AdaBoost | Gradient Boosting
During each iteration, the weights of incorrectly classified samples are increased, so that the next weak learner focuses more on these samples. | Each new weak learner is fitted to the negative gradient of the loss function with respect to the predicted output; the instance weights are not changed.
AdaBoost typically uses simple decision trees with one split, known as decision stumps, as weak learners. | Gradient Boosting can use a wide range of base learners, such as decision trees and linear models.
AdaBoost is more susceptible to noise and outliers in the data, as it assigns high weights to misclassified samples. | Gradient Boosting is generally more robust, as it is driven by the gradients of the loss, which can be less sensitive to outliers (depending on the loss function).

Support Vector Machine (SVM)

Support Vector Machine, or SVM, is one of the most popular supervised learning algorithms. It is used for classification as well as regression problems, but primarily for classification. The goal of the SVM algorithm is to create the best line or decision boundary that can segregate n-dimensional space into classes, so that we can easily put a new data point in the correct category in the future. This best decision boundary is called a hyperplane. SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme cases are called support vectors, and hence the algorithm is termed Support Vector Machine. Consider the diagram below, in which two different categories are classified using a decision boundary or hyperplane:

Example:

Suppose we see a strange cat that also has some features of dogs. If we want a model that can accurately identify whether it is a cat or a dog, such a model can be created by using the SVM algorithm. We first train our model with lots of images of cats and dogs so that it can learn the different features of cats and dogs, and then we test it with this strange creature. The SVM creates a decision boundary between the two classes (cat and dog) and chooses the extreme cases (support vectors) of each, that is, the cats and dogs closest to the boundary. On the basis of the support vectors, it will classify the creature as a cat.

Types of SVM

SVM can be of two types:

o Linear SVM: Linear SVM is used for linearly separable data. If a dataset can be classified into two classes by using a single straight line, then such data is termed linearly separable data, and the classifier used is called a Linear SVM classifier.
o Non-linear SVM: Non-linear SVM is used for non-linearly separable data. If a dataset cannot be classified by using a straight line, then such data is termed non-linear data, and the classifier used is called a Non-linear SVM classifier.

Hyperplane and Support Vectors in the SVM algorithm:

Hyperplane: There can be multiple lines/decision boundaries to segregate the classes in n-dimensional space, but we need to find the best decision boundary that helps to classify the data points. This best boundary is known as the hyperplane of SVM.

The dimensions of the hyperplane depend on the number of features in the dataset: if there are 2 features (as shown in the image), the hyperplane is a straight line, and if there are 3 features, the hyperplane is a 2-dimensional plane.

We always create the hyperplane that has the maximum margin, which means the maximum distance between the hyperplane and the nearest data points of each class.

Support Vectors: The data points or vectors that are closest to the hyperplane and which affect its position are termed support vectors. Since these vectors support the hyperplane, they are called support vectors.
How does SVM work?

Suppose we have a dataset that has two tags (green and blue) and two features, x1 and x2. We want a classifier that can classify the pair (x1, x2) of coordinates as either green or blue. Consider the image below:

Since this is a 2-d space, we can easily separate these two classes with a straight line. But there can be multiple lines that separate these classes. Consider the image below:

The SVM algorithm finds the best line or decision boundary; this best boundary is called a hyperplane. The algorithm finds the points of each class that lie closest to the boundary. These points are called support vectors. The distance between the support vectors and the hyperplane is called the margin, and the goal of SVM is to maximize this margin. The hyperplane with the maximum margin is called the optimal hyperplane.
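
To make the idea of comparing candidate boundaries by their margin concrete, here is a small sketch in plain Python. The points and the two candidate lines are hypothetical, and the helper names are illustrative; the distance formula |w.x + b| / ||w|| is the standard point-to-hyperplane distance.

```python
import math

def signed_distance(w, b, x):
    """Signed distance from point x to the hyperplane w.x + b = 0."""
    dot = sum(wi * xi for wi, xi in zip(w, x))
    return (dot + b) / math.sqrt(sum(wi * wi for wi in w))

def margin(w, b, points):
    """Smallest absolute distance from any point to the hyperplane."""
    return min(abs(signed_distance(w, b, p)) for p in points)

# Hypothetical 2-d data: the first two points are one class, the last two the other.
points = [(1.0, 1.0), (2.0, 0.5), (4.0, 4.0), (5.0, 3.5)]

# Two candidate separating lines (both made up for illustration).
line_a = ((1.0, 1.0), -5.0)   # x1 + x2 - 5 = 0
line_b = ((1.0, 0.0), -2.5)   # x1 - 2.5 = 0

margin_a = margin(*line_a, points)
margin_b = margin(*line_b, points)
print("margin of line A:", margin_a)
print("margin of line B:", margin_b)
```

Both lines separate the two groups, but line A keeps the nearest point further away, so a margin-maximizing classifier would prefer it over line B.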

How to choose the Correct SVM

Suppose we are given 2 hyperplanes: one with 100% accuracy (HP1) on the left side and another with >90% accuracy (HP2) on the right side. Which one do you think is the correct classifier?

Most of us would pick HP2, thinking that it is better because of its maximum margin. But that is the wrong answer.

The Support Vector Machine would choose HP1 even though it has a narrow margin. Although HP2 has the maximum margin, it goes against the constraint that each data point must lie on the correct side of the margin and there should be no misclassification. This is the hard constraint that the Support Vector Machine follows throughout.

Margins

Margins are generally defined by the closest data points (called support vectors) on either side of the hyperplane.

Optimization Technique used in SVM

The core of any machine learning algorithm is the optimization technique that is happening behind the scenes. SVM maximizes the margin by learning a suitable decision boundary/decision surface/separating hyperplane.

It can be written mathematically as:

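For reference, the standard hard-margin SVM optimization problem can be stated as follows. The symbols w (weight vector), b (bias), x_i (training points), y_i in {-1, +1} (labels), and m (number of training points) are the usual textbook notation, assumed here rather than taken from these notes. Minimizing ||w|| is equivalent to maximizing the margin, whose width is 2/||w||.

```latex
\min_{\mathbf{w},\, b} \; \frac{1}{2}\lVert \mathbf{w} \rVert^{2}
\quad \text{subject to} \quad
y_i\,(\mathbf{w}^{\top}\mathbf{x}_i + b) \ge 1, \qquad i = 1, \dots, m
```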
4. Hard and Soft SVM

Let us continue with the above example. We can now clearly state that HP1 is a Hard SVM (left side) while HP2 is a Soft SVM (right side).

In its basic form, the Support Vector Machine implements a hard margin SVM. It works well only if our data is linearly separable. Hard margin SVM does not allow any misclassification to happen. In case our data is non-separable/nonlinear, the hard margin SVM will not return any hyperplane, as it will not be able to separate the data. This is where Soft Margin SVM comes to the rescue.

Soft margin SVM allows some misclassification to happen by relaxing the hard constraints of the Support Vector Machine. It is implemented with the help of the Regularization parameter (C).

Regularization parameter (C): It tells us how much misclassification we want to avoid.
– Hard margin SVM generally has large values of C.
– Soft margin SVM generally has small values of C.

Relation between Regularization parameter (C) and SVM

Now that we know what the Regularization parameter (C) does, we need to understand its relation with the Support Vector Machine.

– As the value of C increases, the margin decreases, approaching a Hard SVM.
– If the value of C is very small, the margin increases, giving a Soft SVM.
– A large value of C can cause overfitting, therefore we need to select a suitable value using hyperparameter tuning.
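
The role of C can be seen in the usual soft-margin objective, 0.5 * ||w||^2 + C * (sum of hinge losses). The sketch below evaluates that objective for one fixed, hypothetical boundary and made-up data with a single margin violation; it does not train an SVM, it only shows how strongly each value of C penalizes the violation.

```python
def hinge_losses(w, b, points, labels):
    """Hinge loss max(0, 1 - y * (w.x + b)) for each point; labels are +1 or -1."""
    return [max(0.0, 1 - y * (sum(wi * xi for wi, xi in zip(w, x)) + b))
            for x, y in zip(points, labels)]

def soft_margin_objective(w, b, points, labels, C):
    """0.5 * ||w||^2 + C * (sum of hinge losses)."""
    norm_sq = sum(wi * wi for wi in w)
    return 0.5 * norm_sq + C * sum(hinge_losses(w, b, points, labels))

# Hypothetical data: the last point sits on the wrong side of the boundary.
points = [(1.0, 1.0), (2.0, 0.5), (4.0, 4.0), (5.0, 3.5), (3.5, 3.0)]
labels = [-1, -1, 1, 1, -1]
w, b = (1.0, 1.0), -5.0   # fixed boundary x1 + x2 - 5 = 0

losses = hinge_losses(w, b, points, labels)
print("hinge losses:", losses)
for C in (0.01, 1.0, 100.0):
    print("C =", C, "objective =", soft_margin_objective(w, b, points, labels, C))
```

With a small C the violation barely changes the objective, so a wide margin with some misclassification is acceptable; with a large C the violation dominates, pushing the optimizer toward a boundary that classifies every point correctly.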

The following Scikit-Learn code loads the iris dataset, scales the features, and then trains a linear SVM model (using the LinearSVC class with C=1 and the hinge loss function) to detect Iris virginica flowers:

import numpy as np
from sklearn import datasets
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

iris = datasets.load_iris()
X = iris["data"][:, (2, 3)]  # petal length, petal width
y = (iris["target"] == 2).astype(np.float64)  # 1.0 for Iris virginica, else 0.0

svm_clf = Pipeline([
    ("scaler", StandardScaler()),
    ("linear_svc", LinearSVC(C=1, loss="hinge")),
])
svm_clf.fit(X, y)

Non-Linear SVM Classification

If data is linearly arranged, then we can separate it by using a straight line, but for non-linear data, we cannot draw a single straight line. Consider the image below:

To separate these data points, we need to add one more dimension. For linear data, we have used two dimensions x and y, so for non-linear data, we will add a third dimension z. It can be calculated as:

z = x^2 + y^2

By adding the third dimension, the sample space will become as in the image below:

Now SVM will divide the dataset into classes in the following way. Consider the image below:

Since we are in 3-d space, the boundary looks like a plane parallel to the x-y plane. If we convert it back to 2-d space with z = 1, it becomes:

Hence we get a circle of radius 1 as the decision boundary for the non-linear data.
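
The effect of the extra dimension can be checked on a few made-up points. In this sketch (the data and the helper name lift are illustrative), an inner cluster and an outer ring cannot be split by a straight line in the x-y plane, but after adding z = x^2 + y^2 the plane z = 1 separates them.

```python
def lift(point):
    """Map (x, y) to (x, y, z) with z = x^2 + y^2."""
    x, y = point
    return (x, y, x * x + y * y)

# Hypothetical data: an inner cluster and a surrounding ring.
inner = [(0.2, 0.1), (-0.3, 0.4), (0.5, -0.5), (-0.6, -0.2)]
outer = [(1.5, 0.0), (0.0, -1.8), (-1.2, 1.2), (1.1, 1.4)]

inner_z = [lift(p)[2] for p in inner]
outer_z = [lift(p)[2] for p in outer]

# In the lifted space the plane z = 1 separates the two groups.
separable = max(inner_z) < 1 < min(outer_z)
print("largest inner z:", max(inner_z))
print("smallest outer z:", min(outer_z))
print("separated by z = 1:", separable)
```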
Read the following topics in the textbook (Hands on Machine Learning), pages 218-223:

• Adding Polynomial Features
• Polynomial Kernel
• Similarity Features
• Gaussian RBF Kernel

Support Vector Regression

Support Vector Regression (SVR), as the name suggests, is a regression algorithm that supports both linear and non-linear regression. This method works on the principle of the Support Vector Machine. SVR differs from SVM in that SVM is a classifier used for predicting discrete categorical labels, while SVR is a regressor used for predicting continuous ordered variables.

In simple regression, the idea is to minimize the error rate, while in SVR the idea is to fit the error inside a certain threshold. In other words, the work of SVR is to approximate the best value within a given margin called the ε-tube.

1. Hyperplane: It is a separation line between two data classes in a higher dimension than the actual dimension. In SVR it is defined as the line that helps in predicting the target value.

2. Kernel: In SVR the regression is performed in a higher dimension. To do that, we need a function that maps the data points into that higher dimension. This function is termed the kernel. Types of kernel used in SVR include the Sigmoidal kernel, Polynomial kernel, Gaussian kernel, etc.

3. Boundary Lines: These are the two lines that are drawn around the hyperplane at a distance of ε (epsilon). They are used to create a margin between the data points.

4. Support Vector: It is the vector that is used to define the hyperplane, or we can say that these are the extreme data points in the dataset which help in defining the hyperplane. These data points lie close to the boundary.

The objective of SVR is to fit as many data points as possible without violating the margin. In SVM classification, the support vectors are used to define the separating hyperplane, whereas in SVR they are used to define the regression line.

Working of SVR
SVR works on the principle of SVM with a few minor differences. Given data points, it tries to find a curve. But since it is a regression algorithm, instead of using the curve as a decision boundary, it uses the curve to find the match between the vectors and the position of the curve. Support vectors help in determining the closest match between the data points and the function which is used to represent them.

The Idea Behind Support Vector Regression


The problem of regression is to find a function that approximates the mapping from an input domain to the real numbers on the basis of a training sample. So let's now look at how SVR actually works.

[Figure: a green hyperplane with a red boundary line on each side]

Consider the two red lines as the decision boundary and the green line as the hyperplane. Our objective in SVR is to consider the points that lie within the decision boundary lines. Our best-fit line is the hyperplane that contains the maximum number of points.

The first thing to understand is the decision boundary (the red lines above). Consider these lines as being at some distance, say 'a', from the hyperplane. So, these are the lines that we draw at distances '+a' and '-a' from the hyperplane. This 'a' is usually referred to as epsilon (ε).

Assuming that the equation of the hyperplane is as follows:

Y = wx + b (equation of hyperplane)

Then the equations of the decision boundary become:

Y = wx + b + a

Y = wx + b - a

Thus, any hyperplane that satisfies our SVR should keep the data points within:

-a < Y - (wx + b) < +a

Our main aim here is to choose a decision boundary at distance 'a' from the original hyperplane such that the data points closest to the hyperplane, the support vectors, are within that boundary line.

Hence, we are going to take only those points that are within the decision boundary and have the least error rate, that is, points within the margin of tolerance. This gives us a better-fitting model.
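The margin of tolerance can be illustrated with a small sketch. It checks the condition -a < Y - (wx + b) < +a for a few points and computes the usual ε-insensitive loss, which is zero inside the tube. The values of w, b and a, the sample points, and the function names are all hypothetical, chosen only for illustration.

```python
def inside_tube(x, y, w, b, a):
    """True when the point satisfies -a < y - (w*x + b) < +a."""
    return -a < y - (w * x + b) < a


def epsilon_insensitive_loss(y_true, y_pred, epsilon):
    """Zero inside the tube; grows linearly with the distance outside it."""
    return max(0.0, abs(y_true - y_pred) - epsilon)


# Hypothetical hyperplane Y = 2x + 1 with a tube of half-width a = 0.5
w, b, a = 2.0, 1.0, 0.5
points = [(0.0, 1.2), (1.0, 3.4), (2.0, 6.0), (3.0, 6.9)]

flags = [inside_tube(x, y, w, b, a) for x, y in points]
losses = [epsilon_insensitive_loss(y, w * x + b, a) for x, y in points]

print(flags)   # only the third point falls outside the tube
print(losses)  # only the third point contributes to the loss
```

Points inside the tube cost nothing, so the fit is driven by the points on or outside the boundary lines.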
What is a Support Vector Machine (SVM)?
So what exactly is a Support Vector Machine (SVM)? We'll start by understanding SVM in simple terms. Let's say we have a plot of two label classes:

[Figure: scatter plot of two label classes]

Can you decide what the separating line will be? You might have come up with this:

[Figure: a straight line separating the two classes]

The line fairly separates the classes. This is what SVM essentially does: simple class separation. Now, what if the data was like this:

[Figure: two classes that a straight line cannot separate]

Here, we don't have a simple line separating these two classes. So we'll extend our dimension and introduce a new dimension along the z-axis. We can now separate these two classes:

[Figure: the two classes separated along the z-axis]

When we transform this line back to the original plane, it maps to a circular boundary:

[Figure: circular boundary in the original plane]

This is exactly what SVM does! It tries to find a line/hyperplane (in multidimensional space) that separates these two classes. Then it classifies a new point depending on whether it lies on the positive or negative side of the hyperplane.
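The final classification step can be sketched as follows. The hyperplane weights, the bias, and the test points are hypothetical values chosen for illustration, and the function name classify is an assumption. A trained SVM would learn the weights and bias from data.

```python
def classify(point, weights, bias):
    """Return +1 or -1 depending on the side of the hyperplane w.x + b = 0."""
    score = sum(w * x for w, x in zip(weights, point)) + bias
    return 1 if score >= 0 else -1


# Hypothetical hyperplane: x1 + x2 - 3 = 0
weights, bias = [1.0, 1.0], -3.0

print(classify([4.0, 2.0], weights, bias))  # positive side
print(classify([0.5, 1.0], weights, bias))  # negative side
```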

Implementing Support Vector Regression (SVR) in Python
Here, we have to predict the salary of an employee given a few
independent variables. A classic HR analytics project!

Step 1: Importing the libraries

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

Step 2: Reading the dataset

dataset = pd.read_csv('Position_Salaries.csv')
X = dataset.iloc[:, 1:2].values
y = dataset.iloc[:, 2].values

Step 3: Feature Scaling

A real-world dataset contains features that vary in magnitude, units, and range. I would suggest performing normalization when the scale of a feature is irrelevant or misleading. Feature scaling basically helps to normalize the data within a particular range. Some model classes perform feature scaling internally, but the SVR class does not, so we should perform feature scaling ourselves.

from sklearn.preprocessing import StandardScaler

sc_X = StandardScaler()
sc_y = StandardScaler()
X = sc_X.fit_transform(X)
# StandardScaler expects a 2D array, so reshape y before scaling,
# then flatten it back to 1D for fitting
y = sc_y.fit_transform(y.reshape(-1, 1)).ravel()

Step 4: Fitting SVR to the dataset

from sklearn.svm import SVR

regressor = SVR(kernel = 'rbf')
regressor.fit(X, y)

The kernel is the most important setting. There are many types of kernels (linear, Gaussian, etc.), and each is used depending on the dataset.

Step 5. Predicting a new result

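# predict() expects a 2D array on the scaled feature range
y_pred = regressor.predict(sc_X.transform([[6.5]]))
# inverse_transform() also expects a 2D array
y_pred = sc_y.inverse_transform(y_pred.reshape(-1, 1))

So, the predicted salary for level 6.5 will be about 170,370.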

Step 6. Visualizing the SVR results (for higher resolution and smoother curve)

# this step is required because the data is feature scaled
X_grid = np.arange(X.min(), X.max(), 0.01)
X_grid = X_grid.reshape((len(X_grid), 1))

plt.scatter(X, y, color = 'red')
plt.plot(X_grid, regressor.predict(X_grid), color = 'blue')
plt.title('Truth or Bluff (SVR)')
plt.xlabel('Position level')
plt.ylabel('Salary')
plt.show()

Implementing SVR in Python


Data preprocessing
As in any other implementation, first, we get the necessary libraries in place. The code below imports these libraries and reads the dataset:
# get the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

dataset = pd.read_csv('/content/drive/MyDrive/Position_Salaries.csv')
# our dataset in this implementation is small, and thus we can
# print it all instead of viewing only the end
print(dataset)

Output: The above dataset contains ten instances. The significant feature in this dataset is the Level column. The Position column is just a description of the Level column, and therefore, it adds no value to our analysis. Therefore, we will separate the dataset into a set of features and a study (target) variable.
As discussed above, we only have one feature in this dataset. Therefore, we carry out our feature/study-variable separation as shown in the code below:

# split the data into features and target variable separately
X_l = dataset.iloc[:, 1:-1].values  # feature set
y_p = dataset.iloc[:, -1].values  # study (target) variable

We can look at our feature set using the print() function.

print(X_l)

Output:
[[ 1]
 [ 2]
 [ 3]
 [ 4]
 [ 5]
 [ 6]
 [ 7]
 [ 8]
 [ 9]
 [10]]
From this output, it's clear that the X_l variable is a 2D array. Similarly, we can have a look at the y_p variable:
print(y_p)

Output:
[  45000   50000   60000   80000  110000  150000  200000  300000  500000 1000000]

It's seen from the output above that the y_p variable is a vector, i.e., a 1D array.

We need to note that the values of y_p are huge compared to those of X_l. Therefore, if we implement a model on this data, the study variable will dominate the feature variable, such that the feature's contribution to the model will be neglected. Due to this, we will have to scale the study variable to the same range as the scaled feature variable.
The challenge here is that StandardScaler, the class we use to scale the data, takes in a 2D array; otherwise, it returns an error. Due to this, we have to reshape our y_p variable from 1D to 2D. The code below does this for us:
y_p = y_p.reshape(-1,1)
print(y_p)
Output:

[[  45000]
 [  50000]
 [  60000]
 [  80000]
 [ 110000]
 [ 150000]
 [ 200000]
 [ 300000]
 [ 500000]
 [1000000]]
From the above output, y_p was successfully reshaped into a 2D array.
Now, import the StandardScaler class and scale the X_l and y_p variables separately as shown:
from sklearn.preprocessing import StandardScaler
StdS_X = StandardScaler()
StdS_y = StandardScaler()
X_l = StdS_X.fit_transform(X_l)
y_p = StdS_y.fit_transform(y_p)
Let's print both variables and check that they were scaled.

print("Scaled X_l:")
print(X_l)
print("Scaled y_p:")
print(y_p)
Output:

Scaled X_l:
[[-1.5666989 ]
 [-1.21854359]
 [-0.87038828]
 [-0.52223297]
 [-0.17407766]
 [ 0.17407766]
 [ 0.52223297]
 [ 0.87038828]
 [ 1.21854359]
 [ 1.5666989 ]]
Scaled y_p:
[[-0.72004253]
 [-0.70243757]
 [-0.66722767]
 [-0.59680786]
 [-0.49117815]
 [-0.35033854]
 [-0.17428902]
 [ 0.17781001]
 [ 0.88200808]
 [ 2.64250325]]

As we can see from the obtained output, both variables were scaled to within the range -3 to +3.
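To see where the scaled values printed above come from, standardization can be reproduced by hand with the formula z = (v - mean) / std. This is a minimal sketch using only the Python standard library. The helper name standardize is an assumption, and the population standard deviation is used because that is what StandardScaler computes.

```python
from statistics import mean, pstdev


def standardize(values):
    """Return (scaled values, mean, std) using z = (v - mean) / std.

    The population standard deviation (pstdev) is used, which matches
    the scaling done by scikit-learn's StandardScaler.
    """
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values], mu, sigma


levels = list(range(1, 11))  # the Level column: 1 to 10
scaled_levels, mu, sigma = standardize(levels)

# The first and last entries can be compared with the scaled X_l output
print(round(scaled_levels[0], 7), round(scaled_levels[-1], 7))
```

After scaling, the values have mean 0 and standard deviation 1, which is why both variables end up on a comparable range.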
Our data is now ready for the SVR model. However, before we fit it, we will first visualize the data to find the kind of SVR model that best fits it. So, let us create a scatter plot of our two variables.

plt.scatter(X_l, y_p, color = 'red')  # plotting the training set
plt.title('Scatter Plot')  # adding a title to our plot
plt.xlabel('Levels')  # adds a label to the x-axis
plt.ylabel('Salary')  # adds a label to the y-axis
plt.show()  # displays the plot

[Figure: scatter plot of Salary against Levels]

The plot shows a non-linear relationship between the Levels and Salary. Due to this, we cannot use the linear SVR to model this data. Therefore, to capture this relationship better, we will use SVR with a kernel function.

Implementing SVR
To implement our model, we first need to import it from scikit-learn and create an object of it.
Since we found our data to be non-linear, we will use a kernel called the Radial Basis Function (RBF) kernel.
After declaring the kernel function, we will fit our data on the object. The following program performs these steps:
# import the model
from sklearn.svm import SVR
# create the model object
regressor = SVR(kernel = 'rbf')
# fit the model on the data (ravel() passes y_p as the 1D array SVR expects)
regressor.fit(X_l, y_p.ravel())

Since the model is now ready, we can use it to make predictions as shown:
A = regressor.predict(StdS_X.transform([[6.5]]))
print(A)
Output:
array([-0.27861589])

As we can see, the model's prediction is on the scale of the scaled study variable. But the value the business requires is on the unscaled scale. So, we need to get back to the real scale of the study variable.
To go back to the real scale, we take the predicted values on the scaled range and transform them to the actual scale. We do so by inverting the transformation applied to the study variable. Recall that we had reshaped our study variable from a 1D to a 2D array since StandardScaler takes in only 2D arrays. So, any predicted value must also be transformed from 1D to 2D; otherwise, we will get an error. So, let's implement these commands and get the required value:
# Convert A to 2D
A = A.reshape(-1,1)
print(A)

Output:
array([[-0.27861589]])
It is clear from the output above that A is now a 2D array. Using the inverse_transform() function, we can convert it to an unscaled value on the scale of the original dataset as shown:
# Taking the inverse of the scaled value
A_pred = StdS_y.inverse_transform(A)
print(A_pred)

Output:
array([[170370.0204065]])

Here is the result, and it falls within the expected range.
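The inverse transformation can also be reproduced by hand as value = scaled * std + mean. The sketch below uses the ten salaries printed earlier and the scaled prediction -0.27861589 from the output above. The helper name inverse_standardize is an assumption made for this illustration.

```python
from statistics import mean, pstdev

# The ten salary values printed earlier in this section
salaries = [45000, 50000, 60000, 80000, 110000,
            150000, 200000, 300000, 500000, 1000000]


def inverse_standardize(scaled_value, values):
    """Undo z = (v - mean) / std for data standardized from `values`."""
    return scaled_value * pstdev(values) + mean(values)


scaled_prediction = -0.27861589  # the scaled model output shown above
salary = inverse_standardize(scaled_prediction, salaries)

# Should land very close to the value returned by inverse_transform()
print(round(salary))
```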

Next, we visualize our model. The following code carries out this task:

# inverse the transformation to go back to the initial scale
plt.scatter(StdS_X.inverse_transform(X_l), StdS_y.inverse_transform(y_p), color = 'red')
plt.plot(StdS_X.inverse_transform(X_l), StdS_y.inverse_transform(regressor.predict(X_l).reshape(-1,1)), color = 'blue')
# add the title to the plot
plt.title('Support Vector Regression Model')
# label x axis
plt.xlabel('Position')
# label y axis
plt.ylabel('Salary Level')
# print the plot
plt.show()
Output:

[Figure: the fitted SVR curve plotted over the data points]
Naïve Bayes Classifier

o Naïve Bayes algorithm is a supervised learning algorithm, which is based on Bayes theorem and used for solving classification problems.
o It is mainly used in text classification that includes a high-dimensional training dataset.
o Naïve Bayes Classifier is one of the simplest and most effective classification algorithms, which helps in building fast machine learning models that can make quick predictions.
o It is a probabilistic classifier, which means it predicts on the basis of the probability of an object.
o Some popular examples of the Naïve Bayes algorithm are spam filtration, sentiment analysis, and classifying articles.

Why is it called Naïve Bayes?


The Naïve Bayes algorithm is made up of two words, Naïve and Bayes, which can be described as:

o Naïve: It is called Naïve because it assumes that the occurrence of a certain feature is independent of the occurrence of other features. For example, if a fruit is identified on the basis of color, shape, and taste, then a red, spherical, and sweet fruit is recognized as an apple. Hence each feature individually contributes to identifying it as an apple, without depending on the others.
o Bayes: It is called Bayes because it depends on the principle of Bayes' Theorem.

Bayes' Theorem:
o Bayes' theorem is also known as Bayes' Rule or Bayes' law, which is used to determine the probability of a hypothesis with prior knowledge. It depends on the conditional probability.

o The formula for Bayes' theorem is given as:

P(A|B) = P(B|A) P(A) / P(B)

Where,
P(A|B) is the posterior probability: the probability of hypothesis A given the observed event B.
P(B|A) is the likelihood: the probability of the evidence B given that hypothesis A is true.
P(A) is the prior probability: the probability of the hypothesis before observing the evidence.
P(B) is the marginal probability: the probability of the evidence.

Using Bayes theorem, we can find the probability of A happening, given that B has occurred. Here, B is the evidence and A is the hypothesis. The assumption made here is that the predictors/features are independent. That is, the presence of one particular feature does not affect the others. Hence it is called naive.

Steps

1. Convert the given dataset into frequency tables.
2. Generate the likelihood table by finding the probabilities of the given features.
3. Now, use Bayes theorem to calculate the posterior probability.
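The three steps above can be sketched in plain Python for a single feature. The small weather dataset below is hypothetical, invented only for illustration, and the variable names are assumptions. The sketch builds the frequency tables, derives the likelihood table, and applies Bayes' theorem to get P(yes | sunny).

```python
from collections import Counter

# Hypothetical toy data: (outlook, play?) pairs invented for illustration
data = [("sunny", "no"), ("sunny", "no"), ("overcast", "yes"),
        ("rainy", "yes"), ("rainy", "yes"), ("rainy", "no"),
        ("overcast", "yes"), ("sunny", "no"), ("sunny", "yes"),
        ("rainy", "yes")]

# Step 1: frequency tables (counts per class and per feature/class pair)
class_counts = Counter(label for _, label in data)
pair_counts = Counter(data)

# Step 2: likelihood table P(feature | class)
likelihood = {(f, c): n / class_counts[c] for (f, c), n in pair_counts.items()}

# Step 3: Bayes' theorem, P(yes | sunny) = P(sunny | yes) * P(yes) / P(sunny)
total = len(data)
p_yes = class_counts["yes"] / total
p_sunny = sum(1 for f, _ in data if f == "sunny") / total
posterior = likelihood[("sunny", "yes")] * p_yes / p_sunny

print(posterior)
```

In this toy data one of the four sunny days has the label yes, so the posterior agrees with counting directly.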

Example

According to this example, Bayes theorem can be rewritten as:

P(y|X) = P(X|y) P(y) / P(X)

The variable y is the class variable (play golf), which represents whether it is suitable to play golf or not given the conditions. The variable X represents the parameters/features.

X is given as:

X = (x_1, x_2, ..., x_n)

Here x_1, x_2, ..., x_n represent the features, i.e., they can be mapped to outlook, temperature, humidity and windy. By substituting for X and expanding using the chain rule (together with the independence assumption), we get:

P(y|x_1, ..., x_n) = P(x_1|y) P(x_2|y) ... P(x_n|y) P(y) / (P(x_1) P(x_2) ... P(x_n))

Now, you can obtain the values for each term by looking at the dataset and substitute them into the equation. For all entries in the dataset, the denominator does not change; it remains static. Therefore, the denominator can be removed and a proportionality can be introduced:

P(y|x_1, ..., x_n) ∝ P(y) P(x_1|y) P(x_2|y) ... P(x_n|y)

In our case, the class variable (y) has only two outcomes, yes or no. There could be cases where the classification has more than two classes. Therefore, we need to find the class y with maximum probability:

y = argmax_y P(y) P(x_1|y) P(x_2|y) ... P(x_n|y)

Using the above function, we can obtain the class, given the predictors.
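The final rule, picking the class with the largest value of the prior times the product of the likelihoods, can be sketched as below. The priors and likelihoods are hypothetical numbers invented for illustration, not values taken from the example dataset, and the feature keys and the function name predict are assumptions.

```python
# Hypothetical priors P(y) and likelihoods P(x_i | y), invented for illustration
priors = {"yes": 0.6, "no": 0.4}
likelihoods = {
    "yes": {"outlook=sunny": 0.2, "windy=true": 0.3},
    "no": {"outlook=sunny": 0.7, "windy=true": 0.6},
}


def predict(features):
    """Return (best class, scores) where score = P(y) * product of P(x_i | y)."""
    scores = {}
    for label, prior in priors.items():
        score = prior
        for feature in features:
            score *= likelihoods[label][feature]
        scores[label] = score
    # The shared denominator is dropped, so scores are only proportional
    # to the posterior; the argmax is unaffected
    best = max(scores, key=scores.get)
    return best, scores


best, scores = predict(["outlook=sunny", "windy=true"])
print(best, scores)
```

Because the denominator is the same for every class, comparing the unnormalized scores is enough to choose the class.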