
Ballari Institute of Technology and Management

Department of Artificial Intelligence and Machine Learning


Course: Machine Learning        Sem: 5th Sem (A, B, C)
Assignment 3
1. Show how SVMs make predictions using quadratic programming and kernelized SVMs.
SVMs make predictions by solving an optimization problem using quadratic programming (QP), and they use kernels to handle nonlinear decision boundaries.
The goal of an SVM is to find the hyperplane that best separates the data points of the different classes by maximizing the margin.


The figure shows the iris dataset. The two classes are easily separated with a straight line. The left plot shows the decision boundaries of three possible linear classifiers. The dashed line does not separate the classes properly. The solid line on the right represents the decision boundary of an SVM classifier; it stays as far away from the training instances as possible while fitting the widest possible street between the classes. This is called large margin classification.
The hard margin and soft margin problems are both convex optimization problems with linear constraints, known as quadratic programming (QP) problems. The general problem formulation is given below.
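For reference, the hard-margin linear SVM objective is commonly written as:

minimize (1/2) ||w||²  over w, b
subject to  t(i) · (w · x(i) + b) ≥ 1   for i = 1, …, m

where t(i) = 1 for positive instances and t(i) = −1 for negative instances. The soft-margin version adds slack variables ζ(i) ≥ 0 to the constraints and the penalty term C · Σ ζ(i) to the objective; both fit the general QP form and can be handed to an off-the-shelf QP solver. Once w and b are found, the classifier predicts the positive class if w · x + b ≥ 0 and the negative class otherwise. In the kernelized case, the decision function becomes a sum over the support vectors of α(i) t(i) K(x(i), x) + b, so predictions only require kernel evaluations against the support vectors.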
To apply a 2nd-degree polynomial transformation to a two-dimensional training set, train a linear SVM classifier on the transformed training set. A 2nd-degree polynomial mapping function is shown below.
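A common choice for this mapping (the standard 2nd-degree example) is:

φ(x) = φ((x1, x2)) = (x1², √2·x1·x2, x2²)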

The dot product of transformed vectors is equal to the square of the dot
product of the original vectors.
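A quick numeric check of this identity (a minimal sketch; the vectors a and b are arbitrary examples):

import numpy as np

def phi(x):
    # 2nd-degree polynomial mapping of a 2D vector (x1, x2)
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

a = np.array([1.0, 2.0])
b = np.array([3.0, 0.5])
print(phi(a) @ phi(b))   # 16.0
print((a @ b) ** 2)      # 16.0 -- same value, without computing phi explicitly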
2. Discuss non-linear SVM classification. How can you use the polynomial kernel and the Gaussian RBF kernel?
For a nonlinear dataset, we can add features to transform it into a linearly separable dataset. For example, for a 1D dataset with a single feature x1, adding a second feature x2 = (x1)² can make the resulting 2D dataset perfectly linearly separable.
from sklearn.datasets import make_moons
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.svm import LinearSVC

X, y = make_moons(n_samples=100, noise=0.15)   # moons dataset
polynomial_svm_clf = Pipeline([
    ("poly_features", PolynomialFeatures(degree=3)),
    ("scaler", StandardScaler()),
    ("svm_clf", LinearSVC(C=10, loss="hinge"))
])
polynomial_svm_clf.fit(X, y)

Adding polynomial features at a low degree cannot handle very complex datasets, while a high degree creates a huge number of features and makes the model slow. With SVMs you can instead apply the kernel trick, which gives the same result as adding many polynomial features without actually adding them. The following code applies a polynomial kernel to the moons dataset.
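A minimal sketch (reusing the X, y moons data from above; the degree, coef0 and C values are illustrative):

from sklearn.svm import SVC

poly_kernel_svm_clf = Pipeline([
    ("scaler", StandardScaler()),
    ("svm_clf", SVC(kernel="poly", degree=3, coef0=1, C=5))
])
poly_kernel_svm_clf.fit(X, y)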

Another technique to handle nonlinear problems is to add similarity features, computed with a similarity function that measures how much each instance resembles a particular landmark. Take the similarity function to be the Gaussian Radial Basis Function (RBF) with γ = 0.3:

φγ(x, ℓ) = exp(−γ · ||x − ℓ||²)
It is a bell-shaped function varying from 0 (very far away from the
landmark) to 1 (at the landmark). Now we are ready to compute the new
features. For example, let’s look at the instance x1 = –1: it is located at a
distance of 1 from the first landmark, and 2 from the second landmark.
Therefore its new features are x2 = exp(–0.3 × 1²) ≈ 0.74 and x3 = exp(–0.3 × 2²) ≈ 0.30. The plot on the right of the figure shows the transformed
dataset (dropping the original features). It is now linearly
separable.

The following code trains an SVM classifier with the Gaussian RBF kernel using the SVC class:

from sklearn.svm import SVC

rbf_kernel_svm_clf = Pipeline([
    ("scaler", StandardScaler()),
    ("svm_clf", SVC(kernel="rbf", gamma=5, C=0.001))
])
rbf_kernel_svm_clf.fit(X, y)
The plots show models trained with different values of the
hyperparameters gamma (γ) and C. Increasing gamma makes the bell-shaped
curve narrower (left plot), and as a result each instance’s range of influence is
smaller: the decision boundary ends up being more irregular, wiggling around
individual instances. Conversely, a small gamma value makes the bell-shaped
curve wider, so instances have a larger range of influence, and the decision
boundary ends up smoother. So γ acts like a regularization hyperparameter: if
your model is overfitting, you should reduce it, and if it is underfitting, you
should increase it.

3.​ Explain how decision trees are trained, visualized and used in making
predictions.
Decision Trees are versatile Machine Learning algorithms and the fundamental components of some of the most powerful ensemble methods (such as Random Forests). Below we train, visualize, and use a Decision Tree to make predictions.
Figure 6-1. Iris Decision Tree
The following code trains a DecisionTreeClassifier on the iris dataset
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
iris = load_iris()
X = iris.data[:, 2:] # petal length and width
y = iris.target
tree_clf = DecisionTreeClassifier(max_depth=2)
tree_clf.fit(X, y)
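To visualize the trained tree, one option (a sketch assuming Graphviz is installed; the output filename is arbitrary) is to export it to a .dot file with export_graphviz:

from sklearn.tree import export_graphviz

export_graphviz(
    tree_clf,
    out_file="iris_tree.dot",                # arbitrary output path
    feature_names=iris.feature_names[2:],    # petal length and width
    class_names=iris.target_names,
    rounded=True,
    filled=True
)
# Convert the .dot file to an image with the Graphviz command line, e.g.:
#   dot -Tpng iris_tree.dot -o iris_tree.png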
Figure 6-1 shows how the tree classifies an iris flower. Start at the root node
(depth 0, at the top): this node asks whether the flower’s petal length is
smaller than 2.45 cm. If it is, you move down to the root’s left child
node (depth 1, left). In this case, it is a leaf node (i.e., it does not have any
children), so it does not ask any questions: you simply look at the
predicted class for that node, and the Decision Tree predicts that the flower is an
Iris-Setosa (class=setosa).
Now suppose you find another flower, but this time the petal length is
greater than 2.45 cm. You move down to the root’s right child node
(depth 1, right), which is not a leaf node, so it asks another question: is
the petal width smaller than 1.75 cm? If it is, then the flower is most likely
an Iris-Versicolor (depth 2, left). If not, it is likely an Iris-Virginica (depth
2, right).
For example, the depth-2 left node has a gini score equal to 1 – (0/54)² – (49/54)² – (5/54)² ≈ 0.168.
Suppose we find a flower whose petals are 5 cm long and 1.5 cm wide. The corresponding
leaf node is the depth-2 left node, so the Decision Tree should output the
following probabilities: 0% for Iris-Setosa (0/54), 90.7% for Iris-Versicolor (49/54),
and 9.3% for Iris-Virginica (5/54). If you ask it to predict the class, it
should output Iris-Versicolor (class 1), since that has the highest probability. Let’s check
this:

>>> tree_clf.predict_proba([[5, 1.5]])
array([[0. , 0.90740741, 0.09259259]])
>>> tree_clf.predict([[5, 1.5]])
array([1])

4.​ Explain Bagging and pasting with an example.


One way to get a diverse set of classifiers is to use very different training
algorithms. Another approach is to use the same training algorithm for
every predictor, but to train them on different random subsets of the training set.
When sampling is performed with replacement, this method is called bagging
(short for bootstrap aggregating). When sampling is performed without
replacement, it is called pasting. Both bagging and pasting allow training
instances to be sampled several times across multiple predictors, but only
bagging allows training instances to be sampled several times for the same
predictor. This sampling and training process is represented in Figure 7-4.
Once all predictors are trained, the ensemble can make a prediction for a new
instance by simply aggregating the predictions of all predictors. The aggregation
function is typically the statistical mode (i.e., the most frequent prediction, just
like a hard voting classifier) for classification, or the average for regression.
Each individual predictor has a higher bias than if it were trained on the original
training set, but aggregation reduces both bias and variance. Generally, the net
result is that the ensemble has a similar bias but a lower variance than a single
predictor trained on the original training set.
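A minimal sketch of both variants with Scikit-Learn’s BaggingClassifier (assuming X_train, y_train are an existing training set, e.g. a moons split; the hyperparameter values are illustrative):

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Bagging: each predictor sees a bootstrap sample (sampling WITH replacement)
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=500,
    max_samples=100, bootstrap=True, n_jobs=-1)
bag_clf.fit(X_train, y_train)

# Pasting: identical setup, but sampling WITHOUT replacement
paste_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=500,
    max_samples=100, bootstrap=False, n_jobs=-1)
paste_clf.fit(X_train, y_train)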

5. Explain the CART algorithm and regularization hyperparameters in Decision Trees.
Scikit-Learn uses the Classification And Regression Tree (CART) algorithm to train
Decision Trees (also called “growing” trees). The idea is quite simple: the
algorithm first splits the training set into two subsets using a single feature
k and a threshold tk (e.g., “petal length ≤ 2.45 cm”).
The cost function that the algorithm tries to minimize is given by
Equation 6-2.
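For reference, the usual textbook form of this cost function (for classification) is:

J(k, tk) = (m_left / m) · G_left + (m_right / m) · G_right

where G_left / G_right measure the impurity of the left/right subset and m_left / m_right are the number of instances in the left/right subset.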
To avoid overfitting the training data, you need to restrict the Decision
Tree’s freedom during training. As you know by now, this is called
regularization. The figure shows two Decision Trees trained on the moons
dataset (introduced in Chapter 5). On the left, the Decision Tree is trained
with the default hyperparameters (i.e., no restrictions), and on the right
the Decision Tree is trained with min_samples_leaf=4. It is quite obvious
that the model on the left is overfitting, and the model on the right will
probably generalize better.

6. What is Boosting? Explain AdaBoost and Gradient Boosting.


Boosting refers to any Ensemble method that can combine several weak
learners into a strong learner. The general idea of most boosting methods is to
train predictors sequentially, each trying to correct its predecessor. There are
many boosting methods available, but by far the most popular are AdaBoost
(short for Adaptive Boosting) and Gradient Boosting. One way for a new predictor to correct its predecessor is to pay a bit
more attention to the training instances that the predecessor underfitted. This
results in new predictors focusing more and more on the hard cases. This is the
technique used by AdaBoost.
For example, to build an AdaBoost classifier, a first base classifier (such as a
Decision Tree) is trained and used to make predictions on the training set. The
relative weight of misclassified training instances is then increased. A second
classifier is trained using the updated weights and again it makes predictions on
the training set, weights are updated, and so on (see Figure 7-7).
Figure 7-8 shows the decision boundaries of five consecutive predictors on the
moons dataset (in this example, each predictor is a highly regularized SVM
classifier with an RBF kernel). The first classifier gets many instances wrong,
so their weights get boosted. The second classifier therefore does a better job on
these instances, and so on. The plot on the right represents the same sequence of
predictors except that the learning rate is halved (i.e., the misclassified instance
weights are boosted half as much at every iteration). As you can see, this
sequential learning technique has some similarities with Gradient Descent,
except that instead of tweaking a single predictor’s parameters to minimize a
cost function, AdaBoost adds predictors to the ensemble, gradually making it
better.

Once all predictors are trained, the ensemble makes predictions very much like
bagging or pasting. The following code trains an AdaBoost classifier based on 200 Decision Stumps:

from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

ada_clf = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1), n_estimators=200,
    algorithm="SAMME.R", learning_rate=0.5)
ada_clf.fit(X_train, y_train)
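Gradient Boosting also adds predictors sequentially, but instead of re-weighting instances at every iteration, each new predictor is fit to the residual errors made by the previous one. A minimal sketch (assuming X, y are a regression training set; the hyperparameter values are illustrative):

from sklearn.ensemble import GradientBoostingRegressor

gbrt = GradientBoostingRegressor(max_depth=2, n_estimators=3, learning_rate=1.0)
gbrt.fit(X, y)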

7. What is Bayes' theorem?

Refer to the answer given in class.

8.​ Discuss the minimum description length algorithm.

9.​ Explain the steps in Gibbs Algorithm


10. Write the EM algorithm and explain it in detail.
11. Explain the Naïve Bayes classifier with an example.
12.​Implement a Support Vector Machine (SVM) model to classify a dataset
with multiple classes. Explain the steps taken to preprocess the data, train
the model, and optimize its performance. Include the methods used for
hyperparameter tuning and evaluation of the final model.
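One possible outline (a sketch only; the dataset, parameter grid, and split below are illustrative assumptions, using the multi-class iris dataset):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import classification_report

# 1. Load a multi-class dataset and hold out a test set
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42)

# 2. Preprocess (feature scaling) and train an SVC inside a pipeline
svm_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("svc", SVC(kernel="rbf"))
])

# 3. Hyperparameter tuning: grid search over C and gamma with cross-validation
param_grid = {"svc__C": [0.1, 1, 10, 100], "svc__gamma": [0.01, 0.1, 1]}
grid = GridSearchCV(svm_pipe, param_grid, cv=5)
grid.fit(X_train, y_train)

# 4. Evaluate the tuned model on the held-out test set
print(grid.best_params_)
print(classification_report(y_test, grid.predict(X_test)))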

13.​What are the main differences between linear and nonlinear Support
Vector Machines?
The fundamental idea behind SVMs is best explained with some pictures.
The figure shows part of the iris dataset. The two classes can clearly be separated
easily with a straight line (they are linearly separable). The left plot
shows the decision boundaries of three possible linear classifiers. The
model whose decision boundary is represented by the dashed line is so
bad that it
does not even separate the classes properly. The other two models work
perfectly on this training set, but their decision boundaries come so close
to the instances that these models will probably not perform as well on
new instances. In contrast, the solid line in the plot on the right represents
the decision boundary of an SVM classifier; this line not only separates
the two classes but also stays as far away from the closest training
instances as possible. You can think of an SVM classifier as fitting the
widest possible street (represented by the parallel dashed lines) between
the classes.
This is called large margin classification.

Soft Margin Classification


If we strictly impose that all instances be off the street and on the right
side, this is called hard margin classification. There are two main issues
with hard margin classification. First, it only works if the data is linearly
separable, and second it is quite sensitive to outliers. Figure 5-3 shows the
iris dataset with just one additional outlier: on the left, it is impossible to
find a hard margin, and on the right the decision boundary ends up very
different from the one we saw in Figure 5-1 without the outlier, and it
will probably not generalize as well.

To avoid these issues it is preferable to use a more flexible model. The
objective is to find a good balance between keeping the street as large as
possible and limiting the margin violations (i.e., instances that end up in
the middle of the street or even on the wrong side). This is called soft
margin classification. In Scikit-Learn’s SVM classes, you can control this
balance using the C hyperparameter: a smaller C value leads to a wider
street but more margin violations. Figure 5-4 shows the decision
boundaries and margins of two soft margin SVM classifiers on a
nonlinearly separable dataset. On the left, using a low C value the margin
is quite large, but many instances end up on the street. On the right, using
a high C value the classifier makes fewer margin violations but ends up
with a smaller margin. However, it seems likely that the first classifier
will generalize better: in fact even on this training set it makes fewer
prediction errors, since most of the margin violations are actually on the
correct side of the decision boundary

The following Scikit-Learn code loads the iris dataset, scales the features,
and then trains a linear SVM model (using the LinearSVC class with C =
1 and the hinge loss function, described shortly) to detect Iris-Virginica
flowers. The resulting model is represented on the left of Figure 5-4.
import numpy as np
from sklearn import datasets
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC
iris = datasets.load_iris()
X = iris["data"][:, (2, 3)] # petal length, petal width
y = (iris["target"] == 2).astype(np.float64) # Iris-Virginica
svm_clf = Pipeline([
    ("scaler", StandardScaler()),
    ("linear_svc", LinearSVC(C=1, loss="hinge")),
])
svm_clf.fit(X, y)
You can then use the model to make predictions:
>>> svm_clf.predict([[5.5, 1.7]])
array([1.])
Nonlinear SVM Classification
Although linear SVM classifiers are efficient and work surprisingly well
in many cases, many datasets are not even close to being linearly
separable. One approach to handling nonlinear datasets is to add more
features, such as polynomial features (as you did in Chapter 4); in some
cases this can result in a linearly separable dataset.
Consider the left plot in Figure 5-5: it represents a simple dataset with
just one feature x1. This dataset is not linearly separable, as you can see.
But if you add a second feature x2 = (x1)², the resulting 2D dataset is
perfectly linearly separable.

from sklearn.datasets import make_moons
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.svm import LinearSVC

X, y = make_moons(n_samples=100, noise=0.15)   # moons dataset
polynomial_svm_clf = Pipeline([
    ("poly_features", PolynomialFeatures(degree=3)),
    ("scaler", StandardScaler()),
    ("svm_clf", LinearSVC(C=10, loss="hinge"))
])
polynomial_svm_clf.fit(X, y)

14.​Explain the concept of GINI impurity and how it is used in decision tree
algorithms.
Gini impurity is a measure of impurity: a metric used in decision tree algorithms to measure how “pure” a node is in terms of its class distribution. A pure node means all data points in that node belong to a single class, while an impure node contains data points from multiple classes.

Formula for Gini Impurity

For a given node i, the Gini impurity is calculated as:

G_i = 1 − Σ (k = 1 to n) (p_i,k)²

Where:
• n: the number of classes.
• p_i,k: the proportion of class-k instances among the training instances in node i.

The Gini impurity ranges from 0 (a pure node) up to a maximum of 1 − 1/n, which depends on the number of classes n.

Interpreting Gini Impurity


​ •​ Gini Impurity = 0: All samples in the node belong to one
class (pure node).
​ •​ Higher Gini Impurity: Indicates a more mixed distribution of
classes.

For example:
​ •​ If all samples in a node are of the same class, Gini=0
​ •​ If there are two classes with equal proportions, Gini=0.5

How Gini Impurity is Used in Decision Trees


​ 1.​ Splitting Nodes:
​ •​ Decision trees aim to split nodes such that the resulting child
nodes are as pure as possible.
​ •​ At each split, the algorithm evaluates the Gini impurity for
potential splits and chooses the one that minimizes the weighted average
impurity of the child nodes.
​ 2.​ Weighted Gini Impurity After a Split:
The weighted Gini impurity (the quantity CART minimizes) is computed as J(k, tk) = (m_left / m) · G_left + (m_right / m) · G_right, i.e. each child node’s impurity weighted by its share of the instances.
Once it has successfully split the training set in two, it splits the subsets
using the same logic, then the sub-subsets and so on, recursively. It stops
recursing once it reaches the maximum depth (defined by the max_depth
hyperparameter), or if it cannot find a split that will reduce impurity.
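A small numeric check of the formula (a sketch, reusing the depth-2 node class counts [0, 49, 5] from question 3):

# Gini impurity of a node containing 0 + 49 + 5 = 54 training instances
counts = [0, 49, 5]
m = sum(counts)
gini = 1 - sum((c / m) ** 2 for c in counts)
print(round(gini, 3))   # 0.168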

15. Evaluate the performance of bagging and boosting, as well as their combination.
One way to get a diverse set of classifiers is to use very different training
algorithms, as just discussed. Another approach is to use the same
training algorithm for every predictor, but to train them on different
random subsets of the training set. When sampling is performed with
replacement, this method is called bagging (short for bootstrap
aggregating). When sampling is performed without replacement, it is
called pasting. In other words, both bagging and pasting allow training
instances to be sampled several times across multiple predictors, but only
bagging allows training instances to be sampled several times for the
same predictor. This sampling and training process is represented in
Figure 7-4.
Once all predictors are trained, the ensemble can make a prediction for a
new
instance by simply aggregating the predictions of all predictors. The
aggregation function is typically the statistical mode (i.e., the most
frequent prediction, just like a hard voting classifier) for classification, or
the average for regression. Each individual predictor has a higher bias
than if it were trained on the original training set, but aggregation reduces
both bias and variance. Generally, the net result is that the ensemble has
a similar bias but a lower variance than a single predictor trained on the
original training set.

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

bag_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=500,
    max_samples=100, bootstrap=True, n_jobs=-1)
bag_clf.fit(X_train, y_train)
y_pred = bag_clf.predict(X_test)
One way for a new predictor to correct its predecessor is to pay a bit more
attention to the training instances that the predecessor underfitted. This
results in new predictors focusing more and more on the hard cases. This
is the technique used by AdaBoost.
For example, to build an AdaBoost classifier, a first base classifier (such
as a DecisionTree) is trained and used to make predictions on the training
set. The relative weight of misclassified training instances is then
increased. A second classifier is trained using the updated weights and
again it makes predictions on the training set, weights are updated, and so
on
from sklearn.ensemble import AdaBoostClassifier

ada_clf = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1), n_estimators=200,
    algorithm="SAMME.R", learning_rate=0.5)
ada_clf.fit(X_train, y_train)

16. Discuss the results in terms of accuracy, robustness, and computational cost.
17. Explain the Maximum Likelihood Estimation (MLE) method and its
significance in parameter estimation

18. Describe the Bayes Optimal Classifier and its theoretical importance in
classification problems.
Answer given in class
19. What is a Bayesian Belief Network, and how does it represent
probabilistic relationships between variables?
Answer given in class
20. Construct a regression model using the following data, which consists of 10 data instances and three attributes: ‘Assessment’, ‘Assignment’ and ‘Project’.
