ML1 17 Hepsi
Measurement and Evaluation
Sources
(1) Zeki Yetgin, Makine Ogrenmesi Ders Notları (Machine Learning Lecture Notes)
(2) Introduction to Machine Learning with Python, Andreas C. Müller & Sarah Guido, 2016
Prof.Dr. Zeki YETGİN, Mersin Univ. , Machine Learning Course Book (Content is copyrighted)
Knowledge Representation
Learning
Searching
Reasoning
(optimization)
Decision-Making
Agents that have reasoning ability are also called knowledge-based agents. Reasoning agents
emulate the way human beings learn. They have a knowledge base similar to that of human beings, and
they learn facts by explicit declaration, such as adding formal sentences to their knowledge base.
Like human beings, they have an inference engine through which they infer new facts from
the knowledge base. This category, covering knowledge engineering, expert systems and logic (fuzzy
logic, first-order logic, second-order logic, first-order fuzzy logic, and so on), is usually outside the
scope of machine learning. Similarly, agents sometimes make decisions by searching the solution
space, which leads to optimization. Optimization is also outside the scope of machine learning in
artificial intelligence.
Machine learning in this course deals with how to make agents learn something from data. Data is
considered in a broad sense: there may be no explicit data but only implicit data, such as the agent's
experience. Generally, agents are trained using input data, which consists of a set of samples. The agent
learns from the input samples and outputs a decision when independent (unseen) samples are given.
Usually, the agent's learning mechanism is based on developing a learner model that generalizes the
relation between the input x and the output y (in the case of supervised learning), or the relation
between samples (in the case of unsupervised learning). Supervised and unsupervised learning will be
introduced in later sections.
Fig 1.1
Here f is the learner and the output f(x) is the agent's decision for the input sample x. Note that
throughout the course the symbol y denotes the actual output for the input sample x,
whereas f(x) is the output predicted by the learner (or agent).
In practice, the agent can learn the behavior of any real system if the system's input and/or output
data are available. For example, as a real system, consider a medical doctor who evaluates many
variables such as symptoms, blood analysis, MR images, and so on, to decide on the patient's
diagnosis. In this example, the input can be denoted as a vector of variables, x = (x1, x2, …, xl), and
the output y is a label indicating the doctor's diagnosis (illness). The agent learns from the
training samples, the (x, y) data, and develops a learning ability f that should approximate
the functionality of the medical doctor. Thus, for any input sample x, the agent decision f(x) ≅ y
Fig 1.2
The learner f does not necessarily follow logic (reasoning); anything that makes agents act
rationally is welcome. This is called the "Rational Agent" approach (given in the AI Course). Thus, f can
be logic functions, probability density functions, curves, planes or surfaces in a multi-dimensional
space, decision trees, rules, neural networks, etc. The important thing is that the decision f(x) should
look like intelligent behavior (output); the internal (learning) mechanism does not need to use
logic.
Let's call the "input variables" of the system "features" since, in practice, samples could be real
objects that are described by features. For example, consider a face image (matrix) that is described
by a vector of 50 features, meaning x=(x1,x2,…,x50) where each xi is a feature and x is a feature
vector. Extracting the features that best describe the sample is called feature extraction or generation.
Feature extraction is outside the scope of the course since the features and the way they are extracted
vary with the problem and the nature of the samples. Problems in chemistry, biomedicine, zoology,
electronics, medicine, and so on, each require their own experts to extract features. However, machine
learning techniques can sometimes be used to extract or select features. Usually the training data is
acquired from sensors, such as a microphone or a camera. Assuming the features are extracted, the learner
design is the main focus in machine learning.
Machine learning applications originate from many fields: Statistics, Mathematics, Computer Science,
Physics, Neuroscience, etc.
Example (classification example): The feature space below is built from survey data that records
people's happiness, happy(+) or unhappy(-), based on the age and weight features. The agent's learning
model is a curve (a curve has parameters in math). Training the model means that the curve
parameters are fitted to the data such that the curve partitions the feature space into happy and unhappy classes
from some generalization point of view.
Learner f= curve
Fig 1.3
Variable Size Feature Vectors : It will be harder to make a decision when the feature vectors vary
in size. We can either reduce/increase to a fixed size or we can choose a suitable model.
Fig 1.4
Scope of the course
Training Set = { (xi, yi) | i=1..N , N is the number of samples } = {training samples}
Classification : The outputs (yi) are labels or descriptors. These labels are called classes. Thus,
each sample (point) in the feature space is marked by a class, usually visualized by colors or symbols.
The meaning of the output labels in the training set is known a priori. For example, the meanings of "+",
the Happy class, and "-", the Unhappy class, are known before training.
Example :
Learner = curve
Here, the learner model is trained with input (face) images and their output labels (classes:
smiling, angry, etc). Features are extracted from each face image, and each face image is described
by a separate feature vector. The feature vectors form the feature space. The agent
develops a learner that discriminates the happy classes from the other classes in the feature
space.
One form of classifier design is curve fitting, where the learner model is a curve that has
parameters and training the learner model means finding the model parameters that
best fit the training samples.
Figure panels:
- Non-separable with a linear learner (2 features, 2 classes)
- Separable with another, non-linear learner (2 features, 2 classes)
- Separable with a linear (hyperplane) learner in higher dimension (2 classes, 3 features)
- Separable with a non-linear learner (2 features, 3 classes)
Regression: The outputs yi of the training samples (xi, yi) are real numbers. The output axis y is
orthogonal to the feature space x. The samples are not labeled (all samples are plotted the same way).
Consider the previous face classification example where the agent learns whether the given
face image is happy or not. Now let the agent be trained with the input faces and their ages. This
time, the agent can predict the age for a given face image. Since the output (age) is a
continuous real number, the learner is a regressor rather than a classifier.
Note that labels in classification are sometimes numbers, but they are only used to
discriminate between classes, such as class 1, class 2, etc. In classification, whether a point lies above or
below the decision surface flips the output completely. In regression, whether the
points lie above or below the decision surface matters little. Furthermore, points
close to the decision surface mean better accuracy in regression, whereas points close to
the decision surface are not preferred in classification.
An unsupervised learning agent generally analyses the input samples according to their
structural similarities. Two general analyses are i- cluster analysis (clustering), where the
agent groups the input samples into clusters, and ii- association analysis (association rules),
where interesting relations between the features (variables) are discovered.
In clustering, the agent partitions the samples into clusters where the samples in each
cluster have some similarities among themselves. Thus, the similarity between samples needs
to be measured by similarity metrics, also called distance metrics, such as Euclidean, cosine,
Hamming, etc. Once the agent has learned from the samples, it assigns a label to each cluster, such as
cluster-1, cluster-2, etc.
Clustering and classification both produce labels in the end. The labels of classification have a
semantic meaning before the learning stage. For example, when the agent predicts the input face
image to be Ahmet (label), the name Ahmet is already known as a label by the agent (it is
available in the dataset). In contrast, the labels of clustering are assigned at the final stage of
learning just to differentiate between the clusters, such as cluster-1, cluster-2, etc. Renaming the
cluster labels to any other labels, such as group1, group2, etc., does not matter since the
labels still differentiate the clusters.
Ref: https://www.slideshare.net/aorriols/lecture13-association-rules
Fig 5.1
One approach is to cluster the data first and label the selected samples later. Here clustering affects
the labeling. Instead of choosing samples that are close to each other in the feature space,
samples that are far enough from each other and represent variations are preferred. Finding
the best samples to select is considered a problem on its own.
The other approach is to label first (there may already exist labeled samples) and cluster later. Here
labeling affects the clustering behavior.
1.6. Dataset
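Each sample is represented by a feature vector. A group of input samples can be kept as
a matrix (each row represents a sample), which is shown as X in the course. The
corresponding outputs are kept as a vector, which is shown as y in the course.
$$X = \begin{bmatrix} x_1 \\ \vdots \\ x_N \end{bmatrix} = \begin{bmatrix} x_{11} & \dots & x_{1l} \\ \vdots & & \vdots \\ x_{N1} & \dots & x_{Nl} \end{bmatrix}, \qquad y = \begin{bmatrix} y_1 \\ \vdots \\ y_N \end{bmatrix}$$
$$Dataset = \{X, y\} = \{(x_i, y_i) \mid i = 1..N\}$$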
2.1. Overfitting and Underfitting for Regressors (on curve fitting examples)
- In supervised learning models (e.g. regression models), the model should fit data enough so that
it develops a good generalization ability.
- Trying to fit to all samples in the training set negatively affects the generalization ability. This is
called overfitting.
- Mean square error (MSE) is one popular way to measure the regression error.
$$MSE = \frac{1}{n}\sum_{i=1}^{n} (y_i - f(x_i))^2$$
- In the above example, f is a regressor (regression model) that is overfitted. The learner f loses its
generalization ability while fitting all the training samples. The training error of f is very low (zero).
- In the above example, the regressor's performance on the test samples is very bad even though its
performance on the training set is perfect. The test samples are far away from the decision curve f, which
overfits the training set.
Underfitting
- The model is too generic: it over-generalizes the training
samples.
- In the example above, the regressor f is underfitted. Thus, the training error is very high (higher
than acceptable).
- In the above example, 8 test samples are badly evaluated by the learner f
- Training Error is 0% (Perfect)
- 100% (not acceptable)
As mentioned before, minimizing the training error is good as long as the test error is also minimized.
Otherwise a perfect training error is usually not preferred since it causes the learner to memorize
the training samples (overfitting), just like a student memorizing the answers instead of actually
learning. Thus, depending on the problem, one may prefer an average or
above-average training performance.
- 11% (acceptable)
- 0% (perfect)
K-Cross Validation
- If we only have a single dataset and no separate test set, the dataset is split into K pieces (folds)
and K iterations are applied. At each iteration, one of the pieces is used as the test set and the rest
is used as the training set. Finally, the average of the K errors shows the model's performance on the
dataset.
- K-cross validation is usually used for model selection: it helps us choose the best
model for the given dataset. It is used to evaluate the performance of a learner model (the model design), rather than
a particular trained model.
- By validating the model across all parts, the model is evaluated better against the overfitting and
underfitting cases.
- $Model\ Error = \frac{1}{K}\sum_{i=1}^{K} e_i$
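The K-fold procedure above can be sketched in a few lines of NumPy. This is only an illustration under assumptions: the helper names k_fold_error, fit_line and predict_line are hypothetical (not from the course or any library), the fold error e_i is taken to be the MSE on the test fold, and the learner is a least-squares line.

```python
import numpy as np

def k_fold_error(X, y, fit, predict, k=5, seed=0):
    """Average test MSE over k folds (hypothetical helper, not a library API)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(y))        # shuffle the sample indices once
    folds = np.array_split(indices, k)       # K pieces (folds)
    errors = []
    for i in range(k):
        test_idx = folds[i]                  # one fold is the test set
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit(X[train_idx], y[train_idx])
        fx = predict(model, X[test_idx])
        errors.append(float(np.mean((y[test_idx] - fx) ** 2)))  # e_i
    return float(np.mean(errors)), errors    # Model Error = sum(e_i) / K

def fit_line(X, y):
    """Least-squares linear learner used only for this demonstration."""
    A = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict_line(coef, X):
    return np.column_stack([X, np.ones(len(X))]) @ coef

# Noise-free demo data on the line y = 3x + 2
X_demo = np.arange(20, dtype=float).reshape(-1, 1)
y_demo = 3.0 * X_demo[:, 0] + 2.0
model_error, fold_errors = k_fold_error(X_demo, y_demo, fit_line, predict_line, k=4)
```

Because the demo data is exactly linear, every fold error is numerically zero; with real data the spread of fold_errors hints at how stable the model is across folds.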
                          Actual Values
Predicted Values          Positive (1)    Negative (0)
Positive (1)              TP              FP
Negative (0)              FN              TN
True Positives are the samples that our model called positive and whose real output is positive.
True Negatives are the samples that our model called negative and whose real output is negative.
False Positives are the samples that our model called positive but whose real output is negative.
False Negatives are the samples that our model called negative but whose real output is positive.
$$Accuracy\ Rate = \frac{TP + TN}{N}$$
$$Error\ Rate = \frac{FP + FN}{N}$$
$$Recall = \frac{TP}{TP + FN}$$
$$FP\ Rate = \frac{FP}{FP + TN}$$
$$Precision = \frac{TP}{TP + FP}$$
Precision: among the samples classified as positive, what percent was correct?
Recall: among the actual positive samples, what percent was classified correctly?
f1-score: harmonic mean of precision and recall, $F_1 = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall}$
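As a small illustration of the confusion-matrix metrics, the sketch below counts TP, TN, FP and FN from two label lists and evaluates the formulas. The function name binary_metrics and the toy labels are assumptions made for this example; the f1 value uses the standard harmonic-mean definition, and the sketch assumes no denominator is zero.

```python
import numpy as np

def binary_metrics(y_true, y_pred):
    """Confusion-matrix metrics for labels 1 (positive) and 0 (negative)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))
    tn = int(np.sum((y_pred == 0) & (y_true == 0)))
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))
    n = tp + tn + fp + fn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / n,
        "error_rate": (fp + fn) / n,
        "recall": recall,
        "fp_rate": fp / (fp + tn),
        "precision": precision,
        "f1": 2 * precision * recall / (precision + recall),
    }

# Toy data: 4 actual positives, 6 actual negatives
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
metrics = binary_metrics(y_true, y_pred)
```

For these toy labels TP=3, FN=1, FP=1 and TN=5, so the accuracy is 0.8 while precision and recall are both 0.75.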
- In the measurement sense, precision shows how close the outputs are to each other in repeated measurements. A precise
model responds consistently: similar inputs repeatedly produce similar outputs, even if they share a systematic error.
- There are many metrics for regressors. Some of them are MAPE (mean absolute percentage error),
R2 (R-square), MAE (mean absolute error), MSE (mean square error) and MMME (mean of min
over max error). The metrics are defined for n samples in the formulas below, where yi is the true
distance and f(xi) is the predicted distance.
- The MAPE is the average percentage error with respect to the true distance.
$$MAPE = \frac{1}{n}\sum_{i=1}^{n} \frac{|y_i - f(x_i)|}{y_i}$$
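- R-square (R2) is the coefficient of determination, a statistical measure of the proportion of the
variance in y that is explained by the model, where y̅ is the mean of the yi values.
$$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - f(x_i))^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$$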
- The MAE is the average of the absolute differences between the true and predicted distances.
$$MAE = \frac{1}{n}\sum_{i=1}^{n} |y_i - f(x_i)|$$
- The MSE measures how close the predicted distances are to the actual distances.
$$MSE = \frac{1}{n}\sum_{i=1}^{n} (y_i - f(x_i))^2$$
- The MMME is the percentage error that measures how close the predicted and actual values are to
each other. The MMME always takes the smaller of the actual and predicted values relative to the larger one,
whereas MAPE always takes their absolute difference relative to
the actual value.
$$MMME = 1 - \frac{1}{n}\sum_{i=1}^{n} \frac{\min(y_i, f(x_i))}{\max(y_i, f(x_i))}$$
- To analyze the performance of a regressor, the y=f(x) line can be plotted to see
how many samples are on this line. The more samples fall on the line, the better the performance.
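The regressor metrics above translate directly into NumPy. The sketch below is an illustration only: the function name regression_metrics and the toy values are assumptions, R2 uses the standard coefficient-of-determination definition, and MAPE and MMME assume strictly positive true and predicted values.

```python
import numpy as np

def regression_metrics(y, fx):
    """MAPE, MAE, MSE, R2 and MMME for true values y and predictions fx."""
    y = np.asarray(y, dtype=float)
    fx = np.asarray(fx, dtype=float)
    return {
        "mape": float(np.mean(np.abs(y - fx) / y)),
        "mae": float(np.mean(np.abs(y - fx))),
        "mse": float(np.mean((y - fx) ** 2)),
        "r2": float(1 - np.sum((y - fx) ** 2) / np.sum((y - y.mean()) ** 2)),
        "mmme": float(1 - np.mean(np.minimum(y, fx) / np.maximum(y, fx))),
    }

# Toy values: true distances and predicted distances
y_true = [10.0, 20.0, 30.0, 40.0]
y_hat = [12.0, 18.0, 30.0, 44.0]
scores = regression_metrics(y_true, y_hat)
```

Note that MAPE and MMME are relative errors (fractions of the value), while MAE and MSE carry the units of y and of y squared respectively.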
Linear Non-Linear
Discriminant Models
- The focus is to model a decision surface that best discriminates the outputs in the feature space.
Thus, the model f(x) is designed in such a way that it directly predicts the output without any
intermediate modeling of the data samples. For example, model the area that best separates the
classes in the case of classification, or model the area that best approximates the outputs in the case of
regression.
- In a parametric model, the model is described in terms of parameters and the best parameter
values are found while training the model. In curve fitting, the form of the curve is initially
assumed and the parameter values are found such that the curve best discriminates the outputs.
- For example, in the curve fitting category, linear classifiers always aim to find the best w direction
(normal or gradient), where w = (wl, …, w1, w0) contains the model parameters and the best w means
the curve direction that best separates the classes.
Generative Models
- The focus is to model the data samples on their own such that they can be regenerated
approximately. Thus, the model f(x) requires an intermediate modeling of the data samples in the
feature space such that the feature space can be approximated. For example, the discriminant
models are interested in modeling the regions outside the classes whereas the generative
models are interested in the regions inside the classes. One general way is to model the sample
distribution in the feature space as an approximation of a probability density function (pdf), p(x), such
as a mixture of Gaussian distributions. For example, the data of each class ci could be approximated
using a separate pdf p(x|ci).
- Then the learner f can be further modeled using the pdf approximation, p(x|ci)
- For example, Bayes classifiers using the pdf approximation assign the most likely class (the class with
the maximum posterior probability p(ci|x), where Pi is the prior probability of class ci) as the final decision as follows,
$$f(x) = \arg\max_i \, p(c_i \mid x) = \arg\max_i \, P_i \cdot p(x \mid c_i)$$
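A minimal sketch of such a generative classifier follows, under simplifying assumptions that are not in the notes: one feature only, one Gaussian pdf per class, and priors Pi estimated as class frequencies. The helper names fit_gaussian_bayes, gaussian_pdf and predict_bayes are hypothetical.

```python
import numpy as np

def fit_gaussian_bayes(X, y):
    """Estimate (prior P_i, mean, variance) of each class for 1-D features."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    params = {}
    for c in np.unique(y):
        xc = X[y == c]                       # samples of class c
        params[int(c)] = (len(xc) / len(X), xc.mean(), xc.var())
    return params

def gaussian_pdf(x, mean, var):
    """Gaussian approximation of p(x | c_i)."""
    return np.exp(-(x - mean) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def predict_bayes(params, x):
    """f(x) = argmax_i P_i * p(x | c_i)."""
    scores = {c: prior * gaussian_pdf(x, mean, var)
              for c, (prior, mean, var) in params.items()}
    return max(scores, key=scores.get)

# Toy 1-D data: class 0 around 1.0, class 1 around 5.0
X_demo = [1.0, 1.2, 0.8, 5.0, 5.3, 4.7]
y_demo = [0, 0, 0, 1, 1, 1]
params = fit_gaussian_bayes(X_demo, y_demo)
label_low = predict_bayes(params, 1.1)
label_high = predict_bayes(params, 4.9)
```

The same idea extends to several features by replacing the 1-D Gaussian with a multivariate pdf or a mixture, as the text suggests.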
Standard Normalization
Standard normalization transforms the dataset into a new set in which the origin is shifted to the data
center (meaning the mean of all samples is zero) and each feature is rescaled to unit variance. Thus, the
feature values are spread around the data center on a comparable scale.
$$x_i' = \frac{x_i - \mu}{\sigma}$$
where xi is any feature vector and the division is element-wise,
μ = mean vector of all the vectors (μi is the mean of the i-th feature, e.g. the i-th column in X),
σ = standard deviation vector of all the vectors (σi is the standard deviation of the i-th feature)
Figure: before normalization the data mean is μ; after normalization μ = 0.
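A short NumPy sketch of standard normalization, assuming σ is taken as the per-feature standard deviation and that no feature is constant (so no division by zero occurs). The toy matrix is an assumption for illustration.

```python
import numpy as np

# Toy dataset: 3 samples (rows), 2 features (columns) on very different scales
X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])

mu = X.mean(axis=0)        # mean of each feature (column)
sigma = X.std(axis=0)      # standard deviation of each feature
X_std = (X - mu) / sigma   # element-wise shift and scale
```

After the transform each column of X_std has mean 0 and standard deviation 1, so both features contribute on the same scale.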
MinMax Normalization
All of the feature values are scaled to between 0 and 1 after min-max normalization
$$x_i' = \frac{x_i - x_{min}}{x_{max} - x_{min}}$$
where xi is any feature vector, and xmin and xmax hold the minimum and maximum value of each
feature respectively, taken over all samples (the division is element-wise).
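A short NumPy sketch of min-max normalization. It assumes the minimum and maximum are taken per feature (column) over all samples, which is one common convention, and that no feature is constant. The toy matrix is an assumption for illustration.

```python
import numpy as np

# Toy dataset: 3 samples (rows), 2 features (columns)
X = np.array([[1.0, 200.0],
              [2.0, 500.0],
              [5.0, 600.0]])

x_min = X.min(axis=0)                     # minimum of each feature
x_max = X.max(axis=0)                     # maximum of each feature
X_scaled = (X - x_min) / (x_max - x_min)  # every feature now lies in [0, 1]
```

The smallest value of each feature maps to 0 and the largest to 1; for example the first column becomes 0, 0.25 and 1.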
The learner model is assumed to have a linear form. In a high dimensional space ($R^l$), the linear form is
called a hyperplane, which can be defined using its normal vector (w) and intercept (w0); these are the model
parameters. w decides the direction of the hyperplane.
- One general solution is w optimization: search for the direction w = (wl ,…, w1, w0) that
best fits the training data.
- The majority of linear regression models use derivative-based optimization to find the best w, while
some other techniques, such as the Theil-Sen estimator, do not use derivatives.
Problem: how to find the w that minimizes the objective function (objfun)? Take the derivative of the
objective function with respect to w, set it to zero, and solve the resulting linear system:
$$objfun(w) = \frac{1}{n}\sum_{i=1}^{n}(y_i - w \cdot x_i)^2 \;\rightarrow\; \frac{\partial\, objfun}{\partial w} = -\frac{2}{n}\sum_{i=1}^{n} x_i\,(y_i - w \cdot x_i) = 0$$
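1) Simple case (Simple Linear Regression): single feature $x \in R$, $f(x) = w_1 x + w_0$
$$\frac{\partial\, objfun}{\partial w_0} = -\frac{2}{n}\sum_{i=1}^{n}(y_i - w_1 x_i - w_0) = 0, \qquad \frac{\partial\, objfun}{\partial w_1} = -\frac{2}{n}\sum_{i=1}^{n} x_i\,(y_i - w_1 x_i - w_0) = 0$$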
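Solving the two equations of the simple case for w1 and w0 gives the familiar closed form w1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)² and w0 = ȳ − w1·x̄. The sketch below evaluates it; the function name simple_linear_regression and the toy data are assumptions for illustration.

```python
import numpy as np

def simple_linear_regression(x, y):
    """Analytic solution of d objfun/d w0 = 0 and d objfun/d w1 = 0."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = x.mean(), y.mean()
    w1 = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    w0 = y_mean - w1 * x_mean
    return float(w1), float(w0)

# Toy data lying exactly on the line y = 2x + 1
x_demo = [0.0, 1.0, 2.0, 3.0]
y_demo = [1.0, 3.0, 5.0, 7.0]
w1, w0 = simple_linear_regression(x_demo, y_demo)
```

For the toy data the solution recovers the slope 2 and the intercept 1.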
When the number of features (variables) is high, the analytic solution requires solving large linear
systems, which is a complexity issue. Thus, the analytic solution is not feasible for the w
optimization when the dataset contains many samples with many features. One can instead approximate the
solution of the derivative equation iteratively using the gradient descent algorithm.
import numpy as np

def linear_regression(X, y, learning_rate, max_iter):
    # X = n x l matrix and y = vector of n outputs
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float).reshape(-1, 1)  # force an n x 1 column vector
    n, l = X.shape                       # number of samples, number of features
    w = np.zeros((l, 1))                 # l x 1 matrix; could be random, here the 0 vector
    w0 = 0.0                             # intercept; could be random, here 0
    for _ in range(max_iter):
        fx = X @ w + w0                  # fx = n x 1 vector of predictions
        dw = -2 / n * (X.T @ (y - fx))   # dw = l x 1 gradient w.r.t. w
        dw0 = -2 / n * np.sum(y - fx)    # dw0 = real number, gradient w.r.t. w0
        w = w - learning_rate * dw
        w0 = w0 - learning_rate * dw0
    fx = X @ w + w0
    score = np.sum((y - fx) ** 2) / n    # training error (MSE), kept for inspection
    return (w, w0)
The stochastic gradient descent (SGD) algorithm uses one random sample at each iteration to compute the
gradient ∆𝒘, which is equal to the derivative of the objective function for one sample, i.e. the loss
function in this context. SGD is faster than GD at the cost of a noisier trajectory, and it is less likely to
converge exactly to the optimum. The basic steps of SGD are
import numpy as np

def linear_regression(X, y, learning_rate, max_iter):
    # X = n x l matrix and y = vector of n outputs
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float).ravel()
    n = len(y)                                  # number of samples
    X = np.append(X, np.ones((n, 1)), axis=1)   # add a column of ones: dimension = l+1
    l = X.shape[1]
    w = np.zeros(l)                             # could be random; the last entry acts as w0
    for _ in range(max_iter):
        ind = np.random.randint(n)              # select a random sample
        xi = X[ind]
        yi = y[ind]
        fxi = np.dot(xi, w)                     # fxi = real number
        dw = -xi * (yi - fxi)                   # dw = vector of length l+1
        w = w - learning_rate * dw
    fx = X @ w
    score = np.sum((y - fx) ** 2) / n           # training error (MSE), kept for inspection
    return w
Gradient descent variants: The mini-batch gradient descent variant partitions the training set into
subsets such that, at each iteration, one subset rather than the whole dataset is used to calculate
the gradient that updates the direction w. Compared with SGD, this further reduces the variance of the gradient. The
gradient is equal to the derivative of the objective function and is computed using only the samples
selected by the variant. Mini-batch gradient descent is a balance between the
advantages of stochastic gradient descent and (batch) gradient descent. It is generally used in
the field of artificial neural networks and deep learning to train the network.
Ref: https://towardsdatascience.com/gradient-descent-algorithm-and-its-variants-10f652806a3
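A minimal sketch of the mini-batch variant for linear regression, written in the same style as the listings above. The function name minibatch_linear_regression, the default hyperparameters and the toy data are assumptions for illustration, not values from the course.

```python
import numpy as np

def minibatch_linear_regression(X, y, learning_rate=0.1, epochs=500,
                                batch_size=4, seed=0):
    """Linear regression trained with mini-batch gradient descent."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float).ravel()
    n = len(y)
    Xb = np.append(X, np.ones((n, 1)), axis=1)   # last weight plays the role of w0
    w = np.zeros(Xb.shape[1])
    rng = np.random.default_rng(seed)
    for _ in range(epochs):
        order = rng.permutation(n)               # new random partition every epoch
        for start in range(0, n, batch_size):
            batch = order[start:start + batch_size]
            residual = y[batch] - Xb[batch] @ w
            # gradient of the MSE computed on the selected subset only
            dw = -2 / len(batch) * (Xb[batch].T @ residual)
            w = w - learning_rate * dw
    return w

# Noise-free toy data on the line y = 3x + 1
X_demo = np.linspace(0, 1, 20).reshape(-1, 1)
y_demo = 3.0 * X_demo[:, 0] + 1.0
w = minibatch_linear_regression(X_demo, y_demo)
```

Setting batch_size=1 gives the stochastic variant and batch_size=n gives (batch) gradient descent, which is why the mini-batch form is often seen as the general case.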
- Multiple Linear Regression Case: w = (wl ,…, w1, w0) it is a research problem
Wang, X., X. Dang, H. Peng, and H. Zhang (2009), The Theil-Sen Estimators in a Multiple Linear
Regression Model
print('MSE is {}'.format(mse))
print('R2 score is {}'.format(r2))
To analyze the regression performance visually, we should examine how closely the results of the model
follow the ideal line y=f(x).
cross_val_predict: This merges the predictions made for each test fold, using the .fit and .predict
implementation of the selected model during K-cross validation. We use this to analyze
the success of the model on the dataset.
cross_val_score: This returns the performance score of the model for each iteration, using
the model's own fit and predict methods. The default score is the model's own default
score. We should add the parameter "scoring=..." to change it.
Example: Analyzing the linear regression scores of each fold of K-cross validation
from sklearn import linear_model
from sklearn.model_selection import cross_val_score
X=...
y=...
model=linear_model.LinearRegression()
score1=cross_val_score(model,X,y,cv=3)
score2=cross_val_score(model,X,y,cv=3,scoring='neg_mean_absolute_error')
print(score1)
score2 = abs(score2) # score2 is the negated MAE, so we take its absolute value
print(score2)
output
score1 =[0.32 , 0.09 , 0.02] (the default score of the linear regression model is R2)
SGDRegressor Class
- The SGDRegressor class uses the stochastic gradient descent algorithm, where the objective
function is designed in terms of loss and penalty functions that both depend
on the parameter w. The loss function models the error for one sample; the penalty function
models the cost of the error with respect to the model complexity.
$$objfun(w) = \frac{1}{n}\sum_{i=1}^{n} loss(y_i, f(x_i)) + alpha \cdot penalty(w)$$
- The regularizer penalizes the model complexity to prevent overfitting, i.e. it penalizes the model when
it loses its generalization ability. Simpler models are better if they produce a
similar total loss. The regularization parameter alpha can be adjusted depending on the problem.
- Typical penalties for regression: L1-norm and L2-norm penalty
$L1\_penalty(w) = \sum_{i=1}^{l} |w_i|$
$L2\_penalty(w) = \sum_{i=1}^{l} w_i^2$
So if some dimensions of w approach zero, the model generalizes better.
The regularizer tries to bring the w vector closer to the zero vector.
- Typical loss function: squared_loss (named "squared_error" in recent scikit-learn versions)
$squared\_loss(w) = (y_i - f(x_i))^2$
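To make the roles of the loss, the penalty and alpha concrete, the sketch below evaluates the regularized objective by hand for a fixed w. The function name objective and the toy numbers are assumptions; this only illustrates the formula and is not the scikit-learn implementation.

```python
import numpy as np

def objective(w, w0, X, y, alpha=0.0, penalty="l2"):
    """objfun(w) = mean squared loss + alpha * penalty(w)."""
    fx = X @ w + w0
    loss = np.mean((y - fx) ** 2)       # average squared loss over the samples
    if penalty == "l1":
        reg = np.sum(np.abs(w))         # L1-norm penalty
    elif penalty == "l2":
        reg = np.sum(w ** 2)            # L2-norm penalty
    else:
        reg = 0.0                       # no regularization
    return float(loss + alpha * reg)

X = np.array([[1.0, 0.0],
              [0.0, 1.0]])
y = np.array([1.0, 2.0])
w = np.array([1.0, -2.0])

obj_none = objective(w, 0.0, X, y, alpha=0.1, penalty=None)
obj_l1 = objective(w, 0.0, X, y, alpha=0.1, penalty="l1")
obj_l2 = objective(w, 0.0, X, y, alpha=0.1, penalty="l2")
```

Here the mean loss is 8; the L1 penalty adds 0.1·3 and the L2 penalty adds 0.1·5, so a larger w is charged more even when the loss is unchanged.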
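from sklearn.linear_model import SGDRegressor
import matplotlib.pyplot as plt
# X and y are assumed to be loaded already
# "squared_error" was called "squared_loss" in older scikit-learn versions
model=SGDRegressor(loss="squared_error",penalty=None)
model.fit(X,y)
fx=model.predict(X) # predict fx for the training samples
# visualization on y=f(x) line
fig,ax=plt.subplots()
ax.scatter(y,fx,marker="o",s=5) # outputs as points of size 5
ax.plot([y.min(), y.max()],[y.min(),y.max()]) # y=f(x) ideal line
ax.set_xlabel("real")
ax.set_ylabel("predicted")
plt.show()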
- The basic parameters of the SGDRegressor class are loss, penalty, alpha, eta0 and
learning_rate. eta0 is the initial learning rate. The learning rate, which adjusts the weight of the
gradient in updating w, can change dynamically according to the learning_rate parameter.
Together with these parameters, SGDRegressor models the objective function
and iteratively approximates the solution of the system $\frac{\partial\, objfun(w)}{\partial w} = 0$. For
example, Ridge, Lasso and ElasticNet regression can be defined using squared_loss and
various penalty forms, as explained in the following examples.
from sklearn.linear_model import Lasso
model=Lasso(alpha=1.0,tol=1e-4)
model.fit(X,y)
. . .
- The learner model is assumed to have a linear form. In a high dimensional space ($R^l$), the linear
form is called a hyperplane, which can be defined using its normal vector (w) and intercept (w0); these are the
model parameters. w decides the direction of the hyperplane.
classifier(x) = sign(f(x)) decides the output class for any given x
Activation function = sign (it could be some other function)
- One general solution is w optimization: search for the direction w = (wl ,…, w1, w0)
that best fits the training data.
- The majority of linear classifier models use derivative-based optimization.
f: hyperplane
- The perceptron algorithm aims to find the best orientation w = (wl ,…, w1, w0) that minimizes
the total loss, where the loss is the distance to the zero margin. In the original variant of the
perceptron algorithm, the algorithm terminates when the total loss becomes zero (possible only
for separable problems), but later implementations allow terminating after a sufficient fit or
tolerance.
$$loss(y_i, f(x_i)) = \begin{cases} 0, & y_i \cdot f(x_i) > 0 \\ -y_i \cdot f(x_i), & \text{else} \end{cases} = \max(0, -y_i \cdot f(x_i))$$
$$objfun(w) = \frac{1}{n}\sum_{i=1}^{n} loss(y_i, f(x_i)) = \frac{1}{n}\sum_{i=1}^{n} \max(0, -y_i \cdot f(x_i))$$
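Perceptron loss is a special form of the hinge loss max(0, M − yi.f(xi)), where M is the margin f(x)=M, with M = 0.
The algorithm iteratively updates w by a step ∆𝑤 to minimize the Error = total loss = ∑(perceptron loss).
According to the gradient descent algorithm, the next weight vector w(t+1) will be
$$w(t+1) = w(t) - learning\_rate \cdot \Delta w, \qquad \Delta w = \frac{\partial\, objfun}{\partial w} = -\frac{1}{n}\sum_{i:\; y_i f(x_i) \le 0} y_i\, x_i$$
The resulting f(x) = 1.41 x1 + 0.51 x2 - 0.5 = 0 now correctly classifies all the samples and the
algorithm terminates.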
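The perceptron iteration can be sketched as batch gradient descent on the perceptron loss. This is an illustration under assumptions: the function name perceptron, the learning rate and the toy data are made up for the example, and samples with yi.f(xi) ≤ 0 are treated as misclassified so that the all-zero start still produces an update.

```python
import numpy as np

def perceptron(X, y, learning_rate=1.0, max_iter=1000):
    """Batch perceptron: gradient descent on mean(max(0, -y_i * f(x_i)))."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(y)
    Xb = np.append(X, np.ones((n, 1)), axis=1)   # last weight plays the role of w0
    w = np.zeros(Xb.shape[1])
    for _ in range(max_iter):
        wrong = y * (Xb @ w) <= 0                # misclassified (or on the boundary)
        if not np.any(wrong):
            break                                # total loss is zero: terminate
        dw = -(y[wrong, None] * Xb[wrong]).sum(axis=0) / n
        w = w - learning_rate * dw               # w(t+1) = w(t) - rate * dw
    return w

# Linearly separable toy data with labels +1 / -1
X_demo = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
y_demo = np.array([1.0, 1.0, -1.0, -1.0])
w = perceptron(X_demo, y_demo)
predictions = np.sign(np.append(X_demo, np.ones((4, 1)), axis=1) @ w)
```

On this toy set a single update already separates the classes; on non-separable data the loop simply stops at max_iter, which mirrors the tolerance-based termination mentioned above.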
- There are many variants of the perceptron algorithm that work on non-separable problems. The
original perceptron algorithm converges only if the classes are linearly separable. One variant, the
Pocket Algorithm, converges to a good solution
even if the linear separability condition is not fulfilled. Other related algorithms that find
reasonably good solutions when the classes are not linearly separable are the thermal
perceptron algorithm, the loss minimization algorithm, and the barycentric correction procedure.
- There are many online variants of the perceptron algorithm. In online learning, data becomes available
through a data flow over time. In the perceptron algorithm, each iteration performs an update to w
after all of the data have been processed. This is not suitable for online classification on the flow.
- Multi-class variants of the linear classifiers will be given in later sections.
- There may exist many w solutions. SVC searches for the best w that leaves the
maximum margin using a derivative approach, e.g. analytic or gradient descent. The
samples that lie on the margin hyperplanes are called support vectors, circled in the
figure.
Figure label: margin vector $\vec{M}$
The unit vector of $\vec{w}$ is $\frac{\vec{w}}{\|w\|}$, whose length is one.
The margin vector $\vec{M}$ is parallel to the unit vector of $\vec{w}$: $\vec{M} = M \cdot \frac{\vec{w}}{\|w\|}$
The length of the margin vector is $\|\vec{M}\| = M = \frac{2}{\|w\|}$
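The margin length 2/‖w‖ can be checked numerically: take one point on each margin hyperplane w.x + w0 = +1 and w.x + w0 = −1 along the direction of w and measure their distance. The values of w and w0 below are arbitrary assumptions for illustration.

```python
import numpy as np

w = np.array([3.0, 4.0])   # assumed normal vector, ||w|| = 5
w0 = -1.0                  # assumed intercept

norm_w = np.linalg.norm(w)
unit_w = w / norm_w                  # unit vector of w, length one
margin = 2 / norm_w                  # M = 2 / ||w||

# Points on the two margin hyperplanes, both taken along the direction of w
x_plus = (1 - w0) * w / norm_w ** 2    # satisfies w.x + w0 = +1
x_minus = (-1 - w0) * w / norm_w ** 2  # satisfies w.x + w0 = -1
distance = np.linalg.norm(x_plus - x_minus)
```

The measured distance between the two hyperplanes equals the margin, 0.4 in this case, so maximizing the margin is the same as minimizing ‖w‖.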
Proof :
Case 1 : g(xi)>0
Case 2 : g(xi)=0
The optimum point of f(x) is
affected by the g(x) constraint area.
So optimum point of f(x) can be
found by solving ΔL = Δf - α Δg = 0.
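Minimize the Lagrange equation below w.r.t. the primal variables (w, w0) and maximize it w.r.t. the
dual variables (αi), under the constraint αi . gi(w) = 0 for each i:
$$L(w, w_0, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{n} \alpha_i \,[\, y_i (w \cdot x_i + w_0) - 1 \,]$$
Here the first term is the function f to be minimized and each bracket is a constraint gi(w).
Dual Problem:
min L(w, w0 ) and max L(α)
constraint : αi . [yi . (w . xi + w0 ) − 1] = 0 and αi ≥ 0, where gi(w) = yi . (w . xi + w0 ) − 1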
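Solution: solve the following system using 4 steps:
$$\frac{\partial L}{\partial w} = 0, \quad \frac{\partial L}{\partial w_0} = 0, \quad \frac{\partial L}{\partial \alpha} = 0$$
i- $\frac{\partial L}{\partial w} = w - \sum_{i=1}^{n} \alpha_i y_i x_i = 0 \;\Rightarrow\; w = \sum_{i=1}^{n} \alpha_i y_i x_i$
ii- $\frac{\partial L}{\partial w_0} = \sum_{i=1}^{n} \alpha_i y_i = 0$
$w = \sum_{i=1}^{n} \alpha_i y_i x_i$
w0 = . . . put w into the equation yi .(w.xi + w0) = 1 (on the margin) to solve for w0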
- Lagrange-based SVC is very costly because it requires quadratic programming, i.e. solving
a quadratic equation system. Thus, Lagrange-based SVC cannot be
recommended for large datasets (maybe more than 10,000 samples). An alternative
approach for SVC is to approximate the Lagrange solution using gradient descent
variants such as mini-batch or stochastic gradient descent (SGD).
- In all gradient descent variants, the gradient Δw is computed at each iteration and then
the orientation w is moved one step in the direction guided by the gradient. The
gradient is equal to the derivative of the objective function.
$$objfun(w) = \frac{1}{2}\|w\|^2, \qquad \Delta w = \frac{\partial\, objfun}{\partial w} = w$$
The objective function should maximize the margin and minimize the total loss (error)
$$e_i = loss(y_i, f(x_i)) = \begin{cases} 0, & y_i \cdot f(x_i) \ge 1 \\ 1 - y_i \cdot f(x_i), & \text{else} \end{cases} = \max(0, 1 - y_i \cdot f(x_i))$$
(hinge loss with margin = 1)
The loss function penalizes the misclassified samples and those that lie inside the margin. The
total loss term is added to the objective function, where the parameter C sets the trade-off between maximum
margin and minimum loss (which one is important to which degree).
A Lagrange formulation is needed to optimize w, with the objective objfun(w) and the constraints gi(w).
Dual Problem:
min L(w, w0 ) and max L(α)
constraint : αi . [yi . (w . xi + w0 ) + ei − 1] = 0 and αi ≥ 0, where gi(w) = yi . (w . xi + w0 ) + ei − 1
$$\frac{\partial L}{\partial w} = 0, \quad \frac{\partial L}{\partial w_0} = 0, \quad \frac{\partial L}{\partial \alpha} = 0$$
- In all gradient descent variants, the gradient Δw is formulated as the derivative of the
objective function. However, depending on the gradient descent variant, the gradient
can use only a single sample (stochastic), some samples (mini-batch), or all the samples in the
dataset (batch) at each iteration in an attempt to compute an orientation towards the
optimum. Thus, let's rewrite the objective function as follows, where the second term is the total loss ∑ ei:
$$objfun(w) = \frac{1}{2}\|w\|^2 + C \sum_i \max[\,0,\; 1 - y_i (w \cdot x_i + w_0)\,]$$
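The (batch) gradient is on the left and the SGD gradient (single sample consideration) is on the right.
In the batch case, the sum runs over the samples that violate the margin.
$$\Delta w = \frac{\partial\, objfun}{\partial w} = \begin{cases} w, & y_i \cdot f(x_i) \ge 1 \text{ for all } i \\ w - C \sum_{i:\; y_i f(x_i) < 1} x_i\, y_i, & \text{else} \end{cases} \;\xrightarrow{\text{single sample}}\; \begin{cases} w, & y_i \cdot f(x_i) \ge 1 \\ w - C\, x_i\, y_i, & \text{else} \end{cases}$$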
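The single-sample gradient can be turned into a small SGD trainer for a linear SVC. This is a sketch under assumptions, not the libsvm or liblinear algorithm: the function name linear_svc_sgd, the hyperparameters and the toy data are made up, and the intercept update −C.yi (the derivative of the hinge term with respect to w0) is added here because the text only gives the gradient for w.

```python
import numpy as np

def linear_svc_sgd(X, y, C=1.0, learning_rate=0.01, max_iter=5000, seed=0):
    """SGD on 1/2 ||w||^2 + C * max(0, 1 - y_i (w.x_i + w0)), one sample per step."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, l = X.shape
    w = np.zeros(l)
    w0 = 0.0
    rng = np.random.default_rng(seed)
    for _ in range(max_iter):
        i = rng.integers(n)                        # pick one random sample
        if y[i] * (X[i] @ w + w0) >= 1:
            dw, dw0 = w, 0.0                       # outside the margin: only shrink w
        else:
            dw = w - C * y[i] * X[i]               # inside the margin or misclassified
            dw0 = -C * y[i]                        # assumed intercept gradient
        w = w - learning_rate * dw
        w0 = w0 - learning_rate * dw0
    return w, w0

# Linearly separable toy data with labels +1 / -1
X_demo = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
y_demo = np.array([1.0, 1.0, -1.0, -1.0])
w, w0 = linear_svc_sgd(X_demo, y_demo)
margins = y_demo * (X_demo @ w + w0)
```

With a fixed learning rate the solution keeps oscillating slightly around the optimum; decreasing the learning rate over time is a common way to reduce that noise.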
The scikit-learn SVC class is based on the libsvm implementation, which uses the Lagrange-based solution.
Thus, the libsvm library gives the accurate derivative solution for SVC at the cost of long fit
time and bad scalability. The SVC class is also designed for non-linear variants of SVC through the
kernel option, where the default kernel is 'rbf'; kernel='linear' gives the linear classifier. The LinearSVC
class is based on another library (liblinear) that is optimized for the linear kernel and scales better than the SVC class.
LinearSVC also has flexibility in the choice of penalties and loss functions.
LinearSVC has the parameter dual, which selects whether the Lagrange dual problem or the primal problem
is solved. With dual=False one can get the solution with respect to the primal variables only (excluding the dual variables).
Sometimes ignoring the dual variables and considering only the primal variables (w, w0) in the Lagrange
process may also produce an approximate solution.
Example (usage of SVC : e.g. calculate the training accuracy on iris dataset)
import numpy as np
from sklearn import svm, datasets
iris = datasets.load_iris()
X = iris.data
y = iris.target
model = svm.SVC(kernel='linear', C=1)
model.fit(X, y)
fx = model.predict(X)  # prediction of the training samples
score = np.sum(y == fx) / len(y)  # compute training accuracy score manually
Example (usage of LinearSVC : e.g. calculate the training accuracy on iris dataset)
from sklearn.svm import LinearSVC
...
model = LinearSVC(C=1)  # default loss="squared_hinge", penalty="l2"
model.fit(X, y)
score = model.score(X, y)  # compute training accuracy score with model.score
References
[1] Greg Grudig, Support Vector Machine(SVM) Classification , Slides
4.4.4. Multi-class variants of the linear classifiers and Python Sklearn examples
- The linear models are originally defined for classification on two-class (+1, -1) problems.
However, they can be extended to multi-class variants using 2 approaches.
One-vs-rest (ovr): the final classification of the point is the class whose decision surface is
furthest from the point on the positive side. As a sample gets closer to the decision surface,
the agent gets more confused (the decision is critical). The greater the distance from the
decision surface, the more stable the decision is.
One-vs-one (ovo): each model fij votes for one of its classes i and j. The final classification
of the point is the class which takes the largest number of votes.
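The two approaches can be tried with the generic wrappers of scikit-learn. This is a minimal sketch, assuming LinearSVC as the underlying two-class model and the iris data; max_iter is raised only to help convergence.

```python
from sklearn import datasets
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import LinearSVC

X, y = datasets.load_iris(return_X_y=True)

# One model per class (class i against all the others)
ovr = OneVsRestClassifier(LinearSVC(C=1, max_iter=10000)).fit(X, y)
# One model f_ij per pair of classes (i against j)
ovo = OneVsOneClassifier(LinearSVC(C=1, max_iter=10000)).fit(X, y)

n_ovr = len(ovr.estimators_)
n_ovo = len(ovo.estimators_)
print(n_ovr, n_ovo, ovr.score(X, y), ovo.score(X, y))
```

With k classes, one-vs-rest trains k models and one-vs-one trains k(k-1)/2 models; for the three iris classes both counts happen to be 3.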
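Example: performance of the linear SVC classifier on iris data with the multi-class model ovo (SVC trains one-vs-one models internally; its decision_function_shape parameter, default 'ovr', only changes the shape of the decision function)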
from sklearn import svm, datasets
iris = datasets.load_iris()
X = iris.data
#multi-class model
y = iris.target
Example : continuing the example above, let us print the mean accuracy and its deviation
>>> print("accuracy: %0.2f ( +/- %0.2f)" % (scores.mean(), scores.std()))
output → accuracy: 0.85 ( +/- 0.08)
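A scores array like the one printed above can be obtained with cross-validation. The following is a minimal sketch, assuming a linear SVC and 5 folds (both are assumptions, so the printed numbers need not match the output shown above).

```python
from sklearn import datasets, svm
from sklearn.model_selection import cross_val_score

X, y = datasets.load_iris(return_X_y=True)
model = svm.SVC(kernel="linear", C=1)

# One accuracy value per fold; cv=5 is an assumed fold count
scores = cross_val_score(model, X, y, cv=5)
print("accuracy: %0.2f ( +/- %0.2f)" % (scores.mean(), scores.std()))
```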
iris = datasets.load_iris()
X = iris.data
y = iris.target
Example : confusion matrix and classification report analysis using SVC on iris data
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.svm import SVC
from sklearn import datasets
from sklearn.metrics import accuracy_score
iris = datasets.load_iris()
X = iris.data
y = iris.target
output →
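A minimal sketch of how such an analysis could be carried out is given below. The 70/30 stratified split, random_state and the linear kernel are assumptions for illustration, not the settings of the original example.

```python
from sklearn import datasets
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

iris = datasets.load_iris()
X, y = iris.data, iris.target

# Hold out 30% of the samples as an independent test set (assumed split)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

model = SVC(kernel="linear", C=1).fit(X_train, y_train)
y_pred = model.predict(X_test)

cm = confusion_matrix(y_test, y_pred)   # rows: true class, columns: predicted class
acc = accuracy_score(y_test, y_pred)
print(cm)
print(classification_report(y_test, y_pred, target_names=iris.target_names))
print("accuracy:", acc)
```

The diagonal of the confusion matrix counts the correctly classified test samples, so its sum divided by the total equals the accuracy.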
The solution of the optimization problem under inequality constraints is found using
the Lagrange equation and its derivatives, as we did for the SVC Lagrange solution. Another
solution is the gradient based approach, which is more scalable to large datasets; its
Python implementation will be given in later sections.
The main parameter of the 𝜺-SVR is 𝜀, which is the tolerance to the loss (error) and can be
given or computed on a per-problem basis. For example, consider the problem of predicting the
ages of people from their face images. The problem could tolerate an error of up to five
years, which is the epsilon.
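The role of epsilon can be illustrated with the epsilon-insensitive loss, max(0, |y - f(x)| - ε). The sketch below assumes this standard form of the loss; the function name and the age values are hypothetical.

```python
import numpy as np

def epsilon_insensitive(y_true, y_pred, epsilon):
    """Zero inside the epsilon tube, linear in the error outside of it."""
    return np.maximum(0.0, np.abs(y_true - y_pred) - epsilon)

true_age = np.array([30.0, 30.0, 30.0])
pred_age = np.array([27.0, 35.0, 38.0])   # errors of 3, 5 and 8 years

loss = epsilon_insensitive(true_age, pred_age, epsilon=5.0)
print(loss)   # only the 8-year error exceeds the 5-year tolerance
```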
The scikit-learn SVR class is based on the libsvm implementation, which uses a Lagrange based
solution. Thus, the libsvm library gives the accurate derivative solution for SVR at the cost of
long fit time and poor scalability. The SVR class is also designed for non-linear variants of the
SVR through the kernel option, where the default kernel is 'rbf' (kernel='linear' gives the linear
variant). The LinearSVR class, however, is based on another library (liblinear) that is optimized
for the linear kernel and scales better than the SVR class. LinearSVR also has flexibility in the
choice of loss functions.
Example : SVR usage with default score (R2) and MSE for training samples
#The diabetes dataset consists of 10 physiological variables (age, sex, weight, blood
#pressure) measured on 442 patients, and an indication of disease progression after one year
diabetes = datasets.load_diabetes()
X = diabetes.data
y = diabetes.target #disease progression
fx = model.predict(X)
Example : LinearSVR usage with default score (R2) and MSE for training samples
diabetes = datasets.load_diabetes()
X = diabetes.data
y = diabetes.target #disease progression
fx = model.predict(X)
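A minimal sketch that puts both regressors side by side on the diabetes data is given below. The values of C, epsilon and max_iter are assumptions chosen only for illustration.

```python
from sklearn import datasets
from sklearn.metrics import mean_squared_error
from sklearn.svm import SVR, LinearSVR

X, y = datasets.load_diabetes(return_X_y=True)

models = (SVR(kernel="linear", C=100.0, epsilon=5.0),
          LinearSVR(C=100.0, epsilon=5.0, max_iter=100000, random_state=0))

for model in models:
    model.fit(X, y)
    fx = model.predict(X)              # predictions for the training samples
    r2 = model.score(X, y)             # default score of regressors is R2
    mse = mean_squared_error(y, fx)
    print(type(model).__name__, "R2 = %.3f" % r2, "MSE = %.1f" % mse)
```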
- Remember: the SGDClassifier and SGDRegressor classes are designed in terms of loss and
penalty functions that both depend on the parameter w. The loss function models the
error for one sample; the penalty function models the cost of the model complexity
(regularization).
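- The objective functions of SVC and ε-SVR can be converted to the loss + alpha*penalty form
$objfun(w) = \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} loss(y_i, f(x_i))$
$objfun(w) = \frac{1}{n}\sum_{i=1}^{n} loss(y_i, f(x_i)) + alpha \cdot penalty(w)$
The second form follows from the first by dividing by C.n, which gives alpha = 1/(C.n) with penalty(w) = ½‖w‖².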
- The objective functions of SVC and ε-SVR have the same skeleton; only the loss, penalty
and constraint functions differ.
- SGDClassifier Summary
- SGDRegressor Summary
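The loss + alpha*penalty form of the SVC objective can be tried with SGDClassifier as sketched below. The feature scaling step, C = 1 and the stopping settings are assumptions; alpha is set to 1/(C.n), which is what dividing the SVC objective by C.n gives.

```python
from sklearn import datasets
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import StandardScaler

X, y = datasets.load_iris(return_X_y=True)
X = StandardScaler().fit_transform(X)    # SGD is sensitive to the feature scale

C = 1.0
alpha = 1.0 / (C * len(y))               # regularization strength matching C

# Hinge loss with l2 penalty: a linear SVC trained by stochastic gradient descent
model = SGDClassifier(loss="hinge", penalty="l2", alpha=alpha,
                      max_iter=1000, tol=1e-3, random_state=0)
model.fit(X, y)
score = model.score(X, y)
print("alpha = %.5f, training accuracy = %.3f" % (alpha, score))
```

Changing loss to another supported value switches the linear model while keeping the same skeleton.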
- Approaching the decision surface (f=0) from the positive side (meaning the classification is
correct), log gives an increasing penalty to the model, while hinge and squared hinge penalize
only for z = y.f(x) in [0, 1], i.e. inside the margin that spans f in [-1, 1]. Squared hinge
penalizes less than hinge inside the margin and more once the sample is misclassified (z < 0).
- Moving away from the decision surface (f=0) towards the negative side (meaning the
classification is wrong), all of them penalize the model. Zero-one loss gives a constant
penalty (1) independent of the distance to the decision surface. The others give penalties
that increase linearly with the distance (perceptron from the decision surface, hinge from the
margin), almost linearly for large distances (log) or quadratically (squared hinge).
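The behaviour of these losses can be checked numerically as a function of z = y.f(x). The sketch below assumes the usual definitions of the five losses; the sample values of z are arbitrary.

```python
import numpy as np

def zero_one(z):
    return (z < 0).astype(float)          # constant penalty 1 when misclassified

def perceptron(z):
    return np.maximum(0.0, -z)            # linear in the distance to f = 0

def hinge(z):
    return np.maximum(0.0, 1.0 - z)       # linear in the distance to the margin

def squared_hinge(z):
    return np.maximum(0.0, 1.0 - z) ** 2  # quadratic in the distance to the margin

def log_loss(z):
    return np.log1p(np.exp(-z))           # never exactly zero

z = np.array([-2.0, -1.0, 0.0, 0.5, 1.0, 2.0])
for fn in (zero_one, perceptron, hinge, squared_hinge, log_loss):
    print("%-14s" % fn.__name__, np.round(fn(z), 3))
```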
#The diabetes dataset consists of 10 physiological variables (age, sex, weight, blood
#pressure) measured on 442 patients, and an indication of disease progression after one year
diabetes = datasets.load_diabetes()
X = diabetes.data
y = diabetes.target #disease progression
fx = model.predict(X)
diabetes = datasets.load_diabetes()
X = diabetes.data
y = diabetes.target #disease progression
fx = model.predict(X)
iris = datasets.load_iris()
X = iris.data
y = iris.target #plant type
fx = model.predict(X)
iris = datasets.load_iris()
X = iris.data
y = iris.target #plant type
fx = model.predict(X)
iris = datasets.load_iris()
X = iris.data
y = iris.target #plant type
- The previous linear classifiers (perceptron, SVC) use the sign activation function without a
final step function. Logistic regression instead uses the sigmoid function as activation and a
step function to finalize the classification output, as shown in Fig[1].
- The sigmoid activation function, σ(f(x)), transforms the f(x) values into [0, 1], where
σ(f(x)) ≥ 0.5 means x is assigned to class 1, otherwise class 0.
$classifier(x) = \phi(\sigma) = \begin{cases} 1, & \sigma(f(x)) \ge 0.5 \\ 0, & \text{otherwise} \end{cases}$
- σ(f(x)) defines the likelihood of the output class being +1 when the input x is given in the
original feature space.
$P(y = +1 \mid x) = \sigma(f(x)) = \frac{1}{1 + e^{-f(x)}}$
$P(y = -1 \mid x) = 1 - \sigma(f(x)) = \frac{1}{1 + e^{+f(x)}}$
[Figure: output transform applied to z, with targets y_i = +1 and y_i = 0]
logloss penalty
let us add a variable z_i = y_i.f(x_i)
$\min\ objfun(w, z) = \frac{1}{n}\sum_{i=1}^{n} \log(1 + e^{-z_i}) + alpha \cdot \|w\|^2$
constraint: $z_i = y_i \, f(x_i)$
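The sigmoid probabilities and the log-loss term can be checked numerically. This is a minimal sketch; the decision values f and the labels are hypothetical.

```python
import numpy as np

def sigmoid(f):
    return 1.0 / (1.0 + np.exp(-f))

f = np.array([-3.0, -0.5, 0.0, 2.0])   # hypothetical decision values f(x)
y = np.array([-1, -1, 1, 1])           # labels in {+1, -1}

p_pos = sigmoid(f)                     # likelihood of class +1
p_neg = sigmoid(-f)                    # likelihood of class -1, equal to 1 - sigmoid(f)
pred = np.where(p_pos >= 0.5, 1, 0)    # step function on the sigmoid output

z = y * f
logloss = np.log1p(np.exp(-z))         # per-sample term log(1 + e^{-z_i})
print(p_pos, p_neg, pred, logloss)
```

The log-loss term equals -log σ(z_i), the negative log-likelihood of the correct class, which is why minimizing it maximizes the likelihood.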
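- Both the decision boundary f(x) and the dual optimization formula L are defined in terms of
the inner product of inputs (xi . x). For example, for the SVC dual problem remember the following
$w = \sum_{i=1}^{n} \alpha_i \, y_i \, x_i$, so that $f(x) = w \cdot x + w_0 = \sum_{i=1}^{n} \alpha_i \, y_i \, (x_i \cdot x) + w_0$
Linear Kernel
$K(x_i, x_j) = x_i \cdot x_j$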
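The relation between the dual variables and w can be checked on a fitted scikit-learn SVC, whose dual_coef_ attribute stores α_i.y_i for the support vectors only (the other α_i are zero). The sketch assumes a two-class subset of iris and C = 1.

```python
import numpy as np
from sklearn import datasets
from sklearn.svm import SVC

X, y = datasets.load_iris(return_X_y=True)
mask = y < 2                     # keep two classes for a binary problem
X, y = X[mask], y[mask]

model = SVC(kernel="linear", C=1).fit(X, y)

# w = sum_i alpha_i * y_i * x_i, summed over the support vectors
w = model.dual_coef_ @ model.support_vectors_

# Linear kernel (Gram) matrix: K[i, j] = x_i . x_j
K = X @ X.T

# f(x) written with inner products only
f = model.dual_coef_ @ (model.support_vectors_ @ X.T) + model.intercept_
print(w, f.shape)
```

Replacing the inner product by another kernel function K(x_i, x) gives the non-linear variants without changing the rest of the formulation.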
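###########################################################################
# Generate sample data (assumed: 40 sorted points of a sine curve, so that
# y[::5] has 8 entries; see ref [4] for the original example)
import numpy as np
X = np.sort(5 * np.random.rand(40, 1), axis=0)
y = np.sin(X).ravel()
###########################################################################
# Add noise to targets
y[::5] += 3 * (0.5 - np.random.rand(8))
###########################################################################
# Fit regression model (hyperparameters are illustrative assumptions)
from sklearn.svm import SVR
y_rbf = SVR(kernel='rbf', C=1e3, gamma=0.1).fit(X, y).predict(X)
y_lin = SVR(kernel='linear', C=1e3).fit(X, y).predict(X)
###########################################################################
# look at the results
import matplotlib.pyplot as pl
pl.scatter(X, y, c='k', label='data')
pl.plot(X, y_rbf, c='g', label='RBF model')
pl.plot(X, y_lin, c='r', label='Linear model')
pl.legend()
pl.show()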
- Once the αi are found, w and f(x) can be computed using the formula above.
References
[1] R.P. Jaia Priyankka, S. Arivalagan, P. Sudhakar. Deep Convolution Neural Network with Logistic Regression Based Image Retrieval and Classification Model for Recommendation System. International Journal of Scientific & Technology Research, Volume 9, Issue 01, January 2020.
[2] Greg Grudig, Support Vector Machine (SVM) Classification, Slides.
[3] Satar Mahdevari et al., A support vector regression model for predicting tunnel boring machine penetration rates, International Journal of Rock Mechanics and Mining Sciences.
[4] https://ogrisel.github.io/scikit-learn.org/sklearn-tutorial/auto_examples/svm/plot_svm_regression.html
Basic Steps:
Training : Training means finding the K nearest points around all possible points.
Normally, the complexity of this step is O(n²), but it can be reduced with
algorithms such as BallTree and KDTree, which use tree based data structures to
find and store the nearest points.
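The tree based search can be compared with a brute-force search as in the sketch below, which uses the KDTree class of scikit-learn. The choice of k = 5, leaf_size and the three query points are assumptions.

```python
import numpy as np
from sklearn import datasets
from sklearn.neighbors import KDTree

X, _ = datasets.load_iris(return_X_y=True)
k = 5

# Build the tree once, then query the k nearest neighbours of 3 points
tree = KDTree(X, leaf_size=20)
dist_tree, idx_tree = tree.query(X[:3], k=k)

# Brute force: distances from the 3 query points to every sample, then sort
diff = X[:3, None, :] - X[None, :, :]
dist_brute = np.sort(np.sqrt((diff ** 2).sum(axis=2)), axis=1)[:, :k]
print(idx_tree)
print(np.round(dist_tree, 3))
```

In KNeighborsClassifier the same structures are selected with the algorithm parameter ('kd_tree', 'ball_tree' or 'brute').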
Feature space
classification algorithms. ID3 and its extension C4.5 are classifiers, whereas CART and
C5 (an extension of C4.5) can handle both classification and regression tasks. ID3
assumes categorical features and splits until the leaves are pure w.r.t. the output class. The
basic issue of ID3 is the tree size: a big tree easily overfits the data and, thus, a small
tree is usually preferred in ID3. Generally speaking, finding the correctly sized tree is a
research issue for tree based learners. CART uses a binary tree (it always splits into left
and right) and works on both categorical and numeric features. C4.5 is the extended
version of ID3 that can handle impure leaves, numeric/categorical features and
splitting into many branches. The algorithms above basically differ in the splitting criteria
and the assumed structure of the tree.
Example: let the training set contain only categorical features, age and weight.
Two solutions are shown below, where the smaller tree is better. A decision tree
classification can be represented as a logical function, which involves a set of binary
rules (a disjunction of conjunctions) to predict the output.
[Figure: Solution-1 and Solution-2]
Example: let the training set contain categorical and numeric features, where [TI, PE] are
the features and Response is the target variable (class)
Problems:
i- which feature to split on? (best splitting)
ii- which value to split on? (best splitting)
iii- the meaning of a leaf (when is splitting no longer needed)?
iv- the size of the tree?
Choosing the Best Attribute (splitting criteria): Use Information Gain with Entropy
$E(S, x) = \sum_{i:\ \text{values of } x} \frac{|S_{x=v_i}|}{|S|}\, Entropy(S_{x=v_i})$
Higher Gain (S, x) is better. Higher gain means lower entropy of an average child
(meaning child sets are getting more pure).
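The quantities above can be computed as in the following sketch. It assumes the standard definitions Entropy(S) = -Σ p.log2(p) and Gain(S, x) = Entropy(S) - E(S, x); the function names and the tiny dataset are hypothetical.

```python
import numpy as np

def entropy(labels):
    """Entropy of a set with respect to its output classes."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def expected_entropy(feature, labels):
    """E(S, x): size-weighted entropy of the child sets S_{x=v}."""
    feature, labels = np.asarray(feature), np.asarray(labels)
    return sum((feature == v).mean() * entropy(labels[feature == v])
               for v in np.unique(feature))

def information_gain(feature, labels):
    return entropy(labels) - expected_entropy(feature, labels)

labels = ["yes", "yes", "no", "no"]
gain_good = information_gain(["a", "a", "b", "b"], labels)  # children are pure
gain_bad = information_gain(["a", "b", "a", "b"], labels)   # children stay mixed
print(gain_good, gain_bad)
```

The first feature separates the classes completely and reaches the maximum gain, whereas the second leaves every child as mixed as the parent and gains nothing.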
Example : A person's decisions on playing tennis are recorded for 14 days of various
weather conditions and given in the table below. Train the agent using inductive decision
tree learning and provide the final decision tree.
1- Find the best feature of S, which has the maximum information gain after splitting,
and add it as a node to the tree.
Rule post-pruning
One of the most popular methods for pruning (e.g., in C4.5). First a full decision
tree is built and represented as a set of if/then rules. Each rule is pruned by
removing any precondition whose removal improves the accuracy. Finally, the
pruned rules are sorted by accuracy and used in that order.
Incorporating continuous-valued attributes (ref[1])
Information gain uses the concept of the entropy of sets and doesn't consider the
entropy of an attribute (how much information the attribute carries).
Classification
Gini measures the average amount of information or impurity of the samples in a
set, S, with respect to the output classes. Gini (S) is defined as
$Gini(S) = 1 - \sum_{c} p_c^2$, where $p_c$ is the proportion of class c in S
Gini Gain (S, x) measures the expected reduction in the Gini index due to splitting on x
$Gini(S, x) = \sum_{i:\ left,\, right} \frac{|S_i|}{|S|}\, Gini(S_i)$
A higher Gini Gain is better. A higher gain means a lower Gini index of an average child
(meaning the child sets are getting purer). Alternatively, instead of maximizing the Gini
Gain, minimizing Gini (S, x) can also be considered. By ignoring the division by |S|, the
following Gini criterion is obtained, where a smaller value is better.
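The Gini computations can be sketched as follows, assuming the usual definition Gini(S) = 1 - Σ p_c² and a binary (left, right) split; the label lists are hypothetical.

```python
import numpy as np

def gini(labels):
    """Gini index of a set with respect to its output classes."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(1.0 - (p ** 2).sum())

def gini_split(left, right):
    """Gini(S, x): size-weighted Gini index of the two children."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

S = [0, 0, 0, 1, 1, 1]
left, right = [0, 0, 0, 1], [1, 1]       # one candidate binary split of S

gini_gain = gini(S) - gini_split(left, right)
print(gini(S), gini_split(left, right), gini_gain)
```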
...
Regression
Variance (or MSE) measures the average amount of information in the set S with
respect to the mean of the set. Variance Reduction (VR) measures the expected reduction in
variance (MSE) due to splitting on feature x
$VR(S, x) = MSE(S) - \sum_{i:\ left,\, right} \frac{|S_i|}{|S|}\, MSE(S_i)$
One can see that, in order to maximize the Variance Reduction, one can minimize the
total MSE of the children (Residual Sum) for regression. Thus, a smaller RSS (Residual
Sum) is better for splitting. The Residual Sum (RSS) is defined as follows
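The search for the best split value can be sketched as below. It assumes that the RSS of a split is the sum of squared deviations of each child from its own mean and that candidate thresholds are the midpoints between consecutive feature values; the data are hypothetical.

```python
import numpy as np

def rss(y):
    """Sum of squared deviations from the mean of the set."""
    return float(((y - y.mean()) ** 2).sum()) if len(y) else 0.0

def best_split(x, y):
    """Return the threshold on x with the smallest total RSS of the two children."""
    order = np.argsort(x)
    x, y = x[order], y[order]
    best_t, best_rss = None, np.inf
    for i in range(1, len(x)):
        if x[i] == x[i - 1]:
            continue                      # no threshold between equal values
        total = rss(y[:i]) + rss(y[i:])
        if total < best_rss:
            best_t, best_rss = (x[i] + x[i - 1]) / 2, total
    return best_t, best_rss

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([0.0, 0.0, 10.0, 10.0])
threshold, total_rss = best_split(x, y)
print(threshold, total_rss)
```

Applying the same search recursively to the left and right children grows the regression tree node by node.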
[Figure: regression tree, node0]
[Figure: regression tree, node1]
iv- left of node0: the best y split = -0.48 with RSS = 13.61
[Figure: regression tree, node2]
vi- right of node0: the best y split = 2.79 with RSS = 25.11
...
If we complete the training, the final Regression Tree
criterion : decides the criterion that measures the impurity of a set. "gini" and
"entropy" are some alternatives for classification, with default "gini". "mse" (mean
squared error) and "mae" (mean absolute error) are some alternatives for regression,
with default "mse" (named "squared_error" and "absolute_error" in recent scikit-learn
versions).
splitter : decides the splitting method on the basis of the criterion. The default "best"
means choosing the best split according to the gain (variance reduction, Gini gain,
information gain, etc.) computed with the selected criterion. The other alternative,
"random", means choosing the best among randomly drawn splits.
max_depth : the maximum depth of the tree. With the default None, the tree's nodes are
expanded until all leaves are pure or all leaves contain fewer samples than the value of
min_samples_split.
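The effect of these parameters can be observed with a short sketch on the iris data. The chosen values (entropy with depth 2 versus gini with unlimited depth) are assumptions for illustration.

```python
from sklearn import datasets
from sklearn.tree import DecisionTreeClassifier

X, y = datasets.load_iris(return_X_y=True)

# A small tree: entropy criterion, best split, depth limited to 2
shallow = DecisionTreeClassifier(criterion="entropy", splitter="best",
                                 max_depth=2, random_state=0).fit(X, y)
# A full tree: gini criterion, expanded until the stopping rules apply
full = DecisionTreeClassifier(criterion="gini", splitter="best",
                              max_depth=None, random_state=0).fit(X, y)

print("shallow:", shallow.get_depth(), shallow.score(X, y))
print("full   :", full.get_depth(), full.score(X, y))
```

The full tree fits the training samples at least as well as the shallow one, which is exactly why its size has to be controlled to avoid overfitting.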
7.1.1.Bagging
A group of weak learners (decision trees here) is ensembled together to predict a
single decision. Each weak learner is trained on a random dataset that is derived from
the original training set by sampling with replacement. The decision of the ensemble
learner is the majority vote for classification and the average value for
regression. Bagging is one of the extensions to CART that reduces the variance of the
decision using an ensemble of many weak trees (learners). An example for regression is as follows
[Figure: a single tree (high variance) versus the average of 100 trees (low variance)]
There is no need for a cross-validation test since the out-of-bag set is an ideal test set. The
out-of-bag error is the error of the learner on the out-of-bag set used as the test set.
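The out-of-bag estimate can be obtained directly from a bagging ensemble, as in this minimal sketch. The iris data, 100 trees and random_state are assumptions.

```python
from sklearn import datasets
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = datasets.load_iris(return_X_y=True)

# Each tree is trained on a bootstrap sample (sampling with replacement);
# the samples left out of a tree's bootstrap set form its out-of-bag set
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100,
                        bootstrap=True, oob_score=True, random_state=0)
bag.fit(X, y)

oob_error = 1.0 - bag.oob_score_
print("out-of-bag accuracy:", bag.oob_score_, "out-of-bag error:", oob_error)
```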
7.1.3. Adaboost
(ref: https://www.datacamp.com/community/tutorials/adaboost-classifier-python)
Adaboost:
Adaboost on example:
(ref: https://laptrinhx.com/understanding-adaboost-and-scikit-learn-s-algorithm-3554153184/)
Adaboost stands for adaptive boosting. The models are sequentially arranged in the
ensemble. This means that at each step we try to boost our weak learners (base
model) based on the mistakes of our previous models so together they are one
strong ensemble model.
Step 1: Assign equal weights for all samples in the data set
[Table: samples 1 to 8, each with sample weight 1/8]
We have 8 samples in our dataset and, as the weights show, each has been
assigned an equal weight of 1/(number of samples). This means that the correct
classification of every sample is equally important.
Step 2: Create a decision stump using the feature that has the lowest Gini index
A decision tree stump is just a decision tree with a single root and two leaves, i.e. a tree
with 1 level. This stump is our weak learner (base model). The feature that first gets to
classify our data is determined using the Gini index.
If you want, you can increase the level to two, but it is very common to go for
a stump.
Step 3: Evaluate the performance of your stump and assign its weight
In Adaboost, we have an ensemble of stumps and all their predictions are taken into
account before deciding the final prediction. But some stumps do a better job of
classifying our data than the other stumps in the ensemble. It only makes sense
to give more importance to these stumps. Adaboost does this by assigning a weight to
each stump in the ensemble. The higher the weight, the greater the amount of
say (significance) the stump will have in the final prediction. So, for example, let
sample-3 and sample-6 be misclassified by the stump; the weight of this stump is
calculated by
$significance = \frac{1}{2}\ln\left(\frac{1 - \text{Total error}}{\text{Total error}}\right)$
where Total error = sum of the weights of the wrongly classified samples. So if our
stump got two samples misclassified, using the weights of those samples, we get
stump significance = 0.5*log((1 - (1/8 + 1/8))/(1/8 + 1/8)) = 0.5*log(3) ≈ 0.55 (using natural log).
And that is the weight of our first model in the ensemble.
Remember that this is different from the weights of the sample. Sample weights
stress the importance of getting the classification of the sample right, while model
weights are used to determine the amount of say a model gets in the final prediction.
Initially the weight of both samples was 1/8 and now the new weight of each is
1/8*e^(amount of say) ≈ 0.217. This is now greater than the initial 1/8.
Next, it reduces the weight of the other samples. This is done by using this formula:
new sample weight = old sample weight * e^(-amount of say)
[Table: samples 1 to 8 with their updated sample weights]
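The weight computations of the running example (8 samples, sample-3 and sample-6 misclassified) can be reproduced with the sketch below. The final normalization, which makes the weights sum to 1, is an assumption that is consistent with the roulette wheel slices used in the next step.

```python
import numpy as np

n = 8
weights = np.full(n, 1.0 / n)                  # Step 1: equal sample weights
misclassified = np.zeros(n, dtype=bool)
misclassified[[2, 5]] = True                   # sample-3 and sample-6 (0-based indices)

# Step 3: total error and amount of say (significance) of the stump
total_error = weights[misclassified].sum()
say = 0.5 * np.log((1.0 - total_error) / total_error)

# Increase the weights of misclassified samples, decrease the others
new_weights = np.where(misclassified,
                       weights * np.exp(say),
                       weights * np.exp(-say))
normalized = new_weights / new_weights.sum()   # assumed normalization step

print("amount of say:", round(float(say), 3))
print("updated weights:", np.round(new_weights, 3))
print("normalized weights:", np.round(normalized, 3))
```

After normalization the two misclassified samples together carry half of the total weight, which is what forces the next stump to focus on them.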
Step 6: Create a new dataset of the same size as the original dataset and pick
samples based on their weights
This step is crucial because this is how the next model benefits from the experience
of the previous model, so that misclassified samples are given more importance. It
works in the following way:
First, an empty dataset of the same size as the original dataset is created.
Then samples are selected according to their weights, for example using roulette wheel
selection. Thus, important samples are more likely to be selected for the new dataset.
This gives the new learner the ability to correct the mistakes of the
previous learner.
With the roulette wheel, a random number from 0 to 1 is picked and the sample whose weight
slice contains that number is chosen. For example, if 0.3 is picked, then the third
sample is chosen and added to the new dataset, since the value 0.3 falls between
0.167 and 0.416.
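Roulette wheel selection can be sketched with cumulative sums of the sample weights. The weights below are assumed from the running example (0.25 for sample-3 and sample-6, 1/12 for the others), which reproduces the slice between 0.167 and 0.416 for the third sample; the helper name spin is hypothetical.

```python
import numpy as np

weights = np.array([1/12, 1/12, 0.25, 1/12, 1/12, 0.25, 1/12, 1/12])
upper = np.cumsum(weights)                 # upper edge of each sample's slice

def spin(r):
    """Return the 0-based index of the sample whose slice contains r."""
    idx = int(np.searchsorted(upper, r, side="right"))
    return min(idx, len(upper) - 1)        # guard against rounding at the top edge

picked = spin(0.3)                         # falls in the slice of the third sample

# Fill a new dataset of the same size by spinning the wheel repeatedly
rng = np.random.default_rng(0)
new_dataset = [spin(r) for r in rng.random(len(weights))]
print(picked, new_dataset)
```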
Repeat the previous step until the new dataset is filled. Once the new dataset is
filled, reassign the sample weights to an equal value of 1/(number of samples).
This is what our new dataset looks like.
Notice how the samples that we got wrong previously are included more often? This
gives the next model a better chance to get them right, somewhat like creating a large
penalty for misclassification.
So if the sum of the weights of the models that classified as "Yes" is more than the sum of
the weights of the models that classified as "No", the final prediction is "Yes", and vice versa.
Example(ref: https://scikit-learn.org/stable/auto_examples/ensemble/plot_adaboost_regression.html)
A decision tree is boosted using the AdaBoost.R2 (Drucker 1997) algorithm on a 1D
sinusoidal dataset with a small amount of Gaussian noise. 299 boosts (300 decision
trees) are compared with a single decision tree regressor. As the number of boosts is
increased, the regressor can fit more detail.
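import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import AdaBoostRegressor
# Create the dataset (assumed: noisy 1D sinusoid, as described above)
rng = np.random.RandomState(1)
X = np.linspace(0, 6, 100)[:, np.newaxis]
y = np.sin(X).ravel() + np.sin(6 * X).ravel() + rng.normal(0, 0.1, X.shape[0])
# Fit a single tree (regr_1) and a boosted ensemble of 300 trees (regr_2)
regr_1 = DecisionTreeRegressor(max_depth=4)
regr_2 = AdaBoostRegressor(DecisionTreeRegressor(max_depth=4),
                           n_estimators=300, random_state=rng)
regr_1.fit(X, y)
regr_2.fit(X, y)
# Predict
y_1 = regr_1.predict(X)
y_2 = regr_2.predict(X)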
# Load data
iris = datasets.load_iris()
X = iris.data
y = iris.target
Output:
Accuracy: 0.8888888888888888
Example(using SVC as base learner for Adaboost classification and compare with
decision tree base learner )
ref: https://www.datacamp.com/community/tutorials/adaboost-classifier-python
# Load libraries
from sklearn.ensemble import AdaBoostClassifier
Output:
Accuracy: 0.9555555555555556
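A minimal, self-contained sketch of Adaboost classification on the iris data is given below. It uses the default base learner (a decision stump) rather than SVC, and the 70/30 split, n_estimators, learning_rate and random_state are assumptions, so the accuracy it prints need not match the outputs quoted above.

```python
from sklearn import datasets
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Load data
X, y = datasets.load_iris(return_X_y=True)

# Hold out 30% of the samples for testing (assumed split)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# The default base learner is a decision stump (a tree with max_depth=1)
abc = AdaBoostClassifier(n_estimators=50, learning_rate=1.0, random_state=0)
abc.fit(X_train, y_train)

accuracy = accuracy_score(y_test, abc.predict(X_test))
print("Accuracy:", accuracy)
```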
References
[1]- Berlin-Chen Slides
[2]- Ovronnaz, Switzerland Slides
[3] https://towardsdatascience.com/machine-learning-part-17-boosting-algorithms-adaboost-in-python-d00faac6c464