
INTRODUCTION TO MACHINE LEARNING

Assessment and Evaluation

Homework (entered as the midterm grade) 40%


Final exam 60%
Attendance is compulsory at a rate of 55%. (Students who do not meet the attendance
requirement lose the right to take the Final and make-up exams.)

References
(1) Zeki Yetgin, Machine Learning Lecture Notes
(2) Introduction to Machine Learning with Python, Andreas C. Müller &
Sarah Guido, 2016
Prof.Dr. Zeki YETGİN, Mersin Univ. , Machine Learning Course Book (Content is copyrighted)

Introduction to Machine Learning with Python

1.1. What is machine learning and what is its scope?


Machine learning is the study of agents whose learning ability is developed using data. Machine
learning is generally used to model an agent's decision making. In artificial intelligence,
optimization, learning, and reasoning are the three basic abilities associated with the decision-making
processes of agents.

[Diagram: Knowledge Representation, Learning, Reasoning, and Searching (optimization) as abilities contributing to Decision-Making.]

Agents that have reasoning ability are also called knowledge-based agents. Reasoning agents
emulate the way human beings learn. They have a knowledge base similar to that of human beings, and
they learn facts by explicit declaration, such as adding formal sentences to their knowledge base.
Similar to human beings, they have an inference engine through which they infer new facts from
the knowledge base. This category, covering knowledge engineering, expert systems, and logic (fuzzy
logic, first-order logic, second-order logic, first-order fuzzy logic, and so on), is usually out of the
scope of machine learning. Similarly, agents sometimes make decisions by searching the solution
space, which leads to optimization. Optimization is also out of the scope of machine learning in
artificial intelligence.
Machine learning in this course deals with how to make agents learn something from data. Data is
considered in a broad sense: there may be no explicit data but only implicit data, such as the agent's
experience. Generally, agents can be trained using input data, which consists of a set of samples. The
agent can learn from the input samples and, when independent samples are given, make decisions as output.
Usually, the agent's learning mechanism is based on developing a learner model that generalizes the
relation between the input x and the output y (in the case of supervised learning), or the relation
between samples (in the case of unsupervised learning). Supervised and unsupervised learning will be
introduced in later sections.

Fig 1.1
Here f is the learner and the output f(x) is the agent's decision for the input sample x. Note that
throughout the course we adopt the symbol y to denote the actual output for the input sample x,
whereas f(x) is the output predicted by the learner (or agent).
In practice, the agent can learn the behavior of any real system if the system's input and/or output
data are available. For example, as a real system, consider a medical doctor who evaluates many
variables such as symptoms, blood analysis, MR images, and so on, to make a decision about a patient's
diagnosis. In this example, the input can be denoted as a vector of variables, x = (x1, x2, ..., xl), and
the output y is a label indicating the doctor's diagnosis (illness). Thus, the agent learns from the
training samples, the (x, y) data, and develops a learning ability f which should be an approximation to
the functionality of the medical doctor. Thus, for any input sample x, the agent decision f(x) ≅ y.

 Machine learning discovers the relationships between the system variables (input, output,
internal) from the samples.
 A sample can be defined with ℓ number of features (variables), or it can contain a variable
number of features. Each feature can be a different type of information, such as an integer,
a string, a real number, etc.
 Outputs can be either a real number or a discrete categorical value.

Fig 1.2
The learner f does not necessarily follow logic (reasoning); anything that makes the agent act
rationally is welcome. This is called the "Rational Agent" approach (given in the AI course). Thus, f can
be a logic function, a probability density function, a curve, a plane or surface in a multi-dimensional
space, a decision tree, a set of rules, a neural network, etc. The important thing is that the decision f(x)
should look like an intelligent behavior (output); the internal (learning) mechanism does not need to use
logic.
Let’s call the “input variables” of the system as “features” since, in practice, samples could be real
objects that are described by features. For example, consider a face image (matrix) that is described
by a vector of 50 features, meaning x=(x1,x2,…,x50) where each xi is a feature and x is a feature
vector. Extracting features that best describes the sample is called feature extraction or generation.
Feature extraction is out of the scope of the course since depending on the problem and sample
nature, the features and its way of extraction vary. The problems in chemistry, biomedical, zoology,
electronic, medicine, and so on, require its own experts to extract features. However, machine
learning techniques can sometimes be used to extract or select features. Usually the training data is
acquired from sensors, such as microphone or camera. Assuming features are extracted, the learner
desing is the main focus in machine learning.

Machine learning applications originate from many fields: Statistics, Mathematics, Computer Science,
Physics, Neuroscience, etc.

1.2. Where are machine learning applications found?


(Ref: Greg Grudig)

1.3. Feature Space


Fixed-Size Feature Vectors: Each sample is represented by a feature vector. Some problems
may expose a fixed number (l) of features for each sample. The distribution of all
samples in the l-dimensional space forms a feature space.

Example (classification example): The feature space below is made from survey data that covers
people's happiness, happy (+) or unhappy (-), based on the age and weight features. The agent's learning
model is a curve (a curve has parameters in math). Training the model means that the curve
parameters are fit to the data such that the curve partitions the feature space into happy and unhappy
classes from some generalization point of view.

Learner f= curve

Fig 1.3

 Feature Vector: (Age, Weight)


 Output or Target: Happy (+), Unhappy (-)
 Learning model: a curve whose parameters are not yet fit to the data
 Learner: the curve after the agent has learnt, i.e., the curve parameters are fit to the data
After the agent has learnt from the samples in the feature space, it can make a decision (prediction) for any
given point. For example, for x = (50, 85) the decision f(x) appears to be the Unhappy class since its
position lies above the curve.

Variable-Size Feature Vectors: It is harder to make a decision when the feature vectors vary
in size. We can either reduce/increase them to a fixed size or choose a suitable model.

1.5. Machine Learning Algorithms in Brief

Machine Learning Algorithms: Reinforcement Learning, Supervised Learning, Unsupervised Learning,
Semi-supervised Learning, Active Learning.

Supervised learning covers Classification and Regression; unsupervised learning covers Clustering and Association.

Fig 1.4
Scope of the course

1.5.1. Supervised Learning


Agent is trained using input samples xi
(feature vectors) and their outputs yi
together, where i = 1..N. Any pair (xi, yi) is
called a training sample. The set of all
input and output pairs (xi, yi) by which the
agent is trained is called the Training Set.

Training Set = { (xi, yi) | i = 1..N, N is the number of samples } = {training samples}

Classification: The outputs (yi) are labels or descriptors. These labels are called classes. Thus,
each sample (point) in the feature space is marked by a class, usually visualized by colors or symbols.
The meanings of the output labels in the Training Set are known a priori. For example, the meanings of "+",
the Happy class, and "-", the Unhappy class, are known before training.

Example :

Learner = curve

 Here, the learner model is trained with input (face) images and their output labels (classes:
smiling, angry, etc.). Features are extracted from each face image, and each face image is described
by a separate feature vector. The feature vectors form the feature space. The agent
develops a learner that discriminates the happy class from the other classes in the feature
space.
 One form of classifier design is curve fitting, where the learner model is a curve that has
parameters, and training the learner model means finding the model parameters that
best fit the training samples.

Some example forms of classification (separability, linear and non-linear classifiers):

- Separable with a non-linear learner (2 features, 2 classes)
- Non-separable with a non-linear learner (2 features, 2 classes)
- Non-separable with a linear learner (2 features, 2 classes)
- Separable with another non-linear learner (2 features, 2 classes)
- Separable with a linear (hyperplane) learner in a higher dimension (3 features, 2 classes)
- Separable with a non-linear learner (2 features, 3 classes)

Regression: The outputs yi of the training samples (xi, yi) are real numbers. The output y axis is
orthogonal to the feature space x. The samples are not labeled (all samples are plotted the same way).

- Example (1 feature, f is a linear learner)
- Example (1 feature, f is a non-linear learner)
- Example (1 feature, f is a non-linear learner)
- Example (2 features, f is a linear learner)

 Consider the previous face classification example where the agent learns whether a given
face image is happy or not. Let the agent instead be trained with the input faces and their ages. This
time, the agent can predict the age of a given face image. Since the output (age) is a
continuous real number, the learner is a regressor rather than a classifier.
 Note that sometimes labels in classification could be numbers, but they are only used to
discriminate between classes, such as class 1, class 2, etc. In classification, being above or
below the decision surface completely reverses the output. In regression, whether the
points lie above or below the decision surface does not matter much. Furthermore, points
close to the decision surface mean better accuracy in regression, whereas points close to
the decision surface are not preferred in classification.

1.5.2. Unsupervised Learning


 Agent is trained only with input samples xi (feature vectors). The outputs yi are not
used for training. This means that the training set consists of inputs only, where any input
xi = (xi1, xi2, ..., xil) is a feature vector (representation of a sample).

 An unsupervised learning agent generally analyses the input samples according to their
structural similarities. Two general analyses are i) cluster analysis (clustering), where the
agent groups the input samples into clusters, and ii) association analysis (association rules),
where interesting relations between the features (variables) are discovered.

 In clustering, the agent partitions the samples into clusters where the samples in each
cluster have some similarity among themselves. Thus, the similarity between samples needs
to be measured by similarity metrics, also called distance metrics, such as Euclidean, cosine,
Hamming, etc. Once the agent learns from the samples, it assigns a label to each cluster, such as
cluster-1, cluster-2, etc.

 Clustering and classification both do labeling in the end. The labels of classification have a
semantic meaning before the learning stage. For example, when the agent predicts the input face
image to be Ahmet (label), the name Ahmet is already known as a label by the agent (it is
available in the dataset). In contrast, the labels of clustering are assigned at the final stage of
learning just to differentiate between the clusters, such as cluster-1, cluster-2, etc. Renaming
cluster labels to any other labels such as group-1, group-2, etc., is not important since the
labels still differentiate the clusters.

Example: clustering with 3 clusters (3 outputs), clusters are non-overlapping, 2 features

Example: clustering with 4 clusters (4 outputs), clusters are overlapping, 2 features

Example: association analysis, the number of features varies

Ref: https://www.slideshare.net/aorriols/lecture13-association-rules
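As a small illustration of clustering (not part of the course text; the toy data and parameter values are assumptions), k-means in sklearn groups unlabeled samples into clusters and assigns labels such as cluster-0, cluster-1:

import numpy as np
from sklearn.cluster import KMeans

# toy unlabeled dataset: 6 samples, 2 features (age, weight) -- assumed values
X = np.array([[25, 60], [27, 65], [30, 62],
              [50, 90], [55, 95], [60, 88]])

model = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = model.fit_predict(X)   # cluster labels have no semantic meaning
print(labels)                   # e.g. [0 0 0 1 1 1]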

1.5.3. Semi-supervised Learning


 When labeling the input samples is hard and costly, we prefer to label only a few samples. For
example, we can obtain MRI images, even if with difficulty, but classifying these images with cancer or
no-cancer labels requires the decisions of many doctors. So the aim is to develop the best possible
learner that can learn from a dataset with only a few labeled samples.

Fig 5.1

 One approach is to cluster the data first and label selected samples later. Here clustering affects
the labeling. Instead of choosing samples that are close to each other in the feature space,
samples that are far enough from each other and represent the variations are preferred. Finding
the best samples to select is considered a problem on its own.
 The other approach is to label first (there may already exist labeled samples) and cluster later. Here
labeling affects the clustering behavior.

1.5.4. Reinforcement Learning


 There is no dataset that contains inputs and outputs. Instead, data comes in real time from
the experience of the agent. Thus, the agent learns from its own experience. This type of agent is
called an autonomous agent.
 For example, consider a cleaning robot. Picking up trash can be
considered a positive reward and hitting an obstacle can be considered a negative
reward for the model. The agent will choose actions that lead to positive rewards over time.
 In the beginning, the agent has zero experience. The initial behaviors of the agent are primitive actions
hard-coded by the designer (the agent moves randomly in the early stages). The agent will gain
intelligence over time and will become independent (autonomous) of its initial design.

1.6. Dataset
 Each sample is represented by a feature vector. A group of input samples can be kept as
a matrix (each row represents a sample), denoted X in the course. The
corresponding outputs are kept as a vector, denoted y in the course.

X = [x_1; x_2; ...; x_n]   (an n x l matrix whose i-th row is the feature vector x_i = (x_i1, ..., x_il))
y = (y_1, y_2, ..., y_n)^T  (an n x 1 vector of outputs)

Dataset = {X, y} = {(x_i, y_i) | i = 1..N}
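A minimal sketch of this representation in NumPy (the numbers are made up):

import numpy as np

# X: n x l matrix, one row per sample; y: n x 1 vector of outputs
X = np.array([[25, 60],    # x_1 = (age, weight) of sample 1
              [50, 85],    # x_2
              [33, 70]])   # x_3
y = np.array([[1], [-1], [1]])   # outputs (e.g. Happy = +1, Unhappy = -1)

n, l = X.shape   # n = 3 samples, l = 2 features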

• Learning methods according to dataset



2.1. Overfitting and Underfitting for Regressors (on curve fitting examples)
- In supervised learning models (e.g. regression models), the model should fit data enough so that
it develops a good generalization ability.
- Trying to fit to all samples in the training set negatively affects the generalization ability. This is
called overfitting.
- Mean square error (MSE) is one popular way to measure the regression error.


MSE = (1/n) Σ_{i=1..n} (y_i − f(x_i))²

- In the above example, f is a regressor (regression model) that is overfitted. The learner f loses its
generalization ability while fitting all the training samples. The training error of f is very low (zero).

Training Error versus Test Error


- Training error measures the performance of the learner f on the training samples only.
- Test error measures the performance on independent samples that are not in the training set.
- Test error is more important than the training error.
- Test error is usually higher than the training error.

- In the above example, the regressor's performance on the test samples is very bad even though its
performance on the training set is perfect. The test samples are far away from the decision curve f that
has overfitted the training set.

Underfitting
- The model is too generic: it over-generalizes the training samples and does not fit them enough.

- In the example above, the regressor f is underfitted. Thus, the training error is very high (higher
than acceptable).

Curve Fitting (Normal Fit)


- The model fits enough to the training samples. Training error is usually good enough but it is not
perfect.

2.2. Overfitting and Underfitting for Classifiers


- In supervised learning models (e.g. classification models), the model should fit data enough so
that it develops a good generalization ability.
- Average zero-one loss is one popular way to measure the classifier error; it counts the
number of misclassifications.

Error = (1/N) Σ_i loss(y_i, f(x_i)),   (x_i, y_i) from the dataset of N samples

loss(y_i, f(x_i)) = 0 if y_i = f(x_i), and 1 otherwise
(y_i: real output, f(x_i): model's output)
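A quick NumPy sketch of the average zero-one loss (the label arrays are made up):

import numpy as np

y  = np.array([1, 0, 1, 1, 0])   # real outputs
fx = np.array([1, 1, 1, 0, 0])   # model's outputs

zero_one_error = np.mean(y != fx)   # fraction of misclassifications
print(zero_one_error)               # 0.4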

Underfitting Example for Classification

- In the above example, the training error is 45% (too high).

Overfitting Example for Classifier

- In the above example, 8 test samples are badly evaluated by the learner f
- Training error is 0% (perfect)
- Test error is 100% (not acceptable)

As mentioned before, minimizing the training error is good as long as the test error is also minimized.
Otherwise, a perfect training error is usually not preferred since it causes the learner to memorize
(overfit) the training samples (just like a student memorizing the answers instead of actually
learning). Thus, depending on the problem, one may prefer an average or above-average training
performance.

Curve fitting Example for Classification

- 11% (acceptable)
- 0% (perfect)

3. Evaluating the learner


Validation Set Approach
- A separate validation set (test set) is used to test or evaluate the learner. If we have a training set
and a separate test set, we can use the training set to train the learner model and the test set to
evaluate the model performance.
- If we have only a single dataset, but no separate test set, we can split the dataset randomly into
two parts; one for the training set (e.g. 80%) and one for the test set (e.g. 20%).

K-Cross Validation
- If we only have a single dataset, but no separate test set, the dataset is split into K pieces (folds)
and K iterations are applied. At each iteration, one of the pieces is used as the test set and the rest
is used as the training set. Finally, the average of the K errors shows the model's performance on the
dataset.
- K-cross validation is usually used for model selection. K-cross validation helps us choose the best
model for the given dataset. It is used to evaluate the performance of a learner model, rather than
a trained model.
- By validating the model across all parts, the model is evaluated better against the overfitting and
underfitting cases.

Model Error = (1/K) Σ_i e_i,   where e_i is the test error of the i-th fold
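A minimal sketch of K-cross validation with sklearn's KFold (the synthetic dataset and the choice of LinearRegression are just placeholders for illustration):

import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))                               # assumed synthetic dataset
y = X @ np.array([1.5, -2.0]) + 0.1 * rng.normal(size=30)

model = LinearRegression()
errors = []
for train_idx, test_idx in KFold(n_splits=3, shuffle=True, random_state=0).split(X):
    model.fit(X[train_idx], y[train_idx])                  # train on K-1 folds
    e_i = mean_squared_error(y[test_idx], model.predict(X[test_idx]))
    errors.append(e_i)                                     # test error of this fold

model_error = np.mean(errors)                              # Model Error = (1/K) * sum(e_i)
print(errors, model_error)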

3.1. Performance Metrics for Classification


- Confusion matrix is usually used to define the performance metrics for classifiers. For example,
let’s say we have 2 classes, positive and negative.

                          Actual Values
                          Positive (1)    Negative (0)
Predicted Positive (1)        TP              FP
Predicted Negative (0)        FN              TN

True Positives (TP) are the samples for which our model said positive and the real output is positive.
True Negatives (TN) are the samples for which our model said negative and the real output is negative.
False Positives (FP) are the samples for which our model said positive but the real output is negative.
False Negatives (FN) are the samples for which our model said negative but the real output is positive.

Accuracy Rate = (TP + TN) / N

Error Rate = (FP + FN) / N

Recall = TP / (TP + FN)

FP Rate = FP / (FP + TN)

Precision = TP / (TP + FP)

Precision: among the samples classified as positive, what percentage was correct?
Recall: among the actual positive samples, what percentage was classified correctly?
f1-score: the harmonic mean of precision and recall
- In the measurement sense, precision shows how close the outputs are to each other in repeated
measurements: a precise model produces consistent outputs for similar inputs, even if its errors are systematic.

[Figure: four cases illustrating the combinations of high/low accuracy with high/low precision.]
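A small sketch of these metrics with sklearn.metrics (the label arrays are made-up examples):

from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score

y  = [1, 0, 1, 1, 0, 0, 1, 0]   # real outputs
fx = [1, 0, 1, 0, 0, 1, 1, 0]   # model's outputs

tn, fp, fn, tp = confusion_matrix(y, fx).ravel()   # sklearn orders rows/cols as [negative, positive]
print(tp, tn, fp, fn)                              # 3 3 1 1
print(accuracy_score(y, fx))                       # (TP+TN)/N
print(precision_score(y, fx))                      # TP/(TP+FP)
print(recall_score(y, fx))                         # TP/(TP+FN)
print(f1_score(y, fx))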

3.2. Performance Metrics for Regression


- The performance metrics for regressors usually uses the error distance (ei) between the samples
and the regressor curve.
Prof.Dr. Zeki YETGİN, Mersin Univ. , Machine Learning Course Book (Content is copyrighted)

- There are many metrics for regressors. Some of them are MAPE (mean absolute percentage error),
R2 (R-square), MAE (mean absolute error), MSE (mean square error) and MMME (mean of min
over max error). The metrics are defined for n samples in formulas below where yi is the true
distance and f (xi) is the predicted distance.
- The MAPE is the average percent of accuracy with respect to original distance .
𝑛
1 |𝑦𝑖 − 𝑓(𝑥𝑖 )|
𝑀𝐴𝑃𝐸 = ∑
𝑛 𝑦𝑖
𝑖=1

- R-square (R2) is the coefficient of determination, which is a statistical measure to represent the
variance rate, where ȳ is the mean of the y_i values.

R2 = 1 − Σ_{i=1..n}(y_i − f(x_i))² / Σ_{i=1..n}(y_i − ȳ)²

- The MAE is the average of the absolute differences between the true and predicted values.

MAE = (1/n) Σ_{i=1..n} |y_i − f(x_i)|

- The MSE measures how close the predicted values are to the actual values.

MSE = (1/n) Σ_{i=1..n} (y_i − f(x_i))²

The MMME is the percentage error that measures how the predicted and actual values are close to
each other. The MMME always considers the minimum one with respect to the maximum one of the
actual and predicted values, whereas MAPE always considers their absolute difference with respect to
the actual value.

MMME = 1 − (1/n) Σ_{i=1..n} min(y_i, f(x_i)) / max(y_i, f(x_i))

- In order to analyze the performance of a regressor, the y = f(x) line can be plotted to see
how many samples are on this line. The more samples fall on this line, the better the performance.
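A short NumPy sketch computing these regression metrics (the y and f(x) values are made up):

import numpy as np

y  = np.array([3.0, 5.0, 8.0, 10.0])    # true values
fx = np.array([2.5, 5.5, 7.0, 11.0])    # predicted values f(x_i)

mape = np.mean(np.abs(y - fx) / y)
mae  = np.mean(np.abs(y - fx))
mse  = np.mean((y - fx) ** 2)
r2   = 1 - np.sum((y - fx) ** 2) / np.sum((y - np.mean(y)) ** 2)
mmme = 1 - np.mean(np.minimum(y, fx) / np.maximum(y, fx))

print(mape, mae, mse, r2, mmme)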

4. Learner Models and Linear Learners


- The word "model" is used in a broad sense and refers to any form or shape of the learner.

Regression and Classifier Models:

- Discriminant Models
  - With parameters: curve fitting (linear, non-linear), Neural Networks
  - Without parameters: Decision Trees
- Generative Models
  - With parameters: Gaussian Mixture Model
  - Without parameters: KNN
- Hybrid / Ensemble Learning Models: Random Forest

Discriminant Models
- The focus is to model a decision surface that best discriminates the outputs in the feature space.
Thus, the model f(x) is designed in such a way that it directly predicts the output without any
intermediate modeling of the data samples. For example, model the area that best separates the
classes in the case of classification, or the area that best approximates the outputs in the case of
regression.
- In a parametric model, the model is described in terms of parameters, and the best parameter
values are found while training the model. In curve fitting, the form of the curve is initially
assumed and the parameter values are found such that the curve best discriminates the outputs.
- For example, in the curve-fitting category, linear classifiers always aim to find the best direction w
(norm or gradient), where w = (wl, ..., w1, w0) contains the model parameters, and the best w means
the curve direction that best separates the classes.

Generative Models
- The focus is to model the data samples themselves such that they can be regenerated
approximately. Thus, the model f(x) requires an intermediate modeling of the data samples in
the feature space such that the feature space can be approximated. For example, the discriminant
models are interested in modeling the regions outside the classes, whereas the generative
models are interested in the regions inside the classes. One general way is to model the sample
distribution in the feature space as an approximation of probability density functions (pdf), p(x), such
as a mixture of Gaussian distributions. For example, each class ci's data could be approximated
using a separate pdf p(x|ci).

- Then the learner f can be further modeled using the pdf approximation p(x|ci).
- For example, Bayes classifiers using the pdf approximation assign the most likely class (maximum
likelihood) as the final decision as follows:

f(x) = argmax_i p(c_i|x) = argmax_i P_i · p(x|c_i)

where P_i is the prior probability of class c_i.
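As an illustration of this decision rule (not from the course text), sklearn's GaussianNB approximates each p(x|ci) with a per-class Gaussian and predicts by the argmax of the class posteriors; the tiny dataset below is made up:

import numpy as np
from sklearn.naive_bayes import GaussianNB

X = np.array([[1.0, 2.0], [1.2, 1.8], [4.0, 5.0], [4.2, 5.5]])
y = np.array([0, 0, 1, 1])

model = GaussianNB()            # approximates p(x|c_i) with per-class Gaussians
model.fit(X, y)
print(model.predict([[1.1, 2.1]]))        # argmax_i P_i * p(x|c_i)  ->  class 0
print(model.predict_proba([[1.1, 2.1]]))  # posterior probabilities p(c_i|x)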

Ensemble Learning Models


The focus is to combine a set of weak learners (an ensemble) in an attempt to build a strong learner. Each
weak learner predicts better than random. There are many methods for ensemble learning.
We can generalize them into two categories:
i) Weak learners independently make decisions in parallel, and finally their decisions
(outputs) are averaged, e.g. a majority vote for classification or the average result for
regression. Random Forest and Bagging are examples of this category.
ii) Weak learners dependently make decisions in a serial way, and the final decision is the
decision of the final learner in the series. Each weak learner takes the experience of the
previous learner, improves on it, and delivers its experience to the next one. The
final learner is the strongest one among all, and its decision is the final decision of the
ensemble. Boosting and AdaBoost are examples of this category.

4.1. Preprocessing of data


Usually, a dataset is not perfect since its features characterize real-life observations. Thus, a
dataset may contain noisy data such as unwanted degradations in feature values, may have missing
values, and may contain extreme values (outliers) or erroneous values. Furthermore, the dataset in its raw
form may not be well learnt by the agent. So preprocessing transforms the dataset (feature space) into a
new set from which the agent can learn better. For example, images can be enhanced and corrected
against many problems due to noise and lighting effects before the agent learns them. Data
transformation, normalization, scaling and dimension reduction techniques can be considered in
this scope. Preprocessing is out of the scope of the course except for two normalization techniques.

Standard Normalization
Standard normalization transforms the dataset into a new set in which the origin is shifted to the data
center (meaning the mean of all samples is zero) and each feature has unit variance. Thus, the feature
values are standardized around the data center.

x_i' = (x_i − μ) / σ     where x_i is any feature vector and the division is element-wise (pair-wise),

μ = mean vector of all the vectors (μ_j is the mean of the j-th feature, e.g. the j-th column of X),
σ = standard deviation vector of all the vectors (σ_j is the standard deviation of the j-th feature)

Before normalization the features have arbitrary means and ranges; after normalization the mean
vector is 0 and each feature has unit variance.

MinMax Normalization
All of the feature values will be scaled between 0 and 1 after min-max normalization.

x_i' = (x_i − x_min) / (x_max − x_min)

where x_i is any feature vector, and x_min and x_max are the vectors of per-feature minimum and
maximum values (computed over all samples); the division is element-wise.
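Both normalizations are available in sklearn's preprocessing module; a short sketch (the data matrix is made up):

import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

X = np.array([[20., 60.], [30., 80.], [40., 100.]])   # 3 samples, 2 features

X_std = StandardScaler().fit_transform(X)    # zero mean, unit variance per feature
X_mm  = MinMaxScaler().fit_transform(X)      # each feature scaled into [0, 1]

print(X_std.mean(axis=0), X_std.std(axis=0)) # ~[0 0], [1 1]
print(X_mm.min(axis=0), X_mm.max(axis=0))    # [0 0], [1 1]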

4.2. Linear Regression

The learner model is assumed to be of linear form. In a high-dimensional space (R^l), the linear form is
called a hyperplane, which can be defined using the norm (w) and intercept (w0); these are the model
parameters. w decides the direction of the hyperplane.

- first form: f(x) = w·x + w0 = w_l x_l + ... + w_1 x_1 + w_0   where x = (x_l, ..., x_1) ∈ R^l

- second form: f(x) = w·x   where the X data is centralized in one of the following ways


1- the data is centralized in the (l+1)-dimensional space with w0 contained in w = (w_l, ..., w_1, w_0) ∈ R^(l+1):

x' = (x_l, ..., x_1, x_0 = 1) ∈ R^(l+1)  →  f(x') = w_l x_l + ... + w_1 x_1 + w_0·1 = w·x'


2- the data is centralized in the l-dimensional space (current space) by normalization such that the mean of
all x_i is zero. Thus, w0 does not need to be fit and is taken as w0 = 0.

- One general solution is w optimization: find the best direction w = (w_l, ..., w_1, w_0) by searching for
the w that best fits the training data.
- The majority of linear regression models use derivative-based optimization to find the best w, while
some other techniques, such as the Theil-Sen estimator, do not use derivatives.

4.2.1. Ordinary Least Square Regression (parametric)


Find the best direction w = (w_l, ..., w_1, w_0) with which the model minimizes the mean square error
(MSE) on the training data as the objective function (objfun). Thus the loss function is the squared error:

lossfun(w) = (y_i − f(x_i))²

objfun(w) = MSE = (1/n) Σ_{i=1..n} lossfun = (1/n) Σ_{i=1..n} (y_i − f(x_i))²

Problem: how to find the w that minimizes the objective function (objfun)? → Take the derivative of the
objective function with respect to w and solve the resulting linear system:

objfun(w) = (1/n) Σ_{i=1..n} (y_i − w·x_i)²   →   ∂objfun/∂w = −(2/n) Σ_{i=1..n} x_i·(y_i − w·x_i) = 0

1) Simple case (Simple Linear Regression): single feature x ∈ R,  f(x) = w_1 x + w_0

∂objfun/∂w_0 = −(2/n) Σ_{i=1..n} (y_i − w_1 x_i − w_0) = 0
∂objfun/∂w_1 = −(2/n) Σ_{i=1..n} x_i·(y_i − w_1 x_i − w_0) = 0

Two equations and two unknowns (w_1, w_0) → solve analytically. The solutions are as follows:

w_1 = Σ(x_i − x̄)(y_i − ȳ) / Σ(x_i − x̄)² ,   w_0 = ȳ − w_1 x̄

x̄: average of all x_i and ȳ: average of all y_i


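A minimal NumPy sketch of these closed-form solutions for the simple case (the data points are made up for illustration):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 7.8])    # roughly y = 2x

w1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
w0 = y.mean() - w1 * x.mean()

print(w1, w0)          # slope ~2, intercept ~0
fx = w1 * x + w0       # predictions of the fitted line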

2) General case (Multiple Linear Regression) : 𝑥 ∈ 𝑅 𝑙 , 𝒇(𝒙) = 𝒘𝒍 𝒙𝒍 + ⋯ + 𝒘𝟏 𝒙𝟏 + 𝒘𝟎

When the number of features (variables) is high, the analytic solution requires solving linear systems
of large size, which is a complexity issue. Thus, the analytic solution is not feasible for the w
optimization when the dataset contains many samples with many features. One can approximate the
solutions of the derivative iteratively using the gradient descent algorithm.

4.2.2. Gradient Descent based Least Square Regressions


The gradient descent (GD) algorithm uses the whole dataset at each iteration to compute the gradient
∆w, which is equal to the derivative of the objective function. Thus, it takes time to converge to the
optimum when the dataset is large, even if a vectorized implementation increases the speed. The
good side is that it has a straight trajectory towards the optimum and is more likely to converge to it.
The basic steps of GD are

1- start with random w = (wl ,…, w1, w0).

2- for each iteration

3- change the current direction w by ∆𝒘 (rotate by ∆ and shift by ∆) as follows


w = w − ∆w   where   ∆w = ∂objfun/∂w = −(2/n) Σ_{i=1..n} x_i·(y_i − w·x_i)

4- when the maximum iteration is reached or the w is fit enough terminate.


5- return w

Gradient Descent Algorithm python implementation:

import numpy as np

def linear_regression(X, y, learning_rate, max_iter):
    # X = nxl matrix and y = nx1 vector
    n = len(y)                        # number of samples
    l = len(X[0])                     # number of features (dimensions)
    w = np.zeros([l, 1])              # lx1 vector (random in general, here the 0 vector)
    w0 = 0                            # intercept (random in general, here 0)
    for i in range(max_iter):
        fx = np.dot(X, w) + w0                 # fx = nx1 vector of predictions
        dw = -2/n * np.dot(X.T, (y - fx))      # dw = lx1 vector (gradient w.r.t. w)
        dw0 = -2/n * np.sum(y - fx)            # dw0 = real number (gradient w.r.t. w0)
        w = w - dw * learning_rate
        w0 = w0 - dw0 * learning_rate
    fx = np.dot(X, w) + w0
    score = np.mean((y - fx) ** 2)    # training error (MSE)
    return (w, w0)
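A quick usage sketch of the function above on synthetic data (the generated values and hyperparameters are arbitrary choices):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                   # 100 samples, 2 features
y = X @ np.array([[2.0], [-1.0]]) + 0.5         # true w = (2, -1), true w0 = 0.5

w, w0 = linear_regression(X, y, learning_rate=0.05, max_iter=1000)
print(w.ravel(), w0)                            # should approach [2, -1] and 0.5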

Stochastic Gradient Descent Algorithm (SGD) based

The stochastic gradient descent algorithm uses one random sample at each iteration to compute the
gradient ∆w, which is equal to the derivative of the objective function for one sample, i.e. the loss
function in this context. SGD is faster than GD at the cost of a noisier trajectory, and it is less likely to
converge exactly to the optimum. The basic steps of SGD are

1- start with random w = (wl ,…, w1, w0).

2- for each iteration

3- (xi , yi )=select a random sample in dataset


4- change the current direction w by ∆𝒘 (rotate by ∆ and shift by ∆) as follows

w = w − ∆w   where   ∆w = ∂lossfun/∂w = −x_i·(y_i − w·x_i)

5- when the maximum iteration is reached or the w is fit enough terminate.


6- return w

Stochastic Gradient Descent Algorithm python implementation:

import numpy as np

def linear_regression(X, y, learning_rate, max_iter):
    # X = nxl matrix and y = nx1 vector
    n = len(y)                                  # number of samples
    X = np.append(X, np.ones([n, 1]), axis=1)   # add the x0=1 column, dimension = l+1
    l = len(X[0])
    w = np.zeros([l, 1])                        # lx1 vector (random in general, here 0)
    for i in range(max_iter):
        ind = np.random.randint(n)              # select a random sample
        xi = X[ind]
        yi = y[ind]
        fxi = np.dot(xi, w)                     # fxi = real number
        dw = (-xi * (yi - fxi)).reshape(-1, 1)  # dw = lx1 vector (gradient of the loss)
        w = w - dw * learning_rate
    fx = np.dot(X, w)
    score = np.mean((y - fx) ** 2)              # training error (MSE)
    return w

Gradient descent variants: The mini-batch gradient descent variant partitions the training set into
subsets such that at each iteration one subset, rather than the whole dataset, is used to calculate
the gradient and update the direction w. This further reduces the variance of the gradient. The
gradient is equal to the derivative of the objective function and is computed only over the samples
selected by the variant. Mini-batch gradient descent is a balanced version between the
advantages of stochastic gradient descent and (batch) gradient descent. It is generally used in
the field of artificial neural networks and deep learning to train the network.

Ref: https://towardsdatascience.com/gradient-descent-algorithm-and-its-variants-10f652806a3

4.2.3. Theil-Sen Regression


- Simple Case: w = (w_1, w_0). The best direction w_1 is the median of the slopes of all lines
between all pairs of points:
w_1 = median { (y_j − y_i)/(x_j − x_i) | for all i ≠ j }
w_0 = median { (y_i − w_1 x_i) | for all i }

the regressor f(x) = w_1 x + w_0

- Multiple Linear Regression Case: w = (w_l, ..., w_1, w_0); this is a research problem, see
Wang, X., X. Dang, H. Peng, and H. Zhang (2009), The Theil-Sen Estimators in a Multiple Linear
Regression Model.

Ref: Srikant K S (telegari.wikidot.com)


The Least Squares regressor is more sensitive to outliers (noise) in the data than the Theil-Sen regressor.
However, the complexity of the Theil-Sen regressor is higher (O(n²), or O(n log n) with efficient
implementations) than that of Least Squares (O(n)).
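A small NumPy sketch of the simple-case Theil-Sen estimator described above (the data points are made up; sklearn also provides a TheilSenRegressor class):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 10.0])
y = np.array([2.0, 4.1, 6.0, 8.2, 40.0])    # last point is an outlier

# median slope over all pairs i < j
slopes = [(y[j] - y[i]) / (x[j] - x[i])
          for i in range(len(x)) for j in range(i + 1, len(x))]
w1 = np.median(slopes)
w0 = np.median(y - w1 * x)

print(w1, w0)            # close to slope 2, intercept 0 despite the outlier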

4.2.4. Ordinary least square regression using Python sklearn library


Example : general form for ordinary least square regression

from sklearn.linear_model import LinearRegression


from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn import datasets

#load the dataset that returns X and y


X, y = datasets.load_XXX(return_X_y=True)   #XXX refers to the dataset name

#validation set approach


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=5)

#select the learner model


model = LinearRegression()

#train the learner model


model.fit(X_train, y_train)

#test the learner , e.g using MSE or R2


fx_test = model.predict(X_test)
mse = mean_squared_error(y_test, fx_test)
r2 = r2_score(y_test, fx_test)

print('MSE is {}'.format(mse))
print('R2 score is {}'.format(r2))

Example : Analyzing linear regression model with cross validation

from sklearn.linear_model import LinearRegression


from sklearn.model_selection import cross_val_predict
from sklearn import datasets
import matplotlib.pyplot as plt

#load the dataset that returns X and y


X, y = datasets.load_XXX(return_X_y=True)

#select the learner model


model = LinearRegression()

#test the learner model directly


fx = cross_val_predict(model, X, y, cv=3) #fx: merging the results of each test set during cross-val.

#plot the result on y=fx line for analysis


fig, ax = plt.subplots()
ax.scatter(y, fx)
ax.plot([y.min(), y.max()], [y.min(), y.max()], 'k-', lw=2) #k- : black solid line, lw: line width
ax.set_xlabel('Real')
ax.set_ylabel('Predicted')
plt.show()

We should examine how well the results of the model agree with the ideal line y = f(x) in order to
analyze the regression performance visually.

cross_val_predict: This merges the prediction results of each test fold by using the .fit and .predict
implementation of the selected model during K-cross validation. We use this for analyzing
the success of the model on the dataset.
cross_val_score: This returns the performance score of the model for each iteration using
the model's own fit and predict methods. The default score is the model's default
scoring metric. We should add the parameter "scoring=..." to change it.

Example: Analyzing the linear regression scores of each fold of K-cross validation

from sklearn import linear_model
from sklearn.model_selection import cross_val_score

X=...

y=...

model=linear_model.LinearRegression()

score1=cross_val_score(model,X,y,cv=3)

score2=cross_val_score(model,X,y,cv=3,scoring='neg_mean_absolute_error')

print(score1)
score2 = abs(score2) #score2 is the negative MAE, so we must take its absolute value
print(score2)

output
score1 =[0.32 , 0.09 , 0.02]  → the default score of the linear regression model is R2

score2 =[2.2 , 1.1, 0.2]  → MAE error



4.2.5. Linear regression using Python sklearn: loss + alpha*penalty form

SGDRegressor Class
- The SGDRegressor class uses the stochastic gradient descent algorithm, where the objective
function is designed in terms of loss and penalty functions that both depend
on the parameter w. The loss function models the error for one sample; the penalty function
models the cost of the error with respect to model complexity.

objfun(w) = (1/n) Σ_{i=1..n} loss(y_i, f(x_i)) + alpha · penalty(w)

The first term is the mean loss (error); the second term is the regulator (penalty term).

- The regulator penalizes the model complexity to avoid overfitting. So it penalizes the model if
the model loses its generalization ability. Simpler models are better if they produce a
similar total loss. The regulator parameter alpha can be adjusted depending on the problem.
- Typical penalties for regression: L1-norm and L2-norm penalties

L1_penalty(w) = Σ_{i=1..l} |w_i|

L2_penalty(w) = Σ_{i=1..l} w_i²
So if some dimensions of w approach zero, the model will generalize better.
The regulator tries to bring the w vector closer to the zero vector.
- Typical loss function: squared_loss
squared_loss(w) = (y_i − f(x_i))²

Example : ordinary least square using SGDRegressor with no penalty


from sklearn.linear_model import SGDRegressor
from sklearn import datasets
import matplotlib.pyplot as plt

X,y=datasets.load_XXX(return_X_y=True) # load data

model=SGDRegressor(loss="squared_loss",penalty=None) # newer sklearn versions use loss="squared_error"
model.fit(X,y)
fx=model.predict(X) # predict fx for the training samples
# visualization on the y=f(x) line
fig,ax=plt.subplots()
ax.scatter(y,fx,marker="o",s=5) # outputs as points of size 5
ax.plot([y.min(), y.max()],[y.min(),y.max()]) # y=f(x) ideal line
ax.set_xlabel("real")
ax.set_ylabel("predicted")
plt.show()

- The basic parameters of the SGDRegressor class are loss, penalty, alpha, eta0, and
learning_rate. eta0 is the initial learning rate. The learning rate can be changed dynamically
according to the learning_rate parameter, which is used to adjust the weight of the gradient when
updating w. Together with these parameters, SGDRegressor models the objective function
and approximates the solutions of the linear system ∂objfun(w)/∂w = 0 iteratively. For
example, Ridge, Lasso and ElasticNet regression can be defined using squared_loss and
various penalty forms, as explained in the following examples.

Example : Ridge Regression using SGDRegressor class


from sklearn.linear_model import SGDRegressor
from sklearn import datasets
X,y=datasets.load_XXX(return_X_y=True) # load data
. . .
model=SGDRegressor(loss="squared_loss",penalty="l2", alpha=0.0001)
model.fit(X,y)
fx=model.predict(X) # predict fx for the training samples
. . .

Example : Ridge Regression using Ridge class


from sklearn.linear_model import Ridge
X,y=. . .

model=Ridge(alpha=1.0, tol=0.001) #Precision of the solution


#Δobjfun(x)<tol means terminate
model.fit(X,y)
. . .

Example : Lasso Regression using SGDRegressor class


from sklearn.linear_model import SGDRegressor
from sklearn import datasets
X,y=datasets.load_XXX(return_X_y=True) # load data
. . .
model=SGDRegressor(loss="squared_loss",penalty="l1")
model.fit(X,y)
fx=model.predict(X) # predict fx for the training samples
. . .

Example : Lasso Regression using Lasso class


from sklearn.linear_model import Lasso
X,y=. . .

model=Lasso(alpha=1.0,tol=1e-4)
model.fit(X,y)
. . .

Example : ElasticNet Regression using SGDRegressor class


from sklearn.linear_model import SGDRegressor
from sklearn import datasets
X,y=datasets.load_XXX(return_X_y=True) # load data
. . .
model=SGDRegressor(loss="squared_loss", penalty="elasticnet", l1_ratio=0.6)
# penalty = 0.6 * L1 + 0.4 * L2
model.fit(X,y)
fx=model.predict(X)
. . .

Example : ElasticNet Regression using ElasticNet class


from sklearn.linear_model import ElasticNet
X,y=. . .

model= ElasticNet(alpha=1.0,tol=1e-3, l1_ratio=0.6)


model.fit(X,y)
. . .

4.3. Linear Classifiers

- The learner model is assumed to be of linear form. In a high-dimensional space (R^l), the linear
form is called a hyperplane, which can be defined using the norm (w) and intercept (w0); these are the
model parameters. w decides the direction of the hyperplane.

first form: f(x) = w·x + w0   where x ∈ R^l and w ∈ R^l

second form: f(x) = w·x   where x ∈ R^(l+1) and w ∈ R^(l+1)

classifier(x) = sign(f(x)) decides the output class for any given x
Activation function = sign; it could also be some other function.

- One general solution is w optimization: find the best direction w = (w_l, ..., w_1, w_0) by searching
for the w that best fits the training data.
- The majority of linear classifier models use derivative-based optimization.

f: hyperplane

4.3.1. Perceptron Algorithm


- The perceptron algorithm emulates a single neuron in the brain. The neuron activates when the
inputs are transferred to the neuron with a sufficient energy level. The energy level is tested by the
activation function. The weights w affect the conductivity (importance) of the inputs. The
weights correspond to the orientation (direction) of the decision surface. The bias decides the
position of the decision surface (it only shifts it away from the origin) and does not depend on the inputs.

- The perceptron algorithm aims to find the best orientation w = (w_l, ..., w_1, w_0) that minimizes
the total loss, where the loss of a sample is its distance to the zero margin. In the original variant of the
perceptron algorithm, the algorithm terminates when the total loss becomes zero (possible only
for separable problems), but later implementations allow terminating after a sufficient fit or
tolerance.

loss(y_i, f(x_i)) = { 0 if y_i·f(x_i) > 0 ;  −y_i·f(x_i) otherwise } = max(0, −y_i·f(x_i))

objfun(w) = (1/n) Σ_{i=1..n} loss(y_i, f(x_i)) = (1/n) Σ_{i=1..n} max(0, −y_i·f(x_i))

The perceptron loss is a special form of the hinge loss where M is the margin, f(x) = M:

hinge loss = max(0, M − y_i·f(x_i))

The algorithm iteratively updates w by a step ∆w to minimize the Error = Total loss = Σ (perceptron loss).

4.3.2. Gradient descent based Perceptron Algorithm


- We can find the w plane by optimization using gradient descent or stochastic gradient
descent, as we did before, with the following steps:

1- From the objective function, calculate the gradient.


For stochastic gradient descent,
objfun(w) = loss(x_i, y_i) = max(0, −y_i·f(x_i))
For (batch) gradient descent,
objfun(w) = (1/n) Σ_{i=1..n} loss(x_i, y_i)
Calculate the gradient Δw, as before, by taking the derivative with respect to w.

For (batch) gradient descent:

Δw = ∂objfun/∂w = ∂(Σ_{i=1..n} max(0, −y_i·w·x_i))/∂w = Σ_{for each misclassified i} −y_i·x_i

For stochastic gradient descent:

Δw = ∂objfun/∂w = ∂(−y_i·w·x_i)/∂w = −y_i·x_i

2- w is updated only for misclassified samples, using the gradient Δw.

- For example, the perceptron algorithm using stochastic gradient descent is as follows:


1- start with random w = (wl ,…, w1, w0).

2- for each iteration

3- (xi , yi )=select a random sample in dataset


4- If 𝑦𝑖 . 𝑓(𝑥𝑖 ) < 0 //if the sample is misclassified by current w
5- 𝑤 = 𝑤 − 𝑒𝑡𝑎 ∗ ∆𝒘 where ∆𝒘 = −𝒚𝒊 . 𝒙𝒊 (default eta = 1)
6- when the maximum iteration is reached or the w is fit enough terminate.
7- return w
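A minimal NumPy sketch of these steps (an illustrative implementation, not the course's reference code; it assumes labels y ∈ {+1, −1} and folds the bias into w by appending x0 = 1):

import numpy as np

def perceptron_sgd(X, y, eta=1.0, max_iter=1000):
    n = len(y)
    X = np.append(X, np.ones([n, 1]), axis=1)   # append x0 = 1 so that w contains w0
    w = np.zeros(X.shape[1])
    for _ in range(max_iter):
        i = np.random.randint(n)                # pick a random sample
        if y[i] * np.dot(w, X[i]) <= 0:         # misclassified (<= 0 so the all-zero initial w also updates)
            w = w - eta * (-y[i] * X[i])        # w = w - eta*dw, with dw = -y_i*x_i
    return w

# toy linearly separable data (made up)
X = np.array([[2.0, 2.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
print(perceptron_sgd(X, y))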

Example (gradient descent behavior at time t):

Suppose at time t, f(x) = x1 + x2 - 0.5 with eta = 0.7.


Thus the corresponding weight vector is w(t) = [1, 1, -0.5], which is shown as the dotted line in the
figure. Two samples x1=[0.4, 0.05] and x2=[-0.20, 0.75] are misclassified by w(t), where y1=+1
and y2=-1 (note: each yi is negated due to the (–yi).xi term of the gradient in the update below).

According to the gradient descent algorithm, the next weight vector is w(t+1) = w(t) − eta·Δw,
where Δw = Σ over the misclassified samples of −yi·xi (with x0 = 1 appended for the bias term).

The resulting f(x) = 1.41 x1 + 0.51 x2 - 0.5 = 0 now correctly classifies all the samples, and the
algorithm terminates.

- There are many variants of the perceptron algorithm that work on non-separable problems. The
original perceptron algorithm converges only if the classes are linearly separable. A variant of the
perceptron algorithm, the Pocket Algorithm, converges to a good solution
even if the linear separability condition is not fulfilled. Other related algorithms that find
reasonably good solutions when the classes are not linearly separable are the thermal
perceptron algorithm, the loss minimization algorithm, and the barycentric correction procedure.
- There are many online variants of the perceptron algorithm. In online learning, data is available
through a data flow in time. In the batch perceptron algorithm, each iteration performs an update to w
after all of the data have been processed. This is not suitable for online classification on the data flow.
- Multi-class variants of the linear classifiers will be given in later sections.

4.3.3. Perceptron implementation with Python Sklearn: loss form + no penalty


SGDClassifier and Perceptron Classes
We can customize SGDClassifier with penalty=None and loss="perceptron" to obtain the perceptron.
SGDClassifier uses stochastic gradient descent to minimize the objective function, which can be
shaped according to the loss and penalty parameters. However, there is no penalty for the perceptron.
The Perceptron class in sklearn similarly implements the perceptron algorithm on the basis of
SGDClassifier. Thus, they are equivalent with the following parameters:

Perceptron() = SGDClassifier(loss='perceptron', eta0=1, learning_rate="constant", penalty=None)


eta0 = initial learning rate; learning_rate = how the learning rate changes at each iteration ("constant"
means eta = eta0)
Example : Perceptron classification using SGDClassifier

from sklearn.datasets import load_iris
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split

X,y=load_iris(return_X_y=True)
X_train,X_test,y_train, y_test= train_test_split(X,y,test_size=0.20) # 20% test set
model=SGDClassifier(loss="perceptron",eta0=1,learning_rate="constant",penalty=None)
model.fit(X_train, y_train)
score=model.score(X_test, y_test) # default metric is accuracy

Example : Perceptron classification using the Perceptron class

from sklearn.datasets import load_digits
from sklearn.linear_model import Perceptron

X,y=load_digits(return_X_y=True)
model=Perceptron(tol=1e-3) # exit criterion: |total_loss_i - total_loss_{i-1}| < tol -> exit
model.fit(X,y)
score=model.score(X,y)

4.4. Support Vector Machine (SVM)

4.4.1. Linear Support Vector Machine Classification (SVC) under separability


- The original SVC model is defined for linear classification on two-class (+1, -1) problems.
Later, the basic SVM model was extended to non-linear cases and multi-class variants.
- The basic SVC aims to find the best hyperplane (w) that maximizes the margin between the two
classes under the linear separability assumption.

this hyperplane is better !

- There may exist many w solutions. SVC searches for the best w that leaves the
maximum margin using a derivative approach, e.g. analytic or gradient descent. The
samples that lie on the margin boundaries are called support vectors, circled in the
figure.

Fig. is modified from the source [1]


- Formulating the Margin (M):
In order to formulate the margin (M), let x1 and x2 be support vectors on the class -1 margin
and the class +1 margin respectively. Thus,
f(x1) = -1,
f(x2) = +1

The unit vector of w is w/‖w‖ (its length is one).

The margin vector M is parallel to w and connects the two margins, so w·M = f(x2) − f(x1) = 2.

The length of the margin vector is therefore M = 2/‖w‖.

4.4.1.1. Linear SVC as a constraint optimization problem under separability


min objfun(w) = (1/2)‖w‖²      (the Lagrange formulation is used to optimize w)

constraint: g_i(w) = y_i·f(x_i) − 1 ≥ 0 for all i   (a new constraint function g_i for each i)

- Lagrange theory for a minimization problem


Let f(x) be the objective function. The derivative of f(x) under a single constraint g(x) ≥ 0 is
equal to the derivative of the Lagrange function L(x, α) = f(x) − α·g(x) under the constraint α·g(x) = 0.
For every constraint i, a new Lagrange multiplier αi is introduced.

Proof :
Case 1 : g(x*) > 0

The unconstrained optimum point x* of f(x) is not affected by the g(x) constraint region. So the optimum
point of f(x) can be found by solving the equation ∇f = 0 without using ∇g.

So for α = 0 the term α·g(x) = 0 and ∂L/∂x = ∂f/∂x.

Case 2 : g(x*) = 0
The optimum point of f(x) is affected by the g(x) constraint region. So the optimum point of f(x) can be
found by solving ∇L = ∇f − α∇g = 0.

So for α > 0 the term α·g(x) = 0 and ∂L/∂x = ∂f/∂x − α·∂g/∂x.

Example (Ref: Constraint optimization and SVM, Man-Wai MAK (EIE))



4.4.1.2. Lagrange based Linear SVC under separability

Minimize the Lagrange equation below w.r.t. the primal variables (w, w0) and maximize it w.r.t. the
dual variables (αi), under the constraint αi·gi(w) = 0 for each i:

L(w, w0, α) = (1/2)‖w‖² − Σ_i αi·[yi·(w·xi + w0) − 1]      where f(xi) = w·xi + w0 and gi(w) = yi·f(xi) − 1

Dual Problem:
min L(w, w0) and max L(α)
constraint: αi·[yi·(w·xi + w0) − 1] = 0 and αi ≥ 0
Solution: solve the following system using 4 steps:

∂L ∂L ∂L
=0, = 0, =0
∂w ∂w0 ∂α
∂L
i- = 𝑤 − ∑𝑛𝑘=1 α𝑖 . y𝑖 . x𝑖 = 0  𝑤 = ∑𝑛𝑘=1 α𝑖 . y𝑖 . x𝑖
∂w

∂L
ii- = ∑𝑛𝑘=1 α𝑖 . y𝑖 = 0
∂w0

iii- Put w into Lagrange equation, get L(α)

iv- max L(α)


constraint : ∑𝑛𝑖=1 α𝑖 . y𝑖 = 0 and α𝑖 > 0
∂L
= 0  Solutions are two types of Lagrange multipliers :
∂α

1st type: αi =0 has no contribution for w solution


2nd type: αi >0 support the w solution, each αi produces a support vector

Once the αi's are computed, the optimum solution is generated by


w = Σ_{i=1..n} αi·yi·xi

w0 = ...  (put w into the yi·(w·xi + w0) = 1 (on margin) equation to solve for w0)

The final classifier is SVC(x) = sgn(w·x + w0)

4.4.1.3. Gradient Descent based Linear SVC under separability

- Lagrange based SVC is very costly because it requires quadratic programming or solving
a quadratic equation system. Thus, Lagrange based SVC is not
recommended for large datasets (perhaps larger than 10,000 samples). An alternative
approach for SVC is to approximate the Lagrange solution using gradient descent
variants such as mini-batch or stochastic gradient descent (SGD).
- In all gradient descent variants, the gradient Δw is computed at each iteration and then
the w orientation is changed one step towards the direction guided by the gradient. The
gradient is equal to the derivative of the objective function.

objfun(w) = (1/2)‖w‖²        Δw = ∂objfun/∂w = w

constraint: yi·f(xi) ≥ 1 for all i      (change w by Δw only in case of error)

- The SGD based SVC algorithm is as follows

1- start with random w = (wl ,…, w1, w0).


2- for each iteration
3- (xi , yi )=select a random sample in dataset
4- If 𝑦𝑖 . 𝑓(𝑥𝑖 ) < 1 //if the sample is misclassified by current w
𝑤 = 𝑤 − 𝑒𝑡𝑎 ∗ ∆𝒘 where ∆𝒘 = 𝒘 //eta = learning rate
5- when the maximum iteration is reached or the w is fit enough terminate.
6- return w

4.4.2. Linear SVC for non-separable problems

The objective function should maximize the margin and minimize the total loss (error):

e_i = loss(y_i, f(x_i)) = { 0 if y_i·f(x_i) ≥ 1 ;  1 − y_i·f(x_i) otherwise } = max(0, 1 − y_i·f(x_i))
(the hinge loss with margin = 1)

The loss function penalizes the misclassified samples and those lying inside the margin. A
total-loss term is added to the objective function, where the parameter C trades off between maximum
margin and minimum loss (i.e., which one is important and to what degree).

4.4.2.1. Linear SVC as a constraint optimization problem under non-separability

min objfun(w) = (1/2)‖w‖² + C·Σ_i e_i      constraint: y_i·(w·x_i + w0) ≥ 1 − e_i for all i
(the Lagrange formulation is needed to optimize w)

- Lagrange based linear SVC for non-separable problems


Convert the constraint to the g(w) ≥ 0 form and construct the Lagrange formula as follows:

L(w, w0, α) = (1/2)‖w‖² + C·Σ_i e_i − Σ_i αi·[y_i·(w·x_i + w0) + e_i − 1]

Dual Problem:
min L(w, w0) and max L(α)
constraint: αi·[y_i·(w·x_i + w0) + e_i − 1] = 0 and αi ≥ 0
(here g_i(w) = y_i·(w·x_i + w0) + e_i − 1)

Solution: solve the following system as we did for the separable case:

∂L/∂w = 0,  ∂L/∂w0 = 0,  ∂L/∂α = 0

Once the αi's are computed, the optimum solution is generated by


w = Σ_{i=1..n} αi·yi·xi

(put w into the yi·(w·xi + w0) = 1 (on margin) equation to solve for w0)

The final classifier is SVC(x) = sgn(w·x + w0)

4.4.2.2. Gradient Descent based Linear SVC under non-separability

- In all gradient descent variants, the gradient Δw is formulated as the derivative of the
objective function. However, depending on the gradient descent variant, the gradient
can use only a single sample (stochastic), some samples (mini-batch) or all samples in the
dataset (batch) at each iteration in an attempt to compute an orientation towards the
optimum. Thus, let us rewrite the objective function as follows (the term inside the sum is the loss e_i):

objfun(w) = (1/2)‖w‖² + C·Σ_i max[0, 1 − y_i·(w·x_i + w0)]

constraint: y_i·f(x_i) ≥ 1 for all i

The (batch) gradient is on the left, the SGD gradient (single-sample consideration) on the right:

Δw (batch) = { w,                          if y_i·f(x_i) ≥ 1        Δw (SGD) = { w,              if y_i·f(x_i) ≥ 1
             { w − C·Σ_{i=1..n} x_i·y_i,   else                                { w − C·x_i·y_i,  else

- The SGD based SVC algorithm is as follows

1- start with random w = (wl ,…, w1, w0).


2- for each iteration
3-     (xi , yi) = select a random sample in the dataset
4-     If yi·f(xi) < 1   //if the sample is misclassified (or inside the margin) by the current w
5-         ∆w = w − C·yi·xi
6-     else
7-         ∆w = w
8-     w = w − eta·∆w   //eta = learning rate
9- when the maximum iteration is reached or the w is fit enough, terminate.
10- return w
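A compact NumPy sketch of this SGD procedure (a simplified illustration, not the course's reference code; the learning rate, C and the toy data are arbitrary choices, and the bias is folded into w via an appended x0 = 1):

import numpy as np

def svc_sgd(X, y, C=1.0, eta=0.01, max_iter=5000):
    n = len(y)
    X = np.append(X, np.ones([n, 1]), axis=1)   # x0 = 1 column so that w contains w0
    w = np.zeros(X.shape[1])
    for _ in range(max_iter):
        i = np.random.randint(n)                # random sample
        if y[i] * np.dot(w, X[i]) < 1:          # misclassified or inside the margin
            dw = w - C * y[i] * X[i]
        else:
            dw = w
        w = w - eta * dw
    return w

# toy two-class data (made up), labels in {+1, -1}
X = np.array([[2.0, 2.0], [2.5, 1.5], [-2.0, -1.5], [-1.5, -2.5]])
y = np.array([1, 1, -1, -1])
w = svc_sgd(X, y)
Xb = np.append(X, np.ones([len(y), 1]), axis=1)
print(np.sign(Xb @ w))       # predicted classes, SVC(x) = sgn(w.x)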

4.4.3. Linear SVM Classification with Python Sklearn (Lagrange based)

Python's SVC class is based on the libsvm implementation, which uses a Lagrange based solution.
Thus, libsvm gives an accurate derivative-based solution for SVC at the cost of long fit time
and poor scalability. The SVC class is also designed for non-linear variants of the SVC via the
kernel option (its default kernel is 'rbf', so kernel='linear' must be passed explicitly for the
linear case). The LinearSVC class is based on another library (liblinear) that is optimized for
the linear kernel and scales better than SVC. LinearSVC also offers flexibility in the choice of
penalties and loss functions.

LinearSVC has the parameter dual, where dual=True solves the Lagrange dual problem. One can
instead solve with respect to the primal variables only (excluding the dual variables) by setting
dual=False. Sometimes ignoring the dual variables and considering only the primal variables
(w, w0) also produces a good approximate solution.

Example (usage of SVC : e.g. calculate the training accuracy on iris dataset)
import numpy as np
from sklearn import svm, datasets
iris = datasets.load_iris()
X = iris.data
y = iris.target
model = svm.SVC(kernel = 'linear', C = 1)
model.fit(X, y)
fx = model.predict(X)                #prediction of the training samples
score = np.sum(y==fx)/len(y)         #compute training accuracy score manually

Example (model.score is accuracy score by default for classifiers)


from sklearn.svm import SVC
...
model = SVC(kernel = 'linear',C = 1)
model.fit(X,y)
score = model.score(X, y) #compute accuracy score on training samples.

Example (alternatively use sklearn.metrics for many classification metrics)


from sklearn.metrics import accuracy_score, precision_score
...
model = SVC(kernel = 'linear', C = 1)
model.fit(X, y)
fx = model.predict(X)                                    #prediction of the training samples
score1 = accuracy_score(y, fx)                           #compute accuracy score on training samples
score2 = precision_score(y, fx, average='macro')         #precision needs an averaging mode for multi-class data

Example (usage of LinearSVC : e.g. calculate the training accuracy on iris dataset)
from sklearn.svm import LinearSVC
...
model = LinearSVC (C = 1) #default loss =“squared_hinge”, penalty=”l2”.
model.fit(X,y)
score = model.score(X, y) #compute training accuracy score by model.score fun.

Example (LinearSVC has loss and penalty parameters)


from sklearn.svm import LinearSVC
...
model = LinearSVC (loss="hinge", tol=1e-5)
model.fit(X,y)
score = model.score(X, y) #compute training accuracy score by model.score fun.

Example (allow lagrange dual solution)


from sklearn.svm import LinearSVC
...
model = LinearSVC (loss="hinge", tol=1e-5, dual=True)
model.fit(X,y)
score = model.score(X, y) #compute training accuracy score by model.score fun.

References
[1] Greg Grudig, Support Vector Machine(SVM) Classification , Slides

4.4.4. Multi-class variants of the linear classifiers and Python Sklearn examples

- The linear models are originally defined for classification on two-class (+1, −1) problems.
  However, they can be extended to multi-class variants using two approaches:

1) One versus rest classification (ovr)

Create as many models as the number of classes (n), where each model fc(x) separates class c (+)
from all the others (−). A sample is correctly classified by model fc when it falls on the
positive side.

final classifier = f(x) = argmax_{c=1..n} fc(x)

The final classification of a point is the class whose decision surface is furthest from the
point on the positive side. As a sample gets closer to a decision surface, the agent becomes less
certain (the decision is critical); the larger the distance from the decision surface, the more
stable the decision.

2) One versus one classification (ovo)

Create n(n−1)/2 models, where n is the number of classes, such that model fij(x) separates
class i (+) from class j (−).

For example, with three classes there are the models f12, f13, f23; if they vote [1, 1, 2] for a
sample, then f(x) = vote([1, 1, 2]) = 1.

classifier(x) = f(x) = majority_vote([f12(x), f13(x), ..., f(n−1)n(x)])

Each model fij votes for class i or class j. The final classification of the point is the class
that receives the most votes.
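Example (illustrative sketch, not from the original text): scikit-learn also exposes these two strategies as explicit wrappers in sklearn.multiclass; the snippet below applies both to a linear SVC on iris.

from sklearn import svm, datasets
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier

iris = datasets.load_iris()
X, y = iris.data, iris.target

ovr = OneVsRestClassifier(svm.SVC(kernel='linear', C=1))   # n binary models, one per class
ovo = OneVsOneClassifier(svm.SVC(kernel='linear', C=1))    # n(n-1)/2 pairwise models
ovr.fit(X, y)
ovo.fit(X, y)
print("ovr training accuracy:", ovr.score(X, y))
print("ovo training accuracy:", ovo.score(X, y))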

Example: performance of SVC linear classifier on iris data with multi-class model = ovo
from sklearn import svm, datasets
from sklearn.model_selection import cross_val_score
iris = datasets.load_iris()
X = iris.data
y = iris.target      #multi-class target

model = svm.SVC(kernel = "linear", decision_function_shape = "ovo")  #multi-class is handled one-vs-one internally

scores = cross_val_score(model, X, y, cv = 5)
print(scores)
scores → [0.92, 0.80, 0.65, 0.70, 1.0]
#scores are accuracy scores due to the model's default scoring
#another metric can be chosen, e.g. cross_val_score(model, X, y, cv = 5, scoring = 'f1_macro')

Example : regarding the above example, let's print the mean accuracy and deviation
>>> print("accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std()))
output → accuracy: 0.85 (+/- 0.08)

Example : preprocessing dataset, standard normalization of dataset


from sklearn import svm, datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
X = iris.data
y = iris.target

#Create the validation set


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

#Preprocessing the data set


scaler = StandardScaler().fit(X_train)
X_train_normal = scaler.transform(X_train) #Normalization of training set

model = svm.SVC(kernel = "linear")


model.fit(X_train_normal, y_train)

# Apply the same normalization for Test set


X_test_normal = scaler.transform(X_test)
score=model.score(X_test_normal, y_test)
print(score)

Example : confusion matrix and classification report analysis using SVC on iris data
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.svm import SVC
from sklearn import datasets
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X = iris.data
y = iris.target

#Create the validation set


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)

#select the model and train


model = SVC(kernel = "linear")
model.fit(X_train, y_train)

#predict the test samples


fx_test= model.predict(X_test)

#print classification report and confusion_matrix


acc_score = accuracy_score(y_test, fx_test)

print("\n", "Accuracy of SVM model: {}".format(acc_score,'.2f'))


print("\n", "Confusion Matrix: \n", confusion_matrix(y_test, fx_test))
print("\n", "Classification Report: \n", classification_report(y_test, fx_test))

output →

Accuracy of SVM model: 0.98

The Confusion Matrix:


#predicted:   setosa   versicolor   virginica        (rows = actual class)
[[14  0  0]       # setosa
 [ 0 15  0]       # versicolor
 [ 0  1 20]]      # virginica

The Classification Report:


precision recall f1-score support

setosa 1.00 1.00 1.00 14


versicolor 0.94 1.00 0.97 15
virginica 1.00 0.95 0.98 21

micro avg 0.98 0.98 0.98 50


macro avg 0.98 0.98 0.98 50
weighted avg 0.98 0.98 0.98 50

classification_report in Python shows the main classification metrics.


Remember:
Precision: among the samples classified as positive, what percent was actually positive?
Recall: among the actual positive samples, what percent was classified correctly?
f1-score: harmonic mean of precision and recall
averaging methods: there are various averaging methods (micro, macro, weighted), which are out of scope for now.

4.4.5. Linear Support Vector Machine Regression (Ꜫ -SVR)


The original SVR model is defined for linear regression; later, the basic SVR was extended to
non-linear cases. Support Vector Regression (ε-SVR) uses the same principles as the SVM for
classification: maximum margin with minimal loss is still the objective. However, the loss is
tolerated up to some degree (ε-loss); a loss |yi − f(xi)| smaller than epsilon is treated as
zero. ε-SVR searches for the best direction of the ε-tube that contains the maximum number of
samples inside the tube with minimal total loss.

The solution of the optimization problem under inequality constraints is found using the
Lagrange equation and its derivatives, as we did for the SVC Lagrange solution. Another solution
is the gradient based approach, which scales better to large datasets; its Python implementation
is given in later sections.
The main parameter of ε-SVR is ε, the tolerance to the loss (error), which can be given or
computed on a per-problem basis. For example, consider the problem of predicting the ages of
people from their face images; the problem could tolerate errors of up to five years, which
becomes the epsilon.
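Example (illustrative sketch, not from the original text): a minimal NumPy version of the ε-insensitive loss described above; the function name is made up for this sketch.

import numpy as np

def epsilon_insensitive_loss(y, fx, eps=5.0):
    # zero loss inside the epsilon tube, linear loss outside it
    return np.maximum(0.0, np.abs(y - fx) - eps)

# e.g. with a 5-year tolerance for age prediction: errors of 3 and 8 years give losses 0 and 3
print(epsilon_insensitive_loss(np.array([30.0, 40.0]), np.array([33.0, 48.0]), eps=5.0))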

4.4.5.1. Linear Ꜫ-SVR with Python Sklearn (Lagrange based)

Python's SVR class is based on the libsvm implementation, which uses a Lagrange based solution.
Thus, libsvm gives an accurate derivative-based solution for SVR at the cost of long fit time and
poor scalability. The SVR class is also designed for non-linear variants of the SVR via the kernel
option (its default kernel is 'rbf', so kernel='linear' must be passed explicitly for the linear
case). The LinearSVR class is based on another library (liblinear) that is optimized for the
linear kernel and scales better than SVR. LinearSVR also offers flexibility in the choice of
penalties and loss functions.

Example : SVR usage with default score (R2) and MSE for training samples

from sklearn.svm import SVR


from sklearn.metrics import mean_squared_error
from sklearn import datasets

#The diabetes dataset consists of 10 physiological variables (age, sex, weight, blood
#pressure) measured on 442 patients, and an indication of disease progression after one year
diabetes = datasets.load_diabetes()

X = diabetes.data
y = diabetes.target #disease progression

#Create the model and train it with epsilon


model = SVR(kernel="linear", epsilon=0.1, C=1.0)
model.fit(X,y) # train process

fx = model.predict(X)

score = model.score(X,y) #default metric is R2


print("R-squared:", score)
print("MSE:", mean_squared_error(y, fx))

Example : LinearSVR usage with default score (R2) and MSE for training samples

from sklearn.svm import LinearSVR


from sklearn.metrics import mean_squared_error
from sklearn import datasets

diabetes = datasets.load_diabetes()

X = diabetes.data
y = diabetes.target #disease progression

#Create the model and train it with epsilon


model = LinearSVR (epsilon=0.1, C=1.0, dual=True) # yes to lagrange dual solution
model.fit(X,y) # train process

fx = model.predict(X)

score = model.score(X,y) #default metric is R2


print("R-squared:", score)
print("MSE:", mean_squared_error(y, fx))

4.5. SGD based linear learners with Python sklearn (Summary):

- Remember: the SGDClassifier and SGDRegressor classes are designed in terms of loss and
  penalty functions, both of which depend on the parameter w. The loss function models the
  error of one sample; the penalty function models the cost of model complexity. The objective
  takes a loss + alpha*penalty form.
- The objective function of SVC and Ꜫ-SVR can be converted to this loss + alpha*penalty form:

objfun(w) = (1/2)·‖w‖² + C·Σ_{i=1}^{n} loss(yi, f(xi))

objfun(w) = (1/n)·Σ_{i=1}^{n} loss(yi, f(xi)) + alpha·penalty(w)

(the first term is the loss term, the second is the regulator = penalty term)

- The objective functions of SVC and Ꜫ-SVR share the same skeleton; what differ are the loss,
  penalty and constraint functions.

- SGDClassifier Summary

  loss function                                      penalty function (regulator)    Algorithm

  hinge: max(0, 1 − yi·f(xi))                        alpha·‖w‖²                      SVC (l2 form)
  hinge: max(0, 1 − yi·f(xi))                        alpha·|w|                       SVC (l1 form)
  squared hinge: max(0, 1 − yi·f(xi))²               ...                             SVC forms
  perceptron: max(0, −yi·f(xi))                      none                            Perceptron
  log: log(1 + exp(−yi·f(xi)))                       ...                             Logistic Regression

- SGDRegressor Summary

  loss function                                          penalty function (regulator)            Algorithm

  squared: (yi − f(xi))²                                 none                                    Least Squares Reg.
  squared: (yi − f(xi))²                                 alpha·|w|                               Lasso Reg.
  squared: (yi − f(xi))²                                 alpha·‖w‖²                              Ridge Reg.
  squared: (yi − f(xi))²                                 alpha·|w| + (1 − alpha)·‖w‖²            ElasticNet Reg.
  epsilon-insensitive: max(0, |yi − f(xi)| − ε)          alpha·‖w‖²                              ε-SVR
  squared epsilon-insensitive: max(0, |yi − f(xi)| − ε)²  ...                                    ε-SVR
4.5.1. Loss functions for linear classifiers: summary
- One can see that the loss functions of linear classifiers are defined in terms of yi·f(xi),
  whereas the loss functions of linear regressors are defined in terms of |yi − f(xi)|.

- Approaching the decision surface (f=0) from the positive side (i.e., the classification is
  still correct), log loss gives an increasing penalty to the model, while hinge and squared
  hinge penalize only inside the margin band yi·f(xi) ∈ [0, 1], i.e. [−1, 1] in feature space.
  Squared hinge penalizes more strongly than hinge there.
- Moving beyond the decision surface (f=0) to the negative side (i.e., the classification is
  wrong), all losses penalize the model. Zero-one loss gives a constant penalty (1) independent
  of the distance to the decision surface. The others give penalties that grow with the distance
  to the margin: linearly for perceptron and hinge, quadratically for squared hinge, and
  approximately linearly (after a smooth transition) for log loss.
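Example (illustrative sketch, not from the original text): a small NumPy comparison of these classification losses as a function of the margin m = yi·f(xi).

import numpy as np

def zero_one(m):    return (m <= 0).astype(float)
def perceptron(m):  return np.maximum(0.0, -m)
def hinge(m):       return np.maximum(0.0, 1.0 - m)
def sq_hinge(m):    return np.maximum(0.0, 1.0 - m) ** 2
def log_loss(m):    return np.log(1.0 + np.exp(-m))

m = np.array([-2.0, -1.0, 0.0, 0.5, 1.0, 2.0])   # margins yi*f(xi)
for name, fn in [("0-1", zero_one), ("perceptron", perceptron),
                 ("hinge", hinge), ("sq. hinge", sq_hinge), ("log", log_loss)]:
    print(name, np.round(fn(m), 2))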

4.5.2. SGD based linear learners with Python sklearn examples

Example : Ridge Regression on Diabet using SGDRegressor

from sklearn.linear_model import SGDRegressor


from sklearn.metrics import mean_squared_error
from sklearn import datasets

#The diabetes dataset consists of 10 physiological variables (age, sex, weight, blood
#pressure) measured on 442 patients, and an indication of disease progression after one year
diabetes = datasets.load_diabetes()

X = diabetes.data
y = diabetes.target #disease progression

#Create the model and train it with epsilon


model = SGDRegressor(loss="squared_loss", penalty="l2", alpha=0.001)
model.fit(X,y) # train process

fx = model.predict(X)

score = model.score(X,y) #default metric is R2


print("R-squared:", score)
print("MSE:", mean_squared_error(y, fx))

Example : Ꜫ-SVR Regression on Diabet using SGDRegressor

from sklearn.linear_model import SGDRegressor


from sklearn import datasets

diabetes = datasets.load_diabetes()

X = diabetes.data
y = diabetes.target #disease progression

#Create the model and train it with epsilon


model = SGDRegressor(loss="epsilon_insensitive", penalty="l2", epsilon=0.01)
model.fit(X,y) # train process

fx = model.predict(X)

score = model.score(X,y) #default metric is R2


print("R-squared:", score)

Example : SVC Classification on Iris using SGDClassifier

from sklearn.linear_model import SGDClassifier


from sklearn import datasets

iris = datasets.load_iris()

X = iris.data
y = iris.target #plant type

#Create the model and train it


model = SGDClassifier (loss="hinge", penalty="l2", alpha=0.01)

model.fit(X,y) # train process

fx = model.predict(X)

score = model.score(X,y) #default metric is accuracy ratio


print("Accuracy:", score)

Example : LogisticRegression Classification on Iris using elasticnet penalty

from sklearn.linear_model import SGDClassifier


from sklearn import datasets

iris = datasets.load_iris()

X = iris.data
y = iris.target #plant type

#Create the model and train it


model = SGDRegressor(loss="log", penalty="elasticnet", alpha=0.01) #default l1_ratio=0.5
model.fit(X,y) # train process

fx = model.predict(X)

score = model.score(X,y) #default metric is accuracy ratio


print("Accuracy:", score)

Example : LogisticRegression Classification on Iris using various solver (here liblinear)

from sklearn.linear_model import LogisticRegression


from sklearn import datasets

iris = datasets.load_iris()

X = iris.data
y = iris.target #plant type

#Create the model and train it


# liblinear uses lagrange based solution
model = LogisticRegression (solver="liblinear", C=1.0, multi_class="ovr")

model.fit(X,y) # train process


fx = model.predict(X)

score = model.score(X,y) #default metric is accuracy ratio


print("Accuracy:", score)

4.6. Logistic Regression

- Logistic regression is actually a classification algorithm. It is formulated on binary


regression ([0 1]) in a transformed space. However, it can be extended to multiclass
generalization. Logistic regression is considered a generalized linear model because the
output still depends on the linear hyperplane f(x)=w.x. That is, we are looking for w that
minimizes total log loss and penalty

- Previous linear classifiers (perceptron, SVC) use the sign activation function without a final
  step function. Logistic regression instead uses the sigmoid function as activation and a step
  function to finalize the classification output, as shown in Fig[1]:

x → f(x) = w·x → σ(f(x)) → ɸ(σ(f(x)))      (f: linear score, σ: sigmoid, ɸ: step)

- The sigmoid activation function σ(f(x)) transforms the f(x) values into [0, 1], where
  σ(f(x)) ≥ 0.5 means x is assigned to class 1, otherwise class 0.

classifier(x) = ɸ(σ) = { 1, if σ(f(x)) ≥ 0.5 ;  0, otherwise }

- σ(f(x)) defines the likelihood of the positive class when the input x is given in the original
  feature space:

P(y = 1 | x) = σ(f(x)) = 1 / (1 + e^(−f(x)))
P(y = 0 | x) = 1 − σ(f(x)) = 1 / (1 + e^(+f(x)))
[Figure: the sigmoid output transform maps f(x) ∈ (−∞, ∞) to (0, 1); samples with output above 0.5 are labeled y = 1, those below are labeled y = 0.]

- Dual Problem Formulation

objfun(w) = (1/n)·Σ_{i=1}^{n} log(1 + e^(−yi·f(xi))) + alpha·‖w‖²

(the first term is the log loss, the second is the penalty)

Let's add a variable zi = yi·f(xi):

min objfun(w, z) = (1/n)·Σ_{i=1}^{n} log(1 + e^(−zi)) + alpha·‖w‖²
constraint: zi = yi·f(xi)

The solution can be acquired by deriving the Lagrange dual formulation.
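Example (illustrative sketch, not from the original text): a minimal NumPy version of the sigmoid decision rule and the per-sample log loss used above; function names are made up for this sketch.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_predict(X, w, w0):
    # class 1 when sigma(f(x)) >= 0.5, which is equivalent to f(x) >= 0
    return (sigmoid(X.dot(w) + w0) >= 0.5).astype(int)

def log_loss_term(y_pm1, fx):
    # y_pm1 in {-1, +1}; per-sample loss log(1 + exp(-y*f(x)))
    return np.log(1.0 + np.exp(-y_pm1 * fx))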

5. Non-Linear generalization of linear learners


It is non-linearly mapping the feature space into a new space (possibly higher-dimension),
where we expect that the linear learners can perform well. Some non-linear problems using
linear learns can be solved with the cost of increasing dimension or complexity.

Fig. Ref. [2]



Fig. Ref. [3]

5.1. Non-Linear variants of SVC and SVR

- Both the decision boundary f(x) and the dual optimization formula L are defined in terms of the
  input product (xi·x). For example, for the SVC dual problem remember the following:

w = Σ_{i=1}^{n} αi·yi·xi

SVC(x) = sgn(w·x + w0) = sgn( Σ_{i=1}^{n} αi·yi·(xi·x) + w0 )

- Let's do the kernel trick by replacing the input product (xi·x) with K(xi, x), where K is the
  kernel function:

SVC(x) = sgn( Σ_{i=1}^{n} αi·yi·K(xi, x) + w0 )

- The original linear SVC and SVR use the linear kernel: K(xi, xj) = xi·xj. Some popular kernels
  are the Mercer kernels listed below. Mercer kernels are symmetric, K(xi, xj) = K(xj, xi), and
  positive on the diagonal, K(xi, xi) ≥ 0.

 Linear Kernel
K(xi, xj) = xi·xj

 Polynomial Kernel (co, gamma, degree)
K(xi, xj) = (co + gamma·(xi·xj))^degree

 Sigmoid Kernel (co, gamma)
K(xi, xj) = tanh(co + gamma·(xi·xj))

 Radial Basis (RBF) or Gaussian Kernel (gamma, where gamma plays the role of 1/(2σ²))
K(xi, xj) = exp(−gamma·‖xi − xj‖²)
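Example (illustrative sketch, not from the original text): a small NumPy version of these kernel functions for two sample vectors; parameter names mirror the list above.

import numpy as np

def linear_kernel(xi, xj):
    return xi.dot(xj)

def poly_kernel(xi, xj, co=1.0, gamma=1.0, degree=3):
    return (co + gamma * xi.dot(xj)) ** degree

def sigmoid_kernel(xi, xj, co=0.0, gamma=1.0):
    return np.tanh(co + gamma * xi.dot(xj))

def rbf_kernel(xi, xj, gamma=0.5):
    # gamma plays the role of 1/(2*sigma**2)
    return np.exp(-gamma * np.sum((xi - xj) ** 2))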

5.2. Non-Linear SVR and SVC examples with Python sklearn

Example: usage of SVR with various kernels, Ref[4]


###########################################################################
# Generate sample data
import numpy as np

X = np.sort(5 * np.random.rand(40, 1), axis=0)


y = np.sin(X).ravel()

###########################################################################
# Add noise to targets
y[::5] += 3 * (0.5 - np.random.rand(8))

###########################################################################
# Fit regression model
from sklearn.svm import SVR

svr_rbf = SVR(kernel='rbf', C=1e4, gamma=0.1)


svr_lin = SVR(kernel='linear', C=1e4)
svr_poly = SVR(kernel='poly', C=1e4, degree=2)
y_rbf = svr_rbf.fit(X, y).predict(X)
y_lin = svr_lin.fit(X, y).predict(X)
y_poly = svr_poly.fit(X, y).predict(X)

###########################################################################
# look at the results
import matplotlib.pyplot as plt
plt.scatter(X, y, c='k', label='data')
plt.plot(X, y_rbf, c='g', label='RBF model')
plt.plot(X, y_lin, c='r', label='Linear model')
plt.plot(X, y_poly, c='b', label='Polynomial model')
plt.xlabel('data')
plt.ylabel('target')
plt.title('Support Vector Regression')
plt.legend()
plt.show()

Example: nonlinear SVC classification on Iris using polynomial kernel


from sklearn import svm, datasets
iris = datasets.load_iris()
X = iris.data
y = iris.target
model = svm.SVC(kernel = 'poly',degree = 3, coef0 = 0) # coef0 = co
model.fit(X,y)
fx = model.predict(X) #prediction of the training samples
score = model.score(X, y) #compute accuracy score on traning samples.

Example: nonlinear SVC classification on Iris using sigmoid kernel


from sklearn import svm, datasets
iris = datasets.load_iris()
X = iris.data
y = iris.target
model = svm.SVC(kernel = 'sigmoid', coef0 = 0.01) # coef0 = co
model.fit(X,y)
fx = model.predict(X) #prediction of the training samples
score = model.score(X, y) #compute accuracy score on traning samples.

5.3. Kernel Perceptron Algorithm


- The kernel perceptron algorithm is the "kernelized" version of the perceptron algorithm:
  substitute each inner product (xi·x) with the kernel function K(xi, x). This is equivalent to
  solving the problem in some transformed space (possibly of higher dimension) where the inner
  product is defined in terms of the respective kernel function.
- Remember: for each training example xᵢ with actual label yᵢ ∈ {-1, 1}:
  f(xᵢ) = sgn(w·xᵢ).
  If f(xᵢ) ≠ yᵢ, update w ← w + yᵢ·xᵢ.
- To derive a kernelized version of the perceptron algorithm, its dual form must be set up.
  The weight vector w can be expressed as a linear combination of the n training samples:

w = Σ_{i=1}^{n} αi·yi·xi
f(x) = sgn(w·x) = sgn( Σ_{i=1}^{n} αi·yi·(xi·x) )

  where αi is the number of times xi was misclassified, forcing an update w ← w + yi·xi.
- Now replace (xi·x) with the kernel function K(xi, x) and, at each iteration, instead of
  updating the weight vector w, update the "mistake counter" vector α:

Initialize αi = 0 for i = 1..n

for each iteration until maxiteration:
    for each training example (xj, yj):
        compute f(xj) = sgn( Σ_{i=1}^{n} αi·yi·K(xi, xj) )
        if f(xj) ≠ yj:  αj ← αj + 1

- Once the αi are found, w and f(x) can be computed using the formulas above.
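Example (illustrative sketch, not from the original text): a brute-force NumPy implementation of the kernel perceptron above, with an RBF kernel as one possible choice; names are made up for this sketch.

import numpy as np

def rbf(xi, xj, gamma=0.5):
    return np.exp(-gamma * np.sum((xi - xj) ** 2))

def kernel_perceptron_fit(X, y, kernel=rbf, max_iter=100):
    # X: (n, d) samples, y: (n,) labels in {-1, +1}; alpha[j] counts the mistakes on sample j
    n = len(y)
    K = np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])
    alpha = np.zeros(n)
    for _ in range(max_iter):
        errors = 0
        for j in range(n):
            fxj = 1.0 if np.sum(alpha * y * K[:, j]) >= 0 else -1.0
            if fxj != y[j]:
                alpha[j] += 1
                errors += 1
        if errors == 0:              # every training sample classified correctly
            break
    return alpha

def kernel_perceptron_predict(X, y, alpha, x, kernel=rbf):
    s = sum(alpha[i] * y[i] * kernel(X[i], x) for i in range(len(y)))
    return 1 if s >= 0 else -1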

References
[1] R.P. Jaia Priyankka, Dr. S. Arivalagan, Dr. P. Sudhakar. Deep Convolution Neural Network With Logistic
Regression Based Image Retrieval And Classification Model For Recommendation System. INTERNATIONAL
JOURNAL OF SCIENTIFIC & TECHNOLOGY RESEARCH VOLUME 9, ISSUE 01, JANUARY 2020
[2] Greg Grudig, Support Vector Machine(SVM) Classification , Slides
[3] Satar Mahdevari et all, A support vector regression model for predicting tunnel boring machine
penetration rates, International Journal of Rock Mechanics and Mining Sciences
[4] https://ogrisel.github.io/scikit-learn.org/sklearn-
tutorial/auto_examples/svm/plot_svm_regression.html

6. K-Nearest Neighbor Learning

6.1. K-Nearest Neighbor Classification(KNN)


KNN can be considered a non-linear, (weakly) generative and non-parametric model. KNN
computes the nearest K samples around each point in the feature space and assigns the given
sample to the majority-voted class as the most probable class. The parameter K is an input
parameter and is not affected by the learning stage. KNN has some information about the data
distribution around each point; from this perspective, KNN could be considered a generative
model in that it can crudely generate the data distribution of each class's samples. Finding
the K nearest neighbours is also used as a method to compute a density estimate p(x | ci) for
Bayes based classifiers, which are explained in later chapters.

Basic Steps:

1- given a sample x in the feature space, find the nearest K points around x

2- take the majority vote of those neighbours

Training : training means preparing to find the K nearest points of any query point. Normally
the complexity of a naive search is O(n²), but it can be reduced with algorithms like BallTree,
KDTree, etc., which use tree based data structures to find and store the nearest points.

[Figure: a query point x in the feature space with its K=6 nearest neighbours; x = majority vote of the decisions of the nearest K examples (assigned to the red class in this example).]

similarity metrics: To find the nearest K points, distance or similarity metrics


(Euclidean, Cos, Mahalanobis, Manhattan etc.) are used.
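Example (illustrative sketch, not from the original text): a brute-force NumPy version of the KNN decision rule above, using Euclidean distance and a majority vote.

import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, x, k=6):
    dists = np.sqrt(np.sum((X_train - x) ** 2, axis=1))   # Euclidean distance to every training sample
    nearest = np.argsort(dists)[:k]                       # indices of the K nearest neighbours
    votes = Counter(y_train[nearest])                     # count the class labels among them
    return votes.most_common(1)[0][0]                     # majority-voted class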

KNN classifier with Python Sklearn


KNeighborsClassifier class can be used with the following basic parameters:

 n_neighbors: number of nearest neighbours (K) to consider

 algorithm: the algorithm that finds the nearest K neighbours of each point during the KNN fit
   process. BallTree ('ball_tree') and KDTree ('kd_tree') are two well known algorithms that find
   the nearest neighbours in an unsupervised manner (they do not use the output labels at this
   stage). The default algorithm = 'auto' means the algorithm is selected automatically after
   analysing the data.
 metric: the distance metric used to calculate the distance between points; it is used to find
   the nearest K neighbours
 predict: finds the majority vote of the K neighbours for the given sample
 additional parameters can be used depending on the selected params.

Example: Classification with the iris dataset, KNN (K=3)

from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
X, y = load_iris(return_X_y=True)
model = KNeighborsClassifier(n_neighbors=3, algorithm='ball_tree')
model.fit(X, y)          #build the ball-tree structure over the training samples
fx = model.predict(X)    #predict the outputs of the training samples
print(accuracy_score(y, fx))

6.2. KNN Regression


Similarly, the K nearest neighbours are found during the training (fit) process with the
selected algorithm (e.g. kd_tree or ball_tree, given as a parameter), and the final prediction
is the average of the K neighbours' values. Thus, instead of a majority vote, the average value
is used for regression.

[Figure: a query point x with its K=5 nearest training examples in the feature space; f(x) = average value of the K neighbours.]



KNN Regression with Python Sklearn


KNeighborsRegressor class can be used with the following basic parameters:

 n_neighbors: Number of nearest neighbor (K)


 algorithm: The algorithm that finds the nearest K neighbor of each point
during KNN fit process. Similarly, possible algorithms {‘auto’, ‘ball_tree’,
‘kd_tree’, ‘brute’} with default=’auto’
 metric: Distance metric which is used to calculate the distance between
points. This metric is used to find nearest K neighbors.
 predict: finds the average of the K neighbours' values for the given sample.
 additional parameters can be used depending on the selected params.

Example: Regression with KNN Regressor

from sklearn.neighbors import KNeighborsRegressor

X, Y = load_xxx                   #placeholder: load any regression dataset here
model = KNeighborsRegressor(n_neighbors=3, algorithm='auto')   #'auto' is the default
model.fit(X, Y)
print("R2 score: %f" % model.score(X, Y))   #R2 score on the training set (score calls predict internally)

7. Decision Tree Learning


A decision tree can be considered a non-linear, discriminative and non-parametric model. The
decision tree algorithm recursively splits the training set into disjoint subsets. Each split
occurs on a feature (or attribute), and the best feature to split on is found using various
methods (gini, mse, etc.). As a result of each split, a new branch is added to the current node
of the decision tree. This splitting continues until a subset reaches a certain purity (e.g. all
samples have the same class), at which point the subset becomes a leaf of the tree; the leaves
are assigned an output by majority vote (classification) or average value (regression). The
skeleton of decision tree learning is almost the same for classification and regression; the
splitting criteria and the way the final outputs are formed are the basic differences.

7.1. Decision Tree Classification


Different algorithms are used to determine the "best" split at a node. ID3, C4.5, C5 and CART
(classification and regression tree) are some decision tree based classification algorithms. ID3
and its extension C4.5 are classifiers, whereas CART and C5 (an extension of C4.5) can handle
both classification and regression tasks. ID3 assumes categoric features and splits until the
leaves are pure with respect to the output class. The basic issue of ID3 is the tree size: a big
tree easily overfits the data, so a small tree is usually preferred in ID3. Generally speaking,
finding the correctly sized tree is a research issue for tree based learners. CART uses a binary
tree (it always splits into left and right) and works on both categoric and numeric features.
C4.5 is the extended version of ID3 that can handle impure leaves, numeric/categoric features and
splitting into many branches. The algorithms above basically differ in their splitting criteria
and the assumed structure of the tree.

Example: let the training set contain only categoric features, age and weight.

Two solutions are shown below, where the smaller tree is better. A decision tree classification
can be represented as a logical function, which involves a set of binary rules (a disjunction of
conjunctions) to predict the output.

Solution-1 Solution-2

fhappy = (age=old) AND (weight=normal) OR fhappy = ( weight=normal)


(age=young) AND (weight=normal)

Example: let the training set contain categoric and numeric features, where [TI, PE] are the
features and Response is the target variable (class).

Problems:
i-   which feature to split on?                  (i and ii together determine the best split)
ii-  which value to split on?
iii- what is the meaning of a leaf (when is splitting no longer needed)?
iv-  what should the size of the tree be?

These questions are addressed by the decision tree algorithms themselves.

7.1.1. Inductive Decision Tree Learning (ID3)


Find a tree that is fully consistent with the training samples (splitting continues until all
leaves are pure). Features must be of categoric type. ID3 recursively chooses the "most
significant" feature as the root of each (sub)tree, following a greedy search through the space
of possible branches.

ID3(S, features):
    if Entropy(S) = 0 then Add_Leaf_to_Tree(Label(S)) and return
    if features = ∅ then Add_Leaf_to_Tree(Majority_Vote(S)) and return
    Best = argmax_{x ∈ features} InformationGain(S, x)
    Add_Node_to_Tree(Best)
    features = features − {Best}
    for each value of Best:
        Add_Branch_to_Tree(value)
        ID3(S_value, features)

Choosing the Best Attribute (splitting criteria): use Information Gain with Entropy

Entropy measures the average amount of information or impurity of the samples in a set S with
respect to the output classes. Let S = {Np+, Nn−} contain Np positive and Nn negative samples
respectively, with N = Np + Nn. Then Entropy(S) is defined as

Entropy(S) = −(Np/N)·log(Np/N) − (Nn/N)·log(Nn/N) = −Σi Pi·log Pi

where Pi is the probability of the i-th class.
Example: the entropy of a set S where only two classes are considered. Entropy is zero if all
samples in S belong to the same class.

Information Gain(S, x) measures the expected reduction in entropy due to splitting on feature x.
Let S_{x=v} be the subset obtained by splitting on x = v (feature = value); that is, S_{x=v}
selects the samples of S whose feature x has value v.

E(S)    = total entropy of the parent set S
E(S, x) = average entropy of the child sets S_{x=v1}, S_{x=v2}, ..., S_{x=vn}

Gain(S, x) = E(S) − E(S, x)

E(S, x) = Σ_{v: values of x} (|S_{x=v}| / |S|) · Entropy(S_{x=v})

Higher Gain (S, x) is better. Higher gain means lower entropy of an average child
(meaning child sets are getting more pure).
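Example (illustrative sketch, not from the original text): a small NumPy computation of Entropy(S) and Gain(S, x) for categoric data; function names are made up for this sketch.

import numpy as np

def entropy(labels):
    # labels: array of class labels of the samples in S
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(feature_values, labels):
    # feature_values: the value of feature x for every sample in S
    total = entropy(labels)
    avg_child = 0.0
    for v in np.unique(feature_values):
        child = labels[feature_values == v]
        avg_child += (len(child) / len(labels)) * entropy(child)
    return total - avg_child        # Gain(S, x) = E(S) - E(S, x)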

Example : a person's decisions on playing tennis are recorded for 14 days under various weather
conditions and given in the table below. Train the agent using inductive decision tree learning
and provide the final decision tree.

Basic steps of the solution:

1- Find the best feature of S, i.e. the one with maximum information gain after splitting, and
   add it as a node to the tree.

best = argmax_{x ∈ features} Gain(S, x)

2- for each value of the best feature,


add a new branch to the tree,
replace S with Svalue and go to step-1.

What is the next here ? Addressed by the next recursive call

The final decision tree is



7.1.2. C4.5 Algorithm


C4.5 is the extended version of ID3. It can handle leaf subsets that are not necessarily pure
and also handles numeric (continuous) features by converting them to categoric values.
Converting numeric values to categoric ones is an important issue and is solved on the basis of
information theory. Choosing the best attribute (the splitting criterion) is based on the Gain
Ratio rather than the Information Gain of ID3. In contrast to CART, which always grows a binary
tree, C4.5 can grow an arbitrary tree by splitting into many branches. It handles the basic
weakness of ID3, the overfitting problem, by pruning the tree. There are two ways of pruning:
prepruning and postpruning.

In prepruning, splitting is stopped when a further split is no longer statistically significant
or does not contribute to classification ability. The aim is to stop early, before perfect
classification; however, it is hard to estimate when to stop splitting.
Postpruning lets the tree grow fully and later prunes it gradually until further pruning no
longer contributes to the classification. This approach is more practical.
Reduced error pruning and rule post-pruning are two methods of postpruning.

Reduced error pruning


• Examine each decision node to see if pruning decreases the tree’s
performance over the evaluation data.
• “Pruning” here means replacing a subtree with a leaf with the most common
classification in the subtree.

Rule post-pruning
One of the most popular pruning methods (used e.g. in C4.5). First a full decision tree is built
and represented as a set of if/then rules. Each rule is then pruned by removing any precondition
whose removal improves accuracy. Finally, the pruned rules are sorted by accuracy and used in
that order.

Incorporating continuous-valued attributes (ref[1])

Choosing the best attribute (splitting criteria) ref[1]

Which is better: splitting a continuous attribute into two branches or into many? Use the Gain
Ratio, which considers the Information Gain together with how the values are split
(SplitInformation):

GainRatio(S, A) = Gain(S, A) / SplitInformation(S, A)

Information gain uses the concept of entropy of sets and does not consider the entropy of an
attribute itself (how much information the attribute carries).

Entropy of the attribute A (experimentally determined from the training samples):

SplitInformation(S, A) = − Σ_{i=1}^{c} (|Si| / |S|) · log2(|Si| / |S|)

where S1, ..., Sc are the subsets produced by splitting S on the c values of A.

Unknown attribute values(ref[1])


C4.5 can also handle missing values. The basic approach is to fill a missing value with the most
probable value. A more sophisticated method can fill a missing value based on the probability
distribution of all values of the given attribute.
When missing values exist in the dataset, the entropy based computations need special handling:
the Gain can be multiplied by a factor F, the ratio of the number of known values to the total
number. To compute the GainRatio, missing values should be considered as a separate group.

7.1.3. CART (Classification and Regression Trees)


CART grows a binary tree. It splits the data into left and right parts using splitting criteria
such as squared error for regression and the gini index for classification. The skeleton of CART
is almost the same for regression and classification; the basic differences are the splitting
methods and the handling of leaves: when a subset is considered a leaf, it is assigned the
average value for regression and the majority vote for classification. The core algorithm for
building the tree is similar to ID3, where Information Gain is replaced by Variance Reduction
for regression and Gini Gain for classification.

Choosing the Best Attribute (splitting criteria): variance and gini

Use variance reduction (or, alternatively, the residual sum) for regression and Gini gain (or,
alternatively, the gini index) for classification.

Classification
Gini measures the average amount of information or impurity of the samples in a set S with
respect to the output classes. Gini(S) is defined as

Gini(S) = Σi Pi·(1 − Pi),  where Pi is the probability of the i-th class

Gini Gain(S, x) measures the expected reduction in the gini index due to splitting on x:

Gini(S)    = total gini index of the parent set S
Gini(S, x) = average gini index of the child sets Sleft and Sright

GiniGain(S, x) = Gini(S) − Gini(S, x)

Gini(S, x) = Σ_{i: left, right} (|Si| / |S|) · Gini(Si)

Higher Gini Gain is better: a higher gain means a lower gini index of the average child (the
child sets are getting more pure). Alternatively, instead of maximizing the Gini Gain one can
minimize Gini(S, x). By ignoring the division by |S|, the following Gini criterion is obtained,
where a smaller value is better:

Gini Criterion = |Sleft|·Gini(Sleft) + |Sright|·Gini(Sright)


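Example (illustrative sketch, not from the original text): a brute-force NumPy search for a CART-style binary split using the Gini criterion above.

import numpy as np

def gini(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return np.sum(p * (1 - p))

def best_gini_split(X, y, feature):
    # try every threshold on one numeric feature; a smaller criterion value is a better split
    best_value, best_criterion = None, np.inf
    for value in np.unique(X[:, feature]):
        left, right = y[X[:, feature] <= value], y[X[:, feature] > value]
        if len(left) == 0 or len(right) == 0:
            continue
        criterion = len(left) * gini(left) + len(right) * gini(right)
        if criterion < best_criterion:
            best_value, best_criterion = value, criterion
    return best_value, best_criterion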

Classification Tree Example on Hepatitis (ref[2]):

...

Regression
Variance (or mse) measures the average amount of information in a set S with respect to the
mean of the set. Variance Reduction (VR) measures the expected reduction in variance (mse) due
to splitting on feature x:

VR(S, x) = MSE(S) − Σ_{i: left, right} (|Si| / |S|) · MSE(Si)

(the second term is the average MSE of the child sets)

One can see that in order to maximize Variance Reduction, one can minimize the total MSE of the
children (the residual sum) for regression. Thus, a smaller RSS (Residual Sum of Squares) is
better for splitting. RSS is defined as follows:

RSS = Σ_{i ∈ Sleft} (yi − ȳleft)² + Σ_{i ∈ Sright} (yi − ȳright)²

Regression Tree Example on Prostate Cancer (ref[2]); at each node the candidate split with the
smaller RSS is preferred:

i-   node0: the best x split = 3.67 with RSS = 68.09
ii-  node0: the best y split = 1.05 with RSS = 61.76
iii- left of node0 (node1): the best x split = 3.66 with RSS = 16.11
iv-  left of node0 (node1): the best y split = −0.48 with RSS = 13.61
v-   right of node0 (node2): the best x split = 3.07 with RSS = 27.15
vi-  right of node0 (node2): the best y split = 2.79 with RSS = 25.11
...
If we complete the training, the final regression tree is obtained (figure in ref[2]).

7.1.4. Decision Tree Learning using Python Sklearn

Python's scikit-learn currently uses an optimized version of the CART algorithm for decision
tree classification and regression. The DecisionTreeClassifier and DecisionTreeRegressor classes
are used for classification and regression respectively. The following are some of the general
parameters that work for both classification and regression:

criterion : decides the criterion that measures the impurity of a set. "gini" and "entropy" are
the alternatives for classification, with default "gini". "mse" (mean squared error) and "mae"
(mean absolute error) are alternatives for regression, with default "mse".

splitter : decides the splitting method on the chosen criterion. The default "best" means
choosing the best attribute to split on, according to the gain (information gain, gini gain,
variance reduction, etc.) computed with the selected criterion. The alternative "random" chooses
a random attribute but gives a higher chance to better attributes according to their gains.

max_depth : the maximum depth of the tree. Default "None": continue expanding the tree's nodes
until all leaves are pure or all leaves contain fewer samples than "min_samples_split".

min_samples_split : if a node has fewer samples than "min_samples_split", do not split it.
Default = 2

min_samples_leaf : minimum number of samples required for a node to become a leaf. Default = 1

min_impurity_decrease : a weighted information gain used as a threshold; do not allow splitting
a node if there is no significant gain in doing so (i.e., if the impurity decrease is less than
this threshold, do not expand the tree). Default = 0

weighted information gain (for a parent set S split into Sleft and Sright, with Pi = |Si|/|S|):

Gain(S, x) = Entropy(S) − Σ_{i: left, right} Pi·Entropy(Si)
or         = Gini(S) − Σ_{i: left, right} Pi·Gini(Si)
or         = MSE(S) − Σ_{i: left, right} Pi·MSE(Si)

Weighted Gain(S, x) = (|S| / N) · Gain(S, x)

where N is the total number of samples in the training set.

Example: analyse the DecisionTreeClassifier on the iris dataset using cross-validation

from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
iris = load_iris()
model = DecisionTreeClassifier()
print(cross_val_score(model, iris.data, iris.target, cv=10))

Example: DecisionTreeClassifier using entropy, rather than the default gini criterion

from sklearn.metrics import accuracy_score
iris = load_iris()
model = DecisionTreeClassifier(criterion="entropy", max_depth=3)
model.fit(iris.data, iris.target)
fx = model.predict(iris.data)
print("Accuracy rate =", accuracy_score(iris.target, fx))
print("Accuracy rate =", model.score(iris.data, iris.target))   #same accuracy score

Example: analyse DecisionTreeRegressor on the boston dataset using cross-validation

from sklearn.tree import DecisionTreeRegressor
from sklearn.datasets import load_boston
from sklearn.model_selection import cross_val_score
boston = load_boston()
model = DecisionTreeRegressor()
print(cross_val_score(model, boston.data, boston.target, cv=10))

Example: DecisionTreeRegressor using mae, rather than the default mse

from sklearn.metrics import mean_squared_error
boston = load_boston()
model = DecisionTreeRegressor(criterion="mae", max_depth=3)
model.fit(boston.data, boston.target)
fx = model.predict(boston.data)
print("R2 score (default) =", model.score(boston.data, boston.target))
print("mse error =", mean_squared_error(boston.target, fx))

7.2. Ensemble Learning using Decision Trees: Bagging, Random Forests, and Adaboost

7.2.1. Bagging
A group of weak learners (decision trees here) is ensembled together to predict a single
decision. Each weak learner is trained on a random dataset derived from the original training
set by sampling with replacement. The decision of the ensemble learner is the majority vote for
classification and the average value for regression. Bagging is one of the extensions of CART
that reduces the variance of the decision by using an ensemble of many weak trees (learners).
An example for regression is as follows:

[Figure: a single tree (high variance) versus the average of 100 trees (low variance).]

Example for classification is as follows

When creating the random datasets (bootstrap sets), the features (attributes) of the bootstrap
sets are preserved. Thus, no feature selection is used, although Python implementations may
allow it. Bootstrap trees are built independently and may differ from the original tree.

7.2.2. Random Forest

Random Forest is an extension of Bagging. The difference from Bagging is that Random Forest
selects a random subset of features when generating each bootstrap set. Thus, the random
datasets have various feature lengths Mi (≤ M).

Advantages: Random Forests reduce the effect of overfitting and improve generalization. The
samples that are not selected for a given random dataset form its out-of-bag set; the score
measured on the out-of-bag samples is called the oob_score.

There is less need for a separate cross-validation test, since the out-of-bag set acts as a
natural test set. The out-of-bag error is the error of the learner on its out-of-bag samples
used as a test set.

Bagging and Random Forest with Python Scikit-Learn

The BaggingClassifier and BaggingRegressor classes are used for bagging classification and
regression respectively. The classes can use any estimator as the weak learner, not only a
decision tree or KNN.
The majority of the parameters are valid for both regression and classification. Some basic
parameters are:
base_estimator : the weak learner. Default is a decision tree
n_estimators : number of weak learners
oob_score : whether to estimate the score on the out-of-bag samples (the fitted value is stored
in the oob_score_ attribute)
bootstrap : are samples drawn with replacement? (Default: True)

Example: BaggingClassifier where the weak classifier is the default DecisionTreeClassifier

...
from sklearn.ensemble import BaggingClassifier
X, y = load …
model = BaggingClassifier(n_estimators=15, oob_score=True)
model.fit(X, y)
print("out of bag score", model.oob_score_)
fx = model.predict(X)
...

Example: BaggingClassifier where the weak classifier is KNN

. . .
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import BaggingClassifier
X, y = load …
model = BaggingClassifier(base_estimator = KNeighborsClassifier())
model.fit(X, y)
fx = model.predict(X)
print("accuracy score", model.score(X, y))
. . .

Example: BaggingRegressor where the weak regressor is the default DecisionTreeRegressor

...
from sklearn.ensemble import BaggingRegressor
X, y = load …
model = BaggingRegressor(n_estimators=15, oob_score=True)
model.fit(X, y)
print("out of bag score", model.oob_score_)
fx = model.predict(X)
...

Example: BaggingRegressor where the weak regressor is KNN

. . .
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import BaggingRegressor
X, y = load …
model = BaggingRegressor(base_estimator = KNeighborsRegressor())
model.fit(X, y)
fx = model.predict(X)
print("R2 score", model.score(X, y))
. . .

The RandomForestClassifier and RandomForestRegressor classes are used for Random Forest
classification and regression respectively. Random Forest inherently uses the parameters of the
decision tree (its weak learner) and of bagging; thus, the decision tree and bagging parameters
are passed to the Random Forest classes implicitly. Parameters such as criterion, max_depth, and
so on are applied to all weak trees.

Example: Regression with Random Forest
. . .
from sklearn.ensemble import RandomForestRegressor
x_train, x_test, y_train, y_test = train_test_split(...)
model = RandomForestRegressor(n_estimators=15, max_depth=3, criterion="mse")
model.fit(x_train, y_train)
fx = model.predict(x_test)
print("Ensemble R2 score: ", model.score(x_test, y_test))
. . .

Example: Classification with Random Forest

...
from sklearn.ensemble import RandomForestClassifier
x_train, x_test, y_train, y_test = train_test_split(...)
model = RandomForestClassifier(n_estimators=50, max_depth=2, criterion="gini")
model.fit(x_train, y_train)
fx = model.predict(x_test)
print("Ensemble accuracy score on test set:", model.score(x_test, y_test))
...

7.2.3. Adaboost
(ref: https://www.datacamp.com/community/tutorials/adaboost-classifier-python)

Adaboost:

Adaboost on example:
(ref: https://laptrinhx.com/understanding-adaboost-and-scikit-learn-s-algorithm-3554153184/)

Adaboost stands for adaptive boosting. The models are sequentially arranged in the
ensemble. This means that at each step we try to boost our weak learners (base
model) based on the mistakes of our previous models so together they are one
strong ensemble model.
Step 1: Assign equal weights to all samples in the data set

[Table: samples 1–8, each with an initial weight of 1/8.]

We have 8 samples in our dataset, and each has been assigned an equal weight of 1/(number of
samples). This means that the correct classification of every sample is equally important.

Step 2: Create a decision stump using the feature that has the lowest Gini index
A decision tree stump is just a decision tree with a single root and two leaves, i.e. a tree
with one level. This stump is our weak learner (base model). The feature that gets to classify
our data first is determined using the Gini index.

If you want, you can increase the depth to two levels, but it is very common to go with a stump.

Step 3: Evaluate the performance of your stump and assign its weight
In Adaboost we have an ensemble of stumps, and all their predictions are taken into account
before deciding the final prediction. But some stumps do a better job classifying our data than
other stumps in the ensemble, so it makes sense to give more importance to these stumps.
Adaboost does this by assigning a weight to each stump in the ensemble: the higher the weight,
the more amount of say (significance) the stump has in the final prediction. So, for example,
suppose sample 3 and sample 6 are misclassified by the stump; the weight of this stump is
calculated by

significance = 0.5 · ln((1 − TotalError) / TotalError)

where TotalError = sum of the weights of the wrongly classified samples. So if our stump
misclassified two samples, using the weights of those samples we get
stump significance = 0.5·ln((1 − (1/8 + 1/8)) / (1/8 + 1/8)) = 0.54 (using the natural log).
And that is the weight of our first model in the ensemble.

Remember that this is different from the sample weights. Sample weights stress the importance of
getting the classification of a sample right, while model weights determine the amount of say a
model gets in the final prediction.

Step 4: Re-assign the weights of the samples

Remember we set equal weights for all samples earlier? After classifying with our first model,
two samples were misclassified. Adaboost then stresses the importance of getting these two
samples right the next time around by assigning them a higher sample weight, which helps the
next model focus more on them. The formula for the new weight of a wrongly classified sample is:

new weight = weight · e^(amount of say)

Initially the weights of both samples were 1/8, and now the new weight of each is
1/8 · e^(0.54) ≈ 0.21, which is greater than the initial 1/8.

Next, the weights of the other (correctly classified) samples are reduced using:

new weight = weight · e^(−amount of say)

Step 5: Normalise the weights of the samples


Now that we have new weights for each sample, we can normalize the weights by
dividing each weight by the sum of the weights. This will make the sum of the new
weights equal to 1. After normalization, the new weights will look like following.
Notice that the weights of the two wrong samples (sample 3 and 6) are increased.

[Table: normalized sample weights for samples 1–8; the two misclassified samples (3 and 6) now carry larger weights than the others.]
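Example (illustrative sketch, not from the original text): a small NumPy version of the weight arithmetic in steps 3–5 for this 8-sample example.

import numpy as np

n = 8
weights = np.full(n, 1.0 / n)                 # step 1: equal weights
wrong = np.array([2, 5])                      # 0-based indices of samples 3 and 6
correct = np.setdiff1d(np.arange(n), wrong)

total_error = weights[wrong].sum()                         # 0.25
say = 0.5 * np.log((1 - total_error) / total_error)        # ~0.55, the stump's amount of say

weights[wrong]   *= np.exp(+say)              # step 4: misclassified samples grow (~0.21 each)
weights[correct] *= np.exp(-say)              # correctly classified samples shrink
weights /= weights.sum()                      # step 5: normalize so the weights sum to 1
print(np.round(weights, 3))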

Step 6: Create a new dataset of the same size as the original dataset and pick samples based on
their weights
This step is crucial because this is how the next model benefits from the experience of the
previous model, so that misclassified samples are given more importance. It works in the
following way:

 First, an empty dataset of the same size as the original dataset is created.
 Then samples are selected according to their weights, e.g. using roulette wheel selection.
   Thus, important samples are more likely to be selected into the new dataset. This grants the
   new learner the ability to correct the mistakes of the previous learner.
 With the roulette wheel, a random number from 0 to 1 is picked and the sample whose cumulative
   weight slice contains that number is chosen. For example, if 0.3 is picked, then the third
   sample is chosen, since 0.3 falls within [0.167, 0.416], and it is added to the new dataset.
 Repeat the previous step until the new dataset is filled. Once the new dataset is filled,
   reset the sample weights to an equal value of 1/(number of samples).
This is what our new dataset looks like.

Notice how the samples that we got wrong previously are included more often? This gives the next
model a better chance to get them right, somewhat like creating a large penalty for
misclassification.

Step 7: Repeat steps 2 to 6 until we have enough models


Step 8: Assign the final prediction by majority vote for classification or by averaging for
regression; the vote and the average are weighted by the learners' weights.

So if the sum of the weights of the models that voted "Yes" is greater than the sum of the
weights of the models that voted "No", the final prediction is "Yes", and vice versa.

Adaboost with Python Scikit-Learn

The AdaBoostClassifier and AdaBoostRegressor classes are used for classification and regression
respectively. The classes can use any estimator as the weak learner, not only a decision tree or
KNN.
The majority of the parameters are valid for both regression and classification. Some basic
parameters are:
base_estimator : the weak learner. The default base classifier is DecisionTreeClassifier
initialized with max_depth=1, and the default base regressor is DecisionTreeRegressor
initialized with max_depth=3
n_estimators : number of weak learners
learning_rate : it scales the contribution (weight) of each weak learner; there is a trade-off
between learning_rate and n_estimators. Default = 1
algorithm : {'SAMME', 'SAMME.R'}, default = 'SAMME.R'. The SAMME and SAMME.R algorithms are
multiclass Adaboost variants put forward in a paper by Ji Zhu, Saharon Rosset, Hui Zou and
Trevor Hastie; they adapt the main idea of Adaboost and extend it with multiclass capabilities.

Example(ref: https://scikit-learn.org/stable/auto_examples/ensemble/plot_adaboost_regression.html)
A decision tree is boosted using the AdaBoost.R2 (Drucker 1997) algorithm on a 1D
sinusoidal dataset with a small amount of Gaussian noise. 299 boosts (300 decision
trees) is compared with a single decision tree regressor. As the number of boosts is
increased the regressor can fit more detail.

print(__doc__)

# Author: Noel Dawe <noel.dawe@gmail.com>


#
# License: BSD 3 clause

# importing necessary libraries


import numpy as np
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import AdaBoostRegressor

# Create the dataset


rng = np.random.RandomState(1)
X = np.linspace(0, 6, 100)[:, np.newaxis]
y = np.sin(X).ravel() + np.sin(6 * X).ravel() + rng.normal(0, 0.1, X.shape[0])

# Fit regression model


regr_1 = DecisionTreeRegressor(max_depth=4)

regr_2 = AdaBoostRegressor(DecisionTreeRegressor(max_depth=4),
n_estimators=300, random_state=rng)

regr_1.fit(X, y)
regr_2.fit(X, y)

# Predict
y_1 = regr_1.predict(X)
y_2 = regr_2.predict(X)

# Plot the results


plt.figure()
plt.scatter(X, y, c="k", label="training samples")
plt.plot(X, y_1, c="g", label="n_estimators=1", linewidth=2)

plt.plot(X, y_2, c="r", label="n_estimators=300", linewidth=2)


plt.xlabel("data")
plt.ylabel("target")
plt.title("Boosted Decision Tree Regression")
plt.legend()
plt.show()

Example( Adaboost classification on iris dataset)


ref: https://www.datacamp.com/community/tutorials/adaboost-classifier-python

from sklearn.ensemble import AdaBoostClassifier
from sklearn import datasets
# Import train_test_split function
from sklearn.model_selection import train_test_split
# Import scikit-learn metrics module for accuracy calculation
from sklearn import metrics

# Load data
iris = datasets.load_iris()
X = iris.data
y = iris.target

# Split dataset into training set and test set


X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size=0.3) # 70% training and 30% test

# Create adaboost classifer model


model = AdaBoostClassifier(n_estimators=50,
learning_rate=1)
# Train Adaboost Classifer
model.fit(X_train, y_train)

#Predict the response for test dataset


fx_test = model.predict(X_test)

# Model Accuracy, how often is the classifier correct?


print("Accuracy:",metrics.accuracy_score(y_test, fx_test))

Output:
Accuracy: 0.8888888888888888

Example(using SVC as base learner for Adaboost classification and compare with
decision tree base learner )
ref: https://www.datacamp.com/community/tutorials/adaboost-classifier-python

# Load libraries
from sklearn.ensemble import AdaBoostClassifier

# Import Support Vector Classifier
from sklearn.svm import SVC
# Import scikit-learn metrics module for accuracy calculation
from sklearn import metrics

svc = SVC(probability=True, kernel='linear')

# Create adaboost classifer object


model =AdaBoostClassifier(n_estimators=50,
base_estimator=svc,learning_rate=1)

# Train Adaboost Classifer


model.fit(X_train, y_train)

#Predict the response for test dataset


fx_test = model.predict(X_test)

# Model Accuracy, how often is the classifier correct?


print("Accuracy:",metrics.accuracy_score(y_test, fx_test))

Output:
Accuracy: 0.9555555555555556

References
[1] Berlin-Chen Slides
[2] Ovronnaz, Switzerland Slides
[3] https://towardsdatascience.com/machine-learning-part-17-boosting-algorithms-adaboost-in-python-d00faac6c464

Some useful resources:

See the YouTube video https://www.youtube.com/watch?v=LsK-xG1cLYA
and https://towardsdatascience.com/machine-learning-part-17-boosting-algorithms-adaboost-in-python-d00faac6c464
