
U20cs604 Machine Learning Unit II

Uploaded by

Boovi

UNIT II SUPERVISED LEARNING

Linear Models for Classification - Discriminant Functions - Probabilistic Generative Models -
Probabilistic Discriminative Models - Bayesian Logistic Regression. Decision Trees - Classification
Trees - Regression Trees - Pruning. Neural Networks - Feed-forward Network Functions - Error
Backpropagation - Regularization - Mixture Density and Bayesian Neural Networks - Kernel Methods -
Dual Representations - Radial Basis Function Networks. Ensemble Methods - Bagging - Boosting.

Discriminant Functions

Linear Models for classification

Two widely used linear models for classification are:

o Logistic Regression
o Support Vector Machines

Logistic Regression

Logistic Regression uses the machinery of regression to perform classification. It has done so
well for several decades and remains among the most popular models. One of the main reasons for
the model's success is its explainability, i.e. it quantifies the contribution of each individual
predictor.

Unlike linear regression, which is fitted by Least Squares, logistic regression is fitted by
Maximum Likelihood: it fits a sigmoid curve that maps the predictors to the probability of the
target class.

Because the model is susceptible to multicollinearity, adding predictors step-wise is often a
better approach to finalizing the chosen predictors of the model.

The algorithm is a popular choice in many natural language processing tasks e.g. toxic speech
detection, topic classification, etc.
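
As a rough illustration of fitting a sigmoid curve by Maximum Likelihood, the sketch below trains a one-predictor logistic regression with plain gradient ascent on the log-likelihood. The data (hours studied vs. pass/fail), the learning rate and the number of epochs are made-up assumptions, and gradient ascent is only one of several ways to maximize the likelihood.

```python
import math

def sigmoid(z):
    # Logistic function, written to avoid overflow for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def fit_logistic(xs, ys, lr=0.1, epochs=2000):
    """Fit p(y=1|x) = sigmoid(w*x + b) by gradient ascent on the mean log-likelihood."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Gradient of the mean log-likelihood with respect to w and b.
        grad_w = sum((y - sigmoid(w * x + b)) * x for x, y in zip(xs, ys)) / n
        grad_b = sum((y - sigmoid(w * x + b)) for x, y in zip(xs, ys)) / n
        w += lr * grad_w
        b += lr * grad_b
    return w, b

# Hypothetical toy data: hours studied -> passed (1) or failed (0).
# The classes overlap slightly, so the maximum-likelihood estimate is finite.
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
ys = [0, 0, 0, 1, 0, 1, 1, 1]

w, b = fit_logistic(xs, ys)
p_low = sigmoid(w * 1.0 + b)    # predicted probability of passing after 1 hour
p_high = sigmoid(w * 4.0 + b)   # predicted probability of passing after 4 hours
print(w, b, p_low, p_high)
```

The sign and size of w show the direction and strength of the predictor's contribution, which is the explainability property mentioned above.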
Support Vector Machine Algorithm

Support Vector Machine or SVM is one of the most popular Supervised Learning algorithms,
which is used for Classification as well as Regression problems. However, primarily, it is used for
Classification problems in Machine Learning.

The goal of the SVM algorithm is to create the best line or decision boundary that can segregate n-
dimensional space into classes so that we can easily put the new data point in the correct category
in the future. This best decision boundary is called a hyperplane.

SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme cases
are called support vectors, and hence the algorithm is termed Support Vector Machine. Consider
the diagram below, in which two different categories are separated by a decision boundary or
hyperplane:

Example: SVM can be understood with the example used for the KNN classifier. Suppose we see a
strange cat that also has some features of dogs, and we want a model that can accurately identify
whether it is a cat or a dog. Such a model can be created using the SVM algorithm. We first train
the model with many images of cats and dogs so that it learns the distinguishing features of
each, and then we test it with this strange creature. The SVM draws a decision boundary between
the two classes (cat and dog) using the extreme cases (support vectors) of each class. On the
basis of the support vectors, it classifies the creature as a cat. Consider the diagram below:

SVM algorithm can be used for Face detection, image classification, text categorization, etc.

Types of SVM

SVM can be of two types:

o Linear SVM: Linear SVM is used for linearly separable data. If a dataset can be classified
into two classes by a single straight line (a hyperplane in higher dimensions), the data is
termed linearly separable, and the classifier used is called a Linear SVM classifier.
o Non-linear SVM: Non-linear SVM is used for data that is not linearly separable. If a dataset
cannot be classified by a straight line, the data is termed non-linear, and the classifier used
is called a Non-linear SVM classifier.
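
The notes do not say how the separating hyperplane is found. One common way to train a linear SVM is to minimize the regularized hinge loss with subgradient descent. The sketch below is written under that assumption; the toy points, the regularization strength `lam` and the learning rate are hypothetical.

```python
def train_linear_svm(points, labels, lam=0.01, lr=0.01, epochs=1000):
    """Subgradient descent on lam/2 * ||w||^2 + hinge loss; labels must be +1 or -1."""
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        for (x1, x2), y in zip(points, labels):
            margin = y * (w[0] * x1 + w[1] * x2 + b)
            # The regularization term shrinks w slightly on every step.
            w[0] -= lr * lam * w[0]
            w[1] -= lr * lam * w[1]
            if margin < 1:
                # The point is misclassified or inside the margin: push the boundary away.
                w[0] += lr * y * x1
                w[1] += lr * y * x2
                b += lr * y
    return w, b

def predict(w, b, point):
    # The sign of the decision function tells which side of the hyperplane the point is on.
    return 1 if w[0] * point[0] + w[1] * point[1] + b >= 0 else -1

# Hypothetical, linearly separable toy data in 2-D.
points = [(1, 1), (2, 1), (1, 2), (4, 4), (5, 4), (4, 5)]
labels = [-1, -1, -1, 1, 1, 1]

w, b = train_linear_svm(points, labels)
print(w, b, predict(w, b, (0, 0)), predict(w, b, (6, 6)))
```

Only points with margin below 1 trigger an update, which mirrors the idea that the boundary is determined by the extreme points (support vectors) rather than by all the data.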

Linear Discriminant Analysis (LDA) in Machine Learning

Linear Discriminant Analysis (LDA) is one of the commonly used dimensionality reduction
techniques in machine learning for solving classification problems with more than two classes.
It is also known as Normal Discriminant Analysis (NDA) or Discriminant Function Analysis
(DFA).

It projects features from a higher-dimensional space into a lower-dimensional space in order to
reduce computational cost. This topic covers the LDA algorithm for classification predictive
modeling problems, the limitations of logistic regression, the representation of the LDA model,
how to make a prediction using LDA, how to prepare data for LDA, and extensions to LDA.

What is Linear Discriminant Analysis (LDA)?

Although logistic regression in its basic form is limited to two classes, Linear Discriminant
Analysis is directly applicable to classification problems with more than two classes.

Linear Discriminant Analysis is one of the most popular dimensionality reduction techniques
used for supervised classification problems in machine learning. It is also used as a
pre-processing step in machine learning and pattern classification applications.

Whenever two or more classes with multiple features need to be separated efficiently, Linear
Discriminant Analysis is one of the most common techniques for the task. For example, suppose we
have two classes with multiple features and need to separate them efficiently. If we classify
them using a single feature, the classes may overlap.

To overcome the overlap, we keep increasing the number of features used for classification.

Example:

Let's assume we have to classify two different classes, each with its own set of data points, in
a 2-dimensional plane as shown in the image below:

It is impossible to draw a straight line in the 2-D plane that separates these data points
efficiently. Using Linear Discriminant Analysis, however, we can reduce the 2-D plane to a 1-D
axis. Using this technique, we can also maximize the separability between multiple classes.

How does Linear Discriminant Analysis (LDA) work?

Linear Discriminant Analysis is used as a dimensionality reduction technique in machine learning,
with which we can easily transform a 2-D or 3-D graph into a 1-dimensional axis.

Let's consider an example where we have two classes in a 2-D plane with X and Y axes, and we
need to classify them efficiently. As seen in the example above, LDA lets us find a straight
line onto which the data points can be projected so that the two classes are well separated.
Here, LDA uses the X and Y axes to create a new axis and projects the data onto that new axis.

Hence, we can maximize the separation between these classes and reduce the 2-D plane into 1-D.

To create a new axis, Linear Discriminant Analysis uses the following criteria:

o It maximizes the distance between means of two classes.


o It minimizes the variance within the individual class.

Using the above two conditions, LDA generates a new axis in such a way that it can maximize the
distance between the means of the two classes and minimizes the variation within each class.

In other words, we can say that the new axis will increase the separation between the data points of
the two classes and plot them onto the new axis.
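
For two classes, the two criteria above lead to Fisher's linear discriminant: the new axis points along w = Sw^-1 (m_a - m_b), where m_a and m_b are the class means and Sw is the within-class scatter matrix. The sketch below computes this direction for 2-D points. The data points are made up for illustration, and the code assumes Sw is invertible.

```python
def mean_vec(points):
    n = len(points)
    return [sum(p[0] for p in points) / n, sum(p[1] for p in points) / n]

def scatter(points, m):
    # 2x2 scatter matrix of one class: sum of (x - m)(x - m)^T over its points.
    s = [[0.0, 0.0], [0.0, 0.0]]
    for p in points:
        d = [p[0] - m[0], p[1] - m[1]]
        for i in range(2):
            for j in range(2):
                s[i][j] += d[i] * d[j]
    return s

def fisher_direction(class_a, class_b):
    ma, mb = mean_vec(class_a), mean_vec(class_b)
    sa, sb = scatter(class_a, ma), scatter(class_b, mb)
    # Within-class scatter Sw = Sa + Sb (assumed invertible here).
    sw = [[sa[i][j] + sb[i][j] for j in range(2)] for i in range(2)]
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
    inv = [[sw[1][1] / det, -sw[0][1] / det],
           [-sw[1][0] / det, sw[0][0] / det]]
    diff = [ma[0] - mb[0], ma[1] - mb[1]]
    # w = Sw^-1 (m_a - m_b)
    return [inv[0][0] * diff[0] + inv[0][1] * diff[1],
            inv[1][0] * diff[0] + inv[1][1] * diff[1]]

def project(points, w):
    # 1-D coordinate of each point along the new axis.
    return [p[0] * w[0] + p[1] * w[1] for p in points]

# Hypothetical 2-D data for two classes.
class_a = [(4, 2), (2, 4), (2, 3), (3, 6), (4, 4)]
class_b = [(9, 10), (6, 8), (9, 5), (8, 7), (10, 8)]

w = fisher_direction(class_a, class_b)
proj_a = project(class_a, w)
proj_b = project(class_b, w)
print(w, proj_a, proj_b)
```

On this toy data the projected values of the two classes do not overlap, so a single threshold on the new axis separates them.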

Why LDA?

o Logistic Regression is one of the most popular classification algorithms. It performs well
for binary classification but falls short for multi-class problems and when the classes are
well separated. LDA handles these cases quite efficiently.
o LDA can also be used in data pre-processing to reduce the number of features, just as PCA
does, which reduces the computing cost significantly.
o LDA is also used in face detection algorithms. In Fisherfaces, LDA is used to extract
useful data from different faces. Coupled with eigenfaces, it produces effective results.

Drawbacks of Linear Discriminant Analysis (LDA)

LDA is used to solve supervised classification problems with two or more classes, including
cases that basic logistic regression does not handle. However, LDA fails in cases where the
means of the class distributions are shared. In such cases, LDA cannot create a new axis that
makes the classes linearly separable.

To overcome such problems, we use non-linear Discriminant analysis in machine learning.

Extension to Linear Discriminant Analysis (LDA)

Linear Discriminant Analysis is one of the simplest and most effective methods for solving
classification problems in machine learning. It has several extensions and variations:

1. Quadratic Discriminant Analysis (QDA): Each class uses its own estimate of variance (or
covariance, when there are multiple input variables).
2. Flexible Discriminant Analysis (FDA): It is used when non-linear combinations of inputs,
such as splines, are used.
3. Regularized Discriminant Analysis (RDA): This uses regularization in the estimate of the
variance (actually covariance) and hence moderates the influence of different variables on
LDA.

Real-world Applications of LDA

Some of the common real-world applications of Linear discriminant Analysis are given below:

o Face Recognition
Face recognition is a popular application of computer vision, where each face is
represented as a combination of a large number of pixel values. In this case, LDA is used to
reduce the number of features to a manageable number before the classification step. It
generates a new template in which each dimension consists of a linear combination of pixel
values. If a linear combination is generated using Fisher's linear discriminant, it is called
a Fisher face.

o Medical
In the medical field, LDA is used to classify a patient's disease on the basis of various
parameters of the patient's health and the ongoing medical treatment. On such parameters, it
classifies the disease as mild, moderate, or severe. This classification helps doctors either
increase or decrease the pace of the treatment.

o Customer Identification
With the help of LDA, we can identify and select the features that characterize the group of
customers who are likely to purchase a specific product in a shopping mall.

o For Predictions
LDA can also be used for making predictions and hence in decision making. For example,
"will you buy this product" gives a predicted result in one of two possible classes: buying
or not buying.

o In Learning
Nowadays, robots are being trained to learn and talk in order to simulate human work, and this
can also be considered a classification problem. In this case, LDA builds similar groups on
the basis of different parameters, including pitches, frequencies, sound, tunes, etc.

How to Prepare Data for LDA

Below are some suggestions that one should always consider while preparing the data to build the
LDA model:

o Classification Problems: LDA is mainly applied for classification problems to classify the
categorical output variable. It is suitable for both binary and multi-class classification
problems.
o Gaussian Distribution: The standard LDA model assumes a Gaussian distribution of the
input variables. One should review the univariate distribution of each attribute and
transform it into a more Gaussian-looking distribution, e.g. use log and root transforms for
exponential distributions and Box-Cox for skewed distributions.
o Remove Outliers: It is good to first remove outliers from the data, because outliers can
skew the basic statistics used to separate classes in LDA, such as the mean and the standard
deviation.
o Same Variance: LDA assumes that all input variables have the same variance, so it is good
practice to standardize the data before fitting an LDA model. After standardization, each
variable has a mean of 0 and a standard deviation of 1.
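
The standardization step in the last point can be sketched as follows. The column of values is made up, and the population standard deviation (dividing by n) is an assumption; some libraries divide by n - 1 instead.

```python
import math

def standardize(column):
    """Rescale one input variable to mean 0 and standard deviation 1."""
    n = len(column)
    mean = sum(column) / n
    # Population standard deviation (divide by n).
    std = math.sqrt(sum((v - mean) ** 2 for v in column) / n)
    return [(v - mean) / std for v in column]

# Hypothetical values of a single input variable.
column = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
z = standardize(column)

z_mean = sum(z) / len(z)
z_std = math.sqrt(sum((v - z_mean) ** 2 for v in z) / len(z))
print(z, z_mean, z_std)
```

The same function would be applied to every input variable separately before fitting the LDA model.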

Probabilistic Generative Models

o Introduction

o Probabilistic modelling in machine learning is the application of the principles of
statistics to data analysis. It was one of the earliest forms of machine learning and is still
widely used to this day. One of the best-known algorithms in this group is the Naive Bayes
algorithm.

o Probabilistic modelling provides a framework for understanding what learning is. The
probabilistic framework describes how to represent and manipulate uncertainty about models.
Predictions play a dominant role in scientific data analysis. They are equally important in
machine learning, automation, cognitive computing and artificial intelligence.

o Description

o Probabilistic models provide a powerful language for describing the world. They are
described using random variables as building blocks, held together by probabilistic
relationships.

o Machine learning includes both probabilistic and non-probabilistic models. Knowledge of
basic probability concepts, such as random variables and probability distributions, is helpful
for a good understanding of probabilistic models.

o Drawing inferences from noisy or ambiguous data is an important part of intelligent
systems. In probability theory, Bayes' theorem in particular serves as a principled framework
for combining prior knowledge and empirical evidence.

o Importance of probabilistic ML models
o One of the key benefits of probabilistic models is that they give an idea of the
uncertainty associated with predictions. We can tell how confident a machine learning model is
in its prediction. For example, if a probabilistic classifier assigns a probability of 0.9 to
the 'Dog' class instead of 0.6, the classifier is more confident that the animal in the image
is a dog. These concepts of uncertainty and confidence are very valuable in critical machine
learning applications such as disease diagnosis and autonomous driving. Moreover,
probabilistic outputs are useful for many techniques related to machine learning, for instance
Active Learning.

Probabilistic Discriminative Models


While generative models learn about the distribution of the dataset, discriminative
models learn about the boundary between classes within a dataset. With discriminative models, the
goal is to identify the decision boundary between classes to apply reliable class labels to data
instances. Discriminative models separate the classes in the dataset by using conditional
probability, not making any assumptions about individual data points.
Discriminative models set out to answer the following question:

“What side of the decision boundary is this instance found in?”

Examples of discriminative models in machine learning include support vector machines, logistic
regression, decision trees, and random forests.
Examples of Discriminative Models
Support Vector Machines
Support vector machines operate by drawing a decision boundary between data points,
finding the decision boundary that best separates the different classes in the dataset. The SVM
algorithm draws a line in 2-dimensional space, a plane in 3-dimensional space, and a hyperplane
in higher dimensions. SVM tries to find the line/hyperplane that best separates the classes by
maximizing the margin, i.e. the distance from the line/hyperplane to the nearest points.
SVM models can also be used on datasets that aren’t linearly separable by using the “kernel trick”
to identify non-linear decision boundaries.

Logistic Regression
Logistic regression is an algorithm that uses a logit (log-odds) function to determine the
probability of an input being in one of two states. A sigmoid function is used to “squish” the
output into a probability between 0 and 1. Probabilities of 0.5 or greater are assigned to
class 1, while probabilities below 0.5 are assigned to class 0. For this reason, logistic
regression is typically used in binary classification problems. However, logistic regression can
be applied to multi-class problems by using a one vs. all approach, creating a binary
classification model for each class and determining the probability that an example belongs to
the target class or to another class in the dataset.

Bayesian Inference or Bayesian Logistic Regression
At the centre of Bayesian inference is Bayes’ rule, sometimes called Bayes’ theorem. It
is used to determine the probability of a hypothesis given prior knowledge. It is based on
conditional probability.

Bayes Rule

Bayes’ theorem is given by:

P (hypothesis│data) = P (data│hypothesis) P (hypothesis) / P (data)

o Bayes’ rule states how to do inference about hypotheses from data.

o Learning and prediction may be understood as forms of inference.
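
A small worked example may make the rule concrete. The numbers are made up: a hypothesis "the patient has the disease" with prior probability 0.01, and a test that is positive with probability 0.95 when the disease is present and 0.05 when it is absent. P(data) is expanded over the two cases (hypothesis true or false).

```python
def posterior(prior, likelihood, likelihood_if_false):
    """P(hypothesis | data) for a hypothesis that is either true or false."""
    # P(data) = P(data | h) P(h) + P(data | not h) P(not h)
    evidence = likelihood * prior + likelihood_if_false * (1.0 - prior)
    return likelihood * prior / evidence

# Hypothetical numbers, chosen only for illustration.
p_disease = 0.01              # P(hypothesis)
p_pos_given_disease = 0.95    # P(data | hypothesis)
p_pos_given_healthy = 0.05    # P(data | not hypothesis)

p = posterior(p_disease, p_pos_given_disease, p_pos_given_healthy)
print(p)
```

Even after a positive test, the posterior stays well below one half in this example, because the prior probability of the hypothesis is so small.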

Standard Bayesian inference with Bayes’ rule lacks a mechanism to directly regulate the
target posterior distribution. The inference process is a one-way procedure that maps the prior
distribution to the posterior by observing empirical data. In supervised learning and
reinforcement learning, the final goal is to apply the posterior to learning tasks, where it is
judged by some performance measure, for instance prediction error or expected reward.

A good posterior distribution should have a small prediction error or a large expected reward.
Furthermore, as large-scale knowledge bases are built and crowdsourcing platforms are widely
adopted to gather human data, it is desirable to incorporate this external information into
statistical modelling and inference when building an intelligent system.

Naive Bayes algorithm

The Naïve Bayes algorithm is a supervised learning algorithm. It is based on Bayes’ theorem
and is used for solving classification problems. It is chiefly used in text classification,
which involves a high-dimensional training dataset. The Naïve Bayes algorithm is one of the
simplest and most effective classification algorithms; it helps build fast machine learning
models that can make quick predictions.

The Naïve Bayes algorithm is a probabilistic classifier: it predicts on the basis of the
probability of an object belonging to each class. Some popular applications of the Naïve Bayes
algorithm are:

o Spam filtering

o Sentiment analysis

o Classifying articles

A closely related model is logistic regression. It is sometimes considered the “hello world” of
modern machine learning. Don’t be misled by its name: logistic regression is a classification
algorithm rather than a regression algorithm. Much like Naive Bayes, it remains quite useful to
this day even though it predates computing by a long time, thanks to its simple and versatile
nature. It is frequently the first thing a data scientist tries on a dataset to get a feel for
the classification task at hand.

Types of Naïve Bayes Model

There are three types of Naïve Bayes model:

o Gaussian: The Gaussian model assumes that features follow a normal distribution. If the
predictors take continuous values rather than discrete ones, the model assumes that these
values are sampled from a Gaussian distribution.

o Multinomial: It is used when the data is multinomially distributed. It is mainly used for
document classification problems, i.e. deciding which category a document belongs to, for
example Sports, Education, or Politics. The classifier uses the frequency of words as the
predictors.

o Bernoulli: The Bernoulli classifier works similarly to the Multinomial classifier, but the
predictor variables are independent Boolean variables, for example whether a specific word is
present in a document or not. This model is also well known for document classification tasks.
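
To make the Multinomial variant concrete, here is a minimal sketch of a word-frequency Naïve Bayes text classifier. The four tiny documents are made up, and the add-one (Laplace) smoothing used to avoid zero probabilities is an assumption of this sketch that the notes do not mention.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """docs: list of (list_of_words, label). Returns the counts needed for prediction."""
    class_counts = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)
    vocab = set()
    for words, label in docs:
        word_counts[label].update(words)
        vocab.update(words)
    return class_counts, word_counts, vocab

def predict_nb(model, words):
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        # Log prior of the class.
        score = math.log(n_docs / total_docs)
        total_words = sum(word_counts[label].values())
        for word in words:
            if word not in vocab:
                continue  # words never seen in training are ignored
            # Word frequency within the class, with add-one smoothing (assumption).
            score += math.log((word_counts[label][word] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical training documents.
docs = [
    (["win", "money", "now"], "spam"),
    (["free", "money", "offer"], "spam"),
    (["meeting", "schedule", "today"], "ham"),
    (["project", "meeting", "notes"], "ham"),
]
model = train_nb(docs)
label_1 = predict_nb(model, ["free", "money"])
label_2 = predict_nb(model, ["meeting", "today"])
print(label_1, label_2)
```

Each word contributes independently to the class score, which is exactly the "naive" independence assumption of the model.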

Uses of Naïve Bayes Model

The Naïve Bayes classifier is used:

o For credit scoring.

o In medical data classification.

o In real-time predictions, as the Naïve Bayes classifier is an eager learner.

o In text classification, for example spam filtering and sentiment analysis.

Pros and Cons of Bayes Classifier

Pros

o Naïve Bayes is one of the simplest and fastest machine learning algorithms for predicting
the class of a data point.

o It can be used for binary as well as multi-class classification.

o It performs well in multi-class prediction compared with other algorithms.

o It is the most popular choice for text classification problems.

Cons

o Naïve Bayes assumes that all features are independent or unrelated, so it cannot learn
relationships between features.

Decision Tree

A decision tree model functions by splitting a dataset down into smaller and smaller
portions, and once the subsets can’t be split any further the result is a tree with nodes and leaves.
Nodes in a decision tree are where decisions about data points are made using different filtering
criteria. The leaves in a decision tree are the data points that have been classified. Decision tree
algorithms can handle both numerical and categorical data, and splits in the tree are based on
specific variables/features.

Random Forests
A random forest model is basically just a collection of decision trees where the predictions
of the individual trees are averaged to come to a final decision. The random forest algorithm
selects observations and features randomly, building the individual trees based on these selections.
Decision Tree Classification Algorithm

Decision Tree is a Supervised learning technique that can be used for both classification
and Regression problems, but mostly it is preferred for solving Classification problems. It is a tree-
structured classifier, where internal nodes represent the features of a dataset, branches represent
the decision rules and each leaf node represents the outcome.

In a decision tree, there are two types of nodes: the Decision Node and the Leaf Node. Decision
nodes are used to make a decision and have multiple branches, whereas Leaf nodes are the
outputs of those decisions and do not contain any further branches.

The decisions or tests are performed on the basis of the features of the given dataset.

It is a graphical representation for getting all the possible solutions to a problem/decision based
on given conditions.

It is called a decision tree because, similar to a tree, it starts with the root node, which expands on
further branches and constructs a tree-like structure.

In order to build a tree, we use the CART algorithm, which stands for Classification and
Regression Tree algorithm.

A decision tree simply asks a question and, based on the answer (Yes/No), further splits the
tree into subtrees.

The diagram below explains the general structure of a decision tree:

Note: A decision tree can contain categorical data (YES/NO) as well as numeric data.

Why use Decision Trees?

There are various algorithms in machine learning, so choosing the best algorithm for the
given dataset and problem is the main point to remember while creating a machine learning model.
Below are two reasons for using a decision tree:

Decision trees usually mimic human thinking when making a decision, so they are easy to
understand.

The logic behind a decision tree can be easily understood because it shows a tree-like structure.

Decision Tree Terminologies

Root Node: Root node is from where the decision tree starts. It represents the entire dataset,
which further gets divided into two or more homogeneous sets.

Leaf Node: Leaf nodes are the final output node, and the tree cannot be segregated further after
getting a leaf node.

Splitting: Splitting is the process of dividing the decision node/root node into sub-nodes
according to the given conditions.

Branch/Sub Tree: A tree formed by splitting the tree.

Pruning: Pruning is the process of removing the unwanted branches from the tree.

Parent/Child node: A node that is split into sub-nodes is called the parent node of those
sub-nodes, and the sub-nodes are called its child nodes.

How does the Decision Tree algorithm Work?

In a decision tree, to predict the class of a given record, the algorithm starts from the root
node of the tree. It compares the value of the root attribute with the corresponding attribute
of the record (real dataset) and, based on the comparison, follows the branch and jumps to the
next node.

At the next node, the algorithm again compares the attribute value with the sub-nodes and
moves further. It continues the process until it reaches a leaf node of the tree. The complete
process can be better understood using the algorithm below:

Step-1: Begin the tree with the root node, say S, which contains the complete dataset.

Step-2: Find the best attribute in the dataset using an Attribute Selection Measure (ASM).

Step-3: Divide S into subsets that contain the possible values of the best attribute.

Step-4: Generate the decision tree node, which contains the best attribute.

Step-5: Recursively make new decision trees using the subsets of the dataset created in Step-3.
Continue this process until a stage is reached where the nodes cannot be classified further;
such a final node is called a leaf node.
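
The five steps can be sketched as a short recursive function. This is only an illustration for categorical attributes: it uses information gain as the ASM, represents the tree as nested dictionaries, and uses a made-up job-offer dataset. These are all assumptions of the sketch, not something the notes prescribe.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_attribute(rows, labels, attributes):
    # Step-2: choose the attribute with the highest information gain.
    def gain(attr):
        remainder = 0.0
        for value in set(r[attr] for r in rows):
            subset = [l for r, l in zip(rows, labels) if r[attr] == value]
            remainder += len(subset) / len(labels) * entropy(subset)
        return entropy(labels) - remainder
    return max(attributes, key=gain)

def build_tree(rows, labels, attributes):
    # Leaf node: all labels agree, or no attribute is left (use the majority label).
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]
    attr = best_attribute(rows, labels, attributes)
    node = {"attribute": attr, "children": {}}           # Step-4
    remaining = [a for a in attributes if a != attr]
    for value in set(r[attr] for r in rows):             # Step-3
        idx = [i for i, r in enumerate(rows) if r[attr] == value]
        node["children"][value] = build_tree(            # Step-5
            [rows[i] for i in idx], [labels[i] for i in idx], remaining)
    return node

def classify(tree, row):
    # Follow branches from the root until a leaf (a plain label) is reached.
    while isinstance(tree, dict):
        tree = tree["children"][row[tree["attribute"]]]
    return tree

# Hypothetical job-offer records.
rows = [
    {"salary": "low", "distance": "near", "cab": "yes"},
    {"salary": "low", "distance": "far", "cab": "no"},
    {"salary": "high", "distance": "near", "cab": "no"},
    {"salary": "high", "distance": "near", "cab": "yes"},
    {"salary": "high", "distance": "far", "cab": "yes"},
    {"salary": "high", "distance": "far", "cab": "no"},
    {"salary": "low", "distance": "near", "cab": "no"},
    {"salary": "low", "distance": "far", "cab": "yes"},
]
labels = ["decline", "decline", "accept", "accept", "accept", "decline", "decline", "decline"]

tree = build_tree(rows, labels, ["salary", "distance", "cab"])   # Step-1: start from the full dataset
print(tree)
```

The `classify` function assumes every attribute value of a new record was seen during training; a real implementation would need a fallback for unseen values.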

Example: Suppose there is a candidate who has a job offer and wants to decide whether to accept
the offer or not. To solve this problem, the decision tree starts with the root node (the Salary
attribute, chosen by ASM). The root node splits into the next decision node (distance from the
office) and one leaf node based on the corresponding labels. The next decision node splits
into one decision node (Cab facility) and one leaf node. Finally, that decision node splits into
two leaf nodes (Accepted offer and Declined offer). Consider the diagram below:

Attribute Selection Measures

While implementing a decision tree, the main issue is how to select the best attribute for
the root node and for the sub-nodes. To solve this problem there is a technique called the
Attribute Selection Measure, or ASM. Using this measure, we can easily select the best
attribute for the nodes of the tree. There are two popular techniques for ASM:

Information Gain

Gini Index

1. Information Gain:

Information gain is the measurement of changes in entropy after the segmentation of a dataset
based on an attribute.

It calculates how much information a feature provides us about a class.

According to the value of information gain, we split the node and build the decision tree.

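A decision tree algorithm always tries to maximize the value of information gain, and the
node/attribute with the highest information gain is split first. It can be calculated using the
formula below:

$\text{Information Gain} = \text{Entropy}(S) - [(\text{Weighted Avg}) \times \text{Entropy}(\text{each feature})]$

Here the weighted average is taken over the subsets produced by the split, each weighted by the
fraction of samples it contains.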

Entropy: Entropy is a metric that measures the impurity of a given attribute. It specifies the
randomness in the data. Entropy can be calculated as:

$\text{Entropy}(S) = -P(\text{yes}) \log_2 P(\text{yes}) - P(\text{no}) \log_2 P(\text{no})$

Where,

S = the set of all samples

P(yes) = probability of yes

P(no) = probability of no
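
The two formulas can be checked on a small example. The sketch below computes the entropy of a yes/no node and the information gain of a split, given (yes, no) counts. The counts are illustrative: a parent node with 9 yes and 5 no samples, split by a hypothetical attribute into three subsets.

```python
import math

def entropy(p_yes):
    """Entropy of a two-class node, given the probability of 'yes'."""
    p_no = 1.0 - p_yes
    total = 0.0
    for p in (p_yes, p_no):
        if p > 0:  # 0 * log2(0) is treated as 0
            total -= p * math.log2(p)
    return total

def information_gain(parent, children):
    """parent and each child are (n_yes, n_no) count pairs."""
    n_parent = sum(parent)
    gain = entropy(parent[0] / n_parent)
    for n_yes, n_no in children:
        n_child = n_yes + n_no
        # Subtract each subset's entropy, weighted by its share of the samples.
        gain -= (n_child / n_parent) * entropy(n_yes / n_child)
    return gain

# Illustrative counts only.
parent = (9, 5)
children = [(2, 3), (4, 0), (3, 2)]

parent_entropy = entropy(parent[0] / sum(parent))
gain = information_gain(parent, children)
print(parent_entropy, gain)
```

A pure subset such as (4, 0) has entropy 0 and therefore removes nothing from the gain, which is why splits that produce pure subsets are favoured.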

2. Gini Index:

Gini index is a measure of impurity or purity used while creating a decision tree in the
CART(Classification and Regression Tree) algorithm.

An attribute with a low Gini index should be preferred over one with a high Gini index.

The CART algorithm uses the Gini index to create splits, and it creates only binary splits.

The Gini index can be calculated using the formula below:

$\text{Gini Index} = 1 - \sum_j P_j^2$

where $P_j$ is the proportion of samples that belong to class j.
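
As a quick check of the formula, the sketch below computes the Gini index of a node from its class counts, and the weighted Gini index of a binary split. The counts are made up for illustration.

```python
def gini(counts):
    """Gini index of a node, given the number of samples in each class."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def weighted_gini(left, right):
    """Gini index of a binary split: child values weighted by child size."""
    n = sum(left) + sum(right)
    return sum(left) / n * gini(left) + sum(right) / n * gini(right)

# Hypothetical class counts (class A, class B).
pure_node = gini([10, 0])      # all samples in one class
mixed_node = gini([5, 5])      # evenly mixed two-class node
split_value = weighted_gini([4, 0], [1, 5])
print(pure_node, mixed_node, split_value)
```

A pure node gives 0, and an evenly mixed two-class node gives 0.5, the largest value possible with two classes.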

Pruning: Getting an Optimal Decision tree

Pruning is the process of deleting unnecessary nodes from a tree in order to get the optimal
decision tree.

A tree that is too large increases the risk of overfitting, while a small tree may not capture
all the important features of the dataset. A technique that decreases the size of the learned
tree without reducing accuracy is therefore known as pruning. There are mainly two types of
tree pruning techniques:

Cost Complexity Pruning

Reduced Error Pruning.

Advantages of the Decision Tree

It is simple to understand, as it follows the same process that a human follows while making a
decision in real life.

It can be very useful for solving decision-related problems.

It helps to think about all the possible outcomes for a problem.

There is less requirement of data cleaning compared to other algorithms.

Disadvantages of the Decision Tree

The decision tree contains lots of layers, which makes it complex.

It may have an overfitting issue, which can be resolved using the Random Forest algorithm.

For more class labels, the computational complexity of the decision tree may increase.

Classification tree
Classification tree methods (i.e., decision tree methods) are recommended when the data mining
task involves classification or prediction of outcomes, and the goal is to generate rules that can
be easily explained and translated into SQL or a natural query language.

A Classification tree labels, records, and assigns variables to discrete classes. A Classification tree can also
provide a measure of confidence that the classification is correct.

A Classification tree is built through a process known as binary recursive partitioning. This is an iterative
process of splitting the data into partitions, and then splitting it up further on each of the branches.

Initially, a Training Set is created where the classification label (e.g., purchaser or
non-purchaser) is known (pre-classified) for each record. Next, the algorithm systematically
assigns each record to one of two subsets on some basis (e.g., income > $75,000 or income <=
$75,000). The objective is to attain a homogeneous set of labels (e.g., purchaser or
non-purchaser) in each partition. This partitioning (splitting) is then applied to each of the
new partitions. The process continues until no more useful splits can be found. The heart of the
algorithm is the rule that determines the initial split (displayed in the following figure).

The process starts with a Training Set consisting of pre-classified records (target field or dependent
variable with a known class or label such as purchaser or non-purchaser). The goal is to build a tree that
distinguishes among the classes. For simplicity, assume that there are only two target classes, and that each
split is a binary partition. The partition (splitting) criterion generalizes to multiple classes, and any multi-
way partitioning can be achieved through repeated binary splits. To choose the best splitter at a node, the
algorithm considers each input field in turn. In essence, each field is sorted. Every possible split is tried
and considered, and the best split is the one that produces the largest decrease in diversity of the
classification label within each partition (i.e., the increase in homogeneity). This is repeated for all fields,
and the winner is chosen as the best splitter for that node. The process is continued at subsequent nodes
until a full tree is generated.
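
The search for the best splitter on one numeric field can be sketched as follows: sort the field, try a threshold between each pair of adjacent values, and keep the threshold that gives the largest decrease in impurity. The sketch uses the Gini index as the impurity measure; the income values (in thousands) and labels are made up, and placing thresholds at midpoints is an assumption.

```python
def gini(labels):
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(values, labels):
    """Try a threshold between every pair of adjacent sorted values of one field."""
    pairs = sorted(zip(values, labels))
    parent = gini(labels)
    best_threshold, best_decrease = None, 0.0
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold fits between two equal values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [l for v, l in pairs if v <= threshold]
        right = [l for v, l in pairs if v > threshold]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        decrease = parent - weighted
        if decrease > best_decrease:
            best_threshold, best_decrease = threshold, decrease
    return best_threshold, best_decrease

# Hypothetical training set: income in thousands and the known label.
incomes = [40, 55, 60, 70, 80, 95, 110, 130]
labels = ["non-purchaser"] * 4 + ["purchaser"] * 4

threshold, decrease = best_split(incomes, labels)
print(threshold, decrease)
```

In a full tree builder this search would be repeated for every input field, and the field with the largest decrease would become the splitter for the node.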

XLMiner uses the Gini index as the splitting criterion, which is a commonly used measure of inequality.
The index ranges between 0 and 1. A Gini index of 0 indicates that all records in the node
belong to the same category. The index approaches 1 when the records in the node are spread evenly over
many different categories. For a complete discussion of this index, please see Classification and
Regression Trees by Leo Breiman, Jerome Friedman, Richard Olshen, and Charles Stone (3).
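As a rough illustration of the splitting criterion, the sketch below computes the Gini index of a node as 1 minus the sum of squared class proportions, and the size-weighted Gini index of a binary split. The function names and the toy labels are hypothetical, and this is not XLMiner's implementation.

```python
from collections import Counter

def gini_index(labels):
    """Gini index of a node: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def split_gini(left, right):
    """Size-weighted Gini index of the two partitions produced by a split."""
    n = len(left) + len(right)
    return (len(left) / n) * gini_index(left) + (len(right) / n) * gini_index(right)

# Toy nodes (made-up labels): one pure, one evenly mixed.
pure = ["purchaser"] * 4
mixed = ["purchaser", "non-purchaser"] * 2

print(gini_index(pure))         # a pure node has index 0
print(gini_index(mixed))        # an evenly mixed two-class node has index 0.5
print(split_gini(pure, mixed))  # weighted index of a split producing both nodes
```

A candidate split is preferred when its weighted index is lower than the index of the parent node, which corresponds to the "largest decrease in diversity" described above.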

Regression Trees

What Are Regression Trees ?

By now you should understand what a decision tree is (the kind we used for the classification task).
A regression tree is a decision tree that is used for the task of regression: it predicts
continuous-valued outputs instead of discrete outputs.

Mean Square Error

In Decision Trees for Classification, we saw how the tree asks the right questions at the right
nodes in order to give accurate and efficient classifications. In Classification Trees this is done
using 2 measures, namely Entropy and Information Gain. But since we are now predicting
continuous variables, we cannot calculate the entropy and go through the same process. We need a
different measure, one that tells us how much our predictions deviate from the original
target. That is the entry point of mean square error.

fig 1.1: Mean Square Error

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(Y_i - \hat{Y}_i\right)^2$$

Y is the actual value and Y_hat is the prediction. We only care about how much the prediction
deviates from the target, not in which direction. So, we square the differences and divide their
sum by the total number of records n.

In the Regression Tree algorithm, we do the same thing as in Classification Trees, but we try to
reduce the Mean Square Error at each child rather than the entropy.

Building a Regression Tree

Let’s consider a dataset where we have 2 variables, as shown below

fig 2.1: Dataset, X is a continuous variable and Y is another continuous variable

fig 2.2: The actual dataset Table

We need to build a Regression Tree that best predicts Y given X.

Step 1

The first step is to sort the data based on X (in this case, it is already sorted). Then, take
the average of the first 2 rows in variable X (which is (1+2)/2 = 1.5 according to the given
dataset). Divide the dataset into 2 parts (Part A and Part B), separated by X < 1.5 and X ≥ 1.5.

Now, Part A consists of only one point, which is the first row (1,1), and all the other points are in
Part B. Take the average of all the Y values in Part A and the average of all the Y values in Part B
separately. These 2 values are the predicted outputs of the decision tree for X < 1.5 and X ≥ 1.5
respectively. Using the predicted and original values, calculate the mean square error and note it
down.

Step 2

In step 1, we calculated the average of the first 2 values of sorted X, split the dataset
at that point and calculated the predictions. We now repeat the same process, but this time with
the average of the second and third values of sorted X ((2+3)/2 = 2.5). We split the
dataset again into Part A and Part B based on X < 2.5 and X ≥ 2.5, predict the outputs and find
the mean square error as shown in step 1. This process is repeated for every subsequent pair of
adjacent values, up to the (n-1)th pair (where n is the number of records or rows in the
dataset).

Step 3

Now that we have n-1 mean squared errors calculated, we need to choose the point at which
to split the dataset. That point is the one which resulted in the lowest mean
squared error on splitting at it. In this case, the point is X = 5.5. Hence the tree will be split
into 2 parts: X < 5.5 and X ≥ 5.5. The root node is selected this way, and the data points that go
to the left child and right child of the root node are recursively passed through the same
algorithm for further splitting.

Brief Explanation of What the algorithm is doing

The basic idea behind the algorithm is to find the point in the independent variable at which to
split the dataset into 2 parts, so that the mean squared error is minimised at that point. The
algorithm does this repeatedly and forms a tree-like structure.
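Steps 1 to 3 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the dataset of fig 2.2 is not reproduced in the text, so the values of xs and ys below are made up (chosen so that the best split also falls at 5.5), and the function name best_split is hypothetical.

```python
def best_split(xs, ys):
    """Return (threshold, mse) of the single split with the lowest mean square error."""
    pairs = sorted(zip(xs, ys))                  # Step 1: sort the records by X
    n = len(pairs)
    best = None
    for i in range(1, n):                        # Steps 1-2: try the n-1 candidate splits
        if pairs[i - 1][0] == pairs[i][0]:
            continue                             # identical X values cannot be separated
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        part_a = [y for x, y in pairs if x < threshold]
        part_b = [y for x, y in pairs if x >= threshold]
        mean_a = sum(part_a) / len(part_a)       # prediction for X < threshold
        mean_b = sum(part_b) / len(part_b)       # prediction for X >= threshold
        squared_error = (sum((y - mean_a) ** 2 for y in part_a)
                         + sum((y - mean_b) ** 2 for y in part_b))
        mse = squared_error / n
        if best is None or mse < best[1]:        # Step 3: keep the lowest MSE
            best = (threshold, mse)
    return best

# Hypothetical data: Y jumps from about 1 to about 5 between X = 5 and X = 6.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [1.0, 1.2, 0.9, 1.1, 1.0, 5.0, 5.2, 4.9, 5.1, 5.0]

threshold, error = best_split(xs, ys)
print(threshold, error)
```

Applying best_split again to the records on each side of the chosen threshold gives the recursive construction of the tree.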

A regression tree for the above shown dataset would look like this

fig 3.1: The resultant Decision Tree

and the resultant prediction visualisation would be this

fig 3.2: The Decision Boundary

The logic behind the algorithm is not rocket science. All we are doing is splitting the
dataset at the points that best split it and minimise the mean square error.
We select these points by iteratively calculating the mean square error for all candidate splits
and choosing the split with the lowest MSE. So it is only natural that this works.

What happens when there are multiple independent variables ?

Let us consider that there are 3 independent variables, each similar to the variable X from fig 2.2.
At each node, all 3 variables go through the same process that X went through in the
above example. The data is sorted on each of the 3 variables separately, and the split point that
minimises the MSE is calculated for each variable. Out of the 3 variables and
their best split points, the one with the lowest MSE is chosen.

How are categorical variables handled ?

When the independent variables are continuous, we select the split point with
the lowest MSE using the iterative algorithm described above. When given a categorical variable,
we simply split it by asking a (usually binary) question. For example, let’s say we have a column
specifying the size of a tumor in categorical terms: Small, Medium and Large.

The tree would split the dataset based on whether tumor size = Small, tumor size = Medium or
tumor size = Large (it can also combine multiple values in some cases), choosing
whichever question reduces the MSE the most. That question becomes the top contender for this
variable (Tumor Size). The top contenders of the other variables are compared with it, and the
selection process is the same as described in “What happens when there are multiple
independent variables ?”

Dealing with Over-Fitting and When to stop building the Tree ?

Overfitting is a common problem for machine learning models, and Regression Trees are prone to it.

When we want to reduce the mean square error, the decision tree can recursively split the dataset
into a large number of subsets, to the point where a subset contains only one row or record. Even
though this might reduce the MSE to zero, it is obviously not a good thing.

This is the famous problem of overfitting, and it is a topic of its own. The basic takeaway is that
the model fits the existing data so perfectly that it fails to generalise to new data. We can use
cross-validation methods to avoid this.

One way to prevent this, with respect to Regression Trees, is to specify in advance the minimum
number of records or rows a leaf node can have. The exact number is not easily known when it comes
to large datasets, but cross-validation can be used for this purpose.

What is Decision Tree Pruning and Why is it Important?

Pruning is a technique that removes parts of the Decision Tree and prevents it from growing
to its full depth. The parts that it removes are those that provide little power
to classify instances. A Decision Tree that is trained to its full depth is highly likely to
overfit the training data, which is why pruning is important.

In simpler terms, the aim of Decision Tree Pruning is to construct a model that may perform
worse on training data but generalizes better on test data. Tuning the hyperparameters of your
Decision Tree model can improve its generalization considerably and save you time and money.

Pruning
How do you Prune a Decision Tree?

There are two types of pruning: Pre-pruning and Post-pruning. I will go through both of them and
how they work.

Pre-pruning
The pre-pruning technique of Decision Trees is tuning the hyperparameters prior to the
training pipeline. It involves the heuristic known as ‘early stopping’ which stops the growth of the
decision tree - preventing it from reaching its full depth.

It stops the tree-building process to avoid producing leaves with small samples. During each stage
of the splitting of the tree, the cross-validation error will be monitored. If the value of the error
does not decrease anymore - then we stop the growth of the decision tree.

The hyperparameters that can be tuned for early stopping and preventing overfitting are:

max_depth, min_samples_leaf, and min_samples_split


Tuning these same parameters can also yield a more robust model. However, you should be
cautious, as early stopping can also lead to underfitting.

Post-pruning
Post-pruning does the opposite of pre-pruning and allows the Decision Tree model to grow
to its full depth. Once the model grows to its full depth, tree branches are removed to prevent the
model from overfitting.

The algorithm will continue to partition data into smaller subsets until the final subsets produced
are similar in terms of the outcome variable. The final subset of the tree will consist of only a few
data points allowing the tree to have learned the data to the T. However, when a new data point is
introduced that differs from the learned data - it may not get predicted well.

The hyperparameter that can be tuned for post-pruning and preventing overfitting is: ccp_alpha

ccp stands for Cost Complexity Pruning and can be used as another option to control the size of a
tree. A higher value of ccp_alpha will lead to an increase in the number of nodes pruned.

Cost complexity pruning (post-pruning) steps:

1. Train your Decision Tree model to its full depth
2. Compute the ccp_alphas values using cost_complexity_pruning_path()
3. Train your Decision Tree model with different ccp_alphas values and compute train and
test performance scores
4. Plot the train and test scores for each value of ccp_alphas.

This hyperparameter can also be tuned to get the best-fitting model.
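The post-pruning steps can be sketched with scikit-learn, which provides ccp_alpha and cost_complexity_pruning_path() on its decision tree estimators. The synthetic dataset, the split and the random seeds below are assumptions made only for illustration, and step 4 (plotting) is replaced by a printed table.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, used only for illustration.
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Steps 1-2: grow the full tree internally and compute the candidate alpha values.
full_tree = DecisionTreeClassifier(random_state=0)
path = full_tree.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas = path.ccp_alphas

# Step 3: train one tree per alpha and record its size and train/test accuracy.
results = []
for alpha in ccp_alphas:
    alpha = max(float(alpha), 0.0)  # guard against tiny negative values from rounding
    tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha)
    tree.fit(X_train, y_train)
    results.append((alpha,
                    tree.tree_.node_count,
                    tree.score(X_train, y_train),
                    tree.score(X_test, y_test)))

# Step 4 would plot these scores; here they are simply printed.
for alpha, nodes, train_acc, test_acc in results:
    print(f"alpha={alpha:.4f}  nodes={nodes:3d}  train={train_acc:.3f}  test={test_acc:.3f}")
```

A reasonable choice of ccp_alpha is one where the test score is high while the node count is already much smaller than that of the full tree.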

Neural Network

An artificial neural network learning algorithm, or neural network, or just neural net, is a
computational learning system that uses a network of functions to understand and translate a
data input of one form into a desired output, usually in another form. The concept of the artificial
neural network was inspired by human biology and the way neurons of the human brain function
together to understand inputs from human senses.

Neural networks are just one of many tools and approaches used in machine learning algorithms.
The neural network itself may be used as a piece in many different machine learning algorithms to
process complex data inputs into a space that computers can understand.

Feed forward neural network

Introduction

Feedforward Neural Networks, also known as Deep Feedforward Networks or Multi-layer
Perceptrons, are the focus of this section. Other architectures build on them: for example,
Convolutional Neural Networks (which are used extensively in computer vision applications) and
Recurrent Neural Networks are based on these networks. We’ll do our best to grasp the key ideas in
an engaging and hands-on manner without having to delve too deeply into mathematics.

Search engines, machine translation, and mobile applications all rely on deep learning
technologies. Deep learning works by simulating the human brain in terms of identifying and
creating patterns from various types of input.

A feedforward neural network is a key component of this fantastic technology since it aids
software developers with pattern recognition and classification, non-linear regression, and function
approximation.

What is Feedforward Neural Network?

A feedforward neural network is a type of artificial neural network in which nodes’ connections do
not form a loop.

Often referred to as a multi-layered network of neurons, feedforward neural networks are so
named because all information flows in a forward manner only.

The data enters the input nodes, travels through the hidden layers, and eventually exits the output
nodes. The network is devoid of links that would allow the information exiting the output node to
be sent back into the network.
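The forward flow of data described above can be sketched as follows. This is a minimal illustration, assuming fully connected layers with a sigmoid activation; the 2-3-1 layer sizes and all weight and bias values are made up.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer_forward(inputs, weights, biases):
    """Output of one fully connected layer; weights[j][i] links input i to neuron j."""
    return [sigmoid(sum(w * x for w, x in zip(neuron_weights, inputs)) + b)
            for neuron_weights, b in zip(weights, biases)]

def feedforward(x, layers):
    """Pass x through every layer in order; nothing is ever fed back."""
    activation = x
    for weights, biases in layers:
        activation = layer_forward(activation, weights, biases)
    return activation

# Hypothetical network: 2 inputs -> 3 hidden neurons -> 1 output neuron.
layers = [
    ([[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]], [0.0, 0.1, -0.1]),
    ([[0.7, -0.5, 0.2]], [0.05]),
]

output = feedforward([1.0, 2.0], layers)
print(output)
```

Each layer only reads the output of the layer before it, which is exactly the absence of loops that defines a feedforward network.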

The purpose of feedforward neural networks is to approximate functions.

Here’s how it works

Consider a classifier given by y = f*(x), which assigns an input x to a category y.

The feedforward network defines a mapping y = f(x; θ) and learns the value of the parameters θ
that most closely approximates the function f*.

As shown in the Google Photos app, a feedforward neural network serves as the foundation for
object detection in photos.

A Feedforward Neural Network’s Layers

The following are the components of a feedforward neural network:

Input layer

It contains the neurons that receive input. The data is subsequently passed on to the next layer.
The input layer’s total number of neurons is equal to the number of variables in the dataset.

Hidden layer

This is the intermediate layer, which is concealed between the input and output layers. This layer
has a large number of neurons that apply transformations to the inputs. They then communicate with
the output layer.

Output layer

It is the last layer, and its form depends on how the model is built. The output layer produces
the predicted feature, i.e. the outcome the network is expected to deliver.

Neuron weights

Weights describe the strength of a connection between neurons. They are real-valued parameters,
often initialized to small values (for example between 0 and 1), that are adjusted during training.

Cost Function in Feedforward Neural Network

The cost function is an important component of a feedforward neural network. Generally, minor
adjustments to weights and biases have little effect on the number of correctly categorized data
points. A smooth cost function is therefore used instead, so that minor adjustments to weights and
biases produce measurable changes that show how to improve performance.

The mean square error cost function is defined as follows:

$$C(w, b) = \frac{1}{2n}\sum_{x}\lVert y(x) - a\rVert^{2}$$

Where,

w = weights collected in the network

b = biases

n = number of training inputs

a = output vectors (the network output when x is the input)

x = input

y(x) = desired output for input x

‖v‖ = usual length of vector v
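As a small sketch, the mean square error cost can be computed as below, assuming the commonly used form C(w, b) = 1/(2n) Σ ‖y(x) − a‖², where y(x) denotes the desired output for input x. The function name and the example vectors are hypothetical.

```python
def quadratic_cost(desired, outputs):
    """C = 1/(2n) * sum over training inputs of ||y(x) - a||^2."""
    n = len(desired)
    total = 0.0
    for y, a in zip(desired, outputs):
        # squared length of the difference vector y(x) - a
        total += sum((y_i - a_i) ** 2 for y_i, a_i in zip(y, a))
    return total / (2 * n)

# Two training inputs with 2-dimensional outputs (made-up values).
desired = [[1.0, 0.0], [0.0, 1.0]]
outputs = [[0.8, 0.1], [0.3, 0.6]]
print(quadratic_cost(desired, outputs))
```

The cost is zero only when every output vector equals its desired output, and it changes smoothly as the outputs move, which is what makes it usable for gradient-based training.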

Loss Function in Feedforward Neural Network

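A neural network’s loss function is used to identify if the learning process needs to be adjusted.

The output layer has as many neurons as there are classes. The cross-entropy loss measures the
difference between the predicted and actual probability distributions.

The cross-entropy loss for binary classification is as follows, where y is the true label (0 or 1)
and p is the predicted probability of class 1:

$$L = -\left[\,y\log(p) + (1 - y)\log(1 - p)\,\right]$$

The cross-entropy loss associated with multi-class categorization is as follows, where M is the
number of classes, y_c is 1 for the correct class and 0 otherwise, and p_c is the predicted
probability of class c:

$$L = -\sum_{c=1}^{M} y_{c}\log(p_{c})$$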
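Both losses can be sketched in plain Python, assuming the standard definitions: L = −[y log p + (1 − y) log(1 − p)] for the binary case and L = −Σ y_c log p_c for the multi-class case with a one-hot target. The function names and the small eps clipping constant are illustrative choices, not part of the original text.

```python
import math

def binary_cross_entropy(y, p, eps=1e-12):
    """y is the true label (0 or 1), p the predicted probability of class 1."""
    p = min(max(p, eps), 1.0 - eps)          # keep log() away from 0
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

def categorical_cross_entropy(y_onehot, probs, eps=1e-12):
    """y_onehot marks the correct class, probs holds one probability per class."""
    return -sum(y * math.log(max(p, eps)) for y, p in zip(y_onehot, probs))

print(binary_cross_entropy(1, 0.9))                            # confident and correct: small loss
print(binary_cross_entropy(1, 0.1))                            # confident and wrong: large loss
print(categorical_cross_entropy([0, 1, 0], [0.2, 0.7, 0.1]))   # equals -log(0.7)
```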

Gradient Learning Algorithm

The Gradient Descent algorithm repeatedly calculates the next point using the gradient at the
current position, scales it (by a learning rate) and subtracts the obtained value from the current
position (makes a step). It subtracts the value because we want to decrease the function (to
maximize it, we would add). This procedure may be written as:

$$p_{n+1} = p_{n} - \eta\,\nabla f(p_{n})$$

There’s a crucial parameter η which scales the gradient and hence determines the step size. In
machine learning, it is termed the learning rate and has a substantial effect on performance.

 The smaller the learning rate, the longer GD takes to converge; it may even reach the maximum
number of iterations before finding the optimal point
 If the learning rate is too large, the algorithm may not converge to the optimal point (it jumps
around) or may even diverge altogether.

In summary, the Gradient Descent method’s steps are:

1. pick a starting point (initialization)
2. compute the gradient at this point
3. make a scaled step in the opposite direction to the gradient (objective: minimize)
4. repeat steps 2 and 3 until one of the conditions is met:

 maximum number of iterations reached
 step size is smaller than the tolerance.

The gradient_descent function used below accepts the following five parameters:

1. Starting point – in our example, we specify it manually, but in practice it is frequently
determined randomly.
2. Gradient function – must be defined in advance
3. Learning rate – factor used to scale step sizes
4. Maximum iterations
5. Tolerance for the algorithm to be stopped on a conditional basis (in this case the default
value is 0.01)
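The body of this function is not shown in the text, so the following is only a sketch that matches the five parameters listed above and the summary of steps. The parameter names, and the choice to return both the list of visited points and the final point, are assumptions.

```python
def gradient_descent(start, gradient, learn_rate, max_iter, tol=0.01):
    """Minimise a univariate function given its gradient function.

    Returns (history, x): every visited point and the final point.
    """
    x = start
    history = [x]                        # 1. starting point
    for _ in range(max_iter):            # stop when the maximum iterations are reached
        step = learn_rate * gradient(x)  # 2. gradient at the current point, scaled
        if abs(step) < tol:              # stop when the step is smaller than the tolerance
            break
        x = x - step                     # 3. step against the gradient (minimise)
        history.append(x)
    return history, x

# Example call with the gradient 2x - 4, whose minimum lies at x = 2.
history, result = gradient_descent(9, lambda x: 2 * x - 4, 0.1, 100)
print(len(history), result)
```

Because the stopping rule looks at the step size, the returned point is close to the minimum but not exactly on it; a smaller tol gives a closer result at the cost of more iterations.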

Example- A quadratic function

Consider the following elementary quadratic function:

$$f(x) = x^{2} - 4x + 1$$

Because it is a univariate function, its gradient function is simply the derivative:

$$\frac{df(x)}{dx} = 2x - 4$$

Let us now write these functions in Python:

def func1(x):
    return x**2 - 4*x + 1

def gradient_func1(x):
    return 2*x - 4

With a learning rate of 0.1 and a starting point of x=9, we can simply compute each step manually
for this function. Let us begin with the first three steps:
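Applying the update x_{n+1} = x_n − η f'(x_n) with η = 0.1, x_0 = 9 and the gradient 2x − 4 defined above, the first three steps work out as follows (recomputed here from those definitions):

```latex
\begin{aligned}
x_0 &= 9 \\
x_1 &= 9 - 0.1\,(2 \cdot 9 - 4) = 9 - 1.4 = 7.6 \\
x_2 &= 7.6 - 0.1\,(2 \cdot 7.6 - 4) = 7.6 - 1.12 = 6.48 \\
x_3 &= 6.48 - 0.1\,(2 \cdot 6.48 - 4) = 6.48 - 0.896 = 5.584
\end{aligned}
```

Each step shrinks the distance to the minimum at x = 2 by a factor of 0.8 (7, 5.6, 4.48, 3.584), which is why the steps become steadily smaller.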

The python function is invoked as follows:

history, result = gradient_descent(9, gradient_func1, 0.1, 100)


The animation below illustrates the steps taken by the GD algorithm at learning rates of 0.1 and
0.8. As you can see, as the algorithm approaches the minimum, the steps become steadily smaller.
With the larger learning rate, the algorithm jumps from one side of the minimum to the other before
converging.

The first ten steps taken by GD for the small and the large learning rate.

The following diagram illustrates the trajectory, number of iterations, and final converged
result (within tolerance) for various learning rates:

The Need for a Neuron Model

Suppose the inputs to the network are pixel data from a scanned character. There are a few things
you need to keep in mind while designing a network to correctly classify a digit:

To see how the network learns, you need to be able to adjust the weights. Ideally, a small change
in a weight should cause only a small change in the output.

What if, on the other hand, a minor change in a weight results in a large change in the output?
The sigmoid neuron model is capable of resolving this issue.
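The difference can be seen in a small sketch that compares a sigmoid neuron with a hard-threshold (step) neuron. The step neuron is introduced here only for contrast and is an assumption, as are the single input, the zero bias and the weight values.

```python
import math

def step_neuron(w, x, b):
    """Hard threshold: the output flips between 0 and 1."""
    return 1 if w * x + b > 0 else 0

def sigmoid_neuron(w, x, b):
    """Smooth output between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-(w * x + b)))

x, b = 1.0, 0.0
for w in (-0.01, 0.01):  # a very small change in the weight
    print(w, step_neuron(w, x, b), round(sigmoid_neuron(w, x, b), 4))
```

The step neuron's output jumps from 0 to 1 for this tiny weight change, while the sigmoid neuron's output moves only slightly around 0.5, which is the behaviour gradient-based learning relies on.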


Applications of Feedforward Neural Network

These neural networks are utilized in a wide variety of applications. Some of these areas are
listed below:

 Physiological feedforward system: Here, feedforward control is exemplified by the normal
anticipatory regulation of the heartbeat in advance of exercise by the central autonomic nervous
system.
 Gene regulation and feedforward: A particular motif appears predominantly across the known
gene regulation networks, and this motif has been shown to be a feedforward system for
detecting non-temporary changes in the environment.
 Automating and managing machines
 Parallel feedforward compensation with derivative: This is a relatively recent technique for
turning the open-loop transfer function of a non-minimum-phase system into a minimum-phase
one.

Error Backpropagation

 The error backpropagation learning algorithm is a tool used during the training of neural
networks. The main goal is to compute the gradient of the loss function (also known as the
error function or cost function). These gradients are required for many optimization
routines such as stochastic gradient descent and its many variants.

How does Error Backpropagation Work?

 Essentially, calculating the gradients relies entirely on the rules of differential calculus. As
a neural network is a series of layers, for each data point the loss function is computed by
passing a labeled data point through the network (feed forward). Next, the gradients are
calculated starting from the final layer, and then, through use of the chain rule, they
are passed backwards to calculate the gradients in the previous layers. The goal is to
get the gradients of the loss function with respect to each model parameter (the weights of
each neural node connection as well as the bias weights). The point of this backwards
pass is to calculate the gradient at each layer more efficiently than the
traditional approach of calculating each layer’s gradient separately.

What are the Uses of Error Backpropagation?

 Backpropagation is especially useful for deep neural networks working on error-prone
projects, such as image or speech recognition. Taking advantage of the chain and power
rules allows backpropagation to function with any number of outputs and to better train all
sorts of neural networks.

The algorithm

Each training iteration of a neural network (NN) has two main stages

1. Forward pass/propagation

2. Backward pass/backpropagation (BP)

The BP stage has the following steps

 Evaluate error signal for each layer

 Use the error signal to compute error gradients

 Update layer parameters using the error gradients with an optimization algorithm such
as GD.

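The idea here is that the network estimates a target value during the forward pass. Then, we compute
how far our estimates are from the actual targets at the last layer (error signal δ_k). Finally, we
compute the error signal for each of the previous layers recursively.

Given an error function such as the sum-of-squares error, E_n = ½ Σ_k (y_nk - t_nk)², error
gradients of the last layer can be found using partial differentiation:

$$\frac{\partial E_n}{\partial w_{kj}} = \frac{\partial E_n}{\partial y_k}\, h'(a_k)\, \frac{\partial a_k}{\partial w_{kj}} = (y_{nk} - t_{nk})\, z_j$$

where z_j is the input signal that the connection w_kj carries into output node k.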

Note that h’(a_k) = 1 for a linear activation and because of this (∂E_n/∂y_k) = (∂E_n/∂a_k). The
index n has been dropped to keep the equation uncluttered. The quantity (y_nk - t_nk) is called the
error signal, δ_k, for the last layer. Therefore, the gradient for the parameter linking a
particular error signal and input signal is the product of the input signal and the error signal.
Using the chain rule, the error signal for the previous layer can be computed using the error
signal of the current layer. From the diagram above,

$$\delta_j = h'(a_j)\sum_{k} w_{kj}\,\delta_k$$

Note how the error signal for a node in the previous layer is obtained by taking a weighted sum of
all the error signals from the current layer nodes to which the previous layer node sends its
signals, i.e. a sum over the index k. This is why it is called error backpropagation. To see where
this sum over k comes from mathematically, note that ∂E_n/∂a_k is a Jacobian vector and ∂a_k/∂a_j
is a Jacobian matrix.

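As before and in general, the gradient for a parameter linking a particular error signal and input
signal in a layer is the product of the input signal and the error signal of the layer. In the
previous case,

$$\frac{\partial E_n}{\partial w_{ji}} = \delta_j\, z_i$$

where z_i denotes the input signal entering the layer on connection i. Similarly for the bias
parameters,

$$\frac{\partial E_n}{\partial b_j} = \delta_j$$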

Note that this is recursive, error signals from the current layer are used to evaluate error signals in a
previous layer. This is a very important piece in the puzzle, therefore let’s see how we are going to
vectorize and later implement this in code. We are going to assume batch training but the same
design can be used for online training by just setting batch size to 1.

For a layer with K outputs and P inputs, the layer weights are initialized with shape (PxK).
Therefore, for N inputs of dimension P, we get N outputs of dimension K. This is the layer forward
propagation step, as can be seen in the diagram. The bias, initialized with shape (K,), is not shown
here because it is broadcast over N and does not affect the output dimension.

The error signal, δ_k, therefore has the shape (NxK). For a reason we’ll see shortly, I transpose
this error signal and feed it into the backprop function as (KxN). Multiplying the layer parameters
(PxK) by the transposed error signal (KxN) performs the weighted sum over k for all the N patterns,
which is why the error signal is first transposed. Let’s call the (PxN) matrix obtained from this
step DM.

Now here is the tricky part: to complete the computation of the previous layer’s back-propagated
error signal, each node’s weighted sum has to be multiplied by h’(a_j), whose dimension is (NxP).
I’ll call this derivative matrix D. In matrix algebra, this is done for each input pattern n by
placing the n-th row of D on the diagonal of a matrix and multiplying that matrix by the
corresponding column of DM. I will call this diagonal matrix S_n. I’m going to zero-initialize a
matrix A of size (PxN) to accumulate the previous layer’s error signal iteratively.

How you build the matrices S_n and DM_n is up to you. You will see one way of doing this in the code
section. There may be more efficient ways that do not involve looping.

Now these error signals are passed on to the previous layer, L_k-1, to update its parameters. The
parameters of the current layer, L_k, are updated by the error signals that have been passed on to
this layer using its “Backprop” function.

One subtle thing to note is that the layer L_k-1 error signals are
computed before the layer L_k parameters are updated.

As explained before, to compute the layer parameter gradients, we multiply the error signal by the
input signal for that layer. I will call the weight gradients G_w and the bias gradients G_b. For a
layer with K outputs, matrix A is (KxN), and input signal I is (NxP)

$$G_w = (A\,I)^{\mathsf T} \;\;(P \times K), \qquad G_b = \sum_{n=1}^{N} A_{:,n} \;\;(K)$$

Note that the operations above sum over N; the effect of this is to accumulate the gradients over a
batch of size N.
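The shapes described above can be checked with a short NumPy sketch. It is one possible vectorization under stated assumptions, not the author's code: the function name layer_backprop is hypothetical, the previous layer is assumed to use a tanh activation, the current layer is linear, and the per-pattern diagonal matrices S_n are replaced by a single elementwise product, which gives the same result without a loop.

```python
import numpy as np

def layer_backprop(W, layer_input, delta, d_prev):
    """Backward step for one layer.

    W           : (P, K) layer weights
    layer_input : (N, P) input signal I of this layer
    delta       : (N, K) error signal of this layer
    d_prev      : (N, P) derivative h'(a_j) of the previous layer's activation
    """
    A = delta.T                        # (K, N), transposed as in the text
    DM = W @ A                         # (P, N): weighted sum over k for all N patterns
    delta_prev = (DM * d_prev.T).T     # (N, P): elementwise product instead of the S_n loop
    G_w = (A @ layer_input).T          # (P, K): weight gradients, summed over the batch
    G_b = A.sum(axis=1)                # (K,):   bias gradients, summed over the batch
    return delta_prev, G_w, G_b

# Made-up batch: N patterns, previous layer with P tanh units, K linear outputs.
rng = np.random.default_rng(0)
N, P, K = 4, 3, 2
a_prev = rng.normal(size=(N, P))       # pre-activations of the previous layer
z = np.tanh(a_prev)                    # input signal I of the current layer
W = rng.normal(size=(P, K))
b = np.zeros(K)
t = rng.normal(size=(N, K))            # targets

y = z @ W + b                          # forward pass of the linear output layer
delta = y - t                          # last-layer error signal (sum-of-squares error)
delta_prev, G_w, G_b = layer_backprop(W, z, delta, 1.0 - z ** 2)
print(delta_prev.shape, G_w.shape, G_b.shape)
```

The returned delta_prev is what would be handed to the previous layer's own backprop step, and it is computed from W before W is updated, in line with the ordering noted above.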

Regularization in Machine Learning

What is Regularization?

Regularization is one of the most important concepts of machine learning. It is a technique
to prevent the model from overfitting by adding extra information to it.

Sometimes the machine learning model performs well with the training data but does not perform
well with the test data. This means the model has fitted the noise in the training data and is not
able to predict the output correctly when it deals with unseen data; such a model is called
overfitted. This problem can be dealt with using a regularization technique.

This technique allows us to keep all the variables or features in the model by reducing the
magnitude of their coefficients. Hence, it maintains accuracy as well as the
generalization of the model.

It mainly regularizes or shrinks the coefficients of the features toward zero. In simple words, "In
the regularization technique, we reduce the magnitude of the feature coefficients while keeping the
same number of features."

How does Regularization Work?

Regularization works by adding a penalty or complexity term to the cost function of a complex
model. Let's consider the simple linear regression equation:

y = β1x1 + β2x2 + β3x3 + ⋯ + βnxn + b

In the above equation, y represents the value to be predicted,

x1, x2, …, xn are the features for y,

β1, β2, …, βn are the weights or magnitudes attached to the features, respectively, and b
represents the intercept (the bias of the model).

Linear regression models try to optimize the weights β1, …, βn and b to minimize the cost function.
The cost function for the linear model is the residual sum of squares (RSS) over the M training
samples:

$$\mathrm{RSS} = \sum_{i=1}^{M}\left(y_i - \hat{y}_i\right)^2 = \sum_{i=1}^{M}\Bigl(y_i - b - \sum_{j=1}^{n}\beta_j x_{ij}\Bigr)^2$$

Regularization adds a penalty term to this loss function and then optimizes the parameters, so that
the model can predict the value of y accurately on unseen data.

Techniques of Regularization

There are mainly two types of regularization techniques, which are given below:

o Ridge Regression
o Lasso Regression

Ridge Regression
o Ridge regression is one of the types of linear regression in which a small amount of bias is
introduced so that we can get better long-term predictions.
o Ridge regression is a regularization technique, which is used to reduce the complexity of
the model. It is also called L2 regularization.
o In this technique, the cost function is altered by adding a penalty term to it. The amount
of bias added to the model is called the Ridge Regression penalty. We calculate it by
multiplying lambda (λ) by the sum of the squared weights of the individual features.
o The equation for the cost function in ridge regression will be (M is the number of training
samples and ŷ_i is the prediction for sample i):

$$\sum_{i=1}^{M}\left(y_i - \hat{y}_i\right)^2 + \lambda\sum_{j=1}^{n}\beta_j^{2}$$

o In the above equation, the penalty term regularizes the coefficients of the model, and hence
ridge regression reduces the magnitudes of the coefficients, which decreases the complexity of
the model.
o As we can see from the above equation, if the value of λ tends to zero, the equation
becomes the cost function of the linear regression model. Hence, for very small values
of λ, the model will resemble the linear regression model.
o A general linear or polynomial regression will fail if there is high collinearity between the
independent variables, so to solve such problems, Ridge regression can be used.
o It helps to solve the problems if we have more parameters than samples.

Lasso Regression:
o Lasso regression is another regularization technique to reduce the complexity of the model.
It stands for Least Absolute Shrinkage and Selection Operator.
o It is similar to Ridge Regression except that the penalty term contains the absolute
values of the weights instead of their squares.
o Since it takes absolute values, it can shrink a slope all the way to 0, whereas Ridge
Regression can only shrink it close to 0.
o It is also called L1 regularization. The equation for the cost function of Lasso
regression will be (M is the number of training samples and ŷ_i is the prediction for sample i):

$$\sum_{i=1}^{M}\left(y_i - \hat{y}_i\right)^2 + \lambda\sum_{j=1}^{n}\lvert\beta_j\rvert$$

o Some of the features in this technique are completely neglected for model evaluation.
o Hence, Lasso regression can help us to reduce the overfitting in the model as well as
perform feature selection.

Key Difference between Ridge Regression and Lasso Regression

o Ridge regression is mostly used to reduce the overfitting in the model, and it includes all
the features present in the model. It reduces the complexity of the model by shrinking the
coefficients.
o Lasso regression helps to reduce the overfitting in the model as well as feature selection.
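The two penalties can be compared with a small sketch of the cost functions, assuming the forms RSS + λ Σ β_j² for ridge and RSS + λ Σ |β_j| for lasso, with the intercept b left unpenalized. The function names and the tiny dataset are hypothetical, and the code only evaluates the costs; it does not fit a model.

```python
def rss(X, y, beta, b):
    """Residual sum of squares of the linear model y_hat = b + sum_j beta_j * x_j."""
    total = 0.0
    for row, target in zip(X, y):
        prediction = b + sum(w * x for w, x in zip(beta, row))
        total += (target - prediction) ** 2
    return total

def ridge_cost(X, y, beta, b, lam):
    """RSS plus the L2 penalty: lambda times the sum of squared weights."""
    return rss(X, y, beta, b) + lam * sum(w ** 2 for w in beta)

def lasso_cost(X, y, beta, b, lam):
    """RSS plus the L1 penalty: lambda times the sum of absolute weights."""
    return rss(X, y, beta, b) + lam * sum(abs(w) for w in beta)

# Made-up data that the weights below fit exactly, so only the penalties remain.
X = [[1.0, 2.0], [2.0, 0.0], [3.0, 1.0]]
y = [0.5, 4.5, 5.5]
beta, b = [2.0, -1.0], 0.5

print(rss(X, y, beta, b))
print(ridge_cost(X, y, beta, b, lam=0.1))
print(lasso_cost(X, y, beta, b, lam=0.1))
```

With lam set to 0 both costs reduce to the plain RSS, which matches the remark that a very small λ makes the model resemble ordinary linear regression.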

Introduction to Kernel Methods

o Kernels or kernel methods (also called kernel functions) are a family of
algorithms used for pattern analysis. They are used to solve a non-linear
problem by using a linear classifier. Kernel methods are employed in SVMs (Support
Vector Machines), which are used in classification and regression problems. The SVM uses
what is called the “Kernel Trick”, where the data is transformed and an optimal boundary is
found between the possible outputs.

The Need for Kernel Method and its Working

o Before we get into the working of kernel methods, it is important to understand
support vector machines (SVMs), because kernels are implemented in SVM models.
Support Vector Machines are supervised machine learning algorithms that are used
in classification and regression problems, such as classifying an apple into the class fruit
and a lion into the class animal.
o To demonstrate, below is what support vector machines look like:

o Here we can see a hyperplane which is separating the green dots from the blue ones. A
hyperplane has one dimension less than the ambient space. E.g. in the above figure, the ambient
space has 2 dimensions, but the line which divides or classifies the
space has one dimension less than the ambient space and is called a hyperplane.
o But what if we have input like this:

o It is very difficult to solve this classification problem using a linear classifier, as there is
no straight line that can separate the red and the green dots given the way the points are
distributed. Here comes the use of the kernel function, which takes the points to
higher dimensions, solves the problem there and returns the output. Think of it
this way: we can see that the green dots are enclosed within some perimeter while the red
ones lie outside it; likewise, there could be other scenarios where the green dots might be
distributed in a trapezoid-shaped area.
o So what we do is convert the two-dimensional plane, which was first classified by a
one-dimensional hyperplane (a straight line), into a three-dimensional space. Here our
classifier, i.e. the hyperplane, will not be a straight line but a two-dimensional plane which
cuts through the space.
o To get a mathematical understanding of kernels, consider Lili Jiang’s
definition of a kernel, which is
o K(x, y) = <f(x), f(y)> where,
K is the kernel function,
x and y are n-dimensional inputs,
f is a map from n-dimensional to m-dimensional space and,
<x, y> denotes the dot product.
o Illustration with the help of an example.
o Let us say that we have two points, x= (2, 3, 4) and y= (3, 4, 5)
o As we have seen, K(x, y) = < f(x), f(y) >.
o Let us first calculate <f(x), f(y)>, taking f to be the map onto all pairwise products of
the coordinates:
o f(x) = (x1x1, x1x2, x1x3, x2x1, x2x2, x2x3, x3x1, x3x2, x3x3)
f(y) = (y1y1, y1y2, y1y3, y2y1, y2y2, y2y3, y3y1, y3y2, y3y3)
so,
f(2, 3, 4) = (4, 6, 8, 6, 9, 12, 8, 12, 16) and
f(3, 4, 5) = (9, 12, 15, 12, 16, 20, 15, 20, 25)
so the dot product is
f(x) . f(y) = f(2, 3, 4) . f(3, 4, 5)
= 36 + 72 + 120 + 72 + 144 + 240 + 120 + 240 + 400
= 1444
And, using the kernel K(x, y) = (<x, y>)^2,
K(x, y) = (2*3 + 3*4 + 4*5)^2 = (6 + 12 + 20)^2 = 38*38 = 1444.
o As we can see, f(x).f(y) and K(x, y) give the same result, but the former method
required many more calculations (because of projecting 3 dimensions into 9 dimensions), while
using the kernel, it was much easier.
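
The worked example above can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the function names `feature_map` and `kernel` are illustrative and not part of the original notes.

```python
import numpy as np

def feature_map(v):
    """Explicit map f: all pairwise products v_i * v_j (3 -> 9 dimensions)."""
    return np.outer(v, v).ravel()

def kernel(x, y):
    """Kernel K(x, y) = (x . y)^2, computed directly in the original space."""
    return np.dot(x, y) ** 2

x = np.array([2, 3, 4])
y = np.array([3, 4, 5])

explicit = np.dot(feature_map(x), feature_map(y))  # dot product in 9 dimensions
implicit = kernel(x, y)                            # same value, no mapping needed
print(explicit, implicit)
```

The explicit route needs 9 multiplications per mapped vector plus a 9-term dot product, while the kernel route needs only a 3-term dot product and one squaring.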
o Types of Kernels Used in SVM
o Let us look at some of the kernel functions that are used in SVMs:

1. Linear Kernel
o Let us say that we have two vectors named x1 and x2; then the linear kernel is defined
by the dot product of these two vectors:
o K(x1, x2) = x1 . x2

2. Polynomial Kernel
o A polynomial kernel is defined by the following equation:
o K(x1, x2) = (x1 . x2 + 1)^d,
o Where,
o d is the degree of the polynomial and x1 and x2 are vectors

3. Gaussian Kernel
o This kernel is an example of a radial basis function kernel. It is commonly written as:
o K(x, y) = exp(-||x - y||^2 / (2σ^2))

o The parameter sigma plays a very important role in the performance of the Gaussian kernel.
It should be neither overestimated nor underestimated, and should be carefully tuned
according to the problem.

4. Exponential Kernel
o This kernel is closely related to the previous one, the Gaussian kernel; the only
difference is that the norm is not squared.
o The exponential kernel is commonly written as:
o K(x, y) = exp(-||x - y|| / (2σ^2))
o This is also a radial basis function kernel.

5. Laplacian Kernel
o This kernel is equivalent to the exponential kernel discussed above, except that it is
less sensitive to changes in the sigma parameter. The Laplacian kernel is commonly written as:
o K(x, y) = exp(-||x - y|| / σ)

6. Hyperbolic or the Sigmoid Kernel


o This kernel comes from the neural network area of machine learning. The activation
function for the sigmoid kernel is the bipolar sigmoid function. The
hyperbolic kernel function is commonly written as:
o K(x, y) = tanh(α (x . y) + c), where α is a slope parameter and c is a constant offset.
o This kernel is widely used and popular among support vector machines.

7. ANOVA Radial Basis Kernel


o This kernel is known to perform very well in multidimensional regression problems, just
like the Gaussian and Laplacian kernels. It also belongs to the family of radial basis
kernels.
o The ANOVA kernel is commonly written as:
o K(x, y) = Σ_{k=1..n} exp(-σ (x_k - y_k)^2)^d

o There are many more types of kernel; we have discussed the most commonly used
ones. Which kernel function to use depends entirely on the type of problem to
be solved.
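
Several of the kernels listed above can be sketched in a few lines of NumPy. This is an illustrative sketch only: it assumes the commonly cited forms of each kernel, and the function names and default parameter values (`d`, `sigma`, `alpha`, `c`) are assumptions chosen for the example, not values from the notes.

```python
import numpy as np

def linear_kernel(x, y):
    # K(x, y) = x . y
    return float(np.dot(x, y))

def polynomial_kernel(x, y, d=2):
    # K(x, y) = (x . y + 1)^d
    return float((np.dot(x, y) + 1) ** d)

def gaussian_kernel(x, y, sigma=1.0):
    # K(x, y) = exp(-||x - y||^2 / (2 sigma^2)), assumed standard form
    return float(np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2)))

def laplacian_kernel(x, y, sigma=1.0):
    # K(x, y) = exp(-||x - y|| / sigma), assumed standard form
    return float(np.exp(-np.linalg.norm(x - y) / sigma))

def sigmoid_kernel(x, y, alpha=0.01, c=0.0):
    # K(x, y) = tanh(alpha * (x . y) + c), assumed standard form
    return float(np.tanh(alpha * np.dot(x, y) + c))

x = np.array([2.0, 3.0, 4.0])
y = np.array([3.0, 4.0, 5.0])
for k in (linear_kernel, polynomial_kernel, gaussian_kernel,
          laplacian_kernel, sigmoid_kernel):
    print(k.__name__, k(x, y))
```

The radial kernels (Gaussian and Laplacian) return 1 when the two inputs coincide and decay towards 0 as the inputs move apart, which is why sigma needs tuning to the scale of the data.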

Dual Representation

Recall that a linear classifier predicts $\hat{Y}(\vec{x}) = \operatorname{sgn}(b + \vec{x} \cdot \vec{w})$. That is, it assumes that
the data can be separated by the plane with normal vector $\vec{w}$, offset a distance $b$ from the origin.
We have been looking at the problem of learning linear classifiers as the problem of selecting good
weights $\vec{w}$ for input features. This is called the primal representation, and we’ve seen several
ways to do it: the prototype method, the perceptron algorithm, logistic regression, etc. The
weights $\vec{w}$ in the primal representation are weights on the features, and functions of the training
vectors $\vec{x}_i$. A dual representation gives weights to the training vectors, which are (implicitly)
functions of the features. That is, the classifier predicts

$\hat{Y}(\vec{x}) = \operatorname{sgn}\left(\beta + \sum_{i=1}^{n} \alpha_i y_i (\vec{x}_i \cdot \vec{x})\right)$ (1)

where $\alpha_i$ are now weights over the training data. We can always find such dual representations
when $\vec{w}$ is a linear function of the vectors, as in the perceptron or the prototype method. But we
could also use them directly. (The perceptron algorithm can be run in the dual representation. Start
with $\beta = 0$, $\vec{\alpha} = 0$. Go over the training vectors; if $\vec{x}_i$ is misclassified, increase $\alpha_i$ by 1, and set
$\beta \leftarrow \beta + y_i R^2$. If any training vector was misclassified, repeat the loop; exit when there are no
misclassifications.)

There are a couple of things to notice about dual representations like equation 1.

1. We need to learn the $n$ weights in $\vec{\alpha}$, not the $p$ weights in $\vec{w}$. This can help when $p \gg n$.
2. The training vector $\vec{x}_i$ appears in the prediction function only in the form of its inner product with the
test vector $\vec{x}$, $\vec{x}_i \cdot \vec{x} = \sum_{j=1}^{p} x_{ij} x_j$.
3. We can have $\alpha_i = 0$ for some $i$. If $\alpha_i \neq 0$, then $\vec{x}_i$ is a
support vector. The fewer support vectors there are, the more sparse the solution is.

The first two attributes of the dual representation play into the kernel trick. The third, unsurprisingly, turns up
in the support vector machine.
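
The dual perceptron loop described above can be sketched as follows. This is a minimal illustration rather than a reference implementation: the text does not define R, so the sketch assumes the common convention that R is the largest norm among the training vectors, and the `max_epochs` cap and the toy data are additions for the example.

```python
import numpy as np

def dual_perceptron(X, y, max_epochs=100):
    """Learn dual weights alpha (one per training vector) and offset beta."""
    n = X.shape[0]
    gram = X @ X.T                          # all inner products x_i . x_j
    R2 = np.max(np.sum(X ** 2, axis=1))     # assumption: R = max_i ||x_i||
    alpha = np.zeros(n)
    beta = 0.0
    for _ in range(max_epochs):             # cap added so the loop always ends
        mistakes = 0
        for i in range(n):
            score = beta + np.sum(alpha * y * gram[:, i])
            if y[i] * score <= 0:           # x_i is misclassified
                alpha[i] += 1
                beta += y[i] * R2
                mistakes += 1
        if mistakes == 0:
            break
    return alpha, beta

def predict(X_train, y_train, alpha, beta, x):
    """Equation (1): sign of beta + sum_i alpha_i y_i (x_i . x)."""
    return int(np.sign(beta + np.sum(alpha * y_train * (X_train @ x))))

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1, 1, -1, -1])
alpha, beta = dual_perceptron(X, y)
print(alpha, beta, predict(X, y, alpha, beta, np.array([4.0, 4.0])))
```

Note that the training data enter only through the Gram matrix of inner products, and that training vectors which were never misclassified keep a weight of zero.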

Radial Basis Function Networks

Radial Basis Function (RBF) networks are a special class of feed-forward neural networks consisting of
three layers: an input layer, a hidden layer, and an output layer. This is fundamentally different
from most neural network architectures, which are composed of many layers and bring about
nonlinearity by repeatedly applying non-linear activation functions. The input layer receives input
data and passes it into the hidden layer, where the computation occurs. The hidden layer of a Radial
Basis Function network is its most powerful part and is very different from that of most neural
networks. The output layer is designated for prediction tasks like classification or regression.

How Do RBF Networks Work?

RBF neural networks are conceptually similar to K-Nearest Neighbor (k-NN) models,
though the implementation of the two models is starkly different. The fundamental idea of Radial
Basis Functions is that an item's predicted target value is likely to be the same as that of other items with
close values of the predictor variables. An RBF network places one or many RBF neurons in the
space described by the predictor variables. The space has as many dimensions as
there are predictor variables. We calculate the Euclidean distance from the evaluated
point to the center of each neuron. A Radial Basis Function (RBF), also known as a kernel function,
is applied to the distance to calculate every neuron's weight (influence). The Radial
Basis Function is so named because the radius distance is the argument to the function:

Weight = RBF(distance)

The greater the distance of a neuron from the point being evaluated, the less
influence (weight) it has.

Radial Basis Functions

A Radial Basis Function is a real-valued function whose value depends only on the distance
from the origin or, more generally, from a fixed center point. Although we use various types of
radial basis functions, the Gaussian function is the most common.

With more than one predictor variable, the Radial Basis Function neural network has
the same number of dimensions as there are variables. For example, if three neurons are placed in a
space with two predictor variables, each neuron's RBF is evaluated at the new point. We calculate the
best predicted value for the new point by summing the output values of the RBF functions multiplied by
the weights computed for each neuron.

The radial basis function for a neuron consists of a center and a radius (also called the spread). The
radius may vary between different neurons. In DTREG-generated RBF networks, each dimension's
radius can differ.

As the spread grows larger, neurons at a distance from a point have more influence.
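
The effect of distance and spread on a neuron's influence can be illustrated with a Gaussian RBF. This is a small sketch assuming the Gaussian form exp(-distance^2 / (2 * spread^2)); the function name and the sample distances are illustrative.

```python
import numpy as np

def gaussian_rbf(distance, spread):
    """Influence of a neuron as a function of distance from its center."""
    return np.exp(-(distance ** 2) / (2 * spread ** 2))

distances = np.array([0.0, 1.0, 2.0, 4.0])
narrow = gaussian_rbf(distances, spread=1.0)  # influence drops off quickly
wide = gaussian_rbf(distances, spread=3.0)    # distant neurons still matter
print(narrow)
print(wide)
```

At distance 0 both give an influence of 1; at every other distance the wider spread yields the larger value.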

RBF Network Architecture

The typical architecture of a radial basis functions neural network consists of an input layer,
hidden layer, and summation layer.

Input Layer

The input layer consists of one neuron for every predictor variable. The input neurons pass the
value to each neuron in the hidden layer. N-1 neurons are used for categorical values, where N
denotes the number of categories. The range of values is standardized by subtracting the median
and dividing by the interquartile range.
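
The standardization step described above can be sketched as follows. This is an illustrative helper, not the implementation of any particular package; the name `robust_standardize` is hypothetical, and quartiles are taken with NumPy's default percentile interpolation.

```python
import numpy as np

def robust_standardize(values):
    """Subtract the median and divide by the interquartile range (IQR)."""
    values = np.asarray(values, dtype=float)
    median = np.median(values)
    q1, q3 = np.percentile(values, [25, 75])
    return (values - median) / (q3 - q1)

print(robust_standardize([1, 2, 3, 4, 5]))
```

Using the median and IQR rather than the mean and standard deviation makes the scaling less sensitive to outliers.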

Hidden Layer

The hidden layer contains a variable number of neurons (the ideal number determined by the
training process). Each neuron comprises a radial basis function centered on a point. The number
of dimensions coincides with the number of predictor variables. The radius or spread of the RBF
function may vary for each dimension.

When an x vector of input values is fed from the input layer, a hidden neuron calculates the
Euclidean distance between the test case and the neuron's center point. It then applies the kernel
function using the spread values. The resulting value gets fed into the summation layer.

Output Layer or Summation Layer

The value obtained from each hidden-layer neuron is multiplied by a weight associated with that neuron and
passed to the summation layer. There the weighted values are added up, and the sum is presented as the
network's output. Classification problems have one output per target category, the value being the
probability that the case being evaluated belongs to that category.

The Input Vector

It is the n-dimensional vector that you're attempting to classify. The whole input vector is
presented to each of the RBF neurons.

The RBF Neurons

Every RBF neuron stores a prototype vector (also known as the neuron's center) from amongst the
vectors of the training set. An RBF neuron compares the input vector with its prototype, and
outputs a value between 0 and 1 as a measure of similarity. If an input is the same as the prototype,
the neuron's output will be 1. As the input and prototype difference grows, the output falls
exponentially towards 0. The shape of the response by the RBF neuron is a bell curve. The
response value is also called the activation value.

The Output Nodes

The network's output comprises one node for each category you're trying to classify. Each
output node computes a score for its category. Generally, we make a classification
decision by assigning the input to the category with the highest score.

The score is calculated based on a weighted sum of the activation values from all RBF neurons. It
usually gives a positive weight to the RBF neuron belonging to its category and a negative weight
to others. Each output node has its own set of weights.
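
The flow from input vector to RBF activations to category scores can be sketched as below. This is a minimal illustration assuming a Gaussian activation of the form exp(-beta * ||x - mu||^2); the prototypes, beta values and output weights are made-up numbers for the example, not trained values.

```python
import numpy as np

def rbf_activations(x, prototypes, beta):
    """Similarity of x to each prototype: 1 at the prototype, falling towards 0."""
    sq_dist = np.sum((prototypes - x) ** 2, axis=1)
    return np.exp(-beta * sq_dist)

def rbf_scores(x, prototypes, beta, weights):
    """One score per category: weighted sum of all RBF activations."""
    return weights @ rbf_activations(x, prototypes, beta)

# Hypothetical network: two prototypes per category.
prototypes = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [6.0, 5.0]])
beta = np.array([0.5, 0.5, 0.5, 0.5])
# Row k holds the weights of output node k: positive for its own
# category's neurons, negative for the others.
weights = np.array([[1.0, 1.0, -1.0, -1.0],
                    [-1.0, -1.0, 1.0, 1.0]])

x = np.array([0.5, 0.5])
scores = rbf_scores(x, prototypes, beta, weights)
predicted = int(np.argmax(scores))   # category with the highest score
print(scores, predicted)
```

Each output node uses its own row of weights, so the same activations produce a different score for every category.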

Radial Basis Function Example

Let us consider an example of a fully trained Radial Basis Function network.

A dataset has two-dimensional data points belonging to two separate classes. An RBF Network
has been trained with 20 RBF neurons on the said data set. We can mark the prototypes selected
and view the category one score on the input space. For viewing, we can draw a 3-D mesh or a
contour plot.

The areas of highest and lowest category one score should be marked separately.

In the case of category one output node:

 All the weights for category 2 RBF neurons will be negative.

 All the weights for category 1 RBF neurons will be positive.

Finally, an approximation of the decision boundary can be plotted by computing the scores over a
finite grid.

Training the RBFN

The training process includes selecting these parameters:

 The prototype (mu)

 Beta coefficient for every RBF neuron, and

 The matrix of output weights between the neurons and output nodes.

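There are several approaches to selecting prototypes, along with variations on them, such as creating an RBF
neuron for every training example or randomly choosing k prototypes from the training data.

When the prototypes are obtained by clustering the training data, the beta coefficients can be specified by
setting sigma equal to the average distance between the points in the cluster and the cluster center; with a
Gaussian activation, a common convention is then beta = 1 / (2 * sigma^2).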

Output weights can be trained using gradient descent.
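
The three training steps above can be sketched end to end. This is a simplified illustration under several assumptions: cluster assignments are taken as given (a stand-in for the output of a clustering step), prototypes are cluster means, beta = 1 / (2 * sigma^2), and output weights are fitted by batch gradient descent on squared error. All names, the learning rate and the step count are hypothetical.

```python
import numpy as np

def activations(X, prototypes, betas):
    """Gaussian activations of every prototype for every row of X."""
    sq = ((X[:, None, :] - prototypes[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-betas[None, :] * sq)

def fit_rbf_network(X, labels, clusters, n_classes, lr=0.5, steps=200):
    ids = np.unique(clusters)
    # 1. Prototypes (mu): one per cluster, here the cluster mean.
    prototypes = np.array([X[clusters == c].mean(axis=0) for c in ids])
    # 2. Beta: sigma is the average distance of cluster points to the center.
    #    (Each cluster needs at least two distinct points so sigma > 0.)
    sigmas = np.array([np.linalg.norm(X[clusters == c] - p, axis=1).mean()
                       for c, p in zip(ids, prototypes)])
    betas = 1.0 / (2.0 * sigmas ** 2)
    # 3. Output weights: gradient descent on mean squared error.
    phi = activations(X, prototypes, betas)
    targets = np.eye(n_classes)[labels]          # one-hot targets
    weights = np.zeros((n_classes, len(ids)))
    for _ in range(steps):
        error = phi @ weights.T - targets
        weights -= lr * (error.T @ phi) / len(X)
    return prototypes, betas, weights

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1],
              [5, 5], [5, 6], [6, 5], [6, 6]], dtype=float)
labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])
clusters = labels.copy()   # toy case: one cluster per class
prototypes, betas, weights = fit_rbf_network(X, labels, clusters, n_classes=2)
predictions = np.argmax(activations(X, prototypes, betas) @ weights.T, axis=1)
print(predictions)
```

Because the hidden layer is fixed once the prototypes and betas are chosen, fitting the output weights is a linear problem, which is one reason RBF networks tend to train quickly.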

Advantages of RBFN

 Easy Design

 Good Generalization

 Faster Training

 Only one hidden layer

 A straightforward interpretation of the meaning or function of each node in the hidden layer

Our Learners Also Ask

1. What is the radial basis function neural network used for?

Radial Basis Function neural networks are artificial neural networks commonly used for
function approximation, classification, and regression problems. The same radial basis functions also
serve as kernels in support vector machine classification.

2. What is the role of the radial basis?

Radial basis functions provide ways to approximate multivariable functions by using linear
combinations of terms that are based on a single univariate function.

3. What is the radial basis function in ML?

A Radial Basis Function (RBF) is a real-valued function whose value depends only on the distance
between the input and a certain fixed point. In supervised machine learning (ML), RBFs are used to
build non-linear classifiers.

4. What is the advantage of the RBF neural network?

The main advantages of the RBF neural network are:

 Easy Design

 Good Generalization

 Faster Training

 Only one hidden layer

 Strong tolerance to input noise

 Easy interpretation of the meaning or function of each node in the hidden layer

5. What is the difference between RBF and MLP?

Multilayer perceptron (MLP) and Radial Basis Function (RBF) networks are both popular
feed-forward neural network architectures. The main differences between RBF and MLP are:

MLP consists of one or several hidden layers, while RBF consists of just one hidden layer.

An RBF network typically learns faster than an MLP. In MLP, training is usually done
through backpropagation for every layer. In RBF, training can be done either through
backpropagation or through hybrid learning specific to RBF networks.

Ensemble Methods

Ensemble Methods, what are they? Ensemble methods are a machine learning technique that
combines several base models in order to produce one optimal predictive model. To better
understand this definition, let's take a step back to the ultimate goal of machine learning and model
building. This is going to make more sense as I dive into specific examples and why Ensemble
methods are used.

I will largely utilize Decision Trees to outline the definition and practicality of Ensemble Methods
(however it is important to note that Ensemble Methods do not only pertain to Decision Trees).

A Decision Tree determines the predicted value based on a series of questions and conditions. For
instance, consider a simple Decision Tree that determines whether an individual should play outside or
not. The tree takes several weather factors into account and, given each factor, either makes a
decision or asks another question. In this example, every time it is overcast, we will play outside.
However, if it is raining, we must ask whether it is windy. If windy, we will not play. But given no
wind, tie those shoelaces tight because we're going outside to play.
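
The branches described above can be written as nested conditions. This sketch covers only the two branches spelled out in the text (overcast and rain); the function name and the string values are illustrative, and any other outlook is deliberately left undecided because the text does not describe it.

```python
def play_outside(outlook, windy):
    """Follow the weather decision tree for the branches described in the text."""
    if outlook == "overcast":
        return True              # overcast: always play
    if outlook == "rain":
        return not windy         # rain: play only when there is no wind
    return None                  # other outlooks are not covered here

print(play_outside("overcast", windy=True))
print(play_outside("rain", windy=True))
print(play_outside("rain", windy=False))
```

Each `if` corresponds to an internal node of the tree, and each `return` to a leaf.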

Decision Trees can also solve quantitative problems with the same format. In the Tree to the
left, we want to know whether or not to invest in a commercial real estate property. Is it an office
building? A warehouse? An apartment building? Good economic conditions? Poor economic
conditions? How much will an investment return? These questions are answered and solved using
this decision tree.

When making Decision Trees, there are several factors we must take into consideration: On what
features do we base our decisions? What is the threshold for classifying each question into a yes
or no answer? In the first Decision Tree, what if we wanted to ask ourselves whether we had friends to
play with? If we have friends, we will play every time. If not, we might continue to ask
ourselves questions about the weather. By adding an additional question, we hope to better define
the Yes and No classes.

This is where Ensemble Methods come in handy! Rather than just relying on one Decision Tree and
hoping we made the right decision at each split, Ensemble Methods allow us to take a sample of
Decision Trees into account, calculate which features to use or questions to ask at each split, and
make a final predictor based on the aggregated results of the sampled Decision Trees.

Types of Ensemble Methods

1. BAGGing, or Bootstrap AGGregating. BAGGing gets its name because it
combines Bootstrapping and Aggregation to form one ensemble model. Given a sample
of data, multiple bootstrapped subsamples are pulled. A Decision Tree is formed on
each of the bootstrapped subsamples. After each subsample Decision Tree has been
formed, an algorithm is used to aggregate over the Decision Trees to form the most
efficient predictor. The image below will help explain:

Given a Dataset, bootstrapped subsamples are pulled. A Decision Tree is formed on each
bootstrapped sample. The results of each tree are aggregated to yield the strongest, most accurate
predictor.
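
The bootstrap-then-aggregate procedure can be sketched with a deliberately tiny base model. As a simplification, this example uses a one-feature decision stump (a depth-one tree) in place of a full Decision Tree and aggregates by majority vote; the function names, the toy data and the choice of 25 models are assumptions made for illustration.

```python
import numpy as np

def fit_stump(x, y):
    """Depth-one tree on one feature: predict 1 when x > threshold."""
    best_thr, best_err = x[0], np.inf
    for thr in np.unique(x):
        err = np.mean((x > thr).astype(int) != y)
        if err < best_err:
            best_thr, best_err = thr, err
    return best_thr

def bagging_fit(x, y, n_models=25, seed=0):
    rng = np.random.default_rng(seed)
    thresholds = []
    for _ in range(n_models):
        # Bootstrap: draw len(x) indices with replacement.
        idx = rng.integers(0, len(x), size=len(x))
        thresholds.append(fit_stump(x[idx], y[idx]))
    return np.array(thresholds)

def bagging_predict(thresholds, x_new):
    # Each stump votes; the ensemble returns the majority class.
    votes = (np.asarray(x_new)[:, None] > thresholds[None, :]).astype(int)
    return (votes.mean(axis=1) > 0.5).astype(int)

x = np.array([1.0, 2.0, 3.0, 4.0, 6.0, 7.0, 8.0, 9.0])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
thresholds = bagging_fit(x, y)
print(bagging_predict(thresholds, [0.5, 9.5]))
```

Because every stump is fitted on a different resample, the thresholds vary slightly from model to model, and the vote averages out that variation.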

2. Random Forest Models. Random Forest Models can be thought of as BAGGing, with a slight
tweak. When deciding where to split and how to make decisions, BAGGed Decision Trees have the
full set of features at their disposal. Therefore, although the bootstrapped samples may be
slightly different, the data is largely going to break off at the same features throughout each model.
In contrast, Random Forest models decide where to split based on a random selection of features.
Rather than splitting at similar features at each node throughout, Random Forest models implement
a level of differentiation because each tree will split based on different features. This level of
differentiation provides a more diverse ensemble to aggregate over, and therefore a more accurate
predictor. Refer to the image for a better understanding.

Similar to BAGGing, bootstrapped subsamples are pulled from a larger dataset. A decision tree is
formed on each subsample. However, each decision tree is split on different features (in this
diagram the features are represented by shapes).
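
The difference between BAGGed trees and Random Forest trees comes down to which features a split may consider. The sketch below shows only that selection step; the helper name `candidate_features` is hypothetical, and using roughly the square root of the number of features is a common convention assumed here rather than something stated in the text.

```python
import numpy as np

def candidate_features(n_features, rng, max_features=None):
    """Features a split may consider.

    max_features=None mimics a BAGGed tree (all features available);
    an integer mimics a Random Forest split (random subset of features).
    """
    if max_features is None:
        return np.arange(n_features)
    return np.sort(rng.choice(n_features, size=max_features, replace=False))

rng = np.random.default_rng(0)
n_features = 9
bagged = candidate_features(n_features, rng)
forest = candidate_features(n_features, rng,
                            max_features=int(np.sqrt(n_features)))
print(bagged, forest)
```

Drawing a fresh subset at each split is what decorrelates the trees and gives the ensemble more varied models to aggregate.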
