Foundation of Machine Learning F-PMLFML02-WS

This document provides an overview of machine learning concepts, with a focus on supervised learning techniques including regression and classification. It covers topics such as linear regression, logistic regression, bias-variance tradeoff, performance evaluation methods, decision trees, support vector machines, random forests, and case studies applying machine learning in domains like healthcare, retail, manufacturing and more. Code examples are provided to illustrate machine learning algorithms.

Foundations of Machine Learning
© Amity University Press

All Rights Reserved

No parts of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic, mechanical, photocopying, recording or otherwise without the prior permission of the publisher.

Advisory Committee
Chairman : Prof. Abhinash Kumar
Members : Prof. Arun Bisaria
          Mr. Alok Awtans
          Dr. Coral J Barboza
          Dr. Monica Rose
          Mr. Sachit Paliwal

Subject Matter Experts
Ms. Archana Sharma
Dr. Prateek Jain

Published by Amity University Press for exclusive use of Amity Directorate of Distance and Online Education, Amity University, Noida-201313
Contents

Page No.

Module - I: Supervised Learning: Regression 01
1.1 Introduction to Regression Concepts
1.1.1 Introduction to Linear Regression
1.1.2 Linear Regression: Important Terms
1.1.3 Linear Regression: Assumptions
1.1.4 Regression: Statistical View
1.1.5 Hands On: Loading Libraries
1.1.6 Multiple Linear Regression
1.1.7 Linear Regression: Real-Life Events
1.1.8 Logistic Regression
1.1.9 Odds vs. Probability
1.1.10 Hypothesis Testing
1.1.11 Linear Regression: Advantages and Pitfalls

Module - II: Supervised Learning: Classification 49
2.1 Introduction to Classification Methods
2.1.1 Introduction to Classification and Regression
2.1.2 Classification of Supervised Algorithms
2.1.3 Bias Variance Trade-off
2.1.4 Under-fitting and Over-fitting
2.1.5 Evaluation Methods: Hold Out Method
2.1.6 Hands On: Introduction to Hold Out Method
2.1.7 Evaluation Methods: Cross Validation and Bootstrapping Method
2.1.8 Introduction to Performance Metrics
2.1.9 Application of Performance Metrics
2.1.10 Application of Performance Metrics (Part II): Class Imbalance Problem

Module - III: Cross-Validation and Performance Evaluation Methods 105
3.1 Introduction to Cross-Validation and Performance Evaluation Methods
3.1.1 Naive Bayes Classifier for Labelled Data Type
3.1.2 Performance Metrics for Regression
3.1.3 Introduction to Naive Bayes
3.1.4 Naive Bayes Classifier for Mixed Data Type
3.1.5 Naive Bayes Classifier
3.1.6 Hands On: Naive Bayes on Movie Review Dataset
3.1.7 K Nearest Neighbour
3.1.8 Hands On: KNN
3.1.9 Decision Tree - I
3.1.10 Decision Tree - II
3.1.11 Hands On: Decision Tree
3.1.12 Introduction to Support Vector Machines
3.1.13 SVM for Linearly Separable Problems
3.1.14 SVM for Non-Linearly Separable Problems
3.1.15 Hands On: Support Vector Machines
3.1.16 Introduction to Random Forest Classifier
3.1.17 Random Forest Technique and Construction
3.1.18 Summary

Module - IV: Unsupervised Learning 177
4.1 Introduction to Clustering Techniques
4.1.1 Introduction to Clustering
4.1.2 Evaluation of Clustering
4.1.3 Understanding Clustering Types
4.1.4 Hierarchical Clustering Part - I
4.1.5 Hierarchical Clustering Part - II
4.1.6 DBSCAN (Density Based Spatial Clustering of Applications with Noise)
4.1.7 Hands On: DBSCAN
4.1.8 Anomaly Detection
4.1.9 K-Means Clustering
4.1.10 Hands On: K-Means
4.1.11 Association Rule Mining - Part I
4.1.12 Association Rule Mining - Part II
4.1.13 Hands On: Association Rule (Apriori)
4.1.14 Principal Component Analysis
4.1.15 Hands On: PCA
4.1.16 Singular Value Decomposition
4.1.17 Hands On: SVD
4.1.18 LOF
4.1.19 T-Distributed Stochastic Neighbour Embedding (t-SNE)
4.1.20 Hands On: t-SNE
4.2 Case Studies
4.2.1 Healthcare Case: Early Prediction of Neonatal Hyperbilirubinemia
4.2.2 Healthcare Case: Predicting Epidemic Outbreaks
4.2.3 Healthcare Case: Using AI to Predict Acute Kidney Injury
4.2.4 ML Applications in Retail
4.2.5 Netflix: Movie Recommendation
4.2.6 Retail: Robot Sales Associate - Apparel Store
4.2.7 Retail: Loss Prevention
4.2.8 Telecom: Optimising Mobile Tower Operations
4.2.9 Prediction of Tornadoes using Time Series Forecasting
4.2.10 Facebook Face Detection
4.2.11 Manufacturing
4.2.12 Explainable AI and Google Maps
4.2.13 Uber Eats
4.2.14 Amazon ADS Recommendation and iPhone Face Detection
4.2.15 Use Cases
4.2.16 Google Translation
4.2.17 Tesla Driverless Car
4.2.18 Kuka Game and Email

Module - V: Reinforcement Learning 300
5.1 Introduction to Reinforcement Learning
5.1.1 Reinforcement Learning: Introduction
5.1.2 Formal Definition of Reinforcement Learning
5.1.3 Steps of Reinforcement Learning
5.1.4 Terms in Reinforcement Learning
5.1.5 Markov Property
5.1.6 Bellman's Equation and Optimal Value Function
5.1.7 Problem Solving Methods
5.1.8 Model Free (RL)
5.1.9 Monte Carlo Control
5.1.10 Temporal Learning
Module - I: Supervised Learning: Regression


Learning Objectives:

At the end of this module, you will be able to understand:

●● Concept of Linear Regression and important terms related to Linear Regression
●● Statistical View of Regression
●● How to load libraries like pandas
●● Applications of Linear Regression in Real life
●● Concepts of Hypothesis Testing

Introduction

The technique of discovering correlations between dependent and independent variables is known as regression. It aids in the prediction of continuous variables such as market trends, house values, and so forth.

The Regression algorithm's goal is to identify the mapping function that maps the continuous input variable (x) to the continuous output variable (y).

Example: Let's say we want to forecast the weather, so we'll apply the Regression approach. When it comes to weather prediction, the model is trained on historical data, and once training is finished, it can accurately predict the weather for future days.

Regression Algorithm Types:

●● Simple Linear Regression
●● Multiple Linear Regression
●● Polynomial Regression
●● Support Vector Regression
●● Decision Tree Regression
●● Random Forest Regression

1.1 Introduction to Regression Concepts

Linear regression is a supervised machine learning algorithm that carries out a regression task. Based on independent variables, regression models a target prediction value. It is mostly utilised in forecasting and in determining the relationships between variables. Regression models differ in the type of relationship they evaluate between dependent and independent variables, as well as in the number of independent variables they employ.

Linear regression is used to predict the value of a dependent variable (y) based on the value of an independent variable (x). This regression technique discovers a linear relationship between x (input) and y (output); hence the name Linear Regression.

[Figure: scatter plot with a fitted regression line, plotting job experience (X) against salary (Y).]

In the diagram above, X (input) represents job experience and Y (output) represents a person's wage. For our model, the regression line is the best fit line.

1.1.1 Introduction to Linear Regression

Linear regression is a type of supervised machine learning model that is commonly used in prediction. Supervised machine learning models are those that use training data to develop the model and then use a loss function to verify the model's correctness.

One of the most well-known time series forecasting techniques for predictive modelling is linear regression. As the name implies, it posits a linear relationship between a group of independent variables and the dependent variable (the variable of interest).

To approximate the given data, regression and log-linear models can be employed. In (simple) linear regression, the data are modelled to fit a straight line. For example, a random variable y (known as a response variable) can be modelled as a linear function of another random variable x (known as a predictor variable) with the equation

y = wx + b

where the variance of y is assumed to be constant. In the context of data mining, x and y are numeric database attributes. The coefficients w and b (called regression coefficients) specify the slope and y-intercept of the line, respectively. These coefficients can be solved for by the method of least squares, which minimises the error between the actual data and the estimated line. Multiple linear regression is an extension of (simple) linear regression that lets you model a response variable, y, as a linear function of two or more predictor variables.
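To make the least-squares idea concrete, here is a minimal sketch in NumPy that estimates w and b for y = wx + b in closed form; the small experience/salary data set is invented for illustration.

import numpy as np

# Hypothetical data: x = years of experience, y = salary (in thousands).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([30.0, 35.0, 41.0, 44.0, 50.0, 56.0])

# Least-squares estimates for y = wx + b:
# w = cov(x, y) / var(x), b = mean(y) - w * mean(x)
w = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b = y.mean() - w * x.mean()

print(f"slope w = {w:.3f}, intercept b = {b:.3f}")
print("prediction at x = 7:", w * 7 + b)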

Log-linear models: Discrete multidimensional probability distributions are approximated using log-linear models. If we have a set of tuples in n dimensions (e.g., described by n attributes), we can think of each tuple as a point in an n-dimensional space. For a collection of discretized attributes, log-linear models can be used to estimate the probability of each point in the multidimensional space based on a reduced subset of dimensional combinations. This enables a higher-dimensional data space to be constructed from lower-dimensional spaces. As a result, log-linear models are especially beneficial for data smoothing (since aggregate estimates in the lower-dimensional space are less subject to sampling variation than estimates in the higher-dimensional space) and for dimensionality reduction (because the lower-dimensional points together often take up less space than the original data points).

On sparse data, both regression and log-linear models can be utilised, though their utility is limited. While both methods can deal with skewed data, regression excels at it. When dealing with high-dimensional data, regression can be computationally costly, whereas log-linear models scale well up to about 10 dimensions.

Simple Linear Regression

In simple linear regression, we want to know the relationship between a single independent variable, or input, and a corresponding dependent variable, or output.

Applications of Simple Linear Regression

●● Predicting students' exam grades from the number of hours they study (ideally). Here, exam grades are the dependent variable and the number of hours studied is the independent variable.
●● Predicting crop yields using rainfall data. Yield is the dependent variable, while precipitation is the independent variable.
●● Predicting a person's salary based on years of experience. Experience is the independent variable, while salary is the dependent variable.

Limitations of Simple Linear Regression

Even the best data does not tell the whole story. In most cases, regression analysis is used in research to establish that there is a link between variables. However, correlation does not imply causation: a link between two variables does not imply that one causes the other. Indeed, even a line in a simple linear regression that fits the data points well may not guarantee a cause-and-effect relationship.

Using a linear regression model, you may determine whether or not there is any correlation between variables. Extra investigation and statistical analysis will be required to determine the exact nature of the link and whether one variable causes the other.
ity

1.1.2 Linear Regression: Important Terms


Linear Regression is the most fundamental algorithm in machine learning, and it
can be considered the introduction to the field. It will be beneficial if you have a good
m

understanding of linear regression before moving on to more complicated methods.

Call
)A

This is how we used a formulaic function call. We employed one response variable,
mpg, as a function of one dependent variable, disp, both of which were called from the
mtcars data frame in this case.

Residuals
(c

The residuals in our model are a measure of the vertical distance between each
data point and the fitted line. We have summary data for all of the vertical distances

Amity Directorate of Distance & Online Education


4 Foundations of Machine Learning

between all of our locations and the fitted line in this example. The better the fit, the
Notes

e
lower this value is.

Coefficients

in
These are the coefficients of our linear equation’s estimated coefficients. In this
example, our equation would be y = 0.04x + 29.59.

nl
●● Std. Error: The Std. Error component of the coefficients table provides error
estimates for those coefficients. In practice, our equation would look like this:
y = (−0. 04 ± 0. 005)x + (29. 59 ± 1. 23).

O
●● t-value: This is the measurement of the difference in our data compared to the
variation. This value is related to p-values, but p-values are significantly more
commonly used.

ty
●● p-value: Statistical significance evaluations are known as p-values. The math
behind p-values is a little more technical, but for our purposes, a p-value of less
than 0.05 indicates that the number is statistically significant. We should err on the

si
side of statistical significance if the quantity in question has a p-value greater than
0.05. The importance codes that follow explain the star ratings next to them.
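The terms above can be reproduced in Python with statsmodels. The following is a minimal sketch on synthetic data standing in for the mtcars example (the disp values and noise level are made up); its summary() output shows the same pieces described above: residual statistics, coefficient estimates, standard errors, t-values, and p-values.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)

# Hypothetical stand-in for the mtcars example: disp as predictor, mpg as response.
disp = rng.uniform(70, 470, 32)
mpg = 29.59 - 0.04 * disp + rng.normal(0, 2, 32)

X = sm.add_constant(disp)      # intercept plus the single predictor
fit = sm.OLS(mpg, X).fit()

# Prints residual statistics, coefficients, std. errors, t-values and p-values.
print(fit.summary())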

Best Fit Line

The best fit line, or line of best fit, is a line that describes the relationship between the independent and dependent variable(s). The dependent variable is predicted using this line. The goal of the line of best fit is to get the predicted and observed values as close as feasible to each other.

Error

For us, the best fit line will yield predicted values. Error is defined as the difference between the predicted value and the actual (observed) value.

Mean Absolute Error: L1 loss

The mean absolute error is calculated by averaging the absolute error of each example over the number of data points. By summing the absolute values of the errors, we prevent errors above and below the true values from cancelling each other out, and we obtain an overall error metric on which to evaluate a model. The error can be anywhere between 0 and ∞.

Mean Squared Error: L2 loss

The most widely used error to assess model performance is the mean squared error. Unlike the absolute error, the residual error (the difference between the predicted and true value) is squared. For mean squared error loss, the range is also 0 to ∞. Squaring the residual error offers the advantages of having positive error terms, emphasising greater errors over smaller errors, and being differentiable. Being differentiable allows us to apply calculus to identify minimum and maximum values, which can save time in the long run.

Huber Loss

In comparison to the squared error loss, the Huber loss is less susceptible to outliers in the data. It is also differentiable at 0. When the error is small, the loss is essentially quadratic; once the error grows beyond a threshold, it is essentially absolute error. The size of error at which the loss stops being quadratic is determined by a hyperparameter, δ (delta), which can be tuned.
error, which becomes quadratic. The size of the error required to make it quadratic is
determined by a hyperparameter, (delta), which can be adjusted.

R-Squared

nl
It’s also referred to as the coefficient of determination. R squared values range
from - to 1, and they assist us figure out if our model can explain variation in Y by

O
variation in X. A model with a R squared of 0 is equivalent to one that always forecasts
the target variable’s mean, whereas a model with a R squared of 1 completely predicts
the target variable.

ty
Adjusted R-Squared

si
If we add additional predictors to our model, R-Squared will rise or remain
constant. So there’s no way of knowing whether increasing the model’s complexity
makes it more accurate. The R-Squared calculation is tweaked to account for the
r
number of predictors in the model. Only if the new term improves model accuracy does
ve
the modified R-Square increase.
ni

1.1.3 Linear Regression: Assumptions

To perform a simple linear regression, certain assumptions about the data must be made. This is because it is a parametric test. The following are the assumptions made when running a simple linear regression:
●● Homogeneity of variance (homoscedasticity)- One of the primary expectations of a simple linear regression approach is that the size of the error stays constant. This basically indicates that the error magnitude does not change much as the independent variable's value varies.
●● Independence of observations- Nothing is hidden in the relationships between the observations, and only valid sampling procedures are employed during data gathering.
●● Normality- The data follows a normal distribution.

However, while running a linear regression, there is one more assumption that must be considered.

The line is always a straight line: During the execution of a linear regression, there is no curve or grouping factor. The variables are linked in a linear fashion (dependent variable and independent variable). A nonparametric test may be employed if the data violates the assumptions of homoscedasticity or normality. (Take, for instance, the Spearman rank test.)
The following is an example of data that does not match the assumptions: one might believe that the consumption of cured pork and the incidence of colorectal cancer in the United States are linked. However, it is later discovered that there is a significant range disparity in the data collection for both variables. There can be no linear regression test since the homoscedasticity condition is violated. A Spearman rank test, on the other hand, can be used to determine the link between the variables.
Homoscedasticity

Explanation

The next assumption of linear regression is that the residuals have constant variance at every level of x. This is termed homoscedasticity. When this isn't the case, the residuals are said to exhibit heteroscedasticity.

When heteroscedasticity is present in a regression study, it becomes difficult to trust the results. In particular, heteroscedasticity increases the variance of the regression coefficient estimates, but the regression model ignores this. This increases the likelihood that a regression model may proclaim a term in the model to be statistically significant when it is not.

How do you know if this assumption is correct?

A fitted value vs. residual plot is the simplest technique to detect heteroscedasticity. After fitting a regression line to a set of data, you may compare the fitted values of the model against the residuals of those fitted values in a scatterplot.

[Figure: a typical fitted value vs. residual plot exhibiting heteroscedasticity.]

As the fitted values grow greater, the residuals become considerably more spread out. Heteroscedasticity is symbolised by this "cone" shape.
There are three common ways to fix heteroscedasticity (a minimal sketch of the first and third follows this list):

1. Transform the Dependent Variable: Taking the log of the dependent variable is a typical transformation. For example, instead of using population size (the independent variable) to predict the number of flower shops in a city, we could use population size to predict the log of the number of flower shops in a city. Heteroscedasticity is often eliminated by using the log of the dependent variable rather than the original dependent variable.
2. Redefine the Dependent Variable: Instead of using the raw value, a rate is a typical way to redefine the dependent variable. Instead of predicting the number of flower shops in a city based on population size, we might use population size to forecast the number of flower shops per capita. Because we're modelling the number of flower shops per person rather than the total number of flower shops, this decreases the variation that naturally exists across larger populations.
3. Use weighted regression: Weighted regression is another method for correcting heteroscedasticity. In this sort of regression, each data point is given a weight based on the variance of its fitted value. This assigns small weights to data points with higher variances, shrinking their squared residuals. Heteroscedasticity can be eliminated when the appropriate weights are employed.
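Here is a hedged sketch of fixes 1 and 3 using statsmodels. The data are synthetic, and the weights 1/x**2 assume the error variance is proportional to x**2, which we know here only because we generated the data that way.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)

# Synthetic data where the error spread grows with x (heteroscedasticity).
x = rng.uniform(1, 100, 200)
y = 2.0 * x + rng.normal(0.0, x / 10.0)   # noise standard deviation rises with x
X = sm.add_constant(x)                     # adds the intercept column

# Ordinary least squares for comparison.
ols = sm.OLS(y, X).fit()

# Fix 1: transform the dependent variable, e.g. model log(y).
log_fit = sm.OLS(np.log(y), X).fit()

# Fix 3: weighted regression, down-weighting high-variance points.
wls = sm.WLS(y, X, weights=1.0 / x ** 2).fit()

print(ols.params, wls.params)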

Normality

According to the next premise of linear regression, the residuals must be normally distributed.

How do you know if this assumption is correct?

There are two popular methods for determining whether or not this assumption is correct:

1. Check the assumption visually using Q-Q plots (a minimal sketch follows).
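A minimal sketch of both checks, using statsmodels and SciPy on synthetic residuals (the data here are invented): sm.qqplot draws the Q-Q plot for item 1, and the Shapiro-Wilk test previews item 2 below.

import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical data and a fitted simple regression.
x = rng.uniform(0, 10, 100)
y = 1.5 + 2.0 * x + rng.normal(0, 1, 100)
fit = sm.OLS(y, sm.add_constant(x)).fit()

# Visual check: Q-Q plot of the residuals against the 45-degree line.
sm.qqplot(fit.resid, line="45", fit=True)

# Formal check: Shapiro-Wilk test; a p-value above 0.05 is consistent
# with normally distributed residuals (but see the sample-size caveat).
print(stats.shapiro(fit.resid))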


A Q-Q plot, or quantile-quantile plot, is a sort of figure that we may use to see whether or not the residuals of a model follow a normal distribution. The normality assumption is satisfied if the points on the plot roughly form a straight diagonal line.

[Figure: a Q-Q plot of residuals that nearly follow a normal distribution.]

A Q-Q plot in which the residuals clearly deviate from a straight diagonal line, on the other hand, indicates that they do not follow a normal distribution.

[Figure: a Q-Q plot of residuals that deviate strongly from the diagonal line.]
2. You can also use formal statistical tests like Shapiro-Wilk, Kolmogorov-Smirnov, Jarque-Bera, or D'Agostino-Pearson to check the normality assumption. Keep in mind, however, that these tests are sensitive to big sample sizes; when your sample size is large, they frequently conclude that the residuals are not normal. This is why it's typically easier to check this assumption with graphical approaches like a Q-Q plot.
What to do if this assumption is violated

You have a few choices if the normality assumption is broken:

◌◌ To begin, double-check that any outliers aren't having a significant impact on the distribution. If there are any outliers, double-check that they are true values and not data entry errors.
◌◌ The independent and/or dependent variables can then be subjected to a nonlinear transformation. Taking the log, square root, or reciprocal of the independent and/or dependent variable are common examples.
Independence

Explanation

According to the next premise of linear regression, the residuals must be independent. This is especially relevant when working with time series data. Ideally, we don't want a pattern between consecutive residuals; for example, residuals should not systematically increase in size over time.

How to determine if this assumption is met

A residual time series plot, which is a plot of residuals vs. time, is the simplest approach to see if this condition is met. Ideally, the majority of the residual autocorrelations should fall within the 95 percent confidence bands around zero, which are placed at roughly ±2 over the square root of n, where n is the sample size. The Durbin-Watson test can be used to formally check if this assumption is met.
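A minimal sketch of the Durbin-Watson check with statsmodels, on invented time-ordered data; a statistic near 2 suggests no first-order autocorrelation, while values toward 0 or 4 suggest positive or negative serial correlation respectively.

import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(7)

# Hypothetical time-ordered data.
t = np.arange(100.0)
y = 3.0 + 0.5 * t + rng.normal(size=100)

fit = sm.OLS(y, sm.add_constant(t)).fit()

# Durbin-Watson statistic on the residuals.
print(durbin_watson(fit.resid))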
What to do if this assumption is violated

You have a few options depending on the nature of the violation of this assumption:

●● For positive serial correlation, consider adding lags of the dependent and/or independent variables to the model.
●● For negative serial correlation, check that none of your variables are overdifferenced.
●● For seasonal correlation, consider adding seasonal dummy variables to the model.

1.1.4 Regression: Statistical View

Regression analysis describes the statistical relationship between one or more predictor variables and the response variable, and provides an equation. After fitting a regression model with Minitab Statistical Software and verifying the fit with residual plots, you'll want to analyse the results.
How Do I Interpret The P-Values In Linear Regression Analysis?

The p-value for each term tests the null hypothesis that the coefficient is equal to zero (no effect). A low p-value (< 0.05) suggests that the null hypothesis can be rejected. In other words, a predictor with a low p-value is likely to be a useful addition to your model, because changes in the predictor's value are related to changes in the response variable.

A larger (insignificant) p-value, on the other hand, indicates that changes in the predictor are unrelated to changes in the response.

In the output below, we can see that the predictor variables South and North are significant because their p-values are both 0.000. East, on the other hand, has a p-value of 0.092, which is higher than the typical alpha level of 0.05, indicating that it is not statistically significant.

[Minitab coefficients table omitted.]

To choose which terms to keep in the regression model, you usually use the coefficient p-values. We should consider deleting East from the model above.

How Do I Interpret The Regression Coefficients For Linear Relationships?

Regression coefficients represent the mean change in the response variable for one unit of change in the predictor variable while holding other predictors in the model constant. This statistical control provided by regression is crucial because it isolates the role of one variable from the rest of the model's variables.

The key to comprehending the coefficients is to think of them as slopes, which is why they're sometimes referred to as slope coefficients. I'll show you how this works in the fitted line plot below, where I model a person's weight using their height. First, the output of Minitab's session window:

[Minitab session output omitted.]

The fitted line plot graphically depicts the same regression results.

[Figure: fitted line plot of weight vs. height.]

According to the equation, the coefficient for height in metres is 106.5 kilos. The statistic suggests that for every additional metre in height, weight will increase by 106.5 kilos on average.

The same information is graphically represented by the blue fitted line. The fitted line rises or falls by 106.5 kg if you move left or right along the x-axis by an amount that equals a one metre change in height. These heights, however, are from middle-school girls and range from 1.3 to 1.7 metres. Because the relationship is only valid within this data range, we wouldn't move the line a full metre up or down in this situation.

If the slope coefficient were 0, the expected value for weight would not vary no matter how far up or down the fitted line you went. As a result, a low p-value indicates that the slope is not zero, implying that changes in the predictor variable are linked to changes in the response variable.

A fitted line plot was utilised because it brings the math to life. Fitted line plots, on the other hand, can only show the results of simple regression, which has one predictor variable and one response. For multiple linear regression, the concepts stay true, but I'd need an extra spatial dimension for each new predictor to depict the results.
How Do I Interpret The Regression Coefficients For Curvilinear Relationships And Interaction Terms?

In the example above, height is a linear effect; the slope is constant, indicating that the effect is consistent across the entire fitted line. If your model requires polynomial or interaction terms, the interpretation is a little less obvious.

To refresh your memory, polynomial terms model curvature in the data, whereas interaction terms indicate that the effect of one predictor depends on the value of another predictor.

The following example employs a data set that requires a quadratic (squared) term to model the curvature. The p-values for both the linear and quadratic terms are significant, as seen in the output below.

[Minitab output omitted.]

We can proceed with the interpretation because the residual plots (not displayed) indicate a satisfactory fit. But how are these coefficients to be interpreted? Graphing them in a fitted line plot is quite helpful.

[Figure: fitted line plot showing a curved relationship between machine setting and energy consumption.]

As you can see, the relationship between machine setting and energy consumption varies depending on where you start on the fitted line. For example, if you start with a machine setting of 12 and increase it by 1, you should see a decrease in energy use. If you start at 25, an increase of one should result in an increase in energy use. And at a setting of around 20, energy consumption shouldn't change substantially.

Because the effect of changing the predictor varies depending on its value, a significant polynomial term can make the interpretation less clear. Similarly, a significant interaction term indicates that the effect of one predictor varies depending on the value of another.

When interpreting a regression model that contains these terms, be extra cautious. You can't simply look at the main effect (linear term) to figure out what's going on! Unfortunately, you won't be able to utilise a fitted line plot to graphically describe the results if you're using multiple regression analysis. This is where subject-matter expertise comes in handy!

1.1.5 Hands On: Loading Libraries

Pandas

Pandas is a free and open-source Python toolkit created by software engineer Wes McKinney in 2008. It offers high-level data structures and programming tools for doing data analysis and manipulation on a variety of data kinds, including numbers, text, and dates.

Adding pandas to the workspace

import pandas as pd

Pandas is imported as pd to make it simpler and shorter to reference.

Importing specific functionality

from pandas import Series, DataFrame
Pandas Data Structures

Series

A Series is an array-like data structure whose labels start from 0 and keep incrementing. For example:

In [1]: from pandas import Series

In [2]: obj = Series([1, 2, 3, 4, 5, 6])

In [3]: obj

Out[3]:
0    1
1    2
2    3
3    4
4    5
5    6
dtype: int64
You may also use custom labels:

In [1]: from pandas import DataFrame, Series

In [2]: obj = Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])

In [3]: obj

Out[3]:
d    4
b    7
a   -5
c    3
dtype: int64

Some useful features

◌◌ Convert dictionaries to Series: s = Series({'age': 23, 'name': 'anuradha'})
◌◌ Direct index assignment: obj.index = ['attr 1', 'attr 2', 'attr 3']
◌◌ A Series can carry a name attribute: obj.name = 'Population'
DataFrame

These are data structures for tabular data, similar to spreadsheets, and can hold vast amounts of data.

In [1]: data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'], 'year': [2000, 2001, 2002, 2001, 2002], 'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}

In [2]: data

Out[2]:
{'pop': [1.5, 1.7, 3.6, 2.4, 2.9],
 'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
 'year': [2000, 2001, 2002, 2001, 2002]}

In [3]: frame = DataFrame(data)

In [4]: frame

Out[4]:
   pop   state  year
0  1.5    Ohio  2000
1  1.7    Ohio  2001
2  3.6    Ohio  2002
3  2.4  Nevada  2001
4  2.9  Nevada  2002
Filtering

For this example let us consider the following piece of code:

In [1]: from pandas import DataFrame, Series

In [2]: import numpy as np

In [3]: data = DataFrame(np.arange(16).reshape(4,4), index=['Ohio', 'Colorado', 'Utah', 'New York'], columns=['one', 'two', 'three', 'four'])

In [4]: data

Out[4]:
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15

We use np.arange(16) to create an array of elements from 0 to 15, and use reshape(4, 4) to make it a 4x4 matrix: just a set of dummy data.
Boolean comparisons

In [6]: data > 5

Out[6]:
            one    two  three   four
Ohio      False  False  False  False
Colorado  False  False   True   True
Utah       True   True   True   True
New York   True   True   True   True

Substitutions

We provide a query condition data < 5 and an assignment = -1 to replace the values less than 5 with -1.

In [7]: data[data < 5] = -1

In [8]: data

Out[8]:
          one  two  three  four
Ohio       -1   -1     -1    -1
Colorado   -1    5      6     7
Utah        8    9     10    11
New York   12   13     14    15
Applications in real scenarios

The handling of missing values is a direct application of pandas. The majority of the time, the data we receive for data mining is not clean or pure: it has missing values and inconsistent figures. As a result, we must either filter them out or fill in replacement values.
Filtering out missing values

Creating a sample array with missing values (NaN, or Not a Number, values):

In [9]: a = np.array([1, 2, 3, np.nan, 4])

In [10]: a

Out[10]: array([ 1., 2., 3., nan, 4.])

In [11]: df = DataFrame(a)

In [12]: df

Out[12]:
     0
0  1.0
1  2.0
2  3.0
3  NaN
4  4.0

Once we have the data frame with missing values/NaNs we can fill them in the following manner. Here I'm filling it with -1:

In [13]: df.fillna(-1)

Out[13]:
     0
0  1.0
1  2.0
2  3.0
3 -1.0
4  4.0

Different fill values can be used in different columns, for instance 0.5 in column 1 and -1 in column 3 (the numbering of the columns begins at 0):

df.fillna({1: 0.5, 3: -1})

We can also use smarter strategies such as filling with the mean, which might make more sense in some cases:

data = Series([1., np.nan, 3.5, np.nan, 7])

data.fillna(data.mean())
Scikit Learn

Sklearn is the Swiss Army Knife of data science libraries. It's an essential weapon in your data science arsenal for clearing apparently insurmountable obstacles. To put it another way, it's used to create machine learning models.

Scikit-learn is undoubtedly Python's most helpful machine learning library. Classification, regression, clustering, and dimensionality reduction are just a few of the useful functions in the sklearn toolkit for machine learning and statistical modelling.

Components of Scikit-learn:

Scikit-learn has a lot of useful features. To help you grasp the spread, here are a few examples:

●● Supervised learning algorithms: Think of any supervised machine learning method you've heard of, and there's a good chance it's included in scikit-learn. From generalised linear models (such as linear regression) to Support Vector Machines (SVM), Decision Trees, and Bayesian approaches, the scikit-learn toolbox has it all. The widespread availability of machine learning methods is one of the main reasons for scikit-learn's popularity. I began using scikit to handle supervised learning challenges, and I would recommend it to anyone who is new to scikit or machine learning.
●● Cross-validation: Using sklearn, you can assess the accuracy of supervised models on unseen data in a variety of ways.
●● Unsupervised learning techniques: The offering includes a wide range of machine learning algorithms, including clustering, factor analysis, principal component analysis, and unsupervised neural networks.
●● Several toy datasets: These came in handy while learning scikit-learn. I had learned SAS using a variety of academic datasets (e.g. the IRIS dataset and the Boston House prices dataset), and having them on hand while studying a new library was quite beneficial.
●● Feature extraction: Scikit-learn allows you to extract features from images and text (e.g. Bag of Words).

Quick Example:

Now that you have a high-level understanding of the ecosystem, let me provide an example to demonstrate the use of sklearn. The goal is to demonstrate how easy it is to use. In one of the next articles, we'll look at several algorithms and the best ways to employ them.
Import the relevant libraries and read the dataset

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn import metrics
from sklearn.linear_model import LogisticRegression

We have imported all the libraries. Next, we read the dataset:

dataset = datasets.load_iris()

Build a logistic regression model on the dataset and make predictions

model = LogisticRegression()
model.fit(dataset.data, dataset.target)
expected = dataset.target
predicted = model.predict(dataset.data)

Print the classification report and confusion matrix

print(metrics.classification_report(expected, predicted))
print(metrics.confusion_matrix(expected, predicted))
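Note that the example above evaluates the model on the same data it was trained on, which gives an optimistic picture of performance. Continuing with the variables defined above, a quick accuracy check looks like this (assessing on held-out data is covered in Module II):

from sklearn import metrics

# Accuracy on the training data itself - an optimistic estimate.
print(metrics.accuracy_score(expected, predicted))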

1.1.6 Multiple Linear Regression

Multiple linear regression (MLR), often known simply as multiple regression, is a statistical technique that predicts the outcome of a response variable by combining several explanatory variables. Multiple linear regression attempts to represent the linear relationship between the explanatory (independent) variables and the response (dependent) variable. Because it incorporates more than one explanatory variable, multiple regression is essentially an extension of ordinary least-squares (OLS) regression.

●● Multiple linear regression (MLR), often known as multiple regression, is a statistical technique that predicts the outcome of a response variable by combining several explanatory variables.
●● Multiple regression is an extension of simple linear (OLS) regression, which uses just one explanatory variable.
●● MLR is widely utilised in econometrics and financial analysis.
Formula and Calculation of Multiple Linear Regression

yi = β0 + β1xi1 + β2xi2 + ... + βpxip + ϵ

Where, for i = n observations:

yi = dependent variable
xi = explanatory variables
β0 = y-intercept (constant term)
βp = slope coefficients for each explanatory variable
ϵ = the model's error term (also known as the residuals)
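A minimal sketch of fitting such a model with statsmodels, on invented data with two explanatory variables; the coefficient values used to generate the data are arbitrary.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)

# Hypothetical data: two explanatory variables and one response.
n = 100
x1 = rng.normal(size=n)                               # e.g. interest rates
x2 = rng.normal(size=n)                               # e.g. oil price
y = 75 - 1.5 * x1 + 7.8 * x2 + rng.normal(size=n)     # response with noise

X = sm.add_constant(np.column_stack([x1, x2]))  # adds the intercept term
model = sm.OLS(y, X).fit()                      # least-squares estimates

print(model.params)    # estimated beta0, beta1, beta2
print(model.rsquared)  # coefficient of determination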

Simple linear regression is a function that allows a statistician or analyst to


m

generate predictions about one variable based on data about another variable. Only
two continuous variables—an independent variable and a dependent variable—can be
utilised in linear regression. The parameter that is utilised to calculate the dependent
)A

variable or outcome is known as the independent variable. Multiple explanatory


variables are included in a multiple regression model.

The following assumptions underpin the multiple regression model:

◌◌ The dependent variables and the independent variables have a linear


(c

relationship.
◌◌ The independent variables are not overly connected with one another,

Amity Directorate of Distance & Online Education


Foundations of Machine Learning 19

◌◌ The observations are chosen at random from the population.
◌◌ Residuals should be normally distributed with a mean of 0 and constant variance σ².
The coefficient of determination (R-squared) is a statistical tool for determining

in
how much variation in the independent variables can be explained by variance in the
outcome. Even if the predictors are unrelated to the outcome variable, R2 always rises
when additional predictors are added to the MLR model.

nl
As a result, R2 alone cannot be used to determine which predictors should be
included and which should be eliminated from a model. R2 can only be between 0 and
1, with 0 indicating that none of the independent variables can predict the outcome and

O
1 indicating that the independent variables can predict the outcome without mistake.

While holding all other variables constant, beta coefficients are appropriate when
evaluating the results of multiple regression (“all else equal”). A multiple regression’s

ty
outcome might be shown horizontally as an equation or vertically as a table.

Example of How to Use Multiple Linear Regression

si
For instance, an analyst would be interested in seeing how market movement
influences the price of ExxonMobil (XOM). The value of the S&P 500 index will be
the independent variable, or predictor, in this case, while the price of XOM will be the
dependent variable.
r
ve
In reality, a variety of elements influence an event’s result. ExxonMobil’s stock
price, for example, is influenced by more than just the market’s performance. Other
factors that influence the price of XOM and the stock prices of other oil firms include
the price of oil, interest rates, and the price movement of oil futures. Multiple linear
regression is used to understand a relationship involving more than two variables.
ni

The mathematical link between various random variables is determined using


multiple linear regression (MLR).
U

In other words, MLR investigates the relationship between numerous independent


factors and a single dependent variable. The information on the many variables can
be used to generate an accurate forecast on the level of effect they have on the result
variable once each of the independent factors has been determined to predict the
ity

dependent variable. The model generates a straight line (linear) relationship that best
approximates all of the individual data points.

Using the MLR equation as an example, in our case:


m

●● yi = dependent variable—the price of XOM


●● xi1 = interest rates
●● xi 2 = oil price
)A

●● xi3 = value of S&P 500 index


●● xi4 = price of oil futures
●● B0 = y-intercept at time zero
(c

●● B1 = regression coefficient that measures a unit change in the dependent variable


when xi1 changes - the change in XOM price when interest rates change

Amity Directorate of Distance & Online Education


20 Foundations of Machine Learning

●● B2 = coefficient value that measures a unit change in the dependent variable


Notes

e
when xi2 changes—the change in XOM price when oil prices change
Statistical software is used to calculate the least-squares estimates—B0, B1, B2

in
…Bp. In a regression model with as many variables as possible, each independent
variable is differentiated by a number—1,2, 3, 4,...p. The multiple regression model
enables an analyst to forecast an outcome using data from several explanatory factors.

nl
Even so, because each data point can diverge somewhat from the model’s
projected outcome, the model is not always flawless. To account for such minor
deviations, the residual value, E, which is the difference between the actual and

O
anticipated outcomes, is incorporated in the model.

If we run our XOM price regression model via a statistics computing programme,
we get the following results:

ty
XOM Price = 75 - 1.5 interest rates + 7.8 oil price + 3.2 S&P 500 + 5.7 oil futures

R – Sq = 86.5%

si
If all other variables remain constant, an analyst would interpret this result to
suggest that if the price of oil in the markets rises by 1%, the price of XOM will rise by
7.8%. The model also predicts that a 1% increase in interest rates will cause the price

r
of XOM to drop by 1.5 percent. Changes in the interest rate, oil price, oil futures, and
the S&P 500 index account for 86.5 percent of the volatility in Exxon Mobil’s stock price,
ve
according to R2.

The Difference Between Linear and Multiple Regression


The response of a dependent variable to a change in various explanatory variables
ni

is compared using ordinary linear squares (OLS) regression. A dependent variable,


on the other hand, is rarely described by a single variable. In this scenario, an analyst
employs multiple regression, which entails employing more than one independent
U

variable to explain a dependent variable. There are two types of multiple regressions:
linear and nonlinear.

Multiple regressions are predicated on the assumption that the dependent


ity

and independent variables have a linear relationship. It also presupposes that the
independent variables have no significant association.

Only one variable rarely explains a dependent variable. In such instances, an


analyst will employ multiple regression, which tries to explain a dependent variable with
m

the help of many independent variables. The model, on the other hand, assumes that
the independent variables have no significant relationships.
)A

1.1.7 Linear Regression: Real-Life Events

Linear regression is one of the most often used statistical techniques. It's used to quantify the association between a response variable and one or more predictor variables.

Simple linear regression is the most basic type of linear regression; it's used to quantify the relationship between one predictor variable and one response variable.

If we have more than one predictor variable, we can use multiple linear regression to quantify the link between several predictor variables and a response variable.

Linear Regression Real Life Example #1

Businesses frequently utilise linear regression to understand the relationship between their advertising spending and their revenue.

For example, they may run a simple linear regression model with advertising spending as the predictor variable and revenue as the response variable. The regression model would look something like this:

revenue = β0 + β1(ad spending)

The coefficient β0 represents total expected revenue when ad spending is zero.

The coefficient β1 represents the average change in total revenue when ad spending is increased by one unit (e.g. one dollar).

If β1 is negative, it means that more advertising spending equals less revenue.

If β1 is close to 0, advertising spending has little impact on revenue.

And if β1 is positive, it indicates that more ad expenditure equals higher revenue.

Depending on the value of β1, a company may decide to reduce or increase its advertising spending. A sketch of fitting this model appears at the end of this section.

Linear Regression Real Life Example #2

Linear regression is frequently used by medical researchers to better understand the association between drug dosage and patient blood pressure.

Researchers might, for example, give patients different doses of a medicine and observe how their blood pressure reacts. They might use dosage as the predictor variable and blood pressure as the response variable in a simple linear regression model. The regression model would look something like this:

blood pressure = β0 + β1(dosage)

The coefficient β0 represents the expected blood pressure when the dosage is zero.

The coefficient β1 represents the average change in blood pressure when the dosage is increased by one unit.

If β1 is negative, it means that increasing the dosage leads to a reduction in blood pressure.

If β1 is close to zero, an increase in dosage is not associated with a change in blood pressure.

If β1 is positive, it means that increasing the dosage is linked to a rise in blood pressure.

Researchers may opt to adjust the dosage given to a patient based on the value of β1.

Linear Regression Real Life Example #3

The effect of fertiliser and water on crop yields is frequently measured using linear regression by agricultural scientists.

For instance, scientists may apply varying amounts of fertiliser and water to different fields to examine how crop yield is affected. They might use fertiliser and water as predictor variables and crop yield as the response variable in a multiple linear regression model. The regression model would look something like this:

crop yield = β0 + β1(amount of fertilizer) + β2(amount of water)

The coefficient β0 represents the expected crop yield with no fertiliser or water.

The coefficient β1 represents the average change in crop yield when fertiliser is increased by one unit, assuming the amount of water is kept constant.

The coefficient β2 represents the average change in crop yield when water is increased by one unit, assuming the amount of fertiliser is kept constant.

Depending on the values of β1 and β2, scientists may adjust the amount of fertiliser and water used in order to maximise crop yield.

Linear Regression Real Life Example #4

Data scientists at professional sports clubs frequently utilise linear regression to assess the impact of various training regimens on player performance.

Data scientists in the NBA, for example, might look at how different amounts of weekly yoga and weightlifting sessions affect a player's points scored. They might use yoga and weightlifting sessions as predictor variables and total points scored as the response variable in a multiple linear regression model. The regression model would look something like this:

points scored = β0 + β1(yoga sessions) + β2(weightlifting sessions)

The coefficient β0 represents the expected points scored for a player who does not participate in any yoga or weightlifting sessions.

The coefficient β1 represents the average change in points scored when weekly yoga sessions are increased by one, assuming the number of weekly weightlifting sessions remains constant.

The coefficient β2 represents the average change in points scored when weekly weightlifting sessions are increased by one, assuming the number of weekly yoga sessions remains constant.

Depending on the values of β1 and β2, data scientists may recommend that a player participate in more or fewer weekly yoga and weightlifting sessions in order to maximise points scored.
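A minimal sketch of Example #1 with scikit-learn; the advertising and revenue figures are invented for illustration.

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: monthly ad spending and revenue (both in $1000s).
ad_spend = np.array([[23.0], [26.0], [30.0], [34.0], [43.0], [48.0], [52.0], [57.0]])
revenue = np.array([651.0, 762.0, 856.0, 1063.0, 1190.0, 1298.0, 1421.0, 1440.0])

model = LinearRegression().fit(ad_spend, revenue)

# beta0 (intercept) and beta1 (average change in revenue per unit of ad spend).
print("beta0 =", model.intercept_)
print("beta1 =", model.coef_[0])
print("predicted revenue at ad spend 60:", model.predict([[60.0]])[0])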

1.1.8 Logistic Regression

In the early twentieth century, the biological sciences adopted logistic regression. It went on to be employed in a variety of social science applications. When the dependent variable (target) is categorical, logistic regression is utilised.

As an example,

◌◌ To predict whether an email is spam (1) or not (0)
◌◌ Whether the tumor is malignant (1) or not (0)

Consider the situation where we need to determine whether or not an email is spam. If we utilise linear regression to solve this problem, we'll need to select a threshold by which we can classify the data. If the actual class is malignant, the predicted continuous value is 0.4, and the threshold value is 0.5, the data point will be classed as non-malignant, potentially resulting in significant consequences in real time.

It is clear from this example that linear regression is unsuitable for classification problems. Because linear regression output is unbounded, logistic regression enters the picture: its values are restricted to the range 0 to 1.
O
problems. Because linear regression is unlimited, logistic regression enters the picture.
Their value is restricted to a range of 0 to 1.

The logistic regression model was developed in order to describe the posterior probabilities of the K classes using linear functions in x, while ensuring that they sum to one and remain in the range [0, 1]. The model has the form

log(Pr(G = 1|X = x) / Pr(G = K|X = x)) = β10 + β1ᵀx
log(Pr(G = 2|X = x) / Pr(G = K|X = x)) = β20 + β2ᵀx
. . .
log(Pr(G = K−1|X = x) / Pr(G = K|X = x)) = β(K−1)0 + β(K−1)ᵀx

The model is specified in terms of K − 1 log-odds or logit transformations (reflecting the constraint that the probabilities sum to one). Although the last class is used as the denominator in the odds-ratios, the choice of denominator is arbitrary because the estimates are equivariant under this choice. A quick calculation reveals that

Pr(G = k|X = x) = exp(βk0 + βkᵀx) / (1 + Σ l=1..K−1 exp(βl0 + βlᵀx)), for k = 1, . . . , K − 1,
Pr(G = K|X = x) = 1 / (1 + Σ l=1..K−1 exp(βl0 + βlᵀx)),

and these clearly sum to one. To emphasise the dependence on the entire parameter set θ = {β10, β1ᵀ, . . . , β(K−1)0, β(K−1)ᵀ}, we denote the probabilities Pr(G = k|X = x) = pk(x; θ).

When K = 2, this model is especially simple, since there is only a single linear function. It is widely used in biostatistical applications with many binary responses (two classes). Patients may live or die, have heart disease or not, or have a condition present or absent.
Fitting Logistic Regression Models


Maximum likelihood is commonly used to fit logistic regression models, using the
conditional likelihood of G given X. The multinomial distribution is acceptable since
Pr(G|X) completely characterises the conditional distribution. For N observations, the
(c

log-likelihood is

where Pk((xi ; θ) = Pr(G = k|X = xi ; θ).

Amity Directorate of Distance & Online Education


24 Foundations of Machine Learning

We discuss in detail the two-class case, since the algorithms simplify considerably.
Notes

e
It is convenient to code the two-class gi via a 0/1 response y_i, where yi= 1 when gi= 1,
and yi = 0 when gi = 2. Let p1(x; θ) = p(x; θ), and p2(x; θ) = 1 − p(x; θ). The log-likelihood
can be written

in
nl
O
Here β = {β10, β1}, and To accommodate the intercept, we assume that the vector
of inputs x iincludes the constant term 1. We set the derivatives of the log-likelihood to
zero to maximise it. These are the score equations.

ty
si
Those are nonlinear p+1 equations in The first scoring equation indicates that
because the first component of xi equals 1,

r
The expected number of class one students is the same as the actual amount (and
ve
hence also class twos.)

We utilise the Newton–Raphson approach to solve the score equations, which


requires the second-derivative or Hessian matrix.
ni

Starting with βold , a single Newton update is


U
ity

Where the derivatives are evaluated atβ^old

The score and Hessian are best written in matrix notation. Let y stand for the vector
of y ivalues, X for the N (p + 1) matrix of x i values, p for the vector of fitted probabilities
with ith element p(xi; old), and W for a N N diagonal matrix of weights with ith diagonal
m

elementp(x_i; β^old)).)(1− p(x_i; β^old)). Then we have


)A

The Newton step is thus


(c

Amity Directorate of Distance & Online Education


Foundations of Machine Learning 25

In the second and third line we have re-expressed the Newton step as a weighted
Notes

e
least squares step, with the respons

in
The adjusted reaction is another name for it. These equations must be solved
multiple times because p, and hence W and z, change with each iteration. Because
each iteration solves the weighted least squares problem, this algorithm is known as

nl
iteratively reweighted least squares, or IRLS.

O
Although convergence is never guaranteed, it appears that β = 0 is an acceptable
beginning value for the iterative approach. Because the log-likelihood is concave, the
procedure usually converges, but overshooting might occur. Step size halving will
ensure convergenc in the rare circumstances where the log-likelihood declines.

ty
For the multiclass case (K ≥ 3), the Newton technique may also be described as an
iteratively reweighted least squares algorithm, with a vector of K − 1 responses and a
nondiagonal weight matrix per observation. The latter prevents any simplified
procedures, so in this scenario it is numerically more convenient to work directly with
the expanded parameter vector θ.

The purpose of using logistic regression models for data analysis and inference is
to understand the role of the input factors in explaining the outcome. In the search for a
parsimonious model including a subset of the variables, perhaps with some interaction
terms, numerous models are often fitted. Some of the concerns are illustrated in the
following example.

Table: Results from a logistic regression fit to the South African heart disease
data

               Coefficient   Std. Error   Z Score
(Intercept)       –4.130        0.964      –4.285
sbp                0.006        0.006       1.023
tobacco            0.080        0.026       3.034
ldl                0.185        0.057       3.219
famhist            0.939        0.225       4.178
obesity           –0.035        0.029      –1.187
alcohol            0.001        0.004       0.136
age                0.043        0.010       4.184

Example: South African Heart Disease


To demonstrate the typical statistical use of the logistic regression model, we give
an analysis of binary data. The figure below shows a subset of the Coronary Risk-Factor
Study (CORIS) baseline survey, which was conducted in three rural areas of South
Africa (Rousseauw et al. 1983). The study’s goal was to assess the severity of the
risk factors for ischemic heart disease in that high-incidence area. The response
variable is the presence or absence of myocardial infarction (MI) at the time of the
survey, and the data reflect white males aged 15 to 64 (the overall prevalence of MI
was 5.1 percent in this region). Our data set contains 160 cases and a sample of 302
controls. Hastie and Tibshirani (1987) discuss these data in greater detail.

in
We used maximum likelihood to fit this model, yielding the results reported in the
table above.

The Z scores for each of the coefficients in the model (coefficients divided by
standard errors) are shown in this summary; a nonsignificant Z score suggests that
a coefficient can be removed from the model. Each of these is formally equivalent to
a test of the null hypothesis that the coefficient in question is zero while all the others
are not (also known as the Wald test). At the 5% level, a Z score greater than
approximately 2 in absolute value is significant.

This table of coefficients has a few surprises that should be interpreted with
caution. Systolic blood pressure (sbp) is not significant! Obesity isn’t either, and its
coefficient has a negative sign. This confusion arises from the correlation between the
set of predictors. On their own, both sbp and obesity are significant, and both have a
positive sign. In the presence of several additional correlated factors, however, they are
no longer needed (and can even acquire a negative sign). At this point the analyst may
undertake some model selection, identifying a subset of the variables that is sufficient
to explain their combined effect on the prevalence of chd. One way to proceed is to
drop the least significant coefficient and refit the model, repeating this until no more
terms can be removed from the model.
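
As a sketch, a fit like the one in the table above can be reproduced in Python with
statsmodels. The file name SAheart.csv and its column names are assumptions about how
a local copy of the CORIS data is stored; adjust them to your data.

import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("SAheart.csv")                           # hypothetical local file
df["famhist"] = (df["famhist"] == "Present").astype(int)  # encode yes/no as 1/0

X = sm.add_constant(df[["sbp", "tobacco", "ldl", "famhist",
                        "obesity", "alcohol", "age"]])
fit = sm.GLM(df["chd"], X, family=sm.families.Binomial()).fit()
print(fit.summary())   # coefficients, standard errors and Z scores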

Figure: The data on cardiac disease in South Africa shown as a scatterplot
matrix. Each plot depicts a pair of risk factors, with cases and controls coloured
differently (blue is a case). Family history of heart disease (famhist) is a binary variable
(yes or no).

in
Table: Results from stepwise logistic regression fit to South African heart disease
data.

               Coefficient   Std. Error   Z Score
(Intercept)       –4.204        0.498      –8.45
tobacco            0.081        0.026       3.16
ldl                0.168        0.054       3.09
famhist            0.924        0.223       4.14
age                0.044        0.010       4.52
A better but more time-consuming strategy is to refit each of the models with one
variable removed and then perform an analysis of deviance to determine which variable
to eliminate. A fitted model’s residual deviance is minus twice its log-likelihood, and the
deviance between two models is the difference between their residual deviances (in
analogy to sums of squares). This strategy produced the same final model as the
previous one.
ve
For example, how would a coefficient of 0.081 (standard error = 0.026) for tobacco
be interpreted? Tobacco use is measured in kilogrammes over a lifetime, with a median
of 1.0 kg for controls and 4.1 kg for cases. Thus, a 1 kg increase in lifetime tobacco use
leads to an 8.4% increase in the odds of coronary heart disease (exp(0.081) = 1.084).
When the standard error is taken into account, we get an approximate 95 percent
confidence interval of exp(0.081 ± 2 × 0.026) = (1.03, 1.14).
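
As a quick sanity check, these numbers can be reproduced directly in Python:

import math

coef, se = 0.081, 0.026
print(math.exp(coef))                  # ≈ 1.084: an 8.4% increase in the odds per kg
print(math.exp(coef - 2 * se),
      math.exp(coef + 2 * se))         # ≈ (1.03, 1.14): approximate 95% CI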
U

1.1.9 Odds vs. Probability


The likelihood of an event occurring is the proportion of times you expect to
observe it in a large number of trials. The probabilities are always between 0 and 1.

The odds are calculated by dividing the probability of an event occurring by the
probability of it not occurring.

If the likelihood of an event occurring is Y, the likelihood of it not occurring is 1-Y.


(For example, if the probability of an occurrence is 0.80 (80%), the likelihood that it will
not happen is 1 − 0.80 = 0.20, or 20%.)

The odds of an event are the ratio of (probability that the event will occur) /
(probability that the event will not occur). This can be stated in the following way:

Odds of event = Y / (1 − Y)

So, if the likelihood of an event occurring is 0.80, the odds are 0.80 / (1 − 0.80) =
0.80/0.20 = 4 in this case (i.e., 4 to 1).

●● If a race horse competes in 100 races and wins 25 of them while losing 75 of
them, the chance of winning is 25/100 = 0.25 or 25%, but the odds of winning are
25/75 = 0.333 or 1 victory to 3 losses.
●● If the horse runs 100 races and wins 5 and loses the other 95 times, the likelihood
of winning is 0.05 or 5%, and the odds of the horse winning are 5/95 = 0.0526.
●● If the horse runs 100 races and wins 50, the likelihood of winning is 50/100 = 0.50
or 50%, and the odds of winning are 50/50 = 1 (even odds).
●● If a horse runs 100 races and wins 80, the probability of winning is 80/100 = 0.80,
or 80%, and the odds are 80/20 = 4 to 1.
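
A minimal sketch of these probability and odds conversions in Python (the function
names are ours, not a library API):

def prob_to_odds(p):
    return p / (1 - p)           # odds of event = Y / (1 - Y)

def odds_to_prob(odds):
    return odds / (1 + odds)     # invert the odds back to a probability

print(prob_to_odds(0.80))   # 4.0 -> "4 to 1"
print(prob_to_odds(0.25))   # 0.333... -> 1 victory to 3 losses
print(odds_to_prob(4.0))    # 0.8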

O
We can’t compute the relative risk using the case-control design since we can’t
quantify the likelihood of disease in each of the exposure groups. We may, however,
compute the odds of disease in each of the exposure groups and compare them
using the odds ratio. The odds ratio in the hypothetical pesticide study is

OR = (7/10) / (6/57) = 6.65

Notice how close this odds ratio is to the RR that would have been obtained if the
full source population had been studied. The reason for this is that if the outcome being
examined is relatively rare, the odds of disease in an exposure group will be similar to
the risk of disease in that exposure group. As a result, for case-control studies, the odds
ratio gives a relative measure of effect, as well as an estimate of the risk ratio in the
source population, provided that the outcome of interest is rare.

We underlined that the odds ratio is the only measure of association that can
be derived in case-control research. However, in cohort studies, which evaluate the
incidence of an outcome by following exposure groups, both a risk ratio and an odds
ratio can be calculated.

If we label the cells in a contingency table arbitrarily as follows:

                Diseased   Non-diseased
Exposed             a            b
Unexposed           c            d

then the odds ratio is computed by taking the ratio of odds, where the odds in each
group are computed as follows:

OR = (a/b) / (c/d)

In the same way that the risk in the unexposed group is placed in the denominator
of a risk ratio, the convention is to place the odds in the unexposed group in the
denominator. Odds ratios, like risk ratios, do not have a normal distribution, so we
employ the log transformation to make them more normal. As a result, computing
a confidence interval for an odds ratio is a two-step process: first, we generate a
confidence interval for Ln(OR), and then we take the antilog of the upper and lower
limits of the confidence interval for Ln(OR) to determine the upper and lower limits of
the confidence interval for the OR. The two steps are described in detail below.
(c

Computing the Confidence Interval for an Odds Ratio


To compute the confidence interval for an odds ratio, use the formula

Ln(OR) ± z × sqrt(1/a + 1/b + 1/c + 1/d)

1. Using the equation above, calculate the confidence interval for Ln(OR).
2. Calculate the confidence interval for OR by taking the antilog of the result in
step 1, i.e., (exp(Lower Limit), exp(Upper Limit)).
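
A minimal sketch of the two steps in Python, using the 2 × 2 cell counts a, b, c and d
defined earlier and z = 1.96 for a 95 percent interval:

import math

def odds_ratio_ci(a, b, c, d, z=1.96):
    or_ = (a / b) / (c / d)                  # odds ratio
    se = math.sqrt(1/a + 1/b + 1/c + 1/d)    # standard error of Ln(OR)
    lo = math.exp(math.log(or_) - z * se)    # step 1 + step 2 (lower limit)
    hi = math.exp(math.log(or_) + z * se)    # step 1 + step 2 (upper limit)
    return or_, lo, hi

print(odds_ratio_ci(7, 10, 6, 57))   # the hypothetical pesticide example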

nl
The null value of the confidence interval for the odds ratio is one, which means
there is no difference. If the odds ratio has a 95 percent confidence interval that does
not include one, the odds ratio is said to be statistically significant. We revisit the prior
examples and generate odds ratio estimates, which we compare to our estimates of risk
differences and relative risks.

Comparison Chart

Basis for Comparison   Odds                                      Probability
Meaning                Odds refers to the chances in favour      Probability refers to the likelihood
                       of the event to the chances against it.   of occurrence of an event.
Expressed in           Ratio                                     Percent or decimal
Lies between           0 to ∞                                    0 to 1
Formula                Occurrence/Non-occurrence                 Occurrence/Whole

Definition of Odds
The term odds is defined in mathematics as the ratio of the number of favourable
events to the number of unfavourable events. While odds for an event show the
possibility of the event occurring, odds against reflect the likelihood of the event not
occurring. In more technical terms, odds refers to the likelihood of a specific occurrence
occurring or not occurring.

Odds can range from zero to infinity; if the odds are 0, the event is unlikely to
occur, and if the odds are ∞, the event is ever more likely to occur.

As an example, assume there are 20 marbles in a bag, eight of which are red, six
of which are blue, and six of which are yellow. If one marble is chosen at random, the
odds of getting a red marble are 8/12, or 2:3.

Definition of Probability
m

Probability is a mathematical term that deals with the possibility of a specific event
occurring. It is the foundation for hypothesis testing theory and estimation theory.
It is calculated as the ratio of the number of favourable events to the total number of
events.

Probability is a number between 0 and 1, inclusive. When the probability of an
event is 0, it indicates that the event is impossible, and when it is 1, it indicates that
the event is certain. In other words, the higher the likelihood of an event, the larger
the odds of it happening.


Consider the following scenario: assume a dartboard is divided into 12 sections,
one for each of the 12 zodiac signs. If a dart is thrown, the probability of hitting a
particular section, say Aries, is 1/12, because the favourable event is 1 (Aries) and the
total number of events is 12; this may be denoted as 0.08 or 8%.

in
Key Differences Between Odds and Probability
The following points describe the distinctions between odds and probability:

nl
1. The term ‘odds’ is used to express the chances in favour of an event occurring.
Probability, on the other hand, establishes the likelihood of an event occurring, i.e.
how often the event will occur.
2. Unlike the odds, which are expressed as a ratio, the probability is expressed as a
percentage or a decimal.
3. Odds typically run from zero to infinity, with zero denoting the impossibility of an
event occurring and infinity denoting an ever greater likelihood of the event
occurring. Probability, on the other hand, ranges from zero to one. As a result, the
closer the probability is to zero, the greater the likelihood of it not occurring, and the
closer it is to one, the greater the likelihood of it occurring.
4. The ratio of favourable events to unfavourable events is known as the odds. The
probability may be computed by dividing the favourable occurrences by the total
number of events.

Odds are a part of probability, which is a branch of mathematics. With the use of
odds or probability, one can quantify chance. The probability is the ratio of occurrence
to the whole, whereas the odds are the ratio of occurrence to non-occurrence.
ni

1.1.10 Hypothesis Testing

Hypothesis Testing
U

A statistical hypothesis test’s main goal is to determine if a data sample is typical


or atypical in comparison to a population, provided a population hypothesis is correct.
A data sample is a tiny subset of entities drawn from a population, such as by an
ity

experiment, whereas the population includes all potential entities.

The basic concept of hypothesis testing is illustrated in the diagram below. In this
case, the population comprises all ducks, and the data sample comprises one individual
duck picked at random from the whole population. In statistics, this ‘random drawing’ is
referred to as sampling. A quantification of the situation is required to do the
comparison between the data sample and the population. In our case this quantification
consists of a mapping from a duck to a number. This number could be related to a
duck’s body weight, beak size, body size, or hair length, for example. Such a mapping
is known in statistics as a test statistic.

Figure: Intuitive example explaining the basic idea underlying a one-sample
hypothesis test.

A ‘hypothesis’ is, of course, an important part of hypothesis testing. The hypothesis
is a quantitative assertion we make regarding the test statistic’s population value. In our
case, it could be regarding a duck’s body parts, such as body size. We can formulate
the following hypothesis: the average body size is 20 cm. Such a hypothesis is denoted
the null hypothesis H0.

Assume we now have a population of ducks with a mean body size of 20 cm, subject
to natural fluctuations. Because the population consists of (infinitely) many ducks, and
we obtain such a quantification for each, we get a probability distribution for the mean
body size, known as the sampling distribution. It’s vital to remember that our population
is a fictitious one that follows our null hypothesis. In other words, the null hypothesis
completely defines the population.
ni

To evaluate the null hypothesis that we have formulated, we can compare the
numerical value of the test statistic, which represents the data sample, with the
sampling distribution, which represents the population. From this comparison we get
another numerical value, called the p-value, which quantifies how typical or atypical the
observed situation is if the null hypothesis is correct. Finally, a decision is made based
on the p-value.
ity

On a technical note, we should mention that the above problem is referred to as a
one-sample hypothesis test because there is only one population involved. The main
idea, however, extends to hypothesis tests involving more than one group.
m

Key Components of Hypothesis Testing


The preceding example will be formalised in the parts that follow. There are
seven components that all hypothesis tests have in common, regardless of the exact
hypothesis test being conducted. The components are summarised in the diagram
below and are listed in the order in which they enter the process when completing a
hypothesis test. As a result, they can be thought of as steps in a hypothesis test.
Because they are interrelated, the logical sequence in which they are presented is
critical. Overall, a hypothesis test is a procedure that must be carried out step by step.

Figure: Main components that are common to all hypothesis tests.

O
Parametric & Non-parametric Tests
The Student’s t-test is an example of a parametric test that uses data expressed in
absolute numbers or values rather than ranks.

ty
The parametric statistical test only works in certain circumstances. Because these
conditions aren’t usually evaluated, they’re assumed to hold. The validity of the
assumptions determines the significance of a parametric test’s results. The proper
interpretation of a parametric test based on the normal distribution also presupposes
that the scores being examined are the product of measurement on at least an interval
scale.

r
Let us try to grasp the meaning of the term population. The term “population”
refers to the total group of people that a researcher is attempting to understand in
relation to an event. The study is usually carried out on a sample of the population in
question, with the results being applied to the wider group from which the sample was
drawn.

Parametric statistical tests include the t, z, and F tests.

T-tests: A T-test is used to see if two groups’ scores on a single variable differ.

A t-test is used to determine whether there are any differences in mean scores.
For example, you could use a t-test to see if students in two courses have different
writing abilities. It is worth noting that parametric tests, such as the t-test and F-test, are
considered to be quite robust and can be used even when certain assumptions are not
met.
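
As a small illustration, a two-sample t-test can be run with SciPy; the writing scores
below are made up for two hypothetical courses:

from scipy import stats

course_a = [72, 85, 78, 90, 66, 81, 74]   # hypothetical writing scores
course_b = [68, 74, 70, 65, 72, 69, 71]

t_stat, p_value = stats.ttest_ind(course_a, course_b)
print(t_stat, p_value)   # reject H0 of equal means if p_value < alpha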

Parametric tests are useful because they are the most powerful for determining the
significance or reliability of the sample statistics calculated. However, their application is
predicated on a set of assumptions. These assumptions depend on the structure of the
population distribution and on how the data measurements are quantified, i.e. the type
of scale utilised.

Let’s look at what a scale is and the various types of scales. Nominal scale, ordinal
scale, interval scale, and ratio scale are the four types of scales used in measurement.

1. The nominal scale is used to represent nominal or classed data, such as the
population separated into males and females. There is no ordering of the data, in the
sense that saying male > female has no significance. These data are also labelled
with arbitrary labels like m/f and 1/0. These are also known as categorical scales, as
they have values that are divided into categories (i.e. they are names rather than
numbers).

2. The ordinal scale is used to represent ordered or ranked data. Although the values
are in a specific order, the differences between them are not meaningful. For
instance, a 5-point scale ranging from 1 to 5, with 1 indicating the least satisfied and
5 indicating the most satisfied.

in
3. The interval scale is used to deal with ordered data that has intervals. This scale is
continuous, yet it lacks a natural zero. Differences are interpretable. Temperature in
Centigrade or Fahrenheit, for example, is an example of this type of data, as are
calendar dates. The magnitude and equal-intervals requirements of a proper
measurement scale are met by an interval scale, but it lacks a real or absolute zero
point.

O
4. The ratio scale is a continuous, ordered scale with a natural zero. Height, weight,
age, and length are examples of this type of information.

Non-parametric statistics are used when there is no normal distribution in a sample
with a small number of items, for example, if our sample size is 30 or less (N ≤ 30).
They can also be utilised with nominal data in addition to ordinal data.

A non-parametric statistical test is based on a model that merely describes very

si
general conditions, with no mention of the precise form of the distribution from which
the sample was taken.

r
Most non-parametric statistical tests require some assumptions, such as that the
observations are independent and that the variable under examination has underlying
ve
continuity, but these assumptions are fewer and weaker than those required by
parametric tests.

Furthermore, as we will see, non-parametric approaches frequently test different


ni

population hypotheses than parametric procedures. Finally, non-parametric procedures,


unlike parametric tests, can be used to analyse data measured on an ordinal scale, a
nominal scale, or a categorical scale.
U

Small sample sizes are dealt with in non-parametric statistics. Non-parametric


statistics are assumption-free, which means they aren’t constrained by any
assumptions. When compared to parametric statistics, non-parametric statistics
are more user-friendly and time-saving. We’ve discovered that parametric tests are
ity

generally reliable and useful, even when some of their mathematical assumptions are
broken. These tests, on the other hand, are only employed with data that is based on
ratio or interval measurements. We employ non-parametric tests when dealing with
counted or ranked data. Non-parametric tests are said to have more validity since
m

they are not predicated on assumptions about the nature of the population distribution,
which are commonly neglected or violated by researchers who employ parametric tests.
Non-parametric tests, on the other hand, are less precise and have less power than
parametric tests.

Steps for Hypothesis Testing

Step 1: Select Test Statistic



A test statistic quantifies a data sample, to put it simply. Any mapping (or function)
between a data sample and a numerical value is referred to as a statistic in statistics.


The mean value and variance are two common examples. The test statistic can be
stated in formal terms as

t_n = T_n(D(n)),

where D(n) = {X_1, . . . , X_n} is a data sample of size n, T_n denotes the mapping, and
t_n its value. Typically, the test statistic assumes real values, i.e. t_n ∈ R; however,
there are several limitations.

A test statistic plays an important part in a hypothesis test, since the chosen test
statistic shapes the hypothesis test to a considerable extent. The reason for this is that
it will be incorporated into the hypotheses we develop in step 2. As a result, it’s critical
to pick a test statistic that is relevant and interesting for the study at hand.

We want to underline that we only choose the test statistic in this phase; we don’t
evaluate or apply it yet. That is accomplished in step 5.

Step 2: Null Hypothesis H0 and Alternative Hypothesis H1

At this point, we define two hypotheses: H0, which is the null hypothesis, and H1,
which is the alternative hypothesis. Both hypotheses make statements about the test
statistic’s population value and are mutually exclusive. The population value of the test
statistic t = T(D) we chose in step 1 is denoted by θ. We can construct the following
hypotheses based on this:

Null hypothesis H0: θ = θ0
Alternative hypothesis H1: θ > θ0

As can be seen, because of the way the two hypotheses are worded, only one
assertion can be true for the population parameter θ, not both. For example, if θ = θ0 is
true, then the alternative hypothesis H1 is false, and if θ > θ0 is true, then the null
hypothesis H0 is false.
U

The four probable outcomes of a hypothesis test are depicted in the diagram
above. Each of these outcomes has its own name that is widely used. For example,
a ‘true positive’ (TP) judgement is made when the null hypothesis is untrue and we
reject H0. The asymmetric meaning of a hypothesis test implies that rejecting H0 when
H0 is untrue is more informative than accepting H0 when H0 is correct, which is why
it’s called ‘positive’. In this sense, rejecting H0 can be considered a positive result of a
hypothesis test.
m

The above alternative hypothesis is an example of a one-sided hypothesis.
Because the alternative expects values greater than θ0, we formulated a right-sided
hypothesis. Alternatively, we can state a left-sided alternative hypothesis as follows:

Alternative hypothesis: H1: θ < θ0

Furthermore, we can formulate a two-sided alternative hypothesis that is
unconcerned about which side is correct:

Alternative hypothesis: H1: θ ≠ θ0



Despite the fact that there are hundreds of different hypothesis tests, the
description above applies to them all. This does not imply that if you understand one
hypothesis test, you will understand all of them; rather, if you understand the theory of
one hypothesis test, you will understand the principle of all hypothesis tests.

To relate the sample value of the test statistic t to its population value, one must
first understand the test statistic’s probability distribution. Because of this connection,
this probability distribution was given a special name: the sampling distribution of the
test statistic. It’s vital to remember that the sampling distribution represents the test
statistic’s values if the null hypothesis is correct. This signifies that the population value
of θ is θ0 in this circumstance.

For the time being, let’s suppose we know the sampling distribution of our test

O
statistic. We obtain a quantification for the ‘typicality’ of this value with respect to the
sampling distribution by comparing the particular value t of our test statistic with the
sampling distribution in a way that is determined by the way we formulated the null and
alternative hypotheses, assuming the null hypothesis is true.

ty
Step 3: Sampling Distribution
We indicated earlier in our general overview of the basic concept of a hypothesis

si
test that the relationship between a test statistic and its sample distribution is critical for
any hypothesis test. As a result, we go into this topic in greater depth in this part.

r
First of all, the sampling distribution is a probability distribution. Given specific
assumptions, this sampling distribution represents the distribution of the test statistic
T, which is a random variable. This remark can be clarified by defining the null
hypothesis’s sampling distribution as follows.
clarified by defining the null hypothesis’s sample distribution as follows.

Let X(n) = X1, . . . , Xn represent a random sample from a population with Xi ∼ P_pop,
and let T(X(n)) represent a test statistic. The sampling distribution of the null hypothesis,
or the null distribution, is the probability distribution f_n(x|H0 true) of T(X(n)) assuming
H0 is true.
the null distribution, is the probability distribution fn(x|H0 true) of T(X(n) assuming H0 is
true.
U

Similarly, f_n(x|H1 true) denotes the sampling distribution of the alternative
hypothesis. There are only two such sampling distributions in this case, since there are
only two different hypotheses, H0 and H1. However, we should point out that sampling
distributions can be used for purposes other than statistical hypothesis testing, such as
estimation theory or data bootstrapping.

We’d like to call attention to a few aspects of the preceding definition. First, the
distribution P_pop from which the random variables Xi are sampled can take any shape
and is not limited to, say, a normal distribution. Second, because it is a function of
random variables, the test statistic is itself a random variable. As a result, there is a
distribution associated with this random variable, and its values are samples from it.
Third, because the test statistic is a function of the sample size n, the sampling
distribution is a function of n as well. That is, if we modify the sample size n, the
sampling distribution will vary. Fourth, because f_n(x|H0 true) is the probability
distribution of T(X(n)), we may precisely rebuild the distribution f_n(x|H0 true) by
collecting an unlimited number of samples from it in the form T(X(n)) ∼ f_n(x|H0 true).
As we will see in the next example, the last point allows a numerical approximation of
the sampling distribution under specific conditions.

the sampling distribution under specific conditions.

Assume we have a random sample of size n, with X(n) = X1, . . . , Xn, and each data
point Xi sampled from a gamma distribution with α = 4 and β = 2, i.e., Xi ∼ gamma(α = 4,
β = 2). In addition, let’s utilise the mean value as a test statistic, i.e.

T(X(n)) = (1/n) Σ_{i=1}^{n} Xi = X̄.

in
Figures A–C illustrate three examples of drawing E = 100,000 samples X(n), from
which we estimate E = 100,000 different mean values T, for three different values of n
(in A, n = 1; in B, n = 3; and in C, n = 10). The density estimates of these 100,000
values are shown in Figures A–C. As previously stated, the approximate sampling
distribution Ps(n, E) becomes the (theoretical) sampling distribution in the limit of
infinitely many samples E.
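
A minimal sketch of this simulation in NumPy is shown below; note that NumPy
parameterises the gamma distribution by shape and scale, and taking scale = 1/β is our
assumption about the rate convention used here:

import numpy as np

rng = np.random.default_rng(0)
E, n = 100_000, 10                     # repetitions and sample size
samples = rng.gamma(shape=4.0, scale=0.5, size=(E, n))
means = samples.mean(axis=1)           # E draws from the sampling distribution
print(means.mean(), means.std())       # close to mu and sigma / sqrt(n)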


Figure: In Figures (A–C) we show approximate sampling distributions for different
values of the sample size n. Figure (A) shows Ps(n = 1, E = 100,000), which is equal to
the population distribution of Xi. Figure (D) shows a qq-plot comparing Ps(n = 10,
E = 100,000) with a normal distribution.


When n = 1, we have the special case where the sampling distribution is the same
as the underlying population distribution P_pop, which in our instance is a gamma
distribution with parameters α = 4 and β = 2, as illustrated in Figure (A). Figures (B) and
(C) show a change in the distributional shape of the sampling distribution for any other
n > 1. However, this transformation should be familiar since, according to the Central
Limit Theorem, the mean of n independent samples X1, . . . , Xn with mean µ and
variance σ² follows a normal distribution with mean µ and standard deviation σ/√n,
that is,

X̄ ∼ N(µ, σ/√n).
This result is only strictly valid in the limit of large n. However, in Figure (D) we
present a qq-plot showing that the resulting distribution, Ps(n = 10, E = 100,000), is
already quite close to a normal distribution for n = 10 (with the appropriate parameters).

ty
The Central Limit Theorem holds for arbitrary iid (independent and identically
distributed) random variables X1, . . . , Xn. As a result, the normal distribution given
above is always the asymptotic sampling distribution of the mean.

si
We can simplify things even more by applying a so-called z-transformation of the
mean value X̄ to Z:

Z = (X̄ − µ) / (σ/√n).

Z’s distribution is a standard normal distribution, which means that

Z ∼ N(0, 1).
ni

We have arrived at a critical juncture where we must ask ourselves whether we are
finished. This depends on our knowledge of the variance σ². If we know the variance σ²,
the sampling distribution of our transformed mean, which we call Z, is a standard
normal distribution. However, because the z-transformation above depends on σ, we
cannot carry it out if we do not know the variance σ². In this scenario we must estimate
the variance of the random sample X1, . . . , Xn:

s² = (1/(n − 1)) Σ_{i=1}^{n} (Xi − X̄)².

The variance estimate can then be used for the following t-transformation:

T = (X̄ − µ) / (s/√n).

Although this t-transformation is mathematically similar to the z-transformation above,
the resulting random variable T follows a Student’s t-distribution with n − 1 degrees of
freedom (dof) rather than a standard normal distribution. We should point out that this
holds only for normally distributed samples, i.e., Xi ∼ N(µ, σ).


Table 1. Sampling distribution of the z-score and the t-score.

Score                            Sampling Distribution
z-score: Z = (X̄ − µ)/(σ/√n)     standard normal N(0, 1) (σ known)
t-score: T = (X̄ − µ)/(s/√n)     Student’s t with n − 1 dof (σ estimated by s)
Step 4: Significance Level α

nl
The significance level α is a value that ranges from 0 to 1, i.e. α ∈ [0, 1]. It has the
interpretation

α = P(Type 1 error) = P(reject H0 | H0 true),

indicating the likelihood of rejecting H0 if H0 is true. That is, it tells us how likely we are
to make a Type 1 error, resulting in a false positive decision.

ty
We have the freedom to choose this value when running a hypothesis test.
However, before deciding on a numerical value, one must consider the possible
repercussions. Although 0.05 is probably the most common value, levels as low as
10⁻⁸ are utilised in Genome-Wide Association Studies (GWAS). The rationale for such
a wide range of employed values is the conceivable ramifications in the different
application domains. Type 1 errors in GWAS lead to follow-up studies that can cost
millions of dollars, which are quite expensive in this discipline. As a result, α is picked
at a very low level.
research, which are quite expensive in this discipline. As a result, is picked at a very low
level. r
ve
Step 5: Evaluate Test Statistic from Data
Because everything up to this point has been theoretical, this stage represents our
connection to the real world, as represented by the data. We evaluate the numerical
value of the test statistic selected in Step 1 for the observed data D(n) = x1, . . . , xn:

t_n = T_n(x_1, . . . , x_n).

Here t_n denotes a specific numerical value derived from the observed data D(n).
Because our data sample is based on the number of samples n, this numerical value
also depends on n; the subscript indicates this explicitly.
ity

Step 6: Determine the p-Values


We must utilise the sampling distribution (Step 3) and the evaluated test statistic t_n
(Step 5) to determine the p-value of a hypothesis test. That is, if H0 is correct, the
p-value is the outcome of a comparison of theoretical assumptions (sampling
distribution) with actual observations (data sample). For a right-sided alternative
hypothesis, this situation is depicted in the diagram below. If H0 is true, the p-value
represents the probability of seeing values at least as extreme as the test statistic t_n:

p = P(observe x at least as extreme as |t| | H0 is true) = P(x ≥ |t| | H0 is true).

It is calculated using an integral over the sampling distribution:

p = ∫_{|t|}^{∞} f_n(x | H0 true) dx.

O
Figure: Determining the p-value from the sampling distribution of the test statistic.

ty
The numerical value of p will be used to determine whether we reject or accept the
null hypothesis.

We can also make use of the following integral:

α = ∫_{θc}^{∞} f_n(x | H0 true) dx,

to find the value of θc. That is, a threshold θc is implied by the significance level. This
threshold can also be used to decide whether or not H0 should be rejected.

We want to underline that while the test statistic is a random variable, the p-value is
likewise a random variable because it depends on the test statistic.
ni

Step 7: Make a Decision about the Null Hypothesis


We make a decision about the null hypothesis in the final stage. There are two
options for accomplishing this: we can make the judgement based on the p-value or
based on the value of the test statistic t_n.

1. Decision based on the p-value: if p < α, reject H0.
2. Decision based on the threshold θc: if t_n > θc, reject H0.

We accept the null hypothesis if we can’t reject it.
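
The seven steps can be traced in a few lines of Python. The sketch below uses made-up
duck body sizes, the null hypothesis θ0 = 20 cm and a right-sided alternative:

import numpy as np
from scipy import stats

data = np.array([21.3, 19.8, 22.1, 20.5, 23.0, 21.7, 20.9, 22.4])  # hypothetical sample
mu0, alpha = 20.0, 0.05                 # Step 2 (H0: mu = 20) and Step 4

n = len(data)                           # Step 1: use the t-score as test statistic
t_n = (data.mean() - mu0) / (data.std(ddof=1) / np.sqrt(n))   # Step 5
p = 1 - stats.t.cdf(t_n, df=n - 1)      # Step 6: right-sided p-value (Step 3: t distribution)
print("reject H0" if p < alpha else "accept H0")              # Step 7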

1.1.11 Linear Regression: Advantages and Pitfalls


m

Advantages of Linear Regression


)A

Simple implementation
Linear Regression is a very simple method that may be quickly implemented and
produces good results. Furthermore, when compared to other, more complicated
techniques, these models may be trained quickly and efficiently on systems with limited
processing resources. When compared to other machine learning techniques, linear
regression has a significantly lower time complexity. Linear regression’s mathematical
formulae are very simple to comprehend and interpret. As a result, linear regression is
a simple concept to grasp.

Performance on linearly separable datasets

in
Linear regression almost perfectly fits linearly separable datasets and is frequently
used to determine the nature of the relationship between variables.

nl
Over-fitting can be reduced by regularization
Over-fitting: It occurs when a machine learning model fits a dataset extremely well

O
and hence captures the noisy data as well. This has a detrimental effect on the model’s
performance and affects its accuracy on the test set.

Regularization: It’s a simple strategy that effectively reduces the complexity of a


function, lowering the risk of Over-fitting.
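
A minimal sketch of L2 regularization (ridge regression) with scikit-learn, on made-up
data; the penalty strength alpha is a tuning choice, not a prescribed value:

from sklearn.linear_model import LinearRegression, Ridge

X = [[1], [2], [3], [4], [5]]
y = [1.2, 1.9, 3.2, 3.8, 5.1]

plain = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)     # alpha controls the penalty strength
print(plain.coef_, ridge.coef_)        # the ridge coefficient is shrunk toward zero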

ty
Disadvantages of Linear Regression

si
Prone to under-fitting
Under-fitting: A situation that occurs when a machine learning model fails to capture
the data adequately, i.e. when the hypothesis function fails to match the data well.
ve
Example

Linear regression fails to fit complicated datasets because it presupposes a linear
relationship between the input and output variables. In most real-world settings, the
connection between the dataset’s variables isn’t linear, hence a straight line doesn’t
appropriately fit the data. In such cases, a more sophisticated function can more
effectively capture the data. As a result, the accuracy of such linear regression models
is low.


Sensitive to outliers
Anomalies or extreme values that depart from the rest of the distribution are called
outliers in a data set. Data outliers can severely degrade the performance of a machine
learning model, resulting in models with low accuracy.

Example:

nl
O
ty
r si
ve
ni

Outliers can have a significant impact on the performance of linear regression,
hence they must be dealt with properly before linear regression is applied to the
dataset.
U

Linear Regression assumes that the data is independent


Because the inputs are frequently not independent of one another, any
multicollinearity must be eliminated prior to using linear regression.
ity

Applications of regression analysis


As a statistical technique, regression analysis offers a variety of applications
or utility for which it is widely employed in practically all natural, physical, and social
sciences. The following are some of the specific applications or utilities of such a
m

technique:

●● It establishes a functional link between two or more related variables, allowing us
to readily estimate or anticipate the unknown values of one variable based on the
known values of the other.
●● It offers a measure of the errors in estimations obtained using the regression
line. A small scatter of observed (actual) values around the appropriate regression
line suggests good estimations of a variable’s values and a lower level of error. A
large amount of scatter of observed values around the appropriate regression line,
on the other hand, suggests erroneous estimations of the values of a variable and
a significant degree of error.
●● It gives a measure of the correlation coefficient between the two variables, which
may be computed by taking the square root of the product of the two regression
coefficients, i.e. r = √(bxy · byx).
●● It gives us a measure of the coefficient of determination, which reflects the effect of
the independent variable (explanatory, or regressing variable) on the dependent
variable (explained, or regressed variable), and thus gives us an indication of
the regression analysis’ predictive value. This coefficient of determination is
calculated by multiplying the two regression coefficients, i.e. r² = bxy · byx. The
higher the value of the coefficient of determination (r²), the better the fit and the
more effective the regression equations are as estimating devices.
●● It is a powerful statistical analysis tool in the field of business and commerce,
where people are interested in predicting future events such as consumption,
production, investment, prices, sales, profits, and so on, and where businessmen’s
success is highly dependent on the accuracy of their various estimates.
●● It is a useful tool for determining and evaluating the cause-and-effect relationship
between the economic variables that are at the heart of economic theory and
practice. It is widely used to estimate demand curves, supply curves, production
functions, cost functions, and consumption functions, among other things. In fact,
economists have proposed a variety of production functions by fitting regression
lines to input and output data.
●● This method is widely utilised in our daily lives and in sociological studies to
estimate various quantities such as birth rates, death rates, tax rates, yield rates,
and so on.
●● Last but not least, the regression analysis technique provides us with an estimate
of a series’ relative variation.
U

Summary
●● The technique of discovering correlations between dependent and independent
ity

variables is known as regression. It aids in the prediction of continuous variables


such as market trends, house values, and so forth. Linear regression is a type
of supervised machine learning model that is commonly used in predicting.
Supervised machine learning models are those that use training data to develop
the model and then use the loss function to verify the model’s correctness.
m

●● The best fit line or line of best fit is a line that can describe the relationship
between the independent and dependent variable(s). The dependent variable is
predicted using this line. The goal of line of best fit is to get the observed and real
)A

values as close as feasible to each other.


●● The residuals have constant variance at every level of x, according to the
following assumption of linear regression. Homoscedasticity is the term for this.
The residuals are considered to have heteroscedasticity when this isn’t the case.
(c

Heteroscedasticity, in particular, increases the variance of regression coefficient


estimates, but the regression model ignores this. This increases the likelihood that




a regression model may proclaim a term in the model to be statistically significant


Notes

e
when it is not.
●● Pandas is a free and open-source Python toolkit created by software engineer

in
Wes McKinney in 2008. It offers high-level data structures and programming tools
for doing data analysis and manipulation on a variety of data kinds, including
numbers, text, and dates.

nl
●● Scikit-learn is undoubtedly Python’s most helpful machine learning library.
Classification, regression, clustering, and dimensionality reduction are just a few
of the useful functions in the sklearn toolkit for machine learning and statistical

O
modelling.
●● Multiple linear regression (MLR), often known as multiple regression, is a
statistical technique that predicts the result of a response variable by combining
numerous explanatory variables. Multiple linear regression attempts to represent

ty
the linear relationship between explanatory (independent) and response
(dependent) variables. Because it incorporates more than one explanatory
variable, multiple regression is essentially an extension of ordinary least-squares

si
(OLS) regression.
●● The term odds is defined in mathematics as the ratio of the number of favourable
events to the number of unfavourable events. While odds for an event show the
r
possibility of the event occurring, odds against reflect the likelihood of the event
ve
not occurring.
●● Probability is a mathematical term that deals with the possibility of a specific event
occurring. It is the foundation for a hypothesis testing theory and an estimating
theory. It’s calculated as the ratio of the number of favourable events to the
ni

total number of events. A statistical hypothesis test’s main goal is to determine


if a data sample is typical or atypical in comparison to a population, provided a
population hypothesis is correct. A data sample is a tiny subset of entities drawn
U

from a population, such as by an experiment, whereas the population includes all


potential entities.
●● The parametric statistical test only works in certain circumstances. Because these
conditions aren’t usually evaluated, they’re considered to be true. The validity of
ity

the assumption determines the significance of a parametric test’s results.


●● Parametric tests are useful because they are the most powerful for determining
the significance or reliability of sample statistics calculated. Non-parametric
statistical test is based on a model that merely describes very general conditions,
m

with no mention of the precise form of the distribution from which the sample was
taken. Most non-parametric statistical tests require some assumptions, such as
that the observations are independent and that the variable under examination
)A

has underlying continuity, but these assumptions are fewer and weaker than those
required by parametric tests.

Glossary
●● Discrete distribution: A discrete distribution is one with a finite number of possible
(c

values deriving from countable data.




●● Dataframe: These are tabular data’s data structures, which are similar to
Notes

e
spreadsheets and can hold vast amounts of data.
●● Binomial distribution: The binomial distribution calculates the likelihood of a certain

in
number of successes or failures in each attempt of an experiment.
●● Poisson distribution: The Poisson distribution is a discrete probability distribution
that assesses the likelihood of a number of events occurring in a particular time

nl
period when they happen in a well-defined order.
●● Hypergeometric distribution: A hypergeometric distribution is a discrete distribution
that estimates the likelihood of a certain number of successes in (n) trials without

O
replacement from a large population (N). To put it another way, it’s sampling
without replacement.
●● Geometric distribution: A geometric distribution is a discrete distribution that

ty
estimates the probability of the first success.
●● Numerical Datasets: A numerical data set is one in which the information is
expressed in numbers rather than natural language.

si
●● Bivariate Datasets: A bivariate data set is one that contains two variables. It is
concerned with the relationship that exists between the two variables.
●● Multivariate Datasets: A multivariate dataset is defined as one that has three or
r
more data kinds (variables).
ve
●● Categorical Datasets: Categorical data sets represent a person’s or an object’s
attributes or qualities.
●● Correlation Datasets: Correlation data sets are a collection of variables that have
ni

some sort of relationship with one another.


●● Decision trees: A decision tree is a recursive partition of the instance space that
represents a classifier.
U

●● Linear regression: Linear regression aims to discover correlations and


dependencies between variables.
●● DWT: Discrete Wavelet Transform.
ity

●● PCA: Principal Components Analysis.


●● UNESCO: United Nations Educational, Scientific, and Cultural Organization.
●● UNDP: United Nations Development Programme.
m

●● WHO: World Health Organization.


●● FAO: Food and Agricultural Organization.
)A

●● MLR: Multiple linear regression (MLR), often known as multiple regression, is a


statistical technique that predicts the result of a response variable by combining
numerous explanatory variables.
●● OLS: Ordinary Least-Squares regression.
●● Best fit line: The best fit line or line of best fit is a line that can describe the
(c

relationship between the independent and dependent variable(s).




●● Over-fitting: It occurs when a machine learning model fits a dataset extremely well
Notes

e
and hence captures the noisy data as well. This has a detrimental effect on the
model’s performance and affects its accuracy on the test set.

in
●● Regularization: It’s a simple strategy that effectively reduces the complexity of a
function, lowering the risk of Over-fitting.
●● Under-fitting: When a machine learning model fails to capture data adequately,

nl
a situation occurs. When the hypothesis function fails to match the data well,
something happens.
●● Homogeneity of variance (homoscedasticity)- The amount of the error stays

O
constant is one of the primary expectations of a simple linear regression approach.
This basically indicates that the error magnitude does not change much while the
independent variable’s value varies.

ty
●● Independence of observations- Nothing is hidden in the relationships between
the observations, and only valid sampling procedures are employed during data
gathering.

si
●● Normality- The data is flowing at a typical rate.

Check Your Understanding

r
1. The technique of discovering correlations between dependent and independent
variables is known as_ _ _ _.
ve
a) Regression
b) Forecasting
c) Mapping
ni

d) Prediction
2. The_ _ _ _in linear regression are a measure of the vertical distance between each
U

data point and the fitted line.


a) call
b) residuals
ity

c) coefficient
d) None of the above
3. _ _ _ _ _are tabular data’s data structures, which are similar to spreadsheets and
m

can hold vast amounts of data.


a) Ledger
b) Stack
)A

c) Dataframe
d) Dataset
4. Pandas is a free and open-source Python toolkit created by software engineer_ _
(c

_ _in_ _ _ _.
a) Satoshi Nakamoto, 2009




b) Nick Szabo, 2007


Notes

e
c) Wei Dai, 2006
d) Wes McKinney, 2008

in
5. Which one of the statements is true regarding residuals in regression analysis?
a) Mean of residuals is always zero

nl
b) Mean of residuals is always less than zero
c) Mean of residuals is always greater than zero

O
d) There is no such rule for residuals
6. Which one is true about Heteroskedasticity?
a) Linear regression with constant error terms

ty
b) Linear regression with varying error terms
c) Linear regression with zero error terms

si
d) All of the mentioned
7. The correlation coefficient is used to determine:
a) A specific value of the y-variable given a specific value of the x-variable
r
b) A specific value of the x-variable given a specific value of the y-variable
ve
c) The strength of the relationship between x and y variables
d) None of the mentioned
8. The square of the correlation coefficient r 2 will always be positive and is called the_
ni

_ _ _.
a) Regression
U

b) Coefficient of determination
c) KNN
d) Algorithm
ity

9. Acronym of Analysis of Variance is?


a) ANOV
b) AVA
m

c) ANOVA
d) ANVA
)A

10. Which is true about homoscedasticity?


a) Linear regression with zero error terms
b) Linear regression with more than zero error terms
c) The amount of the error stays varying is one of the primary expectations of a
(c

simple linear regression approach




d) The amount of the error stays constant is one of the primary expectations of a
Notes

e
simple linear regression approach
11. _ _ _ _ _ _is a statistical technique that predicts the result of a response variable by

in
combining numerous explanatory variables.
a) Multiple linear regression
b) Simple linear regression

nl
c) Polynomial regression
d) Decision tree regression

O
12. The term_ _ _ _ is defined in mathematics as the ratio of the number of favourable
events to the number of unfavourable events.
a) mean

ty
b) odds
c) median

si
d) mode
13. _ _ _ _is a mathematical term that deals with the possibility of a specific event
occurring.
a) Mean r
ve
b) Median
c) Probability
d) Mode
ni

14. _ _ _ _ _regression method is also known as the ordinary least square estimation.
a) Simple
U

b) Direct
c) Indirect
d) Mutual
ity

15. A_ _ _ _ _is a recursive partition of the instance space that represents a classifier.
a) Simple linear
b) Multiple linear
m

c) Polynomial
d) Decision tree
)A

Exercise
1. Define linear regression and state its important terms and assumptions.
2. Explain the statistical view of regression.
3. Define multiple linear regression.
(c

4. Explain Linear Regression.




5. Define logistic regression.


Notes

e
6. What are the differences between odds and probability?
7. Define linear regression and discuss its advantages and pitfalls.

in
Learning Activities
1. Why does linear regression belong to both statistics and machine learning?

nl
Check Your Understanding - Answers
1. a

O
2. b
3. c

ty
4. d
5. a
6. b

si
7. c
8. b
9. c r
ve
10. d
11. a
12. b
ni

13. c
14. b
U

15. d

Further Readings and Bibliography:


1. Introduction to Linear Regression Analysis, Douglas Montgomery, Elizabeth A.
ity

Peck, and G. Geoffrey Vining


2. Linear regression analysis, George Seber
3. Regression and Linear Models, Richard B. Darlington
m

4. Statistics and Data Visualization Using R, David Brown


5. Storytelling With Data: A Data Visualization Guide for Business Professionals,
Cole Nussbaumer Knaflic
)A

6. Data Visualization: A Practical Introduction, Kieran Healy and Kieran Joseph


Healy
7. Fundamentals of Data Visualization: A Primer on Making Informative and
Compelling Figures, Claus O. Wilke
(c




Module - II: Supervised Learning: Classification



e
Learning Objectives:

in
At the end of this module, you will be able to understand:

●● Concept of Classification and Regression

●● Concept of Bias Variance Trade-off
●● All about Under-fitting and Over-fitting

O
●● What is Hold Out Method and its application
●● Various Evaluation Methods: Cross Validation and Bootstrapping Method
●● Performances Metrices and its applications

Introduction
Regression and Classification Algorithms are two of the most common types of
supervised machine learning algorithms. We can forecast the output for continuous
values using Regression techniques, but we need Classification algorithms to predict
the output for categorical values.

An algorithm that classifies new observations based on previously trained data is
known as a Supervised Learning technique. To classify new data, the software uses the
dataset or observations it has been provided as a basis for learning and then assigns
the data to one of a number of categories or groups, for example “yes” or “no”.
Targets, labels or categories are other names for classes.

2.1 Introduction to Classification Methods


Supervised Learning algorithms include regression and classification techniques.
U

Both techniques are used in Machine Learning for prediction and work with labelled
datasets. However, there is a distinction between the two in terms of how they are
applied to certain machine learning situations.
ity

The main difference between Regression and Classification algorithms is that


Regression algorithms are used to predict continuous values like price, salary, age, and
so on, whereas Classification algorithms are used to predict discrete values like Male or
Female, True or False, Spam or Not Spam, and so on.
m
)A
(c




2.1.1 Introduction to Classification and Regression



e
Classification:

in
Classification is the process of identifying a function that aids in the classification
of a dataset based on several factors. A computer programme is trained on the training
dataset and then categorises the data into distinct classes depending on that training.

The classification algorithm’s goal is to discover the mapping function that will
convert the input (x) to the discrete output (y).

Example: Email Spam Detection is the best example of the Classification problem.
The model is trained on millions of emails on various parameters, and it determines if
an email is spam or not when it receives a new one. The email is moved to the Spam
folder if it is spam.

Algorithms for Machine Learning Classification:
●● Logistic Regression
●● K-Nearest Neighbours
●● Support Vector Machines
●● Kernel SVM
●● Naïve Bayes
●● Decision Tree Classification
●● Random Forest Classification
●● Random Forest Classification
ni

Regression:
The technique of discovering correlations between dependent and independent
variables is known as regression. It aids in the prediction of continuous variables such
U

as market trends, house values, and so forth.

The Regression algorithm’s goal is to identify the mapping function that will
translate the continuous input variable (x) to the discrete output variable (y).
ity

Example: Let’s say we want to forecast the weather, so we’ll apply the Regression
approach. When it comes to weather prediction, the model is trained on historical data,
and after it is finished, it can accurately predict the weather for future days.
m

Regression Algorithm Types:


●● Simple Linear Regression
●● Multiple Linear Regression
)A

●● Polynomial Regression
●● Support Vector Regression
●● Decision Tree Regression
(c

●● Random Forest Regression
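In the same spirit, a minimal regression sketch (again with invented numbers) fits a straight line to past observations and predicts a continuous value for a new input.

# A minimal regression sketch using scikit-learn; the data are invented.
from sklearn.linear_model import LinearRegression

# x: day index, y: recorded temperature (continuous target)
X_train = [[1], [2], [3], [4], [5]]
y_train = [21.0, 21.8, 22.9, 23.7, 24.6]

reg = LinearRegression()
reg.fit(X_train, y_train)          # learn the mapping x -> continuous y

print(reg.predict([[6]]))          # forecast for day 6, roughly [25.5]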


Difference between Regression and Classification


Regression Algorithm vs. Classification Algorithm:

●● In Regression, the output variable must be of continuous nature or real value; in Classification, the output variable must be a discrete value.
●● The task of the regression algorithm is to map the input value (x) to a continuous output variable (y); the task of the classification algorithm is to map the input value (x) to a discrete output variable (y).
●● Regression algorithms are used with continuous data; Classification algorithms are used with discrete data.
●● In Regression, we try to find the best-fit line, which can predict the output more accurately; in Classification, we try to find the decision boundary, which can divide the dataset into different classes.
●● Regression algorithms can be used to solve regression problems such as weather prediction, house price prediction, etc.; Classification algorithms can be used to solve classification problems such as identification of spam emails, speech recognition, identification of cancer cells, etc.
●● Regression algorithms can be further divided into Linear and Non-linear Regression; Classification algorithms can be divided into Binary Classifiers and Multi-class Classifiers.
2.1.2 Classification of Supervised Algorithms
Supervised algorithms fall into two categories:

●● Parametric model
●● Nonparametric model

Parametric Model

Although they can substantially ease learning, assumptions also place a limit on what can be learned. Parametric machine learning algorithms are those that reduce the function to a known form.

"A parametric model is a type of learning model that summarises data with a set of parameters of fixed size (regardless of the quantity of training examples). A parametric model won't change its mind about how many parameters it requires, no matter how much data you feed it."

Such algorithms involve two steps:

1. Choose a form for the function.
2. Study the training data to determine the function's coefficients.

A line, like that used in linear regression, is an intuitive functional representation for the mapping function:

b0 + b1*x1 + b2*x2 = 0

Where x1 and x2 are two input variables, and b0, b1, and b2 are the line's coefficients that determine the intercept and slope.


Assuming the function has the shape of a line substantially streamlines the learning process: we now have a prediction model for the problem, and all we need to do is estimate the coefficients of the line equation.

As a result, parametric machine learning algorithms are frequently also referred to as "linear machine learning algorithms", because the assumed functional form is frequently a linear combination of the input variables.

The issue is that the underlying function may not actually be linear like a line. It might be almost a line and only need a small transformation of the input data to work properly. Or it might not be a line at all, in which case the assumption is incorrect and the approach will yield subpar results.

Additional illustrations of parametric machine learning algorithms are as follows:

◌◌ Logistic Regression
◌◌ Linear Regression
◌◌ Naive Bayes
◌◌ Simple Neural Networks

Benefits of Parametric Machine Learning Algorithms:

◌◌ Simpler: The outcomes of these procedures are simpler to comprehend and interpret.
◌◌ Speed: Learning parametric models from data happens relatively quickly.
◌◌ Less Data: They don't need as much training data and can still perform well if the data fit is imperfect.

Limitations of Parametric Machine Learning Algorithms:

◌◌ Constrained: By committing to a functional form, these techniques are strictly limited to that form.
◌◌ Limited Complexity: The methods are better suited to simpler problems.
◌◌ Poor Fit: In practice, the chosen form is unlikely to match the true underlying mapping function exactly.
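The defining property, a fixed number of parameters regardless of how much data is supplied, can be seen in a short sketch (data invented; scikit-learn and NumPy assumed installed): however many rows we fit on, a linear model keeps exactly one intercept plus one coefficient per feature.

# Sketch: a parametric (linear) model keeps a fixed-size parameter set.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
for n in (100, 10_000):                      # two very different data sizes
    X = rng.normal(size=(n, 2))              # two input variables x1, x2
    y = 3.0 + 1.5 * X[:, 0] - 2.0 * X[:, 1]  # true line: b0=3, b1=1.5, b2=-2
    model = LinearRegression().fit(X, y)
    # Always exactly 3 learned numbers: intercept b0 and coefficients b1, b2
    print(n, model.intercept_, model.coef_)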
ity

Nonparametric
Nonparametric machine learning algorithms are those that do not make any firm assumptions about the shape of the mapping function. Because they make no such assumptions, they are free to learn any functional form from the training data.

When you have a lot of data, no prior information, and you don't want to worry too much about picking the perfect features, nonparametric approaches are a viable option.

While keeping some ability to generalise to untried data, nonparametric approaches aim to develop the mapping function that best fits the training data. They can therefore fit a variety of practical forms.

The k-nearest neighbours approach, which generates predictions based on the k most comparable training patterns for a new data instance, is a simple nonparametric model. The only assumption made by the approach is that similar patterns are likely to have a comparable output variable.


Additional illustrations of well-known nonparametric machine learning techniques include:

◌◌ k-Nearest Neighbors
◌◌ Decision Trees like CART and C4.5
◌◌ Support Vector Machines
◌◌ Random Forest

Benefits of Nonparametric Machine Learning Algorithms:

◌◌ Flexibility: Able to fit a wide variety of functional forms.
◌◌ Power: No (or only weak) assumptions are made about the underlying function.
◌◌ Performance: Can lead to prediction models with greater performance.

Limitations of Nonparametric Machine Learning Algorithms:

◌◌ More data: Extra training data is needed to estimate the mapping function.
◌◌ Slower: They train much more slowly because there are frequently more parameters to train.
◌◌ Over-fitting: There is a greater risk of Over-fitting the training data, and it is harder to justify why specific predictions are made.
r
ve
Logistic Regression
Logistic regression is similar to linear regression but is employed when the dependent variable is not a number but something categorical, such as a "yes" or "no" response. A regression is still fitted, but classification is performed on top of it: the dependent variable is assigned to one of the classes.

Figure: Logistic model vs. linear regression. The linear regression line runs unbounded, while the logistic model maps inputs to values between Class 0 and Class 1.
As mentioned before, logistic regression is used to predict binary output. For

instance, a credit card provider will model whether a consumer will “default” or “not
default” on their card when determining whether or not to offer them a credit card.
In order to create the model, a regression is first fitted to the relationship between the variables, with the classification threshold typically set at 0.5. A logistic function is then applied to the regression output to determine the likelihood that an observation belongs to either class.


It compares the log of the likelihood that an event will occur to the log of the likelihood that it won't, and in the end it assigns the variable to whichever class has the larger likelihood.

Odds = Pr(y) / Pr(~y)

Odds is the ratio of the probability that an event occurs to the probability that it does not occur.
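The relationship between a predicted probability and the corresponding odds can be checked in a couple of lines (the probability value here is an arbitrary illustration):

# Sketch: converting a probability into odds and log-odds.
import math

p = 0.8                    # probability the event occurs (made-up value)
odds = p / (1 - p)         # Pr(y) / Pr(~y) -> 4.0, i.e. "4 to 1"
log_odds = math.log(odds)  # the quantity a logistic model is linear in

print(odds, log_odds)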

O
Linear Regression
Linear regression analysis is a statistical method for predicting the value of one variable based on the value of another. The variable you want to forecast is the dependent variable; the variable you use to make the forecast is the independent variable.

This form of analysis estimates the coefficients of the linear equation using one or more independent variables that best predict the value of the dependent variable. Linear regression fits a straight line or surface that minimises the discrepancy between predicted and actual output values. Simple linear regression calculators are available that use the "least squares" approach to find the best-fit line for a set of paired data; the value of Y (the dependent variable) is then estimated from X (the independent variable).
determined using Y. (independent variable).

Naive Bayes
ni

The naive Bayes classifier is based on Bayes' theorem together with an independence assumption between predictors (i.e., it assumes the presence of a feature in a class is unrelated to any other feature). Even when these features depend on one another or on the existence of other features, the model treats them all as independent; hence the term "naive" Bayes.

Gaussian naive Bayes, which builds on naive Bayes, is employed for classification when the features are assumed to follow a normal (Gaussian) distribution.

Bayes' theorem gives:

P(class|data) = P(data|class) × P(class) / P(data)

The posterior probability of the given class (target) for the given data point (attribute) is denoted by P(class|data); this is the value we are attempting to calculate.

●● P(class) is the prior probability of the class.
●● P(data|class) is the likelihood, i.e., the probability of the predictor given the class.
●● P(data) is the prior probability of the predictor, or marginal likelihood.
Steps

1. Calculate Prior Probability


P(class) = Number of data points in the class / Total number of observations

P(yellow) = 10/17
P(green) = 7/17

2. Calculate Marginal Likelihood

P(data) = Number of data points similar to the observation / Total number of observations

P(?) = 4/17


It is useful to check the likelihood of each of the two outcomes.

3. Calculate Likelihood

P(data|class) = Number of observations similar to the new point within the class / Total number of points in the class

P(?|yellow) = 1/10
P(?|green) = 3/7

4. Posterior Probability for Each Class

P(yellow|?) = P(?|yellow) × P(yellow) / P(?) = (1/10 × 10/17) / (4/17) = 0.25
P(green|?) = P(?|green) × P(green) / P(?) = (3/7 × 7/17) / (4/17) = 0.75

5. Classification

As the point has a 75% chance of belonging to the green class, it is classified as green. The other naive Bayes models used to calculate probabilities are the multinomial and Bernoulli variants. A naive Bayes model is simple to construct and does not require laborious iterative parameter estimation, making it especially helpful for very big datasets.
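The arithmetic of the worked example can be reproduced in a few lines; the counts (10 yellow and 7 green points, with 1 and 3 "similar" neighbours respectively) come straight from the example above.

# Sketch: reproducing the naive Bayes posterior from the worked example.
n_yellow, n_green, n_total = 10, 7, 17
similar_yellow, similar_green = 1, 3           # similar points near "?"

prior = {"yellow": n_yellow / n_total, "green": n_green / n_total}
likelihood = {"yellow": similar_yellow / n_yellow,
              "green": similar_green / n_green}
marginal = (similar_yellow + similar_green) / n_total  # P(?) = 4/17

for c in ("yellow", "green"):
    posterior = likelihood[c] * prior[c] / marginal
    print(c, round(posterior, 2))              # yellow 0.25, green 0.75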

Simple Neural Networks


Neural networks are made up of nodes modelled after the neurons of the brain. These nodes are linked to one another, in one or both directions. The basic structure of a directional network consists of an input layer, with each node receiving one of the system's n inputs; a hidden layer made up of m nodes, each of which combines the n inputs it receives; and an output node that receives the hidden layer's m outputs and produces the system output of the neural network. A neural network can also be represented by its transfer function. In this representation, the propagation function is formed by first weighting and summing the inputs; a transfer function is then applied to constrain the output value between two values, such as 0 and 1 (sigmoid function) or -1 and 1 (hyperbolic tangent function), followed by an optional activation function.

activation function.

In the learning phase, the weight values of the input connections are modified across a number of rounds. The supervised version of this algorithm compares the neural network's output with the target output associated with the corresponding input; the error is then used to modify the weight values so that they fit the actual output values more closely.
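A minimal forward pass makes this description concrete. The weights below are arbitrary illustrative numbers, with a sigmoid used as the transfer function.

# Sketch of a single-neuron forward pass: weighted sum + sigmoid transfer.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))  # squeezes the output between 0 and 1

inputs  = [0.5, 1.0, -0.3]             # the system's n inputs (invented)
weights = [0.4, -0.6, 0.9]             # connection weights (invented)
bias    = 0.1

propagation = sum(w * x for w, x in zip(weights, inputs)) + bias
output = sigmoid(propagation)          # node output in (0, 1)
print(output)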

k-Nearest Neighbors
One of the simplest classification algorithms, K-NN examines which classes the nearby data points belong to in order to forecast the classification of a new sample point. K-NN is a non-parametric, lazy learning algorithm: it categorises new cases using a similarity metric (i.e., a distance function).

k-Nearest Neighbors:
●● k is the number of neighbours to consider.
●● Scaling is important.
●● k should be odd.
●● If we have binary features, we can use Hamming distance.
●● Voting can be weighted by distance to each neighbour.
●● Does not scale well to large data.

If k=3, the grey square observation is predicted to be green because two of its
neighbours are green and only one is red.
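A short sketch of the idea (toy 2-D points with invented coordinates; scikit-learn assumed installed), including the feature scaling the list above recommends:

# Sketch: k-NN classification with k=3 and feature scaling.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

X = [[1.0, 1.2], [0.8, 0.9], [1.1, 0.7], [5.0, 5.2], [5.3, 4.8], [4.7, 5.1]]
y = ["red", "red", "red", "green", "green", "green"]

scaler = StandardScaler().fit(X)           # scaling matters for distances
knn = KNeighborsClassifier(n_neighbors=3)  # odd k avoids ties in binary voting
knn.fit(scaler.transform(X), y)

print(knn.predict(scaler.transform([[4.5, 4.9]])))  # -> ['green']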
ni

Decision Trees like CART and C4.5


A decision tree uses a tree structure to develop classification or regression models. It incrementally builds an associated decision tree while segmenting the dataset into smaller and smaller subsets. The outcome is a tree containing decision nodes and leaf nodes. The splits can be determined using the Iterative Dichotomiser 3 (ID3) algorithm.

A decision tree is built using information gain and entropy.

Entropy

Entropy is a measure of the uncertainty or unpredictability of a variable; put another way, it is a gauge of impurity:

Entropy(S) = Σ −p(i) × log2 p(i), summed over the classes i

Intuitively, it tells us how predictable an event is. Entropy determines the homogeneity of a sample: the sample has an entropy of one when it is evenly divided between two classes and zero when it is totally homogeneous.

Information Gain

Information gain measures the relative change in entropy with respect to an independent attribute. It attempts to estimate the amount of information each attribute contains. The key to building a decision tree is finding the attribute that yields the largest information gain (i.e., the most homogeneous branches):

Gain(T, X) = Entropy(T) − Entropy(T, X)

where Gain(T, X) is the information gained by splitting on feature X, Entropy(T) is the entropy of the entire collection, and the second term is the entropy after applying the feature X.

At a specific node in the tree, information gain is used to rank the candidate attributes; the attribute yielding the highest information gain determines each split.
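Both quantities are easy to compute directly. The sketch below evaluates a made-up split of 10 labels into two branches:

# Sketch: entropy and information gain for one candidate split.
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

parent = ["yes"] * 5 + ["no"] * 5   # evenly split -> entropy 1.0
left = ["yes"] * 4 + ["no"]         # two branches after the split
right = ["yes"] + ["no"] * 4

weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(parent)
gain = entropy(parent) - weighted   # Gain(T, X) = Entropy(T) - Entropy(T, X)
print(round(entropy(parent), 3), round(gain, 3))   # 1.0, ~0.278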
ni

A drawback of decision tree models is Over-fitting, which lowers test accuracy as the model tries to fit the training set by delving ever deeper into it.

Figure: An overfit model. Overfitting occurs when a model starts to memorise particular aspects of the training set and in turn loses the ability to generalise.

Pruning nodes helps reduce Over-fitting in decision trees.

Support Vector Machines


Support vector machines are employed for both classification and regression. Their foundation is the idea of decision planes, which define decision boundaries: a decision plane (hyperplane) is one that separates a group of objects having different class memberships.

With the aid of support vectors, the SVM carries out classification by locating the hyperplane that maximises the margin between the two classes. By transforming the problem with some linear algebra, the SVM learns this hyperplane (the example above uses a linear kernel, which assumes linear separability between the classes).
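A compact illustration with scikit-learn's SVC (linearly separable toy points, invented values):

# Sketch: a linear-kernel SVM finding a maximum-margin hyperplane.
from sklearn.svm import SVC

X = [[1, 2], [2, 1], [2, 3], [6, 5], [7, 7], [6, 8]]
y = [0, 0, 0, 1, 1, 1]

svm = SVC(kernel="linear")   # linear kernel: a straight separating hyperplane
svm.fit(X, y)

print(svm.support_vectors_)  # the points that define the margin
print(svm.predict([[5, 6]])) # -> [1]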

Random Forest
ity

The random forest classifier is an ensemble algorithm based on bagging, or bootstrap aggregation. Ensemble methods combine multiple algorithms of the same or different types when classifying objects (for example, an ensemble of SVM, naive Bayes and decision tree classifiers).

The basic premise is that combining learning models improves the final outcome.
Over-fitting may be a problem for deep decision trees, but random forests avoid it by building trees from random subsets of the data. Biases are largely eliminated because the forest takes the average of all the individual predictions.

While the trees are growing, the random forest adds further randomness to the model: when splitting a node, it searches for the best feature among a random subset of features rather than considering all of them. The great diversity this causes generally produces a better model.
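A brief sketch (synthetic data via scikit-learn's make_classification helper) showing an ensemble of 100 randomised trees voting on predictions:

# Sketch: a random forest as an ensemble of randomised decision trees.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=8, random_state=0)

forest = RandomForestClassifier(
    n_estimators=100,     # 100 trees, each grown on a bootstrap sample
    max_features="sqrt",  # each split considers a random subset of features
    random_state=0,
).fit(X, y)

print(forest.predict(X[:3]))        # majority vote across the trees
print(forest.predict_proba(X[:3]))  # averaged per-class probabilities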

2.1.3 Bias Variance Trade-off


The terms "bias" and "variance" refer to the components of prediction error in statistical learning. A machine learning model aims to reduce its prediction error by minimising both bias and variance. Variance describes how using a different training set alters the prediction: in the ideal case, predictions from multiple training sets differ only a little from one another, but in high-variance situations even tiny adjustments to the training set can result in radically different predictions. Bias, on the other hand, refers to error brought about by oversimplifying the problem: no matter how much training data there is, using linear regression on a problem with clearly nonlinear connections between the variables will produce considerable bias.

connections between variables will produce considerable bias.

In general, a model’s flexibility improves variance while decreasing bias [8]. As a


result, bias and variance have a trade-off that needs to be adjusted in order to obtain
high model accuracy.
m

According to the bias-variance trade-off, a learning algorithm’s squared error can


be divided into three parts:
)A
(c

Amity Directorate of Distance & Online Education


Foundations of Machine Learning 61

Notes

e
in
nl
O
ty
Figure: Over-fitting with increased model complexity. While the polynomial model varies significantly with the training data, the linear model does not. As a result, the polynomial model's inconsistent predictions at x = 2 frequently turn out to be more wrong than the linear model's. If enough training data are given, the polynomial model can perform better than the linear model.

r
1. Bias: The bias is the error brought on by the model's simplifying assumptions, which results in some test instances having consistent errors across various training data set choices. The bias cannot be eliminated even if the model has unlimited access to training data. Because the linear model will never fit the (slightly curved) data distribution exactly, regardless of the amount of data given, it has a bigger model bias than the polynomial model, as seen in the above figure. When applying a linear model, for any selection of training sample the prediction of a specific out-of-sample test instance at x = 2 will always have an error in a specific direction. The bias is the difference between the straight and curved lines at any given value of x, if we suppose that they were estimated using an unlimited quantity of data (top left of the above figure). The above figure provides an illustration of the bias at x = 2.

given value of x if we suppose that they were estimated using an unlimited quantity
of data (top left of Above Figure). The above figure provides an illustration of the bias
at x = 2.
ity

2. Variance: Variance results from the failure to fully learn all of the model's parameters in a statistically sound manner, particularly when the data is sparse and the model has many parameters. Increased variance shows up as Over-fitting to the particular training data set: if different training data sets are chosen, different predictions will be made for the same test instance. In the above figure, the linear model yields similar predictions at x = 2 across training sets, while the forecasts of the polynomial model vary significantly over different choices of training cases. The significantly disparate forecasts at x = 2 are frequently grossly inaccurate, which is an example of model variance. As a result, in the above figure, the polynomial predictor has a bigger variance than the linear predictor.

predictor has a bigger variance than the linear predictor.
3. Noise: The noise is caused by the inherent error in the data. In the upper-left scatter plot of the above figure, for instance, every data point deviates from the true model. Had there been no noise, all the scatter plot points would have overlapped with the curved line representing the true model.
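The variance half of the story is easy to see numerically. The sketch below (all numbers invented; NumPy assumed installed) repeatedly draws small training sets from a slightly curved true function and compares how much a degree-1 and a degree-9 polynomial disagree with themselves at x = 2:

# Sketch: estimating prediction variance of a simple vs. a flexible model.
import numpy as np

rng = np.random.default_rng(0)

def f(x):                             # a slightly curved "true" function
    return 0.1 * x**2 + x

preds = {1: [], 9: []}                # degree -> predictions at x = 2
for _ in range(200):                  # 200 different training sets
    x = rng.uniform(-3, 3, 15)
    y = f(x) + rng.normal(0, 0.5, 15) # noisy training labels
    for degree in preds:
        coeffs = np.polyfit(x, y, degree)
        preds[degree].append(np.polyval(coeffs, 2.0))

for degree, p in preds.items():       # flexible model -> far larger variance
    print(degree, np.var(p))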


The bias-variance trade-off is described qualitatively in the paragraphs above. We'll give a more formal, mathematical perspective in the paragraphs that follow.

Formal View

in
We suppose that B represents the base distribution from which the training data set was created. This base distribution can be used to create a data collection D:

D ∼ B

There are various ways to draw the training data, such as choosing only data sets of a certain size. Assume for the time being that training data sets are obtained from B using some clearly defined generative technique; the precise method by which training data sets are extracted from B is not crucial to the analysis that follows.

Because the base distribution can be used to create training data sets indefinitely, having access to B is akin to having unlimited access to a pool of training data. Such base distributions (i.e., endless data resources) do not actually exist; practically speaking, an analyst only ever collects a single finite instance of D using some type of data-collecting device. However, the conceptual existence of a base distribution from which new training data sets can be produced helps in theoretically quantifying the causes of error when training on this finite data set.

r
Now imagine that the analyst has a set of t test instances in d dimensions, denoted by Z̄1, . . . , Z̄t. The dependent variables of these test instances are denoted by y1, . . . , yt. For clarity in discussion, let us assume that the test instances and their dependent variables were also generated from the same base distribution B by a third party, but that the analyst was provided access only to the feature representations Z̄1, . . . , Z̄t and not to the dependent variables y1, . . . , yt. Therefore, the analyst is tasked with using the single finite instance of the training data set D in order to predict the dependent variables of Z̄1, . . . , Z̄t.

Now assume that the relationship between the dependent variable yi and its feature representation Z̄i is defined by the unknown function f(·) as follows:

yi = f(Z̄i) + εi

Here, the notation εi denotes the intrinsic noise, which is independent of the model being used. Although it is assumed that E[εi] = 0, the value of εi could be positive or negative. If the analyst knew the function f(·) corresponding to this relationship and applied it to each test point Z̄i to estimate the dependent variable yi, the only remaining uncertainty would be caused by the intrinsic noise.

The issue is that the analyst does not know what the function f(·) actually is. Keep in mind that this function is used within the base distribution B's generative process, which functions like an oracle the analyst is not privy to; only examples of this function's inputs and outputs are available. In order to estimate this function in a data-driven manner, the analyst needs to create some kind of model g(Z̄i, D) using the training data:

ŷi = g(Z̄i, D)

Note the use of the circumflex (i.e., the symbol 'ˆ') on the variable ŷi to indicate that it is a value predicted by a specific algorithm rather than the observed (true) value of yi.

ty
The estimated function g(·, ·) covers the prediction functions of all learning models, including neural networks. Some algorithms (like perceptrons and linear regression) can even be expressed succinctly and simply, for example:

ŷi = g(Z̄i, D) = W̄ · Z̄i   [linear regression]
ŷi = g(Z̄i, D) = sign(W̄ · Z̄i)   [perceptron]

where the coefficient vector W̄ is learned from the training data set D.
Most neural networks are expressed algorithmically as compositions of multiple functions computed at different nodes. The choice of computational function includes the effect of its specific parameter setting, such as the coefficient vector W̄ in a perceptron. Neural networks with a larger number of units require more parameters to fully learn the function. This is where the variance in predictions on the same test instance arises: when a different training data set is used, a model with a large parameter set W̄ will learn highly varied values of these parameters, and consequently the prediction of the same test instance will also vary greatly across training data sets. As shown in the figure above, these inconsistencies amplify the error.
the same test instance will likewise vary greatly. As shown in the figure above, these
contradictions amplify the inaccuracy.
m

Quantifying the expected error of the learning algorithm in terms of its bias,
variance, and noise (data-specific) is the aim of the bias-variance trade-off. The target
variable is assumed to be numerical for the sake of generality so that the mean-squared
)A

error between the anticipated values yi) and the actual values (yi ) ̂ can be used to
intuitively quantify the error. Although it may also be applied to classification in terms
of probabilistic predictions of test examples, this is a natural way of quantifying error in
regression. Over the set of test instances, the learning method g(·,D)’s squared error, or
MSE, is defined set of test instances Z̅ 1………….Z̅ t as follows:
(c

Amity Directorate of Distance & Online Education


64 Foundations of Machine Learning

Notes

e
in
The best way to estimate the error in a manner independent of the particular choice of training data set is to compute the expected error over several training data set selections:

E[MSE] = (1/t) Σ i=1..t E[(g(Z̄i, D) − f(Z̄i) − εi)²]
       = (1/t) Σ i=1..t E[(g(Z̄i, D) − f(Z̄i))²] + (1/t) Σ i=1..t E[εi²]

The second relationship is obtained by expanding the quadratic expression on the right-hand side of the first equation and using the fact that the average value of εi over a large number of test instances is 0.

si
The squared term on the right-hand side can be further decomposed by adding and subtracting E[g(Z̄i, D)]:

E[MSE] = (1/t) Σ i=1..t E[({f(Z̄i) − E[g(Z̄i, D)]} + {E[g(Z̄i, D)] − g(Z̄i, D)})²] + (1/t) Σ i=1..t E[εi²]
Figure: The trade-off between bias and variance usually causes a point of optimal model complexity (the overall squared error is minimised where the falling bias curve meets the rising variance curve).
m

One can expand the quadratic polynomial on the right-hand side to obtain the following:

E[MSE] = (1/t) Σ i=1..t {f(Z̄i) − E[g(Z̄i, D)]}²
       + (2/t) Σ i=1..t {f(Z̄i) − E[g(Z̄i, D)]}{E[g(Z̄i, D)] − E[g(Z̄i, D)]}
       + (1/t) Σ i=1..t E[{E[g(Z̄i, D)] − g(Z̄i, D)}²] + (1/t) Σ i=1..t E[εi²]

The second term on the right-hand side evaluates to 0, because one of its multiplicative factors is E[g(Z̄i, D)] − E[g(Z̄i, D)]. On simplification, we obtain the following:

E[MSE] = (1/t) Σ i=1..t {f(Z̄i) − E[g(Z̄i, D)]}²   [Bias²]
       + (1/t) Σ i=1..t E[{E[g(Z̄i, D)] − g(Z̄i, D)}²]   [Variance]
       + (1/t) Σ i=1..t E[εi²]   [Noise]
evaluates to 0 because one of the multiplicative factors is E[g( Z̅ i, D)] − E[g( Z̅i, D)]. On
simplification, we obtain the following:

ty
In other words, the squared error can be decomposed into (squared) bias, variance, and noise components. Variance is the crucial element that keeps neural networks from generalising: for neural networks with many parameters, the variance will typically be higher. On the other hand, too few model parameters means insufficient degrees of freedom to capture the intricacies of the data distribution, which causes bias. The figure above illustrates the trade-off between bias and variance as model complexity increases; clearly there is an optimal model complexity at which performance is best. Additionally, scarcity of training data will increase the variance. However, careful design choices can lessen Over-fitting.
ni

2.1.4 Under-fitting and Over-fitting


When we discuss a machine learning model's performance and accuracy, what we are actually referring to is its prediction errors. Consider for a moment that we are creating a machine learning model: a model is said to be a good machine learning model if it generalises any new input data from the problem domain in an appropriate way. This enables us to make predictions on future data that the model has never encountered. Suppose we want to assess how well our machine learning model picks up new information and adapts to it; Over-fitting and under-fitting are the two main causes of poor performance here.

Let's first define two key terms before moving on:

●● Bias: The assumptions a model makes to make the target function simpler to learn. In practice, it is reflected in the error rate on the training data: when this error rate is large, we call it high bias, and when it is low, we call it low bias.
●● Variance: The error rate on the testing data. When this error rate is high, we call it high variance, and when it is low, we call it low variance.
(c

Under-fitting: When a statistical model or machine learning algorithm is unable to capture the underlying pattern in the data, this is referred to as under-fitting. (It's like trying to squeeze into oversized jeans!) Under-fitting destroys the accuracy of our machine learning model. Its occurrence simply means that our model or algorithm does not fit the data well enough. It typically happens when we have too little data to build an accurate model, or when we try to build a linear model with non-linear data. In such cases, the rules of the machine learning model are too simple and flexible to be applied to such sparse data, and the model will likely produce a lot of incorrect predictions. Under-fitting can be avoided by using more data, or by reducing the number of features through feature selection.

In a word, under-fitting describes a model that neither performs well on the training data nor generalises to new data.

O
Reasons for under-fitting:
◌◌ Low variance and high bias.
◌◌ The size of the training dataset employed is insufficient.
◌◌ The model is overly simple.
◌◌ The training data contains noise and has not been cleaned.

Techniques to reduce under-fitting:
1. Make the model more complex.
2. Perform feature engineering to increase the number of features.
3. Clean the noise from the data.
4. Increase the number of epochs or the duration of training to improve results.
Over-fitting: When a statistical model fails to produce reliable predictions on test data, it is said to be overfitted. When a model is trained with a very large amount of data, it begins to learn from the noise and erroneous data entries in our data set, and testing on test data then yields high variance. Due to too many details and noise, the model fails to classify the data appropriately. Non-parametric and non-linear methods are frequent causes of Over-fitting, since these types of machine learning algorithms have more latitude in how they build the model from the dataset and can therefore produce very unrealistic models. If we have linear data, employing a linear algorithm is one way to prevent Over-fitting; if we are using decision trees, constraining parameters like the maximum depth is another.

Simply put, Over-fitting is the problem that arises when the evaluation of a machine learning algorithm on its training data differs from its evaluation on unseen data.

algorithms on training data differs from the evaluation on unknown data.
m

Reasons for Over-fitting are as follows:
◌◌ High variance and low bias
◌◌ The model is too complex
◌◌ The size of the training data
Techniques to reduce Over-fitting:
◌◌ Expand the training data.
◌◌ Simplify the model.
◌◌ Stop training early (keep an eye on the loss over the training period; as soon as the validation loss begins to increase, stop training).
◌◌ Apply Ridge and Lasso regularisation.
◌◌ Use dropout in neural networks to combat Over-fitting.
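The following sketch (synthetic data with an invented noise level; NumPy and scikit-learn assumed installed) makes both failure modes visible at once: a degree-0 fit underfits, a degree-15 fit overfits, and a moderate degree sits in between.

# Sketch: spotting under- and over-fitting from train vs. validation error.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
x = rng.uniform(-3, 3, 60)
y = x**2 + rng.normal(0, 1.0, 60)          # quadratic truth + noise

x_tr, x_val, y_tr, y_val = train_test_split(x, y, test_size=0.3, random_state=1)

for degree in (0, 2, 15):
    coeffs = np.polyfit(x_tr, y_tr, degree)
    err_tr = np.mean((np.polyval(coeffs, x_tr) - y_tr) ** 2)
    err_val = np.mean((np.polyval(coeffs, x_val) - y_val) ** 2)
    print(degree, round(err_tr, 2), round(err_val, 2))
# degree 0: both errors large (under-fitting)
# degree 15: train error small, validation error much larger (Over-fitting)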


When your model is too straightforward for your data, you are under-fitting. Formally speaking, your hypothesis about the distribution of the data is incorrect and overly simplistic; for instance, your data are quadratic while your model is linear. This situation is also called high bias. It indicates that while your algorithm makes consistent predictions, the underlying premise about the data is flawed.
In contrast, Over-fitting occurs when your model is too intricate for your data. Formally speaking, your hypothesis about the distribution of the data is incorrect and overly complicated; for instance, your data are linear while your model is a high-degree polynomial. This situation is also called high variance. It indicates that your algorithm cannot make reliable predictions, since even little changes in the input data cause significant changes in the model's output.

The ideal solution will always fall somewhere in the middle of these two extremes
of the same issue.

●● Low bias, low variance: a good result, just right.
●● Low bias, high variance (Over-fitting): for the same set of data, the algorithm produces widely differing predictions.
●● High bias, low variance (under-fitting): the algorithm makes similar predictions for comparable data, but these predictions are incorrect (the algorithm "misses").
●● High bias, high variance: a terrible algorithm. Most likely, you will never see this.
●● high bias, high variance — awful algorithm. Most likely, you’ll never see this.
U
ity
m
)A
(c

These cases can all be combined into one plot. Though more condensed, it is a
little less distinct than the prior one.

How to Detect Under-fitting and Over-fitting

si
Let’s first learn how to “identify” under-fitting and Over-fitting before moving on to
the tools.
r
ve
ni
U
ity

When your model is underfit, its predictions are stable but inaccurate. In this instance, both the train error and the val/test error are large.

When your model is overfit, its predictions are unstable on new data. Here, the train error is relatively small while the val/test error is substantial.
Train error is small (although bigger than in the case of Over-fitting) and val/test error is also small when you find a decent model.

The test error and validation error in the aforementioned instance are nearly
identical. This occurs when everything is in order and the distributions of your train,
validation, and test data are identical. If the test error and the validation error differ
significantly, you should obtain more test data and double-check that the data was
appropriately divided.
Tools and Techniques
Let’s now examine methods to avoid both under-fitting and Over-fitting while taking

si
into account why we should employ them.

You Should Recall General Intuition

As we recall:
r
ve
◌◌ When your model is too straightforward for your data, under-fitting happens.
◌◌ When your model is too intricate for your data, Over-fitting takes place.

Based on this, you should keep in mind the following simple intuition:
ni

◌◌ Over-fitting should be corrected by simplifying the model,
◌◌ under-fitting should be corrected by complicating the model.

In actuality, everything that follows is simply a result of this straightforward law. I’ll
try to explain why some actions will make the model more complex or simpler.

More Simple / Complex Model


ity

Based on the aforementioned understanding, the simplest approach that springs to mind is to try a more straightforward or a more complicated algorithm (model).

To make the model more complex, you must add extra parameters (degrees of freedom). This can sometimes mean switching to a model that is inherently capable of capturing more complex dependencies (an SVM with different kernels instead of logistic regression). If the algorithm is already quite sophisticated (a neural network or some ensemble model), increase its capacity: for boosting, increase the number of models; for neural networks, increase the number of layers, neurons in each layer, connections between layers, CNN filters, and so on.

Conversely, to simplify the model you must decrease the number of parameters: either reduce the degrees of freedom (fewer neurons, fewer layers, etc.) or change the method altogether (try a random forest instead of a deep neural network).


More Regularization / Less Regularization


This point and the previous one are very closely related. In effect, regularisation is a forced, indirect simplification of the model: the regularisation term pushes parameter values to stay as low as possible, so the model is kept as simple as it can be. This is a particularly potent approach, since complex models with strong regularisation frequently outperform models that are simple to begin with.

The influence of the regularisation term grows as the model is regularised more
(simplified). The regularisation parameters vary based on the method; for example, to
reduce regularisation, alpha for Ridge regression should be decreased while C for SVM
r
should be increased. This process is strictly individual. As a result, you should research
ve
the algorithm’s parameters and pay attention to whether they should be increased or
decreased in a specific circumstance. Such parameters are numerous, including the
L1/L2 coefficients for linear regression, C and gamma for SVM, maximum tree depth
for decision trees, and others. The primary regularisation techniques used in neural
networks are:
◌◌ Early stopping,
◌◌ Dropout,
◌◌ L1 and L2 Regularization.
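The effect of the regularisation strength can be seen directly with Ridge regression (alpha is scikit-learn's regularisation parameter; the data below are invented):

# Sketch: stronger Ridge regularisation shrinks coefficients toward zero.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(2)
X = rng.normal(size=(40, 5))
y = X @ np.array([4.0, -3.0, 2.0, 0.0, 0.0]) + rng.normal(0, 0.5, 40)

for alpha in (0.01, 1.0, 100.0):      # larger alpha = more regularisation
    model = Ridge(alpha=alpha).fit(X, y)
    print(alpha, np.round(model.coef_, 2))  # coefficients shrink as alpha grows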

2.1.5 Evaluation Methods: Hold Out Method


ity

In practice, real data sets are not typically pre-divided into training, validation, and test data sets, so it is important to be able to separate the entries of a ratings matrix into these groups automatically. Most of the division techniques, including hold-out and cross-validation, divide the data set into two parts rather than three. However, three segments can still be obtained: first split the rating entries into training and test portions, and then further carve the validation portion out of the training data. Accordingly, in the sections that follow we will discuss how to divide the entries of the ratings matrix into training and testing halves using techniques like hold-out and cross-validation; the model-building and validation portions of the training data are then separated using the same techniques. The figure below illustrates this hierarchical division. Even though the same approach can also be used for the second-level separation into model-building and validation components, to prevent misunderstanding we will consistently use the language of the first-level division into "training" and "testing" data in what follows.

Figure: Hierarchical division of rated entries into training, validation, and testing portions

Hold-Out

si
In the hold-out method, a portion of the ratings matrix's entries is concealed while the training model is constructed using the remaining entries. The accuracy of predicting the concealed entries is then reported as the overall accuracy. Because the entries used for evaluation are hidden during training, this method ensures that the claimed accuracy is not the consequence of Over-fitting to the particular data set. However, such a strategy underestimates the true accuracy. First, not all entries are used in training, so the full potential of the data is not utilised. Second, consider the scenario in which the held-out entries have a higher average rating than the complete ratings matrix: the held-in entries then have a lower average rating than the held-out entries, and as a result the evaluation will be pessimistically biased.
will be biased negatively.
U

Cross-Validation
The ratings entries are separated into q equal sets for the cross-validation
procedure. As a result, |S|/q represents the size of each set in terms of entries if S is
ity

the set of specified entries in the ratings matrix R. The remaining (q 1) segments are
utilised for training, while one of the q segments is used for testing. In other words,
throughout each such training procedure, a total of |S|/q entries are hidden, and the
accuracy is then assessed over these entries. Each of the q segments serves as the
test set for this process, which is repeated q times. The reported accuracy is the mean
m

across the q distinct test sets. Keep in mind that when the value of q is large, this
method can fairly approximate the true accuracy. Choosing q to be equal to the required
number of entries in the ratings matrix is a specific case. As a result, testing uses the
)A

one entry and training uses the |S| 1 rating entries. Leave-one-out cross-validation is
the name of this method. Even though such a method may fairly roughly approximation
the accuracy, training the model |S| times is typically too expensive. In actuality,
q’s value is set to a specific number, say 10. However, for the particular situation of
neighborhood-based collaborative filtering algorithms, leave-one-out crossvalidation is
(c

not overly challenging to implement.


Bootstrap Method
Bootstrapping, a statistical concept, is a resampling approach that uses sampling with replacement to simulate samples from a data collection. The method of bootstrapping enables one to derive standard errors, make inferences about the population, and ensure that data is tested effectively.

The Bootstrapping Method in Statistics and Machine Learning is a resampling statistical technique that evaluates statistics of a population by sampling a dataset with replacement.

With this method, a dataset is repeatedly sampled at random with replacement. This technique, a statistical procedure belonging to the class of resampling methods, makes sure that the statistics being assessed are as exact and objective as feasible.

The Bootstrapping method, in contrast to other sampling distribution methods, repeatedly employs the study-provided samples in order to apply the replacement methodology and guarantee that the simulated samples result in an accurate evaluation.

In addition to assessing the reliability of a given dataset's sample, bootstrapping in statistics also enables one to calculate confidence intervals for the dataset.
ve
Hold Out Method
Splitting the data into several sets—one for training and other sets for validation
and testing—is the hold-out method, a strategy used to train machine learning models.
A machine learning model’s performance on the fresh data is tested using the hold-
ni

out method. The issue that emerges when analysing machine learning (ML) models is
whether the model is the best one that can be found in the model’s hypothesis space
in terms of generalisation error on the unobserved/future data set. Whether the most
U

effective method was used to train and test the model. Which model should I choose
from those that are offered? The so-called hold-out strategy is used to address these
queries.

Different sets, referred to as the validation set and test set, are isolated or set aside
ity

(thus, the hold-out moniker) from the complete dataset and the model is trained just
on what is referred to as the training dataset rather than using the entire dataset for
training.

The hold-out approach for machine learning model training entails separating
m

the data into many splits, using one split for model training and other splits for model
validation and model testing. Both model evaluation and model selection are done
using the hold-out method.
)A

The issue of comparing the models and choosing the best model persists even
after the whole set of data has been utilised to train the model using various algorithms.
Finding the model with the lowest generalisation error among all other models is the
main task. In other words, which model outperforms all others in its ability to anticipate
(c

data from the future or the unknown. The necessity for a technique that allows the
model to be trained on one dataset and tested on another arises in this situation. The
hold-out strategy is useful in this situation.


Hold-out method for Model Evaluation


The hold-out approach for model evaluation is the procedure of dividing the dataset into training and test datasets. The model is trained on the training set and then tested on the testing set to create the best model possible. This strategy is frequently used when the data set is small and there is not enough data to divide into three groups (training, validation, and testing). It has the benefit of being straightforward to use, but it can be sensitive to how the data is split into the two sets: the results can be skewed if the split is not random. All things considered, the hold-out approach for model evaluation is a decent place to start when training machine learning models, but it should be applied with care. The hold-out approach to model evaluation looks as follows.
Figure: Hold-out method for model evaluation

You may have noticed that the data set is divided into two portions in the preceding diagram. One split is reserved or held back for model training; the other set is held back for testing or evaluating the model. The split percentage depends on the amount of data available for training purposes. Often the dataset is divided so that 70% is used for training and 30% for testing the model.

dataset used for training and 30% of the dataset used for testing the model.

If the objective is to compare the models based on the model accuracy on the test
dataset and choose the best model, this technique is well suited. The model might end
ity

up fitting the test dataset well if this strategy is tried, but that is never a guarantee. In
other words, it is assumed that the test dataset accurately represents the population
while training the models to increase model accuracy. Thus, the test error turns into an
estimation of the generalisation error that is biassed in the favour of optimism. That is
not preferred, though. As a result of being trained to fit well (or overfit) in relation to the
m

test data, the final model does not generalise well to the unknown or future dataset.

The hold-out method for model evaluation is carried out as follows:

●● Create two separate datasets (preferably based on a 70-30 percent split; however, the percentage split will vary).
●● Train the model on the training dataset; some fixed set of hyperparameters is chosen while training the model.
●● Test or assess the model on the held-out test dataset.
●● To get a model that can generalise better on the unknown or future dataset, train the final model on the full dataset.
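A minimal version of these steps with scikit-learn's train_test_split (synthetic data; the 70-30 split mirrors the text):

# Sketch: hold-out evaluation with a 70-30 train/test split.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=0)  # 70% train, 30% held out

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(model.score(X_test, y_test))         # accuracy on unseen data

final_model = LogisticRegression(max_iter=1000).fit(X, y)  # retrain on all data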
It should be noted that this method relies on dividing the dataset into training

in
and test datasets and employing a predetermined set of hyperparameters for model
evaluation. Another method involves dividing the data into three sets and using these
sets to choose a model or tune hyperparameters.

nl
Hold-out method for Model Selection
Model selection and hyperparameter tuning can both be done using the hold-out

O
method. In reality, tuning of the hyper-parameters is a term that is sometimes used
to describe the model selection process. The training, validation, and test datasets
are divided into three separate sets for the hold-out approach of model selection. It is
crucial to make sure that the training, validation, and test datasets are representative of

ty
the complete dataset when utilising the hold out approach, which divides the data into
three sets. Otherwise, the model might not work well with unknown data.

r si
ve
ni

Figure: Hold out method – Training – Validation – Test Dataset


U

The hold-out strategy for choosing models is represented by the process below:

1. Divide the dataset into three sections: training, validation, and testing datasets.
2. Train various models using various machine learning methods; for instance, use logistic regression, random forest, XGBoost, etc. to train a classification model.
3. Adjust the hyper-parameters of the models trained with the various techniques to create new models: change the hyperparameter values and produce several models for each of the algorithms stated in step 2.
4. Use the validation dataset to evaluate how well each of these models (for each of the algorithms) performs.
5. From among the models examined on the validation dataset, choose the best one. For a given method, the best model is the one with the best hyperparameter settings. In the example above, suppose the model trained with XGBoost and having the best hyperparameters is chosen.
6. Examine the best model's performance using the test dataset.

The following diagram can be used to understand the process described above. Note the three distinct splits of the original dataset. The best model is chosen after numerous iterations of the training, tuning, and evaluation steps, and the final model is then assessed on the test dataset.

Figure: Hold out method for model selection
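A compact sketch of the six steps (two candidate model families, a small hyperparameter grid, synthetic data; scikit-learn assumed installed):

# Sketch: model selection with a train/validation/test three-way split.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, random_state=0)
X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X_tmp, y_tmp, test_size=0.25, random_state=0)

candidates = [LogisticRegression(C=c, max_iter=1000) for c in (0.1, 1.0)] + \
             [RandomForestClassifier(max_depth=d, random_state=0) for d in (3, None)]

best = max(candidates, key=lambda m: m.fit(X_tr, y_tr).score(X_val, y_val))
print(best)                         # model/hyperparameters chosen on validation
print(best.score(X_test, y_test))   # final estimate on the untouched test set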

Different types of Hold-out methods


r
ve
There are various hold-out approaches, based on the core principles outlined in the previous section, that are used to increase machine learning model accuracy by preventing Over-fitting or under-fitting of the model. Some of them are listed below.

K-fold Cross-validation hold out method:


U

The cross-validation hold out approach involves the steps listed below:

◌◌ The data set is separated into training sets (training, validation) and a test set.
◌◌ A subset of the data is used to construct the machine learning model, which is subsequently tested using the remaining data.
◌◌ This technique is repeated K times with different random partitions, giving an average performance metric from K machine learning models. For each model training, one portion of the data set is left out (referred to as the test data set), and the model then attempts to predict its values. The procedure is repeated until every sample has been predicted at least once.
using the test data set. Repeat this procedure until the machine learning
model has predicted all samples at least once..
)A

Leave One Out Cross Validation Method


In the leave-one-out cross-validation method, one observation is omitted, and the remaining data are used to build a machine learning model. This procedure is performed numerous times (until all of the data is covered) to produce an average performance measure.

is covered).

Hold-out methods are machine learning approaches that can be used to prevent Over-fitting or under-fitting of machine learning models. One of the most widely used techniques is the cross-validation hold-out method, in which a machine learning model is trained using a subset of the data and then evaluated using the remaining data. Another method that helps prevent these issues is leave-one-out cross-validation, which trains with the majority of the data while leaving one observation out as a test case.
case.

nl
2.1.6 Hands On: Introduction to Hold Out Method
The Holdout Strategy is the most basic method for classifier evaluation. With this method, two sets, the training set and the test set, are created from the data set (a collection of data items or instances). A classifier's job is to assign each data point in a collection to a certain category or class.

ty
Example:
Spam and non-spam emails are separated out in our inbox.

The accuracy, error rate, and error estimates of the classifier should be assessed, and there are numerous ways to accomplish this. The "Holdout Method" is one of the most basic techniques used to evaluate classifiers.

r
The holdout approach divides the data set so that the majority of the data belongs to the training set and the remaining data to the test set.

Example
In the case of 20 data points, 12 are included in the training set while the remaining 8 are included in the test set.

◌◌ After dividing the data set into two sets, the training set is used to create a model or classifier.
◌◌ After classifier development, we use the data items in the test set to check accuracy, error rate, and error estimate.
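The 12/8 split above (60%/40% of 20 points) can be reproduced directly with scikit-learn; the stratify option used below also enforces the class balance discussed later in this section. All data values are invented.

# Sketch: a 12/8 holdout split of 20 labelled points, stratified by class.
from sklearn.model_selection import train_test_split

X = [[i] for i in range(20)]            # 20 toy data items
y = ["GFG1"] * 4 + ["GFG2"] * 16        # 4 items of the rare class GFG1

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=8, stratify=y, random_state=0)

print(len(X_train), len(X_test))        # 12 8
# stratify=y keeps GFG1 represented in BOTH sets instead of all in one:
print(y_train.count("GFG1"), y_test.count("GFG1"))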
Two points are crucial to keep in mind regarding the holdout method. They are as follows:

If the maximum number of data items were included in the training set when building the model or classifier, the classifier's error rate and error estimate would be very low and its accuracy high. This indicates a strong classifier or model.
m

Example –
A teacher coaches a student named "Amit". She is taught all potential exam topics, and consequently tends to make very few mistakes in exams, resulting in good performance.

If test-set data items are used to build (train) the classifier, then evaluating the classifier on those test items becomes invalid.

If there are more data items in the test set, they can be used to test the classifier that was created using the training set, and we obtain a more precise evaluation of the classifier in terms of accuracy, error rate, and error estimation.


Example –
A teacher is coaching a student named “Amit.” Her teacher teaches her certain
subjects that could come up on the test. An accurate assessment of the student’s weak

in
and strong points can be made if the student “Amit” is given a number of exams based
on this coaching.

The error rate, error estimate, and accuracy of the created classifier can be

nl
precisely calculated if sufficient test data are utilised to evaluate it.

Problem :

O
If all data items from class GFG1 are completely placed in test set, leaving no data
items from class GFG1 in training set, then the full data set can be divided into two
parts, i.e., training set and test set. It is clear that the model/classifier used was not
developed utilising GFG1-class data items for training.

Solution:
Using the stratification technique, data items from the GFG1 class are split evenly between the two data sets, the training set and the test set, so that data items from class GFG1 are also used to train the model or classifier.

Example –
The two data sets, training set and test set, each contain two of the four data items
that make up the class GFG1 in this instance.
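A minimal hands-on sketch of the holdout method follows, using scikit-learn (an assumed library choice; the text itself is tool-agnostic). The 60:40 split mirrors the 12/8 example above, and the stratify argument implements the stratification fix for the GFG1 problem:

# Holdout split with stratification (scikit-learn assumed).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Toy stand-in for the 20-point example: 12 train / 8 test.
X, y = make_classification(n_samples=20, n_classes=2, random_state=42)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=42)

# Build the classifier on the training set, then assess it on the test set.
clf = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
print("holdout accuracy:", accuracy_score(y_test, clf.predict(X_test)))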

2.1.7 Evaluation Methods: Cross Validation and Bootstrapping Method
Cross-validation is a technique for validating a model's effectiveness by training the model on a subset of the input data and testing it on a subset of the input data that hasn't been used before. It is also a method for determining how well a statistical model generalises to a different dataset.

Testing the model's stability is a constant requirement in machine learning. This indicates that we cannot fit our model to the training dataset alone. We set aside a specific sample of the dataset—one that wasn't included in the training dataset—for this use. Cross-validation refers to the entire process of testing our model on that sample before deployment, and it differs from the typical train-test split.

Therefore, the fundamental cross-validation stages are:
●● Set aside a portion of the dataset as a validation set.
●● Train the model using the training dataset.
●● Assess the model's performance using the validation set. Proceed to the next step if the model performs well on the validation set; otherwise, look for problems.

Methods used for Cross-Validation
For cross-validation, there are a few standard techniques. The following lists these techniques:

●● Validation Set Approach
●● Leave-P-out cross-validation
●● Leave-one-out cross-validation
●● K-fold cross-validation
●● Stratified k-fold cross-validation

Validation Set Approach
In the validation set approach, we separate our input dataset into a training set and a test or validation set, with half of the dataset assigned to each subset.

However, one of the major drawbacks is that we are only using half of the dataset to train our model, which means that the model can fail to capture crucial dataset information. It also frequently produces an under-fitted model.

Leave-P-out cross-validation
In this method, p data points are excluded from the training data. This means that if the original input dataset has a total of n data points, n − p data points will be utilised as the training dataset, and p data points will be used as the validation set. The entire procedure is carried out once for each possible sample, and the average error is computed to determine the model's efficacy.

This method has a drawback in that it can be computationally challenging for large p.

Leave-one-out cross-validation
This technique is similar to leave-p-out cross-validation, but we exclude one data point from training instead of p. In this method, only one data point is set aside for each learning set, while the remaining dataset is used to train the model. This procedure is repeated for every data point. Thus, for n samples, n distinct training sets and n test sets are obtained. These characteristics describe it:
●● Because all the data points are used, this method has the least amount of bias.
●● The process is repeated n times; consequently, execution time is high.
●● As we iteratively verify the model against a single data point, this strategy results in high variation when testing the model's efficacy.
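A minimal sketch of leave-one-out cross-validation with scikit-learn (an assumed library choice) illustrates the n-fits behaviour described above:

# Leave-one-out CV: one model fit per data point (scikit-learn assumed).
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

scores = cross_val_score(model, X, y, cv=LeaveOneOut())
print("number of fits:", len(scores))      # equals n, the sample count
print("LOOCV accuracy:", np.mean(scores))  # average over all n test points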
m

K-Fold Cross-Validation
The K-fold cross-validation procedure splits the input dataset into K groups of samples of equal size, called folds. For each learning set, the prediction function uses k − 1 folds, while the remaining fold is used as the test set. This strategy is particularly common in cross-validation since it is simple to grasp and produces results that are less biased than those produced by other techniques.

The k-fold cross-validation procedure is as follows:


●● Split the input dataset into K groups.
●● For each group:
◌◌ Take one group as the reserve or test data set.
◌◌ Use the remaining groups as the training dataset.
◌◌ Fit the model on the training set and evaluate its performance using the test set.

Let's look at a 5-fold cross-validation example. The dataset is divided into five folds. On the first iteration, the first fold is reserved for testing the model, while the remaining folds are used to train it. On the second iteration, the second fold is used to test the model, while the remaining folds train it. This process is repeated until every fold has served as the test fold.

Take a look at the diagram below:

Figure: 5-fold cross-validation — each of the five folds serves once as the test set while the remaining four folds train the model.
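A minimal sketch of the 5-fold procedure with scikit-learn (an assumed library choice); the stratified variant discussed in the next subsection is shown alongside:

# 5-fold and stratified 5-fold cross-validation (scikit-learn assumed).
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Each of the 5 folds serves once as the test set.
kf = KFold(n_splits=5, shuffle=True, random_state=0)
print("k-fold scores:", cross_val_score(model, X, y, cv=kf))

# Stratified variant: every fold keeps the overall class proportions.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
print("stratified mean:", np.mean(cross_val_score(model, X, y, cv=skf)))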


Stratified k-fold cross-validation
With a few minor adjustments, this method is comparable to k-fold cross-validation. It is based on the stratification principle, which involves rearranging the data to make sure that each fold or group is a good representation of the entire dataset. It is one of the finest methods for addressing bias and variance.

It can be understood using the example of housing costs, where some homes may have substantially higher prices than others. A stratified k-fold cross-validation technique is helpful in such circumstances.

Holdout Method
This is the simplest cross-validation methodology available. With this technique, we remove a portion of the data from training, train the model on the remaining dataset, and then obtain prediction results on the held-out portion.

The error that occurs during this phase shows how well our model will function on an unknown dataset. Although this method is straightforward to use, it still suffers from high variance and occasionally yields inaccurate findings.

Comparison of Cross-validation to train/test split in Machine Learning
●● Train/test split: A ratio of 70:30, 80:20, etc. is used to divide the input data into the training set and the test set. One of its main drawbacks is the considerable variance it offers.
◌◌ Training Data: The model is trained using training data, and the dependent variable is known.
◌◌ Test Data: The model, which has already been trained on the training data, makes predictions using the test data. Although not a component of the training data, it has the same characteristics.
●● Cross-Validation dataset: By dividing the dataset into groups of train/test splits and averaging the results, it is utilised to overcome the drawback of the train/test split. It can be utilised if we wish to improve the performance of our model after it has been trained using the training dataset. Since every observation is used for both training and testing, it is more effective than a train/test split.

Limitations of Cross-Validation
The cross-validation method has some drawbacks, some of which are listed below:
●● It produces the best results in ideal circumstances, but contradictory data can lead to dramatically different outcomes. Since there is uncertainty over the type of data encountered in machine learning, this is one of the major drawbacks of cross-validation.
●● Because data in predictive modelling changes over time, there may be variations between the training set and validation sets. For instance, if we develop a stock market value prediction model trained on stock prices from the previous five years, the realistic stock prices for the following five years could be very different, making it challenging to predict the correct output in such circumstances.


Applications of Cross-Validation
●● This technique, which is already being utilised by data scientists in the field of medical statistics, can be used to assess the effectiveness of various predictive modelling approaches.
●● It has a wide range of applications in the field of medical research.
●● As data scientists are already using it in the field of medical statistics, it can also be utilised for meta-analysis.

Bootstrapping Method
Bootstrapping, a statistical concept, is a resampling approach that uses the replacement procedure to simulate samples from a data collection. The method of bootstrapping enables one to derive standard errors, infer information about the population, and guarantee that data is tested effectively.

The bootstrapping method in statistics and machine learning is a resampling technique that samples a dataset with replacement in order to estimate statistics of a particular population.

With this method, a dataset is sampled repeatedly with random replacement. This technique, which belongs to the class of resampling procedures, makes sure that the statistics being assessed are as exact and objective as feasible. The bootstrapping method, in contrast to other sampling distribution methods, repeatedly employs the study-provided samples in order to apply the replacement methodology and guarantee that the simulated samples result in an accurate evaluation.
In addition to guaranteeing the correctness of a given dataset's sample, bootstrapping in statistics also enables one to calculate the dataset's confidence intervals.

The degree of assurance that an estimated statistic contains the actual value of the parameter is known as the confidence interval. Let's learn more about the approach right now.
In data science, data collected from a variety of sources is regarded as the foundation for forecasting patterns. In this way, statistics and machine learning use historical data to create prediction patterns for future use.

The bootstrapping method is widely used and has major applications in the field of statistics. While the Bootstrapping Method and the Jackknife Resampling Method are two of the most popular resampling techniques, we will focus on the bootstrap technique.
How does it work?
The bootstrapping method, created by Bradley Efron, is well recognised for creating fresh samples or resamples from the original samples in order to assess the precision of a sample statistic.

The method generates new hypothetical samples, using the replacement methodology, that aid in the testing of an estimated value.


Here are three quick stages that make up the bootstrapping process:
●● Randomly choose a sample size.
●● Pick an observation from the training dataset in random order.
●● Combine this observation with the sample chosen earlier.

The term "bootstrapped samples" or "the bootstrap sample size" refers to the samples picked into the representative sample size. The "out-of-the-bag" samples, on the other hand, are the unchosen samples that make up the testing dataset.

The bootstrapping method entails running a machine learning model on the bootstrapped samples, i.e., the training dataset, which is then tested on a fresh dataset, the out-of-the-bag samples.

The method's goal is to give the model the ability to forecast outcomes for samples taken directly from the bag, which often produce a normal (Gaussian) distribution. The above-mentioned steps are repeated (a minimum of 25 times) utilising the replacement strategy in order to improve results.
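A minimal sketch of one bootstrap round (plain numpy; the variable names are ours, not a prescribed interface) makes the bootstrapped/out-of-the-bag split concrete:

# One bootstrap resample with replacement (numpy only).
import numpy as np

rng = np.random.default_rng(0)
data = np.arange(10)  # stand-in for a small dataset

boot_idx = rng.integers(0, len(data), size=len(data))  # draw WITH replacement
bootstrap_sample = data[boot_idx]                      # training material
oob_mask = ~np.isin(np.arange(len(data)), boot_idx)
out_of_bag = data[oob_mask]                            # testing material

print("bootstrap sample:", bootstrap_sample)
print("out-of-the-bag :", out_of_bag)  # on average about 36.8% of the rows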

To assess the model's correctness in a more thorough manner, it must be highlighted that the sample size chosen should be modest and that the procedure should be repeated several times. The bootstrapping method is a very clever resampling technique used to measure the unknown gaps in a model.

There are two different types of bootstrapping methods that can be used in statistics and machine learning, depending on the specifics of how they operate.
Parametric Bootstrap Method
The distribution parameter must be known when using this method. This means that a prior assumption regarding the sample's kind of distribution must be made.

For example, the user must be aware of whether the sample has a Gaussian or skewed distribution. Since this kind of bootstrap approach is already aware of the distribution's characteristics, it is more effective.

Non-Parametric Bootstrap Method


This type does not need the distribution parameter to be known in advance, in
contrast to the parametric bootstrap approach. Because of this, this kind of bootstrap
approach operates without making any assumptions about the sample distribution.
Applications of the Bootstrapping Method

Hypothesis Testing
The bootstrapping approach is among the most effective ways to test a hypothesis. Unlike conventional methods, the bootstrapping method uses the replacement technique to assess the accuracy of a dataset.

In the instance of hypothesis testing in research, this technique can be used to assess the precision of samples and determine whether a hypothesis is true or false.


Overall, compared to the conventional approach, bootstrap hypothesis testing reduces the drawbacks of a hypothetical model. Contrary to conventional methods, the bootstrapping approach does not assume that the data will be normally distributed when performing hypothesis testing.

Standard Error
Because it uses the replacement strategy, the bootstrapping method can effectively calculate a dataset's standard error. The Standard Error (SE) of a statistic shows its estimated standard deviation.

In a particular dataset, SE reflects the true statistical value of a sample while also representing the precision of a sample statistic.

While the standard error for a sample statistic can be computed using conventional methods, the bootstrapping approach makes use of the replacement methodology to ensure that several standard error values are calculated, each of which contributes to the average SE.

As a result, the approach can be used to calculate standard error.

Machine Learning
Bootstrapping in machine learning operates somewhat differently from statistics. In machine learning, the bootstrapping method uses the bootstrapped data to train the models before testing them with the remaining data.

Machine learning has advanced in every conceivable way and is still evolving. To check that ML models and algorithms perform correctly when applied to independent data points, repeatedly sampled data points are used with the replacement technique. Machine learning may be one of the most significant applications of the bootstrapping method.

Bootstrapping Aggregation
The bootstrapping method and the aggregation approach are both included in the data mining process known as bagging, sometimes known as Bootstrap Aggregation.

The aggregation approach integrates the prediction results generated from many machine learning models, unlike the bootstrapping method, which is a resampling procedure used to obtain samples using the replacement methodology.
This ensemble technique of bootstrap aggregating concentrates on compiling predictions from a variety of models, resulting in a more accurate and trustworthy prediction rather than depending on a single ML model.

This strategy can be used with the bootstrapping method, which has improved the
performance of machine learning models.

Confidence Intervals

A statistic called a confidence interval (CI) expresses the likelihood that a


calculated interval will include the true value.


The bootstrapping method resamples data points with replacement from a shared dataset, producing results for CI that are more precise and effective.

The confidence interval describes the accuracy of an estimated sample value in relation to the other samples that were taken.

Last but not least, one can validate the correctness of a sample statistic overall by
using the bootstrapping method, which is known to examine the accuracy of statistics

nl
like the confidence interval.

Pros & Cons of the Bootstrapping Method
Pros of the bootstrapping method are the following:

Functionally Simpler
The bootstrapping method offers a functionally simpler alternative when the usual methods for calculating statistics are otherwise too complex. The procedure is uncomplicated and makes it possible to assess the accuracy of a model in a straightforward manner, with fewer stages and easy tests.
si
of a model in a straightforward manner with fewer stages and easy tests.

Accurate Results

r
One of the most popular resampling techniques, the bootstrapping approach, does
ve
not rely on any presumptions for the concept to function.

The bootstrapping approach merely watches the findings and works on them,
generating correct results, in contrast to traditional methods that depend on the
theoretical concept to provide results.
ni

The strategy works even when the theory contradicts the actual data, making it
particularly useful in this regard.

Listed below are some cons of the bootstrapping method.

Excessive Computing Power
The bootstrapping approach resamples data points using the replacement methodology to accurately verify a model's performance.

Since the replacement methodology supports the procedure in this situation, it necessitates a lot of computational resources. The bootstrapping method's drawback, which can potentially outweigh its advantages, is the use of excessive computing power.
m

Underestimates the Variability
One disadvantage of the bootstrapping method is that it is prone to underestimating the variability of the distribution, despite the fact that it is advised to function well in the situation of small sample sizes.

When dealing with rare extreme values, the approach tends to favour closer values significantly, avoiding the participation of near-end values.

2.1.8 Introduction to Performance Metrics


To ensure that the predictions in machine learning accurately represent the desired phenomenon, it is essential to be able to evaluate models in the right way (disease prediction, future cost estimation, etc.).

But with so many options available (Accuracy, Precision, Recall, etc.), it can be
difficult for data scientists to decide which performance statistic to utilise.

To be able to measure the performance of the model objectively and in the
appropriate context, it is crucial to choose the appropriate metric for a certain model.

Performance evaluation metrics are crucial when evaluating the performance of our machine learning model on a dataset that it has never seen before. There is a good chance that a model will always perform better on the dataset it was trained on. However, we develop machine learning models to be effective at handling challenges in the real world, where data is constantly flowing. There is no use in employing machine learning to tackle your problems if the model is not strong enough to perform well. This is where performance evaluation metrics come into play: a performance evaluation metric determines whether or not your machine learning model will successfully solve the problem for which it was trained.

Classification
Since certain terms are used consistently across the majority of performance metrics, it is crucial to ensure that they are understood before diving into the performance metrics themselves.

True Value vs Predicted Value
The real outcome (often abbreviated as "y") and the anticipated outcome (typically abbreviated as "ŷ") are two notions that are crucial when assessing the effectiveness of a classification model.

For instance, a model can be taught to forecast a person's risk of contracting a specific disease. In this instance, it is trained using samples, such as data from an individual, that contain predictive information, such as age, gender, etc. Each individual is then marked with a flag indicating whether or not the disease will progress. In this situation, the label might indicate whether the sickness will manifest (y=1) or not (y=0). When a sample is provided to a machine learning model, the goal is for the projected and actual results to be consistent. The performance of the model is measured by how closely its predictions match the actual values. The various methods of assessing a model's performance are listed here, but generally speaking, errors are what make models perform worse.

True Positive, True Negative, False Positive and False Negative
Regarding performance, each prediction made by the model can fall into one of four categories: True Positive, True Negative, False Positive, or False Negative.
●● True Positive (TP): A sample has a positive label (y=1, i.e., the person will actually develop the disease) and a positive prediction (ŷ=1, i.e., the person is projected to develop the disease).
●● True Negative (TN): A sample has a negative label (y=0, i.e., the person will genuinely not develop the disease) and a negative prediction (ŷ=0, i.e., the person is not expected to acquire the disease).
●● False Positive (FP): A sample is projected to be positive (ŷ=1, i.e., the person is predicted to develop the disease) while its label is actually negative (y=0). The sample is "falsely" projected to be positive in this instance.
●● False Negative (FN): A sample has a positive label (y=1, indicating that the individual will really develop the disease) but is predicted to be negative (ŷ=0), i.e., the person is predicted not to develop the condition. The sample is "falsely" predicted to be negative in this instance.
TP, TN, FP, and FN are referred to as “Positive” and “Negative” despite the fact that
the classes are typically labelled 1 and 0. This is because the values are arbitrary and

ty
frequently seen labelled as 1 and -1.

False Positive and False Negative meanings can be difficult to recall, and Data
Scientists frequently need to pause and consider each one’s significance before they

si
can recall which one stands for which.

Concentrate first on the second portion of the name (“Positive” or “Negative”)


r
to help you remember the difference. It basically states, “The sample is predicted to
be Positive/Negative (belong to Class 1/0)...” in relation to the prediction. Then, to
ve
determine if the prediction was accurate or not (“True” or “False”), we can look at the
initial component of the name. In this instance, we are also including the prediction’s
accuracy to determine whether the sample actually belonged to that Class.
ni

E.g. False Positive: Although the sample was projected to be Positive, it was
actually anticipated to be Negative.

Confusion Matrix
U

The so-called Confusion Matrix, which is just a table organising the four values,
typically presents True Positive, True Negative, False Positive, and False Negative in a
tabular manner.
ity
m
)A
(c

Performance Metrics
Accuracy

Amity Directorate of Distance & Online Education


Foundations of Machine Learning 89

The percentage of correct predictions made by our model out of all the forecasts
Notes

e
is known as accuracy. This means that we split all sorts of forecasts, both correct (TP,
TN) and incorrect (all other types), by the total number of predictions that were correctly
classified as Positive (TP) or correctly classified as Negative (TN) (FP, FN).

in
nl
Between 0 and 1 is the accuracy range. These extreme scenarios correspond to
predictions that were either wholly accurate or entirely off. For instance, if our model is
100% accurate, it will not produce any False Positives or False Negatives, bringing the

O
numerator to the denominator and the Accuracy to 1.

In contrast, the number of True Positives and True Negatives will be zero if our
system is always inaccurate, making the equation zero divided by something positive,
resulting in an Accuracy of 0.

ty
Since we can simply reverse the labels of the forecasts to get a better prediction if
the accuracy is less than 0.5, accuracy in real life actually ranges between 0.5 and 1.

si
However, accuracy is not a particularly useful metric, particularly when the data
is unbalanced. Accuracy does not provide the complete picture when there is a big
gap between the number of positive and negative labels. Consider a scenario where
r
we have 100 samples, 95 of which are classified as Class 0 samples, and 5 as Class
1 samples. In this instance, a weakly constructed “dummy” model that consistently
ve
predicts Class 0 earns a 95 percent Accuracy, indicating a highly powerful model.
Accuracy is not the appropriate performance criterion to assess this model’s power, and
it is not truly predictive. We would eventually wind up giving stakeholders and clients
a model that is neither effective nor prognosticative if we evaluated this model just on
ni

accuracy.

Precision
U

Data scientists typically employ Precision, Recall, and Specificity to get around the
limits of Accuracy. What percentage of accurate positive forecasts there were can be
determined by precision. To do this, it divides the total number of positive predictions—
ity

correct or incorrect—by the number of samples that were correctly predicted as positive
(TP, FP).
m

Recall = Sensitivity = True Positive Rate = Hit Rate


Similar to Precision, Recall measures the percentage of true positives that were
accurately detected. It accomplishes this by dividing the number of positive samples
)A

that were successfully forecasted as positive or mistakenly projected as negative, by


the total number of positive samples (TP) (TP, FN).
(c

Specificity = True Negative Rate = Selectivity


Similar to recall (also known as sensitivity), specificity measures the percentage
Amity Directorate of Distance & Online Education
90 Foundations of Machine Learning

of actual negatives that were accurately identified. It accomplishes this by dividing the
Notes

e
number of negative samples that were correctly projected to be negative or wrongly
forecasted to be positive by the total number of negative samples (TN, FP).

in
If we apply Precision, Recall, and Specificity in the example to demonstrate the

nl
inadequacies of Accuracy, we obtain:

●● Accuracy: 0.95

O
●● Recall: 0
We can better appreciate that a model that consistently predicts the majority class
is actually a low-performance model (Recall = 0), even though its accuracy is good
(Accuracy = 0.95), by utilising performance metrics other than accuracy.

ty
Area Under the ROC Curve (AUC)
As we’ve seen, if the distribution of the classes is not very well balanced, one of the

si
problems with Accuracy is that it might result in excessively inflated performance. The
full two-dimensional area beneath the complete ROC curve is measured by AUC, which
stands for “Area under the ROC Curve” (more on this later).
r
It represents a performance average over all potential classification levels. AUC
ve
can also be understood as the likelihood that a random positive sample will be ranked
higher than a random negative sample by the model.

Even though it can only be used in binary classification scenarios, AUC is a great
ni

metric, especially when dealing with imbalanced classes. It is one of the most often
used performance measures in classification (i.e. not with more than 2 classes as
target).

It has a number of advantages that make it a popular metric, including:
●● Scale-Invariance. Instead of focusing on forecasts' absolute values, AUC evaluates how well they are ranked.
●● Classification-Threshold-Invariance. Regardless of the classification threshold that is selected, AUC measures the accuracy of the model's predictions.

F1 Score
A less well-known performance statistic, the F1 score, represents the harmonic mean of Precision and Recall:

F1 = 2 × (Precision × Recall) / (Precision + Recall)

An F1 Score can be as high as 1, which denotes flawless Precision and Recall, or as low as 0, which occurs when either Precision or Recall is zero.
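A minimal sketch of both metrics with scikit-learn (an assumed library choice); note that AUC is computed from scores or probabilities, not hard class labels:

# F1 score and AUC (scikit-learn assumed).
from sklearn.metrics import f1_score, roc_auc_score

y_true = [0, 0, 1, 1, 0, 1, 0, 1]
y_pred = [0, 1, 1, 1, 0, 0, 0, 1]                   # hard labels for F1
y_score = [0.1, 0.6, 0.8, 0.7, 0.2, 0.4, 0.3, 0.9]  # ranking scores for AUC

print("F1 :", f1_score(y_true, y_pred))        # harmonic mean of P and R
print("AUC:", roc_auc_score(y_true, y_score))  # threshold-invariant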

Performance Charts
Performance metrics can also be presented as charts in addition to individual statistics. The ROC Curve and the Precision/Recall Curve are two popular charts used to illustrate how well a machine learning model performs.

ROC Curve (Receiver Operating Characteristic Curve)
A ROC curve is a graph that displays how well a classification model performs across all thresholds. The True Positive Rate (TPR) is plotted on the chart's y-axis, while the False Positive Rate (FPR) is plotted on the chart's x-axis.

A 45-degree diagonal line is the worst-case (random chance) scenario. The ideal
situation entails an angled line that travels vertically first and then horizontally.

As the classification threshold is lowered, the model begins to classify more data
as positive, which raises the number of both False Positives and True Positives.

Figure: ROC Curve — the True Positive Rate (y-axis) plotted against the False Positive Rate (x-axis) across thresholds.

Precision/Recall Curve
A Precision/Recall Curve, like the ROC Curve, depicts performance, with Precision on the y-axis and Recall on the x-axis. Each point corresponds to a different threshold value. The best-case scenario is the reverse of the ROC Curve's best case: essentially a horizontal line turning vertical. The worst-case scenario, random chance, is represented as a horizontal line with Precision = 0.5.
Figure: Precision/Recall Curve — Precision (y-axis) plotted against Recall (x-axis) across thresholds.
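A minimal sketch of computing the points behind both charts with scikit-learn (an assumed library choice); plotting itself is left to a charting library such as matplotlib:

# Points for the ROC and Precision/Recall curves (scikit-learn assumed).
from sklearn.metrics import precision_recall_curve, roc_curve

y_true = [0, 0, 1, 1, 0, 1, 0, 1]
y_score = [0.1, 0.6, 0.8, 0.7, 0.2, 0.4, 0.3, 0.9]

fpr, tpr, _ = roc_curve(y_true, y_score)                      # ROC points
precision, recall, _ = precision_recall_curve(y_true, y_score)

print("ROC points:", list(zip(fpr.round(2), tpr.round(2))))
print("PR points :", list(zip(recall.round(2), precision.round(2))))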


Impact of Choosing the Right Performance Metric
In situations where False Positives and False Negatives do not have the same impact, picking the appropriate metric is crucial. With machine learning models, there is typically a trade-off between accurately detecting False Positives and False Negatives. Ideally, we would want a forecast that is flawless in terms of both False Positives and False Negatives (both zero).

For instance, it may be argued that False Negatives are more significant if our model determines whether a person has a fatal disease, such as cancer. If a person has the condition, we want to make sure we accurately flag them, and we are less concerned if we mistakenly label someone as having the sickness when they didn't.

On the other hand, it may be argued that False Positives are more significant if our model determines whether a person is guilty or innocent: we want to make sure that nobody is unjustly imprisoned.

2.1.9 Application of Performance Metrics

Regression
It is best to emphasise certain important introductory principles before delving more deeply into the performance measures.

Error is a rather intuitive measure, and since it is such a well-known concept, it only needs a brief formal definition. When we talk about errors in machine learning performance, we refer specifically to the difference, or "delta", between the actual target value y and the predicted value ŷ.

)A

The more "off" the model is from being an exact depiction of the phenomenon, the greater the discrepancy between the actual outcome y and the anticipated outcome ŷ; the closer the values, the better the system performs.

Mean Squared Error (MSE) / Mean Squared Deviation (MSD)
The Mean Squared Error calculates the average of the squared errors. In essence, it computes the difference between the estimated and real values, squares those differences, and then finds the average of those numbers:

MSE = (1/n) Σ (yᵢ − ŷᵢ)²

MSE can only rely on non-negative numbers because the mistakes are squared.

MSE is typically positive and not zero because most processes have inherent
unpredictability and noise.

MSE uses the same units of measurement as the variance, i.e., the square of the units of the quantity being estimated.

Mean Squared Error shares the same major drawback as Variance in that it is not
resilient to outliers. The square of the error will be substantially bigger if one sample’s
“y” and associated error are significantly larger than those of the other samples. This

ty
renders MSE susceptible to outliers, together with the fact that it averages mistakes.

Root Mean Squared Error (RMSE) / Root Mean Squared Deviation (RMSD)
RMSE computes the average of the squared errors over all samples, just like MSE does, but it also takes the square root of the outcome, i.e., the square root of MSE:

RMSE = √MSE = √((1/n) Σ (yᵢ − ŷᵢ)²)

By doing this, RMSE offers an error measure in the same unit as the target variable. For instance, RMSE will provide the error in dollars if our target y is the dollar amount of sales for the upcoming year, whereas MSE will provide the error in dollars squared, which is considerably less comprehensible.
U

Mean Absolute Error (MAE)


The square of the mistakes is not taken into account by the Mean Absolute Error.
As an alternative, it only computes the errors’ absolute values before averaging them.
ity

The MAE uses the absolute value because the absolute distance is what interests
us, not the direction in which the estimated and actual target values differ (estimated >
actual or vice versa). By doing this, errors won’t cancel one another out when the MAE
is calculated.
m

Because the MAE formula does not square errors, unlike MSE, it does not penalise
)A

larger errors more severely than smaller ones. Another benefit is that, unlike RMSE,
MAE does not square the units, which makes the results easier to understand.

Mean Absolute Percentage Error (MAPE)
The mean absolute percentage error measures the error between actual and predicted values as a percentage. It does this by calculating the error in a manner similar to MAE and then dividing the result by the actual value to obtain a percentage:

MAPE = (100% / n) Σ |yᵢ − ŷᵢ| / |yᵢ|

We can better grasp the relative accuracy of our forecasts by presenting the
mistake as a percentage. For example, if we were to forecast spending for the
following year, an MAE error of $50 may represent either a reasonably good or poor
approximation.

For instance, we could confidently say that the prediction is fairly accurate if the $50 inaccuracy was made in relation to a $1 million actual expenditure. A $50 inaccuracy on a $60 cost estimate, on the other hand, would be a significant departure from the true value.

In comparison, a $50 inaccuracy against an estimate of $1 million represents a 0.005 percent error. If this error occurred on a $60 prediction, it would entail an error of 83 percent of the expected value, i.e., a range of $10 to $110, or nearly double the value of the prediction.

In this instance, MAPE provides a more realistic depiction of the inaccuracy by expressing it relative to the actual values.

R Squared (R²) / Coefficient of Determination
R Squared (R²) expresses the amount of the dependent variable y's variance that is explained by the independent variables X. R² illustrates how much the variation of one variable contributes to the variance of the other. In other words, if a model's R² is 0.75, its characteristics can account for about 75% of the observed variation.

R² is computed as one minus the ratio of the residual sum of squares to the total sum of squares:

R² = 1 − (SS_res / SS_tot)

R² evaluates how well the selected model fits with respect to a baseline horizontal line. The R² is negative if the selected model fits the data less well than a horizontal line. Despite the "square" in the name, the R² formula allows it to be negative without breaking any mathematical laws: R² becomes negative only when the model does not match the data's trend and fits the data worse than a horizontal line.

One disadvantage of R² is that it never gets worse when more features are added to a model. It keeps improving even when the features that have been added to the model are not inherently predictive.

Adjusted R Squared (R²)
The Adjusted R² was introduced as a result. It takes into account the features incorporated into the predictive model. As a result, the Adjusted R² increases as more predictive characteristics are added to the model. Contrary to what would happen with R², the Adjusted R² value decreases when more "useless" features are added to the model. Because of this, the Adjusted R² is always lower than or equal to the R² value:

Adjusted R² = 1 − [(1 − R²)(n − 1)] / (n − k − 1)

where k is the number of features in the model and n is the number of data points.

In general, it is usually important to report both an error measure, such as RMSE, and an R² measure. This is because R² articulates the relationship between the model's features X and the target variable y, while error measurements describe how dispersed the data points are in relation to the regression fit. For example, reporting both Adjusted R² and RMSE enables a more accurate comparison of the model with other benchmarks.
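A minimal sketch of the regression metrics above, using scikit-learn and numpy (assumed library choices; mean_absolute_percentage_error is available in recent scikit-learn versions and returns a fraction). Adjusted R² has no built-in helper, so it is computed directly from the formula; the data and the feature count k are purely illustrative:

# Regression performance metrics (scikit-learn and numpy assumed).
import numpy as np
from sklearn.metrics import (mean_absolute_error,
                             mean_absolute_percentage_error,
                             mean_squared_error, r2_score)

y_true = np.array([3.0, 5.0, 7.5, 10.0, 12.0])
y_pred = np.array([2.8, 5.4, 7.0, 10.5, 11.6])

mse = mean_squared_error(y_true, y_pred)
print("MSE :", mse)
print("RMSE:", np.sqrt(mse))                         # same unit as y
print("MAE :", mean_absolute_error(y_true, y_pred))
print("MAPE:", mean_absolute_percentage_error(y_true, y_pred))  # a fraction

r2 = r2_score(y_true, y_pred)
n, k = len(y_true), 2          # k: hypothetical number of model features
print("R²     :", r2)
print("Adj. R²:", 1 - (1 - r2) * (n - 1) / (n - k - 1))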

2.1.10 Application of Performance Metrics (Part II): Class Imbalance Problem
In machine learning, the term "classification" refers to a predictive modelling issue where a class label is predicted for an input set of data. The label or target could belong to two classes, or even more.
si
part of two classes, or even more.

When the majority of the data fall under one class label, there is a class imbalance.
r
Both two-class classification and multiclass classification are susceptible to it. Data
distribution is assumed to be equal by machine learning techniques. Therefore, when
ve
there is an imbalance in the classes, the machine learning classifier has a tendency to
be more biassed towards the majority class, which results in incorrect categorization of
the minority class. It happens because the cost function of traditional machine learning
algorithms constantly seeks to maximise metrics like mistake rate without taking the
ni

distribution of the data into account.

How imbalanced is the class?
To determine how much the class is unbalanced, the two measures balanced_accuracy_score and accuracy_score must be taken into account.

When dealing with an unbalanced dataset for binary or multiclass classification issues, balanced_accuracy_score computes balanced accuracy. It is defined as the average recall achieved across all classes.

In contrast, the accuracy score is the plain accuracy classification score. For a balanced dataset, there will be no difference between these two values; in other words, with a balanced dataset, the accuracy_score and the balanced_accuracy_score are equal.
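A minimal sketch contrasting the two scores on an imbalanced target (scikit-learn is an assumed library choice):

# accuracy_score vs. balanced_accuracy_score (scikit-learn assumed).
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score

y_true = np.array([0] * 95 + [1] * 5)
y_pred = np.zeros(100, dtype=int)  # majority-class predictor

print("accuracy_score         :", accuracy_score(y_true, y_pred))           # 0.95
print("balanced_accuracy_score:", balanced_accuracy_score(y_true, y_pred))  # 0.50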

Ways to handle Imbalanced Class

Changing Performance Metric:
When given an unbalanced dataset, the machine learning model will consistently predict the value of the majority class and achieve high classification accuracy, despite being a poor classifier for the minority class. This is referred to as the accuracy paradox.

Other performance metrics, such as confusion matrices, precision, recall, F1 score, and Area Under the ROC Curve, should be taken into account for evaluation in order to get around this.

The effectiveness of a classification method is summarised using the confusion matrix. It includes:
●● True Positive: a positive outcome is correctly predicted as positive.
●● True Negative: a negative outcome is correctly predicted as negative.
●● False Positive: a negative outcome is wrongly predicted as positive.
●● False Negative: a positive outcome is wrongly predicted as negative.

Figure: Confusion Matrix

Falsely rejecting a null hypothesis as the outcome of a test process is referred to


as a type I error (false positive), also known as a first-kind error. This occurs when a
U

negative outcome is incorrectly expected to be good. The incorrect acceptance of the


null hypothesis as the outcome of a test process is referred to as a type II error or error
of the second sort. That is when a favourable outcome is mistakenly foreseen as a
ity

negative outcome.

The degree to which we are confident that a prediction we made will come true
will be referred to as precision. According to mathematics, it is the percentage of real
positive forecasts among all positive predictions.
m

The recall is the percentage of actual positive elements that are actually true
)A

positives. A true positive rate is another name for a recall.

The weighted harmonic mean of the precision and recall is the F1 score.
(c

Area Under the Receiver Operating Characteristic (ROC) Curve: The Receiver

Amity Directorate of Distance & Online Education


Foundations of Machine Learning 97

Operating Characteristic (ROC) curve describes the performance of the classifier


Notes

e
machine learning method over a range of trade-offs between true positive and false
positive error rates. The performance statistic for the ROC curve is AUC. The likelihood
for the model to distinguish between observations from two classes is shown by the

in
area under the ROC curve.

Random Resampling:
The minority class is oversampled, while the dominant class is undersampled.

Let's imagine that our target has 20,000 records, of which 19,900 belong to the majority class and 100 to the minority class.

In oversampling, more records are added to the minority class so that it becomes equivalent to the majority class. In our scenario, after oversampling, the minority class count will be 19,900, the same as the original majority class count.

Oversampling is advised when the dataset is not too big. This method's biggest drawback is that it can result in over-fitting.

In majority class undersampling, records from the majority class are randomly eliminated. In our scenario, the majority class count will be equal to 100 after undersampling, the same as the initial minority class count. It is clear that undersampling results in information loss; it is therefore advised for large datasets, where losing some information is not a significant concern.

Additionally, undersampling can result in poor generalisation to the test set and under-fitting.
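A minimal sketch of both resampling directions using sklearn.utils.resample (an assumed helper choice; dedicated samplers also exist in the imbalanced-learn package):

# Random over- and under-sampling (scikit-learn's resample assumed).
import numpy as np
from sklearn.utils import resample

X = np.arange(40).reshape(20, 2)
y = np.array([0] * 16 + [1] * 4)  # 16 majority vs 4 minority rows

X_min, X_maj = X[y == 1], X[y == 0]

# Oversample minority up to the majority size (with replacement).
X_min_up = resample(X_min, replace=True, n_samples=len(X_maj), random_state=0)

# Undersample majority down to the minority size (no replacement).
X_maj_down = resample(X_maj, replace=False, n_samples=len(X_min), random_state=0)

print("oversampled minority :", len(X_min_up))    # 16
print("undersampled majority:", len(X_maj_down))  # 4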

SMOTE: Synthetic Minority Over-sampling Technique:
In order to achieve parity between the minority and majority classes, SMOTE synthesises new elements for the minority class based on the records it already has. It selects a point from the minority class at random and then determines that point's k-nearest neighbours.

Neighbours are selected at random from the k closest neighbours depending on how much oversampling is necessary, and the synthetic points are inserted between the selected point and its neighbours.

Since we didn't eliminate any data, SMOTE will be more accurate than random under-sampling because all information is kept. However, training the model will take more time.
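A minimal sketch of SMOTE using the imbalanced-learn package (an assumption: the text names the technique, not this library; install with pip install imbalanced-learn):

# SMOTE oversampling (imbalanced-learn assumed).
from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, weights=[0.95, 0.05],
                           random_state=0)
print("before:", Counter(y))  # roughly 950 majority vs 50 minority

X_res, y_res = SMOTE(k_neighbors=5, random_state=0).fit_resample(X, y)
print("after :", Counter(y_res))  # classes balanced with synthetic points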

Algorithmic Ensemble Techniques:
Here, the same problem can be solved using n separate classifiers trained as n different models, and their predictions are combined. The ensemble will generally be more accurate than its individual classifiers. A random forest method, for example, is made up of numerous decision trees and makes use of ensemble learning.

The algorithmic ensemble technique's primary goals are performance enhancement and complicated problem solving.


Use Tree-Based Algorithms:
Decision tree learning is a predictive modelling technique. It is used to address categorisation issues in statistics, data mining, and machine learning. It reflects decisions or is used for decision-making, and it has an upside-down tree-like form. Because the tree-like hierarchical structure enables it to learn from both classes, it frequently performs well on imbalanced datasets.

XGBoost — Extreme Gradient Boosting
XGBoost is an abbreviation of Extreme Gradient Boosting.

Gradient boosting is a potent machine learning ensemble approach that combines numerous classifiers to deliver excellent performance and address challenging problems. XGBoost is an improved and modified gradient-boosting decision tree method. It uses parallel processing, which gives it a quick execution rate, and it contains a built-in system to deal with missing data.

In gradient boosting, decision trees are fitted one at a time by minimising the error gradient, and a node stops being divided as soon as a negative loss is encountered. XGBoost, however, splits up to the set maximum depth and then prunes the tree backward to eliminate pointless comparisons or subtrees.
XG Boost splits up to the set maximum depth. The tree is then pruned backward to
eliminate pointless comparisons or subtrees.

r
The XGBoost package in R and Python can be used to do extreme gradient
boosting.
ve
Summary
●● The regression tree-based approach is frequently used in the field of analytics due to its many benefits. The user may view each level, which enables him to decide more wisely, ranking the choice's considerations in order of significance according to what is most important. Regression analysis is a straightforward method for making decisions when compared to many other methods. The majority of the unwanted data is filtered away as you descend further down the tree, giving you fewer data. The regression tree is easy to build, and it may be displayed to higher-ups as a chart or a straightforward diagram.
further down the tree, giving you fewer data. The regression tree is easy to build,
and it may be displayed to higher-ups as a chart or a straightforward diagram.
ity

●● A simple yet efficient analytical technique for identifying the variables that are most
“important” (in terms of explanatory power) is called CART (Classification and
Regression Tree Analysis). Analyses of classification and regression trees have
exciting promise for nursing and other healthcare investigations. The method is
a straightforward, computationally driven, and useful way to simulate interactions
m

between elements that are connected to health but could otherwise go unreported.
The importance of this cannot be overstated because unidentified factors
frequently affect patient outcomes in healthcare research.
)A

●● The capacity to identify and evaluate the importance of these components is what makes this technique so beautiful. CART is a powerful algorithm that is also fairly easy to explain when compared to other machine learning methods. It doesn't require a lot of processing power, which lets you make models quickly. While it's vital to avoid overfitting your data, this approach works well in straightforward circumstances. The definition of interactions between elements relevant to health that could otherwise go unnoticed is made straightforward by classification and regression tree analysis.
●● The two main issues that affect machine learning and lower the effectiveness of machine learning models are over-fitting and under-fitting. Each machine learning model's primary objective is to generalise effectively. Generalisation is defined here as the ability of an ML model to adapt to a provided set of unknown inputs and produce an acceptable output. It indicates that the model can generate trustworthy and accurate output after receiving training on the dataset. As a result, under-fitting and over-fitting are the two terms that need to be examined in order to determine how effectively the model performs and generalises.

O
●● The holdout method is the most straightforward cross-validation technique. The training set and the testing set are two different subsets of the data set. The training set is the sole data used by the function approximator to fit a function. The output values for the data in the testing set are then predicted by the function approximator (it has never seen these output values before). The errors it makes are added up as before to give the mean absolute test set error, which is used to assess the model. This method's benefit is that it computes quickly and is typically better than the residual method. Its appraisal, however, may have a wide range: the evaluation may be very different depending on how the division is created, because it may heavily depend on which data points are included in the training set and which are included in the test set.
training set and which are included in the test set.
ve
●● Cross-validation and the bootstrap are two of the most popular resampling
techniques that are covered here. Both approaches are valuable tools for the
practical implementation of numerous statistical learning techniques. Cross-
validation, for instance, can be used to choose the right level of flexibility or
ni

to estimate the test error linked to a specific statistical learning approach in


order to assess its performance. Model selection is the process of choosing
the appropriate level of flexibility for a model, whereas model assessment is
U

the process of assessing a model’s performance. The bootstrap is employed in


a variety of situations, but is most frequently used to assess the precision of a
parameter estimate or a specific statistical learning method.
ity

●● When the majority of the data fall under one class label, there is a class imbalance. Both two-class classification and multiclass classification are susceptible to it. Machine learning techniques assume that the data distribution is equal. Therefore, when there is an imbalance in the classes, the machine learning classifier has a tendency to be more biased towards the majority class, which results in incorrect categorisation of the minority class. This happens because the cost function of traditional machine learning algorithms constantly seeks to minimise metrics like the error rate without taking the distribution of the data into account.
maximise metrics like mistake rate without taking the distribution of the data into
)A

account.

Glossary
●● Classification: Classification is the process of identifying a function that aids in the
classification of a dataset based on several factors.
(c

●● Regression: The technique of discovering correlations between dependent and


independent variables is known as regression.


●● Parametric model: A parametric model is a type of learning model that summarises


Notes

e
data with a set of parameters of fixed size.
●● Logistic Regression: Logistic regression is kind of like linear regression, but is

in
used when the dependent variable is not a number but something else (e.g., a
“yes/no” response).
●● Linear Regression: Linear regression analysis is a statistical technique for

nl
predicting the value of one variable based on the value of another.
●● Neural Networks: Neural Networks are structures formed by nodes based on
the neural structure of the brain. These nodes are connected each other or bi-

O
directionally.
●● K- Nearest neighbors: K-NN algorithm is one of the simplest classification
algorithms and it is used to identify the data points that are separated into several

ty
classes to predict the classification of a new sample point.
●● ID3: Iterative Dichotomiser 3.
●● Entropy: Entropy is the degree or amount of uncertainty in the randomness of

si
elements. In other words, it is a measure of impurity.
●● Information gain: Information gain measures the relative change in entropy with
respect to the independent attribute.
●●
r
Hyperplane: A decision plane (hyperplane) is one that separates between a set of
ve
objects having different class memberships.
●● SVM: Support Vector Machine.
●● Random forest: Random forest classifier is an ensemble algorithm based on
ni

bagging i.e bootstrap aggregation.


●● Variance: Variance refers to the change in prediction by using a different training
set.
U

●● Bias: Bias refers to an error produced by simplifying the modeling of a problem too


much.
●● Noise: The noise is caused by the inherent error in the data.
ity

●● Prediction errors: When we talk about the Machine Learning model, we actually
talk about how well it performs and its accuracy which is known as prediction
errors.
m

●● Under-fitting: A statistical model or a machine learning algorithm is said to have


under-fitting when it cannot capture the underlying trend of the data, i.e., it only
performs well on training data but performs poorly on testing data.
)A

●● Over-fitting: A statistical model is said to be overfitted when the model does not
make accurate predictions on testing data.
●● Hold-out: In the hold-out method, a fraction of the entries in the ratings matrix are
hidden, and the remaining entries are used to build the training model.
(c

●● Cross-Validation: In the cross-validation method, the ratings entries are divided


into q equal sets. Therefore, if S is the set of specified entries in the ratings matrix
R, then the size of each set, in terms of the number of entries, is |S|/q.


●● Bootstrapping: Bootstrapping is a resampling method used to simulate samples


out of a data set using the replacement technique. Notes

e
●● K-fold cross validation: K-fold cross-validation approach divides the input dataset

in
into K groups of samples of equal sizes. These samples are called folds.
●● SE: Standard Error.
●● Bootstrapping Aggregation: Bootstrapping Aggregation, is an ensemble Machine

nl
Learning technique that accommodates the bootstrapping method and the
aggregation technique.
●● Confidence Intervals: A Confidence Interval (CI) is a type of statistic that reflects

O
the probability of a calculated interval containing a true value.
●● Accuracy: Accuracy is the fraction of predictions our model got right out of all the
predictions.

ty
●● AUC: Area Under the ROC Curve.
●● ROC Curve: Receiver Operating Characteristic Curve.

si
●● MSE: Mean Squared Error.
●● MSD: Mean Squared Deviation.
●●
r
●● RMSE/RMSD: Root Mean Squared Error / Root Mean Squared Deviation.
ve
●● MAE: Mean Absolute Error.
●● MAPE: Mean Absolute Percentage Error.
●● SMOTE: Synthetic Minority Over-sampling Technique.
ni

●● XGBoost: Extreme Gradient Boosting.

Check Your Understanding


U

1. The technique of discovering correlations between dependent and independent


variables is known as_ _ _ _.
a) Regression
ity

b) Classification
c) Boosting
d) Overlapping
2. _ _ _ _ is the process of identifying a function that aids in the classification of a
m

dataset based on several factors.


a) Data analysis
)A

b) Classification
c) Regression
d) None of the mentioned
3. Which is true about Regression algorithm?
(c

a) The output variable must be discrete


b) The algorithm can be divided into binary classifier and multi-class classifier

c) The algorithms are used with continuous data


Notes

e
d) None of the mentioned
4. Which of the following problems can be solved by Classification algorithm?

in
a) Identification of spam emails
b) Speech recognition

nl
c) Identification of cancer cells
d) All of the above

O
5. A_ _ _ _is a type of learning model that summarises data with a set of parameters of
fixed size.
a) Parametric model

ty
b) Non-parametric model
c) Both a and b
d) None of the mentioned

si
6. _ _ _ _ _is kind of like linear regression, but is used when the dependent variable is
not a number but something else (e.g., a “yes/no” response).
a) Linear regression r
ve
b) Logistic regression
c) Parametric
d) Non-parametric
ni

7. _ _ _ _ _analysis is a statistical technique for predicting the value of one variable


based on the value of another.
a) Logistic regression
U

b) Naïve Bayes
c) Linear regression
d) None of the mentioned
ity

8. _ _ _ _ _theorem is based with the independence assumptions between predictors,


i.e., it assumes the presence of a feature in a class is unrelated to any other feature.
a) Logistic regression
m

b) Linear regression
c) Parametric model
)A

d) Naïve Bayes
9. _ _ _ _are structures formed by nodes based on the neural structure of the brain.
a) Neural Networks
b) Logistic regression
(c

c) Linear regression
d) Naïve Bayes
Amity Directorate of Distance & Online Education
Foundations of Machine Learning 103

10. _ _ _ _ is the degree or amount of uncertainty in the randomness of elements.


Notes

e
a) Information gain
b) SVM

in
c) Entropy
d) None of the mentioned

nl
11. _ _ _ _ measures the relative change in entropy with respect to the independent
attribute.
a) Entropy

O
b) Information gain
c) Neural networks

ty
d) SVM
12. A_ _ _ _is one that separates between a set of objects having different class
memberships.

si
a) SVM
b) Neural networks
c) Entropy r
ve
d) Decision plane
13. _ _ _ _ _classifier is an ensemble algorithm based on bagging i.e bootstrap
aggregation.
a) Random forest
ni

b) Entropy
c) Decision plane
U

d) None of the mentioned


14. _ _ _ _is a resampling method used to stimulate samples out of a data set using the
replacement technique.
ity

a) Testing
b) Bootstrapping
c) Algorithms
m

d) Cross validation
15. _ _ _ _ _is a technique for validating the model efficiency by training it on the subset
of input data and testing on previously unseen subset of the input data.
)A

a) Testing
b) Bootstrapping
c) Cross-validation
(c

d) None of the mentioned

Amity Directorate of Distance & Online Education


104 Foundations of Machine Learning

Exercise
Notes

e
1. What do you mean by classification and regression?
2. Define the concept of bias variance trade-off

in
3. Define under-fitting and over-fitting concepts in machine learning.
4. Explain the hold out method

nl
5. Define cross validation and bootstrapping method.
6. What do you mean by performances metrices? And, also give some application of
performances metrices.

O
7. Define SMOTE.
8. What do you meant by K-Fold Cross-Validation?

ty
Learning Activities
1. How python implements Naïve Bayes algorithm?

si
Check Your Understanding - Answers
1. a 2. b
3. c
r 4. d
ve
5. a 6. b
7. c 8. d
9. a 10. c
ni

11. b 12. d
13. a 14. b
15. c
U

Further Readings and Bibliography:


1. Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts,
ity

Tools, and Techniques to Build Intelligent Systems, Geron Aurelien


2. The Elements of Statistical Learning, Jerome H. Friedman, Robert Tibshirani,
and Trevor Hastie
3. Introduction to Machine Learning with Python: A Guide for Data Scientists,
m

Andreas C. Müller and Sarah Guido


4. Supervised and Unsupervised Learning for Data Science, Berry Michael W.,
Azlinah Mohamed, Bee Wah Yap
)A

5. Machine Learning: An Algorithmic Perspective, Stephen Marsland


(c

Amity Directorate of Distance & Online Education


Foundations of Machine Learning 105

Module - III: Cross-Validation and Performance


Notes

e
Evaluation Methods

in
Learning Objectives:
At the end of this module, you will be able to understand:

nl
●● Naive Bayes Classifier for Labelled and mixed Data Type
●● Performance Matrices for Regression

O
●● K Nearest Neighbour
●● Basics of Decision Tree
●● Concept of Support Vector Machines

ty
●● Random Forest Classifier and its technique and constructions

Introduction

si
Every machine learning pipeline has performance measurements. They quantify
your progress and let you know if you’re making any. All machine learning models
require a metric to assess performance, whether they use linear regression or a SOTA
method like BERT. r
ve
Similar to performance measurements, every machine learning activity may be
divided into two categories: regression and classification. There are other metrics for
these issues, but we’ll talk about the most common ones and the data they reveal
regarding model performance. It’s crucial to understand how your model interprets your
ni

data!

You probably saw the evaluation area if you’ve ever taken part in a Kaggle
competition. There is almost always a metric by which they evaluate your performance.
U

Loss functions are separate from metrics. Model performance is represented by


loss functions. They’re typically differentiable in the model’s parameters and used to
train a machine learning model (using an optimization technique like gradient descent).
ity

It is not necessary for metrics to be differentiable because they are used to track
and gauge a model’s performance (during training and testing).

However, if the performance metric is differentiable for some tasks, it can also be
m

utilised as a loss function, such as MSE (perhaps with additional regularizations added
to it).

3.1 Introduction to Cross-Validation and Performance


)A

Evaluation Methods
The basic principle behind the resampling approach known as cross-validation is to
divide the dataset into training and test data. The model is trained using training data,
and predictions are made using test data that hasn’t been seen. It indicates that the
(c

model hasn’t overfitted the training data and may be used for prediction if it performs
well on the test data and provides good accuracy.

Amity Directorate of Distance & Online Education


106 Foundations of Machine Learning

3.1.1 Naive Bayes Classifier for Labelled Data Type


Notes

e
A group of classification algorithms built on the Bayes’ Theorem are known as
naive Bayes classifiers. It is a family of algorithms rather than a single method, and they

in
are all based on the idea that every pair of features being classified is independent of
the other.

Let’s think about a dataset first.

nl
Take a look at a hypothetical dataset that details the weather requirements for a
round of golf. Each tuple determines whether the weather is suitable for playing golf
(“Yes”) or not (“No”) based on the current conditions.

O
Here is a table showing how our dataset is organised.

Outlook Temperature Humidity Windy Play

ty
Golf
0 Rainy Hot High FALSE No
1 Rainy Hot High TRUE No

si
2 Overcast Hot High FALSE Yes
3 Sunny Mild High FALSE Yes
4
5
Sunny
Sunny
r
Cool
Cool
Normal
Normal
FALSE
TRUE
Yes
No
ve
6 Overcast Cool Normal TRUE Yes
7 Rainy Mild High FALSE No
8 Rainy Cool Normal FALSE Yes
ni

9 Sunny Mild Normal FALSE Yes


10 Rainy Mild Normal TRUE Yes
11 Overcast Mild High TRUE Yes
U

12 Overcast Hot Normal FALSE Yes


13 Sunny Mild High TRUE No
ity

The feature matrix and the response vector are the two components that make up
the dataset.

●● All of the dataset’s vectors (rows), each of which holds the value of dependent
characteristics, are contained in the feature matrix. Features in the dataset above
m

include “Outlook,” “Temperature,” “Humidity,” and “Windy.”


●● For each row of the feature matrix, the response vector holds the value of the
class variable (prediction or output). The class variable in the aforementioned
)A

dataset is called “Play golf.”

Assumption
The fundamental Naive Bayes assumption is that each feature makes an:
(c

●● independent
●● equal

Amity Directorate of Distance & Online Education


Foundations of Machine Learning 107

Contribution to the Outcome:


Notes

e
This idea can be regarded as follows in reference to our dataset:

●● We presum that there are no dependent feature pairs. For instance, the fact that

in
it’s “Hot” outside has nothing to do with the humidity, and the fact that it’s “Rainy”
outside has nothing to do with the winds. Therefore, it is presumed that the traits
are independent.

nl
●● Second, equal weight is assigned to each feature (or importance). For instance,
it is impossible to forecast the outcome with accuracy from temperature and
humidity alone. All of the characteristics are relevant and are seen as having an

O
equal impact on the result.
Note:In practical applications, Naive Bayes’ presumptions are frequently incorrect.
Although the independence assumption is false in theory, it frequently holds true in

ty
practice.

It is crucial to understand Bayes’ theorem before moving on to the formula for


Naive Bayes.

si
Bayes’ Theorem
The Bayes Theorem determines the likelihood of an event occurring given the
r
likelihood of an earlier event occurring. The mathematical formula for Bayes’ theorem is
ve
as follows:

Where A and B are events and P(B) ≠ 0.


ni

●● In essence, we’re looking for the likelihood that event A will occur if event B is true.
Evidence also refers to Event B.
U

●● P(A) is A’s priori (the prior probability, i.e. Probability of event before evidence is
seen). An attribute value of an unidentified instance serves as the proof (here, it is
event B).
●● P(A|B) is the a posteriori probability of B, or the likelihood of an occurrence
ity

following the discovery of evidence.


The Bayes theorem can now be applied to our dataset in the manner described
below:
m

where X is a dependent feature vector (of size n) and y is a class variable.


)A

Just to be clear, here is an illustration of a feature vector and class variable: (refer
1st row of dataset)

X = (Rainy, Hot, High, False)


(c

y = No

P(y|X) here essentially refers to the likelihood of “Not playing golf” under the
Amity Directorate of Distance & Online Education
108 Foundations of Machine Learning

provided conditions of “Rainy outlook,” “Temperature is hot,” “high humidity,” and “no
Notes

e
wind.”

Naive Assumption

in
It’s time to apply a simplistic premise to the Bayes theorem: the independence of
the features. We have now divided the evidence into its individual components.

nl
Consequently, if any two events A and B are independent,

P(A,B) = P(A)P(B)

O
Thus, we arrive at the conclusion:

ty
which is comprehensible as:

si
Now that the denominator is fixed for a particular input, we can eliminate that
phrase:

r
ve
We must now develop a classifier model. To do this, we calculate the likelihood of
the given set of inputs for each potential value of the class variable y and choose the
result with the highest probability. Mathematically, this is expressed as:
ni

Finally, the work of calculating P(y) and P(xi | y) remains.

Please take note that P(xi | y) is known as conditional probability and P(y) is also
U

known as class probability.

The assumptions that different naive Bayes classifiers make about the distribution
of P(xi | y) are what distinguish them from one another.
ity

Let’s attempt to manually apply the aforementioned formula to our weather


collection. Precomputations on our dataset are required for this.

For each xi in X and each yj in Y, we must determine P(xi | yj). The tables below
show the results of all these calculations:
m
)A
(c

Amity Directorate of Distance & Online Education


Foundations of Machine Learning 109

Notes

e
in
nl
O
ty
si
Therefore, in the tables 1-4, we manually calculated P(xi | yj) for each xi in X and yj
in Y, as shown in the above picture. For instance, P(temp. = cool | play golf = yes) = 3/9
is the probability of playing golf given that the temperature is cool.
r
Additionally, we must locate class probabilities (P(y)), which were computed in
ve
table 5. P(play golf = Yes), for instance, equals 9/14.

Our pre-computations are now complete, and the classifier is prepared.

Let’s test it out with a fresh set of features today:


ni

today = (Sunny, Hot, Normal, False)

Therefore, the likelihood of playing golf is provided by:


U

and the likelihood of not playing golf is determined by:


ity

P(today) is shared by both probabilities, thus we can disregard it and calculate


proportional probabilities as:
m

And
)A

Now, since
(c

Amity Directorate of Distance & Online Education


110 Foundations of Machine Learning

Notes

e
in
nl
O
ty
Golf will therefore be played, as predicted.

The approach we covered previously works with discrete data. We must make

si
some assumptions about the distribution of values for each characteristic when dealing
with continuous data. The assumptions that different naive Bayes classifiers make
about the distribution of P(xi | y) are what distinguish them from one another.

r
Gaussian Naive Bayes classifier
ve
Continuous values connected to each feature in Gaussian Naive Bayes are
presumptively distributed in a Gaussian manner. Normal distribution is another name
for a Gaussian distribution. As illustrated below, when plotted, it produces a bell-shaped
curve that is symmetric about the mean of the feature values:
ni
U
ity
m

Since it is presumed that the likelihood of the characteristics is gaussian, the


conditional probability is given by:
)A

Other popular Naive Bayes classifiers are:


(c

●● Multinomial Naive Bayes: The frequency with which specific events were produced
by a multinomial distribution are represented by feature vectors. The event model
generally employed in document classification is this one.

Amity Directorate of Distance & Online Education


Foundations of Machine Learning 111

●● Bernoulli Naive Bayes: Features are independent booleans (binary variables) that
Notes

e
describe inputs in the multivariate Bernoulli event model. This paradigm is used
for document classification problems, where binary term occurrence features—that
is, whether a word appears in a document or not—are utilised rather than term

in
frequencies (i.e. frequency of a word in the document).

3.1.2 Performance Matrices for Regression

nl
Finding correlations between dependent and independent variables is the goal
of the supervised learning technique known as regression. A numerical or discrete
value is predicted by a predictive regression model. Regression measurements are

O
different from classification metrics in many ways. This implies that we cannot assess a
regression model using the Accuracy measure; rather, the performance of a regression
model is reported as errors in the forecast. The most common indicators for assessing

ty
the effectiveness of regression models are listed below.

●● Mean Absolute Error


●● Mean Squared Error

si
●● R2 Score
●● Adjusted R2

Mean Absolute Error (MAE)


r
ve
One of the simplest metrics, Mean Absolute Error (MAE), evaluates the absolute
difference between actual and anticipated values. Absolute refers to treating a number
as positive. Let’s use a linear regression example to illustrate MAE. In linear regression,
ni

dependent and independent variables are connected along the line of greatest fit by
the model. We must compute the difference between actual and predicted numbers in
order to get the MAE, or error in prediction. However, we must first determine the mean
absolute of the entire dataset in order to determine the absolute error for the entire
U

dataset.

To compute MAE, use the formula below:


ity

MAE = 1/N∑|Y-Y’|

Here,

Y stands for the actual result, Y’ for the anticipated result, and N for all the data
points combined.
m

For the outliers, MAE is significantly more reliable. We must use other optimizers,
such as Gradient Descent, because one of MAE’s drawbacks is that it is not
differentiable. Mean Squared Error, or MSE, is a different statistic that can be employed
)A

to get around this restriction.

Mean Squared Error


One of the best metrics for evaluating regression is mean squared error, or MSE.
(c

It calculates the average of the squared difference between actual model output and
expected values.

Amity Directorate of Distance & Online Education


112 Foundations of Machine Learning

Since squared errors are used in MSE, only non-negative values are assumed,
Notes

e
which are often positive and non-zero.

Additionally, because squared differences penalise tiny errors as well, it overstates

in
how flawed the model is.

In comparison to other regression measures, MSE is a highly preferred statistic


since it may be improved differently.

nl
The following is the MSE calculation formula:

MSE = 1/N∑(Y – Y’)2

O
Here,

Y stands for the actual result, Y’ for the anticipated result, and N for all the data
points combined.

ty
R Squared Score
Another well-liked statistic for assessing regression models is R squared

si
error, commonly referred to as Coefficient of Determination. We may evaluate the
performance of the model by comparing it to a fixed baseline using the R-squared
metric. We must take the data’s mean and place the line at the mean in order to choose
the constant baseline. r
ve
Regardless of whether the numbers are too large or too tiny, the R squared score
will always be less than or equal to 1.

R2 = 1 – MSE (Model)/MSE (Baseline)


ni

Adjusted R Squared
R squared error has been improved, as implied by the phrase “adjusted R
U

squared.” R square has a flaw that prevents score improvements when terms are
increased even while the model is not improving, which could lead data scientists
astray. Adjusted R squared, which always displays a lower number than R2, is used to
get around the R square problem. This is due to the fact that it modifies the values of
ity

growing predictors and only displays improvement when there is actually one.

The adjusted R squared can be calculated as follows:


m

Here,

The number of observations is n,


)A

the number of independent variables is k,

and the adjusted R2 is represented by Ra2.

3.1.3 Introduction to Naive Bayes


(c

The Bayesian classification method combines statistical classification with


supervised learning. assumes an underlying probabilistic model and enables the

Amity Directorate of Distance & Online Education


Foundations of Machine Learning 113

rigorous computation of outcome probabilities to represent uncertainty about the


Notes

e
model. Dealing with issues that call for prediction is the main objective of the Bayesian
classification. This classification offers efficient learning strategies while combining
observable data. A Bayesian classification viewpoint might be useful for comprehending

in
and assessing learning systems. It generates clear hypothesis probabilities and
reduces data input noise.

nl
Consider the general probability distribution P(x1, x2), which has two values. Using
Bayes’ rule, we may get the following equation without compromising generality:

O
We obtain the following equation if there is a third class variable, c:

When we apply the general example with two variables to a conditional

ty
independence assumption for a group of variables x1,..., xN conditional on another
variable c, we obtain the following results:

3.1.4 Naive Bayes Classifier for Mixed Data Type


r si
ve
Mixed Naive Bayes
Using Bayes’ theorem as a foundation, naive Bayes classifiers are a group of
supervised learning algorithms with strict independence presumptions between the
features given the value of the class variable (hence naive).
ni

Categorical (Multinoulli) and Gaussian naïve Bayes algorithms are implemented in


this module (hence mixed naive Bayes). This means that we may additionally assume
that features (given their corresponding y’s) follow the categorical distribution in addition
U

to the Gaussian distribution. Therefore, it makes sense that categorical data (nominal or
ordinal) be ascribed to the categorical distribution and continuous data be attributed to
the Gaussian distribution.
ity

The lack of a mixed type naive bayes implementation in scikit-learn served


as the inspiration for creating this package. One is already in the works here for
CategoricalNB.

Installation
m

via pip

pip install git+https://github.com/remykarem/mixed-naive-bayes#egg=mixed_


)A

naive_bayes

Example 1: Discrete and continuous data


An example dataset with both discrete (the first two columns) and continuous
(c

data is provided below (last 2). Indicate the characteristics’ indices that will follow the
categorical distribution (columns 0 and 1). Afterward, fit and predict as usual.

Amity Directorate of Distance & Online Education


114 Foundations of Machine Learning

from mixed_naive_bayes import MixedNB


Notes

e
X = [[0, 0, 180, 75],

[1, 1, 165, 61],

in
[2, 1, 166, 60],

[1, 1, 173, 68],

nl
[0, 2, 178, 71]]

y = [0, 0, 1, 1, 0]

O
clf = MixedNB(categorical_features=[0,1])

clf.fit(X,y)

ty
clf.predict(X)

3.1.5 Naive Bayes Classifier

si
A group of classification algorithms built on the Bayes’ Theorem are known as
naive Bayes classifiers. It is a family of algorithms rather than a single method, and they
are all based on the idea that every pair of features being classified is independent of
the other. r
ve
Because it presumes that the data distribution can be Gaussian, normal, Bernoulli,
or multinomial, the approach is known as naïve.

Naive Bayes also has the issue of requiring continuous features to be


preprocessed and discretized by binning, which might result in the loss of important
ni

information.

Conditional Probability
U
ity
m

A sample space is this rectangular orange area. It is a compilation of all potential


outcomes of an experiment, and A and B are occurrences in the circles. These are
not separate occurrences. They just overlap each other and the overlapped area.
)A

A intersection b refers to that segment, which both a and b are familiar with. The
likelihood that event A will occur provided that event B has already occurred is known
as the conditional probability of happening A given that event B has already occurred.

Simplify by assuming that the entire sample space has been reduced to the event
(c

that has already occurred whenever you are dealing with the conditional probability. As
a result, our sample space now becomes event b because we are only dealing with the
portion of the sample space where B has already occurred.
Amity Directorate of Distance & Online Education
Foundations of Machine Learning 115

Now, we understand that B has already occurred. Where in this section do


Notes

e
we find A? The probability of A given that B can be expressed as the probability of
A intersecting B divided by the probability of B since that is nothing more than the
intersection part.

in
Similarly, if we asked to estimate the probability of B given that A, which will be

nl
something like that, the likelihood of occurrence of B given that the event A has already
occurred, you might want to pause and rewind this article. However, if you’re right, then
keep reading. Now imagine that sample space has been reduced to A. By dividing the

O
likelihood of an event A and B intersecting, you may determine if B is available for an
event A.

ty
This equation results from a slight rearrangement of the denominators in these two
equations.

si
After finding a relationship between the two conditional probabilities, the Bayes

r
Theorem can be proved by simply moving the probability of B from the left to the right
inside of the equation.
ve
Congratulations, we’ve reached the Bayes theorem equation. Right now, we’re
attempting to find the Bayes theorem in its generalised form.
ni
U
ity

The orange rectangle space in the figure above represents sample space, which is
a grouping of the three occurrences a1, a2, and a3. As you can see, these events are
mutually exclusive and exhaustive as a whole. Let’s define events that cannot occur
together.
m

Mutually Exclusive Event


If the occurrence of one event prevents the occurrence of any other events in a
)A

single experiment, the cases are said to be mutually exclusive. Head and tail are
mutually exclusive in a coin tossing investigation because you can only get one of them.
Heads and tails cannot be obtained simultaneously.

Collectively Exhaustive
(c

Collectively exhaustive refers to all potential results from a random experiment. As


an illustration, if we flip a coin, there are two distinct situations and the sample space
is H, T. The sample space will be determined by adding up the odds of each of these
Amity Directorate of Distance & Online Education
116 Foundations of Machine Learning

occurrences. The total sample space will be determined by adding A1, A2, and A3. The
Notes

e
prior probability are another name for these three events, A1, A2, and A3.

in
nl
B has a characteristic with A1, A2, and A3 in the image above. If we were to

O
express it this way, the area of B that is shared with junction A1 is B. A2 intersection
B is the part of B that is shared with A2, while A3 intersection B is the part of B that is
shared with A3. The probability of B is therefore equal to the intersections of A1 and B,
as well as A2 and B, and A3.

ty
The probability of A1 intersection B and the other A2, A3, can be written in this
form.

r si
So the probability of B can be expressed as
ve
The Bayes theorem now exists in a more generalised form that can encompass n
prior probabilities. It’s time to combine everything.
ni
U
ity

We now understand the Bayes theorem’s probability formula and the underlying
equation. It’s fine, but explain how this is used in the classification issue. Let’s read
some more.
m

An result (target/class/dependent variable) and some predictors or traits


(independent variables) are both present. Each record contains a few values for the
specified class’s attributes. We wish to train the model using these predictors and
)A

related classes. We can therefore forecast the class outcome if the feature values are
provided. Now that the values of the predictor are known, the Naive Bayes classifier
method calculates a likelihood probability for each category. Additionally, we can choose
the class with the highest likelihood based on intuition.

Let’s say there are n features represented by the symbols x1, x2, x3,..., xn.
(c

Moreover, the outcome variable y has k classes, each represented by the letters C1,
C2, C3,..., Ck. The likelihood of the record or observation that originates from one of

Amity Directorate of Distance & Online Education


Foundations of Machine Learning 117

the k classes of the outcome variable y, let’s say Ck, is what we now wish to assess.
Notes

e
Now, if we enter B=Ck and A=x1x2...xn into the Bayes formula above, we may write the
aforementioned conditional probability as follows:

in
nl
Since the denominator is independent of classes in practice, we just employ the

O
numerator portion. The values of the characteristics (xi) are specified in such a way as
to effectively keep the denominator constant. The joint probability model is comparable
to the numerator. As a result, the joint model may be expressed as

ty
si
The function that assigns a class label estimated y = Ck for some k as follows is
the corresponding classifier, a Bayes classifier:

r
ve
Using a Naive Bayes classifier, our aim is to forecast whether or not a customer
would buy a product on a specific day, discount, and free delivery.

Weekdays, weekends, and holidays are examples of day characteristics. We were


ni

aware if there was a discount with free delivery on any given day. We can forecast
whether a person will purchase the product based on the available facts. A 30-row,
15-column tiny dataset is presented below as follows:
U
ity
m
)A

the dataset that includes the frequency tables for each of the three input
variables—day, discount, and free delivery.
(c

Amity Directorate of Distance & Online Education


118 Foundations of Machine Learning

Notes

e
in
nl
O
Let the preditors (discount, free delivery, and day) be an event B and the event Buy
be A for the purposes of intuition. Let’s compute the probability table for a single “day”
variable, which includes values for the weekday, weekend, and holiday.

ty
r si
ve
Determine some conditional probabilities based on the likelihood table shown
above:
ni

P(B) = P(Weekday) = 11/30 = 0.37

P(A) = P(No Buy) = 6/30 = 0.2


U

P(B | A) = P(Weekday | No Buy) = 2/6 = 0.33

P(A | B) = P(No Buy | Weekday) = P(Weekday| No Buy) * P(No Buy) / P(Weekday)

= (0.33 * 0.2) / 0.37


ity

= 0.18

The likelihood of purchasing on the weekday = 11/30 or 0.37

This indicates that 11 of the 30 customers who entered the store on weekends,
m

weekdays, and holidays will make purchases. The likelihood of making no purchase
is 6/30, or 0.2. Twenty percent of the time, people might decide not to buy anything.
regardless of the day of the week. Finally, we consider the likelihood that B (such as
)A

weekdays) will not result in a purchase.

The likelihood of not making a transaction on a weekday is 0.18, or 18 percent. The


buyer will probably purchase the product on a weekday because P(No | Weekday) has
a probability of less than 0.5.
(c

We can create the frequency tables for all three variables because we have the
frequency tables for all three predictor variables. Check see what the likelihood tables
for the three variables below look like:
Amity Directorate of Distance & Online Education
Foundations of Machine Learning 119

Notes

e
in
nl
O
ty
si
To determine if a consumer will buy a product on a given day when there is a
discount and whether delivery is available, the likelihood tables can be employed. Think
about the following elements together, where B =
r
ve
●● Day = Holiday
●● Discount = Yes
●● Free Delivery = Yes
ni

Now we find the probability of them not purchasing based on the conditions above.

A = No Purchase

From Applying the Bayes Theorem, we get P(A | B) as follow:


U
ity
m

To determine if a consumer will buy a product on a given day when there is a


)A

discount and whether delivery is available, the likelihood tables can be employed. Think
about the following elements together, where B =
(c

Amity Directorate of Distance & Online Education


120 Foundations of Machine Learning

Notes

e
in
nl
The likelihood of making a purchase is 0.986, and the likelihood of not making

O
one is 0.178. The chance of a purchase on this day is now conditional. Norming these
probabilities is the next step.

Sum of probabilities = 0.986 + 0.178 = 1.164

ty
Likelihood of purchase = 0.986 / 1.164 = 84.71 %

Likelihood of no purchase = 0.178 / 1.164 = 15.29 %

Result: 84.71 % is greater than the 29%, we can conclude that an average

si
customer will buy on holiday with a discount and free delivery.

3.1.6 Hands On: Naive Bayes on Movie Review Dataset


r
ve
The Naive Bayes algorithm makes use of the Baye’s theorem. By calculating the
frequency of specific values or value combinations in a data collection, Baye’s theorem
is a formula that determines a probability. The Bayes Theorem can be expressed as an
equation if A stands for the previous occurrences and B for the dependent event.
ni

Bayes Theorem:
U

here x if for different words in the review text, Ck is for the class label

p(Ck|x): the probability of class label given text review words x


ity

review text(x) can be represented as {x1,x2,x3, …….. ,xn}

p(Ck|x) ∝ p(Ck|x1,x2,x3, …….. ,xn}


m
)A

About IMDB movie review dataset


The text reviews in the IMDB Movie Review Dataset have a data frame with the
word “data” in them.
(c

Amity Directorate of Distance & Online Education


Foundations of Machine Learning 121

Notes

e
in
nl
O
ty
The collection includes text movie reviews with the specified positive and negative
polarities.

Learn how the naive Bayes classifier functions by using some random examples

si
(not from the dataset).

Using a few instances:


r
Consider a mock movie review and its sentiment polarity (0 = negative, 1 =
ve
positive).
ni
U
ity
m

Text Preprocessing:
To sanitise your data, use the following checklist:
)A

1. Start by taking out the HTML tags.


2. Eliminate all punctuation and a select group of special characters, such as or,., #,
etc.
3. Verify that the term is not alphanumeric and is composed entirely of English letters.
(c

4. Verify that the word’s length is more than two (as it was researched that there is no
adjective in 2-letters).

Amity Directorate of Distance & Online Education


122 Foundations of Machine Learning

5. Change the word’s case to lowercase.


Notes

e
6. Eliminate stopwords, such as (the, and, a).
We can begin utilising the clean, labelled data to train models after completing

in
these procedures and checking for additional problems.

nl
O
ty
Bag of Words Representation:
r si
Making a numerical feature vector for each document is the following step. BoW
ve
keeps track of how many times tokens appear in each collection document. It produces
a matrix having the following features:

In the entire collection of documents, the number of columns equals the number of
ni

distinct tokens (vocabulary).

The total number of documents in the collection are equal to the number of rows.
U

The frequency of each token (column) in a specific document is contained in each


cell (row).
ity

pointless
Amazing

Scenes
Loved
Movie

Music

Worst

Class
Story

Poor
Best
m

No

All

....

x1 1 1 1 1 0 0 0 0 1 0 0 1 .... 0
)A

x2 1 0 0 0 1 0 0 1 0 0 0 1 .... 1
x3 1 0 0 0 1 1 0 0 0 0 1 0 .... 1
x4 1 0 0 0 0 0 1 0 1 0 0 0 .... 0
x5 1 1 0 1 0 0 1 0 0 1 0 0 .... 0
(c

x6 1 0 0 0 0 1 0 1 0 0 1 0 .... 1
x7 0 0 0 0 0 0 0 0 0 0 1 0 .... 1
Amity Directorate of Distance & Online Education
Foundations of Machine Learning 123

The posterior probability are calculated. By consulting the tables we created


Notes

e
throughout the learning process, this is simple to accomplish.

P(class=1|text) = P(class=1)* Π(P(wi|class=1))

in
Some important points:

Laplace/Additive Smoothing

nl
Laplace smoothing, often known as additive smoothing, is a statistical technique.
An “smoothed” version of the data yields the estimator: given an observation x = (x1,...,
xd) from a multinomial distribution with N trials and parameter vectorθ = (θ1,..., θd).

O
ty
where the smoothing parameter (α = 0 means no smoothing) is the pseudo
count, which is greater than α>0. As the estimated value will fall between the empirical
estimate xi/N and the uniform probability 1/d, additive smoothing is a type of shrinkage
estimator. Though in reality a smaller value is commonly chosen, some authors have

si
suggested that should be 1 using Laplace’s law of succession (in which case the name
add-one smoothing is also used).

r
How then do we use Laplace smoothing in this situation?
ve
Consider setting the smoothing parameter to 0.1 and d to 1 (see equation above).
By adding 1 to each probability, such as P(class|text), we ensure that it will never be
zero.
ni

Log probability for numerical stability


When the probabilities are relatively tiny, using log probabilities enhances
numerical stability.
U

P(class=1 or 0|text) = P(class=1 or 0)* Π(P(wi|class=1 or 0))

log(P(class=1 or 0|text)) = log(P(class=1 or 0))+∑(log(P(wi|class=1 or 0)))

text query1: The movie’s soundtrack was the worst I’ve ever heard, and the plot
ity

was meaningless.

text preprocessed : * plot * movie * pointless * worst music *


m
)A
(c

P(class=1|text) = P(class=1)*P(plot|1)*P(movie|1)*P(pointless|1)*P(worst|1)*P(mus
Amity Directorate of Distance & Online Education
124 Foundations of Machine Learning

ic|1) =(4/7)*(0.1/4.2)*(3.1/4.2)*(0.1/4.2)*(0.1/4.2)*(1.1/4.2) =1.49097*10^(-6)


Notes

e
P(class=0|text) = P(class=0)*P(plot|0)*P(movie|0)*P(pointless|0)*P(worst|0)*P(mus
ic|0)

in
=(3/7)*(1.1/3.2)*(3.1/3.2)*(1.1/3.2)*(1.1/3.2)*(2.1/3.2) =1.10670*10^(-2)

#since probablity of P(class=0|text) is greater than probablity of P(class=1|text) for

nl
text query1 so we classify the query text as negative review.

text query2 : Loved the action parts, and the soundtrack was fantastic as well.

text preprocessed : * love * * action scenes * music * amazing *

O
ty
r si
ve
P(class=1|text) = P(class=1)*P(love|1)*P(action|1)*P(scenes|1)*P(music|1)*P(amaz
ing|1)=(4/7)*(2.1/4.2)*(2.1/4.2)*(3.1/4.2)*(1.1/4.2)*(2.1/4.2) =1.380790411*10^(-2)

P(class=0|text) = P(class=0)*P(love|0)*P(action|0)*P(scenes|0)*P(music|0)*P(amaz
ni

ing|0)

=(3/7)*(0.1/3.2)*(0.1/3.2)*(0.1/3.2)*(2.1/3.2)*(0.1/3.2) =2.6822*10^(-7)
U

#since probablity of P(class=1|text) is greater than probablity of P(class=0|text) for


text query2 so we classify the query text as positive review.

Implementing Multinomial Naive Bayes Classifier:


ity

To get the ideal value of alpha with the least amount of error, use the Multinomial
Naive Bayes classifier for several values of alpha and obtain a plot of error vs. alpha.
m
)A
(c

Amity Directorate of Distance & Online Education


Foundations of Machine Learning 125

We find that the ideal value of alpha is 6; hence, we will now carry out the following
Notes

e
actions:

Apply Multinomial Naive Bayes for alpha=6

in
nl
Utilize the Multinomial Naive Bayes classifier to predict the result.

O
Identify test and training accuracy

ty
Create a heatmap and confusion matrix. r si
ve
A table called a confusion matrix lets us see how well a classification algorithm is
performing.
ni
U
ity
m
)A

identifying the words that appear most frequently in both good and negative
evaluations.

The frequency of the most frequently used words has been determined using a
sample of both positive and negative words.
(c

Amity Directorate of Distance & Online Education


126 Foundations of Machine Learning

Notes

e
in
nl
O
ty
r si
ve
ni
U

Here, we see that the word “bad” is commonly used to describe unfavourable
assessments.
ity

Here, we see that the employment of adjectives like “excellent” typically indicates
positive ratings.

Improvements
m

Because certain words appear in numerous documents from both classes, they
do not provide pertinent information. A helpful method called term frequency-inverse
document frequency can be used to solve this issue (tf-IDF). It considers the word’s
)A

uniqueness in addition to its frequency.

Furthermore, each token in the BoW model we developed stands for a single word.
The unigram model is what we call that. Adding bigrams, in which tokens stand for pairs
of subsequent words, is another option.
(c

Scikit-learn implements TF-IDF with the TfidfVectorizer class.

Amity Directorate of Distance & Online Education


Foundations of Machine Learning 127

Notes

e
in
nl
O
ty
si
By doing this, we can raise both our test and training accuracy from 87 to 93
percent and 82.464 to 85.508 percent, respectively.

r
Naive Bayes is a straightforward but effective method for classifying texts. With
little effort, we can establish strong baselines and, based on business requirements,
ve
explore more complicated solutions.

For classifying texts, Naive Bayes is a great approach and is well accepted.
Naive Bayes basically serves as a benchmark for text categorization, allowing other
ni

algorithms’ accuracy to be measured against it.

3.1.7 K Nearest Neighbour


U

One of the most fundamental yet crucial categorization methods in machine


learning is K-Nearest Neighbors. It falls under the category of supervised learning and
has numerous applications in data mining, intrusion detection, and pattern recognition.
ity

Due to its non-parametric nature, which means that it makes no underlying


assumptions about the distribution of data, it is frequently disposable in real-world
circumstances (as opposed to other algorithms such as GMM, which assume a
Gaussian distribution of the given data).
m

One of the simplest machine learning algorithms, based on the supervised learning
method, is K-Nearest Neighbour.

The K-NN algorithm makes the assumption that the new case and the existing
)A

cases are comparable, and it places the new instance in the category that is most like
the existing categories.

A new data point is classified using the K-NN algorithm based on similarity after
all the existing data has been stored. This means that utilising the K-NN method, fresh
(c

data can be quickly and accurately sorted into a suitable category.

Amity Directorate of Distance & Online Education


128 Foundations of Machine Learning

Although the K-NN approach is most frequently employed for classification


Notes

e
problems, it can also be utilised for regression.

Since K-NN is a non-parametric technique, it makes no assumptions about the

in
underlying data.

Because it retains the dataset and executes an action on it when it comes time
to classify, this algorithm is also known as a lazy learner because it does not instantly

nl
learn from the training set.

The KNN method simply saves the information during the training phase, and when
it receives new data, it categorises it into a category that is quite similar to the new data.

O
Example: Let’s say we have a picture of a species that resembles both cats and
dogs, but we aren’t sure if it is one or the other. Therefore, since the KNN algorithm is
based on a similarity metric, we can utilise it for this identification. Our KNN model will

ty
look for similarities between the new data set’s features and those in the photos of cats
and dogs, and based on those similarities, it will classify the new data set as either cat-
or dog-related.

r si
ve
Why K-NN Algorithm?
ni

If there are two categories, Category A and Category B, and we have a new data
point, x1, which category does this data point belong in? We require a K-NN algorithm
to address this kind of issue. K-NN makes it simple to determine the category or class
U

of a given dataset. Take a look at the diagram below:


ity
m
)A

How does K-NN work?

The following algorithm can be used to describe how the K-NN works:
(c

Amity Directorate of Distance & Online Education


Foundations of Machine Learning 129

Notes

e
in
Let’s say we need to classify a new data point in order to use it. Take a look at the

nl
photo below:

O
ty
r si
ve
1. First, we’ll decide on the number of neighbours; we’ll go with k=5.
2. The Euclidean distance between the data points will then be determined. The distance
between two points, which we have already examined in geometry, is known as the
Euclidean distance. It is calculable as follows:
ni
U
ity
m

●● By calculating the Euclidean distance, we were able to determine who our closest
)A

neighbours were, with three being in group A and two being in category B. Take a
look at the photo below:
(c

Amity Directorate of Distance & Online Education


130 Foundations of Machine Learning

Notes

e
in
nl
O
ty
●● This new data point must fall within category A because, as we can see, its three
closest neighbours are also from group A.

How to select the value of K in the K-NN Algorithm?

si
The following are some things to keep in mind while choosing K’s value in the
K-NN algorithm:

◌◌ r
The ideal value for “K” cannot be determined in a specific fashion, thus we
ve
must experiment with different values to find the one that works best. K is best
represented by the number 5.
◌◌ It can be noisy and cause outlier effects in the model if K is set to an
extremely low value, such as K=1 or K=2.
ni

◌◌ Although K should have large values, there may be some issues.

Advantages:
U

◌◌ It is resilient to noisy training data, and it is easy to deploy.


◌◌ Large training data sets may increase effectiveness.

Disadvantages:
ity

◌◌ The calculation cost is considerable since it is necessary to calculate.


◌◌ The distance between the data points for each training sample, which might
be complicated at times.
m

3.1.8 Hands On: KNN


Let’s implement the KNN algorithm in Python:
)A

K-NN algorithm problem A new SUV vehicle has been produced by an automobile
manufacturer. The business wants to show adverts to people who are considering
purchasing that SUV. We therefore have a dataset for this issue that includes user
information from various social network users. There is a lot of information in the
dataset, but we will focus on the estimated salary and age as the independent
(c

variables, and the purchased variable as the dependent variable.

The dataset is shown here:

Amity Directorate of Distance & Online Education


Foundations of Machine Learning 131

Notes

e
in
nl
O
ty
si
Steps to implement the K-NN algorithm:
◌◌ Data Pre-processing step r
ve
◌◌ Fitting the K-NN algorithm to the Training set
◌◌ Predicting the test result
◌◌ Test accuracy of the result(Creation of Confusion matrix)
ni

◌◌ Visualizing the test set result.

Data Pre-Processing Step:


U

The Data Pre-processing phase will be identical to the Logistic Regression phase.
The code is listed below:

# importing libraries
ity

import numpy as nm

import matplotlib.pyplot as mtp

import pandas as pd
m

#importing datasets

data_set= pd.read_csv(‘user_data.csv’)
)A

#Extracting Independent and dependent Variable

x= data_set.iloc[:, [2,3]].values

y= data_set.iloc[:, 4].values

# Splitting the dataset into training and test set.


(c

from sklearn.model_selection import train_test_split

Amity Directorate of Distance & Online Education


132 Foundations of Machine Learning

x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.25, random_


Notes

e
state=0)

#feature Scaling

in
from sklearn.preprocessing import StandardScaler

st_x= StandardScaler()

nl
x_train= st_x.fit_transform(x_train)

x_test= st_x.transform(x_test)

O
Our dataset is well pre-processed and imported into our software by running the
aforementioned code. Our test dataset will resemble the following after feature scaling:

ty
r si
ve
ni

We can see that our data has been scaled correctly from the output image up top.
U

●● Fitting K-NN classifier to the Training data:


The K-NN classifier will now be fitted to the training set of data. We will import the
KNeighborsClassifier class from the Sklearn Neighbors library to accomplish this.
ity

We will create the class’ Classifier object after importing the class. This class’s
parameter will be
◌◌ n_neighbors: To specify the algorithm’s necessary neighbours. It typically
takes 5.
m

◌◌ metric=’minkowski’: The distance between the points is determined by this


default setting.
◌◌ p=2: It is the same as the conventional Euclidean metric.
)A

The classifier will then be fitted to the training set of data. The code is listed below:

#Fitting K-NN classifier to the training set

from sklearn.neighbors import KNeighborsClassifier


(c

classifier= KNeighborsClassifier(n_neighbors=5, metric=’minkowski’, p=2 )

classifier.fit(x_train, y_train)
Amity Directorate of Distance & Online Education
Foundations of Machine Learning 133

Output:The output of running the aforementioned code will be as follows:


Notes

e
in
nl
●● Predicting the Test Result: As with Logistic Regression, we will generate a y pred
vector to forecast the outcome of the test set. The code is listed below:
#Predicting the test set result

O
y_pred= classifier.predict(x_test)

Output:

ty
r si
ve
ni
U
ity

●● Creating the Confusion Matrix:


To check the accuracy of the classifier, we will now generate the Confusion Matrix for
our K-NN model. The code is listed below:
m

#Creating the Confusion matrix

from sklearn.metrics import confusion_matrix


)A

cm= confusion_matrix(y_test, y_pred)

The confusion_matrix function was imported and called using the variable cm in the
code above.

Output: By running the aforementioned code, we will obtain the matrix shown
(c

below:

Amity Directorate of Distance & Online Education


134 Foundations of Machine Learning

Notes

e
in
nl
O
ty
In the graphic above, we can see that there are 64 + 29 = 93 correct predictions
and 3 + 4 = 7, compared to 11 incorrect predictions in the Logistic Regression.
Therefore, we may conclude that employing the K-NN method enhances the model’s
performance.

si
●● Visualizing the Training set result:
We will now display the K-NN model’s training set results. Except for the name of
r
the graph, the code will stay the same as it was for Logistic Regression. The code is
ve
listed below:
#Visulaizing the trianing set result

from matplotlib.colors import ListedColormap


ni

x_set, y_set = x_train, y_train

x1, x2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:,


0].max() + 1, step =0.01),
U

nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))

mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).


reshape(x1.shape),
ity

alpha = 0.75, cmap = ListedColormap((‘red’,’green’ )))

mtp.xlim(x1.min(), x1.max())

mtp.ylim(x2.min(), x2.max())
m

for i, j in enumerate(nm.unique(y_set)):

mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],


)A

c = ListedColormap((‘red’, ‘green’))(i), label = j)

mtp.title(‘K-NN Algorithm (Training set)’)

mtp.xlabel(‘Age’)
(c

mtp.ylabel(‘Estimated Salary’)

mtp.legend()
Amity Directorate of Distance & Online Education
Foundations of Machine Learning 135

mtp.show()
Notes

e
Output:

in
nl
O
ty
The graph produced in the output is different from the one we saw during the

si
Logistic Regression. The following points will help you understand it:

◌◌ The graph displays the red point and the green point, as can be seen. The

r
purchased (green) and unpurchased (red) variables are represented by the
pointers.
ve
◌◌ Because it uses a K-NN technique, or finding the nearest neighbour, the graph
displays an uneven boundary rather than a straight line or curve.
◌◌ The majority of users who chose not to purchase an SUV are located in the
red zone of the graph, while those who did purchase an SUV are located in
ni

the green region.


◌◌ The graph is showing good results, but there are still some green and red
dots in the green and red regions, respectively. But this is not a major problem
U

because by doing this, Over-fitting problems with the model are avoided.
◌◌ As a result, our model is skilled.
●● Visualizing the Test set result:
ity

After the model has been trained, we will now put a fresh dataset, called the
Test dataset, in order to test the outcome. By a few small exceptions, the code is
unchanged. For instance, x_train and y_train will be replaced with x_test and y_test,
respectively.
m

Below is the code for it:

#Visualizing the test set result


)A

from matplotlib.colors import ListedColormap

x_set, y_set = x_test, y_test

x1, x2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:,


0].max() + 1, step =0.01),
(c

nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))

Amity Directorate of Distance & Online Education


136 Foundations of Machine Learning

mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).


Notes

e
reshape(x1.shape),

alpha = 0.75, cmap = ListedColormap((‘red’,’green’ )))

in
mtp.xlim(x1.min(), x1.max())

mtp.ylim(x2.min(), x2.max())

nl
for i, j in enumerate(nm.unique(y_set)):

mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],

O
c = ListedColormap((‘red’, ‘green’))(i), label = j)

mtp.title(‘K-NN algorithm(Test set)’)

mtp.xlabel(‘Age’)

ty
mtp.ylabel(‘Estimated Salary’)

mtp.legend()

si
mtp.show()

Output:
r
ve
ni
U
ity

The output for the test data set is displayed in the graph above. The graph shows
that the expected result is excellent because the majority of the red points are located
m

in the red region and the majority of the green points are located in the green zone.
Nevertheless, there aren’t many green and few red points in the green region. In the
confusion matrix, these are the inaccurate observations that we have noted (7 Incorrect
output).
)A

3.1.9 Decision Tree – I


●● A supervised learning method called a decision tree can be used to solve
classification and regression problems, but it is typically favoured for doing so.
(c

It is a tree-structured classifier, where internal nodes stand in for a dataset’s


features, branches for the decision-making process, and each leaf node for the
classification result.
Amity Directorate of Distance & Online Education
Foundations of Machine Learning 137

●● The Decision Node and Leaf Node are the two nodes of a decision tree. While
Notes

e
Leaf nodes are the results of decisions and do not have any more branches,
Decision nodes are used to create decisions and have numerous branches.

in
●● The given dataset’s features are used to execute the test or make the decisions.
●● It is a graphical depiction for obtaining all feasible answers to a choice or problem
based on predetermined conditions.

nl
●● It is known as a decision tree because, like a tree, it begins with the root node and
grows on subsequent branches to form a structure resembling a tree.

O
●● The CART algorithm, which stands for Classification and Regression Tree
algorithm, is used to construct a tree.
●● A decision tree only poses a question and divides the tree into subtrees according
to the response (Yes/No).

ty
●● The general layout of a decision tree is shown in the diagram below:

r si
ve
ni

Why use Decision Trees?


U

The most important thing to keep in mind while developing a machine learning
model is to select the optimal method for the dataset and task at hand. The two
rationales for employing the decision tree are as follows:
ity

●● The rationale behind the decision tree can be simply understood because it
displays a tree-like structure.
●● Decision Trees typically imitate human thinking capacity while making a decision,
m

thus they are simple to understand.

Decision Tree Terminologies


)A

●● Root Node: The decision tree begins at the root node. The full dataset is
represented, which is then split into two or more homogeneous sets.
●● Leaf Node: After receiving a leaf node, the tree cannot be further divided; leaf
nodes are the ultimate output nodes.
●● Splitting: In splitting, the decision node or root node is divided into sub-nodes in
(c

accordance with the specified conditions.


●● Branch/Sub Tree: The tree was split into two new trees.

Amity Directorate of Distance & Online Education


138 Foundations of Machine Learning

●● Pruning: Pruning is the procedure of removing the tree’s undesirable branches.


Notes

e
●● Parent/Child node: The parent node of the tree and the remaining nodes are
referred to as the child nodes.

in
How does the Decision Tree algorithm Work?

In a decision tree, the algorithm begins at the root node and works its way up to
forecast the class of the given dataset. This algorithm follows the branch and jumps to

nl
the following node by comparing the values of the root attribute with those of the record
(real dataset) attribute.

O
The algorithm verifies the attribute value with the other sub-nodes once again for
the following node before continuing. It keeps doing this until it reaches the tree’s leaf
node. The following algorithm can help you comprehend the entire procedure:

●● Step-1: S advises starting the tree from the root node, which has the entire

ty
dataset.
●● Step-2: Utilize Attribute Selection Measure to identify the dataset’s top attribute
(ASM).

si
●● Step-3: Subsets of the S that include potential values for the best qualities should
be created.
●● r
Step-4: Create the best attribute-containing decision tree node.
ve
●● Step-5: Using the subsets of the dataset generated in step 3, repeatedly design
new decision trees. Continue along this path until you reach a point when you can
no longer categorise the nodes and you refer to the last node as a leaf node.
ni

Example
Imagine an applicant who has received a job offer and is debating whether to
accept it or not. Therefore, the decision tree begins at the root node to tackle this issue
U

(Salary attribute by ASM). Based on the corresponding labels, the root node further
divides into the next decision node (distance from the office) and one leaf node. The
following decision node is further divided into a leaf node and a decision node (Cab
facility). The decision node finally separates into two leaf nodes (Accepted offers and
ity

Declined offer). Take a look at the diagram below:


m
)A
(c

Amity Directorate of Distance & Online Education


Foundations of Machine Learning 139

3.1.10 Decision Tree - II


Notes

e
Attribute Selection Measures

in
The fundamental problem that emerges while developing a decision tree is how
to choose the best attribute for the root node and for sub-nodes. As a result, a method
known as attribute selection measure, or ASM, can be used to tackle these issues. We
can choose the ideal attribute for the tree’s nodes with ease using this measurement.

nl
There are two widely used ASM approaches, which are as follows:

●● Information Gain

O
●● Gini Index
1. Information Gain:
◌◌ Information gain is the measurement of the change in entropy after a dataset is segmented on an attribute.
◌◌ It figures out how much knowledge a feature gives us about a class.
◌◌ We divide the node and build the decision tree based on the value of the information gain.
◌◌ A node or attribute with the largest information gain is split first in a decision
tree algorithm, which always seeks to maximise the value of information gain.
The following formula can be used to compute it:
ve
If S is a set of instances and A is an attribute, then

Gain(S, A) = Entropy(S) - Σv ∈ Values(A) (|Sv| / |S|) × Entropy(Sv)

where Values(A) is the set of all possible values of A and Sv is the subset of S with A = v.

Entropy: Entropy, which describes the impurity of a random sample of examples, is a measure of a random variable's uncertainty. The higher the entropy, the more information content there is.

If S is a set of instances and pi is the proportion of instances in S that belong to class i, then

Entropy(S) = - Σi pi log2(pi)

Example:
For the set X = {a,a,a,b,b,b,b,b}

Total instances: 8

Instances of b: 5

Instances of a: 3
Entropy(X) = -[(3/8) log2(3/8) + (5/8) log2(5/8)]

= -[0.375 * (-1.415) + 0.625 * (-0.678)]

= -(-0.531 - 0.424)

= 0.954
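
This arithmetic can be checked in a couple of lines of Python (a quick verification sketch, not part of the worked example itself):

# Entropy of X = {a,a,a,b,b,b,b,b}
import math

X = list('aaabbbbb')
entropy = -sum((X.count(s) / len(X)) * math.log2(X.count(s) / len(X))
               for s in set(X))
print(round(entropy, 3))   # 0.954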

Building Decision Tree using Information Gain


The essentials:
◌◌ Begin with all training instances connected to the root node.
◌◌ Choose an attribute to label each node with using info gain.

◌◌ It should be noted that no root-to-leaf path should have the same discrete
property twice.
◌◌ Build each subtree recursively on the subset of training cases that would be

categorised down that path in the tree.

The border cases:

◌◌ If all remaining training instances are positive or all are negative, label that node “yes” or “no” accordingly.
◌◌ If no attributes are left, label with a majority vote of the remaining training

examples at that node.
◌◌ If no examples remain, label with a majority vote of the parent node’s training instances.

Example:
Now, utilising information gain, create a Decision Tree with the following data.

Training set: 3 features and 2 classes

We have three features and two output classes here. To construct a decision tree using information gain, we take each feature and compute the information gain for each one.

Split on feature X
Split on feature Y

Split on feature Z
We can see from the figures above that the information gain is greatest when we split on feature Y. As a result, the best-suited feature for the root node is feature Y.

We can now observe that when we split the dataset by feature Y, each child node contains a pure subset of the target variable. As a result, we don’t need to split the dataset any further.

The completed tree for the aforementioned dataset would look like this:

2. Gini Index

◌◌ The Gini Index is a metric that measures how frequently a randomly selected element would be incorrectly classified.
◌◌ This indicates that an attribute with a lower Gini index should be preferred.
◌◌ Sklearn supports the “gini” criterion for the Gini Index and uses it by default.

◌◌ The formula for calculating the Gini Index is provided below:

Gini Index = 1 - Σj (pj)²
Example:
Consider the dataset shown below and create a decision tree using the Gini index.


There are 5 attributes in the dataset above, with attribute E being the predictive
feature and including two (Positive & Negative) classes. We have an equal proportion of
each class.

To compute the Gini Index for each attribute, we must select some split values. These are the values for this dataset:

Calculating Gini Index for Var A:


Value >= 5: 12

Attribute A >= 5 & class = positive:5/12



Attribute A >= 5 & class = negative:7/12


Gini(5,7) = 1 - [(5/12)² + (7/12)²]
= 0.4860

Value < 5: 4

Attribute A < 5 & class = positive:3/4

Attribute A < 5 & class = negative:1/4

Gini(3,1) = 1 - [(3/4)² + (1/4)²]

= 0.375

By weighting and summing each of the Gini indices:

Gini(Target, A) = (12/16) * 0.486 + (4/16) * 0.375

= 0.45825

Calculating Gini Index for Var B:

Value >= 3: 12

Attribute B >= 3 & class = positive:8/12

Attribute B >= 3 & class = negative: 4/12


Gini(8,4) = 1 - [(8/12)² + (4/12)²]

= 0.4444

Value < 3: 4

Attribute B < 3 & class = positive: 0/4

Attribute B < 3 & class = negative: 4/4



Gini(0,4) = 1 - [(0/4)² + (4/4)²]

= 0

By weighting and summing each of the Gini indices:



Gini(Target, B) = (12/16) * 0.4444 + (4/16) * 0

= 0.3333

We can calculate the Gini index for the C and D attributes using the same method.
              Positive   Negative
For A | >= 5.0     5         7
      | < 5.0      3         1

Gini Index of A = 0.45825

              Positive   Negative
For B | >= 3.0     8         4
      | < 3.0      0         4

Gini Index of B = 0.3333

              Positive   Negative
For C | >= 4.2     0         6
      | < 4.2      8         2

Gini Index of C = 0.2

              Positive   Negative
For D | >= 1.4     0         5
      | < 1.4      8         3

Gini Index of D = 0.273
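
The same arithmetic can be reproduced in Python. The helper below is an illustrative sketch; the (positive, negative) counts are taken from the tables above, and small differences from the hand computation come from intermediate rounding there:

# Weighted Gini index of a binary split
def gini(pos, neg):
    total = pos + neg
    return 1 - (pos / total) ** 2 - (neg / total) ** 2

def weighted_gini(branches):
    # branches: list of (pos, neg) pairs, one per side of the split
    n = sum(p + q for p, q in branches)
    return sum(((p + q) / n) * gini(p, q) for p, q in branches)

print(round(weighted_gini([(5, 7), (3, 1)]), 4))   # attribute A: ~0.4583
print(round(weighted_gini([(8, 4), (0, 4)]), 4))   # attribute B: ~0.3333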


The most notable types of decision tree algorithms are:

1. Iterative Dichotomiser 3 (ID3): This algorithm makes use of Information Gain



to determine which attribute will be used to classify the current subset of data.
Recursively, information gain is calculated for the remaining data at each level of the
tree.
2. C4.5: This algorithm is the ID3 algorithm’s successor. To determine the classifying

property, this technique uses either Information Gain or Gain Ratio. Because it can
handle both continuous and missing attribute data, it is a direct improvement over
the ID3 method.
3. Classification and Regression Tree (CART): It is a dynamic learning technique that can generate both a regression tree and a classification tree based on the dependent variable.

Over-fitting
Over-fitting is a real issue when designing a decision tree model. Over-fitting occurs when the algorithm keeps growing the tree deeper and deeper to lower the training set error, but this results in an increased test set error, i.e., the prediction accuracy of our model decreases. It usually happens when there are a lot of outliers and anomalies in the data.

To avoid Over-fitting, we can take two approaches:


●● Pre-Pruning
●● Post-Pruning

Pre-Pruning
Pre-pruning halts tree construction a little early. It is preferable not to split a node
if its goodness measure falls below a certain threshold. However, deciding on a suitable stopping point is tricky.

Post-Pruning

Post-pruning begins by going deeper and deeper into the tree in order to produce
a whole tree. If the tree has an Over-fitting problem, pruning is performed as a post-
pruning phase. To test the effectiveness of our pruning, we employ cross-validation data, which determines whether expanding a node would result in an improvement or not.

If there is an improvement, we can proceed by expanding that node. However, if it shows a decrease in accuracy, the node should not be expanded and should instead be converted to a leaf node. The sketch below shows how both approaches look in scikit-learn.
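
In scikit-learn terms, pre-pruning corresponds to growth limits such as max_depth or min_samples_split, while post-pruning is available through cost-complexity pruning (ccp_alpha). The snippet below is a sketch of both; it assumes a feature matrix x_train and labels y_train are already defined, as in the hands-on section that follows:

from sklearn.tree import DecisionTreeClassifier

# Pre-pruning: stop tree growth early with depth / sample-count limits
pre_pruned = DecisionTreeClassifier(max_depth=3, min_samples_split=10)
pre_pruned.fit(x_train, y_train)

# Post-pruning: compute the cost-complexity path, then refit with an alpha;
# larger ccp_alpha values prune the tree back more aggressively
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(x_train, y_train)
post_pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=path.ccp_alphas[-2])
post_pruned.fit(x_train, y_train)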

Decision Tree Algorithm Advantages and Disadvantages

Advantages:
◌◌ Decision trees are simple to understand and yield a set of rules.
◌◌ They take the same approach as humans do when making decisions.
◌◌ Visualizations can help simplify the interpretation of a complex Decision Tree model; even a child can comprehend the reasoning.


◌◌ The number of hyper-parameters that must be tweaked is nearly zero.

Disadvantages:
◌◌ Decision Tree has a significant likelihood of Over-fitting.
◌◌ When compared to other machine learning methods, it has low prediction accuracy for a dataset.


◌◌ Information gain in a decision tree with categorical variables results in a
biassed response for qualities with more categories.
◌◌ When there are several class labels, calculations can grow complicated.
m

3.1.11 Hands On: Decision Tree

Python Implementation of Decision Tree


)A

We will now use Python to implement the Decision tree. We will use the dataset
“user data.csv” for this.

Steps:

◌◌ Data Pre-processing step


◌◌ Fitting a Decision-Tree algorithm to the Training set


◌◌ Predicting the test result
◌◌ Test accuracy of the result (creation of the confusion matrix)

◌◌ Visualizing the test set result.

Data Pre-Processing Step:

The code for the pre-processing stage is as follows:

# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd

# importing the dataset
data_set = pd.read_csv('user_data.csv')

# extracting independent and dependent variables
x = data_set.iloc[:, [2, 3]].values
y = data_set.iloc[:, 4].values

# splitting the dataset into training and test set
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=0)

# feature scaling
from sklearn.preprocessing import StandardScaler
st_x = StandardScaler()
x_train = st_x.fit_transform(x_train)
x_test = st_x.transform(x_test)

We have pre-processed the data in the code above, where the dataset was loaded, as shown below:
)A
(c

Amity Directorate of Distance & Online Education


Foundations of Machine Learning 147

Notes

e
in
nl
O
ty
si
Fitting a Decision-Tree algorithm to the Training set
The model will now be fitted to the training set. We’ll import the DecisionTreeClassifier class needed for this from the sklearn.tree library. The code is listed below:
# Fitting Decision Tree classifier to the training set
from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier(criterion='entropy', random_state=0)
classifier.fit(x_train, y_train)

We constructed a classifier object in the code above and passed it two main parameters:

◌◌ criterion='entropy': the quality of the split is evaluated using this criterion, where the measure is the information gain given by entropy.
◌◌ random_state=0: to generate the random states.


Below is the output for this:

Out[8]:
DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False,
            random_state=0, splitter='best')

Predicting the test result


We shall now forecast the outcome of the test set. A new prediction vector called
y_pred will be made. The code is listed below:

# Predicting the test set result
y_pred = classifier.predict(x_test)

Output:
The actual test output and the predicted output are shown in the output image below. It is obvious that the prediction vector has certain values that differ from the values of the actual vector; these are incorrect predictions.


Test accuracy of the result (Creation of Confusion matrix)

Since some wrong predictions were made, as seen in the output above, we must

utilise the confusion matrix to determine the total number of correct and incorrect predictions. The code is listed below:

# Creating the confusion matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)

Output:

The confusion matrix, which shows 6+3=9 incorrect predictions and 62+29=91 correct predictions, is displayed in the output image above. As a result, we may conclude that the Decision Tree classifier produced a good prediction compared to other classification models.

in
Visualizing the training set result:

nl
We’ll display the training set outcome here. We will draw a graph for the decision
tree classifier in order to visualise the training set result. For consumers who have
either purchased or not purchased the SUV car, the classifier will predict yes or no. The
code is listed below:

# Visualizing the training set result
from matplotlib.colors import ListedColormap
x_set, y_set = x_train, y_train
x1, x2 = nm.meshgrid(nm.arange(start=x_set[:, 0].min() - 1, stop=x_set[:, 0].max() + 1, step=0.01),
                     nm.arange(start=x_set[:, 1].min() - 1, stop=x_set[:, 1].max() + 1, step=0.01))
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha=0.75, cmap=ListedColormap(('purple', 'green')))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                c=ListedColormap(('purple', 'green'))(i), label=j)
mtp.title('Decision Tree Algorithm (Training set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()

Output:

The output presented above differs significantly from the remaining classification models. The dataset is divided along both vertical and horizontal lines according to the age and estimated salary variables.

As we can see, the tree is over-fitting because it is attempting to capture every data point.

Visualizing the test set result:

The training set will be replaced by the test set in the visualisation of the test set
results, which will look quite similar to the training set visualisation.

# Visualizing the test set result
from matplotlib.colors import ListedColormap
x_set, y_set = x_test, y_test
x1, x2 = nm.meshgrid(nm.arange(start=x_set[:, 0].min() - 1, stop=x_set[:, 0].max() + 1, step=0.01),
                     nm.arange(start=x_set[:, 1].min() - 1, stop=x_set[:, 1].max() + 1, step=0.01))
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha=0.75, cmap=ListedColormap(('purple', 'green')))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                c=ListedColormap(('purple', 'green'))(i), label=j)
mtp.title('Decision Tree Algorithm (Test set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()


Output:
There are some green data points within the purple region of the above graphic, and vice versa. These are the incorrect predictions that we described in the confusion matrix.

3.1.12 Introduction to Support Vector Machines
Support vector machines (SVMs) are powerful yet flexible supervised machine learning algorithms used for both regression and classification, although they are typically employed for classification problems. SVMs were first introduced in the 1960s and later improved in the 1990s. Compared to other machine learning algorithms, SVMs are implemented in a different way, and they have recently become very popular because they can handle numerous continuous and categorical variables.
numerous continuous and categorical variables, they have recently become very
popular.
ni

A supervised machine learning approach called Support Vector Machine (SVM)


is used for both classification and regression. Although we also refer to regression
concerns, categorization is the most appropriate term. Finding a hyperplane in an
U

N-dimensional space that clearly classifies the data points is the goal of the SVM
method. The number of features determines the hyperplane’s size. The hyperplane is
essentially a line if there are just two input features. The hyperplane turns into a 2-D
plane if there are three input features. When there are more than three features, it gets
ity

harder to imagine.

Hyperplane and Support Vectors in the SVM algorithm:


Hyperplane:In n-dimensional space, there may be several lines or decision
m

boundaries used to separate the classes, but we must identify the optimum decision
boundary that best aids in classifying the data points. The hyperplane of SVM is a name
for this ideal boundary.
)A

The dataset’s features determine the hyperplane’s dimensions, therefore if there


are just two features (as in the example image), the hyperplane will be a straight line.
Additionally, if there are three features, the hyperplane will only have two dimensions.

The largest margin, or the space between the data points, is always used to form a
(c

hyperplane.

Amity Directorate of Distance & Online Education


152 Foundations of Machine Learning

Support Vectors:
Notes

e
Support vectors are the data points or vectors that are closest to the hyperplane
and have the greatest influence on where the hyperplane is located. These vectors are

in
called support vectors because they support the hyperplane.

Let’s look at two independent variables (x1, x2) and one dependent variable (either
a blue circle or a red circle), which is the dependent variable.

nl
O
ty
si
It is clear from the above graphic that there are numerous lines (our hyperplane here is a line because we are only considering two input features, x1 and x2) that separate our data points or classify the red and blue circles. So how do we pick the best line, or optimal hyperplane, to divide our data points?

Selecting the best hyper-plane:


The hyperplane that shows the greatest gap, or margin, between the two classes is a logical choice for the best hyperplane.

Therefore, we select the hyperplane whose distance to the nearest data point on either side is maximised. Such a hyperplane, if it exists, is known as the maximum-margin hyperplane, or hard margin. We therefore select L2 in the given diagram.

Let's think about the situation depicted below:


One blue ball now lies within the red balls' region. How does SVM classify the data then? It's easy: the blue ball on the red side of the boundary is an outlier among the blue balls. The SVM algorithm can ignore such outliers in order to find the optimal hyperplane that maximises the margin, so SVM is robust to outliers.

SVM finds the maximum margin for this type of data as it did for the earlier data sets, and also applies a penalty each time a point crosses the margin. In such cases, the margins are referred to as soft margins. When the data set has a soft margin, the SVM attempts to minimise (1/margin + penalty). A common penalty is the hinge loss: there is no hinge loss if there are no violations, and if violations occur, the loss is proportional to the distance of the violation.

there will be a loss based on the violation’s length.

Up until this point, we have only discussed data that can be separated along a
ity

straight line (a linear line), such as the group of blue and red balls. What should one do
if data cannot be separated linearly?
m
)A

Let's say the figure above represents our data. SVM resolves this by employing a kernel to create a new variable: we define a new variable yi as a function of the distance from the origin o. If we plot this, we obtain the result depicted below.

In this instance, the distance from the origin is used to construct the new variable y.
Kernel is the name for a non-linear function that produces a new variable.

SVM Kernel:

The SVM kernel is a function that converts non-separable problems into separable problems by taking a low-dimensional input space and transforming it into a higher-dimensional space. It is most useful in non-linear separation problems. Simply put, the kernel performs some extremely complex data transformations and then determines how to separate the data based on the defined labels or outputs.

Advantages of SVM:
●● Effective in high-dimensional cases.
●● Its memory efficiency comes from the decision function's use of support vectors, a subset of the training points.
●● Various kernel functions can be supplied for the decision function, including custom kernels.

3.1.13 SVM for Linearly Separable Problems



Linear SVM: Linear SVM is used for linearly separable data, which is defined as
data that can be divided into two classes using just one straight line. The classifier used
for such data is called the Linear SVM classifier.

Let’s begin with a straightforward two-class problem where the data is obviously
separable along a linear path, as indicated in the image below.

Let (Xi, yi) represent the i-th data point, where Xi is the feature vector and yi is the associated class label, taking one of two possible values: +1 or -1. Say the red balls in the diagram above have class label +1 and the blue balls have class label -1. All of the members of classes +1 and -1 can be separated from one another by drawing a straight line; the two-dimensional data above are clearly linearly separable.
from one another by drawing a straight line. The aforementioned two-dimensional data
are obviously linearly separable.

nl
The blue balls and red balls can actually be separated by an unlimited number of
straight lines.

The challenge is determining which of the infinite straight lines is ideal in the sense

O
that it should have the least amount of classification error when applied to a fresh
observation. Based on the training sample, the straight line is anticipated to properly
identify one or more test samples.

ty
Consider the black, red, and green lines in the diagram above as an example. Is
one of them superior to the other two? Or are they all equally suited to classification?
What does this mean by optimality? It makes intuitive sense that a line will be more
sensitive to slight changes in one or more of the points if it passes too close to any of

si
the points. A red ball is not far from the green line. A blue ball is not far from the red
line. The red ball might land on the opposite side of the green line if it slightly shifts its
location. The blue ball could also be misclassified if it slightly shifts in place. The red
r
and green lines are equally sensitive to slight variations in the observations. Contrarily,
ve
the black line is less sensitive and more resistant to model variance.

A flat subspace of dimension n – 1 is known as a hyperplane in an n-dimensional


space. For instance, as illustrated in the diagram, a straight line in two dimensions is
a one-dimensional hyperplane. A hyperplane is a flat, two-dimensional subspace, or a
ni

plane, in three dimensions. In n dimensions, a separating hyperplane is mathematically


defined as a linear combination of all dimensions equal to 0.
U


In n dimensions, a separating hyperplane can be written as

θ0 + θ1x1 + θ2x2 + … + θnxn = 0


A bias is a common name for the scalar θ0. In the event that θ0 = 0, the hyperplane passes through the origin.

A hyperplane acts as a divider: the points that lie on opposite sides of the hyperplane form two distinct groups.

The fundamental goal of support vector machines is to identify the best hyperplane for patterns that can be separated linearly. A natural choice is the separating hyperplane that is farthest from the observations, known as the optimal margin hyperplane (or optimal separating hyperplane). The perpendicular distance of each observation from a given separating hyperplane is calculated; the smallest of all those distances measures how close the hyperplane is to the set of observations. This minimal separation is called the margin. The SVM algorithm works by locating the hyperplane that gives the training instances the largest minimum distance, i.e., the maximum margin. This is called the maximal margin classifier.

ty
A separating hyperplane in two dimensions is written as

θ0+ θ1x1 + θ2x2 = 0

si
Any point that lies above the hyperplane therefore satisfies

θ0 + θ1x1 + θ2x2 > 0

and any point that lies below the hyperplane satisfies

θ0 + θ1x1 + θ2x2 < 0

By rescaling the coefficients (weights) θ1 and θ2, the margin's bounds can be expressed as

H1: θ0 + θ1x1 + θ2x2 = +1
H2: θ0 + θ1x1 + θ2x2 = -1

This ensures that any observation that falls on or above H1 is considered to belong to class +1, while any observation that falls on or below H2 belongs to class -1. Equivalently, we can write

yi(θ0 + θ1x1i + θ2x2i) ≥ 1 for every observation
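
As a quick numerical illustration of this constraint (a sketch: the hyperplane parameters θ and the two labelled points below are invented for the example):

import numpy as np

theta0, theta = -3.0, np.array([1.0, 2.0])    # hypothetical hyperplane parameters
points = np.array([[3.0, 1.0], [0.0, 1.0]])   # one point from each class
labels = np.array([+1, -1])

margins = labels * (theta0 + points @ theta)
print(margins, (margins >= 1).all())          # both satisfy yi(θ0 + θ·xi) >= 1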

H1 and H2, which define the margin bounds, are also hyperplanes. The training points that lie exactly on the margin's edges are known as the support vectors because they support the maximal margin hyperplane: if any of these points is moved, the maximal margin hyperplane moves as well.

any of these points are moved, the maximal margin hyperplane will also move.

Notably, only these support vectors are the direct dependencies of the maximal
margin hyperplane.
)A

The maximal margin hyperplane does not alter if any of the other points shift
until the movement has an impact on the boundary conditions or the support vectors.
The support vectors are the most challenging to categorise and provide the greatest
insight into classification. The training set’s most important or crucial data points are the
(c

support vectors since they are on or close to the decision border.

The defining equation for a general n-dimensional feature space is

θ0 + θ1x1 + θ2x2 + … + θnxn = 0

If the weight vector is denoted by θ and ||θ|| is this vector's norm, it is simple to show that the maximal margin is 2/||θ||. Finding the maximal margin hyperplane and the support vectors is a convex quadratic optimization problem. It is crucial to remember that the complexity of an SVM is determined by the number of support vectors, not by the dimension of the feature space.
number of support vectors, not the dimension of the feature space, determines how

si
complex an SVM is.

Because of this, SVM has a relatively lower propensity to overfit. The same
separation hyperplane would be discovered if all data points from the training data set
r
aside from the support vectors were eliminated and the training method was redone.
ve
The SVM classifier’s anticipated error rate, which is independent of data dimensionality,
has an upper bound determined by the quantity of support vectors. Even when the data
is highly dimensional, an SVM with a modest number of support vectors offers good
generalisation.
ni

3.1.14 SVM for Non-Linearly Separable Problems


Non-linear SVM: When a dataset cannot be separated using a straight line, it is said to be non-linear, and the classifier used for such data is known as a non-linear SVM classifier.
ity
m
)A

We must therefore add another dimension to separate these data points. For linear data we used the two dimensions x and y, so for non-linear data we will add a third dimension z, calculated as follows:

z = x² + y²
The sample space will change to look like the following image by adding the third
dimension:


The datasets will now be split into the following classes by SVM. Take a look at the image below:

Since we are in three-dimensional space, it appears to be a plane parallel to the x-axis. If we convert it back to 2-D space with z = 1, it becomes as follows:

space. It will become as follows if we convert it to 2D space with z=1:
U
ity
m
)A

Therefore, when dealing with non-linear data, we obtain a circle of radius 1 as the decision boundary. A short sketch of this transformation appears below.
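
The z = x² + y² lifting can be reproduced directly, and in practice the same effect is obtained implicitly by giving SVC a non-linear kernel such as 'rbf'. The sketch below uses a made-up ring-shaped toy dataset from sklearn.datasets.make_circles:

import numpy as nm
from sklearn.svm import SVC
from sklearn.datasets import make_circles

# Ring-shaped data: not linearly separable in (x, y)
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# Explicit transform: add z = x^2 + y^2 as a third dimension
z = (X[:, 0] ** 2 + X[:, 1] ** 2).reshape(-1, 1)
X3 = nm.hstack([X, z])
print(SVC(kernel='linear').fit(X3, y).score(X3, y))   # separable once z is added

# Equivalent in practice: let a non-linear kernel do the lifting implicitly
print(SVC(kernel='rbf').fit(X, y).score(X, y))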

3.1.15 Hands On: Support Vector Machines



Python will be used to implement the SVM algorithm. The user_data dataset will be used in this instance.


Data Pre-processing step: Below is the code:


# Data Pre-processing Step
# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd

# importing the dataset
data_set = pd.read_csv('user_data.csv')

# extracting independent and dependent variables
x = data_set.iloc[:, [2, 3]].values
y = data_set.iloc[:, 4].values

# splitting the dataset into training and test set
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=0)

# feature scaling
from sklearn.preprocessing import StandardScaler
st_x = StandardScaler()
x_train = st_x.fit_transform(x_train)
x_test = st_x.transform(x_test)
U

Running the code above pre-processes the data. The code will output the dataset:
Fitting the SVM classifier to the training set:


The SVM classifier will now be fitted to the training set. To build the SVM classifier, we will import the SVC class from the sklearn.svm package. The code is listed below:

from sklearn.svm import SVC  # "Support vector classifier"
classifier = SVC(kernel='linear', random_state=0)
classifier.fit(x_train, y_train)

In the above code, we have used kernel='linear', as we are creating an SVM for linearly separable data (we can change it for non-linear data). We then fitted the classifier to the training dataset (x_train, y_train).

Output:
Out[8]:

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
    kernel='linear', max_iter=-1, probability=False, random_state=0,
    shrinking=True, tol=0.001, verbose=False)
ve
By adjusting the values of C (Regularization factor), gamma, and kernel, the model
performance can be changed.
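
For example (a sketch reusing the variables above): C, gamma, and the kernel are passed when constructing SVC, and for the linear classifier the learned weights in coef_ give the margin width 2/||θ|| directly:

import numpy as nm

# Different regularisation and kernel are chosen at construction time
tuned = SVC(C=10.0, kernel='rbf', gamma=0.5, random_state=0)
tuned.fit(x_train, y_train)

# For the linear classifier fitted earlier, the margin width is 2 / ||theta||
w = classifier.coef_[0]
print(2 / nm.linalg.norm(w))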

Predicting the test set result:



We will now predict the results for the test set by creating a new prediction vector, y_pred. The code is listed below:
U

# Predicting the test set result
y_pred = classifier.predict(x_test)

Once we have the y_pred vector, we can compare y_pred with y_test to see how much the actual and predicted values differ.

Output: The prediction output for the test set is shown below:

Creating the confusion matrix:

Now, we'll examine the SVM classifier's performance to determine how many incorrect predictions it makes compared to the logistic regression classifier. To generate the confusion matrix, we must import the confusion_matrix function from the sklearn package and invoke it, storing the result in a new variable called cm. The function requires two parameters, mainly y_true (the actual values) and y_pred (the target values returned by the classifier). The code is listed below:

# Creating the confusion matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)

Output:
There are 66+24=90 correct predictions and 8+2=10 incorrect predictions, as can be seen in the output image above. Thus, we can conclude that our SVM model outperformed the Logistic Regression model.

Visualizing the training set result:

We will now visualise the training set result. The code for it is below.

from matplotlib.colors import ListedColormap
x_set, y_set = x_train, y_train
x1, x2 = nm.meshgrid(nm.arange(start=x_set[:, 0].min() - 1, stop=x_set[:, 0].max() + 1, step=0.01),
                     nm.arange(start=x_set[:, 1].min() - 1, stop=x_set[:, 1].max() + 1, step=0.01))
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha=0.75, cmap=ListedColormap(('red', 'green')))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                c=ListedColormap(('red', 'green'))(i), label=j)
mtp.title('SVM classifier (Training set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()
mtp.show()

Output:

As we can see, the above output resembles the logistic regression output. Because we used a linear kernel in the classifier, the hyperplane appeared as a straight line in the output; as mentioned earlier, for two dimensions the hyperplane in SVM is a straight line.

Visualizing the test set result:

# Visualizing the test set result
from matplotlib.colors import ListedColormap
x_set, y_set = x_test, y_test
x1, x2 = nm.meshgrid(nm.arange(start=x_set[:, 0].min() - 1, stop=x_set[:, 0].max() + 1, step=0.01),
                     nm.arange(start=x_set[:, 1].min() - 1, stop=x_set[:, 1].max() + 1, step=0.01))
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha=0.75, cmap=ListedColormap(('red', 'green')))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                c=ListedColormap(('red', 'green'))(i), label=j)
mtp.title('SVM classifier (Test set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()
ity

Output:
m
)A

The SVM classifier separated the users into two regions (Purchased or Not purchased), as shown in the output image above. The red region with red scatter points represents users who bought the SUV, and the green region with green scatter points represents users who did not buy the SUV. The hyperplane has separated the two classes into purchased and not-purchased.

3.1.16 Introduction to Random Forest Classifier

in
Random forest is a supervised machine learning technique popular for classification and regression problems. It constructs decision trees on various samples and uses their majority vote for classification and their average for regression.

One of the Random Forest algorithm's most important qualities is its ability to handle data sets with both continuous variables (as in regression) and categorical variables (as in classification). It produces superior results for classification problems, as the short sketch below illustrates.
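
As a quick illustration, a random forest can be fitted in a few lines with scikit-learn. This is a sketch; it assumes the x_train/x_test split prepared in the earlier hands-on sections:

from sklearn.ensemble import RandomForestClassifier

# 100 trees, each grown on a bootstrap sample; prediction is by majority vote
forest = RandomForestClassifier(n_estimators=100, criterion='entropy', random_state=0)
forest.fit(x_train, y_train)
print(forest.score(x_test, y_test))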

Let's understand this using an example:

After finishing his 10+2, student X wants to enrol in a course but is unsure which
one to take given his skill set. So he makes the decision to talk to a variety of people,
including his cousins, teachers, parents, degree students, and workers. He asks
them a range of questions, such as why he should choose it, the employment prospects with that course, the tuition fee, and so on. He ultimately chooses to take the course that most people
si
that course, tuition, etc. He ultimately chooses to take the course that most people
recommend after conversing with a variety of individuals about it.

r
Working of Random Forest Algorithm


To understand how the random forest works, we must first examine the ensemble technique. Ensemble simply means combining multiple models; thus, a collection of models rather than a single model is used to make predictions.

Two different strategies are used by Ensemble:


ity

1. Bagging: It creates different training subsets from the sample training data with replacement, and the final result is based on majority voting. Random Forest is an example.
2. Boosting: It turns weak learners into strong learners by building sequential models, with the final model having the highest accuracy. AdaBoost and XGBoost are two examples.



As already explained, Random Forest operates on the Bagging principle. Let's dig in and learn more about bagging now.

Bagging

The ensemble method employed by random forest is bagging, also referred to as Bootstrap Aggregation. In bagging, random samples are chosen from the data set, so each model is created from samples (bootstrap samples) drawn from the original data with replacement, a process known as row sampling. This stage of row sampling with replacement is called bootstrap. Each model is then trained independently and produces its own result. The final decision is made by combining the outputs of all models based on a majority vote; combining all the results and producing a result by majority vote is called aggregation.

ty
r si
ve
Let's examine an example with the aid of the figure below. The bootstrap samples (Bootstrap samples 01, 02, and 03) are drawn from the original data with replacement, so there is a significant chance that the samples contain overlapping data. The models (Models 01, 02, and 03) built from these bootstrap samples are trained independently, and each model produces its own result. When compared with the sad emoji, the happy emoji is in the majority; thus, by majority voting, the happy emoji is the final result. The toy sketch below mimics these mechanics.
each model are displayed. Now, when compared to sad emoji, the happy emoji is in the
majority. Thus, the Happy emoji is the ultimate result based on majority voting.
ity
m
)A
(c

Amity Directorate of Distance & Online Education


166 Foundations of Machine Learning

Steps involved in random forest algorithm:


Step 1: In Random Forest, n records are selected at random from a data set containing k records.

Step 2: A separate decision tree is built for each sample.

Step 3: Each decision tree produces an output.

Step 4: The final result is based on majority voting (for classification) or averaging (for regression).

O
ty
r si
ve
For example, take the data from the fruit basket in the figure below. Now, n samples are drawn from the fruit basket, and each sample is used to build a separate decision tree. As seen in the illustration, each decision tree produces an output, and the result is decided by majority voting. As you can see in the figure below, the majority of the decision trees produce an apple rather than a banana as the output, so an apple is chosen as the final result.
U
Important Features of Random Forest


1. Diversity: Each tree is unique, thus not all qualities, variables, or features are taken
(c

into account when creating it.

Amity Directorate of Distance & Online Education


Foundations of Machine Learning 167

2. Immune to the curse of dimensionality: Since no tree takes every feature into account, the feature space is smaller.
3. Parallelization: Each tree is generated independently from different data and attributes. This means we can make full use of the CPU to build random forests.
4. Train-test split: In a random forest, there will always be about 30% of the data that a decision tree does not see, so we don't need to set aside data for train and test.
5. Stability: Stability results because the outcome is based on majority voting or averaging.

Advantages and Disadvantages of Random Forest Algorithm

Advantages
1. It can solve both classification and regression problems.
2. It mitigates the over-fitting problem, because the output is based on majority voting or averaging.
3. It works well even when the data has missing or null values.
4. Each decision tree constructed is independent of the others, which allows parallelization.
5. It is extremely stable, because the results are averaged over a large number of trees.
6. It maintains diversity, because not all attributes are considered when creating each decision tree (although this is not always the case).
7. It is resistant to the curse of dimensionality: the feature space is smaller because each tree does not consider all of the attributes.
8. Since about 30% of the data is always hidden from a decision tree built using bootstrap, we do not need to separate the data into train and test.
8. Since 30% of the data will always be hidden from the decision tree built using
bootstrap, we don’t need to separate the data into train and test.
ity

Disadvantages:
1. When opposed to decision trees, where choices may be made by going along
the tree’s path, random forests are significantly more complex.
2. Due to its intricacy, training takes longer than with other models. Every time a
m

decision tree needs to make a prediction, it must produce output based on the
input data.
)A

3.1.17 Random Forest Technique and Construction


Built on decision trees, the random forest modelling technique is used for behaviour analysis and predictive modelling. It contains numerous decision trees, each representing a different instance of how the data entered into the random forest is classified. The random forest technique chooses the prediction that receives the most votes after considering each case separately.


Figure: Random Forest Technique

Samples from the initial dataset are used as input for each tree in the classification. The features used to grow the tree at each node are chosen at random. No tree in the forest should be pruned until the exercise is complete and the prediction is clearly reached. In this way, the random forest can produce a strong classifier from weakly correlated classifiers.

Modeling Predictions
The random forest approach can develop prediction models using random forest regression trees, which are typically left unpruned to produce accurate predictions. The regression trees, which should not be pruned, are sampled using the bootstrap approach, and the optimal splitting feature is selected by sampling the best nodes from the total set of nodes in the tree.

The random sampling method used to choose the best splitting feature reduces the correlation between trees and hence the variance of the regression trees, improving the predictive ability of the individual trees in the forest. Increased independence among individual trees is another benefit of bootstrap sampling.

Variable Importance
Variables (features) are crucial to the random forest method, although it can be difficult to interpret the models, particularly from a biological standpoint. The naive approach illustrates the significance of variables by assigning each variable an importance based on how frequently the trees include it in the sample. It is doable but challenging, because the effects on cost reduction and accuracy improvement are duplicated.

Permutation importance is a metric that records the drop in prediction accuracy when the variables are randomly permuted in the out-of-bag samples. Although more expensive, the permutation importance approach performs better than the naive approach, as sketched below.
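
scikit-learn exposes both kinds of importance. The sketch below assumes a fitted forest and a held-out split, as in the earlier example:

from sklearn.inspection import permutation_importance

# Naive / impurity-based importance (mean decrease in impurity)
print(forest.feature_importances_)

# Permutation importance: accuracy drop when each feature is shuffled
result = permutation_importance(forest, x_test, y_test, n_repeats=10, random_state=0)
print(result.importances_mean)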


The strategy relies on the naive, mean decrease impurity, and permutation importance techniques to provide direct interpretability, because the random forest itself cannot interpret predictions well enough from a biological viewpoint. The three methods provide different kinds of support for the predictor variables.

in
However, neither the permutation importance nor the mean decrease impurity
approaches show biases when applied to continuous predictor variables with a

nl
comparable number of categories. Bias is frequently present in variable selection. To
prevent it, subsampling without replacement should be employed, and the random
forest technique should be used when conditional inference is being used.

O
Oblique Random Forests
In contrast to typical decision splits at the nodes, oblique random forests use oblique splits for decisions. The following characteristics of oblique forests demonstrate their superiority.

First, they can divide distributions that lie at an angle to the coordinate axes with a single multivariate split, where deep axis-aligned splits would customarily be required. Second, they reduce the decision trees' bias toward the plotted boundaries. When splitting similar classes, traditional axis-aligned splits would require two extra levels of nesting, while oblique splits are simpler and more effective to deploy.

Random Forest Classifier


The random forest classifier is made up of a group of prediction trees. Each tree in the random forest depends on randomly sampled random vectors with the same distribution as those of the other trees.

Due to its excellent accuracy in classifying remotely sensed data, the classifier,
which was initially developed for machine learning, has acquired favour in the remote
sensing field. Additionally, it completes the process with the necessary speed and with
U

effective parameterization. The prediction with the greatest vote across all trees is
chosen by the random forest classifier, which bootstraps random samples.

The uniqueness of the trees is crucial throughout the process. The following
ity

characteristics ensure that each tree is unique. First, random subsets from the initial
training samples are used in each tree training in the sample. Second, the randomly
picked attributes of the unpruned tree nodes are used to select the best split. Thirdly, no
tree should ever be pruned because all trees grow at an unlimited rate.
m

Advantages of Random Forests


Random forests provide estimates of variable importance. They also offer a better way to deal with missing data: the value that appears most frequently in a given node is used to fill in any missing values. Random forests offer the highest accuracy among currently available classification techniques.

The random forest method can also handle large amounts of data with hundreds of different variables. It can automatically balance data sets when a class is less frequent than other classes. The method also handles variables quickly, making it suitable for complicated tasks.


Summary
●● A group of classification algorithms built on the Bayes’ Theorem are known as
naive Bayes classifiers. It is a family of algorithms rather than a single method,

and they are all based on the idea that every pair of features being classified is
independent of the other.
●● The Bayes Theorem determines the likelihood of an event occurring given the

nl
likelihood of an earlier event occurring.
●● Continuous values connected to each feature in Gaussian Naive Bayes are
presumptively distributed in a Gaussian manner. Normal distribution is another

O
name for a Gaussian distribution.
●● Naïve Bayes classifiers are: a) Multinomial Naïve Bayes, b) Bernoulli Naïve Bayes.
●● Finding correlations between dependent and independent variables is the goal of

ty
the supervised learning technique known as regression. A numerical or discrete
value is predicted by a predictive regression model. Regression measurements
are different from classification metrics in many ways. This implies that we cannot

si
assess a regression model using the Accuracy measure; rather, the performance
of a regression model is reported as errors in the forecast.
●● The most common indicators for assessing the effectiveness of regression models
r
are: a) Mean Absolute Error, b) Mean Squared Error, c) R2 Score, d) Adjusted R2.
ve
●● The Bayesian classification method combines statistical classification with
supervised learning. assumes an underlying probabilistic model and enables the
rigorous computation of outcome probabilities to represent uncertainty about the
model.

●● One of the most fundamental yet crucial categorization methods in machine


learning is K-Nearest Neighbors. It falls under the category of supervised learning
and has numerous applications in data mining, intrusion detection, and pattern

recognition.
●● A new data point is classified using the K-NN algorithm based on similarity after
all the existing data has been stored. This means that utilising the K-NN method,
ity

fresh data can be quickly and accurately sorted into a suitable category.
●● A supervised learning method called a decision tree can be used to solve
classification and regression problems, but it is typically favoured for doing so.
It is a tree-structured classifier, where internal nodes stand in for a dataset’s
m

features, branches for the decision-making process, and each leaf node for the
classification result.
●● The Decision Node and Leaf Node are the two nodes of a decision tree. While
)A

Leaf nodes are the results of decisions and do not have any more branches,
Decision nodes are used to create decisions and have numerous branches.
●● The fundamental problem that emerges while developing a decision tree is how
to choose the best attribute for the root node and for sub-nodes. As a result, a
method known as attribute selection measure, or ASM, can be used to tackle
(c

these issues.

Amity Directorate of Distance & Online Education


Foundations of Machine Learning 171

●● Over-fitting is a real issue when designing a decision tree model. Over-fitting occurs when the algorithm keeps growing the tree deeper and deeper to lower the training set error, but this results in an increased test set error, i.e., the prediction accuracy of our model decreases.

●● Support vector machines (SVMs) are powerful yet flexible supervised machine learning algorithms used for both regression and classification. However, they are typically employed in classification problems. SVMs were first presented in the 1960s and later improved in the 1990s. Compared to other machine learning algorithms, SVMs are implemented in a different way.

●● The SVM kernel is a function that converts non separable problems into separable
problems by taking low-dimensional input space and transforming it into higher-
dimensional space. It works best in non-linear separation issues.
●● Random forest is a supervised machine learning technique popular for classification and regression problems. It constructs decision trees on various samples and uses their majority vote for classification and their average for regression.

●● The Random Forest Algorithm’s ability to handle data sets with both continuous
variables, as in regression, and categorical variables, as in classification, is one
of its most crucial qualities. For categorization issues, it produces superior results.
●● The ensemble method employed by random forest is bagging, sometimes referred to as Bootstrap Aggregation. A random sample is chosen from the data set using bagging, so each model is created from samples (bootstrap samples) drawn from the original data with replacement, a process known as row sampling.
ni

Glossary
●● Naïve Bayes classifiers: A group of classification algorithms built on Bayes’ Theorem.


●● Bayes Theorem: The Bayes Theorem determines the likelihood of an event
occurring given the likelihood of an earlier event occurring.

●● Multinomial Naive Bayes: The frequency with which specific events were produced
by a multinomial distribution are represented by feature vectors.
●● Bernoulli Naive Bayes: Features are independent booleans (binary variables) that
describe inputs in the multivariate Bernoulli event model.

●● Regression: Finding correlations between dependent and independent variables is the goal of the supervised learning technique known as regression.
●● MAE: Mean Absolute Error, which evaluates the absolute difference between actual and predicted values.


●● MSE: Mean Squared Error, it calculates the average of the squared difference
between actual model output and expected values.
●● R2 Score: R-squared score; we may evaluate the performance of the model by comparing it to a fixed baseline using the R-squared metric.


●● Adjusted R Squared: Adjusted R squared, which always displays a lower value than R2, is used to get around the R-square problem.
●● Root Node: The decision tree begins at the root node. It represents the full dataset, which is then split into two or more homogeneous sets.
●● Leaf Node: After receiving a leaf node, the tree cannot be further divided; leaf
nodes are the ultimate output nodes.

nl
●● Splitting: In splitting, the decision node or root node is divided into sub-nodes in
accordance with the specified conditions.

O
●● Branch/Sub Tree: The tree was split into two new trees.
●● Pruning: Pruning is the procedure of removing the tree’s undesirable branches.
●● Parent/Child node: A node that is divided into sub-nodes is the parent node, and the sub-nodes are its child nodes.
●● ASM: Attribute Selection Measures.
●● Gini index: A metric that measures how frequently a randomly selected element would be incorrectly classified.
●● Iterative Dichotomiser 3 (ID3): This algorithm makes use of Information Gain to
determine which attribute will be used to classify the current subset of data.
●●
r
C4.5: This algorithm is the ID3 algorithm’s successor. To determine the classifying
ve
property, this technique use either Information gain or Gain ratio.
●● Classification and Regression Tree(CART): It is a dynamic learning technique
that can generate both a regression tree and a classification tree based on the
dependent variable.

●● SVM: Support Vector Machines.


●● SVM Kernel: A function that converts non-separable problems into separable problems by transforming a low-dimensional input space into a higher-dimensional space.

into higher-dimensional space.
●● Linear SVM: Linear SVM is used for linearly separable data, which is defined as
ity

data that can be divided into two classes using just one straight line.
●● Non-Linear SVM: When a dataset cannot be identified using a straight line, it is
said to be non-linear, and the classification algorithm utilised is known as a non-
linear SVM classifier.
m

Check Your Understanding


1. Naïve Bayes is:

a) Conditional Independence
b) Conditional Dependence
c) Both a and b
d) None of the above

2. Naïve Bayes requires:


a) Categorical values

b) Numerical values
c) Both a and b
d) None of the mentioned

3. Probabilistic Model of data within each class is?
a) Discriminative classification

b) Generative classification
c) Probabilistic classification

d) Both b and c
4. Spam Classification is an example of?
a) Random Forest

b) Naïve Bayes
c) Probabilistic classification

d) All of the above
5. Finding correlations between dependent and independent variables is the goal of the
supervised learning technique known as_ _ _ _.
a) Classification
b) Prediction
c) Regression
d) None of the mentioned

6. _ _ _ _evaluates the absolute difference between actual and anticipated values.


a) Mean Absolute Error

b) Mean Squared Error


c) R2 Score
d) Adjusted R2

7. _ _ _ _ _calculates the average of the squared difference between actual model


output and expected values.
a) Mean Absolute Error
m

b) Mean Squared Error


c) R2 Score
d) Adjusted R2
)A

8. What is the application of K-nearest neighbour?


a) Data mining
b) Intrusion detection
(c

c) Pattern recognition
d) All of the above

Amity Directorate of Distance & Online Education


174 Foundations of Machine Learning

9. ____ is a tree-structured classifier, where internal nodes stand in for a dataset’s features, branches for the decision-making process, and each leaf node for the classification result.
a) K-nearest neighbor
b) Decision tree
c) Linear regression
d) Naïve Bayes
10. What are the two nodes of a decision tree?
a) Root node and Stem node
b) Root node and Leaf node
c) Decision node and Leaf node
d) None of the mentioned
11. What does the abbreviation CART stand for?
a) Classification and Regression Tree Algorithm
b) Class and Root Tree Algorithm
c) Classification and Root Tree Algorithm
d) Class and Regression Tree Algorithm
12. After receiving a ____, the tree cannot be further divided; leaf nodes are the ultimate output nodes.
a) Root node
b) Leaf node
c) Stem node
d) Parent node
13. In ____, the decision node or root node is divided into sub-nodes in accordance with the specified conditions.
a) Root node
b) Leaf node
c) Splitting
d) None of the mentioned
14. ____ is the procedure of removing the tree’s undesirable branches.
a) Splitting
b) Bootstrapping
c) Parent node
d) Pruning
15. The ____ is a metric that measures how frequently a randomly selected element is erroneously identified.
a) Gini Index
b) Bootstrapping
c) K-nearest neighbor
d) Decision tree

Exercise

O
1. What do you mean by Naive Bayes Classifier?
2. Define Naive Bayes Classifier for Mixed Data Type.
3. What do you mean by K Nearest Neighbour? Also give an example to implement KNN.
4. Explain Decision Tree.
5. What do you mean by support vector machines?
6. How does SVM work for linearly and non-linearly separable problems?
7. Define random forest classifier.
8. Define random forest technique.
r
ve
Learning Activities
1. How to implement KNN algorithm using python?

Check Your Understanding - Answers


ni

1. a
2. a
3. d
4. b
5. c
6. a
7. b
8. d
9. b
10. c
11. a
12. b
13. c
14. d
15. a


Further Readings and Bibliography:


1. Introduction to Machine Learning with Python: A Guide for Data Scientists, Andreas C. Müller and Sarah Guido
2. Supervised and Unsupervised Learning for Data Science, Berry Michael W., Azlinah Mohamed, Bee Wah Yap
3. Machine Learning: An Algorithmic Perspective, Stephen Marsland
4. Deep Reinforcement Learning Hands-On: Apply Modern RL Methods to Practical Problems of Chatbots, Robotics, Discrete Optimization, Web Automation, and More, 2nd Edition, Maxim Lapan
5. Algorithms for Reinforcement Learning, Csaba Szepesvari
6. Foundations of Deep Reinforcement Learning: Theory and Practice in Python, Laura Graesser and Wah Loon Keng

Module - IV : Unsupervised Learning


Learning Objectives:

in
At the end of this module, you will be able to understand:

●● Clustering and Evaluation of Clustering
●● Different types of Clustering
●● DBSCAN (Density Based Spatial Clustering of Applications with Noise)
●● Concept of Anomaly Detection
●● K-Means Clustering
●● Concept of Principal Component Analysis
●● Singular Value Decomposition
●● LOF (Local Outlier Factor)
●● Concept of T-Distributed Stochastic Neighbour Embedding (t-SNE)

Introduction
r
Unsupervised learning is the process of teaching a computer to use unlabeled,
ve
unclassified data and allowing the algorithm to act on the data without supervision.
Without any prior data training, the machine’s objective in this case is to categorise
unsorted data according to similarities, patterns, and differences.

In contrast to supervised learning, no teacher is present, hence the machine won’t be trained. Instead, the machine is left to discover the hidden structure in the unlabeled data on its own.

Consider the scenario where the machine is shown a picture of dogs and cats that it has never seen before.

Therefore, the machine cannot label the images as “dogs” and “cats”, because it is unaware of the characteristics of dogs and cats. But it can categorise them according to their similarities, patterns, and contrasts, i.e., we can simply divide the above picture into two parts. The first part might have only images with dogs, and the second part might have just images with cats. The machine has no prior knowledge of this, hence there are no examples or training data.

It enables the model to function independently and find previously unnoticed patterns and information. It focuses primarily on unlabeled data.


There are many different clustering algorithms, and there are many different ways to compare a clustering to a gold standard. The clustering objects and the clustering task determine the best clustering algorithm to use as well as the best assessment measure. Verbs serve as the clustering objects in this thesis, and the work of clustering entails categorising the verbs semantically. Within the cluster analysis of the verbs, more cluster parameters need to be investigated.
entails categorising the verbs semantically. Within the cluster analysis of the verbs,
more cluster parameters need to be investigated.

4.1 Introduction to Clustering Techniques
Unsupervised learning algorithms fall into two groups:
●● Clustering: Identifying the natural groupings in the data, such as classifying clients based on their purchasing patterns, is a clustering problem.
●● Association: When you wish to find rules that broadly characterise your data, such as “those who buy X also tend to buy Y,” you have an association rule learning problem.

Various Unsupervised Learning Methods:

Clustering
●● Exclusive (partitioning)
●● Agglomerative
●● Overlapping
●● Probabilistic

Clustering Types:-
●● Hierarchical clustering
●● K-means clustering
●● Principal Component Analysis
●● Singular Value Decomposition
●● Independent Component Analysis

●● Singular Value Decomposition
●● Independent Component Analysis
ity

4.1.1 Introduction to Clustering

Clustering
m

In essence, it is a kind of unsupervised learning technique. The process of


drawing references from datasets of input data without labelled replies is known as
unsupervised learning. It is typically used as a method to identify the groups, generative
)A

qualities, and significant structures that are inherent in a set of instances. The objective
of clustering is to divide the population or set of data points into a number of groups
so that the data points within each group are more similar to one another and different
from the data points within the other groups. It is essentially a grouping of objects based
on how similar and unlike they are to one another. For instance, the data points that are
(c

clustered together in the graph below can be categorised into a single category. In the
image below, the clusters can be distinguished, and we can count three clusters.



e
in
nl
O
Clusters don’t always have to be spherical, for example:

ty
r si
ve
Why Clustering?

Clustering is crucial as it establishes the natural grouping among the available unlabeled data. There are no universal standards for effective clustering; which criteria to use depends on the user and their needs. We might be interested in identifying homogeneous group representatives (data reduction), identifying “natural clusters” and describing their unknown characteristics (“natural” data types), identifying appropriate and helpful groupings (“useful” data classes), or identifying odd data objects (outlier detection). A clustering algorithm must make various assumptions regarding the similarity of the points, and each assumption results in a unique cluster that is equally valid.
ity

Clustering Methods:
●● Density based Method: These methods take clusters into account as a dense zone
that differs and is similar to the lower dense part of the space. These techniques
can combine two clusters and have good accuracy. Examples include OPTICS
m

(Ordering Points to Identify Clustering Structure), DBSCAN (Density-Based Spatial


Clustering of Applications with Noise), etc.
●● Hierarchy based Method: In this method, the clusters are organised into a tree-like
)A

structure. The already formed cluster is used to create new ones. There are two
categories.
◌◌ Agglomerative (bottom-up approach)
◌◌ Divisive (top-down approach)
(c

Examples include BIRCH (Balanced Iterative Reducing Clustering and Using


Hierarchies), CURE (Clustering Using Representatives), etc.


●● Partitioning Methods: The items are divided by these procedures into k clusters,

e
and each split creates a cluster. This technique is used to improve an objective
criterion similarity function, such as K-means or CLARANS (Clustering Large
Applications based upon Randomized Search), when distance is a key parameter.

in
●● Grid-based Methods: In this approach, the data space is represented as a grid-
like structure made up of a finite number of cells. STING (Statistical Information

nl
Grid), wave cluster, CLIQUE (CLusteringIn Quest), and other clustering processes
carried out on these grids are all quick and independent of the quantity of data
objects.

O
4.1.2 Evaluation of Clustering
In multivariate data analysis, clustering is a common practice. It is intended to
investigate the data objects’ innate natural structure, where items in the same cluster

ty
are as similar as possible to one another and objects in separate clusters are as
distinct as possible from one another. A method for generalising over the data items
and their features is provided by the equivalence classes that the clusters induce. Many

si
fields, including pattern detection, psychology, economics, and medical research, use
clustering techniques.

Unconsciously, humans frequently undertake the task of clustering. For instance,


r
when seeing a two-dimensional map, one immediately distinguishes distinct areas
ve
based on how close together the locations are, if the locations are divided by rivers,
lakes, or the sea, etc. However, intuitive judgments become more difficult to get and
justify as the description of objects by their attributes increases in dimension.

Clustering is frequently mistaken for classification or discriminant analysis. The


ni

three main data analysis types, however, each correspond to a distinct concept and are
distinguished as follows: Clustering differs from classification in that the former places
items in pre-established classes, whilst the latter requires no prior knowledge of the
U

object classes or the individuals who make up those classes. A discriminant analysis
seeks to reinforce a classification that has already been provided, whereas a cluster
analysis has to first create the class structure. This is how a cluster analysis differs from
a discriminant analysis in (b).
ity

For the evaluation and comparison of clustering experiments and outcomes,


a clustering evaluation requires an impartial and trustworthy metric. Although the
clustering researcher has developed an understanding for the clustering evaluation
in theory, in practice it is impossible to make an intuitive decision due to the volume
m

of data and the intricate intricacies of data representation and clustering methods.
Therefore, a subjective method is required for large-scale trials, but an intuitive,
introspective evaluation is only feasible for limited collections of objects.
)A

Although there is no precise method for measuring clusterings, several evaluation


metrics from various fields, including theoretical statistics, machine learning, and web
page clustering, are applicable.

Demands on Clustering Evaluation


(c

A trustworthy strategy for judging the calibre of the clusterings should be used
when developing an objective method for evaluating clusterings. How can we convert


these impersonal descriptions into more specific requirements? The needs for the

e
task of grouping verbs into semantic classes are then defined, with an increasing
level of linguistic task specificity. In other words, I first specify general requirements
for an evaluation, then general requirements for a clustering evaluation, and finally

in
requirements for the verb-specific clustering evaluation.

The formal notation of the clustering result and gold standard classification makes

nl
it easier to clarify the requirements for the clustering assessment, hence the notation is
given in advance:

General Evaluation Demands: An experiment’s evaluation should be conducted

O
using a gold standard that is as impartial and trustworthy as possible. The classification
was constructed closely in connection to the existing classifications for German by
Schumacher (1986) and English by Levin in order to make up for the suboptimal setup
by a single person (1993). Prior to conducting any word experimentation, the entire

ty
classification was completed.

General Clustering Demands: The second set of requirements relates to general


cluster analysis characteristics that are not reliant on the clustering region.

si
●● The measure should be applicable to all similarity measures used in clustering, but
may be independent of the specific similarity measure, as the evaluation’s goal is

r
to evaluate and compare various clustering experiments and results.
ve
●● A (numerical) measure representing the worth of the clustering should be defined
using the evaluation’s findings. To make the comprehension of the assessment
easier, the resulting value should either be simple to understand or else be
illustrated with respect to its range and impacts.
ni

●● Without favouring a certain number or size of clusters, the evaluation technique


should be specified.
●● The evaluation metric should be able to differentiate between the quality of the individual clusters Ci ∈ C as well as the entire clustering partition C.


Linguistic Clustering Demands: The demands on an appropriate clustering
evaluation are sharpened by the fact that this thesis deals with the clustering of
ity

linguistic data because the demands on verb classes are particular to the linguistic
background and linguistic intuition and are not necessarily desired for different
clustering areas. As linguistic desiderata for the clustering of verbs, the list that follows
corresponds to a third set of requirements.
m

4.1.3 Understanding Clustering Types


The various types of clustering are:
)A

●● Connectivity-based Clustering (Hierarchical clustering)


●● Centroids-based Clustering (Partitioning methods)
●● Distribution-based Clustering
●● Density-based Clustering (Model-based methods)
(c

●● Fuzzy Clustering


Connectivity-Based Clustering (Hierarchical Clustering)



e
Unsupervised machine learning clustering that starts with a top-to-bottom hierarchy
of clusters is called “hierarchical clustering.” Following that, it decomposes the data

in
items based on this hierarchy to produce the clusters. Depending on whether the
process of building clusters proceeds top-down or bottom-up, this strategy adopts
one of two approaches. These are, respectively, the Divisive Approach and the

nl
Agglomerative Approach.

O
ty
r si
ve
●● Divisive Approach
With this top-down method of hierarchical clustering, we assume that all of the
ni

data points are part of a single, large cluster and attempt to divide the data into smaller
groups using a termination logic, or a limit beyond which there can be no further division
of data points. For categorical data, the metric can be the GINI coefficient inside
U

a cluster, or this termination logic can be based on the least sum of squares of error
inside a cluster.

As a result, we iteratively divide the data into “n” number of smaller clusters to
ity

which the data points now belong after it was previously grouped into a single large
cluster.

When separating the clusters, it is important to keep in mind that this approach is
quite “strict,” which means that once clustering is completed inside of a loop, it cannot
m

be reversed.
)A
(c


●● Agglomerative Approach

e
Agglomerative is quite different from Divisive, in which each of the “N” data points
is regarded as a separate component of the “N” clusters that the data is divided into.

in
We subsequently allocate the data points to each of these clusters in accordance after
repeatedly condensing these numerous “N” clusters to a smaller number of clusters,
let’s say “k” clusters. This method, which is bottom-up, also combines the clusters

nl
using termination logic. Clusters shouldn’t be too widely apart to be merged, a distance
criterion, a variance criterion, or a number-based criterion can all be used in this
reasoning (increase in the variance of the cluster being merged should not exceed a
threshold, Ward Method).

O
Centroid Based Clustering
The most efficient method of forming clusters and allocating data points to them

ty
is centroid-based clustering, which is regarded as one of the most straightforward
clustering algorithms. The idea behind centroid-based clustering is that each cluster
is identified and represented by a central vector, and each cluster is assigned to data
points that are near these vectors.

si
These categories of clustering techniques employ a variety of distance metrics
to iteratively calculate the separation between the clusters and the characteristic
r
centroids. These are either of the Minkowski, Manhattan, or Euclidian distances.
ve
The main drawback in this situation is that we need to identify the number of
clusters, “k,” either intuitively or scientifically (Elbow Method) before any clustering
machine learning algorithm may start assigning the data points during its initial iteration.
ni
U
ity
m

Despite its drawbacks, centroid-based clustering has surpassed hierarchical


clustering in the analysis of sizable datasets. These algorithms have numerous
)A

application areas, including market segmentation, customer segmentation, text topic


retrieval, image segmentation, etc., due to their simplicity in implementation and
interpretation.

Density-based Clustering (Model-based Methods)


(c

The two methods we previously discussed—hierarchical and centroid-based—


are both reliant on a distance (similarity/proximity) measure, if one were to examine


the prior two in more detail. This statistic serves as the foundation for the defining of

e
a cluster. Instead of considering distances, density-based clustering algorithms do.
Clusters, which are defined as a maximal set of connected points, are thought to be the
densest region in a data space, separated from it by regions with lower object densities.

in
The data is assumed to be noise-free and the shape of the cluster that results is
simply geometrical when conducting the majority of the clustering (circular or elliptical).

nl
Data is always inconsistent to some degree (noise), which cannot be disregarded.
Furthermore, it is preferable to have variable shapes rather than sticking to fixed
attribute shapes in order to take into account all possible data points. These are the
applications where methods based on density have shown to be useful!

O
Clusters with various geometries, clusters without any size restrictions, clusters
that have the highest level of homogeneity by guaranteeing the same levels of density
within them, and clusters that are inclusive of outliers or noisy data can all be obtained

ty
using density-based techniques.

r si
ve
ni
U

Distribution-Based Clustering
The clustering methods that have been used up to this point have either been
based on proximity (similarity/distance) or composition (density). A family of clustering
algorithms considers probability, which is a whole separate measure. Data points are
ity

created and grouped using distribution-based clustering based on their likelihood of


belonging to the same probability distribution (such as a Gaussian, binomial, or other) in
the data.
m
)A
(c


The distribution models of clustering have a close relationship to statistics since



e
they are used to generate and arrange datasets according to the principles of random
sampling, i.e., to collect data points from a particular distribution. Objects that are most
likely to belong to the same distribution can subsequently be characterised as clusters.

in
The requirement to define the clusters a priori for some algorithms, and primarily
the definition of the cluster form for the majority of algorithms, is a significant

nl
disadvantage of density and boundary-based techniques. It is necessary to choose at
least one tuning or hyper-parameter, and although if doing so is straightforward, doing it
wrong could have unintended consequences.

O
In terms of flexibility, accuracy, and shape of the clusters created, distribution-
based clustering offers a clear advantage over proximity- and centroid-based clustering
techniques. The main issue is that these clustering approaches only perform well with
data that is artificial, simulated, or if the majority of the data points unquestionably

ty
belong to a predefined distribution; otherwise, the results will overfit.
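
To make the idea concrete, here is a minimal sketch of distribution-based clustering with a Gaussian mixture model, assuming scikit-learn is available (the data, the number of components, and the parameter values below are illustrative assumptions, not taken from the text above):

# A minimal sketch of distribution-based clustering with a Gaussian mixture
# (assumes scikit-learn; data and parameters are illustrative)
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=0)

gmm = GaussianMixture(n_components=3, covariance_type='full', random_state=0)
gmm.fit(X)

labels = gmm.predict(X)       # hard assignment: the most probable Gaussian
probs = gmm.predict_proba(X)  # soft assignment: probability per distribution
print(labels[:5])
print(probs[:5].round(3))

Each cluster here is literally a fitted Gaussian, so every point also receives a probability of belonging to each distribution rather than only a hard label.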

Fuzzy Clustering

si
The fundamental principle of clustering is to allocate data points to mutually
exclusive clusters, which means that a data point may only ever exist inside of
one cluster and cannot be a member of more than one cluster. By allocating a data
r
point to numerous groups with a specified degree of belongingness measure, fuzzy
ve
clustering algorithms challenge this paradigm. The data points near a cluster’s centre
may also belong in the cluster, which is to a greater extent, than the data points around
a cluster’s edge. A membership coefficient, which ranges from 0 to 1, measures the
likelihood that an element belongs to a particular cluster.
ni

With datasets with a lot of overlap between the variables, fuzzy clustering can
be applied. In bioinformatics, where it is difficult for generic clustering algorithms to
distinguish between the picture’s pixels and they fail to execute a suitable clustering
U

due to the identification of overlapping gene codes, it is a strongly recommended


approach for image segmentation.
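
The membership coefficient can be illustrated with a small NumPy sketch of the standard fuzzy c-means membership formula (this is an illustration only; the fuzzifier m = 2 and the sample points are assumptions, not values from the text):

# Sketch: fuzzy membership coefficients (standard fuzzy c-means formula)
import numpy as np

def fuzzy_memberships(X, centers, m=2.0):
    # distance of every point to every centre, shape (n_points, n_centres)
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
    # u[j, i] = 1 / sum_k (d[j, i] / d[j, k]) ** (2 / (m - 1))
    ratio = (d[:, :, None] / d[:, None, :]) ** (2.0 / (m - 1.0))
    return 1.0 / ratio.sum(axis=2)   # each row sums to 1 across clusters

X = np.array([[0.0, 0.0], [1.0, 0.2], [5.0, 5.0]])
centers = np.array([[0.0, 0.0], [5.0, 5.0]])
print(fuzzy_memberships(X, centers).round(3))

Each row of the returned matrix sums to 1, so every point spreads its membership across all clusters instead of belonging to exactly one.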

4.1.4 Hierarchical Clustering Part – I


ity

Another unsupervised machine learning approach, hierarchical clustering (also


known as hierarchical cluster analysis, or HCA), is used to cluster unlabeled datasets.

Another unsupervised learning approach used to combine unlabeled data points


m

with comparable features together is hierarchical clustering. The two categories of


hierarchical clustering algorithms are as follows.

Agglomerative hierarchical algorithms: each data point is treated as a single cluster, and the algorithm then sequentially merges or aggregates the pairs of clusters (bottom-up approach). A dendrogram or tree structure is used to show the hierarchy of the clusters.

On the other hand, in divisive hierarchical algorithms, all the data points are
represented as one large cluster, and the process of clustering entails splitting the one
(c

large cluster into numerous little clusters (Top-down approach).

Why hierarchical clustering?



There are certain difficulties with the K-means clustering algorithm, which always

e
attempts to produce clusters of the same size and has a preset number of clusters.
Because we don’t need to be aware of the specified number of clusters while using the
hierarchical clustering technique, we can choose this algorithm to address these two

in
problems.

Agglomerative Hierarchical clustering

nl
One well-known HCA is the agglomerative hierarchical clustering algorithm.
It employs a bottom-up methodology to cluster the datasets. This means that this
algorithm starts by treating each dataset as a single cluster and then begins combining

O
the two clusters that are the closest to one another. It continues doing this until every
cluster has been combined into a single cluster that has every dataset.

The dendrogram is a visual representation of this cluster structure.

ty
How the Agglomerative Hierarchical clustering Work?

The following steps can be used to describe how the AHC algorithm functions:

si
Step-1: Make a separate cluster out of each data point. Let’s imagine there are N
data points, which means there will be N clusters.

r
ve
ni

Step-2: Combine the two nearest data points or clusters to create a single cluster.
There will now be N-1 clusters as a result.
U
ity
m

Step-3: Once more, combine the two nearest clusters to create a single cluster. N-2
clusters will exist.
)A
(c


Step-4: until there is just one cluster remaining, repeat Step 3. We will therefore

e
obtain the following clusters. Think about the pictures below:

in
nl
O
Step-5: Create a dendrogram to divide the clusters according to the problem once
all the clusters have been joined into one large cluster.

Measure for the Distance between two Clusters

ty
As we’ve seen, the hierarchical clustering depends on how near the two groups
are to one another. There are several methods for measuring the distance between
two clusters, and these methods determine the clustering rule. Linkage techniques are

si
the name given to these actions. The following list of popular connection techniques
includes

Single Linkage: The closest points of the clusters are separated by the shortest
distance. Take a look at the photo below: r
ve
ni
U

Complete Linkage: It represents the greatest separation between the centres of


two distinct clusters. Since it creates more compact clusters than single-linking, it is one
of the widely used linkage techniques.
ity
m
)A

Average Linkage: The average distance between two clusters is determined using
the linkage method, which adds up and divides the distance between each pair of
datasets by the total number of datasets. It is also among the most widely used linking
techniques.

Centroid Linkage: The linkage method is used to calculate the separation between
(c

the cluster centroid. Take a look at the photo below:



e
in
nl
Depending on the nature of the problem or the needs of the business, we can use
any of the aforementioned ways.
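
As a quick, hedged illustration of how the linkage choice affects the result, the sketch below (assuming SciPy is installed; the sample points are made up) builds the merge history with each of the linkage methods described above and cuts it into two clusters:

# Sketch: comparing the linkage methods described above with SciPy
# (sample data is made up for illustration)
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1, 2], [2, 2], [8, 8], [8, 9], [25, 80]], dtype=float)

for method in ['single', 'complete', 'average', 'centroid']:
    Z = linkage(X, method=method)                     # merge history for this rule
    labels = fcluster(Z, t=2, criterion='maxclust')   # cut the tree into 2 clusters
    print(method, labels)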

O
Working of Dendrogram in Hierarchical Clustering
The dendrogram, which resembles a tree, is primarily used to store each step that
the HC algorithm executes as memory. The X-axis of the dendrogram plot displays

ty
all of the data points in the specified dataset, while the Y-axis displays the Euclidean
distances between the data points.

The diagram below can be used to show how the dendrogram works:

r si
ve
ni

The left portion of the following diagram demonstrates how clusters are formed
U

during agglomerative clustering, while the right portion displays the matching
dendrogram.

●● The datapoints P2 and P3 join to form a cluster, as we have already described.


ity

As a result, a dendrogram is produced, connecting P2 and P3 with a rectangle


shape. According to the Euclidean distance between the data points, the height is
determined.
●● The subsequent stage is clustering P5 and P6 and producing the associated
m

dendrogram. It is higher than it was previously because P5 and P6’s Euclidean


distance is a tiny bit more than P2 and P3’s.
●● Once more, two fresh dendrograms are made, one combining P1, P2, and P3, and
)A

the other one combining P4, P5, and P6.


●● The final dendrogram, incorporating all the data points, is then produced.
The dendrogram tree structure can be cut at any level based on our needs.
(c

4.1.5 Hierarchical Clustering Part – II


In continuation from last topic.


Example:

e
Let’s begin by importing the necessary libraries as shown below:

%matplotlib inline

in
import matplotlib.pyplot as plt

import numpy as np

nl
The data points we collected for this example will then be plotted.

X = np.array(

O
[[7,8],[12,20],[17,19],[26,15],[32,37],[87,75],[73,85], [62,80],[73,60],[87,96],])

labels = range(1, 11)

ty
plt.figure(figsize = (10, 7))

plt.subplots_adjust(bottom = 0.1)

plt.scatter(X[:,0],X[:,1], label = ‘True Position’)

si
for label, x, y in zip(labels, X[:, 0], X[:, 1]):

plt.annotate(
r
label,xy = (x, y), xytext = (-3, 3),textcoords = ‘offset points’, ha = ‘right’, va =
ve
‘bottom’)

plt.show()
ni
U
ity
m

It is pretty clear from the above diagram that we have two clusters in our
datapoints, however there could be thousands of clusters in the actual data. The
)A

dendrograms of our datapoints will then be plotted using the Scipy package.

from scipy.cluster.hierarchy import dendrogram, linkage

from matplotlib import pyplot as plt


(c

linked = linkage(X, ‘single’)

labelList = range(1, 11)


plt.figure(figsize = (10, 7))



e
dendrogram(linked, orientation = ‘top’,labels = labelList,

distance_sort =’descending’,show_leaf_counts = True)

in
plt.show()

nl
O
ty
Once the large cluster has formed, the longest vertical distance that is not crossed by any extended horizontal line is chosen, and a horizontal line is drawn through it, as in the diagram that follows. Since this horizontal line crosses the blue vertical lines at two different locations, there would be two clusters.
r
ve
ni
U

The class for clustering must then be imported, and the fit predict method must
ity

be called to forecast the cluster. AgglomerativeClustering class from the sklearn.cluster


package is being imported.

from sklearn.cluster import AgglomerativeClustering

cluster = AgglomerativeClustering(n_clusters = 2, affinity = ‘euclidean’, linkage =


m

‘ward’)

cluster.fit_predict(X)
)A

Next, plot the cluster with the help of following code –

plt.scatter(X[:,0],X[:,1], c = cluster.labels_, cmap = ‘rainbow’)


(c



e
in
nl
O
The two clusters from our datapoints are depicted in the diagram up top.

ty
Divisive clustering
also referred to as a top-down strategy. The number of clusters need not be

si
predetermined for this approach. Top-down clustering starts by figuring out how to split
a cluster that contains all of the data, then splits clusters back and forth until all of the
individual data are in singleton clusters.

Algorithm r
ve
given a dataset (d1, d2, d3, ....dN) of size N

at the top we have all data in one cluster

the cluster is split using a flat clustering method eg. K-Means etc
ni

repeat

choose the best cluster among all the clusters to split


U

split that cluster by the flat clustering algorithm

until each data is in its own singleton cluster
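
A minimal Python sketch of this top-down recipe, using scikit-learn’s KMeans as the flat splitter, is given below. It is only an illustration under stated assumptions: the “best cluster to split” is taken to be simply the largest one, and it stops at four clusters instead of going all the way down to singletons:

# Sketch of divisive (top-down) clustering using KMeans as the flat splitter
# (illustrative only; "best cluster to split" here is simply the largest one)
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=4, random_state=0)

clusters = [np.arange(len(X))]          # start: all points in one cluster
while len(clusters) < 4:                # stop early instead of full singletons
    biggest = max(range(len(clusters)), key=lambda i: len(clusters[i]))
    idx = clusters.pop(biggest)
    split = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X[idx])
    clusters.append(idx[split == 0])    # two new sub-clusters replace the old one
    clusters.append(idx[split == 1])

print([len(c) for c in clusters])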


ity
m
)A
(c


Hierarchical clustering benefits



e
●● In some instances, it is easy to execute and produces the best results.
●● It is simple to do and produces a hierarchy, or a structure with more information.

in
●● The amount of clusters does not need to be pre-determined.

Hierarchical clustering’s drawbacks

nl
●● Large clusters are broken up by it.
●● Handling convex shapes and clusters of varying sizes is challenging.

O
●● Noise and outliers are picked up by it.
●● Once the algorithm has been used, it cannot be altered or removed.

ty
4.1.6 DBSCAN (Density Based Spatial Clustering of Applications
with Noise)
In essence, clustering analysis, also known as clustering, is an unsupervised

si
learning technique that separates the data points into a number of distinct batches or
groups, with the goal of ensuring that the properties of the data points within the same
group are similar and that the properties of the data points within different groups are,
r
at least in part, dissimilar. It consists of numerous various differential evolution-based
methodologies.
ve
Examples include Gaussian mixtures (Mahalanobis distance between centres),
Affinity propagation (graph distance), Mean-shift (distance between points), DBSCAN
(distance between nearest points), K-Means (distance between points), Spectral
ni

clustering (graph distance), etc.

Fundamentally, all clustering techniques follow the same methodology: we first


determine commonalities, and then we apply them to group or batch the data points.
U

The density-based spatial clustering of applications with noise (DBSCAN) clustering


approach will be the main emphasis of this section.

Clusters are dense areas of the data space that are divided by areas of lower point
ity

densities. This common understanding of “clusters” and “noise” is the foundation of the
DBSCAN algorithm. The main principle is that at least a certain number of points must
be present in the vicinity of each point within a cluster within a particular radius.
m
)A
(c


Why DBSCAN?

e
Finding convex or spherical groups can be done using hierarchical clustering
or partitioning methods (K-means, PAM clustering). In other words, they are only

in
appropriate for clusters that are small and well-distributed. Furthermore, the presence
of noise and outliers in the data has a significant impact on them as well.

Real-world data may have errors, such as:

nl
◌◌ Clusters may have any shape, like in the case of those in the figure below.
◌◌ Data could have noise in it.

O
ty
r si
A data set with nonconvex clusters and outliers/noise is depicted in the figure
ve
below. K-means algorithm has trouble locating these clusters with arbitrary shapes
given such input.

DBSCAN Algorithm Requires two Parameters:


ni

●● eps: It establishes the area surrounding a data point; if the distance between two
points is less than or equal to “eps,” they are regarded as neighbours. A significant
portion of the data will be viewed as outliers if the eps value is chosen too low. The
U

clusters will merge if it is decided to be very large, and most of the data points will
be in the same clusters. The k-distance graph can be used to determine the eps
value.
ity

●● MinPts: The bare minimum amount of nearby neighbours (data points) within the
eps radius The greater value of MinPts must be selected the larger the dataset.
In general, the minimal MinPts can be calculated as MinPts>= D+1 from the
dataset’s D dimensions. MinPts must be set to a value of at least 3.
m

In this algorithm, we have 3 types of data points.

Core Point: If a point has more than MinPts points inside an episodic period, it is a
core point.
)A

Border Point: a point that is close to a core point but has fewer than MinPts inside
eps.

Noise or outlier: a location that is neither a border point nor a core point.
(c



e
in
nl
DBSCAN algorithm can be abstracted in the following steps:

O
1. Recognize the core points or points that have been visited by more than
MinPtsneighbours by finding all the neighbour points inside eps.
2. Create a new cluster for each core point if it hasn’t been given one before.

ty
3. Recursively locate all of the points that are associated to it by density and add them
to the same cluster as the core point.
4. If there is a point c that has a sufficient number of points in its neighbours and both

si
the points a and b are within the eps distance, then those two points are said to be
density linked. This procedure involves chains. Therefore, it follows that b must be a
neighbour of an if b is a neighbour of c, c is a neighbour of d, d is a neighbour of e,
and e is a neighbour of a. r
ve
5. Go through the dataset’s remaining unexplored points iteratively. Noise consists of
all points that do not belong to any cluster.

Below is the DBSCAN clustering algorithm in pseudocode:


ni

DBSCAN(dataset, eps, MinPts){
    # cluster index
    C = 1
    for each unvisited point p in dataset {
        mark p as visited
        # find neighbors within distance eps of p
        Neighbors N = find the neighboring points of p
        if |N| < MinPts:
            mark p as noise
        else:
            # p is a core point: start a new cluster
            add p to cluster C
            for each point p’ in N {
                if p’ is not visited:
                    mark p’ as visited
                    Neighbors N’ = find the neighboring points of p’
                    if |N’| >= MinPts:
                        N = N U N’   # density-connected: expand the search
                if p’ is not a member of any cluster:
                    add p’ to cluster C
            }
            C = C + 1
    }
}

4.1.7 Hands On: DBSCAN


(c

Additionally, we will assess clustering methods using the Silhouette score and the
Adjusted rand score.


Silhouette scores range from -1 to 1. The ideal score is 1 or close to 1, indicating that the data point is very compact inside the cluster to which it belongs and located distant from the other clusters. A value of -1 is the worst. Values close to 0 indicate clusters that overlap.

The Rand score ranges from 0 to 1 (the adjusted variant used below can also dip slightly below 0 when agreement is worse than chance). More than 0.9 indicates excellent cluster recovery, more than 0.8 good recovery, and a value lower than 0.5 poor recovery.

Example:

O
import matplotlib.pyplot as plt

import numpy as np

from sklearn.cluster import DBSCAN

ty
from sklearn import metrics

from sklearn.datasets import make_blobs  # samples_generator was removed in newer scikit-learn releases

si
from sklearn.preprocessing import StandardScaler

from sklearn import datasets

# Load data in X r
ve
X, y_true = make_blobs(n_samples=300, centers=4,

cluster_std=0.50, random_state=0)

db = DBSCAN(eps=0.3, min_samples=10).fit(X)
ni

core_samples_mask = np.zeros_like(db.labels_, dtype=bool)

core_samples_mask[db.core_sample_indices_] = True
U

labels = db.labels_

# Number of clusters in labels, ignoring noise if present.


ity

n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)

print(labels)

# Plot result
m

# Black removed and is used for noise instead.

unique_labels = set(labels)

colors = [‘y’, ‘b’, ‘g’, ‘r’]


)A

print(colors)

for k, col in zip(unique_labels, colors):

if k == -1:
(c

# Black used for noise.

col = ‘k’

class_member_mask = (labels == k)

e
xy = X[class_member_mask&core_samples_mask]

plt.plot(xy[:, 0], xy[:, 1], ‘o’, markerfacecolor=col,

in
markeredgecolor=’k’,

markersize=6)

nl
xy = X[class_member_mask& ~core_samples_mask]

plt.plot(xy[:, 0], xy[:, 1], ‘o’, markerfacecolor=col,

O
markeredgecolor=’k’,

markersize=6)

ty
plt.title(‘number of clusters: %d’ % n_clusters_)

plt.show()

#evaluation metrics

si
sc = metrics.silhouette_score(X, labels)

print(“Silhouette Coefficient:%0.2f”%sc)
r
ari = metrics.adjusted_rand_score(y_true, labels)
ve
print(“Adjusted Rand Index: %0.2f”%ari)

Output:
ni
U
ity
m

Silhouette Coefficient:0.13

Adjusted Rand Index: 0.31


)A

Outliers are shown by black points. We can alter the cluster setup by modifying the
eps and the MinPts.
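
A good eps is usually read off the k-distance graph mentioned earlier. A hedged sketch (assuming scikit-learn; setting k to the intended MinPts is a common convention) that plots it for the data above:

# Sketch: choosing eps from the k-distance graph (assumes scikit-learn)
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors

# X is the same data used above
k = 10  # typically set to MinPts
nn = NearestNeighbors(n_neighbors=k).fit(X)
distances, _ = nn.kneighbors(X)      # distances[:, -1] is the k-th NN distance
kth = np.sort(distances[:, -1])

plt.plot(kth)
plt.xlabel('Points sorted by distance')
plt.ylabel(str(k) + '-th nearest neighbour distance')
plt.show()                           # eps is roughly the distance at the "knee"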

4.1.8 Anomaly Detection


(c

Finding unusual occurrences or observations that are statistically distinct from the
rest of the observations is known as anomaly detection. Such “anomalous” behaviour


is usually indicative of some sort of issue, such as credit card fraud, a failed server

e
component, a cyberattack, etc.

Three general categories can be used to classify anomalies:

in
1. Point Anomaly: A tuple in a dataset is referred to as a point anomaly if it deviates
significantly from the rest of the data.

nl
2. Contextual Anomaly: anomalies are observations that are anomalous because of the
context in which they were made.
3. Collective Anomaly: A collection of data instances aid in the discovery of an anomaly.

O
The principles of machine learning can be used for anomaly detection. The
following techniques can be used:

1. Supervised Anomaly Detection: This technique creates a classification model for

ty
future data points using a labelled dataset that includes both normal and anomalous
samples. The most popular methods for this purpose include K-Nearest Neighbors
Classifier, Support Vector Machine learning, and Supervised Neural Networks.

si
2. Unsupervised Anomaly Detection: This technique doesn’t need any training data
and instead makes two assumptions about the data: (1) Not all data is anomalous,
and (2) Any anomaly deviates statistically from the normal samples. The data is then
r
clustered using a similarity metric based on the aforementioned presumptions, and
the data points that are far from the cluster are regarded as anomalies.
ve
Now, using the K-Nearest Neighbors method from the pyod module, we show how
anomaly detection works on a fictitious dataset.
ni

Step 1: Importing the required libraries


import numpy as np

from scipy import stats


U

import matplotlib.pyplot as plt

import matplotlib.font_manager
ity

from pyod.models.knn import KNN

from pyod.utils.data import generate_data, get_outliers_inliers

Step 2: Creating the synthetic data


m

# generating a random dataset with two features

X_train, y_train = generate_data(n_train = 300, train_only = True,


)A

n_features = 2)

# Setting the percentage of outliers

outlier_fraction = 0.1
(c

# Storing the outliers and inliners in different numpy arrays

X_outliers, X_inliers = get_outliers_inliers(X_train, y_train)


n_inliers = len(X_inliers)

e
n_outliers = len(X_outliers)

# Separating the two features

in
f1 = X_train[:, [0]].reshape(-1, 1)

f2 = X_train[:, [1]].reshape(-1, 1)

nl
Step 3: Visualising the data
# Visualising the dataset

O
# create a meshgrid

xx, yy = np.meshgrid(np.linspace(-10, 10, 200),

ty
np.linspace(-10, 10, 200))

# scatter plot

si
plt.scatter(f1, f2)

plt.xlabel(‘Feature 1’)

plt.ylabel(‘Feature 2’)
r
ve
Output:
ni
U
ity

Step 4: Training and evaluating the model


m

# Training the classifier

clf = KNN(contamination = outlier_fraction)


)A

clf.fit(X_train, y_train)

# You can print this to see all the prediction scores

scores_pred = clf.decision_function(X_train)*-1

y_pred = clf.predict(X_train)
(c

n_errors = (y_pred != y_train).sum()

# Counting the number of errors



print(‘The number of prediction errors are ‘ + str(n_errors))



e
Output:

in
Step 5: Visualising the predictions
# threshold value to consider a

nl
# datapoint inlier or outlier

threshold = stats.scoreatpercentile(scores_pred, 100 * outlier_fraction)

# decision function calculates the raw

O
# anomaly score for every point

Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()]) * -1

ty
Z = Z.reshape(xx.shape)

# fill blue colormap from minimum anomaly

si
# score to threshold value

subplot = plt.subplot(1, 2, 1)

r
subplot.contourf(xx, yy, Z, levels = np.linspace(Z.min(),
ve
threshold, 10), cmap = plt.cm.Blues_r)

# draw red contour line where anomaly

# score is equal to threshold


ni

a = subplot.contour(xx, yy, Z, levels =[threshold],

linewidths = 2, colors =’red’)


U

# fill orange contour lines where range of anomaly

# score is from threshold to maximum anomaly score

subplot.contourf(xx, yy, Z, levels =[threshold, Z.max()], colors =’orange’)


ity

# scatter plot of inliers with white dots

b = subplot.scatter(X_train[:-n_outliers, 0], X_train[:-n_outliers, 1],

c =’white’, s = 20, edgecolor =’k’)


m

# scatter plot of outliers with black dots

c = subplot.scatter(X_train[-n_outliers:, 0], X_train[-n_outliers:, 1],


)A

c =’black’, s = 20, edgecolor =’k’)

subplot.axis(‘tight’)

subplot.legend(
(c

[a.collections[0], b, c],


[‘learned decision function’, ‘true inliers’, ‘true outliers’],



e
prop = matplotlib.font_manager.FontProperties(size = 10),

loc =’lower right’)

in
subplot.set_title(‘K-Nearest Neighbours’)

subplot.set_xlim((-10, 10))

nl
subplot.set_ylim((-10, 10))

plt.show()

O
Output:

ty
r si
ve
ni

4.1.9 K-Means Clustering


One of the most straightforward unsupervised learning techniques to handle the
well-known clustering problem is k-means. The process uses a predetermined number
U

of clusters (let’s assume k clusters) fixed apriori to categorise a given data set. To
define k centres, one for each cluster, is the main notion. These centres should be
strategically positioned because different locations yield various effects. The preferable
option is to situate them as far apart from one another as you can.
ity

The following phase is connecting each point from a given data set to the closest
centre. The first step is finished and an early group age is finished when there are no
points still open. In order to determine the barycenter of the clusters produced by the
previous step, we must now recalculate k new centroids. The same data set points
m

must now be bound to the closest new centre once we have these k new centroids.

There now exists a loop. This loop may cause the k centres to gradually shift
)A

positions until no more modifications are made, or, to put it another way, the centres
stop moving altogether. The algorithm’s final goal is to minimise the squared error
function, which is represented by:

J(V) = Σ (i = 1 to c) Σ (j = 1 to ci) ( ||xi − vj|| )²

where,
‘||xi − vj||’ is the Euclidean distance between xi and vj,
‘ci’ is the number of data points in the ith cluster, and
‘c’ is the number of cluster centers.

in
Algorithmic steps for k-means clustering

nl
Let X = {x1, x2, x3, ……, xn} be the set of data points and V = {v1, v2, ……, vc} be the set of centers.
1. Randomly select ‘c’ cluster centers.
2. Calculate the distance between each data point and the cluster centers.
3. Assign each data point to the cluster center whose distance from it is the minimum of all the cluster centers.
4. Recalculate the new cluster center using:
   vi = (1 / ci) Σ (j = 1 to ci) xj
   where, ‘ci’ represents the number of data points in the ith cluster and the xj are the points currently assigned to it.
5. Recalculate the distance between each data point and the newly obtained cluster centers.
6. If no data point was reassigned then stop, otherwise repeat from step 3.
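
A minimal from-scratch NumPy sketch of these steps is given below (illustrative only; in practice a library implementation such as scikit-learn’s KMeans would normally be used, and the sample data here is made up):

# Minimal NumPy sketch of the k-means steps listed above (illustrative only)
import numpy as np

def kmeans(X, c, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    V = X[rng.choice(len(X), size=c, replace=False)]   # step 1: random centers
    for _ in range(iters):
        # steps 2-3: distances to all centers, assign each point to the nearest
        d = np.linalg.norm(X[:, None, :] - V[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # step 4: recompute each center as the mean of its assigned points
        newV = np.array([X[labels == i].mean(axis=0) if np.any(labels == i)
                         else V[i] for i in range(c)])
        if np.allclose(newV, V):   # steps 5-6: stop when nothing changes
            break
        V = newV
    return labels, V

X = np.random.default_rng(1).normal(size=(100, 2))
labels, centers = kmeans(X, c=3)
print(centers)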
ni
U
ity
m
)A
(c

Figure: k-means algorithm figure


Advantages and Disadvantages



e
Advantages

in
The K-Means clustering techniques provide the following benefits:

◌◌ It is really simple to comprehend and put into practice .


◌◌ K-means would be quicker than hierarchical clustering if we had a lot of

nl
variables.
◌◌ An instance may switch the cluster upon recalculating the centroids.
◌◌ When compared to hierarchical clustering, K-means produces tighter clusters.

O
Disadvantages
K-Means clustering techniques have the following drawbacks:

ty
◌◌ Predicting the value of k, or the number of clusters, is a little challenging.
◌◌ Initial inputs like the number of clusters have a significant impact on the output
(value of k).

si
◌◌ The ultimate result will be significantly impacted by the data’s order.
◌◌ Rescaling affects it extremely delicately. The output will be radically different if

◌◌
r
we rescale our data using normalisation or standards. final result.
If the clusters have a sophisticated geometric shape, clustering is not
ve
recommended.

Applications of K-Means Clustering Algorithm


The principal objectives of cluster analysis are
ni

●● To make sense of the data we are working with through intuition.


●● Different models will be developed for various subgroups in the cluster-then-
U

predict method.
K-means clustering is functioning well enough to achieve the aforementioned
objectives. It can be applied to the following things:
ity

●● Market segmentation
●● Document Clustering
●● Image segmentation
m

●● Image compression
●● Customer segmentation
●● Analyzing the trend on dynamic data
)A

4.1.10 Hands On: K-Means


Let’s first define the kind of issue we’re trying to tackle before moving on.
Therefore, we have a dataset called Mall_Customers, which contains information about
(c

people who go to the mall and spend money there.


Customer_Id, Gender, Age, Annual Income ($), and Spending Score are all

e
included in the dataset (which is the calculated value of how much a customer has
spent in the mall, the more the value, the more he has spent). Since this method is
unsupervised, we don’t really know what to calculate, so we need to generate some

in
patterns from this dataset.

The following are the measures that must be taken for implementation:

nl
●● Pre-processing of data.
●● The elbow approach is used to determine the ideal number of clusters.

O
●● using the training dataset to train the K-means algorithm.
●● cluster visualisation

Step-1: Data pre-processing Step

ty
Data pre-processing will be the first step. But compared to other models, it will be
different for the clustering issue. Let’s talk about it:

si
Import libraries

As part of data pre-processing, we will start by importing the libraries for our model.
Following is the code:

# importing libraries
r
ve
import numpy as nm

import matplotlib.pyplot as mtp


ni

import pandas as pd

In the code above, we have imported matplotlib to plot the graph, pandas to
manage the dataset, and numpy to conduct mathematical calculations.
U

• Bringing in the data set

The required dataset will then be imported. We are thus using the Mall Customer
data.csv dataset in this instance. The code listed below can be used to import it:
ity

# Importing the dataset

dataset = pd.read_csv(‘Mall_Customers_data.csv’)

We may access our dataset via the Spyder IDE by running the code mentioned
m

above. The dataset appears as shown in the image below:


)A
(c



e
in
nl
O
ty
We need to look for some patterns in the dataset mentioned above.

si
• Extracting Independent Variables

Since this is a clustering problem and we don’t know what to find, we don’t require
r
any dependent variables for the data pre-processing step. Therefore, we will merely
ve
add a line of code for the features matrix.

x = dataset.iloc[:, [3, 4]].values

We are simply extracting the third and fourth features, as can be seen. It’s because
ni

some features, like customer_id, are not necessary when creating a 2d plot to represent
the model.

Step-2: Finding the optimal number of clusters using the elbow method
U

The best number of clusters for our clustering problem will be determined in
the second step. As was said previously, we will employ the elbow approach in this
instance.
ity

As is well known, the elbow approach plots the WCSS values on the Y-axis and the
number of clusters on the X-axis to create the plot using the WCSS idea. Therefore, for
various k values between 1 and 10, we will calculate the value for WCSS. The code is
listed below:
m

#finding optimal number of clusters using the elbow method

from sklearn.cluster import KMeans


)A

wcss_list= [] #Initializing the list for the values of WCSS

#Using for loop for iterations from 1 to 10.

for i in range(1, 11):


(c

kmeans = KMeans(n_clusters=i, init=’k-means++’, random_state= 42)

kmeans.fit(x)


wcss_list.append(kmeans.inertia_)

e
mtp.plot(range(1, 11), wcss_list)

mtp.title(‘The Elbow Method Graph’)

in
mtp.xlabel(‘Number of clusters(k)’)

mtp.ylabel(‘wcss_list’)

nl
mtp.show()

As we can see from the code above, we formed the clusters using the sklearn.

O
cluster library’s KMeans class.

In order to establish an empty list that will hold the value of wcss computed for
various values of k ranging from 1 to 10, we constructed the wcss list variable.

ty
Then, as Python’s for loop excludes the outbound limit, we initialised the for loop to
iterate on a different value of k between 1 and 10; it is taken as 11 to include the 10th
value.

si
As we fitted the model to a matrix of characteristics and then plotted the graph
between the number of clusters and WCSS, the remaining portion of the code is
identical to what we did in earlier sections.
r
The following output will be obtained after running the aforementioned code:
ve
ni
U
ity

We can see from the plot above that the elbow point is at 5. Therefore, there will be
5 clusters in this area.
m
)A
(c


Step- 3: Training the K-means algorithm on the training dataset



e
We can now train the model on the dataset because we know how many clusters
there are.

in
The two lines of code we used in the section above will be used again to train the
model, but this time we will use 5 instead of I because we know that 5 clusters must be
produced. Following is the code:

nl
#training the K-means model on a dataset

kmeans = KMeans(n_clusters=5, init=’k-means++’, random_state= 42)

O
y_predict= kmeans.fit_predict(x)

For constructing an object of the KMeans class, the first line is the same as above.

ty
The dependent variable y_predict was established in the second line of code to
train the model.

We will obtain the y_predict variable by running the code in the aforementioned

si
lines. In the Spyder IDE, we can inspect it using the variable explorer option. Now that
we have our original dataset and y_predict values, we can compare them. Take a look
at the photo below:

r
ve
ni
U

We may now deduce from the above graphic that CustomerID 1 belongs to cluster
3 (because the index starts at 0, 2 will be treated as 3), and that CustomerID 2 belongs
ity

to cluster 4, and so on.

Step-4: Visualizing the Clusters


Visualizing the clusters is the final stage. We will visualise each cluster individually
m

since our model consists of 5 clusters.

The matplotlib function mtp.scatter() will be used to create a scatter plot to visualise
)A

the clusters.

#visulaizing the clusters

mtp.scatter(x[y_predict == 0, 0], x[y_predict == 0, 1], s = 100, c = ‘blue’, label =


‘Cluster 1’) #for first cluster
(c

mtp.scatter(x[y_predict == 1, 0], x[y_predict == 1, 1], s = 100, c = ‘green’, label =


‘Cluster 2’) #for second cluster


mtp.scatter(x[y_predict== 2, 0], x[y_predict == 2, 1], s = 100, c = ‘red’, label =



e
‘Cluster 3’) #for third cluster

mtp.scatter(x[y_predict == 3, 0], x[y_predict == 3, 1], s = 100, c = ‘cyan’, label =

in
‘Cluster 4’) #for fourth cluster

mtp.scatter(x[y_predict == 4, 0], x[y_predict == 4, 1], s = 100, c = ‘magenta’, label =


‘Cluster 5’) #for fifth cluster

nl
mtp.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s = 300,
c = ‘yellow’, label = ‘Centroid’)

O
mtp.title(‘Clusters of customers’)

mtp.xlabel(‘Annual Income (k$)’)

mtp.ylabel(‘Spending Score (1-100)’)

ty
mtp.legend()

mtp.show()

si
We have code for each cluster, ranging from 1 to 5, in the lines of code above. The first coordinate of mtp.scatter, x[y_predict == 0, 0], selects the x values of the rows of the features matrix whose predicted cluster is 0, and y_predict itself takes values from 0 to 4.
r
ve
Output:
ni
U
ity

The five distinct clusters, each with a distinctive colour, are plainly visible in
the resulting image. The dataset’s two parameters, Customer Annual Income and
m

Spending, are where the clusters are generated. The labels and colours can be
changed to suit your needs or preferences. From the patterns mentioned above, we
may additionally note the following points:
)A

●● Cluster 1 displays the clients with average salary and average spending, allowing us to classify these clients as standard customers.
●● Cluster 2 demonstrates that the customers have a high income but minimal spending, so we can classify them as prudent.
●● Cluster 3 shows low income as well as modest spending, qualifying these customers as sensible.


●● Cluster 4 displays the clients who have modest incomes but high spending rates, indicating that they are irresponsible.
●● Cluster 5 identifies the clients who have high incomes and high spending levels, so they can be classified as target customers, the most valuable segment for the mall owner.
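As a quick, optional follow-up (not part of the original walkthrough), the size of each segment can be checked from y_predict; this sketch assumes numpy is available:

# count how many customers fall into each of the five clusters
import numpy as np
labels, counts = np.unique(y_predict, return_counts=True)
print(dict(zip(labels, counts)))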

4.1.11 Association Rule Mining - Part I

Association rule mining is a method whose goal is to find common patterns, correlations, connections, or causal structures in data sets stored in many types of databases, such as relational databases, transactional databases, and other kinds of data repositories.

Given a set of transactions, the goal of association rule mining is to identify the rules that allow us to anticipate the occurrence of a particular item based on the occurrences of the other items in the transaction.

Typically, association rules are shown as: {Bread} → {Butter}

Association rule mining (ARM) is an approach frequently used for market basket analysis, and it is currently the standard method for analysing huge market basket data. However, when many products are sold across a significant volume of transactions, the data matrix needed for association rule mining becomes enormous and sparse, which makes it slower to compute. Association rules express this kind of information in the form of "IF-THEN" statements. To understand the existence, nature, and strength of an association rule, three indices are frequently used: support, confidence, and lift.

Lift is examined first, since it tells us whether an association exists at all and whether it is positive or negative. If the lift value indicates the existence of an association rule, we then obtain the value of support.
The support of an item or itemset is the percentage of transactions in our dataset that contain that item or itemset. It is a crucial measure, since a rule with little support may arise purely by accident. A low-support rule may also be uninteresting from a business standpoint, because it may not be economical to promote goods that are infrequently purchased together. For these reasons, support is frequently used to eliminate uninteresting rules.

support(A → B) = frequency(A ∩ B) / N, where N is the total number of transactions

Confidence is the conditional probability that a transaction containing the LHS will also contain the RHS:

confidence(A → B) = support(A ∩ B) / support(A)

Results of association analyses should be handled with care. An association rule does not automatically imply causality; instead, it suggests a strong co-occurrence between the items in the rule's antecedent and consequent.

The strength of an association rule is gauged by its support and confidence.


Due to the size of the transactional database, there is a high chance of generating an excessive number of irrelevant rules that are not in our best interests. We therefore frequently set minimum support and confidence levels before the analysis, to ensure that only helpful and interesting rules are produced in the outcome.

If lift is larger than 1, the presence of the LHS increases the chance that the items on the RHS will occur in the transaction. If lift is less than 1, the presence of the LHS makes it less likely that the items on the RHS will be purchased. If lift is 1, the presence of the LHS and RHS items are independent: the likelihood that items appear on the RHS is unaffected by the presence of the LHS items.

lift(A → B) = confidence(A → B) / support(B)

We search for rules with a lift of more than one when performing market basket analysis. It is also ideal to have rules with high support, since they will apply to many transactions, and rules with high confidence, in which there is a high likelihood that the RHS items appear given the presence of the LHS items.
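To make these three indices concrete, the following minimal Python sketch (an illustration added here, with hypothetical transactions) computes support, confidence, and lift for one candidate rule:

# support, confidence and lift for a rule LHS -> RHS over toy transactions
transactions = [
    {'Bread', 'Butter', 'Milk'},
    {'Bread', 'Butter'},
    {'Milk', 'Egg'},
    {'Bread', 'Milk'},
    {'Bread', 'Butter', 'Egg'},
]

def support(itemset):
    # fraction of transactions that contain every item in the itemset
    return sum(itemset <= t for t in transactions) / len(transactions)

lhs, rhs = {'Bread'}, {'Butter'}
supp = support(lhs | rhs)       # support of the whole rule
conf = supp / support(lhs)      # conditional probability of RHS given LHS
lift = conf / support(rhs)      # > 1 suggests a positive association

print(supp, conf, lift)         # 0.6 0.75 1.25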

Become familiar with the following terms related to market basket analysis:

●● Antecedent: the item or itemset found in the data that forms the IF component of the rule, written on the left-hand side. In the example above, the antecedent is bread.
●● Consequent: the item or group of items found in conjunction with the antecedent, forming the THEN component written on the right-hand side. In the example above, the consequent is butter.
The principal uses of association rule mining include:

●● Basket Data Analysis: analysing the relationships among items purchased in a single basket or single purchase, as in the examples above.
●● Cross-Marketing: collaborating with businesses that complement your own, not with rivals. For instance, it stands to reason that auto dealers and manufacturers run cross-promotions with the oil and gas industry.
●● Catalogue Design: the items in a company's catalogue are frequently chosen to work well together, so that purchasing one item prompts a subsequent purchase of another; these goods are therefore frequently complementary or closely connected.
The difficult part is sorting through the enormous number of association rules that can be formed from a list of items to find the crucial ones.

Rule generation involves two steps. The first is to generate an itemset, such as {Bread, Egg, Milk}, and the second is to generate a rule from each itemset, such as {Bread → Egg, Milk} or {Bread, Egg → Milk}. The following discussion covers both steps.

1. Generating itemsets from a list of items


Obtaining all the frequent itemsets, on which binary partitions can later be performed to obtain an antecedent and a consequent, is the first stage in the creation of association rules. For instance, if the transactions contain the six items bread, butter, eggs, milk, notebooks, and toothbrush, candidate itemsets will look like {bread, butter}, {eggs, milk}, {notebooks, milk, toothbrush}, {milk, eggs, vegetables}, and so on. An itemset can range in size from a single item to all of the items we have. Now, in order to keep the total number of generated itemsets under control, we look only for the frequent itemsets in this data, not for all of them.

Itemsets that are frequent are those that appear in transactions at least a certain number of times. Technically speaking, these are the itemsets whose support value (the fraction of transactions that contain the itemset) is higher than the minsup threshold.

As a result, {Bread, Notebook} might not be a frequent itemset if it appears only twice in every 100 transactions, so that its support (2/100) = 0.02 is less than the value of minsup.
A brute-force method checks the support value of every conceivable itemset to locate the frequent ones. The Apriori principle makes this search far more efficient; it states that every subset of a frequent itemset must also be frequent.

This is the same as saying that more transactions contain the items {Bread, Egg} than contain the items {Bread, Egg, Vegetables}, never the other way round. If the latter occurs in 30 out of 100 transactions, the former occurs in all of those 30 and possibly a few more. Therefore, if the support value of {Bread, Egg, Vegetables}, (30/100) = 0.3, is above minsup, we can be sure that the support of {Bread, Egg}, (>30/100) = >0.3, is also above minsup. The support value of the itemset formed by removing an item from an itemset will either remain the same or increase. This is known as the anti-monotone property of support.

The Apriori principle enables us to prune all supersets of an itemset that does not meet the minimum support threshold. For instance, if {Milk, Notebook} does not meet our minsup criterion, no itemset formed by adding items to it will either. The procedure is: first form every frequent itemset (support ≥ minsup) consisting of just one item; next, create itemsets of length 2 from all combinations of the preceding itemsets, and remove those whose support value falls below the threshold.

Create length-3 itemsets from all conceivable combinations of the length-2 itemsets that remained after pruning, and carry out the same support check. In this manner, we keep lengthening itemsets by one and checking against the threshold each time. Pruning infrequent itemsets in this way can reduce the number of itemsets to consider by more than 50%, and the saving in computing power becomes ever more important as the number of items rises. The proportion saved also depends on the minimum support level (minsup) that we choose, which is entirely problem-specific and may be based on prior experience.
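The level-wise search described above can be sketched in a few lines of Python. This is an illustrative toy implementation under the definitions given here, not production code:

from itertools import combinations

transactions = [
    {'Bread', 'Butter', 'Milk'}, {'Bread', 'Butter'}, {'Milk', 'Egg'},
    {'Bread', 'Milk'}, {'Bread', 'Butter', 'Egg'},
]
minsup = 0.4

def support(itemset):
    # fraction of transactions containing the whole itemset
    return sum(itemset <= t for t in transactions) / len(transactions)

items = sorted(set().union(*transactions))
# level 1: frequent single items
frequent = [frozenset([i]) for i in items if support(frozenset([i])) >= minsup]
all_frequent = list(frequent)
k = 2
while frequent:
    # join step: length-k candidates from unions of surviving (k-1)-itemsets
    candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
    # prune step: keep only candidates that clear the minsup threshold
    frequent = [c for c in candidates if support(c) >= minsup]
    all_frequent.extend(frequent)
    k += 1

print(all_frequent)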

2. Generating all possible rules from the frequent itemsets
Finding rules from the frequent itemsets is comparatively less difficult. Rules are created from each binary partition of an itemset. If {Bread, Egg, Milk, Butter} is the frequent itemset, the candidate rules will resemble:

(Egg, Milk, Butter → Bread), (Bread, Egg, Milk → Butter), (Bread, Egg → Milk, Butter), and so on.

We seek to select, from the list of all potential candidate rules, those that exceed a minimum confidence level (minconf). Just like the anti-monotone property of support, the confidence of rules derived from the same itemset also follows an anti-monotone property: it is anti-monotone with respect to the number of elements in the consequent.

Accordingly, confidence(A,B,C → D) ≥ confidence(B,C → A,D) ≥ confidence(C → A,B,D).

Recall that confidence(X → Y) = support(X, Y) / support(X).

Since all of the rules produced from the same itemset have the same support in the numerator, the only difference is in the denominator of the confidence calculation. Due to the anti-monotone property of support, as the number of items in X decreases, support(X) grows, and as a result the confidence value falls.

The following will provide a simple explanation for the aforementioned. Think about
F1 and F2:

F1 is the percentage of transactions that additionally included (butter) (egg, milk,


ni

bread)

All three of the items—egg, milk, and bread—will only be able to find a position in a
U

limited proportion of the butter-containing transactions.

F2 is the percentage of transactions that also included (milk, butter, and bread)
(egg)
ity

Compared to only butter, there will be very few transactions that include milk,
butter, and bread, and the likelihood that there will be an egg on those transactions is
very high. Consequently, it will be shown that F1 F2. Pruning is carried out similarly to
how it was done while searching for frequent itemsets using this confidence attribute.
The image below serves as an illustration.
[Figure: confidence-based pruning of the candidate rules generated from a frequent itemset]
Beginning with a frequent itemset such as {a, b, c, d}, we first create rules with just one item in the consequent and eliminate the rules that do not meet the minconf requirement. We then create new rules by combining the consequents of the surviving rules, and repeat until only one item remains in the antecedent. Every frequent itemset must go through this process. Once more, the minimum confidence criterion that we choose depends entirely on the problem at hand.
on the issue at hand.

nl
These two methods have allowed us to pinpoint a set of association rules that meet
the minimal support and minimum confidence requirements. Depending on the values
of minsup and minconf, different numbers of these rules will be produced. Now that a
subset of rules has been created, the highest lift values can be looked for in order to

O
make business judgments.

Maximal frequent itemset: For which none of the immediate supersets are frequent,
it is a frequent itemset. This is comparable to a frequent itemset X to which item y

ty
cannot be added while maintaining {X,y} above the minimum threshold.

r si
ve
Closed frequent itemset: It happens frequently that an itemset does not have a
ni

superset that supports it in the same way. Think about an X-item set. X is NOT a closed
set if ALL instance of X is followed by an instance of Y.
U

Because they are the most compact type of frequent item representation, maximal
frequent itemsets are valuable.

All frequent item sets can be obtained from maximal frequent item sets and their
subsets. The support for the subsets, however, is lost knowledge. Closed frequent
ity

itemset is another approach to express all the frequent itemsets if this value is
necessary. Closed itemsets make it possible to eliminate some unnecessary itemsets
without losing data on the support values.
m

4.1.12 Association Rule Mining - Part II

Apriori Algorithm
)A

This section will describe the algorithm that will power Market Basket Analysis,
an application of association rule mining, which will be implemented using Python
modules. This will aid businesses in better understanding their customers and
analysing their data with greater care and attention. The Apriori algorithm, which was
the first associative algorithm introduced by Rakesh Agrawal, has been employed as a
(c

component of the approach in subsequent advances of association, classification, and


associative classification algorithms.

Amity Directorate of Distance & Online Education


Foundations of Machine Learning 213

The two steps of association rule mining are as follows:


Notes

e
●● Frequent Itemset Generation: Locate all frequent item-sets having a support
level greater than a specified minimum support level. Frequent mining often

in
identifies the intriguing connections and links between item sets in relational and
transactional databases. In a nutshell, frequent mining identifies the elements that
frequently coexist in a transaction or relation. Multiple repetitions are required to

nl
find frequent item sets. The whole training set must be scanned in order to count
new candidate item-sets from existing item sets. Only two crucial processes are
needed, in brief:

O
◌◌ Pruning
◌◌ Joining
●● Rule Generation: enumerate all of the association rules from common item-sets.

ty
Calculate each rule’s support and confidence. Remove rules that don’t pass the
minimum support and confidence levels.
Frequent Itemset Generation searches the entire database for frequent itemsets

si
with a support threshold. It is the most computationally intensive stage because it
scans the entire database. In the actual world, transaction data for retail can amount
to Gigabytes and Terabytes of data. To filter out item sets that won’t be useful in the
following steps, an optimal method is required. The Apriori algorithm is employed in this.
r
ve
Algorithm a priori states “A frequent itemset’s subsets must also be frequent. In
other words, neither the generation nor testing of a superset of an uncommon itemset is
required.

The Apriori algorithm premise is graphically depicted in the image below. It


ni

comprises of a k-item set node and a relation between k-item sets’ subsets. As you can
see in the graphic, the transaction data items are all listed at the bottom, and as you
move up, you start producing subsets until you reach the null set.
U
ity
m
)A

Figure: All possible subset

This demonstrates that obtaining support for each combination will make it tough
to produce frequent item sets. As a result, the Apriori method helps to decrease the
number of sets that must be generated, as seen in the image below.
(c

Amity Directorate of Distance & Online Education


214 Foundations of Machine Learning

Notes

e
in
nl
O
Figure: if an item set is infrequent, we do not consider its super sets

ty
We do not need to take into account all of the super sets of an item-set {a, b} if it
occurs infrequently.

The following example demonstrates why the Apriori method is significantly more

si
efficient and gradually produces stronger association rules.

Step1:
◌◌ r
Make a table with the support counts for every item in the dataset.
ve
◌◌ We get a new set of things when we compare the support count to the
minimum support count (in this example, the minimum support count is 2 and
the support count is less than that).
ni
U
ity

Figure: transactional data to frequent items


m

Step2:
◌◌ This process is called the join step. By cross-joining each piece with another,
we create a new set.
)A

◌◌ Check to see if an itemset’s subsets are frequently occurring, and if not,


eliminate them. For instance, we can see that the subset of {I1, I2} in the
situation below is {I1} {I2} and is frequent. Each itemset must be checked in
the same manner.
(c

◌◌ Now use the dataset to look for the support count of these item-sets.
◌◌ Given that we’ve previously established a 2-support minimum requirement,

Amity Directorate of Distance & Online Education


Foundations of Machine Learning 215

If the support count is lower than the minimum support count, those items
Notes

e
are removed. Otherwise, we compare the minimum support count. gives us
another collection of things, as shown below.

in
nl
O
ty
Figure: pruning and joining

Step3:

si
◌◌ We repeat this process after obtaining a new dataset (I.e. join step). Each
itemset is cross-joined to the others. As a result, the itemset created following
this step will be:
{I1, I2, I3} r
ve
{I1, I2, I4}

{I1, I2, I5}

{I1, I3, I5}


ni

{I2, I3, I4}

{I2, I4, I5}


U

{I2, I3, I5}

◌◌ Check to see if all of these item sets’ subsets are often used; if not, eliminate
them. For instance, the frequently occurring subset of {I1, I2, I3} in this
ity

situation is {I1, I2} {I1, I3} and {I2, I3}. However, one of the subsets for {I2, I3,
I4] is [I3, I4}, which is uncommon. So, we take this out. The same is applied to
each itemset.
◌◌ Find the support count of the remaining item set by scanning the dataset after
m

removing all of the non-frequent item sets.


◌◌ If the support count is lower than the minimum support count, those items
are removed. Otherwise, compare the support count to the minimum support
)A

count. We can see below that it provides us another itemset.


(c

Amity Directorate of Distance & Online Education


216 Foundations of Machine Learning

Notes

e
in
nl
Figure: pruning and joining again until there are no more frequent items left

O
Step4:
◌◌ The same process is repeated. We cross-connect each itemset with the

ty
others in the join step, which comes first. The item set’s first two components
in our case should match.
◌◌ Then, determine whether or not all of these item sets’ subsets are frequent.

si
The itemset created in our example following the join step is {I1, I2, I3, I5}.
So, one of the uncommon subgroups of this itemset is {I1, I3, I5}. There is
therefore no longer an itemset.
◌◌
r
Because there are no longer any common itemsets, we stop here.
ve
The process of association rule mining began with this.

The next stage will be to generate how strong the association rules are and create
a list of all frequent item-sets. We determine each rule’s confidence level for this. We
employ the following formula to determine confidence:
ni
U
ity

Figure: support, confidence and lift calculation

We will demonstrate rule creation by using an example of any common item (we
used {I1, I2, I3}).
m
)A
(c

Figure: calculation of confidence

Amity Directorate of Distance & Online Education


Foundations of Machine Learning 217

Therefore, if the minimum confidence level is 50%, the first three rules in this situation can be regarded as strong association rules. For instance, {I1, I2} => {I3} having a confidence of 50% indicates that 50% of those who purchased I1 and I2 also purchased I3.
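The rule-generation step for a single frequent itemset can likewise be sketched in Python (the transactions below are hypothetical; only the mechanics matter):

from itertools import combinations

# hypothetical transactions over items I1..I3
transactions = [{'I1', 'I2', 'I3'}, {'I1', 'I2'}, {'I2', 'I3'}, {'I1', 'I2', 'I3'}]
freq_itemset = {'I1', 'I2', 'I3'}

def support(s):
    return sum(s <= t for t in transactions) / len(transactions)

# every binary partition of the frequent itemset gives one candidate rule
for r in range(1, len(freq_itemset)):
    for lhs in combinations(freq_itemset, r):
        lhs = set(lhs)
        rhs = freq_itemset - lhs
        conf = support(freq_itemset) / support(lhs)
        print(sorted(lhs), '=>', sorted(rhs), 'confidence =', round(conf, 2))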

in
4.1.13 Hands On: Association Rule (Apriori)

nl
Association Rules for Market Basket Analysis in R: Step-by-Step Implementation

Loading Required Libraries/Package

O
The ‘arules’ package of the R programming language is used to represent, work with, and examine transaction data and patterns. It performs a market basket analysis (MBA) on the data using frequent itemsets and association rules.

ty
r si
ve
Terminal Output:
ni
U
ity
m
)A
(c

Amity Directorate of Distance & Online Education


218 Foundations of Machine Learning

Loading Dataset/Transactional Dataset


Notes

e
Next, the transactional dataset is loaded and converted into the transaction representation that the ‘arules’ package works with; each record pairs a transaction identifier with an item purchased in that transaction.

nl
Input Data

O
ty
r si
ve
ni
U

Performing Apriori Algorithm And Generating Association Rules


The Apriori algorithm can be used to produce the association rules; the arules package provides it through the apriori() function, one of the popular methods in market basket analysis. It accepts a range of input parameters, including:

◌◌ support threshold (supp)

◌◌ confidence threshold (conf)

◌◌ target = ‘rules’

◌◌ minimum length of rules (minlen)

◌◌ data with two key columns: Transaction Id and Item name


(c

Amity Directorate of Distance & Online Education


Foundations of Machine Learning 219

Notes

e
in
nl
Terminal Output:

O
ty
r si
ve
The status of the completed operations is displayed together with the parameter specification and the minimum support count.

Association Rules Output:


U
ity
m
)A
(c

Amity Directorate of Distance & Online Education


220 Foundations of Machine Learning

In this case, the LHS stands for items that have already been taken, and the RHS for items that are usually taken together with them. We can infer from this analysis that some common pairings of items exist, such as “Coffee” and “Toast”, and “Coffee” and “Alfajores.”
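For readers who prefer Python to R, a roughly equivalent market basket workflow can be sketched with the mlxtend library; this is an illustration only (mlxtend is not used elsewhere in this text, and the transactions are made up):

# minimal Python sketch of the same analysis (pip install mlxtend)
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

transactions = [['Coffee', 'Toast'], ['Coffee', 'Alfajores'],
                ['Coffee', 'Toast', 'Alfajores'], ['Tea', 'Toast']]

# one-hot encode the transactions into a boolean dataframe
te = TransactionEncoder()
df = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)

# frequent itemsets above a support threshold, then rules above a confidence threshold
frequent_itemsets = apriori(df, min_support=0.25, use_colnames=True)
rules = association_rules(frequent_itemsets, metric='confidence', min_threshold=0.5)
print(rules[['antecedents', 'consequents', 'support', 'confidence', 'lift']])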

in
4.1.14 Principal Component Analysis

nl
One of the main applications of the SVD is principal component analysis (PCA), which offers a data-driven, hierarchical coordinate system in which to describe high-dimensional correlated data. This coordinate system involves the correlation matrices. Before running the SVD, PCA pre-processes the data by mean subtraction and by setting the variance to unity. The geometry of the resulting coordinate system is determined by principal components (PCs) that have maximal correlation with the data and minimal correlation with each other. Pearson introduced this theory in 1901, and Hotelling independently developed it in the 1930s. Jolliffe offers a useful reference book.

Typical PCA Algorithm Terms Include:

si
●● Dimensionality: the number of features or variables in the dataset in question; it is most easily determined as the number of columns in the dataset.
●● Correlation: denotes how closely two variables are related, i.e., if one variable changes, the other changes with it. The correlation value lies between -1 and +1, where -1 denotes an inverse relationship and +1 a direct relationship between the variables.
●● Orthogonal: states that the variables are unrelated to one another, hence the correlation between them is zero.

●● Eigenvector and Eigenvalue: Consider a non-zero vector v. The vector v is referred to as an eigenvector of a square (n x n) matrix A if multiplying A by v yields a scalar multiple of v. The following equation represents this: Av = λv. The eigenvector here is v, and the associated eigenvalue is λ. The direction of the vector does not change under this linear transformation, only its scale, and the eigenvector must be non-null. Here is the equation once more for quick reference: (Square matrix A * EigenVector) − (EigenValue * EigenVector) = 0. Eigenvectors and eigenvalues help us comprehend and analyse data, so we employ them to transform data and present it in a more comprehensible manner; both ideas are essential for data science (see the numpy sketch just after this list).

essential for carrying out data science.


●● Covariance Matrix: The Covariance Matrix is a matrix that represents the
covariance between the two variables.
)A

Principal Components in PCA


The Principal Components are the newly altered characteristics or the result
of PCA, as previously said. These PCs are either the same number or fewer than
the initial characteristics that were included in the dataset. Following are a few
(c

characteristics of these primary components:

●● The linear combination of the unique traits must be the major component.

Amity Directorate of Distance & Online Education


Foundations of Machine Learning 221

●● Because these components are orthogonal, there is no association between any


Notes

e
two variables.
●● Going from 1 to n, the importance of each component declines, making 1PC the

in
most important and nPC the least important.
In a typical experiment, a number of measurements are gathered, and these
measurements are organised into a row vector. The measures could be characteristics

nl
of an observable, like the demographics of a particular human being. Each
measurement vector is organised as a row in a big matrix X following a number of
experiments. Polling may be used to collect data for experiments, like in the case of

O
demography. Even if it has more rows than columns or vice versa, the matrix will still be
n m in size.

Computation The mean of all rows, or the row-wise mean x¯, is now calculated,
and it is deducted from X. Given by is the mean x¯.

ty
si
and the mean matrix is

r
ve
Subtracting X¯ from X results in the mean-subtracted data B:

B = X − B¯ .
ni

The covariance matrix of the rows of B is given by


U

The first principal component u1 is given as


ity

This corresponds to the biggest eigenvalue of B*B’s eigenvector. Since the highest
singular value corresponds to the left singular vector of B, u1, it is now obvious what it
is.
m

By using C’s eigen-decomposition, it is feasible to extract the primary components:

CV = VD,
)A

This, given that C is Hermitian, must exist.

pca Command For the principal components analysis, Matlab provides the extra
functions pca and princomp (based on pca):
(c

>> [V,score,s2] = pca(X);


Up to changes in column sign, the matrix V is equal to the V matrix from the SVD of X. The vector s2 contains the squares of the singular values, which are the eigenvalues of the covariance of X, sometimes referred to as the principal component variances. The variable score contains the coordinates of each row of B (the mean-subtracted data) in the principal component directions.
in
all that are included in the variable score.

4.1.15 Hands On: PCA

nl
The dataset for the breast cancer example given below is easily accessible via the
Python sklearn.datasets module.

O
Step-1: Loading all necessary packages
import pandas as pd

ty
import numpy as np

import matplotlib.pyplot as plt

import seaborn as sns

si
%matplotlib inline

from sklearn.datasets import load_breast_cancer


r
from sklearn.preprocessing import StandardScaler
ve
from sklearn.decomposition import PCA

from mpl_toolkits import mplot3d


ni

Step-2: Loading dataset


#loads breast cancer dataset into variable by name cancer.

cancer = load_breast_cancer()
U

# creating dataframe

df = pd.DataFrame(cancer['data'], columns = cancer['feature_names'])


ity

# checking head of dataframe

df.head()

The data is loaded into a variable called cancer using this piece of code. The top
m

5 rows of the data frame constructed using these data are then printed. As a result, we
obtain the following.
)A
(c

Fig: top five rows of the dataset

Amity Directorate of Distance & Online Education


Foundations of Machine Learning 223

So it is obvious that there are 30 columns, each of which contains numerical


Notes

e
numbers. Thus, PCA can be used immediately.

Step-3: Standardizing and applying PCA

in
scalar = StandardScaler()

# Standardizing

nl
scalar.fit(df)

scaled_data = scalar.transform(df)

O
# applying PCA

pca = PCA(n_components = 3)

ty
pca.fit(scaled_data)

x_pca = pca.transform(scaled_data)

x_pca.shape

si
In this case, normalising the data (i.e., df) and running PCA on it are what we’re
doing. There, n-components is a representation of the desired number of major
components, or new features.
r
ve
Step-4: 2-D results
plt.figure(figsize =(8, 6))

plt.scatter(x_pca[:, 0], x_pca[:, 1], c = cancer[‘target’])


ni

# labeling x and y axes

plt.xlabel(‘First Principal Component’)


U

plt.ylabel(‘Second Principal Component’)

This stage consists just of displaying the data’s 2-Dimensional graphic, which
seems something like this.
ity
m
)A
(c

Figure: 2-D results of PCA applied upon breast cancer dataset

Amity Directorate of Distance & Online Education


224 Foundations of Machine Learning

Step-5: 3-D results


Notes

e
ax = plt.axes(projection=’3d’)

ax.scatter(x_pca[:, 0], x_pca[:, 1], x_pca[:, 2], c=cancer[‘target’], cmap=’viridis’,

in
linewidth=1);

Things start to become intriguing at this point. The code for a 3D plot is shown

nl
above. These are the output images that we obtained.

O
ty
si
Fig: 3-D results of PCA applied upon breast cancer dataset.

By rotating the plot r


ve
ni
U
ity

Fig: Plot emphasizing on second and third principal components


m
)A
(c

Fig: Plot emphasizing on First and third principal components

Amity Directorate of Distance & Online Education


Foundations of Machine Learning 225

Notes

e
in
nl
O
ty
Fig: Plot emphasizing on First and Second principal components

Conclusion:

si
Variation across the first component > variation across the second component > variation across the third component.

r
They fall between the range of -7 to 20, or range = 27, for the first primary
component. They fall between the range of -9 to 13, or range = 22, for the second main
ve
component. They fall somewhere between -6 and 10, or a range of 16, for the third
major component.

4.1.16 Singular Value Decomposition


ni

A matrix is factored into three matrices in the Singular Value Decomposition (SVD)
process. It communicates significant geometrical and theoretical insights regarding
linear transformations and has several intriguing algebraic characteristics. Additionally,
U

data science uses it in some significant ways.

Mathematics behind SVD


ity

The following formula yields the SVD of the mxn matrix A:

A = UWVT

where:
m

◌◌ U: mxn matrix of the orthonormal eigenvectors of AAT.


◌◌ VT: transpose of a nxn matrix containing the orthonormal eigenvectors of
A^{T}A.
)A

◌◌ W: a nxn diagonal matrix of the singular values which are the square roots of
the eigenvalues of ATA.
(c

Amity Directorate of Distance & Online Education


226 Foundations of Machine Learning

Examples
Notes

e
◌◌ Find the SVD for the matrix A = [[3, 2, 2], [2, 3, −2]].
◌◌ To calculate the SVD, we first compute the singular values by finding the eigenvalues of AA^T. Here AA^T = [[17, 8], [8, 17]].

The characteristic equation for the above matrix is det(AA^T − λI) = 0, i.e. λ² − 34λ + 225 = (λ − 25)(λ − 9) = 0,

so our singular values are: σ_1 = 5; σ_2 = 3.

si
◌◌ We now identify the corresponding singular vectors, i.e. the orthonormal set of eigenvectors of A^T A; since A^T A is symmetric, we know that the eigenvectors will be orthogonal. Its eigenvalues are 25, 9, and 0, and

A^T A = [[13, 12, 2], [12, 13, −2], [2, −2, 8]].

For λ = 25, the system (A^T A − 25I)v = 0 can be reduced by row operations to v_1 = v_2, v_3 = 0. A unit vector pointing in that direction is:

v1 = (1/√2, 1/√2, 0)^T

Similarly, the eigenvector for λ = 9 is:

v2 = (1/√18, −1/√18, 4/√18)^T

We can use the third eigenvector's perpendicularity to v1 and v2, i.e. v3 · v1 = 0 and v3 · v2 = 0, and solving these equations generates the third eigenvector:

v3 = (2/3, −2/3, −1/3)^T

Now, we calculate U using the formula u_i = (1/σ_i) A v_i, and this gives

U = (1/√2) [[1, 1], [1, −1]].

As a result, the final SVD equation is:

A = U W V^T, with W = [[5, 0, 0], [0, 3, 0]] and V = [v1, v2, v3].
Applications
●● Calculation of Pseudo-inverse: The pseudo-inverse, or Moore-Penrose inverse, is the generalisation of the matrix inverse to matrices that may not be invertible (such as low-rank matrices). When a matrix is invertible, its pseudo-inverse equals its inverse, but the pseudo-inverse also exists for matrices that are not invertible. It is denoted by A+.
ve
Let's say we need to determine a matrix M's pseudo-inverse. The SVD of M can be given as:

M = UWV^T

Multiply both sides on the left by M^{-1}:

M^{-1}M = M^{-1}UWV^T

I = M^{-1}UWV^T

Multiply both sides on the right by V:

V = M^{-1}UWV^TV

V = M^{-1}UW

Multiply by W^{-1}. Since W is the diagonal matrix of singular values, the inverse of W = diag(a1, a2, a3, … an) is

W^{-1} = diag(1/a1, 1/a2, 1/a3, … 1/an)

VW^{-1} = M^{-1}UWW^{-1}

VW^{-1} = M^{-1}U

Multiply by U^T:

VW^{-1}U^T = M^{-1}UU^T

VW^{-1}U^T = M^{-1} = M^+

The pseudo-inverse is given by the equation above.
VW-1UT = M-1UUT
(c

VW-1UT = M-1 = M+

The pseudo-inverse is given by the equation above.

Amity Directorate of Distance & Online Education


228 Foundations of Machine Learning

●● Solving a set of Homogeneous Linear Equation (Mx =b): Calculate SVD and select
Notes

e
any column of VT that has a singular value (in W) of zero if b=0.
If b≠0, Mx = b

in
Multiply by M-1

M-1Mx = M-1b

nl
X = M-1b

From the Pseudo-inverse, we know that M-1 = VW-1UT

O
Hence x = VW-1UTb

●● Rank, Range, and Null space:


◌◌ By counting the number of nonzero singular values, the rank of matrix M can

ty
be determined from SVD.
◌◌ The left singular vectors of U that correspond to the non-zero singular values
are the range of matrix M.

si
◌◌ The right singular vectors of V, which correspond to the zeroed singular
values, make up the null space of matrix M.
M = UWVT

◌◌
r
Curve fitting problem: Singular value decomposition can be used to reduce
ve
the least square error in a curve fitting problem. It makes an approximation
using the pseudo inverse.
◌◌ Singular value decomposition and pseudo-inverse can also be employed
in digital signal processing and image processing in addition to the
ni

aforementioned use.

4.1.17 Hands On: SVD


U

Using Numpy and Scipy, we will attempt to calculate the Singular Value
Decomposition in this code. Both SVD and pseudo-inverse calculations will be done.
Finally, we may use SVD to compress the image.
ity

# Imports

import numpy as np

from scipy.linalg import svd


m

“””

Singular Value Decomposition


)A

“””

# define a matrix

X = np.array([[3, 3, 2], [2,3,-2]])


(c

print(X)

# perform SVD

Amity Directorate of Distance & Online Education


Foundations of Machine Learning 229

U, singular, V_transpose = svd(X)


Notes

e
# print different components

print("U: ", U)

in
print("Singular array", singular)

print("V^{T}", V_transpose)

nl
“””

Calculate Pseudo inverse

O
“””

# inverse of singular matrix is just the reciprocal of each element

ty
singular_inv = 1.0 / singular

# create m x n matrix of zeroes and put singular values in it

s_inv = np.zeros(X.shape)

si
s_inv[0][0]= singular_inv[0]

s_inv[1][1] =singular_inv[1]

# calculate pseudoinverse
r
ve
M = np.dot(np.dot(V_transpose.T,s_inv.T),U.T)

print(M)
ni

“””

SVD on image compression


U

“””

import numpy as np

import matplotlib.pyplot as plt


ity

from skimage import data

from skimage.color import rgb2gray

cat = data.chelsea()
m

plt.imshow(cat)

# convert to grayscale
)A

gray_cat = rgb2gray(cat)

# calculate the SVD and plot the image

U,S,V_T = svd(gray_cat, full_matrices=False)


(c

S = np.diag(S)

fig, ax = plt.subplots(5, 2, figsize=(8, 20))

Amity Directorate of Distance & Online Education


230 Foundations of Machine Learning

curr_fig=0
Notes

e
for r in [5, 10, 70, 100, 200]:

cat_approx =U[:, :r] @ S[0:r, :r] @ V_T[:r, :]

in
ax[curr_fig][0].imshow(256-cat_approx)

ax[curr_fig][0].set_title(“k = “+str(r))

nl
ax[curr_fig,0].axis(‘off’)

ax[curr_fig][1].set_title(“Original Image”)

O
ax[curr_fig][1].imshow(gray_cat)

ax[curr_fig,1].axis(‘off’)

ty
curr_fig +=1

plt.show()

si
Output:
[[ 3 3 2]

[ 2 3 -2]]
r
ve
---------------------------

U: [[-0.7815437 -0.6238505]

[-0.6238505 0.7815437]]
ni

---------------------------

Singular array [5.54801894 2.86696457]


U

---------------------------

V^{T} [[-0.64749817 -0.7599438 -0.05684667]

[-0.10759258 0.16501062 -0.9804057 ]


ity

[-0.75443354 0.62869461 0.18860838]]

--------------------------

# Inverse
m

array([[ 0.11462451, 0.04347826],

[ 0.07114625, 0.13043478],
)A

[ 0.22134387, -0.26086957]])

---------------------------
(c

Amity Directorate of Distance & Online Education


Foundations of Machine Learning 231

Notes

e
in
nl
O
ty
si
4.1.18 LOF
r
The local outlier factor (LOF) is an algorithm for unsupervised outlier detection. It generates an anomaly score that indicates which data points in the data set are outliers. This is accomplished by calculating the local density deviation of a given data point relative to its neighbouring data points.

Working of LOF: Local density is calculated from the distances between nearby data points (the k-nearest neighbours), so a local density can be determined for each data point. By comparing these densities, we can determine which data points have densities comparable to their neighbours and which have lower densities; outliers are those whose densities are well below those of their neighbours. To find each point's k-nearest neighbours, the k-distances (distances between points) are first determined; for example, a point's 2-distance is the distance to its second-closest neighbour. Here is an image showing the k-distances between various neighbours in a point cluster:
k-nearest neighbours, k-distances—or distances between points—are first determined.
The second-closest point is referred to as the point’s second-closest neighbour. Here is
an image showing the k-distances between various neighbours in a point cluster:
ity
m
)A
(c

The reachability distance is defined using this k-distance: it is the maximum of the k-distance of the central point and the actual distance between the two points, reachability-distance_k(A, B) = max(k-distance(B), d(A, B)), where B represents the central point and A represents a nearby point.

Here is an illustration of the reachability distance between a spot and its nearby

in
neighbours:

nl
O
As seen in the graphic above, the k-distance is taken into account for points inside
the circle, while the distance between points is taken into account for points outside the
cluster. The Local Reachability Density (LRD) of a point is now derived by calculating
the reachability distances to all of its k-nearest neighbours. The inverse of the total

ty
sum of the reachability distances of all the k-nearest surrounding points can be used
to determine the local reachability density, which is a measurement of the density
of k-nearest points around a point. The inverse is used in the equation because the

si
distance is smaller and the density is greater the closer the points are to one another.

r
ve
By dividing the average of the lrds of k neighbours’lrds by the lrd of the point in
question, the local outlier factor (LOR) is calculated. The LOR equation is as follows:
ni

As a result, in the equation, if the point’s density and that of its neighbours are
U

nearly equal, we can say that they are quite similar; if the point’s density and that of its
neighbours are not equal, we can say that the point is an inlier, or inside the cluster; and
if the point’s density and that of its neighbours are not equal, we can say that the point
is an outlier. Consider the following example:
ity

LOF ≈ 1 => similar data point
LOF < 1 => inlier (a similar data point inside a dense cluster)
LOF > 1 => outlier

Here is an illustration of the LOF plot on a set of data:


m
)A
(c

Amity Directorate of Distance & Online Education


Foundations of Machine Learning 233

Advantages:
Notes

e
◌◌ Finding outliers might be challenging at times. A point that is close to a highly
dense cluster may be regarded as an outlier, whereas a place that is farther

in
away from a cluster that is more widely dispersed may be regarded as an
inlier. This problem is resolved by identifying outliers in small areas using
LOR.

nl
◌◌ The method employed in LOF can be utilised to tackle problems of
recognising outliers in many other disciplines, such as geographic data, video
streams, etc.

O
◌◌ Another dissimilarity function can be implemented using the LOF. It is also
discovered to perform better than several other algorithms for anomaly
identification.

ty
Disadvantages:
◌◌ The LOF score that determines whether a point is an outlier or not varies from
case to case. For various data sets, it might change.

si
◌◌ The LOF algorithm’s detection accuracy degrades in greater dimensions.
◌◌ It may be a bit difficult to comprehend how to discriminate between inliers
and outliers using the LOF score because it can be any number that the ratio
generates. r
ve
4.1.19 T-Distributed Stochastic Neighbour Embedding (t-SNE)
The machine learning algorithm t-distributed stochastic neighbour embedding
(t-SNE), created by Laurens van der Maaten and Geoffrey Hinton, is used for data
ni

visualisation. When embedding high-dimensional data for display in a low-dimensional


environment of two or three dimensions, t-SNE is a nonlinear dimensionality
reduction technique that is well-suited and adaptable. In other words, it models each
U

high-dimensional object by a two- or three-dimensional point in a way that models


comparable objects by points adjacent to each other in clusters and dissimilar objects
by distant points outside the local clusters or in other clusters.

There are two main steps in the t-SNE algorithm. First, t-SNE creates a probability
ity

distribution across pairs of high-dimensional objects in which picking similar objects is


more likely than picking different points, and vice versa. Second, t-SNE minimises the
Kullback-Leibler divergence between the two distributions with regard to the positions of
the points in the map by defining a similar probability distribution over the points in the
m

low-dimensional map.

Data may be understood and solutions for problems in many different sectors
can be developed with the aid of high-dimensional data visualisation. For instance,
)A

according to Street et al., breast cancer-related cell nuclei are described by about
30 different variables, whereas wordcount vectors and pixel intensity vectors used to
represent text documents typically comprise hundreds of dimensions. Finding the
most important variables in this kind of situation is crucial in order to visualise the relationships between the data, as manually looking for relationships between all the various factors can take a very long time.

various factors can take a very long period.

Amity Directorate of Distance & Online Education


234 Foundations of Machine Learning

Applications for t-SNE include computer security analysis, music analysis, and
Notes

e
cancer research, among many others. It is used to depict high-dimensional data, which
can have tens to hundreds of variables, as was previously stated. One advantage of
t-SNE is that, in comparison to previous methods like Sammon mapping or Locally

in
Linear Embedding, LLE, it is better at producing a single map that displays structure
at numerous scales. This is crucial for high-dimensional data that is distributed
across several related but dissimilar low-dimensional manifolds, such as photographs

nl
of objects from various classes seen from various angles. However, as seen in a
comparison with UMAP, another technique for high-dimensional data display, it is not
the fastest.

O
Let’s first examine SNE, or stochastic neighbour embedding, to better grasp how
t-SNE functions. SNE begins by translating high-dimensional Euclidean distances
between data points into conditional probabilities that describe the similarities between

ty
data points. The conditional probability, pji, that xi would choose xj as its neighbour if
neighbours were chosen in proportion to their probability density under a Gaussian
distribution centred at xi, is what determines how similar two data points are. While pji
will be virtually endlessly small for data points that are far apart, it is reasonably large

si
for data points that are close together. The conditional probability, pji, is determined
mathematically by:

r
ve
ni
U
ity

Figure: Example t-SNE map in two-dimensions.


m
)A

where the Gaussian distribution is centred on data point xi and σi is its variance. The values of pi|i are set to zero because the objective primarily involves modelling pairwise similarities. A similar conditional probability, denoted by qj|i, can be calculated for the low-dimensional counterparts (map points) yi and yj of the high-dimensional data points xi and xj. The conditional probabilities qj|i are computed using a Gaussian distribution with a variance set to 1/√2. Consequently, we may model the similarity between map points yj and yi as follows:
a Gaussian distribution with a variance set to 1/√2. Consequently, we may model the
similarity between map points yj and yi as follows:

Amity Directorate of Distance & Online Education


Foundations of Machine Learning 235

Notes

e
in
Again, qj|i is equal to 0 because the modelling of pairwise similarities is the main
focus. The conditional probabilities pj|i and qj|i will be equal if the map points yi and
yj accurately depict the similarity between the high-dimensional data points xi and xj.

nl
This fact serves as the inspiration for SNE, which seeks to identify a low-dimensional
data representation that reduces the discrepancy between pj|i and qj|i. SNE employs a
gradient descent technique to reduce the total Kullback-Leibler divergences across all

O
data points. As a result, the cost function C is provided by:

ty
The distribution utilised to calculate the similarity scores is the last component
of t-SNE. The normal or Gaussian distributions discussed above have a similar
appearance to the Student t-distribution, which takes their place in the t-SNE. The

si
Student t-distribution differs from them in that the tails are taller at the extremities and
the centre is not as tall, as shown in Figure (Difference between a normal distribution
and a Student t-distribution). Geoffrey Hinton and Laurens van der Maaten concur. The

r
student t-distribution is chosen because it has the incredibly desirable quality of being
(nearly) invariant to changes in the map’s scale for locations that are far apart on the
ve
map. This implies that the method behaves uniformly for all points, and that big clusters
comprising numerous low density points interact uniformly with each other.
ni
U

Figure: Difference between a normal distribution and a Student t-distribution


ity

Hyperparameters
The hyperparameters that the t-SNE algorithm uses are given and described
below.
m

Perplexity: An adjustable parameter used in t-SNE specifies how to balance focus on


local and global elements of your data. In a way, the parameter is an educated guess
)A

as to how many near neighbours each point has. The number of k nearest neighbours
increases when there are several perplexities. The resultant images are complicated
by the perplexity value. The performance of SNE is fairly resilient to changes in the
perplexity, and typical values are between 5 and 50, according to the original study
by van der Maaten. However, the narrative is more complex than that. Analyzing
(c

multiple plots with various perplexities in addition to changes in other parameters


may be necessary to get the most out of t-SNE. In the API utilised in this report, the
learning rate for t-SNE typically falls between 10 and 1000. The data may resemble an

Amity Directorate of Distance & Online Education


236 Foundations of Machine Learning

extremely dispersed cluster with each point being roughly separated from its nearest
Notes

e
neighbours if the learning rate is too high. Since too many significant modifications
are applied too soon, many points also detach from their local clusters. However, if
the learning rate is too slow, most map points might appear compressed into a small,

in
distinct ball with few outliers. Increasing the learning rate may assist resolve this
problem if the cost function becomes trapped in a poor local minimum. Looking at the
map and finding it difficult to discern any distinct divisions between the various classes

nl
can indicate a too high learning rate. Since the t-SNE approach is iterative, it is crucial
to allow for enough iterations to allow the map to converge to a state where any further
changes are minute.

O
Attraction and Repulsion Over Iterations
Each time a map iterates, points are shifted about to show how similar points
attract one another and dissimilar points repel one another. Physically, the gradient

ty
can be thought of as the force produced when a set of springs is placed between the
map point yi and all of the other map points yj. There is a force applied by all springs
in the direction (yi− yj). Depending on whether the distance between the two map

si
points is too little or large to indicate the similarities between the two high-dimensional
data points, the spring between yi and yj repels or attracts the map points. The spring
between yi and yj exerts force that is proportional to both its length and stiffness, which
r
corresponds to the mismatch (pj|i− qj|i + pi|j− qi|j) between the pairwise similarities
of the data points and the map points. Figure provides an instance of this (Step
ve
by step illustration of how the points from every class attracts and repel points from
other classes each iteration.). When the method repeats over all locations, the map
eventually develops clusters of various sizes that bring together related objects close to
one another.
ni
U
ity
m
)A
(c

Figure: Step by step illustration of how the points from every class attracts and
repel points from other classes each iteration.
Amity Directorate of Distance & Online Education
Foundations of Machine Learning 237

4.1.20 Hands On: t-SNE


Notes

e
A method for displaying high-dimensional data is called T-distributed Stochastic
Neighbor Embedding (T-SNE). T-SNE is a nonlinear dimensionality reduction technique

in
to visualise data in a two- or three-dimensional space. It is based on stochastic
neighbour embedding.

To visualise data using the T-SNE method, the Scikit-learn API offers the TSNE

nl
class. Here, we’ll quickly go over how to use Python’s TSNE library to fit and visualise
data.

Let’s begin by loading the necessary libraries and functions:

O
from sklearn.manifold import TSNE

from keras.datasets import mnist

ty
from sklearn.datasets import load_iris

from numpy import reshape

si
import seaborn as sns

import pandas as pd

r
We’ll acquire the data and label specific portions of the Iris dataset after it has
loaded.
ve
iris = load_iris()

x = iris.data
ni

y = iris.target

The number of target dimensions is defined by the n_components option in


the TSNE class, which is used to define the model. We can inspect the log data
U

because’verbose=1’ displays it.

tsne = TSNE(n_components=2, verbose=1, random_state=123)

z = tsne.fit_transform(x)
ity

[t-SNE] Computing 91 nearest neighbors...

[t-SNE] Indexed 150 samples in 0.001s...

[t-SNE] Computed neighbors for 150 samples in 0.006s...


m

[t-SNE] Computed conditional probabilities for sample 150 / 150

[t-SNE] Mean sigma: 0.509910


)A

[t-SNE] KL divergence after 250 iterations with early exaggeration: 48.021526

[t-SNE] KL divergence after 1000 iterations: 0.122989

The outcome will then be represented in a plot. The output component data will be
(c

gathered in a dataframe, and the scatterplot() function of the “seaborn” library will be
used to plot the data. Because there are three different categories in the label data, we
will set the scatter plot’s colour palette to 3.

Amity Directorate of Distance & Online Education


238 Foundations of Machine Learning

df = pd.DataFrame()
Notes

e
df[“y”] = y

df[“comp-1”] = z[:,0]

in
df[“comp-2”] = z[:,1]

sns.scatterplot(x=”comp-1”, y=”comp-2”, hue=df.y.tolist(),

nl
palette=sns.color_palette(“hls”, 3),

data=df).set(title=”Iris data T-SNE projection”)

O
ty
r si
ve
ni

MNIST Dataset TSNE Fitting and Visualizing


U

We’ll next use the larger dataset to use the same methodology. For this task, the
MNIST handwritten digit dataset from the Keras API performs well. We just extract the
train portion of the dataset because TSNE can be used to test the data in this area. We
will only use 3000 rows because processing the TSNE will take too long.
ity

(x_train, y_train), (_ , _) = mnist.load_data()

x_train = x_train[:3000]

y_train = y_train[:3000]
m

print(x_train.shape)

(3000, 28, 28)


)A

We will restructure the three-dimensional MNIST data into a two-dimensional one.

print(x_train.shape)

x_mnist = reshape(x_train, [x_train.shape[0], x_train.shape[1]*x_train.shape[2]])


(c

print(x_mnist.shape)

(3000, 784)

Amity Directorate of Distance & Online Education


Foundations of Machine Learning 239

We have data on 784 features here. We’ll now use TSNE to project it into two
Notes

e
dimensions so that we can see it in a plot.

tsne = TSNE(n_components=2, verbose=1, random_state=123)

in
z = tsne.fit_transform(x_mnist)

df = pd.DataFrame()

nl
df[“y”] = y_train

df[“comp-1”] = z[:,0]

O
df[“comp-2”] = z[:,1]

sns.scatterplot(x=”comp-1”, y=”comp-2”, hue=df.y.tolist(),

palette=sns.color_palette(“hls”, 10),

ty
data=df).set(title=”MNIST data T-SNE projection”)

[t-SNE] Computing 91 nearest neighbors...

si
[t-SNE] Indexed 3000 samples in 0.922s...

[t-SNE] Computed neighbors for 3000 samples in 10.601s...


r
[t-SNE] Computed conditional probabilities for sample 1000 / 3000
ve
[t-SNE] Computed conditional probabilities for sample 2000 / 3000

[t-SNE] Computed conditional probabilities for sample 3000 / 3000

[t-SNE] Mean sigma: 480.474473


ni

[t-SNE] KL divergence after 250 iterations with early exaggeration: 78.815109

[t-SNE] KL divergence after 1000 iterations: 1.261612


U
ity
m
)A

The MNIST data are displayed in two dimensions on the plot. The target digits and
the placement of their feature data in 2D space are determined by the colours.
(c

Amity Directorate of Distance & Online Education


240 Foundations of Machine Learning

4.2 Case Study


Notes

e
4.2.1 Healthcare Case: Early Prediction of Neonatal Hyperbilirubinemia
Neonatal jaundice caused by hyperbilirubinemia is the most frequent condition in newborns and one of the main causes of hospitalisation in the first year of life. A considerable portion of newborn infants with significant hyperbilirubinemia, defined as a bilirubin level above the 95th percentile at a specific age in hours (1, 2), require phototherapy treatment. This condition affects around 10% of newborn infants.

In nations with high rates of glucose-6-phosphate dehydrogenase (G6PD)


deficiency, a known risk factor for newborn jaundice, costs for national health care
systems are consequently considerable (4). (5). Neonatal jaundice, if not properly

ty
treated, can result in severe disability and lifelong consequences (6,7). On the other
hand, population-based research have revealed that phototherapy treatments raise
the risk of allergy disorders in children. Therefore, for infants who are at high risk

si
of developing substantial hyperbilirubinemia, both careful patient monitoring and
thoughtful treatment assignment are needed in the setting of neonatal jaundice.

Nomograms, which are still often used to divide newborns into risk categories,
r
were first presented in 1999 by Bhutani et al. for the assessment of neonatal
ve
hyperbilirubinemia. Nomograms are based on percentiles of bilirubin readings at
a certain age in hours. Additional clinical indicators for the prediction of newborn
hyperbilirubinemia shortly after birth, before to discharge, or during hospital
phototherapy are included in more recent risk stratification methods. Although risk
ni

stratification approaches give clinicians a framework for their evaluation, these


approaches are overly general because they do not take into account whether or not
a person has actually received phototherapy treatment—a decision that frequently
U

depends on additional practical considerations. We identify a case of clinically


significant hyperbilirubinemia if a phototherapy treatment was administered, as the
clinician’s appraisal of any visible jaundice or the medical history of a newborn are
critical criteria that influence the treatment choice. We seek a more accurate prediction
ity

of neonates at risk for clinically significant hyperbilirubinemia in order to avoid


overtreatment and unneeded hospital stays as well as the development of severe
neonatal jaundice.

The early detection of hyperbilirubinemia has been the subject of several studies. Huang et al. used a logistic regression analysis to predict subsequent hyperbilirubinemia for neonates with at least 35 weeks of gestational age (GA) and exclusive breastfeeding, using GA, the percentage of maximum body weight loss, and the peak bilirubin level during the first 72 hours of life. Their model achieved an AUC of 78.8%, establishing a benchmark for further research. By comparing various classification algorithms on a dataset of healthy term and near-term neonates, 15.4 percent of whom received phototherapy, Ferreira et al. showed that state-of-the-art techniques can be used to improve the early detection of neonatal hyperbilirubinemia. Their best model, a logistic regression based on 60 variables, achieved an AUC of 89 percent. The impractical requirement of 60 variables and the restriction that a phototherapy forecast be produced at 24 hours after delivery for all neonates are the model's drawbacks.

A model for the early identification of hyperbilirubinemia in healthy term and near-term neonates under 24 hours of age was created more recently by Castillo et al. Their model, a regularised logistic regression (LASSO), obtains a performance of 89 percent (±3 percent) AUC using umbilical cord bilirubin, GA, and maternal race as predictors. Their model's drawback is that it only makes one prediction during the first 24 hours after birth, although in real practice predictions are needed for each subsequent bilirubin measurement, especially before discharge. As a result, even though the early identification of hyperbilirubinemia has been researched, most practical scenarios do not allow for the application of existing models because they either need too much overhead or cannot be applied after every new bilirubin measurement.

Machine learning (ML) is used to improve the early detection of clinically meaningful hyperbilirubinemia prior to the first phototherapy session. ML has a lot of potential for clinical use, and in paediatrics it has been effectively used for enhanced early diagnosis of neonatal seizures based on EEG data and of late-onset neonatal sepsis based on medical records.

In order to improve the timing of bilirubin assessments and the start of phototherapy treatment in practice, the aim is to leverage the predictive capacity of cutting-edge ML methods to identify newborns at risk of developing clinically meaningful hyperbilirubinemia early.

Results
23 of the 385 newborns were excluded because they received phototherapy before having their initial bilirubin level checked. Of the remaining 362 neonates, 98 (27.07 percent) underwent at least one phototherapy treatment during their initial hospitalisation. All initial phototherapies took place within the first week of life, on average 66 hours after birth.

There were a total of 1543 bilirubin measurements available across the neonates, or an average of more than four measurements per neonate. The typical interval between bilirubin measurements is less than 27 hours, and 95% of individuals who underwent phototherapy had intervals of no more than 50 hours between subsequent measurements.

The table below provides descriptive statistics for a number of factors that highlight important aspects of the sample. Notably, there are many preterm newborns: in the sample, 54.1 percent had gestational ages (GA) of less than 37 weeks, and 41.4 percent had GAs under 35 weeks. A GA of fewer than 37 weeks and phototherapy were found to be correlated at 0.374.

                                     5% Quantile   50% Quantile   95% Quantile   Percentage
Gestational age (in days)                228            253            289            -
Birthweight (in grams)                  1490           2605           4120            -
Age of mother (in years)                  23             32             40            -
First bilirubin level (in µmol/L)         19             46            216            -
Gender (male)                              -              -              -          56.91
Preterm† birth                             -              -              -          54.13
Multiple birth                             -              -              -          23.48
4.2.2 Healthcare Case: Predicting Epidemic Outbreaks
Public health organisations that design measures to lessen or stop illness transmission can benefit from accurate predictions of infectious disease dynamics. With the expansion of big data in the biomedical and healthcare industries, accurate analysis of such data may aid in the early diagnosis of diseases and improved patient care. It is now highly feasible to use "big data" to predict and manage an epidemic outbreak due to the accessibility of enormous computing power. Our goal is to analyse and quantify the transmission of epidemic diseases in rural and suburban areas where access to healthcare may be limited.

We wish to develop a machine learning model that can forecast epidemic disease dynamics and predict the location of the next epidemic outbreak. Our method takes into account the topography, climate, and population distribution of an affected area, because these characteristics significantly influence the dynamics of epidemic disease. Our model would help the healthcare authorities take the required steps to ensure that there are sufficient resources available to meet the requirement and, if possible, to prevent the spread of such epidemic diseases.

Use of Intel Technology: Intel Distribution for Python, Intel Parallel Studio XE, Intel VTune amplifier, AWS C5.2Xlarge Intel instance.

Broad Objective:
●● Reduce the avoidable suffering caused by disease.
●● Reduce the cost burden on governments and healthcare systems by giving them first-hand knowledge of outbreak hotspots and epidemic-causing agents.

Given an area where an epidemic outbreak has already occurred, our ML model should be able to detect future outbreak-prone areas and the attributes that greatly aid in the spread of the outbreak.

Topic of case study: 2015–2016 Zika virus epidemic
Why Zika?

●● The Centers for Disease Control and Prevention's Zika Data Repository offers information about the Zika epidemic that is accessible to the general public. It had enough information for us to create and test our model.
●● The ecology of the host population may change, the pathogen reservoir may change, or an emerging pathogen may be introduced to a host community. These are only a few of the common causes of infectious disease epidemics.
●● The feature vectors in our model are sufficiently broad to be modified slightly and applied to the analysis of any epidemic disease.

Implementation Details

To improve the performance of machine learning and data analytics, we used Intel Distribution for Python* and the Python API for the Intel® Data Analytics Acceleration Library (Intel® DAAL), also known as PyDAAL. We were able to obtain good results for the prediction problem by taking advantage of the optimised scikit-learn* (scikit-learn with Intel DAAL) that comes with it.

Data Sources

●● The Centers for Disease Control and Prevention's Zika Data Repository offers publicly accessible information about the Zika epidemic (https://github.com/cdcepi/zika).
●● Latitude and longitude of locations connected to outbreaks can be obtained using the Google Geolocation API.
●● Information about airport locations worldwide was collected from Falling Rain.
●● The nearest airport code and the weather information were taken from Wunderground.com.
●● The population density of various regions was extracted from a gridded map provided by NASA's SEDAC (https://earthdata.nasa.gov/about/daacs/daac-sedac).
●● The global compendium of Aedes aegypti and Ae. albopictus occurrence, which lists the occurrences of the vector agents (Aedes albopictus, Aedes aegypti), is available at https://datadryad.org/resource/doi:10.5061/dryad.47v3c.
●● IMF World Economic Outlook data on GDP and GDP PPP.

There are Jupyter notebooks in the repository that apply techniques for cleaning and munging the data.

The evaluation result is the likelihood of an outbreak in a particular area.

Pre-processing

Preprocessing and adjusting class imbalance:

Data transformations are carried out on the data before feeding it to the algorithm as part of data pre-processing. Some of the variables in the dataset are categorical, so suitable techniques must be used to transform them into numerical variables, as sketched below.
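As a minimal sketch of this step (the dataframe and the column names are assumptions for illustration), categorical variables can be converted into numerical indicator columns with pandas:

import pandas as pd

# One-hot encode the assumed categorical columns of the assembled dataframe
df_encoded = pd.get_dummies(df, columns=["country", "report_type"])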

Preliminary analysis also revealed a significant class imbalance, particularly for Zika cases recorded in the CDC database. This was partly because most places did experience outbreaks, and the majority of these outbreaks persisted (were present on all dates) throughout the whole time period for which data were available.

In an effort to address this, two frameworks were created and tested in order to balance the classes and facilitate prediction:

Framework A: The locations included in the non-epidemic class were those that had never had an outbreak; for these, the feature information from the earliest accessible date was used. A place was considered to be in the outbreak class if it experienced an outbreak at any point during the studied dates. For these locations, features from two separate dates were tested: those from the outbreak's start date and those from the date of the outbreak's peak level (during the span of data collection). Framework A first and Framework A max were the names of these two data sets.

Framework B: Only information from areas where an outbreak occurred was used. For the non-outbreak class, this data was taken from the earliest date accessible for each location, assuming there was no epidemic at that point. For the outbreak class, features from either the date at which the outbreak started or the date at which it had reached its peak (within the period of data collection) were taken from the time series for these locations. Framework B first and Framework B max were the names of these two data sets.

Of the four, we discovered that Framework A max produced the best outcome. Intuitively, that is the point in time when the illness can truly be referred to as an epidemic. Our main concern is that once the spread reaches a certain point, it will quickly spread to surrounding locations.

Feature Selection
Data sets may have duplicate or unnecessary features that could confuse the machine-learning model. In this step, our goal is to eliminate any elements that could lengthen runtime, produce overly complex patterns, etc. The resulting subset of features is used for the subsequent analysis. Either the Random Forest algorithm or the XGBoost algorithm can be used to choose the features.

The XGBoost method is utilised in our project to choose the best features, namely those whose importance score is higher than a predetermined threshold. Our results support the literature on the Zika epidemic: temperature, precipitation, proximity to mosquito breeding grounds, population density, and proximity to other locations with dense human populations (as captured by the airport_dist_large feature) all have a significant impact on the spread of epidemics.
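A minimal sketch of this threshold-based selection, assuming X and y hold the engineered features and outbreak labels, might look as follows (the threshold value is an illustrative assumption):

from xgboost import XGBClassifier
from sklearn.feature_selection import SelectFromModel

model = XGBClassifier(n_estimators=100, random_state=0)
model.fit(X, y)  # y: outbreak vs. non-outbreak labels

# Keep only the features whose importance exceeds the chosen threshold
selector = SelectFromModel(model, threshold=0.01, prefit=True)
X_selected = selector.transform(X)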

Data Split
U

Data splitting into train and test sets: For further analysis, the data is split into train and test sets; 30% of the data is used for testing, while 70% is used for training. Data splitting is accomplished using the scikit-learn function StratifiedShuffleSplit(n_splits=1, test_size=0.3, random_state=0). Stratified splitting is required to address the class disparity between Zika cases and non-Zika cases: it maintains the ratio of positive to negative cases of the entire sample in both the train and the test sets.
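A minimal sketch of this step, assuming X_selected and y are NumPy arrays holding the selected features and the labels:

from sklearn.model_selection import StratifiedShuffleSplit

sss = StratifiedShuffleSplit(n_splits=1, test_size=0.3, random_state=0)
for train_idx, test_idx in sss.split(X_selected, y):
    X_train, X_test = X_selected[train_idx], X_selected[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]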
m

Model Building
scikit-learn with Intel DAAL
)A

Balancing the data set


With 86 percent of the data consisting of positive Zika cases, the dataset is severely skewed. The SMOTETomek (SMOTE + Tomek)* technique handles this imbalance by creating a new, resampled dataset that addresses the unbalanced class problem. To balance the training dataset, it artificially creates observations of the minority class using the nearest neighbours of existing minority samples. By combining SMOTE with Tomek links, it mixes over- and under-sampling.
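In code, using the imbalanced-learn package, this balancing step can be sketched as follows (applied to the training split only):

from imblearn.combine import SMOTETomek

smt = SMOTETomek(random_state=0)
X_train_bal, y_train_bal = smt.fit_resample(X_train, y_train)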

Model Building and Training

Machine-learning models are chosen for training at this step. All classifiers in scikit-learn use the fit(X, y) method to fit the model to the provided training data (X) and training labels (y). An ensemble of classifiers is used to compare the performance of various models, and after training the models can be used for prediction. SVM, Multi-Layer Perceptron, Logistic Regression, AdaBoost, and XGBoost were all put to the test.
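The comparison loop can be sketched as below; the hyperparameters are illustrative defaults rather than the tuned values used in the project:

from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import AdaBoostClassifier
from xgboost import XGBClassifier

candidates = {
    "SVM": SVC(),
    "MLP": MLPClassifier(max_iter=500),
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "AdaBoost": AdaBoostClassifier(),
    "XGBoost": XGBClassifier(),
}
for name, clf in candidates.items():
    clf.fit(X_train_bal, y_train_bal)       # train on the balanced data
    print(name, clf.score(X_test, y_test))  # accuracy on the held-out test set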

Prediction

In this phase, the trained model uses what it has learned to predict the output for a given input. In other words, predict(X) provides the expected label y for an unlabelled observation X.

Evaluation

There are many performance indicators available for gauging a model's performance. In order to select the most effective model for the problem, we used accuracy, precision, and recall as evaluation measures.
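For example, assuming clf is the chosen classifier from the comparison above, these measures can be computed as follows:

from sklearn.metrics import accuracy_score, precision_score, recall_score

y_pred = clf.predict(X_test)
print("accuracy:", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred))
print("recall:", recall_score(y_test, y_pred))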

si
Results
The best estimator, the XGBoost classifier, achieved outstanding results: an accuracy of 0.95 on the test set and a stratified five-fold cross-validation score of 0.96. Some further pertinent metrics are shown below.
Figure: ROC with 5-fold cross validation

4.2.3 Healthcare Case: Using AI to Predict Acute Kidney Injury

Acute kidney injury (AKI) is poorly managed, and chances to prevent AKI are frequently passed up. In the hospital context, AKI will typically be treated by non-nephrologists who are less skilled at diagnosing it, as it is frequently accompanied by other underlying disorders. As a result, the creation of AKI detection or prediction models has become a prominent topic. The generalizability of most of these models is hindered by the fact that they are only able to predict the risk for AKI at baseline, such as during admission or preoperatively, and/or in specific settings, such as heart surgery or sepsis. A machine learning-based algorithm was recently demonstrated to perform as well as clinicians in predicting AKI stages 2-3 on the day of arrival, albeit in the limited environment of the intensive care unit (ICU), on the basis of which the algorithm was created.
U

The implementation of an (automated) electronic alert system is very appealing because more sophisticated prediction models have demonstrated increased clinical usefulness by allowing continuous, nearly real-time risk prediction that takes into account longitudinal patient data and thereby the dynamic status of the patient. AKI can be prevented or mitigated, yet many models only predict advanced stages of AKI or the requirement for renal replacement therapy.

Tomasev et al. presented in Nature the creation of an artificial intelligence (AI)-based continuous prediction model for AKI. To build a deep-learning recurrent neural network model, the scientists gathered longitudinal data from the electronic health records of more than 700,000 inpatients and outpatients from all disciplines. The algorithm was trained using a staggering 6 billion independent data entries, which included both the most recent medical data and data from up to two years prior to admittance. To enable supervised learning, each case was labelled with the presence or absence of AKI. With a lead time of up to 48 hours and a ratio of two false warnings to every genuine alert, the resulting model was able to accurately forecast AKI in 55.8% of all inpatient instances of identified AKI.
(c

This study serves as an example of the potential of big data-trained AI in the medical field. It also highlights the problems that require more study and attention, as well as the drawbacks and pitfalls of such a strategy.

First off, the model's performance is not particularly spectacular. With a low sensitivity of 55.8%, the model fails to detect almost half of all AKI episodes. It is possible that this was done on purpose to increase specificity. Even extremely effective models in environments where AKI is uncommon will need to find a compromise between high sensitivity, which will result in missing fewer cases, and specificity, which will lessen false positives and address alarm fatigue. To determine which strategy will significantly improve outcomes in clinical practice, more study is required.

nl
Notably, missing AKI cases in a general hospital context can be viewed as more troublesome than alert fatigue. It is crucial to keep in mind, though, that low sensitivity can simply be a result of the fact that a sizable fraction of AKI cases cannot be predicted, just as it is impossible to predict which side a coin will land on in a coin-toss experiment. These limitations stem from the rules of probability and hence hold true for all diagnostic and prediction models, whether they are based on machine learning or traditional methods.

Supervised learning, which Tomasev et al. used, is a mathematical method in which input features are connected to pre-specified output labels. This implies that the model is given the appropriate label for each event during training and that the model also learns from its "mistakes" by modifying the weights used by the algorithm to link the data with the label. Such a strategy assumes that the categorization is (i) "transparent" and "uniform," meaning that everyone knows exactly what the category label means; (ii) "relevant," meaning that categories meaningfully distinguish between cases; (iii) "unique," which means that each case belongs to only one category; and (iv) "exhaustive," which means that all cases that exhibit meaningful differences can be assigned to a different category. Finally, it also assumes that every example in the training set has been appropriately labelled. When these presumptions are broken, the resulting algorithm will perform poorly in real-world situations.
resulting algorithm will perform poorly in real-world situations.

While the majority of AKI prediction algorithms claim to be based on the Kidney Disease: Improving Global Outcomes criteria, they actually only take the creatinine level into account and ignore the urine output criterion. This suggests that the AKI predicted by these models may not be the same as AKI as it is understood in clinical practice, and as a result, the definition of what the algorithm predicts is neither transparent nor clearly defined. Importantly, urine output might be the least expensive continuously available predictor of AKI we have, and teaching medical professionals to keep track of diuresis might be a better way to improve AKI management than looking for novel AKI biomarkers or creating AKI prediction algorithms.

An electronic alarm for the condition would be most beneficial in settings where there is a higher chance of missing an AKI diagnosis. However, cases where a diagnosis of AKI was not considered by the doctor are very likely to lack the data to support the diagnosis in a retrospective data collection. As in Tomasev et al.'s paper, one can choose to consistently classify these cases as having "no AKI." The term "no AKI" in this situation would be meaningless since it does not distinguish between "confirmed no AKI," as supported by the data, and "uncertain AKI," when there is insufficient information to support the diagnosis.

Through supervised learning, "uncertain AKI" may represent examples of missed AKI or inaccurate labelling in the training set, leading to false negatives in the test set. In this way, cases overlooked by the doctor will also be missed by the prediction model. Excluding cases that lack the necessary information to support the diagnosis is another option. The training set, however, would then not be exhaustive because it would not include instances of "easily undetected AKI." Such patients will be randomly classified as having AKI or not having AKI because the test set will not be able to identify their normal pattern.

In addition, the model is trained to use not just test results but also metadata, such as test ordering and timing. There is proof that these metadata can predict outcomes, such as survival, more accurately than the actual test findings. However, such metadata strongly rely on the skill of the doctor who orders the test because of a suspected danger of AKI. The model will develop a vicious circle as a result of this. During training, the model will unconsciously (i.e., without being aware of the issue) link the doctor's request for an AKI test with the onset of the condition later on. Thus, the model will issue an alarm for a probable AKI issue that the doctor has already identified.

A similar effect was demonstrated in an AI-based chest X-ray diagnosis model, where the diagnosis was largely driven by the type of X-ray machine used, because different machines were used in the outpatient setting (low incidence of pneumonia) and in the ICU setting (high incidence of pneumonia).

To avoid or reduce the danger of AKI, an alert must be triggered in a timely manner. Although it would seem that a 48-hour prediction window would be sufficient, not all AKI diagnoses are anticipated by the prediction algorithm within 48 hours of the onset of AKI. In actuality, only 20% of AKI cases are anticipated more than 24 hours in advance of AKI onset. This might turn out to be too late for a successful intervention in the majority of situations. AKI intervention would also include preventive interventions that should be given to all patients, not only those who are at risk for developing AKI.

4.2.4 ML Applications in Retail


Artificial intelligence (AI) applications including predictive analytics, natural language processing (NLP), and pattern and image recognition are built on machine learning (ML) and deep learning (DL) technologies. Supervised machine learning methods are used with NLP to enable chatbot implementations in commercial solutions like Google Dialogflow, Microsoft Bot Framework, etc. A field with numerous commercial applications is also developing around the recognition of patterns in photographs. E-commerce enterprises that use such techniques in their recommendation engines have already taken advantage of the maturity of unsupervised learning techniques.

We may make major advancements in a range of industrial areas by implementing these approaches of supervised learning, unsupervised learning, natural language processing, and pattern recognition. In this work, we show case studies that illustrate how the same set of AI/ML techniques may be used in four distinct industry sectors (insurance, education, hospitality, and retail) to achieve significantly different use cases. Despite the differences in the use cases, the information processing task inherent to each of them is quite similar. These technological enablers might be compared to Lego blocks that business users can put together to solve specific problems in their industry.

Through the implementation of the following use cases, ML technologies are demonstrated to be of substantial benefit in easing the issues experienced by both the end users of e-commerce platforms and their administrators.

Use Case Details

There are numerous e-commerce websites that sell individuals the goods they want and need. As the number of sites and options to pick from grows, the consumer finds it increasingly difficult to examine each site for availability, pricing, offers, suggestions, etc. An NLP-based chatbot interface can greatly improve the customer experience and, as a result, ensure consumer stickiness.

The integration of a chatbot into an e-commerce website's landing page is demonstrated in Fig. A below.

Figure A: E-commerce website with a chatbot integrated within its landing page
ity

1) Responding to FAQs of customers


Chatbots' most developed use is in answering frequently asked questions from customers about things like the purchase, return, and refund procedures of specific stores. Business-wise, this guarantees that customers receive answers to their inquiries at the "moment of truth," that is, when a transaction is most likely to take place. This way, customers do not have to wait long for responses from the e-commerce site's help desk crew. It also ensures that common questions, which would otherwise be answered by the helpdesk phone support team, are handled in a DIY fashion.

For instance, a chatbot that is connected to the inventory database of the store can react to inquiries about the availability of a product of a particular brand or model (e.g. the availability of Amazon Alexa).

When chatbots are coupled with the corporate landscape of the e-commerce company and other web services via API calls, they can provide more value than just answering FAQs. The following implementations serve as examples of several use cases where the chatbot can be utilised as a conversational user interface to deliver the results of such API calls.
2) Price comparisons for customers
A user can ask the chatbot to compare the pricing of a product, like a mouse, as shown in Fig. B below. After identifying the "Price Comparison" intent, the chatbot uses the PriceYuge Data API to call an external web service. The price of the goods is then obtained from other e-commerce websites, like Amazon, Flipkart, etc. The information includes price, product colour, expected delivery, COD availability, product offers, etc.
Figure B: Comparative Details of the product Logitech M235 Wireless Mouse
fetched and shown in the chatbot window
3) Product Recommendations for Customers


Advanced statistical and machine learning approaches can be used on some items to find similarities across several products and depict them in a more comprehensible way. For instance, in the context of cosmetic products, recommendations can be based on the chemical makeup of the items. In the representations shown in Figs. C and D, the closer two items lie to one another, the more similar their composition. Customers can therefore compare the products without any prior knowledge of chemistry.
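A minimal sketch of this idea follows; the ingredients matrix (one numeric composition vector per product, assumed to be a NumPy array) and the neighbour count are illustrative assumptions:

from sklearn.neighbors import NearestNeighbors

# ingredients: 2-D array, one row per product describing its chemical makeup
nn = NearestNeighbors(n_neighbors=5, metric="cosine")
nn.fit(ingredients)

# Recommend the products whose composition is closest to product 0
distances, indices = nn.kneighbors(ingredients[[0]])
print(indices)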

Figure C: Select the Skin Type from the options given.

Figure D: Recommendation System recommends products with brand, pricing
and Rank

4) Automated Product Labeling for Store Administrator
E-commerce companies must make sure that the products featured on their platform come from various merchants and are listed consistently. The difficulty of accurately listing all the features of a product increases with the number of vendors on a platform and the number of items being offered. Image recognition using deep learning algorithms can be employed to identify a product's features from a picture and make sure they are consistently listed. Therefore, some of the typical labels connected to a product can be automatically filled in using deep learning techniques. The prototype implementation of uploading a product image is shown in Fig. E below, and Fig. F displays the numerous labels or features that were predicted for that product and automatically placed in the database. Although the example shown below is for a cosmetic product, the same results can be achieved with images of apparel, accessories, etc. As a result, the database populating process is automated, effective, and less error-prone. The system reached an accuracy of 97 percent after training on more than 3000 photos.

Figure E: Web support for Auto-labelling the product Image


5) Aid for updating FAQ knowledge base by Store Admin


Following the creation of a chatbot application, it is crucial to regularly assess the outcomes of the chatbot's interactions with end users. It is important that an administrator keeps the FAQs updated and pertinent to consumer inquiries by identifying those utterances for which the chatbot was unable to determine the intent. It can be difficult to manually sort through a large number of unanswered questions to identify the subjects for which chatbot responses need to be created. Here, topic modelling on the unrecognised search queries can help prioritise those areas where the chatbot needs to be updated with responses, by identifying the clusters of themes that are commonly queried by customers. The most popular topics thus found can be added to the chatbot's future versions to better serve users.
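As a minimal sketch of this idea (the queries variable and the number of topics are illustrative assumptions), topic modelling over the unrecognised queries can be done with scikit-learn's LDA implementation:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(queries)  # queries: list of unrecognised utterances

lda = LatentDirichletAllocation(n_components=5, random_state=0)
lda.fit(X)

# Show the top words of each discovered topic
terms = vec.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top = [terms[i] for i in weights.argsort()[-8:][::-1]]
    print(f"Topic {k}: {', '.join(top)}")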

ty
r si
ve
Figure F: Predicted labels from the input image are the red colored words
Technology Enablers & Implementation Details


The following is a list of the major technological enablers that were applied in the development of this prototype application:

●● Google Dialogflow: The chatbot was created using Dialogflow, which enables the creation of entities, intents, context-setting, and training phrases for questions and search queries pertaining to certain products.
●● Ngrok: A web tunnelling tool that exposes the locally hosted webhook to the web-based chatbot interface.
●● Flask: A Python framework that provides webhook access to the chatbot interface (a minimal sketch follows this list).
●● MySQL Workbench: Provides database connectivity in order to save user comments, unrecognised search requests, and product and sales information.
●● PriceYuge Data API: An external API used to retrieve product details from various e-commerce stores.
●● NLTK Library: Used for text analysis of user-submitted search terms.
●● FastAI library: Used for image feature extraction on a product image dataset with more than 40,000 photos and label information.
●● Bokeh: Visualization library.
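A minimal sketch of the Flask webhook idea is given below. The route name, intent name, and reply logic are illustrative assumptions rather than the actual implementation.

from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route("/webhook", methods=["POST"])
def webhook():
    req = request.get_json()
    intent = req["queryResult"]["intent"]["displayName"]
    if intent == "Price Comparison":
        reply = "Fetching price comparison..."  # call the external price API here
    else:
        reply = "Sorry, I did not understand that."
    return jsonify({"fulfillmentText": reply})

if __name__ == "__main__":
    app.run(port=5000)  # expose via ngrok so Dialogflow can reach it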


The results of training the deep learning model for label recognition from photos are shown in Fig. G below.
Fig. G: (a) Finding the right learning rate for the multi-label classification model. (b) Training versus validation loss over 5 epochs, using more than 3000 image samples.
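A minimal sketch of the fastai training loop behind Fig. G is given below, assuming a dataframe df with an image-path column and a space-separated multi-label column; the architecture and settings are illustrative assumptions.

from fastai.vision.all import *

# Multi-label image data: one row per image, labels separated by spaces
dls = ImageDataLoaders.from_df(df, path=".", label_delim=" ",
                               item_tfms=Resize(224))
learn = cnn_learner(dls, resnet34, metrics=accuracy_multi)

learn.lr_find()         # panel (a): choose a learning rate from the loss curve
learn.fit_one_cycle(5)  # panel (b): training vs. validation loss over 5 epochs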
ity

4.2.5 Netflix: Movie Recommendation


Deep learning started gaining popularity in the machine learning community in the early 2010s thanks to outstanding results on a range of tasks in several areas, such as computer vision, speech recognition, and natural language processing (NLP). The recommender-systems research community was buzzing at the time with the question, "Will the wave of deep learning also wash over recommenders to provide enormous improvements?" This question and the potential for deep learning to enhance our suggestions piqued the interest of many people, including us at Netflix.

While it is now obvious that deep learning is advantageous for recommender systems, it was difficult to determine how deep learning differs from other recommendation strategies. This is reflected in the length of time it took for these methodologies to gain acceptance in the research community. However, it was a fruitful approach, as shown by the ensuing explosion of work on the issue. The route of our own deep learning research at Netflix was similar: after initial elation, the stark reality of refined baselines confronted us.

As a result, the connection between deep learning and other recommender models became clearer. When we merged the knowledge gained from these methods with information on the kinds of problems in which each deep-learning method excels, as well as its limitations, the barriers disappeared. Along the way, it also produced useful insights into how to make deep learning effective in actual recommendation scenarios.

By viewing deep learning through a recommendation lens, we were able to research a range of models for various Netflix recommendation tasks, and we will give insights into a few significant types of models. We start out by talking about models known as "bags of items" that employ a member's collection of watched videos. While effective, these models ignore temporal information, so we then explore sequential models that fill this gap. We discovered both strategies to be useful for various tasks.

Our research showed that deep-learning algorithms particularly began to shine in recommendation problems when we provided them with additional heterogeneous features and information sources, after initially struggling with deep-learning methodologies. As is frequently the case in the literature (see also Ferrari Dacrema, Cremonesi, and Jannach (2019)), other models, when correctly tuned, remained competitive in the typical recommendation context where only user-item interaction data are used. However, because deep learning models are flexible, we were able to identify situations in which we could create models that significantly improved offline measures assessed using historical data.

In an A/B test with actual members, we discovered that improvements in offline metrics did not always translate to online performance, even when they were quite large. To address this, we required new offline measurements that were better proxies of the online metrics. In addition, we discuss other relevant practical issues involved in implementing deep learning in a recommendation system that serves hundreds of millions of users. Thanks to the flexible foundation offered by current deep-learning toolboxes, it is fairly simple to create and alter model architectures for recommender systems in practice.
Finally, we will discuss major lessons learned from applying deep learning to
Netflix’s recommendation algorithms as well as lessons that may apply to other
applications.

Recommendations at Netflix
To enhance their long-term satisfaction, the primary goal of Netflix's recommender system is to assist subscribers in finding content that they will watch and appreciate. This is a difficult challenge for a number of reasons, such as the fact that each individual is different, has a wide range of interests that might change depending on the situation, and needs a recommender system most when they are unsure of what they want to watch. When done correctly, each member receives a unique experience that enables them to maximise their use of Netflix. Since we offer a monthly subscription service, member satisfaction has a direct impact on customers' propensity to continue using our service, which in turn affects our revenue. Therefore, the improvement in member retention can be used to gauge the value of a recommender system. We have routinely produced significant increases in retention over the years thanks to the development of personalisation and recommendation technologies (Gomez-Uribe and Hunt 2015).

In multiple controlled online trials, we found that the specific recommendation task as well as the available data play a key role when it comes to the question of which algorithm works best. The main selection made by a recommender system at a video-streaming service is which videos each subscriber will see on their Netflix homepage after logging into their profile on any device. This personalization work is itself broken into various sub-tasks, handled by recommender systems created to address various member demands, each of which can be powered by a different algorithm.

Instead of a single algorithm driving all recommendations at Netflix, the company uses a variety of strategies that are all focused on boosting member satisfaction. The best algorithm (whether deep learning or another) depends on the particular recommendation problem that needs to be solved as well as the available data; this was discovered through experimentation with several kinds of recommendation algorithms. Because of this, several machine learning models are employed to create tailored suggestions for the various sections (such as rows) of the Netflix home page. Gomez-Uribe and Hunt (2015) provide an overview of the Netflix recommender system before deep learning was used. By breaking the recommendation problem down into several smaller tasks, we can combine a variety of alternative approaches and increase the modularity and scalability of the research and development of new or improved recommendation algorithms.
Figure: Netflix homepage where we highlighted various recommendation tasks

The Netflix homepage shown in the figure above has red circles marking several recommendation tasks, each of which is driven by a separate algorithm. For instance, there are specific algorithms (1) for selecting the first video to display prominently at the top of the homepage, another one for ranking previously watched videos that the user may wish to continue viewing (7), as well as others designed to assist our members in finding new videos (5). The results of each of these algorithms can be shown on the page as several rows of suggested videos. A number of further Netflix customisation tasks are beyond the scope of this article. For instance, an algorithm (4, 6) chooses which rows to present in a customised manner to form the homepage's structure (Alvino and Basilico 2015). In addition, our members receive tailored messages and notifications (3). Additionally, we incorporate recommendation methods into our search engine (Lamkhede and Das 2019). In this article, we concentrate on the task of suggesting videos from the Netflix library to each member, since it occurs the most frequently in our recommendation settings. Many of the lessons we offer, nevertheless, apply to these other tasks as well.

The available data and their qualities, in addition to the recommendation objective, have a significant impact on which recommendation algorithm performs best. The first and most crucial distinction is whether the data contain additional information such as user attributes, item attributes, or contextual information regarding the user-item interaction, as opposed to just the user-item interactions (as is frequently the case in publicly available data sets used in the literature). Through testing, we discovered that adding such additional heterogeneous variables to the data made more sophisticated models, such as deep-learning algorithms, shine.

Before we examine further issues in the data for developing real-world recommender systems, it is important to briefly review a few typical aspects of the data that have been covered in the literature on recommender systems. The following are the significant distinctions from the data sets utilised in other domains. The first thing to note is that the entries with observed positives (such as played videos, clicked items, given ratings or thumbs, etc.) are not randomly distributed in the user-item interaction matrix (Hernández-Lobato, Houlsby, and Ghahramani 2014; Liang et al. 2016; Marlin, Zemel, and Roweis 2005; Marlin et al. 2007; Marlin and Zemel 2009; Steck 2010). This is a critical distinction from areas like compressive sensing or matrix completion, where it is usual to presume that matrix entries are missing at random. Second, the unobserved entries in a user-item interaction matrix could either represent (true) negatives (i.e., the user is actually uninterested in this item) or positives that have not been observed (yet). Thirdly, because observed positives result from a stochastic process, observed data are frequently highly sparse and noisy. Fourth, the data exhibit a significant popularity skew: the popularity of the various items roughly follows a power-law distribution, leading to big disparities in the popularity of the various items. A similar (approximate) power-law distribution can be seen in the distribution of users, with a small number of very active users and a huge number of less active users. Due to distribution mismatch, this power-law distribution might make modelling difficult. Making fair and accurate suggestions for unpopular items or for users with minimal activity also presents a hurdle.

In a real-world recommender system, the different types of biases in the user-item interaction data, such as presentation or position biases, may be amplified due to a feedback loop in which the recommender system is trained on the observed user actions from a previous time step, which may have been biased due to the recommendations presented to the users at that time (Chaney, Stewart, and Engelhardt 2018). Due to presentation bias, users are more inclined to interact with items that the system displays prominently. In real-world recommender systems, breaking (or at least damping) the feedback loop is a major challenge. This not only makes it difficult to train recommender systems using the obtained data, but it also causes a noteworthy disparity between offline and online metrics.

4.2.6 Retail: Robot Sales Associate - Apparel Store

in
The retail sector will soon undergo a change thanks to AI robots. In addition to performing physical chores, these robots will also be in charge of more cognitive ones thanks to a learning system that will enable them to make judgments in place of humans. These robots are now serving as customer service agents, giving customers a seamless and enjoyable shopping experience.

In this age of automation and the digital world, customer-facing businesses are vying to best meet the ambiguous expectations of customers. AI-based robots serving as customer service agents are assisting customers by making brick-and-mortar store buying less complicated. According to a Tractica analysis, the market for customer support robots will be valued at US$88 million by 2020. These customer care robots are made to engage with consumers while also automating some of their more routine chores. They come in both humanoid and non-humanoid varieties, and their key benefits are labour efficiency, uptime, and cost savings.

The majority of these robots direct shoppers to items in stores or assist them as they navigate the space. They are able to communicate with clients more successfully than human employees. These robots therefore have significant potential for the development of interactive marketing and rebranding tactics based on the tracking and analytics of consumer behaviour.

Invento Robots
MITRA: These humanoid robots were created to increase in-store consumer engagement. Three key technologies are integrated into them: face recognition, conversational skills, and autonomous navigation. Using face recognition, the robots can recognise people as well as estimate age, gender, and the levels of the most dominant emotions. They can engage in dialogue with clients, offering them more useful and understandable responses. They are able to navigate the site effortlessly and can provide clients with information about a particular product, including details, discounts, and offers.

The business recently debuted MITRA 3, the MITRA robot's sibling. The MITRA 3 humanoids are designed to give customers a more engaging, personal experience. Unlike MITRA, these robots can mimic human gestures and eye movements because their eyes are coordinated with their bodies. These robots can greet visitors with "Namaste" because they have three degrees of freedom (shoulder, elbow, and finger movements). This flexibility makes them more adaptable. Along with voice synthesis in local Asian languages like Hindi, Tamil, and Sinhalese, they also have the advantage of seamless independent mobility and obstacle recognition. "Customer engagement is terrible, and businesses can no longer afford to hire human service workers. This leaves one with two choices: a subpar kiosk or no engagement. We aim to give visitors to companies, malls, and airports a sophisticated, enjoyable user experience," according to Balaji Viswanathan, CEO of Invento Robotics.

Suzuki, Smartworks, HDFC Bank, Accenture, PVR Cinemas, and the Airports Authority of India are just a few of MITRA's clients. To automate check-ins and customer service for travellers, the company ran a test programme with these MITRA robots.

Spod: Like MITRA, these non-humanoid robotic shopping assistants with artificial intelligence capabilities help customers make in-store purchasing decisions. The box-shaped robot, suspended above a basket, can recognise faces, greet users, accept input, and travel by itself with ease. The robot, which is still in the development stage, scans the faces of clients to determine whether they have previously visited and to estimate their gender. These robots can interact with consumers, guiding them to the appropriate shelf and providing product information, such as product descriptions, specials, and deals, from the Spod tablet.

By setting up a video conference between the clients and specialists, Spod can also give customers remote assistance. The demand for a tool that can enhance customers' purchasing experiences in retail markets sparked the development of the robot. Driving sales is the main concern for merchants, while giving out more details regarding the product is important for brands. Large malls want it in the front, directing visitors into stores, while e-commerce players want it utilised at the rear, in their stockpiles, according to Balaji.
stockpiles, according to Balaji.

SoftBank Robots
r
ve
Pepper Robots: Pepper robots, a subset of SoftBank robots, are humanoid robots that can recognise faces and fundamental human emotions. The robots carry touchscreens through which they can communicate with customers. They primarily provide creative greeting, education, and guidance functions for guests. With speech recognition and dialogue delivery in more than 15 languages, they have 20 degrees of freedom for the most organic and expressive movements. To help customers get the intended outcomes, they can effortlessly explore the store and its surrounding region. The use of these robots can be observed at Ave, a custom-print garment store in the United States, which witnessed an increase in foot traffic by raising consumer awareness of the outlet, luring in new customers, describing novel parts of the store, and driving sales.

The company's NAO humanoid robot is another one whose main uses are in healthcare, research, and teaching. These 58 cm tall, bipedal robots have beautifully rounded features. Along with their functions of speech recognition and conversation in up to 20 different languages, their 25 degrees of freedom allow them to move and adapt to the environment. They can recognise objects, images, and people thanks to two 2D cameras. As a STEAM education assistance programme, NAO functions as a teaching assistant in education and research: it can deliver customised courses to pupils, encourage positive behaviour, and follow and monitor their development. NAO supports personnel with health trending and monitoring while assisting patients with self-diagnosis. It serves as both an informational centre and a telemedicine platform (alerts, notifications, fall and sound detection, etc.).

Cleveron Automatic Click and Collect


Cleveron is a leading vendor of click-and-collect pick-up automation solutions. The company's technology enables businesses to offer an automated self-pickup location for all online orders. The equipment can be placed within the shop, where consumers can visit and scan the QR code or enter the code to pick up their online orders. Without human assistance, the system can process the code and give the consumer their requested order. The Cleveron machines eliminate all the bother associated with the BOPIS (buy online, pick up in-store) idea, including waiting for the order to be prepared and contacting a staff executive. "Saving consumers' time is the ultimate goal of Cleveron's delivery robots."

nl
Online buying is a quicker and more practical alternative to physical store purchasing. However, the time you saved won't matter after your package is delivered, because you'll need to wait in line at the store's service desk or at home for the courier, who may show up with your package between 5 and 9 o'clock in the evening. The typical wait time to pick up an online order at the shop is more than 15 minutes. Customers receive their orders in less than a minute thanks to Cleveron 401 and Cleveron 402, which function as automated in-store self-service pick-up terminals. 15 more minutes are now yours to keep! Arno Kütt, founder and chairman of the board of Cleveron, stated that the company's robots "make online shopping and picking up orders in a physical store a positive consumer experience, which in turn generates foot traffic, revenue and promotes growth for the company."

He continued, "Here is a quick calculation: In-store parcel pickup typically takes longer than 15 minutes. In a year (52 weeks), if you just pick up once a week, you will stand in line for 13 hours. The pick-up time with Cleveron 401 is under a minute. You will spend 52 minutes picking up your online orders from the store in a year, even if we round it up to a minute."

Lowe Innovation Labs

The 5-foot-tall robot shopping assistant LoweBot assists clients in-store just like a customer service agent. These robots can recognise customers entering the store since they are equipped with 3D scanners. Customers can ask them about any specific item in the store, and they understand and speak seven different languages. These robots steer customers to the merchandise using sophisticated laser sensors. While travelling, they additionally show location-based offers and intelligent recommendations on their rear-facing display panel. They also move throughout the store, scanning the shelves and sending inventory updates to the sales staff. As a result, the business can not only restock its shelves but also learn which products sell more rapidly and on what days of the week or during what seasons. They are also capable of determining the shopping trends at the location.
m

4.2.7 Retail: Loss Prevention


Among the many ill-intentioned fraudulent activities that put a strong retail loss prevention programme in perspective, and that no retailer would be willing to face, are the following: a product is whisked away without being scanned at the counter; merchandise is handed over without being charged; a mal-intentioned individual goes free after returning a product (with no rhyme or reason to do so); and discounts are given on transactions that are not worthy of discounts at all.

Retailers are increasingly making the reduction of retail shrink a primary focus, transforming retail loss prevention with AI and machine intelligence in order to increase revenues and reduce losses. As the retail sector struggles with shrink and the loss that comes with it, perpetuated through shoplifting, internal theft, return fraud, vendor fraud, discount abuse, and administrative errors among others, data, and particularly the patterns, behavioural insights, and correlations sensed from data, forewarn (predict) any possible fraudulent activity and arm retailers with the measures necessary to protect against retail shrink and loss.

nl
Reactive to Proactive Loss Prevention Strategy
When it comes to anticipating and mitigating retail shrink and loss, there has been a significant shift from reactive to proactive approaches. Data from vital sources such as security systems (CCTV, camera, access control, and alarm records), video, store operation applications, POS, payment data, crime data (local crime statistics), store profiles, supply chain data, employee data, customer data, ORC data maps, and e-commerce platforms are first collected.

The data is used as a crucial input for techniques including computer vision, deep learning, image processing and recognition, machine learning and correlation, pattern recognition, and behavioural, predictive, and prescriptive analytics.

When AI/machine learning is used to transform retail loss prevention, proactive measures assist in stopping retail loss, enhance KPIs to prevent inventory loss, shoplifting, theft, pilferage, discount abuse, and return fraud, and reduce shrinkage. Additionally, it facilitates the change from "identifying a case" to "preventing a case."

AI/Machine learning to transform retail loss prevention – A case in point


The use of machine learning to stop return fraud is a case in point. Return patterns may serve as a warning sign to thwart fraudulent refunds. Approving a return can be likened to balancing on a tightrope between providing excellent customer service and ensuring that the action is not fraudulent. The shopper, their shopping behaviour, their return behaviour, and store trends provide a rich feed for the machine learning system to anticipate and stop return fraud.
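A hypothetical sketch of such a return-fraud risk model is shown below; the feature set, variables, and review threshold are illustrative assumptions rather than any vendor's actual system.

from sklearn.ensemble import GradientBoostingClassifier

# Each row of X_returns describes one return: e.g. returns in the last
# 90 days, days since purchase, receipt present (0/1), refund amount,
# store-level return rate -- all assumed feature names
model = GradientBoostingClassifier(random_state=0)
model.fit(X_returns, y_fraud)  # y_fraud: 1 = known fraudulent return

# Flag returns whose predicted fraud probability exceeds a review threshold
risk = model.predict_proba(new_returns)[:, 1]
needs_review = risk > 0.8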

Reading the Risk tea-leaf for a comprehensive loss prevention program


ity

Another advanced analytics technique for forecasting and preventing retail loss
is reading the risk tea-leaf. Retailers can use machine learning algorithms and data
science techniques to discover key loss drivers, anticipate loss, and avoid loss by
monitoring retail loss risk variables. Concerning their loss prevention plan, some
m

merchants are bothered with persistent questions such

●● Is this store prone to theft and loss?


)A

●● Does this product lead to retail losses?


●● Is this transaction devoid of suspicions?
Using advanced analytics, it is valuable to forecast “risky” locations in terms of high-risk products, high-risk transactions, high-risk stores, and high-risk connections. Keeping risk-ranking predictive models, developed using statistical modelling and machine learning, at the forefront aids the prevention of fraudulent actions that lead to retail loss.
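As one hedged illustration, a risk-ranking model can be as simple as a logistic regression that scores stores by loss risk; the features below (local crime index, staff turnover, prior shrink rate) are assumptions for the sketch, not inputs prescribed by the text:

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical per-store features: [local crime index, staff turnover, prior shrink rate]
rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=(200, 3))
y = (X @ np.array([1.0, 0.8, 1.5]) + rng.normal(scale=0.3, size=200) > 1.6).astype(int)

model = LogisticRegression().fit(X, y)

# Rank stores from highest to lowest predicted loss risk.
scores = model.predict_proba(X)[:, 1]
ranking = np.argsort(scores)[::-1]
print("Top 5 highest-risk stores:", ranking[:5])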

Utilising the Appropriate Loss Prevention Method to Combat Retail Shrink


Retail shrinkage is a constant annoyance. A strong loss prevention strategy is the first step in reducing retail shrink. Retailers succeed by extending the LP spectrum to encompass mobile, web, and other customer engagement channels and by using AI/Machine learning to transform retail loss prevention after taking stock of the existing loss prevention technology used to supplement KPIs.

4.2.8 Telecom: Optimising Mobile Tower Operations
Recent developments in machine learning (ML) make it possible to optimise at previously unaffordable levels of complexity. This has produced notable performance gains, encouraging the usage of ML techniques such as neural networks in a variety of industries.

Utilizing ML to improve wireless network performance began with 5G and is crucial to advancing zero-touch configuration and management, enabling the self-configuration and self-optimization anticipated for 6G networks. Numerous factors that affect how wireless networks operate are difficult to predict or infer and are not always understood at the time that decisions must be made. Furthermore, because they include a variety of radio access technologies and modules that interact with one another, must meet a variety of growing criteria, and must swiftly adapt to changes, wireless networks are becoming more complicated and heterogeneous. Because of this, real-time wireless system performance optimization is a challenge that cannot be solved using conventional methods. Since ML technologies can manage extremely complex systems, they are well suited to managing highly dynamic wireless networks and to making more intelligent judgments, such as those based on anticipated future traffic patterns.

Based on these findings, this study suggests a modular machine learning (ML)-based wireless network optimization framework that enables plug-and-play machine intelligence integration into both new and existing network operations. In particular, we use ML to (i) characterise traffic attributes and (ii) anticipate future traffic volume. We then use this data to enhance the performance of network control mechanisms. The benefits of our strategy are twofold: in accordance with the recent ITU-T Y.3172 recommendation for integrating ML in future networks, it (i) is sufficiently general and enables the instantiation of ML pipelines across various network elements and functions, and (ii) enables the retrofit of ML to legacy architectures and the reuse of existing network control mechanisms with little to no modification.
Prior research has integrated machine learning (ML) into the creation of specific algorithms, concentrating on network functions such as (i) resource management and orchestration, (ii) routing, (iii) detection and channel estimation in massive MIMO systems, and (iv) resource scaling of virtual network functions (VNFs). These techniques cannot be easily reused due to their mechanism-specific nature, which means that each network control mechanism requires a specially designed ML technique.

Contrarily, we employ ML to generate precise traffic forecasts that can be directly fed into tried-and-true algorithms and decision-making modules. Significant scientific interest has been shown in traffic forecasting and characterization using ML. While our approach relies on traffic analytics to enhance the performance of general network control mechanisms, previous work has largely concentrated on traffic analysis to optimise specific network operations, such as routing (see, e.g., surveys of ML techniques applied to SDN) or VNF resource scaling.

We also include an ML orchestrator, which is in charge of resource management and monitoring as well as determining the best configurations for ML model training. With the aid of our framework, we anticipate that the costs associated with launching an ML pipeline for a particular function will be comparable to those incurred by ML algorithms created specifically for that purpose. In our method, the orchestration of a function has a one-time signalling cost, but this is a minor price to pay given the flexibility it provides.

We describe how to coordinate two ML pipelines, namely traffic-driven VNF scaling and routing in mobile backhaul networks, to illustrate the viability and performance improvements possible with our system. We assess the latter use case practically. Our methodology achieves up to 3× reductions in packet delay by feeding a cutting-edge routing scheme with city-scale projections of future traffic consumption obtained with a deep learning structure.

ML-Based 5G Network Optimization
We suggest a machine learning (ML)-based framework for network optimization and describe how to apply it in 5G networks.
A. ML-based Framework
The components of our framework are shown in the figure below. They are the ML orchestrator, modules to measure mobile network traffic, ML algorithms to process the data, and modules to make specific network optimizations depending on the results of the ML algorithms.

Figure: Components of the suggested framework. Based on measurements and flow metadata, ML algorithms are used to categorise and forecast traffic. Modules implementing network functions are supplied with the knowledge that was extracted.

The orchestrator specifies, in the form of a template, the set of collector nodes, the duration and aggregation level of traffic measurements, and ML pipeline-specific parameters, such as the number of epochs, layers, and possibly a unique loss function, according to the particular network function to be optimised. Different functions require different inputs; for instance, increasing the computing capacity of VNFs performing core services involves monitoring control traffic from the same set of base stations, while routing requires monitoring traffic from a set of base stations to determine the best routes.

As a result, the orchestrator coordinates the implementation of an ML pipeline and the mechanisms for updating the decisions for each network function (for example, by collaborating with the VNF orchestrator). It also makes sure that there is enough computing power available to train ML models either centrally or in a decentralised manner.

Measuring requires direct access to flow data, which may be present at base stations or Software Defined Networking (SDN) switches, for example. Rather than defining a finite set of input attributes, the measurement modules take sequences of packets from each flow, along with their durations, inter-arrival times, direction (uplink/downlink), and possibly even some of the content. Working with such extensive data as input has several benefits, but the main one is that abstract features may be automatically extracted during training rather than relying on a small, manually selected set. Feature engineering can be expensive and can hurt performance, and new use cases might call for other features. Our strategy is therefore future-proof.

Our system is versatile enough to accommodate various learning methods. We concentrate on deep learning (DL) because (i) DL algorithms scale better than classical ML approaches as the volume of data increases, (ii) DL produces highly accurate outputs in network settings where inferences must be based on a large number of input parameters, and (iii) advances in parallel computing enable the quick training of complex neural networks and the use of those networks in various contexts without retraining.

In order to (i) meet particular needs, such as latency and reliability (traffic classification), and (ii) forecast key aspects of future network traffic, such as average and peak data rates, level of burstiness, etc. (traffic forecasting), we are particularly interested in DL structures that can identify different types of flows within large aggregates. Different DL structures might be used, depending on the objective.
ity

Based on knowledge of TCP traffic flow, auto-encoders are extremely good at classifying traffic. Convolutional Neural Networks, for example, which are frequently used for image segmentation, are useful for classification. Accurate traffic classification is necessary for optimising network processes like scheduling and load balancing.
Temporal characteristics play a key role in traffic forecasting. Time series lend themselves well to Long Short-Term Memory networks (LSTMs). Just as the spatial dimension can be added to the convolution operation to create a 3D-CNN, mobile-traffic-specific spatio-temporal information can be extracted.

A deep learning pipeline designed specifically for anticipating mobile traffic is shown in the figure below. City-level traffic data is fed into stacks of 3D-CNNs and ConvLSTMs to extract spatio-temporal properties from traffic snapshots, which are then used by a collection of fully connected layers to predict future traffic per eNodeB.
Figure: An example of a deep learning pipeline, adapted for forecasting mobile traffic. In order to estimate future traffic demands at the eNodeB level, city-level metrics are input into stacks of 3D-CNNs and ConvLSTMs.

Last but not least, our decision modules are built on pre-existing algorithms that only need to be altered to accept the DL algorithms’ predictions as input. Therefore, using our framework leaves these algorithms’ fundamental functionality untouched. Contrary to completely ML-based systems, where ML algorithms directly make decisions, this gives considerably better control over how they operate.

Result
We have described an ML-based framework to improve (beyond-)5G network performance. Our architecture does not necessitate the invention of use-case-specific ML algorithms or the modification of existing network algorithms, in contrast to current approaches that directly embed ML within network control systems. To describe traffic features and forecast future traffic demands, our system deploys ML pipelines. The existing network control mechanisms are then given the predictions.
In a single complete framework, our method integrates and harmonises ideas from ITU-T, 3GPP, and other specifications. We demonstrated how our system can launch numerous ML pipelines for various goals. For one of them, proactive routing, we put our framework into practice and tested it. Results show that our method considerably lowers packet delay even in small topologies.

4.2.9 Prediction of Tornadoes using Time Series Forecasting



A tornado is a rapidly spinning column of air that is in touch with the ground, either
pendent from or beneath a cumuliform cloud, and that is frequently but not always
visible as a funnel cloud (Glickman & Zenk, 2000). In comparison to other countries,
the United States has more tornadoes (Perkins, 2002). In the United States, more
m

than 1200 tornadoes are reported each year, which is four times more than what has
been documented in Europe (National Oceanic Atmospheric Administration, 2008).
The United States experiences more violent tornadoes than any other nation, with
)A

EF4 or EF5 tornadoes on the Enhanced Fujita Scale (National Oceanic Atmospheric
Administration, 2008).

The majority of tornadoes in the US happen east of the Rocky Mountains. The Great Plains, the Midwest, and the Mississippi Valley are tornado-prone regions. A region that is highly vulnerable to tornadoes is referred to as Tornado Alley (Glickman & Zenk, 2000) or Dixie Alley (Cox, 2010). Additionally, tornadoes have been reported in every state in the US at least once since 1950 (National Oceanic Atmospheric Administration, 2008). Due to their location in Tornado Alley, which is where tornadoes occur most frequently, Midwestern states including Kansas, Missouri, Nebraska, Iowa, and South Dakota are particularly vulnerable to tornado activity (Tornadoes in the United States, 2016). Tornado Alley includes the state of Oklahoma as well.

Since records began to be kept in 1950 (F5 and EF5 Tornadoes of the United States), there have been 58 tornadoes that have been classified as having the maximum intensity and level of destruction (Tornadoes in the United States, 2016). Each year, tornadoes in the United States cause more than 80 fatalities and 1,500 injuries (National Oceanic Atmospheric Administration, 2008). Each year, tornado-related disasters cost billions of dollars in damages.

As a result, it is critical to create precise forecasts that can determine a tornado’s striking site, the extent of its property damage, and its length and width. These predictions can help spread awareness and work towards preventing losses of life and property in the area affected by the tornado event.

Numerical Results for Time Series Forecasting

In the tornado dataset, various univariate time series were fitted with time series forecasting models, and the resulting numerical results were examined and analysed. Five separate univariate time series have been created from the dataset: Property Damage, Tornado Length, Tornado Width, Beginning Latitude, and Beginning Longitude. The information has been provided by the National Climatic Data Center (2015). After data visualisation, the time series data is divided into training and validation periods. Records of tornado events from January 2009 to December 2013 make up the training dataset. Records of tornado incidents from January 2014 to December 2014 (12 months) make up the validation dataset.
Because we have decided to represent the time series in terms of monthly values, there is only one value input into the models each month and only one output value per model. The values inserted into the time series are the mean property damage per month, mean tornado length per month, mean tornado width per month, and mean latitude and longitude of tornado hits per month. Similarly, the forecasted values are the mean expected monthly values for property damage, tornado length, width, and location. These values lie within a predetermined prediction interval.

After model performances are compared, the most accurate model is picked to forecast future values from January 2015 to December 2015 (a forecast horizon of 12 months, identical to the validation period). The best model is determined by assessing the precision and seasonality of model forecasts on the validation period. The 2015 actual figures are then compared to the forecasts, and conclusions are drawn.
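A minimal sketch of this train/validate/forecast workflow is shown below, using Holt-Winters exponential smoothing from statsmodels as a stand-in (the text does not name the exact forecasting models used); the synthetic monthly series is an assumption in place of the real records:

import numpy as np
import pandas as pd
from statsmodels.tsa.holtwinters import ExponentialSmoothing

# Synthetic stand-in for mean monthly property damage, Jan 2009 - Dec 2014.
idx = pd.date_range("2009-01", periods=72, freq="MS")
series = pd.Series(np.abs(np.random.default_rng(2).normal(1e5, 5e4, 72)), index=idx)

train, valid = series[:"2013-12"], series["2014-01":]  # 2009-2013 vs 2014

# Additive trend and yearly additive seasonality, as observed in the text.
model = ExponentialSmoothing(train, trend="add", seasonal="add",
                             seasonal_periods=12).fit()
forecast = model.forecast(12)            # validation horizon: 12 months
errors = forecast.values - valid.values  # compare against 2014 actuals
print("validation MAE:", np.abs(errors).mean())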

Forecasting Property Damage


We start by looking at the damage that tornadoes caused to property between January 2009 and December 2014. We can see the overall trend in property damage values over these six years in the figure below. From this overall plot, we notice significant increases in the amount of property damage in the years 2010 and 2012. These years have seen the highest amounts of tornado-related property damage.

Figure: General Property Damage ($) Plot, 2009-2014

Property damage in the other years shows considerably more modest jumps. After decomposing the general time series, shown in the figure below, we notice yearly additive seasonality across the series. To better understand the nature of the series, we examine smaller time series windows to look for seasonality and trends. Thus, the three time periods of 2009–2010, 2011–2012, and 2013–2014 are used to partition the series.


Figure: Decomposed Property Damage 2009-2014
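A hedged sketch of such an additive decomposition with statsmodels follows; the synthetic monthly series is an illustrative stand-in for the real National Climatic Data Center records:

import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

# Synthetic stand-in for mean monthly property damage, 2009-2014.
idx = pd.date_range("2009-01", periods=72, freq="MS")
rng = np.random.default_rng(3)
seasonal = np.tile(np.sin(np.arange(12) / 12 * 2 * np.pi) * 2e4, 6)
series = pd.Series(1e5 + seasonal + rng.normal(0, 1e4, 72), index=idx)

# Additive decomposition into trend, yearly seasonal, and residual components.
result = seasonal_decompose(series, model="additive", period=12)
print(result.seasonal.head(12))  # the repeating yearly additive pattern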

The graph below shows that, within this time frame, 2010 saw the highest value for property damage. In 2012, $2.5 million in property damage was noted. During the years 2009 through 2010, April exhibits additive seasonality. Between 2011 and 2012, we notice additive seasonality in the month of June. Property damage values for the years 2013 to 2014 are noticeably lower, and November in both years shows an additive seasonality pattern.



Figure: Property Damage Time Windows 2009-2014



4.2.10 Facebook Face Detection


Facial recognition is a method of recognising or verifying a person’s identity using their face. People can be recognised using facial recognition technology in real time or in still images and videos.

Facial recognition is one of the many marvels of machine learning on Facebook. Even with heavy makeup on, it might not be difficult for you to identify your buddies on social media; but how does Facebook do it? If you have Facebook’s “tag recommendations” or “facial recognition” setting on (this indicates that you have given permission for facial recognition), the machine learning system analyses the pixels of the face in the image and develops a template that is a string of numbers. Because each face has a unique template, sort of like a facial fingerprint, it may be used to recognise that face in other photos and recommend a tag.
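A hedged sketch of the template idea follows: a face is mapped to a numeric template (embedding), and two faces are treated as the same person when their templates are close. The 128-dimensional size, the threshold, and the embed() stub are illustrative assumptions, not Facebook’s actual system:

import numpy as np

def embed(face_pixels: np.ndarray) -> np.ndarray:
    """Stand-in for a trained network mapping face pixels to a template.
    A real system would run a deep CNN here; we fake it with a fixed projection."""
    rng = np.random.default_rng(42)           # fixed weights for the demo
    W = rng.normal(size=(face_pixels.size, 128))
    v = face_pixels.ravel() @ W
    return v / np.linalg.norm(v)              # unit-length 128-d template

def same_person(face_a, face_b, threshold=0.8) -> bool:
    # Cosine similarity between templates; high similarity -> suggest a tag.
    return float(embed(face_a) @ embed(face_b)) > threshold

a = np.random.rand(64, 64)
print(same_person(a, a + 0.01 * np.random.rand(64, 64)))  # near-identical faces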

What use does it serve to enable facial recognition on Facebook, then?



The facial recognition technology can identify your template and send you a notification if a newly uploaded photo or video on Facebook features your face but you aren’t tagged in it. Additionally, you will be promptly informed if another person attempts to use your photo as their Facebook profile picture (perhaps in an effort to gain more popularity!). Facial recognition can also let those with vision impairments know if they are in a picture or video when used in conjunction with other accessibility features.
popularity!). Facial Recognition can also let those with vision impairments know if they
are in a picture or video when used in conjunction with other accessibility features.

nl
4.2.11 Manufacturing

Use Case

Around 1,500 processes are required in the production of memory chips, all of which must be done in sterile environments to prevent dust specks from harming the wafers. However, harm still takes place. Scratches, flaws, and other quality problems are frequently minute and almost imperceptible to the human eye!

Many different machines, pipes, and parts can be found in the manufacturing setting. These deteriorate, malfunction, or begin to leak. Early detection of these problems is essential. Maintenance is typically handled by engineers. Even the most talented engineer, though, may overlook early warning signs of a problem.

The process of making memory chips by nature offers a lot of room for error. By relying on human monitoring to spot quality flaws and mechanical concerns, Micron Technology was spending a lot of money: on average $250,000 per hour of downtime.

The Solution
AI solutions are a good fit for this specific business issue. Because the challenges are precisely defined, quantifiable, and measurable, there is sufficient internal data to employ Machine Learning (ML) on numerous fronts with good accuracy. The solutions also function with smaller amounts of data, but the ML algorithm’s accuracy will suffer; accuracy will increase as more data are collected.

Intel, another significant memory chip manufacturer, also integrated machine learning and vision algorithms into its wafer production process. The following intriguing statement is found in a whitepaper on their method: “Similar technology can be applied in many different industries-wherever machines capture photos, regardless of the original function for those images.”

The ML algorithms are intended to discover anomalies at an earlier stage and with greater accuracy and frequency than their human counterparts. It must be emphasised, again and again, that humans are still required to interpret and respond to the system’s alarms.

Machine Vision
Micron Technology integrated machine vision technology as its photolithographic cameras etch the circuitry into the wafers. The system checks for recurring problems and notifies the engineers whenever one is found. Depending on the type of defect, it can take anywhere from 15 seconds to 15 minutes before the alarm is issued.

The issue of manually classifying each fault is resolved by the company’s auto-defect-classification (ADC) system. The algorithm uses deep learning to sift and classify the millions of flaws. The ADC system is detailed in an Intel whitepaper.

Thermal Imaging

Thermal imaging was used by Micron to monitor their production process, significantly enhancing the efficacy and precision of the AI. A “digital twin”, effectively a digital copy of the industrial environment, is overlaid with “heat maps” of the factory environment under typical operating conditions. Then, using this map as a baseline, the factory’s infrared images can be assessed in real time. The device will sound an alarm if it detects an anomaly, such as temperatures that differ from the digital twin.
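A hedged sketch of this baseline-comparison idea: subtract the live thermal frame from the digital-twin baseline and alarm on large deviations. The grid size and the 5 °C threshold are illustrative assumptions:

import numpy as np

# Baseline "digital twin" heat map vs. a live infrared frame (degrees C).
rng = np.random.default_rng(4)
baseline = 40 + rng.normal(0, 0.5, size=(48, 64))
live = baseline + rng.normal(0, 0.5, size=(48, 64))
live[10:14, 20:25] += 12.0                # simulate an overheating component

deviation = np.abs(live - baseline)
hot_cells = np.argwhere(deviation > 5.0)  # alarm threshold: 5 degrees C
if hot_cells.size:
    print(f"ALARM: {len(hot_cells)} cells deviate from the digital twin")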

Acoustic Listening

The last of the three is perhaps the most unexpected: an AI solution has been developed to detect strange noises throughout the manufacturing process. Much like when your car makes strange noises, a machine generating an unusual sound frequently signifies trouble. By transforming sound into visual data points, the AI system at Micron has been trained to recognise anomalies in sound frequencies. Audial sensors are positioned adjacent to machinery or pumps to record the sounds of the equipment in a noisy environment. The engineers classify noises based on their likely causes.
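A hedged sketch of “transforming sound into visual data points”: compute a spectrogram with SciPy and flag frames whose spectral energy departs from the running norm. The sample rate, injected tone, and threshold are illustrative assumptions:

import numpy as np
from scipy.signal import spectrogram

# One second of simulated pump audio at 16 kHz, with a brief anomalous whine.
fs = 16000
t = np.arange(fs) / fs
audio = 0.1 * np.random.default_rng(5).normal(size=fs)            # machine hum
audio[12000:14000] += 0.5 * np.sin(2 * np.pi * 3000 * t[:2000])   # 3 kHz whine

# Turn the sound into a time-frequency image (the "visual data points").
freqs, times, Sxx = spectrogram(audio, fs=fs, nperseg=512)

# Flag time frames whose total energy exceeds 3x the median frame energy.
frame_energy = Sxx.sum(axis=0)
anomalous = times[frame_energy > 3 * np.median(frame_energy)]
print("anomalous frames at t =", np.round(anomalous, 3))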
ve
Benefits & Impact
First, Micron Technology’s AI solutions have significantly improved production accuracy and efficiency. The second improvement is in worker safety: workers come into contact with extreme temperatures and harmful substances less frequently. Third, AI solutions have given the company’s engineers valuable time to devote to other projects. Last but not least, the use of AI solutions in manufacturing has spread to other business processes, such as product demand projections, improving their accuracy by 10 to 20 percent.

Key benefits:
◌◌ Production in the manufacturing sector increased by 10%.
◌◌ 35 percent fewer quality problems.
◌◌ 25 percent quicker yield maturity time.
◌◌ Early diagnosis of equipment failures and quality problems saved millions of dollars.
◌◌ Gave engineers more time to work on other projects.
◌◌ Improved workplace safety.
◌◌ Facilitated the implementation of AI technologies in other corporate processes.

Accessibility and Requirements


Data, loads and lots of data, is the driving force behind the three AI technologies that Micron Technology has developed. Petabytes of manufacturing data are gathered by the business from more than 8,000 sources and 500 servers globally. For data mining, this data is sent to two environments of the Apache Hadoop open-source software programme. Hadoop can process large data sets in parallel, allowing for the simultaneous analysis of many datasets.

Over 2,000,000 photos are kept in the Hadoop environment for machine vision. To quickly manage the enormous strain of the complicated machine learning algorithm used in acoustic listening, Micron delivers the pertinent data to a GPU processor. By distributing work among numerous processors, GPUs can accelerate applications, enabling the system to handle the massive volume of incoming data.

4.2.12 Explainable AI and Google Maps

Explainable AI is a collection of tools and frameworks, natively linked with a number of Google products and services, that assist you in comprehending and interpreting the predictions produced by your machine learning models. With it, you can debug models, enhance their performance, and help others understand your models’ behaviour. Additionally, you may use the What-If Tool to graphically analyse model behaviour and generate feature attributions for model predictions in Vertex AI, BigQuery ML, and AutoML Tables.

Google Maps is used to drive more than 1 billion kilometres each day in more than 220 nations and territories around the world. When you get in your car or on your motorcycle and start navigating, you are immediately presented with a few things: which way to go, whether the traffic along your route is heavy or light, an anticipated journey time, and an estimated arrival time (ETA). Even if everything seems straightforward, a lot is happening in the background to offer this information in a matter of seconds.

We’ll dissect one of our favourite subjects today: traffic and routing. Read on if you’ve ever wondered how Google Maps chooses the optimal route for a journey or how it recognises when there’s a significant traffic jam.

Live traffic, powered by drivers all around the world



When users navigate with Google Maps, aggregate location data can be utilised to analyse traffic conditions on roads throughout the world. Even though this data can help you determine whether traffic congestion will delay your drive right now, it doesn’t take into account how traffic will behave 10, 20, or even 50 minutes into your trip. This is where technology really shines.

Predicting traffic with advanced machine learning techniques, and a little bit of history
Google Maps examines past traffic patterns for roads over time to forecast how traffic will look in the near future. For instance, a pattern might reveal that traffic on the 280 freeway in Northern California regularly moves at 65 mph between 6 and 7 am, but only at around 15-20 mph in the late afternoon. We then merge this database of past traffic patterns with live traffic conditions, using machine learning to create predictions based on both sets of data.

To increase the precision of our traffic forecasting capabilities, we recently collaborated with DeepMind, an Alphabet AI research lab. Our ETA forecasts already have a very high accuracy bar; in fact, we can observe that over 97 percent of journeys have had accurate ETA projections. Through our collaboration with DeepMind, we’ve been able to significantly enhance ETA accuracy in cities like Berlin, Jakarta, São Paulo, Sydney, Tokyo, and Washington, D.C. by utilising Graph Neural Networks, a machine learning architecture. By using this method, Google Maps is better equipped to anticipate whether or not you will be impacted by a slowdown that may not even have started yet!

Keeping it Fresh
For the majority of the 13 years that Google Maps has published traffic data, historical traffic patterns have generally been accurate predictors of the road conditions you are likely to face, but this isn’t always the case. Global transportation patterns have drastically changed since the COVID-19 outbreak began. When lockdowns began in early 2020, traffic decreased by up to 50 percent throughout the world. Since then, some regions of the world have slowly started to reopen, while others continue to have restrictions. We recently changed our models to make them more flexible in order to take this unexpected change into account. As a result, historical traffic patterns from the previous two to four weeks are now automatically prioritised, whereas patterns from earlier periods are deprioritised.
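A hedged sketch of that recency weighting: blend historical speed observations with weights that decay for older weeks. The two-week half-life and the data are illustrative assumptions, not Google’s actual model:

import numpy as np

# Mean observed speed (mph) on one road segment for the past 8 weeks,
# index 0 = most recent week.
weekly_speed = np.array([22, 24, 30, 45, 60, 62, 61, 63], dtype=float)

# Exponential decay with a two-week half-life: recent weeks dominate.
weeks_ago = np.arange(len(weekly_speed))
weights = 0.5 ** (weeks_ago / 2.0)

expected_speed = np.average(weekly_speed, weights=weights)
print(f"recency-weighted expected speed: {expected_speed:.1f} mph")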

How Google Maps Selects Routes
Google Maps’ route-finding algorithms rely heavily on our traffic forecast models. If we anticipate that traffic in one direction may become heavy, we’ll automatically find you a less congested option. Along with these variables, we also consider the condition of the roads. Is the road paved, unpaved, gravelly, muddy, or covered in dirt? These kinds of factors can make a road challenging to drive on, so we’re less inclined to advise including it in your itinerary. We also consider the size and directness of a route, because it is frequently more efficient to drive down a highway than to take a smaller road with several stops.

Authoritative data from local governments and current user input are two more sources of information that are crucial to ensuring that we only provide the best routes. Google Maps can access authoritative data to learn about speed restrictions, tolls, and whether certain roads are closed due to COVID-19 or construction. Additionally, incident reports from drivers help Google Maps swiftly display information about restricted lanes, nearby construction, damaged vehicles, and other roadside obstructions. Both sources also help account for unexpected changes in road conditions brought on by mudslides, snowstorms, or other natural phenomena.

Putting it all Together



So, how does this all function in practice? Let’s imagine you’re travelling down your usual route to a doctor’s appointment across town. When you leave the house, there are no signs of any delays in the traffic’s smooth flow. Using Google Maps’ traffic predictions combined with live traffic conditions, we notify you that if you stick with your present course, there’s a good possibility you’ll get caught in unexpected standstill traffic around 30 minutes into your ride, which would mean missing your appointment. Google Maps then automatically reroutes you, using information on local road conditions and incidents, to help you avoid the traffic altogether and arrive at your appointment on time.
road conditions and accidents.

Traffic prediction and route planning are tremendously complicated tasks, and we’ll continue developing tools and technology to keep you out of traffic jams and on a route that is as safe and efficient as possible.

4.2.13 Uber Eats

Determining the ideal times for drivers to pick up food delivery orders from a variety of businesses is one of the most challenging computational problems. Consider it similar to the Travelling Salesperson problem, the NP-hard combinatorial optimization problem: the delivery worker wants the meal ready when they pull up, and the consumer wants their food delivered promptly. Customers get dissatisfied and delivery personnel seek employment elsewhere if estimates are even slightly off.

Best known for its car-sharing services, Uber is developing Uber Eats, a service available worldwide that will be successful only with precise predictions. The company’s own machine learning platform, known as Michelangelo, will be the key to that success.

Zi Wang, a senior software engineer at Uber who oversees the company’s time
prediction efforts, provided an explanation of how AI is used to generate these time
estimates at QCON New York last month.

Uber Eats was introduced in Toronto in 2015 and has now expanded to 400 cities with 220,000 participating eateries. Utilizing independent contractors from Uber’s car-sharing business, it offers a meal delivery service for restaurants. The effectiveness of the service depends heavily on the ability to anticipate delivery timeframes, including when the meal will be ready and how long it will take to deliver. For 2018, it received orders of over $8 billion.

Wang’s lessons could benefit other companies in the nascent O2O (online-to-offline) e-commerce industry, which must squeeze their profits from constrained and unpredictable tolerances.

In 2015, Wang joined Uber to improve the company’s dispatch system, but the work eventually led him towards machine learning.

The business aims to make food delivery dependable, simple, and reasonably priced for customers. According to Wang, “We want to ensure that the food will be delivered without a hitch, which requires us to forecast the future and strike a balance between orders and delivery partners.” The system must forecast three things for each order: the time of delivery, the amount of time it takes to deliver the food, and the amount of time it takes the restaurant to prepare the order. Predictions are made more challenging by the fact that Uber lacks knowledge of how long it takes a restaurateur to make a certain meal.

The secret is timing. Uber Eats wants the delivery person to be dispatched so that they arrive when the order is prepared. If it is too early, the delivery worker will waste time waiting around, costing them money from potential orders. If it is too late, however, the hungry customers may get their food late or cold.
When users launch the app and browse a list of eateries, they are given a broad estimate of how long it will take for the food to arrive. Additionally, restaurants are given the option of estimating the time needed to prepare the food, which they can then modify if necessary.
Kafka is used to gather all the data, which is then forwarded to a streaming engine for pre-processing and storage in a Cassandra data store, where Spark processes it to model the data. The Cassandra model repository houses trained and ready-to-use models.
ve
ni
U
ity

With so many moving parts in the dispatch system’s problem space, Wang acknowledges that finding the ideal state is an NP-hard computational challenge. Referring to the customer, the delivery driver, and the restaurant, Wang remarked that “for every decision we are making, we are taking into mind all the partners.” The ideal situation is one in which each stakeholder receives the best possible result: the diner receives their food as quickly as possible, the delivery person arrives as promptly as possible, and the restaurant receives the best estimate of the amount of time it will take to prepare the food.

The dispatch system must determine the most precise timing to send the driver to
the restaurant after an order is placed.

“We require that our meal delivery partner show up at the appointed hour when the food will be ready. They shouldn’t show up too late because the meal will be cold. We don’t want them to arrive too early since they would waste time waiting there and also take up the restaurant’s parking and seating areas,” added Wang.

Before machine learning was in place, when deciding when to send a delivery person, Uber Eats relied on a greedy algorithm, which estimated the best local solution for each delivery without optimising the problem space for all the drivers in that region. This resulted in delayed deliveries and delivery personnel lingering in restaurant parking lots while orders were being processed, which had a negative impact on the service as a whole.

A greedy algorithm will find the nearest driver for a given order. A global optimization, though, would determine the ideal timings for all the drivers and pickups. Wang gave a fictitious example: if a greedy algorithm matched two drivers to their nearest orders, the total journey time would be six minutes; however, the pickups for those two drivers might be switched, resulting in a total travel time for both orders of only four minutes.
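A hedged sketch of greedy versus global matching with SciPy’s Hungarian-algorithm solver; the travel-time matrix mirrors Wang’s fictitious two-driver example and is an illustrative assumption:

import numpy as np
from scipy.optimize import linear_sum_assignment

# travel_time[d, o] = minutes for driver d to reach order o.
travel_time = np.array([[1.0, 3.0],
                        [1.0, 5.0]])

# Greedy: each driver in turn grabs the nearest unassigned order.
taken, greedy_total = set(), 0.0
for d in range(2):
    o = min((o for o in range(2) if o not in taken),
            key=lambda o: travel_time[d, o])
    taken.add(o)
    greedy_total += travel_time[d, o]

# Global optimum: minimise the total travel time across all pairs.
rows, cols = linear_sum_assignment(travel_time)
global_total = travel_time[rows, cols].sum()

print(f"greedy total: {greedy_total} min, optimal total: {global_total} min")

Here the greedy assignment totals six minutes, while swapping the pickups yields four, matching the example above.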

With ML in place, a journey time estimate may be obtained based on the historical data of all travel times, for all local restaurants, and for all open jobs and drivers. The method used by Uber Eats incorporates real-time, near-real-time, and historical data. “We can take into account all jobs and suppliers simultaneously with a global matching system,” he explained. “In a single global optimization problem, we can solve the complete collection of jobs and supplies.”

However, estimating the order preparation time can be challenging, in large part because Uber cannot observe how long the kitchen staff spends preparing the dish without being present. It can be inferred from the time the delivery person leaves the restaurant, although this is of limited relevance because the meal might already be done by the time the delivery person arrives.

Here, machine learning (ML) can offer insight from additional contextual cues, both historical and current: this approach takes into account things like the average amount of time spent preparing food over the previous week, the quantity and cost of unfulfilled orders, and even the day of the week. Even immediate signals can be used, such as how many other orders, including those from other delivery services, are currently being processed.

4.2.14 Amazon ADS Recommendation and iPhone Face Detection
It is commonly known that Amazon’s online store offers recommendations that are
incredibly accurate. To enhance customer experience and boost sales, it makes use of
technology like artificial intelligence algorithms and machine learning.

In order to provide clients with a list of suggested products, Amazon’s
recommendation system is capable of intelligently evaluating and anticipating their
shopping habits.

Amazon’s recommendation engine is regarded as one of the best on the market, despite the fact that rival retailers have recently added comparable features to their websites.
What is the Amazon algorithm?

In tandem with the advancement of AI technology, Amazon started creating an algorithm to scan the items users browse and purchase and ascertain each customer’s shopping preferences.
The recommendation engine used by Amazon is an algorithm made up of numerous crucial components that analyse various types of data. Artificial intelligence and machine learning-based technology have made this possible.

Each user’s recommended products are chosen by Amazon’s algorithm based on their past purchases, interactions with other items on display, and ratings of those items. These recommended products are then combined with similar products viewed by other users who share their tastes and interests.

How does Amazon use artificial intelligence in sales?

Existing customers anticipate that the online store will vary their buying options and offer them personalised information.

According to the most recent research on personalisation, 98 percent of eCommerce website owners claim that personalisation strengthens their relationships with customers, and up to 91 percent of online store customers admit that they are more likely to use a brand’s offer that customises their experience.

Personalization is the secret to achieving all of these objectives, whether you want
to increase views, lower bounce rates, or improve click-through rates.

For this reason, Amazon employs artificial intelligence in a number of its commercial operations. In particular, Amazon’s recommendation system uses AI to enhance website personalisation.

By making suggestions that optimise potential value for each individual customer, Amazon is able to keep customers interested and offer them things they might not even have considered.

Amazon and Personalization of the Shopping Path
To personalise purchases, the business developed the Amazon Personalize
feature.

It is a machine learning and artificial intelligence service that specialises in creating recommender system solutions. In order to produce real-time suggestions, it automatically analyses data, chooses functions and algorithms, optimises the model based on the data, and then implements and maintains the model.

Developers can utilise this service to create sophisticated, intelligent recommendation systems for their own websites, going beyond conventional eCommerce recommendations.

In addition to the suggestions themselves, Amazon uses a number of different artificial intelligence algorithms supporting various elements of the platform’s operation.

Additionally, the business makes use of its own A9 algorithm, which powers the
website’s sophisticated product search.
The A9 algorithm examines and categorises brands and their products on the marketplace, enabling it to provide relevant and customised search results to Amazon users.

This technique is also used to choose which merchants are shown to customers on
the front page.
ni

The three main operating concepts of the Amazon A9 algorithm are as follows:

◌◌ To determine the best products, it considers keywords, content, vendor information, user comments and reviews, and return rates.
◌◌ The A9 algorithm categorises products based on past sales outcomes, text-matching precision, price, and stock levels at specific sellers.
◌◌ The position of specific products in the algorithm’s ranking is also affected by indirect factors. Among the most crucial of these are delivery and payment choices, product details and images, premium content, advertising, and promotions.
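A hedged sketch of this kind of ranking follows: combine a text-relevance score with performance signals through weights. The weights, helper names, and tiny catalogue are illustrative assumptions, not the actual (proprietary) A9 formula:

from dataclasses import dataclass

@dataclass
class Product:
    title: str
    sales_rank: float   # normalised 0-1, higher = better past sales
    price_score: float  # normalised 0-1, higher = more competitive price
    in_stock: bool

def text_match(query: str, title: str) -> float:
    # Crude relevance: fraction of query words present in the title.
    words = query.lower().split()
    return sum(w in title.lower() for w in words) / len(words)

def rank(query: str, catalogue: list) -> list:
    def score(p: Product) -> float:
        base = 0.5 * text_match(query, p.title) + 0.3 * p.sales_rank \
               + 0.2 * p.price_score
        return base if p.in_stock else 0.0   # out-of-stock items sink
    return sorted(catalogue, key=score, reverse=True)

catalogue = [Product("USB-C charging cable 2m", 0.9, 0.7, True),
             Product("USB-C cable braided", 0.6, 0.9, True),
             Product("HDMI cable 1m", 0.8, 0.8, True)]
for p in rank("usb-c cable", catalogue):
    print(p.title)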
m

How Apple Is Using Machine Learning


Developing a smartphone with facial recognition is not easy. By overcoming challenges like the size of the device, the launch of Face ID has put Apple in a sweet spot. With Face ID, an iPhone X user simply holds up the device and it recognises his or her face.

According to Phil Schiller, senior vice president of global marketing at Apple, “nothing has ever been simpler, more intuitive, and effortless.” Face ID is the way we’ll unlock smartphones in the future and safeguard our private data.

The foundation of the face detection capability is the use of machine learning algorithms that have been trained to recognise you. The software will be able to identify you even with minor changes like glasses, makeup, and maybe a beard, and it will only get better with time.

Apple used a number of neural networks for this function, and it is open about how secure Face ID is. With Face ID, the likelihood that a random individual will be able to unlock the phone is 1 in 1,000,000, as opposed to Touch ID’s 1 in 50,000. According to Apple, having an identical twin is the only thing that can readily defeat the software.

Various pieces of hardware have made the facial detection feature possible. The elements include a Dot Projector (a transmitter of structured light), an infrared camera (a receiver of structured light), a flood light (an ambient sensor), and a proximity sensor. The iPhone X’s front camera also collects 2D information about a user’s face. When the user looks at their phone, the dot projector projects 30,000 infrared dots onto their face.

ty
Schiller claimed that the iPhone X smartphone accesses the neural engine in the
Apple A11 chip. Face identification, known as Face ID on the iPhone X, is one of the
activities that the A11 can speed up using machine learning and artificial intelligence.

si
Machine learning activities are also well-suited to the A11’s graphics processor.
Additionally, the camera contains an image signal processor created by Apple that can
improve photos and low-light autofocus.

r
Apple’s Earlier Machine Learning Efforts
ve
Apple launched Core ML, a new machine learning framework API for developers,
in June of this year. Core ML would hasten the processing of AI activities on the iPhone,
iPad, and Apple Watch.
ni

Numerous fundamental machine learning tools are supported by Core ML,


including deep, recurrent, and convolutional neural networks, linear models, and tree
ensembles. The data that developers use to enhance user experience will stay on
U

users’ smartphones and tablets because Core ML is designed for on-device processing.

4.2.15 Use Cases



People are currently experiencing a true global technological revolution, thanks to developments in computing power and applications for machine learning. Artificial intelligence (AI) research has made significant strides in the first two decades of the twenty-first century. One of the most successful and widely used technological applications, machine learning has a daily impact on billions of users and a variety of businesses. Machine learning, a subset of artificial intelligence, is the study and use of statistical models and methods that enable computer systems to carry out specified tasks without human input. Machine learning makes possible future technologies that people use every day.

Voice Assistant
These days, voice assistants are everywhere. Voice assistants are becoming more and more common, thanks to programmes like Apple’s Siri, Google Assistant, Amazon’s Alexa, and others. All of these voice assistants use Natural Language Processing (NLP), a machine learning technique, to recognise speech. The response is then created after using machine learning to translate the voice into numbers. NLP is also used to convert ambiguous legalese from contracts into understandable English. As machine learning techniques evolve, researchers anticipate voice assistants becoming astonishingly smarter.

Personalised Marketing
Technology is advancing in marketing. Using machine learning features, the marketing sector groups clients based on behavioural and characteristic data. Digital advertising platforms enable marketers to concentrate on the group of consumers most likely to be influenced by their products. Knowing what customers want, marketers can provide better product advertising.
O
customers want and thus provide better product advertising.

Fraud Detection

ty
Machine learning is being used by banks and large financial services providers to
detect fraud. This aids businesses in safeguarding customer safety. Companies that
process credit card transactions can benefit from machine learning as well. According
to the policies of the company, the system is programmed to identify transactions that

si
seem to be fraudulent based on specific criteria. Companies might avoid suffering a
significant cost by spotting such errors. Additionally, an organisation can use machine
learning to estimate sales or demand in real-time and obtain insights into its consumer
loyalty and competitive landscape. r
ve
Self-Driving Cars
Self-driving cars are one of the exciting technologies that make heavy use of machine learning. Self-driving cars have the advantage of incorporating all three of the main machine learning techniques, supervised, unsupervised, and reinforcement learning, into their design. Machine learning capabilities are used by smart cars to identify objects in the immediate vicinity of the vehicle, measure the distance to the vehicle in front, determine the location of the pavement and traffic signals, assess the driver’s health, and categorise scenes. Real-time advice regarding traffic and road conditions can also be provided via machine learning.

Transportation Optimization
Businesses aiming to increase the transportation sector’s reliance on technology are using machine learning as their main tool. Machine learning is used by ride-hailing applications like Uber, Lyft, Ola, and others to design the best routes and set costs for all of their many services. Travel rates are adjusted dynamically to reflect shifting market conditions. Prices change based on variables such as the time of day, the location, the weather, customer demand, etc. Drivers can now use machine learning to discover the best path to take passengers from point A to point B.

Predicting Behaviour
Machine learning models can be used by businesses to forecast customer behaviour based on historical data. Businesses examine social media for topics people are discussing before identifying users who are looking for a certain commodity or service. Zappos, for instance, leverages analytics and machine learning in order to offer clients customised sizing and search results as well as predictive behaviour models.

Healthcare

The benefit of machine learning in the healthcare industry is its capacity to handle enormous datasets beyond the limits of human capability and then consistently turn the analysis of those datasets into clinical insights that benefit clinicians. The use of machine learning in the planning and delivery of healthcare ultimately results in improved outcomes, cheaper healthcare expenditures, and greater patient satisfaction. Women’s mammography scans can be reviewed and cancer risk assessed using computer-assisted diagnosis (CAD), a machine learning programme.

Process Automation

Intelligent process automation (IPA) is the result of the fusion of AI with related technologies like computer vision, cognitive automation, and machine learning. By combining these technologies into a single process, companies have a richer automation possibility, unleashing greater commercial value for the firm. With the use of machine learning algorithms, it is possible to automate error-free insurance risk assessment from human data entry activity.

Chatbots
Machine learning is assisting customer service by utilising chatbots that provide pertinent answers to customers’ questions. Using Natural Language Processing (NLP) and sentiment analysis ideas, machine learning algorithms are able to comprehend the needs of customers and the tone in which they express those needs. The system then directs the inquiry to the proper customer service agent.

Physical Security

Machine learning greatly aids security at huge gatherings. The use of technology in security at significant public events helps prevent false alarms and detects items that human screeners would overlook. As an illustration, Evolv Technology claims to provide a physical security system that screens 600 to 900 individuals every hour.

4.2.16 Google Translation


68 percent of smartphone users utilise machine learning features on a daily basis without even knowing how they function, and most individuals are unaware of what these features actually are. Voice assistants, predictive text, driving route suggestions, and many more AI-driven functions are in everyday use.

The need to understand what someone speaking to us in a different language is saying contributed to the development of translation. What began as a way for two people to better understand one another has evolved into one of the most lucrative economic sectors. However, you don’t always need to use a qualified translator. Sometimes all you need to know about a document or a word is its general meaning, and in that circumstance you might look online for assistance. The most dependable translation tool online is Google Translate, and consumers are aware that they shouldn’t put their trust in just any app’s method. They are therefore curious about how Google’s translation feature functions.

Google Translate has been available for eleven years. Phrase-Based Machine Translation was the main algorithm used when Google Translate was first released. Later, Google made further advances in machine learning that irrevocably changed the way we work with foreign languages.


ty
Google Neural Machine Translation, or GNMT. An encoder network (on the left) and a
decoder network make up its model architecture (on the right).

Speech must be represented via an intermediate representation in traditional

si
cascade systems. With Translatotron, Google demonstrated that a single sequence-to-
sequence model can translate voice from one language into speech in another without
the need for an intermediary text representation, in contrast to cascaded systems.

How Google translator using Neural Networks?


r
ve
Given the flexibility of human language, machine or automatic translation may be
one of the most challenging AI undertakings. Deep neural network models now produce
cutting-edge findings in the field of neural machine translation. A lot of the shortcomings
ni

of traditional phrase-based translation systems may be overcome by using this end-to-


end learning technique for automated translation.

Recurrent Neural Network (RNN):


U

Recurrent neural networks (RNNs) with the ability to learn long-term dependencies
are known as LSTMs. A series of repeating modules may be seen in all RNNs.
ity
m

In the above image:


◌◌ x: the input. It might be a word from a sentence or another kind of sequential information.
◌◌ O: the output. For instance, the network might suggest the next word to follow the preceding word in a phrase.
◌◌ h: the RNN’s central component. It includes the network’s weights and activation functions.
◌◌ V: represents the flow of information between two time steps.


The network in the image can be represented equally well folded and unfolded. When you want to better understand what is happening at each stage, it can be helpful to unfold the network.

Long short-term memory (LSTM) and Gated Recurrent Unit (GRU)

Problems exist with the fundamental RNN building block illustrated previously. One of the most significant is the inability to remember information presented in a lengthy sequence: information that was provided a few time steps earlier is forgotten, which constrains learning performance. To address this, certain architectural solutions were developed. Long Short-Term Memory (LSTM) and Gated Recurrent Units (GRU) are the most common.

The vast majority of applications use one of these two sister architectures. The contents of the h block are what distinguish these two from the basic RNN. The images below illustrate the internal parts of both the LSTM and the GRU, but I won’t go into depth about each one.
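A hedged sketch of an LSTM consuming a word sequence, using PyTorch; the vocabulary size, embedding width, and task (next-token scores) are illustrative assumptions:

import torch
import torch.nn as nn

class TinyLanguageModel(nn.Module):
    """Embed tokens, run them through an LSTM, and score the next token."""
    def __init__(self, vocab_size=1000, embed_dim=32, hidden_dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, vocab_size)

    def forward(self, token_ids):
        x = self.embed(token_ids)        # (batch, seq, embed_dim)
        out, _ = self.lstm(x)            # hidden state carries long-term context
        return self.head(out[:, -1, :])  # scores for the next token

model = TinyLanguageModel()
tokens = torch.randint(0, 1000, (4, 10))   # batch of 4 sequences, length 10
print(model(tokens).shape)                 # torch.Size([4, 1000])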

However, the difficulty was that NMT systems are generally computationally expensive and have trouble handling unusual words. Google’s Neural Machine Translation (GNMT) system was introduced to address these problems.

How GNMT Works

To increase parallelism and thereby shorten training time, GNMT uses attention and residual connections to build a deep LSTM (Long Short-Term Memory) network with 8 encoder and 8 decoder layers.

The top layer of the encoder is connected to the bottom layer of the decoder through Google’s attention mechanism. Low-precision arithmetic is used in the inference computations to speed up the final translation. To better handle unusual words, words are split into a small number of common sub-word units for input and output. This approach naturally handles the translation of uncommon words and enhances the system’s overall accuracy by striking a fair balance between the flexibility of “character”-delimited models and the effectiveness of “word”-delimited models. The length-normalisation algorithm and coverage penalty used in the Google team’s beam search technique encourage the creation of an output sentence that is most likely to include every word in the source sentence. GNMT produces results competitive with the state of the art on the WMT’14 English-to-French and English-to-German benchmarks. Compared to Google’s phrase-based production system, it reduces translation errors by an average of 60 percent in a human side-by-side evaluation on a collection of isolated simple sentences.
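A hedged sketch of the encoder-decoder-with-attention idea follows, far smaller than GNMT’s stack of 8 encoder and 8 decoder layers with residual connections; the layer sizes and random token data are illustrative assumptions:

import torch
import torch.nn as nn

class Seq2SeqWithAttention(nn.Module):
    """One-layer LSTM encoder/decoder with dot-product attention over
    encoder states; GNMT stacks 8 of each with residual connections."""
    def __init__(self, src_vocab=1000, tgt_vocab=1000, dim=64):
        super().__init__()
        self.src_embed = nn.Embedding(src_vocab, dim)
        self.tgt_embed = nn.Embedding(tgt_vocab, dim)
        self.encoder = nn.LSTM(dim, dim, batch_first=True)
        self.decoder = nn.LSTM(dim, dim, batch_first=True)
        self.out = nn.Linear(2 * dim, tgt_vocab)

    def forward(self, src_ids, tgt_ids):
        enc_out, state = self.encoder(self.src_embed(src_ids))
        dec_out, _ = self.decoder(self.tgt_embed(tgt_ids), state)
        # Dot-product attention: each decoder step attends over encoder states.
        scores = torch.bmm(dec_out, enc_out.transpose(1, 2))
        context = torch.bmm(torch.softmax(scores, dim=-1), enc_out)
        return self.out(torch.cat([dec_out, context], dim=-1))

model = Seq2SeqWithAttention()
src = torch.randint(0, 1000, (2, 7))   # source-language token ids
tgt = torch.randint(0, 1000, (2, 5))   # shifted target tokens
print(model(src, tgt).shape)           # torch.Size([2, 5, 1000])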

4.2.17 Tesla Driverless Car
To train its algorithms for autonomous vehicles, Elon Musk’s corporate offices are receiving data from over 500,000 Teslas throughout the globe. In the rush to put more self-driving cars on the road, this data gives Tesla a significant advantage.

You might imagine Tesla is a conventional auto manufacturer when you think of
them. Tesla is without a doubt a pioneer in the field of electric vehicles.

However, the secret to their success is that they are a technology company. One of the factors contributing to that success is that their business is built on artificial intelligence technology.

Making their cars totally autonomous is currently one of Tesla’s main objectives,
and they are using big data and AI to achieve this.

How AI Can “Teach” Cars to Drive on Their Own


Autonomous vehicles constantly analyse data from their sensors and machine vision cameras in order to be able to drive independently. They then use this data to decide what to do next.

They employ AI to comprehend and predict the next moves of bicycles, pedestrians, and autos. They can use this information to quickly plan their actions and make split-second decisions. Should the car continue in its current lane or should it switch lanes? Should it stay where it is or pass the automobile in front of it? When should the vehicle decelerate or speed up?




Tesla needs to gather the appropriate data to train the algorithms and feed their AIs in order to make cars fully autonomous. Better performance will always result from more training data, and Tesla shines in this area.

The fact that Tesla crowdsources all of its data from the hundreds of thousands of Tesla vehicles that are now on the road gives it a competitive advantage. Both internal and external sensors track how Teslas behave in a variety of circumstances. They also gather information on driver behaviour, including how they respond to various circumstances and how frequently they touch the steering wheel or dashboard.

“Imitation learning” is the name of Tesla’s strategy. Millions of real drivers throughout the world make decisions, respond, and move, and their algorithms learn from those actions. All those miles result in incredibly sophisticated autonomous vehicles.

Their tracking system is really advanced. For instance, when a Tesla vehicle predicts the behaviour of a car or bicycle incorrectly, Tesla stores a data snapshot of the moment, adds it to the data set, and then recreates an abstract representation of the world using colour-coded shapes that the neural network can learn from. In contrast to the real-world data that Tesla is utilising to train its AIs, other businesses working on autonomous vehicles employ synthetic data (such as driving behaviours from video games like Grand Theft Auto).
AI at the Heart of Tesla
Since its start, Tesla has benefited from data from its current customer base, and
its work on autonomous vehicles is a part of their ongoing ambition to put AI at the
centre of all their operations.

AI and big data will continue to be Elon Musk and his team at Tesla’s faithful allies
as they move into their newest initiatives (including their plans to transform the electric
grid with their home solar power panels).

4.2.18 Kuka Game and Email


The robot places an orange street on a screen that mimics the playing field of the well-known board game Settlers of Catan. Students at the OTH Regensburg conducted some fundamental software engineering research as part of their robotics course, which led to the creation of the board game project and allowed Markus Webert to complete his bachelor’s degree in computer science. Prof. Dr. Klaus Volbert, a Regensburg professor of computer science, served as his supervisor.

Collecting Data for Artificial Intelligence



Standing in front of the robot cell with the KR AGILUS, Webert explains: “Before I started my bachelor’s degree, I thought ‘Programming, I don’t like that much.’ But now I’m really into it. When I first started learning programming, a lecturer encouraged me to participate in this awesome effort to create an AI that could play Settlers of Catan. We started with a platform for an online game that required actual participants, and we used that platform to gather information from those players. The intention was to use the data for later purposes, such as machine learning, to train our algorithm to play.”




Industrial Robot makes Connection between Theory and Practice


There are 40 students enrolled in the robotics programme at OTH Regensburg right now. The course lecturer and professor at the Faculty of Mathematics and Science is Prof. Dr. Martin Weiss. He has a background in industry, and practice is a key component of his lessons. Martin explains that in his courses, students “use robots as a practical example to illustrate what very theoretical things in Mathematics and Computer Science might mean in the real world. We don’t only do things on paper or write programmes; instead, I can demonstrate using a computer and robot that an equation has multiple solutions and that the robot can move to the same spot using a variety of different axis configurations.”

Working with the KUKA Small Robot Fascinates Students
The students particularly like the robot, according to Prof. Dr. Weiss. “Working with the robot you see bits moving and you see the robot doing some intriguing things.” They enjoy it because it allows them to apply their thoughts to extremely useful things. The students should understand that this requires teamwork and that this is a job where they might have a bright future ahead of them. Sometimes students are afraid of mathematics, but the problems can be divided so that one group works on the mathematics, the other works on programming, and others work on the engineering.

The KR AGILUS will then have a camera installed so that future students can create vision systems that enable the robot to pick up pieces that are randomly orientated. For the time being, the KUKA tiny robot will continue to move the robber, develop settlements, optimise streets, and place buildings on the screen.

Email and Machine Learning



Unsupervised machine learning


We utilised unsupervised machine learning to cluster the unlabeled emails. How, what? Yes, unsupervised, as our training data only contains inputs, often referred to as features, and no outputs. In supervised machine learning we deal with inputs and their known results; this was an unsupervised machine learning challenge where we wanted to categorise emails based on their message bodies.

Loading in the data


We divided the dataset into a few files, each containing 10k emails, rather than loading in all 500k+ emails at once. You don’t want to load the entire Enron dataset into RAM and perform intricate calculations with it, we can assure you of that.

import pandas as pd
)A

emails = pd.read_csv('split_emails_1.csv')

print(emails.shape)  # (10000, 3)
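As an aside, pandas can also stream a large CSV in fixed-size chunks, which is another way to avoid loading all 500k+ emails at once. A minimal sketch; the file name here is an assumption for illustration:

# Hypothetical alternative: iterate over the full dataset 10,000 rows at a time
# instead of pre-splitting it into separate files.
for chunk in pd.read_csv('emails.csv', chunksize=10000):
    print(chunk.shape)  # each chunk is an ordinary DataFrame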

The dataset now had 10,000 emails divided into three columns (index, message id, and the raw message). We converted the raw message into key-value pairs before working with this data.

A raw email message would look something like this.




Message-ID: <0965995.1075863688265.JavaMail.evans@thyme>
Date: Thu, 31 Aug 2000 04:17:00 -0700 (PDT)
From: phillip.allen@enron.com
To: greg.piper@enron.com
Subject: Re: Hello
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-From: Phillip K Allen
X-To: Greg Piper
X-cc:
X-bcc:
X-Folder: \Phillip_Allen_Dec2000\Notes Folders\'sent mail
X-Origin: Allen-P
X-FileName: pallen.nsf

Greg,

How about either next Tuesday or Thursday?

Phillip

We created a function that extracts this data into key-value pairs so that we could work just with the sender, receiver, and email body data.

def parse_raw_message(raw_message):
    lines = raw_message.split('\n')
    email = {}
    message = ''
    keys_to_extract = ['from', 'to']
    for line in lines:
        if ':' not in line:
            # Lines without a colon belong to the message body.
            message += line.strip()
            email['body'] = message
        else:
            # Header lines look like "Key: value".
            pairs = line.split(':')
            key = pairs[0].lower()
            val = pairs[1].strip()
            if key in keys_to_extract:
                email[key] = val
    return email

def map_to_list(emails, key):
    # Small helper (missing from the original listing): collect one field
    # from every parsed email, using '' when the field is absent.
    return [email.get(key, '') for email in emails]

def parse_into_emails(messages):
    emails = [parse_raw_message(message) for message in messages]
    return {
        'body': map_to_list(emails, 'body'),
        'to': map_to_list(emails, 'to'),
        'from_': map_to_list(emails, 'from')
    }

After executing this function, we produced a fresh dataframe that appears as follows:

email_df = pd.DataFrame(parse_into_emails(emails.message))

   index  body           from_              to
   0      After some...  phillip.allen@..   tim.belden@..

To make absolutely certain that there are no empty columns:

email_df.drop(email_df.query(
    "body == '' | to == '' | from_ == ''"
).index, inplace=True)

Analyzing text with TF-IDF

TF-IDF stands for term frequency–inverse document frequency; it is a metric that measures the weight of a word in a document within a corpus or collection. Machines struggle with text but excel at numbers, so we need to feed them something they can understand. That is why we created a document-term matrix out of the email bodies:

from sklearn.feature_extraction.text import TfidfVectorizer

vect = TfidfVectorizer(stop_words='english', max_df=0.50, min_df=2)



X = vect.fit_transform(email_df.body)
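A quick sanity check on the result: X is a sparse matrix with one row per email and one column per vocabulary term. The shape printed below is indicative only, since the exact numbers depend on the data:

print(X.shape)  # e.g. (9706, 24813): number of emails x vocabulary size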

To see this matrix, we quickly created a visualisation. To accomplish this, we first created a two-dimensional approximation of the DTM (document-term matrix).

import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

X_dense = X.todense()
coords = PCA(n_components=2).fit_transform(X_dense)

plt.scatter(coords[:, 0], coords[:, 1], c='m')
plt.show()




Having done that, we sought to identify the principal terms used in those emails. That is exactly why we created this function:

import numpy as np

def top_tfidf_feats(row, features, top_n=20):
    # Sort term indices by descending TF-IDF score and keep the top_n.
    topn_ids = np.argsort(row)[::-1][:top_n]
    top_feats = [(features[i], row[i]) for i in topn_ids]
    df = pd.DataFrame(top_feats, columns=['features', 'score'])
    return df

def top_feats_in_doc(X, features, row_id, top_n=25):
    # Top terms for a single document (one row of the matrix).
    row = np.squeeze(X[row_id].toarray())
    return top_tfidf_feats(row, features, top_n)

This function produced the following result after being applied on a document.

features = vect.get_feature_names()

print(top_feats_in_doc(X, features, 1, 25))

    features     score
0   meetings     0.383128
1   trip         0.324351
2   ski          0.280451
3   business     0.276205
4   takes        0.204126
5   try          0.161225
6   presenter    0.158455
7   stimulate    0.155878
8   quiet        0.148051
9   speaks       0.148051
10  productive   0.145076
11  honest       0.140225
12  flying       0.139182
13  desired      0.133885
14  boat         0.130366
15  golf         0.126318
16  traveling    0.125302
17  jet          0.124813
18  suggestion   0.124336
19  holding      0.120896
20  opinions     0.116045
21  prepare      0.112680
22  suggest      0.111434
23  round        0.108736
24  formal       0.106745
If you look at the corresponding email, everything makes sense.

The travel is less enjoyable when it is to get to a business meeting, especially if a presentation needs to be prepared. I’d advise having the business plan meetings here before going on vacation without having any official business meetings. Particularly if a trip wasn’t even wanted or necessary, I would try to acquire some straight answers. Regarding the business meetings, I believe it would be more beneficial to try and incite talks about what is working and what is not working across the various groups. All too frequently, only the presenter talks while the audience members wait their turn in silence. A round table discussion structure for the meetings could be preferable. Austin is a good option, in my opinion. Golf, a ski boat, and jet skis can all be rented. It takes too long to fly somewhere.

Writing a function to extract the most important terms from all the emails was the
next step.
m

def top_mean_feats(X, features,
                   grp_ids=None, min_tfidf=0.1, top_n=25):
    # Average the TF-IDF scores over a group of rows (or over all rows).
    if grp_ids:
        D = X[grp_ids].toarray()
    else:
        D = X.toarray()
    D[D < min_tfidf] = 0  # ignore very low-weight terms
    tfidf_means = np.mean(D, axis=0)
    return top_tfidf_feats(tfidf_means, features, top_n)

Here is a list of the top terms across all the emails:

in
print(top_mean_feats(X, features, top_n=10))

   features   score
0  enron      0.044036
1  com        0.033229
2  ect        0.027058
3  hou        0.017350
4  message    0.016722
5  original   0.014824
6  phillip    0.012118
7  image      0.009894
8  gas        0.009022
9  john       0.008551
ve
The information we had thus far was intriguing, but we were curious to learn more and see what else the machine could learn from this data.

Clustering with KMeans



A well-liked clustering approach in machine learning is called KMeans, where K is the number of clusters. With three clusters and 100 iterations, we built a KMeans classifier.

from sklearn.cluster import KMeans

n_clusters = 3
clf = KMeans(n_clusters=n_clusters, max_iter=100, init='k-means++', n_init=1)
labels = clf.fit_predict(X)
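Once fitted, the same model can assign an unseen email body to one of the learned clusters. The sample sentence below is a made-up illustration, and the vectoriser must be the one fitted above:

new_body = ["Let us plan the ski trip after the business meetings."]
print(clf.predict(vect.transform(new_body)))  # e.g. array([1])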

The classifier produced the following 3 groupings after training.






Because we now knew which emails the machine assigned to each cluster, we were able to create a function that retrieves the top phrases for each cluster.

def top_feats_per_cluster(X, y, features, min_tfidf=0.1, top_n=25):
    dfs = []
    labels = np.unique(y)
    for label in labels:
        ids = np.where(y == label)
        feats_df = top_mean_feats(X, features, ids, min_tfidf=min_tfidf, top_n=top_n)
        feats_df.label = label
        dfs.append(feats_df)
    return dfs
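A minimal usage sketch (printing instead of plotting); top_n is lowered here just to keep the output short:

for df in top_feats_per_cluster(X, labels, features, top_n=5):
    print('cluster', df.label)
    print(df)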

We discovered an excellent example of how to plot this graph with matplotlib instead of printing out the terms. We then duplicated the function, made some changes, and created the following plot:

We noted right away that cluster 1 contained strange words like “hou” and “ect.” To learn more about why phrases like “hou” and “ect” are so popular, we basically needed to gain additional insight into the entire dataset, which necessitated a different methodology.

Summary
●● The focus of this chapter is on unsupervised learning, a collection of statistical methods made for situations where we simply have a set of features, such as X1, X2, ..., Xp, measured on n observations. Because we lack a response variable Y, we are not interested in making predictions. Instead, the objective is to learn interesting facts regarding the measurements on X1, X2, ..., Xp. Does a useful technique exist to visualise the data? Can we identify subgroups within the observations or within the variables? Unsupervised learning encompasses a wide range of methods for responding to queries like these. In this chapter, we concentrated on two specific types of unsupervised learning: principal components analysis and clustering. Principal components analysis is a tool for data visualisation or pre-processing before supervised approaches are utilised.

nl
●● Unsupervised learning is frequently more difficult. There is no clear objective for the analysis, such as the prediction of a response, hence the exercise tends to be more subjective. Unsupervised learning is frequently used in conjunction with exploratory data analysis. Furthermore, because there is no widely established way of performing cross-validation or verifying results on a separate data set, it might be difficult to evaluate the outcomes of unsupervised learning methods. It is easy to understand why there is a difference: if we use supervised learning to fit a predictive model, we may evaluate our results by seeing how well the model predicts the answer Y on observations that weren’t used to fit the model. However, since we don’t know the actual solution in unsupervised learning, there is no way to verify our work.
●● Unsupervised machine learning clustering that starts with a top-to-bottom hierarchy of clusters is called “hierarchical clustering.” Following that, it decomposes the data items based on this hierarchy to produce the clusters. Depending on whether the process of building clusters proceeds top-down or bottom-up, this strategy adopts one of two approaches.
●● Agglomerative clustering is quite different from Divisive clustering: each of the “N” data points is initially regarded as a separate component of the “N” clusters that the data is divided into. We then repeatedly condense these numerous “N” clusters into a smaller number of clusters, let’s say “k” clusters, allocating the data points accordingly. This bottom-up method combines the clusters using termination logic (a minimal code sketch of this bottom-up approach follows this summary).


●● The most efficient method of forming clusters and allocating data points to them is centroid-based clustering, which is regarded as one of the most straightforward clustering algorithms. The idea behind centroid-based clustering is that each cluster is identified and represented by a central vector, and data points are assigned to the cluster whose central vector they are nearest to.
●● A family of clustering algorithms considers probability, which is a whole separate measure. Data points are created and grouped using distribution-based clustering based on their likelihood of belonging to the same probability distribution (such as a Gaussian, binomial, or other) in the data.
●● Fuzzy clustering can be applied to datasets with a lot of overlap between the variables. It is a strongly recommended approach for image segmentation in bioinformatics, where generic clustering algorithms find it difficult to distinguish between a picture’s pixels and fail to produce a suitable clustering because of overlapping gene codes.

●● One well-known HCA is the agglomerative hierarchical clustering algorithm. It employs a bottom-up methodology to cluster the datasets. This means that this algorithm starts by treating each dataset as a single cluster and then begins combining the two clusters that are the closest to one another.
●● Divisive clustering is also referred to as a top-down strategy. The number of clusters need not be predetermined for this approach. Top-down clustering starts by figuring out how to split a cluster that contains all of the data, then splits clusters recursively until all of the individual data points are in singleton clusters.

●● Finding unusual occurrences or observations that are statistically distinct from the rest of the observations is known as anomaly detection. Such “anomalous” behaviour is usually indicative of some sort of issue, such as credit card fraud, a failed server component, a cyberattack, etc.
●● Three general categories can be used to classify anomalies: a) Point anomaly, b)
Contextual anomaly, c) Collective anomaly.

●● One of the most straightforward unsupervised learning techniques to handle the well-known clustering problem is k-means. The process uses a predetermined number of clusters (let’s assume k clusters) fixed a priori to categorise a given data set. The main notion is to define k centres, one for each cluster.

●● The Apriori algorithm, which was the first associative algorithm introduced by Rakesh Agrawal, has been employed as a component of the approach in subsequent advances of association, classification, and associative classification algorithms.
●● The Apriori algorithm states: “A frequent itemset’s subsets must also be frequent.” In other words, neither the generation nor the testing of a superset of an infrequent itemset is required.
ni

●● The Apriori algorithm premise can be graphically depicted as a lattice. It comprises k-itemset nodes and relations between k-itemsets’ subsets. In such a graphic, the transaction data items are all listed at the bottom, and as you move up, you start producing subsets until you reach the null set.
●● The Apriori Algorithm can be used to produce association rules. The arules
package contains the apriori() function. One of the popular methods in market
basket analysis is the apriori algorithm.

●● One of the main applications of the SVD is principal components analysis (PCA),
which offers a data-driven, hierarchical coordinate system to describe high-
dimensional correlated data. The correlation matrices are used in this coordinate
system.

●● A matrix is factored into three matrices in the Singular Value Decomposition (SVD)
process. It communicates significant geometrical and theoretical insights regarding
linear transformations and has several intriguing algebraic characteristics.

●● An algorithm for unsupervised outlier detection is called the local outlier factor
(LOF). It generates an anomaly score that indicates the data points in the data set
that are outliers. This is accomplished by calculating the local density deviation of
a given data point in relation to nearby data points.

●● The machine learning algorithm t-distributed stochastic neighbour embedding (t-SNE), created by Laurens van der Maaten and Geoffrey Hinton, is used for data visualisation. When embedding high-dimensional data for display in a low-dimensional environment of two or three dimensions, t-SNE is a nonlinear dimensionality reduction technique that is well-suited and adaptable.

●● Applications for t-SNE include computer security analysis, music analysis, and
cancer research, among many others. It is used to depict high-dimensional data,
which can have tens to hundreds of variables, as was previously stated.
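As promised above, here is a minimal sketch of bottom-up (agglomerative) hierarchical clustering, using SciPy’s scipy.cluster.hierarchy module on a tiny made-up dataset:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Tiny made-up dataset: six 2-D points forming two visible groups.
X = np.array([[1, 1], [1.2, 0.9], [0.8, 1.1],
              [5, 5], [5.1, 4.8], [4.9, 5.2]])

# Build the bottom-up merge tree; 'average' is the average-linkage criterion.
Z = linkage(X, method='average')

# Cut the tree into k = 2 flat clusters.
print(fcluster(Z, t=2, criterion='maxclust'))  # e.g. [1 1 1 2 2 2]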

nl
Glossary
●● OPTICS: Ordering Points to Identify Clustering Structure.

O
●● OBSCAN: Density-Based Spatial Clustering of Applications with Noise.
●● BIRCH: Balanced Iterative Reducing Clustering and Using Hierarchies.
●● CURE: Clustering Using Representatives.

ty
●● CLARANS: Clustering Large Applications based upon Randomized Search.
●● STING: Statistical Information Grid.

si
●● CLIQUE: Clustering In Quest.
●● HCA: Hierarchical Cluster Analysis.
●● DBSCAN: Density Based Spatial Clustering of Applications with Noise.
●●
r
Core Point: If a point has more than MinPts points inside an episodic period, it is a
ve
core point.
●● Border Point: A point that is close to a core point but has fewer than MinPts inside
eps.
ni

●● Noise or outlier: A location that is neither a border point nor a core point.
●● Clustering: Identifying the natural groupings in the data, such as classifying clients
based on their purchasing patterns, is a clustering problem.
U

●● Association: When you wish to find rules that broadly characterise your data, such
as “those who buy X also tend to buy Y,” you have an association rule learning
problem.
ity

●● Unsupervised learning: The process of drawing references from datasets of input


data without labelled replies is known as unsupervised learning.
●● Hierarchical clustering: Unsupervised machine learning clustering that starts with a
top-to-bottom hierarchy of clusters is called “hierarchical clustering.”
m

●● Single Linkage: The closest points of the clusters are separated by the shortest
distance.
)A

●● Complete Linkage: It represents the greatest separation between the centres of


two distinct clusters.
●● Average Linkage: The average distance between two clusters is determined using
the linkage method, which adds up and divides the distance between each pair of
datasets by the total number of datasets.
(c

●● Centroid Linkage: The linkage method is used to calculate the separation between
the cluster centroid.

Amity Directorate of Distance & Online Education


Foundations of Machine Learning 295

●● eps: It establishes the area surrounding a data point; if the distance between two
Notes

e
points is less than or equal to “eps,” they are regarded as neighbours.
●● MinPts: The bare minimum amount of nearby neighbours (data points) within the

in
eps radius The greater value of MinPts must be selected the larger the dataset.
●● Anomaly Detection: Finding unusual occurrences or observations that are
statistically distinct from the rest of the observations is known as anomaly

nl
detection.
●● Point Anomaly: A tuple in a dataset is referred to as a point anomaly if it deviates
significantly from the rest of the data.

O
●● Contextual Anomaly: anomalies are observations that are anomalous because of
the context in which they were made.
●● Collective Anomaly: A collection of data instances aid in the discovery of an

ty
anomaly.
●● Basket Data Analysis: Analyzing the relationship of things purchased in a single
basket or single purchase, as in the aforementioned cases, is known as basket

si
data analysis.
●● Cross-marketing is collaborating with businesses that enhance your own, not
with rivals. For instance, it stands to reason that auto dealers and manufacturers
r
engage in cross-promotional activities with the oil and gas industries.
ve
●● Catalogue Design: The items in a company’s catalogue are frequently chosen to
work well together so that purchasing one item will prompt a subsequent purchase
of another.
ni

●● PCs: Principal Components.


●● PCA: Principal Components Analysis.
●● SVD: Singular Value Decomposition.
U

●● LOF: An algorithm for unsupervised outlier detection is called the local outlier
factor.
●● t-SNE: t-distributed Stochastic NeighbourEmbedding.
ity

Check Your Understanding


1. Identifying the natural groupings in the data, such as classifying clients based on their purchasing patterns, is a _ _ _ _ _.
a) Clustering problem
b) Association problem
c) Density based method
d) Overlapping
2. When you wish to find rules that broadly characterise your data, such as “those who buy X also tend to buy Y,” you have an _ _ _ _ _.
a) Data analysis
b) Association rule learning problem
c) Data collection
d) None of the mentioned
3. The process of drawing references from datasets of input data without labelled replies is known as _ _ _ _.
a) Clustering
b) Supervised learning
c) Unsupervised learning
d) Detection
4. _ _ _ _ _ methods consider clusters as dense regions that have some similarities to, and differences from, the less dense parts of the space.
a) Hierarchy based
b) Partitioning based
c) Grid based
d) Density based
5. In this approach, the data space is represented as a grid-like structure made up of a finite number of cells:
a) Grid based method
b) Partitioning based method
c) Hierarchy based method
d) None of the mentioned
6. In this method, the clusters are organised into a tree-like structure:
a) Grid based method
b) Hierarchy based method
c) Density based method
d) Partitioning based method
7. The items are divided by these procedures into k clusters, and each split creates a cluster:
a) Grid based method
b) Density based method
c) Partitioning based method
d) Hierarchy based method
8. Unsupervised machine learning clustering that starts with a top-to-bottom hierarchy of clusters is called _ _ _ _.
a) Supervised clustering
b) Unsupervised clustering
c) Grid clustering
d) Hierarchical clustering
9. The closest points of the clusters are separated by the shortest distance; this is termed:
a) Single linkage
b) Complete linkage
c) Average linkage
d) Centroid linkage
10. _ _ _ _ _ represents the greatest separation between the centres of two distinct clusters.
a) Average linkage
b) Complete linkage
c) Single linkage
d) Centroid linkage
11. The average distance between two clusters is determined using the _ _ _ _ linkage method, which adds up and divides the distance between each pair of datasets by the total number of datasets.
a) Complete
b) Single
c) Average
d) Centroid
12. The _ _ _ _ linkage method is used to calculate the separation between the cluster centroids.
a) Single
b) Average
c) Complete
d) Centroid
13. The number of clusters need not be predetermined for this approach, which is also referred to as a top-down strategy:
a) Divisive algorithm
b) Agglomerative
c) Fuzzy clustering
d) None of the mentioned
14. Finding unusual occurrences or observations that are statistically distinct from the rest of the observations is known as _ _ _ _.
a) Testing
b) Anomaly detection
c) Error finding
d) Supervised detection
15. A tuple in a dataset is referred to as a _ _ _ _ if it deviates significantly from the rest of the data.
a) Contextual anomaly
b) Collective anomaly
c) Point anomaly
d) None of the mentioned

Exercise
1. What do you mean by evaluation of clustering?
2. Define various clustering types.
3. What do you mean by hierarchical clustering?
4. What do you mean by DBSCAN (Density Based Spatial Clustering of Applications with Noise)?
5. Explain anomaly detection.
6. Define K-Means Clustering.
7. Define singular value decomposition.
8. What do you mean by LOF?
9. Explain the concept of T-distributed stochastic neighbour embedding (t-SNE).

Learning Activities

1. How are machine learning and artificial intelligence changing our day-to-day lives? Explain with the help of a few examples.

Check Your Understanding - Answers


1. a
2. b
3. c
4. d
5. a
6. b
7. c
8. d
9. a
10. b
11. c
12. d
13. a
14. b
15. c

Further Readings and Bibliography:
1. The Hundred-Page Machine Learning Book, Andriy Burkov
2. The Elements of Statistical Learning, Jerome H. Friedman, Robert Tibshirani, and Trevor Hastie
3. Hands-On Unsupervised Learning Using Python: How to Build Applied Machine Learning Solutions from Unlabeled Data, Ankur A. Patel
4. Machine Learning in Action, Peter Harrington
5. Artificial Intelligence: A Modern Approach, Peter Norvig and Stuart J. Russell
6. Understanding Machine Learning: From Theory to Algorithms, Shai Ben-David and Shai Shalev-Shwartz


Module - V: Reinforcement Learning



Learning Objectives:

At the end of this module, you will be able to understand:

●● Reinforcement Learning and different terms used in Reinforcement Learning
●● Various steps involved in Reinforcement Learning
●● Concept of the Markov Property
●● What Bellman’s Equation and the Optimal Value Function are
●● Various types of Problem Solving Methods

Introduction

When we consider the nature of learning, the notion that we learn by interacting with our surroundings is likely the first that comes to mind. A baby has no explicit instructor when it plays, waves its arms, or looks around, but it does have a direct sensory link to its surroundings. This connection can be used to learn a multitude of information about cause and effect, the results of activities, and what to do to accomplish goals.
Such interactions surely serve as a significant source of information about our surroundings and ourselves throughout our lives. We are highly aware of how our environment responds to what we do, whether we are learning to drive a car or to hold a conversation, and we attempt to affect what happens through our behaviour. Nearly all theories of learning and intelligence are based on the fundamental concept that learning occurs through interaction.

We investigate a computational strategy for interaction-based learning. We study idealised learning scenarios and assess the efficacy of various learning techniques rather than directly thinking about how people or animals learn. In other words, we adopt the viewpoint of an AI researcher or engineer.

Through mathematical analysis or computational experiments, we investigate machine designs that are successful in resolving learning issues of scientific or commercial significance. Compared to previous machine learning approaches, the method we investigate, called reinforcement learning, is significantly more focused on goal-directed learning from interaction.
method we investigate, called reinforcement learning, is significantly more focused on
goal-directed learning from interaction.
m

5.1 Introduction to Reinforcement Learning


Two significant, lengthy, and rich threads that were pursued separately before
)A

coming together to form contemporary reinforcement learning make up the history of


reinforcement learning. A theme that has its roots in the psychology of animal learning
is learning by doing. The rebirth of reinforcement learning in the early 1980s was
influenced by this thread, which runs through some of the earliest work in artificial
intelligence.

The other thread focuses on the optimal control issue and how value functions and dynamic programming are used to solve it. This thread largely has little to do with learning. Although the two threads have typically been independent of one another, there have been a few exceptions that centre on a third, less obvious thread of temporal-difference techniques, like those employed in the tic-tac-toe example in this chapter. The present discipline of reinforcement learning was created in the late 1980s when all three elements came together.

We are most familiar with, and have the most to say in this brief history about, the thread that focuses on trial-and-error learning. But first, we quickly go over the optimal control thread.

The difficulty of creating a controller to minimise a measure of a dynamical system’s behaviour over time was first described as the “optimal control” problem in the late 1950s. By extending a nineteenth-century theory of Hamilton and Jacobi, Richard Bellman and others came up with one of the solutions to this issue in the middle of the 1950s. In this method, a functional equation, now frequently known as the Bellman equation, is defined using the ideas of the state of a dynamical system and a value function, or “optimal return function.”
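For reference, the Bellman optimality equation is commonly written as follows; this is the standard textbook form, not a formula quoted from this module:

V^{*}(s) = \max_{a} \sum_{s'} P(s' \mid s, a)\left[ R(s, a, s') + \gamma V^{*}(s') \right]

Here V*(s) is the optimal value of state s, P(s'|s,a) the probability of moving to state s' after taking action a in s, R(s,a,s') the reward received, and γ (with 0 ≤ γ < 1) the discount factor.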

Dynamic programming is a class of techniques for resolving optimal control issues by resolving this equation (Bellman, 1957a). Markovian decision processes (MDPs), which were first presented by Bellman (1957b) and Ronald Howard (1960), are a discrete stochastic variant of the optimal control problem. These are the fundamental concepts that underlie the theory and techniques of contemporary reinforcement learning.

Most experts agree that the only practical method for resolving general stochastic optimal control issues is dynamic programming. It becomes exponentially more computationally expensive as the number of state variables grows, which Bellman called “the curse of dimensionality,” yet it is still much more effective and widely applicable than any other general method.

Since the late 1950s, dynamic programming has undergone extensive development, including extensions to partially observable MDPs (reviewed by Lovejoy, 1991), numerous applications (reviewed by White, 1985, 1988, and 1993), approximation techniques (reviewed by Rust, 1996), and asynchronous techniques (Bertsekas, 1982, 1983). There are many excellent contemporary treatments of dynamic programming (e.g., Bertsekas, 1995; Puterman, 1994; Ross, 1983; and Whittle, 1982, 1983). An authoritative history of optimal control is provided by Bryson (1996).

5.1.1 Reinforcement Learning: Introduction


Reinforcement learning shares similarities with other subjects whose names end in -ing, such as machine learning, planning, and mountaineering, in that it is at once a problem, a class of solutions that are effective for that class of problems, and the field that investigates these issues and their solutions. Reinforcement learning problems involve learning what to do, that is, how to translate situations to actions, in order to optimise a numerical reward signal.

They are closed-loop problems in that the learning system’s actions have an impact on its subsequent inputs. In addition, unlike many other types of machine learning, the learner must experiment with different actions to determine which ones produce the greatest rewards. In the most fascinating and difficult situations, choices can influence not just the immediate reward but also the subsequent circumstance and, thus, all subsequent benefits.

The three key differentiating properties of reinforcement learning problems are that they are fundamentally closed-loop, lack explicit instructions for what actions to take, and have consequences of actions, including reward signals, that take a long time to manifest.

The main goal is simply to capture the most significant features of the actual challenge facing a learning agent interacting with its environment in order to achieve a goal. Reinforcement learning problems can be fully specified in terms of optimal control of Markov decision processes. It is obvious that such an agent must be able to partially detect the environment’s state and must be able to behave in a way that influences that state. An objective or goals pertaining to the status of the environment must also be present for the agent. Sensation, action, and purpose are the only three characteristics that should be included in the formulation, and they should all be included in their most basic forms without being trivialised.

We define a reinforcement learning method as any approach that is effective at tackling a situation of this nature. The type of learning that is investigated in the majority of recent research in the field of machine learning, supervised learning, is distinct from reinforcement learning.
ve
A knowledgeable external supervisor provides a training set of labelled examples
for supervised learning. Each example includes a description of a circumstance as well
as a label—the specification of the correct action the system should do in response to
that situation—which is frequently to define a category to which the situation belongs.
ni

It is the goal of this type of learning for the system to extrapolate, or generalise,
its answers so that it responds appropriately in circumstances that are not part
of the training set. Although this is a crucial type of education, learning through
U

interaction requires more than this. It is frequently impractical to find instances of


desired behaviour in interactive issues that are both accurate and typical of all the
circumstances in which the agent must behave. An agent must be able to draw lessons
ity

from its own experience in new territory, where one would anticipate learning to be
most helpful. Additionally, unsupervised learning, which machine learning researchers
refer to as identifying structure concealed in sets of unlabeled data, is distinct from
reinforcement learning. Although it might seem that the phrases supervised learning
and unsupervised learning fully categorise machine learning paradigms, they do not.
m

Since reinforcement learning does not rely on examples of appropriate conduct, it is


tempting to mistake it for unsupervised learning. However, reinforcement learning aims
to maximise a reward signal rather than seek out hidden structure.
)A

Although finding structure in an agent’s experience can be helpful for reinforcement


learning, it does not solve the issue of how to maximise a reward signal for the agent.
Therefore, alongside supervised learning, unsupervised learning, and maybe other
paradigms, we see reinforcement learning as a third machine learning paradigm.
(c

The trade-off between exploration and exploitation is one of the difficulties that
reinforcement learning faces, as opposed to other types of learning. A reinforcement
learning agent must favour activities that it has previously attempted and found to be
Amity Directorate of Distance & Online Education
Foundations of Machine Learning 303

successful in creating reward in order to receive a lot of reward. However, it must try
Notes

e
acts that it has never chosen before in order to find such activities.

The agent must take advantage of what it already knows in order to profit, but it

in
must also explore in order to choose its future actions more wisely. The problem is that
pursuing either exploration or exploitation solely would result in failure. The agent must
test several different things and gradually favour the ones that seem to work the best.

For a reliable estimate of the expected reward on a stochastic task, each action must
be tried numerous times.

Mathematicians have spent many years studying the exploration–exploitation

O
conundrum. For the time being, we merely point out that, at least in their purist forms,
the whole problem of striking a balance between exploration and exploitation never
even comes up in supervised and unsupervised learning.

ty
The explicit consideration of the entire issue of a goal-directed agent dealing with
an unpredictable environment is another important aspect of reinforcement learning.
Contrary to many techniques, this one takes into account subproblems without
considering how they might integrate into a bigger picture.

si
For instance, we have said that supervised learning is a major focus of machine
learning research without directly stating the ultimate application of such a capability.

r
Other academics have created planning theories with broad objectives, but without
taking into account the function of planning in real-time decision making or the issue of
ve
where the predictive models required for planning would originate from. These methods’
emphasis on discrete subproblems is a key restriction, despite the fact that they have
produced numerous helpful results.
ni

The approach taken by reinforcement learning is the contrary, beginning with a fully
developed, interactive, goal-seeking agent. All reinforcement learning agents can sense
elements of their environments, have clear goals, and make decisions about how to
affect those environments.
U

Furthermore, it is typically taken for granted right away that the agent must
function in spite of a great deal of uncertainty regarding the environment it faces. When
reinforcement learning contains planning, it needs to take into account how real-time
ity

action selection and planning interact as well as the issue of how environment models
are created and updated.

When supervised learning is incorporated into reinforcement learning, it does


so for particular reasons that specify which skills are essential and which are not.
m

Important subproblems must be isolated and researched in order for learning research
to advance, but even if some of the complete agent’s specifics are still missing, these
subproblems must play distinct roles in complete, interactive, goal-seeking agents.
)A

One of the most intriguing features of contemporary reinforcement learning is how


it interacts meaningfully and productively with other engineering and scientific fields.
Increasing the integration of artificial intelligence and machine learning with statistics,
optimization, and other mathematical disciplines has been a long-standing trend in
artificial intelligence and machine learning.
(c

For instance, the traditional “curse of dimensionality” in operations research and


control theory is addressed by some reinforcement learning techniques’ capacity to

learn with parameterized approximators. In a unique way, reinforcement learning has


Notes

e
also strongly influenced psychology and neuroscience, with significant gains for both
fields.

in
Reinforcement learning is the type of machine learning that is most similar to
the learning that occurs in humans and other animals, and many of its fundamental
algorithms were first motivated by biological learning systems. And reinforcement

nl
learning has also given something back, both in the form of an influential model of some
aspects of the brain’s reward system and a psychological model of animal learning that
more closely matches some of the scientific findings.

O
Finally, the return to straightforward general principles in artificial intelligence is a
bigger trend that is reflected in reinforcement learning. Since the late 1960s, a lot of
researchers studying artificial intelligence have assumed that there are no fundamental
principles to be found and that intelligence instead results from the accumulation of a

ty
huge variety of specialised techniques, methods, and heuristics.

It was formerly thought that if we could merely feed a machine with one million
or one billion pertinent data, the machine would become intelligent. Weak techniques

si
were defined as those based on broad principles, such searching or learning, and
strong methods were defined as those based on specific information. Although much
less prevalent now, this viewpoint is still widespread. From our perspective, it was just
r
too early to draw that conclusion because not enough time had been spent looking for
ve
general principles.

Nowadays, a lot of research is being done on modern AI in an effort to include


enormous volumes of domain information as well as general concepts of learning,
search, and decision-making. Reinforcement learning research is unquestionably a
ni

component of the swing back toward simpler and fewer general principles of artificial
intelligence, however it is unclear how far back the pendulum will swing.
U

Examples of Reinforcement Learning


Examining some of the instances and potential applications that have influenced
reinforcement learning’s development is a useful way to grasp it.
ity

●● An expert chess player moves: The decision is influenced by both forethought—


planning potential responses and counterattacks—and by quick, intuitive
assessments of the value of specific positions and manoeuvres.
●● Real-time parameter adjustments are made by an adaptive controller during the
m

operation of a petroleum refinery. Without rigidly adhering to the set points initially
recommended by engineers, the controller optimises the yield/cost/quality trade-off
on the basis of stated marginal costs.
)A

●● Minutes after birth, a gazelle calf struggles to stand. It reaches a speed of 20 miles
per hour after 30 minutes.
●● A mobile robot chooses whether to move into a different room in search of
additional rubbish to collect or to begin attempting to return to its battery charging
station. It bases its choice on the battery’s current state of charge as well as past
(c

experiences with how quickly and simply it has located the recharger.

Amity Directorate of Distance & Online Education



●● Phil gets his breakfast ready: When closely analysed, even this seemingly routine
Notes

e
action—walking to the cabinet, opening it, choosing a cereal box, then reaching
for, grabbing, and retrieving the box—displays a complicated web of conditional
behaviour and interlocking goal-subgoal linkages. To gain a bowl, spoon, and milk

in
jug, additional intricate, calibrated, interactive sequences of behaviour are needed.
A succession of eye movements are required for each step in order to gather

nl
information and direct reaching and movement.

Quick decisions are made regarding how to transport the items and whether it
would be best to transport some of them to the dinner table first before obtaining others.

O
In order to accomplish other goals, such as having the spoon to eat with once the
cereal is produced and ultimately acquiring sustenance, each step is directed by goals,
such as grabbing a spoon or reaching the refrigerator.

ty
Whether or not Phil is aware of it, his body is constantly accessing data about its
condition to assess his dietary requirements, level of hunger, and food preferences.

These instances have characteristics in common that are so fundamental that is

si
simple to ignore them. All of these include interactions between an active decision-
making agent and its surroundings, in which the agent tries to accomplish a task
despite being unclear about the surroundings. The agent’s actions are allowed to

r
influence the environment’s future state (such as the next move in a game of chess, the
level of the refinery’s reservoirs, the location of the robot in the future, and the battery’s
ve
future charge level), which in turn influences the options and opportunities the agent will
have in the future.

Making the right decision necessitates considering the indirect, long-term effects
of actions, which may call for preparation or foresight. The agent must continually
ni

monitor its environment in order to react effectively because, in all of these situations,
the repercussions of actions cannot be entirely predicted. For instance, Phil needs to be
careful not to overfill his breakfast bowl with milk.
U

In each of these situations, the objectives are explicit in the sense that the agent
may gauge its success based on what it can immediately experience. The mobile robot
knows when its batteries run out, the chess player knows whether or not he wins, the
ity

refinery controller knows how much petroleum is produced, and Phil knows whether or
not he is enjoying his breakfast.

The agent and its environment could not exactly match our conceptions of an agent
and its surroundings. The entirety of a robot or organism is not always considered to be
m

an agent, and a robot or organism’s environment is not always considered to be its only
habitat. The example robot’s battery is a feature of its controlling agent’s environment,
just as Phil’s level of hunger and food preferences are aspects of his internal decision-
)A

making agent’s environment.

Information regarding the state of the machine or organism in which an agent


is located is frequently included in the state of the agent’s surroundings, and this
information might include memories and even aspirations. When we discuss agents and
their environments throughout this book, we do it in an abstract manner.
(c

In each of these cases, the agent can use experience to gradually raise
performance levels. The gazelle calf increases its running efficiency, the chess player




improves his intuition for assessing positions, and Phil learns to prepare his breakfast
Notes

e
more quickly. The knowledge an agent brings to a task at the outset—either from prior
experience with related tasks, or built into it by design or evolution—influences what is
useful or simple to learn, but interaction with the environment is crucial for modifying

in
behaviour to take advantage of particular task features.

5.1.2 Formal Definition of Reinforcement Learning

nl
Machine learning includes the discipline of reinforcement learning. It involves
acting appropriately to maximise reward in a certain circumstance. It is used by

O
a variety of programmes and machines to determine the optimal course of action to
pursue in a given circumstance. There is no correct answer in reinforcement learning,
but the reinforcement agent selects what to do to complete the job. This is different from
supervised learning, where the training data includes the solution key and the model is

ty
trained with that answer. It will undoubtedly gain knowledge from its experience in the
absence of a training dataset.

Machine learning models are trained via reinforcement learning to make a series

si
of judgments. The agent gains the ability to do a task in a possibly complicated and
uncertain environment. An artificial intelligence encounters a scenario similar to a
game during reinforcement learning. In order to solve the problem, the computer uses
r
trial and error. The artificial intelligence receives rewards or penalties for the activities
it takes in order to make the machine accomplish what the programmer desires. To
ve
maximise the overall reward is its aim.

Although the designer establishes the reward policy, or the game’s rules, he offers
the model no tips or advice on how to win. The model must determine how to complete
ni

the objective to maximise the reward, starting with completely arbitrary trials and
ending with sophisticated strategies and superhuman abilities. Reinforcement learning
is currently the most efficient method for hinting machine creativity since it makes use
of the power of search and numerous trials. If a reinforcement learning algorithm is
U

performed on a powerful enough computer infrastructure, artificial intelligence can learn


from thousands of concurrent games, unlike humans.

A machine learning training method called reinforcement learning rewards desired


ity

behaviours and/or penalises undesirable ones. A reinforcement learning agent can


typically perceive and comprehend its surroundings, act, and learn by making mistakes.

Developers provide a way of rewarding desired actions and penalising undesirable


behaviours in reinforcement learning. In order to motivate the agent, this technique
m

assigns positive values to desired acts and negative values to undesirable behaviours.
This trains the agent to seek maximal overall reward over the long run in order to arrive
at the best possible outcome.
)A

The agent is prevented from stagnating on smaller tasks by these long-term


objectives. The agent eventually learns to steer clear of the negative and look for the
positive. Artificial intelligence (AI) has adopted this learning strategy as a technique to
control unsupervised machine learning using rewards and penalties.
(c

●● By executing actions and observing the outcomes of those actions, an agent learns how to behave in a given environment via reinforcement learning, a feedback-based machine learning technique. The agent receives praise for each positive action, and is penalised or given negative feedback for each negative action.
●● In contrast to supervised learning, reinforcement learning uses feedback to autonomously train the agent without the use of labelled data.
●● The agent can only learn from its own experience because there isn’t any labelled data.
●● In situations like gaming, robotics, and other long-term endeavours where decisions must be made sequentially, RL offers a solution.
●● The agent engages with the environment and independently explores it. In reinforcement learning, an agent’s main objective is to maximise positive reinforcement while performing better and better.
●● The agent learns through trial and error, and depending on its experience, it develops the skills necessary to carry out the mission more effectively. Thus, we might state that “Reinforcement learning is a form of machine learning method where an intelligent agent (computer programme) interacts with the environment and learns to function within it.” Reinforcement learning is demonstrated by the way a robotic dog learns to move its arms.
●● It is a fundamental component of artificial intelligence, and the idea of reinforcement learning is the basis for all AI agents. In this case, there is no need to pre-program the agent because it learns on its own without assistance from humans.

5.1.3 Steps of Reinforcement Learning



1. An input state is observed by the agent.
2. A decision-making process determines an action (policy).
3. The action is performed.
4. The environment provides the agent with reinforcement, a scalar reward.
5. The reward given for that state/action pair is recorded.
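A minimal sketch of this loop in Python, using tabular Q-learning on a made-up toy environment. Everything here (the environment, action set, and hyperparameters) is an illustrative assumption, not taken from the text:

import random
from collections import defaultdict

class ToyEnv:
    """A made-up 5-state corridor: move right (action 1) to reach the goal at state 4."""
    def reset(self):
        self.pos = 0
        return self.pos
    def step(self, action):
        self.pos = max(0, self.pos + (1 if action == 1 else -1))
        done = self.pos == 4
        return self.pos, (1.0 if done else 0.0), done

env = ToyEnv()
q = defaultdict(float)            # value of each (state, action) pair
actions = [0, 1]                  # assumed action set: left, right
alpha, gamma, epsilon = 0.1, 0.9, 0.1

def choose_action(state):
    if random.random() < epsilon:                      # explore
        return random.choice(actions)
    return max(actions, key=lambda a: q[(state, a)])   # exploit

for episode in range(200):
    state = env.reset()                                # step 1: observe the input state
    done = False
    while not done:
        action = choose_action(state)                  # step 2: the policy picks an action
        next_state, reward, done = env.step(action)    # steps 3-4: act, receive a scalar reward
        best_next = max(q[(next_state, a)] for a in actions)
        # step 5: record what reward followed this state/action pair (Q-learning update)
        q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
        state = next_state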

Feature of Reinforcement Learning


)A

Although the majority of the reinforcement learning techniques we discuss in this book are built around estimating value functions, doing so is not strictly required to address reinforcement learning problems. Techniques such as genetic algorithms, genetic programming, simulated annealing, and other optimisation methods have been used to approach reinforcement learning challenges without ever mentioning value functions.

These techniques assess the "lifetime" behaviour of a large number of non-learning agents, each of which employs a distinct policy for interacting with its surroundings, and then select the agents that are able to maximise reward. These techniques are referred to as evolutionary because they operate similarly to how biological evolution produces animals with complex behaviour even when those animals do not acquire new skills over their lifetimes.

Evolutionary approaches can be successful if the space of policies is sufficiently small, or can be organised so that good policies are common or easy to locate, or if a lot of time is available for the search. Furthermore, evolutionary approaches are advantageous for problems in which the learning agent is unable to precisely assess the state of its surroundings.


We do, however, include some approaches that, like evolutionary methods, do not rely on value functions. These techniques search for policies in policy spaces defined by a set of numerical parameters. They estimate the directions in which the parameters should be changed in order to improve a policy's performance as quickly as possible.

However, unlike evolutionary approaches, they do so while the agent is interacting with its environment, allowing them to benefit from the specifics of each behavioural encounter. These techniques, known as policy gradient methods, have been effective in solving a variety of problems, and some of the most straightforward reinforcement learning techniques fall within this group.

In fact, some of these techniques use estimates of the value function to improve their estimates of the gradient. Overall, there is no clear line separating policy gradient methods from the other techniques we count as reinforcement learning techniques.

5.1.4 Terms in Reinforcement Learning

Agent(): An entity with the capacity to perceive, explore, and act upon its surroundings.

Action(): The steps an agent takes inside the environment are referred to as actions.

State(): The situation returned by the environment after each action taken by the agent.

Q-value(): Broadly comparable to the value, except that it takes the current action as an additional parameter.

Beyond the agent and the environment, a reinforcement learning system primarily consists of four subelements: a policy, a reward signal, a value function, and, if present, a model of the environment.

Policy: A policy defines the learning agent's behaviour at a given time. Broadly speaking, a policy is a mapping from perceived states of the environment to the actions to be taken when in those states. It corresponds to what in psychology would be called a set of stimulus-response rules or associations (provided that stimuli include those that can come from within the animal). In some cases the policy may be as straightforward as a function or lookup table, whereas in others it may involve complex computation such as a search process. The policy is the fundamental component of a reinforcement learning agent in the sense that it alone is sufficient to determine behaviour. In general, policies may be stochastic.

Reward Signal: In a reinforcement learning problem, the objective is defined by a reward signal. On each time step, the environment sends the reinforcement learning agent a single number called the reward. The agent's only goal is to maximise the total return it receives over the long term. The reward signal thus identifies the positive and negative events for the agent.

We might consider rewards to be comparable to the sensations of pleasure or pain in a biological system. They are the primary and distinguishing features of the agent's problem. Depending on the agent's current action and the current state of the environment, a reward may be provided to the agent at any time. The agent cannot alter the process that generates the reward; it can influence the reward signal only through its actions, which may affect the reward directly, or indirectly by changing the state of the environment.

In the earlier example of Phil eating breakfast, the reinforcement learning agent directing his behaviour may receive different reward signals when he eats his breakfast, depending on how hungry he is, how he is feeling, and other features of his body, which are part of the internal environment of that agent. The reward signal is the primary basis for altering the policy.

If an action selected by the policy is followed by a low reward, the policy may be changed to select a different action in that situation in the future. In general, reward signals may be stochastic functions of the state of the environment and the actions taken.
Value Function: A value function specifies what is beneficial over the long term, whereas the reward signal indicates only what is good in the immediate sense. The value of a state can be thought of as the total amount of reward an agent can expect to accumulate over the future, starting from that state. Whereas rewards determine the immediate, intrinsic attractiveness of environmental states, values indicate the long-term desirability of states after taking into account the states that are likely to follow and the rewards available in those states.

For instance, even though a state consistently produces a small immediate reward, it may nonetheless have a high value because it is regularly followed by states that produce large rewards. The opposite may also be true. To use a human comparison, rewards resemble pleasure (if high) and pain (if low), whereas values reflect a more refined and long-term assessment of how happy or unhappy we are with the current state of our surroundings. Expressed this way, we think it is evident that value functions formalise a fundamental and familiar concept.

In a sense, rewards are primary, while values, which serve as predictions of rewards, are secondary. Values would not exist without rewards, and the main purpose of estimating values is to obtain more reward. However, when making and evaluating decisions, we are mainly concerned with values. Decisions about what to do are based on value judgments. We aim to reach states of highest value, not highest reward, because those states will ultimately yield the greatest rewards for us.

The derived quantity known as value is the one we are most concerned with when making decisions and planning. Unfortunately, determining values is far more difficult than determining rewards.

Rewards are essentially delivered directly by the environment, whereas values must be estimated and re-estimated from the sequences of observations an agent makes over its whole lifespan. In fact, a technique for efficiently estimating values is the most crucial element of practically all the reinforcement learning algorithms we study. The central importance of value estimation is arguably the most significant lesson learned about reinforcement learning over the past few decades.

Model of the Environment: A model of the environment is the fourth and final component of some reinforcement learning systems. This is something that mimics the behaviour of the environment or, to put it more broadly, something that enables predictions about how the environment will behave. For instance, given a state and an action, the model might predict the resulting next state and reward.

Models are used for planning, by which we mean any method of selecting a course of action by considering possible future situations before they actually occur. Approaches to reinforcement learning problems that use models and planning are called model-based methods, as distinguished from simpler model-free methods that explicitly rely on trial-and-error learning, which is thought of as almost the opposite of planning.
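To make these four subelements concrete, here is a minimal sketch of how they might be represented for a tiny problem. All names, states, and numbers below are invented for illustration:

# Hypothetical two-state, two-action problem; every number here is made up.
states, actions = ["s0", "s1"], ["a0", "a1"]

# Policy: mapping from states to action probabilities (may be stochastic).
policy = {"s0": {"a0": 0.7, "a1": 0.3},
          "s1": {"a0": 0.5, "a1": 0.5}}

# Reward signal: a single number returned for a (state, action) pair.
reward = {("s0", "a0"): 1.0, ("s0", "a1"): 0.0,
          ("s1", "a0"): -1.0, ("s1", "a1"): 2.0}

# Value function: long-term expected return from each state.
value = {"s0": 0.0, "s1": 0.0}

# Model: predicts the next state for a (state, action) pair, enabling planning.
model = {("s0", "a0"): "s1", ("s0", "a1"): "s0",
         ("s1", "a0"): "s0", ("s1", "a1"): "s1"}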

5.1.5 Markov Property

The reinforcement learning problem is intended to be a straightforward framing of the problem of learning from interaction to achieve a goal. The agent is the learner and decision-maker. The environment is what it interacts with, and comprises everything outside the agent. The agent chooses actions, and the environment responds to those actions by presenting the agent with new situations.
The environment also produces rewards, special numerical values that the agent attempts to maximise over time. A complete specification of an environment defines a task, one instance of the reinforcement learning problem.

More particularly, the agent and environment interact at each of a sequence of discrete time steps, t = 0, 1, 2, 3, .... At each time step t, the agent receives some representation of the environment's state, St ∈ S, where S is the set of possible states, and chooses an action, At ∈ A(St), where A(St) is the set of actions that are feasible in state St. One time step later, partly as a result of its action, the agent earns a numerical reward, Rt+1 ∈ R ⊂ ℝ, and finds itself in a new state, St+1. The agent-environment interaction is depicted in the figure below.

Figure: The agent–environment interaction in reinforcement learning

At each time step, the agent implements a mapping from states to probabilities of selecting each possible action. This mapping is called the agent's policy and is denoted πt, where πt(a|s) is the probability that At = a if St = s. Reinforcement learning methods specify how the agent changes its policy as a result of its experience. Roughly speaking, the agent's goal is to maximise the total amount of reward it receives over the long run.
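A minimal sketch of such a stochastic policy πt(a|s) as a lookup table; the states, actions, and probabilities are made up for illustration:

import random

policy = {
    "s0": {"left": 0.9, "right": 0.1},
    "s1": {"left": 0.2, "right": 0.8},
}

def sample_action(state):
    # Sample an action a with probability pi(a|state).
    actions, probs = zip(*policy[state].items())
    return random.choices(actions, weights=probs, k=1)[0]

print(sample_action("s0"))  # "left" with probability 0.9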

The Markov Property

In the reinforcement learning framework, the agent makes its decisions as a function of a signal from the environment called the environment's state. This section discusses the requirements for the state signal and the kind of information we should and should not expect it to provide. In particular, we formally define the Markov property, a particularly interesting property of environments and their state signals.

In this book, "the state" means whatever information is available to the agent. We presume that the state is provided by some preprocessing system that is nominally a component of the environment. In this book we do not discuss how to construct, change, or learn the state signal. We take this approach in order to concentrate fully on the decision-making issues, not because we consider state representation unimportant. In other words, our main concern is not with designing the state signal, but with deciding what action to take as a function of whatever state signal is available.

The state signal can contain much more information than just immediate experiences such as sensory data. State representations might be elaborate structures built up over time from a sequence of sensations, or they can be highly processed versions of the original sensations.

For example, we can create a rich and detailed representation of a scene by moving our eyes over it, even though only the fovea, a tiny region in the centre of the retina, can be seen in detail at any given time. Or, more plainly, we can glance at an object, then look away, and still know that it is there. We can hear the word "yes" and consider ourselves to be in completely different states depending on the question that was asked earlier and is no longer audible.

At a more mundane level, a control system can measure position at two different times to produce a state representation that includes velocity information. In each of these cases, the state is constructed and maintained on the basis of immediate sensations together with the previous state or some other memory of past sensations.

There is no reason to limit the state representation to the agent's immediate sensations; in most cases, the state representation should be able to provide the agent with more information than that.

On the other hand, the agent should not be expected to receive, via the state signal, all of the information that would be relevant to its decision-making. If the agent is playing blackjack, we should not expect it to know what the next card in the deck is.

We should not expect the agent to know the caller's identity when it answers the phone. We should not expect a paramedic who has been dispatched to a car accident to know immediately the internal injuries of a victim who is unconscious.

In each of these scenarios, the environment contains hidden state information that would be helpful if the agent knew it, but the agent cannot know it because it has never received any relevant sensations. In other words, we do not fault an agent for not knowing something that matters, but only for having known something and then forgotten it.

What we would ideally like is a state signal that summarises past sensations while retaining all relevant information. Normally this requires more than the immediate sensations, but never more than the complete history of all past sensations. A state signal that succeeds in retaining all relevant information is said to be Markov, or to have the Markov property (we define this formally below).

A checkers position, the current arrangement of all the pieces on the board, would serve as a Markov state because it summarises everything important about the complete sequence of positions that led to it. Much of the information about that sequence is lost, but everything that really matters for the future of the game is retained.

Similarly, a cannonball's future trajectory depends only on its current position and velocity. It does not matter at all how that position and velocity came about. Because everything that matters is contained in the current state signal and is independent of the "path," or history, of previous signals, this property is also often referred to as "independence of path."
We now formally define the Markov property for the reinforcement learning problem. To simplify the mathematics, we assume here that there are a finite number of states and reward values. This enables us to work in terms of sums and probabilities rather than integrals and probability densities, although the argument can easily be extended to include continuous states and rewards.
Consider how a general environment might respond at time t + 1 to an action taken at time t. In the most general, causal case, this response may depend on everything that has happened earlier. In this case the dynamics can be defined only by specifying the complete probability distribution:

Pr{St+1 = s′, Rt+1 = r | S0, A0, R1, ..., St−1, At−1, Rt, St, At}

for all r, s′, and all possible values of the past events: S0, A0, R1, ..., St−1, At−1, Rt, St, At. If the state signal has the Markov property, on the other hand, then the environment's response at t + 1 depends only on the state and action representations at t, in which case the environment's dynamics can be defined by specifying only

p(s′, r | s, a) = Pr{St+1 = s′, Rt+1 = r | St = s, At = a}

for all r, s′, St, and At. In other words, a state signal has the Markov property, and is a Markov state, if and only if the two expressions above are equal for all s′, r, and all histories S0, A0, R1, ..., St−1, At−1, Rt, St, At. In this case, the environment and task as a whole are also said to have the Markov property.
also claimed that the environment and task collectively exhibit the Markov property.
r
ve
If an environment possesses the Markov property, then we can use the current
state and action to forecast the next state and anticipated reward. By repeating this
equation, it is possible to demonstrate that it is possible to forecast all future states
ni

and expected rewards using only knowledge of the current state, which is equivalent to
using knowledge of the entire past up to the present.

It follows that Markov states provide the best possible basis for choosing actions. That is, the best policy for choosing actions as a function of a Markov state is just as good as the best policy for choosing actions as a function of complete histories.

Even when the state signal is non-Markov, it is appropriate to think of the state in reinforcement learning as an approximation to a Markov state.

In particular, we want the state to provide a good basis for predicting future rewards and for choosing actions. In cases in which a model of the environment is learned, we also want the state to be a good basis for predicting subsequent states. Markov states provide an unsurpassed basis for all of these things.

To the extent that the state approaches the capability of Markov states in these respects, one will obtain better performance from reinforcement learning systems. For all of these reasons, it is useful to think of the state at each time step as an approximation to a Markov state, even though this may not hold exactly.

The Markov property is important in reinforcement learning because decisions and values are assumed to be a function only of the current state. For these to be effective and informative, the state representation must be informative.

The entire body of theory in this book is predicated on Markov state signals. In circumstances where the Markov property does not exactly apply, this means that not all of the theory applies. However, the theory developed for the Markov case continues to be useful in understanding how the algorithms behave, and the algorithms can be effectively applied to many tasks whose states are not precisely Markov.

A thorough understanding of the theory of the Markov case is a prerequisite for extending it to the more intricate and realistic non-Markov case. Finally, we point out that the assumption of Markov state representations is not specific to reinforcement learning but is found in the majority, if not all, of artificial intelligence methods.

Example: Pole-Balancing State. In the pole-balancing task described earlier, a state signal would be Markov if it specified exactly, or made it possible to reconstruct exactly, the position and speed of the cart along the track, the angle between the cart and the pole, and the rate at which this angle is changing (the angular velocity).

In an idealised cart-pole system, this information would be sufficient to exactly predict the future behaviour of the cart and pole, given the actions taken by the controller. In practice, however, it is impossible to know this information exactly, because any real sensor would introduce some distortion and delay into its measurements.

Furthermore, in any real cart-pole system there are always other effects with a small influence on behaviour, such as the bending of the pole, the temperatures of the wheel and pole bearings, and various forms of backlash. If the state signal consisted only of the positions and speeds of the cart and the pole, these factors would cause violations of the Markov property.

However, the positions and velocities often serve quite well as states. Some early investigations of learning to solve the pole-balancing task used a coarse state signal that divided cart positions into three regions: right, left, and centre (with similar rough quantizations of the other three intrinsic state variables).

This distinctly non-Markov state was sufficient to allow reinforcement learning methods to solve the task. In fact, this coarse representation may even have sped up learning by forcing the learning agent to ignore fine distinctions that were not relevant to the task at hand.

Example: Draw Poker. In draw poker, each player is dealt a hand of five cards. There is a round of betting, in which each player exchanges some of his cards for new ones, and then there is a final round of betting.

In each round, each player must match or exceed the highest bet placed by the other players, or else drop out (fold). After the second round of betting, the player with the best hand who has not folded wins and takes home all of the bets.

In draw poker, each player receives a unique state signal. Each player knows the cards in his own hand, but can only guess at those of the other players. It is a common mistake to think that a Markov state signal should include the contents of all the players' hands and the cards remaining in the deck.

In a fair game, however, we assume that the players are in principle unable to determine these things from their past observations. If a player did know them, then she could predict some future events (such as the cards one could exchange for) better than by simply remembering all past observations.

In addition to knowledge of one's own cards, the state in draw poker should include the bets and the numbers of cards drawn by the other players. For example, if one of the other players drew three new cards, you might suspect he retained a pair and adjust your estimate of the strength of his hand accordingly. The players' bets also influence your assessment of their hands.

In fact, much of your past experience with these particular players is part of the Markov state. Does Ellen like to bluff, or does she play conservatively? Does her face or demeanour reveal anything about how strong her hand is? How does Joe's play change when it is late at night, or when he has already won a large amount?

Although everything ever observed about the other players may have an effect on the probabilities that they are holding various kinds of hands, in practice there is far too much to remember and analyse, and most of it will have no clear effect on one's predictions and decisions.

Very good poker players are adept at remembering just the key clues and at sizing up new players quickly, but no one can remember everything that is relevant. As a result, the state representations people use are presumably non-Markov, and the decisions they make in poker are presumably also imperfect.

Nevertheless, people still make good decisions in such problems. We conclude that the inability to access a complete Markov state representation is probably not a severe handicap for a reinforcement learning agent.
5.1.6 Bellman’s Equation and Optimal Value Function


The Bellman Optimality Equation and the Optimal Value Function are connected recursively.

The only difference between the Bellman Optimality Equation and the Bellman Expectation Equation is that instead of taking the average of the values of the available actions, our agent picks the action with the highest value.

Let's examine this using the backup diagram:
Figure: Backup diagram for State-Value Function

Assume that our agent is in state s and that it has two possible actions from that state. In contrast to the Bellman Expectation Equation, we examine the action-values of each action and our agent chooses the action with the highest q* value. This gives us the value of being in state s.

This can be written mathematically as:

v*(s) = max_a q*(s, a)
Let’s define the Bellman Optimality Equation for State-Action Value Function in a
similar manner (Q-Function).

O
Let’s examine the Backup Diagram for the Q-Function State-Action Value Function:

ty
si
Figure: Backup Diagram for State-Action Value Function

Suppose that our agent has taken action a in state s. Depending on the environment, we might be blown to any of several states. The only difference is that in the Bellman Optimality Equation we know the optimal values of those states, whereas in the Bellman Expectation Equation we merely knew their values. As before, we take the average of the values of the possible next states.

This can be written mathematically as:

q*(s, a) = R(s, a) + γ Σ_s′ P(s′ | s, a) v*(s′)
Let’s sew these backup State-Value Function diagrams once more.


ity
m
)A

Figure: Backup Diagram for State-Value Function



Assume that our agent is in state s and takes action a, with the probability of that action being taken governed by the policy. As a result of action a, the agent may be blown to any of several states s′, with probabilities weighted by the environment. To determine the value of state s, we take the average of the optimal values of the states s′, and then maximise over the actions. This gives the value of being in state s.

This can be written mathematically as:

v*(s) = max_a [ R(s, a) + γ Σ_s′ P(s′ | s, a) v*(s′) ]
The max in the equation arises because we are maximising over the actions the agent can take in the upper arcs. This equation demonstrates how the v* function can be related to itself.

The Bellman Optimality Equation for the State-Action Value Function, q*(s, a), is as follows:
Figure: Backup Diagram for State-Action Value Function
Imagine that in state s, our agent took some action a. The environment may send our agent to any of the states s′ as a result of that action, and from each of those states we maximise over the actions our agent can perform by selecting the action with the highest q* value. Backing that up to the top gives the value of the action a.

This can be written mathematically as:

q*(s, a) = R(s, a) + γ Σ_s′ P(s′ | s, a) max_a′ q*(s′, a′)

To better comprehend it, let’s look at this illustration:


Figure: Example for Bellman Optimality Equation

Look at the red arrows. Suppose we want to find the value of the state with value 6 (in red). As we can see, choosing Facebook earns us a reward of −1 and choosing study earns us a reward of −2. Since the other two states already have optimal values, we apply the Bellman Optimality Equation for the State-Value Function: for each action we add the reward to the value of the resulting state, and then we maximise over the actions (choosing the one that gives the maximum value).

We can therefore deduce from the diagram that choosing Facebook gives our red state a value of 5 and choosing study gives it a value of 6; maximising over the two, the result is 6 for our red state.
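A quick numeric check of this example. The successor values below are assumptions read off the figure (the Facebook state is taken to have v* = 6 and the next study state v* = 8, with γ = 1), so treat this as an illustration rather than a definitive reading:

q_facebook = -1 + 6   # reward -1, then a successor state assumed worth 6 -> 5
q_study    = -2 + 8   # reward -2, then a successor state assumed worth 8 -> 6
v_red = max(q_facebook, q_study)
print(v_red)          # 6, matching the value of the red state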
5.1.7 Problem Solving Methods

Dynamic Programming

Dynamic programming (DP) refers to a collection of methods that can be used to compute optimal policies given a perfect model of the environment as a Markov decision process (MDP). Due to their high computational cost and their assumption of a perfect model, classical DP algorithms are of limited practical value in reinforcement learning, although they are still important conceptually.

DP provides an essential foundation for understanding the methods discussed in the following chapters of this book. In fact, all of these methods can be viewed as attempts to achieve much the same result as DP, only with less computation and without assuming a perfect model of the environment.

We usually assume that the environment is a finite MDP. That is, we assume that its state, action, and reward sets, S, A, and R, are finite, and that its dynamics are given by a set of probabilities p(s′, r | s, a), for all s ∈ S, a ∈ A(s), r ∈ R, and s′ ∈ S⁺ (S⁺ is S plus a terminal state if the problem is episodic).
O
plus a terminal state if the problem is episodic).

Although continuous state and action space problems can benefit from the use
of DP concepts, accurate solutions are only feasible in certain circumstances. Applying

ty
finite-state DP methods after quantizing the state and action spaces is a typical
technique for getting approximations for problems with continuous states and actions.

The use of value functions to organise and structure the search for effective

si
policies is the central concept of DP and reinforcement learning in general. In this
chapter, we demonstrate how to compute the value functions using DP. Once we have
determined the optimal value functions, v* or q*, which fulfil the Bellman optimality

r
equations, as was stated there, we may quickly obtain optimal policies:
ve
ni
U

for all s ɛ S, a ɛ A(s), and s0ɛ S+. As we shall see, Bellman equations like these
are converted into assignments, or update rules for enhancing approximations of the
ity

desired value functions, to produce DP algorithms.

Policy Evaluation
We start by thinking about how to calculate the state-value function vπ for
m

every given policy π. In the DP literature, this is referred to as policy evaluation. The
prediction problem is another name for it.
)A
(c

the expectations are subscripted by to show that they are dependent on the
assumption that the policy is followed, and π (a|s) represents the probability of acting

Amity Directorate of Distance & Online Education


320 Foundations of Machine Learning

in state s under the policy π. As long as either ultimate termination from all states under
Notes

e
the policy or γ< 1 is assured, vπ existence and uniqueness are guaranteed.

The above equation represents a system of |S| simultaneous linear equations in

in
|S| unknowns (the vπ(s), s ∈ S) if the dynamics of the environment are fully known.
Its answer is, in theory, a simple, albeit time-consuming, computation. Iterative solution
techniques are ideal for our needs. Take into account a string of approximation value

nl
functions, each mapping S + to R: v0, v1, v2,... With the exception of the terminal state,
if any, which must be assigned a value of 0, the initial approximation, v0, is chosen
freely, and each succeeding approximation is obtained by using the Bellman equation
for vπ as an update rule:

O
ty
for all s ∈ S. Given that the Bellman equation for v guarantees equality in this
situation, it is obvious that vk = vπ is a fixed point for this update rule. In fact, under the

si
identical circumstances that ensure the existence of vπ, it can be demonstrated that
the sequence “vk” generally converges to vπ as k → ∞. Iterative policy evaluation is the
name of this algorithm.
r
Iterative policy evaluation applies the same operation to each state s to produce
ve
each subsequent approximation, vk+1 from vk: it replaces the old value of s with a new
value obtained from the old values of the successor states of s, as well as the expected
immediate rewards, along all the one-step transitions possible under the policy being
evaluated. This process is referred to as a full backup. The new approximative value
ni

function vk+1 is created by backing up the value of each state once after each round of
iterative policy assessment.

Depending on whether a state (as in this case) or a state-action combination is


U

being backed up, as well as how precisely the projected values of the successor states
are combined, there are various types of full backups. Because they are based on all
potential next states rather than just a sample next state, all backups performed using
ity

DP algorithms are referred to as full backups.


m

You would need to employ two arrays: one for the previous values, vk(s), and one
for the new values, vk+1(s), in order to construct a sequential computer programme to
)A

accomplish iterative policy evaluation as stated in the equation above. In this manner,
the old values can be used to calculate the new values one at a time without affecting
the old values. Of course, updating the values “in place,” or with each new backed-up
value immediately replacing the previous one, is simpler when using a single array.
(c

Then, depending on the order in which the states are backed up, new values may
occasionally be utilised on the right-hand side of the aforementioned equation in place
of old ones. As you might assume given that it uses fresh data as soon as they become

Amity Directorate of Distance & Online Education


Foundations of Machine Learning 321

available, this slightly different algorithm also converges to vπ; in fact, it typically does
Notes

e
so faster than the two-array variant. We imagine that the backups are carried out in
a thorough scan of the state space. The rate of convergence of the in-place method
depends significantly on the order in which the states are backed up during the sweep.

in
When we consider DP algorithms, we typically have the in-place variant in mind.

nl
O
ty
Figure: Iterative policy evaluation

si
The algorithm’s termination is a further implementation point. Iterative policy
evaluation must be stopped before it reaches the limit because it converges only
r
there in theory. Testing the quantity max s∈S |vk+1(s)−vk(s)| after each sweep and
ve
quitting when it is sufficiently small is a common stopping condition for iterative policy
assessment. The whole procedure for iterative policy evaluation using this stopping
criterion is shown in the figure above.
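As a concrete illustration, here is a minimal in-place sketch of iterative policy evaluation, assuming the dynamics are given as a dictionary p mapping each (state, action) pair to a list of (probability, next_state, reward) triples; this data layout and the toy MDP are inventions of the sketch, not a standard API:

gamma, theta = 0.9, 1e-6

states = ["s0", "s1"]
policy = {"s0": {"stay": 0.5, "move": 0.5},
          "s1": {"stay": 0.5, "move": 0.5}}
# p[(s, a)] = list of (probability, next_state, reward) triples.
p = {("s0", "stay"): [(1.0, "s0", 0.0)],
     ("s0", "move"): [(1.0, "s1", 1.0)],
     ("s1", "stay"): [(1.0, "s1", 2.0)],
     ("s1", "move"): [(1.0, "s0", 0.0)]}

V = {s: 0.0 for s in states}       # arbitrary initial approximation v0
while True:
    delta = 0.0
    for s in states:               # one sweep through the state space
        v_old = V[s]
        V[s] = sum(pi_a * sum(prob * (r + gamma * V[s2])
                              for prob, s2, r in p[(s, a)])
                   for a, pi_a in policy[s].items())
        delta = max(delta, abs(v_old - V[s]))
    if delta < theta:              # stop when the sweep changed values little
        break
print(V)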
Policy Improvement

Our reason for computing the value function for a policy is to help find better policies. Suppose we have determined the value function vπ for an arbitrary deterministic policy π. For some state s we would like to know whether or not we should change the policy to deterministically choose an action a ≠ π(s). We know how good it is to follow the current policy from s, that is vπ(s), but would it be better or worse to change to the new policy? One way to answer this question is to consider selecting a in s and thereafter following the existing policy, π. The value of this way of behaving is

qπ(s, a) = Σ_s′,r p(s′, r | s, a) [ r + γ vπ(s′) ]

The key criterion is whether this is greater than or less than vπ(s). If it is greater, that is, if it is better to select a once in s and thereafter follow π than it would be to follow π all the time, then one would expect it to be better still to select a every time s is encountered, and that the new policy would in fact be a better one overall.

That this is true is a special case of a general result called the policy improvement theorem. Let π and π′ be any pair of deterministic policies such that, for all s ∈ S,

qπ(s, π′(s)) ≥ vπ(s)

Then the policy π′ must be as good as, or better than, π. That is, it must obtain greater or equal expected return from every state s ∈ S:

vπ′(s) ≥ vπ(s)

Moreover, if there is strict inequality of the first condition at any state, then there must be strict inequality of the second at at least one state. This result applies in particular to the two policies we considered in the previous paragraph: the original deterministic policy, π, and the changed policy, π′, which is identical to π except that π′(s) = a ≠ π(s). Obviously, the first condition holds at all states other than s. Thus, if qπ(s, a) > vπ(s), then the changed policy is indeed better than π.

The idea behind the proof of the policy improvement theorem is easy to understand. Starting from the first inequality, we keep expanding the qπ side and reapplying the inequality until we obtain vπ′(s):

vπ(s) ≤ qπ(s, π′(s)) = E[Rt+1 + γ vπ(St+1) | St = s, At = π′(s)] ≤ E[Rt+1 + γ qπ(St+1, π′(St+1)) | St = s] ≤ E[Rt+1 + γ Rt+2 + γ² vπ(St+2) | St = s] ≤ ... ≤ vπ′(s)
So far we have seen how, given a policy and its value function, we can easily evaluate a change in the policy at a single state to a particular action. It is a natural extension to consider changes at all states and to all possible actions, selecting at each state the action that appears best according to qπ(s, a). In other words, consider the new greedy policy, π′, given by

π′(s) = argmax_a qπ(s, a) = argmax_a Σ_s′,r p(s′, r | s, a) [ r + γ vπ(s′) ]

where argmax_a denotes the value of a at which the expression that follows is maximised (with ties broken arbitrarily). The greedy policy takes the action that looks best in the short term, after one step of lookahead, according to vπ. By construction, the greedy policy meets the conditions of the policy improvement theorem, so it is as good as, or better than, the original policy. The process of making a new policy that improves on an original policy, by making it greedy with respect to the value function of the original policy, is called policy improvement.
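A minimal sketch of this greedy policy improvement step, reusing the dictionary-based MDP layout assumed in the policy evaluation sketch above (that layout is our own convention, not a standard API):

def improve_policy(states, actions, p, V, gamma=0.9):
    # Greedy improvement: pick, in every state, the action with highest q_pi.
    new_policy = {}
    for s in states:
        # q_pi(s, a) for each action, by a one-step lookahead on V.
        q = {a: sum(prob * (r + gamma * V[s2]) for prob, s2, r in p[(s, a)])
             for a in actions}
        best = max(q, key=q.get)        # argmax with ties broken arbitrarily
        new_policy[s] = {best: 1.0}     # deterministic greedy policy
    return new_policy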

Policy Iteration

Once a policy, π, has been improved using vπ to yield a better policy, π′, we can then compute vπ′ and improve it again to yield an even better π′′. We can thus obtain a sequence of monotonically improving policies and value functions:

π0 →E vπ0 →I π1 →E vπ1 →I π2 →E ... →I π* →E v*

where →E denotes a policy evaluation and →I denotes a policy improvement.

Each new policy is guaranteed to be an improvement over the one that came before it (unless it is already optimal). Because a finite MDP has only a finite number of policies, this process must converge to an optimal policy and optimal value function in a finite number of iterations.

This way of finding an optimal policy is called policy iteration. A complete algorithm is given in the figure below. Note that each policy evaluation, itself an iterative computation, is started with the value function for the previous policy. This typically results in a great increase in the speed of convergence of policy evaluation (presumably because the value function changes little from one policy to the next).
Figure: Policy iteration (using iterative policy evaluation) for v*. This algorithm has a subtle bug, in that it may never terminate if the policy continually switches between two or more policies that are equally good. The bug can be fixed by adding more flags, but doing so would make the pseudocode uglier.
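Combining the two routines sketched above gives the policy iteration loop. This is again a sketch under the same assumed dictionary layout; evaluate_policy is the iterative policy evaluation from earlier, wrapped as a function, and improve_policy is the greedy step from the previous sketch:

def evaluate_policy(states, policy, p, V, gamma=0.9, theta=1e-6):
    # Iterative policy evaluation, started from the supplied V.
    while True:
        delta = 0.0
        for s in states:
            v_old = V[s]
            V[s] = sum(pi_a * sum(prob * (r + gamma * V[s2])
                                  for prob, s2, r in p[(s, a)])
                       for a, pi_a in policy[s].items())
            delta = max(delta, abs(v_old - V[s]))
        if delta < theta:
            return V

def policy_iteration(states, actions, p, policy, gamma=0.9):
    V = {s: 0.0 for s in states}
    while True:
        V = evaluate_policy(states, policy, p, V, gamma)            # evaluation
        new_policy = improve_policy(states, actions, p, V, gamma)   # improvement
        if new_policy == policy:   # policy stable -> optimal (up to ties)
            return policy, V
        policy = new_policy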

Asynchronous Dynamic Programming

A major drawback of the DP methods that we have discussed so far is that they involve operations over the entire state set of the MDP, that is, sweeps of the state set. If the state set is very large, then even a single sweep can be prohibitively expensive. For example, the game of backgammon has over 10²⁰ states. Even if we could perform the value iteration backup on a million states per second, it would take over a thousand years to complete a single sweep.

Asynchronous DP algorithms are in-place iterative DP algorithms that are not organised in terms of systematic sweeps of the state set. These algorithms back up the values of states in any order whatsoever, using whatever values of other states happen to be available. The values of some states may be backed up several times before the values of others are backed up once. To converge correctly, however, an asynchronous algorithm must continue to back up the values of all the states: it cannot ignore any state after some point in the computation. Asynchronous DP algorithms allow great flexibility in selecting the states to which backup operations are applied.

For example, one version of asynchronous value iteration backs up the value of only one state, sk, on each step, k, using the value iteration backup. If 0 ≤ γ < 1, asymptotic convergence to v* is guaranteed given only that all states occur in the sequence {sk} an infinite number of times (the sequence could even be stochastic). (In the undiscounted episodic case, it is possible that some orderings of backups do not result in convergence, although these are generally easy to avoid.) Similarly, it is possible to intermix policy evaluation and value iteration backups to produce a kind of asynchronous truncated policy iteration. Although the details of this and other more unusual DP algorithms are beyond the scope of this book, it is clear that a few different backups form building blocks that can be used flexibly in a wide variety of sweepless DP algorithms.

Of course, avoiding sweeps does not necessarily mean that we can get away with less computation. It just means that an algorithm does not need to get locked into any hopelessly long sweep before it can make progress improving a policy. We can try to take advantage of this flexibility by selecting the states to which we apply backups so as to speed up the algorithm's progress. We can try to order the backups so that value information propagates from state to state in an efficient way. Some states may not need their values backed up as often as others. We might even try to skip backing up some states entirely if they are not important for optimal behaviour.

Asynchronous algorithms also make it easier to intermix computation with real-time interaction. To solve a given MDP, we can run an iterative DP algorithm at the same time that an agent is actually experiencing the MDP. The agent's experience can be used to determine the states to which the DP algorithm applies its backups. At the same time, the latest value and policy information from the DP algorithm can guide the agent's decision-making. For example, we can apply backups to states as the agent visits them. This makes it possible to focus the DP algorithm's backups onto parts of the state set that are most relevant to the agent. This kind of focusing is a repeated theme in reinforcement learning.
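A minimal sketch of the single-state asynchronous value iteration backup described above, again using the assumed dictionary-based MDP layout; the random choice of which state to back up is just one possible ordering in which all states keep recurring:

import random

def async_value_iteration(states, actions, p, gamma=0.9, steps=10000):
    V = {s: 0.0 for s in states}
    for _ in range(steps):
        s = random.choice(states)          # states must keep recurring forever
        V[s] = max(sum(prob * (r + gamma * V[s2])
                       for prob, s2, r in p[(s, a)])
                   for a in actions)       # value iteration backup for s only
    return V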

5.1.8 Model Free (RL)

Monte Carlo

Monte Carlo methods require only experience: sample sequences of states, actions, and rewards from actual or simulated interaction with an environment. Learning from actual experience is striking because it requires no prior knowledge of the environment's dynamics, yet can still attain optimal behaviour. Learning from simulated experience is also powerful. Although a model is required, the model need only generate sample transitions, not the complete probability distributions of all possible transitions that dynamic programming (DP) requires. In surprisingly many cases it is easy to generate experience sampled according to the desired probability distributions, but infeasible to obtain the distributions in explicit form.

Monte Carlo methods are ways of solving the reinforcement learning problem based on averaging sample returns. To ensure that well-defined returns are available, we define Monte Carlo methods only for episodic tasks. That is, we assume experience is divided into episodes, and that all episodes eventually terminate no matter what actions are selected. Only on the completion of an episode are value estimates and policies changed. Monte Carlo methods can thus be incremental in an episode-by-episode sense, but not in a step-by-step (online) sense. The term "Monte Carlo" is often used more broadly for any estimation method whose operation involves a significant random component. Here we use it specifically for methods based on averaging complete returns.

Like bandit approaches, Monte Carlo methods sample and average returns for each state-action pair. The key distinction is that there are now many states, each acting like a different bandit problem (such as an associative-search or contextual bandit), and the different bandit problems are interrelated. That is, the return after taking an action in one state depends on the actions taken in later states in the same episode. Because all the action selections are undergoing learning, the problem becomes nonstationary from the point of view of the earlier state.

To handle this nonstationarity, we adapt the idea of general policy iteration (GPI) introduced for DP. Whereas there we computed value functions from knowledge of the MDP, here we learn value functions from sample returns with the MDP. The value functions and corresponding policies still interact in essentially the same way to attain optimality (GPI). As in the DP chapter, we first consider the prediction problem (the computation of vπ and qπ for a fixed arbitrary policy π), then policy improvement, and finally the control problem and its solution by GPI. Each of these ideas taken from DP is extended to the Monte Carlo case, in which only sample experience is available.
Monte Carlo Prediction

We begin by considering Monte Carlo methods for learning the state-value function for a given policy. Recall that the value of a state is the expected return, the expected cumulative future discounted reward, starting from that state. An obvious way to estimate it from experience, then, is simply to average the returns observed after visits to that state. As more returns are observed, the average should converge to the expected value. This idea underlies all Monte Carlo methods.

Suppose we wish to estimate vπ(s), the value of a state s under policy π, given a set of episodes obtained by following π and passing through s. Each occurrence of state s in an episode is called a visit to s. Of course, s may be visited multiple times in the same episode; let us call the first occurrence of s in an episode the first visit to s. The first-visit MC method estimates vπ(s) as the average of the returns following first visits to s, whereas the every-visit MC method averages the returns following all visits to s. These two Monte Carlo (MC) methods are very similar, but their theoretical properties differ slightly.

First-visit MC has been most widely studied, dating back to the 1940s, and is the one we focus on in this chapter. Every-visit MC extends more naturally to function approximation and eligibility traces. First-visit MC is shown in procedural form in the figure below.

Figure: First-visit MC method for estimating vπ. Note that we use a capital letter V for the approximate value function because, after initialisation, it soon becomes a random variable.
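A minimal sketch of first-visit MC prediction, assuming each episode is supplied as a list of (state, reward) pairs where the reward is the one received on leaving that state; this episode format is a convention of the sketch, not a standard API:

from collections import defaultdict

def first_visit_mc(episodes, gamma=1.0):
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    for episode in episodes:
        first_visit = {}                      # state -> index of first occurrence
        for i, (s, _) in enumerate(episode):
            first_visit.setdefault(s, i)
        G = 0.0
        # Work backwards, accumulating the discounted return G.
        for i in range(len(episode) - 1, -1, -1):
            s, r = episode[i]
            G = gamma * G + r
            if first_visit[s] == i:           # only the first visit contributes
                returns_sum[s] += G
                returns_count[s] += 1
    return {s: returns_sum[s] / returns_count[s] for s in returns_sum}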

Both first-visit MC and every-visit MC converge to vπ(s) as the number of visits (or first visits) to s goes to infinity. This is easy to see for the case of first-visit MC. In this case each return is an independent, identically distributed estimate of vπ(s) with finite variance. By the law of large numbers, the sequence of averages of these estimates converges to their expected value. Each average is itself an unbiased estimate, and the standard deviation of its error falls as 1/√n, where n is the number of returns averaged. Every-visit MC is less straightforward, but its estimates also converge asymptotically to vπ(s) (Singh and Sutton, 1996).

The use of Monte Carlo methods is best illustrated through an example.

Example: Blackjack. The object of the popular casino card game blackjack is to obtain cards whose numerical values sum to as much as possible without exceeding 21. All face cards count as 10, and an ace can count either as 1 or as 11. We consider the version in which each player competes independently against the dealer. The game begins with two cards dealt to both the dealer and the player. One of the dealer's cards is face up and the other is face down. If the player has 21 immediately (an ace and a 10-card), it is called a natural.
The player then wins, unless the dealer also has a natural, in which case the game is a draw. If the player does not have a natural, he can request additional cards one at a time (hits) until he either stops (sticks) or exceeds 21 (goes bust). If he goes bust, he loses; if he sticks, then it becomes the dealer's turn. The dealer hits or sticks according to a fixed strategy without choice: he sticks on any sum of 17 or greater, and hits otherwise. If the dealer goes bust, then the player wins; otherwise, the winner is decided by whose final sum is closer to 21.

Blackjack is naturally formulated as a finite episodic MDP. Each game of blackjack is an episode. Rewards of +1, −1, and 0 are given for winning, losing, and drawing, respectively. All rewards within a game are zero, and we do not discount (γ = 1); therefore these terminal rewards are also the returns. The player's actions are to hit or to stick. The states depend on the player's cards and the dealer's showing card. We assume that cards are dealt from an infinite deck (i.e., with replacement), so there is no advantage to keeping track of the cards already dealt.
O
keeping track of the cards already dealt isn’t necessary.

The ace is said to be useful if the player can count it as 11 without going bust. In
this situation, it is always considered as 11 because treating it as 1 would result in a

ty
sum of 11 or less, in which case the player should always hit. As a result, the player
bases his decisions on three factors: his current sum (12–21), the dealer’s one showing
card (ace–10), and whether or not he has an useable ace. This brings the total number

si
of states to 200.

Consider the policy that sticks if the player’s sum is 20 or 21, and otherwise
hits. To find the state-value function for this policy by a Monte Carlo approach, one
r
simulates many blackjack games using the policy and averages the returns following
ve
each state. Note that in this task the same state never recurs within one episode, so
there is no difference between first-visit and every-visit MC methods. In this way, we
obtained the estimates of the state value function shown in Figure below. The estimates
for states with a usable ace are less certain and less regular because these states
ni

are less common. In any event, after 500,000 games the value function is very well
approximated.
U
ity
m
)A

Figure: Approximate state-value functions for the blackjack policy that sticks
only on 20 or 21, computed by Monte Carlo policy evaluation.
(c

Although we have complete knowledge of the environment in this scenario,


applying DP methods to compute the value function would be difficult. The distribution
of next events is required by DP methods—specifically, the quantities p(s’, r|s, a), which

Amity Directorate of Distance & Online Education


328 Foundations of Machine Learning

are difficult to determine in blackjack. Let’s say the player’s total is 14 and he decides
Notes

e
to stick. As a function of the dealer’s displaying card, what is his expected reward?
Before DP can be used, all of these predicted rewards and transition probabilities must
be estimated, and such computations are generally difficult and error-prone. Creating

in
the sample games necessary by Monte Carlo methods, on the other hand, is simple.
The capacity of Monte Carlo methods to function with sample episodes alone can
be a considerable advantage even when one has comprehensive knowledge of the

nl
environment’s dynamics; this is the case surprisingly often.

Is it possible to use backup diagrams to Monte Carlo algorithms? A backup


diagram’s general concept is to represent the root node to be updated at the top and

O
all the transitions and leaf nodes whose rewards and estimated values contribute to
the update at the bottom. The root is a state node in Monte Carlo estimation of vπ,
and below it is the whole trajectory of transitions along a single episode, finishing at the

ty
terminal state, as shown in Figure below.

r si
ve
ni
U

Figure: The backup diagram for Monte Carlo estimation of vπ.


An important fact about Monte Carlo methods is that the estimates for each state are independent. The estimate for one state does not build upon the estimate of any other state, as it does in DP. In other words, Monte Carlo methods do not bootstrap.

It’s worth noting, in particular, that the cost of computing the value of a single
m

state is independent on the number of states. This can make Monte Carlo approaches
particularly appealing when only one or a subset of states is required. Many sample
episodes can be generated by starting with the states of interest and average the
returns from only these states while ignoring all others. This is the third benefit of Monte
)A

Carlo methods over DP methods (after the ability to learn from actual experience and
from simulated experience).

Monte Carlo Estimation of Action Values

If a model is not available, then it is particularly useful to estimate action values (the values of state-action pairs) rather than state values. With a model, state values alone are sufficient to determine a policy: one simply looks ahead one step and chooses whichever action leads to the best combination of reward and next state, as we did in the DP chapter. Without a model, however, state values alone are not sufficient. One must explicitly estimate the value of each action in order for the values to be useful in suggesting a policy. Thus, one of our primary goals for Monte Carlo methods is to estimate q*. To achieve this, we first consider the policy evaluation problem for action values.

The policy evaluation problem for action values is to estimate qπ(s, a), the expected return when starting in state s, taking action a, and thereafter following policy π. The Monte Carlo methods for this are essentially the same as just presented for state values, except now we talk about visits to a state-action pair rather than to a state. A state-action pair (s, a) is said to be visited in an episode if the state s is visited and action a is taken in it.

The every-visit MC method estimates the value of a state-action pair as the average of the returns that have followed all the visits to it. The first-visit MC method averages the returns following the first time in each episode that the state was visited and the action was selected. As before, both methods converge quadratically to the true expected values as the number of visits to each state-action pair approaches infinity.

The only complication is that many state-action pairs may never be visited. If π is a deterministic policy, then in following π one will observe returns only for one of the actions from each state. With no returns to average, the Monte Carlo estimates of the other actions will not improve with experience. This is a serious problem, because the purpose of learning action values is to help in choosing among the actions available in each state. To compare alternatives we need to estimate the value of all the actions from each state, not just the one we currently favour.

For policy evaluation to work for action values, we must assure continual exploration.
One way to do this is to specify that every episode starts at a state–action pair, with
every pair having a nonzero probability of being selected as the start. This
guarantees that all state–action pairs will be visited an infinite number of times in
the limit of an infinite number of episodes. We call this the assumption of
exploring starts.

The assumption of exploring starts is sometimes useful, but it cannot be relied upon
in general, particularly when learning directly from actual interaction with an
environment. In that case the starting conditions are unlikely to be so helpful. The
most common alternative approach to assuring that all state–action pairs are
encountered is to consider only stochastic policies with a nonzero probability of
selecting all actions in each state. We discuss two important variants of this
approach in later sections. For now, we retain the assumption of exploring starts
and complete the presentation of a full Monte Carlo control method.
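
As a rough sketch of the idea under the exploring starts assumption (the episode generator sample_episode and the toy environment below are hypothetical stand-ins, not a standard API), every-visit estimation of qπ simply averages returns per state–action pair:

import random
from collections import defaultdict

def mc_action_values(sample_episode, states, actions, n_episodes, gamma=1.0):
    """Every-visit MC estimation of q_pi with exploring starts.

    `sample_episode(s0, a0)` is assumed to return one episode as a list
    of (state, action, reward) triples, beginning with action a0 in s0.
    """
    q_sum = defaultdict(float)
    q_count = defaultdict(int)
    for _ in range(n_episodes):
        # Exploring starts: every (state, action) pair can begin an episode.
        s0, a0 = random.choice(states), random.choice(actions)
        episode = sample_episode(s0, a0)
        G = 0.0
        for s, a, r in reversed(episode):
            G = gamma * G + r                  # discounted return to go
            q_sum[(s, a)] += G                 # every visit contributes
            q_count[(s, a)] += 1
    return {sa: q_sum[sa] / q_count[sa] for sa in q_sum}

# Trivial one-step environment, for illustration only.
demo = lambda s, a: [(s, a, 1.0 if a == "good" else 0.0)]
print(mc_action_values(demo, ["s"], ["good", "bad"], 200))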

Temporal-Difference Learning (TD Learning)


One difficulty is that rewards are often not observable immediately in the
environment. In games such as tic-tac-toe, for instance, we learn the reward only
on the final move (the terminal state); no immediate reward accompanies any of the
other actions.

TD learning is an unsupervised technique for predicting the expected value of a
variable over a sequence of states. Through a mathematical trick, TD replaces
complex reasoning about the future with a simple learning procedure that can
produce the same results. Rather than computing the total future reward, TD tries
to predict the combination of the immediate reward and its own reward prediction
at the next moment in time.

The fundamental mathematical idea behind TD learning is the discounted return,

Vt = Rt+1 + γRt+2 + γ²Rt+3 + …

where the reward at time t is the combination of the discounted future rewards. It
implies that rewards further in the future are valued less. The TD error is the
difference between the final correct reward (V*t) and our current prediction (Vt):

TD Error = V*t − Vt

Just as in other optimization techniques, the current value is then updated by its
value plus the learning rate times the error:

Vt ← Vt + α (V*t − Vt)
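
A minimal sketch of this update rule in Python (the numbers are arbitrary):

def td_update(v_current, target, alpha=0.1):
    """Move the current estimate toward the target by a fraction
    (the learning rate alpha) of the error."""
    error = target - v_current        # the TD error, V*_t - V_t
    return v_current + alpha * error

print(td_update(30.0, 50.0))          # 32.0: nudged toward the target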

5.1.9 Monte Carlo Control


We are now ready to consider how Monte Carlo estimation can be used in control,
that is, to approximate optimal policies. The overall idea is to proceed according to
the same pattern as in the DP chapter, namely generalised policy iteration (GPI).
In GPI one maintains both an approximate policy and an approximate value function.
The value function is repeatedly altered to more closely approximate the value
function for the current policy, and the policy is repeatedly improved with respect
to the current value function:
[Figure: the GPI cycle, in which policy evaluation drives Q toward qπ while policy improvement drives π toward greedy(Q).]
These two types of adjustments operate against each other to some extent, as
each provides a moving goal for the other, but when combined, they bring both the
policy and the value function closer to their ideal state.

To begin, let us consider a Monte Carlo version of classical policy iteration. In this
method we alternate complete steps of policy evaluation and policy improvement,
beginning with an arbitrary policy π0 and ending with the optimal policy and optimal
action-value function:

π0 →E qπ0 →I π1 →E qπ1 →I π2 →E ⋯ →I π∗ →E qπ∗

where →E denotes a complete policy evaluation and →I denotes a complete policy
improvement.

Policy evaluation is done exactly as described in the preceding section. Many
episodes are experienced, with the approximate action-value function approaching
the true function asymptotically. For the moment, let us assume that we do indeed
observe an infinite number of episodes and that, in addition, the episodes are
generated with exploring starts. Under these assumptions, the Monte Carlo
methods will compute each qπk exactly, for arbitrary πk.

Policy improvement is done by making the policy greedy with respect to the current
value function. In this case we have an action-value function, so no model is
needed to construct the greedy policy. For any action-value function q, the
corresponding greedy policy is the one that, for each s ∈ S, deterministically
chooses an action with the highest action-value:
π(s) = argmaxa q(s, a)

Policy improvement then can be done by constructing each πk+1 as the greedy
policy with respect to qπk. The policy improvement theorem then applies to πk and
πk+1 because, for all s ∈ S,

qπk(s, πk+1(s)) = maxa qπk(s, a) ≥ qπk(s, πk(s)) = vπk(s)
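
Because the greedy policy is defined directly from the action-value function, it can be constructed without a model. A small illustrative sketch (the Q table and action names are made up):

def greedy_policy(Q, states, actions):
    """For each state, pick an action with the highest action-value."""
    return {s: max(actions, key=lambda a: Q.get((s, a), 0.0)) for s in states}

Q = {("s1", "left"): 0.2, ("s1", "right"): 0.9}
print(greedy_policy(Q, ["s1"], ["left", "right"]))   # {'s1': 'right'}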

As we discussed in the previous chapter, the theorem assures us that each πk+1 is
uniformly better than πk, or just as good as πk, in which case they are both optimal
policies. This, in turn, assures us that the overall process converges to the optimal
policy and optimal value function. In this way Monte Carlo methods can be used
to find optimal policies given only sample episodes and no other knowledge of the
environment’s dynamics. To obtain this guarantee of convergence for the Monte
Carlo method, however, we made two unlikely assumptions above. The first was
that the episodes have exploring starts, and the second was that policy evaluation
could be done with an infinite number of episodes. Both assumptions must be
removed to obtain a practical algorithm.

For now we focus on the assumption that policy evaluation operates on an infinite
number of episodes. This assumption is relatively easy to remove. In fact, the
same issue arises even in classical DP methods such as iterative policy evaluation,
which also converge only asymptotically to the true value function. In both the DP
and Monte Carlo cases there are two ways to solve the problem. One is to hold
firm to the idea of approximating qπk in each policy evaluation. Measurements and
assumptions are made to obtain bounds on the magnitude and probability of error
in the estimates, and then sufficient steps are taken during each policy evaluation
to assure that these bounds are sufficiently small. This approach can probably be
made completely satisfactory in the sense of guaranteeing correct convergence up
to some level of approximation. However, it is also likely to require far too many
episodes to be useful in practice on any but the smallest problems.
a certain level of approximation. However, it will almost certainly require far too many

si
episodes to be useful in practice on anything other than the tiniest of difficulties.

The second strategy for avoiding the theoretically endless number of episodes

r
required for policy evaluation is to abandon policy evaluation entirely before returning
to policy improvement. We move the value function closer to qk with each evaluation
ve
step, but we don’t expect to come close unless we go through a lot of them. Value
iteration is an extreme version of the concept, in which only one iteration of iterative
policy evaluation is carried out between each step of policy improvement. The in-
place form of value iteration goes even farther; for single states, we alternate between
ni

improvement and assessment stages. It’s reasonable to cycle between evaluation and
improvement on an episode-by-episode basis for Monte Carlo policy evaluation. The
observed returns are utilised to evaluate policy after each episode, and the policy is
U

then improved in all of the states visited during the episode. Figure following shows a
complete simple algorithm along these lines. This algorithm is known as Monte Carlo
ES, which stands for Monte Carlo with Exploring Starts.
ity
m
)A
(c

Figure: Monte Carlo ES: A Monte Carlo control algorithm assuming exploring
starts and that episodes always terminate for all policies.
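
The loop in the figure can be sketched in Python roughly as follows. The episode generator generate_episode is a hypothetical hook for the environment, and returns are averaged incrementally rather than stored in lists; this is a sketch of the first-visit variant under the stated assumptions, not a definitive implementation.

import random
from collections import defaultdict

def monte_carlo_es(generate_episode, states, actions, n_episodes, gamma=1.0):
    """Monte Carlo ES: control with exploring starts (first-visit)."""
    Q = defaultdict(float)
    N = defaultdict(int)
    policy = {s: random.choice(actions) for s in states}    # arbitrary pi_0
    for _ in range(n_episodes):
        # Exploring start: a random state-action pair begins the episode.
        s0, a0 = random.choice(states), random.choice(actions)
        episode = generate_episode(s0, a0, policy)  # [(s, a, r), ...]
        first_visit = {}
        for t, (s, a, _) in enumerate(episode):
            first_visit.setdefault((s, a), t)
        G = 0.0
        for t in reversed(range(len(episode))):
            s, a, r = episode[t]
            G = gamma * G + r
            if first_visit[(s, a)] == t:
                N[(s, a)] += 1
                Q[(s, a)] += (G - Q[(s, a)]) / N[(s, a)]   # running average
                # Policy improvement at every state visited in the episode.
                policy[s] = max(actions, key=lambda x: Q[(s, x)])
    return policy, Q

# One-step toy environment: reward 1 only for action "good".
demo = lambda s, a, pi: [(s, a, 1.0 if a == "good" else 0.0)]
policy, Q = monte_carlo_es(demo, ["s"], ["good", "bad"], 500)
print(policy)   # {'s': 'good'} once both actions have been sampled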


In Monte Carlo ES, all of the returns for each state–action pair are accumulated and
averaged, irrespective of what policy was in force when they were observed. It is
easy to see that Monte Carlo ES cannot converge to any suboptimal policy. If it did,
then the value function would eventually converge to the value function for that
policy, and that in turn would cause the policy to change. Stability is achieved only
when both the policy and the value function are optimal.

Convergence to this optimal fixed point seems inevitable as the changes to the
action-value function decrease over time, but it has not yet been formally proved.
In our opinion, this is one of the most fundamental open theoretical questions in
reinforcement learning (for a partial solution, see Tsitsiklis, 2002).

Monte Carlo Control without Exploring Starts

How can we avoid the unlikely assumption of exploring starts? The only general
way to ensure that all actions are selected infinitely often is for the agent to continue
to select them. There are two approaches to ensuring this, resulting in what we call
on-policy methods and off-policy methods. On-policy methods attempt to evaluate
or improve the policy that is used to make decisions, whereas off-policy methods
evaluate or improve a policy different from that used to generate the data. The
Monte Carlo ES method developed above is an example of an on-policy method.
In this section we show how an on-policy Monte Carlo control method can be
designed that does not use the unrealistic assumption of exploring starts.
develop an on-policy Monte Carlo control approach that does not rely on the unrealistic
assumption of investigating begins in this section. r
ve
GPI’s overall concept of on-policy Monte Carlo control remains the same. We
employ first-visit MC methods to estimate the action-value function for the current
policy, much as we did in Monte Carlo ES. We can’t just enhance the policy by making
it greedy with regard to the current value function without the assumption of exploring
ni

beginnings, because that would prevent further investigation of nongreedy actions.


Fortunately, GPI just requires that the policy be advanced toward a greedy policy, not
that it be taken all the way there. We’ll only change it to a ε -greedy policy using our on-
U

policy function. Any , π, ε -greedy policy with respect to qπ is guaranteed to be better


than or equal π to any ε -soft policy.

That any ε-greedy policy with respect to qπ is an improvement over any ε-soft
policy π is assured by the policy improvement theorem. Let π′ be the ε-greedy
policy. The conditions of the policy improvement theorem apply because for any
s ∈ S:

qπ(s, π′(s)) = Σa π′(a|s) qπ(s, a)
= ε/|A(s)| Σa qπ(s, a) + (1 − ε) maxa qπ(s, a)
≥ ε/|A(s)| Σa qπ(s, a) + (1 − ε) Σa [(π(a|s) − ε/|A(s)|) / (1 − ε)] qπ(s, a)
= Σa π(a|s) qπ(s, a) = vπ(s)

(because the sum is a weighted average with nonnegative weights that add up to 1,
it must be less than or equal to the largest number averaged).


Thus, by the policy improvement theorem, π′ ≥ π (that is, vπ′(s) ≥ vπ(s), for all
s ∈ S). We now show that equality can hold only when both π′ and π are optimal
among the ε-soft policies, that is, when they are better than or equal to all other
ε-soft policies.

Consider a new environment that is just like the original environment, except with
the requirement that policies be ε-soft “moved inside” the environment. The new
environment has the same action and state set as the original and behaves as
follows. If in state s and taking action a, then with probability 1 − ε the new
environment behaves exactly like the old environment.

With probability ε it repicks the action at random, with equal probabilities, and then
behaves like the old environment with the new, random action. The best one can
do in this new environment with general policies is the same as the best one could
do in the original environment with ε-soft policies.

Let ve∗ and qe∗ denote the optimal value functions for the new environment. Then
a policy π is optimal among the ε-soft policies if and only if vπ = ve∗. From the
definition of ve∗ we know that it is the unique solution to

ve∗(s) = (1 − ε) maxa qe∗(s, a) + (ε/|A(s)|) Σa qe∗(s, a)

When equality holds and the ε-soft policy π is no longer improved, then we also
know that

vπ(s) = (1 − ε) maxa qπ(s, a) + (ε/|A(s)|) Σa qπ(s, a)

However, this equation is the same as the previous one, except for the substitution
of vπ for ve∗. Since ve∗ is the unique solution, it must be that vπ = ve∗.

In essence, we have shown in the last few pages that policy iteration works for
ε-soft policies. Using the natural notion of greedy policy for ε-soft policies, one is
assured of improvement on every step, except when the best policy has been
found among the ε-soft policies.


This analysis is independent of how the action-value functions are determined at
each stage, but it does assume that they are computed exactly. This brings us to
roughly the same point as in the previous section. Now we achieve only the best
policy among the ε-soft policies, but, on the other hand, we have eliminated the
assumption of exploring starts. The complete algorithm is given in the Figure below.

Figure: An on-policy first-visit MC control algorithm for ε-soft policies.
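
The key step in the algorithm above is choosing actions ε-greedily, so that every action retains probability at least ε/|A(s)|. A minimal sketch in Python (the Q table is illustrative):

import random

def epsilon_greedy_action(Q, state, actions, epsilon=0.1):
    """Sample an action from an epsilon-greedy (hence epsilon-soft) policy."""
    if random.random() < epsilon:
        return random.choice(actions)                          # explore
    return max(actions, key=lambda a: Q.get((state, a), 0.0))  # exploit

Q = {("s", "left"): 0.3, ("s", "right"): 0.7}
print(epsilon_greedy_action(Q, "s", ["left", "right"]))        # usually 'right'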
5.1.10 Temporal Learning
If one had to identify one idea as central and novel to reinforcement learning, it
would undoubtedly be temporal-difference (TD) learning. TD learning is a
combination of Monte Carlo ideas and dynamic programming (DP) ideas. Like
Monte Carlo methods, TD methods can learn directly from raw experience without
a model of the environment’s dynamics. Like DP, TD methods update estimates
based in part on other learned estimates, without waiting for a final outcome (they
bootstrap). The relationship between TD, DP, and Monte Carlo methods is a
recurring theme in the theory of reinforcement learning; this chapter begins our
exploration of it. Before we are done, we will see how these ideas and methods
blend into each other and can be combined in many ways.

TD Prediction
Both TD and Monte Carlo methods use experience to solve the prediction problem.
Given some experience following a policy π, both methods update their estimate V
of vπ for the nonterminal states St occurring in that experience. Roughly speaking,
Monte Carlo methods wait until the return following the visit is known, then use
that return as a target for V (St). A simple every-visit Monte Carlo method suitable
for nonstationary environments is

V (St) ← V (St) + α [Gt − V (St)]

where Gt denotes the actual return following time t and α is a constant step-size
parameter. Let us call this method constant-α MC. Whereas Monte Carlo methods
must wait until the end of the episode to determine the increment to V (St) (only
then is Gt known), TD methods need only wait until the next time step. At time t+1
they immediately form a target and make a useful update using
the observed reward Rt+1 and the estimate V (St+1). The simplest TD method is
known as TD(0):

V (St) ← V (St) + α [Rt+1 + γV (St+1) − V (St)]

In effect, the target of the Monte Carlo update is Gt, whereas the target of the TD
update is Rt+1 + γV (St+1).

Because the TD update is based in part on an existing estimate, we say that TD,
like DP, is a bootstrapping method. From the definition of the value function we
know that

vπ(s) = Eπ[Gt | St = s]
= Eπ[Rt+1 + γGt+1 | St = s]
= Eπ[Rt+1 + γvπ(St+1) | St = s]
Roughly speaking, Monte Carlo methods use an estimate of the first line of the
above equation as a target, whereas DP methods use an estimate of the last line.
The Monte Carlo target is an estimate because the expected value in the first line
is not known; a sample return is used in place of the real expected return.

The DP target is an estimate not because of the expected values, which are
assumed to be completely provided by a model of the environment, but because
vπ(St+1) is not known and the current estimate, V (St+1), is used instead. The TD
target is an estimate for both reasons: it samples the expected values in the
equation above and it uses the current estimate V instead of the true vπ. Thus,
TD methods combine the sampling of Monte Carlo with the bootstrapping of DP.
As we shall see, with care and imagination this can take us a long way toward
obtaining the advantages of both Monte Carlo and DP methods.
Figure: Tabular TD(0) for estimating vπ.

The Figure above shows TD(0) completely in procedural form, and the Figure below
shows its backup diagram. The value estimate for the state node at the top of the
backup diagram is updated on the basis of one sample transition from it to the
immediately following state. Because TD and Monte Carlo updates look ahead to
a sample
successor state (or state–action pair), using the value of the successor and the
reward along the way to compute a backed-up value, and then updating the value
of the original state (or state–action pair) accordingly, we call them sample backups.
Sample backups differ from the full backups of DP methods in that they are based
on a single sample successor rather than on a complete distribution of all possible
successors.

Figure: The backup diagram for TD(0)
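
The tabular TD(0) procedure in the Figure above can be sketched in a few lines of Python. The transition generator run_episode is a hypothetical stand-in for the policy and environment:

from collections import defaultdict

def td0_prediction(run_episode, n_episodes, alpha=0.1, gamma=1.0):
    """Tabular TD(0) for estimating v_pi.

    `run_episode()` is assumed to yield (state, reward, next_state)
    transitions under pi, with next_state None at termination.
    """
    V = defaultdict(float)                    # V of terminal stays 0
    for _ in range(n_episodes):
        for s, r, s_next in run_episode():
            target = r + gamma * (V[s_next] if s_next is not None else 0.0)
            V[s] += alpha * (target - V[s])   # update at every time step
    return V

# Tiny two-state chain: A -> B -> terminal, rewards 0 then 1.
chain = lambda: [("A", 0.0, "B"), ("B", 1.0, None)]
V = td0_prediction(chain, 500)
print(round(V["A"], 2), round(V["B"], 2))     # both approach 1.0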

Example 6.1: Driving Home. Each day as you drive home from work, you try to
predict how long it will take to get home. When you leave your office, you note the
time, the day of week, and anything else that might be relevant. Say on this Friday
you are leaving at exactly 6 o’clock, and you estimate that it will take 30 minutes to
get home. As you reach your car it is 6:05, and you notice it is starting to rain. Traffic
is often slower in the rain, so you re-estimate that it will take 35 minutes from then,
or a total of 40 minutes. Fifteen minutes later you have completed the highway
portion of your journey in good time. As you exit onto a secondary road you cut your
estimate of total travel time to 35 minutes. Unfortunately, at this point you get stuck
behind a slow truck, and the road is too narrow to pass. You end up having to follow
the truck until you turn onto the side street where you live at 6:40. Three minutes
later you are home. The sequence of states, times, and predictions is thus as follows:
ve
sequence of states, times, and predictions is thus as follows:
ni
U
ity

Figure: Changes recommended by Monte Carlo methods in the driving home example.

The elapsed times on each leg of the route are the rewards in this case. Because
we are not discounting (γ = 1), the return for each state is the actual time to go
from that state. The value of each state is the expected time to go. The second
column of numbers gives the current estimated value for each state encountered.

A simple way to view the operation of Monte Carlo methods is to plot the predicted
total time (the last column) over the sequence, as in the Figure above. The arrows
show the changes in predictions recommended by the constant-α MC method, for
α = 1. These are the errors between the estimated value (predicted time to go) in
each state and the actual return (actual time to go). For example, you thought it
would take only 15 more minutes to get home after exiting the highway, but in fact
it took 23 minutes.

At this point the update rule above would be applied to increase the estimate of
time to go after exiting the highway. The error, Gt − V (St), at this time is eight
minutes. Suppose the step-size parameter α is 1/2. Then the predicted time to go
after exiting the highway would be revised upward by four minutes as a result of
this experience. This is probably too large a change in this case; the truck was
probably just an unlucky break. In any event, the change can only be made off-line,

that is, after you have reached home. Only at this point do you know any of the
actual returns. Is it necessary to wait until the final outcome is known before learning
can begin? Suppose on another day you again estimate when leaving your office
that it will take 30 minutes to drive home, but then you become stuck in a massive
traffic jam. Twenty-five minutes after leaving the office you are still bumper-to-bumper
on the highway. You now estimate that it will take another 25 minutes to get home,
for a total of 50 minutes. As you wait in traffic, you already know that your initial
estimate of 30 minutes was too optimistic. Must you wait until you get home before
increasing your estimate for the initial state? According to the Monte Carlo approach
you must, because you do not yet know the true return.

According to a TD approach, on the other hand, you would learn immediately,
shifting your initial estimate from 30 minutes toward 50. In fact, each estimate would
be shifted toward the estimate that immediately follows it. Returning to our first day
of driving, the Figure below shows the same predictions as the Figure above, except
with the changes recommended by the TD rule (these are the changes made by the
rule if α = 1). Each error is proportional to the change over time of the prediction,
that is, to the temporal differences in predictions.

Figure: Changes recommended by TD methods in the driving home example.
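
Using the predicted total travel times from the table above, the two kinds of errors can be computed directly; this small sketch just restates the example’s numbers:

# Predicted total travel time at each state, and the actual outcome (43).
predictions = [30, 40, 35, 40, 43, 43]
actual_return = predictions[-1]

# Monte Carlo errors: each prediction vs. the final outcome (off-line).
mc_errors = [actual_return - v for v in predictions[:-1]]

# TD errors: each prediction vs. the prediction immediately after it.
td_errors = [nxt - cur for cur, nxt in zip(predictions, predictions[1:])]

print(mc_errors)   # [13, 3, 8, 3, 0]
print(td_errors)   # [10, -5, 5, 3, 0]

Each Monte Carlo error is exactly the sum of the TD errors that follow it, which is why the two methods move the predictions toward the same final outcome in different ways.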


Apart from giving you something to do while waiting in traffic, there are several
computational reasons why it is advantageous to learn based on your current
predictions rather than waiting until you know the actual return.

Advantages of TD Prediction Methods
TD methods learn their estimates in part on the basis of other estimates. They learn
a guess from a guess; that is, they bootstrap. Is that a good thing to do? What
advantages do TD methods have over Monte Carlo and DP methods? Developing
and answering such questions will take the rest of this book and more. In this
section we briefly anticipate some of the answers.

Most obviously, TD methods have an advantage over DP methods in that they do
not require a model of the environment, of its reward and next-state probability
distributions.

The next most obvious advantage of TD methods over Monte Carlo methods is that
they are naturally implemented in an on-line, fully incremental fashion. With Monte
Carlo methods one must wait until the end of an episode, because only then is the
return known, whereas with TD methods one need wait only one time step.
Surprisingly often this turns out to be a critical consideration. Some applications
have very long episodes, so that delaying all learning until the end of the episode is
too slow. Other applications are continuing tasks and have no episodes at all.
Monte Carlo methods must ignore or discount episodes on which experimental
actions are taken, which can greatly slow learning. TD methods are much less
susceptible to these problems because they learn from each transition regardless
of what subsequent actions are taken.

But are TD methods sound? Certainly it is convenient to learn one guess from the
next, without waiting for an actual outcome, but can we still guarantee convergence
to the correct answer? Happily, the answer is yes. For any fixed policy π, the TD
algorithm described above has been proved to converge to vπ: in the mean for a
constant step-size parameter if it is sufficiently small, and with probability 1 if the
step-size parameter decreases according to the usual stochastic approximation
conditions. Most convergence proofs apply only to the table-based case of the
algorithm described above, but some also apply to the case of general linear
function approximation.

If both TD and Monte Carlo methods converge asymptotically to the correct
predictions, then a natural next question is “Which gets there first?” In other words,
which method learns faster? Which makes the more efficient use of limited data?
At the current time this is an open question, in the sense that no one has been able
to prove mathematically that one method converges faster than the other. In fact,
it is not even clear what is the most appropriate formal way to phrase this question!
In practice, however, TD methods have usually been found to converge faster than
constant-α MC methods on stochastic tasks.

Optimality of TD(0)

Suppose there is available only a finite amount of experience, say 10 episodes or
100 time steps. In this case, a common approach with incremental learning
methods is to present the experience repeatedly until the method converges upon
an answer. Given an approximate value function V, the increments specified by the
constant-α MC or TD(0) updates above are computed for every time step t at which
a nonterminal state is visited, but the value function is changed only once, by the
sum of all the increments. Then all the available experience is processed again with
the new value function to produce a new overall increment, and so on, until the
value function converges. We call this batch updating because updates are made
only after processing each complete batch of training data.
referred to as batch updating since updates are only performed after each whole batch
of training data has been processed.

nl
O
Under batch updating, TD(0) converges deterministically to a single answer
independent of the step-size parameter α, as long as α is chosen to be sufficiently
small. The constant-α MC method also converges deterministically under the same
conditions, but to a different answer. Understanding these two answers will help us
understand the difference between the two methods. Under normal updating the
methods do not move all the way to their respective batch answers, but in some
sense they take steps in these directions. Before trying to understand the two
answers in general, for all possible tasks, we first look at a few examples.
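
A sketch of batch-updating TD(0) under these assumptions (the fixed batch of transitions is illustrative): increments are accumulated over a full sweep of the data and applied only once per sweep.

def batch_td0(batch, n_sweeps=200, alpha=0.05, gamma=1.0):
    """Batch-updating TD(0) over a fixed set of transitions."""
    V = {}
    for _ in range(n_sweeps):
        increments = {}
        for s, r, s_next in batch:
            target = r + gamma * V.get(s_next, 0.0)   # terminal: None -> 0
            delta = alpha * (target - V.get(s, 0.0))
            increments[s] = increments.get(s, 0.0) + delta
        for s, d in increments.items():               # apply the sums at once
            V[s] = V.get(s, 0.0) + d
    return V

# Ten time steps of fixed experience (state, reward, next state).
batch = [("A", 0.0, "B"), ("B", 1.0, None)] * 5
print(batch_td0(batch))   # converges toward {'A': 1.0, 'B': 1.0}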

Summary
●● A computational method for comprehending and automating goal-directed learning
A computational method for comprehending and automating goal-directed learning
and decision-making is called reinforcement learning. By emphasising learning
ve
by an agent from direct interaction with its environment, rather than depending on
exemplary supervision or comprehensive models of the environment, it differs from
previous computational techniques. According to our assessment, reinforcement
learning is the first area to take the computational challenges presented by
ni

learning from interactions with the environment in order to accomplish long-term


objectives seriously.
●● The interaction between a learning agent and its environment is defined by
U

a formal framework in reinforcement learning in terms of states, actions, and


rewards. This framework aims to offer a straightforward method of expressing key
aspects of the artificial intelligence challenge. A sense of cause and effect, a sense
of ambiguity and nondeterminism, and the presence of clear goals are some of
ity

these characteristics.
●● The main characteristics of the majority of the reinforcement learning techniques
that we discuss in this book are the ideas of value and value functions. We believe
that in order to effectively search the space of policies, value functions are crucial.
m

In contrast to evolutionary approaches that search directly in the policy space


while being led by scalar assessments of complete policies, reinforcement learning
methods use value functions.
)A

●● If an environment’s status signal concisely summarises the past without impairing


forecasting accuracy, it satisfies the Markov property. The state signal should be
chosen or built in such a way that the Markov property holds as nearly as feasible,
even if this is rarely exactly true but is frequently almost so. In this book, we use
(c

the assumption that this has already been accomplished and concentrate on the
challenge of decision-making: how to choose what to do based on the state signal
that is available. The environment is said to as a Markov decision process if the


Markov property holds (MDP). A MDP with finite state and action sets is referred
Notes

e
to as a finite MDP. Although the majority of the present theory of reinforcement
learning is limited to finite MDPs, the techniques and concepts are applicable more
broadly.

in
●● The expected return from each state, or state-action pair, is given to that state,
or state-action pair, by the value functions of a policy, presuming that the agent

nl
applies the policy. The maximum expected return that any policy may achieve
is what the optimal value functions assign to each state, or state-action pair.
An ideal policy is one whose value functions are optimal. There can be multiple
optimal policies, but only one set of optimal value functions for states and state-

O
action pairings for a given MDP. A policy must be optimum if it is greedy in relation
to the optimal value functions. The optimal value functions must fulfil an unique
consistency requirement known as the Bellman optimality equations, which can

ty
theoretically be solved for the optimal value functions and used to fairly easily
identify the ideal policy.
●● The reinforcement learning problem was used as an example to illustrate a novel
type of learning technique called temporal-difference (TD) learning. We separated

si
the total issue into a prediction problem and a control problem, as usual. For
resolving the prediction problem, TD approaches are substitutes for Monte Carlo
techniques. The generalised policy iteration (GPI) concept, which we abstracted
r
from dynamic programming, is used in both instances to extend the solution to
ve
the control problem. It is the notion that approximation policy and value functions
ought to interact in a way that leads to both of them moving in the direction of their
ideal values.
●● The prediction problem is how precisely one of the two processes that make
ni

up GPI drives the value function to forecast returns for the current policy. With
regard to the current value function, the other process pushes the policy to locally
improve (e.g., to be ε-greedy). A problem with maintaining sufficient exploration
U

emerges when the first process is based on experience. TD control methods can
be classified as on-policy or off-policy according to how they deal with this
difficulty. Q-learning is an off-policy method, while Sarsa is an on-policy method.
As we present it here, Expected Sarsa is likewise an off-policy method. There is a
third way to extend TD methods to control, known as actor-critic methods, which
we did not cover in this chapter.

actor-critic methods, which we did not cover in this chapter.
●● TD approaches are actually more broad than this, but up to this point, we have
only examined them in the context of reinforcement learning issues. They are
m

all-encompassing techniques for gaining the ability to forecast the future of


dynamical systems. TD techniques might be useful for forecasting, among other
things, financial data, life expectancy, election results, weather patterns, animal
)A

behaviour, demands on power plants, or consumer purchases. The theoretical


features of TD approaches weren’t fully known until they were examined as pure
prediction methods, irrespective of their usage in reinforcement learning. But
little research has been done on these additional possible uses for TD learning
techniques.
(c

●● We adhered to the generalised policy iteration (GPI) framework described


in Chapter 4 when creating Monte Carlo control methods. Policy evaluation


and improvement procedures interact as part of GPI. A different technique of


Notes

e
evaluating policies is offered by Monte Carlo methods. They simply average
numerous returns that begin in the state rather than using a model to determine
the value of each state. This average can serve as a good approximation to

in
the value as a state’s value is the expected return. Since they can be utilised
to improve the policy without requiring a model of the environment’s transition
dynamics, we are particularly interested in approximating action-value functions

nl
in control approaches. We are now familiar with the fundamental concepts and
dynamic programming algorithms as they apply to solving finite MDPs.
●● Monte Carlo approaches combine policy evaluation and policy enhancement

O
stages on an episode-by-episode basis and can be incrementally implemented
on an episode-by-episode basis. The computation of the value functions for
a specific policy, which is often iterative, is referred to as policy evaluation. The

ty
computation of an improved policy given the value function for that policy is
referred to as “policy improvement.” The two most well-known DP methods, policy
iteration and value iteration, are obtained by combining these two computations.
Given complete information of the MDP, either of these can be used to accurately

si
construct optimal policies and value functions for finite MDPs.
●● Traditional DP algorithms work by iteratively traversing the set of states, updating
each state as intended. According to the values of all potential successor states
r
and their probabilities of occurring, one state is updated with each such operation.
ve
Bellman equations and expected updates share many similarities; in essence,
expected updates are just Bellman equations written as assignment statements.
Convergence occurs to values that meet the related Bellman equation when the
updates no longer cause any value changes. There are four Bellman equations
ni

and four predicted updates for the same set of four primary value functions (vπ, v*,
qπ, and q*). Their backup graphics provide a clear understanding of how DP

Glossary

●● MDPs: Markov Decision Processes.


●● GPI: Generalised policy iteration

●● MPH: Markov Property Holds


●● TD: Temporal Difference
●● DP: Dynamic programming

●● Agent(): A thing with the capacity to see, investigate, and react to its surroundings.
●● Action(): The steps an agent takes inside the environment are referred to as
actions.

●● State(): Following each action done by the agent, the environment will return a
state.
●● Q-value(): Generally speaking, it is comparable to the value, except it adds a
current action parameter.

●● Reinforcement learning: We define a reinforcement learning method as any


approach that is effective at tackling a situation of this nature.


●● Environment: This is the situation in which an agent learns and completes tasks in
the future.
●● Policy: An agent’s ability to make decisions, in which the agent bases future

actions on the present situation.
●● Reward: Payment an agent receives from the environment for each action.
●● Value: When compared to the benefit, it represents the anticipated long-term

return at a reduced rate.
●● Value function: Indicates a state’s value, or the total number of returns.

●● Function approximator: Creating a function from training samples using a function
approximator.
●● Model of the environment: An environment model is a simulation of the real world

used to forecast conclusions.
●● Model based methods: Reward-based models are solved using model-based
approaches.

●● Markov decision process: A probabilistic model of the sequential decision problem
is the Markov decision process.
●● Dynamic programming: Dynamic programming is a category of techniques for
resolving sequential decision-making issues. r
ve
Check Your Understanding
1. Reinforcement learning is a:
a. Unsupervised learning

b. Supervised learning
c. Award based learning

d. None of the mentioned


2. Which of the following is an application of reinforcement learning?
a. Topic modelling

b. Recommendation system
c. Pattern recognition
d. Image recognition

3. Upper confidence bound is a:


a. Reinforcement algorithm

b. Supervised algorithm
c. Unsupervised algorithm
d. None of the mentioned
4. Which of the following is true about reinforcement learning?

a. The agent gets reward or penalty according to the action


b. It’s an online learning

c. The target of an agent is to maximize the rewards


d. All of the above
5. You have a task which is to show relative ads to target users. Which algorithm you

should use for this task?
a. Upper confidence bound

b. K means clustering
c. Naïve Bayes
d. Support vector machine

6. Hidden Markov model is used in:
a. Supervised learning

b. Unsupervised learning
c. Reinforcement learning
d. All of the mentioned

7. Which algorithm is used in robotics and industrial automation?
a. Thompson sampling
b. Naïve Bayes r
ve
c. Decision tree
d. All of the mentioned
8. Thompson sampling is a:

a. Probabilistic algorithm
b. Based on Bayes inference rule

c. Reinforcement learning algorithm


d. All of the above
9. Which of the following is false about Upper confidence bound?

a. It is a Deterministic algorithm
b. It does not allow delayed feedback
c. It is not based on Bayes inference

d. None of the mentioned


10. The multi-armed bandit problem is a generalized use case for:

a. Supervised learning
b. Reinforcement learning
c. Unsupervised learning
d. None of the mentioned

11. _ _ __ _is the machine learning algorithms that can be used with labeled data.
a. Regression algorithms


b. Clustering algorithms
c. Association algorithms
d. All of the above

12. Application of Machine learning is:
a. Email filtering

b. Sentimental analysis
c. Face recognition

d. All of the above
13. _ _ _ _ is a disadvantage of decision trees?
a. Decision trees are robust to outliers

b. Decision trees are prone to overfit
c. Both a and b

d. None of the mentioned
14. Machine learning is an application of:
a. Artificial intelligence
b. Blockchain
ve
c. Both a and b
d. None of the mentioned
15. The term machine learning was coined in which year?
ni

a. 1958
b. 1960
U

c. 1959
d. 1961

Exercise
1. What do you mean by reinforcement learning?
2. Define various elements in reinforcement learning.
3. Define the concept of Markov Property.

4. Explain Bellman’s Equation and the Optimal Value Function.


5. Define various Problem Solving Methods

6. Define dynamic programming.


7. Explain Monte Carlo Control.
8. What do you mean by temporal learning?

Learning Activities
1. Define various use cases reinforcement learning in real world.


Check Your Understanding - Answers


1. c
2. b

3. a
4. d

5. a
6. d

7. a
8. d
9. d

10. b
11. a

12. d
13. b
14. a
15. c
Further Readings and Bibliography:
1. Reinforcement Learning, Richard S. Sutton and Andrew G. Barto

2. Reinforcement Learning, Phil Winder Ph.D.


3. Deep Reinforcement Learning Hands-On: Apply Modern RL Methods to
Practical Problems of Chatbots, Robotics, Discrete Optimization, Web

Automation, and More, 2nd Edition, Maxim Lapan


4. Algorithms for Reinforcement Learning, Csaba Szepesvari
5. Foundations of Deep Reinforcement Learning: Theory and Practice in Python,

Laura Graesser and Wah Loon Keng


