Foundation of Machine Learning F-PMLFML02-WS
Foundations of Machine Learning
© Amity University Press
No parts of this publication may be reproduced, stored in a retrieval system or transmitted
in any form or by any means, electronic, mechanical, photocopying, recording or otherwise
without the prior permission of the publisher.
Advisory Committee
Chairman : Prof. Abhinash Kumar
Members : Prof. Arun Bisaria
Mr. Alok Awtans
Dr. Coral J Barboza
Dr. Monica Rose
Mr. Sachit Paliwal
Subject Matter Experts
Published by Amity University Press for exclusive use of Amity Directorate of Distance and Online Education,
Amity University, Noida-201313
Contents
Module - I: Supervised Learning: Regression
1.1 Introduction to Regression Concepts
1.1.1 Introduction to Linear Regression
1.1.2 Linear Regression: Important Terms
1.1.3 Linear Regression: Assumptions
1.1.4 Regression: Statistical View
1.1.5 Hands On: Loading libraries
1.1.6 Multiple Linear Regression
1.1.7 Linear Regression: Real-Life Events
1.1.8 Logistic Regression
1.1.9 Odds vs. Probability
1.1.10 Hypothesis Testing
1.1.11 Linear Regression: Advantages and Pitfalls
3.1.10 Decision Tree - II
3.1.11 Hands On: Decision Tree
3.1.12 Introduction to Support Vector Machines
3.1.13 SVM for Linearly Separable Problems
3.1.14 SVM for Non-Linearly Separable Problems
3.1.15 Hands On: Support Vector Machines
3.1.16 Introduction to Random Forest Classifier
3.1.17 Random Forest Technique and Construction
3.1.18 Summary
Module - IV: Unsupervised Learning
4.1 Introduction to Clustering Techniques
4.1.1 Introduction to Clustering
4.1.2 Evaluation of Clustering
4.1.3 Understanding Clustering Types
4.1.4 Hierarchical Clustering Part - I
4.1.5 Hierarchical Clustering Part - II
4.1.6 DBSCAN (Density Based Spatial Clustering of Applications with Noise)
4.1.7 Hands On: DBSCAN
4.2.6 Retail: Robot Sales Associate - Apparel Store
4.2.7 Retail: Loss Prevention
4.2.8 Telecom: Optimising Mobile Tower Operations
4.2.9 Prediction of Tornadoes using Time Series Forecasting
4.2.10 Facebook Face Detection
4.2.11 Manufacturing
4.2.12 Explainable AI and Google Maps
4.2.13 Uber Eats
4.2.14 Amazon ADS Recommendation and iPhone Face Detection
4.2.15 Use Cases
4.2.16 Google Translation
4.2.17 Tesla Driverless Car
4.2.18 Kuka Game and Email
Learning Objectives:
At the end of this module, you will be able to understand:
●● Statistical View of Regression
●● How to load libraries like pandas
●● Applications of Linear Regression in Real life
●● Concepts of Hypothesis Testing
Introduction
The technique of discovering correlations between dependent and independent
variables is known as regression. It aids in the prediction of continuous variables such
as market trends, house values, and so forth.
The Regression algorithm's goal is to identify the mapping function that will translate the input variable (x) to the continuous output variable (y).
Example: Let's say we want to forecast the weather, so we'll apply the Regression approach. When it comes to weather prediction, the model is trained on historical data, and once training is finished, it can predict the weather for future days.
Commonly used regression techniques include:

●● Polynomial Regression

●● Support Vector Regression

●● Decision Tree Regression
Linear regression is used to predict the value of a dependent variable (y) based on the value of an independent variable (x). As a result of this regression technique, a linear relationship between x and y is established.

[Figure: a best-fit regression line drawn through data points of job experience (X) against salary (Y)]

In the diagram above, X (input) represents job experience and Y (output) represents a person's wage. For our model, the regression line is the best fit line.
Linear regression is a type of supervised machine learning model that is commonly used for prediction. Supervised machine learning models are those that use training data to develop the model and then use a loss function to verify the model's correctness.

Linear regression is also one of the most well-known techniques for predictive modelling, including time series forecasting. As the name implies, it posits a linear relationship between a group of independent variables and the dependent variable (the variable of interest).
To approximate the given data, regression and log-linear models can be employed. In (basic) linear regression, the data are modelled to fit a straight line. For example, a random variable y (also known as a response variable) can be represented as a linear function of another random variable x (also known as a predictor variable) with the equation

y = wx + b

where x and y are numeric attributes in the context of data mining. The w and b coefficients (also known as regression coefficients) indicate the line's slope and y-intercept, respectively. The method of least squares, which minimises the error between the actual data points and the estimated line, can be used to solve for these coefficients.
Multiple linear regression is a variation of (basic) linear regression that lets you model a
response variable, y, as a linear function of two or more predictor variables.
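As a small illustration of the least-squares fit just described, NumPy can solve for w and b directly; this is a generic sketch with invented numbers, not an example from the original text:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])        # predictor variable
y = np.array([30.0, 35.0, 41.0, 44.0, 50.0])   # response variable

w, b = np.polyfit(x, y, deg=1)   # least-squares slope w and intercept b
print(f"y = {w:.2f}x + {b:.2f}")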
Log-linear models approximate discrete multidimensional probability distributions; they can be used to estimate a higher-dimensional data space from lower-dimensional data spaces. As a result, log-linear models are especially beneficial for dimensionality reduction (because the lower-dimensional points together often take up less space than the original data points) and for data smoothing (since aggregate estimates in the lower-dimensional space are less subject to sampling variations than the estimates in the higher-dimensional space).
On sparse data, both regression and log-linear models can be utilised, although their utility is limited. While both methods can deal with skewed data, regression does especially well with it. When dealing with high-dimensional data, regression can be computationally costly, whereas log-linear models scale well up to about 10 dimensions.
Simple Linear Regression
We want to know the relationship between a single independent variable, or input, and a matching dependent variable, or output, in simple linear regression.
Examples:

●● Predicting students' exam grades from the number of hours they study: exam grades are the dependent variable, while hours studied is the independent variable.
●● Crop yields can be predicted using rainfall data. Yield is a dependent variable,
while precipitation is an independent variable.
●● Predicting a person’s salary based on years of experience- Experience becomes
the independent variable, while Salary becomes the dependent variable.
Note, however, that a line that fits the data points well may not guarantee a cause-and-effect relationship between circumstances and outcomes.
Using a linear regression model, you may determine whether or not there is any significant relationship between the variables. The model summary (here produced by R's lm() function) has several parts:

Call

This is how we used a formulaic function call. We employed one response variable, mpg, as a function of one predictor variable, disp, both of which were called from the mtcars data frame in this case.
Residuals

The residuals in our model are a measure of the vertical distance between each data point and the fitted line. Here we have summary statistics for all of the vertical distances between all of our points and the fitted line. The better the fit, the lower this value is.
Coefficients

These are the estimated coefficients of our linear equation. In this example, our equation would be y = −0.04x + 29.59.
●● Std. Error: The Std. Error component of the coefficients table provides error estimates for those coefficients. In practice, our equation would look like this: y = (−0.04 ± 0.005)x + (29.59 ± 1.23).
●● t-value: This is the measurement of the difference in our data compared to the
variation. This value is related to p-values, but p-values are significantly more
commonly used.
●● p-value: Statistical significance evaluations are known as p-values. The math behind p-values is a little more technical, but for our purposes, a p-value of less than 0.05 indicates that the result is statistically significant; if the quantity in question has a p-value greater than 0.05, we should not claim statistical significance. The significance codes that follow explain the star ratings printed next to the p-values.
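The Call/Residuals/Coefficients output discussed above comes from R's lm() on the mtcars data. For readers following along in Python (the language used elsewhere in this module), statsmodels produces an analogous summary; the sketch below assumes a local CSV copy of the mtcars data:

import pandas as pd
import statsmodels.api as sm

mtcars = pd.read_csv("mtcars.csv")    # assumed local copy of the mtcars data
X = sm.add_constant(mtcars["disp"])   # add the intercept term
model = sm.OLS(mtcars["mpg"], X).fit()
print(model.summary())                # coefficients, std. errors, t-values and p-values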
Error
The best fit line yields predicted values for us. Error is defined as the difference between the predicted value and the actual observed value.
The mean absolute error is the sum of the absolute values of the errors, for example, divided by the number of data points. By summing absolute values, we prevent errors above and below the genuine values from cancelling out, and we obtain an overall error metric on which to evaluate a model. The error can be anywhere between 0 and ∞.
The most widely used error for assessing model performance is the mean squared error. The residual error (the difference between the predicted and true value) is squared, unlike in the absolute error. For mean squared error loss, the range is also 0 to ∞. Squaring the residual error offers the advantages of having positive error terms, emphasising greater errors over smaller errors, and being differentiable.
Huber Loss

In comparison to the squared error loss, the Huber loss is less susceptible to outliers in the data. It is also differentiable at 0. It is essentially quadratic when the error is small, and becomes absolute (linear) when the error is large. The size of error at which it switches from quadratic to linear is determined by a hyperparameter, delta (δ), which can be tuned.
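The three losses above can be written in a few lines of NumPy; this is a generic sketch of the standard formulas, not code from a particular library:

import numpy as np

def mae(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))    # mean absolute error, range 0 to infinity

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)     # mean squared error, emphasises large errors

def huber(y_true, y_pred, delta=1.0):
    r = y_true - y_pred
    quadratic = 0.5 * r ** 2                     # used where |r| <= delta
    linear = delta * (np.abs(r) - 0.5 * delta)   # used where |r| > delta
    return np.mean(np.where(np.abs(r) <= delta, quadratic, linear))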
R-Squared
It's also referred to as the coefficient of determination. R-squared values range from −∞ to 1, and they help us figure out whether our model can explain the variation in y through the variation in x. A model with an R-squared of 0 is equivalent to one that always forecasts the target variable's mean, whereas a model with an R-squared of 1 perfectly predicts the target variable.
Adjusted R-Squared

If we add additional predictors to our model, R-squared will rise or remain constant, so there is no way of knowing whether increasing the model's complexity makes it more accurate. The adjusted R-squared calculation is tweaked to account for the number of predictors in the model: the adjusted R-squared increases only if a new term improves model accuracy.
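Both quantities are simple to compute; a generic sketch of the standard formulas:

import numpy as np

def r_squared(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)            # residual sum of squares
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)   # total sum of squares
    return 1 - ss_res / ss_tot

def adjusted_r_squared(r2, n, p):
    # n = number of observations, p = number of predictors
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)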
Because simple linear regression is a parametric test, certain assumptions about the data must be made. The following are the assumptions made when running a simple linear regression:

●● Homogeneity of variance (homoscedasticity)- This basically indicates that the error magnitude does not change much as the independent variable's value varies.

●● Independence of observations- Nothing is hidden in the relationships between the observations, and only valid sampling procedures are employed during data gathering.

●● Normality- The data follows a normal distribution.
However, while running a linear regression, there is one more assumption that must be considered.

The line is always a straight line: During the execution of a linear regression, there is no curve or grouping factor; the variables (the dependent variable and the independent variable) are linked in a linear fashion. A nonparametric test may be employed if the data violates the assumptions of homoscedasticity or normality (take, for instance, the Spearman rank test).
The following is an example of data that does not match the assumptions: one might believe that the consumption of cured pork and the incidence of colorectal cancer in the United States are linked. However, it is later discovered that there is a significant range disparity in the data collected for both variables. A linear regression test cannot be used, since the homoscedasticity condition is violated. A Spearman rank test, on the other hand, can be used to determine the link between the variables.
Homoscedasticity

Explanation

The next assumption of linear regression is that the residuals have constant variance at every level of x. Homoscedasticity is the term for this. When this isn't the case, the residuals are said to exhibit heteroscedasticity.
When heteroscedasticity is present in a regression study, it becomes difficult to trust the results. In particular, heteroscedasticity increases the variance of the regression coefficient estimates, but the regression model ignores this. This increases the likelihood that a regression model will proclaim a term in the model to be statistically significant when it is not.
As the fitted values grow larger, the residuals become considerably more spread out, a telltale sign of heteroscedasticity.

[Figure: residuals plotted against fitted values, fanning out as the fitted values increase]
There are three common ways to fix heteroscedasticity:

1. Transform the Dependent Variable: Taking the log of the dependent variable is a typical transformation. For example, instead of predicting the number of flower shops in a city using population size (the independent variable), we could use population size to predict the log of the number of flower shops in a city. Heteroscedasticity is often eliminated by using the log of the dependent variable rather than the original dependent variable.
2. Redefine the Dependent Variable: Instead of using the raw value, a rate is a typical way to redefine the dependent variable. Instead of predicting the number of flower shops in a city based on population size, we might use population size to forecast the number of flower shops per capita. Because we're measuring the number of flower shops per person rather than the total number of flower shops, this decreases the fluctuation that naturally exists across larger populations.

3. Use Weighted Regression: Weighted regression assigns a weight to each data point based on the variance of its fitted value, giving smaller weights to data points whose responses have higher variances. With appropriate weights, this can eliminate heteroscedasticity.
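A sketch of the first fix on synthetic data, assuming statsmodels is available (the numbers are invented purely for illustration):

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
population = rng.uniform(1e4, 1e6, size=200)                       # city population sizes
shops = population * 1e-4 * rng.lognormal(sigma=0.5, size=200)     # spread grows with population

X = sm.add_constant(population)
raw_fit = sm.OLS(shops, X).fit()           # residuals fan out as population grows
log_fit = sm.OLS(np.log(shops), X).fit()   # residual spread is far more even after the log transform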
Normality

The next assumption of linear regression is that the residuals must be normally distributed.
There are two popular methods for determining whether or not this assumption is correct:
1. A Q-Q plot, or quantile-quantile plot, is a type of figure that we may use to see whether or not the residuals of a model follow a normal distribution. The normality assumption is satisfied if the points on the plot generally form a straight diagonal line.

An example of residuals that nearly follow a normal distribution is seen in the Q-Q plot below:
[Figure: Q-Q plot with points lying close to the diagonal line]
The Q-Q plot below, on the other hand, illustrates an example of when residuals
obviously deviate from a straight diagonal line, indicating that they do not follow a
normal distribution:
[Figure: Q-Q plot with points deviating markedly from the diagonal line]
2. You can also use formal statistical tests like Shapiro-Wilk, Kolmogorov-Smirnov, Jarque-Bera, or D'Agostino-Pearson to check the normality assumption. Keep in mind, however, that these tests are sensitive to large sample sizes; that is, when your sample size is large, they frequently conclude that the residuals are not normal. This is why it's typically easier to check this assumption with graphical approaches like a Q-Q plot.
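Both checks are available in Python; a minimal sketch (the residuals here are simulated stand-ins for the residuals of a fitted model):

import numpy as np
import scipy.stats as stats
import statsmodels.api as sm

residuals = np.random.default_rng(0).normal(size=100)   # stand-in for model residuals

sm.qqplot(residuals, line="45")       # 1. Q-Q plot against a normal distribution
stat, p = stats.shapiro(residuals)    # 2. Shapiro-Wilk test
print(p)                              # p < 0.05 suggests non-normal residuals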
What to do if this assumption is violated

You have a few choices if the normality assumption is violated:

◌◌ First, verify that any outliers are genuine values and not data input errors.

◌◌ The independent and/or dependent variables can then be subjected to a nonlinear transformation. Taking the log, square root, or reciprocal of the independent and/or dependent variable are common examples.
Independence

Explanation

The next assumption of linear regression is that the residuals must be independent. This is especially relevant when working with time series data: ideally, we don't want a pattern between consecutive residuals. For example, residuals should not increase in size over time.
Looking at a plot of residual autocorrelations is a simple approach to see if this condition is met. Ideally, the majority of the residual autocorrelations should fall within the 95 percent confidence bands around zero, which are placed at roughly ±2/√n, where n is the sample size. The Durbin-Watson test can be used to formally check whether this assumption is met.
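The Durbin-Watson statistic is available in statsmodels; a minimal sketch (again with simulated stand-in residuals):

import numpy as np
from statsmodels.stats.stattools import durbin_watson

residuals = np.random.default_rng(0).normal(size=100)   # stand-in for model residuals
print(durbin_watson(residuals))   # about 2 suggests no autocorrelation; values toward 0 or 4 are warning signs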
If this assumption is violated:

●● Consider using lags of the dependent and/or independent variables in the model for serial correlation.

●● Consider including seasonal dummy variables in the model for seasonal correlation.
Regression analysis provides an equation that describes the statistical relationship between one or more predictor variables and the response variable. After fitting a regression model with Minitab Statistical Software and verifying the fit with residual plots, you'll want to analyse the results.
The p-value for each term tests the null hypothesis that the coefficient is equal to zero (no effect). A low p-value (< 0.05) suggests that the null hypothesis can be rejected. In other words, because changes in the predictor's value are related to changes in the response variable, a predictor with a low p-value is likely to be a useful addition to your model.
A larger (insignificant) p-value, on the other hand, indicates that changes in the predictor are unrelated to changes in the response.

We can see that the predictor variables South and North are significant because their p-values are both 0.000 in the result below. East, on the other hand, has a p-value of 0.092, which is higher than the typical alpha level of 0.05, indicating that it is not statistically significant.
[Minitab output: coefficient p-values for East, South and North]
To choose which terms to keep in the regression model, you usually utilise the
coefficient p-values. We should consider deleting East from the model above.
The regression coefficient represents the mean change in the response variable for one unit of change in the predictor variable, while holding the other predictors in the model constant; this is why they're sometimes referred to as slope coefficients. I'll show you how this works in the fitted line plot below, where I model a person's weight using their height. First, the output of Minitab's session window:

[Minitab session window output: regression of weight on height]
The fitted line plot graphically depicts the same regression results.
[Figure: fitted line plot of weight versus height]
According to the equation, the coefficient for height in metres is 106.5 kilograms. This suggests that for every additional metre in height, weight will increase by 106.5 kilograms on average.
The same information is graphically represented by the blue fitted line. The fitted line rises or lowers by 106.5 kg if you move left or right along the x-axis by an amount that equals a one metre change in height. These heights, however, are from middle-school girls and range from 1.3 to 1.7 metres. Because the relationship is only true within this data range, we wouldn't move the line a full metre up or down in this situation.
If the slope coefficient were 0, the expected value for weight would not vary no matter how far up or down the fitted line you went. As a result, a low p-value indicates that the slope is not zero, implying that changes in the predictor variable are linked to changes in the response variable.
A fitted line plot was utilised because it brings the math to life. Fitted line plots,
on the other hand, can only show the results of simple regression, which has only one
predictor variable and one response. For multiple linear regression, the concepts stay
true, but I’d need an extra spatial dimension for each new predictor to depict the results.
Height is a linear effect in the example above; the slope is constant, indicating that the effect is consistent across the fitted line. The interpretation is a little less apparent if your model requires polynomial or interaction terms.
The following example employs a data set that necessitates the inclusion of a quadratic (squared) term to model curvature. The p-values for both the linear and quadratic terms are significant, as seen in the output below.

[Minitab output: significant linear and quadratic terms for machine setting]
We can proceed with the interpretation because the residual plots (not displayed) indicate a satisfactory fit. But how are these coefficients to be interpreted? Graphing them in a fitted line plot is quite helpful.
[Figure: fitted line plot of energy consumption versus machine setting, showing a curved relationship with a minimum near a setting of 20]
The link between machine setting and energy consumption varies depending on where you start on the fitted line, as you can see. For example, if you start with a machine setting of 12 and increase it by 1, you should see a decrease in energy use. If you start at 25, an increase of one should result in an increase in energy use. And if you start at a setting near 20, energy consumption shouldn't change substantially.
Because the effect of modifying the predictor varies depending on its value, a significant polynomial term can make the interpretation less clear. A significant interaction term, on the other hand, suggests that the effect of one predictor varies depending on the value of another.
When interpreting a regression model that contains these terms, be extra cautious. You can't simply look at the main effect (linear term) to figure out what's going on! Unfortunately, you won't be able to utilise a fitted line plot to graphically describe the results if you're using multiple regression analysis. This is when subject-matter knowledge becomes especially important.
Pandas
Pandas is a free and open-source Python toolkit created by software engineer Wes
McKinney in 2008. It offers high-level data structures and programming tools for doing
data analysis and manipulation on a variety of data kinds, including numbers, text, and
dates.
Adding pandas to the workspace
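In a Python session, pandas is conventionally imported as follows (the alias pd is the usual convention):

import pandas as pd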
Importing specific functionality
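Specific names used later in this section, such as Series and DataFrame (and NumPy, used in the examples below), can be imported directly:

from pandas import Series, DataFrame
import numpy as np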
Pandas Data Structures
Series
A Series is an array-like data structure whose labels (the index) start from 0 by default and keep incrementing. For example:
In [2]: obj = Series([1, 2, 3, 4, 5, 6])

In [3]: obj
Out[3]:
0 1
1 2
U
2 3
3 4
4 5
ity
5 6
dtype: int64
In [4]: obj = Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])

In [5]: obj
Out[5]:
d 4
b 7
a -5
c 3
dtype: int64
◌◌ Convert a dictionary to a Series: s = Series({'age': 23, 'name': 'anuradha'})

◌◌ Direct index assignment: obj.index = ['attr 1', 'attr 2', 'attr 3']

◌◌ A Series can have a name attribute: obj.name = 'Population'
DataFrame

A DataFrame is a tabular data structure, similar to a spreadsheet, that can hold vast amounts of data.
In [1]: data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'], 'year': [2000, 2001, 2002, 2001, 2002], 'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}

In [2]: data
Out[2]:
{'pop': [1.5, 1.7, 3.6, 2.4, 2.9],
 'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
 'year': [2000, 2001, 2002, 2001, 2002]}

In [3]: frame = DataFrame(data)

In [4]: frame
Out[4]:
   pop   state  year
0  1.5    Ohio  2000
1  1.7    Ohio  2001
2  3.6    Ohio  2002
3  2.4  Nevada  2001
4  2.9  Nevada  2002
Filtering

In [4]: data = DataFrame(np.arange(16).reshape(4, 4), index=['Ohio', 'Colorado', 'Utah', 'New York'], columns=['one', 'two', 'three', 'four'])

In [5]: data
Out[5]:
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15

We use np.arange(16) to create an array of elements from 0 to 15, and use reshape(4, 4) to make it a 4x4 matrix: just a set of dummy data.
Boolean comparisons

In [6]: data > 5
Out[6]:
            one    two  three   four
Ohio      False  False  False  False
Colorado  False  False   True   True
Utah       True   True   True   True
New York   True   True   True   True
Substitutions

We select values with the query data < 5 and assign -1 to replace them:

In [7]: data[data < 5] = -1

In [8]: data
Out[8]:
          one  two  three  four
Ohio       -1   -1     -1    -1
Colorado   -1    5      6     7
Utah        8    9     10    11
New York   12   13     14    15
Most of the time, the data we receive for data mining is not clean or pure: it has missing figures and figures that are inconsistent. As a result, we must either filter them out or use a replacement figure.
Creating a sample array with missing values (NaN, or Not a Number, values):
In [9]: a = np.array([1,2,3,np.nan,4])
In [10]: a
Out[10]: array([ 1., 2., 3., nan, 4.])
In [11]: df = DataFrame(a)
In [12]: df
Out[12]:
   0
0  1.0
1  2.0
2  3.0
3  NaN
4  4.0
Once we have the data frame with missing values/NaNs, we can fill them in the following manner. Here I'm filling with -1.

In [13]: df.fillna(-1)
Out[13]:
   0
0  1.0
1  2.0
2  3.0
3 -1.0
4  4.0
Different fill values can be used in different columns, for instance 0.5 in the first column and -1 in the third column. The numbering of the columns begins at 0.
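A sketch of what such a call looks like, assuming the default integer column labels:

df.fillna({0: 0.5, 2: -1})   # fill NaNs with 0.5 in column 0 and with -1 in column 2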
Smarter approaches can also be used, such as filling with the mean, which might make more sense in some cases:

data.fillna(data.mean())
Scikit Learn
Sklearn is the Swiss Army knife of data science libraries. It's an essential weapon in your data science arsenal for clearing apparently insurmountable obstacles. To put it another way, it's utilised to create machine learning models, and it offers many useful functions for machine learning and statistical modelling.
Components of Scikit-learn:
Scikit-learn has a lot of useful features. To help you grasp the spread, here are a
few examples:
●● Supervised learning algorithms: Think of any supervised machine learning method you've heard of, and there's a good chance it's included in scikit-learn. From generalised linear models (such as linear regression) to Support Vector Machines (SVM), Decision Trees, and Bayesian approaches, the scikit-learn toolbox has it all. The widespread availability of machine learning methods is one of the main reasons for scikit-learn's popularity. I began using scikit to handle supervised learning challenges, and I would recommend it to anyone who is new to scikit or machine learning.
●● Cross-validation: Using sklearn, you can assess the correctness of supervised models on unseen data in a variety of ways.
●● Unsupervised learning techniques: The offering includes a wide range of machine learning algorithms, including clustering, factor analysis, principal component analysis, and unsupervised neural networks.
●● Several toy datasets: While learning scikit-learn, this came in handy. I learned SAS while working with a variety of academic datasets (e.g. the IRIS dataset and the Boston house prices dataset). Having them on hand while studying a new library was quite beneficial.
●● Feature extraction: Scikit-learn allows you to extract features from images and text (e.g. bag of words).
Quick Example:

Let me provide an example to demonstrate the use of sklearn now that you have a high-level understanding of the ecosystem. The goal is to demonstrate how easy it is to use sklearn. In one of the next articles, we'll look at several algorithms and the best ways to employ them.
from sklearn import datasets, metrics
from sklearn.tree import DecisionTreeClassifier

dataset = datasets.load_iris()           # load the iris dataset

model = DecisionTreeClassifier()         # any sklearn classifier would work here
model.fit(dataset.data, dataset.target)  # fit the model on the data

expected = dataset.target
predicted = model.predict(dataset.data)

# print classification report and confusion matrix
print(metrics.classification_report(expected, predicted))
print(metrics.confusion_matrix(expected, predicted))
Multiple Linear Regression

Multiple linear regression (MLR), often known as multiple regression, is a statistical technique that predicts the result of a response variable by combining numerous explanatory variables. Multiple linear regression attempts to represent the linear relationship between explanatory (independent) and response (dependent) variables. Because it incorporates more than one explanatory variable, multiple regression is essentially an extension of ordinary least-squares (OLS) regression.
Key points:

●● Multiple linear regression (MLR) is a statistical technique that predicts the result of a response variable by combining numerous explanatory variables.

●● Multiple regression is an extension of linear (OLS) regression, which uses just one explanatory variable.

●● MLR is widely utilised in econometrics and financial analysis.

The formula for a multiple linear regression is:

yi = β0 + β1xi1 + β2xi2 + ... + βpxip + ϵ

where:
yi = dependent variable
xi = explanatory variables
β0 = y-intercept (constant term)
βp = slope coefficients for each explanatory variable
ϵ = the model's error term (residuals)
Simple linear regression allows an analyst to generate predictions about one variable based on data about another variable. Only two continuous variables, an independent variable and a dependent variable, can be utilised in linear regression; the independent variable is the parameter that is utilised to calculate the dependent variable. Multiple linear regression rests on the following assumptions:

◌◌ There is a linear relationship between the dependent variable and the independent variables.

◌◌ The independent variables are not too highly correlated with one another.

◌◌ Residuals should be normally distributed with a mean of 0 and variance σ.
The coefficient of determination (R-squared) is a statistical tool for determining how much of the variation in the outcome can be explained by the variation in the independent variables. R2 always rises when additional predictors are added to the MLR model, even if the predictors are unrelated to the outcome variable.

As a result, R2 alone cannot be used to determine which predictors should be included in and which should be eliminated from a model. R2 can only be between 0 and 1, with 0 indicating that none of the independent variables can predict the outcome and 1 indicating that the independent variables predict the outcome without error.
While holding all other variables constant ("all else equal"), beta coefficients are appropriate when evaluating the results of multiple regression. A multiple regression's outcome might be shown horizontally as an equation or vertically as a table.
For instance, an analyst would be interested in seeing how market movement influences the price of ExxonMobil (XOM). The value of the S&P 500 index will be the independent variable, or predictor, in this case, while the price of XOM will be the dependent variable.
In reality, a variety of elements influence an event's result. ExxonMobil's stock price, for example, is influenced by more than just the market's performance. Other factors that influence the price of XOM and the stock prices of other oil firms include the price of oil, interest rates, and the price movement of oil futures. Multiple linear regression is used to understand a relationship involving more than two variables: it uses several independent variables to explain or predict the dependent variable. The model generates a straight-line (linear) relationship that best approximates all of the individual data points.
Each slope coefficient measures the change in the dependent variable when the corresponding explanatory variable changes, for example the change in XOM price when oil prices (xi2) change. Statistical software is used to calculate the least-squares estimates B0, B1, B2, ..., Bp. In a regression model with many variables, each independent variable is distinguished by a subscript: 1, 2, 3, 4, ..., p. The multiple regression model enables an analyst to forecast an outcome using data from several explanatory factors.
Even so, because each data point can diverge somewhat from the model's projected outcome, the model is not always flawless. To account for such minor deviations, the residual value, E, which is the difference between the actual and anticipated outcomes, is incorporated in the model.

If we run our XOM price regression model through a statistics computing programme, we get the following results:
XOM Price = 75 - 1.5 interest rates + 7.8 oil price + 3.2 S&P 500 + 5.7 oil futures
R-Sq = 86.5%
If all other variables remain constant, an analyst would interpret this result to suggest that if the price of oil in the markets rises by 1%, the price of XOM will rise by 7.8%. The model also predicts that a 1% increase in interest rates will cause the price of XOM to drop by 1.5 percent. Changes in the interest rate, oil price, oil futures, and the S&P 500 index account for 86.5 percent of the volatility in ExxonMobil's stock price, according to R2.
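A sketch of how such a model might be fitted in Python with statsmodels; the file name and column names are hypothetical placeholders, not a real data set:

import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("xom_factors.csv")   # hypothetical data with the four predictors and XOM price
X = sm.add_constant(df[["interest_rate", "oil_price", "sp500", "oil_futures"]])
model = sm.OLS(df["xom_price"], X).fit()
print(model.params)      # least-squares estimates B0, B1, ..., Bp
print(model.rsquared)    # fraction of variance explained (the R-Sq above)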
Multiple regression uses more than one explanatory variable to explain a dependent variable. There are two types of multiple regression: linear and nonlinear. Linear multiple regression assumes that the dependent and independent variables have a linear relationship; it also presupposes that the independent variables have no significant correlation with one another. Nonlinear multiple regression, which models the dependent variable with the help of many independent variables through a nonlinear relationship, likewise assumes that the independent variables have no significant relationships with each other.
Simple linear regression is the most basic type of linear regression, and it's used to quantify the relationship between one predictor variable and one response variable. We can use multiple linear regression to quantify the link between numerous predictor variables and a response variable if we have more than one predictor variable.
Linear Regression Real Life Example #1

Businesses frequently utilise linear regression to quantify the relationship between how much money they spend on advertising and how much money they make.

They may run a basic linear regression model with advertising spending as the predictor variable and revenue as the response variable, for example. The regression model would look something like this:

revenue = β0 + β1(ad spending)

The coefficient β0 represents the expected total revenue when ad spending is zero. The coefficient β1 represents the average change in total revenue when ad spending is increased by one unit (e.g. one dollar).

If β1 is close to 0, advertising spending has little impact on revenue.
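As a sketch with made-up numbers, the same model can be fitted with scikit-learn (introduced earlier):

import numpy as np
from sklearn.linear_model import LinearRegression

ad_spending = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])   # hypothetical figures
revenue = np.array([12.0, 15.0, 19.0, 22.0, 24.0])            # hypothetical figures

reg = LinearRegression().fit(ad_spending, revenue)
print(reg.intercept_, reg.coef_[0])   # estimates of β0 and β1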
Linear Regression Real Life Example #2

Medical researchers might, for example, give patients different doses of a medicine and see how their blood pressure reacts. They might use dosage as the predictor variable and blood pressure as the response variable in a simple linear regression model. The regression model would look something like this:

blood pressure = β0 + β1(dosage)

The coefficient β0 represents the predicted blood pressure when the dosage is zero. The coefficient β1 represents the average change in blood pressure when the dosage is increased by one unit.
If β1 is close to 0, changes in dosage have little effect on blood pressure. If β1 is positive, an increase in dosage is linked to an increase in blood pressure. Researchers may opt to adjust the dosage provided to a patient based on the value of β1.
Linear Regression Real Life Example #3

The effect of fertiliser and water on crop yields is frequently measured using linear regression by agricultural scientists.

For instance, scientists may experiment with varied amounts of fertiliser and water on different fields to examine how crop yield is affected. They might use fertiliser and water as predictor variables and crop yield as the response variable in a multiple linear regression model. The regression model would look something like this:
crop yield = β0 + β1(amount of fertilizer) + β2(amount of water)

With no fertiliser or water, the coefficient β0 represents the projected crop yield.

If the amount of water is kept constant, the coefficient β1 represents the average change in crop yield when fertiliser is increased by one unit. If the amount of fertiliser is kept constant, the coefficient β2 represents the average change in crop yield when water is increased by one unit.

Scientists may adjust the amount of fertiliser and water utilised based on the values of β1 and β2 in order to maximise crop yield.
Linear Regression Real Life Example #4

Professional sports clubs' data scientists frequently utilise linear regression to assess the impact of various training regimens on player performance.

Data scientists in the NBA, for example, might look at how varied amounts of weekly yoga and weightlifting sessions affect a player's point total. They might use yoga and weightlifting sessions as predictor variables and total points scored as the response variable in a multiple linear regression model. The regression model would look something like this:

points scored = β0 + β1(yoga sessions) + β2(weightlifting sessions)

The coefficient β0 represents the predicted points for a player who does not participate in any yoga or weightlifting workouts. The coefficient β1 represents the average change in points scored when weekly yoga sessions are increased by one, and β2 the average change in points scored when weekly weightlifting sessions are increased by one.

In order to optimise points scored, data scientists may recommend that a player participate in more or fewer weekly yoga and weightlifting sessions, depending on the values of β1 and β2.
Logistic Regression

Logistic regression is used when the output variable is categorical. As an example:

◌◌ To predict whether an email is spam (1) or not (0)

◌◌ Whether a tumor is malignant (1) or not (0)

Consider the situation where we need to determine whether or not a tumor is malignant. If we utilise linear regression to solve this problem, we'll need to select a threshold by which we can classify the data. If the actual class is malignant, the predicted continuous value is 0.4, and the threshold value is 0.5, the data point will be classed as non-malignant, potentially resulting in serious consequences in real time.

It is clear from this example that linear regression is unsuitable for classification problems. Because the output of linear regression is unbounded, logistic regression enters the picture: its predicted values are restricted to the range 0 to 1.
The logistic regression model was developed in order to describe the posterior probabilities of the K classes using linear functions in x, while ensuring that they sum to one and remain in the range [0, 1]. The model has the form

log[Pr(G = k|X = x) / Pr(G = K|X = x)] = βk0 + βk^T x,  for k = 1, ..., K − 1.

The model is specified in terms of K − 1 log-odds or logit transformations (reflecting the constraint that the probabilities sum to one). Although the last class is used as the denominator in the odds ratios, the choice of denominator is arbitrary. A simple calculation gives

Pr(G = k|X = x) = exp(βk0 + βk^T x) / (1 + Σl exp(βl0 + βl^T x)),  k = 1, ..., K − 1,

Pr(G = K|X = x) = 1 / (1 + Σl exp(βl0 + βl^T x)),

where the sums run over l = 1, ..., K − 1, and they clearly sum to one. To emphasise the dependence on the complete parameter set θ = {β10, β1^T, ..., β(K−1)0, β(K−1)^T}, we denote the probabilities Pr(G = k|X = x) = pk(x; θ).
Because there is just one linear function for K = 2, this model is very simple. It’s
often utilised in biostatistical applications with a lot of binary responses (two classes).
Patients may live or die, have cardiac disease or not, or have a condition present or not.
Logistic regression models are usually fit by maximum likelihood. For N observations the log-likelihood is

ℓ(θ) = Σi log p_gi(xi; θ),

where pk(xi; θ) = Pr(G = k|X = xi; θ).
We discuss the two-class case in detail, since the algorithms simplify considerably. It is convenient to code the two-class gi via a 0/1 response yi, where yi = 1 when gi = 1, and yi = 0 when gi = 2. Let p1(x; θ) = p(x; θ), and p2(x; θ) = 1 − p(x; θ). The log-likelihood can be written

ℓ(β) = Σi { yi β^T xi − log(1 + exp(β^T xi)) }.

Here β = {β10, β1}, and to accommodate the intercept we assume that the vector of inputs xi includes the constant term 1. We set the derivatives of the log-likelihood to zero to maximise it. These are the score equations:

∂ℓ(β)/∂β = Σi xi (yi − p(xi; β)) = 0.
These are p + 1 nonlinear equations in β. Because the first component of xi equals 1, the first score equation specifies that Σi yi = Σi p(xi; β): the expected number of class ones matches the actual number (and hence also for class twos).
The score and Hessian are best written in matrix notation. Let y stand for the vector of yi values, X for the N × (p + 1) matrix of xi values, p for the vector of fitted probabilities with ith element p(xi; β_old), and W for an N × N diagonal matrix of weights with ith diagonal element p(xi; β_old)(1 − p(xi; β_old)). A Newton step can then be written

β_new = β_old + (X^T W X)^{−1} X^T (y − p)
      = (X^T W X)^{−1} X^T W (X β_old + W^{−1}(y − p))
      = (X^T W X)^{−1} X^T W z.

In the second and third lines we have re-expressed the Newton step as a weighted least squares step, with the response

z = X β_old + W^{−1}(y − p),

sometimes known as the adjusted response. These equations must be solved repeatedly, because p, and hence W and z, change with each iteration. Because each iteration solves a weighted least squares problem, this algorithm is known as iteratively reweighted least squares, or IRLS.
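A compact numerical sketch of IRLS for the two-class case using NumPy; the fixed iteration count is an arbitrary choice, and the step-size halving mentioned below is omitted for brevity:

import numpy as np

def fit_logistic_irls(X, y, n_iter=25):
    # X: N x p input matrix; y: length-N vector of 0/1 responses
    N = X.shape[0]
    Xb = np.hstack([np.ones((N, 1)), X])      # include the constant term 1 for the intercept
    beta = np.zeros(Xb.shape[1])              # beta = 0 as the starting value
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-Xb @ beta))  # fitted probabilities p(x_i; beta_old)
        W = p * (1.0 - p)                     # diagonal of the weight matrix
        z = Xb @ beta + (y - p) / W           # the adjusted response
        # weighted least squares step: beta_new = (X^T W X)^{-1} X^T W z
        beta = np.linalg.solve(Xb.T @ (W[:, None] * Xb), Xb.T @ (W * z))
    return beta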
Although convergence is never guaranteed, β = 0 appears to be an acceptable starting value for the iterative procedure. Because the log-likelihood is concave, the procedure usually converges, but overshooting can occur. In the rare circumstances where the log-likelihood decreases, step size halving will ensure convergence.
For the multiclass case (K ≥ 3), the Newton technique may also be described as an iteratively reweighted least squares algorithm, but with a vector of K − 1 responses and a nondiagonal weight matrix per observation. The latter precludes any simplified procedures, so in this scenario it is numerically more convenient to work directly with the expanded parameter vector θ.
The purpose of using logistic regression models for data analysis and inference is to understand the role of the input variables in explaining the outcome. In the search for a parsimonious model involving a subset of the variables, perhaps with some interaction terms, numerous models are typically fitted. Some of the issues are illustrated in the following example.
Table: Results from a logistic regression fit to the South African heart disease
data
The data come from a study whose aim was to establish the intensity of risk factors for ischemic heart disease in that high-incidence area. The response variable is the presence or absence of myocardial infarction (MI) at the time of the survey, and the data reflect white males aged 15 to 64 (the overall prevalence of MI was 5.1 percent in this region). The data set contains 160 cases and a sample of 302 controls. Hastie and Tibshirani (1987) go into greater detail about these findings.
We used maximum likelihood to fit this model, yielding the results reported in
Table 1.
The Z scores for each of the coefficients in the model (coefficients divided by standard errors) are shown in this summary; a nonsignificant Z score indicates that a coefficient can be removed from the model. Each of these is formally equivalent to a test of the null hypothesis that the coefficient in question is zero, while all the others are not (also known as the Wald test). A Z score larger than approximately 2 in absolute value is significant at the 5% level.
This table of coefficients has a few surprises that should be interpreted with caution. Systolic blood pressure (sbp) is insignificant! Obesity isn't significant either, and it has a negative sign. The correlation between the set of predictors is the source of this perplexity. On their own, both sbp and obesity are significant, and both have a positive sign. In the presence of several other correlated factors, however, they are no longer needed (and can even acquire a negative sign). At this point the analyst may undertake some model selection, identifying a subset of the variables that are sufficient for explaining their combined effect on the prevalence of chd. One approach is to drop the least significant coefficient and re-fit the model. This is repeated until there are no more terms that can be removed from the model.
Figure: A scatterplot matrix of the South African heart disease data. Each plot depicts a pair of risk factors, with cases and controls coloured differently (blue is a case). Family history of heart disease (famhist) is a binary variable (yes or no).
Table: Results from stepwise logistic regression fit to South African heart disease
data.
            Coefficient   Std. Error   Z Score
(Intercept)      -4.204        0.498     -8.45
tobacco           0.081        0.026      3.16
ldl               0.168        0.054      3.09
famhist           0.924        0.223      4.14
age               0.044        0.010      4.52
A better but more time-consuming technique is to refit each of the models with one variable removed and then perform an analysis of deviance to determine which variable to eliminate. A fitted model's residual deviance is minus twice its log-likelihood, and the deviance between two models is the difference between their residual deviances (in analogy to sums of squares). This technique produced the same final model as the previous one.
For example, how should a coefficient of 0.081 (Std. Error = 0.026) for tobacco be interpreted? Tobacco use is measured in kilograms over a lifetime, with a median of 1.0 kg for controls and 4.1 kg for cases. Thus, a 1 kg increase in lifetime tobacco use leads to an 8.4% increase in the odds of coronary heart disease (exp(0.081) = 1.084). Taking the standard error into account, we get an approximate 95 percent confidence interval of exp(0.081 ± 2 × 0.026) = (1.03, 1.14).
Odds vs. Probability

The odds are calculated by dividing the probability of an event occurring by the probability of it not occurring. The odds of an event are the ratio of (probability that the event will occur) / (probability that the event will not occur). So, if the likelihood of an event occurring is 0.80, the odds are 0.80 / (1 − 0.80) = 0.80/0.20 = 4 in this case (i.e., 4 to 1).
●● If a race horse competes in 100 races and wins 25 of them while losing 75, the probability of winning is 25/100 = 0.25 or 25%, but the odds of winning are 25/75 = 0.333, or 1 win to 3 losses.

●● If the horse runs 100 races and wins 5 and loses the other 95 times, the probability of winning is 0.05 or 5%, and the odds of the horse winning are 5/95 = 0.0526.

●● If the horse runs 100 races and wins 50, the probability of winning is 50/100 = 0.50 or 50%, and the odds of winning are 50/50 = 1 (even odds).

●● If a horse runs 100 races and wins 80, the probability of winning is 80/100 = 0.80, or 80%, and the odds of winning are 80/20 = 4 to 1.
We can't compute the relative risk in a case-control design, since we can't quantify the likelihood of disease in each of the exposure groups. We may, however, compute the odds of disease in each of the exposure groups and compare them using the odds ratio.

Notice how close the odds ratio in the hypothetical pesticide study is to the RR that would have been obtained if the full source population had been studied. The reason for this is that if the outcome being examined is relatively rare, the odds of disease in an exposure group will be similar to the probability of disease in that exposure group. As a result, for case-control studies, the odds ratio provides a relative measure of effect, as well as an estimate of the risk ratio in the source population, provided that the outcome of interest is rare.
We underlined that the odds ratio is the only measure of association that can be derived in case-control research. In cohort studies, however, which evaluate the incidence of an outcome by following exposure groups, both a risk ratio and an odds ratio can be computed. With the usual 2×2 table notation (a and b the diseased and non-diseased counts among the exposed, c and d among the unexposed), the odds ratio is computed by taking the ratio of odds, where the odds in each group are computed as follows:

OR = (a/b) / (c/d)
In the same way that the risk in the unexposed group is placed in the denominator of a risk ratio, the convention is to place the odds of the unexposed group in the denominator of the odds ratio. Odds ratios, like risk ratios, do not have a normal distribution, so we employ the log transformation to make them more normal. As a result, computing the confidence interval for an odds ratio is a two-step procedure:
1. Using the equation above, calculate the confidence interval for Ln(OR).

2. Calculate the confidence interval for OR by taking the antilog of the results in step 1, i.e. (exp(Lower Limit), exp(Upper Limit)).
The null value of the confidence interval for the odds ratio is one, which means there is no difference. If the 95 percent confidence interval for the odds ratio does not include one, the odds are said to be statistically significantly different. We revisit the prior examples and generate odds ratio estimates, which we compare to our estimates of risk differences and relative risks.
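The two-step procedure translates directly into code; a generic sketch for a 2×2 table, where the cell counts a, b (exposed group) and c, d (unexposed group) are placeholders:

import numpy as np

def odds_ratio_ci(a, b, c, d, z=1.96):
    # OR = (a/b) / (c/d); the CI is computed on the log scale, then antilogged
    or_ = (a / b) / (c / d)
    se_ln_or = np.sqrt(1/a + 1/b + 1/c + 1/d)   # std. error of Ln(OR)
    lower = np.exp(np.log(or_) - z * se_ln_or)
    upper = np.exp(np.log(or_) + z * se_ln_or)
    return or_, (lower, upper)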
Comparison Chart
Basis for Comparison   Odds                                       Probability
Meaning                The chances in favour of the event         The likelihood of occurrence
                       to the chances against it.                 of an event.
Expressed in           Ratio                                      Percent or decimal
Lies between           0 to ∞                                     0 to 1
Formula                Occurrence/Non-occurrence                  Occurrence/Whole
Definition of Odds
The term odds is defined in mathematics as the ratio of the number of favourable events to the number of unfavourable events. While odds for an event show the possibility of the event occurring, odds against reflect the likelihood of the event not occurring. In more technical terms, odds refers to the likelihood of a specific event occurring relative to the likelihood of it not occurring.

Odds can range from zero to infinity: if the odds are 0, the event cannot occur, and the larger the odds, the more likely the event is to occur.
As an example, assume there are 20 marbles in a bag: eight red, six blue, and six yellow. If one marble is chosen at random, the odds of getting a red marble are 8/12, or 2:3.
Definition of Probability
Probability is a mathematical term that deals with the possibility of a specific event occurring. It is the foundation for hypothesis testing theory and estimation theory. It's calculated as the ratio of the number of favourable events to the total number of events.
For example, suppose a dartboard is divided into 12 parts, one for each of the 12 zodiac signs. If a dart is thrown, the probability of hitting Aries is 1/12, because the favourable event is 1 (Aries) and the total number of events is 12; this may be denoted as 0.08 or 8%.
Key Differences Between Odds and Probability
The following points describe the distinctions between odds and probability:
1. The term 'odds' is used to explain whether or not there are any chances of an event occurring. Probability, on the other hand, establishes the possibility of an event occurring, i.e. how often the event will occur.
2. Unlike the odds, which are expressed as a ratio, the probability is expressed as a
percentage or a decimal.
3. Odds typically run from zero to infinity, with zero denoting the impossibility of an event occurring and infinity denoting the certainty of an event occurring. Probability, on the other hand, ranges from zero to one. As a result, the closer the probability is to zero, the greater the likelihood of the event not occurring, and the closer it is to one, the greater the likelihood of it occurring.
4. The ratio of favourable events to unfavourable events is known as the odds. The probability may be computed by dividing the number of favourable occurrences by the total number of events.
Odds are a part of probability, which is a subject of mathematics. With the use of
odds or probability, one can calculate chance. The probability is the ratio of occurrence
to the whole, whereas the odds are the ratio of occurrence to non-occurrence.
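Converting between the two measures is straightforward; a small sketch of the two formulas above:

def odds_from_probability(p):
    # odds = occurrence / non-occurrence
    return p / (1 - p)

def probability_from_odds(odds):
    # probability = occurrence / whole
    return odds / (1 + odds)

For example, odds_from_probability(0.80) returns 4.0 (4 to 1), matching the earlier example.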
Hypothesis Testing
The basic concept of hypothesis testing is illustrated in the figure below. In this case, the population comprises all ducks, and the data sample is one individual duck picked at random from the whole population. The term 'randomly drawn' means that every duck has the same probability of being selected. A measurement process then maps the sampled duck to a number corresponding, for example, to the duck's body weight, beak size, body size, or hair length. This mapping is known as a test statistic in statistics.
Figure: Intuitive example explaining the basic idea underlying a one-sample hypothesis test.
A 'hypothesis' is, of course, an important part of hypothesis testing. The hypothesis is a quantitative assertion we make regarding the test statistic's population value. In our case, it could be regarding one of a duck's attributes, such as body size. We can formulate the following hypothesis: the average body size is 20 cm. Such a hypothesis is denoted the null hypothesis H0.
Assume we now have a population of ducks with a mean body size of 20 cm, subject to natural fluctuations. Because the population consists of (infinitely) many ducks and we obtain such a quantification for each, we get a probability distribution for the mean body size, known as the sampling distribution. It's vital to remember that this population is a fictitious one that follows our null hypothesis. In other words, the null hypothesis completely defines the population.
We can compare the numerical value of the test statistic, which represents the data sample, with the sampling distribution, which represents the population, to evaluate the null hypothesis that we have formulated. From this comparison we get another numerical value, called the p-value, which quantifies how typical or atypical the observed arrangement is if the null hypothesis is correct. Finally, a decision is made based on the p-value.
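The duck example corresponds to a one-sample t-test. A minimal sketch using SciPy, with simulated body sizes standing in for real measurements (the numbers are made up):

import numpy as np
from scipy import stats

body_sizes = np.random.default_rng(0).normal(loc=20.5, scale=2.0, size=30)  # simulated sample
t_stat, p_value = stats.ttest_1samp(body_sizes, popmean=20.0)  # H0: mean body size = 20 cm
print(p_value)  # a small p-value argues against H0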
ity
hypothesis test being conducted. The components are summarised in the diagram
below. When completing a hypothesis test, we listed these components in the order
in which they enter the process. As a result, they can be thought of as steps in a
hypothesis test. Because they are interrelated, the logical sequence in which they are
presented is critical. Overall, a hypothesis test is a technique that must be carried out.
Figure: Main components that are common to all hypothesis tests.
Parametric & Non-parametric Tests
The Student’s t-test is an example of a parametric test that uses data expressed in
absolute numbers or values rather than ranks.
The parametric statistical test only works in certain circumstances. Because these conditions aren't usually evaluated, they're assumed to be true. The validity of the assumptions determines the significance of a parametric test's results. The proper interpretation of a parametric test based on the normal distribution also presupposes that the scores being examined are the product of at least an interval-scale measurement.
Let us try to grasp the meaning of the term population. The term "population" refers to the total group of people that a researcher is attempting to understand in relation to an event. The study is usually carried out on a sample of the population in question, with the results being applied to the wider group from which the sample was drawn.
T-tests: A t-test is used to see if two groups' scores on a single variable differ, i.e. whether there are any differences in mean scores. For example, you could use a t-test to see if students in two courses have different writing abilities. It is worth noting that parametric tests, such as the t-test and F-test, are considered to be quite robust and can be used even when certain assumptions are not met.
Parametric tests are useful because they are the most powerful for determining the significance or reliability of the sample statistics calculated. However, their application is predicated on a set of assumptions. These assumptions depend on the structure of the population distribution and on how the data measurements are quantified by the type of scale utilised.
Let's look at what a scale is and the different types of scales. Four types of scales are used in measurement: the nominal scale, ordinal scale, interval scale, and ratio scale.
1. The nominal scale is used to represent nominal or classed data, such as a population separated into males and females. There is no ordering of the data, in the sense that saying male > female has no significance. The data are labelled with arbitrary labels like m/f or 1/0. These are also known as categorical scales, as their values are divided into categories (i.e. they are names rather than numbers).
2. The ordinal scale is used to represent ordered data. Although the values are in a specific order, the differences between them are not meaningful. For instance, a 5-point scale ranging from 1 to 5, with 1 indicating the least satisfied and 5 indicating the most satisfied.
3. The interval scale is used to deal with ordered data that has equal intervals. This scale is constant, yet it lacks a natural zero; differences are meaningful. Temperature in Centigrade or Fahrenheit is an example of this type of data, as are calendar dates. An interval scale meets the magnitude and equal-intervals requirements of a proper measurement scale, but it lacks a real or absolute zero point.
4. The ratio scale is a constant, ordered scale with a natural zero. Height, weight, age,
and length are examples of this type of information.
Non-parametric statistics are used when a sample with a small number of items cannot be assumed to follow a normal distribution, for example if our sample size is 30 or less (N ≤ 30). They can also be utilised with nominal data in addition to ordinal data.
A non-parametric statistical test is based on a model that specifies only very general conditions, with no mention of the precise form of the distribution from which the sample was taken.
Most non-parametric statistical tests require some assumptions, such as that the observations are independent and that the variable under examination has underlying continuity, but these assumptions are fewer and weaker than those required by parametric tests.
Parametric tests, as noted, are generally reliable and useful even when some of their mathematical assumptions are violated. These tests, however, are only employed with data that is based on ratio or interval measurements. We employ non-parametric tests when dealing with counted or ranked data. Non-parametric tests are said to have more validity, since they are not predicated on assumptions about the nature of the population distribution, assumptions which are commonly neglected or violated by researchers who employ parametric tests. Non-parametric tests, on the other hand, are less precise and have less power than parametric tests.
A test statistic quantifies a data sample, to put it simply. Any mapping (or function) between a data sample and a numerical value is referred to as a statistic in statistics. The mean value and variance are two common examples. In formal terms, the test statistic can be stated as

tn = Tn(D(n)),

where D(n) = {X1, ..., Xn} is a data sample with sample size n. The mapping is denoted by Tn, while the value is denoted by tn. Typically, the test statistic assumes real values, i.e. tn ∈ R, although there are some exceptions.
A test statistic plays an important part in a hypothesis test, since the chosen test statistic shapes the hypothesis test to a considerable extent. The reason for this is that it will be incorporated into the hypotheses we develop in step 2. As a result, it's critical to pick a test statistic that's relevant and interesting for the study at hand.

We want to underline that we choose the test statistic in this step, but we don't evaluate or apply it yet. That is done in step 5.
Step 2: Null and Alternative Hypotheses

At this point, we define two hypotheses: H0, the null hypothesis, and H1, the alternative hypothesis. Both hypotheses make statements about the test statistic's population value, and they are mutually exclusive. The population value of the test statistic t = T(D) we chose in step 1 is denoted by θ. We can construct the following hypotheses based on this:

H0: θ = θ0 versus H1: θ > θ0
As can be seen from the way the two hypotheses are worded, the value of the population parameter θ can make only one assertion true, not both. For example, if θ = θ0 is true, then the alternative hypothesis H1 is false; conversely, if θ > θ0 is true, then the null hypothesis H0 is false.
There are four possible outcomes of a hypothesis test, and each has its own widely used name. For example, a 'true positive' (TP) judgement is made when the null hypothesis is untrue and we reject H0. It is called 'positive' because of the asymmetric meaning of a hypothesis test: rejecting H0 when H0 is untrue is more informative than accepting H0 when H0 is correct. In this sense, the result of a hypothesis test can be considered positive.
Despite the fact that there are hundreds of different hypothesis tests, the description above applies to them all. This does not imply that if you understand one hypothesis test, you will understand all of them; rather, if you understand the theory of one hypothesis test, you will understand the principle of all hypothesis tests.
To correlate the sample value of the test statistic t with its population value, one must first understand the test statistic's probability distribution. This probability distribution was given a unique name because of this connection: the sampling distribution of the test statistic. It's vital to remember that the sampling distribution represents the test statistic's values if the null hypothesis is correct. This signifies that the population value of θ is θ0 in this circumstance.
For the time being, let's suppose we know the sampling distribution of our test statistic. We obtain a quantification for the 'typicality' of this value with respect to the sampling distribution by comparing the particular value t of our test statistic with the sampling distribution in a way that is determined by the way we formulated the null and alternative hypotheses, assuming the null hypothesis is true.
Step 3: Sampling Distribution
We indicated earlier, in our general overview of the basic concept of a hypothesis test, that the relationship between a test statistic and its sampling distribution is critical for any hypothesis test. As a result, we go into this topic in greater depth in this part.
First and foremost, the sampling distribution is a probability distribution. Given specific assumptions, this sampling distribution represents the distribution of the test statistic T, which is a random variable. This remark can be clarified by defining the null hypothesis's sampling distribution as follows.
Let X(n) = X1, ..., Xn represent a random sample from a population with Xi ∼ Ppop, and let T(X(n)) represent a test statistic. The sampling distribution of the null hypothesis, or the null distribution, is the probability distribution fn(x | H0 true) of T(X(n)) assuming H0 is true.
We'd like to call attention to a few aspects of the preceding definition. For starters, the distribution Ppop from which the random variables Xi are sampled can take any shape and is not limited to, say, a normal distribution. Second, because it is a function of the sample X(n), the sampling distribution is a function of n as well. That is, if we modify the sample size n, the sampling distribution will vary. Finally, because fn(x | H0 true) is the probability distribution of T(X(n)), we may precisely rebuild the distribution fn(x | H0 true) by collecting an unlimited number of samples from it in the form T(X(n)) ∼ fn(x | H0 true). As we will see in the next example, this last point allows for a numerical approximation of the sampling distribution.
Assume we have a random sample of size n, X(n) = X1, ..., Xn, where each data point follows a gamma distribution with parameters α = 4 and β = 2. In addition, let us utilise the mean value as a test statistic, i.e.,

T(X(n)) = (1/n) Σi Xi.
Figures A–C illustrate three examples of drawing E = 100,000 samples X(n), from which we estimate E = 100,000 different mean values T, for three different values of n (in A, n = 1; in B, n = 3; and in C, n = 10). The density estimates of these 100,000 values are shown in Figure 3A–C. As previously stated, the approximate sampling distribution Ps(n, E) becomes the (theoretical) sampling distribution in the limit of infinitely many samples E.
When n = 1, we have the special case where the sampling distribution is the same as the underlying population distribution Ppop, which in our instance is a gamma distribution with parameters α = 4 and β = 2, as illustrated in Figure 3A. Figure 3B,C shows a modification in the distributional shape of the sampling distribution for any other n > 1. However, we should be familiar with this transformation since, according to the Central Limit Theorem, the mean of n independent samples X1, ..., Xn with mean μ and variance σ² follows a normal distribution with mean μ and standard deviation σ/√n, that is,

X̄ ∼ N(μ, σ²/n).

This result is only strictly valid in the limit of large n. However, in Figure 3D we present a qq-plot which shows that the resulting distribution, Ps(n = 10, E = 100,000), is already quite close to a normal distribution (with the appropriate parameters) for n = 10.
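The numerical approximation described above is easy to reproduce. The following is a minimal sketch in Python (using NumPy), assuming the gamma population with α = 4 and β = 2 from the example and E = 100,000 repeated samples; all numbers are illustrative.

import numpy as np

rng = np.random.default_rng(0)
alpha, beta = 4.0, 2.0          # shape and rate of the population distribution
E = 100_000                     # number of repeated samples

for n in (1, 3, 10):
    # Draw E samples of size n and compute the test statistic (the mean) of each.
    samples = rng.gamma(shape=alpha, scale=1.0 / beta, size=(E, n))
    means = samples.mean(axis=1)
    # The CLT predicts mean alpha/beta and standard deviation sqrt(alpha)/beta/sqrt(n).
    print(f"n={n:2d}: mean={means.mean():.3f} (CLT: {alpha / beta:.3f}), "
          f"sd={means.std():.3f} (CLT: {np.sqrt(alpha) / beta / np.sqrt(n):.3f})")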
The Central Limit Theorem holds for arbitrary iid (independent and identically distributed) random variables X1, ..., Xn. As a result, the normal distribution given above is always the limiting sampling distribution for the mean.
We can simplify things even more by applying a so-called z-transformation to the mean value X̄:

Z = (X̄ − μ) / (σ/√n)

Z's distribution is the standard normal distribution, which means that Z ∼ N(0, 1).
If the population standard deviation σ is unknown, the sample estimate s can then be used for the t-transformation that follows:

T = (X̄ − μ) / (s/√n)

which follows a Student's t-distribution with n − 1 degrees of freedom.
Step 4: Significance Level α
The significance level α is a value that ranges from 0 to 1, i.e., α ∈ [0, 1]. It has the meaning

α = P(Type 1 error) = P(reject H0 | H0 true),

indicating the likelihood of rejecting H0 if H0 is true. That is, it tells us how likely we are to make a Type 1 error, resulting in a false positive decision.
We have the freedom to choose this value while running a hypothesis test. However, before deciding on a numerical value, one must consider the possible repercussions. Although 0.05 is probably the most common value, levels as low as 10⁻⁸ are utilised in Genome-Wide Association Studies (GWAS). The conceivable ramifications in the various application domains are the rationale for such a wide range of employed values. Type 1 errors in GWAS can trigger follow-up studies that cost millions of dollars, which are quite expensive in this discipline. As a result, α is picked at a very low level.
Step 5: Evaluate Test Statistic from Data
Because everything up to this point has been theoretical, this stage represents our connection to the real world, as represented by the data. We estimate the numerical value of the test statistic selected in Step 1 for D(n) = X(n) = x1, ..., xn.
tn denotes a specific numerical value derived from the observed data D(n). Because our data collection depends on the number of samples n, this numerical value will also depend on n. The subscript indicates this explicitly.
Step 6: Determine the p-Value

Comparing the observed value of the test statistic with its sampling distribution gives the p-value:

p = P(observe x at least as extreme as |t| | H0 is true) = P(x ≥ |t| | H0 is true)
Figure: Determining the p-values from the sampling distribution of the test
statistic.
The numerical value of p will be used to determine whether we reject or accept the
null hypothesis.
We can also make use of the following integral,

α = ∫ from c to ∞ of fn(x | H0 true) dx,

to find the value of the threshold c. That is, a threshold c is implied by the significance level. This threshold can also be used to decide whether or not H0 should be rejected.
We want to underline that while the test statistic is a random variable, the p-values
are likewise random variables because they are dependent on the test statistic.
Step 7: Make a Decision about the Null Hypothesis

Finally, we make a decision about the null hypothesis. There are two options for accomplishing this: we can choose to make a judgement based on the p-value or based on the value of the test statistic tn.
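To make steps 5–7 concrete, here is a hedged sketch using SciPy's one-sample t-test (the alternative argument assumes SciPy 1.6 or later); the sample values and μ0 are invented for illustration.

import numpy as np
from scipy import stats

x = np.array([2.3, 1.9, 2.8, 2.1, 2.6, 3.0, 2.4, 2.2])  # observed sample D(n)
mu0 = 2.0                                                # population value under H0
alpha = 0.05                                             # significance level

# Step 5: evaluate the test statistic; Step 6: determine the p-value.
t_n, p = stats.ttest_1samp(x, popmean=mu0, alternative="greater")
print(f"test statistic t_n = {t_n:.3f}, p-value = {p:.3f}")
# Step 7: make a decision about the null hypothesis.
print("reject H0" if p < alpha else "do not reject H0")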
Advantages of Linear Regression

Simple implementation
Linear Regression is a very simple method that may be quickly implemented and produces good results. Furthermore, when compared to other complicated techniques, these models may be trained quickly and efficiently on systems with limited processing resources. When compared to other machine learning techniques, linear regression also has far lower complexity, and its learned coefficients are very simple to comprehend and interpret. As a result, linear regression is a simple concept to grasp.
Linear regression almost perfectly fits linearly separable datasets and is frequently
used to determine the nature of the relationship between variables.
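As a rough illustration of this simplicity, the following sketch fits scikit-learn's LinearRegression to a tiny synthetic dataset; the numbers are invented for illustration.

import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])  # independent variable x
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])            # dependent variable y

model = LinearRegression().fit(X, y)
print("slope:", model.coef_[0], "intercept:", model.intercept_)
print("prediction at x = 6:", model.predict([[6.0]])[0])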
Over-fitting can be reduced by regularization
Over-fitting occurs when a machine learning model fits a dataset extremely well and hence captures the noisy data as well. This has a detrimental effect on the model's performance and affects its accuracy on the test set. Regularization penalises overly complex models and thereby reduces this risk.
Disadvantages of Linear Regression
Prone to under-fitting

Under-fitting: a situation that occurs when a machine learning model fails to capture the data adequately, i.e., when the hypothesis function does not match the data well.
Example:

Figure: an under-fitted model. The fitted straight line cannot effectively capture the data. As a result, the accuracy of such a linear regression model is low.
Sensitive to outliers
Anomalies or extreme values that depart from the rest of the distribution are called outliers in a data set. Data outliers can severely degrade the performance of a machine learning model, resulting in models with low accuracy.
Example: a few extreme outliers can pull the fitted regression line away from the trend followed by the bulk of the data.
The following points highlight the utility of the regression technique:
●● It enables us to readily estimate or anticipate the unknown values of one variable based on the known values of the other.
●● It offers a measure of the errors in estimations obtained using the regression line. A small scatter of observed (actual) values around the appropriate regression line suggests good estimations of a variable's values and a lower level of error. A large amount of scatter of observed values around the appropriate regression line, on the other hand, suggests erroneous estimations of the values of a variable and a significant degree of error.
●● It gives a measure of the correlation coefficient between the two variables, which may be computed by taking the square root of the product of the two regression coefficients, i.e., r = √(bxy · byx).
●● It gives us a measure of the coefficient of determination, which reflects the effect of the independent variable (explanatory, or regressing variable) on the dependent variable (explained, or regressed variable), and thus gives us an indication of the regression analysis' predictive value. This coefficient of determination is calculated by multiplying the two regression coefficients, i.e., r² = bxy · byx. The higher the value of the coefficient of determination (r²), the better the fit and the more effective the regression equations are as estimating devices.
●● It is a powerful statistical analysis tool in the field of business and commerce, where people are interested in predicting future events such as consumption, production, investment, prices, sales, profits, and so on, and where businessmen's success is highly dependent on the accuracy of their various estimates.
●● It's a useful tool for determining and evaluating the cause-and-effect relationship between the economic variables that are at the heart of economic theory and practice. It is widely used to estimate demand curves, supply curves, production functions, cost functions, and consumption functions, among other things. In reality, by fitting regression lines to the input and output data, economists have proposed a variety of production functions.
●● This method is widely utilised in our daily lives and sociological studies to estimate numerous elements such as birth rates, death rates, tax rates, yield rates, and so on.
●● Last but not least, the regression analysis technique provides us with an estimate of a series' relative variation.
Summary
●● The technique of discovering correlations between dependent and independent variables is known as regression. It aids in the prediction of continuous variables.
●● The best fit line, or line of best fit, is a line that can describe the relationship between the independent and dependent variable(s). The dependent variable is predicted using this line. The goal of the line of best fit is to keep the difference between the observed (real) values and the values predicted by the line as small as possible.
●● Pandas is a free and open-source Python toolkit created by software engineer Wes McKinney in 2008. It offers high-level data structures and programming tools for doing data analysis and manipulation on a variety of data kinds, including numbers, text, and dates.
●● Scikit-learn is undoubtedly Python's most helpful machine learning library. Classification, regression, clustering, and dimensionality reduction are just a few of the useful functions in the sklearn toolkit for machine learning and statistical modelling.
●● Multiple linear regression (MLR), often known as multiple regression, is a statistical technique that predicts the result of a response variable by combining numerous explanatory variables. Multiple linear regression attempts to represent the linear relationship between explanatory (independent) and response (dependent) variables. Because it incorporates more than one explanatory variable, multiple regression is essentially an extension of ordinary least-squares (OLS) regression.
●● The term odds is defined in mathematics as the ratio of the number of favourable events to the number of unfavourable events. While odds for an event show the possibility of the event occurring, odds against reflect the likelihood of the event not occurring.
●● Probability is a mathematical term that deals with the possibility of a specific event occurring. It is the foundation for a hypothesis testing theory and an estimating theory. It's calculated as the ratio of the number of favourable events to the total number of possible events.
●● A non-parametric statistical test is based on a model that specifies only very general conditions, with no mention of the precise form of the distribution from which the sample was taken. Most non-parametric statistical tests require some assumptions, such as that the observations are independent and that the variable under examination has underlying continuity, but these assumptions are fewer and weaker than those required by parametric tests.
Glossary
●● Discrete distribution: A discrete distribution is one with a finite number of possible values.
●● Dataframe: These are tabular data's data structures, which are similar to spreadsheets and can hold vast amounts of data.
●● Binomial distribution: The binomial distribution calculates the likelihood of a certain number of successes or failures in each attempt of an experiment.
●● Poisson distribution: The Poisson distribution is a discrete probability distribution that assesses the likelihood of a number of events occurring in a particular time period when they happen in a well-defined order.
●● Hypergeometric distribution: A hypergeometric distribution is a discrete distribution that estimates the likelihood of a certain number of successes in (n) trials without replacement from a large population (N). To put it another way, it's sampling without replacement.
●● Geometric distribution: A geometric distribution is a discrete distribution that estimates the probability of the first success.
●● Numerical Datasets: A numerical data set is one in which the information is
expressed in numbers rather than natural language.
●● Bivariate Datasets: A bivariate data set is one that contains two variables. It is
concerned with the relationship that exists between the two variables.
●● Multivariate Datasets: A multivariate dataset is defined as one that has three or more data kinds (variables).
●● Categorical Datasets: Categorical data sets represent a person’s or an object’s
attributes or qualities.
●● Correlation Datasets: Correlation data sets are a collection of variables that have some relationship with one another.
●● Over-fitting: It occurs when a machine learning model fits a dataset extremely well and hence captures the noisy data as well. This has a detrimental effect on the model's performance and affects its accuracy on the test set.
●● Regularization: It’s a simple strategy that effectively reduces the complexity of a
function, lowering the risk of Over-fitting.
●● Under-fitting: A situation that occurs when a machine learning model fails to capture the data adequately, i.e., when the hypothesis function does not match the data well.
●● Homogeneity of variance (homoscedasticity): One of the primary assumptions of a simple linear regression approach is that the size of the error stays constant. This basically indicates that the error magnitude does not change much as the independent variable's value varies.
●● Independence of observations- Nothing is hidden in the relationships between
the observations, and only valid sampling procedures are employed during data
gathering.
●● Normality: The data follows a normal distribution.

Check Your Understanding
1. The technique of discovering correlations between dependent and independent
variables is known as_ _ _ _.
a) Regression
b) Forecasting
c) Mapping
d) Prediction
2. The ____ in linear regression are a measure of the vertical distance between each data point and the regression line.
c) coefficient
d) None of the above
3. _____ are tabular data's data structures, which are similar to spreadsheets and can hold vast amounts of data.
c) Dataframe
d) Dataset
4. Pandas is a free and open-source Python toolkit created by software engineer ____ in ____.
a) Satoshi Nakamoto, 2009
c) Wei Dai, 2006
d) Wes McKinney, 2008
5. Which one of the statements is true regarding residuals in regression analysis?
a) Mean of residuals is always zero
b) Mean of residuals is always less than zero
c) Mean of residuals is always greater than zero
d) There is no such rule for residuals
6. Which one is true about Heteroskedasticity?
a) Linear regression with constant error terms
b) Linear regression with varying error terms
c) Linear regression with zero error terms
d) All of the mentioned
7. The correlation coefficient is used to determine:
a) A specific value of the y-variable given a specific value of the x-variable
b) A specific value of the x-variable given a specific value of the y-variable
c) The strength of the relationship between x and y variables
d) None of the mentioned
8. The square of the correlation coefficient r² will always be positive and is called the ____.
a) Regression
b) Coefficient of determination
c) KNN
d) Algorithm
c) ANOVA
d) ANVA
d) The amount of the error stays constant is one of the primary expectations of a simple linear regression approach
11. ______ is a statistical technique that predicts the result of a response variable by combining numerous explanatory variables.
a) Multiple linear regression
b) Simple linear regression
c) Polynomial regression
d) Decision tree regression
12. The term_ _ _ _ is defined in mathematics as the ratio of the number of favourable
events to the number of unfavourable events.
a) mean
b) odds
c) median
d) mode
13. _ _ _ _is a mathematical term that deals with the possibility of a specific event
occurring.
a) Mean
b) Median
c) Probability
d) Mode
14. _ _ _ _ _regression method is also known as the ordinary least square estimation.
a) Simple
b) Direct
c) Indirect
d) Mutual
15. A_ _ _ _ _is a recursive partition of the instance space that represents a classifier.
a) Simple linear
b) Multiple linear
c) Polynomial
d) Decision tree
Exercise
1. Define Linear Regression important terms and assumptions.
2. Explain the statistical view of regression.
3. Define multiple linear regression.
6. What are the differences between odds and probability?
7. Define linear regression and describe its advantages and pitfalls.
Learning Activities
1. Why does linear regression belong to both statistics and machine learning?
Check Your Understanding - Answers
1. a
2. b
3. c
4. d
5. a
6. b
7. c
8. b
9. c
10. d
11. a
12. b
13. c
14. b
15. d
Learning Objectives:
At the end of this module, you will be able to understand:
●● Concept of Bias Variance Trade-off
●● All about Under-fitting and Over-fitting
●● What is Hold Out Method and its application
●● Various Evaluation Methods: Cross Validation and Bootstrapping Method
●● Performance Metrics and their applications
Introduction
Regression and Classification Algorithms are two of the most common types of supervised machine learning algorithms. We can forecast the output for continuous values using Regression techniques, but we need Classification algorithms to predict the output for categorical values.
An algorithm for classifying new observations based on previously trained data is known as a Supervised Learning technique. To classify new data, software uses the dataset or observations it has been provided as a basis for learning and then assigns the data to one of a number of categories or groups. In a binary problem, for example, the classes might simply be "yes" and "no". Targets, labels, or categories are other names for classes. Both techniques are used in Machine Learning for prediction and work with labelled datasets. However, there is a distinction between the two in terms of how they are applied to certain machine learning situations.
Classification:
Classification is the process of identifying a function that aids in the classification of a dataset based on several factors. A computer programme is trained on the training dataset and then categorises the data into distinct classes depending on that training.

The classification algorithm's goal is to discover the mapping function that will convert the discrete input (x) to the discrete output (y).
Example: Email Spam Detection is the best example of the Classification problem. The model is trained on millions of emails on various parameters, and it determines if an email is spam or not when it receives a new one. The email is moved to the Spam folder if it is spam.
Algorithms for Machine Learning Classification:
●● Logistic Regression
●● K-Nearest Neighbours
●● Support Vector Machines
●● Kernel SVM
●● Naïve Bayes
●● Decision Tree Classification
●● Random Forest Classification
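As a rough sketch of the spam-detection example described above, the following trains one of the listed classifiers (naive Bayes) with scikit-learn; the tiny message set and labels are invented purely for illustration.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

messages = ["win a free prize now", "meeting at 10 am",
            "free offer click now", "lunch tomorrow?"]
labels = ["spam", "not spam", "spam", "not spam"]   # discrete output y

# Vectorise the text (input x) and fit the classifier in one pipeline.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(messages, labels)
print(model.predict(["free prize now"]))  # maps a new input to a class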
Regression:
The technique of discovering correlations between dependent and independent variables is known as regression. It aids in the prediction of continuous variables such as market trends, house values, and so forth.
The Regression algorithm's goal is to identify the mapping function that will translate the continuous input variable (x) to the continuous output variable (y).
Example: Let’s say we want to forecast the weather, so we’ll apply the Regression
approach. When it comes to weather prediction, the model is trained on historical data,
and after it is finished, it can accurately predict the weather for future days.
Algorithms for Machine Learning Regression:
●● Polynomial Regression
●● Support Vector Regression
●● Decision Tree Regression
Regression Algorithm vs. Classification Algorithm
●● In Regression, the output variable must be of continuous nature or real value; in Classification, the output variable must be a discrete value.
●● The task of the regression algorithm is to map the input value (x) to the continuous output variable (y); the task of the classification algorithm is to map the input value (x) to the discrete output variable (y).
●● Regression Algorithms are used with continuous data; Classification Algorithms are used with discrete data.
●● In Regression, we try to find the best fit line, which can predict the output more accurately; in Classification, we try to find the decision boundary, which can divide the dataset into different classes.
●● Regression algorithms can be used to solve regression problems such as weather prediction and house price prediction; classification algorithms can be used to solve classification problems such as identification of spam emails, speech recognition, and identification of cancer cells.
●● The Regression algorithm can be further divided into Linear and Non-linear Regression; the Classification algorithms can be divided into Binary Classifiers and Multi-class Classifiers.
2.1.2 Classification of Supervised Algorithms
Supervised algorithms fall into two categories:
●● Parametric model
●● Nonparametric model
Parametric Model
Although they can substantially ease learning, assumptions also limit what can be learned. Parametric machine learning algorithms are those that reduce the function to a known form.
"A parametric model is a type of learning model that summarises data with a set of parameters of fixed size (regardless of the quantity of training examples). A parametric model won't change its mind about how many parameters it requires no matter how much data you feed it." The assumed functional form is often a line:

b0 + b1*x1 + b2*x2 = 0

Where x1 and x2 are two input variables, and b0, b1, and b2 are the line's coefficients that determine the intercept and slope.
Assuming the function has the shape of a line substantially streamlines the learning process. We now have a prediction model for the issue; all we need to do is estimate the line equation's coefficients.
As a result, parametric machine learning algorithms are frequently also referred
to as “linear machine learning algorithms” because the assumed functional form is
frequently a linear combination of the input variables.
The issue is that the underlying function may not actually be linear like a line. It might be almost a line and only need a small amount of input data modification to function properly. Or it might not even be a line at all, in which case the assumption is incorrect and the strategy will yield subpar results.
Some examples of parametric machine learning algorithms include:
◌◌ Logistic Regression
◌◌ Linear Regression
◌◌ Naive Bayes
◌◌ Simple Neural Networks
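The fixed-parameter idea can be illustrated with a short, hedged sketch: a logistic regression over two input variables always learns exactly three parameters (b0, b1, b2), no matter how much data it sees; the data here are synthetic and invented for illustration.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
for n_rows in (50, 5000):
    X = rng.normal(size=(n_rows, 2))           # two input variables x1, x2
    y = (X[:, 0] + X[:, 1] > 0).astype(int)    # a simple linear labelling rule
    clf = LogisticRegression().fit(X, y)
    # Parameter count is fixed: one intercept (b0) plus two coefficients (b1, b2).
    n_params = clf.intercept_.shape[0] + clf.coef_.size
    print(n_rows, "rows ->", n_params, "parameters")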
Limitations of Parametric Machine Learning Algorithms:
◌◌ Limited Complexity: The methods are better suited to simpler problems.
◌◌ Poor Fit: In practice, the approaches may not match the underlying mapping function.
Nonparametric
Nonparametric machine learning algorithms are those that do not make any firm
assumptions about the shape of the mapping function. They are allowed to learn any
functional form from the training data because they are not making any assumptions.
When you have a lot of data, no prior information, and you don’t want to worry too
much about picking the perfect features, nonparametric approaches are a viable option.
The k-Nearest Neighbors algorithm, which makes predictions for a new data instance based on the most similar training patterns, is an easy-to-understand nonparametric model. The only assumption made by the approach is that patterns that are similar are likely to have an output variable that is comparable.
Some more examples of popular nonparametric machine learning algorithms include:
◌◌ k-Nearest Neighbors
◌◌ Decision Trees like CART and C4.5
◌◌ Support Vector Machines
◌◌ Random Forest
Benefits of Nonparametric Machine Learning Algorithms:
◌◌ Power: No (or only weak) assumptions are made about the underlying function.
◌◌ Performance: Can lead to prediction models with greater performance.
Limitations of Nonparametric Machine Learning Algorithms:
◌◌ More data: Extra training data is needed to estimate the mapping function.
◌◌ Slower: They train much more slowly because there are frequently more parameters to train.
◌◌ Over-fitting: It is tougher to justify why specific predictions are made and there
is a greater risk of Over-fitting the training data.
Logistic Regression
Similar to linear regression, logistic regression is employed when the dependent variable isn't a number but rather something else, such as a "yes" or "no" response. Regression is used, but classification is also done based on the regression, and the observation is assigned to one of the two classes.

Figure: logistic regression produces an S-shaped curve bounded between 0 and 1, in contrast to the straight line of linear regression.
For instance, a credit card provider will model whether a consumer will "default" or "not default" on their card when determining whether or not to offer them a credit card.
In order to create the model, linear regression is first applied to the connection
between the variables. It is expected that the categorization line’s threshold is set at 0.5.
It compares the log of the likelihood that an event will occur to the log of the likelihood that it won't. In the end, it categorises the variable according to which class has the larger likelihood.
odds = Pr(y) / Pr(~y), where y denotes the event and ~y the non-event.

Odds is the ratio of the probability that an event occurs to the probability of the event not occurring.
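A minimal sketch of these ideas with scikit-learn follows; the single "credit utilisation" feature and its values are invented for illustration, and the odds are computed from the predicted probability as defined above.

import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[20], [35], [50], [65], [80], [95]])   # e.g. credit utilisation (%)
y = np.array([0, 0, 0, 1, 1, 1])                     # 0 = not default, 1 = default

clf = LogisticRegression().fit(X, y)
p = clf.predict_proba([[70]])[0, 1]   # probability of class 1 ("default")
print(f"P(default) = {p:.2f}, odds = {p / (1 - p):.2f}")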
Linear Regression
A statistical method for predicting the value of one variable based on the value of another is linear regression analysis. The variable you want to forecast is the dependent variable. You forecast the value of the other variable using the independent variable.
In order to estimate the coefficients of the linear equation, this method of analysis uses one or more independent variables that can most accurately predict the value of the dependent variable. Linear regression reduces the discrepancy between expected and actual output values by fitting a straight line or surface. There are accessible simple linear regression calculators that use the "least squares" approach to find the best-fit line for a set of paired data. The value of Y (the dependent variable) is then estimated from X (the independent variable).
Naive Bayes
Naive Bayes treats each feature's contribution to the outcome independently, despite the fact that these features all depend on one another or on the existence of other features. Hence the term "naive" Bayes.
By Bayes' theorem:

P(class|data) = P(data|class) × P(class) / P(data)
The posterior probability of the given class (target) for the given data (attributes) is denoted by P(class|data): the likelihood of a class for a given data point. This is the value we are attempting to calculate.
●● The prior probability of a class is P(class).
●● The likelihood, or probability of a predictor given a class, is P(data|class).
●● The prior probability of the predictor, or marginal likelihood, is P(data).
Example: consider a data set of 17 points, 10 yellow and 7 green, together with a new point "?" that we wish to classify.
Steps
1. Calculate the prior probability of each class:
P(yellow) = 10/17
P(green) = 7/17
2. Calculate the marginal likelihood of the point to classify:
P(?) = 4/17
3. Calculate Likelihood:
P(data/class) = Number of observations similar to the class / Total no. of points in the class.
P(?/yellow) = 1/7
P(?/green) = 3/10
4. Calculate the posterior probability of each class by combining its prior and likelihood via Bayes' theorem.
5. Classification r si
As the point has a posterior probability of belonging to the class green with a chance of over 75%, it is classified as green. Multinomial, Bernoulli, and Gaussian naive Bayes are other model variants used to calculate the probabilities. A naive Bayes model is thus simple to construct and does not require laborious iterative parameter estimation, making it particularly useful for very large datasets.
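The posterior arithmetic above can be checked with a few lines of Python; this hedged sketch simply reuses the example's stated priors and likelihoods.

p_yellow, p_green = 10 / 17, 7 / 17        # priors P(class)
p_marginal = 4 / 17                        # marginal likelihood P(?)
lik_yellow, lik_green = 1 / 7, 3 / 10      # likelihoods P(? | class)

post_yellow = lik_yellow * p_yellow / p_marginal   # Bayes' theorem
post_green = lik_green * p_green / p_marginal
print(f"P(yellow | ?) = {post_yellow:.2f}, P(green | ?) = {post_green:.2f}")
# The point is assigned to the class with the larger posterior (green here).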
Simple Neural Networks

A simple neural network is made up of nodes organised in a network; these nodes are linked to one another in one or both directions. The basic structure of a directional network consists of an input layer, with each node receiving one of the system's n inputs; a hidden layer made up of m nodes that combine the incoming signals; and an output node that receives the hidden layer's m outputs and produces the system output of the neural network. A neural network can also be represented by its transfer function: the propagation function is formed by first weighting and then summing the inputs, and an activation function, such as the sigmoid, maps the result to an output value between two values, such as 0 and 1.
In the learning phase, the weight values of the input connections are modified across a number of rounds. The supervised version of this algorithm compares the neural network's output with the target output connected to the relevant input. The error is then used to modify the weight values so that they fit the actual output values more closely.
k-Nearest Neighbors
One of the simplest classification algorithms, the k-NN algorithm is used to determine how data points are divided into different classes in order to forecast the class of a new sample point. k-NN is a lazy, non-parametric learning algorithm. It uses a similarity metric (i.e., a distance function) to categorise new cases.
k-Nearest Neighbors
●● k is the number of neighbours to consider.
●● Scaling is important.
●● k should be odd.
●● If we have binary features, we can use Hamming distance.
●● Voting can be weighted by distance to each neighbour.
●● Does not scale well to large data.
If k=3, the grey square observation is predicted to be green because two of its
neighbours are green and only one is red.
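A hedged sketch of the same idea with scikit-learn's KNeighborsClassifier (k = 3, as in the example above) follows; the 2-D points and colour labels are invented for illustration.

from sklearn.neighbors import KNeighborsClassifier

X = [[1, 1], [1, 2], [2, 2], [6, 5], [7, 7], [8, 6]]
y = ["green", "green", "red", "green", "red", "red"]

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X, y)
# The new point is classified by majority vote among its 3 nearest neighbours.
print(knn.predict([[2, 1]]))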
Entropy
Entropy is a measure of an element's unpredictability or uncertainty. It is a gauge of impurity, to put it another way.
It informs us intuitively about an event's predictability. Entropy determines the homogeneity of a sample: the sample has an entropy of one when it is evenly divided and zero when it is totally homogeneous.
Information Gain
The relative change in entropy with respect to the independent attribute is measured as information gain. It attempts to estimate the amount of information each attribute contains. Finding the attribute that yields the highest information gain (i.e., the most homogeneous branches) is the key to building a decision tree.

Gain(T, X) = Entropy(T) − Entropy(T, X)

where Gain(T, X) represents the information gained by using feature X. The second term determines the entropy after applying the feature X, whereas Entropy(T) is the entropy of the entire collection.
At a specific node of the tree, information gain is used to rank the candidate splitting attributes: the attribute with the highest information gain determines the split.
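The following sketch computes entropy and the information gain for one candidate split; the label counts are invented for illustration.

import math

def entropy(labels):
    # Shannon entropy of a list of class labels.
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * math.log2(p) for p in probs)

T = ["yes"] * 5 + ["no"] * 5                               # evenly divided -> entropy 1
left = ["yes"] * 4 + ["no"]                                # one branch of a split on X
right = ["yes"] + ["no"] * 4                               # the other branch

weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(T)
print("Entropy(T) =", entropy(T))
print("Gain(T, X) =", entropy(T) - weighted)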
Overfitting

Figure: an overfit model. Overfitting occurs when a model starts to memorise the aspects of the training set and in turn loses the ability to generalize.
Support Vector Machine (SVM)

With the aid of support vectors, the SVM carries out classification by locating the hyperplane that maximises the margin between the two classes. The SVM learns the hyperplane by transforming the problem with some linear algebra (e.g., a linear kernel assumes linear separability between the classes).
Random Forest
The basic premise is that combining learning models improves the final outcome.
Over-fitting may be a problem for deep decision trees, but random forests avoid it by building trees from random subsets. The biases are eliminated mostly because it takes the average of all the forecasts.
While the trees are growing, the random forest adds more randomness to the model. When splitting a node, it looks for the best feature among a random subset of features rather than the single most important feature overall. A better model is often produced as a result of the great diversity this causes.
Bias-Variance Trade-off

The bias-variance trade-off is a central concept in machine learning. A machine learning model wants to reduce both bias and variance in order to minimise prediction mistakes. Variance describes how utilising a different training set alters the prediction. In the ideal case, predictions from multiple training sets differ only a little from one another, but in situations with high variance, even tiny adjustments to the training set might result in radically different predictions. Bias, however, refers to an inaccuracy brought about by oversimplifying a problem. No matter how much training data there is, using linear regression to solve a problem with an apparently nonlinear pattern will leave a systematic error.
Figure: Over-fitting with increased model complexity. While the polynomial model varies significantly with the training data, the linear model does not. As a result, the polynomial model's inconsistent predictions at x = 2 frequently turn out to be more wrong than the linear model's. If enough training data are given, the polynomial model can perform better than the linear model.
1. Bias: The bias is the error brought on by the model's simplifying assumptions, which results in some test instances having consistent errors across different choices of training data set. The bias cannot be eliminated, even if the model has unlimited access to training data. Because the linear model will never fit the (slightly curved) data distribution exactly, regardless of the amount of data given, it has a bigger model bias than the polynomial model, as seen in the example of the above figure. When applying a linear model, for any selection of training sample, the prediction of a specific out-of-sample test instance at x = 2 will always have an error in a specific direction. If we suppose that the straight and curved lines were estimated using an unlimited quantity of data, the bias is the difference between them at any given value of x. The above figure provides an illustration of the bias at x = 2.
2. Variance: Variance results from the failure to fully learn all of the model's parameters in a statistically sound manner, particularly when the data is sparse and the model tends to have more parameters. Increased variance becomes evident as over-fitting to the specific training data set at hand. As a result, if different training data set selections are made, different predictions will be made for the same test instance. In the above figure, the forecasts of the polynomial model vary significantly over different choices of training data, while the linear model yields similar predictions at x = 2. The significantly disparate forecasts at x = 2 are frequently grossly inaccurate, which is a manifestation of model variance.
3. Noise: The intrinsic noise is the error in the data itself with respect to the real model. All scatter plot points would have overlapped with the curving line reflecting the true model if there had been no noise.
e
give a more formal, mathematical perspective in the paragraphs that follow.
Formal View
We suppose that B represents the base distribution from which the training data set
was created. This base distribution can be used to create a data collection D:
D ∼ B
There are various ways to draw the training data, for example by choosing only data sets of a certain size. Assume for the time being that training data sets are obtained from B using some clearly defined generative technique. The method through which training data sets are extracted from B is not crucial to the analysis that follows.
Because the base distribution can be used to create training data sets indefinitely, having B is akin to having unlimited access to a pool of training data. Such base distributions, i.e., endless data resources, don't actually exist. Practically speaking, an analyst only ever collects a single finite instance of D using some type of data collecting device. However, the theoretical quantification of the causes of error in training on this finite data set is helped by the conceptual existence of a base distribution from which new training data sets can be produced.
Now imagine that the analyst has a set of t test instances in d dimensions, denoted by Z̅1, ..., Z̅t. The dependent variables of these test instances are denoted by y1, ..., yt. For clarity of discussion, let us assume that the test instances and their dependent variables were also generated from the same base distribution B by a third party, but the analyst was provided access only to the feature representations Z̅1, ..., Z̅t and no access to the dependent variables y1, ..., yt. Therefore, the analyst is tasked with the job of using the single finite instance of the training data set D in order to predict the dependent variables of Z̅1, ..., Z̅t.
Now assume that the relationship between the dependent variable yi and its feature
representation Z̅ i is defined by the unknown function f(·) as follows:
yi = f(Z̅i) + εi
The intrinsic noise, which is independent of the model being utilised, is denoted here by εi. Although it is assumed that E[εi] = 0, the individual value of εi could be positive or negative. The only remaining uncertainty would be that brought on by the inherent noise if the analyst knew the function f(·) corresponding to this relationship and applied it to each test point Z̅i to estimate the dependent variable yi.
The issue is that the analyst doesn't know what the function f(·) actually is. Keep in mind that this function is utilised within the base distribution B's generative process, which functions like an oracle that the analyst is not privy to. Only examples of this function's input and output are available to the analyst. In order to estimate this function in a data-driven manner, the analyst needs to create some kind of model of the form

ŷi = g(Z̅i, D)

using the training data.
Note the use of the circumflex (i.e., the symbol 'ˆ') on the variable ŷi to indicate that it is a value predicted by a specific algorithm rather than the observed (true) value of yi.
The estimated function g(·, ·) includes all prediction functions of learning models, including neural networks. Some algorithms (like perceptrons and linear regression) can even be expressed succinctly and simply as a linear function of the input features, for example

ŷi = g(Z̅i, D) = W̄ · Z̅i,

where W̄ is the coefficient vector learned from D.
Most neural networks are expressed algorithmically as compositions of multiple functions computed at different nodes. The choice of computational function includes the effect of its specific parameter setting, such as the coefficient vector W̄ in a perceptron. Neural networks with a larger number of units will require more parameters to fully learn the function. Given a different training data set, a model with a big parameter set W̄ will learn highly varied values of these parameters, and this is where the variance in predictions on the same test instance comes from. As a result, with various training data sets, the prediction of the same test instance will likewise vary greatly. As shown in the figure above, these inconsistencies amplify the inaccuracy.
Quantifying the expected error of the learning algorithm in terms of its bias, variance, and (data-specific) noise is the aim of the bias-variance trade-off. The target variable is assumed to be numerical for the sake of generality, so that the mean-squared error between the predicted values ŷi and the actual values yi can be used to intuitively quantify the error. Although it may also be applied to classification in terms of probabilistic predictions of test examples, this is a natural way of quantifying error in regression. Over the set of test instances Z̅1, ..., Z̅t, the squared error of the learning method g(·, D), or MSE, is defined as follows:

MSE = (1/t) Σi (ŷi − yi)² = (1/t) Σi (g(Z̅i, D) − f(Z̅i) − εi)²
Calculating the expected error across several training data set selections is the best way to estimate the error in a manner that is independent of the particular choice of training data set:

E[MSE] = (1/t) Σi E[(g(Z̅i, D) − f(Z̅i) − εi)²] = (1/t) Σi E[(g(Z̅i, D) − f(Z̅i))²] + (1/t) Σi E[εi²]

Expanding the quadratic expression on the right-hand side of the first equation and using the fact that the average value of εi over a large number of test examples is 0, we obtain the second relationship.
The squared term on the right-hand side can be further decomposed by adding and subtracting E[g(Z̅i, D)]:

E[(g(Z̅i, D) − f(Z̅i))²] = E[({E[g(Z̅i, D)] − f(Z̅i)} + {g(Z̅i, D) − E[g(Z̅i, D)]})²]
Figure: The trade-off between bias and variance usually causes a point of optimal model complexity. (Axes: squared error vs. model complexity; curves: bias, variance, and overall error.)
One can expand the quadratic polynomial on the right-hand side to obtain the following:

E[(g(Z̅i, D) − f(Z̅i))²] = {E[g(Z̅i, D)] − f(Z̅i)}² + 2 {E[g(Z̅i, D)] − f(Z̅i)} E[g(Z̅i, D) − E[g(Z̅i, D)]] + E[{g(Z̅i, D) − E[g(Z̅i, D)]}²]
The second term on the right-hand side of the aforementioned expression evaluates to 0, because one of its multiplicative factors is E[g(Z̅i, D)] − E[g(Z̅i, D)] = 0. On simplification, we obtain the following:

E[MSE] = (1/t) Σi {E[g(Z̅i, D)] − f(Z̅i)}² [Bias²] + (1/t) Σi E[{g(Z̅i, D) − E[g(Z̅i, D)]}²] [Variance] + (1/t) Σi E[εi²] [Noise]
In other words, the expected squared error can be decomposed into its (squared) bias, variance, and noise components. Variance is the crucial element that keeps neural networks from generalising. For neural networks with many parameters, the variance will typically be higher. On the other hand, having too few model parameters leaves insufficient degrees of freedom to accurately capture the intricacies of the data distribution. The above figure illustrates the trade-off between bias and variance as model complexity increases. It is obvious that there is an optimal model complexity at which the performance is best. Additionally, the variance will rise when training data is scarce. But careful design choices can lessen over-fitting.
Under-fitting and Over-fitting

When we discuss a model's performance, its generalisation and accuracy are what we actually refer to. Consider for a moment that we are creating a machine learning model. If a model generalises any new input data from the problem domain in an appropriate way, it is said to be a good machine learning model. This enables us to forecast future data that the model has never encountered. Let's say we want to assess how well our machine learning model picks up new information and adapts to it. Over-fitting and under-fitting are the main causes of the poor performance of machine learning methods here.
Under-fitting: (Imagine trying to squeeze into oversized jeans!) Our machine learning model's accuracy is destroyed by under-fitting. Its occurrence merely indicates that our model or method does not suit the data adequately. It typically occurs when we have insufficient data to build an accurate model, or when we try to fit a linear model to non-linear data. In these circumstances the machine learning model will likely produce a lot of incorrect predictions, since the rules are too simple and flexible to be applied to such sparse data. Using more data can help prevent under-fitting, while feature selection can be used to reduce the number of features.
Reasons for under-fitting:
◌◌ Low variance and high bias.
◌◌ The training dataset size employed is insufficient.
◌◌ The model is overly basic.
◌◌ The training data has noise in it and has not been cleansed.
Techniques to reduce under-fitting:
1. Make the model more complex.
2. Perform feature engineering to increase the number of features.
3. Clean up the data of any noise.
4. To improve outcomes, increase the number of epochs or the training period.
Over-fitting: When a statistical model fails to produce reliable predictions on test data, it is said to be overfitted. When a model is trained with a large amount of data, it begins to learn from the noise and erroneous data entries in our data set, and testing on the test data then yields high variance. Due to too many details and noise, the model fails to identify the underlying pattern in the data appropriately. Non-parametric and non-linear approaches are frequent root causes of over-fitting, since these types of machine learning algorithms have more latitude in how they develop the model based on the dataset, making it possible for them to produce highly unrealistic models. If we have linear data, employing a linear algorithm is one way to prevent over-fitting; if we are using decision trees, restricting parameters like the maximal depth is another.
Techniques to reduce Over-fitting:
◌◌ Expand the training data.
◌◌ Simplify the model.
◌◌ Stop training early (keep an eye on the loss over the training period; as soon as the validation loss begins to increase, stop training).
◌◌ Ridge and Lasso regularization.
Under-fitting occurs when your model is too simple for your data; formally, your hypothesis about the data distribution is wrong and too simple. High bias is another term for this circumstance. This indicates that while your algorithm makes consistent predictions, the underlying premise about the data is flawed.
In contrast, Over-fitting occurs when your model is too intricate for your data. Formally speaking, your hypothesis on the distribution of the data is incorrect and overly complicated; for instance, your data are linear and your model is a high-degree polynomial. High variance is another name for this circumstance. This indicates that your algorithm cannot make reliable predictions, since even little changes in the input data cause significant changes in the model's output.
The ideal solution will always fall somewhere in the middle of these two extremes
of the same issue.
●● Low bias, low variance — a good result, just right.
●● Low bias, high variance — over-fitting — for the same set of data, the algorithm produces widely diverse predictions.
●● High bias, low variance — under-fitting — similar predictions are made by the algorithm for comparable data, but these predictions are incorrect (the algorithm "misses").
●● High bias, high variance — an awful algorithm. Most likely, you'll never see this.
These cases can all be combined into one plot. Though more condensed, it is a
little less distinct than the prior one.
How to Detect Under-fitting and Over-fitting
Let’s first learn how to “identify” under-fitting and Over-fitting before moving on to
the tools.
When your model is underfit, it produces inaccurate predictions and both the train error and the val/test error are large. When your model is overfit, the train error is relatively modest while the val/test error is substantial.
Train error is small (although bigger than in the case of Over-fitting) and val/test
error is also small when you find a decent model.
The test error and validation error in the aforementioned instance are nearly
identical. This occurs when everything is in order and the distributions of your train,
validation, and test data are identical. If the test error and the validation error differ
significantly, you should obtain more test data and double-check that the data was
appropriately divided.
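The following sketch illustrates this detection recipe by comparing train and validation errors for models of increasing complexity; the choice of model (a decision tree) and the synthetic data are assumptions made purely for illustration.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=300)   # noisy non-linear data

X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)
for depth in (1, 4, None):   # too simple, reasonable, unconstrained
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    tr_err = mean_squared_error(y_tr, tree.predict(X_tr))
    val_err = mean_squared_error(y_val, tree.predict(X_val))
    print(f"max_depth={depth}: train MSE={tr_err:.3f}, val MSE={val_err:.3f}")
# Large train and val errors -> under-fitting; a tiny train error with a much
# larger val error -> over-fitting.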
Tools and Techniques
Let's now examine methods to avoid both under-fitting and over-fitting, while taking into account why we should employ them.
As we recall:
◌◌ When your model is too straightforward for your data, under-fitting happens.
◌◌ When your model is too intricate for your data, Over-fitting takes place.
Based on this, you should keep in mind the following simple intuition: if your model under-fits, make it more complex; if it over-fits, make it simpler. In actuality, everything that follows is simply a consequence of this straightforward rule. I'll try to explain why some actions will make the model more complex or simpler.
To make the model more complex, you must add extra parameters (degrees of freedom). This can sometimes mean switching to a model that is inherently capable of capturing more complex dependencies (an SVM with different kernels instead of logistic regression). If the algorithm is already quite sophisticated (a neural network or some ensemble model), increase the number of models in boosting, for example. For neural networks, this entails increasing the number of layers, neurons in each layer, connections between layers, CNN filters, and other components.
Conversely, you must decrease the number of parameters if you want to simplify the model. Either reduce the degrees of freedom (fewer neurons, fewer layers, etc.) or change the method altogether (try a random forest instead of a deep neural network).
This point and the one before it are extremely closely related. In actuality, regularisation is a forced and indirect model simplification. The regularisation term mandates that parameter values be kept as low as possible, pushing the model to be as simple as possible. This is a particularly potent approach, since complex models with strong regularisation frequently outperform originally simple models.
The influence of the regularisation term grows as the model is regularised more (simplified). The regularisation parameters vary based on the method; for example, to reduce regularisation, alpha for Ridge regression should be decreased while C for SVM should be increased. This process is strictly individual. As a result, you should research the algorithm's parameters and pay attention to whether they should be increased or decreased in a specific circumstance. Such parameters are numerous, including the L1/L2 coefficients for linear regression, C and gamma for SVM, maximum tree depth for decision trees, and others. The primary regularisation techniques used in neural networks are:
◌◌ Early stopping,
◌◌ Dropout,
◌◌ L1 and L2 Regularization.
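As a small illustration of regularisation strength, the hedged sketch below fits Ridge regression with several alpha values on synthetic, invented data; in line with the discussion above, a larger alpha shrinks the coefficients toward a simpler model.

import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([3.0, -2.0, 0.5, 0.0, 1.0]) + rng.normal(scale=0.5, size=100)

for alpha in (0.01, 1.0, 100.0):
    ridge = Ridge(alpha=alpha).fit(X, y)
    print(f"alpha={alpha:>6}: |coef| sum = {np.abs(ridge.coef_).sum():.2f}")
# Decreasing alpha reduces regularisation, letting the coefficients grow;
# increasing it forces the parameter values down, i.e. a simpler model.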
Real data sets are not typically pre-divided into training, validation, and test data sets in practice. It is therefore crucial to be able to automatically separate a ratings matrix's entries into these groups. Most of the division techniques, including hold-out and cross-validation, divide the data set into two parts rather than three. However, the three required segments can still be obtained. It is possible to create
m
the necessary three segments by first splitting the rating entries into training and test
portions, and then further segmenting the validation component from the training
data. As a result, in the sections that follow, we’ll go through how to divide the entries’
)A
ratings matrix into training and testing halves using techniques like hold-out and cross-
validation. However, the model-building and validation portions of the training data are
also divided using similar techniques. The above figure exemplifies this hierarchical
divide. Even though the same approach can also be utilised for the second level
separation into model building and validation components, we will consistently use the
(c
language of the first level of division into “training” and “testing” data in the following. To
prevent misunderstanding, language is used consistently.
Figure: Hierarchical division of rated entries into training, validation, and testing portions
Hold-Out
In the hold-out method, a portion of the ratings matrix's entries are concealed, and the model is trained using the remaining entries. The accuracy of predicting the concealed entries is then reported as the overall accuracy. Because the entries used for evaluation are hidden during training, this method ensures that the reported accuracy is not the consequence of over-fitting to the particular data set. However, such a strategy underestimates the true accuracy. First, not all entries are used in training, so the full potential of the data is not utilised. Second, consider the scenario in which the held-out entries have a higher average rating than the complete ratings matrix. In that case the held-in entries have a lower average rating than the held-out entries, and the evaluation will be pessimistically biased.
Cross-Validation
The rating entries are separated into q equal sets for the cross-validation procedure. As a result, if S is the set of specified entries in the ratings matrix R, |S|/q represents the size of each set in terms of entries. One of the q segments is used for testing, while the remaining (q − 1) segments are utilised for training. In other words, during each such training procedure a total of |S|/q entries are hidden, and the accuracy is then assessed over these entries. Each of the q segments serves as the test set for this process, which is repeated q times. The reported accuracy is the mean across the q distinct test sets. Keep in mind that when the value of q is large, this method can closely approximate the true accuracy. A special case is to choose q equal to the number of specified entries in the ratings matrix, so that testing uses one entry and training uses the remaining |S| − 1 rating entries. Leave-one-out cross-validation is the name of this method. Even though such a method can approximate the accuracy quite closely, training the model |S| times is typically too expensive.
In practice, q's value is set to a specific number, say 10. However, for the particular situation of neighborhood-based collaborative filtering algorithms, leave-one-out cross-validation can be implemented particularly efficiently.
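A hedged sketch of q-fold cross-validation with q = 10 follows, using scikit-learn's cross_val_score on synthetic regression data (the model and data are placeholders for illustration, not the ratings-matrix setting described above).

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = X @ np.array([1.5, -2.0]) + rng.normal(scale=0.1, size=100)

# Each of the q = 10 segments serves once as the test set; the reported score
# is the mean across the q distinct test sets.
scores = cross_val_score(LinearRegression(), X, y, cv=10)
print("mean R^2 across 10 folds:", scores.mean())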
Bootstrap Method
Bootstrapping, a statistical concept, is a resampling approach that uses the replacement procedure to simulate samples from a data collection. The method of bootstrapping enables one to derive standard errors, infer information about the population, and guarantee that data is tested effectively.
In other words, bootstrapping is a statistical technique that tests a dataset by resampling it with replacement in order to evaluate statistics of a given population.
With this method, a dataset is repeatedly sampled with random replacement. This technique, a statistical procedure belonging to the class of resampling methods, makes sure that the statistics being assessed are as exact and objective as feasible.
The bootstrapping method repeatedly employs the samples provided by the study in order to apply the replacement methodology and guarantee that the simulated samples result in an accurate evaluation.
In addition to guaranteeing the correctness of a given dataset’s sample,
bootstrapping in statistics also enables one to calculate the dataset’s confidence
intervals.
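A minimal bootstrap sketch follows, resampling with replacement (via NumPy) to approximate a 95% confidence interval for the mean; the data values are invented for illustration.

import numpy as np

rng = np.random.default_rng(0)
data = np.array([4.2, 5.1, 3.8, 6.0, 5.5, 4.9, 5.2, 4.4, 5.8, 4.7])

boot_means = np.array([
    rng.choice(data, size=len(data), replace=True).mean()  # one resample
    for _ in range(10_000)
])
lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(f"sample mean = {data.mean():.2f}, 95% bootstrap CI = ({lo:.2f}, {hi:.2f})")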
Hold Out Method
Splitting the data into several sets, one for training and others for validation and testing, is the hold-out method, a strategy used to train machine learning models. A machine learning model's performance on fresh data is tested using the hold-out method. The issue that emerges when analysing machine learning (ML) models is whether the model is the best one that can be found in the model's hypothesis space in terms of generalisation error on the unobserved/future data set; whether the most effective method was used to train and test the model; and which model should be chosen from those that are available. The so-called hold-out strategy is used to address these queries.
Different sets, referred to as the validation set and test set, are isolated or set aside
ity
(thus, the hold-out moniker) from the complete dataset and the model is trained just
on what is referred to as the training dataset rather than using the entire dataset for
training.
The hold-out approach for machine learning model training entails separating the data into several splits, using one split for model training and the other splits for model validation and model testing. Both model evaluation and model selection are done using the hold-out method.

The issue of comparing the models and choosing the best model persists even after the whole set of data has been used to train models with various algorithms. The main task is finding the model with the lowest generalisation error among all the others—in other words, the model that best anticipates future or unknown data. This creates the need for a technique that allows the model to be trained on one dataset and tested on another. The hold-out strategy is useful in this situation.
The hold-out approach for model evaluation represents the process of dividing the dataset into training and test datasets. To create the best model possible, the model is trained on the training set and then tested on the testing set. This strategy is frequently used when the data set is small and there is not enough data to divide into three groups (training, validation, and testing). It has the benefit of being straightforward to use, but it can be sensitive to how the data is split into the two sets: the results can be skewed if the split is not random. All things considered, the hold-out approach for model evaluation is a decent place to start when training machine learning models, but it should be applied with care. The hold-out approach to model evaluation looks as follows.
Figure: Hold-out method for model evaluation
You may have noticed that the data set is divided into two portions in the preceding diagram. One split is reserved or held back for model training. The other set is held back for testing or evaluating the model. The split percentage is determined by the amount of data needed for training. The dataset is often divided so that 70% of it is used for training and 30% for testing the model.
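As a rough illustration (the dataset and model here are placeholders, not from the text), a 70/30 hold-out split can be produced with scikit-learn's train_test_split:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Hold back 30% of the data for testing; train on the remaining 70%
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(model.score(X_test, y_test))  # accuracy on the held-out 30%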
If the objective is to compare models based on their accuracy on the test dataset and choose the best model, this technique is well suited. However, the model might end up fitting the test dataset well, and that is never a guarantee of general performance. In other words, it is assumed that the test dataset accurately represents the population while the models are being trained to increase accuracy. The test error then becomes an optimistically biased estimate of the generalisation error, which is not desirable: a final model trained to fit (or overfit) the test data does not generalise well to the unknown or future dataset.
●● Create two separate datasets (preferably based on a 70–30 percent split; however, the percentage split can vary).
●● Train the model on the training dataset; some fixed set of hyperparameters is chosen while training the model.
●● Evaluate the model on the test dataset.
●● To get a model that can generalise better on the unknown or future dataset, train the final model on the full dataset.

It should be noted that this method relies on dividing the dataset into training and test datasets and employing a predetermined set of hyperparameters for model evaluation. Another method involves dividing the data into three sets and using these sets to choose a model or tune hyperparameters.
Hold-out method for Model Selection
Model selection and hyperparameter tuning can both be done using the hold-out method. In fact, tuning of the hyper-parameters is a term that is sometimes used to describe the model selection process. For the hold-out approach to model selection, the data is divided into three separate sets: training, validation, and test datasets. When using this approach, it is crucial to make sure that the training, validation, and test datasets are representative of the complete dataset; otherwise, the model might not work well with unknown data.
The hold-out strategy for choosing models is represented by the process below:
1. Divide the dataset into three sections: training, validation, and testing datasets.
2. Use various machine learning methods to train various models. For instance, train the classification model using logistic regression, random forest, XGBoost, etc.
3. Adjust the hyper-parameters of the models that were trained using the various techniques to create new models. Change the hyperparameter values and produce several models for each of the algorithms stated in step 2.
4. Use the validation dataset to evaluate how well each of these models (belonging to each algorithm and hyperparameter setting) performs.
5. Select the model that performs best on the validation dataset.
6. Examine the most effective model's performance using the test dataset.

Take note of the original dataset's three distinct splits. The most effective model is chosen after numerous iterations of the training, adjustment, and evaluation processes. The final model is then assessed on the test dataset.
Figure: Hold out method for model selection
The cross-validation hold-out approach involves the steps listed below:
◌◌ Training sets (training, validation) and a test set are separated from the data set.
◌◌ A subset of the data is used to construct the machine learning model, which is subsequently tested using the remaining data.
◌◌ To produce an average performance metric from K machine learning models, this technique is done K times with different random partitioning. One sample from the data set is left out (referred to as the test data set) for each machine learning model training, and the model then attempts to forecast its value using the test data set. This procedure is repeated until the machine learning model has predicted all samples at least once (i.e., until the entire data set is covered).
Cross-validation techniques help to avoid over-fitting or under-fitting of machine learning models. One of the most widely used techniques is the cross-validation hold-out method, in which a machine learning model is trained using a subset of the data and then evaluated using the remaining data. Another method that helps prevent these issues is leave-one-out cross-validation, which trains with the majority of the data while leaving one observation out as a test case.
2.1.6 Hands On: Introduction to Hold Out Method
The Holdout Strategy is the most basic type of method for classifier evaluation. The training set and test set are two sets that are created using this method from the data set (a collection of data items or instances). A classifier's job is to assign each data point in a collection to a certain category or class.

Example:
Spam and non-spam emails are separated out in our inbox.

The accuracy, error rate, and error estimates of the classifier should be assessed. There are numerous ways to accomplish this. The "Holdout Method" is one of the most basic techniques used to evaluate classifiers.

The holdout approach divides the data set so that the majority of the data belongs to the training set and the remaining data to the test set.

Example
In the case of 20 data points, 12 are included in the training set while the remaining 8 are included in the test set.
◌◌ The training set is used to create a model or classifier after dividing the data set into two sets.
◌◌ We use the data items in the test set after classifier development to check accuracy, error rate, and error estimate.
Regarding the holdout method, it is crucial to keep in mind two points, as follows:

If the maximum number of data items is included in the training set when building the model or classifier, the classifier's error rate and error estimates would be very low and its accuracy high. This indicates a strong classifier or model.
Example –
A teacher coaches a student named "Amit". The teacher covers all potential exam topics, so the student tends to make very few mistakes in exams, resulting in good performance.

However, if nearly all of the data is used for training, too little data remains in the test set, and the evaluation of the classifier becomes unreliable.
If there are more data items in the test set, they can be used to test the classifier that was created using the training set. In terms of accuracy, error rate, and estimation, we can then obtain a more precise evaluation of the classifier.
Example –
A teacher is coaching a student named "Amit". The teacher covers only certain topics that could come up on the test. If the student is then given a number of exams based on this coaching, an accurate assessment of the student's weak and strong points can be made.

The error rate, error estimate, and accuracy of the created classifier can be precisely calculated if sufficient test data are used to evaluate it.
Problem:
When the full data set is divided into two parts, a training set and a test set, it may happen that all data items from class GFG1 are placed entirely in the test set, leaving no data items from class GFG1 in the training set. In that case, the model/classifier is built without ever being trained on GFG1-class data items.
Solution:
Using the stratification technique, data items from the GFG1 class are divided between the two data sets, the training set and the test set, so that data items from class GFG1 are also used to train the model or classifier.

Example –
In this instance, the class GFG1 consists of four data items, and the two data sets, training set and test set, each contain two of them.
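A brief sketch of a stratified hold-out split with scikit-learn (the class labels, counts, and split ratio here are illustrative assumptions; stratify=y preserves the class proportions in both sets):

import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels: 16 items of a majority class and 4 items of class "GFG1" (encoded as 1)
y = np.array([0] * 16 + [1] * 4)
X = np.arange(20).reshape(-1, 1)

# stratify=y guarantees both sets receive GFG1 items in proportion
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0, stratify=y)

print("GFG1 items in train:", (y_train == 1).sum())  # 2
print("GFG1 items in test:", (y_test == 1).sum())    # 2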
Method
By training the model on a subset of the input data and testing it on a previously unseen subset of the input data, you may validate the model's effectiveness. It is also a method for determining how well a statistical model generalises to a different dataset.

This means that we cannot fit our model to the training dataset alone. We set aside a specific sample of the dataset—one that was not included in the training dataset—for this purpose. Cross-validation refers to the entire process of testing our model on that sample before deployment. This is different from the typical train-test split.
For cross-validation, there are a few standard techniques, listed below:
●● Validation set approach
●● Leave-P-out cross-validation
●● Leave one out cross-validation
●● K-fold cross-validation
●● Stratified k-fold cross-validation
Validation Set Approach
In the validation set approach, we separate our input dataset into a training set and a test or validation set, with half of the dataset in each subset.

However, one of the major drawbacks is that we are only using half of the dataset to train our model, which means that the model can fail to capture crucial dataset information. It also frequently produces an underfitted model.
Leave-P-out cross-validation
In this method, p data points are excluded from the training data. This means that if the original input dataset has a total of n data points, n − p data points will be used as the training dataset and p data points as the validation set. The entire procedure is carried out for every possible sample, and the average error is determined to measure the model's efficacy.

This method has a drawback in that it can be computationally challenging for large p.
Leave one out cross-validation
This method is similar to leave-p-out cross-validation with p = 1: a single data point is reserved for testing, and the remaining data is used to train the model. This procedure is repeated for every data point. Thus, for n samples, n distinct training sets and n test sets are obtained. These characteristics describe it:
●● Because all the data points are used, this method has the least amount of bias.
K-Fold Cross-Validation
The input dataset is split into K groups of samples of equal size using the K-fold cross-validation procedure; these samples are called folds. For each learning set, the prediction function uses k − 1 folds, while the remaining fold is used as the test set. This strategy is particularly common because it is simple to grasp and produces results that are less biased than those produced by other techniques.

The steps are:
●● Split the input dataset into K groups
●● Take one group as the test dataset
●● Use the remaining groups as the training dataset
●● Fit the model on the training set and evaluate the performance of the model using the test set.

Let's look at a 5-fold cross-validation example. The dataset is divided into five folds. On the first iteration, the first fold is reserved for testing the model, while the remaining folds are used to train it. On the second iteration, the second fold is used to test the model, while the remaining folds train it. This process is repeated until every fold has served as the test fold.
Take a look at the diagram below:
Figure: 5-fold cross-validation
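A minimal sketch of the 5-fold procedure described above, using scikit-learn (the dataset and estimator are illustrative assumptions, not from the text):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)

# Five folds: each fold serves as the test set exactly once
kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=kf)

# The reported accuracy is the mean over the five test folds
print(scores, scores.mean())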
Stratified k-fold cross-validation
With a few minor adjustments, this method is comparable to k-fold cross-validation. It is based on the stratification principle: the data is rearranged to make sure that each fold or group is a good representation of the entire dataset. It is one of the best methods for addressing bias and variance.

It can be understood using the example of housing costs, where some homes may have substantially higher prices than others. A stratified k-fold cross-validation technique is helpful in such circumstances.
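A short sketch with scikit-learn's StratifiedKFold (the imbalanced toy labels are an illustrative assumption); each fold keeps the class ratio of the full dataset:

import numpy as np
from sklearn.model_selection import StratifiedKFold

# Imbalanced toy labels: 90 of class 0, 10 of class 1
y = np.array([0] * 90 + [1] * 10)
X = np.zeros((100, 1))

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in skf.split(X, y):
    # Every test fold keeps the 9:1 class ratio of the full dataset
    print((y[test_idx] == 1).sum(), "positives out of", len(test_idx))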
Holdout Method
This is the simplest cross-validation methodology available. With this technique, we take out a portion of the data, train the model on the remaining dataset, and obtain prediction results on the held-out portion.

The error that occurs during this phase indicates how well our model will perform on an unknown dataset. Although this method is straightforward to use, it suffers from high variance and occasionally yields inaccurate findings.
Comparison of Cross-validation to train/test split in Machine Learning
●● Train/test split: The input data is divided into a training set and a test set in a ratio of 70:30, 80:20, etc. One of its main drawbacks is the considerable variance it introduces.
◌◌ Training Data: The model is trained using training data, for which the dependent variable is known.
◌◌ Test Data: The model, which has already been trained on the training data, makes predictions using the test data. The test data has the same characteristics as the training data but is not a part of it.
●● Cross-Validation dataset: It is used to overcome the drawback of the train/test split by dividing the dataset into groups of train/test splits and averaging the results. It can be used if we wish to improve the performance of our model after it has been trained on the training dataset. Since every observation is used for both training and testing, this approach makes more efficient use of the available data.
Limitations of Cross-Validation
The cross-validation method has some drawbacks, some of which are listed below:
●● Under ideal conditions it provides optimal output, but with inconsistent data the results can vary drastically, since there is no certainty about the type of data used for validation.
●● Because data in predictive modelling changes over time, there may be variations between the training set and validation sets. For instance, if we develop a stock market value prediction model and the data is trained on the stock prices from the previous five years, the realistic future stock prices for the following five years could be very different; it is challenging to predict the correct output in such circumstances.
Applications of Cross-Validation
●● This technique, which is already being used by data scientists in the field of medical statistics, can be used to assess the effectiveness of various predictive modelling approaches.
●● It has a wide range of applications in the field of medical research.
●● As data scientists are already using it in the field of medical statistics, it can also be utilised for meta-analysis.
Bootstrapping Method
Bootstrapping, a statistical concept, is a resampling approach that draws samples from a data collection with replacement. The method of bootstrapping enables one to derive standard errors, infer facts about the population, and guarantee that data is tested effectively.

Bootstrapping is a statistical technique that tests a dataset by resampling it with replacement in order to evaluate statistics of a particular population.

With this method, a dataset is sampled repeatedly, at random and with replacement. This technique, a statistical test that belongs to the class of resampling procedures, makes sure that the statistics being assessed are as exact and objective as feasible.

The bootstrapping method, in contrast to other sampling distribution methods, repeatedly employs the study-provided samples in order to apply the replacement methodology and guarantee that the simulated samples result in an accurate evaluation.

The degree of assurance that an estimated statistic contains the actual value of the parameter is known as the confidence interval. Let's learn more about the approach now.
The bootstrapping method is widely used and has major applications in the field of statistics. While the Bootstrapping Method and the Jackknife Resampling Method are two of the most popular resampling techniques, we will focus on the Bootstrap technique.

The method generates new hypothetical samples that aid in the testing of an estimated value using the replacement methodology.
The bootstrapping process consists of three quick stages:
●● Randomly choose a sample size.
●● Pick an observation from the training dataset in random order.
●● Combine this observation with the sample chosen earlier.

The samples picked to form the representative sample are referred to as "bootstrapped samples" (of "the bootstrap sample size"). The "out-of-the-bag" samples, on the other hand, are the unchosen samples that make up the testing dataset.
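A small sketch of drawing a bootstrap sample with replacement and identifying the out-of-the-bag samples (NumPy-based; the sample size and toy data are illustrative assumptions):

import numpy as np

rng = np.random.default_rng(0)
data = np.arange(10)  # a toy training dataset of 10 observations

# Draw a bootstrap sample: same size as the data, sampled with replacement
boot_idx = rng.integers(0, len(data), size=len(data))
bootstrap_sample = data[boot_idx]

# Out-of-the-bag samples: observations never drawn into the bootstrap sample
oob = np.setdiff1d(data, bootstrap_sample)
print("bootstrap sample:", bootstrap_sample)
print("out-of-bag:", oob)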
The bootstrapping method entails running a machine learning model on the bootstrapped samples (the training dataset), which is then tested on a fresh dataset, the out-of-the-bag samples.

The method's goal is to give the model the ability to forecast outcomes for the out-of-the-bag samples, which often follow a normal (Gaussian) distribution. The above-mentioned steps are repeated (a minimum of 25 times) using the replacement strategy in order to improve results.
To assess the model's correctness more thoroughly, it should be highlighted that the sample size chosen must be modest and that the procedure should be repeated several times. The bootstrapping method is a very clever resampling technique used to measure the unknown gaps in a model.

There are two different types of bootstrapping methods that can be used in statistics and machine learning, depending on the specifics of how they operate.
For example, the user must be aware of whether the sample has a Gaussian or skewed distribution. Since such an approach already knows the distribution's characteristics, this kind of bootstrap method is more effective.
Hypothesis Testing
The bootstrapping approach is among the most effective ways to test a hypothesis. Unlike conventional methods, the bootstrapping method enables the replacement technique to assess the accuracy of a dataset and reduces the drawbacks of a hypothetical model. Contrary to conventional methods, the bootstrapping approach does not assume that the data is normally distributed when performing hypothesis testing.
Standard Error
Because it uses the replacement strategy, the bootstrapping method can effectively calculate a dataset's standard error. The Standard Error (SE) of a statistical data set shows the estimated standard deviation.

In a particular dataset, SE reflects the true statistical value of a sample while also representing the precision of a sample statistic.

While the standard error for a sample statistic can be computed using conventional methods, the bootstrapping approach makes use of the replacement methodology to ensure that several standard error values are calculated, which together yield the average SE. As a result, the approach can be used to calculate standard error.
Machine Learning
Bootstrapping in machine learning operates rather differently from statistics. In machine learning, the bootstrapping method uses the bootstrapped data to train the models before testing them with the remaining data.

Machine learning has advanced in every conceivable way and is still evolving. To check that ML models and algorithms perform correctly when applied to independent data points, repeatedly sampled data points are used in the replacement technique. Machine learning may be one of the most significant applications of the bootstrapping method.
Bootstrapping Aggregation
The bootstrapping method and the aggregation approach are both included in the data mining process known as bagging, sometimes known as Bootstrapping Aggregation.

Unlike the bootstrapping method, which is a resampling procedure used to obtain samples with the replacement methodology, the aggregation approach integrates the prediction results generated by many machine learning algorithms. Combining this strategy with the bootstrapping method has improved the performance of machine learning models.
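As an illustrative sketch of bagging with scikit-learn (the dataset, base estimator, and parameters here are assumptions, not from the text):

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Bagging: each tree is trained on a bootstrap sample, and predictions are aggregated
bag = BaggingClassifier(DecisionTreeClassifier(),
                        n_estimators=50, bootstrap=True, random_state=0)
print(cross_val_score(bag, X, y, cv=5).mean())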
Confidence Intervals
The bootstrapping method replaces samples with data points from a shared dataset, producing results for confidence intervals (CI) that are more precise and effective, since each bootstrap sample can be evaluated in relation to the other samples that were taken.

Last but not least, one can validate the overall correctness of a sample statistic by using the bootstrapping method, which is known to examine the accuracy of statistics such as the confidence interval.
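A rough sketch of a percentile bootstrap confidence interval for a sample mean (NumPy; the number of resamples and the confidence level are illustrative assumptions):

import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=50, scale=10, size=200)  # a toy sample

# Resample with replacement many times, recording the statistic of interest
boot_means = [rng.choice(data, size=len(data), replace=True).mean()
              for _ in range(2000)]

# Percentile method: the middle 95% of bootstrap means forms the CI
lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(f"95% CI for the mean: [{lo:.2f}, {hi:.2f}]")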
Pros of the bootstrapping method are the following:

Functionally Simpler
The bootstrapping method offers a functionally simpler alternative when the usual methods for calculating statistics are otherwise too complex. It makes it possible to assess a model in a straightforward manner, with fewer stages and easy tests.
Accurate Results
One of the most popular resampling techniques, the bootstrapping approach does not rely on any assumptions for the concept to function. In contrast to traditional methods that depend on a theoretical concept to provide results, the bootstrapping approach merely observes the findings and works on them, generating correct results. The strategy works even when the theory contradicts the actual data, making it particularly useful in this regard.

On the other hand, the method may underestimate the variability of the distribution, despite the fact that it is advised for small sample sizes. When dealing with rare extreme values, the approach tends to favour closer values, leaving out the near-end values.
To ensure that the predictions in machine learning accurately represent the desired phenomenon, it is essential to be able to evaluate models in the right way (disease prediction, future cost estimation, etc.).

But with so many options available (Accuracy, Precision, Recall, etc.), it can be difficult for data scientists to decide which performance statistic to utilise. To measure the performance of the model objectively and in the appropriate context, it is crucial to choose the appropriate metric for a given model.

Performance evaluation metrics are crucial when evaluating the performance of our machine learning model on a dataset that it has never seen before. There is a good chance that the model will always perform better on the dataset it was trained on. However, we develop machine learning models to be effective at handling challenges in the real world, where data is constantly flowing. There is no use in employing machine learning to tackle a problem if the model is not strong enough to perform well. This is where performance evaluation metrics come into play. A performance evaluation metric determines whether or not your machine learning model will successfully solve the problem for which it was trained.
Classification
Since certain terms are used consistently across the majority of performance metrics, it is crucial to ensure that they are understood before diving into the performance metrics themselves.

The real outcome (often abbreviated as "y") and the anticipated outcome (typically abbreviated as "ŷ") are two notions that are crucial when assessing the effectiveness of a classification model.
Consider, for example, a dataset in which each sample is marked with a flag indicating whether or not a disease will progress. In this situation, the label might indicate whether the sickness will manifest (y=1) or not (y=0). When a sample is provided to a machine learning model, the goal is for the projected and actual results to be consistent. The performance of the model is measured by how closely its predictions match the actual values. The four possible outcomes of a single prediction are listed below; generally speaking, models perform worse because of errors.
●● True Positive (TP): A sample has a positive label (y=1, i.e., the person will actually develop the disease) and a positive prediction (ŷ=1, i.e., the person is projected to develop the disease).
●● True Negative (TN): A sample has a negative label (y=0, i.e., the person will not actually develop the disease) and a negative prediction (ŷ=0, i.e., the person is projected not to develop the disease).
●● False Positive (FP): A sample has a negative label (y=0, i.e., the person will not actually develop the disease) but is predicted to be positive (ŷ=1). The sample is "falsely" predicted to be positive in this instance.
●● False Negative (FN): A sample has a positive label (y=1, i.e., the person will actually develop the disease) but is predicted to be negative (ŷ=0). The sample is "falsely" predicted to be negative in this instance.

TP, TN, FP, and FN are referred to as "Positive" and "Negative" despite the fact that the classes are typically labelled 1 and 0; the values are arbitrary and are also frequently seen labelled as 1 and -1.

The meanings of False Positive and False Negative can be difficult to recall, and data scientists frequently need to pause and consider each one's significance before they can recall which one stands for which.

E.g. False Positive: although the sample was predicted to be Positive, it was actually Negative.
Confusion Matrix
The so-called Confusion Matrix, which is just a table organising the four values,
typically presents True Positive, True Negative, False Positive, and False Negative in a
tabular manner.
Figure: Confusion matrix
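A quick sketch of computing these four values with scikit-learn (the labels are illustrative):

from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# For binary 0/1 labels, sklearn orders the matrix as [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("TP:", tp, "TN:", tn, "FP:", fp, "FN:", fn)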
Performance Metrics

Accuracy
Accuracy is the percentage of correct predictions made by our model out of all the predictions. This means that we divide the number of predictions that were correctly classified as Positive (TP) or correctly classified as Negative (TN) by the total number of forecasts, both correct (TP, TN) and incorrect (FP, FN).
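As a formula (reconstructed here, since the original equation image is not reproduced):

Accuracy = (TP + TN) / (TP + TN + FP + FN)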
Accuracy ranges between 0 and 1; the extremes correspond to predictions that were either wholly accurate or entirely off. For instance, if our model is 100% accurate, it will not produce any False Positives or False Negatives, making the numerator equal to the denominator and the Accuracy equal to 1.

In contrast, if our system is always wrong, the number of True Positives and True Negatives will be zero, making the numerator zero and the Accuracy 0.

Since we can simply reverse the labels of the forecasts to get a better prediction if the accuracy is less than 0.5, accuracy in real life actually ranges between 0.5 and 1.
However, accuracy is not a particularly useful metric, particularly when the data is unbalanced. Accuracy does not provide the complete picture when there is a big gap between the number of positive and negative labels. Consider a scenario where we have 100 samples, 95 of which are Class 0 samples and 5 of which are Class 1 samples. In this instance, a weakly constructed "dummy" model that consistently predicts Class 0 earns 95 percent Accuracy, suggesting a highly powerful model even though it is not truly predictive. Accuracy is not the appropriate criterion to assess this model's power. If we evaluated this model on accuracy alone, we would eventually deliver stakeholders and clients a model that is neither effective nor prognosticative.
Precision
Data scientists typically employ Precision, Recall, and Specificity to get around the limits of Accuracy. Precision measures what percentage of positive forecasts were accurate. To do this, it divides the number of samples that were correctly predicted as positive (TP) by the total number of positive predictions, correct or incorrect (TP, FP).
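Expressed as a formula (reconstructed, since the original equation image is not reproduced):

Precision = TP / (TP + FP)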
Recall (also called Sensitivity) measures the percentage of actual positives that were correctly identified: it divides the number of positive samples correctly predicted as positive (TP) by the total number of positive samples (TP, FN).

Specificity measures the percentage of actual negatives that were accurately identified. It does this by dividing the number of negative samples correctly predicted to be negative (TN) by the total number of negative samples, whether correctly predicted negative or wrongly forecast positive (TN, FP).
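As formulas (reconstructed, since the original equation images are not reproduced):

Recall = TP / (TP + FN)
Specificity = TN / (TN + FP)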
If we apply Precision, Recall, and Specificity in the example used to demonstrate the inadequacies of Accuracy, we obtain:
●● Accuracy: 0.95
●● Recall: 0

By using performance metrics other than accuracy, we can better appreciate that a model that consistently predicts the majority class is actually a low-performance model (Recall = 0), even though its accuracy looks good (Accuracy = 0.95).
Area Under the ROC Curve (AUC)
As we've seen, one of the problems with Accuracy is that it can result in excessively inflated performance if the distribution of the classes is not well balanced. AUC, which stands for "Area under the ROC Curve", measures the full two-dimensional area beneath the complete ROC curve (more on this later).

It represents a performance average over all potential classification thresholds. AUC can also be understood as the likelihood that the model will rank a random positive sample higher than a random negative sample.

Even though it can only be used in binary classification scenarios (i.e., not with more than two target classes), AUC is a great metric, especially when dealing with imbalanced classes, and it is one of the most often used performance measures in classification.
F1 Score
The F1 score is the weighted harmonic mean of Precision and Recall, combining both metrics into a single value that is high only when both are high.
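As a formula (reconstructed, since the original equation image is not reproduced):

F1 = 2 × (Precision × Recall) / (Precision + Recall)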
Performance Charts
ROC Curve
A ROC curve is a graph that displays how well a classification model performs across all thresholds. The True Positive Rate (TPR) is plotted on the chart's y-axis, while the False Positive Rate (FPR) is plotted on the x-axis.

A 45-degree diagonal line is the worst-case (random chance) scenario. The ideal situation is a line that travels vertically first and then horizontally.

As the classification threshold is lowered, the model begins to classify more data as positive, which raises the number of both False Positives and True Positives.

Figure: ROC curve
Precision/Recall Curve
A Precision/Recall Curve, like the ROC Curve, depicts performance, with Precision on the y-axis and Recall on the x-axis. Each point corresponds to a different threshold value. The best-case scenario is the reverse of the ROC Curve's: essentially a horizontal line that then turns vertical. The worst-case (random chance) scenario, on the other hand, is represented by a horizontal line with Precision = 0.5.
Figure: Precision/Recall curve
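A short sketch of computing AUC and the ROC curve points with scikit-learn (the labels and scores are illustrative):

from sklearn.metrics import roc_curve, roc_auc_score

y_true = [0, 0, 1, 1, 0, 1]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7]  # predicted probabilities

# One (FPR, TPR) point per threshold; AUC summarises the whole curve
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print("AUC:", roc_auc_score(y_true, y_score))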
In situations where False Positives and False Negatives do not have the same impact, picking the appropriate metric is crucial. With machine learning models, there is typically a trade-off between False Positives and False Negatives; ideally, we would want a forecast that is flawless in terms of both (both zero).

For instance, it may be argued that False Negatives are more costly if our model determines whether a person has a fatal disease, such as cancer. If a person has the condition, we want to make sure we accurately flag them; we are less concerned if we mistakenly label someone as having the sickness when they do not.

On the other hand, it may be argued that False Positives are more costly if our model determines whether a person is guilty or innocent: we want to make sure that nobody is unjustly imprisoned.
Regression
It is best to emphasise certain important introductory principles before delving more deeply into the performance measures.

Error is a rather intuitive measure and needs only a brief formal definition. When we talk about errors in the context of machine learning performance, we refer specifically to the difference, or "delta", between the actual target value y and the predicted value ŷ.
The real outcome (often abbreviated as "y") and the anticipated outcome (typically abbreviated as "ŷ") are equally crucial when assessing the effectiveness of a regression model. When a sample is provided to a machine learning model, the goal is for the predicted and actual results to be consistent, and the performance of the model is measured by how closely its predictions match the actual values.

The greater the discrepancy between the actual outcome y and the anticipated outcome ŷ, the more "off" the model is from being an exact depiction of the phenomenon; the closer the values, the better the system performs.
Mean Squared Error (MSE)
In essence, Mean Squared Error computes the difference between the estimated and real values, squares the answers, and then finds the average of those numbers.

Because the errors are squared, MSE can only take non-negative values. MSE is typically positive and not zero, because most processes have inherent unpredictability and noise. MSE uses the same units of measurement as the variance, i.e. the square of the quantity being estimated.

Mean Squared Error shares the same major drawback as variance in that it is not robust to outliers. If one sample's y and associated error are significantly larger than those of the other samples, the square of the error will be substantially bigger. This, together with the fact that it averages the errors, renders MSE susceptible to outliers.
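As a formula over n samples (reconstructed, since the original equation image is not reproduced):

MSE = (1/n) Σ (yᵢ − ŷᵢ)²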
Root Mean Squared Error (RMSE) / Root Mean Squared Deviation (RMSD)
RMSE computes the average of the squared errors over all samples, just like MSE does, but then takes the square root of the outcome, i.e. the square root of MSE.

By doing this, RMSE provides an error measure in the same unit as the target variable. For instance, if our target y is the dollar amount of sales for the upcoming year, RMSE will give the error in dollars, whereas MSE gives the error in dollars squared, which is considerably less interpretable.
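As a formula (reconstructed, since the original equation image is not reproduced):

RMSE = √MSE = √[(1/n) Σ (yᵢ − ŷᵢ)²]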
Mean Absolute Error (MAE)
The MAE uses the absolute value because what interests us is the absolute distance, not the direction in which the estimated and actual target values differ (estimated > actual or vice versa). By doing this, errors do not cancel one another out when the MAE is calculated.

Because the MAE formula does not square errors, unlike MSE, it does not penalise larger errors more severely than smaller ones. Another benefit is that, unlike RMSE, MAE does not square the units, which makes the results easier to understand.

Mean Absolute Percentage Error (MAPE)
MAPE expresses the error as the mean absolute percentage error. It is calculated in a manner similar to MAE, dividing each absolute error by the actual value to obtain a percentage.
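As formulas over n samples (reconstructed, since the original equation images are not reproduced):

MAE = (1/n) Σ |yᵢ − ŷᵢ|
MAPE = (100% / n) Σ |(yᵢ − ŷᵢ) / yᵢ|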
Presenting the error as a percentage helps us grasp the relative accuracy of our forecasts. For example, if we were to forecast spending for the following year, an MAE of $50 could represent either a reasonably good or a poor approximation.

If the $50 inaccuracy was made in relation to a $1 million actual expenditure, we could confidently say that the prediction is fairly accurate. A $50 error on a $60 cost estimate, on the other hand, would be a significant departure from the true value: about 83 percent of the expected value, giving a range of $10 to $110, with the upper end almost double the prediction.
R-Squared (R2)
Unlike the previous metrics, which express the error in terms of absolute values, R2 measures the proportion of variance explained by the model. R2 is computed as one minus the ratio of the sum of squares of the residuals to the total sum of squares.
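As a formula (reconstructed, since the original equation image is not reproduced):

R² = 1 − SS_res / SS_tot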
R2 evaluates how well the selected model fits with respect to a baseline horizontal line. R2 is negative if the selected model fits the data less well than a horizontal line. Despite the "square" in the name, the R2 formula allows negative values without breaking any mathematical laws: R2 becomes negative only when the model does not match the data's trend and fits the data worse than a horizontal line.
One disadvantage of R2 is that it increases when more features are added to a model, even when the added features are not inherently predictive. The Adjusted R2 corrects for this: contrary to what happens with R2, the Adjusted R2 value decreases when more "useless" features are added to the model. Because of this, the Adjusted R2 is always lower than or equal to the R2 value. In the Adjusted R2 formula, k is the number of features in the model and n is the number of data points.
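The standard formula (reconstructed, since the original equation image is not reproduced), with n data points and k features:

Adjusted R² = 1 − (1 − R²) · (n − 1) / (n − k − 1)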
Reporting an error metric alongside R2 is usually important. This is because R2 describes the relationship between the model's feature X and the target variable y, whereas error measurements describe how dispersed the data points are in relation to the regression fit. For example, reporting both Adjusted R2 and RMSE enables a more accurate comparison of the model with other benchmarks.
Imbalance Problem
In machine learning, the term "classification" refers to a predictive modelling issue where a class label is predicted for an input set of data. The label or target could belong to two classes, or even more.

When the majority of the data falls under one class label, there is a class imbalance. Both two-class classification and multiclass classification are susceptible to it. Machine learning techniques assume that the data distribution is equal. Therefore, when there is an imbalance in the classes, the machine learning classifier tends to be more biased towards the majority class, which results in incorrect categorisation of the minority class. This happens because the cost function of traditional machine learning algorithms constantly seeks to minimise metrics like the error rate without taking the data distribution into account.
To determine how unbalanced the classes are, the two metrics balanced_accuracy_score and accuracy_score can be compared.

The accuracy_score, in contrast, is the plain accuracy classification score. For a balanced dataset, there will be no difference between these two values; in other words, with a balanced dataset, the accuracy_score and the balanced_accuracy_score are equal.
When given an unbalanced dataset, the machine learning model would, despite being a poor classifier for the minority class, consistently predict the value of the majority class and achieve high classification accuracy. This is referred to as the accuracy paradox.

To get around this, other metrics, such as precision, recall, the F1 score, and the Area Under the ROC Curve, should be taken into account for evaluation. These metrics can be computed with the help of a confusion matrix. It includes:
●● True Positive: Positive outcome is correctly predicted as positive.
●● True Negative: Negative outcome is correctly predicted as negative.
●● False Positive: Negative outcome is wrongly predicted as positive.
●● False Negative: Positive outcome is wrongly predicted as negative.

Figure: Confusion matrix of predicted versus actual positive and negative outcomes
Precision refers to the degree to which we are confident that a positive prediction we made will come true. Mathematically, it is the fraction of true positive predictions among all positive predictions.

Recall is the fraction of actual positive elements that are correctly predicted as positive.

The F1 score is the weighted harmonic mean of the precision and recall.
Area Under the Receiver Operating Characteristic (ROC) Curve: The Receiver Operating Characteristic curve shows the performance of a machine learning method over a range of trade-offs between true positive and false positive error rates. AUC is the performance statistic for the ROC curve. The area under the ROC curve indicates the likelihood that the model will distinguish between observations from the two classes.
Random Resampling:
The minority class is oversampled, while the dominant class is undersampled.

Let's imagine that our target has 20,000 records, of which 19,900 belong to the majority class and 100 to the minority class.

In oversampling, more records are added to the minority class so that it matches the majority class's record count. In our scenario, after oversampling, the minority class count will be 19,900, the same as the original majority class. Oversampling is advised when the dataset is not too big. This method's biggest drawback is that it can result in over-fitting.

In majority class undersampling, records from the majority class are randomly eliminated. In our scenario, the majority class count will equal 100 after undersampling, the same as the original minority class. It is clear that undersampling results in information loss; it is therefore advised for large datasets, where losing some information is not a significant concern. Additionally, undersampling can result in poor generalisation to the test set and under-fitting.
SMOTE (Synthetic Minority Over-sampling Technique) synthesises new elements for the minority class based on the records it already has. It selects a point from the minority class at random and then determines that point's k-nearest neighbours. Depending on how much oversampling is necessary, neighbours are selected at random from the k closest neighbours, and the synthetic points are inserted between the selected point and its neighbours.

Since no data is eliminated, SMOTE keeps all information and will be more accurate than random under-sampling. However, training the model will take more time.
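A brief sketch using the imbalanced-learn package (assuming it is installed; the class proportions are illustrative):

from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Imbalanced toy data: roughly 95% majority class, 5% minority class
X, y = make_classification(n_samples=2000, weights=[0.95], random_state=0)
print("before:", Counter(y))

# SMOTE synthesises minority-class points between real minority neighbours
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print("after:", Counter(y_res))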
Ensemble techniques:
Here, the same problem can be addressed using n separate classifiers trained on n different training sets, whose predictions are combined. The ensemble is generally more accurate than its individual classifiers. A random forest, for example, is made up of numerous decision trees and makes use of ensemble learning.
Decision tree learning is a predictive modelling technique used in statistics, data mining, and machine learning to address categorisation problems. It reflects decisions, is used for decision-making, and has an upside-down tree-like form. Because its tree-like hierarchical structure enables it to learn from both classes, it frequently performs well on imbalanced datasets.
XGBoost — Extreme Gradient Boosting
XGBoost is short for Extreme Gradient Boosting.

Gradient boosting is a potent machine learning ensemble approach that combines numerous classifiers to deliver excellent performance and address challenging problems. XGBoost is an improved and modified gradient-boosting decision tree method. It uses parallel processing, which gives it a fast execution rate, and it contains a built-in mechanism to deal with missing data.

In gradient boosting, decision trees are fitted one at a time by minimising the error gradient. A standard implementation stops splitting a node as soon as it encounters a negative loss. XGBoost, however, splits up to the set maximum depth and then prunes the tree backward to eliminate pointless comparisons or subtrees.

The XGBoost package in R and Python can be used to perform extreme gradient boosting.
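A minimal usage sketch with the Python xgboost package (assuming it is installed; the dataset and parameter values are illustrative assumptions):

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# max_depth caps tree growth; trees are then pruned backward internally
model = XGBClassifier(n_estimators=100, max_depth=4, learning_rate=0.1)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))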
Summary
●● The regression tree-based approach is frequently used in the field of analytics due to its many benefits. The user may view each level, which enables wiser decisions and allows the choice's considerations to be ranked in order of significance. Compared to many other methods, regression analysis is a straightforward method for making decisions. The majority of the unwanted data is filtered away as you descend further down the tree, leaving fewer data. The regression tree is easy to build, and it may be displayed to higher-ups as a chart or a straightforward diagram.
●● CART (Classification and Regression Tree Analysis) is a simple yet efficient analytical technique for identifying the variables that are most "important" (in terms of explanatory power). Classification and regression tree analyses hold exciting promise for nursing and other healthcare investigations. The method is a straightforward, computationally driven, and useful way to model interactions between health-related elements that could otherwise go unreported. The importance of this cannot be overstated, because unidentified factors frequently affect patient outcomes in healthcare research.
●● The capacity to identify and evaluate the importance of these components is what makes this technique so attractive. CART is a powerful algorithm that is also fairly easy to explain when compared to other machine learning methods. It doesn't require a lot of processing power, which lets you build models quickly. While it is vital to avoid overfitting your data, this approach works well in straightforward circumstances. Classification and regression tree analysis also makes it possible to define interactions between elements relevant to health.
●● Over-fitting and under-fitting are the two main issues that affect machine learning and lower the effectiveness of machine learning models. Each machine learning model's primary objective is to generalise effectively. Generalisation is defined here as the ability of an ML model to adapt to a given set of unknown inputs and produce an acceptable output. It indicates that the model can generate trustworthy and accurate output after being trained on the dataset. Under-fitting and over-fitting are therefore the two terms that need to be examined to determine how effectively the model performs and generalises.
●● The holdout method is the most straightforward cross-validation technique. The data set is separated into two sets, the training set and the testing set. The function approximator fits a function using the training set only, and then predicts the output values for the data in the testing set (it has never seen these output values before). The errors it makes are accumulated to give the mean absolute test set error, which is used to assess the model. This method's benefit is that it computes quickly and is typically better than the residual method. Its evaluation, however, can have a high variance: it may heavily depend on which data points end up in the training set and which end up in the test set, so the evaluation may differ significantly depending on how the division is made.
●● Cross-validation and the bootstrap are two of the most popular resampling techniques covered here. Both approaches are valuable tools for the practical implementation of numerous statistical learning techniques. Cross-validation, for instance, can be used to choose the right level of flexibility for a method, while the bootstrap can be used to measure the accuracy of a parameter estimate.
●● When the majority of the data falls under one class label, there is a class imbalance. Both two-class classification and multiclass classification are susceptible to it. Machine learning techniques assume that the data distribution is equal. Therefore, when there is an imbalance in the classes, the machine learning classifier tends to be more biased towards the majority class, because the cost function seeks to minimise metrics such as the error rate without taking the data distribution into account.
Glossary
●● Classification: Classification is the process of identifying a function that aids in the classification of a dataset based on several factors.
●● Parametric model: A parametric model is a type of learning model that summarises data with a set of parameters of fixed size.
●● Logistic Regression: Logistic regression is kind of like linear regression, but is used when the dependent variable is not a number but something else (e.g., a "yes/no" response).
●● Linear Regression: Linear regression analysis is a statistical technique for predicting the value of one variable based on the value of another.
●● Neural Networks: Neural networks are structures formed by nodes based on the neural structure of the brain. These nodes are connected to each other bi-directionally.
●● K-Nearest Neighbours: The K-NN algorithm is one of the simplest classification algorithms; it is used to identify the data points that are separated into several classes to predict the classification of a new sample point.
●● ID3: Iterative Dichotomiser 3.
●● Entropy: Entropy is the degree or amount of uncertainty in the randomness of elements. In other words, it is a measure of impurity.
●● Information gain: Information gain measures the relative change in entropy with respect to the independent attribute.
●● Hyperplane: A decision plane (hyperplane) is one that separates between a set of objects having different class memberships.
●● SVM: Support Vector Machine.
●● Random forest: The random forest classifier is an ensemble algorithm based on bagging, i.e. bootstrap aggregation.
●● Prediction errors: When we talk about a machine learning model, we actually talk about how well it performs and its accuracy, which is known in terms of prediction errors.
●● Over-fitting: A statistical model is said to be overfitted when the model does not make accurate predictions on testing data.
●● Hold-out: In the hold-out method, a fraction of the entries in the ratings matrix are hidden, and the remaining entries are used to build the training model.
●● K-fold cross validation: The k-fold cross-validation approach divides the input dataset into K groups of samples of equal sizes. These samples are called folds.
●● SE: Standard Error.
●● Bootstrapping Aggregation: Bootstrapping Aggregation is an ensemble machine learning technique that accommodates the bootstrapping method and the aggregation technique.
●● Confidence Intervals: A Confidence Interval (CI) is a type of statistic that reflects the probability of a calculated interval containing a true value.
●● Accuracy: Accuracy is the fraction of predictions our model got right out of all the predictions.
●● AUC: Area Under the ROC Curve.
●● ROC Curve: Receiver Operating Characteristic Curve.
●● MSE: Mean Squared Error.
●● MSD: Mean Squared Deviation.
●● RMSE/RMSD: Root Mean Squared Error / Root Mean Squared Deviation.
●● MAE: Mean Absolute Error.
●● MAPE: Mean Absolute Percentage Error.
●● SMOTE: Synthetic Minority Over-sampling Technique.
Check Your Understanding
1.
b) Classification
c) Boosting
d) Overlapping
2. _ _ _ _ is the process of identifying a function that aids in the classification of a dataset based on several factors.
b) Classification
c) Regression
d) None of the mentioned
3. Which is true about the Regression algorithm?
d) None of the mentioned
4. Which of the following problems can be solved by a Classification algorithm?
a) Identification of spam emails
b) Speech recognition
c) Identification of cancer cells
d) All of the above
5. A _ _ _ _ is a type of learning model that summarises data with a set of parameters of fixed size.
a) Parametric model
b) Non-parametric model
c) Both a and b
d) None of the mentioned
6. _ _ _ _ _ is kind of like linear regression, but is used when the dependent variable is not a number but something else (e.g., a "yes/no" response).
a) Linear regression
b) Logistic regression
c) Parametric
d) Non-parametric
7.
b) Naïve Bayes
c) Linear regression
d) None of the mentioned
8.
b) Linear regression
c) Parametric model
d) Naïve Bayes
9. _ _ _ _ are structures formed by nodes based on the neural structure of the brain.
a) Neural Networks
b) Logistic regression
c) Linear regression
d) Naïve Bayes
10. _ _ _ _ is the degree or amount of uncertainty in the randomness of elements.
a) Information gain
b) SVM
c) Entropy
d) None of the mentioned
11. _ _ _ _ measures the relative change in entropy with respect to the independent attribute.
a) Entropy
b) Information gain
c) Neural networks
d) SVM
12. A _ _ _ _ is one that separates between a set of objects having different class memberships.
a) SVM
b) Neural networks
c) Entropy
d) Decision plane
13. _ _ _ _ _ classifier is an ensemble algorithm based on bagging, i.e. bootstrap aggregation.
a) Random forest
b) Entropy
c) Decision plane
14.
a) Testing
b) Bootstrapping
c) Algorithms
d) Cross validation
15. _ _ _ _ _ is a technique for validating the model efficiency by training it on a subset of the input data and testing on a previously unseen subset of the input data.
a) Testing
b) Bootstrapping
c) Cross-validation
Exercise
1. What do you mean by classification and regression?
2. Define the concept of the bias-variance trade-off.
3. Define the under-fitting and over-fitting concepts in machine learning.
4. Explain the hold-out method.
5. Define the cross-validation and bootstrapping methods.
6. What do you mean by performance metrics? Also, give some applications of performance metrics.
7. Define SMOTE.
8. What do you mean by K-Fold Cross-Validation?

Learning Activities
1. How does Python implement the Naïve Bayes algorithm?
Check Your Understanding - Answers
1. a 2. b
3. c 4. d
5. a 6. b
7. c 8. d
9. a 10. c
11. b 12. d
13. a 14. b
15. c
Evaluation Methods

Learning Objectives:
At the end of this module, you will be able to understand:
●● Naive Bayes Classifier for Labelled and Mixed Data Types
●● Performance Metrics for Regression
●● K Nearest Neighbour
●● Basics of Decision Tree
●● Concept of Support Vector Machines
●● Random Forest Classifier and its technique and constructions

Introduction
Performance measurements are part of every machine learning pipeline. They quantify your progress and let you know whether you're making any. All machine learning models require a metric to assess performance, whether they use linear regression or a SOTA method like BERT.

Like performance measurements, every machine learning activity can be divided into two categories: regression and classification. There are many metrics for these problems, but we'll talk about the most common ones and what they reveal about model performance. It's crucial to understand how your model interprets your data!

You have probably seen the evaluation area if you've ever taken part in a Kaggle competition. There is almost always a metric by which they evaluate your performance.

It is not necessary for metrics to be differentiable, because they are used to track and gauge a model's performance (during training and testing). However, if a performance metric is differentiable, it can also be utilised as a loss function for some tasks, such as MSE (perhaps with additional regularisation added to it).
Evaluation Methods
The basic principle behind the resampling approach known as cross-validation is to divide the dataset into training and test data. The model is trained using the training data, and predictions are made using test data that hasn't been seen. If the model performs well on the test data and provides good accuracy, it indicates that the model hasn't overfitted the training data and may be used for prediction.
Naive Bayes Classifiers
Naive Bayes classifiers are a group of classification algorithms built on Bayes' Theorem. It is a family of algorithms rather than a single method, and they are all based on the idea that every pair of features being classified is independent of every other.

Take a look at a hypothetical dataset that details the weather requirements for a round of golf. Each tuple determines whether the weather is suitable for playing golf ("Yes") or not ("No") based on the current conditions.
Here is a table showing how our dataset is organised.
Outlook Temperature Humidity Windy Play Golf
0 Rainy Hot High FALSE No
1 Rainy Hot High TRUE No
2 Overcast Hot High FALSE Yes
3 Sunny Mild High FALSE Yes
4 Sunny Cool Normal FALSE Yes
5 Sunny Cool Normal TRUE No
6 Overcast Cool Normal TRUE Yes
7 Rainy Mild High FALSE No
8 Rainy Cool Normal FALSE Yes
9 Sunny Mild Normal FALSE Yes
10 Rainy Mild Normal TRUE Yes
11 Overcast Mild High TRUE Yes
12 Overcast Hot Normal FALSE Yes
13 Sunny Mild High TRUE No
The feature matrix and the response vector are the two components that make up
the dataset.
●● All of the dataset's vectors (rows), each of which holds the values of the dependent features, are contained in the feature matrix. In the dataset above, the features are 'Outlook', 'Temperature', 'Humidity' and 'Windy'.
●● The response vector contains the value of the class variable (prediction or output) for each row of the feature matrix. In the dataset above, the class variable name is 'Play Golf'.
Assumption
The fundamental Naive Bayes assumption is that each feature makes an:
●● independent
●● equal
contribution to the outcome.
This idea can be regarded as follows in reference to our dataset:
●● We presum that there are no dependent feature pairs. For instance, the fact that
in
it’s “Hot” outside has nothing to do with the humidity, and the fact that it’s “Rainy”
outside has nothing to do with the winds. Therefore, it is presumed that the traits
are independent.
nl
●● Second, equal weight is assigned to each feature (or importance). For instance,
it is impossible to forecast the outcome with accuracy from temperature and
humidity alone. All of the characteristics are relevant and are seen as having an
O
equal impact on the result.
Note: In practical applications, Naive Bayes' assumptions are frequently incorrect. Although the independence assumption rarely holds exactly, the classifier frequently performs well in practice.
Bayes' Theorem
Bayes' Theorem determines the probability of an event occurring given the probability of another event that has already occurred. The mathematical formula for Bayes' theorem is as follows:

P(A|B) = P(B|A) P(A) / P(B)

●● In essence, we are looking for the probability of event A occurring, given that event B is true. Event B is also termed the evidence.
●● P(A) is the priori of A (the prior probability, i.e. the probability of the event before the evidence is seen). The evidence is an attribute value of an unknown instance (here, it is event B).
●● P(A|B) is the a posteriori probability of A, i.e. the probability of the event after the evidence is seen.
Applying this to our dataset, with class variable y and dependent feature vector X = (x1, x2, ..., xn):

P(y|X) = P(X|y) P(y) / P(X)

Just to be clear, here is an example of a feature vector and class variable (refer to the 1st row of the dataset):

X = (Rainy, Hot, High, False)
y = No

Here, P(y|X) essentially refers to the probability of "not playing golf" under the given conditions of "Rainy outlook", "Temperature is hot", "high humidity" and "no wind".
Naive Assumption
It is now time to apply the naive assumption to Bayes' theorem: independence among the features. We split the evidence into its independent parts.

If any two events A and B are independent, then:

P(A,B) = P(A) P(B)

Thus, we arrive at the result:

P(y|x1, ..., xn) = P(x1|y) P(x2|y) ... P(xn|y) P(y) / (P(x1) P(x2) ... P(xn))

which can be written as:

P(y|x1, ..., xn) = P(y) ∏ P(xi|y) / (P(x1) P(x2) ... P(xn))

Since the denominator is constant for a given input, we can remove that term:

P(y|x1, ..., xn) ∝ P(y) ∏ P(xi|y)

We must now build a classifier model. To do this, we compute the probability of the given set of inputs for each possible value of the class variable y and pick the output with the highest probability. Mathematically, this is expressed as:

y = argmax over y of P(y) ∏ P(xi|y)
Please note that P(xi|y) is known as the conditional probability and P(y) is known as the class probability (the prior).

The assumptions that different naive Bayes classifiers make about the distribution of P(xi|y) are what distinguish them from one another.
For each xi in X and each yj in Y, we must determine P(xi|yj). These calculations are tabulated in likelihood tables 1-4, one per feature (the tables accompanied figures that are not reproduced here). For instance, P(temp. = cool | play golf = yes) = 3/9 is the probability that the temperature is cool, given that golf is played.
Additionally, we must find the class probabilities P(y), which are computed in table 5. For instance, P(play golf = Yes) = 9/14.

Using these tables, the two posteriors P(Yes | today's conditions) and P(No | today's conditions) can be computed for a new day and compared (the full numerical working accompanied figures that are not reproduced here); the class with the larger posterior wins. Golf will therefore be played, as predicted.
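As a hedged sketch, the same kind of prediction can be reproduced with scikit-learn's CategoricalNB. The query row and the default alpha=1 smoothing are assumptions here, so the exact probabilities will differ slightly from the hand computation above:

# A minimal sketch of categorical naive Bayes on the golf dataset;
# the query day and the alpha=1 smoothing are illustrative assumptions.
from sklearn.naive_bayes import CategoricalNB
from sklearn.preprocessing import OrdinalEncoder

weather = [["Rainy", "Hot", "High", "False"],
           ["Rainy", "Hot", "High", "True"],
           ["Overcast", "Hot", "High", "False"],
           ["Sunny", "Mild", "High", "False"],
           ["Sunny", "Cool", "Normal", "False"],
           ["Sunny", "Cool", "Normal", "True"],
           ["Overcast", "Cool", "Normal", "True"],
           ["Rainy", "Mild", "High", "False"],
           ["Rainy", "Cool", "Normal", "False"]]
play = ["No", "No", "Yes", "Yes", "Yes", "No", "Yes", "No", "Yes"]

enc = OrdinalEncoder()                  # map category strings to integer codes
X = enc.fit_transform(weather)
clf = CategoricalNB().fit(X, play)

today = enc.transform([["Sunny", "Hot", "Normal", "False"]])  # assumed query
print(clf.predict(today))               # ['Yes'] on this toy data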
The approach we covered previously works with discrete data. When dealing with continuous data, we must make some assumptions about the distribution of values for each feature. The assumptions that different naive Bayes classifiers make about the distribution of P(xi|y) are what distinguish them from one another.
Gaussian Naive Bayes Classifier
In Gaussian Naive Bayes, the continuous values associated with each feature are assumed to be distributed in a Gaussian manner. A Gaussian distribution is also called a normal distribution. When plotted, it produces a bell-shaped curve that is symmetric about the mean of the feature values.
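A minimal sketch of Gaussian Naive Bayes follows; the iris dataset and the 75/25 split are illustrative assumptions, not part of the module's own example:

# A minimal sketch of Gaussian Naive Bayes on a toy continuous dataset.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

x, y = load_iris(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=0)

gnb = GaussianNB().fit(x_train, y_train)   # fits a per-class Gaussian to each feature
print("test accuracy:", gnb.score(x_test, y_test))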
Other popular naive Bayes classifiers are:
●● Multinomial Naive Bayes: Feature vectors represent the frequencies with which certain events were generated by a multinomial distribution. This is the event model typically used in document classification.
●● Bernoulli Naive Bayes: In the multivariate Bernoulli event model, features are independent booleans (binary variables) describing the inputs. This paradigm is also used for document classification problems, where binary term-occurrence features (i.e. whether a word appears in a document or not) are utilised rather than term frequencies (i.e. how often a word appears in the document).
Performance Metrics for Regression
Finding relationships between dependent and independent variables is the goal of the supervised learning technique known as regression. A predictive regression model predicts a numerical, continuous value. Regression metrics therefore differ from classification metrics in many ways: we cannot assess a regression model using the Accuracy measure; rather, the performance of a regression model is reported as the error in its predictions. The most common metrics for assessing the effectiveness of regression models are listed below:

●● Mean Absolute Error (MAE)
●● Mean Squared Error (MSE)
●● R2 Score
●● Adjusted R2
Mean Absolute Error (MAE)
In a regression model, the dependent and independent variables are connected along the line of best fit. To get the MAE, or error in prediction, we must compute the difference between the actual and predicted values. To determine the absolute error for the entire dataset, we take the mean of the absolute differences across all data points:

MAE = 1/N ∑ |Y - Y'|

Here, Y stands for the actual value, Y' for the predicted value, and N for the total number of data points.
MAE is significantly more robust to outliers. One of MAE's drawbacks is that it is not differentiable at zero, so we must use other optimisers, such as Gradient Descent.

Mean Squared Error (MSE)
Mean Squared Error, or MSE, is another metric that can be employed. It calculates the average of the squared differences between the actual values and the model's predicted values. Since squared errors are used, MSE only takes non-negative values, which are typically positive and non-zero; the larger the value, the more flawed the model is.

The MSE calculation formula is as follows:

MSE = 1/N ∑ (Y - Y')²

Here, Y stands for the actual value, Y' for the predicted value, and N for the total number of data points.
R Squared Score
Another popular metric for assessing regression models is the R squared score, commonly referred to as the Coefficient of Determination. The R-squared metric lets us evaluate the performance of the model by comparing it with a constant baseline. To choose the constant baseline, we take the mean of the data and place a line at that mean.

Regardless of whether the values are very large or very small, the R squared score will always be less than or equal to 1.
Adjusted R Squared
Adjusted R squared, as the name implies, is an improvement over the R squared score. R squared has a flaw: its score can improve as terms are added even when the model is not actually improving, which can mislead data scientists. Adjusted R squared, which is always lower than R2, is used to get around this problem, because it adjusts for the growing number of predictors and only increases when there is a real improvement. Its formula is:

Adjusted R2 = 1 - [(1 - R2)(N - 1) / (N - p - 1)]

Here, N is the number of data points and p is the number of independent variables (predictors).
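A minimal sketch computing all four metrics follows; the actual and predicted values are illustrative assumptions:

# A minimal sketch of the regression metrics above with scikit-learn.
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = [3.0, 5.0, 2.5, 7.0]   # illustrative actual values
y_pred = [2.8, 5.4, 2.9, 6.5]   # illustrative predicted values

mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)

# Adjusted R2 from its formula, with N data points and p predictors
n, p = len(y_true), 1
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(mae, mse, r2, adj_r2)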
Naive Bayes with Mixed Data Types
Dealing with problems that call for prediction is the main objective of Bayesian classification. This classification offers efficient learning strategies while combining observed data. A Bayesian viewpoint can be useful for comprehending and assessing learning systems: it generates explicit hypothesis probabilities and reduces the effect of noise in the input data.

Consider the general probability distribution P(x1, x2) over two variables. Using Bayes' rule, we may write, without loss of generality:

P(x1, x2) = P(x1 | x2) P(x2)

If there is a third class variable, c, we obtain:

P(x1, x2 | c) = P(x1 | x2, c) P(x2 | c)

Applying the naive conditional independence assumption for a group of variables x1, ..., xN conditional on the class variable c, we obtain:

P(x1, ..., xN | c) = ∏ P(xi | c)

Each conditional distribution P(xi | c) can then be modelled by a distribution suited to the feature: it makes sense that categorical data (nominal or ordinal) be ascribed to the categorical distribution and continuous data be attributed to the Gaussian distribution.
Installation
The mixed-naive-bayes package can be installed via pip:

pip install mixed-naive-bayes

An example with mixed data is provided below: the first two columns follow a categorical distribution and the last two are continuous. Indicate the indices of the features that will follow the categorical distribution (columns 0 and 1); afterwards, fit and predict as usual. Note that two of the five rows below are illustrative placeholders with the same structure as the rest.

from mixed_naive_bayes import MixedNB

X = [[0, 0, 180, 75],
     [1, 1, 165, 61],   # placeholder row (illustrative)
     [2, 1, 166, 60],
     [1, 0, 173, 68],   # placeholder row (illustrative)
     [0, 2, 178, 71]]
y = [0, 0, 1, 1, 0]

clf = MixedNB(categorical_features=[0,1])
clf.fit(X, y)
clf.predict(X)
To recap: naive Bayes classifiers are a family of classification algorithms built on Bayes' Theorem, all based on the idea that every pair of features being classified is independent of the others. The approach is called naïve because of this independence assumption; the distribution of each feature may then be modelled as Gaussian (normal), Bernoulli, or multinomial. Bayes' theorem itself tells us how to update a prior probability in the light of new information.
Conditional Probability
Consider two events A and B in a sample space (picture them as overlapping regions in a Venn diagram). "A intersection B" refers to the segment that belongs to both A and B. The conditional probability of A given B is the likelihood that event A will occur given that event B has already occurred.

When dealing with conditional probability, simplify by assuming that the entire sample space has been reduced to the event that has already occurred. Our sample space therefore becomes event B, because we are only dealing with the portion of the sample space where B has already occurred. Within this reduced space, where do we find A? The probability of A given B can be expressed as the probability of A intersection B divided by the probability of B, since A contributes nothing more than the intersection part:

P(A|B) = P(A ∩ B) / P(B)
Similarly, if asked to estimate the probability of B given A, i.e. the likelihood of event B occurring given that event A has already occurred, imagine the sample space reduced to A. Dividing the probability of the intersection of A and B by the probability of A gives:

P(B|A) = P(A ∩ B) / P(A)

A slight rearrangement of the denominators in these two equations yields:

P(A|B) P(B) = P(B|A) P(A)

Having found this relationship between the two conditional probabilities, Bayes' Theorem follows by simply moving the probability of B from the left to the right side of the equation:

P(A|B) = P(B|A) P(A) / P(B)

Congratulations, we have reached the Bayes theorem equation. Next, we derive the Bayes theorem in its generalised form.
Imagine the sample space as a rectangle partitioned into three events A1, A2 and A3. As can be seen, these events are mutually exclusive and collectively exhaustive. Let us define these terms.

Mutually Exclusive
If two or more events cannot occur together in a single experiment, the events are said to be mutually exclusive. Heads and tails are mutually exclusive in a coin-tossing experiment because you can only get one of them; heads and tails cannot be obtained simultaneously.

Collectively Exhaustive
Events are collectively exhaustive when, taken together, they cover all possible occurrences. The total sample space is obtained by adding A1, A2 and A3. The probabilities of these three events, A1, A2 and A3, are also called the prior probabilities.
Now introduce an event B that overlaps A1, A2 and A3. The part of B shared with A1 is A1 ∩ B, the part shared with A2 is A2 ∩ B, and the part shared with A3 is A3 ∩ B. The probability of B is therefore the sum of the probabilities of these intersections:

P(B) = P(A1 ∩ B) + P(A2 ∩ B) + P(A3 ∩ B)

The probability of each intersection can be written as:

P(Ai ∩ B) = P(B|Ai) P(Ai)

So the probability of B can be expressed as:

P(B) = ∑ P(B|Ai) P(Ai)

The Bayes theorem now exists in a more generalised form that can encompass n prior probabilities. Combining everything:

P(Ai|B) = P(B|Ai) P(Ai) / ∑ P(B|Aj) P(Aj)

We now understand the Bayes theorem's probability formula and the underlying equation. That is fine, but how is it used in a classification problem? Let's read on.
In a classification problem, the feature values of a record are associated with its class. We can therefore forecast the class outcome if the feature values are provided: given the values of the predictors, the Naive Bayes classifier method calculates a likelihood for each category, and intuitively we choose the class with the highest likelihood.

Let's say there are n features represented by x1, x2, x3, ..., xn. Moreover, the outcome variable y has k classes, represented by C1, C2, C3, ..., Ck. We now wish to assess the likelihood that a record or observation originates from one of the k classes of the outcome variable y, say Ck. If we substitute B = Ck and A = x1x2...xn into the Bayes formula above, we may write this conditional probability as follows:

P(Ck | x1, ..., xn) = P(x1, ..., xn | Ck) P(Ck) / P(x1, ..., xn)

Since the denominator does not depend on the class, in practice we use only the numerator portion: the values of the features xi are given, so the denominator is effectively constant. The numerator is equivalent to the joint probability model, so under the naive independence assumption the joint model may be expressed as:

P(Ck) ∏ P(xi | Ck)

The corresponding classifier, a Bayes classifier, is the function that assigns a class label estimated y = Ck for some k as follows:

estimated y = argmax over k of P(Ck) ∏ P(xi | Ck)
Using a Naive Bayes classifier, our aim is to forecast whether or not a customer will buy a product, given the day, whether a discount was offered, and whether free delivery was available. We can forecast whether a person will purchase the product based on these available facts. A small dataset of 30 records with these variables was presented for this example (the table itself is not reproduced here).

From the dataset, frequency tables are built for each of the three input variables: day, discount, and free delivery.
Let the predictors (day, discount, and free delivery) be an event B and the event Buy be A, for the purposes of intuition. Let's compute the likelihood table for the single variable "day", which takes the values weekday, weekend and holiday (the table is not reproduced here).

From the likelihood table we can determine some conditional probabilities; for example, the probability of no purchase on a weekday works out to

P(No Purchase | Day = Weekday) = 0.18

using the fact that 11 of the 30 customers entered the store on a weekday. The overall likelihood of making no purchase is 6/30, or 0.2: twenty percent of the time, people decide not to buy anything, regardless of the day of the week. Finally, we consider the likelihood of B (such as a particular day) under each outcome: because we have the frequency tables for all three predictor variables, we can create the likelihood tables for all three variables (shown in tables not reproduced here).
To determine whether a consumer will buy a product on a given day, with a discount and with free delivery available, the likelihood tables can be employed. Consider the following conditions together as the event B:

●● Day = Holiday
●● Discount = Yes
●● Free Delivery = Yes

First we find the probability of not purchasing under the conditions above, with A = No Purchase; then, likewise, the probability of purchasing, with A = Purchase.
Working through Bayes' theorem with the likelihood tables gives an (unnormalised) likelihood of making a purchase of 0.986 and of not making one of 0.178. These conditional scores do not yet sum to one, so normalising them is the next step:

Likelihood of purchase = 0.986 / (0.986 + 0.178) = 0.986 / 1.164 = 84.71%
Likelihood of no purchase = 0.178 / 1.164 = 15.29%

Result: since 84.71% is greater than 15.29%, we can conclude that an average customer will buy on a holiday with a discount and free delivery.
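As a minimal sketch, the normalisation step above can be checked in a couple of lines; the two unnormalised scores are taken directly from the worked example:

# A minimal sketch of normalising the two unnormalised likelihoods above.
score_buy, score_no_buy = 0.986, 0.178
total = score_buy + score_no_buy                       # 1.164
print("P(purchase)    =", round(score_buy / total * 100, 2), "%")    # 84.71 %
print("P(no purchase) =", round(score_no_buy / total * 100, 2), "%") # 15.29 %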
Naive Bayes for Text Classification
We again apply Bayes' theorem:

P(Ck | x1, ..., xn) ∝ P(Ck) ∏ P(xi | Ck)

where the xi stand for the different words in the review text and Ck is the class label.

The collection includes text movie reviews labelled with positive and negative polarities. We will learn how the naive Bayes classifier functions using some random examples (not from the dataset).

Text Preprocessing:
To sanitise your data, use the following checklist:
4. Verify that each word's length is more than two (research suggests there are no two-letter adjectives).
6. Eliminate stopwords, such as "the", "and", "a".

After completing these procedures and checking for additional problems, we can begin utilising the clean, labelled data to train models.
Bag of Words Representation:
The next step is making a numerical feature vector for each document. Bag of Words (BoW) keeps track of how many times each token appears in each document of the collection. It produces a matrix with the following properties:

●● The number of columns equals the number of unique tokens (words) in the entire collection of documents.
●● The number of rows equals the total number of documents in the collection.
     pointless  Amazing  Scenes  Loved  Movie  Music  Worst  Story  Poor  Best  No  All  ....  Class
x1   1          1        1       1      0      0      0      0      1     0     0   1    ....  0
x2   1          0        0       0      1      0      0      1      0     0     0   1    ....  1
x3   1          0        0       0      1      1      0      0      0     0     1   0    ....  1
x4   1          0        0       0      0      0      1      0      1     0     0   0    ....  0
x5   1          1        0       1      0      0      1      0      0     1     0   0    ....  0
x6   1          0        0       0      0      1      0      1      0     0     1   0    ....  1
x7   0          0        0       0      0      0      0      0      0     0     1   0    ....  1
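A minimal sketch of building such a Bag of Words matrix follows; the two example reviews are illustrative assumptions:

# A minimal sketch of a Bag of Words matrix with scikit-learn.
from sklearn.feature_extraction.text import CountVectorizer

docs = ["loved the movie amazing scenes",
        "worst movie pointless story poor music"]

vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(docs)        # rows = documents, columns = tokens
print(vectorizer.get_feature_names_out())
print(bow.toarray())                        # token counts per document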
Because the vocabulary is collected throughout the learning process, this is simple to accomplish.

Some important points:
Laplace/Additive Smoothing
Laplace smoothing, also known as additive smoothing, is a statistical technique. Given an observation x = (x1, ..., xd) from a multinomial distribution with N trials and parameter vector θ = (θ1, ..., θd), a "smoothed" version of the data yields the estimator:

θ̂i = (xi + α) / (N + αd),  for i = 1, ..., d

where the smoothing parameter α > 0 acts as a pseudocount (α = 0 corresponds to no smoothing). Because the estimated value falls between the empirical estimate xi/N and the uniform probability 1/d, additive smoothing is a type of shrinkage estimator. Some authors have suggested that α should be 1, following Laplace's rule of succession (in which case the name add-one smoothing is also used), though in practice a smaller value is commonly chosen.
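A minimal sketch of this formula follows; the word counts are illustrative assumptions:

# A minimal sketch of additive (Laplace) smoothing from its formula.
import numpy as np

def smoothed_probs(counts, alpha):
    counts = np.asarray(counts, dtype=float)
    n, d = counts.sum(), counts.size
    return (counts + alpha) / (n + alpha * d)

counts = [3, 0, 1]                    # raw word counts; note the zero
print(smoothed_probs(counts, 0.0))    # unsmoothed: contains an exact zero
print(smoothed_probs(counts, 0.1))    # smoothed: every probability is non-zero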
How, then, do we use Laplace smoothing in this situation? Consider setting the smoothing parameter α to 0.1. By adding this small constant to every word count, we ensure that no conditional probability, such as P(word | class), is ever zero.
Text query 1: "The movie's soundtrack was the worst I've ever heard, and the plot was meaningless."

P(class=1|text) = P(class=1)·P(plot|1)·P(movie|1)·P(pointless|1)·P(worst|1)·P(music|1)
P(class=0|text) = P(class=0)·P(plot|0)·P(movie|0)·P(pointless|0)·P(worst|0)·P(music|0)
= (3/7)·(1.1/3.2)·(3.1/3.2)·(1.1/3.2)·(1.1/3.2)·(2.1/3.2) = 1.10670×10^(-2)

P(class=0|text) is the larger of the two for text query 1, so we classify the query text as a negative review.

Text query 2: "Loved the action parts, and the soundtrack was fantastic as well."

P(class=1|text) = P(class=1)·P(love|1)·P(action|1)·P(scenes|1)·P(music|1)·P(amazing|1)
= (4/7)·(2.1/4.2)·(2.1/4.2)·(3.1/4.2)·(1.1/4.2)·(2.1/4.2) = 1.380790411×10^(-2)
P(class=0|text) = P(class=0)·P(love|0)·P(action|0)·P(scenes|0)·P(music|0)·P(amazing|0)
= (3/7)·(0.1/3.2)·(0.1/3.2)·(0.1/3.2)·(2.1/3.2)·(0.1/3.2) = 2.6822×10^(-7)

P(class=1|text) is larger, so text query 2 is classified as a positive review.
To find the value of alpha that gives the least error, run the Multinomial Naive Bayes classifier for several values of alpha and plot error vs. alpha. Suppose we find that the ideal value of alpha is 6; we then carry out the following actions:

●● Utilise the Multinomial Naive Bayes classifier (with the chosen alpha) to predict the result.
●● Identify the test and training accuracy.
●● Create a confusion matrix and its heatmap.

A confusion matrix is a table that lets us see how well a classification algorithm is performing. A sketch of these steps follows.
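The snippet below is a minimal sketch of this tuning loop; the toy reviews, the alpha grid and the seaborn heatmap are illustrative assumptions, not the module's own dataset:

# A minimal sketch of tuning alpha for Multinomial Naive Bayes.
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import confusion_matrix

docs = ["loved it", "amazing scenes", "worst movie", "pointless plot"] * 10
labels = [1, 1, 0, 0] * 10
X = CountVectorizer().fit_transform(docs)
X_train, X_test, y_train, y_test = train_test_split(X, labels, random_state=0)

alphas = [0.1, 1, 2, 4, 6, 8, 10]
errors = [1 - MultinomialNB(alpha=a).fit(X_train, y_train).score(X_test, y_test)
          for a in alphas]
plt.plot(alphas, errors)              # error vs. alpha plot
plt.xlabel('alpha'); plt.ylabel('test error'); plt.show()

best = MultinomialNB(alpha=alphas[errors.index(min(errors))]).fit(X_train, y_train)
print('train accuracy:', best.score(X_train, y_train))
print('test accuracy:', best.score(X_test, y_test))
sns.heatmap(confusion_matrix(y_test, best.predict(X_test)), annot=True)  # heatmap
plt.show()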
It is also informative to identify the words that appear most frequently in positive and in negative reviews. The frequencies of the most commonly used words were determined from a sample of both positive and negative reviews (the frequency plots are not reproduced here). In the negative sample, we see that a word like "bad" is commonly used in unfavourable assessments; in the positive sample, adjectives like "excellent" typically indicate positive ratings.
Improvements
Certain words appear in numerous documents from both classes and therefore do not provide pertinent information. A helpful method called term frequency-inverse document frequency (tf-idf) can be used to solve this issue: it weighs a word's frequency in a document against how common the word is across the whole collection.

Furthermore, each token in the BoW model we developed stands for a single word; this is called the unigram model. Another option is adding bigrams, in which tokens stand for pairs of consecutive words.
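A minimal sketch of tf-idf features with unigrams and bigrams follows; the documents are illustrative assumptions:

# A minimal sketch of tf-idf with unigrams and bigrams.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["loved the amazing scenes",
        "worst movie ever",
        "the story was pointless"]

vec = TfidfVectorizer(ngram_range=(1, 2))   # unigrams and bigrams
X = vec.fit_transform(docs)                 # tf-idf weighted matrix
print(vec.get_feature_names_out())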
By doing this, we can raise our training accuracy from 87 to 93 percent and our test accuracy from 82.464 to 85.508 percent.

Naive Bayes is a straightforward but effective method for classifying texts. With little effort, we can establish strong baselines and, depending on business requirements, explore more complicated solutions. Naive Bayes is well accepted for text classification and basically serves as a benchmark against which other, more sophisticated classifiers can be compared.
One of the simplest machine learning algorithms, based on the supervised learning method, is K-Nearest Neighbour (K-NN).

The K-NN algorithm assumes that the new case is comparable to the existing cases, and it places the new instance in the category that is most similar to the existing categories.

The K-NN algorithm stores all the existing data and classifies a new data point based on similarity. This means that fresh data can be quickly sorted into a well-suited category using the K-NN method. Although K-NN is mostly employed for classification problems, it can also be utilised for regression.

K-NN is a non-parametric algorithm, which means it makes no assumption about the underlying data.
Because it retains the dataset and acts on it only when it is time to classify, this algorithm is also known as a lazy learner: it does not learn from the training set immediately. The KNN method simply saves the data during the training phase, and when it receives new data, it assigns it to the category most similar to that new data.

Example: Let's say we have a picture of a creature that resembles both cats and dogs, and we want to know which it is. Since the KNN algorithm is based on a similarity measure, we can use it for this identification. Our KNN model will look for similarities between the new image's features and those in the photos of cats and dogs, and based on the most similar features it will classify the new image as either cat- or dog-related.
Why K-NN Algorithm?
Suppose there are two categories, Category A and Category B, and we have a new data point x1. Which category does this data point belong in? We require a K-NN algorithm to address this kind of issue. With K-NN it is simple to determine the category or class of a given data point.

The following algorithm describes how K-NN works:
●● Step-1: Select the number K of neighbours.
●● Step-2: Calculate the Euclidean distance to the data points.
●● Step-3: Take the K nearest neighbours according to the calculated Euclidean distance.
●● Step-4: Among these K neighbours, count the number of data points in each category.
●● Step-5: Assign the new data point to the category for which the number of neighbours is maximum.
Suppose we have a new data point that we need to classify (illustrated in a scatter plot that is not reproduced here).

1. First, we'll decide on the number of neighbours; we'll go with k=5.
2. Next, the Euclidean distance between the data points will be determined. The Euclidean distance, which we have already examined in geometry, is the distance between two points. It is calculated as follows:

d = √((x2 - x1)² + (y2 - y1)²)
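A minimal sketch of this formula follows; the coordinates are illustrative assumptions:

# A minimal sketch of the Euclidean distance between two points.
import math

def euclidean(p, q):
    return math.sqrt((q[0] - p[0]) ** 2 + (q[1] - p[1]) ** 2)

print(euclidean((1, 2), (4, 6)))   # 5.0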
●● By calculating the Euclidean distance, we determine the nearest neighbours: three are in category A and two are in category B.
●● Because three of its five closest neighbours are from category A, this new data point must belong to category A.
The following are some things to keep in mind while choosing the value of K in the K-NN algorithm:

◌◌ There is no specific way to determine the ideal value for K, so we must experiment with different values to find the one that works best. The most preferred value for K is 5.
◌◌ An extremely low value of K, such as K=1 or K=2, can be noisy and lead to outlier effects in the model.
Advantages:
◌◌ It is simple to implement.
◌◌ It is robust to noisy training data.
◌◌ It can be more effective when the training data is large.

Disadvantages:
◌◌ The value of K always needs to be determined, which may be complex.
◌◌ The computation cost is high, because the distance between the new point and every training sample must be calculated.
Problem for the K-NN algorithm: An automobile manufacturer has produced a new SUV. The business wants to show adverts to users who are considering purchasing that SUV. For this problem we have a dataset containing information on users of a social network. The dataset holds a lot of information, but we will focus on the Estimated Salary and Age as the independent variables and Purchased as the dependent variable (the dataset preview is not reproduced here).
Steps to implement the K-NN algorithm:
◌◌ Data pre-processing step
◌◌ Fitting the K-NN algorithm to the training set
◌◌ Predicting the test result
◌◌ Test accuracy of the result (creation of the confusion matrix)
◌◌ Visualising the test set result

The data pre-processing phase will be identical to the Logistic Regression phase. The code is listed below:
# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd

#importing datasets
data_set= pd.read_csv('user_data.csv')

#Extracting Independent and dependent Variable
x= data_set.iloc[:, [2,3]].values
y= data_set.iloc[:, 4].values

# Splitting the dataset into training and test set.
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.25, random_state=0)

#feature Scaling
from sklearn.preprocessing import StandardScaler
st_x= StandardScaler()
x_train= st_x.fit_transform(x_train)
x_test= st_x.transform(x_test)
Our dataset is imported and well pre-processed by running the code above. After feature scaling, our test dataset will resemble the output image (not reproduced here), from which we can see that the data has been scaled correctly.
Fitting the K-NN classifier to the training data: we import the KNeighborsClassifier class from the sklearn.neighbors library and create its classifier object. The main parameter of this class is:
◌◌ n_neighbors: specifies the required number of neighbours for the algorithm. It typically takes 5.

The classifier is then fitted to the training data. The code is listed below:

#Fitting K-NN classifier to the training set
from sklearn.neighbors import KNeighborsClassifier
classifier= KNeighborsClassifier(n_neighbors=5)
classifier.fit(x_train, y_train)
●● Predicting the Test Result: As with Logistic Regression, we will generate a y_pred vector to forecast the outcome of the test set. The code is listed below:

#Predicting the test set result
y_pred= classifier.predict(x_test)

Output: the prediction vector (not reproduced here).

●● Creating the Confusion Matrix: next we import the confusion_matrix function and call it using the variable cm:

#Creating the Confusion matrix
from sklearn.metrics import confusion_matrix
cm= confusion_matrix(y_test, y_pred)

Output: By running the code above, we obtain the matrix shown below:
From the confusion matrix we can see that there are 64 + 29 = 93 correct predictions and 3 + 4 = 7 incorrect predictions, compared to 11 incorrect predictions in Logistic Regression. Therefore, we may conclude that employing the K-NN method enhances the model's performance.
●● Visualising the Training Set Result:
We will now display the K-NN model's training set results. Except for the name of the graph, the code stays the same as it was for Logistic Regression. The code is listed below:

#Visulaizing the trianing set result
from matplotlib.colors import ListedColormap
x_set, y_set = x_train, y_train
x1, x2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step = 0.01),
    nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
    alpha = 0.75, cmap = ListedColormap(('red','green')))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
        c = ListedColormap(('red', 'green'))(i), label = j)
mtp.title('K-NN Algorithm (Training set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()
Output:
The graph produced in the output is different from the one we saw for Logistic Regression. The following points will help you understand it:

◌◌ The graph displays red points and green points. The green points represent users who purchased the SUV, and the red points those who did not.
◌◌ Because the K-NN technique finds the nearest neighbours, the graph shows an irregular boundary rather than a straight line or curve.
◌◌ Most users who chose not to purchase the SUV are located in the red region of the graph, while most of those who did purchase are located in the green region. The classifier does not try to capture every single point, because doing so avoids over-fitting problems in the model.
◌◌ As a result, our model is well trained.
●● Visualising the Test Set Result:
After the model has been trained, we now test the outcome on a fresh dataset, the test dataset. Apart from a few small changes, the code is unchanged: x_train and y_train are replaced with x_test and y_test, respectively.

#Visulaizing the test set result
from matplotlib.colors import ListedColormap
x_set, y_set = x_test, y_test
x1, x2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step = 0.01),
    nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
    alpha = 0.75, cmap = ListedColormap(('red','green')))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
        c = ListedColormap(('red', 'green'))(i), label = j)
mtp.title('K-NN Algorithm (Test set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()

Output:
The graph above displays the output for the test dataset. The predicted result is good, because most of the red points are located in the red region and most of the green points in the green region. However, there are a few green points in the red region and a few red points in the green region; these are the incorrect observations that we noted in the confusion matrix (7 incorrect outputs).
Decision Tree
●● A decision tree has two kinds of nodes: decision nodes and leaf nodes. Decision nodes are used to make decisions and have multiple branches, whereas leaf nodes are the outcomes of those decisions and do not contain any further branches.
●● The decisions or tests are performed on the basis of features of the given dataset.
●● It is a graphical representation for obtaining all feasible solutions to a problem or decision based on given conditions.
●● It is called a decision tree because, like a tree, it begins at a root node and expands on further branches to form a tree-like structure.
●● A tree is constructed using the CART algorithm, which stands for Classification and Regression Tree algorithm.
●● A decision tree simply asks a question and splits into subtrees according to the answer (Yes/No).
●● The general layout of a decision tree is shown in a diagram (not reproduced here).
The most important thing to keep in mind while developing a machine learning model is to select the optimal algorithm for the dataset and task at hand. The two rationales for employing a decision tree are as follows:

●● The logic behind a decision tree is easy to grasp because it displays a tree-like structure.
●● Decision trees typically imitate the human capacity for thinking while making a decision, so they are easy to understand.
Decision Tree Terminologies
●● Root Node: The decision tree begins at the root node. It represents the full dataset, which is then split into two or more homogeneous sets.
●● Leaf Node: A leaf node is a final output node; the tree cannot be divided further after a leaf node.
●● Splitting: In splitting, the decision node or root node is divided into sub-nodes according to the given conditions.
●● Pruning: The process of removing unwanted branches from the tree.
●● Parent/Child node: The root node of the tree is called the parent node, and the other nodes are called the child nodes.
How does the Decision Tree algorithm Work?
In a decision tree, the algorithm begins at the root node and works its way down to forecast the class of the given dataset. The algorithm compares the values of the root attribute with those of the record's (real dataset's) attribute, follows the corresponding branch and jumps to the next node.

At the next node, the algorithm again compares the attribute value with the other sub-nodes and continues. It keeps doing this until it reaches a leaf node of the tree. The following algorithm describes the entire procedure:

●● Step-1: Begin the tree with the root node, say S, which contains the complete dataset.
●● Step-2: Use the Attribute Selection Measure (ASM) to identify the best attribute in the dataset.
●● Step-3: Divide S into subsets that contain possible values for the best attributes.
●● Step-4: Generate the decision tree node that contains the best attribute.
●● Step-5: Recursively design new decision trees using the subsets of the dataset created in Step-3. Continue along this path until you reach a point where you can no longer classify the nodes; the final node is a leaf node.
Example
Imagine an applicant who has received a job offer and is debating whether to accept it. The decision tree starts at the root node (the Salary attribute, chosen by ASM) to tackle this problem. Based on the corresponding labels, the root node splits into the next decision node (distance from the office) and one leaf node. The next decision node is further divided into a decision node (cab facility) and a leaf node. The final decision node splits into two leaf nodes (Accepted offer and Declined offer).
Attribute Selection Measures
The fundamental problem that arises while developing a decision tree is how to choose the best attribute for the root node and for the sub-nodes. A technique known as the attribute selection measure (ASM) can be used to tackle this; with it we can easily choose the ideal attribute for the tree's nodes. There are two widely used ASM approaches:

●● Information Gain
●● Gini Index

1. Information Gain:
◌◌ Information gain is the measurement of the change in entropy after segmenting a dataset on an attribute.
◌◌ It tells us how much information a feature gives us about a class.
◌◌ We split the node and build the decision tree according to the value of the information gain.
◌◌ A decision tree algorithm always tries to maximise the value of information gain, and the node or attribute with the largest information gain is split first.

If S is a set of instances, A is an attribute, Sv is the subset of S with A = v, and Values(A) is the set of all possible values of A, then information gain can be computed with the following formula:

Gain(S, A) = Entropy(S) - ∑ (|Sv| / |S|) × Entropy(Sv),  summed over v in Values(A)

Entropy measures the impurity (randomness) of a set:

Entropy(S) = -∑ p(c) log2 p(c),  summed over the classes c

Example:
For the set X = {a, a, a, b, b, b, b, b}:
Total instances: 8
Instances of b: 5
Instances of a: 3

Entropy(X) = -[(3/8) log2(3/8) + (5/8) log2(5/8)]
= -(-0.53 - 0.424)
= 0.954
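A minimal sketch verifying this entropy computation follows:

# A minimal sketch of the entropy calculation above.
import math

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

print(round(entropy([3, 5]), 3))   # 0.954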
The essentials:
◌◌ Begin with all training instances connected to the root node.
◌◌ Use information gain to choose which attribute to label each node with.
◌◌ Note that no root-to-leaf path should contain the same discrete attribute twice.
◌◌ Build each subtree recursively on the subset of training instances that would be classified down that path in the tree.
◌◌ If the remaining training instances at a node are all positive or all negative, label that node "yes" or "no" accordingly.
◌◌ If no attributes are left, label with a majority vote of the remaining training instances at that node.
◌◌ If no instances remain, label with a majority vote of the parent's training instances.

Example:
Now, utilising information gain, let us create a decision tree from a training set with 3 features (X, Y and Z) and 2 output classes (the table is not reproduced here).

We take each feature in turn and compute the information gain for it: splitting on feature X, on feature Y, and on feature Z (the worked calculations accompanied figures that are not reproduced here). The information gain is greatest when we split on feature Y; as a result, feature Y is the best-suited feature for the root node.

We can now observe that when we split the dataset by feature Y, each child node comprises a pure subset of the target variable. As a result, we do not need to split the dataset any further, and the tree for this dataset is complete.
2. Gini Index
◌◌ The Gini Index is a metric that measures how often a randomly chosen element would be incorrectly classified.
◌◌ It is computed as Gini = 1 - ∑ (pj)², where pj is the proportion of class j.
◌◌ An attribute with a lower Gini index should therefore be preferred.
◌◌ Sklearn supports the "gini" criterion for the Gini Index, and it takes the "gini" value by default.

Example:
Consider the dataset below and create a decision tree using the Gini index.
The dataset (not reproduced here) has 5 attributes; attribute E is the predicted feature and contains two classes (Positive and Negative), in equal proportion. To categorise each attribute with the Gini Index, we must select some threshold values; the chosen values for this dataset accompanied a table that is not reproduced here.
For attribute A (threshold value 5):

Value >= 5: 12 instances, of which 5 are positive and 7 negative.
Gini(5,7) = 1 - [(5/12)² + (7/12)²] = 0.4860

Value < 5: 4 instances, of which 3 are positive (Attribute A < 5 & class = positive: 3/4) and 1 negative.
Gini(3,1) = 1 - [(3/4)² + (1/4)²] = 0.375

Weighting and summing each of the Gini indices:
Gini(A) = (12/16) × 0.4860 + (4/16) × 0.375 = 0.45825

For attribute B (threshold value 3):

Value >= 3: 12 instances, with Gini = 0.4460
Value < 3: 4 instances, of which 0 are positive and 4 negative.
Gini(0,4) = 1 - [(0/4)² + (4/4)²] = 0

Gini(B) = (12/16) × 0.4460 + (4/16) × 0 = 0.3345
We can calculate the Gini index for the C and D attributes using the same method. The per-attribute splits are:

Attribute   Split     Positive   Negative
A           < 5       3          1
B           < 3.0     0          4
C           < 4.2     8          2
D           < 1.4     8          3
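A minimal sketch of the weighted Gini computation for attribute A follows:

# A minimal sketch verifying the weighted Gini computation for attribute A.
def gini(pos, neg):
    total = pos + neg
    return 1 - (pos / total) ** 2 - (neg / total) ** 2

g_right, g_left = gini(5, 7), gini(3, 1)        # Value >= 5 and Value < 5
weighted = (12 / 16) * g_right + (4 / 16) * g_left
print(round(weighted, 5))   # about 0.458; the text's 0.45825 uses rounded values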
The most notable types of decision tree algorithms are:

1. Iterative Dichotomiser 3 (ID3): This algorithm uses information gain to determine which attribute will be used to classify the current subset of data. Information gain is calculated recursively for the remaining data at each level of the tree.
2. C4.5: This algorithm is the successor of the ID3 algorithm. To determine the classifying attribute, this technique uses either information gain or the gain ratio. It is a direct improvement over the ID3 method because it can handle both continuous and missing attribute values.
3. Classification and Regression Tree (CART): It is a dynamic learning technique that can generate either a regression tree or a classification tree depending on the dependent variable.
Over-fitting
Over-fitting is a real issue when designing a decision tree model. Over-fitting occurs when the algorithm keeps diving deeper and deeper into the tree to lower the training-set error, at the cost of an increased test-set error, i.e. the prediction accuracy of our model decreases. It usually happens when there are a lot of outliers and noise in the data. Over-fitting can be avoided by two methods:

●● Pre-Pruning
●● Post-Pruning

Pre-Pruning
Pre-pruning halts tree construction a little early. It is preferable not to split a node if its goodness measure falls below a certain threshold. However, deciding on a suitable stopping point is tricky.

Post-Pruning
Post-pruning begins by going deeper and deeper into the tree to produce the whole tree, and pruning is then performed as a post-pruning step if the tree shows an over-fitting problem. We use cross-validation data to test the effectiveness of our pruning: the cross-validation data tell us whether expanding a node would result in an improvement or not. If it would, we can continue expanding that node; otherwise, the node is changed to a leaf node.
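As a minimal sketch of post-pruning, scikit-learn offers cost-complexity pruning via the ccp_alpha parameter; the toy data and the specific alpha value are illustrative assumptions, not the module's own method:

# A minimal sketch of post-pruning with scikit-learn's ccp_alpha.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

x, y = load_iris(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=0)

full = DecisionTreeClassifier(random_state=0).fit(x_train, y_train)
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=0.02).fit(x_train, y_train)

# The pruned tree is smaller and often generalises better
print(full.tree_.node_count, full.score(x_test, y_test))
print(pruned.tree_.node_count, pruned.score(x_test, y_test))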
Advantages:
◌◌ Decision trees are simple to understand and yield a set of rules.
◌◌ They take the same strategy that humans follow when making decisions.
◌◌ Visualisations can help simplify the interpretation of a complex decision tree model.

Disadvantages:
◌◌ A decision tree has a significant likelihood of over-fitting.
◌◌ Compared to other machine learning methods, it can give low prediction accuracy for a dataset with many class labels.
3.1.11 Hands On: Decision Tree
We will now use Python to implement the decision tree, again using the dataset "user_data.csv". The steps are:

◌◌ Data pre-processing step
◌◌ Fitting a decision tree algorithm to the training set
◌◌ Predicting the test result
◌◌ Test accuracy of the result (creation of the confusion matrix)
◌◌ Visualising the test set result

The code for the pre-processing stage is as follows:
# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd

#importing datasets
data_set= pd.read_csv('user_data.csv')

#Extracting Independent and dependent Variable
x= data_set.iloc[:, [2,3]].values
y= data_set.iloc[:, 4].values

# Splitting the dataset into training and test set.
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.25, random_state=0)

#feature Scaling
from sklearn.preprocessing import StandardScaler
st_x= StandardScaler()
x_train= st_x.fit_transform(x_train)
x_test= st_x.transform(x_test)
In the code above we have pre-processed the data and loaded the dataset (a preview of which is not reproduced here).
Fitting a Decision Tree algorithm to the Training set
We will now fit the model to the training set. For this we import the DecisionTreeClassifier class from the sklearn.tree library. The code is listed below:

#Fitting Decision Tree classifier to the training set
from sklearn.tree import DecisionTreeClassifier
classifier= DecisionTreeClassifier(criterion='entropy', random_state=0)
classifier.fit(x_train, y_train)

In the code above, we constructed a classifier object and passed it two main parameters:
◌◌ criterion='entropy': entropy is used to measure the quality of a split, computed via the information gain.
◌◌ random_state=0: for generating reproducible random states.

Out[8]:
DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=None,
max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, presort=False,
random_state=0, splitter='best')
Predicting the test result
We shall now forecast the outcome of the test set by creating a new prediction vector, y_pred. The code is listed below:

#Predicting the test set result
y_pred= classifier.predict(x_test)

Output:
The output image (not reproduced here) shows the actual test output alongside the predicted output. The prediction vector clearly has some values that differ from the actual vector; these are incorrect predictions.

Since some wrong predictions were made, we must use the confusion matrix to determine the total number of correct and incorrect predictions. The code is listed below:

#Creating the Confusion matrix
from sklearn.metrics import confusion_matrix
cm= confusion_matrix(y_test, y_pred)

Output:
The confusion matrix shows 6+3=9 incorrect predictions and 62+29=91 correct predictions. As a result, we may conclude that the Decision Tree classifier produced a good forecast compared with other classification models.
Visualising the training set result:
Here we will display the training set result by drawing a graph for the decision tree classifier. The classifier will predict yes or no for the users who have either purchased or not purchased the SUV car. The code is listed below:

#Visulaizing the trianing set result
from matplotlib.colors import ListedColormap
x_set, y_set = x_train, y_train
x1, x2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step = 0.01),
    nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
    alpha = 0.75, cmap = ListedColormap(('purple','green')))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
        c = ListedColormap(('purple', 'green'))(i), label = j)
mtp.title('Decision Tree Algorithm (Training set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()

Output:
The output presented above departs significantly from the remaining classification models: the plane is divided along both vertical and horizontal lines according to the age and estimated salary variables, as the tree tries to capture every point in the dataset.
The visualisation of the test set result will look very similar to the training set visualisation, with the training set replaced by the test set.

#Visulaizing the test set result
from matplotlib.colors import ListedColormap
x_set, y_set = x_test, y_test
x1, x2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step = 0.01),
    nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
    alpha = 0.75, cmap = ListedColormap(('purple','green')))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
        c = ListedColormap(('purple', 'green'))(i), label = j)
mtp.title('Decision Tree Algorithm (Test set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()

Output:
There are some green data points within the purple region of the graph above, and vice versa; these are the incorrect predictions that we noted in the confusion matrix.
3.1.12 Introduction to Support Vector Machines
Support vector machines (SVMs) are powerful yet adaptable supervised machine learning algorithms used for both regression and classification, though they are typically employed for classification problems. SVMs were first introduced in the 1960s and were later refined in the 1990s. They are implemented differently from other machine learning algorithms, and they have recently become very popular because they can handle multiple continuous and categorical variables.

The goal of the SVM method is to find a hyperplane in an N-dimensional space that distinctly classifies the data points. The number of features determines the hyperplane's dimension: if there are just two input features, the hyperplane is essentially a line; with three input features it becomes a 2-D plane; with more than three features it becomes hard to visualise.

There can be multiple decision boundaries that separate the classes, but we must identify the optimal decision boundary that best helps classify the data points. This optimal boundary is known as the hyperplane of the SVM.

A hyperplane is always formed with the largest margin, i.e. the maximum distance to the data points.
Support Vectors:
Support vectors are the data points or vectors that are closest to the hyperplane and have the greatest influence on its position. They are called support vectors because they support the hyperplane.

Consider two independent variables (x1, x2) and one dependent variable, which is either a blue circle or a red circle. It is clear from such a plot that there are numerous lines (our hyperplane here is a line, because we are only considering the two input features x1, x2) that separate the data points, i.e. perform a classification between the red and blue circles. So how do we pick the optimal line, the optimum hyperplane, to divide up our data points?

We select the hyperplane whose distance to the closest data point on either side is maximum. Such a hyperplane, if it exists, is known as the maximum-margin hyperplane, or hard margin; from the given diagram we therefore select line L2.

Now suppose one blue ball lies within the perimeter of the red balls. How does SVM classify the data then? Easily: that blue ball on the red side is an outlier of the blue class. The SVM algorithm has the ability to ignore outliers while determining the optimum hyperplane that maximises the margin; SVM is robust to outliers.
For data where some points cross the margin, SVM determines the maximum margin as before but also applies a penalty each time a point crosses the margin; in such circumstances the margins are referred to as soft margins. When the data set has a soft margin, the SVM attempts to minimise (1/margin + penalty). A commonly used penalty is the hinge loss: there is no cost if there are no violations, and the cost grows with the distance of each violation.
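A minimal sketch of controlling the softness of the margin follows, using SVC's C parameter; the toy data and C values are illustrative assumptions:

# A minimal sketch of soft vs. harder margins via SVC's C parameter.
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

x, y = make_blobs(n_samples=100, centers=2, cluster_std=2.5, random_state=0)

# Small C -> wide, soft margin (violations tolerated);
# large C -> narrower margin (violations penalised heavily).
soft = SVC(kernel='linear', C=0.01).fit(x, y)
hard = SVC(kernel='linear', C=100).fit(x, y)
print(len(soft.support_), len(hard.support_))  # soft margin uses more support vectors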
Up until this point, we have only discussed data that can be separated by a straight line (linearly separable data), such as the groups of blue and red balls. What should one do if the data cannot be separated linearly?

Suppose the data form groups that no straight line can separate. SVM resolves this by employing a kernel to create a new variable: we designate a new variable yi as a function of the distance of each point from the origin o. If we plot this new variable, the groups become separable. In this instance, the distance from the origin is used to construct the new variable y; a non-linear function that produces a new variable in this way is called a kernel.

SVM Kernel:
The SVM kernel is a function that converts non-separable problems into separable problems by taking a low-dimensional input space and transforming it into a higher-dimensional space. It works best in non-linear separation problems. Simply put, the kernel determines how to split the data depending on the defined labels or outputs, after performing some sophisticated data transformations.
Advantages of SVM:
●● It is effective in high-dimensional cases.
●● Its memory efficiency comes from the decision function's use of support vectors, a subset of the training points.
●● Various kernel functions can be supplied for the decision function, including bespoke kernels.

Linear SVM: Linear SVM is used for linearly separable data, i.e. data that can be divided into two classes by a single straight line. The classifier used for such data is called the Linear SVM classifier.
Let's begin with a straightforward two-class problem where the data are clearly linearly separable, as indicated in the image (not reproduced here).

Let (Xi, yi) represent the i-th data point, where Xi is the feature vector and yi is the associated class label, taking one of two possible values: +1 or -1. Say the red balls in the diagram have class label +1 and the blue balls have class label -1. All of the members of classes +1 and -1 can be distinguished from one another by drawing a straight line; these two-dimensional data are clearly linearly separable. In fact, an unlimited number of straight lines can separate the blue balls from the red balls.

The challenge is determining which of these infinitely many straight lines is optimal, in the sense that it should have the least classification error on a fresh observation. Based on the training sample, the chosen straight line should be expected to classify test samples correctly.

Consider black, red, and green candidate lines in the diagram. Is one of them superior to the other two, or are they all equally suited to classification? What does optimality mean here? Intuitively, a line will be more sensitive to slight changes in one or more of the points if it passes too close to any of them. Suppose a red ball is not far from the green line and a blue ball is not far from the red line. The red ball might land on the opposite side of the green line if it slightly shifts its location; the blue ball could likewise be misclassified if it shifts slightly. The red and green lines are thus highly sensitive to slight variations in the observations. By contrast, the black line is less sensitive and more resistant to model variance.
A separating line can be written as a hyperplane equation with parameters θ. The scalar θ0 is commonly referred to as a bias; if θ0 = 0, the hyperplane passes through the origin.

A hyperplane is a separator: the points that lie on opposite sides of the hyperplane form two distinct groups.

The fundamental goal of support vector machines is to find the best hyperplane for linearly separable patterns. A natural choice is the separating hyperplane that is farthest from the observations, called the optimal margin hyperplane. The distance from each observation to a given separating hyperplane is computed, and the smallest of all those distances measures how close the hyperplane is to the set of observations; this minimum separation is called the margin. The SVM algorithm works by locating the hyperplane that gives the training instances the largest minimum distance, i.e. the maximum margin. This is referred to as the maximal margin classifier.
A separating hyperplane in two dimensions is written as

θ0 + θ1x1 + θ2x2 = 0

Any point above the hyperplane satisfies

θ0 + θ1x1 + θ2x2 > 0

and any point below the hyperplane satisfies

θ0 + θ1x1 + θ2x2 < 0
Define two margin hyperplanes H1 and H2 on either side of the separator, with H1: θ0 + θ1x1 + θ2x2 = 1 and H2: θ0 + θ1x1 + θ2x2 = -1. This ensures that any observation that falls on or above H1 is considered to belong to class +1, while any observation that falls on or below H2 is considered to belong to class -1. As an alternative, the two conditions can be combined and written as

yi (θ0 + θ1x1i + θ2x2i) ≥ 1 for every observation i

H1 and H2, which define the margin bounds, are themselves hyperplanes. The training points that lie exactly on the margin's edges are known as the support vectors, because they act as supports for the maximal margin hyperplane: if any of these points is moved, the maximal margin hyperplane also moves. Notably, the maximal margin hyperplane depends directly only on these support vectors.

The maximal margin hyperplane does not change if any of the other points shift, as long as the movement does not affect the boundary conditions or the support vectors. The support vectors are the most challenging points to classify and provide the greatest insight into the classification; they are the most important or crucial data points of the training set.
The defining equation for a general n-dimensional feature space is

θ0 + θ1x1 + θ2x2 + ... + θnxn = 0

If the weight vector is denoted by θ and ||θ|| is this vector's norm, it is simple to show that the maximal margin is 2/||θ||. Finding the maximal margin hyperplane and the support vectors is a convex quadratic optimisation problem. It is crucial to remember that the complexity of an SVM is determined by the number of support vectors, not by the dimension of the feature space.

Because of this, SVM has a relatively low propensity to overfit. If all data points other than the support vectors were removed from the training data set and the training method were rerun, the same separating hyperplane would be found. The expected error rate of the SVM classifier has an upper bound determined by the number of support vectors, independent of the data dimensionality. Even when the data are highly dimensional, an SVM with a modest number of support vectors offers good generalisation.
Non-Linear SVM: to separate data values that are not linearly separable, we must add another dimension. For linear data we used the two dimensions x and y, so for non-linear data we add a third dimension z, calculated as:

z = x² + y²

By adding the third dimension, the sample space changes so that the classes become separable, and SVM can now split the datasets into their classes (the transformed plots are not reproduced here).
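A minimal sketch of this kernel idea follows; the make_circles toy data and the RBF kernel are illustrative assumptions, not the module's own example:

# A minimal sketch of the kernel trick on circular (non-linear) data.
from sklearn.datasets import make_circles
from sklearn.svm import SVC

x, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear = SVC(kernel='linear').fit(x, y)  # a straight line cannot separate circles
rbf = SVC(kernel='rbf').fit(x, y)        # implicitly adds a dimension like z = x² + y²
print("linear:", linear.score(x, y))     # low accuracy
print("rbf:", rbf.score(x, y))           # near-perfect accuracy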
Python will now be used to implement the SVM method, again with the user_data dataset.

#Data Pre-processing Step
# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd

#importing datasets
data_set= pd.read_csv('user_data.csv')

#Extracting Independent and dependent Variable
x= data_set.iloc[:, [2,3]].values
y= data_set.iloc[:, 4].values

# Splitting the dataset into training and test set.
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.25, random_state=0)

#feature Scaling
from sklearn.preprocessing import StandardScaler
st_x= StandardScaler()
x_train= st_x.fit_transform(x_train)
x_test= st_x.transform(x_test)

After running the code above, the data are pre-processed; the code outputs a preview of the dataset (not reproduced here).
Fitting the SVM classifier to the training set:
The SVM classifier will now be fitted to the training set. To build it, we import the SVC class from the sklearn.svm library. The code is listed below:

from sklearn.svm import SVC # "Support Vector Classifier"
classifier = SVC(kernel='linear', random_state=0)
classifier.fit(x_train, y_train)

In the code above we used kernel='linear', since we are creating an SVM for linearly separable data; it can be changed for non-linear data. We then fitted the classifier to the training dataset (x_train, y_train).

Output:
Out[8]:
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
kernel='linear', max_iter=-1, probability=False, random_state=0,
shrinking=True, tol=0.001, verbose=False)
Predicting the test set result:
We will now predict the results for the test set by making a new vector called y_pred. The code is listed below:

#Predicting the test set result
y_pred= classifier.predict(x_test)

Once we have the y_pred vector, we can compare y_pred and y_test to see the difference between the actual and predicted values.

Output: The output for the test set forecast is shown in an image (not reproduced here).
Creating the confusion matrix:
Now we will examine the SVM classifier's performance, i.e. how many incorrect predictions it makes compared with the Logistic Regression classifier. We must import the confusion_matrix function from the sklearn.metrics package; once imported, we invoke it using a new variable cm. The function requires two parameters, mainly y_true (the actual values) and y_pred (the target values returned by the classifier):

#Creating the Confusion matrix
from sklearn.metrics import confusion_matrix
cm= confusion_matrix(y_test, y_pred)

Output:
The confusion matrix shows 66+24=90 correct predictions and 8+2=10 incorrect predictions. Thus, we can conclude that our SVM model outperformed the Logistic Regression model.
Visualising the training set result:
We will now visualise the training set result. The code is listed below:

#Visulaizing the training set result
from matplotlib.colors import ListedColormap
x_set, y_set = x_train, y_train
x1, x2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step =0.01),
    nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step =0.01))
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
    alpha = 0.75, cmap = ListedColormap(('red', 'green')))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
        c = ListedColormap(('red', 'green'))(i), label = j)
mtp.title('SVM classifier (Training set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()

Output:
As we can see, the output above resembles the Logistic Regression output. The straight line appears as the decision boundary because we used a linear kernel in the classifier; in two dimensions the SVM hyperplane is a straight line.
in
#Visualising the test set result
x_set, y_set = x_test, y_test
x1, x2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step = 0.01),
    nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
    alpha = 0.75, cmap = ListedColormap(('red', 'green')))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
        c = ListedColormap(('red', 'green'))(i), label = j)
mtp.title('SVM classifier (Test set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()
Output:
The SVM classifier separated the users into two zones (Purchased or Not purchased), as shown in the result image above. The red area with red scatter points represents users who bought the SUV, while the green area with green scatter points represents users who did not. The hyperplane has separated the two classes into purchased and non-purchased groups.
A popular algorithm for classification and regression problems is the supervised machine learning technique known as random forest. It constructs decision trees on various samples and uses their majority vote for classification and their average for regression.

The Random Forest algorithm's ability to handle data sets with both continuous variables, as in regression, and categorical variables, as in classification, is one of its most crucial qualities. For classification problems, it produces superior results.
After finishing his 10+2, student X wants to enrol in a course but is unsure which one to take given his skill set. So he decides to talk to a variety of people, including his cousins, teachers, parents, degree students and working professionals. He asks them a range of questions, such as why he ought to select a particular course, the employment prospects with that course, the tuition fees, and so on. He ultimately chooses the course that most people recommend after conversing with a variety of individuals about it.
Ensemble learning is the key to understanding how random forest functions. Ensemble simply refers to the blending of various models: a group of models, rather than a single model, is employed to create predictions. Ensemble methods are of two types:
1. Bagging: It trains each model on a different training subset drawn from the sample training data with replacement, and the result is based on majority voting. Random Forest is an example.
2. Boosting: It turns weak learners into strong learners by building consecutive models, with the final model having the best accuracy. AdaBoost and XGBoost are examples.
As was already explained, the Bagging principle governs how Random Forest operates. Let's dig in and learn more about bagging now.
Bagging
The ensemble method employed by random forest is bagging, sometimes referred to as Bootstrap Aggregation. Bagging chooses random samples from the data set: each model is created from samples (bootstrap samples) drawn from the original data with replacement, a process known as row sampling. This stage of row sampling with replacement is called bootstrap. Each model is then trained independently and produces a result. The final decision is made based on a majority vote after merging the outputs of all models. This step of combining all the results and producing an output based on a majority vote is known as aggregation.
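As an illustrative sketch (not from the original text; the toy dataset and parameters are assumptions), scikit-learn's BaggingClassifier makes the bootstrap-and-aggregate idea concrete:

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# a hypothetical toy dataset
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 10 trees, each trained on a bootstrap sample (row sampling with replacement);
# the final prediction is the majority vote over the trees
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=10,
                        bootstrap=True, random_state=0)
bag.fit(X_train, y_train)
print(bag.score(X_test, y_test))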
Let's examine an example by dissecting it with the aid of the figure below. The bootstrap samples here (bootstrap samples 01, 02 and 03) are drawn from the original data with replacement, so each sample is unlikely to contain entirely unique data. The models (models 01, 02 and 03) derived from these bootstrap samples are trained independently, and each model produces a result. When the results are combined, the happy emoji is in the majority compared to the sad emoji, so the happy emoji is the final result based on majority voting.
Step 1: In Random Forest, n records are selected at random from a data set of k records.
Step 2: A different decision tree is built for each sample.
Step 3: Each decision tree generates an output.
Step 4: The final result is based on majority voting for classification or averaging for regression.
For example, take the data from the fruit basket in the figure below. n samples are drawn from the fruit basket, and each sample is used to build a separate decision tree. Each decision tree produces an output, as seen in the illustration, and the result is decided by majority voting. As shown in the figure, the majority of the decision trees output an apple rather than a banana, hence an apple is chosen as the final result.
U
ity
m
)A
Important features of Random Forest:
1. Diversity: Not all attributes are considered while making an individual tree; each tree is different.
2. Immune to the curse of dimensionality: The feature space is smaller, since no tree takes every feature into account.
3. Parallelization: Each tree is generated independently from different data and attributes. This means we can make full use of the CPU to build random forests.
4. Train-test split: In a random forest, roughly 30% of the data is never seen by a given decision tree, so we don't need to separate the data into train and test sets.
5. Stability: Stability results because the outcome is based on majority voting or averaging.
Advantages
1. It can be used to solve both classification and regression problems.
2. Because the output is based on majority voting or averaging, it resolves the overfitting issue.
3. It works well even when the data has missing or null values.
4. Because each decision tree that is constructed is independent of the others, it demonstrates parallelization.
5. Because the averaged responses of a vast number of trees are used, it is extremely stable.
6. It maintains diversity, because not all attributes are considered when creating each decision tree, albeit this isn't always the case.
7. It is resistant to the curse of dimensionality. The feature space is smaller because each tree does not take all of the attributes into account.
8. Since roughly 30% of the data will always be hidden from a decision tree built using bootstrap sampling, we don't need to separate the data into train and test sets.
Disadvantages:
1. Compared to decision trees, where decisions can be traced by following a path down the tree, random forests are significantly more complex.
2. Due to this complexity, training takes longer than with other models: every time a prediction is needed, each decision tree must produce an output for the input data.
Classification in a random forest is done tree by tree: the forecast chosen by the random forest technique is the one that receives the most votes after each tree has considered the case separately.
Figure: Random Forest Technique
Samples from the initial dataset are used as input for each tree in the classification. The features used to grow the tree at each node are then chosen at random. No tree in the forest should be pruned until the exercise is complete and the forecast is clearly reached. In this way, the random forest enables classifiers with weak correlations to produce a strong combined classifier.
Modeling Predictions
The random forest approach can develop prediction models using random forest regression trees, which are typically left unpruned to produce accurate predictions. These regression trees are sampled using the bootstrap approach, and the optimal splitting feature is chosen by sampling the best nodes from the randomly selected candidate features at each split.
The random sampling used to choose the best splitting feature reduces the variance of the regression trees by reducing the correlation between them. This enhances the ability of the individual trees in the forest to predict future events. Increased independence among individual trees is another benefit of bootstrap sampling.
Variable Importance
Since the models can be difficult to comprehend, particularly from a biological standpoint, variables (features) are crucial to interpreting the random forest method. The naive approach illustrates the significance of variables by assigning each variable an importance based on how frequently the trees include it in the sample. It is doable but challenging, because the effects on cost reduction and accuracy improvement are redundant.
Permutation importance is a metric that records the change in prediction accuracy when the variables are randomly permuted in the out-of-bag samples. Although more expensive, the permutation importance approach performs better than the naive approach.
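As a hedged illustration (using scikit-learn's general-purpose permutation_importance helper and a toy dataset, both assumptions rather than anything prescribed by the text):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
forest = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# permute each feature in turn and record the drop in accuracy;
# a large drop means the model relied heavily on that feature
result = permutation_importance(forest, X_test, y_test, n_repeats=10, random_state=0)
for i, mean in enumerate(result.importances_mean):
    print(f"feature {i}: {mean:.3f}")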
Because the random forest cannot by itself interpret predictions well enough from a biological viewpoint, the strategy relies on the naive, mean decrease impurity and permutation importance techniques to give the problems direct interpretability. The three methods provide different kinds of support for the predictor variables.
However, neither the permutation importance nor the mean decrease impurity approach shows bias when applied to continuous predictor variables with a comparable number of categories. Bias is frequently present in variable selection; to prevent it, subsampling without replacement should be employed, and the random forest technique should be used where conditional inference is required.
Oblique Random Forests
In contrast to typical decision splits at the nodes, oblique random forests use oblique splits for decisions. The following characteristics demonstrate the superiority of oblique forests.

First, they can use a single multivariate split to divide distributions at the coordinate axes, where deep axis-aligned splits would customarily be required. Second, they reduce the bias of the decision trees towards the plotted boundaries. When splitting similar classes, traditional axis-aligned splits would require two extra layers of nesting, whereas the oblique splits are simpler and more effective to deploy.
Due to its excellent accuracy in classifying remotely sensed data, the classifier, which was initially developed for machine learning, has gained favour in the remote sensing field. It also completes the process with the necessary speed and with effective parameterization. The random forest classifier bootstraps random samples and chooses the prediction with the greatest vote across all trees.
The uniqueness of the trees is crucial throughout the process, and the following characteristics ensure that each tree is unique. First, each tree is trained on random subsets of the initial training samples. Second, the best split at each unpruned tree node is selected from randomly picked attributes. Third, no tree is ever pruned; all trees grow without restriction.
The value that occurs most frequently in a given node is used to fill in any missing values. Random forests are among the most accurate classification techniques currently available.
The random forest method can also handle large amounts of data with hundreds of different variables. When a class in the data is less frequent than other classes, it can automatically balance the data set. The approach handles variables quickly, making it appropriate for challenging tasks.
Summary
●● A group of classification algorithms built on Bayes' Theorem are known as naive Bayes classifiers. It is a family of algorithms rather than a single method, and they all share the assumption that every pair of features being classified is independent of the others.
●● The Bayes Theorem determines the likelihood of an event occurring given the likelihood of an earlier event occurring.
●● In Gaussian Naive Bayes, the continuous values connected to each feature are presumed to be distributed in a Gaussian manner. Normal distribution is another name for a Gaussian distribution.
●● Naïve Bayes classifiers are: a) Multinomial Naïve Bayes, b) Bernoulli Naïve Bayes.
●● Finding correlations between dependent and independent variables is the goal of the supervised learning technique known as regression. A predictive regression model predicts a numerical, continuous value. Regression metrics differ from classification metrics in many ways: we cannot assess a regression model using the accuracy measure; rather, the performance of a regression model is reported as errors in the prediction.
●● The most common indicators for assessing the effectiveness of regression models are: a) Mean Absolute Error, b) Mean Squared Error, c) R2 Score, d) Adjusted R2.
ve
●● The Bayesian classification method combines statistical classification with
supervised learning. assumes an underlying probabilistic model and enables the
rigorous computation of outcome probabilities to represent uncertainty about the
model.
●● The K-NN algorithm is widely used in applications such as pattern recognition.
●● A new data point is classified using the K-NN algorithm based on similarity after all the existing data has been stored. This means that, using the K-NN method, fresh data can be quickly and accurately sorted into a suitable category.
●● A supervised learning method called a decision tree can be used to solve classification and regression problems, but it is typically favoured for classification. It is a tree-structured classifier, where internal nodes stand in for a dataset's features, branches for the decision-making process, and each leaf node for the classification result.
●● The Decision Node and Leaf Node are the two node types of a decision tree. Decision nodes are used to make decisions and have numerous branches, while leaf nodes are the results of those decisions and do not have any further branches.
●● The fundamental problem that emerges while developing a decision tree is how to choose the best attribute for the root node and for sub-nodes. A method known as attribute selection measure, or ASM, can be used to tackle these issues.
●● Overfitting occurs when the algorithm continues to dive deeper and deeper into the tree to lower the training set error, but this results in an increased test set error, i.e., the accuracy of prediction for our model decreases.
●● Support vector machines (SVMs) are powerful yet adaptable supervised machine learning algorithms used for both regression and classification, though they are typically employed for classification problems. SVMs were first presented in the 1960s and were later improved in the 1990s. Compared to other machine learning algorithms, SVMs are implemented in a different way.
●● The SVM kernel is a function that converts non-separable problems into separable problems by taking a low-dimensional input space and transforming it into a higher-dimensional space. It works best for non-linear separation problems.
●● A popular algorithm for classification and regression problems is the supervised machine learning technique known as random forest. It constructs decision trees on various samples and uses their majority vote for classification and their average for regression.
●● The Random Forest algorithm's ability to handle data sets with both continuous variables, as in regression, and categorical variables, as in classification, is one of its most crucial qualities. For classification problems, it produces superior results.
●● The ensemble method employed by random forest is bagging, sometimes referred to as Bootstrap Aggregation. A random sample is chosen from the data set using bagging. As a result, each model is created using the samples (bootstrap samples) that the original data gave, with a replacement process known as row sampling.
Glossary
●● Naïve Bayes classifiers: A group of classification algorithms built on Bayes' Theorem.
●● Multinomial Naive Bayes: Feature vectors represent the frequencies with which specific events were generated by a multinomial distribution.
●● Bernoulli Naive Bayes: Features are independent booleans (binary variables) that
describe inputs in the multivariate Bernoulli event model.
●● Adjusted R2: Adjusted R square, a better metric than R2, is used to get around the R-square problem.
●● Root Node: The decision tree begins at the root node. It represents the full dataset, which is then split into two or more homogeneous sets.
●● Leaf Node: After a leaf node is reached, the tree cannot be divided further; leaf nodes are the ultimate output nodes.
●● Splitting: In splitting, the decision node or root node is divided into sub-nodes in
accordance with the specified conditions.
●● Branch/Sub-tree: A subtree formed by splitting the tree.
●● Pruning: Pruning is the procedure of removing the tree’s undesirable branches.
●● Parent/Child node: The root node of the tree is referred to as the parent node, and the remaining nodes are referred to as the child nodes.
●● ASM: Attribute Selection Measures.
●● Gini index: A metric that measures how frequently a randomly selected element is incorrectly identified.
●● Iterative Dichotomiser 3 (ID3): This algorithm makes use of Information Gain to
determine which attribute will be used to classify the current subset of data.
●● C4.5: This algorithm is the successor to the ID3 algorithm. To determine the classifying attribute, it uses either Information Gain or Gain Ratio.
●● Classification and Regression Tree(CART): It is a dynamic learning technique
that can generate both a regression tree and a classification tree based on the
dependent variable.
●● Linear SVM: Used for linearly separable data, i.e., data that can be divided into two classes using just one straight line.
●● Non-Linear SVM: When a dataset cannot be identified using a straight line, it is
said to be non-linear, and the classification algorithm utilised is known as a non-
linear SVM classifier.
a) Conditional Independence
b) Conditional Dependence
c) Both a and b
d) None of the above
b) Numerical values
c) Both a and b
d) None of the mentioned
3. Probabilistic Model of data within each class is?
a) Discriminative classification
b) Generative classification
c) Probabilistic classification
d) Both b and c
4. Spam Classification is an example of?
a) Random Forest
b) Naïve Bayes
c) Probabilistic classification
d) All of the above
5. Finding correlations between dependent and independent variables is the goal of the
supervised learning technique known as_ _ _ _.
a) Classification
b) Prediction
c) Regression
d) None of the mentioned
c) Pattern recognition
d) All of the above
9. _ _ _ _ is a tree-structured classifier, where internal nodes stand in for a dataset's features, branches for the decision-making process, and each leaf node for the classification result.
a) K-nearest neighbor
b) Decision tree
c) Linear regression
nl
d) Naïve Bayes
10. What are the two nodes of a decision tree?
a) Root node and Stem Node
b) Root node and Leaf node
c) Decision node and Leaf node
d) None of the mentioned
11. What is abbreviation of CART?
a) Classification and Regression Tree Algorithm
b) Class and Root Tree Algorithm
c) Classification and Root Tree Algorithm
d) Class and Regression Tree Algorithm
12. After receiving a_ _ _ _, the tree cannot be further divided; leaf nodes are the ultimate
output nodes.
a) Root node
b) Leaf node
c) Stem node
d) Parent node
13. In_ _ _ _ _, the decision node or root node is divided into sub-nodes in accordance
with the specified conditions.
a) Root node
b) Leaf node
c) Splitting
d) None of the mentioned
14. _ _ _ _ is the procedure of removing the tree's undesirable branches.
a) Splitting
b) Bootstrapping
c) Parent node
d) Pruning
15. The_ _ _ _is a metric that measures how frequently a randomly selected piece is
erroneously identified.
a) Gini Index
b) Bootstrapping
c) K-nearest neighbor
d) Decision tree
Exercise
1. What do you mean by Naive Bayes Classifier?
2. Define Naive Bayes Classifier for mixed data types.
3. What do you mean by K Nearest Neighbour? Also give an example to implement the KNN algorithm.
4. Explain decision trees.
5. What do you mean by support vector machines?
6. How does SVM work for linearly and non-linearly separable problems?
7. Define random forest classifier.
8. Define random forest technique.
Learning Activities
1. How do you implement the KNN algorithm using Python?
Answers:
1. a
2. a
3. d
4. b
5. c
6. a
7. b
8. d
9. b
10. c
11. a
12. b
13. c
14. d
15. a
Further Readings
1. Introduction to Machine Learning with Python: A Guide for Data Scientists, Andreas C. Müller and Sarah Guido
2. Supervised and Unsupervised Learning for Data Science, Berry Michael W., Azlinah Mohamed, Bee Wah Yap
3. Machine Learning: An Algorithmic Perspective, Stephen Marsland
4. Deep Reinforcement Learning Hands-On: Apply Modern RL Methods to Practical Problems of Chatbots, Robotics, Discrete Optimization, Web Automation, and More, 2nd Edition, Maxim Lapan
5. Algorithms for Reinforcement Learning, Csaba Szepesvari
6. Foundations of Deep Reinforcement Learning: Theory and Practice in Python, Laura Graesser and Wah Loon Keng
Module - IV: Unsupervised Learning
Learning Objectives:
At the end of this module, you will be able to understand:
●● Different types of Clustering
●● DBSCAN (Density Based Spatial Clustering of Applications with Noise)
●● Concept of Anomaly Detection
●● K-Means Clustering
●● Concept of Principal Component Analysis
●● Singular Value Decomposition
●● LOF
●● Concept of T-Distributed Stochastic Neighbour Embedding (t-SNE)
Introduction
Unsupervised learning is the process of teaching a computer to use unlabelled, unclassified data and allowing the algorithm to act on the data without supervision. Without any prior training, the machine's objective here is to categorise unsorted data according to similarities, patterns and differences. Unlike supervised learning, the model won't be trained on labelled examples, so the machine has to discover the hidden structure in unlabelled data on its own.
Consider the scenario where the machine is shown a picture of dogs and cats that it has never seen before.
The machine is unaware of the characteristics of dogs and cats, but it can categorise them according to their similarities, patterns and contrasts; that is, we can simply divide the picture above into two parts. The first section might contain only images with dogs, and the second only images with cats. There is no prior knowledge here, hence no examples or training data.
There are many different clustering algorithms, and there are many different ways to compare a clustering to a gold standard. The clustering objects and the clustering task determine the best clustering algorithm to use as well as the best assessment measure. In the study discussed here, verbs serve as the clustering objects, and the work of clustering entails categorising the verbs semantically. Within the cluster analysis of the verbs, further cluster parameters need to be investigated.
4.1 Introduction to Clustering Techniques
Unsupervised learning algorithms fall into two groups:
●● Clustering: Identifying the natural groupings in the data, such as classifying clients based on their purchasing patterns, is a clustering problem.
●● Association: When you wish to find rules that broadly characterise your data, such as "those who buy X also tend to buy Y," you have an association rule learning problem.
Clustering:
●● Exclusive (partitioning)
●● Agglomerative
●● Overlapping
●● Probabilistic

Clustering types:
●● Hierarchical clustering
●● K-means clustering
Clustering
Clustering is the task of identifying the natural groupings, qualities and significant structures that are inherent in a set of instances. The objective of clustering is to divide the population or set of data points into a number of groups so that the data points within each group are more similar to one another than to the data points in the other groups. It is essentially a grouping of objects based on how similar and dissimilar they are to one another. For instance, the data points that are clustered together in the graph below can be categorised into a single category; the clusters can be distinguished, and we can count three of them.
Clusters don't always have to be spherical, as shown below:
Why Clustering?
Which clustering criteria a user chooses to meet their needs depends on the task. We might be interested in identifying homogeneous group representatives (data reduction), identifying "natural clusters" and describing their unknown characteristics ("natural" data types), identifying appropriate and helpful groupings ("useful" data classes), or identifying odd data objects (outlier detection). A clustering algorithm must make various assumptions regarding the similarity of the points, and each assumption results in a different but equally valid clustering.
Clustering Methods:
●● Density-based Methods: These methods consider clusters as dense regions that differ from the less dense parts of the space. They have good accuracy and the ability to merge two clusters. Examples include DBSCAN and OPTICS.
●● Hierarchical-based Methods: The clusters formed in this method have a tree-type structure, and already-formed clusters are used to create new ones. There are two categories:
◌◌ Agglomerative (bottom-up approach)
◌◌ Divisive (top-down approach)
●● Partitioning Methods: These methods divide the objects into k clusters, and each partition forms a cluster. This technique is used to optimise an objective criterion similarity function when distance is a key parameter; examples include K-means and CLARANS (Clustering Large Applications based upon Randomized Search).
●● Grid-based Methods: In this approach, the data space is represented as a grid-like structure made up of a finite number of cells. Clustering processes carried out on these grids, such as STING (Statistical Information Grid), wave cluster and CLIQUE (CLustering In QUEst), are fast and independent of the number of data objects.
4.1.2 Evaluation of Clustering
In multivariate data analysis, clustering is a common practice. It is intended to investigate the data objects' innate natural structure, where items in the same cluster are as similar as possible to one another and objects in separate clusters are as distinct as possible from one another. The equivalence classes that the clusters induce provide a method for generalising over the data items and their features. Many fields, including pattern detection, psychology, economics and medical research, use clustering techniques.
The three main data analysis types, however, each correspond to a distinct concept and are distinguished as follows: (a) Clustering differs from classification in that the latter places items into pre-established classes, whilst the former requires no prior knowledge of the object classes or of the individuals who make up those classes. (b) A discriminant analysis seeks to reinforce a classification that has already been provided, whereas a cluster analysis has to first create the class structure; this is how a cluster analysis differs from a discriminant analysis.
An intuitive, introspective evaluation is only feasible for limited collections of objects, given the amount of data and the intricate details of data representation and clustering methods. Therefore, an objective method is required for large-scale trials.
A trustworthy strategy for judging the calibre of the clusterings should be used when developing an objective method for evaluating them. How can we convert these impersonal descriptions into more specific requirements? The requirements for the task of grouping verbs into semantic classes are defined next, with an increasing level of linguistic task specificity: first general requirements for an evaluation, then general requirements for a clustering evaluation, and finally requirements for the verb-specific clustering evaluation.
The formal notation of the clustering result and gold standard classification makes it easier to clarify the requirements for the clustering assessment, hence the notation is given in advance:
The clustering should be evaluated using a gold standard that is as impartial and trustworthy as possible. The classification was constructed in close connection with the existing classifications for German by Schumacher (1986) and for English by Levin (1993), in order to compensate for the suboptimal setup by a single person. The entire classification was completed prior to conducting any word experimentation.
●● The measure should be applicable to all similarity measures used in clustering, but may be independent of the specific similarity measure, as the evaluation's goal is to evaluate and compare various clustering experiments and results.
●● A (numerical) measure representing the worth of the clustering should be defined using the evaluation's findings. To make the assessment easier to comprehend, the resulting value should either be simple to understand or else be illustrated with respect to its range and impacts.
●● A third set of requirements refers to the linguistic data, because the demands on verb classes are particular to the linguistic background and linguistic intuition and are not necessarily desired for other clustering areas. The list that follows corresponds to these linguistic desiderata for the clustering of verbs.

4.1.3 Understanding Clustering Types
The main types of clustering discussed below are:
●● Hierarchical Clustering
●● Centroid-based Clustering
●● Density-based Clustering
●● Distribution-based Clustering
●● Fuzzy Clustering
Hierarchical Clustering
Unsupervised machine learning clustering that starts with a top-to-bottom hierarchy of clusters is called "hierarchical clustering". It then decomposes the data objects based on this hierarchy to produce the clusters. Depending on whether the process of building clusters proceeds top-down or bottom-up, this strategy adopts one of two approaches: the Divisive Approach and the Agglomerative Approach, respectively.
●● Divisive Approach
With this top-down method of hierarchical clustering, we assume that all of the data points form a single large cluster and attempt to divide the data into smaller groups using a termination logic, i.e., a criterion beyond which there can be no further division of data points. This termination logic can be based on the least sum of squared errors inside a cluster or, for categorical data, on the Gini coefficient inside a cluster.
As a result, the data that was previously grouped into a single large cluster is iteratively divided into "n" smaller clusters to which the data points now belong.
When separating the clusters, it is important to keep in mind that this approach is quite "strict": once a split is performed inside a loop, it cannot be reversed.
●● Agglomerative Approach
Agglomerative is quite the opposite of Divisive: each of the "N" data points is regarded as a separate member of the "N" clusters into which the data is divided. These numerous "N" clusters are then repeatedly merged into a smaller number of clusters, say "k" clusters, and the data points are allocated to these clusters accordingly. This bottom-up method also combines the clusters using a termination logic: clusters shouldn't be too far apart to be merged, and the logic can use a distance criterion, a variance criterion (the increase in the variance of the cluster being merged should not exceed a threshold, the Ward method) or a number-based criterion.
Centroid Based Clustering
Centroid-based clustering, regarded as one of the most straightforward clustering algorithms, is the most efficient method of forming clusters and allocating data points to them. The idea behind centroid-based clustering is that each cluster is identified and represented by a central vector, and data points near these vectors are assigned to the respective clusters.

These clustering techniques employ a variety of distance metrics, such as the Minkowski, Manhattan or Euclidean distance, to iteratively calculate the separation between the clusters and the characteristic centroids.
The main drawback in this situation is that we need to identify the number of
clusters, “k,” either intuitively or scientifically (Elbow Method) before any clustering
machine learning algorithm may start assigning the data points during its initial iteration.
Density-based Clustering
Density-based clustering methods differ from the prior two in that they consider density rather than distance: this statistic serves as the foundation for defining a cluster. Clusters, which are defined as a maximal set of connected points, are thought to be the densest regions in a data space, separated from each other by regions with lower object densities.
in
The data is assumed to be noise-free and the shape of the cluster that results is
simply geometrical when conducting the majority of the clustering (circular or elliptical).
nl
Data is always inconsistent to some degree (noise), which cannot be disregarded.
Furthermore, it is preferable to have variable shapes rather than sticking to fixed
attribute shapes in order to take into account all possible data points. These are the
applications where methods based on density have shown to be useful!
Clusters with various geometries, clusters without any size restrictions, clusters with the highest level of homogeneity (guaranteed by equal densities within them), and clusters that are inclusive of outliers or noisy data can all be obtained using density-based techniques.
Distribution-Based Clustering
The clustering methods used up to this point have been based either on proximity (similarity/distance) or on composition (density). A family of clustering algorithms instead considers probability, a wholly separate measure. Data points are grouped based on the likelihood that they belong to the same probability distribution; such distributions matter because they are used to generate and arrange datasets according to the principles of random sampling, i.e., to collect data points from a particular distribution. Objects that are most likely to belong to the same distribution can subsequently be characterised as clusters.
A significant disadvantage of density- and boundary-based techniques is the requirement to define the clusters a priori for some algorithms, and primarily the definition of the cluster form for the majority of algorithms. At least one tuning hyper-parameter must be chosen, and although doing so is straightforward, getting it wrong could have unintended consequences.
In terms of flexibility, accuracy and shape of the clusters created, distribution-based clustering offers a clear advantage over proximity- and centroid-based clustering techniques. The main issue is that these clustering approaches only perform well with data that is artificial or simulated, or when the majority of the data points unquestionably belong to a predefined distribution; otherwise, the results will overfit.
Fuzzy Clustering
The fundamental principle of clustering is to allocate data points to mutually exclusive clusters, meaning a data point only ever exists inside one cluster and cannot be a member of more than one. Fuzzy clustering algorithms challenge this paradigm by allocating a data point to multiple clusters with a specified degree-of-belongingness measure. Data points near a cluster's centre belong to that cluster to a greater extent than data points near the cluster's edge. A membership coefficient, which ranges from 0 to 1, measures the likelihood that an element belongs to a particular cluster.
Fuzzy clustering can be applied to datasets with a lot of overlap between the variables. It is used, for example, in bioinformatics and image analysis, where it is difficult for generic clustering algorithms to distinguish between an image's pixels and they fail to produce a suitable clustering.

In divisive hierarchical algorithms, on the other hand, all the data points are represented as one large cluster, and the clustering process entails splitting that one large cluster into several smaller ones.
There are certain difficulties with the K-means clustering algorithm: it always attempts to produce clusters of the same size, and the number of clusters is preset. Because we don't need to know the number of clusters in advance when using the hierarchical clustering technique, we can choose this algorithm to address these two problems.
One well-known HCA is the agglomerative hierarchical clustering algorithm. It employs a bottom-up methodology to cluster the datasets: the algorithm starts by treating each data point as a single cluster and then merges the two closest clusters, continuing until every cluster has been combined into a single cluster containing the whole dataset.
How does Agglomerative Hierarchical Clustering work?
The following steps describe how the AHC algorithm functions:
Step-1: Make a separate cluster out of each data point. If there are N data points, there will be N clusters.
Step-2: Combine the two nearest data points or clusters into a single cluster, leaving N-1 clusters.
Step-3: Once more, combine the two nearest clusters into a single cluster, leaving N-2 clusters.
Step-4: Repeat Step 3 until just one cluster remains, as shown in the figures below.
Step-5: Once all the clusters have been joined into one large cluster, create a dendrogram and divide the clusters according to the problem.
As we've seen, hierarchical clustering depends on how near two clusters are to one another. There are several methods for measuring the distance between two clusters, and these methods determine the clustering rule. These measures are called linkage techniques. Some of the popular linkage techniques are listed below.
Single Linkage: The distance between two clusters is the shortest distance between their closest points. Take a look at the figure below.

Complete Linkage: The distance between two clusters is the farthest distance between points in the two different clusters.
Average Linkage: The distance between two clusters is the average pairwise distance: the distances between every pair of points, one from each cluster, are added up and divided by the total number of pairs. It is also among the most widely used linkage techniques.
Centroid Linkage: The distance between the centroids of the two clusters is calculated.
Depending on the nature of the problem or the needs of the business, we can use
any of the aforementioned ways.
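In SciPy, for instance, the linkage criterion is simply an argument (a minimal sketch; the sample data here is invented):

import numpy as np
from scipy.cluster.hierarchy import linkage

X = np.random.rand(10, 2)                     # hypothetical sample data
Z_single   = linkage(X, method='single')      # shortest distance between closest points
Z_average  = linkage(X, method='average')     # average of all pairwise distances
Z_centroid = linkage(X, method='centroid')    # distance between cluster centroids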
Working of the Dendrogram in Hierarchical Clustering
The dendrogram, which resembles a tree, is primarily used to store each step that the HC algorithm executes as memory. In the dendrogram plot, the X-axis displays all of the data points in the specified dataset, while the Y-axis displays the Euclidean distances between the data points.
The diagram below shows how the dendrogram works:
The left portion of the diagram demonstrates how clusters are formed during agglomerative clustering, while the right portion displays the corresponding dendrogram.
Example:
Let's begin by importing the necessary libraries as shown below:

%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np

The data points we collected for this example will then be plotted:

X = np.array(
[[7,8],[12,20],[17,19],[26,15],[32,37],[87,75],[73,85],[62,80],[73,60],[87,96],])
labels = range(1, 11)
plt.figure(figsize = (10, 7))
plt.subplots_adjust(bottom = 0.1)
plt.scatter(X[:, 0], X[:, 1], label = 'True Position')
for label, x, y in zip(labels, X[:, 0], X[:, 1]):
    plt.annotate(
        label, xy = (x, y), xytext = (-3, 3), textcoords = 'offset points', ha = 'right', va = 'bottom')
plt.show()
It is pretty clear from the above diagram that we have two clusters in our data points; however, there could be thousands of clusters in real data. Next, the dendrogram of our data points will be plotted using the SciPy package:
from scipy.cluster.hierarchy import dendrogram, linkage

linked = linkage(X, 'single')
labelList = range(1, 11)
plt.figure(figsize = (10, 7))
dendrogram(linked, orientation = 'top', labels = labelList,
    distance_sort = 'descending', show_leaf_counts = True)
plt.show()
After the large cluster has formed, the longest vertical distance is chosen, and a horizontal line is drawn through it, as in the diagram that follows. Since this horizontal line crosses the blue vertical lines at two different locations, the number of clusters would be two.
The class for clustering must then be imported, and its fit_predict method must be called to predict the cluster of each data point:

from sklearn.cluster import AgglomerativeClustering

cluster = AgglomerativeClustering(n_clusters = 2, linkage = 'ward')
cluster.fit_predict(X)
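To see the two predicted clusters, the labels can be plotted (a minimal sketch continuing the example above; the colour map is an arbitrary choice):

plt.scatter(X[:, 0], X[:, 1], c = cluster.labels_, cmap = 'rainbow')
plt.show()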
The two clusters in our data points are depicted in the diagram above.
Divisive clustering
Divisive clustering is also referred to as a top-down strategy. The number of clusters need not be predetermined for this approach. Top-down clustering starts by figuring out how to split a cluster that contains all of the data, then splits clusters recursively until all of the individual data points are in singleton clusters.
Algorithm

given a dataset (d1, d2, d3, ....dN) of size N
at the top we have all data in one cluster
the cluster is split using a flat clustering method e.g. K-Means etc.
repeat
    choose the best cluster among all the clusters to split
    split that cluster by the flat clustering algorithm
until each data point is in its own singleton cluster
Advantages:
●● In some instances, it is easy to execute and produces the best results.
●● It is simple to implement and produces a hierarchy, i.e., a structure that contains more information.
●● The number of clusters does not need to be pre-determined.

Limitations:
●● It breaks up large clusters.
●● Handling convex shapes and clusters of varying sizes is challenging.
●● It is sensitive to noise and outliers.
●● Once a merge or split has been performed, it cannot be altered or undone.
4.1.6 DBSCAN (Density Based Spatial Clustering of Applications
with Noise)
In essence, clustering analysis, also known as clustering, is an unsupervised learning technique that separates the data points into a number of distinct batches or groups, with the goal of ensuring that the properties of data points within the same group are similar while the properties of data points in different groups are, at least in part, dissimilar. It comprises numerous differential-evolution-based methodologies.
Examples include Gaussian mixtures (Mahalanobis distance to centres), affinity propagation (graph distance), mean-shift (distance between points), DBSCAN (distance between nearest points), K-means (distance between points) and spectral clustering (graph distance).
Clusters are dense regions of the data space, separated by regions of lower point density. This intuitive understanding of "clusters" and "noise" is the foundation of the DBSCAN algorithm. The key principle is that, for each point of a cluster, the neighbourhood of a given radius must contain at least a minimum number of points.
Why DBSCAN?
Hierarchical clustering and partitioning methods (K-means, PAM clustering) work for finding convex or spherical clusters. In other words, they are only appropriate for compact and well-separated clusters. Moreover, the presence of noise and outliers in the data has a significant impact on them.

Real-life data, however, may contain irregularities:
◌◌ Clusters may have an arbitrary shape, like those in the figure below.
◌◌ The data may contain noise.
The figure depicts a data set with non-convex clusters and outliers/noise. Given such data, the K-means algorithm has trouble identifying these clusters with their arbitrary shapes.

The DBSCAN algorithm requires two parameters:
●● eps: It defines the neighbourhood around a data point; if the distance between two points is less than or equal to eps, they are regarded as neighbours. If the eps value is chosen too small, a large portion of the data will be considered outliers; if it is chosen too large, the clusters will merge and most of the data points will fall in the same cluster. The k-distance graph can be used to determine the eps value.
●● MinPts: The minimum number of neighbours (data points) within the eps radius. The larger the dataset, the larger the value of MinPts that must be selected. As a general rule, the minimum MinPts can be derived from the number of dimensions D in the dataset as MinPts >= D+1. MinPts must be set to a value of at least 3.

In this algorithm, there are three types of data points:
Core Point: A point is a core point if it has more than MinPts points within eps.
Border Point: A point that has fewer than MinPts points within eps but is in the neighbourhood of a core point.
Noise or outlier: A point that is neither a core point nor a border point.
DBSCAN algorithm can be abstracted in the following steps:
1. Find all the neighbour points within eps and identify the core points, i.e., the points with more than MinPts neighbours.
2. For each core point, create a new cluster if it hasn't already been assigned to one.
3. Recursively find all of the points that are density-connected to the core point and add them to the same cluster.
4. Two points a and b are said to be density-connected if there is a point c that has a sufficient number of points in its neighbourhood and both a and b are within the eps distance of it. This is a chaining process: if b is a neighbour of c, c is a neighbour of d, d is a neighbour of e, and e is a neighbour of a, then b is connected to a.
5. Iterate through the remaining unvisited points in the dataset. The points that do not belong to any cluster are noise.
DBSCAN(dataset, eps, MinPts):
# cluster index
C = 1
for each unvisited point p in dataset:
    mark p as visited
    # find neighbours
    N = getNeighbours(p, eps)
    if |N| >= MinPts:
        # p is a core point; start or extend a cluster
        add p to cluster C
        for each point p' in N:
            if p' is not visited:
                mark p' as visited
                N' = getNeighbours(p', eps)
                if |N'| >= MinPts:
                    N = N U N'
            if p' is not yet a member of any cluster:
                add p' to cluster C
        C = C + 1
    else:
        mark p as noise
Additionally, we will assess the clustering using the Silhouette score and the Adjusted Rand score.
Silhouette scores range from -1 to 1. The ideal score is 1 or close to 1, indicating that the data point is far from other clusters and very compact inside the cluster to which it belongs. A value of -1 is the worst, and values close to 0 indicate overlapping clusters.
The Adjusted Rand score ranges from 0 to 1. A value above 0.9 denotes excellent cluster recovery and above 0.8 good recovery, while a value below 0.5 denotes poor recovery.
Example:
import matplotlib.pyplot as plt
import numpy as np

from sklearn import metrics
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Load data in X
X, y_true = make_blobs(n_samples=300, centers=4,
    cluster_std=0.50, random_state=0)
db = DBSCAN(eps=0.3, min_samples=10).fit(X)
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_
print(labels)

# number of clusters in labels, ignoring noise (-1) if present
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)

# Plot result
unique_labels = set(labels)
colors = plt.cm.Spectral(np.linspace(0, 1, len(unique_labels)))
print(colors)
for k, col in zip(unique_labels, colors):
    if k == -1:
        # black is used for noise
        col = 'k'
    class_member_mask = (labels == k)
    # core points
    xy = X[class_member_mask & core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=col,
        markeredgecolor='k',
        markersize=6)
    # border points
    xy = X[class_member_mask & ~core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=col,
        markeredgecolor='k',
        markersize=6)

plt.title('number of clusters: %d' % n_clusters_)
plt.show()

#evaluation metrics
sc = metrics.silhouette_score(X, labels)
print("Silhouette Coefficient:%0.2f" % sc)
ari = adjusted_rand_score(y_true, labels)
print("Adjusted Rand Index: %0.2f" % ari)
Output:
Silhouette Coefficient:0.13
Outliers are shown by black points. We can alter the cluster setup by modifying the
eps and the MinPts.
Anomaly Detection
Finding unusual occurrences or observations that are statistically distinct from the rest of the observations is known as anomaly detection. Such "anomalous" behaviour is usually indicative of some sort of issue, such as credit card fraud, a failing server component, a cyberattack, etc.

Anomalies can be broadly categorised as:
1. Point Anomaly: A tuple in a dataset is referred to as a point anomaly if it deviates significantly from the rest of the data.
2. Contextual Anomaly: An observation that is anomalous because of the context in which it was made.
3. Collective Anomaly: A collection of data instances that together help in the discovery of an anomaly.
The principles of machine learning can be used for anomaly detection. The
following techniques can be used:
1. Supervised Anomaly Detection: This technique builds a predictive model that classifies future data points using a labelled dataset that includes both normal and anomalous samples. The most popular methods for this purpose include the K-Nearest Neighbors classifier, support vector machine learning and supervised neural networks.
2. Unsupervised Anomaly Detection: This technique doesn't need any training data and instead makes two assumptions about the data: (1) only a small fraction of the data is anomalous, and (2) any anomaly deviates statistically from the normal samples. Based on these assumptions, the data is clustered using a similarity measure, and the data points that are far from the clusters are regarded as anomalies.
Now, using the K-Nearest Neighbors method from the pyod module, we show how
anomaly detection works on a fictitious dataset.
Step 1: Importing the required libraries

import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
import matplotlib.font_manager
from pyod.models.knn import KNN
from pyod.utils.data import generate_data, get_outliers_inliers

Step 2: Generating the synthetic data

# generating a random dataset with two features
X_train, y_train = generate_data(n_train = 300, train_only = True,
    n_features = 2)

# setting the percentage of outliers
outlier_fraction = 0.1

# storing the outliers and inliers in different numpy arrays
X_outliers, X_inliers = get_outliers_inliers(X_train, y_train)
n_inliers = len(X_inliers)
n_outliers = len(X_outliers)

# separating the two features
f1 = X_train[:, [0]].reshape(-1, 1)
f2 = X_train[:, [1]].reshape(-1, 1)
Step 3: Visualising the data

# Visualising the dataset
# create a meshgrid
xx, yy = np.meshgrid(np.linspace(-10, 10, 200),
    np.linspace(-10, 10, 200))

# scatter plot
plt.scatter(f1, f2)
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.show()
Output:
Step 4: Training and evaluating the model

# training the KNN detector from pyod
clf = KNN(contamination = outlier_fraction)
clf.fit(X_train, y_train)

# raw anomaly scores (the higher, the more abnormal)
scores_pred = clf.decision_function(X_train)*-1

# binary prediction: 0 for inliers, 1 for outliers
y_pred = clf.predict(X_train)
n_errors = (y_pred != y_train).sum()
print('The number of prediction errors are ' + str(n_errors))
Output:
Step 5: Visualising the predictions

# threshold value to consider a
# datapoint inlier or outlier
threshold = stats.scoreatpercentile(scores_pred, 100 * outlier_fraction)

# decision function calculates the raw
# anomaly score for every point
Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()]) * -1
Z = Z.reshape(xx.shape)

# fill blue colormap from minimum anomaly
# score to threshold value
subplot = plt.subplot(1, 2, 1)
subplot.contourf(xx, yy, Z, levels = np.linspace(Z.min(),
    threshold, 10), cmap = plt.cm.Blues_r)

# draw red contour line where anomaly score equals the threshold
a = subplot.contour(xx, yy, Z, levels = [threshold], linewidths = 2, colors = 'red')

# fill orange contour from threshold to maximum anomaly score
subplot.contourf(xx, yy, Z, levels = [threshold, Z.max()], colors = 'orange')

# scatter plot of inliers with white dots
b = subplot.scatter(X_train[:-n_outliers, 0], X_train[:-n_outliers, 1],
    c = 'white', s = 20, edgecolor = 'k')

# scatter plot of outliers with black dots
c = subplot.scatter(X_train[-n_outliers:, 0], X_train[-n_outliers:, 1],
    c = 'black', s = 20, edgecolor = 'k')
subplot.axis('tight')

subplot.legend(
    [a.collections[0], b, c],
    ['learned decision function', 'true inliers', 'true outliers'],
    prop = matplotlib.font_manager.FontProperties(size = 10),
    loc = 'lower right')

subplot.set_title('K-Nearest Neighbours')
subplot.set_xlim((-10, 10))
subplot.set_ylim((-10, 10))
plt.show()
Output:
K-Means Clustering
The K-means algorithm keeps the number of clusters (let's assume k clusters) fixed a priori to categorise a given data set. The main notion is to define k centres, one for each cluster. These centres should be strategically positioned, because different locations yield different results; the preferable option is to situate them as far apart from one another as possible.
The next phase is connecting each point of the given data set to the closest centre. When no point is left unassigned, the first step is complete and an early grouping is done. We must then recalculate k new centroids as the barycentres of the clusters resulting from the previous step. Once we have these k new centroids, the same data set points must again be bound to the closest new centre.
This creates a loop: the k centres may gradually shift positions until no more changes occur, or, in other words, until the centres stop moving altogether. The algorithm's final goal is to minimise the squared error function, which is represented by:
J(V) = Σ (i=1 to c) Σ (j=1 to ci) ( ||xij − vi|| )²

where,
'||xij − vi||' is the Euclidean distance between data point xij and cluster centre vi,
'c' is the number of cluster centres, and
'ci' is the number of data points in the ith cluster.
Algorithmic steps for k-means clustering
Let X = {x1, x2, x3, ........, xn} be the set of data points and V = {v1, v2, ......., vc} be the set of centers.
1. Randomly select 'c' cluster centers.
2. Calculate the distance between each data point and the cluster centers.
3. Assign each data point to the cluster center whose distance from it is the minimum over all the cluster centers.
4. Recalculate the new cluster centers using:

vi = (1/ci) Σ (j=1 to ci) xij

where 'ci' represents the number of data points in the ith cluster.
5. Recalculate the distance between each data point and the newly obtained cluster centers.
6. If no data point was reassigned then stop; otherwise repeat from step 3.
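A compact NumPy sketch of these steps (illustrative only; it assumes no cluster becomes empty, and real work would normally use a library implementation such as sklearn.cluster.KMeans):

import numpy as np

def kmeans(X, c, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # step 1: randomly select c cluster centers
    centers = X[rng.choice(len(X), size=c, replace=False)]
    for _ in range(iters):
        # steps 2-3: assign each point to its nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # step 4: recalculate each center as the mean of its assigned points
        new_centers = np.array([X[labels == i].mean(axis=0) for i in range(c)])
        # steps 5-6: stop once the centers no longer move
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels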
Advantages
The K-means clustering technique provides the following benefits:
◌◌ It is very easy to understand and implement.
◌◌ It is faster than hierarchical clustering when there are a large number of variables.
◌◌ An instance may switch clusters when the centroids are recalculated.
◌◌ Compared to hierarchical clustering, K-means produces tighter clusters.
Disadvantages
K-Means clustering techniques have the following drawbacks:
◌◌ Predicting the value of k, i.e., the number of clusters, is a little challenging.
◌◌ Initial inputs, such as the number of clusters (the value of k), have a significant impact on the output.
◌◌ The order of the data will significantly impact the final result.
◌◌ It is extremely sensitive to rescaling: if we rescale our data using normalisation or standardisation, the output will be radically different.
◌◌ It is not recommended when the clusters have a sophisticated geometric shape.
K-means clustering performs well enough to achieve the aforementioned objectives. It can be applied to the following:
●● Market segmentation
●● Document Clustering
●● Image segmentation
●● Image compression
●● Customer segmentation
●● Analyzing the trend on dynamic data
The dataset includes Customer_Id, Gender, Age, Annual Income ($) and Spending Score (a calculated value indicating how much a customer has spent in the mall: the higher the value, the more they have spent). Since this method is unsupervised, we don't know exactly what to calculate, so we need to discover patterns from this dataset.
The following steps are needed for the implementation:
●● Pre-processing of the data.
●● Finding the optimal number of clusters using the elbow method.
●● Training the K-means algorithm on the training dataset.
●● Visualising the clusters.
Step-1: Data pre-processing
Data pre-processing will be the first step. However, compared to other models, it differs for the clustering problem. Let's discuss it.
Import libraries
As part of data pre-processing, we will start by importing the libraries for our model. Following is the code:

# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd

In the code above, we have imported numpy to conduct mathematical calculations, matplotlib to plot the graphs, and pandas to manage the dataset.
The required dataset will then be imported. We are using the Mall_Customers_data.csv dataset in this instance. It can be imported with the code listed below:

dataset = pd.read_csv('Mall_Customers_data.csv')
By running the code above, we can inspect our dataset in the Spyder IDE. The dataset looks like the image below:
We need to look for some patterns in the dataset mentioned above.
• Extracting Independent Variables
Since this is a clustering problem and we don't know what to find, we don't require any dependent variable for the data pre-processing step. Therefore, we will merely add a line of code for the features matrix, as shown below.
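The extraction line itself is omitted in the source; consistent with the description below (the third and fourth columns, i.e., Annual Income and Spending Score), it would presumably be:

x = dataset.iloc[:, [3, 4]].values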
As can be seen, we are simply extracting the third and fourth features. This is because some features, like customer_id, are not necessary, and we need a 2-D plot to visualise the model.
Step-2: Finding the optimal number of clusters using the elbow method
The best number of clusters for our clustering problem will be determined in
the second step. As was said previously, we will employ the elbow approach in this
instance.
As is well known, the elbow approach plots the WCSS values on the Y-axis and the
number of clusters on the X-axis to create the plot using the WCSS idea. Therefore, for
various k values between 1 and 10, we will calculate the value for WCSS. The code is
listed below:
m
# finding the optimal number of clusters using the elbow method
from sklearn.cluster import KMeans
wcss_list= []  # initialising the list for the values of WCSS

# using a for loop for iterations from 1 to 10
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', random_state= 42)
    kmeans.fit(x)
    wcss_list.append(kmeans.inertia_)
mtp.plot(range(1, 11), wcss_list)
mtp.title('The Elbow Method Graph')
mtp.xlabel('Number of clusters(k)')
mtp.ylabel('wcss_list')
mtp.show()
As we can see from the code above, we formed the clusters using the KMeans class of the sklearn.cluster library. We then created the wcss_list variable, an empty list to hold the WCSS values computed for the various values of k ranging from 1 to 10.
Then we initialised the for loop to iterate over values of k from 1 to 10; since Python's range excludes the upper bound, it is written as 11 to include the 10th value.
The remaining portion of the code is identical to what we did in earlier sections: we fitted the model to the matrix of features and then plotted the graph of the number of clusters against WCSS.

The following output will be obtained after running the aforementioned code:
We can see from the plot above that the elbow point is at 5. Therefore, there will be 5 clusters here.
Step-3: Training the K-means algorithm on the training dataset
We can now train the model on the dataset, because we know how many clusters there are.
The code used in the section above will be used again to train the model, but this time with n_clusters=5 instead of i, because we know that 5 clusters must be produced. Following is the code:
#training the K-means model on a dataset
kmeans = KMeans(n_clusters=5, init='k-means++', random_state= 42)
y_predict= kmeans.fit_predict(x)
The first line is the same as above: it constructs an object of the KMeans class. The second line creates the dependent variable y_predict to train the model.
We will obtain the y_predict variable by running the lines of code above. In the Spyder IDE, we can inspect it using the variable explorer option. Now that we have our original dataset and the y_predict values, we can compare them. Take a look at the image below:
From the image above we can deduce that CustomerID 1 belongs to cluster 3 (since the index starts at 0, the value 2 is treated as cluster 3), CustomerID 2 belongs to cluster 4, and so on.
Step-4: Visualising the clusters
The matplotlib function mtp.scatter() will be used to create a scatter plot to visualise the clusters:
#visualising the clusters
mtp.scatter(x[y_predict == 0, 0], x[y_predict == 0, 1], s = 100, c = 'blue', label = 'Cluster 1') #for first cluster
mtp.scatter(x[y_predict == 1, 0], x[y_predict == 1, 1], s = 100, c = 'green', label = 'Cluster 2') #for second cluster
mtp.scatter(x[y_predict == 2, 0], x[y_predict == 2, 1], s = 100, c = 'red', label = 'Cluster 3') #for third cluster
mtp.scatter(x[y_predict == 3, 0], x[y_predict == 3, 1], s = 100, c = 'cyan', label = 'Cluster 4') #for fourth cluster
mtp.scatter(x[y_predict == 4, 0], x[y_predict == 4, 1], s = 100, c = 'magenta', label = 'Cluster 5') #for fifth cluster
mtp.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s = 300,
    c = 'yellow', label = 'Centroid')
mtp.title('Clusters of customers')
mtp.xlabel('Annual Income (k$)')
mtp.ylabel('Spending Score (1-100)')
mtp.legend()
mtp.show()
In the lines of code above there is one scatter call for each cluster, from 1 to 5. The first coordinate of mtp.scatter(), x[y_predict == 0, 0], selects the x values of the feature matrix for the points whose predicted cluster index is 0, and the other calls do the same for cluster indices 1 to 4.
Output:
The five distinct clusters, each with a distinctive colour, are plainly visible in the resulting image. The clusters are formed over the dataset’s two parameters, Annual Income and Spending Score. The labels and colours can be changed to suit your needs or preferences. From the patterns above, we may additionally note the following points:
●● Cluster 1 displays the clients with average income and average spending, allowing us to classify these clients as standard customers.
●● Cluster 2 demonstrates that the consumers have a high income but minimal spending, so we can classify them as prudent.
●● Cluster 3 has low income as well as modest spending, qualifying these customers as sensible.
●● Cluster 4 displays the clients who have modest incomes but high spending rates, indicating that they are careless spenders.
●● Cluster 5 identifies the target clients for the mall owner: they have high incomes and high spending levels, so these are the customers worth pursuing.
4.1.12 Association Rule Mining
Finding common patterns, correlations, connections, or causal structures in data sets found in many types of databases, such as relational databases, transactional databases, and other data repositories, is the goal of the method known as association rule mining.
Given a set of transactions, the goal of association rule mining is to identify the rules that allow us to anticipate the occurrence of a particular item based on the occurrences of the other items in the transaction.
An approach frequently used for market basket analysis is association rule mining.
Association rule mining (ARM) is currently the preferred method for analysing huge market basket data. However, when many products are sold across a significant volume of transactions, the data matrix needed for association rule mining becomes enormous and sparse, which takes longer to compute. Association rules express this kind of information in the form of “IF-THEN” statements. To understand the existence, nature, and strength of an association rule, three indices are frequently used: lift, support, and confidence.
Lift is examined first, since it tells us whether an association exists at all and whether it is a positive or negative association. Support and confidence then quantify how often and how reliably the rule holds.
The support of an item or itemset is the percentage of transactions in our dataset that contain that item or itemset. It is a crucial measure, since a rule with little support may arise purely by accident. A low-support rule may also be uninteresting from a business standpoint, because it may not be economical to promote goods that are infrequently purchased together. For these reasons, support is frequently used to eliminate uninteresting rules. In compact notation, with N the total number of transactions:
support = (A ∩ B)/N
Confidence is the conditional probability that a transaction containing the LHS of the rule will also contain the RHS. Results of association analyses should be handled with care: an association rule does not automatically imply causality. Instead, it suggests a close association between the items in the rule’s antecedent and consequent.
(c
an excessive number of irrelevant rules that may not be in our best interests. We
Notes
e
frequently set a support and confidence level before the analysis in order to ensure that
only helpful and interesting rules are produced in our outcome.
If lift is larger than 1, there is a greater chance that the items on the RHS will occur in a transaction as a result of the LHS’s presence. If the lift is less than 1, the items on the RHS are less likely to be purchased as a result of the LHS’s presence. If the lift is exactly 1, the presence of the LHS and RHS items is independent: the likelihood that items appear on the RHS is unaffected by the presence of the LHS items.
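Again in the same notation, the index just described can be written explicitly as:
lift(A → B) = support(A ∩ B)/(support(A) × support(B))
so that lift above, equal to, or below 1 corresponds exactly to the positive, independent, and negative cases above.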
We search for rules with a lift of more than one when performing market basket analysis. It is also ideal to have rules with high support, since they apply to many transactions, and rules with high confidence, in which there is a high likelihood that an item appears on the RHS given the presence of the items on the LHS.
Become familiar with the following terms related to market basket analysis, using the rule {bread → butter} as a running example:
●● Antecedents are items or “itemsets” that can be located in the data. Put simply, it is the IF component, written on the left-hand side. In the example above, the antecedent is bread.
●● A consequent is an item or group of items found in conjunction with the antecedent. It is the THEN clause, written on the right-hand side. In the example above, the consequent is butter.
The principal uses of association rule mining include:
●● Market basket data analysis.
●● Cross-marketing: collaborating with businesses that complement your own, not with rivals. For instance, it stands to reason that auto dealers and manufacturers engage in cross-promotional activities with the oil and gas industry.
●● Catalogue design: the items in a company’s catalogue are frequently chosen to work well together, so that purchasing one item will prompt the purchase of another.
Rule generation involves two steps. The first step is to create itemsets, such as {Bread, Egg, Milk}, and the second is to create rules, such as {Bread → Egg, Milk} or {Bread, Egg → Milk}, from each itemset. The following discussion covers both steps.
1. Generating frequent itemsets
Finding the itemsets from which we obtain the antecedent and consequent is the first stage in the creation of association rules. For instance, if all of the transactions together include the six items bread, butter, eggs, milk, notebooks, and toothbrush, the itemsets will look like {bread}, {butter}, {bread, butter}, {milk, toothbrush}, {milk, eggs, vegetables}, and so on. An itemset can range in size from one item to all of the items we have. Now, in order to keep the total number of generated itemsets under control, we look only for frequent itemsets in this data and not for all possible itemsets.
Frequent itemsets are those that appear in transactions at least a certain number of times. Technically speaking, these are the itemsets whose support value (the fraction of transactions that contain the itemset) is higher than the minsup threshold.
Every transaction that contains {Bread, Egg, Vegetables} necessarily also contains {Bread, Egg}, so the larger itemset can never be more frequent than the smaller one. If {Bread, Egg, Vegetables} appears in 30 of 100 transactions, then {Bread, Egg} appears in all of those 30 transactions and possibly a few more. Therefore, if the support value of {Bread, Egg, Vegetables}, (30/100) = 0.3, is above minsup, then the support of {Bread, Egg}, (≥30/100) ≥ 0.3, is above minsup as well. In general, the support value of an itemset either remains the same or increases when an item is removed from it. This is known as the anti-monotone property of support.
The Apriori principle enables us to remove all supersets of an itemset that does not meet the minimum support threshold. For instance, if {Milk, Notebook} does not meet our minsup criterion, no itemset formed by adding an item to it will either. The procedure is: first make every frequent itemset (support ≥ minsup) with just one item; next, create itemsets of length 2 from all combinations of the preceding itemsets, and remove those whose support value falls below the threshold; then continue in the same way for longer itemsets.
Even in small examples, this pruning can give roughly a 50% reduction in the number of itemsets to take into consideration. As the number of items rises, the saving in computing power becomes ever more important. The proportion saved also depends on the minimum support level (minsup) that we choose, which is entirely arbitrary with respect to the problem at hand and may be based on prior experience.
2. Generating all possible rules from the frequent itemsets
Finding rules from the frequent itemsets already generated is comparatively less difficult. Rules are created from each itemset’s binary partitions. If {Bread, Egg, Milk, Butter} is the frequent itemset, candidate rules will look like (Egg, Milk, Butter → Bread), (Bread, Milk, Butter → Egg), (Bread, Egg → Milk, Butter), and so on.
From the list of all candidate rules, we seek to select the rules that exceed a minimum confidence level (minconf). Like the anti-monotone property of support, the confidence of rules derived from the same itemset also follows an anti-monotone property: it is anti-monotone in the number of items on the left-hand side.
Recall that confidence(X → Y) = support(X ∪ Y)/support(X). Since all of the rules produced from the same itemset have the same numerator, the only difference is in the denominator. Due to the anti-monotone property of support, when the number of items in X decreases, support(X) grows, and as a result the confidence value falls.
A simple illustration of the above: from the frequent itemset {Egg, Milk, Butter, Bread}, consider two confidence values F1 and F2:
F1 = the fraction of transactions containing (Butter) that also contain (Egg, Milk, Bread);
F2 = the fraction of transactions containing (Milk, Butter, Bread) that also contain (Egg).
Compared to those containing only butter, there will be far fewer transactions that contain milk, butter, and bread together, and the likelihood that an egg appears in those transactions is relatively high. Consequently, F1 ≤ F2. Pruning is then carried out using this confidence property, similarly to how it was done while searching for frequent itemsets. The image below serves as an illustration.
Beginning with a frequent itemset of {a, b, c, d}, we create rules with just one consequent, eliminate the rules that don’t meet the minconf requirement, then create further rules by combining the consequents of the remaining ones, and repeat until there is only one item left in the antecedent. All frequent itemsets must go through this process. Once more, the minimum confidence criterion that we choose depends entirely on the issue at hand.
These two steps give us a set of association rules that meet the minimum support and minimum confidence requirements. Depending on the values of minsup and minconf, different numbers of rules will be produced. From this subset of rules, those with the highest lift values can then be sought in order to make business judgements.
Maximal frequent itemset: a frequent itemset for which none of its immediate supersets is frequent. Equivalently, it is a frequent itemset X to which no item y can be added while keeping {X, y} above the minimum support threshold.
Closed frequent itemset: a frequent itemset that has no immediate superset with the same support. Consider an itemset X: X is NOT closed if at least one of its immediate supersets occurs in exactly the same transactions, i.e., every instance of X is also an instance of {X, y} for some item y.
Because they are the most compact representation of the frequent itemsets, maximal frequent itemsets are valuable: all frequent itemsets can be obtained as their subsets. The support values of those subsets, however, are lost. If this information is necessary, the closed frequent itemsets are another way to express all the frequent itemsets: they make it possible to eliminate redundant itemsets without losing data on the support values.
Apriori Algorithm
This section describes the algorithm that powers Market Basket Analysis, an application of association rule mining, which will be implemented using Python modules. This helps businesses better understand their customers and analyse their data with greater care and attention. The Apriori algorithm, the first associative algorithm, introduced by Rakesh Agrawal, has been employed here as the basis. It consists of two major steps:
●● Frequent Itemset Generation: locate all frequent itemsets with a support level greater than a specified minimum. Frequent-pattern mining identifies the interesting connections and links between itemsets in relational and transactional databases; in a nutshell, it identifies the elements that frequently coexist in a transaction or relation. Multiple passes over the data are required to find frequent itemsets: the whole training set must be scanned in order to count new candidate itemsets generated from existing itemsets. In brief, only two crucial operations are needed:
◌◌ Pruning
◌◌ Joining
●● Rule Generation: enumerate all of the association rules from the frequent itemsets, calculate each rule’s support and confidence, and remove rules that don’t pass the minimum support and confidence levels.
Frequent itemset generation searches the entire database for frequent itemsets above a support threshold. It is the most computationally intensive stage because it scans the entire database; in the real world, retail transaction data can amount to gigabytes and terabytes. An optimal method is therefore required to filter out itemsets that will not be useful in the following steps, and this is where the Apriori algorithm is employed.
The Apriori principle states: “A frequent itemset’s subsets must also be frequent. In other words, no superset of an infrequent itemset ever needs to be generated or tested.”
The itemset lattice below comprises one node per k-itemset and the relations between k-itemsets and their subsets. As the graphic shows, the individual transaction items are all listed at the bottom, and as you move up you keep producing subsets until you reach the null set.
This demonstrates that computing support for every combination makes it tough to produce frequent itemsets directly. As a result, the Apriori method helps to decrease the number of itemsets that must be generated, as seen in the image below.
Figure: If an itemset is infrequent, we do not consider its supersets
We do not need to take into account any of the supersets of an itemset {a, b} if it occurs infrequently.
The following example demonstrates why the Apriori method is significantly more efficient and gradually produces stronger association rules.
Step1:
◌◌ Make a table with the support count of every item in the dataset.
◌◌ Comparing each support count to the minimum support count (in this example, the minimum support count is 2), we drop the items whose support count is lower and obtain a new set of items.
Step2:
◌◌ This process is called the join step. By cross-joining each item with another, we create a new set of candidate 2-itemsets.
◌◌ Now scan the dataset for the support count of these itemsets.
◌◌ Since we have already established a minimum support count of 2, the itemsets whose support count is lower than 2 are removed. This gives us another collection of itemsets, as shown below.
Figure: Pruning and joining
Step3:
◌◌ After obtaining a new set of itemsets, we repeat the join step: each itemset is cross-joined with the others. The itemsets created after this step will be:
{I1, I2, I3}
{I1, I2, I4} and so on.
◌◌ Check whether all the subsets of these itemsets are frequent; if not, eliminate the itemset. For instance, the subsets of {I1, I2, I3} are {I1, I2}, {I1, I3} and {I2, I3}, all of which are frequent. However, one of the subsets of {I2, I3, I4} is {I3, I4}, which is infrequent, so we take {I2, I3, I4} out. The same check is applied to each itemset.
◌◌ Find the support count of the remaining itemsets by scanning the dataset again after pruning, and remove those below the minimum support count.
Figure: Pruning and joining again until there are no more frequent itemsets left
Step4:
◌◌ The same process is repeated, beginning with the join step: we cross-join each itemset with the others, where the first two elements of the itemsets being joined should match.
◌◌ Then determine whether all the subsets of the resulting itemsets are frequent. The itemset created after the join step in our example is {I1, I2, I3, I5}; one of its subsets, {I1, I3, I5}, is infrequent, so this itemset is removed and no itemset remains.
◌◌ Because there are no longer any frequent itemsets, we stop here.
This completes the frequent-itemset generation phase of association rule mining.
The next stage is to list all the frequent itemsets and measure how strong the association rules generated from them are. For this, we determine each rule’s confidence level, using the formula:
confidence(A → B) = support count(A ∪ B)/support count(A)
We will demonstrate rule creation using one of the frequent itemsets, {I1, I2, I3}.
If the minimum confidence level is 50%, the first three rules generated from this itemset can be regarded as strong association rules. For instance, {I1, I2} → {I3} having a confidence of 50% indicates that 50% of those who purchased I1 and I2 also purchased I3.
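The two steps walked through above can be reproduced in a few lines of Python. The sketch below uses the mlxtend library (an assumption: mlxtend is installed; the nine transactions over I1–I5 are illustrative stand-ins consistent with the example, not taken verbatim from the text):

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

# illustrative transactions over the items I1..I5
transactions = [['I1', 'I2', 'I5'], ['I2', 'I4'], ['I2', 'I3'],
                ['I1', 'I2', 'I4'], ['I1', 'I3'], ['I2', 'I3'],
                ['I1', 'I3'], ['I1', 'I2', 'I3', 'I5'], ['I1', 'I2', 'I3']]

# one-hot encode the transactions into a boolean DataFrame
te = TransactionEncoder()
df = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)

# step 1: frequent itemset generation (minimum support count 2 out of 9 transactions)
frequent = apriori(df, min_support=2/9, use_colnames=True)

# step 2: rule generation with a 50% minimum confidence threshold
rules = association_rules(frequent, metric='confidence', min_threshold=0.5)
print(rules[['antecedents', 'consequents', 'support', 'confidence', 'lift']])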
4.1.13 Hands On: Association Rule (Apriori)
Association Rules for Market Basket Analysis in R: Step-by-Step Implementation
The ‘arules’ package of the R programming language is used to represent, work with, and examine transaction data and patterns. It performs market basket analysis (MBA) on the data using frequent itemsets and association rules.
Terminal Output:
Input Data
The ‘arules’ package contains the apriori() function, one of the popular methods in market basket analysis. It accepts a range of input parameters, including the support threshold (supp), the confidence threshold (conf), and the mining target (target = ’rules’).
Terminal Output:
The status of the completed operations is displayed together with the parameter values used.
In this output, LHS stands for items that have already been taken, and RHS for objects that are usually taken together with the purchased items. We can infer from this analysis that there are some common pairings of items, such as “Coffee” and “Toast”, and “Coffee” and “Alfajores.”
4.1.14 Principal Component Analysis
One of the main applications of the SVD is principal component analysis (PCA), which offers a data-driven, hierarchical coordinate system to describe high-dimensional correlated data; this coordinate system involves the correlation matrices. Prior to running the SVD, PCA pre-processes the data by mean subtraction and by setting the variance to unity. The geometry of the resulting coordinate system is determined by principal components (PCs) that are maximally correlated with the data but uncorrelated with each other. Pearson proposed this theory in 1901, and Hotelling developed it independently in the 1930s; Jolliffe offers a useful reference book.
Some important terms:
●● Dimensionality: the number of characteristics or variables in the dataset in question, most easily determined as the number of columns in the dataset.
●● Correlation: denotes how closely two variables are related to one another, i.e., if one variable changes, the other variable also changes. The correlation value lies between -1 and +1. Here, -1 denotes an inverse relationship between the variables, and +1 a direct relationship.
●● Orthogonal: states that variables are not related to one another, hence there is no correlation between the two.
●● Eigenvector and Eigenvalue: consider a non-zero vector v. The vector v is referred to as an eigenvector of the square matrix A (n x n) if Av is a scalar multiple of v. This can be represented by the equation Av = λv, where v is the eigenvector and λ the associated eigenvalue. Applying this linear transformation does not change the direction of the vector, only its scale, and the eigenvector must have a non-null value. Here is the equation once more for quick reference: (square matrix A × eigenvector) − (eigenvalue × eigenvector) = 0. Eigenvectors and eigenvalues help us comprehend and analyse data, so we employ them to transform data and display it in a more comprehensible manner. Both ideas are fundamental to PCA.
Properties of principal components:
●● Each principal component must be a linear combination of the original features.
●● The components are orthogonal, i.e., there is no correlation between any two of them.
●● Going from 1 to n, the importance of each component declines, making PC1 the most important and PCn the least important.
In a typical experiment, a number of measurements are gathered and organised into a row vector. The measurements could be characteristics of an observable, such as the demographics of a particular human being. After a number of experiments, each measurement vector is arranged as a row in a large matrix X. In the case of demography, the data may be gathered by polling. The matrix X is of size n × m, whether it has more rows than columns or vice versa.
Computation: the row-wise mean x̄ (the mean across all rows) is now calculated and subtracted from X. The mean is given by
x̄_j = (1/n) Σ_i X_ij
and the mean matrix X̄ is the column of ones multiplied by x̄. Subtracting X̄ from X results in the mean-subtracted data B:
B = X − X̄
The first principal component corresponds to the eigenvector of BᵀB with the largest eigenvalue, which is the singular vector of B associated with the highest singular value, u1. More generally, the principal components are the eigenvectors V of the covariance matrix C of B, satisfying
CV = VD
where D is the diagonal matrix of eigenvalues.
pca command: for principal component analysis, Matlab provides the extra functions pca and princomp (based on pca).
Up to changes in column sign, the matrix V returned by pca is equal to the V matrix from the SVD of X. The squares of the singular values make up the eigenvalues of the covariance of X, sometimes referred to as the principal component variances, returned in the vector s2. The variable score contains the coordinates of each row of B (the mean-subtracted data) in the principal component directions.
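A minimal NumPy sketch of the computation just described, mean subtraction followed by an SVD of the centred data, is given below (the random matrix X is an illustrative placeholder; variable names such as score mirror the description above):

import numpy as np

X = np.random.rand(100, 3)            # illustrative data: 100 experiments, 3 measurements
x_bar = X.mean(axis=0)                # row-wise mean
B = X - x_bar                         # mean-subtracted data
U, s, Vt = np.linalg.svd(B, full_matrices=False)
score = B @ Vt.T                      # coordinates of each row of B in the PC directions
pc_variances = s**2 / (len(X) - 1)    # principal component variances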
The dataset for the breast cancer example given below is easily accessible via the Python sklearn.datasets module.
Step-1: Loading all necessary packages

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
%matplotlib inline

cancer = load_breast_cancer()

# creating dataframe
df = pd.DataFrame(cancer['data'], columns = cancer['feature_names'])
df.head()
This piece of code loads the data into a variable called cancer and then prints the top 5 rows of the data frame constructed from it. As a result, we obtain the following.
Step-2: Standardising the data and applying PCA
All the features in this dataset are numbers. Thus, PCA can be applied immediately:
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

scalar = StandardScaler()

# Standardizing
scalar.fit(df)
scaled_data = scalar.transform(df)

# applying PCA
pca = PCA(n_components = 3)
pca.fit(scaled_data)
x_pca = pca.transform(scaled_data)
x_pca.shape
Here we are normalising the data (i.e., df) and running PCA on it. The n_components parameter represents the desired number of principal components, i.e., new features.
Step-4: 2-D results

plt.figure(figsize = (8, 6))
plt.scatter(x_pca[:, 0], x_pca[:, 1], c = cancer['target'])  # colour points by target label

This step simply displays a 2-dimensional plot of the data, which looks something like this.
Step-5: 3-D results

from mpl_toolkits import mplot3d
fig = plt.figure(figsize = (9, 7))
ax = plt.axes(projection = '3d')
ax.scatter3D(x_pca[:, 0], x_pca[:, 1], x_pca[:, 2], c = cancer['target'], linewidth = 1);

Things start to become intriguing at this point. The code above produces a 3-D plot; the output images we obtained are shown below.
Fig: 3-D results of PCA applied upon breast cancer dataset
Fig: Plot emphasising the first and second principal components
Conclusion:
Variation across the first component > variation across the second component > variation across the third component.
The values fall between -7 and 20 (range = 27) for the first principal component, between -9 and 13 (range = 22) for the second principal component, and between -6 and 10 (range = 16) for the third principal component.
Singular Value Decomposition (SVD)
The Singular Value Decomposition (SVD) factors a matrix into three matrices. It communicates significant geometrical and theoretical insights about linear transformations and has several intriguing algebraic properties, with important applications in data science. For a matrix A:

A = UWVᵀ

where:
◌◌ U: an orthonormal matrix whose columns are the eigenvectors of AAᵀ (the left singular vectors).
◌◌ W: a diagonal matrix of the singular values, which are the square roots of the eigenvalues of AᵀA.
◌◌ Vᵀ: the transpose of an n×n orthonormal matrix whose columns are the eigenvectors of AᵀA (the right singular vectors).
Examples
◌◌ Find the SVD of the matrix A = [[3, 2, 2], [2, 3, −2]].
◌◌ To calculate the SVD, first we compute the singular values by finding the eigenvalues of AAᵀ:

AAᵀ = [[17, 8], [8, 17]]

The characteristic equation for this matrix is

det(AAᵀ − λI) = λ² − 34λ + 225 = (λ − 25)(λ − 9) = 0

so λ = 25, 9 and our singular values are: σ₁ = 5; σ₂ = 3.
◌◌ We now identify the right singular vectors, i.e., the orthonormal set of eigenvectors of AᵀA. Since AᵀA is symmetric with eigenvalues 25, 9, and 0, we know that the eigenvectors will be orthogonal; normalising them gives

v₁ = (1/√2)(1, 1, 0), v₂ = (1/√18)(1, −1, 4), v₃ = (1/3)(2, −2, −1).

Now we calculate U using the formula uᵢ = (1/σᵢ) A vᵢ, and this gives

U = [[1/√2, 1/√2], [1/√2, −1/√2]].
Applications
●● Calculation of pseudo-inverse: the pseudo-inverse or Moore–Penrose inverse is the generalisation of the matrix inverse to matrices that might not be invertible (such as low-rank matrices). When a matrix is invertible, its pseudo-inverse equals its inverse; when it is not invertible, no ordinary inverse exists but the pseudo-inverse still does. It is denoted by A⁺.
Let’s say we need to determine the pseudo-inverse of a matrix M. From the SVD,

M = UWVᵀ

Multiplying both sides by M⁻¹:

M⁻¹M = M⁻¹UWVᵀ
I = M⁻¹UWVᵀ

Multiplying by V (and using VᵀV = I):

V = M⁻¹UWVᵀV
V = M⁻¹UW

Multiplying by W⁻¹; since W is the diagonal matrix of singular values diag(a₁, a₂, …, aₙ), its inverse is diag(1/a₁, 1/a₂, …, 1/aₙ):

VW⁻¹ = M⁻¹UWW⁻¹
VW⁻¹ = M⁻¹U

Multiplying by Uᵀ (and using UUᵀ = I):

VW⁻¹Uᵀ = M⁻¹UUᵀ
VW⁻¹Uᵀ = M⁻¹ = M⁺
●● Solving a set of homogeneous linear equations (Mx = b): if b = 0, calculate the SVD and select any column of Vᵀ associated with a singular value (in W) equal to zero.
If b ≠ 0, then from Mx = b, multiplying by M⁻¹ gives

M⁻¹Mx = M⁻¹b
x = M⁻¹b

Hence x = VW⁻¹Uᵀb.
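As a quick numerical check of the identity x = VW⁻¹Uᵀb, the NumPy sketch below solves a small system via the SVD (the values of M and b are illustrative, not from the text):

import numpy as np

M = np.array([[3.0, 2.0, 2.0], [2.0, 3.0, -2.0]])
b = np.array([1.0, 2.0])

U, s, Vt = np.linalg.svd(M, full_matrices=False)
x = Vt.T @ np.diag(1.0 / s) @ U.T @ b   # x = V W^{-1} U^T b
print(np.allclose(M @ x, b))            # True: x solves Mx = b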
●● Rank, range, and null space: these can all be determined from the SVD.
◌◌ The rank of matrix M equals the number of non-zero singular values.
◌◌ The range of matrix M is spanned by the left singular vectors of U corresponding to the non-zero singular values.
◌◌ The null space of matrix M is spanned by the right singular vectors of V corresponding to the zero singular values.
◌◌ Curve fitting: singular value decomposition can be used to minimise the least-squares error in a curve-fitting problem, making the approximation via the pseudo-inverse.
◌◌ Beyond the above uses, singular value decomposition and the pseudo-inverse are also employed in digital signal processing and image processing.
In the following code we will calculate the Singular Value Decomposition using NumPy and SciPy. Both SVD and pseudo-inverse calculations will be done. Finally, we use the SVD to compress an image.
# Imports
import numpy as np
from scipy.linalg import svd

"""
Singular Value Decomposition
"""
# define a matrix
X = np.array([[3, 3, 2], [2, 3, -2]])
print(X)

# perform SVD
U, s, V_transpose = svd(X)

# print different components
print("U: ", U)
print("Singular array", s)
print("V^{T}", V_transpose)
"""
Calculate Pseudo-inverse
"""
# invert the non-zero singular values
singular_inv = 1.0 / s

# build an inverse-singular-value matrix of the right shape
s_inv = np.zeros(X.shape)
s_inv[0][0] = singular_inv[0]
s_inv[1][1] = singular_inv[1]

# calculate pseudo-inverse: M+ = V W^{-1} U^T
M = np.dot(np.dot(V_transpose.T, s_inv.T), U.T)
print(M)
"""
SVD on image compression
"""
import matplotlib.pyplot as plt
from skimage import data
from skimage.color import rgb2gray

cat = data.chelsea()
plt.imshow(cat)

# convert to grayscale
gray_cat = rgb2gray(cat)

# compute the SVD of the grayscale image and plot rank-r approximations
U, S, V_T = svd(gray_cat, full_matrices=False)
S = np.diag(S)
fig, ax = plt.subplots(5, 2, figsize=(8, 20))

curr_fig = 0
for r in [5, 10, 70, 100, 200]:
    cat_approx = U[:, :r] @ S[0:r, :r] @ V_T[:r, :]
    ax[curr_fig][0].imshow(256 - cat_approx)
    ax[curr_fig][0].set_title("k = " + str(r))
    ax[curr_fig, 0].axis('off')
    ax[curr_fig][1].set_title("Original Image")
    ax[curr_fig][1].imshow(gray_cat)
    ax[curr_fig, 1].axis('off')
    curr_fig += 1
plt.show()
Output:
[[ 3 3 2]
 [ 2 3 -2]]
---------------------------
U: [[-0.7815437 -0.6238505]
 [-0.6238505 0.7815437]]
---------------------------
# Inverse
[ 0.07114625, 0.13043478],
 [ 0.22134387, -0.26086957]])
---------------------------
Fig: Grayscale image reconstructions for k = 5, 10, 70, 100 and 200 alongside the original image.
4.1.18 LOF
The local outlier factor (LOF) is an algorithm for unsupervised outlier detection. It generates an anomaly score that indicates which data points in the data set are outliers. This is accomplished by calculating the local density deviation of a given data point with respect to its neighbouring data points.
Working of LOF: local densities are estimated from the distances between nearby data points; outliers are those whose density is below that of their neighbours. First, the k-distances (distances between points) are determined in order to find each point’s k nearest neighbours; for example, a point’s second nearest neighbour defines its 2-distance. Here is an image showing the k-distances between various neighbours in a point cluster:
The k-distance is used to determine the reachability distance, which is defined as the maximum of the k-distance of B and the distance between the two points. In the following equation, B represents the central point and A a nearby point:

reachability-distanceₖ(A, B) = max{k-distance(B), distance(A, B)}

Here is an illustration of the reachability distance between a point and its nearby neighbours:
As seen in the graphic above, the k-distance is used for points inside the circle, while the actual distance between points is used for points outside the cluster. The Local Reachability Density (LRD) of a point is then derived from the reachability distances to all of its k nearest neighbours: it is the inverse of the average of the reachability distances of the k nearest neighbouring points,

lrdₖ(A) = 1 / ( Σ_{B ∈ Nₖ(A)} reachability-distanceₖ(A, B) / |Nₖ(A)| )

and it is a measurement of the density of the k nearest points around a point. The inverse is used because the closer the points are to one another, the smaller the distance and the greater the density.
The local outlier factor (LOF) is calculated by dividing the average of the LRDs of the k neighbours by the LRD of the point in question:

LOFₖ(A) = ( Σ_{B ∈ Nₖ(A)} lrdₖ(B) / |Nₖ(A)| ) / lrdₖ(A)
As a result, if the point’s density is nearly equal to that of its neighbours, the point is quite similar to them; if the point’s density is greater than that of its neighbours, the point is an inlier, inside the cluster; and if the point’s density is smaller than that of its neighbours, the point is an outlier. In summary:

LOF ~ 1 => similar data point
LOF < 1 => inlier (a similar data point which is inside the density cluster)
LOF > 1 => outlier
Advantages:
◌◌ Finding outliers can be challenging: a point close to a highly dense cluster may be an outlier, whereas a point farther away from a more widely dispersed cluster may be an inlier. LOF resolves this problem by identifying outliers relative to their local neighbourhood.
◌◌ The method employed in LOF can be used to tackle outlier-detection problems in many other disciplines, such as geographic data, video streams, etc.
◌◌ Other dissimilarity functions can be plugged into LOF, and it has been found to perform better than several other algorithms for anomaly identification.
Disadvantages:
◌◌ The LOF score that determines whether a point is an outlier varies from case to case; it might change between data sets.
◌◌ The LOF algorithm’s detection accuracy degrades in higher dimensions.
◌◌ It may be difficult to decide where to draw the line between inliers and outliers, because the LOF score is a ratio that can take any value.
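A minimal sketch of LOF with scikit-learn is shown below (assuming scikit-learn is available; the four points are illustrative, with the last one an obvious outlier):

import numpy as np
from sklearn.neighbors import LocalOutlierFactor

X = np.array([[1.0, 1.1], [1.2, 0.9], [0.9, 1.0], [10.0, 10.0]])
lof = LocalOutlierFactor(n_neighbors=2)
labels = lof.fit_predict(X)               # -1 marks outliers, 1 marks inliers
scores = -lof.negative_outlier_factor_    # LOF scores; values well above 1 indicate outliers
print(labels)                             # e.g. [ 1  1  1 -1]
print(scores)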
ve
4.1.19 T-Distributed Stochastic Neighbour Embedding (t-SNE)
The machine learning algorithm t-distributed stochastic neighbour embedding (t-SNE), created by Laurens van der Maaten and Geoffrey Hinton, is used for visualising high-dimensional data by giving each data point a location in a two- or three-dimensional map.
There are two main steps in the t-SNE algorithm. First, t-SNE creates a probability distribution over pairs of high-dimensional points such that similar points receive high probability. Second, it defines a similar distribution over the points in the low-dimensional map and moves the map points so as to minimise the divergence between the two distributions.
High-dimensional data visualisation helps us understand data and develop solutions for problems in many different sectors. For instance, according to Street et al., cell nuclei related to breast cancer are described by about 30 different variables, whereas the word-count vectors and pixel-intensity vectors used to represent text documents and images typically comprise hundreds of dimensions. Finding the most crucial variables in this kind of situation can be essential for visualising the relationships in the data, as manually looking for relationships between all the variables is impractical.
Applications of t-SNE include computer security analysis, music analysis, and cancer research, among many others. As stated previously, it is used to depict high-dimensional data, which can have tens to hundreds of variables. One advantage of t-SNE is that, in comparison to previous methods like Sammon mapping or Locally Linear Embedding (LLE), it is better at producing a single map that displays structure at numerous scales. This is crucial for high-dimensional data distributed across several related but dissimilar low-dimensional manifolds, such as photographs of objects from various classes seen from various angles. However, as a comparison with UMAP, another technique for high-dimensional data display, shows, it is not the fastest.
Let’s first examine SNE, stochastic neighbour embedding, to better grasp how t-SNE functions. SNE begins by converting high-dimensional Euclidean distances between data points into conditional probabilities that describe the similarities between data points. The similarity of data point xj to data point xi is the conditional probability, p(j|i), that xi would pick xj as its neighbour if neighbours were picked in proportion to their probability density under a Gaussian distribution centred at xi. For nearby data points p(j|i) is relatively high, whereas for widely separated data points it is almost infinitesimally small. Mathematically, the conditional probability p(j|i) is given by

p(j|i) = exp(−||xi − xj||² / 2σi²) / Σ_{k≠i} exp(−||xi − xk||² / 2σi²)
where the Gaussian distribution is centred on the data point xi and σi² is its variance. Because the objective only involves modelling pairwise similarities, the value of p(i|i) is set to zero. A similar conditional probability, denoted q(j|i), can be computed for the low-dimensional counterparts (map points) yi and yj of the high-dimensional data points xi and xj. The conditional probabilities q(j|i) are computed using a Gaussian distribution with variance set to 1/√2. Consequently, the similarity between map points yj and yi is modelled as

q(j|i) = exp(−||yi − yj||²) / Σ_{k≠i} exp(−||yi − yk||²)
Again, q(i|i) is set to 0 because only pairwise similarities are modelled. If the map points yi and yj accurately depict the similarity between the high-dimensional data points xi and xj, the conditional probabilities p(j|i) and q(j|i) will be equal. This fact serves as the inspiration for SNE, which seeks a low-dimensional data representation that reduces the discrepancy between p(j|i) and q(j|i). SNE employs a gradient descent technique to reduce the sum of the Kullback-Leibler divergences across all data points. The cost function C is thus given by

C = Σ_i KL(P_i || Q_i) = Σ_i Σ_j p(j|i) log( p(j|i) / q(j|i) )
The distribution used to calculate the similarity scores in the map is the last distinctive component of t-SNE. The Student t-distribution, which takes the place of the normal or Gaussian distributions discussed above, has a similar appearance but differs in that its tails are taller at the extremities and its centre is not as tall, as shown in the figure (Difference between a normal distribution and a Student t-distribution). According to Geoffrey Hinton and Laurens van der Maaten, the Student t-distribution is chosen because it has the incredibly desirable quality of being (nearly) invariant to changes in the map’s scale for locations that are far apart on the map. This implies that the method behaves uniformly for all points, and that big clusters comprising numerous low-density points interact uniformly with each other.
Hyperparameters
The hyperparameters that the t-SNE algorithm uses are described below.
Perplexity can loosely be interpreted as a guess as to how many near neighbours each point has: the effective number of k nearest neighbours increases with the perplexity, and the perplexity value shapes the resulting images. The performance of SNE is fairly resilient to changes in the perplexity, and typical values are between 5 and 50, according to the original study by van der Maaten. However, the story is more complex than that in practice.
The learning rate also matters. If it is too high, the map can end up as an extremely dispersed cluster, with each point roughly separated from its nearest neighbours; because too many significant modifications are applied too soon, many points also detach from their local clusters. However, if the learning rate is too low, most map points may appear compressed into a small, dense ball with few outliers. Increasing the learning rate may help resolve this problem if the cost function becomes trapped in a poor local minimum, while looking at the map and finding it difficult to discern any distinct divisions between the various classes can indicate a learning rate that is too high. Finally, since the t-SNE approach is iterative, it is crucial to allow enough iterations for the map to converge to a state where any further changes are minute.
Attraction and Repulsion Over Iterations
In each iteration of the map, points are shifted about so that similar points attract one another and dissimilar points repel one another. Physically, the gradient can be thought of as the force produced when a set of springs is placed between the map point yi and all of the other map points yj. Every spring applies a force in the direction (yi − yj). Depending on whether the distance between the two map points is too small or too large to reflect the similarity between the two high-dimensional data points, the spring between yi and yj repels or attracts them. The spring between yi and yj exerts a force proportional to both its length and its stiffness, which corresponds to the mismatch (p(j|i) − q(j|i) + p(i|j) − q(i|j)) between the pairwise similarities of the data points and of the map points. The figure below provides an instance of this. As the method iterates over all locations, the map eventually develops clusters of various sizes that bring related objects close to one another.
Figure: Step-by-step illustration of how the points from every class attract and repel points from other classes in each iteration.
T-distributed Stochastic Neighbour Embedding (t-SNE) is thus a method for displaying high-dimensional data: a nonlinear dimensionality reduction technique, based on stochastic neighbour embedding, used to visualise data in a two- or three-dimensional space.
To visualise data using the t-SNE method, the Scikit-learn API offers the TSNE class. Here, we’ll quickly go over how to use Python’s TSNE implementation to fit and visualise data.
from sklearn.manifold import TSNE
from sklearn.datasets import load_iris
import seaborn as sns
import pandas as pd
We’ll acquire the data and labels of the Iris dataset after it has loaded:

iris = load_iris()
x = iris.data
y = iris.target

tsne = TSNE(n_components = 2, random_state = 0)  # project to 2 dimensions
z = tsne.fit_transform(x)
The outcome is then represented in a plot. The output component data is gathered in a dataframe, and the scatterplot() function of the seaborn library is used to plot it. Because there are three different categories in the label data, we set the scatter plot’s colour palette to 3.

df = pd.DataFrame()
df["y"] = y
df["comp-1"] = z[:, 0]
df["comp-2"] = z[:, 1]

sns.scatterplot(x = "comp-1", y = "comp-2", hue = df.y.tolist(),
                palette = sns.color_palette("hls", 3),
                data = df).set(title = "Iris data T-SNE projection")
We’ll next apply the same methodology to a larger dataset. For this task, the MNIST handwritten digit dataset from the Keras API performs well. We extract just the train portion of the dataset, since it is enough for testing t-SNE here, and we will only use 3000 rows because processing the TSNE would otherwise take too long.

from keras.datasets import mnist
import numpy as np

(x_train, y_train), (_, _) = mnist.load_data()
x_train = x_train[:3000]
y_train = y_train[:3000]
print(x_train.shape)

# flatten each 28x28 image into a 784-dimensional vector
x_mnist = np.reshape(x_train, [x_train.shape[0], x_train.shape[1] * x_train.shape[2]])
print(x_mnist.shape)

(3000, 784)
We have 784 features here. We’ll now use TSNE to project the data into two dimensions so that we can see it in a plot.

tsne = TSNE(n_components = 2, random_state = 0)
z = tsne.fit_transform(x_mnist)

df = pd.DataFrame()
df["y"] = y_train
df["comp-1"] = z[:, 0]
df["comp-2"] = z[:, 1]

sns.scatterplot(x = "comp-1", y = "comp-2", hue = df.y.tolist(),
                palette = sns.color_palette("hls", 10),
                data = df).set(title = "MNIST data T-SNE projection")
[t-SNE] Indexed 3000 samples in 0.922s...
The MNIST data are displayed in two dimensions on the plot. The colours indicate the target digits and the placement of their feature data in 2-D space.
4.2.1 Healthcare Case: Early Prediction of Neonatal Hyperbilirubinemia
Neonatal jaundice caused by hyperbilirubinemia is the most frequent condition in newborns and one of the main causes of hospitalisation in the first year of life. A considerable portion of newborn infants with significant hyperbilirubinemia, defined as a bilirubin level above the 95th percentile at a specific age in hours (1, 2), require phototherapy treatment; this condition affects around 10% of newborn infants.
Significant hyperbilirubinemia, if left untreated, can result in severe disability and lifelong consequences (6, 7). On the other hand, population-based research has revealed that phototherapy treatments raise the risk of allergic disorders in children. Therefore, for infants who are at high risk of developing substantial hyperbilirubinemia, both careful patient monitoring and thoughtful treatment assignment are needed in the setting of neonatal jaundice.
Nomograms, which are still often used to divide newborns into risk categories, were first presented in 1999 by Bhutani et al. for the assessment of neonatal hyperbilirubinemia. Nomograms are based on percentiles of bilirubin readings at a certain age in hours. More recent risk stratification methods include additional clinical indicators for the prediction of newborn hyperbilirubinemia shortly after birth, before discharge, or during hospital phototherapy. Although risk stratification of this kind is widely used, accurate early prediction remains challenging.
The early detection of hyperbilirubinemia has been the subject of several studies.
Requiring that a phototherapy forecast be produced at 24 hours after delivery for all neonates is one of the drawbacks of such a model.
A model for the early identification of hyperbilirubinemia in healthy term and near-term neonates less than 24 hours old was created more recently by Castillo et al. Their model, a regularised logistic regression (LASSO), achieves a performance of 89% (±3%) AUC using umbilical cord bilirubin, GA, and maternal race as predictors. Its drawback is that it makes only one prediction during the first 24 hours after birth, although in real practice predictions are needed for each subsequent bilirubin measurement, especially before discharge. As a result, even though early identification of hyperbilirubinemia has been researched, most practical scenarios do not allow the application of existing models, because they either need too much overhead or cannot be applied after every new bilirubin measurement.
Here, machine learning (ML) is used to improve the early detection of clinically meaningful hyperbilirubinemia prior to the first phototherapy session. ML has a lot of potential for clinical use; in paediatrics it has been used effectively for enhanced early diagnosis of neonatal seizures based on EEG data and of late-onset neonatal sepsis based on medical records.
Results
23 of the 385 newborns were excluded because they had phototherapy before their initial bilirubin level was checked. Of the 362 remaining neonates, 98 (27.07 percent) underwent at least one phototherapy treatment during their initial hospitalisation. All initial bilirubin measurements were taken within the first week of life, on average 66 hours after birth.
The distribution of bilirubin readings among the neonates is shown in the table below. There were a total of 1543 bilirubin measurements available, an average of more than four measurements per neonate. The typical interval between bilirubin measurements is less than 27 hours, and 95% of the individuals who underwent phototherapy saw intervals of no more than 50 hours between subsequent measurements.
The table below provides descriptive statistics for a number of factors that highlight important aspects of the sample. Notably, there are many preterm newborns: 54.1 percent of the sample had gestational ages (GA) under 37 weeks, and 41.4 percent had GAs under 35 weeks. A GA of fewer than 37 weeks and phototherapy had a correlation of 0.374.

Variable | Lower quantile | Median | Upper quantile | %
Gestational age (in days) | 228 | 253 | 289 | -
Birthweight (in grams) | 1490 | 2605 | 4120 | -
Age of mother (in years) | 23 | 32 | 40 | -
First bilirubin level (in µmol/L) | 19 | 46 | 216 | -
Gender (male) | - | - | - | 56.91
Preterm† birth | - | - | - | 54.13
Multiple birth | - | - | - | 23.48
4.2.2 Healthcare Case: Predicting Epidemic Outbreaks
Public health organisations that design measures to lessen or stop illness transmission can benefit from accurate predictions of infectious disease dynamics. With the expansion of big data in the biomedical and healthcare industries, accurate analysis of such data may aid the early diagnosis of diseases and improve patient care. Thanks to the accessibility of enormous computing power, it is now highly feasible to use “big data” to predict and manage an epidemic outbreak. Our goal is to analyse and quantify the transmission of epidemic diseases in rural and suburban areas where access to healthcare may be limited.
We wish to develop a machine learning model that can forecast epidemic disease dynamics and the location of the next epidemic outbreak. Our method takes into account the topography, climate, and population distribution of an affected area, because these are significant characteristics that influence the dynamics of an epidemic disease. Our model would help the healthcare authorities take the steps required to ensure that sufficient resources are available to meet the requirement.
Use of Intel Technology: Intel Distribution for Python, Intel Parallel Studio XE, Intel VTune Amplifier, AWS C5.2xlarge Intel instance.
Broad Objective:
●● Reducing the avoidable misery caused by disease.
●● Reducing the cost load on governments and healthcare systems by giving them early warning of likely outbreak locations.
Given an area where an epidemic outbreak has already occurred, our ML model should be able to detect future outbreak-prone areas and the attributes that most aid the spread of the outbreak.
Changes in the ecology of the host population, changes in the pathogen reservoir, or the introduction of an emerging pathogen to a host population are some of the common causes of infectious disease epidemics.
The feature vectors in our model are sufficiently broad to be modified slightly and applied to the analysis of any epidemic disease.
Why Zika?
●● The Centers for Disease Control and Prevention’s Zika Data Repository offers publicly accessible information about the Zika epidemic, and it had enough information for us to create and test our model.
Implementation Details
To improve the performance of machine learning and data analytics, we used Intel Distribution for Python* and the Python API for the Intel® Data Analytics Acceleration Library (Intel® DAAL), also known as PyDAAL. By utilising the optimised scikit-learn* (scikit-learn with Intel DAAL) that comes with it, we succeeded in getting decent results for the prediction problem.
Data Sources
●● The Centers for Disease Control and Prevention’s Zika Data Repository offers publicly accessible information about the Zika epidemic: https://github.com/cdcepi/zika.
●● Vector occurrence data, available at https://datadryad.org/resource/doi:10.5061/dryad.47v3c, listing occurrences of the vector agents (Aedes albopictus, Aedes aegypti).
●● IMF World Economic Outlook data on GDP and GDP PPP.
There are Jupyter notebooks in the repository that apply techniques for cleaning and munging the data.
The evaluation result is the likelihood of an outbreak in a particular area.
Pre-processing
Pre-processing and adjusting class imbalance: data transformations are carried out before feeding the data to the algorithm. Some of the variables in the dataset are categorical, so appropriate techniques must be used to transform them into numerical variables. Preliminary research revealed a significant class imbalance, particularly for Zika cases recorded in the CDC database.
This was partly because most places did experience outbreaks, and the majority of these outbreaks persisted (were present on all dates) throughout the whole time period for which data were available.
In an effort to address this, two frameworks were created and tested in order to balance the classes and facilitate prediction:
Framework A: the locations included in the non-epidemic class were those that had never had an outbreak, using the feature information from the earliest accessible date. A place was considered to be in the outbreak class if it experienced an outbreak at any point during the studied dates. For these locations, features from two separate dates were tested: the outbreak’s start date and the date of the outbreak’s peak level (during the span of data collection). Framework a first and Framework a max were the names of these two data sets.
Framework B: only information from areas where an outbreak occurred was used. The non-outbreak class was based on data from the earliest accessible date, assuming no epidemic was present yet. For the outbreak class, features were taken from the time series at either the date at which the outbreak started or the date at which it reached its peak (within the period of data collection). Framework b first and Framework b max were the names of these data sets.
Of the four, we discovered that Framework a max produced the best outcome. Intuitively, that is the point in time when the illness can truly be called an epidemic. Our main concern is that once the spread reaches a certain point, it will quickly spread to surrounding locations.
Feature Selection
Data sets may contain duplicate or unnecessary features that could confuse the machine-learning model. In this step our goal is to eliminate any elements that could lengthen the runtime, produce spurious patterns, and so on. The resulting subset of characteristics is used for subsequent analysis. Either the Random Forest algorithm or the XGBoost algorithm can be used to choose the features.
In our project, the XGBoost method is utilised to choose the best features, namely those with an importance score higher than a predetermined threshold; a minimal sketch of this follows below. Our results support the literature on the Zika epidemic1: temperature, precipitation, proximity to mosquito breeding grounds, population density, and proximity to other locations with dense human populations (as captured by the airport dist large feature) all have a significant impact on outbreak risk.
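The sketch below illustrates this threshold-based selection (assumptions: the xgboost package is installed, the feature names and the 0.01 cut-off are illustrative, and the random data stands in for the real feature table and outbreak labels):

import numpy as np
import pandas as pd
from xgboost import XGBClassifier

# illustrative stand-ins for the feature table and outbreak labels
X_df = pd.DataFrame(np.random.rand(100, 4),
                    columns=['temperature', 'precipitation', 'density', 'airport_dist'])
y = np.random.randint(0, 2, 100)

model = XGBClassifier(n_estimators=50, random_state=0)
model.fit(X_df, y)

# keep only the features whose importance exceeds the chosen threshold
importances = pd.Series(model.feature_importances_, index=X_df.columns)
selected = importances[importances > 0.01].index.tolist()
print(selected)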
Data Split
Splitting the data into train and test sets: for further analysis, the data is split into train and test sets; 30% of the data is used for testing, while 70% is used for training. Data splitting is accomplished using the scikit-learn class StratifiedShuffleSplit(n_splits=1, test_size=0.3, random_state=0). To address the class disparity between Zika cases and non-Zika cases, stratified splitting is required: it maintains the ratio of positive to negative cases of the entire sample in both the train and test sets, as the sketch below shows.
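A minimal sketch of this stratified split (the feature matrix and imbalanced labels here are illustrative stand-ins, not the real data):

import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

X = np.random.rand(100, 5)
y = np.array([0] * 80 + [1] * 20)   # imbalanced labels, as in the Zika data

sss = StratifiedShuffleSplit(n_splits=1, test_size=0.3, random_state=0)
for train_idx, test_idx in sss.split(X, y):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
print(y_train.mean(), y_test.mean())  # both ≈ 0.2: the class ratio is preserved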
Model Building
scikit-learn with Intel DAAL
To balance the classes before training, SMOTE combined with Tomek links was used: SMOTE over-samples the minority class by synthesising new examples between existing minority examples and their closest neighbours, while Tomek links under-sample by removing ambiguous boundary pairs. It thus mixes over- and under-sampling, as the sketch below illustrates.
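A minimal sketch of this combined resampling, assuming the imbalanced-learn package (imblearn) is installed and reusing the X_train, y_train arrays from the split sketch above:

from imblearn.combine import SMOTETomek

smt = SMOTETomek(random_state=0)
X_resampled, y_resampled = smt.fit_resample(X_train, y_train)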
Machine-learning models are chosen for training at this step. All classifiers in scikit-learn use the fit(X, y) method to fit the model to the provided training data (X) and training labels (y). An ensemble of classifiers is used to compare the performance of the various models, and after training the model can be used for prediction. SVM, Multi-Layer Perceptron, Logistic Regression, AdaBoost, and XGBoost were all put to the test.
Prediction
In this phase, the trained model uses its learning to predict the output for a given input. In other words, predict(X) provides the expected label y for an unlabelled observation X.
Evaluation
Many performance indicators are available to gauge a model’s performance. In order to select the most effective model for the problem, we used accuracy, precision, and recall as evaluation measures.
Results
The best estimator, XGBoostClassifier, received outstanding results on the test set for accuracy (0.95) and stratified five-fold cross-validation (0.96). Some more pertinent metrics are shown below.
Figure: ROC with 5-fold cross validation
4.2.3 Healthcare Case: Predicting Acute Kidney Injury
Acute kidney injury (AKI) is poorly managed, and chances to prevent AKI are frequently missed. In the hospital context, AKI is frequently accompanied by other underlying disorders and will typically be treated by non-nephrologists who are less skilled at diagnosing it. As a result, the creation of AKI detection or prediction models has become a prominent topic. The generalisability of most of these models is hindered by the fact that they can only predict the risk of AKI at baseline, such as at admission or preoperatively, and/or in specific settings, such as heart surgery or sepsis, or in the limited environment of the intensive care unit (ICU) on whose data the algorithm was trained. Moreover, it is the early stages of AKI that can be prevented or mitigated, yet many models only predict advanced stages of AKI or the requirement for renal replacement therapy.
To train their neural network model, the scientists gathered longitudinal data from the electronic health records of more than 700,000 inpatients and outpatients from all disciplines. The algorithm was trained using a staggering 6 billion independent data entries, which included both the most recent medical data and data from up to two years prior to admission. To enable supervised learning, the presence or absence of AKI was labelled for each case. With a lead time of up to 48 hours and a ratio of two false warnings to every genuine alert, the resulting model was able to accurately forecast AKI in 55.8% of all inpatient instances of identified AKI.
First off, the model’s performance is not particularly spectacular. With a low
Notes
e
sensitivity of 55.8%, the model fails to detect half of AKI episodes. It’s possible that
this was done on purpose to increase specificity. Even extremely effective models
in environments where AKI is uncommon will need to find a compromise between
in
high sensitivity, which will result in missing fewer cases, and specificity, which will
lessen false positives and address warning fatigue. To determine which strategy will
significantly improve outcomes in clinical practice , more study is required.
nl
Notably, missing AKI cases in a general hospital context can be viewed as
more troublesome than alert weariness. It is crucial to keep in mind, though, that low
sensitivity can simply be a result of the fact that a sizable fraction of AKI cases cannot
O
be predicted, just as it is impossible to predict which side a coin will land on in a coin
toss experiment. These limitations stem from the rules of probability and hence hold
true for all diagnostic and prediction models, whether they are based on machine
ty
learning or traditional methods.
Supervised learning assumes that the model is given the appropriate label for each event during training and that the model also learns from its “mistakes” by modifying the weights used by the algorithm to link the data with the label. Such a strategy assumes that the categorisation is (i) “transparent” and “uniform,” meaning that everyone knows exactly what the category label means; (ii) “relevant,” meaning that categories meaningfully distinguish between cases; (iii) “unique,” meaning that each case belongs to only one category; and (iv) “exhaustive,” meaning that all cases that exhibit meaningful differences can be assigned to a different category. Finally, it also assumes that every example in the training set has been appropriately labelled.
When these presumptions are broken, the model learns a distorted version of the task. Most AKI prediction models, for instance, only take the serum creatinine criterion into account and do not take the urine output criterion into account. This suggests that the AKI predicted by these models may not be the same as AKI as it is understood in clinical practice, and as a result, the definition of what the algorithm predicts is neither transparent nor clearly defined. Importantly, urine output might be the least expensive continuously available predictor of AKI we have, and teaching medical professionals to keep track of diuresis might be a better way to improve AKI management than looking for novel AKI biomarkers or creating AKI prediction algorithms.
Consider cases that lack the information needed to support or rule out the diagnosis: one can choose to consistently classify these cases as having “no AKI.” The term “no AKI” in this situation would be meaningless, since it does not distinguish between “confirmed no AKI,” as supported by the data, and “uncertain AKI,” where there is insufficient information to support the diagnosis.
Excluding cases that lack the necessary information to support the diagnosis is another option. The training set, however, would then not be exhaustive, because it would not include instances of “easily undetected AKI.” Such patients will be classified essentially at random as having or not having AKI, because the test set will not be able to identify their normal pattern.
In addition, the model is trained to use metadata, such as test ordering and timing, alongside test results, so it is not trained on test outcomes alone. There is evidence that these metadata can predict outcomes, such as survival, more accurately than the actual test findings. However, such metadata strongly depend on the skill of the doctor, who orders the test because of a suspected risk of AKI. The model will develop a vicious circle as a result of this: during training, the model will unknowingly (i.e., without being aware of the issue) link the doctor’s request for an AKI test with the onset of the condition later on. Thus, the model will issue an alarm for a probable AKI that the doctor has already identified.
A similar effect was demonstrated in an AI-based chest X-ray diagnosis model, where the diagnosis was largely driven by the type of X-ray machine used, because a different machine was used in the outpatient setting (low incidence of pneumonia) than in the ICU setting (high incidence of pneumonia).
To avoid or reduce the danger of AKI, an alert must be triggered in time. Although a 48-hour prediction window would seem sufficient, not all AKI diagnoses are anticipated by the prediction algorithm a full 48 hours before onset. In actuality, only 20% of AKI cases are anticipated more than 24 hours in advance of AKI onset. This might turn out to be too late for a successful intervention in the majority of situations. AKI intervention would also include preventive measures that should be given to all patients, not only those who are at risk of developing AKI.
Chatbots are built on machine learning (ML) and deep learning (DL) technologies. Supervised machine learning methods are used with NLP to enable chatbot implementations in commercial solutions like Google Dialogflow, Microsoft Bot Framework, etc. A field with numerous commercial applications is also developing around the recognition of patterns in photographs. E-commerce enterprises increasingly combine these approaches of supervised learning, unsupervised learning, natural language processing, and pattern recognition. In this work, we present case studies that illustrate how the same set of AI/ML techniques may be used in four distinct industry sectors—insurance, education, hospitality, and retail—to achieve significantly different use cases. The Information Processing Task that is inherent to each of these, despite the differences in the use cases, is quite similar. These technological enablers might be compared to Lego blocks that business users can put together to solve specific problems in their industry.
In the retail sector in particular, these techniques have been demonstrated to be of substantial benefit in easing the issues experienced by both the end users of e-commerce platforms and their administrators.
Use Case Details
There are numerous e-commerce websites that sell individuals the goods they want and need. As the number of sites and options to pick from grows, it becomes harder for the consumer to examine each site for availability, pricing, offers, suggestions, and so on. An NLP-based chatbot interface can greatly improve the customer experience and, as a result, help guarantee consumer stickiness.
The integration of a chatbot into an e-commerce website’s landing page is demonstrated in Fig. A below.
Figure A: E-commerce website with a chatbot integrated within its landing page
Business-wise, this integration guarantees that customers receive answers to their inquiries at the “moment of truth,” when a transaction is most likely to take place. Customers won’t have to wait long for responses from the e-commerce site’s help desk crew this way. This guarantees that common questions, which would otherwise be answered by the helpdesk phone support team, are handled in a self-service (DIY) fashion.
For instance, a chatbot that is connected to the store’s inventory database can react to inquiries about the availability of a product of a particular brand or model, as sketched below.
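A minimal sketch of such an availability handler follows; the intent name, the inventory table, and the check_availability helper are hypothetical stand-ins for a real NLU engine (such as Dialogflow) and a real product database.

# Hypothetical availability-check handler for an e-commerce chatbot.
# A real deployment would receive the intent and entities from an NLU
# engine; here both are simulated with plain dictionaries.
inventory = {
    ("logitech", "m235"): {"stock": 12, "price": 899.0},
    ("hp", "deskjet-2331"): {"stock": 0, "price": 3199.0},
}

def check_availability(brand: str, model: str) -> str:
    item = inventory.get((brand.lower(), model.lower()))
    if item is None:
        return f"Sorry, we don't list the {brand} {model}."
    if item["stock"] == 0:
        return f"The {brand} {model} is currently out of stock."
    return f"The {brand} {model} is in stock ({item['stock']} units) at {item['price']:.2f}."

# Simulated output of intent detection on a user utterance.
parsed = {"intent": "CheckAvailability", "brand": "Logitech", "model": "M235"}
if parsed["intent"] == "CheckAvailability":
    print(check_availability(parsed["brand"], parsed["model"]))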
When chatbots are coupled with the corporate landscape of the e-commerce company and other web services via API calls, they can provide more value than just scripted answers. Below are cases where the chatbot might be utilised as a conversational user interface to deliver the results of such API calls.
2) Price comparisons for customers
A user can ask the chatbot to compare the pricing of a product, like a mouse, as shown in Fig. B below. After identifying the “Price Comparison” intent, the chatbot uses the PriceYuge Data API to call an external web service. The price of the goods is then obtained from other e-commerce websites, like Amazon, Flipkart, etc. The information retrieved includes price, product colour, expected delivery, COD availability, product offers, and so on.
Figure B: Comparative Details of the product Logitech M235 Wireless Mouse
fetched and shown in the chatbot window
3) Product recommendations for customers
Vector representations of products can be used to find similarities across several products and depict them in a more comprehensible way, for instance, giving advice based on the chemical makeup of cosmetic products. According to the representations shown in Figs. C and D, the closer two products are to one another, the more similar their composition is. Customers can thus compare the products without having any prior knowledge of chemistry.
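One plausible way to realise this (a sketch, not the authors’ implementation) is to encode each product’s ingredient list as a binary vector and compare products by cosine similarity; the ingredient lists below are invented.

# Sketch: compare cosmetics by ingredient overlap using cosine similarity.
# Ingredient lists are illustrative; a real catalogue would be much larger.
import numpy as np

products = {
    "moisturiser_a": {"water", "glycerin", "dimethicone", "niacinamide"},
    "moisturiser_b": {"water", "glycerin", "shea butter", "niacinamide"},
    "sunscreen_c":   {"water", "zinc oxide", "octocrylene"},
}

vocab = sorted(set().union(*products.values()))

def to_vector(ingredients: set) -> np.ndarray:
    return np.array([1.0 if ing in ingredients else 0.0 for ing in vocab])

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

base = to_vector(products["moisturiser_a"])
for name, ings in products.items():
    print(name, round(cosine(base, to_vector(ings)), 3))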
Figure D: Recommendation System recommends products with brand, pricing
and Rank
4) Automated Product Labeling for Store Administrator
E-commerce companies must make sure that the products featured on their platform, which come from various merchants, are listed consistently. The difficulty of accurately listing all the features of a product increases with the number of vendors on a platform and the number of items being offered. Image recognition using Deep Learning algorithms can be employed to identify a product’s features from a picture and make sure they are consistently listed. Therefore, some of the typical labels connected to a product can be automatically filled in using deep learning techniques. The prototype implementation of uploading a product image is shown in Fig. E below, and Fig. F displays the numerous labels or features that were predicted for that product and automatically placed in the database. The same results can be achieved with images of apparel, accessories, and other categories, even though the example shown below is for a single product type.
Following the creation of a chatbot application, it is crucial to regularly assess the outcomes of the chatbot’s interactions with end users. An administrator should keep the FAQs updated and pertinent to consumer inquiries by identifying those utterances for which the chatbot was unable to determine the intent. It can be difficult to manually sort through a large number of unanswered questions to identify the subjects for which chatbot responses need to be created. Here, topic modelling on the unrecognised search queries can help prioritise the areas where the chatbot needs to be updated with responses, by identifying the clusters of themes that customers commonly ask about. The most popular topics thus found can be added to a chatbot’s future versions to better serve users.
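As a rough sketch of this triage step (the log entries and parameters below are invented), the unanswered utterances can be clustered with latent Dirichlet allocation:

# Sketch: topic modelling over utterances the chatbot failed to answer.
# The log entries are invented; a real system would read them from storage.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

unanswered = [
    "how do I return a damaged mouse",
    "refund not credited after return",
    "is cash on delivery available in my city",
    "when will my order be delivered",
    "track delivery status of my order",
    "return policy for opened items",
]

vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(unanswered)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(X)

terms = vectorizer.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    top = [terms[j] for j in topic.argsort()[-4:][::-1]]
    print(f"Topic {i}: {', '.join(top)}")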
Figure F: Predicted labels from the input image are the red colored words
●● Google Dialogflow: The chatbot was created using Dialogflow, which enables the creation of entities, intents, context-setting, and training phrases for questions and search queries pertaining to certain products.
●● PriceYuge Data API: An external API used to retrieve product details from various e-commerce stores.
●● NLTK Library: Used to text-analyse user-submitted search terms.
●● FastAI Library: A product image dataset with more than 40,000 photos and label annotations, used to train the Deep Learning model.
The results of training the Deep Learning model for label recognition from photos are shown in Figure G below.
Figure G: (a) Finding the right learning rate for the multi-label classification model. (b) A training-versus-validation loss plot using more than 3,000 image samples and a training run of 5 epochs.
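A minimal fastai-style sketch of such a multi-label training run is shown below; the CSV layout, column names, and file paths are assumptions rather than the authors’ actual code.

# Sketch: multi-label product image classification with fastai (v2-style API).
# Assumes a labels.csv whose first column holds file names and whose second
# column holds space-separated tags, plus an images/ folder; both are
# hypothetical stand-ins for the 40,000-image dataset mentioned above.
import pandas as pd
from fastai.vision.all import (
    ImageDataLoaders, Resize, aug_transforms, vision_learner, resnet34, accuracy_multi
)

df = pd.read_csv("labels.csv")
dls = ImageDataLoaders.from_df(
    df, path="images", label_delim=" ",      # space-separated tags => multi-label
    valid_pct=0.2, item_tfms=Resize(224), batch_tfms=aug_transforms(),
)

learn = vision_learner(dls, resnet34, metrics=accuracy_multi)
learn.lr_find()      # the learning-rate plot, as in Fig. G(a)
learn.fine_tune(5)   # 5 epochs, matching the loss curves in Fig. G(b)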
The experience of our own deep-learning research at Netflix was similar: after initial elation, the stark reality of refined baselines confronted us.
As a result, the connection between deep learning and other recommender models became clearer. When we merged the knowledge gained from these methods with an understanding of the kinds of problems in which each deep-learning method excels, as well as its limitations, the barriers disappeared. Along the way, this also produced useful insights into how to make deep learning effective in actual recommendation scenarios.
We distilled our work into a few significant types of models. We start out by talking about models known as “bags of goods” that employ a member’s collection of seen videos. Because these models, while effective, ignore temporal information, we then explore sequential models that fill this gap. For various tasks, we discovered both strategies to be useful.
After initially struggling with deep-learning methodologies, our research showed that deep-learning algorithms particularly began to shine in recommendation problems when we provided them with extra heterogeneous features and information sources. As is frequently the case in the literature (see also Ferrari Dacrema, Cremonesi, and Jannach (2019)), other models, when correctly tuned, remained competitive in the typical recommendation setting where only user-item interaction data are used. However, because deep-learning models are flexible, we were able to identify situations in which we could create models that significantly improved offline metrics assessed using historical data.
The gap between offline metrics and online results can be large. To address this, we required new offline measurements that were better proxies of the online metrics. In addition, we discuss other practical issues that must be handled to implement deep learning in a recommendation system that serves hundreds of millions of users. In practice, it is fairly simple to create and alter model architectures for recommender systems thanks to the flexible foundation offered by current deep-learning toolboxes.
Finally, we will discuss major lessons learned from applying deep learning to Netflix’s recommendation algorithms, as well as lessons that may apply to other applications.
Recommendations at Netflix
Each member is different, has a wide range of interests that might change depending on the situation, and needs a recommender system most when they are unsure of what they want to watch. When done correctly, each member receives a special experience that enables them to get the most out of Netflix. Since we offer a monthly subscription service, member satisfaction has a direct impact on customers’ propensity to continue using our service, which in turn affects our revenue. Therefore, improvement in member retention can be used to gauge the value of a recommender system. We have routinely produced significant increases in retention over the years thanks to the development of personalisation and recommendation technologies (Gomez-Uribe and Hunt 2015).
In multiple controlled online trials, we found that the specific recommendation task, as well as the available data, plays a key role in determining which algorithm works best. The main selection made by a recommender system at a video-streaming service is which videos each subscriber will see on their Netflix homepage after logging into their profile on any device. This personalisation work is itself broken into various sub-tasks, using recommender systems created to address various member needs, each of which can be powered by a different algorithm. Instead of a single algorithm driving all recommendations at Netflix, the company uses a variety of strategies that are all focused on boosting member satisfaction.
Through experimentation with several kinds of recommendation algorithms, it was discovered that the best algorithm (whether deep learning or another) depends on the particular recommendation problem that needs to be solved, as well as on the available data. Because of this, several machine learning models are employed to create tailored suggestions for the various sections (such as rows) of the Netflix home page. Gomez-Uribe and Hunt (2015) provide an overview of the Netflix recommender system before deep learning was used. By breaking the recommendation problem down into several smaller tasks, we can combine a variety of alternative approaches and increase the modularity and scalability of the research and development of new or improved recommendation algorithms.
The Netflix homepage shown in the figure above has red circles marking several recommendation tasks, each of which is driven by a separate algorithm. For instance, there are specific algorithms for selecting the first video to display prominently at the top of the homepage (1), another for ranking previously watched videos that the user may wish to continue viewing (7), as well as others designed to assist our members in finding new videos (5). The results of each of these algorithms can be shown on the page as several rows of suggested videos. A number of further Netflix personalisation tasks are beyond the scope of this article. For instance, an algorithm (4, 6) chooses which rows to present, in a customised manner, to form the homepage’s structure (Alvino and Basilico 2015). In addition, our members receive tailored messages and notifications (3). Additionally, we incorporate recommendation methods into our search engine (Lamkhede and Das 2019). Since it occurs the most frequently in our recommendation settings, we concentrate in this article on the process of suggesting videos from the Netflix library to each member. Many of the lessons we offer, nevertheless, apply to these other tasks as well.
The available data and its qualities, in addition to the recommendation objective, have a significant impact on which recommendation algorithm performs best. The first and most crucial distinction is whether the data contain additional information such as user attributes, item attributes, or contextual information regarding the user-item interaction, as opposed to just the user-item interactions (as is frequently the case in publicly available data sets used in the literature). Through testing, we discovered that adding such additional heterogeneous variables to the data made more sophisticated models, such as deep-learning algorithms, shine.
Before we examine further issues in the data for developing real-world recommender systems, it is important to briefly review a few typical aspects of the data that have been covered in the literature on recommender systems. The following are significant ways in which these data differ from the data sets utilised in other domains. First, the entries with observed positives (such as played videos, clicked items, given ratings or thumbs, etc.) are not randomly distributed in the user-item interaction matrix (Hernández-Lobato, Houlsby, and Ghahramani 2014; Liang et al. 2016; Marlin, Zemel, and Roweis 2005; Marlin et al. 2007; Marlin and Zemel 2009; Steck 2010). This is a critical distinction from areas like compressive sensing or matrix completion, where it is usual to presume that matrix entries are missing at random. Second, the unobserved entries in a user-item interaction matrix could either represent (true) negatives (i.e., the user is actually uninterested in this item) or positives that have not been observed (yet). Thirdly, because observed positives result from a stochastic process, observed data are frequently highly sparse and noisy. Fourth, the data exhibit a significant popularity skew: the popularity of the various items roughly follows a power-law distribution, leading to big disparities in the popularity of the various items. A similar (approximate) power-law distribution can be seen in the distribution of users, with a small number of very active users and a huge number of less active users. Due to distribution mismatch, this power-law distribution might make modelling difficult.
Making fair and accurate suggestions for unpopular items, or for users with minimal interaction history, is therefore challenging. In addition, biases in the user-item interaction data, such as presentation or position biases, may be amplified due to a feedback loop in which the recommender system is trained on the observed user actions from a previous time-step, which may have been biassed by the recommendations presented to the users at that time (Chaney, Stewart, and Engelhardt 2018). Due to presentation bias, users are more inclined to interact with items that are presented prominently. It is still possible to train recommender systems using the obtained data, but this also causes a noteworthy disparity between offline and online metrics.
4.2.6 Retail: Robot Sales Associate - Apparel Store
The retail sector will soon undergo a change thanks to AI robots. In addition to performing physical chores, these robots will also take charge of more cognitive ones, thanks to learning systems that enable them to make judgments in place of humans. These robots are already serving as customer service agents, giving customers a seamless and enjoyable shopping experience.
Customer-facing businesses are vying to best meet the ambiguous expectations of customers in this age of automation and the digital world. AI-based robots serving as customer service agents are assisting customers by making brick-and-mortar store buying less complicated. According to a Tractica analysis, the market for customer support robots was expected to reach US $88 million by 2020. These customer care robots are made to engage with consumers while also automating some of their more routine chores. They come in both humanoid and non-humanoid varieties, and their key benefits are labour efficiency, uptime, and cost savings.
The majority of these robots direct shoppers to items in stores or assist them as they navigate the space. They can communicate with customers more consistently than human employees. The tracking and analytics of consumer behaviour by these robots therefore holds significant potential for the development of interactive marketing and rebranding tactics.
Invento Robots
Invento’s MITRA robots can identify customers using face recognition, as well as detect age, gender, and the most dominant emotions. They can engage in dialogue with clients, offering them more useful and understandable responses. They can effortlessly navigate the store and provide clients with information about a certain product, including details, discounts, and offers.
The business recently debuted MITRA 3, the MITRA robot’s sibling. The MITRA 3 humanoids are designed to give customers a more engaging, personal experience. Unlike MITRA, these robots can mimic human gestures and eye movements because their eyes are coordinated with their bodies. These robots can greet visitors with “Namaste” because they have three degrees of freedom (shoulder, elbow, and finger movements). This flexibility makes them more adaptable. Along with voice synthesis in local Asian languages like Hindi, Tamil, and Sinhalese, they also have the advantage of seamless independent mobility and obstacle recognition. “Customer engagement is terrible, and businesses can no longer afford to hire human service workers. This leaves one with two choices: a subpar kiosk or no engagement,” according to Balaji Viswanathan, CEO of Invento Robotics, whose company aims to give visitors to companies and malls a better alternative.
Suzuki, Smartworks, HDFC Bank, Accenture, PVR Cinemas, and the Airports Authority of India are just a few of MITRA’s clients. To automate check-ins and customer service for travellers, the company ran a test programme with these MITRA robots.
Spod: These non-humanoid robotic shopping assistants with artificial intelligence capabilities will, like MITRA, help customers make in-store purchasing decisions. The box-shaped robot, which carries a suspended basket, can recognise faces, greet users, accept input, and travel by itself with ease. The robot, which is still in the development stage, scans the faces of clients to determine whether they have previously visited and to determine their gender. These robots can interact with consumers, guiding them to the appropriate shelf and providing product information such as descriptions, specials, and deals from the Spod tablet.
By setting up a video conference between clients and specialists, Spod can also give customers remote assistance. The demand for a tool that can enhance customers’ purchasing experiences in retail markets sparked the development of the robot. Driving sales is the main concern for merchants, while giving out more details regarding the product is important for brands. Large malls want the robot at the front, directing visitors into stores, while e-commerce players want it at the rear, in their stockrooms, according to Balaji.
SoftBank Robots
Pepper Robots: Pepper robots, a subset of SoftBank robots, are humanoid robots that can recognise faces and fundamental human emotions. Equipped with touchscreens, the robots can communicate with customers. They primarily provide creative greeting, education, and guidance functions for guests. With more than 15 languages for speech recognition and dialogue delivery, they have 20 degrees of freedom for the most organic and expressive movements. To help customers get the intended outcomes, they can effortlessly explore the store and surrounding area. The use of these robots can be observed at Ave, a custom-print garment store in the United States, which witnessed an increase in foot traffic by raising consumer awareness of the outlet, luring in new customers, describing novel parts of the store, and driving sales.
The company’s NAO humanoid robot is another, whose main uses are in healthcare, research, and teaching. These 58 cm tall, bipedal robots have beautifully rounded features. Along with speech recognition and conversation in up to 20 different languages, their 25 degrees of freedom allow them to move and adapt to the environment. They can recognise objects, images, and people thanks to two 2D cameras. In education and research, NAO functions as a teaching assistant within the STEAM education assistance programme: it can deliver customised courses to pupils, encourage positive behaviour, and follow and monitor their development. In healthcare, NAO supports personnel with health trending and monitoring while assisting patients with self-diagnosis, serving as both an informational centre and a telemedicine platform (alerts, notifications, fall and sound detection, etc.).
Cleveron Robots: Cleveron’s robotic parcel terminals act as automated pick-up points for all online orders. The equipment can be placed within the shop, where consumers can visit and scan the QR code or enter the code to pick up their online orders. Without human assistance, the system can process the code and give the consumer their requested order. The Cleveron machines eliminate all the bother associated with the BOPIS (Buy Online, Pick up In-Store) idea, including waiting for the order to be prepared and contacting the staff executive. “Saving consumers’ time is the ultimate goal of Cleveron’s delivery robots.”
Online buying is a quicker and more practical alternative to physical store purchasing. However, the time you saved won’t matter once your package is delivered, because you’ll need to wait in line at the store’s service desk or wait at home for the courier, who may show up with your package anywhere between 5 and 9 o’clock in the evening. The typical wait time to pick up an online order at the shop is more than 15 minutes. With Cleveron 401 and Cleveron 402, which function as automated in-store self-service pick-up terminals, customers receive their orders in less than a minute. Those 15 minutes are now yours to keep! Arno Kütt, founder and chairman of the board of Cleveron, stated that the company’s robots “make online shopping and picking up orders in a physical store a positive consumer experience, which in turn generates foot traffic, revenue and promotes growth for the company.”
Suppose each in-store pick-up takes 15 minutes or longer. In a year (52 weeks), if you pick up just once a week, you will stand in line for 13 hours. The pick-up time with Cleveron 401 is under a minute; even if we round it up to a full minute, you will spend only 52 minutes a year picking up your online orders from the store.
The 5-foot-tall robot shopping assistant LoweBot assists clients in-store just like a customer service agent. Equipped with 3D scanners, these robots can recognise customers entering the store. Customers can ask them about any specific item in the store, and they understand and speak seven different languages. These robots steer customers to the merchandise using sophisticated laser sensors. While travelling, they also show location-based offers and intelligent recommendations on their rear-facing display panel. Additionally, they move throughout the store, scanning the shelves and sending inventory updates to the sales staff. As a result, the business can not only restock its shelves but also learn which products sell more rapidly and on what days of the week or during what seasons. They are also capable of determining the shopping trends at the location.
4.2.7 Retail: Loss Prevention
Among the fraudulent activities that put a strong retail loss prevention programme in perspective are the following: a product is whisked away without being scanned at the counter; merchandise is handed over without being charged; a mal-intentioned individual goes free after returning a product (with no rhyme or reason to do so); and discounts are given on transactions that do not merit discounts at all.
Retailers are increasingly making the reduction of retail shrink a primary focus in order to increase revenues and reduce losses, transforming retail loss prevention with AI and machine learning. Patterns and correlations sensed from data forewarn (predict) possible fraudulent activity and arm retailers with the measures needed to contain retail shrink and loss, as the sector struggles with losses perpetuated through shoplifting, internal theft, return fraud, vendor fraud, discount abuse, and administrative errors, among others.
Reactive to Proactive Loss Prevention Strategy
When it comes to anticipating and mitigating retail shrink and loss, there has been a significant shift from reactive to proactive approaches. Data are first collected from vital sources such as security systems (CCTV, camera, access control, and alarm records), video, store operation applications, POS, payment data, crime data (local crime statistics), store profiles, supply chain data, employee data, customer data, ORC data maps, and e-commerce platforms.
The data are used as crucial input for techniques including computer vision, deep learning, image processing and recognition, machine learning and correlation, pattern recognition, and behavioural, predictive, and prescriptive analytics.
Return patterns may serve as a warning sign to thwart fraudulent refunds. Return permission can be likened to balancing on a tightrope between providing excellent customer service and ensuring that the action is not fraudulent. For the machine learning system to anticipate and stop return fraud, the shopper, shopping behaviour, return behaviour, and store trends provide a vibrant feed.
Another advanced analytics technique for forecasting and preventing retail loss is reading the risk tea-leaf. Retailers can use machine learning algorithms and data science techniques to discover key loss drivers, anticipate loss, and avoid loss by monitoring retail loss risk variables. The prevention of fraudulent actions that lead to retail loss is aided by keeping risk-ranking predictive models, developed using statistical modelling and machine learning, at the forefront.
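A toy sketch of such a risk-ranking model is given below; the features and training data are synthetic stand-ins for the shopper, return-behaviour, and store-trend feeds described above.

# Sketch: risk-ranking return requests with logistic regression.
# Features and labels are synthetic stand-ins for real loss prevention data.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 500
X = np.column_stack([
    rng.poisson(1.5, n),        # returns in the last 90 days
    rng.integers(0, 60, n),     # days between purchase and return
    rng.integers(0, 2, n),      # 1 if a receipt was presented
])
# Synthetic ground truth: frequent, late, receiptless returns are riskier.
logits = 0.9 * X[:, 0] + 0.03 * X[:, 1] - 1.5 * X[:, 2] - 2.0
y = rng.random(n) < 1 / (1 + np.exp(-logits))

model = LogisticRegression().fit(X, y)

# Rank incoming return requests by predicted fraud risk.
new_returns = np.array([[4, 55, 0], [0, 3, 1]])
print(model.predict_proba(new_returns)[:, 1])   # higher score = review first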
Retail shrinkage is a constant annoyance, and a strong loss prevention strategy is the first step in reducing retail shrink. After taking stock of the existing loss prevention technology used to supplement KPIs, retailers succeed by extending the LP spectrum to encompass mobile, web, and other customer engagement channels, and by using AI and machine learning to transform retail loss prevention.
4.2.8 Telecom: Optimising Mobile Tower Operations
Recent developments in machine learning (ML) make it possible to optimise at previously unaffordable levels of complexity. This has produced notable performance gains, encouraging the usage of ML techniques like neural networks in a variety of industries.
Utilizing ML to improve wireless network performance began with 5G and is crucial to advancing zero-touch configuration and management, enabling the self-configuration and self-optimization anticipated for 6G networks. Numerous factors that affect how wireless networks operate are difficult to predict or infer and are not always known at the time that decisions must be made. Furthermore, because they include a variety of radio access technologies and modules that interact with one another, must meet a variety of growing criteria, and must swiftly adapt to changes, wireless networks are becoming more complicated and heterogeneous. Because of this, real-time wireless system performance optimization is a challenge that cannot be solved using conventional methods. ML technologies, by contrast, can manage extremely complex systems; they are well suited to managing highly dynamic wireless networks and to making more intelligent judgments, such as those based on anticipated future traffic patterns.
Based on these findings, this study suggests a modular machine learning (ML)-based wireless network optimization framework that enables plug-and-play integration of machine intelligence into both new and existing network operations. In particular, we use ML to (i) characterise traffic attributes and (ii) anticipate future traffic volume. We then use this information to enhance the performance of network control mechanisms. The benefits of our strategy are twofold, and the design is in accordance with the recent ITU-T Y.3172 architectural framework for machine learning in future networks.
Prior research has integrated machine learning (ML) into the creation of specific algorithms, concentrating on network functions such as (i) resource management and orchestration, (ii) routing, (iii) detection and channel estimation in massive MIMO systems, and (iv) resource scaling of virtual network functions (VNF). These cannot be easily reused due to their mechanism-specific nature, which means that each network control mechanism requires a specially designed ML technique.
Other work has applied ML to optimise specific network operations, such as routing (see the literature for a survey of ML techniques applied to SDN) or VNF resource scaling.
We also include an ML orchestrator, which is in charge of resource management and monitoring, as well as of determining the best configurations for ML model training. With the aid of our framework, we anticipate that the costs associated with launching an ML pipeline for a particular function will be comparable to those incurred by ML algorithms created specifically for that purpose. In our method, the orchestration of a function has a one-time signalling cost, but this is a minor price to pay given the flexibility it provides.
We describe how to coordinate two ML pipelines, namely traffic-driven VNF scaling and routing in mobile backhaul networks, to illustrate the viability and performance improvements possible with our system. We assess the latter use case practically. Our methodology achieves a threefold reduction in packet delay by feeding a cutting-edge routing scheme with city-scale projections of future traffic obtained with a deep learning structure.
ML-Based 5G Network Optimization
We suggest a machine learning (ML)-based framework for network optimization and describe how to apply it in 5G networks.
A. ML-based Framework
The components of our framework are shown in the figure below. They are the ML orchestrator, modules to measure mobile network traffic, ML algorithms to process the data, and modules to make specific network optimizations depending on the results of the ML algorithms.
The orchestrator specifies, in the form of a template, the set of collector nodes and the inputs that each ML pipeline requires; different network functions need different inputs. For instance, increasing the computing capacity of VNFs performing core services involves monitoring control traffic from the same set of base stations, while routing requires monitoring traffic from a set of base stations to determine the best routes.
As a result, the orchestrator coordinates the implementation of an ML pipeline and the mechanisms for updating the decisions for each network function (for example, by collaborating with the VNF orchestrator). It also makes sure that there is enough computing power available to train ML models either centrally or in a decentralised manner.
Measurement modules are deployed at base stations or Software Defined Networking (SDN) switches, for example. Rather than defining a finite set of input attributes, the measurement modules take sequences of packets from each flow, along with their durations, interarrival times, direction (uplink/downlink), and possibly even some of the content. Working with such extensive data as input has several benefits, the main one being that abstract features can be automatically extracted during training rather than relying on a small, manually selected set. Feature engineering can be expensive and can hurt performance, and new use cases might call for other features. Our strategy is therefore future proof.
In terms of flow attributes such as average and peak data rates, level of burstiness, etc., we are particularly interested in DL structures that can identify different types of flows within large aggregates and forecast future demand (traffic forecasting). Different DL structures might be used, depending on the objective. Temporal characteristics play a key role in traffic forecasting, and time series lend themselves well to Long Short-Term Memory networks (LSTMs). Just as the spatial dimension can be added to the convolution operation to create a 3D-CNN, mobile-traffic-specific spatio-temporal information can likewise be extracted with convolutional LSTMs (ConvLSTMs).
Figure: An example of a deep learning pipeline, adapted from the literature, for forecasting mobile traffic. To estimate future traffic demands at the eNodeB level, city-level metrics are fed into stacks of 3D-CNNs and ConvLSTMs.
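As a hedged sketch of such a pipeline (not the exact architecture referenced above), a Keras model can stack a 3D convolution and a ConvLSTM layer over a grid of per-cell traffic measurements; all shapes and layer sizes are illustrative assumptions.

# Sketch: spatio-temporal traffic forecasting with Conv3D + ConvLSTM2D.
# Input: 12 past time steps of a 16x16 city grid of traffic volumes.
# Output: the predicted traffic grid for the next time step.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(12, 16, 16, 1)),        # (time, height, width, channels)
    layers.Conv3D(16, kernel_size=3, padding="same", activation="relu"),
    layers.ConvLSTM2D(32, kernel_size=3, padding="same", return_sequences=False),
    layers.Conv2D(1, kernel_size=1, activation="relu"),  # next-step grid
])
model.compile(optimizer="adam", loss="mse")

# Train on random data just to demonstrate the expected tensor shapes.
X = np.random.rand(64, 12, 16, 16, 1).astype("float32")
y = np.random.rand(64, 16, 16, 1).astype("float32")
model.fit(X, y, epochs=1, batch_size=8, verbose=0)
print(model.predict(X[:1]).shape)              # (1, 16, 16, 1)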
Last but not least, our decision modules are built on pre-existing algorithms that only need to be altered to accept the DL algorithms’ predictions as input. Using our framework therefore leaves these algorithms’ fundamental functionality untouched. Contrary to completely ML-based systems, where ML algorithms directly make decisions, this gives considerably better control over how they operate.
Result
We have presented an ML-based framework to improve (beyond-)5G network performance. In contrast to current approaches that directly embed ML within network control systems, our architecture does not necessitate the invention of use-case-specific ML algorithms or the modification of existing network algorithms. To describe traffic features and forecast future traffic demands, our system deploys ML pipelines; the existing network control mechanisms are then given the predictions. We put our framework into practice and tested it. Results show that our method considerably lowers packet delay even in small topologies.
4.2.9 Prediction of Tornadoes using Time Series Forecasting
A tornado is a rapidly spinning column of air that is in touch with the ground, either pendant from or beneath a cumuliform cloud, and that is frequently but not always visible as a funnel cloud (Glickman & Zenk, 2000). In comparison to other countries, the United States has more tornadoes (Perkins, 2002). In the United States, more than 1,200 tornadoes are reported each year, which is four times more than what has been documented in Europe (National Oceanic and Atmospheric Administration, 2008). The United States experiences more violent tornadoes than any other nation, with EF4 or EF5 tornadoes on the Enhanced Fujita Scale (National Oceanic and Atmospheric Administration, 2008).
The majority of tornadoes in the US happen east of the Rocky Mountains. The Great Plains, the Midwest, and the Mississippi Valley are tornado-prone regions. A region that is highly vulnerable to tornadoes is referred to as Tornado Alley (Glickman & Zenk, 2000) or Dixie Alley (Cox, 2010). Additionally, tornadoes have been reported in every state in the US at least once since 1950 (National Oceanic and Atmospheric Administration, 2008). Due to their location in Tornado Alley, where tornadoes occur most frequently, Midwestern states including Kansas, Missouri, Nebraska, Iowa, and South Dakota are particularly vulnerable to tornado activity (Tornadoes in the United States, 2016). Tornado Alley includes the state of Oklahoma as well.
Since records began to be kept in 1950, there have been 58 tornadoes classified as having the maximum intensity and level of destruction (F5 and EF5 Tornadoes of the United States; Tornadoes in the United States, 2016). Each year, tornadoes in the United States cause more than 80 fatalities and 1,500 injuries (National Oceanic and Atmospheric Administration, 2008). Each year, tornado-related disasters cause billions of dollars in damages.
As a result, it is critical to create precise forecasts that can determine a tornado’s striking site, the extent of its property damage, and its length and width. These predictions can help spread awareness and work towards preventing losses of life and property in the area affected by a tornado event.
In the tornado dataset, various univariate time series were fitted with time series forecasting models, and the resulting numerical results were examined and analysed. Five separate univariate time series have been created from the dataset: Property Damage, Tornado Length, Tornado Width, Beginning Latitude, and Beginning Longitude. The information has been provided by the National Climatic Data Center (2015). After data visualisation, the time series data are divided into training and validation periods. Records of tornado events from January 2009 to December 2013 make up the training dataset. Records of tornado incidents from January 2014 to December 2014 (12 months) make up the validation dataset.
Because we have decided to represent the time series in terms of monthly values, there is only one value input into the models each month and only one output value per model. The values inserted into the time series are the mean property damage per month, mean tornado length per month, mean tornado width per month, and mean latitude and longitude of tornado strikes per month. Likewise, the forecasted values are the mean expected monthly values for property damage, tornado length, width, and location, each within a predetermined prediction interval.
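A minimal sketch of this train/validate/forecast workflow for one series is given below; the synthetic data stand in for the NCDC monthly means, and Holt-Winters exponential smoothing is just one candidate model such a comparison might include.

# Sketch: fit a seasonal model on monthly means and forecast a 12-month
# horizon. The synthetic series stands in for the NCDC monthly data.
import numpy as np
import pandas as pd
from statsmodels.tsa.holtwinters import ExponentialSmoothing

rng = np.random.default_rng(1)
idx = pd.date_range("2009-01", periods=72, freq="MS")   # Jan 2009 - Dec 2014
seasonal = 1.0 + 0.5 * np.sin(2 * np.pi * idx.month / 12)
series = pd.Series(seasonal * rng.lognormal(0.0, 0.3, 72), index=idx)

train = series.loc["2009":"2013"]    # training period
valid = series.loc["2014"]           # 12-month validation period

model = ExponentialSmoothing(train, trend="add",
                             seasonal="add", seasonal_periods=12).fit()
forecast = model.forecast(12)

mae = np.abs(forecast.values - valid.values).mean()
print(f"Validation MAE: {mae:.3f}")  # compare candidate models on this score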
After model performances are compared, the most accurate model is picked to forecast future values from January 2015 to December 2015 (a forecast horizon of 12 months, identical to the validation period). The best model is determined by assessing the precision and seasonality of model forecasts on the validation period. The 2015 actual figures are then compared to the forecasts, and conclusions are drawn.
We can see high amounts of property damage in the years 2010 and 2012 from this overall plot. These years saw the highest amounts of tornado-related property damage.
Figure: General Property Damage ($) Plot, 2009-2014
The property damage values form the time series shown in the figure above. To better understand the nature of the series, we examine smaller time-series windows to look for seasonality and trends. Thus, the series is partitioned into the three time periods 2009–2010, 2011–2012, and 2013–2014.
The graph below shows that, in this time frame, 2010 saw the highest value for property damage. In 2012, $2.5 million in property damage was noted. During the years 2009 through 2010, April exhibits additive seasonality. Between 2011 and 2012, we notice additional seasonality in the month of June. Property damage values for the years 2013 to 2014 are noticeably lower, and November shows an additive seasonal effect in both years.
4.2.10 Facebook Face Detection
Facebook can automatically recognise the people in photos and videos shared on social media. But how does Facebook do it? If you have Facebook’s “tag recommendations” or “facial recognition” setting turned on (this indicates that you have given permission for facial recognition), the machine learning system analyses the pixels of the face in the image and develops a template, which is a string of numbers. Because each face has a unique template (sort of like a facial fingerprint), it may be used to recognise you in other photos and videos.
The facial recognition technology can identify your template and send you a notification if a newly uploaded photo or video on Facebook features your face but you aren’t tagged in it. Additionally, you will be promptly informed if another person attempts to use your photo as their Facebook profile picture (perhaps in an effort to gain more popularity!). When used in conjunction with other accessibility features, facial recognition can also let people with vision impairments know if they are in a picture or video.
4.2.11 Manufacturing
Use Case
Around 1,500 processes are required in the production of memory chips, all of which must be done in sterile environments to prevent dust specks from harming the wafers. However, harm still takes place. Scratches, flaws, and other quality problems are frequently minute and almost imperceptible to the human eye.
The manufacturing setting contains many different machines, pipes, and parts. These deteriorate, malfunction, or begin to leak, and early detection of these problems is essential. Maintenance is typically handled by engineers, but even the most talented engineer may overlook early warning signs of a problem.
The process of making memory chips by nature offers a lot of room for error. By relying on human monitoring to spot quality flaws and mechanical concerns, Micron Technology was losing a lot of money: on average, $250,000 per hour of downtime.
The Solution
AI solutions are a good fit for this specific business issue. Because the challenges are precisely defined, quantifiable, and measurable, there is sufficient internal data to employ Machine Learning (ML) on numerous fronts with good accuracy. The solutions also function with smaller amounts of data, but the ML algorithm’s accuracy will suffer; accuracy will increase as more data are collected. The ML algorithms are intended to discover anomalies with greater accuracy and speed than human monitoring.
Machine Vision
As its photolithographic cameras etch the circuitry into the wafers, Micron Technology integrated machine vision technology. The system checks for recurring problems and notifies the engineers whenever one is found. Depending on the type of defect, it can take anywhere from 15 seconds to 15 minutes before the alarm is issued. The issue of manually classifying each fault is resolved by the company’s auto-defect-classification (ADC) system. The algorithm uses deep learning to sift and classify the millions of flaws. The ADC system is detailed in an Intel whitepaper.
Thermal Imaging
Thermal imaging was used by Micron to monitor their production process, significantly enhancing the efficacy and precision of the AI. A “digital twin,” which is effectively a digital copy of the industrial environment, is overlaid with “heat maps” of the factory environment under typical operating conditions. Then, using this map as a baseline, the factory’s infrared images can be assessed in real time. The system will sound an alarm if it detects an anomaly, such as temperatures that differ from the digital twin.
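In its simplest form, this kind of baseline comparison can be sketched as below; the grid size, baseline map, and alarm threshold are invented for illustration.

# Sketch: flag hot spots by comparing a live thermal frame against a
# baseline "digital twin" heat map. All values are invented.
import numpy as np

rng = np.random.default_rng(2)
baseline = 40 + 2 * rng.standard_normal((64, 64))    # deg C, normal operation
live = baseline + rng.standard_normal((64, 64))
live[10:13, 20:23] += 15                             # simulate an overheating part

THRESHOLD = 8.0                                      # deviation that trips an alarm
deviation = np.abs(live - baseline)
hot = np.argwhere(deviation > THRESHOLD)

if hot.size:
    print(f"Alarm: {len(hot)} pixels deviate by more than {THRESHOLD} deg C")
    print("First hot pixel (row, col):", tuple(hot[0]))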
Acoustic Listening
The last of the three is perhaps the most unexpected: an AI solution has been developed to detect strange noises throughout the manufacturing process. Much like when your car makes strange noises, a machine generating an unusual sound frequently signifies trouble. By transforming sound into visual datapoints, the AI system at Micron has been trained to recognise abnormalities in sound frequencies. Audio sensors are positioned adjacent to machinery or pumps to record the sounds of the equipment in a noisy environment. The engineers classify noises based on their likely causes.
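The “sound to visual datapoints” step can be sketched with a spectrogram plus a simple deviation score against a baseline recording; the signals and the alarm threshold below are synthetic assumptions.

# Sketch: turn audio into a spectrogram and score deviation from a baseline.
# Signals are synthetic; a real system would stream data from audio sensors.
import numpy as np
from scipy.signal import spectrogram

fs = 16_000                                   # sample rate in Hz
t = np.arange(0, 2.0, 1 / fs)
normal = np.sin(2 * np.pi * 120 * t)          # healthy pump hum at 120 Hz
faulty = normal + 0.5 * np.sin(2 * np.pi * 3_000 * t)   # added whine

def mean_spectrum(x):
    _, _, sxx = spectrogram(x, fs=fs, nperseg=1024)
    return sxx.mean(axis=1)                   # average power per frequency bin

baseline = mean_spectrum(normal)
score = np.linalg.norm(mean_spectrum(faulty) - baseline)
print(f"Deviation score: {score:.2f}")        # alarm if above a tuned threshold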
Benefits & Impact
First off, Micron Technology’s AI solutions have significantly improved production accuracy and efficiency. The second improvement is in worker safety: workers come into contact with extreme temperatures and harmful substances less frequently. Thirdly, AI solutions have given the company’s engineers valuable time to devote to other projects. Last but not least, the use of AI solutions in manufacturing has spread to other business processes, such as product demand projections, improving their accuracy by 10 to 20 percent.
Key benefits:
◌◌ Production in the manufacturing sector increased by 10%.
◌◌ Saved downtime costs running into millions of dollars.
◌◌ Gave engineers more time to work on other projects.
◌◌ Improved workplace security.
Data is collected by the business from more than 8,000 sources and 500 servers globally. For data mining, this data is sent to two environments of the Apache Hadoop open-source software programme. Hadoop can process large data sets in parallel, allowing for the simultaneous analysis of many datasets.
Over 2,000,000 photos are kept in the Hadoop environment for machine vision. To quickly manage the enormous load of the complicated machine learning algorithm used in acoustic listening, Micron delivers the pertinent data to a GPU processor. By distributing work among numerous processors, GPUs can accelerate applications, enabling the system to handle the massive volume of incoming data.
4.2.12 Explainable AI and Google Maps
Explainable AI is a collection of tools and frameworks, natively integrated with a number of Google products and services, that assists you in comprehending and interpreting the predictions produced by your machine learning models. With it, you can debug models, enhance their performance, and help others understand their behaviour. Additionally, you can use the What-If Tool to visually analyse model behaviour and generate feature attributions for model predictions in Vertex AI, BigQuery ML, and AutoML Tables.
Google Maps is used in more than 220 countries and territories around the world to drive more than 1 billion kilometres each day. When you get in your car or on your motorcycle and start navigating, you are immediately presented with a few things: which way to go, whether the traffic along your route is heavy or light, an anticipated journey time, and an estimated time of arrival (ETA). Even if everything seems straightforward, a lot is happening in the background to offer this information in a matter of seconds.
Today, we’ll dissect one of our favourite subjects: traffic and routing. Read on if you’ve ever wondered how Google Maps chooses the optimal route for a journey.
Predicting traffic with advanced machine learning techniques, and a little bit of
history
To forecast how traffic will look in the near future, Google Maps examines historical traffic patterns for roads over time. For instance, a pattern might reveal that traffic on the 280 freeway in Northern California regularly moves at 65 mph between 6 and 7 am, but only at around 15-20 mph in the late afternoon. We then merge this database of historical traffic patterns with live traffic conditions, using machine learning to create predictions based on both sets of data.
We have a very high accuracy bar; in fact, we can observe that over 97 percent of journeys have had accurate ETA projections. Through our collaboration with DeepMind, we’ve been able to significantly enhance ETA accuracy in cities like Berlin, Jakarta, São Paulo, Sydney, Tokyo, and Washington, D.C. by utilising Graph Neural Networks, a machine learning architecture. By using this method, Google Maps is better equipped to anticipate whether or not you will be impacted by a slowdown that may not even have started yet!
Keeping it Fresh
For the majority of the 13 years that Google Maps has published traffic data, historical traffic patterns have generally been accurate predictors of your likely road conditions, but this isn’t always the case. Global transportation patterns have drastically changed since the COVID-19 outbreak began. When lockdowns began in early 2020, traffic decreased by up to 50% throughout the world. Since then, some regions of the world have slowly started to reopen, while others continue to have restrictions. To take this unexpected change into account, we recently updated our models to make them more flexible: historical traffic patterns from the previous two to four weeks are now automatically prioritised, whereas patterns from earlier periods are deprioritised.
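The recency weighting described above might look something like the toy sketch below (not Google’s actual model); the weights and the weekly speed samples are invented.

# Sketch: recency-weighted estimate of a road segment's typical speed.
# Weights and weekly speed samples are invented for illustration.
weekly_avg_speed = [63, 61, 24, 22, 21, 23]   # mph, oldest week first

def recency_weighted(values, decay=0.5):
    """Weight week k (0 = most recent) by decay**k, then normalise."""
    recent_first = list(reversed(values))
    weights = [decay ** k for k in range(len(recent_first))]
    total = sum(w * v for w, v in zip(weights, recent_first))
    return total / sum(weights)

print(f"Recency-weighted: {recency_weighted(weekly_avg_speed):.1f} mph")
print(f"Naive mean:       {sum(weekly_avg_speed) / len(weekly_avg_speed):.1f} mph")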
How Google Maps Selects Routes
Google Maps’ route-finding algorithms rely heavily on our traffic forecast models. If we anticipate that traffic in one direction may become heavy, we’ll automatically find you a less congested option. Along with these variables, we also consider the condition of the roads: is the road paved, unpaved, gravelly, muddy, or covered in dirt? Such factors can make a road challenging to drive on, so we’re less inclined to include it in your itinerary. We also consider the size and directness of a route, because it is frequently more efficient to drive down a highway than to take a smaller road with multiple stops.
Authoritative data from local governments and live user reports are two more sources of information that are crucial to ensuring that we only provide the best routes. Google Maps can access authoritative data to learn about speed restrictions, tolls, and whether certain roads are closed due to COVID-19 or construction. Additionally, incident reports from drivers help Google Maps swiftly display information about restricted lanes, nearby construction, damaged vehicles, and other roadside obstructions. Both sources also help account for unexpected changes in road conditions brought on by mudslides, snowstorms, or other natural phenomena.
So, how does this all function in practice? Let’s imagine you’re travelling down your usual route to a doctor’s appointment across town. When you leave the house, traffic is flowing smoothly with no signs of any delays. Using Google Maps’ traffic projections mixed with current traffic conditions, we notify you that if you stick with your present course, there’s a good possibility you’ll get caught in unexpected standstill traffic around 30 minutes into your ride, which would mean missing your appointment. Google Maps then automatically reroutes you using information on local road conditions, helping you avoid the traffic altogether and get to your appointment on time.
We’ll continue developing tools and technology to keep you out of traffic jams and on a route that is as safe and efficient as possible. Traffic prediction and route planning are tremendously complicated tasks.
4.2.13 Uber Eats
One of the most challenging computational problems is determining the ideal times for drivers to pick up food delivery orders for a variety of businesses. Consider it similar to the NP-hard combinatorial optimization problem known as the Traveling Salesperson Problem: the delivery worker wants the meal ready when they pull up, and the consumer wants their food delivered promptly. If estimates are even slightly off, customers get dissatisfied and delivery personnel seek employment elsewhere.
Zi Wang, a senior software engineer at Uber who oversees the company’s time prediction efforts, explained at QCon New York last month how AI is used to generate these time estimates.
Uber Eats was introduced in 2015 in Toronto and has now expanded to 400 cities with 220,000 participating eateries. Utilizing independent contractors from its car-sharing business, it offers a meal delivery service for eateries. The effectiveness of the service depends heavily on the ability to anticipate delivery timeframes, including when the meal will be ready and how long it will take to deliver. For 2018, it received orders of over $8 billion.
Other businesses, which must squeeze their profits from constrained and unpredictable tolerances, could benefit from Wang’s lessons.
In 2015, Wang joined Uber to improve the company’s dispatch system.
The business aims to make food delivery dependable, simple, and reasonably priced for customers. According to Wang, “We want to ensure that the food will be delivered without a hitch, which requires us to forecast the future and strike a balance between orders and delivery partners.” The system must forecast three things for each order: the time of delivery, the amount of time it takes to deliver the food, and the amount of time it takes the restaurant to produce the order. The last is especially difficult, given that Uber lacks direct knowledge of the length of time it takes a restaurateur to make a certain meal.
The secret is timing. Uber Eats wants the delivery person to be sent out exactly when the order is prepared. If the delivery worker arrives too early, they will waste time waiting around, costing them money from potential orders. If they arrive too late, the hungry clients get their food late or cold.
When users launch the app and browse a list of eateries, they are given a broad estimate of how long it will take for the food to arrive. Additionally, eateries are given the option of estimating the time needed to prepare the cuisine, which they can then modify if necessary.
Kafka is used to gather all data, which is then forwarded to a streaming engine for pre-processing and storage in a Cassandra data store, where Spark processes it to model the data. The Cassandra model repository houses trained and ready-to-use models.
With so many moving parts in the dispatch system’s issue space, Wang
acknowledges that finding the ideal state is an NP-hard computational challenge.
Wang remarked, referring to the customer, the delivery driver, and the restaurant, that
m
“for every decision we are making, we are taking into mind all the partners.” The ideal
situation is one in which each stakeholder receives the best possible result: the diner
receives their food as quickly as possible, the delivery person arrives as promptly as
)A
possible, and the restaurant receives the best estimate of the amount of time it will take
to prepare their food.
The dispatch system must determine the most precise time to send the driver to the restaurant after an order is placed.

“We require that our meal delivery partner show up at the appointed hour when the food will be ready. They shouldn't show up too late because the meal will be cold. We don't want them to arrive too early since they would waste time waiting there and also take up the restaurant's parking and seating areas,” added Wang.
Before machine learning, when deciding when to send a delivery person, Uber Eats relied on a greedy algorithm, which estimated the best local solution for each delivery without optimising the problem space for all the drivers in that region. This had a negative impact on the service as a whole because it resulted in delayed deliveries and delivery personnel lingering in the parking lot of the restaurant while orders were being processed.
A greedy algorithm finds the nearest driver for a given order. A global optimisation, though, would determine the ideal timings for all the drivers and pickups. Wang gave a fictitious example: if a greedy algorithm matched two drivers to their nearest orders, the total journey time would be six minutes; however, swapping the pickups for those two drivers could reduce the total travel time for both orders.

With ML in place, a journey time estimate can be obtained based on the historical data of all travel times, for all local restaurants, and for all open jobs and drivers. The method used by Uber Eats incorporates real-time, near-real-time, and historical data. “We can take into account all jobs and suppliers simultaneously with a global matching system,” he explained. “In a single global optimisation problem, we can solve the complete collection of jobs and supplies.”
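To make the greedy-versus-global difference concrete, here is a toy Python illustration (not Uber's system): a greedy assignment of drivers to pickups compared with a global optimisation using SciPy's Hungarian-algorithm solver. The travel times are invented.

import numpy as np
from scipy.optimize import linear_sum_assignment

# travel_minutes[i][j] = time for driver i to reach pickup j (made-up numbers)
travel_minutes = np.array([
    [2.0, 3.0],
    [3.0, 9.0],
])

# Greedy: each driver in turn grabs the nearest still-unassigned pickup.
remaining = set(range(travel_minutes.shape[1]))
greedy_total = 0.0
for i in range(travel_minutes.shape[0]):
    j = min(remaining, key=lambda j: travel_minutes[i, j])
    greedy_total += travel_minutes[i, j]
    remaining.remove(j)

# Global: minimise the total travel time across all drivers at once.
rows, cols = linear_sum_assignment(travel_minutes)
global_total = travel_minutes[rows, cols].sum()

print(greedy_total)   # 11.0: driver 0 takes pickup 0, forcing driver 1 to 9.0
print(global_total)   # 6.0: swapping the pickups is better overall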
However, estimating the order preparation time can be challenging, in large part because Uber is unable to observe how long the kitchen staff spends preparing the dish without being present. The preparation time can be inferred from the time the delivery person leaves the restaurant, although this is of limited use because the meal may already be finished by the time the delivery person arrives.

Here, machine learning (ML) can offer insight with more contextual cues, both historical and current: this approach takes into account things like the average amount of time spent preparing food over the previous week, the quantity and cost of unfulfilled orders, and even the day of the week. Immediate signals can also be used, such as how many other orders, including those from other delivery services, are currently being processed.
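As a hedged sketch of how such contextual cues could feed a preparation-time model, the snippet below trains a gradient-boosted regressor on a tiny invented dataset; the feature names and numbers are illustrative assumptions, not Uber's actual features.

import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

orders = pd.DataFrame({
    'avg_prep_last_week': [11.0, 14.5, 9.0, 16.0],   # minutes
    'open_orders_now':    [3, 7, 1, 9],              # includes other services
    'order_value':        [18.0, 42.0, 12.0, 55.0],
    'day_of_week':        [0, 4, 2, 5],
    'prep_minutes':       [10.0, 17.0, 8.0, 19.0],   # observed target
})

model = GradientBoostingRegressor()
model.fit(orders.drop(columns='prep_minutes'), orders['prep_minutes'])

# Estimate the prep time for a new order on a busy Friday.
new_order = pd.DataFrame([{'avg_prep_last_week': 13.0, 'open_orders_now': 8,
                           'order_value': 30.0, 'day_of_week': 4}])
print(model.predict(new_order))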
4.2.14 Amazon ADS Recommendation and iPhone Face Detection
It is commonly known that Amazon's online store offers recommendations that are incredibly accurate. To enhance customer experience and boost sales, it makes use of technologies like artificial intelligence algorithms and machine learning.

In order to provide clients with a list of suggested products, Amazon's recommendation system is capable of intelligently evaluating and anticipating their shopping habits.

Amazon's recommendation engine is regarded as one of the best on the market, despite the fact that rival retailers have recently added comparable features to their websites.
What is the Amazon algorithm?
Existing customers anticipate that the online store will vary their buying options and personalise its interactions with customers; up to 91 percent of online store customers admit that they are more likely to use a brand's offer that customises their experience.
Personalization is the secret to achieving all of these objectives, whether you want to increase views, lower bounce rates, or improve click-through rates.
Amazon is able to keep customers interested and offer them things they might not even consider by making suggestions that optimise potential value for each individual customer.
Amazon and Personalization of the Shopping Path
To personalise purchases, the business developed the Amazon Personalize feature.
It is a machine learning and artificial intelligence service that specialises in creating recommender system solutions. In order to produce real-time suggestions, it automatically analyses data, chooses functions and algorithms, optimises the model based on the data, and then implements and maintains the model.
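A minimal sketch of the item-to-item collaborative-filtering idea behind such recommendations is shown below; the ratings matrix and the recommend() helper are invented for illustration and do not represent Amazon's production system.

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Rows = users, columns = products; entries are ratings (0 = not rated).
ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [0, 1, 5, 4],
    [1, 0, 4, 5],
])

# Similarity between products, based on which users rated them together.
item_sim = cosine_similarity(ratings.T)

def recommend(user_id, top_n=1):
    scores = ratings[user_id] @ item_sim      # weight items by similarity
    scores[ratings[user_id] > 0] = -np.inf    # hide already-rated items
    return np.argsort(scores)[::-1][:top_n]

print(recommend(0))  # the unrated product most similar to user 0's likes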
Machine learning at Amazon extends beyond the suggestions themselves, supporting various elements of the platform's operation.
Additionally, the business makes use of its own A9 algorithm, which powers the website's sophisticated product search.
The A9 algorithm examines and categorises certain brands and their products on the marketplace, enabling it to provide relevant and customised search results to Amazon users.
This technique is also used to choose which merchants are shown to customers on the front page. The Amazon A9 algorithm is built around three main operating concepts.

iPhone Face Detection
To unlock the phone using Face ID, an iPhone X user simply holds up the device and it recognizes his or her face.
The face detection capability is founded on machine learning algorithms that have been trained to recognise you. The software can identify you even with minor changes like glasses, makeup, and perhaps a beard, and it will only become better with time.

For this function, Apple used a number of neural networks, and it is open about how secure Face ID is. With Face ID, the likelihood that a random individual will be able to unlock the phone is 1 in 1,000,000, as opposed to Touch ID's 1 in 50,000. According to Apple, an identical twin is the only thing that can readily defeat the software.
Various pieces of hardware make the facial detection feature possible. The elements include a Dot Projector (a transmitter of structured light), an infrared camera (a receiver of structured light), a flood illuminator, and a proximity sensor. The iPhone X's front camera also collects 2D information about a user's face. When the user looks at their phone, the dot projector projects 30,000 infrared dots onto their face.

Schiller claimed that the iPhone X smartphone accesses the neural engine in the Apple A11 chip. Face identification, known as Face ID on the iPhone X, is one of the activities that the A11 can speed up using machine learning and artificial intelligence. Machine learning activities are also well-suited to the A11's graphics processor. Additionally, the camera contains an image signal processor created by Apple that can improve photos and low-light autofocus.
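Conceptually, such a system compares an embedding of the captured face with a stored enrolment embedding. The sketch below is a highly simplified illustration of that matching step; the vectors and threshold are made up, and this is not Apple's implementation.

import numpy as np

def cosine_distance(a, b):
    # 0 means identical direction; larger means less similar.
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# In a real system a neural network maps a face image to an embedding;
# here we just use stand-in vectors.
enrolled = np.array([0.90, 0.10, 0.40])    # stored at enrolment
captured = np.array([0.88, 0.15, 0.42])    # from the current unlock attempt

THRESHOLD = 0.05                            # arbitrary, for illustration
print('unlock' if cosine_distance(enrolled, captured) < THRESHOLD else 'reject')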
Apple’s Earlier Machine Learning Efforts
ve
Apple launched Core ML, a new machine learning framework API for developers,
in June of this year. Core ML would hasten the processing of AI activities on the iPhone,
iPad, and Apple Watch.
ni
users’ smartphones and tablets because Core ML is designed for on-device processing.
The study and use of statistical models and methods that enable computer systems to carry out specified tasks without human input is known as machine learning, and it is a subset of artificial intelligence. Many technologies that people use every day are built on it.
Voice Assistant
These days, voice assistants are everywhere. They are becoming more and more common thanks to programmes like Apple's Siri, Google Assistant, Amazon's Alexa, and others. All of these voice assistants use Natural Language Processing (NLP), a machine learning technique, to recognise speech. The response is then created after utilising machine learning to translate the voice into numbers. NLP is also used to prepare information, for example by converting ambiguous legalese in contracts into understandable English. As machine learning techniques evolve, researchers anticipate that these assistants will become astonishingly smarter.
Personalised Marketing
Technology is advancing in the marketing system. The marketing sector groups clients based on behavioural and characteristic data using machine learning features. Digital advertising platforms enable marketers to concentrate on a group of consumers who are likely to be influenced by their products. Marketers become aware of what customers want and can thus provide better product advertising.
Fraud Detection
Machine learning is being used by banks and large financial services providers to detect fraud. This helps businesses safeguard their customers. Companies that process credit card transactions can benefit from machine learning as well. According to the policies of the company, the system is programmed to identify transactions that seem to be fraudulent based on specific criteria. Companies can avoid suffering significant costs by spotting such transactions early. Additionally, an organisation can use machine learning to estimate sales or demand in real time and obtain insights into its consumer loyalty and competitive landscape.
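As a small illustration of anomaly-based fraud flagging, the sketch below uses scikit-learn's IsolationForest on synthetic transactions; the features and contamination rate are assumptions for demonstration only.

import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# Normal card transactions: amount (in currency units) and hour of day.
normal = np.column_stack([rng.normal(40, 15, 500), rng.normal(14, 3, 500)])
# A few suspicious ones: very large amounts in the middle of the night.
suspicious = np.array([[900.0, 3.0], [1200.0, 4.0]])

X = np.vstack([normal, suspicious])
clf = IsolationForest(contamination=0.01, random_state=0).fit(X)

labels = clf.predict(X)   # +1 = looks normal, -1 = flagged as an anomaly
print(X[labels == -1])    # flagged transactions, passed on for manual review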
Self-Driving Cars
One of the exciting technologies that makes heavy use of machine learning is the self-driving car. Self-driving cars have the advantage of incorporating all three of the main types of machine learning: such a car can detect objects in front of it, determine the location of the pavement and traffic signals, assess the driver's health, and categorise scenes. Real-time advice regarding traffic and road conditions can also be provided via machine learning.
Transportation Optimization
Machine learning is the main tool used by businesses aiming to increase the transportation sector's reliance on technology. Machine learning is used by ride-hailing applications like Uber, Lyft, Ola, and others to design the best routes and set prices for all of their many services. Travel rates are adjusted dynamically to reflect shifting market conditions; prices change based on variables such as the time of day, the location, the weather, client demand, etc. Drivers can now use machine learning to find the most efficient routes.
Predicting Behaviour
Machine learning models can be used by businesses to forecast customer behaviour based on historical data. Businesses examine social media for topics people are discussing and identify users who are looking for a certain commodity or service. Zappos, for instance, leverages analytics and machine learning to offer clients customised sizing and search results as well as predictive behaviour models.
Healthcare
The benefit of machine learning in the healthcare industry is its capacity to handle enormous datasets beyond the limits of human capability and then consistently turn the analysis of those datasets into therapeutic insights that benefit clinicians. The use of machine learning in the planning and delivery of healthcare ultimately results in improved outcomes, cheaper healthcare expenditure, and greater patient satisfaction. Women's mammography scans can be reviewed and cancer risk assessed using computer-assisted diagnosis (CAD), a machine learning application.
Process Automation
Intelligent process automation (IPA) is the result of fusing AI with related technologies like computer vision, cognitive automation, and machine learning. By combining these technologies into a single process, companies gain richer automation possibilities, unlocking greater commercial value for the firm. With machine learning algorithms, for example, error-free insurance risk assessment can be automated from human data-entry activity.
Chatbots
Machine learning is assisting customer service through chatbots that provide pertinent answers to customers' questions. Machine learning algorithms are able to comprehend the needs of customers, and the tone in which they express those needs, using Natural Language Processing (NLP) and sentiment analysis. The system can then respond appropriately.
Physical Security
Machine learning is also used in physical security, for example in analysing video surveillance feeds.

People use many AI-driven features every day without even knowing how they function; most individuals are unaware of what these capabilities actually are. Voice assistants, predictive text, driving route suggestions, and many more AI-driven functions are often used.
4.2.16 Google Translation
People often turn to machine learning for translations. Sometimes all you need to know about a document or a word is its general meaning, and you might look online for assistance in that circumstance. The most dependable translation tool online is Google Translate; consumers are wary of putting their trust in any other app's method. They are therefore curious about how Google's translation feature functions.
Google Translate has been available for eleven years. Phrase-Based Machine Translation was the main algorithm used when Google Translate was first released. Later, Google made other advances in machine learning that irrevocably changed the way we perceive foreign languages.
The most important development in translation technologies was the introduction of Google Neural Machine Translation, or GNMT. Its model architecture is made up of an encoder network and a decoder network.

With Translatotron, Google demonstrated that a single sequence-to-sequence model can translate voice in one language into speech in another without the need for an intermediary text representation, in contrast to cascaded systems.
Recurrent neural networks (RNNs) with the ability to learn long-term dependencies are known as LSTMs. A series of repeating modules can be seen in all RNNs. An RNN can be represented equally well in folded or unfolded form; when you want to understand what is happening at each stage, it can be helpful to unfold the network.
Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU)
Problems exist with the fundamental RNN building block described above. One of the most significant is the inability to remember information when it is presented in a lengthy sequence: information that was provided a few time steps previously is forgotten, so learning performance is constrained. To address this, researchers developed certain architectural solutions, of which Long Short-Term Memory (LSTM) and the Gated Recurrent Unit (GRU) are the most common.
The vast majority of applications use one of these two sister architectures. The hidden-state block contains the information that distinguishes these two from the basic RNN; the LSTM and the GRU differ internally in how that state is gated and updated.
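As a minimal Keras sketch of an LSTM processing a sequence (the shapes and layer sizes are illustrative choices, not the GNMT architecture), consider:

import numpy as np
from tensorflow import keras

model = keras.Sequential([
    keras.layers.LSTM(32, input_shape=(20, 8)),   # 20 steps, 8 features each
    keras.layers.Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy')

x = np.random.rand(4, 20, 8)    # a batch of 4 toy sequences
print(model.predict(x).shape)   # (4, 1)

Swapping the LSTM for a GRU is a one-line change (keras.layers.GRU(32)), which is part of why these two sister architectures are used so interchangeably in practice.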
However, the difficulty was that NMT systems are generally more expensive and have trouble handling unusual words. Google's Neural Machine Translation (GNMT) system was introduced to address these problems.

The GNMT uses attention and residual connections to build a deep LSTM (Long Short-Term Memory) network with 8 encoder and 8 decoder layers. To increase parallelism, the top layer of the encoder is connected to the bottom layer of the decoder through Google's attention mechanism. They use low-precision arithmetic in their inference computations to speed up the final translation. They split words into a small number of common sub-word units for input and output to better handle unusual words. This approach naturally handles the translation of uncommon words and enhances the system's overall accuracy by striking a fair balance between the flexibility of character-delimited models and the efficiency of word-delimited models. The length-normalisation algorithm and coverage penalty used by the Google team's beam search technique encourage the creation of an output sentence that is most likely to include every word in the source sentence. The GNMT produces results competitive with the state of the art on the WMT'14 English-to-French and English-to-German benchmarks. Compared to Google's phrase-based production system, it reduces translation errors by an average of 60% in a human side-by-side evaluation on a collection of isolated simple sentences.
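The toy function below illustrates the spirit of such sub-word segmentation with a greedy longest-match rule; the vocabulary is invented, and this is not Google's wordpiece implementation.

def segment(word, vocab):
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start and word[start:end] not in vocab:
            end -= 1                  # shrink until a known piece matches
        if end == start:
            return ['<unk>']          # no known piece at this position
        pieces.append(word[start:end])
        start = end
    return pieces

vocab = {'trans', 'lat', 'ion', 'un', 'common'}
print(segment('translation', vocab))  # ['trans', 'lat', 'ion']
print(segment('uncommon', vocab))     # ['un', 'common']

Because rare words decompose into common pieces, the model rarely meets a token it has never seen.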
4.2.17 Tesla Driverless Car
To train its algorithms for autonomous vehicles, Elon Musk's corporate offices receive data from over 500,000 Teslas throughout the globe. In the rush to put more self-driving cars on the road, this data gives Tesla a significant advantage.

You might imagine Tesla is a conventional auto manufacturer. Tesla is without a doubt a pioneer in the field of electric vehicles; however, the fact that they are a technology company is the secret to their success, and one of the factors contributing to that success is that their business is built on data.

Making their cars totally autonomous is currently one of Tesla's main objectives, and they are using big data and AI to achieve this.
An autonomous car must constantly answer questions such as: when should it switch lanes? Should it continue where it is or pass the automobile in front? When should the vehicle decelerate or speed up?
Tesla needs to gather the appropriate data to train the algorithms and feed their AIs in order to make cars fully autonomous. Better performance will always result from more training data, and Tesla shines in this area.

The fact that Tesla crowdsources all of its data from the hundreds of thousands of Tesla vehicles now on the road gives it a competitive advantage. Both internal and external sensors track how Teslas behave in a variety of circumstances. They also gather information on driver behaviour, including how drivers respond to various circumstances and how frequently they touch the steering wheel or dashboard. Tesla observes how drivers throughout the world make decisions, respond, and move, and their algorithms learn from those actions. All those miles result in incredibly sophisticated autonomous vehicles.
Their tracking system is really advanced. When a Tesla vehicle predicts the behaviour of a car or bicycle incorrectly, for instance, Tesla stores a data snapshot of the moment, adds it to the data set, and then recreates an abstract representation of the world using colour-coded shapes that the neural network can learn from. In contrast to the real-world data that Tesla uses to train its AIs, other businesses working on autonomous vehicles employ synthetic data (such as driving behaviours from video games like Grand Theft Auto).
AI at the Heart of Tesla
Since its start, Tesla has benefited from data from its current customer base, and its work on autonomous vehicles is part of its ongoing ambition to put AI at the centre of all its operations. AI and big data will continue to be faithful allies of Elon Musk and his team at Tesla as they move into their newest initiatives (including their plans to transform the electric grid with their home solar power panels).
4.2.18 Kuka Game and Email
In the board game Catan, the robot places an orange street. Students at OTH Regensburg conducted some fundamental software engineering research as part of their robotics course, which led to the creation of the board game project and allowed Markus Webert to complete his bachelor's degree in computer science under Prof. Dr. Klaus Volbert.
Standing in front of the robot cell with the KR AGILUS, Webert explains, “Before I started my bachelor's degree, I thought 'Programming, I don't like that much.' But now I'm really into it. When I first started learning programming, a lecturer encouraged me to participate in this awesome effort to create an AI that could play Settlers of Catan. We started with a platform for an online game that required actual participants, and we used that platform to gather information from those players. The intention was to use the data for later purposes, such as machine learning, to train our algorithm to play.”
There are 40 students enrolled in the robotics programme at OTH Regensburg right now. The course lecturer, a professor at the Faculty of Mathematics and Science, is Prof. Dr. Martin Weiss. He has a background in industry, and practice is a key component of his lessons. Martin explains that in his courses, students “use robots as a practical example to illustrate what very theoretical things in Mathematics and Computer Science might mean in the real world. We don't only do things on paper or write programmes; instead, I can demonstrate using a computer and robot that an equation has multiple solutions and that the robot can move to the same spot using a variety of axis configurations.”
Working with the KUKA Small Robot Fascinates Students
The students particularly like the robot, according to Prof. Dr. Weiss. “Working with the robot you see bits moving and you see the robot doing some intriguing things.” They enjoy it because it allows them to apply their ideas to extremely useful things. The students come to understand that this requires teamwork and that this is a field where they might have a bright future ahead of them. Sometimes students are afraid of the mathematics, but the problems can be divided so that one group works on the mathematics, another works on the programming, and others work on the engineering. The KR AGILUS will eventually have a camera installed so that future students can create vision systems that enable the robot to pick up pieces that are randomly oriented. For the time being, the KUKA small robot will continue to move the robber, develop settlements, optimise streets, and place buildings on the screen.
Supervised or unsupervised? Unsupervised, as my training data only contains inputs, often referred to as features, and no outputs. In supervised machine learning we deal with inputs and their known results; this was an unsupervised machine learning challenge where we wanted to categorise emails based on their message bodies.

Working with the full dataset would take a great deal of RAM for the intricate calculations involved, so we started from a split of the data:

import pandas as pd

emails = pd.read_csv('split_emails_1.csv')
The dataset now had 10,000 emails divided into three columns (index, message id, and the raw message). We converted the raw message into key-value pairs before working with it. Here is an example of a raw message:
Message-ID: <0965995.1075863688265.JavaMail.evans@thyme>
Date: Thu, 31 Aug 2000 04:17:00 -0700 (PDT)
From: phillip.allen@enron.com
To: greg.piper@enron.com
Mime-Version: 1.0
Content-Transfer-Encoding: 7bit
X-To: Greg Piper
X-cc:
X-bcc:
X-Folder: \Phillip_Allen_Dec2000\Notes Folders\'sent mail
X-Origin: Allen-P
X-FileName: pallen.nsf

Greg,
Phillip
I created a function that extracts this data into key-value pairs so that I could work just with the sender, receiver, and email body data:

def parse_raw_message(raw_message):
    lines = raw_message.split('\n')
    email = {}
    message = ''
    keys_to_extract = ['from', 'to']
    for line in lines:
        if ':' not in line:
            message += line.strip()
            email['body'] = message
        else:
            pairs = line.split(':')
            key = pairs[0].lower()
            val = pairs[1].strip()
            if key in keys_to_extract:
                email[key] = val
    return email
def map_to_list(emails, key):
    return [email.get(key, '') for email in emails]

def parse_into_emails(messages):
    emails = [parse_raw_message(message) for message in messages]
    return {
        'body': map_to_list(emails, 'body'),
        'to': map_to_list(emails, 'to'),
        'from_': map_to_list(emails, 'from')
    }

We then created a dataframe from the parsed data and dropped emails with missing fields, as follows:

email_df = pd.DataFrame(parse_into_emails(emails.message))
email_df.drop(email_df.query(
    "body == '' | to == '' | from_ == ''"
).index, inplace=True)
Machine learning models struggle with text but excel at numbers, so I needed to feed the model something it could understand. Because of this, I created a document-term matrix out of the email bodies:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

vect = TfidfVectorizer(stop_words='english', max_df=0.50, min_df=2)
X = vect.fit_transform(email_df.body)

X_dense = X.todense()
coords = PCA(n_components=2).fit_transform(X_dense)
plt.scatter(coords[:, 0], coords[:, 1], c='m')
plt.show()
Having done that, we sought to identify the principal terms used in those emails, which is exactly why we created these functions:

import numpy as np

def top_tfidf_feats(row, features, top_n=20):
    topn_ids = np.argsort(row)[::-1][:top_n]
    top_feats = [(features[i], row[i]) for i in topn_ids]
    df = pd.DataFrame(top_feats, columns=['features', 'score'])
    return df

def top_feats_in_doc(X, features, row_id, top_n=25):
    row = np.squeeze(X[row_id].toarray())
    return top_tfidf_feats(row, features, top_n)
After extracting the feature names with

features = vect.get_feature_names()

this function produced the following result when applied to a document:
    features    score
0   meetings    0.383128
1   trip        0.324351
2   ski         0.280451
3   business    0.276205
4   takes       0.204126
5   try         0.161225
6   presenter   0.158455
7   stimulate   0.155878
8   quiet       0.148051
9   speaks      0.148051
10  productive  0.145076
11  honest      0.140225
12  flying      0.139182
13  desired     0.133885
14  boat        0.130366
15  golf        0.126318
16  traveling   0.125302
17  jet         0.124813
18  suggestion  0.124336
19  holding     0.120896
20  opinions    0.116045
21  prepare     0.112680
22  suggest     0.111434
23  round       0.108736
24  formal      0.106745
If you look at the corresponding email, everything makes sense:

“... before going on vacation without having any official business meetings. Particularly if a trip wasn't even wanted or necessary, I would try to acquire some straight answers. Regarding the business meetings, I believe it would be more beneficial to try and incite talks about what is working and what is not working across the various groups. All too frequently, only the presenter talks while the audience members wait their turn in silence. A round table discussion structure for the meetings could be preferable. Austin is a good option, in my opinion. Golf, a ski boat, and jet skis can all be rented. ...”
Writing a function to extract the most important terms from all the emails was the next step:

def top_mean_feats(X, features, grp_ids=None, min_tfidf=0.1, top_n=25):
    if grp_ids:
        D = X[grp_ids].toarray()
    else:
        D = X.toarray()
    D[D < min_tfidf] = 0
    tfidf_means = np.mean(D, axis=0)
    return top_tfidf_feats(tfidf_means, features, top_n)

print(top_mean_feats(X, features, top_n=10))
   features   score
0  enron      0.044036
1  com        0.033229
2  ect        0.027058
3  hou        0.017350
4  message    0.016722
5  original   0.014824
6  phillip    0.012118
7  image      0.009894
8  gas        0.009022
9  john       0.008551
The information we had thus far was intriguing, but we were curious to learn what more the machine could have learned from these facts, so we ran the emails through a KMeans classifier:

from sklearn.cluster import KMeans

n_clusters = 3
clf = KMeans(n_clusters=n_clusters, max_iter=100, init='k-means++', n_init=1)
labels = clf.fit_predict(X)
Because we now knew which emails the machine assigned to each cluster, we were able to create a function that retrieves the top terms for each cluster:

def top_feats_per_cluster(X, y, features, min_tfidf=0.1, top_n=25):
    dfs = []
    labels = np.unique(y)
    for label in labels:
        ids = np.where(y == label)
        feats_df = top_mean_feats(X, features, ids, min_tfidf=min_tfidf, top_n=top_n)
        feats_df.label = label
        dfs.append(feats_df)
    return dfs
si
I discovered an excellent example of how to plot this graph with matlibplot instead
of printing out the terms. I then duplicated the function, made some changes, and
created the following plot:
r
ve
ni
U
ity
m
We noted right away that cluster 1 contained strange words like “hou” and “ect”. To learn more about why terms like these are so popular, we needed to gain additional insight into the entire dataset, which called for a different methodology.
Summary
●● The focus of this chapter has been on unsupervised learning, a collection of statistical methods made for situations where we simply have a set of features, such as X1, X2, ..., Xp, measured on n observations. Because we lack a response variable Y, we are not interested in making predictions. Instead, the objective is to discover interesting things about the measurements: is there a useful technique to visualise the data? Can we identify subgroups within the observations or within the variables? Unsupervised learning encompasses a wide range of methods for responding to queries like these. In this chapter, we concentrated on two specific types of unsupervised learning: principal components analysis and clustering. Principal components analysis is a tool for data visualisation or pre-processing before supervised approaches are utilised.
●● Unsupervised learning is frequently more difficult. There is no clear objective for the analysis, such as the prediction of a response, hence the exercise tends to be more subjective. Unsupervised learning is frequently used in conjunction with exploratory data analysis. Furthermore, because there is no widely established way of performing cross-validation or verifying results on a separate data set, it can be difficult to evaluate the outcomes of unsupervised learning methods. It's easy to understand why there is a difference: if we use supervised learning to fit a predictive model, we may evaluate our results by seeing how well the model predicts the answer Y on observations that weren't used to fit the model. However, since we don't know the actual solution in unsupervised learning, there is no way to verify our work.
●● Unsupervised machine learning clustering that starts with a top-to-bottom hierarchy of clusters is called “hierarchical clustering”. Following that, it decomposes the data items based on this hierarchy to produce the clusters. Depending on whether the process of building clusters proceeds top-down or bottom-up, this strategy adopts one of two approaches.
●● Agglomerative clustering is quite different from divisive clustering: each of the “N” data points is initially regarded as a separate component of the “N” clusters that the data is divided into. These numerous “N” clusters are then repeatedly condensed into a smaller number of clusters, say “k” clusters, and the data points are allocated to these clusters accordingly. This bottom-up method starts by treating each data point as a single cluster and then begins combining the two clusters that are closest to one another.
●● Divisive clustering is also referred to as a top-down strategy. The number of clusters need not be predetermined for this approach. Top-down clustering starts by figuring out how to split a cluster that contains all of the data, then splits clusters recursively until all of the individual data points are in singleton clusters.
●● Finding unusual occurrences or observations that are statistically distinct from the rest of the observations is known as anomaly detection. Such “anomalous” behaviour is usually indicative of some sort of issue, such as credit card fraud, a failed server component, a cyberattack, etc.
●● Three general categories can be used to classify anomalies: a) Point anomaly, b) Contextual anomaly, c) Collective anomaly.
●● One of the most straightforward unsupervised learning techniques for handling the well-known clustering problem is k-means. The process uses a predetermined number of clusters (say, k clusters) fixed a priori to categorise a given data set. The main notion is to define k centres, one for each cluster.
●● The Apriori algorithm, the first associative algorithm, introduced by Rakesh Agrawal, has been employed as a component in subsequent advances of association, classification, and associative classification algorithms.
●● The Apriori principle states: “A frequent itemset's subsets must also be frequent.” In other words, neither the generation nor testing of a superset of an infrequent itemset is required.
●● Itemsets can be arranged in a lattice with the complete itemset at the bottom; as you move up, you produce subsets until you reach the null set.
●● The Apriori algorithm can be used to produce association rules. In R, the arules package contains the apriori() function. The Apriori algorithm is one of the popular methods in market basket analysis.
●● One of the main applications of the SVD is principal components analysis (PCA), which offers a data-driven, hierarchical coordinate system to describe high-dimensional correlated data. This coordinate system is based on the correlation matrices.
●● A matrix is factored into three matrices in the Singular Value Decomposition (SVD) process. It communicates significant geometrical and theoretical insights regarding linear transformations and has several intriguing algebraic characteristics.
●● The local outlier factor (LOF) is an algorithm for unsupervised outlier detection. It generates an anomaly score that indicates which data points in the data set are outliers. This is accomplished by calculating the local density deviation of a given data point in relation to nearby data points.
●● For visualising high-dimensional data in a low-dimensional environment of two or three dimensions, t-SNE is a well-suited and adaptable nonlinear dimensionality reduction technique.
●● Applications for t-SNE include computer security analysis, music analysis, and cancer research, among many others. It is used to depict high-dimensional data, which can have tens to hundreds of variables, as was previously stated.
Glossary
●● OPTICS: Ordering Points To Identify Clustering Structure.
●● BIRCH: Balanced Iterative Reducing and Clustering Using Hierarchies.
●● CURE: Clustering Using Representatives.
●● CLARANS: Clustering Large Applications based upon Randomized Search.
●● STING: Statistical Information Grid.
●● CLIQUE: Clustering In Quest.
●● HCA: Hierarchical Cluster Analysis.
●● DBSCAN: Density-Based Spatial Clustering of Applications with Noise.
●● Core Point: If a point has more than MinPts points inside its eps radius, it is a core point.
●● Border Point: A point that is close to a core point but has fewer than MinPts points inside its eps radius.
●● Noise or Outlier: A point that is neither a border point nor a core point.
●● Clustering: Identifying the natural groupings in the data, such as classifying clients based on their purchasing patterns, is a clustering problem.
●● Association: When you wish to find rules that broadly characterise your data, such as “those who buy X also tend to buy Y,” you have an association rule learning problem.
●● Single Linkage: The distance between two clusters is the shortest distance between their closest points.
●● Complete Linkage: The distance between two clusters is the longest distance between their points.
●● Average Linkage: The distance between two clusters is the average of the distances between every pair of points in the two clusters.
●● Centroid Linkage: The linkage method used to calculate the separation between the centroids of the clusters.
●● eps: It establishes the area surrounding a data point; if the distance between two points is less than or equal to “eps,” they are regarded as neighbours.
●● MinPts: The minimum number of neighbouring data points within the eps radius. The larger the dataset, the greater the value of MinPts that should be selected.
●● Anomaly Detection: Finding unusual occurrences or observations that are statistically distinct from the rest of the observations is known as anomaly detection.
●● Point Anomaly: A tuple in a dataset is referred to as a point anomaly if it deviates significantly from the rest of the data.
●● Contextual Anomaly: Observations that are anomalous because of the context in which they were made.
●● Collective Anomaly: A collection of data instances that together aid in the discovery of an anomaly.
●● Basket Data Analysis: Analysing the relationship of things purchased in a single basket or single purchase, as in the aforementioned cases, is known as basket data analysis.
●● Cross-marketing: Collaborating with businesses that enhance your own, not with rivals. For instance, it stands to reason that auto dealers and manufacturers engage in cross-promotional activities with the oil and gas industries.
●● Catalogue Design: The items in a company's catalogue are frequently chosen to work well together so that purchasing one item will prompt a subsequent purchase of another.
●● LOF: The local outlier factor, an algorithm for unsupervised outlier detection.
●● t-SNE: t-distributed Stochastic Neighbour Embedding.
Check Your Understanding
1. Identifying the natural groupings in the data, such as classifying clients based on their purchasing patterns, is known as a _ _ _ _.
a) Clustering problem
b) Association problem
2. When you wish to find rules that broadly characterise your data, such as “those who buy X also tend to buy Y,” you have a(n) _ _ _ _.
a) Data analysis
b) Association rule learning problem
c) Data collection
d) None of the mentioned
3. The process of drawing inferences from datasets of input data without labelled responses is known as _ _ _ _.
a) Clustering
b) Supervised learning
c) Unsupervised learning
d) Detection
4. _ _ _ _ methods treat clusters as a dense zone that differs from the less dense parts of the space.
a) Hierarchy based
b) Partitioning based
c) Grid based
d) Density based
5. In this approach, the data space is represented as a grid-like structure made up of a finite number of cells:
a) Grid based method
b) Partitioning based method
c) Hierarchy based method
d) None of the mentioned
7. These procedures divide the items into k clusters, and each partition forms a cluster:
a) Grid based method
b) Unsupervised clustering
c) Grid clustering
d) Hierarchical clustering
9. When the distance between two clusters is the shortest distance between their closest points, this is termed:
a) Single linkage
b) Complete linkage
c) Average linkage
d) Centroid linkage
10. _ _ _ _ represents the greatest separation between the points of two distinct clusters.
a) Average linkage
b) Complete linkage
c) Single linkage
d) Centroid linkage
11. The average distance between two clusters is determined using the _ _ _ _ linkage method, which adds up the distance between each pair of data points and divides by the total number of pairs.
a) Complete
b) Single
c) Average
d) Centroid
12. The _ _ _ _ linkage method is used to calculate the separation between the centroids of two clusters.
a) Single
b) Average
c) Complete
d) Centroid
13. The number of clusters need not be predetermined for this approach, which is also referred to as a top-down strategy:
a) Divisive algorithm
b) Agglomerative
c) Fuzzy clustering
14. Finding unusual occurrences or observations that are statistically distinct from the rest of the observations is known as:
b) Anomaly detection
c) Error finding
d) Supervised detection
15. A tuple in a dataset is referred to as a _ _ _ _ if it deviates significantly from the rest of the data.
a) Contextual anomaly
b) Collective anomaly
c) Point anomaly
d) None of the mentioned
Exercise
1. What do you mean by evaluation of clustering?
2. Define various clustering types.
3. What do you mean by hierarchical clustering?
4. What do you mean by DBSCAN (Density Based Spatial Clustering of Applications with Noise)?
5. Explain anomaly detection.
6. Define k-means clustering.
7. Define singular value decomposition.
8. What do you mean by LOF?
9. Explain the concept of t-distributed stochastic neighbour embedding (t-SNE).
Learning Activities
1. How are machine learning and artificial intelligence changing our day-to-day lives? Explain with the help of a few examples.
Answers to Check Your Understanding
3. c
4. d
5. a
6. b
7. c
8. d
9. a
10. b
11. c
12. d
13. a
14. b
15. c
Further Readings and Bibliography:
1. The Hundred-Page Machine Learning Book, Andriy Burkov
2. The Elements of Statistical Learning, Jerome H. Friedman, Robert Tibshirani, and Trevor Hastie
3. Hands-On Unsupervised Learning Using Python: How to Build Applied Machine Learning Solutions from Unlabeled Data, Ankur A. Patel
4. Machine Learning in Action, Peter Harrington
5. Artificial Intelligence: A Modern Approach, Peter Norvig and Stuart J. Russell
6. Understanding Machine Learning: From Theory to Algorithms, Shai Ben-David and Shai Shalev-Shwartz
Module - V: Reinforcement Learning

Learning Objectives:
At the end of this module, you will be able to understand:
●● Various Steps involved in Reinforcement Learning
●● Concept of the Markov Property
●● What is Bellman's Equation and the Optimal Value Function
●● Various Types of Problem-Solving Methods
Introduction
When we consider the nature of learning, the notion that we learn by interacting with our surroundings is likely the first that comes to mind. A baby has no explicit instructor when it plays, waves its arms, or looks around, but it does have a direct sensory link to its surroundings. This connection can be used to learn a multitude of information about cause and effect, the results of activities, and what to do to accomplish goals.
Such interactions surely serve as a significant source of information about our surroundings and ourselves throughout our lives. We are highly aware of how our environment responds to what we do, whether we are learning to drive a car or to hold a conversation, and we attempt to affect what happens through our behaviour. Nearly all theories of learning and intelligence are based on the fundamental concept that we learn from interaction. Here, we explore idealised learning scenarios and assess the efficacy of various learning techniques rather than directly thinking about how people or animals learn. In other words, we adopt the viewpoint of an AI researcher or engineer.
The other thread focuses on the optimal control problem and how value functions and dynamic programming are used to solve it. This thread largely has little to do with learning. Although the two threads have typically been independent of one another, there have been a few exceptions that centre on a third, less obvious thread of temporal-difference techniques, like those employed in the well-known tic-tac-toe example. The present discipline of reinforcement learning was created in the late 1980s when all three threads came together.

We are most familiar with, and have the most to say in this brief history about, the thread that focuses on trial-and-error learning. But first, we quickly go over the thread of optimal control.
system’s behaviour over time was first described as the “optimal control” problem in
the late 1950s. By extending a Hamilton and Jacobi theory from the nineteenth century,
Richard Bellman and others came up with one of the solutions to this issue in the
middle of the 1950s. In this method, a functional equation, now frequently known as the
ty
Bell-Man equation, is defined using the ideas of the state of a dynamical system and a
value function, or “optimal return function.”
si
by resolving this equation (Bellman, 1957a). Markovian decision processes (MDPs),
which were first presented by Bellman (1957b) and Ronald Howard (1960), are a
discrete stochastic variant of the optimum control problem. These are the fundamental
r
concepts that underlie the theory and techniques of contemporary reinforcement
ve
learning.
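For reference, the Bellman optimality equation for the state-value function of an MDP can be written in its standard form (this is the conventional notation, not reproduced from this module):

V^{*}(s) = \max_{a} \sum_{s'} P(s' \mid s, a)\left[ R(s, a, s') + \gamma V^{*}(s') \right]

Here P(s' | s, a) is the probability of moving to state s' when taking action a in state s, R(s, a, s') is the reward received on that transition, and the discount factor \gamma (between 0 and 1) weights future rewards against immediate ones.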
Most experts agree that the only practical method for resolving general stochastic optimal control problems is dynamic programming. It becomes exponentially more computationally expensive as the number of state variables grows, which Bellman called “the curse of dimensionality,” yet it is still much more effective and widely applicable than any other general method (Bertsekas, 1982, 1983). There are many excellent contemporary treatments of dynamic programming (e.g., Bertsekas, 1995; Puterman, 1994; Ross, 1983; and Whittle, 1982, 1983). An authoritative history of optimal control is provided by Bryson (1996).
Reinforcement learning is simultaneously a problem, a class of solutions that are effective for that class of problems, and the field that investigates these problems and their solutions. Reinforcement learning problems involve learning what to do, that is, how to map situations to actions, so as to maximise a numerical reward signal.

They are closed-loop problems in that the learning system's actions have an impact on its subsequent inputs. In addition, unlike many other types of machine learning, the learner must experiment with different actions to determine which ones produce the
greatest rewards. In the most fascinating and difficult situations, choices can influence not just the immediate reward but also the subsequent situation and, thus, all subsequent rewards.

The three key differentiating properties of reinforcement learning problems are that they are fundamentally closed-loop, lack explicit instructions for what actions to take, and have consequences of actions, including reward signals, that can take a long time to manifest.
The main goal is simply to capture the most significant features of the actual problem facing a learning agent interacting with its environment in order to achieve a goal. Reinforcement learning problems can be fully specified in terms of optimal control of Markov decision processes. It is obvious that such an agent must be able to at least partially detect the environment's state and must be able to act in a way that influences that state. An objective or goals pertaining to the state of the environment must also be present for the agent. Sensation, action, and goal are the three characteristics that should be included in the formulation, in their simplest forms, without being trivialised.

We define a reinforcement learning method as any approach that is effective at tackling a problem of this nature. Reinforcement learning is distinct from supervised learning, the type of learning investigated in the majority of recent research in the field of machine learning.
In supervised learning, a knowledgeable external supervisor provides a training set of labelled examples. Each example includes a description of a situation as well as a label, the specification of the correct action the system should take in response to that situation, which is frequently the category to which the situation belongs. The goal of this type of learning is for the system to extrapolate, or generalise, its answers so that it responds appropriately in situations that are not part of the training set. Although this is a crucial type of learning, it is not adequate on its own for learning from interaction: an agent must be able to learn from its own experience in new territory, where one would anticipate learning to be most helpful. Additionally, unsupervised learning, which machine learning researchers describe as identifying structure concealed in sets of unlabelled data, is distinct from reinforcement learning. Although it might seem that the phrases supervised learning and unsupervised learning fully categorise machine learning paradigms, they do not.
The trade-off between exploration and exploitation is one of the difficulties that reinforcement learning faces, as opposed to other types of learning. To receive a lot of reward, a reinforcement learning agent must favour actions that it has previously attempted and found to be successful in producing reward. However, to discover such actions, it must try actions that it has never chosen before.

The agent must exploit what it already knows in order to profit, but it must also explore in order to choose its future actions more wisely. The problem is that pursuing either exploration or exploitation exclusively would result in failure. The agent must test several different actions and gradually favour the ones that seem to work the best. For a reliable estimate of the expected reward on a stochastic task, each action must be tried numerous times.
The exploration-exploitation conundrum has been studied intensively for many decades. For the time being, we merely point out that, at least in their purest forms, the whole problem of striking a balance between exploration and exploitation never even arises in supervised and unsupervised learning.

The explicit consideration of the entire problem of a goal-directed agent dealing with an unpredictable environment is another important aspect of reinforcement learning. In contrast, many other techniques consider subproblems without addressing how they might integrate into a bigger picture.
For instance, we have said that supervised learning is a major focus of machine learning research, without directly stating the ultimate application of such a capability. Other academics have created planning theories with broad objectives, but without taking into account the function of planning in real-time decision making or the issue of where the predictive models required for planning would originate. These methods' emphasis on isolated subproblems is a key restriction, despite the fact that they have produced numerous helpful results.

The approach taken by reinforcement learning is the contrary, beginning with a fully developed, interactive, goal-seeking agent. All reinforcement learning agents can sense elements of their environments, have clear goals, and make decisions about how to affect those environments.
Furthermore, it is typically taken for granted from the start that the agent must function in spite of a great deal of uncertainty regarding the environment it faces. When reinforcement learning involves planning, it needs to take into account how real-time action selection and planning interact, as well as the issue of how environment models are created and updated. Important subproblems must be isolated and researched in order for learning research to advance, but even if some of the complete agent's specifics are still missing, these subproblems must play distinct roles in complete, interactive, goal-seeking agents.
Reinforcement learning has also strongly influenced psychology and neuroscience, with significant gains for both fields. Reinforcement learning is the type of machine learning most similar to the learning that occurs in humans and other animals, and many of its fundamental algorithms were first motivated by biological learning systems. And reinforcement learning has also given something back, both in the form of an influential model of some aspects of the brain's reward system and a psychological model of animal learning that more closely matches some of the scientific findings.
Finally, the return to straightforward general principles in artificial intelligence is a bigger trend that is reflected in reinforcement learning. Since the late 1960s, many researchers studying artificial intelligence have assumed that there are no fundamental principles to be found and that intelligence instead results from the accumulation of a huge variety of specialised techniques, methods, and heuristics.

It was formerly thought that if we could merely feed a machine one million or one billion pertinent facts, the machine would become intelligent. Weak techniques were defined as those based on broad principles, such as searching or learning, and strong methods were defined as those based on specific knowledge. Although much less prevalent now, this viewpoint is still widespread. From our perspective, it was simply too early to draw that conclusion, because not enough time had been spent looking for general principles.
Reinforcement learning is a component of the swing back toward simpler and fewer general principles of artificial intelligence, though it is unclear how far back the pendulum will swing.

Some examples of reinforcement learning in action:
●● An adaptive controller adjusts the parameters of a petroleum refinery's operation in real time. Without rigidly adhering to the set points initially recommended by engineers, the controller optimises the yield/cost/quality trade-off on the basis of stated marginal costs.
●● Minutes after birth, a gazelle calf struggles to stand. It reaches a speed of 20 miles
per hour after 30 minutes.
●● A mobile robot chooses whether to move into a different room in search of additional rubbish to collect or to begin attempting to return to its battery charging station. It bases its choice on the battery's current state of charge as well as past experiences with how quickly and simply it has located the recharger.
●● Phil gets his breakfast ready: when closely analysed, even this seemingly routine action (walking to the cabinet, opening it, choosing a cereal box, then reaching for, grabbing, and retrieving the box) displays a complicated web of conditional behaviour and interlocking goal-subgoal linkages. To obtain a bowl, spoon, and milk jug, additional intricate, calibrated, interactive sequences of behaviour are needed. A succession of eye movements is required for each step in order to gather information and direct reaching and movement.
Quick decisions are made regarding how to transport the items and whether it would be best to transport some of them to the dinner table first before obtaining others. Each step is directed by goals, such as grabbing a spoon or reaching the refrigerator, in the service of other goals, such as having the spoon to eat with once the cereal is prepared and ultimately acquiring sustenance.
Whether or not Phil is aware of it, his body is constantly accessing data about its condition to assess his dietary requirements, level of hunger, and food preferences.

These examples share features so basic that they are simple to ignore. All of them involve interaction between an active decision-making agent and its surroundings, in which the agent tries to accomplish a task despite uncertainty about the surroundings. The agent's actions are allowed to influence the environment's future state (such as the next move in a game of chess, the level of the refinery's reservoirs, the location of the robot in the future, and the battery's future charge level), which in turn influences the options and opportunities the agent will have in the future.
Making the right decision necessitates considering the indirect, long-term effects of actions, which may call for preparation or foresight. The agent must continually monitor its environment in order to react effectively because, in all of these situations, the repercussions of actions cannot be entirely predicted. For instance, Phil needs to be careful not to overfill his breakfast bowl with milk.
In each of these situations, the objectives are explicit in the sense that the agent can gauge its success based on what it can immediately experience. The mobile robot knows when its batteries run out, the chess player knows whether or not he wins, the refinery controller knows how much petroleum is produced, and Phil knows whether or not he is enjoying his breakfast.

The agent and its environment may not exactly match our intuitive conceptions of an agent and its surroundings. The entirety of a robot or organism is not always considered to be an agent, and a robot or organism's environment is not always considered to be only its physical surroundings. The example robot's battery is a feature of its controlling agent's environment, just as Phil's level of hunger and food preferences are aspects of his internal decision-making agent's environment.
In each of these cases, the agent can use experience to gradually raise performance levels. The gazelle calf increases its running efficiency, the chess player improves his intuition for assessing positions, and Phil learns to prepare his breakfast more quickly. The knowledge an agent brings to a task at the outset, either from prior experience with related tasks or built into it by design or evolution, influences what is useful or simple to learn, but interaction with the environment is crucial for modifying behaviour to take advantage of particular task features.
Reinforcement learning is a discipline within machine learning. It involves acting appropriately to maximise reward in a given situation. It is used by a variety of programmes and machines to determine the optimal course of action to pursue in a given circumstance. There is no single correct answer in reinforcement learning; the reinforcement agent decides what to do to complete the job. This is different from supervised learning, where the training data includes the answer key and the model is trained with that answer. In the absence of a training dataset, the agent is bound to learn from its experience.
Machine learning models are trained via reinforcement learning to make a series of judgements. The agent gains the ability to perform a task in a possibly complicated and uncertain environment. An artificial intelligence encounters a game-like scenario during reinforcement learning. In order to solve the problem, the computer uses trial and error. The artificial intelligence receives rewards or penalties for the actions it takes in order to make the machine accomplish what the programmer desires. Its aim is to maximise the overall reward.
Although the designer establishes the reward policy, that is, the game's rules, he offers the model no tips or advice on how to win. The model must determine how to complete the objective to maximise the reward, starting with completely arbitrary trials and ending with sophisticated strategies and superhuman abilities. Reinforcement learning is currently the most efficient method for hinting at machine creativity, since it makes use of the power of search and numerous trials. A reinforcement learning algorithm assigns positive values to desired actions and negative values to undesirable behaviours. This trains the agent to seek maximal overall reward over the long run in order to arrive at the best possible outcome.

Some key characteristics of reinforcement learning:
●● In RL, the agent is rewarded with positive feedback for each positive action, and is penalised or given negative feedback for each negative action.
●● In contrast to supervised learning, reinforcement learning uses feedback to
autonomously train the agent without the use of labelled data.
●● The agent can only learn from its own experience because there isn’t any labelled
data.
●● In situations like gaming, robotics, and other long-term endeavours where
decisions must be made sequentially, RL offers a solution.
●● The agent engages with the environment and independently explores it. In
reinforcement learning, an agent’s main objective is to maximise positive
reinforcement while doing better.
●● The agent learns through hit-and-miss, and depending on its experience, it
develops the skills necessary to carry out the task more effectively. Thus we might say that “reinforcement learning is a type of machine learning method in which an intelligent agent (a computer programme) interacts with the environment and learns to function within it.” The way a robotic dog learns to move its limbs is a demonstration of reinforcement learning.
●● It is a fundamental component of artificial intelligence, and the idea of reinforcement learning underlies all AI agents. In this case there is no need to pre-program the agent, because it learns on its own without assistance from humans. A minimal sketch of this trial-and-error loop is given below.
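The following is a minimal, self-contained Python sketch of this trial-and-error loop. The two-armed "environment" and its payout probabilities are invented for illustration; only the pattern of acting, receiving a reward, and updating an estimate reflects the ideas above.

import random

payout_prob = [0.3, 0.7]   # hypothetical chance each arm pays a reward of 1
q = [0.0, 0.0]             # running estimate of each arm's value
counts = [0, 0]
epsilon = 0.1              # exploration rate

for step in range(10000):
    # explore with probability epsilon, otherwise exploit the best estimate
    if random.random() < epsilon:
        a = random.randrange(2)
    else:
        a = 0 if q[0] >= q[1] else 1
    reward = 1.0 if random.random() < payout_prob[a] else 0.0
    counts[a] += 1
    q[a] += (reward - q[a]) / counts[a]   # incremental average of rewards

print(q)   # the estimate for the second arm should approach 0.7

With enough steps, the value estimate for the better arm approaches its true payout probability, and the agent comes to exploit it most of the time.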
Evolutionary methods, for example, generate numerous agents, each of which employs a distinct policy for interacting with its surroundings, and then select the agents that are able to obtain the most reward. These techniques are referred to as evolutionary because they operate similarly to how biological evolution creates animals with complex behaviour even when those animals do not acquire new skills over their lifetimes.
Evolutionary approaches can be successful if the space of policies is sufficiently small, or can be organised so that good policies are common or easy to locate, or if a lot of time is available for the search. Furthermore, evolutionary approaches are advantageous for problems in which the learning agent is unable to precisely assess the state of its surroundings.
Evolutionary methods have in fact been used to address reinforcement learning problems. Without ever referring to value functions, techniques like genetic algorithms, genetic programming, simulated annealing, and other optimisation techniques have been applied to reinforcement learning challenges.
We do, however, include several approaches that, like evolutionary approaches, do not use value functions. These techniques look for policies in policy spaces defined by numerical parameters; they estimate the directions in which the parameters should be adjusted, and they produce these estimates while the agent is interacting with its environment, allowing them to benefit from the specifics of each behavioural encounter. These techniques, sometimes known as policy gradient approaches, have been effective in solving a variety of issues, and some of the most straightforward reinforcement learning techniques fall within this group.
In fact, some of these techniques use estimates of the value function to enhance
their estimations of the gradient. Overall, there isn’t a clear line separating policy
gradient methods from the other techniques we count as reinforcement learning
techniques.
Agent(): An entity with the capacity to perceive, explore, and react to its surroundings.
Action(): The steps an agent takes inside the environment are referred to as
actions.
State(): Following each action done by the agent, the environment will return a
state.
Beyond the agent and the environment, a reinforcement learning system primarily
consists of four subelements: a policy, a reward signal, a value function, and, if present,
a model of the environment.
Policy: The learning agent’s behaviour at a specific time is defined by a policy. A
policy, broadly speaking, is a mapping from perceived environmental states to actions
to be taken when those states are present. It corresponds to what would be referred
to as a collection of stimulus-response rules or associations in psychology (provided
that stimuli include those that can come from within the animal). The policy may
be as straightforward as a function or lookup table in some circumstances, while in
others it may need complex computations like a search procedure. Insofar as it can
control behaviour on its own, the policy serves as the fundamental component of a
reinforcement learning agent. In general, policies may be stochastic.
Reward Signal: A reward signal defines the goal of a reinforcement learning problem. On each time step, the environment sends the learning agent a single number called the reward. The agent's only goal is to maximise its overall
return over the long term. Thus, the reward signal identifies the positive and negative
occurrences for the agent.
The process that generates the reward signal cannot be changed by the agent. Only the agent's actions, which may affect the reward directly, or indirectly by affecting the state of the environment, can have an impact on the reward signal.
For example, depending on how hungry Phil is, how he is feeling, and other features of his body, which are part of his internal reinforcement learning agent's environment, the reinforcement learning agent directing his behaviour may receive different reward signals when he eats his breakfast. The
reward signal serves as the main justification for changing the policy.
The policy may be amended to choose a different action in that circumstance in the
future if the policy’s chosen action is followed by a low reward. Reward signals could, in
general, be stochastic functions of the environment’s state and the actions conducted.
Value Function: A value function specifies what is beneficial over the long term, whereas the reward signal only reveals what is good in the here and now. A state's value can be thought of as the total amount of reward that an agent can expect to accumulate over
the course of the future, starting from that state. Values show the long-term desirability
of states after taking into consideration the states that are expected to follow and the
rewards available in those states, whereas rewards determine the immediate, intrinsic
attraction of environmental conditions.
For instance, even though a state constantly produces a small immediate reward,
it nonetheless has a high value since it is frequently followed by states that produce
large rewards. The opposite may also be true. To use a human comparison, rewards
resemble pleasure (if high) and pain (if low), whereas values reflect a more complex
and long-term assessment of how happy or unhappy we are with the current state
of our surroundings. With this explanation, we think it is evident that value functions
formalise a fundamental and well-known concept.
Rewards are in a sense primary, whereas values, as predictions of rewards, are secondary. Without rewards there could be no values, and the only purpose of assessing values is to obtain more reward.
decisions, we are mainly concerned with values. Decisions about what to do are based
on value assessments. Not the highest reward, but the highest value states are what
we aim to achieve because they will ultimately yield the greatest rewards for us.
The derived quantity known as value is the one we are most concerned with when making decisions and planning. Unfortunately, determining values is far more difficult than determining rewards.
Although values must be evaluated and re-estimated from the sequences of
observations an agent makes throughout its whole lifespan, rewards are essentially
delivered directly by the environment. In fact, a technique for effectively estimating
values is the most crucial element of practically all reinforcement learning algorithms we
study. The most significant lesson we have learned about reinforcement learning over
the past few decades is likely the central importance of value estimation.
Model of the Environment: A model of the environment serves as the fourth and final element of some reinforcement learning systems. A model mimics the behaviour of the environment: for example, given a particular state and action, the model might forecast the subsequent state and reward.
Models are used for planning, which we define as any method of selecting a course of action by taking into account potential future circumstances before they actually occur.
The learner and decision maker is called the agent, and the thing it interacts with, comprising everything outside the agent, is called the environment. The agent chooses actions, and the environment reacts to those actions by presenting the agent with new circumstances.
The environment also produces rewards, special numerical values that the agent attempts to maximise over time. A thorough specification of an environment defines a task, one instance of the reinforcement learning problem.
More particularly, the agent and environment interact at each of a sequence of discrete time steps, t = 0, 1, 2, 3, .... At each time step t, the agent receives some representation of the environment's state, St ∈ S, where S is the set of possible states, and on that basis chooses an action, At ∈ A(St), where A(St) is the set of actions available in state St. One time step later, partly as a result of its action, the agent finds itself in a new state, St+1, and receives a numerical reward, Rt+1 ∈ R ⊂ ℝ. The agent–environment interaction is depicted in the figure below.
Figure: The agent–environment interaction in reinforcement learning
At each time step, the agent implements a mapping from states to probabilities of selecting each possible action. This mapping is called the agent's policy and is denoted πt, where πt(a|s) is the probability that At = a given St = s. Reinforcement learning methods specify how the agent changes its policy as a result of its experience. Roughly speaking, the agent's goal is to maximise the total amount of reward it receives over the long run.
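The interaction loop just described can be sketched in Python as follows. The RandomWalkEnv class and its step method are hypothetical stand-ins, invented here only to show the St, At, Rt+1 cycle; they are not from the original text.

import random

class RandomWalkEnv:
    # A toy five-state random walk, invented for illustration.
    def reset(self):
        self.s = 2                       # start in the middle state
        return self.s
    def step(self, action):              # action: -1 (left) or +1 (right)
        self.s += action
        done = self.s in (0, 4)          # terminal states at both ends
        reward = 1.0 if self.s == 4 else 0.0
        return self.s, reward, done

def policy(state):
    # pi_t(a|s): here a fixed stochastic policy, 50/50 left or right
    return random.choice([-1, +1])

env = RandomWalkEnv()
s = env.reset()
total_reward, done = 0.0, False
while not done:                          # one episode of S_t, A_t, R_{t+1}
    a = policy(s)
    s, r, done = env.step(a)
    total_reward += r
print(total_reward)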
The environment's state serves as a signal from the environment that the agent considers when using the reinforcement learning framework. This section explains the requirements for the state signal as well as the kind of information we should and should not anticipate from it. In particular, we formally define the Markov property.
In this book, “the state” refers to any information that the agent has access to.
We presume that some preprocessing system that is ostensibly a component of the
environment provides the state. In this book, we do not discuss how to build, modify,
or learn the state signal. We use this strategy in order to completely concentrate on
the issues surrounding decision-making, not because we think state representation
is unimportant. In other words, choosing what to do based on the state signal that is
available, rather than designing the state signal, is our primary concern.
The state signal can contain a lot more information than just immediate
experiences like sensory data. State representations might be elaborate structures that
have been constructed over time from a sequence of sensations, or they can be highly
processed reproductions of the initial sensations.
In order to create a rich and realistic representation of a scene, we can move our
eyes over it while only allowing the fovea—a tiny region in the centre of the retina—to
be seen in detail at any given time. Or, more plainly, we can glance at something, then
turn our gaze elsewhere, and still be able to see it. Depending on the question that was
asked earlier and is no longer audible, we can hear the word “yes” and believe that we
are in completely different states.
On a more common level, a control system can monitor position twice and provide
a state representation that includes velocity information. In each of these situations, the
state is created and maintained based on the current state as well as the previous state
or another recollection of earlier feelings.
There is no reason to limit the state representation to the agent’s current feelings;
in most cases, the state representation should be able to provide the agent with more
information.
On the other hand, the agent shouldn’t be expected to receive all of the information
that would be relevant to its decision-making via the state signal. We shouldn’t
anticipate the agent to be aware of the upcoming card in the deck if it is playing
blackjack.
We shouldn’t anticipate the caller’s identity being known to the agent when it
answers the phone. We shouldn’t expect a paramedic who has been dispatched to a
car accident to know the interior injuries of a victim who is unconscious right away.
In each of these scenarios, the environment contains concealed state information
that would be helpful if the agent knew about it, but the agent is unable to do so
because it has never experienced any pertinent sensations. In other words, we only
criticise an agent for knowing something and then forgetting it, not for not knowing
something that matters.
A state signal that condenses previous sensations while still retaining all pertinent information is what we are ideally looking for. Normally, this requires more than the current sensations, but never more than the entire history of all previous sensations. A state signal that succeeds in retaining all relevant information is said to be Markov, or to have the Markov property.
A checkers position, which is the current arrangement of all the pieces on the
board, would function as an example of a Markov state since it condenses all of the
significant information from the entire series of positions that led to it. While much of the
information about the sequence is destroyed, everything that is actually important for
the game’s future is still there.
A cannonball’s future trajectory depends only on its present position and velocity.
How that position and velocity got to be is irrelevant. Because everything is contained in
the current state signal and has no bearing on the “path,” or history, of previous signals,
this attribute is also frequently referred to as having “independence of path.”
We now formally define the Markov property for the reinforcement learning problem. Here, we assume that there are a finite number of states and reward values in order to simplify the mathematics. This enables us to work in terms of sums and probabilities rather than integrals and probability densities, and the argument can easily be extended to include continuous states and rewards.
Consider how a general environment might respond at time t + 1 to the action taken at time t. In the most general, causal case, this response may depend on everything that has happened earlier. In that case the dynamics can be defined only by specifying the complete probability distribution:

Pr{Rt+1 = r, St+1 = s′ | S0, A0, R1, ..., St−1, At−1, Rt, St, At}

for all r, s′, and all possible values of the past events: S0, A0, R1, ..., St−1, At−1, Rt, St, At. If the state signal has the Markov property, on the other hand, then the environment's response at t + 1 depends only on the state and action representations at t, in which case the environment's dynamics can be defined by specifying only

p(s′, r | s, a) = Pr{Rt+1 = r, St+1 = s′ | St = s, At = a}

for all r, s′, St, and At. In other words, a state signal has the Markov property, and is a Markov state, if and only if the second expression equals the first for all s′, r, and histories S0, A0, R1, ..., St−1, At−1, Rt, St, At. In this case, the environment and task as a whole are also said to have the Markov property.
If an environment possesses the Markov property, then we can use the current
state and action to forecast the next state and anticipated reward. By repeating this
equation, it is possible to demonstrate that it is possible to forecast all future states
and expected rewards using only knowledge of the current state, which is equivalent to
using knowledge of the entire past up to the present.
Therefore, Markov states offer the best possible foundation for selecting actions. In other words, the best strategy for selecting actions based on a Markov state is just as effective as the best strategy for selecting actions based on entire histories.
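One way to see what the Markov property buys us computationally is that the one-step dynamics p(s′, r | s, a) can be stored keyed only by the current state and action, with no history required. The tiny dynamics table below, including its state names and numbers, is invented purely for illustration.

# p(s', r | s, a) stored as {(s, a): [(next_state, reward, probability), ...]}
dynamics = {
    ("standing", "walk"): [("moving", 0.0, 0.9), ("fallen", -1.0, 0.1)],
    ("moving",   "walk"): [("moving", 1.0, 0.8), ("fallen", -1.0, 0.2)],
}

def expected_reward(s, a):
    # Because the dynamics depend only on (s, a), no history is needed.
    return sum(prob * r for (_, r, prob) in dynamics[(s, a)])

print(expected_reward("standing", "walk"))   # 0.9*0.0 + 0.1*(-1.0) = -0.1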
In particular, we want the state to provide a good basis for predicting future rewards and choosing actions. We also want the state to be a good basis for predicting subsequent states.
To the extent that the state approaches the capability of Markov states in these respects, one will obtain better performance from reinforcement learning systems. For all of these reasons, it is helpful to think of the state at each time step as an approximation to a Markov state, even though this may not strictly be the case.
The Markov property is crucial in reinforcement learning because decisions and values are assumed to be a function only of the current state. For these to be effective and informative, the state representation must itself be informative.
The entire body of theory in this book is predicated on Markov state signals. In
circumstances when the Markov property does not exactly apply, this means that not
all of the theory applies. Although the algorithms can be effectively applied to several
jobs with states that are not precisely Markov, the theory created for the Markov case
continues to be useful in understanding how the algorithms behave.
To expand the theory of the Markov case to the more intricate and realistic
non-Markov situation, a thorough comprehension of the Markov case theory is a
prerequisite. Finally, we point out that the assumption of Markov state representations
is not specific to reinforcement learning but is found in the majority, if not all, of artificial
intelligence methods.
Example: Pole-Balancing State: In the pole-balancing task described earlier, a state signal would be Markov if it specified exactly, or made it possible to reconstruct exactly, the position and velocity of the cart along the track, the angle between the cart and the pole, and the rate at which this angle is changing (the angular velocity).
Such a state would be Markov because it would summarise everything needed to predict the system's future behaviour and choose optimal actions. However, in reality, it is impossible to know this information precisely, because any genuine sensor would distort and delay its measurements in some way.
In addition, there are always additional factors that have a small impact on the
behaviour of a genuine cart–pole system, such as the pole’s bending, the temperatures
of the wheel and pole bearings, and different types of backlash. If the state signal
consisted just of the positions and speeds of the cart and the pole, then these variables
would lead to violations of the Markov property.
ni
The positions and velocities, however, often serve quite well as states. A coarse state signal that divided cart positions into right, left, and centre regions (along with similarly rough quantizations of the other three intrinsic state variables) was used in several early studies of learning to solve the pole-balancing task.
Although such a coarse state signal is markedly non-Markov, it may well have sped up learning by making the learning agent disregard subtle distinctions that weren't relevant to the task at hand.
Example: Draw Poker: Each player receives a hand of five cards when playing draw poker. After a round of betting in which each player may swap some of his cards for new ones, there is a second round of betting.
In each round, every participant must match the highest wager placed by the other players or drop out (fold). After the second round of betting, the player with the best remaining hand wins all of the bets.
Each player in draw poker receives a different state signal. Every player is aware of the cards in his own hand, but can only guess at the other players' hands. It's a common misconception to believe that a Markov state signal should include the cards in the other players' hands and the cards remaining in the deck.
However, in a fair game, we suppose that the participants are inherently unable
to infer these details from prior experience. A player could predict some future events
(such as the cards one could trade for) better if she knew them than if she simply
remembered all of the previous observations.
In addition to being aware of one's own cards, draw poker players must also be aware of their opponents' bets and the number of cards they have drawn. If one of the
other players drew three additional cards, for instance, you might assume he kept a pair
and change your estimation of the strength of his hand to reflect this. Your evaluation of
the hands of the players is also influenced by their bets.
In reality, a large portion of your past interactions with these specific players
are included in the Markov state. Does Ellen prefer to play it safe or does she enjoy
bluffing? Does her expression or manner convey anything about how powerful her hand
is? When it is late at night or when he has already made a sizable profit, how does
Joe’s strategy change?
Although everything that has ever been noticed about the other players may have
an impact on the chances that they are holding different types of hands, in actuality
there is far too much information to remember and analyse, and most of it won’t have a
discernible influence on one’s predictions or actions.
Very good poker players are skilled at retaining only the most important details and at swiftly sizing up new opponents, yet nobody can recall everything that is pertinent. As a result, people's state representations are probably non-Markov, and the decisions they make in poker are probably also imperfect.
The sole difference between the Bellman Optimality Equation and the Bellman Expectation Equation is that instead of taking the average of the values of the actions, our agent chooses the action with the maximum value.
Assume that our agent is in state S and that it has two possible actions from that state. In contrast to the Bellman Expectation Equation, when we examine the action-values for each of the actions, our agent chooses the action with the highest q* value. That gives us the value of being in state S.
This can be written mathematically as:

v*(s) = max_a q*(s, a)
Let’s define the Bellman Optimality Equation for State-Action Value Function in a
similar manner (Q-Function).
Let’s examine the Backup Diagram for the Q-Function State-Action Value Function:
Figure: Backup Diagram for State-Action Value Function
Let's say that our agent has performed action a in state s. The environment might send us to any of several states, depending on its dynamics. Note that we still take the average over the values of the successor states; the difference is that in the Bellman Optimality Equation we know the optimal values of each state, whereas with the Bellman Expectation Equation we knew only the policy-dependent values of the states.
Assume that our agent is in state s and takes action a, the probability of that action being determined by the policy. As a result of action a, the environment may send the agent to any of several states s′, with probabilities determined by the environment. To determine the value of state s, we simply take the average of the optimal values of the states s′, weighted by those probabilities. This gives the value of being in state s.
v*(s) = max_a Σ_{s′, r} p(s′, r | s, a) [ r + γ v*(s′) ]
We are maximising over the actions the agent can take in the upper arcs, hence the max in the equation. This equation demonstrates how the v* function can be related to itself.
The Bellman Optimality Equation for the State-Action Value Function, q*(s,a), is as
follows:
q*(s, a) = Σ_{s′, r} p(s′, r | s, a) [ r + γ max_{a′} q*(s′, a′) ]
Figure: Backup Diagram for State-Action Value Function
Imagine that while in state s, our agent took some action a. The environment may send our agent to any of the states s′ as a result of that action, and from each of these states we maximise over the actions our agent can perform by selecting the action with the highest q* value. Backing that up to the top gives the value of action a.
Figure: Example for Bellman Optimality Equation
Look at the red arrows. Assume we want to find the value of the state with value 6 (in red). As we can see, choosing Facebook earns us a reward of −1 and choosing Study earns us a reward of −2. Since the other two states have known optimal values, we apply the Bellman Optimality Equation for the state-value function to the red state: average over successor states for each action, then maximise over the actions (choose the one that gives the maximum value).

We can therefore deduce from the diagram that choosing Facebook results in a value of 5 for our red state and choosing Study results in a value of 6; maximising over the two, the result is 6 for our red state.
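The calculation for the red state can be checked numerically. Reading the diagram, the Facebook successor is taken to have optimal value 6 and the Study successor optimal value 8; these two numbers are our inference from the stated results of 5 and 6, since the figure itself is not reproduced here.

# One-step Bellman optimality backup for the state marked in red.
r_facebook, v_facebook_successor = -1.0, 6.0   # assumed from the diagram
r_study,    v_study_successor    = -2.0, 8.0   # assumed from the diagram
gamma = 1.0                                    # no discounting in this example

candidates = {
    "facebook": r_facebook + gamma * v_facebook_successor,  # -1 + 6 = 5
    "study":    r_study    + gamma * v_study_successor,     # -2 + 8 = 6
}
v_red = max(candidates.values())
print(candidates, v_red)   # {'facebook': 5.0, 'study': 6.0} 6.0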
Dynamic Programming
The term dynamic programming (DP) refers to a collection of algorithms that can be used to compute optimal policies given a perfect model of the environment as a Markov decision process (MDP). Classical DP algorithms are of limited use in reinforcement learning, both because they assume a perfect model and because of their great computational expense, although they are still significant conceptually. DP provides an essential foundation for understanding the methods presented in the next chapters of this book. In actuality, all of these approaches can be seen as attempts to achieve roughly the same result as DP, albeit with less computational effort and without relying on an ideal environment model.
We usually assume that the environment is a finite MDP. That is, we assume that its state, action, and reward sets, S, A, and R, are finite, and that its dynamics are given by a set of probabilities p(s′, r | s, a), for all s ∈ S, a ∈ A(s), r ∈ R, and s′ ∈ S+ (S+ is S plus a terminal state if the problem is episodic).
Although continuous state and action space problems can benefit from the use
of DP concepts, accurate solutions are only feasible in certain circumstances. Applying
finite-state DP methods after quantizing the state and action spaces is a typical
technique for getting approximations for problems with continuous states and actions.
The use of value functions to organise and structure the search for good policies is the key concept of DP and of reinforcement learning in general. In this chapter, we demonstrate how to compute value functions using DP. As discussed earlier, once we have found the optimal value functions, v* or q*, which satisfy the Bellman optimality equations, we can easily obtain optimal policies:
v*(s) = max_a E[ Rt+1 + γ v*(St+1) | St = s, At = a ]
      = max_a Σ_{s′, r} p(s′, r | s, a) [ r + γ v*(s′) ]

q*(s, a) = E[ Rt+1 + γ max_{a′} q*(St+1, a′) | St = s, At = a ]
         = Σ_{s′, r} p(s′, r | s, a) [ r + γ max_{a′} q*(s′, a′) ]
for all s ∈ S, a ∈ A(s), and s′ ∈ S+. As we shall see, Bellman equations like these are converted into assignments, that is, into update rules for improving approximations of the desired value functions.
Policy Evaluation
We start by thinking about how to calculate the state-value function vπ for
an arbitrary given policy π. In the DP literature this is referred to as policy evaluation; it is also known as the prediction problem.
vπ(s) = Eπ[ Gt | St = s ] = Eπ[ Σ_{k=0}^{∞} γ^k Rt+k+1 | St = s ]
      = Σ_a π(a|s) Σ_{s′, r} p(s′, r | s, a) [ r + γ vπ(s′) ]
where the expectations are subscripted by π to show that they depend on the assumption that policy π is followed, and π(a|s) denotes the probability of taking action a in state s under policy π. The existence and uniqueness of vπ are guaranteed as long as either γ < 1 or eventual termination from all states is assured under the policy.
If the dynamics of the environment are fully known, then the Bellman equation for vπ is a system of |S| simultaneous linear equations in |S| unknowns (the vπ(s), s ∈ S). Its solution is, in principle, a straightforward if tedious computation. Iterative solution techniques are better suited to our purposes. Consider a sequence of approximate value
functions, each mapping S+ to ℝ: v0, v1, v2, .... The initial approximation, v0, is chosen arbitrarily, except that the terminal state, if any, must be given a value of 0, and each successive approximation is obtained by using the Bellman equation for vπ as an update rule:
vk+1(s) = Σ_a π(a|s) Σ_{s′, r} p(s′, r | s, a) [ r + γ vk(s′) ]
for all s ∈ S. It is clear that vk = vπ is a fixed point for this update rule, because the Bellman equation for vπ guarantees equality in this case. Indeed, under the same conditions that guarantee the existence of vπ, it can be shown that the sequence {vk} converges to vπ as k → ∞. This algorithm is called iterative policy evaluation.
Iterative policy evaluation applies the same operation to each state s to produce
each subsequent approximation, vk+1 from vk: it replaces the old value of s with a new
value obtained from the old values of the successor states of s, as well as the expected
immediate rewards, along all the one-step transitions possible under the policy being
evaluated. This operation is referred to as a full backup. Each round of iterative policy evaluation backs up the value of every state once to produce the new approximate value function vk+1.
Depending on whether a state (as here) or a state–action pair is being backed up, and on exactly how the estimated values of the successor states are combined, there are various types of full backups. All backups done in DP algorithms are called full backups because they are based on all potential next states rather than on a sample next state.
You would need to employ two arrays: one for the previous values, vk(s), and one
for the new values, vk+1(s), in order to construct a sequential computer programme to
accomplish iterative policy evaluation as stated in the equation above. In this manner,
the old values can be used to calculate the new values one at a time without affecting
the old values. Of course, updating the values “in place,” or with each new backed-up
value immediately replacing the previous one, is simpler when using a single array.
Then, depending on the order in which the states are backed up, new values may
occasionally be utilised on the right-hand side of the aforementioned equation in place
of old ones. As you might assume given that it uses fresh data as soon as they become
available, this slightly different algorithm also converges to vπ; in fact, it typically does
so faster than the two-array variant. We think of the backups as being carried out in a sweep through the state space. For the in-place algorithm, the rate of convergence depends significantly on the order in which the states are backed up during the sweep.
When we consider DP algorithms, we typically have the in-place variant in mind.
Figure: Iterative policy evaluation
A further implementation point is the algorithm's termination. Formally, iterative policy evaluation converges only in the limit, so in practice it must be stopped short of this. A typical stopping condition is to test the quantity max_{s∈S} |vk+1(s) − vk(s)| after each sweep and stop when it is sufficiently small. The complete procedure for iterative policy evaluation with this stopping criterion is shown in the figure above.
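A compact Python sketch of iterative policy evaluation, in the in-place variant with the stopping test just described, is given below. The two-state MDP (its states, dynamics table, and policy) is a made-up example, not one from the text.

# Iterative policy evaluation (in-place variant) on a small invented MDP.
states = ["A", "B"]
actions = {"A": ["stay", "go"], "B": ["stay", "go"]}
# p[(s, a)] -> list of (next_state, reward, probability)
p = {
    ("A", "stay"): [("A", 0.0, 1.0)],
    ("A", "go"):   [("B", 1.0, 1.0)],
    ("B", "stay"): [("B", 2.0, 1.0)],
    ("B", "go"):   [("A", 0.0, 1.0)],
}
pi = {("A", "stay"): 0.5, ("A", "go"): 0.5,
      ("B", "stay"): 0.5, ("B", "go"): 0.5}   # pi(a|s)
gamma, theta = 0.9, 1e-8

V = {s: 0.0 for s in states}
while True:
    delta = 0.0
    for s in states:
        v_old = V[s]
        # Bellman equation for v_pi used as an update rule
        V[s] = sum(pi[(s, a)] *
                   sum(prob * (r + gamma * V[s2]) for (s2, r, prob) in p[(s, a)])
                   for a in actions[s])
        delta = max(delta, abs(v_old - V[s]))
    if delta < theta:          # stop when the largest change is tiny
        break
print(V)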
Policy Improvement
To aid in the development of better policies, we compute the value function for a policy. Suppose we have determined the value function vπ for an arbitrary deterministic policy π. For some state s we would like to know whether or not we should change the policy to deterministically choose an action a ≠ π(s). We know how good it is to follow the current policy from s, that is, vπ(s), but would it be better or worse to change to the new policy? One way to answer this question is to consider selecting a in s and thereafter following the existing policy, π. The value of this way of behaving is:

qπ(s, a) = E[ Rt+1 + γ vπ(St+1) | St = s, At = a ]
         = Σ_{s′, r} p(s′, r | s, a) [ r + γ vπ(s′) ]
The key criterion is whether this is greater than or less than vπ(s). If it is greater, that is, if it is better to select a once in s and thereafter follow π than it would be to follow π all the time, then one would expect it to be better still to select a every time s is encountered, and that the new policy would in fact be a better one overall.
That this is true is a special case of a general result called the policy improvement theorem. Let π and π′ be any pair of deterministic policies such that, for all s ∈ S,

qπ(s, π′(s)) ≥ vπ(s).
Then the policy π′ must be as good as, or better than, π. That is, it must obtain greater or equal expected return from every state s ∈ S:

vπ′(s) ≥ vπ(s).
Moreover, if there is strict inequality of the first of these relations at any state, then there must be strict inequality of the second at at least one state. This result applies in particular to the two policies we considered above: the original deterministic policy, π, and the modified policy, π′, which is identical to π except that π′(s) = a ≠ π(s). Evidently, the first relation holds trivially at all states other than s. Thus, if qπ(s, a) > vπ(s), then the modified policy is indeed better than π.
The idea behind the proof of the policy improvement theorem is easy to comprehend. Starting from the first relation, we keep expanding the qπ side and reapplying the inequality until we obtain vπ′(s):

vπ(s) ≤ qπ(s, π′(s))
      = Eπ′[ Rt+1 + γ vπ(St+1) | St = s ]
      ≤ Eπ′[ Rt+1 + γ qπ(St+1, π′(St+1)) | St = s ]
      ≤ Eπ′[ Rt+1 + γ Rt+2 + γ² qπ(St+2, π′(St+2)) | St = s ]
      ...
      ≤ Eπ′[ Rt+1 + γ Rt+2 + γ² Rt+3 + ⋯ | St = s ]
      = vπ′(s)
We have thus far seen how, given a policy and its value function, it is easy to evaluate a change in the policy at a single state to a particular action. It is a natural extension to consider changes at all states and to all possible actions, selecting at each state the action that appears best according to qπ(s, a). In other words, consider the new greedy policy, π′, given by

π′(s) = argmax_a qπ(s, a)
      = argmax_a Σ_{s′, r} p(s′, r | s, a) [ r + γ vπ(s′) ]
where argmax_a denotes the value of a at which the expression that follows is maximised (with ties broken arbitrarily). The greedy policy takes the action that looks best in the short term, after one step of lookahead, according to vπ. By construction, the greedy policy meets the conditions of the policy improvement theorem, so it is as good as, or better than, the original policy.
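A sketch of this greedy-policy construction is shown below. It reuses the states, actions, p, and gamma tables and the value function V from the policy-evaluation sketch above; those names are assumptions carried over from that example.

def greedy_policy(V, states, actions, p, gamma):
    # pi'(s) = argmax_a sum over (s', r) of p(s', r | s, a) * (r + gamma * V[s'])
    pi_new = {}
    for s in states:
        best_a, best_q = None, float("-inf")
        for a in actions[s]:
            q = sum(prob * (r + gamma * V[s2]) for (s2, r, prob) in p[(s, a)])
            if q > best_q:
                best_a, best_q = a, q
        pi_new[s] = best_a            # ties broken by first action found
    return pi_new

print(greedy_policy(V, states, actions, p, gamma))   # e.g. {'A': 'go', 'B': 'stay'}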
Policy Iteration
Once a policy, π, has been improved using vπ to yield a better policy, π′, we can then compute vπ′ and improve it again to yield an even better π′′. We can thus obtain a sequence of monotonically improving policies and value functions:

π0 →E vπ0 →I π1 →E vπ1 →I π2 →E ⋯ →I π* →E v*

where →E denotes a policy evaluation and →I denotes a policy improvement.
Each new policy is guaranteed to be a strict improvement over the previous one (unless the previous one is already optimal). Because a finite MDP has only a finite number of policies, this process must converge to an optimal policy and an optimal value function in a finite number of iterations.
This way of finding an optimal policy is called policy iteration. A complete algorithm is shown in the figure below. Note that each policy evaluation, itself an iterative computation, is started with the value function for the previous policy. This typically results in a great increase in the speed of convergence of policy evaluation (presumably because the value function changes little from one policy to the next).
Figure: Policy iteration (using iterative policy evaluation) for v*. This algorithm has a subtle flaw: it may never terminate if the policy continually switches between two or more policies that are equally good. The bug can be fixed by adding extra flags, but doing so would make the pseudocode uglier.
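A rough Python rendering of this alternation is given below. It is a sketch under the same assumptions as the earlier examples: it reuses the states, actions, p, and gamma tables and the greedy_policy helper defined above, and uses a deterministic-policy version of iterative policy evaluation.

def evaluate(pi, V, theta=1e-8):
    # iterative policy evaluation for a deterministic policy pi
    while True:
        delta = 0.0
        for s in states:
            v_old = V[s]
            a = pi[s]
            V[s] = sum(prob * (r + gamma * V[s2]) for (s2, r, prob) in p[(s, a)])
            delta = max(delta, abs(v_old - V[s]))
        if delta < theta:
            return V

pi = {s: actions[s][0] for s in states}        # arbitrary initial policy
V = {s: 0.0 for s in states}
while True:
    V = evaluate(pi, V)                        # policy evaluation
    improved = greedy_policy(V, states, actions, p, gamma)   # policy improvement
    if improved == pi:                         # policy stable, hence optimal
        break
    pi = improved
print(pi, V)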
Asynchronous Dynamic Programming
A significant drawback of the DP methods we have discussed so far is that they involve operations over the entire state set of the MDP, that is, sweeps of the state set. If the state set is very large, even a single sweep can be prohibitively expensive. Backgammon, for example, has more than 10^20 states. Even if we could perform the value iteration backup on a million states per second, it would take more than a thousand years to complete a single sweep.
In-place iterative DP algorithms that are not organised around systematic sweeps of the state set are known as asynchronous DP algorithms. These algorithms back up the values of states in any order whatsoever, using whatever values of other states happen to be available. The values of some states may be backed up several times before the values of others are backed up once. To converge correctly, however, an asynchronous algorithm cannot ignore any state after some point in the computation; it must continue to back up the values of all the states. Asynchronous DP algorithms allow great flexibility in selecting the states to which backup operations are applied.
To illustrate, one version of asynchronous value iteration uses the value iteration backup to update the value of only one state, sk, on each step k. If 0 ≤ γ < 1, asymptotic convergence to v* is assured given only that all states occur in the sequence {sk} an infinite number of times (the sequence could even be stochastic).
(In the undiscounted episodic case, it is possible that some orderings of backups do not lead to convergence, but it is generally straightforward to avoid these.) Similarly, it is possible to intermix policy evaluation and value iteration backups to produce a kind of asynchronous truncated policy iteration. It is clear that a few distinct backups form building blocks that can be employed flexibly in a wide range of sweepless DP algorithms, even though the details of these and other more unusual DP algorithms are beyond the scope of this discussion.
Naturally, avoiding sweeps does not automatically imply that we can use
less computation. It simply means that an algorithm does not need to get locked into a hopelessly long sweep before it can make progress in improving a policy. We may try to take advantage of this flexibility by selecting the states to which backups are applied, so as to speed up the algorithm's progress. We can try to order the backups so that value information spreads effectively from state to state. Some states may not need their values backed up as often as others. If some states are not important for optimal behaviour, we might even try to skip backing them up entirely.
Asynchronous algorithms also make it easier to intermix computation with real-time interaction. The agent's experience can be used to determine the states to which the DP algorithm applies its backups, while the agent's decision-making can be guided by the most recent value and policy information obtained by the DP algorithm. For example, we could apply backups to states as the agent visits them. This makes it possible to focus the DP algorithm's backups on the parts of the state set that are most relevant to the agent. This kind of focusing is a recurring theme in reinforcement learning.
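The sketch below illustrates the asynchronous idea on the same made-up two-state MDP used earlier; it assumes the states, actions, p, and gamma tables from the previous sketches. States are backed up one at a time, in random order, in place, with no sweeps.

import random

V = {s: 0.0 for s in states}
for k in range(20000):
    s = random.choice(states)        # pick any state; no systematic sweep
    # in-place value iteration backup of this single state
    V[s] = max(
        sum(prob * (r + gamma * V[s2]) for (s2, r, prob) in p[(s, a)])
        for a in actions[s])
print(V)    # converges because every state keeps being selected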
Monte Carlo
Only experience—sample sequences of states, actions, and rewards from actual
or simulated contact with an environment—is required for Monte Carlo algorithms.
Learning from actual experience is remarkable in that it requires no prior understanding
of the dynamics of the environment while still allowing for optimal conduct. Simulated
experience can also help you learn. Although a model is necessary, it just needs
to provide sample transitions, rather than the whole probability distributions of all
potential transitions that dynamic programming requires (DP). In a surprising number of
circumstances, generating experience sampled according to the appropriate probability
distributions is simple, but obtaining the distributions in explicit form is impossible.
Monte Carlo methods are based on averaging sample returns and are used to
solve the reinforcement learning problem. We specify Monte Carlo methods only for
episodic jobs to ensure that well-defined returns are provided. That is, we presume
that experience is divided into episodes, and that no matter what actions are taken, all
episodes will eventually end. Value estimations and policies are only updated once an
episode is completed. In an episode-by-episode sense, Monte Carlo approaches can
be incremental, but not in a step-by-step (online) sense. The word “Monte Carlo” is
typically used to refer to any estimating method that has a strong random component
in its operation. Here we use it specifically for methods based on averaging complete returns.
Like bandit approaches, Monte Carlo methods sample and average returns for
each state–action pair. The key distinction is that there are now numerous states,
each of which acts as a different bandit problem (such as an associative-search or
contextual bandit), and the bandit problems are interconnected. That is, the return after taking an action in one state depends on the actions taken in later states of the same episode. Because all of the action selections are undergoing learning, the problem becomes nonstationary from the point of view of the earlier state.
We modify the general policy iteration (GPI) concept introduced for DP to deal
with nonstationarity. Rather than computing value functions based on MDP knowledge,
we learn value functions from sample returns using the MDP. The value functions and
policies that correspond to them still interact in essentially the same way to achieve
optimality (GPI). We start with the prediction problem (computation of vπ and qπ for
a fixed arbitrary policyπ), then go on to policy improvement, and finally the control
problem and its solution using GPI, as we did in the DP chapter. Each of these
concepts from DP is applied to the Monte Carlo example, where only sample data is
provided.
Averaging the returns observed after visits to a state is an easy way to estimate its value from experience. As more returns are recorded, the average should converge to the expected value. This idea underlies all Monte Carlo methods.
Consider the case where we want to estimate vπ(s), the value of a state s under
policy π, given a set of episodes obtained by following π and passing through s. A visit to s is each occurrence of state s in an episode. Of course, s may be visited multiple times in a single episode; call the first occurrence of s in an episode the first visit to s. The first-visit MC method estimates vπ(s) as the average of the returns following first visits to s, whereas the every-visit MC method averages the returns following all visits to s. These two Monte Carlo (MC) methods are quite similar,
although their theoretical properties differ slightly.
The most commonly studied MC is first-visit MC, which has been studied since the
1940s and is the subject of this chapter. Function approximation and eligibility traces
are more naturally extended by every-visit MC. Figure below depicts first-visit MC in
procedural form.
Figure: The first-visit MC method for estimating vπ.
Both first-visit MC and every-visit MC converge to vπ(s) as the number of visits (or first visits) to s goes to infinity. This is easy to see for first-visit MC. In this case, each return is an independent, identically distributed estimate of vπ(s) with finite variance. By the law of large numbers, the sequence of averages of these estimates converges to their expected value. Each average is itself an unbiased estimate, and the standard deviation of its error falls as 1/√n, where n is the number of averaged returns. Every-visit MC is more complicated, but its estimates also converge asymptotically to vπ(s) (Singh and Sutton, 1996).
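A Python sketch of first-visit MC prediction follows. Here generate_episode(pi) is a hypothetical helper, assumed to return one episode as a list of (St, Rt+1) pairs obtained by following the policy pi; everything about the environment behind it is left unspecified.

from collections import defaultdict

def first_visit_mc(generate_episode, pi, num_episodes, gamma=1.0):
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    V = {}
    for _ in range(num_episodes):
        episode = generate_episode(pi)          # [(S0, R1), (S1, R2), ...]
        G = 0.0
        # walk the episode backwards, accumulating the return G
        for t in range(len(episode) - 1, -1, -1):
            s, r = episode[t]
            G = gamma * G + r
            if s not in [x for x, _ in episode[:t]]:    # first visit to s only
                returns_sum[s] += G
                returns_count[s] += 1
                V[s] = returns_sum[s] / returns_count[s]
    return V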
An example is the best way to demonstrate the use of Monte Carlo methods.
Example: Blackjack. The goal of the popular casino card game blackjack is to obtain cards whose numerical values sum as close to 21 as possible without exceeding it. All face cards count as ten, and an ace can count as either one or eleven.
We’ll look at the variation in which each player plays against the dealer on their own.
Both the dealer and the player are dealt two cards at the start of the game. The dealer
has one face-up card and one face-down card. A natural is when a player has 21 points
right away (an ace and a 10-card).
If the player has a natural, he wins immediately, unless the dealer also has a natural, in which case the game is a tie. If the player does not have a natural, he can request additional cards one at a time (hits) until he either stops (sticks) or exceeds 21 (goes bust). If he goes bust, he loses; if he sticks,
the dealer has his turn. The dealer has no choice except to hit or stick according to
a predetermined strategy: he sticks on any total of 17 or more, and hits otherwise. If
the dealer goes bust, the player wins; otherwise, the winner is whoever’s total sum is
closest to 21.
Blackjack is naturally formulated as a finite episodic MDP, with each game of blackjack being an episode. Rewards of +1, −1, and 0 are given for winning, losing, and drawing, respectively. Because all rewards within a game are zero and we do not discount (γ = 1),
these terminal rewards are also the returns. The player's actions are to hit or to stick. The states depend on the player's cards and the dealer's showing card. We assume that cards are dealt from an infinite deck (i.e., with replacement), so there is no advantage to keeping track of the cards already dealt.
The ace is said to be usable if the player can count it as 11 without going bust. In this situation it is always counted as 11, because counting it as 1 would give a sum of 11 or less, in which case the player should obviously always hit. As a result, the player bases his decisions on three factors: his current sum (12–21), the dealer's one showing card (ace–10), and whether or not he has a usable ace. This brings the total number of states to 200.
Consider the policy that sticks if the player’s sum is 20 or 21, and otherwise
hits. To find the state-value function for this policy by a Monte Carlo approach, one
simulates many blackjack games using the policy and averages the returns following
each state. Note that in this task the same state never recurs within one episode, so
there is no difference between first-visit and every-visit MC methods. In this way, we
obtained the estimates of the state value function shown in Figure below. The estimates
for states with a usable ace are less certain and less regular because these states
are less common. In any event, after 500,000 games the value function is very well
approximated.
Figure: Approximate state-value functions for the blackjack policy that sticks
only on 20 or 21, computed by Monte Carlo policy evaluation.
Although we have complete knowledge of the environment in this task, it would not be easy to apply DP methods, because the required transition quantities are difficult to determine in blackjack. Suppose the player's total is 14 and he decides to stick. What is his expected reward as a function of the dealer's showing card? All of these expected rewards and transition probabilities must be computed before DP can be applied, and such computations are often complex and error-prone. In contrast, generating the sample games required by Monte Carlo methods is easy. The ability of Monte Carlo methods to work with sample episodes alone can be a significant advantage even when one has complete knowledge of the environment's dynamics; this is the case surprisingly often.
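The sketch below shows how easy it is to generate such sample games. It is heavily simplified relative to the text: the state recorded is only the player's current sum, not the full (sum, dealer's showing card, usable ace) triple, and naturals are not handled specially, so treat it purely as an illustration of episode-based averaging.

import random
from collections import defaultdict

def draw():
    # infinite deck with replacement: 1 = ace, 2-9, and 10 for 10/J/Q/K
    return min(random.randint(1, 13), 10)

def hand_value(cards):
    # count one ace as 11 whenever doing so does not bust the hand
    total, aces = sum(cards), cards.count(1)
    return total + 10 if aces and total + 10 <= 21 else total

def play_episode():
    player, dealer, states = [draw(), draw()], [draw(), draw()], []
    while hand_value(player) < 20:             # the policy: hit below 20
        states.append(hand_value(player))
        player.append(draw())
    if hand_value(player) > 21:                # player goes bust
        return states, -1
    states.append(hand_value(player))
    while hand_value(dealer) < 17:             # dealer sticks on 17 or more
        dealer.append(draw())
    p, d = hand_value(player), hand_value(dealer)
    return states, (1 if d > 21 or p > d else 0 if p == d else -1)

total, count = defaultdict(float), defaultdict(int)
for _ in range(100000):
    states, reward = play_episode()
    for s in set(states):                      # the single end-of-game reward
        total[s] += reward                     # is the return for every state
        count[s] += 1
print({s: round(total[s] / count[s], 2) for s in sorted(count)})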
Can the idea of a backup diagram be applied to Monte Carlo algorithms? The general idea of a backup diagram is to show at the top the root node to be updated, and below it all the transitions and leaf nodes whose rewards and estimated values contribute to the update. In Monte Carlo estimation of vπ the root is a state node, and below it is the whole trajectory of transitions along a single episode, ending at the terminal state, as shown in the figure below.
Figure: The backup diagram for Monte Carlo estimation of vπ.
The estimations for each state are independent, which is a key feature of Monte
Carlo methods. As in DP, the estimate for one state does not build on the estimate for
any other state. Monte Carlo approaches, in other words, do not bootstrap.
It's worth noting, in particular, that the cost of computing the value of a single state is independent of the number of states. This can make Monte Carlo approaches
particularly appealing when the value of only one state, or a subset of states, is required. One can generate many sample episodes starting from the states of interest, averaging the returns from only these states and ignoring all others. This is the third advantage of Monte Carlo methods over DP methods (after the ability to learn from actual experience and from simulated experience).
If a model is available, state values alone suffice to determine a policy: one simply looks ahead one step and chooses the action that leads to the best combination of reward and future state, as we did in the DP chapter. Without a model, however, state values are insufficient. Each action must be explicitly valued for the values to be useful in establishing a policy. Thus, one of the primary goals of Monte Carlo methods is to estimate q*. To do so, we first consider the problem of policy evaluation for action values.
Estimating qπ (s, a), the expected return when starting in state s, taking action a,
and then following policy π, is the policy evaluation problem for action values. The Monte Carlo methods are essentially the same as for state values, except that now we speak of visits to a state–action pair rather than to a state. A state–action pair (s, a) is said to be visited in an episode if the state s is visited and action a is taken in it.
The every-visit MC method estimates the value of a state–action pair as the average of the returns that follow all visits to it. The first-visit MC method averages the returns following the first time in each episode that the state is visited and the action is chosen. As before, both methods converge to the true expected values as the number of visits to each state–action pair approaches infinity.
The only snag is that many state–action pairs might never be visited. If π is a
deterministic policy, then in following π one will observe returns only for one of the actions from each state. With no returns to average, the Monte Carlo estimates of the other actions will not improve with experience. This is a critical issue because the
goal of learning action values is to help in choosing among the actions available in each state. To compare alternatives we must estimate the value of all the actions from each state, not just the one we currently favour.
We must ensure continual exploration in order for policy evaluation to work for action values. One way to do this is to specify that each episode begins with a
state–action pair, with every pair having a nonzero probability of being selected as the start. This guarantees that all state–action pairs will be visited an infinite number of times in the limit of an infinite number of episodes. We call this the assumption of exploring starts.
The assumption of exploring starts is sometimes useful, but it cannot be relied upon in general, especially when learning directly from actual
interaction with an environment. In that case, the starting conditions are unlikely to be so helpful. The most common alternative approach to ensuring that all state–action pairs are encountered is to consider only stochastic policies with a nonzero probability of selecting all actions in each state. We will discuss two important variants of this approach later. For now, we retain the assumption of exploring starts and complete the presentation of a full Monte Carlo control
approach.
)A
Temporal-difference (TD) learning does not need to wait until the end of an episode to achieve the same outcomes. Rather than calculating the total future reward, TD seeks to predict the combination of the immediate reward and its own reward prediction at the next moment in time.
The fundamental mathematical idea behind TD learning is the discounted return:

Gt = Rt+1 + γ Rt+2 + γ² Rt+3 + ⋯ = Σ_{k=0}^{∞} γ^k Rt+k+1

where the return at time t is the combination of discounted future rewards. This implies that rewards in the future are valued less than immediate ones. The TD error is the difference between the final correct prediction (V*_t) and our current prediction (V_t).
The current value is then updated by adding the learning rate times the error, just as in other optimization techniques:

V(St) ← V(St) + α (V*_t − V(St))
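In code, a single tabular TD(0) update looks like the sketch below. Here the target is the bootstrapped quantity Rt+1 + γV(St+1), which plays the role of V*_t in the notation above; the state names and numbers in the usage example are invented.

def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    # target: immediate reward plus the discounted current prediction
    # of the next state (the role of V*_t in the text's notation)
    td_target = r + gamma * V[s_next]
    td_error = td_target - V[s]          # the TD error
    V[s] += alpha * td_error             # value += learning_rate * error
    return V

V = {"A": 0.0, "B": 1.0}                 # invented two-state value table
td0_update(V, "A", 0.5, "B")
print(V["A"])                            # 0.1 * (0.5 + 0.9 * 1.0) = 0.14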
We're now ready to consider how Monte Carlo estimation can be used in control, that is, to approximate optimal policies. The overall idea is to follow the same pattern as in the DP chapter, namely generalised policy iteration (GPI). In GPI we maintain both an approximate policy and an approximate value function. The value function is repeatedly altered to more closely match the value function of the current policy, and the policy is repeatedly improved with respect to the current value function:
evaluation: π → qπ        improvement: π ← greedy(q)
These two types of adjustments operate against each other to some extent, as
each provides a moving goal for the other, but when combined, they bring both the
policy and the value function closer to their ideal state.
Let's start with a Monte Carlo version of classical policy iteration. Starting with an arbitrary policy π0 and finishing with the optimal policy and optimal action-value function, we alternate complete steps of policy evaluation and policy improvement:

π0 →E qπ0 →I π1 →E qπ1 →I π2 →E ⋯ →I π* →E q*
where →E denotes a complete policy evaluation and →I denotes a complete policy improvement.
Policy evaluation is carried out exactly as described in the preceding section. Many episodes are experienced, with the estimated action-value function approaching the true function asymptotically. For the moment, let us assume that we do indeed observe an infinite number of episodes and that, in addition, the episodes are generated with exploring starts. Under these assumptions, the Monte Carlo methods will compute each qπk exactly, for arbitrary πk.
Policy improvement is done by making the policy greedy with respect to the current value function. In this case we have an action-value function, and therefore no model is needed to construct the greedy policy. For any action-value function q, the corresponding greedy policy is the one that, for each s ∈ S, deterministically chooses an action with the highest action-value:

π(s) = argmax_a q(s, a)
Policy improvement can then be done by constructing each πk+1 as the greedy policy with respect to qπk. The policy improvement theorem applies to πk and πk+1 because, for all s ∈ S,

qπk(s, πk+1(s)) = max_a qπk(s, a) ≥ qπk(s, πk(s)) = vπk(s).
As we discussed in the previous chapter, the theorem assures us that each πk+1 is uniformly better than πk, or just as good as πk, in which case both are optimal policies. This in turn assures us that the overall process converges to the optimal policy and optimal value function. In this way Monte Carlo methods can be used to find optimal policies given only sample episodes and no
further information about the dynamics of the environment. To obtain this guarantee of convergence for the Monte Carlo approach, we made two implausible assumptions above. The first was that the episodes have exploring starts, and the second was that policy evaluation could be done with an infinite number of episodes. Both assumptions must be removed in order to create a workable method.
For now we focus on the assumption that policy evaluation is performed on an infinite number of episodes. This assumption is relatively easy to remove. In fact, the same issue arises even in classical DP methods such as iterative policy evaluation, which also converge only asymptotically to the true value function. In both the DP and Monte Carlo cases there are two
ways to solve the problem. One is to hold firm to the idea of approximating qπk in each policy evaluation. Measurements and assumptions are made to obtain bounds on the magnitude and probability of error in the estimates, and then sufficient steps are taken during each policy evaluation to ensure that these bounds are suitably small. This approach can probably be made completely satisfactory in the sense of guaranteeing correct convergence up to some level of approximation. However, it is also likely to require far too many episodes to be useful in practice on all but the smallest problems.
The second strategy for avoiding the theoretically infinite number of episodes required for policy evaluation is to give up on completing policy evaluation before returning to policy improvement. On each evaluation step we move the value function toward qπk, but we do not expect to get close except over many steps. Value iteration is an extreme form of this idea, in which only one iteration of iterative policy evaluation is performed between each step of policy improvement. The in-place version of value iteration goes even further: there, for single states, we alternate between improvement and evaluation steps. For Monte Carlo policy evaluation it is natural to alternate between evaluation and improvement on an episode-by-episode basis. After each episode, the observed returns are used for policy evaluation, and then the policy is improved at all of the states visited in the episode. A complete simple algorithm along these lines is shown in the figure below. It is called Monte Carlo ES, for Monte Carlo with Exploring Starts.
Figure: Monte Carlo ES: A Monte Carlo control algorithm assuming exploring starts and that episodes always terminate for all policies.
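A compact sketch of Monte Carlo ES is given below. The helpers random_state_action() and run_episode(s0, a0, pi) are hypothetical: the first is assumed to return a uniformly random (state, action) exploring start, and the second to play out one episode beginning with that pair and thereafter following pi (falling back to a random action in states pi has not yet seen), returning a list of (state, action, reward) triples.

from collections import defaultdict

def monte_carlo_es(random_state_action, run_episode, actions,
                   num_episodes, gamma=1.0):
    Q = defaultdict(float)
    N = defaultdict(int)
    pi = {}                                   # deterministic policy, filled lazily
    for _ in range(num_episodes):
        s0, a0 = random_state_action()        # exploring start
        episode = run_episode(s0, a0, pi)     # [(s, a, r), ...]
        G = 0.0
        for t in range(len(episode) - 1, -1, -1):
            s, a, r = episode[t]
            G = gamma * G + r
            if (s, a) not in [(x, y) for x, y, _ in episode[:t]]:   # first visit
                N[(s, a)] += 1
                Q[(s, a)] += (G - Q[(s, a)]) / N[(s, a)]            # running mean
                pi[s] = max(actions[s], key=lambda b: Q[(s, b)])    # greedy update
    return pi, Q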
All of the returns for each state–action combination are aggregated and averaged
in Monte Carlo ES, regardless of which policy was in force when they were observed. It is easy to see that Monte Carlo ES cannot converge to any suboptimal policy. If it did, the value function would eventually converge to the value function for that policy, and that in turn would cause the policy to change. Stability is achieved only when both the policy and the value function are optimal.
Convergence to this optimal fixed point seems inevitable as the changes to the action-value function decrease over time, but it has not yet been proved formally. In our opinion, this is one of the most fundamental open theoretical questions in reinforcement learning (for a partial solution, see Tsitsiklis, 2002).
Monte Carlo Control without Exploring Starts
How can we avoid the unlikely assumption of exploring starts? The only general way to ensure that all actions are selected infinitely often is for the agent to continue to select them. There are two approaches to ensuring this, resulting in what we call on-policy methods and off-policy methods. On-policy methods attempt to evaluate or improve
the policy that is used to make decisions, whereas off-policy methods evaluate or improve a policy different from the one used to generate the data. The Monte Carlo ES method developed above is an example of an on-policy method. In this section we show how to design an on-policy Monte Carlo control method that does not use the unrealistic assumption of exploring starts.
The overall idea of on-policy Monte Carlo control is still that of GPI. As in Monte Carlo ES, we use first-visit MC methods to estimate the action-value function for the current policy. Without the assumption of exploring starts, however, we cannot simply improve the policy by making it greedy with respect to the current value function, because that would prevent further exploration of nongreedy actions. In our on-policy method we therefore move the policy only toward a greedy policy, using ε-greedy policies: most of the time they choose an action with maximal estimated action value, but with probability ε they instead select an action at random. The ε-greedy policies are examples of ε-soft policies, for which π(a|s) ≥ ε/|A(s)| for all states and actions.

The policy improvement theorem assures us that any ε-greedy policy with respect to qπ is an improvement over any ε-soft policy π. Let π′ be the ε-greedy policy. The conditions of the policy improvement theorem apply because, for any s ∈ S:

$$
\begin{aligned}
q_\pi(s, \pi'(s)) &= \sum_a \pi'(a \mid s)\, q_\pi(s, a) \\
&= \frac{\varepsilon}{|\mathcal{A}(s)|} \sum_a q_\pi(s, a) + (1 - \varepsilon) \max_a q_\pi(s, a) \\
&\geq \frac{\varepsilon}{|\mathcal{A}(s)|} \sum_a q_\pi(s, a) + (1 - \varepsilon) \sum_a \frac{\pi(a \mid s) - \frac{\varepsilon}{|\mathcal{A}(s)|}}{1 - \varepsilon}\, q_\pi(s, a) \\
&= \frac{\varepsilon}{|\mathcal{A}(s)|} \sum_a q_\pi(s, a) - \frac{\varepsilon}{|\mathcal{A}(s)|} \sum_a q_\pi(s, a) + \sum_a \pi(a \mid s)\, q_\pi(s, a) \\
&= v_\pi(s).
\end{aligned}
$$

(The inequality holds because the sum is a weighted average with nonnegative weights that add up to 1, and as such it must be less than or equal to the largest number averaged.)
Thus, by the policy improvement theorem, π′ ≥ π (i.e., vπ′(s) ≥ vπ(s) for all s ∈ S). We now show that equality holds only when both π′ and π are optimal among the ε-soft policies, that is, when they are better than or equal to all other ε-soft policies.
Consider a new environment that is just like the original, except that the requirement that policies be ε-soft is "moved inside" the environment. The new environment has the same action and state sets as the original and behaves as follows. If in state s and taking action a, then with probability 1 − ε the new environment behaves exactly like the old environment.
With probability ε it repicks the action at random, with equal probabilities, and then behaves like the old environment with the new, random action. The best one can do in this new environment with general policies is the same as the best one could do in the original environment with ε-soft policies.
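To make this construction concrete, here is a minimal Python sketch of such a wrapper. The base_step function and the action list are hypothetical placeholders for the original environment's transition function, not anything prescribed by the text.

```python
import random

def soft_step(base_step, state, action, actions, epsilon=0.1):
    """One step of the 'moved inside' environment (a sketch).

    With probability 1 - epsilon the original action is used; with
    probability epsilon the action is repicked uniformly at random, and
    the environment then behaves like the old one with the new action.
    """
    if random.random() < epsilon:
        action = random.choice(actions)   # repick with equal probabilities
    return base_step(state, action)       # behave like the old environment
```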
Let ṽ∗ and q̃∗ denote the optimal value functions for the new environment. Then a policy π is optimal among ε-soft policies if and only if vπ = ṽ∗. From the definition of ṽ∗ we know that it is the unique solution to

$$
\begin{aligned}
\tilde{v}_*(s) &= (1 - \varepsilon) \max_a \tilde{q}_*(s, a) + \frac{\varepsilon}{|\mathcal{A}(s)|} \sum_a \tilde{q}_*(s, a) \\
&= (1 - \varepsilon) \max_a \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma \tilde{v}_*(s') \right] \\
&\qquad + \frac{\varepsilon}{|\mathcal{A}(s)|} \sum_a \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma \tilde{v}_*(s') \right].
\end{aligned}
$$
When equality holds and the ε-soft policy π is no longer improved, then we also know that

$$
\begin{aligned}
v_\pi(s) &= (1 - \varepsilon) \max_a q_\pi(s, a) + \frac{\varepsilon}{|\mathcal{A}(s)|} \sum_a q_\pi(s, a) \\
&= (1 - \varepsilon) \max_a \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma v_\pi(s') \right] \\
&\qquad + \frac{\varepsilon}{|\mathcal{A}(s)|} \sum_a \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma v_\pi(s') \right].
\end{aligned}
$$
However, this equation is identical to the previous one, except with vπ substituted for ṽ∗. Since ṽ∗ is the unique solution, it must be that vπ = ṽ∗.

In essence, we have shown in the last few pages that policy iteration works for ε-soft policies. Using the natural notion of greedy policy for ε-soft policies, one is assured of improvement on every step, except when the best policy among the ε-soft policies has been found.
This analysis is independent of how the action-value functions are determined at each stage, but it does assume that they are computed exactly. This brings us back to roughly where we were in the previous section: we can now achieve the best policy only among the ε-soft policies, but the assumption of exploring starts has been removed. The complete method is shown in the figure below.
Figure: An on-policy first-visit MC control algorithm for ε-soft policies.
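To make the ε-soft requirement concrete, the following minimal Python sketch shows the ε-greedy action selection used on every step of such an algorithm. The function name and the dictionary-based Q table are our own illustrative choices, not part of the pseudocode in the figure.

```python
import random

def epsilon_greedy_action(Q, state, actions, epsilon=0.1):
    """Sample an action from an epsilon-greedy (hence epsilon-soft) policy.

    With probability epsilon an action is drawn uniformly at random, so
    every action keeps probability at least epsilon/|A(s)|; otherwise the
    action that is greedy w.r.t. the current estimate Q is taken.
    """
    if random.random() < epsilon:
        return random.choice(actions)                        # exploratory move
    return max(actions, key=lambda a: Q.get((state, a), 0))  # greedy move
```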
5.1.10 Temporal Learning
If one had to identify one idea as central and novel to reinforcement learning, it would surely be temporal-difference (TD) learning. TD learning is a combination of Monte Carlo ideas and dynamic programming (DP) ideas. Like Monte Carlo methods, TD methods can learn directly from raw experience without a model of the environment's dynamics. Like DP, TD methods update estimates based in part on previously learnt estimates, without waiting for a final outcome (they bootstrap). The relationship between TD, DP, and Monte Carlo methods is a recurring theme in the theory of reinforcement learning; this section begins our exploration of it. Before we are done, we will see how these ideas and methods blend into one another and can be combined in many ways.
TD Prediction
Both TD and Monte Carlo methods use experience to solve the prediction problem. Given some experience following a policy π, both methods update their estimate V of vπ for the nonterminal states St occurring in that experience. Roughly speaking, Monte Carlo methods wait until the return following the visit is known and then use that return as a target for V(St). A simple every-visit Monte Carlo method suitable for nonstationary environments is

$$
V(S_t) \leftarrow V(S_t) + \alpha \left[ G_t - V(S_t) \right],
$$
where Gt is the actual return following time t and α is a constant step-size parameter; we call this method constant-α MC. Whereas Monte Carlo methods must wait until the end of the episode to determine the increment to V(St) (only then is Gt known), TD methods need wait only until the next time step. At time t+1 they immediately form a target and make a useful update using the observed reward Rt+1 and the estimate V(St+1). The simplest TD method, known as TD(0), is

$$
V(S_t) \leftarrow V(S_t) + \alpha \left[ R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \right].
$$

In effect, the target for the Monte Carlo update is Gt, whereas the target for the TD update is Rt+1 + γV(St+1).
Because TD(0) bases its update in part on an existing estimate, we say that it, like the DP method, is a bootstrapping method. We know that

$$
\begin{aligned}
v_\pi(s) &= \mathbb{E}_\pi\!\left[ G_t \mid S_t = s \right] \\
&= \mathbb{E}_\pi\!\left[ R_{t+1} + \gamma G_{t+1} \mid S_t = s \right] \\
&= \mathbb{E}_\pi\!\left[ R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s \right].
\end{aligned}
$$
Roughly speaking, Monte Carlo methods use an estimate of the first line of the above equation as a target, whereas DP methods use an estimate of the last line. The Monte Carlo target is an estimate because the expected value in the first line is not known; a sample return is used in place of the true expected return. The DP target is an estimate not because of the expected values, which are assumed to be completely provided by a model of the environment, but because vπ(St+1) is not known and the current estimate, V(St+1), is used instead. The TD target is an estimate for both reasons: it samples the expected values and it uses the current estimate V instead of the true vπ. Thus TD methods combine the sampling of Monte Carlo with the bootstrapping of DP. As we shall see, with care and imagination this can take us a long way toward obtaining the advantages of both Monte Carlo and DP methods.
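To make the two targets concrete, here is a minimal Python sketch of both update rules. The function names and the dictionary-based value table are our own illustrative choices rather than anything prescribed by the text.

```python
def constant_alpha_mc_update(V, s, G, alpha=0.1):
    # Constant-alpha MC: move V(s) toward the actual return G_t,
    # which is available only after the episode terminates.
    v = V.get(s, 0.0)
    V[s] = v + alpha * (G - v)

def td0_update(V, s, r_next, s_next, alpha=0.1, gamma=1.0):
    # TD(0): move V(s) toward the bootstrapped target
    # R_{t+1} + gamma * V(S_{t+1}), available after a single time step.
    v = V.get(s, 0.0)
    target = r_next + gamma * V.get(s_next, 0.0)
    V[s] = v + alpha * (target - v)
```

The only difference is the target: the actual return versus the one-step reward plus the current estimate of the next state's value.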
The figure above specifies TD(0) completely in procedural form, and the figure below shows its backup diagram. The value estimate for the state node at the top of the backup diagram is updated on the basis of one sample transition from that state to the immediately following state. We refer to TD and Monte Carlo updates as sample backups because they involve looking ahead to a sample successor state (or state–action pair), using the value of the successor and the reward along the way to compute a backed-up value, and then changing the value of the original state (or state–action pair) accordingly. Sample backups differ from the full backups of DP methods in that they are based on a single sample successor rather than on a complete distribution of all possible successors.
Figure: The backup diagram for TD(0)
Example 6.1: Driving Home. Each day as you drive home from work, you try to predict how long it will take to get home. When you leave your office, you note the time, the day of week, and anything else that might be relevant. Say on this Friday you are leaving at exactly 6 o'clock, and you estimate that it will take 30 minutes to get home. As you reach your car it is 6:05, and you notice it is starting to rain. Traffic is often slower in the rain, so you re-estimate that it will take 35 minutes from then, or a total of 40 minutes. Fifteen minutes later you have completed the highway portion of your journey in good time. As you exit onto a secondary road you cut your estimate of total travel time to 35 minutes. Unfortunately, at this point you get stuck behind a slow truck, and the road is too narrow to pass. You end up having to follow the truck until you turn onto the side street where you live at 6:40. Three minutes later you are home. The sequence of states, times, and predictions is thus as follows:
State                           Elapsed Time    Predicted     Predicted
                                (minutes)       Time to Go    Total Time
leaving office, Friday at 6     0               30            30
reach car, raining              5               35            40
exiting highway                 20              15            35
secondary road, behind truck    30              10            40
entering home street            40              3             43
arrive home                     43              0             43
The rewards in this example are the elapsed times on each leg of the journey. We are not discounting (γ = 1), and thus the return for each state is the actual time to go from that state. The value of each state is the expected time to go. The second column of numbers gives the current estimated value for each state encountered.
A simple way to view the operation of Monte Carlo methods is to plot the predicted total time (the last column) over the sequence, as in the figure above. The arrows show the changes in predictions recommended by the constant-α MC method, for α = 1. These are exactly the differences between the estimated value (predicted time to go) and the actual return (actual time to go) in each state. For example, when you exited the highway you thought it would take only 15 more minutes to get home, but in fact it took 23.
At this point the above equation applies and determines an increment in the estimate of time to go after exiting the highway. The error, Gt − V(St), at this time is eight minutes. Suppose the step-size parameter α is 1/2. Then the predicted time to go after exiting the highway would be revised upward by four minutes as a result of this experience. This is probably too large a change in this case; the truck was likely just an unlucky break. In any event, the change can only be made off-line, that is, after you have reached home. Only at this point do you know any of the actual returns. Is it necessary to wait until the final outcome is known before learning can begin? Suppose on another day you again estimate when leaving your office that it will take 30 minutes to drive home, but then you become stuck in a massive traffic jam. Twenty-five minutes after leaving the office you are still bumper-to-bumper on the highway. You now estimate that it will take another 25 minutes to get home, for a total of 50 minutes. As you wait in traffic, you already know that your original estimate of 30 minutes was too optimistic. Must you wait until you get home before updating your estimate for the initial state? According to the Monte Carlo approach you must, because you don't yet know the true return.
According to a TD approach, on the other hand, you would learn immediately, shifting your initial estimate from 30 minutes toward 50. In fact, each estimate would be shifted toward the estimate that immediately follows it. Returning to our first day of driving, the figure below shows the same predictions as the figure above, but with the changes recommended by the TD rule (these are the changes made by the rule if α = 1). Each error is proportional to the change over time of the prediction, that is, to the temporal differences in predictions.
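As a worked check on this example (with γ = 1 and α = 1 as above), the following short Python snippet computes both sets of errors from the numbers in the table; the variable names are our own.

```python
# Predicted minutes-to-go at each state of the drive (from the table above)
# and the minutes actually elapsed on each leg (the "rewards").
predictions = [30, 35, 15, 10, 3, 0]   # leaving office ... arriving home
leg_times = [5, 15, 10, 10, 3]

# TD errors, R_{t+1} + V(S_{t+1}) - V(S_t): available one step at a time.
td_errors = [r + predictions[t + 1] - predictions[t]
             for t, r in enumerate(leg_times)]

# MC errors, G_t - V(S_t): available only after arriving home.
time_to_go = [sum(leg_times[t:]) for t in range(len(leg_times))]
mc_errors = [g - v for g, v in zip(time_to_go, predictions)]

print(td_errors)  # [10, -5, 5, 3, 0]
print(mc_errors)  # [13, 3, 8, 3, 0]
# The TD errors sum to 13, the first MC error: the one-step errors
# telescope into the full-return error for the initial state.
```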
Apart from giving you something to do while waiting in traffic, there are several computational reasons why it is advantageous to learn based on your current predictions rather than waiting until termination, when you know the actual return.
Advantages of TD Prediction Methods
TD methods learn their estimates in part on the basis of other estimates. They learn a guess from a guess; they bootstrap. Is this a good thing to do? What advantages do TD methods have over Monte Carlo and DP methods? Developing and answering such questions takes considerable further work; in this section we look ahead to some of the answers.
Obviously, TD methods have an advantage over DP methods in that they do not require a model of the environment, of its reward, or of its next-state probability distributions.
The next most obvious advantage of TD methods over Monte Carlo methods is that they are naturally implemented in an on-line, fully incremental fashion. With Monte Carlo methods one must wait until the end of an episode, because only then is the return known, whereas with TD methods one need wait only one time step. Surprisingly often this turns out to be a critical consideration. Some applications have very long episodes, so that delaying all learning until the end of the episode is too slow. Other applications are continuing tasks and have no episodes at all. Some Monte Carlo methods must also ignore or discount episodes on which experimental actions are taken, which can greatly slow learning. TD methods are much less susceptible to these problems because they learn from each transition regardless of what subsequent actions are taken.
But are TD methods sound? It is certainly convenient to learn one guess from the next, without having to wait for the final result, but can we still guarantee convergence to the correct answer? Happily, the answer is yes. For any fixed policy π, the TD algorithm described above has been proved to converge to vπ: in the mean if the step-size parameter is sufficiently small, and with probability 1 if the step-size parameter decreases according to the usual stochastic approximation conditions. Most convergence proofs apply only to the table-based case of the algorithm described above, but some also apply to the case of general linear function approximation.
If both TD and Monte Carlo methods converge asymptotically to the correct predictions, then a natural next question is "Which gets there first?" In other words, which method learns faster? Which makes the more efficient use of limited data? At present this is an open question, in the sense that no one has been able to prove mathematically that one method converges faster than the other. In fact, it is not even clear what the most formal way to phrase this question would be! In practice, however, TD methods have usually been found to converge faster than constant-α MC methods on stochastic tasks.
Optimality of TD(0)
Suppose there is available only a finite amount of experience, say 10 episodes or 100 time steps. In this case, a common approach with incremental learning methods is to present the experience repeatedly until the method converges upon an answer. Given an approximate value function V, the increments specified by the update equation above are computed for every time step t at which a nonterminal state is visited, but the value function is changed only once, by the sum of all the increments. Then all the available experience is processed again with the new value function to produce a new overall increment, and so on, until the value function converges. We call this batch updating because updates are made only after processing each complete batch of training data.
Under batch updating, TD(0) converges deterministically to a single answer independent of the step-size parameter α, as long as α is chosen to be sufficiently small. The constant-α MC method also converges deterministically under the same conditions, but to a different answer. Understanding these two answers will help us understand the difference between the two methods. Under normal updating the methods do not move all the way to their respective batch answers, but in some sense they take steps in those directions. Before trying to understand the two answers in general, for all possible tasks, we first look at a few examples.
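Before turning to those examples, here is a minimal Python sketch of batch TD(0), assuming episodes are given as lists of (state, reward, next_state) transitions; the function name and data layout are our own illustrative choices.

```python
def batch_td0(episodes, alpha=0.01, gamma=1.0, sweeps=1000):
    """Batch TD(0) (a sketch): repeatedly sweep a fixed batch of episodes,
    accumulating the increments for every step, and apply them only once
    per sweep. next_state is assumed to be None at termination. In practice
    one would sweep until the value function stops changing.
    """
    V = {}
    for _ in range(sweeps):
        increments = {}
        for episode in episodes:
            for s, r, s_next in episode:
                v_next = 0.0 if s_next is None else V.get(s_next, 0.0)
                delta = r + gamma * v_next - V.get(s, 0.0)
                increments[s] = increments.get(s, 0.0) + alpha * delta
        # Apply the summed increments only after the whole batch is processed.
        for s, inc in increments.items():
            V[s] = V.get(s, 0.0) + inc
    return V
```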
Summary
●● Reinforcement learning is a computational approach to understanding and automating goal-directed learning and decision-making. It differs from other computational approaches in its emphasis on learning by an agent from direct interaction with its environment, rather than depending on exemplary supervision or complete models of the environment. In our assessment, reinforcement learning is the first field to seriously take on the computational challenges presented by these characteristics.
●● The ideas of value and value functions are the key distinguishing features of most of the reinforcement learning methods discussed here. We take the position that value functions are essential for efficiently searching the space of policies.
●● Constructing a state signal with the Markov property is an important problem in its own right, but here we make the assumption that this has already been accomplished and concentrate on the decision-making problem: how to choose what to do based on whatever state signal is available. If the Markov property holds, the environment is said to be a Markov decision process (MDP). An MDP with finite state and action sets is referred to as a finite MDP. Although most of the present theory of reinforcement learning is limited to finite MDPs, the techniques and concepts are applicable more broadly.
●● The value functions of a policy assign to each state, or state–action pair, the expected return from that state, or state–action pair, given that the agent follows the policy. The optimal value functions assign to each state, or state–action pair, the maximum expected return achievable by any policy. A policy whose value functions are optimal is an optimal policy. There can be multiple optimal policies for a given MDP, but only one set of optimal value functions for states and for state–action pairs. Any policy that is greedy with respect to the optimal value functions must be optimal. The optimal value functions satisfy a unique consistency condition known as the Bellman optimality equations, which can, in principle, be solved for the optimal value functions, from which an optimal policy can be identified fairly easily.
●● Temporal-difference (TD) learning was introduced as a novel kind of learning method, illustrated using the reinforcement learning problem. As usual, we separated the overall problem into a prediction problem and a control problem. TD methods are alternatives to Monte Carlo methods for solving the prediction problem. In both cases, the extension to the control problem uses the idea of generalised policy iteration (GPI), which we abstracted from dynamic programming: the idea that approximate policy and value functions should interact in such a way that they both move toward their optimal values.
●● One of the two processes making up GPI drives the value function to accurately predict returns for the current policy; this is the prediction problem. The other process drives the policy to improve locally (e.g., to be ε-greedy) with respect to the current value function. Maintaining sufficient exploration is an issue in all of these control methods.
●● An alternative way of evaluating policies is offered by Monte Carlo methods.
Rather than using a model to determine the value of each state, they simply average the many returns that begin in that state. Because a state's value is the expected return, this average can serve as a good approximation to the value. In control methods we are particularly interested in approximating action-value functions, since these can be used to improve the policy without requiring a model of the environment's transition dynamics. We are now familiar with the fundamental concepts and dynamic programming algorithms as they apply to solving finite MDPs.
●● Monte Carlo methods intermix policy evaluation and policy improvement steps and can be implemented incrementally on an episode-by-episode basis. Policy evaluation refers to the (typically iterative) computation of the value functions for a given policy. Policy improvement refers to the computation of an improved policy given the value function for that policy. Putting these two computations together yields the two best-known DP methods, policy iteration and value iteration. Given complete knowledge of the MDP, either of these can be used to compute optimal policies and value functions for finite MDPs exactly.
●● Classical DP methods operate in sweeps through the state set, performing an expected update on each state. Each such operation updates the value of one state based on the values of all possible successor states and their probabilities of occurring. Expected updates are closely related to Bellman equations: they are little more than Bellman equations turned into assignment statements. When the updates no longer result in any change of value, convergence has occurred to values satisfying the corresponding Bellman equation. There are four primary value functions (vπ, v∗, qπ, and q∗), and correspondingly four Bellman equations and four expected updates. Their backup diagrams give an intuitive picture of how each DP update operates.
Glossary
●● Agent: An entity that can perceive, explore, and act upon its environment.
●● Action: The steps an agent takes within the environment.
●● State: The situation returned by the environment after each action taken by the agent.
●● Q-value: Similar to value, except that it takes the current action as an additional parameter.
●● Environment: The setting in which an agent learns and performs its tasks.
●● Policy: The agent's decision-making strategy, by which it chooses future actions based on the current state.
●● Reward: The feedback an agent receives from the environment for each action.
●● Value: The expected long-term return with discounting, as opposed to the short-term reward.
●● Value function: Specifies the value of a state, that is, the total expected return from it.
●● Function approximator: A method for constructing a function from training samples.
●● Model of the environment: A simulation of the real world used to predict outcomes.
●● Model-based methods: Methods for solving reinforcement learning problems that make use of a model of the environment.
●● Markov decision process: A probabilistic model of the sequential decision problem.
●● Dynamic programming: A class of methods for solving sequential decision-making problems.
Check Your Understanding
1. Reinforcement learning is a:
a. Unsupervised learning
b. Supervised learning
c. Reward-based learning
b. Recommendation system
c. Pattern recognition
d. Image recognition
b. Supervised algorithm
c. Unsupervised algorithm
d. None of the mentioned
4. Which of the following is true about reinforcement learning?
d. All of the above
5. You have a task to show relevant ads to target users. Which algorithm should you use for this task?
a. Upper confidence bound
b. K means clustering
c. Naïve Bayes
d. Support vector machine
6. Hidden Markov model is used in:
a. Supervised learning
b. Unsupervised learning
c. Reinforcement learning
d. All of the mentioned
7. Which algorithm is used in robotics and industrial automation?
a. Thompson sampling
b. Naïve Bayes
c. Decision tree
d. All of the mentioned
8. Thompson sampling is a:
a. Probabilistic algorithm
b. Based on Bayes inference rule
a. It is a Deterministic algorithm
b. It does not allow delayed feedback
c. It is not based on Bayes inference
a. Supervised learning
b. Reinforcement learning
c. Unsupervised learning
d. None of the mentioned
11. ________ are the machine learning algorithms that can be used with labeled data.
a. Regression algorithms
b. Clustering algorithms
c. Association algorithms
d. All of the above
12. An application of machine learning is:
a. Email filtering
b. Sentiment analysis
c. Face recognition
d. All of the above
13. Which of the following is a disadvantage of decision trees?
a. Decision trees are robust to outliers
b. Decision trees are prone to overfitting
c. Both a and b
d. None of the mentioned
14. Machine learning is an application of:
a. Artificial intelligence
b. Blockchain
c. Both a and b
d. None of the mentioned
15. The term machine learning was coined in which year?
a. 1958
b. 1960
c. 1959
d. 1961
Exercise
1. What do you mean by reinforcement learning?
2. Define various elements in reinforcement learning.
3. Define the concept of Markov Property.
Learning Activities
1. Describe various real-world use cases of reinforcement learning.
Check Your Understanding - Answers
1. c
2. b
3. a
4. d
5. a
6. d
7. a
8. d
9. d
10. b
11. a
12. d
13. b
14. a
15. c
Further Readings and Bibliography:
1. Reinforcement Learning: An Introduction, Richard S. Sutton and Andrew G. Barto