
Machine Learning BCS602

Module-2
Understanding Data-2: Bivariate Data And Multivariate Data, Multivariate Statistics,
Essential Mathematics For Multivariate Data, Feature Engineering And Dimensionality
Reduction Techniques.

Basic Learning Theory: Design Of Learning System, Introduction to Concept of Learning, Modelling in Machine Learning.

2.1 BIVARIATE DATA AND MULTIVARIATE DATA

Bivariate data examines the relationship between two variables with the aim of finding connections and possible cause-and-effect relationships. For example, consider the correlation between shop temperature and sweater sales.

Temperature (in centigrade) Sales of Sweaters (in thousands)


5 200
10 150
15 140
20 75
22 60
23 55
25 20

Table 2.1: Temperature in a shop and sales Data

Fig. 2.1: Scatter Plot


Bivariate analysis explores relationships between two variables through graphical methods like
scatter plots. Scatter plots visualize data, reveal trends, show differences, and indicate the
strength, shape, direction, and outliers of the relationship, aiding in exploratory data analysis
before further calculations.

Line Graphs are similar to scatter plots. The line chart for sales data is shown below.

Fig. 2.2: Line Chart

2.1.1 Bivariate Statistics

Covariance and Correlation are examples of bivariate statistics. Covariance is a measure of the joint variability of two random variables, say X and Y. Generally, random variables are represented in capital letters. It is written as covariance(X, Y) or COV(X, Y) and is used to measure how the two dimensions vary together. The formula for finding the covariance of X and Y is:

COV(X, Y) = (1/N) Σ_{i=1}^{N} (x_i − E(X)) (y_i − E(Y))

Where:

• COV(X,Y) is the covariance between X and Y.

• N is the number of data points.

• xi and yi are individual data points from X and Y, respectively.


• E(X) is the mean (average) of the data points in X.

• E(Y) is the mean (average) of the data points in Y.

• Σ denotes the sum of the terms.

Correlation

The Pearson correlation coefficient is the most common test for determining an association
between two phenomena. It measures the strength and direction of a linear relationship
between the x and y variables.

The correlation indicates the relationship between dimensions using its sign. The sign is more
important than the actual value.

• Positive sign: Indicates a direct relationship; as one variable increases, the other also
tends to increase.

• Negative sign: Indicates an inverse relationship; as one variable increases, the other
tends to decrease.

• Zero: Indicates no linear relationship; the variables are considered independent.

If a strong correlation exists, it might suggest that one of the variables is redundant and could
potentially be removed from the analysis.

The Pearson correlation coefficient, denoted as 'r', is calculated using the following formula:

r = COV(X, Y) / (σx σy)

Where:

• r is the Pearson correlation coefficient.

• COV(X,Y) is the covariance between variables X and Y.

• σx is the standard deviation of variable X.

• σy is the standard deviation of variable Y.
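
For illustration, the covariance and Pearson correlation for the temperature/sales data of Table 2.1 can be computed as below (a sketch using NumPy; the variable names are only illustrative):

```python
import numpy as np

# Temperature (Celsius) and sweater sales (thousands) from Table 2.1
temperature = np.array([5, 10, 15, 20, 22, 23, 25], dtype=float)
sales = np.array([200, 150, 140, 75, 60, 55, 20], dtype=float)

# Population covariance: COV(X, Y) = (1/N) * sum((x_i - E(X)) * (y_i - E(Y)))
cov_xy = np.mean((temperature - temperature.mean()) * (sales - sales.mean()))

# Pearson correlation: r = COV(X, Y) / (sigma_x * sigma_y)
r = cov_xy / (temperature.std() * sales.std())

print(f"covariance = {cov_xy:.2f}, correlation r = {r:.3f}")
# A strongly negative r is expected: sales fall as temperature rises.
```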


2.2 MULTIVARIATE STATISTICS

In machine learning, almost all datasets are multivariate. Multivariate data involves three or more observable variables, and often thousands of measurements need to be conducted for one or more subjects.

❖ More than two: Multivariate data analyses datasets with three or more variables.
❖ Mean vector: The average of each variable is represented as a mean vector.
❖ Covariance matrix: Variance becomes a covariance matrix, showing relationships
between variables.
❖ Applications: Includes techniques like regression, factor analysis, and PCA.

Heatmap
❖ Visual Representation: Heatmaps use color to show the values in a 2D matrix.
❖ Color Coding: Darker colors represent higher values, lighter colors represent lower
values.
❖ Human Perception: We easily understand color differences, making heatmaps
effective.
❖ Applications: Heatmaps can visualize data like traffic density or patient health data.

Fig. 2.3: Heatmap for patient Data


Pairplot
❖ Pairplot/Scatter Matrix: A visual technique for multivariate data.
❖ Structure: Consists of multiple pairwise scatter plots.
❖ Purpose: Shows relationships between variables.
❖ Format: Presented in a matrix layout.
❖ Analysis: Allows easy identification of correlations and other relationships.
Example: Demonstrated with a random 3-column matrix in below figure.

Fig. 2.4: Pairplot for Random Data
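
Plots such as those in Figures 2.3 and 2.4 can be generated with standard plotting libraries; a minimal sketch using seaborn on synthetic data (the 3-column matrix here is random, as in the pairplot figure):

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(50, 3)), columns=["A", "B", "C"])

# Heatmap: colour encodes the value of each cell of a 2D matrix
sns.heatmap(df.corr(), annot=True, cmap="viridis")
plt.show()

# Pairplot / scatter matrix: pairwise scatter plots of all columns
sns.pairplot(df)
plt.show()
```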

2.3 ESSENTIAL MATHEMATICS FOR MULTIVARIATE DATA

Machine learning relies heavily on mathematical foundations, particularly linear algebra,


statistics, probability, and information theory, with linear algebra being paramount as the
"mathematics of data." It provides the essential tools, including linear equations, vectors,
matrices, vector spaces, and transformations, that are fundamental for machine learning
algorithms to function effectively.

2.3.1 Linear Systems And Gaussian Elimination For Multivariate Data

A Linear system of equation is a group of equations with unknown variables.

Let Ax = y; then the solution x is given as:

x = A⁻¹y


This is true provided A is non-singular, i.e., A⁻¹ exists. The logic can be extended to a set of N equations with 'n' unknown variables.

It means that if A is the coefficient matrix and y = (y₁, y₂, …, yₙ), then the unknown vector x can be computed as:

x = A⁻¹y

If there is a unique solution, then the system is called consistent independent. If there are various solutions, then the system is called consistent dependent. If there are no solutions and the equations are contradictory, then the system is called inconsistent.

For solving large number of system of equations, Gaussian elimination can be used. The
procedure for applying Gaussian elimination is given as follows:

1. Write the given matrix A.

2. Append the vector y to the matrix A. This matrix is called the augmented matrix.

3. Keep the element a₁₁ as the pivot and eliminate a₂₁ in the second row using the row operation R₂ ← R₂ − (a₂₁/a₁₁)R₁; here, R₂ is the second row and (a₂₁/a₁₁) is called the multiplier. The same logic is used to eliminate the first-column entries in all the remaining rows.

4. Repeat the same logic for the subsequent columns to reduce the matrix to row echelon form. Then, the last unknown variable is obtained as:

   xₙ = yₙ / aₙₙ

5. The remaining unknown variables can then be found by back-substitution as:

   xₙ₋₁ = (yₙ₋₁ − a₍ₙ₋₁₎ₙ × xₙ) / a₍ₙ₋₁₎₍ₙ₋₁₎

This part is called backward substitution.

To facilitate the application of Gaussian elimination method, the following row operations are
applied:

❖ Swapping the rows.


❖ Multiplying or dividing a row by a constant.
❖ Replacing a row by adding or subtracting a multiple of another row to it.
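
In practice, such systems are solved numerically; a minimal sketch using NumPy, with a made-up 3 × 3 system for illustration:

```python
import numpy as np

# Example system Ax = y (coefficients chosen only for illustration)
A = np.array([[2.0, 1.0, -1.0],
              [-3.0, -1.0, 2.0],
              [-2.0, 1.0, 2.0]])
y = np.array([8.0, -11.0, -3.0])

# np.linalg.solve uses LU-based elimination with partial pivoting,
# the numerically stable form of Gaussian elimination
x = np.linalg.solve(A, y)
print(x)            # [ 2.  3. -1.]

# Sanity check: A @ x should reproduce y
assert np.allclose(A @ x, y)
```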


2.3.2 Matrix Decompositions

Matrix factorization methods, like eigen decomposition, break down a matrix into simpler
components for easier operations. Eigen decomposition, a common technique, specifically
decomposes a matrix into its eigenvalues and eigenvectors. This results in expressing the
original matrix as the product of a matrix of eigenvectors, a diagonal matrix, and the transpose
of the eigenvector matrix.

Then, a real symmetric matrix A can be decomposed as:

A = QΛQᵀ

where Q is the matrix of eigenvectors, Λ is the diagonal matrix of eigenvalues, and Qᵀ is the transpose of matrix Q.

LU Decomposition

One of the simplest matrix decompositions is LU decomposition, where the matrix A can be decomposed into two matrices:

A = LU

Here, L is a lower triangular matrix and U is an upper triangular matrix. The decomposition can be done using the Gaussian elimination method discussed in the previous section. First, an identity matrix is augmented to the given matrix. Then, row operations and Gaussian elimination are applied to reduce the given matrix and obtain the matrices L and U.
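
Both decompositions are available in standard numerical libraries; a sketch using NumPy and SciPy on an arbitrary symmetric matrix:

```python
import numpy as np
from scipy.linalg import lu

A = np.array([[4.0, 2.0],
              [2.0, 3.0]])      # a real symmetric matrix

# Eigen decomposition: A = Q * diag(eigvals) * Q^T for symmetric A
eigvals, Q = np.linalg.eigh(A)
assert np.allclose(Q @ np.diag(eigvals) @ Q.T, A)

# LU decomposition: A = P * L * U (P is a permutation from pivoting)
P, L, U = lu(A)
assert np.allclose(P @ L @ U, A)
print(eigvals)
print(L)
print(U)
```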

2.3.3 Machine Learning and importance of probability and statistics

Machine learning heavily relies on statistics and probability, with statistics being crucial for
data analysis and probability essential for understanding data distributions. Data is viewed as
generated from probability distributions, and machine learning datasets often involve multiple
distributions, making knowledge of probability distributions and random variables vital.
Furthermore, hypothesis testing, model construction and evaluation, and dataset creation via
sampling theory are all key aspects linking machine learning with probability and statistics,
forming the foundation for effective model development and analysis.


Probability Distributions

Probability distributions summarize the probability of a variable's events. They are functions
describing the relationship between observations in a sample space. Data following a
distribution obeys a mathematical function, allowing probability calculations.

There are two main types of probability distributions:


1. Discrete probability distribution
2. Continuous probability distribution

For continuous variables, the probability density function (PDF) describes how likely each value is to occur, while the cumulative distribution function (CDF) gives the probability of an observation being less than or equal to a value. Both PDF and CDF are continuous functions.
The discrete equivalent of the PDF for discrete distributions is the probability mass function (PMF). To find the probability of an event for a continuous variable, calculate the area under the PDF curve over a small interval around the specific outcome; the area under the PDF up to a value x is the CDF at x.

Continuous Probability Distributions

Normal, Rectangular, and Exponential distributions fall under this category.

1. Normal Distribution (Gaussian or Bell-Shaped)

❖ Definition: A continuous probability distribution where data clusters around a central


mean, forming a symmetrical bell-shaped curve.

❖ Key Features:

o Mean (μ): The centre of the distribution.

o Standard Deviation (σ): Measures the spread of the data.

o Symmetry: Mean, median, and mode are equal.

o Range: Extends from negative infinity (-∞) to positive infinity (+∞).

❖ PDF (Probability Density Function):

f(x; μ, σ²) = (1 / √(2πσ²)) · e^(−(x − μ)² / (2σ²))

o Standard Normal Distribution: A special case where μ = 0 and σ = 1.


❖ Z-score: Measures how many standard deviations a data point is from the mean.
z = (x - μ) / σ

o Used to normalize data.

❖ Normality Tests:

o Q-Q Plot: Compares the quantiles of the data to the quantiles of a normal
distribution. A straight line indicates normality.
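
The PDF, z-score normalization, and a Q-Q normality check can be tried with SciPy; a sketch on synthetic data (μ = 10 and σ = 2 are assumed values for illustration):

```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
x = rng.normal(loc=10, scale=2, size=500)   # synthetic data, mu=10, sigma=2

# PDF of the normal distribution evaluated at a point
print(stats.norm.pdf(10, loc=10, scale=2))

# Z-scores: z = (x - mu) / sigma, used to normalize data
z = (x - x.mean()) / x.std()

# Q-Q plot: points close to the straight line indicate normality
stats.probplot(x, dist="norm", plot=plt)
plt.show()
```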

2. Rectangular Distribution (Uniform Distribution)

❖ Definition: A continuous distribution where all values within a specified range have
equal probability.

❖ Key Features:

o Range: Defined by an interval [a, b].

o Constant Probability: Probability is uniform across the range.

3. Exponential Distribution

❖ Definition: A continuous distribution describing the time between events in a Poisson


process.

❖ Key Features:

o Rate Parameter (λ): Determines the rate of events.

o Memoryless Property: The probability of an event occurring in the future is


independent of past events.

o Special Case of Gamma Distribution: With a shape parameter of 1.

❖ Mean and Standard Deviation: Both are equal to 1/λ (represented as β in the notes).


Applications

❖ Normal Distribution: Used in many statistical analyses, modelling natural phenomena,


and approximating other distributions.

❖ Rectangular Distribution: Used when all outcomes within a range are equally likely.

❖ Exponential Distribution: Used in reliability analysis, queuing theory, and modelling


waiting times.

Discrete Probability Distributions

Binomial, Poisson, and Bernoulli distributions fall under this category.

1. Binomial Distribution

Binomial distribution is another distribution that is often encountered in machine learning. Each trial has only two outcomes: success or failure. Such a trial is also called a Bernoulli trial.

The objective of this distribution is to find the probability of getting k successes out of n trials. The number of ways of getting k successes out of n trials is given as:

C(n, k) = n! / (k! (n − k)!)

The binomial distribution function is given as follows, where p is the probability of success and the probability of failure is (1 − p). The probability of one particular sequence of k successes and n − k failures is:

p^k (1 − p)^(n−k) or p^k q^(n−k)

Combining both, one gets the PMF of the binomial distribution as:

P(X = k) = C(n, k) p^k (1 − p)^(n−k)

Here, p is the probability of success in each trial, k is the number of successes, and n is the total number of trials. The mean of the binomial distribution is given below:

μ=n×p

and Variance is given as:

σ² = np(1-p)

Hence, the standard deviation is given as:


σ = √𝒏𝒑(𝟏 − 𝒑)

2. Poisson Distribution

It is another important distribution that is quite useful. Given an interval of time, this distribution is used to model the probability of a given number of events k occurring in that interval. The events occur at a known mean rate λ and independently of previous events. Some examples of the Poisson distribution are the number of emails received, the number of customers visiting a shop, and the number of phone calls received by an office.

The PMF of the Poisson distribution is given as follows:

f(X = x; λ) = Pr[X = x] = e^(−λ) λ^x / x!

Here, x is the number of times the event occurs and λ is the mean number of times an event occurs in the interval. The mean of the distribution is λ and the standard deviation is √λ.

3. Bernoulli Distribution

This distribution models an experiment whose outcome is binary. The outcome is positive with probability p and negative with probability 1 − p. The PMF of this distribution is given as:

f(k; p) = p            if k = 1
f(k; p) = q = 1 − p    if k = 0

The mean is p and the variance is p(1 − p) = pq.
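
The PMFs, means, and standard deviations of these discrete distributions can be computed with scipy.stats; a sketch with arbitrary parameter values:

```python
from scipy import stats

n, p = 10, 0.3      # binomial: n trials, success probability p
lam = 4             # Poisson: mean rate lambda

# Binomial: P(k successes in n trials), mean = n*p, std = sqrt(n*p*(1-p))
print(stats.binom.pmf(3, n, p), stats.binom.mean(n, p), stats.binom.std(n, p))

# Poisson: P(X = x), mean = lambda, std = sqrt(lambda)
print(stats.poisson.pmf(2, lam), stats.poisson.mean(lam), stats.poisson.std(lam))

# Bernoulli: single binary trial, mean = p, variance = p*(1-p)
print(stats.bernoulli.pmf(1, p), stats.bernoulli.mean(p), stats.bernoulli.var(p))
```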

Density Estimation

Density estimation is a statistical problem where the goal is to approximate the probability
density function of a population based on a finite sample of data points. This estimated
function, denoted as p(x), allows us to assign probabilities to new, unseen data points. By
comparing the estimated probability of a new point, p(x_i), to a threshold ε, we can identify
outliers or anomalies: points with probabilities below ε are considered atypical, suggesting they
deviate significantly from the learned distribution.

There are two types of density estimation methods, namely parametric density estimation and non-parametric density estimation.


Parametric Density Estimation

Parametric density estimation assumes data originates from a known distribution, characterized
by parameters θ, allowing the density to be expressed as p(x | θ). The method focuses on
estimating these parameters, often using techniques like maximum likelihood estimation, to
define the most likely distribution that generated the observed data.

Maximum Likelihood Estimation (MLE)

MLE is a method for estimating the parameters of a probability distribution based on observed
data. It aims to find the parameter values that maximize the likelihood of observing the given
data.

1. Formulate the Likelihood Function (L(X; θ)): This function represents the probability
of observing the data X given the distribution's parameters θ. For independent data
points, it's the product of individual probabilities: L(X; θ) = ∏ p(xᵢ ; θ).

2. Maximize the Likelihood: The goal is to find the parameter values θ that maximize
L(X; θ).

3. Log-Likelihood: For computational stability, the log-likelihood is often used:

log L(X; θ) = ∑ log p(xᵢ ; θ). Maximizing the log-likelihood is equivalent to maximizing
the likelihood.

4. Negative Log-Likelihood: Often, minimization is preferred over maximization. The


negative log-likelihood is used: -log L(X; θ) = -∑ log p(xᵢ; θ).

Application in Machine Learning:

❖ Density Estimation: MLE is used to estimate the underlying probability distribution of


data.

❖ Predictive Modelling: In regression, MLE can be used to estimate the parameters of a


model that predicts an output y given an input x by maximizing the conditional
likelihood p(y | x, h), where h represents the model parameters.
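
As a concrete illustration, the parameters of a normal distribution can be estimated by minimizing the negative log-likelihood; a sketch using SciPy on synthetic data (the normal model and the optimizer settings are assumptions made only for this example):

```python
import numpy as np
from scipy import stats, optimize

rng = np.random.default_rng(2)
data = rng.normal(loc=5.0, scale=1.5, size=1000)   # synthetic sample

def neg_log_likelihood(params, x):
    """-log L(X; theta) = -sum(log p(x_i; theta)) for a normal model."""
    mu, sigma = params
    return -np.sum(stats.norm.logpdf(x, loc=mu, scale=sigma))

# Minimizing the negative log-likelihood is equivalent to maximizing L
result = optimize.minimize(neg_log_likelihood, x0=[0.0, 1.0], args=(data,),
                           bounds=[(None, None), (1e-6, None)])
mu_hat, sigma_hat = result.x
print(mu_hat, sigma_hat)   # close to the sample mean and standard deviation
```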

Gaussian Mixture Model and Expectation-Maximization (EM) Algorithm

Gaussian Mixture Models (GMMs) leverage the Maximum Likelihood Estimation (MLE)
framework for clustering by assuming data is generated from a mixture of Gaussian
distributions, each with its own parameters. The Expectation-Maximization (EM) algorithm is


employed to estimate these parameters, particularly when dealing with latent variables, such as
unobserved group memberships (e.g., gender influencing weight), enabling effective modelling
of complex data distributions.

Generally, there can be many unspecified distributions with different set of parameters. The
EM algorithm has two stages:

1. Expectation (E) Stage – In this stage, the expected PDF and its parameters are
estimated for each latent variable.
2. Maximization (M) Stage – In this, the parameters are optimized using the MLE
function.

This process is iterative, and the iteration is continued till all the latent variables are fitted by
probability distributions effectively along with the parameters.
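
scikit-learn's GaussianMixture implements this EM loop; a brief sketch on synthetic one-dimensional data standing in for two latent groups (e.g., weights influenced by gender):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(3)
# Synthetic data drawn from two latent groups
x = np.concatenate([rng.normal(55, 5, 300), rng.normal(75, 6, 300)]).reshape(-1, 1)

# EM alternates E and M steps internally until the log-likelihood converges
gmm = GaussianMixture(n_components=2, random_state=0).fit(x)

print(gmm.means_.ravel())        # estimated component means
print(gmm.covariances_.ravel())  # estimated component variances
print(gmm.predict(x[:5]))        # most likely latent group for each sample
```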

Non-parametric Density Estimation

Non-parametric density estimation, which can be generative (like Parzen windows, finding
p(x | θ)) or discriminative (finding p(θ | x)), avoids assumptions about the underlying data
distribution. Examples include Parzen windows and k-Nearest Neighbors (KNN).

Parzen Window

Parzen window is a non-parametric method to estimate the probability density function (PDF)
of a dataset. It works by placing a "window" function (often a hypercube) around each data
point and summing these windows to approximate the overall density.

Let there be 'n' samples, X = {x₁, x₂, …, xₙ}.

The samples are drawn independently from the same distribution, i.e., they are independent and identically distributed (i.i.d.). Let R be the region that covers 'k' samples of the total 'n' samples. Then, the probability that a sample falls in R is given as:

p = k/n

The density estimate is given as:

p(x) = (k/n) / V


where V is the volume of the region R. If R is a hypercube centred at x and h is the edge length of the hypercube, the volume V is h² for a 2D square and h³ for a 3D cube (in general, h^d in d dimensions).

The Parzen window function can be taken as (a standard choice):

φ(u) = 1 if |uⱼ| ≤ 1/2 for every coordinate j, and 0 otherwise

The window indicates whether a sample is inside the region or not. The Parzen probability density function estimate using the above window is then:

p(x) = (1/n) Σ_{i=1}^{n} (1/h^d) φ((x − xᵢ)/h)

This window can be replaced by any other function too. If a Gaussian function is used, then it is called a Gaussian density (kernel) estimate.
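
A Gaussian Parzen-window (kernel) density estimate can be written in a few lines; a sketch on synthetic one-dimensional data, with h as the assumed window width:

```python
import numpy as np

rng = np.random.default_rng(4)
samples = rng.normal(0.0, 1.0, size=200)   # synthetic 1D data
h = 0.5                                     # window (bandwidth) length

def parzen_gaussian(x, data, h):
    """p(x) = (1/n) * sum_i (1/h) * phi((x - x_i)/h) with a Gaussian window."""
    u = (x - data) / h
    phi = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)
    return np.mean(phi / h)

print(parzen_gaussian(0.0, samples, h))   # higher density near the mode
print(parzen_gaussian(3.0, samples, h))   # much smaller density in the tail
```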

KNN Estimation

The KNN estimation is another non-parametric density estimation method. Here, the initial
parameter k is determined and based on that k-neighbours are determined. The probability
density function estimate is the average of the values that are returned by the neighbours.
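
A one-dimensional sketch of KNN density estimation, assuming the estimate p(x) ≈ k / (nV) with V the length of the interval reaching the k-th nearest neighbour:

```python
import numpy as np

rng = np.random.default_rng(5)
samples = rng.normal(0.0, 1.0, size=500)   # synthetic 1D data

def knn_density(x, data, k=20):
    """p(x) ~ k / (n * V), with V the width of the smallest interval
    around x containing the k nearest samples."""
    dist = np.sort(np.abs(data - x))
    radius = dist[k - 1]            # distance to the k-th nearest neighbour
    volume = 2 * radius             # interval [x - r, x + r] in 1D
    return k / (len(data) * volume)

print(knn_density(0.0, samples))    # high density near the centre
print(knn_density(3.0, samples))    # low density in the tail
```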

2.4 FEATURE ENGINEERING AND DIMENSIONALITY REDUCTION TECHNIQUES

Feature engineering is crucial for improving machine learning model performance by carefully
selecting and transforming input features. It encompasses two main aspects: feature
transformation, which involves creating new features from existing ones (e.g., calculating BMI
from height and weight), and feature subset selection, which focuses on identifying the most
relevant features to reduce dimensionality and computational complexity without sacrificing
reliability. This process combats the "curse of dimensionality," where processing high-
dimensional data becomes intractable, by employing strategies like greedy search to find
optimal feature subsets.


The features can be removed based on two aspects:

❖ Feature Relevancy: Feature relevancy emphasizes the importance of selecting features


that directly contribute to the classification task. Not all features are created equal; some
provide significantly more information than others
❖ Feature Redundancy: Feature redundancy focuses on eliminating features that
provide overlapping or easily derivable information. When features are redundant, they
add unnecessary complexity to the model without improving its performance.

So, the procedure is,

1. Generate all possible subsets.


2. Evaluate the subsets and model performance
3. Evaluate the results for optimal feature selection

Filter-based selection uses statistical measures for assessing features. In this approach, no
learning algorithm is used. Correlation and information gain measures like mutual information
and entropy are all examples of this approach.

Wrapper-based methods use classifiers to identify the best features. These are selected and
evaluated by the learning algorithms. This procedure is computationally intensive but has
superior performance.

2.4.1 Stepwise Forward Selection

This procedure starts with an empty set of attributes. Every time, an attribute is tested for
statistical significance for best quality and is added to be reduced set. This process is continued
till a good reduced set of attributes is obtained.

2.4.2 Stepwise Backward Elimination

This procedure starts with a complete set of attributes. At every stage, the procedure removes
the worst attribute from the set, leading to the reduced set.

Combined Approach Both forward and reverse methods can be combined so that the
procedure can add the best attribute and remove the worst attribute.


2.4.3 Principal Component Analysis

The idea of the principal component analysis (PCA) or KL transform is to transform a given
set of measurements to a new set of features so that the features exhibit high information
packing properties. This leads to a reduced and compact set of features.

Consider a group of random vectors of the form x = (x₁, x₂, …, xₙ)ᵀ.

The mean vector of the set of random vectors is defined as:

mx = E{x}

The operator E refers to the expected value of the population. This is calculated theoretically using the probability density functions (PDF) of the elements xᵢ and the joint probability density functions between the elements xᵢ and xⱼ. From this, the covariance matrix can be calculated as:

C = E{(x − mx) (x − mx)ᵀ}

For M random vectors, when M is large enough, the mean vector and covariance matrix can be approximately calculated from the samples as:

mx ≈ (1/M) Σ_{k=1}^{M} xₖ

C ≈ (1/M) Σ_{k=1}^{M} xₖ xₖᵀ − mx mxᵀ
The covariance matrix is real and symmetric, allowing for the calculation of eigenvectors (eᵢ)
and eigenvalues (λᵢ), which are ordered by magnitude (λ₁ ≥ λ₂...). These eigenvectors form the
transformation matrix (A), used to map data (x) to a new representation (y) through
y = A(x - mₓ), and this transformation is also known as the Karhunen-Loeve or Hotelling
transform. The original data can be reconstructed via x = Aᵀ y + mₓ. The goal of PCA is to
reduce the data's dimensionality by using only the most significant eigenvectors, achieving
maximum compression, with a reconstruction using the K largest eigen values represented by
x = 𝑨𝑻𝒌 y + mₓ.


The advantages of PCA are immense. It reduces the attribute list by eliminating all irrelevant attributes. The PCA algorithm is as follows:

1. The target dataset x is obtained


2. The mean is subtracted from the dataset. Let the mean be m. Thus, the adjusted dataset
is X – m. The objective of this process is to transform the dataset with zero mean.
3. The covariance of dataset x is obtained. Let it be C.
4. Eigen values and eigen vectors of the covariance matrix are calculated.
5. The eigen vector of the highest eigen value is the principal component of the dataset.
The eigen values are arranged in a descending order. The feature vector is formed with
these eigen vectors in its columns.
a. Feature vector = {eigen vector1, eigen vector2, …., eigen vectorn}
6. Obtain the transpose of feature vector. Let it be A.
7. PCA transform is y = A × (x – m), where x is the input dataset, m is the mean, and A is
the transpose of the feature vector.

The original data can be retrieved using the formula given below (since A is orthogonal, A⁻¹ = Aᵀ):

Original data (f) = {A⁻¹ × y} + m = {Aᵀ × y} + m
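
The steps above translate almost line-for-line into NumPy; a sketch on a small synthetic dataset:

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.normal(size=(100, 3))            # synthetic dataset, 100 samples x 3 features

m = x.mean(axis=0)                       # step 2: mean vector
xc = x - m                               # zero-mean data
C = np.cov(xc, rowvar=False)             # step 3: covariance matrix

eigvals, eigvecs = np.linalg.eigh(C)     # step 4: eigenvalues / eigenvectors
order = np.argsort(eigvals)[::-1]        # step 5: descending eigenvalues
A = eigvecs[:, order].T                  # step 6: transpose of the feature vector

y = (A @ xc.T).T                         # step 7: PCA transform y = A(x - m)

# Reconstruction: x = A^T y + m (A is orthogonal, so A^-1 = A^T)
x_back = (A.T @ y.T).T + m
assert np.allclose(x_back, x)
```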

From below figure, one can infer the relevance of the attributes. The scree plot indicates that
the first attribute is more important than all other attributes.

Fig. 2.5: Scree plot


❖ PCA reduces data to a smaller, representative matrix, effectively removing non-contributing attributes.
❖ The original data can be perfectly reconstructed, ensuring no information loss.
❖ A scree plot visualizes the importance of principal components, highlighting key
attributes.
❖ In a 246-attribute dataset, a scree plot revealed only 6 attributes were significant after
PCA.

2.4.4 Linear Discriminant Analysis (LDA)

LDA is also a feature reduction technique like PCA. The focus of LDA is to project higher dimensional data onto a line (lower dimensional data). LDA is also used to classify the data. Let there be two classes, c₁ and c₂, and let μ₁ and μ₂ be the means of the patterns of the two classes. The means of the classes c₁ and c₂ are computed as:

μ₁ = (1/N₁) Σ_{x ∈ c₁} x   and   μ₂ = (1/N₂) Σ_{x ∈ c₂} x

The aim of LDA is to optimize the function:

J(V) = (Vᵀ σ_B V) / (Vᵀ σ_W V)

where V is the linear projection, and σ_B and σ_W are the between-class and within-class scatter matrices, respectively. For the two-class problem, these matrices are given as:

σ_B = (μ₁ − μ₂)(μ₁ − μ₂)ᵀ

σ_W = Σ_{x ∈ c₁} (x − μ₁)(x − μ₁)ᵀ + Σ_{x ∈ c₂} (x − μ₂)(x − μ₂)ᵀ

The maximization of J(V) should satisfy the generalized eigenvalue equation:

σ_B V = λ σ_W V

As σ_B V is always in the direction of (μ₁ − μ₂), V can be given as:

V = σ_W⁻¹ (μ₁ − μ₂)


Let V = {v₁, v₂, …, v_d} be the generalized eigenvectors of σ_B and σ_W, where the d eigenvectors corresponding to the largest eigenvalues are retained, as in PCA. The transformation of x is then given as:

y = Vᵀ x

Like in PCA, only the eigenvectors with the largest eigenvalues can be retained to form the projections.

2.4.5 Singular value Decomposition (SVD)

SVD is another useful decomposition technique. Let A be the matrix, then the matrix A can be
decomposed as:

A = USVT

Here, A is the given matrix of dimension m × n, U is an orthogonal matrix of dimension m × n, S is a diagonal matrix of dimension n × n, and V is an orthogonal matrix of dimension n × n. The procedure for finding the decomposition matrices is as follows:

1. For the given matrix, find AAᵀ.
2. Find the eigenvalues and eigenvectors of AAᵀ.
3. Sort the eigenvalues in descending order. Pack the corresponding eigenvectors as a matrix U.
4. Arrange the square roots of the eigenvalues along the diagonal. This diagonal matrix is S.
5. Find the eigenvalues and eigenvectors of AᵀA. Pack these eigenvectors as a matrix called V.

Thus, A = USVᵀ. Here, U and V are orthogonal matrices. The columns of U and V are the left and right singular vectors, respectively. SVD is useful in compression, as one can decide to retain only the first k components instead of the original matrix A as:

A ≈ Σ_{i=1}^{k} sᵢ uᵢ vᵢᵀ

Based on the choice of k (the number of retained components), the compression can be controlled.

The main advantage of SVD is compression. A matrix, say an image, can be decomposed, and only certain components can be selectively retained by setting all other singular values to zero. This reduces the storage required for the image while retaining its quality. SVD is useful in data reduction too.
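
Truncated reconstruction with only the k largest singular values is straightforward with NumPy; a sketch on a random matrix standing in for an image:

```python
import numpy as np

rng = np.random.default_rng(7)
A = rng.normal(size=(50, 40))                      # stand-in for an image matrix

U, s, Vt = np.linalg.svd(A, full_matrices=False)   # A = U S V^T

k = 10                                             # number of components to retain
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]        # rank-k approximation

# The reconstruction error shrinks as k grows; k controls the trade-off
print(np.linalg.norm(A - A_k) / np.linalg.norm(A))
```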


BASIC LEARNING THEORY

2.5 DESIGN OF LEARNING SYSTEM

A system that is built around a learning algorithm is called a learning system. The design of a learning system focuses on these steps:

1. Choosing a training experience


2. Choosing a target function
3. Representation of a target function
4. Function approximation

Training Experience

Let us consider designing a chess game. In direct experience, individual board states and the correct moves of the chess game are given directly. In indirect experience, only the move sequences and final results are given. The training experience also depends on the presence of a supervisor who can label all valid moves for a board state. In the absence of a supervisor, the game agent plays against itself and learns the good moves, provided the training samples cover all scenarios, or in other words, are distributed well enough for performance computation. If the training samples and testing samples have the same distribution, the results would be good.

Determine the Target Function

The next step is the determination of a target function. In this step, the type of knowledge that needs to be learnt is determined. In direct experience, a board move is selected and it is determined whether it is a good move or not against all other moves. If it is the best move, then it is chosen via a mapping of the form B → M, where B is the set of legal board states and M is the set of legal moves. In indirect experience, all legal moves are accepted and a score is generated for each. The move with the largest score is then chosen and executed.

Determine the Target Function Representation

The representation of knowledge may be a table, a collection of rules, or a neural network. A linear combination of board features can be used as:

v = ω₀ + ω₁x₁ + ω₂x₂ + ω₃x₃

where x₁, x₂, and x₃ represent different board features and ω₀, ω₁, ω₂, and ω₃ represent weights.


Choosing an Approximation Algorithm for the Target Function

The focus is to choose weights that fit the given training samples effectively. The aim is to reduce the error, given by the standard squared-error criterion:

E = Σ_b (v_train(b) − v̂(b))²

Here, b is a sample board state, v_train(b) is its training value, and v̂(b) is the value predicted by the current hypothesis. The approximation is carried out by computing the error as the difference between the trained and predicted hypothesis values. Let this error be error(b). Then, for every board feature xᵢ, the weights are updated as:

𝝎𝒊 = 𝝎𝒊 + µ × 𝐞𝐫𝐫𝐨𝐫(𝐛) × 𝒙𝒊

Here, µ is the constant that moderates the size of the weight update.
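
The weight update above is the LMS rule; a minimal sketch of one training pass, where the board features and training scores are synthetic placeholders:

```python
import numpy as np

rng = np.random.default_rng(8)
X = rng.normal(size=(500, 3))              # synthetic board features x1..x3
true_w = np.array([2.0, -1.0, 0.5])
scores = X @ true_w + 4.0                  # synthetic training values v_train(b)

w = np.zeros(4)                            # [w0, w1, w2, w3]
mu = 0.05                                  # moderates the size of each update

for features, v_train in zip(X, scores):
    xb = np.concatenate(([1.0], features)) # prepend 1 for the bias term w0
    v_hat = w @ xb                         # value predicted by current hypothesis
    error = v_train - v_hat                # error(b)
    w = w + mu * error * xb                # w_i <- w_i + mu * error(b) * x_i

print(w)   # close to [4.0, 2.0, -1.0, 0.5] after the pass
```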

Thus, the learning system has the following components:

❖ A Performance system to allow the game to play against itself.
❖ A Critic system to generate the training samples.
❖ A Generalizer system to generate a hypothesis based on the samples.
❖ An Experimenter system to generate a new problem (board state) based on the currently learnt function. This is sent as input to the performance system.

2.6 INTRODUCTION TO CONCEPT LEARNING

Concept learning is a learning method where a system acquires general knowledge or


categories by observing training examples, identifying common features, and creating a
simplified model to classify new instances; this process, which involves abstraction and
generalization, functions as a Boolean classifier, assigning true or false values to objects based
on whether they fit the learned concept, much like humans categorize animals by recognizing
distinguishing features.

Concept learning requires three things:

1. Input - Training dataset which is a set of training instances, each labelled with the name
of a concept or category to which it belongs. Use this past experience to train and build
the model.


2. Output - Target concept or target function f. It is a mapping function f(x) from input x to output y. It determines the specific or common features needed to identify an object. In other words, it finds the hypothesis that determines the target concept. For example, the specific set of features to identify an elephant from all animals.
3. Test - New instances to test the learned model.

Formally, Concept learning is defined as "Given a set of hypotheses, the learner searches
through the hypothesis space to identify the best hypothesis that matches the target concept".

Consider the following set of training instances shown in below table

Sl. No. Horns Tail Tusks Paws Fur Color Hooves Size Elephant
1 No Short Yes No No Black No Big Yes
2 Yes Short No No No Brown Yes Medium No
3 No Short Yes No No Black No Medium Yes
4 No Long No Yes Yes White No Medium No
5 No Short Yes Yes Yes Black No Big Yes

Here, in this set of training instances, the independent attributes considered are 'Horns', 'Tail', 'Tusks', 'Paws', 'Fur', 'Color', 'Hooves' and 'Size'. The dependent attribute is 'Elephant'. The target concept is to identify the animal to be an Elephant.
Let us now take this example and understand further the concept of hypothesis.

Target Concept: Predict the type of animal - For example -'Elephant'.

2.6.1 Representation of a Hypothesis

A hypothesis 'h' approximates a target function 'f' to represent the relationship between the
independent attributes and the dependent attribute of the training instances. The hypothesis is
the predicted approximate model that best maps the inputs to outputs. Each hypothesis is
represented as a conjunction of attribute conditions in the antecedent part.

For example, (Tail = Short) ∧ (Colour = Black) …

The set of hypotheses in the search space is called the hypotheses (the plural form of hypothesis). Generally, 'H' is used to represent the set of hypotheses and 'h' is used to represent a candidate hypothesis.


Each attribute condition is a constraint on the attribute, represented as an attribute-value pair. In the antecedent of an attribute condition of a hypothesis, each attribute can take a value as either '?' or '𝜙', or can hold a single value.

❖ "?" denotes that the attribute can take any value [e.g., Colour = ?]
❖ "𝜙" denotes that the attribute cannot take any value, i.e., it represents a null value [e.g., Horns = 𝜙]
❖ A single value denotes a specific value from the acceptable values of the attribute, i.e., the attribute 'Tail' can take a value such as 'Short' [e.g., Tail = Short]

For example, a hypothesis 'h' will look like (this particular instance is only illustrative):

h = <No, Short, Yes, ?, ?, Black, No, ?>

Given a test instance x, we say h(x) = 1 if the test instance x satisfies this hypothesis h.

The training dataset given above has 5 training instances with 8 independent attributes and one dependent attribute. Different hypotheses can be predicted for the target concept.

The task is to predict the best hypothesis for the target concept (an elephant). The most general
hypothesis can allow any value for each of the attribute.

It is represented as:

<?, ?, ?, ?, ?, ?, ?, ?>. This hypothesis indicates that any animal can be an elephant.

The most specific hypothesis will not allow any value for each of the attribute <
𝜙, 𝜙, 𝜙, 𝜙, 𝜙, 𝜙, 𝜙, 𝜙 >. This hypothesis indicates that no animal can be an elephant.

The target concept mentioned in this example is to identify the conjunction of specific features
from the training instances to correctly identify an elephant.

Thus, concept learning can also be called as Inductive Learning that tries to induce a general
function from specific training instances. This way of learning a hypothesis that can produce
an approximate target function with a sufficiently large set of training instances can also


approximately classify other unobserved instances and is called as inductive learning


hypothesis. We can only determine an approximate target function because it is very difficult
to find an exact target function with the observed training instances. That is why a hypothesis
is an approximate target function that best maps the inputs to outputs.

2.6.2 Hypothesis Space

Hypothesis space is the set of all possible hypotheses that approximate the target function f. In other words, the set of all possible approximations of the target function can be defined as the
hypothesis space. From this set of hypotheses in the hypothesis space, a machine learning
algorithm would determine the best possible hypothesis that would best describe the target
function or best fit the outputs. Generally, a hypothesis representation language represents a
larger hypothesis space. Every machine learning algorithm would represent the hypothesis
space in a different manner about the function that maps the input variables to output variables.
For example, a regression algorithm represents the hypothesis space as a linear function
whereas a decision tree algorithm represents the hypothesis space as a tree.

The set of hypotheses that can be generated by a learning algorithm can be further reduced by
specifying a language bias.

The subset of hypothesis space that is consistent with all-observed training instances is called
as Version Space. Version space represents the only hypotheses that are used for the
classification.

For example, each of the attributes given in the table above has the following possible set of values.

Horns-Yes, No
Tail-Long, Short
Tusks-Yes, No
Paws-Yes, No
Fur-Yes, No
Color - Brown, Black, White
Hooves-Yes, No
Size-Medium, Big

Considering these values for each of the attributes, there are (2×2×2×2×2×3×2×2) = 384 distinct instances covering all the 5 instances in the training dataset.

So, we can generate (4×4×4×4×4×5×4×4) = 81,920 distinct hypotheses when including the two additional values [?, 𝜙] for each attribute. However, any hypothesis containing one or more 𝜙 symbols represents the empty set of instances; that is, it classifies every instance as a negative


instance. Therefore, there will be (3×3×3×3×3×4×3×3 + 1) = 8,749 distinct hypotheses by including only '?' as the additional value for each attribute, plus one hypothesis representing the empty set of instances. Thus, the hypothesis space is much larger, and hence we need efficient learning algorithms to search for the best hypothesis from the set of hypotheses.

Hypothesis ordering is also important wherein the hypotheses are ordered from the most
specific one to the most general one in order to restrict searching the hypothesis space
exhaustively.

2.6.3 Heuristic Space Search

Heuristic search is a search strategy that finds an optimized hypothesis/solution to a problem


by iteratively improving the hypothesis/solution based on a given heuristic function or a cost
measure. Heuristic search methods will generate a possible hypothesis that can be a solution in
the hypothesis space or a path from the initial state. This hypothesis will be tested with the
target function or the goal state to see if it is a real solution. If the tested hypothesis is a real
solution, then it will be selected. This method generally increases the efficiency because it is
guaranteed to find an improved hypothesis but may not find the best hypothesis. It is useful for solving tough problems which could not be solved by any other method. A typical example problem solved by heuristic search is the travelling salesman problem.

Several commonly used heuristic search methods are hill climbing, constraint satisfaction, best-first search, simulated annealing, the A* algorithm, and genetic algorithms.

2.6.4 Generalization and Specialization

In order to understand about how we construct this concept hierarchy, let us apply this general
principle of generalization/specialization relation. By generalization of the most specific
hypothesis and by specialization of the most general hypothesis, the hypothesis space can be
searched for an approximate hypothesis that matches all positive instances but does not match
any negative instance.
Searching the Hypothesis Space
There are two ways of learning the hypothesis, consistent with all training instances from the
large hypothesis space.
1. Specialization - General to Specific learning
2. Generalization - Specific to General learning


Generalization-Specific to General Learning This learning methodology will search through


the hypothesis space for an approximate hypothesis by generalizing the most specific
hypothesis.

Specialization - General to Specific learning This learning methodology will search through
the hypothesis space for an approximate hypothesis by specializing the most general
hypothesis.

2.6.5 Hypothesis Space Search by Find-S Algorithm

Algorithm 2.1: Find-S

Find-S algorithm is guaranteed to converge to the most specific hypothesis in H that is


consistent with the positive instances in the training dataset. Obviously, it will also be
consistent with the negative instances. Thus, this algorithm considers only the positive
instances and eliminates negative instances while generating the hypothesis. It initially starts
with the most specific hypothesis.
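
A compact sketch of the Find-S idea on the elephant training data (the attribute encoding is illustrative):

```python
def find_s(examples):
    """Find-S: generalize the most specific hypothesis over positive examples only."""
    hypothesis = None
    for attributes, label in examples:
        if label != "Yes":              # negative instances are ignored
            continue
        if hypothesis is None:          # first positive example: copy it
            hypothesis = list(attributes)
        else:                           # differing attribute values become '?'
            hypothesis = [h if h == a else "?" for h, a in zip(hypothesis, attributes)]
    return hypothesis

# (Horns, Tail, Tusks, Paws, Fur, Color, Hooves, Size) -> Elephant?
data = [
    (("No", "Short", "Yes", "No", "No", "Black", "No", "Big"), "Yes"),
    (("Yes", "Short", "No", "No", "No", "Brown", "Yes", "Medium"), "No"),
    (("No", "Short", "Yes", "No", "No", "Black", "No", "Medium"), "Yes"),
    (("No", "Long", "No", "Yes", "Yes", "White", "No", "Medium"), "No"),
    (("No", "Short", "Yes", "Yes", "Yes", "Black", "No", "Big"), "Yes"),
]
print(find_s(data))   # ['No', 'Short', 'Yes', '?', '?', 'Black', 'No', '?']
```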

Limitations of Find – S Algorithm

1. The Find-S algorithm tries to find a hypothesis that is consistent with the positive instances, ignoring all negative instances. As long as the training dataset is consistent, the hypothesis found by this algorithm remains consistent.
2. The algorithm finds only one unique hypothesis, wherein there may be many other
hypotheses that are consistent with the training dataset.


3. Many times, the training dataset may contain some errors; hence such inconsistent data
instances can mislead this algorithm in determining the consistent hypothesis since it
ignores negative instances.

Hence, it is necessary to find the set of hypotheses that are consistent with the training data
including the negative examples. To overcome the limitations of Find-S algorithm, Candidate
Elimination algorithm was proposed to output the set of all hypotheses consistent with the
training dataset.

2.6.6 Version Spaces

The version space contains the subset of hypotheses from the hypothesis space that is consistent
with all training instances in the training dataset.

List – Then – Eliminate Algorithm

The principle idea of this learning algorithm is to initialize the version space to contain all
hypotheses and then eliminate any hypothesis that is found inconsistent with any training
instances. Initially, the algorithm starts with a version space to contain all hypotheses scanning
each training instance. The hypotheses that are inconsistent with the training instance are
eliminated. Finally, the algorithm outputs the list of remaining hypotheses that are all
consistent.

Algorithm 2.2: List-Then-Eliminate

The above algorithm works fine if the hypothesis space is finite but practically it is difficult to
deploy this algorithm. Hence, a variation of this idea is introduced in the Candidate Elimination
algorithm.


Version Spaces and the Candidate Elimination Algorithm

Algorithm 2.3: Candidate Elimination

Version space learning generates all consistent hypotheses. This algorithm computes the version space by combining two cases, namely,

❖ Specific to General learning - Generalize S to include the positive example
❖ General to Specific learning - Specialize G to exclude the negative example

Using the Candidate Elimination algorithm, we can compute the version space containing all
(and only those) hypotheses from H that are consistent with the given observed sequence of
training instances. The algorithm defines two boundaries: the 'general boundary', which is the set of all hypotheses that are the most general, and the 'specific boundary', which is the set of all hypotheses that are the most specific. Thus, the algorithm represents the version space compactly by storing only the most general and most specific hypotheses, providing a compact representation of the List-Then-Eliminate algorithm.


Generating Positive Hypothesis 'S' - If it is a positive example, refine S to include the positive instance. We need to generalize S to include the positive instance. The hypothesis is the conjunction of 'S' and the positive instance. When generalizing, for the first positive instance, add to S all minimal generalizations such that S is filled with the attribute values of the positive instance. For each subsequent positive instance scanned, compare the attribute values of the positive instance and the S obtained in the previous iteration. If the attribute values of the positive instance and S are different, fill that field value with a "?". If the attribute values of the positive instance and S are the same, no change is required.

If it is a negative instance, S is left unchanged (the instance is skipped).

Generating Negative Hypothesis 'G' - If it is a negative instance, refine G to exclude the negative instance. Then, prune G to exclude all hypotheses in G that are inconsistent with the positive instances. The idea is to add to G all minimal specializations that exclude the negative instance and remain consistent with the positive instances. The negative instances thus shape the general hypotheses.

If the attribute values of the positive and negative instances are different, then fill that field with the positive instance value so that the hypothesis does not classify that negative instance as true. If the attribute values of the positive and negative instances are the same, then there is no need to update 'G'; fill that attribute value with a '?'.

Generating Version Space - [Consistent Hypothesis] We need to take the combination of sets
in 'G' and check that with 'S'. When the combined set fields are matched with fields in 'S', then
only that is included in the version space as consistent hypothesis.

2.7 MODELLING IN MACHINE LEARNING

Machine learning models are created by training algorithms on datasets to make predictions on
new data, a process involving parameter learning and model evaluation using separate training
and testing sets to prevent overfitting. The model's accuracy is assessed by measuring the
difference between predicted and actual values, often using Mean Squared Error, with lower
errors indicating better predictive performance.


Machine Learning Process

The four basic steps in the machine learning process are:

1. Choose a machine learning algorithm to suit the training data and the problem domain
2. Input the training dataset and train the machine learning algorithm to learn from the
data and capture the patterns in the data
3. Tune the parameters of the model to improve the accuracy of learning of the algorithm
4. Evaluate the learned model once the model is built

2.7.1 Model Selection and Model Evaluation

The biggest challenge in machine learning is choosing an algorithm that suits the problem.
Hence, model selection and assessment are very important and deal with two types of
complexities.

1. Model Performance-How well the model performs on the training dataset?


2. Model Complexity-How much complexity the model possesses after the training phase
is over?

Model Selection is a process of selecting one good enough model among different machine
learning models for the dataset or selecting different sets of features or hyperparameters for the
same machine learning model. It is difficult to find the best model because all models exhibit
some predictive error for the problem, so at least a good enough model should be selected that
performs fairly well with the dataset.

Some of the approaches used for selecting a machine learning model are listed below:

❖ Use re-sample methods and split the dataset as training, testing and validation datasets
and observe the performance of the model over all the phases. This approach is suitable
for smaller datasets.
❖ The simplest approach is to fit a model on the training dataset and to compute measures
like error or accuracy.
❖ The use of probabilistic framework and quantification of the performance of the model
as a score is the third approach.

These methods are discussed in the following sections.


2.7.2 Re-sampling Methods

Re-sampling is a technique to select a model by reconstructing the training dataset and test
dataset by randomly choosing instances by some method from the given dataset. This method
involves selecting different instances repeatedly from a training dataset to tune a model. It is
done to improve the accuracy of a model. The common re-sampling model selection methods
are Random train/test splits, Cross-validation (K-fold, LOOCV, etc.) and Bootstrap.

Cross-Validation

Cross-Validation is a method by which we can tune the model with only training dataset. It is
a model evaluation approach by which we can set aside some data of the training dataset for
validation and fit the rest of the data to train the model. The best model is found by estimating
the average of errors on different test data. The popular cross-validation family of methods
includes Holdout method, K-fold cross-validation, Stratified cross-validation and Leave-One-
Out Cross-Validation (LOOCV).

Holdout Method

This is the simplest method of cross-validation. The dataset is split into two subsets called
training dataset and test dataset. The model is trained using the training dataset and then
evaluated using the test dataset. This holdout method can be applied for a single time which is
called as single holdout method or it can be repeated for more than once which is called as
repeated holdout method. The average performance on the test dataset is estimated to evaluate
the model. Even though this model is very simple, it can exhibit high variance and the
performance largely depends on how the dataset is split.

K-fold Cross-Validation

Another way of cross-validating is using a k-fold cross-validation, which will split the training
dataset into k equal folds/parts creating k-1 subsets of training set and one test subset. Out of
the k folds, k-1 folds are used for training and one fold is used for testing the model. This has
to be performed for k iterations and during each iteration a different fold is selected for testing.
The average performance of the model on k iterations is the final estimate of the model
performance.

The illustration of this re-sampling is shown in below figure.


Fig. 2.6: Illustration of K-fold Cross-Validation

Algorithm 2.4: K-fold Cross Validation
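
K-fold cross-validation is available directly in scikit-learn; a sketch using a placeholder classifier on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# 5-fold CV: each fold is used once as the test set, the rest for training
kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=kf)

print(scores)          # accuracy on each of the 5 folds
print(scores.mean())   # final estimate of model performance
```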

Stratified K-fold Cross-Validation

This method is similar to k-fold cross-validation but with a slight difference. Here, it is ensured
that while splitting the dataset into k folds, each fold should contain the same proportion of
instances with a given categorical value. This is called stratified cross-validation.


Leave-One-Out Cross-Validation (LOOCV)

This method repeatedly splits the n data instances of the dataset into a training dataset containing n − 1 data instances, leaving one data instance out for evaluating the model. This process is repeated n times, and the average test error is then estimated for the model. Even though this method is expensive and time consuming because it has to run n times (i.e., once per data instance in the dataset), it has less bias. For example, if the training dataset contains 100 data instances, then 99 instances are used for training and one instance to test or evaluate the model. This process is repeated 100 times, selecting a different instance as the holdout instance for testing in each iteration.

The illustration of this re-sampling is shown in below figure.

Fig. 2.7: Illustration of Leave-One-Out Cross-Validation


Algorithm 2.5: LOOCV

Model Performance

Classifier models are discussed in the subsequent chapters. The focus of this section is the
evaluation of classifier models. Classifiers are unstable as a small change in the input can
change the output. A solid framework is needed for proper evaluation. There are several metrics
that can be used to describe the quality and usefulness of a classifier. One way to compute the
metrics is to form a table called contingency table. For example, consider a test for detecting a
disease, say cancer. Below table shows a contingency table for this scenario.

Test vs Disease      Has Disease (Cancer)      Has No Disease (No Cancer)

Positive             True Positive             False Positive
Negative             False Negative            True Negative

In this table, True Positive (TP) = number of cancer patients who are correctly classified by the test, and True Negative (TN) = number of normal patients who do not have cancer and are correctly detected. The two errors involved in this process are False Positive (FP), an alarm where the test shows positive although the patient has no disease, and False Negative (FN), an error where the test says negative or normal although the patient actually has cancer. FP and FN are costly errors in this classification process.

The metrics that can be derived from this contingency table are listed below:

1. Sensitivity - The sensitivity of a test is the probability that it will produce a true positive
result when used on a test dataset. It is also known as true positive rate. The sensitivity
of a test can be determined by calculating:
   TP / (TP + FN)
2. Specificity - The specificity of a test is the probability that a test will produce a true
negative result when used on test dataset.


   TN / (TN + FP)
3. Positive Predictive Value - The positive predictive value of a test is the probability that
an object is classified correctly when a positive test result is observed.
   TP / (TP + FP)
4. Negative Predictive Value - The negative predictive value of a test is the probability
that an object is not classified properly when a negative test result is observed.
   TN / (TN + FN)
5. Accuracy - The accuracy of the classifier is the proportion of correct predictions over all predictions, computed as:
   (TP + TN) / (TP + TN + FP + FN)
6. Precision - Precision is also known as positive predictive power. It is defined as the
ratio of true positive divided by the sum of true positive and false positive.
   Precision = TP / (TP + FP)

Precision indicates how good the classifier is in predicting the positive classes.


7. Recall - It is the same as sensitivity.

   Recall = Sensitivity = TP / (TP + FN)

The harmonic mean of precision and recall is called the F-measure or F1 score: F1 = 2 × Precision × Recall / (Precision + Recall). This is useful in assessing the model skill for a specific threshold.
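
All of these metrics follow from the contingency (confusion) matrix; a sketch with made-up predictions using scikit-learn:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])   # made-up ground truth
y_pred = np.array([1, 1, 1, 0, 0, 0, 0, 0, 1, 0])   # made-up predictions

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

sensitivity = tp / (tp + fn)          # recall / true positive rate
specificity = tn / (tn + fp)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(sensitivity, specificity, accuracy)
print(precision_score(y_true, y_pred), recall_score(y_true, y_pred),
      f1_score(y_true, y_pred))
```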

Classifier Performance as Distance Measures - The classifier performance can also be computed as a distance measure. The classifier's performance can be plotted as a point; a point towards the north-west indicates a better classifier. The Euclidean distance between the points of two classifiers gives a comparative performance measure. The value ranges from 0 to 1.

Visual Classifier Performance Receiver Operating Characteristic (ROC) curve and Precision-
Recall curves indicate the performance of classifiers visually. ROC curves are visual means of
checking the accuracy and comparison of classifiers. ROC is a plot of sensitivity (True Positive
Rate) and the 1-specificity (False Positive Rate) for a given model.

A sample ROC curve is shown in Figure 2.8, where the results of five classifiers are given. A is the ROC of an average classifier. The ideal classifier is E, where the area under the curve is 1.0.


Theoretically, the area under the curve can range from 0 to 1; an excellent classifier lies between 0.9 and 1. The rest of the classifiers B, C and D are categorized as good, better and still better based on their area under curve values.

Fig. 2.8: A Sample ROC Curve

Classifier predictions rely on a threshold value, like 0.5, to assign data points to classes, and
this threshold can be adjusted to manage false positives (FP) and false negatives (FN), which
is crucial when focusing on specific error types; the Receiver Operating Characteristic (ROC)
curve, which plots true positive rate against false positive rate, visually assesses model skill,
with curves above the diagonal indicating better performance and the area under the curve
(AUC) quantifying overall accuracy across various thresholds, where an AUC of 1 signifies a
perfect model.

Instead of just predicting labels, models can output probabilities, enabling more nuanced
evaluations through scoring functions like AUC, which measures a model's performance across
different threshold values; precision-recall curves, plotting precision against recall, are
particularly useful for imbalanced datasets where one class significantly outnumbers the other,
whereas ROC curves are preferred for balanced datasets, offering a comprehensive view of a
model's ability to discriminate between classes.
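
Both curves are computed from predicted probabilities rather than hard labels; a sketch using scikit-learn on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score, precision_recall_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
probs = model.predict_proba(X_te)[:, 1]        # predicted probabilities for class 1

fpr, tpr, _ = roc_curve(y_te, probs)           # ROC: TPR vs FPR over thresholds
precision, recall, _ = precision_recall_curve(y_te, probs)

print(roc_auc_score(y_te, probs))              # AUC = 1 only for a perfect model
```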

Scoring Methods

Another alternative for model selection is to combine the complexity of the model and
performance of the model as a score. Then, model selection is done by selecting the model that
maximizes or minimizes the score.

Minimum Description Length (MDL) is one such method. The aim is to describe target variable
and model in terms of bits. MDL is the principle of using minimum number of bits to represent
the data and model. It is a variant of the Occam's Razor principle, which states that the model with


the simplest explanation is the best model. MDL, too recommends the selection of the
hypothesis that minimizes the sum of two descriptions of data and model.

Let h be a learning model (hypothesis). Let L(h) be the number of bits used to represent the model and D be the predictions; then the MDL is given as:

L(h) + L(D|h)

where, L(D|h) is the number of bits used to represent the predictions D based on the training
set. MDL can be expressed in terms of negative log-likelihood also as:

MDL = − 𝒍𝒐𝒈(𝒑(𝜽)) − 𝒍𝒐𝒈(𝒑(𝒚 | 𝒙, 𝜽))

where y is the target variable, x is the input and θ represents the model parameters.
