ML Module 02
Module-2
Understanding Data-2: Bivariate Data And Multivariate Data, Multivariate Statistics,
Essential Mathematics For Multivariate Data, Feature Engineering And Dimensionality
Reduction Techniques.
Bivariate data examines the relationship between two variables, aiming to find connections. Bivariate data deals with the causes behind such relationships. For example, consider the correlation between shop temperature and sweater sales.
Bivariate analysis explores relationships between two variables through graphical methods like
scatter plots. Scatter plots visualize data, reveal trends, show differences, and indicate the
strength, shape, direction, and outliers of the relationship, aiding in exploratory data analysis
before further calculations.
Line Graphs are similar to scatter plots. The line chart for sales data is shown below.
COV(X, Y) = (1/N) ∑ᵢ₌₁ᴺ (xᵢ − E(X))(yᵢ − E(Y))
where E(X) and E(Y) are the expected values (means) of X and Y, and N is the number of observations.
Correlation
The Pearson correlation coefficient is the most common test for determining any association between two phenomena. It measures the strength and direction of a linear relationship between the x and y variables.
The correlation indicates the relationship between dimensions using its sign. The sign is more
important than the actual value.
• Positive sign: Indicates a direct relationship; as one variable increases, the other also
tends to increase.
• Negative sign: Indicates an inverse relationship; as one variable increases, the other
tends to decrease.
If a strong correlation exists, it might suggest that one of the variables is redundant and could
potentially be removed from the analysis.
The Pearson correlation coefficient, denoted as 'r', is calculated using the following formula:
r = COV(X, Y) / (σ_X σ_Y)
where σ_X and σ_Y are the standard deviations of X and Y.
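As a quick illustration of the two formulas above, the following is a minimal Python sketch (using NumPy, with hypothetical temperature and sales values that are not from these notes) that computes the covariance and the Pearson correlation coefficient:

import numpy as np

# Hypothetical bivariate sample: shop temperature (x) and sweater sales (y)
x = np.array([18, 20, 22, 25, 28, 30], dtype=float)
y = np.array([52, 48, 45, 38, 30, 25], dtype=float)

# COV(X, Y) = (1/N) * sum((x_i - E(X)) * (y_i - E(Y)))
cov_xy = np.mean((x - x.mean()) * (y - y.mean()))

# r = COV(X, Y) / (sigma_x * sigma_y)
r = cov_xy / (x.std() * y.std())
print(cov_xy, r)
print(np.corrcoef(x, y)[0, 1])   # cross-check: matches NumPy's built-in correlation

The negative r obtained here reflects the inverse relationship between temperature and sweater sales.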
In machine learning, almost all datasets are multivariate. Multivariate data involves the analysis of more than two observable variables, and often thousands of measurements need to be collected for one or more subjects.
❖ More than two: Multivariate data analyses datasets with three or more variables.
❖ Mean vector: The average of each variable is represented as a mean vector.
❖ Covariance matrix: Variance becomes a covariance matrix, showing relationships
between variables.
❖ Applications: Includes techniques like regression, factor analysis, and PCA.
Heatmap
❖ Visual Representation: Heatmaps use color to show the values in a 2D matrix.
❖ Color Coding: Darker colors represent higher values, lighter colors represent lower
values.
❖ Human Perception: We easily understand color differences, making heatmaps
effective.
❖ Applications: Heatmaps can visualize data like traffic density or patient health data.
Pairplot
❖ Pairplot/Scatter Matrix: A visual technique for multivariate data.
❖ Structure: Consists of multiple pairwise scatter plots.
❖ Purpose: Shows relationships between variables.
❖ Format: Presented in a matrix layout.
❖ Analysis: Allows easy identification of correlations and other relationships.
Example: Demonstrated with a random 3-column matrix in below figure.
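The heatmap and pairplot described above can be produced with a few lines of Python. This is a minimal sketch assuming the pandas, seaborn, and matplotlib libraries are available; the random 3-column matrix is hypothetical and stands in for the figure referenced above:

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Hypothetical random 3-column matrix
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(100, 3)), columns=["x1", "x2", "x3"])

# Heatmap of the correlation matrix: darker cells indicate higher values
sns.heatmap(df.corr(), annot=True, cmap="Blues")
plt.show()

# Pairplot / scatter matrix: pairwise scatter plots in a matrix layout
sns.pairplot(df)
plt.show()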
x = y/A = A⁻¹y
This is true if A is not zero (i.e., A is invertible). The logic can be extended to a system of N equations with n unknown variables:
x = A⁻¹y
If there is a unique solution, then the system is called consistent independent. If there are multiple solutions, then the system is called consistent dependent. If there are no solutions and the equations are contradictory, then the system is called inconsistent.
For solving a large system of equations, Gaussian elimination can be used. The procedure for applying Gaussian elimination is given as follows:
x_{n−1} = (y_{n−1} − a_{(n−1)n} × x_n) / a_{(n−1)(n−1)}
To facilitate the application of Gaussian elimination method, the following row operations are
applied:
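The elimination and back-substitution steps described above can be sketched as follows. This is a minimal illustration on a small hypothetical system; it assumes non-zero pivots (no row exchanges) and cross-checks the answer with NumPy's solver:

import numpy as np

def gaussian_elimination(A, y):
    # Solve A x = y by forward elimination followed by back substitution.
    # Assumes A is square, non-singular, and has non-zero pivots (no pivoting here).
    A = A.astype(float).copy()
    y = y.astype(float).copy()
    n = len(y)
    # Forward elimination: reduce A to upper-triangular form
    for i in range(n):
        for j in range(i + 1, n):
            factor = A[j, i] / A[i, i]
            A[j, i:] -= factor * A[i, i:]
            y[j] -= factor * y[i]
    # Back substitution, e.g. x_{n-1} = (y_{n-1} - a_{(n-1)n} x_n) / a_{(n-1)(n-1)}
    x = np.zeros(n)
    for i in range(n - 1, -1, -1):
        x[i] = (y[i] - A[i, i + 1:] @ x[i + 1:]) / A[i, i]
    return x

A = np.array([[2.0, 1.0], [1.0, 3.0]])
y = np.array([3.0, 5.0])
print(gaussian_elimination(A, y))   # [0.8 1.4]
print(np.linalg.solve(A, y))        # same solution from NumPy's built-in solver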
Matrix factorization methods, like eigen decomposition, break down a matrix into simpler
components for easier operations. Eigen decomposition, a common technique, specifically
decomposes a matrix into its eigenvalues and eigenvectors. This results in expressing the
original matrix as the product of a matrix of eigenvectors, a diagonal matrix, and the transpose
of the eigenvector matrix.
A = Q Λ Qᵀ
where Q is the matrix of eigenvectors, Λ is the diagonal matrix of eigenvalues, and Qᵀ is the transpose of matrix Q.
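A small sketch of this decomposition for a hypothetical symmetric matrix, using NumPy (an assumed library choice, not part of these notes):

import numpy as np

A = np.array([[4.0, 2.0],
              [2.0, 3.0]])             # a real symmetric matrix

eigvals, Q = np.linalg.eigh(A)         # eigh handles symmetric/Hermitian matrices
Lam = np.diag(eigvals)                 # the diagonal matrix of eigenvalues
print(np.allclose(A, Q @ Lam @ Q.T))   # True: A = Q Λ Qᵀ is recovered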
LU Decomposition
One of the simplest matrix decompositions is LU decomposition, where the matrix A can be decomposed into two matrices:
A = LU
Here, L is the lower triangular matrix and U is the upper triangular matrix. The decomposition can be done using the Gaussian elimination method as discussed in the previous section. First, an identity matrix is augmented to the given matrix. Then, row operations and Gaussian elimination are applied to reduce the given matrix and obtain the matrices L and U.
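A minimal LU sketch using SciPy (an assumed library choice). Note that SciPy's routine also returns a permutation matrix P from partial pivoting, so the reconstruction is A = P L U rather than the plain A = LU form above:

import numpy as np
from scipy.linalg import lu

A = np.array([[4.0, 3.0],
              [6.0, 3.0]])

P, L, U = lu(A)                     # L is lower triangular, U is upper triangular
print(L)
print(U)
print(np.allclose(A, P @ L @ U))    # True: the product reconstructs A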
Machine learning heavily relies on statistics and probability, with statistics being crucial for
data analysis and probability essential for understanding data distributions. Data is viewed as
generated from probability distributions, and machine learning datasets often involve multiple
distributions, making knowledge of probability distributions and random variables vital.
Furthermore, hypothesis testing, model construction and evaluation, and dataset creation via
sampling theory are all key aspects linking machine learning with probability and statistics,
forming the foundation for effective model development and analysis.
Probability Distributions
Probability distributions summarize the probability of a variable's events. They are functions
describing the relationship between observations in a sample space. Data following a
distribution obeys a mathematical function, allowing probability calculations.
For continuous variables, the probability density function (PDF) gives the probability of
observing a value, while the cumulative distribution function (CDF) gives the probability of an
observation being less than or equal to a value. Both PDF and CDF are continuous.
The discrete equivalent of the PDF for discrete distributions is the probability mass function (PMF). For a continuous variable, the probability of an event is found by calculating the area under the PDF over a small interval around the specific outcome; accumulating this area up to a given value gives the CDF.
1. Normal Distribution
❖ Key Features:
f(x; μ, σ²) = (1 / √(2πσ²)) e^(−(x−μ)² / (2σ²))
❖ Z-score: Measures how many standard deviations a data point is from the mean.
z = (x - μ) / σ
❖ Normality Tests:
o Q-Q Plot: Compares the quantiles of the data to the quantiles of a normal
distribution. A straight line indicates normality.
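A brief sketch of the normal PDF, CDF, z-score, and a Q-Q plot using SciPy and Matplotlib (assumed libraries); the mean, standard deviation, and sample used here are hypothetical:

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

mu, sigma = 10.0, 2.0
x = 13.0

print(stats.norm.pdf(x, loc=mu, scale=sigma))   # f(x; mu, sigma^2)
print(stats.norm.cdf(x, loc=mu, scale=sigma))   # P(X <= x)
print((x - mu) / sigma)                         # z-score: 1.5 standard deviations above the mean

# Q-Q plot: points falling on a straight line suggest the sample is normal
sample = np.random.default_rng(1).normal(mu, sigma, size=200)
stats.probplot(sample, dist="norm", plot=plt)
plt.show()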
2. Uniform Distribution
❖ Definition: A continuous distribution where all values within a specified range have equal probability.
❖ Key Features:
3. Exponential Distribution
❖ Key Features:
❖ Mean and Standard Deviation: Both are equal to 1/λ (represented as β in the notes).
Applications
❖ Rectangular Distribution: Used when all outcomes within a range are equally likely.
Discrete Distributions
Binomial, Poisson, and Bernoulli distributions fall under this category.
1. Binomial Distribution
The objective of this distribution is to find the probability of getting k successes out of n trials. The number of ways of getting k successes out of n trials is given as:
C(n, k) = n! / (k! (n − k)!)
The binomial distribution function is given as follows, where p is the probability of success and the probability of failure is q = (1 − p). The probability of one particular sequence of k successes in n trials is:
p^k (1 − p)^(n−k) or p^k q^(n−k)
Therefore, the probability of exactly k successes in n trials is:
P(X = k) = C(n, k) p^k (1 − p)^(n−k)
Here, p is the probability of success in each trial, k is the number of successes, and n is the total number of trials. The mean, variance, and standard deviation of the binomial distribution are given below:
μ = n × p
σ² = np(1 − p)
σ = √(np(1 − p))
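A small sketch of these binomial quantities using SciPy (an assumed library); the values n = 10, p = 0.5, and k = 6 are hypothetical:

from scipy import stats

n, p, k = 10, 0.5, 6
print(stats.binom.pmf(k, n, p))    # P(exactly k successes in n trials)
print(stats.binom.mean(n, p))      # mu = n * p = 5.0
print(stats.binom.std(n, p))       # sigma = sqrt(n * p * (1 - p)) ≈ 1.58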
2. Poisson Distribution
It is another important distribution that is quite useful. Given an interval of time, this
distribution is used to model the probability of a given number of events k. The mean rule 𝜆 is
inclusive of previous events. Some of the examples of poisson distribution are number of
emails received, number of customers visiting a shop and the number of phone calls received
by the office.
f(X = x; λ) = Pr[X = x] = (e^(−λ) λ^x) / x!
Here, x is the number of times the event occurs and λ is the mean number of times the event occurs. The mean of the distribution is λ (e.g., the average number of emails received) and the standard deviation is √λ.
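A similar sketch for the Poisson distribution, with a hypothetical mean rate of 4 events per interval (e.g., emails per hour):

from scipy import stats

lam = 4
print(stats.poisson.pmf(2, lam))                        # P(X = 2)
print(stats.poisson.mean(lam), stats.poisson.std(lam))  # mean λ and standard deviation √λ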
3. Bernoulli Distribution
This distribution models an experiment whose outcome is binary. The outcome is positive with
p and negative with 1 – p. The PMF of this distribution is given as:
f(k; p) = p,          if k = 1
f(k; p) = q = 1 − p,  if k = 0
Density Estimation
Density estimation is a statistical problem where the goal is to approximate the probability
density function of a population based on a finite sample of data points. This estimated
function, denoted as p(x), allows us to assign probabilities to new, unseen data points. By
comparing the estimated probability of a new point, p(x_i), to a threshold ε, we can identify
outliers or anomalies: points with probabilities below ε are considered atypical, suggesting they
deviate significantly from the learned distribution.
There are two types of density estimation methods, namely parametric density estimation and non-parametric density estimation.
Parametric density estimation assumes data originates from a known distribution, characterized
by parameters θ, allowing the density to be expressed as p(x | θ). The method focuses on
estimating these parameters, often using techniques like maximum likelihood estimation, to
define the most likely distribution that generated the observed data.
Maximum Likelihood Estimation (MLE)
MLE is a method for estimating the parameters of a probability distribution based on observed data. It aims to find the parameter values that maximize the likelihood of observing the given data.
1. Formulate the Likelihood Function (L(X; θ)): This function represents the probability
of observing the data X given the distribution's parameters θ. For independent data
points, it's the product of individual probabilities: L(X; θ) = ∏ p(xᵢ ; θ).
2. Maximize the Likelihood: The goal is to find the parameter values θ that maximize
L(X; θ).
In practice, the log-likelihood is used: log L(X; θ) = ∑ log p(xᵢ ; θ). Maximizing the log-likelihood is equivalent to maximizing the likelihood.
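As a concrete example, for a normal distribution the log-likelihood is maximized by the sample mean and the biased (divide-by-N) sample variance. The sketch below checks this on hypothetical data; the true parameters 5.0 and 2.0 are assumptions made for the demonstration:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=1000)     # observed data

mu_hat = X.mean()                                 # MLE of the mean
sigma_hat = np.sqrt(np.mean((X - mu_hat) ** 2))   # MLE of the std (divides by N, not N-1)
print(mu_hat, sigma_hat)                          # close to the true values 5.0 and 2.0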
Gaussian Mixture Models (GMMs) leverage the Maximum Likelihood Estimation (MLE)
framework for clustering by assuming data is generated from a mixture of Gaussian
distributions, each with its own parameters. The Expectation-Maximization (EM) algorithm is
employed to estimate these parameters, particularly when dealing with latent variables, such as
unobserved group memberships (e.g., gender influencing weight), enabling effective modelling
of complex data distributions.
Generally, there can be many unspecified distributions with different set of parameters. The
EM algorithm has two stages:
1. Expectation (E) Stage – In this stage, the expected PDF and its parameters are
estimated for each latent variable.
2. Maximization (M) Stage – In this, the parameters are optimized using the MLE
function.
This process is iterative, and the iteration is continued till all the latent variables are fitted by
probability distributions effectively along with the parameters.
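A minimal GMM sketch using scikit-learn (an assumed library), which runs the EM algorithm internally; the two-group "weight" data mirrors the gender example above and is entirely hypothetical:

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Hypothetical weights drawn from two latent groups
X = np.concatenate([rng.normal(60, 5, 300), rng.normal(80, 6, 300)]).reshape(-1, 1)

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)   # EM estimation
print(gmm.means_.ravel())             # estimated component means, near 60 and 80
print(gmm.predict([[62.0], [83.0]]))  # most likely latent group for new points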
Non-parametric density estimation, which can be generative (like Parzen windows, finding
p(x | θ)) or discriminative (finding p(θ | x)), avoids assumptions about the underlying data
distribution. Examples include Parzen windows and k-Nearest Neighbors (KNN).
Parzen Window
Parzen window is a non-parametric method to estimate the probability density function (PDF)
of a dataset. It works by placing a "window" function (often a hypercube) around each data
point and summing these windows to approximate the overall density.
The samples are drawn independently from the same distribution; they are said to be independent and identically distributed (i.i.d.). Let R be the region that covers k samples out of the total n samples. Then, the probability of the region is given as:
p = k/n
p(x) = (k/n) / V
where V is the volume of the region R. If R is a hypercube centred at x and h is the side length of the hypercube, the volume V is h² for a 2D square and h³ for a 3D cube.
The window function indicates whether the sample is inside the region or not. The Parzen probability density function estimate using the above equation is given as:
This window can be replaced by any other function too. If Gaussian function is used, then it is
called Gaussian density function.
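A short sketch of this idea using scikit-learn's KernelDensity (an assumed library choice), which sums a Gaussian window centred on each sample; the data, window width h, and query points are hypothetical:

import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(500, 1))            # samples drawn i.i.d.

kde = KernelDensity(kernel="gaussian", bandwidth=0.5).fit(X)   # window width h = 0.5
log_p = kde.score_samples([[0.0], [3.5]])          # log density estimates at two points
print(np.exp(log_p))     # the point at 3.5 receives a much lower density (possible outlier)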
KNN Estimation
The KNN estimation is another non-parametric density estimation method. Here, the initial
parameter k is determined and based on that k-neighbours are determined. The probability
density function estimate is the average of the values that are returned by the neighbours.
Feature engineering is crucial for improving machine learning model performance by carefully
selecting and transforming input features. It encompasses two main aspects: feature
transformation, which involves creating new features from existing ones (e.g., calculating BMI
from height and weight), and feature subset selection, which focuses on identifying the most
relevant features to reduce dimensionality and computational complexity without sacrificing
reliability. This process combats the "curse of dimensionality," where processing high-
dimensional data becomes intractable, by employing strategies like greedy search to find
optimal feature subsets.
Filter-based selection uses statistical measures for assessing features. In this approach, no
learning algorithm is used. Correlation and information gain measures like mutual information
and entropy are all examples of this approach.
Wrapper-based methods use classifiers to identify the best features. These are selected and
evaluated by the learning algorithms. This procedure is computationally intensive but has
superior performance.
Stepwise Forward Selection This procedure starts with an empty set of attributes. Every time, an attribute is tested for statistical significance for best quality and is added to the reduced set. This process is continued till a good reduced set of attributes is obtained.
Stepwise Backward Elimination This procedure starts with a complete set of attributes. At every stage, the procedure removes the worst attribute from the set, leading to the reduced set.
Combined Approach Both forward and reverse methods can be combined so that the
procedure can add the best attribute and remove the worst attribute.
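A brief sketch contrasting the filter-based and wrapper-based approaches with scikit-learn; the Iris dataset, mutual information scoring, and RFE with logistic regression are illustrative assumptions rather than methods prescribed by these notes:

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, mutual_info_classif, RFE
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Filter-based: rank features by a statistical measure, with no learning algorithm involved
filt = SelectKBest(mutual_info_classif, k=2).fit(X, y)
print(filt.get_support())           # mask of the two selected features

# Wrapper-based: a classifier drives recursive feature elimination (computationally heavier)
wrap = RFE(LogisticRegression(max_iter=1000), n_features_to_select=2).fit(X, y)
print(wrap.support_)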
The idea of the principal component analysis (PCA) or KL transform is to transform a given
set of measurements to a new set of features so that the features exhibit high information
packing properties. This leads to a reduced and compact set of features.
mx = E{x}
The operator E refers to the expected value of the population. This is calculated theoretically using the probability density functions (PDF) of the elements xi and the joint probability density functions between the elements xi and xj. From this, the covariance matrix can be calculated as:
C = E{(x-mx) (x-mx)T}
For M random vectors, when M is large enough, the mean vector and covariance matrix can be approximately calculated as:
mx ≈ (1/M) ∑ₖ₌₁ᴹ xₖ
C ≈ (1/M) ∑ₖ₌₁ᴹ (xₖ − mx)(xₖ − mx)ᵀ
The covariance matrix is real and symmetric, allowing for the calculation of eigenvectors (eᵢ)
and eigenvalues (λᵢ), which are ordered by magnitude (λ₁ ≥ λ₂...). These eigenvectors form the
transformation matrix (A), used to map data (x) to a new representation (y) through
y = A(x - mₓ), and this transformation is also known as the Karhunen-Loeve or Hotelling
transform. The original data can be reconstructed via x = Aᵀ y + mₓ. The goal of PCA is to
reduce the data's dimensionality by using only the most significant eigenvectors, achieving
maximum compression, with a reconstruction using the K largest eigenvalues represented by x̂ = A_Kᵀ y + mₓ.
The advantages of PCA are immense. It reduces the attribute list by eliminating all irrelevant attributes. The PCA algorithm is as follows:
The original data can be retrieved using the formula given below:
x = (Aᵀ × y) + mₓ
From below figure, one can infer the relevance of the attributes. The scree plot indicates that
the first attribute is more important than all other attributes.
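The PCA steps described above (mean vector, covariance matrix, sorted eigenvectors, projection, and reconstruction) can be sketched directly with NumPy; the data and the choice of K = 2 components are hypothetical:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) @ np.array([[3.0, 0.0, 0.0],
                                          [1.0, 1.0, 0.0],
                                          [0.2, 0.1, 0.1]])   # correlated 3-D data

m = X.mean(axis=0)                          # mean vector m_x
C = np.cov(X - m, rowvar=False)             # covariance matrix
eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]           # order eigenvalues so that λ1 ≥ λ2 ≥ ...
A = eigvecs[:, order].T                     # rows of A are the eigenvectors e_i

K = 2                                       # keep the K most significant components
Y = (X - m) @ A[:K].T                       # y = A_K (x − m_x)
X_hat = Y @ A[:K] + m                       # reconstruction x̂ = A_Kᵀ y + m_x
print(np.mean((X - X_hat) ** 2))            # small reconstruction error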
LDA is also a feature reduction technique like PCA. The focus of LDA is to project higher
dimension data to a line (lower dimension data). LDA is also used to classify the data. Let there
be two classes, c₁ and c₂. Let μ₁ and μ₂ be the mean of the patterns of two classes. The mean of
the classes c₁ and c₂ can be computed as:
where V is the linear projection, and σB and σW are the between-class and within-class scatter matrices, respectively. For the two-class problem, these matrices are given as:
Let V = {v1, v2, …, vd} be the generalized eigenvectors of σB and σW, where d is the number of largest eigenvalues retained, as in PCA. The transformation of x is then given as:
Like in PCA, the largest eigen values can be retained to have projections.
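A minimal LDA sketch using scikit-learn (an assumed library), which both projects the data onto the discriminant directions and classifies it; the Iris dataset is used only for illustration:

from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

lda = LinearDiscriminantAnalysis(n_components=2)
Z = lda.fit_transform(X, y)      # project 4-D data onto 2 discriminant directions
print(Z.shape)                   # (150, 2)
print(lda.predict(X[:5]))        # LDA can also be used directly as a classifier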
SVD is another useful decomposition technique. Let A be the matrix, then the matrix A can be
decomposed as:
A = USVT
Here, A is the given matrix of dimension m × n, U is the orthogonal matrix whose dimension
is m × n, S is the diagonal matrix of dimension n × n, and V is the orthogonal matrix. The
procedure for finding the decomposition matrices is given as follows:
1. For the given matrix A, find AAᵀ.
2. Find the eigenvalues of AAᵀ.
3. Sort the eigenvalues in descending order. Pack the eigenvectors as a matrix U.
4. Arrange the square roots of the eigenvalues along the diagonal. This diagonal matrix is S.
5. Find the eigenvalues and eigenvectors of AᵀA. Pack the eigenvectors as a matrix called V.
Thus, A = USVᵀ. Here, U and V are orthogonal matrices. The columns of U and V are the left and right singular vectors, respectively. SVD is useful in compression, as one can decide to retain only a certain number of components instead of the original matrix A as:
A ≈ ∑ᵢ₌₁ᵏ σᵢ uᵢ vᵢᵀ
The main advantage of SVD is compression. A matrix, say an image, can be decomposed and
selectively only certain components can be retained by making all other elements zero. This
reduces the contents of image while retaining the quality of the image. SVD is useful in data
reduction too.
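A short SVD sketch with NumPy showing the decomposition A = USVᵀ and a rank-1 approximation used for compression; the matrix is hypothetical:

import numpy as np

A = np.array([[3.0, 1.0, 1.0],
              [-1.0, 3.0, 1.0]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)
print(np.allclose(A, U @ np.diag(s) @ Vt))      # True: A = U S Vᵀ

# Compression: retain only the largest singular value (rank-1 approximation)
k = 1
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
print(A_k)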
A system that is built around a learning algorithm is called a learning system. The design of
systems focuses on these steps:
Training Experience
Let us consider the design of a chess game. In direct experience, individual board states and correct moves of the chess game are given directly. In indirect experience, only the move sequences and final results are given. The training experience also depends on the presence of a supervisor
who can label all valid moves for a board state. In the absence of a supervisor, the game agent plays against itself and learns the good moves, provided the training samples cover all scenarios, or in other words, are distributed enough for performance computation. If the training samples and testing samples have the same distribution, the results would be good.
The next step is the determination of a target function. In this step, the type of knowledge that
needs to be learnt is determined. In direct experience, a board move is selected and is
determined whether it is a good move or not against all other moves. If it is the best move, then it is chosen as B → M, where B is the set of board states and M is the set of legal moves. In indirect experience, all legal moves
are accepted and a score is generated for each. The move with largest score is then chosen and
executed.
The representation of knowledge may be a table, a collection of rules, or a neural network. The linear combination of these factors can be written as:
v = ω₀ + ω₁x₁ + ω₂x₂ + ω₃x₃
where x₁, x₂, and x₃ represent different board features, and ω₀, ω₁, ω₂, and ω₃ represent the weights.
The focus is to choose weights that fit the given training samples effectively. The aim is to reduce the error, given as:
E = ∑ (v_train(b) − v̂(b))²
where the sum is taken over the training samples. Here, b is a sample and v̂(b) is the predicted hypothesis for that sample. The approximation is carried out as follows:
Compute the error as the difference between the training value and the predicted value, and let this error be error(b). Then, for every board feature xᵢ, the weights are updated as:
𝝎𝒊 = 𝝎𝒊 + µ × 𝐞𝐫𝐫𝐨𝐫(𝐛) × 𝒙𝒊
Here, µ is the constant that moderates the size of the weight update.
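A minimal sketch of this weight-update rule; the board features, training value, and learning rate µ below are hypothetical:

import numpy as np

def lms_update(w, x, v_train, mu=0.1):
    # One LMS step for v = w0 + w1*x1 + w2*x2 + w3*x3 (x includes x0 = 1 for w0)
    v_hat = w @ x                   # predicted value v̂(b)
    error = v_train - v_hat         # error(b) = v_train(b) − v̂(b)
    return w + mu * error * x       # w_i ← w_i + µ · error(b) · x_i

w = np.zeros(4)
x = np.array([1.0, 3.0, 0.0, 2.0])  # hypothetical board features, with x0 = 1
w = lms_update(w, x, v_train=100.0)
print(w)                            # weights move toward reducing the error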
1. Input - Training dataset which is a set of training instances, each labelled with the name
of a concept or category to which it belongs. Use this past experience to train and build
the model.
2. Output - Target concept or target function f. It is a mapping function f(x) from input x to output y. It is to determine the specific features or common features to identify an object. In other words, it is to find the hypothesis to determine the target concept. For example, the specific set of features to identify an elephant from all animals.
3. Test - New instances to test the learned model.
Formally, Concept learning is defined as "Given a set of hypotheses, the learner searches
through the hypothesis space to identify the best hypothesis that matches the target concept".
Sl. No.  Horns  Tail   Tusks  Paws  Fur  Color  Hooves  Size    Elephant
1        No     Short  Yes    No    No   Black  No      Big     Yes
2        Yes    Short  No     No    No   Brown  Yes     Medium  No
3        No     Short  Yes    No    No   Black  No      Medium  Yes
4        No     Long   No     Yes   Yes  White  No      Medium  No
5        No     Short  Yes    Yes   Yes  Black  No      Big     Yes
Here, in this set of training instances, the independent attributes considered are 'Horns', 'Tail', 'Tusks', 'Paws', 'Fur', 'Color', 'Hooves' and 'Size'. The dependent attribute is 'Elephant'. The
target concept is to identify the animal to be an Elephant.
Let us now take this example and understand further the concept of hypothesis.
A hypothesis 'h' approximates a target function 'f' to represent the relationship between the
independent attributes and the dependent attribute of the training instances. The hypothesis is
the predicted approximate model that best maps the inputs to outputs. Each hypothesis is
represented as a conjunction of attribute conditions in the antecedent part.
The set of hypotheses in the search space is called hypotheses (the plural form of hypothesis). Generally, 'H' is used to represent the set of hypotheses and 'h' is used to represent a candidate hypothesis.
Each attribute condition is a constraint on the attribute, represented as an attribute–value pair. In the antecedent of an attribute condition of a hypothesis, each attribute can take the value '?' or '𝜙', or can hold a single value.
❖ "?" denotes that the attribute can take any value [e.g., Color = ?]
❖ "𝜙" denotes that the attribute cannot take any value, i.e., it represents a null value [e.g., Horns = 𝜙]
❖ A single value denotes a specific value from the acceptable values of the attribute, i.e., the attribute 'Tail' can take a value such as 'Short' [e.g., Tail = Short]
Given a test instance x, we say h(x) = 1, if the test instance x satisfies this hypothesis h.
The training dataset given above has 5 training instances with 8 independent attributes and one
dependent attribute. Here, the different hypotheses that can be predicted for the target concept
are,
The task is to predict the best hypothesis for the target concept (an elephant). The most general
hypothesis can allow any value for each of the attribute.
It is represented as:
<?, ?, ?, ?, ?, ?, ?, ?>. This hypothesis indicates that any animal can be an elephant.
The most specific hypothesis will not allow any value for each of the attribute <
𝜙, 𝜙, 𝜙, 𝜙, 𝜙, 𝜙, 𝜙, 𝜙 >. This hypothesis indicates that no animal can be an elephant.
The target concept mentioned in this example is to identify the conjunction of specific features
from the training instances to correctly identify an elephant.
Thus, concept learning can also be called Inductive Learning, which tries to induce a general function from specific training instances. This way of learning assumes that a hypothesis which produces an approximate target function over a sufficiently large set of training instances will also approximate it well for unobserved instances; this assumption is known as the inductive learning hypothesis.
Hypothesis space is the set of all possible hypotheses that approximate the target function f. In other words, the set of all possible approximations of the target function can be defined as
hypothesis space. From this set of hypotheses in the hypothesis space, a machine learning
algorithm would determine the best possible hypothesis that would best describe the target
function or best fit the outputs. Generally, a hypothesis representation language represents a
larger hypothesis space. Every machine learning algorithm would represent the hypothesis
space in a different manner about the function that maps the input variables to output variables.
For example, a regression algorithm represents the hypothesis space as a linear function
whereas a decision tree algorithm represents the hypothesis space as a tree.
The set of hypotheses that can be generated by a learning algorithm can be further reduced by
specifying a language bias.
The subset of hypothesis space that is consistent with all-observed training instances is called
as Version Space. Version space represents the only hypotheses that are used for the
classification.
For example, each of the attributes given in Table 3.1 has the following possible set of values.
Horns-Yes, No
Tail-Long, Short
Tusks-Yes, No
Paws-Yes, No
Fur-Yes, No
Color - Brown, Black, White
Hooves-Yes, No
Size-Medium, Big
Considering these values for each of the attribute, there are (2x2x2x2x2x3x2x2) = 384 distinct
instances covering all the 5 instances in the training dataset.
So, we can generate (4×4×4×4×4×5×4×4) = 81,920 distinct hypotheses when including two more values [?, 𝜙] for each of the attributes. However, any hypothesis containing one or more 𝜙 symbols represents the empty set of instances; that is, it classifies every instance as negative.
Hypothesis ordering is also important wherein the hypotheses are ordered from the most
specific one to the most general one in order to restrict searching the hypothesis space
exhaustively.
Several commonly used heuristic search methods are hill climbing methods, constraint satisfaction problems, best-first search, simulated annealing, the A* algorithm, and genetic algorithms.
In order to understand about how we construct this concept hierarchy, let us apply this general
principle of generalization/specialization relation. By generalization of the most specific
hypothesis and by specialization of the most general hypothesis, the hypothesis space can be
searched for an approximate hypothesis that matches all positive instances but does not match
any negative instance.
Searching the Hypothesis Space
There are two ways of learning the hypothesis, consistent with all training instances from the
large hypothesis space.
1. Specialization - General to Specific learning
2. Generalization - Specific to General learning
Specialization - General to Specific learning This learning methodology will search through
the hypothesis space for an approximate hypothesis by specializing the most general
hypothesis.
1. The Find-S algorithm tries to find a hypothesis that is consistent with the positive instances, ignoring all negative instances (a minimal sketch of Find-S is given after this list). As long as the training dataset is consistent, the hypothesis found by this algorithm may be consistent.
2. The algorithm finds only one unique hypothesis, wherein there may be many other
hypotheses that are consistent with the training dataset.
3. Many times, the training dataset may contain some errors; hence such inconsistent data
instances can mislead this algorithm in determining the consistent hypothesis since it
ignores negative instances.
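The sketch referenced in the list above: a minimal Find-S implementation applied to the elephant training table, which keeps only the positive instances and generalizes mismatched attribute values to '?':

def find_s(instances, labels):
    # Return the most specific hypothesis consistent with the positive instances.
    h = None
    for x, label in zip(instances, labels):
        if label != "Yes":
            continue                                  # negative instances are ignored
        if h is None:
            h = list(x)                               # start from the first positive instance
        else:
            h = [a if a == b else "?" for a, b in zip(h, x)]   # generalize mismatches
    return h

# (Horns, Tail, Tusks, Paws, Fur, Color, Hooves, Size) from the table above
data = [
    ("No", "Short", "Yes", "No", "No", "Black", "No", "Big"),
    ("Yes", "Short", "No", "No", "No", "Brown", "Yes", "Medium"),
    ("No", "Short", "Yes", "No", "No", "Black", "No", "Medium"),
    ("No", "Long", "No", "Yes", "Yes", "White", "No", "Medium"),
    ("No", "Short", "Yes", "Yes", "Yes", "Black", "No", "Big"),
]
labels = ["Yes", "No", "Yes", "No", "Yes"]
print(find_s(data, labels))   # ['No', 'Short', 'Yes', '?', '?', 'Black', 'No', '?']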
Hence, it is necessary to find the set of hypotheses that are consistent with the training data
including the negative examples. To overcome the limitations of Find-S algorithm, Candidate
Elimination algorithm was proposed to output the set of all hypotheses consistent with the
training dataset.
The version space contains the subset of hypotheses from the hypothesis space that is consistent
with all training instances in the training dataset.
The principle idea of this learning algorithm is to initialize the version space to contain all
hypotheses and then eliminate any hypothesis that is found inconsistent with any training
instances. Initially, the algorithm starts with a version space to contain all hypotheses scanning
each training instance. The hypotheses that are inconsistent with the training instance are
eliminated. Finally, the algorithm outputs the list of remaining hypotheses that are all
consistent.
The above algorithm works fine if the hypothesis space is finite but practically it is difficult to
deploy this algorithm. Hence, a variation of this idea is introduced in the Candidate Elimination
algorithm.
The aim of version space learning is to generate all consistent hypotheses. This algorithm computes the version space by the combination of two cases, namely,
Using the Candidate Elimination algorithm, we can compute the version space containing all
(and only those) hypotheses from H that are consistent with the given observed sequence of
training instances. The algorithm defines two boundaries: the 'general boundary', which is a set of all hypotheses that are the most general, and the 'specific boundary', which is a set of all hypotheses that are the most specific. Thus, the algorithm limits the version space to contain only those hypotheses that are most general and most specific. In this way, it provides a compact representation of the List-Then-Eliminate algorithm.
Generating Positive Hypothesis 'S' If it is a positive example, refine S to include the positive
instance. We need to generalize S to include the positive instance. The hypothesis is the
conjunction of 'S' and positive instance. When generalizing, for the first positive instance, add
to S all minimal generalizations such that S is filled with attribute values of the positive
instance. For the subsequent positive instances scanned, check the attribute value of the positive
instance and S obtained in the previous iteration. If the attribute values of positive instance and
S are different, fill that field value with a "?". If the attribute values of positive instance and S
are same, no change is required.
Generating Negative Hypothesis 'G' If it is a negative example, refine G to exclude the negative instance. If the attribute values of the positive and negative instances are different, then fill that field with the positive instance value so that the hypothesis does not classify that negative instance as true. If the attribute values of the positive and negative instances are the same, then there is no need to update 'G' and fill that attribute value with a '?'.
Generating Version Space - [Consistent Hypothesis] We need to take the combination of sets
in 'G' and check that with 'S'. When the combined set fields are matched with fields in 'S', then
only that is included in the version space as consistent hypothesis.
Machine learning models are created by training algorithms on datasets to make predictions on
new data, a process involving parameter learning and model evaluation using separate training
and testing sets to prevent overfitting. The model's accuracy is assessed by measuring the
difference between predicted and actual values, often using Mean Squared Error, with lower
errors indicating better predictive performance.
1. Choose a machine learning algorithm to suit the training data and the problem domain
2. Input the training dataset and train the machine learning algorithm to learn from the
data and capture the patterns in the data
3. Tune the parameters of the model to improve the accuracy of learning of the algorithm
4. Evaluate the learned model once the model is built
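A compact sketch of the four steps above using scikit-learn (an assumed library): a hypothetical regression dataset is split into training and test sets, a model is trained, and Mean Squared Error is measured on the unseen test data:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))           # hypothetical input feature
y = 3 * X.ravel() + rng.normal(0, 1, 200)       # noisy target values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = LinearRegression().fit(X_train, y_train)   # step 2: learn from the training data
y_pred = model.predict(X_test)                     # step 4: evaluate on unseen data
print(mean_squared_error(y_test, y_pred))          # lower MSE indicates better predictions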
The biggest challenge in machine learning is choosing an algorithm that suits the problem.
Hence, model selection and assessment are very important and deal with two types of
complexities.
Model Selection is a process of selecting one good enough model among different machine
learning models for the dataset or selecting different sets of features or hyperparameters for the
same machine learning model. It is difficult to find the best model because all models exhibit
some predictive error for the problem, so at least a good enough model should be selected that
performs fairly well with the dataset.
Some of the approaches used for selecting a machine learning model are listed below:
❖ Use re-sample methods and split the dataset as training, testing and validation datasets
and observe the performance of the model over all the phases. This approach is suitable
for smaller datasets.
❖ The simplest approach is to fit a model on the training dataset and to compute measures
like error or accuracy.
❖ The use of probabilistic framework and quantification of the performance of the model
as a score is the third approach.
Re-sampling is a technique to select a model by reconstructing the training dataset and test
dataset by randomly choosing instances by some method from the given dataset. This method
involves selecting different instances repeatedly from a training dataset to tune a model. It is
done to improve the accuracy of a model. The common re-sampling model selection methods
are Random train/test splits, Cross-validation (K-fold, LOOCV, etc.) and Bootstrap.
Cross-Validation
Cross-Validation is a method by which we can tune the model with only training dataset. It is
a model evaluation approach by which we can set aside some data of the training dataset for
validation and fit the rest of the data to train the model. The best model is found by estimating
the average of errors on different test data. The popular cross-validation family of methods
includes Holdout method, K-fold cross-validation, Stratified cross-validation and Leave-One-
Out Cross-Validation (LOOCV).
Holdout Method
This is the simplest method of cross-validation. The dataset is split into two subsets called
training dataset and test dataset. The model is trained using the training dataset and then
evaluated using the test dataset. This holdout method can be applied for a single time which is
called as single holdout method or it can be repeated for more than once which is called as
repeated holdout method. The average performance on the test dataset is estimated to evaluate
the model. Even though this model is very simple, it can exhibit high variance and the
performance largely depends on how the dataset is split.
K-fold Cross-Validation
Another way of cross-validating is using a k-fold cross-validation, which will split the training
dataset into k equal folds/parts creating k-1 subsets of training set and one test subset. Out of
the k folds, k-1 folds are used for training and one fold is used for testing the model. This has
to be performed for k iterations and during each iteration a different fold is selected for testing.
The average performance of the model on k iterations is the final estimate of the model
performance.
Stratified Cross-Validation
This method is similar to k-fold cross-validation but with a slight difference. Here, it is ensured that while splitting the dataset into k folds, each fold contains the same proportion of instances with a given categorical value. This is called stratified cross-validation.
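A minimal cross-validation sketch using scikit-learn's KFold and StratifiedKFold; the Iris dataset, the logistic regression model, and k = 5 are illustrative assumptions:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, KFold, StratifiedKFold

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# K-fold: k iterations, each holding out a different fold for testing
scores = cross_val_score(model, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=0))
print(scores.mean())     # average of the k fold scores estimates model performance

# Stratified variant: each fold keeps the same class proportions as the full dataset
strat = cross_val_score(model, X, y,
                        cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0))
print(strat.mean())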
Leave-One-Out Cross-Validation (LOOCV)
This method repeatedly splits the n data instances of the dataset into a training dataset containing n − 1 data instances, leaving one data instance out for evaluating the model. This process is repeated n times and the average test error is then estimated for the model. Even though this method is expensive and time consuming because it has to run n times (i.e., once per data instance in the dataset), it has less bias. For example, if the training dataset contains 100 data instances, then 99 instances are used for training and one instance to test or evaluate the model. This process is repeated 100 times, selecting a different instance as the holdout instance for testing in each iteration.
Model Performance
Classifier models are discussed in the subsequent chapters. The focus of this section is the
evaluation of classifier models. Classifiers are unstable as a small change in the input can
change the output. A solid framework is needed for proper evaluation. There are several metrics
that can be used to describe the quality and usefulness of a classifier. One way to compute the
metrics is to form a table called contingency table. For example, consider a test for detecting a
disease, say cancer. Below table shows a contingency table for this scenario.
In this table, True Positive (TP) = the number of cancer patients who are correctly classified by the test, and True Negative (TN) = the number of normal patients who do not have cancer and are correctly detected. The two errors involved in this process are False Positive (FP), a false alarm where the test shows positive although the patient has no disease, and False Negative (FN), where the test says negative or normal although the patient actually has cancer. FP and FN are costly errors in this classification process.
The metrics that can be derived from this contingency table are listed below:
1. Sensitivity - The sensitivity of a test is the probability that it will produce a true positive
result when used on a test dataset. It is also known as true positive rate. The sensitivity
of a test can be determined by calculating:
Sensitivity = TP / (TP + FN)
2. Specificity - The specificity of a test is the probability that a test will produce a true
negative result when used on test dataset.
Specificity = TN / (TN + FP)
3. Positive Predictive Value - The positive predictive value of a test is the probability that
an object is classified correctly when a positive test result is observed.
Positive Predictive Value = TP / (TP + FP)
4. Negative Predictive Value - The negative predictive value of a test is the probability that an object does not belong to the class when a negative test result is observed.
Negative Predictive Value = TN / (TN + FN)
5. Accuracy - The accuracy of the classifier is the fraction of correctly classified instances, computed as:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
6. Precision - Precision is also known as positive predictive power. It is defined as the
ratio of true positive divided by the sum of true positive and false positive.
Precision = TP / (TP + FP)
Recall = Sensitivity = TP / (TP + FN)
The harmonic mean of precision and recall is called the F-measure or F1 score: F1 = 2 × (Precision × Recall) / (Precision + Recall). This is useful in identifying the model skill for a specific threshold.
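A short sketch computing these metrics from hypothetical contingency-table counts for the cancer-screening example (the counts themselves are made up for illustration):

TP, TN, FP, FN = 85, 890, 10, 15     # hypothetical counts

sensitivity = TP / (TP + FN)         # recall / true positive rate
specificity = TN / (TN + FP)
precision = TP / (TP + FP)           # positive predictive value
npv = TN / (TN + FN)                 # negative predictive value
accuracy = (TP + TN) / (TP + TN + FP + FN)
f1 = 2 * precision * sensitivity / (precision + sensitivity)   # harmonic mean

print(sensitivity, specificity, precision, npv, accuracy, f1)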
Visual Classifier Performance
Receiver Operating Characteristic (ROC) curves and Precision-Recall curves indicate the performance of classifiers visually. ROC curves are a visual means of checking the accuracy of classifiers and comparing them. ROC is a plot of sensitivity (True Positive Rate) against 1 − specificity (False Positive Rate) for a given model.
A sample ROC curve is shown in Figure 3.6, where the results of five classifiers are given. A is the ROC of an average classifier. The ideal classifier is E, where the area under the curve is 1.0; in practice, an excellent classifier has an area under the curve between 0.9 and 1. The remaining classifiers B, C, and D are categorized as good, better, and still better based on their area under curve values.
Classifier predictions rely on a threshold value, such as 0.5, to assign data points to classes. This threshold can be adjusted to manage false positives (FP) and false negatives (FN), which is crucial when focusing on specific error types. The Receiver Operating Characteristic (ROC) curve, which plots the true positive rate against the false positive rate, visually assesses model skill: curves above the diagonal indicate better performance, and the area under the curve (AUC) quantifies overall accuracy across various thresholds, where an AUC of 1 signifies a perfect model.
Instead of just predicting labels, models can output probabilities, enabling more nuanced evaluation through scoring functions like AUC, which measures a model's performance across different threshold values. Precision-recall curves, plotting precision against recall, are particularly useful for imbalanced datasets where one class significantly outnumbers the other, whereas ROC curves are preferred for balanced datasets, offering a comprehensive view of a model's ability to discriminate between classes.
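A minimal ROC/AUC sketch using scikit-learn (the dataset and model below are assumed for illustration): predicted probabilities from a hypothetical binary classification problem are turned into an ROC curve and an AUC score:

import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve, roc_auc_score

X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

probs = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
fpr, tpr, thresholds = roc_curve(y_te, probs)    # TPR vs FPR over many thresholds
print(roc_auc_score(y_te, probs))                # AUC closer to 1 means a better classifier

plt.plot(fpr, tpr)
plt.plot([0, 1], [0, 1], linestyle="--")         # diagonal: random-guess baseline
plt.xlabel("False Positive Rate (1 - specificity)")
plt.ylabel("True Positive Rate (sensitivity)")
plt.show()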
Scoring Methods
Another alternative for model selection is to combine the complexity of the model and
performance of the model as a score. Then, model selection is done by selecting the model that
maximizes or minimizes the score.
Minimum Description Length (MDL) is one such method. The aim is to describe target variable
and model in terms of bits. MDL is the principle of using minimum number of bits to represent
the data and the model. It is a variant of Occam's Razor, which states that the model with the simplest explanation is the best model. MDL, too, recommends the selection of the hypothesis that minimizes the sum of the two description lengths of the model and the data.
Let h be a learning model (hypothesis). Let L(h) be the number of bits used to represent the model and D be the predictions; then the MDL is given as:
L(h) + L(D|h)
where, L(D|h) is the number of bits used to represent the predictions D based on the training
set. MDL can also be expressed in terms of the negative log-likelihood as:
−log P(θ) − log P(y | x, θ)
where y is the target variable, x is the input, and θ represents the model parameters.