AI&ML Module 2

Module 2 covers the understanding of bivariate and multivariate data, including key concepts such as covariance, correlation, and statistical methods for analyzing relationships between variables. It discusses essential mathematical principles from linear algebra and probability that underpin machine learning, including matrix decomposition and probability distributions. The module also introduces graphical representations like heatmaps and pairplots to visualize data relationships.


Module 2

Understanding Data – 2 and Basic Learning Theory

Dr. Vishwesh J, GSSSIETW, Mysuru


Understanding Data-2=> 2.1 Bivariate Data and Multivariate Data
• Bivariate Data involves two variables. Bivariate analysis examines the relationship between the two variables, often looking for the causes behind that relationship. The aim is to find relationships among the data. Consider the following Table 2.1, with data of the temperature in a shop and sales of sweaters.
• Figures 2.1 and 2.2 show the scatter plot and line chart for the data in Table 2.1.

Table 2.1: Temperature in a Shop and Sales Data


Figure 2.1: Scatter Plot Figure 2.2: Line Chart

2.1.1 Bivariate Statistics


• Covariance and Correlation are examples of bivariate statistics.
Covariance
• It is a measure of the joint variability of two random variables, say X and Y.
• Generally, random variables are represented in capital letters.


• It is defined as covariance(X, Y) or COV(X, Y) and is used to measure the variance between two dimensions.
• The formula for finding the covariance of X and Y is given below, where 𝑥𝑖 and 𝑦𝑖 are data values from X and Y, E(X) and E(Y) are their mean values, and N is the number of data points. Also, COV(X, Y) is the same as COV(Y, X).
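In standard form, consistent with E(X), E(Y) and N as defined here, the covariance is:

COV(X, Y) = \frac{1}{N} \sum_{i=1}^{N} \left( x_i - E(X) \right) \left( y_i - E(Y) \right)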

Example 2.1: Find the covariance of data X = {1, 2, 3, 4, 5} and Y = {1, 4, 9, 16, 25}.
Solution: Mean(X) = E(X) = 15/5 = 3, Mean(Y) = E(Y) = 55/5 = 11. The covariance is computed using COV(X, Y) as:

The covariance between X and Y is 12. It can be normalized to a value between -1 and +1 by dividing it by the product of the standard deviations of the two variables; the result is called the Pearson correlation coefficient. Sometimes, N - 1 can also be used instead of N. In that case, the covariance is 60/4 = 15.

Correlation
• The correlation indicates the relationship between dimensions using its sign. The sign is more important
than the actual value.
1. If the value is positive, it indicates that the dimensions increase together.
2. If the value is negative, it indicates that while one dimension increases, the other dimension decreases.



3. If the value is zero, it indicates that the dimensions are uncorrelated, i.e., there is no linear relationship between them.
• If two dimensions are highly correlated, it is better to remove one of them, as it is redundant.
• If the given attributes are X = (x1, x2, …, xN) and Y = (y1, y2, …, yN), then the Pearson correlation coefficient, denoted as r, is given as:
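In terms of the covariance and the standard deviations σX and σY of the two attributes, the standard form is:

r = \frac{COV(X, Y)}{\sigma_X \, \sigma_Y} = \frac{\sum_{i=1}^{N} (x_i - E(X)) (y_i - E(Y))}{N \, \sigma_X \, \sigma_Y}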

Example 2.2: Find the correlation coefficient of data X = {1, 2, 3, 4, 5} and Y = {1, 4, 9, 16, 25}.
Solution: The mean values of X and Y are 15/5 = 3 and 55/5 = 11. The standard deviations of X and Y are 1.41 and 8.6486, respectively. Therefore, the correlation coefficient is the ratio of the covariance (12, from Example 2.1) to the product of the standard deviations of X and Y, as per the above equation: r = 12 / (1.41 × 8.6486) ≈ 0.98.
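As a cross-check of Examples 2.1 and 2.2, a minimal NumPy sketch (assuming population statistics, i.e., division by N) is:

import numpy as np

X = np.array([1, 2, 3, 4, 5], dtype=float)
Y = np.array([1, 4, 9, 16, 25], dtype=float)

# Population covariance (divide by N, as in Example 2.1)
cov_xy = np.mean((X - X.mean()) * (Y - Y.mean()))   # 12.0

# Pearson correlation coefficient (Example 2.2)
r = cov_xy / (X.std() * Y.std())                    # approx. 0.98
print(cov_xy, r)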

Understanding Data-2=> 2.2 Multivariate Statistics


• In machine learning, almost all datasets are multivariate. Multivariate analysis deals with more than two observable variables, and often thousands of measurements need to be conducted for one or more subjects.
• Multivariate data is like bivariate data but may have more than two dependent variables. Some of the multivariate analysis methods are regression analysis, principal component analysis, and path analysis.


• The mean of multivariate data is a mean vector and the mean of the shown
three attributes is given as (2, 5, 1.33).
• The variance of multivariate data becomes the covariance matrix.
• The mean vector is called the centroid and the covariance matrix is called the dispersion matrix (discussed later).
• Multivariate data has three or more variables.

Heatmap
• Heatmap is a graphical representation of 2D matrix.
• It takes a matrix as input and colours it. The darker colours indicate very large values and the lighter colours indicate smaller values.
• The advantage of this method is that humans perceive colours well. So, by colour shading, larger values can be perceived easily.
• For example, in vehicle traffic data, heavy traffic regions can be differentiated from low traffic regions through heatmap.
• In Figure 2.3, patient data highlighting weight and health status is plotted. Here, X-axis is weights and Y-axis is patient counts.
The dark colour regions highlight patients’ weights vs patient counts in health status.

Pairplot
• Pairplot or scatter matrix is a data visual technique for multivariate data. A scatter matrix consists of several pair-wise scatter
plots of variables of the multivariate data. All the results are presented in a matrix format.
• By visual examination of the chart, one can easily find relationships among the variables such as correlation between the
variables.
• A random matrix of three columns is chosen and the relationships of the columns are plotted as a pairplot (or scatter matrix) as shown below in Figure 2.4.
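A minimal sketch of how such plots are typically produced, assuming pandas, seaborn and matplotlib are available (these libraries are assumptions, not named in the slides):

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Random data with three columns, in the spirit of Figure 2.4
df = pd.DataFrame(np.random.randn(100, 3), columns=["A", "B", "C"])

sns.heatmap(df.corr(), annot=True)   # heatmap of the correlation matrix (cf. Figure 2.3)
plt.show()

sns.pairplot(df)                     # pair-wise scatter plots / scatter matrix (cf. Figure 2.4)
plt.show()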

Figure 2.3: Heatmap for Patient Data

Figure 2.4: Pairplot for Random Data


Understanding Data-2=> 2.3 Essential Mathematics for Multivariate Data
• Machine learning involves many mathematical concepts from the domain of Linear algebra, Statistics,
Probability and Information theory.
• Here we discuss important aspects of linear algebra and probability.
• 'Linear Algebra' is a branch of mathematics that is central for many scientific applications and other
mathematical subjects.
• Linear algebra deals with linear equations, vectors, matrices, vector spaces and transformations. These are the
driving forces of machine learning and machine learning cannot exist without these data types.
• Let us discuss some of the important concepts of linear algebra now.

2.3.1 Linear Systems and Gaussian Elimination for Multivariate Data:


• A linear system of equations is a group of equations with unknown variables.
• Let Ax = y; then the solution is x = A⁻¹y, provided A is non-zero (invertible). The logic can be extended for a set of N equations with 'n' unknown variables.


• If there is a unique solution, then the system is called consistent independent. If there are various solutions,
then the system is called consistent dependent. If there are no solutions and if the equations are contradictory,
then the system is called inconsistent.
• For solving a large system of equations, Gaussian elimination can be used. The procedure is to reduce the augmented matrix to row echelon form by forward elimination; the unknowns are then recovered, starting from the last equation, in a step called backward substitution.



• To facilitate the application of Gaussian elimination method, the following row operations are applied:
1. Swapping the rows
2. Multiplying or dividing a row by a constant
3. Replacing a row by adding or subtracting a multiple of another row to it.
These concepts are illustrated in Example 2.3.

Example 2.3: Solve the following set of equations using Gaussian Elimination method.
2𝑥1 + 4𝑥2 = 6 and 4𝑥1 + 3𝑥2 = 7
Solution: Rewrite this in matrix form as follows:

Apply the transformation by dividing row 1 by 2. There are no general guidelines for row operations other than reducing the given matrix to row echelon form. The operator ~ means 'reduces to'. The above matrix can further be reduced as follows:
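A quick numerical cross-check of Example 2.3, assuming NumPy (np.linalg.solve internally performs an LU/Gaussian-elimination style factorization):

import numpy as np

A = np.array([[2.0, 4.0],
              [4.0, 3.0]])
y = np.array([6.0, 7.0])

x = np.linalg.solve(A, y)
print(x)          # [1. 1.]  ->  x1 = 1, x2 = 1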


2.3.2 Matrix Decompositions:


• It is often necessary to reduce a matrix to its constituent parts so that complex matrix operations can be
performed. These methods are also known as matrix factorization methods.
• The most popular matrix decomposition is called eigen decomposition. It is a way of reducing the matrix into eigen values and eigen vectors. A (symmetric) matrix A can then be decomposed as:
𝐴 = 𝑄𝛬𝑄𝑇

where Q is the matrix of eigen vectors, Λ is the diagonal matrix of eigen values, and 𝑸𝑻 is the transpose of matrix Q.


LU Decomposition
• One of the simplest matrix decompositions is LU decomposition, where the matrix A can be decomposed into two matrices:
A = LU
o Here, L is the lower triangular matrix and U is the upper triangular matrix. The decomposition can be done using the Gaussian elimination method.
o First, an identity matrix is augmented to the given matrix. Then, row operations and Gaussian elimination
is applied to reduce the given matrix to get matrices L and U.
o Example 2.4 illustrates the application of Gaussian elimination to get LU.

Example 2.4: Find LU decomposition of the given matrix:

Solution: First, augment an identity matrix and apply Gaussian elimination. The steps are as shown in:


Now, it can be observed that the first matrix is L, the lower triangular matrix, whose entries are the multipliers used in the reduction of the equations above, such as 3, 3 and 2/3. The second matrix is U, the upper triangular matrix, whose entries are the values of the matrix reduced by Gaussian elimination.

It can be cross-verified that the multiplication of L and U yields the original matrix A.
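A minimal SciPy sketch of LU decomposition; the matrix below is a hypothetical example, not the matrix of Example 2.4, and SciPy additionally returns a permutation matrix P because it applies partial pivoting:

import numpy as np
from scipy.linalg import lu

A = np.array([[2.0, 1.0, 1.0],
              [4.0, 3.0, 3.0],
              [8.0, 7.0, 9.0]])

P, L, U = lu(A)
print(np.allclose(P @ L @ U, A))   # True: A = P L U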

2.3.3 Machine Learning and Importance of Probability and Statistics:


• Machine learning is linked with statistics and probability. Like linear algebra, statistics is at the heart of machine learning. The importance of statistics needs to be stressed, as without statistics, analysis of data is difficult.
• Probability is especially important for machine learning. Any data can be assumed to be generated by a
probability distribution.

Probability Distributions
• A probability distribution of a variable, say X, summarizes the probability associated with X’s events.
Distribution is a function that describes the relationship between the observations in a sample space.
• Probability distributions are of two types:
1. Discrete probability distribution
2. Continuous probability distribution
• The relationship between the events of a continuous random variable and their probabilities is called a continuous probability distribution. It is summarized by a Probability Density Function (PDF). The PDF gives the relative likelihood (density) of observing an instance, and the plot of the PDF shows the shape of the distribution.
• The Cumulative Distribution Function (CDF) computes the probability of an observation being less than or equal to a given value. Both PDF and CDF are continuous functions.
• The discrete equivalent of PDF in discrete distribution is called Probability Mass Function (PMF).


Continuous Probability Distributions


• Common continuous probability distributions are the Normal, Rectangular (Uniform) and Exponential distributions.

1. Normal Distribution
• Normal distribution is a continuous probability distribution. It is also known as the Gaussian distribution or the bell-shaped curve distribution.
• It is the most common distribution function. The shape of this distribution is a typical bell-shaped curve.
• The heights of the students, blood pressure of a population, and marks scored in a class can be approximated
using normal distribution.
• PDF of the normal distribution is given as:

Here, μ is mean and σ is the standard deviation. Normal distribution is characterized by two parameters –
mean and variance.
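The standard form of this density, with μ and σ as defined above, is:

f(x) = \frac{1}{\sigma \sqrt{2\pi}} \, e^{-\frac{(x - \mu)^2}{2\sigma^2}}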
2. Rectangular Distribution
• This is also known as the uniform distribution. It has equal probability density for all values in the range [a, b].
• The uniform distribution is given as follows:
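The standard form of the uniform density over [a, b] is:

f(x) = \begin{cases} \frac{1}{b - a}, & a \le x \le b \\ 0, & \text{otherwise} \end{cases}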


3. Exponential Distribution
• This is a continuous probability distribution. It is used to describe the time between events in a Poisson process.
• The exponential distribution is a special case of the Gamma distribution with the shape parameter fixed at 1.
• This distribution is helpful in modelling the time until an event occurs.

Discrete Probability Distributions


• Binomial, Poisson, and Bernoulli distributions fall under this category.
1. Binomial Distribution
• Binomial distribution is another distribution that is often encountered in machine learning. Each trial has only two outcomes, success or failure; such a trial is called a Bernoulli trial.
• The objective of this distribution is to find the probability of getting k successes out of n trials. The number of ways of getting k successes out of n trials is given below.


• The binomial distribution function involves p, the probability of success, and (1 - p), the probability of failure. Combining the number of ways of choosing k successes with these probabilities gives the PMF of the binomial distribution, together with its mean, variance and standard deviation, as summarized below.
• Here, p is the probability of success, k is the number of successes, and n is the total number of trials.
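In standard form, with n, k and p as defined above, these quantities are:

P(X = k) = \binom{n}{k} p^{k} (1 - p)^{n - k}, \qquad \binom{n}{k} = \frac{n!}{k!\,(n - k)!}

\text{mean} = np, \qquad \text{variance} = np(1 - p), \qquad \text{standard deviation} = \sqrt{np(1 - p)}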

2. Poisson Distribution
• It is another important distribution that is quite useful. Given an interval of time, this distribution is used to model the probability of a given number of events k occurring in that interval. The mean rate of events is denoted λ.
• Some examples of the Poisson distribution are the number of emails received, the number of customers visiting a shop and the number of phone calls received by an office.


• The PMF of the Poisson distribution is given as follows:

Here, x is the number of times the event occurs and λ is the mean number of times an event occurs.
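In standard form, with x and λ as defined above, the PMF is:

P(X = x) = \frac{e^{-\lambda} \, \lambda^{x}}{x!}, \qquad x = 0, 1, 2, \ldots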

3. Bernoulli Distribution
• This distribution models an experiment whose outcome is binary. The outcome is positive with probability p and negative with probability 1 - p.
• The PMF of this distribution is given below.
The mean is p and the variance is p(1 - p) = pq, where q = 1 - p.
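In standard form, with p as defined above, the PMF is:

P(X = x) = p^{x} (1 - p)^{1 - x}, \qquad x \in \{0, 1\}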

Density Estimation
• Let there be a set of observed values x1, x2, …, xn from a larger set of data whose distribution is not known.
Density estimation is the problem of estimating the density function from an observed data.
• The estimated density function, denoted as p(x), can be evaluated directly for any unknown data point, say 𝑥𝑡, as p(𝑥𝑡). If this value is less than a threshold ε, then 𝑥𝑡 is categorized as an outlier or anomaly; otherwise, it is treated as normal data.
• There are two types of density estimation methods, namely parametric density estimation and non-parametric
density estimation.

Parametric Density Estimation


• It assumes that the data is from a known probabilistic distribution and can be estimated as p(x | Θ), where, Θ
is the parameter.
• Maximum likelihood function is a parametric estimation method.

Maximum Likelihood Estimation


• For a sample of observations, one can estimate the probability distribution. This is called density estimation.
Maximum Likelihood Estimation (MLE) is a probabilistic framework that can be used for density estimation.
• This involves formulating a function called likelihood function which is the conditional probability of
observing the observed samples and distribution function with its parameters.
• For example, consider a joint probability p(X; θ), where, X = {x1, x2, …, xn}
• The likelihood of observing the data is given as a function L(X; θ). The objective of MLE is to maximize this
function as max L(X; θ).
• The joint probability of this problem can be restated as a product of the individual probabilities.
• The computation of this product is numerically unstable, and hence the problem is restated as maximizing the log of the conditional probability given θ. This is given as:

• Instead of maximizing, one can minimize this function as:

This is called negative log-likelihood function.
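In standard form, under the i.i.d. assumption, the three formulations referred to above are:

L(X; \theta) = \prod_{i=1}^{n} p(x_i; \theta), \qquad
\max_{\theta} \sum_{i=1}^{n} \log p(x_i; \theta), \qquad
\min_{\theta} \left( -\sum_{i=1}^{n} \log p(x_i; \theta) \right)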


• MLE can be stated as:

Here, β is the regression coefficient and 𝑥𝑖 is the given sample.

Non-parametric Density Estimation


• A non-parametric estimation can be generative or discriminative. Parzen window and k-Nearest Neighbour
(KNN) rule are examples of non-parametric density estimation.

Parzen Window
• Let there be ‘n’ samples, X = {x1, x2, …, xn}
• The samples are drawn independently from the same distribution, i.e., they are independent and identically distributed (i.i.d.).
• Let R be the region that covers 'k' samples of the total 'n' samples. Then, the probability of a sample falling in R is given as:

• The estimate is given as:

where, V is the volume of the region R.
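In standard form, with k, n and V as defined above, these two estimates are:

P \approx \frac{k}{n}, \qquad p(x) \approx \frac{k}{nV}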


• The Parzen window is given as follows:

• The window indicates if the sample is inside the region or not. The Parzen probability density function estimate
using above equation is given as:

KNN Estimation
• The KNN estimation is another non-parametric density estimation method.
• Here, the initial parameter k is determined and based on that k-neighbours are determined.
• The probability density function estimate is the average of the values that are returned by the neighbours.



Understanding Data-2=> 2.4 Feature Engineering and Dimensionality Reduction Techniques
• Features are attributes. Feature engineering is about determining the subset of features that form an important
part of the input that improves the performance of the model, be it classification or any other model in machine
learning.
• Feature engineering deals with two problems – Feature Transformation and Feature Selection.
o Feature transformation is extraction of features and creating new features that may be helpful in
increasing performance. For example, the height and weight may give a new attribute called Body Mass
Index (BMI).
o Feature subset selection is another important aspect of feature engineering that focuses on selection of
features to reduce the time but not at the cost of reliability.
• The subset selection reduces the dataset size by removing irrelevant features and constructs a minimum set of
attributes for machine learning.
• The features can be removed based on two aspects:
1. Feature relevancy – Some features contribute more to classification than others. For example, a mole on the face can help in face detection more than common features like the nose. In simple words, the features should be relevant. The relevancy of the features can be determined based on information measures such as mutual information, correlation-based measures like the correlation coefficient, and distance measures.
2. Feature redundancy – Some features are redundant. For example, when a database table has a field called Date of Birth, the Age field is redundant as age can easily be computed from the date of birth. Removing the Age column reduces the dimensionality by one.


2.4.1 Stepwise Forward Selection


• This procedure starts with an empty set of attributes. Every time, an attribute is tested for statistical significance
for best quality and is added to the reduced set. This process is continued till a good reduced set of attributes is
obtained.

2.4.2 Stepwise Backward Elimination


• This procedure starts with a complete set of attributes. At every stage, the procedure removes the worst attribute
from the set, leading to the reduced set.

2.4.3 Principal Component Analysis


• The idea of the principal component analysis (PCA) or KL transform is to transform a given set of measurements
to a new set of features so that the features exhibit high information packing properties.
• This leads to a reduced and compact set of features.
• Consider a group of random vectors of the form:

• The mean vector of the set of random vectors is defined as:


• The operator E refers to the expected value of the population. This is calculated theoretically using the
probability density functions (PDF) of the elements xi and the joint probability density functions between the
elements xi and xj. From this, the covariance matrix can be calculated as:

• For M random vectors, when M is large enough, the mean vector and covariance matrix can be approximately calculated as Eq. (1) and Eq. (2) below.
• The mapping of the vectors x to y using the transformation can now be described as Eq. (3) below.

• This transform is also called the Karhunen-Loeve or Hotelling transform. The original vector x can then be reconstructed as shown below.
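The standard statement of these relations (the Hotelling/KL transform), consistent with Eqs. (1)–(3) referenced above, is:

\mathbf{m}_x \approx \frac{1}{M} \sum_{k=1}^{M} \mathbf{x}_k \quad (1), \qquad
\mathbf{C}_x \approx \frac{1}{M} \sum_{k=1}^{M} \mathbf{x}_k \mathbf{x}_k^{T} - \mathbf{m}_x \mathbf{m}_x^{T} \quad (2)

\mathbf{y} = \mathbf{A} (\mathbf{x} - \mathbf{m}_x) \quad (3), \qquad
\mathbf{x} = \mathbf{A}^{T} \mathbf{y} + \mathbf{m}_x

where A is the matrix whose rows are the eigen vectors of C_x.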

• The goal of PCA is to reduce the set of attributes to a newer, smaller set that captures the variance of the data.


Example 2.5: Let the data points be (2, 6)ᵀ and (1, 7)ᵀ. Apply PCA and find the transformed data. Again, apply the inverse and prove that PCA works.
Solution: One can combine two vectors into a matrix as follows:
The mean vector can be computed as Eq. (1) as follows:

As part of PCA, the mean must be subtracted from the data to get the adjusted data:

One can find the covariance for these data vectors. The covariance can be obtained using Eq. (2):


The final covariance matrix is obtained by adding these two matrices as:

The eigen values of matrix C can be obtained as λ1 = 1, λ2 = 0, and the corresponding eigen vectors are (-1, 1)ᵀ and (1, 1)ᵀ. The matrix A can be obtained by packing the eigen vectors of these eigen values (after sorting them) of matrix C. For this problem,

A = [ -1  1 ]
    [  1  1 ]

The transpose Aᵀ is the same matrix, as A is symmetric. The matrix can be normalized by dividing each element of an eigen vector by the norm of that vector to get:

One can check that the normalized PCA matrix A is orthogonal. A matrix is orthogonal if A⁻¹ = Aᵀ, i.e., AAᵀ = I (here, since A is also symmetric, A⁻¹ = A as well).


The transformed matrix y using Eq. (3) is given as:

Recollect that (x-m) is the adjusted matrix.

One can check the original matrix can be retrieved from this matrix as:

Therefore, one can infer that the original data is obtained without any loss of information.
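A minimal NumPy sketch reproducing Example 2.5 (np.cov uses the N - 1 normalization, which matches the eigen values λ1 = 1, λ2 = 0 above):

import numpy as np

X = np.array([[2.0, 1.0],      # columns are the two data points (2, 6) and (1, 7)
              [6.0, 7.0]])

m = X.mean(axis=1, keepdims=True)     # mean vector (1.5, 6.5)
Xc = X - m                            # mean-adjusted data

C = np.cov(X)                         # covariance matrix
vals, vecs = np.linalg.eigh(C)        # eigen values [0, 1] and orthonormal eigen vectors

A = vecs[:, ::-1].T                   # rows = eigen vectors, largest eigen value first
Y = A @ Xc                            # transformed data, y = A(x - m)

X_rec = A.T @ Y + m                   # inverse transform
print(np.allclose(X_rec, X))          # True -> reconstruction without loss of information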


2.4.4 Linear Discriminant Analysis


• Linear Discriminant Analysis (LDA) is also a feature reduction technique like PCA. The focus of LDA is to
project higher dimension data to a line (lower dimension data).
• LDA is also used to classify the data. Let there be two classes, c1 and c2. Let m1 and m2 be the mean of the
patterns of two classes. The mean of the class c1 and c2 can be computed as:

• The aim of LDA is to optimize the function:

where V is the linear projection, and σ𝐵 and σ𝑊 are the between-class scatter matrix and the within-class scatter matrix, respectively. For the two-class problem, these matrices are given as:

• The maximization of J(V) should satisfy the equation:


2.4.5 Singular Value Decomposition


• Singular Value Decomposition (SVD) is another useful decomposition technique. Let A be the matrix, then the
matrix A can be decomposed as:

Here, A is the given matrix of dimension m × n, U is the orthogonal matrix whose dimension is m × n, S is the
diagonal matrix of dimension n × n, and V is the orthogonal matrix.
• The procedure for finding decomposition matrix is given as follows:
1. For a given matrix, find 𝐴𝐴𝑇
2. Find eigen values of 𝐴𝐴𝑇
3. Sort the eigen values in a descending order. Pack the eigen vectors as a matrix U.
4. Arrange the square root of the eigen values in diagonal. This matrix is diagonal matrix, S.
5. Find the eigen values and eigen vectors of 𝐴𝑇𝐴 and pack the eigen vectors as a matrix called V.
• Thus, 𝐴 = 𝑈𝑆𝑉𝑇. Here, U and V are orthogonal matrices. The columns of U and V are the left and right singular vectors, respectively. SVD is useful in compression, as one can decide to retain only certain components instead of the original matrix A as:

Based on the choice of retention, the compression can be controlled.


Example 2.6: Find SVD of the matrix:

Solution: The first step is to compute:

The eigen value and eigen vector of this matrix can be calculated to get U. The eigen values of this matrix are 0.0098
and 101.9902.
The eigen vectors of this matrix are:

These vectors are normalized to get the vectors respectively as:

The matrix U can be obtained by concatenating the above vector as:


The main advantage of SVD is compression. A matrix, say an image, can be decomposed and only certain components selectively retained by making all other elements zero. This reduces the content of the image while retaining its quality. SVD is useful in data reduction too.
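A minimal NumPy sketch of SVD-based compression; the matrix below is a hypothetical example, not the matrix of Example 2.6:

import numpy as np

A = np.array([[3.0, 1.0, 1.0],
              [-1.0, 3.0, 1.0]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)    # A = U diag(s) V^T
print(np.allclose(U @ np.diag(s) @ Vt, A))          # True

# Compression: retain only the largest singular value (rank-1 approximation)
k = 1
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]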



Basic Learning Theory=> 2.5 Design of Learning System
• A system that is built around a learning algorithm is called a learning system. The design of systems focuses on
these steps:
1. Choosing a training experience
2. Choosing a target function
3. Representation of a target function
4. Function approximation

Training Experience
• Let us consider the design of a chess-playing program.
• In direct experience, individual board states and the correct moves of the chess game are given directly.
• In indirect experience, only the move sequences and final results are given.
• The training experience also depends on the presence of a supervisor who can label all valid moves for a board state.
• In the absence of a supervisor, the game agent plays against itself and learns the good moves. If the training samples and testing samples have the same distribution, the results would be good.

Determine the Target Function


• The next step is the determination of a target function. In this step, the type of knowledge that needs to be learnt
is determined.
• In direct experience, a board move is selected and it is determined whether it is a good move or not against all other moves. If it is the best move, then it is chosen; this defines a mapping B -> M, where B is the set of board states and M is the set of legal moves.

• In indirect experience, all legal moves are accepted and a score is generated for each. The move with largest
score is then chosen and executed.

Determine the Target Function Representation


• The representation of knowledge may be a table, a collection of rules, or a neural network. The linear combination of board features can be written as shown below,
where x1, x2 and x3 represent different board features and w0, w1, w2 and w3 represent weights.
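With these features and weights, the linear combination referred to above takes the form:

\hat{V}(b) = w_0 + w_1 x_1 + w_2 x_2 + w_3 x_3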

Choosing an Approximation Algorithm for the Target Function


• The focus is to choose weights and fit the given training samples effectively. The aim is to reduce the error given
as:

Here, b is a training sample and V̂(b) is the predicted hypothesis. The approximation is carried out as:
o Computing the error as the difference between the trained and expected hypothesis. Let this error be error(b).
o Then, for every board feature xi, the weights are updated as shown below.
o Here, μ is the constant that moderates the size of the weight update.
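With error(b) and μ as defined above, the update referred to in the second point is the standard LMS-style rule:

w_i \leftarrow w_i + \mu \cdot error(b) \cdot x_i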



Basic Learning Theory=> 2.6 Introduction to Concept Learning
• Concept learning is a learning strategy of acquiring abstract knowledge or inferring a general concept or
deriving a category from the given training samples.
• It is a process of abstraction and generalization from the data.
• Concept learning helps to classify an object that has a set of common, relevant features.
• For example, humans can identify different kinds of animals based on common relevant features and categorize all animals based on specific sets of features. The special features that distinguish one animal from another can be called a concept. This way of learning categories for objects and recognizing new instances of those categories is called concept learning.
• Concept learning requires three things:
1. Input – Training dataset which is a set of training instances, each labeled with the name of a concept or
category to which it belongs.
2. Output – Target concept or Target function f. It is a mapping function f(x) from input x to output y. It is to
determine the specific features or common features to identify an object.
3. Test – New instances to test the learned model.
• Formally, Concept learning is defined as–"Given a set of hypotheses, the learner searches through the hypothesis
space to identify the best hypothesis that matches the target concept".
• Consider the following set of training instances shown in Table 2.2.

Table 2.2: Sample Training Instances

• Here, in this set of training instances, the independent attributes considered are ‘Horns’, ‘Tail’, ‘Tusks’, ‘Paws’,
‘Fur’, ‘Color’, ‘Hooves’ and ‘Size’.
• The dependent attribute is ‘Elephant’.
• The target concept is to identify the animal to be an Elephant.
• Let us now take this example and understand further the concept of hypothesis.
Target Concept: Predict the type of animal - For example –‘Elephant’.

2.6.1 Representation of a Hypothesis


• A hypothesis ‘h’ approximates a target function ‘f’ to represent the relationship between the independent
attributes and the dependent attribute of the training instances.
• The hypothesis is the predicted approximate model that best maps the inputs to outputs.
• Each hypothesis is represented as a conjunction of attribute conditions in the antecedent part, for example, (Tail = Short) ∧ (Color = Black). The set of hypotheses in the search space is called hypotheses.

• Generally, 'H' is used to represent the hypotheses and 'h' is used to represent a candidate hypothesis.
• Each attribute condition is the constraint on the attribute which is represented as attribute-value pair.
• In the antecedent of an attribute condition of a hypothesis, each attribute can take value as either ‘?’ or ‘Ø’ or can
hold a single value.
o “?” denotes that the attribute can take any value [e.g., Color =?]
o “Ø” denotes that the attribute cannot take any value, i.e., it represents a null value [e.g., Horns = Ø]
o Single value denotes a specific single value from acceptable values of the attribute, i.e., the attribute ‘Tail’
can take a value as ‘short’ [e.g., Tail = Short]
• For example, a hypothesis ‘h’ will look like,

• Given a test instance x, we say h(x) = 1, if the test instance x satisfies this hypothesis h.


• The training dataset given above has 5 training instances with 8 independent attributes and one dependent
attribute. Here, the different hypotheses that can be predicted for the target concept are,

• The task is to predict the best hypothesis for the target concept (an elephant).
• The most general hypothesis can allow any value for each of the attribute.
o It is represented as: <?,?,?,?,?,?,?, ?>. This hypothesis indicates that any animal can be an elephant.
• The most specific hypothesis will not allow any value for each of the attribute.
o < Ø, Ø, Ø, Ø, Ø, Ø, Ø, Ø >. This hypothesis indicates that no animal can be an elephant.

Example 2.7: Explain Concept Learning Task of an Elephant from the dataset given in Table 2.2. Given,
Input: 5 instances each with 8 attributes
Target concept/function ‘c’: Elephant → {Yes, No}
Hypotheses H: Set of hypothesis each with conjunctions of literals as propositions [i.e., each literal is
represented as an attribute-value pair]
Solution: The hypothesis ‘h’ for the concept learning task of an Elephant is given as:



The hypothesis produced is also called a concept description, which is a model that can be used to classify subsequent instances.

2.6.2 Hypothesis Space


• Hypothesis space is the set of all possible hypotheses that approximates the target function f.
• From this set of hypotheses in the hypothesis space, a machine learning algorithm would determine the best
possible hypothesis that would best describe the target function or best fit the outputs.
• The subset of the hypothesis space that is consistent with all observed training instances is called the Version Space.
• The version space represents the only hypotheses that are used for the classification.
• For example, each of the attribute given in the Table 2.2 has the following possible set of values.


• Considering these values for each of the attribute, there are (2 × 2 × 2 × 2 × 2 × 3 × 2 × 2) = 384 distinct instances
covering all the 5 instances in the training dataset.
• So, we can generate (4 × 4 × 4 × 4 × 4 × 5 × 4 × 4) = 81,920 distinct hypotheses when including two more values [?,
Ø] for each of the attribute.
• However, any hypothesis containing one or more Ø symbols represents the empty set of instances; that is, it
classifies every instance as negative instance. Therefore, there will be (3 × 3 × 3 × 3 × 3 × 4 × 3 × 3 + 1) = 8,749
distinct hypotheses by including only ‘?’ for each of the attribute and one hypothesis representing the empty set
of instances.
• Thus, the hypothesis space is much larger and hence we need efficient learning algorithms to search for the best
hypothesis from the set of hypotheses.


2.6.3 Generalization and Specialization


• In order to understand how we construct this concept hierarchy, let us apply the general principle of the generalization/specialization relation.
• By generalization of the most specific hypothesis and by specialization of the most general hypothesis, the
hypothesis space can be searched for an approximate hypothesis that matches all positive instances but does not
match any negative instance.

Searching the Hypothesis Space


• There are two ways of learning the hypothesis, consistent with all training instances from the large hypothesis
space.
1. Generalization – Specific to General learning
2. Specialization – General to Specific learning

Generalization – Specific to General Learning


• This learning methodology will search through the hypothesis space for an approximate hypothesis by
generalizing the most specific hypothesis.


Example 2.8: Consider the training instances shown in Table 2.2 and illustrate Specific to General Learning.
Solution: We will start from all false or the most specific hypothesis to determine the most restrictive
specialization. Consider only the positive instances and generalize the most specific hypothesis. Ignore the negative
instances.
This learning is illustrated as follows: The most specific hypothesis is taken now, which will not classify any
instance to true.

Read the first instance I1, to generalize the hypothesis h so that this positive instance can be classified by the
hypothesis h1.

When reading the second instance I2, it is a negative instance, so ignore it.

Similarly, when reading the third instance I3, it is a positive instance so generalize h2 to h3 to accommodate it. The
resulting h3 is generalized.


Ignore I4 since it is a negative instance.

When reading the fifth instance I5, h4 is further generalized to h5.

Now, after observing all the positive instances, an approximate hypothesis h5 is generated which can now classify
any subsequent positive instance to true.

Specialization – General to Specific Learning


• This learning methodology will search through the hypothesis space for an approximate hypothesis by
specializing the most general hypothesis.

Example 2.9: Illustrate learning by Specialization – General to Specific Learning for the data instances shown in
Table 2.2.
Solution: Start from the most general hypothesis which will make true all positive and negative instances.


2.6.4 Hypothesis Space Search by Find-S Algorithm


• Find-S algorithm is guaranteed to converge to the most specific hypothesis in H that is consistent with the
positive instances in the training dataset.
• Obviously, it will also be consistent with the negative instances. Thus, this algorithm considers only the positive
instances and eliminates negative instances while generating the hypothesis.
• It initially starts with the most specific hypothesis.


Algorithm 2.1: Find-S
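A minimal Python sketch of the procedure described above, using 'Ø' for the null constraint and '?' for 'any value'; the small dataset is a hypothetical illustration only in the spirit of Table 2.3:

def find_s(instances, labels):
    # Start with the most specific hypothesis: every attribute constrained to 'Ø'.
    h = ['Ø'] * len(instances[0])
    for x, label in zip(instances, labels):
        if label != 'Yes':          # negative instances are ignored
            continue
        for i, value in enumerate(x):
            if h[i] == 'Ø':         # first positive instance: copy its attribute value
                h[i] = value
            elif h[i] != value:     # mismatch with a later positive instance: generalize to '?'
                h[i] = '?'
    return h

# Hypothetical instances: (CGPA, Interactive, Practical Knowledge, Communication)
data = [('>=9', 'Yes', 'Excellent', 'Good'),
        ('>=9', 'Yes', 'Good',      'Good'),
        ('>=8', 'No',  'Good',      'Good'),
        ('>=9', 'Yes', 'Good',      'Good')]
labels = ['Yes', 'Yes', 'No', 'Yes']

print(find_s(data, labels))          # ['>=9', 'Yes', '?', 'Good']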


Example 2.10: Consider the training dataset of 4 instances shown in Table 2.3. It contains the details of the
performance of students and their likelihood of getting a job offer or not in their final semester. Apply the Find-S
algorithm.
Table 2.3 Training Dataset



Limitations of Find-S Algorithm


1. Find-S algorithm tries to find a hypothesis that is consistent with positive instances, ignoring all negative
instances. As long as the training dataset is consistent, the hypothesis found by this algorithm may be
consistent.
2. The algorithm finds only one unique hypothesis, wherein there may be many other hypotheses that are
consistent with the training dataset.
3. Many times, the training dataset may contain some errors; hence such inconsistent data instances can
mislead this algorithm in determining the consistent hypothesis since it ignores negative instances.

• Hence, it is necessary to find the set of hypotheses that are consistent with the training data including the
negative examples.
• To overcome the limitations of Find-S algorithm, Candidate Elimination algorithm was proposed to output the
set of all hypotheses consistent with the training dataset.

2.6.5 Version Spaces


• The version space contains the subset of hypotheses from the hypothesis space that is consistent with all training
instances in the training dataset.


List-Then-Eliminate Algorithm
• The principal idea of this learning algorithm is to initialize the version space to contain all hypotheses and then eliminate any hypothesis that is found inconsistent with any training instance.
• Initially, the algorithm starts with a version space to contain all hypotheses scanning each training instance. The
hypotheses that are inconsistent with the training instance are eliminated.
• Finally, the algorithm outputs the list of remaining hypotheses that are all consistent.

Algorithm 2.2: List-Then-Eliminate

• This algorithm works fine if the hypothesis space is finite but practically it is difficult to deploy this algorithm.
Hence, a variation of this idea is introduced in the Candidate Elimination algorithm.
Version Spaces and the Candidate Elimination Algorithm
• Version space learning is to generate all consistent hypotheses. This algorithm computes the version space by the combination of two cases, namely,
o Specific to General learning – Generalize S to include the positive example
o General to Specific learning – Specialize G to exclude the negative example

• Using the Candidate Elimination algorithm, we can compute the version space containing all (and only those)
hypotheses from H that are consistent with the given observed sequence of training instances.
• The algorithm defines two boundaries called ‘general boundary’ which is a set of all hypotheses that are the
most general and ‘specific boundary’ which is a set of all hypotheses that are the most specific.
• Thus, the algorithm limits the version space to contain only those hypotheses that are most general and most
specific.
Algorithm 2.3: Candidate Elimination


• Generating Positive Hypothesis ‘S’ If it is a positive example, refine S to include the positive instance. We need
to generalize S to include the positive instance. The hypothesis is the conjunction of ‘S’ and positive instance.
• Generating Negative Hypothesis ‘G’ If it is a negative instance, refine G to exclude the negative instance. Then,
prune G to exclude all inconsistent hypotheses in G with the positive instance.
• Generating Version Space – [Consistent Hypothesis] We need to take the combination of sets in ‘G’ and check
that with ‘S’. When the combined set fields are matched with fields in ‘S’, then only that is included in the
version space as consistent hypothesis.

Example 2.11: Consider the same set of instances from the training dataset shown in Table 2.3 and generate the version space as consistent hypotheses.
Solution:
Step 1: Initialize ‘G’ boundary to the maximally general hypotheses,


Step 2: Initialize ‘S’ boundary to the maximally specific hypothesis. There are 6 attributes, so for each attribute, we
initially fill ‘Ø’ in the hypothesis ‘S’.

Generalize the initial hypothesis for the first positive instance. I1 is a positive instance; so generalize the most
specific hypothesis ‘S’ to include this positive instance. Hence,

Step 3:
Iteration 1
Scan the next instance I2. Since I2 is a positive instance, generalize ‘S1’ to include positive instance I2. For each of
the non-matching attribute value in ‘S1’, put a ‘?’ to include this positive instance. The third attribute value is
mismatching in ‘S1’ with I2, so put a ‘?’.

Prune G1 to exclude all inconsistent hypotheses with the positive instance. Since G1 is consistent with this
positive instance, there is no change. The resulting G2 is,


Iteration 2
Now Scan I3,

Since it is a negative instance, specialize G2 to exclude the negative example but stay consistent with S2.
Generate hypothesis for each of the non-matching attribute value in S2 and fill with the attribute value of S2. In
those generated hypotheses, for all matching attribute values, put a ‘?’. The first, second and 6th attribute
values do not match, hence ‘3’ hypotheses are generated in G3.
There is no inconsistent hypothesis in S2 with the negative instance, hence S3 remains the same.

Iteration 3
Now Scan I4. Since it is a positive instance, check for mismatch in the hypothesis ‘S3’ with I4. The 5th and 6th
attribute value are mismatching, so add ‘?’ to those attributes in ‘S4’.


Prune G3 to exclude all inconsistent hypotheses with the positive instance I4.

Since the third hypothesis in G3 is inconsistent with this positive instance, remove the third one. The resulting
G4 is,

Using the two boundary sets, S4 and G4, the version space is converged to contain the set of consistent
hypotheses.
The final version space is,

Thus, the algorithm finds the version space to contain only those hypotheses that are most general and most
specific.
The diagrammatic representation of deriving the version space is shown in Figure 2.5.


Figure 2.5 Deriving the Version Space



Basic Learning Theory=> 2.7 Modelling in Machine Learning
• The process of modelling means training a machine learning algorithm with the training dataset, tuning it to
increase performance, validating it and making predictions for a new unseen data.
• The major concern in machine learning is what model to select, how to train the model, time required to train,
the dataset to be used, what performance to expect, and so on.
• Learning the parameters is the main goal in machine learning algorithms. There are two types of parameters –
model parameters and hyperparameters.
o Certain parameters can be learnt directly from training data and are called model parameters. For
example, the coefficients used in regression model, split attributes in decision tree model, weights and
biases in neural networks and so on.
o Hyperparameters are higher-level parameters which cannot be learnt directly. For example, regularization
lambda 𝝀 used in regularized regression, number of decision trees to include in a random forest, and so
on.
• Evaluating the selected machine learning model is also equally important as training the model. Hence, the
dataset is split into two subsets called training dataset and test dataset, wherein the training dataset is used to
train the model and the test dataset is used to evaluate the model.
• During prediction, an error occurs when the estimated output does not match with the true output.
o Training error, also called as in-sample error, results when applying the predicted model on the training
data, while Test error also called as out-of-sample error is the average error when predicting on unseen
observations.
o The error function or the loss function is the aggregation of the differences between the true values and
the predicted values.

• This loss function is defined as the Mean Squared Error (MSE), which is the average of the squared differences
between the true values 𝐘𝒊 and the predicted values 𝐟(𝐗 𝒊 ) for an input value ‘𝐗 𝒊 ‘.
• A smaller value of MSE denotes that the error is less and, therefore, the prediction is more accurate.
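With Yᵢ and f(Xᵢ) as defined above, the MSE over n samples takes the standard form:

MSE = \frac{1}{n} \sum_{i=1}^{n} \left( Y_i - f(X_i) \right)^{2}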

Machine Learning Process


The four basic steps in the machine learning process are:
1. Choose a machine learning algorithm to suit the training data and the problem domain
2. Input the training dataset and train the machine learning algorithm to learn from the data and capture the
patterns in the data
3. Tune the parameters of the model to improve the accuracy of learning of the algorithm
4. Evaluate the learned model once the model is built

2.7.1 Model Selection and Model Evaluation


The biggest challenge in machine learning is choosing an algorithm that suits the problem. Hence, model
selection and assessment are very important and deal with two types of complexities.
1. Model Performance – How well does the model perform on the training dataset?
2. Model Complexity – How much complexity does the model possess after the training phase is over?
Some of the approaches used for selecting a machine learning model are listed below:
1. Use resample methods and split the dataset as training, testing and validation datasets and observe the
performance of the model over all the phases. This approach is suitable for smaller datasets.

2. The simplest approach is to fit a model on the training dataset and to compute measures like error or
accuracy.
3. The use of probabilistic framework and quantification of the performance of the model as a score is the
third approach.

2.7.2 Re-sampling Methods


• Re-sampling is a technique to select a model by reconstructing the training dataset and test dataset by
randomly choosing instances by some method from the given dataset.
• This method involves selecting different instances repeatedly from a training dataset to tune a model.
• It is done to improve the accuracy of a model.
• The common re-sampling model selection methods are Random train/test splits, Cross-Validation (K-fold,
LOOCV, etc.) and Bootstrap.
Cross-Validation
• Cross-Validation is a method by which we can tune the model with only training dataset.
• It is a model evaluation approach by which we can set aside some data of the training dataset for validation
and fit the rest of the data to train the model.
• The best model is found by estimating the average of errors on different test data.
• The popular cross-validation family of methods includes Holdout method, K-fold cross-validation, Stratified
cross-validation and Leave-One-Out Cross-Validation (LOOCV).

1. Holdout Method
• This is the simplest method of cross-validation.
• The dataset is split into two subsets called training dataset and test dataset.
• The model is trained using the training dataset and then evaluated using the test dataset.
• This holdout method can be applied for a single time which is called as single holdout method or it can be
repeated for more than once which is called as repeated holdout method.
• The average performance on the test dataset is estimated to evaluate the model.
2. K-fold Cross-Validation
• Another way of cross-validating is using a k-fold cross-validation, which will split the training dataset into
k equal folds/parts creating k – 1 subsets of training set and one test subset.
• Out of the k folds, k – 1 folds are used for training and one fold is used for testing the model.
• This has to be performed for k iterations and during each iteration a different fold is selected for testing.
• The average performance of the model on k iterations is the final estimate of the model performance. The
illustration of this re-sampling is shown in Figure 2.6.
Algorithm 2.4: K-fold Cross Validation
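A minimal sketch of k-fold cross-validation using scikit-learn; the random dataset and the logistic-regression model are illustrative assumptions, not part of the slides:

import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LogisticRegression

# Hypothetical dataset: 100 instances, 5 features, binary labels
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = rng.integers(0, 2, size=100)

kf = KFold(n_splits=5, shuffle=True, random_state=0)            # k = 5 folds
scores = cross_val_score(LogisticRegression(), X, y, cv=kf)     # one score per fold

print(scores, scores.mean())    # the average over the k folds is the final estimate

StratifiedKFold and LeaveOneOut from the same module can be substituted for kf to obtain the stratified and LOOCV variants described below.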


Figure 2.6 Illustration of K-fold Cross-Validation


3. Stratified K-fold Cross-Validation
• This method is similar to k-fold cross-validation but with a slight difference.
• Here, it is ensured that while splitting the dataset into k folds, each fold should contain the same
proportion of instances with a given categorical value. This is called stratified cross-validation.
4. Leave-One-Out Cross-Validation (LOOCV)
• This method repeatedly splits the n data instances of the dataset into training dataset containing n – 1
data instances and leaving one data instance for evaluating the model.
• This process is repeated n times and average test error is then estimated for the model.
• Even though this model is expensive and time consuming because it has to run for n times (i.e., n data
instances in the dataset), it has less bias.
• For example, if the training dataset contains 100 data instances, then 99 instances are used for training and
one instance to test or evaluate the model. This process is repeated 100 times selecting a different instance
as holdout instance for testing in each iteration.
• The illustration of this re-sampling is shown in Figure 2.7.
Algorithm 2.5: LOOCV


Figure 2.7 Illustration of Leave-One-Out Cross-Validation


Model Performance
• The focus of this section is the evaluation of classifier models. Classifiers are unstable as a small change in the
input can change the output.
• There are several metrics that can be used to describe the quality and usefulness of a classifier. One way to
compute the metrics is to form a table called contingency table. For example, consider a test for detecting a
disease, say cancer. Table 2.4 shows a contingency table for this scenario.
Table 2.4 Contingency Table

o In this table, True Positive (TP) = number of cancer patients correctly classified by the test, and True Negative (TN) = number of normal patients correctly detected as not having cancer. Two errors are involved in this process: a False Positive (FP) is a false alarm, where the test shows positive although the patient has no disease, and a False Negative (FN) occurs when the test says negative or normal although the patient has cancer. FP and FN are costly errors in this classification process.
• The metrics that can be derived from this contingency table are listed below:
1. Sensitivity – The sensitivity of a test is the probability that it will produce a true positive result when used on a test dataset. It is also known as the true positive rate. The sensitivity of a test can be determined by calculating: TP / (TP + FN)
2. Specificity – The specificity of a test is the probability that it will produce a true negative result when used on a test dataset. The specificity of a test can be determined by calculating: TN / (TN + FP)
3. Positive Predictive Value – The positive predictive value of a test is the probability that an object is classified correctly when a positive test result is observed. It can be determined by calculating: TP / (TP + FP)
4. Negative Predictive Value – The negative predictive value of a test is the probability that an object is not classified properly when a negative test result is observed. It can be determined by calculating: TN / (TN + FN)
5. Accuracy – The accuracy of the classifier is the proportion of correct predictions, computed as: (TP + TN) / (TP + TN + FP + FN)
6. Precision – Precision is also known as positive predictive power. It is defined as the ratio of true positives divided by the sum of true positives and false positives. Precision indicates how good the classifier is in predicting the positive classes: Precision = TP / (TP + FP)
7. Recall – It is the same as sensitivity: Recall = Sensitivity = TP / (TP + FN)
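A minimal Python sketch that computes these metrics from the four counts of the contingency table; the example counts are hypothetical:

def classifier_metrics(TP, TN, FP, FN):
    sensitivity = TP / (TP + FN)               # recall / true positive rate
    specificity = TN / (TN + FP)
    ppv = TP / (TP + FP)                       # precision / positive predictive value
    npv = TN / (TN + FN)
    accuracy = (TP + TN) / (TP + TN + FP + FN)
    return sensitivity, specificity, ppv, npv, accuracy

# Hypothetical counts for the cancer-test example
print(classifier_metrics(TP=40, TN=45, FP=5, FN=10))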

Classifier Performance as Distance Measures


• The classifier performance can also be computed as a distance measure.
• The classifier accuracy can be plotted as a point; a point towards the north-west indicates a better classifier.
• The Euclidean distance between the points of two classifiers can give a performance measure. The value ranges from 0 to 1.
Visual Classifier Performance
• Receiver Operating Characteristic (ROC) curve and Precision-Recall curves indicate the performance of
classifiers visually.
• ROC curves are visual means of checking the accuracy and comparison of classifiers. ROC is a plot of sensitivity
(True Positive Rate) and the 1-specificity (False Positive Rate) for a given model.
• A sample ROC curve is shown in Figure 2.8, where the results of five classifiers are given.
o A is the ROC of an average classifier.
o The ideal classifier is E, where the area under the curve is 1.0. Theoretically, it can range from 0.9 to 1.
o The rest of the classifiers B, C, D are categorized as good, better and still better based on their area under curve values.
• We start from the bottom left-hand corner initially. If we have any true positive case, we move up and plot a point.

Figure 2.8 A Sample ROC Curve
• If it is a false positive case, we move right and plot. This process is
repeated until the complete curve is drawn.
• In ROC, the diagonal of the plot indicates the model has no skill or
random classifier, and skillful models show the curve above the diagonal.
• In short, if the ROC curve is closer to the diagonal line, then it shows the
classifier to be less accurate.
Scoring Methods
• Another alternative for model selection is to combine the complexity of the model and the performance of the model as a score. Model selection is then done by selecting the model that maximizes or minimizes the score.
• Minimum Description Length (MDL) is one such method. The aim is to describe target variable and model in
terms of bits.
• MDL is the principle of using the minimum number of bits to represent the data and the model. It is a variant of Occam's Razor, which states that the model with the simplest explanation is the best model. MDL too recommends the selection of the hypothesis that minimizes the sum of the two description lengths of the data and the model.
• Let h be a learning model, let L(h) be the number of bits used to represent the model, and let D be the predictions; then the MDL is given as:
L(h) + L(D|h)
where L(D|h) is the number of bits used to represent the predictions D based on the training set.

******
