AI&ML Module 2
• It is defined as covariance(X, Y) or COV(X, Y) and is used to measure how two dimensions vary together.
• The formula for finding the covariance for specific x and y is:
COV(X, Y) = (1/N) Σᵢ (xᵢ − E(X)) (yᵢ − E(Y))
Here, xᵢ and yᵢ are data values from X and Y, E(X) and E(Y) are the mean values of xᵢ and yᵢ, and N is the number
of given data values. Also, COV(X, Y) is the same as COV(Y, X).
Example 2.1: Find the covariance of data X = {1, 2, 3, 4, 5} and Y = {1, 4, 9, 16, 25}.
Solution: Mean(X) = E(X) = 15/5 = 3, Mean(Y) = E(Y) = 55/5 = 11. The covariance is computed using COV(X, Y) as:
COV(X, Y) = [(1 − 3)(1 − 11) + (2 − 3)(4 − 11) + (3 − 3)(9 − 11) + (4 − 3)(16 − 11) + (5 − 3)(25 − 11)] / 5
          = (20 + 7 + 0 + 5 + 28) / 5 = 60/5 = 12
The covariance between X and Y is 12. It can be normalized to a value between -1 and +1 by dividing it by the
product of the standard deviations of the two variables. This normalized value is called the Pearson correlation
coefficient. Sometimes, N - 1 can also be used instead of N; in that case, the covariance is 60/4 = 15.
Correlation
• The correlation indicates the relationship between dimensions using its sign. The sign is more important
than the actual value.
1. If the value is positive, it indicates that the dimensions increase together.
2. If the value is negative, it indicates that while one dimension increases, the other dimension decreases.
Example 2.2: Find the correlation coefficient of data X = {1, 2, 3, 4, 5} and Y = {1, 4, 9, 16, 25}.
Solution: The mean values of X and Y are 15/5 = 3 and 55/5 = 11. The standard deviations of X and Y are 1.41 and
8.6486, respectively. Therefore, the correlation coefficient is the ratio of the covariance (12, from Example 2.1)
to the product of the standard deviations of X and Y:
r = 12 / (1.41 × 8.6486) ≈ 0.98
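The calculations in Examples 2.1 and 2.2 can be checked with a few lines of NumPy; this is only a verification sketch, and the array names are illustrative.

```python
import numpy as np

X = np.array([1, 2, 3, 4, 5])
Y = np.array([1, 4, 9, 16, 25])

# Population covariance (divide by N, as in Example 2.1); note np.cov divides by N-1 by default.
cov_xy = np.mean((X - X.mean()) * (Y - Y.mean()))
print(cov_xy)                                 # 12.0

# Pearson correlation coefficient (Example 2.2).
r = cov_xy / (X.std() * Y.std())              # np.std divides by N by default
print(r)                                      # ~0.98
print(np.corrcoef(X, Y)[0, 1])                # same value from the library routine
```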
• Multivariate data has three or more variables.
• The mean of multivariate data is a mean vector; for example, the mean of the three attributes shown is (2, 5, 1.33).
• The variance of multivariate data becomes the covariance matrix.
• The mean vector is called the centroid and the variance is called the dispersion matrix (discussed later).
Heatmap
• A heatmap is a graphical representation of a 2D matrix.
• It takes a matrix as input and colours it: darker colours indicate larger values and lighter colours indicate smaller
values.
• The advantage of this method is that humans perceive colours well, so by colour shading, larger values can be spotted easily.
• For example, in vehicle traffic data, heavy traffic regions can be differentiated from low traffic regions through heatmap.
• In Figure 2.3, patient data highlighting weight and health status is plotted. Here, the X-axis is weight and the Y-axis is patient count.
The dark regions highlight where patient counts are high for a given weight and health status.
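As a rough illustration of how such a plot is produced (not the actual data of Figure 2.3), a small matplotlib sketch with a randomly generated matrix:

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical 2-D matrix, e.g. counts binned by weight range and health status.
data = np.random.randint(0, 50, size=(6, 8))

plt.imshow(data, cmap='hot')      # each cell is coloured according to its magnitude
plt.colorbar(label='Count')
plt.xlabel('Weight bin')
plt.ylabel('Health status')
plt.title('Heatmap of a 2-D matrix')
plt.show()
```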
Pairplot
• Pairplot or scatter matrix is a data visual technique for multivariate data. A scatter matrix consists of several pair-wise scatter
plots of variables of the multivariate data. All the results are presented in a matrix format.
• By visual examination of the chart, one can easily find relationships among the variables such as correlation between the
variables.
• A random matrix of three columns is chosen and the relationships between the columns are plotted as a pairplot (or scatter matrix), as
shown below in Figure 2.4.
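A pairplot like the one in Figure 2.4 can be generated, for instance, with seaborn on a random three-column matrix (the column names here are placeholders):

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Random matrix of three columns, as described for Figure 2.4.
df = pd.DataFrame(np.random.randn(100, 3), columns=['A', 'B', 'C'])

sns.pairplot(df)                    # pair-wise scatter plots of every column against every other
# pd.plotting.scatter_matrix(df)    # an alternative that avoids the seaborn dependency
plt.show()
```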
• This is true if y is not zero and A is not zero. The logic can be extended to a system of N equations with 'n' unknown
variables.
• If there is a unique solution, the system is called consistent independent. If there are infinitely many solutions,
the system is called consistent dependent. If there are no solutions because the equations are contradictory,
the system is called inconsistent.
• For solving a large system of equations, Gaussian elimination can be used. The procedure for
applying Gaussian elimination is given as follows:
• To facilitate the application of the Gaussian elimination method, the following row operations are applied:
1. Swapping the rows
2. Multiplying or dividing a row by a constant
3. Replacing a row by adding or subtracting a multiple of another row to it.
These concepts are illustrated in Example 2.3.
Example 2.3: Solve the following set of equations using Gaussian Elimination method.
2𝑥1 + 4𝑥2 = 6 and 4𝑥1 + 3𝑥2 = 7
Solution: Rewrite this in augmented matrix form as follows:
[2 4 | 6; 4 3 | 7]
Apply the transformation by dividing row 1 by 2. There are no general guidelines for row operations
other than reducing the given matrix to row echelon form. The operator ~ means "reduces to":
[2 4 | 6; 4 3 | 7] ~ [1 2 | 3; 4 3 | 7]
The above matrix can further be reduced by applying R2 ← R2 − 4R1 and then dividing row 2 by −5:
[1 2 | 3; 4 3 | 7] ~ [1 2 | 3; 0 −5 | −5] ~ [1 2 | 3; 0 1 | 1]
Back substitution gives x₂ = 1 and x₁ = 3 − 2x₂ = 1.
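The same row reduction can be reproduced numerically; a minimal NumPy sketch of Example 2.3 (the row operations mirror the steps above):

```python
import numpy as np

# Augmented matrix for 2*x1 + 4*x2 = 6 and 4*x1 + 3*x2 = 7.
M = np.array([[2.0, 4.0, 6.0],
              [4.0, 3.0, 7.0]])

M[0] = M[0] / 2.0             # R1 <- R1 / 2
M[1] = M[1] - 4.0 * M[0]      # R2 <- R2 - 4*R1 (row echelon form)
x2 = M[1, 2] / M[1, 1]        # back substitution
x1 = M[0, 2] - M[0, 1] * x2
print(x1, x2)                 # 1.0 1.0

# Cross-check with the library solver.
print(np.linalg.solve([[2.0, 4.0], [4.0, 3.0]], [6.0, 7.0]))
```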
• A matrix A can also be factorized (eigen decomposition) as A = QΛQᵀ, where Q is the matrix of eigen vectors, Λ is the diagonal matrix of eigen values and Qᵀ is the transpose of matrix Q.
LU Decomposition
• One of the simplest matrix decompositions is LU decomposition, where the matrix A can be decomposed into two
matrices:
A = LU
o Here, L is the lower triangular matrix and U is the upper triangular matrix. The decomposition can be
done using Gaussian elimination method.
o First, an identity matrix is augmented to the given matrix. Then, row operations and Gaussian elimination
are applied to reduce the given matrix to obtain the matrices L and U.
o Example 2.4 illustrates the application of Gaussian elimination to get LU.
Solution: First, augment an identity matrix and apply Gaussian elimination. The steps are as shown below:
Now, it can be observed that the first matrix is L, the lower triangular matrix whose entries are the
multipliers used in the reduction of the equations above (such as 3, 3 and 2/3). The second matrix is U, the upper
triangular matrix whose entries are the values of the reduced matrix resulting from Gaussian elimination.
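SciPy provides a ready-made LU routine; the following sketch uses a small hypothetical matrix (the matrix of Example 2.4 is not reproduced here) just to show the decomposition and its check:

```python
import numpy as np
from scipy.linalg import lu

A = np.array([[4.0, 3.0],
              [6.0, 3.0]])           # illustrative matrix only

P, L, U = lu(A)                      # SciPy also returns a permutation matrix P, so A = P @ L @ U
print(L)                             # lower triangular (unit diagonal)
print(U)                             # upper triangular
print(np.allclose(P @ L @ U, A))     # True: the product reconstructs A
```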
Probability Distributions
• A probability distribution of a variable, say X, summarizes the probability associated with X’s events.
Distribution is a function that describes the relationship between the observations in a sample space.
• Probability distributions are of two types:
1. Discrete probability distribution
2. Continuous probability distribution
• The relationship between the events of a continuous random variable and their probabilities is called a
continuous probability distribution. It is summarized by a Probability Density Function (PDF). The PDF calculates
the probability of observing an instance, and the plot of the PDF shows the shape of the distribution.
• The Cumulative Distribution Function (CDF) computes the probability of an observation being less than or equal to a given value. Both PDF and
CDF are continuous functions.
• The discrete equivalent of PDF in discrete distribution is called Probability Mass Function (PMF).
1. Normal Distribution
• Normal distribution is a continuous probability distribution. It is also known as the Gaussian distribution or
bell-shaped curve distribution.
• It is the most common distribution function. The shape of this distribution is a typical bell-shaped curve.
• The heights of the students, blood pressure of a population, and marks scored in a class can be approximated
using normal distribution.
• The PDF of the normal distribution is given as:
f(x) = (1 / (σ√(2π))) e^(−(x − μ)² / (2σ²))
Here, μ is the mean and σ is the standard deviation. The normal distribution is characterized by two parameters –
mean and variance.
2. Rectangular Distribution
• This is also known as uniform distribution. It has equal probabilities for all values in the range [a, b].
• The uniform distribution is given as follows:
f(x) = 1/(b − a) for a ≤ x ≤ b, and 0 otherwise
3. Exponential Distribution
• This is a continuous probability distribution used to describe the time between
events in a Poisson process.
• The exponential distribution is a special case of the Gamma distribution with its shape parameter fixed at 1.
• This distribution is helpful in modelling the time until an event occurs. Its PDF is:
f(x; λ) = λ e^(−λx), for x ≥ 0
1. Binomial Distribution
• The binomial distribution function is given as follows, where p is the probability of success in a single trial and
(1 − p) is the probability of failure. The probability of k successes in n trials is given as:
P(X = k) = C(n, k) pᵏ (1 − p)ⁿ⁻ᵏ
• Here, p is the probability of success, k is the number of successes, and n is the total number of trials. The
mean of the binomial distribution is np.
2. Poisson Distribution
• It is another important distribution that is quite useful. Given an interval of time, this distribution is used to
model the probability of a given number of events k occurring in that interval. The events occur at a known mean rate λ, independently of previous events.
• Some examples of the Poisson distribution are the number of emails received, the number of customers visiting a
shop, and the number of phone calls received by an office. The PMF of the Poisson distribution is given as follows:
P(X = x) = (λˣ e^(−λ)) / x!
Here, x is the number of times the event occurs and λ is the mean number of times an event occurs.
3. Bernoulli Distribution
• This distribution models an experiment whose outcome is binary. The outcome is positive (success) with probability p and negative (failure)
with probability 1 − p.
• The PMF of this distribution is given as:
P(X = x) = pˣ (1 − p)¹⁻ˣ, for x ∈ {0, 1}
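The distributions above are all available in scipy.stats; a small sketch (with illustrative parameter values) that evaluates each PDF/PMF and a CDF:

```python
from scipy import stats

print(stats.norm.pdf(0.5, loc=0, scale=1))     # Normal PDF with mu = 0, sigma = 1
print(stats.norm.cdf(0.5, loc=0, scale=1))     # Normal CDF: P(X <= 0.5)
print(stats.uniform.pdf(0.5, loc=0, scale=1))  # Uniform (rectangular) on [0, 1]
print(stats.expon.pdf(0.5, scale=1.0))         # Exponential with rate lambda = 1 (scale = 1/lambda)
print(stats.binom.pmf(3, n=10, p=0.5))         # Binomial: P(3 successes in 10 trials)
print(stats.poisson.pmf(2, mu=4))              # Poisson: P(X = 2) when lambda = 4
print(stats.bernoulli.pmf(1, p=0.3))           # Bernoulli: P(X = 1) when p = 0.3
```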
Density Estimation
• Let there be a set of observed values x1, x2, …, xn from a larger set of data whose distribution is not known.
Density estimation is the problem of estimating the density function from the observed data.
• The estimated density function, denoted p(x), can then be evaluated directly for any unknown data point, say 𝑥𝑡, as
p(𝑥𝑡). If this value is greater than a threshold ε, then 𝑥𝑡 is not an outlier or anomaly; otherwise, it is categorized
as anomalous data.
• There are two types of density estimation methods, namely parametric density estimation and non-parametric
density estimation.
Parzen Window
• Let there be 'n' samples, X = {x1, x2, …, xn}.
• The samples are drawn independently from the same distribution, i.e., they are independent and identically distributed (i.i.d.).
• Let R be a region of volume V that covers 'k' of the total 'n' samples. Then, the probability density estimate is given as:
p(x) ≈ k / (nV)
• The window function φ indicates whether a sample is inside the region or not. The Parzen probability density estimate
using the above idea, with a window of width h centred at x, is given as:
p(x) = (1/n) Σᵢ (1/hᵈ) φ((x − xᵢ)/h)
where d is the dimensionality of the data.
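A minimal one-dimensional sketch of the Parzen (box-window) estimate p(x) ≈ k/(nV), assuming a hypercube window of width h; the sample data here is synthetic:

```python
import numpy as np

def parzen_estimate(x, samples, h):
    # Box window: phi(u) = 1 if |u| <= 1/2, else 0; count the samples inside the
    # window of width h centred at x and divide by n * V (V = h in one dimension).
    n = len(samples)
    inside = np.abs((samples - x) / h) <= 0.5
    return inside.sum() / (n * h)

samples = np.random.normal(0.0, 1.0, size=500)   # synthetic 1-D sample
print(parzen_estimate(0.0, samples, h=0.5))      # density estimate near the sample mean
```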
KNN Estimation
• The KNN estimation is another non-parametric density estimation method.
• Here, the parameter k is fixed in advance and, based on that, the k nearest neighbours of a query point are determined.
• The probability density estimate at the point is then obtained from those k neighbours; a common form is p(x) ≈ k/(nV), where V is the volume of the smallest region around the point that contains its k neighbours.
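A corresponding one-dimensional k-nearest-neighbour density sketch, using the common p(x) ≈ k/(nV) form where the volume grows until it encloses k samples (synthetic data, illustrative k):

```python
import numpy as np

def knn_density(x, samples, k):
    # Distance to the k-th nearest sample fixes the interval [x - r, x + r];
    # the density estimate is then k / (n * V) with V = 2r in one dimension.
    n = len(samples)
    dists = np.sort(np.abs(samples - x))
    r = dists[k - 1]
    return k / (n * 2 * r)

samples = np.random.normal(0.0, 1.0, size=500)   # synthetic 1-D sample
print(knn_density(0.0, samples, k=10))
```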
• The operator E refers to the expected value of the population. This is calculated theoretically using the
probability density functions (PDF) of the elements xᵢ and the joint probability density functions between the
elements xᵢ and xⱼ. From this, the covariance matrix can be calculated as:
C = E[(x − m)(x − m)ᵀ], where m = E[x] is the mean vector.
• For M random vectors, when M is large enough, the mean vector and covariance matrix can be approximately
calculated as:
m = (1/M) Σₖ xₖ    … Eq. (1)
C = (1/M) Σₖ xₖxₖᵀ − m mᵀ    … Eq. (2)
• Let A be the matrix whose rows are the eigen vectors of C, sorted by decreasing eigen value. The mapping of the vectors x to y using this transformation can now be described as:
y = A(x − m)    … Eq. (3)
• This transform is also called the Karhunen–Loeve or Hotelling transform. The original vector x can now be
reconstructed as follows:
x = Aᵀy + m
• The goal of PCA is to reduce the set of attributes to a newer, smaller set that captures the variance of the data.
Example 2.5: Let the data points be (2, 6)ᵀ and (1, 7)ᵀ. Apply PCA and find the transformed data. Again, apply the
inverse and prove that PCA works.
Solution: One can combine the two vectors into a matrix, one data point per column, as follows:
X = [2 1; 6 7]
The mean vector can be computed using Eq. (1) as follows:
m = (1.5, 6.5)ᵀ
As part of PCA, the mean must be subtracted from the data to get the adjusted data:
x₁ − m = (0.5, −0.5)ᵀ and x₂ − m = (−0.5, 0.5)ᵀ
One can find the covariance for these data vectors using Eq. (2). The matrices for the two adjusted data vectors are:
[0.25 −0.25; −0.25 0.25] and [0.25 −0.25; −0.25 0.25]
The final covariance matrix is obtained by adding these two matrices (the 1/M factor only rescales the eigen values, not the eigen vectors, and is omitted here):
C = [0.5 −0.5; −0.5 0.5]
The eigen values of matrix C are λ₁ = 1 and λ₂ = 0, and the corresponding eigen vectors are (−1, 1)ᵀ and (1, 1)ᵀ.
The matrix A can be obtained by packing the eigen vectors (after sorting by eigen value) of matrix C as rows.
For this problem, A = [−1 1; 1 1]. The transpose of A, Aᵀ = [−1 1; 1 1], is the same matrix, as A is symmetric.
The matrix can be normalized by dividing each element of an eigen vector by the norm of that vector to get:
A = (1/√2) [−1 1; 1 1]
One can check that the PCA matrix A is orthogonal. A matrix is orthogonal if A⁻¹ = Aᵀ and AAᵀ = I.
Applying Eq. (3), y = A(x − m), the transformed data points are (−1/√2, 0)ᵀ and (1/√2, 0)ᵀ.
One can check that the original data can be retrieved from the transformed data using x = Aᵀy + m, which gives back (2, 6)ᵀ and (1, 7)ᵀ. Hence, PCA works.
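Example 2.5 can be verified with NumPy; a minimal sketch (the covariance is formed by summing the outer products, as in the notes, since scaling by 1/M does not change the eigen vectors):

```python
import numpy as np

# Data points of Example 2.5 as columns: (2, 6) and (1, 7).
X = np.array([[2.0, 1.0],
              [6.0, 7.0]])

m = X.mean(axis=1, keepdims=True)       # mean vector (1.5, 6.5)
Xc = X - m                              # mean-adjusted data
C = Xc @ Xc.T                           # sum of outer products -> [[0.5, -0.5], [-0.5, 0.5]]

eigvals, eigvecs = np.linalg.eigh(C)    # eigen decomposition of the symmetric matrix C
order = np.argsort(eigvals)[::-1]       # sort eigen values in descending order
A = eigvecs[:, order].T                 # rows of A are the normalised eigen vectors

Y = A @ Xc                              # transformed data, Eq. (3)
X_back = A.T @ Y + m                    # inverse transform, x = A^T y + m
print(Y)
print(np.allclose(X_back, X))           # True: the original data is recovered
```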
where V is the linear projection, and σ𝐵 and σ𝑊 are the between-class scatter matrix and the within-class scatter matrix, respectively.
For the two-class problem, these matrices are given as:
Singular Value Decomposition
• Singular Value Decomposition (SVD) factorizes a given matrix as A = USVᵀ. Here, A is the given matrix of dimension m × n, U is an orthogonal matrix of dimension m × n, S is the
diagonal matrix of dimension n × n, and V is an orthogonal matrix of dimension n × n.
• The procedure for finding decomposition matrix is given as follows:
1. For a given matrix, find 𝐴𝐴𝑇
2. Find eigen values of 𝐴𝐴𝑇
3. Sort the eigen values in a descending order. Pack the eigen vectors as a matrix U.
4. Arrange the square root of the eigen values in diagonal. This matrix is diagonal matrix, S.
5. Find the eigen values and eigen vectors of 𝐴𝑇 𝐴. Pack the eigen vectors as a matrix
called V.
• Thus, 𝐴 = 𝑈𝑆𝑉 𝑇 . Here, U and V are orthogonal matrices. The columns of U and V are the left and right singular
vectors, respectively. SVD is useful in compression, as one can decide to retain only the largest k singular values (and their
vectors) instead of the original matrix A:
A ≈ Σᵢ₌₁ᵏ sᵢ uᵢ vᵢᵀ
The eigen values and eigen vectors of this matrix can be calculated to get U. The eigen values of this matrix are 0.0098
and 101.9902.
The eigen vectors of this matrix are:
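A short NumPy sketch of SVD and rank-k compression on a small hypothetical matrix (not the matrix from the worked example above):

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 3.0],
              [1.0, 1.0]])                    # illustrative 3 x 2 matrix

U, s, Vt = np.linalg.svd(A, full_matrices=False)
print(np.allclose(U @ np.diag(s) @ Vt, A))    # True: A = U S V^T

# Compression: keep only the largest singular value and its vectors (rank-1 approximation).
A1 = s[0] * np.outer(U[:, 0], Vt[0, :])
print(A1)
```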
Training Experience
• Let us consider designing of a chess game.
• In direct experience, individual board states and correct moves of the chess game are given directly.
• In indirect experience, only the move sequences and final results are given.
• The training experience also depends on the presence of a supervisor who can label all valid moves for a board
state.
• In the absence of a supervisor, the game agent plays against itself and learns the good moves. If the training
samples and testing samples have the same distribution, the results would be good.
• In indirect experience, all legal moves are accepted and a score is generated for each. The move with the largest
score is then chosen and executed.
• A linear evaluation function for a board state b can be represented as:
V̂(b) = w0 + w1x1 + w2x2 + w3x3
where x1, x2 and x3 represent different board features and w0, w1, w2 and w3 represent weights.
o The weights are adjusted using the LMS (Least Mean Squares) rule:
wᵢ ← wᵢ + μ (V_train(b) − V̂(b)) xᵢ
Here, μ is the constant that moderates the size of the weight update.
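A minimal sketch of this weight-update rule for a linear board-evaluation function; the feature values and training signal below are hypothetical:

```python
import numpy as np

def v_hat(x, w):
    # Linear evaluation function: V_hat(b) = w0 + w1*x1 + w2*x2 + w3*x3
    return w[0] + np.dot(w[1:], x)

def lms_update(x, v_train, w, mu):
    # LMS rule: w_i <- w_i + mu * (V_train(b) - V_hat(b)) * x_i   (with x0 = 1 for the bias w0)
    error = v_train - v_hat(x, w)
    features = np.concatenate(([1.0], x))
    return w + mu * error * features

w = np.zeros(4)                          # initial weights w0..w3
x = np.array([2.0, 0.0, 1.0])            # hypothetical board features x1, x2, x3
w = lms_update(x, v_train=1.0, w=w, mu=0.1)
print(w)                                 # weights nudged towards the training value
```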
• Here, in this set of training instances, the independent attributes considered are ‘Horns’, ‘Tail’, ‘Tusks’, ‘Paws’,
‘Fur’, ‘Color’, ‘Hooves’ and ‘Size’.
• The dependent attribute is ‘Elephant’.
• The target concept is to identify the animal to be an Elephant.
• Let us now take this example and understand further the concept of hypothesis.
Target Concept: Predict the type of animal, for example, 'Elephant'.
• Given a test instance x, we say h(x) = 1, if the test instance x satisfies this hypothesis h.
• The training dataset given above has 5 training instances with 8 independent attributes and one dependent
attribute. Here, the different hypotheses that can be predicted for the target concept are,
• The task is to predict the best hypothesis for the target concept (an elephant).
• The most general hypothesis allows any value for each of the attributes.
o It is represented as: <?, ?, ?, ?, ?, ?, ?, ?>. This hypothesis indicates that any animal can be an elephant.
• The most specific hypothesis does not allow any value for any of the attributes.
o It is represented as: < Ø, Ø, Ø, Ø, Ø, Ø, Ø, Ø >. This hypothesis indicates that no animal can be an elephant.
Example 2.7: Explain Concept Learning Task of an Elephant from the dataset given in Table 2.2. Given,
Input: 5 instances each with 8 attributes
Target concept/function ‘c’: Elephant → {Yes, No}
Hypotheses H: Set of hypothesis each with conjunctions of literals as propositions [i.e., each literal is
represented as an attribute-value pair]
Solution: The hypothesis ‘h’ for the concept learning task of an Elephant is given as:
This hypothesis produced is also called as concept description which is a model that can be used to classify
subsequent instances.
• Considering these values for each of the attributes, there are (2 × 2 × 2 × 2 × 2 × 3 × 2 × 2) = 384 distinct instances,
covering all the 5 instances in the training dataset.
• So, we can generate (4 × 4 × 4 × 4 × 4 × 5 × 4 × 4) = 81,920 distinct hypotheses when including two more values [?,
Ø] for each of the attributes.
• However, any hypothesis containing one or more Ø symbols represents the empty set of instances; that is, it
classifies every instance as a negative instance. Therefore, there will be (3 × 3 × 3 × 3 × 3 × 4 × 3 × 3 + 1) = 8,749
distinct hypotheses by including only '?' for each of the attributes, plus one hypothesis representing the empty set
of instances.
• Thus, the hypothesis space is much larger and hence we need efficient learning algorithms to search for the best
hypothesis from the set of hypotheses.
Example 2.8: Consider the training instances shown in Table 2.2 and illustrate Specific to General Learning.
Solution: We will start from all false or the most specific hypothesis to determine the most restrictive
specialization. Consider only the positive instances and generalize the most specific hypothesis. Ignore the negative
instances.
This learning is illustrated as follows: The most specific hypothesis is taken now, which will not classify any
instance to true.
Read the first instance I1, to generalize the hypothesis h so that this positive instance can be classified by the
hypothesis h1.
When reading the second instance I2, it is a negative instance, so ignore it.
Similarly, when reading the third instance I3, it is a positive instance so generalize h2 to h3 to accommodate it. The
resulting h3 is generalized.
Now, after observing all the positive instances, an approximate hypothesis h5 is generated which can now classify
any subsequent positive instance to true.
Example 2.9: Illustrate learning by Specialization – General to Specific Learning for the data instances shown in
Table 2.2.
Solution: Start from the most general hypothesis, which classifies every instance, both positive and negative, as true.
Example 2.10: Consider the training dataset of 4 instances shown in Table 2.3. It contains the details of the
performance of students and their likelihood of getting a job offer or not in their final semester. Apply the Find-S
algorithm.
Table 2.3 Training Dataset
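A minimal Find-S sketch of the procedure used in Example 2.10 (the attribute names and values below are hypothetical, not the actual Table 2.3 data):

```python
# Find-S: start from the most specific hypothesis and generalize it with every
# positive instance; negative instances are ignored.
def find_s(instances, labels):
    n_attrs = len(instances[0])
    h = ['Ø'] * n_attrs                       # most specific hypothesis
    for instance, label in zip(instances, labels):
        if label != 'Yes':                    # ignore negative instances
            continue
        for i, value in enumerate(instance):
            if h[i] == 'Ø':
                h[i] = value                  # first positive instance is copied as-is
            elif h[i] != value:
                h[i] = '?'                    # non-matching attribute is generalized to '?'
    return h

# Hypothetical instances (Horns, Tail, Tusks, Size) and whether each is an elephant.
instances = [('No', 'Short', 'Yes', 'Big'),
             ('Yes', 'Short', 'No', 'Medium'),
             ('No', 'Short', 'Yes', 'Big')]
labels = ['Yes', 'No', 'Yes']
print(find_s(instances, labels))              # ['No', 'Short', 'Yes', 'Big']
```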
• Hence, it is necessary to find the set of hypotheses that are consistent with the training data including the
negative examples.
• To overcome the limitations of Find-S algorithm, Candidate Elimination algorithm was proposed to output the
set of all hypotheses consistent with the training dataset.
List-Then-Eliminate Algorithm
• The principal idea of this learning algorithm is to initialize the version space to contain all hypotheses and then
eliminate any hypothesis that is found inconsistent with any training instance.
• Initially, the algorithm starts with a version space to contain all hypotheses scanning each training instance. The
hypotheses that are inconsistent with the training instance are eliminated.
• Finally, the algorithm outputs the list of remaining hypotheses that are all consistent.
• This algorithm works fine if the hypothesis space is finite but practically it is difficult to deploy this algorithm.
Hence, a variation of this idea is introduced in the Candidate Elimination algorithm.
Version Spaces and the Candidate Elimination Algorithm
• Version space learning generates all hypotheses consistent with the training data. This algorithm computes the version
space by combining two cases, namely,
o Specific to General learning – Generalize S to include the positive example
o General to Specific learning – Specialize G to exclude the negative example
• Using the Candidate Elimination algorithm, we can compute the version space containing all (and only those)
hypotheses from H that are consistent with the given observed sequence of training instances.
• The algorithm defines two boundaries called ‘general boundary’ which is a set of all hypotheses that are the
most general and ‘specific boundary’ which is a set of all hypotheses that are the most specific.
• Thus, the algorithm limits the version space to contain only those hypotheses that are most general and most
specific.
Algorithm 2.3: Candidate Elimination
• Generating Positive Hypothesis ‘S’ If it is a positive example, refine S to include the positive instance. We need
to generalize S to include the positive instance. The hypothesis is the conjunction of ‘S’ and positive instance.
• Generating Negative Hypothesis ‘G’ If it is a negative instance, refine G to exclude the negative instance. Then,
prune G to exclude all inconsistent hypotheses in G with the positive instance.
• Generating Version Space – [Consistent Hypotheses] We need to take the combination of sets in 'G' and check
them against 'S'. Only when the fields of a combined set match the fields in 'S' is it included in the
version space as a consistent hypothesis.
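A minimal sketch of the two refinement steps just described (simplified, following this note's procedure rather than the full Candidate Elimination algorithm; attribute values are placeholders):

```python
def generalize_S(S, pos_instance):
    # Positive instance: replace Ø with the instance value, and put '?' where S and
    # the instance disagree (specific-to-general step).
    new_S = []
    for s, v in zip(S, pos_instance):
        if s == 'Ø':
            new_S.append(v)
        elif s == v:
            new_S.append(s)
        else:
            new_S.append('?')
    return new_S

def specialize_G(S, neg_instance):
    # Negative instance: for every attribute where S holds a concrete value that
    # differs from the negative instance, emit a hypothesis that is all '?' except
    # that attribute (general-to-specific step, as described above).
    hyps = []
    for i, (s, v) in enumerate(zip(S, neg_instance)):
        if s not in ('?', 'Ø') and s != v:
            h = ['?'] * len(S)
            h[i] = s
            hyps.append(h)
    return hyps

S = ['Ø'] * 6                 # maximally specific boundary (6 attributes)
G = [['?'] * 6]               # maximally general boundary
# For each training instance: call generalize_S on positives, specialize_G on
# negatives, and prune G against S as described above to obtain the version space.
```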
Example 2.11: Consider the same set of instances from the training dataset shown in Table 2.3 and generate the version
space of consistent hypotheses.
Solution:
Step 1: Initialize ‘G’ boundary to the maximally general hypotheses,
Step 2: Initialize ‘S’ boundary to the maximally specific hypothesis. There are 6 attributes, so for each attribute, we
initially fill ‘Ø’ in the hypothesis ‘S’.
Generalize the initial hypothesis for the first positive instance. I1 is a positive instance; so generalize the most
specific hypothesis ‘S’ to include this positive instance. Hence,
Step 3:
Iteration 1
Scan the next instance I2. Since I2 is a positive instance, generalize ‘S1’ to include positive instance I2. For each of
the non-matching attribute value in ‘S1’, put a ‘?’ to include this positive instance. The third attribute value is
mismatching in ‘S1’ with I2, so put a ‘?’.
Prune G1 to exclude all inconsistent hypotheses with the positive instance. Since G1 is consistent with this
positive instance, there is no change. The resulting G2 is,
Iteration 2
Now Scan I3,
Since it is a negative instance, specialize G2 to exclude the negative example but stay consistent with S2.
Generate a hypothesis for each of the non-matching attribute values in S2 and fill it with the attribute value of S2. In
those generated hypotheses, for all matching attribute values, put a '?'. The first, second and sixth attribute
values do not match; hence, three hypotheses are generated in G3.
There is no inconsistent hypothesis in S2 with the negative instance, hence S3 remains the same.
Iteration 3
Now scan I4. Since it is a positive instance, check for mismatches in the hypothesis 'S3' with I4. The fifth and sixth
attribute values are mismatching, so put a '?' for those attributes in 'S4'.
Prune G3 to exclude all inconsistent hypotheses with the positive instance I4.
Since the third hypothesis in G3 is inconsistent with this positive instance, remove the third one. The resulting
G4 is,
Using the two boundary sets, S4 and G4, the version space is converged to contain the set of consistent
hypotheses.
The final version space is,
Thus, the algorithm finds the version space to contain only those hypotheses that are most general and most
specific.
The diagrammatic representation of deriving the version space is shown in Figure 2.5.
• This loss function is defined as the Mean Squared Error (MSE), which is the average of the squared differences
between the true values 𝐘𝒊 and the predicted values 𝐟(𝐗 𝒊 ) for an input value '𝐗 𝒊 ':
MSE = (1/n) Σᵢ (Yᵢ − f(Xᵢ))²
• A smaller value of MSE denotes that the error is less and, therefore, the prediction is more accurate.
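A quick numeric check of the MSE definition (the values are illustrative):

```python
import numpy as np

def mse(y_true, y_pred):
    # Mean Squared Error: average squared difference between true and predicted values.
    return np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)

print(mse([3.0, 5.0, 2.5], [2.5, 5.0, 3.0]))   # 0.1666...
```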
2. The simplest approach is to fit a model on the training dataset and to compute measures like error or
accuracy.
3. The use of probabilistic framework and quantification of the performance of the model as a score is the
third approach.
Model Performance
• The focus of this section is the evaluation of classifier models. Classifiers are unstable as a small change in the
input can change the output.
• There are several metrics that can be used to describe the quality and usefulness of a classifier. One way to
compute the metrics is to form a table called contingency table. For example, consider a test for detecting a
disease, say cancer. Table 2.4 shows a contingency table for this scenario.
Table 2.4 Contingency Table
                          Test result: positive      Test result: negative
Patient has cancer        True Positive (TP)         False Negative (FN)
Patient has no cancer     False Positive (FP)        True Negative (TN)
o In this table, True Positive (TP) = number of cancer patients who are correctly classified by the test, and True
Negative (TN) = number of normal patients who do not have cancer and are correctly detected as such. The two errors
involved in this process are False Positive (FP), a false alarm where the test shows positive although the patient
has no disease, and False Negative (FN), where the patient actually has cancer but the test reports negative or
normal. FP and FN are costly errors in this classification process.
• The metrics that can be derived from this contingency table are listed below:
1. Sensitivity – The sensitivity of a test is the probability that it will produce a true positive result when
used on a test dataset. It is also known as the true positive rate. The sensitivity of a test can be determined by
calculating:
Sensitivity = TP / (TP + FN)
2. Specificity – The specificity of a test is the probability that it will produce a true negative result when
used on a test dataset. The specificity of a test can be determined by calculating:
Specificity = TN / (TN + FP)
3. Positive Predictive Value – The positive predictive value of a test is the probability that an object is
classified correctly when a positive test result is observed. It can be determined by calculating:
Positive Predictive Value = TP / (TP + FP)
4. Negative Predictive Value – The negative predictive value of a test is the probability that an object which
receives a negative test result is truly negative. It can be determined by calculating:
Negative Predictive Value = TN / (TN + FN)
5. Accuracy – The accuracy of the classifier is computed as:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
6. Precision – Precision is also known as positive predictive power. It is defined as the ratio of true positives
to the sum of true positives and false positives. Precision indicates how good the classifier is at
predicting the positive classes.
Precision = TP / (TP + FP)
7. Recall – It is the same as sensitivity.
Recall = Sensitivity = TP / (TP + FN)
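The contingency-table metrics above can be computed together; a small sketch with illustrative counts for the cancer-test example:

```python
def classification_metrics(tp, tn, fp, fn):
    # Derive the metrics listed above from the contingency-table counts.
    return {
        'sensitivity (recall)':       tp / (tp + fn),
        'specificity':                tn / (tn + fp),
        'positive predictive value':  tp / (tp + fp),
        'negative predictive value':  tn / (tn + fn),
        'accuracy':                   (tp + tn) / (tp + tn + fp + fn),
    }

print(classification_metrics(tp=40, tn=45, fp=5, fn=10))   # illustrative counts
```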
******