Unit-1 - Machine Learning
•This is the mode of the distribution p(y | x, D) and is known as a MAP estimate (maximum a
posteriori).
• Confidence in predictions is crucial, especially in risk-averse domains like medicine
and finance.
• IBM's Watson for Jeopardy uses a confidence module to decide when to answer.
• Google's SmartASS (ad selection system) predicts the click-through rate (CTR) to
maximize expected profit.
• Systems like Watson and SmartASS assess the risk of their predictions, making
decisions based on confidence levels to optimize performance and minimize errors.
Supervised learning-Cont.
Real-world applications:
(i) Document classification and email spam filtering
• In document classification, the primary objective is to categorize documents like web
pages or email messages into predefined classes C, determining p(y=c∣x,D), where x
represents the document's text representation.
• A classic example is email spam filtering, where classes are typically labeled as spam (
y=1 ) or non-spam ( y=0).
• Most classifiers assume a fixed-size input vector x. To handle variable-length documents,
a common approach is the bag of words (BoW) representation.
Bag of Words (BoW):
• Documents are transformed into fixed-size feature vectors.
• Each vector element corresponds to a word from a predefined vocabulary.
• If a word appears in the document, its corresponding vector element is set to 1; otherwise,
it remains 0.
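To make the representation concrete, here is a minimal Python sketch of a binary bag-of-words encoder; the vocabulary and example documents are made up for illustration.

```python
# Minimal bag-of-words sketch: binary presence/absence vectors over a fixed vocabulary.
# The vocabulary and example documents below are illustrative, not from any real dataset.
vocabulary = ["free", "money", "meeting", "project", "winner"]

def bag_of_words(document: str) -> list[int]:
    """Return a fixed-size 0/1 vector: 1 if the vocabulary word occurs in the document."""
    words = set(document.lower().split())
    return [1 if w in words else 0 for w in vocabulary]

print(bag_of_words("Free money for the lucky winner"))   # [1, 1, 0, 0, 1] -> spam-like
print(bag_of_words("Project meeting moved to Monday"))   # [0, 0, 1, 1, 0] -> non-spam-like
```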
Supervised learning-Cont.
(ii) Classifying flowers
• The goal is to classify iris flowers into three types: setosa, versicolor, and virginica,
based on four extracted features: sepal length, sepal width, petal length, and petal
width.
Supervised learning-Cont.
(iii) Image classification and handwriting recognition
• Image classification involves categorizing images
based on their content, such as indoor vs. outdoor
scenes, orientation (horizontal vs. vertical), or
presence of specific objects like dogs.
• MNIST (which stands for "Modified National
Institute of Standards and Technology") is a widely
used dataset for handwritten digit recognition,
containing 60,000 training images and 10,000 test
images of digits (0-9).
• Each image is grayscale, sized 28x28 pixels, and
represents handwritten digits by various individuals.
• Images are represented as feature vectors, where each
pixel's grayscale value (ranging from 0 to 255) serves
as a feature.
Supervised learning-Cont.
(iv) Face detection and recognition
• Object detection, or localization, involves
identifying specific objects within an image. A
notable application is face detection, which is
crucial for tasks like autofocus in cameras and
privacy features in services like Google's
StreetView.
• One approach to face detection is the sliding
window detector method. It divides the image into
small overlapping patches at various locations,
scales, and orientations.
• Each patch is classified based on whether it
exhibits face-like textures or features. Locations
where the probability of containing a face is high
are identified as potential face locations (see the
sketch after this list).
• Modern digital cameras often integrate face
detection systems to assist with autofocus by
identifying and focusing on faces within the frame.
• Services like Google's StreetView use face
detection to automatically blur faces to protect
privacy.
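As referenced above, a rough Python sketch of the sliding-window idea follows; the patch scorer `looks_like_face` is a hypothetical stand-in for a trained classifier, and the window size, stride, and threshold are illustrative.

```python
import numpy as np

def looks_like_face(patch: np.ndarray) -> float:
    """Hypothetical stand-in for a trained patch classifier returning p(face | patch)."""
    return float(np.clip(patch.mean() / 255.0, 0.0, 1.0))  # placeholder score

def sliding_window_detect(image: np.ndarray, size: int = 24, stride: int = 8, threshold: float = 0.9):
    """Scan overlapping patches and report locations whose face score exceeds a threshold."""
    detections = []
    h, w = image.shape
    for top in range(0, h - size + 1, stride):
        for left in range(0, w - size + 1, stride):
            patch = image[top:top + size, left:left + size]
            score = looks_like_face(patch)
            if score >= threshold:
                detections.append((top, left, score))
    return detections

# Toy usage on a random grayscale image (a real detector would also scan multiple scales).
image = np.random.randint(0, 256, size=(120, 160), dtype=np.uint8)
print(len(sliding_window_detect(image)))
```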
Supervised learning-Cont.
2. Regression:
• Regression is just like classification, except the response variable is continuous.
Unsupervised learning
1. Discovering clusters:
•Clustering involves grouping data points into clusters based on similarities in their features,
without predefined labels.
•The goal is to estimate the distribution p(K∣D) over the number of clusters K, indicating
the presence of subgroups within the data.
•Model selection in clustering aims to determine the optimal number of clusters K*, often
approximated by the mode of p(K|D). Unlike supervised learning where classes are
predefined, unsupervised learning allows flexibility in choosing the number of clusters that
best represent the underlying structure of the data.
•Each data point i is assigned to a cluster z_i ∈ {1, …, K} based on the probability
p(z_i = k | x_i, D), where x_i is the feature vector of the data point.
•Assignments z_i* are inferred to determine the cluster membership of each data point,
illustrated by different colors representing clusters in visualizations.
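A minimal sketch of hard cluster assignment using K-means from scikit-learn (assuming it is installed); a probabilistic model such as a mixture model would additionally provide p(z_i = k | x_i, D). The toy data are synthetic.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy 2-D data drawn from two well-separated blobs (illustrative only).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=[0, 0], scale=0.5, size=(50, 2)),
               rng.normal(loc=[4, 4], scale=0.5, size=(50, 2))])

# Fit K = 2 clusters; in practice K itself can be chosen by model selection.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
z = kmeans.labels_            # hard assignments z_i in {0, 1}
print(np.bincount(z))         # cluster sizes
print(kmeans.cluster_centers_)
```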
Unsupervised learning-Cont.
Applications of Clustering:
•Astronomy: Clustering methods like Autoclass have been used to discover new
types of stars based on astrophysical measurements.
•E-commerce: Clustering users based on purchasing or web-surfing behavior allows
for targeted advertising and personalized recommendations.
•Biology: Clustering flow-cytometry data helps identify different sub-populations of
cells, aiding in biological research such as understanding disease mechanisms.
Unsupervised learning-Cont.
2. Discovering latent factors:
• Dimensionality reduction involves projecting high-dimensional data into a
lower-dimensional subspace that captures essential characteristics of the data.
• Despite high-dimensional appearances, data often exhibit variability across a smaller
number of latent factors. Dimensionality reduction helps in focusing on these key
factors, such as lighting, pose, or identity in face image modeling.
• PCA is a common approach for dimensionality reduction, resembling an unsupervised
form of multi-output linear regression.
• Given high-dimensional responses y, PCA infers latent low-dimensional factors z that
explain most of the variability in y.
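A minimal numpy sketch of PCA via the SVD of the centered data matrix, projecting high-dimensional observations y onto low-dimensional factors z; the synthetic data are made up for illustration.

```python
import numpy as np

# Synthetic data: 200 points in 5 dimensions that mostly vary along 2 latent directions.
rng = np.random.default_rng(0)
Z_true = rng.normal(size=(200, 2))                      # latent factors
W_true = rng.normal(size=(2, 5))                        # loading matrix
Y = Z_true @ W_true + 0.05 * rng.normal(size=(200, 5))  # observed high-dimensional data

# PCA: center the data, take the top-L right singular vectors, project.
L = 2
Y_centered = Y - Y.mean(axis=0)
U, S, Vt = np.linalg.svd(Y_centered, full_matrices=False)
W = Vt[:L].T                 # principal directions (5 x L)
Z = Y_centered @ W           # low-dimensional representation (200 x L)

explained = (S[:L] ** 2).sum() / (S ** 2).sum()
print(f"fraction of variance explained by {L} components: {explained:.3f}")
```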
Unsupervised learning-Cont.
Applications:
• In biology, it is common to use PCA to interpret gene microarray data, to account for
the fact that each measurement is usually the result of many genes which are correlated
in their behavior by the fact that they belong to different biological pathways.
• In natural language processing, it is common to use a variant of PCA called latent
semantic analysis for document retrieval.
• In signal processing (e.g., of acoustic or neural signals), it is common to use ICA (which
is a variant of PCA) to separate signals into their different sources.
• In computer graphics, it is common to project motion capture data to a low dimensional
space, and use it to create animations.
Unsupervised learning-Cont.
3. Discovering graph structure
• Learning sparse graphical models involves representing relationships between correlated
variables using a graph G, where nodes depict variables and edges denote direct
dependencies.
• This approach is pivotal in both discovering new knowledge and enhancing joint
probability density estimators.
• In systems biology, sparse graphical models are used to uncover relationships among
biological entities. For instance, graphs derived from protein phosphorylation data reveal
complex interactions within cellular networks.
• In fields like financial portfolio management, sparse graphs help model covariance
between stocks for better prediction and decision-making.
• Applications extend to traffic prediction systems, such as JamBayes, which leverage
learned graphical models to forecast traffic flow dynamics.
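A crude numpy sketch of the underlying idea (not a proper sparse estimator such as the graphical lasso): estimate the precision (inverse covariance) matrix from data and read off a graph by thresholding near-zero entries; the data and threshold below are made up for illustration.

```python
import numpy as np

# Synthetic data from 4 variables where X0-X1 and X2-X3 are directly dependent.
rng = np.random.default_rng(0)
x0 = rng.normal(size=500)
x1 = x0 + 0.3 * rng.normal(size=500)
x2 = rng.normal(size=500)
x3 = x2 + 0.3 * rng.normal(size=500)
X = np.column_stack([x0, x1, x2, x3])

precision = np.linalg.inv(np.cov(X, rowvar=False))

# Near-zero entries in the precision matrix correspond to missing edges in the graph.
adjacency = (np.abs(precision) > 0.5).astype(int)
np.fill_diagonal(adjacency, 0)
print(adjacency)   # edges only between the directly dependent pairs
```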
Unsupervised learning-Cont.
4. Matrix completion
• Sometimes we have missing data, that is, variables whose values are unknown. For example, we might have
conducted a survey, and some people might not have answered certain questions.
• The corresponding design matrix will then have “holes” in it; these missing entries are often represented by
NaN, which stands for “not a number”. The goal of imputation is to infer plausible values for the missing
entries. This is sometimes called matrix completion.
• Image Inpainting: Technique to fill in missing parts of images due to scratches or occlusions, achieved by
modeling joint probability of pixels from clean images.
• Collaborative Filtering: Predicting user preferences for items (like movies) based on sparse ratings matrices,
aiming to fill in missing ratings for better recommendation systems.
• Market basket analysis:
❖ Involves examining a large, sparse binary matrix where columns represent items/products and rows
represent transactions.
❖ Each entry in the matrix indicates whether an item was purchased in a specific transaction. By analyzing
correlations among items often bought together, predictions can be made about additional items a consumer
might buy based on partial transaction data.
❖ This technique is also applicable in other domains, such as predicting file dependencies in software systems.
❖ Common methods for market basket analysis include frequent itemset mining, which generates association
rules, and probabilistic modeling, which fits a joint density model to the data.
❖ Data mining emphasizes interpretability of models, whereas machine learning focuses on model accuracy.
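Below is a minimal numpy sketch of one simple matrix-completion heuristic (iterative low-rank SVD refitting of the missing entries); the toy ratings matrix is made up, and real collaborative filtering systems use more sophisticated models.

```python
import numpy as np

def complete_matrix(R: np.ndarray, rank: int = 2, iters: int = 50) -> np.ndarray:
    """Fill NaN entries of R using repeated truncated-SVD (low-rank) approximation."""
    mask = ~np.isnan(R)
    filled = np.where(mask, R, np.nanmean(R, axis=0, keepdims=True))
    for _ in range(iters):
        U, S, Vt = np.linalg.svd(filled, full_matrices=False)
        low_rank = (U[:, :rank] * S[:rank]) @ Vt[:rank]
        filled = np.where(mask, R, low_rank)   # keep observed entries, update missing ones
    return filled

# Toy ratings matrix (users x movies) with missing entries marked as NaN.
R = np.array([[5.0, 4.0, np.nan, 1.0],
              [4.0, np.nan, 1.0, 1.0],
              [1.0, 1.0, 5.0, np.nan],
              [np.nan, 1.0, 4.0, 5.0]])
print(np.round(complete_matrix(R), 1))
```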
Polynomial Curve Fitting
• Consider the example of recognizing handwritten digits, illustrated in Figure 1.1.
• Each digit corresponds to a 28×28 pixel image and so can be represented by a vector x comprising 784 real
numbers.
• The goal is to build a machine that will take such a vector x as input and that will produce the identity of the digit
0…9 as the output.
• This is not a simple problem due to the wide variability of handwriting. It could be tackled using handcrafted
rules or heuristics for distinguishing the digits based on the shapes of the strokes.
• But this leads to a proliferation of rules and of exceptions to the rules and so on, and invariably gives poor results.
• Better results can be obtained by adopting a machine learning approach in which a large set of N digits {x_1, ..., x_N},
called a training set, is used to tune the parameters of an adaptive model.
• The categories of the digits in the training set are known in advance (hand-labelling). We can express the
category of a digit using target vector t, which represents the identity of the corresponding digit. There is one
such target vector t for each digit image x.
• The result of running the machine learning algorithm can be expressed as a function y(x) which takes a new digit
image x as input and that generates an output vector y, encoded in the same way as the target vectors.
• The precise form of the function y(x) is determined during the training phase (learning phase) on the basis of the
training data.
• Once the model is trained it can then determine the identity of new digit images (a test set). The ability to
categorize correctly new examples that differ from those used for training is known as generalization (goal of
pattern recognition).
• The original input variables are pre-processed to transform them into some new space of variables where the
pattern recognition problem will be easier to solve.
Polynomial Curve Fitting-Cont.
• Example: In the digit recognition problem, the images of the digits are translated and scaled so that each digit is
contained within a box of a fixed size. This greatly reduces the variability within each digit class, because the
location and scale of all the digits are now the same, which makes it much easier for a subsequent pattern
recognition algorithm to distinguish between the different classes. This pre-processing is sometimes called feature
extraction.
• Pre-processing might also be performed to speed up computation. Example: if the goal is real-time face detection
in a high-resolution video stream, the computer must handle huge numbers of pixels per second.
• Instead, useful features are found (dimensionality reduction) that are fast to compute and preserve useful
discriminatory information, enabling faces to be distinguished from non-faces.
• These features are then used as the inputs to the pattern recognition algorithm.
• Problems in which the training data comprises the input vectors along with their corresponding target vectors are
known as supervised learning problems.
• Ex: Classification - digit recognition, which assigns each input vector to one of a finite number of discrete categories
• Regression - the desired output consists of one or more continuous variables
• Ex: prediction of the yield in a chemical manufacturing process in which the inputs consist of the concentrations
of reactants, the temperature, and the pressure.
Polynomial Curve Fitting-Cont.
• Unsupervised learning problems: The training data consists of a set of input vectors x without any corresponding
target values.
• Clustering - to find groups of similar examples within the data. Other unsupervised goals include determining the
distribution of data within the input space (known as density estimation) and projecting the data from a
high-dimensional space down to two or three dimensions for visualization.
Fig. 1.2 Plot of a training data set of N=10 points (blue circles), each comprising an observation of the input variable
x along with the corresponding target variable t. The green curve shows the function sin(2πx) used to generate the
data. The goal is to predict the value of t for some new value of x, without knowledge of the green curve.
Polynomial Curve Fitting-Cont.
• Consider a simple regression problem. Suppose we observe a real-valued input variable x and wish to use
this observation to predict the value of a real-valued target variable t.
• For example, the data is generated from the function sin(2πx) with random noise included in the
target values. Now suppose that we are given a training set comprising N observations of x, written x
≡ (x1,...,xN )T, together with corresponding observations of the values of t, denoted t ≡ (t1,...,tN )T.
• Figure 1.2 shows a plot of a training set comprising N = 10 data points. The input data set x was
generated by choosing values of xn, for n = 1,...,N, spaced uniformly in range [0, 1], and the target
data set t was obtained by first computing the corresponding values of the function sin(2πx) and then
adding a small level of random noise having a Gaussian distribution to each such point in order to
obtain the corresponding value tn.
• In this way, the data set captures a property of many real data sets: they possess an underlying
regularity, which we wish to learn, but individual observations are corrupted by random noise.
Polynomial Curve Fitting-Cont.
• This noise might arise from intrinsically stochastic (i.e. random) processes such as radioactive decay
but more typically is due to there being sources of variability that are unobserved.
• The goal is to exploit this training set to make predictions of the value t of the target variable for some
new value x of the input variable.
• The observed data are corrupted with noise, and for a given x there is uncertainty as to the appropriate
value for t.
• Probability theory provides a framework for expressing such uncertainty in a precise and quantitative
manner.
• Decision theory allows to exploit this probabilistic representation in order to make predictions that are
optimal according to appropriate criteria.
• Consider a simple approach based on curve fitting. The data can be fit using a polynomial function of
the form

y(x, w) = w_0 + w_1 x + w_2 x^2 + ... + w_M x^M = Σ_{j=0}^{M} w_j x^j        (1.1)

where M is the order of the polynomial. The polynomial coefficients w_0, ..., w_M are collectively denoted
by the vector w.
Polynomial Curve Fitting-Cont.
• Although the polynomial function y(x, w) is a nonlinear function of x, it is a linear function of the coefficients w.
• Functions, such as the polynomial, which are linear in the unknown parameters have important properties and are
called linear models.
• The values of the coefficients will be determined by fitting the polynomial to the training data. This can be done
by minimizing an error function that measures the misfit between the function y(x, w), for any given value of w,
and the training set data points.
• The error function is given by the sum of the squares of the errors between the predictions y(x_n, w) for each data
point x_n and the corresponding target values t_n, so that we minimize

E(w) = (1/2) Σ_{n=1}^{N} { y(x_n, w) − t_n }^2        (1.2)
Fig. 1.3 The error function (1.2) corresponds to (one half of) the sum of the squares of
the displacements (vertical green bars) of each data point from the function y(x,w)
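A minimal numpy sketch of fitting the polynomial (1.1) by minimizing the sum-of-squares error (1.2), using synthetic sin(2πx) data of the kind shown in Figure 1.2; the noise level and chosen orders M are illustrative.

```python
import numpy as np

# Generate N = 10 training points: x uniform in [0, 1], t = sin(2*pi*x) + Gaussian noise.
rng = np.random.default_rng(0)
N = 10
x = np.linspace(0.0, 1.0, N)
t = np.sin(2 * np.pi * x) + 0.3 * rng.normal(size=N)

def fit_polynomial(x, t, M):
    """Minimize E(w) = 0.5 * sum_n (y(x_n, w) - t_n)^2 for an order-M polynomial."""
    Phi = np.vander(x, M + 1, increasing=True)      # design matrix with columns x^0 ... x^M
    w, *_ = np.linalg.lstsq(Phi, t, rcond=None)
    return w

for M in (0, 1, 3, 9):
    w = fit_polynomial(x, t, M)
    residuals = np.vander(x, M + 1, increasing=True) @ w - t
    E = 0.5 * np.sum(residuals ** 2)
    print(f"M = {M}: training error E(w*) = {E:.4f}")
```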
Polynomial Curve Fitting-Cont.
Fig. 1.5 Graphs of the root-mean-square error, defined by (1.3), evaluated on the training set and on an independent
test set for various values of M
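For reference, the root-mean-square error referred to as (1.3) is defined in terms of the minimized error E(w*) and the number of data points N:

```latex
E_{\mathrm{RMS}} = \sqrt{\, 2\,E(\mathbf{w}^{*}) / N \,}
```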
Polynomial Curve Fitting-Cont.
Table 1.1.
Polynomial Curve Fitting-Cont.
Fig. 1.8 Graph of the root-mean-square error (1.3) versus ln λ for the M=9 polynomial
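Since Figure 1.8 plots the error against ln λ, it helps to recall the regularized error function that λ controls; a sketch of its standard form, where the penalty term discourages large coefficients:

```latex
\widetilde{E}(\mathbf{w}) \;=\; \frac{1}{2}\sum_{n=1}^{N}\bigl\{\,y(x_{n},\mathbf{w})-t_{n}\,\bigr\}^{2} \;+\; \frac{\lambda}{2}\,\lVert \mathbf{w}\rVert^{2}
```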
Probability Theory
What is probability?
• We are all familiar with the phrase “the probability that a coin will land heads is 0.5”.
• There are actually at least two different interpretations of probability.
• One is called the frequentist interpretation. In this view, probabilities represent long run frequencies of
events. For example, the above statement means that, if we flip the coin many times, we expect it to land
heads about half the time
• The other interpretation is called the Bayesian interpretation of probability. In this view, probability is
used to quantify our uncertainty about something; hence it is fundamentally related to information rather
than repeated trials.
• One big advantage of the Bayesian interpretation is that it can be used to model our uncertainty about
events that do not have long term frequencies.
Eg:
• To compute the probability that a received email message is spam
Discrete Random Variables
• The expression p(A) denotes the probability that the event A is true. For example, A might be the logical
expression "it will rain tomorrow".
• We require that 0 ≤ p(A) ≤ 1, where p(A)=0 means the event definitely will not happen, and p(A)=1 means
the event definitely will happen.
• We write p(Ā) to denote the probability of the event not A; this is defined as p(Ā) = 1 − p(A).
• We will often write A = 1 to mean the event A is true, and A = 0 to mean the event A is false.
• We can extend the notion of binary events by defining a discrete random variable X, which can take on any
value from a finite or countably infinite set X.
• We denote the probability of the event that X = x by p(X = x), or just p(x) for short. Here p() is called a
probability mass function or pmf.
• This satisfies the properties 0 ≤ p(x) ≤ 1 and Σ_{x∈X} p(x) = 1.
• Figure 2.1 shows two pmf's defined on the finite state space X = {1, 2, 3, 4, 5}. On the left we have a uniform
distribution, p(x) = 1/5, and on the right, we have a degenerate distribution, p(x) = I(x = 1), where I() is the
binary indicator function.
• This distribution represents the fact that X is always equal to the value 1; in other words, it is a constant.
Fundamental Rules
• Probability of a union of two events: given two events, A and B, we define the probability of A or B as follows:

p(A ∨ B) = p(A) + p(B) − p(A ∧ B)

• The product rule gives the joint probability p(A, B) = p(A | B) p(B). The product rule can be applied multiple
times to yield the chain rule of probability:

p(X_1, X_2, ..., X_D) = p(X_1) p(X_2 | X_1) p(X_3 | X_2, X_1) ... p(X_D | X_1, ..., X_{D−1})

Bayes Rule
1. Example: medical diagnosis
• Bayes rule combines the prior p(y) with the likelihood p(x | y) to give the posterior. For the mammogram example:

p(y=1 | x=1) = p(x=1 | y=1) p(y=1) / [ p(x=1 | y=1) p(y=1) + p(x=1 | y=0) p(y=0) ]

where x=1 is the event the mammogram is positive, and y=1 is the event you have breast cancer.
• The test has sensitivity p(x=1 | y=1) = 0.8, so many people conclude they are therefore 80% likely to have
cancer. But the prior is low, p(y=1) = 0.004, so p(y=0) = 1 − p(y=1) = 0.996; with a false positive rate of
p(x=1 | y=0) = 0.1, the posterior is only about 0.03. In other words, if you test positive, you only have about
a 3% chance of actually having breast cancer.
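A quick Python check of the numbers used in this example (prior 0.004, sensitivity 0.8, false positive rate 0.1, as reconstructed above):

```python
# Bayes rule for the mammogram example.
p_y1 = 0.004                 # prior probability of breast cancer
p_x1_given_y1 = 0.8          # sensitivity: P(positive test | cancer)
p_x1_given_y0 = 0.1          # false positive rate: P(positive test | no cancer)
p_y0 = 1 - p_y1              # 0.996

posterior = (p_x1_given_y1 * p_y1) / (p_x1_given_y1 * p_y1 + p_x1_given_y0 * p_y0)
print(f"P(cancer | positive test) = {posterior:.3f}")   # about 0.031, i.e. roughly 3%
```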
Bayes Rule(Contd..)
2. Example: Generative classifiers
We can generalize the medical diagnosis example to classify feature vectors x of arbitrary type as follows:

p(y=c | x, θ) = p(y=c | θ) p(x | y=c, θ) / Σ_{c'} p(y=c' | θ) p(x | y=c', θ)

This is called a generative classifier, since it specifies how to generate the data using the class-conditional density
p(x|y=c) and the class prior p(y=c). An alternative approach is to directly fit the class posterior, p(y=c|x); this is known
as a discriminative classifier.
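A minimal numpy sketch of a generative classifier, using Gaussian class-conditional densities p(x|y=c) and class priors p(y=c) combined via Bayes rule; the Gaussian choice and the toy data are illustrative assumptions, not prescribed by the notes.

```python
import numpy as np

# Toy 1-D data: class 0 centered at -1, class 1 centered at +2.
rng = np.random.default_rng(0)
x0 = rng.normal(-1.0, 1.0, size=100)
x1 = rng.normal(2.0, 1.0, size=100)

def gaussian_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Fit the generative model: class priors and per-class Gaussian parameters.
prior = np.array([len(x0), len(x1)], dtype=float)
prior /= prior.sum()
mu = np.array([x0.mean(), x1.mean()])
sigma = np.array([x0.std(), x1.std()])

def posterior(x):
    """p(y = c | x) via Bayes rule from the class-conditional densities and priors."""
    joint = prior * gaussian_pdf(x, mu, sigma)
    return joint / joint.sum()

print(posterior(0.0))   # probabilities of class 0 and class 1 for a new point x = 0
```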
Independence and conditional independence
• X and Y are unconditionally independent or marginally independent, denoted X ⊥ Y, if we can represent the
joint as the product of the two marginals: p(X, Y) = p(X) p(Y).
• A set of variables is mutually independent if the joint can be written as a product of marginals.
• Unconditional independence is rare, because most variables can influence most other variables.
Independence and conditional independence—Contd.
• X and Y are conditionally independent (CI) given Z if the conditional joint can be written as a product of
conditional marginals: p(X, Y | Z) = p(X | Z) p(Y | Z).
• The assumption can be represented as a graph X−Z−Y, indicating that all dependencies between X and Y are
mediated through Z.
• For instance, the probability of rain tomorrow (X) is independent of whether the ground is wet today (Y) if we
know whether it is raining today (Z). This is because Z influences both X and Y, making knowledge of Z
sufficient to predict X or Y independently.
• Another characterization of conditional independence (CI) is provided by a theorem, which states that
X is conditionally independent of Y given Z (X ⊥ Y | Z) if and only if there exist functions g and h such that
p(x, y | z) = g(x, z) h(y, z) for all x, y, z with p(z) > 0.
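A small numpy check of this factorization on a hand-built joint distribution that is conditionally independent by construction (the probability tables are made up for illustration):

```python
import numpy as np

# Binary variables X, Y, Z. Construct a joint that is CI by design: p(x,y,z) = p(z) p(x|z) p(y|z).
p_z = np.array([0.3, 0.7])
p_x_given_z = np.array([[0.9, 0.1],    # p(x | z=0)
                        [0.2, 0.8]])   # p(x | z=1)
p_y_given_z = np.array([[0.6, 0.4],
                        [0.5, 0.5]])

joint = np.einsum('z,zx,zy->xyz', p_z, p_x_given_z, p_y_given_z)   # p(x, y, z)

# Check the CI definition: p(x, y | z) == p(x | z) * p(y | z) for every z.
p_xy_given_z = joint / joint.sum(axis=(0, 1), keepdims=True)
factorized = np.einsum('zx,zy->xyz', p_x_given_z, p_y_given_z)
print(np.allclose(p_xy_given_z, factorized))   # True
```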
Example
• Let’s consider an example. Suppose A is the height of a child, and B is the
number of words that the child knows. It seems that when A is high, B is high
too.
• The height and the number of words known by the kid are NOT independent,
but they are conditionally independent if you provide the kid’s age.
Random Variable
Introduction to Random Variables
• A random variable is a mathematical function that assigns a numerical value
to every outcome which is possible in a random experiment. It is called a
random variable because its value depends on the output of a random
experiment. Random variables are usually symbolized by capital letters, such
as X, Y, or Z.
• For example, let's consider the outcome of flipping a coin. If the coin lands
heads, we assign a value of 1 to the random variable X, and if it lands tails, we
assign a value of 0 to X. Therefore, X is a random variable that takes on two
possible values, 0 or 1, depending on the outcome of the coin flip. We can
represent the possible values of X using a probability distribution, which tells
us the probability of each possible value.
Random Variable- Contd.
For the case of tossing a coin, the probability distribution is as follows:
• P(X = 0) = 1 / 2 (since the probability of getting tails is 1 / 2)
• P(X = 1) = 1 / 2 (since the probability of getting heads is also 1 / 2)
• In general, a probability distribution for a random variable specifies the
probability of each possible value of the variable.
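A tiny Python simulation of this random variable, checking that the empirical frequencies approach P(X=0)=P(X=1)=1/2:

```python
import random

# Simulate the coin-flip random variable X: heads -> 1, tails -> 0.
random.seed(0)
flips = [random.randint(0, 1) for _ in range(10_000)]
print("P(X = 1) ~", sum(flips) / len(flips))          # close to 0.5
print("P(X = 0) ~", 1 - sum(flips) / len(flips))      # close to 0.5
```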
Random Variable- Contd.
• There are two specific types of Random variables:
1. Discrete Random variables
2. Continuous Random variables.
1. Discrete Random Variables:
• Discrete random variables take on a finite or countably infinite set of
possible values. In other words, the range of a discrete random variable is a
discrete set of numbers. For example, the number of heads obtained when
flipping a coin five times is a discrete random variable that can take on values
0, 1, 2, 3, 4, or 5.
• The probability distribution for a discrete random variable is called a
probability mass function (PMF). The PMF gives the probability that the
random variable takes on a particular value.
Random Variable- Contd.
Sample Space (S): This is the set of all possible outcomes when three
coins are tossed simultaneously.
S={HHH,HHT,HTH,HTT,THH,THT,TTH,TTT}
• Each outcome represents the result of the three coins, where 'H' stands
for heads and 'T' stands for tails.
Random Variable (X): The number of tails in each outcome is
considered as the random variable X.
Values of X:
• Here, x_i represents the number of tails in the i-th outcome. Listing the
number of tails for each outcome in the sample space:
• X = {0, 1, 1, 1, 2, 2, 2, 3}
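A short Python sketch that enumerates the sample space and computes the PMF of X, the number of tails:

```python
from itertools import product
from collections import Counter

# Sample space for three coin tosses and the random variable X = number of tails.
sample_space = ["".join(outcome) for outcome in product("HT", repeat=3)]
x_values = [outcome.count("T") for outcome in sample_space]

counts = Counter(x_values)
pmf = {x: counts[x] / len(sample_space) for x in sorted(counts)}
print(sample_space)   # ['HHH', 'HHT', 'HTH', 'HTT', 'THH', 'THT', 'TTH', 'TTT']
print(pmf)            # {0: 0.125, 1: 0.375, 2: 0.375, 3: 0.125}
```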
Continuous Random variables
• Continuous random variables take on any value in a continuous range of
possible values. In other words, the range of a continuous random variable is
an uncountable set of numbers.
• For example, the height of a person can be referred to as a continuous random
variable that can take on any value in a continuous range from zero to infinity.
• The probability distribution for a continuous random variable is called a
probability density function (PDF).
• Unlike the PMF, the PDF does not give the probability that the random
variable takes on a particular value, but rather the probability density at each
point in the range of possible values.
Continuous Random variables- Contd
Var(X) = 0.26
Probability densities
• In probability theory and statistics, a probability density function (PDF) is a function that
describes the relative likelihood of a continuous random variable taking on a particular value. Unlike
discrete random variables, which have probabilities assigned to specific outcomes, continuous random
variables have probabilities defined over intervals of values, obtained by integrating the PDF.
• The PDF provides a powerful tool for modeling and analyzing the behavior of continuous random
variables.
Probability Density Function (PDF)
Properties of PDF
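A brief LaTeX summary of the standard definition and properties of a PDF f(x):

```latex
% Probability of an interval is the integral of the density
P(a \le X \le b) = \int_{a}^{b} f(x)\,dx
% Properties of a valid PDF
f(x) \ge 0 \quad \text{for all } x, \qquad \int_{-\infty}^{\infty} f(x)\,dx = 1
% Note: for a continuous random variable, P(X = x) = 0 for any single point x
```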
Some of the commonly used PDFs in probability and statistics to model different types of data:
• Uniform Distribution
• Normal Distribution (Gaussian Distribution)
• Exponential Distribution:
• Gamma distribution
• Beta distribution
Uniform distribution
Normal distribution
This PDF is the most popular distribution for continuous random variables
Describes some natural phenomena, e.g. height, blood pressure, test scores, measurement error, and IQ scores.
Exponential distribution
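For reference, the standard forms of the uniform, normal, and exponential densities listed above:

```latex
\text{Uniform}(a,b):\quad f(x) = \frac{1}{b-a}, \quad a \le x \le b
\text{Normal}(\mu,\sigma^2):\quad f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)
\text{Exponential}(\lambda):\quad f(x) = \lambda e^{-\lambda x}, \quad x \ge 0
```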
Standard Deviation
–The positive square root of the variance
The variance of a Random Variable
–Suppose that the diameter of a metal cylinder has a p.d.f
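The general definitions used in a variance calculation of this kind are:

```latex
E(X) = \int x\, f(x)\, dx, \qquad
\operatorname{Var}(X) = E\bigl[(X - E(X))^{2}\bigr] = E(X^{2}) - \bigl(E(X)\bigr)^{2}, \qquad
\sigma_X = \sqrt{\operatorname{Var}(X)}
```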
Covariance tells us the direction in which two quantities vary with each other; it compares
two random variables against each other.
Cov(X, Y) = E(XY) − E(X)E(Y)
Correlation shows us both the direction and the magnitude of how two quantities
vary with each other.
Corr(X, Y) = Cov(X, Y) / (σ_X σ_Y)
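A small numpy check of these formulas on synthetic data (the data and coefficients are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = 2.0 * x + rng.normal(size=1000)   # y varies in the same direction as x

cov_xy = np.mean(x * y) - np.mean(x) * np.mean(y)    # Cov(X, Y) = E(XY) - E(X)E(Y)
corr_xy = cov_xy / (np.std(x) * np.std(y))           # Corr(X, Y) lies in [-1, 1]
print(f"Cov(X, Y)  = {cov_xy:.3f}")    # positive: X and Y move together
print(f"Corr(X, Y) = {corr_xy:.3f}")   # close to +0.9: strong positive relationship
```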
Summary: