Summary

This document covers a lecture on Bayesian inference.
- Bayes' theorem provides a formula for updating probabilities based on new evidence or data: the posterior probability is proportional to the likelihood of the data times the prior probability.
- Bayesian estimators are point estimates derived from the central tendencies (mean, median, mode) of the posterior distribution.
- Common loss functions used in Bayesian estimation include the quadratic, absolute-value, hit-or-miss, and Huber losses, which measure the error between the estimated and true values.
- The minimum mean-square error (MMSE) estimator minimizes the expected quadratic loss under the posterior distribution for a given dataset.

LECTURE 2

Bayesian Inference

Posterior ∝ Likelihood × Prior

G.C. Calafiore (Politecnico di Torino)

(Portrait: T. Bayes)


Outline

1 Bayes' rule

2 Bayesian estimators
    Example: estimating a Bernoulli parameter

3 Naive Bayes classifiers
    Example: document classification
    Example: buyers' classification


Introduction
Bayesian inference is a mathematical procedure that applies probability to statistical problems. It provides the tools to update one's beliefs in light of new evidence or data.
Bayes’ Theorem
    p(A|B) = p(A) p(B|A) / p(B)

A is some statement (e.g., "the subject is pregnant"), and B is some data or evidence (e.g., "the human chorionic gonadotropin (HCG) test is positive").
We cannot observe A directly, but we observe some related "clue" B:

    p(pregnant|HCG+) = p(pregnant) p(HCG+|pregnant) / p(HCG+)

p(pregnant|HCG+): the probability that the subject is pregnant, given the information that the HCG test is positive.
Notice that the HCG test is not infallible!



Introduction
    p(A|B) = p(A) p(B|A) / p(B)   →   p(pregnant|HCG+) = p(pregnant) p(HCG+|pregnant) / p(HCG+)

p(pregnant): the probability of being pregnant, before looking at any evidence; it is the a-priori plausibility of statement A, for instance based on age, nationality, etc.
p(HCG+|pregnant) (the likelihood) expresses how likely it is for the test to be positive, under the assumption that the subject is pregnant.
p(HCG+) is the total plausibility of the evidence, whereby we have to
consider all possible scenarios to ensure that the posterior is a proper
probability distribution:
    p(HCG+) = p(HCG+|pregnant) p(pregnant) + p(HCG+|not pregnant) p(not pregnant)

Bayes' rule thus effectively updates our initial beliefs about a proposition with some observation, yielding a final measure of plausibility, given the evidence.
Bayes’ Theorem
In pictures...

p(A ∩ B) is the joint probability of A and B, often also denoted by p(A, B).
The conditional probability of A given B is defined as

    p(A|B) = p(A ∩ B) / p(B).

Since p(A ∩ B) = p(B ∩ A), we also have that

    p(B|A) = p(B ∩ A) / p(A) = p(A|B) p(B) / p(A),

which is Bayes' rule.
Law of total probability
Let {Bi : i = 1, 2, 3, . . .} be a finite or countably infinite partition of a
sample space (i.e., a set of pairwise disjoint events whose union is the entire
sample space), and each event Bi is measurable.
Then for any event A of the same probability space it holds that

    p(A) = Σ_i p(A ∩ Bi) = Σ_i p(A|Bi) p(Bi).

We can hence write Bayes' rule as

    p(Bk|A) = p(A|Bk) p(Bk) / p(A) = p(A|Bk) p(Bk) / [Σ_i p(A|Bi) p(Bi)].
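
As a quick numerical illustration (not part of the original slides), here is a minimal Python sketch of Bayes' rule over a finite partition; the prior and likelihood values below are made-up numbers.

    # Minimal sketch: Bayes' rule over a finite partition {B1, ..., Bn}.
    # The numbers below are made up, for illustration only.
    priors = [0.5, 0.3, 0.2]          # p(B_i), must sum to 1
    likelihoods = [0.9, 0.5, 0.1]     # p(A | B_i)

    # Law of total probability: p(A) = sum_i p(A|B_i) p(B_i)
    p_A = sum(l * p for l, p in zip(likelihoods, priors))

    # Bayes' rule: p(B_k | A) = p(A|B_k) p(B_k) / p(A)
    posteriors = [l * p / p_A for l, p in zip(likelihoods, priors)]

    print(p_A)          # 0.62
    print(posteriors)   # posterior over the partition; sums to 1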



Bayes’ rule for probability density functions (pdf)
Let θ ∈ R^n be a vector of parameters we are interested in.
We describe our assumptions and a-priori knowledge about θ, before
observing any data, in the form of a prior probability distribution (or, prior
belief) p(θ).
Let D denote a set of observed data related to θ. The statistical model
relating D to θ is expressed by the conditional distribution p(D|θ), which is
called the likelihood.
Bayes' rule for pdf then states that

    p(θ|D) = p(D|θ) p(θ) / p(D),
where p(θ|D) is the posterior distribution, representing the updated state of
knowledge about θ, after we see the data.
Since p(D) only acts as a normalization constant, we can state Bayes’ rule
as
Posterior ∝ Likelihood × Prior.
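
As a numerical sketch (an assumed example, not from the slides), this proportionality can be implemented on a discretized parameter grid and normalized numerically; here the data are a hypothetical Bernoulli sample with 7 successes out of 10 trials, and the prior is flat.

    import numpy as np

    theta = np.linspace(0.0, 1.0, 101)          # grid for a scalar parameter
    prior = np.ones_like(theta)                 # flat prior p(theta), up to a constant
    likelihood = theta**7 * (1 - theta)**3      # p(D | theta) for 7 successes in 10 trials (assumed data)

    unnormalized = likelihood * prior           # posterior ∝ likelihood × prior
    posterior = unnormalized / np.trapz(unnormalized, theta)   # divide by p(D), computed numerically

    print(np.trapz(posterior, theta))           # ≈ 1.0: a proper density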



Bayes’ rule for probability density functions (pdf)
The denominator p(D) can be expressed in terms of the prior and the likelihood, via the continuous version of the total probability rule

    p(D) = ∫ p(D|θ) p(θ) dθ.

The posterior distribution of θ can be used to infer information about θ, i.e., to find a suitable point estimate θ̂ of θ (more on this in the following slides).
The Bayesian approach is well suited to on-line, recursive inference, since it can be applied repeatedly: as new data is gathered, the current posterior becomes the new prior, and a new posterior is computed based on the new data, and so on...
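
A minimal sketch of this recursive use of Bayes' rule, under the same assumptions as the grid example above (Bernoulli observations, flat initial prior; the data batches are made up):

    import numpy as np

    theta = np.linspace(0.0, 1.0, 201)
    belief = np.ones_like(theta)                # start from a flat prior (assumption)
    belief /= np.trapz(belief, theta)

    batches = [[1, 0, 1], [1, 1, 0, 1]]         # two hypothetical batches of Bernoulli outcomes
    for batch in batches:
        for y in batch:
            likelihood = theta if y == 1 else (1 - theta)
            belief = belief * likelihood        # new posterior ∝ likelihood × current prior
            belief /= np.trapz(belief, theta)   # renormalize; this posterior is the next prior

    print(theta[np.argmax(belief)])             # posterior mode after all the data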



Bayesian Estimators
In statistics, point estimation involves the use of sample data to calculate a
single value (known as a statistic) which is to serve as a best guess or “best
estimate” of an unknown population parameter θ.
Bayesian point-estimators are the central-tendency statistics of the posterior distribution (e.g., its mean, median, or mode):
- The posterior mean, which minimizes the (posterior) risk (expected loss) for a squared-error loss function;
- The posterior median, which minimizes the posterior risk for the absolute-value loss function;
- The maximum a posteriori (MAP) estimate, which finds a maximum of the posterior distribution;
- The Maximum Likelihood (ML) estimate, which coincides with the MAP under a uniform prior probability.



Loss functions
For a given point estimate θ̂, a loss function L measures the error in
predicting θ via θ̂.
Typical loss functions are the following (assuming for simplicity that θ is scalar); a small code sketch follows the list:
- Quadratic:
    L(θ − θ̂) = (θ − θ̂)²
- Absolute-value:
    L(θ − θ̂) = |θ − θ̂|
- Hit-or-miss:
    L(θ − θ̂) = 0 if |θ − θ̂| ≤ δ,  1 if |θ − θ̂| > δ
- Huber loss:
    L(θ − θ̂) = (1/2)(θ − θ̂)² if |θ − θ̂| ≤ δ,  δ(|θ − θ̂| − δ/2) if |θ − θ̂| > δ
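
A small Python sketch of these four losses (δ is a free threshold; the values used below are illustrative only):

    def quadratic(e):
        return e**2

    def absolute(e):
        return abs(e)

    def hit_or_miss(e, delta=0.1):
        return 0.0 if abs(e) <= delta else 1.0

    def huber(e, delta=0.1):
        # quadratic near zero, linear in the tails
        return 0.5 * e**2 if abs(e) <= delta else delta * (abs(e) - delta / 2)

    # e = theta - theta_hat is the estimation error
    for loss in (quadratic, absolute, hit_or_miss, huber):
        print(loss.__name__, loss(0.05), loss(0.5))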



Estimation
Bayesian estimators are defined by minimizing the expected loss, under the
posterior conditional density:
    θ̂ = arg min_θ̂ ∫ L(θ − θ̂) p(θ|D) dθ = arg min_θ̂ E{L(θ − θ̂) | D}.

We next discuss three relevant special cases.
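
Before the closed-form results, here is an illustrative sketch (not in the original slides) that minimizes the expected posterior loss numerically over a grid of candidate estimates, reusing the assumed Bernoulli grid posterior from earlier (7 successes in 10 trials, flat prior):

    import numpy as np

    theta = np.linspace(0.0, 1.0, 401)
    posterior = theta**7 * (1 - theta)**3       # flat prior, so posterior ∝ likelihood
    posterior /= np.trapz(posterior, theta)

    def expected_loss(loss, theta_hat):
        # E{L(theta - theta_hat) | D}, approximated by numerical integration
        return np.trapz(loss(theta - theta_hat) * posterior, theta)

    quadratic = lambda e: e**2
    absolute = lambda e: np.abs(e)

    for name, loss in [("quadratic", quadratic), ("absolute", absolute)]:
        risks = [expected_loss(loss, th) for th in theta]
        print(name, theta[int(np.argmin(risks))])
    # quadratic -> approx. the posterior mean (≈ 0.667); absolute -> approx. the posterior median (≈ 0.676)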



Estimation
Minimum Mean-Square Error (MMSE)

For the quadratic loss L(θ − θ̂) = (θ − θ̂)², we have

    E{L(θ − θ̂) | D} = ∫ (θ − θ̂)² p(θ|D) dθ.

To compute the minimum w.r.t. θ̂, we take the derivative of the objective:

    ∂/∂θ̂ ∫ (θ − θ̂)² p(θ|D) dθ = ∫ ∂/∂θ̂ [(θ − θ̂)² p(θ|D)] dθ = ∫ −2(θ − θ̂) p(θ|D) dθ.

Setting this derivative to zero, we obtain

    ∫ −2(θ − θ̂) p(θ|D) dθ = 0  ⇔  ∫ θ̂ p(θ|D) dθ = ∫ θ p(θ|D) dθ  ⇔  θ̂ = ∫ θ p(θ|D) dθ = E{θ|D}.



Estimation
Minimum Mean-Square Error (MMSE)

Thus, we obtain the minimizer

    θ̂MMSE = ∫ θ p(θ|D) dθ = E{θ|D},

i.e., the mean of the posterior pdf p(θ|D).

This is called the minimum mean square error estimator (MMSE estimator),
because it minimizes the average squared error.



Estimation
Minimum Absolute Error (MAE)

For the absolute-value loss L(θ − θ̂) = |θ − θ̂|, we have

    E{L(θ − θ̂) | D} = ∫ |θ − θ̂| p(θ|D) dθ.

It can be shown that the minimum w.r.t. θ̂ is obtained when

    ∫_{−∞}^{θ̂} p(θ|D) dθ = ∫_{θ̂}^{+∞} p(θ|D) dθ.

In words, the estimate θ̂ is the value which divides the probability mass into equal proportions:

    ∫_{−∞}^{θ̂} p(θ|D) dθ = 1/2,

which is the definition of the median of the posterior pdf.
This is a well-known fact; see, e.g., J.B.S. Haldane, "Note on the median of a multivariate distribution," Biometrika, vol. 35, pp. 414–417, 1948.
Estimation
Maximum A-Posteriori Estimator (MAP)

For the hit-or-miss loss (L = 0 if |θ − θ̂| ≤ δ, L = 1 if |θ − θ̂| > δ), we have

    E{L(θ − θ̂) | D} = ∫_{−∞}^{θ̂−δ} p(θ|D) dθ + ∫_{θ̂+δ}^{+∞} p(θ|D) dθ = 1 − ∫_{θ̂−δ}^{θ̂+δ} p(θ|D) dθ.

This is minimized by maximizing ∫_{θ̂−δ}^{θ̂+δ} p(θ|D) dθ.

For small δ and suitable assumptions on p(θ|D) (e.g., smooth, quasi-concave), the maximum occurs at the maximum of p(θ|D).
Therefore, the estimator is the mode (the peak value) of the posterior pdf. Hence the name Maximum A Posteriori (MAP) estimator.



Estimation
MAP and Maximum Likelihood (ML) estimators

The MAP estimate is

    θ̂MAP = arg max_θ p(θ|D)
         = arg max_θ p(D|θ) p(θ)   [by Bayes' rule],

where p(D|θ) is the likelihood, and p(θ) is the prior.
If the prior is uniform, i.e., p(θ) = const., then maximizing p(θ|D) is equivalent to maximizing the likelihood p(D|θ).
The maximum likelihood estimator is

    θ̂ML = arg max_θ p(D|θ),

and it is equivalent to the MAP estimator under a uniform prior.



Estimation
Summary

Probabilistic model:
- p(θ): the prior
- p(D|θ): the likelihood
- p(θ|D) = const. × p(D|θ) p(θ): the posterior.

Estimators (a numerical comparison is sketched after this list):
- θ̂MMSE is the mean of the posterior p(θ|D). It is the value which minimizes the expected quadratic loss.
- θ̂MAE is the median of the posterior p(θ|D). It is the value which minimizes the expected absolute loss.
- θ̂MAP is the maximum (i.e., peak, or mode) of the posterior p(θ|D). It is (approximately) the value which minimizes the expected hit-or-miss loss.
- θ̂ML is the maximum (i.e., peak, or mode) of the likelihood function p(D|θ).
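
To make the summary concrete, here is a short numerical sketch (same assumptions as the earlier grid examples: Bernoulli data with 7 successes in 10 trials and a flat prior) computing the three posterior statistics directly:

    import numpy as np

    theta = np.linspace(0.0, 1.0, 401)
    posterior = theta**7 * (1 - theta)**3           # flat prior, so posterior ∝ likelihood
    posterior /= np.trapz(posterior, theta)

    mean = np.trapz(theta * posterior, theta)       # MMSE estimate
    cdf = np.cumsum(posterior) * (theta[1] - theta[0])
    median = theta[np.searchsorted(cdf, 0.5)]       # MAE estimate
    mode = theta[np.argmax(posterior)]              # MAP estimate (= ML here, since the prior is flat)

    print(mean, median, mode)                       # ≈ 0.667, 0.676, 0.700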



Example
Estimating the outcome probability of a Bernoulli experiment

A Bernoulli experiment is an experiment whose outcome y is random and binary, say 0 (fail) or 1 (success).
The random outcome of a Bernoulli experiment is dictated by an underlying
probability θ ∈ [0, 1], which is the success probability of the experiment, i.e.,
p(y = 1) = θ.
We here assume that θ is unknown, and we want to estimate it from
observed data.
Let D = {y1 , . . . , yN } be the observed outcomes of N such experiments.
Here yi , i = 1, . . . , N, are i.i.d. trials, each having distribution yi ∼ Ber(θ),
where Ber(x|θ) is the Bernoulli distribution

Ber(x|θ) = θ^{1(x)} (1 − θ)^{1(1−x)},   i.e., Ber(x|θ) = θ for x = 1 and 1 − θ for x = 0,

where 1(x) is equal to one for x = 1 and zero for x = 0.



Example
Estimating the outcome probability of a Bernoulli experiment

Since the experiments are i.i.d., we have that

    p(y1, . . . , yN |θ) = θ^{1(y1)+···+1(yN)} (1 − θ)^{1(1−y1)+···+1(1−yN)}.

Let N1 = 1(y1) + · · · + 1(yN) denote the number of successes; then

    p(y1, . . . , yN |θ) = θ^{N1} (1 − θ)^{N−N1},

which is proportional to the Binomial pdf

    Bin(k|θ, N) = (N choose k) θ^k (1 − θ)^{N−k}.

Thus, the likelihood of the data is

    p(D|θ) = p(y1, . . . , yN |θ) = θ^{N1} (1 − θ)^{N−N1} ∝ Bin(N1|θ, N).



Example
Estimating the outcome probability of a Bernoulli experiment
Maximum likelihood estimation
The ML estimate is obtained by maximizing the likelihood p(D|θ) w.r.t. θ.
Taking the derivative,

    d/dθ p(D|θ) = d/dθ [θ^{N1} (1 − θ)^{N−N1}]
                = N1 θ^{N1−1} (1 − θ)^{N−N1} − (N − N1) θ^{N1} (1 − θ)^{N−N1−1}
                = θ^{N1−1} (1 − θ)^{N−N1−1} [N1(1 − θ) − (N − N1)θ]
                = θ^{N1−1} (1 − θ)^{N−N1−1} (N1 − Nθ),

we see that the derivative is zero for

    θ̂ML = N1 / N.

A quite expected result: the estimated success probability is the empirical frequency of successes observed in N trials.



Example
Maximum likelihood estimation
Overfitting in ML estimation, or the zero-count problem
A problem with ML estimation is that it puts too much emphasis on the observed data.
Suppose for example that the Bernoulli process we are observing is a (perhaps unfair, or biased) coin, where 1 indicates heads and 0 indicates tails.
Suppose we perform a few experiments, say N = 5, and in all N experiments we observe heads. We would conclude that θ̂ML = N1/N = 1, so our guess is that the underlying Bernoulli probability is one. This means that we will predict another head, with probability one.
ML estimation may thus lead to poor results for small sample sizes. ML works fine asymptotically, i.e., for N → ∞.



Example
MAP estimation

If we have the information that the experiment is actually a coin toss, we can consider a prior on θ which is somewhat concentrated around 1/2, instead of uniform on [0, 1], as we implicitly assumed in the ML estimation.
A typical choice of prior for the Bernoulli likelihood is the Beta prior

    p(θ) = Beta(θ|a, b) ∝ θ^{a−1} (1 − θ)^{b−1},

where a, b > 0 are the hyperparameters of the distribution. If a = b = 1 we obtain the uniform distribution.
For Beta(θ|a, b) we have

    mean = a / (a + b),   mode = (a − 1) / (a + b − 2),   variance = ab / ((a + b)² (a + b + 1)).

For instance, for a = b = 2, Beta(θ|a, b) has both mode and mean at 1/2.
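
A quick check of these formulas (a sketch assuming scipy is available; the mode formula holds for a, b > 1):

    from scipy.stats import beta

    a, b = 2, 2
    dist = beta(a, b)

    print(dist.mean())              # a/(a+b) = 0.5
    print(dist.var())               # ab/((a+b)^2 (a+b+1)) = 0.05
    print((a - 1) / (a + b - 2))    # mode = 0.5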



Example
MAP estimation

An interesting fact is that, for a Bernoulli likelihood

    p(D|θ) = p(y1, . . . , yN |θ) = θ^{N1} (1 − θ)^{N−N1} ∝ Bin(N1|θ, N),

assuming a Beta(θ|a, b) prior yields a posterior which is again a Beta distribution, since

    p(θ|D) ∝ p(D|θ) p(θ) ∝ θ^{N1} (1 − θ)^{N−N1} θ^{a−1} (1 − θ)^{b−1} = θ^{N1+a−1} (1 − θ)^{N−N1+b−1}.

When the prior and the posterior have the same form, we say that the prior
is a conjugate prior for the corresponding likelihood. In the case of the
Bernoulli likelihood, the conjugate prior is the beta distribution.



Example
MAP estimation

Since

    p(θ|D) ∝ θ^{N1+a−1} (1 − θ)^{N−N1+b−1} = Beta(θ|N1 + a, N − N1 + b),

we have that θ̂MAP is given by the mode of the posterior, hence

    θ̂MAP = (N1 + a − 1) / (a + b + N − 2).

In our example with a = b = 2, we obtain θ̂MAP = (N1 + 1)/(N + 2).

For the small-sample example with N = 5 and N1 = N, we see that the effect of an informative prior is to avoid the extreme estimate θ̂ML = N1/N = 1.
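
A minimal sketch of this comparison for the all-heads example (N = 5, N1 = 5, with the assumed prior a = b = 2):

    N, N1 = 5, 5            # five trials, all heads
    a, b = 2, 2             # Beta prior hyperparameters, concentrated around 1/2

    theta_ml = N1 / N                               # 1.0: the extreme ML estimate
    theta_map = (N1 + a - 1) / (a + b + N - 2)      # 6/7 ≈ 0.857

    print(theta_ml, theta_map)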



Example
MMSE estimation

The MMSE estimate is given by the mean of the posterior distribution, i.e.,

    θ̂MMSE = E{θ|D} = (N1 + a) / (N + a + b).

We may observe that

    θ̂MMSE = α · a/(a + b) + (1 − α) · N1/N,

for α = (a + b)/(N + a + b).
The interpretation is that the posterior mean is a convex combination of the prior mean and the ML estimate.
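
A short numerical check of this decomposition (same assumed numbers as above: N = 5 heads out of 5 trials, a = b = 2):

    N, N1 = 5, 5
    a, b = 2, 2

    theta_mmse = (N1 + a) / (N + a + b)             # posterior mean = 7/9 ≈ 0.778

    alpha = (a + b) / (N + a + b)
    prior_mean = a / (a + b)
    theta_ml = N1 / N
    convex_combo = alpha * prior_mean + (1 - alpha) * theta_ml

    print(theta_mmse, convex_combo)                 # both ≈ 0.778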



Example
Medical diagnosis

Suppose you are a woman in your 40s, and you decide to have a medical test
for breast cancer called a mammogram.
If the test is positive, what is the probability you have cancer?
That obviously depends on how reliable the test is. Suppose you are told the
test has a sensitivity of 80%, which means, if you have cancer, the test will
be positive with probability 0.8. In other words,
p(y = 1|x = 1) = 0.8
where y = 1 is the event the outcome of the mammogram is positive, and
x = 1 is the (hidden) event you have breast cancer.
Consider also the rate of “false positives” of the test, quantified as
p(y = 1|x = 0) = 0.1.
What is the probability that you have cancer (x = 1), given that the
mammogram is positive (y = 1)? That is, evaluate
p(x = 1|y = 1).



Example
Medical diagnosis

Many people conclude they are therefore 80% likely to have cancer. But this
is false!
It ignores the prior probability of having breast cancer, which fortunately is
quite low:
p(x = 1) = 0.004.
Ignoring this prior is called the base rate fallacy.
We compute the correct probability by using Bayes' rule:

    p(x = 1|y = 1) = p(y = 1|x = 1) p(x = 1) / p(y = 1)
                   = p(y = 1|x = 1) p(x = 1) / [p(y = 1|x = 1) p(x = 1) + p(y = 1|x = 0) p(x = 0)]
                   = (0.8 × 0.004) / (0.8 × 0.004 + 0.1 × (1 − 0.004)) = 0.031.
In other words, if you test positive, you only have about a 3% chance of
actually having breast cancer!
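
A few lines of Python reproduce this computation (numbers taken from the slide):

    sensitivity = 0.8        # p(y=1 | x=1)
    false_pos = 0.1          # p(y=1 | x=0)
    prior = 0.004            # p(x=1), the base rate of breast cancer

    evidence = sensitivity * prior + false_pos * (1 - prior)   # p(y=1), by total probability
    posterior = sensitivity * prior / evidence                 # p(x=1 | y=1), by Bayes' rule
    print(round(posterior, 3))                                 # 0.031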
Naive Bayes classifiers
We next discuss how to classify vectors of n features x = (x1 , . . . , xn ) into K
classes C1 , . . . , CK . The output y of the model is thus a categorical variable
in {1, . . . , K }.
The classifier assigns to each input feature vector x = (x1 , . . . , xn ) a class
probability
p(y |x), y = 1, . . . , K .

We will use a generative approach. This requires us to specify the class-conditional distribution p(x|y).
The simplest approach is to assume the features are conditionally independent given the class label. This allows us to write the class-conditional density as a product of one-dimensional densities:

    p(x|y = c, θ) = Π_{j=1}^{n} p(xj |y = c, θjc)

The resulting model is called a naive Bayes classifier (NBC).



Naive Bayes classifiers
In the case of binary features xj ∈ {0, 1}, we can use the Bernoulli distribution:

    p(x|y = c, θ) = Π_{j=1}^{n} Ber(xj |θjc),

where θjc is the probability that feature j occurs in class c. This is sometimes called the multivariate Bernoulli naive Bayes model.
In the case of categorical features, xj ∈ {1, . . . , Q}, we can use the multinoulli distribution

    p(x|y = c, θ) = Π_{j=1}^{n} Cat(xj |θjc),

where θjc is a histogram over the Q possible values for xj in class c. That is, if xj ∼ Cat(xj |θjc), then p(xj = q|θjc) = θjc(q).



Naive Bayes classifiers
The class probability, given the input feature, can be expressed as

    p(y|x) = p(x|y) p(y) / p(x) = (1/p(x)) p(y) Π_{j=1}^{n} p(xj |y),

where the constant factor (the evidence) p(x) can be computed as

    p(x) = Σ_{k=1}^{K} p(Ck) p(x|Ck).

The chosen output class is the one that maximizes p(y|x):

    ŷ = arg max_y p(y|x) = arg max_y p(x|y) p(y).
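
A compact sketch of this decision rule for categorical features (the function and data structures below are illustrative assumptions, not from the slides):

    import numpy as np

    def naive_bayes_predict(x, class_priors, feature_tables):
        # Return the class maximizing p(C_k) * prod_j p(x_j | C_k), together with the scores.
        # class_priors: list of p(C_k); feature_tables[k][j] maps a feature value to p(x_j = value | C_k).
        scores = []
        for k, prior in enumerate(class_priors):
            score = prior
            for j, value in enumerate(x):
                score *= feature_tables[k][j][value]
            scores.append(score)
        return int(np.argmax(scores)), scores

    # Toy usage with 2 classes and 2 binary features (made-up probabilities):
    priors = [0.6, 0.4]
    tables = [
        {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.8}},   # class 0: p(x_j = v | C_0)
        {0: {0: 0.4, 1: 0.6}, 1: {0: 0.5, 1: 0.5}},   # class 1: p(x_j = v | C_1)
    ]
    print(naive_bayes_predict((1, 1), priors, tables))   # -> class 0, scores ≈ [0.144, 0.12]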



Naive Bayes classifiers
Binary features

In the case of binary features xi ∈ {0, 1} we obtain the Bernoulli naive Bayes classifier, whereby

    p(x|Ck) = Π_{i=1}^{n} θki^{xi} (1 − θki)^{1−xi},

where θki is the probability of class Ck generating feature xi.

The decoupling of the class-conditional feature distributions means that each distribution can be independently estimated as a one-dimensional distribution.



Example
Document classification

Consider a dictionary composed of n words.
Any document D can be represented by a binary vector x ∈ {0, 1}^n, where xj = 1 if word j appears in the document, and xj = 0 otherwise.
Any document belongs to one of K classes C1, . . . , CK.
Given the class Ck, the class-conditional probability of a document x is

    p(x|Ck) = Π_{i=1}^{n} p(xi |Ck),

and we assume that p(xi |Ck) is Bernoulli with parameter θki (to be estimated).
The question that we want to answer is: “what is the probability that a
given document D (or, more precisely, its encoding x) belongs to a given
class Ck ?” In other words, what is p(Ck |x)?



Example
Document classification

By Bayes' rule,

    p(Ck|x) = p(x|Ck) p(Ck) / p(x) = (p(Ck)/p(x)) Π_{i=1}^{n} p(xi |Ck).

Suppose there are only two classes C1 and C2 (e.g., spam and non-spam); then

    p(C1|x) = (p(C1)/p(x)) Π_{i=1}^{n} p(xi |C1),    p(C2|x) = (p(C2)/p(x)) Π_{i=1}^{n} p(xi |C2).

By dividing, we obtain the odds

    p(C1|x) / p(C2|x) = (p(C1)/p(C2)) Π_{i=1}^{n} p(xi |C1)/p(xi |C2).

Classify as spam if p(C1|x)/p(C2|x) > 1.



Example
Buyers’ classification

[Table: N = 14 sample client records, each with the attributes Age, Income, Student, CreditRating and the binary response "buys computer".]

The table contains N = 14 samples of potential clients.
Each client is described by an n-dimensional vector of attributes, or features.
Example
Buyers’ classification

There are n = 4 categorical features:

    x = (Age, Income, Student, CreditRating)

The response variable y is a binary output that describes whether or not the client will buy a computer.
We shall use this data to construct a Bayes’ classifier with the purpose of
predicting if a new client x will buy a computer or not.
There are two classes C1 (buys computer) and C2 (does not buy a
computer).
Under the naive Bayes hypotheses, we wish to compute p(C1|x) and p(C2|x):

    p(C1|x) = (p(C1)/p(x)) Π_{i=1}^{n} p(xi |C1),    p(C2|x) = (p(C2)/p(x)) Π_{i=1}^{n} p(xi |C2).

Observe that for the purpose of deciding if p(C1 |x) > p(C2 |x) we do not
need to evaluate the common denominator p(x)...
Example
Buyers’ classification
We next evaluate all the terms needed to build the classifier:
The marginal class probabilities p(C1), p(C2) are simply evaluated as the empirical frequencies of the two classes in the data set:

    p(C1) = (number of clients that buy a computer) / N = 9/14 = 0.643
    p(C2) = (number of clients that do not buy a computer) / N = 5/14 = 0.357.

The class-conditional probability p(xi |Ck) is evaluated as the empirical frequency of the outcome xi for the population in class Ck.
For instance, if x = (age=youth, income=medium, student=yes, credit=fair), then...



Example
Buyers’ classification

...class-conditional probabilities (continued):

    p(x1 = youth|C1)  = 2/9 = 0.222
    p(x2 = medium|C1) = 4/9 = 0.444
    p(x3 = yes|C1)    = 6/9 = 0.667
    p(x4 = fair|C1)   = 6/9 = 0.667

    p(x1 = youth|C2)  = 3/5 = 0.600
    p(x2 = medium|C2) = 2/5 = 0.400
    p(x3 = yes|C2)    = 1/5 = 0.200
    p(x4 = fair|C2)   = 2/5 = 0.400.
Example
Buyers’ classification

Therefore,

    p(C1|x) = (p(C1)/p(x)) Π_{i=1}^{n} p(xi |C1)
            = (0.643/p(x)) × 0.222 × 0.444 × 0.667 × 0.667 = 0.0282/p(x)

    p(C2|x) = (p(C2)/p(x)) Π_{i=1}^{n} p(xi |C2)
            = (0.357/p(x)) × 0.600 × 0.400 × 0.200 × 0.400 = 0.0069/p(x).

Since p(C1|x) + p(C2|x) = 1, we obtain p(x) = 0.0351.


Since p(C1 |x) > p(C2 |x), we classify x in class C1 .
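
A short Python sketch reproduces these numbers (probabilities taken from the slides):

    p_C1, p_C2 = 9/14, 5/14

    # Class-conditional probabilities for x = (youth, medium, yes, fair)
    cond_C1 = [2/9, 4/9, 6/9, 6/9]
    cond_C2 = [3/5, 2/5, 1/5, 2/5]

    score_C1 = p_C1
    for p in cond_C1:
        score_C1 *= p              # p(C1) * prod_i p(x_i|C1) ≈ 0.0282 (posterior up to 1/p(x))
    score_C2 = p_C2
    for p in cond_C2:
        score_C2 *= p              # ≈ 0.0069

    p_x = score_C1 + score_C2      # evidence p(x), since the two posteriors sum to 1
    print(score_C1 / p_x, score_C2 / p_x)   # ≈ 0.803 vs 0.197: classify x in class C1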

