G.C. Calafiore (Politecnico di Torino)
Bayesian Inference
1 Bayes’ rule
2 Bayesian estimators
Example: estimating a Bernoulli parameter
    p(pregnant|HCG+) = p(pregnant) p(HCG+|pregnant) / p(HCG+)
This is called the minimum mean square error estimator (MMSE estimator),
because it minimizes the average squared error.
In words, the estimate θ̂ is the value which divides the probability mass into
equal proportions:
    ∫_{−∞}^{θ̂} p(θ|D) dθ = 1/2,
which is the definition of the median of the posterior pdf.
A well-known fact; see, e.g., J.B.S. Haldane, “Note on the median of a multivariate distribution,” Biometrika, vol. 35, pp. 414–417, 1948.
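For completeness, here is a brief sketch of why the posterior median minimizes the expected absolute loss (assuming a continuous posterior with cumulative distribution function F(θ|D)):

    E{|θ − θ̂| | D} = ∫_{−∞}^{θ̂} (θ̂ − θ) p(θ|D) dθ + ∫_{θ̂}^{+∞} (θ − θ̂) p(θ|D) dθ.

Differentiating with respect to θ̂ and setting the derivative to zero gives F(θ̂|D) − (1 − F(θ̂|D)) = 0, that is F(θ̂|D) = 1/2, which is precisely the median condition above.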
Estimation
Maximum A-Posteriori Estimator (MAP)
For the hit-or-miss loss

    L(θ − θ̂) = 0 if |θ − θ̂| ≤ δ,    1 if |θ − θ̂| > δ,

we have

    E{L(θ − θ̂)|D} = ∫_{−∞}^{θ̂−δ} p(θ|D) dθ + ∫_{θ̂+δ}^{+∞} p(θ|D) dθ
                  = 1 − ∫_{θ̂−δ}^{θ̂+δ} p(θ|D) dθ.

This is minimized by maximizing ∫_{θ̂−δ}^{θ̂+δ} p(θ|D) dθ. For small δ this integral is approximately 2δ p(θ̂|D), so the minimizer is (approximately) the value at which the posterior p(θ|D) attains its maximum, i.e., its mode.
Probabilistic model:
▶ p(θ): the prior
▶ p(D|θ): the likelihood
▶ p(θ|D) = const. × p(D|θ) p(θ): the posterior.
Estimators (a numerical comparison is sketched right after this list):
▶ θ̂MMSE is the mean of the posterior p(θ|D). It is the value which minimizes the expected quadratic loss.
▶ θ̂MAE is the median of the posterior p(θ|D). It is the value which minimizes the expected absolute loss.
▶ θ̂MAP is the maximum (i.e., peak, or mode) of the posterior p(θ|D). It is (approximately) the value which minimizes the expected hit-or-miss loss.
▶ θ̂ML is the maximum (i.e., peak, or mode) of the likelihood function p(D|θ).
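As a numerical illustration of these four estimators, here is a minimal sketch, assuming numpy; the sample size N = 10, the count N1 = 7 and the Beta(2, 2) prior are illustrative choices, not values from the slides.

    import numpy as np

    # Hypothetical data: N Bernoulli trials with N1 successes, and a Beta(a, b) prior on theta.
    N, N1 = 10, 7
    a, b = 2.0, 2.0

    theta = np.linspace(1e-6, 1 - 1e-6, 100001)                  # grid over (0, 1)
    log_like = N1 * np.log(theta) + (N - N1) * np.log(1 - theta)
    log_prior = (a - 1) * np.log(theta) + (b - 1) * np.log(1 - theta)

    post = np.exp(log_like + log_prior)                          # unnormalized posterior on the grid
    w = post / post.sum()                                        # discrete posterior weights
    cdf = np.cumsum(w)                                           # approximate posterior cdf

    theta_ML = theta[np.argmax(log_like)]                        # mode of the likelihood
    theta_MAP = theta[np.argmax(post)]                           # mode of the posterior
    theta_MMSE = (theta * w).sum()                               # mean of the posterior
    theta_MAE = theta[np.searchsorted(cdf, 0.5)]                 # median of the posterior

    print(theta_ML, theta_MAP, theta_MMSE, theta_MAE)
    # Closed forms: ML = N1/N = 0.7, MAP = (N1+a-1)/(N+a+b-2) = 2/3, MMSE = (N1+a)/(N+a+b) = 9/14.

On this example the Bayesian estimates (MAP, MMSE, MAE) are all pulled toward the prior mean 1/2 relative to θ̂ML = 0.7.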
If we know that the experiment is actually a coin toss, we can consider a prior on θ which is somewhat concentrated around 1/2, instead of the uniform prior on [0, 1] that we implicitly assumed in the ML estimation.
A typical choice of prior for the Bernoulli likelihood is the Beta prior, Beta(θ|a, b) ∝ θ^(a−1) (1 − θ)^(b−1), which has

    mean = a/(a + b),    mode = (a − 1)/(a + b − 2),    variance = ab/((a + b)² (a + b + 1)).
When the prior and the posterior have the same form, we say that the prior
is a conjugate prior for the corresponding likelihood. In the case of the
Bernoulli likelihood, the conjugate prior is the beta distribution.
Since the posterior is again a Beta distribution, Beta(θ|N1 + a, N0 + b) with N0 = N − N1, the MAP estimate is its mode,

    θ̂MAP = (N1 + a − 1)/(N + a + b − 2).

In our example with a = b = 2, we obtain θ̂MAP = (N1 + 1)/(N + 2).
For the small-sample example with N = 5 and N1 = N, we see that the effect of an informative prior is to avoid the extreme estimate θ̂ML = N1/N = 1.
The MMSE estimate is given by the mean of the posterior distribution, i.e.,

    θ̂MMSE = E{θ|D} = (N1 + a)/(N + a + b).
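For concreteness, a minimal check of these closed-form expressions on the small-sample example (N = 5, N1 = N = 5, a = b = 2), in plain Python:

    # Closed-form estimates for the small-sample example: all N = 5 tosses came up heads.
    N, N1 = 5, 5
    a, b = 2, 2

    theta_ML = N1 / N                                # 1.0  (extreme estimate)
    theta_MAP = (N1 + a - 1) / (N + a + b - 2)       # 6/7  ≈ 0.857
    theta_MMSE = (N1 + a) / (N + a + b)              # 7/9  ≈ 0.778

    print(theta_ML, theta_MAP, theta_MMSE)

Both Bayesian estimates pull the extreme ML value 1 toward the prior mean 1/2, with the MMSE estimate shrinking slightly more than the MAP estimate.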
Suppose you are a woman in your 40s, and you decide to have a medical test
for breast cancer called a mammogram.
If the test is positive, what is the probability you have cancer?
That obviously depends on how reliable the test is. Suppose you are told the
test has a sensitivity of 80%, which means, if you have cancer, the test will
be positive with probability 0.8. In other words,
p(y = 1|x = 1) = 0.8
where y = 1 is the event the outcome of the mammogram is positive, and
x = 1 is the (hidden) event you have breast cancer.
Consider also the rate of “false positives” of the test, quantified as
p(y = 1|x = 0) = 0.1.
What is the probability that you have cancer (x = 1), given that the
mammogram is positive (y = 1)? That is, evaluate
p(x = 1|y = 1).
Many people conclude they are therefore 80% likely to have cancer. But this
is false!
It ignores the prior probability of having breast cancer, which fortunately is
quite low:
p(x = 1) = 0.004.
Ignoring this prior is called the base rate fallacy.
We compute the correct probability by using Bayes’ rule:
    p(x = 1|y = 1) = p(y = 1|x = 1) p(x = 1) / p(y = 1)
                   = p(y = 1|x = 1) p(x = 1) / [p(y = 1|x = 1) p(x = 1) + p(y = 1|x = 0) p(x = 0)]
                   = 0.8 × 0.004 / (0.8 × 0.004 + 0.1 × (1 − 0.004)) = 0.031.
In other words, if you test positive, you only have about a 3% chance of
actually having breast cancer!
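The same computation as a minimal Python sketch (a direct transcription of the numbers above; the function name bayes_posterior is just an illustrative choice):

    def bayes_posterior(prior, sensitivity, false_positive_rate):
        """Return p(x = 1 | y = 1) for a binary test, via Bayes' rule."""
        evidence = sensitivity * prior + false_positive_rate * (1 - prior)  # p(y = 1)
        return sensitivity * prior / evidence

    # Mammogram example: prior 0.004, sensitivity 0.8, false-positive rate 0.1.
    print(bayes_posterior(0.004, 0.8, 0.1))   # ≈ 0.031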
Naive Bayes classifiers
We next discuss how to classify vectors of n features x = (x1 , . . . , xn ) into K
classes C1 , . . . , CK . The output y of the model is thus a categorical variable
in {1, . . . , K }.
The classifier assigns to each input feature vector x = (x1, . . . , xn) a class probability p(y|x), y = 1, . . . , K.
For categorical features taking Q possible values, the naive Bayes model assumes p(x|y = c) = ∏_{j=1}^{n} Cat(xj|θjc), where θjc is a histogram over the Q possible values for xj in class c. That is, if xj ∼ Cat(xj|θjc), then p(xj = q|θjc) = θjc(q).
The chosen output class is the one that maximizes p(y|x), that is, ŷ = arg max_{k=1,...,K} p(y = k|x); a small numerical sketch follows.
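A minimal numpy sketch of this categorical naive Bayes classifier; the toy arrays X, y (with K = 2 classes and Q = 3 feature values) and the add-one (Laplace) smoothing are illustrative choices, not part of the slides.

    import numpy as np

    # Toy data: each row is a feature vector whose entries take values in {0, 1, 2}.
    X = np.array([[0, 2, 1],
                  [1, 2, 0],
                  [0, 1, 1],
                  [2, 0, 2]])
    y = np.array([0, 0, 1, 1])          # class labels in {0, ..., K-1}
    K, Q = 2, 3
    n = X.shape[1]

    # Estimate the prior p(y = c) and the histograms theta[j, c, q] = p(x_j = q | y = c),
    # with add-one smoothing so that unseen feature values keep nonzero probability.
    prior = np.array([(y == c).mean() for c in range(K)])
    theta = np.ones((n, K, Q))
    for xi, c in zip(X, y):
        for j in range(n):
            theta[j, c, xi[j]] += 1
    theta /= theta.sum(axis=2, keepdims=True)

    def predict(x):
        # log p(y = c | x) up to the common constant -log p(x)
        log_post = np.log(prior) + sum(np.log(theta[j, :, x[j]]) for j in range(n))
        return int(np.argmax(log_post))

    print(predict(np.array([0, 2, 1])))   # most probable class for this input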
In the case of binary features xi ∈ {0, 1} we obtain the Bernoulli naive Bayes classifier, whereby

    p(x|Ck) = ∏_{i=1}^{n} θki^xi (1 − θki)^(1−xi),

and we assume that p(xi|Ck) is Bernoulli with parameter θki (to be estimated).
The question that we want to answer is: “what is the probability that a
given document D (or, more precisely, its encoding x) belongs to a given
class Ck ?” In other words, what is p(Ck |x)?
By Bayes’ rule

    p(Ck|x) = p(x|Ck) p(Ck) / p(x) = (p(Ck)/p(x)) ∏_{i=1}^{n} p(xi|Ck).
Suppose there are only two classes C1 and C2 (e.g., spam and non-spam); then

    p(C1|x) = (p(C1)/p(x)) ∏_{i=1}^{n} p(xi|C1),    p(C2|x) = (p(C2)/p(x)) ∏_{i=1}^{n} p(xi|C2).

Classify as spam if p(C1|x)/p(C2|x) > 1.
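A minimal sketch of this two-class decision rule with Bernoulli features, in the same numpy-based style as above; the word-presence parameters theta1, theta2 and the class priors are illustrative values, not estimated from any real corpus.

    import numpy as np

    # Illustrative Bernoulli parameters theta_ki = p(word i present | class k)
    theta1 = np.array([0.80, 0.60, 0.10])    # C1 = spam
    theta2 = np.array([0.10, 0.30, 0.50])    # C2 = non-spam
    p1, p2 = 0.4, 0.6                        # class priors p(C1), p(C2)

    def log_posterior_ratio(x):
        """log [ p(C1|x) / p(C2|x) ]; the common factor 1/p(x) cancels."""
        ll1 = np.sum(x * np.log(theta1) + (1 - x) * np.log(1 - theta1))
        ll2 = np.sum(x * np.log(theta2) + (1 - x) * np.log(1 - theta2))
        return np.log(p1) - np.log(p2) + ll1 - ll2

    x = np.array([1, 1, 0])                  # encoded document (word presence/absence)
    print("spam" if log_posterior_ratio(x) > 0 else "non-spam")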
The response variable y is a binary output that describes whether or not the client will buy a computer.
We shall use this data to construct a Bayes’ classifier with the purpose of
predicting if a new client x will buy a computer or not.
There are two classes C1 (buys computer) and C2 (does not buy a
computer).
Under the Naive Bayes’ hypotheses, we wish to compute p(C1 |x), p(C2 |x):
    p(C1|x) = (p(C1)/p(x)) ∏_{i=1}^{n} p(xi|C1),    p(C2|x) = (p(C2)/p(x)) ∏_{i=1}^{n} p(xi|C2).
Observe that for the purpose of deciding if p(C1 |x) > p(C2 |x) we do not
need to evaluate the common denominator p(x)...
Example
Buyers’ classification
We next evaluate all the terms needed to build the classifier:
The marginal class probabilities p(C1 ), p(C2 ) are simply evaluated as the
empirical frequencies of the two classes in the data set:
    p(C1) = (number of clients that buy a computer)/N = 9/14 = 0.643,
    p(C2) = (number of clients that do not buy a computer)/N = 5/14 = 0.357.
Therefore,
    p(C1|x) = (p(C1)/p(x)) ∏_{i=1}^{n} p(xi|C1) = (0.643/p(x)) × 0.222 × 0.444 × 0.667 × 0.667 = 0.0282/p(x),

    p(C2|x) = (p(C2)/p(x)) ∏_{i=1}^{n} p(xi|C2) = (0.357/p(x)) × 0.600 × 0.400 × 0.200 × 0.400 = 0.0069/p(x).
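As a quick arithmetic check of the two products above (plain Python; the conditional probabilities are those reported in the slide):

    # Numerators of p(C1|x) and p(C2|x); the common factor 1/p(x) is omitted.
    score_C1 = 0.643 * 0.222 * 0.444 * 0.667 * 0.667   # ≈ 0.0282
    score_C2 = 0.357 * 0.600 * 0.400 * 0.200 * 0.400   # ≈ 0.0069
    print(score_C1, score_C2, "C1" if score_C1 > score_C2 else "C2")

Since 0.0282 > 0.0069 and the factor 1/p(x) is common to both expressions, the classifier predicts class C1, i.e., that the client buys a computer.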