CS772 Lec5
Regression
Bayesian Inference for Variance of a Univariate Gaussian
Consider i.i.d. scalar observations $x_1, \ldots, x_N$ drawn from a Gaussian with mean $\mu$ and precision $\lambda$ (so $\lambda^{-1}$ is the variance):

$$p(x_n \mid \mu, \lambda) = \mathcal{N}(x_n \mid \mu, \lambda^{-1}) = \sqrt{\frac{\lambda}{2\pi}} \exp\left[-\frac{\lambda}{2}(x_n - \mu)^2\right]$$

If we parametrize by the variance, an inverse-gamma distribution (with shape and scale hyperparameters) is the conjugate prior on the variance.
If the mean is known, a Gamma prior on the precision is a conjugate prior to the Gaussian likelihood; its shape and rate parameters are the hyperparameters (note: the mean of a Gamma distribution equals shape/rate). We assume the shape-rate parametrization of the gamma throughout.
Note: Unlike the case of unknown mean and fixed variance, the PPD for this case (and also the unknown variance case) will not be a Gaussian.
When both the mean and the precision are unknown, the conjugate prior is a product of a normal and a gamma distribution (a Normal-Gamma distribution).
(For full derivation of the posterior, refer to "Conjugate Bayesian analysis of the Gaussian distribution", Murphy (2007))
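For the known-mean case, the conjugate update has a simple closed form. Below is a minimal Python sketch (not from the slides); the hyperparameter names a0, b0 and the toy data are made up for illustration:

import numpy as np

def gamma_posterior_precision(x, mu, a0, b0):
    """Conjugate update for the precision of a Gaussian with known mean mu.

    Prior:     lambda ~ Gamma(a0, b0)                       (shape-rate parametrization)
    Posterior: lambda ~ Gamma(a0 + N/2, b0 + 0.5 * sum((x_n - mu)^2))
    """
    x = np.asarray(x)
    aN = a0 + x.size / 2.0
    bN = b0 + 0.5 * np.sum((x - mu) ** 2)
    return aN, bN

# Example: data from N(mu=2, precision=4), i.e., standard deviation 0.5
rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=0.5, size=100)
aN, bN = gamma_posterior_precision(x, mu=2.0, a0=1.0, b0=1.0)
print("posterior mean of precision:", aN / bN)   # should be close to 4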
Other Quantities of Interest
We saw that the joint posterior for the mean and precision is Normal-Gamma (NG).
From the above, we can also obtain the marginal posteriors for $\mu$ and $\lambda$.
(For full derivation of the posterior, refer to "Conjugate Bayesian analysis of the Gaussian distribution", Murphy (2007))
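For reference, a hedged sketch of what those marginals look like, using the standard Normal-Gamma parametrization $(\mu_N, \kappa_N, a_N, b_N)$ from Murphy (2007) rather than the slides' notation:

$$p(\lambda \mid \mathbf{x}) = \mathrm{Gamma}(\lambda \mid a_N, b_N), \qquad p(\mu \mid \mathbf{x}) = \mathrm{St}\!\left(\mu \,\middle|\, \mu_N,\ \sigma^2 = \frac{b_N}{a_N \kappa_N},\ \nu = 2a_N\right)$$

So the marginal posterior of the precision is a Gamma, and the marginal posterior of the mean is a Student-t (discussed next).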
Student-t distribution
A Student-t is an infinite sum of Gaussian distributions with the same mean but different precisions. This is the same as saying that we are integrating out the precision parameter of a Gaussian, with the mean held fixed.
Here $\nu$ is called the degrees of freedom, $\mu$ is the mean, and $\sigma$ is the scale. As $\nu$ tends to infinity, the Student-t becomes a Gaussian.
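A minimal Monte Carlo sketch (not from the slides) of this scale-mixture view, assuming the standard construction $\lambda \sim \mathrm{Gamma}(\nu/2, \nu/2)$, $x \mid \lambda \sim \mathcal{N}(\mu, \sigma^2/\lambda)$; SciPy is used only for the comparison:

import numpy as np
from scipy import stats

# Sample lambda ~ Gamma(nu/2, rate=nu/2), then x | lambda ~ N(mu, sigma^2 / lambda).
# The resulting x is Student-t distributed: St(x | mu, sigma, nu).
rng = np.random.default_rng(0)
mu, sigma, nu = 0.0, 1.0, 4.0

lam = rng.gamma(shape=nu / 2.0, scale=2.0 / nu, size=100_000)  # scale = 1/rate
x = rng.normal(loc=mu, scale=sigma / np.sqrt(lam))

# Compare empirical quantiles with SciPy's Student-t (loc/scale parametrization)
print(np.quantile(x, [0.1, 0.5, 0.9]))
print(stats.t(df=nu, loc=mu, scale=sigma).ppf([0.1, 0.5, 0.9]))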
Inferring Params of Gaussian: Some Other Cases
So far, we only considered parameter estimation for the univariate Gaussian distribution. A very useful related case is the linear Gaussian model: assume a Gaussian prior $p(\mathbf{z}) = \mathcal{N}(\mathbf{z} \mid \boldsymbol{\mu}, \boldsymbol{\Lambda}^{-1})$ and an observation $\mathbf{x}$ whose mean depends linearly on $\mathbf{z}$. It is easy to see that, conditioned on $\mathbf{z}$, $\mathbf{x}$ too has a Gaussian distribution:

$$p(\mathbf{x} \mid \mathbf{z}) = \mathcal{N}(\mathbf{x} \mid \mathbf{A}\mathbf{z} + \mathbf{b}, \mathbf{L}^{-1})$$

Posterior of $\mathbf{z}$:

$$p(\mathbf{z} \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid \mathbf{z})\, p(\mathbf{z})}{p(\mathbf{x})} = \mathcal{N}\big(\mathbf{z} \mid \boldsymbol{\Sigma}(\mathbf{A}^\top \mathbf{L}(\mathbf{x} - \mathbf{b}) + \boldsymbol{\Lambda}\boldsymbol{\mu}),\ \boldsymbol{\Sigma}\big), \qquad \boldsymbol{\Sigma} = (\boldsymbol{\Lambda} + \mathbf{A}^\top \mathbf{L}\mathbf{A})^{-1}$$

Marginal likelihood:

$$p(\mathbf{x}) = \int p(\mathbf{x} \mid \mathbf{z})\, p(\mathbf{z})\, d\mathbf{z} = \mathcal{N}(\mathbf{x} \mid \mathbf{A}\boldsymbol{\mu} + \mathbf{b},\ \mathbf{L}^{-1} + \mathbf{A}\boldsymbol{\Lambda}^{-1}\mathbf{A}^\top)$$

Closed-form expressions for both the posterior and the marginal likelihood, and both are Gaussian. Very useful results (PRML Chap. 2 contains a proof).
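A small numerical sketch (not from the slides) of evaluating these closed-form expressions; the function name and toy dimensions are made up:

import numpy as np

def linear_gaussian_posterior(A, b, L, mu, Lam, x):
    """Posterior p(z|x) and marginal p(x) for the linear Gaussian model
    p(z) = N(mu, Lam^{-1}), p(x|z) = N(Az + b, L^{-1})  (PRML, Sec. 2.3.3)."""
    Sigma = np.linalg.inv(Lam + A.T @ L @ A)           # posterior covariance
    m = Sigma @ (A.T @ L @ (x - b) + Lam @ mu)         # posterior mean
    marg_mean = A @ mu + b                             # marginal-likelihood mean
    marg_cov = np.linalg.inv(L) + A @ np.linalg.inv(Lam) @ A.T
    return m, Sigma, marg_mean, marg_cov

# Tiny example with made-up dimensions (dim(z) = 2, dim(x) = 3)
rng = np.random.default_rng(0)
A = rng.normal(size=(3, 2)); b = np.zeros(3)
L = 2.0 * np.eye(3)                    # likelihood precision
mu = np.zeros(2); Lam = np.eye(2)      # prior mean and precision
x = rng.normal(size=3)
print(linear_gaussian_posterior(A, b, L, mu, Lam, x))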
(Probabilistic/Bayesian) Linear Regression
Assume training data $\{(\mathbf{x}_n, y_n)\}_{n=1}^N$, with features $\mathbf{x}_n$ and responses $y_n$. Assume each $y_n$ is generated by a noisy linear model with weights $\mathbf{w}$ (each weight assumed real-valued):

$$y_n = \mathbf{w}^\top \mathbf{x}_n + \epsilon_n$$

with $\epsilon_n$ drawn from zero-mean Gaussian noise, $\epsilon_n \sim \mathcal{N}(0, \beta^{-1})$. Other noise models are also possible (like Laplace).

Notation alert: $\beta$ is the precision of the Gaussian noise (and $\beta^{-1}$ the variance).

Likelihood model:

$$p(y_n \mid \mathbf{x}_n, \mathbf{w}) = \mathcal{N}(y_n \mid \mathbf{w}^\top \mathbf{x}_n, \beta^{-1})$$

The line $\mathbf{w}^\top \mathbf{x}_n$ represents the mean of the output random variable; the zero-mean Gaussian noise perturbs the output $y$ from its mean. Thus the NLL is like the squared loss.
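A one-function sketch (not from the slides) making the NLL/squared-loss connection concrete:

import numpy as np

# The Gaussian negative log-likelihood of one observation is
# beta/2 * (y - w^T x)^2 plus a w-independent constant, i.e., a squared loss.
def gaussian_nll(y, x, w, beta):
    mean = w @ x
    return 0.5 * beta * (y - mean) ** 2 - 0.5 * np.log(beta / (2 * np.pi))

w = np.array([1.0, -2.0]); x = np.array([0.5, 1.5]); y = 0.3; beta = 4.0
print(gaussian_nll(y, x, w, beta))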
Probabilistic Linear Regression
For all the training data, we can write the above model in matrix-vector notation, $p(\mathbf{y} \mid \mathbf{X}, \mathbf{w}) = \mathcal{N}(\mathbf{y} \mid \mathbf{X}\mathbf{w}, \beta^{-1}\mathbf{I}_N)$, where $\mathbf{y}$ is the $N \times 1$ response vector, $\mathbf{X}$ is the $N \times D$ input matrix, and $\boldsymbol{\epsilon}$ is the $N \times 1$ noise vector. Same as writing

$$\mathbf{y} = \mathbf{X}\mathbf{w} + \boldsymbol{\epsilon}$$

This is an example of a linear Gaussian model, with $\mathbf{w}$ being the unknown.
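A minimal generative sketch (not from the slides); $N$, $D$, the true weights, and $\beta$ are made-up values:

import numpy as np

# Generate data from y = Xw + eps with Gaussian noise of precision beta.
rng = np.random.default_rng(0)
N, D, beta = 50, 3, 25.0
X = rng.normal(size=(N, D))
w_true = np.array([0.5, -1.0, 2.0])
eps = rng.normal(scale=1.0 / np.sqrt(beta), size=N)   # noise variance = 1/beta
y = X @ w_true + eps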
A simple plate notation would look like this
Prior on weights
Assume a zero-mean Gaussian prior on $\mathbf{w}$:

$$p(\mathbf{w}) = \prod_{d=1}^{D} p(w_d) = \prod_{d=1}^{D} \mathcal{N}(w_d \mid 0, \lambda^{-1}) = \mathcal{N}(\mathbf{w} \mid \mathbf{0}, \lambda^{-1}\mathbf{I}_D) = \left(\frac{\lambda}{2\pi}\right)^{D/2} \exp\left[-\frac{\lambda}{2}\mathbf{w}^\top\mathbf{w}\right]$$

This prior assumes that a priori each weight has a small value (close to zero). We may also use a non-zero-mean Gaussian prior, e.g., if we expect the weights to be close to some other value.
$\lambda$ controls the uncertainty around our prior belief about the value of $\mathbf{w}$. The prior's precision $\lambda$ controls how aggressively the prior pushes $\mathbf{w}$ towards its mean (0); a large $\lambda$ means a more aggressive push (reason: look at the negative log of the prior). The zero-mean Gaussian prior corresponds to an $\ell_2$ regularizer.
We can even use a different $\lambda_d$ for each $w_d$, which is useful in sparse modeling (later); in the zero-mean case, $\lambda_d^{-1}$ sort of denotes each feature's importance (think why?). We can also use a full covariance matrix for the prior to impose a priori correlations among the different weights.
The hyperparameters ($\lambda$, $\beta$) etc. can be learned as well, using point estimation (e.g., MLE-II) or fully Bayesian inference.
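To make the regularization connection explicit, a one-line check (not spelled out on the slide) using the prior above:

$$-\log p(\mathbf{w}) = \frac{\lambda}{2}\,\mathbf{w}^\top\mathbf{w} + \text{const}$$

which is exactly the $\ell_2$ penalty on $\mathbf{w}$ with regularization strength $\lambda$.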
The Posterior
Will only look at fully Bayesian inference; for MLE/MAP, refer to the CS771 slides or the book.
The posterior over $\mathbf{w}$ (for now, assume the hyperparameters $\lambda$ and $\beta$ to be known):

$$p(\mathbf{w} \mid \mathbf{X}, \mathbf{y}) = \frac{p(\mathbf{y} \mid \mathbf{X}, \mathbf{w})\, p(\mathbf{w})}{p(\mathbf{y} \mid \mathbf{X})}$$

The denominator is the marginal likelihood for this regression model; note that it is conditioned on $\mathbf{X}$ too, which is assumed given and not being modeled. The posterior must be a Gaussian due to conjugacy.
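A sketch of computing that Gaussian posterior under the zero-mean prior above (this closed form is a standard result, e.g., PRML Chap. 3, not derived on this slide; the toy data are made up):

import numpy as np

def posterior_w(X, y, lam, beta):
    """Gaussian posterior N(w | mu_N, Sigma_N) for Bayesian linear regression
    with prior N(w | 0, lam^{-1} I) and likelihood N(y | Xw, beta^{-1} I)."""
    D = X.shape[1]
    Sigma_N = np.linalg.inv(lam * np.eye(D) + beta * X.T @ X)
    mu_N = beta * Sigma_N @ X.T @ y
    return mu_N, Sigma_N

# Toy usage with made-up data (N = 50, D = 3; lam and beta picked arbitrarily)
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([0.5, -1.0, 2.0]) + rng.normal(scale=0.2, size=50)
mu_N, Sigma_N = posterior_w(X, y, lam=1.0, beta=25.0)
print(mu_N)   # should be close to the true weights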
The Posterior: A Visualization
Assume a linear regression problem with a known true $\mathbf{w}$, i.e., data generated by a linear regression model. Note: it's actually 1-D regression (one weight is just a bias term), or equivalently 2-D regression with a constant feature.
Figures below show the "data space" and the posterior of $\mathbf{w}$ for different numbers of observations (note: with no observations, the posterior = prior). Each red line represents the "data" generated for a $\mathbf{w}$ randomly drawn from the current posterior.
Posterior Predictive Distribution
To get the prediction $y_*$ for a new input $\mathbf{x}_*$, we can compute its PPD. Only $\mathbf{w}$ is unknown, with posterior distribution $\mathcal{N}(\mathbf{w} \mid \boldsymbol{\mu}_N, \boldsymbol{\Sigma}_N)$, so only $\mathbf{w}$ has to be integrated out:

$$p(y_* \mid \mathbf{x}_*, \mathbf{X}, \mathbf{y}) = \int \mathcal{N}(y_* \mid \mathbf{w}^\top \mathbf{x}_*, \beta^{-1})\, \mathcal{N}(\mathbf{w} \mid \boldsymbol{\mu}_N, \boldsymbol{\Sigma}_N)\, d\mathbf{w}$$
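Using the Gaussian marginalization result from the linear Gaussian model section, this integral has a closed form (a standard result, stated here for completeness rather than taken from this slide):

$$p(y_* \mid \mathbf{x}_*, \mathbf{X}, \mathbf{y}) = \mathcal{N}\!\left(y_* \,\middle|\, \boldsymbol{\mu}_N^\top \mathbf{x}_*,\ \beta^{-1} + \mathbf{x}_*^\top \boldsymbol{\Sigma}_N \mathbf{x}_*\right)$$

Note that the predictive variance combines the noise variance $\beta^{-1}$ with the posterior uncertainty about $\mathbf{w}$.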
Posterior Predictive Distribution: An Illustration
Black dots are training examples
Hyperparameters
The probabilistic linear regression model we saw had two hyperparameters ($\lambda$ and $\beta$). Thus, there are three unknowns in total ($\mathbf{w}$, $\lambda$, $\beta$).
We would need a posterior over all the 3 unknowns, and the PPD would require integrating out all 3 unknowns.
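Concretely, writing out what that integration would look like (a sketch of the integral, given a joint posterior over all three unknowns):

$$p(y_* \mid \mathbf{x}_*, \mathbf{X}, \mathbf{y}) = \iiint \mathcal{N}(y_* \mid \mathbf{w}^\top \mathbf{x}_*, \beta^{-1})\; p(\mathbf{w}, \lambda, \beta \mid \mathbf{X}, \mathbf{y})\; d\mathbf{w}\, d\lambda\, d\beta$$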