
(Probabilistic/Bayesian) Linear Regression

CS772A: Probabilistic Machine Learning


Piyush Rai
2
Plan for today
 Wrapping up the discussion of parameter estimation for
Gaussians
 (Probabilistic/Bayesian) Linear Regression

3
Bayesian Inference for Variance of a Univariate Gaussian
 Consider $N$ i.i.d. scalar observations $x_1, \ldots, x_N$ drawn from $\mathcal{N}(x \mid \mu, \sigma^2)$

 We want to estimate the variance $\sigma^2$. Assume the mean $\mu$ to be known.

 If we want a conjugate prior $p(\sigma^2)$, its functional form must be the same as that of the likelihood

 An inverse-gamma (IG) distribution has this form ($\alpha$ and $\beta$ are its shape and scale hyperparams), so we use an inverse-gamma prior on the variance:
$$\text{IG}(\sigma^2 \mid \alpha, \beta) = \frac{\beta^{\alpha}}{\Gamma(\alpha)}\, (\sigma^2)^{-\alpha - 1} \exp\!\left(-\frac{\beta}{\sigma^2}\right)$$

 Due to conjugacy, the posterior will also be IG (exercise: verify), with
$$p(\sigma^2 \mid x_1, \ldots, x_N) = \text{IG}\!\left(\alpha + \frac{N}{2},\; \beta + \frac{1}{2}\sum_{n=1}^N (x_n - \mu)^2\right)$$
4
Working with Gaussians: Variance vs Precision
 Often, it is easier to work with the precision (= 1/variance) rather than the variance:
$$p(x_n \mid \mu, \lambda) = \mathcal{N}(x_n \mid \mu, \lambda^{-1}) = \sqrt{\frac{\lambda}{2\pi}}\, \exp\!\left[-\frac{\lambda}{2}(x_n - \mu)^2\right]$$

 If the mean is known, a gamma distribution is a conjugate prior to the Gaussian likelihood for the precision; we use a gamma prior $\text{Gamma}(\lambda \mid a, b)$ on the precision, where $a$ and $b$ are the shape and rate params, resp., of the gamma distribution (note: its mean is $a/b$)

 (Verify) The posterior will be
$$p(\lambda \mid x_1, \ldots, x_N) = \text{Gamma}\!\left(a + \frac{N}{2},\; b + \frac{1}{2}\sum_{n=1}^N (x_n - \mu)^2\right)$$

 Note: Unlike the case of unknown mean and fixed variance, the PPD for this case (and also the unknown-variance case) will not be a Gaussian

 Note: The gamma distribution can be defined in terms of a shape-and-scale or a shape-and-rate parametrization (scale = 1/rate); likewise for the inverse-gamma
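A minimal numpy/scipy sketch (not from the slides) of the two conjugate updates from this slide and the previous one, assuming a known mean; the hyperparameter names (alpha0, beta0, a0, b0) and the toy data are illustrative choices:

# Conjugate updates for the variance (inverse-gamma prior) and the precision
# (gamma prior) of a univariate Gaussian with KNOWN mean mu.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
mu, true_var = 2.0, 1.5
x = rng.normal(mu, np.sqrt(true_var), size=500)          # i.i.d. observations
N, ss = len(x), np.sum((x - mu) ** 2)                     # sufficient statistics

# Inverse-gamma prior on the variance: IG(alpha0, beta0) -> IG(alpha0 + N/2, beta0 + ss/2)
alpha0, beta0 = 2.0, 2.0
alphaN, betaN = alpha0 + N / 2, beta0 + ss / 2
post_var = stats.invgamma(a=alphaN, scale=betaN)          # posterior over sigma^2
print("posterior mean of variance :", post_var.mean())    # close to true_var for large N

# Gamma prior on the precision: Gamma(a0, b0) (shape-rate) -> Gamma(a0 + N/2, b0 + ss/2)
a0, b0 = 2.0, 2.0
aN, bN = a0 + N / 2, b0 + ss / 2
post_prec = stats.gamma(a=aN, scale=1.0 / bN)             # scipy uses shape-scale, so scale = 1/rate
print("posterior mean of precision:", post_prec.mean(), "(true:", 1 / true_var, ")")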
5
Bayesian Inference for Both Parameters of a Gaussian
 Gaussian with unknown scalar mean and unknown scalar precision (two parameters)
 Consider $N$ i.i.d. scalar obs $x_1, \ldots, x_N$ drawn from $\mathcal{N}(x \mid \mu, \lambda^{-1})$
 Assume both mean and precision to be unknown. The likelihood can be written as
$$p(\boldsymbol{x} \mid \mu, \lambda) = \prod_{n=1}^N \mathcal{N}(x_n \mid \mu, \lambda^{-1}) \propto \lambda^{N/2} \exp\!\left[-\frac{\lambda}{2}\sum_{n=1}^N (x_n - \mu)^2\right]$$

 We would like a jointly conjugate prior distribution $p(\mu, \lambda)$

 It must have the same form as the likelihood written above; basically, something that looks like a product of a Gaussian in $\mu$ (whose precision depends on $\lambda$) and a gamma in $\lambda$

 Thankfully, this is a known distribution: the normal-gamma (NG) distribution, called so since it can be written as a product of a normal and a gamma. The NG also has a multivariate version, the normal-Wishart distribution, used to jointly model a real-valued vector and a PSD matrix
6
Detour: Normal-gamma (Gaussian-gamma) Distribution
 We saw that the conjugate prior needed to have the form (assuming the shape-rate parametrization of the gamma)
$$\text{NG}(\mu, \lambda \mid \mu_0, \kappa_0, a, b) = \mathcal{N}\big(\mu \mid \mu_0, (\kappa_0\lambda)^{-1}\big)\, \text{Gamma}(\lambda \mid a, b)$$

 The above is a product of a normal and a gamma distribution

 The NG is conjugate to a Gaussian distribution if both its mean and precision parameters are unknown and are to be estimated

 Thus it is a useful prior in many problems involving Gaussians with unknown mean and precision
7
Bayesian Inference for Both Parameters of a Gaussian
 Due to conjugacy, the joint posterior will also be normal-gamma (skipping all hyperparameters on the conditioning side)
$$p(\mu, \lambda \mid \boldsymbol{x}) = \text{NG}(\mu, \lambda \mid \mu_N, \kappa_N, a_N, b_N)$$

 Plugging in the expressions for the likelihood and the NG prior, and matching terms, the posterior's parameters will be (with $\bar{x} = \frac{1}{N}\sum_n x_n$)
$$\mu_N = \frac{\kappa_0\mu_0 + N\bar{x}}{\kappa_0 + N}, \quad \kappa_N = \kappa_0 + N, \quad a_N = a + \frac{N}{2}, \quad b_N = b + \frac{1}{2}\sum_{n=1}^N (x_n - \bar{x})^2 + \frac{\kappa_0 N (\bar{x} - \mu_0)^2}{2(\kappa_0 + N)}$$

(For the full derivation of the posterior, refer to "Conjugate Bayesian analysis of the Gaussian distribution" - Murphy (2007))
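A small sketch, assuming the shape-rate parametrization used above, that applies these standard normal-gamma updates; the variable names (mu0, kappa0, a0, b0) and the toy data are illustrative:

# Normal-gamma posterior update for a Gaussian with unknown mean AND precision.
# Prior: NG(mu, lam | mu0, kappa0, a0, b0) = N(mu | mu0, 1/(kappa0*lam)) * Gamma(lam | a0, b0)
import numpy as np

def ng_posterior(x, mu0=0.0, kappa0=1.0, a0=1.0, b0=1.0):
    N, xbar = len(x), np.mean(x)
    muN    = (kappa0 * mu0 + N * xbar) / (kappa0 + N)
    kappaN = kappa0 + N
    aN     = a0 + N / 2
    bN     = (b0 + 0.5 * np.sum((x - xbar) ** 2)
                 + 0.5 * kappa0 * N * (xbar - mu0) ** 2 / (kappa0 + N))
    return muN, kappaN, aN, bN

rng = np.random.default_rng(1)
x = rng.normal(3.0, 0.5, size=1000)                 # true mean 3.0, true precision 4.0
muN, kappaN, aN, bN = ng_posterior(x)
print("posterior mean of mu     :", muN)            # close to 3.0
print("posterior mean of lambda :", aN / bN)        # close to 4.0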
8
Other Quantities of Interest
 We saw that the joint posterior for the mean and precision is NG

 From the above, we can also obtain the marginal posteriors for $\mu$ and $\lambda$

 Marginal likelihood of the model: it has a closed-form expression (for a conjugate likelihood and prior, the marginal likelihood has a closed form; more on this when we see exponential-family distributions)

 PPD of a new observation

(For the full derivations, refer to "Conjugate Bayesian analysis of the Gaussian distribution" - Murphy (2007))
9
Student-t distribution
 An infinite "sum" (more precisely, a mixture) of Gaussian distributions with the same mean but different precisions; same as saying that we are integrating out the precision parameter of a Gaussian with the mean held fixed:
$$\text{St}(x \mid \mu, \sigma^2, \nu) \propto \left[1 + \frac{(x - \mu)^2}{\nu\sigma^2}\right]^{-(\nu + 1)/2}$$

 $\nu$ is called the degrees of freedom, $\mu$ is the mean, and $\sigma$ is the scale. As $\nu$ tends to infinity, a Student-t becomes a Gaussian

 Has fatter tails than a Gaussian and is sharper around the mean

 Zero-mean Student-t (and other such "infinite mixtures of Gaussians") are useful priors for modeling sparse weights
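A small sketch of this mixture view: drawing the precision from a gamma and then the observation from a Gaussian yields Student-t samples. The mapping to (dof = 2a, scale^2 = b/a) is the standard compound construction; the hyperparameter values are illustrative:

# With lam ~ Gamma(a, b) (shape-rate) and x | lam ~ N(mu, 1/lam),
# marginally x ~ St(mu, scale^2 = b/a, dof = 2a).
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
mu, a, b = 0.0, 2.0, 3.0
lam = rng.gamma(shape=a, scale=1.0 / b, size=200_000)      # one precision per sample
x = rng.normal(mu, 1.0 / np.sqrt(lam))                     # one Gaussian draw per precision

t_marginal = stats.t(df=2 * a, loc=mu, scale=np.sqrt(b / a))
print(stats.kstest(x, t_marginal.cdf))                     # samples match the Student-t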
10
Inferring Params of Gaussian: Some Other Cases
 We only considered parameter estimation for the univariate Gaussian distribution

 The approach also extends to inferring parameters of a multivariate Gaussian

 Posterior updates have forms similar to those in the univariate case

 Conjugate priors exist for the multivariate Gaussian case as well
 Multivariate Gaussian distribution as conjugate prior for the mean vector of the multivariate Gaussian
 Wishart (or inverse Wishart) distribution as conjugate prior for the precision (or covariance) matrix
11
Background: Linear Gaussian Models
 Consider a noisy linear transformation of a random variable $\boldsymbol{z}$ with prior $p(\boldsymbol{z}) = \mathcal{N}(\boldsymbol{z} \mid \boldsymbol{\mu}, \boldsymbol{\Lambda}^{-1})$:
$$\boldsymbol{x} = \boldsymbol{A}\boldsymbol{z} + \boldsymbol{b} + \boldsymbol{\epsilon}$$
Both $\boldsymbol{x}$ and $\boldsymbol{z}$ are vectors (can be of different sizes). The noise vector $\boldsymbol{\epsilon}$ is drawn independently from $\mathcal{N}(\boldsymbol{0}, \boldsymbol{L}^{-1})$. Also assume $\boldsymbol{A}$, $\boldsymbol{b}$, $\boldsymbol{L}$ to be known; only $\boldsymbol{z}$ is unknown

 Easy to see that, conditioned on $\boldsymbol{z}$, $\boldsymbol{x}$ too has a Gaussian distribution
$$p(\boldsymbol{x} \mid \boldsymbol{z}) = \mathcal{N}(\boldsymbol{x} \mid \boldsymbol{A}\boldsymbol{z} + \boldsymbol{b}, \boldsymbol{L}^{-1})$$

 Assume $\mathcal{N}(\boldsymbol{z} \mid \boldsymbol{\mu}, \boldsymbol{\Lambda}^{-1})$ as the prior and $\mathcal{N}(\boldsymbol{x} \mid \boldsymbol{A}\boldsymbol{z} + \boldsymbol{b}, \boldsymbol{L}^{-1})$ as the likelihood, and define $\boldsymbol{\Sigma} = (\boldsymbol{\Lambda} + \boldsymbol{A}^\top\boldsymbol{L}\boldsymbol{A})^{-1}$. Then

Posterior of $\boldsymbol{z}$:
$$p(\boldsymbol{z} \mid \boldsymbol{x}) = \frac{p(\boldsymbol{x} \mid \boldsymbol{z})\, p(\boldsymbol{z})}{p(\boldsymbol{x})} = \mathcal{N}\big(\boldsymbol{z} \mid \boldsymbol{\Sigma}(\boldsymbol{A}^\top\boldsymbol{L}(\boldsymbol{x} - \boldsymbol{b}) + \boldsymbol{\Lambda}\boldsymbol{\mu}),\, \boldsymbol{\Sigma}\big)$$

Marginal likelihood:
$$p(\boldsymbol{x}) = \int p(\boldsymbol{x} \mid \boldsymbol{z})\, p(\boldsymbol{z})\, d\boldsymbol{z} = \mathcal{N}(\boldsymbol{x} \mid \boldsymbol{A}\boldsymbol{\mu} + \boldsymbol{b},\, \boldsymbol{L}^{-1} + \boldsymbol{A}\boldsymbol{\Lambda}^{-1}\boldsymbol{A}^\top)$$

 Very useful results (PRML Chap. 2 contains a proof): closed-form expressions for the posterior and the marginal likelihood (and both are Gaussian)
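A minimal numpy/scipy sketch of these linear Gaussian model results; the dimensions and the particular A, b, L, Lambda below are arbitrary illustrative choices:

# Posterior and marginal likelihood for a linear Gaussian model:
#   prior      p(z) = N(z | m, Lambda^{-1})
#   likelihood p(x | z) = N(x | A z + b, L^{-1})
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
Dz, Dx = 3, 5
A, b = rng.normal(size=(Dx, Dz)), rng.normal(size=Dx)
m, Lambda = np.zeros(Dz), 2.0 * np.eye(Dz)                  # prior mean/precision of z
L = 4.0 * np.eye(Dx)                                        # noise precision

z_true = rng.normal(size=Dz)
x = A @ z_true + b + rng.multivariate_normal(np.zeros(Dx), np.linalg.inv(L))

Sigma = np.linalg.inv(Lambda + A.T @ L @ A)                 # posterior covariance
mean_z = Sigma @ (A.T @ L @ (x - b) + Lambda @ m)           # posterior mean
print("posterior mean of z:", mean_z, " (true z:", z_true, ")")

marg = stats.multivariate_normal(A @ m + b, np.linalg.inv(L) + A @ np.linalg.inv(Lambda) @ A.T)
print("log marginal likelihood log p(x):", marg.logpdf(x))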
12
(Probabilistic/Bayesian) Linear Regression
 Assume training data $\{(\boldsymbol{x}_n, y_n)\}_{n=1}^N$, with features $\boldsymbol{x}_n \in \mathbb{R}^D$ and real-valued responses $y_n$

 Assume each $y_n$ is generated by a noisy linear model with weights $\boldsymbol{w}$ (each weight assumed real-valued):
$$y_n = \boldsymbol{w}^\top\boldsymbol{x}_n + \epsilon_n$$
with Gaussian noise $\epsilon_n$ drawn from $\mathcal{N}(0, \beta^{-1})$; other noise models are also possible (like Laplace)

 Notation alert: $\beta$ is the precision of the Gaussian noise (and $\beta^{-1}$ the variance)

Likelihood model:
$$p(y_n \mid \boldsymbol{x}_n, \boldsymbol{w}) = \mathcal{N}(y_n \mid \boldsymbol{w}^\top\boldsymbol{x}_n, \beta^{-1})$$

The line $\boldsymbol{w}^\top\boldsymbol{x}$ represents the mean of the output random variable; the zero-mean Gaussian noise perturbs the output $y$ from its mean. Thus the NLL is like the squared loss
13
Probabilistic Linear Regression
 For all the training data, we can write the above model in matrix-vector notation:
$$p(\boldsymbol{y} \mid \boldsymbol{X}, \boldsymbol{w}) = \mathcal{N}(\boldsymbol{y} \mid \boldsymbol{X}\boldsymbol{w}, \beta^{-1}\boldsymbol{I}_N)$$
where $\boldsymbol{y}$ is the $N \times 1$ response vector, $\boldsymbol{X}$ is the $N \times D$ input matrix, and $\boldsymbol{\epsilon}$ is the $N \times 1$ noise vector. Same as writing $\boldsymbol{y} = \boldsymbol{X}\boldsymbol{w} + \boldsymbol{\epsilon}$

 This is an example of a linear Gaussian model with $\boldsymbol{w}$ being the unknown

 A simple plate notation would look like this (plate diagram not reproduced here)
14
Prior on weights
 Assume a zero-mean Gaussian prior on $\boldsymbol{w}$ (may also use a non-zero-mean Gaussian prior, e.g., if we expect the weights to be close to some value):
$$p(\boldsymbol{w}) = \prod_{d=1}^{D} p(w_d) = \prod_{d=1}^{D} \mathcal{N}(w_d \mid 0, \lambda^{-1}) = \mathcal{N}(\boldsymbol{w} \mid \boldsymbol{0}, \lambda^{-1}\boldsymbol{I}_D) \propto \exp\!\left[-\frac{\lambda}{2}\boldsymbol{w}^\top\boldsymbol{w}\right]$$

 This prior assumes that, a priori, each weight has a small value (close to zero). The prior's precision $\lambda$ controls the uncertainty around our prior belief about the value of $\boldsymbol{w}$: a large $\lambda$ means a more aggressive push of $\boldsymbol{w}$ towards the prior mean (0)

 Can even use different $\lambda_d$'s for different $w_d$'s; useful in sparse modeling (later). In the zero-mean case, $\lambda_d^{-1}$ sort of denotes each feature's importance (think why?). Can also use a full covariance matrix for the prior to impose a priori correlations among different weights

 The prior's hyperparameters ($\lambda$, $\beta$) etc. can be learned as well, using point estimation (e.g., MLE-II) or fully Bayesian inference

 Zero-mean Gaussian prior corresponds to the $\ell_2$ regularizer. Reason: the negative log prior is $\frac{\lambda}{2}\boldsymbol{w}^\top\boldsymbol{w}$ plus a constant (see the small check below)
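A tiny sketch (not from the slides) checking this correspondence numerically; lam and D are arbitrary illustrative values:

# Check that -log N(w | 0, (1/lam) I) equals the l2 regularizer (lam/2)||w||^2 plus a constant.
import numpy as np
from scipy import stats

lam, D = 3.0, 4
w = np.random.default_rng(4).normal(size=D)
neg_log_prior = -stats.multivariate_normal(np.zeros(D), np.eye(D) / lam).logpdf(w)
const = 0.5 * D * np.log(2 * np.pi / lam)                  # normalization constant of the prior
print(neg_log_prior, "==", 0.5 * lam * w @ w + const)      # identical up to float error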
15
The Posterior
(Will only look at fully Bayesian inference; for MLE/MAP, refer to the CS771 slides or the book)

 The posterior over $\boldsymbol{w}$ (for now, assume the hyperparams $\beta$ and $\lambda$ to be known):
$$p(\boldsymbol{w} \mid \boldsymbol{X}, \boldsymbol{y}) = \frac{p(\boldsymbol{y} \mid \boldsymbol{X}, \boldsymbol{w})\, p(\boldsymbol{w})}{p(\boldsymbol{y} \mid \boldsymbol{X})}$$
The denominator is the marginal likelihood for this regression model. Note that it is conditioned on $\boldsymbol{X}$ too, which is assumed given and not being modeled. The posterior must be a Gaussian due to conjugacy. Note that $\beta$ and $\lambda$ can also be learned under the probabilistic set-up (though assumed fixed as of now)

 Using the "completing the squares" trick (or the linear Gaussian model results):
$$p(\boldsymbol{w} \mid \boldsymbol{X}, \boldsymbol{y}) = \mathcal{N}(\boldsymbol{w} \mid \boldsymbol{\mu}_N, \boldsymbol{\Sigma}_N), \quad \boldsymbol{\Sigma}_N = (\lambda\boldsymbol{I}_D + \beta\boldsymbol{X}^\top\boldsymbol{X})^{-1}, \quad \boldsymbol{\mu}_N = \beta\,\boldsymbol{\Sigma}_N\boldsymbol{X}^\top\boldsymbol{y}$$
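A minimal numpy sketch of this posterior computation with fixed hyperparameters; the synthetic data and the values of beta and lam are illustrative assumptions:

# Posterior over w for probabilistic linear regression with prior N(w | 0, (1/lam) I)
# and Gaussian noise precision beta.
import numpy as np

rng = np.random.default_rng(5)
N, D, beta, lam = 50, 3, 25.0, 1.0            # beta = noise precision, lam = prior precision
w_true = np.array([0.5, -1.0, 2.0])
X = rng.normal(size=(N, D))
y = X @ w_true + rng.normal(0, 1 / np.sqrt(beta), size=N)

Sigma_N = np.linalg.inv(lam * np.eye(D) + beta * X.T @ X)   # posterior covariance
mu_N = beta * Sigma_N @ X.T @ y                             # posterior mean
print("posterior mean mu_N:", mu_N)                         # close to w_true
print("posterior covariance diag:", np.diag(Sigma_N))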
16
The Posterior: A Visualization
 Assume a linear regression problem with a known true weight vector $\boldsymbol{w}$ used to generate the data

 Assume data generated by a linear regression model $y = w_0 + w_1 x + \epsilon$

 Note: It's actually 1-D regression ($w_0$ is just a bias term), or 2-D regression with feature vector $[1, x]$

 Figures below show the "data space" and the posterior of $\boldsymbol{w}$ for different numbers of observations (note: with no observations, the posterior = prior). Each red line represents the "data" generated for a $\boldsymbol{w}$ randomly drawn from the current posterior (figures not reproduced here)
17
Posterior Predictive Distribution
 To get the prediction for a new input $\boldsymbol{x}_*$, we can compute its PPD:
$$p(y_* \mid \boldsymbol{x}_*, \boldsymbol{X}, \boldsymbol{y}) = \int p(y_* \mid \boldsymbol{x}_*, \boldsymbol{w})\, p(\boldsymbol{w} \mid \boldsymbol{X}, \boldsymbol{y})\, d\boldsymbol{w}$$
Only $\boldsymbol{w}$ is unknown, with posterior distribution $\mathcal{N}(\boldsymbol{w} \mid \boldsymbol{\mu}_N, \boldsymbol{\Sigma}_N)$, so only $\boldsymbol{w}$ has to be integrated out of the likelihood $\mathcal{N}(y_* \mid \boldsymbol{w}^\top\boldsymbol{x}_*, \beta^{-1})$

 The above is the marginalization of $\boldsymbol{w}$. Using the Gaussian results:
$$p(y_* \mid \boldsymbol{x}_*, \boldsymbol{X}, \boldsymbol{y}) = \mathcal{N}\big(y_* \mid \boldsymbol{\mu}_N^\top\boldsymbol{x}_*,\; \beta^{-1} + \boldsymbol{x}_*^\top\boldsymbol{\Sigma}_N\boldsymbol{x}_*\big)$$
Can also derive it by writing $y_* = \boldsymbol{w}^\top\boldsymbol{x}_* + \epsilon$ where $\boldsymbol{w} \sim \mathcal{N}(\boldsymbol{\mu}_N, \boldsymbol{\Sigma}_N)$ and $\epsilon \sim \mathcal{N}(0, \beta^{-1})$

 So we have a predictive mean as well as an input-specific predictive variance

 In contrast, MLE and MAP make "plug-in" predictions (using the point estimate of $\boldsymbol{w}$). Since the PPD also takes into account the uncertainty in $\boldsymbol{w}$, its predictive variance is larger
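A small sketch of the PPD formula above; it assumes mu_N and Sigma_N have already been computed as on the previous slide (the toy values below are placeholders, not real posterior quantities):

# Posterior predictive mean and variance: N(y* | mu_N^T x*, 1/beta + x*^T Sigma_N x*)
import numpy as np

def ppd(x_star, mu_N, Sigma_N, beta):
    """Predictive mean and variance for a single test input x_star."""
    mean = mu_N @ x_star
    var = 1.0 / beta + x_star @ Sigma_N @ x_star     # noise variance + posterior uncertainty in w
    return mean, var

# Example usage with toy quantities (in practice mu_N, Sigma_N come from the posterior):
D, beta = 3, 25.0
mu_N, Sigma_N = np.zeros(D), np.eye(D) / 10.0
print(ppd(np.array([1.0, 0.5, -0.5]), mu_N, Sigma_N, beta))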
18
Posterior Predictive Distribution: An Illustration
 Black dots are training examples

 Width of the shaded region at any input $x$ denotes the predictive uncertainty at that $x$ (+/- one std-dev)

 Regions with more training examples have smaller predictive variance
19
Nonlinear Regression

 Can extend the linear regression model to handle nonlinear regression problems

 One way is to replace the feature vectors $\boldsymbol{x}_n$ by a nonlinear mapping $\phi(\boldsymbol{x}_n)$; the mapping can be pre-defined or extracted by a pretrained deep neural net (see the sketch below)

 Alternatively, a kernel function can be used to implicitly define the nonlinear mapping
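A minimal sketch of the first option (a pre-defined feature map), here polynomial features; the degree, hyperparameters, and toy data are illustrative assumptions:

# Bayesian linear regression on a polynomial feature map phi(x) = [1, x, x^2, ...]
import numpy as np

rng = np.random.default_rng(6)
beta, lam, degree = 25.0, 1.0, 3
x = rng.uniform(-1, 1, size=40)
y = np.sin(3 * x) + rng.normal(0, 1 / np.sqrt(beta), size=x.shape)

def phi(x, degree):                                   # feature map x -> [1, x, x^2, ...]
    return np.vander(x, degree + 1, increasing=True)

Phi = phi(x, degree)
Sigma_N = np.linalg.inv(lam * np.eye(Phi.shape[1]) + beta * Phi.T @ Phi)
mu_N = beta * Sigma_N @ Phi.T @ y

x_test = np.linspace(-1, 1, 5)
Phi_test = phi(x_test, degree)
pred_mean = Phi_test @ mu_N
pred_var = 1 / beta + np.sum(Phi_test @ Sigma_N * Phi_test, axis=1)   # diag of Phi* Sigma_N Phi*^T
print(np.c_[x_test, pred_mean, np.sqrt(pred_var)])    # predictive mean and std-dev per test input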
20
More on Visualization of Uncertainty
 Figures below: the green curve is the true function and the blue circles are the observations

 Posterior of the nonlinear regression model: some curves drawn from the posterior are shown
21
Hyperparameters
 The probabilistic linear regression model we saw had two hyperparams ($\beta$, $\lambda$)

 Thus, in total, three unknowns ($\boldsymbol{w}$, $\beta$, $\lambda$). Being fully Bayesian would need a posterior over all 3 unknowns, and the PPD would require integrating out all 3 unknowns

 Posterior and PPD computation is intractable in this case. Several ways to address this

 MLE-II for ($\beta$, $\lambda$): $(\hat{\beta}, \hat{\lambda}) = \arg\max_{\beta, \lambda}\, p(\boldsymbol{y} \mid \boldsymbol{X}, \beta, \lambda)$. Use these to infer the posterior of $\boldsymbol{w}$ and the PPD
22
MLE-II
(For any model where the hyperparams are estimated by MLE-II, the posterior and PPD are approximated in a similar fashion)

 For the probabilistic linear regression model, the overall posterior over all the unknowns is $p(\boldsymbol{w}, \beta, \lambda \mid \boldsymbol{X}, \boldsymbol{y})$

 With the MLE-II approximation of ($\beta$, $\lambda$), the posterior over ($\beta$, $\lambda$) is treated as a point mass at $(\hat{\beta}, \hat{\lambda})$:
$$p(\boldsymbol{w}, \beta, \lambda \mid \boldsymbol{X}, \boldsymbol{y}) \approx p(\boldsymbol{w} \mid \boldsymbol{X}, \boldsymbol{y}, \hat{\beta}, \hat{\lambda})\, \delta_{\hat{\beta}}(\beta)\, \delta_{\hat{\lambda}}(\lambda)$$
Same as the posterior of $\boldsymbol{w}$ with the hyperparameters fixed at $(\hat{\beta}, \hat{\lambda})$

 Likewise, the PPD will be approximated as
$$p(y_* \mid \boldsymbol{x}_*, \boldsymbol{X}, \boldsymbol{y}) \approx \int p(y_* \mid \boldsymbol{x}_*, \boldsymbol{w}, \hat{\beta})\, p(\boldsymbol{w} \mid \boldsymbol{X}, \boldsymbol{y}, \hat{\beta}, \hat{\lambda})\, d\boldsymbol{w}$$
Same form for the PPD as in the case of fixed hyperparams; only need to integrate over $\boldsymbol{w}$, since the other two are fixed at their MLE-II solutions
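A minimal sketch of MLE-II for this model via a coarse grid search over (beta, lambda), using the closed-form marginal likelihood $p(\boldsymbol{y} \mid \boldsymbol{X}, \beta, \lambda) = \mathcal{N}(\boldsymbol{y} \mid \boldsymbol{0}, \beta^{-1}\boldsymbol{I}_N + \lambda^{-1}\boldsymbol{X}\boldsymbol{X}^\top)$; the grid and toy data are illustrative, and in practice one would use gradient-based optimization:

# MLE-II (evidence maximization) for the noise precision beta and prior precision lambda.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
N, D = 100, 3
X = rng.normal(size=(N, D))
w_true = rng.normal(size=D)
y = X @ w_true + rng.normal(0, 0.2, size=N)           # true noise precision beta = 25

def log_marg_lik(beta, lam):
    cov = np.eye(N) / beta + X @ X.T / lam            # covariance of y with w integrated out
    return stats.multivariate_normal(np.zeros(N), cov).logpdf(y)

grid = np.logspace(-2, 3, 30)
beta_hat, lam_hat = max(((b, l) for b in grid for l in grid),
                        key=lambda bl: log_marg_lik(*bl))
print("MLE-II estimates: beta ~", beta_hat, ", lambda ~", lam_hat)
# These estimates plug into the usual posterior/PPD formulas in place of the fixed hyperparameters.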
