CS772 Lec5
Regression
Bayesian Inference for Variance of a Univariate Gaussian
Consider i.i.d. scalar observations $x_1, \ldots, x_N$ drawn from a Gaussian with mean $\mu$ and precision $\lambda$ (so $\lambda^{-1}$ is the variance):

$$p(x_n \mid \mu, \lambda) = \mathcal{N}(x_n \mid \mu, \lambda^{-1}) = \sqrt{\frac{\lambda}{2\pi}} \exp\left[-\frac{\lambda}{2}(x_n - \mu)^2\right]$$

If we parametrize by the variance, an inverse-gamma distribution (with shape and scale hyperparameters) is the conjugate prior on the variance.
If the mean is known, a Gamma prior on the precision is a conjugate prior to the Gaussian likelihood; its shape and rate parameters are the hyperparameters (note: the mean of a Gamma distribution equals shape/rate). We assume the shape-rate parametrization of the gamma throughout.
Note: Unlike the case of unknown mean and fixed variance, the PPD for this case (and also the unknown variance case) will not be a Gaussian.
When both the mean and the precision are unknown, the conjugate prior is a product of a normal and a gamma distribution (a Normal-Gamma distribution).
(For full derivation of the posterior, refer to "Conjugate Bayesian analysis of the Gaussian distribution", Murphy (2007))
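For the known-mean case, the conjugate update has a simple closed form. Below is a minimal Python sketch (not from the slides); the hyperparameter names a0, b0 and the toy data are made up for illustration:

import numpy as np

def gamma_posterior_precision(x, mu, a0, b0):
    """Conjugate update for the precision of a Gaussian with known mean mu.

    Prior:     lambda ~ Gamma(a0, b0)                       (shape-rate parametrization)
    Posterior: lambda ~ Gamma(a0 + N/2, b0 + 0.5 * sum((x_n - mu)^2))
    """
    x = np.asarray(x)
    aN = a0 + x.size / 2.0
    bN = b0 + 0.5 * np.sum((x - mu) ** 2)
    return aN, bN

# Example: data from N(mu=2, precision=4), i.e., standard deviation 0.5
rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=0.5, size=100)
aN, bN = gamma_posterior_precision(x, mu=2.0, a0=1.0, b0=1.0)
print("posterior mean of precision:", aN / bN)   # should be close to 4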
Other Quantities of Interest
We saw that the joint posterior for the mean and precision is Normal-Gamma (NG).
From the above, we can also obtain the marginal posteriors for $\mu$ and $\lambda$.
(For full derivation of the posterior, refer to "Conjugate Bayesian analysis of the Gaussian distribution", Murphy (2007))
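For reference, a hedged sketch of what those marginals look like, using the standard Normal-Gamma parametrization $(\mu_N, \kappa_N, a_N, b_N)$ from Murphy (2007) rather than the slides' notation:

$$p(\lambda \mid \mathbf{x}) = \mathrm{Gamma}(\lambda \mid a_N, b_N), \qquad p(\mu \mid \mathbf{x}) = \mathrm{St}\!\left(\mu \,\middle|\, \mu_N,\ \sigma^2 = \frac{b_N}{a_N \kappa_N},\ \nu = 2a_N\right)$$

So the marginal posterior of the precision is a Gamma, and the marginal posterior of the mean is a Student-t (discussed next).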
Student-t distribution
A Student-t is an infinite sum of Gaussian distributions with the same mean but different precisions. This is the same as saying that we are integrating out the precision parameter of a Gaussian, with the mean held fixed.
Here $\nu$ is called the degrees of freedom, $\mu$ is the mean, and $\sigma$ is the scale. As $\nu$ tends to infinity, the Student-t becomes a Gaussian.
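A minimal Monte Carlo sketch (not from the slides) of this scale-mixture view, assuming the standard construction $\lambda \sim \mathrm{Gamma}(\nu/2, \nu/2)$, $x \mid \lambda \sim \mathcal{N}(\mu, \sigma^2/\lambda)$; SciPy is used only for the comparison:

import numpy as np
from scipy import stats

# Sample lambda ~ Gamma(nu/2, rate=nu/2), then x | lambda ~ N(mu, sigma^2 / lambda).
# The resulting x is Student-t distributed: St(x | mu, sigma, nu).
rng = np.random.default_rng(0)
mu, sigma, nu = 0.0, 1.0, 4.0

lam = rng.gamma(shape=nu / 2.0, scale=2.0 / nu, size=100_000)  # scale = 1/rate
x = rng.normal(loc=mu, scale=sigma / np.sqrt(lam))

# Compare empirical quantiles with SciPy's Student-t (loc/scale parametrization)
print(np.quantile(x, [0.1, 0.5, 0.9]))
print(stats.t(df=nu, loc=mu, scale=sigma).ppf([0.1, 0.5, 0.9]))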
Inferring Params of Gaussian: Some Other Cases
So far, we only considered parameter estimation for the univariate Gaussian distribution. A very useful related case is the linear Gaussian model: assume a Gaussian prior $p(\mathbf{z}) = \mathcal{N}(\mathbf{z} \mid \boldsymbol{\mu}, \boldsymbol{\Lambda}^{-1})$ and an observation $\mathbf{x}$ whose mean depends linearly on $\mathbf{z}$. It is easy to see that, conditioned on $\mathbf{z}$, $\mathbf{x}$ too has a Gaussian distribution:

$$p(\mathbf{x} \mid \mathbf{z}) = \mathcal{N}(\mathbf{x} \mid \mathbf{A}\mathbf{z} + \mathbf{b}, \mathbf{L}^{-1})$$

Posterior of $\mathbf{z}$:

$$p(\mathbf{z} \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid \mathbf{z})\, p(\mathbf{z})}{p(\mathbf{x})} = \mathcal{N}\big(\mathbf{z} \mid \boldsymbol{\Sigma}(\mathbf{A}^\top \mathbf{L}(\mathbf{x} - \mathbf{b}) + \boldsymbol{\Lambda}\boldsymbol{\mu}),\ \boldsymbol{\Sigma}\big), \qquad \boldsymbol{\Sigma} = (\boldsymbol{\Lambda} + \mathbf{A}^\top \mathbf{L}\mathbf{A})^{-1}$$

Marginal likelihood:

$$p(\mathbf{x}) = \int p(\mathbf{x} \mid \mathbf{z})\, p(\mathbf{z})\, d\mathbf{z} = \mathcal{N}(\mathbf{x} \mid \mathbf{A}\boldsymbol{\mu} + \mathbf{b},\ \mathbf{L}^{-1} + \mathbf{A}\boldsymbol{\Lambda}^{-1}\mathbf{A}^\top)$$

Closed-form expressions for both the posterior and the marginal likelihood, and both are Gaussian. Very useful results (PRML Chap. 2 contains a proof).
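A small numerical sketch (not from the slides) of evaluating these closed-form expressions; the function name and toy dimensions are made up:

import numpy as np

def linear_gaussian_posterior(A, b, L, mu, Lam, x):
    """Posterior p(z|x) and marginal p(x) for the linear Gaussian model
    p(z) = N(mu, Lam^{-1}), p(x|z) = N(Az + b, L^{-1})  (PRML, Sec. 2.3.3)."""
    Sigma = np.linalg.inv(Lam + A.T @ L @ A)           # posterior covariance
    m = Sigma @ (A.T @ L @ (x - b) + Lam @ mu)         # posterior mean
    marg_mean = A @ mu + b                             # marginal-likelihood mean
    marg_cov = np.linalg.inv(L) + A @ np.linalg.inv(Lam) @ A.T
    return m, Sigma, marg_mean, marg_cov

# Tiny example with made-up dimensions (dim(z) = 2, dim(x) = 3)
rng = np.random.default_rng(0)
A = rng.normal(size=(3, 2)); b = np.zeros(3)
L = 2.0 * np.eye(3)                    # likelihood precision
mu = np.zeros(2); Lam = np.eye(2)      # prior mean and precision
x = rng.normal(size=3)
print(linear_gaussian_posterior(A, b, L, mu, Lam, x))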
(Probabilistic/Bayesian) Linear Regression
Assume training data $\{(\mathbf{x}_n, y_n)\}_{n=1}^N$, with features $\mathbf{x}_n$ and responses $y_n$. Assume each $y_n$ is generated by a noisy linear model with weights $\mathbf{w}$ (each weight assumed real-valued):

$$y_n = \mathbf{w}^\top \mathbf{x}_n + \epsilon_n$$

with $\epsilon_n$ drawn from zero-mean Gaussian noise, $\epsilon_n \sim \mathcal{N}(0, \beta^{-1})$. Other noise models are also possible (like Laplace).

Notation alert: $\beta$ is the precision of the Gaussian noise (and $\beta^{-1}$ the variance).

Likelihood model:

$$p(y_n \mid \mathbf{x}_n, \mathbf{w}) = \mathcal{N}(y_n \mid \mathbf{w}^\top \mathbf{x}_n, \beta^{-1})$$

The line $\mathbf{w}^\top \mathbf{x}_n$ represents the mean of the output random variable; the zero-mean Gaussian noise perturbs the output $y$ from its mean. Thus the NLL is like the squared loss.
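A one-function sketch (not from the slides) making the NLL/squared-loss connection concrete:

import numpy as np

# The Gaussian negative log-likelihood of one observation is
# beta/2 * (y - w^T x)^2 plus a w-independent constant, i.e., a squared loss.
def gaussian_nll(y, x, w, beta):
    mean = w @ x
    return 0.5 * beta * (y - mean) ** 2 - 0.5 * np.log(beta / (2 * np.pi))

w = np.array([1.0, -2.0]); x = np.array([0.5, 1.5]); y = 0.3; beta = 4.0
print(gaussian_nll(y, x, w, beta))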
Probabilistic Linear Regression
For all the training data, we can write the above model in matrix-vector notation, $p(\mathbf{y} \mid \mathbf{X}, \mathbf{w}) = \mathcal{N}(\mathbf{y} \mid \mathbf{X}\mathbf{w}, \beta^{-1}\mathbf{I}_N)$, where $\mathbf{y}$ is the $N \times 1$ response vector, $\mathbf{X}$ is the $N \times D$ input matrix, and $\boldsymbol{\epsilon}$ is the $N \times 1$ noise vector. Same as writing

$$\mathbf{y} = \mathbf{X}\mathbf{w} + \boldsymbol{\epsilon}$$

This is an example of a linear Gaussian model, with $\mathbf{w}$ being the unknown.
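A minimal generative sketch (not from the slides); $N$, $D$, the true weights, and $\beta$ are made-up values:

import numpy as np

# Generate data from y = Xw + eps with Gaussian noise of precision beta.
rng = np.random.default_rng(0)
N, D, beta = 50, 3, 25.0
X = rng.normal(size=(N, D))
w_true = np.array([0.5, -1.0, 2.0])
eps = rng.normal(scale=1.0 / np.sqrt(beta), size=N)   # noise variance = 1/beta
y = X @ w_true + eps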
A simple plate notation would look like this
Prior on weights
Assume a zero-mean Gaussian prior on $\mathbf{w}$:

$$p(\mathbf{w}) = \prod_{d=1}^{D} p(w_d) = \prod_{d=1}^{D} \mathcal{N}(w_d \mid 0, \lambda^{-1}) = \mathcal{N}(\mathbf{w} \mid \mathbf{0}, \lambda^{-1}\mathbf{I}_D) = \left(\frac{\lambda}{2\pi}\right)^{D/2} \exp\left[-\frac{\lambda}{2}\mathbf{w}^\top\mathbf{w}\right]$$

This prior assumes that a priori each weight has a small value (close to zero). We may also use a non-zero-mean Gaussian prior, e.g., if we expect the weights to be close to some other value.
$\lambda$ controls the uncertainty around our prior belief about the value of $\mathbf{w}$. The prior's precision $\lambda$ controls how aggressively the prior pushes $\mathbf{w}$ towards its mean (0); a large $\lambda$ means a more aggressive push (reason: look at the negative log of the prior). The zero-mean Gaussian prior corresponds to an $\ell_2$ regularizer.
We can even use a different $\lambda_d$ for each $w_d$, which is useful in sparse modeling (later); in the zero-mean case, $\lambda_d^{-1}$ sort of denotes each feature's importance (think why?). We can also use a full covariance matrix for the prior to impose a priori correlations among the different weights.
The hyperparameters ($\lambda$, $\beta$) etc. can be learned as well, using point estimation (e.g., MLE-II) or fully Bayesian inference.
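To make the regularization connection explicit, a one-line check (not spelled out on the slide) using the prior above:

$$-\log p(\mathbf{w}) = \frac{\lambda}{2}\,\mathbf{w}^\top\mathbf{w} + \text{const}$$

which is exactly the $\ell_2$ penalty on $\mathbf{w}$ with regularization strength $\lambda$.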
The Posterior
Will only look at fully Bayesian inference; for MLE/MAP, refer to the CS771 slides or the book.
The posterior over $\mathbf{w}$ (for now, assume the hyperparameters $\lambda$ and $\beta$ to be known):

$$p(\mathbf{w} \mid \mathbf{X}, \mathbf{y}) = \frac{p(\mathbf{y} \mid \mathbf{X}, \mathbf{w})\, p(\mathbf{w})}{p(\mathbf{y} \mid \mathbf{X})}$$

The denominator is the marginal likelihood for this regression model; note that it is conditioned on $\mathbf{X}$ too, which is assumed given and not being modeled. The posterior must be a Gaussian due to conjugacy.
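A sketch of computing that Gaussian posterior under the zero-mean prior above (this closed form is a standard result, e.g., PRML Chap. 3, not derived on this slide; the toy data are made up):

import numpy as np

def posterior_w(X, y, lam, beta):
    """Gaussian posterior N(w | mu_N, Sigma_N) for Bayesian linear regression
    with prior N(w | 0, lam^{-1} I) and likelihood N(y | Xw, beta^{-1} I)."""
    D = X.shape[1]
    Sigma_N = np.linalg.inv(lam * np.eye(D) + beta * X.T @ X)
    mu_N = beta * Sigma_N @ X.T @ y
    return mu_N, Sigma_N

# Toy usage with made-up data (N = 50, D = 3; lam and beta picked arbitrarily)
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([0.5, -1.0, 2.0]) + rng.normal(scale=0.2, size=50)
mu_N, Sigma_N = posterior_w(X, y, lam=1.0, beta=25.0)
print(mu_N)   # should be close to the true weights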
The Posterior: A Visualization
Assume a linear regression problem with a known true $\mathbf{w}$, i.e., data generated by a linear regression model. Note: it's actually 1-D regression (one weight is just a bias term), or equivalently 2-D regression with a constant feature.
Figures below show the "data space" and the posterior of $\mathbf{w}$ for different numbers of observations (note: with no observations, the posterior = prior). Each red line represents the "data" generated for a $\mathbf{w}$ randomly drawn from the current posterior.
Posterior Predictive Distribution
To get the prediction $y_*$ for a new input $\mathbf{x}_*$, we can compute its PPD. Only $\mathbf{w}$ is unknown, with posterior distribution $\mathcal{N}(\mathbf{w} \mid \boldsymbol{\mu}_N, \boldsymbol{\Sigma}_N)$, so only $\mathbf{w}$ has to be integrated out:

$$p(y_* \mid \mathbf{x}_*, \mathbf{X}, \mathbf{y}) = \int \mathcal{N}(y_* \mid \mathbf{w}^\top \mathbf{x}_*, \beta^{-1})\, \mathcal{N}(\mathbf{w} \mid \boldsymbol{\mu}_N, \boldsymbol{\Sigma}_N)\, d\mathbf{w}$$
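Using the Gaussian marginalization result from the linear Gaussian model section, this integral has a closed form (a standard result, stated here for completeness rather than taken from this slide):

$$p(y_* \mid \mathbf{x}_*, \mathbf{X}, \mathbf{y}) = \mathcal{N}\!\left(y_* \,\middle|\, \boldsymbol{\mu}_N^\top \mathbf{x}_*,\ \beta^{-1} + \mathbf{x}_*^\top \boldsymbol{\Sigma}_N \mathbf{x}_*\right)$$

Note that the predictive variance combines the noise variance $\beta^{-1}$ with the posterior uncertainty about $\mathbf{w}$.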
Posterior Predictive Distribution: An Illustration
Black dots are training examples
Hyperparameters
The probabilistic linear regression model we saw had two hyperparameters ($\lambda$ and $\beta$). Thus, there are three unknowns in total ($\mathbf{w}$, $\lambda$, $\beta$).
We would need a posterior over all the 3 unknowns, and the PPD would require integrating out all 3 unknowns.
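Concretely, writing out what that integration would look like (a sketch of the integral, given a joint posterior over all three unknowns):

$$p(y_* \mid \mathbf{x}_*, \mathbf{X}, \mathbf{y}) = \iiint \mathcal{N}(y_* \mid \mathbf{w}^\top \mathbf{x}_*, \beta^{-1})\; p(\mathbf{w}, \lambda, \beta \mid \mathbf{X}, \mathbf{y})\; d\mathbf{w}\, d\lambda\, d\beta$$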