
Bayesian Regression

MLAI: Week 6

Neil D. Lawrence

Department of Computer Science


Sheffield University

3rd November 2015


Outline

Quick Review: Overdetermined Systems

Underdetermined Systems

Bayesian Regression

Univariate Bayesian Linear Regression

Bayesian Polynomials
Two Simultaneous Equations

A system of two simultaneous equations with two unknowns:

y1 = mx1 + c
y2 = mx2 + c

(Figure: data plotted as time in min/km, y, against year, x.)

Subtracting the second equation from the first gives

y1 − y2 = m(x1 − x2),

so

m = (y1 − y2)/(x1 − x2) = (y2 − y1)/(x2 − x1)

and

c = y1 − mx1.

Two Simultaneous Equations

How do we deal with three simultaneous equations with only two unknowns?

y1 = mx1 + c
y2 = mx2 + c
y3 = mx3 + c
Overdetermined System

- With two unknowns and two observations:

  y1 = mx1 + c
  y2 = mx2 + c

- An additional observation leads to an overdetermined system:

  y3 = mx3 + c

- This problem is solved through a noise model ε ∼ N(0, σ²):

  y1 = mx1 + c + ε1
  y2 = mx2 + c + ε2
  y3 = mx3 + c + ε3
Noise Models

- We aren't modelling the entire system.
- The noise model gives the mismatch between model and data.
- The Gaussian model is justified by appeal to the central limit theorem.
- Other models are also possible (Student-t for heavy tails).
- Maximum likelihood with Gaussian noise leads to least squares.
(Figure: the line y = mx + c plotted for x from 0 to 5, with the intercept c and slope m annotated.)

y = mx + c

point 1: x = 1, y = 3
3 = m + c
point 2: x = 3, y = 1
1 = 3m + c
point 3: x = 2, y = 2.5
2.5 = 2m + c
A Philosophical Essay on Probabilities (Laplace), page 6:

"... height: 'The day will come when, by study pursued through several ages, the things now concealed will appear with evidence; and posterity will be astonished that truths so clear had escaped us.' Clairaut then undertook to submit to analysis the perturbations which the comet had experienced by the action of the two great planets, Jupiter and Saturn; after immense calculations he fixed its next passage at the perihelion toward the beginning of April, 1759, which was actually verified by observation. The regularity which astronomy shows us in the movements of the comets doubtless exists also in all phenomena.

The curve described by a simple molecule of air or vapor is regulated in a manner just as certain as the planetary orbits; the only difference between them is that which comes from our ignorance.

Probability is relative, in part to this ignorance, in part to our knowledge. We know that of three or a greater number of events a single one ought to occur; but nothing induces us to believe that one of them will occur rather than the others. In this state of indecision it is impossible for us to announce their occurrence with certainty. It is, however, probable that one of these events, chosen at will, will not occur because we see several cases equally possible which exclude its occurrence, while only a single one favors it.

The theory of chance consists in reducing all the events of the same kind to a certain number of cases equally possible, that is to say, to such as we may be equally undecided about in regard to their existence, and in determining the number of cases favorable to the event whose probability is sought. The ratio of ..."
y = mx + c + ε

point 1: x = 1, y = 3
3 = m + c + ε1
point 2: x = 3, y = 1
1 = 3m + c + ε2
point 3: x = 2, y = 2.5
2.5 = 2m + c + ε3
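As a concrete illustration (a minimal numpy sketch, not part of the original slides), the three noisy equations above can be fitted by least squares, which is the maximum likelihood solution under the Gaussian noise model:

# Sketch: fit m and c to the three points above by least squares.
import numpy as np

x = np.array([1.0, 3.0, 2.0])
y = np.array([3.0, 1.0, 2.5])

# Design matrix: one column for the slope m, one column of ones for the offset c.
Phi = np.column_stack([x, np.ones_like(x)])

# Least squares solution of y ≈ Phi @ [m, c].
(m, c), residual, _, _ = np.linalg.lstsq(Phi, y, rcond=None)
print(m, c)        # best-fit slope and intercept
print(residual)    # sum of squared errors; nonzero, since no exact solution exists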
The Gaussian Density

- Perhaps the most common probability density.

  p(y|µ, σ²) = (1/√(2πσ²)) exp( −(y − µ)²/(2σ²) )
             = N(y|µ, σ²)

- The Gaussian density.

Gaussian Density

(Figure: the Gaussian PDF p(h|µ, σ²) against h, height/m, with µ = 1.7 and variance σ² = 0.0225; the mean is shown as a red line. It could represent the heights of a population of students.)
Gaussian Density

N(y|µ, σ²) = (1/√(2πσ²)) exp( −(y − µ)²/(2σ²) )

σ² is the variance of the density and µ is the mean.
Two Important Gaussian Properties

Sum of Gaussians

- The sum of Gaussian variables is also Gaussian.

  yi ∼ N(µi, σi²)

  And the sum is distributed as

  ∑_{i=1}^n yi ∼ N( ∑_{i=1}^n µi, ∑_{i=1}^n σi² )

  (Aside: as the number of terms in the sum increases, a sum of non-Gaussian, finite variance variables is also Gaussian [central limit theorem].)
Two Important Gaussian Properties

Scaling a Gaussian

- Scaling a Gaussian leads to a Gaussian.

  y ∼ N(µ, σ²)

  And the scaled density is distributed as

  wy ∼ N(wµ, w²σ²)
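A quick numerical check of the two properties (a sketch with arbitrary, assumed means and variances; not from the slides): sums and scalings of Gaussian samples should match the stated means and variances.

# Sketch: empirically verify the sum and scaling properties of Gaussians.
import numpy as np

rng = np.random.default_rng(0)
n_samples = 100_000

mus = np.array([1.0, -2.0, 0.5])       # assumed means mu_i
sigma2s = np.array([0.5, 1.5, 2.0])    # assumed variances sigma_i^2

# Draw y_i ~ N(mu_i, sigma_i^2) and sum over i for each sample.
y = rng.normal(mus, np.sqrt(sigma2s), size=(n_samples, 3))
y_sum = y.sum(axis=1)
print(y_sum.mean(), mus.sum())         # empirical mean vs sum of the means
print(y_sum.var(), sigma2s.sum())      # empirical variance vs sum of the variances

# Scaling: w*y ~ N(w*mu, w^2 * sigma^2).
w = 3.0
scaled = w * rng.normal(1.0, np.sqrt(0.5), size=n_samples)
print(scaled.mean(), w * 1.0)
print(scaled.var(), w**2 * 0.5)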
Outline

Quick Review: Overdetermined Systems

Underdetermined Systems

Bayesian Regression

Univariate Bayesian Linear Regression

Bayesian Polynomials
Underdetermined System

What about two unknowns and one observation?

y1 = mx1 + c

Can compute m given c:

m = (y1 − c)/x1

For example, sampling different intercepts:

c = 1.75 ⟹ m = 1.25
c = −0.777 ⟹ m = 3.78
c = −4.01 ⟹ m = 7.01
c = −0.718 ⟹ m = 3.72
c = 2.45 ⟹ m = 0.545
c = −0.657 ⟹ m = 3.66
c = −3.13 ⟹ m = 6.13
c = −1.47 ⟹ m = 4.47

Assume

c ∼ N(0, 4),

and we find a distribution of solutions.

(Figure: the single observation, with one line y = mx + c through it for each sampled intercept; axes x from 0 to 3, y from 0 to 5.)
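A small sketch of this idea (the observation (x1, y1) = (1, 3) is assumed for illustration): sampling the intercept from its prior induces a distribution over slopes.

# Sketch: one observation, a prior on c, and the implied distribution of slopes.
import numpy as np

rng = np.random.default_rng(1)
x1, y1 = 1.0, 3.0                       # a single, assumed observation

c_samples = rng.normal(0.0, np.sqrt(4.0), size=10)   # c ~ N(0, 4)
m_samples = (y1 - c_samples) / x1                     # slope consistent with each c

for c, m in zip(c_samples, m_samples):
    print(f"c = {c:6.3f}  =>  m = {m:6.3f}")   # each (c, m) pair is a line through (x1, y1)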
Different Types of Uncertainty

- The first type of uncertainty we are assuming is aleatoric uncertainty.
- The second type of uncertainty we are assuming is epistemic uncertainty.
Aleatoric Uncertainty

- This is uncertainty we couldn't know even if we wanted to, e.g. the result of a football match before it's played.
- Where a sheet of paper might land on the floor.
Outline

Quick Review: Overdetermined Systems

Underdetermined Systems

Bayesian Regression

Univariate Bayesian Linear Regression

Bayesian Polynomials
Prior Distribution

- Bayesian inference requires a prior on the parameters.
- The prior represents your belief, before you see the data, about the likely value of the parameters.
- For linear regression, consider a Gaussian prior on the intercept:

  c ∼ N(0, α1)
Posterior Distribution

- The posterior distribution is found by combining the prior with the likelihood.
- The posterior distribution is your belief, after you see the data, about the likely value of the parameters.
- The posterior is found through Bayes' Rule:

  p(c|y) = p(y|c) p(c) / p(y)
Bayes Update

p(c) = N(c|0, α1)

p(y|m, c, x, σ²) = N(y|mx + c, σ²)

p(c|y, m, x, σ²) = N( c | (y − mx)/(1 + σ²/α1), (σ⁻² + α1⁻¹)⁻¹ )

Figure: A Gaussian prior combines with a Gaussian likelihood to give a Gaussian posterior (densities plotted against c).
Stages to Derivation of the Posterior

- Multiply likelihood by prior:
  - they are "exponentiated quadratics", so the answer is always also an exponentiated quadratic, because exp(a²) exp(b²) = exp(a² + b²).
- Complete the square to get the resulting density in the form of a Gaussian.
- Recognise the mean and (co)variance of the Gaussian. This is the estimate of the posterior.
Main Trick

p(c) = (1/√(2πα1)) exp( −c²/(2α1) )

p(y|x, c, m, σ²) = (1/(2πσ²)^{n/2}) exp( −(1/(2σ²)) ∑_{i=1}^n (yi − mxi − c)² )

p(c|y, x, m, σ²) = p(y|x, c, m, σ²) p(c) / p(y|x, m, σ²)
                 = p(y|x, c, m, σ²) p(c) / ∫ p(y|x, c, m, σ²) p(c) dc
                 ∝ p(y|x, c, m, σ²) p(c)

log p(c|y, x, m, σ²) = −(1/(2σ²)) ∑_{i=1}^n (yi − c − mxi)² − (1/(2α1)) c² + const
                     = −(1/(2σ²)) ∑_{i=1}^n (yi − mxi)² − ( n/(2σ²) + 1/(2α1) ) c² + c ∑_{i=1}^n (yi − mxi)/σ²,

complete the square of the quadratic form to obtain

log p(c|y, x, m, σ²) = −(1/(2τ²)) (c − µ)² + const,

where τ² = (nσ⁻² + α1⁻¹)⁻¹ and µ = (τ²/σ²) ∑_{i=1}^n (yi − mxi).
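A sketch of this result on synthetic data (the data, slope and parameter values below are assumed, not from the slides): compute τ² and µ for the posterior over c.

# Sketch: posterior over the offset c, conditioned on a fixed slope m.
import numpy as np

rng = np.random.default_rng(2)
alpha1 = 1.0                # prior variance of c
sigma2 = 0.05               # assumed noise variance
m_true, c_true = 0.4, -1.0  # assumed generating parameters

x = rng.uniform(0, 5, size=20)
y = m_true * x + c_true + rng.normal(0, np.sqrt(sigma2), size=x.shape)

m = m_true                  # condition on a fixed slope, as in the derivation
n = len(x)
tau2 = 1.0 / (n / sigma2 + 1.0 / alpha1)    # posterior variance of c
mu = (tau2 / sigma2) * np.sum(y - m * x)    # posterior mean of c
print(mu, tau2)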
The Joint Density

- Really want to know the joint posterior density over the parameters c and m.
- Could now integrate out over m, but it's easier to consider the multivariate case.
Epistemic Uncertainty

- This is uncertainty we could in principle know the answer to. We just haven't observed enough yet, e.g. the result of a football match after it's played.
- What colour socks your lecturer is wearing.
Reading

- Bishop Section 1.2.3 (pg 21–24).
- Bishop Section 1.2.6 (start from just past eq 1.64, pg 30–32).
- Bayesian inference: Rogers and Girolami use an example of a coin toss for introducing Bayesian inference, Chapter 3, Sections 3.1–3.4 (pg 95–117), although you also need the beta density, which we haven't yet discussed. This is also the example that Laplace used.
Outline

Quick Review: Overdetermined Systems

Underdetermined Systems

Bayesian Regression

Univariate Bayesian Linear Regression

Bayesian Polynomials
Two Dimensional Gaussian

- Consider height, h/m, and weight, w/kg.
- Could sample height from a distribution:

  h ∼ N(1.7, 0.0225)

- And similarly weight:

  w ∼ N(75, 36)
Height and Weight Models

(Figure: Gaussian distributions p(h) and p(w) for height, h/m, and weight, w/kg.)
Sampling Two Dimensional Variables

(Figure, shown frame by frame as samples accumulate: the joint distribution of height, h/m, and weight, w/kg, with the marginal distributions p(h) and p(w) on the axes and samples of height and weight plotted in the joint space.)
Independence Assumption

- This assumes height and weight are independent:

  p(h, w) = p(h) p(w)

- In reality they are dependent (e.g. through body mass index, w/h²).
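A sketch of the independence assumption using the values from the slides (the sample size and seed are assumed): sampling h and w from their marginals gives an empirical correlation near zero.

# Sketch: independent samples of height and weight from their marginals.
import numpy as np

rng = np.random.default_rng(3)
n = 10_000
h = rng.normal(1.7, np.sqrt(0.0225), size=n)   # height in metres
w = rng.normal(75.0, np.sqrt(36.0), size=n)    # weight in kilograms

print(np.corrcoef(h, w)[0, 1])   # close to 0: the joint factorises, p(h, w) = p(h)p(w)
print((w / h**2).mean())         # e.g. body mass index computed from the samples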
Sampling Two Dimensional Variables

(Figure: further frames of the sampling animation, with samples of height and weight shown in the joint space and the marginals p(h) and p(w) on the axes.)
Independent Gaussians

p(w, h) = p(w) p(h)

p(w, h) = (1/√(2πσ1²)) (1/√(2πσ2²)) exp( −(1/2) [ (w − µ1)²/σ1² + (h − µ2)²/σ2² ] )

In matrix form, with y = [w, h]ᵀ, µ = [µ1, µ2]ᵀ and D = diag(σ1², σ2²),

p(w, h) = (1/√(2πσ1² · 2πσ2²)) exp( −(1/2) (y − µ)ᵀ D⁻¹ (y − µ) )

p(y) = (1/|2πD|^{1/2}) exp( −(1/2) (y − µ)ᵀ D⁻¹ (y − µ) )
Correlated Gaussian

Form a correlated Gaussian from the original by rotating the data space using a matrix R:

p(y) = (1/|2πD|^{1/2}) exp( −(1/2) (y − µ)ᵀ D⁻¹ (y − µ) )

p(y) = (1/|2πD|^{1/2}) exp( −(1/2) (Rᵀy − Rᵀµ)ᵀ D⁻¹ (Rᵀy − Rᵀµ) )

p(y) = (1/|2πD|^{1/2}) exp( −(1/2) (y − µ)ᵀ R D⁻¹ Rᵀ (y − µ) )

this gives an inverse covariance matrix:

C⁻¹ = R D⁻¹ Rᵀ

p(y) = (1/|2πC|^{1/2}) exp( −(1/2) (y − µ)ᵀ C⁻¹ (y − µ) )

this gives a covariance matrix:

C = R D Rᵀ
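A sketch of this construction (the diagonal variances and rotation angle are assumed for illustration): build C = RDRᵀ and sample from the resulting correlated Gaussian.

# Sketch: rotate an axis-aligned Gaussian to obtain a correlated one.
import numpy as np

rng = np.random.default_rng(4)
mu = np.zeros(2)
D = np.diag([1.0, 0.1])                        # diagonal (independent) covariance

theta = np.pi / 4                              # assumed rotation angle
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

C = R @ D @ R.T                                # correlated covariance matrix
samples = rng.multivariate_normal(mu, C, size=1000)
print(C)
print(np.cov(samples.T))                       # empirical covariance, close to C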
Reading

- Section 2.3 of Bishop up to top of pg 85 (multivariate Gaussians).
- Section 3.3 of Bishop up to 159 (pg 152–159).
Outline

Quick Review: Overdetermined Systems

Underdetermined Systems

Bayesian Regression

Univariate Bayesian Linear Regression

Bayesian Polynomials
Revisit Olympics Data

- Use Bayesian approach on Olympics data with polynomials.
- Choose a prior w ∼ N(0, αI) with α = 1.
- Choose noise variance σ² = 0.01.
Sampling the Prior

- Always useful to perform a 'sanity check' and sample from the prior before observing the data.
- Since y = Φw + ε we just need to sample

  w ∼ N(0, αI)
  ε ∼ N(0, σ²)

  with α = 1 and σ² = 0.01.
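A sketch of this sanity check (the polynomial order and the rescaling of the year inputs are assumptions, not taken from the slides): build a polynomial basis Φ, draw w ∼ N(0, αI), and look at the sampled functions y = Φw + ε.

# Sketch: sample polynomial functions from the prior.
import numpy as np

rng = np.random.default_rng(5)
alpha, sigma2 = 1.0, 0.01
order = 4                                           # assumed polynomial order

x = np.linspace(1892, 2012, 50)
x_scaled = (x - x.mean()) / x.std()                 # rescale inputs so powers stay well behaved
Phi = np.vander(x_scaled, order + 1, increasing=True)   # columns 1, x, x^2, ..., x^order

for _ in range(3):                                  # three sample functions from the prior
    w = rng.normal(0.0, np.sqrt(alpha), size=order + 1)
    f = Phi @ w
    y = f + rng.normal(0.0, np.sqrt(sigma2), size=f.shape)
    print(y[:5])                                    # first few values of each sampled curve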
Polynomial Fits to Olympics Data

(Figures: left, fit to data; right, marginal log likelihood against polynomial order.)

Polynomial order 0: model error 29.757, σ² = 0.286, σ = 0.535.
Polynomial order 1: model error 14.942, σ² = 0.0749, σ = 0.274.
Polynomial order 2: model error 9.7206, σ² = 0.0427, σ = 0.207.
Polynomial order 3: model error 10.416, σ² = 0.0402, σ = 0.200.
Polynomial order 4: model error 11.34, σ² = 0.0401, σ = 0.200.
Polynomial order 5: model error 11.986, σ² = 0.0399, σ = 0.200.
Polynomial order 6: model error 12.369, σ² = 0.0384, σ = 0.196.
Model Fit

- Marginal likelihood doesn't always increase as model order increases.
- The Bayesian model always has 2 parameters, regardless of how many basis functions (and here we didn't even fit them).
- The maximum likelihood model overfits through an increasing number of parameters.
- Revisit the maximum likelihood solution with a validation set.
Recall: Validation Set for Maximum Likelihood

(Figures: left, fit to data; right, model error against polynomial order.)

Polynomial order 0: training error -1.8774, validation error -0.13132, σ² = 0.302, σ = 0.549.
Polynomial order 1: training error -15.325, validation error 2.5863, σ² = 0.0733, σ = 0.271.
Polynomial order 2: training error -17.579, validation error -8.4831, σ² = 0.0578, σ = 0.240.
Polynomial order 3: training error -18.064, validation error 11.27, σ² = 0.0549, σ = 0.234.
Polynomial order 4: training error -18.245, validation error 232.92, σ² = 0.0539, σ = 0.232.
Polynomial order 5: training error -20.471, validation error 9898.1, σ² = 0.0426, σ = 0.207.
Polynomial order 6: training error -22.881, validation error 67775, σ² = 0.0331, σ = 0.182.
Validation Set

(Figures: left, fit to data; right, model error against polynomial order.)

Polynomial order 0: training error 29.757, validation error -0.29243, σ² = 0.302, σ = 0.550.
Polynomial order 1: training error 14.942, validation error 4.4027, σ² = 0.0762, σ = 0.276.
Polynomial order 2: training error 9.7206, validation error -8.6623, σ² = 0.0580, σ = 0.241.
Polynomial order 3: training error 10.416, validation error -6.4726, σ² = 0.0555, σ = 0.236.
Polynomial order 4: training error 11.34, validation error -8.431, σ² = 0.0555, σ = 0.236.
Polynomial order 5: training error 11.986, validation error -10.483, σ² = 0.0551, σ = 0.235.
Polynomial order 6: training error 12.369, validation error -3.3823, σ² = 0.0537, σ = 0.232.
Regularized Mean

- Validation fit here based on the mean solution for w only.
- For the Bayesian solution

  µw = [σ⁻²ΦᵀΦ + α⁻¹I]⁻¹ σ⁻²Φᵀy

  instead of

  w* = [ΦᵀΦ]⁻¹ Φᵀy

- The two are equivalent when α → ∞.
- Equivalent to a prior for w with infinite variance.
- In other cases the α⁻¹I term regularizes the system (keeps the parameters smaller).
Sampling the Posterior

- Now check samples by extracting w from the posterior.
- Now for y = Φw + ε we need

  w ∼ N(µw, Cw)

  with Cw = [σ⁻²ΦᵀΦ + α⁻¹I]⁻¹ and µw = Cw σ⁻²Φᵀy,

  ε ∼ N(0, σ²)

  with α = 1 and σ² = 0.01.
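A sketch of these equations on synthetic data (the data, basis and polynomial order below are assumed): compute Cw and µw and draw samples of w from the posterior.

# Sketch: posterior mean and covariance over w, and posterior samples.
import numpy as np

rng = np.random.default_rng(6)
alpha, sigma2 = 1.0, 0.01
order = 2

x = np.linspace(-1, 1, 30)
Phi = np.vander(x, order + 1, increasing=True)
w_true = rng.normal(0.0, np.sqrt(alpha), size=order + 1)
y = Phi @ w_true + rng.normal(0.0, np.sqrt(sigma2), size=x.shape)

Cw = np.linalg.inv(Phi.T @ Phi / sigma2 + np.eye(order + 1) / alpha)   # posterior covariance
mu_w = Cw @ Phi.T @ y / sigma2                                         # posterior mean

w_samples = rng.multivariate_normal(mu_w, Cw, size=5)   # samples from the posterior
print(mu_w)
print(w_samples)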


Marginal Likelihood

- The marginal likelihood can also be computed, it has the form:

  p(y|X, σ², α) = (1/((2π)^{n/2} |K|^{1/2})) exp( −(1/2) yᵀ K⁻¹ y )

  where K = αΦΦᵀ + σ²I.

- So it is a zero mean n-dimensional Gaussian with covariance matrix K.
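A sketch of this computation (the synthetic data are assumed); comparing polynomial orders by their log marginal likelihood mirrors the Olympics plots above.

# Sketch: log marginal likelihood with K = alpha * Phi Phi^T + sigma^2 I.
import numpy as np

def log_marginal_likelihood(Phi, y, alpha, sigma2):
    """log p(y | X, sigma^2, alpha) for the zero-mean Gaussian with covariance K."""
    n = len(y)
    K = alpha * Phi @ Phi.T + sigma2 * np.eye(n)
    _, logdet = np.linalg.slogdet(K)
    return -0.5 * (n * np.log(2 * np.pi) + logdet + y @ np.linalg.solve(K, y))

# Assumed synthetic data: compare polynomial orders by marginal likelihood.
rng = np.random.default_rng(7)
x = np.linspace(-1, 1, 25)
y = 1.0 - 2.0 * x + 0.5 * x**2 + rng.normal(0, 0.1, size=x.shape)
for order in range(5):
    Phi = np.vander(x, order + 1, increasing=True)
    print(order, log_marginal_likelihood(Phi, y, alpha=1.0, sigma2=0.01))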
Computing the Expected Output

- Given the posterior for the parameters, how can we compute the expected output at a given location?
- The output of the model at location xi is given by

  f(xi; w) = φiᵀ w

- We want the expected output under the posterior density, p(w|y, X, σ², α).
- The mean of the mapping function will be given by

  ⟨f(xi; w)⟩_{p(w|y,X,σ²,α)} = φiᵀ ⟨w⟩_{p(w|y,X,σ²,α)} = φiᵀ µw

Variance of Expected Output

- The variance of the model at location xi is given by

  var(f(xi; w)) = ⟨(f(xi; w))²⟩ − ⟨f(xi; w)⟩²
                = φiᵀ ⟨wwᵀ⟩ φi − φiᵀ ⟨w⟩ ⟨w⟩ᵀ φi
                = φiᵀ Cw φi

  where all these expectations are taken under the posterior density, p(w|y, X, σ², α).
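A sketch of these predictive equations (the synthetic data and basis are assumed): the mean φiᵀ µw and variance φiᵀ Cw φi at a set of test inputs.

# Sketch: predictive mean and variance of the mapping function at test inputs.
import numpy as np

rng = np.random.default_rng(8)
alpha, sigma2, order = 1.0, 0.01, 2

x = np.linspace(-1, 1, 30)
Phi = np.vander(x, order + 1, increasing=True)
y = Phi @ np.array([0.5, -1.0, 0.8]) + rng.normal(0, np.sqrt(sigma2), size=x.shape)

Cw = np.linalg.inv(Phi.T @ Phi / sigma2 + np.eye(order + 1) / alpha)
mu_w = Cw @ Phi.T @ y / sigma2

x_test = np.linspace(-1.2, 1.2, 7)
Phi_test = np.vander(x_test, order + 1, increasing=True)
f_mean = Phi_test @ mu_w                               # expected output at each test point
f_var = np.sum(Phi_test @ Cw * Phi_test, axis=1)       # phi_i^T C_w phi_i for each row
print(f_mean)
print(f_var)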
Reading

- Section 3.7–3.8 of Rogers and Girolami (pg 122–133).
- Section 3.4 of Bishop (pg 161–165).
References I

C. M. Bishop. Pattern Recognition and Machine Learning. Springer-Verlag, 2006. [Google Books].

P. S. Laplace. Mémoire sur la probabilité des causes par les évènemens. In Mémoires de mathématique et de physique, présentés à l'Académie Royale des Sciences, par divers savans, & lûs dans ses assemblées 6, pages 621–656, 1774. Translated in Stigler (1986).

P. S. Laplace. Essai philosophique sur les probabilités. Courcier, Paris, 2nd edition, 1814. Sixth edition of 1840 translated and reprinted (1951) as A Philosophical Essay on Probabilities, New York: Dover; fifth edition of 1825 reprinted 1986 with notes by Bernard Bru, Paris: Christian Bourgois Éditeur; translated by Andrew Dale (1995) as Philosophical Essay on Probabilities, New York: Springer-Verlag.

S. Rogers and M. Girolami. A First Course in Machine Learning. CRC Press, 2011. [Google Books].

S. M. Stigler. Laplace's 1774 memoir on inverse probability. Statistical Science, 1:359–378, 1986.
