
Bayesian Regression

MLAI: Week 6

Neil D. Lawrence

Department of Computer Science


Sheffield University

3rd November 2015


Outline

Quick Review: Overdetermined Systems

Underdetermined Systems

Bayesian Regression

Univariate Bayesian Linear Regression

Bayesian Polynomials
Two Simultaneous Equations

A system of two simultaneous equations with two unknowns:

y1 = mx1 + c
y2 = mx2 + c

(Figure: data plotted as time in min/km, y, against year, x.)

Subtracting the second equation from the first gives

y1 − y2 = m(x1 − x2),

so

m = (y1 − y2)/(x1 − x2) = (y2 − y1)/(x2 − x1)

and

c = y1 − mx1.

Two Simultaneous Equations

How do we deal with three simultaneous equations with only two unknowns?

y1 = mx1 + c
y2 = mx2 + c
y3 = mx3 + c
Overdetermined System

- With two unknowns and two observations:

  y1 = mx1 + c
  y2 = mx2 + c

- An additional observation leads to an overdetermined system:

  y3 = mx3 + c

- This problem is solved through a noise model ε ∼ N(0, σ²):

  y1 = mx1 + c + ε1
  y2 = mx2 + c + ε2
  y3 = mx3 + c + ε3
Noise Models

- We aren't modelling the entire system.
- The noise model gives the mismatch between model and data.
- The Gaussian model is justified by appeal to the central limit theorem.
- Other models are also possible (Student-t for heavy tails).
- Maximum likelihood with Gaussian noise leads to least squares.
(Figure: the line y = mx + c plotted for x from 0 to 5, with the intercept c and slope m annotated.)

y = mx + c

point 1: x = 1, y = 3
3 = m + c
point 2: x = 3, y = 1
1 = 3m + c
point 3: x = 2, y = 2.5
2.5 = 2m + c
A Philosophical Essay on Probabilities (Laplace), page 6:

"... height: 'The day will come when, by study pursued through several ages, the things now concealed will appear with evidence; and posterity will be astonished that truths so clear had escaped us.' Clairaut then undertook to submit to analysis the perturbations which the comet had experienced by the action of the two great planets, Jupiter and Saturn; after immense calculations he fixed its next passage at the perihelion toward the beginning of April, 1759, which was actually verified by observation. The regularity which astronomy shows us in the movements of the comets doubtless exists also in all phenomena.

The curve described by a simple molecule of air or vapor is regulated in a manner just as certain as the planetary orbits; the only difference between them is that which comes from our ignorance.

Probability is relative, in part to this ignorance, in part to our knowledge. We know that of three or a greater number of events a single one ought to occur; but nothing induces us to believe that one of them will occur rather than the others. In this state of indecision it is impossible for us to announce their occurrence with certainty. It is, however, probable that one of these events, chosen at will, will not occur because we see several cases equally possible which exclude its occurrence, while only a single one favors it.

The theory of chance consists in reducing all the events of the same kind to a certain number of cases equally possible, that is to say, to such as we may be equally undecided about in regard to their existence, and in determining the number of cases favorable to the event whose probability is sought. The ratio of ..."
y = mx + c + ε

point 1: x = 1, y = 3
3 = m + c + ε1
point 2: x = 3, y = 1
1 = 3m + c + ε2
point 3: x = 2, y = 2.5
2.5 = 2m + c + ε3
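As a concrete illustration (a minimal numpy sketch, not part of the original slides), the three noisy equations above can be fitted by least squares, which is the maximum likelihood solution under the Gaussian noise model:

# Sketch: fit m and c to the three points above by least squares.
import numpy as np

x = np.array([1.0, 3.0, 2.0])
y = np.array([3.0, 1.0, 2.5])

# Design matrix: one column for the slope m, one column of ones for the offset c.
Phi = np.column_stack([x, np.ones_like(x)])

# Least squares solution of y ≈ Phi @ [m, c].
(m, c), residual, _, _ = np.linalg.lstsq(Phi, y, rcond=None)
print(m, c)        # best-fit slope and intercept
print(residual)    # sum of squared errors; nonzero, since no exact solution exists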
The Gaussian Density

- Perhaps the most common probability density.

  p(y|µ, σ²) = (1/√(2πσ²)) exp( −(y − µ)²/(2σ²) )
             = N(y|µ, σ²)

- The Gaussian density.

Gaussian Density

(Figure: the Gaussian PDF p(h|µ, σ²) against h, height/m, with µ = 1.7 and variance σ² = 0.0225; the mean is shown as a red line. It could represent the heights of a population of students.)
Gaussian Density

N(y|µ, σ²) = (1/√(2πσ²)) exp( −(y − µ)²/(2σ²) )

σ² is the variance of the density and µ is the mean.
Two Important Gaussian Properties

Sum of Gaussians

- The sum of Gaussian variables is also Gaussian.

  yi ∼ N(µi, σi²)

  And the sum is distributed as

  ∑_{i=1}^n yi ∼ N( ∑_{i=1}^n µi, ∑_{i=1}^n σi² )

  (Aside: as the number of terms in the sum increases, a sum of non-Gaussian, finite variance variables is also Gaussian [central limit theorem].)
Two Important Gaussian Properties

Scaling a Gaussian

- Scaling a Gaussian leads to a Gaussian.

  y ∼ N(µ, σ²)

  And the scaled density is distributed as

  wy ∼ N(wµ, w²σ²)
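A quick numerical check of the two properties (a sketch with arbitrary, assumed means and variances; not from the slides): sums and scalings of Gaussian samples should match the stated means and variances.

# Sketch: empirically verify the sum and scaling properties of Gaussians.
import numpy as np

rng = np.random.default_rng(0)
n_samples = 100_000

mus = np.array([1.0, -2.0, 0.5])       # assumed means mu_i
sigma2s = np.array([0.5, 1.5, 2.0])    # assumed variances sigma_i^2

# Draw y_i ~ N(mu_i, sigma_i^2) and sum over i for each sample.
y = rng.normal(mus, np.sqrt(sigma2s), size=(n_samples, 3))
y_sum = y.sum(axis=1)
print(y_sum.mean(), mus.sum())         # empirical mean vs sum of the means
print(y_sum.var(), sigma2s.sum())      # empirical variance vs sum of the variances

# Scaling: w*y ~ N(w*mu, w^2 * sigma^2).
w = 3.0
scaled = w * rng.normal(1.0, np.sqrt(0.5), size=n_samples)
print(scaled.mean(), w * 1.0)
print(scaled.var(), w**2 * 0.5)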
Outline

Quick Review: Overdetermined Systems

Underdetermined Systems

Bayesian Regression

Univariate Bayesian Linear Regression

Bayesian Polynomials
Underdetermined System

What about two unknowns and one observation?

y1 = mx1 + c

Can compute m given c:

m = (y1 − c)/x1

For example, sampling different intercepts:

c = 1.75 ⟹ m = 1.25
c = −0.777 ⟹ m = 3.78
c = −4.01 ⟹ m = 7.01
c = −0.718 ⟹ m = 3.72
c = 2.45 ⟹ m = 0.545
c = −0.657 ⟹ m = 3.66
c = −3.13 ⟹ m = 6.13
c = −1.47 ⟹ m = 4.47

Assume

c ∼ N(0, 4),

and we find a distribution of solutions.

(Figure: the single observation, with one line y = mx + c through it for each sampled intercept; axes x from 0 to 3, y from 0 to 5.)
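A small sketch of this idea (the observation (x1, y1) = (1, 3) is assumed for illustration): sampling the intercept from its prior induces a distribution over slopes.

# Sketch: one observation, a prior on c, and the implied distribution of slopes.
import numpy as np

rng = np.random.default_rng(1)
x1, y1 = 1.0, 3.0                       # a single, assumed observation

c_samples = rng.normal(0.0, np.sqrt(4.0), size=10)   # c ~ N(0, 4)
m_samples = (y1 - c_samples) / x1                     # slope consistent with each c

for c, m in zip(c_samples, m_samples):
    print(f"c = {c:6.3f}  =>  m = {m:6.3f}")   # each (c, m) pair is a line through (x1, y1)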
Different Types of Uncertainty

- The first type of uncertainty we are assuming is aleatoric uncertainty.
- The second type of uncertainty we are assuming is epistemic uncertainty.
Aleatoric Uncertainty

- This is uncertainty we couldn't know even if we wanted to, e.g. the result of a football match before it's played.
- Where a sheet of paper might land on the floor.
Outline

Quick Review: Overdetermined Systems

Underdetermined Systems

Bayesian Regression

Univariate Bayesian Linear Regression

Bayesian Polynomials
Prior Distribution

- Bayesian inference requires a prior on the parameters.
- The prior represents your belief, before you see the data, about the likely value of the parameters.
- For linear regression, consider a Gaussian prior on the intercept:

  c ∼ N(0, α1)
Posterior Distribution

- The posterior distribution is found by combining the prior with the likelihood.
- The posterior distribution is your belief, after you see the data, about the likely value of the parameters.
- The posterior is found through Bayes' Rule:

  p(c|y) = p(y|c) p(c) / p(y)
Bayes Update

p(c) = N(c|0, α1)

p(y|m, c, x, σ²) = N(y|mx + c, σ²)

p(c|y, m, x, σ²) = N( c | (y − mx)/(1 + σ²/α1), (σ⁻² + α1⁻¹)⁻¹ )

Figure: A Gaussian prior combines with a Gaussian likelihood to give a Gaussian posterior (densities plotted against c).
Stages to Derivation of the Posterior

- Multiply likelihood by prior:
  - they are "exponentiated quadratics", so the answer is always also an exponentiated quadratic, because exp(a²) exp(b²) = exp(a² + b²).
- Complete the square to get the resulting density in the form of a Gaussian.
- Recognise the mean and (co)variance of the Gaussian. This is the estimate of the posterior.
Main Trick

p(c) = (1/√(2πα1)) exp( −c²/(2α1) )

p(y|x, c, m, σ²) = (1/(2πσ²)^{n/2}) exp( −(1/(2σ²)) ∑_{i=1}^n (yi − mxi − c)² )

p(c|y, x, m, σ²) = p(y|x, c, m, σ²) p(c) / p(y|x, m, σ²)
                 = p(y|x, c, m, σ²) p(c) / ∫ p(y|x, c, m, σ²) p(c) dc
                 ∝ p(y|x, c, m, σ²) p(c)

log p(c|y, x, m, σ²) = −(1/(2σ²)) ∑_{i=1}^n (yi − c − mxi)² − (1/(2α1)) c² + const
                     = −(1/(2σ²)) ∑_{i=1}^n (yi − mxi)² − ( n/(2σ²) + 1/(2α1) ) c² + c ∑_{i=1}^n (yi − mxi)/σ²,

complete the square of the quadratic form to obtain

log p(c|y, x, m, σ²) = −(1/(2τ²)) (c − µ)² + const,

where τ² = (nσ⁻² + α1⁻¹)⁻¹ and µ = (τ²/σ²) ∑_{i=1}^n (yi − mxi).
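A sketch of this result on synthetic data (the data, slope and parameter values below are assumed, not from the slides): compute τ² and µ for the posterior over c.

# Sketch: posterior over the offset c, conditioned on a fixed slope m.
import numpy as np

rng = np.random.default_rng(2)
alpha1 = 1.0                # prior variance of c
sigma2 = 0.05               # assumed noise variance
m_true, c_true = 0.4, -1.0  # assumed generating parameters

x = rng.uniform(0, 5, size=20)
y = m_true * x + c_true + rng.normal(0, np.sqrt(sigma2), size=x.shape)

m = m_true                  # condition on a fixed slope, as in the derivation
n = len(x)
tau2 = 1.0 / (n / sigma2 + 1.0 / alpha1)    # posterior variance of c
mu = (tau2 / sigma2) * np.sum(y - m * x)    # posterior mean of c
print(mu, tau2)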
The Joint Density

- Really want to know the joint posterior density over the parameters c and m.
- Could now integrate out over m, but it's easier to consider the multivariate case.
Epistemic Uncertainty

- This is uncertainty we could in principle know the answer to. We just haven't observed enough yet, e.g. the result of a football match after it's played.
- What colour socks your lecturer is wearing.
Reading

- Bishop Section 1.2.3 (pg 21–24).
- Bishop Section 1.2.6 (start from just past eq 1.64, pg 30–32).
- Bayesian inference: Rogers and Girolami use an example of a coin toss for introducing Bayesian inference, Chapter 3, Sections 3.1–3.4 (pg 95–117), although you also need the beta density, which we haven't yet discussed. This is also the example that Laplace used.
Outline

Quick Review: Overdetermined Systems

Underdetermined Systems

Bayesian Regression

Univariate Bayesian Linear Regression

Bayesian Polynomials
Two Dimensional Gaussian

- Consider height, h/m, and weight, w/kg.
- Could sample height from a distribution:

  h ∼ N(1.7, 0.0225)

- And similarly weight:

  w ∼ N(75, 36)
Height and Weight Models

(Figure: Gaussian distributions p(h) and p(w) for height, h/m, and weight, w/kg.)
Sampling Two Dimensional Variables

(Figure, shown frame by frame as samples accumulate: the joint distribution of height, h/m, and weight, w/kg, with the marginal distributions p(h) and p(w) on the axes and samples of height and weight plotted in the joint space.)
Independence Assumption

- This assumes height and weight are independent:

  p(h, w) = p(h) p(w)

- In reality they are dependent (e.g. through body mass index, w/h²).
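A sketch of the independence assumption using the values from the slides (the sample size and seed are assumed): sampling h and w from their marginals gives an empirical correlation near zero.

# Sketch: independent samples of height and weight from their marginals.
import numpy as np

rng = np.random.default_rng(3)
n = 10_000
h = rng.normal(1.7, np.sqrt(0.0225), size=n)   # height in metres
w = rng.normal(75.0, np.sqrt(36.0), size=n)    # weight in kilograms

print(np.corrcoef(h, w)[0, 1])   # close to 0: the joint factorises, p(h, w) = p(h)p(w)
print((w / h**2).mean())         # e.g. body mass index computed from the samples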
Sampling Two Dimensional Variables

(Figure: further frames of the sampling animation, with samples of height and weight shown in the joint space and the marginals p(h) and p(w) on the axes.)
Independent Gaussians

p(w, h) = p(w) p(h)

p(w, h) = (1/√(2πσ1²)) (1/√(2πσ2²)) exp( −(1/2) [ (w − µ1)²/σ1² + (h − µ2)²/σ2² ] )

In matrix form, with y = [w, h]ᵀ, µ = [µ1, µ2]ᵀ and D = diag(σ1², σ2²),

p(w, h) = (1/√(2πσ1² · 2πσ2²)) exp( −(1/2) (y − µ)ᵀ D⁻¹ (y − µ) )

p(y) = (1/|2πD|^{1/2}) exp( −(1/2) (y − µ)ᵀ D⁻¹ (y − µ) )
Correlated Gaussian

Form a correlated Gaussian from the original by rotating the data space using a matrix R:

p(y) = (1/|2πD|^{1/2}) exp( −(1/2) (y − µ)ᵀ D⁻¹ (y − µ) )

p(y) = (1/|2πD|^{1/2}) exp( −(1/2) (Rᵀy − Rᵀµ)ᵀ D⁻¹ (Rᵀy − Rᵀµ) )

p(y) = (1/|2πD|^{1/2}) exp( −(1/2) (y − µ)ᵀ R D⁻¹ Rᵀ (y − µ) )

this gives an inverse covariance matrix:

C⁻¹ = R D⁻¹ Rᵀ

p(y) = (1/|2πC|^{1/2}) exp( −(1/2) (y − µ)ᵀ C⁻¹ (y − µ) )

this gives a covariance matrix:

C = R D Rᵀ
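A sketch of this construction (the diagonal variances and rotation angle are assumed for illustration): build C = RDRᵀ and sample from the resulting correlated Gaussian.

# Sketch: rotate an axis-aligned Gaussian to obtain a correlated one.
import numpy as np

rng = np.random.default_rng(4)
mu = np.zeros(2)
D = np.diag([1.0, 0.1])                        # diagonal (independent) covariance

theta = np.pi / 4                              # assumed rotation angle
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

C = R @ D @ R.T                                # correlated covariance matrix
samples = rng.multivariate_normal(mu, C, size=1000)
print(C)
print(np.cov(samples.T))                       # empirical covariance, close to C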
Reading

- Section 2.3 of Bishop up to top of pg 85 (multivariate Gaussians).
- Section 3.3 of Bishop up to 159 (pg 152–159).
Outline

Quick Review: Overdetermined Systems

Underdetermined Systems

Bayesian Regression

Univariate Bayesian Linear Regression

Bayesian Polynomials
Revisit Olympics Data

- Use Bayesian approach on Olympics data with polynomials.
- Choose a prior w ∼ N(0, αI) with α = 1.
- Choose noise variance σ² = 0.01.
Sampling the Prior

- Always useful to perform a 'sanity check' and sample from the prior before observing the data.
- Since y = Φw + ε we just need to sample

  w ∼ N(0, αI)
  ε ∼ N(0, σ²)

  with α = 1 and σ² = 0.01.
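A sketch of this sanity check (the polynomial order and the rescaling of the year inputs are assumptions, not taken from the slides): build a polynomial basis Φ, draw w ∼ N(0, αI), and look at the sampled functions y = Φw + ε.

# Sketch: sample polynomial functions from the prior.
import numpy as np

rng = np.random.default_rng(5)
alpha, sigma2 = 1.0, 0.01
order = 4                                           # assumed polynomial order

x = np.linspace(1892, 2012, 50)
x_scaled = (x - x.mean()) / x.std()                 # rescale inputs so powers stay well behaved
Phi = np.vander(x_scaled, order + 1, increasing=True)   # columns 1, x, x^2, ..., x^order

for _ in range(3):                                  # three sample functions from the prior
    w = rng.normal(0.0, np.sqrt(alpha), size=order + 1)
    f = Phi @ w
    y = f + rng.normal(0.0, np.sqrt(sigma2), size=f.shape)
    print(y[:5])                                    # first few values of each sampled curve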
Polynomial Fits to Olympics Data

(Figures: left, fit to data; right, marginal log likelihood against polynomial order.)

Polynomial order 0: model error 29.757, σ² = 0.286, σ = 0.535.
Polynomial order 1: model error 14.942, σ² = 0.0749, σ = 0.274.
Polynomial order 2: model error 9.7206, σ² = 0.0427, σ = 0.207.
Polynomial order 3: model error 10.416, σ² = 0.0402, σ = 0.200.
Polynomial order 4: model error 11.34, σ² = 0.0401, σ = 0.200.
Polynomial order 5: model error 11.986, σ² = 0.0399, σ = 0.200.
Polynomial order 6: model error 12.369, σ² = 0.0384, σ = 0.196.
Model Fit

- Marginal likelihood doesn't always increase as model order increases.
- The Bayesian model always has 2 parameters, regardless of how many basis functions (and here we didn't even fit them).
- The maximum likelihood model overfits through an increasing number of parameters.
- Revisit the maximum likelihood solution with a validation set.
Recall: Validation Set for Maximum Likelihood

(Figures: left, fit to data; right, model error against polynomial order.)

Polynomial order 0: training error -1.8774, validation error -0.13132, σ² = 0.302, σ = 0.549.
Polynomial order 1: training error -15.325, validation error 2.5863, σ² = 0.0733, σ = 0.271.
Polynomial order 2: training error -17.579, validation error -8.4831, σ² = 0.0578, σ = 0.240.
Polynomial order 3: training error -18.064, validation error 11.27, σ² = 0.0549, σ = 0.234.
Polynomial order 4: training error -18.245, validation error 232.92, σ² = 0.0539, σ = 0.232.
Polynomial order 5: training error -20.471, validation error 9898.1, σ² = 0.0426, σ = 0.207.
Polynomial order 6: training error -22.881, validation error 67775, σ² = 0.0331, σ = 0.182.
Validation Set

(Figures: left, fit to data; right, model error against polynomial order.)

Polynomial order 0: training error 29.757, validation error -0.29243, σ² = 0.302, σ = 0.550.
Polynomial order 1: training error 14.942, validation error 4.4027, σ² = 0.0762, σ = 0.276.
Polynomial order 2: training error 9.7206, validation error -8.6623, σ² = 0.0580, σ = 0.241.
Polynomial order 3: training error 10.416, validation error -6.4726, σ² = 0.0555, σ = 0.236.
Polynomial order 4: training error 11.34, validation error -8.431, σ² = 0.0555, σ = 0.236.
Polynomial order 5: training error 11.986, validation error -10.483, σ² = 0.0551, σ = 0.235.
Polynomial order 6: training error 12.369, validation error -3.3823, σ² = 0.0537, σ = 0.232.
Regularized Mean

- Validation fit here based on the mean solution for w only.
- For the Bayesian solution

  µw = [σ⁻²ΦᵀΦ + α⁻¹I]⁻¹ σ⁻²Φᵀy

  instead of

  w* = [ΦᵀΦ]⁻¹ Φᵀy

- The two are equivalent when α → ∞.
- Equivalent to a prior for w with infinite variance.
- In other cases the α⁻¹I term regularizes the system (keeps the parameters smaller).
Sampling the Posterior

- Now check samples by extracting w from the posterior.
- Now for y = Φw + ε we need

  w ∼ N(µw, Cw)

  with Cw = [σ⁻²ΦᵀΦ + α⁻¹I]⁻¹ and µw = Cw σ⁻²Φᵀy,

  ε ∼ N(0, σ²)

  with α = 1 and σ² = 0.01.
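A sketch of these equations on synthetic data (the data, basis and polynomial order below are assumed): compute Cw and µw and draw samples of w from the posterior.

# Sketch: posterior mean and covariance over w, and posterior samples.
import numpy as np

rng = np.random.default_rng(6)
alpha, sigma2 = 1.0, 0.01
order = 2

x = np.linspace(-1, 1, 30)
Phi = np.vander(x, order + 1, increasing=True)
w_true = rng.normal(0.0, np.sqrt(alpha), size=order + 1)
y = Phi @ w_true + rng.normal(0.0, np.sqrt(sigma2), size=x.shape)

Cw = np.linalg.inv(Phi.T @ Phi / sigma2 + np.eye(order + 1) / alpha)   # posterior covariance
mu_w = Cw @ Phi.T @ y / sigma2                                         # posterior mean

w_samples = rng.multivariate_normal(mu_w, Cw, size=5)   # samples from the posterior
print(mu_w)
print(w_samples)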


Marginal Likelihood

- The marginal likelihood can also be computed, it has the form:

  p(y|X, σ², α) = (1/((2π)^{n/2} |K|^{1/2})) exp( −(1/2) yᵀ K⁻¹ y )

  where K = αΦΦᵀ + σ²I.

- So it is a zero mean n-dimensional Gaussian with covariance matrix K.
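A sketch of this computation (the synthetic data are assumed); comparing polynomial orders by their log marginal likelihood mirrors the Olympics plots above.

# Sketch: log marginal likelihood with K = alpha * Phi Phi^T + sigma^2 I.
import numpy as np

def log_marginal_likelihood(Phi, y, alpha, sigma2):
    """log p(y | X, sigma^2, alpha) for the zero-mean Gaussian with covariance K."""
    n = len(y)
    K = alpha * Phi @ Phi.T + sigma2 * np.eye(n)
    _, logdet = np.linalg.slogdet(K)
    return -0.5 * (n * np.log(2 * np.pi) + logdet + y @ np.linalg.solve(K, y))

# Assumed synthetic data: compare polynomial orders by marginal likelihood.
rng = np.random.default_rng(7)
x = np.linspace(-1, 1, 25)
y = 1.0 - 2.0 * x + 0.5 * x**2 + rng.normal(0, 0.1, size=x.shape)
for order in range(5):
    Phi = np.vander(x, order + 1, increasing=True)
    print(order, log_marginal_likelihood(Phi, y, alpha=1.0, sigma2=0.01))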
Computing the Expected Output

- Given the posterior for the parameters, how can we compute the expected output at a given location?
- The output of the model at location xi is given by

  f(xi; w) = φiᵀ w

- We want the expected output under the posterior density, p(w|y, X, σ², α).
- The mean of the mapping function will be given by

  ⟨f(xi; w)⟩_{p(w|y,X,σ²,α)} = φiᵀ ⟨w⟩_{p(w|y,X,σ²,α)} = φiᵀ µw

Variance of Expected Output

- The variance of the model at location xi is given by

  var(f(xi; w)) = ⟨(f(xi; w))²⟩ − ⟨f(xi; w)⟩²
                = φiᵀ ⟨wwᵀ⟩ φi − φiᵀ ⟨w⟩ ⟨w⟩ᵀ φi
                = φiᵀ Cw φi

  where all these expectations are taken under the posterior density, p(w|y, X, σ², α).
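A sketch of these predictive equations (the synthetic data and basis are assumed): the mean φiᵀ µw and variance φiᵀ Cw φi at a set of test inputs.

# Sketch: predictive mean and variance of the mapping function at test inputs.
import numpy as np

rng = np.random.default_rng(8)
alpha, sigma2, order = 1.0, 0.01, 2

x = np.linspace(-1, 1, 30)
Phi = np.vander(x, order + 1, increasing=True)
y = Phi @ np.array([0.5, -1.0, 0.8]) + rng.normal(0, np.sqrt(sigma2), size=x.shape)

Cw = np.linalg.inv(Phi.T @ Phi / sigma2 + np.eye(order + 1) / alpha)
mu_w = Cw @ Phi.T @ y / sigma2

x_test = np.linspace(-1.2, 1.2, 7)
Phi_test = np.vander(x_test, order + 1, increasing=True)
f_mean = Phi_test @ mu_w                               # expected output at each test point
f_var = np.sum(Phi_test @ Cw * Phi_test, axis=1)       # phi_i^T C_w phi_i for each row
print(f_mean)
print(f_var)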
Reading

- Section 3.7–3.8 of Rogers and Girolami (pg 122–133).
- Section 3.4 of Bishop (pg 161–165).
References I

C. M. Bishop. Pattern Recognition and Machine Learning. Springer-Verlag, 2006. [Google Books].

P. S. Laplace. Mémoire sur la probabilité des causes par les évènemens. In Mémoires de mathématique et de physique, présentés à l'Académie Royale des Sciences, par divers savans, & lûs dans ses assemblées 6, pages 621–656, 1774. Translated in Stigler (1986).

P. S. Laplace. Essai philosophique sur les probabilités. Courcier, Paris, 2nd edition, 1814. Sixth edition of 1840 translated and reprinted (1951) as A Philosophical Essay on Probabilities, New York: Dover; fifth edition of 1825 reprinted 1986 with notes by Bernard Bru, Paris: Christian Bourgois Éditeur; translated by Andrew Dale (1995) as Philosophical Essay on Probabilities, New York: Springer-Verlag.

S. Rogers and M. Girolami. A First Course in Machine Learning. CRC Press, 2011. [Google Books].

S. M. Stigler. Laplace's 1774 memoir on inverse probability. Statistical Science, 1:359–378, 1986.
