Week 6: Bayesian Regression
MLAI: Week 6
Neil D. Lawrence
Underdetermined Systems
Bayesian Regression
Bayesian Polynomials
Two Simultaneous Equations

A system of two simultaneous equations with two unknowns:
$$y_1 = mx_1 + c$$
$$y_2 = mx_2 + c$$

[Figure: Olympic marathon data, time in min/km ($y$) against year ($x$).]
Two Simultaneous Equations

Subtracting one equation from the other eliminates $c$:
$$y_1 - y_2 = m(x_1 - x_2)$$
$$\frac{y_1 - y_2}{x_1 - x_2} = m$$

[Figure: Olympic marathon data, time in min/km ($y$) against year ($x$).]
Two Simultaneous Equations

Solving for the gradient:
$$m = \frac{y_2 - y_1}{x_2 - x_1}$$

[Figure: Olympic marathon data, time in min/km ($y$) against year ($x$).]
How do we deal with three simultaneous equations with only two unknowns?
$$y_1 = mx_1 + c$$
$$y_2 = mx_2 + c$$
$$y_3 = mx_3 + c$$

[Figure: Olympic marathon data, time in min/km ($y$) against year ($x$).]
Overdetermined System

- With two unknowns and two observations:
$$y_1 = mx_1 + c$$
$$y_2 = mx_2 + c$$
- A third observation makes the system overdetermined:
$$y_3 = mx_3 + c$$
- This problem is solved through a noise model $\epsilon \sim \mathcal{N}(0, \sigma^2)$:
$$y_1 = mx_1 + c + \epsilon_1$$
$$y_2 = mx_2 + c + \epsilon_2$$
$$y_3 = mx_3 + c + \epsilon_3$$
Noise Models

[Figure: animation frames of a line $y = mx + c$ fit to data, annotated with the gradient $m$ and intercept $c$, with Gaussian noise shown around the line.]
$$y = mx + c$$
point 1: $x = 1$, $y = 3$ gives $3 = m + c$
point 2: $x = 3$, $y = 1$ gives $1 = 3m + c$
point 3: $x = 2$, $y = 2.5$ gives $2.5 = 2m + c$
With the noise model, each observation gains its own noise term:
point 1: $x = 1$, $y = 3$ gives $3 = m + c + \epsilon_1$
point 2: $x = 3$, $y = 1$ gives $1 = 3m + c + \epsilon_2$
point 3: $x = 2$, $y = 2.5$ gives $2.5 = 2m + c + \epsilon_3$
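As an aside (not from the slides), the Gaussian noise model justifies fitting $m$ and $c$ by least squares; a minimal numpy sketch for the three points above:

```python
import numpy as np

# The three observations from the slides: (x, y) pairs.
x = np.array([1.0, 3.0, 2.0])
y = np.array([3.0, 1.0, 2.5])

# Design matrix for y = m*x + c: one column for x, one for the intercept.
Phi = np.column_stack([x, np.ones_like(x)])

# Least squares resolves the overdetermined system by minimising the
# sum of squared errors, consistent with Gaussian noise.
(m, c), residual, *_ = np.linalg.lstsq(Phi, y, rcond=None)
print(f"m = {m:.3f}, c = {c:.3f}")  # best-fit gradient and intercept
```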
The Gaussian Density

$$p(y|\mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y-\mu)^2}{2\sigma^2}\right) = \mathcal{N}\left(y|\mu, \sigma^2\right)$$

$\sigma^2$ is the variance of the density and $\mu$ is the mean.

[Figure: the Gaussian density $p(h|\mu, \sigma^2)$ plotted over height $h$/m.]
Two Important Gaussian Properties

Sum of Gaussians: the sum of two independent Gaussian variables is also Gaussian. If $y_1 \sim \mathcal{N}(\mu_1, \sigma_1^2)$ and $y_2 \sim \mathcal{N}(\mu_2, \sigma_2^2)$ are independent, then
$$y_1 + y_2 \sim \mathcal{N}(\mu_1 + \mu_2, \sigma_1^2 + \sigma_2^2).$$

Scaling a Gaussian: scaling a Gaussian variable leads to another Gaussian. If $y \sim \mathcal{N}(\mu, \sigma^2)$, then
$$wy \sim \mathcal{N}(w\mu, w^2\sigma^2).$$
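A quick numerical check of both properties (a sketch, not part of the slides; parameter values are arbitrary): sample the variables and compare the empirical moments with the predicted ones.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# Sum of Gaussians: y1 + y2 has mean mu1 + mu2, variance s1^2 + s2^2.
mu1, s1, mu2, s2 = 1.0, 0.5, -2.0, 2.0
y = rng.normal(mu1, s1, n) + rng.normal(mu2, s2, n)
print(y.mean(), y.var())   # approx -1.0 and 0.25 + 4.0 = 4.25

# Scaling a Gaussian: w*y has mean w*mu, variance w^2 * sigma^2.
w, mu, s = 3.0, 1.0, 0.5
z = w * rng.normal(mu, s, n)
print(z.mean(), z.var())   # approx 3.0 and 9 * 0.25 = 2.25
```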
Underdetermined Systems
Bayesian Regression
Bayesian Polynomials
Underdetermined System

- What about one observation and two unknowns?
$$y_1 = mx_1 + c$$
- We can compute $m$ given $c$:
$$m = \frac{y_1 - c}{x_1}$$
- Assuming
$$c \sim \mathcal{N}(0, 4),$$
we find a distribution of solutions.

[Figure: animation frames of candidate lines through a single data point, one line per sampled value of $c$.]
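A minimal sketch of that sampling procedure (the single observation $x_1 = 1$, $y_1 = 3$ is a hypothetical stand-in, not from the slides): draw $c$ from its prior and solve for $m$ each time.

```python
import numpy as np

rng = np.random.default_rng(42)
x1, y1 = 1.0, 3.0          # hypothetical single observation

# Draw intercepts from the prior c ~ N(0, 4): variance 4, std dev 2.
c = rng.normal(0.0, 2.0, size=10)

# Each sampled c fixes the gradient of the line through (x1, y1).
m = (y1 - c) / x1
for mi, ci in zip(m, c):
    print(f"y = {mi:.2f} x + {ci:.2f}")   # one solution line per sample
```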
Different Types of Uncertainty
Underdetermined Systems
Bayesian Regression
Bayesian Polynomials
Prior Distribution

$$p(c|y) = \frac{p(y|c)\,p(c)}{p(y)}$$
Bayes Update

$$p(c) = \mathcal{N}(c|0, \alpha_1)$$
$$p(y|m, c, x, \sigma^2) = \mathcal{N}\left(y|mx + c, \sigma^2\right)$$
$$p(c|y, m, x, \sigma^2) = \mathcal{N}\left(c \,\middle|\, \frac{y - mx}{1 + \sigma^2/\alpha_1},\; \left(\sigma^{-2} + \alpha_1^{-1}\right)^{-1}\right)$$

Figure: A Gaussian prior combines with a Gaussian likelihood for a Gaussian posterior.
Stages to Derivation of the Posterior

$$p(c) = \frac{1}{\sqrt{2\pi\alpha_1}} \exp\left(-\frac{1}{2\alpha_1} c^2\right)$$
$$p(y|x, c, m, \sigma^2) = \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\left(-\frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - mx_i - c)^2\right)$$
Main Trick

$$p(c) = \frac{1}{\sqrt{2\pi\alpha_1}} \exp\left(-\frac{1}{2\alpha_1} c^2\right)$$
$$p(y|x, c, m, \sigma^2) = \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\left(-\frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - mx_i - c)^2\right)$$

Apply Bayes' rule:
$$p(c|y, x, m, \sigma^2) = \frac{p(y|x, c, m, \sigma^2)\,p(c)}{p(y|x, m, \sigma^2)} = \frac{p(y|x, c, m, \sigma^2)\,p(c)}{\int p(y|x, c, m, \sigma^2)\,p(c)\,\mathrm{d}c}$$

Completing the square in $c$ shows the posterior is Gaussian:
$$\log p(c|y, x, m, \sigma^2) = -\frac{1}{2\tau^2}(c - \mu)^2 + \text{const},$$
where $\tau^2 = \left(n\sigma^{-2} + \alpha_1^{-1}\right)^{-1}$ and $\mu = \frac{\tau^2}{\sigma^2} \sum_{i=1}^{n} (y_i - mx_i)$.
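A minimal numerical sketch of this posterior (synthetic data with assumed parameter values, not the lecture's data): compute $\tau^2$ and $\mu$ directly from the formula.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m_true, c_true, sigma2, alpha1 = 50, 2.0, -1.0, 0.25, 1.0

# Synthetic data from the model y = m*x + c + eps, eps ~ N(0, sigma2).
x = rng.uniform(0, 3, n)
y = m_true * x + c_true + rng.normal(0, np.sqrt(sigma2), n)

# Posterior over c for known m:
#   tau^2 = (n / sigma^2 + 1 / alpha1)^{-1}
#   mu    = (tau^2 / sigma^2) * sum_i (y_i - m * x_i)
tau2 = 1.0 / (n / sigma2 + 1.0 / alpha1)
mu = (tau2 / sigma2) * np.sum(y - m_true * x)
print(f"posterior mean {mu:.3f} (true c = {c_true}), variance {tau2:.4f}")
```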
The Joint Density
Joint Distribution

[Figure: animation frames of samples from the joint density of height $h$/m and weight $w$/kg, with the marginal densities $p(h)$ and $p(w)$ shown on the axes.]

$$p(h, w) = p(h)\,p(w)$$
Sampling Two Dimensional Variables

[Figure: animation frames of repeated sampling from the joint distribution, with samples accumulating in the marginal histograms for $h$ and $w$.]
Independent Gaussians

$$p(w, h) = p(w)\,p(h)$$
$$p(w, h) = \frac{1}{\sqrt{2\pi\sigma_1^2}\sqrt{2\pi\sigma_2^2}} \exp\left(-\frac{1}{2}\left(\frac{(w-\mu_1)^2}{\sigma_1^2} + \frac{(h-\mu_2)^2}{\sigma_2^2}\right)\right)$$
Independent Gaussians

Rewriting in vector notation, with diagonal covariance $\mathbf{D}$:
$$p(\mathbf{y}) = \frac{1}{|2\pi \mathbf{D}|^{1/2}} \exp\left(-\frac{1}{2}(\mathbf{y}-\boldsymbol{\mu})^\top \mathbf{D}^{-1} (\mathbf{y}-\boldsymbol{\mu})\right)$$
Correlated Gaussian

Form a correlated Gaussian from the independent one by rotating the data space with $\mathbf{R}$:
$$p(\mathbf{y}) = \frac{1}{|2\pi \mathbf{D}|^{1/2}} \exp\left(-\frac{1}{2}(\mathbf{R}^\top\mathbf{y} - \mathbf{R}^\top\boldsymbol{\mu})^\top \mathbf{D}^{-1} (\mathbf{R}^\top\mathbf{y} - \mathbf{R}^\top\boldsymbol{\mu})\right)$$
$$p(\mathbf{y}) = \frac{1}{|2\pi \mathbf{D}|^{1/2}} \exp\left(-\frac{1}{2}(\mathbf{y}-\boldsymbol{\mu})^\top \mathbf{R}\mathbf{D}^{-1}\mathbf{R}^\top (\mathbf{y}-\boldsymbol{\mu})\right)$$
this gives the inverse covariance $\mathbf{C}^{-1} = \mathbf{R}\mathbf{D}^{-1}\mathbf{R}^\top$, so
$$p(\mathbf{y}) = \frac{1}{|2\pi \mathbf{C}|^{1/2}} \exp\left(-\frac{1}{2}(\mathbf{y}-\boldsymbol{\mu})^\top \mathbf{C}^{-1} (\mathbf{y}-\boldsymbol{\mu})\right)$$
this gives a covariance matrix:
$$\mathbf{C} = \mathbf{R}\mathbf{D}\mathbf{R}^\top$$
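A sketch of this construction in numpy (the diagonal variances and rotation angle are assumed values for illustration): build $\mathbf{C} = \mathbf{R}\mathbf{D}\mathbf{R}^\top$ and sample from the resulting correlated Gaussian.

```python
import numpy as np

# Diagonal covariance of the independent Gaussians (assumed values).
D = np.diag([1.0, 0.04])

# Rotation matrix for an assumed angle of 45 degrees.
theta = np.pi / 4
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

# Rotating the independent density yields covariance C = R D R^T.
C = R @ D @ R.T
print(C)

# Sample from the correlated (zero-mean) Gaussian.
rng = np.random.default_rng(0)
samples = rng.multivariate_normal(mean=np.zeros(2), cov=C, size=5)
print(samples)
```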
Reading
Underdetermined Systems
Bayesian Regression
Bayesian Polynomials
Revisit Olympics Data

$$\mathbf{w} \sim \mathcal{N}(\mathbf{0}, \alpha\mathbf{I})$$
$$\epsilon \sim \mathcal{N}\left(0, \sigma^2\right)$$
with $\alpha = 1$ and $\sigma^2 = 0.01$.
Polynomial Fits to Olympics Data

[Figure: animation frames of Bayesian polynomial fits of increasing order to the Olympic marathon data (left panels, pace in min/km against year, 1892-2012), alongside a model score plotted against polynomial order 0-7 (right panels).]
- The model is fitted by computing the posterior mean
$$\mathbf{w} = \left[\boldsymbol{\Phi}^\top\boldsymbol{\Phi} + \frac{\sigma^2}{\alpha}\mathbf{I}\right]^{-1} \boldsymbol{\Phi}^\top \mathbf{y}$$
instead of
$$\mathbf{w}^* = \left[\boldsymbol{\Phi}^\top\boldsymbol{\Phi}\right]^{-1} \boldsymbol{\Phi}^\top \mathbf{y}.$$
- The two are equivalent when $\alpha \to \infty$.
- Equivalent to a prior for $\mathbf{w}$ with infinite variance.
- In other cases the added diagonal term regularizes the system (keeps parameters smaller); a sketch of the computation follows below.
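A minimal sketch of the regularized solve (the basis function and stand-in data are hypothetical; the real slides use the Olympic marathon paces; $\alpha$ and $\sigma^2$ as above):

```python
import numpy as np

def polynomial_basis(x, degree):
    """Columns 1, x, x^2, ..., x^degree (x rescaled for conditioning)."""
    x = (x - x.mean()) / x.std()
    return np.column_stack([x**d for d in range(degree + 1)])

def map_weights(Phi, y, alpha=1.0, sigma2=0.01):
    # Posterior mean: (Phi^T Phi + (sigma^2 / alpha) I)^{-1} Phi^T y.
    A = Phi.T @ Phi + (sigma2 / alpha) * np.eye(Phi.shape[1])
    return np.linalg.solve(A, Phi.T @ y)

# Hypothetical stand-in data over the Olympic years.
x = np.linspace(1892, 2012, 28)
y = 5.0 - 0.01 * (x - 1892) + np.random.default_rng(0).normal(0, 0.1, x.size)

Phi = polynomial_basis(x, degree=3)
print(map_weights(Phi, y))
```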
Sampling the Posterior

- Under the prior, the targets are marginally distributed as $\mathbf{y} \sim \mathcal{N}(\mathbf{0}, \mathbf{K})$, where $\mathbf{K} = \alpha \boldsymbol{\Phi}\boldsymbol{\Phi}^\top + \sigma^2 \mathbf{I}$.
- So it is a zero-mean $n$-dimensional Gaussian with covariance matrix $\mathbf{K}$.
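As a sketch, we can sample $\mathbf{y} \sim \mathcal{N}(\mathbf{0}, \mathbf{K})$ directly (the cubic basis and input grid here are assumptions for illustration):

```python
import numpy as np

alpha, sigma2 = 1.0, 0.01
x = np.linspace(-1, 1, 30)
Phi = np.column_stack([x**d for d in range(4)])   # assumed cubic basis

# Marginal covariance of the targets: K = alpha * Phi Phi^T + sigma^2 I.
K = alpha * Phi @ Phi.T + sigma2 * np.eye(x.size)

# Each draw is one sampled function (plus noise) evaluated at x.
rng = np.random.default_rng(3)
y_samples = rng.multivariate_normal(np.zeros(x.size), K, size=5)
print(y_samples.shape)   # (5, 30)
```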
Computing the Expected Output

$$f(\mathbf{x}_i; \mathbf{w}) = \boldsymbol{\phi}_i^\top \mathbf{w}$$
$$\left\langle f(\mathbf{x}_i; \mathbf{w}) \right\rangle_{p(\mathbf{w}|\mathbf{y},\mathbf{X},\sigma^2,\alpha)} = \boldsymbol{\phi}_i^\top \left\langle \mathbf{w} \right\rangle_{p(\mathbf{w}|\mathbf{y},\mathbf{X},\sigma^2,\alpha)} = \boldsymbol{\phi}_i^\top \boldsymbol{\mu}_w$$
Variance of Expected Output