Linear Regression Model: Alan Ledesma Arista
Contents
1 Bayesian inference
2 Likelihood function
3 Non-informative prior
4 Conjugate priors
5 Independent priors
6 Simulations
Research analyst at Central Reserve Bank of Peru. Email: alan.ledesma@bcrp.gob.pe
1 Bayesian inference
• We are interested in learning about a set of coefficients θ (of a model) based on the data Y.
• In the Bayesian approach the ‘true’ coefficients are regarded as random variables; hence, we estimate distributions rather than points.
• Three objects are involved in the analysis:
1. The prior distribution p(θ). It reflects the researcher's beliefs about θ and is parametrized as a PDF.
2. The likelihood function L(θ) ≡ p(Y|θ). It describes the likelihood of observing Y conditional on θ.
3. The posterior distribution p(θ|Y). This measure is the fundamental object of interest in a Bayesian analysis. It summarizes what we learn about θ given the data; it can also be understood as the update of our beliefs once the data has been processed.
• Bayes' theorem:
$$ p(\theta|Y) = \frac{p(Y|\theta)\,p(\theta)}{p(Y)} $$
As inference is made upon θ, the denominator can be treated as a scalar. Hence
$$ p(\theta|Y) \propto p(Y|\theta)\,p(\theta). $$
2 Likelihood function
• The linear regression model is
$$ y = X\beta + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2 I_n), \tag{2.1} $$
where $y$ is $n \times 1$ and $X$ is $n \times k$. Define the OLS quantities
$$ b = (X'X)^{-1}X'y \qquad \text{and} \qquad s^2 = \frac{(y - Xb)'(y - Xb)}{n - k}; $$
therefore, the expression $(y - X\beta)'(y - X\beta)$ can be reduced to
$$ (y - X\beta)'(y - X\beta) = \big[(y - Xb) + X(b - \beta)\big]'\big[(y - Xb) + X(b - \beta)\big] = (y - Xb)'(y - Xb) + (\beta - b)'X'X(\beta - b) = (n - k)s^2 + (\beta - b)'X'X(\beta - b); $$
the second-to-last equality comes from the following well-known result: $X'(y - Xb) = X'(y - X(X'X)^{-1}X'y) = X'(I - X(X'X)^{-1}X')y = (X' - X'X(X'X)^{-1}X')y = 0$. Hence, the likelihood is
$$ L(\beta, \sigma^2) \equiv p(y|\beta, \sigma^2) \propto (\sigma^2)^{-n/2} \exp\left[-\frac{(n - k)s^2 + (\beta - b)'X'X(\beta - b)}{2\sigma^2}\right]. \tag{2.2} $$
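As an illustration, a minimal Python sketch (assuming numpy and some simulated data, which are not part of the notes) of the OLS quantities b and s² that enter the likelihood (2.2):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: n observations, k regressors (first column is a constant)
n, k = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
beta_true = np.array([1.0, 0.5, -0.3])
y = X @ beta_true + rng.normal(scale=0.8, size=n)

# OLS estimate b = (X'X)^{-1} X'y and s^2 = (y - Xb)'(y - Xb) / (n - k)
b = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ b
s2 = resid @ resid / (n - k)

print("b  =", b)
print("s2 =", s2)
```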
4 Conjugate priors
• A prior p(θ) is called a “conjugate prior” for the likelihood p(y|θ) if the posterior distribution p(θ|y) is in the same probability distribution family as the prior.
• In the case of the linear regression model, set the following prior distributions: $\beta|\sigma^2 \sim N(\underline{b}, \sigma^2 \underline{Q})$ and $\sigma^2 \sim \Gamma^{-1}\!\left(\frac{\underline{\nu} - k - 2}{2}, \frac{\underline{\eta}}{2}\right)$, where underlined symbols denote prior hyperparameters. These priors are
$$ p(\beta|\sigma^2) \propto \frac{1}{(\sigma^2)^{k/2}} \exp\left[-\frac{1}{2\sigma^2}(\beta - \underline{b})'\underline{Q}^{-1}(\beta - \underline{b})\right] \quad \text{and} \tag{4.1} $$
$$ p(\sigma^2) \propto \frac{1}{(\sigma^2)^{(\underline{\nu} - k)/2}} \exp\left[-\frac{\underline{\eta}}{2\sigma^2}\right], \tag{4.2} $$
such that
$$ p(\beta, \sigma^2) = p(\beta|\sigma^2)\,p(\sigma^2) \propto \frac{1}{(\sigma^2)^{\underline{\nu}/2}} \exp\left\{-\frac{1}{2\sigma^2}\left[\underline{\eta} + (\beta - \underline{b})'\underline{Q}^{-1}(\beta - \underline{b})\right]\right\}. \tag{4.3} $$
Define
$$ \bar{Q} = \left(\underline{Q}^{-1} + X'X\right)^{-1}, \qquad \bar{b} = \bar{Q}\left(\underline{Q}^{-1}\underline{b} + X'Xb\right), \tag{4.5} $$
$$ \bar{\nu} = \underline{\nu} + n \qquad \text{and} \qquad \bar{\eta} = \underline{\eta} + (n - k)s^2 + (b - \underline{b})'\left[\underline{Q} + (X'X)^{-1}\right]^{-1}(b - \underline{b}), \tag{4.6} $$
where barred symbols denote posterior hyperparameters.
• With definitions (4.5), the following result holds (see the derivations in equations (2.2)–(2.9) in Appendix.pdf):
$$ (\beta - \underline{b})'\underline{Q}^{-1}(\beta - \underline{b}) + (\beta - b)'X'X(\beta - b) = (\beta - \bar{b})'\bar{Q}^{-1}(\beta - \bar{b}) + (b - \underline{b})'\left[(X'X)^{-1} + \underline{Q}\right]^{-1}(b - \underline{b}). \tag{4.7} $$
• Using identity (2.1) in the technical appendix, it can be shown that the marginal posterior distributions are
$$ \beta|y \sim t\!\left(\bar{b},\; \frac{\bar{\eta}}{\bar{\nu} - k - 1}\,\bar{Q},\; \bar{\nu} - k - 1\right) \tag{4.9} $$
$$ \sigma^2|y \sim \Gamma^{-1}\!\left(\frac{\bar{\nu} - k - 2}{2},\; \frac{\bar{\eta}}{2}\right) \tag{4.10} $$
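As a sketch, continuing the simulated X, y, b and s2 from the earlier snippet and assuming some illustrative prior hyperparameters (b_0, Q_0, nu_0 and eta_0 below are arbitrary choices, not values from the notes), the posterior hyperparameters in (4.5)–(4.6) can be computed as follows:

```python
import numpy as np

# Illustrative prior hyperparameters (assumed values, not from the notes)
b_0 = np.zeros(k)          # prior mean of beta
Q_0 = 10.0 * np.eye(k)     # prior scale matrix of beta
nu_0, eta_0 = 6.0, 4.0     # prior hyperparameters of sigma^2

# Posterior hyperparameters, equations (4.5)-(4.6)
Q_0_inv = np.linalg.inv(Q_0)
Q_bar = np.linalg.inv(Q_0_inv + X.T @ X)
b_bar = Q_bar @ (Q_0_inv @ b_0 + X.T @ X @ b)
nu_bar = nu_0 + n
diff = b - b_0
eta_bar = (eta_0 + (n - k) * s2
           + diff @ np.linalg.solve(Q_0 + np.linalg.inv(X.T @ X), diff))
```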
5 Independent priors
• Sometimes it is convenient to formulate the priors as independent distributions among the coefficients.
• The algorithm used to draw simulations from the posterior will depend upon the shape of the resulting posterior:
– If the complete set of conditional posterior distributions is easy to simulate: Gibbs sampling.
– Otherwise: Metropolis-Hastings.
• Under this specification there is no closed-form expression for the joint posterior distribution, but the conditional posteriors are tractable:
– The conditional posterior of β satisfies
$$ p(\beta|\sigma^2, y) \propto \exp\left[-\frac{1}{2\sigma^2}(\beta - \bar{b})'\bar{V}^{-1}(\beta - \bar{b})\right], \tag{5.5} $$
with $\bar{b}$ and $\bar{V}$ as defined in (5.4); as a result, $\beta|\sigma^2, y \sim N(\bar{b}, \sigma^2\bar{V})$.
– The conditional posterior of σ² satisfies
$$ p(\sigma^2|\beta, y) \propto \frac{1}{(\sigma^2)^{a/2 + 1}} \exp\left[-\frac{b/2}{\sigma^2}\right], \tag{5.8} $$
with $a$ and $b$ as defined in (5.7); as a result, $\sigma^2|\beta, y \sim \Gamma^{-1}\!\left(\frac{a}{2}, \frac{b}{2}\right)$.
– In a more general case, the posterior does not take a known form.
6 Simulations
Non-informative prior and conjugate priors
• As the marginal posterior distributions of β and σ² are known and easy to simulate, we can use them directly to draw simulations. These distributions are given by equations (3.4) and (3.5) in the case of the non-informative prior and by equations (4.9) and (4.10) for the conjugate prior; a sketch for the conjugate case is given below.
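A minimal sketch of direct simulation for the conjugate-prior case, assuming the posterior hyperparameters Q_bar, b_bar, nu_bar and eta_bar computed in the previous snippet. Rather than sampling the multivariate t in (4.9) directly, it draws σ² from (4.10) and then β from the conditional normal N(b̄, σ²Q̄) implied by the barred version of kernel (4.3), which yields draws from the same joint posterior:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
R = 10_000  # number of posterior draws

# sigma^2 | y ~ inverse-gamma((nu_bar - k - 2)/2, eta_bar/2), equation (4.10)
sigma2_draws = stats.invgamma.rvs((nu_bar - k - 2) / 2,
                                  scale=eta_bar / 2,
                                  size=R, random_state=rng)

# beta | sigma^2, y ~ N(b_bar, sigma^2 * Q_bar)  (composition sampling)
L = np.linalg.cholesky(Q_bar)
beta_draws = (b_bar
              + np.sqrt(sigma2_draws)[:, None]
              * (rng.standard_normal((R, k)) @ L.T))

print("posterior mean of beta:", beta_draws.mean(axis=0))
print("posterior mean of sigma^2:", sigma2_draws.mean())
```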
Gibbs sampling
• Gibbs sampling is a Markov Chain Monte Carlo (MCMC) based algorithm to simulate from an unknown distribution when the whole set of conditional distributions is known.
• We can simulate z = {z_1, ..., z_n} ∼ f(z), with f(z) unknown but with all conditionals f(z_i | z_{-i}) known (notation: z_{-i} ≡ {z_1, ..., z_{i-1}, z_{i+1}, ..., z_n}), with the following recursion (a toy sketch in code is given after the list):
1. Initialize: propose starting points z^(0) = {z_1^(0), ..., z_n^(0)}.
2. Simulate each z_i^(r) from f(z_i | z_{-i}^(r-1)) for i ∈ {1, ..., n}.
3. Repeat step 2 for r = {1, ..., R + B} and disregard the first B simulations.
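A minimal sketch of this recursion in Python, using a toy target whose conditionals are known: a bivariate standard normal with correlation ρ, for which both full conditionals are univariate normals (the target and ρ are illustrative assumptions, not part of the notes):

```python
import numpy as np

rng = np.random.default_rng(2)
rho = 0.8           # correlation of the toy bivariate-normal target
R, B = 5_000, 500   # retained draws and burn-in

z = np.zeros(2)     # step 1: starting point z^(0)
draws = []
for r in range(R + B):
    # step 2: draw each z_i from its full conditional f(z_i | z_-i)
    z[0] = rng.normal(loc=rho * z[1], scale=np.sqrt(1 - rho**2))
    z[1] = rng.normal(loc=rho * z[0], scale=np.sqrt(1 - rho**2))
    draws.append(z.copy())

draws = np.array(draws[B:])   # step 3: disregard the first B simulations
print("sample correlation:", np.corrcoef(draws.T)[0, 1])  # should be close to rho
```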
• Geman and Geman (1984) showed that the previous recursion converges to the joint distribution at an exponential rate.
• Gibbs sampling can also be used in the case of the conjugate prior, as the whole set of conditional posterior distributions belongs to known distribution families.
• The sampler (a code sketch follows below):
1. Initialize: calculate a according to (5.7) and propose a starting point σ^{2(0)}.
2. With σ² = σ^{2(r-1)}, calculate $\bar{V}^{(r)}$ and $\bar{b}^{(r)}$ according to (5.4) and simulate β^{(r)} from $N(\bar{b}^{(r)}, \sigma^{2(r-1)}\bar{V}^{(r)})$.
3. With β = β^{(r)}, calculate b^{(r)} according to (5.7) and simulate σ^{2(r)} from $\Gamma^{-1}\!\left(\frac{a}{2}, \frac{b^{(r)}}{2}\right)$.
4. Repeat steps 2 and 3 for r = {1, ..., R + B} and disregard the first B simulations.
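A minimal sketch of this sampler in Python, reusing X, y, n, k and s2 from the earlier snippet. Since equations (5.4) and (5.7) are not reproduced in these notes, the update formulas below are the standard ones for an independent normal / inverse-gamma prior (β ∼ N(b_0, Q_0) independent of σ² ∼ Γ⁻¹(a_0/2, d_0/2)); they are stand-ins and may differ in detail from the expressions referenced above:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
R, B = 5_000, 500

# Assumed independent priors (illustrative values, not from the notes)
b_0, Q_0 = np.zeros(k), 10.0 * np.eye(k)
a_0, d_0 = 6.0, 4.0
Q_0_inv = np.linalg.inv(Q_0)

sigma2 = s2                      # step 1: starting point for sigma^2
a_post = a_0 + n                 # shape parameter does not depend on beta
beta_draws, sigma2_draws = [], []
for r in range(R + B):
    # step 2: conditional posterior of beta (standard independent-prior formulas)
    V = np.linalg.inv(Q_0_inv + X.T @ X / sigma2)
    m = V @ (Q_0_inv @ b_0 + X.T @ y / sigma2)
    beta = rng.multivariate_normal(m, V)
    # step 3: conditional posterior of sigma^2
    resid = y - X @ beta
    d_post = d_0 + resid @ resid
    sigma2 = stats.invgamma.rvs(a_post / 2, scale=d_post / 2, random_state=rng)
    beta_draws.append(beta)
    sigma2_draws.append(sigma2)

# step 4: drop the burn-in draws
beta_draws = np.array(beta_draws[B:])
sigma2_draws = np.array(sigma2_draws[B:])
```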
Metropolis-Hastings sampling
• (Random walk) Metropolis-Hastings sampling is a more general MCMC-based simulation method.
• We want to simulate z ∼ f(z) = g(z)/k, where g(z) can be evaluated but the normalizing constant k is unknown.
• Define the candidate-generating function q(z^(r-1), z*); in the case of a Gaussian random walk, Metropolis-Hastings is specified with z* = z^(r-1) + ε, ε ∼ N(0, Ω).
The likelihood of proposing a state z* departing from state z^(r-1) should equal the likelihood of proposing a state z^(r-1) departing from state z* (i.e., q(z*, z^(r-1)) = q(z^(r-1), z*): balanced sampling).
• The sampler:
1. Initialize: propose a starting point z^(0).
2. Simulate a candidate z* from q(z^(r-1), ·) and calculate the acceptance probability α(z^(r-1), z*) = min{g(z*)/g(z^(r-1)), 1} (the proposal densities cancel under balanced sampling, and so does the unknown constant k).
3. Simulate u from U(0, 1); if u < α(z^(r-1), z*) set z^(r) = z*, but z^(r) = z^(r-1) otherwise.
4. Repeat steps 2 and 3 until r = R + B and drop the first B simulations.
For the independent prior
• The density to simulate from is f(β, σ²) ≡ p(β, σ²|y), which is proportional to
$$ g(\beta, \sigma^2) \equiv (\sigma^2)^{-n/2} \exp\left\{-\frac{1}{2\sigma^2}\left[(n - k)s^2 + (\beta - b)'X'X(\beta - b)\right]\right\} p(\beta)\,p(\sigma^2). \tag{6.1} $$
• The sampler (a code sketch is given at the end of the section):
1. Initialize: set β^(0), σ^{2(0)} and select Ω. It is customary to set (β^(0), σ^{2(0)}) = arg max log g(β, σ²) and to take Ω as minus the inverse Hessian of log g evaluated at that mode, scaled by some c > 0:
$$ \Omega = c \left( - \begin{bmatrix} \dfrac{\partial^2 \log g(\beta,\sigma^2)}{\partial\beta\,\partial\beta'} & \dfrac{\partial^2 \log g(\beta,\sigma^2)}{\partial\beta\,\partial\sigma^2} \\[1ex] \dfrac{\partial^2 \log g(\beta,\sigma^2)}{\partial\sigma^2\,\partial\beta'} & \dfrac{\partial^2 \log g(\beta,\sigma^2)}{\partial(\sigma^2)^2} \end{bmatrix} \right)^{-1} \Bigg|_{\,\beta^{(0)},\,\sigma^{2(0)}} $$
2. Simulate (β*, σ^{2*}) from
$$ \beta^* = \beta^{(r-1)} + \varepsilon_\beta \tag{6.2} $$
$$ \sigma^{2*} = \sigma^{2(r-1)} + \varepsilon_{\sigma^2} \tag{6.3} $$
where $[\varepsilon_\beta' \;\; \varepsilon_{\sigma^2}]' \sim N(0, \Omega)$.
3. Calculate the candidate's log acceptance probability
$$ \log\alpha = \min\left[\log g(\beta^*, \sigma^{2*}) - \log g(\beta^{(r-1)}, \sigma^{2(r-1)}),\; 0\right]. \tag{6.4} $$
4. Simulate u from U(0, 1); if log u < log α set (β^(r), σ^{2(r)}) = (β*, σ^{2*}), but (β^(r), σ^{2(r)}) = (β^{(r-1)}, σ^{2(r-1)}) otherwise.
5. Repeat steps 2 to 4 until r = R + B and drop the first B simulations.
Note that, from (6.1),
$$ \log g(\beta, \sigma^2) \equiv -\frac{n}{2}\log\sigma^2 - \frac{1}{2\sigma^2}\left[(n - k)s^2 + (\beta - b)'X'X(\beta - b)\right] + \log p(\beta) + \log p(\sigma^2). \tag{6.5} $$
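A minimal sketch of the random-walk Metropolis-Hastings recursion for (6.1)/(6.5), reusing the simulated X, b, s2, n and k from the earlier snippets. For brevity it assumes flat priors, so log p(β) and log p(σ²) contribute only constants, and uses a small diagonal proposal covariance Ω instead of the scaled inverse-Hessian choice; both are illustrative assumptions, not the notes' prescription:

```python
import numpy as np

rng = np.random.default_rng(4)
R, B = 20_000, 2_000

def log_g(beta, sigma2):
    # Equation (6.5) with (illustrative) flat priors: log p(beta) = log p(sigma2) = const
    if sigma2 <= 0:
        return -np.inf
    quad = (n - k) * s2 + (beta - b) @ (X.T @ X) @ (beta - b)
    return -0.5 * n * np.log(sigma2) - quad / (2.0 * sigma2)

# Step 1: initialize at the OLS values and pick a simple proposal covariance
beta_c, sigma2_c = b.copy(), s2
Omega = np.diag(np.r_[np.full(k, 1e-3), 1e-2])  # assumed, not the inverse-Hessian choice
chol = np.linalg.cholesky(Omega)

draws = []
for r in range(R + B):
    # Step 2: random-walk candidate, equations (6.2)-(6.3)
    eps = chol @ rng.standard_normal(k + 1)
    beta_s, sigma2_s = beta_c + eps[:k], sigma2_c + eps[k]
    # Step 3: log acceptance probability, equation (6.4)
    log_alpha = min(log_g(beta_s, sigma2_s) - log_g(beta_c, sigma2_c), 0.0)
    # Step 4: accept or reject the candidate
    if np.log(rng.uniform()) < log_alpha:
        beta_c, sigma2_c = beta_s, sigma2_s
    draws.append(np.r_[beta_c, sigma2_c])

draws = np.array(draws[B:])  # Step 5: drop the burn-in draws
print("posterior means (beta, sigma^2):", draws.mean(axis=0))
```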