
Marginal Likelihood from the Gibbs Output

Siddhartha Chib

Journal of the American Statistical Association, Vol. 90, No. 432 (Dec., 1995), pp. 1313-1321. Published by the American Statistical Association.

Stable URL: http://links.jstor.org/sici?sici=0162-1459%28199512%2990%3A432%3C1313%3AMLFTGO%3E2.0.CO%3B2-2

Marginal Likelihood From the Gibbs Output

Siddhartha CHIB

In the context of Bayes estimation via Gibbs sampling, with or without data augmentation, a simple approach is developed for computing the marginal density of the sample data (marginal likelihood) given parameter draws from the posterior distribution. Consequently, Bayes factors for model comparisons can be routinely computed as a by-product of the simulation. Hitherto, this calculation has proved extremely challenging. Our approach exploits the fact that the marginal density can be expressed as the prior times the likelihood function over the posterior density. This simple identity holds for any parameter value. An estimate of the posterior density is shown to be available if all complete conditional densities used in the Gibbs sampler have closed-form expressions. To improve accuracy, the posterior density is estimated at a high density point, and the numerical standard error of the resulting estimate is derived. The ideas are applied to probit regression and finite mixture models.

KEY WORDS: Bayes factor; Estimation of normalizing constant; Finite mixture models; Linear regression; Markov chain Monte Carlo; Markov mixture model; Multivariate density estimation; Numerical standard error; Probit regression; Reduced conditional density.

Siddhartha Chib is Professor of Econometrics, John M. Olin School of Business, Washington University, St. Louis, MO 63130. This article has benefited from valuable comments of two anonymous referees, the associate editor, and the editor. In addition, discussions with Jim Albert, Ed Greenberg, and Radford Neal are gratefully acknowledged.

© 1995 American Statistical Association, Journal of the American Statistical Association, December 1995, Vol. 90, No. 432, Theory and Methods

1. INTRODUCTION

The advent of Markov chain Monte Carlo (MCMC) methods (Gelfand and Smith 1990; Tanner and Wong 1987) to simulate posterior distributions has virtually revolutionized the practice of Bayesian statistics. For the most part, these methods have been used for estimation and out-of-sample prediction, because both of those problems are easily solved given a sample of draws from the posterior distribution. On the other hand, the problem of calculating the marginal likelihood, which is the normalizing constant of the posterior density and an input to the computation of Bayes factors (see, for example, Berger 1985, Kass and Raftery 1995, or O'Hagan 1994), has proved extremely challenging. This is because the marginal likelihood is obtained by integrating the likelihood function with respect to the prior density, whereas the MCMC method produces draws from the posterior.

One way to deal with this problem is to compute Bayes factors without attempting to calculate the marginal likelihood, by introducing a model indicator into the list of unknown parameters. Work along these lines has been reported by Carlin and Polson (1991), Carlin and Chib (1995), and many others. To use these methods, however, it is necessary to specify all of the competing models at the outset, which may not always be possible, and to carefully specify certain tuning constants to ensure that the simulation algorithm mixes suitably in model space. In this article, therefore, we concern ourselves with methods that directly address the calculation of the marginal likelihood.

Suppose that $f(y \mid \theta_k, M_k)$ is the density function of the data $y = (y_1, \ldots, y_n)$ under model $M_k$ $(k = 1, 2, \ldots, K)$, given the model-specific parameter vector $\theta_k$. Let the prior density of $\theta_k$ (assumed to be proper) be given by $\pi(\theta_k \mid M_k)$, and let $\{\theta_k^{(1)}, \ldots, \theta_k^{(G)}\}$ be $G$ draws from the posterior $\pi(\theta_k \mid y, M_k)$ obtained using a MCMC method, say the Gibbs sampler. Newton and Raftery (1994) showed that the marginal likelihood (equivalently, the marginal density of $y$) under model $M_k$, that is,

$$m(y \mid M_k) = \int f(y \mid \theta_k, M_k)\,\pi(\theta_k \mid M_k)\,d\theta_k, \tag{1}$$

can be estimated as

$$\hat{m}_{NR} = \left\{ \frac{1}{G} \sum_{g=1}^{G} f(y \mid \theta_k^{(g)}, M_k)^{-1} \right\}^{-1}, \tag{2}$$

which is the harmonic mean of the likelihood values. Although this estimate is a simulation-consistent estimate of $m(y \mid M_k)$, it is not stable, because the inverse likelihood does not have finite variance. But consider the quantity proposed by Gelfand and Dey (1994):

$$\hat{m}_{GD} = \left\{ \frac{1}{G} \sum_{g=1}^{G} \frac{p(\theta_k^{(g)})}{f(y \mid \theta_k^{(g)}, M_k)\,\pi(\theta_k^{(g)} \mid M_k)} \right\}^{-1}, \tag{3}$$

where $p(\theta)$ is a density with tails thinner than the product of the prior and the likelihood. This can be shown to have the property that $\hat{m}_{GD} \to m(y \mid M_k)$ as $G$ becomes large, without the instability of $\hat{m}_{NR}$. Nonetheless, this approach requires a tuning function, which can be quite difficult to determine in high-dimensional problems, and subsequent monitoring to ensure that the numbers are stable. In fact, we have found that the somewhat obvious choices of $p(\cdot)$, a normal density or t density with mean and covariance equal to the posterior mean and covariance, do not necessarily satisfy the thinness requirement. Other attempts to modify the harmonic mean estimator, though requiring samples from both the prior and posterior distributions, have been discussed by Newton and Raftery (1994).
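For concreteness, the harmonic mean estimate just described can be computed from stored log-likelihood values as follows. This is our sketch, not code from the paper; logsumexp is used only for numerical stability, and the estimator's statistical instability of course remains.

```python
import numpy as np
from scipy.special import logsumexp

def log_marglik_harmonic_mean(loglik):
    """Newton-Raftery estimate m_NR = {G^{-1} sum_g 1/f(y|theta^(g))}^{-1},
    returned on the log scale. logsumexp avoids overflow in exp(-loglik),
    but the estimate stays unstable because 1/f(y|theta) need not have
    finite variance under the posterior."""
    loglik = np.asarray(loglik, dtype=float)
    G = loglik.size
    # ln m_NR = -( ln sum_g exp(-loglik_g) - ln G )
    return -(logsumexp(-loglik) - np.log(G))
```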
The purpose of this article is to demonstrate that a simple approach to computing the marginal likelihood and the Bayes factor is available that is free of the problems just described. This approach is developed in the setting where the Gibbs sampling algorithm, with or without data augmentation, has been used to provide a sample of draws from the posterior distribution.

To compute the marginal density by our approach, it is necessary that all integrating constants of the full conditional distributions in the Gibbs sampler be known. This requirement is usually satisfied in models fit with conjugate priors and covers almost all applications of the Gibbs sampler that have appeared in the literature.

The rest of the article is organized as follows. Section 2 presents the approach, and Section 3 illustrates the derivation of the numerical standard error of the estimate. Section 4 presents applications of the approach, first for variable selection in probit regression and then for model comparisons in finite mixture models. The final section contains brief concluding remarks.

2. THE APPROACH

Suppress the model index k and consider the situation wherein $f(y \mid \theta)$ is the sampling density (likelihood function) for the given model and $\pi(\theta)$ is the prior density. To allow for the possibility that posterior simulation requires data augmentation, let z denote latent data, and suppose that for a given set of vector blocks $\theta = (\theta_1, \theta_2, \ldots, \theta_B)$, the Gibbs sampling algorithm is applied to the set of $(B + 1)$ complete conditional densities

$$\pi(\theta_r \mid y, \theta_s\,(s \neq r), z), \quad r = 1, \ldots, B; \qquad p(z \mid y, \theta_1, \ldots, \theta_B). \tag{4}$$

The objective is to compute the marginal density $m(y \mid M_k)$ from the output $\{\theta^{(g)}, z^{(g)}\}_{g=1}^{G}$ obtained from (4).

The approach developed here consists of two related ideas. First, $m(y)$, by virtue of being the normalizing constant of the posterior density, can be written as

$$m(y) = \frac{f(y \mid \theta)\,\pi(\theta)}{\pi(\theta \mid y)}, \tag{5}$$

where the numerator is just the product of the sampling density and the prior, with all integrating constants included, and the denominator is the posterior density of $\theta$. It is worthwhile to refer to this simple identity, which holds for any $\theta$, as the basic marginal likelihood identity (BMI). Second, for a given $\theta$ (say $\theta^*$), the posterior ordinate $\pi(\theta^* \mid y)$ can be estimated by exploiting the information in the collection of complete conditional densities $\{\pi(\theta_r \mid y, \theta_s\,(s \neq r), z)\}_{r=1}^{B}$. The technique for doing so is described later, but for the present, if the posterior density estimate at $\theta^*$ is denoted by $\hat{\pi}(\theta^* \mid y)$, then the proposed estimate of the marginal density, on the computationally convenient logarithm scale, is

$$\ln \hat{m}(y) = \ln f(y \mid \theta^*) + \ln \pi(\theta^*) - \ln \hat{\pi}(\theta^* \mid y). \tag{6}$$

It is important to observe the simplicity and benefits of this expression: all it requires is the evaluation of the log-likelihood function and the prior, and an estimate of the posterior ordinate. The estimate does not suffer from any instability problem, because it is a density value that is averaged rather than its inverse. In addition, the entire estimation (simulation) error arises from the estimation of the posterior ordinate, and this simulation error can be derived, as shown in Section 3. It is now time to examine the method for calculating the posterior density estimate from the Gibbs output.

2.1 Estimation of $\pi(\theta^* \mid y)$

Consider now the estimation of the multivariate density $\pi(\theta^* \mid y)$ and the selection of the point $\theta^*$. As was pointed out, the BMI expression holds for any $\theta$, and thus the choice of the point is not critical, but efficiency considerations dictate that, for a given number of posterior draws, the density is likely to be more accurately estimated at a high density point, where more samples are available, than at a point in the tails. It should be noted that a modal value, such as the posterior mode or the maximum likelihood estimate, can be computed from the Gibbs output, at least approximately, if it is easy to evaluate the log-likelihood function for each draw in the simulation. Alternatively, one can make use of the posterior mean, provided that there is no concern that it is a low density point.

We now explain how the posterior density ordinate can be estimated from the Gibbs output, starting with a canonical situation consisting of two blocks of parameters before turning to the general case. We show that the proposed multivariate density estimation method is easy to implement, requires only the available complete conditional densities, and produces a simulation consistent estimate of the posterior ordinate.

2.1.1 Two Vector Blocks. Suppose that Gibbs sampling is applied to the complete conditional densities

$$\pi(\theta \mid y, z) \quad \text{and} \quad p(z \mid y, \theta),$$

which is the setting of Tanner and Wong (1987). Let the output from the Gibbs algorithm be given by $\{\theta^{(g)}, z^{(g)}\}_{g=1}^{G}$, and suppose that $\theta^*$ is the selected point. If the posterior density is written as

$$\pi(\theta^* \mid y) = \int \pi(\theta^* \mid y, z)\,p(z \mid y)\,dz,$$

then it follows that an appropriate Monte Carlo estimate of $\pi(\theta \mid y)$ at $\theta^*$ is

$$\hat{\pi}(\theta^* \mid y) = \frac{1}{G} \sum_{g=1}^{G} \pi(\theta^* \mid y, z^{(g)}),$$

because $z^{(g)}$ is a draw from the distribution $z \mid y$. Gelfand and Smith (1990) referred to this technique as Rao-Blackwellization and argued that it improves on the multivariate kernel method (Scott 1992). Also, under regularity conditions, the estimate is simulation consistent; that is, $\hat{\pi}(\theta^* \mid y) \to \pi(\theta^* \mid y)$ as $G$ becomes large, almost surely, as a consequence of the ergodic theorem (Tierney 1994).
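As a sketch of how (6) is assembled from the Gibbs output in this two-block setting (our illustration, not code from the paper; `log_cond_ordinate` is a hypothetical callable that evaluates $\ln \pi(\theta^* \mid y, z)$ for one draw of z):

```python
import numpy as np
from scipy.special import logsumexp

def log_posterior_ordinate(theta_star, z_draws, log_cond_ordinate):
    """Rao-Blackwellized estimate of ln pi(theta*|y): average the
    complete conditional ordinate pi(theta*|y, z^(g)) over the Gibbs
    draws z^(g), computed on the log scale for stability."""
    vals = np.array([log_cond_ordinate(theta_star, z) for z in z_draws])
    return logsumexp(vals) - np.log(vals.size)

def log_marginal_likelihood(loglik_star, logprior_star, logpost_star):
    """Basic marginal likelihood identity (6) on the log scale."""
    return loglik_star + logprior_star - logpost_star
```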
Substituting the estimate of the posterior ordinate into (6) gives the following estimate of the marginal likelihood:

$$\ln \hat{m}(y) = \ln f(y \mid \theta^*) + \ln \pi(\theta^*) - \ln \left\{ \frac{1}{G} \sum_{g=1}^{G} \pi(\theta^* \mid y, z^{(g)}) \right\}. \tag{7}$$

This simple expression can be used for a large class of models, including the probit regression model discussed later. Observe that the calculation amounts to evaluating the likelihood, the prior, and the "complete data" posterior density at the point $\theta^*$.

2.1.2 Three Vector Blocks. An even larger class of models can be covered by slightly generalizing the Tanner and Wong structure. Suppose that the Gibbs sampler is defined through the complete conditional densities

$$\pi(\theta_1 \mid y, \theta_2, z), \qquad \pi(\theta_2 \mid y, \theta_1, z), \qquad p(z \mid y, \theta_1, \theta_2).$$

Models such as linear regression, linear regression with independent Student-t errors, Zellner's seemingly unrelated regression, and censored regression either fall in this category or are a special case of this structure if z is absent. Once again, the objective is to estimate $\pi(\theta^* \mid y)$, which now is expressed as

$$\pi(\theta^* \mid y) = \pi(\theta_1^* \mid y)\,\pi(\theta_2^* \mid y, \theta_1^*), \tag{8}$$

where

$$\pi(\theta_1^* \mid y) = \int \pi(\theta_1^* \mid y, \theta_2, z)\,\pi(\theta_2, z \mid y)\,d\theta_2\,dz$$

and

$$\pi(\theta_2^* \mid y, \theta_1^*) = \int \pi(\theta_2^* \mid y, \theta_1^*, z)\,p(z \mid y, \theta_1^*)\,dz \tag{9}$$

is the reduced conditional density ordinate. It should be clear that the normalizing constants of $\pi(\theta_1 \mid y, \theta_2, z)$ and $\pi(\theta_2 \mid y, \theta_1, z)$ must be included in the integration for the decomposition in (8) to be valid. The first ordinate, $\pi(\theta_1^* \mid y)$, can be estimated in an obvious way, by taking the ergodic average of the full conditional density with the posterior draws of $(\theta_2, z)$, leading to the estimate

$$\hat{\pi}(\theta_1^* \mid y) = \frac{1}{G} \sum_{g=1}^{G} \pi(\theta_1^* \mid y, \theta_2^{(g)}, z^{(g)}).$$

A similar technique, with an important twist, can be invoked to obtain the reduced conditional ordinate in (9). Recognize that the draws of z from the Gibbs sampler are from the distribution $[z \mid y]$ and not from $[z \mid y, \theta_1^*]$. Therefore, the complete conditional density of $\theta_2$ cannot be averaged directly. A simple solution is available to deal with this complication: Continue sampling for an additional $G$ iterations with the complete conditional densities

$$\pi(\theta_2 \mid y, \theta_1^*, z) \quad \text{and} \quad p(z \mid y, \theta_1^*, \theta_2),$$

where in each of these densities, $\theta_1$ is set equal to $\theta_1^*$. From MCMC theory, it can be verified that the draws $\{z^{(j)}\}$ from this run follow the density $p(z \mid y, \theta_1^*)$, as required. Consequently, $\hat{\pi}(\theta_2^* \mid y, \theta_1^*) = G^{-1} \sum_j \pi(\theta_2^* \mid y, \theta_1^*, z^{(j)})$ is a simulation consistent estimate of (9). Although this procedure leads to an increase in the number of iterations, it is important to stress that it does not require new programming and thus is straightforward to implement. Note that the reduced conditional run is not necessary if z is absent from the sampling. In this case the reduced conditional density of $\theta_2$ is identical to its complete conditional density, and the density estimate reduces to one used by Zellner and Min (1995) in a different context.

Substituting the two density estimates into (6) yields the estimate

$$\ln \hat{m}(y) = \ln f(y \mid \theta^*) + \ln \pi(\theta^*) - \ln \hat{\pi}(\theta_1^* \mid y) - \ln \hat{\pi}(\theta_2^* \mid y, \theta_1^*).$$

2.1.3 General Case. Although the technique described thus far will apply to many problems of importance, consider the situation with an arbitrary number of blocks. Even in this case, the posterior density ordinate can be estimated rather easily.

Begin by writing the posterior density at the selected point as

$$\pi(\theta^* \mid y) = \pi(\theta_1^* \mid y) \times \pi(\theta_2^* \mid y, \theta_1^*) \times \cdots \times \pi(\theta_B^* \mid y, \theta_1^*, \ldots, \theta_{B-1}^*),$$

where the first term is the marginal ordinate, which can be estimated from the draws of the initial Gibbs run, and the typical term is the reduced conditional ordinate $\pi(\theta_r^* \mid y, \theta_1^*, \theta_2^*, \ldots, \theta_{r-1}^*)$. The latter is given by

$$\pi(\theta_r^* \mid y, \theta_1^*, \ldots, \theta_{r-1}^*) = \int \pi(\theta_r^* \mid y, \theta_1^*, \ldots, \theta_{r-1}^*, \theta_{r+1}, \ldots, \theta_B, z)\; d\pi(\theta_{r+1}, \ldots, \theta_B, z \mid y, \theta_1^*, \ldots, \theta_{r-1}^*), \tag{10}$$

where $\pi$ is being used to denote density and distribution function interchangeably. To estimate this term, continue the sampling with the complete conditional densities of $\{\theta_r, \theta_{r+1}, \ldots, \theta_B, z\}$, where in each of these full conditional densities, $\theta_s$ is set equal to $\theta_s^*$ $(s \le r - 1)$. If the draws from the reduced complete conditional Gibbs run are denoted by $\{\theta_r^{(j)}, \theta_{r+1}^{(j)}, \ldots, \theta_B^{(j)}, z^{(j)}\}$, then an estimate of (10) is

$$\hat{\pi}(\theta_r^* \mid y, \theta_1^*, \ldots, \theta_{r-1}^*) = \frac{1}{G} \sum_{j=1}^{G} \pi(\theta_r^* \mid y, \theta_1^*, \ldots, \theta_{r-1}^*, \theta_{r+1}^{(j)}, \ldots, \theta_B^{(j)}, z^{(j)}), \tag{11}$$

whereas an estimate of the joint density is $\prod_{r=1}^{B} \hat{\pi}(\theta_r^* \mid y, \theta_s^*\,(s < r))$. The log of the marginal likelihood is

$$\ln \hat{m}(y) = \ln f(y \mid \theta^*) + \ln \pi(\theta^*) - \sum_{r=1}^{B} \ln \hat{\pi}(\theta_r^* \mid y, \theta_s^*\,(s < r)). \tag{12}$$
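In outline (our sketch, not the paper's code), the full-ordinate calculation in (11)-(12) is a loop over blocks; `reduced_run_draws` and `log_complete_conditional` are hypothetical callables standing in for the model-specific reduced Gibbs runs and conditional density evaluations.

```python
import numpy as np
from scipy.special import logsumexp

def log_posterior_ordinate_general(B, reduced_run_draws, log_complete_conditional):
    """Sum of reduced conditional log ordinates, eqs. (11)-(12):
    ln pi_hat(theta*|y) = sum_{r=1}^B ln pi_hat(theta_r*|y, theta_s* (s<r)).
    reduced_run_draws(r) returns the G draws of the free blocks
    (theta_{r+1},...,theta_B, z) with blocks 1..r-1 fixed at their starred
    values; reduced_run_draws(1) is simply the initial full Gibbs run."""
    total = 0.0
    for r in range(1, B + 1):
        vals = np.array([log_complete_conditional(r, d)
                         for d in reduced_run_draws(r)])
        total += logsumexp(vals) - np.log(vals.size)
    return total
```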
As an illustration of this procedure, suppose that $B = 3$, a situation that arises in longitudinal random effects models and many other models. Then $\pi(\theta_2^* \mid y, \theta_1^*)$ is estimated as $G^{-1} \sum_j \pi(\theta_2^* \mid y, \theta_1^*, \theta_3^{(j)}, z^{(j)})$, where the draws $\{\theta_3^{(j)}, z^{(j)}\}$ are obtained by continuing the Gibbs sampler with

$$\pi(\theta_2 \mid y, \theta_1^*, \theta_3, z), \qquad \pi(\theta_3 \mid y, \theta_1^*, \theta_2, z),$$

and

$$p(z \mid y, \theta_1^*, \theta_2, \theta_3).$$

Finally, additional $G$ iterations with the densities

$$\pi(\theta_3 \mid y, \theta_1^*, \theta_2^*, z) \quad \text{and} \quad p(z \mid y, \theta_1^*, \theta_2^*, \theta_3)$$

produce draws $\{z^{(j)}\}$ that follow the distribution $[z \mid y, \theta_1^*, \theta_2^*]$. These draws yield an estimate $\hat{\pi}(\theta_3^* \mid y, \theta_1^*, \theta_2^*)$. This technique is illustrated in Section 4.2 for mixture models.

2.2 Bayes Factor Estimate

To compute the Bayes factor for any two models k and l, that is, $m(y \mid M_k)/m(y \mid M_l)$, the calculation described earlier is repeated for all models, and the following estimate is used:

$$\hat{B}_{kl} = \frac{\hat{m}(y \mid M_k)}{\hat{m}(y \mid M_l)}, \tag{13}$$

computed on the log scale as the exponential of the difference of the two log marginal likelihood estimates. An estimate of the posterior odds of any two models is given by multiplying the estimated Bayes factor by the prior odds.
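On the log scale this amounts to exponentiating a difference, as in the following small helper (our illustration):

```python
import math

def bayes_factor(log_ml_k, log_ml_l):
    """Estimated Bayes factor B_kl = m_hat(y|M_k) / m_hat(y|M_l),
    computed from the two log marginal likelihood estimates (13)."""
    return math.exp(log_ml_k - log_ml_l)

# posterior odds = Bayes factor x prior odds
```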
2.3 Remarks

In some situations there are two sets of latent vectors $(z, \psi)$ such that the density $f(y \mid \theta, \psi) = \int f(y, z \mid \theta, \psi)\,dz$ is available in closed form but the likelihood $f(y \mid \theta) = \int f(y, \psi \mid \theta)\,d\psi$ is not. This occurs, for example, in discrete response data models with random effects. To analyze this situation, one can use the BMI expression

$$m(y) = \frac{f(y \mid \theta, \psi)\,\pi(\theta, \psi)}{\pi(\theta, \psi \mid y)}.$$

Both the numerator and denominator can be evaluated at the point $(\theta^*, \psi^*)$, and the posterior mean of $(\theta, \psi)$ and $\pi(\theta, \psi \mid y)$ can be estimated using the method in Section 2.1 by treating $\psi$ as an additional block.

The BMI can also be used to assess the convergence of the Gibbs sampler, by computing and monitoring its stability for different iterations. Such an idea, combined with a different approach for computing the posterior density, appears in the Gibbs stopper proposed by Ritter and Tanner (1992). Raftery (1994) mentioned using the kernel estimate of the posterior density in connection with the BMI, but the resulting estimate can inherit the inaccuracy of the kernel method, especially in high dimensions. Finally, another identity similar to the BMI is available in the prediction context. Suppose that $y_f$ denotes an out-of-sample observation. Then the Bayesian prediction density, $f(y_f \mid y) = \int f(y_f \mid y, \theta)\,\pi(\theta \mid y)\,d\theta$, can be expressed as

$$f(y_f \mid y) = \frac{f(y_f \mid y, \theta^*)\,\pi(\theta^* \mid y)}{\pi(\theta^* \mid y, y_f)}$$

(see Besag 1989). This identity follows in a straightforward manner from the definition of the posterior density $\pi(\theta \mid y, y_f)$ and cross-multiplying. Besag (1989) alluded to a different proof.

3. NUMERICAL STANDARD ERROR

As mentioned in the preceding section, the proposed density estimation procedure is likely to produce an accurate estimate of $\pi(\theta \mid y)$ at the point $\theta^*$. In fact, it is possible to calculate the accuracy achieved by a computation that uses the Gibbs output. This calculation yields the numerical standard error of the marginal density estimate (or, equivalently, that of the posterior density estimate). The numerical standard error gives the variation that can be expected in the estimate if the simulation were to be done afresh, but the point at which the ordinate is evaluated is kept fixed.

To concentrate on the main ideas, consider the case in Section 2.1.2 and define the vector stochastic process

$$h^{(g)} = \begin{pmatrix} h_1^{(g)} \\ h_2^{(g)} \end{pmatrix} = \begin{pmatrix} \pi(\theta_1^* \mid y, \theta_2^{(g)}, z^{(g)}) \\ \pi(\theta_2^* \mid y, \theta_1^*, z^{(g)}) \end{pmatrix},$$

where in the first component the latent vector $(\theta_2, z) \sim [\,\cdot \mid y]$, while in the second component the latent vector z follows the distribution $[\,\cdot \mid y, \theta_1^*]$. In general, h is a $B \times 1$ vector with the rth component given by $\pi(\theta_r^* \mid y, \theta_1^*, \theta_2^*, \ldots, \theta_l\,(l > r), z)$, the integrand of (10).

It should be noted that due to the procedure used to estimate the reduced conditional ordinate, the second component of h is approximately independent of the first. But for expositional simplicity, it is worthwhile to proceed with the vector formulation. Then in this notation,

$$\hat{\pi}(\theta^* \mid y) = \bar{h}_1 \times \bar{h}_2, \qquad \bar{h}_i = \frac{1}{G} \sum_{g=1}^{G} h_i^{(g)},$$

and our objective is to find the variance of two functions of $\bar{h}$, namely $\hat{\psi}_1 = \bar{h}_1 \times \bar{h}_2$ and $\hat{\psi}_2 = \ln \bar{h}_1 + \ln \bar{h}_2 = \ln \hat{\pi}(\theta_1^* \mid y) + \ln \hat{\pi}(\theta_2^* \mid y, \theta_1^*)$. The variance of these two functions is found by the delta method as soon as the variance of $\bar{h}$ is determined. Because h inherits the ergodicity of the Gibbs output, it follows by the ergodic theorem (Tierney 1994) that

$$\bar{h} \to \mu \quad \text{almost surely, where} \quad \mu = (\pi(\theta_1^* \mid y),\; \pi(\theta_2^* \mid y, \theta_1^*))',$$

$$\lim_{G \to \infty} G\,E\{(\bar{h} - \mu)(\bar{h} - \mu)'\} = 2\pi S(0),$$

and $S(0)$ is the spectral density matrix at frequency zero. An estimate of $\Omega = 2\pi S(0)$ can be obtained by the approach of Newey and West (1987) or Geweke (1992). If

$$\hat{\Omega}_s = \frac{1}{G} \sum_{g=s+1}^{G} (h^{(g)} - \bar{h})(h^{(g-s)} - \bar{h})',$$

then
$$\hat{\Omega} = \hat{\Omega}_0 + \sum_{s=1}^{q} \left(1 - \frac{s}{q+1}\right)\left(\hat{\Omega}_s + \hat{\Omega}_s'\right),$$

where q is some constant, essentially the value at which the autocorrelation function tapers off. In the applications to follow, q is conservatively set equal to 10, although there was negligible to vanishing serial correlation in the $h^{(g)}$ process. The variance of $\hat{\psi}_2$, for example, is found by the delta method to be

$$\mathrm{Var}(\hat{\psi}_2) = \frac{1}{G}\, d'\,\hat{\Omega}\, d,$$

where the derivative vector d consists of the elements $\bar{h}_1^{-1}$ and $\bar{h}_2^{-1}$. The square root of this variance is the numerical standard error of the marginal likelihood on the log scale.
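As a minimal sketch (ours) of this computation, given the $G \times B$ array of ordinate evaluations $h^{(g)}$, one can form the Newey-West estimate of $\Omega$ and apply the delta method with $d_r = 1/\bar{h}_r$:

```python
import numpy as np

def marglik_nse(h, q=10):
    """Numerical standard error of ln m_hat(y). h is a (G, B) array whose
    rth column holds the conditional ordinate evaluations for block r.
    omega is the Newey-West estimate of 2*pi*S(0) with lag window q;
    Var(psi_2) = d' omega d / G, with d_r = 1 / hbar_r."""
    h = np.asarray(h, dtype=float)
    G = h.shape[0]
    hbar = h.mean(axis=0)
    dev = h - hbar
    omega = dev.T @ dev / G                      # Omega_0
    for s in range(1, q + 1):
        gamma = dev[s:].T @ dev[:-s] / G         # Omega_s at lag s
        omega += (1.0 - s / (q + 1.0)) * (gamma + gamma.T)
    d = 1.0 / hbar
    return float(np.sqrt(d @ omega @ d / G))
```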
4. EXAMPLES

In this section the approach developed earlier is applied to two important classes of models. In particular, the methods are discussed in the context of variable selection in binary probit regression models and in the context of two broad classes of finite mixture models, the iid mixture model and the Markov mixture model.

By way of notation, for a d-dimensional normal random vector with mean $\mu$ and covariance matrix $\Sigma$, the density at the point t is denoted by

$$\phi(t \mid \mu, \Sigma) = (2\pi)^{-d/2}\,|\Sigma|^{-1/2} \exp\{-(t - \mu)'\Sigma^{-1}(t - \mu)/2\},$$

and the inverse gamma density at the point s is denoted by

$$p_{IG}(s \mid a, b) \equiv \frac{b^a}{\Gamma(a)} \left(\frac{1}{s}\right)^{a+1} \exp(-b/s).$$

Finally, for an m-vector q on the unit simplex, the Dirichlet $D(\alpha_1, \alpha_2, \ldots, \alpha_m)$ density is denoted by

$$p_D(q \mid \alpha_1, \ldots, \alpha_m) \equiv \frac{\Gamma(\sum_j \alpha_j)}{\prod_j \Gamma(\alpha_j)}\, q_1^{\alpha_1 - 1} \cdots q_m^{\alpha_m - 1}.$$

4.1 Binary Probit Regression

Consider the data in Table 1 on the presence of prostatic nodal involvement collected on 53 patients with cancer of the prostate. The data (reported in the study by Brown (1980); see also Collett 1991) include a binary response variable y that takes the value 1 if cancer had spread to the surrounding lymph nodes and the value zero otherwise. The objective is to explain the binary response with five variables: age of the patient in years at diagnosis (x1); level of serum acid phosphate (x2); the result of an X-ray examination, coded 0 if negative and 1 if positive (x3); the size of the tumor, coded 0 if small and 1 if large (x4); and the pathological grade of the tumor, coded 0 if less serious and 1 if more serious (x5).

[Table 1. Nodal Involvement Data: case number, binary response Y, and covariates X1-X5 for each of the 53 patients.]

The probability of positive response can be explained through a probit link function or, as by Collett (1991), by a logit link. If interactions and powers of explanatory variables are excluded, then there are 32 possible models that can be fit. Collett's finding from the classical deviance statistic (-2 times the maximized log-likelihood) is that the logistic model containing log(x2), x3, and x4 provides a suitable fit for the data among these 32 models. These data are reanalyzed to demonstrate the computation of the marginal likelihood using nine of these models (defined later and selected entirely for illustrative purposes).

Under model $M_k$, suppose that

$$\Pr(y_i = 1 \mid x_{ik}, \beta_k) = \Phi(x_{ik}'\beta_k), \tag{14}$$

where $\Phi(\cdot)$ is the cumulative distribution function of the standard normal density, $x_{ik}$ are the covariates included in model $M_k$, and $\beta_k$ is the corresponding regression parameter vector. The likelihood function under $M_k$, assuming a random sample, is then
$$f(y \mid \beta_k) = \prod_{i=1}^{53} \Phi(x_{ik}'\beta_k)^{y_i}\,\{1 - \Phi(x_{ik}'\beta_k)\}^{1 - y_i}. \tag{15}$$

For this situation, the marginal likelihood can be computed rather simply by the Laplace method (see Kass and Raftery 1995), but given the small sample size, it is difficult to know the accuracy of the Laplace approximation. Harmonic mean type estimators, on the other hand, are rather more difficult to obtain with this likelihood, because its tails generally decline quite sharply.

A procedure that works extremely well in conjunction with the technique developed above is the data augmentation-Gibbs sampling method of Albert and Chib (1993a). Suppose that the prior information about $\beta_k$ is weak, but not improper, and is represented by a multivariate normal prior with the mean of each parameter equal to .75 (because each covariate is expected to have a positive impact on the probability of response) and a standard deviation of 5. Under the assumption that the parameters are independent, the prior of $\beta_k$ takes the form

$$\beta_k \sim N(a_k, A_k^{-1}).$$

Suppressing the model index k, the Gibbs draws for each model are obtained as follows. Define a normally distributed latent variable $z_i$ such that

$$z_i \sim N(x_i'\beta, 1), \qquad y_i = I(z_i > 0),$$

where $I(A)$ is an indicator function of the event A. This in fact is equivalent to the probit model, because $\Pr(y_i = 1) = \Pr(z_i > 0) = \Phi(x_i'\beta)$. Then, following Albert and Chib (1993a), the Gibbs sampler is defined through the complete conditional densities

$$\pi(\beta \mid y, z) = \phi(\beta \mid \hat{\beta}, B)$$

and

$$p(z_i \mid y, \beta) \propto \begin{cases} \phi(z_i \mid x_i'\beta, 1)\,I[z_i \in (0, \infty)] & \text{if } y_i = 1, \\ \phi(z_i \mid x_i'\beta, 1)\,I[z_i \in (-\infty, 0]] & \text{if } y_i = 0, \end{cases}$$

where $\hat{\beta} = (A + X'X)^{-1}(Aa + X'z)$, $B = (A + X'X)^{-1}$, $z = (z_1, z_2, \ldots, z_n)'$, X is the matrix of all the covariates, and $\phi(\cdot \mid \mu, 1)I[a, b]$ is the normal density truncated to the interval $[a, b]$. The output $\{\beta^{(g)}, z^{(g)}\}$ is obtained by running the sampler for G = 5,000 cycles after deleting the first 500, and the estimate $\beta^* = \sum_g \beta^{(g)}/5{,}000$ is obtained. Then the logarithm of the marginal likelihood of model $M_k$ is

$$\ln \hat{m}(y \mid M_k) = \ln f(y \mid \beta^*) + \ln \phi(\beta^* \mid a_k, A_k^{-1}) - \ln \left\{ \frac{1}{G} \sum_{g=1}^{G} \phi(\beta^* \mid \hat{\beta}^{(g)}, B) \right\},$$

where, it should be noted, the mean vector of the third density (i.e., $\hat{\beta}^{(g)}$) is produced as a by-product of the sampling algorithm.
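Putting the pieces together for a single model, a compact sketch of the whole procedure might look as follows. This is our illustration, not the paper's code: `a` and `A` stand for the prior mean vector and prior precision matrix, and the truncated normal draws use standardized bounds.

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal, norm, truncnorm

def probit_log_marglik(y, X, a, A, G=5000, burn=500, seed=0):
    """Albert-Chib data augmentation for the probit model with prior
    beta ~ N(a, A^{-1}), followed by Chib's marginal likelihood estimate."""
    rng = np.random.default_rng(seed)
    n, k = X.shape
    B = np.linalg.inv(A + X.T @ X)     # posterior covariance of beta given z
    C = np.linalg.cholesky(B)
    beta = np.zeros(k)
    betas, bhats = [], []
    for g in range(G + burn):
        mu = X @ beta
        # z_i | y, beta: N(x_i'beta, 1) truncated to (0, inf) if y_i = 1,
        # to (-inf, 0] if y_i = 0; bounds are standardized for truncnorm
        lo = np.where(y == 1, -mu, -np.inf)
        hi = np.where(y == 1, np.inf, -mu)
        z = mu + truncnorm.rvs(lo, hi, random_state=rng)
        bhat = B @ (A @ a + X.T @ z)
        beta = bhat + C @ rng.standard_normal(k)
        if g >= burn:
            betas.append(beta)
            bhats.append(bhat)
    beta_star = np.mean(betas, axis=0)
    # Rao-Blackwellized log posterior ordinate at beta_star
    ords = np.array([multivariate_normal.logpdf(beta_star, m, B) for m in bhats])
    log_post = logsumexp(ords) - np.log(ords.size)
    p = norm.cdf(X @ beta_star)
    loglik = np.sum(np.where(y == 1, np.log(p), np.log1p(-p)))
    logprior = multivariate_normal.logpdf(beta_star, a, np.linalg.inv(A))
    return loglik + logprior - log_post
```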
The results are summarized in Table 2, where for each of nine models, the maximized likelihood is reported along with the degrees of freedom, the log of the marginal likelihood, and its numerical standard error.

[Table 2. Summary of Results for Nodal Involvement Data: for each of the nine fitted models, the terms fitted, the maximized log-likelihood, the degrees of freedom, the log marginal likelihood, and its numerical standard error.]

From this table it can be seen that the marginal likelihood is very precisely estimated in all the fitted models. Of course, these results are obtained with G = 5,000 draws, and further improvements in accuracy can be achieved by increasing G. For comparison, the BMI expression was evaluated at a point that was one posterior standard deviation from $\beta^*$. As expected, this led to an increase in the numerical standard error of the estimate, which, for example, was .26 in $M_9$ with G = 5,000. The Laplace method was also used to determine the marginal likelihood, and the results were in agreement up to the second decimal place. We also examined whether a multivariate kernel estimate of the posterior ordinate (with a Gaussian product kernel) could be used in the BMI expression. This procedure did not produce equally accurate results. Also note that x1 (the age variable) does not improve on the model with just a constant (the Bayes factor for the second model vs. the first is .009), whereas the model with the variable x3 (X-ray) has a Bayes factor of approximately 25 versus the model with just a constant. The Bayes factor for $M_8$ versus $M_9$ is 5.33, supporting the conclusion of Collett (1991), who argued that $M_8$ is the best model, and also demonstrating the value of the marginal likelihood in providing information about the comparative value of a fitted model.

4.2 Marginal Likelihood in Mixture Models

To further illustrate the usefulness of our approach, consider the calculation of the marginal likelihood in two broad applications that involve mixture models.
The first is concerned with determining the number of components in a Gaussian finite mixture model applied to astronomical data on the velocity of galaxies. The second is concerned with a mixture model that applies to time series data. This model, which is also referred to as a Markov switching model or a hidden Markov model, is illustrated with data on the growth rates of U.S. gross national product for the postwar period.

4.2.1 Determining the Number of Components in a Mixture. Consider the data set in Table 3 on velocities of 82 galaxies from 6 well-separated conic sections of the Corona Borealis region, originally presented by Postman, Huchra, and Geller (1986). The objective is to find the best-fitting Gaussian finite mixture model.

[Table 3. Velocity (km/second) for Galaxies in the Corona Borealis Region.]

This data set has been analyzed by Roeder (1990), who developed a nonparametric density approach to determine the number of modes. Subsequently, Carlin and Chib (1995) reanalyzed the data by parametric Bayesian methods and estimated Gaussian mixture models with two to five components using the Gibbs sampler. Their results indicate symptoms of overfitting when models with four or five components are estimated. The Gibbs output from these models displays nonvanishing serial correlation for extremely high lags, indicating difficulties with convergence and nonidentifiability of parameters. (See Crawford 1994 for a discussion of identification issues in mixture models.) For this reason, models with two and three components are fit.

For the model with d components, suppose that the jth component is given by $\phi(y_i \mid \mu_j, \sigma_j^2)$, where $y_i$ is the ith data value (velocity/1,000) and $(\mu_j, \sigma_j^2)$ is the component-specific mean and variance. If each component is sampled with probability $q_j$ $(\sum q_j = 1)$, then the density function of the data $y = (y_1, \ldots, y_{82})$ given the parameters $\theta$ is

$$f(y \mid \theta) = \prod_{i=1}^{82} \sum_{j=1}^{d} q_j\,\phi(y_i \mid \mu_j, \sigma_j^2), \tag{16}$$

where $\theta = (q, \mu, \sigma^2)$ with $q = (q_1, q_2, \ldots, q_d)$, $\mu = (\mu_1, \mu_2, \ldots, \mu_d)$, and $\sigma^2 = (\sigma_1^2, \ldots, \sigma_d^2)$. It is useful to refer to this model as the "iid mixture model" because, as is well known, introducing iid latent variables $z_i \in \{1, 2, \ldots, d\}$ such that

$$\Pr(z_i = j \mid q) = q_j \tag{17}$$

and defining $f(y_i \mid z_i = j, \theta) = \phi(y_i \mid \mu_j, \sigma_j^2)$ leads to the mixture model in (16).

Assume that all components of $\theta$ are mutually independent, and define the prior information through the distributions

$$\mu_j \sim N(\mu_0, \lambda^{-1}), \qquad \sigma_j^2 \sim IG\!\left(\frac{\nu_0}{2}, \frac{\delta_0}{2}\right), \qquad q \sim \text{Dirichlet}(\alpha_1, \ldots, \alpha_d),$$

where $\mu_0 = 20$, $\lambda^{-1} = 100$, $\nu_0 = 6$, $\delta_0 = 40$, and $\alpha_j = 1$. As can be observed, these priors reflect weak prior information about the parameters. Under these prior distributions, the objective is to compute the marginal likelihood for models with two and three components. In addition, models obtained by restricting the variance $\sigma_j^2$ to be constant across components are also of interest.

The Gibbs implementation for this model is straightforward (see Diebolt and Robert 1994 and West 1992). Let $z = (z_1, \ldots, z_n)$; then Gibbs sampling is defined through the conditional densities of $\mu$, $\sigma^2$, q, and z. Let $T_j = \{i : z_i = j\}$ be the set of observation indices for the observations classified into the jth population, and let $n_j$ represent the number of observations so assigned. Now pick out the observations that correspond to the jth population and place them in the vector $y_j$, and define an $n_j$-vector $i_j$ comprising of units. Then

$$\mu_j \mid y, z, \sigma_j^2 \sim N(\hat{\mu}_j, B_j), \qquad \sigma_j^2 \mid y, z, \mu_j \sim IG\!\left(\frac{\nu_0 + n_j}{2}, \frac{\delta_0 + \delta_j}{2}\right),$$

$$q \mid z \sim \text{Dirichlet}(\alpha_1 + n_1, \ldots, \alpha_d + n_d),$$

and $\Pr(z_i = j \mid y, \theta) \propto q_j \times \phi(y_i \mid \mu_j, \sigma_j^2)$, $i \le n$, where $\hat{\mu}_j = (\lambda + \sigma_j^{-2} n_j)^{-1}(\lambda\mu_0 + \sigma_j^{-2} i_j' y_j)$, $B_j = (\lambda + \sigma_j^{-2} n_j)^{-1}$, and $\delta_j = (y_j - i_j\mu_j)'(y_j - i_j\mu_j)$.

The posterior density ordinate can be computed from the decomposition

$$\pi(\theta^* \mid y) = \pi(\mu^* \mid y)\,\pi(\sigma^{2*} \mid y, \mu^*)\,\pi(q^* \mid y, \mu^*, \sigma^{2*}), \tag{18}$$

where $\theta^*$ is taken to be the (approximate) maximum likelihood estimate computed by evaluating (16) for each simulated draw. Now apply (11) as follows. The draws from the full Gibbs run are used to estimate

$$\pi(\mu^* \mid y) = \int \prod_{j=1}^{d} \phi(\mu_j^* \mid \hat{\mu}_j, B_j)\; \pi(z, \sigma^2 \mid y)\, dz\, d\sigma^2.$$

Next, the draws from the reduced Gibbs run with the densities $\pi(\sigma_j^2 \mid y, z, \mu^*)$, $\pi(q \mid y, z)$, and $\{\Pr(z_i \mid y, \mu^*, \sigma^2, q)\}$ are used to estimate

$$\pi(\sigma^{2*} \mid y, \mu^*) = \int \prod_{j=1}^{d} p_{IG}\!\left(\sigma_j^{2*} \,\Big|\, \tfrac{1}{2}(\nu_0 + n_j), \tfrac{1}{2}(\delta_0 + \delta_j)\right) \pi(z \mid y, \mu^*)\, dz.$$

Finally, the draws from the subsequent reduced Gibbs run with the densities $\pi(q \mid y, z)$ and $\{\Pr(z_i \mid y, \mu^*, \sigma^{2*}, q)\}$ are used to estimate

$$\pi(q^* \mid y, \mu^*, \sigma^{2*}) = \int p_D(q^* \mid \alpha_1 + n_1, \ldots, \alpha_d + n_d)\; p(z \mid y, \mu^*, \sigma^{2*})\, dz.$$
An estimate of the marginal likelihood is given by substituting these quantities into (12).
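A minimal sketch (ours, not the paper's code) of one cycle of the underlying sampler may help fix ideas; the inverse gamma draw is obtained by inverting a gamma variate, and all names are our own.

```python
import numpy as np

def mixture_gibbs_cycle(y, z, mu, sig2, mu0, lam, nu0, delta0, alpha, rng):
    """One Gibbs cycle for the d-component iid Gaussian mixture of
    Section 4.2.1, using the conjugate complete conditionals.
    lam is the prior precision of each mu_j; alpha is the Dirichlet
    prior parameter vector; mu, sig2, alpha are NumPy arrays."""
    d = mu.size
    for j in range(d):
        yj = y[z == j]
        nj = yj.size
        prec = lam + nj / sig2[j]
        mhat = (lam * mu0 + yj.sum() / sig2[j]) / prec
        mu[j] = rng.normal(mhat, np.sqrt(1.0 / prec))
        dj = np.sum((yj - mu[j]) ** 2)
        # sig2_j | y, z, mu ~ IG((nu0+nj)/2, (delta0+dj)/2):
        # draw W ~ Gamma((nu0+nj)/2, 1) and set sig2_j = (delta0+dj)/2 / W
        sig2[j] = 0.5 * (delta0 + dj) / rng.gamma(0.5 * (nu0 + nj))
    nvec = np.bincount(z, minlength=d)
    q = rng.dirichlet(alpha + nvec)
    # z_i | y, theta: Pr(z_i = j) proportional to q_j * phi(y_i | mu_j, sig2_j)
    dens = np.exp(-0.5 * (y[:, None] - mu) ** 2 / sig2) / np.sqrt(2 * np.pi * sig2)
    w = q * dens
    w /= w.sum(axis=1, keepdims=True)
    z = np.array([rng.choice(d, p=wi) for wi in w])
    return z, mu, sig2, q
```

The reduced runs needed for the second and third ordinates in (18) reuse this same cycle, first with mu pinned at its starred value and then with both mu and sig2 pinned.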
Our results, which are based on G = 5,000 draws, are summarized in Table 4. (Almost identical results were obtained when the BMI expression was evaluated at the posterior mean instead of the approximate maximum likelihood value.)

Table 4. Summary of Results for Galaxy Data

    Model fitted                              log(marginal)   Num SE
    Two components: sigma_j^2 = sigma^2          -240.464      .006
    Three components: sigma_j^2 = sigma^2        -228.620      .008
    Three components: sigma_j^2 unrestricted     -224.138      .086

First, the two-component model is clearly dominated by both three-component models. Second, the three-component model with $\sigma^2$ unrestricted appears to be better than the three-component model with $\sigma^2$ restricted to be the same across components. This result would not be obvious from just looking at the posterior distributions of the fitted models, because all the parameters in both three-component models are tightly estimated. Third, all the numerical standard errors are small, indicating that the marginal likelihood has been accurately estimated.

4.2.2 Markov Mixture Model. As a final illustration of the value of our approach, consider data on the quarterly growth rates of U.S. gross national product (GNP) for the postwar period 1951.2 to 1992.4. Many different time series models have been fit to these data, and our objective is to demonstrate how the marginal likelihood can be calculated in one particular case, of substantial practical importance, for which this calculation has hitherto not been attempted.

The model of interest is the Markov mixture model, also sometimes referred to as the Markov switching model (Goldfeld and Quandt 1973; Hamilton 1989). Let $y_t$ denote the growth rate of GNP (multiplied by 100), and suppose that

$$y_t \mid z_t = j \sim N(\mu_j, \sigma^2), \qquad j = 1, 2, \tag{19}$$

where $\mu = (\mu_1, \mu_2)$ and $z_t$ is an unobserved state variable that follows a two-state Markov chain,

$$\Pr(z_t = j \mid z_{t-1} = i) = p_{ij},$$

where $P = \{p_{ij}\}$ is the one-step transition probability matrix of the chain and $\pi_1$ is the probability distribution of the state at t = 1. This model is a generalization of the iid mixture model of the last subsection. Furthermore, it is a model that is particularly appropriate for modeling the correlation in growth rates that is observed in practice.

Let $\theta = (\mu, \sigma^2, q_1, q_2)$, where $q_i$ is the ith row of P; then the likelihood function for the Markov mixture model is given in terms of the one-step ahead prediction densities,

$$f(y_t \mid Y_{t-1}, \theta) = \sum_{j=1}^{2} \phi(y_t \mid \mu_j, \sigma^2)\, p(z_t = j \mid Y_{t-1}, \theta),$$

where $Y_{t-1}$ is the observed data up to time t - 1 and $p(z_t = j \mid Y_{t-1}, \theta)$ is a time-varying conditional probability. The joint density of all the data is then

$$f(y \mid \theta) = \prod_{t=1}^{n} f(y_t \mid Y_{t-1}, \theta). \tag{20}$$

A little reflection shows that, given z, this model has the same structure as the iid mixture model, and thus the marginal likelihood calculation proceeds in virtually the same way. The complete conditional densities of $(\mu, \sigma^2)$ are identical to those in the iid mixture model, and, if one assumes that the prior density on $q_i$ is Dirichlet$(\alpha_{i1}, \alpha_{i2})$, then

$$q_i \mid y, z \sim \text{Dirichlet}(\alpha_{i1} + n_{i1}, \alpha_{i2} + n_{i2}),$$

where $n_{ij}$ denotes the number of one-step transitions from i to j in the sequence z (see Albert and Chib 1993b). A decomposition similar to (18) is again available, and each of the ordinates can be estimated by the reduced conditional Gibbs sampling procedure described earlier.

The Gibbs implementation of this model, and the calculation of the marginal likelihood, require the simulation of the latent variables z from $p(z \mid y, \theta)$. As described by Chib (1993), the latent variables are simulated through the following recursive steps, which are initiated with $p(z_0 = i \mid Y_0, \theta)$. These recursions require one pass from t = 1 to n and then a second pass from t = n to t = 1.

Step 1: Repeat for t = 1, 2, ..., n.
  Prediction step: Calculate
  $$p(z_t = j \mid Y_{t-1}, \theta) = \sum_{i=1}^{2} p_{ij}\; p(z_{t-1} = i \mid Y_{t-1}, \theta).$$
  Update step: Calculate
  $$p(z_t = j \mid Y_t, \theta) \propto p(z_t = j \mid Y_{t-1}, \theta)\;\phi(y_t \mid \mu_j, \sigma^2).$$

Step 2: Simulate $z_n$ from $p(z_n = j \mid Y_n, \theta)$, the mass function produced by the last update step.

Step 3: Repeat for t = n - 1, ..., 2, 1.
  Given the draw $z_{t+1} = l$, calculate
  $$p(z_t = j \mid Y_n, z_{t+1} = l, \theta) \propto p_{jl} \times p(z_t = j \mid Y_t, \theta), \qquad (j = 1, 2).$$
  Simulate $z_t$ from $p(z_t = j \mid Y_n, z_{t+1} = l, \theta)$.

Note that the prediction step gives the time-varying probability mass function required to calculate the likelihood function in (20) at the point $\theta^*$.
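These recursions translate directly into code. The following is our sketch (not the paper's implementation), assuming a vector `p0` for the initial state distribution; the forward pass stores the update-step mass functions and accumulates the log-likelihood in (20), and the backward pass samples the states.

```python
import numpy as np

def simulate_states(y, mu, sig2, P, p0, rng):
    """Forward-filter, backward-sample z from p(z | y, theta) for the
    m-state Markov mixture model. Returns the sampled states and
    ln f(y|theta), accumulated from the prediction-step densities."""
    n, m = y.size, mu.size
    filt = np.zeros((n, m))
    loglik = 0.0
    prev = p0
    for t in range(n):
        pred = prev @ P                            # prediction step
        dens = (np.exp(-0.5 * (y[t] - mu) ** 2 / sig2)
                / np.sqrt(2 * np.pi * sig2))
        joint = pred * dens
        ft = joint.sum()                           # f(y_t | Y_{t-1}, theta)
        loglik += np.log(ft)
        filt[t] = joint / ft                       # update step
        prev = filt[t]
    z = np.zeros(n, dtype=int)
    z[-1] = rng.choice(m, p=filt[-1])              # Step 2
    for t in range(n - 2, -1, -1):                 # Step 3: backward pass
        w = P[:, z[t + 1]] * filt[t]               # p_jl * p(z_t = j | Y_t)
        z[t] = rng.choice(m, p=w / w.sum())
    return z, loglik
```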
Our results for this model and data are summarized in Table 5. These results are based on G = 6,000 draws and rely on the prior distributions $\mu_1 \sim N(0, 2)$, $\mu_2 \sim N(.75, 2)$, $\sigma^2 \sim IG(4, 4)$, $q_1 \sim \text{Dirichlet}(4, 1)$, and $q_2 \sim \text{Dirichlet}(1, 4)$. These priors are relatively vague and are designed to model the potential persistence in low and high growth rates.

Table 5. Summary of Results for U.S. GNP Growth Rates Data

    Parameter    Prior mean   Prior std dev   Posterior mean   Posterior std dev
    mu_1         0            1.414           -.313            .314
    mu_2         .75          1.414           1.038            .111
    sigma^2      1.33         .943            .672             .089
    p_11         .8           .163            .743             .098
    p_22         .8           .163            .911             .042

    Log marginal likelihood: -229.496 (Num SE .028)

Thus the marginal likelihood is seen to be equal to -229.496 on the log scale and is accurately estimated with a numerical standard error of .028. In comparison, the marginal likelihood is also calculated for a first-order autoregressive model $y_t = \beta_0 + \beta_1 y_{t-1} + \varepsilon_t$, $\varepsilon_t \sim N(0, \sigma^2)$, by treating this as a linear regression model after conditioning on the first observation. Under the prior $(\beta_0, \beta_1)' \sim N_2(0, \text{diag}(10, 10))$ and $\sigma^2 \sim IG(3, 3)$, the log marginal likelihood is estimated to be -231.94. Thus the data support the Markov mixture model over the first-order autoregressive model.

5. CONCLUDING REMARKS

In summary, this article has developed and illustrated a new approach to calculating the marginal likelihood that relies on the output of the Gibbs sampling algorithm. The approach is fully automatic and stable, requiring no inputs beyond the draws from the simulation. Thus draws from the prior, or additional maximizations, or importance sampling functions, or any other tuning functions are not required. It was shown that the numerical standard error of the estimate can be derived from the posterior sample, and the calculations are exhibited in problems dealing with probit regression and finite mixture models. In all the examples, the marginal likelihood is estimated easily and very accurately. As a result, this approach should encourage the routine calculation of Bayes factors in models estimated by the Gibbs sampler.

[Received May 1994. Revised February 1995.]

REFERENCES

Albert, J., and Chib, S. (1993a), "Bayesian Analysis of Binary and Polychotomous Response Data," Journal of the American Statistical Association, 88, 669-679.

--- (1993b), "Bayes Inference via Gibbs Sampling of Autoregressive Time Series Subject to Markov Mean and Variance Shifts," Journal of Business & Economic Statistics, 11, 1-15.

Berger, J. (1985), Statistical Decision Theory and Bayesian Analysis, New York: Springer-Verlag.

Besag, J. (1989), "A Candidate's Formula: A Curious Result in Bayesian Prediction," Biometrika, 76, 183.

Brown, B. W. (1980), "Prediction Analyses for Binary Data," in Biostatistics Casebook, eds. R. J. Miller, B. Efron, B. W. Brown, and L. E. Moses, New York: John Wiley.

Carlin, B., and Chib, S. (1995), "Bayesian Model Choice via Markov Chain Monte Carlo," Journal of the Royal Statistical Society, Ser. B, 57, 473-484.

Carlin, B., and Polson, N. (1991), "Inference for Nonconjugate Bayesian Models Using Gibbs Sampling," Canadian Journal of Statistics, 19, 399-405.

Chib, S. (1992), "Bayes Inference in the Tobit Censored Regression Model," Journal of Econometrics, 51, 79-99.

--- (1993), "Calculating Posterior Distributions and Modal Estimates in Markov Mixture Models," submitted to Journal of Econometrics.

Collett, D. (1991), Modelling Binary Data, London: Chapman and Hall.

Crawford, S. L. (1994), "An Application of the Laplace Method to Finite Mixture Distributions," Journal of the American Statistical Association, 89, 259-267.

Diebolt, J., and Robert, C. P. (1994), "Estimation of Finite Mixture Distributions Through Bayesian Sampling," Journal of the Royal Statistical Society, Ser. B, 56, 363-375.

Gelfand, A. E., and Dey, D. K. (1994), "Bayesian Model Choice: Asymptotics and Exact Calculations," Journal of the Royal Statistical Society, Ser. B, 56, 501-514.

Gelfand, A. E., and Smith, A. F. M. (1990), "Sampling-Based Approaches to Calculating Marginal Densities," Journal of the American Statistical Association, 85, 398-409.

Geweke, J. (1992), "Evaluating the Accuracy of Sampling-Based Approaches to the Calculation of Posterior Moments," in Proceedings of the Fourth Valencia International Conference on Bayesian Statistics, eds. J. M. Bernardo, J. O. Berger, A. P. Dawid, and A. F. M. Smith, New York: Oxford University Press, pp. 169-193.

Goldfeld, S. M., and Quandt, R. E. (1973), "A Markov Model for Switching Regressions," Journal of Econometrics, 1, 3-16.

Hamilton, J. D. (1989), "A New Approach to the Economic Analysis of Nonstationary Time Series and the Business Cycle," Econometrica, 57, 357-384.

Kass, R. E., and Raftery, A. E. (1995), "Bayes Factors and Model Uncertainty," Journal of the American Statistical Association, 90, 773-795.

Newey, W. K., and West, K. D. (1987), "A Simple Positive Semi-Definite, Heteroskedasticity and Autocorrelation Consistent Covariance Matrix," Econometrica, 55, 703-708.

Newton, M. A., and Raftery, A. E. (1994), "Approximate Bayesian Inference by the Weighted Likelihood Bootstrap" (with discussion), Journal of the Royal Statistical Society, Ser. B, 56, 3-48.

O'Hagan, A. (1994), Bayesian Inference (Kendall's Advanced Theory of Statistics, Vol. 2B), London: Edward Arnold.

Postman, M., Huchra, J. P., and Geller, M. J. (1986), "Probes of Large-Scale Structures in the Corona Borealis Region," The Astronomical Journal, 92, 1238-1247.

Raftery, A. E. (1994), "Hypothesis Testing and Model Selection via Posterior Simulation," unpublished manuscript, University of Washington, Dept. of Statistics.

Ritter, C., and Tanner, M. A. (1992), "Facilitating the Gibbs Sampler: The Gibbs Stopper and the Griddy-Gibbs Sampler," Journal of the American Statistical Association, 87, 861-868.

Roeder, K. (1990), "Density Estimation With Confidence Sets Exemplified by Superclusters and Voids in Galaxies," Journal of the American Statistical Association, 85, 617-624.

Scott, D. W. (1992), Multivariate Density Estimation, New York: John Wiley.

Tanner, M. A., and Wong, W. (1987), "The Calculation of Posterior Distributions by Data Augmentation" (with discussion), Journal of the American Statistical Association, 82, 528-550.

Tierney, L. (1994), "Markov Chains for Exploring Posterior Distributions," The Annals of Statistics, 22, 1701-1762.

West, M. (1992), "Modelling With Mixtures" (with discussion), in Bayesian Statistics 4, eds. J. M. Bernardo, J. O. Berger, A. P. Dawid, and A. F. M. Smith, Oxford, U.K.: Oxford University Press, pp. 503-524.

Zellner, A., and Min, C. (1995), "Gibbs Sampler Convergence Criteria (GSC2)," Journal of the American Statistical Association, 90, 921-927.
