Marginal Likelihood From the Gibbs Output
Siddhartha CHIB
In the context of Bayes estimation via Gibbs sampling, with or without data augmentation, a simple approach is developed for
computing the marginal density of the sample data (marginal likelihood) given parameter draws from the posterior distribution.
Consequently, Bayes factors for model comparisons can be routinely computed as a by-product of the simulation. Hitherto, this
calculation has proved extremely challenging. Our approach exploits the fact that the marginal density can be expressed as the
prior times the likelihood function over the posterior density. This simple identity holds for any parameter value. An estimate
of the posterior density is shown to be available if all complete conditional densities used in the Gibbs sampler have closed-form
expressions. To improve accuracy, the posterior density is estimated at a high density point, and the numerical standard error of
resulting estimate is derived. The ideas are applied to probit regression and finite mixture models.
KEY WORDS: Bayes factor; Estimation of normalizing constant; Finite mixture models; Linear regression; Markov chain Monte
Carlo; Markov mixture model; Multivariate density estimation; Numerical standard error; Probit regression;
Reduced conditional density.
Journal of the American Statistical Association, Vol. 90, No. 432 (December 1995), pp. 1313-1321.
1. INTRODUCTION

The Gibbs sampling algorithm, with or without data augmentation, has been used to provide a sample of draws from the posterior distribution. To compute the marginal density by our approach, it is necessary that all integrating constants of the full conditional distributions in the Gibbs sampler be known. This requirement is usually satisfied in models fit with conjugate priors and covers almost all applications of the Gibbs sampler that have appeared in the literature.

The rest of the article is organized as follows. Section 2 presents the approach, and Section 3 illustrates the derivation of the numerical standard error of the estimate. Section 4 presents applications of the approach, first for variable selection in probit regression and then for model comparisons in finite mixture models. The final section contains brief concluding remarks.

2. THE APPROACH

Suppress the model index k and consider the situation wherein f(y|θ) is the sampling density (likelihood function) for the given model and π(θ) is the prior density. To allow for the possibility that posterior simulation requires data augmentation, let z denote latent data, and suppose that for a given set of vector blocks θ = (θ_1, θ_2, ..., θ_B), the Gibbs sampling algorithm is applied to the set of (B + 1) complete conditional densities

    π(θ_r | y, θ_s (s ≠ r), z),  r = 1, ..., B;   p(z | y, θ_1, ..., θ_B).   (4)

The objective is to compute the marginal density m(y|M_k) from the output {θ^(g), z^(g)}_{g=1}^{G} obtained from (4).

The approach developed here consists of two related ideas. First, m(y), by virtue of being the normalizing constant of the posterior density, can be written as

    m(y) = f(y|θ) π(θ) / π(θ|y),

an identity we refer to as the basic marginal likelihood identity (BMI), which holds for any value of θ. Second, an estimate π̂(θ*|y) of the posterior ordinate at a point θ* is available from the simulation output; substituting it into the identity on the log scale gives the following estimate of the marginal likelihood:

    log m̂(y) = log f(y|θ*) + log π(θ*) − log π̂(θ*|y).

This simple expression can be used for a large class of models, including the probit regression model discussed later. Observe that the calculation amounts to evaluating the likelihood, the prior, and the "complete data" posterior density at the point θ*. The numerical standard error of the estimate can be derived, as shown in Section 3. It is now time to examine the method for calculating the posterior density estimate from the Gibbs output.

2.1 Estimation of π(θ*|y)

Consider now the estimation of the multivariate density π(θ*|y) and the selection of the point θ*. As was pointed out, the BMI expression holds for any θ, and thus the choice of the point is not critical, but efficiency considerations dictate that for a given number of posterior draws, the density is likely to be more accurately estimated at a high density point, where more samples are available, than at a point in the tails. It should be noted that a modal value such as the posterior mode, or the maximum likelihood estimate, can be computed from the Gibbs output, at least approximately, if it is easy to evaluate the log-likelihood function for each draw in the simulation. Alternatively, one can make use of the posterior mean provided that there is no concern that it is a low density point.

We now explain how the posterior density ordinate can be estimated from the Gibbs output, starting with a canonical situation consisting of two blocks of parameters before turning to the general case. We show that the proposed multivariate density estimation method is easy to implement, requires only the available complete conditional densities, and produces a simulation consistent estimate of the posterior ordinate.

2.1.1 Two Vector Blocks. Suppose that Gibbs sampling is applied to the complete conditional densities

    π(θ_1 | y, θ_2, z),   π(θ_2 | y, θ_1, z),   p(z | y, θ_1, θ_2),

and write the posterior ordinate as

    π(θ*|y) = π(θ_1*|y) π(θ_2*|y, θ_1*).   (6)

The first ordinate is estimated by averaging the complete conditional density over the main Gibbs draws, π̂(θ_1*|y) = G^{-1} Σ_g π(θ_1*|y, θ_2^(g), z^(g)), whereas the second is estimated from a reduced conditional run in which θ_1 is held fixed at θ_1*. Although the reduced conditional run leads to an increase in the number of iterations, it is important to stress that it does not require new programming and thus is straightforward to implement. Note that the reduced conditional run is not necessary if z is absent from the sampling. In this case the reduced conditional density of θ_2 is identical to its complete conditional density, and the density estimate reduces to one used by Zellner and Min (1995) in a different context. Substituting the two density estimates into (6) yields the estimate

    π̂(θ*|y) = π̂(θ_1*|y) π̂(θ_2*|y, θ_1*).
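To make the two-block recipe concrete, here is a minimal, self-contained sketch in Python (assuming NumPy and SciPy; the toy model, simulated data, and hyperparameters are illustrative inventions, not the article's examples). Because z is absent and there are only two blocks, π(σ²*|y, μ*) is available exactly and no reduced run is needed, which is precisely the Zellner and Min case noted above.

```python
import numpy as np
from scipy import stats

# Toy model (illustrative): y_i ~ N(mu, sig2), mu ~ N(mu0, tau02),
# sig2 ~ IG(a, b).  Blocks: theta1 = mu, theta2 = sig2, no latent z.
rng = np.random.default_rng(0)
y = rng.normal(1.0, 1.5, size=50)
n, ybar = y.size, y.mean()
mu0, tau02, a, b = 0.0, 10.0, 3.0, 3.0

# Main Gibbs run over pi(mu, sig2 | y).
G, mus, sig2s = 5000, [], []
mu, sig2 = ybar, y.var()
for g in range(G):
    v = 1.0 / (1.0 / tau02 + n / sig2)                  # mu | y, sig2
    mu = rng.normal(v * (mu0 / tau02 + n * ybar / sig2), np.sqrt(v))
    sig2 = stats.invgamma.rvs(a + n / 2.0,              # sig2 | y, mu
                              scale=b + 0.5 * np.sum((y - mu) ** 2),
                              random_state=rng)
    mus.append(mu); sig2s.append(sig2)

# High density point theta*: the posterior mean.
mu_s, sig2_s = np.mean(mus), np.mean(sig2s)

# pi(mu*|y): Rao-Blackwellized average of the complete conditional ordinate.
sig2s = np.array(sig2s)
vs = 1.0 / (1.0 / tau02 + n / sig2s)
ms = vs * (mu0 / tau02 + n * ybar / sig2s)
log_ord_mu = np.log(np.mean(stats.norm.pdf(mu_s, ms, np.sqrt(vs))))

# pi(sig2*|y, mu*): exact, because it equals the complete conditional.
log_ord_sig2 = stats.invgamma.logpdf(
    sig2_s, a + n / 2.0, scale=b + 0.5 * np.sum((y - mu_s) ** 2))

# BMI: log m(y) = log f(y|theta*) + log pi(theta*) - log pi(theta*|y).
log_lik = np.sum(stats.norm.logpdf(y, mu_s, np.sqrt(sig2_s)))
log_prior = (stats.norm.logpdf(mu_s, mu0, np.sqrt(tau02))
             + stats.invgamma.logpdf(sig2_s, a, scale=b))
print(f"estimated log m(y) = {log_lik + log_prior - (log_ord_mu + log_ord_sig2):.3f}")
```

The same skeleton extends to data augmentation: z would be drawn inside the loop, and π(σ²*|y, μ*) would instead be averaged over a reduced run with μ fixed at μ*.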
2.1.2 Three Vector Blocks. An even larger class of models can be covered by slightly generalizing the Tanner and Wong structure. Suppose that the Gibbs sampler is defined through the complete conditional densities

    π(θ_1 | y, θ_2, θ_3, z),   π(θ_2 | y, θ_1, θ_3, z),   π(θ_3 | y, θ_1, θ_2, z),   p(z | y, θ_1, θ_2, θ_3).

Models such as linear regression, linear regression with independent Student-t errors, Zellner's seemingly unrelated regression, censored regression, and many other models either fall in this category or are a special case of this structure if z is absent. Once again, the objective is to estimate π(θ*|y), which now is expressed as

    π(θ*|y) = π(θ_1*|y) π(θ_2*|y, θ_1*) π(θ_3*|y, θ_1*, θ_2*).

Then π(θ_2*|y, θ_1*) is estimated as G^{-1} Σ_j π(θ_2*|y, θ_1*, θ_3^(j), z^(j)), where the draws {θ_3^(j), z^(j)} are obtained by continuing the Gibbs sampler with the reduced conditional densities

    π(θ_2 | y, θ_1*, θ_3, z),   π(θ_3 | y, θ_1*, θ_2, z),   and   p(z | y, θ_1*, θ_2, θ_3).

Finally, additional G iterations with the densities

    π(θ_3 | y, θ_1*, θ_2*, z)   and   p(z | y, θ_1*, θ_2*, θ_3)

produce draws {z^(j)} that follow the distribution [z|y, θ_1*, θ_2*]. These draws yield an estimate π̂(θ_3*|y, θ_1*, θ_2*). This technique is illustrated in Section 4.2 for mixture models.

2.1.3 General Case. Although the technique described thus far will apply to many problems of importance, consider the situation with an arbitrary number of blocks. Even in this case, the posterior density ordinate can be estimated rather easily. Begin by writing the posterior density at the selected point as

    π(θ*|y) = π(θ_1*|y) × π(θ_2*|y, θ_1*) × ··· × π(θ_B*|y, θ_1*, ..., θ_{B-1}*),

where each reduced conditional ordinate is the integral

    π(θ_r*|y, θ_1*, ..., θ_{r-1}*) = ∫ π(θ_r*|y, θ_1*, ..., θ_{r-1}*, θ_l (l > r), z) dp(θ_l (l > r), z | y, θ_1*, ..., θ_{r-1}*),   (10)

and can therefore be estimated by averaging the complete conditional density of θ_r, evaluated at θ_r*, over draws from the reduced run in which θ_1, ..., θ_{r-1} are held fixed at their starred values.
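The general case can be organized as a single driver around user-supplied conditional samplers and ordinate evaluators. The Python sketch below uses an assumed interface invented for illustration (the article supplies no code, and all function and argument names here are hypothetical); stage r = 0 is just the main Gibbs run, and each later stage is a reduced run with one more block pinned at θ*.

```python
import numpy as np

def chib_log_posterior_ordinate(block_samplers, z_sampler, block_ordinates,
                                theta_star, theta0, z0, G, rng):
    """Estimate log pi(theta*|y) = sum_r log pi(theta_r*|y, theta_1*, ...,
    theta_{r-1}*) by reduced conditional Gibbs runs.

    Hypothetical interface (illustration only):
      block_samplers[r](theta, z, rng) -> draw block r from its complete
          conditional, given the other entries of theta and the latent z
      z_sampler(theta, z, rng)         -> draw the latent data z
      block_ordinates[r](theta, z)     -> value of the complete conditional
          density of block r at theta[r], given the rest of theta and z
    """
    B = len(theta_star)
    theta, z = list(theta0), z0        # blocks get pinned at theta* in turn
    log_ord = 0.0
    for r in range(B):                 # stage r = 0 is the main Gibbs run
        vals = np.empty(G)
        for g in range(G):
            for j in range(r, B):      # blocks < r stay fixed at theta*
                theta[j] = block_samplers[j](theta, z, rng)
            z = z_sampler(theta, z, rng)
            # Average the conditional ordinate at theta_r*, the integrand
            # of (10), over the draws of the free blocks and z.
            vals[g] = block_ordinates[r](
                theta[:r] + [theta_star[r]] + theta[r + 1:], z)
        log_ord += np.log(vals.mean())
        theta[r] = theta_star[r]       # pin block r before the next stage
    return log_ord
```

Adding log f(y|θ*) + log π(θ*) and subtracting the returned log ordinate then gives the marginal likelihood estimate, exactly as in the two-block case.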
2.2 Bayes Factor Estimate

To compute the Bayes factor for any two models k and l, that is, m(y|M_k)/m(y|M_l), the calculation described earlier is repeated for all models, and the following estimate is used:

    B̂_kl = m̂(y|M_k) / m̂(y|M_l).

An estimate of the posterior odds of any two models is given by multiplying the estimated Bayes factor by the prior odds.

2.3 Remarks

In some situations there are two sets of latent vectors (z, ψ) such that the density f(y|θ, ψ) = ∫ f(y, z|θ, ψ) dz is available in closed form but the likelihood f(y|θ) = ∫ f(y, ψ|θ) dψ is not. This occurs, for example, in discrete response data models with random effects. To analyze this situation, one can use the BMI expression

    m(y) = f(y|θ*, ψ*) π(θ*, ψ*) / π(θ*, ψ*|y).

Both the numerator and denominator can be evaluated at the point (θ*, ψ*), and the posterior ordinate π(θ*, ψ*|y) can be estimated using the method in Section 2.1 by treating ψ as an additional block.

The BMI can also be used to assess the convergence of the Gibbs sampler, by computing and monitoring its stability across iterations. Such an idea, combined with a different approach for computing the posterior density, appears in the Gibbs stopper proposed by Ritter and Tanner (1992). Raftery (1994) mentioned using the kernel estimate of the posterior density in connection with the BMI, but the resulting estimate can inherit the inaccuracy of the kernel method, especially in high dimensions. Finally, another identity similar to the BMI is available in the prediction context. Suppose that y_f denotes an out-of-sample observation. Then the Bayesian prediction density, f(y_f|y) = ∫ f(y_f|y, θ) π(θ|y) dθ, can be expressed as

    f(y_f|y) = f(y_f|y, θ) π(θ|y) / π(θ|y, y_f)

(see Besag 1989). This identity follows in a straightforward manner from the definition of the posterior density π(θ|y, y_f) and cross-multiplying. Besag (1989) alluded to a different proof.

3. NUMERICAL STANDARD ERROR

As mentioned in the preceding section, the proposed density estimation procedure is likely to produce an accurate estimate of π(θ*|y) at the point θ*. In fact, it is possible to calculate the accuracy achieved by a computation that uses the Gibbs output. This calculation yields the numerical standard error of the marginal density estimate (or, equivalently, that of the posterior density estimate). The numerical standard error gives the variation that can be expected in the estimate if the simulation were to be done afresh, but the point at which the ordinate is evaluated is kept fixed.

To concentrate on the main ideas, consider the case in Section 2.1.1 with two blocks and define the vector stochastic process

    h^(g) = (h_1^(g), h_2^(g))' = (π(θ_1*|y, θ_2^(g), z^(g)), π(θ_2*|y, θ_1*, z^(g)))',

where in the first component the latent vector (θ_2, z) follows [·|y], while in the second component the latent vector z follows the distribution [·|y, θ_1*]. In general, h is a B × 1 vector with the rth component given by π(θ_r*|y, θ_1*, ..., θ_{r-1}*, θ_l (l > r), z), the integrand of (10).

It should be noted that due to the procedure used to estimate the reduced conditional ordinate, the second component of h is approximately independent of the first. But for expositional simplicity, it is worthwhile to proceed with the vector formulation. Then in this notation,

    ĥ = G^{-1} Σ_{g=1}^{G} h^(g),

and our objective is to find the variance of two functions of h, namely ψ_1 = h_1 × h_2 and ψ_2 = ln(h_1) + ln(h_2) ≡ ln π̂(θ_1*|y) + ln π̂(θ_2*|y, θ_1*). The variance of these two functions is found by the delta method as soon as the variance of h is determined. Because h inherits the ergodicity of the Gibbs output, it follows by the ergodic theorem (Tierney 1994) that ĥ → μ almost surely, where μ = (π(θ_1*|y), π(θ_2*|y, θ_1*))',

    lim_{G→∞} G E{(ĥ − μ)(ĥ − μ)'} = 2πS(0),

and S(0) is the spectral density matrix at frequency zero. An estimate of Ω = 2πS(0) can be obtained by the approach of Newey and West (1987) or Geweke (1992). If

    Ω̂ = Ω̂_0 + Σ_{s=1}^{q} (1 − s/(q + 1)) (Ω̂_s + Ω̂_s'),   with   Ω̂_s = G^{-1} Σ_{g=s+1}^{G} (h^(g) − ĥ)(h^(g−s) − ĥ)',

then the variance of ĥ can be estimated by G^{-1} Ω̂, where q is some constant, essentially the value at which the autocorrelation function tapers off. In the applications to follow, q is conservatively set equal to 10, although there was negligible to vanishing serial correlation in the h^(g) process. The variance of ψ_2, for example, is found by the delta method to be

    Var(ψ̂_2) ≈ G^{-1} (∂ψ_2/∂h)' Ω̂ (∂ψ_2/∂h),

where the derivative vector consists of elements h_1^{-1} and h_2^{-1}. The square root of this variance is the numerical standard error of the marginal likelihood in the log scale.
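The numerical standard error computation can be coded directly from the h^(g) process. In the sketch below (Python; the synthetic autocorrelated series at the end is purely illustrative), Ω is estimated with Bartlett-weighted autocovariances up to lag q, in the spirit of Newey and West (1987), and the delta method is applied to ψ_2 = Σ_r ln ĥ_r.

```python
import numpy as np

def nse_log_marglik(h, q=10):
    """Numerical SE of the log marginal likelihood from the ordinate
    process h (a G x B array whose g-th row is h^(g) of Section 3).
    q is the lag at which the autocorrelation is assumed to taper off;
    the article conservatively uses q = 10."""
    G, _ = h.shape
    hbar = h.mean(axis=0)
    d = h - hbar
    omega = d.T @ d / G                       # lag-0 term of Omega-hat
    for s in range(1, q + 1):                 # Bartlett-weighted lag terms
        w = 1.0 - s / (q + 1.0)
        gam = d[s:].T @ d[:-s] / G            # lag-s autocovariance
        omega += w * (gam + gam.T)
    grad = 1.0 / hbar                         # d psi2 / d h  =  (1/h_r)
    return np.sqrt(grad @ omega @ grad / G)   # delta method: Var = d'Ωd/G

# Illustrative use on a synthetic, mildly autocorrelated ordinate process.
rng = np.random.default_rng(1)
G = 5000
h = np.empty((G, 2)); h[0] = 1.0
for g in range(1, G):
    h[g] = 1.0 + 0.5 * (h[g - 1] - 1.0) + rng.normal(0.0, 0.05, size=2)
print(nse_log_marglik(h))                     # numerical SE of log m-hat(y)
```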
4. EXAMPLES

In this section the approach developed earlier is applied to two important classes of models. In particular, the methods are discussed in the context of variable selection in binary probit regression models and in the context of two broad classes of finite mixture models, the iid mixture model and the Markov mixture model.

By way of notation, for a d-dimensional normal random vector with mean μ and covariance matrix Σ, the density at the point t is denoted by

    φ(t|μ, Σ) = (2π)^{−d/2} |Σ|^{−1/2} exp{−(t − μ)' Σ^{−1} (t − μ)/2},

the inverse gamma density at the point s is denoted by

    p_IG(s|a, b) ≡ (b^a / Γ(a)) (1/s)^{a+1} exp(−b/s),

and, finally, for an m-vector q on the unit simplex, the Dirichlet D(α_1, α_2, ..., α_m) density is denoted by

    p_D(q|α_1, ..., α_m) ≡ Γ(Σ_j α_j) q_1^{α_1 − 1} ··· q_m^{α_m − 1} / Π_j Γ(α_j).

4.1 Binary Probit Regression

Consider the data in Table 1 on the presence of prostatic nodal involvement collected on 53 patients with cancer of the prostate. The data (reported in the study by Brown (1980); see also Collett 1991) include a binary response variable y that takes the value 1 if cancer had spread to the surrounding lymph nodes and the value zero otherwise. The objective is to explain the binary response with five variables: age of the patient in years at diagnosis (x_1); level of serum acid phosphate (x_2); the result of an X-ray examination, coded 0 if negative and 1 if positive (x_3); the size of the tumor, coded 0 if small and 1 if large (x_4); and the pathological grade of the tumor, coded 0 if less serious and 1 if more serious (x_5).

The probability of positive response can be explained through a probit link function or, as by Collett (1991), by a logit link. If interactions and powers of explanatory variables are excluded, then there are 32 possible models that can be fit. Collett's finding from the classical deviance statistic (−2 times the maximized log-likelihood) is that the logistic model containing log(x_2), x_3, and x_4 provides a suitable fit for the data among these 32 models. These data are reanalyzed to demonstrate the computation of the marginal likelihood using nine of these models (defined later and selected entirely for illustrative purposes). Under model M_k, suppose that

    Pr(y_i = 1 | M_k, β_k) = Φ(x_{ik}' β_k),

where Φ(·) is the cumulative distribution function of the standard normal density, x_{ik} are the covariates included in model M_k, and β_k is the corresponding regression parameter vector. The likelihood function under M_k, assuming a random sample, is then

    f(y | M_k, β_k) = Π_{i=1}^{53} Φ(x_{ik}' β_k)^{y_i} [1 − Φ(x_{ik}' β_k)]^{1 − y_i}.
For this situation, the marginal likelihood can be computed rather simply by the Laplace method (see Kass and Raftery 1995), but given the small sample size, it is difficult to know the accuracy of the Laplace approximation. Harmonic mean type estimators, on the other hand, are rather more difficult to obtain with this likelihood, because its tails generally decline quite sharply.

A procedure that works extremely well in conjunction with the technique developed above is the data augmentation-Gibbs sampling method of Albert and Chib (1993a). Suppose that the prior information about β_k is weak, but not improper, and is represented by a multivariate normal prior with the mean of each parameter equal to .75 (because each covariate is expected to have a positive impact on the probability of response) and a standard deviation of 5. Under the assumption that the parameters are independent, the prior of β_k takes the form

    π(β_k) = φ(β_k | .75 ι, 25 I),

where ι is a vector of ones and I is the identity matrix. From the Gibbs output of 5,000 draws collected after a transient stage of 500 iterations, the estimate β* = Σ_g β^(g)/5,000 is obtained. Then the logarithm of the marginal likelihood of model M_k is

    log m̂(y|M_k) = log f(y|M_k, β*) + log π(β*|M_k) − log π̂(β*|y, M_k).   (12)

The results are summarized in Table 2, where for each of nine models, the maximized likelihood is reported along with the degrees of freedom, the log of the marginal likelihood, and its numerical standard error. From this table it can be seen that the marginal likelihood is very precisely estimated in all the fitted models. Of course, these results are obtained with G = 5,000 draws, and further improvements in accuracy can be achieved by increasing G. For comparison, the BMI expression was evaluated at a point that was one posterior standard deviation from β*. As expected, this led to an increase in the numerical standard error of the estimate, which, for example, was .26 in M_9 with G = 5,000. The Laplace method was also used to determine the marginal likelihood, and the results were in agreement up to the second decimal place. We also examined whether a multivariate kernel estimate of the posterior ordinate (with a Gaussian product kernel) could be used in the BMI expression. This procedure did not produce equally accurate results. Also note that x_1 (the age variable) does not improve on the model with just a constant (the Bayes factor for the second model vs. the first is .009), whereas the model with the variable x_3 (X-ray) has a substantially larger Bayes factor.
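Putting the pieces together for this example, the following Python sketch implements the Albert and Chib (1993a) data augmentation sampler together with the marginal likelihood formula (12). It is a minimal reconstruction, not the article's code: the inputs y and X must be supplied (Table 1 is not reproduced here), and with the prior above one would set b0 = .75 ι and B0 = 25 I.

```python
import numpy as np
from scipy import stats

def probit_log_marglik(y, X, b0, B0, G=5000, rng=None):
    """Chib's estimate of log m(y) for the probit model with prior
    beta ~ N(b0, B0).  The ordinate pi(beta*|y) is the average of the
    exact N(bhat(z), Bhat) complete conditional densities (Sec. 2.1.1)."""
    rng = rng or np.random.default_rng(0)
    n, k = X.shape
    B0inv = np.linalg.inv(B0)
    Bhat = np.linalg.inv(B0inv + X.T @ X)       # cov of beta | y, z
    prior_part = B0inv @ b0
    lo = np.where(y == 1, 0.0, -np.inf)         # z_i truncated to (0, inf)
    hi = np.where(y == 1, np.inf, 0.0)          #   if y_i = 1, else (-inf, 0)
    beta, betas, bhats = np.zeros(k), [], []
    for g in range(G):
        m = X @ beta                            # z_i | y_i, beta ~ TN(m_i, 1)
        z = stats.truncnorm.rvs(lo - m, hi - m, loc=m, scale=1.0,
                                random_state=rng)
        bhat = Bhat @ (prior_part + X.T @ z)    # beta | y, z ~ N(bhat, Bhat)
        beta = rng.multivariate_normal(bhat, Bhat)
        betas.append(beta); bhats.append(bhat)
    beta_s = np.mean(betas, axis=0)             # beta*: posterior mean
    # Rao-Blackwellized posterior ordinate at beta*; no reduced run is
    # needed because beta is a single block.
    log_ord = np.log(np.mean(
        [stats.multivariate_normal.pdf(beta_s, m, Bhat) for m in bhats]))
    p = stats.norm.cdf(X @ beta_s)              # likelihood and prior at beta*
    log_lik = np.sum(y * np.log(p) + (1 - y) * np.log1p(-p))
    log_prior = stats.multivariate_normal.logpdf(beta_s, b0, B0)
    return log_lik + log_prior - log_ord        # equation (12)
```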
4.2 Finite Mixture Models

An estimate of the marginal likelihood is given by substituting these quantities into (12). Our results, which are based on G = 5,000 draws, are summarized in Table 4. (Almost identical results were obtained when the BMI expression was evaluated at the posterior mean instead of the approximate maximum likelihood value.) First, the two-component model is clearly dominated by both three-component models. Second, the three-component model with σ² unrestricted appears to be better than the three-component model with σ² restricted to be the same across components. This result would not be obvious from just looking at posterior distributions of the fitted models, because all the parameters in both three-component models are tightly estimated. Third, all the numerical standard errors are small, indicating that the marginal likelihood has been accurately estimated.

4.2.2 Markov Mixture Model. As a final illustration of the value of our approach, consider data on the quarterly growth rates of U.S. gross national product (GNP) for the postwar period 1951.2 to 1992.4. Many different time series models have been fit to these data, and our objective is to demonstrate how the marginal likelihood can be calculated in one particular case, of substantial practical importance, for which this calculation has hitherto not been attempted.

The model of interest is the Markov mixture model, also sometimes referred to as the Markov switching model (Goldfeld and Quandt 1973; Hamilton 1989). Let y_t denote the growth rate of GNP (multiplied by 100), and suppose that y_t, given the latent state z_t = j, is distributed as N(μ_j, σ²), where z_t follows a two-state Markov chain with transition probabilities p_ij = Pr(z_t = j | z_{t−1} = i). The one-step-ahead density of y_t is then

    f(y_t | Y_{t−1}, θ) = Σ_{j=1}^{2} p(z_t = j | Y_{t−1}, θ) φ(y_t | μ_j, σ²),

where Y_{t−1} is the observed data up to time t − 1 and p(z_t = 1|Y_{t−1}, θ) is a time-varying conditional probability. The joint density of all the data is then

    f(y | θ) = Π_{t=1}^{n} f(y_t | Y_{t−1}, θ).

A little reflection shows that, given z, this model has the same structure as the iid mixture model, and thus the marginal likelihood calculation proceeds in virtually the same way. The complete conditional densities of (μ, σ²) are identical to those in the iid mixture model, and, if one assumes that the prior density on q_i (the ith row of the transition matrix) is Dirichlet(α_{i1}, α_{i2}), then

    q_i | z ~ Dirichlet(α_{i1} + n_{i1}, α_{i2} + n_{i2}),

where n_{ij} denotes the number of one-step transitions from i to j in the sequence z (see Albert and Chib 1993b). A decomposition similar to (18) is again available, and each of the ordinates can be estimated by the reduced conditional Gibbs sampling procedure described earlier.

The Gibbs implementation of this model, and the calculation of the marginal likelihood, require the simulation of the latent variables z from p(z|y, θ). As described by Chib (1993), the latent variables are simulated through the following recursive steps, which are initiated with p(z_0 = i|Y_0, θ). These recursions require one pass from t = 1 to n and then a second pass from t = n to t = 1.

Step 1: Repeat for t = 1, 2, ..., n.
    Prediction step: Calculate
        p(z_t = j | Y_{t−1}, θ) = Σ_{i=1}^{2} p_ij p(z_{t−1} = i | Y_{t−1}, θ).
    Update step: Calculate
        p(z_t = j | Y_t, θ) ∝ p(z_t = j | Y_{t−1}, θ) f(y_t | z_t = j, θ).
Step 2: In the second pass, from t = n to t = 1, sample each z_t given the data and the previously sampled z_{t+1}.

The prior distributions are (μ_1, μ_2)' ~ N((0, .75)', 2I), σ² ~ IG(4, 4), q_1 ~ Dirichlet(4, 1), and q_2 ~ Dirichlet(1, 4). The prior and posterior moments of the parameters, along with the estimated log marginal likelihood, are as follows:

                    Prior                    Posterior
    Parameter   Mean    Std dev       Mean    Std dev
    μ_1         0       1.414         −.313   .314
    μ_2         .75     1.414         1.038   .111
    σ²          1.33    .943          .672    .089
    p_11        .8      .163          .743    .098
    p_22        .8      .163          .911    .042
    Log marginal likelihood: −229.496 (.028)

For comparison, a Gaussian first-order autoregressive model was also fit, after conditioning on the first observation. Under the prior β ~ N_2(0, diag(10, 10)) and σ² ~ IG(3, 3), the log marginal likelihood is estimated to be −231.94. Thus the data support the Markov mixture model over the first-order autoregressive model.
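For completeness, the recursive simulation of z and the transition probability update described above can be sketched as follows (Python; a generic m-state version with hypothetical argument names, written from the text's prediction/update/backward-pass description rather than from any code in the article):

```python
import numpy as np
from scipy import stats

def sample_states(y, mu, sig2, P, p0, rng):
    """Draw z from p(z|y, theta) for a Markov mixture with
    y_t | z_t = j ~ N(mu_j, sig2): a forward pass of prediction and
    update steps, then a backward sampling pass from t = n to t = 1."""
    n, m = y.size, P.shape[0]
    dens = stats.norm.pdf(y[:, None], mu, np.sqrt(sig2))  # f(y_t | z_t = j)
    filt = np.empty((n, m))                # p(z_t = j | Y_t, theta)
    prob = p0                              # p(z_0 = i | Y_0, theta)
    for t in range(n):
        pred = prob @ P                    # prediction step
        upd = pred * dens[t]               # update step (unnormalized)
        prob = filt[t] = upd / upd.sum()
    z = np.empty(n, dtype=int)             # backward pass
    z[-1] = rng.choice(m, p=filt[-1])
    for t in range(n - 2, -1, -1):
        w = filt[t] * P[:, z[t + 1]]       # p(z_t | Y_t, z_{t+1}, theta)
        z[t] = rng.choice(m, p=w / w.sum())
    return z

def update_transitions(z, alpha, rng):
    """Draw each row q_i of the transition matrix from its Dirichlet
    complete conditional, Dirichlet(alpha_i + n_i), where n_ij counts
    one-step transitions i -> j in the sequence z."""
    m = alpha.shape[0]
    counts = np.zeros((m, m))
    np.add.at(counts, (z[:-1], z[1:]), 1.0)
    return np.vstack([rng.dirichlet(alpha[i] + counts[i]) for i in range(m)])
```

Given z, the draws of (μ, σ²) and the ordinate estimates then proceed exactly as in the iid mixture case, with the reduced runs of Section 2.1 supplying each conditional ordinate.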
5. CONCLUDING REMARKS

In summary, this article has developed and illustrated a new approach to calculating the marginal likelihood that relies on the output of the Gibbs sampling algorithm. The approach is fully automatic and stable, requiring no inputs beyond the draws from the simulation. Thus draws from the prior, or additional maximizations, or importance sampling functions, or any other tuning function, are not required. It was shown that the numerical standard error of the estimate can be derived from the posterior sample, and the calculations are exhibited in problems dealing with probit regression and finite mixture models. In all the examples, the marginal likelihood is estimated easily and very accurately. As a result, this approach should encourage the routine calculation of Bayes factors in models estimated by the Gibbs sampler.

[Received May 1994. Revised February 1995.]

REFERENCES

Albert, J., and Chib, S. (1993a), "Bayesian Analysis of Binary and Polychotomous Response Data," Journal of the American Statistical Association, 88, 669-679.
——— (1993b), "Bayes Inference Via Gibbs Sampling of Autoregressive Time Series Subject to Markov Mean and Variance Shifts," Journal of Business & Economic Statistics, 11, 1-15.
Berger, J. (1985), Statistical Decision Theory and Bayesian Analysis, New York: Springer-Verlag.
Besag, J. (1989), "A Candidate's Formula: A Curious Result in Bayesian Prediction," Biometrika, 76, 183.
Brown, B. W. (1980), "Prediction Analyses for Binary Data," in Biostatistics Casebook, eds. R. J. Miller, B. Efron, B. W. Brown, and L. E. Moses, New York: John Wiley.
Carlin, B., and Chib, S. (1995), "Bayesian Model Choice Via Markov Chain Monte Carlo," Journal of the Royal Statistical Society, Ser. B, 57, 473-484.
Carlin, B., and Polson, N. (1991), "Inference for Nonconjugate Bayesian Models Using Gibbs Sampling," Canadian Journal of Statistics, 19, 399-405.
Collett, D. (1991), Modelling Binary Data, London: Chapman and Hall.
Gelfand, A. E., and Dey, D. K. (1994), "Bayesian Model Choice: Asymptotics and Exact Calculations," Journal of the Royal Statistical Society, Ser. B, 56, 501-514.
Gelfand, A. E., and Smith, A. F. M. (1990), "Sampling-Based Approaches to Calculating Marginal Densities," Journal of the American Statistical Association, 85, 398-409.
Geweke, J. (1992), "Evaluating the Accuracy of Sampling-Based Approaches to the Calculation of Posterior Moments," in Proceedings of the Fourth Valencia International Conference on Bayesian Statistics, eds. J. M. Bernardo, J. O. Berger, A. P. Dawid, and A. F. M. Smith, New York: Oxford University Press, pp. 169-193.
Goldfeld, S. M., and Quandt, R. E. (1973), "A Markov Model for Switching Regressions," Journal of Econometrics, 1, 3-16.
Hamilton, J. D. (1989), "A New Approach to the Economic Analysis of Nonstationary Time Series and the Business Cycle," Econometrica, 57, 357-384.
Kass, R. E., and Raftery, A. E. (1995), "Bayes Factors," Journal of the American Statistical Association, 90, 773-795.
Newey, W. K., and West, K. D. (1987), "A Simple, Positive Semi-Definite, Heteroskedasticity and Autocorrelation Consistent Covariance Matrix," Econometrica, 55, 703-708.
Newton, M. A., and Raftery, A. E. (1994), "Approximate Bayesian Inference by the Weighted Likelihood Bootstrap" (with discussion), Journal of the Royal Statistical Society, Ser. B, 56, 3-48.
O'Hagan, A. (1994), Bayesian Inference (Kendall's Advanced Theory of Statistics, Vol. 2B), London: Edward Arnold.
Postman, M., Huchra, J. P., and Geller, M. J. (1986), "Probes of Large-Scale Structures in the Corona Borealis Region," The Astronomical Journal, 92, 1238-1247.
Raftery, A. E. (1994), "Hypothesis Testing and Model Selection Via Posterior Simulation," unpublished manuscript, University of Washington, Dept. of Statistics.
Ritter, C., and Tanner, M. A. (1992), "Facilitating the Gibbs Sampler: The Gibbs Stopper and the Griddy-Gibbs Sampler," Journal of the American Statistical Association, 87, 861-868.
Roeder, K. (1990), "Density Estimation With Confidence Sets Exemplified by Superclusters and Voids in Galaxies," Journal of the American Statistical Association, 85, 617-624.
Scott, D. W. (1992), Multivariate Density Estimation, New York: John Wiley.
Tanner, M. A., and Wong, W. (1987), "The Calculation of Posterior Distributions by Data Augmentation" (with discussion), Journal of the American Statistical Association, 82, 528-550.
Tierney, L. (1994), "Markov Chains for Exploring Posterior Distributions" (with discussion), The Annals of Statistics, 22, 1701-1762.
West, M. (1992), "Modelling With Mixtures" (with discussion), in Bayesian Statistics 4, eds. J. M. Bernardo, J. O. Berger, A. P. Dawid, and A. F. M. Smith, Oxford, U.K.: Oxford University Press, pp. 503-524.
Zellner, A., and Min, C. (1995), "Gibbs Sampler Convergence Criteria (GSC2)," Journal of the American Statistical Association, 90, 921-927.