The Bayesian Information Criterion

This file contains slides describing the utility of discrimination criteria for detecting the orders of a stationary time series. Several exist in the literature (Akaike, Bayes, Hannan-Quinn). Presented here is the Bayesian criterion (or Schwarz criterion). Although the proofs are incomplete, the general line of argument is introduced with the necessary mathematical notation.


Introduction

Overview

Derivation

BIC and Bayes Factors

BIC vs. AIC

Use of BIC

171:290 Model Selection


Lecture V: The Bayesian Information Criterion
Joseph E. Cavanaugh
Department of Biostatistics
Department of Statistics and Actuarial Science
The University of Iowa

September 25, 2012


Introduction
BIC, the Bayesian information criterion, was introduced by
Schwarz (1978) as a competitor to the Akaike (1973, 1974)
information criterion.
Schwarz derived BIC to serve as an asymptotic approximation
to a transformation of the Bayesian posterior probability of a
candidate model.
In large-sample settings, the fitted model favored by BIC
ideally corresponds to the candidate model which is
a posteriori most probable; i.e., the model which is rendered
most plausible by the data at hand.
The computation of BIC is based on the empirical
log-likelihood and does not require the specification of priors.

Introduction
In Bayesian applications, pairwise comparisons between
models are often based on Bayes factors.
Assuming two candidate models are regarded as equally
probable a priori, a Bayes factor represents the ratio of the
posterior probabilities of the models. The model which is
a posteriori most probable is determined by whether the Bayes
factor is less than or greater than one.
In certain settings, model selection based on BIC is roughly
equivalent to model selection based on Bayes factors (Kass
and Raftery, 1995; Kass and Wasserman, 1995).
Thus, BIC has appeal in many Bayesian modeling problems
where priors are hard to set precisely.

Introduction

Outline:
Overview of BIC
Derivation of BIC
BIC and Bayes Factors
BIC versus AIC
Use of BIC


Overview of BIC

Key Constructs:
True or generating model: $g(y)$.
Candidate or approximating model: $f(y \mid \theta_k)$.
Candidate class: $\mathcal{F}(k) = \{ f(y \mid \theta_k) \mid \theta_k \in \Theta(k) \}$.
Fitted model: $f(y \mid \hat{\theta}_k)$.


Overview of BIC

Akaike information criterion:
$$\mathrm{AIC} = -2 \ln f(y \mid \hat{\theta}_k) + 2k.$$
Bayesian (Schwarz) information criterion:
$$\mathrm{BIC} = -2 \ln f(y \mid \hat{\theta}_k) + k \ln n.$$
AIC and BIC feature the same goodness-of-fit term.
The penalty term of BIC is more stringent than the penalty term of AIC. (For $n \ge 8$, $k \ln n$ exceeds $2k$.)
Consequently, BIC tends to favor smaller models than AIC.
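The two criteria differ only in their penalty terms, which is easy to see numerically. A minimal sketch (the function names are my own, not from the slides) computing both criteria from a maximized log-likelihood:

```python
import math

def aic(max_log_lik, k):
    # AIC = -2 ln f(y | theta_hat) + 2k
    return -2.0 * max_log_lik + 2.0 * k

def bic(max_log_lik, k, n):
    # BIC = -2 ln f(y | theta_hat) + k ln n
    return -2.0 * max_log_lik + k * math.log(n)

# Same goodness-of-fit term; for n >= 8 the BIC penalty k*ln(n) exceeds 2k.
print(aic(-100.0, 3))      # 206.0
print(bic(-100.0, 3, 50))  # 200 + 3*ln(50), roughly 211.74
```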


Overview of BIC
The Bayesian information criterion is often called the Schwarz
information criterion.
Common acronyms: BIC, SIC, SBC, SC.
AIC provides an asymptotically unbiased estimator of the
expected Kullback discrepancy between the generating model
and the fitted approximating model.
BIC provides a large-sample estimator of a transformation of
the Bayesian posterior probability associated with the
approximating model.
By choosing the fitted candidate model corresponding to the
minimum value of BIC, one is attempting to select the
candidate model corresponding to the highest Bayesian
posterior probability.

Overview of BIC

BIC was justified by Schwarz (1978) for the case of independent, identically distributed observations and linear models, under the assumption that the likelihood is from the regular exponential family.
Generalizations of Schwarz's derivation are presented by Stone (1979), Leonard (1982), Kashyap (1982), Haughton (1988), and Cavanaugh and Neath (1999).
We will consider a justification which is general, yet informal.


Derivation of BIC

Let $y$ denote the observed data.
Assume that $y$ is to be described using a model $M_k$ selected from a set of candidate models $M_{k_1}, M_{k_2}, \ldots, M_{k_L}$.
Assume that each $M_k$ is uniquely parameterized by a vector $\theta_k$, where $\theta_k$ is an element of the parameter space $\Theta(k)$ ($k \in \{k_1, k_2, \ldots, k_L\}$).
Let $L(\theta_k \mid y)$ denote the likelihood for $y$ based on $M_k$. Note: $L(\theta_k \mid y) = f(y \mid \theta_k)$.
Let $\hat{\theta}_k$ denote the maximum likelihood estimate of $\theta_k$, obtained by maximizing $L(\theta_k \mid y)$ over $\Theta(k)$.


Derivation of BIC

We assume that derivatives of $L(\theta_k \mid y)$ up to order two exist with respect to $\theta_k$, and are continuous and suitably bounded for all $\theta_k \in \Theta(k)$.
The motivation behind BIC can be seen through a Bayesian development of the model selection problem.
Let $\pi(k)$ ($k \in \{k_1, k_2, \ldots, k_L\}$) denote a discrete prior over the models $M_{k_1}, M_{k_2}, \ldots, M_{k_L}$.
Let $g(\theta_k \mid k)$ denote a prior on $\theta_k$ given the model $M_k$ ($k \in \{k_1, k_2, \ldots, k_L\}$).


Derivation of BIC
Applying Bayes' Theorem, the joint posterior of $M_k$ and $\theta_k$ can be written as
$$h((k, \theta_k) \mid y) = \frac{\pi(k)\, g(\theta_k \mid k)\, L(\theta_k \mid y)}{m(y)},$$
where $m(y)$ denotes the marginal distribution of $y$.
A Bayesian model selection rule might aim to choose the model $M_k$ which is a posteriori most probable.
The posterior probability for $M_k$ is given by
$$P(k \mid y) = m(y)^{-1}\, \pi(k) \int L(\theta_k \mid y)\, g(\theta_k \mid k)\, d\theta_k.$$


Derivation of BIC

Now consider minimizing $-2 \ln P(k \mid y)$ as opposed to maximizing $P(k \mid y)$.
We have
$$-2 \ln P(k \mid y) = 2 \ln \{m(y)\} - 2 \ln \{\pi(k)\} - 2 \ln \left\{ \int L(\theta_k \mid y)\, g(\theta_k \mid k)\, d\theta_k \right\}.$$
The term involving $m(y)$ is constant with respect to $k$; thus, for the purpose of model selection, this term can be discarded.


Derivation of BIC
We obtain
$$S(k \mid y) \equiv -2 \ln \{\pi(k)\} - 2 \ln \left\{ \int L(\theta_k \mid y)\, g(\theta_k \mid k)\, d\theta_k \right\}.$$
Now consider the integral which appears above:
$$\int L(\theta_k \mid y)\, g(\theta_k \mid k)\, d\theta_k.$$
In order to obtain an approximation to this term, we take a second-order Taylor series expansion of the log-likelihood about $\hat{\theta}_k$.

Derivation of BIC
We have
$$\ln L(\theta_k \mid y) \approx \ln L(\hat{\theta}_k \mid y) + (\theta_k - \hat{\theta}_k)' \frac{\partial \ln L(\hat{\theta}_k \mid y)}{\partial \theta_k} + \frac{1}{2} (\theta_k - \hat{\theta}_k)' \left[ \frac{\partial^2 \ln L(\hat{\theta}_k \mid y)}{\partial \theta_k\, \partial \theta_k'} \right] (\theta_k - \hat{\theta}_k)$$
$$= \ln L(\hat{\theta}_k \mid y) - \frac{1}{2} (\theta_k - \hat{\theta}_k)' \left[ n\, \bar{I}(\hat{\theta}_k, y) \right] (\theta_k - \hat{\theta}_k),$$
where
$$\bar{I}(\hat{\theta}_k, y) = -\frac{1}{n} \frac{\partial^2 \ln L(\theta_k \mid y)}{\partial \theta_k\, \partial \theta_k'} \bigg|_{\theta_k = \hat{\theta}_k}$$
is the average observed Fisher information matrix.
(The first-order term vanishes because $\hat{\theta}_k$ maximizes the log-likelihood, so the gradient at $\hat{\theta}_k$ is zero.)



Derivation of BIC
Thus,
$$L(\theta_k \mid y) \approx L(\hat{\theta}_k \mid y) \exp\left\{ -\frac{1}{2} (\theta_k - \hat{\theta}_k)'\, n\, \bar{I}(\hat{\theta}_k, y)\, (\theta_k - \hat{\theta}_k) \right\}.$$
We therefore have the following approximation for our integral:
$$\int L(\theta_k \mid y)\, g(\theta_k \mid k)\, d\theta_k \approx L(\hat{\theta}_k \mid y) \int \exp\left\{ -\frac{1}{2} (\theta_k - \hat{\theta}_k)'\, n\, \bar{I}(\hat{\theta}_k, y)\, (\theta_k - \hat{\theta}_k) \right\} g(\theta_k \mid k)\, d\theta_k.$$


Derivation of BIC

Now consider the evaluation of
$$\int \exp\left\{ -\frac{1}{2} (\theta_k - \hat{\theta}_k)'\, n\, \bar{I}(\hat{\theta}_k, y)\, (\theta_k - \hat{\theta}_k) \right\} g(\theta_k \mid k)\, d\theta_k$$
using the noninformative prior $g(\theta_k \mid k) = 1$.
In this case, we obtain
$$\int \exp\left\{ -\frac{1}{2} (\theta_k - \hat{\theta}_k)'\, n\, \bar{I}(\hat{\theta}_k, y)\, (\theta_k - \hat{\theta}_k) \right\} d\theta_k = (2\pi)^{k/2}\, \left| n\, \bar{I}(\hat{\theta}_k, y) \right|^{-1/2}.$$
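This Gaussian integral identity can be sanity-checked numerically. The sketch below verifies the $k = 1$ case with a midpoint-rule sum; the values chosen for $n\,\bar{I}(\hat{\theta}_k, y)$ and $\hat{\theta}_k$ are arbitrary assumptions of mine, made only for the check:

```python
import math

# Numerical check (k = 1) of:
#   integral of exp(-0.5 * n_info * (t - theta_hat)^2) dt
#   = (2*pi)^(1/2) * n_info^(-1/2)
n_info = 4.0       # assumed value of n * I(theta_hat, y)
theta_hat = 1.3    # assumed value of the MLE

h = 1e-4                  # midpoint-rule step size
steps = int(20.0 / h)     # integrate over theta_hat +/- 10
total = 0.0
for i in range(steps):
    t = theta_hat - 10.0 + (i + 0.5) * h
    total += math.exp(-0.5 * n_info * (t - theta_hat) ** 2)
numeric = h * total

closed_form = math.sqrt(2.0 * math.pi) / math.sqrt(n_info)
print(abs(numeric - closed_form) < 1e-6)  # True
```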


Derivation of BIC

We therefore have
$$\int L(\theta_k \mid y)\, g(\theta_k \mid k)\, d\theta_k \approx L(\hat{\theta}_k \mid y)\, (2\pi)^{k/2}\, \left| n\, \bar{I}(\hat{\theta}_k, y) \right|^{-1/2}$$
$$= L(\hat{\theta}_k \mid y)\, (2\pi)^{k/2}\, n^{-k/2}\, \left| \bar{I}(\hat{\theta}_k, y) \right|^{-1/2}$$
$$= L(\hat{\theta}_k \mid y) \left( \frac{2\pi}{n} \right)^{k/2} \left| \bar{I}(\hat{\theta}_k, y) \right|^{-1/2}.$$


Derivation of BIC

The preceding can be viewed as a variation on the Laplace method of approximating the integral
$$\int L(\theta_k \mid y)\, g(\theta_k \mid k)\, d\theta_k.$$
(See Tierney and Kadane, 1986; Kass and Raftery, 1995.)
This approximation is valid in large-sample settings provided that the prior $g(\theta_k \mid k)$ is flat over the neighborhood of $\hat{\theta}_k$ where $L(\theta_k \mid y)$ is dominant.
The prior $g(\theta_k \mid k)$ need not be noninformative, although the choice of $g(\theta_k \mid k) = 1$ makes our derivation more tractable.


Derivation of BIC
We can now write
$$S(k \mid y) = -2 \ln \{\pi(k)\} - 2 \ln \left\{ \int L(\theta_k \mid y)\, g(\theta_k \mid k)\, d\theta_k \right\}$$
$$\approx -2 \ln \{\pi(k)\} - 2 \ln \left[ L(\hat{\theta}_k \mid y) \left( \frac{2\pi}{n} \right)^{k/2} \left| \bar{I}(\hat{\theta}_k, y) \right|^{-1/2} \right]$$
$$= -2 \ln \{\pi(k)\} - 2 \ln L(\hat{\theta}_k \mid y) + k \ln \left( \frac{n}{2\pi} \right) + \ln \left| \bar{I}(\hat{\theta}_k, y) \right|.$$


Derivation of BIC

Ignoring terms in the preceding that are bounded as the sample size grows to infinity, we obtain
$$S(k \mid y) \approx -2 \ln L(\hat{\theta}_k \mid y) + k \ln n.$$
With this motivation, the Bayesian (Schwarz) information criterion is defined as follows:
$$\mathrm{BIC} = -2 \ln L(\hat{\theta}_k \mid y) + k \ln n = -2 \ln f(y \mid \hat{\theta}_k) + k \ln n.$$
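To make the definition concrete, here is a small sketch (my own construction, not from the slides) computing BIC for two i.i.d. normal candidate models: one with a fixed zero mean ($k = 1$, variance only) and one with an estimated mean ($k = 2$), using the closed-form Gaussian maximum log-likelihood:

```python
import math

def gaussian_bic(y, estimate_mean):
    # BIC = -2 ln L(theta_hat | y) + k ln n for an i.i.d. normal model.
    n = len(y)
    mu = sum(y) / n if estimate_mean else 0.0
    var = sum((v - mu) ** 2 for v in y) / n        # MLE of the variance
    max_log_lik = -0.5 * n * (math.log(2.0 * math.pi * var) + 1.0)
    k = 2 if estimate_mean else 1                  # number of free parameters
    return -2.0 * max_log_lik + k * math.log(n)

# Data centered near 3: estimating the mean should yield the smaller BIC.
y = [2.9, 3.1, 3.0, 2.8, 3.2, 3.1, 2.9, 3.0, 3.3, 2.7]
print(gaussian_bic(y, True) < gaussian_bic(y, False))  # True
```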


BIC and Bayes Factors

Consider two candidate models $M_{k_1}$ and $M_{k_2}$ in a Bayesian analysis. To choose between these models, a Bayes factor is often used.
The Bayes factor, $B_{12}$, is defined as the ratio of the posterior odds of $M_{k_1}$,
$$P(k_1 \mid y) / P(k_2 \mid y),$$
to the prior odds of $M_{k_1}$,
$$\pi(k_1) / \pi(k_2).$$
If $B_{12} > 1$, model $M_{k_1}$ is favored by the data; if $B_{12} < 1$, model $M_{k_2}$ is favored by the data.

BIC and Bayes Factors

Kass and Raftery (1995) write, "The Bayes factor is a summary of the evidence provided by the data in favor of one scientific theory, represented by a statistical model, as opposed to another."


BIC and Bayes Factors


Let $\mathrm{BIC}(k_1)$ denote BIC for model $M_{k_1}$, and let $\mathrm{BIC}(k_2)$ denote BIC for model $M_{k_2}$. Kass and Raftery (1995) argue that as $n \to \infty$,
$$\frac{2 \ln B_{12} + \left[ \mathrm{BIC}(k_1) - \mathrm{BIC}(k_2) \right]}{2 \ln B_{12}} \to 0.$$
Thus, $-\left[ \mathrm{BIC}(k_1) - \mathrm{BIC}(k_2) \right]$ can be viewed as a rough approximation to $2 \ln B_{12}$.
Kass and Raftery (1995) write, "The Schwarz criterion (or BIC) gives a rough approximation to [twice] the logarithm of the Bayes factor, which is easy to use and does not require evaluation of prior distributions. It is well suited for summarizing results in scientific communication."
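Given this relationship, a BIC difference can be converted into a rough Bayes factor: $B_{12} \approx \exp\{[\mathrm{BIC}(k_2) - \mathrm{BIC}(k_1)]/2\}$. A minimal sketch (the function name is my own):

```python
import math

def approx_bayes_factor(bic1, bic2):
    # 2 ln B12 is roughly BIC(k2) - BIC(k1), so B12 ~ exp((bic2 - bic1) / 2).
    return math.exp((bic2 - bic1) / 2.0)

# A BIC advantage of 6 for model 1 corresponds to B12 of about e^3 (~20),
# consistent with "strong" evidence on the Kass-Raftery scale.
print(round(approx_bayes_factor(100.0, 106.0), 2))  # 20.09
```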

BIC versus AIC


Recall the definitions of consistency and asymptotic efficiency.
Suppose that the generating model is of a finite dimension,
and that this model is represented in the candidate collection
under consideration. A consistent criterion will
asymptotically select the fitted candidate model having the
correct structure with probability one.
On the other hand, suppose that the generating model is of an
infinite dimension, and therefore lies outside of the candidate
collection under consideration. An asymptotically efficient
criterion will asymptotically select the fitted candidate model
which minimizes the mean squared error of prediction.
AIC is asymptotically efficient yet not consistent;
BIC is consistent yet not asymptotically efficient.

BIC versus AIC

AIC and BIC share the same goodness-of-fit term, but the
penalty term of BIC (k ln n) is potentially much more
stringent than the penalty term of AIC (2k).
Thus, BIC tends to choose fitted models that are more
parsimonious than those favored by AIC.
The differences in selected models may be especially
pronounced in large sample settings.
Intuitively, why is the complexity penalization so much greater
for BIC than for AIC?


BIC versus AIC

In Bayesian analyses, the strength of evidence required to


favor additional complexity is generally greater than in
frequentist analyses. Why?
The Bayesian analytical paradigm incorporates
estimation uncertainty and parameter uncertainty.
The frequentist analytical paradigm only incorporates
estimation uncertainty.


BIC versus AIC

From a practical perspective,


AIC could be advocated when the primary goal of the
modeling application is predictive; i.e., to build a model that
will effectively predict new outcomes.
BIC could be advocated when the primary goal of the
modeling application is descriptive; i.e., to build a model that
will feature the most meaningful factors influencing the
outcome, based on an assessment of relative importance.
As the sample size grows, predictive accuracy improves as
subtle effects are admitted to the model. AIC will increasingly
favor the inclusion of such effects; BIC will not.


Use of BIC

BIC can be used to compare non-nested models.
BIC can be used to compare models based on different probability distributions. However, when the criterion values are computed, no constants should be discarded from the goodness-of-fit term $-2 \ln f(y \mid \hat{\theta}_k)$.
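For instance, comparing a normal and a Laplace model for the same data requires the full log-likelihoods, constants included. A sketch of that comparison (the closed-form MLEs used are standard, but the function names and data are my own illustration):

```python
import math

def bic_normal(y):
    # Normal model (k = 2): MLE mean and variance; keep the 2*pi constant.
    n = len(y)
    mu = sum(y) / n
    var = sum((v - mu) ** 2 for v in y) / n
    neg2_log_lik = n * (math.log(2.0 * math.pi * var) + 1.0)
    return neg2_log_lik + 2.0 * math.log(n)

def bic_laplace(y):
    # Laplace model (k = 2): MLE location is the median, scale is the mean
    # absolute deviation about the median; -2 ln L = 2n(ln(2b) + 1).
    n = len(y)
    s = sorted(y)
    med = (s[(n - 1) // 2] + s[n // 2]) / 2.0
    b = sum(abs(v - med) for v in y) / n
    neg2_log_lik = 2.0 * n * (math.log(2.0 * b) + 1.0)
    return neg2_log_lik + 2.0 * math.log(n)

y = [0.1, -0.4, 0.3, 0.2, -0.1, 0.0, 0.5, -0.2, 0.4, -0.3]
# The smaller criterion value identifies the better-supported family.
print(min(("normal", bic_normal(y)), ("laplace", bic_laplace(y)),
          key=lambda t: t[1])[0])
```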


Use of BIC

In a model selection application, the optimal fitted model is identified by the minimum value of BIC.
However, as with the application of any model selection criterion, the criterion values are important; models with similar values should receive the same ranking in assessing criterion preferences.


Use of BIC

Question: What constitutes a substantial difference in criterion values?
For BIC, Kass and Raftery (1995, p. 777) feature the following table (slightly revised for presentation).

BIC_i - BIC_min | Evidence Against Model i
0 - 2           | Not worth more than a bare mention
2 - 6           | Positive
6 - 10          | Strong
> 10            | Very Strong
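The table translates directly into a small helper (my own sketch of the Kass-Raftery grades; the handling of exact boundary values is my choice, since the table leaves it ambiguous):

```python
def evidence_against(bic_i, bic_min):
    # Grade the difference BIC_i - BIC_min per Kass and Raftery (1995).
    diff = bic_i - bic_min
    if diff <= 2.0:
        return "Not worth more than a bare mention"
    if diff <= 6.0:
        return "Positive"
    if diff <= 10.0:
        return "Strong"
    return "Very Strong"

print(evidence_against(108.0, 101.0))  # difference of 7 -> Strong
```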


Use of BIC

The use of BIC seems justifiable for model screening in


large-sample Bayesian analyses.
However, BIC is often employed in frequentist analyses.
Some frequentist practitioners prefer BIC to AIC, since BIC
tends to choose fitted models that are more parsimonious
than those favored by AIC.
However, given the Bayesian justification of BIC, is the use of
the criterion in frequentist analyses defensible?


References

Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics 6, 461-464.
Kass, R. and Raftery, A. (1995). Bayes factors. Journal of the American Statistical Association 90, 773-795.
Neath, A. A. and Cavanaugh, J. E. (2012). The Bayesian information criterion: Background, derivation, and applications. WIREs Computational Statistics 4, 199-203.

