Fisher Information
X (2017) 1–59
Contents
1 Introduction
2 The Role of Fisher Information in Frequentist Statistics
3 The Role of Fisher Information in Bayesian Statistics
4 The Role of Fisher Information in Minimum Description Length
5 Concluding Comments
References
A Generalization to Vector-Valued Parameters: The Fisher Information Matrix
B Frequentist Statistics based on Asymptotic Normality
∗ This work was supported by the starting grant “Bayes or Bust” awarded by the European Research Council (283876). Correspondence concerning this article may be addressed to Alexander Ly, email address: a.ly@uva.nl. The authors would like to thank Jay Myung, Trisha Van Zandt, and three anonymous reviewers for their comments on an earlier version of this paper. The discussions with Helen Steingroever, Jean-Bernard Salomond, Fabian Dablander, Nishant Mehta, Alexander Etz, Quentin Gronau and Sacha Epskamp led to great improvements of the manuscript. Moreover, the first author is grateful to Chris Klaassen, Bas Kleijn and Henk Pijls for the patience and enthusiasm with which they taught, and answered questions from, a not very docile student.
imsart-generic ver. 2014/10/16 file: fiArxiv.tex date: October 18, 2017
Ly, et. al./Fisher information tutorial 2
1. Introduction
the coin with θ∗ = 0.6 generates this particular outcome with 0.18% chance.
In practice, the true value of θ is not known and has to be inferred from the
observed data. The first step typically entails the creation of a data summary.
For example, suppose once more that X^n refers to an n-trial coin flip experiment and suppose that we observed x^n_obs = (1, 0, 0, 1, 1, 1, 1, 0, 1, 1). To simplify matters, we only record the number of heads as Y = ∑_{i=1}^n X_i, which is a function of the data. Applying our function to the specific observations yields the realization y_obs = Y(x^n_obs) = 7. Since the coin flips X^n are governed by θ, so is a function of X^n; indeed, θ relates to the potential outcomes y of Y as follows
f(y ∣ θ) = (n choose y) θ^y (1 − θ)^{n−y}, where y ∈ 𝒴 = {0, 1, . . . , n}. (1.4)
The derivative (d/dθ) log f(x ∣ θ) is known as the score function, a function of x, and
describes how sensitive the model (i.e., the functional form f ) is to changes in
θ at a particular θ. The Fisher information measures the overall sensitivity of
the functional relationship f to changes of θ by weighting the sensitivity at each
potential outcome x with respect to the chance defined by pθ (x) = f (x ∣ θ). The
weighting with respect to pθ (x) implies that the Fisher information about θ is
an expectation.
Similarly, the Fisher information I_{X^n}(θ) within the random vector X^n about θ is calculated by replacing f(x ∣ θ) with f(x^n ∣ θ), thus, p_θ(x) with p_θ(x^n) in the definition. Moreover, under the assumption that the random vector X^n consists of n iid trials of X it can be shown that I_{X^n}(θ) = n I_X(θ), which is why I_X(θ) is also known as the unit Fisher information.² Intuitively, an experiment consisting of n = 10 trials is expected to be twice as informative about θ as an experiment consisting of only n = 5 trials. ◇
Intuitively, we cannot expect an arbitrary summary statistic T to extract more information about θ than what is already provided by the raw data. Fisher information adheres to this rule, as it can be shown that I_{T(X^n)}(θ) ≤ I_{X^n}(θ), with equality if and only if T is a sufficient statistic for θ.

[Fig 1 appears here: the unit Fisher information I_X(θ) of the Bernoulli model plotted as a function of the parameter θ.]
For the Bernoulli model, the Fisher information within the n-trial experiment X^n is given by

I_{X^n}(θ) = n I_X(θ) = n / (θ(1 − θ)), (1.8)
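As a quick numerical check of Eq. (1.8) (a Python sketch of our own, not code from the paper), we can compute the Fisher information of a single Bernoulli trial directly as the expectation of the squared score, confirm that it matches 1/(θ(1 − θ)), and confirm that n iid trials carry n times the unit information:

```python
import math

def score(x, theta):
    # d/dtheta log f(x | theta) for a single Bernoulli trial,
    # where f(x | theta) = theta**x * (1 - theta)**(1 - x)
    return x / theta - (1 - x) / (1 - theta)

def fisher_info_bernoulli(theta):
    # Fisher information as the expectation of the squared score,
    # weighted by the chances p_theta(x) = f(x | theta)
    return sum(f * score(x, theta) ** 2
               for x, f in [(0, 1 - theta), (1, theta)])

theta = 0.6
unit_info = fisher_info_bernoulli(theta)
assert math.isclose(unit_info, 1 / (theta * (1 - theta)))  # Eq. (1.8) with n = 1

# For n iid trials the information adds up: I_{X^n}(theta) = n * I_X(theta)
n = 10
assert math.isclose(n * unit_info, n / (theta * (1 - theta)))
```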
Recall that θ is unknown in practice and to infer its value we might: (1) provide a best guess in terms of a point estimate; (2) postulate its value and test whether this value aligns with the data; or (3) derive a confidence interval. In the frequentist framework, each of these inferential tools is related to the Fisher information and exploits the data generative interpretation of a pmf. Recall that given a model f(x^n ∣ θ) and a known θ, we can view the resulting pmf p_θ(x^n)
Fig 2. The likelihood function based on observing y_obs = 7 heads in n = 10 trials. For these data, the MLE is equal to θ̂_obs = 0.7; see the main text for the interpretation of this function.
as a recipe that reveals how θ defines the chances with which X n takes on the
potential outcomes xn .
This data generative view is central to Fisher’s conceptualization of the
maximum likelihood estimator (MLE; Fisher, 1912; Fisher, 1922; Fisher, 1925;
LeCam, 1990; Myung, 2003). For instance, the binomial model implies that a coin with a hypothetical propensity θ = 0.5 will generate the outcome y = 7 heads out of n = 10 trials with 11.7% chance, whereas a hypothetical propensity of θ = 0.7 will generate the same outcome y = 7 with 26.7% chance. Fisher concluded that an actual observation y_obs = 7 out of n = 10 is therefore more likely to be generated from a coin with a hypothetical propensity of θ = 0.7 than from a coin with a hypothetical propensity of θ = 0.5. Fig. 2 shows that for this specific observation y_obs = 7, the hypothetical value θ = 0.7 is the maximum likelihood estimate, the number θ̂_obs = 0.7. This estimate is a realization of the maximum likelihood estimator (MLE); in this case, the MLE is the function θ̂ = (1/n) ∑_{i=1}^n X_i = Y/n, i.e., the sample mean. Note that the MLE is a statistic, that is, a function of the data.
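The likelihood values quoted above (11.7% and 26.7%) and the location of the MLE can be reproduced with a few lines of Python; the function name binom_likelihood is our own, and the grid search merely illustrates that the likelihood peaks at the sample mean:

```python
from math import comb

def binom_likelihood(theta, y=7, n=10):
    # f(y | theta): chance that a coin with propensity theta
    # generates y heads in n trials
    return comb(n, y) * theta ** y * (1 - theta) ** (n - y)

print(round(binom_likelihood(0.5), 3))  # 0.117, i.e. 11.7% chance
print(round(binom_likelihood(0.7), 3))  # 0.267, i.e. 26.7% chance

# The MLE maximizes the likelihood; a grid search over candidate
# propensities locates it at the sample mean 7/10 = 0.7
grid = [i / 1000 for i in range(1001)]
mle = max(grid, key=binom_likelihood)
print(mle)  # 0.7
```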
For iid data and under general conditions,3 the difference between the true
θ∗ and the MLE converges in distribution to a normal distribution, that is,
√n (θ̂ − θ∗) →ᴰ N(0, I_X(θ∗)⁻¹), as n → ∞. (2.1)
This means that the MLE θ̂ generates potential estimates θ̂_obs around the true value θ∗ with a standard error given by the inverse of the square root of the Fisher information at the true value θ∗, i.e., 1/√(n I_X(θ∗)), whenever n is large enough. Note that the chances with which the estimates of θ̂ are generated depend on the true value θ∗ and the sample size n. Observe that the standard error decreases when the unit information I_X(θ∗) is high or when n is large. As experimenters we do not have control over the true value θ∗, but we can affect the data generating process by choosing the number of trials n. Larger values of n increase the amount of information in X^n, heightening the chances of the MLE producing an estimate θ̂_obs that is close to the true value θ∗. The following example shows how this can be made precise.
Example 2.1 (Designing a binomial experiment with the Fisher information).
Recall that the potential outcomes of a normal distribution fall within one standard error of the population mean with 68% chance. Hence, when we choose n such that 1/√(n I_X(θ∗)) = 0.1 we design an experiment that allows the MLE to generate estimates within 0.1 distance of the true value with 68% chance. To overcome the problem that θ∗ is not known, we solve the problem for the worst-case scenario. For the Bernoulli model this is given by θ = 1/2, the least informative case, see Fig. 1. As such, we have 1/√(n I_X(θ∗)) ≤ 1/√(n I_X(1/2)) = 1/(2√n) = 0.1, where the last equality is the target requirement and is solved by n = 25.
This leads to the following interpretation. After simulating k = 100 data sets x^n_obs,1, . . . , x^n_obs,k each with n = 25 trials, we can apply to each of these data sets the MLE, yielding k estimates θ̂_obs,1, . . . , θ̂_obs,k. The sampling distribution implies that at least 68 of these k = 100 estimates are expected to be at most 0.1 distance away from the true θ∗. ◇
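Because the binomial pmf is available in closed form, the promised 68% can also be checked exactly rather than by simulation. The sketch below (our own illustration, not code from the paper) computes the chance that the MLE lands within 0.1 of the worst-case truth θ∗ = 1/2 when n = 25; owing to the discreteness of Y the exact coverage is even better than the normal approximation promises:

```python
from math import comb, sqrt

n, theta_star = 25, 0.5  # worst case: the least informative Bernoulli parameter

# Standard error from the unit Fisher information I_X(1/2) = 4
se = 1 / sqrt(n * 4)
assert se == 0.1  # the design target from the example

# Exact probability that the MLE Y/n falls within 0.1 of theta_star:
# |Y/25 - 0.5| <= 0.1  <=>  Y in {10, ..., 15}
pmf = lambda y: comb(n, y) * theta_star ** y * (1 - theta_star) ** (n - y)
coverage = sum(pmf(y) for y in range(10, 16))
print(round(coverage, 2))  # 0.77, comfortably above the promised 68%
```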
³ Basically, when the Fisher information exists for all parameter values. For details see the advanced accounts provided by Bickel et al. (1993), Hájek (1970), Inagaki (1970), LeCam (1970) and Appendix E.
⁴ Note that θ̂ is random, while the true value θ∗ is fixed. As such, the error θ̂ − θ∗ and the rescaled error √n(θ̂ − θ∗) are also random. We used →ᴰ in Eq. (2.1) to convey that the distribution of the left-hand side goes to the distribution on the right-hand side. Similarly, ≈ᴰ in Eq. (2.2) implies that the distribution of the left-hand side is approximately equal to the distribution given on the right-hand side. Hence, for finite n there will be an error due to using the normal distribution as an approximation to the true sampling distribution. This approximation error is ignored in the constructions given below; see Appendix B.1 for a more thorough discussion.
with (approximately) 95% chance. This 95%-prediction interval Eq. (2.3) allows us to construct a point null hypothesis test based on a pre-experimental postulate θ∗ = θ0.
Example 2.2 (A null hypothesis test for a binomial experiment). Under the
null hypothesis H0 ∶ θ∗ = θ0 = 0.5, we predict that an outcome of the MLE based
on n = 10 trials will lie between (0.19, 0.81) with 95% chance. This interval
follows from replacing θ∗ by θ0 in the 95%-prediction interval Eq. (2.3). The data
generative view implies that if we simulate k = 100 data sets each with the same θ∗ = 0.5 and n = 10, we would then have k estimates θ̂_obs,1, . . . , θ̂_obs,k of which five are expected to be outside this 95% interval (0.19, 0.81). Fisher, therefore,
classified an outcome of the MLE that is smaller than 0.19 or larger than 0.81
as extreme under the null and would then reject the postulate H0 ∶ θ0 = 0.5 at a
significance level of .05. ◇
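The interval (0.19, 0.81) follows from θ0 ± 1.96/√(n I_X(θ0)); a small Python sketch (our own, not code from the paper) reproduces it:

```python
from math import sqrt

def prediction_interval(theta0, n, z=1.96):
    # Approximate 95%-prediction interval for the MLE under H0: theta* = theta0,
    # using the unit Fisher information I_X(theta) = 1 / (theta (1 - theta))
    se = sqrt(theta0 * (1 - theta0) / n)  # = 1 / sqrt(n * I_X(theta0))
    return theta0 - z * se, theta0 + z * se

lo, hi = prediction_interval(theta0=0.5, n=10)
print(round(lo, 2), round(hi, 2))  # 0.19 0.81
```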
The normal approximation to the sampling distribution of the MLE and the
resulting null hypothesis test is particularly useful when the exact sampling
distribution of the MLE is unavailable or hard to compute.
Example 2.3 (An MLE null hypothesis test for the Laplace model). Suppose
that we have n iid samples from the Laplace distribution
f(x_i ∣ θ) = (1/(2b)) exp( −∣x_i − θ∣ / b ), (2.4)
where θ denotes the population mean and the population variance is given by 2b2 .
It can be shown that the MLE for this model is the sample median, θ̂ = M̂ , and
the unit Fisher information is IX (θ) = b−2 . The exact sampling distribution of
the MLE is unwieldy (Kotz, Kozubowski and Podgorski, 2001) and not presented
here. Asymptotic normality of the MLE is practical, as it allows us to discard the
unwieldy exact sampling distribution and, instead, base our inference on a more
tractable (approximate) normal distribution with a mean equal to the true value
θ∗ and a variance equal to b2 /n. For n = 100, b = 1 and repeated sampling under
the hypothesis H0 ∶ θ∗ = θ0 , approximately 95% of the estimates (the observed
sample medians) are expected to fall in the range (θ0 − 0.196, θ0 + 0.196). ◇
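A short simulation (our own sketch, not code from the paper; sample_laplace draws via the inverse CDF, and the seed is arbitrary) illustrates the claim: for n = 100 and b = 1 the half-width is 1.96 b/√n ≈ 0.196, and roughly 95% of observed sample medians fall inside that range under H0: θ∗ = 0:

```python
import math, random

def sample_laplace(theta, b, n, rng):
    # Inverse-CDF sampling from the Laplace(theta, b) distribution
    xs = []
    for _ in range(n):
        u = rng.random() - 0.5
        xs.append(theta - b * math.copysign(1.0, u) * math.log(1 - 2 * abs(u)))
    return xs

def median(xs):
    xs = sorted(xs)
    m = len(xs) // 2
    return xs[m] if len(xs) % 2 else 0.5 * (xs[m - 1] + xs[m])

rng = random.Random(1)
theta0, b, n = 0.0, 1.0, 100
half_width = 1.96 * b / math.sqrt(n)  # 1.96 / sqrt(n * I_X(theta)) with I_X = 1/b^2
print(round(half_width, 3))           # 0.196

# Repeated sampling under H0: roughly 95% of the observed medians
# should fall inside (theta0 - 0.196, theta0 + 0.196)
hits = sum(abs(median(sample_laplace(theta0, b, n, rng))) <= half_width
           for _ in range(1000))
print(hits / 1000)  # close to 0.95
```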
This section outlines how Fisher information can be used to define the Jeffreys’s
prior, a default prior commonly used for estimation problems and for nuisance
parameters in a Bayesian hypothesis test (e.g., Bayarri et al., 2012; Dawid,
2011; Gronau, Ly and Wagenmakers, 2017; Jeffreys, 1961; Liang et al., 2008;
Li and Clyde, 2015; Ly, Verhagen and Wagenmakers, 2016a,b; Ly, Marsman and Wagenmakers,
in press; Ly et al., 2017a; Robert, 2016). To illustrate the desirability of the Jef-
freys’s prior we first show how the naive use of a uniform prior may have unde-
sirable consequences, as the uniform prior depends on the representation of the
inference problem, that is, on how the model is parameterized. This dependence
is commonly referred to as lack of invariance: different parameterizations of the
same model result in different posteriors and, hence, different conclusions. We
visualize the representation problem using simple geometry and show how the
geometrical interpretation of Fisher information leads to the Jeffreys’s prior that
is parameterization-invariant.
Bayesian analysis centers on the observations x^n_obs for which a generative model f is proposed that functionally relates the observed data to an unobserved parameter.

⁵ But see Brown, Cai and DasGupta (2001).
When little is known about the parameter θ that governs the outcomes of X n , it
may seem reasonable to express this ignorance with a uniform prior distribution
g(θ), as no parameter value of θ is then favored over another. This leads to the
following type of inference:
Example 3.1 (Uniform prior on θ). Before data collection, θ is assigned a
uniform prior, that is, g(θ) = 1/VΘ with a normalizing constant of VΘ = 1
as shown in the left panel of Fig. 3. Suppose that we observe coin flip data x^n_obs with y_obs = 7 heads out of n = 10 trials. To relate these observations to the coin's propensity θ we use the Bernoulli distribution as our f(x^n ∣ θ). A replacement of x^n by the data actually observed yields the likelihood function f(x^n_obs ∣ θ) = θ^7 (1 − θ)^3, which is a function of θ. Bayes' theorem now allows us to update our prior to the posterior that is plotted in the right panel of Fig. 3. ◇
Note that a uniform prior on θ has the length, more generally, volume, of
the parameter space as the normalizing constant; in this case, VΘ = 1, which
equals the length of the interval Θ = (0, 1). Furthermore, a uniform prior can
be characterized as the prior that gives equal probability to all sub-intervals of
equal length. Thus, the probability of finding the true value θ∗ within a sub-interval J_θ = (θ_a, θ_b) ⊂ Θ = (0, 1) is given by the relative length of J_θ with
Fig 3. Bayesian updating based on observations x^n_obs with y_obs = 7 heads out of n = 10 tosses. In the left panel, the uniform prior distribution assigns equal probability to every possible value of the coin's propensity θ. In the right panel, the posterior distribution is a compromise between the prior and the observed data.
P(θ∗ ∈ J_θ) = ∫_{J_θ} g(θ) dθ = (1/V_Θ) ∫_{θ_a}^{θ_b} 1 dθ = (θ_b − θ_a)/V_Θ. (3.3)
Hence, before any datum is observed, the uniform prior expresses the belief
P (θ∗ ∈ Jθ ) = 0.20 of finding the true value θ∗ within the interval Jθ = (0.6, 0.8).
After observing x^n_obs with y_obs = 7 out of n = 10, this prior is updated to the posterior belief of P(θ∗ ∈ J_θ ∣ x^n_obs) = 0.54, see the shaded areas in Fig. 3.
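The posterior probability 0.54 can be reproduced numerically: under the uniform prior the posterior density is proportional to θ⁷(1 − θ)³ (a Beta(8, 4) distribution), so the shaded area is a ratio of two integrals. A sketch using composite Simpson integration (our own helper, not code from the paper):

```python
def simpson(f, a, b, k=2000):
    # Composite Simpson's rule with k (even) subintervals
    h = (b - a) / k
    s = f(a) + f(b) + sum((4 if i % 2 else 2) * f(a + i * h) for i in range(1, k))
    return s * h / 3

# Posterior density under a uniform prior: proportional to theta^7 (1 - theta)^3
kernel = lambda t: t ** 7 * (1 - t) ** 3
norm = simpson(kernel, 0.0, 1.0)                 # = B(8, 4) = 1/1320
posterior_mass = simpson(kernel, 0.6, 0.8) / norm
print(round(posterior_mass, 2))  # 0.54, matching the shaded area in Fig. 3

prior_mass = 0.8 - 0.6  # uniform prior probability of (0.6, 0.8)
print(round(prior_mass, 2))  # 0.2
```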
Although intuitively appealing, it can be unwise to choose the uniform dis-
tribution by default, as the results are highly dependent on how the model is
parameterized. In what follows, we show how a different parameterization leads
to different posteriors and, consequently, different conclusions.
Example 3.2 (Different representations, different conclusions). The propensity
of a coin landing heads up is related to the angle φ with which that coin is bent.
Suppose that the relation between the angle φ and the propensity θ is given by the function θ = h(φ) = 1/2 + (1/2)(φ/π)³, chosen here for mathematical convenience.⁶ When φ is positive the tail side of the coin is bent inwards, which increases the coin's
chances to land heads. As the function θ = h(φ) also admits an inverse function
h−1 (θ) = φ, we have an equivalent formulation of the problem in Example 3.1,
but now described in terms of the angle φ instead of the propensity θ.
As before, in order to obtain a posterior distribution, Bayes’ theorem requires
that we specify a prior distribution. As the problem is formulated in terms of
φ, one may believe that a noninformative choice is to assign a uniform prior
g̃(φ) on φ, as this means that no value of φ is favored over another. A uniform
prior on φ is in this case given by g̃(φ) = 1/VΦ with a normalizing constant
VΦ = 2π, because the parameter φ takes on values in the interval Φ = (−π, π).
⁶ Another example involves the logit formulation of the Bernoulli model, that is, in terms of φ = log(θ/(1 − θ)), where Φ = ℝ. This logit formulation is the basic building block in item response theory. We did not discuss this example as the uniform prior on the logit cannot be normalized and, therefore, is not easily represented in the plots.
This uniform distribution expresses the belief that the true φ∗ can be found
in any of the intervals (−1.0π, −0.8π), (−0.8π, −0.6π), . . . , (0.8π, 1.0π) with 10%
probability, because each of these intervals is 10% of the total length, see the
top-left panel of Fig. 4. For the same data as before, the posterior calculated
Fig 4. Bayesian updating based on observations x^n_obs with y_obs = 7 heads out of n = 10 tosses when a uniform prior distribution is assigned to the coin's angle φ. The uniform distribution is shown in the top-left panel. Bayes' theorem results in a posterior distribution for φ that is shown in the top-right panel. This posterior g̃(φ ∣ x^n_obs) is transformed into a posterior on θ (bottom-right panel) using θ = h(φ). The same posterior on θ is obtained if we proceed via an alternative route in which we first transform the uniform prior on φ to the corresponding prior on θ and then apply Bayes' theorem with the induced prior on θ. A comparison to the results from Fig. 3 reveals that posterior inference differs notably depending on whether a uniform distribution is assigned to the angle φ or to the propensity θ.
Unlike the other fathers of modern statistical thought, Harold Jeffreys continued to study Bayesian statistics based on formal logic and his philosophical convictions of scientific inference (see, e.g., Aldrich, 2005; Etz and Wagenmakers,
2017; Jeffreys, 1961; Ly, Verhagen and Wagenmakers, 2016a,b; Robert, Chopin and Rousseau,
2009; Wrinch and Jeffreys, 1919, 1921, 1923). Jeffreys concluded that the uni-
form prior is unsuitable as a default prior due to its dependence on the parame-
terization. As an alternative, Jeffreys (1946) proposed the following prior based
on Fisher information
g_J(θ) = (1/V) √(I_X(θ)), where V = ∫_Θ √(I_X(θ)) dθ, (3.4)
which is known as the prior derived from Jeffreys's rule, or the Jeffreys's prior for short. The Jeffreys's prior is parameterization-invariant, which implies that it leads to the same posteriors regardless of how the model is represented.
Example 3.3 (Jeffreys's prior). The Jeffreys's prior of the Bernoulli model in terms of φ is

g_J(φ) = 3φ² / (V √(π⁶ − φ⁶)), where V = π, (3.5)
Fig 5. For priors constructed through Jeffreys's rule it does not matter whether the problem is represented in terms of the angle φ or the propensity θ. Not only is the problem equivalent under the transformation θ = h(φ) and its inverse φ = h⁻¹(θ), but the prior information is also the same in both representations. This also holds for the posteriors.
Similarly, we could have started with the Jeffreys’s prior in terms of θ instead,
that is,
g_J(θ) = 1 / (V √(θ(1 − θ))), where V = π. (3.6)
The Jeffreys’s prior and posterior on θ are plotted in the bottom-left and the
bottom-right panel of Fig. 5, respectively. The Jeffreys’s prior on θ corresponds
to the prior belief PJ (θ∗ ∈ Jθ ) = 0.14 of finding the true value θ∗ within the
interval Jθ = (0.6, 0.8). After observing xnobs with yobs = 7 out of n = 10, this
prior is updated to the posterior belief of PJ (θ∗ ∈ Jθ ∣ xnobs ) = 0.53, see the shaded
areas in Fig. 5. The posterior is identical to the one obtained from the previously
described updating procedure that starts with the Jeffreys’s prior on φ instead of
on θ. ◇
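Both shaded areas can be reproduced numerically: the Jeffreys's prior of Eq. (3.6) is the arcsine (Beta(1/2, 1/2)) distribution with closed-form CDF (2/π) arcsin(√θ), and the posterior kernel after y_obs = 7 out of n = 10 is θ^6.5 (1 − θ)^2.5. A sketch with our own Simpson helper (not code from the paper):

```python
import math

def simpson(f, a, b, k=2000):
    # Composite Simpson's rule with k (even) subintervals
    h = (b - a) / k
    s = f(a) + f(b) + sum((4 if i % 2 else 2) * f(a + i * h) for i in range(1, k))
    return s * h / 3

# Jeffreys's prior g_J(theta) = 1 / (pi sqrt(theta (1 - theta))) has the
# closed-form CDF (2/pi) arcsin(sqrt(theta)): the arcsine distribution
cdf_J = lambda t: (2 / math.pi) * math.asin(math.sqrt(t))
prior_mass = cdf_J(0.8) - cdf_J(0.6)
print(round(prior_mass, 2))  # 0.14

# Posterior kernel after y_obs = 7 heads in n = 10:
# prior kernel times likelihood = theta^6.5 (1 - theta)^2.5 (a Beta(7.5, 3.5))
kernel = lambda t: t ** 6.5 * (1 - t) ** 2.5
posterior_mass = simpson(kernel, 0.6, 0.8) / simpson(kernel, 0.0, 1.0)
print(round(posterior_mass, 2))  # 0.53
```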
This example shows that the Jeffreys's prior leads to the same posterior knowledge regardless of how we as researchers represent the problem. Hence, the
same conclusions about θ are drawn regardless of whether we (1) use Jeffreys’s
rule to construct a prior on θ and update with the observed data, or (2) use
Jeffreys’s rule to construct a prior on φ, update to a posterior distribution on
φ, which is then transformed to a posterior on θ.
In the remainder of this section we make intuitive that the Jeffreys’s prior is in
fact uniform in the model space. We elaborate on what is meant by model space
and how this can be viewed geometrically. This geometric approach illustrates
(1) the role of Fisher information in the definition of the Jeffreys’s prior, (2)
the interpretation of the shaded area, and (3) why the normalizing constant is
V = π, regardless of the chosen parameterization.
Before we describe the geometry of statistical models, recall that a pmf can be
thought of as a data generating device of X, as the pmf specifies the chances with
which X takes on the potential outcomes 0 and 1. Each such pmf has to fulfil two
conditions: (i) the chances have to be non-negative, that is, 0 ≤ p(x) = P (X = x)
for every possible outcome x of X, and (ii) to explicitly convey that there are
w = 2 outcomes, and none more, the chances have to sum to one, that is,
p(0) + p(1) = 1. We call the largest set of functions that adhere to conditions (i)
and (ii) the complete set of pmfs P.
As any pmf from P defines w = 2 chances, we can represent such a pmf
as a vector in w dimensions. To simplify notation, we write p(X) for all w
chances simultaneously, hence, p(X) is the vector p(X) = [p(0), p(1)] when
w = 2. The two chances with which a pmf p(X) generates outcomes of X can be
simultaneously represented in the plane with p(0) = P (X = 0) on the horizontal
axis and p(1) = P (X = 1) on the vertical axis. In the most extreme case, we
have the pmf p(X) = [1, 0] or p(X) = [0, 1]. These two extremes are linked by a
straight line in the left panel of Fig. 6. Any pmf –and the true pmf p∗ (X) of X
Fig 6. The true pmf of X with the two outcomes {0, 1} has to lie on the line (left panel)
or more naturally on the positive part of the circle (right panel). The dot represents the pmf
pe (X).
in particular– can be uniquely identified with a vector on the line and vice versa.
For instance, the pmf pe (X) = [1/2, 1/2] (i.e., the two outcomes are generated
with the same chance) is depicted as the dot on the line.
3.4.2. Uniform on the parameter space versus uniform on the model space
Fig 7. The parameterization in terms of propensity θ (left panel) and angle φ (right panel)
differ from each other substantially, and from a uniform prior in the model space. Left panel:
The eleven squares (starting from the right bottom going anti-clockwise) represent pmfs that
correspond to θ = 0.0, 0.1, 0.2, . . . , 0.9, 1.0. The shaded area corresponds to the shaded area
in the bottom-left panel of Fig. 5 and accounts for 14% of the model’s length. Right panel:
Similarly, the eleven triangles (starting from the right bottom going anti-clockwise) represent pmfs that correspond to φ = −1.0π, −0.8π, . . . , 0.8π, 1.0π.
that the models MΘ and MΦ are equivalent; the two models define the same
candidate set of pmfs that we believe to be viable data generating devices for
X.
However, θ and φ represent M in a substantially different manner. As the representation m(X) = 2√(p(X)) respects our natural notion of distance, we conclude, by eye, that a uniform division of θs with distance, say, dθ = 0.1 does
not lead to a uniform partition of the model. More extremely, a uniform division
of φ with distance dφ = 0.2π (10% of the length of the parameter space) also
does not lead to a uniform partition of the model. In particular, even though
the intervals (−π, −0.8π) and (−0.2π, 0) are of equal length in the parameter
space Φ, they do not have an equal displacement in the model MΦ . In effect,
the right panel of Fig. 7 shows that the 10% probability that the uniform prior
on φ assigns to φ∗ ∈ (−π, −0.8π) in parameter space is redistributed over a larger
arc length of the model MΦ compared to the 10% assigned to φ∗ ∈ (−0.2π, 0).
Thus, a uniform distribution on φ favors the pmfs mφ (X) with φ close to zero.
Note that this effect is cancelled by the Jeffreys’s prior, as it puts more mass
on the end points compared to φ = 0, see the top-left panel of Fig. 5. Similarly,
the left panel of Fig. 7 shows that the uniform prior g(θ) also fails to yield an
equiprobable assessment of the pmfs in model space. Again, the Jeffreys’s prior
in terms of θ compensates for the fact that the interval (0, 0.1) as compared
to (0.5, 0.6) in Θ is more spread out in model space. However, it does so less
severely compared to the Jeffreys’s prior on φ. To illustrate, we added additional
tick marks on the horizontal axis of the priors in the left panels of Fig. 5. The tick marks at φ = −2.8 and θ = 0.15 both indicate the 25% quantiles of their respective Jeffreys's priors. Hence, the Jeffreys's prior allocates more mass to the
boundaries of φ than to the boundaries of θ to compensate for the difference in
geometry, see Fig. 7. More generally, the Jeffreys’s prior uses Fisher information
Fig. 7 shows that neither a uniform prior on θ, nor a uniform prior on φ yields
a uniform prior on the model. Alternatively, we can begin with a uniform prior
on the model M and convert this into priors on the parameter spaces Θ and
Φ. This uniform prior on the model translated to the parameters is exactly the
Jeffreys’s prior.
Recall that a prior on a space S is uniform, if it has the following two defining
features: (i) the prior is proportional to one, and (ii) a normalizing constant given
by V_S = ∫_S 1 ds that equals the length, more generally, volume of S. For instance, a replacement of s by φ and S by Φ = (−π, π) yields the uniform prior on the angles with the normalizing constant V_Φ = ∫_Φ 1 dφ = 2π. Similarly, a replacement
of s by the pmf mθ (X) and S by the function space MΘ yields a uniform prior
on the model MΘ . The normalizing constant then becomes a daunting looking
integral in terms of displacements dmθ (X) between functions in model space
MΘ . Fortunately, it can be shown, see Appendix C, that V simplifies to
V = ∫_{M_Θ} 1 dm_θ(X) = ∫_Θ √(I_X(θ)) dθ. (3.7)
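For the Bernoulli model this integral can be verified numerically. The integrand 1/√(θ(1 − θ)) is singular at both endpoints, so the sketch below (our own, not code from the paper) substitutes θ = sin²(t), under which the integrand becomes the constant 2 on (0, π/2) and a simple midpoint rule recovers V = π:

```python
import math

# Arc length of the Bernoulli model: V = ∫_0^1 sqrt(I_X(theta)) dtheta
# with sqrt(I_X(theta)) = 1 / sqrt(theta (1 - theta)).
def integrand(t):
    # After substituting theta = sin(t)^2, dtheta = 2 sin(t) cos(t) dt,
    # the singular integrand reduces to the constant 2 on (0, pi/2)
    theta = math.sin(t) ** 2
    dtheta = 2 * math.sin(t) * math.cos(t)
    return dtheta / math.sqrt(theta * (1 - theta))

k = 10000
h = (math.pi / 2) / k
V = sum(integrand((i + 0.5) * h) for i in range(k)) * h  # midpoint rule
print(round(V, 6))  # 3.141593, i.e. V = pi regardless of parameterization
```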
In conclusion, we verified that the Jeffreys's prior is a prior that leads to the same conclusion regardless of how we parameterize the problem. This parameterization-invariance property is a direct result of shifting our focus from finding the true parameter value within the parameter space to the proper formulation of the estimation problem – as discovering the true data generating pmf m_θ∗(X) = 2√(p_θ∗(X)) in M_Θ – and by expressing our prior ignorance as a uniform prior on the model M_Θ.
where n denotes the sample size, d_j the number of free parameters, θ̂_j the MLE, I_{M_j}(θ_j) the unit Fisher information, and f_j the functional relationship between the potential outcome x^n and the parameters θ_j within model M_j.¹¹ Hence, except for the observations x^n_obs, all quantities in the formulas depend on the model M_j. We made this explicit using a subscript j to indicate that the quantity, say, θ_j belongs to model M_j.¹² For all three criteria, the model yielding the lowest criterion value is perceived as the model that generalizes best (Myung and Pitt, in press).
¹¹ For vector-valued parameters θ_j, we have a Fisher information matrix and det I_{M_j}(θ_j) refers to the determinant of this matrix. This determinant is always non-negative, because the Fisher information matrix is always a positive semidefinite symmetric matrix. Intuitively, volumes and areas cannot be negative (Appendix C.3.3).
¹² For the sake of clarity, we will use different notations for the parameters within the different models. We introduce two models in this section: the model M_1 with parameter θ_1 = ϑ which we pit against the model M_2 with parameter θ_2 = α.
Each of the three model selection criteria tries to strike a balance between
model fit and model complexity. Model fit is expressed by the goodness-of-fit terms, which involve replacing the potential outcomes x^n and the unknown parameter θ_j of the functional relationships f_j by the actually observed data x^n_obs, as in the Bayesian setting, and the maximum likelihood estimate θ̂_j(x^n_obs), as in the frequentist setting.
The positive terms in the criteria account for model complexity. A penaliza-
tion of model complexity is necessary, because the support in the data cannot
be assessed by solely considering goodness-of-fit, as the ability to fit observations increases with model complexity (e.g., Roberts and Pashler, 2000). As
a result, the more complex model necessarily leads to better fits but may in
fact overfit the data. The overly complex model then captures idiosyncratic
noise rather than general structure, resulting in poor model generalizability
(Myung, Forster and Browne, 2000; Wagenmakers and Waldorp, 2006).
The focus in this section is to make intuitive how FIA acknowledges the trade-
off between goodness-of-fit and model complexity in a principled manner by
graphically illustrating this model selection procedure, see also Balasubramanian
(1996), Kass (1989), Myung, Balasubramanian and Pitt (2000), and Rissanen
(1996). We exemplify the concepts with simple multinomial processing tree
(MPT) models (e.g., Batchelder and Riefer, 1999; Klauer and Kellen, 2011; Wu, Myung and Batchelder,
2010). For a more detailed treatment of the subject we refer to Appendix D,
de Rooij and Grünwald (2011), Grünwald (2007), Myung, Navarro and Pitt (2006),
and the references therein.
Recall that each model specifies a functional relationship fj between the poten-
tial outcomes of X and the parameters θj . This fj is used to define a so-called
normalized maximum likelihood (NML) code. For the jth model its NML code is defined as

p_NML(x^n_obs ∣ M_j) = f_j(x^n_obs ∣ θ̂_j(x^n_obs)) / ∑_{x^n} f_j(x^n ∣ θ̂_j(x^n)),

a number between zero and one that can be transformed into a non-negative number by taking the negative logarithm as¹⁴
− log p_NML(x^n_obs ∣ M_j) = − log f_j(x^n_obs ∣ θ̂_j(x^n_obs)) + log ∑_{x^n} f_j(x^n ∣ θ̂_j(x^n)), (4.5)

where the second term represents the model complexity. The left-hand side is called the description length of model M_j. Within the MDL framework, the model with the shortest description length is the model that best describes the observed data x^n_obs.
The model complexity term is typically hard to compute, but Rissanen (1996) showed that it can be well-approximated by the dimensionality and the geometrical complexity terms. That is,
FIA = − log f_j(x^n_obs ∣ θ̂_j(x^n_obs)) + (d_j / 2) log( n / (2π) ) + log ( ∫_Θ √(det I_{M_j}(θ_j)) dθ_j ),
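As an illustration (our own, using the running coin-flip example rather than an MPT model), the FIA of the Bernoulli model for y_obs = 7 heads in n = 10 trials can be computed directly: d_j = 1, the goodness-of-fit term uses θ̂ = 0.7, and the geometric complexity term is log π by Eq. (3.7):

```python
import math
from math import comb

# FIA for the Bernoulli/binomial model with y_obs = 7 heads in n = 10 trials
y_obs, n, d = 7, 10, 1
theta_hat = y_obs / n

# Goodness of fit: -log f(y_obs | theta_hat)
fit = -math.log(comb(n, y_obs) * theta_hat ** y_obs
                * (1 - theta_hat) ** (n - y_obs))

# Dimensionality term: (d / 2) log(n / (2 pi))
dimension = (d / 2) * math.log(n / (2 * math.pi))

# Geometric complexity: log ∫ sqrt(det I(theta)) dtheta = log(pi) for this model
geometry = math.log(math.pi)

fia = fit + dimension + geometry
print(round(fia, 2))  # 2.7 (in nats)
```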
• L meaning both words come from the left list, i.e., “ll”,
• M meaning the words are mixed, i.e., “lr” or “rl”,
• R meaning both words come from the right list, i.e., “rr”.
For simplicity we assume that the participant will be presented with n test pairs
X n of equal difficulty. ◇
For the graphical illustration of this new running example, we generalize
the ideas presented in Section 3.4.1 from w = 2 to w = 3. Recall that a pmf
of X with w number of outcomes can be written as a w-dimensional vector.
For the task described above we know that a data generating pmf defines the
three chances p(X) = [p(L), p(M ), p(R)] with which X generates the outcomes
[L, M, R] respectively.15 As chances cannot be negative, (i) we require that
0 ≤ p(x) = P (X = x) for every outcome x in X , and (ii) to explicitly convey
that there are w = 3 outcomes, and none more, these w = 3 chances have to sum
to one, that is, ∑x∈X p(x) = 1. We call the largest set of functions that adhere
to conditions (i) and (ii) the complete set of pmfs P. The three chances with
which a pmf p(X) generates outcomes of X can be simultaneously represented
in three-dimensional space with p(L) = P (X = L) on the left most axis, p(M ) =
P (X = M ) on the right most axis and p(R) = P (X = R) on the vertical axis
as shown in the left panel of Fig. 8.16 In the most extreme case, we have the
pmf p(X) = [1, 0, 0], p(X) = [0, 1, 0] or p(X) = [0, 0, 1], which correspond to the
corners of the triangle indicated by pL, pM and pR, respectively. These three
extremes are linked by a triangular plane in the left panel of Fig. 8. Any pmf
–and the true pmf p∗ (X) in particular– can be uniquely identified with a vector
on the triangular plane and vice versa. For instance, a possible true pmf of X
is pe (X) = [1/3, 1/3, 1/3] (i.e., the outcomes L, M and R are generated with the
same chance) depicted as a (red) dot on the simplex.
This vector representation allows us to associate to each pmf of X the
Euclidean norm. For instance, the representation in the left panel of Fig. 8
leads to an extreme pmf p(X) = [1, 0, 0] that is one unit long, while
pe(X) = [1/3, 1/3, 1/3] is only √((1/3)² + (1/3)² + (1/3)²) ≈ 0.58 units away
from the origin. As before, we can avoid this mismatch in lengths by
considering the vectors m(X) = 2√p(X) instead. Any pmf that is identified as
m(X) is now two units away from the origin. The model space M is the collection
of all transformed pmfs and is represented as the surface of (the positive part
of) the sphere in the right panel of Fig. 8. By representing the set of all
possible pmfs of X as m(X) = 2√p(X), we adopted our intuitive notion of
distance. As a result, the selection mechanism underlying MDL can be made
intuitive by simply looking at the forthcoming plots.
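These lengths are easy to verify numerically; the snippet below is an illustration we add here, using numpy:

```python
import numpy as np

# Two pmfs over the outcomes [L, M, R]: an extreme one and the uniform one.
p_extreme = np.array([1.0, 0.0, 0.0])
p_uniform = np.array([1/3, 1/3, 1/3])

# Their Euclidean norms differ: 1 versus sqrt(1/3), roughly 0.58.
print(np.linalg.norm(p_extreme))
print(np.linalg.norm(p_uniform))

# The transformed vectors m(X) = 2 * sqrt(p(X)) all lie on a sphere of
# radius two, since ||2 sqrt(p)||^2 = 4 * sum(p) = 4 for every pmf p.
for p in (p_extreme, p_uniform):
    print(np.linalg.norm(2 * np.sqrt(p)))
```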
15. As before we write p(X) = [p(L), p(M), p(R)] with a capital X to denote all
w chances simultaneously, and we use the shorthand notation p(L) = p(X = L),
p(M) = p(X = M) and p(R) = p(X = R).
16. This is the three-dimensional generalization of Fig. 6.
Fig 8. Every point on the sphere corresponds to a pmf of a categorical distribution with
w = 3 categories. In particular, the (red) dot refers to the pmf pe (x) = [1/3, 1/3, 1/3], the
circle represents the pmf given by p(X) = [0.01, 0.18, 0.81], while the cross represents the pmf
p(X) = [0.25, 0.5, 0.25].
To ease the exposition, we assume that both words presented to the participant
come from the right list R, thus, “rr” for the two models introduced below. As
model M1 we take the so-called individual-word strategy. Within this model
M1 , the parameter is θ1 = ϑ, which we interpret as the participant’s “right-list
recognition ability”. With chance ϑ the participant then correctly recognizes
that the first word originates from the right list and repeats this procedure for
the second word, after which the participant categorizes the word pair as L, M ,
or R, see the left panel of Fig. 9 for a schematic description of this strategy as a
processing tree. Fixing the participant's “right-list recognition ability” ϑ
yields the pmf f1(X ∣ ϑ) = [(1 − ϑ)², 2ϑ(1 − ϑ), ϑ²] over the outcomes [L, M, R].
For instance, when the participant’s true ability is ϑ∗ = 0.9, the three outcomes
[L, M, R] are then generated with the following three chances f1 (X ∣ 0.9) =
[0.01, 0.18, 0.81], which is plotted as a circle in Fig. 8. On the other hand, when
ϑ∗ = 0.5 the participant’s generating pmf is then f1 (X ∣ ϑ = 0.5) = [0.25, 0.5, 0.25],
which is depicted as the cross in model space M. The set of pmfs so defined
forms a curve that goes through both the cross and the circle, see the left panel
of Fig. 10.
Fig 9. Two MPT models that theorize how a participant chooses the outcomes L, M , or R
in the source-memory task described in the main text. The left panel schematically describes
the individual-word strategy, while the right model schematically describes the only-mixed
strategy.
For instance, when the participant’s true differentiability is α∗ = 1/3, the three
outcomes [L, M, R] are then generated with the equal chances f2 (X ∣ 1/3) =
[1/3, 1/3, 1/3], which, as before, is plotted as the dot in Fig. 10. On the other
hand, when α∗ = 0.5 the participant’s generating pmf is then given by f2 (X ∣ α =
0.5) = [0.25, 0.5, 0.25], i.e., the cross. The set of pmfs so defined forms a curve
that goes through both the dot and the cross, see the left panel of Fig. 10.
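The two model curves can be reproduced numerically. The closed-form pmfs below are our reconstruction from the chances reported in the text (f1 from the processing tree, f2 reproducing the dot and the cross); they are assumptions insofar as the original equations are not repeated here:

```python
def f1(theta):
    """pmf over [L, M, R] under the individual-word strategy M1."""
    return [(1 - theta) ** 2, 2 * theta * (1 - theta), theta ** 2]

def f2(alpha):
    """pmf over [L, M, R] under the only-mixed strategy M2."""
    return [(1 - alpha) / 2, alpha, (1 - alpha) / 2]

# Chances reported in the text, rounded for display.
print([round(p, 2) for p in f1(0.9)])    # [0.01, 0.18, 0.81] -- the circle
print([round(p, 2) for p in f1(0.5)])    # [0.25, 0.5, 0.25]  -- the cross
print([round(p, 2) for p in f2(1 / 3)])  # [0.33, 0.33, 0.33] -- the dot
print([round(p, 2) for p in f2(0.5)])    # [0.25, 0.5, 0.25]  -- overlap at the cross
```

Sweeping θ and α over (0, 1) traces out the two curves shown in Fig. 10.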
The plots show that the models M1 and M2 are neither saturated nor nested,
as the two models define proper subsets of M and only overlap at the cross.
Furthermore, the plots also show that M1 and M2 are both one-dimensional,
as each model is represented as a line in model space. Hence, the dimensionality
terms in all three information criteria are the same. Moreover, AIC and BIC
will discriminate these two models based on goodness-of-fit alone. This
particular model comparison, thus, allows us to highlight the role Fisher
information plays in the MDL model selection philosophy.
Fig 10. Left panel: The set of pmfs that are defined by the individual-list strategy M1 forms
a curve that goes through both the cross and the circle, while the pmfs of the only-mixed
strategy M2 correspond to the curve that goes through both the cross and the dot. Right
panel: The model selected by FIA can be thought of as the model closest to the empirical pmf
with an additional penalty for model complexity. The selection between the individual-list and
the only-mixed strategy by FIA based on n = 30 trials is formalized by the additional curves
–the only-mixed strategy is preferred over the individual-list strategy, when the observations
yield an empirical pmf that lies between the two non-decision curves. The top, middle and
bottom squares correspond to the data sets x^n_obs,1, x^n_obs,2 and x^n_obs,3 in Table 1, which are
best suited to M2, either, and M1, respectively. The additional penalty is most noticeable at
the cross, where the two models share a pmf. Observations with n = 30 yielding an empirical
pmf in this area are automatically assigned to the simpler model, i.e., the only-mixed strategy
M2 .
For FIA we need to compute the goodness-of-fit terms, thus, we need to identify
the MLEs for the parameters within each model.
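As a hedged illustration (ours, not the paper's), assume the pmfs implied by the chances reported earlier, f1(X ∣ ϑ) = [(1−ϑ)², 2ϑ(1−ϑ), ϑ²] and f2(X ∣ α) = [(1−α)/2, α, (1−α)/2]. Setting the score functions to zero then gives the MLEs ϑ̂ = (yM + 2yR)/(2n) and α̂ = yM/n, and the goodness-of-fit terms follow directly:

```python
import math

def loglik(pmf, counts):
    """Multinomial log-likelihood, up to the constant multinomial coefficient."""
    return sum(y * math.log(p) for y, p in zip(counts, pmf) if y > 0)

def fit_m1(counts):
    """Goodness of fit of M1: minus the maximized log-likelihood, with
    f1 = [(1-t)^2, 2t(1-t), t^2] and MLE t_hat = (y_M + 2 y_R) / (2n)."""
    yL, yM, yR = counts
    n = yL + yM + yR
    t = (yM + 2 * yR) / (2 * n)
    return -loglik([(1 - t) ** 2, 2 * t * (1 - t), t ** 2], counts)

def fit_m2(counts):
    """Goodness of fit of M2, with f2 = [(1-a)/2, a, (1-a)/2] and MLE a_hat = y_M / n."""
    yL, yM, yR = counts
    n = yL + yM + yR
    a = yM / n
    return -loglik([(1 - a) / 2, a, (1 - a) / 2], counts)

for counts in ([12, 1, 17], [14, 10, 6], [12, 16, 2]):
    print(counts, round(fit_m1(counts), 1), round(fit_m2(counts), 1))
```

Adding the dimensionality and geometric complexity terms to these fits yields the FIA values reported below.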
Table 1

x^n_obs = [yL, yM, yR]      FIA_M1(x^n_obs)   FIA_M2(x^n_obs)   Preferred model
x^n_obs,1 = [12, 1, 17]           42                26               M2
x^n_obs,2 = [14, 10, 6]           34                34               tie
x^n_obs,3 = [12, 16, 2]           29                32               M1

Table 1 shows three data sets x^n_obs,1, x^n_obs,2, x^n_obs,3 with n = 30 observations. The three
associated empirical pmfs are plotted as the top, middle and lower rectangles in
the right panel of Fig. 10, respectively. Table 1 also shows the approximation of
each model’s description length using FIA. Note that the first observed pmf, the
top rectangle in Fig. 10, is closer to M2 than to M1 , while the third empirical
pmf, the lower rectangle, is closer to M1 . Of particular interest is the middle
rectangle, which lies on an additional black curve that we refer to as a non-
decision curve; observations that correspond to an empirical pmf that lies on this
curve are described equally well by M1 and M2 . For this specific comparison, we
have the following decision rule: FIA selects M2 as the preferred model whenever
the observations correspond to an empirical pmf between the two non-decision
curves, otherwise, FIA selects M1 . Fig. 10 shows that FIA, indeed, selects the
model that is closest to the data except in the area where the two models
overlap –observations consisting of n = 30 trials with an empirical pmf near the
cross are considered better described by the simpler model M2 . Hence, this
yields an incorrect decision even when the empirical pmf is exactly equal to the
true data generating pmf that is given by, say, f1 (X ∣ ϑ = 0.51). This automatic
preference for the simpler model, however, decreases as n increases. The left
Fig 11. For n large the additional penalty for model complexity becomes irrelevant. The
plotted non-decision curves are based on n = 120 and n = 10,000 trials in the left and right
panel respectively. In the right panel only the goodness-of-fit matters in the model comparison.
The model selected is then the model that is closest to the observations.
and right panel of Fig. 11 show the non-decision curves when n = 120 and n
(extremely) large, respectively. As a result of moving non-decision bounds, the
data set xnobs,4 = [56, 40, 24] that has the same observed pmf as xnobs,2 , i.e., the
middle rectangle, will now be better described by model M1 .
For (extremely) large n, the additional penalty due to M1 being more volup-
tuous than M2 becomes irrelevant and the sphere is then separated into quad-
rants: observations corresponding to an empirical pmf in the top-left or bottom-
right quadrant are better suited to the only-mixed strategy, while the top-right
and bottom-left quadrants indicate a preference for the individual-word strat-
egy M1 . Note that pmfs on the non-decision curves in the right panel of Fig. 11
are as far apart from M1 as from M2 , which agrees with our geometric in-
terpretation of goodness-of-fit as a measure of the model’s proximity to the
data. This quadrant division is only based on the two models’ goodness-of-fit
terms and yields the same selection as one would get from BIC (e.g., Rissanen,
1996). For large n, FIA, thus, selects the model that is closest to the empirical
pmf. This behavior is desirable, because asymptotically the empirical pmf is not
distinguishable from the true data generating pmf. As such, the model that is
closest to the empirical pmf will then also be closest to the true pmf. Hence,
FIA asymptotically selects the model that is closest to the true pmf. As a result,
the projected pmf within the closest model is then expected to yield the best
predictions amongst the competing models.
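This closest-model reading of the large-n behavior can be mimicked numerically: map the empirical pmf and the model curves into model space with m(X) = 2√p(X) and compare Euclidean distances. The sketch below is our illustration; the pmfs f1 and f2 are reconstructions from the chances reported in the text:

```python
import numpy as np

def m(p):
    """Map a pmf to model space: m(X) = 2 * sqrt(p(X))."""
    return 2 * np.sqrt(np.asarray(p))

def f1(t):  # individual-word strategy, reconstructed from the reported chances
    return [(1 - t) ** 2, 2 * t * (1 - t), t ** 2]

def f2(a):  # only-mixed strategy
    return [(1 - a) / 2, a, (1 - a) / 2]

def dist_to_curve(p_emp, curve, grid=np.linspace(1e-6, 1 - 1e-6, 2001)):
    """Smallest Euclidean distance in model space from the empirical pmf
    to the curve traced out by a one-parameter model (grid search)."""
    target = m(p_emp)
    return min(np.linalg.norm(target - m(curve(g))) for g in grid)

p_emp = [12 / 30, 1 / 30, 17 / 30]   # empirical pmf of the first data set
print(dist_to_curve(p_emp, f1), dist_to_curve(p_emp, f2))
```

For this empirical pmf the only-mixed curve is by far the nearer one, in line with the FIA preference for M2 in Table 1.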
5. Concluding Comments
References
by minimum description length: Lower-bound sample sizes for the Fisher in-
formation approximation. Journal of Mathematical Psychology 60 29–34.
Huzurbazar, V. S. (1950). Probability distributions and orthogonal parame-
ters. In Mathematical Proceedings of the Cambridge Philosophical Society 46
281–284. Cambridge University Press.
Huzurbazar, V. S. (1956). Sufficient statistics and orthogonal parameters.
Sankhyā: The Indian Journal of Statistics (1933-1960) 17 217–220.
Inagaki, N. (1970). On the Limiting Distribution of a Sequence of Estimators
with Uniformity Property. Annals of the Institute of Statistical Mathematics
22 1–13.
Jeffreys, H. (1946). An Invariant Form for the Prior Probability in Estimation
Problems. Proceedings of the Royal Society of London. Series A. Mathematical
and Physical Sciences 186 453–461.
Jeffreys, H. (1961). Theory of Probability, 3rd ed. Oxford University Press,
Oxford, UK.
Kass, R. E. (1989). The Geometry of Asymptotic Inference. Statistical Science
4 188–234.
Kass, R. E. and Vaidyanathan, S. K. (1992). Approximate Bayes factors and
orthogonal parameters, with application to testing equality of two binomial
proportions. Journal of the Royal Statistical Society. Series B (Methodologi-
cal) 129–144.
Kass, R. E. and Vos, P. W. (2011). Geometrical foundations of asymptotic
inference 908. John Wiley & Sons.
Klauer, K. C. and Kellen, D. (2011). The flexibility of models of recognition
memory: An analysis by the minimum-description length principle. Journal
of Mathematical Psychology 55 430–450.
Kleijn, B. J. K. and Zhao, Y. Y. (2017). Criteria for posterior consistency.
arXiv preprint arXiv:1308.1263.
Korostelev, A. P. and Korosteleva, O. (2011). Mathematical statistics:
Asymptotic minimax theory 119. American Mathematical Society.
Kotz, S., Kozubowski, T. J. and Podgorski, K. (2001). The Laplace Distri-
bution and Generalizations: A Revisit with Applications to Communications,
Economics, Engineering, and Finance. Springer, New York.
Kraft, L. G. (1949). A device for quantizing, grouping, and coding amplitude-
modulated pulses Master’s thesis, Massachusetts Institute of Technology.
Kullback, S. and Leibler, R. A. (1951). On information and sufficiency.
The Annals of Mathematical Statistics 22 79–86.
LeCam, L. (1970). On the assumptions used to prove asymptotic normality
of maximum likelihood estimates. The Annals of Mathematical Statistics 41
802–828.
LeCam, L. (1990). Maximum likelihood: An introduction. International Statis-
tical Review/Revue Internationale de Statistique 58 153–171.
Lee, M. D. and Wagenmakers, E. J. (2013). Bayesian Cognitive Modeling:
A Practical Course. Cambridge University Press, Cambridge.
Lehmann, E. L. (2011). Fisher, Neyman, and the creation of classical statistics.
Springer Science & Business Media.
where l(x ∣ θ⃗) = log f(x ∣ θ⃗) is the log-likelihood function, ∂l(x ∣ θ⃗)/∂θi is the score
function, that is, the partial derivative with respect to the ith component of
the vector θ⃗, and the dot is short-hand notation for the vector of partial
derivatives with respect to θ⃗ = (θ1, . . . , θd). Thus, l̇(x ∣ θ⃗) is a d × 1 column vector
of score functions, while l̇ᵀ(x ∣ θ⃗) is a 1 × d row vector of score functions at the
outcome x. The partial derivatives are evaluated at θ⃗, the same θ⃗ that is used
in the pmf pθ⃗(x) for the weighting. In Appendix E it is shown that the score
functions are expected to be zero, which explains why IX(θ⃗) is a covariance
matrix.
Under mild regularity conditions the i, jth entry of the Fisher information
matrix can be equivalently calculated via the negative expectation of the second
order partial derivatives, that is,
Note that the sum (thus, integral in the continuous case) is with respect to the
outcomes x of X.
Example A.1 (Fisher information for normally distributed random variables).
When X is normally distributed, i.e., X ∼ N (µ, σ 2 ), it has the following proba-
bility density function (pdf )
f(x ∣ θ⃗) = (1/(σ√(2π))) exp( −(x − µ)²/(2σ²) ),   (A.5)

where the parameters are collected into the vector θ⃗ = (µ, σ), with µ ∈ ℝ and σ > 0.
The score vector is

l̇(x ∣ θ⃗) = ( ∂l(x ∣ θ⃗)/∂µ , ∂l(x ∣ θ⃗)/∂σ )ᵀ = ( (x − µ)/σ² , (x − µ)²/σ³ − 1/σ )ᵀ.   (A.6)
The unit Fisher information matrix IX(θ⃗) is a 2 × 2 symmetric positive
semidefinite matrix, consisting of expectations of partial derivatives.
Equivalently, IX(θ⃗) can be calculated using the second-order partial derivatives
The off-diagonal elements are in general not zero. If the i, jth entry is zero we
say that θi and θj are orthogonal to each other, see Appendix C.3.3 below. ◇
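The claim that IX(θ⃗) is the covariance matrix of the score can be checked by simulation. The sketch below (ours, not part of the original appendix) estimates the 2 × 2 matrix for the normal model by averaging outer products of the score vector of Eq. (A.6), and compares it with the known result diag(1/σ², 2/σ²):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.0, 2.0
x = rng.normal(mu, sigma, size=200_000)

# Score vector of Eq. (A.6) evaluated at the true parameters,
# one row per observation.
score = np.column_stack([(x - mu) / sigma**2,
                         (x - mu) ** 2 / sigma**3 - 1 / sigma])

# Fisher information as the (Monte Carlo) covariance of the score.
I_hat = score.T @ score / len(x)
I_analytic = np.diag([1 / sigma**2, 2 / sigma**2])
print(np.round(I_hat, 3))
print(I_analytic)
```

The vanishing off-diagonal estimate illustrates the orthogonality of µ and σ mentioned above.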
For iid trials X^n = (X1, . . . , Xn) with X ∼ pθ⃗(x), the Fisher information
matrix for X^n is given by IX^n(θ⃗) = n IX(θ⃗). Thus, for vector-valued parameters
θ⃗ the Fisher information matrix remains additive.
In the remainder of the text, we simply use θ for both one-dimensional and
vector-valued parameters. Similarly, depending on the context it should be clear
whether IX (θ) is a number or a matrix.
The construction of the hypothesis tests and confidence intervals in the frequen-
tist section were all based on the MLE being asymptotically normal.
For so-called regular parametric models, see Appendix E, the MLE for vector-
valued parameters θ converges in distribution to a multivariate normal distri-
bution, that is,
√n (θ̂ − θ∗) →_D Nd(0, IX⁻¹(θ∗)), as n → ∞,   (B.1)
In practice, we fix n and replace the true sampling distribution by this normal
distribution. Hence, we incur an approximation error that is only negligible
whenever n is large enough. What constitutes n large enough depends on the
true data generating pmf p∗ (x) that is unknown in practice. In other words,
the hypothesis tests and confidence intervals given in the main text based on
the replacement of the true sampling distribution by this normal distribution
might not be appropriate. In particular, this means that a hypothesis test at
a significance level of 5% based on the asymptotic normal distribution, instead
of the true sampling distribution, might actually yield a type 1 error rate of,
say, 42%. Similarly, as a result of the approximation error, a 95%-confidence
interval might only encapsulate the true parameter in, say, 20% of the times
that the experiment is repeated.
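The possible mismatch between nominal and actual error rates is easy to exhibit. As a hedged illustration (our example, not the paper's), the exact coverage of the textbook 95% Wald interval for a binomial proportion can be computed by enumerating all outcomes; for small n it can fall far below 95%:

```python
import math

def wald_coverage(n, p, z=1.96):
    """Exact coverage probability of the nominal 95% Wald interval
    p_hat +/- z * sqrt(p_hat (1 - p_hat) / n), summed over all outcomes."""
    cover = 0.0
    for k in range(n + 1):
        p_hat = k / n
        half = z * math.sqrt(p_hat * (1 - p_hat) / n)
        if p_hat - half <= p <= p_hat + half:
            cover += math.comb(n, k) * p**k * (1 - p) ** (n - k)
    return cover

# The nominal 95% interval covers the true p = 0.1 far less often for small n.
for n in (10, 50, 100):
    print(n, round(wald_coverage(n, 0.1), 3))
```

This is the phenomenon documented in detail by Brown, Cai and DasGupta (2001), cited below.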
Example B.1 (Asymptotic normality of the MLE vs the CLT: The Gaussian
distribution). If X has a Gaussian (normal) distribution, i.e., X ∼ N (θ, σ 2 ),
with σ 2 known, then the MLE is the sample mean and the unit Fisher infor-
mation is IX(θ) = 1/σ². Asymptotic normality of the MLE leads to the same
statement as the CLT, that is, √n (θ̂ − θ∗) →_D N(0, σ²). Hence, asymptotically
we do not gain anything by going from the CLT to asymptotic normality of the
MLE. The additional knowledge of f (x ∣ θ) being normal does, however, allow
us to come to the rare conclusion that the normal approximation holds exactly
for every finite n, thus, (θ̂ − θ∗) =_D N(0, σ²/n). In all other cases, whenever
The sample median estimator M̂ performs better. Again, Fisher (1922) already
knew that for n large enough (M̂ − θ∗) ≈_D N(0, π²/(4n)). The MLE is
The previous examples showed that the MLE is an estimator that leads to
a smaller sample size requirement, because it is the estimator with the lower
asymptotic variance. This lower asymptotic variance is a result of the MLE
making explicit use of the functional relationship between the samples xnobs and
the target θ in the population. Given any such f , one might wonder whether the
MLE is the estimator with the lowest possible asymptotic variance. The answer
is affirmative, whenever we restrict ourselves to the broad class of so-called
regular estimators.
A regular estimator Tn = tn (Xn ) is a function of the data that has a limiting
distribution that does not change too much, whenever we change the parameters
in the neighborhood of the true value θ∗ , see van der Vaart (1998, p. 115) for
a precise definition. The Hájek-LeCam convolution theorem characterizes the
aforementioned limiting distribution as a convolution, i.e., a sum of, the inde-
pendent statistics ∆θ∗ and Zθ∗ . That is, for any regular estimator Tn and every
possible true value θ∗ we have
√n (Tn − θ∗) →_D ∆θ∗ + Zθ∗, as n → ∞,   (B.3)
The MLE is a regular estimator with ∆θ∗ equal to the fixed true value θ∗,
thus, Var(∆θ∗) = 0. As such, the MLE has an asymptotic variance IX⁻¹(θ∗) that
is equal to the lower bound given above. Hence, amongst the broad class of
regular estimators, the MLE performs best. This result was already foreshad-
owed by Fisher (1922), though it took another 50 years before this statement
21. Given observations x^n_obs the maximum likelihood estimate θ̂obs is the number for which
the score function l̇(x^n_obs ∣ θ) = ∑_{i=1}^{n} 2(x_{obs,i} − θ)/(1 + (x_{obs,i} − θ)²) is zero. This
optimization cannot be solved analytically.
was made mathematically rigorous (Hájek, 1970; Inagaki, 1970; LeCam, 1970;
van der Vaart, 2002; Yang, 1999), see also Ghosh (1985) for a beautiful review.
We stress that the normal approximation to the true sampling distribution
only holds when n is large enough. In practice, n is relatively small and the
replacement of the true sampling distribution by the normal approximation
can, thus, lead to confidence intervals and hypothesis tests that perform poorly
(Brown, Cai and DasGupta, 2001). This can be very detrimental, especially,
when we are dealing with hard decisions such as the rejection or non-rejection
of a hypothesis.
A simpler version of the Hájek-LeCam convolution theorem is known as the
Cramér-Fréchet-Rao information lower bound (Cramér, 1946; Fréchet, 1943;
Rao, 1945), which also holds for finite n. This theorem states that the variance
of an unbiased estimator Tn cannot be lower than the inverse Fisher information,
that is, n Var(Tn) ≥ IX⁻¹(θ∗). We call an estimator Tn = t(X^n) unbiased if for
every possible true value θ∗ and at each fixed n, its expectation is equal to
the true value, that is, E(Tn ) = θ∗ . Hence, this lower bound shows that Fisher
information is not only a concept that is useful for large samples.
Unfortunately, the class of unbiased estimators is rather restrictive (in
general, it does not include the MLE) and the lower bound cannot be attained
whenever the parameter has more than one dimension (Wijsman, 1973).
Consequently, for vector-valued parameters θ, this information lower bound does
not inform us whether we should stop our search for a better estimator.
The Hájek-LeCam convolution theorem implies that for n large enough the
MLE θ̂ is the best performing statistic. For the MLE to be superior, however,
the data do need to be generated as specified by the functional relationship f .
In reality, we do not know whether the data are indeed generated as specified by
f , which is why we should also try to empirically test such an assumption. For
instance, we might believe that the data are normally distributed, while in fact
they were generated according to a Cauchy distribution. This incorrect assump-
tion implies that we should use the sample mean, but Example B.3 showed the
futility of such an estimator. Model misspecification, in addition to hard decisions
based on the normal approximation, might be the main culprit of the crisis of
replicability. Hence, more research on the detection of model misspecification is
desirable and expected (e.g., Grünwald, 2016; Grünwald and van Ommen, 2014;
van Ommen et al., 2016).
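The Cauchy scenario can be simulated directly. The sketch below (our illustration) replicates many experiments and shows that the sample mean fails to concentrate under Cauchy data, whereas the sample median does:

```python
import numpy as np

rng = np.random.default_rng(1)
theta_star = 0.0   # true location of a standard Cauchy

# Sampling distributions of the mean and the median over replicated experiments.
n, reps = 100, 2000
data = rng.standard_cauchy((reps, n)) + theta_star
means = data.mean(axis=1)
medians = np.median(data, axis=1)

# Under a Cauchy, the sample mean of n observations is itself Cauchy
# distributed and does not concentrate; the median does concentrate.
print(np.median(np.abs(means - theta_star)))    # typically around 1
print(np.median(np.abs(medians - theta_star)))  # much smaller
```

Assuming normality here and averaging the data would thus be exactly the kind of model misspecification discussed above.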
We make intuitive that the Jeffreys's prior is a uniform prior on the model MΘ,
i.e.,

P(m∗ ∈ Jm) = (1/V) ∫_{Jm} 1 dmθ(X) = (1/V) ∫_{θa}^{θb} √(IX(θ)) dθ,   (C.1)
First note that we swapped the area of integration by substituting the interval
Jm = (mθa (X), mθb (X)) consisting of pmfs in function space MΘ by the interval
(θa , θb ) in parameter space. This is made possible by the parameter functional ν
with domain MΘ and range Θ that uniquely assigns to any (transformed) pmf
ma(X) ∈ MΘ a parameter value θa ∈ Θ. In this case, we have θa = ν(ma(X)) =
(½ ma(1))². Uniqueness of the assignment implies that the resulting parameter
values θa and θb in Θ differ from each other whenever ma (X) and mb (X) in
MΘ differ from each other. For example, the map ν ∶ MΘ → Θ implies that
Fig 12. The full arrow represents the simultaneous displacement in model space based on the
Taylor approximation Eq. (C.3) in terms of θ at mθa (X), where θa = 0.8 (left panel) and in
terms of φ at mφa (X) where φa = 0.6π (right panel). The dotted line represents a part of the
Bernoulli model and note that the full arrow is tangent to the model.
in the left panel of Fig. 12 the third square from the left with coordinates
ma(X) = [0.89, 1.79] can be labeled by θa = 0.8 ≈ (½ · 1.79)², while the second
square from the left with coordinates mb(X) = [0.63, 1.90] can be labeled by
θb = 0.9 ≈ (½ · 1.90)².
To calculate the arc length of the curve Jm consisting of functions in MΘ ,
we first approximate Jm by a finite sum of tangent vectors, i.e., straight lines.
The approximation of the arc length is the sum of the length of these straight
lines. The associated approximation error goes to zero, when we increase the
number of tangent vectors and change the sum into an integral sign, as in the
usual definition of an integral. First we discuss tangent vectors.
In the left panel in Fig. 12, we depicted the tangent vector at mθa (X) as
the full arrow. This full arrow is constructed from its components: one broken
arrow that is parallel to the horizontal axis associated with the outcome x = 0,
and one broken arrow that is parallel to the vertical axis associated with the
outcome x = 1. The arrows parallel to the axes are derived by first fixing X = x
followed by a Taylor expansion of the parameterization θ ↦ mθ(x) at θa. This
expansion approximates mθb(x) by mθa(x) + Aθa(x) dθ + Bθa(x), with dθ = θb − θa,
where the slope Aθa(x) at mθa(x) in the direction of x is given by

Aθa(x) = dmθa(x)/dθ = ½ { (d/dθ) log f(x ∣ θa) } mθa(x),   (C.3)

in which the factor in braces is the score function, and with an “intercept”
Bθa(x) = o(dθ) that goes fast to zero whenever dθ → 0. Thus, for dθ small, the
intercept Bθa(x) is practically zero. Hence, we approximate the displacement
between mθa(x) and mθb(x) by a straight line.
Example C.1 (Tangent vectors). In the right panel of Fig. 12 the right most
triangle is given by mφa(X) = [1.25, 1.56], while the triangle in the middle refers
to mφb(X) = [0.99, 1.74]. Using the functional ν̃, i.e., the inverse of the
parameterization φ ↦ 2√f(x ∣ φ), where
f(x ∣ φ) = (½ + ½(φ/π)³)^x (½ − ½(φ/π)³)^(1−x), we
= √( ∑_{x∈X} ( (d/dθ) log f(x ∣ θa) )² pθa(x) ) dθ = √(IX(θa)) dθ.   (C.9)

The second equality follows from the definition of dmθa(X)/dθ, i.e., Eq. (C.3),
and the last equality is due to the definition of Fisher information.
Example C.2 (Length of the tangent vectors). The length of the tangent vector
in the right panel of Fig. 12 can be calculated as the root of the sum of squares
of its components, that is, ∥ (dmφa(X)/dφ) dφ ∥₂ = √((−0.14)² + 0.17²) = 0.22.
Alternatively, we can first calculate the square root of the Fisher information
at φa = 0.6π, i.e.,

√(I(φa)) = 3φa² / √(π⁶ − φa⁶) ≈ 0.35,   (C.10)
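The numbers in Example C.2 can be verified numerically: differentiate m(x ∣ φ) = 2√f(x ∣ φ) by finite differences and compare the Euclidean norm of the tangent vector with the analytic square root of the Fisher information. The sketch below is our check, with f(x ∣ φ) as in Example C.1:

```python
import math

def f(x, phi):
    """Bernoulli pmf in the phi-parameterization of Example C.1."""
    p1 = 0.5 + 0.5 * (phi / math.pi) ** 3
    return p1 if x == 1 else 1 - p1

def m(x, phi):
    """Transformed pmf m(x | phi) = 2 * sqrt(f(x | phi))."""
    return 2 * math.sqrt(f(x, phi))

phi_a, h = 0.6 * math.pi, 1e-6

# Tangent vector: numerical derivatives dm(x)/dphi for the outcomes x = 0, 1.
dm = [(m(x, phi_a + h) - m(x, phi_a - h)) / (2 * h) for x in (0, 1)]
norm = math.hypot(*dm)

# Analytic square root of the Fisher information at phi_a.
sqrt_I = 3 * phi_a**2 / math.sqrt(math.pi**6 - phi_a**6)

print(round(norm, 3), round(sqrt_I, 3))  # both ~0.352
```

Multiplying either quantity by dφ = 0.2π reproduces the length 0.22 reported above.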
More generally, to approximate the length between pmfs mθa(X) and mθb(X),
we first identify ν(mθa(X)) = θa and multiply the root of the Fisher information
at θa with the distance dθ = ∣θa − ν(mθb(X))∣ in parameter space, i.e.,

dmθ(X) = ∥ dmθ(X)/dθ ∥₂ dθ = √(IX(θ)) dθ.   (C.11)

In other words, the root of the Fisher information converts a small distance dθ
at θa to a displacement in model space at mθa(X).
For random variables with w number of outcomes, the largest set of pmfs P is
the collection of functions p on X such that (i) 0 ≤ p(x) = P (X = x) for every
outcome x in X , and (ii) to explicitly convey that there are w outcomes, and
none more, these w chances have to sum to one, that is, ∑x∈X p(x) = 1. The
complete set of pmfs P can be parameterized using the functional ν that assigns
to each w-dimensional pmf p(X) a parameter β ∈ Rw−1 .
For instance, given a pmf p(X) = [p(L), p(M), p(R)] we typically use the
functional ν : P → ℝ² that takes the first two coordinates, that is, ν(p(X)) =
β = (β1, β2), where β1 = p(L) and β2 = p(M). The range of this functional ν is the
parameter space B = [0, 1] × [0, 1 − β1]. Conversely, the inverse of the functional ν is
the parameterization β ↦ pβ (X) = [β1 , β2 , 1 − β1 − β2 ], where (i’) 0 ≤ β1 , β2 and
(ii’) β1 +β2 ≤ 1. The restrictions (i’) and (ii’) imply that the parameterization has
domain B and the largest set of pmfs P as its range. By virtue of the functional
ν and its inverse, that is, the parameterization β ↦ pβ (X), we conclude that the
parameter space B and the complete set of pmfs P are isomorphic. This means
that each pmf p(X) ∈ P can be uniquely identified with a parameter β ∈ B and
vice versa. The inverse of ν implies that the parameters β ∈ B are functionally
related to the potential outcomes x of X as
and write MΓ for the collection of vectors so defined. As before, this collection
coincides with the full model, i.e., MΓ = M. In other words, by virtue of the
functional ν̃ and its inverse γ ↦ pγ (x) = f (x ∣ γ) we conclude that the parameter
space Γ and the complete set of pmfs M are isomorphic. Because M = MB this
means that we also have an isomorphism between the parameter space B and
Γ via M, even though B is a strict subset of Γ. Note that this equivalence goes
via parameterization β ↦ mβ (X) and the functional ν̃.
IX(β) = (1/(1 − β1 − β2)) ( (1 − β2)/β1        1
                                  1        (1 − β1)/β2 )   and

IX(γ) = ( 1/(γ1(1 − γ1))            0
                0            (1 − γ1)/(γ2(1 − γ2)) ),   (C.17)
respectively. The left panel of Fig. 13 shows the tangent vectors at pβ ∗ (X) =
[1/3, 1/3, 1/3] in model space, where β ∗ = (1/3, 1/3). The green tangent vector
22. This only works if pL < 1. When p(x1) = 1, we simply set γ2 = 0, thus, γ = (1, 0).
Fig 13. When the off-diagonal entries are zero, the tangent vectors are orthogonal. Left
panel: The tangent vectors at pβ∗(X) = [1/3, 1/3, 1/3] span a diamond with an area given
by √(det I(β∗)) dβ. The black curve is the submodel with β2 = 1/3 fixed and β1 free to vary
and yields a green tangent vector. The blue curve is the submodel with β1 = 1/3 fixed and β2
free to vary. Right panel: The tangent vectors at the same pmf in terms of γ, thus, pγ∗(X),
span a rectangle with an area given by √(det I(γ∗)) dγ. The black curve is the submodel with
γ2 = 1/2 fixed and γ1 free to vary and yields a green tangent vector. The blue curve is the
submodel with γ1 = 1/3 fixed and γ2 free to vary.
Observe that the inner integral depends on the value of β1 from the outer
integral. In terms of γ, however, the integrand √(det IX(γ)) = 1/√(γ1 γ2(1 − γ2))
factorizes and the double integral simplifies to

V = ∫₀¹ ∫₀¹ 1/√(γ1 γ2(1 − γ2)) dγ1 dγ2 = ∫₀¹ 1/√γ1 dγ1 ∫₀¹ 1/√(γ2(1 − γ2)) dγ2 = 2 · π = 2π.   (C.19)
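The value V = 2π can also be confirmed by brute force. The sketch below (our illustration) approximates the two one-dimensional integrals with a midpoint rule, which handles the integrable endpoint singularities well enough for a rough check:

```python
import math

# Midpoint-rule check of the normalizing constant V. The integrand
# sqrt(det I(gamma)) = 1 / sqrt(gamma1 * gamma2 * (1 - gamma2)) factorizes,
# so the double integral is a product of two one-dimensional integrals.
def midpoint(g, n=200_000):
    h = 1.0 / n
    return sum(g((i + 0.5) * h) for i in range(n)) * h

V = midpoint(lambda g1: 1 / math.sqrt(g1)) \
    * midpoint(lambda g2: 1 / math.sqrt(g2 * (1 - g2)))
print(V, 2 * math.pi)  # the approximation is close to 2*pi
```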
where li is the code length of the ith outcome. In our example, we have taken
D = 2 and code lengths of 2, 1 and 2 bits for the responses L, M and R respectively.
Indeed, 2⁻² + 2⁻¹ + 2⁻² = 1. Hence, code lengths behave like the negative
logarithm (with base D) of a pmf.
Given a data generating pmf p∗ (X), we can use the so-called Shannon-Fano
algorithm (e.g., Cover and Thomas, 2006, Ch. 5) to construct a prefix coding
system C ∗ . The idea behind this algorithm is to give the outcome x that is
generated with the highest chance the shortest code length. To do so, we encode
the outcome x as a code word C ∗ (x) that consists of − log2 p∗ (x) bits.23
23. When we use the logarithm with base two, log₂(y), we get the code length in bits,
while the natural logarithm, log(y), yields the code length in nats. Any result in terms of the
natural logarithm can be equivalently described in terms of the logarithm with base two, as
log(y) = log(2) log₂(y).
With the true data generating pmf p∗ (X) at hand, thus, also the true coding
system f (X ∣ β ∗ ), we can calculate the (population) average code length per
trial
Whenever we use the logarithm with base 2, we refer to this quantity H(p∗(X))
as the Shannon entropy.25 If the true pmf is p∗(X) = [0.25, 0.5, 0.25] we have an
average code length of 1.5 bits per trial whenever we use the true coding system
f(X ∣ β∗). Thus, we expect to use 12 bits to encode observations consisting of
n = 8 trials.
As coding theorists, we have no control over the true data generating pmf
p∗ (X), but we can choose the coding system f (X ∣ β) to encode the observations.
The (population) average code length per trial is given by
The quantity H(p∗ (X)∥ β) is also known as the cross entropy from the true
pmf p∗ (X) to the postulated f (X ∣ β).26 For instance, when we use the pmf
f (X ∣ β) = [0.01, 0.18, 0.81] to encode data that are generated according to
p∗ (X) = [0.25, 0.5, 0.25], we will use 2.97 bits on average per trial. Clearly,
24. Due to rounding, the Shannon-Fano algorithm actually produces code words C(x) that
are at most one bit larger than the ideal code length − log₂ p∗(x). We avoid further discussions
on rounding. Moreover, in the following we consider the natural logarithm instead.
25. Shannon denoted this quantity with an H to refer to the capital Greek letter for eta. It
seems that John von Neumann convinced Claude Shannon to call this quantity entropy rather
than information (Tribus and McIrvine, 1971).
26. Observe that the entropy H(p∗(X)) is just the cross entropy from the true p∗(X) to itself.
this is much more than the 1.5 bits per trial that we get from using the true
coding system f (X ∣ β ∗ ).
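Both averages are quickly recomputed; the snippet below is an illustration we add here:

```python
import math

def cross_entropy_bits(p_true, q_code):
    """Average code length per trial (in bits) when data generated by p_true
    are encoded with code lengths -log2 q_code."""
    return sum(-p * math.log2(q) for p, q in zip(p_true, q_code) if p > 0)

p_true = [0.25, 0.5, 0.25]

# True coding system: the Shannon entropy of p_true, 1.5 bits per trial.
print(cross_entropy_bits(p_true, p_true))                        # 1.5
# Mismatched coding system [0.01, 0.18, 0.81]: about 2.97 bits per trial.
print(round(cross_entropy_bits(p_true, [0.01, 0.18, 0.81]), 2))  # 2.97
```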
More generally, Shannon (1948) showed that the cross entropy can never be
smaller than the entropy, i.e., H(p∗ (X)) ≤ H(p∗ (X)∥ β). In other words, we
always get a larger average code length, whenever we use the wrong coding
system f (X ∣ β). To see why this holds, we decompose the cross entropy as a
sum of the entropy and the Kullback-Leibler divergence,27 and show that the
latter cannot be negative. This decomposition follows from the definition of cross
entropy and a subsequent addition and subtraction of the entropy resulting in
H(p∗(X) ∥ β) = H(p∗(X)) + ∑_{x∈X} ( log [ p∗(x) / f(x ∣ β) ] ) p∗(x) = H(p∗(X)) + D(p∗(X) ∥ β),   (D.4)
where D(p∗ (X)∥β) defines the Kullback-Leibler divergence from the true pmf
p∗ (X) to the postulated coding system f (X ∣ β). Using Jensen's inequality,
it can be shown that the KL-divergence is non-negative and that it is zero
only when f (X ∣ β) = p∗ (X). Thus, the cross entropy can never
be smaller than the entropy. Consequently, to minimize the load on the com-
munication network, we have to minimize the cross entropy with respect to the
parameter β. Unfortunately, however, we cannot do this in practice, because the
cross entropy is a population quantity based on the unknown true pmf p∗ (X).
Instead, we do the next best thing by replacing the true p∗ (X) in Eq. (D.3)
by the empirical pmf that gives the relative occurrences of the outcomes in the
sample rather than in the population. Hence, for any postulated f (X ∣ β), with
β fixed, we approximate the population average defined in Eq. (D.3) by the
sample average
\[
H(x^n_{\mathrm{obs}} \,\|\, \beta) = H(\hat{p}_{\mathrm{obs}}(X) \,\|\, f(X \mid \beta)) = \sum_{i=1}^{n} - \log f(x_{\mathrm{obs},i} \mid \beta) = - \log f(x^n_{\mathrm{obs}} \mid \beta). \tag{D.5}
\]
We call the quantity H(xnobs ∥ β) the log-loss from the observed data xnobs , i.e.,
the empirical pmf p̂obs (X), to the coding system f (X ∣ β).
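Because the log-loss is simply a negative log-likelihood, its minimizer over a full multinomial model is the vector of relative frequencies. A small sketch; the observations xs and the non-MLE candidate pmfs are hypothetical choices made for the illustration:

```python
import math
from collections import Counter

def log_loss(xs, f):
    """Log-loss H(x^n || beta) = -sum_i log f(x_i | beta), natural log."""
    return sum(-math.log(f[x]) for x in xs)

xs = [0, 1, 1, 2, 1, 0, 1, 1]                  # hypothetical observed outcomes
counts = Counter(xs)
mle = [counts[k] / len(xs) for k in range(3)]  # empirical pmf = MLE in the full model

candidates = [mle, [1/3, 1/3, 1/3], [0.25, 0.5, 0.25], [0.1, 0.6, 0.3]]
losses = [log_loss(xs, f) for f in candidates]
assert losses.index(min(losses)) == 0          # the MLE attains the smallest log-loss
```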
The entropy inequality H(p∗ (X)) ≤ H(p∗ (X)∥β) implies that the coding theo-
rist’s goal of finding the coding system f (X ∣ β) with the shortest average code
length is in fact equivalent to the statistical goal of finding the true data gen-
erating process p∗ (X). The coding theorist’s best guess is the coding system
f (X ∣ β) that minimizes the log-loss from xnobs to the model MB . Note that
minimizing the negative log-likelihood is the same as maximizing the likelihood.
Hence, the log-loss is minimized by the coding system associated with the MLE,
27 The KL-divergence is also known as the relative entropy.
thus, the predictive pmf f (X ∣ β̂obs ). Furthermore, the cross entropy decomposi-
tion shows that minimization of the log-loss is equivalent to minimization of the
KL-divergence from the observations xnobs to the model MB . The advantage of
having the optimization problem formulated in terms of KL-divergence is that it
has a known lower bound, namely, zero. Moreover, whenever the KL-divergence
from xnobs to the code f (X ∣ β̂obs ) is larger than zero, we know that the
empirical pmf associated with the observations does not reside on the model. In
particular, Section 4.3.1 showed that the MLE plug-in f (X ∣ β̂obs ) is the pmf
on the model that is closest to the data. This geometric interpretation is due
to the fact that we retrieve the Fisher-Rao metric when we take the second
derivative of the KL-divergence with respect to β (Kullback and Leibler, 1951).
This connection between the KL-divergence and Fisher information is exploited
by Ghosal, Ghosh and Ramamoorthi (1997) to generalize Jeffreys's prior to
nonparametric models. See van Erven and Harremos (2014) for the relationship
between the KL-divergence and the broader class of divergence measures
developed by Rényi (1961), and also Campbell (1965).
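The claim that the second derivative of the KL-divergence at the truth recovers the Fisher information can be checked with finite differences. A sketch for a Bernoulli model, for which I(θ) = 1/(θ(1 − θ)); the value θ∗ = 0.6 and the step size are arbitrary choices:

```python
import math

def kl_bernoulli(p, q):
    """KL divergence D(Ber(p) || Ber(q)) in nats."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

theta_star, h = 0.6, 1e-4
# central second difference of theta -> D(theta* || theta), evaluated at theta*
second_deriv = (kl_bernoulli(theta_star, theta_star + h)
                - 2 * kl_bernoulli(theta_star, theta_star)
                + kl_bernoulli(theta_star, theta_star - h)) / h**2
fisher = 1 / (theta_star * (1 - theta_star))  # Bernoulli Fisher information
assert abs(second_deriv - fisher) < 1e-3
```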
A more mathematically rigorous exposition of the subject would have had this
section as the starting point, rather than the last section of the appendix. The
regularity conditions given below can be seen as a summary, and guidelines for
model builders. If we as scientists construct models such that these conditions
are met, we can then use the results presented in the main text. We first give
a more general notion of statistical models, then state the regularity conditions
followed by a brief discussion on these conditions.
The goal of statistical inference is to find the true probability measure P ∗
that governs the chances with which X takes on its events. A model PΘ defines
a subset of P, the largest collection of all possible probability measures. We
as model builders choose PΘ and perceive each probability measure P within
PΘ as a possible explanation of how the events of X were or will be generated.
When P ∗ ∈ PΘ we have a well-specified model and when P ∗ ∉ PΘ , we say that
the model is misspecified.
By taking PΘ to be equal to the largest possible collection P, we will not be
misspecified. Unfortunately, this choice is not helpful as the complete set is hard
to track and leads to uninterpretable inferences. Instead, we typically construct
the candidate set PΘ using a parameterization that sends a label θ ∈ Θ to a
probability measure Pθ . For instance, we might take the label θ = (µ, σ²) from
the parameter space Θ = R × (0, ∞) and interpret these two numbers as the
population mean and variance of a normal probability Pθ . This distributional
choice is typical in psychology, because it allows for very tractable inference
with parameters that are generally overinterpreted. Unfortunately, the normal
distribution comes with rather stringent assumptions resulting in a high risk of
misspecification. More specifically, the normal distribution is rather idealized, as
it supposes that the population is symmetrically centred at its population mean
and that, due to its light tails, outliers are practically never expected.
\[
m_\theta(x) - m_{\theta^*}(x) = \tfrac{1}{2} (\theta - \theta^*)^T \dot{l}(x \mid \theta^*) \, m_{\theta^*}(x) + o(\|\theta - \theta^*\|_2), \tag{E.1}
\]
where l̇(x ∣ θ∗) is a d-dimensional vector of score functions in L2(Pθ∗),
(iii) the Fisher information matrix IX (θ) is non-singular,
(iv) the map θ ↦ l̇(x ∣ θ)mθ(x) is continuous from Θ to $L_2^d(\lambda)$.
Note that (ii) allows us to generalize the geometrical concepts discussed in Ap-
pendix C.3 to more general random variables X. ◇
We provide some intuition. Condition (i) implies that Θ inherits the topologi-
cal structure of Rd . In particular, we have an inner product on Rd that allows us
to project vectors onto each other, a norm that allows us to measure the length
of a vector, and the Euclidean metric that allows us to measure the distance
between two vectors by taking the square root of the sum of squares, that is,
$\|\theta^* - \theta\|_2 = \sqrt{\sum_{i=1}^{d} (\theta_i^* - \theta_i)^2}$. For d = 1 this norm is just the absolute value.
Condition (ii) implies that the linearization term (1/2)(θ − θ∗)T l̇(x ∣ θ∗)mθ∗ (x) is a good
approximation to the “error” mθ (x) − mθ∗ (x) in the model MΘ whenever θ
is close to θ∗, provided that the score functions l̇(x ∣ θ∗) do not blow up. More
specifically, this means that each component of l̇(x ∣ θ∗) has a finite norm.
Moreover, the score functions have mean zero at θ∗.
This condition follows from the chain rule applied to the logarithm and an
exchange of the order of integration with respect to x and differentiation with
respect to θi , as
\[
\int_{x \in \mathcal{X}} \frac{\partial}{\partial \theta_i} l(x \mid \theta^*) \, p_{\theta^*}(x) \, \mathrm{d}x
= \int_{x \in \mathcal{X}} \frac{\partial}{\partial \theta_i} p_{\theta^*}(x) \, \mathrm{d}x
= \frac{\partial}{\partial \theta_i} \int_{x \in \mathcal{X}} p_{\theta^*}(x) \, \mathrm{d}x
= \frac{\partial}{\partial \theta_i} 1 = 0. \tag{E.6}
\]
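The mean-zero property of the score in Eq. (E.6) can be checked numerically. A minimal sketch for a Bernoulli model, where the expectation is a finite sum over the two outcomes; θ = 0.3 is an arbitrary choice:

```python
theta = 0.3  # arbitrary Bernoulli parameter

def score(x, theta):
    """Score function d/dtheta log p_theta(x) for x in {0, 1}."""
    return x / theta - (1 - x) / (1 - theta)

pmf = {0: 1 - theta, 1: theta}
mean_score = sum(score(x, theta) * pmf[x] for x in (0, 1))
assert abs(mean_score) < 1e-12  # E[score] = 0, as in Eq. (E.6)
```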
Condition (iii) implies that the model does not collapse to a lower dimension.
For instance, when the parameter space is a plane, the resulting model MΘ
cannot be a line. Lastly, condition (iv) implies that the tangent functions change
smoothly as we move from mθ∗ (x) to mθ (x) on the sphere in L2 (λ), where θ is
a parameter value in the neighborhood of θ∗ .
The following conditions are stronger, thus, less general, but avoid Fréchet
differentiability and are typically easier to check.
Lemma E.1. Let Θ ⊂ Rd be open. At each possible true value θ∗ ∈ Θ, we
assume that pθ (x) is continuously differentiable in θ for λ-almost all x with
tangent vector ṗθ∗ (x). We define the score function at x as
\[
\dot{l}(x \mid \theta^*) = \frac{\dot{p}_{\theta^*}(x)}{p_{\theta^*}(x)}. \tag{E.7}
\]
The parameterization θ ↦ Pθ is regular, if the norm of the score vector Eq. (E.7)
is finite in quadratic mean, that is, l̇(X ∣ θ∗ ) ∈ L2 (Pθ∗ ), and if the corresponding
Fisher information matrix based on the score functions Eq. (E.7) is non-singular
and continuous in θ. ◇
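The conditions of the lemma can be made concrete for the normal model with θ = (µ, σ²): the Fisher information matrix built from the scores is finite and non-singular. A quadrature sketch; the values µ = 1, σ = 2 and the integration grid are arbitrary choices made for the illustration:

```python
import math

mu, sigma = 1.0, 2.0  # arbitrary normal parameters

def pdf(x):
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def score_mu(x):       # d/dmu log p_theta(x)
    return (x - mu) / sigma ** 2

def score_var(x):      # d/dsigma^2 log p_theta(x)
    return -1 / (2 * sigma ** 2) + (x - mu) ** 2 / (2 * sigma ** 4)

def expect(f, lo=-20.0, hi=22.0, n=20000):
    """Midpoint-rule approximation of E[f(X)] under the normal pdf."""
    h = (hi - lo) / n
    return h * sum(f(lo + (k + 0.5) * h) * pdf(lo + (k + 0.5) * h) for k in range(n))

# Fisher information entries I_ij = E[score_i * score_j]
I11 = expect(lambda x: score_mu(x) ** 2)             # analytic value: 1 / sigma^2
I22 = expect(lambda x: score_var(x) ** 2)            # analytic value: 1 / (2 sigma^4)
I12 = expect(lambda x: score_mu(x) * score_var(x))   # analytic value: 0
det = I11 * I22 - I12 ** 2                           # > 0, so I(theta) is non-singular
```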
There are many better sources than the current manuscript on this topic
that are mathematically much more rigorous and better written. For instance,
Bickel et al. (1993) give a proof of the lemma above and many more beautiful,
but sometimes rather (agonizingly) technically challenging, results. For a more
accessible, but no less elegant, exposition of the theory we highly recommend
van der Vaart (1998).