
Vol. X (2017) 1–59

A Tutorial on Fisher Information∗


Alexander Ly, Maarten Marsman, Josine Verhagen, Raoul Grasman and Eric-Jan Wagenmakers∗

arXiv:1705.01064v2 [math.ST] 17 Oct 2017


University of Amsterdam
Department of Psychological Methods
PO Box 15906
Nieuwe Achtergracht 129-B
1001 NK Amsterdam
The Netherlands
e-mail: a.ly@uva.nl
url: www.alexander-ly.com/; https://jasp-stats.org/

Abstract: In many statistical applications that concern mathematical psychologists, the concept of Fisher information plays an important role. In this tutorial we clarify the concept of Fisher information as it manifests itself across three different statistical paradigms. First, in the frequentist paradigm, Fisher information is used to construct hypothesis tests and confidence intervals using maximum likelihood estimators; second, in the Bayesian paradigm, Fisher information is used to define a default prior; lastly, in the minimum description length paradigm, Fisher information is used to measure model complexity.

MSC 2010 subject classifications: Primary 62-01, 62B10; secondary 62F03, 62F12, 62F15, 62B10.
Keywords and phrases: Confidence intervals, hypothesis testing, Jeffreys’s prior, minimum description length, model complexity, model selection, statistical modeling.

Contents

1 Introduction
2 The Role of Fisher Information in Frequentist Statistics
3 The Role of Fisher Information in Bayesian Statistics
4 The Role of Fisher Information in Minimum Description Length
5 Concluding Comments
References
A Generalization to Vector-Valued Parameters: The Fisher Information Matrix
B Frequentist Statistics based on Asymptotic Normality
∗ This work was supported by the starting grant “Bayes or Bust” awarded by the European Research Council (283876). Correspondence concerning this article may be addressed to Alexander Ly, email address: a.ly@uva.nl. The authors would like to thank Jay Myung, Trisha Van Zandt, and three anonymous reviewers for their comments on an earlier version of this paper. The discussions with Helen Steingroever, Jean-Bernard Salomond, Fabian Dablander, Nishant Mehta, Alexander Etz, Quentin Gronau and Sacha Epskamp led to great improvements of the manuscript. Moreover, the first author is grateful to Chris Klaassen, Bas Kleijn and Henk Pijls for their patience and enthusiasm with which they taught, and answered questions from a not very docile student.
imsart-generic ver. 2014/10/16 file: fiArxiv.tex date: October 18, 2017
Ly, et. al./Fisher information tutorial 2

C Bayesian use of the Fisher-Rao Metric: The Jeffreys’s Prior
D MDL: Coding Theoretical Background
E Regularity conditions

1. Introduction

Mathematical psychologists develop and apply quantitative models in order to describe human behavior and understand latent psychological processes. Examples of such models include Stevens’ law of psychophysics that describes the relation between the objective physical intensity of a stimulus and its subjectively experienced intensity (Stevens, 1957); Ratcliff’s diffusion model of decision making that measures the various processes that drive behavior in speeded response time tasks (Ratcliff, 1978); and multinomial processing tree models that decompose performance in memory tasks into the contribution of separate latent mechanisms (Batchelder and Riefer, 1980; Chechile, 1973).

When applying their models to data, mathematical psychologists may operate from within different statistical paradigms and focus on different substantive questions. For instance, working within the classical or frequentist paradigm a researcher may wish to test certain hypotheses or decide upon the number of trials to be presented to participants in order to estimate their latent abilities. Working within the Bayesian paradigm a researcher may wish to know how to determine a suitable default prior on the parameters of a model. Working within the minimum description length (MDL) paradigm a researcher may wish to compare rival models and quantify their complexity. Despite the diversity of these paradigms and purposes, they are connected through the concept of Fisher information.

Fisher information plays a pivotal role throughout statistical modeling, but an accessible introduction for mathematical psychologists is lacking. The goal of this tutorial is to fill this gap and illustrate the use of Fisher information in the three statistical paradigms mentioned above: frequentist, Bayesian, and MDL. This work builds directly upon the Journal of Mathematical Psychology tutorial article by Myung (2003) on maximum likelihood estimation. The intended target group for this tutorial is graduate students and researchers with an affinity for cognitive modeling and mathematical statistics.

To keep this tutorial self-contained we start by describing our notation and key concepts. We then provide the definition of Fisher information and show how it can be calculated. The ensuing sections exemplify the use of Fisher information for different purposes. Section 2 shows how Fisher information can be used in frequentist statistics to construct confidence intervals and hypothesis tests from maximum likelihood estimators (MLEs). Section 3 shows how Fisher information can be used in Bayesian statistics to define a default prior on model parameters. In Section 4 we clarify how Fisher information can be used to measure model complexity within the MDL framework of inference.


1.1. Notation and key concepts

Before defining Fisher information it is necessary to discuss a series of fundamental concepts such as the nature of statistical models, probability mass functions, and statistical independence. Readers familiar with these concepts may safely skip to the next section.

A statistical model is typically defined through a function f(x_i ∣ θ) that represents how a parameter θ is functionally related to potential outcomes x_i of a random variable X_i. For ease of exposition, we take θ to be one-dimensional throughout this text. The generalization to vector-valued θ can be found in Appendix A, see also Myung and Navarro (2005).

As a concrete example, θ may represent a participant’s intelligence, X_i a participant’s (future) performance on the ith item of an IQ test, x_i = 1 the potential outcome of a correct response, and x_i = 0 the potential outcome of an incorrect response on the ith item. Similarly, X_i is the ith trial in a coin flip experiment with two potential outcomes: heads, x_i = 1, or tails, x_i = 0. Thus, we have the binary outcome space X = {0, 1}. The coin flip model is also known as the Bernoulli distribution f(x_i ∣ θ) that relates the coin’s propensity θ ∈ (0, 1) to land heads to the potential outcomes as

    f(x_i ∣ θ) = θ^{x_i} (1 − θ)^{1 − x_i},  where x_i ∈ X = {0, 1}.    (1.1)

Formally, if θ is known, fixing it in the functional relationship f yields a function p_θ(x_i) = f(x_i ∣ θ) of the potential outcomes x_i. This p_θ(x_i) is referred to as a probability density function (pdf) when X_i has outcomes in a continuous interval, whereas it is known as a probability mass function (pmf) when X_i has discrete outcomes. The pmf p_θ(x_i) = P(X_i = x_i ∣ θ) can be thought of as a data generative device as it specifies how θ defines the chance with which X_i takes on a potential outcome x_i. As this holds for any outcome x_i of X_i, we say that X_i is distributed according to p_θ(x_i). For brevity, we do not further distinguish the continuous from the discrete case, and refer to p_θ(x_i) simply as a pmf.

For example, when the coin’s true propensity is θ* = 0.3, replacing θ by θ* in the Bernoulli distribution yields the pmf p_{0.3}(x_i) = 0.3^{x_i} 0.7^{1 − x_i}, a function of all possible outcomes of X_i. A subsequent replacement x_i = 0 in the pmf, p_{0.3}(0) = 0.7, tells us that this coin generates the outcome 0 with 70% chance.

In general, experiments consist of n trials yielding a potential set of outcomes x^n = (x_1, …, x_n) of the random vector X^n = (X_1, …, X_n). These n random variables are typically assumed to be independent and identically distributed (iid). Identically distributed implies that each of these n random variables is governed by one and the same θ, while independence implies that the joint distribution of all these n random variables simultaneously is given by a product, that is,

    f(x^n ∣ θ) = f(x_1 ∣ θ) × … × f(x_n ∣ θ) = ∏_{i=1}^{n} f(x_i ∣ θ).    (1.2)

As before, when θ is known, fixing it in this relationship f(x^n ∣ θ) yields the (joint) pmf of X^n as p_θ(x^n) = p_θ(x_1) × … × p_θ(x_n) = ∏_{i=1}^{n} p_θ(x_i).


In psychology the iid assumption is typically evoked when experimental data are analyzed in which participants have been confronted with a sequence of n items of roughly equal difficulty. When the participant can be either correct or incorrect on each trial, the participant’s performance X^n can then be related to an n-trial coin flip experiment governed by one single θ over all n trials. The random vector X^n has 2^n potential outcomes x^n. For instance, when n = 10, we have 2^n = 1,024 possible outcomes and we write X^n for the collection of all these potential outcomes. The chance of observing a potential outcome x^n is determined by the coin’s propensity θ as follows

    f(x^n ∣ θ) = f(x_1 ∣ θ) × … × f(x_n ∣ θ) = θ^{∑_{i=1}^{n} x_i} (1 − θ)^{n − ∑_{i=1}^{n} x_i},  where x^n ∈ X^n.    (1.3)

When the coin’s true propensity θ is θ* = 0.6, replacing θ by θ* in Eq. (1.3) yields the joint pmf p_{0.6}(x^n) = f(x^n ∣ θ = 0.6) = 0.6^{∑_{i=1}^{n} x_i} 0.4^{n − ∑_{i=1}^{n} x_i}. The pmf with a particular outcome entered, say, x^n = (1, 1, 1, 1, 1, 1, 1, 0, 0, 0), reveals that the coin with θ* = 0.6 generates this particular outcome with 0.18% chance.
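As an illustrative check (a Python sketch, not part of the original text), the joint pmf of Eq. (1.3) can be evaluated for the outcome given above:

```python
def joint_pmf(xs, theta):
    # Joint pmf of n iid Bernoulli trials, Eq. (1.3):
    # f(x^n | theta) = theta^(sum x_i) * (1 - theta)^(n - sum x_i)
    y = sum(xs)
    return theta ** y * (1 - theta) ** (len(xs) - y)

x = (1, 1, 1, 1, 1, 1, 1, 0, 0, 0)
p = joint_pmf(x, 0.6)
print(f"{p:.6f} ({100 * p:.2f}% chance)")  # 0.001792 (0.18% chance)
```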

1.2. Definition of Fisher information

In practice, the true value of θ is not known and has to be inferred from the observed data. The first step typically entails the creation of a data summary. For example, suppose once more that X^n refers to an n-trial coin flip experiment and suppose that we observed x^n_obs = (1, 0, 0, 1, 1, 1, 1, 0, 1, 1). To simplify matters, we only record the number of heads as Y = ∑_{i=1}^{n} X_i, which is a function of the data. Applying our function to the specific observations yields the realization y_obs = Y(x^n_obs) = 7. Since the coin flips X^n are governed by θ, so is a function of X^n; indeed, θ relates to the potential outcomes y of Y as follows

    f(y ∣ θ) = \binom{n}{y} θ^y (1 − θ)^{n − y},  where y ∈ Y = {0, 1, …, n},    (1.4)

where \binom{n}{y} = n! / (y! (n − y)!) enumerates the possible sequences of length n that consist of y heads and n − y tails. For instance, when flipping a coin n = 10 times, there are 120 possible sequences of zeroes and ones that contain y = 7 heads and n − y = 3 tails. The distribution f(y ∣ θ) is known as the binomial distribution.

The summary statistic Y has n + 1 possible outcomes, whereas X^n has 2^n. For instance, when n = 10 the statistic Y has only 11 possible outcomes, whereas X^n has 1,024. This reduction results from the fact that the statistic Y ignores the order with which the data are collected. Observe that the conditional probability of the raw data given Y = y is equal to P(X^n ∣ Y = y, θ) = 1 / \binom{n}{y} and that it does not depend on θ. This means that after we observe Y = y the conditional probability of X^n is independent of θ, even though each of the distributions of X^n and Y separately does depend on θ. We, therefore, conclude that there is no information about θ left in X^n after observing Y = y (Fisher, 1920; Stigler, 1973).
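The claim that P(X^n ∣ Y = y, θ) = 1 / \binom{n}{y} for every θ can be verified numerically (a Python sketch, not part of the original text):

```python
from math import comb

def conditional_prob(xs, theta):
    # P(X^n = x^n | Y = y, theta) = f(x^n | theta) / f(y | theta)
    n, y = len(xs), sum(xs)
    joint = theta ** y * (1 - theta) ** (n - y)                   # f(x^n | theta), Eq. (1.3)
    marginal = comb(n, y) * theta ** y * (1 - theta) ** (n - y)   # f(y | theta), Eq. (1.4)
    return joint / marginal

x = (1, 0, 0, 1, 1, 1, 1, 0, 1, 1)  # y = 7 heads in n = 10 trials
for theta in (0.2, 0.5, 0.9):
    print(conditional_prob(x, theta))  # 1/120 for every theta
```

The conditional probability is 1/120 regardless of θ, which is exactly what makes Y a sufficient statistic.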

imsart-generic ver. 2014/10/16 file: fiArxiv.tex date: October 18, 2017


Ly, et. al./Fisher information tutorial 5

More generally, we call a function of the data, say, T = t(X^n), a statistic. A statistic is referred to as sufficient for the parameter θ if the expression P(X^n ∣ T = t, θ) does not depend on θ itself. To quantify the amount of information about the parameter θ in a sufficient statistic T and the raw data, Fisher introduced the following measure.

Definition 1.1 (Fisher information). The Fisher information I_X(θ) of a random variable X about θ is defined as¹

    I_X(θ) = ∑_{x ∈ X} (d/dθ log f(x ∣ θ))² p_θ(x)       if X is discrete,
    I_X(θ) = ∫_X (d/dθ log f(x ∣ θ))² p_θ(x) dx          if X is continuous.    (1.6)

The derivative d/dθ log f(x ∣ θ) is known as the score function, a function of x, and describes how sensitive the model (i.e., the functional form f) is to changes in θ at a particular θ. The Fisher information measures the overall sensitivity of the functional relationship f to changes of θ by weighting the sensitivity at each potential outcome x with respect to the chance defined by p_θ(x) = f(x ∣ θ). The weighting with respect to p_θ(x) implies that the Fisher information about θ is an expectation.

Similarly, the Fisher information I_{X^n}(θ) within the random vector X^n about θ is calculated by replacing f(x ∣ θ) with f(x^n ∣ θ), thus, p_θ(x) with p_θ(x^n) in the definition. Moreover, under the assumption that the random vector X^n consists of n iid trials of X it can be shown that I_{X^n}(θ) = n I_X(θ), which is why I_X(θ) is also known as the unit Fisher information.² Intuitively, an experiment consisting of n = 10 trials is expected to be twice as informative about θ compared to an experiment consisting of only n = 5 trials. ◇

Intuitively, we cannot expect an arbitrary summary statistic T to extract more information about θ than what is already provided by the raw data. Fisher information adheres to this rule, as it can be shown that

    I_{X^n}(θ) ≥ I_T(θ),    (1.7)

with equality if and only if T is a sufficient statistic for θ.


Example 1.1 (The information about θ within the raw data and a summary statistic). A direct calculation with a Bernoulli distributed random vector X^n shows that the Fisher information about θ within an n-trial coin flip experiment

¹ Under mild regularity conditions Fisher information is equivalently defined as

    I_X(θ) = −E(d²/dθ² log f(X ∣ θ)) = −∑_{x ∈ X} (d²/dθ² log f(x ∣ θ)) p_θ(x)      if X is discrete,
    I_X(θ) = −E(d²/dθ² log f(X ∣ θ)) = −∫_X (d²/dθ² log f(x ∣ θ)) p_θ(x) dx        if X is continuous,    (1.5)

where d²/dθ² log f(x ∣ θ) denotes the second derivative of the logarithm of f with respect to θ.
² Note the abuse of notation – we dropped the subscript i for the ith random variable X_i and denote it simply by X instead.


[Fig 1. The unit Fisher information I_X(θ) = 1/(θ(1 − θ)) as a function of θ within the Bernoulli model. As θ reaches zero or one the expected information goes to infinity.]

is given by

    I_{X^n}(θ) = n I_X(θ) = n / (θ(1 − θ)),    (1.8)

where I_X(θ) = 1/(θ(1 − θ)) is the Fisher information of θ within a single trial. As shown in Fig. 1, the unit Fisher information I_X(θ) depends on θ. Similarly, we can calculate the Fisher information about θ within the summary statistic Y by using the binomial model instead. This yields I_Y(θ) = n/(θ(1 − θ)). Hence, I_{X^n}(θ) = I_Y(θ) for any value of θ. In other words, the expected information in Y about θ is the same as the expected information about θ in X^n, regardless of the value of θ. ◇
Observe that the information in the raw data X^n and the statistic Y are equal for every θ, and specifically also for its unknown true value θ*. That is, there is no statistical information about θ lost when we use the sufficient statistic Y instead of the raw data X^n. This is particularly useful when the data set X^n is large and can be replaced by the single number Y.
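The equality I_{X^n}(θ) = I_Y(θ) = n/(θ(1 − θ)) can be checked by applying Definition 1.1 directly, summing the squared score over all outcomes (a Python sketch, not part of the original text):

```python
from math import comb

def fisher_bernoulli(theta):
    # I_X(theta) = sum_x (d/dtheta log f(x|theta))^2 * p_theta(x), Eq. (1.6)
    info = 0.0
    for x in (0, 1):
        score = x / theta - (1 - x) / (1 - theta)   # score function of the Bernoulli model
        p = theta ** x * (1 - theta) ** (1 - x)
        info += score ** 2 * p
    return info

def fisher_binomial(theta, n):
    # Same definition applied to the summary statistic Y with pmf f(y | theta), Eq. (1.4)
    info = 0.0
    for y in range(n + 1):
        score = y / theta - (n - y) / (1 - theta)
        p = comb(n, y) * theta ** y * (1 - theta) ** (n - y)
        info += score ** 2 * p
    return info

theta, n = 0.3, 10
print(n * fisher_bernoulli(theta))   # n * I_X(theta) = n / (theta (1 - theta))
print(fisher_binomial(theta, n))     # I_Y(theta): the same value
```

Both computations return n/(θ(1 − θ)), illustrating that the sufficient statistic Y loses no information.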

2. The Role of Fisher Information in Frequentist Statistics

Recall that θ is unknown in practice and to infer its value we might: (1) provide a best guess in terms of a point estimate; (2) postulate its value and test whether this value aligns with the data; or (3) derive a confidence interval. In the frequentist framework, each of these inferential tools is related to the Fisher information and exploits the data generative interpretation of a pmf. Recall that given a model f(x^n ∣ θ) and a known θ, we can view the resulting pmf p_θ(x^n)

[Fig 2. The likelihood function f(y_obs = 7, n = 10 ∣ θ) based on observing y_obs = 7 heads in n = 10 trials. For these data, the MLE is equal to θ̂_obs = 0.7; see the main text for the interpretation of this function.]

as a recipe that reveals how θ defines the chances with which X^n takes on the potential outcomes x^n.

This data generative view is central to Fisher’s conceptualization of the maximum likelihood estimator (MLE; Fisher, 1912; Fisher, 1922; Fisher, 1925; LeCam, 1990; Myung, 2003). For instance, the binomial model implies that a coin with a hypothetical propensity θ = 0.5 will generate the outcome y = 7 heads out of n = 10 trials with 11.7% chance, whereas a hypothetical propensity of θ = 0.7 will generate the same outcome y = 7 with 26.7% chance. Fisher concluded that an actual observation y_obs = 7 out of n = 10 is therefore more likely to be generated from a coin with a hypothetical propensity of θ = 0.7 than from a coin with a hypothetical propensity of θ = 0.5. Fig. 2 shows that for this specific observation y_obs = 7, the hypothetical value θ = 0.7 is the maximum likelihood estimate: the number θ̂_obs = 0.7. This estimate is a realization of the maximum likelihood estimator (MLE); in this case, the MLE is the function θ̂ = (1/n) ∑_{i=1}^{n} X_i = Y/n, i.e., the sample mean. Note that the MLE is a statistic, that is, a function of the data.
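The likelihood values quoted above, and the fact that the likelihood peaks at the sample mean, can be reproduced directly (a Python sketch, not part of the original text):

```python
from math import comb

def binom_likelihood(theta, y=7, n=10):
    # f(y | theta) from Eq. (1.4), here viewed as a function of theta
    return comb(n, y) * theta ** y * (1 - theta) ** (n - y)

print(round(binom_likelihood(0.5), 3))  # 0.117, i.e., 11.7% chance
print(round(binom_likelihood(0.7), 3))  # 0.267, i.e., 26.7% chance

# A grid search confirms that the likelihood peaks at the sample mean y/n = 0.7
grid = [i / 1000 for i in range(1, 1000)]
mle = max(grid, key=binom_likelihood)
print(mle)  # 0.7
```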

2.1. Using Fisher information to design an experiment

Since X^n depends on θ so will a function of X^n, in particular, the MLE θ̂. The distribution of the potential outcomes of the MLE θ̂ is known as the sampling distribution of the estimator and denoted as f(θ̂_obs ∣ θ). As before, when θ* is assumed to be known, fixing it in f(θ̂_obs ∣ θ) yields the pmf p_{θ*}(θ̂_obs), a function of the potential outcomes of θ̂. This function f between the parameter θ and the potential outcomes of the MLE θ̂ is typically hard to describe, but for n large enough it can be characterized by the Fisher information.


For iid data and under general conditions,³ the difference between the true θ* and the MLE converges in distribution to a normal distribution, that is,

    √n (θ̂ − θ*) →^D N(0, I_X^{−1}(θ*)), as n → ∞.    (2.1)

Hence, for large enough n, the “error” is approximately normally distributed⁴

    (θ̂ − θ*) ≈^D N(0, 1/(n I_X(θ*))).    (2.2)

This means that the MLE θ̂ generates potential estimates θ̂_obs around the true value θ* with a standard error given by the inverse of the square root of the Fisher information at the true value θ*, i.e., 1/√(n I_X(θ*)), whenever n is large enough. Note that the chances with which the estimates of θ̂ are generated depend on the true value θ* and the sample size n. Observe that the standard error decreases when the unit information I_X(θ*) is high or when n is large. As experimenters we do not have control over the true value θ*, but we can affect the data generating process by choosing the number of trials n. Larger values of n increase the amount of information in X^n, heightening the chances of the MLE producing an estimate θ̂_obs that is close to the true value θ*. The following example shows how this can be made precise.
Example 2.1 (Designing a binomial experiment with the Fisher information). Recall that the potential outcomes of a normal distribution fall within one standard error of the population mean with 68% chance. Hence, when we choose n such that 1/√(n I_X(θ*)) = 0.1 we design an experiment that allows the MLE to generate estimates within 0.1 distance of the true value with 68% chance. To overcome the problem that θ* is not known, we solve the problem for the worst case scenario. For the Bernoulli model this is given by θ = 1/2, the least informative case, see Fig. 1. As such, we have 1/√(n I_X(θ*)) ≤ 1/√(n I_X(1/2)) = 1/(2√n) = 0.1, where the last equality is the target requirement and is solved by n = 25.

This leads to the following interpretation. After simulating k = 100 data sets x^n_obs,1, …, x^n_obs,k each with n = 25 trials, we can apply to each of these data sets the MLE yielding k estimates θ̂_obs,1, …, θ̂_obs,k. The sampling distribution implies that at least 68 of these k = 100 estimates are expected to be at most 0.1 distance away from the true θ*. ◇
³ Basically, when the Fisher information exists for all parameter values. For details see the advanced accounts provided by Bickel et al. (1993), Hájek (1970), Inagaki (1970), LeCam (1970) and Appendix E.
⁴ Note that θ̂ is random, while the true value θ* is fixed. As such, the error θ̂ − θ* and the rescaled error √n(θ̂ − θ*) are also random. We used →^D in Eq. (2.1) to convey that the distribution of the left-hand side goes to the distribution on the right-hand side. Similarly, ≈^D in Eq. (2.2) implies that the distribution of the left-hand side is approximately equal to the distribution given on the right-hand side. Hence, for finite n there will be an error due to using the normal distribution as an approximation to the true sampling distribution. This approximation error is ignored in the constructions given below, see Appendix B.1 for a more thorough discussion.
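The design calculation in Example 2.1 can be checked by simulation (a Python sketch, not part of the original text; the seed and number of replications are arbitrary choices):

```python
import random

random.seed(1)
n, theta_star, k = 25, 0.5, 1000

# Simulate k experiments of n = 25 coin flips each and record how often
# the MLE lands within 0.1 of the true propensity theta*
hits = 0
for _ in range(k):
    y = sum(random.random() < theta_star for _ in range(n))
    mle = y / n
    if abs(mle - theta_star) <= 0.1:
        hits += 1

print(hits / k)  # at least 0.68, as the design intended
```

The observed proportion exceeds the 68% target; the worst-case calculation is conservative for any θ* other than 1/2, and the discreteness of Y works in the design's favor here.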


2.2. Using Fisher information to construct a null hypothesis test

The (asymptotic) normal approximation to the sampling distribution of the MLE can also be used to construct a null hypothesis test. When we postulate that the true value equals some hypothesized value of interest, say, θ* = θ0, a simple plugin then allows us to construct a prediction interval based on our knowledge of the normal distribution. More precisely, for n large enough, potential outcomes x^n generated according to p_{θ*}(x^n) lead to potential estimates θ̂_obs that fall within the range

    (θ* − 1.96 √(I_X^{−1}(θ*)/n), θ* + 1.96 √(I_X^{−1}(θ*)/n)),    (2.3)

with (approximately) 95% chance. This 95%-prediction interval Eq. (2.3) allows us to construct a point null hypothesis test based on a pre-experimental postulate θ* = θ0.
Example 2.2 (A null hypothesis test for a binomial experiment). Under the null hypothesis H0: θ* = θ0 = 0.5, we predict that an outcome of the MLE based on n = 10 trials will lie between (0.19, 0.81) with 95% chance. This interval follows from replacing θ* by θ0 in the 95%-prediction interval Eq. (2.3). The data generative view implies that if we simulate k = 100 data sets each with the same θ* = 0.5 and n = 10, we would then have k estimates θ̂_obs,1, …, θ̂_obs,k of which five are expected to be outside this 95% interval (0.19, 0.81). Fisher, therefore, classified an outcome of the MLE that is smaller than 0.19 or larger than 0.81 as extreme under the null and would then reject the postulate H0: θ0 = 0.5 at a significance level of .05. ◇
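The bounds of the prediction interval in Example 2.2 follow from plugging θ0 = 0.5 into Eq. (2.3) (a Python sketch, not part of the original text):

```python
from math import sqrt

# 95%-prediction interval under H0: theta = 0.5 with n = 10 trials, Eq. (2.3)
theta0, n = 0.5, 10
fisher_unit = 1 / (theta0 * (1 - theta0))   # I_X(theta0) for the Bernoulli model
se = 1 / sqrt(n * fisher_unit)              # standard error of the MLE under H0
lo, hi = theta0 - 1.96 * se, theta0 + 1.96 * se
print(round(lo, 2), round(hi, 2))           # 0.19 0.81
```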
The normal approximation to the sampling distribution of the MLE and the resulting null hypothesis test is particularly useful when the exact sampling distribution of the MLE is unavailable or hard to compute.

Example 2.3 (An MLE null hypothesis test for the Laplace model). Suppose that we have n iid samples from the Laplace distribution

    f(x_i ∣ θ) = (1/(2b)) exp(−|x_i − θ|/b),    (2.4)

where θ denotes the population mean and the population variance is given by 2b². It can be shown that the MLE for this model is the sample median, θ̂ = M̂, and the unit Fisher information is I_X(θ) = b^{−2}. The exact sampling distribution of the MLE is unwieldy (Kotz, Kozubowski and Podgorski, 2001) and not presented here. Asymptotic normality of the MLE is practical, as it allows us to discard the unwieldy exact sampling distribution and, instead, base our inference on a more tractable (approximate) normal distribution with a mean equal to the true value θ* and a variance equal to b²/n. For n = 100, b = 1 and repeated sampling under the hypothesis H0: θ* = θ0, approximately 95% of the estimates (the observed sample medians) are expected to fall in the range (θ0 − 0.196, θ0 + 0.196). ◇
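The 95% claim in Example 2.3 can be checked by simulation, using the fact that a Laplace(θ, b = 1) draw equals θ plus the difference of two independent Exponential(1) draws (a Python sketch, not part of the original text; the seed and replication count are arbitrary):

```python
import random
import statistics

random.seed(2)
n, k, theta0 = 100, 2000, 0.0

def laplace_sample():
    # Laplace(theta0, b = 1): theta0 plus a difference of two iid Exp(1) draws
    return theta0 + random.expovariate(1.0) - random.expovariate(1.0)

# Simulate k experiments and record how often the sample median (the MLE)
# falls inside the asymptotic 95% range (theta0 - 0.196, theta0 + 0.196)
inside = 0
for _ in range(k):
    median = statistics.median(laplace_sample() for _ in range(n))
    if theta0 - 0.196 < median < theta0 + 0.196:
        inside += 1

print(inside / k)  # close to 0.95
```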


2.3. Using Fisher information to compute confidence intervals

An alternative to both point estimation and null hypothesis testing is interval estimation. In particular, a 95%-confidence interval can be obtained by replacing the unknown true value θ* in the prediction interval Eq. (2.3) by an estimate θ̂_obs. Recall that a simulation with k = 100 data sets each with n trials leads to estimates θ̂_obs,1, …, θ̂_obs,k, and each estimate leads to a different 95%-confidence interval. It is then expected that 95 of these k = 100 intervals encapsulate the true value θ*.⁵ Note that these intervals are centred around different points whenever the estimates differ and that their lengths differ, as the Fisher information depends on θ.

Example 2.4 (An MLE confidence interval for the Bernoulli model). When we observe y_obs,1 = 7 heads in n = 10 trials, the MLE then produces the estimate θ̂_obs,1 = 0.7. Replacing θ* in the prediction interval Eq. (2.3) with θ* = θ̂_obs,1 yields an approximate 95%-confidence interval (0.42, 0.98) of length 0.57. On the other hand, had we instead observed y_obs,2 = 6 heads, the MLE would then yield θ̂_obs,2 = 0.6, resulting in the interval (0.29, 0.90) of length 0.61. ◇

In sum, Fisher information can be used to approximate the sampling distribution of the MLE when n is large enough. Knowledge of the Fisher information can be used to choose n such that the MLE produces an estimate close to the true value, construct a null hypothesis test, and compute confidence intervals.
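The intervals of Example 2.4 can be reproduced by plugging the estimate into Eq. (2.3) (a Python sketch, not part of the original text):

```python
from math import sqrt

def wald_ci(y, n, z=1.96):
    # Plug theta_hat = y/n into Eq. (2.3): theta_hat +/- z / sqrt(n * I_X(theta_hat)),
    # with I_X(theta) = 1 / (theta (1 - theta)) for the Bernoulli model
    theta_hat = y / n
    se = sqrt(theta_hat * (1 - theta_hat) / n)
    return theta_hat - z * se, theta_hat + z * se

print(wald_ci(7, 10))  # approximately (0.42, 0.98), as in Example 2.4
print(wald_ci(6, 10))  # approximately (0.29, 0.90)
```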

3. The Role of Fisher Information in Bayesian Statistics

This section outlines how Fisher information can be used to define the Jeffreys’s prior, a default prior commonly used for estimation problems and for nuisance parameters in a Bayesian hypothesis test (e.g., Bayarri et al., 2012; Dawid, 2011; Gronau, Ly and Wagenmakers, 2017; Jeffreys, 1961; Liang et al., 2008; Li and Clyde, 2015; Ly, Verhagen and Wagenmakers, 2016a,b; Ly, Marsman and Wagenmakers, in press; Ly et al., 2017a; Robert, 2016). To illustrate the desirability of the Jeffreys’s prior we first show how the naive use of a uniform prior may have undesirable consequences, as the uniform prior depends on the representation of the inference problem, that is, on how the model is parameterized. This dependence is commonly referred to as lack of invariance: different parameterizations of the same model result in different posteriors and, hence, different conclusions. We visualize the representation problem using simple geometry and show how the geometrical interpretation of Fisher information leads to the Jeffreys’s prior that is parameterization-invariant.

3.1. Bayesian updating

Bayesian analysis centers on the observations x^n_obs for which a generative model f is proposed that functionally relates the observed data to an unobserved parameter θ. Given the observations x^n_obs, the functional relationship f is inverted using Bayes’ rule to infer the relative plausibility of the values of θ. This is done by replacing the potential outcome part x^n in f by the actual observations, yielding a likelihood function f(x^n_obs ∣ θ), which is a function of θ. In other words, x^n_obs is known, thus fixed, and the true θ is unknown, therefore free to vary. The candidate set of possible values for the true θ is denoted by Θ and referred to as the parameter space. Our knowledge about θ is formalized by a distribution g(θ) over the parameter space Θ. This distribution is known as the prior on θ, as it is set before any datum is observed. We can use Bayes’ theorem to calculate the posterior distribution over the parameter space Θ given the data that were actually observed as follows

    g(θ ∣ X^n = x^n_obs) = f(x^n_obs ∣ θ) g(θ) / ∫_Θ f(x^n_obs ∣ θ) g(θ) dθ.    (3.1)

This expression is often verbalized as

    posterior = (likelihood × prior) / marginal likelihood.    (3.2)

The posterior distribution is a combination of what we knew before we saw the data (i.e., the information in the prior), and what we have learned from the observations in terms of the likelihood (e.g., Lee and Wagenmakers, 2013). Note that the integral is now over θ and not over the potential outcomes.

⁵ But see Brown, Cai and DasGupta (2001).
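Eq. (3.1) can be approximated on a grid, here for the coin-flip data y_obs = 7 out of n = 10 under a uniform prior (a Python sketch, not part of the original text):

```python
# Grid approximation of Bayes' theorem, Eq. (3.1)
grid = [(i + 0.5) / 1000 for i in range(1000)]   # midpoints of 1000 bins on (0, 1)
width = 1 / 1000

def likelihood(theta):
    return theta ** 7 * (1 - theta) ** 3          # f(x_obs^n | theta) for y_obs = 7, n = 10

prior = [1.0 for _ in grid]                       # uniform prior g(theta) = 1
marginal = sum(likelihood(t) * g * width for t, g in zip(grid, prior))
posterior = [likelihood(t) * g / marginal for t, g in zip(grid, prior)]

# The resulting posterior is a Beta(8, 4), whose mean is 8/12
post_mean = sum(t * p * width for t, p in zip(grid, posterior))
print(round(post_mean, 3))  # 0.667
```

The same grid recipe works for any one-dimensional model: only the `likelihood` and `prior` lines change.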

3.2. Failure of the uniform distribution on the parameter as a noninformative prior

When little is known about the parameter θ that governs the outcomes of X^n, it may seem reasonable to express this ignorance with a uniform prior distribution g(θ), as no parameter value of θ is then favored over another. This leads to the following type of inference:

Example 3.1 (Uniform prior on θ). Before data collection, θ is assigned a uniform prior, that is, g(θ) = 1/V_Θ with a normalizing constant of V_Θ = 1, as shown in the left panel of Fig. 3. Suppose that we observe coin flip data x^n_obs with y_obs = 7 heads out of n = 10 trials. To relate these observations to the coin’s propensity θ we use the Bernoulli distribution as our f(x^n ∣ θ). A replacement of x^n by the data actually observed yields the likelihood function f(x^n_obs ∣ θ) = θ^7 (1 − θ)^3, which is a function of θ. Bayes’ theorem now allows us to update our prior to the posterior that is plotted in the right panel of Fig. 3. ◇
Note that a uniform prior on θ has the length, more generally, volume, of
the parameter space as the normalizing constant; in this case, VΘ = 1, which
equals the length of the interval Θ = (0, 1). Furthermore, a uniform prior can
be characterized as the prior that gives equal probability to all sub-intervals of
equal length. Thus, the probability of finding the true value θ∗ within a sub-
interval Jθ = (θa , θb ) ⊂ Θ = (0, 1) is given by the relative length of Jθ with

imsart-generic ver. 2014/10/16 file: fiArxiv.tex date: October 18, 2017



[Figure: left panel “Uniform prior on θ”; right panel “Posterior θ from θ ∼ U [0, 1]”; horizontal axes: propensity θ, vertical axes: density.]
Fig 3. Bayesian updating based on observations xn obs with yobs = 7 heads out of n = 10 tosses.
In the left panel, the uniform prior distribution assigns equal probability to every possible
value of the coin’s propensity θ. In the right panel, the posterior distribution is a compromise
between the prior and the observed data.

respect to the length of the parameter space, that is,

P (θ∗ ∈ Jθ ) = ∫_{Jθ} g(θ) dθ = (1/VΘ) ∫_{θa}^{θb} 1 dθ = (θb − θa)/VΘ . (3.3)
Hence, before any datum is observed, the uniform prior expresses the belief
P (θ∗ ∈ Jθ ) = 0.20 of finding the true value θ∗ within the interval Jθ = (0.6, 0.8).
After observing xnobs with yobs = 7 out of n = 10, this prior is updated to the
posterior belief of P (θ∗ ∈ Jθ ∣ xnobs ) = 0.54, see the shaded areas in Fig. 3.
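Both shaded-area probabilities can be reproduced numerically; the following sketch (our `simpson` helper, not part of the tutorial) integrates the normalized likelihood over Jθ = (0.6, 0.8).

```python
import math

def simpson(f, a, b, n=1000):
    # composite Simpson's rule; n must be even
    h = (b - a) / n
    s = f(a) + f(b) + math.fsum((4 if i % 2 else 2) * f(a + i * h)
                                for i in range(1, n))
    return s * h / 3

# uniform prior g(θ) = 1/VΘ with VΘ = 1: P(θ* ∈ (0.6, 0.8)) is just the length
p_prior = 0.8 - 0.6

# posterior is proportional to θ^7 (1 − θ)^3, normalized over Θ = (0, 1)
f = lambda t: t**7 * (1 - t)**3
p_post = simpson(f, 0.6, 0.8) / simpson(f, 0.0, 1.0)

print(round(p_prior, 2), round(p_post, 2))  # 0.2 0.54
```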
Although intuitively appealing, it can be unwise to choose the uniform dis-
tribution by default, as the results are highly dependent on how the model is
parameterized. In what follows, we show how a different parameterization leads
to different posteriors and, consequently, different conclusions.
Example 3.2 (Different representations, different conclusions). The propensity
of a coin landing heads up is related to the angle φ with which that coin is bent.
Suppose that the relation between the angle φ and the propensity θ is given by the
function θ = h(φ) = 1/2 + (1/2)(φ/π)³, chosen here for mathematical convenience.⁶ When

φ is positive the tail side of the coin is bent inwards, which increases the coin’s
chances to land heads. As the function θ = h(φ) also admits an inverse function
h−1 (θ) = φ, we have an equivalent formulation of the problem in Example 3.1,
but now described in terms of the angle φ instead of the propensity θ.
As before, in order to obtain a posterior distribution, Bayes’ theorem requires
that we specify a prior distribution. As the problem is formulated in terms of
φ, one may believe that a noninformative choice is to assign a uniform prior
g̃(φ) on φ, as this means that no value of φ is favored over another. A uniform
prior on φ is in this case given by g̃(φ) = 1/VΦ with a normalizing constant
VΦ = 2π, because the parameter φ takes on values in the interval Φ = (−π, π).
⁶ Another example involves the logit formulation of the Bernoulli model, that is, in terms of φ = log(θ/(1 − θ)), where Φ = ℝ. This logit formulation is the basic building block in item response theory. We did not discuss this example as the uniform prior on the logit cannot be normalized and, therefore, not easily represented in the plots.


This uniform distribution expresses the belief that the true φ∗ can be found
in any of the intervals (−1.0π, −0.8π), (−0.8π, −0.6π), . . . , (0.8π, 1.0π) with 10%
probability, because each of these intervals is 10% of the total length, see the
top-left panel of Fig. 4. For the same data as before, the posterior calculated

[Figure: top-left “Uniform prior on φ”; top-right “Posterior φ from φ ∼ U [−π, π]” (horizontal axes: angle φ); bottom-left “Prior θ from φ ∼ U [−π, π]”; bottom-right “Posterior θ from φ ∼ U [−π, π]” (horizontal axes: propensity θ); the top panels map to the bottom panels through h; vertical axes: density.]
Fig 4. Bayesian updating based on observations xnobs with yobs = 7 heads out of n = 10 tosses when a uniform prior distribution is assigned to the coin's angle φ. The uniform
distribution is shown in the top-left panel. Bayes’ theorem results in a posterior distribution
for φ that is shown in the top-right panel. This posterior g̃(φ ∣ xn obs ) is transformed into a
posterior on θ (bottom-right panel) using θ = h(φ). The same posterior on θ is obtained if
we proceed via an alternative route in which we first transform the uniform prior on φ to
the corresponding prior on θ and then apply Bayes’ theorem with the induced prior on θ. A
comparison to the results from Fig. 3 reveals that posterior inference differs notably depending
on whether a uniform distribution is assigned to the angle φ or to the propensity θ.

from Bayes’ theorem is given in top-right panel of Fig. 4. As the problem in


terms of the angle φ is equivalent to that of θ = h(φ) we can use the function h
to translate the posterior in terms of φ to a posterior on θ, see the bottom-right
panel of Fig. 4. This posterior on θ is noticeably different from the posterior on
θ shown in Figure 3.
Specifically, the uniform prior on φ corresponds to the prior belief P̃ (θ∗ ∈
Jθ ) = 0.13 of finding the true value θ∗ within the interval Jθ = (0.6, 0.8). After
observing xnobs with yobs = 7 out of n = 10, this prior is updated to the posterior
belief of P̃ (θ∗ ∈ Jθ ∣ xnobs ) = 0.29,7 see the shaded areas in Fig. 4. Crucially,
the earlier analysis that assigned a uniform prior to the propensity θ yielded a
⁷ The tilde makes explicit that the prior and posterior are derived from the uniform prior g̃(φ) on φ.


posterior probability P (θ∗ ∈ Jθ ∣ xnobs ) = 0.54, which is markedly different from


the current analysis that assigns a uniform prior to the angle φ.
The same posterior on θ is obtained when the prior on φ is first translated
into a prior on θ (bottom-left panel) and then updated to a posterior with Bayes’
theorem. Regardless of the stage at which the transformation is applied, the
resulting posterior on θ differs substantially from the result plotted in the right
panel of Fig. 3. ◇
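The prior probability P̃(θ∗ ∈ (0.6, 0.8)) = 0.13 follows from the change of variables alone; a short sketch (the function names are ours) inverts h and measures the relative length of the corresponding φ-interval within Φ = (−π, π).

```python
import math

def h(phi):
    # the bent-coin link: θ = h(φ) = 1/2 + (1/2)(φ/π)³ on Φ = (−π, π)
    return 0.5 + 0.5 * (phi / math.pi) ** 3

def h_inv(theta):
    # inverse link: φ = π (2θ − 1)^(1/3), with a sign-safe cube root
    u = 2.0 * theta - 1.0
    return math.pi * math.copysign(abs(u) ** (1.0 / 3.0), u)

# under a uniform prior on φ, P(θ* ∈ (a, b)) is the relative length of
# (h⁻¹(a), h⁻¹(b)) within Φ = (−π, π), cf. Eq. (3.3)
a, b = 0.6, 0.8
p_theta = (h_inv(b) - h_inv(a)) / (2.0 * math.pi)
print(round(p_theta, 2))  # 0.13, versus 0.20 under the uniform prior on θ
```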
Thus, the uniform prior distribution is not a panacea for the quantification of
prior ignorance, as the conclusions depend on how the problem is parameterized.
In particular, a uniform prior on the coin’s angle g̃(φ) = 1/VΦ yields a highly
informative prior in terms of the coin’s propensity θ. This lack of invariance
caused Karl Pearson, Ronald Fisher and Jerzy Neyman to reject 19th century
Bayesian statistics that was based on the uniform prior championed by Pierre-
Simon Laplace. This rejection resulted in, what is now known as, frequentist
statistics, see also Hald (2008), Lehmann (2011), and Stigler (1986).

3.3. A default prior by Jeffreys’s rule

Unlike the other fathers of modern statistical thought, Harold Jeffreys continued to study Bayesian statistics based on formal logic and his philosophical convictions of scientific inference (see, e.g., Aldrich, 2005; Etz and Wagenmakers,
2017; Jeffreys, 1961; Ly, Verhagen and Wagenmakers, 2016a,b; Robert, Chopin and Rousseau,
2009; Wrinch and Jeffreys, 1919, 1921, 1923). Jeffreys concluded that the uni-
form prior is unsuitable as a default prior due to its dependence on the parame-
terization. As an alternative, Jeffreys (1946) proposed the following prior based
on Fisher information
gJ (θ) = (1/V) √IX (θ), where V = ∫_Θ √IX (θ) dθ, (3.4)

which is known as the prior derived from Jeffreys’s rule or the Jeffreys’s prior
in short. The Jeffreys’s prior is parameterization-invariant, which implies that
it leads to the same posteriors regardless of how the model is represented.
Example 3.3 (Jeffreys’s prior). The Jeffreys’s prior of the Bernoulli model in
terms of φ is

gJ (φ) = 3φ² / (V √(π⁶ − φ⁶)), where V = π, (3.5)

which is plotted in the top-left panel of Fig. 5. The corresponding posterior is


plotted in the top-right panel, which we transformed into a posterior in terms of
θ using the function θ = h(φ) shown in the bottom-right panel.8
⁸ The subscript J makes explicit that the prior and posterior are based on the prior derived from Jeffreys's rule, i.e., gJ (θ) on θ, or equivalently, gJ (φ) on φ.


[Figure: top-left “Jeffreys’s prior on φ” (with an extra tick mark at φ = −2.8); top-right “Jeffreys’s posterior on φ” (horizontal axes: angle φ); bottom-left “Jeffreys’s prior on θ” (with an extra tick mark at θ = 0.15); bottom-right “Jeffreys’s posterior on θ” (horizontal axes: propensity θ); the rows are linked through h and h⁻¹; vertical axes: density.]
Fig 5. For priors constructed through Jeffreys’s rule it does not matter whether the problem
is represented in terms of the angles φ or its propensity θ. Thus, not only is the problem
equivalent due to the transformations θ = h(φ) and its backwards transformation φ = h−1 (θ),
the prior information is the same in both representations. This also holds for the posteriors.

Similarly, we could have started with the Jeffreys’s prior in terms of θ instead,
that is,
gJ (θ) = 1 / (V √(θ(1 − θ))), where V = π. (3.6)

The Jeffreys’s prior and posterior on θ are plotted in the bottom-left and the
bottom-right panel of Fig. 5, respectively. The Jeffreys’s prior on θ corresponds
to the prior belief PJ (θ∗ ∈ Jθ ) = 0.14 of finding the true value θ∗ within the
interval Jθ = (0.6, 0.8). After observing xnobs with yobs = 7 out of n = 10, this
prior is updated to the posterior belief of PJ (θ∗ ∈ Jθ ∣ xnobs ) = 0.53, see the shaded
areas in Fig. 5. The posterior is identical to the one obtained from the previously
described updating procedure that starts with the Jeffreys’s prior on φ instead of
on θ. ◇
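Both numbers can be reproduced from Eq. (3.6); the sketch below (our `simpson` helper) uses the closed-form antiderivative (2/π) asin(√θ) for the prior mass and numerical integration for the posterior, whose shape is that of a Beta(7.5, 3.5) distribution.

```python
import math

def simpson(f, a, b, n=2000):
    # composite Simpson's rule; n must be even
    h = (b - a) / n
    s = f(a) + f(b) + math.fsum((4 if i % 2 else 2) * f(a + i * h)
                                for i in range(1, n))
    return s * h / 3

# prior mass of Jθ = (0.6, 0.8): the integral of gJ(θ) has antiderivative (2/π) asin(√θ)
p_prior = (2 / math.pi) * (math.asin(math.sqrt(0.8)) - math.asin(math.sqrt(0.6)))

# posterior is proportional to θ^7 (1 − θ)^3 gJ(θ), i.e., θ^6.5 (1 − θ)^2.5
f = lambda t: t**6.5 * (1 - t)**2.5
p_post = simpson(f, 0.6, 0.8) / simpson(f, 0.0, 1.0)

print(round(p_prior, 2), round(p_post, 2))  # 0.14 0.53
```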
This example shows that the Jeffreys’s prior leads to the same posterior
knowledge regardless of how we as researcher represent the problem. Hence, the
same conclusions about θ are drawn regardless of whether we (1) use Jeffreys’s
rule to construct a prior on θ and update with the observed data, or (2) use
Jeffreys’s rule to construct a prior on φ, update to a posterior distribution on
φ, which is then transformed to a posterior on θ.


3.4. Geometrical properties of Fisher information

In the remainder of this section we make intuitive that the Jeffreys’s prior is in
fact uniform in the model space. We elaborate on what is meant by model space
and how this can be viewed geometrically. This geometric approach illustrates
(1) the role of Fisher information in the definition of the Jeffreys’s prior, (2)
the interpretation of the shaded area, and (3) why the normalizing constant is
V = π, regardless of the chosen parameterization.

3.4.1. The model space M

Before we describe the geometry of statistical models, recall that a pmf can be
thought of as a data generating device of X, as the pmf specifies the chances with
which X takes on the potential outcomes 0 and 1. Each such pmf has to fulfil two
conditions: (i) the chances have to be non-negative, that is, 0 ≤ p(x) = P (X = x)
for every possible outcome x of X, and (ii) to explicitly convey that there are
w = 2 outcomes, and none more, the chances have to sum to one, that is,
p(0) + p(1) = 1. We call the largest set of functions that adhere to conditions (i)
and (ii) the complete set of pmfs P.
As any pmf from P defines w = 2 chances, we can represent such a pmf
as a vector in w dimensions. To simplify notation, we write p(X) for all w
chances simultaneously, hence, p(X) is the vector p(X) = [p(0), p(1)] when
w = 2. The two chances with which a pmf p(X) generates outcomes of X can be
simultaneously represented in the plane with p(0) = P (X = 0) on the horizontal
axis and p(1) = P (X = 1) on the vertical axis. In the most extreme case, we
have the pmf p(X) = [1, 0] or p(X) = [0, 1]. These two extremes are linked by a
straight line in the left panel of Fig. 6. Any pmf –and the true pmf p∗ (X) of X

[Figure: left panel plots P(X = 1) against P(X = 0) and shows the line segment of pmfs; right panel plots m(X = 1) against m(X = 0) and shows the quarter circle of radius two; both axes run from 0 to 2.]

Fig 6. The true pmf of X with the two outcomes {0, 1} has to lie on the line (left panel)
or more naturally on the positive part of the circle (right panel). The dot represents the pmf
pe (X).

in particular– can be uniquely identified with a vector on the line and vice versa.
For instance, the pmf pe (X) = [1/2, 1/2] (i.e., the two outcomes are generated
with the same chance) is depicted as the dot on the line.


This vector representation allows us to associate to each pmf of X a norm,


that is, a length. Our intuitive notion of length is based on the Euclidean norm

and entails taking the root of the sum of squares. For instance, we can associate to the pmf pe (X) the length ∥pe (X)∥₂ = √((1/2)² + (1/2)²) = 1/√2 ≈ 0.71. On
the other hand, the length of the pmf that states that X = 1 is generated with
100% chance has length one. Note that by eye, we conclude that pe (X), the
arrow pointing to the dot in the left panel in Fig. 6 is indeed much shorter than
the arrow pointing to extreme pmf p(X) = [0, 1].
This mismatch in lengths can be avoided when we represent each pmf p(X) by two times its square root instead (Kass, 1989), that is, by m(X) = 2√p(X) = [2√p(0), 2√p(1)].⁹ A pmf that is identified as the vector m(X) is now two units away from the origin, that is, ∥m(X)∥₂ = √(m(0)² + m(1)²) = √(4(p(0) + p(1))) = 2. For instance, the pmf pe (X) is now represented as me (X) ≈ [1.41, 1.41]. The model space M is the collection of all transformed pmfs, represented as the surface of (the positive part of) a circle, see the right panel of Fig. 6.¹⁰ By representing the set of all possible pmfs of X as vectors m(X) = 2√p(X) that reside on the sphere M, we adopted our intuitive notion of distance. As a result,
reside on the sphere M, we adopted our intuitive notion of distance. As a result,
we can now, by simply looking at the figures, clarify that a uniform prior on the
parameter space may lead to a very informative prior in the model space M.

3.4.2. Uniform on the parameter space versus uniform on the model space

As M represents the largest set of pmfs, any model defines a subset of M.


Recall that the function f (x ∣ θ) represents how we believe a parameter θ is
functionally related to an outcome x of X. For each θ this parameterization yields a pmf pθ (X) and, thus, also mθ (X) = 2√pθ (X). We denote the resulting set of vectors mθ (X) so created by MΘ . For instance, the Bernoulli model f (x ∣ θ) = θ^x (1 − θ)^(1−x) consists of pmfs given by pθ (X) = [f (0 ∣ θ), f (1 ∣ θ)] = [1 − θ, θ], which we represent as the vectors mθ (X) = [2√(1 − θ), 2√θ]. Doing this
for every θ in the parameter space Θ yields the candidate set of pmfs MΘ . In
this case, we obtain a saturated model, since MΘ = M, see the left panel in
Fig. 7, where the rightmost square on the curve corresponds to m0 (X) = [2, 0].
By following the curve in an anti-clockwise manner we encounter squares that
represent the pmfs mθ (X) corresponding to θ = 0.1, 0.2, . . . , 1.0 respectively. In
the right panel of Fig. 7 the same procedure is repeated, but this time in terms
of φ at φ = −1.0π, −0.8π, . . . , 1.0π. Indeed, filling in the gaps shows that the
Bernoulli model in terms of θ and φ fully overlap with the largest set of possible
pmfs, thus, MΘ = M = MΦ . Fig. 7 makes precise what is meant when we say
⁹ The factor two is used to avoid a scaling of a quarter, though its precise value is not essential for the ideas conveyed here. To simplify matters, we also call m(X) a pmf.
¹⁰ Hence, the model space M is the collection of all functions on X such that (i) m(x) ≥ 0 for every outcome x of X, and (ii) √(m(0)² + m(1)²) = 2. This vector representation of all the pmfs on X has the advantage that it also induces an inner product, which allows one to project one vector onto another, see Rudin (1991, p. 4), van der Vaart (1998, p. 94) and Appendix E.


[Figure: two panels plotting m(X = 1) against m(X = 0), both showing the quarter circle of radius two; the left panel marks pmfs at equally spaced values of θ (squares), the right panel at equally spaced values of φ (triangles).]

Fig 7. The parameterization in terms of propensity θ (left panel) and angle φ (right panel)
differ from each other substantially, and from a uniform prior in the model space. Left panel:
The eleven squares (starting from the right bottom going anti-clockwise) represent pmfs that
correspond to θ = 0.0, 0.1, 0.2, . . . , 0.9, 1.0. The shaded area corresponds to the shaded area
in the bottom-left panel of Fig. 5 and accounts for 14% of the model’s length. Right panel:
Similarly, the eleven triangles (starting from the right bottom going anti-clockwise) represent
pmfs that correspond to φ = −1.0π, −0.8π, . . . , 0.8π, 1.0π.

that the models MΘ and MΦ are equivalent; the two models define the same
candidate set of pmfs that we believe to be viable data generating devices for
X.

However, θ and φ represent M in a substantially different manner. As the representation m(X) = 2√p(X) respects our natural notion of distance, we
conclude, by eye, that a uniform division of θs with distance, say, dθ = 0.1 does
not lead to a uniform partition of the model. More extremely, a uniform division
of φ with distance dφ = 0.2π (10% of the length of the parameter space) also
does not lead to a uniform partition of the model. In particular, even though
the intervals (−π, −0.8π) and (−0.2π, 0) are of equal length in the parameter
space Φ, they do not have an equal displacement in the model MΦ . In effect,
the right panel of Fig. 7 shows that the 10% probability that the uniform prior
on φ assigns to φ∗ ∈ (−π, −0.8π) in parameter space is redistributed over a larger
arc length of the model MΦ compared to the 10% assigned to φ∗ ∈ (−0.2π, 0).
Thus, a uniform distribution on φ favors the pmfs mφ (X) with φ close to zero.
Note that this effect is cancelled by the Jeffreys’s prior, as it puts more mass
on the end points compared to φ = 0, see the top-left panel of Fig. 5. Similarly,
the left panel of Fig. 7 shows that the uniform prior g(θ) also fails to yield an
equiprobable assessment of the pmfs in model space. Again, the Jeffreys’s prior
in terms of θ compensates for the fact that the interval (0, 0.1) as compared
to (0.5, 0.6) in Θ is more spread out in model space. However, it does so less
severely compared to the Jeffreys’s prior on φ. To illustrate, we added additional
tick marks on the horizontal axis of the priors in the left panels of Fig. 5. The
tick mark at φ = −2.8 and θ = 0.15 both indicate the 25% quantiles of their
respective Jeffreys’s priors. Hence, the Jeffreys’s prior allocates more mass to the
boundaries of φ than to the boundaries of θ to compensate for the difference in
geometry, see Fig. 7. More generally, the Jeffreys’s prior uses Fisher information


to convert the geometry of the model to the parameter space.


Note that because the Jeffreys’s prior is specified using the Fisher informa-
tion, it takes the functional relationship f (x ∣ θ) into account. The functional
relationship makes precise how the parameter is linked to the data and, thus,
gives meaning and context to the parameter. On the other hand, a prior on φ
specified without taking the functional relationship f (x ∣ φ) into account is a
prior that neglects the context of the problem. For instance, the right panel of
Fig. 7 shows that this neglect with a uniform prior on φ results in having the
geometry of Φ = (−π, π) forced onto the model MΦ .
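The unequal spacing can be quantified: the arc position of mθ(X) along the quarter circle is 2 asin(√θ), the radius two times the angle. The sketch below (our helper names) computes the arc displacement produced by equal steps in θ and by equal steps in φ.

```python
import math

def arc_position(theta):
    # arc length along the quarter circle from m0(X) = [2, 0] to
    # mθ(X) = [2√(1 − θ), 2√θ]: the radius 2 times the angle asin(√θ)
    return 2 * math.asin(math.sqrt(theta))

# equal steps in θ (the squares in the left panel of Fig. 7) ...
steps_theta = [arc_position((i + 1) / 10) - arc_position(i / 10) for i in range(10)]

# ... and equal steps in φ (the triangles in the right panel), via θ = h(φ)
h = lambda phi: 0.5 + 0.5 * (phi / math.pi) ** 3
phis = [-math.pi + i * (0.2 * math.pi) for i in range(11)]
steps_phi = [arc_position(h(phis[i + 1])) - arc_position(h(phis[i]))
             for i in range(10)]

# a uniform division of the model would give π/10 ≈ 0.314 of arc per step
print([round(s, 3) for s in steps_theta])
print([round(s, 3) for s in steps_phi])
```

The first and last θ-steps cover far more arc than the middle ones, and the effect is much stronger for φ: the step from −π to −0.8π covers over a hundred times the arc of the step from −0.2π to 0, which is why the uniform prior on φ concentrates mass near φ = 0 in model space.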

3.5. Uniform prior on the model

Fig. 7 shows that neither a uniform prior on θ, nor a uniform prior on φ yields
a uniform prior on the model. Alternatively, we can begin with a uniform prior
on the model M and convert this into priors on the parameter spaces Θ and
Φ. This uniform prior on the model translated to the parameters is exactly the
Jeffreys’s prior.
Recall that a prior on a space S is uniform, if it has the following two defining
features: (i) the prior is proportional to one, and (ii) a normalizing constant given
by VS = ∫S 1ds that equals the length, more generally, volume of S. For instance,
a replacement of s by φ and S by Φ = (−π, π) yields the uniform prior on the
angles with the normalizing constant VΦ = ∫Φ 1dφ = 2π. Similarly, a replacement
of s by the pmf mθ (X) and S by the function space MΘ yields a uniform prior
on the model MΘ . The normalizing constant then becomes a daunting looking
integral in terms of displacements dmθ (X) between functions in model space
MΘ . Fortunately, it can be shown, see Appendix C, that V simplifies to

V = ∫_{MΘ} 1 dmθ (X) = ∫_Θ √IX (θ) dθ. (3.7)

Thus, V can be computed in terms of θ by multiplying the distances dθ in Θ


by the root of the Fisher information. Heuristically, this means that the root of
the Fisher information translates displacements dmθ (X) in the model MΘ to distances √IX (θ) dθ in the parameter space Θ.
Recall from Example 3.3 that regardless of the parameterization, the nor-
malizing constant of the Jeffreys’s prior was π. To verify that this is indeed the
length of the model, we use the fact that the circumference of a quarter circle
with radius r = 2 can also be calculated as V = (2πr)/4 = π.
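This identity is easy to check numerically; the sketch below (plain Python, our variable names) approximates the model's length by summing chords between neighboring pmfs on the quarter circle and compares it with the closed-form value of V.

```python
import math

# chord-length approximation to the length of the model MΘ: sum the Euclidean
# distances between successive points mθ(X) = [2√(1 − θ), 2√θ] on a fine θ-grid
N = 100_000
length, prev = 0.0, (2.0, 0.0)  # the pmf m at θ = 0
for i in range(1, N + 1):
    t = i / N
    cur = (2 * math.sqrt(1 - t), 2 * math.sqrt(t))
    length += math.hypot(cur[0] - prev[0], cur[1] - prev[1])
    prev = cur

# the same length via Eq. (3.7), V = ∫√IX(θ) dθ, using the antiderivative 2 asin(√θ)
V = 2 * math.asin(math.sqrt(1.0)) - 2 * math.asin(math.sqrt(0.0))

print(round(length, 4), round(V, 4))  # both ≈ 3.1416
```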
Given that the Jeffreys’s prior corresponds to a uniform prior on the model,
we deduce that the shaded area in the bottom-left panel of Fig. 5 with PJ (θ∗ ∈
Jθ ) = 0.14, implies that the model interval Jm = (m0.6 (X), m0.8 (X)), the shaded
area in the left panel of Fig. 7, accounts for 14% of the model’s length. After
updating the Jeffreys’s prior with the observations xnobs consisting of yobs = 7
out of n = 10 the probability of finding the true data generating pmf m∗ (X) in
this interval of pmfs Jm is increased to 53%.


In conclusion, we verified that the Jeffreys’s prior is a prior that leads to the
same conclusion regardless of how we parameterize the problem. This parameterization-
invariance property is a direct result of shifting our focus from finding the true
parameter value within the parameter space to the proper formulation of the

estimation problem –as discovering the true data generating pmf mθ∗ (X) = 2√pθ∗ (X) in MΘ and by expressing our prior ignorance as a uniform prior on
the model MΘ .

4. The Role of Fisher Information in Minimum Description Length

In this section we graphically show how Fisher information is used as a measure


of model complexity and its role in model selection within the minimum descrip-
tion length framework (MDL; de Rooij and Grünwald, 2011; Grünwald, Myung and Pitt,
2005; Grünwald, 2007; Myung, Forster and Browne, 2000; Myung, Navarro and Pitt,
2006; Pitt, Myung and Zhang, 2002).
The primary aim of a model selection procedure is to select a single model
from a set of competing models, say, models M1 and M2 , that best suits the
observed data xnobs . Many model selection procedures have been proposed in the
literature, but the most popular methods are those based on penalized maxi-
mum likelihood criteria, such as the Akaike information criterion (AIC; Akaike,
1974; Burnham and Anderson, 2002), the Bayesian information criterion (BIC;
Raftery, 1995; Schwarz, 1978), and the Fisher information approximation (FIA;
Grünwald, 2007; Rissanen, 1996). These criteria are defined as follows
AIC = −2 log fj (xnobs ∣ θ̂j (xnobs )) + 2dj , (4.1)
BIC = −2 log fj (xnobs ∣ θ̂j (xnobs )) + dj log(n), (4.2)
FIA = −log fj (xnobs ∣ θ̂j (xnobs )) + (dj /2) log(n/(2π)) + log (∫_Θ √(det IMj (θj )) dθj ), (4.3)

(the three terms of the FIA are referred to as the goodness-of-fit, dimensionality, and geometric complexity terms, respectively)

where n denotes the sample size, dj the number of free parameters, θ̂j the
MLE, IMj (θj ) the unit Fisher information, and fj the functional relationship
between the potential outcome xn and the parameters θj within model Mj .11
Hence, except for the observations xnobs , all quantities in the formulas depend
on the model Mj . We made this explicit using a subscript j to indicate that
the quantity, say, θj belongs to model Mj .12 For all three criteria, the model
yielding the lowest criterion value is perceived as the model that generalizes best
(Myung and Pitt, in press).
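For the Bernoulli running example (yobs = 7, n = 10, dj = 1, and ∫√IX(θ)dθ = π from Section 3), the three criteria take only a few lines to compute; the sketch below uses the product Bernoulli likelihood θ̂^y (1 − θ̂)^(n−y), as in the rest of the tutorial, and the variable names are ours.

```python
import math

y_obs, n, d = 7, 10, 1    # seven heads in ten tosses, one free parameter
theta_hat = y_obs / n     # maximum likelihood estimate of the propensity
log_f = y_obs * math.log(theta_hat) + (n - y_obs) * math.log(1 - theta_hat)

aic = -2 * log_f + 2 * d                 # Eq. (4.1)
bic = -2 * log_f + d * math.log(n)       # Eq. (4.2)
# Eq. (4.3): for the Bernoulli model the geometric complexity is log π,
# the logarithm of the model's volume computed in Section 3.5
fia = -log_f + (d / 2) * math.log(n / (2 * math.pi)) + math.log(math.pi)

print(round(aic, 2), round(bic, 2), round(fia, 2))  # 14.22 14.52 7.49
```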
¹¹ For vector-valued parameters θj , we have a Fisher information matrix and det IMj (θj ) refers to the determinant of this matrix. This determinant is always non-negative, because the Fisher information matrix is always a positive semidefinite symmetric matrix. Intuitively, volumes and areas cannot be negative (Appendix C.3.3).
¹² For the sake of clarity, we will use different notations for the parameters within the different models. We introduce two models in this section: the model M1 with parameter θ1 = ϑ, which we pit against the model M2 with parameter θ2 = α.


Each of the three model selection criteria tries to strike a balance between
model fit and model complexity. Model fit is expressed by the goodness-of-fit
terms, which involves replacing the potential outcomes xn and the unknown
parameter θj of the functional relationships fj by the actually observed data
xnobs , as in the Bayesian setting, and the maximum likelihood estimate θ̂j (xnobs ),
as in the frequentist setting.
The positive terms in the criteria account for model complexity. A penaliza-
tion of model complexity is necessary, because the support in the data cannot
be assessed by solely considering goodness-of-fit, as the ability to fit obser-
vations increases with model complexity (e.g., Roberts and Pashler, 2000). As
a result, the more complex model necessarily leads to better fits but may in
fact overfit the data. The overly complex model then captures idiosyncratic
noise rather than general structure, resulting in poor model generalizability
(Myung, Forster and Browne, 2000; Wagenmakers and Waldorp, 2006).
The focus in this section is to make intuitive how FIA acknowledges the trade-
off between goodness-of-fit and model complexity in a principled manner by
graphically illustrating this model selection procedure, see also Balasubramanian
(1996), Kass (1989), Myung, Balasubramanian and Pitt (2000), and Rissanen
(1996). We exemplify the concepts with simple multinomial processing tree
(MPT) models (e.g., Batchelder and Riefer, 1999; Klauer and Kellen, 2011; Wu, Myung and Batchelder,
2010). For a more detailed treatment of the subject we refer to Appendix D,
de Rooij and Grünwald (2011), Grünwald (2007), Myung, Navarro and Pitt (2006),
and the references therein.

4.0.1. The description length of a model

Recall that each model specifies a functional relationship fj between the poten-
tial outcomes of X and the parameters θj . This fj is used to define a so-called
normalized maximum likelihood (NML) code. For the jth model its NML code
is defined as

pNML (xnobs ∣ Mj ) = fj (xnobs ∣ θ̂j (xnobs )) / ∑_{x^n ∈ X^n} fj (x^n ∣ θ̂j (x^n )), (4.4)
where the sum in the denominator is over all possible outcomes xn in X n ,
and where θ̂j refers to the MLE within model Mj . The NML code is a relative
goodness-of-fit measure, as it compares the observed goodness-of-fit term against
the sum of all possible goodness-of-fit terms. Note that the actual observations
xnobs only affect the numerator, by a plugin of xnobs and its associated maximum
likelihood estimate θ̂(xnobs ) into the functional relationship fj belonging to model
Mj . The sum in the denominator consists of the same plugins, but for every
possible realization of X n .13 Hence, the denominator can be interpreted as a
measure of the model’s collective goodness-of-fit or the model’s fit capacity.
Consequently, for every set of observations xnobs , the NML code outputs a number
¹³ As before, for continuous data, the sum is replaced by an integral.


between zero and one that can be transformed into a non-negative number by
taking the negative logarithm as14

− log pNML (xnobs ∣ Mj ) = − log fj (xnobs ∣ θ̂j (xnobs )) + log ∑_{x^n ∈ X^n} fj (x^n ∣ θ̂j (x^n )), (4.5)

which is called the description length of model Mj ; the second term on the right-hand side is known as the model complexity. Within the MDL framework, the model with the shortest description length is the model that best describes the observed data xnobs .
The model complexity term is typically hard to compute, but Rissanen (1996)
showed that it can be well-approximated by the dimensionality and the geomet-
rical complexity terms. That is,

FIA = − log fj (xnobs ∣ θ̂j (xnobs )) + (dj /2) log(n/(2π)) + log (∫_Θ √(det IMj (θj )) dθj ),

is an approximation of the description length of model Mj . The determinant is


simply the absolute value when the number of free parameters dj is equal to one.
Furthermore, the integral in the geometrical complexity term coincides with the
normalizing constant of the Jeffreys’s prior, which represented the volume of
the model. In other words, a model’s fit capacity is proportional to its volume
in model space as one would expect.
In sum, within the MDL philosophy, a model is selected if it yields the shortest
description length, as this model uses the functional relationship fj that best
extracts the regularities from xnobs . As the description length is often hard to
compute, we approximate it with FIA instead (Heck, Moshagen and Erdfelder,
2014). To do so, we have to characterize (1) all possible outcomes of X, (2)
propose at least two models which will be pitted against each other, and (3)
identify the model characteristics: the MLE θ̂j corresponding to Mj , and its
volume VMj . In the remainder of this section we show that FIA selects the model
that is closest to the data with an additional penalty for model complexity.
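For the Bernoulli model with n = 10 the NML denominator in Eq. (4.4) can be computed exactly, because the 2ⁿ binary sequences group by their number of heads y, each with θ̂ = y/n. The sketch below (our variable names) compares the exact model complexity with FIA's dimensionality-plus-geometric-complexity approximation.

```python
import math

n = 10
# the model's fit capacity: the sum of f(x^n | θ̂(x^n)) over all 2^n binary
# sequences, grouped by the number of heads y, for which θ̂ = y/n
capacity = math.fsum(
    math.comb(n, y) * (y / n) ** y * (1 - y / n) ** (n - y)
    for y in range(n + 1)
)  # Python evaluates 0**0 as 1, which the y = 0 and y = n terms require
exact = math.log(capacity)  # exact model complexity term of Eq. (4.5)

# FIA's approximation: (d/2) log(n/2π) + log V, with d = 1 and V = π
approx = 0.5 * math.log(n / (2 * math.pi)) + math.log(math.pi)

print(round(exact, 3), round(approx, 3))  # 1.539 1.377
```

At n = 10 the approximation is already in the right neighborhood; the gap between the two numbers shrinks as n grows.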

4.1. A new running example and the geometry of a random variable with w = 3 outcomes

To graphically illustrate the model selection procedure underlying MDL we


introduce a random variable X that has w = 3 number of potential outcomes.
Example 4.1 (A psychological task with three outcomes). In the training phase
of a source-memory task, the participant is presented with two lists of words on a
computer screen. List L is projected on the left-hand side and list R is projected
on the right-hand side. In the test phase, the participant is presented with two
words, side by side, that can stem from either list, thus, ll, lr, rl, rr. At each
trial, the participant is asked to categorize these pairs as either:
¹⁴ Quite deceivingly, the minus sign actually makes this definition positive, as − log(y) = log(1/y) ≥ 0 if 0 < y ≤ 1.


• L meaning both words come from the left list, i.e., “ll”,
• M meaning the words are mixed, i.e., “lr” or “rl”,
• R meaning both words come from the right list, i.e., “rr”.
For simplicity we assume that the participant will be presented with n test pairs Xn of equal difficulty. ◇
For the graphical illustration of this new running example, we generalize
the ideas presented in Section 3.4.1 from w = 2 to w = 3. Recall that a pmf of X with w outcomes can be written as a w-dimensional vector.
For the task described above we know that a data generating pmf defines the
three chances p(X) = [p(L), p(M ), p(R)] with which X generates the outcomes
[L, M, R] respectively.15 As chances cannot be negative, (i) we require that
0 ≤ p(x) = P (X = x) for every outcome x in X , and (ii) to explicitly convey
that there are w = 3 outcomes, and none more, these w = 3 chances have to sum
to one, that is, ∑x∈X p(x) = 1. We call the largest set of functions that adhere
to conditions (i) and (ii) the complete set of pmfs P. The three chances with
which a pmf p(X) generates outcomes of X can be simultaneously represented
in three-dimensional space with p(L) = P (X = L) on the left most axis, p(M ) =
P (X = M ) on the right most axis and p(R) = P (X = R) on the vertical axis
as shown in the left panel of Fig. 8.16 In the most extreme case, we have the
pmf p(X) = [1, 0, 0], p(X) = [0, 1, 0] or p(X) = [0, 0, 1], which correspond to the
corners of the triangle indicated by pL, pM and pR, respectively. These three
extremes are linked by a triangular plane in the left panel of Fig. 8. Any pmf
–and the true pmf p∗ (X) in particular– can be uniquely identified with a vector
on the triangular plane and vice versa. For instance, a possible true pmf of X
is pe (X) = [1/3, 1/3, 1/3] (i.e., the outcomes L, M and R are generated with the
same chance) depicted as a (red) dot on the simplex.
This vector representation allows us to associate to each pmf of X the
Euclidean norm. For instance, the representation in the left panel of Fig. 8
leads to an extreme pmf p(X) = [1, 0, 0] that is one unit long, while pe(X) = [1/3, 1/3, 1/3] is only √((1/3)² + (1/3)² + (1/3)²) ≈ 0.58 units away from the origin. As before, we can avoid this mismatch in lengths by considering the vectors m(X) = 2√p(X), instead. Any pmf that is identified as m(X) is now two units away from the origin. The model space M is the collection of all transformed pmfs and is represented as the surface of (the positive part of) the sphere in the right panel of Fig. 8. By representing the set of all possible pmfs of X as m(X) = 2√p(X), we adopted our intuitive notion of distance. As a result,
selection mechanism underlying MDL can be made intuitive by simply looking
at the forthcoming plots.
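The length calculations above are easy to verify numerically. The following sketch (plain Python, written for this tutorial) represents pmfs as 3-vectors and checks that the transformation m(X) = 2√p(X) places every pmf exactly two units from the origin.

```python
import math

def norm(v):
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in v))

def m(p):
    """Sphere representation of a pmf: m(X) = 2 * sqrt(p(X)), coordinatewise."""
    return [2 * math.sqrt(x) for x in p]

p_corner = [1.0, 0.0, 0.0]   # the extreme pmf at corner pL
p_e = [1 / 3, 1 / 3, 1 / 3]  # the (red) dot

print(norm(p_corner))        # 1.0: one unit long on the simplex
print(round(norm(p_e), 2))   # 0.58: the mismatch in lengths
print(round(norm(m(p_corner)), 10), round(norm(m(p_e)), 10))  # both 2.0
```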
15 As before we write p(X) = [p(L), p(M), p(R)] with a capital X to denote all w chances simultaneously, and we used the shorthand notation p(L) = p(X = L), p(M) = p(X = M) and p(R) = p(X = R).
16 This is the three-dimensional generalization of Fig. 6.



Fig 8. Every point on the sphere corresponds to a pmf of a categorical distribution with w = 3 categories. In particular, the (red) dot refers to the pmf pe(X) = [1/3, 1/3, 1/3], the circle represents the pmf given by p(X) = [0.01, 0.18, 0.81], while the cross represents the pmf p(X) = [0.25, 0.5, 0.25].

4.2. The individual-word and the only-mixed strategy

To ease the exposition, we assume that both words presented to the participant
come from the right list R, thus, “rr” for the two models introduced below. As
model M1 we take the so-called individual-word strategy. Within this model
M1 , the parameter is θ1 = ϑ, which we interpret as the participant’s “right-list
recognition ability”. With chance ϑ the participant then correctly recognizes
that the first word originates from the right list and repeats this procedure for
the second word, after which the participant categorizes the word pair as L, M ,
or R, see the left panel of Fig. 9 for a schematic description of this strategy as a
processing tree. Fixing the participant’s “right-list recognition ability” ϑ yields
the following pmf

f1 (X ∣ ϑ) = [(1 − ϑ)2 , 2ϑ(1 − ϑ), ϑ2 ]. (4.6)

For instance, when the participant’s true ability is ϑ∗ = 0.9, the three outcomes
[L, M, R] are then generated with the following three chances f1 (X ∣ 0.9) =
[0.01, 0.18, 0.81], which is plotted as a circle in Fig. 8. On the other hand, when
ϑ∗ = 0.5 the participant’s generating pmf is then f1 (X ∣ ϑ = 0.5) = [0.25, 0.5, 0.25],
which is depicted as the cross in model space M. The set of pmfs so defined
forms a curve that goes through both the cross and the circle, see the left panel
of Fig. 10.


Individual−word strategy Only−mixed strategy

Fig 9. Two MPT models that theorize how a participant chooses the outcomes L, M , or R
in the source-memory task described in the main text. The left panel schematically describes
the individual-word strategy, while the right model schematically describes the only-mixed
strategy.

As a competing model M2, we take the so-called only-mixed strategy. For the task described in Example 4.1, we might posit that participants from a certain
clinical group are only capable of recognizing mixed word pairs and that they
are unable to distinguish the pairs “rr” from “ll” resulting in a random guess
between the responses L and R, see the right panel of Fig. 9 for the processing
tree. Within this model M2 the parameter is θ2 = α, which is interpreted as the
participant’s “mixed-list differentiability skill” and fixing it yields the following
pmf

f2 (X ∣ α) = [(1 − α)/2, α, (1 − α)/2]. (4.7)

For instance, when the participant’s true differentiability is α∗ = 1/3, the three
outcomes [L, M, R] are then generated with the equal chances f2 (X ∣ 1/3) =
[1/3, 1/3, 1/3], which, as before, is plotted as the dot in Fig. 10. On the other
hand, when α∗ = 0.5 the participant’s generating pmf is then given by f2 (X ∣ α =
0.5) = [0.25, 0.5, 0.25], i.e., the cross. The set of pmfs so defined forms a curve
that goes through both the dot and the cross, see the left panel of Fig. 10.
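Both strategies are two-line functions from a parameter to a pmf, which makes the chances quoted above easy to check; a small sketch:

```python
def f1(theta):
    """Individual-word strategy, Eq. (4.6): pmf over [L, M, R]."""
    return [(1 - theta) ** 2, 2 * theta * (1 - theta), theta ** 2]

def f2(alpha):
    """Only-mixed strategy, Eq. (4.7): pmf over [L, M, R]."""
    return [(1 - alpha) / 2, alpha, (1 - alpha) / 2]

print(f1(0.9))    # the circle: chances [0.01, 0.18, 0.81] up to float rounding
print(f2(1 / 3))  # the dot: [1/3, 1/3, 1/3] up to float rounding
print(f1(0.5) == f2(0.5))  # True: the models overlap at the cross
```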
The plots show that the models M1 and M2 are neither saturated nor nested,
as the two models define proper subsets of M and only overlap at the cross.
Furthermore, the plots also show that M1 and M2 are both one-dimensional,
as each model is represented as a line in model space. Hence, the dimensionality
terms in all three information criteria are the same. Moreover, AIC and BIC will discriminate these two models based on goodness-of-fit alone. This particular model comparison, thus, allows us to highlight the role Fisher information plays in the MDL model selection philosophy.



Fig 10. Left panel: The set of pmfs that are defined by the individual-word strategy M1 forms a curve that goes through both the cross and the circle, while the pmfs of the only-mixed strategy M2 correspond to the curve that goes through both the cross and the dot. Right panel: The model selected by FIA can be thought of as the model closest to the empirical pmf with an additional penalty for model complexity. The selection between the individual-word and the only-mixed strategy by FIA based on n = 30 trials is formalized by the additional curves: the only-mixed strategy is preferred over the individual-word strategy when the observations yield an empirical pmf that lies between the two non-decision curves. The top, middle and bottom squares correspond to the data sets xnobs,1, xnobs,2 and xnobs,3 in Table 1, which are best suited to M2, to either model, and to M1, respectively. The additional penalty is most noticeable at the cross, where the two models share a pmf. Observations with n = 30 yielding an empirical pmf in this area are automatically assigned to the simpler model, i.e., the only-mixed strategy M2.

4.3. Model characteristics

4.3.1. The maximum likelihood estimators

For FIA we need to compute the goodness-of-fit terms, thus, we need to identify
the MLEs for the parameters within each model. For the models at hand, the
MLEs are

θ̂1 = ϑ̂ = (YM + 2YR )/(2n) for M1 , and θ̂2 = α̂ = YM /n for M2 , (4.8)

where YL, YM and YR = n − YL − YM are the number of L, M and R responses in the data consisting of n trials.
Estimation is a within-model operation and it can be viewed as projecting
the so-called empirical (i.e., observed) pmf corresponding to the data onto the
model. For iid data with w = 3 outcomes the empirical pmf corresponding to xnobs
is defined as p̂obs (X) = [yL /n, yM /n, yR /n]. Hence, the empirical pmf gives the
relative occurrence of each outcome in the sample. For instance, the observations xnobs consisting of [yL = 3, yM = 3, yR = 3] responses correspond to the observed pmf p̂obs(X) = [1/3, 1/3, 1/3], i.e., the dot in Fig. 10. Note that this observed pmf p̂obs(X) does not reside on the curve of M1.
Nonetheless, when we use the MLE ϑ̂ of M1 , we as researchers bestow the
participant with a “right-list recognition ability” ϑ and implicitly assume that
she used the individual-word strategy to generate the observations. In other
words, we only consider the pmfs on the curve of M1 as viable explanations of
how the participant generated her responses. For the data at hand, we have the
estimate ϑ̂obs = 0.5. If we were to generalize the observations xnobs under M1 , we
would then plug this estimate into the functional relationship f1 resulting in the
predictive pmf f1 (X ∣ ϑ̂obs ) = [0.25, 0.5, 0.25]. Hence, even though the number of
L, M and R responses were equal in the observations xnobs , under M1 we expect
that this participant will answer with twice as many M responses compared to
the L and R responses in a next set of test items. Thus, for predictions, part of
the data is ignored and considered as noise.
Geometrically, the generalization f1 (X ∣ ϑ̂obs ) is a result of projecting the
observed pmf p̂obs (X), i.e., the dot, onto the cross that does reside on the curve
of M1 .17 Observe that amongst all pmfs on M1 , the projected pmf is closest to
the empirical pmf p̂obs (X). Under M1 the projected pmf f1 (X ∣ ϑ̂obs ), i.e., the
cross, is perceived as structural, while any deviations from the curve of M1 is
labeled as noise. When generalizing the observations, we ignore noise. Hence, by
estimating the parameter ϑ, we implicitly restrict our predictions to only those
pmfs that are defined by M1 . Moreover, evaluating the prediction at xnobs and,
subsequently, taking the negative logarithm yields the goodness-of-fit term; in
this case, − log f1 (xnobs ∣ ϑ̂obs = 0.5) = 10.4.
Which part of the data is perceived as structural or as noise depends on the
model. For instance, when we use the MLE α̂, we restrict our predictions to
the pmfs of M2 . For the data at hand, we get α̂obs = 1/3 and the plugin yields
f2 (X ∣ α̂obs ) = [1/3, 1/3, 1/3]. Again, amongst all pmfs on M2 , the projected pmf
is closest to the empirical pmf p̂obs (X). In this case, the generalization under
M2 coincides with the observed pmf p̂obs (X). Hence, under M2 there is no
noise, as the empirical pmf p̂obs (X) was already on the model. Geometrically,
this means that M2 is closer to the empirical pmf than M1 , which results in a
lower goodness-of-fit term − log f2 (xnobs ∣ α̂obs = 1/3) = 9.9.
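The MLEs of Eq. (4.8) and the two goodness-of-fit terms for the [3, 3, 3] data can be reproduced in a few lines, with f1 and f2 as in Eqs. (4.6) and (4.7); a sketch:

```python
import math

def f1(theta):
    return [(1 - theta) ** 2, 2 * theta * (1 - theta), theta ** 2]

def f2(alpha):
    return [(1 - alpha) / 2, alpha, (1 - alpha) / 2]

def neg_log_lik(pmf, counts):
    """-log of the iid likelihood of counts [yL, yM, yR] under a pmf."""
    return -sum(y * math.log(p) for y, p in zip(counts, pmf))

counts = [3, 3, 3]                                   # yL, yM, yR: n = 9 trials
n = sum(counts)
theta_hat = (counts[1] + 2 * counts[2]) / (2 * n)    # MLE under M1, Eq. (4.8)
alpha_hat = counts[1] / n                            # MLE under M2, Eq. (4.8)

print(theta_hat)                                     # 0.5
print(round(neg_log_lik(f1(theta_hat), counts), 1))  # 10.4
print(round(neg_log_lik(f2(alpha_hat), counts), 1))  # 9.9
```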
This geometric interpretation allows us to make intuitive that data sets with
the same goodness-of-fit terms will be as far from M1 as from M2 . Equivalently,
M1 and M2 identify the same amount of noise within xnobs , when the two models
fit the observations equally well. For instance, Fig. 10 shows that observations
xnobs with an empirical pmf p̂obs (X) = [0.25, 0.5, 0.25] are equally far from M1
as from M2 . Note that the closest pmf on M1 and M2 are both equal to the
empirical pmf, as f1 (X ∣ ϑ̂obs = 0.5) = p̂obs (X) = f2 (X ∣ α̂obs = 1/2). As a result,
the two goodness-of-fit terms will be equal to each other.
17 This resulting pmf f1(X ∣ ϑ̂obs) is also known as the Kullback-Leibler projection of the empirical pmf p̂obs(X) onto the model M1. White (1982) used this projection to study the behavior of the MLE under model misspecification.


In sum, goodness-of-fit measures a model’s proximity to the observed data.


Consequently, models that take up more volume in model space will be able to
be closer to a larger number of data sets. In particular, when, say, M3 is nested
within M4 , this means that the distance between p̂obs (X) and M3 (noise) is
at least the distance between p̂obs (X) and M4 . Equivalently, for any data set,
M4 will automatically label more of the observations as structural. Models that
excessively identify parts of the observations as structural are known to overfit
the data. Overfitting has an adverse effect on generalizability, especially when
n is small, as p̂obs (X) is then dominated by sampling error. In effect, the more
voluminous model will then use this sampling error, rather than the structure,
for its predictions. To guard ourselves from overfitting, thus, bad generalizability,
the information criteria AIC, BIC and FIA all penalize for model complexity.
AIC and BIC only do this via the dimensionality terms, while FIA also takes the models' volumes into account.

4.3.2. Geometrical complexity


For both models the dimensionality term is given by 1/2 log(n/(2π)). Recall that the
geometrical complexity term is the logarithm of the model’s volume, which for
the individual-word and the only-mixed strategy are given by
VM1 = ∫₀¹ √(IM1(θ)) dθ = √2 π, and (4.9)

VM2 = ∫₀¹ √(IM2(α)) dα = π, (4.10)

respectively. Hence, the individual-word strategy is a more complex model, because it has a larger volume and, thus, a larger capacity to fit data compared to the only-mixed strategy. After taking logs, we see that the individual-word strategy incurs an additional penalty of 1/2 log(2) compared to the only-mixed strategy.
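The two volumes can be verified numerically. The unit Fisher informations used below are not stated in this section; they follow from applying the multinomial identity I(θ) = Σi (dpi/dθ)²/pi to Eqs. (4.6) and (4.7), which simplifies to IM1(θ) = 2/(θ(1−θ)) and IM2(α) = 1/(α(1−α)). A crude midpoint rule suffices despite the integrable endpoint singularities:

```python
import math

def info_m1(t):
    # Unit Fisher information of the individual-word strategy: applying
    # I(theta) = sum_i (dp_i/dtheta)^2 / p_i to Eq. (4.6) simplifies to
    return 2 / (t * (1 - t))

def info_m2(a):
    # The same formula applied to Eq. (4.7) gives
    return 1 / (a * (1 - a))

def volume(info, k=400_000):
    """Midpoint-rule approximation of the volume int_0^1 sqrt(I(t)) dt."""
    h = 1 / k
    return sum(math.sqrt(info((i + 0.5) * h)) * h for i in range(k))

v1, v2 = volume(info_m1), volume(info_m2)
print(round(v1, 2), round(v2, 2))   # 4.44 and 3.14: sqrt(2)*pi and pi
print(round(math.log(v1 / v2), 2))  # 0.35: the extra (1/2) log(2) penalty
```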

4.4. Model selection based on the minimum description length principle

With all model characteristics at hand, we only need observations to illustrate that MDL model selection boils down to selecting the model that is closest
to the observations with an additional penalty for model complexity. Table 1
Table 1
The description lengths for three observations xnobs = [yL, yM, yR], where yL, yM, yR are the number of observed responses L, M and R respectively.

xnobs = [yL, yM, yR]      FIAM1(xnobs)   FIAM2(xnobs)   Preferred model
xnobs,1 = [12, 1, 17]     42             26             M2
xnobs,2 = [14, 10, 6]     34             34             tie
xnobs,3 = [12, 16, 2]     29             32             M1

shows three data sets xnobs,1 , xnobs,2 , xnobs,3 with n = 30 observations. The three


associated empirical pmfs are plotted as the top, middle and lower rectangles in
the right panel of Fig. 10, respectively. Table 1 also shows the approximation of
each model’s description length using FIA. Note that the first observed pmf, the
top rectangle in Fig. 10, is closer to M2 than to M1 , while the third empirical
pmf, the lower rectangle, is closer to M1 . Of particular interest is the middle
rectangle, which lies on an additional black curve that we refer to as a non-
decision curve; observations that correspond to an empirical pmf that lies on this
curve are described equally well by M1 and M2 . For this specific comparison, we
have the following decision rule: FIA selects M2 as the preferred model whenever
the observations correspond to an empirical pmf between the two non-decision
curves, otherwise, FIA selects M1 . Fig. 10 shows that FIA, indeed, selects the
model that is closest to the data except in the area where the two models
overlap –observations consisting of n = 30 trials with an empirical pmf near the
cross are considered better described by the simpler model M2 . Hence, this
yields an incorrect decision even when the empirical pmf is exactly equal to the
true data generating pmf that is given by, say, f1(X ∣ ϑ = 0.51). This automatic preference for the simpler model, however, decreases as n increases.

Fig 11. For n large the additional penalty for model complexity becomes irrelevant. The plotted non-decision curves are based on n = 120 and n = 10,000 trials in the left and right panel respectively. In the right panel only the goodness-of-fit matters in the model comparison. The model selected is then the model that is closest to the observations.

The left and right panel of Fig. 11 show the non-decision curves when n = 120 and n is (extremely) large, respectively. As a result of the moving non-decision bounds, the data set xnobs,4 = [56, 40, 24], which has the same observed pmf as xnobs,2, i.e., the middle rectangle, will now be better described by model M1.
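Putting the goodness-of-fit, dimensionality and volume terms together reproduces Table 1. The sketch below (function names are ours) prints the description lengths to one decimal; their integer parts are apparently what the table reports:

```python
import math

def f1(t):
    return [(1 - t) ** 2, 2 * t * (1 - t), t ** 2]

def f2(a):
    return [(1 - a) / 2, a, (1 - a) / 2]

def fia(counts, f, mle, vol):
    """FIA = goodness of fit + dimensionality term + geometrical complexity."""
    n = sum(counts)
    pmf = f(mle(counts, n))
    fit = -sum(y * math.log(p) for y, p in zip(counts, pmf))
    return fit + 0.5 * math.log(n / (2 * math.pi)) + math.log(vol)

mle1 = lambda c, n: (c[1] + 2 * c[2]) / (2 * n)  # theta-hat of Eq. (4.8)
mle2 = lambda c, n: c[1] / n                     # alpha-hat of Eq. (4.8)

for counts in ([12, 1, 17], [14, 10, 6], [12, 16, 2]):
    fia1 = fia(counts, f1, mle1, math.sqrt(2) * math.pi)
    fia2 = fia(counts, f2, mle2, math.pi)
    print(counts, round(fia1, 1), round(fia2, 1))
# [12, 1, 17] -> 42.3 vs 26.4, [14, 10, 6] -> 34.8 vs 34.9 (a near tie),
# [12, 16, 2] -> 29.4 vs 32.4; the integer parts match Table 1.
```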
For (extremely) large n, the additional penalty due to M1 being more voluminous than M2 becomes irrelevant and the sphere is then separated into quadrants: observations corresponding to an empirical pmf in the top-left or bottom-


right quadrant are better suited to the only-mixed strategy, while the top-right
and bottom-left quadrants indicate a preference for the individual-word strat-
egy M1 . Note that pmfs on the non-decision curves in the right panel of Fig. 11
are as far apart from M1 as from M2 , which agrees with our geometric in-
terpretation of goodness-of-fit as a measure of the model’s proximity to the
data. This quadrant division is only based on the two models’ goodness-of-fit
terms and yields the same selection as one would get from BIC (e.g., Rissanen,
1996). For large n, FIA, thus, selects the model that is closest to the empirical
pmf. This behavior is desirable, because asymptotically the empirical pmf is not
distinguishable from the true data generating pmf. As such, the model that is
closest to the empirical pmf will then also be closest to the true pmf. Hence,
FIA asymptotically selects the model that is closest to the true pmf. As a result,
the projected pmf within the closest model is then expected to yield the best
predictions amongst the competing models.

4.5. Fisher information and generalizability

Model selection by MDL is sometimes perceived as a formalization of Occam's razor (e.g., Balasubramanian, 1996; Grünwald, 1998), a principle that states
that the most parsimonious model should be chosen when the models under
consideration fit the observed data equally well. This preference for the parsi-
monious model is based on the belief that the simpler model is better at pre-
dicting new (as yet unseen) data coming from the same source, as was shown
by Pitt, Myung and Zhang (2002) with simulated data.
To make intuitive why the more parsimonious model, on average, leads to
better predictions, we assume, for simplicity, that the true data generating pmf
is given by f (X ∣ θ∗ ), thus, the existence of a true parameter value θ∗ . As the
observations are expected to be contaminated with sampling error, we also ex-
pect an estimation error, i.e., a distance dθ between the maximum likelihood
estimate θ̂obs and the true θ∗ . Recall that in the construction of Jeffreys’s prior
Fisher information was used to convert displacement in model space to distances
on parameter space. Conversely, Fisher information transforms the estimation
error in parameter space to a generalization error in model space. Moreover,
the larger the Fisher information at θ∗ is, the more it will expand the estima-
tion error into a displacement between the prediction f (X ∣ θ̂obs ) and the true
pmf f (X ∣ θ∗ ). Thus, a larger Fisher information at θ∗ will push the prediction
further from the true pmf resulting in a bad generalization. Smaller models
have, on average, a smaller Fisher information at θ∗ and will therefore lead to
more stable predictions that are closer to the true data generating pmf. Note
that the generalization scheme based on the MLE plugin f (X ∣ θ̂obs ) ignores
the error at each generalization step. The Bayesian counterpart, on the other
hand, does take these errors into account, see Dawid (2011), Ly et al. (2017b) and Marsman, Ly and Wagenmakers (2016); see van Erven, Grünwald and De Rooij (2012), Grünwald and Mehta (2016), van der Pas and Grünwald (2014) and Wagenmakers, Grünwald and Steyvers (2006) for a prequential view of generalizability.
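This conversion from estimation error to displacement in model space can be made concrete with a small numerical check: for the individual-word strategy, the Kullback-Leibler divergence between f1(X ∣ θ*) and f1(X ∣ θ* + δ) is approximately (1/2) I(θ*) δ². The Fisher information IM1(θ) = 2/(θ(1−θ)) used below is derived by us from Eq. (4.6) and is not stated in this section:

```python
import math

def f1(t):
    return [(1 - t) ** 2, 2 * t * (1 - t), t ** 2]

def kl(p, q):
    """Kullback-Leibler divergence between two pmfs."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def info(t):
    # Unit Fisher information of M1, from sum_i (dp_i/dt)^2 / p_i.
    return 2 / (t * (1 - t))

theta_true, d_theta = 0.6, 0.01  # true parameter and a small estimation error
displacement = kl(f1(theta_true), f1(theta_true + d_theta))
approx = 0.5 * info(theta_true) * d_theta ** 2
print(round(displacement, 5), round(approx, 5))  # both 0.00042
```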


5. Concluding Comments

Fisher information is a central statistical concept that is of considerable relevance for mathematical psychologists. We illustrated the use of Fisher information in three different statistical paradigms: in the frequentist paradigm, Fisher
information was used to construct hypothesis tests and confidence intervals;
in the Bayesian paradigm, Fisher information was used to specify a default,
parameterization-invariant prior distribution; lastly, in the paradigm of infor-
mation theory, data compression, and minimum description length, Fisher infor-
mation was used to measure model complexity. Note that these three paradigms
highlight three uses of the functional relationship f between potential observa-
tions xn and the parameters θ. Firstly, in the frequentist setting, the second
argument was fixed at a supposedly known parameter value θ0 or θ̂obs resulting
in a probability mass function, a function of the potential outcomes f (⋅ ∣ θ0 ).
Secondly, in the Bayesian setting, the first argument was fixed at the observed
data resulting in a likelihood function, a function of the parameters f (xobs ∣ ⋅).
Lastly, in the information geometric setting both arguments were free to vary,
i.e., f (⋅ ∣ ⋅) and plugged in by the observed data and the maximum likelihood
estimate.
To ease the exposition we only considered Fisher information of one-dimensional
parameters. The generalization of the concepts introduced here to vector valued
θ can be found in the appendix. A complete treatment of all the uses of Fisher in-
formation throughout statistics would require a book (e.g., Frieden, 2004) rather
than a tutorial article. Due to the vastness of the subject, the present account is
by no means comprehensive. Our goal was to use concrete examples to provide
more insight about Fisher information, something that may benefit psychologists
who propose, develop, and compare mathematical models for psychological pro-
cesses. Other uses of Fisher information are in the detection of model misspec-
ification (Golden, 1995; Golden, 2000; Waldorp, Huizenga and Grasman, 2005;
Waldorp, 2009; Waldorp, Christoffels and van de Ven, 2011; White, 1982), in
the reconciliation of frequentist and Bayesian estimation methods through the
Bernstein-von Mises theorem (Bickel and Kleijn, 2012; Rivoirard and Rousseau,
2012; van der Vaart, 1998; Yang and Le Cam, 2000), in statistical decision the-
ory (e.g., Berger, 1985; Hájek, 1972; Korostelev and Korosteleva, 2011; Ray and Schmidt-Hieber,
2016; Wald, 1949), in the specification of objective priors for more complex mod-
els (e.g., Ghosal, Ghosh and Ramamoorthi, 1997; Grazian and Robert, 2015;
Kleijn and Zhao, 2017), and computational statistics and generalized MCMC
sampling in particular (e.g., Banterle et al., 2015; Girolami and Calderhead,
2011; Grazian and Liseo, 2014; Gronau et al., 2017).
In sum, Fisher information is a key concept in statistical modeling. We hope
to have provided an accessible and concrete tutorial article that explains the
concept and some of its uses for applications that are of particular interest to
mathematical psychologists.


References

Akaike, H. (1974). A New Look at the Statistical Model Identification. IEEE Transactions on Automatic Control 19 716–723.
Aldrich, J. (2005). The statistical education of Harold Jeffreys. International
Statistical Review 73 289–307.
Amari, S. I., Barndorff-Nielsen, O. E., Kass, R. E., Lauritzen, S. L.
and Rao, C. R. (1987). Differential geometry in statistical inference. Institute
of Mathematical Statistics Lecture Notes—Monograph Series, 10. Institute of
Mathematical Statistics, Hayward, CA. MR932246
Atkinson, C. and Mitchell, A. F. S. (1981). Rao’s distance measure.
Sankhyā: The Indian Journal of Statistics, Series A 345–365.
Balasubramanian, V. (1996). A geometric formulation of Occam’s razor for
inference of parametric distributions. arXiv preprint adap-org/9601001.
Banterle, M., Grazian, C., Lee, A. and Robert, C. P. (2015). Acceler-
ating Metropolis-Hastings algorithms by delayed acceptance. arXiv preprint
arXiv:1503.00996.
Batchelder, W. H. and Riefer, D. M. (1980). Separation of Storage and
Retrieval Factors in Free Recall of Clusterable Pairs. Psychological Review 87
375–397.
Batchelder, W. H. and Riefer, D. M. (1999). Theoretical and Empiri-
cal Review of Multinomial Process Tree Modeling. Psychonomic Bulletin &
Review 6 57–86.
Bayarri, M. J., Berger, J. O., Forte, A. and Garcı́a-Donato, G. (2012).
Criteria for Bayesian model choice with application to variable selection. The
Annals of Statistics 40 1550–1577.
Berger, J. O. (1985). Statistical decision theory and Bayesian analysis.
Springer Verlag.
Berger, J. O., Pericchi, L. R. and Varshavsky, J. A. (1998). Bayes fac-
tors and marginal distributions in invariant situations. Sankhyā: The Indian
Journal of Statistics, Series A 307–321.
Bickel, P. J. and Kleijn, B. J. K. (2012). The semiparametric Bernstein–von
Mises Theorem. The Annals of Statistics 40 206–237.
Bickel, P. J., Klaassen, C. A. J., Ritov, Y. and Wellner, J. A. (1993).
Efficient and Adaptive Estimation for Semiparametric Models. Johns Hopkins
University Press Baltimore.
Brown, L. D., Cai, T. T. and DasGupta, A. (2001). Interval estimation for
a binomial proportion. Statistical Science 101–117.
Burbea, J. (1984). Informative geometry of probability spaces Technical Re-
port, DTIC Document.
Burbea, J. and Rao, C. R. (1982). Entropy differential metric, distance and
divergence measures in probability spaces: A unified approach. Journal of
Multivariate Analysis 12 575–596.
Burbea, J. and Rao, C. R. (1984). Differential metrics in probability spaces.
Probability and mathematical statistics 3 241–258.
Burnham, K. P. and Anderson, D. R. (2002). Model Selection and Multimodel Inference: A Practical Information–Theoretic Approach (2nd ed.). Springer Verlag, New York.
Campbell, L. L. (1965). A coding theorem and Rényi’s entropy. Information
and Control 8 423–429.
Chechile, R. A. (1973). The Relative Storage and Retrieval Losses in Short–
Term Memory as a Function of the Similarity and Amount of Information
Processing in the Interpolated Task PhD thesis, University of Pittsburgh.
Cover, T. M. and Thomas, J. A. (2006). Elements of information theory.
John Wiley & Sons.
Cox, D. R. and Reid, N. (1987). Parameter orthogonality and approxi-
mate conditional inference. Journal of the Royal Statistical Society. Series
B (Methodological) 1–39.
Cramér, H. (1946). Methods of Mathematical Statistics. Princeton University
Press 23.
Dawid, A. P. (1977). Further comments on some comments on a paper by
Bradley Efron. The Annals of Statistics 5 1249.
Dawid, A. P. (2011). Posterior model probabilities. In Handbook of the Phi-
losophy of Science, (D. M. Gabbay, P. S. Bandyopadhyay, M. R. Forster,
P. Thagard and J. Woods, eds.) 7 607–630. Elsevier, North-Holland.
de Rooij, S. and Grünwald, P. D. (2011). Luckiness and Regret in Min-
imum Description Length Inference. In Handbook of the Philosophy of Sci-
ence, (D. M. Gabbay, P. S. Bandyopadhyay, M. R. Forster, P. Thagard and
J. Woods, eds.) 7 865–900. Elsevier, North-Holland.
Efron, B. (1975). Defining the curvature of a statistical problem (with applica-
tions to second order efficiency). The Annals of Statistics 3 1189–1242. With
a discussion by C. R. Rao, Don A. Pierce, D. R. Cox, D. V. Lindley, Lucien
LeCam, J. K. Ghosh, J. Pfanzagl, Niels Keiding, A. Philip Dawid, Jim Reeds
and with a reply by the author. MR0428531
Etz, A. and Wagenmakers, E. J. (2017). J. B. S. Haldane’s Contribution to
the Bayes Factor Hypothesis Test. Statistical Science 32 313–329.
Fisher, R. A. (1912). On an Absolute Criterion for Fitting Frequency Curves.
Messenger of Mathematics 41 155–160.
Fisher, R. A. (1920). A Mathematical Examination of the Methods of De-
termining the Accuracy of an Observation by the Mean Error, and by the
Mean Square Error. Monthly Notices of the Royal Astronomical Society 80
758–770.
Fisher, R. A. (1922). On the Mathematical Foundations of Theoretical Statis-
tics. Philosophical Transactions of the Royal Society of London. Series A,
Containing Papers of a Mathematical or Physical Character 222 309–368.
Fisher, R. A. (1925). Theory of Statistical Estimation. Mathematical Proceed-
ings of the Cambridge Philosophical Society 22 700–725.
Fréchet, M. (1943). Sur l’extension de certaines evaluations statistiques au
cas de petits echantillons. Revue de l’Institut International de Statistique 182–
205.
Frieden, B. R. (2004). Science from Fisher information: A unification. Cam-
bridge University Press.


Ghosal, S., Ghosh, J. K. and Ramamoorthi, R. (1997). Non-informative priors via sieves and packing numbers. In Advances in statistical decision
theory and applications 119–132. Springer.
Ghosh, J. K. (1985). Efficiency of Estimates–Part I. Sankhyā: The Indian
Journal of Statistics, Series A 310–325.
Girolami, M. and Calderhead, B. (2011). Riemann manifold Langevin and
Hamiltonian Monte Carlo methods. Journal of the Royal Statistical Society:
Series B (Statistical Methodology) 73 123–214.
Golden, R. M. (1995). Making correct statistical inferences using the wrong
probability model. Journal of Mathematical Psychology 39 3-20.
Golden, R. M. (2000). Statistical tests for comparing possibly misspecified
and nonnested models. Journal of Mathematical Psychology 44 153–170.
Grazian, C. and Liseo, B. (2014). Approximate integrated likelihood via ABC
methods. arXiv preprint arXiv:1403.0387.
Grazian, C. and Robert, C. P. (2015). Jeffreys’ Priors for Mixture Estima-
tion. In Bayesian Statistics from Methods to Models and Applications 37–48.
Springer.
Gronau, Q. F., Ly, A. and Wagenmakers, E.-J. (2017). Informed Bayesian
t-Tests. arXiv preprint arXiv:1704.02479.
Gronau, Q. F., Sarafoglou, A., Matzke, D., Ly, A., Boehm, U., Mars-
man, M., Leslie, D. S., Forster, J. J., Wagenmakers, E.-J. and
Steingroever, H. (2017). A tutorial on bridge sampling. arXiv preprint
arXiv:1703.05984.
Grünwald, P. D. (1998). The Minimum Description Length Principle and
Reasoning under Uncertainty PhD thesis, ILLC and University of Amster-
dam.
Grünwald, P. D. (2007). The Minimum Description Length Principle. MIT
Press, Cambridge, MA.
Grünwald, P. (2016). Safe Probability. arXiv preprint arXiv:1604.01785.
Grünwald, P. D. and Mehta, N. A. (2016). Fast Rates with Unbounded
Losses. arXiv preprint arXiv:1605.00252.
Grünwald, P. D., Myung, I. J. and Pitt, M. A., eds. (2005). Advances
in Minimum Description Length: Theory and Applications. MIT Press, Cam-
bridge, MA.
Grünwald, P. and van Ommen, T. (2014). Inconsistency of Bayesian infer-
ence for misspecified linear models, and a proposal for repairing it. arXiv
preprint arXiv:1412.3730.
Hájek, J. (1970). A Characterization of Limiting Distributions of Regular Es-
timates. Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete 14
323–330.
Hájek, J. (1972). Local asymptotic minimax and admissibility in estimation.
In Proceedings of the sixth Berkeley symposium on mathematical statistics and
probability 1 175–194.
Hald, A. (2008). A history of parametric statistical inference from Bernoulli
to Fisher, 1713-1935. Springer Science & Business Media.
Heck, D. W., Moshagen, M. and Erdfelder, E. (2014). Model selection
by minimum description length: Lower-bound sample sizes for the Fisher in-
formation approximation. Journal of Mathematical Psychology 60 29–34.
Huzurbazar, V. S. (1950). Probability distributions and orthogonal parame-
ters. In Mathematical Proceedings of the Cambridge Philosophical Society 46
281–284. Cambridge University Press.
Huzurbazar, V. S. (1956). Sufficient statistics and orthogonal parameters.
Sankhyā: The Indian Journal of Statistics (1933-1960) 17 217–220.
Inagaki, N. (1970). On the Limiting Distribution of a Sequence of Estimators
with Uniformity Property. Annals of the Institute of Statistical Mathematics
22 1–13.
Jeffreys, H. (1946). An Invariant Form for the Prior Probability in Estimation
Problems. Proceedings of the Royal Society of London. Series A. Mathematical
and Physical Sciences 186 453–461.
Jeffreys, H. (1961). Theory of Probability, 3rd ed. Oxford University Press,
Oxford, UK.
Kass, R. E. (1989). The Geometry of Asymptotic Inference. Statistical Science
4 188–234.
Kass, R. E. and Vaidyanathan, S. K. (1992). Approximate Bayes factors and
orthogonal parameters, with application to testing equality of two binomial
proportions. Journal of the Royal Statistical Society. Series B (Methodologi-
cal) 129–144.
Kass, R. E. and Vos, P. W. (2011). Geometrical foundations of asymptotic
inference 908. John Wiley & Sons.
Klauer, K. C. and Kellen, D. (2011). The flexibility of models of recognition
memory: An analysis by the minimum-description length principle. Journal
of Mathematical Psychology 55 430–450.
Kleijn, B. J. K. and Zhao, Y. Y. (2017). Criteria for posterior consistency.
arXiv preprint arXiv:1308.1263.
Korostelev, A. P. and Korosteleva, O. (2011). Mathematical statistics:
Asymptotic minimax theory 119. American Mathematical Society.
Kotz, S., Kozubowski, T. J. and Podgorski, K. (2001). The Laplace Distri-
bution and Generalizations: A Revisit with Applications to Communications,
Economics, Engineering, and Finance. Springer, New York.
Kraft, L. G. (1949). A device for quantizing, grouping, and coding amplitude-
modulated pulses Master’s thesis, Massachusetts Institute of Technology.
Kullback, S. and Leibler, R. A. (1951). On information and sufficiency.
The Annals of Mathematical Statistics 22 79–86.
LeCam, L. (1970). On the assumptions used to prove asymptotic normality
of maximum likelihood estimates. The Annals of Mathematical Statistics 41
802–828.
LeCam, L. (1990). Maximum likelihood: An introduction. International Statis-
tical Review/Revue Internationale de Statistique 58 153–171.
Lee, M. D. and Wagenmakers, E. J. (2013). Bayesian Cognitive Modeling:
A Practical Course. Cambridge University Press, Cambridge.
Lehmann, E. L. (2011). Fisher, Neyman, and the creation of classical statistics.
Springer Science & Business Media.
Li, Y. and Clyde, M. A. (2015). Mixtures of g-priors in Generalized Linear
Models. arXiv preprint arXiv:1503.06913.
Liang, F., Paulo, R., Molina, G., Clyde, M. A. and Berger, J. O. (2008).
Mixtures of g priors for Bayesian variable selection. Journal of the American
Statistical Association 103.
Ly, A., Marsman, M. and Wagenmakers, E.-J. (in press). Analytic Poste-
riors for Pearson’s Correlation Coefficient. Statistica Neerlandica.
Ly, A., Verhagen, A. J. and Wagenmakers, E. J. (2016a). Harold Jeffreys’s
default Bayes factor hypothesis tests: Explanation, extension, and application
in psychology. Journal of Mathematical Psychology 72 19–32.
Ly, A., Verhagen, A. J. and Wagenmakers, E. J. (2016b). An evaluation
of alternative methods for testing hypotheses, from the perspective of Harold
Jeffreys. Journal of Mathematical Psychology 72 43–55.
Ly, A., Raj, A., Etz, A., Marsman, M., Gronau, Q. F. and Wagenmak-
ers, E. J. (2017a). Bayesian Reanalyses From Summary Statistics and the
Strength of Statistical Evidence. Manuscript submitted for publication.
Ly, A., Etz, A., Marsman, M. and Wagenmakers, E. J. (2017b). Replica-
tion Bayes factors from evidence updating. Manuscript submitted for publica-
tion.
Marsman, M., Ly, A. and Wagenmakers, E. J. (2016). Four requirements
for an acceptable research program. Basic and Applied Social Psychology 38
308–312.
McMillan, B. (1956). Two inequalities implied by unique decipherability. IRE
Transactions on Information Theory 2 115–116.
Mitchell, A. F. (1962). Sufficient statistics and orthogonal parameters. In
Mathematical Proceedings of the Cambridge Philosophical Society 58 326–337.
Cambridge University Press.
Myung, I. J. (2003). Tutorial on Maximum Likelihood Estimation. Journal of
Mathematical Psychology 47 90–100.
Myung, I. J., Balasubramanian, V. and Pitt, M. A. (2000). Counting
Probability Distributions: Differential Geometry and Model Selection. Pro-
ceedings of the National Academy of Sciences 97 11170–11175.
Myung, I. J., Forster, M. R. and Browne, M. W. (2000). Model Selection
[Special Issue]. Journal of Mathematical Psychology 44.
Myung, I. J. and Navarro, D. J. (2005). Information matrix. Encyclopedia
of Statistics in Behavioral Science.
Myung, I. J., Navarro, D. J. and Pitt, M. A. (2006). Model Selection
by Normalized Maximum Likelihood. Journal of Mathematical Psychology 50
167–179.
Myung, I. J. and Pitt, M. A. (in press). Model comparison in psychol-
ogy. In The Stevens’ Handbook of Experimental Psychology and Cognitive
Neuroscience (Fourth Edition), (J. Wixted and E. J. Wagenmakers, eds.) 5:
Methodology John Wiley & Sons, New York, NY.
Pitt, M. A., Myung, I. J. and Zhang, S. (2002). Toward a Method of Se-
lecting Among Computational Models of Cognition. Psychological Review 109
472–491.
Raftery, A. E. (1995). Bayesian model selection in social research. In Socio-
logical Methodology (P. V. Marsden, ed.) 111–196. Blackwells, Cambridge.
Rao, C. R. (1945). Information and Accuracy Attainable in the Estimation
of Statistical Parameters. Bulletin of the Calcutta Mathematical Society 37
81–91.
Ratcliff, R. (1978). A Theory of Memory Retrieval. Psychological Review 85
59–108.
Ray, K. and Schmidt-Hieber, J. (2016). Minimax theory for a class of non-
linear statistical inverse problems. Inverse Problems 32 065003.
Rényi, A. (1961). On measures of entropy and information. In Proceedings of
the fourth Berkeley symposium on mathematical statistics and probability 1
547–561.
Rissanen, J. (1996). Fisher Information and Stochastic Complexity. IEEE
Transactions on Information Theory 42 40–47.
Rivoirard, V. and Rousseau, J. (2012). Bernstein–von Mises theorem for
linear functionals of the density. The Annals of Statistics 40 1489–1523.
Robert, C. P. (2016). The expected demise of the Bayes Factor. Journal of
Mathematical Psychology 72 33–37.
Robert, C. P., Chopin, N. and Rousseau, J. (2009). Harold Jeffreys’s The-
ory of Probability Revisited. Statistical Science 141–172.
Roberts, S. and Pashler, H. (2000). How Persuasive is a Good Fit? A Com-
ment on Theory Testing in Psychology. Psychological Review 107 358–367.
Rudin, W. (1991). Functional analysis, second ed. International Series in Pure
and Applied Mathematics. McGraw-Hill, Inc., New York. MR1157815
Schwarz, G. (1978). Estimating the Dimension of a Model. Annals of Statistics
6 461–464.
Shannon, C. E. (1948). A mathematical theory of communication. Bell System
Technical Journal 27 379–423.
Stevens, S. S. (1957). On the Psychophysical Law. Psychological Review 64
153–181.
Stigler, S. M. (1973). Studies in the History of Probability and Statis-
tics. XXXII Laplace, Fisher, and the discovery of the concept of sufficiency.
Biometrika 60 439–445.
Stigler, S. M. (1986). The history of statistics: The measurement of uncer-
tainty before 1900. Belknap Press.
Tribus, M. and McIrvine, E. C. (1971). Energy and information. Scientific
American 225 179–188.
van der Pas, S. and Grünwald, P. D. (2014). Almost the Best of Three
Worlds: Risk, Consistency and Optional Stopping for the Switch Criterion in
Single Parameter Model Selection. arXiv preprint arXiv:1408.5724.
van der Vaart, A. W. (1998). Asymptotic Statistics. Cambridge University
Press.
van der Vaart, A. W. (2002). The statistical work of Lucien Le Cam. Annals
of Statistics 631–682.
van Erven, T., Grünwald, P. and De Rooij, S. (2012). Catching up faster
by switching sooner: A predictive approach to adaptive estimation with an
application to the AIC–BIC dilemma. Journal of the Royal Statistical Society:
Series B (Statistical Methodology) 74 361–417.
van Erven, T. and Harremos, P. (2014). Rényi divergence and Kullback-
Leibler divergence. IEEE Transactions on Information Theory 60 3797–3820.
van Ommen, T., Koolen, W. M., Feenstra, T. E. and Grünwald, P. D.
(2016). Robust probability updating. International Journal of Approximate
Reasoning 74 30–57.
Wagenmakers, E. J., Grünwald, P. D. and Steyvers, M. (2006). Accu-
mulative Prediction Error and the Selection of Time Series Models. Journal
of Mathematical Psychology 50 149–166.
Wagenmakers, E. J. and Waldorp, L. (2006). Model Selection: Theoreti-
cal Developments and Applications [Special Issue]. Journal of Mathematical
Psychology 50.
Wald, A. (1949). Statistical decision functions. The Annals of Mathematical
Statistics 165–205.
Waldorp, L. J. (2009). Robust and unbiased variance of GLM coefficients
for misspecified autocorrelation and hemodynamic response models in fMRI.
International Journal of Biomedical Imaging 2009 723912.
Waldorp, L., Christoffels, I. and van de Ven, V. (2011). Effective con-
nectivity of fMRI data using ancestral graph theory: Dealing with missing
regions. NeuroImage 54 2695–2705.
Waldorp, L. J., Huizenga, H. M. and Grasman, R. P. P. P. (2005). The
Wald test and Cramér–Rao bound for misspecified models in electromagnetic
source analysis. IEEE Transactions on Signal Processing 53 3427-3435.
White, H. (1982). Maximum likelihood estimation of misspecified models.
Econometrica 50 1–25.
Wijsman, R. (1973). On the attainment of the Cramér-Rao lower bound. The
Annals of Statistics 1 538–542.
Wrinch, D. and Jeffreys, H. (1919). On some aspects of the theory of prob-
ability. Philosophical Magazine 38 715–731.
Wrinch, D. and Jeffreys, H. (1921). On certain fundamental principles of
scientific inquiry. Philosophical Magazine 42 369–390.
Wrinch, D. and Jeffreys, H. (1923). On certain fundamental principles of
scientific inquiry. Philosophical Magazine 45 368–375.
Wu, H., Myung, I. J. and Batchelder, W. H. (2010). Minimum Description
Length Model Selection of Multinomial Processing Tree Models. Psychonomic
Bulletin & Review 17 275–286.
Yang, G. L. (1999). A conversation with Lucien Le Cam. Statistical Science
223–241.
Yang, G. L. and Le Cam, L. (2000). Asymptotics in Statistics: Some Basic
Concepts. Springer-Verlag, Berlin.
Appendix A: Generalization to Vector-Valued Parameters: The
Fisher Information Matrix

Let X be a random variable, θ⃗ = (θ₁, . . . , θd) a vector of parameters, and f
a functional relationship that relates θ⃗ to the potential outcomes x of X. As
before, it is assumed that by fixing θ⃗ in f we get the pmf pθ⃗(x) = f(x ∣ θ⃗), which
is a function of x. The pmf pθ⃗(x) fully determines the chances with which X
takes on the events in the outcome space X. The Fisher information of the
vector θ⃗ ∈ ℝᵈ is a positive semidefinite symmetric matrix of dimension d × d with
the entry at the ith row and jth column given by

\[
I_X(\vec\theta)_{i,j} = \mathrm{Cov}\big( \dot{l}(X \mid \vec\theta),\, \dot{l}^{T}(X \mid \vec\theta) \big)_{i,j} \tag{A.1}
\]
\[
= \begin{cases}
\sum_{x \in \mathcal{X}} \big( \tfrac{\partial}{\partial \theta_i} l(x \mid \vec\theta) \big) \big( \tfrac{\partial}{\partial \theta_j} l(x \mid \vec\theta) \big)\, p_{\vec\theta}(x) & \text{if } X \text{ is discrete,} \\[4pt]
\int_{x \in \mathcal{X}} \big( \tfrac{\partial}{\partial \theta_i} l(x \mid \vec\theta) \big) \big( \tfrac{\partial}{\partial \theta_j} l(x \mid \vec\theta) \big)\, p_{\vec\theta}(x)\, \mathrm{d}x & \text{if } X \text{ is continuous,}
\end{cases} \tag{A.2}
\]

⃗ = log f (x ∣ θ)
where l(x ∣ θ) ⃗ is the log-likelihood function, ∂ l(x ∣ θ) ⃗ is the score
∂θi
function, that is, the partial derivative with respect to the ith component of
the vector θ⃗ and the dot is short-hand notation for the vector of the partial
⃗ is a d × 1 column vector
derivatives with respect to θ = (θ1 , . . . , θd ). Thus, l̇(x ∣ θ)
˙T ⃗
of score functions, while l (x ∣ θ) is a 1 × d row vector of score functions at the
outcome x. The partial derivative is evaluated at θ, ⃗ the same θ⃗ that is used
in the pmf pθ⃗(x) for the weighting. In Appendix E it is shown that the score
functions are expected to be zero, which explains why IX (θ) ⃗ is a covariance
matrix.
Under mild regularity conditions the i, jth entry of the Fisher information
matrix can be equivalently calculated via the negative expectation of the second-order
partial derivatives, that is,

\[
I_X(\vec\theta)_{i,j} = -E\Big( \frac{\partial^{2}}{\partial \theta_i \partial \theta_j} l(X \mid \vec\theta) \Big) \tag{A.3}
\]
\[
= \begin{cases}
-\sum_{x \in \mathcal{X}} \frac{\partial^{2}}{\partial \theta_i \partial \theta_j} \log f(x \mid \vec\theta)\, p_{\vec\theta}(x) & \text{if } X \text{ is discrete,} \\[4pt]
-\int_{x \in \mathcal{X}} \frac{\partial^{2}}{\partial \theta_i \partial \theta_j} \log f(x \mid \vec\theta)\, p_{\vec\theta}(x)\, \mathrm{d}x & \text{if } X \text{ is continuous.}
\end{cases} \tag{A.4}
\]

Note that the sum (thus, integral in the continuous case) is with respect to the
outcomes x of X.
Example A.1 (Fisher information for normally distributed random variables).
When X is normally distributed, i.e., X ∼ N(µ, σ²), it has the following probability
density function (pdf)

\[
f(x \mid \vec\theta) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\Big( -\frac{1}{2\sigma^{2}} (x - \mu)^{2} \Big), \tag{A.5}
\]

where the parameters are collected into the vector θ⃗ = (µ, σ), with µ ∈ ℝ and σ > 0.
The score vector at a specific θ⃗ = (µ, σ) is the following vector of functions of x

\[
\dot{l}(x \mid \vec\theta) = \begin{pmatrix} \frac{\partial}{\partial \mu} l(x \mid \vec\theta) \\[4pt] \frac{\partial}{\partial \sigma} l(x \mid \vec\theta) \end{pmatrix} = \begin{pmatrix} \frac{x - \mu}{\sigma^{2}} \\[4pt] \frac{(x - \mu)^{2}}{\sigma^{3}} - \frac{1}{\sigma} \end{pmatrix}. \tag{A.6}
\]
The unit Fisher information matrix IX(θ⃗) is a 2 × 2 symmetric positive semidefinite
matrix, consisting of expectations of partial derivatives. Equivalently, IX(θ⃗)
can be calculated using the second-order partial derivatives

\[
I_X(\vec\theta) = -E \begin{pmatrix} \frac{\partial^{2}}{\partial \mu \partial \mu} \log f(x \mid \mu, \sigma) & \frac{\partial^{2}}{\partial \mu \partial \sigma} \log f(x \mid \mu, \sigma) \\[4pt] \frac{\partial^{2}}{\partial \sigma \partial \mu} \log f(x \mid \mu, \sigma) & \frac{\partial^{2}}{\partial \sigma \partial \sigma} \log f(x \mid \mu, \sigma) \end{pmatrix} = \begin{pmatrix} \frac{1}{\sigma^{2}} & 0 \\[4pt] 0 & \frac{2}{\sigma^{2}} \end{pmatrix}. \tag{A.7}
\]

The off-diagonal elements are in general not zero. If the i, jth entry is zero we
say that θi and θj are orthogonal to each other, see Appendix C.3.3 below. ◇
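The covariance characterization of Eq. (A.1) can be checked numerically. The sketch below (Python; an illustration, not part of the original text) draws normal samples, stacks the two score components of Eq. (A.6) evaluated at the true parameters, and compares the Monte Carlo covariance of the score with the analytic matrix diag(1/σ², 2/σ²) of Eq. (A.7).

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma = 0.5, 2.0
n = 200_000
x = rng.normal(mu, sigma, size=n)

# Score components of Eq. (A.6), evaluated at the true (mu, sigma):
#   d/dmu    log f = (x - mu) / sigma^2
#   d/dsigma log f = (x - mu)^2 / sigma^3 - 1 / sigma
score = np.stack([(x - mu) / sigma**2,
                  (x - mu)**2 / sigma**3 - 1 / sigma])

# Eq. (A.1): the Fisher information is the covariance of the score.
# The score has expectation zero, so E(score score^T) is its covariance.
I_mc = score @ score.T / n
I_exact = np.array([[1 / sigma**2, 0.0],
                    [0.0, 2 / sigma**2]])
print(np.round(I_mc, 3))   # close to [[0.25, 0], [0, 0.5]]
print(I_exact)
```

The vanishing off-diagonal entries illustrate the orthogonality of µ and σ mentioned above.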
For iid trials Xⁿ = (X₁, . . . , Xₙ) with X ∼ pθ⃗(x), the Fisher information
matrix for Xⁿ is given by IXⁿ(θ⃗) = nIX(θ⃗). Thus, for vector-valued parameters
θ⃗ the Fisher information matrix remains additive.
In the remainder of the text, we simply use θ for both one-dimensional and
vector-valued parameters. Similarly, depending on the context it should be clear
whether IX (θ) is a number or a matrix.

Appendix B: Frequentist Statistics based on Asymptotic Normality

The hypothesis tests and confidence intervals constructed in the frequentist
section were all based on the MLE being asymptotically normal.

B.1. Asymptotic normality of the MLE for vector-valued parameters

For so-called regular parametric models, see Appendix E, the MLE for vector-valued
parameters θ⃗ converges in distribution to a multivariate normal distribution,
that is,

\[
\sqrt{n}(\hat\theta - \theta^{*}) \xrightarrow{\;D\;} N_d\big( 0, I_X^{-1}(\theta^{*}) \big), \quad \text{as } n \to \infty, \tag{B.1}
\]

where Nd is a d-dimensional multivariate normal distribution, and IX⁻¹(θ∗) the
inverse Fisher information matrix at the true value θ∗. For n large enough, we
can, thus, approximate the sampling distribution of the “error” of the MLE by
a normal distribution,

\[
(\hat\theta - \theta^{*}) \overset{D}{\approx} N_d\big( 0, \tfrac{1}{n} I_X^{-1}(\theta^{*}) \big), \quad \text{we repeat, approximately.} \tag{B.2}
\]

In practice, we fix n and replace the true sampling distribution by this normal
distribution. Hence, we incur an approximation error that is only negligible
whenever n is large enough. What constitutes n large enough depends on the
true data generating pmf p∗ (x) that is unknown in practice. In other words,
the hypothesis tests and confidence intervals given in the main text based on
the replacement of the true sampling distribution by this normal distribution
might not be appropriate. In particular, this means that a hypothesis test at
a significance level of 5% based on the asymptotic normal distribution, instead
of the true sampling distribution, might actually yield a type 1 error rate of,
say, 42%. Similarly, as a result of the approximation error, a 95%-confidence
interval might only encapsulate the true parameter in, say, 20% of the time that
we repeat the experiment.
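To see how badly the approximation can fail at small n, one can simulate the coverage of the textbook Wald interval θ̂ ± 1.96√(θ̂(1 − θ̂)/n) for a Bernoulli probability, the setting studied by Brown, Cai and DasGupta (2001). The particular numbers (θ = 0.05, n = 10) are an illustrative sketch, not taken from the original text.

```python
import numpy as np

rng = np.random.default_rng(0)
theta, n, reps = 0.05, 10, 20_000

# Wald interval for a Bernoulli proportion: the asymptotic variance is
# I_X^{-1}(theta)/n = theta(1 - theta)/n, with theta replaced by the MLE.
theta_hat = rng.binomial(n, theta, size=reps) / n
half_width = 1.96 * np.sqrt(theta_hat * (1 - theta_hat) / n)
covered = (theta_hat - half_width <= theta) & (theta <= theta_hat + half_width)

# Nominal coverage is 95%; the actual coverage is far lower, largely because
# theta_hat = 0 (an interval of zero width at zero) occurs ~60% of the time.
print(covered.mean())
```

With these settings the empirical coverage lands around 40%, nowhere near the nominal 95%, exactly the kind of discrepancy warned about above.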

B.2. Asymptotic normality of the MLE and the central limit theorem

Asymptotic normality of the MLE can be thought of as a refinement of the central
limit theorem. The (Lindeberg-Lévy) CLT is a general statement about the
sampling distribution of the sample mean estimator X̄ = (1/n) ∑ⁿᵢ₌₁ Xᵢ based on iid
trials of X with common population mean θ = E(X) and variance Var(X) < ∞.
More specifically, the CLT states that, with a proper scaling, the sample mean
X̄ centred around the true θ∗ will converge in distribution to a normal distribution,
that is,

\[
\sqrt{n}(\bar X - \theta^{*}) \xrightarrow{\;D\;} N\big( 0, \mathrm{Var}(X) \big).
\]

In practice, we replace the true sampling distribution by this normal distribution at
fixed n and hope that n is large enough. Hence, for fixed n we then suppose that the
“error” is distributed as (X̄ − θ∗) ≈D N(0, (1/n)Var(X)) and we ignore the
approximation error. In particular, when we know that the population variance is
Var(X) = 1, we then know that we require an experiment with n = 100 samples for
X̄ to generate estimates within 0.196 distance from θ∗ with approximately 95%
chance, that is, P(∣X̄ − θ∗∣ ≤ 0.196) ≈ 0.95.¹⁸ This calculation was based on our
knowledge of the normal distribution N(0, 0.01), which has its 97.5% quantile at
0.196. In the examples below we re-use this calculation by matching the asymptotic
variances to 0.01.¹⁹ The 95% statement only holds approximately, because we
do not know whether n = 100 is large enough for the CLT to hold, i.e., this
probability could be well below 23%. Note that the CLT holds under very general
conditions; the population mean and variance both need to exist, i.e., be finite.
The distributional form of X is irrelevant for the statement of the CLT.

On the other hand, to even compute the MLE we not only require that the
population quantities exist and are finite, but we also need to know the
functional relationship f that relates these parameters to the outcomes of X.
When we assume more (and nature adheres to these additional conditions),
we know more, and are then able to give stronger statements. We give three
examples.
18 As before, chance refers to the relative frequency, that is, when we repeat the experiment
k = 200 times, each with n = 100, we get k estimates and approximately 95% of
these k estimates are then expected to be within 0.196 distance away from the true
population mean θ∗.
19 Technically, an asymptotic variance is free of n, but we mean the approximate variance
at finite n. For the CLT this means (1/n)σ².
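The sample-size calculations in the three examples that follow all solve 1.96·√(v/n) = 0.196 for n, where v is the approximate variance of the estimator at finite n. A small helper (an illustration, not part of the original text) reproduces the numbers n = 100, 50, 200, and 247 used below:

```python
from math import pi

# Smallest n such that an estimator with asymptotic variance v lands
# within +/- eps of the target with ~95% chance:
#   1.96 * sqrt(v / n) <= eps   =>   n >= (1.96 / eps)^2 * v
def n_required(v, eps=0.196, z=1.96):
    return round((z / eps) ** 2 * v)

print(n_required(1.0))          # CLT, Var(X) = 1:                n = 100
print(n_required(1 / 2))        # Laplace MLE, I^{-1} = b^2 = 1/2: n = 50
print(n_required(2.0))          # Cauchy MLE, I^{-1} = 2:          n = 200
print(n_required(pi**2 / 4))    # Cauchy sample median:            n = 247
```

Halving the variance v halves the required sample size, which is the practical pay-off of a more efficient estimator.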
Example B.1 (Asymptotic normality of the MLE vs the CLT: The Gaussian
distribution). If X has a Gaussian (normal) distribution, i.e., X ∼ N(θ, σ²),
with σ² known, then the MLE is the sample mean and the unit Fisher information
is IX(θ) = 1/σ². Asymptotic normality of the MLE leads to the same
statement as the CLT, that is, √n(θ̂ − θ∗) →D N(0, σ²). Hence, asymptotically
we do not gain anything by going from the CLT to asymptotic normality of the
MLE. The additional knowledge of f(x ∣ θ) being normal does, however, allow
us to come to the rare conclusion that the normal approximation holds exactly
for every finite n, thus, (θ̂ − θ∗) =D N(0, (1/n)σ²). In all other cases, whenever
X ≁ N(θ, σ²), we always have an approximation.²⁰ Thus, whenever σ² = 1 and
n = 100 we know that P(∣θ̂ − θ∗∣ ≤ 0.196) = 0.95 holds exactly. ◇
Example B.2 (Asymptotic normality of the MLE vs the CLT: The Laplace distribution).
If X has a Laplace distribution with scale b, i.e., X ∼ Laplace(θ, b),
then its population mean and variance are θ = E(X) and 2b² = Var(X), respectively.
In this case, the MLE is the sample median M̂ and the unit Fisher information
is IX(θ) = 1/b². Asymptotic normality of the MLE implies that we
can approximate the sampling distribution by the normal distribution, that is,
(θ̂ − θ∗) ≈D N(0, (1/n)b²), when n is large enough. Given that the population
variance is Var(X) = 1, we know that b = 1/√2, yielding a variance of 1/(2n) in our
normal approximation to the sampling distribution. Matching this variance to
0.01 shows that we now require only n = 50 samples for the estimator to generate
estimates within 0.196 distance away from the true value θ∗ with 95% chance.
As before, the validity of this statement only holds approximately, i.e., whenever
the normal approximation to the sampling distribution of the MLE at n = 50 is
not too bad.
Hence, the additional knowledge of f(x ∣ θ) being Laplace allows us to use an
estimator, i.e., the MLE, that has a lower asymptotic variance. Exploiting this
knowledge allowed us to design an experiment with half as many participants. ◇
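A quick simulation makes the variance gain concrete. The sketch below (illustrative, not part of the original text) draws Laplace data with population variance 1 (so b = 1/√2) and compares the sampling variance of the mean and the median at n = 50:

```python
import numpy as np

rng = np.random.default_rng(2)
n, reps = 50, 20_000

b = 1 / np.sqrt(2)                       # scale chosen so Var(X) = 2b^2 = 1
x = rng.laplace(loc=0.0, scale=b, size=(reps, n))

var_mean = x.mean(axis=1).var()          # ~ Var(X)/n          = 1/50  = 0.020
var_median = np.median(x, axis=1).var()  # ~ I_X^{-1}(theta)/n = b^2/n = 0.010
print(round(var_mean, 4), round(var_median, 4))
```

The median (the MLE here) attains roughly half the sampling variance of the mean, matching the n = 100 versus n = 50 comparison above.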
Example B.3 (Asymptotic normality of the MLE vs the CLT: The Cauchy
distribution). If X has a Cauchy distribution centred around θ with scale 1,
i.e., X ∼ Cauchy(θ, 1), then X does not have a finite population variance, nor
a finite population mean. As such, the CLT cannot be used. Even worse, Fisher
(1922) showed that the sample mean as an estimator for θ is in this case useless,
as the sampling distribution of the sample mean is a Cauchy distribution that
does not depend on n, namely, X̄ ∼ Cauchy(θ, 1). As such, using the first obser-
vation alone to estimate θ is as good as combining the information of n = 100
samples in the sample mean estimator. Hence, after seeing the first observation
no additional information about θ is gained using the sample mean X̄, not even
if we increase n.
20
This is a direct result of Cramér’s theorem that states that whenever X is independent
of Y and Z = X + Y with Z a normal distribution, then X and Y themselves are necessarily
normally distributed.
The sample median estimator M̂ performs better. Again, Fisher (1922) already
knew that for n large enough (M̂ − θ∗) ≈D N(0, (1/n)(π²/4)). The MLE is
even better, but unfortunately, in this case, it cannot be given as an explicit
function of the data.²¹ The Fisher information can be given explicitly, namely,
IX(θ) = 1/2. Asymptotic normality of the MLE implies that (θ̂ − θ∗) ≈D N(0, (1/n)·2),
when n is large enough. Matching the variances in the approximation based on
the normal distribution to 0.01 shows that we require n = 25π² ≈ 247 for the sample
median and n = 200 samples for the MLE to generate estimates within 0.196
distance away from the true value θ∗ with approximately 95% chance. ◇
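The contrast between the estimators can again be simulated. Since the Cauchy sample mean has no finite variance, the sketch below (illustrative, not part of the original text) compares interquartile ranges instead: the IQR of the sample mean stays near 2, the IQR of Cauchy(θ, 1) itself, for every n, while the sample median's IQR shrinks at the 1/√n rate.

```python
import numpy as np

rng = np.random.default_rng(3)
reps = 5_000

for n in (10, 100, 1000):
    x = rng.standard_cauchy(size=(reps, n))   # theta* = 0, scale 1
    q75, q25 = np.percentile(x.mean(axis=1), [75, 25])
    iqr_mean = q75 - q25                      # stays ~2 for every n
    q75, q25 = np.percentile(np.median(x, axis=1), [75, 25])
    iqr_median = q75 - q25                    # shrinks roughly like 2.12/sqrt(n)
    print(n, round(iqr_mean, 2), round(iqr_median, 3))
```

Averaging more Cauchy observations does not concentrate the sample mean at all, which is Fisher's (1922) point restated empirically.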

B.3. Efficiency of the MLE: The Hájek-LeCam convolution theorem
and the Cramér-Fréchet-Rao information lower bound

The previous examples showed that the MLE is an estimator that leads to
a smaller sample size requirement, because it is the estimator with the lower
asymptotic variance. This lower asymptotic variance is a result of the MLE
making explicit use of the functional relationship between the samples xⁿobs and
the target θ in the population. Given any such f , one might wonder whether the
MLE is the estimator with the lowest possible asymptotic variance. The answer
is affirmative, whenever we restrict ourselves to the broad class of so-called
regular estimators.
A regular estimator Tₙ = tₙ(Xⁿ) is a function of the data that has a limiting
distribution that does not change too much, whenever we change the parameters
in the neighborhood of the true value θ∗, see van der Vaart (1998, p. 115) for
a precise definition. The Hájek-LeCam convolution theorem characterizes the
aforementioned limiting distribution as a convolution, i.e., a sum of, the independent
statistics ∆θ∗ and Zθ∗. That is, for any regular estimator Tₙ and every
possible true value θ∗ we have

\[
\sqrt{n}(T_n - \theta^{*}) \xrightarrow{\;D\;} \Delta_{\theta^{*}} + Z_{\theta^{*}}, \quad \text{as } n \to \infty, \tag{B.3}
\]

where Zθ∗ ∼ N(0, IX⁻¹(θ∗)) and where ∆θ∗ has an arbitrary distribution. By
independence, the variance of the asymptotic distribution is simply the sum of
the variances. As the variance of ∆θ∗ cannot be negative, we know that the
asymptotic variance of any regular estimator Tₙ is bounded from below, that
is, Var(∆θ∗) + IX⁻¹(θ∗) ≥ IX⁻¹(θ∗).

The MLE is a regular estimator with ∆θ∗ equal to the constant zero, thus,
Var(∆θ∗) = 0. As such, the MLE has an asymptotic variance IX⁻¹(θ∗) that
is equal to the lower bound given above. Hence, amongst the broad class of
is equal to the lower bound given above. Hence, amongst the broad class of
regular estimators, the MLE performs best. This result was already foreshad-
owed by Fisher (1922), though it took another 50 years before this statement
21 Given observations xⁿobs the maximum likelihood estimate θ̂obs is the number for which
the score function l̇(xⁿobs ∣ θ) = ∑ⁿᵢ₌₁ 2(xobs,i − θ)/(1 + (xobs,i − θ)²) is zero. This
optimization cannot be solved analytically and there are 2n solutions to this equation.
was made mathematically rigorous (Hájek, 1970; Inagaki, 1970; LeCam, 1970;
van der Vaart, 2002; Yang, 1999), see also Ghosh (1985) for a beautiful review.
We stress that the normal approximation to the true sampling distribution
only holds when n is large enough. In practice, n is relatively small and the
replacement of the true sampling distribution by the normal approximation
can, thus, lead to confidence intervals and hypothesis tests that perform poorly
(Brown, Cai and DasGupta, 2001). This can be very detrimental, especially,
when we are dealing with hard decisions such as the rejection or non-rejection
of a hypothesis.
A simpler version of the Hájek-LeCam convolution theorem is known as the
Cramér-Fréchet-Rao information lower bound (Cramér, 1946; Fréchet, 1943;
Rao, 1945), which also holds for finite n. This theorem states that the variance
of an unbiased estimator Tₙ cannot be lower than the inverse Fisher information,
that is, nVar(Tₙ) ≥ IX⁻¹(θ∗). We call an estimator Tₙ = t(Xⁿ) unbiased if for
every possible true value θ∗ and at each fixed n, its expectation is equal to
the true value, that is, E(Tₙ) = θ∗. Hence, this lower bound shows that Fisher
information is not only a concept that is useful for large samples.
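As a concrete check of the bound (a sketch, not part of the original text): for Bernoulli(θ) trials the unit Fisher information is IX(θ) = 1/(θ(1 − θ)), and the sample mean is unbiased with nVar(X̄) = θ(1 − θ) = IX⁻¹(θ), so in this one-dimensional case the bound is attained exactly.

```python
import numpy as np

rng = np.random.default_rng(4)
theta, n, reps = 0.3, 40, 100_000

# Cramér-Fréchet-Rao bound for Bernoulli(theta): n Var(T_n) >= I_X^{-1}(theta),
# where I_X(theta) = 1/(theta(1 - theta)), so the bound equals theta(1 - theta).
bound = theta * (1 - theta)
x_bar = rng.binomial(n, theta, size=reps) / n

print(round(n * x_bar.var(), 3), bound)   # both ~ 0.21: the bound is attained
```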
Unfortunately, the class of unbiased estimators is rather restrictive (in general,
it does not include the MLE) and the lower bound cannot be attained
whenever the parameter has more than one dimension (Wijsman, 1973). Consequently,
for vector-valued parameters θ⃗, this information lower bound does
not inform us whether we should stop our search for a better estimator.
The Hájek-LeCam convolution theorem implies that for n large enough the
MLE θ̂ is the best performing statistic. For the MLE to be superior, however,
the data do need to be generated as specified by the functional relationship f .
In reality, we do not know whether the data are indeed generated as specified by
f , which is why we should also try to empirically test such an assumption. For
instance, we might believe that the data are normally distributed, while in fact
they were generated according to a Cauchy distribution. This incorrect assump-
tion implies that we should use the sample mean, but Example B.3 showed the
futility of such estimator. Model misspecification, in addition to hard decisions
based on the normal approximation, might be the main culprit of the crisis of
replicability. Hence, more research on the detection of model misspecification is
desirable and expected (e.g., Grünwald, 2016; Grünwald and van Ommen, 2014;
van Ommen et al., 2016).

Appendix C: Bayesian use of the Fisher-Rao Metric: The Jeffreys's
Prior

We make intuitive that the Jeffreys's prior is a uniform prior on the model MΘ,
i.e.,

\[
P(m^{*} \in J_m) = \frac{1}{V} \int_{J_m} 1 \, \mathrm{d}m_{\theta}(X) = \frac{1}{V} \int_{\theta_a}^{\theta_b} \sqrt{I_X(\theta)} \, \mathrm{d}\theta, \tag{C.1}
\]

where Jm = (mθa(X), mθb(X)) is an interval of pmfs in model space. To do
so, we explain why the differential dmθ(X), a displacement in model space, is
converted into √IX(θ) dθ in parameter space. The elaboration below boils down
to an explanation of arc length computations using integration by substitution.
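For a Bernoulli model, the running example of this tutorial, IX(θ) = 1/(θ(1 − θ)), and Eq. (C.1) can be evaluated numerically. The sketch below (illustrative, not part of the original text) recovers the total length V = ∫₀¹ √IX(θ) dθ = π and shows how the Jeffreys's prior spreads its mass: the central interval (0.25, 0.75) receives only mass 1/3, the rest sitting near the boundary of the parameter space.

```python
from math import pi, sqrt

# Arc length of the Bernoulli model between theta_a and theta_b:
# integrate sqrt(I_X(theta)) = 1/sqrt(theta(1 - theta)) by the midpoint rule.
def arc_length(theta_a, theta_b, k=200_000):
    h = (theta_b - theta_a) / k
    mids = (theta_a + (i + 0.5) * h for i in range(k))
    return sum(h / sqrt(t * (1 - t)) for t in mids)

V = arc_length(0.0, 1.0)
print(round(V, 3))                           # ~ pi: total length of the model
print(round(arc_length(0.25, 0.75) / V, 3))  # Jeffreys mass of (0.25, 0.75) ~ 1/3
```

Dividing each arc length by V is exactly the normalization in Eq. (C.1): equal lengths in model space receive equal prior mass.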

C.1. Tangent vectors

First note that we swapped the area of integration by substituting the interval
Jm = (mθa (X), mθb (X)) consisting of pmfs in function space MΘ by the interval
(θa , θb ) in parameter space. This is made possible by the parameter functional ν
with domain MΘ and range Θ that uniquely assigns to any (transformed) pmf
ma(X) ∈ MΘ a parameter value θa ∈ Θ. In this case, we have θa = ν(ma(X)) =
(½ ma(1))². Uniqueness of the assignment implies that the resulting parameter
values θa and θb in Θ differ from each other whenever ma (X) and mb (X) in
MΘ differ from each other. For example, the map ν ∶ MΘ → Θ implies that

[Figure 12: two panels plotting m(X = 1) against m(X = 0).]

Fig 12. The full arrow represents the simultaneous displacement in model space based on the
Taylor approximation Eq. (C.3) in terms of θ at mθa (X), where θa = 0.8 (left panel) and in
terms of φ at mφa (X) where φa = 0.6π (right panel). The dotted line represents a part of the
Bernoulli model and note that the full arrow is tangent to the model.

in the left panel of Fig. 12 the third square from the left with coordinates
ma(X) = [0.89, 1.79] can be labeled by θa = 0.8 ≈ (½(1.79))², while the second
square from the left with coordinates mb(X) = [0.63, 1.90] can be labeled by
θb = 0.9 ≈ (½(1.90))².
To calculate the arc length of the curve Jm consisting of functions in MΘ ,
we first approximate Jm by a finite sum of tangent vectors, i.e., straight lines.
The approximation of the arc length is the sum of the length of these straight
lines. The associated approximation error goes to zero, when we increase the
number of tangent vectors and change the sum into an integral sign, as in the
usual definition of an integral. First we discuss tangent vectors.
In the left panel in Fig. 12, we depicted the tangent vector at mθa (X) as
the full arrow. This full arrow is constructed from its components: one broken
arrow that is parallel to the horizontal axis associated with the outcome x = 0,
and one broken arrow that is parallel to the vertical axis associated with the
outcome x = 1. The arrows parallel to the axes are derived by first fixing X = x
followed by a Taylor expansion of the parameterization θ ↦ mθ (x) at θa . The

imsart-generic ver. 2014/10/16 file: fiArxiv.tex date: October 18, 2017


Ly, et. al./Fisher information tutorial 46

Taylor expansion is derived by differentiating with respect to θ at θa yielding


the following “linear” function of the distance dθ = ∣θb − θa ∣ in parameter space,
dmθa (x) = mθb (x) − mθa (x) = (dmθa (x)/dθ) dθ + o(dθ),  (C.2)

with slope Aθa (x) = dmθa (x)/dθ and “intercept” Bθa (x) = o(dθ). The slope, a function of x, Aθa (x) at mθa (x) in the direction of x is given by

Aθa (x) = dmθa (x)/dθ = ½ { (d/dθ) log f (x ∣ θa ) } mθa (x),  (C.3)

where (d/dθ) log f (x ∣ θa ) is the score function. The “intercept” Bθa (x) = o(dθ) goes to zero quickly whenever dθ → 0. Thus, for dθ small, the intercept Bθa (x) is practically zero. Hence, we approximate the displacement between mθa (x) and mθb (x) by a straight line.
Example C.1 (Tangent vectors). In the right panel of Fig. 12 the right most
triangle is given by mφa (X) = [1.25, 1.56], while the triangle in the middle refers
to mφb (X) = [0.99, 1.74]. Using the functional ν̃, i.e., the inverse of the parameterization φ ↦ 2√f (x ∣ φ), where f (x ∣ φ) = ( ½ + ½ (φ/π)³ )^x ( ½ − ½ (φ/π)³ )^(1−x), we find that these two pmfs correspond to φa = 0.6π and φb = 0.8π.


The tangent vector at mφa (X) is constructed from its components. For the
horizontal displacement, we fill in x = 0 in log f (x ∣ φ) followed by the derivation
with respect to φ at φa and a multiplication by mφa (x) resulting in
(dmφa (0)/dφ) dφ = ½ { (d/dφ) log f (0 ∣ φa ) } mφa (0) dφ,  (C.4)
                 = −3φa² / √(2π³ (π³ − φa³)) dφ,  (C.5)
where dφ = ∣φb − φa ∣ is the distance in parameter space Φ. The minus sign indi-
cates that the displacement along the horizontal axis is from right to left. Filling
in dφ = ∣φb − φa ∣ = 0.2π and φa = 0.6π yields a horizontal displacement of 0.17
at mφa (0) from right to left in model space. Similarly, the vertical displacement
in terms of φ is calculated by first filling in x = 1 and leads to
(dmφa (1)/dφ) dφ = ½ { (d/dφ) log f (1 ∣ φa ) } mφa (1) dφ,  (C.6)
                 = 3φa² / √(2π³ (π³ + φa³)) dφ.  (C.7)
By filling in dφ = 0.2π and φa = 0.6π, we see that a change of dφ = 0.2π at
φa = 0.6π in the parameter space corresponds to a vertical displacement of 0.14
at mφa (1) from bottom to top in model space. Note that the axes in Fig. 12 are
scaled differently.
The combined displacement (dmφa (X)/dφ) dφ at mφa (X) is the sum of the two broken arrows and is plotted as a full arrow in the right panel of Fig. 12.
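The displacements of Example C.1 can be double-checked numerically; in the sketch below (function and variable names are ours) the analytic components of the tangent vector are compared against a finite-difference derivative:

```python
import math

PI3 = math.pi ** 3

def m(phi, x):
    """m_phi(x) = 2*sqrt(f(x | phi)) for the transformed Bernoulli model of Example C.1."""
    f1 = 0.5 + 0.5 * (phi / math.pi) ** 3            # f(1 | phi)
    return 2 * math.sqrt(f1 if x == 1 else 1 - f1)   # f(0 | phi) = 1 - f(1 | phi)

phi_a, dphi = 0.6 * math.pi, 0.2 * math.pi

# analytic components of the tangent vector at m_phi_a(X), Eqs. (C.4)-(C.7)
horiz = -3 * phi_a**2 / math.sqrt(2 * PI3 * (PI3 - phi_a**3)) * dphi   # about -0.17
vert = 3 * phi_a**2 / math.sqrt(2 * PI3 * (PI3 + phi_a**3)) * dphi     # about 0.14

# finite-difference cross-check of the horizontal component
h = 1e-6
horiz_num = (m(phi_a + h, 0) - m(phi_a, 0)) / h * dphi
```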

The length of the tangent vector (dmθa (X)/dθ) dθ at the vector mθa (X) is calculated by taking the root of the sum of its squared components, the natural measure of distance we adopted above, and this yields

∥ (dmθa (X)/dθ) dθ ∥2 = √( ∑_{x∈X} ( dmθa (x)/dθ )² (dθ)² ),  (C.8)
                      = √( ∑_{x∈X} ( (d/dθ) log f (x ∣ θa ) )² pθa (x) ) dθ = √IX (θa ) dθ.  (C.9)

The second equality follows from the definition of dmθa (X)/dθ, i.e., Eq. (C.3), and the last equality is due to the definition of Fisher information.
Example C.2 (Length of the tangent vectors). The length of the tangent vector in the right panel of Fig. 12 can be calculated as the root of the sum of squares of its components, that is, ∥ (dmφa (X)/dφ) dφ ∥2 = √((−0.17)² + 0.14²) = 0.22. Alternatively, we can first calculate the square root of the Fisher information at φa = 0.6π, i.e.,

√IX (φa ) = 3φa² / √(π⁶ − φa⁶) = 0.35,  (C.10)

and a multiplication with dφ = 0.2π results in ∥ dmφa (X)/dφ ∥2 dφ = 0.22. ◇
More generally, to approximate the length between pmfs mθa (X) and mθb (X),
we first identify ν(mθa (X)) = θa and multiply this with the distance dθ =
∣θa − ν(mθb (X))∣ in parameter space, i.e.,

dmθ (X) = ∥ dmθ (X)/dθ ∥2 dθ = √IX (θ) dθ.  (C.11)

In other words, the root of the Fisher information converts a small distance dθ
at θa to a displacement in model space at mθa (X).

C.2. The Fisher-Rao metric

By virtue of the parameter functional ν, we send an interval of pmfs Jm = (mθa (X), mθb (X)) in the function space MΘ to the interval (θa , θb ) in the parameter space Θ. In addition, with the conversion dmθ (X) = √IX (θ) dθ we integrate by substitution, that is,

P (m∗ (X) ∈ Jm ) = (1/V) ∫_{mθa (X)}^{mθb (X)} 1 dmθ (X) = (1/V) ∫_{θa}^{θb} √IX (θ) dθ.  (C.12)

In particular, choosing Jm = MΘ yields the normalizing constant V = ∫_0^1 √IX (θ) dθ. The interpretation of V as the total length of MΘ is due to the use of dmθ (X) as the metric, a measure of distance, in model space. To honour

Calyampudi Radhakrishna Rao’s (1945) contribution to the theory, this metric is


also known as the Fisher-Rao metric (e.g., Amari et al., 1987; Atkinson and Mitchell,
1981; Burbea, 1984; Burbea and Rao, 1982, 1984; Dawid, 1977; Efron, 1975;
Kass and Vos, 2011).
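For the Bernoulli model, where IX (θ) = 1/(θ(1 − θ)), the total length V can be approximated by quadrature; a minimal sketch (the naive midpoint integrator is ours, not from the text):

```python
import math

def sqrt_fisher(theta):
    """Square root of the Fisher information I_X(theta) = 1/(theta*(1 - theta)) of one Bernoulli trial."""
    return 1 / math.sqrt(theta * (1 - theta))

def midpoint(f, a, b, n=200_000):
    """Plain midpoint rule; good enough here despite the integrable singularities at the endpoints."""
    h = (b - a) / n
    return sum(f(a + (i + 0.5) * h) for i in range(n)) * h

V = midpoint(sqrt_fisher, 0.0, 1.0)   # total length of the Bernoulli model; approaches pi
```

Since V = π, the normalized density √IX (θ)/π, i.e., the Jeffreys's prior of the Bernoulli model, is exactly the arcsine distribution on (0, 1).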

C.3. Fisher-Rao metric for vector-valued parameters

C.3.1. The parameter functional ν ∶ P → B and the categorical distribution

For random variables with w number of outcomes, the largest set of pmfs P is
the collection of functions p on X such that (i) 0 ≤ p(x) = P (X = x) for every
outcome x in X , and (ii) to explicitly convey that there are w outcomes, and
none more, these w chances have to sum to one, that is, ∑x∈X p(x) = 1. The
complete set of pmfs P can be parameterized using the functional ν that assigns
to each w-dimensional pmf p(X) a parameter β ∈ Rw−1 .
For instance, given a pmf p(X) = [p(L), p(M ), p(R)] we typically use the
functional ν ∶ P → R2 that takes the first two coordinates, that is, ν(p(X)) =
β = (β1 , β2 ), where β1 = p(L) and β2 = p(M ). The range of this functional ν is the parameter space B = {(β1 , β2 ) ∶ 0 ≤ β1 , β2 and β1 + β2 ≤ 1}. Conversely, the inverse of the functional ν is
the parameterization β ↦ pβ (X) = [β1 , β2 , 1 − β1 − β2 ], where (i’) 0 ≤ β1 , β2 and
(ii’) β1 +β2 ≤ 1. The restrictions (i’) and (ii’) imply that the parameterization has
domain B and the largest set of pmfs P as its range. By virtue of the functional
ν and its inverse, that is, the parameterization β ↦ pβ (X), we conclude that the
parameter space B and the complete set of pmfs P are isomorphic. This means
that each pmf p(X) ∈ P can be uniquely identified with a parameter β ∈ B and
vice versa. The inverse of ν implies that the parameters β ∈ B are functionally
related to the potential outcomes x of X as

f (x ∣ β) = β1^xL β2^xM (1 − β1 − β2 )^xR ,  (C.13)

where xL , xM and xR are the number of L, M and R responses in one trial


–we either have x = [xL , xM , xR ] = [1, 0, 0], x = [0, 1, 0], or x = [0, 0, 1]. The
model f (x ∣ β) can be regarded as the generalization of the Bernoulli model to
w = 3 categories. In effect, the parameters β1 and β2 can be interpreted as a
participant’s propensity of choosing L and M , respectively. If X n consists of n
iid categorical random variables with the outcomes [L, M, R], the joint pmf of
X n is then

f (xn ∣ β) = β1^yL β2^yM (1 − β1 − β2 )^yR ,  (C.14)

where yL , yM and yR = n − yL − yM are the number of L, M and R responses in n trials. As before, the representation of the pmfs as the vectors mβ (X) = [2√β1 , 2√β2 , 2√(1 − β1 − β2 )] forms the surface of (the positive part of) the sphere
of radius two, thus, M = MB , see Fig. 13. The extreme pmfs indicated by
mL, mM and mR in the figure are indexed by the parameter values β = (1, 0),
β = (0, 1) and β = (0, 0), respectively.

C.3.2. The stick-breaking parameterization of the categorical distribution

Alternatively, we could also have used a “stick-breaking” parameter functional


ν̃ that sends each pmf in P to the vector of parameters ν̃(p(X)) = (γ1 , γ2 ), where γ1 = pL and γ2 = pM /(1 − pL ).22 Again the parameter γ = (γ1 , γ2 ) is only a label, but this time the range of ν̃ is the parameter space Γ = [0, 1] × [0, 1]. The functional relationship f associated to γ is given by

f (x ∣ γ) = γ1^xL ((1 − γ1 )γ2 )^xM ((1 − γ1 )(1 − γ2 ))^xR .  (C.15)

For each γ we can transform the pmf into the vector


mγ (X) = [2√γ1 , 2√((1 − γ1 )γ2 ), 2√((1 − γ1 )(1 − γ2 ))],  (C.16)

and write MΓ for the collection of vectors so defined. As before, this collection
coincides with the full model, i.e., MΓ = M. In other words, by virtue of the
functional ν̃ and its inverse γ ↦ pγ (x) = f (x ∣ γ) we conclude that the parameter
space Γ and the complete set of pmfs M are isomorphic. Because M = MB this
means that we also have an isomorphism between the parameter space B and
Γ via M, even though B is a strict subset of Γ. Note that this equivalence goes
via parameterization β ↦ mβ (X) and the functional ν̃.
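The equivalence of the two parameterizations via M can be checked by composing the maps; a minimal sketch (function names are ours):

```python
def pmf_from_beta(b1, b2):
    """Categorical pmf [p(L), p(M), p(R)] under the beta-parameterization."""
    return [b1, b2, 1 - b1 - b2]

def pmf_from_gamma(g1, g2):
    """The same pmf under the stick-breaking gamma-parameterization, Eq. (C.15)."""
    return [g1, (1 - g1) * g2, (1 - g1) * (1 - g2)]

def gamma_from_pmf(p):
    """Stick-breaking functional: gamma1 = p(L) and gamma2 = p(M)/(1 - p(L))."""
    pL, pM, _ = p
    return (pL, pM / (1 - pL) if pL < 1 else 0.0)

beta = (1 / 3, 1 / 3)                            # the beta* of Fig. 13
gamma = gamma_from_pmf(pmf_from_beta(*beta))     # (1/3, 1/2): the gamma* of Fig. 13
same = pmf_from_gamma(*gamma)                    # recovers the pmf [1/3, 1/3, 1/3]
```

Starting from β = (1/3, 1/3), the stick-breaking functional returns γ = (1/3, 1/2), and mapping back recovers the same pmf, illustrating the isomorphism between B and Γ via M.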

C.3.3. Multidimensional Jeffreys’s prior via the Fisher information matrix and orthogonal parameters

The multidimensional Jeffreys’s prior is parameterization-invariant and has as normalization constant V = ∫ √(det IX (θ)) dθ, where det IX (θ) is the determinant
of the Fisher information matrix.
In the previous subsection we argued that the categorical distribution in
terms of β or parameterized with γ are equivalent to each other, that is, MB =
M = MΓ . However, these two parameterizations describe the model space M
quite differently. In this subsection we use the Fisher information to show that
the parameterization in terms of γ is sometimes preferred over β.
The complete model M is easier described by γ, because the parameters are
orthogonal. We say that two parameters are orthogonal to each other whenever
the corresponding off-diagonal entries in the Fisher information matrix are zero.
The Fisher information matrices in terms of β and γ are

IX (β) = (1/(1 − β1 − β2 )) [ (1 − β2 )/β1 , 1 ; 1 , (1 − β1 )/β2 ]  and  IX (γ) = [ 1/(γ1 (1 − γ1 )) , 0 ; 0 , (1 − γ1 )/(γ2 (1 − γ2 )) ],  (C.17)

respectively. The left panel of Fig. 13 shows the tangent vectors at pβ∗ (X) = [1/3, 1/3, 1/3] in model space, where β∗ = (1/3, 1/3). The green tangent vector
22
This only works if pL < 1. When p(x1 ) = 1, we simply set γ2 = 0, thus, γ = (1, 0).

[Figure 13: two panels showing the spherical model M with corner pmfs mL , mM and mR .]

Fig 13. When the off-diagonal entries are zero, the tangent vectors are orthogonal. Left panel: The tangent vectors at pβ∗ (X) = [1/3, 1/3, 1/3] span a diamond with an area given by √(det I(β∗ )) dβ. The black curve is the submodel with β2 = 1/3 fixed and β1 free to vary and yields a green tangent vector. The blue curve is the submodel with β1 = 1/3 fixed and β2 free to vary. Right panel: The tangent vectors at the same pmf in terms of γ, thus, pγ∗ (X), span a rectangle with an area given by √(det I(γ∗ )) dγ. The black curve is the submodel with γ2 = 1/2 fixed and γ1 free to vary and yields a green tangent vector. The blue curve is the submodel with γ1 = 1/3 fixed and γ2 free to vary.

corresponds to ∂mβ∗ (X)/∂β1 , thus, with β2 = 1/3 fixed and β1 free to vary, while the red tangent vector corresponds to ∂mβ∗ (X)/∂β2 , thus, with β1 = 1/3 fixed and β2 free to vary. The area of the diamond spanned by these two tangent vectors is √(det I(β∗ )) dβ1 dβ2 , where we have taken dβ1 = 0.1 and dβ2 = 0.1.
The right panel of Fig. 13 shows the tangent vectors at the same point pγ∗ (X) = [1/3, 1/3, 1/3], where γ∗ = (1/3, 1/2). The green tangent vector corresponds to ∂mγ∗ (X)/∂γ1 , thus, with γ2 = 1/2 fixed and γ1 free to vary, while the red tangent vector corresponds to ∂mγ∗ (X)/∂γ2 , thus, with γ1 = 1/3 fixed and γ2 free to vary. By glancing over the plots, we see that the two tangent vectors are indeed orthogonal. The area of the rectangle spanned by these two tangent vectors is √(det I(γ∗ )) dγ1 dγ2 , where we have taken dγ1 = dγ2 = 0.1.
There are now two ways to calculate the normalizing constant of the Jeffreys’s prior, the area, or more generally the volume, of the model M. In terms of β this leads to

V = ∫_0^1 ( ∫_0^{1−β1} √( 1/(β1 β2 (1 − β1 − β2 )) ) dβ2 ) dβ1 .  (C.18)

Observe that the inner integral depends on the value of β1 from the outer

integral. This coupling is reflected by the non-zero off-diagonal term of the


Fisher information matrix IX (β) corresponding to β1 and β2 . On the other hand,
orthogonality implies that the two parameters can be treated independently of
each other. That is, knowing and fixing γ1 and changing γ2 will not affect mγ (X)
via γ1 . This means that the double integral decouples

V = ∫_0^1 ( ∫_0^1 1/( √γ1 √(γ2 (1 − γ2 )) ) dγ1 ) dγ2 = ∫_0^1 1/√γ1 dγ1 ∫_0^1 1/√(γ2 (1 − γ2 )) dγ2 = 2π.  (C.19)

Using standard geometry we verify that this is indeed the area of M, as an eighth of the surface area of a sphere of radius two is given by (1/8) · 4π · 2² = 2π.
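The two computations of V can be mimicked numerically; in the sketch below (the midpoint integrator is ours and deliberately crude, so the β-route is only accurate to about one decimal) both routes approach 2π:

```python
import math

def midpoint(f, a, b, n):
    """Plain midpoint rule over (a, b); the midpoints avoid the boundary singularities."""
    h = (b - a) / n
    return sum(f(a + (i + 0.5) * h) for i in range(n)) * h

# gamma-route: orthogonality decouples the double integral into a product, Eq. (C.19)
V_gamma = (midpoint(lambda g: 1 / math.sqrt(g), 0, 1, 100_000)
           * midpoint(lambda g: 1 / math.sqrt(g * (1 - g)), 0, 1, 100_000))

# beta-route: the inner limit depends on beta1, so the integrals do not decouple, Eq. (C.18)
def inner(b1, n=1_000):
    return midpoint(lambda b2: 1 / math.sqrt(b1 * b2 * (1 - b1 - b2)), 0, 1 - b1, n)

V_beta = midpoint(inner, 0, 1, 1_000)

# both approach 2*pi, the area of one eighth of a sphere of radius two
```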
Orthogonality is relevant in Bayesian analysis, as it provides an argument to
choose a prior on a vector-valued parameter that factorizes (e.g., Berger, Pericchi and Varshavsky,
1998; Huzurbazar, 1950, 1956; Jeffreys, 1961; Kass and Vaidyanathan, 1992;
Ly, Verhagen and Wagenmakers, 2016a,b), see also Cox and Reid (1987); Mitchell
(1962).
By taking a random variable X with w = 3 outcomes, we were able to vi-
sualize the geometry of model space. For more general X these plots get more
complicated and perhaps even impossible to draw. Nonetheless, the ideas con-
veyed here extend, even to continuous X, whenever the model adheres to the
regularity conditions given in Appendix E.

Appendix D: MDL: Coding Theoretical Background

D.1. Coding theory, code length and log-loss

A coding system translates words, i.e., outcomes of a random variable X, into


code words with code lengths that behave like a pmf. Code lengths can be mea-
sured with a logarithm, which motivates the adoption of log-loss, defined below,
as the decision criterion within the MDL paradigm. The coding theoretical ter-
minologies introduced here are illustrated using the random variable X with
w = 3 potential outcomes.

D.1.1. Kraft-McMillan inequality: From code lengths of a specific coding system to a pmf

For the source-memory task we encoded the outcomes as L, M and R, but


when we communicate a participant’s responses xnobs to a collaborator over the
internet, we have to encode the observations xnobs as zeroes and ones. For in-
stance, we might use a coding system C̃ with code words C̃(X = L) = 00,
C̃(X = M ) = 01 and C̃(X = R) = 10. This coding system C̃ will transform
any set of responses xnobs into a code string C̃(xnobs ) consisting of 2n bits. Al-
ternatively, we can use a coding system C with code words C(X = L) = 10,

C(X = M ) = 0 and C(X = R) = 11, instead. Depending on the actual observa-


tions xnobs , this coding system outputs code strings C(xnobs ) with varying code
lengths that range from n to 2n bits. For example, if a participant responded
with xnobs = (M, R, M, L, L, M, M, M ) in n = 8 trials, the coding system C would
then output the 11-bit long code string C(xnobs ) = 01101010000. In contrast, the
first coding system C̃ will always output a 16-bit long code string when n = 8.
Shorter code strings are desirable as they will lead to a smaller load on the com-
munication network and they are less likely to be intercepted by “competing”
researchers.
Note that the shorter code length C(xnobs ) = 01101010000 of 11-bits is a result
of having code words of unequal lengths. The fact that one of the code word
is shorter does not interfere with the decoding, since no code word is a prefix
of another code word. As such, we refer to C as a prefix (free) coding system.
This implies that the 11-bit long code string C(xnobs ) is self-punctuated and
that it can be uniquely deciphered by simply reading the code string from left
to right resulting in the retrieval of xnobs . Note that the code lengths of C inherit
the randomness of the data. In particular, the coding system C produces a
shorter code string with high chance, if the participant generates the outcome
M with high chance. In the extreme case, the coding system C produces the 8-
bits long code string C(xn ) = 00000000 with 100% (respectively, 0%) chance, if
the participant generates the outcome M with 100% (respectively, 0%) chance.
More generally, Kraft and McMillan (Kraft, 1949; McMillan, 1956) showed that
any uniquely decipherable (prefix) coding system from the outcome space X
with w outcomes to an alphabet with D elements must satisfy the inequality
∑_{i=1}^{w} D^(−li) ≤ 1,  (D.1)

where li is the code length of the i-th outcome. In our example, we have taken D = 2 and code lengths of 2, 1 and 2 bits for the responses L, M and R respectively. Indeed, 2^(−2) + 2^(−1) + 2^(−2) = 1. Hence, code lengths behave like the logarithm (with
base D) of a pmf.
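The coding system C and the Kraft-McMillan inequality can be illustrated in a few lines; a sketch using the code words given above (the helper functions are ours):

```python
# the prefix coding system C from the text: M -> 0, L -> 10, R -> 11
code = {"M": "0", "L": "10", "R": "11"}

def encode(responses):
    return "".join(code[r] for r in responses)

def decode(bits):
    """Prefix-freeness lets us decipher greedily from left to right."""
    inverse = {v: k for k, v in code.items()}
    out, buffer = [], ""
    for b in bits:
        buffer += b
        if buffer in inverse:      # no code word is a prefix of another
            out.append(inverse[buffer])
            buffer = ""
    return out

obs = ["M", "R", "M", "L", "L", "M", "M", "M"]          # the n = 8 responses from the text
bits = encode(obs)                                      # the 11-bit string "01101010000"
kraft_sum = sum(2 ** -len(w) for w in code.values())    # Kraft-McMillan: must be <= 1
```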

D.1.2. Shannon-Fano algorithm: From a pmf to a coding system with specific code lengths

Given a data generating pmf p∗ (X), we can use the so-called Shannon-Fano
algorithm (e.g., Cover and Thomas, 2006, Ch. 5) to construct a prefix coding
system C ∗ . The idea behind this algorithm is to give the outcome x that is
generated with the highest chance the shortest code length. To do so, we encode
the outcome x as a code word C ∗ (x) that consists of − log2 p∗ (x) bits.23
23
When we use the logarithm with base two, log2 (y), we get the code length in bits,
while the natural logarithm, log(y), yields the code length in nats. Any result in terms of the
natural logarithm can be equivalently described in terms of the logarithm with base two, as
log(y) = log(2) log 2 (y).

For instance, when a participant generates the outcomes [L, M, R] according


to the chances p∗ (X) = [0.25, 0.5, 0.25] the Shannon-Fano algorithm prescribes
that we should encode the outcome L with − log2 (0.25) = 2, M with − log2 (0.5) =
1 and R with 2 bits; the coding system C given above.24 The Shannon-Fano
algorithm works similarly for any other given pmf pβ (X). Hence, the Kraft-
McMillan inequality and its inverse, i.e., the Shannon-Fano algorithm imply that
pmfs and uniquely decipherable coding systems are equivalent to each other. As
such we have an additional interpretation of a pmf. To distinguish the different
uses, we write f (X ∣ β) when we view the pmf as a coding system, while we
retain the notation pβ (X) when we view the pmf as a data generating device.
In the remainder of this section we will not explicitly construct any other coding
system, as the coding system itself is irrelevant for the discussion at hand –only
the code lengths matter.
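The Shannon-Fano prescription itself is a one-liner; a minimal sketch for the pmf p∗ (X) = [0.25, 0.5, 0.25] of the example above (ignoring the rounding issue of footnote 24):

```python
import math

p_star = {"L": 0.25, "M": 0.5, "R": 0.25}

# ideal Shannon-Fano code lengths -log2 p*(x): 2 bits for L, 1 for M, 2 for R
lengths = {x: -math.log2(p) for x, p in p_star.items()}

# average code length per trial under p*: 1.5 bits, the Shannon entropy of Section D.1.3
avg_length = sum(p_star[x] * lengths[x] for x in p_star)
```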

D.1.3. Entropy, cross entropy, log-loss

With the true data generating pmf p∗ (X) at hand, thus, also the true coding
system f (X ∣ β ∗ ), we can calculate the (population) average code length per
trial

H(p∗ (X)) = H(p∗ (X) ∥ f (X ∣ β ∗ )) = ∑_{x∈X} − log f (x ∣ β ∗ ) p∗ (x).  (D.2)

Whenever we use the logarithm with base 2, we refer to this quantity H(p∗ (X))
as the Shannon entropy.25 If the true pmf is p∗ (X) = [0.25, 0.5, 0.25] we have an
average code length of 1.5 bits per trial whenever we use the true coding system
f (X ∣ β ∗ ). Thus, we expect to use 12 bits to encode observations consisting of
n = 8 trials.
As coding theorists, we have no control over the true data generating pmf
p∗ (X), but we can choose the coding system f (X ∣ β) to encode the observations.
The (population) average code length per trial is given by

H(p∗ (X) ∥ β) = H(p∗ (X) ∥ f (X ∣ β)) = ∑_{x∈X} − log f (x ∣ β) p∗ (x).  (D.3)

The quantity H(p∗ (X)∥ β) is also known as the cross entropy from the true
pmf p∗ (X) to the postulated f (X ∣ β).26 For instance, when we use the pmf
f (X ∣ β) = [0.01, 0.18, 0.81] to encode data that are generated according to
p∗ (X) = [0.25, 0.5, 0.25], we will use 2.97 bits on average per trial. Clearly,
24
Due to rounding, the Shannon-Fano algorithm actually produces code words C(x) that
are at most one bit larger than the ideal code length − log2 p∗ (x). We avoid further discussions
on rounding. Moreover, in the following we consider the natural logarithm instead.
25
Shannon denoted this quantity with an H to refer to the capital Greek letter for eta. It
seems that John von Neumann convinced Claude Shannon to call this quantity entropy rather
than information (Tribus and McIrvine, 1971).
26 Observe that the entropy H(p∗ (X)) is just the cross entropy from the true p∗ (X) to the true coding system f (X ∣ β ∗ ).

this is much more than the 1.5 bits per trial that we get from using the true
coding system f (X ∣ β ∗ ).
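The entropy and cross entropy of this example are easily reproduced; a minimal sketch (the variable names are ours):

```python
import math

p_star = {"L": 0.25, "M": 0.5, "R": 0.25}     # true data generating pmf
f_beta = {"L": 0.01, "M": 0.18, "R": 0.81}    # postulated coding system

# entropy: average code length per trial under the true coding system (1.5 bits)
entropy = sum(-math.log2(p) * p for p in p_star.values())

# cross entropy: average code length when the wrong coding system is used (about 2.97 bits)
cross = sum(-math.log2(f_beta[x]) * p for x, p in p_star.items())

kl = cross - entropy   # the nonnegative Kullback-Leibler divergence
```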
More generally, Shannon (1948) showed that the cross entropy can never be
smaller than the entropy, i.e., H(p∗ (X)) ≤ H(p∗ (X)∥ β). In other words, we
always get a larger average code length, whenever we use the wrong coding
system f (X ∣ β). To see why this holds, we decompose the cross entropy as a
sum of the entropy and the Kullback-Leibler divergence,27 and show that the
latter cannot be negative. This decomposition follows from the definition of cross
entropy and a subsequent addition and subtraction of the entropy resulting in

H(p∗ (X) ∥ β) = H(p∗ (X)) + ∑_{x∈X} ( log p∗ (x)/f (x ∣ β) ) p∗ (x) = H(p∗ (X)) + D(p∗ (X) ∥ β),  (D.4)

where D(p∗ (X)∥β) defines the Kullback-Leibler divergence from the true pmf
p∗ (X) to the postulated coding system f (X ∣ β). Using the so-called Jensen’s
inequality it can be shown that the KL-divergence is non-negative and that
it is only zero whenever f (X ∣ β) = p∗ (X). Thus, the cross entropy can never
be smaller than the entropy. Consequently, to minimize the load on the com-
munication network, we have to minimize the cross entropy with respect to the
parameter β. Unfortunately, however, we cannot do this in practice, because the
cross entropy is a population quantity based on the unknown true pmf p∗ (X).
Instead, we do the next best thing by replacing the true p∗ (X) in Eq. (D.3)
by the empirical pmf that gives the relative occurrences of the outcomes in the
sample rather than in the population. Hence, for any postulated f (X ∣ β), with
β fixed, we approximate the population average defined in Eq. (D.3) by the
sample average
H(xnobs ∥ β) = H(p̂obs (X) ∥ f (X ∣ β)) = ∑_{i=1}^{n} − log f (xobs,i ∣ β) = − log f (xnobs ∣ β).  (D.5)

We call the quantity H(xnobs ∥ β) the log-loss from the observed data xnobs , i.e.,
the empirical pmf p̂obs (X), to the coding system f (X ∣ β).
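That the log-loss is minimized by the MLE plugin, i.e., the relative frequencies for a categorical model, can be illustrated with the n = 8 responses from Section D.1.1; a minimal sketch (the comparison pmf `alternative` is an arbitrary pick of ours):

```python
import math

obs = ["M", "R", "M", "L", "L", "M", "M", "M"]   # n = 8 observed responses
n = len(obs)

def log_loss(f):
    """Log-loss H(x_obs || beta) = -log f(x_obs | beta), i.e., the total code length in nats."""
    return sum(-math.log(f[x]) for x in obs)

# the MLE plugin: relative frequencies of the three responses
mle = {x: obs.count(x) / n for x in ["L", "M", "R"]}

# any competing pmf yields a larger log-loss, here an arbitrary one for comparison
alternative = {"L": 0.2, "M": 0.5, "R": 0.3}
```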

D.2. Data compression and statistical inference

The entropy inequality H(p∗ (X)) ≤ H(p∗ (X)∥β) implies that the coding theo-
rist’s goal of finding the coding system f (X ∣ β) with the shortest average code
length is in fact equivalent to the statistical goal of finding the true data gen-
erating process p∗ (X). The coding theorist’s best guess is the coding system
f (X ∣ β) that minimizes the log-loss from xnobs to the model MB . Note that
minimizing the negative log-likelihood is the same as maximizing the likelihood.
Hence, the log-loss is minimized by the coding system associated with the MLE,
27
The KL-divergence is also known as the relative entropy.

thus, the predictive pmf f (X ∣ β̂obs ). Furthermore, the cross entropy decomposi-
tion shows that minimization of the log-loss is equivalent to minimization of the
KL-divergence from the observations xnobs to the model MB . The advantage of
having the optimization problem formulated in terms of KL-divergence is that it
has a known lower bound, namely, zero. Moreover, whenever the KL-divergence
from xnobs to the code f (X ∣ β̂obs ) is larger than zero, we then know that the
empirical pmf associated to the observations does not reside on the model. In
particular, Section 4.3.1 showed that the MLE plugin, f (X ∣ β̂obs ) is the pmf
on the model that is closest to the data. This geometric interpretation is due
to the fact that we retrieve the Fisher-Rao metric, when we take the second
derivative of the KL-divergence with respect to β (Kullback and Leibler, 1951).
This connection between the KL-divergence and Fisher information is exploited
in Ghosal, Ghosh and Ramamoorthi (1997) to generalize the Jeffreys’s prior to
nonparametric models, see also van Erven and Harremos (2014) for the rela-
tionship between KL-divergence and the broader class of divergence measures
developed by Rényi (1961), see also Campbell (1965).
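The connection between the KL-divergence and the Fisher information can be checked numerically for a Bernoulli trial: the second derivative of D(pβ∗ (X) ∥ β) with respect to β, evaluated at β = β∗, recovers IX (β∗) = 1/(β∗(1 − β∗)). A minimal sketch (central differences; the step size h is our choice):

```python
import math

def kl_bernoulli(p, q):
    """KL-divergence from Bernoulli(p) to Bernoulli(q); assumes 0 < p, q < 1."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

beta_star, h = 0.3, 1e-4

# second derivative of the KL-divergence at beta_star via central differences
curvature = (kl_bernoulli(beta_star, beta_star + h)
             - 2 * kl_bernoulli(beta_star, beta_star)
             + kl_bernoulli(beta_star, beta_star - h)) / h**2

fisher = 1 / (beta_star * (1 - beta_star))   # I_X(beta_star) for one Bernoulli trial
```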

Appendix E: Regularity conditions

A more mathematically rigorous exposition of the subject would have had this
section as the starting point, rather than the last section of the appendix. The
regularity conditions given below can be seen as a summary, and guidelines for
model builders. If we as scientists construct models such that these conditions
are met, we can then use the results presented in the main text. We first give
a more general notion of statistical models, then state the regularity conditions
followed by a brief discussion on these conditions.
The goal of statistical inference is to find the true probability measure P ∗
that governs the chances with which X takes on its events. A model PΘ defines
a subset of P, the largest collection of all possible probability measures. We
as model builders choose PΘ and perceive each probability measure P within
PΘ as a possible explanation of how the events of X were or will be generated.
When P ∗ ∈ PΘ we have a well-specified model and when P ∗ ∉ PΘ , we say that
the model is misspecified.
By taking PΘ to be equal to the largest possible collection P, we will not be
misspecified. Unfortunately, this choice is not helpful as the complete set is hard
to track and leads to uninterpretable inferences. Instead, we typically construct
the candidate set PΘ using a parameterization that sends a label θ ∈ Θ to a
probability measure Pθ . For instance, we might take the label θ = (σµ2 ) from
the parameter space Θ = R × (0, ∞) and interpret these two numbers as the
population mean and variance of a normal probability Pθ . This distributional
choice is typical in psychology, because it allows for very tractable inference
with parameters that are generally overinterpreted. Unfortunately, the normal
distribution comes with rather stringent assumptions resulting in a high risk of
misspecification. More specifically, the normal distribution is far too ideal, as it
supposes that the population is nicely symmetrically centred at its population
mean and outliers are practically not expected due to its tail behavior.

Statistical modeling is concerned with the intelligent construction of the can-


didate set PΘ such that it encapsulates the true probability measure P ∗ . In
other words, the restriction of P to PΘ in a meaningful manner. Consequently,
the goal of statistical inference is to give an informed guess P̃ within PΘ for
P ∗ based on the data. This guess should give us insights to how the data were
generated and how yet unseen data will be generated. Hence, the goal is not
to find the parameters as they are mere labels. Of course parameters can be
helpful, but they should not be the goal of inference.
Note that our general description of a model as a candidate set PΘ does not
involve any structure –thus, the members of PΘ do not need to be related to
each other in any sense. We use the parameterization to transfer the structure
of our labels Θ to a structure on PΘ . To do so, we require that Θ is a nice
open subset of Rd . Furthermore, we require that each label defines a member
Pθ of PΘ unambiguously. This means that if θ∗ and θ differ from each other
that the resulting pair of probability measure Pθ∗ and Pθ also differ from each
other. Equivalently, we call a parameterization identifiable whenever θ∗ = θ leads
to Pθ∗ = Pθ . Conversely, identifiability implies that when we know everything
about Pθ , we can then also use the inverse of the parameterization to pinpoint
the unique θ that corresponds to Pθ . We write ν ∶ PΘ → Θ for the functional that
attaches to each probability measure P a label θ. For instance, ν could be defined
on the family of normal distributions such that P ↦ ν(P ) = (EP (X), VarP (X)) = (µ, σ²).
In this case we have ν(PΘ ) = Θ and, therefore, a one-to-one correspondence
between the probability measures Pθ ∈ PΘ and the parameters θ ∈ Θ.
By virtue of the parameterization and its inverse ν, we can now transfer
additional structure from Θ to PΘ . We assume that each probability measure
Pθ that is defined on the events of X can be identified with a probability density
function (pdf) pθ (x) that is defined on the outcomes of X. For this assumption,
we require that the set PΘ is dominated by a so-called countably additive
measure λ. When X is continuous, we usually take for λ the Lebesgue measure
that assigns to each interval of the form (a, b) a length of b − a. Domination
allows us to express the probability of X falling in the range (a, b) under Pθ
by the “area under the curve of pθ (x)”, that is, Pθ (X ∈ (a, b)) = ∫_a^b pθ (x) dx. For discrete variables X taking values in X = {x1 , x2 , x3 , . . .}, we take λ to be the counting measure. Consequently, the probability of observing the event X ∈ A where A = {a = x1 , x2 , . . . , b = xk } is calculated by summing the pmf at each outcome, that is, Pθ (X ∈ A) = ∑_{x=a}^{x=b} pθ (x). Thus, we represent PΘ as the set PΘ = {pθ (x) ∶ θ ∈ Θ, Pθ (x) = ∫_{−∞}^{x} pθ (y) dy for all x ∈ X } in function space.

With this representation of PΘ in function space, the parameterization is now


essentially the functional relationship f that pushes each θ in Θ to a pdf pθ (x).
If we choose f to be regular, we can then also transfer additional topological
structure from Θ to PΘ .
Definition E.1 (Regular parametric model). We call the model PΘ a regular
parametric model, if the parameterization θ ↦ pθ (x) = f (x ∣ θ), that is, the
functional relationship f , satisfies the following conditions

imsart-generic ver. 2014/10/16 file: fiArxiv.tex date: October 18, 2017


Ly, et. al./Fisher information tutorial 57

(i) its domain Θ is an open subset of Rd ,



(ii) at each possible true value θ∗ ∈ Θ, the spherical representation θ ↦ mθ (x) = 2√pθ (x) = 2√f (x ∣ θ) is so-called Fréchet differentiable in L2 (λ). The tangent function, i.e., the “derivative” in function space, at mθ∗ (x) is then given by

dmθ∗ (x) = ½ (θ − θ∗ )T l̇(x ∣ θ∗ ) mθ∗ (x),  (E.1)

where l̇(x ∣ θ∗ ) is a d-dimensional vector of score functions in L2 (Pθ∗ ),
(iii) the Fisher information matrix IX (θ) is non-singular,
(iv) the map θ ↦ l̇(x ∣ θ) mθ (x) is continuous from Θ to L2^d (λ).

Note that (ii) allows us to generalize the geometrical concepts discussed in Ap-
pendix C.3 to more general random variables X. ◇
We provide some intuition. Condition (i) implies that Θ inherits the topologi-
cal structure of Rd . In particular, we have an inner product on Rd that allows us
to project vectors onto each other, a norm that allows us to measure the length
of a vector, and the Euclidean metric that allows us to measure the distance
between two vectors by taking the square root of the sum of squares, that is, ∥θ∗ − θ∥2 = √( ∑_{i=1}^{d} (θi∗ − θi )² ). For d = 1 this norm is just the absolute value, which is why we previously denoted this as ∣θ∗ − θ∣.


Condition (ii) implies that the measurement of distances in Rd generalizes to
the measurement of distance in function space L2(λ). Intuitively, we perceive
functions as vectors and say that a function h is a member of L2(λ) if it has a
finite norm (length), i.e., ∥h(x)∥L2(λ) < ∞, meaning

∥h(x)∥L2(λ) = √(∫X [h(x)]² dx)   if X takes on outcomes on R,
∥h(x)∥L2(λ) = √(∑x∈X [h(x)]²)    if X is discrete.   (E.2)

As visualized in the main text, by considering MΘ = {mθ(x) = 2√(pθ(x)) ∣ pθ ∈ PΘ}
we relate Θ to a subset of the sphere with radius two in the function space L2(λ).
In particular, Section 4 showed that whenever the parameter space is one-dimensional,
thus a line, the resulting collection MΘ also defines a line in model space.
Similarly, Appendix C.3 showed that whenever the parameter space is a subset
of [0, 1] × [0, 1], the resulting MΘ forms a plane.
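The radius-two claim is easy to check numerically. The sketch below (our own illustration, not from the paper; the function names are assumptions) approximates the L2(λ) norm of mθ(x) = 2√(pθ(x)) for a normal density and recovers a norm of two, since ∫ mθ(x)² dx = 4 ∫ pθ(x) dx = 4.

```python
import numpy as np

# Sketch (hypothetical, not from the paper): every spherical representation
# m_theta(x) = 2 * sqrt(p_theta(x)) lies on the sphere of radius two in
# L2(lambda), because the integral of m_theta(x)^2 equals 4 * integral p = 4.

def l2_norm(h, grid):
    """Riemann-sum approximation of the continuous L2(lambda) norm of Eq. (E.2)."""
    dx = grid[1] - grid[0]
    return np.sqrt(np.sum(h(grid) ** 2) * dx)

# Example: standard normal pdf, theta = (mu, sigma) = (0, 1).
p = lambda x: np.exp(-x ** 2 / 2.0) / np.sqrt(2.0 * np.pi)
m = lambda x: 2.0 * np.sqrt(p(x))

grid = np.linspace(-12.0, 12.0, 100_001)
print(l2_norm(m, grid))  # close to 2.0
```

The same computation for any other pdf on this grid returns the same norm, which is exactly why the model MΘ lives on a sphere.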
Fréchet differentiability at θ∗ is formalized as

∥mθ(x) − mθ∗(x) − ½ (θ − θ∗)ᵀ l̇(x ∣ θ∗) mθ∗(x)∥L2(λ) / ∥θ − θ∗∥₂ → 0, as θ → θ∗.   (E.3)
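For a concrete one-dimensional case, the limit in Eq. (E.3) can be checked directly for a Bernoulli model, where the L2(λ) norm reduces to a sum over the two outcomes. The sketch below (our own illustration, not from the paper) shows that the remainder norm divided by ∣θ − θ∗∣ shrinks as θ approaches θ∗.

```python
import numpy as np

# Sketch (our illustration, not from the paper): numerical check of Frechet
# differentiability, Eq. (E.3), for the Bernoulli model with p_theta(1) = theta.
theta_star = 0.3
m = lambda th: 2.0 * np.sqrt(np.array([1.0 - th, th]))          # m_theta at x = 0, 1
score = np.array([-1.0 / (1.0 - theta_star), 1.0 / theta_star])  # ldot(x | theta*)

ratios = []
for h in [0.1, 0.01, 0.001]:
    # Remainder of the linearization in Eq. (E.3), evaluated at x = 0 and x = 1.
    remainder = m(theta_star + h) - m(theta_star) - 0.5 * h * score * m(theta_star)
    ratios.append(np.sqrt(np.sum(remainder ** 2)) / abs(h))

print(ratios)  # decreases roughly linearly in h, consistent with Eq. (E.3)
```

The ratio behaves like a constant times h, because the remainder of the first-order Taylor expansion of 2√θ is of order h².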

This implies that the linearization term ½ (θ − θ∗)ᵀ l̇(x ∣ θ∗) mθ∗(x) is a good
approximation to the “error” mθ(x) − mθ∗(x) in the model MΘ whenever θ
is close to θ∗, given that the score functions l̇(x ∣ θ∗) do not blow up. More
specifically, this means that each component of l̇(x ∣ θ∗) has a finite norm. We
say that the component ∂/∂θi l(x ∣ θ∗) is in L2(Pθ∗), if ∥∂/∂θi l(x ∣ θ∗)∥L2(Pθ∗) < ∞,
meaning

∥∂/∂θi l(x ∣ θ∗)∥L2(Pθ∗) = √(∫x∈X (∂/∂θi l(x ∣ θ∗))² pθ∗(x) dx)   if X is continuous,
∥∂/∂θi l(x ∣ θ∗)∥L2(Pθ∗) = √(∑x∈X (∂/∂θi l(x ∣ θ∗))² pθ∗(x))      if X is discrete.   (E.4)
This condition is visualized in Fig. 12 and Fig. 13 by tangent vectors with finite
lengths. Under Pθ∗, each component i = 1, . . . , d of the tangent vector is expected
to be zero, that is,

∫x∈X ∂/∂θi l(x ∣ θ∗) pθ∗(x) dx = 0   if X is continuous,
∑x∈X ∂/∂θi l(x ∣ θ∗) pθ∗(x) = 0      if X is discrete.   (E.5)

This condition follows from the chain rule applied to the logarithm and an
exchange of the order of integration with respect to x and differentiation with
respect to θi, as

∫x∈X ∂/∂θi l(x ∣ θ∗) pθ∗(x) dx = ∫x∈X ∂/∂θi pθ∗(x) dx = ∂/∂θi ∫x∈X pθ∗(x) dx = ∂/∂θi 1 = 0.   (E.6)

Note that if ∫ ∂/∂θi pθ∗(x) dx > 0, then a small change at θ∗ will lead to a function
pθ∗+dθ(x) that does not integrate to one and is, therefore, not a pdf.
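The zero-mean score property of Eq. (E.5) and Eq. (E.6) can be verified by hand for the Bernoulli model; the sketch below (our own, not from the paper) carries out the discrete sum explicitly.

```python
# Sketch (our own, not from the paper): the zero-mean score property of
# Eq. (E.5) checked for a Bernoulli model with p_theta(1) = theta.
theta = 0.3
score = lambda x: x / theta - (1 - x) / (1 - theta)  # d/dtheta log p_theta(x)

# Sum the score against the pmf over the two outcomes x in {0, 1}:
expected_score = (1 - theta) * score(0) + theta * score(1)
print(expected_score)  # 0.0 up to floating-point error
```

The two terms are −1 and +1 exactly, mirroring the cancellation in Eq. (E.6).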

Condition (iii) implies that the model does not collapse to a lower dimension.
For instance, when the parameter space is a plane, the resulting model MΘ
cannot be a line. Lastly, condition (iv) implies that the tangent functions change
smoothly as we move from mθ∗(x) to mθ(x) on the sphere in L2(λ), where θ is
a parameter value in the neighborhood of θ∗.
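To make condition (iii) concrete, consider the normal model with θ = (μ, σ). The sketch below (our own illustration, not from the paper) approximates IX(θ) as the expectation of the outer product of the score vector on a grid and confirms that its determinant is positive, hence the matrix is non-singular; analytically IX(μ, σ) = diag(1/σ², 2/σ²).

```python
import numpy as np

# Sketch (our own illustration): the 2x2 Fisher information matrix of a
# normal model with theta = (mu, sigma), approximated as E[score score^T].
mu, sigma = 1.0, 2.0
x = np.linspace(mu - 12 * sigma, mu + 12 * sigma, 200_001)
dx = x[1] - x[0]
p = np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

# Score components: d/dmu log p and d/dsigma log p.
s_mu = (x - mu) / sigma ** 2
s_sigma = ((x - mu) ** 2 - sigma ** 2) / sigma ** 3

scores = np.vstack([s_mu, s_sigma])
info = (scores * p) @ scores.T * dx   # Riemann sum approximating E[ldot ldot^T]
print(info)                           # close to diag(1/sigma^2, 2/sigma^2)
print(np.linalg.det(info) > 0)        # non-singular
```

Had the two score components been linearly dependent, the determinant would vanish and the model would effectively be one-dimensional.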
The following conditions are stronger, thus less general, but they avoid Fréchet
differentiability and are typically easier to check.
Lemma E.1. Let Θ ⊂ Rd be open. At each possible true value θ∗ ∈ Θ, we
assume that pθ(x) is continuously differentiable in θ for λ-almost all x with
tangent vector ṗθ∗(x). We define the score function at x as

l̇(x ∣ θ∗) = (ṗθ∗(x) / pθ∗(x)) 1[pθ∗>0](x),   (E.7)

where 1[pθ∗>0](x) is the indicator function

1[pθ∗>0](x) = 1   for all x such that pθ∗(x) > 0,
1[pθ∗>0](x) = 0   otherwise.   (E.8)

The parameterization θ ↦ Pθ is regular if the norm of the score vector Eq. (E.7)
is finite in quadratic mean, that is, l̇(X ∣ θ∗) ∈ L2(Pθ∗), and if the corresponding
Fisher information matrix based on the score functions Eq. (E.7) is non-singular
and continuous in θ. ◇
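As an illustration (ours, not from the paper), the Bernoulli model satisfies the conditions of Lemma E.1 on the open set Θ = (0, 1): the squared norm of the score equals the Fisher information IX(θ) = 1/(θ(1 − θ)), which is finite, positive, and continuous in θ, while it blows up at the boundary.

```python
# Sketch (ours, not from the paper): checking the norm condition of
# Lemma E.1 for the Bernoulli model on the open parameter set (0, 1).
def fisher_info(theta):
    """E[score^2] for Bernoulli, summed over the two outcomes x in {0, 1}."""
    score = lambda x: x / theta - (1 - x) / (1 - theta)
    return (1 - theta) * score(0) ** 2 + theta * score(1) ** 2

for theta in [0.1, 0.5, 0.9]:
    # Matches the closed form 1 / (theta * (1 - theta)); finite and positive,
    # so ldot(X | theta) is in L2(P_theta) and the information is non-singular.
    print(theta, fisher_info(theta), 1 / (theta * (1 - theta)))
```

The openness of Θ matters: at θ = 0 or θ = 1 the score norm is infinite, which is exactly the kind of irregularity the lemma rules out.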


There are many better sources than the current manuscript on this topic
that are mathematically much more rigorous and better written. For instance,
Bickel et al. (1993) give a proof of the lemma above and many more beautiful,
but sometimes rather (agonizingly) technically challenging, results. For a more
accessible, but no less elegant, exposition of the theory we highly recommend
van der Vaart (1998).

