
Random Variables aren’t Random

Paul W. Vos
February 9, 2025

Abstract
This paper examines the foundational concept of random variables
in probability theory and statistical inference, demonstrating that their
mathematical definition requires no reference to randomization or hypo-
thetical repeated sampling. We show how measure-theoretic probabil-
ity provides a framework for modeling populations through distributions,
leading to three key contributions. First, we establish that random vari-
ables, properly understood as measurable functions, can be fully char-
acterized without appealing to infinite hypothetical samples. Second, we
demonstrate how this perspective enables statistical inference through log-
ical rather than probabilistic reasoning, extending the reductio ad absur-
dum argument from deductive to inductive inference. Third, we show how
this framework naturally leads to information-based assessment of statisti-
cal procedures, replacing traditional inference metrics that emphasize bias
and variance with information-based approaches that better describe the
families of distributions used in parametric inference. This reformulation
addresses long-standing debates in statistical inference while providing
a more coherent theoretical foundation. Our approach offers an alter-
native to traditional frequentist inference that maintains mathematical
rigor while avoiding the philosophical complications inherent in repeated
sampling interpretations.

1 Introduction
Statistical inference aims to draw conclusions about populations from observed
data, yet its foundational concepts are often obscured by an unnecessary focus
on randomization and hypothetical repeated sampling. This focus has led to
persistent confusion and controversy in the interpretation of basic statistical
concepts, even among experienced researchers. A striking example appears in
discussions of p-values, where emphasis on random phenomena and hypothetical
datasets clouds the simpler mathematical reality: a p-value is fundamentally a
measure of location in a sampling distribution. The confusion surrounding p-
values illustrates a broader issue in statistical inference – the tendency to explain
mathematical concepts through mental constructs involving infinite hypotheti-
cal samples rather than recognizing them as well-defined mathematical objects.

The confusion arising from emphasis on randomization is illustrated by Gelman & Loken (2014) who begin their article stating that “Researchers typically express the confidence in their data in terms of p-value: the probability that a perceived result is actually the result of random variation.” This characterization obscures the mathematical nature of p-values by casting them as descriptions of random phenomena, which is but one potential use, rather than what they are: measures of where the observed sample lies in a specifically defined sampling distribution.
The authors further assert that “p-values are based on what would have hap-
pened under other possible data sets.” This interpretation moves the discussion
from mathematics to what Penrose (2007) calls the mental world — a realm
of imagination and hypothetical scenarios. Such a transition inevitably leads
to confusion. A p-value is purely mathematical: it represents the percentile of
the observed sample in the distribution of all possible samples under the null
hypothesis model. This distribution and the resulting p-value are part of math-
ematics. While proper physical randomization of the observed sample is crucial
for connecting the p-value to the actual population, this requires only the single
randomization that produced our data.
While Gelman & Loken (2014) make valid points about potential misuse
of statistical methods, their conclusion that “the justification for p-values lies
in what would have happened across multiple data sets” reveals how deeply
embedded the repeated sampling perspective has become. The problems they
identify stem not from p-values themselves, which are well-defined mathematical
quantities, but from interpreting these mathematical objects through the lens of
hypothetical repeated sampling. When we maintain focus on the mathematical
structure – the distribution specified by the null hypothesis and the location of
our observed sample within its sampling distribution – the meaning of p-values
becomes clearer and their proper use more evident.
The remainder of this paper develops these ideas systematically. Section 2
introduces distributions on finite sets, called simple distributions, and shows how
these extend naturally to both countable and uncountable sets without requir-
ing the concept of randomness. Section 3 employs Penrose’s distinction between
mathematical, physical, and mental worlds to clarify how single-instance phys-
ical randomization validates our use of mathematical sampling distributions.
Section 4 shows how measure theory provides precise tools for describing data
through percentiles and tail areas, laying groundwork for the logical approach
to inference developed in Section 5. There, we show how the reductio ad absur-
dum argument from deductive logic extends naturally to statistical inference,
establishing what we call the Fisher-Information-Logic (FIL) approach. Section
6 examines traditional approaches to inference based on bias and variance, revealing how their limitations stem from focusing on additional structures of the support set rather than on the probabilities assigned to the support. Section 7 describes the
information-based framework that better aligns with the mathematical nature
of distributions while providing practical tools for inference.

2 Mathematical Framework of Distributions
2.1 Simple Distributions
Simple distributions provide the mathematical foundation for describing finite
populations, forming the basis for more abstract probability concepts. We begin
with this concrete case because it captures the essential features of distributions
while maintaining clear connections to observable populations.
Let f_1, f_2, . . . , f_K be positive integers that sum to N, representing frequencies in a population of size N. For example, these might represent counts of different blood types in a hospital’s patient database or the number of students receiving each possible grade in a class. Let X_K be a set with K distinct elements. For example, X_K might consist of labels for the different blood types or letter grades. We define an (N, K)-simple frequency distribution as

{(x_1, f_1), (x_2, f_2), . . . , (x_K, f_K)}

where each x_k ∈ X_K and each f_k represents its frequency in the population.


A simple distribution normalizes these frequencies by the population size N. Formally, it is a function m_(N,K) : X_K → (0, 1] satisfying:

• Σ_{x ∈ X_K} m_(N,K)(x) = 1
• N · m_(N,K)(x) ∈ Z⁺ for all x ∈ X_K

The distribution is degenerate if m_(N,K)(x) = 1, in which case K = 1.
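To make the definition concrete, here is a minimal Python sketch (the blood-type counts are hypothetical) that builds a simple distribution from a frequency distribution and checks the two defining conditions with exact arithmetic:

```python
from fractions import Fraction

# Hypothetical blood-type frequencies in a population of size N = 200.
frequencies = {"O": 90, "A": 70, "B": 30, "AB": 10}

N = sum(frequencies.values())                              # population size
m = {x: Fraction(f, N) for x, f in frequencies.items()}    # simple distribution m_(N,K)

# The two defining conditions: proportions sum to 1 and N * m(x) is a positive integer.
assert sum(m.values()) == 1
assert all((N * p).denominator == 1 and N * p > 0 for p in m.values())

print(m)   # proportions: O 9/20, A 7/20, B 3/20, AB 1/20
```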
The support XK is an abstract set. To emphasize that there may be addi-
tional structure on this set, XK is called the label space and the structure will
range from no structure to the algebraic properties of the reals when XK ⊂ R.
When the labels have meaningful numeric values, the simple distribution will
be approximated by a continuous distribution described in Section 2.3.
Each simple distribution corresponds to a unique multi-set (or bag) ⌊m_(N,K)⌋ containing N elements, where each value x_k appears exactly f_k times. Formally, this multi-set can be represented as a set of N ordered pairs where the first component takes values in X_K and the second component ranges over {1, . . . , N} so that each ordered pair is unique. For brevity, we write ⌊m⌋ when N and K are clear from context.
The multi-set shows the connection with the more common notation X for a distribution. For simple distributions, X = Π_1 ⌊m⌋ where Π_1 is the projection onto the first component. The proportion corresponding to the value x is obtained from the counting measure of its pre-image

m(x) = |X⁻¹(x)|/N.

To avoid thinking about X as describing a random process it is helpful to have a visualization. The graphical representation of a simple distribution depends on the structure of the label space X_K:

• For ordered XK (e.g., course grades A, B, C, D, F): Plot points on the
horizontal axis with uniform spacing, constructing rectangles centered at
each x with area m(N,K) (x).
• For unordered XK (e.g., blood types): The visualization uses similar rect-
angles, but their horizontal arrangement carries no meaningful informa-
tion.
• For XK ⊂ R (e.g., height measurements): The horizontal spacing reflects
the numerical values in XK , with rectangle heights chosen to achieve areas
of m(N,K) (x).
The label space XK and corresponding proportions m(x) shown in these visu-
alizations are the defining features of general distributions which we consider
next.

2.2 Discrete Distributions


Simple distributions model finite populations where N , while typically large,
remains unspecified. Discrete distributions generalize this concept by removing
the dependence on N .
Formally, a discrete distribution on X_K is a function m_K : X_K → (0, 1] satisfying Σ_{x ∈ X_K} m_K(x) = 1. The space of discrete distributions on X_K encompasses all (N, K)-simple distributions for N ≥ K. Each non-degenerate discrete distribution corresponds to a point in the open simplex Δ^(K−1) ⊂ R^K.
When XK lacks structure beyond that of an ordering, there is a natural
bijection between the space of discrete distributions on XK and the simplex
∆(K−1) . This geometric perspective reveals that families of distributions on XK
typically form smooth submanifolds in ∆(K−1) , a structure not available when
restricted to simple distributions.
The notation XK for the distribution mK emphasizes the label space. For-
mally, Pr(XK = x) = mK (x) where Pr indicates measure-theoretic probability;
that is, a generalization of a proportion rather than a number that describes a
random phenomenon.
The next generalization is to allow K to be arbitrarily large so that the support, X_ℕ, is countably infinite; that is, |X_ℕ| = |ℕ| where ℕ = {1, 2, 3, . . .}.

2.3 Continuous Distributions


Continuous distributions arise naturally when modeling measured values with
specified precision or grouped data where K is large but imprecise. These
distributions often serve as approximations to simple distributions, particularly
when dealing with physical measurements.
For an open interval X ⊂ R, a random variable X with probability density
function (pdf) m is visualized as a curve over X with total area 1. Areas
under portions of this curve approximate the rectangular areas of corresponding simple distributions. This approximation becomes increasingly accurate as the
measurement precision increases and the population size grows.
As with simple and discrete distributions, the notation X for m emphasizes the label space, which in this case has the advantage of representing the algebraic structure inherited from R. Formally, Pr(X ∈ A) = ∫_A m(x) dx for measurable sets A. The support X for continuous distributions is uncountable; countability is achieved by requiring a σ-algebra on X. It is common practice to use X to denote both continuous and discrete distributions. We will do the same when the context indicates the cardinality of the support.
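As an illustrative sketch of this approximation (using a synthetic population of rounded height measurements, not data from the paper), the area under a fitted normal density over an interval of width one approximates the rectangle area m(x) of the corresponding simple distribution:

```python
import math
import random

random.seed(0)

# Synthetic finite population of N heights (cm), rounded to the nearest cm;
# the rounded values define a simple distribution on a subset of the integers.
N = 10_000
population = [round(random.gauss(170, 8)) for _ in range(N)]

freq = {}
for h in population:
    freq[h] = freq.get(h, 0) + 1
m_simple = {h: f / N for h, f in freq.items()}

# Approximate the simple distribution by a normal density matched to the
# population mean and standard deviation.
mu = sum(population) / N
sigma = math.sqrt(sum((h - mu) ** 2 for h in population) / N)

def normal_cdf(x):
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

# Area under the density over [h - 0.5, h + 0.5] versus the rectangle area m(h).
for h in (160, 170, 180):
    area = normal_cdf(h + 0.5) - normal_cdf(h - 0.5)
    print(h, round(m_simple.get(h, 0), 4), round(area, 4))
```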

3 Randomization and the Three Worlds


Penrose’s distinction between the physical, mental, and mathematical worlds
provides a framework for understanding the role of randomization in statistical
inference. Drawing from his The Road to Reality (Penrose, 2007), these worlds
can be characterized as follows:
• The physical world contains observable phenomena and tangible reality,
including the actual process of random selection through mechanisms like
shuffling cards or rolling dice
• The mental world encompasses human consciousness, understanding, and
imagination, including our intuitive conceptualization of probability and
randomness
• The mathematical world consists of absolute, objective truths that exist
independently of human thought or physical reality, including the formal
structures of measure theory and probability measures

Importantly, randomization exists only in the physical world as an actual process. While the mental world contains our understanding and intuition about
randomness, these are distinct from physical randomization itself. Distributions
are part of mathematics that can serve as models for random phenomena in the
physical world but they are also models for real-world populations.
There is a direct connection between the sample y of size n in the physical world and a simple distribution in the mathematical world. Let X_(N,K) be the distribution that provides an exact model for the real-world population and ⌊X⌋ be its bag of N elements. Let ⌊X⌋^(n) be the bag of all subsets of ⌊X⌋ of size n. The ordered pairs in ⌊X⌋^(n) consist of n-tuples where the second component ranges over all N!/(N − n)! n-tuples from {1, 2, . . . , N}. The distribution Y_(N′,K′), in particular the values for N′ and K′, for the bag ⌊X⌋^(n) is obtained using combinatorics.
If y was obtained using a simple random sample, that is, if it was chosen in such a way that every sample of size n had an equal chance of being selected, then Y_(N′,K′) is the exact distribution for y. For sampling plans other than simple random samples, ⌊X⌋^(n) is replaced with a different subset of the powerset of ⌊X⌋. The proportions obtained by combinatorics are what connect the simple
distributions to the physical and mental worlds. This connection is fundamental
to understanding statistical inference in terms of distributions of data rather
than models of random phenomena, i.e., random variables.
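A small sketch can make this concrete. Assuming Python and a hypothetical bag of N = 6 labeled elements, enumerating all size-n subsets gives the exact sampling distribution of a statistic (here the sample total) by combinatorics alone, with no repeated sampling:

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations

# Hypothetical bag of N = 6 labeled elements; indices keep duplicate values distinct.
bag = [0, 0, 1, 1, 1, 2]
n = 3

# Under simple random sampling every size-n subset is equally likely, so the
# exact distribution of the sample total follows by enumeration.
subsets = list(combinations(range(len(bag)), n))          # C(6, 3) = 20 subsets
counts = Counter(sum(bag[i] for i in idx) for idx in subsets)

Y = {t: Fraction(c, len(subsets)) for t, c in sorted(counts.items())}
print(Y)   # exact sampling distribution of the total
```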
Measure theory extends the idea of a proportion of a finite set to a probability
that describes an infinite set. This extension does not need to occur in the mental or physical world, and attempting to do so using hypothetical repeated sampling from a real-world population is ripe for confusion.

4 Probability to Describe Observed Data


4.1 Proportions and Percentiles
For a simple distribution X(N,K) , proportions arise naturally from relative fre-
quencies. The proportion of any value x is defined as f (x)/N , where f (x)
represents its frequency. These proportions form the basis for understanding
more abstract probability concepts.
When the support set X carries an ordering, we can define the percentile of a value x as 100 · Σ_{x′ ≤ x} f(x′)/N. This percentile characterizes the location of x within the distribution X. An equivalent and often more useful characterization comes through tail areas:

• Left tail area: TA_L(x) = Σ_{x′ ≤ x} f(x′)/N
• Right tail area: TA_R(x) = Σ_{x ≤ x′} f(x′)/N
• Tail area: TA(x) = 2 min {TA_L(x), TA_R(x)}

For continuous distributions the sums are replaced with integrals.
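A short Python sketch of these definitions (the grade frequencies are hypothetical, with the ordering F < D < C < B < A encoded as 0, . . . , 4):

```python
from fractions import Fraction

def tail_areas(freq, x):
    """Left, right, and two-sided tail areas of x for a simple distribution
    given as a map from ordered labels to frequencies (exact Fractions)."""
    N = sum(freq.values())
    left = Fraction(sum(f for x0, f in freq.items() if x0 <= x), N)
    right = Fraction(sum(f for x0, f in freq.items() if x0 >= x), N)
    return left, right, 2 * min(left, right)

# Hypothetical grade frequencies, ordering F < D < C < B < A encoded as 0..4.
grades = {0: 3, 1: 7, 2: 30, 3: 40, 4: 20}
print(tail_areas(grades, 0))   # TA_L = 3/100, TA_R = 1, TA = 3/50
```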
The concept of extreme values in a distribution warrants careful considera-
tion. Values with small tail areas are termed extreme, and while these might be
called unlikely or surprising, such characterizations describe the value’s location
within the distribution rather than any inherent probability of the specific ob-
servation. A better synonym for ’extreme’ is ’rare’, indicating a property shared by relatively few members of a population. Consider human height: meeting
someone exceptionally tall represents an extreme or rare event because of where
that height sits in the population distribution, not because of any inherent im-
probability.
This distinction becomes particularly clear in games of chance, such as 5-card poker. In a well-shuffled deck, each specific 5-card hand occurs with equal probability 1/C(52, 5). However, the game of poker requires an imposed ordering on
equivalence classes of hands. Consider two hands: H1 = {2♥, 3♥, 4♥, 5♥, 7♣}
and H2 = {2♠, 3♠, 4♠, 5♠, 6♠}. While these hands have identical probabilities
of being dealt, H2 lies further in the tail of the distribution of hand rankings,
as significantly fewer hands beat it. A hand this far into the ordering is a rare
hand.

4.2 Fisher’s Infinite Population
R.A. Fisher did not adhere to the view of probability as describing hypothetical repeated samples; instead, as Spiegelhalter (2024) notes, “Fisher suggested thinking of a unique data set as a sample from a hypothetical infinite population, but this seems to be more of a thought experiment than an objective reality.”
However, Penrose’s distinction between mathematical and mental worlds pro-
vides an objective reality to Fisher’s framework.
To understand Fisher’s approach, we begin with a simple distribution X(N,K)
that exactly models a finite population. The observed data y represents a point
in Y(N,K) , the sampling distribution of X(N,K) . When we approximate X(N,K)
with a continuous distribution X, the corresponding sampling distribution Y
approximates Y(N,K) . Both X and Y will have infinite support.
The infinite sampling distribution exists in the mathematical world rather
than the mental world. It is not a hypothetical construct requiring repeated
sampling or imagination, but rather a mathematical object with the same ob-
jective reality as any other mathematical structure. Fisher’s infinite population,
therefore, describes a precise mathematical framework for understanding sam-
pling distributions, not a mental exercise in hypothetical repetition.

5 Logic of Statistical Inference


5.1 Deduction and Induction
We use the following example to illustrate the distinction between deductive
and inductive reasoning in statistical inference.
Example 1. Two Lotteries
Consider a scenario involving an extraterrestrial civilization comprised of
two distinct nations, both of which have discovered Earth. While these nations
coexist peacefully, they hold divergent views regarding humanity, with Nation
A representing an existential threat to Earth.
Earth’s intelligence services have intercepted communications revealing that
both nations operate similar lottery systems. Each nation conducts a pick-4
lottery where four numbers are drawn from separate but identical bins, differing
only in the numerical range of their balls. Nation A’s lottery uses balls numbered
0 through 7, while Nation B employs balls numbered 0 through 9. Intelligence
reports indicate that a spacecraft from one of these nations is approaching Earth,
commanded by a captain known for wagering on the sum of their national lottery
numbers. The next communication is expected to contain this sum.
Earth’s scientific community has determined that if the received sum exceeds
28, this conclusively identifies Nation B as the approaching civilization. This
conclusion follows from a reductio ad absurdum argument: assume the sum
originates from Nation A’s lottery, where each draw must be between 0 and 7.
The maximum possible sum would be 28 (achieved when drawing 7 four times).

Therefore, any sum exceeding 28 contradicts the assumption that it came from
Nation A, leaving Nation B as the only possible source.
This argument can be expressed in terms of distributions as follows. Let S_k = {k′ ∈ Z : 0 ≤ k′ ≤ k} be the set of non-negative integers up through k. Let S be the distribution for the observed sum s_obs. The premise, known as the null hypothesis in statistical terminology, is H◦ : S = S_A where S_A is the distribution of sums from lottery A, whose support is S_28. The steps of the argument for a sum of 29 are as follows:

1. Begin with H◦
2. H◦ =⇒ s_obs ∈ S_28; s_obs = 29
3. s_obs ∉ S_28. A contradiction. ∴ ¬H◦

This argument provides proof that H◦ is false; i.e., S ≠ S_A. Clearly, this argument goes through whenever s_obs ∉ S_28.
When the observed sum is 28 or less, reasoning shifts from deduction to in-
duction, revealing fundamental divisions within the statistical community. The
repeated-sampling frequentists either conclude that no inference is possible or
resort to multiverse arguments to maintain their philosophical framework. The
Bayesian school insists that prior probabilities must be assigned to the two
possibilities, yet this leads to considerable debate regarding both the numeri-
cal values of these priors and their philosophical interpretation. Good (1971),
himself a Bayesian, wryly observes that there are at least “46,656 varieties of
Bayesians”.

5.2 Induction and Logic


A third approach emerges from those who recognize that the deductive argu-
ment exists purely within mathematics and can be extended using mathematical
principles, providing an objective framework for analysis. This mathematical
extension is motivated by the work of Fisher (one reference being pages 42-44
of Fisher (1959)).
The deductive argument only required S_28, the support of the distribution specified by H◦. The inductive argument requires S_A, the distribution specified by H◦, along with another distribution that describes lottery B conditioned on the event s_obs ∈ S_28. This conditioning reflects the fact that we use a deductive argument when s_obs ∉ S_28. For simplicity of notation we use S_B for this conditional distribution.
This extension introduces an auxiliary hypothesis Haux : “the observed sum is
not exceptionally rare.” This auxiliary hypothesis formalizes a principle common
to human experience - we generally assume that exceptionally rare events do
not occur, as evidenced by our willingness to engage in activities like air travel
despite infinitesimal but non-zero risks.
While the concept of “rare” might initially seem subjective, it can be quanti-
fied objectively using tail areas of a distribution. The tail area - specifically, the proportions associated with extreme values - provides a mathematical frame-
work for defining rarity. For instance, if we designate a sum of 28 as rare but 27
as not rare for the distribution SA , this corresponds to labeling the uppermost
0.0002 of possible sums for lottery A as rare. More formally, we can express
Haux as the statement that sobs falls below the 99.98th percentile using the
natural ordering provided by the numerical values of the sum.
This formalization allows us to construct a reductio ad absurdum argument
applicable for induction:
1. Begin with the conjunction H◦ ∧ H_aux
2. This conjunction implies s_obs ∈ S_27; s_obs = 28
3. s_obs ∉ S_27. A contradiction. ∴ ¬(H◦ ∧ H_aux) ≡ ¬H◦ ∨ ¬H_aux

This argument structure is particularly noteworthy because both hypotheses play essential roles: H_aux establishes both the ordering and the numerical
threshold for extreme values, while H◦ determines the proportions assigned to
these values. The conclusion takes the form of a logical disjunction that Fisher
(1960) described as:
Either the hypothesis is not true, or an exceptionally rare outcome
has occurred.

The definition of rare events through tail areas can be calibrated to different
levels of evidence. For example, if we consider a sum of 24 as extreme but 23
as not, this corresponds to a tail area of 0.0171 for lottery A.
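These tail areas can be checked directly. A minimal Python sketch, enumerating all equally likely four-ball draws for each lottery, reproduces the values cited above:

```python
from collections import Counter
from fractions import Fraction
from itertools import product

def sum_distribution(top):
    """Exact distribution of the sum of four draws, each uniform on 0..top."""
    counts = Counter(sum(draw) for draw in product(range(top + 1), repeat=4))
    return {s: Fraction(c, (top + 1) ** 4) for s, c in counts.items()}

S_A = sum_distribution(7)   # Nation A: balls numbered 0-7
S_B = sum_distribution(9)   # Nation B: balls numbered 0-9

def right_tail(dist, s):
    """Proportion of sums greater than or equal to s."""
    return sum(p for s0, p in dist.items() if s0 >= s)

print(float(right_tail(S_A, 28)))   # 1/4096 ~ 0.00024: the 'uppermost 0.0002'
print(float(right_tail(S_A, 24)))   # 70/4096 ~ 0.0171
print(right_tail(S_A, 29))          # 0: sums above 28 are impossible under H0
```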
This framework provides a bridge between deductive and inductive reason-
ing. Rather than requiring the conceptual machinery of hypothetical repeated
sampling or subjective prior probabilities, it extends classical deductive logic
to accommodate uncertainty in a mathematically rigorous way. The approach
maintains objectivity while acknowledging the inherent limitations in our ability
to draw absolute conclusions from empirical data.

5.3 Test Statistics


The transition from deductive to inductive reasoning in statistical inference
highlights two fundamental requirements. First, we must assume that the distri-
bution of observed data matches that of the underlying population - an assump-
tion validated through proper randomization. Second, we require a method for
ordering possible outcomes to evaluate their extremity. This ordering represents
an important choice in statistical analysis.
The mathematical tool for imposing this order is the test statistic. Formally,
a test statistic for hypothesis H◦ is a real-valued function T defined on the
sample space. For each value t in the image of T, the pre-image T⁻¹(t) forms
a subset of the sample space. As t ranges over all possible values, these pre-
images partition the sample space into subsets ordered by t. When working with continuous distributions, we require T to be measurable to ensure compatibility
with the probability structure.
There are many choices for the test statistic and different statistics provide
different orderings of the sample space. In our Two Lotteries example, we seek
a test statistic that effectively distinguishes between distributions SA and SB .
The Neyman-Pearson lemma provides theoretical guidance for this choice,
showing that the likelihood ratio test achieves optimal power when comparing
two simple hypotheses, as in our lottery example. Interestingly, for sums in
{8, 9, . . . , 27, 28}, the ordering induced by the likelihood ratio matches that of
the numerical values themselves. However, the likelihood ratio remains con-
stant across {0, 1, . . . , 6, 7}, revealing a subtle distinction between natural and
likelihood ratio ordering of outcomes.
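This behavior of the likelihood ratio can be verified numerically. The sketch below recomputes the exact sum distributions and conditions lottery B on sums of at most 28, as in Section 5.2, then checks that the ratio is constant on {0, . . . , 7} and increasing on {8, . . . , 28}:

```python
from collections import Counter
from fractions import Fraction
from itertools import product

def sum_distribution(top):
    """Exact distribution of the sum of four draws, each uniform on 0..top."""
    counts = Counter(sum(d) for d in product(range(top + 1), repeat=4))
    return {s: Fraction(c, (top + 1) ** 4) for s, c in counts.items()}

S_A = sum_distribution(7)
S_B = sum_distribution(9)

# Condition lottery B on the event {sum <= 28}, as in Section 5.2.
q = sum(p for s, p in S_B.items() if s <= 28)
S_B_cond = {s: p / q for s, p in S_B.items() if s <= 28}

lr = {s: S_B_cond[s] / S_A[s] for s in S_A}   # likelihood ratio on the support of S_A

print(len({lr[s] for s in range(8)}))               # 1: constant on 0..7
print(all(lr[s] < lr[s + 1] for s in range(8, 28))) # True: increasing on 8..28
```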
Most statistical applications differ from our example in two important ways.
First, the exact values of N and K characterizing the simple distribution are
typically unknown, necessitating the use of approximating distributions X. For
discrete X, the sample space X becomes a countable set with proportions re-
placed by real numbers in the unit interval. For continuous X, X becomes a
measurable space with proportions replaced by measure-theoretic probability of
measurable sets.
Second, rather than choosing between two specific distributions, we consider
a smooth continuum of possible models. Nevertheless, the modified reductio ad
absurdum argument extends naturally to this setting after appropriate adjust-
ments for measurable sets. The likelihood continues to play a central role, with
the smooth structure of the model space allowing us to analyze how likelihood
functions vary across models - essentially a local version of the likelihood ratio.
The information content of a test statistic provides a mathematical frame-
work for describing its induced ordering of the sample space. This connection
between test statistics and information theory is described in Section 7.

6 Inference and Hypothetical Repeated Sampling


The traditional frequentist approach to statistical inference is motivated by a
conceptual framework of hypothetical repeated sampling. This section presents
this framework and shows how it naturally leads to concepts like bias and vari-
ance for evaluating statistical procedures. We purposefully maintain the re-
peated sampling perspective and ’random variable’ notation X here to demon-
strate how it shapes statistical thinking, before introducing an alternative ap-
proach in Section 7.
The population we study is finite and can be exactly described by a simple
distribution, which we denote by mpop or by Xpop to emphasize the label space.
To develop statistical procedures, we introduce a family of distributions M that typically does not contain simple distributions, but does contain a distribution X⋆ that best approximates Xpop, and we assume it is a suitable approximation so that X⋆ is considered the distribution from which the sample was taken.
A parameterization is a function θ : M → Θ ⊂ R^d and the goal of inference is to use the sample y = (x_1, x_2, . . . , x_n) to obtain a value for the parameter, called an estimate, that will be close to θ⋆ = θ(X⋆). The value of the estimate will depend on the sample and a desirable property is that, were the process repeated, the average of the estimates would be close to θ⋆. Conceptually, this requires hypothetically sampling from the real-world population and letting the number of such samples tend to infinity before the average equals θ⋆.
Mathematically, an estimate is the value of a measurable function t : Y → Θ. The distribution that is obtained when t is applied to Y is an estimator, θ̂ = t(Y). Unbiasedness is described using the expectation operator, which for continuous distributions is defined as E h(Y) = ∫ h(y) m_Y(y) dy where m_Y is the density function for the distribution Y obtained from m. This operator is defined for each distribution in M so that a subscript on the operator will indicate a specific distribution. The conceptual construction of unbiasedness is expressed mathematically as E⋆ θ̂ = θ⋆ where E⋆ is the expectation defined using m = m⋆. Our notation does not distinguish between an estimate θ̂ = t(y) and an estimator θ̂ = t(Y), but the meaning will be clear from the context. In particular, expectation operates on estimators.
The bias is the difference between this expectation and θ⋆,

Bias⋆(θ̂) = E⋆ θ̂ − θ⋆.

While the conceptual construction focuses on the distribution m⋆, mathematically bias is defined for all distributions in M. Since θ⋆ is unknown, we want estimators that are unbiased for all values of the parameter (i.e., distributions in M). The estimator θ̂ is unbiased for θ if its bias vanishes for all values of θ; that is, Bias(θ̂) is the zero function on Θ.
Another important property is that the value of the estimate is, in some sense, close to θ⋆. Recognizing that the estimate cannot be close for all y ∈ Y, the estimator is described in terms of its average distance from θ⋆ and, as with bias, a mental conceptualization of this involves hypothetical repeated samples from the physical population.
Mathematically, the distance is defined using a non-negative function d defined on Θ × Θ so that for a fixed y the distance between the estimate and θ⋆ is d(θ̂, θ⋆). The average distance is found using the estimator and the expectation operator, E⋆ d(θ̂, θ⋆). Since E and d are defined for all θ, E d(θ̂, θ) is a function on Θ.
The most common choice for d is square error, d(θ̂, θ) = (θ̂ − θ)², and this average distance is called mean square error

MSE(θ̂) = E(θ̂ − θ)².

MSE is a function on Θ and estimators that minimize MSE for all parameter values generally do not exist. When estimators are required to be unbiased, estimators that minimize MSE do exist for many important estimation problems. For unbiased estimators MSE equals the variance V(θ̂) = E(θ̂ − E θ̂)² and such estimators are called uniformly minimum variance unbiased (UMVU).

The controversy with UMVU estimators comes when there are biased es-
timators that have smaller MSE than the UMVU estimator at all values of
the parameter. Using MSE as the estimation criterion, this indicates that the
UMVU estimator should not be used, but leaves open the question of what
estimator should be used as estimators minimizing MSE for all values of the pa-
rameter generally do not exist. Efron (2024) suggests the solution to this issue
is to move from frequentist to Bayesian inference methods. Efron’s suggestion
is based on the assumption that there are problems with frequentist methods,
in particular with maximum likelihood. What Efron found shocking was the
existence of estimators that had smaller MSE for all values of the parameter
(i.e., “always”):
That “always” was the shocking part: two centuries of statistical
theory, ANOVA, regression, multivariate analysis, etc., depended on
maximum likelihood estimation. Did everything have to be rethought?
Not everything; just the role of bias and MSE. Rethinking these leads to in-
formation as a means of assessing estimators. Before describing the role of
information in the next section we describe an important property shared by
bias and MSE that illustrates a difficulty with these measures and the way
forward.
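The estimators Efron refers to are of the James–Stein type (the subject of the cited article). The following Monte Carlo sketch, with hypothetical parameter values, illustrates the phenomenon: the positive-part James–Stein estimator of a 10-dimensional normal mean has smaller total MSE than the coordinatewise MLE y at every mean vector tried:

```python
import random

random.seed(1)

def mse(estimator, theta, reps=20_000):
    """Monte Carlo MSE, summed over coordinates, for estimating a normal mean."""
    total = 0.0
    for _ in range(reps):
        y = [t + random.gauss(0, 1) for t in theta]
        total += sum((e - t) ** 2 for e, t in zip(estimator(y), theta))
    return total / reps

def mle(y):
    return y                     # the unbiased estimator: the observation itself

def james_stein(y):
    """Positive-part James-Stein shrinkage toward the origin."""
    shrink = max(0.0, 1 - (len(y) - 2) / sum(v * v for v in y))
    return [shrink * v for v in y]

for scale in (0.0, 1.0, 3.0):
    theta = [scale * (i + 1) / 10 for i in range(10)]      # hypothetical mean vectors
    print(scale, round(mse(mle, theta), 2), round(mse(james_stein, theta), 2))
```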

Units of Measurement and Invariance


The fundamental requirement that statistical inference should not depend on
units of measurement leads us to examine the behavior of our estimation criteria
under different transformations. While bias and MSE exhibit invariance under
linear transformations (allowing, for instance, conversion between kilometers
and miles), they fail under more complex transformations that are equally valid
representations of the same physical quantity.
Consider, for example, the measurement of fuel efficiency in automobiles. In the US, the units of measure are miles per gallon (mpg) while in much of Europe liters per 100 km (L/100 km) is used. The issue is not metric versus English units, so we simplify and consider two studies of the same data that use the same family of models. In one, the data are presented in km/L while in the other the units are L/km, so the units require a reciprocal transformation.
Suppose both studies used unbiasedness to choose an estimator. If the study
using units km/L finds an unbiased estimator, that estimator will not be unbi-
ased in the reciprocal units. This means that the analysis now depends on the
units of measurement. If both studies use MSE and the study using km/L units
finds one estimator having smaller MSE than another, this relationship need
not hold using the L/km units. Again, the analysis will depend on the units
chosen to measure the data.
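A minimal Monte Carlo sketch of the reciprocal-units point, with hypothetical numbers and a normal measurement-error model assumed purely for illustration: the sample mean is unbiased in km/L, yet reporting the same estimate in L/km (its reciprocal) is biased for the reciprocal parameter:

```python
import random

random.seed(2)

theta = 12.0        # hypothetical true efficiency in km/L
eta = 1 / theta     # the same quantity expressed in L/km
n, reps = 5, 100_000

avg_kml = avg_lkm = 0.0
for _ in range(reps):
    sample = [random.gauss(theta, 2.0) for _ in range(n)]   # measurements in km/L
    theta_hat = sum(sample) / n                             # unbiased for theta
    avg_kml += theta_hat / reps
    avg_lkm += (1 / theta_hat) / reps                       # same estimate, reciprocal units

print(round(avg_kml, 3), theta)          # ~12.0: unbiased in km/L
print(round(avg_lkm, 5), round(eta, 5))  # noticeably above 1/12: biased in L/km
```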
This fuel-efficiency example illustrates deficiencies of bias and MSE to assess estimators, but it also suggests a solution. Namely, ignore the structure of the label space and use the probability or probability density to assess estimators. In terms of the visualization of simple distributions using rectangles placed on the horizontal axis, we should ignore the location and other structure of this axis and focus on the height of the rectangles. The bases of the rectangles serve only as an index set to compare distributions in terms of their corresponding heights.

7 Inference and Information


Section 6 presented the traditional frequentist framework where inference rests
on hypothetical repeated sampling, leading naturally to concepts like bias and
variance for evaluating statistical procedures. This section develops an alterna-
tive approach that aligns with our view of random variables as mathematical
structures. Rather than considering the behavior of procedures across hypothet-
ical samples, we fix the observed sample and examine its relationship to possible
models. This shift in perspective leads to two key developments: a logical frame-
work for inference through generalized estimation, and an information-theoretic
assessment of estimators that replaces bias and variance. Technical details of
generalized estimation are given in Vos (2022) and Vos & Wu (2024). See Vos
& Holbert (2022) for additional discussion regarding the adequacy of a single
random sample for inference.

7.1 From Repeated Sampling to Logical Inference


Traditional inference metrics like bias and variance arise naturally when con-
sidering sampling distributions, but they reflect properties of the label space
rather than the mathematical structure required to compare distributions. The
Fisher-Information-Logic (FIL) approach shifts focus to the relative frequencies
(or probabilities) assigned to points in the sample space, leading to metrics that
better align with the mathematical nature of distributions.
Consider how bias assessment typically works: we imagine repeatedly draw-
ing samples from some true distribution, computing our estimate each time, and
comparing the average estimate to the true parameter value. This framework
requires us to reason about samples we never observe. The FIL approach in-
stead starts with our actual observed sample yobs and examines its relationship
to every model in a family of distributions M . A key point here is that the sam-
pling distribution for each distribution in M is part of mathematics requiring
no sampling. The justification for comparing yobs to each sampling distribution
depends only on the single sampling mechanism used in the physical world that
resulted in yobs .
For a smooth family of distributions M with sampling distribution sup-
port Y, we introduce generalized estimators that map Y × M to R. A gener-
alized estimator g evaluated at our observed sample yobs provides a function
gobs = g(yobs , ·) that orders all models in M according to their consistency with
yobs . Rather than asking about the long-run behavior of g across hypothetical
samples, we examine how effectively it discriminates between different possible
models given our observed data.
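As an illustrative sketch only (not necessarily the construction in Vos (2022)), one possible choice of g is the log-likelihood of the observed sample; evaluated at a hypothetical y_obs it orders a grid of N(μ, 1) models by their consistency with the data:

```python
import math

y_obs = [4.2, 5.1, 3.8, 4.9, 4.6]           # hypothetical observed sample

def g(y, mu):
    """One possible g(y, m) for the N(mu, 1) family: the log-likelihood of y."""
    return -0.5 * sum((v - mu) ** 2 for v in y) - 0.5 * len(y) * math.log(2 * math.pi)

models = [mu / 10 for mu in range(0, 101)]   # a grid standing in for the smooth family M
g_obs = {mu: g(y_obs, mu) for mu in models}  # g_obs = g(y_obs, .) orders the models

best = max(g_obs, key=g_obs.get)             # top of the ordering: the usual point estimate
print(best, sum(y_obs) / len(y_obs))         # 4.5 (grid maximizer) vs 4.52 (sample mean)
```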

7.2 Extending Logical Arguments to Statistical Inference
The FIL approach extends the modified reductio ad absurdum argument from
Section 5.2 to continuous families of distributions. For each model m ∈ M , we
consider the conjunction of two hypotheses:
• H: The model m best approximates the true simple distribution

• Haux : The observed sample yobs is not rare under model m


We formalize “rare” through tail areas, which naturally extend from simple distributions to general distributions via measure theory:

TA_L(y_obs, m) = ∫_{y : g(y,m) ≤ g_obs(m)} m_Y dy

where m_Y represents the sampling distribution under m. The right tail area TA_R uses ≥ instead of ≤, and we set TA(y_obs, m) = 2 min(TA_L, TA_R). For
significance level α, we formalize Haux as TA(yobs , m) > α.
This framework partitions the model space into two sets: M_α^c containing all distributions where the reductio ad absurdum argument successfully reaches a contradiction (the observed data are rare), and M_α containing the distributions where the argument fails to reach a contradiction:

M_α^c = {m ∈ M : TA(y_obs, m) ≤ α}
M_α = {m ∈ M : m ∉ M_α^c}

We cannot know with certainty which of these sets contains m⋆, the distribution in M that best approximates the true simple distribution Xpop. However, if m⋆ ∈ M_α^c then y_obs is rare. We have Fisher’s logical disjunction: either the true distribution is in M_α or the sample we observed is rare. The set M_α forms a (1 − α)100% confidence region, with its image under a parameterization θ giving a subset Θ_α ⊂ R^d. When d = 1, Θ_α often forms an interval, in which case we call it a confidence interval.
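A minimal sketch of this inversion, assuming a binomial(n, p) family with a hypothetical observed count and a grid of values standing in for M: models whose two-sided tail area exceeds α form M_α, and its endpoints give an approximate 95% confidence interval:

```python
from math import comb

n, y_obs, alpha = 50, 32, 0.05     # hypothetical observed count from binomial(n, p)

def tail_area(p, y):
    """TA(y, m) = 2 * min(left, right) for the binomial(n, p) sampling distribution."""
    pmf = [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]
    left = sum(pmf[: y + 1])       # Pr(Y <= y)
    right = sum(pmf[y:])           # Pr(Y >= y)
    return 2 * min(left, right)

grid = [i / 1000 for i in range(1, 1000)]                       # stand-in for M
M_alpha = [p for p in grid if tail_area(p, y_obs) > alpha]      # models not contradicted

print(min(M_alpha), max(M_alpha))  # endpoints of an approximate 95% confidence interval
```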

7.3 Information Theory and Statistical Evidence


The FIL approach centers on comparing distributions rather than focusing on a
single hypothetical true distribution. This shift naturally leads to information-
theoretic concepts for measuring the strength of statistical evidence. While
tail areas provide evidence against specific distributions through the reductio
ad absurdum argument, the effectiveness of a generalized estimator is measured
through its information content Λ(g). This connection between statistical evi-
dence and information mirrors fundamental concepts in information theory: the
Kullback-Leibler (KL) divergence (also called KL information) and entropy.
Like our treatment of generalized estimators, these information-theoretic
measures depend only on probability assignments, not on the structure of the label space. For distributions sharing support X, the KL divergence

KL(m_1, m_2) = Σ_{x ∈ X} m_1(x) log( m_1(x) / m_2(x) )

measures the information lost when m_2 is used to approximate m_1. Similarly, entropy

ENT(m) = − Σ_{x ∈ X} m(x) log m(x)

quantifies the information content of a distribution using only its probability structure. These equations illustrate how X serves purely as an index set, consistent with our emphasis on probability assignments over label space structure.
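A small Python sketch of both quantities (the two distributions are hypothetical); note that the labels are used only as dictionary keys, i.e., purely as an index set:

```python
import math

def kl(m1, m2):
    """KL(m1, m2): information lost when m2 is used to approximate m1 (shared support)."""
    return sum(p * math.log(p / m2[x]) for x, p in m1.items())

def entropy(m):
    """ENT(m) = -sum m(x) log m(x)."""
    return -sum(p * math.log(p) for p in m.values())

# Hypothetical distributions on the same four labels.
m1 = {"O": 0.45, "A": 0.35, "B": 0.15, "AB": 0.05}
m2 = {"O": 0.25, "A": 0.25, "B": 0.25, "AB": 0.25}

print(round(kl(m1, m2), 4), round(kl(m2, m1), 4))    # KL is asymmetric
print(round(entropy(m1), 4), round(entropy(m2), 4))  # entropy(m2) = log 4, the maximum
```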
The Fisher information and the information utilized by generalized estima-
tors, Λ(g), share this independence from label space structure, requiring only
measurability for continuous distributions. However, they differ from KL diver-
gence and entropy in a crucial way: while KL and entropy can be computed for
individual distributions or pairs of distributions, Fisher information and Λ(g)
require a smooth family of distributions. This reflects their roles in statistical
inference, where they describe properties of distribution families and functions
defined on these families rather than isolated distributions.
When distributions have different supports, the KL divergence becomes in-
finite — precisely the cases where deductive rather than inductive reasoning
applies, as illustrated in our Two Lotteries example. This connection between
information theory and logical inference reinforces how the FIL approach pro-
vides a unified framework for statistical reasoning that maintains mathematical
rigor while avoiding the conceptual burden of hypothetical repeated sampling.

8 Discussion
This paper has developed a framework for understanding statistical inference
that emphasizes the mathematical nature of distributions while carefully distin-
guishing between mathematical, physical, and mental worlds. This distinction
proves particularly valuable when examining fundamental statistical concepts
and terminology that often conflate these domains. Three key areas illustrate
both the importance and broader implications of this perspective.
First, statistical terminology frequently introduces mental world connota-
tions that can obscure rather than clarify mathematical concepts. The term
“random variable” exemplifies this problem - while mathematically defined as
a measurable function, the word “random” suggests a mental world construct
involving chance and unpredictability. Similar issues arise with terms like “infor-
mation,” “likelihood,” and “confidence.” These terms carry intuitive meanings
that may mislead practitioners about their precise mathematical definitions.
The solution is not to eliminate such terminology - these terms are deeply em-
bedded in statistical practice - but rather to explicitly recognize and address
the potential confusion they may cause.

Second, the concept of information in statistics requires particular care.
While information theory provides precise mathematical definitions through
concepts like entropy and KL divergence, these capture only specific aspects
of how information is understood more broadly.
The phrase “amount of information” might suggest information is a quantity
like mass or volume, but this analogy breaks down upon closer examination.
Information does not have units, and the relationship between information and
variance illustrates this subtlety.
Consider an unbiased estimator θ̂ where the information Λ(θ̂) equals the re-
ciprocal of its variance. While these quantities are numerically reciprocal, they
are conceptually distinct: Λ(θ̂) measures how rapidly probability assignments
change when considering different models, while variance is defined for an iso-
lated distribution and quantifies the spread using the algebraic properties of
the label space. The parameter θ plays fundamentally different roles in each
case - for information, it serves as a coordinate on a smooth manifold (and thus
has no units), while for variance, it inherits the units and structure of the label
space. See Vos (2024) for problems when inference depends on the choice of
parameterization.
Third, our framework provides new insights into point estimation. When
working with continuous exponential families, we can leverage the bijection
between canonical statistics and expectation parameters to view points in the
support X as either canonical statistics or expectation parameters. In either
case, X becomes a subset of Rd , which we denote as XR . More fundamentally,
using the parameterization allows us to take X = M , viewing a point estimate
as a distribution in the model space M rather than as a real number labeling this
distribution. While M lacks the algebraic structure of the reals, concepts like
mean and variance can be defined through optimization, generalizing familiar
ideas like least squares to other exponential families using KL divergence. See
Wu & Vos (2012) and Vos & Wu (2015) for details on these distribution-valued
point estimators.
These observations point to a broader principle: the importance of maintain-
ing clear distinctions between mathematical definitions and their interpretations
in the physical and mental worlds. The fact that “random variables aren’t ran-
dom” serves as a caution about mathematical terminology in general - terms
that carry rich meaning outside mathematics may not align with their precise
mathematical definitions. This misalignment becomes particularly problematic
when it leads to reasoning about hypothetical scenarios (like infinite sequences
of samples) rather than focusing on well-defined mathematical objects.

References
Efron, B. (2024). Machine learning and the James–Stein estimator. Jpn. J.
Stat. Data Sci., 7(1), 257–266.

Fisher, R. (1959). Statistical methods and scientific inference. Hopetoun Street,
University of Edinburgh: T and A Constable Ltd, 2nd edition.
Fisher, S. R. A. (1960). Scientific thought and the refinement of human reason-
ing. 3, 1–10.

Gelman, A. & Loken, E. (2014). The statistical crisis in science. Am. Sci.,
102(6), 460.
Good, I. J. (1971). Letters to the editor: 46656 varieties of Bayesians. The American Statistician, 25(5), 62–63.
Penrose, R. (2007). The road to reality: a complete guide to the laws of the
universe. New York: Vintage Books, 1st vintage books ed edition.
Spiegelhalter, D. (2024). Why probability probably doesn’t exist (but it is useful
to act like it does). Nature, 636(8043), 560 – 563.
Vos, P. (2022). Generalized estimators, slope, efficiency, and Fisher information bounds. Information Geometry. SharedIt link https://rdcu.be/c0YQn.

Vos, P. (2024). Rethinking Mean Square Error: Why Information is a Superior Assessment of Estimators. Preprint, https://arxiv.org/abs/2412.08475.
Vos, P. & Holbert, D. (2022). Frequentist statistical inference without repeated
sampling. Synthese, 200(2).

Vos, P. & Wu, Q. (2015). Maximum likelihood estimators uniformly minimize distribution variance among distribution unbiased estimators in exponential families. Bernoulli, 21(4).
Vos, P. & Wu, Q. (2024). Generalized estimation and information.

Wu, Q. & Vos, P. (2012). Decomposition of Kullback–Leibler risk and unbiasedness for parameter-free estimators. Journal of Statistical Planning and Inference, 142(6), 1525–1536.
