Random Variables Aren't Random

Paul W. Vos

February 9, 2025
Abstract
This paper examines the foundational concept of random variables
in probability theory and statistical inference, demonstrating that their
mathematical definition requires no reference to randomization or hypo-
thetical repeated sampling. We show how measure-theoretic probabil-
ity provides a framework for modeling populations through distributions,
leading to three key contributions. First, we establish that random vari-
ables, properly understood as measurable functions, can be fully char-
acterized without appealing to infinite hypothetical samples. Second, we
demonstrate how this perspective enables statistical inference through log-
ical rather than probabilistic reasoning, extending the reductio ad absur-
dum argument from deductive to inductive inference. Third, we show how
this framework naturally leads to information-based assessment of statisti-
cal procedures, replacing traditional inference metrics that emphasize bias
and variance with information-based approaches that better describe the
families of distributions used in parametric inference. This reformulation
addresses long-standing debates in statistical inference while providing
a more coherent theoretical foundation. Our approach offers an alter-
native to traditional frequentist inference that maintains mathematical
rigor while avoiding the philosophical complications inherent in repeated
sampling interpretations.
1 Introduction
Statistical inference aims to draw conclusions about populations from observed
data, yet its foundational concepts are often obscured by an unnecessary focus
on randomization and hypothetical repeated sampling. This focus has led to
persistent confusion and controversy in the interpretation of basic statistical
concepts, even among experienced researchers. A striking example appears in
discussions of p-values, where emphasis on random phenomena and hypothetical
datasets clouds the simpler mathematical reality: a p-value is fundamentally a
measure of location in a sampling distribution. The confusion surrounding p-
values illustrates a broader issue in statistical inference – the tendency to explain
mathematical concepts through mental constructs involving infinite hypotheti-
cal samples rather than recognizing them as well-defined mathematical objects.
The confusion arising from emphasis on randomization is illustrated by Gelman & Loken (2014), who begin their article by stating that “Researchers typically express the confidence in their data in terms of p-value: the probability that a perceived result is actually the result of random variation.” This characterization obscures the mathematical nature of p-values by casting them as descriptions of random phenomena, which is but one potential use, rather than as what they are: measures of where the observed sample lies in a specifically defined sampling distribution.
The authors further assert that “p-values are based on what would have hap-
pened under other possible data sets.” This interpretation moves the discussion
from mathematics to what Penrose (2007) calls the mental world — a realm
of imagination and hypothetical scenarios. Such a transition inevitably leads
to confusion. A p-value is purely mathematical: it represents the percentile of
the observed sample in the distribution of all possible samples under the null
hypothesis model. This distribution and the resulting p-value are part of math-
ematics. While proper physical randomization of the observed sample is crucial
for connecting the p-value to the actual population, this requires only the single
randomization that produced our data.
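To make this location interpretation concrete, here is a minimal sketch (the coin-tossing setup is hypothetical, not taken from the paper): under a fully specified null hypothesis model, the p-value is read directly off the sampling distribution as a tail proportion, with no appeal to repeated experiments.

```python
from scipy.stats import binom

# Null hypothesis model: the number of heads in n = 20 tosses of a fair
# coin has the Binomial(20, 0.5) sampling distribution.
n, p_null, observed = 20, 0.5, 15

# The p-value locates the observed count in this distribution: the
# proportion of possible samples at least as extreme as the one observed.
p_value = binom.sf(observed - 1, n, p_null)      # Pr(X >= 15) ~ 0.0207
percentile = binom.cdf(observed - 1, n, p_null)  # observed count sits above the ~97.9th percentile
print(round(p_value, 4), round(percentile, 4))
```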
While Gelman & Loken (2014) make valid points about potential misuse
of statistical methods, their conclusion that “the justification for p-values lies
in what would have happened across multiple data sets” reveals how deeply
embedded the repeated sampling perspective has become. The problems they
identify stem not from p-values themselves, which are well-defined mathematical
quantities, but from interpreting these mathematical objects through the lens of
hypothetical repeated sampling. When we maintain focus on the mathematical
structure – the distribution specified by the null hypothesis and the location of
our observed sample within its sampling distribution – the meaning of p-values
becomes clearer and their proper use more evident.
The remainder of this paper develops these ideas systematically. Section 2
introduces distributions on finite sets, called simple distributions, and shows how
these extend naturally to both countable and uncountable sets without requir-
ing the concept of randomness. Section 3 employs Penrose’s distinction between
mathematical, physical, and mental worlds to clarify how single-instance phys-
ical randomization validates our use of mathematical sampling distributions.
Section 4 shows how measure theory provides precise tools for describing data
through percentiles and tail areas, laying groundwork for the logical approach
to inference developed in Section 5. There, we show how the reductio ad absur-
dum argument from deductive logic extends naturally to statistical inference,
establishing what we call the Fisher-Information-Logic (FIL) approach. Section
6 examines traditional approaches to inference based on bias and variance, revealing that these criteria depend on additional structure of the support set rather than on the probabilities assigned to the support. Section 7 describes the
information-based framework that better aligns with the mathematical nature
of distributions while providing practical tools for inference.
2 Mathematical Framework of Distributions
2.1 Simple Distributions
Simple distributions provide the mathematical foundation for describing finite
populations, forming the basis for more abstract probability concepts. We begin
with this concrete case because it captures the essential features of distributions
while maintaining clear connections to observable populations.
Let f1, f2, . . . , fK be positive integers that sum to N, representing frequencies in a population of size N. For example, these might represent counts of different blood types in a hospital's patient database or the number of students receiving each possible grade in a class. Let XK be a set with K distinct elements. For example, XK might consist of labels for the different blood types or letter grades. We define an (N, K)-simple frequency distribution as a function m(N,K) on XK satisfying:

• ∑_{x∈XK} m(N,K)(x) = 1
• N m(N,K)(x) ∈ Z+ for all x ∈ XK

The distribution is degenerate if m(N,K)(x) = 1, in which case K = 1.
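As a concrete check of this definition, the following sketch (with hypothetical blood-type counts) verifies both conditions for an (N, K)-simple frequency distribution:

```python
from fractions import Fraction

# Hypothetical blood-type counts in a population of N = 100 patients.
freq = {"O": 45, "A": 40, "B": 11, "AB": 4}
N = sum(freq.values())

# The (N, K)-simple frequency distribution assigns each label its proportion.
m = {x: Fraction(f, N) for x, f in freq.items()}

assert sum(m.values()) == 1                               # proportions sum to one
assert all((N * p).denominator == 1 for p in m.values())  # N m(x) is a positive integer
```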
The support XK is an abstract set. To emphasize that there may be additional structure on this set, XK is called the label space; the structure may range from none at all to the algebraic properties of the reals when XK ⊂ R. When the labels have meaningful numeric values, the simple distribution will be approximated by a continuous distribution as described in Section 2.3.
Each simple distribution corresponds to a unique multi-set (or bag) ⌊m(N,K)⌋ containing N elements, where each value xk appears exactly fk times. Formally, this multi-set can be represented as a set of N ordered pairs where the first component takes values in XK and the second component ranges over {1, . . . , N} so that each ordered pair is unique. For brevity, we write ⌊m⌋ when N and K are clear from context.

The multi-set shows the connection with the more common notation X for a distribution. For simple distributions, X = Π1⌊m⌋, where Π1 is the projection onto the first component. The proportion corresponding to the value x is obtained from the counting measure of its pre-image:

m(x) = |X⁻¹(x)|/N.
Simple distributions can be visualized using rectangles whose areas represent the proportions m(N,K)(x); the construction depends on the structure of the label space:
• For ordered XK (e.g., course grades A, B, C, D, F): Plot points on the
horizontal axis with uniform spacing, constructing rectangles centered at
each x with area m(N,K) (x).
• For unordered XK (e.g., blood types): The visualization uses similar rect-
angles, but their horizontal arrangement carries no meaningful informa-
tion.
• For XK ⊂ R (e.g., height measurements): The horizontal spacing reflects
the numerical values in XK , with rectangle heights chosen to achieve areas
of m(N,K) (x).
The label space XK and corresponding proportions m(x) shown in these visu-
alizations are the defining features of general distributions which we consider
next.
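These constructions are straightforward to sketch programmatically. In the example below the grade proportions are hypothetical; with unit spacing and unit-width rectangles, each bar's height equals its area m(N,K)(x):

```python
import matplotlib.pyplot as plt

# Rectangle visualization for an ordered label space (hypothetical grades).
labels = ["A", "B", "C", "D", "F"]
m = [0.20, 0.35, 0.30, 0.10, 0.05]

fig, ax = plt.subplots()
# Unit-width rectangles at uniform spacing, so height equals area m(x).
ax.bar(range(len(labels)), m, width=1.0, edgecolor="black")
ax.set_xticks(range(len(labels)), labels)
ax.set_ylabel("m(x)")
plt.show()
```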
simple distributions. This approximation becomes increasingly accurate as the
measurement precision increases and the population size grows.
As with simple and discrete distributions, the notation X for m emphasizes the label space, which in this case has the advantage of representing the algebraic structure inherited from R. Formally, Pr(X ∈ A) = ∫_A m(x)dx for any measurable set A. The support X for continuous distributions is uncountable; probabilities are assigned to the measurable sets of a σ-algebra on X rather than to individual points. It is common practice to use X to denote both continuous and discrete distributions. We will do the same when the context indicates the cardinality of the support.
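A short sketch illustrates the formula Pr(X ∈ A) = ∫_A m(x)dx; the normal density for heights is an illustrative assumption, not a model taken from the text:

```python
from scipy import stats
from scipy.integrate import quad

# A continuous distribution given by a density m on the label space R;
# a normal density for heights (in cm) is used purely for illustration.
m = stats.norm(loc=170, scale=10).pdf

# Pr(X in A) as the integral of m over the measurable set A = [160, 180].
prob, _ = quad(m, 160, 180)
print(round(prob, 4))  # ~0.6827, the usual one-standard-deviation mass
```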
⌊X⌋. The proportions obtained by combinatorics are what connect the simple distributions to the physical and mental worlds. This connection is fundamental to understanding statistical inference in terms of distributions of data rather than models of random phenomena, i.e., random variables.

Measure theory extends the idea of a proportion of a finite set to a probability that describes an infinite set. This extension does not need to occur in the mental or physical world, and attempting to do so using hypothetical repeated sampling from a real-world population is ripe for confusion.
4.2 Fisher’s Infinite Population
R.A. Fisher did not adhere to the view of probability as describing hypothetical repeated samples. Instead, as Spiegelhalter (2024) notes, “Fisher suggested thinking of a unique data set as a sample from a hypothetical infinite population, but this seems to be more of a thought experiment than an objective reality.” However, Penrose's distinction between mathematical and mental worlds provides an objective reality for Fisher's framework.
To understand Fisher’s approach, we begin with a simple distribution X(N,K)
that exactly models a finite population. The observed data y represents a point
in Y(N,K) , the sampling distribution of X(N,K) . When we approximate X(N,K)
with a continuous distribution X, the corresponding sampling distribution Y
approximates Y(N,K) . Both X and Y will have infinite support.
The infinite sampling distribution exists in the mathematical world rather
than the mental world. It is not a hypothetical construct requiring repeated
sampling or imagination, but rather a mathematical object with the same ob-
jective reality as any other mathematical structure. Fisher’s infinite population,
therefore, describes a precise mathematical framework for understanding sam-
pling distributions, not a mental exercise in hypothetical repetition.
Therefore, any sum exceeding 28 contradicts the assumption that it came from
Nation A, leaving Nation B as the only possible source.
This argument can be expressed in terms of distributions as follows. Let Sk = {k′ ∈ Z : 0 ≤ k′ ≤ k} be the set of non-negative integers up through k. Let S be the distribution for the observed sum sobs. The premise, known as the null hypothesis in statistical terminology, is H◦ : S = SA, where SA is the distribution of sums from lottery A, whose support is S28. The steps of the argument for a sum of 29 are as follows:

1. Begin with H◦
2. H◦ ⟹ sobs ∈ S28; sobs = 29
3. sobs ∉ S28. A contradiction. ∴ ¬H◦

This argument provides proof that H◦ is false; i.e., S ≠ SA. Clearly, the argument goes through whenever sobs ∉ S28.
When the observed sum is 28 or less, reasoning shifts from deduction to in-
duction, revealing fundamental divisions within the statistical community. The
repeated-sampling frequentists either conclude that no inference is possible or
resort to multiverse arguments to maintain their philosophical framework. The
Bayesian school insists that prior probabilities must be assigned to the two
possibilities, yet this leads to considerable debate regarding both the numeri-
cal values of these priors and their philosophical interpretation. Good (1971),
himself a Bayesian, wryly observes that there are at least “46,656 varieties of
Bayesians”.
The tail area - the proportions associated with extreme values - provides a mathematical framework for defining rarity. For instance, if we designate a sum of 28 as rare but 27
as not rare for the distribution SA , this corresponds to labeling the uppermost
0.0002 of possible sums for lottery A as rare. More formally, we can express
Haux as the statement that sobs falls below the 99.98th percentile using the
natural ordering provided by the numerical values of the sum.
This formalization allows us to construct a reductio ad absurdum argument applicable to induction:

1. Begin with the conjunction H◦ ∧ Haux
2. This conjunction implies sobs ∈ S27; sobs = 28
3. sobs ∉ S27. A contradiction. ∴ ¬(H◦ ∧ Haux) ≡ ¬H◦ ∨ ¬Haux
The definition of rare events through tail areas can be calibrated to different
levels of evidence. For example, if we consider a sum of 24 as extreme but 23
as not, this corresponds to a tail area of 0.0171 for lottery A.
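These tail areas can be reproduced by direct enumeration under one natural reading of lottery A, namely the sum of four independent draws, each uniform on {0, . . . , 7}; this specification is inferred from the quoted numbers, since the lottery's full definition appears earlier in the example.

```python
from itertools import product
from fractions import Fraction

# Assumed specification of lottery A: the sum of four independent draws,
# each uniform on {0, ..., 7}, giving support S_28 = {0, ..., 28} and
# 8**4 = 4096 equally likely outcomes.
sums = [sum(t) for t in product(range(8), repeat=4)]
total = len(sums)

def upper_tail(k):
    """Proportion of lottery-A sums that are >= k."""
    return Fraction(sum(s >= k for s in sums), total)

print(float(upper_tail(28)))  # 0.000244..., the 0.0002 tail for a sum of 28
print(float(upper_tail(24)))  # 0.017089..., the 0.0171 tail for sums of 24 or more
```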
This framework provides a bridge between deductive and inductive reason-
ing. Rather than requiring the conceptual machinery of hypothetical repeated
sampling or subjective prior probabilities, it extends classical deductive logic
to accommodate uncertainty in a mathematically rigorous way. The approach
maintains objectivity while acknowledging the inherent limitations in our ability
to draw absolute conclusions from empirical data.
A test statistic T maps samples to real numbers, inducing an ordering of the sample space. For continuous distributions, we require T to be measurable to ensure compatibility with the probability structure.
There are many choices for the test statistic and different statistics provide
different orderings of the sample space. In our Two Lotteries example, we seek
a test statistic that effectively distinguishes between distributions SA and SB .
The Neyman-Pearson lemma provides theoretical guidance for this choice,
showing that the likelihood ratio test achieves optimal power when comparing
two simple hypotheses, as in our lottery example. Interestingly, for sums in
{8, 9, . . . , 27, 28}, the ordering induced by the likelihood ratio matches that of
the numerical values themselves. However, the likelihood ratio remains con-
stant across {0, 1, . . . , 6, 7}, revealing a subtle distinction between natural and
likelihood ratio ordering of outcomes.
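These likelihood ratio properties can be checked numerically. The sketch below again takes lottery A as the sum of four draws uniform on {0, . . . , 7} and, more speculatively, takes lottery B as the sum of four draws uniform on {0, . . . , 9}; both specifications are assumptions made for illustration rather than definitions taken from the text.

```python
from itertools import product
from collections import Counter

# Assumed lotteries: A sums four draws uniform on {0,...,7};
# B sums four draws uniform on {0,...,9}.
count_A = Counter(sum(t) for t in product(range(8), repeat=4))
count_B = Counter(sum(t) for t in product(range(10), repeat=4))

# Likelihood ratio P_A(s) / P_B(s) on the support of A.
lr = {s: (count_A[s] / 8**4) / (count_B[s] / 10**4) for s in range(29)}

print(len({round(lr[s], 10) for s in range(8)}) == 1)  # constant on {0,...,7}
print(all(lr[s] > lr[s + 1] for s in range(8, 28)))    # decreasing on {8,...,28}
```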
Most statistical applications differ from our example in two important ways.
First, the exact values of N and K characterizing the simple distribution are
typically unknown, necessitating the use of approximating distributions X. For
discrete X, the sample space X becomes a countable set, with proportions replaced by real numbers in the unit interval. For continuous X, X becomes a measurable space, with proportions replaced by the measure-theoretic probabilities of measurable sets.
Second, rather than choosing between two specific distributions, we consider
a smooth continuum of possible models. Nevertheless, the modified reductio ad
absurdum argument extends naturally to this setting after appropriate adjust-
ments for measurable sets. The likelihood continues to play a central role, with
the smooth structure of the model space allowing us to analyze how likelihood
functions vary across models - essentially a local version of the likelihood ratio.
The information content of a test statistic provides a mathematical frame-
work for describing its induced ordering of the sample space. This connection
between test statistics and information theory is described in Section 7.
The goal of point estimation is to use the sample y = (x1, x2, . . . , xn) to obtain a value for the parameter, called an estimate, that will be close to θ⋆ = θ(X⋆). The value of the estimate will depend on the sample, and a desirable property is that, were the process repeated, the average of the estimates would be close to θ⋆. Conceptually, this requires hypothetically sampling from the real-world population and letting the number of such samples tend to infinity before the average equals θ⋆.
Mathematically, an estimate is the value of a measurable function t : Y → Θ. The distribution that is obtained when t is applied to Y is an estimator, θ̂ = t(Y). Unbiasedness is described using the expectation operator, which for continuous distributions is defined as E h(Y) = ∫ h(y) mY(y) dy, where mY is the density function for the distribution Y obtained from m. This operator is defined for each distribution in M, so a subscript on the operator will indicate a specific distribution. The conceptual construction of unbiasedness is expressed mathematically as E⋆ θ̂ = θ⋆, where E⋆ is the expectation defined using m = m⋆. Our notation does not distinguish between an estimate θ̂ = t(y) and an estimator θ̂ = t(Y); the intended meaning will be clear from the context. In particular, expectation operates on estimators.
The bias is the difference between this expectation and θ⋆:

Bias⋆(θ̂) = E⋆ θ̂ − θ⋆.
While the conceptual construction focuses on the distribution m⋆, mathematically bias is defined for all distributions in M. Since θ⋆ is unknown, we want estimators that are unbiased for all values of the parameter (i.e., for all distributions in M). The estimator θ̂ is unbiased for θ if its bias vanishes for all values of θ; that is, Bias(θ̂) is the zero function on Θ.
Another important property is that the value of the estimate is, in some sense, close to θ⋆. Recognizing that the estimate cannot be close for all y ∈ Y, the estimator is described in terms of its average distance from θ⋆ and, as with bias, a mental conceptualization of this involves hypothetical repeated samples from the physical population.

Mathematically, the distance is defined using a non-negative function d defined on Θ × Θ so that for a fixed y the distance between the estimate and θ⋆ is d(θ̂, θ⋆). The average distance is found by applying the expectation operator to the estimator, E⋆ d(θ̂, θ⋆). Since E and d are defined for all θ, E d(θ̂, θ) is a function on Θ.
The most common choice for d is squared error, d(θ̂, θ) = (θ̂ − θ)², and this average distance is called the mean square error:

MSE(θ̂) = E d(θ̂, θ) = E(θ̂ − θ)².

MSE is a function on Θ, and estimators that minimize MSE for all parameter values generally do not exist. When estimators are required to be unbiased, estimators that minimize MSE do exist for many important estimation problems. For unbiased estimators MSE equals the variance V(θ̂) = E(θ̂ − E θ̂)², and such estimators are called uniformly minimum variance unbiased (UMVU).
The controversy with UMVU estimators comes when there are biased es-
timators that have smaller MSE than the UMVU estimator at all values of
the parameter. Using MSE as the estimation criterion, this indicates that the
UMVU estimator should not be used, but leaves open the question of what
estimator should be used, since estimators minimizing MSE for all values of the parameter generally do not exist. Efron (2024) suggests that the solution to this issue is to move from frequentist to Bayesian inference methods. Efron's suggestion is based on the assumption that there are problems with frequentist methods,
in particular with maximum likelihood. What Efron found shocking was the
existence of estimators that had smaller MSE for all values of the parameter
(i.e., “always”):
That “always” was the shocking part: two centuries of statistical
theory, ANOVA, regression, multivariate analysis, etc., depended on
maximum likelihood estimation. Did everything have to be rethought?
Not everything; just the role of bias and MSE. Rethinking these leads to in-
formation as a means of assessing estimators. Before describing the role of
information in the next section, we describe an important property shared by bias and MSE that illustrates a difficulty with these measures and points the way forward.
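A short simulation, written for this discussion rather than taken from Efron (2024), illustrates the “always”: in d ≥ 3 dimensions the James–Stein estimator has smaller total MSE than the unbiased maximum likelihood estimator at every parameter value (here checked at one arbitrary value of θ).

```python
import numpy as np

rng = np.random.default_rng(0)

# y ~ N(theta, I_d); the MLE of theta is y itself, with total MSE d.
d, reps = 10, 100_000
theta = np.linspace(-2.0, 2.0, d)           # an arbitrary true mean vector
y = rng.normal(theta, 1.0, size=(reps, d))

# James-Stein shrinks y toward the origin.
shrink = 1 - (d - 2) / np.sum(y**2, axis=1, keepdims=True)
james_stein = shrink * y

mse = lambda est: np.mean(np.sum((est - theta) ** 2, axis=1))
print(mse(y))            # ~ d = 10 for the MLE
print(mse(james_stein))  # strictly smaller, whatever theta is chosen above
```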
axis, we should ignore the location and other structure of this axis and focus on the heights of the rectangles. The bases of the rectangles serve only as an index set for comparing distributions in terms of their corresponding heights.
7.2 Extending Logical Arguments to Statistical Inference
The FIL approach extends the modified reductio ad absurdum argument from
Section 5.2 to continuous families of distributions. For each model m ∈ M , we
consider the conjunction of two hypotheses:
• H: The model m best approximates the true simple distribution
• Haux: The observed data yobs are not rare in the sampling distribution mY under m

Rarity is measured by tail areas: the left tail area is TAL(yobs, m) = Pr(Y ≤ yobs) computed under mY, the right tail area TAR uses ≥ instead of ≤, and we set TA(yobs, m) = 2 min(TAL, TAR). For significance level α, we formalize Haux as TA(yobs, m) > α.
This framework partitions the model space into two sets: Mα∁, containing all distributions where the reductio ad absurdum argument successfully reaches a contradiction (the observed data are rare), and Mα, containing the distributions where the argument fails to reach a contradiction:

Mα∁ = {m ∈ M : TA(yobs, m) ≤ α}
Mα = {m ∈ M : m ∉ Mα∁}
We cannot know with certainty which of these sets contains m⋆, the distribution in M that best approximates the true simple distribution Xpop. However, if m⋆ ∈ Mα∁, then yobs is rare. We have Fisher's logical disjunction: either the true distribution is in Mα or the sample we observed is rare. The set Mα forms a (1 − α)100% confidence region, with its image under a parameterization θ giving a subset Θα ⊂ R^d. When d = 1, Θα often forms an interval, in which case we call it a confidence interval.
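A minimal sketch shows how Mα can be computed by inverting tail areas; the normal-mean model with known σ is an illustrative assumption, and for it the construction recovers the familiar interval:

```python
import numpy as np
from scipy import stats

# Hypothetical data summary: observed mean of n = 25 values, sigma known.
y_obs, sigma, n, alpha = 5.2, 1.0, 25, 0.05
se = sigma / np.sqrt(n)

def tail_area(y, mu):
    """TA(y, m) = 2 min(TA_L, TA_R) for the model with mean mu."""
    ta_left = stats.norm.cdf(y, loc=mu, scale=se)
    ta_right = stats.norm.sf(y, loc=mu, scale=se)
    return 2 * min(ta_left, ta_right)

# M_alpha: the models where the reductio argument fails to reach a contradiction.
grid = np.linspace(4.0, 6.5, 2001)
region = [mu for mu in grid if tail_area(y_obs, mu) > alpha]
print(min(region), max(region))  # ~ y_obs +/- 1.96 * se = (4.808, 5.592)
```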
label space. For distributions sharing support X, the KL divergence is

KL(m1, m2) = ∑_{x∈X} m1(x) log( m1(x) / m2(x) ).
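A small numeric check of this divergence, with hypothetical proportions on a shared four-point support:

```python
import numpy as np

# Two distributions on the same finite label space.
m1 = np.array([0.45, 0.40, 0.11, 0.04])
m2 = np.array([0.25, 0.25, 0.25, 0.25])

kl = np.sum(m1 * np.log(m1 / m2))
print(round(kl, 4))  # positive, and KL(m1, m2) != KL(m2, m1) in general
```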
8 Discussion
This paper has developed a framework for understanding statistical inference
that emphasizes the mathematical nature of distributions while carefully distin-
guishing between mathematical, physical, and mental worlds. This distinction
proves particularly valuable when examining fundamental statistical concepts
and terminology that often conflate these domains. Three key areas illustrate
both the importance and broader implications of this perspective.
First, statistical terminology frequently introduces mental world connota-
tions that can obscure rather than clarify mathematical concepts. The term
“random variable” exemplifies this problem - while mathematically defined as
a measurable function, the word “random” suggests a mental world construct
involving chance and unpredictability. Similar issues arise with terms like “infor-
mation,” “likelihood,” and “confidence.” These terms carry intuitive meanings
that may mislead practitioners about their precise mathematical definitions.
The solution is not to eliminate such terminology - these terms are deeply em-
bedded in statistical practice - but rather to explicitly recognize and address
the potential confusion they may cause.
Second, the concept of information in statistics requires particular care.
While information theory provides precise mathematical definitions through
concepts like entropy and KL divergence, these capture only specific aspects
of how information is understood more broadly.
The phrase “amount of information” might suggest information is a quantity
like mass or volume, but this analogy breaks down upon closer examination.
Information does not have units, and the relationship between information and
variance illustrates this subtlety.
Consider an unbiased estimator θ̂ where the information Λ(θ̂) equals the re-
ciprocal of its variance. While these quantities are numerically reciprocal, they
are conceptually distinct: Λ(θ̂) measures how rapidly probability assignments
change when considering different models, while variance is defined for an iso-
lated distribution and quantifies the spread using the algebraic properties of
the label space. The parameter θ plays fundamentally different roles in each
case - for information, it serves as a coordinate on a smooth manifold (and thus
has no units), while for variance, it inherits the units and structure of the label
space. See Vos (2024) for problems when inference depends on the choice of
parameterization.
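A standard instance, not spelled out in the text, makes the reciprocal relationship concrete. For a sample of size n from N(θ, σ²) with σ known, the unbiased estimator θ̂ = X̄ satisfies

V(θ̂) = σ²/n,   Λ(θ̂) = n/σ² = 1/V(θ̂),

where n/σ² is the Fisher information for θ. The variance carries the squared units of the label space, while the information describes how quickly the densities mθ separate as θ varies.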
Third, our framework provides new insights into point estimation. When
working with continuous exponential families, we can leverage the bijection
between canonical statistics and expectation parameters to view points in the
support X as either canonical statistics or expectation parameters. In either
case, X becomes a subset of Rd , which we denote as XR . More fundamentally,
using the parameterization allows us to take X = M , viewing a point estimate
as a distribution in the model space M rather than as a real number labeling this
distribution. While M lacks the algebraic structure of the reals, concepts like
mean and variance can be defined through optimization, generalizing familiar
ideas like least squares to other exponential families using KL divergence. See
Wu & Vos (2012) and Vos & Wu (2015) for details on these distribution-valued
point estimators.
These observations point to a broader principle: the importance of maintain-
ing clear distinctions between mathematical definitions and their interpretations
in the physical and mental worlds. The fact that “random variables aren’t ran-
dom” serves as a caution about mathematical terminology in general - terms
that carry rich meaning outside mathematics may not align with their precise
mathematical definitions. This misalignment becomes particularly problematic
when it leads to reasoning about hypothetical scenarios (like infinite sequences
of samples) rather than focusing on well-defined mathematical objects.
References
Efron, B. (2024). Machine learning and the James–Stein estimator. Japanese Journal of Statistics and Data Science, 7(1), 257–266.

Fisher, R. A. (1959). Statistical methods and scientific inference (2nd ed.). Edinburgh: T. and A. Constable Ltd.

Fisher, R. A. (1960). Scientific thought and the refinement of human reasoning. 3, 1–10.

Gelman, A. & Loken, E. (2014). The statistical crisis in science. American Scientist, 102(6), 460.

Good, I. J. (1971). Letters to the editor: 46656 varieties of Bayesians. The American Statistician, 25(5), 62–63.

Penrose, R. (2007). The road to reality: A complete guide to the laws of the universe (1st Vintage Books ed.). New York: Vintage Books.

Spiegelhalter, D. (2024). Why probability probably doesn't exist (but it is useful to act like it does). Nature, 636(8043), 560–563.

Vos, P. (2022). Generalized estimators, slope, efficiency, and Fisher information bounds. Information Geometry. SharedIt link: https://rdcu.be/c0YQn.