Lecture 1.3
Note: These notes are for your personal educational use only. Please do not distribute them.
1 Introduction
We now turn to the study of estimation theory. But before delving in, it is important to understand how
this topic fits into the broader scope of statistical inference. So, we briefly situate the topics we cover in this
class within the four main paradigms of statistical decision theory. Each paradigm focuses on a particular set
of assumptions imposed on a common abstract problem model. The common abstract setting is as follows.
We have a hidden variable X ∈ 𝒳 that we cannot observe, and another random variable Y ∈ 𝒴 that we can observe, where 𝒳 and 𝒴 are standard measurable spaces, e.g., {0, 1}, R, etc. The two variables are related by a known “observation model” or “statistical experiment,” which is a collection of likelihoods of Y given X, i.e., a Markov kernel {PY|X(·|x) : x ∈ 𝒳}, where each PY|X(·|x) is a probability distribution over 𝒴.
Our goal is to infer X based on an observation of a realization of the random variable Y generated by the
above model. Clearly, the variable X may be deterministic or random, and its alphabet 𝒳 may be a finite or an infinite set. The precise nature of X and 𝒳 determines which of the aforementioned four paradigms the inference problem falls into.
In the early 1900s, the radar community was interested in models where |𝒳| < +∞ (often with |𝒳| = 2).
For instance, radar engineers would observe some measurement and have to detect if there was a signal
in the measurement, or if the measurement was just random noise. This problem could be set up as a
binary hypothesis testing problem where 𝒳 = {0 = no signal, 1 = signal} (which we studied earlier). Such inference problems with finite |𝒳| are classified under the category of detection theory. In contrast, the branch of statistics that deals with inference problems where 𝒳 is a countably or uncountably infinite set, e.g., {. . . , −2, −1, 0, 1, 2, . . . }, R, etc., is known as estimation theory. In the radar context, after detecting
an analog signal, engineers would have to approximate its value from noisy measurements. This would
correspond to a parameter estimation problem.
In the early statistics community, there was another divide among inference problems. Bayesian statis-
ticians believed that the underlying variable X was random and had a prior distribution PX . This prior
represented the statistician’s belief about X. So, the “right” way to proceed after observing Y was to com-
pute the posterior distribution PX|Y using Bayes’ rule in order to update the belief about X. In contrast,
non-Bayesian (or frequentist) statisticians did not impose such a prior over X. They argued that because X
could not be observed enough times for probabilities of X to have philosophical meaning, imposing a prior
over X was also meaningless. They assumed instead that X was just an unknown deterministic parameter.
The “matrix” below classifies the topics we cover in this class within the four paradigms of statistical decision
theory.
             | Bayesian                    | Non-Bayesian
  Detection  | Bayesian hypothesis testing | Neyman-Pearson theory
  Estimation | Bayesian least squares      | Minimax estimation
2 Bayesian Estimation
In this section, we are concerned with Bayesian estimation theory. For simplicity, we assume in the sequel
that 𝒳 = 𝒴 = R. As discussed above, in the Bayesian framework, we assume that we are given a prior probability density function (PDF) PX > 0 and the likelihoods {PY|X(·|x) > 0 : x ∈ R}, where each PY|X(·|x) is a conditional PDF of Y given X = x. Moreover, we assume that X and Y have finite second moments. Our objective is to construct a “good” estimator X̂ : R → R, so that X̂(Y) provides an estimate of
X given an observation of Y . (Note that although we work with scalar random variables X and Y here, the
analysis can be generalized for vector-valued X and Y .)
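To make this setup concrete, here is a small simulation sketch (an illustration of the framework, not part of the original notes): it assumes a hypothetical Gaussian prior and additive Gaussian noise model, and evaluates two naive baseline estimators by Monte Carlo.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical instance of the Bayesian setup (assumed for illustration):
# X ~ N(0, sigma_x^2) is the hidden variable and Y = X + N is the observation,
# with independent noise N ~ N(0, sigma_n^2).
sigma_x, sigma_n = 1.0, 0.5
n_samples = 100_000

x = rng.normal(0.0, sigma_x, size=n_samples)      # draws from the prior P_X
y = x + rng.normal(0.0, sigma_n, size=n_samples)  # draws from P_{Y|X}(.|x)

# Any estimator is a function of y alone; two naive baselines for later comparison:
print("MSE of constant estimator X_hat = 0:", np.mean(x ** 2))
print("MSE of identity estimator X_hat = Y:", np.mean((x - y) ** 2))
```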
To formally set up the problem of finding a good estimator, as in our discussion of binary hypothesis
testing, consider a loss function L : R × R → R, where L(x, x̂) represents the “loss” in inferring x̂ when the
actual realization of X is x. We would like to construct the Bayes optimal estimator X̂B : R → R that solves
the following minimization problem:
RB ≜ inf_{X̂:R→R} E[L(X, X̂(Y))] ,   (1)
where the infimum is over all estimators X̂ of X based on Y , E[L(X, X̂(Y ))] is known as the risk (where
the expectation is with respect to the joint PDF PX,Y ), and RB is known as the Bayes risk. Note that
the infimum is over all deterministic estimators, because Bayes risk is achieved by deterministic estimators
as we argued during our analysis of Bayesian hypothesis testing. So, there is no advantage in considering
randomized estimators in the minimization above.
In analogy with our analysis of binary hypothesis testing, the next proposition characterizes the Bayes
optimal estimator, cf. [1, Section 6].
Proposition 1 (Bayes Optimal Estimator). Under the aforementioned framework, suppose there exists an estimator X̂B : R → R such that:

∀y ∈ R, X̂B(y) = arg min_{u∈R} E[L(X, u) | Y = y] .

Then, X̂B is the Bayes optimal estimator (as the notation suggests).
Proof. By the law of total expectation, the risk of any estimator X̂ can be written as

E[L(X, X̂(Y))] = ∫_{−∞}^{+∞} PY(y) ∫_{−∞}^{+∞} L(x, X̂(y)) PX|Y(x|y) dx dy ,

where PX|Y(x|y) = PY|X(y|x)PX(x)/PY(y) by Bayes’ rule. Hence, we can find the optimal X̂ by minimizing the above expression pointwise for each y ∈ R. Therefore, for any y ∈ R, the Bayes optimal estimator is given by

X̂B(y) = arg min_{u∈R} ∫_{−∞}^{+∞} L(x, u) PX|Y(x|y) dx = arg min_{u∈R} E[L(X, u) | Y = y] .
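To make this pointwise minimization concrete, the following sketch (assuming a hypothetical discretized posterior, not taken from the notes) minimizes E[L(X, u) | Y = y] over a grid of candidates u for two different losses; squared loss recovers the posterior mean and absolute loss recovers the posterior median.

```python
import numpy as np

# Hypothetical discretized posterior P_{X|Y}(.|y) for one fixed y, chosen only
# for illustration: a skewed distribution supported on a grid.
x_grid = np.linspace(-3.0, 3.0, 601)
weights = np.exp(-0.5 * (x_grid - 0.7) ** 2) * (1.0 + 0.8 * (x_grid > 0.7))
posterior = weights / weights.sum()  # normalize to sum to one on the grid

def bayes_estimate(loss):
    """Minimize u -> E[L(X, u) | Y = y] over the same grid of candidate values u."""
    expected_loss = [np.sum(loss(x_grid, u) * posterior) for u in x_grid]
    return x_grid[int(np.argmin(expected_loss))]

# Squared loss recovers (approximately) the posterior mean ...
print(bayes_estimate(lambda x, u: (x - u) ** 2), np.sum(x_grid * posterior))
# ... while absolute loss recovers (approximately) the posterior median.
cdf = np.cumsum(posterior)
print(bayes_estimate(lambda x, u: np.abs(x - u)), x_grid[int(np.searchsorted(cdf, 0.5))])
```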
It is evident that different choices of loss functions lead to different Bayes optimal estimators X̂B . Perhaps
the most popular and commonly used loss function in the literature is the squared loss:

L(x, x̂) = (x − x̂)² .
The risk corresponding to the squared loss is known as the mean squared error (MSE), and the minimum
MSE (MMSE) risk is given by:

RMMSE = min_{X̂:R→R} E[(X − X̂(Y))²] .   (3)
The Bayes optimal estimator that achieves RMMSE , denoted X̂BLS : R → R, is known as the Bayes least
squares (BLS) estimator (or sometimes the MMSE estimator). The next theorem presents the well-known
result that the BLS estimator is the mean of the posterior distribution PX|Y , cf. [1, Section 6.3].
Theorem 1 (BLS Estimator). The BLS estimator is given by the conditional expectation:

∀y ∈ R, X̂BLS(y) = E[X | Y = y] ,

and the corresponding MMSE risk is RMMSE = E[var(X|Y)].
Proof. Applying Proposition 1 with the squared loss, for any y ∈ R,

X̂B(y) = arg min_{u∈R} E[(X − u)² | Y = y] = arg min_{u∈R} u(u − 2E[X|Y = y]) = E[X|Y = y] .

Moreover, the resulting MMSE risk is

RMMSE = E[(X − E[X|Y])²] = E[E[(X − E[X|Y])² | Y]] = E[var(X|Y)] .
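As a numerical sanity check of Theorem 1 (a Monte Carlo sketch under an assumed jointly Gaussian model where E[X|Y] has a simple closed form), the conditional mean attains a smaller empirical MSE than other estimators of Y:

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed jointly Gaussian model (illustration only): X ~ N(0, 1) and
# Y = X + N with N ~ N(0, 0.25), so E[X|Y] = Y / 1.25 in closed form and
# E[var(X|Y)] = 0.25 / 1.25 = 0.2.
n = 200_000
x = rng.normal(0.0, 1.0, n)
y = x + rng.normal(0.0, 0.5, n)

mse = lambda x_hat: np.mean((x - x_hat) ** 2)
print("conditional mean E[X|Y]:", mse(y / 1.25))  # ~0.2, the smallest
print("identity estimator Y   :", mse(y))         # ~0.25
print("shrunken estimator Y/2 :", mse(0.5 * y))   # ~0.3125
```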
To develop a more geometric view of least squares estimation, we regard functions of (X, Y) as elements of the Hilbert space L2(R × R, PX,Y) of functions f : R × R → R with E[f(X, Y)²] < +∞, equipped with the inner product ⟨f, g⟩ = E[f(X, Y)g(X, Y)] and the induced norm ‖f‖ = ⟨f, f⟩^{1/2} = E[f(X, Y)²]^{1/2}, which is the square root of the second moment of f. The finite second moment (or finite norm) constraint in the definition of a Hilbert space is an analytical condition that ensures that the space is complete, i.e., every Cauchy sequence converges in the space. Informally, the vector space structure of a Hilbert space
allows linear combinations of functions in the space to belong to the space, the inner product permits us
to measure distances and angles between functions in the space, and the completeness allows us to take
well-defined limits in the space. Finally, it is worth mentioning that various well-known inequalities for
Euclidean spaces carry over to this infinite dimensional Hilbert space setting. For instance, for any two
functions f, g ∈ L2 (R × R, PX,Y ), we have the well-known Cauchy-Schwarz-Bunyakovsky inequality:
|⟨f, g⟩|² = E[f(X, Y)g(X, Y)]² ≤ E[f(X, Y)²] E[g(X, Y)²] = ‖f‖² ‖g‖² .
We refer readers to [2] for a comprehensive introduction to the theory of infinite dimensional Hilbert spaces.
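As a quick numerical illustration of the inequality (not in the notes), one can replace the expectations by sample averages for two arbitrary square-integrable test functions:

```python
import numpy as np

rng = np.random.default_rng(2)

# Arbitrary correlated (X, Y) pair and two test functions f, g of (X, Y),
# chosen only to check the inequality with Monte Carlo second moments.
x = rng.normal(size=50_000)
y = 0.6 * x + 0.8 * rng.normal(size=50_000)
f, g = x + y, x * y

lhs = np.mean(f * g) ** 2                # |<f, g>|^2
rhs = np.mean(f ** 2) * np.mean(g ** 2)  # ||f||^2 ||g||^2
print(lhs <= rhs, lhs, rhs)
```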
Let S be a closed linear subspace of L2 (R × R, PX,Y ) (i.e., a sub-Hilbert space of L2 (R × R, PX,Y )). This
means that S is a non-empty subset of L2 (R × R, PX,Y ) that is itself a Hilbert space with the same inherited
inner product. Then, we have the following orthogonality principle, cf. [2, Lemma 4.1], which can be shown to follow from the Hilbert projection theorem in convex analysis.

Theorem 2 (Orthogonality Principle). Let g ∈ L2(R × R, PX,Y) and let S be a closed linear subspace of L2(R × R, PX,Y). Then, there exists a unique solution h = arg min_{f∈S} ‖g − f‖², and h ∈ S solves this minimization if and only if

∀f ∈ S, ⟨g − h, f⟩ = 0 .
Proof. Since you are not required to know functional analysis, we omit the analytical details that guarantee the existence and uniqueness of h as the solution to the extremization min_{f∈S} ‖g − f‖²; see, e.g., [2, Lemma 4.1] if interested.
To prove the forward direction, consider the function h − εf ∈ S for any fixed f ∈ S and ε ∈ R\{0}, and observe using h = arg min_{f∈S} ‖g − f‖² that:

‖g − h‖² ≤ ‖g − h + εf‖² = ‖g − h‖² + 2ε⟨g − h, f⟩ + ε²‖f‖² .

Rearranging gives 0 ≤ 2ε⟨g − h, f⟩ + ε²‖f‖² for every ε ≠ 0; dividing by ε and letting ε → 0 from above and from below yields ⟨g − h, f⟩ ≥ 0 and ⟨g − h, f⟩ ≤ 0, respectively, so that ⟨g − h, f⟩ = 0 for every f ∈ S.

To prove the converse direction, suppose that ⟨g − h, f⟩ = 0 for all f ∈ S. Then, for any f ∈ S,

‖g − f‖² = ‖(g − h) + (h − f)‖² = ‖g − h‖² + 2⟨g − h, h − f⟩ + ‖h − f‖² = ‖g − h‖² + ‖h − f‖² ≥ ‖g − h‖² ,

where the third equality follows from ⟨g − h, h − f⟩ = 0 since h − f ∈ S. This completes the proof.
Geometrically, this principle states that given a function g ∈ L2 (R × R, PX,Y ), the closest function to g
in a closed linear subspace S is a function h ∈ S such that the error h − g is orthogonal to the subspace S.
As we will see, this principle provides a unified framework to characterize the BLS and related constrained
MMSE estimators, such as the linear least squares estimator.
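A minimal finite-dimensional analogue of this principle (an illustration, not from the notes): projecting a vector g onto the column space of a matrix A via least squares produces a residual that is orthogonal to every element of that subspace.

```python
import numpy as np

rng = np.random.default_rng(3)

# The subspace S is the column space of A; g is an arbitrary vector to be
# approximated by an element of S.
A = rng.normal(size=(100, 3))
g = rng.normal(size=100)

# h = arg min_{f in S} ||g - f||^2 via least squares (the projection onto S).
coeffs, *_ = np.linalg.lstsq(A, g, rcond=None)
h = A @ coeffs

# Orthogonality principle: the error g - h is orthogonal to every column of A,
# hence to all of S (all entries printed below are ~0 up to roundoff).
print(A.T @ (g - h))
```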
Now let S be a closed linear subspace of L2(R × R, PX,Y) whose elements serve as a class of admissible estimators, and consider the constrained MMSE problem of minimizing E[(X − X̂(Y))²] over X̂ ∈ S, where each X̂ ∈ S is a function of Y only, and the definition of the subspace S imposes further constraints on the estimators. The solution to this constrained MMSE problem is the unique estimator

X̂S = arg min_{X̂∈S} E[(X − X̂(Y))²] .   (6)

Applying Theorem 2 with g(X, Y) = X shows that X̂S is the unique element of S satisfying

∀f ∈ S, E[(X̂S(Y) − X)f(Y)] = 0 .
Intuitively, the optimal estimator X̂S is characterized by the property that its error X̂S(Y) − X is orthogonal to (or uncorrelated with) every function of Y in S. Equivalently, X̂S(Y) is the projection of X onto the subspace
S. Let us use this key idea to establish orthogonality characterizations of BLS and related estimators as
corollaries of Theorem 2.
Recall that the BLS estimator is defined as the solution to the unconstrained MSE minimization in (3), i.e.,

X̂BLS = arg min_{X̂:R→R} E[(X − X̂(Y))²] .   (8)

Since this MSE minimization is over all estimators or functions of Y, consider the sub-Hilbert space

S = L2(R, PY) ≜ { f : R → R : E[f(Y)²] < +∞ } ,   (9)
which contains all real-valued functions f (y) that only depend on Y and have finite second moment. It is
straightforward to verify that S is a closed subspace of L2 (R × R, PX,Y ). (For example, linear combinations
of functions of Y are also functions of Y .) Then, applying Theorem 2 yields the following orthogonality
characterization of BLS estimators, cf. [1, Section 6.3.2].
Proposition 2 (Orthogonality Characterization of BLS Estimator). An estimator X̂ : R → R satisfies the
orthogonality property:

∀f ∈ L2(R, PY), E[(X̂(Y) − X)f(Y)] = 0 ,
if and only if it is the BLS estimator X̂(y) = X̂BLS (y) = E[X|Y = y] for all y ∈ R.
Proof. As before, let g(X, Y ) = X be the function we seek to estimate. Then, Theorem 2 provides the fol-
lowing orthogonality characterization of BLS estimators: An estimator X̂ : R → R satisfies the orthogonality
property:

∀f ∈ L2(R, PY), E[(X̂(Y) − X)f(Y)] = 0 ,   (10)
if and only if it is the BLS estimator, X̂(y) = X̂BLS (y) for all y ∈ R, defined in (8). Furthermore, we can
derive the explicit conditional expectation form of the BLS estimator from this orthogonality characterization.
Indeed, for every f : R → R with finite second moment, we have using (10) and the tower property that

E[X̂BLS(Y)f(Y)] = E[Xf(Y)] = E[E[X|Y] f(Y)] ,

and choosing f(Y) = X̂BLS(Y) − E[X|Y] then gives E[(X̂BLS(Y) − E[X|Y])²] = 0, so that X̂BLS(Y) = E[X|Y] almost surely.
Intuitively, Proposition 2 conveys that the BLS estimator is characterized by having an error X̂BLS (Y )−X
that is orthogonal to the subspace of all functions of Y . Equivalently, the BLS estimator is the projection
of X onto the subspace of all functions of Y .
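The orthogonality property in Proposition 2 can be spot-checked numerically (a sketch under an assumed Gaussian model where E[X|Y] = 0.8 Y is known in closed form; the test functions are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(4)

# Assumed Gaussian model (illustration only): X ~ N(0, 1), Y = X + N with
# N ~ N(0, 0.25), so the BLS estimator is E[X|Y] = 0.8 * Y.
n = 500_000
x = rng.normal(0.0, 1.0, n)
y = x + rng.normal(0.0, 0.5, n)
error = 0.8 * y - x  # X_BLS(Y) - X

# E[(X_BLS(Y) - X) f(Y)] should be ~0 for every square-integrable f(Y).
for f in (y, y ** 2, np.sin(y), np.exp(-(y ** 2))):
    print(np.mean(error * f))
```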
Next, fix the sub-Hilbert space

S = { f : R → R : f(y) = ay + b for some a, b ∈ R }   (11)

of all affine functions f(y) that only depend on Y. It is straightforward to verify that S is a closed subspace of L2(R × R, PX,Y) using the fact that Y has finite second moment E[Y²] < +∞. Fixing S to be our class
of admissible estimators, we define the linear least squares (LLS) estimation problem as:
min_{X̂∈S} E[(X − X̂(Y))²] = min_{a,b∈R} E[(X − aY − b)²] ,   (12)

whose solution is the LLS estimator

X̂LLS = arg min_{X̂∈S} E[(X − X̂(Y))²] .   (13)
Note that we are essentially performing MMSE estimation over a smaller, more constrained, subspace here.
Then, applying Theorem 2 yields the following orthogonality characterization of LLS estimators, cf. [1,
Section 7].
Proposition 3 (Orthogonality Characterization of LLS Estimator). An affine estimator X̂ ∈ S satisfies the orthogonality property

∀f ∈ S, E[(X̂(Y) − X)f(Y)] = 0 ,

if and only if it is the LLS estimator, which is given by

∀y ∈ R, X̂LLS(y) = E[X] + (cov(X, Y)/var(Y)) (y − E[Y]) ,
where cov(X, Y ) = E[(X − E[X])(Y − E[Y ])] is the covariance of X and Y , and var(Y ) = cov(Y, Y ) is the
variance of Y . Moreover, the LLS estimator achieves an MMSE risk of
E[(X − X̂LLS(Y))²] = var(X) − cov(X, Y)²/var(Y) .
Proof. As before, let g(X, Y ) = X be the function we seek to estimate. Then, Theorem 2 provides the
following orthogonality characterization of LLS estimators: An affine estimator X̂ : R → R satisfies the
orthogonality property:

∀f ∈ S, E[(X̂(Y) − X)f(Y)] = E[(X̂(Y) − X)(aY + b)] = 0 ,   (14)
where a, b ∈ R are the parameters that define f (y) = ay + b for y ∈ R, if and only if it is the LLS estimator,
X̂(y) = X̂LLS (y) for all y ∈ R, defined in (13). It remains to prove the explicit form of the LLS estimator
from this orthogonality characterization. (Note that the explicit form could also be derived through direct
optimization of (12).) Let X̂LLS(y) = cy + d for some c, d ∈ R. Then, letting (a, b) = (0, 1) and (a, b) = (1, 0) in (14), we respectively get

cE[Y] + d = E[X] ,
cE[Y²] + dE[Y] = E[XY] .
Solving these equations produces c = cov(X, Y )/var(Y ) and d = E[X] − (cov(X, Y )/var(Y ))E[Y ], which
establish the form of X̂LLS in the proposition statement. Finally, the MMSE risk of X̂LLS follows from direct
calculation:

E[(X − X̂LLS(Y))²] = E[((X − E[X]) − (cov(X, Y)/var(Y)) (Y − E[Y]))²]
                  = var(X) + cov(X, Y)²/var(Y) − 2 cov(X, Y)²/var(Y)
                  = var(X) − cov(X, Y)²/var(Y) .
Intuitively, Proposition 3 conveys that the LLS estimator is characterized by having an error X̂LLS (Y )−X
that is orthogonal to the subspace of all affine (or linear) functions of Y . Equivalently, the LLS estimator is the
projection of X onto the subspace of all affine functions of Y . Finally, we mention two more remarks. Firstly,
the LLS estimator can be computed using only first and second order moments of the joint distribution PX,Y ;
these quantities can often be much easier to obtain than the full joint distribution PX,Y itself. Secondly,
when X, Y are jointly Gaussian random variables, the BLS and LLS estimators coincide; we leave this as an
exercise for the reader.
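To illustrate these two closing remarks, here is a sketch under an assumed non-Gaussian model in which the BLS estimator is nonlinear: the LLS estimator is built from sample first- and second-order moments only, and its MSE is compared against that of the conditional mean (in a jointly Gaussian model the two would coincide).

```python
import numpy as np

rng = np.random.default_rng(5)

# Assumed non-Gaussian example (illustration only): X ~ N(0, 1) and Y = X^3,
# so the BLS estimator is the nonlinear map E[X|Y] = Y^(1/3), while the LLS
# estimator must remain affine in Y.
n = 500_000
x = rng.normal(0.0, 1.0, n)
y = x ** 3

# The LLS estimator needs only first- and second-order moments of (X, Y).
C = np.cov(x, y)                 # sample covariance matrix of (X, Y)
a = C[0, 1] / C[1, 1]            # cov(X, Y) / var(Y)
b = np.mean(x) - a * np.mean(y)  # E[X] - a * E[Y]

mse = lambda x_hat: np.mean((x - x_hat) ** 2)
print("LLS MSE:", mse(a * y + b))   # strictly positive (best affine fit only)
print("BLS MSE:", mse(np.cbrt(y)))  # ~0 here, since E[X|Y] = Y^(1/3) exactly
```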
References
[1] G. W. Wornell, “Inference and information,” May 2017, Department of Electrical Engineering and Com-
puter Science, MIT, Cambridge, MA, USA, Lecture Notes 6.437.
[2] E. M. Stein and R. Shakarchi, Real Analysis: Measure Theory, Integration, and Hilbert Spaces, ser.
Princeton Lectures in Analysis. Princeton, NJ, USA: Princeton University Press, 2005, vol. 3.