
CS 57800 Statistical Machine Learning, Prof. Anuran Makur, Spring 2022

Lecture 1.3 Bayesian Estimation

Anuran Makur

Note: These notes are for your personal educational use only. Please do not distribute them.

1 Introduction
We now turn to the study of estimation theory. But before delving in, it is important to understand how
this topic fits into the broader scope of statistical inference. So, we briefly situate the topics we cover in this
class within the four main paradigms of statistical decision theory. Each paradigm focuses on a particular set
of assumptions imposed on a common abstract problem model. The common abstract setting is as follows.
We have a hidden variable X ∈ X that we cannot observe, and another random variable Y ∈ Y that we can
observe, where X and Y are standard measurable spaces, e.g., {0, 1}, R, etc. The two variables are related
by a known “observation model” or “statistical experiment,” which is a collection of likelihoods of Y given
X, i.e., a Markov kernel {PY |X (·|x) : x ∈ X }, where each PY |X (·|x) is a probability distribution over Y.
Our goal is to infer X based on an observation of a realization of the random variable Y generated by the
above model. Clearly, the variable X may be deterministic or random, and its alphabet X may be a finite
or an infinite set. The precise nature of X and its alphabet X determines which of the aforementioned four paradigms
the inference problem falls into.
In the early 1900’s, the radar community was interested in models where |X | < +∞ (often with |X | = 2).
For instance, radar engineers would observe some measurement and have to detect if there was a signal
in the measurement, or if the measurement was just random noise. This problem could be set up as a
binary hypothesis testing problem where X = {0 = no signal, 1 = signal} (which we studied earlier). Such
inference problems with finite |X | are classified under the category of detection theory. In contrast, the
branch of statistics that deals with inference problems where X is a countably or uncountably infinite set,
e.g., {. . . , −2, −1, 0, 1, 2, . . . }, R, etc., is known as estimation theory. In the radar context, after detecting
an analog signal, engineers would have to approximate its value from noisy measurements. This would
correspond to a parameter estimation problem.
In the early statistics community, there was another divide among inference problems. Bayesian statis-
ticians believed that the underlying variable X was random and had a prior distribution PX . This prior
represented the statistician’s belief about X. So, the “right” way to proceed after observing Y was to com-
pute the posterior distribution PX|Y using Bayes’ rule in order to update the belief about X. In contrast,
non-Bayesian (or frequentist) statisticians did not impose such a prior over X. They argued that because X
could not be observed enough times for probabilities of X to have philosophical meaning, imposing a prior
over X was also meaningless. They assumed instead that X was just an unknown deterministic parameter.
The “matrix” below classifies the topics we cover in this class within the four paradigms of statistical decision
theory.

                 Bayesian                        Non-Bayesian
    Detection    Bayesian hypothesis testing     Neyman-Pearson theory
    Estimation   Bayesian least squares          Minimax estimation

2 Bayesian Estimation
In this section, we are concerned with Bayesian estimation theory. For simplicity, we assume in the sequel
that X = Y = R. As discussed above, in the Bayesian framework, we assume that we are given a prior
probability density function (PDF) PX > 0 and the likelihoods {PY |X (·|x) > 0 : x ∈ R}, where each
PY |X (·|x) is a conditional PDF of Y given X = x. Moreover, we assume that X and Y have finite second
moments. Our objective is to construct a “good” estimator X̂ : R → R, where X̂(Y) provides an estimate of
X given an observation of Y. (Note that although we work with scalar random variables X and Y here, the
analysis can be generalized to vector-valued X and Y.)


To formally set up the problem of finding a good estimator, as in our discussion of binary hypothesis
testing, consider a loss function L : R × R → R, where L(x, x̂) represents the “loss” in inferring x̂ when the
actual realization of X is x. We would like to construct the Bayes optimal estimator X̂B : R → R that solves
the following minimization problem:
    RB ≜ inf_{X̂ : R→R} E[L(X, X̂(Y))] ,    (1)

where the infimum is over all estimators X̂ of X based on Y , E[L(X, X̂(Y ))] is known as the risk (where
the expectation is with respect to the joint PDF PX,Y ), and RB is known as the Bayes risk. Note that
the infimum is over all deterministic estimators, because Bayes risk is achieved by deterministic estimators
as we argued during our analysis of Bayesian hypothesis testing. So, there is no advantage in considering
randomized estimators in the minimization above.
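For concreteness, the following short Python sketch (our addition, not part of the original development) approximates the risk E[L(X, X̂(Y))] of a fixed candidate estimator by sampling from the joint distribution PX,Y. The Gaussian prior, the Gaussian noise level, and the candidate X̂(y) = y are arbitrary illustrative choices.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 200_000

    # Hypothetical Bayesian model (illustration only):
    # prior X ~ N(0, 1), likelihood Y | X = x ~ N(x, 0.5^2).
    x = rng.normal(0.0, 1.0, size=n)        # samples from the prior P_X
    y = x + rng.normal(0.0, 0.5, size=n)    # samples from P_{Y|X}(.|x)

    def squared_loss(x, xhat):
        """Squared loss L(x, xhat) = (x - xhat)^2."""
        return (x - xhat) ** 2

    x_hat = y   # a simple (suboptimal) candidate estimator X_hat(Y) = Y

    # Monte Carlo approximation of the risk E[L(X, X_hat(Y))] under P_{X,Y}.
    risk = squared_loss(x, x_hat).mean()
    print(f"estimated risk of X_hat(Y) = Y : {risk:.4f}")

For this particular candidate the printed value is close to 0.25, since X − X̂(Y) = X − Y is just the observation noise, whose variance is 0.25 in this hypothetical model.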
In analogy with our analysis of binary hypothesis testing, the next proposition characterizes the Bayes
optimal estimator, cf. [1, Section 6].

Proposition 1 (Bayes Optimal Estimator). Under the aforementioned framework, suppose there exists an
estimator X̂B : R → R such that:

    ∀y ∈ R,  X̂B(y) = arg min_{u∈R} E[L(X, u) | Y = y] .

Then, X̂B is the Bayes optimal estimator (as the notation suggests).

Proof. For any estimator X̂ : R → R, observe that


    E[L(X, X̂(Y))] = ∫_{−∞}^{+∞} ∫_{−∞}^{+∞} L(x, X̂(y)) PY|X(y|x) PX(x) dx dy
                  = ∫_{−∞}^{+∞} PY(y) ( ∫_{−∞}^{+∞} L(x, X̂(y)) PX|Y(x|y) dx ) dy ,

where PX|Y(x|y) = PY|X(y|x) PX(x)/PY(y) by Bayes' rule. Hence, we can find the optimal X̂ by minimizing
the inner integral pointwise for each y ∈ R. Therefore, for any y ∈ R, the Bayes optimal estimator is
given by

    X̂B(y) = arg min_{u∈R} ∫_{−∞}^{+∞} L(x, u) PX|Y(x|y) dx = arg min_{u∈R} E[L(X, u) | Y = y]

when it exists. This completes the proof.
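To see Proposition 1 in action, the following sketch (our addition) discretizes a hypothetical Gaussian prior and likelihood onto a grid, forms the posterior via Bayes' rule, and minimizes the posterior expected squared loss over a grid of candidate values u. The grid minimizer coincides, up to grid resolution, with the posterior mean, anticipating Theorem 1 below.

    import numpy as np

    # Hypothetical model (illustration only): X ~ N(0, 1), Y | X = x ~ N(x, 0.5^2).
    x_grid = np.linspace(-5, 5, 2001)      # discretization of the hidden variable X
    prior = np.exp(-x_grid**2 / 2)         # unnormalized prior P_X on the grid
    y_obs = 1.3                            # a particular observed realization of Y

    likelihood = np.exp(-(y_obs - x_grid)**2 / (2 * 0.5**2))   # P_{Y|X}(y_obs | x)
    posterior = prior * likelihood
    posterior /= posterior.sum()           # normalized posterior P_{X|Y}(. | y_obs)

    # Pointwise minimization of E[L(X, u) | Y = y_obs] over candidates u,
    # with the squared loss L(x, u) = (x - u)^2.
    u_grid = np.linspace(-5, 5, 2001)
    exp_loss = ((x_grid[None, :] - u_grid[:, None])**2 * posterior[None, :]).sum(axis=1)
    u_star = u_grid[np.argmin(exp_loss)]

    posterior_mean = (x_grid * posterior).sum()
    print(f"grid minimizer of posterior expected loss : {u_star:.4f}")
    print(f"posterior mean                            : {posterior_mean:.4f}")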

It is evident that different choices of loss functions lead to different Bayes optimal estimators X̂B . Perhaps
the most popular and commonly used loss function in the literature is the squared loss:

    ∀x, x̂ ∈ R,  L(x, x̂) = (x − x̂)² .    (2)

The risk corresponding to the squared loss is known as the mean squared error (MSE), and the minimum
MSE (MMSE) risk is given by:

    RMMSE = min_{X̂ : R→R} E[(X − X̂(Y))²] .    (3)

The Bayes optimal estimator that achieves RMMSE , denoted X̂BLS : R → R, is known as the Bayes least
squares (BLS) estimator (or sometimes the MMSE estimator). The next theorem presents the well-known
result that the BLS estimator is the mean of the posterior distribution PX|Y , cf. [1, Section 6.3].

Theorem 1 (BLS Estimator). The BLS estimator is given by the conditional expectation:

∀y ∈ R, X̂BLS (y) = E[X|Y = y] ,


and it achieves an MMSE risk of

    RMMSE = E[(X − X̂BLS(Y))²] = E[var(X|Y)] ,

where var(X|Y = y) = E[(X − E[X|Y = y])² | Y = y] is the variance of the PDF PX|Y(·|y) for any y ∈ R.
Moreover, it is an unbiased estimator:

    E[X̂BLS(Y)] = E[X] .

Proof. To prove this, notice using Proposition 1 that for all y ∈ R,

    X̂B(y) = arg min_{u∈R} E[(X − u)² | Y = y] = arg min_{u∈R} u(u − 2E[X|Y = y]) = E[X|Y = y] ,

where the second equality follows by expanding the square and dropping the term E[X² | Y = y], which does not depend on u.

Hence, X̂BLS = X̂B given above. Moreover, we have

    RMMSE = E[(X − E[X|Y])²]
          = E[E[(X − E[X|Y])² | Y]]
          = E[var(X|Y)]

as desired. Finally, notice using the tower property that

    E[X̂BLS(Y)] = E[E[X|Y]] = E[X] ,

which completes the proof.
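As a numerical check of Theorem 1 (our addition), consider the hypothetical model X ~ N(0, σx²) and Y = X + W with W ~ N(0, σw²) independent of X; for this model it is a standard fact that E[X|Y = y] = (σx²/(σx² + σw²)) y and var(X|Y = y) = σx² σw²/(σx² + σw²). The sketch below verifies by Monte Carlo that the MSE of the conditional mean matches E[var(X|Y)], that the estimator is unbiased, and that another estimator does worse.

    import numpy as np

    rng = np.random.default_rng(1)
    n = 500_000
    sx2, sw2 = 1.0, 0.25    # hypothetical prior and noise variances (assumptions)

    x = rng.normal(0.0, np.sqrt(sx2), size=n)
    y = x + rng.normal(0.0, np.sqrt(sw2), size=n)

    # BLS estimator E[X|Y] and posterior variance for this Gaussian model.
    x_bls = (sx2 / (sx2 + sw2)) * y
    post_var = sx2 * sw2 / (sx2 + sw2)      # var(X | Y = y), constant in y here

    mse_bls = np.mean((x - x_bls) ** 2)
    print(f"Monte Carlo MSE of BLS estimator : {mse_bls:.4f}")
    print(f"E[var(X|Y)] (theory)             : {post_var:.4f}")
    print(f"E[X_BLS(Y)] = {x_bls.mean():.4f}  vs  E[X] = {x.mean():.4f}")     # unbiasedness
    print(f"Monte Carlo MSE of X_hat(Y) = Y  : {np.mean((x - y) ** 2):.4f}")  # worse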


It turns out that MMSE estimation, and consequently, the BLS estimator outlined above, has an elegant
underlying geometry. The next section introduces the relevant mathematical machinery from functional
analysis to elucidate this geometric structure.

3 Hilbert Spaces and the Orthogonality Principle



Let L2(R × R, PX,Y) ≜ {f : R × R → R | E[f(X, Y)²] < +∞} denote the Hilbert space of real-valued functions
f(x, y) that have finite second moment (over the field R). Recall that a Hilbert space is a set of functions
(which play the role of “vectors”) that forms a vector space and is endowed with an inner product; so, the
set of functions forms an inner product space. Specifically, we endow L2(R × R, PX,Y) with the correlation
as inner product:

    ∀f, g ∈ L2(R × R, PX,Y),  ⟨f, g⟩ ≜ E[f(X, Y) g(X, Y)] ,    (4)
where the expectation is taken with respect to the joint distribution PX,Y . It is straightforward to verify
that this is indeed an inner product by checking the inner product axioms:

1. (Positive definiteness) ⟨f, f⟩ = E[f(X, Y)²] ≥ 0, with equality if and only if f = f0, where the almost everywhere zero function, f0(x, y) = 0 for x, y ∈ R, represents the zero vector in L2(R × R, PX,Y),

2. (Symmetry) ⟨f, g⟩ = E[f(X, Y)g(X, Y)] = E[g(X, Y)f(X, Y)] = ⟨g, f⟩,

3. (Linearity) ⟨af + bg, h⟩ = E[(af(X, Y) + bg(X, Y))h(X, Y)] = aE[f(X, Y)h(X, Y)] + bE[g(X, Y)h(X, Y)] = a⟨f, h⟩ + b⟨g, h⟩,

where f, g, h ∈ L2(R × R, PX,Y) and a, b ∈ R. Furthermore, this inner product induces the norm:

    ∀f ∈ L2(R × R, PX,Y),  ‖f‖ ≜ ⟨f, f⟩^(1/2) = E[f(X, Y)²]^(1/2) ,

which is the square root of the second moment of f . The finite second moment (or finite norm) constraint
in the definition of a Hilbert space is an analytical condition that ensures that the space is complete, i.e.,
every Cauchy sequence converges in the space. Informally, the vector space structure of a Hilbert space


allows linear combinations of functions in the space to belong to the space, the inner product permits us
to measure distances and angles between functions in the space, and the completeness allows us to take
well-defined limits in the space. Finally, it is worth mentioning that various well-known inequalities for
Euclidean spaces carry over to this infinite dimensional Hilbert space setting. For instance, for any two
functions f, g ∈ L2 (R × R, PX,Y ), we have the well-known Cauchy-Schwarz-Bunyakovsky inequality:
    |⟨f, g⟩|² = E[f(X, Y)g(X, Y)]² ≤ E[f(X, Y)²] E[g(X, Y)²] = ‖f‖² ‖g‖² ,

as well as the triangle inequality:

    ‖f + g‖ = E[(f(X, Y) + g(X, Y))²]^(1/2) ≤ E[f(X, Y)²]^(1/2) + E[g(X, Y)²]^(1/2) = ‖f‖ + ‖g‖ .

We refer readers to [2] for a comprehensive introduction to the theory of infinite dimensional Hilbert spaces.
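As a quick numerical illustration (our addition, with an arbitrarily chosen joint distribution and arbitrary functions f and g), the sketch below approximates the inner products and norms above by sample averages and confirms both inequalities.

    import numpy as np

    rng = np.random.default_rng(2)
    n = 200_000

    # Hypothetical joint distribution P_{X,Y} (illustration only).
    x = rng.normal(size=n)
    y = x + 0.5 * rng.normal(size=n)

    # Two arbitrary functions in L2(R x R, P_{X,Y}).
    f = x * y
    g = np.sin(x) + y

    inner = np.mean(f * g)                 # <f, g> = E[f(X,Y) g(X,Y)]
    norm_f = np.sqrt(np.mean(f ** 2))      # ||f||
    norm_g = np.sqrt(np.mean(g ** 2))      # ||g||

    print(f"|<f,g>|   = {abs(inner):.4f}  <=  ||f|| ||g||   = {norm_f * norm_g:.4f}")
    print(f"||f + g|| = {np.sqrt(np.mean((f + g) ** 2)):.4f}"
          f"  <=  ||f|| + ||g|| = {norm_f + norm_g:.4f}")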
Let S be a closed linear subspace of L2 (R × R, PX,Y ) (i.e., a sub-Hilbert space of L2 (R × R, PX,Y )). This
means that S is a non-empty subset of L2 (R × R, PX,Y ) that is itself a Hilbert space with the same inherited
inner product. Then, we have the following orthogonality principle, cf. [2, Lemma 4.1], which can be shown
to follow from the Hilbert projection theorem in convex analysis.

Theorem 2 (Orthogonality Principle). Given g ∈ L2(R × R, PX,Y), we have

    h = arg min_{f∈S} ‖g − f‖² = arg min_{f∈S} E[(g(X, Y) − f(X, Y))²]

if and only if for every f ∈ S,

    ⟨h − g, f⟩ = E[(h(X, Y) − g(X, Y))f(X, Y)] = 0 .

Proof. Since you are not required to know functional analysis, we omit analytical details that guarantee the
existence and uniqueness of h as the solution to the extremization min_{f∈S} ‖g − f‖²; see, e.g., [2, Lemma 4.1]
if interested.
To prove the forward direction, consider the function h − εf ∈ S for any fixed f ∈ S and ε ∈ R\{0}, and
observe using h = arg min_{f∈S} ‖g − f‖² that:

    ‖g − h‖² ≤ ‖g − h + εf‖² = ‖g − h‖² + 2ε⟨g − h, f⟩ + ε²‖f‖² ,

which implies that:

    2ε⟨g − h, f⟩ + ε²‖f‖² ≥ 0 .

If ⟨g − h, f⟩ > 0, then taking ε to be small enough (in magnitude) and negative contradicts the non-negativity above. Likewise, if ⟨g − h, f⟩ < 0, then taking ε to be small enough and positive contradicts the non-negativity above. Hence, we must have ⟨g − h, f⟩ = 0 for every f ∈ S.
To prove the converse direction, note that for every f ∈ S:

    ‖g − f‖² = ‖g − h + h − f‖²
             = ‖g − h‖² + 2⟨g − h, h − f⟩ + ‖h − f‖²
             = ‖g − h‖² + ‖h − f‖²
             ≥ ‖g − h‖² ,

where the third equality follows from ⟨g − h, h − f⟩ = 0 since h − f ∈ S. This completes the proof.

Geometrically, this principle states that given a function g ∈ L2 (R × R, PX,Y ), the closest function to g
in a closed linear subspace S is a function h ∈ S such that the error h − g is orthogonal to the subspace S.
As we will see, this principle provides a unified framework to characterize the BLS and related constrained
MMSE estimators, such as the linear least squares estimator.
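The same geometry is easy to see in a finite-dimensional analogue. The sketch below (our addition) projects a vector g ∈ R^n onto a subspace S spanned by the columns of a matrix using least squares, and checks that the residual h − g is orthogonal to every spanning vector, mirroring the condition ⟨h − g, f⟩ = 0 for all f ∈ S.

    import numpy as np

    rng = np.random.default_rng(3)
    n, k = 50, 3

    A = rng.normal(size=(n, k))     # columns span a k-dimensional subspace S of R^n
    g = rng.normal(size=n)          # vector to be projected onto S

    # h = arg min_{f in S} ||g - f||^2, computed via least squares over f = A c.
    c, *_ = np.linalg.lstsq(A, g, rcond=None)
    h = A @ c

    # Orthogonality principle: the error h - g is orthogonal to each column of A
    # (and hence to every linear combination of them, i.e., to all of S).
    print("inner products <h - g, a_j> :", A.T @ (h - g))   # numerically ~ 0
    print("squared distance ||g - h||^2:", np.sum((g - h) ** 2))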


4 Geometry of MMSE Estimation


We now elucidate the geometry of the BLS and linear least squares estimators using the orthogonality
principle in Theorem 2. To this end, suppose g(X, Y ) = X is the function we seek to estimate, and S is
some subspace of admissible estimator functions X̂(Y ), which only includes functions X̂ that depend on Y .
Then, the minimization in Theorem 2 corresponds to the constrained MMSE problem:
    min_{X̂∈S} E[(X − X̂(Y))²] ,    (5)

where each X̂ ∈ S is a function of Y only, and the definition of the subspace S imposes further constraints
on the estimators. The solution to this constrained MMSE problem is the unique estimator
    X̂S = arg min_{X̂∈S} E[(X − X̂(Y))²]    (6)

that satisfies the orthogonality principle:

    ∀f ∈ S,  E[(X̂S(Y) − X)f(Y)] = 0 .    (7)

Intuitively, the optimal estimator X̂S is characterized by the property that its error X̂S(Y) − X is orthogonal
to (or uncorrelated with) every admissible function f(Y) with f ∈ S. Equivalently, X̂S(Y) is the projection of X onto the subspace
S. Let us use this key idea to establish orthogonality characterizations of BLS and related estimators as
corollaries of Theorem 2.

4.1 BLS Estimator


Recall that the BLS estimator X̂BLS is the solution to the optimization problem:
    X̂BLS = arg min_{X̂ : R→R} E[(X − X̂(Y))²] .    (8)

Since this MSE minimization is over all estimators or functions of Y , consider the sub-Hilbert space

    S = L2(R, PY) ≜ {f : R → R | E[f(Y)²] < +∞} ,    (9)

which contains all real-valued functions f (y) that only depend on Y and have finite second moment. It is
straightforward to verify that S is a closed subspace of L2 (R × R, PX,Y ). (For example, linear combinations
of functions of Y are also functions of Y .) Then, applying Theorem 2 yields the following orthogonality
characterization of BLS estimators, cf. [1, Section 6.3.2].
Proposition 2 (Orthogonality Characterization of BLS Estimator). An estimator X̂ : R → R satisfies the
orthogonality property:

    ∀f ∈ L2(R, PY),  E[(X̂(Y) − X)f(Y)] = 0 ,

if and only if it is the BLS estimator X̂(y) = X̂BLS (y) = E[X|Y = y] for all y ∈ R.
Proof. As before, let g(X, Y) = X be the function we seek to estimate. Then, Theorem 2 provides the following orthogonality characterization of BLS estimators: An estimator X̂ : R → R satisfies the orthogonality property:

    ∀f ∈ L2(R, PY),  E[(X̂(Y) − X)f(Y)] = 0 ,    (10)

if and only if it is the BLS estimator, X̂(y) = X̂BLS(y) for all y ∈ R, defined in (8). Furthermore, we can
derive the explicit conditional expectation form of the BLS estimator from this orthogonality characterization.
Indeed, for every f : R → R with finite second moment, we have using (10) and the tower property that

    E[X̂BLS(Y)f(Y)] = E[Xf(Y)] = E[E[X|Y] f(Y)] ,

which implies that

    E[(X̂BLS(Y) − E[X|Y])f(Y)] = 0 .

Then, setting f(Y) = X̂BLS(Y) − E[X|Y], we get E[(X̂BLS(Y) − E[X|Y])²] = 0. Hence, X̂BLS(Y) = E[X|Y]
almost surely, since the second moment of a random variable vanishes if and only if it is zero almost surely.
This completes the proof.

Intuitively, Proposition 2 conveys that the BLS estimator is characterized by having an error X̂BLS (Y )−X
that is orthogonal to the subspace of all functions of Y . Equivalently, the BLS estimator is the projection
of X onto the subspace of all functions of Y .
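The sketch below (our addition) checks Proposition 2 empirically in the same hypothetical Gaussian model as before, X ~ N(0, σx²) and Y = X + W, where E[X|Y = y] = (σx²/(σx² + σw²)) y is the standard closed form: the BLS error is approximately uncorrelated with several test functions of Y, whereas the error of an arbitrarily chosen affine estimator is not.

    import numpy as np

    rng = np.random.default_rng(4)
    n = 500_000
    sx2, sw2 = 1.0, 0.25                   # hypothetical variances (assumptions)

    x = rng.normal(0.0, np.sqrt(sx2), size=n)
    y = x + rng.normal(0.0, np.sqrt(sw2), size=n)

    x_bls = (sx2 / (sx2 + sw2)) * y        # E[X|Y] for this Gaussian model
    x_other = 0.5 * y + 0.3                # some other (suboptimal) affine estimator

    # Test functions f(Y) with finite second moment.
    tests = {"1": np.ones(n), "Y": y, "Y^2": y**2, "sin(Y)": np.sin(y)}

    for name, f in tests.items():
        e_bls = np.mean((x_bls - x) * f)       # E[(X_BLS(Y) - X) f(Y)], ~ 0
        e_other = np.mean((x_other - x) * f)   # generally nonzero
        print(f"f = {name:6s}  BLS: {e_bls:+.4f}   other: {e_other:+.4f}")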

4.2 Linear Least Squares Estimator


Although BLS estimators achieve MMSE, they can sometimes be difficult to calculate due to imperfect
knowledge about the underlying joint distribution PX,Y or computational intractability of finding the pos-
terior PX|Y . In such cases, we often restrict our attention to the class of affine (or linear) estimators. So,
consider the sub-Hilbert space

    S = {f : R → R | f(y) = ay + b for some a, b ∈ R}    (11)

of all affine functions f (y) that only depend on Y . It is straightforward to verify that S is a closed subspace
of L2 (R × R, PX,Y ) using the fact that Y has finite second moment E[Y 2 ] < +∞. Fixing S to be our class
of admissible estimators, we define the linear least squares (LLS) estimation problem as:
    min_{X̂∈S} E[(X − X̂(Y))²] = min_{a,b∈R} E[(X − aY − b)²] ,    (12)

and refer to the associated optimal estimator as the LLS estimator:

    X̂LLS = arg min_{X̂∈S} E[(X − X̂(Y))²] .    (13)

Note that we are essentially performing MMSE estimation over a smaller, more constrained, subspace here.
Then, applying Theorem 2 yields the following orthogonality characterization of LLS estimators, cf. [1,
Section 7].

Proposition 3 (Orthogonality Characterization of LLS Estimator). An affine estimator X̂ : R → R satisfies
the orthogonality property:

    ∀a, b ∈ R,  E[(X̂(Y) − X)(aY + b)] = 0 ,

if and only if it is the LLS estimator X̂ = X̂LLS given by:

    ∀y ∈ R,  X̂LLS(y) = E[X] + (cov(X, Y)/var(Y)) (y − E[Y]) ,

where cov(X, Y) = E[(X − E[X])(Y − E[Y])] is the covariance of X and Y, and var(Y) = cov(Y, Y) is the
variance of Y. Moreover, the LLS estimator achieves an MMSE risk of

    E[(X − X̂LLS(Y))²] = var(X) − cov(X, Y)²/var(Y) .

Proof. As before, let g(X, Y ) = X be the function we seek to estimate. Then, Theorem 2 provides the
following orthogonality characterization of LLS estimators: An affine estimator X̂ : R → R satisfies the
orthogonality property:
    ∀f ∈ S,  E[(X̂(Y) − X)f(Y)] = E[(X̂(Y) − X)(aY + b)] = 0 ,    (14)


where a, b ∈ R are the parameters that define f (y) = ay + b for y ∈ R, if and only if it is the LLS estimator,
X̂(y) = X̂LLS (y) for all y ∈ R, defined in (13). It remains to prove the explicit form of the LLS estimator
from this orthogonality characterization. (Note that the explicit form could also be derived through direct
optimization of (12).) Let X̂LLS (y) = cy + d for some c, d ∈ R. Then, letting a = 0 and b = 1 in (14), we get

    cE[Y] + d = E[X] ,

and letting a = 1 and b = 0 in (14), we get

    cE[Y²] + dE[Y] = E[XY] .

Solving these equations produces c = cov(X, Y )/var(Y ) and d = E[X] − (cov(X, Y )/var(Y ))E[Y ], which
establish the form of X̂LLS in the proposition statement. Finally, the MMSE risk of X̂LLS follows from direct
calculation:

    E[(X − X̂LLS(Y))²] = E[((X − E[X]) − (cov(X, Y)/var(Y))(Y − E[Y]))²]
                      = var(X) + cov(X, Y)²/var(Y) − 2cov(X, Y)²/var(Y)
                      = var(X) − cov(X, Y)²/var(Y) .

This completes the proof.

Intuitively, Proposition 3 conveys that the LLS estimator is characterized by having an error X̂LLS (Y )−X
that is orthogonal to the subspace of all affine (or linear) functions of Y . Equivalently, the LLS estimator is the
projection of X onto the subspace of all affine functions of Y . Finally, we mention two more remarks. Firstly,
the LLS estimator can be computed using only first and second order moments of the joint distribution PX,Y ;
these quantities can often be much easier to obtain than the full joint distribution PX,Y itself. Secondly,
when X, Y are jointly Gaussian random variables, the BLS and LLS estimators coincide; we leave this as an
exercise for the reader.
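To close, the sketch below (our addition) implements the first remark: it builds the LLS estimator purely from first and second order sample moments, for a hypothetical nonlinear model in which Y ~ N(0, 1) is observed and X = Y³ + W with W ~ N(0, 0.1) independent of Y. Here the BLS estimator is E[X|Y = y] = y³, so the gap between the LLS and BLS mean squared errors shows the price of the affine restriction; for jointly Gaussian (X, Y) the two would coincide, as noted above.

    import numpy as np

    rng = np.random.default_rng(5)
    n = 500_000

    # Hypothetical joint distribution (illustration only): observe Y ~ N(0, 1),
    # hidden X = Y^3 + W with W ~ N(0, 0.1) independent of Y.
    y = rng.normal(size=n)
    x = y**3 + rng.normal(0.0, np.sqrt(0.1), size=n)

    # LLS estimator built from first and second order sample moments only.
    mx, my = x.mean(), y.mean()
    cov_xy = np.mean((x - mx) * (y - my))
    var_y = np.var(y)
    x_lls = mx + (cov_xy / var_y) * (y - my)

    # BLS estimator for this model: E[X | Y = y] = y^3 (W is zero-mean, independent of Y).
    x_bls = y**3

    print(f"LLS MSE : {np.mean((x - x_lls)**2):.3f}   "
          f"(theory: var(X) - cov(X,Y)^2/var(Y) = {np.var(x) - cov_xy**2 / var_y:.3f})")
    print(f"BLS MSE : {np.mean((x - x_bls)**2):.3f}   (theory: 0.100)")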

References
[1] G. W. Wornell, “Inference and information,” May 2017, Department of Electrical Engineering and Com-
puter Science, MIT, Cambridge, MA, USA, Lecture Notes 6.437.
[2] E. M. Stein and R. Shakarchi, Real Analysis: Measure Theory, Integration, and Hilbert Spaces, ser.
Princeton Lectures in Analysis. Princeton, NJ, USA: Princeton University Press, 2005, vol. 3.
