EE378A - Combined Notes

EE378A Statistical Signal Processing Lecture 1 - 03/29/2016

Lecture 1: Introduction
Lecturer: Jiantao Jiao, Tsachy Weissman Scribe: Nico Chaves

In this lecture, we discuss the content and structure of the course at a high level. We also introduce the 3
pre-approved course projects.

1 Course Logistics
Office Hours:
• Jiantao Jiao, Wed 4:30-5:30 pm in Packard 251
• Tsachy Weissman, Thurs 2:30-4:00 pm in Packard 256
• Bai Jiang, Tues 4:30-5:30 in Sequoia 200
• Yanjun Han, Thurs 4:30-5:30 in Packard 251
See the course website for more details, as these times/locations may be subject to change.

Grading:
• 40%: Approximately 4 problem sets
• 40%: Project
• 10%: Class attendance
• 10%: Scribe notes (due 72 hours after lecture)
• Up to 5% bonus for useful input to the course staff

2 Course Motivation
There are many topics that can be categorized as part of statistical signal processing: decision theory, learning
theory, reinforcement learning, etc. Modern statistical signal processing viewed from a broad perspective
includes everything about data science. However, it’s not always obvious how these different frameworks in
data science are related, and it can be difficult to decide which framework to use to solve a given problem.
This course will have 3 goals:
1. Compare and analyze these existing frameworks for data analysis and processing. By understanding
the goals of each framework, we’ll be able to choose an appropriate algorithm when facing a new
problem, and choose an appropriate framework to analyze a new algorithm.
2. Emphasize the most important principles.
3. Enable students to come up with the best algorithm for the problem at hand.

2.1 Peter Huber’s Data Analysis Book


Peter Huber is a renowned statistician who wrote a book called Data Analysis: What Can Be Learned From
the Past 50 Years [2]. The book focuses on how to teach/learn data analysis. It is highly recommended by
the course staff. In fact, it motivated the staff to teach this course.

2.2 “Quiz” Question 1
What is the difference between channel coding and multiple hypothesis testing?

Channel coding: Suppose a transmitter wants to transmit some message c in the form of a binary string.
There may be many different messages (binary strings) that the transmitter can send, say, c1, ..., cM. Suppose
the transmitter randomly selects one of these M messages and sends it over the channel. The channel
corrupts the message in a certain way. For example, the channel may randomly flip each bit of the
transmitted codeword with probability ε < 1/2 (bits are flipped independently).

Multiple hypothesis testing: The receiver then receives a message c′, which may be different from
the message that was sent. The receiver must then apply some decoding algorithm to try to decide which
message was actually sent; the decoding process is just choosing one hypothesis from multiple hypotheses.
In the Bayesian framework, to decode the message, the receiver computes the MAP estimate to minimize
the block decoding probability of error.

The main difference between channel coding and multiple hypothesis testing is that most of the effort
in channel coding goes into designing the codewords. In other words, our primary goal in channel coding is to select
the messages c1, ..., cM such that we can maximize the rate of transmission. Given a fixed codebook, the
decoding is statistically trivial (see footnote 1): one just needs to compute the MAP estimate of the codeword to minimize
the decoding probability of error.

However, in multiple hypothesis testing, we are usually given the conditional distributions of the
observations given each hypothesis. In other words, we do not have the freedom to design each hypothesis. Then,
if we study the multiple hypothesis testing problem under a Bayesian framework and assume each hypothesis
is chosen with equal probability, this problem is in fact equivalent to the channel decoding problem.
However, multiple hypothesis testing can also be performed under a frequentist framework, where one does
not assume that each hypothesis has a prior distribution (see footnote 2).
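To make the decoding step concrete, here is a small illustrative sketch in Python (the 3-codeword codebook and parameters are made up for illustration): for a binary symmetric channel with crossover probability ε < 1/2 and a uniform prior over the codewords, the MAP rule reduces to minimum Hamming distance decoding.

import numpy as np

rng = np.random.default_rng(0)
eps = 0.1                                   # crossover probability of the BSC (illustrative value)
codebook = np.array([[0, 0, 0, 0, 0],       # toy codebook: the hypotheses c_1, ..., c_M
                     [1, 1, 1, 1, 1],
                     [1, 0, 1, 0, 1]])

def transmit(c):
    # flip each bit independently with probability eps
    flips = rng.random(c.shape) < eps
    return np.bitwise_xor(c, flips.astype(int))

def map_decode(y, codebook):
    # with a uniform prior, MAP decoding = minimum Hamming distance decoding
    dists = (codebook != y).sum(axis=1)
    return np.argmin(dists)

# send a random codeword many times and estimate the block error probability
errors, trials = 0, 10000
for _ in range(trials):
    m = rng.integers(len(codebook))
    y = transmit(codebook[m])
    errors += (map_decode(y, codebook) != m)
print("estimated block error probability:", errors / trials)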

2.3 “Quiz” Question 2


What are the modeling assumptions behind the support vector machine (SVM)? In particular,
what assumptions do you make about the distribution of the data when applying an SVM?

An SVM doesn't actually need to make any assumptions about the distribution of the data beyond that
the (feature, label) pairs are generated independently from the same distribution, which can also be relaxed
to some extent. An SVM simply applies a hyperplane to try to separate 2 different "clouds" of data. The
theoretical results one can derive about the SVM also do not rely on any additional data-generating
assumptions: one can derive "distribution-free" results by comparing the performance of the SVM to that
of an oracle that knows the underlying distributions but is restricted to use only linear classifiers.

The SVM corresponds to the idea of structural risk minimization in statistical learning theory, where multiple
function classes with different VC dimensions are considered simultaneously.

3 Pre-Approved Projects
This section introduces the 3 pre-approved course projects. Students are also welcome to propose their own
project topics.
Footnote 1: If we do not consider computational issues: decoding an arbitrary code is NP-hard in the worst case.
Footnote 2: Keywords: FWER and FDR control.

3.1 Project 1: Portfolio Optimization
In this context, the term “portfolio” refers to a set of investments (e.g., stocks). The problem can be defined
as follows: Suppose you have N stocks. Every day, you can assign a portion of your capital to each stock.
Your goal is to maximize your profit at the end of some time period.

There exist conflicts between various theoretical frameworks applied to portfolio optimization. We refer
the reader to [1] for details. Since it may not be possible to convincingly propose a portfolio optimization
algorithm from theory alone, this project aims to compare different algorithms in practice.

Here is a basic example explaining why theoretical comparisons might be difficult. Consider a sequence
of random variables X1, ..., XN, where Xi represents the amount of money you have at day i. If you become
bankrupt at some time point, you can't make any more investments. Therefore, Xi ≥ 0.

Suppose you're given a portfolio that generates such a sequence and that satisfies the following 2 conditions:

1. XN → 0 almost surely (a.s.)
2. E[XN] → ∞ as N → ∞

Condition 1 suggests that the portfolio is very bad, because it suggests you won't have any money after N
days. In particular, XN converges to 0 in the "almost sure" sense of convergence. However, condition 2
seems to be a desirable property, since the expectation approaches ∞.
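To see how sharp this conflict can be, consider the following hypothetical all-in strategy (a minimal simulation sketch with made-up parameters): each day the entire capital either triples or drops to zero with probability 1/2 each, so E[X_N] = (3/2)^N → ∞ while X_N → 0 almost surely.

import numpy as np

rng = np.random.default_rng(1)
N = 10           # horizon in days (kept small so the rare surviving paths still show up)
paths = 200000   # number of simulated wealth paths

# all-in strategy: wealth triples with prob 1/2, otherwise drops to 0 (bankruptcy)
wins = rng.random((paths, N)) < 0.5
X_N = np.where(wins.all(axis=1), 3.0 ** N, 0.0)

print("theoretical E[X_N] =", 1.5 ** N)          # grows without bound as N -> infinity
print("empirical   mean   =", X_N.mean())        # driven entirely by the rare surviving paths
print("P(X_N > 0) approx  =", (X_N > 0).mean())  # approx 2^{-N}: tends to 0, so X_N -> 0 a.s.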

Students pursuing Project 1 will use real stock market data. The course staff will provide these students
with real data from the S&P 500 spanning 5 years in total. At the end of the quarter, we'll be able to compare
the performance of the different approaches.

3.2 Project 2: Learning using Privileged Information (LUPI)


In the LUPI setting, your training set consists of data points of the form: (xi , x∗i , yi ) ∈ X × X ∗ × Y. The test
set consists of data points of the form: (xj , yj ) ∈ X × Y. Roughly speaking, your goal is to find a classifier
f : X → Y such that f (xj ) ≈ yj for each (xj , yj ) pair in the test set. As in the traditional machine learning
setting, xi is a feature vector and yi is a label.

The key difference is x∗i, which is called the "teacher's advice." The teacher's advice gives you some
additional information to use to train your classifier. The idea behind the LUPI framework is that this additional
information at training time enables you to train a classifier that will perform better on the test set. Note:
the test set does not contain the teacher's advice x∗i.

Example 1: Image Classification


We are given an image of a handwritten digit. Our goal is to determine which digit is shown in the image.
The possible labels are 0-9, i.e.: yi ∈ {0, 1, ..., 9}
Suppose that xi is a 5 × 5 image of the digit.
Then, x∗i could be a 64 × 64 version of the same image.

A 64 × 64 image is sharper than the corresponding 5 × 5 image. The reason LUPI helps learning
is as follows: since we have access to sharper images, it's easier to train a classifier that separates different
images. In particular, having access to the 64 × 64 images enables us to define a distance function that
classifies the images better. A pair of 5 × 5 images of different digits may look similar, but the corresponding
64 × 64 images will enable us to differentiate between the pair, resulting in a better distance function.

Example 2: Time Series Prediction
In the traditional time series prediction setting, we have a sequence of data points up to time t. Given
x1, ..., xt, our goal is to predict xt+k (where k > 0). In other words, our goal is to predict the value of x at
some time in the future using features up to the present.

Now, let’s recast the time series prediction problem under the LUPI setting: As before, xi consists of
features up to now and yi is the value we want to predict. Now, our training set also contains x∗i , which can
contain information from the future: xt+1 , xt+2 , ...

We emphasize again that you do not have access to the x∗i features when evaluating your predictor on
the test set. This should be especially clear in Example 2, where the teacher’s hint contains information
from the future.

In the case of a soft-margin SVM, you can use the teacher’s advice to learn the slack variables. It was
argued in Vapnik and Izmailov [3] that using LUPI significantly improves the classification performance of
an SVM. However, this technique hasn’t been studied extensively in practice (at least to the course staff’s
knowledge). It would make an interesting course project to see how well LUPI works in practice.

Moreover, it’s possible that the LUPI framework doesn’t actually require one to use SVMs. It has yet to
be shown whether LUPI can be applied to other machine learning techniques. This would be an interesting
problem to study as well.

3.3 Project 3: Ensemble Methods and Synergy Mechanism


3.3.1 Ensemble Methods
In this approach, your learning algorithm trains multiple predictors. Some aggregation of these predictors
is then used to classify data points at prediction time. For example, you may have 3 predictors: f1 , f2 , and
f3 , where f1 was trained using SVM, f2 using neural networks, and f3 using nearest neighbors. You could
then classify a new data point xj by evaluating y1j = f1 (xj ), y2j = f2 (xj ), and y3j = f3 (xj ), and then
aggregating these predictions. If we use a simple linear combination, then this prediction takes the form:

    yj = Σ_{i=1}^{3} αi yij

The goal is to learn the αi coefficients such that the performance of this ensemble classifier is comparable to
the oracle on the test set. The oracles can be quite different, corresponding to different types of aggregations,
i.e., model selection aggregation, convex aggregation and linear aggregation, following Nemirovski [4]. In the
model selection aggregation, the oracle is the best one among fi on the test set. In the convex aggregation,
the oracle is the best one among all convex combinations of fi , and similarly for the linear aggregation.
However, the problem is that most ensemble methods only use a linear combination of the candidates as
the final aggregate, which is subject to the drawbacks of linear methods.
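As an illustrative sketch (using synthetic held-out predictions, not real data), the weights αi of a linear aggregate can be fit by least squares, while a convex aggregate constrains the weights to the probability simplex:

import numpy as np

rng = np.random.default_rng(2)

# hypothetical held-out predictions of 3 base classifiers (scores in [0, 1]) and true labels
n = 500
F = rng.random((n, 3))          # column i holds f_i(x_j) for the held-out points x_j
y = (0.6 * F[:, 0] + 0.4 * F[:, 2] + 0.1 * rng.standard_normal(n) > 0.5).astype(float)

# linear aggregation: unconstrained least squares fit of the weights alpha
alpha_lin, *_ = np.linalg.lstsq(F, y, rcond=None)

# convex aggregation: crude grid search over weights on the simplex
grid = np.linspace(0, 1, 21)
best, alpha_cvx = np.inf, None
for a1 in grid:
    for a2 in grid[grid <= 1 - a1]:
        a = np.array([a1, a2, 1 - a1 - a2])
        err = np.mean((F @ a - y) ** 2)
        if err < best:
            best, alpha_cvx = err, a

print("linear aggregation weights:", alpha_lin)
print("convex aggregation weights:", alpha_cvx)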

3.3.2 Synergy Mechanism


Suppose you’re given k classifiers f1 , f2 , ..., fk . Again, your goal is to construct the best possible classifier by
aggregating these k classifiers, but now you are no longer restricted to using linear combinations.

Solving the Conditional Density Estimation Problem:


Suppose we're dealing with a binary classification problem, so yi ∈ {0, 1}. We can minimize our probability
of classification error by using the rule based on the conditional probability

    P(y = 1 | f1, f2, ..., fk),

i.e., predicting the label 1 whenever this probability exceeds 1/2.

So the optimal decision rule is a conditional distribution, rather than some linear or convex combination of
the predictors. Suppose that we discretize the score of each predictor into m possible values. Then, we can
treat each of the m^k possible value combinations of (f1, f2, ..., fk) as a bin and count the number of 0 and 1 labels in each
bin. We can then use these empirical frequencies for the decision rule. However, this approach is impractical.
As k grows larger, the number of bins increases exponentially, so the bins will become increasingly sparse
(in practice, we only have some finite amount of data). As a result, the empirical frequencies we obtain from
the bins will rapidly become unreliable as we increase k.

Instead, we notice that it would be beneficial to assume that our classifier is monotonic. Suppose you
train an SVM. The classifier will be of the form f(x) = wᵀx + b. If f(x1) > f(x2), then it seems reasonable
to assume that P(y = 1 | f(x1)) > P(y = 1 | f(x2)). In particular, x1 results in a larger margin, so we
believe that it's more likely to have a label of 1 than x2 is. Therefore, we assume: if f(x1) > f(x2), then
P(y = 1 | f(x1)) > P(y = 1 | f(x2)). This is an example of a monotonicity assumption.

Next we’ll see why the monotonicity assumption simplifies the problem of conditional density estimation.

First, we can cast the 1-D conditional density estimation problem as the following integral equation:

    ∫_{−∞}^{x} f(t) dt = F(x)

Suppose you modify the density function by adding a sinusoid:

    ∫_{−∞}^{x} (f(t) + sin(ωt)) dt

Evaluating this modified integral, we have:

    ∫_{−∞}^{x} (f(t) + sin(ωt)) dt = ∫_{−∞}^{x} f(t) dt − [cos(ωt)/ω]_{−∞}^{x} = F(x) + O(1/|ω|)    (1)

Suppose we let ω → ∞. Then, the error term above approaches 0. Therefore, the modified integral
approaches F(x). But as ω → ∞, the density defined by f(t) + sin(ωt) will oscillate infinitely fast and will
become very different from f(t), which is the density we actually wanted to estimate. This means that
there exist densities which are very different from the actual density but nonetheless result in a CDF that is
very close to the target CDF. Our conclusion is that density estimation is difficult when we don't make any
assumptions about the density.

However, if we assume that the conditional density is monotonic, then it can't have an oscillating
component as in the example above. This assumption makes the conditional density estimation problem much
easier to solve.
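In practice, one standard way to exploit such a monotonicity assumption is isotonic regression, which fits the best monotone estimate of P(y = 1 | f(x)) as a function of the classifier score. Below is a small sketch using scikit-learn and synthetic scores (all parameters are made up for illustration):

import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(3)

# hypothetical classifier scores f(x) and binary labels whose true P(y=1|score) is monotone
scores = rng.normal(size=2000)
p_true = 1.0 / (1.0 + np.exp(-2.0 * scores))     # monotone in the score
labels = (rng.random(2000) < p_true).astype(float)

# isotonic regression: the monotone fit of P(y = 1 | score)
iso = IsotonicRegression(out_of_bounds="clip")
iso.fit(scores, labels)

test_scores = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
print("estimated P(y=1|score):", iso.predict(test_scores))
print("true      P(y=1|score):", 1.0 / (1.0 + np.exp(-2.0 * test_scores)))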

The synergy method has proved to be successful in some experiments. For example, Vapnik showed in
his recent talk "Intelligent Methods of Learning" that, for aggregating kernels with different parameters, the
performance of the aggregate on the test set is even better than that obtained by the best kernel parameter.

Some questions one could address for this project topic include: does the synergy mechanism actually
perform well when compared to traditional ensemble methods? For which types of predictors (SVM, neural
networks, etc.) does the monotonicity assumption hold? For which types of predictors does the synergy
mechanism work well?

References
[1] MacLean, Leonard C., Edward O. Thorp, and William T. Ziemba. "Good and bad properties of the
Kelly criterion." Risk 20, no. 2 (2010): 1.
[2] Huber, Peter J. Data Analysis: What Can Be Learned From the Past 50 Years. Vol. 874. John Wiley & Sons,
2012.
[3] Vapnik, Vladimir, and Rauf Izmailov. "Learning Using Privileged Information: Similarity Control and
Knowledge Transfer." Journal of Machine Learning Research 16 (2015): 2023-2049.
[4] Nemirovski, Arkadi. "Topics in non-parametric statistics." École d'Été de Probabilités de Saint-Flour 28
(2000): 85.

EE378A Statistical Signal Processing Lecture 2 - 03/31/2016

Lecture 2: Basic Concepts of Statistical Decision Theory


Lecturer: Jiantao Jiao, Tsachy Weissman Scribe: John Miller and Aran Nayebi

In this lecture (see footnote 1 for suggested readings), we will introduce some of the basic concepts of statistical decision theory, which will play
crucial roles throughout the course.

1 Motivation
The process of inductive inference consists of
1. Observing a phenomenon
2. Constructing a model of that phenomenon
3. Making predictions using this model.
The main idea behind learning algorithms is to generalize past observations to make future predictions. Of
course, given a set of observations, one can always find a function to exactly fit the observed data, or to find
a probability measure that generates the observed data with probability one. However, without placing any
restrictions on how the future is related to the past, the No Free Lunch theorem essentially says that it is
impossible to generalize to new data. In other words, data alone cannot replace knowledge. Hence, a central
theme of this course is
Generalization = Data + Knowledge.
We can view statistical decision theory and statistical learning theory as different ways of incorporating
knowledge into a problem in order to ensure generalization.

2 Decision Theory
2.1 Basic Setup
The basic setup in statistical decision theory is as follows: We have an outcome space X and a class of
probability measures {Pθ : θ ∈ Θ}, and observations X ∼ Pθ , X ∈ X . An example would be the Gaussian
measure, defined in the above framework as follows:

    dPθ/dx = (1/(σ√(2π))) exp(−(x−µ)²/(2σ²)),    θ = (µ, σ²).
Our goal as a statistician is to estimate g(θ), where g is an arbitrary function that is known to us and θ is
unknown. Thus, observe that our goal is more general than simply estimating θ, and the generality of our
goal is motivated by the fact that the parametrization θ can be arbitrary.

A useful perspective to view the goal of estimating g(θ) is as a game between Nature and the statistician,
where Nature chooses θ and the statistician chooses a decision rule δ(X). Note that a general decision rule
may be randomized, i.e., for any realization of X = x, δ(x) produces an action a ∈ A following the probability
Footnote 1: Some readings:

1. Olivier Bousquet, Stéphane Boucheron, Gábor Lugosi (2004) “Introduction to Statistical Learning Theory”.
2. Lawrence D. Brown (2000) “An Essay on Statistical Decision Theory”.
3. Abraham Wald (1949) “Statistical Decision Functions”.

distribution D(da|x). Intuitively, after observing x, the decision rule chooses the action a ∈ A randomly
from the distribution D(da|x). As a special case, for a deterministic decision rule δ(x), D(da|x) is just a
one-point probability mass at δ(x).
Moreover, we have a loss function L(θ, a) : Θ × A → R to quantify the loss between θ and any action
a ∈ A, such that (usually we assume that g(θ) belongs to the action space A)

L(θ, a) ≥ 0, ∀a ∈ A, θ ∈ Θ (1)
L(θ, g(θ)) = 0, ∀θ ∈ Θ. (2)

Immediately, we see that our loss function L(θ, δ(X)) is inherently a random quantity. Note that we
may have two sources of randomness: the observation X is inherently random, and for a fixed X the action
produced by the decision rule δ(X) ∼ D(da|X) may also be random.
We therefore define a risk function R(θ, δ) in terms of our loss function; the risk, in contrast, is deterministic.

Definition 1 (Risk). For a general randomized decision rule δ(X):

    R(θ, δ) ≜ Eθ[L(θ, δ(X))] = ∬ L(θ, a) D(da|x) dPθ(x).    (3)

When the decision rule δ(X) is deterministic, we have R(θ, δ) = ∫ L(θ, δ(x)) dPθ(x).
Intuitively, the risk R(θ, δ) is the average loss incurred by using the decision procedure δ over many draws
of data from Pθ .
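As a quick illustration (a sketch with an assumed Gaussian location model and made-up parameters), the risk of a deterministic rule can be approximated by averaging the loss over many draws of the data from Pθ; here the rule is the sample mean under squared error loss, whose risk should be close to σ²/n.

import numpy as np

rng = np.random.default_rng(4)

theta = 1.5          # true mean, with g(theta) = theta
sigma = 2.0
n = 25               # sample size per experiment
reps = 100000        # number of independent datasets drawn from P_theta

X = rng.normal(theta, sigma, size=(reps, n))
delta = X.mean(axis=1)                     # decision rule: sample mean
risk_mc = np.mean((delta - theta) ** 2)    # Monte Carlo estimate of R(theta, delta)

print("Monte Carlo risk:          ", risk_mc)
print("theoretical risk sigma^2/n:", sigma ** 2 / n)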

2.2 Comparing Procedures


Decision theory offers a principled way of analyzing the performance of different statistical procedures by
evaluating and comparing their respective risks R(θ, δ). For example, suppose we observe data X ∼ Pθ .
Should we use maximum likelihood, method of moments, or some other procedure to estimate g(θ)? Decision
theory allows us to rule out certain inadmissible procedures.
Definition 2 (Admissibility). A decision rule δ(X) is called “inadmissible” if there exists some decision
rule δ 0 (X) such that

1. R(θ, δ 0 ) ≤ R(θ, δ) for all θ ∈ Θ, and


2. there exists θ0 ∈ Θ such that R(θ0 , δ 0 ) < R(θ0 , δ).
Such a rule δ 0 is said to dominate δ. If there does not exist any rule δ 0 (X) which dominates δ(X), δ(X) is
called “admissible”.

Having ruled out inadmissible decision rules, one might seek to find the best rule, namely the rule that
uniformly minimizes the risk. However, in most cases of interest, there is no such uniformly best decision rule.

For instance, if Nature chooses θ0, then the trivial procedure δ(X) = g(θ0) achieves the minimum possible
risk R(θ0, δ(X)) = 0. Thus, in most cases this δ is admissible. However, such a decision rule is intuitively
unreasonable and, further, can have arbitrarily high risk for θ ≠ θ0.
There are two main approaches to restrict the order of the game to avoid this triviality, namely, we can
restrict the statistician, or we can assume more knowledge about Nature.

2.3 Restrict the Statistician


Let D be the set of all possible decision rules. One way to restrict the statistician is to force her to choose a
decision rule from some subset D0 ⊊ D. How should we choose D0? Some possibilities are:

1. Take D0 to be the set of unbiased decisions, namely, the decisions δ(X) such that Eθ [δ(X)] = g(θ),
∀θ ∈ Θ.
2. Equivariance (which we will not discuss as much in class): As an example, if Pθ has density f(x − θ), where
θ ∈ R, f(x) ≜ (1/√(2π)) e^{−x²/2}, and our observations X1, . . . , Xn are i.i.d. from Pθ, then we will choose our decision
function δ to be such that δ(X1 + c, X2 + c, . . . , Xn + c) = δ(X1, . . . , Xn) + c, where c is a constant.
The reason why we might want to choose such a decision function δ is because, for equivariant
estimators, the variance of the estimator does not depend on θ.

2.4 Assume more knowledge about Nature


The downside of restricting the statistician is that we may be ignoring good decision rules δ. Instead, an
alternative approach is to assume more knowledge about Nature. Here, we run into the classic Bayesian vs.
Frequentist debate, which can be viewed as two different ways of assuming more knowledge about Nature.

1. Bayes: We assume that θ is drawn from a probability density, namely, θ ∼ λ(θ). For instance, for the
problem of disease detection, our prior λ(θ) could be the population disease density.

Thus, in the Bayesian setting, our new risk function becomes

    r(δ) = ∫ R(θ, δ) λ(θ) dθ.    (4)

As a result, to minimize the risk, the optimization problem becomes:

    min_δ r(δ),    (5)

which is a well-posed optimization problem over δ alone (and can be solved in many cases, though it
can be computationally intractable).

It is known that if θ is finite dimensional, our observations X1 , . . . , Xn ∼ Pθ , and λ(θ) is supported


everywhere, then as n → ∞, λ(θ) has no effect on inference asymptotically. However, there are a
couple of catches to notice here. The first catch is that we are assuming n → ∞, but if n is finite,
then λ(θ) will influence the decision rule rather significantly. The second catch is that we are assuming
θ is finite dimensional, which does not hold generally in semiparametric and nonparametric models,
e.g., if we are trying to estimate a function.
2. Frequentist: The frequentist approach is to not impose a prior, which leads to a natural choice to
guard against the worst case risk using a minimax approach. Namely, the risk in the minimax setting
becomes
    r(δ) = max_θ R(θ, δ),    (6)

and our optimization problem is, as usual,

    min_δ r(δ).    (7)

3 Data Reduction
Not all data is relevant to a particular decision problem. Indeed, the irrelevant data can be discarded and
replaced with some statistic T (X n ) of the data without hurting performance. We make this precise below.

3.1 Sufficient Statistics
Definition 3 (Markov chain). Random variables X, Y, Z are said to form a Markov Chain if X and Z are
conditionally independent given Y . In particular, the joint distribution can be written as

    p(X, Y, Z) = p(X) p(Y | X) p(Z | Y)
               = p(Z) p(Y | Z) p(X | Y).

Definition 4 (Sufficiency, see footnote 2). A statistic T(X) is "sufficient" for the model P = {Pθ : θ ∈ Θ} if and only if
the following Markov chains hold for any distribution on θ:
1. θ − T (X) − X
2. θ − X − T (X)

One useful interpretation of the first condition is, if we know T (X), we can generate X without knowing
θ (because X and θ are independent conditioned on T (X)). The second condition usually trivially holds,
since in most cases T (X) is a deterministic function of X. However, a subtle but important point about
the second condition is that the statistic T (X) could be a random function of X. In this case, the second
condition implies that given X, T (X) does not depend on the randomness of θ.

3.2 Neyman-Fisher Factorization Criterion


Often checking the definition of sufficiency can be difficult. In cases where the model distributions admit
densities, we can give a characterization that is easier to work with.
We first give technical conditions necessary for the existence of density functions. Let (Ω, F) be a
measurable space. Recall A ∈ F is null for a measure µ on (Ω, F) if µ(B) = 0 for every B ∈ F with B ⊆ A.
Definition 5 (Absolute continuity). Suppose µ and ν are measures on (Ω, F). We say ν is absolutely
continuous with respect to µ if every null set of µ is also a null set of ν. We write ν ≪ µ.
Definition 6 (Dominated statistical model). The statistical model (X , {Pθ : θ ∈ Θ}) is dominated by measure
µ if, for all θ ∈ Θ, Pθ ≪ µ.

Theorem 1 (Neyman-Fisher Factorization Criterion). Suppose P = {Pθ : θ ∈ Θ} is dominated by a σ-finite
measure µ. In particular, Pθ has a density pθ = dPθ/dµ for all Pθ ∈ P. Then T(X) is sufficient iff there exist
functions gθ and h such that

    pθ(x) = gθ(T(x)) h(x).
Example 2. Let X1, X2, . . . , Xn be i.i.d. Poisson(λ) and θ = λ. Then the joint distribution is

    pθ(X1, X2, . . . , Xn) = ∏_{i=1}^{n} e^{−λ} λ^{Xi} / Xi!
                           = e^{−nλ} λ^{Σ_{i=1}^{n} Xi} / (X1! · · · Xn!)
                           = gθ(T(X1, . . . , Xn)) h(X1, . . . , Xn),

where T(X) = Σ_{i=1}^{n} Xi is sufficient by the Neyman-Fisher Factorization Criterion.

Footnote 2: Note that this definition of sufficiency is slightly different from the conventional frequentist definition, and is along the lines
of the Bayes definition. These two definitions are equivalent under mild conditions. Cf. [1].
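A small numerical illustration of sufficiency in Example 2 (a sketch with made-up parameters): given T = Σ Xi = t, the conditional law of X1 is Binomial(t, 1/n) regardless of λ, so conditioning on T removes all dependence on the parameter.

import numpy as np
from math import comb

rng = np.random.default_rng(5)
n, t, reps = 3, 6, 400000

def cond_dist_of_X1(lam):
    # empirical distribution of X1 given sum(X) == t, for i.i.d. Poisson(lam) samples
    X = rng.poisson(lam, size=(reps, n))
    keep = X[X.sum(axis=1) == t]
    return np.bincount(keep[:, 0], minlength=t + 1) / len(keep)

# the conditional law of X1 given T = t is Binomial(t, 1/n), independent of lambda
binom_pmf = np.array([comb(t, k) * (1 / n) ** k * (1 - 1 / n) ** (t - k) for k in range(t + 1)])

print("empirical, lambda = 0.5:", np.round(cond_dist_of_X1(0.5), 3))
print("empirical, lambda = 2.0:", np.round(cond_dist_of_X1(2.0), 3))
print("Binomial(t, 1/n) pmf:   ", np.round(binom_pmf, 3))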

3.3 Rao-Blackwell
In the decision theory framework, sufficient statistics provide a reduction of the data without loss of infor-
mation. In particular, any risk that can be achieved using a decision rule based on X can also be achieved
by a decision rule based on T (X), as the following theorem makes precise.
Theorem 3. Suppose X ∼ Pθ ∈ P and T is sufficient for P. For all decision rules δ(X) achieving risk
R(θ, δ(X)), there exists a decision rule δ 0 (T (X)) that achieves risk

R(θ, δ 0 (T (X))) ≤ R(θ, δ(X)) for all θ ∈ Θ.

Sketch of Proof  Given T(X), we can sample a new dataset X′ from the conditional distribution
p(X | T(X)). By sufficiency (the first condition in Definition 4), the conditional distribution p(X | T(X))
doesn't depend on θ, and hence we can sample X′ without knowing θ. We then define a randomized procedure

    δ′(T(X)) ≜ δ(X′), which has the same distribution as δ(X).

Since δ(X) and δ′(T(X)) have the same distribution, they also have the same risk function.

It is rarely necessary to regenerate a dataset from sufficient statistics. Rather, in the case of convex
losses, it is possible to obtain a non-randomized decision rule that matches or improves the performance of
the original rule using sufficient statistics alone.
Definition 7 (Strict convexity). A function f : C → R is strictly convex if C is a convex set and

f (tx + (1 − t)y) < tf (x) + (1 − t)f (y)

for all x, y ∈ C, x ≠ y, t ∈ (0, 1).


Example 4. For any θ, the function L(θ, a) = (a − θ)2 is strictly convex with respect to a on R.
Theorem 5 (Rao-Blackwell). Assume L(θ, a) is strictly convex in a. Then

R(θ, E[δ(X) | T (X)]) < R(θ, δ(X)),

unless E[δ(X) | T (X)] = δ(X) almost everywhere.


Rao-Blackwell is stronger than the previous theorem because, when the loss function is strictly convex,
it is possible to improve upon the original decision rule using only a reduced version of the data T(X) and
using a deterministic rule.

References
[1] Blackwell, D., and R. V. Ramamoorthi. A Bayes but not classically sufficient statistic, The Annals of
Statistics 10, no. 3 (1982): 1025-1026.

EE378A Statistical Signal Processing Lecture 3 - 04/05/2016

Lecture 3: Completeness
Lecturer: Jiantao Jiao, Tsachy Weissman Scribe: Xin Zheng and Chenyue Meng

In this lecture, we first review the Rao-Blackwell theorem, along with its proof and complications. Then
we introduce an important concept called completeness, which has applications such as Basu’s theorem.
In the end, the exponential family will be introduced as a tool for finding complete and sufficient statistics for common
distributions.

1 Rao-Blackwell
Theorem 1 (Rao-Blackwell). Assume L(θ, d) is strictly convex in d, ∃ decision rule δ(X), s.t. R(θ, δ(X)) <
∞. Then

R(θ, E[δ(X)|T ]) < R(θ, δ) (1)

unless δ(X) = E[δ(X)|T ] with probability 1, where T is sufficient for P = {Pθ , θ ∈ Θ}.
Proof To prove Rao-Blackwell theorem, we first introduce Jensen’s inequality
Theorem 2 (Jensen’s inequality). Let f be a convex function, and let X be a random variable. Then

E[f (X)] ≥ f (EX) (2)

Moreover, if f is strictly convex, then E[f (X)] = f (EX) holds true if and only if X = EX with probability
1 (i.e., if X is a constant a.s.).
Given that L(θ, d) is strictly convex in d, we have

L(θ, E[δ(X)|T = t]) < E[L(θ, δ(X))|T = t] (3)

if we let X have the conditional distribution PX|T =t and apply Jensen’s inequality. The strict convexity of
L(θ, d) in d shows that the inequality is strict unless Pδ(X)|T =t is a point mass (i.e., a constant almost surely)
To finish the proof, we take expectations on both sides after replacing the fixed t with the random T ,

    Eθ[L(θ, E[δ(X)|T])] < Eθ[E[L(θ, δ(X))|T]]    (4)

    R(θ, E[δ(X)|T]) < Eθ[L(θ, δ(X))]    (5)
    R(θ, E[δ(X)|T]) < R(θ, δ(X))    (6)

Note: Here, the estimator E[δ(X)|T = t] in Rao-Blackwell theorem only depends on t since T is sufficient
for P = {Pθ , θ ∈ Θ}. The convexity of L(θ, d) implies that it suffices to consider deterministic decision rules.

Complications:
Despite the power of Rao-Blackwell, its drawbacks and impracticalities are as follows:
A. Rao-Blackwell focuses on improvement, not optimality. Given a decision rule δ(x) and a sufficient
statistic T, we can use Rao-Blackwell to compute a new decision rule E[δ(x)|T] which is guaranteed to reduce
the risk (or at least not increase it) for a convex loss function. However, this new decision rule might be far from
optimal.
B. It may be hard to compute the conditional expectation.

2 Completeness
In lecture 2, we introduced the concept of a statistic being sufficient with respect to a statistical model. In
general, a statistic T is sufficient for P = {Pθ , θ ∈ Θ} if the sample from Pθ gives no additional information
than T . What characterizes the situations in which sufficiency leads to substantial reduction of
data?
Definition 1 (Ancillary). Statistic V (X) is ancillary if PV (X)|θ = PV (X) . Moreover, V (X) is first-order
ancillary if Eθ V (X) = const.
Note: A statistic being ancillary intuitively means that it contains no information about θ.
Example 3. Assume X1, X2, · · · , Xn are i.i.d. N(µ, σ²), where µ is unknown and σ² is known, and we set θ = µ
as the parameter for the distribution family P = {Pθ, θ ∈ Θ}.
Then let's define the statistics X̄ and S² as

    X̄ = (Σ Xi)/n,        S² = Σ (Xi − X̄)²

Now we want to prove that S² is ancillary for µ. Write Xi = Zi + µ, so that (X1, X2, · · · , Xn) equals (Z1, Z2, · · · , Zn) + µ
in distribution, where Z1, Z2, · · · , Zn are i.i.d. N(0, σ²). Then S² = Σ (Xi − X̄)² is equal to Σ (Zi − Z̄)² in distribution,
whose distribution does not depend on µ, i.e. θ. Hence S² is ancillary for µ.
Definition 2 (Completeness). The statistic T is complete if the following is true:

Eθ [f (T )] = 0, ∀θ ∈ Θ ⇒ f (T ) = 0 w.p.1 (7)

Note: An alternative way to phrase this: a statistic T appears to be most successful in reducing data if no
non-constant function of T is even first-order ancillary.
Example 4. An example of non-complete statistic: we argue that X = (X1 , X2 , . . . , Xn ) in the previous
Gaussian model is not complete. Indeed, we can find a non-constant function f (X) = X1 − X2 such that
Eθ [f (X)] = 0 for any θ ∈ Θ.

3 Basu’s Theorem
An important application of completeness is Basu’s theorem.
Theorem 5 (Basu's theorem). If T is a complete and sufficient statistic for P = {Pθ, θ ∈ Θ}, then any
ancillary statistic V is independent of T, i.e. V ⊥⊥ T.
Proof  Given any ancillary statistic V, we can define pA ≜ Pθ(V ∈ A), which is independent of θ for any
given A. If T is sufficient, then ηA(t) ≜ Pθ(V ∈ A | T = t) does not depend on θ. Taking the expectation of
ηA(T), we have

    Eθ[ηA(T)] = Pθ(V ∈ A) = pA.

Hence, we have found a function ηA(T) − pA whose expectation is constant zero for all parameters. It
follows from the definition of completeness that ηA(T) = pA with probability 1.

Example 6. Assume X1, X2, · · · , Xn are i.i.d. N(µ, σ²), and θ = (µ, σ²). Suppose we have proved that for known
σ², the statistic X̄ is both complete and sufficient for µ (which will be proved at the end of this lecture). Then,
applying Basu's theorem, we conclude that X̄ ⊥⊥ S², where S² = Σ (Xi − X̄)². Actually, this also implies
that X̄ ⊥⊥ S² holds true even in the case where both µ and σ² are unknown, since σ² is arbitrary.
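A quick numerical sanity check of this independence (a sketch with made-up parameters): since independence implies zero correlation, the sample correlation between X̄ and S² should be essentially zero for any value of µ.

import numpy as np

rng = np.random.default_rng(6)

def corr_xbar_s2(mu, sigma, n=10, reps=200000):
    # sample correlation between the sample mean and S^2 = sum (Xi - Xbar)^2
    X = rng.normal(mu, sigma, size=(reps, n))
    xbar = X.mean(axis=1)
    s2 = ((X - xbar[:, None]) ** 2).sum(axis=1)
    return np.corrcoef(xbar, s2)[0, 1]

# Basu's theorem: Xbar and S^2 are independent, so the correlation should be ~0 for any mu
print("corr(Xbar, S^2), mu = 0:", corr_xbar_s2(0.0, 2.0))
print("corr(Xbar, S^2), mu = 5:", corr_xbar_s2(5.0, 2.0))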

4 Exponential Family
If a distribution can be transformed into the form of exponential family, then it will be much easier to
compute its corresponding statistics satisfying both completeness and sufficiency properties.
Exponential family has the general form as
Ps
pθ (x) = e[ i=1 ηi (θ)Ti (x)]−B(θ)
h(x)

where pθ (x) is the density/mass function and B(θ) is the normalization term.
According to the factorization theorem, (T1, T2, · · · , Ts) is sufficient.
Definition 3 (Full rank). An exponential family is called full-rank if and only if:
1. neither Ti nor ηi satisfies a linear constraint, i.e., if Σ_{i=1}^{s} ai Ti = a0 for some constants a0, a1, · · · , as,
then a0 = a1 = · · · = as = 0;
2. the natural parameter space (defined below) contains an s-dimensional rectangle, i.e., has a non-empty
interior.

Definition 4 (Natural parameter space). The natural parameter space of an exponential family is {(η1, · · · , ηs) :
∫ e^{Σ_i ηi Ti(x)} h(x) dµ(x) < ∞}, where µ is the dominating measure for the distribution family P = {Pθ, θ ∈ Θ}.

Theorem 7 (Completeness in exponential family). If the statistical model constitutes an exponential family
which has full rank, then T = [T1 , T2 , · · · , Ts ] is complete and sufficient.
Example 8. Assume X1, X2, · · · , Xn are i.i.d. N(µ, σ²) with known σ² and θ = µ. Then Pθ(X1, X2, · · · , Xn) can
be written as

    Pθ(X1, X2, · · · , Xn) = ∏_{i=1}^{n} (1/√(2πσ²)) exp( −(Xi − µ)²/(2σ²) )    (8)
                           = (2π)^{−n/2} σ^{−n} exp( −Σ Xi²/(2σ²) − nµ²/(2σ²) + µ Σ Xi / σ² ).    (9)

It belongs to the exponential family with h(X) = (2π)^{−n/2} σ^{−n} exp( −Σ Xi²/(2σ²) ), B(θ) = nµ²/(2σ²), and T(X) = Σ Xi.
Then Theorem 7 implies that T(X) is complete and sufficient for µ.

EE378A Statistical Signal Processing Lecture 4 - 04/07/2016

Lecture 4: Unbiased Estimation, UMVU


Lecturer: Jiantao Jiao, Tsachy Weissman Scribe: Neal Jean, Rachel Luo

In this lecture, we will introduce unbiased estimators, the concepts of uniform minimum risk unbiased
(UMRU) and uniform minimum variance unbiased (UMVU) estimators, and the Lehmann-Scheffe Theorem.

1 Logistics
Homework and grading
• No midterm or final, and grading on homeworks will be lenient.
• The emphasis is on learning—feel free to collaborate/discuss on Piazza, or come to office hours for
explanations from course staff.
Technical conditions
• Although these are necessary for rigorous proofs, the goal is to understand the main ideas so that we
will be able to solve real problems using these ideas in the future.
• From this point on, we’ll try to avoid spending time on technical conditions whenever possible.

2 Unbiased Estimation
Definition 1 (Unbiasedness). A decision rule δ(x) is unbiased if Eθ [δ(x)] = g(θ) ∀θ ∈ Θ.
Note: We will argue later in the course that unbiasedness may not be the desired property in practice,
but for now our goal is to find the best unbiased estimator.

2.1 Problems with unbiasedness


Example: Suppose X ∼ B(n, p), i.e., P(X = k) = (n choose k) p^k (1 − p)^{n−k}, 0 ≤ k ≤ n. If we want to estimate p, we
can simply use p̂ = X/n. Then E[p̂] = p. But what if we want to estimate p ln p?
A naive solution is g(p̂) = p̂ ln p̂. Is this unbiased? Jensen's inequality and convexity of g(p) tell us

    E[g(p̂)] > g(E[p̂]) = g(p),

so this is not an unbiased estimator. What's even worse: there does not exist any unbiased estimator for
g(p) = p ln p in the Binomial model!
Proof: Suppose we have a decision rule δ(X), where X ∼ B(n, p). Let us compute E[δ(X)] and set it equal
to p ln p for all p ∈ [0, 1]:

    E[δ(X)] = Σ_{k=0}^{n} δ(k) P(X = k)
            = Σ_{k=0}^{n} δ(k) (n choose k) p^k (1 − p)^{n−k}
            ≡ g(p)
            = p ln p, ∀p ∈ [0, 1].

But this is impossible, since Σ_{k=0}^{n} δ(k) (n choose k) p^k (1 − p)^{n−k} is a polynomial in p of degree ≤ n, while p ln p is not
a polynomial. In general, if we are trying to estimate any quantity that can't be written as a polynomial of
degree no more than n, then an unbiased estimator does not exist.

2.2 UMRU and UMVU


Definition 2 (U-estimable). If there exists an unbiased estimator for g(θ), then g(θ) is U-estimable.
Unfortunately, even if g(θ) is U-estimable, there is no guarantee that any unbiased estimators are good
in any way in a decision theoretic sense. However, it is a very healthy exercise to find the uniformly best
rules among unbiased estimators.
Definition 3 (Uniform minimum risk unbiased (UMRU)). An unbiased estimator δ(X) of g(θ) is the uniform
minimum risk unbiased (UMRU) estimator if R(θ, δ) ≤ R(θ, δ′), ∀θ ∈ Θ, ∀δ′ unbiased.
When the squared error loss is used, we show that UMRU estimators reduce to the uniform minimum
variance unbiased (UMVU) estimators, which are defined by simply replacing the condition R(θ, δ) ≤
R(θ, δ′) by Varθ(δ) ≤ Varθ(δ′).
The key argument is the bias-variance decomposition for the squared error loss L(x, y) = (x − y)²:

    R(θ, δ) = E[L(g(θ), δ(X))]
            = E[(δ(X) − g(θ))²]
            = E[(δ(X) − E[δ(X)] + E[δ(X)] − g(θ))²]
            = E[(a + b)²]

where a = δ(X) − E[δ(X)] and b = E[δ(X)] − g(θ). Then we have

    R(θ, δ) = E[a²] + E[b²] + 2 E[(δ(X) − E[δ(X)])(E[δ(X)] − g(θ))]
            = E[a²] + E[b²] + 2 (E[δ(X)] − g(θ)) · E[δ(X) − E[δ(X)]]
            = E[a²] + E[b²]
            = Varθ(δ) + (Biasθ(δ))².

As a result, when working with squared error loss, we can always decompose our risk into a variance
component and a squared bias component. Since the bias is zero when we are restricted to only use unbiased
estimators, all of the risk comes from the variance (i.e., risk = variance) and the UMRU estimator is also
the UMVU estimator.
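A tiny numerical check of this decomposition (a sketch with a deliberately biased estimator, chosen only for illustration): the Monte Carlo mean squared error of δ(X) = X̄/2 for a Gaussian mean should match its variance plus squared bias.

import numpy as np

rng = np.random.default_rng(7)
theta, sigma, n, reps = 2.0, 1.0, 20, 200000

X = rng.normal(theta, sigma, size=(reps, n))
delta = 0.5 * X.mean(axis=1)             # a deliberately biased estimator of theta

mse = np.mean((delta - theta) ** 2)      # risk under squared error loss
var = np.var(delta)
bias2 = (np.mean(delta) - theta) ** 2

print("MSE:         ", mse)
print("Var + Bias^2:", var + bias2)      # should agree up to Monte Carlo error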

3 Lehmann-Scheffe Theorem
Theorem 1 (Lehmann-Scheffe). Assume T is complete and sufficient, and that h(T) is an unbiased estimator
for g(θ). Then h(T) is

(1) the only function of T that is an unbiased estimator for g(θ);


(2) an UMRU estimator for any convex loss function;
(3) a unique UMRU estimator for any strictly convex loss function;

(4) a unique UMVU estimator.


Proof

(1) Suppose there exists some h̃ ≠ h such that h̃(T) is also an unbiased estimator for g(θ). Then

    E[h̃(T) − h(T)] = E[h̃(T)] − E[h(T)]
                    = g(θ) − g(θ)
                    = 0  ∀θ ∈ Θ.

If we let f(T) = h̃(T) − h(T), then E[f(T)] = 0 ∀θ ∈ Θ. By the definition of completeness, this implies
that f(T) = 0, which implies that h̃ = h everywhere. This is a contradiction, so h(T) must be the only
function of T that is an unbiased estimator for g(θ).

(2) For any unbiased δ(x), Rao-Blackwell gives us η(T) = E[δ(x)|T], which is also unbiased (Rao-Blackwellization
preserves unbiasedness) and is at least as good as δ(x) in terms of risk. It
follows from part (1) that η(T) is the only function of T that is an unbiased estimator for g(θ), which
implies that η(T) = h(T) with probability one. Hence, we have proved that h(T) is at least as good
as any unbiased estimator, which by definition shows h(T) is UMRU.

(3) Uniqueness follows from the strict convexity of the loss function.
(4) The result follows from the strict convexity of the squared error loss.

Side note: Bahadur’s Theorem


Definition 4 (Minimal sufficient). Statistic T is minimal sufficient if for any sufficient statistic U , there
exists a function q(·) such that T = q(U ).

Theorem 2 (Bahadur). Suppose T is complete and sufficient. Then T is minimal sufficient.


By Bahadur’s theorem, if T1 and T2 are two complete sufficient statistics, by minimal sufficiency there
exist deterministic functions p(·) and q(·) such that T1 = p(T2 ), T2 = q(T1 ). Hence, the complete sufficient
statistic is unique up to (bimeasurable) bijection.

4 Examples
4.1 Strategies for finding UMRU estimators
A. Rao-Blackwellization: Start with any unbiased δ(x) and compute E[δ(x)|T ].
• Works in theory, but in practice computing the conditional expectation can be difficult.

B. Solve Eθ [δ(T )] = g(θ), ∀θ ∈ Θ.


• If you can find an unbiased estimator that is a function of T, then by Lehmann-Scheffe it is the
only one.
C. Guess δ(T) by playing around with the distribution of T.

• Wait... what?
• Although this sounds unlikely to work, we will see that in practice this can often be more reason-
able than trying to work through the computations necessary for strategies A and B.

4.2 Example: Bernoulli distribution
Suppose that X1, X2, . . . , Xn are i.i.d. Bernoulli(p), where Z ∼ Bernoulli(p) means that P(Z = 1) = p and P(Z =
0) = 1 − p. In order to apply any of strategies A, B, or C from above, we first need T, so let's start by finding
the complete and sufficient statistic. This is easy if we can write it as an exponential family distribution:

    P(X1, . . . , Xn) = p^{Σ 1(Xi = 1)} (1 − p)^{Σ 1(Xi = 0)}
                      = p^{Σ Xi} (1 − p)^{n − Σ Xi}
                      = exp( (Σ Xi) ln p + (n − Σ Xi) ln(1 − p) )
                      = exp( (Σ Xi) ln(p/(1 − p)) + n ln(1 − p) )

We can see that T = Σ Xi is a complete sufficient statistic. Suppose that we want to estimate p; then
E(p̂) = p implies that p̂ = (1/n) Σ_{i=1}^{n} Xi is the UMRU estimator for p. However, what if we want to estimate
p²? Let's try using recipe A (Rao-Blackwellization). First we need to find a simple, unbiased estimator
δ(x) = X1 · X2:

    E[δ(x)] = E[X1] · E[X2] = p².

Note that δ(x) ∈ {0, 1}. We have

    E[δ(x) | T = t] = P(X1 = 1, X2 = 1 | Σ Xi = t)
                    = P(X1 = 1, X2 = 1, Σ Xi = t) / P(Σ Xi = t)
                    = P(X1 = 1, X2 = 1, Σ_{i=3}^{n} Xi = t − 2) / P(Σ_{i=1}^{n} Xi = t)
                    = p · p · (n−2 choose t−2) p^{t−2} (1 − p)^{(n−2)−(t−2)} 1(t ≥ 2) / [ (n choose t) p^t (1 − p)^{n−t} ]
                    = t(t − 1) 1(t ≥ 2) / (n(n − 1))
                    = t(t − 1) / (n(n − 1)).

Therefore, our UMRU estimator for p² is h(T) = T(T − 1)/(n(n − 1)). As expected, the conditional expectation does
not depend on p, by the definition of sufficiency (it should only depend on T). To summarize, we
start with an unbiased estimator that is not a function of T, so we cannot use Lehmann-Scheffe directly. We first
use Rao-Blackwell to find another unbiased estimator that is a function of T; then all of the implications
of Lehmann-Scheffe apply.
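As a numerical illustration of how much Rao-Blackwellization helps here (a sketch with made-up parameters), one can check by simulation that h(T) = T(T − 1)/(n(n − 1)) has the same mean p² as the naive estimator X1 · X2 but far smaller variance:

import numpy as np

rng = np.random.default_rng(8)
n, p, reps = 30, 0.3, 200000

X = (rng.random((reps, n)) < p).astype(float)

naive = X[:, 0] * X[:, 1]                    # unbiased but crude: mean = p^2
T = X.sum(axis=1)
rao_blackwell = T * (T - 1) / (n * (n - 1))  # UMVU estimator of p^2

print("target p^2              :", p ** 2)
print("naive         mean, var :", naive.mean(), naive.var())
print("Rao-Blackwell mean, var :", rao_blackwell.mean(), rao_blackwell.var())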

EE378A Statistical Signal Processing Lecture 5 - 04/12/2016

Lecture 5: Unbiased Estimation and Information Inequality


Lecturer: Jiantao Jiao, Tsachy Weissman Scribe: Irena Fischer-Hwang, Kedar Tatwawadi

1 Announcements
• We very strongly encourage you to typeset your homework using LaTeX, but if you insist on handwriting
and believe it is neat enough, you can also do that without asking for approval before the deadline.
After the deadline for homework 1 we will carefully review the writing quality of the handwritten
homework (if there are any), and decide whether they meet the standard of neatness. It is of course
our hope that they are neat enough! If some homework fails our test, we may require the corresponding
student(s) to typeset the rest of their homework.

2 Recap
Last time we discussed unbiased estimation, the Lehmann-Scheffe theorem, UMRU and UMVU estimators.
• In particular, we discussed some strategies for finding the UMRU estimator:
1. Rao-Blackwellization: First, find an unbiased estimator δ(x). Then, find the conditional expecta-
tion E[δ(x)|T ], which will be an estimator that is at least as good as δ(x). By Lehmann-Scheffe,
E[δ(x)|T ] will be the UMRU.
2. Solve equation: Recall that an unbiased estimator has the property that Eθ δ(T ) = g(θ), ∀θ ∈ Θ.
3. Guess δ(T ) which satisfies Eθ δ(T ) = g(θ) by playing with the distribution of T .

3 Some interesting examples


At the end of the previous lecture, we showed an example of finding the UMRU estimator of p2 for a
Bernoulli distribution using strategy 1 (Rao-Blackwellization). Today, we examine some examples which
utilize strategies 2 and 3.
Example 1 (An inadmissible UMVU estimator). X ∼ Poi(λ), i.e. P(X = k) = e^{−λ} λ^k / k!, k ∈ N. We wish to
estimate g(λ) = e^{−aλ}, and we start by finding the unbiased estimator. For δ(x) to be unbiased, we require
Eλ δ(x) = g(λ), ∀λ > 0. Let's write out the expression for Eλ δ(x):

    Eλ δ(x) = Σ_{k=0}^{∞} δ(k) e^{−λ} λ^k / k! = e^{−aλ}    (1)

Since e^{−λ} does not depend on k, we can divide it out on both sides to obtain

    Σ_{k=0}^{∞} δ(k) λ^k / k! = e^{(1−a)λ}.    (2)

The expression on the left-hand side (LHS) is a power series expansion, so we will rewrite the right-hand
side (RHS) as a power series expansion as well, and then match terms in order to find δ(k). Using the power
series expansion for an exponential function, the RHS can be written as:

    e^{(1−a)λ} = Σ_{k=0}^{∞} (1 − a)^k λ^k / k!.    (3)

We find that δ(k) = (1 − a)^k. Since this example concerns estimation of a function of a parameter for a
single random variable X, X is obviously a sufficient statistic. As a result, the estimator δ(x) = (1 − a)^x is
a function of a complete sufficient statistic. By Lehmann-Scheffe, this estimator must be the unique UMVU
estimator.
This appears to be a good estimator, but let's look more closely at it. Let's consider two specific cases:

1. Case 1: a = 0. Since a = 0, g(λ) = 1. Our estimator is δ(x) = 1 for all x, as expected.

2. Case 2: a = 2. Since a = 2, g(λ) = e^{−2λ}, which is always positive. However, our estimator is
δ(x) = (−1)^x, which clearly is not always positive.

What is going on? It turns out that the estimator, despite being UMVU, is inadmissible under squared
error loss when a > 1. Here's why: let's consider the specific case of a = 2 again. When a = 2, g(λ) =
e^{−2λ} > 0, and we can propose another estimator δ⁺(x) = max{(−1)^x, 0}. Let's compare the squared error
loss between using estimators δ(x) and δ⁺(x):

    L(θ, δ) = (e^{−2λ} − δ(x))²    (4)
    L(θ, δ⁺) = (e^{−2λ} − max{δ(x), 0})²    (5)

It is clear that L(θ, δ⁺) ≤ L(θ, δ), ∀θ ∈ Θ. Intuitively, we can consider the loss to be the "distance"
between g(λ) (which is a positive number) and the decision δ(x). If δ(x) < 0, then the distance between
g(λ) and δ⁺ is less than that between g(λ) and δ. Thus, δ⁺ dominates δ(x)! So, when a > 1, δ(x) = (1 − a)^x
is inadmissible.
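A quick simulation sketch (with made-up values of λ) comparing the risks of the UMVU estimator δ(x) = (−1)^x and the truncated estimator δ⁺(x) = max{(−1)^x, 0} for a = 2; the truncated rule should have uniformly smaller mean squared error:

import numpy as np

rng = np.random.default_rng(9)
reps = 500000

for lam in [0.2, 1.0, 3.0]:
    X = rng.poisson(lam, size=reps)
    target = np.exp(-2 * lam)              # g(lambda) = e^{-2 lambda}, always positive
    delta = (-1.0) ** X                    # UMVU estimator (takes negative values!)
    delta_plus = np.maximum(delta, 0.0)    # truncated estimator, dominates delta
    print(f"lambda = {lam}: risk(delta) = {np.mean((delta - target) ** 2):.4f}, "
          f"risk(delta+) = {np.mean((delta_plus - target) ** 2):.4f}")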
Example 2 (UMVU estimators for the variance in the Gaussian model). X1, X2, . . . , Xn are i.i.d. N(µ, σ²), with
θ = (µ, σ²) unknown. Note that in this case, (X̄, S²) is the complete sufficient statistic by the property of
the exponential family, where X̄ = Σ_{i=1}^{n} Xi / n is the sample mean, and S² = Σ_{i=1}^{n} (Xi − X̄)². Let's find
UMVU estimators for:
1. µ: It is easy to show that the sample average X̄ is an unbiased estimator of µ. As a result, by
Lehmann-Scheffe, δ(x) = X̄ is the UMVU for µ.
2. σ²: Since ES² = (n − 1)σ², we can define the estimator σ̂² = S²/(n − 1), which is an unbiased estimate of
σ². Again, Lehmann-Scheffe guarantees that σ̂² is the UMVU for σ².

3. σ: Natural approach: As |X1 − X2| is independent of µ and E|X1 − X2| = cσ for some constant
c (in fact c = 2/√π), our estimator can be:

    δ(X1, X2) = |X1 − X2| / c.    (6)

On applying Rao-Blackwellization to δ(X1, X2) with (X̄, S²), it is possible to obtain the UMVU
estimator for σ. However, computing the conditional expectation with respect to S² is messy.
Guessing the estimator: Let's now turn to the third strategy to find the UMVU for σ. Recall that
if X1, X2, . . . , Xn are i.i.d. N(µ, σ²), then we can also represent (X1, X2, . . . , Xn) as µ + σ(Z1, . . . , Zn) where
Z1, Z2, . . . , Zn are i.i.d. N(0, 1). Then it's clear that S² equals σ² χ²_{n−1} in distribution, where χ²_k is the Chi-squared distribution
with k degrees of freedom. We can take the square root and then the expectation of both sides:

    ES = σ E√(χ²_{n−1}).    (7)

Note that E√(χ²_{n−1}) is a constant which does not depend on µ or σ², so we obtain the UMVU estimator σ̂ = S / E√(χ²_{n−1}).

Example 3 (Criteria for goodness of estimators). This example is motivated by the different scaling factors
that sometimes appear in estimators, for instance, in MATLAB toolboxes. In particular, in the previous
example we found the UMVU estimate of σ² to be S²/(n − 1), which has a scaling factor of n − 1. Now, recall the
maximum likelihood estimator (MLE): θ̂_MLE = arg max_θ Pθ(x), which seems like it would come up with a
good estimator for σ². In fact, the MLE is σ̂² = S²/n, but it clearly has a scaling factor of n, not n − 1.
As if this were not confusing enough, let us now find a third candidate for σ̂² by optimizing the risk
function over estimators of the form S²/a, with a > 0. We first compute the risk of S²/a under squared-error
loss. Recall that for any random variable X, we can decompose the second moment of X into the bias and
variance. So we have

    R(σ², S²/a) = E[(S²/a − σ²)²] = Var(S²/a) + (E[S²/a] − σ²)².    (8)

The bias term can be computed as follows:

    E[S²/a] − σ² = (1/a) ES² − σ² = (n − 1)σ²/a − σ².    (9)

Note that the bias may not be zero, so the estimator we obtain may not be an unbiased estimator. Now,
we compute the variance term, recalling from the previous example that S² follows a scaled Chi-squared distribution,
i.e. S² equals σ² χ²_{n−1} in distribution, so (note that if Y ∼ χ²_k then the m-th moment of Y is EY^m = k(k+2)(k+4) · · · (k+2m−2)):

    Var(S²/a) = (σ⁴/a²) · 2(n − 1).    (10)

Since the bias is a linear function of 1/a and the variance is a quadratic function of 1/a, the risk is a quadratic
function of 1/a. So, minimizing the risk involves minimizing a quadratic function of 1/a. The minimization
results in a = n + 1, which gives

    σ̂² = S²/(n + 1).    (11)

This is yet another expression for σ̂²! So, which of the three expressions is the best estimator? Under squared
error loss, the σ̂² we just found implies that both the UMVU estimator and the MLE are inadmissible, for the
third estimator is the optimal estimator of the form S²/a minimizing the risk function under squared
error loss. However, we still cannot claim that the third estimator is optimal, since it may be the case that
not restricting the statistician to estimators of the form S²/a will result in an even better estimator. This
example highlights the difficulty of finite-sample theory, and the fact that the definition of "goodness" of an
estimator matters a great deal for small n. In contrast, for large n it suffices to use the MLE, since it is
known that the MLE is asymptotically optimal (i.e., attains the best possible asymptotic risk as n → ∞).
Also note that these three estimators are essentially the same as n → ∞.
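A small simulation sketch (with made-up parameters) comparing the mean squared errors of the three candidates S²/(n − 1) (UMVU), S²/n (MLE), and S²/(n + 1) (the risk-optimal choice within the family S²/a):

import numpy as np

rng = np.random.default_rng(10)
mu, sigma2, n, reps = 0.0, 4.0, 10, 300000

X = rng.normal(mu, np.sqrt(sigma2), size=(reps, n))
S2 = ((X - X.mean(axis=1, keepdims=True)) ** 2).sum(axis=1)

for a, name in [(n - 1, "UMVU S^2/(n-1)"), (n, "MLE  S^2/n    "), (n + 1, "opt  S^2/(n+1)")]:
    est = S2 / a
    print(name, " MSE =", np.mean((est - sigma2) ** 2))   # smallest MSE expected for a = n+1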

4 Information Inequality (Cramér-Rao Lower Bound)


Goal: As we saw in the previous examples, UMVU estimators need not be the best estimators for
minimizing the risk. However, to understand how they compare with the "best" estimator (which might be
prohibitively difficult to compute), we need a good lower bound on Var(δ(X)).
The general idea of the Cramér-Rao lower bound is to use the Cauchy-Schwarz inequality, which states that for
any function ψ(X, θ) with finite second moment, we have

    Var(δ(X)) ≥ [Cov(δ, ψ)]² / Var(ψ).    (12)

We now want to choose an appropriate ψ(X, θ) which simplifies the above inequality (specifically, simplifies
the covariance term).

4.1 Hammersley-Chapman-Robbins Inequality
For the density pθ(x) = p(x, θ) ≥ 0, ∀x ∈ X, define the function ψ(x, θ) as:

    ψ(x, θ) = p(x, θ + ∆)/p(x, θ) − 1    (13)

where θ, θ + ∆ ∈ Θ, and we assume that g(θ) ≠ g(θ + ∆), where g(θ) is the function we are estimating.
One reason for choosing this ψ(x, θ) is that Eθ ψ(x, θ) = 0. To see why:
    Eθ[ψ(x, θ)] = ∫ (p(x, θ + ∆)/p(x, θ) − 1) p(x, θ) dx    (14)
                = ∫ (p(x, θ + ∆) − p(x, θ)) dx    (15)
                = 0.    (16)

This fact simplifies the covariance term:

    Cov(δ, ψ) = E[δψ] − E[δ]E[ψ]    (17)
              = E[δψ]    (18)
              = ∫ δ(x) (p(x, θ + ∆)/p(x, θ) − 1) p(x, θ) dx    (19)
              = ∫ δ(x) (p(x, θ + ∆) − p(x, θ)) dx    (20)
              = Eθ+∆[δ] − Eθ[δ]    (21)

Upon plugging the covariance term into (12), we obtain the following result:

    Var(δ(X)) ≥ (Eθ+∆[δ] − Eθ[δ])² / Eθ[(p(x, θ + ∆)/p(x, θ) − 1)²].    (22)

If we further assume that the estimator δ(X) is unbiased, i.e., Eθ δ(X) = g(θ), ∀θ, we can simplify (22) and
obtain the Hammersley-Chapman-Robbins inequality in its standard form:

    Var(δ(X)) ≥ (g(θ + ∆) − g(θ))² / Eθ[(p(x, θ + ∆)/p(x, θ) − 1)²].    (23)

4.2 Cramér-Rao Lower Bound


The Cramér-Rao bound can be obtained from the Hammersley-Chapman-Robbins inequality by letting
∆ → 0. The limit of the RHS of (23) can be written as:

    lim_{∆→0} (g(θ + ∆) − g(θ))² / Eθ[(p(x, θ + ∆)/p(x, θ) − 1)²]
        = lim_{∆→0} ((g(θ + ∆) − g(θ))/∆)² / ( Eθ[(p(x, θ + ∆)/p(x, θ) − 1)²] · (1/∆²) )    (24)
        = lim_{∆→0} ((g(θ + ∆) − g(θ))/∆)² / Eθ[((p(x, θ + ∆) − p(x, θ))/(p(x, θ) ∆))²]    (25)
        = (g′(θ))² / Eθ[((1/p(x, θ)) ∂p(x, θ)/∂θ)²]    (26)

Thus, the Cramér-Rao Lower Bound in its standard form can be written as:

    Var(δ(X)) ≥ (g′(θ))² / Var(ψCR(x, θ))    (27)

where ψCR(x, θ) = (1/p(x, θ)) ∂p(x, θ)/∂θ = ∂ ln p(x, θ)/∂θ. In some sense, ψCR captures the relative change in p(x, θ) as
we change θ, and it is also the partial derivative of the log-likelihood ln p(x, θ) with respect to θ, which is a
very interesting result since we did not start with anything to do with the log-likelihood.
The denominator I(θ) ≜ Var(ψCR(x, θ)) is known as the Fisher information, which we will discuss further
in the next lecture. The Fisher information in some sense captures the speed at which the model changes
with θ. As the Fisher information increases, the model changes more significantly for a local change of θ,
so different values of θ can be better distinguished from each other. Better distinguishability of
the models implies the possibility of improved estimation (the lower bound on Var(δ) decreases).
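As a numerical illustration (a sketch with made-up parameters): for n i.i.d. Bernoulli(p) observations the Fisher information is I(p) = n/(p(1 − p)), and the sample mean is unbiased for g(p) = p with variance exactly equal to the Cramér-Rao bound 1/I(p).

import numpy as np

rng = np.random.default_rng(11)
n, p, reps = 40, 0.3, 300000

X = (rng.random((reps, n)) < p).astype(float)
p_hat = X.mean(axis=1)                # unbiased estimator of g(p) = p

fisher = n / (p * (1 - p))            # Fisher information of n Bernoulli(p) samples
crlb = 1.0 / fisher                   # Cramer-Rao lower bound (g'(p) = 1)

print("empirical Var(p_hat):", p_hat.var())
print("Cramer-Rao bound    :", crlb)  # the sample mean attains the bound here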

EE378A Statistical Signal Processing Lecture 6 - 04/14/2016

Lecture 6: Information Inequality, Bayesian Decision Theory


Lecturer: Jiantao Jiao, Tsachy Weissman Scribe: Stefan Jorgensen

In this lecture we will recap the material so far, finish discussing the information inequality and introduce
the Bayes formulation of decision theory.

1 Recap
The first part of this course focuses on statistical decision theory, which provides a systematic framework for
statistical study. In the second half of the course, we will focus on learning theory, which is much younger
(developed since the 1970s). We study both because their theoretical foundations are very similar and
complement each other, and we can accelerate our understanding of learning theory by leveraging
established decision theory results.
Statistical decision theory problems specify a family of distributions {Pθ , θ ∈ Θ} and seek a decision
rule δ(X) to minimize the risk in a certain sense. For the problem to be well posed, we must impose further
restrictions. We studied two options:
A. Restrict the statistician: e.g. requiring that the decision rule be unbiased or equivariant. Though
restricting ourselves to these special types of decision rules may make us miss some other good decision
rules, the study of the optimal decision rules under these restrictions gives us insights about the struc-
ture of the decision problem, and deepens our understandings of important concepts such as sufficiency
and completeness, as well as important theorems such as Rao-Blackwell, Basu, and Lehmann–Scheffe.
B. Assume more knowledge about nature: We admit all decision rules, but seek ‘weaker’ notions of
optimality. In particular we will study
• Bayes risk, where we minimize the expected risk over a prior distribution on θ. This is particularly
powerful because it provides an easy and principled way of incorporating ‘knowledge’ about the
problem into the decision rule (e.g. data from earlier tests about the distribution of outcomes).
• Minimax optimality, where we choose δ to minimize the worst case risk.

2 Information Inequality (continued from Lecture 5)


Recall that from Cauchy-Schwarz we had

    Var(δ) ≥ [Cov(δ, ψ)]² / Var(ψ)                                        (1)
where we define Cov(X, Y ) = E[(X − E[X])(Y − E[Y ])] and Var(X) = Cov(X, X).
In the following we consider a specific choice of ψ, the Fisher information, the multi-variate bound and
when the bound is tight.

2.1 Score function


We found the following choice of ψ(x, θ) to be particularly important:
Definition 1 (Score function).

    Sθ(x) ≜ (∂/∂θ) log p(x; θ)                                            (2)

1
2.1.1 Properties of the score function:
1. E[Sθ (X)] = 0.
Proof: By the chain rule and definition of expectation we have
    E[Sθ(X)] = E[(∂/∂θ) log p(X; θ)] = E[ṗ(X; θ)/p(X; θ)] = ∫ ((∂/∂θ) p(x; θ)) (1/p(x; θ)) p(x; θ)dx,   (3)

where ṗ(x; θ) ≜ ∂p(x; θ)/∂θ. When technical conditions on p(x; θ) are met (to be precise, the statistical
experiment should be ‘regular’), the order of differentiation and integration can be exchanged. This
gives

    E[Sθ(X)] = (∂/∂θ) ∫ p(x; θ)dx = (∂/∂θ) 1 = 0                          (4)

since the integral of p(x; θ) is constant for all θ.

2. Cov(Sθ, δ) = (∂/∂θ) Eθ[δ(X)].
Proof: Since the score function has zero mean,

Cov(Sθ , δ) = Eθ [Sθ (δ(X) − E[δ(X)])] = Eθ [Sθ δ(X)] (5)

Evaluating the integral gives


    Cov(Sθ, δ) = ∫ δ(x) ((∂/∂θ) p(x; θ)) (1/p(x; θ)) p(x; θ)dx = ∫ δ(x) (∂/∂θ) p(x; θ)dx     (6)

and now, assuming the order of integration and differentiation can be exchanged, we obtain:

    Cov(Sθ, δ) = (∂/∂θ) ∫ δ(x)p(x; θ)dx = (∂/∂θ) Eθ[δ(X)]                 (7)

3. If δ is unbiased, then Cov(Sθ, δ) = g′(θ).


Proof: This follows directly from the previous result, since for unbiased δ we have Eθ [δ(X)] = g(θ).
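These two properties are easy to check numerically. The following small Monte Carlo sketch (ours, not from the notes; the Poisson(θ) model and δ(X) = X are arbitrary illustrative choices) verifies E[Sθ(X)] = 0 and Cov(Sθ, δ) = (∂/∂θ) Eθ[δ(X)] = 1:

# Monte Carlo check of the score-function properties for a Poisson(theta) model
# with delta(X) = X: E[S_theta(X)] = 0 and Cov(S_theta, X) = d/dtheta E[X] = 1.
import numpy as np

rng = np.random.default_rng(0)
theta = 2.5
x = rng.poisson(theta, size=1_000_000)

# Score of the Poisson family: S_theta(x) = x/theta - 1.
score = x / theta - 1.0

print("E[S_theta(X)]    ~", score.mean())            # close to 0
print("Cov(S_theta, X)  ~", np.cov(score, x)[0, 1])  # close to 1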

2.1.2 Generalizations of Sθ
The score function is special because it is the only ψ(x, θ) that makes the corresponding Cauchy–Schwarz
inequality give the correct asymptotic variance in asymptotic theory. We could also consider a more general
class of ψ(x, θ) of the form
    ψ(x, θ) = ((∂^k/∂θ^k) p(x; θ)) / p(x; θ)                              (8)

but while these also have the zero-mean property, they do not play the same role as the score function does in
asymptotic theory.

2.2 Fisher Information


The score function captures the relative rate at which the distribution changes when θ varies. If small changes
in θ result in large changes in the distribution of X, it is very easy to distinguish θ based on observations of X.
On the contrary, if large changes in θ only result in small changes in the distribution of X, then estimating θ
from X is very difficult, since there is not much information about θ available. This relationship is captured
by the variance of the score function, which we call the Fisher information:
Definition 2 (Fisher information).
    I(θ) ≜ Eθ[Sθ²]                                                        (9)

2.2.1 Properties of Fisher information
1. I(θ) depends on parameterization (but the information inequality (1) does not)
This is easy to see from the definition. Suppose θ = h(ξ), for a function h differentiable at ξ. Then we
have from the chain rule that
    I(ξ) = I(θ)[h′(ξ)]²                                                   (10)
Applying this equation to the information inequality, we see that the h′(ξ) term appears in both the
numerator and denominator, meaning that equation (1) is independent of the choice of parameteriza-
tion.
2. Alternate (easier) definition: I(θ) = −E[Ṡθ] = −E[(∂²/∂θ²) log p(X; θ)]
Proof: Working backwards, we have from the chain rule and quotient rule that

    −E[Ṡθ] = −E[(∂/∂θ)(ṗ(X; θ)/p(X; θ))]
            = −E[(p̈(X; θ)p(X; θ) − ṗ(X; θ)²)/p(X; θ)²]
            = E[(ṗ(X; θ)/p(X; θ))²] − E[p̈(X; θ)/p(X; θ)]                  (11)
The first term of the last equality is simply the Fisher information, and the second term evaluates to zero
(assuming we can exchange the order of differentiation and integration), since it is the partial derivative
of the integral over a probability distribution (similar calculation as showing that E[Sθ (x)] = 0). Hence
we have
−E[Ṡθ ] = I(θ) (12)
Remark: This definition is particularly convenient when p(x; θ) belongs to the exponential family.

2.2.2 Example: Fisher information for binomial model


We want to compute I(θ) for the binomial model, X ∼ B(n, θ). Recall that the probability distribution of
x given θ is

    p(x; θ) = (n choose x) θ^x (1 − θ)^(n−x),   x = 0, · · · , n          (13)

and that Eθ[X] = nθ. Taking the logarithm, we have

    log p(x; θ) = log (n choose x) + x log(θ) + (n − x) log(1 − θ)        (14)

Now we can compute the score function and its derivative:

    Sθ(x) = (∂/∂θ) log p(x; θ) = 0 + x/θ − (n − x)/(1 − θ)                (15)
    Ṡθ(x) = −x/θ² − (n − x)/(1 − θ)²                                      (16)

As a sanity check, it is easy to verify that E[Sθ] = 0. This example is simple enough that we could evaluate
I(θ) directly as the variance of Sθ, but it is still much easier to use the alternate form:

    I(θ) = −E[Ṡθ] = Eθ[X]/θ² + (n − Eθ[X])/(1 − θ)² = n/θ + n/(1 − θ) = n/(θ(1 − θ))        (17)

Now we can use the inequality (1) to lower bound the variance of any estimator δ as

    Var(δ) ≥ ((∂/∂θ) E[δ(X)])² / I(θ) = ((∂/∂θ) E[δ(X)])² / (n/(θ(1 − θ)))                  (18)

In the case when δ is unbiased, we can go further, since (∂/∂θ) Eθ[δ(X)] = 1:

    Var(δ) ≥ θ(1 − θ)/n                                                   (19)
which means that the variance is lower bounded by Ω(n−1 ). But we also know that the variance of the
empirical estimator θ̂ = x/n is θ(1 − θ)/n, which means that θ̂ is the optimal (unbiased) estimator for the
squared loss function. In other words, θ̂ is the UMVU estimator.
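As a quick sanity check (ours, not part of the original notes; the particular n, θ and sample size are arbitrary), the empirical variance of X/n indeed matches the Cramér-Rao bound (19):

# Numerical check of (17)-(19) for the binomial model: the empirical variance of
# the unbiased estimator X/n matches theta*(1-theta)/n = 1/I(theta).
import numpy as np

rng = np.random.default_rng(0)
n, theta, trials = 50, 0.3, 200_000

x = rng.binomial(n, theta, size=trials)
theta_hat = x / n

print("empirical Var(X/n) ~", theta_hat.var())
print("CR bound 1/I(theta) =", theta * (1 - theta) / n)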

2.3 Multi-variate case


The Cramér-Rao bound is fundamentally the Cauchy-Schwarz inequality, so to extend it to the multivariate
case we look at the multi-variate Cauchy-Schwarz inequality. In the multivariate case, θ is a vector, and so
instead of partial derivatives we get gradient vectors. Consider choosing
    ψ(x, θ) = a^T ∇θ log p(x; θ) = Σ_{i=1}^S a_i ψ_i(x; θ)                (20)

for some vector of coefficients a = [a1, . . . , aS] not all zero. Using this definition of ψ with the Cauchy-Schwarz
bound gives

    Var(δ) ≥ [Cov(δ, Σ_{i=1}^S a_i ψ_i(X; θ))]² / Var(Σ_{i=1}^S a_i ψ_i(X; θ))              (21)

Since we want to make the bound as tight as possible, we choose a to maximize the right hand side. This
maximization problem has a closed-form solution (Check out [1, Chapter 2, Exercises 6.2-3]). Define the
vector of covariances element-wise as
[γ]i = Cov(δ, ψi ) (22)
and the Fisher information matrix as

[I(θ)]i,j = Cov(ψi , ψj ) (23)

Then the Cramér-Rao lower bound is optimized by choosing a = I^{-1}(θ)γ. That is,

    Var(δ) ≥ γ^T I^{-1}(θ) γ                                              (24)

2.4 Looseness of Cramér-Rao


The Cramér-Rao bound is only tight if the Cauchy-Schwarz inequality is tight. That is, it is tight if and
only if there exists some function λ(θ) that does not depend on x, such that
    δ(x) − Eθ[δ(X)] = λ(θ) Σ_{i=1}^S a_i ψ_i(x; θ)                        (25)

But the existence of λ(θ) that satisfies this equality everywhere restricts p(x; θ). Concretely, it requires
that p(x; θ) follows an exponential family distribution. Even with this restriction, it turns out that the
Cramér-Rao bound is still quite useful, and particularly gives insightful results in the asymptotic analysis.

3 Bayes Formulation of Decision Theory
So far we have studied ways that restricting the statistician leads to well-posed decision theory problems.
The second approach to framing decision theory problems is to assume more knowledge about nature. In
this section, we discuss the Bayes formulation, where we assume that the parameter θ is selected from a
prior distribution Λ(θ) over the parameter space Θ. We define the average risk as
Definition 3 (Average risk).
    r(Λ, δ) = ∫ R(θ, δ)dΛ(θ)                                              (26)

3.1 Motivation
A practical statistical theory should be able to comprehensively address the following two general questions:
• Achievability: produce approaches to construct practical schemes for general problems

• Evaluation: produce comprehensive framework to evaluate general schemes


From one perspective, one can think of the Bayes formulation as a way to incorporate known knowledge
about the statistical problems, and from another perspective, the Bayes formulation provides a general way
to produce reasonable practical schemes. Indeed, the complete class theorem tells us that it essentially
suffices to consider Bayes estimators in a large variety of decision theoretic problems, even if one does not
accept the Bayes prior assumption. In contrast, UMRU estimators do not address the first question in a
satisfactory fashion since they may not exist in many important cases.
There exist other general methods for achievability in decision theory, such as M (Z)-estimation and
method of moments, and we will discuss more methods when we talk about learning theory.

References
[1] Lehmann, Erich Leo, and George Casella. Theory of Point Estimation, Springer Texts in Statistics,
1998.

EE378A Statistical Signal Processing Lecture 7 - 04/19/2016

Lecture 7: Basic Results of Bayes Decision Theory


Lecturer: Jiantao Jiao, Tsachy Weissman Scribe: Ned Danyliw and Junhong Choi

In this lecture, we will learn about four important theorems in Bayesian framework and start looking at
examples of Bayes estimators.

1 Bayes Setup
Recall in Lecture 2 that the basic setup in statistical decision theory is as follows: an outcome space X , a
class of probability measures {Pθ : θ ∈ Θ}, and observations X ∼ Pθ , X ∈ X . In Bayes decision theory
framework, we assume in addition that θ are probabilistic, and θ follows a distribution Λ(θ). Why do
Bayesian statisticians make this assumption?
• This is a nice way to incorporate prior knowledge into data. Actually not everyone thinks it is a good
idea, because sometimes it is unclear how to impose an appropriate prior on your data.

• This is a general way to generate reasonable schemes. What do we mean by ”reasonable schemes”?
We will look at it later.

2 Terminology
• Average risk:

      r(Λ, δ) ≜ ∫ R(θ, δ) dΛ(θ),

  where R(θ, δ) is the risk function and ∫ dΛ(θ) = 1, i.e. Λ(θ) is a probability distribution.
• Bayes solution is the estimator or decision rule which minimizes the average risk:

      δΛ = arg min_δ r(Λ, δ).

3 Existence of Bayes Estimator


The general message of this section: it is “easy” to statistically solve δΛ in sense that δΛ usually exists (in
contrast to the UMVU which might not always exist). However, it might not be easy to compute the Bayes
solution.

Theorem 1. Suppose {Pθ , θ ∈ Θ} is a family of distributions, Λ(θ) is a probability distribution on Θ, L(θ, δ) ≥ 0, the goal is to estimate
g(θ), and assume
(1) There exists δ0 that achieves finite average risk.
(2) For almost all x, there exists a value δΛ(x) minimizing E[L(θ, δ(X))|X = x].

Then δΛ (X) is a Bayes estimator.

1
Proof Let δ be any estimator with finite average risk. It follows from assumption (2) and non-negativity
of loss function L that
E [L(θ, δ(X))|X = x] ≥ E [L(θ, δΛ (X))|X = x] ≥ 0.
Taking expectation of both sides with respect to PX yields

E {E [L(θ, δ(X))|X]} ≥ E {E [L(θ, δΛ (X))|X]} ≥ 0.

By tower property of conditional expectation,

E[L(θ, δ)] ≥ E[L(θ, δΛ )].

Remark 1. Given X = x, δ(X) is a constant (if δ is a deterministic decision rule), but L(θ, δ(X)) is still
random because θ is probabilistic. Hence the conditional expectation E[L(θ, δ(X))|X = x] makes sense in
assumption (2).
Remark 2. The expectations in E[L(θ, δ)] ≥ E[L(θ, δΛ )] are with respect to the joint distribution of
PX,θ = Λ(θ)PX|θ .

Let’s get a more intuitive understanding of the above theorem. Write

    r(Λ, δ) = ∫ R(θ, δ)dΛ(θ)
            = ∫ (∫ L(θ, δ)dPX|θ) dΛ(θ)
            = ∫∫ L(θ, δ)dPX|θ dΛ(θ)              [Fubini’s theorem]
            = ∫∫ L(θ, δ)dPX,θ                    [dPX|θ dΛ(θ) = dPX,θ]
            = ∫∫ L(θ, δ)dPθ|X dPX                [dPX,θ = dPθ|X dPX]
            = ∫ (∫ L(θ, δ)dPθ|X) dPX             [Fubini’s theorem]

Since we have written the average risk r(Λ, δ) as an integral over x, it suffices to minimize the integrand
∫ L(θ, δ)dPθ|X in the brackets (which is called the conditional risk). This basically justifies the above theorem. In
Bayes setting, we actually have names for these two distributions
• Λ(θ) : Prior distribution
• Pθ|X : Posterior distribution

4 Bayes Estimator under Three Popular Loss Functions


In the previous section, we have shown that the task of average risk minimization can be reduced to the task
of conditional risk minimization. Suppose we have computed the posterior distribution Pθ|X , how could we
minimize the conditional risk E[L(θ, δ(X))|X = x]? It is hard in general as it depends largely on which loss
function is used. But for the three most popular loss functions, we have the following useful facts.
Theorem 2.
• For squared error loss L(θ, d) = (d − g(θ))2 , δΛ (x) = E[g(θ)|X = x].

• For absolute error loss L(θ, d) = |d − g(θ)|, δΛ (x) = any median of Pg(θ)|X=x .

• For Hamming loss L(θ, d) = 1(g(θ) ≠ d), δΛ (x) = arg maxy Pg(θ)|X (y), which is the Maximum a
posteriori (MAP) estimator of g(θ).
Remark. Since δΛ (X) is a constant in the conditional risk, all we need to do is find a scalar δ(x) for the
conditional risk at any given x. It is an easy task if the loss function L is either squared error loss, absolute
error loss or Hamming loss.
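The following tiny sketch (ours, not from the notes; the discretization and the particular unnormalized posterior are arbitrary choices) illustrates Theorem 2 on a gridded posterior, where the three Bayes estimates are simply the posterior mean, median, and mode:

# Bayes estimates under the three losses, computed from a discretized posterior.
import numpy as np

theta_grid = np.linspace(0.01, 0.99, 99)
posterior = theta_grid ** 3 * (1 - theta_grid) ** 7     # unnormalized, roughly Beta(4, 8)
posterior /= posterior.sum()

mean = np.sum(theta_grid * posterior)                          # squared error loss
median = theta_grid[np.searchsorted(posterior.cumsum(), 0.5)]  # absolute error loss
mode = theta_grid[np.argmax(posterior)]                        # Hamming loss (MAP)

print(mean, median, mode)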

5 Admissibility of Bayes Estimator, and Complete Class Theorem


Theorem 3. (part 1) Every unique Bayes estimator is admissible.

Proof Let δΛ be the unique Bayes estimator under prior Λ. Suppose for the sake of contradiction that
there exists δ ≠ δΛ such that
    R(θ, δ) ≤ R(θ, δΛ), ∀θ ∈ Θ.
Taking expectations w.r.t. Λ(θ) yields

    r(Λ, δ) = ∫ R(θ, δ)dΛ(θ) ≤ ∫ R(θ, δΛ)dΛ(θ) = r(Λ, δΛ).

Since δΛ minimizes the average risk, r(Λ, δ) ≥ r(Λ, δΛ). Putting these together, we have r(Λ, δ) = r(Λ, δΛ), i.e. δ
is also a Bayes estimator. This is a contradiction to the uniqueness of the Bayes estimator δΛ.

Remark 1. This is super powerful! Even UMVU and MLE may not be admissible.
Remark 2. It is worth noting that an admissible estimator does not necessarily imply that it is “good”.
However inadmissibility does imply that it is “bad”.

This leads us to the question: is every admissible estimator Bayes? Yes, under mild conditions.
Theorem 3. (part 2: complete class theorem) Suppose
(1) Θ is a compact set.

(2) The risk set {R(·, δ) : δ is a (random) decision rule} is convex.


(3) R(θ, δ) is continuous in θ for every δ.
Then Bayes estimators (decision rules) constitute a complete class in the sense that for any admissible δ ∗ ,
∃ prior Π∗ such that δ ∗ is a Bayes solution for Π∗ .

Remark. Part 2 of Theorem 3 tells us that it suffices to search over all possible Bayes decision rules to find
the best estimator.

Example 1. Suppose X ∼ Binomial(n, θ), θ ∈ [0, 1]. θ̂ = X/n is UMVU, MLE and admissible. If we in
addition assume that θ follows a Beta distribution

    Λ(θ) = Λa,b(θ) = θ^(a−1) (1 − θ)^(b−1) / B(a, b)   for some a, b > 0,

where B(a, b) is the Beta function defined as

    B(a, b) = ∫₀¹ θ^(a−1) (1 − θ)^(b−1) dθ = Γ(a)Γ(b)/Γ(a + b).

This prior is called the “conjugate prior” to the Binomial distribution. It makes the posterior distribution
much easier to compute:

    Pa,b(X, θ) = Λa,b(θ) PX|θ
              = [θ^(a−1) (1 − θ)^(b−1) / B(a, b)] · (n choose x) θ^x (1 − θ)^(n−x)
              = C(a, b, X) θ^(x+a−1) (1 − θ)^(n−x+b−1),

where C(a, b, X) does not depend on θ. Then

    Pa,b(θ|X) = Pa,b(X, θ) / Pa,b(X) = [C(a, b, X)/Pa,b(X)] θ^(x+a−1) (1 − θ)^(n−x+b−1) = C′(a, b, X) θ^(x+a−1) (1 − θ)^(n−x+b−1).

The denominator Pa,b(X) does not depend on θ, so we can absorb it into C(a, b, X) and write
C(a, b, X)/Pa,b(X) = C′(a, b, X). We conclude that

    Pa,b(θ|X) ∼ Beta(x + a, n − x + b).

Using Theorem 2, the Bayes estimator under squared error loss is

    θ̂a,b = E[θ|X] = (x + a)/((x + a) + (n − x + b)) = (x + a)/(a + b + n).

Here we use the basics of the Beta distribution: if Z ∼ Beta(a, b),

    E[Z] = a/(a + b),   Var(Z) = ab/((a + b)²(a + b + 1)).
Let’s look at this result a bit more closely:
• θ̂a,b is a convex combination of θ̂ (UMVU, MLE) and the prior mean:

      θ̂a,b = [(a + b)/(a + b + n)] · (a/(a + b)) + [n/(a + b + n)] · (x/n),

  where a/(a + b) is the prior mean and x/n = θ̂.
• θ̂a,b approaches θ̂ as a and b approach 0. We can interpret a as the number of samples of “1” you
have seen before the experiment, and b as the number of samples of “0”.
• θ̂a,b is biased, since its first part a/(a + b) is biased and its second part θ̂ is unbiased (UMVU).
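The following short sketch (ours, not from the notes; n, θ, a, b are arbitrary illustrative values) compares the conjugate-prior Bayes estimator (x + a)/(a + b + n) with the MLE x/n on simulated data; here the prior happens to help, while for θ far from the prior mean a/(a + b) the MLE can win for small n:

# Compare the MSE of the Bayes estimator with the MLE on simulated binomial data.
import numpy as np

rng = np.random.default_rng(1)
n, theta, a, b = 10, 0.7, 2.0, 2.0

x = rng.binomial(n, theta, size=100_000)
mle = x / n
bayes = (x + a) / (a + b + n)

print("MSE of MLE   ~", np.mean((mle - theta) ** 2))
print("MSE of Bayes ~", np.mean((bayes - theta) ** 2))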

6 Biasedness of Bayes Estimator


As shown in Example 1, the Bayes estimator for Binomial model is biased. Is it generically true for any
Bayes estimator?
Theorem 4. Λ(θ) is a prior on Θ and L is the squared error loss. Then no unbiased estimator δ(x) can be
a Bayes estimator unless
E[(δ(X) − g(θ))2 ] = 0
or equivalently δ(X) = g(θ) with probability 1.
Remark. This does not mean δ(X) = g(θ) everywhere. For example, if δ(X) ≠ g(θ) only on a subset of Θ where
Λ(θ) has no probability mass, we can still have E[(δ(X) − g(θ))²] = 0.
Example 2. If X ∼ N (µ, 1), then X is unbiased (it is UMVU, MLE and admissible) and cannot satisfy
X = µ with probability one for any prior distribution on µ. Hence, it cannot be a Bayes estimator.
Remark. However we can show that X is approached as the limit of a sequence of Bayes estimators.

EE378A Statistical Signal Processing Lecture 9 - 04/26/2016

Lecture 9: State Estimation in Hidden Markov Processes


Lecturer: Jiantao Jiao, Tsachy Weissman Scribe: Abbas Kazerouni

In this lecture, we introduce Hidden Markov Processes and develop efficient methods for estimation in such
processes. Although a similar analysis can be carried over to the continuous case, in this lecture we focus
on a discrete setting where all the random variables take values in a discrete set.

1 Notation
We start by introducing some notations which will be useful in the subsequent analysis. We use upper-case
letters X, Y, Z to denote random variables and lower-case letters x, y, z to denote specific realizations. For
any m ≤ n and collection of random variables X1, X2, · · · , Xn, we let

    X^n = (X1, X2, · · · , Xn),   X_m^n = (Xm, Xm+1, · · · , Xn).

Moreover, we abbreviate P(X = x, Y = y, Z = z) as p(x, y, z), and similarly for conditional probabilities.

2 Hidden Markov Process


Recall from the previous lectures that a sequence of random variables X, Y, Z forms a Markov chain, denoted
as X − Y − Z, if p(x, z|y) = p(x|y)p(z|y) or equivalently p(x|y, z) = p(x|y) or equivalently p(z|y, x) = p(z|y).
The following Lemma gives another specification of a Markov chain.
Lemma 1. Let X, Y, Z be three random variables. Then, X − Y − Z if and only if there exist two functions
φ1 and φ2 such that
p(x, y, z) = φ1 (x, y)φ2 (y, z).
Proof (Proof of necessity) Assume that X − Y − Z. Then by definition of the Markov chain, we have

p(x, y, z) = p(x|y, z)p(y, z) = p(x|y)p(y, z).

Now taking φ1(x, y) = p(x|y) and φ2(y, z) = p(y, z) proves this direction.
(Proof of sufficiency) Now suppose the functions φ1, φ2 exist as mentioned in the statement. Then we have:

    p(x|y, z) = p(x, y, z)/p(y, z)
              = p(x, y, z) / Σ_x̃ p(x̃, y, z)
              = φ1(x, y)φ2(y, z) / (φ2(y, z) Σ_x̃ φ1(x̃, y))
              = φ1(x, y) / Σ_x̃ φ1(x̃, y).

The final equality shows that p(x|y, z) is not a function of z and hence, X − Y − Z follows by definition.

Definition 2 (Hidden Markov Process). The process {(Xn , Yn )}n≥1 is a Hidden Markov Process if

• {Xn}n≥1 is a Markov process, i.e.,

      ∀ n ≥ 1 : p(x^n) = Π_{t=1}^n p(xt|xt−1),

  where by convention, we take X0 ≡ 0 and hence p(x1|x0) = p(x1).

• {Yn}n≥1 is determined by a memoryless channel; i.e.,

      ∀ n ≥ 1 : p(y^n|x^n) = Π_{t=1}^n p(yt|xt).

The processes {Xn }n≥1 and {Yn }n≥1 are called the state process and the noisy observations, respectively.
In view of the above definition, we can derive the joint distribution of the state and the observations as
follows:

    p(x^n, y^n) = p(x^n)p(y^n|x^n) = Π_{t=1}^n p(xt|xt−1)p(yt|xt).

The main problem in Hidden Markov Models is to compute the posterior probability of the state
at any time, given all the observations up to that time, i.e. p(xt|y^t). The naive approach to do this is to
simply use the definition of conditional probability:

    p(xt|y^t) = p(xt, y^t)/p(y^t) = Σ_{x^{t−1}} p(y^t|xt, x^{t−1}) p(xt, x^{t−1}) / Σ_{x̃t} Σ_{x^{t−1}} p(y^t|x̃t, x^{t−1}) p(x̃t, x^{t−1}).

The above approach needs exponentially many computations both for the numerator and the denominator.
To avoid this problem, we develop an efficient method for computing the posterior probabilities using forward
recursion. Before getting to the algorithm, we establish some conditional independence relations.

3 Some Conditional Independence Relations


(a) (X^{t−1}, Y^t) − Xt − (X_{t+1}^n, Y_{t+1}^n):
Proof We have:

    p(x^n, y^n) = Π_{i=1}^n p(xi|xi−1)p(yi|xi)
                = [Π_{i=1}^t p(xi|xi−1)p(yi|xi)] · [Π_{i=t+1}^n p(xi|xi−1)p(yi|xi)],

where the first bracket plays the role of φ1(xt, (x^{t−1}, y^t)) and the second that of φ2(xt, (x_{t+1}^n, y_{t+1}^n)).
The statement then follows from Lemma 1.

(b) (X^{t−1}, Y^{t−1}) − Xt − (X_{t+1}^n, Y_t^n):
Proof Note that if {Xi, Yi}_{i=1}^n is a Hidden Markov Process, then the time-reversed process {Xn−i, Yn−i} will also be a Hidden Markov Process. Applying (a) to this reversed process proves (b).

(c) X^{t−1} − (Xt, Y^{t−1}) − (X_{t+1}^n, Y_t^n):
Proof Left as an exercise.

4 Causal Inference via Forward Recursion
We now derive the forward recursion algorithm as an efficient method to sequentially compute the causal
posterior probabilities.
Note that we have

    p(xt|y^t) = p(xt, y^t) / Σ_{x̃t} p(x̃t, y^t)
              = p(xt, y^{t−1})p(yt|xt, y^{t−1}) / Σ_{x̃t} p(x̃t, y^{t−1})p(yt|x̃t, y^{t−1})
              = p(y^{t−1})p(xt|y^{t−1})p(yt|xt) / [p(y^{t−1}) Σ_{x̃t} p(x̃t|y^{t−1})p(yt|x̃t)]     (by (b))
              = p(xt|y^{t−1})p(yt|xt) / Σ_{x̃t} p(x̃t|y^{t−1})p(yt|x̃t).

Define αt(xt) = p(xt|y^t) and βt(xt) = p(xt|y^{t−1}); then the above computation can be summarized as

    αt(xt) = βt(xt)p(yt|xt) / Σ_{x̃t} βt(x̃t)p(yt|x̃t).                     (1)

Similarly, we can write β as

    βt+1(xt+1) = p(xt+1|y^t)
               = Σ_{xt} p(xt+1, xt|y^t)
               = Σ_{xt} p(xt|y^t)p(xt+1|y^t, xt)
               = Σ_{xt} p(xt|y^t)p(xt+1|xt)                                (by (a)).

The above equation can be summarized as

    βt+1(xt+1) = Σ_{xt} αt(xt)p(xt+1|xt).                                  (2)

Equations (1) and (2) indicate that αt and βt can be sequentially computed based on each other for t =
1, · · · , n with the initialization β1(x1) = p(x1). Hence, with this simple algorithm, the causal inference can
be done efficiently in terms of both computation and memory. This is called the forward recursion.
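A minimal sketch of the forward recursion (1)-(2) for a toy two-state HMM follows (ours, not from the notes; the transition matrix P, emission matrix Q, initial distribution pi0, and all names are illustrative choices):

# Forward recursion for a toy two-state HMM with binary observations.
import numpy as np

P = np.array([[0.9, 0.1],       # P[i, j] = p(x_{t+1} = j | x_t = i)
              [0.2, 0.8]])
Q = np.array([[0.7, 0.3],       # Q[i, k] = p(y_t = k | x_t = i)
              [0.1, 0.9]])
pi0 = np.array([0.5, 0.5])      # p(x_1)

def forward(y):
    """Return alpha[t, i] = p(x_t = i | y^t) for the observation sequence y."""
    alphas = []
    beta = pi0                                  # beta_1(x_1) = p(x_1)
    for yt in y:
        unnorm = beta * Q[:, yt]                # beta_t(x) p(y_t | x)
        alpha = unnorm / unnorm.sum()           # eq. (1)
        alphas.append(alpha)
        beta = alpha @ P                        # eq. (2)
    return np.array(alphas)

print(forward([0, 0, 1, 1, 1]))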

5 Non-causal Inference via Backward Recursion


Now suppose that we are interested in finding p(xt|y^n) for t = 1, 2, · · · , n in a non-causal manner. For this
purpose, we can derive a backward recursion algorithm as follows:

    p(xt|y^n) = Σ_{xt+1} p(xt, xt+1|y^n)
              = Σ_{xt+1} p(xt+1|y^n)p(xt|xt+1, y^t)                        (by (c))
              = Σ_{xt+1} p(xt+1|y^n) p(xt|y^t)p(xt+1|xt) / p(xt+1|y^t).    (by (a))

Now, let γt(xt) = p(xt|y^n). Then, the above equation can be summarized as

    γt(xt) = Σ_{xt+1} γt+1(xt+1) αt(xt)p(xt+1|xt) / βt+1(xt+1).            (3)

Equation (3) indicates that γt can be recursively computed based on αt ’s and βt ’s for t = n − 1, n − 2, · · · , 1
with the initialization of γn (xn ) = αn (xn ). This is called the backward recursion.
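Continuing the toy example above (ours, not from the notes), the backward recursion (3) computes the smoothed posteriors from the alphas and the transition matrix:

# Backward recursion (3): gamma[t, i] = p(x_t = i | y^n).
import numpy as np

def backward(y, alphas, P):
    n = len(y)
    gammas = np.zeros_like(alphas)
    gammas[n - 1] = alphas[n - 1]                    # gamma_n = alpha_n
    for t in range(n - 2, -1, -1):
        beta_next = alphas[t] @ P                    # beta_{t+1}, eq. (2)
        # eq. (3): gamma_t(x) = alpha_t(x) * sum_{x'} P[x, x'] * gamma_{t+1}(x') / beta_{t+1}(x')
        gammas[t] = alphas[t] * (P @ (gammas[t + 1] / beta_next))
        gammas[t] /= gammas[t].sum()                 # guard against round-off
    return gammas

For instance, backward([0, 0, 1, 1, 1], forward([0, 0, 1, 1, 1]), P) returns the non-causal posteriors for the same toy chain.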

EE378A Statistical Signal Processing Lecture 10 - 04/28/2016

Lecture 10: State Estimation in Hidden Markov Processes


Lecturer: Jiantao Jiao, Tsachy Weissman Scribe: Dominic Delgado, Lewis Guignard, Greg Kehoe

Last time we talked about forward and backward recursions in hidden Markov processes, where we obtained
both the “causal” posterior probability distribution p(xt |y t ) and the “non-causal” one p(xt |y n ). Now we
treat the hidden Markov process as a communication model, i.e., the transmitter sends x^n and the receiver
receives y^n through a memoryless channel p(y^n|x^n) = Π_{t=1}^n p(yt|xt). With the posterior distribution p(xt|y^n)
in hand, we can minimize the bit error probability

    (1/n) Σ_{t=1}^n P(x̂t ≠ xt)

based on the MAP estimates

    x̂t(y^n) = arg max_{xt} p(xt|y^n),   t = 1, · · · , n.

However, if we want to minimize the block error probability, i.e.,

    P(x̂^n ≠ x^n),

we need to find the following MAP estimate

    x̂^n(y^n) = arg max_{x^n} p(x^n|y^n).

In this lecture, we discuss efficient algorithms for computing both p(xn |y n ) and arg maxxn p(xn |y n ).

1 Recap: Forward and Backward Recursion


Let (Xt, Yt)_{t=1}^n be a hidden Markov process, and define

    αt(xt) = p(xt|y^t)
    βt(xt) = p(xt|y^{t−1})
    γt(xt) = p(xt|y^n).

The recursive formulae for computing these quantities are

    αt(xt) = βt(xt)p(yt|xt) / Σ_{x̃t} βt(x̃t)p(yt|x̃t)
    βt+1(xt+1) = Σ_{xt} αt(xt)p(xt+1|xt)
    γt(xt) = Σ_{xt+1} γt+1(xt+1) αt(xt)p(xt+1|xt) / βt+1(xt+1)

with the initialization β1 (x1 ) = p(x1 ) and γn (xn ) = αn (xn ).

2 Joint Posterior
Before computing p(xn |y n ), we begin with a useful fact about the HMP.
Lemma 1. In the HMP, the following Markov chain holds:

    X^{t−1} − (Xt, Y^n) − X_{t+1}^n.

Proof The desired result follows from Lemma 1 in Lecture 9 and

    p(x^n, y^n) = Π_{i=1}^n p(xi|xi−1)p(yi|xi)
                = [Π_{i=1}^t p(xi|xi−1)p(yi|xi)] · [Π_{i=t+1}^n p(xi|xi−1)p(yi|xi)],

where the first bracket plays the role of φ1(x^{t−1}, xt, y^n) and the second that of φ2(xt, x_{t+1}^n, y^n).

Now we are ready to derive the joint posterior distribution p(x^n|y^n):

    p(x^n|y^n) = p(xn|y^n) Π_{t=1}^{n−1} p(xt|x_{t+1}^n, y^n)
               = p(xn|y^n) Π_{t=1}^{n−1} p(xt|xt+1, y^n)                   (by Lemma 1)
               = p(xn|y^n) Π_{t=1}^{n−1} p(xt|xt+1, y^t)                   (by (c) in Lecture 9)
               = p(xn|y^n) Π_{t=1}^{n−1} p(xt|y^t)p(xt+1|xt, y^t) / p(xt+1|y^t)
               = γn(xn) Π_{t=1}^{n−1} αt(xt)p(xt+1|xt) / βt+1(xt+1)        (by (a) in Lecture 9)

Write

    ln p(x^n|y^n) = Σ_{t=1}^n gt(xt, xt+1)

with

    gt(xt, xt+1) ≜ ln [αt(xt)p(xt+1|xt) / βt+1(xt+1)]   for t = 1, · · · , n − 1,
    gn(xn, xn+1) ≜ ln γn(xn)                            for t = n (with no dependence on xn+1);
we can solve the MAP estimator x̂MAP with the help of the following definition:
Definition 2. For 1 ≤ k ≤ n, let

    Mk(xk) := max_{x_{k+1}^n} Σ_{t=k}^n gt(xt, xt+1).

It is straightforward from the definition that M1(x̂1^MAP) = ln p(x̂^MAP|y^n) = max_{x^n} ln p(x^n|y^n), and

    Mk(xk) = max_{xk+1} max_{x_{k+2}^n} (gk(xk, xk+1) + Σ_{t=k+1}^n gt(xt, xt+1))
           = max_{xk+1} (gk(xk, xk+1) + max_{x_{k+2}^n} Σ_{t=k+1}^n gt(xt, xt+1))
           = max_{xk+1} (gk(xk, xk+1) + Mk+1(xk+1)).

Since Mn(xn) = gn(xn, xn+1) = ln γn(xn) only depends on the single term xn, we may start from n and use the
previous recursive formula to obtain M1 (x1 ). This is called the Viterbi Algorithm.

3 Bellman Equations (Viterbi Algorithm)


The Bellman Equations, referred to in the communication setting as the Viterbi Algorithm, is an application
of a technique called dynamic programming. Using the recursive equations derived in the previous section,
we can express the Viterbi Algorithm as follows.

1: function Viterbi
2: Mn (xn ) ← ln γn (xn ) . Initialization of log-likelihood
3: x̂MAP (xn ) ← ∅ . Initialization of the MAP estimator
4: for k = n − 1, · · · , 1 do
5: Mk (xk ) ← maxxk+1 (gk (xk , xk+1 ) + Mk+1 (xk+1 )) . Maximum of log-likelihood
6: x̂k+1 (xk ) ← arg maxxk+1 (gk (xk , xk+1 ) + Mk+1 (xk+1 ))
7: x̂MAP (xk ) ← [x̂k+1 (xk ), x̂MAP (x̂k+1 (xk ))] . Maximizing sequence with leading term xk
8: end for
9: M ← maxx1 M1 (x1 ) . Maximum of overall log-likelihood
10: x̂1 ← arg maxx1 M1 (x1 )
11: x̂MAP ← [x̂1 , x̂MAP (x̂1 )] . Overall maximizing sequence
12: end function
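A compact implementation sketch (ours, not from the notes) of the Viterbi recursion for the same toy HMM used in the Lecture 9 sketches follows; it maximizes ln p(x^n, y^n), which has the same maximizer over x^n as the ln p(x^n|y^n) decomposition above since they differ only by the constant ln p(y^n):

# Viterbi (dynamic programming) for the toy HMM; P, Q, pi0 as in the earlier sketch.
import numpy as np

def viterbi(y, P, Q, pi0):
    n, S = len(y), len(pi0)
    logP, logQ = np.log(P), np.log(Q)
    M = np.zeros((n, S))               # M[t, i]: best log-prob of a path ending in state i at time t
    back = np.zeros((n, S), dtype=int)
    M[0] = np.log(pi0) + logQ[:, y[0]]
    for t in range(1, n):
        scores = M[t - 1][:, None] + logP + logQ[None, :, y[t]]   # scores[i, j]
        back[t] = scores.argmax(axis=0)
        M[t] = scores.max(axis=0)
    # Trace back the maximizing sequence.
    x_hat = np.zeros(n, dtype=int)
    x_hat[-1] = M[-1].argmax()
    for t in range(n - 2, -1, -1):
        x_hat[t] = back[t + 1, x_hat[t + 1]]
    return x_hat

# Example, reusing P, Q, pi0 from the forward-recursion sketch:
# print(viterbi([0, 0, 1, 1, 1], P, Q, pi0))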

4 More discussions
4.1 Optimality Criteria
Let us distinguish between the MAP estimator for the joint posterior p(xn |y n ) (i.e., minimizing the block
error probability) and the MAP estimator for the marginal posterior p(xt |y n ) (i.e., minimizing the bit error
probability) via an example. Consider a random vector X n with the following distribution:


    X^n = 1111 · · · 11   w.p. 0.1
          1000 · · · 00   w.p. (1/n) × 0.9
          0100 · · · 00   w.p. (1/n) × 0.9
          0010 · · · 00   w.p. (1/n) × 0.9
          · · ·
          0000 · · · 01   w.p. (1/n) × 0.9

For n ≥ 10, the maximum likelihood sequence is the all-one sequence, but to get the j-th bit right we should
always guess zero. Hence, if our goal is to guess the whole sequence correctly, we should guess all ones, but
if our goal is to make as few bit errors as possible on average, we should guess all zeros.
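A quick numerical check of this example (ours; n = 20 is an arbitrary choice) makes the trade-off explicit: guessing all ones maximizes the probability of getting the whole block right, while guessing all zeros minimizes the expected number of bit errors.

# Block error vs. expected bit errors for the toy distribution above, n = 20.
n = 20

print("P(block correct | guess all ones)  =", 0.1)
print("P(block correct | guess all zeros) =", 0.0)

errors_all_ones = 0.1 * 0 + 0.9 * (n - 1)    # one-hot sequences cost n-1 bit errors
errors_all_zeros = 0.1 * n + 0.9 * 1         # all-ones sequence costs n bit errors
print("E[# bit errors | guess all ones]   =", errors_all_ones)
print("E[# bit errors | guess all zeros]  =", errors_all_zeros)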

4.2 General Forward-Backward Recursion
Denote by ∆d = {(p1, · · · , pd) ∈ R^d_+ : Σ_{i=1}^d pi = 1} the probability simplex with support size at most d, and
by X and Y the alphabets of X and Y, respectively. Then the forward-backward recursion can be written
via the mappings

    F : ∆|X| × (∆|Y|)^|X| × Y → ∆|X|
    (pX, pY|X, y) ↦ qX : x ↦ pX(x)pY|X(y|x) / Σ_x̃ pX(x̃)pY|X(y|x̃)

and

    G : ∆|X| × (∆|Y|)^|X| → ∆|Y|
    (pX, pY|X) ↦ qY : y ↦ Σ_x pX(x)pY|X(y|x)

as follows:
• Forward recursion:

      αt = F(βt, pYt|Xt, yt)
      βt+1 = G(αt, pXt+1|Xt)

• Backward recursion:

      γt = G(γt+1, F(αt, pXt+1|Xt, ·))

EE378A Statistical Signal Processing Lecture 11 - 05/03/2016

Lecture 11: Minimax Decision Theory


Lecturer: Jiantao Jiao, Tsachy Weissman Scribe: Jie Jun Ang

1 Minimax Framework
Recall that the Bayes estimator minimizes the average risk
    r(Λ, δ) = ∫ R(θ, δ) dΛ(θ),

where Λ is the prior. One complaint about the Bayes estimator is that the choice of Λ is often arbitrary and
hard to defend.
An alternative approach is to assume that nature is malicious, and will always pick the worst value of θ
in response to the statistician’s choice of δ. In this case, we instead seek to minimize the maximum risk (or
worst-case risk):

    min_δ sup_θ ∫ L(θ, δ) dPX|θ.

Claim 1. If the loss function L(θ, δ) is convex in δ, then finding the minimax rule can be reduced to solving
a convex optimization problem.
Why is this claim true? We note that the average of convex functions is convex, hence ∫ L(θ, δ) dPX|θ is
convex in δ. Also, the supremum of convex functions is easily checked to be convex, hence supθ ∫ L(θ, δ) dPX|θ
is convex in δ. As a result, the problem

    min_δ sup_θ ∫ L(θ, δ) dPX|θ

aims to minimize a convex function, and is thus a convex optimization problem.


Even though there is substantial literature on solving convex optimization problems, this particular
problem is difficult because it’s hard to evaluate the supremum over θ. Indeed, if we are given a decision
rule δ, we cannot evaluate the gradient because computing supθ is computationally intractable.

2 Minimax Theorem
The first minimax theorem was proved by von Neumann, in the setting of zero-sum games. It states that
every two-person finite-strategy zero-sum game has a mixed-strategy Nash equilibrium. For the purposes of
this class, however, we introduce the generalization by Sion and Kakutani:
Theorem 2 (Sion-Kakutani Minimax Theorem). Let X and Λ be compact convex sets in linear spaces (i.e.,
topological vector spaces). Let H(λ, x) : Λ × X → R be a continuous function such that
• H(λ, ·) is convex for every fixed λ ∈ Λ,
• H(·, x) is concave for any fixed x ∈ X.
Then
1. The strong duality holds:
    max_λ min_x H(λ, x) = min_x max_λ H(λ, x).                            (1)

2. There exists a ”saddle point” (λ∗ , x∗ ), for which

H(λ, x∗ ) ≤ H(λ∗ , x∗ ) ≤ H(λ∗ , x) for all λ ∈ Λ, x ∈ X. (2)

We omit the proof, but strongly recommend the following exercises:


• Show that (2) implies (1).

• Show that for an arbitrary function F (λ, x), we have

      max_λ min_x F(λ, x) ≤ min_x max_λ F(λ, x).

This is called the weak duality.

There is a game-theoretic way to view weak duality. Two people, named Max and Min, are playing a zero-
sum game. Max chooses λ ∈ Λ and Min chooses x ∈ X, then F (λ, x) is the reward function for Max (and
−F(λ, x) is the reward function for Min). Weak duality states that going second is (weakly) advantageous for Max, no
matter the choice of the function F.
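For a finite game this is easy to check numerically. The following tiny sketch (ours, not from the notes; the random payoff matrix is an arbitrary choice) verifies the weak duality inequality over pure strategies:

# Weak duality for a finite zero-sum game: max-min <= min-max for any payoff matrix.
import numpy as np

rng = np.random.default_rng(0)
F = rng.standard_normal((5, 7))      # rows: Max's choices, columns: Min's choices

maxmin = F.min(axis=1).max()         # Max commits first, Min responds
minmax = F.max(axis=0).min()         # Min commits first, Max responds
print(maxmin, "<=", minmax, ":", maxmin <= minmax)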

3 Applications of the Minimax Problem


3.1 A failed approach
Since we are interested in finding min_δ sup_θ ∫ L(θ, δ) dPX|θ, a first approach is to lower bound this quantity
using weak duality. The following is slightly sloppy mathematically:

    min_δ sup_θ ∫ L(θ, δ) dPX|θ ≥ sup_θ min_δ ∫ L(θ, δ) dPX|θ.

The problem is that in the expression on the RHS, the statistician is allowed to choose δ after nature has
chosen θ. If L(θ, g(θ)) = 0, the statistician may pick the constant predictor δ = g(θ), so the right hand side
is just 0. We have given up too much by applying weak duality.

3.2 Wald’s approach


What if, instead of operating directly on θ, we instead maximize over priors on θ? Recall that the risk
function R(θ, δ) is defined to be ∫ L(θ, δ) dPX|θ.
Claim 3. For any δ, we have

    sup_θ R(θ, δ) = sup_{Λ(θ)} ∫ R(θ, δ) dΛ(θ).

Proof Since the right hand side is an average of R(θ, δ) over some values of θ and the left hand side is the
supremum of R(θ, δ), we have LHS ≥ RHS. For the reverse inequality, given any ε > 0, there exists some θ0
such that R(θ0, δ) + ε > sup_θ R(θ, δ). Taking Λ0 to be the delta distribution with point mass at θ = θ0, we
get

    sup_Λ ∫ R(θ, δ) dΛ(θ) ≥ ∫ R(θ, δ) dΛ0(θ) = R(θ0, δ) > sup_θ R(θ, δ) − ε.

Taking ε → 0+, we see that RHS ≥ LHS, so we are done.

Thus, maximizing over θ is the same as maximizing over Λ(θ). We use the minimax theorem:
    min_δ sup_θ ∫ L(θ, δ) dPX|θ = min_δ sup_θ R(θ, δ)
                                = min_δ sup_Λ ∫ R(θ, δ) dΛ(θ)
                                = sup_Λ min_δ ∫ R(θ, δ) dΛ(θ)
                                = sup_Λ min_δ ∫∫ L(θ, δ) dPX|θ dΛ(θ).

We have not verified that the hypotheses of the minimax theorem are satisfied. The main idea is that, when
Λ is fixed, then this function is convex in δ (as discussed in the first section when L(θ, ·) is convex), and
when δ is fixed, this function is linear in Λ. We gloss over the details regarding compactness and convexity,
the norms on X and Λ, continuity, etc.
The above result may be succinctly summarized:

    min_δ sup_θ R(θ, δ) = sup_{Λ(θ)} min_δ r(Λ, δ).

Define the Bayes risk

    rΛ = min_δ r(Λ, δ) = min_δ EΛ[R(θ, δ)]

to be the risk of the Bayes estimator under the prior θ ∼ Λ(θ). Then the above result is minδ supθ R(θ, δ) =
supΛ(θ) rΛ .
The left-hand side’s expression may be interpreted as the statistician choosing δ first, then nature picking
θ, which seems very bad for the statistician. However, this is not as pessimistic as it initially seems: in the
right hand side’s expression, nature acts first by choosing a prior Λ, then the statistician picks δ.
An objection to Wald’s minimax formulation was that nature is not malicious, but this way of viewing
the problem reveals that it is not so pessimistic.

3.3 Difficulties
Similar to the fact that the first formulation of the minimax problem is hard to solve, this second form also
has its own difficulties. While it is easy to evaluate rΛ for each choice of Λ, there are too many possible
choices for Λ to evaluate supΛ rΛ . Ultimately, we have traded the problem of being unable to evaluate the
objective sup_θ R(θ, δ) for the problem of searching over another possibly infinite-dimensional space.

4 Minimax in Action
Unfortunately, the minimax problem is difficult to solve exactly, even when it has been reformulated. Very
few results of exact optimality are known. Fortunately, statisticians have two ways of dealing with this
problem:
• Instead of searching for optimal solutions, we can instead seek potential solutions, which are asymp-
totically optimal. More on this next week.
• We may upper bound and lower bound the minimax risk by constants. Indeed, in the expression

      min_δ sup_θ R(θ, δ) = sup_{Λ(θ)} rΛ,

the left hand side is a minimum over δ, so for any choice δ0 , the quantity supθ R(θ, δ0 ) is an upper
bound on minimax risk. Similarly, the right hand side is a supremum over Λ, so for any choice of Λ0 ,

the quantity rΛ0 is a lower bound on the minimax risk. If we get upper and lower bounds which are
close, then we are quite satisfied with the result. This is still a difficult task, however.
Definition 4. A prior Λ is called least favorable if, for any prior Λ0 , we have rΛ ≥ rΛ0 .
This definition is completely motivated by the minimax equation. The notion is very formal in the sense
that it’s usually not possible to check whether Λ is least favorable.

Theorem 5. Suppose Λ(θ) is a prior on θ, and δΛ is the Bayes estimator under Λ. Suppose also that
r(Λ, δΛ ) = rΛ = supθ R(θ, δΛ ). Then
1. δΛ is the minimax estimator;
2. If δΛ is the unique Bayes estimator with respect to Λ, then it is the unique minimax estimator;

3. Λ is least favorable.
This theorem is almost tautological. Nevertheless, we prove the first assertion, and leave the rest as
exercises.
Proof Let δ be any decision rule. Then
    sup_θ R(θ, δ) ≥ ∫ R(θ, δ) dΛ(θ)
                  ≥ ∫ R(θ, δΛ) dΛ(θ)
                  = rΛ
                  = sup_θ R(θ, δΛ).

Hence δΛ minimizes the minimax risk.

5 Reference
B. Levit (2010). “Minimax Revisited: I, II”. This paper is a survey of several techniques people have used
to bound the minimax risk of the following problem: suppose you have a single observation X ∼ N (µ, σ 2 ),
where σ 2 is known, and you know that |µ| ≤ 1. You want to estimate µ. The minimax estimator is not
known! You can have a feeling of the difficulty in solving the exact minimax estimator!

EE378A Statistical Signal Processing Lecture 12 - 5/5/2016

Lecture 12: Minimax Decision Theory and Asymptotics


Lecturer: Jiantao Jiao, Tsachy Weissman Scribe: Leighton Barnes, Nikhil Garg, & Sahaana Suri

1 Minimax estimator: an example


In the last lecture, we defined the minimax decision rule and introduced the minimax theorem which estab-
lishes the duality between minimax decision rule and Bayes decision rule.

Theorem 1 (Minimax Theorem). Under certain conditions,

    inf_δ sup_θ R(θ, δ) = sup_Λ rΛ.

Based on the minimax theorem, the minimax risk can be upper bounded by using any decision rule δ0,
and lower bounded by any prior Λ0:

    sup_θ R(θ, δ0) ≥ inf_δ sup_θ R(θ, δ) = sup_Λ rΛ ≥ rΛ0.

This means that if we can come up with specific δ0, Λ0 such that these two bounds are close, then we have
found an accurate value of the true minimax risk. Specifically, we have
Theorem 2. If Λ is a prior, δΛ is a Bayes estimator under this prior such that rΛ = ∫ R(θ, δΛ)dΛ(θ) =
sup_θ R(θ, δΛ), then we have:
1. δΛ is minimax;
2. If δΛ is uniquely Bayes, then δΛ is unique minimax;
3. Λ is least favorable.

Corollary 3. If a Bayes estimator has constant risk, it is minimax (not necessarily unique).
The preceding theorem and corollary provide us with the tool to find the exact minimax estimator “using
human brain” in some simple examples.

Example 4. In the Binomial model X ∼ B(n, θ), we will find the minimax estimator under the squared
error loss. Recall that θ̂(X) = X/n is UMVU (by Lehmann-Scheffé), admissible (by Problem 2(4) in Homework
2), and the MLE. However, we show that it turns out not to be minimax, and we find the minimax estimator. In
fact, the risk function of θ̂ is

    R(θ, θ̂) = Eθ(θ̂ − θ)² = θ(1 − θ)/n.
We already proved that under the conjugate prior
    Λa,b(θ) = θ^(a−1) (1 − θ)^(b−1) / B(a, b)   (a > 0, b > 0)

we get the Bayes estimator

    θ̂a,b(X) = (a + X)/(a + b + n).

The risk function of this Bayes estimator is

    R(θ, θ̂a,b) = Eθ(θ̂a,b − θ)² = [n(1 − θ)θ + (a(1 − θ) − bθ)²] / (a + b + n)².

Motivated by the previous corollary, we would like R(θ, θ̂a,b) to be a constant independent of θ. It is easy
to see that we should set a = b = √n/2, which yields

    θ̂_{√n/2, √n/2}(X) = (√n/2 + X)/(√n + n)

with

    R(θ, θ̂_{√n/2, √n/2}) = 1/(4(√n + 1)²).

Based on Theorem 2, this estimator is the unique minimax estimator for θ under squared error loss, while
θ̂(X) = X/n is not minimax. However, a closer inspection will reveal that R(θ, θ̂) ≥ R(θ, θ̂_{√n/2, √n/2}) only holds
when θ is really close to 1/2, and these two estimators coincide in the asymptotic setting (i.e., n → ∞).
In general, how do we choose which estimator to use? We have many choices, but the only way to choose
the right estimator is by looking at and understanding the data and the problem you are trying to solve. In
the finite sample theory, the choice is a big matter as different estimators satisfy different properties, while
in asymptotics the difference sometimes disappears, as we will show below.
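The following small sketch (ours, not from the notes; n = 25 and the θ grid are arbitrary) compares the risk curve of the MLE X/n with the constant risk of the minimax estimator from Example 4, showing that each wins on part of the parameter range:

# Risk of the MLE X/n vs. the constant-risk minimax estimator of Example 4.
import numpy as np

n = 25
theta = np.linspace(0.01, 0.99, 9)

risk_mle = theta * (1 - theta) / n
risk_minimax = np.full_like(theta, 1 / (4 * (np.sqrt(n) + 1) ** 2))

for th, r1, r2 in zip(theta, risk_mle, risk_minimax):
    better = "minimax" if r2 < r1 else "MLE"
    print(f"theta = {th:.2f}: MLE risk = {r1:.5f}, minimax risk = {r2:.5f} -> {better} wins")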

2 Fisher’s Program for asymptotic statistics


Now we eliminate some of the above ambiguity in the finite sample theory by looking at what happens when
the number of samples approaches infinity.
Recall that in the binomial example, Eθ(X/n − θ)² = θ(1 − θ)/n → 0 as n → ∞, at rate 1/n. Now, consider
the sequence √n(X/n − θ) = Σ_{i=1}^n (Xi − θ)/√n (where X = Σ_{i=1}^n Xi with X1, · · · , Xn i.i.d. ∼ Bernoulli(θ)). By the CLT, this
sequence converges weakly to N(0, θ(1 − θ)) as n → ∞. Then, we know that X/n ∈ [θ − C/√n, θ + C/√n] with high
probability, where C is chosen by looking at Gaussian percentiles. Moreover, the asymptotic variance is the
reciprocal of the Fisher information I(θ) = (θ(1 − θ))^{−1}.
Fisher wondered whether in general he could bound the asymptotic variance of asymptotically normal
estimators of θ in this way.
Conjecture 1. For regular statistical experiments with non-singular Fisher information I(θ), Fisher had
the following conjectures:
1. ∀ asymptotically normal estimators θn*, i.e., √n(θn* − θ) →d N(0, σ²(θ, {θn*})),

       σ²(θ, {θn*}) ≥ 1/I(θ),   θ ∈ Θ;

2. ∃ estimators (e.g., the MLE) θ̂n s.t.

       σ²(θ, {θ̂n}) = 1/I(θ).

This conjecture is unexpected. The Cramér-Rao lower bound of this form only applies to unbiased esti-
mators in the finite sample case. Furthermore, we are not restricting ourselves to exponential families as we were
before. Fisher’s intuition is that any asymptotically normal estimator should be asymptotically unbiased,
in which case applying the Cramér-Rao lower bound seems to be fine. Moreover, the Taylor expansion of the MLE
seems to suggest that the MLE attains the Fisher information bound asymptotically. Unfortunately, neither
of these conjectures is true in general: the second conjecture requires some technical conditions, and the
first conjecture is even more problematic as shown by the following example.
Example 5 (Hodges’s Example in 1951). Let X1, · · · , Xn i.i.d. ∼ N(θ, 1). Then X̄ = Σ_{i=1}^n Xi/n is an admissible
(by Problem 2(1) in Homework 2), MLE, UMVU (by Lehmann-Scheffé), and minimax estimator (will be
shown in Homework 3). Hodges claimed that this estimator can somehow be improved in terms of the
asymptotic variance. Note that √n(X̄ − θ) →d N(0, 1); thus we are seeking a better estimator Sn such that
√n(Sn − θ) →d N(0, σ²(θ)), where σ²(θ) ≤ 1 for any θ and σ²(θ0) < 1 for some θ0.
Hodges’s estimator is as follows. Let

    Sn = X̄   if |X̄| ≥ n^{−1/4},
    Sn = 0    otherwise.

Then it is easy to verify that

    σ²(θ) = 1   if θ ≠ 0,
    σ²(θ) = 0   if θ = 0.

Hence, Hodges’s estimator dominates X̄ in terms of the asymptotic variance, while it does not dominate
X̄ for any finite n by the admissibility of X̄. The n^{−1/4} term is key to the result because the typical deviation
from the mean is of the order n^{−1/2}: when θ = 0, with high probability Sn = 0; when θ ≠ 0, as n increases,
the uncertainty region shrinks and [−n^{−1/4}, n^{−1/4}] ∩ [θ − Cn^{−1/2}, θ + Cn^{−1/2}] = ∅ for large n, i.e., with high
probability Sn = X̄. Note that Hodges’s example is almost never used in practice because in low dimensions
it has unpleasant properties (try to figure out the scaled risks nR(n^{−1/2}, Sn) (which is of order 1) and nR(n^{−1/4}, Sn)
(which is of order √n) under the squared error loss!).
In general, a sequence of asymptotically normal estimators in a regular statistical experiment is called
superefficient if the asymptotic variance under θ is no worse than I(θ)−1 for any θ, and improves over
I(θ0 )−1 for some θ0 . Hodges’s estimator was the first example of superefficiency. It has been proved that in
dimension one and two, superefficient estimators must have some unpleasant properties, which may disappear
for dimension greater than or equal to three (e.g., read the wikipedia page on the James-Stein estimator)!
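The "unpleasant property" of Hodges's estimator is easy to see by simulation. The following Monte Carlo sketch (ours, not from the notes; n and the number of replications are arbitrary) estimates the scaled risk n·Eθ[(Sn − θ)²] at θ = 0, n^{−1/2}, and n^{−1/4}, and shows the blow-up at the n^{−1/4} scale:

# Scaled risk of Hodges's estimator at three local parameter values.
import numpy as np

rng = np.random.default_rng(0)
n, reps = 10_000, 200_000

for theta in [0.0, n ** -0.5, n ** -0.25]:
    xbar = theta + rng.standard_normal(reps) / np.sqrt(n)    # X-bar ~ N(theta, 1/n)
    s = np.where(np.abs(xbar) >= n ** -0.25, xbar, 0.0)      # Hodges's estimator
    print(f"theta = {theta:.4f}: n * risk ~ {n * np.mean((s - theta) ** 2):.2f}")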

3 Modern Theorems
In this section we give a brief overview of the modern, corrected theorems which have replaced Fisher’s
conjecture. These are due to many contributors, including Le Cam, Wolfowitz, Hájek, etc. (see Chapter 8
of the book Asymptotic Statistics by Van der Vaart).
First we state the result corresponding to Fisher’s second conjecture.
Theorem 6 (Asymptotic Normality of MLE). Under mild conditions (cf. Theorem 5.39 and Lemma 8.14
in Van der Vaart’s book), if θ̂n = arg maxθ pθ (x1 , . . . , xn ) then
    √n(θ̂n − θ0) = (1/√n) I(θ0)^{−1} Σ_{i=1}^n sθ0(xi) + o_{Pθ0}(1),   ∀θ0,

and for any g(θ) differentiable at θ0,

    √n(g(θ̂n) − g(θ0)) →d N(0, [g′(θ0)]^T I(θ0)^{−1} [g′(θ0)]).

In the previous theorem, sθ(x) is the score function, I(θ) is the Fisher information, and the notation
Xn = o_{Pθ0}(1) means Xn → 0 in Pθ0-probability.
Now we introduce some approaches to correct Fisher’s first conjecture. Although the phenomenon of
superefficiency may occur, several observations can be drawn from the Hodges’s example. Firstly, in Hodges’s

example, superefficiency only occurs at one point θ = 0, which motivates statisticians to think about whether
it is generally true that superefficiency can only take place in a “small” set. The answer turns out to be
affirmative and is stated in the following theorem.
Theorem 7 (Hájek-Le Cam almost everywhere convolution theorem). Suppose the statistical experiment is
regular with non-singular Fisher information I(θ), and let g(θ) be differentiable at every θ. For any sequence
of estimators Tn(X1, . . . , Xn) satisfying √n(Tn − g(θ)) →d Lθ for all θ ∈ Θ, there exists a probability
measure Mθ such that for almost every θ (i.e., for all θ ∈ Θ − N where N is of Lebesgue measure zero),

    Lθ = N(0, [g′(θ)]^T I(θ)^{−1} [g′(θ)]) ∗ Mθ.




The ∗ above denotes the usual convolution of probability densities. Because the convolution of probability
densities corresponds to adding independent random variables, we can interpret this theorem as follows: any
limiting estimator is a Gaussian random variable with mean zero and variance I(θ)−1 (when g(θ) = θ) plus
some random independent junk. In particular, the almost everywhere convolution theorem shows that the
superefficiency phenomenon can only occur at a set of Lebesgue measure zero.
The second observation can help us get rid of the “almost everywhere” statement in the previous convolu-
tion theorem, and it is as follows: although in Hodges’s example the random variable √n(Sn − θ) under θ = 0
has a limiting distribution L0 which is the delta point mass at zero, the quantity √n(Sn − θ) under θ = n^{−1/2}
does not have the limiting distribution L0 as n → ∞ (recall the previous computation of nR(n−1/2 , Sn ) under
the squared error loss). This motivates us to introduce the concept of a regular estimator sequence.

Definition 8 (Regular estimator sequence). An estimator sequence Tn is called regular at θ for estimating
g(θ), if for every h,


  
    √n (Tn − g(θ + h/√n)) →d Lθ   under P^n_{θ+h/√n}.

The probability measure Lθ may be arbitrary but should be the same for every h.

Now, if the regular estimator sequence is considered instead (as a subfamily of all estimator sequences
which may not be regular), we come to the following traditional convolution theorem which replaces “almost
everywhere” with “everywhere”.
Theorem 9 (Hájek-Le Cam convolution theorem). Suppose the statistical experiment is regular with non-
singular Fisher information I(θ), and let g(θ) be differentiable at every θ. For any regular estimator sequence
Tn(X1, . . . , Xn) for estimating g(θ) with limiting distribution Lθ for all θ ∈ Θ, there exists a probability
measure Mθ such that for every θ,

    Lθ = N(0, [g′(θ)]^T I(θ)^{−1} [g′(θ)]) ∗ Mθ.





The third observation from Hodges’s example is that although Eθ(√n(Sn − θ))² approaches zero at θ = 0,
the “local” worst-case risk sup_{θ∈[−n^{−1/2}, n^{−1/2}]} Eθ(√n(Sn − θ))² does not. Hence, we wonder whether Fisher’s
first conjecture becomes correct when we take the local worst-case risk, where “local” is characterized by the
scaling n^{−1/2}. In fact, for X1, · · · , Xn i.i.d. ∼ Pθ with density pθ, the log-likelihood ratio is given by

    ln (dP^n_{θ+h/√n} / dP^n_θ)(X1, · · · , Xn) = ln Π_{i=1}^n [p_{θ+h/√n}(Xi)/pθ(Xi)] = Σ_{i=1}^n ln [p_{θ+h/√n}(Xi)/pθ(Xi)]
        = (h^T/√n) Σ_{i=1}^n sθ(Xi) + (1/(2n)) h^T (Σ_{i=1}^n ∂sθ(Xi)/∂θ) h + o_{Pθ}(1)    (Taylor expansion)
        = h^T I(θ)Z − (1/2) h^T I(θ)h + o_{Pθ}(1)                                           (by the CLT and LLN)

where sθ (X) is the score function, and Z ∼ N (0, I(θ)−1 ) is a normal random variable. The interesting fact
is that, for Gaussian models we can explicitly compute that

    ln [dN(h, I(θ)^{−1}) / dN(0, I(θ)^{−1})](Z) = h^T I(θ)Z − (1/2) h^T I(θ)h,

which is the same as above for any regular statistical model. Mathematically, we can show that the following
two models

    (P^n_{θ+h/√n} : h ∈ R^k)   and   (N(h, I(θ)^{−1}) : h ∈ R^k)

are asymptotically equivalent, i.e., the model distance (defined in Problem 2 of Homework 1) between them
vanishes as n → ∞. Hence, asymptotically it suffices to work with the Gaussian location model (N(h, I(θ)^{−1}) :
h ∈ Rk ), which is easy to handle and yields the following theorem:

Theorem 10 (Hájek-Le Cam local asymptotic minimax theorem). Suppose the statistical experiment is
regular with non-singular Fisher information I(θ), and let g(θ) be differentiable at θ. For any estimator
sequence Tn and bowl-shaped loss function l, we have

   
    sup_I lim inf_{n→∞} sup_{h∈I} E_{θ+h/√n}[l(√n(Tn − g(θ + h/√n)))] ≥ E[l(Z)]

where Z ∼ N(0, [g′(θ)]^T I(θ)^{−1} [g′(θ)]), and the first supremum is taken over all finite subsets I ⊂ R^k.
A non-negative loss function l : R^k → R+ with l(0) = 0 is called bowl-shaped if l(−u) = l(u) for any
u ∈ R^k and {u ∈ R^k : l(u) ≤ t} is a convex set for any t ≥ 0. By taking l(u) = ‖u‖² and g(θ) = θ, Theorem
10 shows that Fisher’s first conjecture is true in a “local asymptotic minimax” sense.

EE378A Statistical Signal Processing Lecture 13 - 05/10/2016

Lecture 13: Introduction to Statistical Learning Theory


Lecturer: Jiantao Jiao, Tsachy Weissman Scribe: Raghav Subramaniam

In this lecture, we get a mostly philosophical introduction to statistical learning theory.

1 Recap: Decision Theory


In decision theory we have a class of distributions {Pθ : θ ∈ Θ} and assume the observation X follows exactly
one of the distributions Pθ. We have a loss function L(θ, δ) to quantify the loss incurred by any action δ.
One also has multiple criteria to evaluate a decision rule, such as unbiased estimation, Bayes decision theory,
minimax decision theory, and various approximations and asymptotics. However, in practice this framework
may be difficult to implement for highly complex data such as images: it is extremely difficult to propose
a not-too-big set of distributions that can characterize how natural images are generated. In other words,
choosing the proper set of Pθ that is both easy to analyze and also well characterizes the nature of the data
is a daunting task.

2 Books and Papers


References for learning theory:

1. Vladimir Vapnik: Estimation of Dependencies Based on Empirical Data (first version ’82, second
version ’06) includes a detailed comparison between learning theory and decision theory. The second
version includes an afterword that updates the technical results presented in the first version and
describes a general picture of how the new ideas developed over these years.
2. Vladimir Vapnik: Statistical Learning Theory (’98)
3. Leo Breiman: “Statistical Modeling: The Two Cultures” (’01) challenges mainstream statistical decision
theory in favor of learning theory (a fun read). It includes the comments from top statisticians and a
rejoinder by Breiman.

3 Problems with Decision Theory


1. Defining Pθ that is both flexible and analytically tractable is difficult. In modern decision theory,
generalized linear models and the general graphical models provide nice families of distributions to
play with, but the recent success of image classification using deep neural networks shows that these
distributions are not sufficient.
2. Decision theory puts the most crucial modeling assumptions on Pθ (but not on the decision rule δ).
It means that we essentially need to encode all the knowledge we have about data science problems in
Pθ . However, knowledge transfer from the data to Pθ is difficult in general for complex data.

3.1 Realism and Instrumentalism in the Philosophy of Science (Vapnik’06)


The philosophy of science has two different points of view on the goals and the results of scientific activities.

1. There is a group of philosophers who believe that the results of scientific discovery are the real laws
that exist in nature. These philosophers are called the realists.

2. There is another group of philosophers who believe the laws that are discovered by scientists are just
an instrument to make a good prediction. The discovered laws can be very different from the ones that
exist in Nature. These philosophers are called the instrumentalists.

The two types of approximations defined by classical discriminant analysis (using the generative model of
data) and by statistical learning theory (using the function that explains the data best) reflect the positions of
realists and instrumentalists in our simple model of the philosophy of generalization, the pattern recognition
model. Later in the class we will see that the position of philosophical instrumentalism played a crucial role
in the success that pattern recognition technology has achieved.

3.2 Example: Neuroscience


The goal is to build a machine that is controlled by electrical signals from the human brain; the machine’s
goal is to get coffee for the user when the user wants coffee. The “realism” approach is to figure out how
the neural signals associated with wanting coffee are generated and transmitted and how different cells
in the brain interact, and then program the machine to recognize those signals and then get coffee. The
“instrumentalism” approach is to build a specific mapping f : X → Y , where X is the brain signal and Y is
the movement of the machine such that the machine gets coffee when the user asks for it.

4 Learning Theory
Learning theory puts the most crucial modeling assumptions on the decision rule δ. In decision theory, we
have
    X1, . . . , Xn i.i.d. ∼ Pθ.
In learning theory, we also have
    X1, . . . , Xn i.i.d. ∼ P.
However, in decision theory, it is known that without assumptions on P , inference is impossible (called
the “No Free Lunch” theorem in decision theory). For example, the minimax risk

    sup_θ R(θ, δ)

is not vanishing if θ indexes over all possible distributions. Vapnik proposed that in various contexts, the
risk is not the “correct” quantity to control.

4.1 Terminology
i.i.d
• Observations (training samples) (X1 , Y1 ), (X2 , Y2 ), . . . , (Xn , Yn ) ∼ PXY .
• Loss function L(g(X), Y ); examples of loss functions include

L(g(X), Y ) = (g(X) − Y )2

and

L(g(X), Y ) = 1{g(X)6=Y }

• Risk function R(g) = EL(g(X), Y ) where (X, Y ) ∼ PXY . One may view (X, Y ) as a test sample,
which is independent of the training sample and follows the same distribution.

• Bayes risk R* ≜ inf_g R(g)

• Oracle risk R(g*) ≜ inf_{g∈G} R(g) where G is a subset of the set of all decision functions; an example is

      G = {g(x) = a^T x + b : a ∈ R^d, b ∈ R},
i.e., all linear classifiers.

Note: The risk function is defined as an expectation over only X and Y , but not g, even when g may be a
random classifier which depends on the random training data. Therefore we could write the risk explicitly as
R(g) = EXY (L(g(X), Y )|{(Xi , Yi )}n1 ). We will instead implicitly assume that the risk is always conditioned
on the data.

Based on n observations, we propose some gn ∈ G. We have an important decomposition:

R(gn ) − R∗ = (R(gn ) − R(g ∗ )) + (R(g ∗ ) − R∗ )


The first term is nonnegative and tells us how good the estimator is compared to the optimal estimator
in G. This is known as the stochastic error or estimation error. The second term is nonnegative and tells
us how good the best estimator in G is compared to the best estimator of any kind. This is known as the
approximation error. Some people also call the approximation error the bias, and the stochastic error the
variance, but one should note that the bias and variance here are different from the usual notions defined
in decision theory. Hence, we recommend using the terms approximation error and stochastic error.
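The decomposition can be seen numerically. In the following sketch (ours, not from the notes; the model X ∼ U(0, 1), P(Y = 1|X = x) = x and the threshold class G = {1(x > c)} are arbitrary illustrative choices), the class contains the Bayes rule 1(x > 1/2), so R(g*) = R* = 0.25 and the hold-out estimate of R(gn) converges to it as n grows; the remaining gap is the stochastic error:

# Empirical-risk minimization over threshold classifiers and hold-out risk estimation.
import numpy as np

rng = np.random.default_rng(0)

def sample(m):
    x = rng.uniform(size=m)
    y = (rng.uniform(size=m) < x).astype(int)
    return x, y

def erm_threshold(x, y):
    """Pick the threshold c minimizing the empirical 0-1 risk of 1(x > c)."""
    candidates = np.concatenate(([0.0], np.sort(x), [1.0]))
    errs = [np.mean((x > c).astype(int) != y) for c in candidates]
    return candidates[int(np.argmin(errs))]

x_test, y_test = sample(200_000)
for n in [10, 100, 1000]:
    x_tr, y_tr = sample(n)
    c = erm_threshold(x_tr, y_tr)
    risk = np.mean((x_test > c).astype(int) != y_test)
    print(f"n = {n:4d}: estimated R(g_n) = {risk:.3f}  (oracle/Bayes risk = 0.25)")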

4.2 Crucial Observation of Learning Theory


“Stochastic error can be universally bounded regardless of P .” More specifically, we have the
following:
Proposition 1.
inf_{g_n∈G} sup_{P_XY} E(R(g_n) − R(g*)) → 0

as n → ∞ under mild conditions on G.


For some intuition, suppose that nature chooses a “bad” distribution. Then, both gn and g ∗ suffer.
In general, we can’t make such universal guarantees on the approximation error. Note that R(gn ) can be
estimated by the testing data and we have a bound on inf_{g_n∈G} sup_{P_XY} (R(g_n) − R(g*)), so we have an idea
about the value of the oracle risk R(g*). The main remaining problem is in selecting G. Intuitively, if the
size of G increases, the approximation error will decrease, and the stochastic error will increase. In this sense,
learning theorists encode the knowledge in G instead of Pθ .

4.3 Example: Support Vector Machine


(from Vapnik ’06, page 437)
We have a classification problem with two classes, 1 and 2, with PY (1) = PY (2) = 1/2. Generate training
i.i.d.
data (Xi , Yi ) ∼ PXY . A decision theorist would use Bayes’ rule, i.e.
   
ŷ = 1 · 1{x : p_1(x)/p_2(x) ≥ 1} + 2 · 1{x : p_1(x)/p_2(x) < 1},
where pi (x) , pX|Y =i (x). Specifically, a decision theorist first computes p1 (x) and p2 (x), then applies Bayes
rule to classify. The decision boundary is p1 (x) − p2 (x) = 0. If we use the kernel density estimator with
Gaussian kernel to estimate pi (x), i = 1, 2, we have the decision boundary for the decision theorist:

Σ_{i : y_i=1} e^{−(‖x−x_i‖/σ)^2} − Σ_{i : y_i=2} e^{−(‖x−x_i‖/σ)^2} = 0.

A learning theorist would use a support vector machine (SVM) with a Gaussian kernel:

Σ_{i : y_i=1} α_i e^{−(‖x−x_i‖/σ)^2} − Σ_{i : y_i=2} α_i e^{−(‖x−x_i‖/σ)^2} = 0.

If we use the kernel trick to map the space X to an RKHS (Reproducing Kernel Hilbert Space) Z such that
K(x_1, x_2) = e^{−(‖x_1−x_2‖/σ)^2} = ⟨z_1, z_2⟩, then it is clear that the decision theorist will generate a hyperplane
passing through the center of mass of the data points, perpendicular to the line between the centers of the
two classes of data points. A learning theorist will get a better result with an SVM (search over all possible
separating hyperplanes and find the best one).

Figure 1: Classifications given by the classical nonparametric method and the SVM are very different. The picture
is in Z space. The red hyperplane is the output of SVM, and the blue hyperplane is the output of nonparametric
statistical decision theory.
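To make this contrast concrete, here is a small simulation sketch (not from the original notes; it assumes numpy and scikit-learn are available, and the synthetic Gaussian data, the bandwidth σ, and the SVM regularization constant C are our own illustrative choices). It compares the Parzen-window rule above with an RBF-kernel SVM trained on the same data:

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic two-class data (hypothetical example, not from the lecture).
n_per_class = 100
X1 = rng.normal(loc=[-1.0, 0.0], scale=1.0, size=(n_per_class, 2))
X2 = rng.normal(loc=[+1.0, 0.0], scale=1.0, size=(n_per_class, 2))
X_train = np.vstack([X1, X2])
y_train = np.array([1] * n_per_class + [2] * n_per_class)

sigma = 0.5  # common kernel width for both rules

def parzen_decision(x, X_train, y_train, sigma):
    """Decision-theorist rule: sign of the difference of Gaussian KDEs."""
    sq_dists = np.sum((X_train - x) ** 2, axis=1)
    weights = np.exp(-sq_dists / sigma ** 2)
    score = weights[y_train == 1].sum() - weights[y_train == 2].sum()
    return 1 if score >= 0 else 2

# Learning-theorist rule: SVM with the same Gaussian (RBF) kernel,
# exp(-gamma * ||x1 - x2||^2) with gamma = 1 / sigma^2.
svm = SVC(kernel="rbf", gamma=1.0 / sigma ** 2, C=1.0).fit(X_train, y_train)

# Compare the two rules on fresh test data from the same distribution.
X_test = np.vstack([
    rng.normal(loc=[-1.0, 0.0], scale=1.0, size=(500, 2)),
    rng.normal(loc=[+1.0, 0.0], scale=1.0, size=(500, 2)),
])
y_test = np.array([1] * 500 + [2] * 500)

parzen_pred = np.array([parzen_decision(x, X_train, y_train, sigma) for x in X_test])
svm_pred = svm.predict(X_test)
print("Parzen error:", np.mean(parzen_pred != y_test))
print("SVM error:   ", np.mean(svm_pred != y_test))

Depending on the sample size and bandwidth, the SVM boundary can differ noticeably from the Parzen boundary, which is the phenomenon Figure 1 illustrates.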

EE378A Statistical Signal Processing Lecture 14 - 05/12/2016

Lecture 14: Basics of Statistical Learning Theory


Lecturer: Jiantao Jiao, Tsachy Weissman Scribe: Jeremy Kim, Milind Rao

In this lecture, we recap what we have seen about the philosophy of learning theory, and then proceed to
present the framework for binary classification problems and conclude by discussing first results using the
probably approximately correct (PAC) learning framework.

1 Recap: Philosophy of Learning Theory


• Looking at prediction problems: while decision theory gives a framework for many classes of problems,
learning theory is more focused.
• Model decision rules, or a function class (e.g., trees, neural networks), instead of probability measures
(i.e., the statistical experiment (X , (Pθ )θ∈Θ )):

– Decision theory assumes an underlying probability distribution that can be measured;


– Probability measures are one way to incorporate knowledge - modeling the decision rules is another
way (e.g., knowledge is that the data can be modeled by a tree). Remember that

Generalization = Data + Knowledge.

• Analyze regret instead of risk: one is always trying to compete with an oracle with stronger abilities than
the statistician (such as knowing precisely the underlying distribution; knowing that in a parametrized
model the parameter lies in a smaller uncertainty set, etc).

2 Classification Problem
2.1 Problem Set-Up
A classification problem consists of the following components:
• Input space defined by X , binary output space defined by Y ∈ {−1, 1};
iid
• The training data (observations) (Xi , Yi ) ∼ PXY , 1 ≤ i ≤ n and the testing data (X, Y ) ∼ PXY ;
• Goal: construct a function (decision rule, predictor, or classifier) g : X → Y;

• Risk function based on zero-one loss: R(g) = P(g(X) ≠ Y) = E_XY(I{g(X) ≠ Y});


• The class of functions that one may choose g from (i.e., the prior knowledge), written by G.
Note: The risk function is defined as an expectation over only X and Y , but not g, even when g may be
a random classifier which depends on the random training data. Therefore we could write the risk explicitly
as R(g) = P(g(X) ≠ Y) = E_XY(I{g(X) ≠ Y} | {(X_i, Y_i)}_1^n). We will instead implicitly assume that the risk is
always conditioned on the data.

2.2 Evaluating Risk
• Bayes risk:
R* = inf_g R(g).

This value is always non-random because the testing data is independent of the training data, and
the infimum achieving g, which is the Bayes rule, does not depend on the training data.

– Example: in the binary classification problem, define the regression function as

η(x) = 2P(Y = 1|X = x) − 1;

we claim that t(x) = sgn(η(x)) is the Bayes decision rule, i.e.,

t = arg min_g R(g).

In fact, this is the MAP decision rule under the joint distribution P_XY.


– Typically R∗ is not actually available in practice because PXY is unknown.
• Empirical risk (training error):
R_n(g) = (1/n) Σ_{i=1}^n I{g(x_i) ≠ y_i}.
This is always a random variable which depends on the training data.

• Oracle risk:
R(g*) = inf_{g∈G} R(g).

This is also not achievable in practice, but tight bounds can be found.

2.3 Risk Bounds


We can evaluate the error of an arbitrary decision rule gn ∈ G as follows:

R(gn ) − R∗ = [R(gn ) − R(g ∗ )] + [R(g ∗ ) − R∗ ]

where [R(gn ) − R(g ∗ )] is called the stochastic error or estimation error and [R(g ∗ ) − R∗ ] the approximation
error. Sometimes these are also informally called variance and bias respectively, but this is different than
variance and bias that we talked about in decision theory.
Now there are three types of results we wish to derive:

A: R(gn ) ≤ Rn (gn ) + B(n, G)


– This is the gap between the true risk and the empirical risk of a decision rule, which usually
works for any decision rules and will involve the concept of the uniform deviations and empirical
processes. This inequality basically tells us: if our classifier incurs 5% error on the training data,
what can we expect about the testing error.

B: R(gn ) ≤ R(g ∗ ) + B(n, G)


– This bound gives an upper bound for the stochastic error, and tells you about the choice of
function class. You can treat R(gn ) as the validation error, and R(g ∗ ) as the best your class can
do. If validation error is low, you’ve picked a good class.

C: R(gn ) ≤ R∗ + B(n, G)

– This is the strongest result, but not very commonly used as it is hard to prove unless there are
some additional restrictions on PXY .
We first focus on the results of type A. We first define Zi , (Xi , Yi ) as the data. Given function class G,
we define the loss class
F = {(x, y) ↦ I{g(x) ≠ y}, g ∈ G}.
It can be immediately seen that there is a bijection between the two classes F and G.
Further define P_n f ≜ (1/n) Σ_{i=1}^n f(Z_i) and Pf ≜ E[f(Z)]. We need to bound

R_n(g) − R(g) = (1/n) Σ_{i=1}^n f(Z_i) − E[f(Z_i)] = P_n f − Pf.

By the LLN, of course we know that Rn (g) → R(g) as n → ∞. Here we need a non-asymptotic version:
Theorem 1. (Hoeffding) Let Z_1, · · · , Z_n be n i.i.d. r.v.s with f(Z) ∈ [a, b]. Then for all ε > 0, we have

P(|P_n f − Pf| ≥ ε) ≤ 2 exp(−2nε² / (b − a)²).

We focus on the so-called Probably Approximately Correct (PAC) learning framework, which aims at
giving the following type of result: with probability at least 1 − δ (probably), R(g_n) is ε-close to R(g*)
(approximately). This is equivalent to an alternate framework - expectation learning framework, which
usually gives an upper bound for the expected excess risk ER(gn ) − R(g ∗ ). The equivalence can be shown as
follows: to show that a non-negative random variable X is small, we can show that P(X ≥ t) is small in the
PAC learning framework and E[X] is small in the expectation learning framework via the relationship
E[X] = ∫_0^∞ P(X ≥ t) dt.

We denote the RHS of the Hoeffding bound by δ ∈ (0, 1), so that 1 − δ can be interpreted as the confidence level.
We can see

P(|P_n f − Pf| ≥ (b − a)√(log(2/δ)/(2n))) ≤ δ.

In other words, the typical deviation of P_n f from its mean Pf is O(1/√n).


Applying Hoeffding's inequality to f(Z) = I{g(X) ≠ Y}, we get for any fixed g and any δ > 0, with probability
at least 1 − δ,

R(g) ≤ R_n(g) + √(log(2/δ)/(2n)).

It is worth bearing in mind that in Hoeffding's inequality, f is fixed in advance. As a result, to apply the
previous inequality, g cannot depend on the training samples. In other words, for any fixed g ∈ G, there
exists a set S_g of samples {Z_i}_{i=1}^n with P(S_g) ≥ 1 − δ such that R(g) ≤ R_n(g) + √(log(2/δ)/(2n)). We add the
As can be seen from Fig. 1, for any fixed g′, with high probability, |R_n(g′) − R(g′)| ≤ O(1/√n). However,
since g_n is chosen based on the training samples, |R_n(g_n) − R(g_n)| could grow to very large values.


Figure 1: The actual and empirical loss function on the function class. g ∗ is the minimizer of the risk function that
we are looking for, and g_n is some classifier obtained from the training data.

In other words, consider the set of samples S for which sup_{g∈G} |R_n(g) − R(g)| ≥ ε. Hoeffding's inequality
alone does not give us a tool for bounding the probability P(S). Suppose now we are able to find δ ≥ 0 such
that P(S) ≤ δ; then we can state that with probability at least 1 − δ (i.e., on the complement of S),

R(g_n) = R_n(g_n) + (R(g_n) − R_n(g_n))
       ≤ R_n(g_n) + sup_{g∈G} |R_n(g) − R(g)|
       ≤ R_n(g_n) + ε,

which gives the result of type A. Hence, the uniform deviation bound is sufficient to bound the gap between
training and testing errors, and it is also shown to be necessary in certain sense.

2.4 Simple Case: Finite G


Let |G| = N or G = {g1 , . . . , gN }. Define set Ci as

C_i ≜ {{(x_j, y_j)}_{j=1}^n : R(g_i) − R_n(g_i) ≥ ε}.

From the Hoeffding bound, we know that P(C_i) ≤ δ = exp(−2nε²). Now observe,


P(∪_{i=1}^N C_i) ≤ Σ_{i=1}^N P(C_i)   [Union Bound]
               ≤ Nδ
⇒ P(sup_i R(g_i) − R_n(g_i) ≥ ε) ≤ N exp(−2nε²) = δ′
⇒ sup_{g∈G} R(g) − R_n(g) ≤ √(log(N/δ′)/(2n))   [w.p. ≥ 1 − δ′]

We thus have a uniform bound on the deviation of the empirical risk from its mean with high probability.
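This finite-class bound is easy to check numerically. The sketch below is ours, not from the notes; it assumes numpy, and the class of N threshold classifiers, the noisy-label distribution, and the use of a large independent sample to approximate the true risks are arbitrary choices made only for illustration:

import numpy as np

rng = np.random.default_rng(1)
n, N, delta = 2000, 50, 0.05

# Finite class G: threshold classifiers g_t(x) = sign(x - t) for N thresholds
# (a hypothetical choice just to make the bound concrete).
thresholds = np.linspace(-2, 2, N)

# Data: X ~ N(0,1), Y = sign(X) flipped with probability 0.2.
X = rng.normal(size=n)
Y = np.sign(X) * np.where(rng.random(n) < 0.2, -1, 1)

# True risks R(g_t), approximated on a very large independent sample.
X_big = rng.normal(size=500_000)
Y_big = np.sign(X_big) * np.where(rng.random(500_000) < 0.2, -1, 1)
true_risk = np.array([np.mean(np.sign(X_big - t) != Y_big) for t in thresholds])

# Empirical risks R_n(g_t) on the n training points.
emp_risk = np.array([np.mean(np.sign(X - t) != Y) for t in thresholds])

sup_dev = np.max(np.abs(true_risk - emp_risk))
bound = np.sqrt(np.log(N / delta) / (2 * n))  # one-sided bound from the notes
print(f"sup_g |R(g) - R_n(g)| = {sup_dev:.4f}, Hoeffding + union bound = {bound:.4f}")

In such runs the observed supremum typically sits well below the bound, since the union bound is loose.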

EE378A Statistical Signal Processing Lecture 15 - 05/17/2016

Lecture 15: Vapnik-Chervonenkis Theory


Lecturer: Jiantao Jiao, Tsachy Weissman Scribe: Ahmadreza Momeni, Huseyin A. Inan

1 Recap
In the previous lecture, we considered the following two types of risks, i.e., the true risk and the empirical
risk:
 
R(g) = E[I(g(X) ≠ Y)],
R_n(g) = (1/n) Σ_{i=1}^n 1(g(x_i) ≠ y_i),

and we were interested in their difference as g varies, i.e., we aimed to find a condition to guarantee that
these two quantities remain close uniformly over the entire set G. In other words we wanted to control the
following quantity
sup_{g∈G} |R(g) − R_n(g)|.   (1)

The intuition behind this goal was that, if we have such a guarantee, then minimizing Rn (g) would be
roughly the same as minimizing R(g). In this regard, we determined some types of results to achieve:
• Type A: R(gn ) ≤ Rn (gn ) + B(n, τ ), which is closely related to (1). If we guarantee such a bound then
we have:
|R(g) − Rn (g)| ≤ B(n, τ ), ∀g ∈ G ⇒ |R(gn ) − Rn (gn )| ≤ B(n, τ ),
which we will exploit later.
• Type B: R(g_n) ≤ R(g*) + B(n, τ), where g* = arg min_{g∈G} R(g).

Now suppose that g_n minimizes R_n(g); we call g_n the Empirical Risk Minimizer. Based on the above results
we have

R(g_n) = R(g_n) − R(g*) + R(g*)
       ≤(a) R(g_n) − R(g*) + R(g*) + R_n(g*) − R_n(g_n)
       = (R(g_n) − R_n(g_n)) + (R_n(g*) − R(g*)) + R(g*)
       ≤ 2 sup_{g∈G} |R(g) − R_n(g)| + R(g*)
       ≤(b) 2B(n, τ) + R(g*),
where (a) is due to the definition of the empirical risk minimizer that Rn (g ∗ ) ≥ Rn (gn ), and (b) follows from
result type A.
In order to proceed let us restate the following lemma.
Lemma 1 (Hoeffding bound and finite function class). For a fixed g we have

P(|R(g) − R_n(g)| ≥ ε) ≤ 2e^{−2nε²},

and consequently for finite G,

P(sup_{g∈G} |R(g) − R_n(g)| ≥ ε) ≤ |G| · 2e^{−2nε²}.

2 Empirical Process Theory
Part of empirical process theory studies the “uniform convergence” of frequencies to their true probabilities.
Formally, consider the probability triple (X , B, P), where X is the space of outcomes, B is a collection of sets
in X (a σ-algebra), and P is the probability measure defined over B. In addition let S ⊂ B denote a non-empty
set of possible events, and A ∈ S be an event. Also, let X n = (X1 , . . . , Xn ) be a collection of n i.i.d. random
variables drawn from distribution P. We denote any specific n-dimensional vector (x1 , x2 , . . . , xn ) ∈ X n as
xn . We define the empirical frequency as follows:

ν(A, x^n) ≜ #{x_i : x_i ∈ A} / n.
We are interested in the relationship between the empirical frequency and the probability of event A

p(A) , P(X ∈ A),

for all A. If we fix an event A, then according to the Law of Large Numbers we have
ν(A, x^n) → p(A) in probability as n → ∞.

In order to study the mentioned relationship we define π S (xn ) as follows

π^S(x^n) ≜ sup_{A∈S} |ν(A, x^n) − p(A)|,

and then pursue the following goal: P(π^S(x^n) > ε) → 0 as n → ∞.
We will shortly introduce the notion of entropy of the set of events S of the samples of size n:

N S (x1 , . . . , xn ) , # {(1(x1 ∈ A), 1(x2 ∈ A), . . . , 1(xn ∈ A)) : A ∈ S} .

This function basically represents the number of patterns of xn the set in S can generate. We have the
following proposition for this function:
Proposition 2. 1 ≤ N S (x1 , . . . , xn ) ≤ 2n .

Proof This result follows from the fact S is not empty, and since the patterns are binary sequences of
length n, there can be at most 2n of them.

Note that the function N S (x1 , x2 , . . . , xn ) is defined for any specific xn . For example, if x1 = . . . = xn ,
then 1 ≤ N S (x1 , . . . , xn ) ≤ 2 since we can have at most 2 different patterns possibly, (0, . . . , 0) and (1, . . . , 1).

We now introduce the entropy:

H S (n) , E log2 N S (X1 , . . . , Xn ). (2)

We have the following propositions for the entropy:

Proposition 3. 0 ≤ H S (n) ≤ n.
Proof Follows from Proposition 2.

Proposition 4 (Subadditivity). H S (m + n) ≤ H S (m) + H S (n).

This property is similar to the one that we have in information theoretic entropy. Recall from the
information theory:
H(X^{m+n}) = H(X^m) + H(X_{m+1}^{m+n} | X^m) ≤ H(X^m) + H(X_{m+1}^{m+n}),
where the equality is from the Chain Rule and the inequality is from the fact that conditioning reduces
entropy.
Proof Left as an exercise. A hint is to justify N^S(x_1^{m+n}) ≤ N^S(x_1^m) N^S(x_{m+1}^{m+n}).

The following Lemma in mathematical analysis will be important in our derivation:


Lemma 5 (Fekete). For any subadditive sequence {a_n}_{n=1}^∞ (i.e., a_{m+n} ≤ a_m + a_n),

lim_{n→∞} a_n/n = inf_{n≥1} a_n/n.   (3)

Combining Proposition 4 and Lemma 5, we have


Corollary 6. There exists constant c ∈ [0, 1] such that
lim_{n→∞} H^S(n)/n = c = inf_n H^S(n)/n,   (4)

and for any finite n,

H^S(n)/n ≥ c.   (5)
Note that we had 0 ≤ H S (n) ≤ n, therefore, c ∈ [0, 1].
We now state a very important theorem that later we will connect to our uniform convergence problem:
Theorem 7. [1, Theorem 14.1] If lim_{n→∞} H^S(n)/n = 0, then

π^S(x^n) → 0 a.s. as n → ∞.   (6)

Moreover, if lim_{n→∞} H^S(n)/n = c > 0, then there exists δ(c) > 0 not depending on n such that

lim_{n→∞} P(π^S(x^n) > δ(c)) = 1.   (7)

We note that (6) is the desired case where we have almost sure convergence to 0. However, (7)
tells us that there exists a gap δ(c) with probability 1, which is not desired. Let us now connect this to our
problem with the following bijections:

X ←− Z = (X, Y) ∼ P_XY   (8)
A ←− {(X, Y) : g(X) ≠ Y} (maps to g)   (9)
S ←− {{(X, Y) : g(X) ≠ Y}, g ∈ G}   (10)
p(A) ←− R(g)   (11)
ν(A; x^n) ←− R_n(g)   (12)
π^S(x^n) ←− sup_{g∈G} |R(g) − R_n(g)|   (13)

What is the entropy in this case? It is the following:


H^G(n) = E log_2 #{(1(g(X_1) ≠ Y_1), . . . , 1(g(X_n) ≠ Y_n)) : g ∈ G}.   (14)

We have the following theorem for the undesired case (lim_{n→∞} H^G(n)/n = c > 0):

Theorem 8. [1, Theorem 3.6] If H^G(n)/n → c > 0 as n → ∞, then there exists a subset Z* ⊂ Z with P(Z*) = c such that, for the subset Z_1*, . . . , Z_k* ≜ (Z_1, . . . , Z_n) ∩ Z* of almost every training data set (Z_1, . . . , Z_n) and for any given sequence of binary values δ_1, . . . , δ_k ∈ {0, 1}, there exists a function g ∈ G such that δ_i = 1(g(X_i*) ≠ Y_i*).

This theorem basically tells us that if c > 0, then for a c-portion of the training data, there is a function
g ∈ G which perfectly overfits this portion of the data. In the extreme case where c = 1, we say that G is
non-falsifiable, which means that we cannot falsify the function class G given any observations, for there is
always some g ∈ G which commits no error. Then we cannot expect to do anything useful on the test data.
We refer to the case 0 < c < 1 as the partially non-falsifiable case.

References
[1] Vladimir Vapnik, Statistical learning theory. New York: Wiley, 1998.

EE378A Statistical Signal Processing Lecture 16 - 05/19/2016

Lecture 16: Statistical Learning Theory in Action


Lecturer: Jiantao Jiao, Tsachy Weissman Scribe: Xinwei Shi

When we apply learning theory in practice, what really matters is not the precise bounds, but the intuition
we gain through the derivation. In this lecture, we will focus on how to use the ideas of learning theory
in practice. We will also provide intuitive explanations of some popular learning algorithms, such as SVM,
which will deepen our understandings of these algorithms.

1 Recap
Some notations and definitions:
• G: class of functions g : X → {−1, 1}
• loss class: F ≜ {f_g : (x, y) ↦ 1{g(x) ≠ y}, g ∈ G}
• For samples Z_1, Z_2, ..., Z_n i.i.d. ∼ P_XY, Z_i = (X_i, Y_i), the quantity N^F(Z_1, Z_2, ..., Z_n) ≜ #{(f(Z_1), f(Z_2), ..., f(Z_n)) :
f ∈ F} is often referred to as the "projection of F on the samples" (Z_1, Z_2, ..., Z_n). It counts the number of
n-dimensional binary error patterns that can be generated by the loss class F.
• VC-entropy: H F (n) , E log2 N F (Z1 , Z2 , ..., Zn ).
Last time we introduced a theorem:
sup_{g∈G} |R(g) − R_n(g)| → 0 in probability as n → ∞  ⇔  lim_{n→∞} H^F(n)/n = 0

It shows that if VC-entropy over n converges to 0, “uniform convergence” can be achieved, which means
true risk function R(g) and the empirical risk function Rn (g) are uniformly close. In this case, if we choose
function g to minimize Rn (g), the true risk R(g) achieved by this g will be close to its minimal.

2 Shattering coefficient and VC-dimension


However, it is often difficult in practice to show lim_{n→∞} H^F(n)/n = 0, since

1. the distribution PXY is unknown to us, hence we cannot compute the expectation of log2 N F (Z1 , Z2 , ..., Zn )
with infinite precision;
2. even for a given sequence (z1 , z2 , . . . , zn ), computing log2 N F (z1 , z2 , ..., zn ) may be difficult.
In practice, one approach is to approximate the VC-entropy by well-designed approximation algorithms.
Now we look at a simple way to upper bound the VC-entropy:

H^F(n) ≤ log_2 sup_{(Z_1,Z_2,...,Z_n)} N^F(Z_1, Z_2, ..., Z_n)

By using the supremum as an upper bound of the expectation, we get rid of the dependence on distribution
PXY .
Definition 1 (Shattering coefficient). S_F(n) ≜ sup_{(Z_1,Z_2,...,Z_n)} N^F(Z_1, Z_2, ..., Z_n) is called the shattering coefficient of the loss class F.

In practice, to show VC-entropy over n converges to 0, it suffices to show that

lim sup_{n→∞} log_2 S_F(n) / n = 0

since the logarithm of the shattering coefficient log2 SF (n) provides an upper bound for the VC-entropy.
The shattering coefficient SF (n) has some strong structural properties:

Theorem 2. [1, Theorem 4.3] The shattering coefficient SF (n) satisfies either

(a) S_F(n) = 2^n for all n ≥ 1,

or

(b) S_F(n) = 2^n if n ≤ h, and S_F(n) ≤ Σ_{i=0}^{h} (n choose i) ≤ (en/h)^h if n > h.

In case (a), S_F(n) always exhibits exponential growth, and lim_{n→∞} log_2 S_F(n) / n = 1 > 0. However, this does not
necessarily mean that the VC-entropy over n does not converge to 0 (or that there is no uniform convergence of the empirical
risk to the true risk), because the logarithm of the shattering coefficient is only a pessimistic upper bound
on the VC-entropy.
In case (b), S_F(n) exhibits exponential growth until n = h, and then it grows at most polynomially. In
particular, log_2 S_F(n) / n ≤ c log(n)/n → 0 as n → ∞, therefore the VC-entropy over n converges to 0.
Definition 3 (VC-dimension). The largest integer h for which case (b) holds is defined as the VC-
dimension of F, denoted by V (F). In case (a) we define V (F) = ∞.
The intuition of VC-dimension is that h is the maximum number of feature vectors that can be shattered
by the function class G. Some notes:
1. Since there is a bijection between G and F, V (F) can also be written as V (G).

2. To prove V (F) or V (G) = h, it takes two steps:


(a) find a specific set of h feature vectors (X1 , ..., Xh ), show it can be shattered by G.
(b) prove that any set of n feature vectors (X_1, ..., X_n), n > h, cannot be shattered by G.
Usually step (b) is more difficult.

Example 4. X = R², G = {g = 1{w_1 x_1 + w_2 x_2 ≥ b} : w_1, w_2, b ∈ R, (w_1, w_2) ≠ (0, 0)}. Find V(G). (Note: G is
the class of linear classifiers with two-dimensional features.)
It is easy to show that G can shatter a specific set of 2 points, or a specific set of 3 points in R². Then
we enumerate all patterns of 4 points in R² to prove that any set of 4 points cannot be shattered by G:

1. when there exist 3 of the 4 points that are collinear, the labeling in Fig.1(a) cannot be achieved by
linear classifiers.
2. when there exist 3 points that are not collinear, and the 4th point is inside the convex hull of the other
3 points, the labeling in Fig.1(b) cannot be achieved by linear classifiers.
3. when there exist 3 points that are not collinear, and not a single point is in the convex hull of the other
3 points, the labeling in Fig.1(c) cannot be achieved by linear classifiers (cf., Minsky & Papert ’69).

Figure 1: patterns of 4 points

Therefore, V (G) = 3. A generalization of this result is that for feature space Rd and linear classifier class
G, V (G) = d + 1 = number of parameters.
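The claim V(G) = 3 can also be verified computationally: a labeling is achievable by a strict linear separator iff a small linear program is feasible. The sketch below is ours, assuming numpy and scipy; it checks strict separability, which is a sufficient condition, and for the four-point configuration the XOR labeling is genuinely unachievable, matching the cases above:

import itertools
import numpy as np
from scipy.optimize import linprog

def linearly_separable(points, labels):
    """Check feasibility of y_i (w . x_i + b) >= 1 via a small LP (w, b free)."""
    n, d = points.shape
    # Variables: [w_1, ..., w_d, b]; constraints: -y_i (w . x_i + b) <= -1.
    A_ub = -labels[:, None] * np.hstack([points, np.ones((n, 1))])
    b_ub = -np.ones(n)
    res = linprog(c=np.zeros(d + 1), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * (d + 1))
    return res.success

def shattered(points):
    """A set is shattered if every +/-1 labeling is (strictly) linearly separable."""
    n = len(points)
    return all(
        linearly_separable(points, np.array(labels))
        for labels in itertools.product([-1, 1], repeat=n)
    )

# Three points in general position can be shattered; four points cannot,
# consistent with V(G) = 3 for linear classifiers on R^2.
three = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
four = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
print("3 points shattered:", shattered(three))   # expected: True
print("4 points shattered:", shattered(four))    # expected: False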
Example 5. Let us introduce Dudley's theorem (cf., Dudley '78) here: for arbitrary X, let G = {g = 1{Σ_{i=1}^m c_i ψ_i(x) ≥ 0} : c_i ∈ R} with fixed ψ_1, · · · , ψ_m : X → R; then V(G) ≤ m. (Note: G is a class of generalized linear classifiers.)
To prove this fact, assume by contradiction that the points x_1, · · · , x_{m+1} can be shattered. By definition,
there exist M = 2^{m+1} vectors c^{(1)}, · · · , c^{(M)} such that the (m + 1) × 2^{m+1} matrix A formed by A_{ij} =
(ψ_1(x_i), · · · , ψ_m(x_i))^T c^{(j)} satisfies: the columns of sign(A) exhaust all possible 2^{m+1} different binary vectors
of length m + 1. By construction of A, the rank of A is at most m, hence the row vectors of A are linearly
dependent: Σ_{i=1}^{m+1} u_i A_i = 0 with u_1, · · · , u_{m+1} ∈ R not all zero. However, picking out the column of A
which equals sign(u_1, · · · , u_{m+1})^T results in a non-zero entry in Σ_{i=1}^{m+1} u_i A_i, a contradiction!
which equals sign(u1 , · · · , um+1 ) results in a non-zero entry in i=1 ui Ai , a contradiction!
The previous example shows that, for linear models the VC-dimension is closely related to the number
of parameters. However, for other function classes VC-dimension may be either larger or smaller than the
number of parameters, as shown in Examples 6 and 7.
Example 6 shows that the VC-dimension of G that is nonlinear in its parameters can be smaller than the
number of parameters.
Example 6. For X = R, G = {g = 1{Σ_{i=1}^d |a_i x^i| sign(x) + a_0 ≥ 0} : a_i ∈ R}, its VC-dimension is V(G) = 1,
since Σ_{i=1}^d |a_i x^i| sign(x) + a_0 is monotonically increasing in x, so G cannot shatter any set of two points.
Example 7 shows that the VC-dimension of G that is nonlinear in its parameters can be larger than the
number of parameters.
Example 7. For X = (0, 2π), G = {g = 1(sin αx ≥ 0) : α ∈ (0, ∞)}, its VC-dimension is V(G) = ∞. In
fact, for x_i = 2^{−i}, i = 1, · · · , n and any binary sequence δ_1, · · · , δ_n ∈ {0, 1}, the choice

α* = π (Σ_{i=1}^n (1 − δ_i) 2^i + 1)

satisfies 1(sin(α* x_i) ≥ 0) = δ_i for all i = 1, · · · , n.
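The construction in Example 7 can be checked directly: for every labeling, the α* above reproduces it. A short verification sketch (ours, assuming numpy; n is kept small only to keep the enumeration cheap):

import itertools
import numpy as np

def shatters(n):
    """Check that {x -> 1(sin(a*x) >= 0)} realizes every labeling of x_i = 2^{-i}."""
    x = np.array([2.0 ** (-i) for i in range(1, n + 1)])
    for delta in itertools.product([0, 1], repeat=n):
        delta = np.array(delta)
        # alpha* from the lecture: pi * (sum_i (1 - delta_i) 2^i + 1)
        alpha = np.pi * (np.sum((1 - delta) * 2.0 ** np.arange(1, n + 1)) + 1)
        labels = (np.sin(alpha * x) >= 0).astype(int)
        if not np.array_equal(labels, delta):
            return False
    return True

for n in range(1, 11):
    assert shatters(n), f"failed at n = {n}"
print("sin(alpha x) classifiers shatter x_i = 2^{-i} for n = 1..10")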

3 Learning theory in practice


The last section of the lecture will discuss how we incorporate learning theory, including concepts like VC-
dimension, in practice. There are several different schemes:
• Empirical risk minimization (ERM): the optimal function is selected as g_n ≜ arg min_{g∈G} R_n(g). We
assume R_n(g) is close to R(g) (uniform convergence), and therefore R(g_n) is also small.

• Structural risk minimization (SRM): we introduce a sequence of function classes {G_d, d = 1, 2, 3, · · · }
such that G_1 ⊂ G_2 ⊂ G_3 ⊂ · · · . The optimal function is selected to minimize the sum of the empirical
risk and a penalty on the complexity of the function class, as g_n ≜ arg min_{g∈G_d, d∈N+} (R_n(g) + pen(d, n)). Intro-
ducing the sequence of function class Gd relaxes the “knowledge” incorporated by the function class.
The complexity penalization pen(d, n) increases as d increases, and it is usually a simple function of
the structure. In this way, we try to achieve an optimal balance between the empirical risk and the
complexity of the class.
Choosing the penalty term is an “art”: the penalty term can be chosen according to VC-dimension,
VC-entropy and many others. We refer the readers to [2] for more details.
• Regularization: the optimal function is selected as g_n ≜ arg min_{g∈G} (R_n(g) + λ‖g‖_2²), where ‖g‖_2² can be
replaced by other regularizers. The idea is essentially the same as SRM (e.g., consider G_d = {g ∈ G :
‖g‖_2 ≤ d}).
We can understand ERM as an unconstrained optimization problem, and the regularization scheme as
a constrained optimization problem, where we minimize the empirical risk R_n(g) subject to ‖g‖_2² < b.
Here tuning the parameter b is equivalent to tuning the parameter λ. The idea is that the function class
G can be very rich, and searching over all g ∈ G can lead to overfitting (large stochastic error). So we
constrain the function g to have a norm smaller than √b, so that we search over a smaller domain.
The parameter b can be gradually increased (or λ decreased) to find the optimal result in practice.
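A small illustration of this regularization-path idea (ours, assuming numpy and scikit-learn; logistic loss is used only as a convenient surrogate for the 0-1 loss, and the noisy high-dimensional data are an arbitrary choice). Sweeping C, which plays the role of the constraint level b above, trades training error against validation error:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(6)
# Toy data (ours): many noisy features, few informative ones, so that an
# unconstrained fit overfits and some regularization helps.
n, d = 200, 100
X = rng.normal(size=(n, d))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=1.0, size=n) > 0).astype(int)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.5, random_state=0)

# Sweep the constraint level: larger C means a weaker L2 penalty (larger b above).
for C in [0.01, 0.1, 1.0, 10.0, 100.0]:
    clf = LogisticRegression(C=C, max_iter=1000).fit(X_tr, y_tr)
    print(f"C = {C:7.2f}  train err = {1 - clf.score(X_tr, y_tr):.3f}  "
          f"val err = {1 - clf.score(X_va, y_va):.3f}")

Typically the training error decreases as C grows while the validation error is minimized at an intermediate value, mirroring the stochastic-vs-approximation trade-off above.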
The following theorem explains the idea behind SVM:
Theorem 8 (SVM, Vapnik '78, Gurvits '97). If F is the set of hyperplanes with margin ρ, then

V(F) ≤ min{O(1/ρ²), n + 1},

where n is the dimension of the space.
Let two parallel hyperplanes separate the two classes of data into w^T x_i + b ≥ 1 and w^T x_i + b ≤ −1. The
margin, defined as the distance between the two hyperplanes, is ρ = 2/‖w‖_2. So the VC-dimension is upper
bounded by C‖w‖_2², where C is a constant. This explains why in SVM, we try to minimize the training error
and control the ℓ_2-norm (instead of other norms) of the parameter w.

References
[1] Vladimir Vapnik, Statistical learning theory. New York: Wiley, 1998.
[2] Bartlett, Peter L., Stéphane Boucheron, and Gábor Lugosi, “Model selection and error estimation”,
Machine Learning, 48, no. 1-3 (2002): 85-113.

EE378A Statistical Signal Processing Lecture 17 - 05/24/2016

Lecture 17: Individual Sequence Prediction under Logarithmic Loss


Lecturer: Jiantao Jiao, Tsachy Weissman Scribe: Aman Sinha

In this lecture, we look at predicting an individual sequence X t = (X1 , X2 , ..., Xt ) based on experts’ advice,
which is very close to the setting of online learning.

1 Notation
1. Alphabets: We consider a sequence alphabet X and a prediction alphabet X̂ , which need not be the
same. For example, X̂ may be the set of all probability distributions on X .
2. Predictor: F = {Ft }t≥1 , where Ft : X t−1 → X̂ makes a prediction for time t based on all available
information up to now. Specifically, the prediction for time t is Ft (X t−1 ).
3. Loss: The instantaneous loss is l : X̂ × X → R. Specifically, the loss at time t is l(F_t(X^{t−1}), X_t). The
cumulative loss of a predictor F on a sequence X^n is L_F(X^n) = Σ_{t=1}^n l(F_t(X^{t−1}), X_t).

2 Some results regarding regret


Let G be a predictor and F a set of predictors (experts). The regret of G relative to F on a specific sequence
xn is
Regret: L_G(x^n) − min_{F∈F} L_F(x^n).   (1)
Namely, the regret is how much worse G performed compared to the best performance possible in F. Our
goal is to find G with a small regret for all sequences xn given a class F.

2.1 Binary prediction with Hamming loss


Consider X = X̂ = {0, 1}, l(x̂, x) = I{x̂ ≠ x}, F = {F^(0), F^(1)}, where F_t^(i)(x^{t−1}) = i for all t, x^{t−1}, i = 0, 1. Then
we have the following:

1. min_{F∈F} L_F(x^n) = min(n_1(x^n), n − n_1(x^n)) ≤ n/2, where n_1(x^n) = Σ_{i=1}^n I{x_i = 1}.
2. for any deterministic predictor G, there always exists some sequence xn such that LG (xn ) = n. (For
example, choose a sequence that always has xt opposite of what G predicts for time t).
3. By the above,

inf_G sup_{x^n} (L_G(x^n) − min_{F∈F} L_F(x^n)) ≥ n/2.

2.2 Prediction with logarithmic loss and finite families


Consider a finite alphabet X , and predictor space X̂ = M(X ), which is the space of probability mass
functions on X , and the loss function l(x̂, x) = − log(x̂[x]), where x̂[x] is the probability of the symbol x
under the distribution x̂.
In this setting, note that any predictor F is equivalent to some probability law on sequences. Namely, we
set Ft (xt−1 )[xt ] = P(Xt = xt |X t−1 = xt−1 ) and
L_F(x^n) = Σ_{t=1}^n − log F_t(x^{t−1})[x_t] = − log Π_{t=1}^n F_t(x^{t−1})[x_t] =: − log P_F(x^n),   (2)

where PF (xn ) is the probability law of sequences according to F .

Average predictor on finite family Given a finite class of functions F = {F (1) , ..., F (m) }, consider the
average predictor G defined by
P_G = (1/m) Σ_{i=1}^m P_{F^(i)}.   (3)
We have

L_G(x^n) − min_{F∈F} L_F(x^n) = log (max_i P_{F^(i)}(x^n) / P_G(x^n)) ≤ log m,   (4)

since Σ_{i=1}^m P_{F^(i)}(x^n) ≥ max_i P_{F^(i)}(x^n). Therefore, the worst-case regret of the average predictor G relative to any finite
family F = {F^(1), ..., F^(m)} is upper bounded by log m.
Taking a closer look at this predictor, we see that

G_t(x^{t−1})[x_t] = P_G(x_t | x^{t−1})
                 = P_G(x^t) / P_G(x^{t−1})
                 = Σ_i P_{F^(i)}(x^t) / Σ_j P_{F^(j)}(x^{t−1})
                 = Σ_i P_{F^(i)}(x_t | x^{t−1}) P_{F^(i)}(x^{t−1}) / Σ_j P_{F^(j)}(x^{t−1}).

Recognizing that P_F(x^n) = e^{−L_F(x^n)}, the above expression can be rewritten as

G_t(x^{t−1})[x_t] = Σ_i F_t^(i)(x^{t−1})[x_t] exp(−L_{F^(i)}(x^{t−1})) / Σ_j exp(−L_{F^(j)}(x^{t−1})).   (5)

Thus, the average predictor iteratively reweights the predictors in F over a specific sequence xn according
to how well they have done so far in predicting the sequence.
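A minimal simulation of this mixture predictor (ours, not from the notes; it assumes numpy, and the family of constant Bernoulli experts and the random binary sequence are arbitrary illustrative choices) confirms that the log-loss regret never exceeds log m:

import numpy as np

rng = np.random.default_rng(2)
n, m = 200, 8

# Finite family: constant Bernoulli experts, F^(i) always predicts P(x_t = 1) = p_i.
p = np.linspace(0.05, 0.95, m)
x = rng.integers(0, 2, size=n)          # an arbitrary binary sequence to predict

log_weights = np.zeros(m)               # log exp(-L_{F^(i)}(x^{t-1})), starts at 0
loss_G = 0.0                            # cumulative log loss of the mixture predictor
loss_F = np.zeros(m)                    # cumulative log loss of each expert

for t in range(n):
    w = np.exp(log_weights - log_weights.max())
    w /= w.sum()                        # posterior weights over experts, as in (5)
    q1 = np.dot(w, p)                   # mixture probability assigned to x_t = 1
    prob_xt = q1 if x[t] == 1 else 1 - q1
    loss_G += -np.log(prob_xt)
    expert_probs = np.where(x[t] == 1, p, 1 - p)
    loss_F += -np.log(expert_probs)
    log_weights += np.log(expert_probs)  # reweight experts by their performance

regret = loss_G - loss_F.min()
print(f"regret = {regret:.4f} <= log m = {np.log(m):.4f}")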

Normalized maximum-likelihood predictor (NML) on finite family Consider the same finite family
as above. Define the normalization constant (i.e. partition function) Z := Σ_{x^n} max_i P_{F^(i)}(x^n). Then,
consider the NML predictor:

P_G(x^n) = max_i P_{F^(i)}(x^n) / Z.   (6)

Then we have

L_G(x^n) − min_{F∈F} L_F(x^n) = log Z.   (7)

Note that for any other G′, there exists y^n such that P_{G′}(y^n) ≤ P_G(y^n), since both P_{G′} and P_G define probability
laws. In this case, we have L_G(y^n) = − log P_G(y^n) ≤ − log P_{G′}(y^n) = L_{G′}(y^n). This leads to the following
theorem:
Theorem 1. min_G max_{x^n} {L_G(x^n) − min_{F∈F} L_F(x^n)} = log Z(F), which is achieved by the NML predictor.
Proof By the above, where G is the NML predictor and y^n is the sequence such that P_{G′}(y^n) ≤ P_G(y^n),
we have (using that the regret of the NML predictor is the same for every sequence)

max_{x^n} (L_G(x^n) − min_{F∈F} L_F(x^n)) = L_G(y^n) − min_{F∈F} L_F(y^n)
                                          ≤ L_{G′}(y^n) − min_{F∈F} L_F(y^n)
                                          ≤ max_{x^n} (L_{G′}(x^n) − min_{F∈F} L_F(x^n)),
as desired.
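For small alphabets and horizons, Z (and hence the minimax regret log Z of the NML predictor) can be computed by brute force. A toy sketch of ours, assuming numpy; the i.i.d. Bernoulli experts and the horizon are arbitrary choices:

import itertools
import numpy as np

# Small finite family of i.i.d. Bernoulli "experts" and short binary sequences,
# so that Z can be computed exactly by enumeration (our own toy setup).
p = np.array([0.1, 0.5, 0.9])           # P_{F^(i)}(x_t = 1)
n = 8

def seq_prob(x, q):
    """Probability of the binary sequence x under an i.i.d. Bernoulli(q) expert."""
    ones = sum(x)
    return q ** ones * (1 - q) ** (len(x) - ones)

# Partition function Z = sum over all x^n of max_i P_{F^(i)}(x^n).
Z = sum(max(seq_prob(x, q) for q in p)
        for x in itertools.product([0, 1], repeat=n))

print(f"minimax regret log Z = {np.log(Z):.4f}")
print(f"uniform-mixture bound log m = {np.log(len(p)):.4f}")

As expected, log Z never exceeds the log m guarantee of the average (uniform-mixture) predictor.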

EE378A Statistical Signal Processing Lecture 18 - 05/26/2016

Lecture 18: Individual Sequence Prediction under General Loss


Lecturer: Jiantao Jiao, Tsachy Weissman Scribe: Igor Berman

In this lecture we continue our work on predicting individual sequences1 without assuming any underlying
stochastic model.

1 Classic Results in Probability Theory


The following two results can be easily shown and will be useful in our analysis of sequence prediction:

1.1 Hoeffding’s Inequality:


For any bounded random variable X s.t. a ≤ X ≤ b and for any s > 0, the following holds:

E[e^{s(X−EX)}] ≤ exp(s²(b − a)² / 8).
In particular, for s = 1:

E e^X ≤ e^{EX} e^{(b−a)²/8}  ⟹  log E e^X ≤ EX + (b − a)²/8.

1.2 Maximum of i.i.d. Normal Variables:


Given n i.i.d. random variables G_1, G_2, · · · , G_n ∼ N(0, 1), the following holds:

lim_{n→∞} E[max_{1≤i≤n} G_i] / √(2 log n) = 1.

2 Individual Sequence Prediction under Logarithmic Loss


2.1 Notation
Going back to individual sequence prediction, we recall the following notation:
1. We are given a sequence x^n where x_t ∈ X. We try to predict it with x̂^n where x̂_t ∈ X̂.
2. The predictor is F = {Ft }t≥1 , s.t. Ft : X t−1 → X̂ .

3. We use the loss function l : X̂ × X → R+ and denote l_max = sup_{x̂,x} l(x̂, x). The cumulative loss of predictor
F on sequence x^n is:

L_F(x^n) = Σ_{t=1}^n l(F_t(x^{t−1}), x_t)

4. The worst-case regret of predictor G relative to a reference class of predictors F is:


 
WCR_n(G, F) = max_{x^n} (L_G(x^n) − min_{F∈F} L_F(x^n))

1 Reading: N. Cesa-Bianchi, and G. Lugosi, Prediction, learning, and games. Cambridge university press, 2006.

2.2 Logarithmic Loss, NML predictor and Uniform Predictor
Consider the logarithmic loss, a finite alphabet X, and the predictor alphabet X̂ which is the simplex of all
probability distributions on X: X̂ = M(X). We use the loss function l(x̂, x) = log(1/x̂[x]), where x̂[x] denotes
the x-th entry of the vector x̂, or the probability that x̂ assigns to outcome x.
In Lecture 17 we showed a one-to-one correspondence between predictors and probability distributions
on sequences, i.e., for every predictor F there exists a distribution PF on sequences such that

F_t(x^{t−1})[x_t] = P_F(x_t | x^{t−1})

We used this to prove the following properties:


1. The minimum worst-case regret for reference class F is achieved by the Normalized Maximum Likeli-
hood predictor G:
P_G(x^n) = max_{F∈F} P_F(x^n) / Σ_{x̃^n} max_{F∈F} P_F(x̃^n)

2. We can upper-bound the WCR of G using the WCR of the Uniform Predictor Guniform , which is defined
by
P_{G_uniform}(x^n) = (1/|F|) Σ_{F∈F} P_F(x^n)

Since G achieves minimal WCR, we know that:

WCRn (G, F) ≤ WCRn (Guniform , F) ≤ log |F|

The inequality on the right was shown in the last lecture.


3. The output of the Uniform Predictor at time t can be shown to be a weighted average of the outputs
of all the predictors in the reference class:

G_t(x^{t−1})[x_t] = Σ_{F∈F} F_t(x^{t−1})[x_t] · exp(−L_F(x^{t−1})) / Σ_{F̃∈F} exp(−L_{F̃}(x^{t−1}))

3 Individual Sequence Prediction under General Loss


So far we analyzed predictors that output a probability distribution over X, F_t : X^{t−1} → M(X). Next, we
explore randomized predictors: with a random predictor F we pick the prediction x̂_t randomly according to the distribution
F_t(x^{t−1}), i.e., x̂_t ∼ F_t(x^{t−1}). Since this is a stochastic setting, we'll focus on the expected loss:

L_F(x^n) = E[Σ_{t=1}^n l(x̂_t, x_t)] = Σ_{t=1}^n Σ_{x̂∈X̂} l(x̂, x_t) F_t(x^{t−1})[x̂]

3.1 Exponentially-weighted Predictor


Given a finite reference class F, let G be the exponentially-weighted predictor:

G_t(x^{t−1})[x̂] = Σ_{F∈F} F_t(x^{t−1})[x̂] · exp(−ηL_F(x^{t−1})) / Σ_{F̃∈F} exp(−ηL_{F̃}(x^{t−1}))

Theorem 1. The exponentially-weighted predictor G satisfies

max_{x^n} [L_G(x^n) − min_F L_F(x^n)] ≤ log |F| / η + η l_max² n / 8.

As a direct corollary, by choosing

η = √(8 log |F| / (l_max² n))

we achieve

L_G(x^n) − min_F L_F(x^n) ≤ l_max √((n/2) log |F|).
Proof: Define W_1 = |F|, and for t > 1 let W_t = Σ_{F∈F} exp(−ηL_F(x^{t−1})). Then on one hand,

log(W_{n+1}/W_1) = log Σ_{F∈F} exp(−ηL_F(x^n)) − log |F|   (1)
                 ≥ log max_{F∈F} exp(−ηL_F(x^n)) − log |F|   (2)
                 = log exp(−η min_{F∈F} L_F(x^n)) − log |F|   (3)
                 = −η min_{F∈F} L_F(x^n) − log |F|.   (4)

On the other hand:


log(W_{t+1}/W_t) = log [ Σ_{F∈F} exp(−ηL_F(x^t)) / Σ_{F∈F} exp(−ηL_F(x^{t−1})) ]   (5)
                = log Σ_{F∈F} [ exp(−η(Σ_{x̂} l(x̂, x_t) F_t(x^{t−1})[x̂] + L_F(x^{t−1}))) / Σ_{F̃∈F} exp(−ηL_{F̃}(x^{t−1})) ]   (6)
                = log Σ_{F∈F} exp(−η Σ_{x̂} l(x̂, x_t) F_t(x^{t−1})[x̂]) · [ exp(−ηL_F(x^{t−1})) / Σ_{F̃∈F} exp(−ηL_{F̃}(x^{t−1})) ]   (7)
                = log E_{F∼Q_t} [ exp(−η Σ_{x̂} l(x̂, x_t) F_t(x^{t−1})[x̂]) ]   (8)


where E_{F∼Q_t}[·] denotes the expectation w.r.t. the distribution Q_t on F defined by Q_t(F) ∝ exp(−ηL_F(x^{t−1})).
Recall Hoeffding’s inequality:
log E e^X ≤ EX + (b − a)²/8.
Since −η Σ_{x̂} l(x̂, x_t) F_t(x^{t−1})[x̂] is bounded in [−ηl_max, 0], by (8) we know that

log(W_{t+1}/W_t) ≤ E_{F∼Q_t}[ −η Σ_{x̂} l(x̂, x_t) F_t(x^{t−1})[x̂] ] + η²l_max²/8   (9)
                = −η Σ_{x̂} l(x̂, x_t) E_{F∼Q_t}[ F_t(x^{t−1})[x̂] ] + η²l_max²/8   (10)
                = −η Σ_{x̂} l(x̂, x_t) ( Σ_{F∈F} exp(−ηL_F(x^{t−1})) F_t(x^{t−1})[x̂] / Σ_{F̃∈F} exp(−ηL_{F̃}(x^{t−1})) ) + η²l_max²/8.   (11)

Recalling that we defined G as

G_t(x^{t−1})[x̂] = Σ_{F∈F} exp(−ηL_F(x^{t−1})) F_t(x^{t−1})[x̂] / Σ_{F̃∈F} exp(−ηL_{F̃}(x^{t−1})),

we arrive at

log(W_{t+1}/W_t) ≤ −η Σ_{x̂} l(x̂, x_t) G_t(x^{t−1})[x̂] + η²l_max²/8.

Summing telescopically over all t:

Σ_{t=1}^n log(W_{t+1}/W_t) = log(W_{n+1}/W_1) ≤ −ηL_G(x^n) + η²l_max²n/8.

Combining this with the lower bound (4):

−η min_{F∈F} L_F(x^n) − log |F| ≤ −ηL_G(x^n) + η²l_max²n/8.

We arrive at

L_G(x^n) − min_{F∈F} L_F(x^n) ≤ log |F|/η + ηl_max²n/8,

as desired.
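A quick numerical check of Theorem 1 and its corollary (a sketch of ours, assuming numpy; the two constant experts and the random binary sequence under Hamming loss are arbitrary choices): the expected loss of the exponentially weighted predictor stays within l_max √((n/2) log |F|) of the best expert.

import numpy as np

rng = np.random.default_rng(3)
n, l_max = 1000, 1.0

# Reference class F: the two constant predictors "always 0" and "always 1"
# under Hamming loss (our own toy instance of the framework above).
experts = [lambda t: 0, lambda t: 1]
N = len(experts)
eta = np.sqrt(8 * np.log(N) / (l_max ** 2 * n))   # the eta from the corollary

x = rng.integers(0, 2, size=n)        # an arbitrary binary sequence
cum_loss_F = np.zeros(N)              # cumulative losses of the experts
expected_loss_G = 0.0                 # cumulative expected loss of G

for t in range(n):
    w = np.exp(-eta * (cum_loss_F - cum_loss_F.min()))
    w /= w.sum()                      # exponential weights over experts
    preds = np.array([f(t) for f in experts])
    inst_loss = (preds != x[t]).astype(float)     # Hamming loss of each expert
    expected_loss_G += np.dot(w, inst_loss)       # expected loss of randomized G
    cum_loss_F += inst_loss

regret = expected_loss_G - cum_loss_F.min()
bound = l_max * np.sqrt(n * np.log(N) / 2)
print(f"regret = {regret:.2f}, bound = {bound:.2f}")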

3.2 Optimality of Exponential Weighting


Denote the worst-case regret for reference classes of size N and sequences of length n as

WC(n, N) = sup_{F : |F|=N} inf_G WCR_n(G, F)

Theorem 2. For binary prediction with Hamming loss l(x̂, x) = 1(x̂ ≠ x), and X = X̂ = {0, 1}, we have:

lim_{n,N→∞} WC(n, N) / √((n/2) log N) = 1.

By the above theorem, it is always possible to construct a reference set F under which the exponentially-
weighted predictor is nearly optimal.

EE378A Statistical Signal Processing Lecture 19 - 05/31/2016

Lecture 19: AdaBoost and Course Review


Lecturer: Jiantao Jiao, Tsachy Weissman Scribe: Brad Nelson, Mengyuan Yan, Arjun Seshadri

In this lecture, we cover the AdaBoost algorithm by Freund and Schapire [1], which can combine many weak
predictors into one strong predictor. We also review some of the key ideas from the course.

1 AdaBoost
The AdaBoost algorithm [1] is an example of an ensemble algorithm, which can combine many weak decision
rules into one strong decision rule. In general, in ensemble methods we need to consider two factors:
1. What to aggregate?

2. How to aggregate?

1.1 The AdaBoost Algorithm


AdaBoost assumes that we have n feature-observation pairs (X1 , Y1 ), . . . , (Xn , Yn ), Yi ∈ {−1, 1}.
Initialize:
• Weight D1 (i) = 1/n for i = 1, . . . , n (equal weights for each sample)
• Classifier F0 (x) = 0 (we don’t yet know anything)
Algorithm:

for t = 1, 2, · · · , T do
    Choose g_t ∈ G to minimize the weighted error ε_t = Σ_{i=1}^n D_t(i) I(g_t(X_i) ≠ Y_i)
    F_t = F_{t−1} + α_t g_t, where α_t = (1/2) log((1 − ε_t)/ε_t)
    D_{t+1}(i) = (D_t(i)/Z_t) × { e^{α_t} if g_t(X_i) ≠ Y_i;  e^{−α_t} if g_t(X_i) = Y_i }, where Z_t = 2√(ε_t(1 − ε_t))
end for
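A compact implementation of the algorithm above, using decision stumps as the weak class G (a sketch of ours, assuming numpy; the exhaustive stump search, the toy data, and T = 50 are illustrative choices, not part of the original notes):

import numpy as np

def adaboost(X, y, T):
    """A minimal AdaBoost with decision stumps as the weak class G (a sketch)."""
    n = len(y)
    D = np.full(n, 1.0 / n)                 # D_1(i) = 1/n
    stumps, alphas = [], []
    for _ in range(T):
        # Weak learner: pick the stump (feature, threshold, sign) minimizing
        # the weighted error eps_t = sum_i D(i) 1{g(X_i) != Y_i}.
        best = None
        for j in range(X.shape[1]):
            for thr in np.unique(X[:, j]):
                for s in (+1, -1):
                    pred = s * np.sign(X[:, j] - thr + 1e-12)
                    err = np.sum(D * (pred != y))
                    if best is None or err < best[0]:
                        best = (err, j, thr, s, pred)
        eps, j, thr, s, pred = best
        alpha = 0.5 * np.log((1 - eps) / eps)
        D = D * np.exp(-alpha * y * pred)   # up-weight the mistakes of g_t
        D /= D.sum()                        # normalize (plays the role of Z_t)
        stumps.append((j, thr, s))
        alphas.append(alpha)
    def predict(Xq):
        F = sum(a * s * np.sign(Xq[:, j] - thr + 1e-12)
                for a, (j, thr, s) in zip(alphas, stumps))
        return np.sign(F)
    return predict

# Toy data (our own choice): label +1 inside the unit circle, -1 outside.
# No single stump classifies this rule accurately, but boosted stumps can.
rng = np.random.default_rng(4)
X = rng.normal(size=(300, 2))
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 < 1.0, 1, -1)
clf = adaboost(X, y, T=50)
print("training error of boosted classifier:", np.mean(clf(X) != y))

On such data the boosted combination typically drives the training error far below what any single stump achieves, which is the reweighting effect explained in the next subsection.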

1.2 Explanation of AdaBoost


In this section we want to answer the question of what to aggregate and how to aggregate. In AdaBoost we
aggregate the simple classifiers gt by linear combination with weight αt . Why does this way of calculating
and aggregating gt make sense?
Explanation: we think of the aggregation of the g_t into F_t as functional gradient descent: boosting tries to
minimize E e^{−Y F(X)}. Our way to find the optimal F is to update it step by step. However, calculating the
expectation relies on the true distribution, so we instead try to use the empirical loss:
minimize  P_n e^{−Y F(X)} = (1/n) Σ_{i=1}^n e^{−Y_i F(X_i)}
                          = (1/n) Σ_{i=1}^n e^{−Y_i (F_{t−1}(X_i) + α_t f_t(X_i))}

How do we choose αt and ft to minimize the loss above? First, fix ft , and choose αt :
∂/∂α_t P_n e^{−Y F_t(X)} = Σ_{i:Y_i ≠ f_t(X_i)} (1/n) e^{−Y_i F_{t−1}(X_i)} e^{α_t} − Σ_{i:Y_i = f_t(X_i)} (1/n) e^{−Y_i F_{t−1}(X_i)} e^{−α_t} = 0

⟹ e^{α_t} Σ_{i:Y_i ≠ f_t(X_i)} (1/n) e^{−Y_i F_{t−1}(X_i)} = e^{−α_t} Σ_{i:Y_i = f_t(X_i)} (1/n) e^{−Y_i F_{t−1}(X_i)}

Claim: D_t(i) ∝ e^{−Y_i F_{t−1}(X_i)}, because F_{t−1}(X_i) = Σ_{j≤t−1} α_j g_j(X_i), and the update rule of D_t(i) can be
written as D_{t+1}(i) ∝ D_t(i) exp(−α_t g_t(X_i) Y_i).
Now the above equation becomes
e^{α_t} ε_t = e^{−α_t} (1 − ε_t)  ⟹  α_t = (1/2) ln((1 − ε_t)/ε_t).
If f_t achieves weighted error ε_t, then this is the α_t to use; however, we have yet
to optimize over f_t. To solve for f_t, we minimize

(1/n) [ Σ_{i:f_t(X_i)=Y_i} e^{−Y_i F_{t−1}(X_i)} e^{−α_t} + Σ_{i:f_t(X_i)≠Y_i} e^{−Y_i F_{t−1}(X_i)} e^{α_t} ]
     = (1/n) [ e^{−α_t} Σ_i e^{−Y_i F_{t−1}(X_i)} + (e^{α_t} − e^{−α_t}) Σ_{i:f_t(X_i)≠Y_i} e^{−Y_i F_{t−1}(X_i)} ].

We note that the first term in fact does not depend on f_t. Hence, it suffices to minimize the second term,
which is proportional to Σ_i D_t(i) I(f_t(X_i) ≠ Y_i). Thus, we just need to minimize ε_t = Σ_{i=1}^n D_t(i) I(g_t(X_i) ≠ Y_i),
which is exactly what AdaBoost does.

Claim: gt achieves weighted error 50% with respect to weight distribution Dt+1 . This can be easily
proved using the definition of gt and Dt+1 . Essentially, each step tries to correct errors that appeared at the
previous step. It is the underlying reason that boosting can combine a bunch of weak learners into a strong
one. For further explanation, see [2].

Some problems with AdaBoost include:


• AdaBoost is sensitive to label noise and outliers (other methods can do better) because it is a greedy
algorithm;
• Adaboost tries to maximize the margin (arc-gv is also designed to maximize the minimum margin, but
it sometimes performs worse than AdaBoost. So is maximizing margin really desirable?);
• loss minimization of E e^{−Y F(X)}: why should we use this loss? Can we use another loss function to derive
the update steps?
Note that if Ft achieves zero training error, we may still not be done, since we want to maximize the
margin. If gt achieves zero risk, then we are in fact done.

Other ensemble methods:


• Bagging (bootstrap aggregation)
• Dropout
– used in neural networks, where connections between layers are randomly dropped while training
– used because other ensemble methods are too computationally expensive
• Exponential screening (See [3] for more information)

2 Course Review
Generalization = Data + Knowledge

How to efficiently encode knowledge in algorithms?

2.1 Statistical Decision Theory


In Statistical Decision Theory, we assume we know a parameterization of the distribution X ∼ Pθ , θ ∈ Θ.
Some general frameworks for these results:
1. Restrict statistician: unbiased estimation, equivariance

• Severe application restriction


2. Assume more knowledge about nature: Bayes (use prior knowledge), minimax (worst-case prior)
Finite sample decision theory is hard. For example, recall there are different ways of normalizing the
empirical variance (dividing by n, n − 1, n + 1 depending on what is considered a “good” estimator).
Asymptotically, if Θ is finite dimensional, then MLE is good under mild conditions. This problem
becomes hard in the infinite dimensional case, see, e.g., [5].
Two intuitions in finite sample decision theory:
1. Shrinkage: consider X ∼ N(µ, I_{p×p}), µ̂_MLE = X, µ̂_JS = (1 − (p−2)/‖X‖²₂) X. Under squared error, R(µ, µ̂_MLE) =
p > R(µ, µ̂_JS) for any µ. Hence, the MLE is inadmissible here. The intuition is that, if you want to
estimate a high-dimensional vector, usually the variance dominates [7] (see the simulation sketch after this list).
2. Approximation [4]: consider estimating F(θ) ∈ R with θ ∈ R^p, p ≫ 1. In this case, the measure
concentration phenomenon occurs [6]: F(θ̂) ≈ E F(θ̂) ≠ F(E θ̂). As a result, bias dominates in this
scenario, especially when F (·) is non-smooth.
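A quick simulation of the shrinkage intuition in item 1 (ours, assuming numpy; the dimension, the number of trials, and the arbitrary true mean are illustrative choices):

import numpy as np

rng = np.random.default_rng(5)
p, trials = 50, 2000
mu = rng.normal(size=p)                      # an arbitrary fixed mean vector

X = mu + rng.normal(size=(trials, p))        # X ~ N(mu, I_p), one draw per trial
mle = X
shrink = 1 - (p - 2) / np.sum(X ** 2, axis=1, keepdims=True)
js = shrink * X                              # James-Stein estimator

risk_mle = np.mean(np.sum((mle - mu) ** 2, axis=1))   # should be about p
risk_js = np.mean(np.sum((js - mu) ** 2, axis=1))
print(f"MLE risk ~ {risk_mle:.2f} (p = {p}), James-Stein risk ~ {risk_js:.2f}")

The James-Stein risk comes out strictly smaller than p, illustrating the dominance claimed above.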

2.2 Statistical Learning Theory


Statistical learning theory differs from decision theory by
1. Focus on prediction
2. Encode knowledge through a decision rule (function class) instead of a probability measure
3. Focus on regret instead of risk

• Regret = our performance − performance of oracle


• This approach makes it possible to deal with problems where probability is undefined, as the
definition of the oracle’s performance may not necessarily require the definition of probability
We have different kinds of oracles. In VC theory, oracle knows P (true measure), and is restricted to
rules in G. In individual sequence prediction, there is no true distribution, so the oracle knows the entire
sequence (true data), and is restricted to rules in G.

2.3 The last words


“Nothing is more practical than a good theory”, by V. Vapnik [8]

The precise theory is not always useful, and one should always focus on gaining general intuitions as
opposed to seeking out black-and-white rules that are claimed to be optimal.

References
[1] Y. Freund, R.E. Schapire, “A decision-theoretic generalization of on-line learning and an application to
boosting”. Journal of Computer and System Sciences, vol. 55, no. 1, pp. 119–139, 1997.
[2] R. Schapire, “Explaining AdaBoost”. Empirical Inference: Festschrift in Honor of Vladimir N. Vapnik,
2013.
[3] A. B. Tsybakov, “Aggregation and minimax optimality in high-dimensional estimation”. Proc. Int. Congr.
Math., Seoul, Korea, pp. 225–246, Aug. 2014.

[4] J. Jiao, K. Venkat, Y. Han, T. Weissman, “Minimax Estimation of Functionals of Discrete Distributions”.
IEEE Transactions on Information Theory, vol. 61, no. 5, pp. 2835–2885, 2015.
[5] A. Nemirovski, “Topics in nonparametric statistics”. In École d'Été de Probabilités de Saint-Flour XXVIII-
1998. Lecture Notes in Math. 1738. Springer, New York. MR1775640, 2000.

[6] M. Talagrand, “A new look at independence”. Ann. Probab. 24, pp. 1–34, 1996.
[7] W. James and C. Stein, “Estimation with quadratic loss,” Proc. 4th Berkeley Symp. Math. Statist.
Probab., vol. 1, pp. 361–379, 1961.
[8] V. N. Vapnik, “Statistical Learning Theory”. vol. 2. New York, NY, USA: Wiley, 1998.
