EE378A - Combined Notes
Lecture 1: Introduction
Lecturer: Jiantao Jiao, Tsachy Weissman Scribe: Nico Chaves
In this lecture, we discuss the content and structure of the course at a high level. We also introduce the 3
pre-approved course projects.
1 Course Logistics
Office Hours:
• Jiantao Jiao, Wed 4:30-5:30 pm in Packard 251
• Tsachy Weissman, Thurs 2:30-4:00 pm in Packard 256
• Bai Jiang, Tues 4:30-5:30 in Sequoia 200
• Yanjun Han, Thurs 4:30-5:30 in Packard 251
See the course website for more details, as these times/locations may be subject to change.
Grading:
• 40%: Approximately 4 problem sets
• 40%: Project
• 10%: Class attendance
• 10%: Scribe notes (due 72 hours after lecture)
• Up to 5% bonus for useful input to the course staff
2 Course Motivation
There are many topics that can be categorized as part of statistical signal processing: decision theory, learning
theory, reinforcement learning, etc. Modern statistical signal processing viewed from a broad perspective
includes everything about data science. However, it’s not always obvious how these different frameworks in
data science are related, and it can be difficult to decide which framework to use to solve a given problem.
This course will have 3 goals:
1. Compare and analyze these existing frameworks for data analysis and processing. By understanding
the goals of each framework, we’ll be able to choose an appropriate algorithm when facing a new
problem, and choose an appropriate framework to analyze a new algorithm.
2. Emphasize the most important principles.
3. Enable students to come up with the best algorithm for the problem at hand.
2.2 “Quiz” Question 1
What is the difference between channel coding and multiple hypothesis testing?
Channel coding: Suppose a transmitter wants to transmit some message c in the form of binary string.
There may be many different messages (binary strings) that the transmitter can send, say, c1 , · · · , cM . Sup-
pose the transmitter randomly selects one of these M messages and sends it over the channel. The channel
corrupts the message in a certain way. For example, the channel may randomly flip each binary bit of the
transmitted codeword with probability < 1/2 (bits are flipped independently).
Multiple hypothesis testing: The receiver then receives a message c′, which may be different from
the message that was sent. The receiver must then apply some decoding algorithm to decide which
message was actually sent, and the decoding process is just choosing one hypothesis from multiple hypotheses.
In the Bayesian framework, to decode the message, the receiver computes the MAP estimate, which minimizes
the block decoding probability of error.
The main difference between channel coding and multiple hypothesis testing is that most of the effort
in channel coding goes into designing the codewords. In other words, our primary goal in channel coding is to select
the messages c1 , ..., cM so as to maximize the rate of transmission. Given a fixed codebook, the
decoding is statistically trivial:1 one just needs to compute the MAP estimate of the codeword to minimize
the decoding probability of error.
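To make the connection concrete, here is a minimal Python sketch (ours, not from the lecture) of MAP decoding over a binary symmetric channel with a uniform prior on a small hypothetical codebook; with bits flipped independently with probability less than 1/2, the MAP rule reduces to choosing the codeword closest to the received word in Hamming distance.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical codebook of M = 4 codewords of length 8 (illustrative values only).
codebook = np.array([
    [0, 0, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1, 1, 1],
])
p = 0.1  # crossover probability of the binary symmetric channel, p < 1/2

# Transmit a uniformly chosen codeword through the channel.
sent = codebook[rng.integers(len(codebook))]
received = sent ^ (rng.random(sent.shape) < p)   # flip each bit independently w.p. p

# MAP decoding with a uniform prior = maximum likelihood = minimum Hamming distance,
# since the log-likelihood is decreasing in the Hamming distance when p < 1/2.
hamming = (codebook != received).sum(axis=1)
decoded = codebook[np.argmin(hamming)]
print(sent, received.astype(int), decoded, sep="\n")
```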
However, in multiple hypothesis testing, we are usually given the conditional distributions of the obser-
vations given each hypothesis. In other words, we do not have the freedom to design each hypothesis. Then,
if we study the multiple hypothesis testing problem under a Bayesian framework and assume each hypoth-
esis is chosen with equal probability, this problem is in fact equivalent to the channel decoding problem.
However, multiple hypothesis testing can also be performed under a frequentist framework, where one does
not assume that each hypothesis has a prior distribution. 2
An SVM doesn’t actually need to make any assumptions about the distribution of the data beyond that
the (feature, label) pairs are generated independently with the same distribution, which can also be relaxed
to some extent. An SVM simply applies a hyperplane to try to separate 2 different “clouds” of data. The
theoretical results one can derive about the SVM also do not rely on any additional data-generating
assumptions: one can derive “distribution-free” results by comparing the performance of the SVM with that
of an oracle that knows the underlying distribution but is restricted to using only linear classifiers.
SVM corresponds to the idea of structural risk minimization in statistical learning theory, where multi-
ple function classes with different VC dimensions are considered simultaneously.
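As a small illustration of the “hyperplane between two clouds” picture (ours; the data and parameter values are purely synthetic), one can fit a linear SVM with scikit-learn without assuming anything about the data-generating distribution beyond i.i.d. sampling:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Two synthetic "clouds" of (feature, label) pairs drawn i.i.d.; the analysis only
# needs train and test pairs to come from the same (unknown) distribution.
X = np.vstack([rng.normal(loc=-2, size=(100, 2)), rng.normal(loc=+2, size=(100, 2))])
y = np.array([0] * 100 + [1] * 100)

clf = SVC(kernel="linear").fit(X, y)   # separating hyperplane w^T x + b
print(clf.coef_, clf.intercept_)       # hyperplane parameters
print(clf.score(X, y))                 # training accuracy
```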
3 Pre-Approved Projects
This section introduces the 3 pre-approved course projects. Students are also welcome to propose their own
project topics.
1 If we do not consider computational issues: decoding an arbitrary code is NP-hard in the worst case.
2 Keywords: FWER and FDR control
3.1 Project 1: Portfolio Optimization
In this context, the term “portfolio” refers to a set of investments (e.g., stocks). The problem can be defined
as follows: Suppose you have N stocks. Every day, you can assign a portion of your capital to each stock.
Your goal is to maximize your profit at the end of some time period.
There exist conflicts between various theoretical frameworks applied to portfolio optimization. We refer
the readers to [1] for details. Since it may not be possible to convincingly propose a portfolio optimization
algorithm in theory, this project is aimed to compare different algorithms in practice.
Here is a basic example explaining why theoretical comparisons might be difficult. Consider a sequence
of random variables X1 , ..., XN , where Xi represents the amount of money you have at day i. If you become
bankrupt at some time point, you can’t make any more investments. Therefore, Xi ≥ 0.
Suppose you’re given a portfolio that generates such a sequence and that satisfies the following 2 condi-
tions:
1. $X_N \xrightarrow{\text{a.s.}} 0$ as $N \to \infty$
2. $E[X_N] \to \infty$ as $N \to \infty$
Condition 1 suggests that the portfolio is very bad, because it suggests you won’t have any money after N
days. In particular, XN converges to 0 in the “almost sure” sense of convergence. However, condition 2
seems to be a desirable property, since the expectation approaches ∞.
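A short simulation (ours, not part of the project description) of a wealth sequence with exactly these two properties: each day the capital is multiplied by 2.5 or 0.1 with probability 1/2, so E[X_N] = 1.3^N → ∞, while the law of large numbers applied to the log-multipliers (whose mean is log 0.5 < 0) forces X_N → 0 almost surely.

```python
import numpy as np

rng = np.random.default_rng(0)
N, trials = 200, 100_000

# Each day the capital is multiplied by 2.5 or 0.1 with probability 1/2 each.
multipliers = rng.choice([2.5, 0.1], size=(trials, N))
wealth = multipliers.prod(axis=1)          # X_N starting from X_0 = 1

print("E[X_N] = 1.3**N =", 1.3 ** N)
print("empirical mean (needs a huge sample to converge):", wealth.mean())
print("median X_N:", np.median(wealth))                      # essentially 0
print("fraction of paths with X_N < 1e-6:", (wealth < 1e-6).mean())  # close to 1
```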
Students pursuing Project 1 will use real stock market data. The course staff will provide these students
with real data from the S&P 500 spanning 5 years in total. At the end of the quarter, we’ll be able to compare
the performance of the different approaches.
The key difference is x∗i , which is called the “teacher’s advice.” The teacher’s advice gives you some addi-
tional information to use to train your classifier. The idea behind the LUPI framework is that this additional
information at training time enables you to train a classifier that will perform better on the test set. Note:
The test set does not contain the teacher’s advice x∗i .
A 64 × 64 image is sharper than the corresponding 5 × 5 image. The reason why the LUPI helps learning
is as follows: since we have access to sharper images, it’s easier to train a classifier that separates different
images. In particular, having access to the 64 × 64 images enables us to define a distance function that
classifies the images better. A pair of 5 × 5 images of different digits may look similar, but the corresponding
64×64 images will enable us to differentiate between the pair of images, resulting in a better distance function.
Example 2: Time Series Prediction
In the traditional time series prediction setting, we have a sequence of data points up to time t. Given
x1 , ..., xt , our goal is to predict xt+k (where k > 0). In other words, our goal is to predict the value of x at
some time in the future using features up to the present.
Now, let’s recast the time series prediction problem under the LUPI setting: As before, xi consists of
features up to now and yi is the value we want to predict. Now, our training set also contains x∗i , which can
contain information from the future: xt+1 , xt+2 , ...
We emphasize again that you do not have access to the x∗i features when evaluating your predictor on
the test set. This should be especially clear in Example 2, where the teacher’s hint contains information
from the future.
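For concreteness, a sketch (our construction, not from the notes) of how such a LUPI training set might be assembled for a time series: x_i is a window of past values, the privileged feature x*_i is a window of future values, and y_i is the value k steps ahead; the function name and window sizes are arbitrary.

```python
import numpy as np

def make_lupi_dataset(series, past=10, future=5, k=3):
    """Build (x_i, x*_i, y_i) triples from a scalar time series.

    x_i  : the `past` most recent values (available at train and test time)
    xs_i : the next `future` values after the prediction target -- the
           "teacher's advice", available only in the training set
    y_i  : the value k steps ahead, the prediction target
    """
    xs, privileged, ys = [], [], []
    for t in range(past, len(series) - k - future):
        xs.append(series[t - past:t])
        ys.append(series[t + k])
        privileged.append(series[t + k + 1:t + k + 1 + future])
    return np.array(xs), np.array(privileged), np.array(ys)

series = np.sin(np.linspace(0, 20, 500)) + 0.1 * np.random.default_rng(0).normal(size=500)
x, x_star, y = make_lupi_dataset(series)
print(x.shape, x_star.shape, y.shape)
```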
In the case of a soft-margin SVM, you can use the teacher’s advice to learn the slack variables. It was
argued in Vapnik and Izmailov [3] that using LUPI significantly improves the classification performance of
an SVM. However, this technique hasn’t been studied extensively in practice (at least to the course staff’s
knowledge). It would make an interesting course project to see how well LUPI works in practice.
Moreover, it’s possible that the LUPI framework doesn’t actually require one to use SVMs. It has yet to
be shown whether LUPI can be applied to other machine learning techniques. This would be an interesting
problem to study as well.
The goal is to learn the αi coefficients such that the performance of this ensemble classifier is comparable to
the oracle on the test set. The oracles can be quite different, corresponding to different types of aggregations,
i.e., model selection aggregation, convex aggregation and linear aggregation, following Nemirovski [4]. In the
model selection aggregation, the oracle is the best one among fi on the test set. In the convex aggregation,
the oracle is the best one among all convex combinations of fi , and similarly for the linear aggregation.
However, the problem is that, most ensemble methods only use a linear combination of the candidates as
the final aggregate, which is subject to the drawbacks of linear methods.
So the optimal decision rule is a conditional distribution, rather than some linear or convex combination of
the predictors. Suppose that we discretize the score of each predictor into m possible values. Then, we can
treat each of the mk possible values of f1 , f2 , ..., fk as a bin and count the number of 0 and 1 labels in each
bin. We can then use these empirical frequencies for the decision rule. However, this approach is impractical.
As k grows larger, the number of bins increases exponentially, so the bins will become increasingly sparse
(in practice, we only have some finite amount of data). As a result, the empirical frequencies we obtain from
the bins will rapidly become unreliable as we increase k.
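A quick numerical sanity check (ours) of why the binning approach breaks down: with m discretization levels and k predictors there are m^k bins, and for moderate k almost every bin receives no data at all.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 10_000, 10            # n labeled examples, m discretization levels per score

for k in (2, 4, 6, 8):       # number of predictors being aggregated
    # Discretized scores of k predictors on n examples (uniform, purely illustrative).
    scores = rng.integers(0, m, size=(n, k))
    occupied = len({tuple(row) for row in scores})
    print(f"k={k}: {m**k:>12,} bins, {occupied:>6,} non-empty "
          f"({occupied / m**k:.2%} of bins see any data)")
```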
Instead, we notice that it would be beneficial to assume that our classifier is monotonic. Suppose you
train an SVM. The classifier will be of the form $f(x) = w^T x + b$. If $f(x_1) > f(x_2)$, then it seems
reasonable to assume that $P(y = 1 \mid f(x_1)) > P(y = 1 \mid f(x_2))$. In particular, $x_1$ results in a larger margin, so we
believe that it is more likely to have a label of 1 than $x_2$ is. Therefore, we assume: if $f(x_1) > f(x_2)$, then
$P(y = 1 \mid f(x_1)) > P(y = 1 \mid f(x_2))$. This is an example of a monotonicity assumption.
Next we’ll see why the monotonicity assumption simplifies the problem of conditional density estimation.
First, we can cast the 1-D conditional density estimation problem as the following integral equation:
$$\int_{-\infty}^{x} f(t)\,dt = F(x)$$
Suppose we perturb the density to $f(t) + \sin(\omega t)$ and let ω → ∞. Then the error term above approaches 0
(the integral of $\sin(\omega t)$ over any bounded interval is of order $1/\omega$), so the modified integral
approaches F(x). But as ω → ∞, the density defined by $f(t) + \sin(\omega t)$ oscillates infinitely fast and
becomes very different from f(t), which is the density we actually wanted to estimate. This means that
there exist densities which are very different from the actual density but nonetheless result in a CDF that is
very close to the target CDF. Our conclusion is that density estimation is difficult when we don’t make any
assumptions about the density.
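To spell out why the error term vanishes (our computation, assuming for simplicity that the density is supported on [0, 1]): the perturbation of the CDF caused by adding $\sin(\omega t)$ to the density satisfies
$$\left|\int_0^x \bigl(f(t) + \sin(\omega t)\bigr)\,dt - F(x)\right| = \left|\int_0^x \sin(\omega t)\, dt\right| = \frac{|1 - \cos(\omega x)|}{\omega} \le \frac{2}{\omega} \longrightarrow 0 \quad \text{as } \omega \to \infty.$$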
However, if we assume that the conditional density is monotonic, then it can’t have an oscillating com-
ponent as in the example above. This assumption makes the conditional density estimation problem much
easier to solve.
The synergy method has proved to be successful in some experiments. For example, Vapnik showed in
his recent talk “intelligent methods of learning” that for aggregating kernels with different parameters, the
performance of the aggregate on the test set is even better than that obtained by the best kernel parameter.
Some questions one could address for this project topic include: does the synergy mechanism actually
perform well when compared to traditional ensemble methods? For which types of predictors (SVM, neural
networks, etc.) does the monotonicity assumption hold? For which types of predictors does the synergy
mechanism work well?
References
[1] MacLean, Leonard C., Edward O. Thorp, and William T. Ziemba. “Good and bad properties of the Kelly criterion.” Risk 20, no. 2 (2010): 1.
[2] Huber, Peter J. Data Analysis: What Can Be Learned from the Past 50 Years. Vol. 874. John Wiley & Sons, 2012.
[3] Vapnik, Vladimir, and Rauf Izmailov. “Learning Using Privileged Information: Similarity Control and Knowledge Transfer.” Journal of Machine Learning Research 16 (2015): 2023-2049.
[4] Nemirovski, Arkadi. “Topics in non-parametric statistics.” École d'Été de Probabilités de Saint-Flour 28 (2000): 85.
EE378A Statistical Signal Processing Lecture 2 - 03/31/2016
In this lecture1 , we will introduce some of the basic concepts of statistical decision theory, which will play
crucial roles throughout the course.
1 Motivation
The process of inductive inference consists of
1. Observing a phenomenon
2. Constructing a model of that phenomenon
3. Making predictions using this model.
The main idea behind learning algorithms is to generalize past observations to make future predictions. Of
course, given a set of observations, one can always find a function to exactly fit the observed data, or to find
a probability measure that generates the observed data with probability one. However, without placing any
restrictions on how the future is related to the past, the No Free Lunch theorem essentially says that it is
impossible to generalize to new data. In other words, data alone cannot replace knowledge. Hence, a central
theme of this course is
Generalization = Data + Knowledge.
We can view statistical decision theory and statistical learning theory as different ways of incorporating
knowledge into a problem in order to ensure generalization.
2 Decision Theory
2.1 Basic Setup
The basic setup in statistical decision theory is as follows: We have an outcome space X and a class of
probability measures {Pθ : θ ∈ Θ}, and observations X ∼ Pθ , X ∈ X . An example would be the Gaussian
measure, defined in the above framework as follows:
$$\frac{dP_\theta}{dx} = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}; \qquad \theta = (\mu, \sigma^2).$$
Our goal as a statistician is to estimate g(θ), where g is an arbitrary function that is known to us and θ is
unknown. Thus, observe that our goal is more general than simply estimating θ, and the generality of our
goal is motivated by the fact that the parametrization θ can be arbitrary.
A useful perspective to view the goal of estimating g(θ) is as a game between Nature and the statistician,
where Nature chooses θ and the statistician chooses a decision rule δ(X). Note that a general decision rule
may be randomized, i.e., for any realization of X = x, δ(x) produces an action a ∈ A following the probability
1 Some readings:
1. Olivier Bousquet, Stéphane Boucheron, Gábor Lugosi (2004) “Introduction to Statistical Learning Theory”.
2. Lawrence D. Brown (2000) “An Essay on Statistical Decision Theory”.
3. Abraham Wald (1949) “Statistical Decision Functions”.
distribution D(da|x). Intuitively, after observing x, the decision rule chooses the action a ∈ A randomly
from the distribution D(da|x). As a special case, for a deterministic decision rule δ(x), D(da|x) is just a
one-point probability mass at δ(x).
Moreover, we have a loss function L(θ, a) : Θ × A → R to quantify the loss between θ and any action
a ∈ A, such that (usually we assume that g(θ) belongs to the action space A)
L(θ, a) ≥ 0, ∀a ∈ A, θ ∈ Θ (1)
L(θ, g(θ)) = 0, ∀θ ∈ Θ. (2)
Immediately, we see that our loss function L(θ, δ(X)) is inherently a random quantity. Note that we
may have two sources of randomness: the observation X is inherently random, and for a fixed X the action
produced by the decision rule δ(X) ∼ D(da|X) may also be random.
Instead, we will define a risk function R(θ, δ) in terms of our loss function, which will instead be deter-
ministic.
Having ruled out inadmissible decision rules, one might seek to find the best rule, namely the rule that
uniformly minimizes the risk. However, in most cases of interest, there is no such uniformly best decision rule.
For instance, if Nature chooses θ0 , then the trivial procedure $\delta(X) \triangleq g(\theta_0)$ achieves the minimum possible
risk $R(\theta_0, \delta(X)) = 0$. Thus, in most cases this δ is admissible. However, such a decision rule is intuitively
unreasonable and, further, can have arbitrarily high risk for $\theta \ne \theta_0$.
There are two main approaches to restrict the order of the game to avoid this triviality, namely, we can
restrict the statistician, or we can assume more knowledge about Nature.
1. Take D0 to be the set of unbiased decisions, namely, the decisions δ(X) such that Eθ [δ(X)] = g(θ),
∀θ ∈ Θ.
2. Equivariance (which we will not discuss as much in class): As an example, if $P_\theta$ has density $f(x - \theta)$, where
$\theta \in \mathbb{R}$, $f(x) \triangleq \frac{1}{\sqrt{2\pi}} e^{-x^2/2}$, and our observations $X_1, \ldots, X_n \stackrel{\text{i.i.d.}}{\sim} P_\theta$, then we will choose our decision
function δ to be such that $\delta(X_1 + c, X_2 + c, \ldots, X_n + c) = \delta(X_1, \ldots, X_n) + c$, where c is a constant.
The reason for why we might want to choose such a decision function δ is because, for equivariant
estimators, the variance of the estimator does not depend on θ.
1. Bayes: We assume that θ is drawn from a probability density, namely, θ ∼ λ(θ). For instance, for the
problem of disease detection, our prior λ(θ) could be the population disease density.
which is a well-posed optimization problem only over δ (and can be solved in many cases, though it
can be computationally intractable to compute).
3 Data Reduction
Not all data is relevant to a particular decision problem. Indeed, the irrelevant data can be discarded and
replaced with some statistic T (X n ) of the data without hurting performance. We make this precise below.
3.1 Sufficient Statistics
Definition 3 (Markov chain). Random variables X, Y, Z are said to form a Markov Chain if X and Z are
conditionally independent given Y . In particular, the joint distribution can be written as
Definition 4 (Sufficiency). 2 A statistic T (X) is “sufficient” for the model P = {Pθ : θ ∈ Θ} if and only if
the following Markov chains hold for any distribution on θ:
1. θ − T (X) − X
2. θ − X − T (X)
One useful interpretation of the first condition is, if we know T (X), we can generate X without knowing
θ (because X and θ are independent conditioned on T (X)). The second condition usually trivially holds,
since in most cases T (X) is a deterministic function of X. However, a subtle but important point about
the second condition is that the statistic T (X) could be a random function of X. In this case, the second
condition implies that given X, T (X) does not depend on the randomness of θ.
2 Note that this definition of Sufficiency is slightly different from the conventional frequentist definition, and is along the lines
of the Bayes definition. These two definitions are equivalent under mild conditions. Cf.[1]
3.3 Rao-Blackwell
In the decision theory framework, sufficient statistics provide a reduction of the data without loss of infor-
mation. In particular, any risk that can be achieved using a decision rule based on X can also be achieved
by a decision rule based on T (X), as the following theorem makes precise.
Theorem 3. Suppose X ∼ Pθ ∈ P and T is sufficient for P. For all decision rules δ(X) achieving risk
R(θ, δ(X)), there exists a decision rule δ 0 (T (X)) that achieves risk
Sketch of Proof Given T(X), we can sample a new dataset X′ from the conditional distribution
p(X | T(X)). By sufficiency (the first condition in Definition 4), the conditional distribution p(X | T(X))
doesn’t depend on θ, and hence we can sample X′ without knowing θ. We then define a randomized procedure
$$\delta'(T(X)) \triangleq \delta(X') \stackrel{d}{=} \delta(X).$$
Since δ(X) and δ 0 (T (X)) have the same distribution, they also have the same risk function.
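A small simulation (ours) of this proof idea for the Bernoulli model with T(X) = ΣX_i: given T = t, a fresh dataset X′ can be generated by placing t ones uniformly at random among the n positions, without knowing θ, and any δ(X′) then has the same distribution, hence the same risk, as δ(X).

```python
import numpy as np

rng = np.random.default_rng(0)
n, theta, trials = 10, 0.3, 50_000

def delta(x):
    return x[0] * x[1]            # a (crude) unbiased estimator of theta**2

risk_orig, risk_regen = [], []
for _ in range(trials):
    x = (rng.random(n) < theta).astype(int)
    t = x.sum()
    # Regenerate X' from p(x | T = t): put t ones in uniformly random positions.
    x_new = np.zeros(n, dtype=int)
    x_new[rng.choice(n, size=t, replace=False)] = 1
    risk_orig.append((delta(x) - theta**2) ** 2)
    risk_regen.append((delta(x_new) - theta**2) ** 2)

print(np.mean(risk_orig), np.mean(risk_regen))   # the two risks agree up to noise
```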
It is rarely necessary to regenerate a dataset from sufficient statistics. Rather, in the case of convex
losses, it is possible to obtain a non-randomized decision rule that matches or improves the performance of
the original rule using sufficient statistics alone.
Definition 7 (Strict convexity). A function f : C → R is strictly convex if C is a convex set and
References
[1] Blackwell, D., and R. V. Ramamoorthi. A Bayes but not classically sufficient statistic, The Annals of
Statistics 10, no. 3 (1982): 1025-1026.
EE378A Statistical Signal Processing Lecture 3 - 04/05/2016
Lecture 3: Completeness
Lecturer: Jiantao Jiao, Tsachy Weissman Scribe: Xin Zheng and Chenyue Meng
In this lecture, we first review the Rao-Blackwell theorem, along with its proof and complications. Then
we introduce an important concept called completeness, which has applications such as Basu’s theorem.
In the end, exponential family will be introduced for finding complete and sufficient statistics for common
distributions.
1 Rao-Blackwell
Theorem 1 (Rao-Blackwell). Assume L(θ, d) is strictly convex in d and that there exists a decision rule δ(X) such that R(θ, δ(X)) < ∞. Then
$$R(\theta, E[\delta(X)\mid T]) < R(\theta, \delta(X)),$$
unless δ(X) = E[δ(X)|T] with probability 1, where T is sufficient for P = {Pθ , θ ∈ Θ}.
Proof To prove Rao-Blackwell theorem, we first introduce Jensen’s inequality
Theorem 2 (Jensen’s inequality). Let f be a convex function, and let X be a random variable. Then
$$E[f(X)] \ge f(E[X]).$$
Moreover, if f is strictly convex, then E[f(X)] = f(EX) holds true if and only if X = EX with probability
1 (i.e., if X is a constant a.s.).
Given that L(θ, d) is strictly convex in d, we have
$$E[L(\theta, \delta(X)) \mid T = t] \ge L(\theta, E[\delta(X) \mid T = t])$$
if we let X have the conditional distribution $P_{X|T=t}$ and apply Jensen’s inequality. The strict convexity of
L(θ, d) in d shows that the inequality is strict unless $P_{\delta(X)|T=t}$ is a point mass (i.e., a constant almost surely)
To finish the proof, we take expectations on both sides after replacing the fixed t with the random T ,
Note: Here, the estimator E[δ(X)|T = t] in Rao-Blackwell theorem only depends on t since T is sufficient
for P = {Pθ , θ ∈ Θ}. The convexity of L(θ, d) implies that it suffices to consider deterministic decision rules.
Complications:
Despite the power of the Rao-Blackwell theorem, its drawbacks and impracticalities are as follows:
A. Rao-Blackwell focuses on improvement, not optimality. Given a decision rule δ(x) and a sufficient
statistic T, we can use Rao-Blackwell to compute a new decision rule E[δ(x)|T], which for a convex loss
function is guaranteed not to increase the risk (and typically reduces it). However, this new decision rule
might still be far from optimal.
B. It may be hard to compute the conditional expectation.
2 Completeness
In lecture 2, we introduced the concept of a statistic being sufficient with respect to a statistical model. In
general, a statistic T is sufficient for P = {Pθ , θ ∈ Θ} if the sample from Pθ carries no information about θ
beyond T. What characterizes the situations in which sufficiency leads to a substantial reduction of
the data?
Definition 1 (Ancillary). A statistic V(X) is ancillary if $P_{V(X)|\theta} = P_{V(X)}$ for all θ ∈ Θ. Moreover, V(X) is first-order
ancillary if $E_\theta[V(X)] = \text{const}$.
Note: A statistic being ancillary intuitively means that it contains no information about θ.
Example 3. Assume $X_1, X_2, \cdots, X_n \stackrel{\text{i.i.d.}}{\sim} N(\mu, \sigma^2)$, where µ is unknown and σ² is known, and we set θ = µ
as the parameter for the distribution family P = {Pθ , θ ∈ Θ}.
Then let’s denote the statistics X̄ and S² as
$$\bar X = \frac{\sum_i X_i}{n}, \qquad S^2 = \sum_i (X_i - \bar X)^2.$$
Now we want to prove that S² is ancillary for µ. Write $X_i \stackrel{d}{=} Z_i + \mu$, so that $(X_1, X_2, \cdots, X_n) \stackrel{d}{=} (Z_1, Z_2, \cdots, Z_n) + \mu$
with $Z_1, Z_2, \cdots, Z_n \stackrel{\text{i.i.d.}}{\sim} N(0, \sigma^2)$. Then $S^2 = \sum_i (X_i - \bar X)^2$ is equal to $\sum_i (Z_i - \bar Z)^2$ in distribution,
whose distribution does not depend on µ, i.e. θ. Hence S² is ancillary for µ.
Definition 2 (Completeness). The statistic T is complete if the following is true:
Eθ [f (T )] = 0, ∀θ ∈ Θ ⇒ f (T ) = 0 w.p.1 (7)
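As a standard illustration (not worked out in the notes): for $X_1, \ldots, X_n \stackrel{\text{i.i.d.}}{\sim} \mathrm{Bernoulli}(\theta)$, the statistic $T = \sum_i X_i$ is complete, since
$$E_\theta[f(T)] = \sum_{t=0}^{n} f(t) \binom{n}{t} \theta^t (1-\theta)^{n-t} = (1-\theta)^n \sum_{t=0}^{n} f(t) \binom{n}{t} \left(\frac{\theta}{1-\theta}\right)^t,$$
and a polynomial in θ/(1 − θ) that vanishes for every θ ∈ (0, 1) must have all coefficients equal to zero, i.e. f(t) = 0 for every t.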
3 Basu’s Theorem
An important application of completeness is Basu’s theorem.
Theorem 5 (Basu’s theorem). If T is a complete and sufficient statistic for P = {Pθ , θ ∈ Θ}, then any
ancillary statistic V is independent of T, i.e. $V \perp\!\!\!\perp T$.
Proof Given any ancillary statistic V, we can define $p_A \triangleq P_\theta(V \in A)$, which is independent of θ for any
given A. If T is sufficient, then $\eta_A(t) \triangleq P_\theta(V \in A \mid T = t)$ does not depend on θ. Taking the expectation of
$\eta_A(T)$, we have
$$E_\theta[\eta_A(T)] = P_\theta(V \in A) = p_A.$$
Hence, we have found a function $\eta_A(T) - p_A$ whose expectation is zero for all parameters. It
follows from the definition of completeness that $\eta_A(T) = p_A$ with probability 1, i.e., V is independent of T.
Example 6. Assume $X_1, X_2, \cdots, X_n \stackrel{\text{i.i.d.}}{\sim} N(\mu, \sigma^2)$, and θ = (µ, σ²). Suppose we have proved that for known
σ², the statistic X̄ is both complete and sufficient for µ (which will be proved at the end of this lecture); then,
applying Basu’s theorem, we conclude that $\bar X \perp\!\!\!\perp S^2$, where $S^2 = \sum_i (X_i - \bar X)^2$. Actually, this also implies
that $\bar X \perp\!\!\!\perp S^2$ holds true even in the case where both µ and σ² are unknown, since σ² is arbitrary.
4 Exponential Family
If a distribution can be written in exponential family form, then it is much easier to find statistics that are
both complete and sufficient.
An exponential family has the general form
$$p_\theta(x) = e^{\left[\sum_{i=1}^{s} \eta_i(\theta) T_i(x)\right] - B(\theta)}\, h(x),$$
where $p_\theta(x)$ is the density/mass function and B(θ) is the normalization term.
According to the factorization theorem, $[T_1, T_2, \cdots, T_s]$ is sufficient.
Definition 3 (Full rank). An exponential family is called full-rank if and only if:
1. neither the $T_i$ nor the $\eta_i$ satisfy a linear constraint, i.e., if $\sum_{i=1}^{s} a_i T_i = a_0$ for some constants $a_0, a_1, \cdots, a_s$,
then $a_0 = a_1 = \cdots = a_s = 0$;
2. the natural parameter space (defined below) contains an s-dimensional rectangle, i.e., has a non-empty
interior.
Definition 4 (Natural parameter space). The natural parameter space of an exponential family is $\{(\eta_1, \cdots, \eta_s) : \int e^{\sum_i \eta_i T_i(x)}\, h(x)\, d\mu(x) < \infty\}$, where µ is the dominating measure for the distribution family P = {Pθ , θ ∈ Θ}.
Theorem 7 (Completeness in exponential family). If the statistical model constitutes an exponential family
which has full rank, then T = [T1 , T2 , · · · , Ts ] is complete and sufficient.
Example 8. Assume $X_1, X_2, \cdots, X_n \stackrel{\text{i.i.d.}}{\sim} N(\mu, \sigma^2)$ with known σ² and θ = µ; then $P_\theta(X_1, X_2, \cdots, X_n)$ can
be written as
$$P_\theta(X_1, X_2, \cdots, X_n) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(X_i - \mu)^2}{2\sigma^2}\right) \qquad (8)$$
$$= \frac{1}{\sqrt{(2\pi)^n}\,\sigma^n} \exp\left(-\frac{\sum X_i^2}{2\sigma^2} - \frac{n\mu^2}{2\sigma^2} + \frac{\mu \sum X_i}{\sigma^2}\right). \qquad (9)$$
It belongs to the exponential family with $h(X) = \frac{1}{\sqrt{(2\pi)^n}\,\sigma^n} \exp\left(-\frac{\sum X_i^2}{2\sigma^2}\right)$, $B(\theta) = \frac{n\mu^2}{2\sigma^2}$ and $T(X) = \sum X_i$.
Then Theorem 7 implies that T(X) is complete and sufficient for µ.
EE378A Statistical Signal Processing Lecture 4 - 04/07/2016
In this lecture, we will introduce unbiased estimators, the concepts of uniform minimum risk unbiased
(UMRU) and uniform minimum variance unbiased (UMVU) estimators, and the Lehmann-Scheffe Theorem.
1 Logistics
Homework and grading
• No midterm or final, and grading on homeworks will be lenient.
• The emphasis is on learning—feel free to collaborate/discuss on Piazza, or come to office hours for
explanations from course staff.
Technical conditions
• Although these are necessary for rigorous proofs, the goal is to understand the main ideas so that we
will be able to solve real problems using these ideas in the future.
• From this point on, we’ll try to avoid spending time on technical conditions whenever possible.
2 Unbiased Estimation
Definition 1 (Unbiasedness). A decision rule δ(x) is unbiased if Eθ [δ(x)] = g(θ) ∀θ ∈ Θ.
Note: We will argue later in the course that unbiasedness may not be the desired property in practice,
but for now our goal is to find the best unbiased estimator.
But this is impossible, since $E_p[\delta(X)] = \sum_{k=0}^{n} \delta(k) \binom{n}{k} p^k (1-p)^{n-k}$ is a polynomial in p of degree ≤ n, while p ln p is not
a polynomial. In general, if we are trying to estimate any quantity that can’t be written as a polynomial in p of
degree no more than n, then an unbiased estimator does not exist.
As a result, when working with squared error loss, we can always decompose our risk into a variance
component and a squared bias component. Since the bias is zero when we are restricted to only use unbiased
estimators, all of the risk comes from the variance (i.e., risk = variance) and the UMRU estimator is also
the UMVU estimator.
3 Lehmann-Scheffe Theorem
Theorem 1 (Lehmann-Scheffe). Assume T is complete and sufficient, and that h(T) is an unbiased estimator
for g(θ). Then h(T) is
(1) Suppose there exists some $\tilde h \ne h$ such that $\tilde h(T)$ is also an unbiased estimator for g(θ). Then
(2) For any unbiased δ(x), Rao-Blackwell gives us η(T) = E[δ(x)|T], which is also unbiased (Rao-Blackwellization
preserves unbiasedness) and is at least as good as δ(x) in terms of risk. It
follows from part (1) that η(T) is the only function of T that is an unbiased estimator for g(θ), which
implies that η(T) = h(T) with probability one. Hence, we have proved that h(T) is at least as good
as any unbiased estimator, which by definition shows h(T) is UMRU.
(3) Uniqueness follows from the strict convexity of the loss function.
(4) The result follows from the strict convexity of the squared error loss.
4 Examples
4.1 Strategies for finding UMRU estimators
A. Rao-Blackwellization: Start with any unbiased δ(x) and compute E[δ(x)|T ].
• Works in theory, but in practice computing the conditional expectation can be difficult.
• Wait... what?
• Although this sounds unlikely to work, we will see that in practice this can often be more reason-
able than trying to work through the computations necessary for strategies A and B.
4.2 Example: Bernoulli distribution
Suppose that $X_1, X_2, \ldots, X_n \stackrel{\text{iid}}{\sim} \mathrm{Bernoulli}(p)$, where Z ∼ Bernoulli(p) means that P(Z = 1) = p and P(Z = 0) = 1 − p. In order to apply any of strategies A, B, or C from above, we first need T, so let’s start by finding
the complete and sufficient statistic. This is easy if we can write the model as an exponential family distribution:
$$p(x^n) = p^{\sum X_i}\, (1-p)^{n - \sum X_i} = e^{(\sum X_i)\ln p + (n - \sum X_i)\ln(1-p)} = e^{(\sum X_i)\ln\frac{p}{1-p} + n\ln(1-p)}$$
We can see that $T = \sum X_i$ is a complete sufficient statistic. Suppose that we want to estimate p; then
$E[\hat p] = p$ implies that $\hat p = n^{-1}\sum_{i=1}^{n} X_i$ is the UMRU estimator for p. However, what if we want to estimate
$p^2$? Let’s try using recipe A (Rao-Blackwellization). First we need to find a simple, unbiased estimator, say
$\delta(x) = X_1 \cdot X_2$:
$$E[\delta(x)] = E[X_1] \cdot E[X_2] = p^2.$$
Note that δ(x) ∈ {0, 1}. We have
$$E[\delta(x) \mid T = t] = P\Big(X_1 = 1, X_2 = 1 \,\Big|\, \sum X_i = t\Big) = \frac{P(X_1 = 1, X_2 = 1, \sum X_i = t)}{P(\sum X_i = t)} = \frac{P(X_1 = 1, X_2 = 1, \sum_{i=3}^{n} X_i = t - 2)}{P(\sum_{i=1}^{n} X_i = t)}$$
$$= \frac{p \cdot p\, \binom{n-2}{t-2} p^{t-2} (1-p)^{(n-2)-(t-2)}\, \mathbf{1}(t \ge 2)}{\binom{n}{t} p^t (1-p)^{n-t}} = \frac{t(t-1)\,\mathbf{1}(t \ge 2)}{n(n-1)} = \frac{t(t-1)}{n(n-1)}$$
Therefore, our UMRU estimator for $p^2$ is $h(T) = \frac{T(T-1)}{n(n-1)}$. As expected, the conditional expectation does
not depend on p, by the definition of sufficiency (it should only depend on T). To summarize, we
started with an unbiased estimator that is not a function of T, so we could not apply Lehmann-Scheffe directly. We first
used Rao-Blackwell to find another unbiased estimator that is a function of T—then all of the implications
of Lehmann-Scheffe apply.
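A quick Monte Carlo check (ours) that the Rao-Blackwellized estimator has the same mean as X1·X2 but a much smaller variance; the sample size and parameter values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, trials = 20, 0.3, 100_000

x = (rng.random((trials, n)) < p).astype(int)
t = x.sum(axis=1)

naive = x[:, 0] * x[:, 1]              # delta(x) = X1 * X2, unbiased for p^2
umru = t * (t - 1) / (n * (n - 1))     # h(T) = T(T-1)/(n(n-1)), Rao-Blackwellized

print("target p^2    :", p ** 2)
print("mean of X1*X2 :", naive.mean(), " variance:", naive.var())
print("mean of h(T)  :", umru.mean(),  " variance:", umru.var())   # same mean, far smaller variance
```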
EE378A Statistical Signal Processing Lecture 5 - 04/12/2016
1 Announcements
• We very strongly encourage you to typeset your homework using LaTeX, but if you insist on handwriting
and believe it is neat enough, you can also do that without asking for approval before the deadline.
After the deadline for homework 1 we will carefully review the writing quality of the handwritten
homework (if there are any), and decide whether they meet the standard of neatness. It is of course
our hope that they are neat enough! If some homework fails our test, we may require the corresponding
student(s) to typeset the rest of their homework.
2 Recap
Last time we discussed unbiased estimation, the Lehmann-Scheffe theorem, UMRU and UMVU estimators.
• In particular, we discussed some strategies for finding the UMRU estimator:
1. Rao-Blackwellization: First, find an unbiased estimator δ(x). Then, find the conditional expecta-
tion E[δ(x)|T ], which will be an estimator that is at least as good as δ(x). By Lehmann-Scheffe,
E[δ(x)|T ] will be the UMRU.
2. Solve equation: Recall that an unbiased estimator has the property that Eθ δ(T ) = g(θ), ∀θ ∈ Θ.
3. Guess δ(T ) which satisfies Eθ δ(T ) = g(θ) by playing with the distribution of T .
Since $e^{-\lambda}$ does not depend on k, we can divide both sides by this term to obtain
$$\sum_{k=0}^{\infty} \delta(k) \frac{\lambda^k}{k!} = e^{(1-a)\lambda}. \qquad (2)$$
The expression on the left-hand side (LHS) is a power series expansion, so we will rewrite the right-hand
side (RHS) as a power series expansion as well, and then match terms in order to find δ(k). Using the power
series expansion for an exponential function, the RHS can be written as:
$$e^{(1-a)\lambda} = \sum_{k=0}^{\infty} \frac{(1-a)^k \lambda^k}{k!}. \qquad (3)$$
We find that $\delta(k) = (1-a)^k$. Since this example concerns estimation of a function of a parameter for a
single random variable X, X is obviously a sufficient statistic. As a result, the estimator $\delta(x) = (1-a)^x$ is
a function of a complete sufficient statistic. By Lehmann-Scheffe, this estimator must be the unique UMVU
estimator.
This appears to be a good estimator, but let’s look more closely at it. Let’s consider two specific cases:
What is going on? It turns out that the estimator, despite being UMVU, is inadmissible under squared
error loss when a > 1. Here’s why: let’s consider the specific case of a = 2 again. When a = 2, $g(\lambda) = e^{-2\lambda} > 0$
and we can propose another estimator $\delta^+(x) = \max\{(-1)^x, 0\}$. Let’s compare the squared error
loss of the estimators δ(x) and δ⁺(x):
It is clear that $L(\theta, \delta^+) \le L(\theta, \delta)$, ∀θ ∈ Θ. Intuitively, we can consider the loss to be the “distance”
between g(λ) (which is a positive number) and the decision δ(x). If δ(x) < 0, then the distance between
g(λ) and δ⁺ is less than that between g(λ) and δ. Thus, δ⁺ dominates δ(x)! So, when a > 1, $\delta(x) = (1-a)^x$
is inadmissible.
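A small numerical comparison (ours) of the two estimators for a = 2, i.e. g(λ) = e^{-2λ}: the truncated estimator never does worse under squared error loss.

```python
import numpy as np

rng = np.random.default_rng(0)
a, trials = 2, 200_000

for lam in (0.5, 1.0, 2.0):
    x = rng.poisson(lam, size=trials)
    g = np.exp(-a * lam)                    # the quantity being estimated, always positive
    delta = (1 - a) ** x                    # UMVU estimator (-1)^x, can be negative
    delta_plus = np.maximum(delta, 0)       # truncated estimator
    print(f"lambda={lam}: risk(delta)={np.mean((delta - g)**2):.4f}, "
          f"risk(delta+)={np.mean((delta_plus - g)**2):.4f}")
```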
Example 2 (UMVU estimators for the variance in the Gaussian model). $X_1, X_2, \ldots, X_n \stackrel{\text{i.i.d.}}{\sim} N(\mu, \sigma^2)$, with
θ = (µ, σ²) unknown. Note that in this case, $(\bar X, S^2)$ is the complete sufficient statistic by the property of
the exponential family, where $\bar X = \sum_{i=1}^{n} X_i / n$ is the sample mean and $S^2 = \sum_{i=1}^{n} (X_i - \bar X)^2$. Let’s find
UMVU estimators for:
1. µ: It is easy to show that the sample average X̄ is an unbiased estimator of µ. As a result, by
Lehmann-Scheffe, δ(x) = X̄ is the UMVU for µ.
2. σ²: Since $E[S^2] = (n-1)\sigma^2$, we can define the estimator $\hat\sigma^2 = \frac{S^2}{n-1}$, which is an unbiased estimate of
σ². Again, Lehmann-Scheffe guarantees that $\hat\sigma^2$ is the UMVU for σ².
$$\delta(X_1, X_2) = \frac{|X_1 - X_2|}{c}. \qquad (6)$$
On applying Rao-Blackwellization on δ(X1 , X2 ) with (X, S 2 ), it is possible to obtain the UMVU
estimator for σ. However, computation of conditional expectation with respect to S 2 is messy.
Guessing the estimator: Let’s now turn to the third strategy to find the UMVU for σ. Recall that
if $X_1, X_2, \ldots, X_n \stackrel{\text{i.i.d.}}{\sim} N(\mu, \sigma^2)$, then we can also represent $(X_1, X_2, \ldots, X_n)$ as $\mu + \sigma(Z_1, \ldots, Z_n)$ where
$Z_1, Z_2, \ldots, Z_n \stackrel{\text{i.i.d.}}{\sim} N(0, 1)$. Then it’s clear that $S^2 \stackrel{d}{=} \sigma^2 \chi^2_{n-1}$, where $\chi^2_k$ is the Chi-squared distribution
with k degrees of freedom. We can take the square root and then the expectation of both sides:
$$E[S] = \sigma\, E\sqrt{\chi^2_{n-1}}. \qquad (7)$$
Note that $E\sqrt{\chi^2_{n-1}}$ is a constant which does not depend on µ or σ², so $\hat\sigma = \frac{S}{E\sqrt{\chi^2_{n-1}}}$ is an unbiased estimator of σ, and by Lehmann-Scheffe it is the UMVU.
Example 3 (Criteria for goodness of estimators). This example is motivated by the different scaling factors
that sometimes appear in estimators, for instance, in MATLAB toolboxes. In particular, in the previous
example we found the UMVU estimate of σ² to be $\frac{S^2}{n-1}$, which has a scaling factor of n − 1. Now, recall the
maximum likelihood estimator (MLE): $\hat\theta_{\text{MLE}} = \arg\max_\theta P_\theta(x)$, which seems like it would come up with a
good estimator for σ². In fact, the MLE is $\hat\sigma^2 = \frac{S^2}{n}$, so it has a scaling factor of n, not n − 1.
As if this were not confusing enough, let us now find a third candidate for σ̂² by optimizing the risk
function over estimators of the form $\frac{S^2}{a}$, where a > 0. We first compute the risk of $\frac{S^2}{a}$ under squared-error
loss. Recall that we can decompose the mean squared error into a variance term and a squared bias term. So we have
$$R\left(\sigma^2, \frac{S^2}{a}\right) = E\left[\left(\frac{S^2}{a} - \sigma^2\right)^2\right] = \mathrm{Var}\left(\frac{S^2}{a}\right) + \left(E\left[\frac{S^2}{a}\right] - \sigma^2\right)^2. \qquad (8)$$
The bias term can be computed as follows:
$$E\left[\frac{S^2}{a}\right] - \sigma^2 = \frac{1}{a} E[S^2] - \sigma^2 = \frac{(n-1)\sigma^2}{a} - \sigma^2. \qquad (9)$$
Note that the bias may not be zero, so the estimator we obtain may not be an unbiased estimator. Now
we compute the variance term, recalling from the previous example that $S^2 \stackrel{d}{=} \sigma^2 \chi^2_{n-1}$ (note that if $Y \sim \chi^2_k$ then the m-th moment of Y is $E[Y^m] = k(k+2)(k+4)\cdots(k+2m-2)$):
$$\mathrm{Var}\left(\frac{S^2}{a}\right) = \frac{\sigma^4}{a^2} \cdot 2(n-1). \qquad (10)$$
Since the bias is a linear function of 1/a and variance is a quadratic function of 1/a, the risk is a quadratic
function of 1/a. So, minimizing the risk involves minimizing a quadratic function of 1/a. The minimization
results in a = n + 1, which gives
$$\hat\sigma^2 = \frac{S^2}{n+1}. \qquad (11)$$
This is yet another expression for σ̂ 2 ! So, which of the three expressions is the best estimator? Under squared
error loss, the σ̂² we just found implies that both the UMVU estimator and the MLE are inadmissible, since the
third estimator is the optimal estimator of the form S²/a, i.e., it minimizes the risk function under squared
error loss. However, we still cannot claim that the third estimator is optimal, since it may be the case that
not restricting the statistician to estimators of the form S 2 /a will result in an even better estimator. This
example highlights the difficulty of finite-sample theory, and the fact that the definition of “goodness” of an
estimator matters a great deal for small n. In contrast, for large n it suffices to use the MLE, since it is
known that the MLE is asymptotically optimal (i.e., attains the best possible asymptotic risk for n → ∞).
Also note that these three estimators are essentially the same as n → ∞.
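A short Monte Carlo comparison (ours) of the three candidates under squared error loss, confirming that among estimators of the form S²/a the choice a = n + 1 gives the smallest risk; the values of n, µ, σ² are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n, mu, sigma2, trials = 10, 1.0, 4.0, 200_000

x = rng.normal(mu, np.sqrt(sigma2), size=(trials, n))
s2 = ((x - x.mean(axis=1, keepdims=True)) ** 2).sum(axis=1)   # S^2 = sum (X_i - Xbar)^2

for a in (n - 1, n, n + 1):      # UMVU, MLE, and the risk-optimal member of S^2/a
    mse = np.mean((s2 / a - sigma2) ** 2)
    print(f"a = {a:>2}:  MSE = {mse:.4f}")
```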
4.1 Hammersley-Chapman-Robbins Inequality
For the density pθ (x) = p(x, θ) ≥ 0, ∀x ∈ X , define the function ψ(x, θ) as:
$$\psi(x, \theta) = \frac{p(x, \theta + \Delta)}{p(x, \theta)} - 1 \qquad (13)$$
where θ, θ + ∆ ∈ Θ, and we assume that g(θ) ≠ g(θ + ∆), where g(θ) is the function which we are estimating.
One reason for choosing this ψ(x, θ) is that Eθ ψ(x, θ) = 0. To see why:
$$E_\theta[\psi(x,\theta)] = \int \left(\frac{p(x, \theta+\Delta)}{p(x,\theta)} - 1\right) p(x,\theta)\, dx \qquad (14)$$
$$= \int \left(p(x, \theta+\Delta) - p(x,\theta)\right) dx \qquad (15)$$
$$= 0 \qquad (16)$$
Upon plugging in the covariance term in (12), we obtain the following result:
$$\mathrm{Var}(\delta(X)) \ge \frac{\left(E_{\theta+\Delta}[\delta] - E_\theta[\delta]\right)^2}{E_\theta\left[\left(\frac{p(x,\theta+\Delta)}{p(x,\theta)} - 1\right)^2\right]}. \qquad (22)$$
If we further assume that the estimator δ(X) is unbiased, i.e., $E_\theta[\delta(X)] = g(\theta)$, ∀θ, we can simplify (22) and
obtain the Hammersley-Chapman-Robbins inequality in its standard form:
$$\mathrm{Var}(\delta(X)) \ge \frac{\left(g(\theta+\Delta) - g(\theta)\right)^2}{E_\theta\left[\left(\frac{p(x,\theta+\Delta)}{p(x,\theta)} - 1\right)^2\right]}. \qquad (23)$$
$$\lim_{\Delta\to 0} \frac{\left(g(\theta+\Delta) - g(\theta)\right)^2}{E_\theta\left[\left(\frac{p(x,\theta+\Delta)}{p(x,\theta)} - 1\right)^2\right]} = \lim_{\Delta\to 0} \frac{\left(\frac{g(\theta+\Delta) - g(\theta)}{\Delta}\right)^2}{E_\theta\left[\left(\frac{p(x,\theta+\Delta)}{p(x,\theta)} - 1\right)^2\right] \frac{1}{\Delta^2}} \qquad (24)$$
$$= \lim_{\Delta\to 0} \frac{\left(\frac{g(\theta+\Delta) - g(\theta)}{\Delta}\right)^2}{E_\theta\left[\left(\frac{p(x,\theta+\Delta) - p(x,\theta)}{p(x,\theta)\,\Delta}\right)^2\right]} \qquad (25)$$
$$= \frac{\left(g'(\theta)\right)^2}{E_\theta\left[\left(\frac{1}{p(x,\theta)} \frac{\partial p(x,\theta)}{\partial\theta}\right)^2\right]} \qquad (26)$$
Thus, the Cramér-Rao Lower Bound in its standard form can be written as:
$$\mathrm{Var}(\delta(X)) \ge \frac{\left(g'(\theta)\right)^2}{\mathrm{Var}\left(\psi_{CR}(x,\theta)\right)} \qquad (27)$$
where $\psi_{CR}(x,\theta) = \frac{1}{p(x,\theta)} \frac{\partial p(x,\theta)}{\partial\theta} = \frac{\partial \ln p(x,\theta)}{\partial\theta}$. In some sense, $\psi_{CR}$ captures the relative change in p(x, θ) as
we change θ, and it is also the partial derivative of the log-likelihood ln p(x, θ) with respect to θ, which is a
very interesting result since we did not start with anything involving the log-likelihood.
The denominator $I(\theta) \triangleq \mathrm{Var}(\psi_{CR}(x,\theta))$ is known as the Fisher information, which we will discuss further
in the next lecture. The Fisher information in some sense captures the speed at which the model changes
with θ. As the Fisher information increases, the model changes more significantly under a local change of θ,
so different values of θ can be better distinguished from each other. Better distinguishability of
different models allows improved estimation (the lower bound on Var(δ) decreases).
EE378A Statistical Signal Processing Lecture 6 - 04/14/2016
In this lecture we will recap the material so far, finish discussing the information inequality and introduce
the Bayes formulation of decision theory.
1 Recap
The first part of this course focuses on statistical decision theory, which provides a systematic framework for
statistical study. In the second half of the course, we will focus on learning theory which is much younger
(developed since the 1970s). We study both because their theoretical foundations are very similar and
also complementing each other, and we can accelerate our understanding of learning theory by leveraging
established decision theory results.
A statistical decision theory problem specifies a family of distributions {Pθ , θ ∈ Θ} and seeks a decision
rule δ(X) to minimize the risk in a certain sense. For the problem to be well posed, we must impose further
restrictions. We studied two options:
A. Restrict the statistician: e.g. requiring that the decision rule be unbiased or equivariant. Though
restricting ourselves to these special types of decision rules may make us miss some other good decision
rules, the study of the optimal decision rules under these restrictions gives us insights about the struc-
ture of the decision problem, and deepens our understandings of important concepts such as sufficiency
and completeness, as well as important theorems such as Rao-Blackwell, Basu, and Lehmann–Scheffe.
B. Assume more knowledge about nature: We admit all decision rules, but seek ‘weaker’ notions of
optimality. In particular we will study
• Bayes risk, where we minimize the expected risk over a prior distribution on θ. This is particularly
powerful because it provides an easy and principled way of incorporating ‘knowledge’ about the
problem into the decision rule (e.g. data from earlier tests about the distribution of outcomes).
• Minimax optimality, where we choose δ to minimize the worst case risk.
2.1.1 Properties of the score function:
1. E[Sθ (X)] = 0.
Proof: By the chain rule and definition of expectation we have
$$E[S_\theta(X)] = E\left[\frac{\partial}{\partial\theta}\log p(X;\theta)\right] = E\left[\frac{\dot p(X;\theta)}{p(X;\theta)}\right] = \int \frac{1}{p(x;\theta)} \frac{\partial}{\partial\theta} p(x;\theta)\; p(x;\theta)\, dx, \qquad (3)$$
and now, assuming the order of integration and differentiation can be exchanged, we obtain:
$$\mathrm{Cov}(S_\theta, \delta) = \frac{\partial}{\partial\theta} \int \delta(x)\, p(x;\theta)\, dx = \frac{\partial}{\partial\theta} E_\theta[\delta(X)] \qquad (7)$$
2.1.2 Generalizations of Sθ
The score function is special because it is the only ψ(x, θ) that makes the corresponding Cauchy–Schwarz
inequality give the correct asymptotic variance in asymptotic theory. We could also consider a more general
class of ψ(x, θ) of the form
$$\psi(x, \theta) = \frac{\frac{\partial^k}{\partial\theta^k} p(x;\theta)}{p(x;\theta)}, \qquad (8)$$
but while these have the zero-mean property, they do not play the same role that the score function does in
asymptotic theory.
2.2.1 Properties of Fisher information
1. I(θ) depends on parameterization (but the information inequality (1) does not)
This is easy to see from the definition. Suppose θ = h(ξ), for a function h differentiable at ξ. Then we
have from the chain rule that
$$I(\xi) = I(\theta)\,[h'(\xi)]^2 \qquad (10)$$
Applying this equation to the information inequality, we see that the h'(ξ) term appears in both the
numerator and denominator, meaning that equation (1) is independent of the choice of parameteriza-
tion.
2. Alternate (easier) definition: $I(\theta) = -E[\dot S_\theta] = -E\left[\frac{\partial^2}{\partial\theta^2} \log p(X;\theta)\right]$.
Proof: Working backwards, we have from the chain rule and quotient rule that
" 2 #
p̈(X; θ)p(X; θ) − ṗ(X; θ)2
∂ ṗ(X; θ) ṗ(X; θ) p̈(X; θ)
−E[Ṡθ ] = −E = −E = E − E
∂θ p(X; θ) p(X; θ)2 p(X; θ) p(X; θ)
(11)
The first term of the last equality is simply the Fisher information, and the second term evaluates to zero
(assuming we can exchange the order of differentiation and integration), since it is the partial derivative
of the integral over a probability distribution (similar calculation as showing that E[Sθ (x)] = 0). Hence
we have
−E[Ṡθ ] = I(θ) (12)
Remark: This definition is particularly convenient when p(x; θ) belongs to the exponential family.
As a sanity check, it is easy to verify that E[Sθ ] = 0. This example is simple enough that we could evaluate
I(θ) directly as the variance of Sθ , but it is still much easier to use the alternate form:
$$I(\theta) = -E[\dot S_\theta] = -\left(-\frac{E_\theta[X]}{\theta^2} - \frac{n - E_\theta[X]}{(1-\theta)^2}\right) = \frac{n}{\theta} + \frac{n}{1-\theta} = \frac{n}{\theta(1-\theta)} \qquad (17)$$
Now we can use the inequality (1) to lower bound the variance of any estimator δ as
$$\mathrm{Var}(\delta) \ge \frac{\left(\frac{\partial}{\partial\theta} E[\delta(X)]\right)^2}{I(\theta)} = \frac{\left(\frac{\partial}{\partial\theta} E[\delta(X)]\right)^2}{n/(\theta(1-\theta))} \qquad (18)$$
In the case when δ is unbiased, we can go further, since $\frac{\partial}{\partial\theta} E[\delta(X)] = 1$:
$$\mathrm{Var}(\delta) \ge \frac{\theta(1-\theta)}{n} \qquad (19)$$
which means that the variance is lower bounded by Ω(n−1 ). But we also know that the variance of the
empirical estimator θ̂ = x/n is θ(1 − θ)/n, which means that θ̂ is the optimal (unbiased) estimator for the
squared loss function. In other words, θ̂ is the UMVU estimator.
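A quick simulation (ours) confirming that the empirical estimator attains the information bound in this Binomial example; the parameter values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n, theta, trials = 50, 0.3, 200_000

x = rng.binomial(n, theta, size=trials)
theta_hat = x / n                          # empirical estimator theta_hat = x/n

print("empirical Var(theta_hat)          :", theta_hat.var())
print("information bound theta(1-theta)/n:", theta * (1 - theta) / n)   # the two agree
```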
for some vector of coefficients a = [a1 , . . . , aS ] not all zero. Using this definition of ψ with the Cauchy-Schwarz
bound gives
$$\mathrm{Var}(\delta) \ge \frac{\left[\mathrm{Cov}\left(\delta, \sum_{i=1}^{S} a_i \psi_i(X;\theta)\right)\right]^2}{\mathrm{Var}\left(\sum_{i=1}^{S} a_i \psi_i(X;\theta)\right)} \qquad (21)$$
Since we want to make the bound as tight as possible, we choose a to maximize the right hand side. This
maximization problem has a closed-form solution (Check out [1, Chapter 2, Exercises 6.2-3]). Define the
vector of covariances element-wise as
[γ]i = Cov(δ, ψi ) (22)
and the Fisher information matrix as
But the existence of λ(θ) that satisfies this equality everywhere restricts p(x; θ). Concretely, it requires
that p(x; θ) follows an exponential family distribution. Even with this restriction, it turns out that the
Cramér-Rao bound is still quite useful, and particularly gives insightful results in the asymptotic analysis.
3 Bayes Formulation of Decision Theory
So far we have studied ways that restricting the statistician leads to well-posed decision theory problems.
The second approach to framing decision theory problems is to assume more knowledge about nature. In
this section, we discuss the Bayes formulation, where we assume that the parameter θ is drawn from a
prior distribution Λ(θ) over the parameter space Θ. We define the average risk as
Definition 3 (Average risk).
$$r(\Lambda, \delta) = \int R(\theta, \delta)\, d\Lambda(\theta) \qquad (26)$$
3.1 Motivation
A practical statistical theory should be able to comprehensively address the following two general questions:
• Achievability: produce approaches to construct practical schemes for general problems
References
[1] Lehmann, Erich Leo, and George Casella. Theory of Point Estimation, Springer Texts in Statistics,
1998.
EE378A Statistical Signal Processing Lecture 7 - 04/19/2016
In this lecture, we will learn about four important theorems in Bayesian framework and start looking at
examples of Bayes estimators.
1 Bayes Setup
Recall in Lecture 2 that the basic setup in statistical decision theory is as follows: an outcome space X , a
class of probability measures {Pθ : θ ∈ Θ}, and observations X ∼ Pθ , X ∈ X . In the Bayes decision theory
framework, we assume in addition that θ is random and follows a distribution Λ(θ). Why do
Bayesian statisticians make this assumption?
• This is a nice way to incorporate prior knowledge into data. Actually not everyone thinks it is a good
idea, because sometimes it is unclear how to impose an appropriate prior on your data.
• This is a general way to generate reasonable schemes. What do we mean by ”reasonable schemes”?
We will look at it later.
2 Terminology
• Average risk:
$$r(\Lambda, \delta) \triangleq \int \underbrace{R(\theta, \delta)}_{\text{risk function}}\, d\Lambda(\theta)$$
with $\int d\Lambda(\theta) = 1$, i.e. Λ(θ) is a probability distribution.
• Bayes solution is the estimator or decision rule which minimizes the average risk.
Theorem 1. Suppose {Pθ , θ ∈ Θ} is given, Λ(θ) is a probability distribution on Θ, L(θ, δ) ≥ 0, and the goal is to estimate
g(θ). Assume
(1) There exists δ0 that achieves finite average risk.
(2) For almost all x, there exists a value δΛ(x) minimizing E[L(θ, δ(X))|X = x].
Then δΛ(X) is a Bayes solution, i.e., it minimizes the average risk.
Proof Let δ be any estimator with finite average risk. It follows from assumption (2) and non-negativity
of loss function L that
E [L(θ, δ(X))|X = x] ≥ E [L(θ, δΛ (X))|X = x] ≥ 0.
Taking expectation of both sides with respect to PX yields
Remark 1. Given X = x, δ(X) is a constant (if δ is a deterministic decision rule), but L(θ, δ(X)) is still
random because θ is random. Hence the conditional expectation E [L(θ, δ(X))|X = x] makes sense in
assumption (2).
Remark 2. The expectations in E[L(θ, δ)] ≥ E[L(θ, δΛ )] are with respect to the joint distribution of
PX,θ = Λ(θ)PX|θ .
Since we have written the average risk r(Λ, δ) as an integral over x, it suffices to minimize the integrand
$\int L(\theta, \delta)\, dP_{\theta|X}$ (which is called the conditional risk). This essentially justifies the above theorem. In the
Bayes setting, we have names for these two distributions:
• Λ(θ) : Prior distribution
• Pθ|X : Posterior distribution
• For absolute error loss L(θ, d) = |d − g(θ)|, δΛ (x) = any median of Pg(θ)|X=x .
• For Hamming loss, L(θ, d) = 1(g(θ) 6= d), δΛ (x) = arg maxy Pg(θ)|X (y), which is the Maximum a
posteriori (MAP) estimator of g(θ).
Remark. Since δΛ (X) is a constant in the conditional risk, all we need to do is find a scalar δ(x) for the
conditional risk at any given x. It is an easy task if the loss function L is either squared error loss, absolute
error loss or Hamming loss.
Proof Let δΛ be the unique Bayes estimator under prior Λ. Suppose for the sake of contradiction that
there exists δ ≠ δΛ such that
R(θ, δ) ≤ R(θ, δΛ ), ∀θ ∈ Θ
Taking expectation w.r.t. Λ(θ) yields
Z Z
r(Λ, δ) = R(θ, δ)dΛ(θ) ≤ R(θ, δΛ )dΛ(θ) = r(Λ, δΛ ).
Since δΛ minimizes the average risk, r(Λ, δ) ≥ r(Λ, δΛ ). Putting these together, we have r(Λ, δ) = r(Λ, δΛ ), i.e. δ
is also a Bayes estimator. This is a contradiction to the uniqueness of Bayes estimator δΛ .
Remark 1. This is super powerful! Even UMVU and MLE may not be admissible.
Remark 2. It is worth noting that an admissible estimator does not necessarily imply that it is “good”.
However inadmissibility does imply that it is “bad”.
This leads us to the question: is every admissible estimator Bayes? Yes, under mild conditions.
Theorem 3. (part 2: complete class theorem) Suppose
(1) Θ is a compact set.
Remark. Part 2 of Theorem 3 tells us that it suffices to search over all possible Bayes decision rules to find
the best estimator.
Example 1. Suppose X ∼ Binomial(n, θ), θ ∈ [0, 1]. Then $\hat\theta = \frac{x}{n}$ is UMVU, the MLE, and admissible. Suppose we in
addition assume that θ follows a Beta distribution
$$\Lambda(\theta) = \Lambda_{a,b}(\theta) = \frac{1}{B(a,b)}\, \theta^{a-1} (1-\theta)^{b-1} \quad \text{for some } a, b \ge 0.$$
This prior is called the “conjugate prior” to the Binomial distribution. It makes the posterior distribution
much easier to compute.
$$P_{a,b}(X, \theta) = \Lambda_{a,b}(\theta)\, P_{X|\theta} = \frac{1}{B(a,b)}\, \theta^{a-1}(1-\theta)^{b-1} \binom{n}{x} \theta^{x}(1-\theta)^{n-x} = C(a,b,X)\, \theta^{x+a-1}(1-\theta)^{n-x+b-1},$$
where C(a, b, X) does not depend on θ. Then
$$P_{a,b}(\theta|X) = \frac{P_{a,b}(X, \theta)}{P_{a,b}(X)} = \frac{C(a,b,X)}{P_{a,b}(X)}\, \theta^{x+a-1}(1-\theta)^{n-x+b-1} = C'(a,b,X)\, \theta^{x+a-1}(1-\theta)^{n-x+b-1}.$$
The denominator $P_{a,b}(X)$ does not depend on θ, so we can absorb it into C(a, b, X) and write $\frac{C(a,b,X)}{P_{a,b}(X)} = C'(a,b,X)$.
We conclude that
$$\theta \mid X \sim \mathrm{Beta}(x + a,\; n - x + b).$$
Using Theorem 2, the Bayes estimator under squared error loss is
$$\hat\theta_{a,b} = E[\theta|X] = \frac{x+a}{(x+a)+(n-x+b)} = \frac{x+a}{a+b+n}.$$
Here we use the basics of the Beta distribution: if Z ∼ Beta(a, b),
$$E[Z] = \frac{a}{a+b}, \qquad \mathrm{Var}(Z) = \frac{ab}{(a+b)^2(a+b+1)}.$$
Let’s look at this result a bit more closely
• $\hat\theta_{a,b}$ is a convex combination of θ̂ (UMVU, MLE) and the prior mean:
$$\hat\theta_{a,b} = \frac{a+b}{a+b+n} \underbrace{\frac{a}{a+b}}_{\text{prior mean}} + \frac{n}{a+b+n} \underbrace{\frac{x}{n}}_{\hat\theta}.$$
• $\hat\theta_{a,b}$ approaches θ̂ as a and b approach 0. We can interpret a as the number of samples of “1” you
have seen before the experiment, and b as the number of samples of “0”.
• $\hat\theta_{a,b}$ is biased, since its first part $\frac{a}{a+b}$ is biased and its second part θ̂ is unbiased (UMVU).
EE378A Statistical Signal Processing Lecture 9 - 04/26/2016
In this lecture, we introduce Hidden Markov Processes and develop efficient methods for estimation in such
processes. Although a similar analysis can be carried over to the continuous case, in this lecture we focus
on a discrete setting where all the random variables take values in a discrete set.
1 Notation
We start by introducing some notations which will be useful in the subsequent analysis. We use upper-case
letters X, Y, Z to denote random variables and lower-case letters x, y, z to denote specific realizations. For
any $m \le n$ and collection of random variables $X_1, X_2, \cdots$, we let
$$X^n = (X_1, X_2, \cdots, X_n), \qquad X_m^n = (X_m, X_{m+1}, \cdots, X_n).$$
Moreover, we abbreviate P(X = x, Y = y, Z = z) as p(x, y, z), and similarly for conditional probabilities.
Now taking $\phi_1(x, y) = p(x \mid y)$ and $\phi_2(y, z) = p(y, z)$ proves this direction.
(Proof of sufficiency) Now suppose the functions φ1, φ2 exist as mentioned in the statement. Then we have:
$$p(x \mid y, z) = \frac{p(x, y, z)}{p(y, z)} = \frac{p(x, y, z)}{\sum_{\tilde x} p(\tilde x, y, z)} = \frac{\phi_1(x, y)\, \phi_2(y, z)}{\phi_2(y, z) \sum_{\tilde x} \phi_1(\tilde x, y)} = \frac{\phi_1(x, y)}{\sum_{\tilde x} \phi_1(\tilde x, y)}.$$
The final equality shows that p(x|y, z) is not a function of z and hence, X − Y − Z follows by definition.
Definition 2 (Hidden Markov Process). The process {(Xn , Yn )}n≥1 is a Hidden Markov Process if
• {Xn }n≥1 is a Markov process, i.e.,
$$\forall\, n \ge 1:\quad p(x^n) = \prod_{t=1}^{n} p(x_t \mid x_{t-1}),$$
The processes {Xn }n≥1 and {Yn }n≥1 are called the state process and the noisy observations, respectively.
In view of the above definition, we can derive the joint distribution of the state and the observation as
follows:
$$p(x^n, y^n) = p(x^n)\, p(y^n \mid x^n) = \prod_{t=1}^{n} p(x_t \mid x_{t-1})\, p(y_t \mid x_t).$$
The main problem in Hidden Markov Models is to compute the posterior probability of the state
at any time, given all the observations up to that time, i.e. $p(x_t \mid y^t)$. The naive approach is to
simply use the definition of conditional probability:
$$p(x_t \mid y^t) = \frac{p(x_t, y^t)}{p(y^t)} = \frac{\sum_{x^{t-1}} p(y^t \mid x_t, x^{t-1})\, p(x_t, x^{t-1})}{\sum_{\tilde x_t} \sum_{x^{t-1}} p(y^t \mid \tilde x_t, x^{t-1})\, p(\tilde x_t, x^{t-1})}.$$
The above approach needs exponentially many computations both for the numerator and the denominator.
To avoid this problem, we develop an efficient method for computing the posterior probabilities using forward
recursion. Before getting to the algorithm, we establish some conditional independence relations.
4 Causal Inference via Forward Recursion
We now derive the forward recursion algorithm as an efficient method to sequentially compute the causal
posterior probabilities.
Note that we have
$$p(x_t \mid y^t) = \frac{p(x_t, y^t)}{\sum_{\tilde x_t} p(\tilde x_t, y^t)} = \frac{p(x_t, y^{t-1})\, p(y_t \mid x_t, y^{t-1})}{\sum_{\tilde x_t} p(\tilde x_t, y^{t-1})\, p(y_t \mid \tilde x_t, y^{t-1})}.$$
Define $\alpha_t(x_t) = p(x_t \mid y^t)$ and $\beta_t(x_t) = p(x_t \mid y^{t-1})$; then the above computation can be summarized as
$$\alpha_t(x_t) = \frac{\beta_t(x_t)\, p(y_t \mid x_t)}{\sum_{\tilde x_t} \beta_t(\tilde x_t)\, p(y_t \mid \tilde x_t)}. \qquad (1)$$
Equations (1) and (2) indicate that αt and βt can be sequentially computed based on each other for t =
1, · · · , n with the initialization β(x1 ) = p(x1 ). Hence, with this simple algorithm, the causal inference can
be done efficiently in terms of both computation and memory. This is called the forward recursion.
Now, let γt (xt ) = p(xt |y n ). Then, the above equation can be summarized as
$$\gamma_t(x_t) = \sum_{x_{t+1}} \gamma_{t+1}(x_{t+1})\, \frac{\alpha_t(x_t)\, p(x_{t+1} \mid x_t)}{\beta_{t+1}(x_{t+1})}. \qquad (3)$$
Equation (3) indicates that γt can be recursively computed based on αt ’s and βt ’s for t = n − 1, n − 2, · · · , 1
with the initialization of γn (xn ) = αn (xn ). This is called the backward recursion.
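A self-contained numpy sketch (ours) of the forward recursion for α_t, β_t and the backward recursion for γ_t on a finite-alphabet HMM; the transition and emission matrices are arbitrary illustrative values.

```python
import numpy as np

def forward_backward(pi, P, B, y):
    """Causal and non-causal state posteriors for a discrete HMM.

    pi : initial distribution p(x_1), shape (S,)
    P  : transition matrix P[i, j] = p(x_{t+1}=j | x_t=i), shape (S, S)
    B  : emission matrix B[i, k] = p(y_t=k | x_t=i), shape (S, K)
    y  : observed sequence of symbols, length n
    Returns alpha[t] = p(x_t | y^t), beta[t] = p(x_t | y^{t-1}), gamma[t] = p(x_t | y^n).
    """
    n, S = len(y), len(pi)
    alpha, beta = np.zeros((n, S)), np.zeros((n, S))
    beta[0] = pi                               # initialization: beta_1(x_1) = p(x_1)
    for t in range(n):
        if t > 0:
            beta[t] = alpha[t - 1] @ P         # beta_{t+1}(x) = sum_x' alpha_t(x') p(x|x')
        unnorm = beta[t] * B[:, y[t]]          # eq. (1): multiply by p(y_t|x_t), then normalize
        alpha[t] = unnorm / unnorm.sum()
    gamma = np.zeros((n, S))
    gamma[-1] = alpha[-1]                      # initialization: gamma_n = alpha_n
    for t in range(n - 2, -1, -1):             # eq. (3), backward recursion
        gamma[t] = (gamma[t + 1] / beta[t + 1]) @ P.T * alpha[t]
    return alpha, beta, gamma

# Toy two-state example with a noisy binary observation channel (illustrative values).
pi = np.array([0.5, 0.5])
P = np.array([[0.9, 0.1], [0.2, 0.8]])
B = np.array([[0.8, 0.2], [0.3, 0.7]])
alpha, beta, gamma = forward_backward(pi, P, B, [0, 0, 1, 1, 0])
print(np.round(gamma, 3))
```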
EE378A Statistical Signal Processing Lecture 10 - 04/28/2016
Last time we talked about forward and backward recursions in hidden Markov processes, where we obtained
both the “causal” posterior probability distribution p(xt |y t ) and the “non-causal” one p(xt |y n ). Now we
treat the hidden Markov process as a communication model, i.e., the transmitter sends $x^n$ and the receiver
receives $y^n$ through a memoryless channel $p(y^n \mid x^n) = \prod_{t=1}^{n} p(y_t \mid x_t)$. With the posterior distribution $p(x_t \mid y^n)$
we can minimize the bit error probability; the MAP estimate of the whole sequence minimizes the block error probability
$P(\hat x^n \ne x^n)$.
In this lecture, we discuss efficient algorithms for computing both p(xn |y n ) and arg maxxn p(xn |y n ).
αt (xt ) = p(xt |y t )
βt (xt ) = p(xt |y t−1 )
γt (xt ) = p(xt |y n ).
2 Joint Posterior
Before computing p(xn |y n ), we begin with a useful fact about the HMP.
Lemma 1. In the HMP, the following Markov chain holds:
Write
$$\ln p(x^n \mid y^n) = \sum_{t=1}^{n} g_t(x_t, x_{t+1})$$
with
$$g_t(x_t, x_{t+1}) \triangleq \begin{cases} \ln \dfrac{\alpha_t(x_t)\, p(x_{t+1} \mid x_t)}{\beta_{t+1}(x_{t+1})} & t = 1, \cdots, n-1 \\[2mm] \ln \gamma_n(x_n) & t = n \end{cases}$$
we can solve the MAP estimator x̂MAP with the help of the following definition:
Definition 2. For 1 ≤ k ≤ n, let
$$M_k(x_k) := \max_{x_{k+1}^n} \sum_{t=k}^{n} g_t(x_t, x_{t+1}).$$
It is straightforward from the definition that M_1(x̂_1^{MAP}) = ln p(x̂^{MAP} | y^n) = max_{x^n} ln p(x^n | y^n), and
M_k(x_k) = max_{x_{k+1}} max_{x_{k+2}^n} ( g_k(x_k, x_{k+1}) + ∑_{t=k+1}^n g_t(x_t, x_{t+1}) )
= max_{x_{k+1}} ( g_k(x_k, x_{k+1}) + max_{x_{k+2}^n} ∑_{t=k+1}^n g_t(x_t, x_{t+1}) )
= max_{x_{k+1}} ( g_k(x_k, x_{k+1}) + M_{k+1}(x_{k+1}) ).
Since M_n(x_n) = g_n(x_n, x_{n+1}) = ln γ_n(x_n) depends only on x_n, we may start from k = n and use the previous recursive formula to obtain M_1(x_1). This is called the Viterbi Algorithm.
1: function Viterbi
2: Mn (xn ) ← ln γn (xn ) . Initialization of log-likelihood
3: x̂MAP (xn ) ← ∅ . Initialization of the MAP estimator
4: for k = n − 1, · · · , 1 do
5: Mk (xk ) ← maxxk+1 (gk (xk , xk+1 ) + Mk+1 (xk+1 )) . Maximum of log-likelihood
6: x̂k+1 (xk ) ← arg maxxk+1 (gk (xk , xk+1 ) + Mk+1 (xk+1 ))
7: x̂MAP (xk ) ← [x̂k+1 (xk ), x̂MAP (x̂k+1 (xk ))] . Maximizing sequence with leading term xk
8: end for
9: M ← maxx1 M1 (x1 ) . Maximum of overall log-likelihood
10: x̂1 ← arg maxx1 M1 (x1 )
11: x̂MAP ← [x̂1 , x̂MAP (x̂1 )] . Overall maximizing sequence
12: end function
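A compact Python sketch of the same dynamic program may be helpful. It works in the log domain and maximizes ln p(x^n, y^n), which has the same maximizer as ln p(x^n | y^n), so it sidesteps the α, β, γ quantities; the model matrices are the same hypothetical ones used in the earlier sketches.

import numpy as np

def viterbi(P, B, pi, y):
    """MAP state sequence: argmax_{x^n} p(x^n | y^n) = argmax_{x^n} p(x^n, y^n)."""
    n, S = len(y), len(pi)
    logP, logB, logpi = np.log(P), np.log(B), np.log(pi)
    M = np.zeros((n, S))                  # M[t, s]: best log joint prob of a path ending in s at time t
    ptr = np.zeros((n, S), dtype=int)     # back-pointers to the best predecessor
    M[0] = logpi + logB[:, y[0]]
    for t in range(1, n):
        scores = M[t - 1][:, None] + logP          # scores[i, j]: come from state i, move to j
        ptr[t] = scores.argmax(axis=0)
        M[t] = scores.max(axis=0) + logB[:, y[t]]
    x_hat = np.zeros(n, dtype=int)
    x_hat[-1] = M[-1].argmax()
    for t in range(n - 1, 0, -1):                  # backtrack the maximizing sequence
        x_hat[t - 1] = ptr[t, x_hat[t]]
    return x_hat

P = np.array([[0.9, 0.1], [0.2, 0.8]])
B = np.array([[0.7, 0.3], [0.1, 0.9]])
pi = np.array([0.5, 0.5])
print(viterbi(P, B, pi, y=[0, 0, 1, 1, 1]))        # e.g. [0 0 1 1 1]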
4 More discussions
4.1 Optimality Criteria
Let us distinguish between the MAP estimator for the joint posterior p(xn |y n ) (i.e., minimizing the block
error probability) and the MAP estimator for the marginal posterior p(xt |y n ) (i.e., minimizing the bit error
probability) via an example. Consider a random vector X n with the following distribution:
X^n = 111···11 with probability 0.1,
X^n = 100···00 with probability (1/n) × 0.9,
X^n = 010···00 with probability (1/n) × 0.9,
X^n = 001···00 with probability (1/n) × 0.9,
···
X^n = 000···01 with probability (1/n) × 0.9.
For n ≥ 10, the maximum likelihood sequence is the all-one sequence, but to get the j-th bit right we should
always guess zero. Hence, if our goal is to guess the whole sequence correctly, we should guess all ones, but
if our goal is to make as few bit errors as possible on average, we should guess all zeros.
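A quick numerical check of this example (with the arbitrary choice n = 10, so the probabilities are exactly 0.1 and 0.09) confirms that the two criteria disagree.

import numpy as np

n = 10
seqs = [np.ones(n, dtype=int)] + [np.eye(n, dtype=int)[j] for j in range(n)]
probs = np.array([0.1] + [0.9 / n] * n)

block_map = seqs[int(probs.argmax())]                     # most probable whole sequence
bit_marginals = sum(p * s for p, s in zip(probs, seqs))   # P(X_j = 1) for each coordinate j
bit_map = (bit_marginals > 0.5).astype(int)               # bitwise MAP guess

print(block_map)        # all ones: 0.1 beats 0.09 for every single-one sequence
print(bit_marginals)    # each coordinate equals 1 with probability 0.1 + 0.09 = 0.19
print(bit_map)          # all zeros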
4.2 General Forward – backward Recursion
Denote by ∆_d = {(p_1, · · · , p_d) ∈ R_+^d : ∑_{i=1}^d p_i = 1} the probability simplex with support size at most d, and by X and Y the alphabets of X and Y, respectively. Then the forward–backward recursion can be written via the mappings
and
as follows:
• Forward recursion:
• Backward recursion:
EE378A Statistical Signal Processing Lecture 11 - 05/03/2016
1 Minimax Framework
Recall that the Bayes estimator minimizes the average risk
r(Λ, δ) = ∫ R(θ, δ) dΛ(θ),
where Λ is the prior. One complaint about the Bayes estimator is that the choice of Λ is often arbitrary and
hard to defend.
An alternative approach is to assume that nature is malicious, and will always pick the worst value of θ
in response to the statistician’s choice of δ. In this case, we instead seek to minimize the maximum risk (or
worst-case risk):
min_δ sup_θ ∫ L(θ, δ) dP_{X|θ}.
Claim 1. If the loss function L(θ, δ) is convex in δ, then finding the minimax rule can be reduced to solving
a convex optimization problem.
Why is this claim true? We note that the average of convex functions is convex, hence ∫ L(θ, δ) dP_{X|θ} is convex in δ. Also, the supremum of convex functions is easily checked to be convex, hence sup_θ ∫ L(θ, δ) dP_{X|θ} is convex in δ. As a result, the problem
min_δ sup_θ ∫ L(θ, δ) dP_{X|θ}
is a convex optimization problem in δ.
2 Minimax Theorem
The first minimax theorem was proved by von Neumann, in the setting of zero-sum games. It states that
every two-person finite-strategy zero-sum game has a mixed-strategy Nash equilibrium. For the purposes of
this class, however, we introduce the generalization by Sion and Kakutani:
Theorem 2 (Sion-Kakutani Minimax Theorem). Let X and Λ be compact convex sets in linear spaces (i.e.,
topological vector spaces). Let H(λ, x) : Λ × X → R be a continuous function such that
• H(λ, ·) is convex for every fixed λ ∈ Λ,
• H(·, x) is concave for any fixed x ∈ X.
Then
1. The strong duality holds:
max_{λ∈Λ} min_{x∈X} H(λ, x) = min_{x∈X} max_{λ∈Λ} H(λ, x).   (1)
2. There exists a “saddle point” (λ^*, x^*), for which H(λ, x^*) ≤ H(λ^*, x^*) ≤ H(λ^*, x) for every λ ∈ Λ and x ∈ X.
There is a game-theoretic way to view weak duality. Two people, named Max and Min, are playing a zero-sum game. Max chooses λ ∈ Λ and Min chooses x ∈ X; then H(λ, x) is the reward function for Max (and −H(λ, x) is the reward function for Min). Weak duality, max_λ min_x H(λ, x) ≤ min_x max_λ H(λ, x), states that committing to a choice first can never help the player who does so, no matter the choice of the function H.
The problem is that in the expression on the RHS, the statistician is allowed to choose δ after nature has
chosen θ. If L(θ, g(θ)) = 0, the statistician may pick the constant predictor δ = g(θ), so the right hand side
is just 0. We have given up too much by applying weak duality.
Proof Since the right hand side is an average of R(θ, δ) over some values of θ and the left hand side is the supremum of R(θ, δ), we have LHS ≥ RHS. For the reverse inequality, given any ε > 0, there exists some θ_0 such that R(θ_0, δ) + ε > sup_θ R(θ, δ). Taking Λ_0 to be the delta distribution with point mass at θ = θ_0, we get
sup_Λ ∫ R(θ, δ) dΛ(θ) ≥ ∫ R(θ, δ) dΛ_0(θ) = R(θ_0, δ) > sup_θ R(θ, δ) − ε.
Taking ε → 0^+, we see that RHS ≥ LHS, so we are done.
Thus, maximizing over θ is the same as maximizing over Λ(θ). We use the minimax theorem:
min_δ sup_θ ∫ L(θ, δ) dP_{X|θ} = min_δ sup_θ R(θ, δ)
= min_δ sup_Λ ∫ R(θ, δ) dΛ(θ)
= sup_Λ min_δ ∫ R(θ, δ) dΛ(θ)
= sup_Λ min_δ ∫∫ L(θ, δ) dP_{X|θ} dΛ(θ).
We have not verified that the hypotheses of the minimax theorem are satisfied. The main idea is that, when
Λ is fixed, then this function is convex in δ (as discussed in the first section when L(θ, ·) is convex), and
when δ is fixed, this function is linear in Λ. We gloss over the details regarding compactness and convexity,
the norms on X and Λ, continuity, etc.
The above result may be succinctly summarized: define r_Λ ≜ min_δ ∫ R(θ, δ) dΛ(θ) to be the risk of the Bayes estimator under the prior θ ∼ Λ(θ). Then the above result is min_δ sup_θ R(θ, δ) = sup_Λ r_Λ.
The left-hand side’s expression may be interpreted as the statistician choosing δ first, then nature picking
θ, which seems very bad for the statistician. However, this is not as pessimistic as it initially seems: in the
right hand side’s expression, nature acts first by choosing a prior Λ, then the statistician picks δ.
An objection to Wald’s minimax formulation was that nature is not malicious, but this way of viewing
the problem reveals that it is not so pessimistic.
3.3 Difficulties
Similar to the fact that the first formulation of the minimax problem is hard to solve, this second form also
has its own difficulties. While it is easy to evaluate rΛ for each choice of Λ, there are too many possible
choices of Λ to evaluate sup_Λ r_Λ. Ultimately, we have traded the problem of being unable to evaluate the objective sup_θ R(θ, δ) for the problem of searching over another, possibly infinite-dimensional, space.
4 Minimax in Action
Unfortunately, the minimax problem is difficult to solve exactly, even when it has been reformulated. Very
few results of exact optimality are known. Fortunately, statisticians have two ways of dealing with this
problem:
• Instead of searching for optimal solutions, we can instead seek potential solutions, which are asymp-
totically optimal. More on this next week.
• We may upper bound and lower bound the minimax risk by constants. Indeed, in the expression min_δ sup_θ R(θ, δ) = sup_Λ r_Λ, the left hand side is a minimum over δ, so for any choice δ_0, the quantity sup_θ R(θ, δ_0) is an upper bound on the minimax risk. Similarly, the right hand side is a supremum over Λ, so for any choice of Λ_0,
the quantity rΛ0 is a lower bound on the minimax risk. If we get upper and lower bounds which are
close, then we are quite satisfied with the result. This is still a difficult task, however.
Definition 4. A prior Λ is called least favorable if, for any prior Λ0 , we have rΛ ≥ rΛ0 .
This definition is completely motivated by the minimax equation. The notion is very formal in the sense
that it’s usually not possible to check whether Λ is least favorable.
Theorem 5. Suppose Λ(θ) is a prior on θ, and δΛ is the Bayes estimator under Λ. Suppose also that
r(Λ, δΛ ) = rΛ = supθ R(θ, δΛ ). Then
1. δΛ is the minimax estimator;
2. If δΛ is the unique Bayes estimator with respect to Λ, then it is the unique minimax estimator;
3. Λ is least favorable.
This theorem is almost tautological. Nevertheless, we prove the first assertion, and leave the rest as
exercises.
Proof Let δ be any decision rule. Then
sup_θ R(θ, δ) ≥ ∫ R(θ, δ) dΛ(θ)
≥ ∫ R(θ, δ_Λ) dΛ(θ)
= r_Λ
= sup_θ R(θ, δ_Λ).
5 Reference
B. Levit (2010). “Minimax Revisited: I, II”. This paper is a survey of several techniques people have used
to bound the minimax risk of the following problem: suppose you have a single observation X ∼ N (µ, σ 2 ),
where σ² is known, and you know that |µ| ≤ 1. You want to estimate µ. The minimax estimator is not known! This gives a feel for the difficulty of finding the exact minimax estimator.
EE378A Statistical Signal Processing Lecture 12 - 5/5/2016
Based on the minimax theorem, the minimax risk can be upper bounded using any decision rule δ_0, and lower bounded using any prior Λ_0:
r_{Λ_0} ≤ min_δ sup_θ R(θ, δ) ≤ sup_θ R(θ, δ_0).
Corollary 3. If a Bayes estimator has constant risk, it is minimax (not necessarily unique).
The preceding theorem and corollary provide us with the tool to find the exact minimax estimator “using
human brain” in some simple examples.
Example 4. In the Binomial model X ∼ B(n, θ), we will find the minimax estimator under the squared error loss. Recall that θ̂(X) = X/n is UMVU (by Lehmann–Scheffé), admissible (by Problem 2(4) in Homework 2), and the MLE. However, we show that it turns out not to be minimax and find the minimax estimator. In fact, the risk function of θ̂ is
R(θ, θ̂) = E_θ(θ̂ − θ)^2 = θ(1 − θ)/n.
We already proved that under the conjugate prior
Λ_{a,b}(θ) = θ^{a−1}(1 − θ)^{b−1} / B(a, b)   (a > 0, b > 0),
the Bayes estimator under squared error loss is the posterior mean θ̂_{a,b}(X) = (X + a)/(a + b + n).
The risk function of this Bayes estimator is
R(θ, θ̂_{a,b}) = E_θ(θ̂_{a,b} − θ)^2 = [ n θ(1 − θ) + (a(1 − θ) − bθ)^2 ] / (a + b + n)^2,
and the choice a = b = √n/2 makes this risk constant in θ:
R(θ, θ̂_{√n/2, √n/2}) = 1 / (4(√n + 1)^2).
Based on Theorem 2, this estimator is the unique minimax estimator for θ under squared error loss, while θ̂(X) = X/n is not minimax. However, a closer inspection will reveal that R(θ, θ̂) ≥ R(θ, θ̂_{√n/2, √n/2}) only holds when θ is really close to 1/2, and these two estimators coincide in the asymptotic setting (i.e., n → ∞).
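A short computation (a sketch; n = 50 is an arbitrary choice) makes the comparison explicit and shows how small the region where X/n loses actually is.

import numpy as np

n = 50
theta = np.linspace(0.0, 1.0, 201)

risk_mle = theta * (1 - theta) / n                                     # R(theta, X/n)
risk_minimax = np.full_like(theta, 1 / (4 * (np.sqrt(n) + 1) ** 2))    # constant risk of the Bayes rule

worse = theta[risk_mle > risk_minimax]      # where the MLE has larger risk
print(risk_minimax[0])                      # constant minimax risk
print(worse.min(), worse.max())             # a shrinking interval around theta = 1/2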
In general, how do we choose which estimator to use? We have many choices, but the only way to choose the right estimator is by looking at and understanding the data and the problem you are trying to solve. In finite-sample theory the choice matters a great deal, as different estimators satisfy different properties, while in asymptotics the difference sometimes disappears, as we will show below.
probability, where C is chosen by looking at Gaussian percentiles. Moreover, the asymptotic variance is the reciprocal of the Fisher information I(θ) = (θ(1 − θ))^{−1}.
Fisher wondered whether in general he could bound the asymptotic variance of asymptotically normal
estimators of θ in this way.
Conjecture 1. For regular statistical experiments with non-singular Fisher information I(θ), Fisher had the following conjectures:
1. for all asymptotically normal estimators θ_n^*, i.e., estimators with √n(θ_n^* − θ) →_d N(0, σ^2(θ, {θ_n^*})), we have
σ^2(θ, {θ_n^*}) ≥ 1/I(θ),   θ ∈ Θ;
2. the MLE attains this bound asymptotically.
The first conjecture is unexpected: the Cramér–Rao lower bound of this form only applies to unbiased estimators in the finite-sample case. Furthermore, we are not restricting ourselves to exponential families as we were before. Fisher's intuition was that any asymptotically normal estimator should be asymptotically unbiased, in which case applying the Cramér–Rao lower bound seems to be fine. Moreover, the Taylor expansion of the MLE seems to suggest that the MLE attains the Fisher information bound asymptotically. Unfortunately, neither of these conjectures is true in general: the second conjecture requires some technical conditions, and the first conjecture is even more problematic as shown by the following example.
Example 5 (Hodges's Example in 1951). Let X_1, · · · , X_n be i.i.d. N(θ, 1). Then X̄ = (1/n) ∑_{i=1}^n X_i is an admissible (by Problem 2(1) in Homework 2), MLE, UMVU (by Lehmann–Scheffé), and minimax estimator (will be shown in Homework 3). Hodges claimed that this estimator can somehow be improved in terms of the asymptotic variance. Note that √n(X̄ − θ) →_d N(0, 1), thus we are seeking a better estimator S_n such that √n(S_n − θ) →_d N(0, σ^2(θ)), where σ^2(θ) ≤ 1 for any θ and σ^2(θ_0) < 1 for some θ_0.
Hodges's estimator is as follows. Let
S_n = X̄ if |X̄| ≥ n^{−1/4}, and S_n = 0 otherwise.
One can check that √n(S_n − θ) →_d N(0, σ^2(θ)) with σ^2(θ) = 1 for θ ≠ 0 and σ^2(0) = 0, so S_n is superefficient at θ = 0.
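A small Monte Carlo sketch (the sample size, θ values, and replication count are arbitrary) illustrates both the superefficiency at θ = 0 and the poor local behaviour near the truncation threshold that is exploited below.

import numpy as np

rng = np.random.default_rng(0)

def hodges_scaled_risk(theta, n, reps=20000):
    """Monte Carlo estimate of n * E_theta (S_n - theta)^2 for Hodges's estimator."""
    xbar = theta + rng.standard_normal(reps) / np.sqrt(n)    # X-bar ~ N(theta, 1/n) exactly
    s = np.where(np.abs(xbar) >= n ** (-0.25), xbar, 0.0)    # Hodges's truncation
    return n * np.mean((s - theta) ** 2)

n = 1000
for theta in [0.0, n ** (-0.5), n ** (-0.25), 0.5]:
    print(theta, hodges_scaled_risk(theta, n))
# roughly 0 at theta = 0 (superefficiency), about 1 at theta = n^{-1/2},
# large (order sqrt(n)) near the truncation threshold n^{-1/4}, and about 1 at theta = 0.5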
3 Modern Theorems
In this section we give a brief overview of the modern, corrected theorems which have replaced Fisher’s
conjecture. These are due to many contributors, including Le Cam, Wolfowitz, Hájek, etc. (see Chapter 8
of the book Asymptotic Statistics by Van der Vaart).
First we state the result corresponding to Fisher’s second conjecture.
Theorem 6 (Asymptotic Normality of MLE). Under mild conditions (cf. Theorem 5.39 and Lemma 8.14
in Van der Vaart’s book), if θ̂n = arg maxθ pθ (x1 , . . . , xn ) then
√n(θ̂_n − θ_0) = (1/√n) ∑_{i=1}^n I(θ_0)^{−1} s_{θ_0}(x_i) + o_{p_{θ_0}}(1),   ∀ θ_0.
In the previous theorem, s_θ(x) is the score function, I(θ) is the Fisher information, and the notation X_n = o_{p_{θ_0}}(1) means that X_n → 0 in P_{θ_0}-probability.
Now we introduce some approaches to correct Fisher's first conjecture. Although the phenomenon of superefficiency may occur, several observations can be drawn from Hodges's example. Firstly, in Hodges's example, superefficiency only occurs at one point θ = 0, which motivates statisticians to ask whether it is generally true that superefficiency can only take place in a “small” set. The answer turns out to be affirmative and is stated in the following theorem.
Theorem 7 (Hájek–Le Cam almost everywhere convolution theorem). Suppose the statistical experiment is regular with non-singular Fisher information I(θ), and let g(θ) be differentiable at every θ. For any sequence of estimators T_n(X_1, . . . , X_n) satisfying √n(T_n − g(θ)) →_d L_θ for all θ ∈ Θ, there exists a probability measure M_θ such that for almost every θ (i.e., for all θ ∈ Θ − N where N is of Lebesgue measure zero),
L_θ = N(0, [g′(θ)]^T I(θ)^{−1} [g′(θ)]) ∗ M_θ.
The ∗ above denotes the usual convolution of probability densities. Because the convolution of probability
densities corresponds to adding independent random variables, we can interpret this theorem as follows: any
limiting estimator is a Gaussian random variable with mean zero and variance I(θ)−1 (when g(θ) = θ) plus
some random independent junk. In particular, the almost everywhere convolution theorem shows that the
superefficiency phenomenon can only occur at a set of Lebesgue measure zero.
The second observation can help us get rid of the “almost everywhere” statement in the previous convolution theorem, and is as follows: although in Hodges's example the random variable √n(S_n − θ) under θ = 0 has a limiting distribution L_0 which is the delta point mass at zero, the quantity √n(S_n − θ) under θ = n^{−1/2} does not have the limiting distribution L_0 as n → ∞ (recall the previous computation of nR(n^{−1/2}, S_n) under the squared error loss). This motivates us to introduce the concept of a regular estimator sequence.
Definition 8 (Regular estimator sequence). An estimator sequence Tn is called regular at θ for estimating
g(θ), if for every h,
√n ( T_n − g(θ + h/√n) ) →_d L_θ   under P^n_{θ+h/√n}.
The probability measure Lθ may be arbitrary but should be the same for every h.
Now, if the regular estimator sequence is considered instead (as a subfamily of all estimator sequences
which may not be regular), we come to the following traditional convolution theorem which replaces “almost
everywhere” with “everywhere”.
Theorem 9 (Hájek–Le Cam convolution theorem). Suppose the statistical experiment is regular with non-singular Fisher information I(θ), and let g(θ) be differentiable at every θ. For any regular estimator sequence T_n(X_1, . . . , X_n) for estimating g(θ) with limiting distribution L_θ for all θ ∈ Θ, there exists a probability measure M_θ such that for every θ,
L_θ = N(0, [g′(θ)]^T I(θ)^{−1} [g′(θ)]) ∗ M_θ.
The third observation from Hodges's example is that although E_θ(√n(S_n − θ))^2 approaches zero at θ = 0, the “local” worst-case risk sup_{θ ∈ [−n^{−1/2}, n^{−1/2}]} E_θ(√n(S_n − θ))^2 does not. Hence, we wonder whether Fisher's first conjecture becomes correct when we take the local worst-case risk, where “local” is characterized by the scaling n^{−1/2}. In fact, for X_1, · · · , X_n i.i.d. ∼ P_θ with density p_θ, the log-likelihood ratio is given by
ln [ dP^n_{θ+h/√n} / dP^n_θ ](X_1, · · · , X_n) = ln ∏_{i=1}^n [ p_{θ+h/√n}(X_i) / p_θ(X_i) ] = ∑_{i=1}^n ln [ p_{θ+h/√n}(X_i) / p_θ(X_i) ]
= (h^T/√n) ∑_{i=1}^n s_θ(X_i) + (1/(2n)) h^T ( ∑_{i=1}^n ∂s_θ(X_i)/∂θ ) h + o_{p_θ}(1)   (Taylor expansion)
= h^T I(θ) Z − (1/2) h^T I(θ) h + o_{p_θ}(1)   (by CLT and LLN),
where sθ (X) is the score function, and Z ∼ N (0, I(θ)−1 ) is a normal random variable. The interesting fact
is that, for Gaussian models we can explicitly compute that
ln [ dN(h, I(θ)^{−1}) / dN(0, I(θ)^{−1}) ](Z) = h^T I(θ) Z − (1/2) h^T I(θ) h,
which is the same as above for any regular statistical models. Mathematically, we can show that the following
two models
( P^n_{θ+h/√n} : h ∈ R^k )   and   ( N(h, I(θ)^{−1}) : h ∈ R^k )
are asymptotically equivalent, i.e., the model distance (defined in Problem 2 of Homework 1) between them
vanishes as n → ∞. Hence, asymptotically it suffices to work with the Gaussian location model (N(h, I(θ)^{−1}) : h ∈ R^k), which is easy to handle and yields the following theorem:
Theorem 10 (Hájek-Le Cam local asymptotic minimax theorem). Suppose the statistical experiment is
regular with non-singular Fisher information I(θ), and let g(θ) be differentiable at θ. For any estimator
sequence Tn and bowl-shaped loss function l, we have
sup_I liminf_{n→∞} sup_{h∈I} E_{θ+h/√n} l( √n ( T_n − g(θ + h/√n) ) ) ≥ E l(Z),
where Z ∼ N(0, [g′(θ)]^T I(θ)^{−1} [g′(θ)]), and the first supremum is taken over all finite subsets I ⊂ R^k.
A non-negative loss function l : Rk → R+ with l(0) = 0 is called bowl-shaped if l(−u) = l(u) for any
u ∈ Rk and {u ∈ Rk : l(u) ≤ t} is a convex set for any t ≥ 0. By taking l(u) = kuk2 and g(θ) = θ, Theorem
10 shows that Fisher’s first conjecture is true in a “local asymptotic minimax” sense.
EE378A Statistical Signal Processing Lecture 13 - 05/10/2016
Recommended reading for this lecture:
1. Vladimir Vapnik: Estimation of Dependencies Based on Empirical Data (first version ’82, second
version ’06) includes a detailed comparison between learning theory and decision theory. The second
version includes an afterword that updates the technical results presented in the first version and
describes a general picture of how the new ideas developed over these years.
2. Vladimir Vapnik: Statistical Learning Theory (’98)
3. Leo Breiman: “Statistical Modeling: The Two Cultures” (’01) challenges mainstream statistical decision
theory in favor of learning theory (a fun read). It includes the comments from top statisticians and a
rejoinder by Breiman.
1. There is a group of philosophers who believe that the results of scientific discovery are the real laws
that exist in nature. These philosophers are called the realists.
2. There is another group of philosophers who believe the laws that are discovered by scientists are just
an instrument to make a good prediction. The discovered laws can be very different from the ones that
exist in Nature. These philosophers are called the instrumentalists.
The two types of approximations defined by classical discriminant analysis (using the generative model of
data) and by statistical learning theory (using the function that explains the data best) reflect the positions of
realists and instrumentalists in our simple model of the philosophy of generalization, the pattern recognition
model. Later in the class we will see that the position of philosophical instrumentalism played a crucial role
in the success that pattern recognition technology has achieved.
4 Learning Theory
Learning theory puts the most crucial modeling assumptions on the decision rule δ. In decision theory, we
have
X_1, . . . , X_n i.i.d. ∼ P_θ.
In learning theory, we also have
X_1, . . . , X_n i.i.d. ∼ P.
However, in decision theory, it is known that without assumptions on P , inference is impossible (called
the “No Free Lunch” theorem in decision theory). For example, the minimax risk
sup_θ R(θ, δ)
is not vanishing if θ indexes over all possible distributions. Vapnik proposed that in various contexts, the
risk is not the “correct” quantity to control.
4.1 Terminology
• Observations (training samples): (X_1, Y_1), (X_2, Y_2), . . . , (X_n, Y_n) i.i.d. ∼ P_XY.
• Loss function L(g(X), Y); examples of loss functions include L(g(X), Y) = (g(X) − Y)^2 and L(g(X), Y) = 1{g(X) ≠ Y}.
• Risk function R(g) = EL(g(X), Y ) where (X, Y ) ∼ PXY . One may view (X, Y ) as a test sample,
which is independent of the training sample and follows the same distribution.
• Bayes risk R^* ≜ inf_g R(g).
• Oracle risk R(g^*) ≜ inf_{g∈G} R(g), where G is a subset of the set of all decision functions; an example is G = {g(x) = a^T x + b : a ∈ R^d, b ∈ R}, i.e., all linear classifiers.
Note: The risk function is defined as an expectation over only X and Y , but not g, even when g may be a
random classifier which depends on the random training data. Therefore we could write the risk explicitly as
R(g) = EXY (L(g(X), Y )|{(Xi , Yi )}n1 ). We will instead implicitly assume that the risk is always conditioned
on the data.
A decision theorist would estimate the two class-conditional densities with Gaussian kernels and place the decision boundary where the two estimates are equal:
∑_{i : y_i = 1} e^{−(‖x−x_i‖/σ)^2} − ∑_{i : y_i = 2} e^{−(‖x−x_i‖/σ)^2} = 0.
A learning theorist would use a support vector machine (SVM) with a Gaussian kernel:
∑_{i : y_i = 1} α_i e^{−(‖x−x_i‖/σ)^2} − ∑_{i : y_i = 2} α_i e^{−(‖x−x_i‖/σ)^2} = 0.
If we use the kernel trick to map the space X to an RKHS (Reproducing Kernel Hilbert Space) Z such that K(x_1, x_2) = e^{−(‖x_1−x_2‖/σ)^2} = ⟨z_1, z_2⟩, then it is clear that the decision theorist will generate a hyperplane passing through the center of mass of the data points, perpendicular to the line between the centers of the two classes of data points. A learning theorist will get a better result with an SVM (search over all possible separating hyperplanes and find the best one).
Figure 1: Classifications given by the classical nonparametric method and the SVM are very different. The picture
is in Z space. The red hyperplane is the output of SVM, and the blue hyperplane is the output of nonparametric
statistical decision theory.
EE378A Statistical Signal Processing Lecture 14 - 05/12/2016
In this lecture, we recap what we have seen about the philosophy of learning theory, and then proceed to
present the framework for binary classification problems and conclude by discussing first results using the
probably approximately correct (PAC) learning framework.
• Analyze regret instead of risk: one is always trying to compete with an oracle with stronger abilities than
the statistician (such as knowing precisely the underlying distribution; knowing that in a parametrized
model the parameter lies in a smaller uncertainty set, etc).
2 Classification Problem
2.1 Problem Set-Up
A classification problem consists of the following components:
• Input space defined by X, binary output space defined by Y = {−1, 1};
iid
• The training data (observations) (Xi , Yi ) ∼ PXY , 1 ≤ i ≤ n and the testing data (X, Y ) ∼ PXY ;
• Goal: construct a function (decision rule, predictor, or classifier) g : X → Y;
2.2 Evaluating Risk
• Bayes risk:
R^* = inf_g R(g).
This value is always non-random because the testing data is independent of the training data, and the infimum-achieving g, which is the Bayes rule, does not depend on the training data.
we claim that t(x) = sgn(x) is the Bayes decision rule which satisfies
• Oracle risk:
R(g^*) = inf_{g∈G} R(g).
This is also not achievable in practice, but tight bounds can be found.
We can decompose the excess risk as R(g_n) − R^* = [R(g_n) − R(g^*)] + [R(g^*) − R^*], where [R(g_n) − R(g^*)] is called the stochastic error or estimation error and [R(g^*) − R^*] the approximation error. Sometimes these are also informally called variance and bias respectively, but this is different from the variance and bias that we talked about in decision theory.
Now there are three types of results we wish to derive:
A: R(g_n) ≤ R_n(g_n) + B(n, G)
B: R(g_n) ≤ R(g^*) + B(n, G)
C: R(g_n) ≤ R^* + B(n, G)
– Type C is the strongest result, but not very commonly used as it is hard to prove unless there are some additional restrictions on P_XY.
We first focus on results of type A. Define Z_i ≜ (X_i, Y_i) as the data. Given the function class G, we define the loss class
F = {(x, y) ↦ 1{g(x) ≠ y} : g ∈ G}.
It can be immediately seen that there is a bijection between the two classes F and G.
Further define P_n f ≜ (1/n) ∑_{i=1}^n f(Z_i) and Pf ≜ E[f(Z)]. We need to bound
R_n(g) − R(g) = (1/n) ∑_{i=1}^n f(Z_i) − E[f(Z_i)] = P_n f − Pf.
By the LLN, of course we know that Rn (g) → R(g) as n → ∞. Here we need a non-asymptotic version:
Theorem 1 (Hoeffding). Let Z_1, · · · , Z_n be n i.i.d. random variables with f(Z) ∈ [a, b]. Then for all ε > 0, we have
P(|P_n f − Pf| ≥ ε) ≤ 2 exp( −2nε² / (b − a)² ).
We focus on the so-called Probably Approximately Correct (PAC) learning framework, which aims at giving the following type of result: with probability at least 1 − δ (probably), R(g_n) is ε-close to R(g^*) (approximately). This is equivalent to an alternate framework, the expectation learning framework, which usually gives an upper bound for the expected excess risk E R(g_n) − R(g^*). The equivalence can be shown as follows: to show that a non-negative random variable X is small, we can show that P(X ≥ t) is small in the PAC learning framework and E[X] is small in the expectation learning framework via the relationship
E[X] = ∫_0^∞ P(X ≥ t) dt.
We denote the RHS of the Hoeffding bound by δ ∈ (0, 1) which can be interpreted as the confidence level.
We can see
P( |P_n f − Pf| ≥ (b − a) √( log(2/δ) / (2n) ) ) ≤ δ.
Figure 1: The true risk R(g) and the empirical risk R_n(g) over the function class g ∈ G. g^* is the minimizer of the true risk that we are looking for; g_n is the classifier obtained from the training data.
In other words, consider the set of samples S for which sup_{g∈G} |R_n(g) − R(g)| ≥ ε. The Hoeffding inequality alone does not give us a handle on the probability P(S). Suppose now we are able to find δ ≥ 0 such that P(S) ≤ δ; then we can state that, with probability at least 1 − δ (i.e., on the complement of S),
R(g_n) ≤ R_n(g_n) + ε,
which gives a result of type A. Hence, the uniform deviation bound is sufficient to bound the gap between training and testing errors, and it is also shown to be necessary in a certain sense.
We thus have a uniform bound on the deviation of the empirical risk from its mean with high probability.
EE378A Statistical Signal Processing Lecture 15 - 05/17/2016
1 Recap
In the previous lecture, we considered the following two types of risks, i.e., the true risk and the empirical
risk:
R(g) = E[ 1(g(X) ≠ Y) ],
R_n(g) = (1/n) ∑_{i=1}^n 1(g(x_i) ≠ y_i),
and we were interested in their difference as g varies, i.e., we aimed to find a condition to guarantee that
these two quantities remain close uniformly over the entire set G. In other words we wanted to control the
following quantity
sup_{g∈G} |R(g) − R_n(g)|.   (1)
The intuition behind this goal was that, if we have such a guarantee, then minimizing Rn (g) would be
roughly the same as minimizing R(g). In this regard, we determined some types of results to achieve:
• Type A: R(gn ) ≤ Rn (gn ) + B(n, τ ), which is closely related to (1). If we guarantee such a bound then
we have:
|R(g) − Rn (g)| ≤ B(n, τ ), ∀g ∈ G ⇒ |R(gn ) − Rn (gn )| ≤ B(n, τ ),
which we will exploit later.
• Type B: R(g_n) ≤ R(g^*) + B(n, τ), where g^* = arg min_{g∈G} R(g).
Now suppose that g_n minimizes R_n(g); we call g_n the Empirical Risk Minimizer. Based on the above results we have
R(g_n) = R(g_n) − R(g^*) + R(g^*)
≤ R(g_n) − R(g^*) + R(g^*) + R_n(g^*) − R_n(g_n)   (a)
= (R(g_n) − R_n(g_n)) + (R_n(g^*) − R(g^*)) + R(g^*)
≤ 2 sup_{g∈G} |R(g) − R_n(g)| + R(g^*)
≤ 2B(n, τ) + R(g^*),   (b)
where (a) is due to the definition of the empirical risk minimizer that Rn (g ∗ ) ≥ Rn (gn ), and (b) follows from
result type A.
In order to proceed let us restate the following lemma.
Lemma 1 (Hoeffding bound and finite function class). For a fixed g we have
P(|R(g) − R_n(g)| ≥ ε) ≤ 2e^{−2nε²},
and consequently for finite G,
P( sup_{g∈G} |R(g) − R_n(g)| ≥ ε ) ≤ |G| · 2e^{−2nε²}.
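To see the finite-class bound in action, here is a small simulation sketch; the class of threshold classifiers, the data distribution, and all constants below are made up for illustration.

import numpy as np

rng = np.random.default_rng(1)
n, reps, eps = 500, 2000, 0.1
thresholds = np.linspace(-2, 2, 41)            # finite class G: g_c(x) = sign(x - c)

def sample(m):
    # P_XY: X ~ N(0,1), Y = sign(X) flipped with probability 0.2
    x = rng.standard_normal(m)
    y = np.sign(x) * np.where(rng.random(m) < 0.2, -1, 1)
    return x, y

xt, yt = sample(500_000)                       # huge sample to approximate the true risks R(g_c)
true_risk = np.array([np.mean(np.sign(xt - c) != yt) for c in thresholds])

exceed = 0
for _ in range(reps):
    x, y = sample(n)
    emp_risk = np.array([np.mean(np.sign(x - c) != y) for c in thresholds])
    exceed += np.max(np.abs(emp_risk - true_risk)) >= eps

print(exceed / reps)                                     # empirical P(sup_g |R - R_n| >= eps)
print(len(thresholds) * 2 * np.exp(-2 * n * eps ** 2))   # union-plus-Hoeffding upper bound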
2 Empirical Process Theory
Part of empirical process theory studies the “uniform convergence” of frequencies to their true probabilities.
Formally, consider the probability triple (X , B, P), where X is the space of outcomes, B is a collection of sets
in X (a σ-algebra), and P is the probability measure defined over B. In addition let S ⊂ B denote a non-empty
set of possible events, and A ∈ S be an event. Also, let X n = (X1 , . . . , Xn ) be a collection of n i.i.d. random
variables drawn from distribution P. We denote any specific n-dimensional vector (x1 , x2 , . . . , xn ) ∈ X n as
xn . We define the empirical frequency as follows:
ν(A, x^n) ≜ #{x_i : x_i ∈ A} / n.
We are interested in the relationship between the empirical frequency and the probability of event A
for all A. If we fix an event A, then according to the Law of Large Numbers we have
ν(A, X^n) →^p p(A) as n → ∞.
However, we are interested in uniform convergence over all events in S. Define π^S(x^n) ≜ sup_{A∈S} |ν(A, x^n) − p(A)|, and then pursue the following goal: P(π^S(X^n) > ε) → 0 as n → ∞.
We will shortly introduce the notion of entropy of the set of events S for samples of size n. Define N^S(x_1, . . . , x_n) ≜ #{(1(x_1 ∈ A), . . . , 1(x_n ∈ A)) : A ∈ S} and the entropy H^S(n) ≜ E log_2 N^S(X_1, . . . , X_n). The function N^S basically represents the number of binary patterns of x^n that the sets in S can generate. We have the following proposition for this function:
Proposition 2. 1 ≤ N^S(x_1, . . . , x_n) ≤ 2^n.
Proof This result follows from the fact that S is not empty, and since the patterns are binary sequences of length n, there can be at most 2^n of them.
Note that the function N^S(x_1, x_2, . . . , x_n) is defined for any specific x^n. For example, if x_1 = . . . = x_n, then 1 ≤ N^S(x_1, . . . , x_n) ≤ 2 since we can have at most 2 different patterns, (0, . . . , 0) and (1, . . . , 1).
Proposition 3. 0 ≤ H^S(n) ≤ n.
Proof Follows from Proposition 2.
This property is similar to the one that we have for information-theoretic entropy. Recall from information theory:
H(X^{m+n}) = H(X^m) + H(X_{m+1}^{m+n} | X^m) ≤ H(X^m) + H(X_{m+1}^{m+n}),
where the equality is from the Chain Rule and the inequality is from the fact that conditioning reduces
entropy.
In the same spirit, the entropy defined above is subadditive: H^S(m + n) ≤ H^S(m) + H^S(n).
Proof Left as an exercise. A hint is to justify N^S(x_1^{m+n}) ≤ N^S(x_1^m) N^S(x_{m+1}^{m+n}).
We note that (6) is the desired case where we have almost sure convergence to 0. However, (7) tells us that there exists a gap δ(c) with probability 1, which is not desired. Let us now connect this to our problem with the following bijections:
X ←− Z = (X, Y) ∼ P_XY   (8)
A ←− {(X, Y) : g(X) ≠ Y} (maps to g)   (9)
S ←− {{(X, Y) : g(X) ≠ Y} : g ∈ G}   (10)
p(A) ←− R(g)   (11)
ν(A; x^n) ←− R_n(g)   (12)
π^S(x^n) ←− sup_{g∈G} |R(g) − R_n(g)|   (13)
Theorem 8. [1, Theorem 3.6] If H^G(n)/n → c > 0 as n → ∞, then there exists a subset Z^* ⊂ Z with P(Z^*) = c such that, for the subset Z_1^*, . . . , Z_k^* ≜ (Z_1, . . . , Z_n) ∩ Z^* of almost every training data set (Z_1, . . . , Z_n) and for any given sequence of binary values δ_1, . . . , δ_k ∈ {0, 1}, there exists a function g ∈ G such that δ_i = 1(g(X_i^*) ≠ Y_i^*).
This theorem basically tells us that if c > 0, then for a c-portion of the training data there is a function g ∈ G which perfectly overfits this portion of the data. In an extreme case where c = 1, we say that G is non-falsifiable, which means that we cannot falsify the function class G given any observations, for there is always some g ∈ G which commits no error. Then we cannot expect to do anything useful on the test data. We refer to the case of c < 1 as the partially non-falsifiable case.
References
[1] Vladimir Vapnik, Statistical Learning Theory. New York: Wiley, 1998.
EE378A Statistical Signal Processing Lecture 16 - 05/19/2016
When we apply learning theory in practice, what really matters is not the precise bounds, but the intuition
we gain through the derivation. In this lecture, we will focus on how to use the ideas of learning theory
in practice. We will also provide intuitive explanations of some popular learning algorithms, such as SVM,
which will deepen our understanding of these algorithms.
1 Recap
Some notations and definitions:
• G: class of functions g : X → {−1, 1}
• Loss class: F ≜ {f_g : (x, y) ↦ 1{g(x) ≠ y}, g ∈ G}
• For samples Z_1, Z_2, ..., Z_n i.i.d. ∼ P_XY with Z_i = (X_i, Y_i), N^F(Z_1, Z_2, ..., Z_n) ≜ #{(f(Z_1), f(Z_2), ..., f(Z_n)) : f ∈ F} is often referred to as the “projection of F on the samples” (Z_1, Z_2, ..., Z_n). It counts the number of n-dimensional binary error patterns that can be generated by the loss class F.
• VC-entropy: H^F(n) ≜ E log_2 N^F(Z_1, Z_2, ..., Z_n).
Last time we introduced a theorem:
sup_{g∈G} |R(g) − R_n(g)| → 0 in probability as n → ∞   ⇔   lim_{n→∞} H^F(n)/n = 0
It shows that if the VC-entropy over n converges to 0, “uniform convergence” can be achieved, which means the true risk function R(g) and the empirical risk function R_n(g) are uniformly close. In this case, if we choose the function g to minimize R_n(g), the true risk R(g) achieved by this g will be close to its minimum.
However, the VC-entropy is difficult to use directly, for two reasons:
1. the distribution P_XY is unknown to us, hence we cannot compute the expectation of log_2 N^F(Z_1, Z_2, ..., Z_n) with infinite precision;
2. even for a given sequence (z_1, z_2, . . . , z_n), computing log_2 N^F(z_1, z_2, ..., z_n) may be difficult.
In practice, one approach is to approximate the VC-entropy by well-designed approximation algorithms.
Now we look at a simple way to upper bound the VC-entropy:
By using the supremum as an upper bound of the expectation, we get rid of the dependence on distribution
PXY .
Definition 1 (Shattering coefficient). S_F(n) ≜ sup_{(Z_1, Z_2, ..., Z_n)} N^F(Z_1, Z_2, ..., Z_n) is called the shattering coefficient of the loss class F.
In practice, to show VC-entropy over n converges to 0, it suffices to show that
lim sup_{n→∞} log_2 S_F(n) / n = 0
since the logarithm of the shattering coefficient log2 SF (n) provides an upper bound for the VC-entropy.
The shattering coefficient SF (n) has some strong structural properties:
Theorem 2. [1, Theorem 4.3] The shattering coefficient S_F(n) satisfies either
(a) S_F(n) = 2^n for all n; or
(b) there exists an integer h such that S_F(n) = 2^n for n ≤ h and S_F(n) ≤ (en/h)^h for n > h.
In case (a), S_F(n) always exhibits exponential growth, and lim_{n→∞} log_2 S_F(n)/n = 1 > 0. However, it does not
necessarily mean that VC-entropy over n does not converge to 0 (or no uniform convergence of the empirical
risk to the true risk), because the logarithm of the shattering coefficient is only a pessimistic upper bound
of VC-entropy.
In case (b), SF (n) exhibits exponential growth until n = h, and then it grows at most polynomially. In
particular, log_2 S_F(n)/n ≤ c·log(n)/n → 0 as n → ∞, therefore the VC-entropy over n converges to 0.
Definition 3 (VC-dimension). The largest integer h that makes case (b) hold is defined as the VC-dimension of F, denoted by V(F). In case (a) we define V(F) = ∞.
The intuition of VC-dimension is that h is the maximum number of feature vectors that can be shattered
by the function class G. Some notes:
1. Since there is a bijection between G and F, V (F) can also be written as V (G).
1. when there exist 3 of the 4 points that are collinear, the labeling in Fig.1(a) cannot be achieved by
linear classifiers.
2. when there exist 3 points that are not collinear, and the 4th point is inside the convex hull of the other
3 points, the labeling in Fig.1(b) cannot be achieved by linear classifiers.
3. when there exist 3 points that are not collinear, and not a single point is in the convex hull of the other
3 points, the labeling in Fig.1(c) cannot be achieved by linear classifiers (cf., Minsky & Papert ’69).
Figure 1: patterns of 4 points
Therefore, V (G) = 3. A generalization of this result is that for feature space Rd and linear classifier class
G, V (G) = d + 1 = number of parameters.
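For d = 2 this can be verified by brute force; the perceptron-based separability check below is only a sketch (a linear program would work equally well), and the two point sets are arbitrary.

import numpy as np
from itertools import product

def separable(points, labels, max_epochs=5000):
    """Perceptron with a bias feature: returns True iff a separating hyperplane is found."""
    X = np.hstack([points, np.ones((len(points), 1))])
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X, labels):
            if yi * (w @ xi) <= 0:
                w += yi * xi
                mistakes += 1
        if mistakes == 0:
            return True
    return False            # no separator found within the budget (reliable for these tiny sets)

pts3 = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])               # 3 non-collinear points
pts4 = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])   # 4 points in convex position
print(all(separable(pts3, np.array(l)) for l in product([-1, 1], repeat=3)))   # True: shattered
print(all(separable(pts4, np.array(l)) for l in product([-1, 1], repeat=4)))   # False: XOR labeling fails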
Example 5. Let us introduce Dudley's theorem (cf. Dudley ’78) here: for arbitrary X, G = {g = 1{∑_{i=1}^m c_i ψ_i(x) ≥ 0} : c_i ∈ R} with fixed ψ_1, · · · , ψ_m : X → R, we have V(G) ≤ m. (Note: G is a class of generalized linear classifiers.)
To prove this fact, assume by contradiction that the points x_1, · · · , x_{m+1} can be shattered. By definition, there exist M = 2^{m+1} vectors c^{(1)}, · · · , c^{(M)} such that the (m + 1) × 2^{m+1} matrix A formed by A_{ij} = (ψ_1(x_i), · · · , ψ_m(x_i))^T c^{(j)} satisfies: the columns of sign(A) exhaust all 2^{m+1} different binary vectors of length m + 1. By construction of A, the rank of A is at most m, hence the row vectors of A are linearly dependent: ∑_{i=1}^{m+1} u_i A_i = 0 with u_1, · · · , u_{m+1} ∈ R not all zero. However, picking out the column of A whose sign pattern equals sign(u_1, · · · , u_{m+1})^T results in a non-zero entry of ∑_{i=1}^{m+1} u_i A_i, a contradiction!
The previous example shows that, for linear models the VC-dimension is closely related to the number
of parameters. However, for other function classes VC-dimension may be either larger or smaller than the
number of parameters, as shown in Examples 6 and 7.
Example 6 shows that the VC-dimension of G that is nonlinear in its parameters can be smaller than the
number of parameters.
Example 6. For X = R, G = {g = 1{∑_{i=1}^d |a_i x^i| sign(x) + a_0 ≥ 0} : a_i ∈ R}, its VC-dimension is V(G) = 1, for ∑_{i=1}^d |a_i x^i| sign(x) + a_0 is monotonically increasing in x, so it cannot shatter any set of two points.
Example 7 shows that the VC-dimension of G that is nonlinear in its parameters can be larger than the
number of parameters.
Example 7. For X = (0, 2π), G = {g = 1(sin(αx) ≥ 0) : α ∈ (0, ∞)}, its VC-dimension is V(G) = ∞. In fact, for x_i = 2^{−i}, i = 1, · · · , n and any binary sequence δ_1, · · · , δ_n ∈ {0, 1}, the choice
α^* = π ( ∑_{i=1}^{n} (1 − δ_i) 2^i + 1 )
satisfies 1(sin(α^* x_i) ≥ 0) = δ_i for every i, so arbitrarily many points can be shattered.
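The construction can be checked numerically for a small number of points; the sketch below enumerates all labelings for the arbitrary choice n = 8.

import numpy as np
from itertools import product

n = 8
x = 2.0 ** -np.arange(1, n + 1)                  # x_i = 2^{-i}

def shattered():
    for delta in product([0, 1], repeat=n):
        d = np.array(delta)
        alpha = np.pi * (np.sum((1 - d) * 2.0 ** np.arange(1, n + 1)) + 1)   # alpha* above
        labels = (np.sin(alpha * x) >= 0).astype(int)
        if not np.array_equal(labels, d):
            return False
    return True

print(shattered())     # True: every one of the 2^n labelings of the x_i is realized by some alpha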
• Structural risk minimization (SRM): we introduce a sequence of function class {Gd , d = 1, 2, 3, · · · }
such that G1 ⊂ G2 ⊂ G3 ⊂ · · · . The optimal function is selected to minimize the sum of the empirical
risk and a penalty on the complexity of the function class, as g_n ≜ arg min_{g∈G_d, d∈N_+} (R_n(g) + pen(d, n)). Introducing the sequence of function classes G_d relaxes the “knowledge” incorporated by the function class.
The complexity penalization pen(d, n) increases as d increases, and it is usually a simple function of
the structure. In this way, we try to achieve an optimal balance between the empirical risk and the
complexity of the class.
Choosing the penalty term is an “art”: the penalty term can be chosen according to VC-dimension,
VC-entropy and many others. We refer the readers to [2] for more details.
• Regularization: the optimal function is selected as g_n ≜ arg min_{g∈G} (R_n(g) + λ‖g‖_2^2), where ‖g‖_2^2 can be replaced by other regularizers. The idea is essentially the same as SRM (e.g., consider G_d = {g ∈ G : ‖g‖_2 ≤ d}).
We can understand ERM as an unconstrained optimization problem, and the regularization scheme as a constrained optimization problem, where we minimize the empirical risk R_n(g) subject to ‖g‖_2^2 < b. Here tuning the parameter b is equivalent to tuning the parameter λ. The idea is that the function class G can be very rich, and searching over all g ∈ G can lead to overfitting (large stochastic error). So we constrain the function g to have a norm smaller than √b, so that we search over a smaller domain. The parameter b can be gradually increased (or λ decreased) to find the optimal result in practice.
The following theorem explains the idea behind SVM:
Theorem 8 (SVM; Vapnik ’78, Gurvits ’97). If F is the set of hyperplanes with margin ρ, then
V(F) ≤ min{ O(1/ρ²), n + 1 },
where n is the dimension of the space.
Let two parallel hyperplanes separate the two classes of data into w^T x_i + b ≥ 1 and w^T x_i + b ≤ −1. The margin, defined as the distance between the two hyperplanes, is ρ = 2/‖w‖_2. So the VC-dimension is upper bounded by C‖w‖_2^2, where C is a constant. This explains why in the SVM we try to minimize the training error and control the ℓ_2-norm (instead of other norms) of the parameter w.
References
[1] Vladimir Vapnik, Statistical Learning Theory. New York: Wiley, 1998.
[2] P. L. Bartlett, S. Boucheron, and G. Lugosi, “Model selection and error estimation”, Machine Learning, vol. 48, no. 1–3, pp. 85–113, 2002.
EE378A Statistical Signal Processing Lecture 17 - 05/24/2016
In this lecture, we look at predicting an individual sequence X t = (X1 , X2 , ..., Xt ) based on experts’ advice,
which is very close to the setting of online learning.
1 Notation
1. Alphabets: We consider a sequence alphabet X and a prediction alphabet X̂ , which need not be the
same. For example, X̂ may be the set of all probability distributions on X .
2. Predictor: F = {Ft }t≥1 , where Ft : X t−1 → X̂ makes a prediction for time t based on all available
information up to now. Specifically, the prediction for time t is Ft (X t−1 ).
3. Loss: The instantaneous loss is l : X̂ × X → R. Specifically, the loss at time t is l(F_t(X^{t−1}), X_t). The cumulative loss of a predictor F on a sequence X^n is L_F(X^n) = ∑_{t=1}^n l(F_t(X^{t−1}), X_t).
Under the logarithmic loss, the correspondence is L_F(x^n) = log(1/P_F(x^n)), where P_F(x^n) is the probability law of sequences according to F.
Average predictor on finite family Given a finite class of functions F = {F (1) , ..., F (m) }, consider the
average predictor G defined by
P_G = (1/m) ∑_{i=1}^m P_{F^{(i)}}.   (3)
We have
L_G(x^n) − min_{F∈F} L_F(x^n) = log [ max_i P_{F^{(i)}}(x^n) / P_G(x^n) ] ≤ log m,   (4)
since ∑_{i=1}^m P_{F^{(i)}}(x^n) ≥ max_i P_{F^{(i)}}(x^n). Therefore, the worst-case regret of the average predictor G relative to any finite family F = {F^{(1)}, ..., F^{(m)}} is upper bounded by log m.
Taking a closer look at this predictor, we see that
G_t(x^{t−1})[x_t] = P_G(x_t | x^{t−1})
= P_G(x^t) / P_G(x^{t−1})
= ∑_i P_{F^{(i)}}(x^t) / ∑_j P_{F^{(j)}}(x^{t−1})
= ∑_i P_{F^{(i)}}(x_t | x^{t−1}) P_{F^{(i)}}(x^{t−1}) / ∑_j P_{F^{(j)}}(x^{t−1}).
Recognizing that P_F(x^n) = e^{−L_F(x^n)}, the above expression can be rewritten as
G_t(x^{t−1})[x_t] = ∑_i F_t^{(i)}(x^{t−1})[x_t] exp(−L_{F^{(i)}}(x^{t−1})) / ∑_j exp(−L_{F^{(j)}}(x^{t−1})).   (5)
Thus, the average predictor iteratively reweights the predictors in F over a specific sequence xn according
to how well they have done so far in predicting the sequence.
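A small sketch of the reweighting in (5) under the logarithmic loss; the two hypothetical "experts" are constant Bernoulli predictors and the observed sequence is arbitrary.

import numpy as np

# two experts, each predicting a fixed distribution over {0, 1} at every time step
experts = [np.array([0.9, 0.1]), np.array([0.3, 0.7])]
x = [0, 0, 1, 0, 0, 1, 0, 0]                        # observed sequence

cum_loss = np.zeros(len(experts))                   # L_{F^(i)}(x^{t-1}) for each expert
mixture_loss = 0.0
for xt in x:
    w = np.exp(-cum_loss)
    w /= w.sum()                                    # weights proportional to exp(-L_{F^(i)})
    g = sum(wi * p for wi, p in zip(w, experts))    # equation (5): mixture prediction
    mixture_loss += -np.log(g[xt])
    cum_loss += -np.log([p[xt] for p in experts])   # update each expert's cumulative log loss

print(mixture_loss - cum_loss.min())                           # regret of the average predictor
print(mixture_loss - cum_loss.min() <= np.log(len(experts)))   # True: regret <= log m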
Normalized maximum-likelihood predictor (NML) on finite family Consider the same finite family as above. Define the normalization constant (i.e., partition function) Z := ∑_{x^n} max_i P_{F^{(i)}}(x^n). Then, consider the NML predictor:
P_G(x^n) = max_i P_{F^{(i)}}(x^n) / Z.   (6)
Then we have
L_G(x^n) − min_{F∈F} L_F(x^n) = log Z.   (7)
Note that for any other predictor G′, there exists y^n such that P_{G′}(y^n) ≤ P_G(y^n), since both P_{G′} and P_G define probability laws. In this case, we have L_G(y^n) = − log P_G(y^n) ≤ − log P_{G′}(y^n) = L_{G′}(y^n). This leads to the following theorem:
Theorem 1. min_G max_{x^n} { L_G(x^n) − min_{F∈F} L_F(x^n) } = log Z(F), which is achieved by the NML predictor.
Proof By the above, where G is the NML predictor and y^n is the sequence such that P_{G′}(y^n) ≤ P_G(y^n), we have
max_{x^n} [ L_G(x^n) − min_{F∈F} L_F(x^n) ] = L_G(y^n) − min_{F∈F} L_F(y^n)
≤ L_{G′}(y^n) − min_{F∈F} L_F(y^n)
≤ max_{x^n} [ L_{G′}(x^n) − min_{F∈F} L_F(x^n) ],
as desired.
EE378A Statistical Signal Processing Lecture 18 - 05/26/2016
In this lecture we continue our work on predicting individual sequences1 without assuming any underlying
stochastic model.
We will use two standard facts. (Hoeffding's lemma) For a random variable X ∈ [a, b]:
E e^X ≤ e^{E X} e^{(1/8)(b−a)^2}, i.e., log E e^X ≤ E X + (1/8)(b−a)^2.
For i.i.d. standard Gaussian random variables G_1, . . . , G_n:
lim_{n→∞} E[ max_{1≤i≤n} G_i ] / √(2 log n) = 1.
3. We use loss function l : X̂ × X → R_+ and denote l_max = sup_{x̂,x} l(x̂, x). The cumulative loss of predictor F on sequence x^n is:
L_F(x^n) = ∑_{t=1}^n l(F_t(x^{t−1}), x_t)
1 Reading: N. Cesa-Bianchi, and G. Lugosi, Prediction, learning, and games. Cambridge university press, 2006.
2.2 Logarithmic Loss, NML predictor and Uniform Predictor
Consider the logarithmic loss, a finite alphabet X, and the predictor alphabet X̂ which is the simplex of all probability distributions on X: X̂ = M(X). We use the loss function l(x̂, x) = log(1/x̂[x]), where x̂[x] denotes the x-th entry of the vector x̂, i.e., the probability that x̂ assigns to outcome x.
In Lecture 17 we showed a one-to-one correspondence between predictors and probability distributions on sequences, i.e., for every predictor F there exists a distribution P_F on sequences such that L_F(x^n) = log(1/P_F(x^n)).
2. We can upper-bound the WCR of G using the WCR of the Uniform Predictor Guniform , which is defined
by
P_{G_uniform}(x^n) = (1/|F|) ∑_{F∈F} P_F(x^n)
As a direct corollary, by choosing
η = √( 8 log|F| / (l_max² n) )
we achieve
L_G(x^n) − min_{F∈F} L_F(x^n) ≤ l_max √( (n/2) log|F| ).
Proof: Define W_1 = |F|, and for t > 1 let W_t = ∑_{F∈F} exp(−η L_F(x^{t−1})). Then on one hand,
log( W_{n+1} / W_1 ) = log ∑_{F∈F} exp(−η L_F(x^n)) − log|F|   (1)
we arrive at:
log( W_{t+1} / W_t ) ≤ −η ∑_{x̂} l(x̂, x_t) G_t(x^{t−1})[x̂] + η² l_max² / 8
Telescopically summing over all t:
∑_{t=1}^n log( W_{t+1} / W_t ) = log( W_{n+1} / W_1 ) ≤ −η L_G(x^n) + η² l_max² n / 8
−η min_{F∈F} L_F(x^n) − log|F| ≤ −η L_G(x^n) + η² l_max² n / 8
We arrive at
L_G(x^n) − min_{F∈F} L_F(x^n) ≤ log|F| / η + η l_max² n / 8
as desired.
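A sketch of the exponentially weighted forecaster under the Hamming loss, tracking its expected loss at each step (the quantity the proof actually bounds); the expert pool, horizon, and sequence below are all made up.

import numpy as np

rng = np.random.default_rng(2)
n, m = 200, 16
x = rng.integers(0, 2, size=n)                      # binary sequence to predict
experts = rng.integers(0, 2, size=(m, n))           # expert i predicts experts[i, t] at time t

l_max = 1.0
eta = np.sqrt(8 * np.log(m) / (l_max ** 2 * n))     # the eta from the corollary above

cum_loss = np.zeros(m)
expected_loss_G = 0.0
for t in range(n):
    w = np.exp(-eta * cum_loss)
    w /= w.sum()                                    # exponential weights on the experts
    losses = (experts[:, t] != x[t]).astype(float)  # Hamming loss of each expert at time t
    expected_loss_G += w @ losses                   # expected loss of the randomized forecaster
    cum_loss += losses

regret = expected_loss_G - cum_loss.min()
print(regret, l_max * np.sqrt(n / 2 * np.log(m)))   # regret stays below l_max * sqrt((n/2) log m)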
Theorem 2. For binary prediction with Hamming loss l(x̂, x) = 1(x̂ ≠ x), and X = X̂ = {0, 1}, we have:
lim_{n,N→∞} WC(n, N) / √( (n/2) log N ) = 1.
By the above theorem, it is always possible to construct a reference set F under which the exponentially-
weighted predictor is nearly optimal.
EE378A Statistical Signal Processing Lecture 19 - 05/31/2016
In this lecture, we cover the AdaBoost algorithm by Freund and Schapire [1], which can combine many weak
predictors into one strong predictor. We also review some of the key ideas from the course.
1 AdaBoost
The AdaBoost algorithm [1] is an example of an ensemble algorithm, which can combine many weak decision
rules into one strong decision rule. In general, in ensemble methods we need to consider two factors:
1. What to aggregate?
2. How to aggregate?
Initialize D_1(i) = 1/n for i = 1, . . . , n and F_0 = 0.
for t = 1, 2, · · · , T do
    Choose g_t ∈ G to minimize the weighted error ε_t = ∑_{i=1}^n D_t(i) 1(g_t(X_i) ≠ Y_i)
    F_t = F_{t−1} + α_t g_t, where α_t = (1/2) log((1 − ε_t)/ε_t)
    D_{t+1}(i) = (D_t(i)/Z_t) × e^{α_t} if g_t(X_i) ≠ Y_i, and D_{t+1}(i) = (D_t(i)/Z_t) × e^{−α_t} if g_t(X_i) = Y_i, where Z_t = 2√(ε_t(1 − ε_t))
end for
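A compact Python sketch of the update above, using decision stumps as the weak class G; the dataset and all constants are synthetic, and the stump search is brute force.

import numpy as np

rng = np.random.default_rng(3)
n = 300
X = rng.standard_normal((n, 2))
Y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 < 1.0, 1, -1)      # synthetic nonlinear labels

def best_stump(X, Y, D):
    """Weak learner: single-coordinate threshold minimizing the weighted error."""
    best = None
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for sign in (1, -1):
                pred = sign * np.where(X[:, j] >= thr, 1, -1)
                err = D @ (pred != Y)
                if best is None or err < best[0]:
                    best = (err, j, thr, sign)
    return best

D = np.full(n, 1.0 / n)                                     # D_1(i) = 1/n
F = np.zeros(n)                                             # F_t evaluated at the training points
for t in range(20):
    eps, j, thr, sign = best_stump(X, Y, D)
    eps = np.clip(eps, 1e-10, 1 - 1e-10)                    # numerical guard
    alpha = 0.5 * np.log((1 - eps) / eps)
    pred = sign * np.where(X[:, j] >= thr, 1, -1)
    F += alpha * pred                                       # F_t = F_{t-1} + alpha_t g_t
    D *= np.exp(np.where(pred != Y, alpha, -alpha))         # up-weight misclassified points
    D /= D.sum()                                            # normalization plays the role of Z_t
    print(t, eps, np.mean(np.sign(F) != Y))                 # weighted error and training error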
How do we choose α_t and g_t to minimize the empirical exponential loss P_n e^{−Y F_t(X)} = (1/n) ∑_{i=1}^n e^{−Y_i F_t(X_i)}? First, fix g_t, and choose α_t:
∂/∂α_t P_n e^{−Y F_t(X)} = ∑_{i: Y_i ≠ g_t(X_i)} (1/n) e^{−Y_i F_{t−1}(X_i)} e^{α_t} − ∑_{i: Y_i = g_t(X_i)} (1/n) e^{−Y_i F_{t−1}(X_i)} e^{−α_t} = 0
⟹ e^{α_t} ∑_{i: Y_i ≠ g_t(X_i)} (1/n) e^{−Y_i F_{t−1}(X_i)} = e^{−α_t} ∑_{i: Y_i = g_t(X_i)} (1/n) e^{−Y_i F_{t−1}(X_i)}.
Claim: gt achieves weighted error 50% with respect to weight distribution Dt+1 . This can be easily
proved using the definition of gt and Dt+1 . Essentially, each step tries to correct errors that appeared at the
previous step. It is the underlying reason that boosting can combine a bunch of weak learners into a strong
one. For further explanation, see [2].
2 Course Review
Generalization = Data + Knowledge
The precise theory is not always useful, and one should always focus on gaining general intuitions as
opposed to seeking out black-and-white rules that are claimed to be optimal.
References
[1] Y. Freund, R.E. Schapire, “A decision-theoretic generalization of on-line learning and an application to
boosting”. Journal of Computer and System Sciences, vol. 55, no. 1, pp. 119–139, 1997.
[2] R. Schapire, “Explaining AdaBoost”. Empirical Inference: Festschrift in Honor of Vladimir N. Vapnik,
2013.
[3] A. B. Tsybakov, “Aggregation and minimax optimality in high-dimensional estimation”. Proc. Int. Congr.
Math., Seoul, Korea, pp. 225–246, Aug. 2014.
[4] J. Jiao, K. Venkat, Y. Han, T. Weissman, “Minimax Estimation of Functionals of Discrete Distributions”.
IEEE Transactions on Information Theory, vol. 61, no. 5, pp. 2835–2885, 2015.
[5] A. Nemirovski, “Topics in nonparametric statistics”. In École d'Été de Probabilités de Saint-Flour XXVIII-1998. Lecture Notes in Math. 1738. Springer, New York. MR1775640, 2000.
[6] M. Talagrand, “A new look at independence”. Ann. Probab. 24, pp. 1–34, 1996.
[7] W. James and C. Stein, “Estimation with quadratic loss,” Proc. 4th Berkeley Symp. Math. Statist.
Probab., vol. 1, pp. 361–379, 1961.
[8] V. N. Vapnik, “Statistical Learning Theory”. vol. 2. New York, NY, USA: Wiley, 1998.