Editorial Board:
Already published
1. Bootstrap Methods and Their Application, by A.C. Davison and D.V. Hinkley
2. Markov Chains, by J. Norris
CAMBRIDGE UNIVERSITY PRESS
www.cambridge.org
Information on this title: www.cambridge.org/9780521784504
DOI: 10.1017/CBO9780511802256
© Cambridge University Press 1998
This publication is in copyright. Subject to statutory exception
and to the provisions of relevant collective licensing agreements,
no reproduction of any part may take place without the written
permission of Cambridge University Press.
First published 1998
First paperback edition 2000
8th printing 2007
A catalogue record for this publication is available from the British Library
Library of Congress Cataloguing in Publication Data
Vaart, A. W. van der
Asymptotic statistics / A. W. van der Vaart.
p. cm. – (Cambridge series in statistical and probabilistic
mathematics)
Includes bibliographical references.
1. Mathematical statistics – Asymptotic theory. I. Title.
II. Series: Cambridge series in statistical and probabilistic mathematics
QA276.V22 1998
519.5–dc21 98-15176
ISBN 978-0-521-49603-2 Hardback
ISBN 978-0-521-78450-4 Paperback
Cambridge University Press has no responsibility for the persistence or accuracy of
URLs for external or third-party internet websites referred to in this publication
and does not guarantee that any content on such websites is, or will remain,
accurate or appropriate.
1. Introduction
1.1. Approximate Statistical Procedures
1.2. Asymptotic Optimality Theory 2
1.3. Limitations 3
1.4. The Index n 4
2. Stochastic Convergence 5
2.1. Basic Theory 5
2.2. Stochastic o and O Symbols 12
*2.3. Characteristic Functions 13
*2.4. Almost-Sure Representations 17
*2.5. Convergence of Moments 17
*2.6. Convergence-Determining Classes 18
*2.7. Law of the Iterated Logarithm 19
*2.8. Lindeberg-Feller Theorem 20
*2.9. Convergence in Total Variation 22
Problems 24
3. Delta Method 25
3.1. Basic Result 25
3.2. Variance-Stabilizing Transformations 30
*3.3. Higher-Order Expansions 31
*3.4. Uniform Delta Method 32
*3.5. Moments 33
Problems 34
4. Moment Estimators 35
4.1. Method of Moments 35
*4.2. Exponential Families 37
Problems 40
5. M- and Z-Estimators 41
5.1. Introduction 41
5.2. Consistency 44
5.3. Asymptotic Normality 51
6. Contiguity 85
6.1. Likelihood Ratios 85
6.2. Contiguity 87
Problems 91
7. Local Asymptotic Normality 92
7.1. Introduction 92
7.2. Expanding the Likelihood 93
7.3. Convergence to a Normal Experiment 97
7.4. Maximum Likelihood 100
*7.5. Limit Distributions under Alternatives 103
*7.6. Local Asymptotic Normality 103
Problems 106
8. Efficiency of Estimators 108
8.1. Asymptotic Concentration 108
8.2. Relative Efficiency 110
8.3. Lower Bound for Experiments 111
8.4. Estimating Normal Means 112
8.5. Convolution Theorem 115
8.6. Almost-Everywhere Convolution Theorem 115
*8.7. Local Asymptotic Minimax Theorem 117
*8.8. Shrinkage Estimators 119
*8.9. Achieving the Bound 120
*8.10. Large Deviations 122
Problems 123
9. Limits of Experiments 125
9.1. Introduction 125
9.2. Asymptotic Representation Theorem 126
9.3. Asymptotic Normality 127
9.4. Uniform Distribution 129
9.5. Pareto Distribution 130
9.6. Asymptotic Mixed Normality 131
9.7. Heuristics 136
Problems 137
References 433
Index 439
This book grew out of courses that I gave at various places, including a graduate course in
the Statistics Department of Texas A&M University, Master's level courses for mathematics
students specializing in statistics at the Vrije Universiteit Amsterdam, a course in the DEA
program (graduate level) of Universite de Paris-sud, and courses in the Dutch AIO-netwerk
(graduate level).
The mathematical level is mixed. Some parts I have used for second year courses for
mathematics students (but they find it tough), other parts I would only recommend for a
graduate program. The text is written both for students who know about the technical
details of measure theory and probability, but little about statistics, and vice versa. This
requires brief explanations of statistical methodology, for instance of what a rank test or
the bootstrap is about, and there are similar excursions to introduce mathematical details.
Familiarity with (higher-dimensional) calculus is necessary in all of the manuscript. Metric
and normed spaces are briefly introduced in Chapter 18, when these concepts become
necessary for Chapters 19, 20, 21 and 22, but I do not expect that this would be enough as a
first introduction. For Chapter 25 basic knowledge of Hilbert spaces is extremely helpful,
although the bare essentials are summarized at the beginning. Measure theory is implicitly
assumed in the whole manuscript but can at most places be avoided by skipping proofs, by
ignoring the word "measurable" or with a bit of handwaving. Because we deal mostly with
i.i.d. observations, the simplest limit theorems from probability theory suffice. These are
derived in Chapter 2, but prior exposure is helpful.
Sections, results or proofs that are preceded by asterisks are either of secondary impor-
tance or are out of line with the natural order of the chapters. As the chart in Figure 0.1
shows, many of the chapters are independent from one another, and the book can be used
for several different courses.
A unifying theme is approximation by a limit experiment. The full theory is not developed
(another writing project is on its way), but the material is limited to the "weak topology"
on experiments, which in 90% of the book is exemplified by the case of smooth parameters
of the distribution of i.i.d. observations. For this situation the theory can be developed
by relatively simple, direct arguments. Limit experiments are used to explain efficiency
properties, but also why certain procedures asymptotically take a certain form.
A second major theme is the application of results on abstract empirical processes. These
already have benefits for deriving the usual theorems on M -estimators for Euclidean pa-
rameters but are indispensable if discussing more involved situations, such as M -estimators
with nuisance parameters, chi-square statistics with data-dependent cells, or semiparamet-
ric models. The general theory is summarized in about 30 pages, and it is the applications
Figure 0.1. Dependence chart. A solid arrow means that a chapter is a prerequisite for a next chapter.
A dotted arrow means a natural continuation. Vertical or horizontal position has no independent
meaning.
that we focus on. In a sense, it would have been better to place this material (Chapters
18 and 19) earlier in the book, but instead we start with material of more direct statistical
relevance and of a less abstract character. A drawback is that a few (starred) proofs point
ahead to later chapters.
Almost every chapter ends with a "Notes" section. These are meant to give a rough
historical sketch, and to provide entries in the literature for further reading. They certainly
do not give sufficient credit to the original contributions by many authors and are not meant
to serve as references in this way.
Mathematical statistics obtains its relevance from applications. The subjects of this book
have been chosen accordingly. On the other hand, this is a mathematician's book in that
we have made some effort to present results in a nice way, without the (unnecessary) lists
of "regularity conditions" that are sometimes found in statistics books. Occasionally, this
means that the accompanying proof must be more involved. If this means that an idea could
get lost, then an informal argument precedes the statement of a result.
This does not mean that I have striven for the greatest possible generality. A simple,
clean presentation was the main aim.
$A^*$                                adjoint operator
$\mathbb{B}^*$                        dual space
$C_b(T)$, $UC(T)$, $C(T)$             (bounded, uniformly) continuous functions on $T$
$\ell^\infty(T)$                      bounded functions on $T$
$\mathcal{L}_r(Q)$, $L_r(Q)$          measurable functions whose $r$th powers are $Q$-integrable
$\|f\|_{Q,r}$                         norm of $L_r(Q)$
$\|z\|_\infty$, $\|z\|_T$             uniform norm
lin                                   linear span
$\mathbb{C}$, $\mathbb{N}$, $\mathbb{Q}$, $\mathbb{R}$, $\mathbb{Z}$    number fields and sets
$\mathrm{E}X$, $\mathrm{E}^*X$, $\mathrm{var}\,X$, $\mathrm{sd}\,X$, $\mathrm{Cov}\,X$    (outer) expectation, variance, standard deviation, covariance (matrix) of $X$
$\mathbb{P}_n$, $\mathbb{G}_n$        empirical measure and process
$\mathbb{G}_P$                        $P$-Brownian bridge
$N(\mu, \Sigma)$, $t_n$, $\chi^2_n$   normal, $t$ and chi-square distributions
$z_\alpha$, $\chi^2_{n,\alpha}$, $t_{n,\alpha}$    upper $\alpha$-quantiles of the normal, chi-square and $t$ distributions
$\ll$                                 absolutely continuous
$\lhd$, $\lhd\rhd$                    contiguous, mutually contiguous
$\lesssim$                            smaller than up to a constant
$\rightsquigarrow$                    convergence in distribution
$\xrightarrow{P}$                     convergence in probability
$\xrightarrow{as}$                    convergence almost surely
$N(\varepsilon, T, d)$, $N_{[\,]}(\varepsilon, T, d)$    covering and bracketing numbers
$J(\varepsilon, T, d)$, $J_{[\,]}(\varepsilon, T, d)$    entropy integrals
$o_P(1)$, $O_P(1)$                    stochastic order symbols
provided the variables $X_i$ have a finite second moment. This variation on the central limit
theorem is proved in the next chapter. A "large sample" level $\alpha$ test is to reject $H_0\colon \mu = \mu_0$
if $\bigl|\sqrt{n}(\overline{X}_n - \mu_0)/S_n\bigr|$ exceeds the upper $\alpha/2$ quantile of the standard normal distribution.
Table 1.1 gives the significance level of this test if the observations are either normally or
exponentially distributed, and $\alpha = 0.05$. For $n \ge 20$ the approximation is quite reasonable
in the normal case. If the underlying distribution is exponential, then the approximation is
less satisfactory, because of the skewness of the exponential distribution.
n      Normal    Exponential
5      0.122     0.19
10     0.082     0.14
15     0.070     0.11
20     0.065     0.10
25     0.062     0.09
50     0.056     0.07
100    0.053     0.06
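Entries such as those of Table 1.1 can be approximated by simulation. The sketch below is illustrative only; the exponential model, the replication count, and the use of numpy/scipy are assumptions of this illustration, not part of the text.

```python
import numpy as np
from scipy.stats import norm

def rejection_rate(n, sampler, alpha=0.05, reps=100_000, rng=None):
    # Estimate P(|sqrt(n)(Xbar - mu0)/S| > z_{alpha/2}) when the null mean mu0 is true.
    rng = np.random.default_rng(rng)
    z = norm.ppf(1 - alpha / 2)          # upper alpha/2 standard normal quantile
    x = sampler(rng, size=(reps, n))     # reps samples of size n
    xbar = x.mean(axis=1)
    s = x.std(axis=1, ddof=1)            # sample standard deviation S_n
    mu0 = 1.0                            # true mean of Exp(1), so H0 holds
    t = np.sqrt(n) * (xbar - mu0) / s
    return np.mean(np.abs(t) > z)

# Exponential(1) observations: skewed, so the normal approximation is off for small n.
exp_sampler = lambda rng, size: rng.exponential(1.0, size=size)
for n in (5, 10, 20, 50, 100):
    print(n, rejection_rate(n, exp_sampler))
```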
In many ways the t-test is an uninteresting example. There are many other reasonable
test statistics for the same problem. Often their null distributions are difficult to calculate.
An asymptotic result similar to the one for the t-statistic would make them practically
applicable at least for large sample sizes. Thus, one aim of asymptotic statistics is to derive
the asymptotic distribution of many types of statistics.
There are similar benefits when obtaining confidence intervals. For instance, the given
approximation result asserts that $\sqrt{n}(\overline{X}_n - \mu)/S_n$ is approximately standard normally dis-
tributed if $\mu$ is the true mean, whatever its value. This means that, with probability approx-
imately $1 - 2\alpha$,
$-z_\alpha \le \frac{\sqrt{n}(\overline{X}_n - \mu)}{S_n} \le z_\alpha.$
This can be rewritten as the confidence statement $\mu = \overline{X}_n \pm z_\alpha S_n/\sqrt{n}$ in the usual manner.
For large $n$ its confidence level should be close to $1 - 2\alpha$.
As another example, consider maximum likelihood estimators $\hat\theta_n$ based on a sample of
size $n$ from a density $p_\theta$. A major result in asymptotic statistics is that in many situations
$\sqrt{n}(\hat\theta_n - \theta)$ is asymptotically normally distributed with zero mean and covariance matrix the
inverse of the Fisher information matrix $I_\theta$. If $Z$ is $k$-variate normally distributed with mean
zero and nonsingular covariance matrix $\Sigma$, then the quadratic form $Z^T\Sigma^{-1}Z$ possesses a
chi-square distribution with $k$ degrees of freedom. Thus, acting as if $\sqrt{n}(\hat\theta_n - \theta)$ possesses
an $N_k(0, I_\theta^{-1})$ distribution, we find that the ellipsoid
in certain exponential family models; the Rao-Blackwell theory allows us to conclude that
certain estimators are of minimum variance among the unbiased estimators. An important
and fairly general result is the Cramer-Rao bound for the variance of unbiased estimators,
but it is often not sharp.
If exact optimality theory does not give results, be it because the problem is intractable
or because there exist no "optimal" procedures, then asymptotic optimality theory may
help. For instance, to compare two tests we might compare approximations to their power
functions. To compare estimators, we might compare asymptotic variances rather than
exact variances. A major result in this area is that for smooth parametric models maximum
likelihood estimators are asymptotically optimal. This roughly means the following. First,
maximum likelihood estimators are asymptotically consistent: The sequence of estimators
converges in probability to the true value of the parameter. Second, the rate at which
maximum likelihood estimators converge to the true value is the fastest possible, typically
$1/\sqrt{n}$. Third, their asymptotic variance, the variance of the limit distribution of $\sqrt{n}(\hat\theta_n - \theta)$,
is minimal; in fact, maximum likelihood estimators "asymptotically attain" the Cramer-Rao
bound. Thus asymptotics justify the use of the maximum likelihood method in certain
situations. It is of interest here that, even though the method of maximum likelihood often
leads to reasonable estimators and has great intuitive appeal, in general it does not lead
to best estimators for finite samples. Thus the use of an asymptotic criterion simplifies
optimality theory considerably.
By taking limits we can gain much insight into the structure of statistical experiments. It
turns out that not only estimators and test statistics are asymptotically normally distributed,
but often also the whole sequence of statistical models converges to a model with a nor-
mal observation. Our good understanding of the latter "canonical experiment" translates
directly into understanding other experiments asymptotically. The mathematical beauty of
this theory is an added benefit of asymptotic statistics. Though we shall be mostly concerned
with normal limiting theory, this theory applies equally well to other situations.
1.3 Limitations
Although asymptotics is both practically useful and of theoretical importance, it should not
be taken for more than what it is: approximations. Clearly, a theorem that can be interpreted
as saying that a statistical procedure works fine for $n \to \infty$ is of no use if the number of
available observations is n = 5.
In fact, strictly speaking, most asymptotic results that are currently available are logically
useless. This is because most asymptotic results are limit results, rather than approximations
consisting of an approximating formula plus an accurate error bound. For instance, to
estimate a value $a$, we consider it to be the 25th element $a = a_{25}$ in a sequence $a_1, a_2, \ldots$,
and next take $\lim_{n\to\infty} a_n$ as an approximation. The accuracy of this procedure depends
crucially on the choice of the sequence in which a25 is embedded, and it seems impossible
to defend the procedure from a logical point of view. This is why there is good asymptotics
and bad asymptotics and why two types of asymptotics sometimes lead to conflicting
claims.
Fortunately, many limit results of statistics do give reasonable answers. Because it may
be theoretically very hard to ascertain that approximation errors are small, one often takes
recourse to simulation studies to judge the accuracy of a certain approximation.
Just as care is needed if using asymptotic results for approximations, results on asymptotic
optimality must be judged in the right manner. One pitfall is that even though a certain
procedure, such as maximum likelihood, is asymptotically optimal, there may be many
other procedures that are asymptotically optimal as well. For finite samples these may
behave differently and possibly better. Then so-called higher-order asymptotics, which
yield better approximations, may be fruitful. See e.g., [7], [52] and [114]. Although we
occasionally touch on this subject, we shall mostly be concerned with what is known as
"first-order asymptotics."
1.5 Notation
A symbol index is given on page xv.
For brevity we often use operator notation for evaluation of expectations and have special
symbols for the empirical measure and process.
For $P$ a measure on a measurable space $(\mathcal{X}, \mathcal{B})$ and $f\colon \mathcal{X} \mapsto \mathbb{R}^k$ a measurable function,
$Pf$ denotes the integral $\int f\, dP$; equivalently, the expectation $\mathrm{E}_P f(X_1)$ for $X_1$ a random
variable distributed according to $P$. When applied to the empirical measure $\mathbb{P}_n$ of a sample
$X_1, \ldots, X_n$, the discrete uniform measure on the sample values, this yields
$\mathbb{P}_n f = \frac{1}{n}\sum_{i=1}^n f(X_i).$
This formula can also be viewed as simply an abbreviation for the average on the right. The
empirical process $\mathbb{G}_n f$ is the centered and scaled version of the empirical measure, defined by
$\mathbb{G}_n f = \sqrt{n}\bigl(\mathbb{P}_n f - Pf\bigr).$
This is studied in detail in Chapter 19, but is used as an abbreviation throughout the book.
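In computations this operator notation translates directly into averages. The following sketch is illustrative only; the simulated sample, the function f, and the use of numpy are assumptions of this illustration, not part of the text.

```python
import numpy as np

def P_n(f, x):
    # Empirical measure applied to f: the average n^{-1} sum f(X_i).
    return np.mean(f(x))

def G_n(f, x, Pf):
    # Empirical process applied to f: sqrt(n) * (P_n f - P f), with Pf the true expectation.
    return np.sqrt(len(x)) * (P_n(f, x) - Pf)

rng = np.random.default_rng(0)
x = rng.normal(size=1000)            # sample from P = N(0, 1)
f = lambda t: t ** 2                 # Pf = E X^2 = 1 under N(0, 1)
print(P_n(f, x), G_n(f, x, Pf=1.0))
```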
for every $x$ at which the limit distribution function $x \mapsto P(X \le x)$ is continuous. Alterna-
tive names are weak convergence and convergence in law. As the last name suggests, the
convergence only depends on the induced laws of the vectors and not on the probability
spaces on which they are defined. Weak convergence is denoted by $X_n \rightsquigarrow X$; if $X$ has dis-
tribution $L$, or a distribution with a standard code, such as $N(0, 1)$, then also by $X_n \rightsquigarrow L$ or
$X_n \rightsquigarrow N(0, 1)$.
Let $d(x, y)$ be a distance function on $\mathbb{R}^k$ that generates the usual topology. For instance,
the Euclidean distance
$d(x, y) = \|x - y\| = \Bigl(\sum_{i=1}^k (x_i - y_i)^2\Bigr)^{1/2}.$
A sequence of random vectors $X_n$ is said to converge in probability to $X$ if for all $\varepsilon > 0$
$P\bigl(d(X_n, X) > \varepsilon\bigr) \to 0.$
This is denoted by $X_n \xrightarrow{P} X$. In this notation convergence in probability is the same as
$d(X_n, X) \xrightarrow{P} 0$.
† More formally, it is a Borel measurable map from some probability space into $\mathbb{R}^k$. Throughout it is implic-
itly understood that variables $X$, $g(X)$, and so forth of which we compute expectations or probabilities are
measurable maps on some probability space.
The sequence $X_n$ is said to converge almost surely to $X$ if $d(X_n, X) \to 0$ with probability one:
$P\bigl(\lim d(X_n, X) = 0\bigr) = 1.$
This is denoted by $X_n \xrightarrow{as} X$. Note that convergence in probability and convergence almost
surely only make sense if each of $X_n$ and $X$ are defined on the same probability space. For
convergence in distribution this is not necessary.
2.1 Example (Classical limit theorems). Let $\overline{Y}_n$ be the average of the first $n$ of a sequence
of independent, identically distributed random vectors $Y_1, Y_2, \ldots$. If $\mathrm{E}\|Y_1\| < \infty$, then
$\overline{Y}_n \xrightarrow{as} \mathrm{E}Y_1$ by the strong law of large numbers. Under the stronger assumption that $\mathrm{E}\|Y_1\|^2 <
\infty$, the central limit theorem asserts that $\sqrt{n}(\overline{Y}_n - \mathrm{E}Y_1) \rightsquigarrow N(0, \mathrm{Cov}\, Y_1)$. The central limit
theorem plays an important role in this manuscript. It is proved later in this chapter, first
for the case of real variables, and next it is extended to random vectors. The strong law
of large numbers appears to be of less interest in statistics. Usually the weak law of large
numbers, according to which $\overline{Y}_n \xrightarrow{P} \mathrm{E}Y_1$, suffices. This is proved later in this chapter. □
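Both statements of the example are easy to check numerically. The following sketch is illustrative only; exponential observations, the sample sizes, and the replication count are arbitrary choices made here.

```python
import numpy as np

rng = np.random.default_rng(1)
mean = 1.0                                  # EY1 (and sd Y1 = 1) for Exp(1) observations
for n in (10, 100, 1000):
    y = rng.exponential(mean, size=(10_000, n))
    ybar = y.mean(axis=1)
    print(n,
          ybar.mean(),                              # law of large numbers: close to EY1 = 1
          (np.sqrt(n) * (ybar - mean)).std())       # CLT: spread stays close to sd Y1 = 1
```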
2.2 Lemma (Portmanteau). For any random vectors $X_n$ and $X$ the following statements
are equivalent.
(i) $P(X_n \le x) \to P(X \le x)$ for all continuity points of $x \mapsto P(X \le x)$;
(ii) $\mathrm{E}f(X_n) \to \mathrm{E}f(X)$ for all bounded, continuous functions $f$;
(iii) $\mathrm{E}f(X_n) \to \mathrm{E}f(X)$ for all bounded, Lipschitz† functions $f$;
(iv) $\liminf \mathrm{E}f(X_n) \ge \mathrm{E}f(X)$ for all nonnegative, continuous functions $f$;
(v) $\liminf P(X_n \in G) \ge P(X \in G)$ for every open set $G$;
(vi) $\limsup P(X_n \in F) \le P(X \in F)$ for every closed set $F$;
(vii) $P(X_n \in B) \to P(X \in B)$ for all Borel sets $B$ with $P(X \in \delta B) = 0$, where
$\delta B = \bar{B} - \mathring{B}$ is the boundary of $B$.
Proof. (i) ⇒ (ii). Assume first that the distribution function of $X$ is continuous. Then
condition (i) implies that $P(X_n \in I) \to P(X \in I)$ for every rectangle $I$. Choose a
sufficiently large, compact rectangle $I$ with $P(X \notin I) < \varepsilon$. A continuous function $f$ is
uniformly continuous on the compact set $I$. Thus there exists a partition $I = \cup_j I_j$ into
finitely many rectangles $I_j$ such that $f$ varies at most $\varepsilon$ on every $I_j$. Take a point $x_j$ from
each $I_j$ and define $f_\varepsilon = \sum_j f(x_j)1_{I_j}$. Then $|f - f_\varepsilon| < \varepsilon$ on $I$, whence if $f$ takes its values
in $[-1, 1]$,
† A function is called Lipschitz if there exists a number $L$ such that $|f(x) - f(y)| \le L\, d(x, y)$, for every $x$ and
$y$. The least such number $L$ is denoted $\|f\|_{\mathrm{lip}}$.
For sufficiently large n, the right side of the first equation is smaller than 2ε as well. We
combine this with
$\bigl|\mathrm{E}f_\varepsilon(X_n) - \mathrm{E}f_\varepsilon(X)\bigr| \le \sum_j \bigl|P(X_n \in I_j) - P(X \in I_j)\bigr|\,\bigl|f(x_j)\bigr| \to 0.$
Together with the triangle inequality the three displays show that E f (X n ) − E f (X ) is
bounded by 5ε eventually. This being true for every ε > 0 implies (ii).
Call a set B a continuity set if its boundary δ B satisfies P(X ∈ δ B) = 0. The preceding
argument is valid for a general X provided all rectangles I are chosen equal to continuity
sets. This is possible, because the collection of discontinuity sets is sparse. Given any
collection of pairwise disjoint measurable sets, at most countably many sets can have
positive probability. Otherwise the probability of their union would be infinite. Therefore,
given any collection of sets {Bα : α ∈ A} with pairwise disjoint boundaries, all except at
most countably many sets are continuity sets. In particular, for each j at most countably
many sets of the form {x : x j ≤ α} are not continuity sets. Conclude that there exist dense
subsets Q 1 , . . . , Q k of R such that each rectangle with corners in the set Q 1 × · · · × Q k is
a continuity set. We can choose all rectangles I inside this set.
(iii) ⇒ (v). For every open set G there exists a sequence of Lipschitz functions with
0 ≤ f m ↑ 1G . For instance f m (x) = (md(x, G c )) ∧ 1. For every fixed m,
by (vi). If P(X ∈ δ B) = 0, then left and right side are equal, whence all inequalities
are equalities. The probability P(X ∈ B) and the limit lim P(X n ∈ B) are between the
expressions on left and right and hence equal to the common value.
(vii) ⇒ (i). Every cell (−∞, x] such that x is a continuity point of x → P(X ≤ x) is a
continuity set.
The equivalence (ii) ⇔ (iv) is left as an exercise.
Proof. (i). The event $\{g(X_n) \in F\}$ is identical to the event $\{X_n \in g^{-1}(F)\}$. For every
closed set $F$,
$g^{-1}(F) \subset \overline{g^{-1}(F)} \subset g^{-1}(F) \cup C^c,$
for $C$ the set of continuity points of $g$. To see the second inclusion, take $x$ in the closure of $g^{-1}(F)$. Thus, there exists a sequence
$x_m$ with $x_m \to x$ and $g(x_m) \in F$ for every $m$. If $x \in C$, then $g(x_m) \to g(x)$, which is in $F$
because $F$ is closed; otherwise $x \in C^c$. By the portmanteau lemma,
$\limsup P\bigl(g(X_n) \in F\bigr) \le \limsup P\bigl(X_n \in \overline{g^{-1}(F)}\bigr) \le P\bigl(X \in \overline{g^{-1}(F)}\bigr).$
Because $P(X \in C^c) = 0$, the probability on the right is $P(X \in g^{-1}(F)) = P(g(X) \in F)$.
Apply the portmanteau lemma again, in the opposite direction, to conclude that
$g(X_n) \rightsquigarrow g(X)$.
(ii). Fix arbitrary $\varepsilon > 0$. For each $\delta > 0$ let $B_\delta$ be the set of $x$ for which there exists
$y$ with $d(x, y) < \delta$, but $d(g(x), g(y)) > \varepsilon$. If $X \notin B_\delta$ and $d(g(X_n), g(X)) > \varepsilon$, then
$d(X_n, X) \ge \delta$. Consequently,
$P\bigl(d(g(X_n), g(X)) > \varepsilon\bigr) \le P(X \in B_\delta) + P\bigl(d(X_n, X) \ge \delta\bigr).$
The second term on the right converges to zero as $n \to \infty$ for every fixed $\delta > 0$. Because
$B_\delta \cap C \downarrow \emptyset$ by continuity of $g$, the first term converges to zero as $\delta \downarrow 0$.
Assertion (iii) is trivial. •
Any random vector $X$ is tight: For every $\varepsilon > 0$ there exists a constant $M$ such that
$P(\|X\| > M) < \varepsilon$. A set of random vectors $\{X_\alpha\colon \alpha \in A\}$ is called uniformly tight if $M$ can
be chosen the same for every $X_\alpha$: For every $\varepsilon > 0$ there exists a constant $M$ such that
$\sup_\alpha P\bigl(\|X_\alpha\| > M\bigr) < \varepsilon.$
Thus, there exists a compact set to which all $X_\alpha$ give probability "almost" one. Another
name for uniformly tight is bounded in probability. It is not hard to see that every weakly
converging sequence Xn is uniformly tight. More surprisingly, the converse of this statement
is almost true: According to Prohorov's theorem, every uniformly tight sequence contains a
weakly converging subsequence. Prohorov's theorem generalizes the Heine-Borel theorem
from deterministic sequences Xn to random vectors.
Proof. (i). Fix a number $M$ such that $P(\|X\| \ge M) < \varepsilon$. By the portmanteau lemma
$P(\|X_n\| \ge M)$ exceeds $P(\|X\| \ge M)$ arbitrarily little for sufficiently large $n$. Thus there
exists $N$ such that $P(\|X_n\| \ge M) < 2\varepsilon$, for all $n \ge N$. Because each of the finitely many
variables $X_n$ with $n < N$ is tight, the value of $M$ can be increased, if necessary, to ensure
that $P(\|X_n\| \ge M) < 2\varepsilon$ for every $n$.
The crux of the proof of Prohorov's theorem is Helly's lemma. This asserts that any
given sequence of distribution functions contains a subsequence that converges weakly to
a possibly defective distribution function. A defective distribution function is a function
that has all the properties of a cumulative distribution function with the exception that it has
limits less than 1 at $\infty$ and/or greater than 0 at $-\infty$.
Proof. Let $\mathbb{Q}^k = \{q_1, q_2, \ldots\}$ be the vectors with rational coordinates, ordered in an
arbitrary manner. Because the sequence $F_n(q_1)$ is contained in the interval $[0, 1]$, it has
a converging subsequence. Call the indexing subsequence $\{n_j^1\}$ and the limit $G(q_1)$.
Next, extract a further subsequence $\{n_j^2\} \subset \{n_j^1\}$ along which $F_n(q_2)$ converges to a
limit $G(q_2)$, a further subsequence $\{n_j^3\} \subset \{n_j^2\}$ along which $F_n(q_3)$ converges to a limit
$G(q_3), \ldots$, and so forth. The "tail" of the diagonal sequence $n_j := n_j^j$ belongs to every
sequence $\{n_j^i\}$. Hence $F_{n_j}(q_i) \to G(q_i)$ for every $i = 1, 2, \ldots$. Because each $F_n$ is nonde-
creasing, $G(q) \le G(q')$ if $q \le q'$. Define
$F(x) = \inf_{q > x} G(q).$
Conclude that $\bigl|\liminf F_{n_j}(x) - F(x)\bigr| < \varepsilon$. Because this is true for every $\varepsilon > 0$ and
the same result can be obtained for the limsup, it follows that $F_{n_j}(x) \to F(x)$ at every
continuity point of $F$.
In the higher-dimensional case, it must still be shown that the expressions defining masses
of cells are nonnegative. For instance, for $k = 2$, $F$ is a (defective) distribution function
only if $F(b) + F(a) - F(a_1, b_2) - F(a_2, b_1) \ge 0$ for every $a \le b$. In the case that the four
corners $a$, $b$, $(a_1, b_2)$, and $(a_2, b_1)$ of the cell are continuity points, this is immediate from
the convergence of $F_{n_j}$ to $F$ and the fact that each $F_n$ is a distribution function. Next, for
general cells the property follows by right continuity. •
The right side can be made arbitrarily small, uniformly in $n$, by choosing sufficiently
large $M$.
Because $\mathrm{E}X_n^2 = \mathrm{var}\,X_n + (\mathrm{E}X_n)^2$, an alternative sufficient condition for uniform tight-
ness is $\mathrm{E}X_n = O(1)$ and $\mathrm{var}\,X_n = O(1)$. This cannot be reversed. □
Consider some of the relationships among the three modes of convergence. Convergence
in distribution is weaker than convergence in probability, which is in turn weaker than
almost-sure convergence, except if the limit is constant.
The second term on the right converges to zero as $n \to \infty$. The first term can be made
arbitrarily small by choice of $\varepsilon$. Conclude that the sequences $\mathrm{E}f(X_n)$ and $\mathrm{E}f(Y_n)$ have the
same limit. The result follows from the portmanteau lemma.
(ii). Because $d(X_n, X) \xrightarrow{P} 0$ and trivially $X \rightsquigarrow X$, it follows that $X_n \rightsquigarrow X$ by (iv).
(iii). The "only if" part is a special case of (ii). For the converse let $\mathrm{ball}(c, \varepsilon)$ be the open
ball of radius $\varepsilon$ around $c$. Then $P(d(X_n, c) \ge \varepsilon) = P\bigl(X_n \in \mathrm{ball}(c, \varepsilon)^c\bigr)$. If $X_n \rightsquigarrow c$, then
the limsup of the last probability is bounded by $P\bigl(c \in \mathrm{ball}(c, \varepsilon)^c\bigr) = 0$, by the portmanteau
lemma.
(v). First note that $d\bigl((X_n, Y_n), (X_n, c)\bigr) = d(Y_n, c) \xrightarrow{P} 0$. Thus, according to (iv), it
suffices to show that $(X_n, c) \rightsquigarrow (X, c)$. For every continuous, bounded function $(x, y) \mapsto
f(x, y)$, the function $x \mapsto f(x, c)$ is continuous and bounded. Thus $\mathrm{E}f(X_n, c) \to \mathrm{E}f(X, c)$
if $X_n \rightsquigarrow X$.
(vi). This follows from $d\bigl((x_1, y_1), (x_2, y_2)\bigr) \le d(x_1, x_2) + d(y_1, y_2)$. •
2.8 Lemma (Slutsky). Let $X_n$, $X$ and $Y_n$ be random vectors or variables. If $X_n \rightsquigarrow X$ and
$Y_n \rightsquigarrow c$ for a constant $c$, then
(i) $X_n + Y_n \rightsquigarrow X + c$;
(ii) $Y_n X_n \rightsquigarrow cX$;
(iii) $Y_n^{-1} X_n \rightsquigarrow c^{-1} X$ provided $c \ne 0$.
In (i) the "constant" $c$ must be a vector of the same dimension as $X$, and in (ii) it is
probably initially understood to be a scalar. However, (ii) is also true if every $Y_n$ and $c$
are matrices (which can be identified with vectors, for instance by aligning rows, to give a
meaning to the convergence $Y_n \rightsquigarrow c$), simply because matrix multiplication $(x, y) \mapsto yx$ is
a continuous operation. Even (iii) is valid for matrices $Y_n$ and $c$ and vectors $X_n$ provided
$c \ne 0$ is understood as $c$ being invertible, because taking an inverse is also continuous.
for certain parameters $\theta$ and $\sigma^2$ depending on the underlying distribution, for every distri-
bution in the model. Then $\theta = T_n \pm z_\alpha S_n/\sqrt{n}$ is a confidence interval for $\theta$ of asymptotic
level $1 - 2\alpha$. More precisely, we have that the probability that $\theta$ is contained in $[T_n -
z_\alpha S_n/\sqrt{n},\ T_n + z_\alpha S_n/\sqrt{n}]$ converges to $1 - 2\alpha$.
This is a consequence of the fact that the sequence $\sqrt{n}(T_n - \theta)/S_n$ is asymptotically
standard normally distributed. □
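The example can be checked by simulation. In the sketch below, taking $T_n = \overline{X}_n$ for exponential observations is an illustrative assumption made here; the estimated coverage of $T_n \pm z_\alpha S_n/\sqrt{n}$ should approach $1 - 2\alpha$, as Slutsky's lemma predicts.

```python
import numpy as np
from scipy.stats import norm

def coverage(n, alpha=0.025, reps=10_000, rng=None):
    # Fraction of intervals T_n +/- z_alpha * S_n / sqrt(n) that contain the true mean.
    rng = np.random.default_rng(rng)
    z = norm.ppf(1 - alpha)
    x = rng.exponential(1.0, size=(reps, n))   # true mean theta = 1
    t, s = x.mean(axis=1), x.std(axis=1, ddof=1)
    half = z * s / np.sqrt(n)
    return np.mean((t - half <= 1.0) & (1.0 <= t + half))

for n in (10, 50, 200, 1000):
    print(n, coverage(n))          # should approach 1 - 2*alpha = 0.95
```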
If the limit variable $X$ has a continuous distribution function, then weak convergence
$X_n \rightsquigarrow X$ implies $P(X_n \le x) \to P(X \le x)$ for every $x$. The convergence is then even
uniform in $x$.
2.11 Lemma. Suppose that $X_n \rightsquigarrow X$ for a random vector $X$ with a continuous distribution
function. Then $\sup_x \bigl|P(X_n \le x) - P(X \le x)\bigr| \to 0$.
Proof. Let $F_n$ and $F$ be the distribution functions of $X_n$ and $X$. First consider the one-
dimensional case. Fix $k \in \mathbb{N}$. By the continuity of $F$ there exist points $-\infty = x_0 <
x_1 < \cdots < x_k = \infty$ with $F(x_i) = i/k$. By monotonicity, we have, for $x_{i-1} \le x \le x_i$,
$F_n(x_{i-1}) - F(x_{i-1}) - \frac{1}{k} \le F_n(x) - F(x) \le F_n(x_i) - F(x_i) + \frac{1}{k}.$
Thus $|F_n(x) - F(x)|$ is bounded above by $\sup_i |F_n(x_i) - F(x_i)| + 1/k$, for every $x$. The
latter, finite supremum converges to zero as $n \to \infty$, for each fixed $k$. Because $k$ is arbitrary,
the result follows.
In the higher-dimensional case, we follow a similar argument but use hyperrectangles,
rather than intervals. We can construct the rectangles by intersecting the $k$ partitions obtained
by subdividing each coordinate separately as before. •
$(1 + o_P(1))^{-1} = O_P(1)$
$o_P(R_n) = R_n\, o_P(1)$
$O_P(R_n) = R_n\, O_P(1)$
$o_P(O_P(1)) = o_P(1).$
To see the validity of these rules it suffices to restate them in terms of explicitly named
vectors, where each $o_P(1)$ and $O_P(1)$ should be replaced by a different sequence of vectors
that converges to zero or is bounded in probability. In this way the first rule says: If $X_n \xrightarrow{P} 0$
and $Y_n \xrightarrow{P} 0$, then $Z_n = X_n + Y_n \xrightarrow{P} 0$. This is an example of the continuous-mapping
theorem. The third rule is short for the following: If $X_n$ is bounded in probability and
$Y_n \xrightarrow{P} 0$, then $X_n Y_n \xrightarrow{P} 0$. If $X_n$ would also converge in distribution, then this would be
statement (ii) of Slutsky's lemma (with $c = 0$). But by Prohorov's theorem, $X_n$ converges
in distribution "along subsequences" if it is bounded in probability, so that the third rule
can still be deduced from Slutsky's lemma by "arguing along subsequences."
Note that both rules are in fact implications and should be read from left to right, even
though they are stated with the help of the equality sign. Similarly, although it is true that
$O_P(1) + O_P(1) = 2\,O_P(1)$, writing down this rule does not reflect understanding of the $O_P$
symbol.
Two more complicated rules are given by the following lemma.
2.12 Lemma. Let $R$ be a function defined on a domain in $\mathbb{R}^k$ such that $R(0) = 0$. Let $X_n$ be
a sequence of random vectors with values in the domain of $R$ that converges in probability
to zero. Then, for every $p > 0$,
(i) if $R(h) = o(\|h\|^p)$ as $h \to 0$, then $R(X_n) = o_P(\|X_n\|^p)$;
(ii) if $R(h) = O(\|h\|^p)$ as $h \to 0$, then $R(X_n) = O_P(\|X_n\|^p)$.
Proof. Define $g(h)$ as $g(h) = R(h)/\|h\|^p$ for $h \ne 0$ and $g(0) = 0$. Then $R(X_n) =
g(X_n)\,\|X_n\|^p$.
(i) Because the function $g$ is continuous at zero by assumption, $g(X_n) \xrightarrow{P} g(0) = 0$ by
the continuous-mapping theorem.
(ii) By assumption there exist $M$ and $\delta > 0$ such that $|g(h)| \le M$ whenever $\|h\| \le \delta$.
Thus $P(|g(X_n)| > M) \le P(\|X_n\| > \delta) \to 0$, and the sequence $g(X_n)$ is tight. •
Each of the functions $x \mapsto e^{it^T x}$ is continuous and bounded. Thus, by the portmanteau
lemma, $\mathrm{E}e^{it^T X_n} \to \mathrm{E}e^{it^T X}$ for every $t$ if $X_n \rightsquigarrow X$. By Lévy's continuity theorem the
converse is also true.
2.13 Theorem (Lévy's continuity theorem). Let $X_n$ and $X$ be random vectors in $\mathbb{R}^k$.
Then $X_n \rightsquigarrow X$ if and only if $\mathrm{E}e^{it^T X_n} \to \mathrm{E}e^{it^T X}$ for every $t \in \mathbb{R}^k$. Moreover, if $\mathrm{E}e^{it^T X_n}$ con-
verges pointwise to a function $\phi(t)$ that is continuous at zero, then $\phi$ is the characteristic
function of a random vector $X$ and $X_n \rightsquigarrow X$.
By assumption, the integrand in the right side converges pointwise to $\mathrm{Re}\bigl(1 - \phi(t)\bigr)$. By the
dominated-convergence theorem, the whole expression converges to
$\frac{1}{\delta}\int_{-\delta}^{\delta} \mathrm{Re}\bigl(1 - \phi(t)\bigr)\, dt.$
Because $\phi$ is continuous at zero, there exists for every $\varepsilon > 0$ a $\delta > 0$ such that $|1 - \phi(t)| < \varepsilon$
for $|t| < \delta$. For this $\delta$ the integral is bounded by $2\varepsilon$. Conclude that $P(|X_n| > 2/\delta) \le 2\varepsilon$
for sufficiently large $n$, whence the sequence $X_n$ is uniformly tight. •
2.14 Example (Normal distribution). The characteristic function of the $N_k(\mu, \Sigma)$ distri-
bution is the function
$t \mapsto \mathrm{E}e^{it^T X} = e^{it^T\mu - \frac{1}{2}t^T\Sigma t},$
which reduces, after a linear change of variables, to the identity $\mathrm{E}e^{zY} = e^{\frac{1}{2}z^2}$ for a standard
normal variable $Y$, evaluated at $z = it$.
For real-valued $z$, the last equality follows easily by completing the square in the exponent.
Evaluating the integral for complex $z$, such as $z = it$, requires some skill in complex
function theory. One method, which avoids further calculations, is to show that both the
left- and right-hand sides of the preceding display are analytic functions of $z$. For the right
side this is obvious; for the left side we can justify differentiation under the expectation
sign by the dominated-convergence theorem. Because the two sides agree on the real axis,
they must agree on the complex plane by uniqueness of analytic continuation. □
2.15 Lemma. Random vectors $X$ and $Y$ in $\mathbb{R}^k$ are equal in distribution if and only if
$\mathrm{E}e^{it^T X} = \mathrm{E}e^{it^T Y}$ for every $t \in \mathbb{R}^k$.
Proof. By Fubini's theorem and calculations as in the preceding example, for every $a > 0$
and $y \in \mathbb{R}^k$,
By the convolution formula for densities, the right-hand side is $(2\pi)^k$ times the density
$p_{X + aZ}(y)$ of the sum of $X$ and $aZ$ for a standard normal vector $Z$ that is independent of $X$.
Conclude that if $X$ and $Y$ have the same characteristic function, then the vectors $X + aZ$
and $Y + aZ$ have the same density and hence are equal in distribution for every $a > 0$. By
Slutsky's lemma $X + aZ \rightsquigarrow X$ as $a \downarrow 0$, and similarly for $Y$. Thus $X$ and $Y$ are equal in
distribution. •
The characteristic function of a sum of independent variables equals the product of the
characteristic functions of the individual variables. This observation, combined with Lévy's
theorem, yields simple proofs of both the law of large numbers and the central limit theorem.
2.16 Proposition (Weak law of large numbers). Let $Y_1, \ldots, Y_n$ be i.i.d. random variables
with characteristic function $\phi$. Then $\overline{Y}_n \xrightarrow{P} \mu$ for a real number $\mu$ if and only if $\phi$ is differ-
entiable at zero with $i\mu = \phi'(0)$.
Proof. We only prove that differentiability is sufficient. For the converse, see, for exam-
ple, [127, p. 52]. Because $\phi(0) = 1$, differentiability of $\phi$ at zero means that $\phi(t) = 1
+ t\phi'(0) + o(t)$ as $t \to 0$. Thus, by Fubini's theorem, for each fixed $t$ and $n \to \infty$,
$\mathrm{E}e^{it\overline{Y}_n} = \phi^n\Bigl(\frac{t}{n}\Bigr) = \Bigl(1 + i\mu\frac{t}{n} + o\Bigl(\frac{1}{n}\Bigr)\Bigr)^n \to e^{it\mu}.$
The right side is the characteristic function of the constant variable $\mu$. By Lévy's theorem,
$\overline{Y}_n$ converges in distribution to $\mu$. Convergence in distribution to a constant is the same as
convergence in probability. •
A sufficient but not necessary condition for $\phi(t) = \mathrm{E}e^{itY}$ to be differentiable at zero
is that $\mathrm{E}|Y| < \infty$. In that case the dominated convergence theorem allows differentiation
under the expectation sign, and
$\phi'(t) = \frac{d}{dt}\mathrm{E}e^{itY} = \mathrm{E}\,iY e^{itY}.$
In particular, the derivative at zero is $\phi'(0) = i\mathrm{E}Y$ and hence $\overline{Y}_n \xrightarrow{P} \mathrm{E}Y_1$.
If $\mathrm{E}Y^2 < \infty$, then the Taylor expansion can be carried a step further and we can obtain
a version of the central limit theorem.
2.17 Proposition (Central limit theorem). Let $Y_1, \ldots, Y_n$ be i.i.d. random variables with
$\mathrm{E}Y_i = 0$ and $\mathrm{E}Y_i^2 = 1$. Then the sequence $\sqrt{n}\,\overline{Y}_n$ converges in distribution to the standard
normal distribution.
Proof. A second differentiation under the expectation sign shows that $\phi''(0) = i^2\mathrm{E}Y^2$.
Because $\phi'(0) = i\mathrm{E}Y = 0$, we obtain $\phi(t) = 1 - \frac{1}{2}t^2 + o(t^2)$ as $t \to 0$, whence
$\mathrm{E}e^{it\sqrt{n}\,\overline{Y}_n} = \phi^n\Bigl(\frac{t}{\sqrt{n}}\Bigr) = \Bigl(1 - \frac{t^2}{2n} + o\Bigl(\frac{1}{n}\Bigr)\Bigr)^n \to e^{-\frac{1}{2}t^2}.$
The right side is the characteristic function of the standard normal distribution, so that the
proposition follows from Lévy's continuity theorem. •
2.18 Example (Multivariate central limit theorem). Let $Y_1, Y_2, \ldots$ be i.i.d. random vec-
tors in $\mathbb{R}^k$ with mean vector $\mu = \mathrm{E}Y_1$ and covariance matrix $\Sigma = \mathrm{E}(Y_1 - \mu)(Y_1 - \mu)^T$.
Then
$\sqrt{n}\,(\overline{Y}_n - \mu) = \frac{1}{\sqrt{n}}\sum_{i=1}^n (Y_i - \mu) \rightsquigarrow N_k(0, \Sigma).$
(The sum is taken coordinatewise.) By the Cramér-Wold device, this can be proved by
finding the limit distribution of the sequences of real variables
$t^T\Bigl(\frac{1}{\sqrt{n}}\sum_{i=1}^n (Y_i - \mu)\Bigr) = \frac{1}{\sqrt{n}}\sum_{i=1}^n (t^T Y_i - t^T\mu).$
Because the random variables $t^T Y_1 - t^T\mu,\ t^T Y_2 - t^T\mu, \ldots$ are i.i.d. with zero mean and
variance $t^T\Sigma t$, this sequence is asymptotically $N_1(0, t^T\Sigma t)$-distributed by the univariate
central limit theorem. This is exactly the distribution of $t^T X$ if $X$ possesses an $N_k(0, \Sigma)$
distribution. □
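A quick numerical check of the multivariate statement, in the spirit of the Cramér-Wold argument (the particular two-dimensional distribution, sample size, and replication count below are assumptions of this illustration): the empirical covariance of $\sqrt{n}(\overline{Y}_n - \mu)$ should be close to $\Sigma$, and any fixed linear combination should have variance close to $t^T\Sigma t$.

```python
import numpy as np

rng = np.random.default_rng(2)
n, reps = 400, 5_000
# Y_i = (U, U + V) for independent Exp(1) variables U, V: mu = (1, 2), Sigma = [[1, 1], [1, 2]].
u = rng.exponential(1.0, size=(reps, n))
v = rng.exponential(1.0, size=(reps, n))
y = np.stack([u, u + v], axis=-1)                    # shape (reps, n, 2)
mu = np.array([1.0, 2.0])
z = np.sqrt(n) * (y.mean(axis=1) - mu)               # sqrt(n)(Ybar_n - mu), one row per replication
print(np.cov(z, rowvar=False))                       # close to [[1, 1], [1, 2]]
t = np.array([1.0, -1.0])
print((z @ t).var())                                 # close to t' Sigma t = 1
```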
Proof. For random variables we can simply define X̃ n = Fn−1 (U ) for Fn the distribution
function of X n and U an arbitrary random variable with the uniform distribution on
[0, 1]. (The “quantile transformation,” see Section 21.1.) The simplest known construction
for higher-dimensional vectors is more complicated. See, for example, Theorem 1.10.4
in [146], or [41].
Proof. We give the proof only in the most interesting direction. (See, for example, [146]
(p. 69) for the other direction.) Suppose that Yn = f (X n ) is asymptotically uniformly
integrable. Then we show that EYn → EY for Y = f (X ). Assume without loss of
generality that Yn is nonnegative; otherwise argue the positive and negative parts separately.
By the continuous mapping theorem, $Y_n \rightsquigarrow Y$. By the triangle inequality,
|EYn − EY | ≤ |EYn − EYn ∧ M| + |EYn ∧ M − EY ∧ M| + |EY ∧ M − EY |.
Because the function y → y ∧ M is continuous and bounded on [0, ∞), it follows that the
middle term on the right converges to zero as n → ∞. The first term is bounded above by
$\mathrm{E}Y_n 1\{Y_n > M\}$, and converges to zero as $n \to \infty$ followed by $M \to \infty$, by the uniform
integrability. By the portmanteau lemma (iv), the third term is bounded by the liminf as
$n \to \infty$ of the first and hence converges to zero as $M \to \infty$. •
2.21 Example. Suppose $X_n$ is a sequence of random variables such that $X_n \rightsquigarrow X$ and
$\limsup \mathrm{E}|X_n|^p < \infty$ for some $p$. Then all moments of order strictly less than $p$ converge
also: $\mathrm{E}X_n^k \to \mathrm{E}X^k$ for every $k < p$.
By the preceding theorem, it suffices to prove that the sequence $X_n^k$ is asymptotically
uniformly integrable. By Markov's inequality
$\mathrm{E}|X_n|^k 1\bigl\{|X_n|^k \ge M\bigr\} \le M^{1 - p/k}\,\mathrm{E}|X_n|^p.$
The limit superior, as $n \to \infty$ followed by $M \to \infty$, of the right side is zero if $k < p$. □
2.22 Theorem. Let $X_n$ and $X$ be random variables such that $\mathrm{E}X_n^p \to \mathrm{E}X^p < \infty$ for
every $p \in \mathbb{N}$. If the distribution of $X$ is uniquely determined by its moments, then $X_n \rightsquigarrow X$.
Proof. Because $\mathrm{E}X_n^2 = O(1)$, the sequence $X_n$ is uniformly tight, by Markov's inequality.
By Prohorov's theorem, each subsequence has a further subsequence that converges weakly
to a limit Y. By the preceding example the moments of Y are the limits of the moments
of the subsequence. Thus the moments of Y are identical to the moments of X. Because,
by assumption, there is only one distribution with this set of moments, X and Y are equal
in distribution. Conclude that every subsequence of Xn has a further subsequence that
converges in distribution to X. This implies that the whole sequence converges to X. •
2.23 Example. The normal distribution is uniquely determined by its moments. (See, for
example, [123] or [133, p. 293].) Thus $\mathrm{E}X_n^p \to 0$ for odd $p$ and $\mathrm{E}X_n^p \to (p-1)(p-3)\cdots 1$
for even $p$ implies that $X_n \rightsquigarrow N(0, 1)$. The converse is false. □
2.24 Lemma. On $\mathbb{R}^k = \mathbb{R}^l \times \mathbb{R}^m$ the set of functions $(x, y) \mapsto f(x)g(y)$ with $f$ and $g$
ranging over all bounded, continuous functions on $\mathbb{R}^l$ and $\mathbb{R}^m$, respectively, is convergence-
determining.
2.25 Lemma. There exists a countable set $\mathcal{F}$ of continuous functions $f\colon \mathbb{R}^k \mapsto [0, 1]$ that
is convergence-determining and, moreover, $X_n \rightsquigarrow X$ implies that $\mathrm{E}f(X_n) \to \mathrm{E}f(X)$ uni-
formly in $f \in \mathcal{F}$.
The law of the iterated logarithm gives an interesting illustration of the difference between
almost sure and distributional statements. Under the conditions of the proposition, the
sequence $n^{-1/2}(Y_1 + \cdots + Y_n)$ is asymptotically normally distributed by the central limit
theorem. The limiting normal distribution is spread out over the whole real line. Apparently
division by the factor $\sqrt{\log\log n}$ is exactly right to keep $n^{-1/2}(Y_1 + \cdots + Y_n)$ within a
compact interval, eventually.
A simple application of Slutsky's lemma gives
$Z_n := \frac{Y_1 + \cdots + Y_n}{\sqrt{n\log\log n}} \xrightarrow{P} 0.$
Thus $Z_n$ is with high probability contained in the interval $(-\varepsilon, \varepsilon)$ eventually, for any $\varepsilon > 0$.
This appears to contradict the law of the iterated logarithm, which asserts that $Z_n$ reaches
the interval $(\sqrt{2} - \varepsilon, \sqrt{2} + \varepsilon)$ infinitely often with probability one. The explanation is
that the set of $\omega$ such that $Z_n(\omega)$ is in $(-\varepsilon, \varepsilon)$ or $(\sqrt{2} - \varepsilon, \sqrt{2} + \varepsilon)$ fluctuates with $n$. The
convergence in probability shows that at any advanced time a very large fraction of $\omega$ have
$Z_n(\omega) \in (-\varepsilon, \varepsilon)$. The law of the iterated logarithm shows that for each particular $\omega$ the
sequence $Z_n(\omega)$ drops in and out of the interval $(\sqrt{2} - \varepsilon, \sqrt{2} + \varepsilon)$ infinitely often (and
hence out of $(-\varepsilon, \varepsilon)$).
The implications for statistics can be illustrated by considering confidence statements.
If $\mu$ and 1 are the true mean and variance of the sample $Y_1, Y_2, \ldots$, then the probability that
$\overline{Y}_n - \frac{2}{\sqrt{n}} < \mu < \overline{Y}_n + \frac{2}{\sqrt{n}}$
converges to $\Phi(2) - \Phi(-2) \approx 95\%$. Thus the given interval is an asymptotic confidence
interval of level approximately 95%. (The confidence level is exactly $\Phi(2) - \Phi(-2)$ if the
observations are normally distributed. This may be assumed in the following; the accuracy
of the approximation is not an issue in this discussion.) The point $\mu = 0$ is contained in
the interval if and only if the variable $Z_n$ satisfies
$|Z_n| \le \frac{2}{\sqrt{\log\log n}}.$
Assume that $\mu = 0$ is the true value of the mean, and consider the following argument. By
the law of the iterated logarithm, we can be sure that $Z_n$ hits the interval $(\sqrt{2} - \varepsilon, \sqrt{2} + \varepsilon)$
infinitely often. The expression $2/\sqrt{\log\log n}$ is close to zero for large $n$. Thus we can be
sure that the true value $\mu = 0$ is outside the confidence interval infinitely often.
How can we solve the paradox that the usual confidence interval is wrong infinitely often?
There appears to be a conceptual problem if it is imagined that a statistician collects data in
a sequential manner, computing a confidence interval for every n. However, although the
frequentist interpretation of a confidence interval is open to the usual criticism, the paradox
does not seem to rise within the frequentist framework. In fact, from a frequentist point
of view the curious conclusion is reasonable. Imagine 100 statisticians, all of whom set
95% confidence intervals in the usual manner. They all receive one observation per day
and update their confidence intervals daily. Then every day about five of them should have
a false interval. It is only fair that as the days go by all of them take turns in being unlucky,
and that the same five do not have it wrong all the time. This, indeed, happens according
to the law of the iterated logarithm.
The paradox may be partly caused by the feeling that with a growing number of observa-
tions, the confidence intervals should become better. In contrast, the usual approach leads
to errors with certainty. However, this is only true if the usual approach is applied naively
in a sequential set-up. In practice one would do a genuine sequential analysis (including
the use of a stopping rule) or change the confidence level with n.
There is also another reason that the law of the iterated logarithm is of little practical
consequence. The argument in the preceding paragraphs is based on the assumption that
$2/\sqrt{\log\log n}$ is close to zero and is nonsensical if this quantity is larger than $\sqrt{2}$. Thus the
argument requires at least $n \ge 1619$, a respectable number of observations.
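For completeness, the arithmetic behind this bound:
$\frac{2}{\sqrt{\log\log n}} \le \sqrt{2} \iff \log\log n \ge 2 \iff n \ge e^{e^2} \approx 1618.2,$
so the smallest integer sample size for which the argument applies is $n = 1619$.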
2.27 Proposition (Lindeberg-Feller central limit theorem). For each $n$ let $Y_{n,1}, \ldots,
Y_{n,k_n}$ be independent random vectors with finite variances such that
$\sum_{i=1}^{k_n} \mathrm{E}\|Y_{n,i}\|^2\, 1\bigl\{\|Y_{n,i}\| > \varepsilon\bigr\} \to 0, \quad \text{every } \varepsilon > 0,$
$\sum_{i=1}^{k_n} \mathrm{Cov}\, Y_{n,i} \to \Sigma.$
Then the sequence $\sum_{i=1}^{k_n} (Y_{n,i} - \mathrm{E}Y_{n,i})$ converges in distribution to a normal $N(0, \Sigma)$
distribution.
A result of this type is necessary to treat the asymptotics of, for instance, regression
problems with fixed covariates. We illustrate this by the linear regression model. The
application is straightforward but notationally a bit involved. Therefore, at other places
in the manuscript we find it more convenient to assume that the covariates are a random
sample, so that the ordinary central limit theorem applies.
2.28 Example (Linear regression). In the linear regression problem, we observe a vector
$Y = X\beta + e$ for a known $(n \times p)$ matrix $X$ of full rank, and an (unobserved) error vector $e$
with i.i.d. components with mean zero and variance $\sigma^2$. The least squares estimator of $\beta$ is
$\hat\beta = (X^T X)^{-1} X^T Y.$
This estimator is unbiased and has covariance matrix $\sigma^2(X^T X)^{-1}$. If the error vector $e$ is
normally distributed, then $\hat\beta$ is exactly normally distributed. Under reasonable conditions
on the design matrix, the least squares estimator is asymptotically normally distributed for
a large range of error distributions. Here we fix $p$ and let $n$ tend to infinity.
This follows from the representation
$(X^T X)^{1/2}(\hat\beta - \beta) = (X^T X)^{-1/2} X^T e = \sum_{i=1}^n a_{ni} e_i,$
where $a_{n1}, \ldots, a_{nn}$ are the columns of the $(p \times n)$ matrix $(X^T X)^{-1/2} X^T =: A$. This sequence
is asymptotically normal if the vectors $a_{n1}e_1, \ldots, a_{nn}e_n$ satisfy the Lindeberg conditions.
The norming matrix $(X^T X)^{1/2}$ has been chosen to ensure that the vectors in the display
have covariance matrix $\sigma^2 I$ for every $n$. The remaining condition is
$\sum_{i=1}^n \mathrm{E}\|a_{ni}\|^2 e_i^2\, 1\bigl\{\|a_{ni}\|\,|e_i| > \varepsilon\bigr\} \to 0.$
This can be simplified to other conditions in several ways. Because $\sum \|a_{ni}\|^2 = \mathrm{trace}(AA^T)
= p$, it suffices that $\max_i \mathrm{E}e_i^2\, 1\{\|a_{ni}\|\,|e_i| > \varepsilon\} \to 0$, which is equivalent to
$\max_{1\le i\le n}\|a_{ni}\| \to 0.$
Alternatively, the expectation $\mathrm{E}e^2 1\{a|e| > \varepsilon\}$ can be bounded by $\varepsilon^{-k}\mathrm{E}|e|^{k+2} a^k$, and a
second set of sufficient conditions is
$\sum_{i=1}^n \|a_{ni}\|^k \to 0 \quad \text{and} \quad \mathrm{E}|e_1|^k < \infty \qquad (k > 2).$
Both sets of conditions are reasonable. Consider for instance the simple linear regression
model $Y_i = \beta_0 + \beta_1 x_i + e_i$. Then
It is reasonable to assume that the sequences $\bar{x}$ and $\overline{x^2}$ are bounded. Then the first matrix
on the right behaves like a fixed matrix, and the conditions for asymptotic normality
simplify to
where the supremum is taken over all measurable sets B. In view of the portmanteau lemma,
this type of convergence is stronger than convergence in distribution. Not only is it required
that the sequence P(Xn E B) converges for every Borel set B, the convergence must also
be uniform in B. Such strong convergence occurs less frequently and is often more than
necessary, whence the concept is less useful.
A simple sufficient condition for convergence in total variation is pointwise convergence
of densities. If $X_n$ and $X$ have densities $p_n$ and $p$ with respect to a measure $\mu$, then
$\sup_B \bigl|P(X_n \in B) - P(X \in B)\bigr| = \frac{1}{2}\int |p_n - p|\, d\mu.$
Thus, convergence in total variation can be established by convergence theorems for inte-
grals from measure theory. The following proposition, which should be compared with the
monotone and dominated convergence theorems, is most appropriate.
2.29 Proposition. Suppose that $f_n$ and $f$ are arbitrary measurable functions such that
$f_n \to f$ $\mu$-almost everywhere (or in $\mu$-measure) and $\limsup \int |f_n|^p\, d\mu \le \int |f|^p\, d\mu <
\infty$, for some $p \ge 1$ and measure $\mu$. Then $\int |f_n - f|^p\, d\mu \to 0$.
Proof. By the inequality $(a + b)^p \le 2^p a^p + 2^p b^p$, valid for every $a, b \ge 0$, and the
assumption, $0 \le 2^p|f_n|^p + 2^p|f|^p - |f_n - f|^p \to 2^{p+1}|f|^p$ almost everywhere. By
Fatou's lemma,
$\int 2^{p+1}|f|^p\, d\mu \le \liminf\int\bigl(2^p|f_n|^p + 2^p|f|^p - |f_n - f|^p\bigr)\, d\mu.$
2.30 Corollary (Scheffé). Let $X_n$ and $X$ be random vectors with densities $p_n$ and $p$ with
respect to a measure $\mu$. If $p_n \to p$ $\mu$-almost everywhere, then the sequence $X_n$ converges
to $X$ in total variation.
2.31 Theorem (Central limit theorem in total variation). Let $Y_1, Y_2, \ldots$ be i.i.d. random
variables with finite second moment and characteristic function $\phi$ such that $\int |\phi(t)|^v\, dt <
\infty$ for some $v \ge 1$. Then $Y_1 + \cdots + Y_n$ satisfies the central limit theorem in total variation.
Proof. It can be assumed without loss of generality that $\mathrm{E}Y_1 = 0$ and $\mathrm{var}\, Y_1 = 1$. By
the inversion formula for characteristic functions (see [47, p. 509]), the density $p_n$ of
$(Y_1 + \cdots + Y_n)/\sqrt{n}$ can be written
$p_n(x) = \frac{1}{2\pi}\int e^{-itx}\,\phi\Bigl(\frac{t}{\sqrt{n}}\Bigr)^n\, dt.$
By the central limit theorem and Lévy's continuity theorem, the integrand converges to
$e^{-itx}\exp(-\frac{1}{2}t^2)$. It will be shown that the integral converges to
$\frac{1}{2\pi}\int e^{-itx}\, e^{-\frac{1}{2}t^2}\, dt = \frac{e^{-\frac{1}{2}x^2}}{\sqrt{2\pi}}.$
Then an application of Scheffé's theorem concludes the proof.
The integral can be split into two parts. First, for every $\varepsilon > 0$,
$\Bigl|\int_{|t| > \varepsilon\sqrt{n}} e^{-itx}\,\phi\Bigl(\frac{t}{\sqrt{n}}\Bigr)^n\, dt\Bigr| \le \sqrt{n}\,\sup_{|t| > \varepsilon}\bigl|\phi(t)\bigr|^{n-v}\int\bigl|\phi(t)\bigr|^v\, dt.$
Here $\sup_{|t| > \varepsilon}|\phi(t)| < 1$ by the Riemann-Lebesgue lemma and because $\phi$ is the characteristic
function of a nonlattice distribution (e.g., [47, pp. 501, 513]). Thus, the first part of the
integral converges to zero geometrically fast.
Second, a Taylor expansion yields that $\phi(t) = 1 - \frac{1}{2}t^2 + o(t^2)$ as $t \to 0$, so that there
exists $\varepsilon > 0$ such that $|\phi(t)| \le 1 - t^2/4$ for every $|t| < \varepsilon$. It follows that, for $|t| \le \varepsilon\sqrt{n}$, the
integrand is dominated by $\bigl(1 - t^2/(4n)\bigr)^n \le e^{-t^2/4}$.
The proof can be concluded by applying the dominated convergence theorem to the remain-
ing part of the integral. •
Notes
The results of this chapter can be found in many introductions to probability theory. A
standard reference for weak convergence theory is the first chapter of [11]. Another very
readable introduction is [41]. The theory of this chapter is extended to random elements
with values in general metric spaces in Chapter 18.
PROBLEMS
1. If $X_n$ possesses a $t$-distribution with $n$ degrees of freedom, then $X_n \rightsquigarrow N(0, 1)$ as $n \to \infty$.
Show this.
2. Does it follow immediately from the result of the previous exercise that $\mathrm{E}X_n^p \to \mathrm{E}N(0, 1)^p$ for
every $p \in \mathbb{N}$? Is this true?
3. If $X_n \rightsquigarrow N(0, 1)$ and $Y_n \xrightarrow{P} \sigma$, then $X_n Y_n \rightsquigarrow N(0, \sigma^2)$. Show this.
4. In what sense is a chi-square distribution with $n$ degrees of freedom approximately a normal
distribution?
5. Find an example of sequences such that $X_n \rightsquigarrow X$ and $Y_n \rightsquigarrow Y$, but the joint sequence $(X_n, Y_n)$
does not converge in law.
6. If $X_n$ and $Y_n$ are independent random vectors for every $n$, then $X_n \rightsquigarrow X$ and $Y_n \rightsquigarrow Y$ imply that
$(X_n, Y_n) \rightsquigarrow (X, Y)$, where $X$ and $Y$ are independent. Show this.
7. If every $X_n$ and $X$ possess discrete distributions supported on the integers, then $X_n \rightsquigarrow X$ if and
only if $P(X_n = x) \to P(X = x)$ for every integer $x$. Show this.
8. If $P(X_n = i/n) = 1/n$ for every $i = 1, 2, \ldots, n$, then $X_n \rightsquigarrow X$ for a variable $X$ that is uniformly
distributed on $[0, 1]$, but there exist Borel sets with $P(X_n \in B) = 1$ for every $n$, but $P(X \in B) = 0$.
Show this.
9. If $P(X_n = x_n) = 1$ for numbers $x_n$ and $x_n \to x$, then $X_n \rightsquigarrow x$. Prove this
(i) by considering distribution functions
(ii) by using Theorem 2.7.
10. State the rule $O_P(1) + o_P(1) = O_P(1)$ in terms of random vectors and show its validity.
11. In what sense is it true that $o_P(1) = O_P(1)$? Is it true that $O_P(1) = o_P(1)$?
12. The rules given by Lemma 2.12 are not simple plug-in rules.
(i) Give an example of a function $R$ with $R(h) = o(\|h\|)$ as $h \to 0$ and a sequence of random
variables $X_n$ such that $R(X_n)$ is not equal to $o_P(X_n)$.
14. Find an example of a sequence of random variables such that $X_n \xrightarrow{P} 0$, but $X_n$ does not converge
almost surely.
15. Let $X_1, \ldots, X_n$ be i.i.d. with density $f_{\lambda, a}(x) = \lambda e^{-\lambda(x - a)}1\{x \ge a\}$. Calculate the maximum
likelihood estimator $(\hat\lambda_n, \hat a_n)$ of $(\lambda, a)$ and show that $(\hat\lambda_n, \hat a_n) \xrightarrow{P} (\lambda, a)$.
16. Let $X_1, \ldots, X_n$ be i.i.d. standard normal variables. Show that the vector $U = (X_1, \ldots, X_n)/N$,
where $N^2 = \sum_{i=1}^n X_i^2$, is uniformly distributed over the unit sphere $S^{n-1}$ in $\mathbb{R}^n$, in the sense that
$U$ and $OU$ are identically distributed for every orthogonal transformation $O$ of $\mathbb{R}^n$.
17. For each $n$, let $U_n$ be uniformly distributed over the unit sphere $S^{n-1}$ in $\mathbb{R}^n$. Show that the vectors
$\sqrt{n}(U_{n,1}, U_{n,2})$ converge in distribution to a pair of independent standard normal variables.
18. If $\sqrt{n}(T_n - \theta)$ converges in distribution, then $T_n$ converges in probability to $\theta$. Show this.
19. If $\mathrm{E}X_n \to \mu$ and $\mathrm{var}\, X_n \to 0$, then $X_n \xrightarrow{P} \mu$. Show this.
20. If $\sum_n P(|X_n| > \varepsilon) < \infty$ for every $\varepsilon > 0$, then $X_n$ converges almost surely to zero. Show this.
21. Use characteristic functions to show that $\mathrm{binomial}(n, \lambda/n) \rightsquigarrow \mathrm{Poisson}(\lambda)$. Why does the central
limit theorem not hold?
22. If $X_1, \ldots, X_n$ are i.i.d. standard Cauchy, then $\overline{X}_n$ is standard Cauchy.
(i) Show this by using characteristic functions
(ii) Why does the weak law not hold?
23. Let $X_1, \ldots, X_n$ be i.i.d. with finite fourth moment. Find constants $a$, $b$, and $c_n$ such that the
sequence $c_n(\overline{X}_n - a, \overline{X^2}_n - b)$ converges in distribution, and determine the limit law. Here $\overline{X}_n$
and $\overline{X^2}_n$ are the averages of the $X_i$ and the $X_i^2$, respectively.
3
Delta Method
If $\sqrt{n}(T_n - \theta) \rightsquigarrow T$ for some variable $T$, then we expect that $\sqrt{n}\bigl(\phi(T_n) - \phi(\theta)\bigr) \rightsquigarrow \phi'(\theta)T$.
In particular, if $\sqrt{n}(T_n - \theta)$ is asymptotically normal $N(0, \sigma^2)$, then we expect that
$\sqrt{n}\bigl(\phi(T_n) - \phi(\theta)\bigr)$ is asymptotically normal $N(0, \phi'(\theta)^2\sigma^2)$. This is proved in greater
generality in the following theorem.
In the preceding paragraph it is silently understood that $T_n$ is real-valued, but we are more
interested in considering statistics $\phi(T_n)$ that are formed out of several more basic statistics.
Consider the situation that $T_n = (T_{n,1}, \ldots, T_{n,k})$ is vector-valued, and that $\phi\colon \mathbb{R}^k \mapsto \mathbb{R}^m$ is
a given function defined at least on a neighbourhood of $\theta$. Recall that $\phi$ is differentiable at
$\theta$ if there exists a linear map (matrix) $\phi'_\theta\colon \mathbb{R}^k \mapsto \mathbb{R}^m$ such that
$\phi(\theta + h) - \phi(\theta) = \phi'_\theta(h) + o(\|h\|).$
All the expressions in this equation are vectors of length $m$, and $\|h\|$ is the Euclidean
norm. The linear map $h \mapsto \phi'_\theta(h)$ is sometimes called a "total derivative," as opposed to
partial derivatives. A sufficient condition for $\phi$ to be (totally) differentiable is that all partial
derivatives $\partial\phi_j(x)/\partial x_i$ exist for $x$ in a neighborhood of $\theta$ and are continuous at $\theta$. (Just
existence of the partial derivatives is not enough.) In any case, the total derivative is found
from the partial derivatives. If $\phi$ is differentiable, then it is partially differentiable, and the
derivative map $h \mapsto \phi'_\theta(h)$ is matrix multiplication by the matrix
$\phi'_\theta = \begin{pmatrix} \partial\phi_1/\partial\theta_1 & \cdots & \partial\phi_1/\partial\theta_k\\ \vdots & & \vdots\\ \partial\phi_m/\partial\theta_1 & \cdots & \partial\phi_m/\partial\theta_k \end{pmatrix}(\theta).$
3.1 Theorem. Let $\phi\colon \mathbb{D}_\phi \subset \mathbb{R}^k \mapsto \mathbb{R}^m$ be a map defined on a subset of $\mathbb{R}^k$ and dif-
ferentiable at $\theta$. Let $T_n$ be random vectors taking their values in the domain of $\phi$. If
$r_n(T_n - \theta) \rightsquigarrow T$ for numbers $r_n \to \infty$, then $r_n\bigl(\phi(T_n) - \phi(\theta)\bigr) \rightsquigarrow \phi'_\theta(T)$. Moreover, the
difference between $r_n\bigl(\phi(T_n) - \phi(\theta)\bigr)$ and $\phi'_\theta\bigl(r_n(T_n - \theta)\bigr)$ converges to zero in probability.
Proof. Because the sequence $r_n(T_n - \theta)$ converges in distribution, it is uniformly tight and
$T_n - \theta$ converges to zero in probability. By the differentiability of $\phi$ the remainder function
$R(h) = \phi(\theta + h) - \phi(\theta) - \phi'_\theta(h)$
satisfies $R(h) = o(\|h\|)$ as $h \to 0$. Lemma 2.12 allows us
to replace the fixed $h$ by a random sequence and gives
The map $\phi$ is differentiable at the point $\theta = (\alpha_1, \alpha_2)^T$, with derivative $\phi'_{(\alpha_1, \alpha_2)} = (-2\alpha_1, 1)$.
Thus if the vector $(T_1, T_2)'$ possesses the normal distribution in the last display, then
$\sqrt{n}(S^2 - \sigma^2) \rightsquigarrow -2\alpha_1 T_1 + T_2.$
In view of Slutsky's lemma, the same result is valid for the unbiased version $n/(n-1)S^2$
of the sample variance, because $\sqrt{n}\bigl(n/(n-1) - 1\bigr) \to 0$. □
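The conclusion of the example, that $\sqrt{n}(S^2 - \sigma^2)$ is asymptotically normal with variance $\mu_4 - \sigma^4$ (equivalently $(\kappa + 2)\sigma^4$ in the notation of the next example; the explicit value is a standard computation not spelled out in the fragment above), can be checked by simulation. The exponential distribution, sample size, and replication count below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(3)
n, reps = 1000, 10_000
x = rng.exponential(1.0, size=(reps, n))     # for Exp(1): sigma^2 = 1, central mu_4 = 9
s2 = x.var(axis=1)                            # sample variance with factor 1/n
stat = np.sqrt(n) * (s2 - 1.0)
print(stat.var())                             # close to mu_4 - sigma^4 = 9 - 1 = 8
```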
3.3 Example (Level of the chi-square test). As an application of the preceding example,
consider the chi-square test for testing variance. Normal theory prescribes to reject the null
hypothesis $H_0\colon \mu_2 \le 1$ for values of $nS^2$ exceeding the upper $\alpha$ point $\chi^2_{n-1,\alpha}$ of the $\chi^2_{n-1}$
distribution. If the observations are sampled from a normal distribution, then the test has
exactly level $\alpha$. Is this still approximately the case if the underlying distribution is not
normal? Unfortunately, the answer is negative.
For large values of $n$, this can be seen with the help of the preceding result. The central
limit theorem and the preceding example yield the two statements
$\frac{\chi^2_{n-1} - (n-1)}{\sqrt{2n-2}} \rightsquigarrow N(0, 1), \qquad \sqrt{n}\Bigl(\frac{S^2}{\mu_2} - 1\Bigr) \rightsquigarrow N(0, \kappa + 2),$
where $\kappa = \mu_4/\mu_2^2 - 3$ is the kurtosis of the underlying distribution. The first statement
implies that $\bigl(\chi^2_{n-1,\alpha} - (n-1)\bigr)/\sqrt{2n-2}$ converges to the upper $\alpha$ point $z_\alpha$ of the standard
normal distribution. Thus the level of the chi-square test satisfies
$P_{\mu_2 = 1}\bigl(nS^2 > \chi^2_{n-1,\alpha}\bigr) = P\Bigl(\sqrt{n}\Bigl(\frac{S^2}{\mu_2} - 1\Bigr) > \frac{\chi^2_{n-1,\alpha} - n}{\sqrt{n}}\Bigr) \to 1 - \Phi\Bigl(\frac{z_\alpha\sqrt{2}}{\sqrt{\kappa + 2}}\Bigr).$
The asymptotic level reduces to $1 - \Phi(z_\alpha) = \alpha$ if and only if the kurtosis of the underlying
distribution is 0. This is the case for normal distributions. On the other hand, heavy-tailed
distributions have a much larger kurtosis. If the kurtosis of the underlying distribution is
"close to" infinity, then the asymptotic level is close to $1 - \Phi(0) = 1/2$. We conclude that
the level of the chi-square test is nonrobust against departures of normality that affect the
value of the kurtosis. At least this is true if the critical values of the test are taken from
the chi-square distribution with $n - 1$ degrees of freedom. If, instead, we would use a
normal approximation to the distribution of $\sqrt{n}(S^2/\mu_2 - 1)$ the problem would not arise,
provided the asymptotic variance $\kappa + 2$ is estimated accurately. Table 3.1 gives the level
for two distributions with slightly heavier tails than the normal distribution. □

Table 3.1.
Law                               Level
Laplace                           0.12
0.95 N(0, 1) + 0.05 N(0, 9)       0.12
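The Laplace row of Table 3.1 can be explored by simulation. The sketch below is illustrative only; the sample sizes and replication count are arbitrary choices made here. The estimated level stays well above the nominal 0.05, in line with the nonrobustness just described.

```python
import numpy as np
from scipy.stats import chi2

def chisq_test_level(n, alpha=0.05, reps=20_000, rng=None):
    # Estimate P(n * S^2 > chi2_{n-1, alpha}) when the data are Laplace with variance 1.
    rng = np.random.default_rng(rng)
    crit = chi2.ppf(1 - alpha, df=n - 1)
    x = rng.laplace(0.0, 1.0 / np.sqrt(2), size=(reps, n))   # scale 1/sqrt(2) gives variance 1
    s2 = x.var(axis=1, ddof=0)                               # S^2 with factor 1/n, as in the text
    return np.mean(n * s2 > crit)

for n in (20, 100, 500):
    print(n, chisq_test_level(n))    # stays far above the nominal 0.05
```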
In the preceding example the asymptotic distribution of $\sqrt{n}(S^2 - \sigma^2)$ was obtained by the
delta method. Actually, it can also and more easily be derived by a direct expansion. Write
$\sqrt{n}(S^2 - \sigma^2) = \frac{1}{\sqrt{n}}\sum_{i=1}^n \bigl((X_i - \mu)^2 - \sigma^2\bigr) - \sqrt{n}\bigl(\overline{X} - \mu\bigr)^2.$
The second term converges to zero in probability; the first term is asymptotically normal
by the central limit theorem. The whole expression is asymptotically normal by Slutsky's
lemma.
Thus it is not always a good idea to apply general theorems. However, in many exam-
ples the delta method is a good way to package the mechanics of Taylor expansions in a
transparent way.
3.4 Example. Consider the joint limit distribution of the sample variance $S^2$ and the
$t$-statistic $\overline{X}/S$. Again for the limit distribution it does not make a difference whether we
use a factor $n$ or $n - 1$ to standardize $S^2$. For simplicity we use $n$. Then $(S^2, \overline{X}/S)$ can be
written as $\phi(\overline{X}, \overline{X^2})$ for the map $\phi\colon \mathbb{R}^2 \mapsto \mathbb{R}^2$ given by
$\phi(x, y) = \Bigl(y - x^2,\ \frac{x}{(y - x^2)^{1/2}}\Bigr).$
The joint limit distribution of $\sqrt{n}(\overline{X} - \alpha_1, \overline{X^2} - \alpha_2)$ is derived in the preceding example. The
map $\phi$ is differentiable at $\theta = (\alpha_1, \alpha_2)$ provided $\sigma^2 = \alpha_2 - \alpha_1^2$ is positive, with derivative
$\phi'_{(\alpha_1, \alpha_2)} = \begin{pmatrix} -2\alpha_1 & 1\\ \sigma^{-1} + \alpha_1^2\sigma^{-3} & -\frac{1}{2}\alpha_1\sigma^{-3} \end{pmatrix}.$
It follows that the sequence $\sqrt{n}\bigl(S^2 - \sigma^2,\ \overline{X}/S - \alpha_1/\sigma\bigr)$ is asymptotically bivariate normally
distributed, with zero mean and covariance matrix,
3.5 Example (Skewness). The sample skewness of a sample $X_1, \ldots, X_n$ is defined as
$l_n = \frac{n^{-1}\sum_{i=1}^n (X_i - \overline{X})^3}{\bigl(n^{-1}\sum_{i=1}^n (X_i - \overline{X})^2\bigr)^{3/2}}.$
The sequence $\sqrt{n}(\overline{X} - \alpha_1, \overline{X^2} - \alpha_2, \overline{X^3} - \alpha_3)$ is asymptotically mean-zero normal by the
central limit theorem, provided $\mathrm{E}X_1^6$ is finite. The value $\phi(\alpha_1, \alpha_2, \alpha_3)$ is exactly the popu-
lation skewness. The function $\phi$ is differentiable at the point $(\alpha_1, \alpha_2, \alpha_3)$ and application of
the delta method is straightforward. We can save work by noting that the sample skewness
is location and scale invariant. With $Y_i = (X_i - \alpha_1)/\sigma$, the skewness can also be written as
$\phi(\overline{Y}, \overline{Y^2}, \overline{Y^3})$. With $\lambda = \mu_3/\sigma^3$ denoting the skewness of the underlying distribution, the
$\overline{Y}$s satisfy
$\sqrt{n}\begin{pmatrix}\overline{Y}\\ \overline{Y^2} - 1\\ \overline{Y^3} - \lambda\end{pmatrix} \rightsquigarrow N\left(0,\ \begin{pmatrix} 1 & \lambda & \kappa + 3\\ \lambda & \kappa + 2 & \mu_5/\sigma^5 - \lambda\\ \kappa + 3 & \mu_5/\sigma^5 - \lambda & \mu_6/\sigma^6 - \lambda^2\end{pmatrix}\right).$
The derivative of ¢ at the point (0, 1, A) equals (-3, -3A/2, 1). Hence, if T possesses the
normal distribution in the display, then ../ii(In -)..) is asymptotically normal distributed with
mean zero and variance equal to var( -3Tl - 3AT2/2 + T3). If the underlying distribution
asymptotically N(O,
6)-distributed.
°
is normal, then A = JLs = 0, K = and JL6/a6 = 15. In that case the sample skewness is
An approximate level α test for normality based on the sample skewness could be to reject normality if √n|lₙ| > √6 z_{α/2}. Table 3.2 gives the level of this test for different values of n. □
Table 3.2.
n     Level
10    0.02
20    0.03
30    0.03
50    0.05
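Values of this kind can be reproduced by simulation. The following sketch is not part of the original text; it uses NumPy and SciPy to estimate the exact level of the skewness test under normal samples for a few sample sizes, illustrating how slowly the nominal level α = 0.05 is reached.

```python
import numpy as np
from scipy import stats

def skewness_test_level(n, alpha=0.05, reps=50000, rng=None):
    """Monte Carlo level of the test rejecting normality when
    sqrt(n) * |sample skewness| exceeds sqrt(6) * z_{alpha/2}."""
    rng = np.random.default_rng(rng)
    x = rng.standard_normal((reps, n))
    xbar = x.mean(axis=1, keepdims=True)
    s = x.std(axis=1)                       # factor 1/n, as in the text
    skew = ((x - xbar) ** 3).mean(axis=1) / s ** 3
    crit = np.sqrt(6.0) * stats.norm.ppf(1 - alpha / 2)
    return np.mean(np.sqrt(n) * np.abs(skew) > crit)

for n in (10, 20, 30, 50):
    print(n, round(skewness_test_level(n, rng=0), 3))
```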
For φ chosen such that φ′(θ)σ(θ) ≡ 1, the asymptotic variance is constant and finding an asymptotic confidence interval for η = φ(θ) is easy. The solution

φ(θ) = ∫ 1/σ(θ) dθ

is a variance-stabilizing transformation.
It does not work very well to base an asymptotic confidence interval directly on this result.
Figure 3.1. Histogram of 1000 sample correlation coefficients, based on 1000 independent samples of the bivariate normal distribution with correlation 0.6, and histogram of the arctanh of these values.
The transformation

φ(ρ) = ∫ 1/(1 − ρ²) dρ = ½ log((1 + ρ)/(1 − ρ)) = arctanh ρ

is variance stabilizing. Thus, the sequence √n(arctanh rₙ − arctanh ρ) converges to a standard normal distribution for every ρ. This leads to the asymptotic confidence interval for the correlation coefficient ρ given by

(tanh(arctanh rₙ − z_α/√n), tanh(arctanh rₙ + z_α/√n)).
Table 3.3 gives an indication of the accuracy of this interval. Besides stabilizing the variance, the arctanh transformation has the benefit of symmetrizing the distribution of the sample correlation coefficient (which is perhaps of greater importance), as can be seen in Figure 3.1.
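A minimal sketch of this interval in code (not from the original text; it takes z_α to be the upper α standard normal quantile, so the interval below has asymptotic confidence level 1 − 2α):

```python
import numpy as np
from scipy import stats

def correlation_ci(x, y, alpha=0.05):
    """Confidence interval for the correlation based on the
    variance-stabilizing arctanh (Fisher z) transformation."""
    n = len(x)
    r = np.corrcoef(x, y)[0, 1]
    z_alpha = stats.norm.ppf(1 - alpha)
    lo = np.tanh(np.arctanh(r) - z_alpha / np.sqrt(n))
    hi = np.tanh(np.arctanh(r) + z_alpha / np.sqrt(n))
    return lo, hi

# Example: bivariate normal data with correlation 0.6, as in Figure 3.1.
rng = np.random.default_rng(0)
x, y = rng.multivariate_normal([0, 0], [[1, 0.6], [0.6, 1]], size=1000).T
print(correlation_ci(x, y))
```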
In the one-dimensional case, a Taylor expansion applied to a statistic Tₙ has the form

φ(Tₙ) = φ(θ) + φ′(θ)(Tₙ − θ) + ½ φ″(θ)(Tₙ − θ)² + ⋯.

Usually the linear term φ′(θ)(Tₙ − θ) is of higher order than the remainder, and thus determines the order at which φ(Tₙ) − φ(θ) converges to zero: the same order as Tₙ − θ. Then the approach of the preceding section gives the limit distribution of φ(Tₙ) − φ(θ). If φ′(θ) = 0, this approach is still valid but not of much interest, because the resulting limit distribution is degenerate at zero. Then it is more informative to multiply the difference φ(Tₙ) − φ(θ) by a higher rate and obtain a nondegenerate limit distribution. Looking at the Taylor expansion, we see that the linear term disappears if φ′(θ) = 0, and we expect that the quadratic term determines the limit behavior of φ(Tₙ).
3.7 Example. Suppose that √n X̄ converges weakly to a standard normal distribution. Because the derivative of x ↦ cos x is zero at x = 0, the standard delta method of the preceding section yields that √n(cos X̄ − cos 0) converges weakly to 0. It should be concluded that √n is not the right norming rate for the random sequence cos X̄ − 1. A more informative statement is that −2n(cos X̄ − 1) converges in distribution to a chi-square distribution with one degree of freedom. The explanation is that the first nonzero term in the Taylor expansion of cos x − 1 at zero is the quadratic term −x²/2, so that −2n(cos X̄ − 1) = (√n X̄)² + oP(1).
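A short simulation (not in the original text) makes the two norming rates concrete: at the rate √n the transformed statistic collapses to zero, while −2n(cos X̄ − 1) matches the χ²₁ quantiles.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, reps = 400, 20000
xbar = rng.standard_normal((reps, n)).mean(axis=1)   # sqrt(n)*xbar is approx N(0, 1)

stat = -2 * n * (np.cos(xbar) - 1)
print(np.sqrt(n) * (np.cos(xbar) - 1).mean())        # near 0: sqrt(n) is the wrong rate
print(np.quantile(stat, [0.5, 0.9, 0.95]))           # approx chi2_1 quantiles
print(stats.chi2.ppf([0.5, 0.9, 0.95], df=1))        # 0.45, 2.71, 3.84
```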
If the sequences Tₙ,ⱼ − θⱼ are of different order, then it may happen, for instance, that the linear part involving Tₙ,ⱼ − θⱼ is of the same order as the quadratic part involving (Tₙ,ⱼ − θⱼ)². Thus, it is necessary to determine carefully the rate of all terms in the expansion, and to rearrange these in decreasing order of magnitude, before neglecting the "remainder."
Several applications of Slutsky's lemma and the delta method yield as limit in law the vector φ′_θ(T + h) − φ′_θ(h) = φ′_θ(T), if T is the limit in distribution of √n(Tₙ − θₙ). For θₙ → θ at a slower rate, this argument does not work. However, the same result is true under a slightly stronger differentiability assumption on φ.
3.8 Theorem. Let φ: ℝᵏ → ℝᵐ be a map defined and continuously differentiable in a neighborhood of θ. Let Tₙ be random vectors taking their values in the domain of φ. If rₙ(Tₙ − θₙ) ⇝ T for vectors θₙ → θ and numbers rₙ → ∞, then rₙ(φ(Tₙ) − φ(θₙ)) ⇝ φ′_θ(T). Moreover, the difference between rₙ(φ(Tₙ) − φ(θₙ)) and φ′_θ(rₙ(Tₙ − θₙ)) converges to zero in probability.
Proof. It suffices to prove the last assertion. Because convergence in probability to zero of vectors is equivalent to convergence to zero of the components separately, it is no loss of generality to assume that φ is real-valued. For 0 ≤ t ≤ 1 and fixed h, define gₙ(t) = φ(θₙ + th). For sufficiently large n and sufficiently small h, both θₙ and θₙ + h are in a ball around θ inside the neighborhood on which φ is differentiable. Then gₙ: [0, 1] → ℝ is continuously differentiable with derivative gₙ′(t) = φ′_{θₙ+th}(h). By the mean-value theorem, gₙ(1) − gₙ(0) = gₙ′(ξₙ) for some 0 ≤ ξₙ ≤ 1. In other words,

φ(θₙ + h) − φ(θₙ) = φ′_{θₙ+ξₙh}(h) = φ′_θ(h) + Rₙ(h),   Rₙ(h) = (φ′_{θₙ+ξₙh} − φ′_θ)(h).

By the continuity of the map θ ↦ φ′_θ, there exists for every ε > 0 a δ > 0 such that ‖(φ′_ξ − φ′_θ)(h)‖ < ε‖h‖ for every ‖ξ − θ‖ < δ and every h. For sufficiently large n and ‖h‖ < δ/2, the vectors θₙ + ξₙh are within distance δ of θ, so that the norm ‖Rₙ(h)‖ of the right side of the preceding display is bounded by ε‖h‖. Thus, for any η > 0,

P(rₙ‖Rₙ(Tₙ − θₙ)‖ > η) ≤ P(‖Tₙ − θₙ‖ ≥ δ/2) + P(rₙ‖Tₙ − θₙ‖ ε > η).

The first term converges to zero as n → ∞. The second term can be made arbitrarily small by choosing ε small. ∎
*3.5 Moments
So far we have discussed the stability of convergence in distribution under transformations. We can pose the same problem regarding moments: Can an expansion for the moments of φ(Tₙ) − φ(θ) be derived from a similar expansion for the moments of Tₙ − θ? In principle the answer is affirmative, but unlike in the distributional case, in which a simple derivative of φ is enough, global regularity conditions on φ are needed to argue that the remainder terms are negligible.
One possible approach is to apply the distributional delta method first, thus yielding the qualitative asymptotic behavior. Next, the convergence of the moments of φ(Tₙ) − φ(θ) (or a remainder term) is a matter of uniform integrability, in view of Lemma 2.20. If φ is uniformly Lipschitz, then this uniform integrability follows from the corresponding uniform integrability of Tₙ − θ. If φ has an unbounded derivative, then the connection between moments of φ(Tₙ) − φ(θ) and Tₙ − θ is harder to make, in general.
Notes
The Delta method belongs to the folklore of statistics. It is not entirely trivial; proofs are
sometimes based on the mean-value theorem and then require continuous differentiability in
a neighborhood. A generalization to functions on infinite-dimensional spaces is discussed
in Chapter 20.
PROBLEMS
1. Find the joint limit distribution of (√n(X̄ − μ), √n(S² − σ²)) if X̄ and S² are based on a sample of size n from a distribution with finite fourth moment. Under what condition on the underlying distribution are √n(X̄ − μ) and √n(S² − σ²) asymptotically independent?
2. Find the asymptotic distribution of √n(r − ρ) if r is the correlation coefficient of a sample of n bivariate vectors with finite fourth moments. (This is quite a bit of work. It helps to assume that the mean and the variance are equal to 0 and 1, respectively.)
3. Investigate the asymptotic robustness of the level of the t-test for testing the mean that rejects H₀: μ ≤ 0 if √n X̄/S is larger than the upper α quantile of the t_{n−1} distribution.
4. Find the limit distribution of the sample kurtosis kₙ = n⁻¹ Σᵢ₌₁ⁿ (Xᵢ − X̄)⁴/S⁴ − 3, and design an asymptotic level α test for normality based on kₙ. (Warning: At least 500 observations are needed to make the normal approximation work in this case.)
5. Design an asymptotic level α test for normality based on the sample skewness and kurtosis jointly.
6. Let X₁, …, Xₙ be i.i.d. with expectation μ and variance 1. Find constants aₙ and bₙ such that aₙ(X̄ₙ² − bₙ) converges in distribution if μ = 0 or μ ≠ 0.
7. Let X₁, …, Xₙ be a random sample from the Poisson distribution with mean θ. Find a variance-stabilizing transformation for the sample mean, and construct a confidence interval for θ based on this.
8. Let X₁, …, Xₙ be i.i.d. with expectation 1 and finite variance. Find the limit distribution of √n(X̄ₙ⁻¹ − 1). If the random variables are sampled from a density f that is bounded and strictly positive in a neighborhood of zero, show that E|X̄ₙ⁻¹| = ∞ for every n. (The density of X̄ₙ is bounded away from zero in a neighborhood of zero for every n.)
(1/n) Σᵢ₌₁ⁿ fⱼ(Xᵢ) = E_θ fⱼ(X),   j = 1, …, k,

for given functions f₁, …, fₖ. Thus the parameter is chosen such that the sample moments (on the left side) match the theoretical moments. If the parameter is k-dimensional one usually tries to match k moments in this manner. The choices fⱼ(x) = xʲ lead to the method of moments in its simplest form.
Moment estimators are not necessarily the best estimators, but under reasonable conditions they have convergence rate √n and are asymptotically normal. This is a consequence of the delta method. Write the given functions in the vector notation f = (f₁, …, fₖ), and let e: Θ → ℝᵏ be the vector-valued expectation e(θ) = P_θ f. Then the moment estimator θ̂ₙ solves the system of equations
en solves the system of equations
1 n
W'nf == -
n
L f(X
i=1
i) = e(O) == Po!.
For existence of the moment estimator, it is necessary that the vector ℙₙ f be in the range of the function e. If e is one-to-one, then the moment estimator is uniquely determined as θ̂ₙ = e⁻¹(ℙₙ f), and

√n(θ̂ₙ − θ₀) = √n(e⁻¹(ℙₙ f) − e⁻¹(P_{θ₀} f)).

If ℙₙ f is asymptotically normal and e⁻¹ is differentiable, then the right side is asymptotically normal by the delta method.
The derivative of e⁻¹ at e(θ₀) is the inverse e′_{θ₀}⁻¹ of the derivative of e at θ₀. Because the function e⁻¹ is often not explicit, it is convenient to ascertain its differentiability from the differentiability of e. This is possible by the inverse function theorem. According to this theorem a map that is (continuously) differentiable throughout an open set with nonsingular derivatives is locally one-to-one, is of full rank, and has a differentiable inverse. Thus we obtain the following theorem.

4.1 Theorem. Suppose that e(θ) = P_θ f is one-to-one on an open set Θ ⊂ ℝᵏ and continuously differentiable at θ₀ with nonsingular derivative e′_{θ₀}. Moreover, assume that P_{θ₀}‖f‖² < ∞. Then moment estimators θ̂ₙ exist with probability tending to one and satisfy

√n(θ̂ₙ − θ₀) ⇝ N(0, e′_{θ₀}⁻¹ P_{θ₀}(f − P_{θ₀}f)(f − P_{θ₀}f)ᵀ (e′_{θ₀}⁻¹)ᵀ).
For completeness, the following two lemmas constitute, if combined, a proof of the
inverse function theorem. If necessary the preceding theorem can be strengthened somewhat
by applying the lemmas directly. Furthermore, the first lemma can be easily generalized to
infinite-dimensional parameters, such as used in the semiparametric models discussed in
Chapter 25.
4.2 Lemma. Let Θ ⊂ ℝᵏ be arbitrary and let e: Θ → ℝᵏ be one-to-one and differentiable at a point θ₀ with a nonsingular derivative. Then the inverse e⁻¹ (defined on the range of e) is differentiable at e(θ₀) provided it is continuous at e(θ₀).

Proof. Write η = e(θ₀) and Δₕ = e⁻¹(η + h) − e⁻¹(η). Because e⁻¹ is continuous at η, we have that Δₕ → 0 as h → 0. Thus
sufficiently small open neighborhood U of θ₀ onto an open set V, and e⁻¹: V → U is well defined and continuous.
Consequently, φ maps ball(θ₁, ε) into itself. Because φ is a contraction, it has a fixed point θ ∈ ball(θ₁, ε): a point with φ(θ) = θ. By definition of φ this satisfies e(θ) = η. Any other θ̃ with e(θ̃) = η is also a fixed point of φ. In that case the difference θ̃ − θ = φ(θ̃) − φ(θ) has norm bounded by ½‖θ̃ − θ‖. This can only happen if θ̃ = θ. Hence e is one-to-one throughout U. ∎
4.4 Example. Let X₁, …, Xₙ be a random sample from the beta-distribution: The common density is equal to x ↦ x^{α−1}(1 − x)^{β−1}/B(α, β) on (0, 1). The moment estimator for (α, β) is the solution of the system of equations

X̄ₙ = E_{α,β} X₁ = α/(α + β),
X̄²ₙ = E_{α,β} X₁² = (α + 1)α/((α + β + 1)(α + β)).

The right-hand side is a smooth and regular function of (α, β), and the equations can be solved explicitly. Hence, the moment estimators exist and are asymptotically normal. □
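As a concrete illustration (not part of the original text), the explicit solution of the two moment equations can be written in terms of the sample mean and the biased sample variance, since matching the mean and variance is equivalent to matching the first two moments. The sketch below assumes the data really lie in (0, 1).

```python
import numpy as np

def beta_moment_estimators(x):
    """Method-of-moments estimates (alpha, beta) for a beta sample,
    obtained by matching the first two moments."""
    m = x.mean()
    v = x.var()                       # biased sample variance, factor 1/n
    common = m * (1 - m) / v - 1      # equals alpha + beta
    return m * common, (1 - m) * common

rng = np.random.default_rng(0)
x = rng.beta(2.0, 5.0, size=2000)
print(beta_moment_estimators(x))      # close to (2, 5) for large samples
```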
Thus h and t = (t₁, …, tₖ) are known functions on the sample space, and the family is given in its natural parametrization. The parameter set Θ must be contained in the natural parameter space for the family. This is the set of θ for which p_θ can define a probability density. If μ is the dominating measure, then this is the right side in
Here the second equality is an example of the general rule that score functions have zero means. It can formally be established by differentiating the identity ∫ p_θ dμ ≡ 1 under the integral sign: Combine the lemma and the Leibniz rule to see that

∂/∂θⱼ ∫ c(θ) h(x) e^{θᵀt(x)} dμ(x) = ∫ (∂c/∂θⱼ)(θ) h(x) e^{θᵀt(x)} dμ(x) + ∫ c(θ) h(x) tⱼ(x) e^{θᵀt(x)} dμ(x).

The left side is zero and the equation can be rewritten as 0 = (∂c/∂θⱼ)(θ)/c(θ) + E_θ tⱼ(X).
It follows that the likelihood equations Σᵢ ℓ̇_θ(Xᵢ) = 0 reduce to the system of k equations

(1/n) Σᵢ₌₁ⁿ t(Xᵢ) = E_θ t(X).

Thus, the maximum likelihood estimators are moment estimators. Their asymptotic properties depend on the function e(θ) = E_θ t(X), which is very well behaved on the interior of the natural parameter set. By differentiating E_θ t(X) under the expectation sign (which is justified by the lemma), we see that its derivative matrices are given by

e′_θ = Cov_θ t(X).
The exponential family is said to be of full rank if no linear combination Σ λⱼtⱼ(X) is constant with probability 1; equivalently, if the covariance matrix of t(X) is nonsingular. In view of the preceding display, this ensures that the derivative e′_θ is strictly positive-definite throughout the interior of the natural parameter set. Then e is one-to-one, so that there exists at most one solution to the moment equations. (Cf. Problem 4.6.) In view of the expression for ℓ̇_θ, the matrix −e′_θ is, up to the factor n, the second-derivative matrix (Hessian) of the log likelihood Σᵢ₌₁ⁿ ℓ_θ(Xᵢ). Thus, a solution to the moment equations must be a point of maximum of the log likelihood.
A solution can be shown to exist (within the natural parameter space) with probability 1 if the exponential family is "regular," or more generally "steep" (see [17]); it is then a point of absolute maximum of the likelihood. If the true parameter is in the interior of the parameter set, then a (unique) solution θ̂ₙ exists with probability tending to 1 as n → ∞, in any case, by Theorem 4.1. Moreover, this theorem shows that the sequence √n(θ̂ₙ − θ₀) is asymptotically normal with covariance matrix (Cov_{θ₀} t(X))⁻¹.
So far we have considered an exponential family in standard form. Many examples arise in the form

p_θ(x) = d(θ) h(x) e^{Q(θ)ᵀ t(x)}.   (4.6)
4.6 Theorem. Let Θ ⊂ ℝᵏ be open and let Q: Θ → ℝᵏ be one-to-one and continuously differentiable throughout Θ with nonsingular derivatives. Let the (exponential) family of densities p_θ be given by (4.6) and be of full rank. Then the likelihood equations have a unique solution θ̂ₙ with probability tending to 1 and √n(θ̂ₙ − θ) ⇝ N(0, I_θ⁻¹) for every θ.

Proof. According to the inverse function theorem, the range of Q is open and the inverse map Q⁻¹ is differentiable throughout this range. Thus, as discussed previously, the delta method ensures the asymptotic normality. It suffices to calculate the asymptotic covariance matrix. By the preceding discussion this is equal to

Q̇_θ⁻¹ (Cov_θ t(X))⁻¹ (Q̇_θᵀ)⁻¹.

By direct calculation, the score function for the model is equal to ℓ̇_θ(x) = ḋ/d(θ) + Q̇_θᵀ t(x). As before, the score function has mean zero, so that this can be rewritten as ℓ̇_θ(x) = Q̇_θᵀ (t(x) − E_θ t(X)). Thus, the Fisher information matrix equals I_θ = Q̇_θᵀ Cov_θ t(X) Q̇_θ. This is the inverse of the asymptotic covariance matrix given in the preceding display. ∎
Not all exponential families satisfy the conditions of the theorem. For instance, the normal N(θ, θ²) family is an example of a "curved exponential family." The map Q(θ) = (θ⁻², θ⁻¹) (with t(x) = (−x²/2, x)) does not fill up the natural parameter space of the normal location-scale family but only traces out a one-dimensional curve. In such cases the result of the theorem may still hold. In fact, the result is true for most models with "smooth parametrizations," as is seen in Chapter 5. However, the "easy" proof of this section is not valid.
PROBLEMS
1. Let X₁, …, Xₙ be a sample from the uniform distribution on [−θ, θ]. Find the moment estimator of θ based on X̄². Is it asymptotically normal? Can you think of an estimator for θ that converges faster to the parameter?
2. Let X₁, …, Xₙ be a sample from a density p_θ and f a function such that e(θ) = E_θ f(X) is differentiable with e′(θ) = E_θ ℓ̇_θ(X) f(X) for ℓ_θ = log p_θ.
(i) Show that the asymptotic variance of the moment estimator based on f equals var_θ(f)/cov_θ(f, ℓ̇_θ)².
(ii) Show that this is bigger than I_θ⁻¹ with equality for all θ if and only if the moment estimator is the maximum likelihood estimator.
(iii) Show that the latter happens only for exponential family members.
3. To what extent does the result of Theorem 4.1 require that the observations are i.i.d.?
4. Let the observations be a sample of size n from the N(μ, σ²) distribution. Calculate the Fisher information matrix for the parameter θ = (μ, σ²) and its inverse. Check directly that the maximum likelihood estimator is asymptotically normal with zero mean and covariance matrix I_θ⁻¹.
5. Establish the formula e′_θ = Cov_θ t(X) by differentiating e(θ) = E_θ t(X) under the integral sign. (Differentiating under the integral sign is justified by Lemma 4.5, because E_θ t(X) is the first derivative of c(θ)⁻¹.)
6. Suppose a function e: Θ → ℝᵏ is defined and continuously differentiable on a convex subset Θ ⊂ ℝᵏ with strictly positive-definite derivative matrix. Then e has at most one zero in Θ. (Consider the function g(λ) = (θ₁ − θ₂)ᵀ e(λθ₁ + (1 − λ)θ₂) for given θ₁ ≠ θ₂ and 0 ≤ λ ≤ 1. If g(0) = g(1) = 0, then there exists a point λ₀ with g′(λ₀) = 0 by the mean-value theorem.)
5.1 Introduction
Suppose that we are interested in a parameter (or "functional") θ attached to the distribution of observations X₁, …, Xₙ. A popular method for finding an estimator θ̂ₙ = θ̂ₙ(X₁, …, Xₙ) is to maximize a criterion function of the type

Mₙ(θ) = (1/n) Σᵢ₌₁ⁿ m_θ(Xᵢ).   (5.1)

Here m_θ: 𝒳 → ℝ are known functions. An estimator maximizing Mₙ(θ) over Θ is called an M-estimator. In this chapter we investigate the asymptotic behavior of sequences of M-estimators.
Often the maximizing value is sought by setting a derivative (or the set of partial derivatives in the multidimensional case) equal to zero. Therefore, the name M-estimator is also used for estimators satisfying systems of equations of the type

Ψₙ(θ) = (1/n) Σᵢ₌₁ⁿ ψ_θ(Xᵢ) = 0.   (5.2)
Here ψ_θ are known vector-valued maps. For instance, if θ is k-dimensional, then ψ_θ typically has k coordinate functions ψ_θ,1, …, ψ_θ,k, and (5.2) is shorthand for the system of equations

Σᵢ₌₁ⁿ ψ_θ,ⱼ(Xᵢ) = 0,   j = 1, 2, …, k.

Even though in many examples ψ_θ,ⱼ is the jth partial derivative of some function m_θ, this is irrelevant for the following. Equations, such as (5.2), defining an estimator are called estimating equations and need not correspond to a maximization problem. In the latter case it is probably better to call the corresponding estimators Z-estimators (for zero), but the use of the name M-estimator is widespread.
Sometimes the maximum of the criterion function Mₙ is not taken or the estimating equation does not have an exact solution. Then it is natural to use as estimator a value that almost maximizes the criterion function or is a near zero. This yields approximate M-estimators or Z-estimators. Estimators that are sufficiently close to being a point of maximum or a zero often have the same asymptotic behavior.
An operator notation for taking expectations simplifies the formulas in this chapter. We write P for the marginal law of the observations X₁, …, Xₙ, which we assume to be identically distributed. Furthermore, we write Pf for the expectation Ef(X) = ∫ f dP and abbreviate the average n⁻¹ Σᵢ₌₁ⁿ f(Xᵢ) to ℙₙf. Thus ℙₙ is the empirical distribution: the (random) discrete distribution that puts mass 1/n at each of the observations X₁, …, Xₙ. The criterion functions now take the forms Mₙ(θ) = ℙₙm_θ and Ψₙ(θ) = ℙₙψ_θ.
5.3 Example (Maximum likelihood estimators). Suppose X₁, …, Xₙ have a common density p_θ. Then the maximum likelihood estimator maximizes the likelihood ∏ᵢ₌₁ⁿ p_θ(Xᵢ), or equivalently the log likelihood

θ ↦ Σᵢ₌₁ⁿ log p_θ(Xᵢ).

(Define log 0 = −∞.) However, this function is not smooth in θ and there exists no natural version of (5.2). Thus, in this example the definition as the location of a maximum is more fundamental than the definition as a zero. □
5.4 Example (Location estimators). Let X₁, …, Xₙ be a random sample of real-valued observations and suppose we want to estimate the location of their distribution. "Location" is a vague term; it could be made precise by defining it as the mean or median, or the center of symmetry of the distribution if this happens to be symmetric. Two examples of location estimators are the sample mean and the sample median. Both are Z-estimators, because they solve the equations

Σᵢ₌₁ⁿ (Xᵢ − θ) = 0   and   Σᵢ₌₁ⁿ sign(Xᵢ − θ) = 0,

respectively.† Both estimating equations involve functions of the form ψ(x − θ) for a function ψ that is monotone and odd around zero. It seems reasonable to study estimators that solve a general equation of the type

Σᵢ₌₁ⁿ ψ(Xᵢ − θ) = 0.
The Huber estimators correspond to the choice

ψ(x) = −k if x < −k,   x if |x| ≤ k,   k if x ≥ k.
The Huber estimators were motivated by studies in robust statistics concerning the influ-
ence of extreme data points on the estimate. The exact values of the largest and smallest
observations have very little influence on the value of the median, but a proportional influ-
ence on the mean. Therefore, the sample mean is considered nonrobust against outliers. If
the extreme observations are thought to be rather unreliable, it is certainly an advantage to
limit their influence on the estimate, but the median may be too successful in this respect.
Depending on the value of k, the Huber estimators behave more like the mean (large k) or
more like the median (small k) and thus bridge the gap between the nonrobust mean and
very robust median.
Another example are the quantiles. A pth sample quantile is roughly a point θ such that pn observations are less than θ and (1 − p)n observations are greater than θ. The precise definition has to take into account that the value pn may not be an integer. One possibility is to call a pth sample quantile any θ̂ that solves the inequalities

−1 < Σᵢ₌₁ⁿ ((1 − p) 1{Xᵢ < θ} − p 1{Xᵢ > θ}) < 1.   (5.5)

† The sign-function is defined as sign(x) = −1, 0, 1 if x < 0, x = 0 or x > 0, respectively. Also x⁺ means x ∨ 0 = max(x, 0). For the median we assume that there are no tied observations (in the middle).
Figure 5.1. The functions θ ↦ Ψₙ(θ) for the 80% quantile and the Huber estimator for samples of size 15 from the gamma(8, 1) and standard normal distribution, respectively.
All the estimators considered so far can also be defined as a solution of a maximization problem. Mean, median, Huber estimators, and quantiles minimize Σᵢ₌₁ⁿ m(Xᵢ − θ) for m equal to x², |x|, x² 1{|x| ≤ k} + (2k|x| − k²) 1{|x| > k} and (1 − p)x⁻ + px⁺, respectively. □
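To make the Z-estimation recipe concrete, here is a small sketch (not from the original text) that solves Σψ(Xᵢ − θ) = 0 for Huber's ψ by bisection; the bracketing interval [min Xᵢ, max Xᵢ] works because the estimating function is monotone in θ.

```python
import numpy as np
from scipy.optimize import brentq

def huber_psi(x, k=1.345):
    """Huber's psi: the identity truncated to [-k, k]."""
    return np.clip(x, -k, k)

def huber_location(x, k=1.345):
    """Z-estimator of location solving sum(psi(x_i - theta)) = 0."""
    equation = lambda theta: huber_psi(x - theta, k).sum()
    return brentq(equation, x.min(), x.max())

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0.0, 1.0, 95), rng.normal(8.0, 1.0, 5)])  # 5% outliers
print(np.mean(x), np.median(x), huber_location(x))
```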
5.2 Consistency
If the estimator θ̂ₙ is used to estimate the parameter θ, then it is certainly desirable that the sequence θ̂ₙ converges in probability to θ. If this is the case for every possible value of the parameter, then the sequence of estimators is called asymptotically consistent. For instance, the sample mean X̄ₙ is asymptotically consistent for the population mean EX (provided the population mean exists). This follows from the law of large numbers. Not surprisingly this extends to many other sample characteristics. For instance, the sample median is consistent for the population median, whenever this is well defined. What can be said about M-estimators in general? We shall assume that the set Θ of possible parameters is a metric space, and write d for the metric. Then we wish to prove that d(θ̂ₙ, θ₀) → 0 in probability for some value θ₀, which depends on the underlying distribution of the observations.
Suppose that the M-estimator θ̂ₙ maximizes the random criterion function θ ↦ Mₙ(θ). Clearly, the "asymptotic value" of θ̂ₙ depends on the asymptotic behavior of the functions Mₙ. Under suitable normalization there typically exists a deterministic "asymptotic criterion function" θ ↦ M(θ) such that

Mₙ(θ) →ᴾ M(θ),   every θ.   (5.6)

For instance, if Mₙ(θ) is an average of the form ℙₙm_θ as in (5.1), then the law of large numbers gives this result with M(θ) = Pm_θ, provided this expectation exists.
It seems reasonable to expect that the maximizer θ̂ₙ of Mₙ converges to the maximizing value θ₀ of M. This is what we wish to prove in this section, and we say that θ̂ₙ is (asymptotically) consistent for θ₀. However, the convergence (5.6) is too weak to ensure the convergence of θ̂ₙ. Because the value θ̂ₙ depends on the whole function θ ↦ Mₙ(θ), an appropriate form of "functional convergence" of Mₙ to M is needed, strengthening the pointwise convergence (5.6). There are several possibilities. In this section we first discuss an approach based on uniform convergence of the criterion functions. Admittedly, the assumption of uniform convergence is too strong for some applications and it is sometimes not easy to verify, but the approach illustrates the general idea.
Given an arbitrary random function θ ↦ Mₙ(θ), consider estimators θ̂ₙ that nearly maximize Mₙ, that is,

Mₙ(θ̂ₙ) ≥ sup_θ Mₙ(θ) − oP(1).

Then certainly Mₙ(θ̂ₙ) ≥ Mₙ(θ₀) − oP(1), which turns out to be enough to ensure consistency. It is assumed that the sequence Mₙ converges to a nonrandom map M: Θ → ℝ. Condition (5.8) of the following theorem requires that this map attains its maximum at a unique point θ₀, and only parameters close to θ₀ may yield a value of M(θ) close to the maximum value M(θ₀). Thus, θ₀ should be a well-separated point of maximum of M. Figure 5.2 shows a function that does not satisfy this requirement.

Figure 5.2. Example of a function whose point of maximum is not well separated.
5.7 Theorem. Let Mₙ be random functions and let M be a fixed function of θ such that for every ε > 0†

sup_{θ∈Θ} |Mₙ(θ) − M(θ)| →ᴾ 0,
sup_{θ: d(θ,θ₀)≥ε} M(θ) < M(θ₀).   (5.8)

Then any sequence of estimators θ̂ₙ with Mₙ(θ̂ₙ) ≥ Mₙ(θ₀) − oP(1) converges in probability to θ₀.

† Some of the expressions in this display may be nonmeasurable. Then the probability statements are understood in terms of outer measure.
Proof. By the property of θ̂ₙ, we have Mₙ(θ̂ₙ) ≥ Mₙ(θ₀) − oP(1). Because the uniform convergence of Mₙ to M implies the convergence Mₙ(θ₀) →ᴾ M(θ₀), the right side equals M(θ₀) − oP(1). It follows that Mₙ(θ̂ₙ) ≥ M(θ₀) − oP(1), whence
Then it may be expected that a sequence of (approximate) zeros of Ψₙ converges in probability to a zero of Ψ. This is true under similar restrictions as in the case of maximizing M-estimators. In fact, this can be deduced from the preceding theorem by noting that a zero of Ψₙ maximizes the function θ ↦ −‖Ψₙ(θ)‖.
5.9 Theorem. Let Ψₙ be random vector-valued functions and let Ψ be a fixed vector-valued function of θ such that for every ε > 0

sup_{θ∈Θ} ‖Ψₙ(θ) − Ψ(θ)‖ →ᴾ 0,
inf_{θ: d(θ,θ₀)≥ε} ‖Ψ(θ)‖ > 0 = ‖Ψ(θ₀)‖.

Then any sequence of estimators θ̂ₙ such that Ψₙ(θ̂ₙ) = oP(1) converges in probability to θ₀.

Proof. This follows from the preceding theorem, on applying it to the functions Mₙ(θ) = −‖Ψₙ(θ)‖ and M(θ) = −‖Ψ(θ)‖. ∎
The conditions of both theorems consist of a stochastic and a deterministic part. The deterministic condition can be verified by drawing a picture of the graph of the function. A helpful general observation is that, for a compact set Θ and continuous function M or Ψ, uniqueness of θ₀ as a maximizer or zero implies the condition. (See Problem 5.27.)
For Mₙ(θ) or Ψₙ(θ) equal to averages as in (5.1) or (5.2), the uniform convergence required by the stochastic condition is equivalent to the set of functions {m_θ: θ ∈ Θ} or {ψ_θ,ⱼ: θ ∈ Θ, j = 1, …, k} being Glivenko-Cantelli. Glivenko-Cantelli classes of functions are discussed in Chapter 19. One simple set of sufficient conditions is that Θ be compact, that the functions θ ↦ m_θ(x) or θ ↦ ψ_θ(x) are continuous for every x, and that they are dominated by an integrable function.
Uniform convergence of the criterion functions as in the preceding theorems is much stronger than needed for consistency. The following lemma is one of the many possibilities to replace the uniformity by other assumptions.
5.10 Lemma. Let Θ be a subset of the real line and let Ψₙ be random functions and Ψ a fixed function of θ such that Ψₙ(θ) → Ψ(θ) in probability for every θ. Assume that each map θ ↦ Ψₙ(θ) is continuous and has exactly one zero θ̂ₙ, or is nondecreasing with Ψₙ(θ̂ₙ) = oP(1). Let θ₀ be a point such that Ψ(θ₀ − ε) < 0 < Ψ(θ₀ + ε) for every ε > 0. Then θ̂ₙ →ᴾ θ₀.
Proof. If the map θ ↦ Ψₙ(θ) is continuous and has a unique zero at θ̂ₙ, then

P(Ψₙ(θ₀ − ε) < 0 < Ψₙ(θ₀ + ε)) ≤ P(θ₀ − ε < θ̂ₙ < θ₀ + ε).

The left side converges to one, because Ψₙ(θ₀ ± ε) → Ψ(θ₀ ± ε) in probability. Thus the right side converges to one as well, and θ̂ₙ is consistent.
If the map θ ↦ Ψₙ(θ) is nondecreasing and θ̂ₙ is a zero, then the same argument is valid. More generally, if θ ↦ Ψₙ(θ) is nondecreasing, then Ψₙ(θ₀ − ε) < −η and θ̂ₙ ≤ θ₀ − ε imply Ψₙ(θ̂ₙ) < −η, which has probability tending to zero for every η > 0 if θ̂ₙ is a near zero. This and a similar argument applied to the right tail shows that, for every ε, η > 0, the probability that θ₀ − ε < θ̂ₙ < θ₀ + ε is at least the probability that both Ψₙ(θ₀ − ε) < −η and Ψₙ(θ₀ + ε) > η, up to a term converging to zero. For 2η equal to the smallest of the numbers −Ψ(θ₀ − ε) and Ψ(θ₀ + ε), the latter probability converges to one, and hence so does the former. ∎
5.11 Example (Median). The sample median θ̂ₙ is a (near) zero of the map θ ↦ Ψₙ(θ) = n⁻¹ Σᵢ₌₁ⁿ sign(Xᵢ − θ). By the law of large numbers,

Ψₙ(θ) →ᴾ Ψ(θ) = P sign(X − θ) = P(X > θ) − P(X < θ)

for every fixed θ. Thus, we expect that the sample median converges in probability to a point θ₀ such that P(X > θ₀) = P(X < θ₀): a population median.
This can be proved rigorously by applying Theorem 5.7 or 5.9. However, even though the conditions of the theorems are satisfied, they are not entirely trivial to verify. (The uniform convergence of Ψₙ to Ψ is proved essentially in Theorem 19.1.) In this case it is easier to apply Lemma 5.10. Because the functions θ ↦ Ψₙ(θ) are nonincreasing, it follows that θ̂ₙ →ᴾ θ₀ provided that Ψ(θ₀ − ε) > 0 > Ψ(θ₀ + ε) for every ε > 0. This is the case if the population median is unique: P(X < θ₀ − ε) < ½ < P(X < θ₀ + ε) for all ε > 0. □
M(θ) = Pm_θ.
In this subsection we consider an alternative set of conditions under which the maximizer θ̂ₙ of the process Mₙ converges in probability to a point of maximum θ₀ of the function M. This "classical" approach to consistency was taken by Wald in 1949 for maximum likelihood estimators. It works best if the parameter set Θ is compact. If not, then the argument must be complemented by a proof that the estimators are in a compact set eventually or be applied to a suitable compactification of the parameter set.
Assume that the map θ ↦ m_θ(x) is upper-semicontinuous for almost all x: For every θ
5.14 Theorem. Let θ ↦ m_θ(x) be upper-semicontinuous for almost all x and let (5.13) be satisfied. Then for any estimators θ̂ₙ such that Mₙ(θ̂ₙ) ≥ Mₙ(θ₀) − oP(1) for some θ₀ ∈ Θ₀, for every ε > 0 and every compact set K ⊂ Θ,

P(d(θ̂ₙ, Θ₀) ≥ ε and θ̂ₙ ∈ K) → 0.

Proof. If the function θ ↦ Pm_θ is identically −∞, then Θ₀ = Θ, and there is nothing to prove. Hence, we may assume that there exists θ₀ ∈ Θ₀ such that Pm_{θ₀} > −∞, whence P|m_{θ₀}| < ∞ by (5.13).
Fix some θ and let Uₗ ↓ θ be a decreasing sequence of open balls around θ of diameter converging to zero. Write m_U(x) for sup_{θ∈U} m_θ(x). The sequence m_{Uₗ} is decreasing and greater than m_θ for every l. Combination with (5.12) yields that m_{Uₗ} ↓ m_θ almost surely. In view of (5.13), we can apply the monotone convergence theorem and obtain that Pm_{Uₗ} ↓ Pm_θ (which may be −∞).
For θ ∉ Θ₀, we have Pm_θ < Pm_{θ₀}. Combine this with the preceding paragraph to see that for every θ ∉ Θ₀ there exists an open ball U_θ around θ with Pm_{U_θ} < Pm_{θ₀}. The set B = {θ ∈ K: d(θ, Θ₀) ≥ ε} is compact and is covered by the balls {U_θ: θ ∈ B}. Let U_{θ₁}, …, U_{θ_p} be a finite subcover. Then, by the law of large numbers,

sup_{θ∈B} ℙₙ m_θ ≤ max_{1≤j≤p} ℙₙ m_{U_{θⱼ}} →ᴾ max_{1≤j≤p} P m_{U_{θⱼ}} < P m_{θ₀}.

If θ̂ₙ ∈ B, then sup_{θ∈B} ℙₙm_θ is at least ℙₙm_{θ̂ₙ}, which by definition of θ̂ₙ is at least ℙₙm_{θ₀} − oP(1) = Pm_{θ₀} − oP(1), by the law of large numbers. Thus
Even in simple examples, condition (5.13) can be restrictive. One possibility for relaxation is to divide the n observations into groups of approximately the same size. Then (5.13) may be replaced by, for some k and every k ≤ l < 2k,

Pˡ sup_{θ∈U} Σᵢ₌₁ˡ m_θ(xᵢ) < ∞.   (5.15)

Surprisingly enough, this simple device may help. For instance, under condition (5.13) the preceding theorem does not apply to yield the asymptotic consistency of the maximum likelihood estimator of (μ, σ) based on a random sample from the N(μ, σ²) distribution (unless we restrict the parameter set for σ), but under the relaxed condition it does (with k = 2). (See Problem 5.25.) The proof of the theorem under (5.15) remains almost the same. Divide the n observations into groups of k observations and, possibly, a remainder group of l observations; next, apply the law of large numbers to the approximately n/k group sums.
5.16 Example (Cauchy likelihood). The maximum likelihood estimator for θ based on a random sample from the Cauchy distribution with location θ maximizes the map θ ↦ ℙₙm_θ for m_θ(x) = log p_θ(x) = −log π − log(1 + (x − θ)²), extended to the compactified real line by m_{±∞} ≡ −∞.
These infinite values should not worry us: They are permitted in the preceding theorem. Moreover, because we maximize θ ↦ ℙₙm_θ, they ensure that the estimator θ̂ₙ never takes the values ±∞, which is excellent.
We apply Wald's theorem with Θ = ℝ̄, equipped with, for instance, the metric d(θ₁, θ₂) = |arctan θ₁ − arctan θ₂|. Because the functions θ ↦ m_θ(x) are continuous and nonpositive, the conditions are trivially satisfied. Thus, taking K = ℝ̄, we obtain that d(θ̂ₙ, Θ₀) →ᴾ 0. This conclusion is valid for any underlying distribution P of the observations for which the set Θ₀ is nonempty, because so far we have used the Cauchy likelihood only to motivate m_θ.
To conclude that the maximum likelihood estimator in a Cauchy location model is consistent, it suffices to show that Θ₀ = {θ₀} if P is the Cauchy distribution with center θ₀. This follows most easily from the identifiability of this model, as discussed in Lemma 5.35. □
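A numerical illustration (not in the original text): the Cauchy log likelihood can have several local maxima, so a sketch that maximizes ℙₙm_θ should scan a grid before polishing with a local optimizer.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def cauchy_mle(x):
    """Location MLE for the Cauchy model: maximizes the log likelihood,
    i.e. minimizes sum(log(1 + (x - theta)^2))."""
    neg_loglik = lambda theta: np.log1p((x - theta) ** 2).sum()
    grid = np.linspace(np.median(x) - 10, np.median(x) + 10, 401)
    theta0 = grid[np.argmin([neg_loglik(t) for t in grid])]
    step = grid[1] - grid[0]
    return minimize_scalar(neg_loglik, bounds=(theta0 - step, theta0 + step),
                           method="bounded").x

rng = np.random.default_rng(0)
x = rng.standard_cauchy(200) + 3.0   # true center 3
print(np.median(x), cauchy_mle(x))   # both close to 3; the MLE is consistent
```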
5.17 Example (Current status data). Suppose that a "death" that occurs at time T is only observed to have taken place or not at a known "check-up time" C. We model the observations as a random sample X₁, …, Xₙ from the distribution of X = (C, 1{T ≤ C}), where T and C are independent random variables with completely unknown distribution functions F and G, respectively. The purpose is to estimate the "survival distribution" 1 − F.
If G has a density g with respect to Lebesgue measure λ, then X = (C, Δ), with Δ = 1{T ≤ C}, has a density with respect to the product of λ and counting measure on the set {0, 1}. A maximum likelihood estimator for F can be defined as the distribution function F̂ that maximizes the likelihood

F ↦ ∏ᵢ₌₁ⁿ (Δᵢ F(Cᵢ) + (1 − Δᵢ)(1 − F)(Cᵢ))

over all distribution functions on [0, ∞). Because this only involves the numbers F(C₁), …, F(Cₙ), the maximizer of this expression is not unique, but some thought shows that there is a unique maximizer F̂ that concentrates on (a subset of) the observation times C₁, …, Cₙ. This is commonly used as an estimator.
We can show the consistency of this estimator by Wald's theorem. By its definition F̂ maximizes the function F ↦ ℙₙ log p_F, but the consistency proof proceeds in a smoother way by setting

m_F = log (p_F / p_{(F+F₀)/2}) = log (2p_F / (p_F + p_{F₀})).
With p = p_F, p₀ = p_{F₀} and f the concave function f(x) = x log(2/(1 + x)),

P₀ log (2p/(p + p₀)) = P f(p₀/p) ≤ f(P(p₀/p)) = f(1) = 0,
† Alternatively, consider all probability distributions on the compactification [0, ∞], again equipped with the weak topology.
by Jensen’s inequality and the concavity of f , with equality only if p0 / p = 1 almost surely
under P, and then also under P0 . This completes the proof.
Ψₙ(θ) ≡ (1/n) Σᵢ₌₁ⁿ ψ_θ(Xᵢ) = ℙₙψ_θ,   Ψ(θ) = Pψ_θ.

Assume that the estimator θ̂ₙ is a zero of Ψₙ and converges in probability to a zero θ₀ of Ψ. Because θ̂ₙ →ᴾ θ₀, it makes sense to expand Ψₙ(θ̂ₙ) in a Taylor series around θ₀. Assume for simplicity that θ is one-dimensional. Then

0 = Ψₙ(θ̂ₙ) = Ψₙ(θ₀) + (θ̂ₙ − θ₀) Ψ̇ₙ(θ₀) + ½ (θ̂ₙ − θ₀)² Ψ̈ₙ(θ̃ₙ),
ℝᵏ and the derivatives Ψ̇ₙ(θ₀) are (k × k) matrices that converge to the (k × k) matrix Pψ̇_{θ₀} with entries P ∂ψ_{θ₀,i}/∂θⱼ. The final statement becomes

Assume that P‖ψ_{θ₀}‖² < ∞ and that the map θ ↦ Pψ_θ is differentiable at a zero θ₀, with nonsingular derivative matrix V_{θ₀}. If ℙₙψ_{θ̂ₙ} = oP(n^{−1/2}) and θ̂ₙ →ᴾ θ₀, then

√n(θ̂ₙ − θ₀) = −V_{θ₀}⁻¹ (1/√n) Σᵢ₌₁ⁿ ψ_{θ₀}(Xᵢ) + oP(1).

In particular, the sequence √n(θ̂ₙ − θ₀) is asymptotically normal with mean zero and covariance matrix V_{θ₀}⁻¹ Pψ_{θ₀}ψ_{θ₀}ᵀ (V_{θ₀}⁻¹)ᵀ.
Proof. For a fixed measurable function f, we abbreviate √n(ℙₙ − P)f to 𝔾ₙf, the empirical process evaluated at f. The consistency of θ̂ₙ and the Lipschitz condition on the maps θ ↦ ψ_θ imply that

𝔾ₙψ_{θ̂ₙ} − 𝔾ₙψ_{θ₀} →ᴾ 0.   (5.22)

For a nonrandom sequence θₙ this is immediate from the fact that the means of these variables are zero, while the variances are bounded by P‖ψ_{θₙ} − ψ_{θ₀}‖² ≤ Pψ̇² ‖θₙ − θ₀‖² and hence converge to zero. A proof for estimators θ̂ₙ under the present mild conditions takes more effort. The appropriate tools are developed in Chapter 19. In Example 19.7 it is seen that the functions ψ_θ form a Donsker class. Next, (5.22) follows from Lemma 19.24. Here we accept the convergence as a fact and give the remainder of the proof.
By the definitions of θ̂ₙ and θ₀, we can rewrite 𝔾ₙψ_{θ̂ₙ} as √nP(ψ_{θ₀} − ψ_{θ̂ₙ}) + oP(1). Combining this with the delta method (or Lemma 2.12) and the differentiability of the map θ ↦ Pψ_θ, we find that

√n‖θ̂ₙ − θ₀‖ ≤ ‖V_{θ₀}⁻¹‖ ‖√n V_{θ₀}(θ̂ₙ − θ₀)‖ = O_P(1) + o_P(√n‖θ̂ₙ − θ₀‖).

This implies that θ̂ₙ is √n-consistent: The left side is bounded in probability. Inserting this in the previous display, we obtain that √n V_{θ₀}(θ̂ₙ − θ₀) = −𝔾ₙψ_{θ₀} + oP(1). We conclude the proof by taking the inverse V_{θ₀}⁻¹ left and right. Because matrix multiplication is a continuous map, the inverse of the remainder term still converges to zero in probability. ∎
It is this expansion rather than the differentiability that is needed in the following theorem.
5.23 Theorem. For each θ in an open subset of Euclidean space let x ↦ m_θ(x) be a measurable function such that θ ↦ m_θ(x) is differentiable at θ₀ for P-almost every x† with derivative ṁ_{θ₀}(x) and such that, for every θ₁ and θ₂ in a neighborhood of θ₀ and a measurable function ṁ with Pṁ² < ∞,

|m_{θ₁}(x) − m_{θ₂}(x)| ≤ ṁ(x) ‖θ₁ − θ₂‖.

Furthermore, assume that the map θ ↦ Pm_θ admits a second-order Taylor expansion at a point of maximum θ₀ with nonsingular symmetric second derivative matrix V_{θ₀}. If ℙₙm_{θ̂ₙ} ≥ sup_θ ℙₙm_θ − oP(n⁻¹) and θ̂ₙ →ᴾ θ₀, then

√n(θ̂ₙ − θ₀) = −V_{θ₀}⁻¹ (1/√n) Σᵢ₌₁ⁿ ṁ_{θ₀}(Xᵢ) + oP(1).

In particular, the sequence √n(θ̂ₙ − θ₀) is asymptotically normal with mean zero and covariance matrix V_{θ₀}⁻¹ P ṁ_{θ₀} ṁ_{θ₀}ᵀ V_{θ₀}⁻¹.
*Proof. The Lipschitz property and the differentiability of the maps θ ↦ m_θ imply that, for every random sequence hₙ that is bounded in probability,
5.24 Example (Median). The sample median maximizes the criterion function θ ↦ −Σᵢ₌₁ⁿ |Xᵢ − θ|. Assume that the distribution function F of the observations is differentiable
Figure 5.3. The distribution function of the sample median (dotted curve) and its normal approximation for a sample of size 25 from the Laplace distribution.
at its median θ₀ with positive derivative f(θ₀). Then the sample median is asymptotically normal.
This follows from Theorem 5.23 applied with m_θ(x) = |x − θ| − |x|. As a consequence of the triangle inequality, this function satisfies the Lipschitz condition with ṁ(x) ≡ 1. Furthermore, the map θ ↦ m_θ(x) is differentiable at θ₀ except if x = θ₀, with ṁ_{θ₀}(x) = −sign(x − θ₀). By partial integration,
of the true underlying distribution P on the model, using the Kullback-Leibler divergence, which is defined as −P log(p_θ/p), as a "distance" measure: p_{θ₀} minimizes this quantity over all densities in the model. Second, we expect that √n(θ̂ₙ − θ₀) is asymptotically normal with mean zero and covariance matrix

V_{θ₀}⁻¹ P ℓ̇_{θ₀} ℓ̇_{θ₀}ᵀ V_{θ₀}⁻¹.

Here ℓ_θ = log p_θ, and V_{θ₀} is the second derivative matrix of the map θ ↦ P log p_θ. The preceding theorem with m_θ = log p_θ gives sufficient conditions for this to be true.
The asymptotics give insight into the practical value of the experimenter's estimate θ̂ₙ. This depends on the specific situation. However, if the model is not too far off from the truth, then the estimated density p_{θ̂ₙ} may be a reasonable approximation for the true density. □
5.26 Example (Exponential frailty model). Suppose that the observations are a random sample (X₁, Y₁), …, (Xₙ, Yₙ) of pairs of survival times. For instance, each Xᵢ is the survival time of a "father" and Yᵢ the survival time of a "son." We assume that given an unobservable value zᵢ, the survival times Xᵢ and Yᵢ are independent and exponentially distributed with parameters zᵢ and θzᵢ, respectively. The value zᵢ may be different for each observation. The problem is to estimate the ratio θ of the parameters.
To fit this example into the i.i.d. set-up of this chapter, we assume that the values z₁, …, zₙ are realizations of a random sample Z₁, …, Zₙ from some given distribution (that we do not have to know or parametrize).
One approach is based on the sufficiency of the variable Xᵢ + θYᵢ for zᵢ in the case that θ is known. Given Zᵢ = z, this "statistic" possesses the gamma-distribution with shape parameter 2 and scale parameter z. Corresponding to this, the conditional density of an observation (X, Y) factorizes, for a given z, as h_θ(x, y) g_θ(x + θy | z), for g_θ(s | z) = z²se^{−zs} the gamma-density and

h_θ(x, y) = θ/(x + θy).
Because the density of Xᵢ + θYᵢ depends on the unobservable value zᵢ, we might wish to discard the factor g_θ(s | z) from the likelihood and use the factor h_θ(x, y) only. Unfortunately, this "conditional likelihood" does not behave as an ordinary likelihood, in that the corresponding "conditional likelihood equation," based on the function (ḣ_θ/h_θ)(x, y) = ∂/∂θ log h_θ(x, y), does not have mean zero under θ. The bias can be corrected by conditioning on the sufficient statistic. Let

ψ_θ(X, Y) = 2θ (ḣ_θ/h_θ)(X, Y) − 2θ E_θ((ḣ_θ/h_θ)(X, Y) | X + θY) = (X − θY)/(X + θY).
Hence the zero of θ ↦ P_{θ₀}ψ_θ is taken uniquely at θ = θ₀. Next, the sequence √n(θ̂ₙ − θ₀) can be shown to be asymptotically normal by Theorem 5.21. In fact, the functions ψ̇_θ(x, y) are uniformly bounded in x, y > 0 and θ ranging over compacta in (0, ∞), so that, by the mean value theorem, the function ψ̇ in this theorem may be taken equal to a constant.
On the other hand, although this estimator is easy to compute, it can be shown that it is not asymptotically optimal. In Chapter 25 on semiparametric models, we discuss estimators with a smaller asymptotic variance. □
5.27 Example (Nonlinear least squares). Suppose that we observe a random sample (X₁, Y₁), …, (Xₙ, Yₙ) from the distribution of a vector (X, Y) that follows the regression model

Y = f_{θ₀}(X) + e,   E(e | X) = 0.

Here f_θ is a parametric family of regression functions, for instance f_θ(x) = θ₁ + θ₂e^{θ₃x}, and we aim at estimating the unknown vector θ. (We assume that the independent variables are a random sample in order to fit the example in our i.i.d. notation, but the analysis could be carried out conditionally as well.) The least squares estimator that minimizes

θ ↦ Σᵢ₌₁ⁿ (Yᵢ − f_θ(Xᵢ))²

is an M-estimator for m_θ(x, y) = (y − f_θ(x))² (or rather minus this function). It should be expected to converge to the minimizer of the limit criterion function

M(θ) = E(Y − f_θ(X))² = Ee² + E(f_θ − f_{θ₀})²(X).

Thus the least squares estimator should be consistent if θ₀ is identifiable from the model, in the sense that θ ≠ θ₀ implies that f_θ(X) ≠ f_{θ₀}(X) with positive probability.
For sufficiently regular regression models, we have
Besides giving the asymptotic normality of √n(θ̂ₙ − θ₀), the preceding theorems give an asymptotic representation
If we neglect the remainder term,† then this means that θ̂ₙ − θ₀ behaves as the average of the variables V_{θ₀}⁻¹ψ_{θ₀}(Xᵢ, Yᵢ). Then the (asymptotic) "influence" of the nth observation on the
† To make the following derivation rigorous, more information concerning the remainder term would be necessary.
Because the "influence" of an extra observation x is proportional to VO- I t/lo (x), the function
x VO- I t/lo (x) is called the asymptotic influence function of the estimator On. Influence
functions can be defined for many other estimators as well, but the method of Z-estimation
is particularly convenient to obtain estimators with given influence functions. Because VIio
is a constant (matrix), any shape of influence function can be obtained by simply choosing
the right functions t/lo.
For the purpose of robust estimation, perhaps the most important aim is to bound the
influence of each individual observation. Thus, a Z-estimator is called B-robust if the
function t/lo is bounded.
√n(θ̂ₙ − θ₀) = −V_{θ₀}⁻¹ (1/√n) Σᵢ₌₁ⁿ ψ(Yᵢ − θ₀ᵀXᵢ) Xᵢ + oP(1).
Consequently, even for a bounded function ψ, the influence function (x, y) ↦ V_{θ₀}⁻¹ψ(y − θ₀ᵀx) x may be unbounded, and an extreme value of an Xⱼ may still have an arbitrarily large influence on the estimate (asymptotically). Thus, the estimators obtained in this way are protected against influence points but may still suffer from leverage points and hence are only partly robust. To obtain fully robust estimators, we can change the estimating equations to

Σᵢ₌₁ⁿ ψ((Yᵢ − θᵀXᵢ) v(Xᵢ)) w(Xᵢ) = 0.

Here we protect against leverage points by choosing w bounded. For more flexibility we have also allowed a weighting factor v(Xᵢ) inside ψ. The choices ψ(x) = x, v(x) = 1 and w(x) = x correspond to the (nonrobust) least-squares estimator.
The solution θ̂ₙ of our final estimating equation should be expected to be consistent for the solution θ₀ of

E ψ((Y − θᵀX) v(X)) w(X) = 0.

If the function ψ is odd and the error symmetric, then the true value θ₀ will be a solution whenever e is symmetric about zero, because then Eψ(ea) = 0 for every a.
Precise conditions for the asymptotic normality of √n(θ̂ₙ − θ₀) can be obtained from Theorems 5.21 and 5.9. The verification of the conditions of Theorem 5.21, which are "local" in nature, is relatively easy, and, if necessary, the Lipschitz condition can be relaxed by using results on empirical processes introduced in Chapter 19 directly. Perhaps proving the consistency of θ̂ₙ is harder. The biggest technical problem may be to show that θ̂ₙ = OP(1), so it would help if θ could a priori be restricted to a bounded set. On the other hand, for bounded functions ψ, the case of most interest in the present context, the functions (x, y) ↦ ψ((y − θᵀx) v(x)) w(x) readily form a Glivenko-Cantelli class when θ ranges freely, so that verification of the strong uniqueness of θ₀ as a zero becomes the main challenge when applying Theorem 5.9. This leads to a combination of conditions on ψ, v, w, and the distributions of e and X. □
5.29 Example (Optimal robust estimators). Every sufficiently regular function ψ defines a location estimator θ̂ₙ through the equation Σᵢ₌₁ⁿ ψ(Xᵢ − θ) = 0. In order to choose among the different estimators, we could compare their asymptotic variances and use the one with the smallest variance under the postulated (or estimated) distribution P of the observations. On the other hand, if we also wish to guard against extreme observations, then we should find a balance between robustness and asymptotic variance. One possibility is to use the estimator with the smallest asymptotic variance at the postulated, ideal distribution P under the side condition that its influence function be uniformly bounded by some constant c. In this example we show that for P the normal distribution, this leads to the Huber estimator.
The Z-estimator is consistent for the solution θ₀ of the equation Pψ(· − θ) = Eψ(X₁ − θ) = 0. Suppose that we fix an underlying, ideal P whose "location" θ₀ is zero. Then the problem is to find ψ that minimizes the asymptotic variance Pψ²/(Pψ′)² under the two side conditions, for a given constant c,

|ψ(x)/Pψ′| ≤ c for every x,   and   Pψ = 0.
The problem is homogeneous in ψ, and hence we may assume that Pψ′ = 1 without loss of generality. Next, minimization of Pψ² under the side conditions Pψ = 0, Pψ′ = 1 and ‖ψ‖∞ ≤ c can be achieved by using Lagrange multipliers, as in Problem 14.6. This leads to minimizing

P(ψ² + λψ + μψ (p′/p))

for fixed "multipliers" λ and μ under the side condition ‖ψ‖∞ ≤ c with respect to ψ. This expectation is minimized by minimizing the integrand pointwise, for every fixed x. Thus the minimizing ψ has the property that, for every x separately, y = ψ(x) minimizes the parabola y² + λy + μy(p′/p)(x) over y ∈ [−c, c]. This readily gives the solution, with the value y truncated to the interval [−c, c],

ψ(x) = [−½λ − ½μ (p′/p)(x)]^{c}_{−c}.

The constants λ and μ can be solved from the side conditions Pψ = 0 and Pψ′ = 1. The normal distribution P = Φ has location score function (p′/p)(x) = −x, and by symmetry it follows that λ = 0 in this case. Then the optimal ψ reduces to Huber's ψ function. □
Σᵢ₌₁ⁿ ψ((Xᵢ − θ)/σ̂) = 0.   (5.30)
Here σ̂ is an initial (robust) estimator of scale, which is meant to stabilize the robustness of the location estimator. For instance, the "cut-off" parameter k in Huber's ψ-function determines the amount of robustness of Huber's estimator, but the effect of a particular choice of k on bounding the influence of outlying observations is relative to the range of the observations. If the observations are concentrated in the interval [−k, k], then Huber's ψ yields nothing else but the sample mean; if all observations are outside [−k, k], we get the median. Scaling the observations to a standard scale gives a clear meaning to the value of k. The use of the median absolute deviation from the median (see Section 21.3) is often recommended for this purpose.
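A sketch of the studentized estimating equation (5.30) (not from the original text): the scale is first estimated by the median absolute deviation from the median, rescaled by the usual consistency factor for the normal distribution, and the location is then solved by bisection.

```python
import numpy as np
from scipy.optimize import brentq

def mad_scale(x):
    """Median absolute deviation from the median, rescaled to be
    consistent for the standard deviation under normality."""
    return 1.4826 * np.median(np.abs(x - np.median(x)))

def huber_location_studentized(x, k=1.345):
    """Solve sum(psi((x_i - theta)/sigma_hat)) = 0 for Huber's psi."""
    sigma = mad_scale(x)
    equation = lambda theta: np.clip((x - theta) / sigma, -k, k).sum()
    return brentq(equation, x.min(), x.max())

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(2.0, 1.0, 95), rng.normal(30.0, 1.0, 5)])
print(huber_location_studentized(x))   # close to 2 despite the outliers
```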
If the scale estimator is itself a Z-estimator, then we can treat the pair (θ̂, σ̂) as a Z-estimator for a system of equations, and next apply the preceding theorems. More generally, we can apply the following result. In this subsection we allow a condition in terms of Donsker classes, which are discussed in Chapter 19. The proof of the following theorem follows the same steps as the proof of Theorem 5.21.
5.31 Theorem. For each θ in an open subset of ℝᵏ and each η in a metric space, let x ↦ ψ_{θ,η}(x) be an ℝᵏ-valued measurable function such that the class of functions {ψ_{θ,η}: ‖θ − θ₀‖ < δ, d(η, η₀) < δ} is Donsker for some δ > 0, and such that P‖ψ_{θ,η} − ψ_{θ₀,η₀}‖² → 0 as (θ, η) → (θ₀, η₀). Assume that Pψ_{θ₀,η₀} = 0, and that the maps θ ↦ Pψ_{θ,η} are differentiable at θ₀, uniformly in η in a neighborhood of η₀, with nonsingular derivative matrices V_{θ₀,η} such that V_{θ₀,η} → V_{θ₀,η₀}. If ℙₙψ_{θ̂ₙ,η̂ₙ} = oP(n^{−1/2}) and (θ̂ₙ, η̂ₙ) →ᴾ (θ₀, η₀), then

√n(θ̂ₙ − θ₀) = −V_{θ₀,η₀}⁻¹ (𝔾ₙψ_{θ₀,η₀} + √n Pψ_{θ₀,η̂ₙ}) + oP(1).
Under the conditions of this theorem, the limiting distribution of the sequence √n(θ̂ₙ − θ₀) depends on the estimator η̂ₙ through the "drift" term √n Pψ_{θ₀,η̂ₙ}. In general, this gives a contribution to the limiting distribution, and η̂ₙ must be chosen with care. If η̂ₙ is √n-consistent and the map η ↦ Pψ_{θ₀,η} is differentiable, then the drift term can be analyzed using the delta-method.
It may happen that the drift term is zero. If the parameters θ and η are "orthogonal" in this sense, then the auxiliary estimators η̂ₙ may converge at an arbitrarily slow rate and affect the limit distribution of θ̂ₙ only through their limiting value η₀.
5.32 Example (Symmetric location). Suppose that the distribution of the observations is symmetric about θ₀. Let x ↦ ψ(x) be an antisymmetric function, and consider the Z-estimators that solve equation (5.30). Because Pψ((X − θ₀)/σ) = 0 for every σ, by the symmetry of P and the antisymmetry of ψ, the "drift term" due to σ̂ in the preceding theorem is identically zero. The estimator θ̂ₙ has the same limiting distribution whether we use an arbitrary consistent estimator σ̂ of a "true scale" σ₀ or σ₀ itself. □
5.33 Example (Robust regression). In the linear regression model considered in Example 5.28, suppose that we choose the weight functions v and w dependent on the data and solve the robust estimator θ̂ₙ of the regression parameters from
If X₁, …, Xₙ are a random sample from a density p_θ, then the maximum likelihood estimator θ̂ₙ maximizes the function θ ↦ Σ log p_θ(Xᵢ), or equivalently, the function

Mₙ(θ) = (1/n) Σᵢ₌₁ⁿ log (p_θ/p_{θ₀})(Xᵢ) = ℙₙ log (p_θ/p_{θ₀}).

(Subtraction of the "constant" Σ log p_{θ₀}(Xᵢ) turns out to be mathematically convenient.) If we agree that log 0 = −∞, then this expression is with probability 1 well defined if p_{θ₀} is the true density. The asymptotic function corresponding to Mₙ is†

M(θ) = P_{θ₀} log (p_θ/p_{θ₀}).

This requires that the model for the observations is not the same under the parameters θ and θ₀. Identifiability is a natural and even a necessary condition: If the parameter is not identifiable, then consistent estimators cannot exist.
5.35 Lemma. Let {p_θ: θ ∈ Θ} be a collection of subprobability densities such that (5.34) holds and such that P_{θ₀} is a probability measure. Then M(θ) = P_{θ₀} log (p_θ/p_{θ₀}) attains its maximum uniquely at θ₀.

Proof. First note that M(θ₀) = P_{θ₀} log 1 = 0. Hence we wish to show that M(θ) is strictly negative for θ ≠ θ₀.
Because log x ≤ 2(√x − 1) for every x ≥ 0, we have, writing μ for the dominating measure,

M(θ) ≤ 2P_{θ₀}(√(p_θ/p_{θ₀}) − 1) = 2 ∫ √(p_θ p_{θ₀}) dμ − 2 ≤ −∫ (√p_θ − √p_{θ₀})² dμ.

(The last inequality is an equality if ∫ p_θ dμ = 1.) This is always nonpositive, and is zero only if p_θ and p_{θ₀} are equal. By assumption the latter happens only if θ = θ₀. ∎
† Presently we take the expectation P_{θ₀} under the parameter θ₀, whereas the derivation in section 5.3 is valid for a generic underlying probability structure and does not conceptually require that the set of parameters θ indexes a set of underlying distributions.
Hence it is a Z-estimator for ψ_θ equal to the score function ℓ̇_θ = ∂/∂θ log p_θ of the model. In view of the results of section 5.3, we expect that the sequence √n(θ̂ₙ − θ) is, under θ, asymptotically normal with mean zero and covariance matrix

(P_θ ℓ̈_θ)⁻¹ P_θ ℓ̇_θ ℓ̇_θᵀ (P_θ ℓ̈_θ)⁻¹.   (5.36)

Under regularity conditions, this reduces to the inverse of the Fisher information matrix

I_θ = P_θ ℓ̇_θ ℓ̇_θᵀ.
To see this in the case of a one-dimensional parameter, differentiate the identity ∫ p_θ dμ ≡ 1 twice with respect to θ. Assuming that the order of differentiation and integration can be reversed, we obtain ∫ ṗ_θ dμ = ∫ p̈_θ dμ = 0. Together with the identities

ℓ̇_θ = ṗ_θ/p_θ,   ℓ̈_θ = p̈_θ/p_θ − (ṗ_θ)²/p_θ²,

this implies that P_θℓ̇_θ = 0 (scores have mean zero), and P_θℓ̈_θ = −I_θ (the curvature of the likelihood is equal to minus the Fisher information). Consequently, (5.36) reduces to I_θ⁻¹. The higher-dimensional case follows in the same way, in which we should interpret the identities P_θℓ̇_θ = 0 and P_θℓ̈_θ = −I_θ as a vector and a matrix identity, respectively.
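These two identities are easy to check numerically for a concrete model. The sketch below is not part of the original text; it evaluates P_θℓ̇_θ, P_θℓ̇_θ² and P_θℓ̈_θ by quadrature for the exponential density p_θ(x) = θe^{−θx}, for which the Fisher information is 1/θ².

```python
import numpy as np
from scipy.integrate import quad

theta = 2.0
p       = lambda x: theta * np.exp(-theta * x)   # exponential density on (0, inf)
score   = lambda x: 1.0 / theta - x              # l-dot
hessian = lambda x: -1.0 / theta ** 2            # l-double-dot (constant in x)

mean_score = quad(lambda x: score(x) * p(x), 0, np.inf)[0]
fisher     = quad(lambda x: score(x) ** 2 * p(x), 0, np.inf)[0]
mean_hess  = quad(lambda x: hessian(x) * p(x), 0, np.inf)[0]

print(mean_score)            # approximately 0: the score has mean zero
print(fisher, -mean_hess)    # both approximately 0.25 = 1/theta^2
```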
We conclude that maximum likelihood estimators typically satisfy
This is a very important result, as it implies that maximum likelihood estimators are asymp-
totically optimal. The convergence in distribution means roughly that the maximum likeli-
hood estimator On is N(O, (nlo)-I)-distributed for every 0, for large n. Hence, it is asymp-
totically unbiased and asymptotically of variance (nlo)-I. According to the Cramer-Rao
theorem, the variance of an unbiased estimator is at least (nle)-l. Thus, we could in-
fer that the maximum likelihood estimator is asymptotically uniformly minimum-variance
unbiased, and in this sense optimal. We write "could" because the preceding reasoning is
informal and unsatisfying. The asymptotic normality does not warrant any conclusion about
the convergence of the moments and vare we have not introduced an asymptotic
version of the Cramer-Rao theorem; and the Cramer-Rao bound does not make any assertion
concerning asymptotic normality. Moreover, the unbiasedness required by the Cramer-Rao
theorem is restrictive and can be relaxed considerably in the asymptotic situation.
However, the message that maximum likelihood estimators are asymptotically efficient
is correct. We give a precise discussion in Chapter 8. The justification through asymptotics
appears to be the only general justification of the method of maximum likelihood. In some
form, this result was found by Fisher in the 1920s, but a better and more general insight
was only obtained in the period from 1950 through 1970 through the work of Le Cam and
others.
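The claim that θ̂n is approximately N(θ, (nIθ)^{-1})-distributed is easy to inspect by simulation. A minimal sketch for the exponential model with density θe^{−θx}, in which the maximum likelihood estimator is 1/X̄ and Iθ = 1/θ² (all numerical choices are illustrative):

    import numpy as np

    rng = np.random.default_rng(9)
    theta, n, reps = 2.0, 500, 10000
    x = rng.exponential(1 / theta, size=(reps, n))
    mle = 1 / x.mean(axis=1)                    # maximum likelihood estimator 1/X-bar
    print(np.sqrt(n) * (mle - theta).mean())    # approximately 0
    print(n * (mle - theta).var(), theta**2)    # approximately 1/I_theta = theta^2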
In the preceding informal derivations and discussion, it is implicitly understood that the density pθ possesses at least two derivatives with respect to the parameter. Although this can be relaxed considerably, a certain amount of smoothness of the dependence θ ↦ pθ is essential for the asymptotic normality. Compare the behavior of the maximum likelihood estimators in the case of uniformly distributed observations: They are neither asymptotically normal nor asymptotically optimal.
5.37 Example (Uniform distribution). Let X1, ..., Xn be a sample from the uniform distribution on [0, θ]. Then the maximum likelihood estimator is the maximum X(n) of the observations. Because the variance of X(n) is of the order O(n^{-2}), we expect that a suitable norming rate in this case is not √n, but n. Indeed, for each x < 0,

  Pθ(n(X(n) − θ) ≤ x) = Pθ(X(n) ≤ θ + x/n) = (1 + x/(nθ))^n → e^{x/θ},

so that n(X(n) − θ) converges in distribution to a (negative) exponential limit.
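A quick numerical illustration of this nonstandard n-rate (a minimal simulation sketch, not part of the original example; sample size and seed are arbitrary choices):

    import numpy as np

    rng = np.random.default_rng(0)
    theta, n, reps = 1.0, 1000, 5000
    mle = rng.uniform(0, theta, size=(reps, n)).max(axis=1)   # X_(n) for Uniform[0, theta]
    scaled = n * (theta - mle)                                # approximately exponential, mean theta
    print(scaled.mean(), scaled.std())                        # both close to theta = 1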
We conclude this section with a theorem that establishes the asymptotic normality of maximum likelihood estimators rigorously. Clearly, the asymptotic normality follows from Theorem 5.23 applied to mθ = log pθ, or from Theorem 5.21 applied with ψθ = ℓ̇θ equal to the score function of the model. The following result is a minor variation on the first theorem. Its conditions somehow also ensure the relationship Pθℓ̈θ = −Iθ and the twice-differentiability of the map θ ↦ Pθ0 log pθ, even though the existence of second derivatives is not part of the assumptions. This remarkable phenomenon results from the trivial fact that square roots of probability densities have squares that integrate to 1. To exploit this, we require the differentiability of the maps θ ↦ √pθ, rather than of the maps θ ↦ log pθ. A statistical model (Pθ: θ ∈ Θ) is called differentiable in quadratic mean if there exists a measurable vector-valued function ℓ̇θ0 such that, as θ → θ0,

  ∫ [√pθ − √pθ0 − ½(θ − θ0)^T ℓ̇θ0 √pθ0]² dμ = o(‖θ − θ0‖²).    (5.38)

A discussion of this property, including simple conditions for its validity, is given in Chapter 7.
5.39 Theorem. Suppose that the model (Pθ: θ ∈ Θ) is differentiable in quadratic mean (5.38) at an inner point θ0 of Θ ⊂ ℝ^k. Furthermore, suppose that there exists a measurable function ℓ̇ with Pθ0 ℓ̇² < ∞ such that, for every θ1 and θ2 in a neighborhood of θ0,

  |log pθ1(x) − log pθ2(x)| ≤ ℓ̇(x) ‖θ1 − θ2‖.

If the Fisher information matrix Iθ0 is nonsingular and the sequence of maximum likelihood estimators θ̂n is consistent, then

  √n(θ̂n − θ0) = Iθ0^{-1} (1/√n) Σ_{i=1}^n ℓ̇θ0(Xi) + oPθ0(1).

In particular, the sequence √n(θ̂n − θ0) is asymptotically normal with mean zero and covariance matrix Iθ0^{-1}.
*Proof. This theorem is a corollary of Theorem 5.23. We shall show that the conditions of the latter theorem are satisfied for mθ = log pθ and Vθ0 = −Iθ0.
Fix an arbitrary converging sequence of vectors hn → h, and set

  Wn = 2(√(pθ0+hn/√n / pθ0) − 1).

By the differentiability in quadratic mean, the sequence √n Wn converges in L2(Pθ0) to the function h^T ℓ̇θ0. In particular, it converges in probability, whence by a delta method
In view of the Lipschitz condition on the map θ ↦ log pθ, we can apply the dominated-convergence theorem to strengthen this to convergence in L2(Pθ0). This shows that the map θ ↦ log pθ is differentiable in probability, as required in Theorem 5.23. (The preceding argument considers only sequences θn of the special form θ0 + hn/√n approaching θ0. Because hn can be any converging sequence and √(n+1)/√n → 1, these sequences are actually not so special. By re-indexing, the result can be seen to be true for any θn → θ0.)
Next, by computing means (which are zero) and variances, we see that
Equating this result to the expansion given by Theorem 7.2, we see that
Hence the map θ ↦ Pθ0 log pθ is twice-differentiable with second derivative matrix −Iθ0, or at least permits the corresponding Taylor expansion of order 2. ■
5.40 Example (Binary regression). Suppose that we observe a random sample (X1, Y1), ..., (Xn, Yn) consisting of k-dimensional vectors of "covariates" Xi, and 0-1 "response variables" Yi, following the model

  Pθ(Yi = 1 | Xi = x) = Ψ(θ^T x).

Here Ψ: ℝ ↦ [0, 1] is a known continuously differentiable, monotone function. The choices Ψ(θ) = 1/(1 + e^{−θ}) (the logistic distribution function) and Ψ = Φ (the normal distribution function) correspond to the logit model and the probit model, respectively. The maximum likelihood estimator θ̂n maximizes the (conditional) likelihood function

  θ ↦ ∏_{i=1}^n pθ(Yi | Xi) := ∏_{i=1}^n Ψ(θ^T Xi)^{Yi} (1 − Ψ(θ^T Xi))^{1−Yi}.
The consistency and asymptotic normality of θ̂n can be proved, for instance, by combining Theorems 5.7 and 5.39. (Alternatively, we may follow the classical approach given in section 5.6. The latter is particularly attractive for the logit model, for which the log likelihood is strictly concave in θ, so that the point of maximum is unique.) For identifiability of θ we must assume that the distribution of the Xi is not concentrated on a (k − 1)-dimensional affine subspace of ℝ^k. For simplicity we assume that the range of Xi is bounded.
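To make the preceding likelihood concrete, here is a minimal numerical sketch for the logit model, computing θ̂n by Newton iterations on the log likelihood (the data-generating θ, the sample size, and the use of plain Newton steps are illustrative assumptions, not part of the example):

    import numpy as np

    rng = np.random.default_rng(1)
    n, k = 500, 2
    theta_true = np.array([1.0, -0.5])
    X = rng.uniform(-1, 1, size=(n, k))          # bounded covariates, as assumed above
    Y = rng.binomial(1, 1 / (1 + np.exp(-X @ theta_true)))

    theta = np.zeros(k)                          # starting value
    for _ in range(25):                          # Newton-Raphson on the log likelihood
        p = 1 / (1 + np.exp(-X @ theta))
        score = X.T @ (Y - p)                    # gradient of the log likelihood
        hessian = -(X * (p * (1 - p))[:, None]).T @ X
        theta = theta - np.linalg.solve(hessian, score)
    print(theta)                                 # close to theta_true for moderate n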
The consistency can be proved by applying Theorem 5.7 with mθ = log ((pθ + pθ0)/2). Because pθ0 is bounded away from 0 (and ∞), the function mθ is somewhat better behaved than the function log pθ.
By Lemma 5.35, the parameter θ is identifiable from the density pθ. We can redo the proof to see that, with ≲ meaning "less than up to a constant,"
This shows that θ0 is the unique point of maximum of θ ↦ Pθ0 mθ. Furthermore, if Pθ0 mθk → Pθ0 mθ0, then θk^T X → θ0^T X. If the sequence θk is also bounded, then E((θk − θ0)^T X)² → 0, whence θk → θ0 by the nonsingularity of the matrix EXX^T. On the other hand, ‖θk‖ cannot have a diverging subsequence, because in that case θk^T X/‖θk‖ → 0 and hence θk/‖θk‖ → 0 by the same argument. This verifies condition (5.8).
Checking the uniform convergence to zero of supθ |Pnmθ − Pmθ| is not trivial, but it becomes an easy exercise if we employ the Glivenko-Cantelli theorem, as discussed in Chapter 19. The functions x ↦ Ψ(θ^T x) form a VC-class, and the functions mθ take the form mθ(x, y) = φ(Ψ(θ^T x), y, Ψ(θ0^T x)), where the function φ(η, y, η0) is Lipschitz in its first argument with Lipschitz constant bounded above by 1/η + 1/(1 − η). This is enough to ensure that the functions mθ form a Donsker class and hence certainly a Glivenko-Cantelli class, in view of Example 19.20.
The asymptotic normality of √n(θ̂n − θ) is now a consequence of Theorem 5.39. The score function

  ℓ̇θ(x, y) = ((y − Ψ(θ^T x)) Ψ'(θ^T x) / (Ψ(θ^T x)(1 − Ψ(θ^T x)))) x

is uniformly bounded in x, y and θ ranging over compacta, and continuous in θ for every x and y. The Fisher information matrix

Ψ(θ) = Pψθ. The estimator θ̂n is a zero of Ψn, and the true value θ0 a zero of Ψ.
The essential condition of the following theorem is that the second-order partial derivatives of ψθ(x) with respect to θ exist for every x and satisfy

  ‖ψ̈θ(x)‖ ≤ ψ̈(x),

for some integrable measurable function ψ̈. This should be true at least for every θ in a neighborhood of θ0.
5.41 Theorem. For each θ in an open subset of Euclidean space, let θ ↦ ψθ(x) be twice continuously differentiable for every x. Suppose that Pψθ0 = 0, that P‖ψθ0‖² < ∞ and that the matrix Pψ̇θ0 exists and is nonsingular. Assume that the second-order partial derivatives are dominated by a fixed integrable function ψ̈(x) for every θ in a neighborhood of θ0. Then every consistent estimator sequence θ̂n such that Ψn(θ̂n) = 0 for every n satisfies

  √n(θ̂n − θ0) = −(Pψ̇θ0)^{-1} (1/√n) Σ_{i=1}^n ψθ0(Xi) + oP(1).

In particular, the sequence √n(θ̂n − θ0) is asymptotically normal with mean zero and covariance matrix (Pψ̇θ0)^{-1} Pψθ0ψθ0^T (Pψ̇θ0)^{-1}.
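Before the proof, a small illustration of how the "sandwich" covariance in this theorem is used in practice: the following sketch estimates (Pψ̇θ0)^{-1} Pψθ0ψθ0^T (Pψ̇θ0)^{-1} by replacing P with the empirical measure, here for the Huber location ψ-function (the cutoff k, the simulated data, and the use of the median as a rough consistent estimate are arbitrary illustrative choices):

    import numpy as np

    def huber_psi(x, k=1.345):
        return np.clip(x, -k, k)

    def huber_psi_prime(x, k=1.345):
        return (np.abs(x) <= k).astype(float)

    rng = np.random.default_rng(2)
    x = rng.standard_normal(1000)
    theta_hat = np.median(x)                   # rough consistent estimate of the location
    r = x - theta_hat
    # psi_theta(x) = psi(x - theta), so psi-dot_theta = -psi'(x - theta)
    A = -huber_psi_prime(r).mean()             # empirical version of P psi-dot_theta0
    B = (huber_psi(r) ** 2).mean()             # empirical version of P psi psi^T
    asym_var = B / A**2                        # one-dimensional sandwich A^{-1} B A^{-1}
    print(asym_var / len(x))                   # approximate variance of the location estimator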
Proof. By Taylor's theorem there exist (random) vectors θ̃n on the line segment between θ0 and θ̂n (possibly different for each coordinate of the function Ψn) such that

  0 = Ψn(θ̂n) = Ψn(θ0) + Pnψ̇θ0 (θ̂n − θ0) + ½(θ̂n − θ0)^T Pnψ̈θ̃n (θ̂n − θ0).

The term Pnψ̈θ̃n is bounded in probability by the law of large numbers, applied to the dominating function ψ̈. Combination of these facts allows us to rewrite the preceding display as

  −Ψn(θ0) = (V + oP(1) + ½(θ̂n − θ0)^T OP(1)) (θ̂n − θ0) = (V + oP(1))(θ̂n − θ0),

with V = Pψ̇θ0. Hence √n(θ̂n − θ0) = −(V + oP(1))^{-1} √n Ψn(θ0), and the theorem follows from the central limit theorem and Slutsky's lemma. ■
In the preceding sections, the existence and consistency of solutions θ̂n of the estimating
equations is assumed from the start. The present smoothness conditions actually ensure the
existence of solutions. (Again the conditions could be significantly relaxed, as shown in
the next proof.) Moreover, provided there exists a consistent estimator sequence at all, it is
always possible to select a consistent sequence of solutions.
5.42 Theorem. Under the conditions of the preceding theorem, the probability that the
equation Pn ψθ = 0 has at least one root tends to 1, as n → ∞, and there exists a sequence
of roots θ̂n such that θ̂n → θ0 in probability. If ψθ = ṁ θ is the gradient of some function
m θ and θ0 is a point of local maximum of θ → Pm θ , then the sequence θ̂n can be chosen
to be local maxima of the maps θ → Pn m θ .
Proof. Integrate the Taylor expansion of θ → ψθ (x) with respect to x to find that,
for points θ̃ = θ̃ (x) on the line segment between θ0 and θ (possibly different for each
coordinate of the function θ → Pψθ ),
  Pψθ = Pψθ0 + Pψ̇θ0 (θ − θ0) + ½(θ − θ0)^T Pψ̈θ̃ (θ − θ0).

By the domination condition, Pψ̈θ̃ is bounded by Pψ̈ < ∞ if θ is sufficiently close to θ0. Thus, the map Ψ(θ) = Pψθ is differentiable at θ0. By the same argument Ψ is differentiable throughout a small neighborhood of θ0, and by a similar expansion (but now to first order) the derivative Pψ̇θ can be seen to be continuous throughout this neighborhood. Because Pψ̇θ0 is nonsingular by assumption, we can make the neighborhood still smaller, if necessary, to ensure that the derivative of Ψ is nonsingular throughout the neighborhood. Then, by the inverse function theorem, there exists, for every sufficiently small δ > 0, an open neighborhood Gδ of θ0 such that the map Ψ: Gδ → ball(0, δ) is a homeomorphism. The diameter of Gδ is bounded by a multiple of δ, by the mean-value theorem and the fact that the norms of the derivatives (Pψ̇θ)^{-1} of the inverse Ψ^{-1} are bounded.
Combining the preceding Taylor expansion with a similar expansion for the sample version Ψn(θ) = Pnψθ, we see that

  sup_{θ∈Gδ} ‖Ψn(θ) − Ψ(θ)‖ ≤ oP(1) + δ oP(1) + δ² OP(1),

where the oP(1) terms and the OP(1) term result from the law of large numbers, and are uniform in small δ. Because P(‖oP(1) + δ oP(1)‖ > ½δ) → 0 for every δ > 0, there exists δn ↓ 0 such that P(‖oP(1) + δn oP(1)‖ > ½δn) → 0. If Kn,δ is the event where the left side of the preceding display is bounded above by δ, then P(Kn,δn) → 1 as n → ∞.
On the event Kn,δ the map θ ↦ θ − Ψn ∘ Ψ^{-1}(θ) maps ball(0, δ) into itself, by the definitions of Gδ and Kn,δ. Because the map is also continuous, it possesses a fixed point in ball(0, δ), by Brouwer's fixed point theorem. This yields a zero of Ψn in the set Gδ, whence the first assertion of the theorem.
For the final assertion, first note that the Hessian P ψ̇θ0 of θ → Pm θ at θ0 is negative-
definite, by assumption. A Taylor expansion as in the proof of Theorem 5.41 shows that Pnψ̇θ̂n − Pnψ̇θ0 → 0 in probability for every sequence θ̂n → θ0 in probability. Hence the Hessian Pnψ̇θ̂n of θ ↦ Pnmθ at any consistent zero θ̂n converges in probability to the negative-definite matrix Pψ̇θ0 and is negative-definite with probability tending to 1. ■
The assertion of the theorem that there exists a consistent sequence of roots of the
estimating equations is easily misunderstood. It does not guarantee the existence of an
asymptotically consistent sequence of estimators. The only claim is that a clairvoyant
statistician (with preknowledge of θ0 ) can choose a consistent sequence of roots. In reality,
it may be impossible to choose the right solutions based only on the data (and knowledge
of the model). In this sense the preceding theorem, a standard result in the literature, looks
better than it is.
The situation is not as bad as it seems. One interesting situation is if the solution of the
estimating equation is unique for every n. Then our solutions must be the same as those of
the clairvoyant statistician and hence the sequence of solutions is consistent.
In general, the deficit can be repaired with the help of a preliminary sequence of estimators θ̃n. If the sequence θ̃n is consistent, then it works to choose the root θ̂n of Pnψθ = 0 that is closest to θ̃n. Because ‖θ̂n − θ̃n‖ is smaller than the distance ‖θ̂n* − θ̃n‖ between the clairvoyant sequence θ̂n* and θ̃n, both distances converge to zero in probability. Thus the sequence of closest roots is consistent.
The assertion of the theorem can also be used in a negative direction. The point θ0 in the theorem is required to be a zero of θ ↦ Pψθ, but, apart from that, it may be arbitrary. Thus, the theorem implies at the same time that a malicious statistician can always choose a sequence of roots θ̂n that converges to any given zero. These may include other points besides the "true" value of θ. Furthermore, inspection of the proof shows that the sequence of roots can also be chosen to jump back and forth between two (or more) zeros. Thus, if the function θ ↦ Pψθ has multiple roots, we must exercise care: We can be sure that certain roots of θ ↦ Pnψθ are bad estimators.
Part of the problem here is caused by using estimating equations, rather than maximization, to find estimators, which blurs the distinction between points of absolute maximum, local maximum, and even minimum. In the light of the results on consistency in section 5.2, we may expect the location of the point of absolute maximum of θ ↦ Pnmθ to converge to the point of absolute maximum of θ ↦ Pmθ. As long as the latter is unique, the absolute maximizers of the criterion function are typically consistent.
5.43 Example (Weibull distribution). Let X1, ..., Xn be a sample from the Weibull distribution with density

  pθ,σ(x) = (θ/σ) x^{θ−1} e^{−x^θ/σ},   x > 0, θ > 0, σ > 0.

(Then σ^{1/θ} is a scale parameter.) The score function is given by the partial derivatives of the log density with respect to θ and σ:

  ℓ̇θ,σ(x) = ( 1/θ + log x − (x^θ/σ) log x ,  (1/σ)(−1 + x^θ/σ) ).
The likelihood equations Σ_{i=1}^n ℓ̇θ,σ(Xi) = 0 reduce to

  σ = (1/n) Σ_{i=1}^n Xi^θ,   1/θ + (1/n) Σ_{i=1}^n log Xi − (Σ_{i=1}^n Xi^θ log Xi)/(Σ_{i=1}^n Xi^θ) = 0.

The left side of the second equation is strictly decreasing in θ, from ∞ at θ = 0 to (1/n)Σ_{i=1}^n log Xi − log X(n) at θ = ∞. Hence a solution exists, and is unique, unless all Xi are equal. Provided the higher-order derivatives of the score function exist and can be dominated, the sequence of maximum likelihood estimators (θ̂n, σ̂n) is asymptotically normal by Theorems 5.41 and 5.42.
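Since the second likelihood equation involves θ alone, it can be solved numerically and σ̂ recovered from the first equation. A minimal sketch under these conventions (the simulated data and the bracketing interval are illustrative assumptions):

    import numpy as np
    from scipy.optimize import brentq

    rng = np.random.default_rng(3)
    x = rng.weibull(a=2.0, size=500)            # shape theta = 2, scale sigma = 1

    def profile_eq(theta):
        # left side of the second likelihood equation, strictly decreasing in theta
        return 1/theta + np.mean(np.log(x)) - np.sum(x**theta * np.log(x)) / np.sum(x**theta)

    theta_hat = brentq(profile_eq, 0.01, 50.0)  # unique root unless all observations are equal
    sigma_hat = np.mean(x**theta_hat)
    print(theta_hat, sigma_hat)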
There exist four different third-order derivatives, given by

  ∂³ℓθ,σ(x)/∂θ³ = 2/θ³ − (x^θ/σ) log³ x,

together with analogous expressions for the remaining mixed and σ-derivatives.
For θ and σ ranging over sufficiently small neighborhoods of θ0 and σ0, these functions are dominated by a function of the form
This corresponds to replacing Ψn(θ) by its tangent at θ̃n, and is known as the method of Newton-Raphson in numerical analysis. The solution θ = θ̂n is

  θ̂n = θ̃n − Ψ̇n(θ̃n)^{-1} Ψn(θ̃n).

In numerical analysis this procedure is iterated a number of times, taking θ̂n as the new preliminary guess, and so on. Provided that the starting point θ̃n is well chosen, the sequence of solutions converges to a root of Ψn. Our interest here goes in a different direction. We suppose that the preliminary estimator θ̃n is already within range n^{-1/2} of the true value of θ. Then, as we shall see, just one iteration of the Newton-Raphson scheme produces an estimator θ̂n that is as good as the Z-estimator defined by Ψn. In fact, it is better in that its consistency is guaranteed, whereas the true Z-estimator may be inconsistent or not uniquely defined.
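The following minimal sketch implements this single Newton-Raphson step for a generic estimating equation; the function names are assumptions made for the illustration, and the Cauchy-location choice of ψ anticipates Example 5.50:

    import numpy as np

    def one_step(theta_prelim, psi, psi_dot, x):
        """One Newton-Raphson step from a preliminary estimate theta_prelim."""
        num = np.mean(psi(x, theta_prelim))       # Psi_n evaluated at the preliminary estimate
        den = np.mean(psi_dot(x, theta_prelim))   # empirical derivative of Psi_n
        return theta_prelim - num / den

    # Cauchy location: psi equal to the score function of Example 5.50
    psi = lambda x, t: 2 * (x - t) / (1 + (x - t) ** 2)
    psi_dot = lambda x, t: (2 * (x - t) ** 2 - 2) / (1 + (x - t) ** 2) ** 2

    rng = np.random.default_rng(4)
    x = rng.standard_cauchy(200)                  # true location 0
    print(one_step(np.median(x), psi, psi_dot, x))   # one-step estimator started at the median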
In this way consistency and asymptotic normality are effectively separated, which is
useful because these two aims require different properties of the estimating equations.
Good initial estimators can be constructed by ad-hoc methods and take care of consistency.
Next, these initial estimators can be improved by the one-step method. Thus, for instance,
the good properties of maximum likelihood estimation can be retained, even in cases in
which the consistency fails.
In this section we impose the following condition on the random criterion functions Ψn. For every constant M and a given nonsingular matrix Ψ̇0,

  sup_{‖h‖≤M} ‖√n (Ψn(θ0 + h/√n) − Ψn(θ0)) − Ψ̇0 h‖ →P 0.    (5.44)

Condition (5.44) suggests that Ψn is differentiable at θ0, with derivative tending to Ψ̇0, but this is not an assumption. We do not require that a derivative exists, and introduce estimators Ψ̇n,0 of Ψ̇0 as separate elements.
The one-step estimator is then defined as

  θ̂n = θ̃n − Ψ̇n,0^{-1} Ψn(θ̃n).
5.45 Theorem (One-step estimation). Let √n Ψn(θ0) ⇝ Z and let (5.44) hold. Then the one-step estimator θ̂n, for a given √n-consistent estimator sequence θ̃n and estimators Ψ̇n,0 →P Ψ̇0, satisfies

  √n(θ̂n − θ0) ⇝ −Ψ̇0^{-1} Z.

5.46 Addendum. For Ψn(θ) = Pnψθ condition (5.44) is satisfied under the conditions of Theorem 5.21 with Ψ̇0 = Vθ0, and under the conditions of Theorem 5.41 with Ψ̇0 = Pψ̇θ0.

Proof. The standardized estimator √n(θ̂n − θ0) equals

  √n(θ̃n − θ0) − Ψ̇n,0^{-1} √n(Ψn(θ̃n) − Ψn(θ0)) − Ψ̇n,0^{-1} √n Ψn(θ0).

By (5.44) the second term can be replaced by −Ψ̇n,0^{-1} Ψ̇0 √n(θ̃n − θ0) + oP(1). Thus the expression can be rewritten as

  (I − Ψ̇n,0^{-1} Ψ̇0) √n(θ̃n − θ0) − Ψ̇n,0^{-1} √n Ψn(θ0) + oP(1).

The first term converges to zero in probability, and the theorem follows after application of Slutsky's lemma.
For a proof of the addendum, see the proofs of the corresponding theorems. ■
If the sequence √n(θ̂n − θ0) converges in distribution, then it is certainly uniformly tight. Consequently, a sequence of one-step estimators is √n-consistent and can itself be used as preliminary estimator for a second iteration of the modified Newton-Raphson algorithm. Presumably, this would give a value closer to a root of Ψn. However, the limit distribution of this "two-step estimator" is the same, so that repeated iteration does not give asymptotic improvement. In practice a multistep method may nevertheless give better results.
We close this section with a discussion of the discretization trick. This device is mostly of theoretical value and has been introduced to relax condition (5.44) to the following. For every nonrandom sequence θn = θ0 + O(n^{-1/2}),

  √n (Ψn(θn) − Ψn(θ0)) − Ψ̇0 √n(θn − θ0) →P 0.    (5.47)

This new condition is less stringent and much easier to check. It is sufficiently strong if the preliminary estimators θ̃n are discretized on grids of mesh width n^{-1/2}. For instance, θ̃n is suitably discretized if all its realizations are points of the grid n^{-1/2}ℤ^k (consisting of the points n^{-1/2}(i1, ..., ik) for integers i1, ..., ik). This is easy to achieve, but perhaps unnatural. Any preliminary estimator sequence θ̃n can be discretized by replacing its values by the closest points of the grid. Because this changes each coordinate by at most n^{-1/2}, √n-consistency of θ̃n is retained by discretization.
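In code, discretizing a preliminary estimate onto the grid n^{-1/2}ℤ^k is a one-line operation; this small sketch (variable names are illustrative) shows it and the fact that it moves each coordinate by less than the grid width:

    import numpy as np

    def discretize(theta_prelim, n):
        """Round each coordinate to the nearest point of the grid n^{-1/2} Z^k."""
        return np.round(theta_prelim * np.sqrt(n)) / np.sqrt(n)

    theta_prelim = np.array([0.4137, -1.2093])
    print(discretize(theta_prelim, n=100))   # each coordinate moves by at most half of n^{-1/2}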
Define a one-step estimator θ̂n as before, but now use a discretized version of the preliminary estimator.

5.48 Theorem (Discretized one-step estimation). Let √n Ψn(θ0) ⇝ Z and let (5.47) hold. Then the one-step estimator θ̂n, for a given √n-consistent, discretized estimator sequence θ̃n and estimators Ψ̇n,0 →P Ψ̇0, satisfies

  √n(θ̂n − θ0) ⇝ −Ψ̇0^{-1} Z.

5.49 Addendum. For Ψn(θ) = Pnψθ and Pn the empirical measure of a random sample from a density pθ that is differentiable in quadratic mean (5.38), condition (5.47) is satisfied, with Ψ̇0 = −Pθ0 ψθ0 ℓ̇θ0^T, if, as θ → θ0,
converges to zero in probability. Fix ε > 0. By the √n-consistency, there exists M with P(√n‖θ̃n − θ0‖ > M) < ε. If √n‖θ̃n − θ0‖ ≤ M, then θ̃n equals one of the values in the set Sn = {θ ∈ n^{-1/2}ℤ^k: ‖θ − θ0‖ ≤ n^{-1/2}M}. For each M and n there are only finitely many elements in this set. Moreover, for fixed M the number of elements is bounded independently of n. Thus

The maximum of the terms in the sum corresponds to a sequence of nonrandom vectors θn with θn = θ0 + O(n^{-1/2}). It converges to zero by (5.47). Because the number of terms in the sum is bounded independently of n, the sum converges to zero.
For a proof of the addendum, see Proposition A.10 in [139]. ■

If the score function ℓ̇θ of the model also satisfies the conditions of the addendum, then the estimators Ψ̇n,0 = −Pθ̃n ψθ̃n ℓ̇θ̃n^T are consistent for Ψ̇0. This shows that discretized one-step estimation can be carried through under very mild regularity conditions. Note that the addendum requires only continuity of θ ↦ ψθ, whereas (5.47) appears to require differentiability.
5.50 Example (Cauchy distribution). Suppose X1, ..., Xn are a sample from the Cauchy location family pθ(x) = π^{-1}(1 + (x − θ)²)^{-1}. Then the score function is given by

  ℓ̇θ(x) = 2(x − θ)/(1 + (x − θ)²).

Figure 5.4. Cauchy log likelihood function of a sample of 25 observations, showing three local maxima. The value of the absolute maximum is well-separated from the other maxima, and its location is close to the true value zero of the parameter.
This function behaves like 1/x for x → ±∞ and is bounded in between. The second moment of ℓ̇θ(X1) therefore exists, unlike the moments of the distribution itself. Because the sample mean possesses the same (Cauchy) distribution as a single observation X1, the sample mean is a very inefficient estimator. Instead we could use the median, or another M-estimator. However, the asymptotically best estimator should be based on maximum likelihood. We have

The tails of this function are of the order 1/x³, and the function is bounded in between. These bounds are uniform in θ varying over a compact interval. Thus the conditions of Theorems 5.41 and 5.42 are satisfied. Since the consistency follows from Example 5.16, the sequence of maximum likelihood estimators is asymptotically normal.
The Cauchy maximum likelihood estimator has gained a bad reputation, because the likelihood equation Σ ℓ̇θ(Xi) = 0 typically has several roots. The number of roots behaves asymptotically as two times a Poisson(1/π) variable plus 1. (See [126].) Therefore, the one-step (or possibly multi-step) method is often recommended, with, for instance, the median as the initial estimator. Perhaps a better solution is not to use the likelihood equations, but to determine the maximum likelihood estimator by, for instance, visual inspection of a graph of the likelihood function, as in Figure 5.4. This is particularly appropriate because the difficulty of multiple roots does not occur in the two-parameter location-scale model. In the model with density pθ(x/σ)/σ, the maximum likelihood estimator for (θ, σ) is unique. (See [25].) □
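A quick way to carry out the graphical inspection suggested above is to evaluate the log likelihood on a grid and take the grid maximizer as the estimate or as a starting value; a minimal sketch (grid range, sample size, and seed are arbitrary choices):

    import numpy as np

    rng = np.random.default_rng(5)
    x = rng.standard_cauchy(25)                       # true location 0, as in Figure 5.4

    grid = np.linspace(-10, 10, 2001)
    # Cauchy log likelihood, up to the constant -n*log(pi), evaluated on the grid
    loglik = -np.log(1 + (x[None, :] - grid[:, None]) ** 2).sum(axis=1)
    theta_grid = grid[np.argmax(loglik)]              # location of the absolute maximum
    print(theta_grid)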
5.51 Example (Mixtures). Let f and g be given, positive probability densities on the real line. Consider estimating the parameter θ = (μ, ν, σ, τ, p) based on a random sample from the mixture density

  x ↦ p (1/σ) f((x − μ)/σ) + (1 − p) (1/τ) g((x − ν)/τ).

The likelihood is unbounded: If we set μ = x1 and next maximize over σ > 0, then we obtain the value ∞ whenever p > 0, irrespective of the values of ν and τ.
A one-step estimator appears reasonable in this example. In view of the smoothness of the likelihood, the general theory yields the asymptotic efficiency of a one-step estimator if started with an initial √n-consistent estimator. Moment estimators could be appropriate initial estimators.
5.52 Theorem (Rate of convergence). Assume that for fixed constants C and α > β, for every n, and for every sufficiently small δ > 0,

  sup_{δ/2 ≤ d(θ,θ0) < δ} P(mθ − mθ0) ≤ −Cδ^α,
  E* sup_{d(θ,θ0) < δ} |Gn(mθ − mθ0)| ≤ Cδ^β.

If the sequence θ̂n satisfies Pnmθ̂n ≥ Pnmθ0 − OP(n^{α/(2β−2α)}) and converges in outer probability to θ0, then n^{1/(2α−2β)} d(θ̂n, θ0) = OP*(1).
Proof. Set rn = n^{1/(2α−2β)} and suppose that θ̂n maximizes the map θ ↦ Pnmθ up to a variable Rn = OP(rn^{−α}).
For each n, the parameter space minus the point θ0 can be partitioned into the "shells" Sj,n = {θ: 2^{j−1} < rn d(θ, θ0) ≤ 2^j}, with j ranging over the integers. If rn d(θ̂n, θ0) is larger than 2^M for a given integer M, then θ̂n is in one of the shells Sj,n with j ≥ M. In that case the supremum of the map θ ↦ Pnmθ − Pnmθ0 over this shell is at least −Rn by the property of θ̂n. Conclude that, for every ε > 0,

If the sequence θ̂n is consistent for θ0, then the second probability on the right converges to 0 as n → ∞, for every fixed ε > 0. The third probability on the right can be made arbitrarily small by choice of K, uniformly in n. Choose ε > 0 small enough to ensure that the conditions of the theorem hold for every δ ≤ ε. Then for every j involved in the sum, we have

  sup_{θ∈Sj,n} P(mθ − mθ0) ≤ −C 2^{(j−1)α} / rn^α.

For ½C2^{(M−1)α} ≥ K, the series can be bounded in terms of the empirical process Gn by

by Markov's inequality and the definition of rn. The right side converges to zero for every M = Mn → ∞. ■
Consider the special case that the parameter θ is a Euclidean vector. If the map θ ↦ Pmθ is twice-differentiable at the point of maximum θ0, then its first derivative at θ0 vanishes and a Taylor expansion of the limit criterion function takes the form

  Pmθ − Pmθ0 = ½(θ − θ0)^T V (θ − θ0) + o(‖θ − θ0‖²).

Then the first condition of the theorem holds with α = 2 provided that the second-derivative matrix V is nonsingular.
The second condition of the theorem is a maximal inequality and is harder to verify. In "regular" cases it is valid with β = 1 and the theorem yields the "usual" rate of convergence √n. The theorem also applies to nonstandard situations and yields, for instance, the rate n^{1/3} if α = 2 and β = ½. Lemmas 19.34, 19.36 and 19.38 and Corollary 19.35 are examples of maximal inequalities that can be appropriate for the present purpose. They give bounds in terms of the entropies of the classes of functions {mθ − mθ0: d(θ, θ0) < δ}.
A Lipschitz condition on the maps θ ↦ mθ is one possibility to obtain simple estimates on these entropies and is applicable in many applications. The result of the following corollary is used earlier in this chapter.
5.53 Corollary. For each θ in an open subset of Euclidean space let x ↦ mθ(x) be a measurable function such that, for every θ1 and θ2 in a neighborhood of θ0 and a measurable function ṁ such that Pṁ² < ∞,

  |mθ1(x) − mθ2(x)| ≤ ṁ(x) ‖θ1 − θ2‖.

Furthermore, suppose that the map θ ↦ Pmθ admits a second-order Taylor expansion at the point of maximum θ0 with nonsingular second derivative. If Pnmθ̂n ≥ Pnmθ0 − oP(n^{-1}), then √n(θ̂n − θ0) = OP(1), provided that θ̂n →P θ0.
Proof. By assumption, the first condition of Theorem 5.52 is valid with α = 2. To see that the second one is valid with β = 1, we apply Corollary 19.35 to the class of functions F = {mθ − mθ0: ‖θ − θ0‖ < δ}. This class has envelope function F = ṁδ, whence

  E* sup_{‖θ−θ0‖<δ} |Gn(mθ − mθ0)| ≲ ∫_0^{‖ṁδ‖_{P,2}} √(log N[](ε, F, L2(P))) dε.

The bracketing entropy of the class F is estimated in Example 19.7. Inserting the upper bound obtained there into the integral, we obtain that the preceding display is bounded above by a multiple of δ. ■
Rates of convergence different from √n are quite common for M-estimators of infinite-dimensional parameters and may also be obtained through the application of Theorem 5.52. See Chapters 24 and 25 for examples. Rates slower than √n may also arise for fairly simple parametric estimates.
5.54 Example (Modal interval). Suppose that we define an estimator θ̂n of location as the center of an interval of length 2 that contains the largest possible fraction of the observations. This is an M-estimator for the functions mθ = 1[θ−1, θ+1].
For many underlying distributions the first condition of Theorem 5.52 holds with α = 2. It suffices that the map θ ↦ Pmθ = P[θ − 1, θ + 1] is twice-differentiable and has a proper maximum at some point θ0. Using the maximal inequality Corollary 19.35 (or Lemma 19.38), we can show that the second condition is valid with β = ½. Indeed, the bracketing entropy of the intervals in the real line is of the order δ/ε², and the envelope function of the class of functions 1[θ−1, θ+1] − 1[θ0−1, θ0+1], as θ ranges over (θ0 − δ, θ0 + δ), is bounded by 1[θ0−1−δ, θ0−1+δ] + 1[θ0+1−δ, θ0+1+δ], whose squared L2-norm is bounded by a multiple of ‖p‖∞ δ.
Thus Theorem 5.52 applies with α = 2 and β = ½ and yields the rate of convergence n^{1/3}. The resulting location estimator is very robust against outliers. However, in view of its slow convergence rate, one should have good reasons to use it.
The use of an interval of length 2 is somewhat awkward. Every other fixed length would give the same result. More interestingly, we can also replace the fixed-length interval by the smallest interval that contains a fixed fraction, for instance 1/2, of the observations. This still yields a rate of convergence of n^{1/3}. The intuitive reason for this is that the length of a "shorth" settles down at a √n-rate and hence its randomness is asymptotically negligible relative to its center. □
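For concreteness, a minimal sketch of the modal-interval estimator (the half-width 1 follows the interval of length 2 in the example; the simulated data are an arbitrary choice):

    import numpy as np

    def modal_interval_center(x, half_width=1.0):
        """Center of an interval of length 2*half_width containing the most observations."""
        xs = np.sort(x)
        # an optimal interval can be taken to start at an observation: [x_i, x_i + 2*half_width]
        counts = np.searchsorted(xs, xs + 2 * half_width, side="right") - np.arange(len(xs))
        i = np.argmax(counts)
        return xs[i] + half_width

    rng = np.random.default_rng(6)
    x = rng.standard_normal(1000)
    print(modal_interval_center(x))   # near the mode 0, but converging only at an n^(1/3) rate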
The sets Θn and Hn need not be metric spaces, but instead we measure the discrepancies between θ̂n and θ0, and η̂n and a limiting value η0, by nonnegative functions θ ↦ dn(θ, θ0) and η ↦ d(η, η0), which may be arbitrary.
5.55 Theorem. Assume that, for arbitrary functions Mn: Θn × Hn ↦ ℝ and φn: (0, ∞) ↦ ℝ such that δ ↦ φn(δ)/δ^β is decreasing for some β < 2, every (θ, η) ∈ Θn × Hn, and every δ > 0,

Let δn > 0 satisfy φn(δn) ≤ √n δn² for every n. If P(θ̂n ∈ Θn, η̂n ∈ Hn) → 1 and Mn(θ̂n, η̂n) ≥ Mn(θ0, η̂n) − OP(δn²), then dn(θ̂n, θ0) = OP*(δn + d(η̂n, η0)).
Then the intersection of the events {(θ̂n, η̂n) ∈ Θn × Hn} and {dn(θ̂n, θ0) ≥ 2^M(δn + d(η̂n, η0))} is contained in the union of the events {(θ̂n, η̂n) ∈ Sn,j,M} over j ≥ M. By the definition of θ̂n, the supremum of Mn(θ, η̂n) − Mn(θ0, η̂n) over the set of parameters (θ, η) ∈ Sn,j,M is nonnegative on the event {(θ̂n, η̂n) ∈ Sn,j,M}. Conclude that

From here on the proof is the same as the proof of Theorem 5.52, except that we use that φn(cδ) ≤ c^β φn(δ) for every c > 1, by the assumption on φn. ■
  h ↦ Mn(θ0 + h) − Mn(θ0).
Suppose that these, if suitably normed, converge to a limit process h ↦ M(h). Then the general principle is that the sequence ĥn converges in distribution to the maximizer of this limit process.
For simplicity of notation we shall write the local criterion functions as h ↦ Mn(h). Let {Mn(h): h ∈ Hn} be arbitrary stochastic processes indexed by subsets Hn of a given metric space. We wish to prove that the argmax functional is continuous: If Mn ⇝ M and Hn → H in a suitable sense, then the (near) maximizers ĥn of the random maps h ↦ Mn(h) converge in distribution to the maximizer ĥ of the limit process h ↦ M(h). It is easy to find examples in which this is not true, but given the right definitions it is, under some conditions. Given a set B, set

  Mn(B) = sup_{h∈B} Mn(h).

Then convergence in distribution of the vectors (Mn(A), Mn(B)) for given pairs of sets A and B is an appropriate form of convergence of Mn to M. The following theorem gives some flexibility in the choice of the indexing sets. We implicitly either assume that the suprema Mn(B) are measurable or understand the weak convergence in terms of outer probabilities, as in Chapter 18.
The result we are looking for is not likely to be true if the maximizer of the limit process is not well defined. Exactly as in Theorem 5.7, the maximum should be "well separated." Because in the present case the limit is a stochastic process, we require that every sample path h ↦ M(h) possesses a well-separated maximum (condition (5.57)).
5.56 Theorem (Argmax theorem). Let Mn and M be stochastic processes indexed by subsets Hn and H of a given metric space such that, for every pair of a closed set F and a set K in a given collection K,

Furthermore, suppose that every sample path of the process h ↦ M(h) possesses a well-separated point of maximum ĥ in that, for every open set G and every K ∈ K,

Proof. If ĥn ∈ F ∩ K, then Mn(F ∩ K ∩ Hn) ≥ Mn(B) − oP(1) for any set B. Hence, for every closed set F and every K ∈ K,

by Slutsky's lemma and the portmanteau lemma. If ĥ ∈ F^c, then M(F ∩ K ∩ H) is strictly smaller than M(ĥ) by (5.57) and hence on the intersection with the event in the far right side ĥ cannot be contained in K ∩ H. It follows that
The theorem works most smoothly if we can take K to consist only of the whole space. However, then we are close to assuming some sort of global uniform convergence of Mn to M, and this may not hold or be hard to prove. It is usually more economical in terms of conditions to show that the maximizers ĥn are contained in certain sets K, with high probability. Then uniform convergence of Mn to M on K is sufficient. The choice of compact sets K corresponds to establishing the uniform tightness of the sequence ĥn before applying the argmax theorem.
If the sample paths of the processes Mn are bounded on K and Hn = H for every n, then the convergence of the processes Mn viewed as elements of the space ℓ^∞(K) implies the convergence condition of the argmax theorem. This follows by the continuous-mapping theorem, because the map
Proof. The compactness of K and the continuity of the sample paths h ↦ M(h) imply that the (unique) points of maximum ĥ are automatically well separated in the sense of (5.57). Indeed, if this fails for a given open set G ∋ ĥ and K (and a given ω in the underlying probability space), then there exists a sequence hm in G^c ∩ K ∩ H such that M(hm) → M(ĥ). If K is compact, then this sequence can be chosen convergent, with limit h0 say. The limit h0 must be in the closed set G^c and hence cannot be ĥ. By the continuity of M it also has the property that M(h0) = lim M(hm) = M(ĥ). This contradicts the assumption that ĥ is a unique point of maximum.
If we can show that (Mn(F ∩ Hn), Mn(K ∩ Hn)) converges to the corresponding limit for every compact sets F ⊂ K, then the theorem is a corollary of Theorem 5.56. If Hn = H for every n, then this convergence is immediate from the weak convergence of Mn to M in ℓ^∞(K), by the continuous-mapping theorem. For Hn changing with n this convergence may fail, and we need to refine the proof of Theorem 5.56. This goes through with minor changes if

and g similarly, but with H replacing Hn. By an argument as in the proof of Theorem 18.11, the desired result follows if lim sup gn(zn) ≤ g(z) for every sequence zn → z in ℓ^∞(K) and continuous function z. (Then lim sup P(gn(Mn) ≥ x) ≤ P(g(M) ≥ x) for every x, for any weakly converging sequence Mn ⇝ M with a limit with continuous sample paths.) This in turn follows if for every precompact set B ⊂ K,
The argmax theorem can also be used to prove consistency, by applying it to the original criterion functions θ ↦ Mn(θ). Then the limit process θ ↦ M(θ) is degenerate, and has a fixed point of maximum θ0. Weak convergence becomes convergence in probability, and the theorem now gives conditions for the consistency θ̂n →P θ0. Condition (5.57) reduces to the well-separation of θ0, and the convergence

  sup_{θ∈F∩K∩Θn} Mn(θ) →P sup_{θ∈F∩K∩Θ} M(θ)

is, apart from allowing Θn to depend on n, weaker than the uniform convergence of Mn to M.
Notes
In the section on consistency we have given two main results (uniform convergence and
Wald's proof) that have proven their value over the years, but there is more to say on this
subject. The two approaches can be unified by replacing the uniform convergence by "one-
sided uniform convergence," which in the case of i.i.d. observations can be established
under the conditions of Wald's theorem by a bracketing approach as in Example 19.8 (but
then one-sided). Furthermore, the use of special properties, such as convexity of the ψ or
m functions, is often helpful. Examples such as Lemma 5.10, or the treatment of maximum
likelihood estimators in exponential families in Chapter 4, appear to indicate that no single
approach can be satisfactory.
The study of the asymptotic properties of maximum likelihood estimators and other
M-estimators has a long history. Fisher [48], [50] was a strong advocate of the method of
maximum likelihood and noted its asymptotic optimality as early as the 1920s. What we
have labelled the classical conditions correspond to the rigorous treatment given by Cramér
[27] in his authoritative book. Huber initiated the systematic study of M -estimators, with
the purpose of developing robust statistical procedures. His paper [78] contains important
ideas that are precursors for the application of techniques from the theory of empirical
processes by, among others, Pollard, as in [117], [118], and [120]. For one-dimensional
parameters these empirical process methods can be avoided by using a maximal inequality
based on the L_r-norm (see, e.g., Theorem 2.2.4 in [146]). Surprisingly, then a Lipschitz
condition on the Hellinger distance (an integrated quantity) suffices; see for example, [80] or
[94]. For higher-dimensional parameters the results are also not the best possible, but I do
not know of any simple better ones.
The books by Huber [79] and by Hampel, Ronchetti, Rousseeuw, and Stahel [73] are
good sources for applications of M -estimators in robust statistics. These references also
discuss the relative efficiency of the different M -estimators, which motivates, for instance,
the use of Huber's ψ-function. In this chapter we have derived Huber's estimator as the
solution of the problem of minimizing the asymptotic variance under the side condition
of a uniformly bounded influence function. Originally Huber derived it as the solution to
the problem of minimizing the maximum asymptotic variance sup_P σ²(ψ, P) for P ranging over a contamination neighborhood P = (1 − ε)Φ + εQ with Q arbitrary. For M-estimators these two approaches turn out to be equivalent.
The one-step method can be traced back to numerical schemes for solving the likelihood
equations, including Fisher's method of scoring. One-step estimators were introduced for
their asymptotic efficiency by Le Cam in 1956, who later developed them for general locally
asymptotically quadratic models, and also introduced the discretization device, (see [93]).
PROBLEMS
1. Let X1, ..., Xn be a sample from a density that is strictly positive and symmetric about some point. Show that the Huber M-estimator for location is consistent for the symmetry point.
2. Find an expression for the asymptotic variance of the Huber estimator for location if the observations are normally distributed.
3. Define ψ(x) = 1 − p, 0, −p according as x < 0, x = 0, x > 0. Show that Eψ(X − θ) = 0 implies that P(X < θ) ≤ p ≤ P(X ≤ θ).
4. Let X1, ..., Xn be i.i.d. N(μ, σ²)-distributed. Derive the maximum likelihood estimator for (μ, σ²) and show that it is asymptotically normal. Calculate the Fisher information matrix for this parameter and its inverse.
5. Let X1, ..., Xn be i.i.d. Poisson(1/θ)-distributed. Derive the maximum likelihood estimator for θ and show that it is asymptotically normal.
6. Let X1, ..., Xn be i.i.d. N(θ, θ)-distributed. Derive the maximum likelihood estimator for θ and show that it is asymptotically normal.
7. Find a sequence of fixed (nonrandom) functions Mn: ℝ → ℝ that converges pointwise to a limit M0 and such that each Mn has a unique maximum at a point θn, but the sequence θn does not converge to θ0. Can you also find a sequence Mn that converges uniformly?
8. Find a sequence of fixed (nonrandom) functions Mn: ℝ → ℝ that converges pointwise but not uniformly to a limit M0 such that each Mn has a unique maximum at a point θn and the sequence θn converges to θ0.
9. Let X1, ..., Xn be i.i.d. observations from a uniform distribution on [0, θ]. Show that the sequence of maximum likelihood estimators is asymptotically consistent. Show that it is not asymptotically normal.
10. Let X1, ..., Xn be i.i.d. observations from an exponential density θe^{−θx}. Show that the sequence of maximum likelihood estimators is asymptotically normal.
11. Let Fn^{-1}(p) be a pth sample quantile of a sample from a cumulative distribution F on ℝ that is differentiable with positive derivative f at the population pth quantile F^{-1}(p) = inf{x: F(x) ≥ p}. Show that √n(Fn^{-1}(p) − F^{-1}(p)) is asymptotically normal with mean zero and variance p(1 − p)/f(F^{-1}(p))².
12. Derive a minimal condition on the distribution function F that guarantees the consistency of the sample pth quantile.
13. Calculate the asymptotic variance of √n(θ̂n − θ) in Example 5.26.
14. Suppose that we observe a random sample from the distribution of (X, Y) in the following errors-in-variables model:
  X = Z + e,   Y = α + βZ + f,
where (e, f) is bivariate normally distributed with mean 0 and covariance matrix σ²I and is independent from the unobservable variable Z. In analogy to Example 5.26, construct a system of estimating equations for (α, β) based on a conditional likelihood, and study the limit properties of the corresponding estimators.
15. In Example 5.27, for what point is the least squares estimator θ̂n consistent if we drop the condition that E(e | X) = 0? Derive an (implicit) solution in terms of the function E(e | X). Is it necessarily θ0 if Ee = 0?
16. In Example 5.27, consider the asymptotic behavior of the least absolute-value estimator θ̂ that minimizes Σ_{i=1}^n |Yi − φθ(Xi)|.
17. Let X1, ..., Xn be i.i.d. with density fλ,a(x) = λe^{−λ(x−a)} 1{x ≥ a}, where the parameters λ > 0 and a ∈ ℝ are unknown. Calculate the maximum likelihood estimator (λ̂n, ân) of (λ, a) and derive its asymptotic properties.
18. Let X be Poisson-distributed with density pθ(x) = θ^x e^{−θ}/x!. Show by direct calculation that Eθ ℓ̇θ(X) = 0 and Eθ ℓ̈θ(X) = −Iθ. Compare this with the assertions in the introduction. Apparently, differentiation under the integral (sum) is permitted in this case. Is that obvious from results from measure theory or (complex) analysis?
19. Let X1, ..., Xn be a sample from the N(θ, 1) distribution, where it is known that θ ≥ 0. Show that the maximum likelihood estimator is not asymptotically normal under θ = 0. Why does this not contradict the theorems of this chapter?
20. Show that (θ̂n − θ0)Ψ̇n(θ̄n) in formula (5.18) converges in probability to zero if θ̂n →P θ0 and there exists an integrable function M and δ > 0 with ‖ψ̇θ(x)‖ ≤ M(x) for every x and every ‖θ − θ0‖ < δ.
21. If θ̂n maximizes Mn, then it also maximizes Mn^+. Show that this may be used to relax the conditions of Theorem 5.7 to supθ |Mn^+ − M^+|(θ) → 0 in probability (if M(θ0) > 0).
22. Suppose that for every ε > 0 there exists a set Θε with lim inf P(θ̂n ∈ Θε) ≥ 1 − ε. Then uniform convergence of Mn to M in Theorem 5.7 can be relaxed to uniform convergence on every Θε.
23. Show that Wald's consistency proof yields almost sure convergence of θ̂n, rather than convergence in probability, if the parameter space is compact and Mn(θ̂n) ≥ Mn(θ0) − o(1).
24. Suppose that (X1, Y1), ..., (Xn, Yn) are i.i.d. and satisfy the linear regression relationship Yi = θ^T Xi + ei for (unobservable) errors e1, ..., en independent of X1, ..., Xn. Show that the mean absolute deviation estimator, which minimizes Σ |Yi − θ^T Xi|, is asymptotically normal under a mild condition on the error distribution.
25. (i) Verify the conditions of Wald's theorem for mθ the log likelihood function of the N(μ, σ²)-distribution if the parameter set for θ = (μ, σ²) is a compact subset of ℝ × ℝ^+.
(ii) Extend mθ by continuity to the compactification of ℝ × ℝ^+. Show that the conditions of Wald's theorem fail at the points (μ, 0).
(iii) Replace mθ by the log likelihood function of a pair of two independent observations from the N(μ, σ²)-distribution. Show that Wald's theorem now does apply, also with a compactified parameter set.
26. A distribution on ℝ^k is called ellipsoidally symmetric if it has a density of the form x ↦ g((x − μ)^T Σ^{-1}(x − μ)) for a function g: [0, ∞) ↦ [0, ∞), a vector μ, and a symmetric positive-definite matrix Σ. Study the Z-estimators for location μ̂ that solve an equation of the form

  Σ_{i=1}^n ψ((Xi − μ)^T Σ̂n^{-1}(Xi − μ)) (Xi − μ) = 0,

for given estimators Σ̂n and, for instance, Huber's ψ-function. Is the asymptotic distribution of Σ̂n important?
27. Suppose that Θ is a compact metric space and M: Θ → ℝ is continuous. Show that (5.8) is equivalent to the point θ0 being a point of unique global maximum. Can you relax the continuity of M to some form of "semi-continuity"?
As proved in the next lemma, Q^a ≪ P and Q^⊥ ⊥ P. Furthermore, for every measurable set A,
The measures Q^a and Q^⊥ are called the absolutely continuous part and the orthogonal part (or singular part) of Q with respect to P, respectively. In view of the preceding display, the function q/p is a density of Q^a with respect to P. It is denoted dQ/dP (not: dQ^a/dP), so that

  dQ/dP = q/p,   P-a.s.

As long as we are only interested in the properties of the quotient q/p under P-probability, we may leave the quotient undefined for p = 0. The density dQ/dP is only P-almost surely unique by definition. Even though we have used densities to define them, dQ/dP and the Lebesgue decomposition are actually independent of the choice of densities and dominating measure.
In statistics a more common name for a Radon-Nikodym density is likelihood ratio. We shall think of it as a random variable dQ/dP: Ω → [0, ∞) and shall study its law under P.
6.2 Lemma. Let P and Q be probability measures with densities p and q with respect to a measure μ. Then for the measures Q^a and Q^⊥ defined in (6.1):
(i) Q = Q^a + Q^⊥, Q^a ≪ P, Q^⊥ ⊥ P.
(ii) Q^a(A) = ∫_A (q/p) dP for every measurable set A.
(iii) Q ≪ P if and only if Q(p = 0) = 0 if and only if ∫ (q/p) dP = 1.

Proof. The first statement of (i) is obvious from the definitions of Q^a and Q^⊥. For the second, we note that P(A) can be zero only if p(x) = 0 for μ-almost all x ∈ A. In this case, μ(A ∩ {p > 0}) = 0, whence Q^a(A) = Q(A ∩ {p > 0}) = 0 by the absolute continuity of Q with respect to μ. The third statement of (i) follows from P(p = 0) = 0 and Q^⊥(p > 0) = Q(∅) = 0.
Statement (ii) follows from

  Q^a(A) = ∫_{A∩{p>0}} q dμ = ∫_{A∩{p>0}} (q/p) p dμ = ∫_A (q/p) dP.

For (iii) we note first that Q ≪ P if and only if Q^⊥ = 0. By (6.1) the latter happens if and only if Q(p = 0) = 0. This yields the first "if and only if." For the second, we note that ∫ (q/p) dP = Q^a(Ω) = Q(Ω) − Q^⊥(Ω) = 1 − Q(p = 0), by (ii) and (6.1). ■
It is not true in general that ∫ f dQ = ∫ f (dQ/dP) dP. For this to be true for every measurable function f, the measure Q must be absolutely continuous with respect to P. On the other hand, for any P and Q and nonnegative f,

  ∫ f dQ ≥ ∫_{p>0} f q dμ = ∫_{p>0} f (q/p) p dμ = ∫ f (dQ/dP) dP.

This inequality is used freely in the following. The inequality may be strict, because dividing by zero is not permitted.†
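A small worked example (not from the text) may make the decomposition concrete. Take μ Lebesgue measure, P the uniform distribution on [0, 1] and Q the uniform distribution on [0, 2], so p = 1[0,1] and q = ½·1[0,2]. Then dQ/dP = q/p = ½ on [0, 1]; Q^a is the measure with density ½·1[0,1] with respect to μ, of total mass ½; and Q^⊥ is Q restricted to (1, 2], also of mass ½. In accordance with Lemma 6.2(iii), Q is not absolutely continuous with respect to P, Q(p = 0) = Q((1, 2]) = ½ ≠ 0, and ∫ (q/p) dP = ½ < 1. For f ≡ 1 the inequality of the preceding display is strict: ∫ f dQ = 1 > ½ = ∫ f (dQ/dP) dP.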
6.2 Contiguity
If a probability measure Q is absolutely continuous with respect to a probability measure P, then the Q-law of a random vector X: Ω → ℝ^k can be calculated from the P-law of the pair (X, dQ/dP) through the formula

  E_Q f(X) = E_P f(X) (dQ/dP).

With P^{X,V} equal to the law of the pair (X, V) = (X, dQ/dP) under P, this relationship can also be expressed as

  Q(X ∈ B) = E_P 1_B(X) (dQ/dP) = ∫_{B×ℝ} v dP^{X,V}(x, v).

The validity of these formulas depends essentially on the absolute continuity of Q with respect to P, because a part of Q that is orthogonal with respect to P cannot be recovered from any P-law.
Consider an asymptotic version of the problem. Let (Ωn, An) be measurable spaces, each equipped with a pair of probability measures Pn and Qn. Under what conditions can a Qn-limit law of random vectors Xn: Ωn → ℝ^k be obtained from suitable Pn-limit laws? In view of the above it is necessary that Qn is "asymptotically absolutely continuous" with respect to Pn in a suitable sense. The right concept is contiguity.
The name "contiguous" is standard, but perhaps conveys a wrong image. "Contiguity" suggests sequences of probability measures living next to each other, but the correct image is "on top of each other" (in the limit).
† The algebraic identity dQ = (dQ/dP) dP is false, because the notation dQ/dP is used as shorthand for dQ^a/dP: If we write dQ/dP, then we are not implicitly assuming that Q ≪ P.
Thus, the sequences of likelihood ratios dQn/dPn and dPn/dQn are uniformly tight under Pn and Qn, respectively. By Prohorov's theorem, every subsequence has a further weakly converging subsequence. The next lemma shows that the properties of the limit points determine contiguity. This can be understood in analogy with the nonasymptotic situation. For probability measures P and Q, the following three statements are equivalent by (iii) of Lemma 6.2:

  Q ≪ P,   Q(p = 0) = 0,   E_P (dQ/dP) = 1.

This equivalence persists if the three statements are replaced by their asymptotic counterparts: Sequences Pn and Qn satisfy Qn ◁ Pn, if and only if the weak limit points of dPn/dQn under Qn give mass 0 to 0, if and only if the weak limit points of dQn/dPn under Pn have mean 1.
6.4 Lemma (Le Cam's first lemma). Let Pn and Qn be sequences of probability measures on measurable spaces (Ωn, An). Then the following statements are equivalent:
(i) Qn ◁ Pn.
(ii) If dPn/dQn ⇝ U under Qn along a subsequence, then P(U > 0) = 1.
(iii) If dQn/dPn ⇝ V under Pn along a subsequence, then EV = 1.
(iv) For any statistics Tn: Ωn → ℝ^k: If Tn → 0 under Pn, then Tn → 0 under Qn.

Proof. The equivalence of (i) and (iv) follows directly from the definition of contiguity: Given statistics Tn, consider the sets An = {‖Tn‖ > ε}; given sets An, consider the statistics Tn = 1_{An}.
(i) ⇒ (ii). For simplicity of notation, we write just {n} for the given subsequence along which dPn/dQn ⇝ U under Qn. For given n, we define the function gn(ε) = Qn(dPn/dQn < ε) − P(U < ε). By the portmanteau lemma, lim inf gn(ε) ≥ 0 for every ε > 0. Then, for εn ↓ 0 at a sufficiently slow rate, also lim inf gn(εn) ≥ 0. Thus,

  P(U = 0) = lim P(U < εn) ≤ lim inf Qn(dPn/dQn < εn).

On the other hand,

  Pn(dPn/dQn < εn, qn > 0) = ∫_{dPn/dQn < εn, qn > 0} (dPn/dQn) dQn ≤ ∫ εn dQn → 0.

If Qn is contiguous with respect to Pn, then the Qn-probability of the set on the left goes to zero also. But this is the probability on the right in the first display. Combination shows that P(U = 0) = 0.
(iii) ⇒ (i). If Pn(An) → 0, then the sequence 1_{Ωn−An} converges to 1 in Pn-probability. By Prohorov's theorem, every subsequence of {n} has a further subsequence along which (dQn/dPn, 1_{Ωn−An}) ⇝ (V, 1) under Pn, for some weak limit V. The function (v, t) ↦ vt is continuous and nonnegative on the set [0, ∞) × {0, 1}. By the portmanteau lemma

Under (iii) the right side equals EV = 1. Then the left side is 1 as well and the sequence Qn(An) = 1 − Qn(Ωn − An) converges to zero.
(ii) ⇒ (iii). The probability measures μn = ½(Pn + Qn) dominate both Pn and Qn, for every n. The sum of the densities of Pn and Qn with respect to μn equals 2. Hence, each of the densities takes its values in the compact interval [0, 2]. By Prohorov's theorem every subsequence possesses a further subsequence along which

  dPn/dQn ⇝ U under Qn,   dQn/dPn ⇝ V under Pn,   Wn := dPn/dμn ⇝ W under μn,

for certain random variables U, V and W. Every Wn has expectation 1 under μn. In view of the boundedness, the weak convergence of the sequence Wn implies convergence of moments, and the limit variable has mean EW = 1 as well. For a given bounded, continuous function f, define a function g: [0, 2] → ℝ by g(w) = f(w/(2 − w))(2 − w) for 0 ≤ w < 2 and g(2) = 0. Then g is bounded and continuous. Because dPn/dQn = Wn/(2 − Wn) and dQn/dμn = 2 − Wn, the portmanteau lemma yields

  E_{Qn} f(dPn/dQn) = E_{μn} f(dPn/dQn) (dQn/dμn) = E_{μn} g(Wn) → E f(W/(2 − W)) (2 − W),

where the integrand on the right side is understood to be g(2) = 0 if W = 2. By assumption, the left side converges to Ef(U). Thus Ef(U) equals the right side of the display for every continuous and bounded function f. Take a sequence of such functions with 1 ≥ fm ↓ 1_{{0}}, and conclude by the dominated-convergence theorem that
6.5 Example (Asymptotic log normality). The following special case plays an important role in the asymptotic theory of smooth parametric models. Let Pn and Qn be probability measures on arbitrary measurable spaces such that

  dPn/dQn ⇝ e^{N(μ,σ²)} under Qn.

Then Qn ◁ Pn. Furthermore, Qn ◁▷ Pn if and only if μ = −½σ².
Because the (log normal) variable on the right is positive, the first assertion is immediate from (ii) of the lemma. The second follows from (iii) with the roles of Pn and Qn switched, on noting that E exp N(μ, σ²) = 1 if and only if μ = −½σ².
A mean equal to minus half times the variance looks peculiar, but we shall see that this situation arises naturally in the study of the asymptotic optimality of statistical procedures. □
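A quick simulation sketch of this log-normal limit, for the simplest case of n i.i.d. N(0, 1) observations under Pn and N(h/√n, 1) observations under Qn (here σ² = h², so the limit mean should be −½h²; the simulation evaluates log(dQn/dPn) under Pn, and the choice h = 1 and all sizes are arbitrary):

    import numpy as np

    rng = np.random.default_rng(7)
    n, h, reps = 400, 1.0, 20000
    x = rng.standard_normal((reps, n))                 # data generated under P_n = N(0,1)^n
    # log likelihood ratio log(dQ_n/dP_n) for Q_n = N(h/sqrt(n), 1)^n
    loglr = (h / np.sqrt(n)) * x.sum(axis=1) - h**2 / 2
    print(loglr.mean(), loglr.var())                   # close to -h^2/2 and h^2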
The following theorem solves the problem of obtaining a Qn-limit law from a Pn-limit law that we posed in the introduction. The result, a version of Le Cam's third lemma, is in perfect analogy with the nonasymptotic situation.

by the portmanteau lemma. Apply the portmanteau lemma in the converse direction to conclude the proof that Xn ⇝ L. ■

In this situation the asymptotic covariance matrices of the sequence Xn are the same under Pn and Qn, but the mean vectors differ by the asymptotic covariance τ between Xn and the log likelihood ratios.†
The statement is a special case of the preceding theorem. Let (X, W) have the given (k + 1)-dimensional normal distribution. By the continuous-mapping theorem, the sequence (Xn, dQn/dPn) converges in distribution under Pn to (X, e^W). Because W is N(−½σ², σ²)-distributed, the sequences Pn and Qn are mutually contiguous. According to the abstract version of Le Cam's third lemma, Xn ⇝ L with L(B) = E1_B(X) e^W. The characteristic function of L is ∫ e^{it^T x} dL(x) = E e^{it^T X} e^W. This is the characteristic function of the given normal distribution at the vector (t, −i). Thus

The right side is the characteristic function of the N_k(μ + τ, Σ) distribution. □

† We set log 0 = −∞; because the normal distribution does not charge the point −∞, the assumed asymptotic normality of log dQn/dPn includes the assumption that Pn(dQn/dPn = 0) → 0.
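For the record, the evaluation at (t, −i) can be written out as follows, assuming (as the argument indicates) that (X, W) is (k+1)-variate normal with mean (μ, −½σ²) and covariance matrix with blocks Σ, τ, τ^T, σ². The joint characteristic function is E exp(it^T X + iuW) = exp(it^T μ − ½iuσ² − ½(t^T Σ t + 2u t^T τ + u²σ²)); substituting u = −i gives

  E e^{it^T X} e^W = exp(it^T μ − ½σ² − ½(t^T Σ t − 2i t^T τ − σ²)) = exp(it^T(μ + τ) − ½ t^T Σ t),

which is indeed the characteristic function of the N_k(μ + τ, Σ) distribution.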
Notes
The concept and theory of contiguity was developed by Le Cam in [92]. In his paper the
results that were later to become known as Le Cam's lemmas are listed as a single theorem.
The names "first" and "third" appear to originate from [71]. (The second lemma is on
product measures and the first lemma is actually only the implication (iii) => (i).)
PROBLEMS
1. Let Pn = N(0, 1) and Qn = N(μn, 1). Show that the sequences Pn and Qn are mutually contiguous if and only if the sequence μn is bounded.
2. Let Pn and Qn be the distribution of the mean of a sample of size n from the N(0, 1) and the N(θn, 1) distribution, respectively. Show that Pn ◁▷ Qn if and only if θn = O(1/√n).
3. Let Pn and Qn be the law of a sample of size n from the uniform distribution on [0, 1] or [0, 1 + 1/n], respectively. Show that Pn ◁ Qn. Is it also true that Qn ◁ Pn? Use Lemma 6.4 to derive your answers.
4. Suppose that ‖Pn − Qn‖ → 0, where ‖·‖ is the total variation distance ‖P − Q‖ = sup_A |P(A) − Q(A)|. Show that Pn ◁▷ Qn.
5. Given ε > 0, find an example of sequences such that Pn ◁▷ Qn, but ‖Pn − Qn‖ → 1 − ε. (The maximum total variation distance between two probability measures is 1.) This exercise shows that it is wrong to think of contiguous sequences as being close. (Try measures that are supported on just two points.)
6. Give a simple example in which Pn ◁ Qn, but it is not true that Qn ◁ Pn.
7. Show that the constant sequences {P} and {Q} are contiguous if and only if P and Q are absolutely continuous.
8. If P ≪ Q, then Q(An) → 0 implies P(An) → 0 for every sequence of measurable sets. How does this follow from Lemma 6.4?
7.1 Introduction
Suppose we observe a sample X1, ..., Xn from a distribution Pθ on some measurable space (X, A) indexed by a parameter θ that ranges over an open subset Θ of ℝ^k. Then the full observation is a single observation from the product Pθ^n of n copies of Pθ, and the statistical model is completely described as the collection of probability measures {Pθ^n: θ ∈ Θ} on the sample space (X^n, A^n). In the context of the present chapter we shall speak of a statistical experiment, rather than of a statistical model. In this chapter it is shown that many statistical experiments can be approximated by Gaussian experiments after a suitable reparametrization.
The reparametrization is centered around a fixed parameter θ0, which should be regarded as known. We define a local parameter h = √n(θ − θ0), rewrite Pθ^n as P^n_{θ0+h/√n}, and thus obtain an experiment with parameter h. In this chapter we show that, for large n, the experiments

  (P^n_{θ0+h/√n}: h ∈ ℝ^k)   and   (N(h, Iθ0^{-1}): h ∈ ℝ^k)
are similar in statistical properties, whenever the original experiments θ ↦ Pθ are "smooth"
in the parameter. The second experiment consists of observing a single observation from a
normal distribution with mean h and known covariance matrix (equal to the inverse of the
Fisher information matrix). This is a simple experiment, which is easy to analyze, whence
the approximation yields much information about the asymptotic properties of the original
experiments. This information is extracted in several chapters to follow and concerns both
asymptotic optimality theory and the behavior of statistical procedures such as the maximum
likelihood estimator and the likelihood ratio test.
We have taken the local parameter set equal to ℝ^k, which is not correct if the parameter set Θ is a true subset of ℝ^k. If θ0 is an inner point of the original parameter set, then the vector θ = θ0 + h/√n is a parameter in Θ for a given h, for every sufficiently large n, and the local parameter set converges to the whole of ℝ^k as n → ∞. Then taking the local parameter set equal to ℝ^k does not cause errors. To give a meaning to the results of this chapter, the measure Pθ0+h/√n may be defined arbitrarily if θ0 + h/√n ∉ Θ.
7.2 Expanding the Likelihood

Let pθ denote a density of Pθ with respect to a σ-finite measure μ, and suppose that, as h → 0,

∫ [ √p_{θ+h} − √pθ − ½ hᵀ ℓ̇θ √pθ ]² dμ = o(‖h‖²).   (7.1)

If this condition is satisfied, then the model (Pθ : θ ∈ Θ) is called differentiable in quadratic
mean at θ.
Usually, ½ hᵀ ℓ̇θ(x) √pθ(x) is the derivative of the map h ↦ √p_{θ+h}(x) at h = 0 for
(almost) every x. In this case

ℓ̇θ(x) = 2 (1/√pθ(x)) ∂/∂θ √pθ(x) = ∂/∂θ log pθ(x).
Condition (7.1) does not require differentiability of the map θ ↦ pθ(x) for any single x, but
rather differentiability in (quadratic) mean. Admittedly, the latter is typically established by
pointwise differentiability plus a convergence theorem for integrals. Because the condition
is exactly right for its purpose, we establish in the following theorem local asymptotic
normality under (7.1). A lemma following the theorem gives easily verifiable conditions in
terms of pointwise derivatives.
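As a concrete illustration of what (7.1) asks for, the following small numerical sketch (assuming a Python environment with NumPy and SciPy available) evaluates the quadratic-mean remainder for the exponential family pθ(x) = θe^{−θx} on (0, ∞), whose score is ℓ̇θ(x) = 1/θ − x; the ratio of the remainder to h² should tend to zero as h → 0.

```python
# Numerical check of differentiability in quadratic mean (7.1) for the exponential
# family p_theta(x) = theta * exp(-theta * x) with score l_theta(x) = 1/theta - x.
import numpy as np
from scipy.integrate import quad

theta = 2.0

def sqrt_dens(t, x):
    return np.sqrt(t) * np.exp(-t * x / 2.0)

def score(x):
    return 1.0 / theta - x

def remainder(h):
    # integrand of the quadratic-mean remainder in (7.1)
    f = lambda x: (sqrt_dens(theta + h, x) - sqrt_dens(theta, x)
                   - 0.5 * h * score(x) * sqrt_dens(theta, x)) ** 2
    val, _ = quad(f, 0.0, np.inf)
    return val

for h in [0.1, 0.01, 0.001]:
    print(h, remainder(h) / h ** 2)   # the ratios should decrease towards zero
```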
7.2 Theorem. Suppose that Θ is an open subset of ℝᵏ and that the model (Pθ : θ ∈ Θ)
is differentiable in quadratic mean at θ. Then Pθ ℓ̇θ = 0 and the Fisher information matrix
Iθ = Pθ ℓ̇θ ℓ̇θᵀ exists. Furthermore, for every converging sequence hn → h, as n → ∞,

log ∏_{i=1}^n (p_{θ+hn/√n}/pθ)(Xi) = (1/√n) Σ_{i=1}^n hᵀ ℓ̇θ(Xi) − ½ hᵀ Iθ h + o_{Pθ}(1).
Proof. Given a converging sequence hn → h, we use the abbreviations pn, p, and g for
p_{θ+hn/√n}, pθ, and hᵀℓ̇θ, respectively. By (7.1) the sequence √n(√pn − √p) converges in
quadratic mean (i.e., in L₂(μ)) to ½ g √p. This implies that the sequence √pn converges in
quadratic mean to √p. By the continuity of the inner product,

Pg = ⟨½ g√p, 2√p⟩ = lim ⟨√n(√pn − √p), √pn + √p⟩ = lim √n ∫ (pn − p) dμ.

The right side equals √n(1 − 1) = 0 for every n, because both probability densities integrate
to 1. Thus Pg = 0.
The random variable Wni = 2[√(pn/p)(Xi) − 1] is with P-probability 1 well defined.
By (7.1),

var( Σ_{i=1}^n Wni − (1/√n) Σ_{i=1}^n g(Xi) ) ≤ E(√n Wn1 − g(X1))² → 0,   (7.3)

E Σ_{i=1}^n Wni = 2n ∫ (√pn − √p)√p dμ = −n ∫ (√pn − √p)² dμ → −¼ Pg².
Here Pg² = ∫ g² dP = hᵀ Iθ h by the definitions of g and Iθ. If both the means and the
variances of a sequence of random variables converge to zero, then the sequence converges
to zero in probability. Therefore, combining the preceding pair of displayed equations, we
find

Σ_{i=1}^n Wni = (1/√n) Σ_{i=1}^n g(Xi) − ¼ hᵀ Iθ h + o_P(1).   (7.4)
Next, we express the log likelihood ratio in Σ_{i=1}^n Wni through a Taylor expansion of the
logarithm. If we write log(1 + x) = x − ½x² + x²R(2x), then R(x) → 0 as x → 0, and

log ∏_{i=1}^n (pn/p)(Xi) = 2 Σ_{i=1}^n log(1 + ½ Wni)
  = Σ_{i=1}^n Wni − ¼ Σ_{i=1}^n Wni² + ½ Σ_{i=1}^n Wni² R(Wni).   (7.5)
As a consequence of the right side of (7.3), it is possible to write n Wni² = g²(Xi) + Ani for
random variables Ani such that E|Ani| → 0. The averages Ān converge in mean and hence
in probability to zero. Combination with the law of large numbers yields
7.6 Lemma. For every θ in an open subset of ℝᵏ let pθ be a μ-probability density. Assume
that the map θ ↦ sθ(x) = √pθ(x) is continuously differentiable for every x. If the elements
of the matrix Iθ = ∫ (ṗθ/pθ)(ṗθ/pθ)ᵀ pθ dμ are well defined and continuous in θ, then the
map θ ↦ √pθ is differentiable in quadratic mean (7.1) with ℓ̇θ given by ṗθ/pθ.
Proof. By the chain rule, the map θ ↦ pθ(x) = sθ²(x) is differentiable for every x, with
gradient ṗθ = 2 sθ ṡθ. Because sθ is nonnegative, its gradient ṡθ at a point at which sθ = 0
must be zero. Conclude that we can write ṡθ = ½ (ṗθ/pθ) √pθ, where the quotient ṗθ/pθ
may be defined arbitrarily if pθ = 0. By assumption, the map θ ↦ Iθ = 4 ∫ ṡθ ṡθᵀ dμ is
continuous.
Because the map θ ↦ sθ(x) is continuously differentiable, the difference s_{θ+h}(x) − sθ(x)
can be written as the integral ∫₀¹ hᵀ ṡ_{θ+uh}(x) du of its derivative. By Jensen's (or Cauchy-
Schwarz's) inequality, the square of this integral is bounded by the integral ∫₀¹ (hᵀ ṡ_{θ+uh}(x))² du.
Hence, for any h_t → h (as t → 0),

∫ ((s_{θ+t h_t} − sθ)/t)² dμ ≤ ∫ ∫₀¹ (h_tᵀ ṡ_{θ+u t h_t})² du dμ = ¼ ∫₀¹ h_tᵀ I_{θ+u t h_t} h_t du,

where the last equality follows by Fubini's theorem and the definition of Iθ. For h_t → h
the right side converges to ¼ hᵀ Iθ h = ∫ (hᵀ ṡθ)² dμ, by the continuity of the map θ ↦ Iθ.
By the differentiability of the map θ ↦ sθ(x) the integrand in

∫ ((s_{θ+t h_t} − sθ)/t − hᵀ ṡθ)² dμ

converges pointwise to zero. The result of the preceding paragraph combined with Propo-
sition 2.29 shows that the integral converges to zero. ∎
ℓ̇θ(x) = t(x) − Eθ t(X),   Iθ = Covθ t(X).

Thus the asymptotic expansion of the local log likelihood is valid for most exponential
families. □
7.8 Example (Location models). The preceding lemma also includes all location models
{f(x − θ) : θ ∈ ℝ} for a positive, continuously differentiable density f with finite Fisher
information for location

I_f = ∫ (f′/f)²(x) f(x) dx.

The score function ℓ̇θ(x) can be taken equal to −(f′/f)(x − θ). The Fisher information is
equal to I_f for every θ and hence certainly continuous in θ.
By a refinement of the lemma, differentiability in quadratic mean can also be established
for slightly irregular shapes, such as the Laplace density f(x) = ½e^{−|x|}. For the Laplace
density the map θ ↦ log f(x − θ) fails to be differentiable at the single point θ = x.
At other points the derivative exists and equals sign(x − θ). It can be shown that the
Laplace location model is differentiable in quadratic mean with score function ℓ̇θ(x) =
sign(x − θ). This may be proved by writing the difference √f(x − h) − √f(x) as the
integral ∫₀¹ ½ h sign(x − uh) √f(x − uh) du of its derivative, which is possible even though
the derivative does not exist everywhere. Next the proof of the preceding lemma applies. □
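The Fisher information for location is also easy to evaluate numerically. The sketch below (assuming Python with NumPy and SciPy) computes I_f = ∫ (f′/f)² f dx for the standard normal and logistic densities, for which the exact values are 1 and 1/3, respectively.

```python
# Numerical Fisher information for location I_f = int (f'(x)^2 / f(x)) dx
# for the standard normal and the standard logistic density.
import numpy as np
from scipy.integrate import quad

def info_location(f, fprime, lo=-40.0, hi=40.0):
    integrand = lambda x: fprime(x) ** 2 / f(x)
    val, _ = quad(integrand, lo, hi, points=[0.0], limit=200)
    return val

norm_f = lambda x: np.exp(-x * x / 2.0) / np.sqrt(2.0 * np.pi)
norm_fp = lambda x: -x * norm_f(x)

logis_f = lambda x: np.exp(-x) / (1.0 + np.exp(-x)) ** 2
logis_fp = lambda x: logis_f(x) * (np.exp(-x) - 1.0) / (np.exp(-x) + 1.0)

print("normal  :", info_location(norm_f, norm_fp))    # close to 1
print("logistic:", info_location(logis_f, logis_fp))  # close to 1/3
```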
P_{θ+h}(pθ = 0) = ∫ 1_{[0,θ]ᶜ}(x) (1/(θ + h)) 1_{[0,θ+h]}(x) dx = h/(θ + h).

The orthogonal part does converge to zero, but only at the rate O(h). □
The right side is very similar in form to the right side of the expansion of the log likelihood
ratio log dP^n_{θ+h/√n}/dP^n_θ given in Theorem 7.2. In view of the similarity, the possibility of
a normal approximation is not a complete surprise. The approximation in this section is
"local" in nature: We fix θ and think of

(P^n_{θ+h/√n} : h ∈ ℝᵏ)

as a statistical model with parameter h, for "known" θ. We show that this can be approxi-
mated by the statistical model (N(h, Iθ⁻¹) : h ∈ ℝᵏ).
A motivation for studying a local approximation is that, usually, asymptotically, the
"true" parameter can be known with unlimited precision. The true statistical difficulty is
therefore determined by the nature of the measures Pθ for θ in a small neighbourhood of
the true value. In the present situation "small" turns out to be "of size O(1/√n)."
A relationship between the models that can be statistically interpreted will be described
through the possible (limit) distributions of statistics. For each n, let Tn = Tn(X1, ..., Xn)
be a statistic in the experiment (P^n_{θ+h/√n} : h ∈ ℝᵏ), with values in a fixed Euclidean space.
Suppose that the sequence of statistics Tn converges in distribution under every possible
(local) parameter:

Tn ⇝_{θ+h/√n} L_{θ,h},   every h.

Here ⇝_{θ+h/√n} means convergence in distribution under the parameter θ + h/√n, and L_{θ,h}
may be any probability distribution. According to the following theorem, the distributions
{L_{θ,h} : h ∈ ℝᵏ} are necessarily the distributions of a statistic T in the normal experiment
(N(h, Iθ⁻¹) : h ∈ ℝᵏ). Thus, every weakly converging sequence of statistics is "matched"
by a statistic in the limit experiment. (In the present set-up the vector θ is considered known
and the vector h is the statistical parameter. Consequently, by "statistics" Tn and T are
understood measurable maps that do not depend on h but may depend on θ.)
This principle of matching estimators is a method to give the convergence of models
a statistical interpretation. Most measures of quality of a statistic can be expressed in the
distribution of the statistic under different parameters. For instance, if a certain hypothesis
is rejected for values of a statistic Tn exceeding a number c, then the power function
h ↦ P_h(Tn > c) is relevant; alternatively, if Tn is an estimator of h, then the mean square
error h ↦ E_h(Tn − h)², or a similar quantity, determines the quality of Tn. Both quality
measures depend on the laws of the statistics only. The following theorem asserts that as a
function of h the law of a statistic Tn can be well approximated by the law of some statistic
T. Then the quality of the approximating T is the same as the "asymptotic quality" of the
sequence Tn. Investigation of the possible T should reveal the asymptotic performance of
possible sequences Tn. Concrete applications of this principle to testing and estimation are
given in later chapters.
A minor technical complication is that it is necessary to allow randomized statistics in
the limit experiment. A randomized statistic T based on the observation X is defined as a
measurable map T = T (X, U) that depends on X but may also depend on an independent
variable U with a uniform distribution on [0, 1]. Thus, the statistician working in the limit
experiment is allowed to base an estimate or test on both the observation and the outcome of
an extra experiment that can be run without knowledge of the parameter. In most situations
such randomization is not useful, but the following theorem would not be true without
it.†
J = Iθ,
† It is not important that U is uniformly distributed. Any randomization mechanism that is sufficiently rich will do.

By assumption, the marginals of the sequence (Tn, Δn) converge in distribution under
h = 0; hence they are uniformly tight by Prohorov's theorem. Because marginal tightness
implies joint tightness, Prohorov's theorem can be applied in the other direction to see the
existence of a subsequence of {n} along which

(Tn, Δn) ⇝ (S, Δ)

jointly, for some random vector (S, Δ). The vector Δ is necessarily a marginal weak limit
of the sequence Δn and hence it is N(0, J)-distributed. Combination with Theorem 7.2
yields
(Tn, log dP_{n,h}/dP_{n,0}) ⇝_0 (S, hᵀΔ − ½ hᵀJh).
Because the random vectors on the left and right sides have the same second marginal
distribution, this is the same as saying that T(X, U) is distributed according to the conditional
distribution of S given Δ = δ, for almost every δ. As shown in the next lemma, this can be
achieved by using the quantile transformation.
Let X be an observation in the limit experiment (N(h, J⁻¹) : h ∈ ℝᵏ). Then JX is under
h = 0 normally N(0, J)-distributed and hence it is equal in distribution to Δ. Furthermore,
by Fubini's theorem,
7.11 Lemma. Given a random vector (S, Δ) with values in ℝᵈ × ℝᵏ and an independent
uniformly [0, 1] distributed random variable U (defined on the same probability space), there exists a
jointly measurable map T on ℝᵏ × [0, 1] such that (T(Δ, U), Δ) and (S, Δ) are equal in
distribution.
Proof. Consider the case that S = (S1, S2) has two coordinates. (A single uniform variable U
can be decomposed into two independent uniform variables U1 and U2, for instance by splitting
its binary expansion.) Let Q1(u1 | δ) and Q2(u2 | δ, s1) be the quantile functions of the conditional
laws P^{S1|Δ=δ} and P^{S2|Δ=δ,S1=s1},
respectively. These are measurable functions in their two and three arguments, respectively.
Furthermore, Q1(U1 | δ) has law P^{S1|Δ=δ} and Q2(U2 | δ, s1) has law P^{S2|Δ=δ,S1=s1}, for every δ
and s1. Set

T(δ, u1, u2) = (Q1(u1 | δ), Q2(u2 | δ, Q1(u1 | δ))).

Then the first coordinate Q1(U1 | δ) of T(δ, U1, U2) possesses the distribution P^{S1|Δ=δ}.
Given that this first coordinate equals s1, the second coordinate is distributed as Q2(U2 | s1),
which has law P^{S2|Δ=δ,S1=s1} by construction. Thus T satisfies the requirements. ∎
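The quantile-transformation construction can be made concrete by simulation. In the sketch below (assuming Python with NumPy and SciPy), (S, Δ) is bivariate normal with correlation ρ, the conditional quantile function of S given Δ = δ is ρδ + √(1 − ρ²)Φ⁻¹(u), and the variable T(Δ, U) built from Δ and an independent uniform U reproduces the joint law of (S, Δ) up to simulation error.

```python
# Toy illustration of the quantile-transformation construction of Lemma 7.11
# for a one-dimensional S that is jointly normal with Delta.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
rho, n = 0.7, 200_000

# target joint law: (S, Delta) bivariate normal with correlation rho
delta = rng.standard_normal(n)
s = rho * delta + np.sqrt(1 - rho ** 2) * rng.standard_normal(n)

def T(d, u):
    # conditional quantile function of S given Delta = d, evaluated at u
    return rho * d + np.sqrt(1 - rho ** 2) * norm.ppf(u)

u = rng.uniform(size=n)
t = T(delta, u)          # built from Delta and an independent uniform only

# (T(Delta, U), Delta) should match (S, Delta) in distribution
for name, a in [("S", s), ("T", t)]:
    print(name, "mean %.3f  var %.3f  corr with Delta %.3f"
          % (a.mean(), a.var(), np.corrcoef(a, delta)[0, 1]))
```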
Then the maximum likelihood estimator in the limit experiment is a "projection" of X and
the limit distribution of √n(θ̂n − θ) may change accordingly.
Let Θ be an arbitrary subset of ℝᵏ and define Hn as the local parameter space Hn =
√n(Θ − θ). Then ĥn is the maximizer over Hn of the random function (or "process")

h ↦ log (dP^n_{θ+h/√n}/dP^n_θ).

If the experiment (Pθ : θ ∈ Θ) is differentiable in quadratic mean, then this sequence of
processes converges (marginally) in distribution to the process

h ↦ hᵀΔ − ½ hᵀ Iθ h,   with Δ normally N(0, Iθ)-distributed.
*Proof. Let 𝔾n = √n(ℙn − P_{θ₀}) be the empirical process. In the proof of Theorem 5.39
it is shown that the map θ ↦ log pθ is differentiable at θ₀ in L₂(P_{θ₀}) with derivative
ℓ̇_{θ₀}, and that the map θ ↦ P_{θ₀} log pθ permits a Taylor expansion of order 2 at θ₀, with
"second-derivative matrix" −I_{θ₀}. Therefore, the conditions of Lemma 19.31 are satisfied
for mθ = log pθ, whence, for every M,
estimators ĥn are bounded in probability and hence belong to the balls of radius Mn with
probability tending to 1. Furthermore, the sequence of intersections Hn ∩ ball(0, Mn)
converges to H, as do the original sets Hn. Thus, we may assume that the ĥn are the maximum
likelihood estimators relative to local parameter sets Hn that are contained in the balls of
radius Mn. Fix an arbitrary closed set F. If ĥn ∈ F, then the log likelihood is maximal on
F. Hence P(ĥn ∈ F) is bounded above by
P( sup_{h∈F∩Hn} n ℙn log (p_{θ₀+h/√n}/p_{θ₀}) ≥ sup_{h∈Hn} n ℙn log (p_{θ₀+h/√n}/p_{θ₀}) )
  = P( sup_{h∈F∩Hn} (hᵀ 𝔾n ℓ̇_{θ₀} − ½ hᵀ I_{θ₀} h) ≥ sup_{h∈Hn} (hᵀ 𝔾n ℓ̇_{θ₀} − ½ hᵀ I_{θ₀} h) + o_P(1) )
  = P( ‖I_{θ₀}^{−1/2} 𝔾n ℓ̇_{θ₀} − I_{θ₀}^{1/2}(F ∩ Hn)‖ ≤ ‖I_{θ₀}^{−1/2} 𝔾n ℓ̇_{θ₀} − I_{θ₀}^{1/2} Hn‖ + o_P(1) ),
by completing the square. By Lemma 7.13 (ii) and (iii) ahead, we can replace Hn by H on
both sides, at the cost of adding a further o_P(1)-term and increasing the probability. Next,
by the continuous-mapping theorem and the continuity of the map z ↦ ‖z − A‖ for every
set A, the probability is asymptotically bounded above by, with Z a standard normal vector,

P( ‖Z − I_{θ₀}^{1/2}(F ∩ H)‖ ≤ ‖Z − I_{θ₀}^{1/2} H‖ ).
7.13 Lemma. If the sequence of subsets Hn of ℝᵏ converges to a nonempty set H and
the sequence of random vectors Xn converges in distribution to a random vector X, then
(i) ‖Xn − Hn‖ ⇝ ‖X − H‖;
(ii) ‖Xn − Hn ∩ F‖ ≥ ‖Xn − H ∩ F‖ + o_P(1), for every closed set F;
(iii) ‖Xn − Hn ∩ G‖ ≤ ‖Xn − H ∩ G‖ + o_P(1), for every open set G.
Proof. (i). Because the map x ↦ ‖x − H‖ is (Lipschitz) continuous for any set H,
we have that ‖Xn − H‖ ⇝ ‖X − H‖ by the continuous-mapping theorem. If we also show
that ‖Xn − Hn‖ − ‖Xn − H‖ → 0 in probability, then the proof is complete after an application of
Slutsky's lemma. By the uniform tightness of the sequence Xn, it suffices to show that
‖x − Hn‖ → ‖x − H‖ uniformly for x ranging over compact sets, or equivalently that
‖xn − Hn‖ → ‖x − H‖ for every converging sequence xn → x.
For every fixed vector xn, there exists a vector hn ∈ Hn with ‖xn − Hn‖ ≥ ‖xn − hn‖ − 1/n.
Unless ‖xn − Hn‖ is unbounded, we can choose the sequence hn bounded. Then every
subsequence of hn has a further subsequence along which it converges, to a limit h in H.
Conclude that, in any case,

lim inf ‖xn − Hn‖ ≥ ‖x − H‖.

Conversely, for every ε > 0 there exists h ∈ H and a sequence hn → h with hn ∈ Hn and

‖x − H‖ ≥ ‖x − h‖ − ε = lim ‖xn − hn‖ − ε ≥ lim sup ‖xn − Hn‖ − ε.
Combination of the last two displays yields the desired convergence of the sequence ‖xn −
Hn‖ to ‖x − H‖.
(ii). The assertion is equivalent to the statement P(‖Xn − Hn ∩ F‖ − ‖Xn − H ∩ F‖ >
−ε) → 1 for every ε > 0. In view of the uniform tightness of the sequence Xn, this follows
if lim inf ‖xn − Hn ∩ F‖ ≥ ‖x − H ∩ F‖ for every converging sequence xn → x. We can
prove this by the method of the first half of the proof of (i), replacing Hn by Hn ∩ F.
(iii). Analogously to the situation under (ii), it suffices to prove that lim sup ‖xn − Hn ∩
G‖ ≤ ‖x − H ∩ G‖ for every converging sequence xn → x. This follows as the second
half of the proof of (i). ∎
Suppose that the sequence of statistics Tn is asymptotically linear in the sense that

√n(Tn − μθ) = (1/√n) Σ_{i=1}^n ψθ(Xi) + o_{Pθ}(1).
According to Theorem 7.2, the sequence of log likelihood ratios can be approximated
by an average as well: It is asymptotically equivalent to an affine transformation
of n^{−1/2} Σ ℓ̇θ(Xi). The sequence of joint averages n^{−1/2} Σ (ψθ(Xi), ℓ̇θ(Xi)) is
asymptotically multivariate normal under θ by the central limit theorem (provided ψθ has
mean zero and finite second moment). With the help of Slutsky's lemma we obtain the joint
limit distribution of Tn and the log likelihood ratios under θ:

( √n(Tn − μθ), log dP^n_{θ+h/√n}/dP^n_θ ) ⇝ N( ( 0, −½ hᵀIθh ),  ( Pθψθψθᵀ        Pθψθ hᵀℓ̇θ
                                                                   Pθ(hᵀℓ̇θ)ψθᵀ    hᵀIθh ) ).
Finally we can apply Le Cam's third lemma, Example 6.7, to obtain the limit distribution
of √n(Tn − μθ) under θ + h/√n. Concrete examples of this scheme are discussed in later
chapters.
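For later reference, the outcome of this step can be written out. Under the joint normal limit displayed above, Le Cam's third lemma shifts the mean of the first coordinate by its asymptotic covariance with the log likelihood ratio and leaves the covariance matrix unchanged:

√n(Tn − μθ) ⇝ N( (Pθψθℓ̇θᵀ) h, Pθψθψθᵀ )   under θ + h/√n.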
A wide variety of models satisfy a general form of local asymptotic normality and for that
reason allow a unified treatment. These include models with independent, not identically
distributed observations, but also models with dependent observations, such as used in time
series analysis or certain random fields. Because local asymptotic normality underlies a
large part of asymptotic optimality theory and also explains the asymptotic normality of
certain estimators, such as maximum likelihood estimators, it is worthwhile to formulate a
general concept.
Suppose the observation at "time" n is distributed according to a probability measure
P_{n,θ}, for a parameter θ ranging over an open subset Θ of ℝᵏ.
7.14 Definition. The sequence of statistical models (P_{n,θ} : θ ∈ Θ) is locally asymptoti-
cally normal (LAN) at θ if there exist matrices rn and Iθ and random vectors Δ_{n,θ} such that
Δ_{n,θ} ⇝ N(0, Iθ) and for every converging sequence hn → h

log (dP_{n,θ+rn⁻¹hn} / dP_{n,θ}) = hᵀ Δ_{n,θ} − ½ hᵀ Iθ h + o_{P_{n,θ}}(1).
7.15 Example. If the experiment (Pθ : θ ∈ Θ) is differentiable in quadratic mean, then the
sequence of models (P^n_θ : θ ∈ Θ) is locally asymptotically normal with norming matrices
rn = √n I. □
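A concrete case in which the expansion can be checked by hand is the normal location model. If Pθ = N(θ, 1), then ℓ̇θ(x) = x − θ and Iθ = 1, and a direct computation of the log likelihood ratio gives

log dP^n_{θ+h/√n}/dP^n_θ = (h/√n) Σ_{i=1}^n (Xi − θ) − ½ h² = h Δ_{n,θ} − ½ h² Iθ,

with Δ_{n,θ} = n^{−1/2} Σ_{i=1}^n (Xi − θ) exactly N(0, 1)-distributed under θ, so that the o_{P_{n,θ}}(1)-remainder vanishes identically.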
An inspection of the proof of Theorem 7.10 readily reveals that this depends on the local
asymptotic normality property only. Thus, the local experiments (P_{n,θ+rn⁻¹h} : h ∈ ℝᵏ) of a
locally asymptotically normal sequence of models can be approximated by the Gaussian
experiment (N(h, Iθ⁻¹) : h ∈ ℝᵏ) in the sense of Theorem 7.10.
7.17 Example (Gaussian time series). This example requires some knowledge of time-
series models. Suppose that at time n the observations are a stretch X1, ..., Xn from a
stationary, Gaussian time series {Xt : t ∈ ℤ} with mean zero. The covariance matrix of n
consecutive variables is the (n × n) matrix Tn(fθ) with (s, t)-element ∫_{−π}^{π} e^{i(s−t)λ} fθ(λ) dλ.
The function fθ is the spectral density of the series. It is convenient to let the parameter
enter the model through the spectral density, rather than directly through the density of the
observations.
Let P_{n,θ} be the distribution (on ℝⁿ) of the vector (X1, ..., Xn), a normal distribution
with mean zero and covariance matrix Tn(fθ). The periodogram of the observations is the
function

In(λ) = (1/(2πn)) | Σ_{t=1}^n Xt e^{itλ} |².
Suppose that fθ is bounded away from zero and infinity, and that there exists a vector-valued
function ℓ̇θ : ℝ ↦ ℝᵈ such that, as h → 0, the spectral densities f_{θ+h} are suitably approximated
by fθ(1 + hᵀℓ̇θ). Then the sequence of models (P_{n,θ}) is locally asymptotically normal at θ with

Iθ = (1/(4π)) ∫ ℓ̇θ ℓ̇θᵀ dλ.
The proof is elementary, but involved, because it has to deal with the quadratic forms in
the n-variate normal density, which involve vectors whose dimension converges to infinity
(see [30]). 0
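The periodogram is easy to compute in practice. The following sketch (assuming Python with NumPy; the AR(1) model is that of Problem 9 below) simulates a Gaussian AR(1) series, evaluates In(λ) at the Fourier frequencies via the fast Fourier transform, and compares its average over replications with the spectral density fθ(λ) = 1/(2π(1 − 2θ cos λ + θ²)).

```python
# Periodogram of a simulated Gaussian AR(1) series X_t = phi*X_{t-1} + Z_t, Z_t ~ N(0,1),
# compared with its spectral density f(lam) = 1 / (2*pi*(1 - 2*phi*cos(lam) + phi^2)).
import numpy as np

rng = np.random.default_rng(1)
phi, n, reps = 0.5, 512, 200

def ar1(length, phi, burn=200):
    z = rng.standard_normal(length + burn)
    x = np.zeros(length + burn)
    for t in range(1, length + burn):
        x[t] = phi * x[t - 1] + z[t]
    return x[burn:]

def periodogram(x):
    # I_n(lambda_j) = |sum_t x_t exp(i t lambda_j)|^2 / (2 pi n) at Fourier frequencies
    m = len(x)
    return np.abs(np.fft.fft(x)) ** 2 / (2.0 * np.pi * m)

lam = 2.0 * np.pi * np.arange(n) / n
spec = 1.0 / (2.0 * np.pi * (1.0 - 2.0 * phi * np.cos(lam) + phi ** 2))

avg = sum(periodogram(ar1(n, phi)) for _ in range(reps)) / reps
for j in [10, 50, 128, 200]:
    print("lambda=%.2f  average periodogram=%.3f  spectral density=%.3f"
          % (lam[j], avg[j], spec[j]))
```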
The sequence Δ_{n,θ} may be thought of as "asymptotically sufficient" for the local parameter
h. The definition of Δ_{n,θ} shows that, asymptotically, all the "information" about the
parameter is contained in the observations falling into the neighborhoods Vn + cj. Thus,
asymptotically, the problem is determined by the points of irregularity.
The remarkable rescaling rate √(n log n) can be explained by computing the Hellinger
distance between the densities f(x − θ) and f(x) (see section 14.5).
Notes
Local asymptotic normality was introduced by Le Cam [92], apparently motivated by the
study and construction of asymptotically similar tests. In this paper Le Cam defines two
sequences of models (P_{n,θ} : θ ∈ Θ) and (Q_{n,θ} : θ ∈ Θ) to be differentially equivalent if
for every bounded set K and every θ. He next shows that a sequence of statistics Tn
in a given asymptotically differentiable sequence of experiments (roughly LAN) that is
asymptotically equivalent to the centering sequence Δ_{n,θ} is asymptotically sufficient, in
the sense that the original experiments and the experiments consisting of observing the Tn
are differentially equivalent. After some interpretation this gives roughly the same message
as Theorem 7.10. The latter is a concrete example of an abstract result in [95], with a
different (direct) proof.
PROBLEMS
1. Show that the Poisson distribution with mean θ satisfies the conditions of Lemma 7.6. Find
the information.
2. Find the Fisher information for location for the normal, logistic, and Laplace distributions.
3. Find the Fisher information for location for the Cauchy distributions.
4. Let f be a density that is symmetric about zero. Show that the Fisher information matrix (if
it exists) of the location-scale family x ↦ f((x − μ)/σ)/σ is diagonal.
5. Find an explicit expression for the o Pθ (1)-term in Theorem 7.2 in the case that pθ is the
density of the N (θ, 1)-distribution.
6. Show that the Laplace location family is differentiable in quadratic mean.
† See, for example, [80, pp. 133–139] for a proof, and also a discussion of other almost regular situations. For
instance, singularities of the form f(x) ∼ f(cj) + |x − cj|^{1/2} at points cj with f(cj) > 0.
7. Find the form of the score function for a location-scale family x ↦ f((x − μ)/σ)/σ with parameter
θ = (μ, σ) and apply Lemma 7.6 to find a sufficient condition for differentiability in quadratic
mean.
8. Investigate for which parameters k the location family f(x − θ), for f the gamma(k, 1) density,
is differentiable in quadratic mean.
9. Let P_{n,θ} be the distribution of the vector (X1, ..., Xn) if {Xt : t ∈ ℤ} is a stationary Gaussian
time series satisfying Xt = θX_{t−1} + Zt for a given number |θ| < 1 and independent standard
normal variables Zt. Show that the model is locally asymptotically normal.
10. Investigate whether the log-normal family of distributions with density

x ↦ (1/(σ√(2π)(x − s))) exp(−(log(x − s) − μ)²/(2σ²)) 1{x > s}

is differentiable in quadratic mean with respect to θ = (s, μ, σ).
A limit distribution is "good" if quantities of this type are small. More generally, we
focus on minimizing ∫ ℓ dLθ for a given nonnegative function ℓ. Such a function is called
a loss function and its integral ∫ ℓ dLθ is the asymptotic risk of the estimator. The method
of measuring concentration (or rather lack of concentration) by means of loss functions
applies to one- and higher-dimensional parameters alike.
The following example shows that a definition of what constitutes asymptotic optimality
is not as straightforward as it might seem.
8.1 Example (Hodges' estimator). Suppose that Tn is a sequence of estimators for a real
parameter θ with standard asymptotic behavior in that, for each θ and certain limit distri-
butions Lθ,

√n(Tn − θ) ⇝_θ Lθ.
As a specific example, let Tn be the mean of a sample of size n from the N(θ, 1)-distribution.
Define a second estimator Sn through

Sn = Tn 1{|Tn| ≥ n^{−1/4}}.
If the estimator Tn is already close to zero, then it is changed to exactly zero; otherwise it
is left unchanged. The truncation point n^{−1/4} has been chosen in such a way that the limit
behavior of Sn is the same as that of Tn for every θ ≠ 0, but for θ = 0 there appears to be a
great improvement. Indeed, for every rn,

rn Sn ⇝_0 0;
√n(Sn − θ) ⇝_θ Lθ,   every θ ≠ 0.
To see this, note first that the probability that Tn falls in the interval (θ − Mn^{−1/2}, θ + Mn^{−1/2})
converges to Lθ(−M, M) for most M and hence is arbitrarily close to 1 for M and n
sufficiently large. For θ ≠ 0, the intervals (θ − Mn^{−1/2}, θ + Mn^{−1/2}) and (−n^{−1/4}, n^{−1/4})
are centered at different places and eventually disjoint. This implies that truncation will
rarely occur: Pθ(Tn = Sn) → 1 if θ ≠ 0, whence the second assertion. On the other hand, the
interval (−Mn^{−1/2}, Mn^{−1/2}) is contained in the interval (−n^{−1/4}, n^{−1/4}) eventually. Hence
under θ = 0 we have truncation with probability tending to 1 and hence P₀(Sn = 0) → 1;
this is stronger than the first assertion.
At first sight, Sn is an improvement on Tn. For every θ ≠ 0 the estimators behave
the same, while for θ = 0 the sequence Sn has an "arbitrarily fast" rate of convergence.
However, this reasoning is a bad use of asymptotics.
Consider the concrete situation that Tn is the mean of a sample of size n from the
normal N(θ, 1)-distribution. It is well known that Tn = X̄ is optimal in many ways for
every fixed n and hence it ought to be asymptotically optimal also. Figure 8.1 shows
why Sn = X̄1{|X̄| ≥ n^{−1/4}} is no improvement. It shows the graph of the risk function
θ ↦ nEθ(Sn − θ)² for three different values of n. These functions are close to 1 on most
Figure 8.1. Risk function θ ↦ nEθ(Sn − θ)² of the Hodges estimator based on the
means of samples of size 10 (dashed), 100 (dotted), and 1000 (solid) observations from
the N(θ, 1)-distribution.
of the domain but possess peaks close to zero. As n → ∞, the locations and widths of
the peaks converge to zero but their heights to infinity. The conclusion is that Sn “buys”
its better asymptotic behavior at θ = 0 at the expense of erratic behavior close to zero.
Because the values of θ at which Sn is bad differ from n to n, the erratic behavior is not
visible in the pointwise limit distributions under fixed θ.
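The peaks in Figure 8.1 are easy to reproduce by simulation. The sketch below (assuming Python with NumPy) draws the sample mean directly from its exact N(θ, 1/n)-distribution, applies the Hodges truncation at n^{−1/4}, and estimates the rescaled risk nEθ(Sn − θ)² at a few values of θ; the values near the truncation point grow with n, while the risk at θ = 0 collapses.

```python
# Monte Carlo estimate of the rescaled risk  theta -> n * E_theta (S_n - theta)^2
# for the Hodges estimator S_n = mean * 1{|mean| >= n^(-1/4)}  (cf. Figure 8.1).
import numpy as np

rng = np.random.default_rng(2)

def hodges_risk(theta, n, reps=200_000):
    means = theta + rng.standard_normal(reps) / np.sqrt(n)   # exact law of the sample mean
    s = np.where(np.abs(means) >= n ** (-0.25), means, 0.0)  # Hodges truncation
    return n * np.mean((s - theta) ** 2)

for n in [10, 100, 1000]:
    thetas = [0.0, 0.5 * n ** (-0.25), n ** (-0.25), 1.0]
    print(n, [round(hodges_risk(t, n), 2) for t in thetas])
```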
needed to meet the requirement with each of the estimators. Then, if it exists, the limit

lim_{ν→∞} n_{ν,2} / n_{ν,1}

is called the relative efficiency of the estimators. (In general, it depends on the para-
meter θ.)
Because √ν(T_{nν} − ψ(θ)) can be written as √(ν/nν) √nν(T_{nν} − ψ(θ)), it follows that
necessarily nν → ∞, and also that nν/ν → σ²(θ). Thus, the relative efficiency of two
estimator sequences with asymptotic variances σi²(θ) is just

lim_{ν→∞} n_{ν,2} / n_{ν,1} = σ₂²(θ) / σ₁²(θ).

If the value of this quotient is bigger than 1, then the second estimator sequence needs
proportionally that many observations more than the first to achieve the same (asymptotic)
precision.
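As a simple instance (and a warm-up for Problem 1 below), the relative efficiency of the sample median to the sample mean for N(θ, 1) data is π/2 ≈ 1.57: the median needs roughly 57% more observations for the same precision. The sketch below (assuming Python with NumPy) checks the two rescaled variances by simulation.

```python
# Simulated check: n * var(sample mean) ~ 1 and n * var(sample median) ~ pi/2 ~ 1.571
# for N(theta, 1) data, so the relative efficiency of median to mean is about pi/2.
import numpy as np

rng = np.random.default_rng(3)
n, reps = 400, 10_000
x = rng.standard_normal((reps, n))     # theta = 0 without loss of generality

var_mean = n * x.mean(axis=1).var()
var_median = n * np.median(x, axis=1).var()

print("n * var(mean)   =", round(var_mean, 3))
print("n * var(median) =", round(var_median, 3))
print("relative efficiency =", round(var_median / var_mean, 3))
```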
Then Tn can be considered a good estimator for ψ(θ) if the limit distributions L_{θ,h} are
maximally concentrated near zero. If they are maximally concentrated for every h and
some fixed θ, then Tn can be considered locally optimal at θ. Unless specified otherwise,
we assume in the remainder of this chapter that the parameter set Θ is an open subset of
ℝᵏ, and that ψ maps Θ into ℝᵐ. The derivative of θ ↦ ψ(θ) is denoted by ψ̇θ.
Suppose that the observations are a sample of size n from a distribution Pθ. If Pθ depends
smoothly on the parameter, then

(P^n_{θ+h/√n} : h ∈ ℝᵏ) ⇝ (N(h, Iθ⁻¹) : h ∈ ℝᵏ)

as experiments, in the sense of Theorem 7.10. This theorem shows which limit distributions
are possible and can be specialized to the estimation problem in the following way.
(8.2) holds for every h. Then there exists a randomized statistic T in the experiment
(N(h, Iθ⁻¹) : h ∈ ℝᵏ) such that T − ψ̇θh has distribution L_{θ,h} for every h.
Proof. Apply Theorem 7.10 to Sn = √n(Tn − ψ(θ)). In view of the definition of L_{θ,h}
and the differentiability of ψ, the sequence
This theorem shows that for most estimator sequences Tn there is a randomized estimator
T such that the distribution of √n(Tn − ψ(θ + h/√n)) under θ + h/√n is, for large n,
approximately equal to the distribution of T − ψ̇θh under h. Consequently the standardized
distribution of the best possible estimator Tn for ψ(θ + h/√n) is approximately equal to the
standardized distribution of the best possible estimator T for ψ̇θh in the limit experiment. If
we know the best estimator T for ψ̇θh, then we know the "locally best" estimator sequence
Tn for ψ(θ).
In this way, the asymptotic optimality problem is reduced to optimality in the experiment
based on one observation X from a N(h, Iθ⁻¹)-distribution, in which θ is known and h
ranges over ℝᵏ. This experiment is simple and easy to analyze. The observation itself is
the customary estimator for its expectation h, and the natural estimator for ψ̇θh is ψ̇θX.
This has several optimality properties: It is minimum variance unbiased, minimax, best
equivariant, and Bayes with respect to the noninformative prior. Some of these properties
are reviewed in the next section.
Let us agree, at least for the moment, that ψ̇θX is a "best" estimator for ψ̇θh. The
distribution of ψ̇θX − ψ̇θh is normal with zero mean and covariance ψ̇θ Iθ⁻¹ ψ̇θᵀ for every h.
The parameter h = 0 in the limit experiment corresponds to the parameter θ in the original
problem. We conclude that the "best" limit distribution of √n(Tn − ψ(θ)) under θ is the
N(0, ψ̇θ Iθ⁻¹ ψ̇θᵀ)-distribution.
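In the simplest case ψ(θ) = θ, so that ψ̇θ is the identity matrix, this says that the "best" limit distribution of √n(Tn − θ) under θ is the N(0, Iθ⁻¹)-distribution. For instance, in the N(θ, 1) model Iθ = 1, and the sample mean X̄n attains this bound exactly, since √n(X̄n − θ) has an N(0, 1)-distribution for every n.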
This is the main result of the chapter. The remaining sections discuss several ways of
making this reasoning more rigorous. Because the expression ψ̇θ Iθ⁻¹ ψ̇θᵀ is precisely the
Cramér-Rao lower bound for the covariance of unbiased estimators for ψ(θ), we can think
of the results of this chapter as asymptotic Cramér-Rao bounds. This is helpful, even though
it does not do justice to the depth of the present results. For instance, the Cramér-Rao bound
in no way suggests that normal limiting distributions are best. Also, it is not completely
true that an N(h, Iθ⁻¹)-distribution is "best" (see section 8.8). We shall see exactly to what
extent the optimality statement is false.
8.5 Lemma (Anderson's lemma). For any bowl-shaped loss function ℓ on ℝᵏ, every prob-
ability measure M on ℝᵏ, and every covariance matrix Σ,

∫ ℓ dN(0, Σ) ≤ ∫ ℓ d[N(0, Σ) ∗ M].
In the minimax approach one seeks to minimize the maximum risk sup_h Ehℓ(T − Ah)
over all (randomized) estimators T. For every bowl-shaped loss function ℓ, this leads again
to the estimator AX.
8.6 Proposition. For any bowl-shaped loss function ℓ, the maximum risk of any random-
ized estimator T of Ah is bounded below by E₀ℓ(AX). Consequently, AX is a minimax
estimator for Ah. If Ah is real and E₀(AX)²ℓ(AX) < ∞, then AX is the only minimax
estimator for Ah up to changes on sets of probability zero.
Proofs. For a proof of the uniqueness of the minimax estimator, see [18] or [80]. We
prove the other assertions for subconvex loss functions, using a Bayesian argument.
Let H be a random vector with a normal N(0, Λ)-distribution, and consider the original
N(h, Σ)-distribution as the conditional distribution of X given H = h. The randomization
variable U in T(X, U) is constructed independently of the pair (X, H). In this notation, the
distribution of the variable T − AH is equal to the "average" of the distributions of T − Ah
under the different values of h in the original set-up, averaged over h using a N(0, Λ)-"prior
distribution."
By a standard calculation, we find that the "a posteriori" distribution, the distribution
of H given X, is the normal distribution with mean (Σ⁻¹ + Λ⁻¹)⁻¹ Σ⁻¹ X and covariance
matrix (Σ⁻¹ + Λ⁻¹)⁻¹. Define the random vectors

W_Λ = T − A(Σ⁻¹ + Λ⁻¹)⁻¹ Σ⁻¹ X,   G_Λ = A(Σ⁻¹ + Λ⁻¹)⁻¹ Σ⁻¹ X − AH.
These vectors are independent, because W_Λ is a function of (X, U) only, and the condi-
tional distribution of G_Λ given X is normal with mean 0 and covariance matrix A(Σ⁻¹ +
Λ⁻¹)⁻¹ Aᵀ, independent of X. As Λ = λI for a scalar λ → ∞, the sequence G_Λ converges
in distribution to a N(0, AΣAᵀ)-distributed vector G. The sum of the two vectors yields
T − AH, for every Λ.
Because a supremum is larger than an average, we obtain, where on the left we take the
expectation with respect to the original model,

sup_h Ehℓ(T − Ah) ≥ Eℓ(T − AH) = Eℓ(W_Λ + G_Λ) ≥ Eℓ(G_Λ),

by Anderson's lemma. This is true for every Λ. The lim inf of the right side as λ → ∞ is
at least Eℓ(G), by the portmanteau lemma. This concludes the proof that AX is minimax.
If T is equivariant-in-law with invariant law L, then the distribution of G_Λ + W_Λ =
T − AH is L, for every Λ. It follows that

∫ e^{itᵀx} dL(x) = E e^{itᵀG_Λ} E e^{itᵀW_Λ},   every t.

As Λ → ∞, the left side remains fixed; the first factor on the right side converges to the
characteristic function of G, which is positive. Conclude that the characteristic functions of
W_Λ converge to a continuous function, whence W_Λ converges in distribution to some vector
W, by Lévy's continuity theorem. By the independence of G_Λ and W_Λ for every Λ, the
sequence (G_Λ, W_Λ) converges in distribution to a pair (G, W) of independent vectors with
marginal distributions as before. Next, by the continuous-mapping theorem, the distribution
of G_Λ + W_Λ, which is fixed at L, "converges" to the distribution of G + W. This proves
that L can be written as a convolution, as claimed in Proposition 8.4.
If T is an equivariant-in-law estimator and t(X) = E(T(X, U) | X), then
The probability measure Lθ may be arbitrary but should be the same for every h.
A regular estimator sequence attains its limit distribution in a "locally uniform" manner.
This type of regularity is common and is often considered desirable: A small change in
the parameter should not change the distribution of the estimator too much; a disappearing
small change should not change the (limit) distribution at all. However, some estimator
sequences of interest, such as shrinkage estimators, are not regular.
In terms of the limit distributions L_{θ,h} in (8.2), regularity is exactly that all L_{θ,h} are
equal, for the given θ. According to Theorem 8.3, every estimator sequence is matched
by an estimator T in the limit experiment (N(h, Iθ⁻¹) : h ∈ ℝᵏ). For a regular estimator
sequence this matching estimator has the property

T − ψ̇θh ∼ Lθ,   every h.   (8.7)
possible to improve on a given estimator sequence for selected parameters. In this section
it is shown that improvement over an N(0, ψ̇θ Iθ⁻¹ ψ̇θᵀ)-distribution can be made on at most
a Lebesgue null set of parameters. Thus the possibilities for improvement are very much
restricted.
In particular, if Lθ has covariance matrix Σθ, then the matrix Σθ − ψ̇θ Iθ⁻¹ ψ̇θᵀ is
nonnegative definite for Lebesgue almost every θ.
The theorem follows from the convolution theorem in the preceding section combined
with the following remarkable lemma. Any estimator sequence with limit distributions is
automatically regular at almost every θ along a subsequence of {n}.
Then for every γn → 0 there exists a subsequence of {n} such that, for Lebesgue almost
every (θ, h), along the subsequence,
Proof. Assume without loss of generality that Θ = ℝᵏ; otherwise, fix some θ₀ and let
P_{n,θ} = P_{n,θ₀} for every θ not in Θ. Write T_{n,θ} = rn(Tn − ψ(θ)). There exists a countable
collection ℱ of uniformly bounded, left- or right-continuous functions f such that weak
convergence of a sequence of maps Tn to a limit L is equivalent to Ef(Tn) → ∫ f dL for every f ∈ ℱ.†
Suppose that for every f there exists a subsequence of {n} along which
† For continuous distributions L we can use the indicator functions of cells (−∞, c] with c ranging over ℚᵏ. For
general L replace every such indicator by an approximating sequence of continuous functions. Alternatively,
see, e.g., Theorem 1.12.2 in [146]. Also see Lemma 2.25.
Setting gn(θ) = Eθ f(T_{n,θ}) and g(θ) = ∫ f dLθ, we see that the lemma is proved once we
have established the following assertion: Every sequence of bounded, measurable functions
gn that converges almost everywhere to a limit g has a subsequence along which

gn(θ + γn h) → g(θ),   for Lebesgue almost every (θ, h).

We may assume without loss of generality that the function g is integrable; otherwise we
first multiply each gn and g with a suitable, fixed, positive, continuous function. It should
also be verified that, under our conditions, the functions gn are measurable.
Write p for the standard normal density on ℝᵏ and pn for the density of the N(0, (1 + γn²)I)-
distribution. By Scheffé's lemma, the sequence pn converges to p in L₁. Let ϑ and H denote
independent standard normal vectors. Then, by the triangle inequality and the dominated-
convergence theorem,
Second, for any fixed continuous and bounded function g_ε the sequence E|g_ε(ϑ + γnH) −
g_ε(ϑ)| converges to zero as n → ∞ by the dominated convergence theorem. Thus, by the
triangle inequality, we obtain
lim_{δ↓0} lim inf_{n→∞} sup_{‖θ′−θ‖<δ} E_{θ′} ℓ( √n(Tn − ψ(θ′)) ).

This is the asymptotic maximum risk over an arbitrarily small neighborhood of θ. The
following theorem concerns an even more refined (and smaller) version of the local maxi-
mum risk.
8.11 Theorem. Let the experiment (Pθ : θ ∈ Θ) be differentiable in quadratic mean (7.1) at
θ with nonsingular Fisher information matrix Iθ. Let ψ be differentiable at θ. Let Tn be
any estimator sequence in the experiments (P^n_θ : θ ∈ ℝᵏ). Then for any bowl-shaped loss
function ℓ

sup_I lim inf_{n→∞} sup_{h∈I} E_{θ+h/√n} ℓ( √n(Tn − ψ(θ + h/√n)) ) ≥ ∫ ℓ dN(0, ψ̇θ Iθ⁻¹ ψ̇θᵀ),

where the first supremum is taken over all finite subsets I of ℝᵏ.
Proof. We only give the proof under the further assumptions that the sequence √n(Tn −
ψ(θ)) is uniformly tight under θ and that ℓ is (lower) semicontinuous.† Then Prohorov's
theorem shows that every subsequence of {n} has a further subsequence along which the
vectors
It suffices to show that the left side of this display is a lower bound for the left side of the
theorem.
The complicated construction that defines the asymptotic minimax risk (the lim inf sand-
wiched between two suprema) requires that we apply the preceding argument to a carefully
chosen subsequence. Place the rational vectors in an arbitrary order, and let Ik consist of
the first k vectors in this sequence. Then the left side of the theorem is larger than
There exists a subsequence {nk} of {n} such that this expression is equal to
We apply the preceding argument to this subsequence and find a further subsequence along
which Tn satisfies (8.2). For simplicity of notation write this as {n′} rather than with a
double subscript. Because ℓ is nonnegative and lower semicontinuous, the portmanteau
lemma gives, for every h,
Every rational vector h is contained in Ik for every sufficiently large k. Conclude that

R ≥ sup_{h∈ℚᵏ} ∫ ℓ dL_{θ,h} = sup_{h∈ℚᵏ} Ehℓ(T − ψ̇θh).
The risk function in the supremum on the right is lower semicontinuous in h, by the
continuity of the Gaussian location family and the lower semicontinuity of ℓ. Thus
the expression on the right does not change if ℚᵏ is replaced by ℝᵏ. This concludes the
proof. ∎
8.12 Example (Shrinkage estimator). Let X1, ..., Xn be a sample from a multivariate
normal distribution with mean θ and covariance the identity matrix. The dimension k of
the observations is assumed to be at least 3. This is essential! Consider the estimator

Tn = X̄n − (k − 2) X̄n / (n‖X̄n‖²).
Because X̄n converges in probability to the mean θ, the second term in the definition of Tn
is O_P(n⁻¹) if θ ≠ 0. In that case √n(Tn − X̄n) converges in probability to zero, whence
the estimator sequence Tn is regular at every θ ≠ 0. For θ = h/√n, the variable √n X̄n is
distributed as a variable X with an N(h, I)-distribution, and for every n the standardized
estimator √n(Tn − h/√n) is distributed as T − h for

T(X) = X − (k − 2) X / ‖X‖².
This is the Stein shrinkage estimator. Because the distribution of T − h depends on h, the
sequence Tn is not regular at θ = 0. The Stein estimator has the remarkable property that,
for every h (see, e.g., [99, p. 300]),

Eh‖T(X) − h‖² < k = Eh‖X − h‖².
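This risk inequality is easy to confirm numerically. The sketch below (assuming Python with NumPy) draws X from an N(h, I)-distribution in dimension k = 5 and compares the Monte Carlo quadratic risks of X and of the Stein estimator T(X).

```python
# Monte Carlo check of  E_h ||T(X) - h||^2 < k = E_h ||X - h||^2  for the Stein estimator
# T(X) = X - (k - 2) X / ||X||^2  with X ~ N(h, I_k), k >= 3.
import numpy as np

rng = np.random.default_rng(4)
k, reps = 5, 200_000
h = np.array([1.0, 0.0, -1.0, 0.5, 0.0])

x = h + rng.standard_normal((reps, k))
t = x - (k - 2) * x / np.sum(x ** 2, axis=1, keepdims=True)

print("E||X - h||^2 =", round(np.mean(np.sum((x - h) ** 2, axis=1)), 3))   # about k = 5
print("E||T - h||^2 =", round(np.mean(np.sum((t - h) ** 2, axis=1)), 3))   # strictly smaller
```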
The example of shrinkage estimators shows that, depending on the optimality criterion, a
normal N(0, ψ̇θ Iθ⁻¹ ψ̇θᵀ)-limit distribution need not be optimal. In this light, is it reasonable
to uphold that maximum likelihood estimators are asymptotically optimal? Perhaps not. On
the other hand, the possibility of improvement over the N(0, ψ̇θ Iθ⁻¹ ψ̇θᵀ)-limit is restricted
in two important ways.
First, improvement can be made only on a null set of parameters by Theorem 8.9.
Second, improvement is possible only for special loss functions, and improvement for one
loss function necessarily implies worse performance for other loss functions. This follows
from the next lemma.
Suppose that we require the estimator sequence to be locally asymptotically minimax for
a given loss function ℓ in the sense that
8.13 Lemma. Assume that the experiment (Pθ : θ ∈ Θ) is differentiable in quadratic mean
(7.1) at θ with nonsingular Fisher information matrix Iθ. Let ψ be a real-valued map
that is differentiable at θ. Then an estimator sequence Tn in the experiments (P^n_θ : θ ∈ ℝᵏ)
can be locally asymptotically minimax at θ for a bowl-shaped loss function ℓ such that
0 < ∫ x² ℓ(x) dN(0, ψ̇θ Iθ⁻¹ ψ̇θᵀ)(x) < ∞ only if Tn is best regular at θ.
Proof. We only give the proof under the further assumption that the sequence √n(Tn −
ψ(θ)) is uniformly tight under θ. Then by the same arguments as in the proof of Theo-
rem 8.11, every subsequence of {n} has a further subsequence along which the sequence
√n(Tn − ψ(θ + h/√n)) converges in distribution under θ + h/√n to the distribution
L_{θ,h} of T − ψ̇θh under h, for a randomized estimator T based on an N(h, Iθ⁻¹)-distributed
observation. Because Tn is locally asymptotically minimax, it follows that

sup_h Ehℓ(T − ψ̇θh) ≤ ∫ ℓ dN(0, ψ̇θ Iθ⁻¹ ψ̇θᵀ).
Thus T is a minimax estimator for ψ̇θh in the limit experiment. By Proposition 8.6,
T = ψ̇θX, whence L_{θ,h} is independent of h. ∎
Then Tn is a best regular estimator sequence for ψ(θ) at θ. Conversely, every best regular estimator
sequence satisfies this expansion.
Proof. The sequence Δ_{n,θ} = n^{−1/2} Σ ℓ̇θ(Xi) converges in distribution to a vector Δθ with
a N(0, Iθ)-distribution. By Theorem 7.2 the sequence log dP^n_{θ+h/√n}/dP^n_θ is asymptotically
equivalent to hᵀΔ_{n,θ} − ½ hᵀIθh. If Tn is asymptotically linear, then √n(Tn − ψ(θ)) is
asymptotically equivalent to ψ̇θ Iθ⁻¹ Δ_{n,θ}. Apply Slutsky's lemma to find that

( √n(Tn − ψ(θ)), log dP^n_{θ+h/√n}/dP^n_θ ) ⇝ ( ψ̇θ Iθ⁻¹ Δθ, hᵀΔθ − ½ hᵀIθh ).

The limit distribution of the sequence √n(Tn − ψ(θ)) under θ + h/√n follows by Le Cam's
third lemma, Example 6.7, and is normal with mean ψ̇θh and covariance matrix ψ̇θ Iθ⁻¹ ψ̇θᵀ.
Combining this with the differentiability of ψ, we obtain that Tn is regular.
Next suppose that Sn and Tn are both best regular estimator sequences. By the same
arguments as in the proof of Theorem 8.11 it can be shown that, at least along subsequences,
the joint estimators (Sn, Tn) for (ψ(θ), ψ(θ)) satisfy for every h

( √n(Sn − ψ(θ + h/√n)), √n(Tn − ψ(θ + h/√n)) ) ⇝_{θ+h/√n} (S − ψ̇θh, T − ψ̇θh),

for a randomized estimator (S, T) in the normal-limit experiment. Because Sn and Tn are
best regular, the estimators S and T are best equivariant-in-law. Thus S = T = ψ̇θX almost
surely by Proposition 8.6, whence √n(Sn − Tn) converges in distribution to S − T = 0.
Thus every two best regular estimator sequences are asymptotically equivalent. The
second assertion of the lemma follows on applying this to Tn and the estimators

Sn = ψ(θ) + (1/√n) ψ̇θ Iθ⁻¹ Δ_{n,θ}.

Because the parameter θ is known in the local experiments (P^n_{θ+h/√n} : h ∈ ℝᵏ), this indeed
defines an estimator sequence within the present context. It is best regular by the first part
of the lemma. ∎
Under regularity conditions, for instance those of Theorem 5.39, the maximum likeli-
hood estimator θ̂n in a parametric model satisfies

√n(θ̂n − θ) = (1/√n) Σ_{i=1}^n Iθ⁻¹ ℓ̇θ(Xi) + o_{Pθ}(1).

Then the maximum likelihood estimator is asymptotically optimal for estimating θ in terms
of the convolution theorem. By the delta method, the estimator ψ(θ̂n) for ψ(θ) can be seen
to be asymptotically optimal as well.
8.15 Theorem. Suppose that the estimator sequence Tn is consistent for ψ(θ) under every
θ. Then, for every ε > 0 and every θ₀,

lim sup_{n→∞} −(1/n) log P_{θ₀}( d(Tn, ψ(θ₀)) > ε ) ≤ inf{ −Pθ log (p_{θ₀}/pθ) : d(ψ(θ), ψ(θ₀)) > ε }.
Proof. If the right side is infinite, then there is nothing to prove. The Kullback-Leibler
information −Pθ log(p_{θ₀}/pθ) can be finite only if Pθ ≪ P_{θ₀}. Hence, it suffices to prove that
−Pθ log(p_{θ₀}/pθ) is an upper bound for the left side for every θ such that Pθ ≪ P_{θ₀} and
d(ψ(θ), ψ(θ₀)) > ε. The variable Λn = n⁻¹ Σ_{j=1}^n log(pθ/p_{θ₀})(Xj) is well defined
(possibly −∞). For every constant M,

−(1/n) log P_{θ₀}( d(Tn, ψ(θ₀)) > ε ) ≤ M − (1/n) log Pθ( d(Tn, ψ(θ₀)) > ε, Λn < M ).
For M > Pθ log(pθ/p_{θ₀}), we have that Pθ(Λn < M) → 1 by the law of large numbers.
Furthermore, by the consistency of Tn for ψ(θ), the probability Pθ(d(Tn, ψ(θ₀)) > ε)
converges to 1 for every θ such that d(ψ(θ), ψ(θ₀)) > ε. Conclude that the probability in
the right side of the preceding display converges to 1, whence the lim sup of the left side is
bounded by M. ∎
Notes
Chapter 32 of the famous book by Cramér [27] gives a rigorous proof of what we now
know as the Cramér-Rao inequality and next goes on to define the asymptotic efficiency of
an estimator as the quotient of the inverse Fisher information and the asymptotic variance.
Cramér defines an estimator as asymptotically efficient if its efficiency (the quotient men-
tioned previously) equals one. These definitions lead to the conclusion that the method of
maximum likelihood produces asymptotically efficient estimators, as already conjectured
by Fisher [48, 50] in the 1920s. That there is a conceptual hole in the definitions was clearly
realized in 1951 when Hodges produced his example of a superefficient estimator. Not long
after this, in 1953, Le Cam proved that superefficiency can occur only on a Lebesgue null
set. Our present result, almost without regularity conditions, is based on later work by Le
Cam (see [95]). The asymptotic convolution and minimax theorems were obtained in the
present form by Hájek in [69] and [70] after initial work by many authors. Our present
proofs follow the approach based on limit experiments, initiated by Le Cam in [95].
PROBLEMS
1. Calculate the asymptotic relative efficiency of the sample mean and the sample median for
estimating θ, based on a sample of size n from the normal N(θ, 1) distribution.
2. As the previous problem, but now for the Laplace distribution (density p(x) = ½e^{−|x|}).
3. Consider estimating the distribution function P(X ≤ x) at a fixed point x based on a sample
X1, ..., Xn from the distribution of X. The "nonparametric" estimator is n⁻¹ #(Xi ≤ x). If it
is known that the true underlying distribution is normal N(θ, 1), another possible estimator is
Φ(x − X̄). Calculate the relative efficiency of these estimators.
4. Calculate the relative efficiency of the empirical p-quantile and the estimator Φ⁻¹(p)Sn + X̄n
for estimating the p-th quantile of the distribution of a sample from the normal N(μ, σ²)-
distribution.
5. Consider estimating the population variance by either the sample variance S² (which is unbiased)
or else n⁻¹ Σ_{i=1}^n (Xi − X̄)² = ((n − 1)/n) S². Calculate the asymptotic relative efficiency.
6. Calculate the asymptotic relative efficiency of the sample standard deviation and the interquartile
range (corrected for unbiasedness) for estimating the standard deviation based on a sample of
size n from the normal N(μ, σ²)-distribution.
7. Given a sample of size n from the uniform distribution on [0, θ], the maximum X(n) of the
observations is biased downwards. Because Eθ(θ − X(n)) = Eθ X(1), the bias can be removed by
adding the minimum of the observations. Is X(1) + X(n) a good estimator for θ from an asymptotic
point of view?
8. Consider the Hodges estimator Sn based on the mean of a sample from the N(θ, 1)-distribution.
(i) Investigate the behaviour of √n(Sn − θn) under θn if θn → 0 in such a way that n^{1/4}θn → 0 and n^{1/2}θn → ∞.
(ii) Show that Sn is not regular at θ = 0.
(iii) Show that sup_{−δ<θ<δ} Pθ(√n|Sn − θ| > kn) → 1 for every kn that converges to infinity
sufficiently slowly.
9. Show that a loss function ℓ: ℝ ↦ ℝ is bowl-shaped if and only if it has the form ℓ(x) = ℓ₀(|x|)
for a nondecreasing function ℓ₀.
10. Show that a function of the form ℓ(x) = ℓ₀(‖x‖) for a nondecreasing function ℓ₀ is bowl-shaped.
11. Prove Anderson's lemma for the one-dimensional case, for instance by calculating the derivative
of ∫ ℓ(x + h) dN(0, 1)(x). Does the proof generalize to higher dimensions?
12. What does Lemma 8.13 imply about the coordinates of the Stein estimator? Are they good
estimators of the coordinates of the expectation vector?
13. All results in this chapter extend in a straightforward manner to general locally asymptotically
normal models. Formulate Theorem 8.9 and Lemma 8.14 for such models.