
Asymptotic Statistics

This book is an introduction to the field of asymptotic statistics. The
treatment is both practical and mathematically rigorous. In addition to
most of the standard topics of an asymptotics course, including likelihood
inference, M-estimation, asymptotic efficiency, U-statistics, and
rank procedures, the book also presents recent research topics such as
semiparametric models, the bootstrap, and empirical processes and their
applications.
One of the unifying themes is the approximation by limit experiments.
This entails mainly the local approximation of the classical i.i.d.
set-up with smooth parameters by location experiments involving a single,
normally distributed observation. Thus, even the standard subjects
of asymptotic statistics are presented in a novel way.
Suitable as a text for a graduate or Master's level statistics course, this
book also gives researchers in statistics, probability, and their applications
an overview of the latest research in asymptotic statistics.

A.W. van der Vaart is Professor of Statistics in the Department of
Mathematics and Computer Science at the Vrije Universiteit, Amsterdam.

Published online by Cambridge University Press


CAMBRIDGE SERIES IN STATISTICAL AND PROBABILISTIC MATHEMATICS

Editorial Board:

R. Gill, Department of Mathematics, Utrecht University
B.D. Ripley, Department of Statistics, University of Oxford
S. Ross, Department of Industrial Engineering, University of California, Berkeley
M. Stein, Department of Statistics, University of Chicago
D. Williams, School of Mathematical Sciences, University of Bath

This series of high-quality upper-division textbooks and expository monographs covers
all aspects of stochastic applicable mathematics. The topics range from pure and applied
statistics to probability theory, operations research, optimization, and mathematical
programming. The books contain clear presentations of new developments in the field and
also of the state of the art in classical methods. While emphasizing rigorous treatment of
theoretical methods, the books also contain applications and discussions of new techniques
made possible by advances in computational practice.

Already published
1. Bootstrap Methods and Their Application, by A.C. Davison and D.V. Hinkley
2. Markov Chains, by J. Norris



Asymptotic Statistics

A.W. VAN DER VAART

CAMBRIDGE UNIVERSITY PRESS



University Printing House, Cambridge CB2 8BS, United Kingdom
One Liberty Plaza, 20th Floor, New York, NY 10006, USA
477 Williamstown Road, Port Melbourne, VIC 3207, Australia
314–321, 3rd Floor, Plot 3, Splendor Forum, Jasola District Centre,
New Delhi – 110025, India
103 Penang Road, #05–06/07, Visioncrest Commercial, Singapore 238467

Cambridge University Press is part of the University of Cambridge.
It furthers the University’s mission by disseminating knowledge in the pursuit of
education, learning, and research at the highest international levels of excellence.

www.cambridge.org
Information on this title: www.cambridge.org/9780521784504
DOI: 10.1017/CBO9780511802256

© Cambridge University Press 1998
This publication is in copyright. Subject to statutory exception
and to the provisions of relevant collective licensing agreements,
no reproduction of any part may take place without the written
permission of Cambridge University Press.
First published 1998
First paperback edition 2000
8th printing 2007
A catalogue record for this publication is available from the British Library
Library of Congress Cataloguing in Publication Data
Vaart, A. W. van der
Asymptotic statistics / A. W. van der Vaart.
p. cm. – (Cambridge series in statistical and probabilistic
mathematics)
Includes bibliographical references.
1. Mathematical statistics – Asymptotic theory. I. Title.
II. Series: Cambridge series in statistical and probabilistic mathematics.
QA276.V22 1998
519.5–dc21 98-15176
ISBN 978-0-521-49603-2 Hardback
ISBN 978-0-521-78450-4 Paperback
Cambridge University Press has no responsibility for the persistence or accuracy of
URLs for external or third-party internet websites referred to in this publication
and does not guarantee that any content on such websites is, or will remain,
accurate or appropriate.



To Maryse and Marianne



Contents

Preface page xiii

Notation xv

1. Introduction 1
1.1. Approximate Statistical Procedures 1
1.2. Asymptotic Optimality Theory 2
1.3. Limitations 3
1.4. The Index n 4

2. Stochastic Convergence 5
2.1. Basic Theory 5
2.2. Stochastic o and O Symbols 12
*2.3. Characteristic Functions 13
*2.4. Almost-Sure Representations 17
*2.5. Convergence of Moments 17
*2.6. Convergence-Determining Classes 18
*2.7. Law of the Iterated Logarithm 19
*2.8. Lindeberg-Feller Theorem 20
*2.9. Convergence in Total Variation 22
Problems 24
3. Delta Method 25
3.1. Basic Result 25
3.2. Variance-Stabilizing Transformations 30
*3.3. Higher-Order Expansions 31
*3.4. Uniform Delta Method 32
*3.5. Moments 33
Problems 34
4. Moment Estimators 35
4.1. Method of Moments 35
*4.2. Exponential Families 37
Problems 40
5. M- and Z-Estimators 41
5.1. Introduction 41
5.2. Consistency 44
5.3. Asymptotic Normality 51




*5.4. Estimated Parameters 60


5.5. Maximum Likelihood Estimators 61
*5.6. Classical Conditions 67
*5.7. One-Step Estimators 71
*5.8. Rates of Convergence 75
*5.9. Argmax Theorem 79
Problems 83

6. Contiguity 85
6.1. Likelihood Ratios 85
6.2. Contiguity 87
Problems 91
7. Local Asymptotic Normality 92
7.1. Introduction 92
7.2. Expanding the Likelihood 93
7.3. Convergence to a Normal Experiment 97
7.4. Maximum Likelihood 100
*7.5. Limit Distributions under Alternatives 103
*7.6. Local Asymptotic Normality 103
Problems 106
8. Efficiency of Estimators 108
8.1. Asymptotic Concentration 108
8.2. Relative Efficiency 110
8.3. Lower Bound for Experiments 111
8.4. Estimating Normal Means 112
8.5. Convolution Theorem 115
8.6. Almost-Everywhere Convolution
Theorem 115
*8.7. Local Asymptotic Minimax Theorem 117
*8.8. Shrinkage Estimators 119
*8.9. Achieving the Bound 120
*8.10. Large Deviations 122
Problems 123
9. Limits of Experiments 125
9.1. Introduction 125
9.2. Asymptotic Representation Theorem 126
9.3. Asymptotic Normality 127
9.4. Uniform Distribution 129
9.5. Pareto Distribution 130
9.6. Asymptotic Mixed Normality 131
9.7. Heuristics 136
Problems 137

10. Bayes Procedures 138


10.1. Introduction 138
10.2. Bernstein-von Mises Theorem 140




10.3. Point Estimators 146


*10.4. Consistency 149
Problems 152
11. Projections 153
11.1. Projections 153
11.2. Conditional Expectation 155
11.3. Projection onto Sums 157
*11.4. Hoeffding Decomposition 157
Problems 160
12. U-Statistics 161
12.1. One-Sample U-Statistics 161
12.2. Two-Sample U-Statistics 165
*12.3. Degenerate U-Statistics 167
Problems 171

13. Rank, Sign, and Permutation Statistics 173


13.1. Rank Statistics 173
13.2. Signed Rank Statistics 181
13.3. Rank Statistics for Independence 184
*13.4. Rank Statistics under Alternatives 184
13.5. Permutation Tests 188
*13.6. Rank Central Limit Theorem 190
Problems 190

14. Relative Efficiency of Tests 192


14.1. Asymptotic Power Functions 192
14.2. Consistency 199
14.3. Asymptotic Relative Efficiency 201
*14.4. Other Relative Efficiencies 202
*14.5. Rescaling Rates 211
Problems 213

15. Efficiency of Tests 215


15.1. Asymptotic Representation Theorem 215
15.2. Testing Normal Means 216
15.3. Local Asymptotic Normality 218
15.4. One-Sample Location 220
15.5. Two-Sample Problems 223
Problems 226

16. Likelihood Ratio Tests 227


16.1. Introduction 227
*16.2. Taylor Expansion 229
16.3. Using Local Asymptotic Normality 231
16.4. Asymptotic Power Functions 236




16.5. Bartlett Correction 238


*16.6. Bahadur Efficiency 238
Problems 241

17. Chi-Square Tests 242


17.1. Quadratic Forms in Normal Vectors 242
17.2. Pearson Statistic 242
17.3. Estimated Parameters 244
17.4. Testing Independence 247
*17.5. Goodness-of-Fit Tests 248
*17.6. Asymptotic Efficiency 251
Problems 253

18. Stochastic Convergence in Metric Spaces 255


18.1. Metric and Normed Spaces 255
18.2. Basic Properties 258
18.3. Bounded Stochastic Processes 260
Problems 263
19. Empirical Processes 265
19.1. Empirical Distribution Functions 265
19.2. Empirical Distributions 269
19.3. Goodness-of-Fit Statistics 277
19.4. Random Functions 279
19.5. Changing Classes 282
19.6. Maximal Inequalities 284
Problems 289
20. Functional Delta Method 291
20.1. von Mises Calculus 291
20.2. Hadamard-Differentiable Functions 296
20.3. Some Examples 298
Problems 303
21. Quantiles and Order Statistics 304
21.1. Weak Consistency 304
21.2. Asymptotic Normality 305
21.3. Median Absolute Deviation 310
21.4. Extreme Values 312
Problems 315
22. L-Statistics 316
22.1. Introduction 316
22.2. Hajek Projection 318
22.3. Delta Method 320
22.4. L-Estimators for Location 323
Problems 324
23. Bootstrap 326




23.1. Introduction 326


23.2. Consistency 329
23.3. Higher-Order Correctness 334
Problems 339
24. Nonparametric Density Estimation 341
24.1 Introduction 341
24.2 Kernel Estimators 341
24.3 Rate Optimality 346
24.4 Estimating a Unimodal Density 349
Problems 356
25. Semiparametric Models 358
25.1 Introduction 358
25.2 Banach and Hilbert Spaces 360
25.3 Tangent Spaces and Information 362
25.4 Efficient Score Functions 368
25.5 Score and Information Operators 371
25.6 Testing 384
*25.7 Efficiency and the Delta Method 386
25.8 Efficient Score Equations 391
25.9 General Estimating Equations 400
25.10 Maximum Likelihood Estimators 402
25.11 Approximately Least-Favorable
Submodels 408
25.12 Likelihood Equations 419
Problems 431

References 433

Index 439



Preface

This book grew out of courses that I gave at various places, including a graduate course in
the Statistics Department of Texas A&M University, Master's level courses for mathematics
students specializing in statistics at the Vrije Universiteit Amsterdam, a course in the DEA
program (graduate level) of Université de Paris-Sud, and courses in the Dutch AIO-netwerk
(graduate level).
The mathematical level is mixed. Some parts I have used for second year courses for
mathematics students (but they find it tough), other parts I would only recommend for a
graduate program. The text is written both for students who know about the technical
details of measure theory and probability, but little about statistics, and vice versa. This
requires brief explanations of statistical methodology, for instance of what a rank test or
the bootstrap is about, and there are similar excursions to introduce mathematical details.
Familiarity with (higher-dimensional) calculus is necessary in all of the manuscript. Metric
and normed spaces are briefly introduced in Chapter 18, when these concepts become
necessary for Chapters 19, 20, 21 and 22, but I do not expect that this would be enough as a
first introduction. For Chapter 25 basic knowledge of Hilbert spaces is extremely helpful,
although the bare essentials are summarized at the beginning. Measure theory is implicitly
assumed in the whole manuscript but can at most places be avoided by skipping proofs, by
ignoring the word "measurable" or with a bit of handwaving. Because we deal mostly with
i.i.d. observations, the simplest limit theorems from probability theory suffice. These are
derived in Chapter 2, but prior exposure is helpful.
Sections, results or proofs that are preceded by asterisks are either of secondary impor-
tance or are out of line with the natural order of the chapters. As the chart in Figure 0.1
shows, many of the chapters are independent of one another, and the book can be used
for several different courses.
A unifying theme is approximation by a limit experiment. The full theory is not developed
(another writing project is on its way), but the material is limited to the "weak topology"
on experiments, which in 90% of the book is exemplified by the case of smooth parameters
of the distribution of i.i.d. observations. For this situation the theory can be developed
by relatively simple, direct arguments. Limit experiments are used to explain efficiency
properties, but also why certain procedures asymptotically take a certain form.
A second major theme is the application of results on abstract empirical processes. These
already have benefits for deriving the usual theorems on M-estimators for Euclidean
parameters but are indispensable if discussing more involved situations, such as
M-estimators with nuisance parameters, chi-square statistics with data-dependent cells,
or semiparametric models. The general theory is summarized in about 30 pages, and it is the applications





Figure 0.1. Dependence chart. A solid arrow means that a chapter is a prerequisite for a next chapter.
A dotted arrow means a natural continuation. Vertical or horizontal position has no independent
meaning.

that we focus on. In a sense, it would have been better to place this material (Chapters
18 and 19) earlier in the book, but instead we start with material of more direct statistical
relevance and of a less abstract character. A drawback is that a few (starred) proofs point
ahead to later chapters.
Almost every chapter ends with a "Notes" section. These are meant to give a rough
historical sketch, and to provide entries in the literature for further reading. They certainly
do not give sufficient credit to the original contributions by many authors and are not meant
to serve as references in this way.
Mathematical statistics obtains its relevance from applications. The subjects of this book
have been chosen accordingly. On the other hand, this is a mathematician's book in that
we have made some effort to present results in a nice way, without the (unnecessary) lists
of "regularity conditions" that are sometimes found in statistics books. Occasionally, this
means that the accompanying proof must be more involved. If this means that an idea could
get lost, then an informal argument precedes the statement of a result.
This does not mean that I have strived after the greatest possible generality. A simple,
clean presentation was the main aim.

Leiden, September 1997


A.W. van der Vaart



Notation

A*  adjoint operator
B*  dual space
Cb(T), UC(T), C(T)  (bounded, uniformly) continuous functions on T
ℓ∞(T)  bounded functions on T
Cr(Q), Lr(Q)  measurable functions whose rth powers are Q-integrable
‖f‖Q,r  norm of Lr(Q)
‖z‖∞, ‖z‖T  uniform norm
lin  linear span
ℂ, ℕ, ℚ, ℝ, ℤ  number fields and sets
EX, E*X, var X, sd X, Cov X  (outer) expectation, variance, standard deviation, covariance (matrix) of X
ℙn, 𝔾n  empirical measure and process
𝔾P  P-Brownian bridge
N(μ, Σ), tn, χ²k  normal, t and chi-square distributions
zα, χ²n,α, tn,α  upper α-quantiles of the normal, chi-square and t distributions
≪  absolutely continuous
◁, ◁▷  contiguous, mutually contiguous
≲  smaller than up to a constant
⇝  convergence in distribution
→P  convergence in probability
→as  convergence almost surely
N(ε, T, d), N[ ](ε, T, d)  covering and bracketing numbers
J(ε, T, d), J[ ](ε, T, d)  entropy integrals
oP(1), OP(1)  stochastic order symbols




1
Introduction

Why asymptotic statistics? The use of asymptotic approximations is twofold.
First, they enable us to find approximate tests and confidence regions.
Second, approximations can be used theoretically to study the quality
(efficiency) of statistical procedures.

1.1 Approximate Statistical Procedures


To carry out a statistical test, we need to know the critical value for the test statistic. In
most cases this means that we must know the distribution of the test statistic under the
null hypothesis. Sometimes this is known exactly, but more often only approximations are
available. This may be because the distribution of the statistic is analytically intractable,
or perhaps the postulated statistical model is considered only an approximation of the true
underlying distributions. In both cases the use of an approximate critical value may be fully
satisfactory for practical purposes.
Consider for instance the classical t-test for location. Given a sample of independent
observations X1, ..., Xn, we wish to test a null hypothesis concerning the mean μ = EX1.
The t-test is based on the quotient of the sample mean X̄n and the sample standard deviation
Sn. If the observations arise from a normal distribution with mean μ0, then the distribution
of √n(X̄n − μ0)/Sn is known exactly: It is a t-distribution with n − 1 degrees of freedom.
However, we may have doubts regarding the normality, or we might even believe in a
completely different model. If the number of observations is not too small, this does not
matter too much. Then we may act as if √n(X̄n − μ0)/Sn possesses a standard normal
distribution. The theoretical justification is the limiting result, as n → ∞,

sup_x | Pμ( √n(X̄n − μ)/Sn ≤ x ) − Φ(x) | → 0,

provided the variables Xi have a finite second moment. This variation on the central limit
theorem is proved in the next chapter. A "large sample" level α test is to reject H0: μ = μ0
if |√n(X̄n − μ0)/Sn| exceeds the upper α/2 quantile of the standard normal distribution.
Table 1.1 gives the significance level of this test if the observations are either normally or
exponentially distributed, and α = 0.05. For n ≥ 20 the approximation is quite reasonable
in the normal case. If the underlying distribution is exponential, then the approximation is
less satisfactory, because of the skewness of the exponential distribution.




Table 1.1. Level of the test with critical region
|√n(X̄n − μ0)/Sn| > 1.96 if the observations
are sampled from the normal or
exponential distribution.

n      Normal   Exponential(a)
5      0.122    0.19
10     0.082    0.14
15     0.070    0.11
20     0.065    0.10
25     0.062    0.09
50     0.056    0.07
100    0.053    0.06

(a) The third column gives approximations based on 10,000
simulations.
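A simulation of this kind is easy to reproduce. The sketch below estimates the actual level of the nominal-5% test under normal and exponential sampling; the function name and the use of the Python standard library are illustrative choices, not from the text, and only the critical value 1.96 comes from the table caption.

```python
import random
import statistics

def simulated_level(n, sampler, mu0, reps=10_000, seed=0):
    """Fraction of samples in which |sqrt(n)(mean - mu0)/S_n| exceeds 1.96."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        x = [sampler(rng) for _ in range(n)]
        t = abs(n ** 0.5 * (statistics.fmean(x) - mu0) / statistics.stdev(x))
        rejections += t > 1.96
    return rejections / reps

# Normal(0, 1) has mean 0; Exponential(1) has mean 1.
normal_level = simulated_level(20, lambda r: r.gauss(0.0, 1.0), mu0=0.0)
exp_level = simulated_level(20, lambda r: r.expovariate(1.0), mu0=1.0)
```

With n = 20 the estimated levels should come out near the 0.065 and 0.10 of the table, exhibiting the skewness effect for the exponential.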

In many ways the t-test is an uninteresting example. There are many other reasonable
test statistics for the same problem. Often their null distributions are difficult to calculate.
An asymptotic result similar to the one for the t-statistic would make them practically
applicable at least for large sample sizes. Thus, one aim of asymptotic statistics is to derive
the asymptotic distribution of many types of statistics.
There are similar benefits when obtaining confidence intervals. For instance, the given
approximation result asserts that √n(X̄n − μ)/Sn is approximately standard normally
distributed if μ is the true mean, whatever its value. This means that, with probability
approximately 1 − 2α,

−zα ≤ √n(X̄n − μ)/Sn ≤ zα.

This can be rewritten as the confidence statement μ = X̄n ± zα Sn/√n in the usual manner.
For large n its confidence level should be close to 1 − 2α.
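The interval computation itself is a one-liner; a minimal sketch (hypothetical helper name, with zα = 1.96 hard-coded so α = 0.025 per tail and the level is approximately 0.95):

```python
import statistics

def approx_ci(xs, z=1.96):
    """Approximate interval mean ± z * S_n / sqrt(n) for the true mean."""
    n = len(xs)
    m = statistics.fmean(xs)
    half = z * statistics.stdev(xs) / n ** 0.5
    return m - half, m + half

# Illustrative data with sample mean 4.5.
lo, hi = approx_ci([4.1, 5.0, 3.8, 4.4, 4.9, 5.2, 4.0, 4.6])
```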
As another example, consider maximum likelihood estimators θ̂n based on a sample of
size n from a density pθ. A major result in asymptotic statistics is that in many situations
√n(θ̂n − θ) is asymptotically normally distributed with zero mean and covariance matrix the
inverse of the Fisher information matrix Iθ. If Z is k-variate normally distributed with mean
zero and nonsingular covariance matrix Σ, then the quadratic form ZᵀΣ⁻¹Z possesses a
chi-square distribution with k degrees of freedom. Thus, acting as if √n(θ̂n − θ) possesses
an Nk(0, Iθ⁻¹) distribution, we find that the ellipsoid

{θ : n(θ − θ̂n)ᵀ Iθ̂n (θ − θ̂n) ≤ χ²k,α}

is an approximate 1 − α confidence region, if χ²k,α is the appropriate critical value from the
chi-square distribution. A closely related alternative is the region based on inverting the
likelihood ratio test, which is also based on an asymptotic approximation.
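Checking whether a point lies in this ellipsoid is just a quadratic form against a chi-square quantile. The sketch below does this for an assumed bivariate example; the sample size, the identity information matrix, and the critical value (about 5.99 for k = 2, α = 0.05) are illustrative assumptions, not taken from the text.

```python
def in_wald_region(theta, theta_hat, info, n, crit):
    """Check n (theta - theta_hat)^T I (theta - theta_hat) <= crit,
    where info is the (estimated) Fisher information matrix and crit
    the upper-alpha chi-square quantile with k = len(theta) df."""
    d = [t - th for t, th in zip(theta, theta_hat)]
    k = len(d)
    quad = sum(d[i] * info[i][j] * d[j] for i in range(k) for j in range(k))
    return n * quad <= crit

# Hypothetical numbers: n = 100, identity information, chi2(2, 0.05) ~ 5.99.
inside = in_wald_region((0.1, 0.1), (0.0, 0.0), [[1, 0], [0, 1]], 100, 5.99)
outside = in_wald_region((0.3, 0.3), (0.0, 0.0), [[1, 0], [0, 1]], 100, 5.99)
```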

1.2 Asymptotic Optimality Theory


For a relatively small number of statistical problems there exists an exact, optimal solution.
For instance, the Neyman-Pearson theory leads to optimal (uniformly most powerful) tests




in certain exponential family models; the Rao-Blackwell theory allows us to conclude that
certain estimators are of minimum variance among the unbiased estimators. An important
and fairly general result is the Cramér-Rao bound for the variance of unbiased estimators,
but it is often not sharp.
If exact optimality theory does not give results, be it because the problem is intractable
or because there exist no "optimal" procedures, then asymptotic optimality theory may
help. For instance, to compare two tests we might compare approximations to their power
functions. To compare estimators, we might compare asymptotic variances rather than
exact variances. A major result in this area is that for smooth parametric models maximum
likelihood estimators are asymptotically optimal. This roughly means the following. First,
maximum likelihood estimators are asymptotically consistent: The sequence of estimators
converges in probability to the true value of the parameter. Second, the rate at which
maximum likelihood estimators converge to the true value is the fastest possible, typically
1/√n. Third, their asymptotic variance, the variance of the limit distribution of √n(θ̂n − θ),
is minimal; in fact, maximum likelihood estimators "asymptotically attain" the Cramér-Rao
bound. Thus asymptotics justify the use of the maximum likelihood method in certain
situations. It is of interest here that, even though the method of maximum likelihood often
leads to reasonable estimators and has great intuitive appeal, in general it does not lead
to best estimators for finite samples. Thus the use of an asymptotic criterion simplifies
optimality theory considerably.
By taking limits we can gain much insight in the structure of statistical experiments. It
turns out that not only estimators and test statistics are asymptotically normally distributed,
but often also the whole sequence of statistical models converges to a model with a nor-
mal observation. Our good understanding of the latter "canonical experiment" translates
directly into understanding other experiments asymptotically. The mathematical beauty of
this theory is an added benefit of asymptotic statistics. Though we shall be mostly concerned
with normal limiting theory, this theory applies equally well to other situations.

1.3 Limitations
Although asymptotics is both practically useful and of theoretical importance, it should not
be taken for more than what it is: approximations. Clearly, a theorem that can be interpreted
as saying that a statistical procedure works fine for n → ∞ is of no use if the number of
available observations is n = 5.
In fact, strictly speaking, most asymptotic results that are currently available are logically
useless. This is because most asymptotic results are limit results, rather than approximations
consisting of an approximating formula plus an accurate error bound. For instance, to
estimate a value a, we consider it to be the 25th element a = a25 in a sequence a1, a2, ...,
and next take limn→∞ an as an approximation. The accuracy of this procedure depends
crucially on the choice of the sequence in which a25 is embedded, and it seems impossible
to defend the procedure from a logical point of view. This is why there is good asymptotics
and bad asymptotics and why two types of asymptotics sometimes lead to conflicting
claims.
Fortunately, many limit results of statistics do give reasonable answers. Because it may
be theoretically very hard to ascertain that approximation errors are small, one often takes
recourse to simulation studies to judge the accuracy of a certain approximation.




Just as care is needed if using asymptotic results for approximations, results on asymptotic
optimality must be judged in the right manner. One pitfall is that even though a certain
procedure, such as maximum likelihood, is asymptotically optimal, there may be many
other procedures that are asymptotically optimal as well. For finite samples these may
behave differently and possibly better. Then so-called higher-order asymptotics, which
yield better approximations, may be fruitful. See e.g., [7], [52] and [114]. Although we
occasionally touch on this subject, we shall mostly be concerned with what is known as
"first-order asymptotics."

1.4 The Index n


In all of the following n is an index that tends to infinity, and asymptotics means taking
limits as n → ∞. In most situations n is the number of observations, so that usually
asymptotics is equivalent to "large-sample theory." However, certain abstract results are
pure limit theorems that have nothing to do with individual observations. In that case n just
plays the role of the index that goes to infinity.

1.5 Notation
A symbol index is given on page xv.
For brevity we often use operator notation for evaluation of expectations and have special
symbols for the empirical measure and process.
For P a measure on a measurable space (𝒳, ℬ) and f : 𝒳 → ℝᵏ a measurable function,
Pf denotes the integral ∫ f dP; equivalently, the expectation EP f(X1) for X1 a random
variable distributed according to P. When applied to the empirical measure ℙn of a sample
X1, ..., Xn, the discrete uniform measure on the sample values, this yields

ℙn f = (1/n) Σⁿᵢ₌₁ f(Xi).

This formula can also be viewed as simply an abbreviation for the average on the right. The
empirical process 𝔾n f is the centered and scaled version of the empirical measure, defined
by

𝔾n f = √n (ℙn f − Pf).

This is studied in detail in Chapter 19, but is used as an abbreviation throughout the book.
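The operator notation translates directly into two small functions. The sketch below is illustrative: the names P_n and G_n and the Gaussian example (with f(x) = x², so Pf = EX² = 1) are assumptions for the demonstration, not notation from the book.

```python
import math
import random

def P_n(f, xs):
    """Empirical measure: P_n f = (1/n) * sum_i f(X_i)."""
    return sum(f(x) for x in xs) / len(xs)

def G_n(f, xs, Pf):
    """Empirical process: G_n f = sqrt(n) * (P_n f - P f), with P f supplied."""
    return math.sqrt(len(xs)) * (P_n(f, xs) - Pf)

rng = random.Random(1)
sample = [rng.gauss(0.0, 1.0) for _ in range(10_000)]
mean_sq = P_n(lambda x: x * x, sample)        # estimates E X^2 = 1
centered = G_n(lambda x: x * x, sample, 1.0)  # approximately N(0, var X^2)
```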



2
Stochastic Convergence

This chapter provides a review of basic modes of convergence of sequences
of stochastic vectors, in particular convergence in distribution and in
probability.

2.1 Basic Theory


A random vector in ℝᵏ is a vector X = (X1, ..., Xk) of real random variables.† The
distribution function of X is the map x ↦ P(X ≤ x).
A sequence of random vectors Xn is said to converge in distribution to a random vector
X if

P(Xn ≤ x) → P(X ≤ x),

for every x at which the limit distribution function x ↦ P(X ≤ x) is continuous. Alternative
names are weak convergence and convergence in law. As the last name suggests, the
convergence only depends on the induced laws of the vectors and not on the probability
spaces on which they are defined. Weak convergence is denoted by Xn ⇝ X; if X has
distribution L, or a distribution with a standard code, such as N(0, 1), then also by Xn ⇝ L or
Xn ⇝ N(0, 1).
Let d(x, y) be a distance function on ℝᵏ that generates the usual topology. For instance,
the Euclidean distance

d(x, y) = ‖x − y‖ = ( Σᵏᵢ₌₁ (xi − yi)² )^{1/2}.

A sequence of random variables Xn is said to converge in probability to X if for all ε > 0

P(d(Xn, X) > ε) → 0.

This is denoted by Xn →P X. In this notation convergence in probability is the same as
d(Xn, X) →P 0.

† More formally it is a Borel measurable map from some probability space into ℝᵏ. Throughout it is implicitly
understood that variables X, g(X), and so forth of which we compute expectations or probabilities are
measurable maps on some probability space.




As we shall see, convergence in probability is stronger than convergence in distribution.
An even stronger mode of convergence is almost-sure convergence. The sequence Xn is
said to converge almost surely to X if d(Xn, X) → 0 with probability one:

P(lim d(Xn, X) = 0) = 1.

This is denoted by Xn →as X. Note that convergence in probability and convergence almost
surely only make sense if each of Xn and X are defined on the same probability space. For
convergence in distribution this is not necessary.

2.1 Example (Classical limit theorems). Let Ȳn be the average of the first n of a sequence
of independent, identically distributed random vectors Y1, Y2, .... If E‖Y1‖ < ∞, then
Ȳn →as EY1 by the strong law of large numbers. Under the stronger assumption that E‖Y1‖² <
∞, the central limit theorem asserts that √n(Ȳn − EY1) ⇝ N(0, Cov Y1). The central limit
theorem plays an important role in this manuscript. It is proved later in this chapter, first
for the case of real variables, and next it is extended to random vectors. The strong law
of large numbers appears to be of less interest in statistics. Usually the weak law of large
numbers, according to which Ȳn →P EY1, suffices. This is proved later in this chapter. □
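Both limit theorems in the example are easy to observe numerically. The sketch below uses Uniform(0, 1) variables, for which EY = 1/2 and var Y = 1/12; all names and the choice of distribution are illustrative assumptions.

```python
import random
import statistics

def averages(n, reps, rng):
    """reps independent sample means of n Uniform(0, 1) draws."""
    return [statistics.fmean(rng.random() for _ in range(n)) for _ in range(reps)]

rng = random.Random(2)

# Law of large numbers: the sample mean settles near E Y = 0.5 as n grows.
near = statistics.fmean(abs(m - 0.5) for m in averages(1000, 200, rng))

# Central limit theorem: sqrt(n)(Y_bar - E Y) has variance about var Y = 1/12.
scaled = [1000 ** 0.5 * (m - 0.5) for m in averages(1000, 2000, rng)]
var_hat = statistics.variance(scaled)
```

The scaled averages have an approximately normal histogram with variance near 1/12 ≈ 0.083, while the raw averages concentrate at 0.5.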

The portmanteau lemma gives a number of equivalent descriptions of weak convergence.
Most of the characterizations are only useful in proofs. The last one also has intuitive value.

2.2 Lemma (Portmanteau). For any random vectors Xn and X the following statements
are equivalent.
(i) P(Xn ≤ x) → P(X ≤ x) for all continuity points of x ↦ P(X ≤ x);
(ii) Ef(Xn) → Ef(X) for all bounded, continuous functions f;
(iii) Ef(Xn) → Ef(X) for all bounded, Lipschitz† functions f;
(iv) lim inf Ef(Xn) ≥ Ef(X) for all nonnegative, continuous functions f;
(v) lim inf P(Xn ∈ G) ≥ P(X ∈ G) for every open set G;
(vi) lim sup P(Xn ∈ F) ≤ P(X ∈ F) for every closed set F;
(vii) P(Xn ∈ B) → P(X ∈ B) for all Borel sets B with P(X ∈ δB) = 0, where
δB = B̄ − B̊ is the boundary of B.

Proof. (i) ⇒ (ii). Assume first that the distribution function of X is continuous. Then
condition (i) implies that P(Xn ∈ I) → P(X ∈ I) for every rectangle I. Choose a
sufficiently large, compact rectangle I with P(X ∉ I) < ε. A continuous function f is
uniformly continuous on the compact set I. Thus there exists a partition I = ∪j Ij into
finitely many rectangles Ij such that f varies at most ε on every Ij. Take a point xj from
each Ij and define fε = Σj f(xj) 1_{Ij}. Then |f − fε| < ε on I, whence if f takes its values
in [−1, 1],

|Ef(Xn) − Efε(Xn)| ≤ ε + P(Xn ∉ I),
|Ef(X) − Efε(X)| ≤ ε + P(X ∉ I) < 2ε.

† A function is called Lipschitz if there exists a number L such that |f(x) − f(y)| ≤ L d(x, y), for every x and
y. The least such number L is denoted ‖f‖_Lip.



For sufficiently large n, the right side of the first equation is smaller than 2ε as well. We
combine this with

|Efε(Xn) − Efε(X)| ≤ Σj |P(Xn ∈ Ij) − P(X ∈ Ij)| |f(xj)| → 0.

Together with the triangle inequality the three displays show that |Ef(Xn) − Ef(X)| is
bounded by 5ε eventually. This being true for every ε > 0 implies (ii).
Call a set B a continuity set if its boundary δ B satisfies P(X ∈ δ B) = 0. The preceding
argument is valid for a general X provided all rectangles I are chosen equal to continuity
sets. This is possible, because the collection of discontinuity sets is sparse. Given any
collection of pairwise disjoint measurable sets, at most countably many sets can have
positive probability. Otherwise the probability of their union would be infinite. Therefore,
given any collection of sets {Bα : α ∈ A} with pairwise disjoint boundaries, all except at
most countably many sets are continuity sets. In particular, for each j at most countably
many sets of the form {x : x j ≤ α} are not continuity sets. Conclude that there exist dense
subsets Q 1 , . . . , Q k of R such that each rectangle with corners in the set Q 1 × · · · × Q k is
a continuity set. We can choose all rectangles I inside this set.
(iii) ⇒ (v). For every open set G there exists a sequence of Lipschitz functions with 0 ≤ fm ↑ 1G. For instance fm(x) = (m d(x, Gᶜ)) ∧ 1. For every fixed m,

lim inf_{n→∞} P(Xn ∈ G) ≥ lim inf_{n→∞} Efm(Xn) = Efm(X).

As m → ∞ the right side increases to P(X ∈ G) by the monotone convergence theorem.


(v) ⇔ (vi). Because a set is open if and only if its complement is closed, this follows by
taking complements.

(v) + (vi) ⇒ (vii). Let B̊ and B̄ denote the interior and the closure of a set B, respectively. By (v),

P(X ∈ B̊) ≤ lim inf P(Xn ∈ B̊) ≤ lim sup P(Xn ∈ B̄) ≤ P(X ∈ B̄),

by (vi). If P(X ∈ δ B) = 0, then left and right side are equal, whence all inequalities
are equalities. The probability P(X ∈ B) and the limit lim P(X n ∈ B) are between the
expressions on left and right and hence equal to the common value.
(vii) ⇒ (i). Every cell (−∞, x] such that x is a continuity point of x → P(X ≤ x) is a
continuity set.
The equivalence (ii) ⇔ (iv) is left as an exercise. 
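The one-sided nature of (v) can be seen already in the simplest possible example, added here as an illustration (it is not part of the text): the point masses at 1/n converge weakly to the point mass at 0, yet the open set G = (0, ∞) has lim inf P(Xn ∈ G) = 1 > 0 = P(X ∈ G), while Ef(Xn) → Ef(X) still holds for bounded Lipschitz f.

```python
# Point mass at x: probabilities and expectations can be evaluated directly.
def prob_in_G(x):
    """P(point mass at x lies in G) for the open set G = (0, infinity)."""
    return 1.0 if x > 0 else 0.0

def expect(x, f):
    """E f(X) for the point mass at x."""
    return f(x)

f = lambda x: min(abs(x), 1.0)  # bounded, Lipschitz, f(0) = 0

p_limit = prob_in_G(0.0)                            # P(X in G) = 0
p_seq = [prob_in_G(1.0 / n) for n in range(1, 6)]   # P(X_n in G) = 1 for every n
ef_seq = [expect(1.0 / n, f) for n in (10, 100, 1000)]

assert p_seq == [1.0] * 5   # lim inf P(X_n in G) = 1 > 0 = P(X in G): (v) is strict
assert ef_seq[-1] < 0.002   # yet E f(X_n) -> E f(X) = 0, in accordance with (iii)
```

The strictness disappears for sets whose boundary carries no mass of the limit, which is exactly the condition in (vii): here δG = {0} and P(X ∈ δG) = 1.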

The continuous-mapping theorem is a simple result, but it is extremely useful. If the


sequence of random vectors X n converges to X and g is continuous, then g(X n ) converges
to g(X ). This is true for each of the three modes of stochastic convergence.

2.3 Theorem (Continuous mapping). Let g : ℝᵏ → ℝᵐ be continuous at every point of a set C such that P(X ∈ C) = 1.
(i) If Xn ⇝ X, then g(Xn) ⇝ g(X);
(ii) If Xn →P X, then g(Xn) →P g(X);
(iii) If Xn →a.s. X, then g(Xn) →a.s. g(X).


Proof. (i). The event {g(Xn) ∈ F} is identical to the event {Xn ∈ g⁻¹(F)}. For every closed set F,

g⁻¹(F) ⊂ cl g⁻¹(F) ⊂ g⁻¹(F) ∪ Cᶜ.

To see the second inclusion, take x in the closure of g⁻¹(F). Thus, there exists a sequence xm with xm → x and g(xm) ∈ F for every m. If x ∈ C, then g(xm) → g(x), which is in F because F is closed; otherwise x ∈ Cᶜ. By the portmanteau lemma,

lim sup P(g(Xn) ∈ F) ≤ lim sup P(Xn ∈ cl g⁻¹(F)) ≤ P(X ∈ cl g⁻¹(F)).

Because P(X ∈ Cᶜ) = 0, the probability on the right is P(X ∈ g⁻¹(F)) = P(g(X) ∈ F). Apply the portmanteau lemma again, in the opposite direction, to conclude that g(Xn) ⇝ g(X).
(ii). Fix arbitrary ε > 0. For each δ > 0 let Bδ be the set of x for which there exists y with d(x, y) < δ, but d(g(x), g(y)) > ε. If X ∉ Bδ and d(g(Xn), g(X)) > ε, then d(Xn, X) ≥ δ. Consequently,

P(d(g(Xn), g(X)) > ε) ≤ P(X ∈ Bδ) + P(d(Xn, X) ≥ δ).

The second term on the right converges to zero as n → ∞ for every fixed δ > 0. Because Bδ ∩ C ↓ ∅ by continuity of g, the first term converges to zero as δ ↓ 0.
Assertion (iii) is trivial. ■
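A numerical companion to part (ii), added as an illustration (the noise scale 1/n and the map g(x) = x² are arbitrary choices, not from the text):

```python
import random

random.seed(0)

# X_n = X + U/n with U uniform on (-1, 1), so d(X_n, X) -> 0 in probability;
# by the continuous-mapping theorem d(g(X_n), g(X)) -> 0 as well.
g = lambda x: x * x

def max_discrepancy(n, draws=2000):
    """Largest observed |g(X_n) - g(X)| over repeated simulations."""
    worst = 0.0
    for _ in range(draws):
        x = random.gauss(0.0, 1.0)
        xn = x + random.uniform(-1.0, 1.0) / n
        worst = max(worst, abs(g(xn) - g(x)))
    return worst

d10, d1000 = max_discrepancy(10), max_discrepancy(1000)
assert d1000 < d10   # the discrepancy shrinks as n grows
assert d1000 < 0.05
```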

Any random vector X is tight: For every ε > 0 there exists a constant M such that P(‖X‖ > M) < ε. A set of random vectors {Xα : α ∈ A} is called uniformly tight if M can be chosen the same for every Xα: For every ε > 0 there exists a constant M such that

sup_α P(‖Xα‖ > M) < ε.

Thus, there exists a compact set to which all Xα give probability "almost" one. Another name for uniformly tight is bounded in probability. It is not hard to see that every weakly converging sequence Xn is uniformly tight. More surprisingly, the converse of this statement is almost true: According to Prohorov's theorem, every uniformly tight sequence contains a weakly converging subsequence. Prohorov's theorem generalizes the Heine-Borel theorem from deterministic sequences Xn to random vectors.

2.4 Theorem (Prohorov's theorem). Let Xn be random vectors in ℝᵏ.
(i) If Xn ⇝ X for some X, then {Xn : n ∈ ℕ} is uniformly tight;
(ii) If Xn is uniformly tight, then there exists a subsequence with Xnj ⇝ X as j → ∞, for some X.

Proof. (i). Fix a number M such that P(‖X‖ ≥ M) < ε. By the portmanteau lemma P(‖Xn‖ ≥ M) exceeds P(‖X‖ ≥ M) arbitrarily little for sufficiently large n. Thus there exists N such that P(‖Xn‖ ≥ M) < 2ε, for all n ≥ N. Because each of the finitely many variables Xn with n < N is tight, the value of M can be increased, if necessary, to ensure that P(‖Xn‖ ≥ M) < 2ε for every n.


(ii). By Helly's lemma (described subsequently), there exists a subsequence Fnj of the sequence of cumulative distribution functions Fn(x) = P(Xn ≤ x) that converges weakly to a possibly "defective" distribution function F. It suffices to show that F is a proper distribution function: F(x) → 0 if xi → −∞ for some i, and F(x) → 1 if x → ∞. By the uniform tightness, there exists M such that Fn(M) > 1 − ε for all n. By making M larger, if necessary, it can be ensured that M is a continuity point of F. Then F(M) = lim Fnj(M) ≥ 1 − ε. Conclude that F(x) → 1 as x → ∞. That the limits at −∞ are zero can be seen in a similar manner. ■

The crux of the proof of Prohorov's theorem is Helly's lemma. This asserts that any
given sequence of distribution functions contains a subsequence that converges weakly to
a possibly defective distribution function. A defective distribution function is a function
that has all the properties of a cumulative distribution function with the exception that it has
limits less than 1 at ∞ and/or greater than 0 at −∞.

2.5 Lemma (Helly's lemma). Each given sequence Fn of cumulative distribution functions on ℝᵏ possesses a subsequence Fnj with the property that Fnj(x) → F(x) at each continuity point x of a possibly defective distribution function F.

Proof. Let ℚᵏ = {q1, q2, ...} be the vectors with rational coordinates, ordered in an arbitrary manner. Because the sequence Fn(q1) is contained in the interval [0, 1], it has a converging subsequence. Call the indexing subsequence {n¹_j} and the limit G(q1). Next, extract a further subsequence {n²_j} ⊂ {n¹_j} along which Fn(q2) converges to a limit G(q2), a further subsequence {n³_j} ⊂ {n²_j} along which Fn(q3) converges to a limit G(q3), and so forth. The "tail" of the diagonal sequence nj := nʲ_j belongs to every sequence {nⁱ_j}. Hence Fnj(qi) → G(qi) for every i = 1, 2, .... Because each Fn is nondecreasing, G(q) ≤ G(q′) if q ≤ q′. Define

F(x) = inf_{q > x} G(q).

Then F is nondecreasing. It is also right-continuous at every point x, because for every ε > 0 there exists q > x with G(q) − F(x) < ε, which implies F(y) − F(x) < ε for every x ≤ y ≤ q. Continuity of F at x implies, for every ε > 0, the existence of q < x < q′ such that G(q′) − G(q) < ε. By monotonicity, we have G(q) ≤ F(x) ≤ G(q′), and

G(q) ← Fnj(q) ≤ Fnj(x) ≤ Fnj(q′) → G(q′).

Conclude that |lim inf Fnj(x) − F(x)| < ε. Because this is true for every ε > 0 and the same result can be obtained for the lim sup, it follows that Fnj(x) → F(x) at every continuity point of F.
In the higher-dimensional case, it must still be shown that the expressions defining masses of cells are nonnegative. For instance, for k = 2, F is a (defective) distribution function only if F(b) + F(a) − F(a1, b2) − F(a2, b1) ≥ 0 for every a ≤ b. In the case that the four corners a, b, (a1, b2), and (a2, b1) of the cell are continuity points, this is immediate from the convergence of Fnj to F and the fact that each Fn is a distribution function. Next, for general cells the property follows by right continuity. ■


2.6 Example (Markov's inequality). A sequence Xn of random variables with E|Xn|ᵖ = O(1) for some p > 0 is uniformly tight. This follows because by Markov's inequality

P(|Xn| > M) ≤ E|Xn|ᵖ / Mᵖ.

The right side can be made arbitrarily small, uniformly in n, by choosing sufficiently large M.
Because EXn² = var Xn + (EXn)², an alternative sufficient condition for uniform tightness is EXn = O(1) and var Xn = O(1). This cannot be reversed. □
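A Monte Carlo sketch of the inequality used in this example, added as an illustration (the Exp(1) distribution and the constants p = 2, M = 3 are arbitrary choices):

```python
import random

random.seed(1)

# Monte Carlo check of Markov's inequality P(|X| > M) <= E|X|^p / M^p,
# the bound behind Example 2.6 (uniform tightness from bounded p-th moments).
p, M = 2.0, 3.0
draws = [random.expovariate(1.0) for _ in range(100_000)]

moment = sum(abs(x) ** p for x in draws) / len(draws)    # approximates E|X|^p = 2
tail = sum(1 for x in draws if abs(x) > M) / len(draws)  # approximates P(|X| > 3)

assert tail <= moment / M ** p + 1e-3   # Markov bound, up to Monte Carlo error
```

Here the Markov bound 2/9 is quite loose (the true tail is e⁻³ ≈ 0.05), but a bound uniform in n is all that tightness requires.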

Consider some of the relationships among the three modes of convergence. Convergence in distribution is weaker than convergence in probability, which is in turn weaker than almost-sure convergence, except if the limit is constant.

2.7 Theorem. Let Xn, X, and Yn be random vectors. Then
(i) Xn →a.s. X implies Xn →P X;
(ii) Xn →P X implies Xn ⇝ X;
(iii) Xn →P c for a constant c if and only if Xn ⇝ c;
(iv) if Xn ⇝ X and d(Xn, Yn) →P 0, then Yn ⇝ X;
(v) if Xn ⇝ X and Yn →P c for a constant c, then (Xn, Yn) ⇝ (X, c);
(vi) if Xn →P X and Yn →P Y, then (Xn, Yn) →P (X, Y).

Proof. (i). The sequence of sets An = ∪_{m≥n} {d(Xm, X) > ε} is decreasing for every ε > 0 and decreases to the empty set if Xn(ω) → X(ω) for every ω. If Xn →a.s. X, then P(d(Xn, X) > ε) ≤ P(An) → 0.
(iv). For every f with range [0, 1] and Lipschitz norm at most 1 and every ε > 0,

|Ef(Xn) − Ef(Yn)| ≤ ε + P(d(Xn, Yn) > ε).

The second term on the right converges to zero as n → ∞. The first term can be made arbitrarily small by choice of ε. Conclude that the sequences Ef(Xn) and Ef(Yn) have the same limit. The result follows from the portmanteau lemma.
(ii). Because d(Xn, X) →P 0 and trivially X ⇝ X, it follows that Xn ⇝ X by (iv).
(iii). The "only if" part is a special case of (ii). For the converse let ball(c, ε) be the open ball of radius ε around c. Then P(d(Xn, c) ≥ ε) = P(Xn ∈ ball(c, ε)ᶜ). If Xn ⇝ c, then the lim sup of the last probability is bounded by P(c ∈ ball(c, ε)ᶜ) = 0, by the portmanteau lemma.
(v). First note that d((Xn, Yn), (Xn, c)) = d(Yn, c) →P 0. Thus, according to (iv), it suffices to show that (Xn, c) ⇝ (X, c). For every continuous, bounded function (x, y) ↦ f(x, y), the function x ↦ f(x, c) is continuous and bounded. Thus Ef(Xn, c) → Ef(X, c) if Xn ⇝ X.
(vi). This follows from d((x1, y1), (x2, y2)) ≤ d(x1, x2) + d(y1, y2). ■

According to the last assertion of the lemma, convergence in probability of a sequence of vectors Xn = (Xn,1, ..., Xn,k) is equivalent to convergence of every one of the sequences of components Xn,i separately. The analogous statement for convergence in distribution


is false: Convergence in distribution of the sequence Xn is stronger than convergence of every one of the sequences of components Xn,i. The point is that the distribution of the components Xn,i separately does not determine their joint distribution: They might be independent or dependent in many ways. We speak of joint convergence in distribution versus marginal convergence.
Assertion (v) of the lemma has some useful consequences. If Xn ⇝ X and Yn →P c, then (Xn, Yn) ⇝ (X, c). Consequently, by the continuous-mapping theorem, g(Xn, Yn) ⇝ g(X, c) for every map g that is continuous at every point in the set ℝᵏ × {c} in which the vector (X, c) takes its values. Thus, the conclusion holds for every g such that

lim_{x→x0, y→c} g(x, y) = g(x0, c), for every x0.

Some particular applications of this principle are known as Slutsky's lemma.

2.8 Lemma (Slutsky). Let Xn, X, and Yn be random vectors or variables. If Xn ⇝ X and Yn ⇝ c for a constant c, then
(i) Xn + Yn ⇝ X + c;
(ii) YnXn ⇝ cX;
(iii) Yn⁻¹Xn ⇝ c⁻¹X provided c ≠ 0.

In (i) the "constant" c must be a vector of the same dimension as X, and in (ii) it is probably initially understood to be a scalar. However, (ii) is also true if every Yn and c are matrices (which can be identified with vectors, for instance by aligning rows, to give a meaning to the convergence Yn ⇝ c), simply because matrix multiplication (x, y) ↦ yx is a continuous operation. Even (iii) is valid for matrices Yn and c and vectors Xn provided c ≠ 0 is understood as c being invertible, because taking an inverse is also continuous.
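An added numerical illustration of part (ii) (the particular sequences Xn and Yn below are arbitrary choices): if Xn ⇝ N(0, 1) and Yn →P 2, then YnXn ⇝ N(0, 4).

```python
import random

random.seed(2)

# X_n: a standardized sum of random signs (approximately N(0,1) by the CLT);
# Y_n = 2 + O_P(1/n) converges in probability to the constant 2.
def slutsky_draw(n):
    s = sum(random.choice((-1.0, 1.0)) for _ in range(n))
    xn = s / n ** 0.5
    yn = 2.0 + random.uniform(-1.0, 1.0) / n
    return yn * xn

sample = [slutsky_draw(400) for _ in range(4000)]
mean = sum(sample) / len(sample)
var = sum((v - mean) ** 2 for v in sample) / len(sample)

assert abs(mean) < 0.15      # the limit N(0, 4) has mean 0
assert abs(var - 4.0) < 0.4  # and variance c^2 * var X = 4
```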

2.9 Example (t-statistic). Let Y1, Y2, ... be independent, identically distributed random variables with EY1 = 0 and EY1² < ∞. Then the t-statistic √n Ȳn/Sn, where Sn² = (n − 1)⁻¹ ∑ᵢ₌₁ⁿ (Yi − Ȳn)² is the sample variance, is asymptotically standard normal.
To see this, first note that by two applications of the weak law of large numbers and the continuous-mapping theorem for convergence in probability

Sn² = (n/(n − 1)) (n⁻¹ ∑ᵢ₌₁ⁿ Yi² − Ȳn²) →P 1 · (EY1² − (EY1)²) = var Y1.

Again by the continuous-mapping theorem, Sn converges in probability to sd Y1. By the central limit theorem √n Ȳn converges in law to the N(0, var Y1) distribution. Finally, Slutsky's lemma gives that the sequence of t-statistics converges in distribution to N(0, var Y1)/sd Y1 = N(0, 1). □
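A simulation sketch of this example, added as an illustration (the centered exponential distribution and the sample size n = 200 are arbitrary choices):

```python
import random

random.seed(3)

# t-statistic sqrt(n) * mean / S_n for centered exponential data
# (EY = 0, var Y = 1); by Example 2.9 it is approximately N(0, 1).
def t_statistic(n):
    y = [random.expovariate(1.0) - 1.0 for _ in range(n)]
    m = sum(y) / n
    s2 = sum((v - m) ** 2 for v in y) / (n - 1)
    return n ** 0.5 * m / s2 ** 0.5

stats = [t_statistic(200) for _ in range(3000)]
mean = sum(stats) / len(stats)
var = sum((v - mean) ** 2 for v in stats) / len(stats)

assert abs(mean) < 0.2       # approximately the N(0, 1) mean
assert abs(var - 1.0) < 0.2  # approximately the N(0, 1) variance
```

Note that normality of the data is not assumed anywhere; only EY1² < ∞ is used.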

2.10 Example (Confidence intervals). Let Tn and Sn be sequences of estimators satisfying

√n(Tn − θ) ⇝ N(0, σ²),    Sn² →P σ²,

for certain parameters θ and σ² depending on the underlying distribution, for every distribution in the model. Then Tn ± Sn/√n zα is a confidence interval for θ of asymptotic level 1 − 2α. More precisely, we have that the probability that θ is contained in [Tn − Sn/√n zα, Tn + Sn/√n zα] converges to 1 − 2α.
This is a consequence of the fact that the sequence √n(Tn − θ)/Sn is asymptotically standard normally distributed. □
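A coverage simulation for this example, added as an illustration (the normal data, the true mean θ = 5, the scale 2, and n = 100 are arbitrary choices; z₀.₀₂₅ ≈ 1.96 gives asymptotic level 0.95):

```python
import random

random.seed(4)

# Coverage of the interval T_n +/- z_alpha * S_n / sqrt(n) from Example 2.10,
# with T_n the sample mean and S_n the sample standard deviation.
z, theta = 1.96, 5.0

def covers(n):
    y = [theta + random.gauss(0.0, 2.0) for _ in range(n)]
    t = sum(y) / n
    s = (sum((v - t) ** 2 for v in y) / (n - 1)) ** 0.5
    half = z * s / n ** 0.5
    return t - half <= theta <= t + half

coverage = sum(covers(100) for _ in range(2000)) / 2000
assert 0.92 < coverage < 0.98   # close to the asymptotic level 1 - 2*alpha = 0.95
```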

If the limit variable X has a continuous distribution function, then weak convergence Xn ⇝ X implies P(Xn ≤ x) → P(X ≤ x) for every x. The convergence is then even uniform in x.

2.11 Lemma. Suppose that Xn ⇝ X for a random vector X with a continuous distribution function. Then sup_x |P(Xn ≤ x) − P(X ≤ x)| → 0.
Proof. Let Fn and F be the distribution functions of Xn and X. First consider the one-dimensional case. Fix k ∈ ℕ. By the continuity of F there exist points −∞ = x0 < x1 < ··· < xk = ∞ with F(xi) = i/k. By monotonicity, we have, for xi−1 ≤ x ≤ xi,

Fn(x) − F(x) ≤ Fn(xi) − F(xi−1) = Fn(xi) − F(xi) + 1/k,
Fn(x) − F(x) ≥ Fn(xi−1) − F(xi) = Fn(xi−1) − F(xi−1) − 1/k.

Thus |Fn(x) − F(x)| is bounded above by sup_i |Fn(xi) − F(xi)| + 1/k, for every x. The latter, finite supremum converges to zero as n → ∞, for each fixed k. Because k is arbitrary, the result follows.
In the higher-dimensional case, we follow a similar argument but use hyperrectangles, rather than intervals. We can construct the rectangles by intersecting the k partitions obtained by subdividing each coordinate separately as before. ■

2.2 Stochastic o and O Symbols


It is convenient to have short expressions for terms that converge in probability to zero or are uniformly tight. The notation o_P(1) ("small oh-P-one") is short for a sequence of random vectors that converges to zero in probability. The expression O_P(1) ("big oh-P-one") denotes a sequence that is bounded in probability. More generally, for a given sequence of random variables Rn,

Xn = o_P(Rn)  means  Xn = YnRn and Yn →P 0;
Xn = O_P(Rn)  means  Xn = YnRn and Yn = O_P(1).

This expresses that the sequence Xn converges in probability to zero or is bounded in probability at the "rate" Rn. For deterministic sequences Xn and Rn, the stochastic "oh" symbols reduce to the usual o and O from calculus.
There are many rules of calculus with o and O symbols, which we apply without comment. For instance,

o_P(1) + o_P(1) = o_P(1)
o_P(1) + O_P(1) = O_P(1)
O_P(1) o_P(1) = o_P(1)
(1 + o_P(1))⁻¹ = O_P(1)
o_P(Rn) = Rn o_P(1)
O_P(Rn) = Rn O_P(1)
o_P(O_P(1)) = o_P(1).
To see the validity of these rules it suffices to restate them in terms of explicitly named vectors, where each o_P(1) and O_P(1) should be replaced by a different sequence of vectors that converges to zero or is bounded in probability. In this way the first rule says: If Xn →P 0 and Yn →P 0, then Zn = Xn + Yn →P 0. This is an example of the continuous-mapping theorem. The third rule is short for the following: If Xn is bounded in probability and Yn →P 0, then XnYn →P 0. If Xn would also converge in distribution, then this would be statement (ii) of Slutsky's lemma (with c = 0). But by Prohorov's theorem, Xn converges in distribution "along subsequences" if it is bounded in probability, so that the third rule can still be deduced from Slutsky's lemma by "arguing along subsequences."
Note that both rules are in fact implications and should be read from left to right, even though they are stated with the help of the equality sign. Similarly, although it is true that O_P(1) + O_P(1) = 2O_P(1), writing down this rule does not reflect understanding of the O_P symbol.
Two more complicated rules are given by the following lemma.

2.12 Lemma. Let R be a function defined on a domain in ℝᵏ such that R(0) = 0. Let Xn be a sequence of random vectors with values in the domain of R that converges in probability to zero. Then, for every p > 0,
(i) if R(h) = o(‖h‖ᵖ) as h → 0, then R(Xn) = o_P(‖Xn‖ᵖ);
(ii) if R(h) = O(‖h‖ᵖ) as h → 0, then R(Xn) = O_P(‖Xn‖ᵖ).

Proof. Define g(h) as g(h) = R(h)/‖h‖ᵖ for h ≠ 0 and g(0) = 0. Then R(Xn) = g(Xn)‖Xn‖ᵖ.
(i) Because the function g is continuous at zero by assumption, g(Xn) →P g(0) = 0 by the continuous-mapping theorem.
(ii) By assumption there exist M and δ > 0 such that |g(h)| ≤ M whenever ‖h‖ ≤ δ. Thus P(|g(Xn)| > M) ≤ P(‖Xn‖ > δ) → 0, and the sequence g(Xn) is tight. ■
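A numerical illustration of part (ii), added here (the choice R(h) = cos h − 1, which is O(h²) as h → 0, and the rate of Xn are arbitrary):

```python
import math
import random

random.seed(5)

# Lemma 2.12(ii) with R(h) = cos(h) - 1 = O(h^2) as h -> 0: if X_n ->P 0,
# then g(X_n) = R(X_n)/X_n^2 stays bounded in probability.
def ratio(n):
    xn = random.gauss(0.0, 1.0) / n   # X_n ->P 0
    return (math.cos(xn) - 1.0) / xn ** 2

ratios = [ratio(50) for _ in range(1000)]
# Near zero (cos h - 1)/h^2 is close to -1/2, and it always lies in [-1/2, 0).
assert all(-0.51 < r < 0.0 for r in ratios)
```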

*2.3 Characteristic Functions


It is sometimes possible to show convergence in distribution of a sequence of random vectors
directly from the definition. In other cases "transforms" of probability measures may help.
The basic idea is that it suffices to show characterization (ii) of the portmanteau lemma for
a small subset of functions f only.
The most important transform is the characteristic function

t ↦ E e^{i tᵀX},   t ∈ ℝᵏ.

Each of the functions x ↦ e^{i tᵀx} is continuous and bounded. Thus, by the portmanteau lemma, E e^{i tᵀXn} → E e^{i tᵀX} for every t if Xn ⇝ X. By Lévy's continuity theorem the


converse is also true: Pointwise convergence of characteristic functions is equivalent to


weak convergence.

2.13 Theorem (Lévy's continuity theorem). Let Xn and X be random vectors in ℝᵏ. Then Xn ⇝ X if and only if E e^{i tᵀXn} → E e^{i tᵀX} for every t ∈ ℝᵏ. Moreover, if E e^{i tᵀXn} converges pointwise to a function φ(t) that is continuous at zero, then φ is the characteristic function of a random vector X and Xn ⇝ X.

Proof. If Xn ⇝ X, then Eh(Xn) → Eh(X) for every bounded continuous function h, in particular for the functions h(x) = e^{i tᵀx}. This gives one direction of the first statement.
For the proof of the last statement, suppose first that we already know that the sequence Xn is uniformly tight. Then, according to Prohorov's theorem, every subsequence has a further subsequence that converges in distribution to some vector Y. By the preceding paragraph, the characteristic function of Y is the limit of the characteristic functions of the converging subsequence. By assumption, this limit is the function φ(t). Conclude that every weak limit point Y of a converging subsequence possesses characteristic function φ. Because a characteristic function uniquely determines a distribution (see Lemma 2.15), it follows that the sequence Xn has only one weak limit point. It can be checked that a uniformly tight sequence with a unique limit point converges to this limit point, and the proof is complete.
The uniform tightness of the sequence Xn can be derived from the continuity of φ at zero. Because marginal tightness implies joint tightness, it may be assumed without loss of generality that Xn is one-dimensional. For every x and δ > 0,

1{|x| > 2/δ} ≤ 2(1 − sin(δx)/(δx)) = (1/δ) ∫₋δ^δ (1 − cos tx) dt.

Replace x by Xn, take expectations, and use Fubini's theorem to obtain that

P(|Xn| > 2/δ) ≤ (1/δ) ∫₋δ^δ Re(1 − E e^{itXn}) dt.

By assumption, the integrand in the right side converges pointwise to Re(1 − φ(t)). By the dominated-convergence theorem, the whole expression converges to

(1/δ) ∫₋δ^δ Re(1 − φ(t)) dt.

Because φ is continuous at zero, there exists for every ε > 0 a δ > 0 such that |1 − φ(t)| < ε for |t| < δ. For this δ the integral is bounded by 2ε. Conclude that lim sup P(|Xn| > 2/δ) ≤ 2ε, whence the sequence Xn is uniformly tight. ■

2.14 Example (Normal distribution). The characteristic function of the Nk(μ, Σ) distribution is the function

t ↦ e^{i tᵀμ − (1/2) tᵀΣt}.

Indeed, if X is Nk(0, I) distributed and Σ^{1/2} is a symmetric square root of Σ (hence Σ = (Σ^{1/2})²), then Σ^{1/2}X + μ possesses the given normal distribution and

E e^{i tᵀ(Σ^{1/2}X + μ)} = e^{i tᵀμ} ∏ⱼ₌₁ᵏ E e^{zⱼXⱼ},   where z = iΣ^{1/2}t and E e^{zX₁} = e^{z²/2}.

For real-valued z, the last equality follows easily by completing the square in the exponent.
Evaluating the integral for complex z, such as z = it, requires some skill in complex
function theory. One method, which avoids further calculations, is to show that both the
left- and righthand sides of the preceding display are analytic functions of z. For the right
side this is obvious; for the left side we can justify differentiation under the expectation
sign by the dominated-convergence theorem. Because the two sides agree on the real axis,
they must agree on the complex plane by uniqueness of analytic continuation. □
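A Monte Carlo check of this example in one dimension, added as an illustration (the parameters μ = 1, σ = 2, t = 0.7 are arbitrary choices):

```python
import cmath
import random

random.seed(6)

# For X ~ N(mu, sigma^2), Example 2.14 gives
# E exp(itX) = exp(it*mu - t^2 * sigma^2 / 2).
mu, sigma, t = 1.0, 2.0, 0.7
draws = [random.gauss(mu, sigma) for _ in range(200_000)]

empirical = sum(cmath.exp(1j * t * x) for x in draws) / len(draws)
exact = cmath.exp(1j * t * mu - t ** 2 * sigma ** 2 / 2)

assert abs(empirical - exact) < 0.02   # Monte Carlo error only
```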

2.15 Lemma. Random vectors X and Y in ℝᵏ are equal in distribution if and only if E e^{i tᵀX} = E e^{i tᵀY} for every t ∈ ℝᵏ.

Proof. By Fubini's theorem and calculations as in the preceding example, for every a > 0 and y ∈ ℝᵏ,

∫ e^{−i tᵀy} E e^{i tᵀX} e^{−(1/2) a²‖t‖²} dt = E (2π/a²)^{k/2} e^{−‖y − X‖²/(2a²)}.

By the convolution formula for densities, the right-hand side is (2π)ᵏ times the density p_{X+aZ}(y) of the sum of X and aZ for a standard normal vector Z that is independent of X. Conclude that if X and Y have the same characteristic function, then the vectors X + aZ and Y + aZ have the same density and hence are equal in distribution for every a > 0. By Slutsky's lemma X + aZ ⇝ X as a ↓ 0, and similarly for Y. Thus X and Y are equal in distribution. ■

The characteristic function of a sum of independent variables equals the product of the characteristic functions of the individual variables. This observation, combined with Lévy's theorem, yields simple proofs of both the law of large numbers and the central limit theorem.

2.16 Proposition (Weak law of large numbers). Let Y1, ..., Yn be i.i.d. random variables with characteristic function φ. Then Ȳn →P μ for a real number μ if and only if φ is differentiable at zero with iμ = φ′(0).

Proof. We only prove that differentiability is sufficient. For the converse, see, for example, [127, p. 52]. Because φ(0) = 1, differentiability of φ at zero means that φ(t) = 1 + tφ′(0) + o(t) as t → 0. Thus, for each fixed t and n → ∞,

E e^{itȲn} = (φ(t/n))ⁿ = (1 + (t/n)φ′(0) + o(1/n))ⁿ → e^{tφ′(0)} = e^{itμ}.

The right side is the characteristic function of the constant variable μ. By Lévy's theorem, Ȳn converges in distribution to μ. Convergence in distribution to a constant is the same as convergence in probability. ■

A sufficient but not necessary condition for φ(t) = E e^{itY} to be differentiable at zero is that E|Y| < ∞. In that case the dominated convergence theorem allows differentiation


under the expectation sign, and we obtain

φ′(t) = (d/dt) E e^{itY} = E iY e^{itY}.

In particular, the derivative at zero is φ′(0) = iEY and hence Ȳn →P EY1.
If EY² < ∞, then the Taylor expansion can be carried a step further and we can obtain a version of the central limit theorem.

2.17 Proposition (Central limit theorem). Let Y1, ..., Yn be i.i.d. random variables with EYi = 0 and EYi² = 1. Then the sequence √n Ȳn converges in distribution to the standard normal distribution.

Proof. A second differentiation under the expectation sign shows that φ″(0) = i²EY². Because φ′(0) = iEY = 0, we obtain

E e^{it√nȲn} = (φ(t/√n))ⁿ = (1 − (t²/(2n))EY² + o(1/n))ⁿ → e^{−(1/2)t²EY²}.

The right side is the characteristic function of the normal distribution with mean zero and variance EY². The proposition follows from Lévy's continuity theorem. ■
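A simulation sketch of the proposition, added as an illustration (the uniform distribution scaled to variance 1 and the sample size n = 100 are arbitrary choices; the normal CDF Φ is computed via the error function):

```python
import math
import random

random.seed(7)

def phi(x):
    """Standard normal CDF via erf."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def scaled_mean(n):
    """sqrt(n) * mean of uniform(-1, 1) variables rescaled to variance 1."""
    sd = (1.0 / 3.0) ** 0.5   # sd of uniform(-1, 1)
    return sum(random.uniform(-1.0, 1.0) / sd for _ in range(n)) / n ** 0.5

sample = sorted(scaled_mean(100) for _ in range(4000))
errs = []
for x in (-1.0, 0.0, 1.0):
    ecdf = sum(1 for v in sample if v <= x) / len(sample)
    errs.append(abs(ecdf - phi(x)))

assert max(errs) < 0.03   # empirical CDF close to Phi at the chosen points
```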

The characteristic function t ↦ E e^{i tᵀX} of a vector X is determined by the set of all characteristic functions u ↦ E e^{iu(tᵀX)} of linear combinations tᵀX of the components of X. Therefore, Lévy's continuity theorem implies that weak convergence of vectors is equivalent to weak convergence of linear combinations:

Xn ⇝ X if and only if tᵀXn ⇝ tᵀX for all t ∈ ℝᵏ.

This is known as the Cramér-Wold device. It allows us to reduce higher-dimensional problems to the one-dimensional case.

2.18 Example (Multivariate central limit theorem). Let Y1, Y2, ... be i.i.d. random vectors in ℝᵏ with mean vector μ = EY1 and covariance matrix Σ = E(Y1 − μ)(Y1 − μ)ᵀ. Then

n^{−1/2} ∑ⱼ₌₁ⁿ (Yj − μ) = √n(Ȳn − μ) ⇝ Nk(0, Σ).

(The sum is taken coordinatewise.) By the Cramér-Wold device, this can be proved by finding the limit distribution of the sequences of real variables

tᵀ(n^{−1/2} ∑ⱼ₌₁ⁿ (Yj − μ)) = n^{−1/2} ∑ⱼ₌₁ⁿ (tᵀYj − tᵀμ).

Because the random variables tᵀY1 − tᵀμ, tᵀY2 − tᵀμ, ... are i.i.d. with zero mean and variance tᵀΣt, this sequence is asymptotically N1(0, tᵀΣt)-distributed by the univariate central limit theorem. This is exactly the distribution of tᵀX if X possesses an Nk(0, Σ) distribution. □
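A numerical sketch of the Cramér-Wold reduction, added as an illustration (the bivariate distribution Y = (A, A + B) with A, B independent standard normal and the vector t = (1, 2) are arbitrary choices; here μ = (0, 0) and Σ = [[1, 1], [1, 2]]):

```python
import random

random.seed(8)

# Variance of t^T sqrt(n)(Ybar_n - mu) should approach t^T Sigma t.
t = (1.0, 2.0)
target = 13.0   # t^T Sigma t = 1*1 + 2*(1*2*1) + 4*2 = 13

def linear_combination(n):
    sx = sy = 0.0
    for _ in range(n):
        a, b = random.gauss(0.0, 1.0), random.gauss(0.0, 1.0)
        sx += a
        sy += a + b
    return (t[0] * sx + t[1] * sy) / n ** 0.5

sample = [linear_combination(50) for _ in range(4000)]
mean = sum(sample) / len(sample)
var = sum((v - mean) ** 2 for v in sample) / len(sample)

assert abs(mean) < 0.3
assert abs(var - target) < 1.0   # close to t^T Sigma t
```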


∗ 2.4 Almost-Sure Representations


Convergence in distribution certainly does not imply convergence in probability or almost surely. However, the following theorem shows that a given sequence Xn ⇝ X can always be replaced by a sequence X̃n ⇝ X̃ that is, marginally, equal in distribution and converges almost surely. This construction is sometimes useful and has been put to good use by some authors, but we do not use it in this book.

2.19 Theorem (Almost-sure representations). Suppose that the sequence of random vectors Xn converges in distribution to a random vector X0. Then there exists a probability space (Ω̃, Ũ, P̃) and random vectors X̃n defined on it such that X̃n is equal in distribution to Xn for every n ≥ 0 and X̃n → X̃0 almost surely.

Proof. For random variables we can simply define X̃n = Fn⁻¹(U) for Fn the distribution function of Xn and U an arbitrary random variable with the uniform distribution on [0, 1]. (The "quantile transformation"; see Section 21.1.) The simplest known construction for higher-dimensional vectors is more complicated. See, for example, Theorem 1.10.4 in [146], or [41]. ■

∗ 2.5 Convergence of Moments


By the portmanteau lemma, weak convergence Xn ⇝ X implies that Ef(Xn) → Ef(X) for every continuous, bounded function f. The condition that f be bounded is not superfluous: It is not difficult to find examples of a sequence Xn ⇝ X and an unbounded, continuous function f for which the convergence fails. In particular, in general convergence in distribution does not imply convergence EXnᵖ → EXᵖ of moments. However, in many situations such convergence occurs, but it requires more effort to prove it.
A sequence of random variables Yn is called asymptotically uniformly integrable if

lim_{M→∞} lim sup_{n→∞} E|Yn| 1{|Yn| > M} = 0.

Uniform integrability is the missing link between convergence in distribution and


convergence of moments.

2.20 Theorem. Let f : ℝᵏ → ℝ be measurable and continuous at every point in a set C. Let Xn ⇝ X where X takes its values in C. Then Ef(Xn) → Ef(X) if the sequence of random variables f(Xn) is asymptotically uniformly integrable.

Proof. We give the proof only in the most interesting direction. (See, for example, [146] (p. 69) for the other direction.) Suppose that Yn = f(Xn) is asymptotically uniformly integrable. Then we show that EYn → EY for Y = f(X). Assume without loss of generality that Yn is nonnegative; otherwise argue the positive and negative parts separately. By the continuous-mapping theorem, Yn ⇝ Y. By the triangle inequality,

|EYn − EY| ≤ |EYn − EYn ∧ M| + |EYn ∧ M − EY ∧ M| + |EY ∧ M − EY|.

Because the function y ↦ y ∧ M is continuous and bounded on [0, ∞), it follows that the middle term on the right converges to zero as n → ∞. The first term is bounded above by


EYn 1{Yn > M}, and converges to zero as n → ∞ followed by M → ∞, by the uniform integrability. By the portmanteau lemma (iv), the third term is bounded by the lim inf as n → ∞ of the first and hence converges to zero as M → ∞. ■

2.21 Example. Suppose Xn is a sequence of random variables such that Xn ⇝ X and lim sup E|Xn|ᵖ < ∞ for some p. Then all moments of order strictly less than p converge also: EXnᵏ → EXᵏ for every k < p.
By the preceding theorem, it suffices to prove that the sequence Xnᵏ is asymptotically uniformly integrable. By Markov's inequality

E|Xn|ᵏ 1{|Xn|ᵏ > M} ≤ M^{(k−p)/k} E|Xn|ᵖ.

The limit superior, as n → ∞ followed by M → ∞, of the right side is zero if k < p. □

The moment function p ↦ EXᵖ can be considered a transform of probability distributions, just as can the characteristic function. In general, it is not a true transform in that it determines a distribution uniquely only under additional assumptions. If a limit distribution is uniquely determined by its moments, this transform can still be used to establish weak convergence.

2.22 Theorem. Let Xn and X be random variables such that EXnᵖ → EXᵖ < ∞ for every p ∈ ℕ. If the distribution of X is uniquely determined by its moments, then Xn ⇝ X.

Proof. Because EXn² = O(1), the sequence Xn is uniformly tight, by Markov's inequality. By Prohorov's theorem, each subsequence has a further subsequence that converges weakly to a limit Y. By the preceding example the moments of Y are the limits of the moments of the subsequence. Thus the moments of Y are identical to the moments of X. Because, by assumption, there is only one distribution with this set of moments, X and Y are equal in distribution. Conclude that every subsequence of Xn has a further subsequence that converges in distribution to X. This implies that the whole sequence converges to X. ■

2.23 Example. The normal distribution is uniquely determined by its moments. (See, for example, [123] or [133, p. 293].) Thus EXnᵖ → 0 for odd p and EXnᵖ → (p − 1)(p − 3) ··· 1 for even p implies that Xn ⇝ N(0, 1). The converse is false. □

*2.6 Convergence-Determining Classes


A class F of functions f : ℝᵏ → ℝ is called convergence-determining if for every sequence of random vectors Xn the convergence Xn ⇝ X is equivalent to Ef(Xn) → Ef(X) for every f ∈ F. By definition the set of all bounded continuous functions is convergence-determining, but so is the smaller set of all differentiable functions, and many other classes. The set of all indicator functions 1{(−∞, t]} would be convergence-determining if we would restrict the definition to limits X with continuous distribution functions. We shall have occasion to use the following results. (For proofs see, for example, Corollary 1.4.5 and Theorem 1.12.2 in [146].)


2.24 Lemma. On ℝᵏ = ℝˡ × ℝᵐ the set of functions (x, y) ↦ f(x)g(y) with f and g ranging over all bounded, continuous functions on ℝˡ and ℝᵐ, respectively, is convergence-determining.

2.25 Lemma. There exists a countable set F of continuous functions f : ℝᵏ → [0, 1] that is convergence-determining and, moreover, Xn ⇝ X implies that Ef(Xn) → Ef(X) uniformly in f ∈ F.

*2.7 Law of the Iterated Logarithm


The law of the iterated logarithm is an intriguing result but appears to be of less interest
to statisticians. It can be viewed as a refinement of the strong law of large numbers.
If Y1, Y2, … are i.i.d. random variables with mean zero, then Y1 + ⋯ + Yn = o(n)
almost surely by the strong law. The law of the iterated logarithm improves this order to
O(√(n log log n)), and even gives the proportionality constant.
2.26 Proposition (Law of the iterated logarithm). Let Y1, Y2, … be i.i.d. random vari-
ables with mean zero and variance 1. Then

    lim sup_{n→∞} (Y1 + ⋯ + Yn) / √(n log log n) = √2,  a.s.

Conversely, if this statement holds for both Yj and −Yj, then the variables have mean zero
and variance 1.

The law of the iterated logarithm gives an interesting illustration of the difference between
almost sure and distributional statements. Under the conditions of the proposition, the
sequence n^{−1/2}(Y1 + ⋯ + Yn) is asymptotically normally distributed by the central limit
theorem. The limiting normal distribution is spread out over the whole real line. Apparently
division by the factor √(log log n) is exactly right to keep n^{−1/2}(Y1 + ⋯ + Yn) within a
compact interval, eventually.
A simple application of Slutsky's lemma gives

    Zn := (Y1 + ⋯ + Yn) / √(n log log n) →P 0.

Thus Zn is with high probability contained in the interval (−ε, ε) eventually, for any ε > 0.
This appears to contradict the law of the iterated logarithm, which asserts that Zn reaches
the interval (√2 − ε, √2 + ε) infinitely often with probability one. The explanation is
that the set of ω such that Zn(ω) is in (−ε, ε) or (√2 − ε, √2 + ε) fluctuates with n. The
convergence in probability shows that at any advanced time a very large fraction of ω have
Zn(ω) ∈ (−ε, ε). The law of the iterated logarithm shows that for each particular ω the
sequence Zn(ω) drops in and out of the interval (√2 − ε, √2 + ε) infinitely often (and
hence out of (−ε, ε)).
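A single simulated sample path can illustrate the distributional half of this picture, although no finite simulation can exhibit the lim sup itself. The sketch below (Python, standard library; Rademacher increments, a fixed but arbitrary seed) computes Zn along one path and checks that at a fixed large time it is typically small:

```python
import math
import random

random.seed(0)

def lil_path(n_max):
    """Final value of Z_n = (Y_1 + ... + Y_n) / sqrt(n log log n) along one
    path of i.i.d. Rademacher variables (mean 0, variance 1)."""
    s, z = 0.0, 0.0
    for n in range(1, n_max + 1):
        s += random.choice((-1.0, 1.0))
        if n >= 3:  # log log n is positive only for n > e
            z = s / math.sqrt(n * math.log(math.log(n)))
    return z

z_final = lil_path(200_000)
print(z_final)
# Z_n -> 0 in probability, so at a fixed large time the path is usually small,
# even though it must eventually revisit a neighborhood of sqrt(2).
assert abs(z_final) < 3
```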
The implications for statistics can be illustrated by considering confidence statements.
If μ and 1 are the true mean and variance of the sample Y1, …, Yn, then the probability that

    Ȳn − 2/√n < μ < Ȳn + 2/√n


converges to Φ(2) − Φ(−2) ≈ 95%. Thus the given interval is an asymptotic confidence
interval of level approximately 95%. (The confidence level is exactly Φ(2) − Φ(−2) if the
observations are normally distributed. This may be assumed in the following; the accuracy
of the approximation is not an issue in this discussion.) The point μ = 0 is contained in
the interval if and only if the variable Zn satisfies

    |Zn| ≤ 2/√(log log n).
Assume that μ = 0 is the true value of the mean, and consider the following argument. By
the law of the iterated logarithm, we can be sure that Zn hits the interval (√2 − ε, √2 + ε)
infinitely often. The expression 2/√(log log n) is close to zero for large n. Thus we can be
sure that the true value μ = 0 is outside the confidence interval infinitely often.
How can we solve the paradox that the usual confidence interval is wrong infinitely often?
There appears to be a conceptual problem if it is imagined that a statistician collects data in
a sequential manner, computing a confidence interval for every n. However, although the
frequentist interpretation of a confidence interval is open to the usual criticism, the paradox
does not seem to arise within the frequentist framework. In fact, from a frequentist point
of view the curious conclusion is reasonable. Imagine 100 statisticians, all of whom set
95% confidence intervals in the usual manner. They all receive one observation per day
and update their confidence intervals daily. Then every day about five of them should have
a false interval. It is only fair that as the days go by all of them take turns in being unlucky,
and that the same five do not have it wrong all the time. This, indeed, happens according
to the law of the iterated logarithm.
The paradox may be partly caused by the feeling that with a growing number of observa-
tions, the confidence intervals should become better. In contrast, the usual approach leads
to errors with certainty. However, this is only true if the usual approach is applied naively
in a sequential set-up. In practice one would do a genuine sequential analysis (including
the use of a stopping rule) or change the confidence level with n.
There is also another reason that the law of the iterated logarithm is of little practical
consequence. The argument in the preceding paragraphs is based on the assumption that
2/√(log log n) is close to zero and is nonsensical if this quantity is larger than √2. Thus the
argument requires at least n ≥ 1619, a respectable number of observations.
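The threshold 1619 follows from solving 2/√(log log n) ≤ √2, that is, log log n ≥ 2, or n ≥ e^(e²). A few lines of Python (standard library only) confirm the arithmetic:

```python
import math

# 2 / sqrt(log log n) <= sqrt(2)  <=>  log log n >= 2  <=>  n >= e^(e^2)
threshold = math.exp(math.exp(2))
print(threshold)  # just above 1618, so the smallest integer n is 1619
n_min = math.ceil(threshold)
assert n_min == 1619
assert 2 / math.sqrt(math.log(math.log(1619))) <= math.sqrt(2)
assert 2 / math.sqrt(math.log(math.log(1618))) > math.sqrt(2)
```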

*2.8 Lindeberg-Feller Theorem


Central limit theorems are theorems concerning convergence in distribution of sums of
random variables. There are versions for dependent observations and nonnormal limit
distributions. The Lindeberg-Feller theorem is the simplest extension of the classical central
limit theorem and is applicable to independent observations with finite variances.

2.27 Proposition (Lindeberg-Feller central limit theorem). For each n let Y_{n,1}, …,
Y_{n,kn} be independent random vectors with finite variances such that

    ∑_{i=1}^{kn} E‖Y_{n,i}‖² 1{‖Y_{n,i}‖ > ε} → 0,  every ε > 0,

    ∑_{i=1}^{kn} Cov Y_{n,i} → Σ.


Then the sequence ∑_{i=1}^{kn} (Y_{n,i} − EY_{n,i}) converges in distribution to a normal N(0, Σ)
distribution.

A result of this type is necessary to treat the asymptotics of, for instance, regression
problems with fixed covariates. We illustrate this by the linear regression model. The
application is straightforward but notationally a bit involved. Therefore, at other places
in the manuscript we find it more convenient to assume that the covariates are a random
sample, so that the ordinary central limit theorem applies.

2.28 Example (Linear regression). In the linear regression problem, we observe a vector
Y = Xβ + e for a known (n × p) matrix X of full rank, and an (unobserved) error vector e
with i.i.d. components with mean zero and variance σ². The least squares estimator of β is

    β̂ = (XᵀX)⁻¹XᵀY.

This estimator is unbiased and has covariance matrix σ²(XᵀX)⁻¹. If the error vector e is
normally distributed, then β̂ is exactly normally distributed. Under reasonable conditions
on the design matrix, the least squares estimator is asymptotically normally distributed for
a large range of error distributions. Here we fix p and let n tend to infinity.
This follows from the representation

    (XᵀX)^{1/2}(β̂ − β) = (XᵀX)^{−1/2}Xᵀe = ∑_{i=1}^{n} a_{ni} e_i,

where a_{n1}, …, a_{nn} are the columns of the (p × n) matrix (XᵀX)^{−1/2}Xᵀ =: A. This sequence
is asymptotically normal if the vectors a_{n1}e_1, …, a_{nn}e_n satisfy the Lindeberg conditions.
The norming matrix (XᵀX)^{1/2} has been chosen to ensure that the vectors in the display
have covariance matrix σ²I for every n. The remaining condition is

    ∑_{i=1}^{n} ‖a_{ni}‖² Ee_1² 1{‖a_{ni}‖ |e_1| > ε} → 0.

This can be simplified to other conditions in several ways. Because ∑ ‖a_{ni}‖² = trace(AAᵀ)
= p, it suffices that max_i Ee_1² 1{‖a_{ni}‖ |e_1| > ε} → 0, which is equivalent to
max_i ‖a_{ni}‖ → 0. Alternatively, the expectation Ee² 1{a|e| > ε} can be bounded by
ε^{−k} E|e|^{k+2} a^k and a second set of sufficient conditions is

    ∑_{i=1}^{n} ‖a_{ni}‖^k → 0;  E|e_1|^k < ∞  (k > 2).

Both sets of conditions are reasonable. Consider for instance the simple linear regression
model Yi = β0 + β1 xi + ei. Then

    XᵀX = n ( 1  x̄ ; x̄  x̄² ),

where x̄ and x̄² are the averages of the xi and the xi², respectively. It is reasonable to
assume that the sequences x̄ and x̄² are bounded. Then the first matrix


on the right behaves like a fixed matrix, and the conditions for asymptotic normality
simplify to

    max_{1≤i≤n} xi²/n → 0.

Every reasonable design satisfies these conditions. □
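The trace identity ∑ ‖a_{ni}‖² = p and the shrinking of max_i ‖a_{ni}‖ can be verified numerically. The sketch below (assuming NumPy is available; the design x_i = i/n and the helper name are illustrative) forms the columns a_{ni} of (XᵀX)^{−1/2}Xᵀ via an eigendecomposition:

```python
import numpy as np

def lindeberg_norms(x):
    """Norms of the columns a_ni of (X^T X)^{-1/2} X^T for rows (1, x_i)."""
    n = len(x)
    X = np.column_stack([np.ones(n), x])
    # symmetric inverse square root of X^T X via its eigendecomposition
    w, V = np.linalg.eigh(X.T @ X)
    inv_sqrt = V @ np.diag(w ** -0.5) @ V.T
    A = inv_sqrt @ X.T  # p x n; A @ A.T is the identity by construction
    return np.linalg.norm(A, axis=0)

for n in (10, 100, 1000):
    norms = lindeberg_norms(np.linspace(0, 1, n))
    # trace identity: the squared norms sum to p = 2 for every n
    assert abs((norms ** 2).sum() - 2) < 1e-8
    print(n, norms.max())  # max_i ||a_ni|| shrinks as n grows
```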

*2.9 Convergence in Total Variation


A sequence of random variables Xn converges in total variation to a variable X if

    sup_B |P(Xn ∈ B) − P(X ∈ B)| → 0,

where the supremum is taken over all measurable sets B. In view of the portmanteau lemma,
this type of convergence is stronger than convergence in distribution. Not only is it required
that the sequence P(Xn E B) converges for every Borel set B, the convergence must also
be uniform in B. Such strong convergence occurs less frequently and is often more than
necessary, whence the concept is less useful.
A simple sufficient condition for convergence in total variation is pointwise convergence
of densities. If Xn and X have densities pn and p with respect to a measure μ, then

    sup_B |P(Xn ∈ B) − P(X ∈ B)| = ½ ∫ |pn − p| dμ.

Thus, convergence in total variation can be established by convergence theorems for
integrals from measure theory. The following proposition, which should be compared with the
monotone and dominated convergence theorems, is most appropriate.

2.29 Proposition. Suppose that fn and f are arbitrary measurable functions such that
fn → f μ-almost everywhere (or in μ-measure) and lim sup ∫ |fn|^p dμ ≤ ∫ |f|^p dμ <
∞, for some p ≥ 1 and measure μ. Then ∫ |fn − f|^p dμ → 0.

Proof. By the inequality (a + b)^p ≤ 2^p a^p + 2^p b^p, valid for every a, b ≥ 0, and the
assumption, 0 ≤ 2^p|fn|^p + 2^p|f|^p − |fn − f|^p → 2^{p+1}|f|^p almost everywhere. By
Fatou's lemma,

    ∫ 2^{p+1}|f|^p dμ ≤ lim inf ∫ (2^p|fn|^p + 2^p|f|^p − |fn − f|^p) dμ
                      ≤ 2^{p+1} ∫ |f|^p dμ − lim sup ∫ |fn − f|^p dμ,

by assumption. The proposition follows. •

2.30 Corollary (Scheffé). Let Xn and X be random vectors with densities pn and p with
respect to a measure μ. If pn → p μ-almost everywhere, then the sequence Xn converges
to X in total variation.
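Scheffé's theorem can be illustrated numerically with the t-distributions, whose densities converge pointwise to the standard normal density as the degrees of freedom grow. The sketch below (Python, standard library only; the function name and grid parameters are ours) approximates the total variation distance ½ ∫ |pn − p| by a midpoint sum and checks that it decreases:

```python
import math

def tv_to_normal(df, half_width=60.0, steps=100_000):
    """Total variation distance between the t_df and N(0, 1) densities,
    approximated as (1/2) * midpoint sum of |p_n - p|."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    h = 2 * half_width / steps
    total = 0.0
    for i in range(steps):
        x = -half_width + (i + 0.5) * h
        t_dens = c * (1 + x * x / df) ** (-(df + 1) / 2)
        n_dens = math.exp(-x * x / 2) / math.sqrt(2 * math.pi)
        total += abs(t_dens - n_dens)
    return 0.5 * total * h

tv5, tv50 = tv_to_normal(5), tv_to_normal(50)
print(tv5, tv50)
assert tv50 < tv5 < 0.1  # the distance shrinks as the densities converge
```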


The central limit theorem is usually formulated in terms of convergence in distribution.
Often it is valid in terms of the total variation distance, in the sense that

    sup_B | P(Y1 + ⋯ + Yn ∈ B) − ∫_B (1/√(2πnσ²)) e^{−(x−nμ)²/(2nσ²)} dx | → 0.

Here μ and σ² are mean and variance of the Yi, and the supremum is taken over all Borel
sets. An integrable characteristic function, in addition to a finite second moment, suffices.

2.31 Theorem (Central limit theorem in total variation). Let Y1, Y2, … be i.i.d. random
variables with finite second moment and characteristic function φ such that ∫ |φ(t)|^ν dt <
∞ for some ν ≥ 1. Then Y1 + ⋯ + Yn satisfies the central limit theorem in total variation.

Proof. It can be assumed without loss of generality that EY1 = 0 and var Y1 = 1. By
the inversion formula for characteristic functions (see [47, p. 509]), the density pn of
(Y1 + ⋯ + Yn)/√n can be written

    pn(x) = (1/2π) ∫ e^{−itx} φ(t/√n)^n dt.

By the central limit theorem and Lévy's continuity theorem, the integrand converges to
e^{−itx} exp(−½t²). It will be shown that the integral converges to

    (1/2π) ∫ e^{−itx} e^{−½t²} dt = e^{−½x²}/√(2π).
Then an application of Scheffe's theorem concludes the proof.
The integral can be split into two parts. First, for every ε > 0,

    ∫_{|t|>ε√n} | e^{−itx} φ(t/√n)^n | dt ≤ √n sup_{|t|>ε} |φ(t)|^{n−ν} ∫ |φ(t)|^ν dt.

Here sup_{|t|>ε} |φ(t)| < 1 by the Riemann-Lebesgue lemma and because φ is the characteristic
function of a nonlattice distribution (e.g., [47, pp. 501, 513]). Thus, the first part of the
integral converges to zero geometrically fast.
Second, a Taylor expansion yields that φ(t) = 1 − ½t² + o(t²) as t → 0, so that there
exists ε > 0 such that |φ(t)| ≤ 1 − t²/4 for every |t| < ε. It follows that

    |φ(t/√n)|^n ≤ (1 − t²/(4n))^n ≤ e^{−t²/4},  every |t| < ε√n.

The proof can be concluded by applying the dominated convergence theorem to the remain-
ing part of the integral. •

Notes

The results of this chapter can be found in many introductions to probability theory. A
standard reference for weak convergence theory is the first chapter of [11]. Another very
readable introduction is [41]. The theory of this chapter is extended to random elements
with values in general metric spaces in Chapter 18.


PROBLEMS
1. If Xn possesses a t-distribution with n degrees of freedom, then Xn ⇝ N(0, 1) as n → ∞.
Show this.
2. Does it follow immediately from the result of the previous exercise that EXn^p → EN(0, 1)^p for
every p ∈ ℕ? Is this true?
3. If Xn ⇝ N(0, 1) and Yn →P σ, then XnYn ⇝ N(0, σ²). Show this.
4. In what sense is a chi-square distribution with n degrees of freedom approximately a normal
distribution?
5. Find an example of sequences such that Xn ⇝ X and Yn ⇝ Y, but the joint sequence (Xn, Yn)
does not converge in law.
6. If Xn and Yn are independent random vectors for every n, then Xn ⇝ X and Yn ⇝ Y imply that
(Xn, Yn) ⇝ (X, Y), where X and Y are independent. Show this.
7. If every Xn and X possess discrete distributions supported on the integers, then Xn ⇝ X if and
only if P(Xn = x) → P(X = x) for every integer x. Show this.
8. If P(Xn = i/n) = 1/n for every i = 1, 2, …, n, then Xn ⇝ X, but there exist Borel sets with
P(Xn ∈ B) = 1 for every n, but P(X ∈ B) = 0. Show this.
9. If P(Xn = xn) = 1 for numbers xn and xn → x, then Xn ⇝ x. Prove this
(i) by considering distribution functions;
(ii) by using Theorem 2.7.
10. State the rule oP(1) + OP(1) = OP(1) in terms of random vectors and show its validity.
11. In what sense is it true that oP(1) = OP(1)? Is it true that OP(1) = oP(1)?
12. The rules given by Lemma 2.12 are not simple plug-in rules.
(i) Give an example of a function R with R(h) = o(‖h‖) as h → 0 and a sequence of random
variables Xn such that R(Xn) is not equal to oP(Xn).
(ii) Give an example of a function R such that R(h) = O(‖h‖) as h → 0 and a sequence of random
variables Xn such that Xn = OP(1) but R(Xn) is not equal to OP(Xn).
13. Find an example of a sequence of random variables such that Xn ⇝ 0, but EXn → ∞.
14. Find an example of a sequence of random variables such that Xn →P 0, but Xn does not converge
almost surely.
15. Let X1, …, Xn be i.i.d. with density f_{λ,a}(x) = λe^{−λ(x−a)} 1{x ≥ a}. Calculate the maximum
likelihood estimator (λ̂n, ân) of (λ, a) and show that (λ̂n, ân) →P (λ, a).
16. Let X1, …, Xn be i.i.d. standard normal variables. Show that the vector U = (X1, …, Xn)/N,
where N² = ∑_{i=1}^n Xi², is uniformly distributed over the unit sphere S^{n−1} in ℝⁿ, in the sense that
U and OU are identically distributed for every orthogonal transformation O of ℝⁿ.
17. For each n, let Un be uniformly distributed over the unit sphere S^{n−1} in ℝⁿ. Show that the vectors
√n(U_{n,1}, U_{n,2}) converge in distribution to a pair of independent standard normal variables.
18. If √n(Tn − θ) converges in distribution, then Tn converges in probability to θ. Show this.
19. If EXn → μ and var Xn → 0, then Xn →P μ. Show this.
20. If ∑_n P(|Xn| > ε) < ∞ for every ε > 0, then Xn converges almost surely to zero. Show this.
21. Use characteristic functions to show that binomial(n, λ/n) ⇝ Poisson(λ). Why does the central
limit theorem not hold?
22. If X1, …, Xn are i.i.d. standard Cauchy, then X̄n is standard Cauchy.
(i) Show this by using characteristic functions.
(ii) Why does the weak law not hold?
23. Let X1, …, Xn be i.i.d. with finite fourth moment. Find constants a, b, and cn such that the
sequence cn(X̄n − a, X̄²n − b) converges in distribution, and determine the limit law. Here X̄n
and X̄²n are the averages of the Xi and the Xi², respectively.
3
Delta Method

The delta method consists of using a Taylor expansion to approximate a
random vector of the form φ(Tn) by the polynomial φ(θ) + φ′(θ)(Tn −
θ) + ⋯ in Tn − θ. It is a simple but useful method to deduce the limit law
of φ(Tn) − φ(θ) from that of Tn − θ. Applications include the nonrobust-
ness of the chi-square test for normal variances and variance-stabilizing
transformations.

3.1 Basic Result


Suppose an estimator Tn for a parameter θ is available, but the quantity of interest is φ(θ) for
some known function φ. A natural estimator is φ(Tn). How do the asymptotic properties
of φ(Tn) follow from those of Tn?
A first result is an immediate consequence of the continuous-mapping theorem. If the
sequence Tn converges in probability to θ and φ is continuous at θ, then φ(Tn) converges
in probability to φ(θ).
Of greater interest is a similar question concerning limit distributions. In particular, if
√n(Tn − θ) converges weakly to a limit distribution, is the same true for √n(φ(Tn) − φ(θ))?
If φ is differentiable, then the answer is affirmative. Informally, we have

    √n(φ(Tn) − φ(θ)) ≈ φ′(θ) √n(Tn − θ).

If √n(Tn − θ) ⇝ T for some variable T, then we expect that √n(φ(Tn) − φ(θ)) ⇝ φ′(θ)T.
In particular, if √n(Tn − θ) is asymptotically normal N(0, σ²), then we expect that
√n(φ(Tn) − φ(θ)) is asymptotically normal N(0, φ′(θ)²σ²). This is proved in greater
generality in the following theorem.
generality in the following theorem.
In the preceding paragraph it is silently understood that Tn is real-valued, but we are more
interested in considering statistics φ(Tn) that are formed out of several more basic statistics.
Consider the situation that Tn = (T_{n,1}, …, T_{n,k}) is vector-valued, and that φ : ℝ^k ↦ ℝ^m is
a given function defined at least on a neighbourhood of θ. Recall that φ is differentiable at
θ if there exists a linear map (matrix) φ′_θ : ℝ^k ↦ ℝ^m such that

    φ(θ + h) − φ(θ) = φ′_θ(h) + o(‖h‖),  h → 0.

All the expressions in this equation are vectors of length m, and ‖h‖ is the Euclidean
norm. The linear map h ↦ φ′_θ(h) is sometimes called a "total derivative," as opposed to


partial derivatives. A sufficient condition for φ to be (totally) differentiable is that all partial
derivatives ∂φ_j(x)/∂x_i exist for x in a neighborhood of θ and are continuous at θ. (Just
existence of the partial derivatives is not enough.) In any case, the total derivative is found
from the partial derivatives. If φ is differentiable, then it is partially differentiable, and the
derivative map h ↦ φ′_θ(h) is matrix multiplication by the (m × k) matrix of partial
derivatives with (i, j)-entry ∂φ_i(θ)/∂x_j.
If the dependence of the derivative φ′_θ on θ is continuous, then φ is called continuously
differentiable.
It is better to think of a derivative as a linear approximation h ↦ φ′_θ(h) to the function
h ↦ φ(θ + h) − φ(θ) than as a set of partial derivatives. Thus the derivative at a point θ
is a linear map. If the range space of φ is the real line (so that the derivative is a horizontal
vector), then the derivative is also called the gradient of the function.
Note that what is usually called the derivative of a function φ : ℝ ↦ ℝ does not com-
pletely correspond to the present derivative. The derivative at a point, usually written φ′(θ),
is written here as φ′_θ. Although φ′(θ) is a number, the second object is identified with the
map h ↦ φ′_θ(h) = φ′(θ)h. Thus in the present terminology the usual derivative function
θ ↦ φ′(θ) is a map from ℝ into the set of linear maps from ℝ to ℝ, not a map from
ℝ to ℝ. Graphically the "affine" approximation h ↦ φ(θ) + φ′_θ(h) is the tangent to the
function φ at θ.

3.1 Theorem. Let φ : 𝔻_φ ⊂ ℝ^k ↦ ℝ^m be a map defined on a subset of ℝ^k and dif-
ferentiable at θ. Let Tn be random vectors taking their values in the domain of φ. If
rn(Tn − θ) ⇝ T for numbers rn → ∞, then rn(φ(Tn) − φ(θ)) ⇝ φ′_θ(T). Moreover, the
difference between rn(φ(Tn) − φ(θ)) and φ′_θ(rn(Tn − θ)) converges to zero in probability.

Proof. Because the sequence rn(Tn − θ) converges in distribution, it is uniformly tight and
Tn − θ converges to zero in probability. By the differentiability of φ the remainder function
R(h) = φ(θ + h) − φ(θ) − φ′_θ(h) satisfies R(h) = o(‖h‖) as h → 0. Lemma 2.12 allows
us to replace the fixed h by a random sequence and gives

    φ(Tn) − φ(θ) − φ′_θ(Tn − θ) = R(Tn − θ) = oP(‖Tn − θ‖).

Multiply this left and right with rn, and note that oP(rn‖Tn − θ‖) = oP(1) by tightness of
the sequence rn(Tn − θ). This yields the last statement of the theorem. Because matrix
multiplication is continuous, φ′_θ(rn(Tn − θ)) ⇝ φ′_θ(T) by the continuous-mapping theorem.
Apply Slutsky's lemma to conclude that the sequence rn(φ(Tn) − φ(θ)) has the same weak
limit. •

A common situation is that √n(Tn − θ) converges to a multivariate normal distribution
N_k(μ, Σ). Then the conclusion of the theorem is that the sequence √n(φ(Tn) − φ(θ))
converges in law to the N_m(φ′_θ μ, φ′_θ Σ (φ′_θ)ᵀ) distribution.
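A quick Monte Carlo experiment makes the one-dimensional case concrete. The sketch below (Python, standard library; the choice Tn = mean of n Exp(1) variables, θ = 1 and φ = log, and the seed are illustrative) checks that √n(φ(Tn) − φ(θ)) has approximately variance φ′(θ)²σ² = 1:

```python
import math
import random

random.seed(1)

def delta_sample(n, reps=10_000):
    """Draws of sqrt(n)(phi(T_n) - phi(theta)) for T_n the mean of n Exp(1)
    variables, theta = 1 and phi = log, so phi'(theta)^2 * var(X) = 1."""
    vals = []
    for _ in range(reps):
        t = sum(random.expovariate(1.0) for _ in range(n)) / n
        vals.append(math.sqrt(n) * (math.log(t) - math.log(1.0)))
    return vals

vals = delta_sample(n=200)
mean = sum(vals) / len(vals)
var = sum((v - mean) ** 2 for v in vals) / len(vals)
print(mean, var)  # mean near 0, variance near 1
assert abs(mean) < 0.1
assert abs(var - 1) < 0.15
```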

3.2 Example (Sample variance). The sample variance of n observations X1, …, Xn
is defined as S² = n⁻¹ ∑_{i=1}^n (Xi − X̄)² and can be written as φ(X̄, X̄²) for the function

φ(x, y) = y − x². (For simplicity of notation, we divide by n rather than n − 1.) Suppose that
S² is based on a sample from a distribution with finite first to fourth moments α1, α2, α3, α4.
By the multivariate central limit theorem,

    √n( (X̄, X̄²) − (α1, α2) ) ⇝ N2( 0, ( α2 − α1², α3 − α1α2 ; α3 − α1α2, α4 − α2² ) ).

The map φ is differentiable at the point θ = (α1, α2)ᵀ, with derivative φ′_{(α1,α2)} = (−2α1, 1).
Thus if the vector (T1, T2)ᵀ possesses the normal distribution in the last display, then

    √n( φ(X̄, X̄²) − φ(α1, α2) ) ⇝ −2α1 T1 + T2.

The latter variable is normally distributed with zero mean and a variance that can be ex-
pressed in α1, …, α4. In case α1 = 0, this variance is simply α4 − α2². The general case
can be reduced to this case, because S² does not change if the observations Xi are replaced
by the centered variables Yi = Xi − α1. Write μk = EYi^k for the central moments of the
Xi. Noting that S² = φ(Ȳ, Ȳ²) and that φ(μ1, μ2) = μ2 is the variance of the original
observations, we obtain

    √n(S² − μ2) ⇝ N(0, μ4 − μ2²).

In view of Slutsky's lemma, the same result is valid for the unbiased version n/(n − 1) S²
of the sample variance, because √n(n/(n − 1) − 1) → 0. □
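The asymptotic variance μ4 − μ2² can be checked by simulation. The sketch below (Python, standard library; uniform observations and the seed are our choices) compares the empirical variance of √n(S² − μ2) with the exact value 1/180 for the Uniform(0, 1) distribution, for which μ2 = 1/12 and μ4 = 1/80:

```python
import random

random.seed(2)

MU2 = 1 / 12            # variance of Uniform(0, 1)
MU4 = 1 / 80            # fourth central moment of Uniform(0, 1)
TARGET = MU4 - MU2**2   # asymptotic variance of sqrt(n)(S^2 - mu_2) = 1/180

def sample_var(n):
    xs = [random.random() for _ in range(n)]
    m = sum(xs) / n
    return sum((x - m) ** 2 for x in xs) / n

n, reps = 200, 8_000
vals = [n ** 0.5 * (sample_var(n) - MU2) for _ in range(reps)]
mean = sum(vals) / reps
var = sum((v - mean) ** 2 for v in vals) / reps
print(var, TARGET)  # the two numbers should be close
assert abs(var - TARGET) < 0.001
```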

3.3 Example (Level of the chi-square test). As an application of the preceding example,
consider the chi-square test for testing variance. Normal theory prescribes to reject the null
hypothesis H0 : μ2 ≤ 1 for values of nS² exceeding the upper α point χ²_{n−1,α} of the χ²_{n−1}
distribution. If the observations are sampled from a normal distribution, then the test has
exactly level α. Is this still approximately the case if the underlying distribution is not
normal? Unfortunately, the answer is negative.
For large values of n, this can be seen with the help of the preceding result. The central
limit theorem and the preceding example yield the two statements

    (χ²_{n−1} − (n − 1)) / √(2n − 2) ⇝ N(0, 1),   √n( S²/μ2 − 1 ) ⇝ N(0, κ + 2),

where κ = μ4/μ2² − 3 is the kurtosis of the underlying distribution. The first statement
implies that (χ²_{n−1,α} − (n − 1))/√(2n − 2) converges to the upper α point z_α of the standard
normal distribution. Thus the level of the chi-square test satisfies

    P_{μ2=1}( nS² > χ²_{n−1,α} ) = P( √n(S²/μ2 − 1) > (χ²_{n−1,α} − n)/√n ) → 1 − Φ( z_α √2 / √(κ + 2) ).

The asymptotic level reduces to 1 − Φ(z_α) = α if and only if the kurtosis of the underlying
distribution is 0. This is the case for normal distributions. On the other hand, heavy-tailed
distributions have a much larger kurtosis. If the kurtosis of the underlying distribution is
"close to" infinity, then the asymptotic level is close to 1 − Φ(0) = 1/2. We conclude that
the level of the chi-square test is nonrobust against departures of normality that affect the
value of the kurtosis. At least this is true if the critical values of the test are taken from
the chi-square distribution with n − 1 degrees of freedom. If, instead, we would use a


Table 3.1. Level of the test that rejects
if nS²/μ2 exceeds the 0.95 quantile
of the χ²_19 distribution.

    Law                             Level
    Laplace                         0.12
    0.95 N(0, 1) + 0.05 N(0, 9)     0.12

Note: Approximations based on simulation of 10,000 samples.

normal approximation to the distribution of √n(S²/μ2 − 1), the problem would not arise,
provided the asymptotic variance κ + 2 is estimated accurately. Table 3.1 gives the level
for two distributions with slightly heavier tails than the normal distribution. □
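The limit formula for the level can be evaluated directly. The sketch below (Python, standard library; the bisection quantile routine is our own) computes 1 − Φ(z_α √2/√(κ + 2)) and confirms that for the Laplace distribution (κ = 3) the asymptotic level is well above the nominal 0.05 (and above the finite-sample value 0.12 reported in Table 3.1 for n = 20):

```python
import math

def Phi(x):
    """Standard normal distribution function via the error function."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def Phi_inv(p, lo=-10.0, hi=10.0):
    """Standard normal quantile by bisection (Phi is strictly increasing)."""
    for _ in range(200):
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if Phi(mid) < p else (lo, mid)
    return (lo + hi) / 2

def asymptotic_level(alpha, kurtosis):
    z = Phi_inv(1 - alpha)
    return 1 - Phi(z * math.sqrt(2 / (kurtosis + 2)))

print(asymptotic_level(0.05, 0))  # normal data: the nominal level alpha
print(asymptotic_level(0.05, 3))  # Laplace data: substantially inflated
assert abs(asymptotic_level(0.05, 0) - 0.05) < 1e-6
assert asymptotic_level(0.05, 3) > 0.10
```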

In the preceding example the asymptotic distribution of √n(S² − σ²) was obtained by the
delta method. Actually, it can also and more easily be derived by a direct expansion. Write

    √n(S² − σ²) = √n( n⁻¹ ∑_{i=1}^n (Xi − μ)² − σ² ) − √n( X̄ − μ )².

The second term converges to zero in probability; the first term is asymptotically normal
by the central limit theorem. The whole expression is asymptotically normal by Slutsky's
lemma.
Thus it is not always a good idea to apply general theorems. However, in many exam-
ples the delta method is a good way to package the mechanics of Taylor expansions in a
transparent way.

3.4 Example. Consider the joint limit distribution of the sample variance S² and the
t-statistic X̄/S. Again for the limit distribution it does not make a difference whether we
use a factor n or n − 1 to standardize S². For simplicity we use n. Then (S², X̄/S) can be
written as φ(X̄, X̄²) for the map φ : ℝ² → ℝ² given by

    φ(x, y) = ( y − x², x/(y − x²)^{1/2} ).

The joint limit distribution of √n(X̄ − α1, X̄² − α2) is derived in the preceding example. The
map φ is differentiable at θ = (α1, α2)ᵀ provided σ² = α2 − α1² is positive, with derivative

    φ′_{(α1,α2)} = ( −2α1, 1 ; σ⁻¹ + α1²σ⁻³, −α1/(2σ³) ).

It follows that the sequence √n(S² − σ², X̄/S − α1/σ) is asymptotically bivariate normally
distributed, with zero mean and covariance matrix φ′_{(α1,α2)} Σ (φ′_{(α1,α2)})ᵀ, for Σ the covariance
matrix of the preceding example. It is easy but uninteresting to compute this explicitly. □


3.5 Example (Skewness). The sample skewness of a sample X1, …, Xn is defined as

    ln = ( n⁻¹ ∑_{i=1}^n (Xi − X̄)³ ) / ( n⁻¹ ∑_{i=1}^n (Xi − X̄)² )^{3/2}.

Not surprisingly it converges in probability to the skewness of the underlying distribution,
defined as the quotient λ = μ3/σ³ of the third central moment and the third power of the
standard deviation of one observation. The skewness of a symmetric distribution, such
as the normal distribution, equals zero, and the sample skewness may be used to test this
aspect of normality of the underlying distribution. For large samples a critical value may
be determined from the normal approximation for the sample skewness.
The sample skewness can be written as φ(X̄, X̄², X̄³) for the function φ given by

    φ(a, b, c) = (c − 3ab + 2a³) / (b − a²)^{3/2}.

The sequence √n(X̄ − α1, X̄² − α2, X̄³ − α3) is asymptotically mean-zero normal by the
central limit theorem, provided the sixth moment of the underlying distribution is finite. The
value φ(α1, α2, α3) is exactly the population skewness. The function φ is differentiable at
the point (α1, α2, α3) and application of the delta method is straightforward. We can save
work by noting that the sample skewness is location and scale invariant. With Yi = (Xi − α1)/σ,
the skewness can also be written as φ(Ȳ, Ȳ², Ȳ³). With λ = μ3/σ³ denoting the skewness
of the underlying distribution, the Ȳs satisfy

    √n( Ȳ, Ȳ² − 1, Ȳ³ − λ ) ⇝ N( 0, ( 1, λ, κ + 3 ; λ, κ + 2, μ5/σ⁵ − λ ; κ + 3, μ5/σ⁵ − λ, μ6/σ⁶ − λ² ) ).

The derivative of φ at the point (0, 1, λ) equals (−3, −3λ/2, 1). Hence, if T possesses the
normal distribution in the display, then √n(ln − λ) is asymptotically normally distributed with
mean zero and variance equal to var(−3T1 − 3λT2/2 + T3). If the underlying distribution
is normal, then λ = μ5 = 0, κ = 0 and μ6/σ⁶ = 15. In that case the sample skewness is
asymptotically N(0, 6)-distributed.
An approximate level α test for normality based on the sample skewness could be to
reject normality if √n|ln| > √6 z_{α/2}. Table 3.2 gives the level of this test for different
values of n. □

Table 3.2. Level of the test that
rejects if √n|ln|/√6 exceeds the
0.975 quantile of the normal
distribution, in the case that the
observations are normally
distributed.

    n     Level
    10    0.02
    20    0.03
    30    0.03
    50    0.05

Note: Approximations based on simulation of 10,000 samples.
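The asymptotic variance 6 in the normal case is just the quadratic form gᵀΣg with g the gradient (−3, −3λ/2, 1) evaluated at λ = 0 and Σ the covariance matrix of the display, filled in with standard normal moments. A short computation (assuming NumPy is available) confirms it exactly:

```python
import numpy as np

# Covariance matrix of the limit of sqrt(n)(Ybar, Ybar^2 - 1, Ybar^3 - lam)
# for standardized normal observations: lam = kappa = mu5 = 0, mu6/sigma^6 = 15.
lam, kappa, mu5, mu6 = 0.0, 0.0, 0.0, 15.0
Sigma = np.array([
    [1.0,         lam,          kappa + 3.0],
    [lam,         kappa + 2.0,  mu5 - lam],
    [kappa + 3.0, mu5 - lam,    mu6 - lam**2],
])
grad = np.array([-3.0, -1.5 * lam, 1.0])  # derivative of phi at (0, 1, lam)
avar = grad @ Sigma @ grad
print(avar)
assert avar == 6.0  # 9*1 + 15 - 2*3*3 = 6, exactly
```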


3.2 Variance-Stabilizing Transformations


Given a sequence of statistics Tn with √n(Tn − θ) ⇝ N(0, σ²(θ)) for a range of values of
θ, asymptotic confidence intervals for θ are given by

    ( Tn − z_α σ(θ)/√n,  Tn + z_α σ(θ)/√n ).

These are asymptotically of level 1 − 2α in that the probability that θ is covered by


the interval converges to 1 − 2α for every θ . Unfortunately, as stated previously, these
intervals are useless, because of their dependence on the unknown θ . One solution is to
replace the unknown standard deviations σ (θ ) by estimators. If the sequence of estimators
is chosen consistent, then the resulting confidence interval still has asymptotic level 1 − 2α.
Another approach is to use a variance-stabilizing transformation, which often leads to a
better approximation.
The idea is that no problem arises if the asymptotic variances σ²(θ) are independent of θ.
Although this fortunate situation is rare, it is often possible to transform the parameter into
a different parameter η = φ(θ ), for which this idea can be applied. The natural estimator
for η is φ(Tn ). If φ is differentiable, then
    √n( φ(Tn) − φ(θ) ) ⇝ N( 0, φ′(θ)² σ²(θ) ).

For φ chosen such that φ′(θ)σ(θ) ≡ 1, the asymptotic variance is constant and finding an
asymptotic confidence interval for η = φ(θ) is easy. The solution

    φ(θ) = ∫ (1/σ(θ)) dθ

is a variance-stabilizing transformation. If it is well defined, then it is automatically


monotone, so that a confidence interval for η can be transformed back into a confidence
interval for θ .
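A standard illustration (not discussed in the text; the Poisson example, sampler, and seed are our choices) is the Poisson family, with σ(θ) = √θ and stabilizing transformation φ(θ) = ∫ θ^{−1/2} dθ = 2√θ. The Monte Carlo sketch below (Python, standard library) checks that n·var(2√X̄) is close to 1 for different values of θ:

```python
import math
import random

random.seed(3)

def poisson(lam):
    """Knuth's multiplication method; adequate for the small means used here."""
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= random.random()
        if p < L:
            return k
        k += 1

def stabilized_variance(theta, n=100, reps=5_000):
    """n * var(phi(X_bar)) for phi(t) = 2 sqrt(t) and X_i ~ Poisson(theta)."""
    vals = []
    for _ in range(reps):
        xbar = sum(poisson(theta) for _ in range(n)) / n
        vals.append(2 * math.sqrt(xbar))
    m = sum(vals) / reps
    return n * sum((v - m) ** 2 for v in vals) / reps

for theta in (1.0, 4.0):
    v = stabilized_variance(theta)
    print(theta, v)  # approximately 1, independent of theta
    assert abs(v - 1) < 0.15
```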

3.6 Example (Correlation). Let (X1, Y1), …, (Xn, Yn) be a sample from a bivariate
normal distribution with correlation coefficient ρ. The sample correlation coefficient is
defined as

    rn = ∑_{i=1}^n (Xi − X̄)(Yi − Ȳ) / ( ∑_{i=1}^n (Xi − X̄)² ∑_{i=1}^n (Yi − Ȳ)² )^{1/2}.

With the help of the delta method, it is possible to derive that √n(rn − ρ) is asymptotically
zero-mean normal, with variance depending on the (mixed) third and fourth moments of
(X, Y ). This is true for general underlying distributions, provided the fourth moments exist.
Under the normality assumption the asymptotic variance can be expressed in the correlation
of X and Y . Tedious algebra gives
√  
n (rn − ρ)  N 0, (1 − ρ 2 )2 .

It does not work very well to base an asymptotic confidence interval directly on this result.


Table 3.3. Coverage probability of the asymptotic 95%
confidence interval for the correlation coefficient, for two
values of n and five different values of the true correlation ρ.

n ρ=0 ρ = 0.2 ρ = 0.4 ρ = 0.6 ρ = 0.8


15 0.92 0.92 0.92 0.93 0.92
25 0.93 0.94 0.94 0.94 0.94

Note: Approximations based on simulation of 10,000 samples.

Figure 3.1. Histogram of 1000 sample correlation coefficients, based on 1000 independent
samples of the bivariate normal distribution with correlation 0.6, and histogram of the
arctanh of these values.

The transformation

    φ(ρ) = ∫ 1/(1 − ρ²) dρ = ½ log( (1 + ρ)/(1 − ρ) ) = arctanh ρ

is variance stabilizing. Thus, the sequence √n(arctanh rn − arctanh ρ) converges to a
standard normal distribution for every ρ. This leads to the asymptotic confidence interval
for the correlation coefficient ρ given by
    ( tanh(arctanh rn − z_α/√n),  tanh(arctanh rn + z_α/√n) ).

Table 3.3 gives an indication of the accuracy of this interval. Besides stabilizing the
variance, the arctanh transformation has the benefit of symmetrizing the distribution of the
sample correlation coefficient (which is perhaps of greater importance), as can be seen in
Figure 3.1. □
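The variance-stabilized interval is easy to check by simulation. The following Python sketch (an editorial illustration, not part of the original text) estimates its coverage in the setting of Table 3.3:

```python
import numpy as np

rng = np.random.default_rng(0)

def arctanh_ci(x, y, z=1.96):
    # tanh(arctanh r_n -/+ z/sqrt(n)), the interval from the display above
    n = len(x)
    r = np.corrcoef(x, y)[0, 1]
    half = z / np.sqrt(n)
    return np.tanh(np.arctanh(r) - half), np.tanh(np.arctanh(r) + half)

rho, n, reps = 0.6, 15, 10_000
cov = [[1.0, rho], [rho, 1.0]]
hits = 0
for _ in range(reps):
    x, y = rng.multivariate_normal([0.0, 0.0], cov, size=n).T
    lo, hi = arctanh_ci(x, y)
    hits += lo < rho < hi
print(hits / reps)  # close to the 0.93 reported in Table 3.3 for n = 15
```

Even for n = 15 the coverage is close to nominal; the transformed interval also automatically stays inside [−1, 1].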

∗ 3.3 Higher-Order Expansions


To package a simple idea in a theorem has the danger of obscuring the idea. The delta
method is based on a Taylor expansion of order one. Sometimes a problem cannot be
exactly forced into the framework described by the theorem, but the principle of a Taylor
expansion is still valid.




In the one-dimensional case, a Taylor expansion applied to a statistic T_n has the form

φ(T_n) = φ(θ) + (T_n − θ)φ′(θ) + ½(T_n − θ)²φ″(θ) + ⋯ .

Usually the linear term (T_n − θ)φ′(θ) is of higher order than the remainder, and thus
determines the order at which φ(T_n) − φ(θ) converges to zero: the same order as T_n − θ.
Then the approach of the preceding section gives the limit distribution of φ(T_n) − φ(θ). If
φ′(θ) = 0, this approach is still valid but not of much interest, because the resulting limit
distribution is degenerate at zero. Then it is more informative to multiply the difference
φ(T_n) − φ(θ) by a higher rate and obtain a nondegenerate limit distribution. Looking at
the Taylor expansion, we see that the linear term disappears if φ′(θ) = 0, and we expect
that the quadratic term determines the limit behavior of φ(T_n).

3.7 Example. Suppose that √n X̄ converges weakly to a standard normal distribution.
Because the derivative of x ↦ cos x is zero at x = 0, the standard delta method of the
preceding section yields that √n(cos X̄ − cos 0) converges weakly to 0. It should be
concluded that √n is not the right norming rate for the random sequence cos X̄ − 1. A
more informative statement is that −2n(cos X̄ − 1) converges in distribution to a chi-square
distribution with one degree of freedom. The explanation is that

cos X̄ − cos 0 = (X̄ − 0)·0 − ½(X̄ − 0)² + ⋯ .

That the remainder term is negligible after multiplication with n can be shown along the
same lines as the proof of Theorem 3.1. The sequence nX̄² converges in law to a χ²₁
distribution by the continuous-mapping theorem; the sequence −2n(cos X̄ − 1) has the
same limit, by Slutsky's lemma. □
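The chi-square limit can be illustrated numerically. The following Python sketch (an editorial illustration, not part of the original text) simulates X̄ for standard normal samples and compares the first two moments of −2n(cos X̄ − 1) with those of the χ²₁ distribution, which has mean 1 and variance 2:

```python
import numpy as np

rng = np.random.default_rng(1)

n, reps = 500, 20_000
# X-bar for samples from a distribution with mean 0 and variance 1,
# so that sqrt(n) X-bar is approximately standard normal
xbar = rng.standard_normal((reps, n)).mean(axis=1)

w = -2 * n * (np.cos(xbar) - 1)   # the second-order statistic from the example
print(w.mean(), w.var())          # both close to the chi-square_1 values 1 and 2
```

The statistic n X̄² gives virtually identical simulated moments, in agreement with the Slutsky argument above.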

A more complicated situation arises if the statistic T_n is higher-dimensional with coordinates of different orders of magnitude. For instance, for a real-valued function φ,

φ(T_n) − φ(θ) = Σ_j (T_{n,j} − θ_j) ∂φ/∂θ_j(θ) + ½ Σ_{j,k} (T_{n,j} − θ_j)(T_{n,k} − θ_k) ∂²φ/∂θ_j∂θ_k(θ) + ⋯ .

If the sequences T_{n,j} − θ_j are of different order, then it may happen, for instance, that the
linear part involving T_{n,j} − θ_j is of the same order as the quadratic part involving (T_{n,k} − θ_k)².
Thus, it is necessary to determine carefully the rate of all terms in the expansion, and to
rearrange these in decreasing order of magnitude, before neglecting the "remainder."

*3.4 Uniform Delta Method


Sometimes we wish to prove the asymptotic normality of a sequence √n(φ(T_n) − φ(θ_n))
for centering vectors θ_n changing with n, rather than a fixed vector. If √n(θ_n − θ) → h for
certain vectors θ and h, then this can be handled easily by decomposing

√n(φ(T_n) − φ(θ_n)) = √n(φ(T_n) − φ(θ)) − √n(φ(θ_n) − φ(θ)).




Several applications of Slutsky's lemma and the delta method yield as limit in law the vector
φ′_θ(T + h) − φ′_θ(h) = φ′_θ(T), if T is the limit in distribution of √n(T_n − θ_n). For θ_n → θ
at a slower rate, this argument does not work. However, the same result is true under a
slightly stronger differentiability assumption on φ.

3.8 Theorem. Let φ: ℝᵏ ↦ ℝᵐ be a map defined and continuously differentiable in
a neighborhood of θ. Let T_n be random vectors taking their values in the domain of
φ. If r_n(T_n − θ_n) ⇝ T for vectors θ_n → θ and numbers r_n → ∞, then r_n(φ(T_n) −
φ(θ_n)) ⇝ φ′_θ(T). Moreover, the difference between r_n(φ(T_n) − φ(θ_n)) and φ′_θ(r_n(T_n − θ_n))
converges to zero in probability.

Proof. It suffices to prove the last assertion. Because convergence in probability to zero
of vectors is equivalent to convergence to zero of the components separately, it is no loss
of generality to assume that φ is real-valued. For 0 ≤ t ≤ 1 and fixed h, define g_n(t) =
φ(θ_n + th). For sufficiently large n and sufficiently small h, both θ_n and θ_n + h are in a
ball around θ inside the neighborhood on which φ is differentiable. Then g_n: [0, 1] ↦ ℝ is
continuously differentiable with derivative g_n′(t) = φ′_{θ_n + th}(h). By the mean-value theorem,
g_n(1) − g_n(0) = g_n′(ξ_n) for some 0 ≤ ξ_n ≤ 1. In other words,

φ(θ_n + h) − φ(θ_n) = φ′_{θ_n + ξ_n h}(h) =: φ′_θ(h) + R_n(h).

By the continuity of the map θ ↦ φ′_θ, there exists for every ε > 0 a δ > 0 such that
‖φ′_{θ′}(h) − φ′_θ(h)‖ ≤ ε‖h‖ for every ‖θ′ − θ‖ < δ and every h. For sufficiently large n and
‖h‖ < δ/2, the vectors θ_n + ξ_n h are within distance δ of θ, so that the norm ‖R_n(h)‖ of the
right side of the preceding display is bounded by ε‖h‖. Thus, for any η > 0,

P(r_n‖R_n(T_n − θ_n)‖ > η) ≤ P(‖T_n − θ_n‖ ≥ δ/2) + P(r_n‖T_n − θ_n‖ε > η).

The first term converges to zero as n → ∞. The second term can be made arbitrarily small
by choosing ε small. ∎

*3.5 Moments
So far we have discussed the stability of convergence in distribution under transformations.
We can pose the same problem regarding moments: Can an expansion for the moments of
φ(T_n) − φ(θ) be derived from a similar expansion for the moments of T_n − θ? In principle
the answer is affirmative, but unlike in the distributional case, in which a simple derivative
of φ is enough, global regularity conditions on φ are needed to argue that the remainder
terms are negligible.
One possible approach is to apply the distributional delta method first, thus yielding the
qualitative asymptotic behavior. Next, the convergence of the moments of φ(T_n) − φ(θ)
(or a remainder term) is a matter of uniform integrability, in view of Lemma 2.20. If
φ is uniformly Lipschitz, then this uniform integrability follows from the corresponding
uniform integrability of T_n − θ. If φ has an unbounded derivative, then the connection
between moments of φ(T_n) − φ(θ) and T_n − θ is harder to make, in general.




Notes
The Delta method belongs to the folklore of statistics. It is not entirely trivial; proofs are
sometimes based on the mean-value theorem and then require continuous differentiability in
a neighborhood. A generalization to functions on infinite-dimensional spaces is discussed
in Chapter 20.

PROBLEMS
1. Find the joint limit distribution of (√n(X̄ − μ), √n(S² − σ²)) if X̄ and S² are based on a sample
of size n from a distribution with finite fourth moment. Under what condition on the underlying
distribution are √n(X̄ − μ) and √n(S² − σ²) asymptotically independent?
2. Find the asymptotic distribution of √n(r − ρ) if r is the correlation coefficient of a sample of n
bivariate vectors with finite fourth moments. (This is quite a bit of work. It helps to assume that
the mean and the variance are equal to 0 and 1, respectively.)
3. Investigate the asymptotic robustness of the level of the t-test for testing the mean that rejects
H₀: μ ≤ 0 if √n X̄/S is larger than the upper α quantile of the t_{n−1} distribution.
4. Find the limit distribution of the sample kurtosis k_n = n⁻¹ Σ_{i=1}^n (X_i − X̄)⁴/S⁴ − 3, and design an
asymptotic level α test for normality based on k_n. (Warning: At least 500 observations are needed
to make the normal approximation work in this case.)
5. Design an asymptotic level α test for normality based on the sample skewness and kurtosis jointly.
6. Let X₁, ..., X_n be i.i.d. with expectation μ and variance 1. Find constants a_n and b_n such that a_n(X̄_n² − b_n)
converges in distribution if μ = 0 or μ ≠ 0.
7. Let X₁, ..., X_n be a random sample from the Poisson distribution with mean θ. Find a variance-stabilizing
transformation for the sample mean, and construct a confidence interval for θ based on this.
8. Let X₁, ..., X_n be i.i.d. with expectation 1 and finite variance. Find the limit distribution of
√n(X̄_n⁻¹ − 1). If the random variables are sampled from a density f that is bounded and strictly
positive in a neighborhood of zero, show that E|X̄_n⁻¹| = ∞ for every n. (The density of X̄_n is
bounded away from zero in a neighborhood of zero, for every n.)



4
Moment Estimators

The method of moments determines estimators by comparing sample and
theoretical moments. Moment estimators are useful for their simplicity,
although not always optimal. Maximum likelihood estimators for full exponential
families are moment estimators, and their asymptotic normality
can be proved by treating them as such.

4.1 Method of Moments


Let X₁, ..., X_n be a sample from a distribution P_θ that depends on a parameter θ, ranging
over some set Θ. The method of moments consists of estimating θ by the solution of a
system of equations

(1/n) Σ_{i=1}^n f_j(X_i) = E_θ f_j(X),   j = 1, ..., k,

for given functions f₁, ..., f_k. Thus the parameter is chosen such that the sample moments
(on the left side) match the theoretical moments. If the parameter is k-dimensional one
usually tries to match k moments in this manner. The choices f_j(x) = xʲ lead to the
method of moments in its simplest form.
Moment estimators are not necessarily the best estimators, but under reasonable conditions
they have convergence rate √n and are asymptotically normal. This is a consequence
of the delta method. Write the given functions in the vector notation f = (f₁, ..., f_k), and
let e: Θ ↦ ℝᵏ be the vector-valued expectation e(θ) = P_θ f. Then the moment estimator
θ̂_n solves the system of equations

ℙ_n f ≡ (1/n) Σ_{i=1}^n f(X_i) = e(θ) ≡ P_θ f.

For existence of the moment estimator, it is necessary that the vector ℙ_n f be in the range
of the function e. If e is one-to-one, then the moment estimator is uniquely determined as
θ̂_n = e⁻¹(ℙ_n f) and

√n(θ̂_n − θ₀) = √n(e⁻¹(ℙ_n f) − e⁻¹(P_{θ₀} f)).

If ℙ_n f is asymptotically normal and e⁻¹ is differentiable, then the right side is asymptotically
normal by the delta method.





The derivative of e⁻¹ at e(θ₀) is the inverse e′_{θ₀}⁻¹ of the derivative e′_{θ₀} of e at θ₀. Because the
function e⁻¹ is often not explicit, it is convenient to ascertain its differentiability from the
differentiability of e. This is possible by the inverse function theorem. According to this
theorem a map that is (continuously) differentiable throughout an open set with nonsingular
derivatives is locally one-to-one, is of full rank, and has a differentiable inverse. Thus we
obtain the following theorem.

4.1 Theorem. Suppose that e(θ) = P_θ f is one-to-one on an open set Θ ⊂ ℝᵏ and continuously
differentiable at θ₀ with nonsingular derivative e′_{θ₀}. Moreover, assume that
P_{θ₀}‖f‖² < ∞. Then moment estimators θ̂_n exist with probability tending to one and
satisfy

√n(θ̂_n − θ₀) ⇝ N(0, e′_{θ₀}⁻¹ Cov_{θ₀} f(X₁) (e′_{θ₀}⁻¹)ᵀ).

Proof. Continuous differentiability at θ₀ presumes differentiability in a neighborhood, and
the continuity of θ ↦ e′_θ and nonsingularity of e′_{θ₀} imply nonsingularity in a neighborhood.
Therefore, by the inverse function theorem there exist open neighborhoods U of θ₀ and
V of P_{θ₀} f such that e: U ↦ V is a differentiable bijection with a differentiable inverse
e⁻¹: V ↦ U. Moment estimators θ̂_n = e⁻¹(ℙ_n f) exist as soon as ℙ_n f ∈ V, which
happens with probability tending to 1 by the law of large numbers.
The central limit theorem guarantees asymptotic normality of the sequence √n(ℙ_n f −
P_{θ₀} f). Next use Theorem 3.1 on the display preceding the statement of the theorem. ∎

For completeness, the following two lemmas constitute, if combined, a proof of the
inverse function theorem. If necessary the preceding theorem can be strengthened somewhat
by applying the lemmas directly. Furthermore, the first lemma can be easily generalized to
infinite-dimensional parameters, such as used in the semiparametric models discussed in
Chapter 25.

4.2 Lemma. Let Θ ⊂ ℝᵏ be arbitrary and let e: Θ ↦ ℝᵏ be one-to-one and differentiable
at a point θ₀ with a nonsingular derivative. Then the inverse e⁻¹ (defined on the range of
e) is differentiable at e(θ₀) provided it is continuous at e(θ₀).

Proof. Write η = e(θ₀) and Δ_h = e⁻¹(η + h) − e⁻¹(η). Because e⁻¹ is continuous at η,
we have that Δ_h → 0 as h → 0. Thus

η + h = e ∘ e⁻¹(η + h) = e(Δ_h + θ₀) = e(θ₀) + e′_{θ₀}(Δ_h) + o(‖Δ_h‖),

as h → 0, where the last step follows from the differentiability of e. The displayed equation
can be rewritten as e′_{θ₀}(Δ_h) = h + o(‖Δ_h‖). By continuity of the inverse of e′_{θ₀} this implies
that

Δ_h = e′_{θ₀}⁻¹(h) + o(‖Δ_h‖).

In particular, ‖Δ_h‖(1 + o(1)) ≤ ‖e′_{θ₀}⁻¹(h)‖ = O(‖h‖). Insert this in the displayed equation
to obtain the desired result that Δ_h = e′_{θ₀}⁻¹(h) + o(‖h‖). ∎

4.3 Lemma. Let e: Θ ↦ ℝᵏ be defined and differentiable in a neighborhood of a point
θ₀ and continuously differentiable at θ₀ with a nonsingular derivative. Then e maps every




sufficiently small open neighborhood U of θ₀ onto an open set V and e⁻¹: V ↦ U is well
defined and continuous.

Proof. By assumption, e′_θ → A⁻¹ := e′_{θ₀} as θ → θ₀. Thus ‖I − A e′_θ‖ ≤ ½ for every
θ in a sufficiently small neighborhood U of θ₀. Fix an arbitrary point η₁ = e(θ₁) from
V = e(U) (where θ₁ ∈ U). Next find an ε > 0 such that ball(θ₁, ε) ⊂ U, and fix an
arbitrary point η with ‖η − η₁‖ < δ := ½‖A‖⁻¹ε. It will be shown that η = e(θ) for some
point θ ∈ ball(θ₁, ε). Hence every η ∈ ball(η₁, δ) has an original in ball(θ₁, ε). If e is
one-to-one on U, so that the original is unique, then it follows that V is open and that e⁻¹
is continuous at η₁.
Define a function φ(θ) = θ + A(η − e(θ)). Because the norm of the derivative φ′_θ =
I − A e′_θ is bounded by ½ throughout U, the map φ is a contraction on U. Furthermore, if
‖θ − θ₁‖ ≤ ε,

‖φ(θ) − θ₁‖ ≤ ‖φ(θ) − φ(θ₁)‖ + ‖φ(θ₁) − θ₁‖ ≤ ½‖θ − θ₁‖ + ‖A‖ ‖η − η₁‖ < ε.

Consequently, φ maps ball(θ₁, ε) into itself. Because φ is a contraction, it has a fixed point
θ ∈ ball(θ₁, ε): a point with φ(θ) = θ. By definition of φ this satisfies e(θ) = η.
Any other θ̃ with e(θ̃) = η is also a fixed point of φ. In that case the difference θ − θ̃ =
φ(θ) − φ(θ̃) has norm bounded by ½‖θ − θ̃‖. This can only happen if θ̃ = θ. Hence e is
one-to-one throughout U. ∎
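The contraction argument in the preceding proof is constructive and can be run numerically. The sketch below (an editorial illustration, not part of the original text; the map e is a hypothetical smooth example with derivative computed by hand) inverts e near θ₀ by iterating φ(θ) = θ + A(η − e(θ)):

```python
import numpy as np

# A hypothetical smooth map e with e(0) = 0, chosen only to demonstrate
# the contraction phi(theta) = theta + A(eta - e(theta)) from the proof.
def e(theta):
    x, y = theta
    return np.array([x + 0.1 * x**2 + 0.05 * y, y + 0.1 * np.sin(x)])

theta0 = np.array([0.0, 0.0])
e_prime0 = np.array([[1.0, 0.05],      # derivative of e at theta0,
                     [0.1, 1.0]])      # computed by hand for this e
A = np.linalg.inv(e_prime0)            # A = (e'_{theta0})^{-1}

eta = np.array([0.2, -0.1])            # target value near e(theta0)
theta = theta0.copy()
for _ in range(50):                    # fixed-point iteration of phi
    theta = theta + A @ (eta - e(theta))

print(theta, e(theta))                 # e(theta) reproduces eta
```

Each iteration roughly halves (in fact much better, close to θ₀) the distance to the fixed point, exactly as the bound ‖I − A e′_θ‖ ≤ ½ predicts.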

4.4 Example. Let X₁, ..., X_n be a random sample from the beta distribution: The common
density is equal to

x ↦ (Γ(α + β)/(Γ(α)Γ(β))) x^{α−1}(1 − x)^{β−1},   0 < x < 1.

The moment estimator for (α, β) is the solution of the system of equations

X̄_n = E_{α,β} X = α/(α + β),
n⁻¹ Σ_{i=1}^n X_i² = E_{α,β} X² = (α + 1)α/((α + β + 1)(α + β)).

The right-hand side is a smooth and regular function of (α, β), and the equations can be
solved explicitly. Hence, the moment estimators exist and are asymptotically normal. □

*4.2 Exponential Families


Maximum likelihood estimators in full exponential families are moment estimators. This can
be exploited to show their asymptotic normality. Actually, as shown in Chapter 5, maximum
likelihood estimators in smoothly parametrized models are asymptotically normal in great
generality. Therefore the present section is included for the benefit of the simple proof,
rather than as an explanation of the limit properties.
Let X₁, ..., X_n be a sample from the k-dimensional exponential family with density

p_θ(x) = c(θ) h(x) e^{θᵀ t(x)}.




Thus h and t = (t₁, ..., t_k) are known functions on the sample space, and the family is
given in its natural parametrization. The parameter set Θ must be contained in the natural
parameter space for the family. This is the set of θ for which p_θ can define a probability
density. If μ is the dominating measure, then this is the right side in

Θ ⊂ {θ ∈ ℝᵏ : c(θ)⁻¹ ≡ ∫ h(x) e^{θᵀ t(x)} dμ(x) < ∞}.

It is a standard result (and not hard to see) that the natural parameter space is convex. It is
usually open, in which case the family is called "regular." In any case, we assume that the
true parameter is an inner point of Θ. Another standard result concerns the smoothness of
the function θ ↦ c(θ), or rather of its inverse. (For a proof of the following lemma, see
[100, p. 59] or [17, p. 39].)

4.5 Lemma. The function θ ↦ ∫ h(x) e^{θᵀ t(x)} dμ(x) is analytic on the set {θ ∈ ℂᵏ : Re θ ∈
int Θ}. Its derivatives can be found by differentiating (repeatedly) under the integral sign:

(∂^p/∂θ₁^{i₁} ⋯ ∂θ_k^{i_k}) ∫ h(x) e^{θᵀ t(x)} dμ(x) = ∫ t₁(x)^{i₁} ⋯ t_k(x)^{i_k} h(x) e^{θᵀ t(x)} dμ(x),

for any natural numbers p and i₁ + ⋯ + i_k = p.


The lemma implies that the log likelihood ℓ_θ(x) = log p_θ(x) can be differentiated (infinitely
often) with respect to θ. The vector of partial derivatives (the score function) satisfies

ℓ̇_θ(x) = (ċ/c)(θ) + t(x) = t(x) − E_θ t(X).

Here the second equality is an example of the general rule that score functions have zero
means. It can formally be established by differentiating the identity ∫ p_θ dμ ≡ 1 under the
integral sign: Combine the lemma and the Leibniz rule to see that

0 = ∂/∂θ_j ∫ p_θ dμ = ∫ (∂c/∂θ_j)(θ) h(x) e^{θᵀ t(x)} dμ(x) + ∫ c(θ) h(x) t_j(x) e^{θᵀ t(x)} dμ(x).

The left side is zero and the equation can be rewritten as 0 = (ċ/c)(θ) + E_θ t(X).
It follows that the likelihood equations Σ_{i=1}^n ℓ̇_θ(X_i) = 0 reduce to the system of k equations

(1/n) Σ_{i=1}^n t(X_i) = E_θ t(X).

Thus, the maximum likelihood estimators are moment estimators. Their asymptotic properties
depend on the function e(θ) = E_θ t(X), which is very well behaved on the interior
of the natural parameter set. By differentiating E_θ t(X) under the expectation sign (which
is justified by the lemma), we see that its derivative matrices are given by

e′_θ = Cov_θ t(X).

The exponential family is said to be of full rank if no linear combination λᵀ t(X) is
constant with probability 1; equivalently, if the covariance matrix of t(X) is nonsingular. In
view of the preceding display, this ensures that the derivative e′_θ is strictly positive-definite
throughout the interior of the natural parameter set. Then e is one-to-one, so that there exists
at most one solution to the moment equations. (Cf. Problem 4.6.) In view of the expression
for ℓ̇_θ, the matrix −n e′_θ is the second-derivative matrix (Hessian) of the log likelihood
θ ↦ Σ_{i=1}^n ℓ_θ(X_i). Thus, a solution to the moment equations must be a point of maximum of
the log likelihood.
A solution can be shown to exist (within the natural parameter space) with probability
1 if the exponential family is "regular," or more generally "steep" (see [17]); it is then a
point of absolute maximum of the likelihood. If the true parameter is in the interior of the
parameter set, then a (unique) solution θ̂_n exists with probability tending to 1 as n → ∞,
in any case, by Theorem 4.1. Moreover, this theorem shows that the sequence √n(θ̂_n − θ₀)
is asymptotically normal with covariance matrix

e′_{θ₀}⁻¹ Cov_{θ₀} t(X) (e′_{θ₀}⁻¹)ᵀ = (Cov_{θ₀} t(X))⁻¹.
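As a concrete instance, the Poisson family is a full exponential family with t(x) = x and natural parameter η = log θ. The likelihood equation X̄ = E_η t(X) = e^η gives η̂ = log X̄, and the display above predicts asymptotic variance (Cov_η t(X))⁻¹ = e^{−η} = 1/θ for √n(η̂ − η). A Python sketch (an editorial illustration, not part of the original text) checks this by simulation:

```python
import numpy as np

rng = np.random.default_rng(3)

# Poisson(theta) as an exponential family: eta = log(theta), t(x) = x.
# The likelihood equation (1/n) sum X_i = E_eta t(X) = e^eta gives
# eta_hat = log(X-bar); the predicted asymptotic variance of
# sqrt(n)(eta_hat - eta) is (Cov_eta t(X))^{-1} = 1/theta.

theta, n, reps = 4.0, 400, 20_000
x = rng.poisson(theta, size=(reps, n))
eta_hat = np.log(x.mean(axis=1))

v = n * eta_hat.var()
print(v)   # close to 1/theta = 0.25
```

The same number also follows from the ordinary delta method applied to log X̄, since var(X̄) = θ/n.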

So far we have considered an exponential family in standard form. Many examples arise
in the form

p_θ(x) = c(θ) h(x) e^{Q(θ)ᵀ t(x)},   (4.6)

where Q = (Q₁, ..., Q_k) is a vector-valued function. If Q is one-to-one and a maximum
likelihood estimator θ̂_n exists, then by the invariance of maximum likelihood estimators
under transformations, Q(θ̂_n) is the maximum likelihood estimator for the natural parameter
Q(θ) as considered before. If the range of Q contains an open ball around Q(θ₀), then
the preceding discussion shows that the sequence √n(Q(θ̂_n) − Q(θ₀)) is asymptotically
normal. It requires another application of the delta method to obtain the limit distribution
of √n(θ̂_n − θ₀). As is typical of maximum likelihood estimators, the asymptotic covariance
matrix is the inverse of the Fisher information matrix

I_θ = E_θ ℓ̇_θ(X) ℓ̇_θ(X)ᵀ.

4.6 Theorem. Let Θ ⊂ ℝᵏ be open and let Q: Θ ↦ ℝᵏ be one-to-one and continuously
differentiable throughout Θ with nonsingular derivatives. Let the (exponential) family of
densities p_θ be given by (4.6) and be of full rank. Then the likelihood equations have a
unique solution θ̂_n with probability tending to 1 and √n(θ̂_n − θ) ⇝ N(0, I_θ⁻¹) for every θ.

Proof. According to the inverse function theorem, the range of Q is open and the inverse
map Q⁻¹ is differentiable throughout this range. Thus, as discussed previously, the delta
method ensures the asymptotic normality. It suffices to calculate the asymptotic covariance
matrix. By the preceding discussion this is equal to

Q′_θ⁻¹ (Cov_θ t(X))⁻¹ (Q′_θ⁻¹)ᵀ.

By direct calculation, the score function for the model is equal to ℓ̇_θ(x) = (ċ/c)(θ) +
Q′_θᵀ t(x). As before, the score function has mean zero, so that this can be rewritten
as ℓ̇_θ(x) = Q′_θᵀ (t(x) − E_θ t(X)). Thus, the Fisher information matrix equals I_θ =
Q′_θᵀ Cov_θ t(X) Q′_θ. This is the inverse of the asymptotic covariance matrix given in the
preceding display. ∎




Not all exponential families satisfy the conditions of the theorem. For instance, the
normal N(θ, θ²) family is an example of a "curved exponential family." The map Q(θ) =
(θ⁻², θ⁻¹) (with t(x) = (−x²/2, x)) does not fill up the natural parameter space of the
normal location-scale family but only traces out a one-dimensional curve. In such cases the
result of the theorem may still hold. In fact, the result is true for most models with "smooth
parametrizations," as is seen in Chapter 5. However, the "easy" proof of this section is not
valid.

PROBLEMS
1. Let X₁, ..., X_n be a sample from the uniform distribution on [−θ, θ]. Find the moment estimator
of θ based on X². Is it asymptotically normal? Can you think of an estimator for θ that converges
faster to the parameter?
2. Let X₁, ..., X_n be a sample from a density p_θ and f a function such that e(θ) = E_θ f(X) is
differentiable with e′(θ) = E_θ ℓ̇_θ(X) f(X) for ℓ_θ = log p_θ.
(i) Show that the asymptotic variance of the moment estimator based on f equals var_θ(f)/
cov_θ(f, ℓ̇_θ)².
(ii) Show that this is bigger than I_θ⁻¹ with equality for all θ if and only if the moment estimator
is the maximum likelihood estimator.
(iii) Show that the latter happens only for exponential family members.
3. To what extent does the result of Theorem 4.1 require that the observations are i.i.d.?
4. Let the observations be a sample of size n from the N(μ, σ²) distribution. Calculate the Fisher
information matrix for the parameter θ = (μ, σ²) and its inverse. Check directly that the maximum
likelihood estimator is asymptotically normal with zero mean and covariance matrix I_θ⁻¹.
5. Establish the formula e′_θ = Cov_θ t(X) by differentiating e(θ) = E_θ t(X) under the integral sign.
(Differentiating under the integral sign is justified by Lemma 4.5, because E_θ t(X) can be expressed
in the first derivative of c(θ)⁻¹.)
6. Suppose a function e: Θ ↦ ℝᵏ is defined and continuously differentiable on a convex subset
Θ ⊂ ℝᵏ with strictly positive-definite derivative matrix. Then e has at most one zero in Θ.
(Consider the function g(λ) = (θ₁ − θ₂)ᵀ e(λθ₁ + (1 − λ)θ₂) for given θ₁ ≠ θ₂ and 0 ≤ λ ≤ 1.
If g(0) = g(1) = 0, then there exists a point λ₀ with g′(λ₀) = 0 by the mean-value theorem.)



5
M- and Z-Estimators

This chapter gives an introduction to the consistency and asymptotic
normality of M-estimators and Z-estimators. Maximum likelihood estimators
are treated as a special case.

5.1 Introduction
Suppose that we are interested in a parameter (or "functional") () attached to the distribution
ofobservationsXJ, ... , X n. A popular method for finding an estimator On =on(X 1 , ••• , Xn)
is to maximize a criterion function of the type

(5.1)

Here me: X t-+ lR are known functions. An estimator maximizing Mn(() over e is called
an M -estimator. In this chapter we investigate the asymptotic behavior of sequences of
M -estimators.
Often the maximizing value is sought by setting a derivative (or the set of partial deriva-
tives in the multidimensional case) equal to zero. Therefore, the name M -estimator is also
used for estimators satisfying systems of equations of the type
1 n
Wn (() = - LVre(X;) = o. (5.2)
n ;=1
Here Vre are known vector-valued maps. For instance, if () is k-dimensional, then Vre
typically has k coordinate functions Vre = (Vre,I, ... , Vre,k), and (5.2) is shorthand for the
system of equations
n
LVre,j(X;) = 0, j=1,2, ... ,k.
;=1

Even though in many examples Vre,j is the jth partial derivative of some function mo, this
is irrelevant for the following. Equations, such as (5.2), defining an estimator are called
estimating equations and need not correspond to a maximization problem. In the latter case
it is probably better to call the corresponding estimators Z-estimators (for zero), but the
use of the name M -estimator is widespread.





Sometimes the maximum of the criterion function M_n is not taken or the estimating
equation does not have an exact solution. Then it is natural to use as estimator a value
that almost maximizes the criterion function or is a near zero. This yields approximate
M-estimators or Z-estimators. Estimators that are sufficiently close to being a point of
maximum or a zero often have the same asymptotic behavior.
An operator notation for taking expectations simplifies the formulas in this chapter.
We write P for the marginal law of the observations X₁, ..., X_n, which we assume to be
identically distributed. Furthermore, we write Pf for the expectation Ef(X) = ∫ f dP and
abbreviate the average n⁻¹ Σ_{i=1}^n f(X_i) to ℙ_n f. Thus ℙ_n is the empirical distribution: the
(random) discrete distribution that puts mass 1/n at every of the observations X₁, ..., X_n.
The criterion functions now take the forms

M_n(θ) = ℙ_n m_θ,   and   Ψ_n(θ) = ℙ_n ψ_θ.

We also abbreviate the centered sums n^{−1/2} Σ_{i=1}^n (f(X_i) − Pf) to 𝔾_n f, the empirical
process at f.

5.3 Example (Maximum likelihood estimators). Suppose X₁, ..., X_n have a common
density p_θ. Then the maximum likelihood estimator maximizes the likelihood ∏_{i=1}^n p_θ(X_i),
or equivalently the log likelihood

θ ↦ Σ_{i=1}^n log p_θ(X_i).

Thus, a maximum likelihood estimator is an M-estimator as in (5.1) with m_θ = log p_θ. If
the density is partially differentiable with respect to θ for each fixed x, then the maximum
likelihood estimator also solves an equation of type (5.2), with ψ_θ equal to the vector of
partial derivatives ℓ̇_{θ,j} = ∂/∂θ_j log p_θ. The vector-valued function ℓ̇_θ is known as the score
function of the model.
The definition (5.1) of an M-estimator may apply in cases where (5.2) does not. For
instance, if X₁, ..., X_n are i.i.d. according to the uniform distribution on [0, θ], then it
makes sense to maximize the log likelihood

θ ↦ Σ_{i=1}^n (log 1_{[0,θ]}(X_i) − log θ).

(Define log 0 = −∞.) However, this function is not smooth in θ and there exists no natural
version of (5.2). Thus, in this example the definition as the location of a maximum is more
fundamental than the definition as a zero. □
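The non-smooth uniform likelihood can be maximized by brute force, and the maximizer is the largest observation. A Python sketch (an editorial illustration, not part of the original text; the grid search is only for demonstration):

```python
import numpy as np

rng = np.random.default_rng(6)

def loglik(theta, x):
    # log likelihood for the uniform[0, theta] model, with log 0 = -inf
    if theta < x.max():
        return -np.inf          # some indicator 1_[0,theta](X_i) vanishes
    return -len(x) * np.log(theta)

x = rng.uniform(0.0, 3.0, size=50)
grid = np.linspace(0.01, 6.0, 10_000)
best = grid[np.argmax([loglik(t, x) for t in grid])]
print(best, x.max())   # the maximizer is (up to grid resolution) max X_i
```

The log likelihood is −∞ for θ below max X_i and strictly decreasing above it, so θ̂_n = max X_i; no estimating equation of type (5.2) describes this estimator.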

5.4 Example (Location estimators). Let X₁, ..., X_n be a random sample of real-valued
observations and suppose we want to estimate the location of their distribution. "Location"
is a vague term; it could be made precise by defining it as the mean or median, or the center
of symmetry of the distribution if this happens to be symmetric. Two examples of location
estimators are the sample mean and the sample median. Both are Z-estimators, because
they solve the equations

Σ_{i=1}^n (X_i − θ) = 0,   and   Σ_{i=1}^n sign(X_i − θ) = 0,




respectively.† Both estimating equations involve functions of the form ψ(x − θ) for a
function ψ that is monotone and odd around zero. It seems reasonable to study estimators
that solve a general equation of the type

Σ_{i=1}^n ψ(X_i − θ) = 0.

We can consider a Z-estimator defined by this equation a "location" estimator, because it
has the desirable property of location equivariance: If the observations X_i are shifted by a
fixed amount a, then so is the estimate, for θ̂ + a solves Σ_{i=1}^n ψ(X_i + a − θ) = 0 if θ̂ solves
the original equation.
Popular examples are the Huber estimators corresponding to the functions

ψ(x) = −k if x ≤ −k;   x if |x| ≤ k;   k if x ≥ k.

The Huber estimators were motivated by studies in robust statistics concerning the influence
of extreme data points on the estimate. The exact values of the largest and smallest
observations have very little influence on the value of the median, but a proportional influence
on the mean. Therefore, the sample mean is considered nonrobust against outliers. If
the extreme observations are thought to be rather unreliable, it is certainly an advantage to
limit their influence on the estimate, but the median may be too successful in this respect.
Depending on the value of k, the Huber estimators behave more like the mean (large k) or
more like the median (small k) and thus bridge the gap between the nonrobust mean and
very robust median.
Another example are the quantiles. A pth sample quantile is roughly a point θ such that
pn observations are less than θ and (1 − p)n observations are greater than θ. The precise
definition has to take into account that the value pn may not be an integer. One possibility
is to call a pth sample quantile any θ̂ that solves the inequalities

−1 < Σ_{i=1}^n ((1 − p) 1{X_i < θ} − p 1{X_i > θ}) < 1.   (5.5)

This is an approximate M-estimator for ψ(x) = 1 − p, 0, −p if x < 0, x = 0, or x > 0,
respectively. The "approximate" refers to the inequalities: It is required that the value of
the estimating equation be inside the interval (−1, 1), rather than exactly zero. This may
seem a rather wide tolerance interval for a zero. However, all solutions turn out to have the
same asymptotic behavior. In any case, except for special combinations of p and n, there is
no hope of finding an exact zero, because the criterion function is discontinuous with jumps
at the observations. (See Figure 5.1.) If no observations are tied, then all jumps are of size
one and at least one solution θ̂ to the inequalities exists. If tied observations are present, it
may be necessary to increase the interval (−1, 1) to ensure the existence of solutions. Note
that the present ψ function is monotone, as in the previous examples, but not symmetric
about zero (for p ≠ 1/2).

† The sign function is defined as sign(x) = −1, 0, 1 if x < 0, x = 0 or x > 0, respectively. Also x⁺ means
x ∨ 0 = max(x, 0). For the median we assume that there are no tied observations (in the middle).
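The inequalities (5.5) are easy to verify numerically at an empirical quantile. A Python sketch (an editorial illustration, not part of the original text; the gamma sample mimics the setting of Figure 5.1):

```python
import numpy as np

rng = np.random.default_rng(4)

def psi_sum(x, theta, p):
    # The sum appearing in the inequalities (5.5)
    return np.sum((1 - p) * (x < theta) - p * (x > theta))

p, n = 0.8, 15
x = rng.gamma(8.0, 1.0, size=n)
theta = np.sort(x)[11]   # the 12th order statistic; 12 = ceil(p * n)

val = psi_sum(x, theta, p)
print(theta, val)        # 11 points below, 3 above: val = 0.2*11 - 0.8*3
```

Since the sample is continuous there are no ties; the sum equals 0.2·11 − 0.8·3 = −0.2, which lies strictly between −1 and 1, so this order statistic satisfies (5.5).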




Figure 5.1. The functions θ ↦ Ψ_n(θ) for the 80% quantile and the Huber estimator for samples of
size 15 from the gamma(8,1) and standard normal distribution, respectively.

All the estimators considered so far can also be defined as a solution of a maximization
problem. Mean, median, Huber estimators, and quantiles minimize Σ_{i=1}^n m(X_i − θ) for m
equal to x², |x|, x² 1{|x| ≤ k} + (2k|x| − k²) 1{|x| > k}, and (1 − p)x⁻ + px⁺, respectively. □
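The Huber estimating equation is monotone in θ, so a solution can be found by bisection. The following Python sketch (an editorial illustration, not part of the original text; k = 1.345 is a conventional tuning choice, not taken from the text) contrasts mean, median, and Huber estimate on a contaminated sample:

```python
import numpy as np

rng = np.random.default_rng(5)

def huber_psi(x, k):
    return np.clip(x, -k, k)   # equals -k, x, or k in the three regimes above

def huber_estimate(x, k, tol=1e-10):
    # sum_i psi(X_i - theta) is decreasing in theta, so bisect for its zero
    lo, hi = x.min(), x.max()
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if huber_psi(x - mid, k).sum() > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

x = rng.standard_normal(100)
x[:5] += 50.0                  # plant five gross outliers
print(x.mean(), np.median(x), huber_estimate(x, k=1.345))
```

The five outliers drag the mean far from zero, while the median and the Huber estimate stay near the center of the uncontaminated distribution, illustrating the bounded influence of the clipped ψ.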

5.2 Consistency
If the estimator θ̂n is used to estimate the parameter θ, then it is certainly desirable that
the sequence θ̂n converges in probability to θ. If this is the case for every possible value
of the parameter, then the sequence of estimators is called asymptotically consistent. For
instance, the sample mean X̄n is asymptotically consistent for the population mean EX
(provided the population mean exists). This follows from the law of large numbers. Not
surprisingly this extends to many other sample characteristics. For instance, the sample
median is consistent for the population median, whenever this is well defined. What
can be said about M-estimators in general? We shall assume that the set of possible
parameters is a metric space, and write d for the metric. Then we wish to prove that
d(θ̂n, θ0) →P 0 for some value θ0, which depends on the underlying distribution of the
observations.
Suppose that the M-estimator θ̂n maximizes the random criterion function

θ ↦ Mn(θ).

Clearly, the "asymptotic value" of θ̂n depends on the asymptotic behavior of the functions
Mn. Under suitable normalization there typically exists a deterministic "asymptotic criterion
function" θ ↦ M(θ) such that

Mn(θ) →P M(θ),    every θ.    (5.6)

For instance, if Mn(θ) is an average of the form ℙn mθ as in (5.1), then the law of large
numbers gives this result with M(θ) = Pmθ, provided this expectation exists.
It seems reasonable to expect that the maximizer θ̂n of Mn converges to the maximizing
value θ0 of M. This is what we wish to prove in this section, and we say that θ̂n is
(asymptotically) consistent for θ0. However, the convergence (5.6) is too weak to ensure


Figure 5.2. Example of a function whose point of maximum is not well separated.

the convergence of θ̂n. Because the value θ̂n depends on the whole function θ ↦ Mn(θ),
an appropriate form of "functional convergence" of Mn to M is needed, strengthening the
pointwise convergence (5.6). There are several possibilities. In this section we first discuss
an approach based on uniform convergence of the criterion functions. Admittedly, the
assumption of uniform convergence is too strong for some applications and it is sometimes
not easy to verify, but the approach illustrates the general idea.
Given an arbitrary random function θ ↦ Mn(θ), consider estimators θ̂n that nearly
maximize Mn, that is,

Mn(θ̂n) ≥ sup_θ Mn(θ) − oP(1).

Then certainly Mn(θ̂n) ≥ Mn(θ0) − oP(1), which turns out to be enough to ensure con-
sistency. It is assumed that the sequence Mn converges to a nonrandom map M: Θ ↦ ℝ̄.
Condition (5.8) of the following theorem requires that this map attains its maximum at a
unique point θ0, and only parameters close to θ0 may yield a value of M(θ) close to the
maximum value M(θ0). Thus, θ0 should be a well-separated point of maximum of M.
Figure 5.2 shows a function that does not satisfy this requirement.

5.7 Theorem. Let Mn be random functions and let M be a fixed function of θ such that
for every ε > 0†

sup_{θ∈Θ} |Mn(θ) − M(θ)| →P 0,
sup_{θ: d(θ,θ0) ≥ ε} M(θ) < M(θ0).    (5.8)

Then any sequence of estimators θ̂n with Mn(θ̂n) ≥ Mn(θ0) − oP(1) converges in proba-
bility to θ0.

† Some of the expressions in this display may be nonmeasurable. Then the probability statements are understood
in terms of outer measure.


Proof. By the property of θ̂n, we have Mn(θ̂n) ≥ Mn(θ0) − oP(1). Because the uniform
convergence of Mn to M implies the convergence Mn(θ0) →P M(θ0), the right side equals
M(θ0) − oP(1). It follows that Mn(θ̂n) ≥ M(θ0) − oP(1), whence

M(θ0) − M(θ̂n) ≤ Mn(θ̂n) − M(θ̂n) + oP(1) ≤ sup_θ |Mn − M|(θ) + oP(1) →P 0,

by the first part of assumption (5.8). By the second part of assumption (5.8), there exists for
every ε > 0 a number η > 0 such that M(θ) < M(θ0) − η for every θ with d(θ, θ0) ≥ ε.
Thus, the event {d(θ̂n, θ0) ≥ ε} is contained in the event {M(θ̂n) < M(θ0) − η}. The
probability of the latter event converges to 0, in view of the preceding display. ∎

Instead of through maximization, an M-estimator may be defined as a zero of a criterion
function θ ↦ Ψn(θ). It is again reasonable to assume that the sequence of criterion
functions converges to a fixed limit:

Ψn(θ) →P Ψ(θ),    every θ.

Then it may be expected that a sequence of (approximate) zeros of Ψn converges in prob-
ability to a zero of Ψ. This is true under similar restrictions as in the case of maximizing
M-estimators. In fact, this can be deduced from the preceding theorem by noting that a
zero of Ψn maximizes the function θ ↦ −‖Ψn(θ)‖.
5.9 Theorem. Let Ψn be random vector-valued functions and let Ψ be a fixed vector-
valued function of θ such that for every ε > 0

sup_{θ∈Θ} ‖Ψn(θ) − Ψ(θ)‖ →P 0,
inf_{θ: d(θ,θ0) ≥ ε} ‖Ψ(θ)‖ > 0 = ‖Ψ(θ0)‖.

Then any sequence of estimators θ̂n such that Ψn(θ̂n) = oP(1) converges in probability
to θ0.

Proof. This follows from the preceding theorem, on applying it to the functions Mn(θ) =
−‖Ψn(θ)‖ and M(θ) = −‖Ψ(θ)‖. ∎

The conditions of both theorems consist of a stochastic and a deterministic part. The
deterministic condition can be verified by drawing a picture of the graph of the function. A
helpful general observation is that, for a compact set Θ and continuous function M or Ψ,
uniqueness of θ0 as a maximizer or zero implies the condition. (See Problem 5.27.)
For Mn(θ) or Ψn(θ) equal to averages as in (5.1) or (5.2) the uniform convergence
required by the stochastic condition is equivalent to the set of functions {mθ: θ ∈ Θ}
or {ψθ,j: θ ∈ Θ, j = 1, ..., k} being Glivenko-Cantelli. Glivenko-Cantelli classes of
functions are discussed in Chapter 19. One simple set of sufficient conditions is that Θ be
compact, that the functions θ ↦ mθ(x) or θ ↦ ψθ(x) are continuous for every x, and that
they are dominated by an integrable function.
Uniform convergence of the criterion functions as in the preceding theorems is much
stronger than needed for consistency. The following lemma is one of the many possibilities
to replace the uniformity by other assumptions.


5.10 Lemma. Let Θ be a subset of the real line and let Ψn be random functions and
Ψ a fixed function of θ such that Ψn(θ) → Ψ(θ) in probability for every θ. Assume that
each map θ ↦ Ψn(θ) is continuous and has exactly one zero θ̂n, or is nondecreasing with
Ψn(θ̂n) = oP(1). Let θ0 be a point such that Ψ(θ0 − ε) < 0 < Ψ(θ0 + ε) for every ε > 0.
Then θ̂n →P θ0.

Proof. If the map θ ↦ Ψn(θ) is continuous and has a unique zero at θ̂n, then

{Ψn(θ0 − ε) < 0 < Ψn(θ0 + ε)} ⊂ {θ0 − ε < θ̂n < θ0 + ε}.

The probability of the left side converges to one, because Ψn(θ0 ± ε) → Ψ(θ0 ± ε) in probability. Thus the
probability of the right side converges to one as well, and θ̂n is consistent.
If the map θ ↦ Ψn(θ) is nondecreasing and θ̂n is a zero, then the same argument is valid.
More generally, if θ ↦ Ψn(θ) is nondecreasing, then Ψn(θ0 − ε) < −η and θ̂n ≤ θ0 − ε
imply Ψn(θ̂n) < −η, which has probability tending to zero for every η > 0 if θ̂n is a near
zero. This and a similar argument applied to the right tail shows that, for every ε, η > 0,

P(Ψn(θ0 − ε) < −η, Ψn(θ0 + ε) > η) ≤ P(θ0 − ε < θ̂n < θ0 + ε) + o(1).

For 2η equal to the smallest of the numbers −Ψ(θ0 − ε) and Ψ(θ0 + ε) the left side still
converges to one. ∎

5.11 Example (Median). The sample median θ̂n is a (near) zero of the map θ ↦ Ψn(θ) =
n⁻¹ ∑ᵢ₌₁ⁿ sign(Xi − θ). By the law of large numbers,

Ψn(θ) →P Ψ(θ) = E sign(X − θ) = P(X > θ) − P(X < θ),

for every fixed θ. Thus, we expect that the sample median converges in probability to a
point θ0 such that P(X > θ0) = P(X < θ0): a population median.
This can be proved rigorously by applying Theorem 5.7 or 5.9. However, even though
the conditions of the theorems are satisfied, they are not entirely trivial to verify. (The
uniform convergence of Ψn to Ψ is proved essentially in Theorem 19.1.) In this case it
is easier to apply Lemma 5.10. Because the functions θ ↦ Ψn(θ) are nonincreasing, it
follows that θ̂n →P θ0 provided that Ψ(θ0 − ε) > 0 > Ψ(θ0 + ε) for every ε > 0. This is
the case if the population median is unique: P(X < θ0 − ε) < ½ < P(X < θ0 + ε) for all
ε > 0. □
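As a numerical companion to this example (our own sketch, not from the book), one can locate a near zero of the nonincreasing step function Ψn by bisection and watch it approach the population median as n grows. Here we use Exponential(1) data, whose median is θ0 = log 2; the solver and sample sizes are our choices.

```python
import random

def sign(u):
    return (u > 0) - (u < 0)

def psi_n(theta, xs):
    return sum(sign(xi - theta) for xi in xs) / len(xs)

def near_zero(xs):
    """Bisection for a (near) zero of the nonincreasing map theta -> Psi_n."""
    lo, hi = min(xs), max(xs)
    for _ in range(60):
        mid = (lo + hi) / 2
        if psi_n(mid, xs) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

random.seed(2)
theta0 = 0.6931471805599453   # log 2, the Exponential(1) median
errors = []
for n in (100, 10_000):
    xs = [random.expovariate(1.0) for _ in range(n)]
    errors.append(abs(near_zero(xs) - theta0))
```

The error for n = 10 000 should be an order of magnitude smaller than the typical error at n = 100, in line with consistency.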

*5.2.1 Wald's Consistency Proof


Consider the situation that, for a random sample of variables X1, ..., Xn,

Mn(θ) = ℙn mθ,    M(θ) = Pmθ.

In this subsection we consider an alternative set of conditions under which the maximizer θ̂n
of the process Mn converges in probability to a point of maximum θ0 of the function M. This
"classical" approach to consistency was taken by Wald in 1949 for maximum likelihood
estimators. It works best if the parameter set Θ is compact. If not, then the argument must


be complemented by a proof that the estimators are in a compact set eventually or be applied
to a suitable compactification of the parameter set.
Assume that the map θ ↦ mθ(x) is upper-semicontinuous for almost all x: For every θ,

lim sup_{θ′→θ} mθ′(x) ≤ mθ(x),    a.s.    (5.12)

(The exceptional set of x may depend on θ.) Furthermore, assume that for every sufficiently
small ball U ⊂ Θ the function x ↦ sup_{θ∈U} mθ(x) is measurable and satisfies

P sup_{θ∈U} mθ < ∞.    (5.13)

Typically, the map θ ↦ Pmθ has a unique global maximum at a point θ0, but we shall
allow multiple points of maximum, and write Θ0 for the set {θ0 ∈ Θ: Pmθ0 = sup_θ Pmθ}
of all points at which M attains its global maximum. The set Θ0 is assumed not empty. The
maps mθ: 𝒳 ↦ ℝ̄ are allowed to take the value −∞, but the following theorem assumes
implicitly that at least Pmθ0 is finite.

5.14 Theorem. Let θ ↦ mθ(x) be upper-semicontinuous for almost all x and let (5.13) be
satisfied. Then for any estimators θ̂n such that Mn(θ̂n) ≥ Mn(θ0) − oP(1) for some θ0 ∈ Θ0,
for every ε > 0 and every compact set K ⊂ Θ,

P(d(θ̂n, Θ0) ≥ ε and θ̂n ∈ K) → 0.
Proof. If the function θ ↦ Pmθ is identically −∞, then Θ0 = Θ, and there is nothing
to prove. Hence, we may assume that there exists θ0 ∈ Θ0 such that Pmθ0 > −∞, whence
P|mθ0| < ∞ by (5.13).
Fix some θ and let Ul ↓ θ be a decreasing sequence of open balls around θ of diameter
converging to zero. Write mU(x) for sup_{θ∈U} mθ(x). The sequence mUl is decreasing
and greater than mθ for every l. Combination with (5.12) yields that mUl ↓ mθ almost
surely. In view of (5.13), we can apply the monotone convergence theorem and obtain that
PmUl ↓ Pmθ (which may be −∞).
For θ ∉ Θ0, we have Pmθ < Pmθ0. Combine this with the preceding paragraph to see
that for every θ ∉ Θ0 there exists an open ball Uθ around θ with PmUθ < Pmθ0. The set
B = {θ ∈ K: d(θ, Θ0) ≥ ε} is compact and is covered by the balls {Uθ: θ ∈ B}. Let
Uθ1, ..., Uθp be a finite subcover. Then, by the law of large numbers,

sup_{θ∈B} ℙn mθ ≤ max_{1≤j≤p} ℙn mUθj →P max_{1≤j≤p} PmUθj < Pmθ0.

If θ̂n ∈ B, then sup_{θ∈B} ℙn mθ is at least ℙn mθ̂n, which by definition of θ̂n is at least ℙn mθ0 −
oP(1) = Pmθ0 − oP(1), by the law of large numbers. Thus

{θ̂n ∈ B} ⊂ {sup_{θ∈B} ℙn mθ ≥ Pmθ0 − oP(1)}.

In view of the preceding display the probability of the event on the right side converges to
zero as n → ∞. ∎

Even in simple examples, condition (5.13) can be restrictive. One possibility for
relaxation is to divide the n observations in groups of approximately the same size. Then (5.13)


may be replaced by, for some k and every k ≤ l < 2k,

P^l sup_{θ∈U} ∑ᵢ₌₁^l mθ(xi) < ∞.    (5.15)

Surprisingly enough, this simple device may help. For instance, under condition (5.13)
the preceding theorem does not apply to yield the asymptotic consistency of the maximum
likelihood estimator of (μ, σ) based on a random sample from the N(μ, σ²) distribution
(unless we restrict the parameter set for σ), but under the relaxed condition it does (with
k = 2). (See Problem 5.25.) The proof of the theorem under (5.15) remains almost the
same. Divide the n observations in groups of k observations and, possibly, a remainder
group of l observations; next, apply the law of large numbers to the approximately n/k
group sums.

5.16 Example (Cauchy likelihood). The maximum likelihood estimator for θ based on a
random sample from the Cauchy distribution with location θ maximizes the map θ ↦ ℙn mθ
for

mθ(x) = −log(1 + (x − θ)²).

The natural parameter set ℝ is not compact, but we can enlarge it to the extended real line,
provided that we can define mθ in a reasonable way for θ = ±∞. To have the best chance
of satisfying (5.13), we opt for the minimal extension, which in order to satisfy (5.12) is

m₋∞(x) = lim sup_{θ→−∞} mθ(x) = −∞;    m∞(x) = lim sup_{θ→∞} mθ(x) = −∞.

These infinite values should not worry us: They are permitted in the preceding theorem.
Moreover, because we maximize θ ↦ ℙn mθ, they ensure that the estimator θ̂n never takes
the values ±∞, which is excellent.
We apply Wald's theorem with Θ = ℝ̄, equipped with, for instance, the metric d(θ1, θ2) =
|arctan θ1 − arctan θ2|. Because the functions θ ↦ mθ(x) are continuous and nonpositive, the
conditions are trivially satisfied. Thus, taking K = ℝ̄, we obtain that d(θ̂n, Θ0) →P 0. This
conclusion is valid for any underlying distribution P of the observations for which the set
Θ0 is nonempty, because so far we have used the Cauchy likelihood only to motivate mθ.
To conclude that the maximum likelihood estimator in a Cauchy location model is con-
sistent, it suffices to show that Θ0 = {θ0} if P is the Cauchy distribution with center θ0. This
follows most easily from the identifiability of this model, as discussed in Lemma 5.35. □
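A hypothetical numerical illustration of this example (ours, not the book's): maximizing the average Cauchy log-likelihood θ ↦ ℙn mθ by a coarse grid search with local refinement recovers the center of a Cauchy sample. The solver, grid sizes, and sample size are all our choices.

```python
import math
import random

def log_lik(theta, xs):
    """Average of m_theta(x) = -log(1 + (x - theta)^2) over the sample."""
    return -sum(math.log1p((xi - theta) ** 2) for xi in xs) / len(xs)

def mle(xs, lo=-10.0, hi=10.0):
    """Grid search over [lo, hi], refined twice around the best grid point."""
    best = lo
    for _ in range(3):
        span = (hi - lo) / 200
        grid = [lo + span * k for k in range(201)]
        best = max(grid, key=lambda t: log_lik(t, xs))
        lo, hi = best - 2 * span, best + 2 * span
    return best

random.seed(3)
theta0 = 1.0
# Cauchy(theta0) draws by inverse-cdf sampling
xs = [theta0 + math.tan(math.pi * (random.random() - 0.5)) for _ in range(3000)]
theta_hat = mle(xs)
```

Despite the heavy tails (the Cauchy distribution has no mean), the location MLE concentrates near θ0.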

5.17 Example (Current status data). Suppose that a "death" that occurs at time T is only
observed to have taken place or not at a known "check-up time" C. We model the obser-
vations as a random sample X1, ..., Xn from the distribution of X = (C, Δ), Δ = 1{T ≤ C},
where T and C are independent random variables with completely unknown distribution
functions F and G, respectively. The purpose is to estimate the "survival distribution"
1 − F.
If G has a density g with respect to Lebesgue measure λ, then X = (C, Δ) has a density

pF(c, δ) = (δF(c) + (1 − δ)(1 − F)(c)) g(c)


with respect to the product of λ and counting measure on the set {0, 1}. A maximum like-
lihood estimator for F can be defined as the distribution function F̂ that maximizes the
likelihood

F ↦ ∏ᵢ₌₁ⁿ (Δi F(Ci) + (1 − Δi)(1 − F)(Ci))

over all distribution functions on [0, ∞). Because this only involves the numbers F(C1),
..., F(Cn), the maximizer of this expression is not unique, but some thought shows that
there is a unique maximizer F̂ that concentrates on (a subset of) the observation times
C1, ..., Cn. This is commonly used as an estimator.
We can show the consistency of this estimator by Wald's theorem. By its definition F̂
maximizes the function F ↦ ℙn log pF, but the consistency proof proceeds in a smoother
way by setting

mF = log (pF / p₍F+F0₎/2) = log (2pF / (pF + pF0)).

Because the likelihood is bigger at F̂ than it is at ½F̂ + ½F0, it follows that ℙn mF̂ ≥ 0 =
ℙn mF0. (It is not claimed that F̂ maximizes F ↦ ℙn mF; this is not true.)
Condition (5.13) is satisfied trivially, because mF ≤ log 2 for every F. We can equip the
set of all distribution functions with the topology of weak convergence. If we restrict the
parameter set to distributions on a compact interval [0, τ], then the parameter set is compact
by Prohorov's theorem.† The map F ↦ mF(c, δ) is continuous at F, relative to the weak
topology, for every (c, δ) such that c is a continuity point of F. Under the assumption that
G has a density, this includes almost every (c, δ), for every given F. Thus, Theorem 5.14
shows that F̂n converges under F0 in probability to the set ℱ0 of all distribution functions
that maximize the map F ↦ PF0 mF, provided F0 ∈ ℱ0. This set always contains F0, but
it does not necessarily reduce to this single point. For instance, if the density g is zero
on an interval [a, b], then we receive no information concerning deaths inside the interval
[a, b], and there can be no hope that F̂n converges to F0 on [a, b]. In that case, F0 is not
"identifiable" on the interval [a, b].
We shall show that ℱ0 is the set of all F such that F = F0 almost everywhere according
to G. Thus, the sequence F̂n is consistent for F0 "on the set of time points that have a
positive probability of occurring."
Because pF = pF0 under PF0 if and only if F = F0 almost everywhere according to G, it
suffices to prove that, for every pair of probability densities p and p0, P0 log 2p/(p + p0) ≤ 0,
with equality if and only if p = p0 almost surely under P0. If P0(p = 0) > 0, then
log 2p/(p + p0) = −∞ with positive probability and hence, because the function is
bounded above, P0 log 2p/(p + p0) = −∞. Thus we may assume that P0(p = 0) = 0.
Then, with f(u) = −u log(½ + ½u),

P0 log (2p / (p + p0)) = P f(p0/p) ≤ f(P(p0/p)) = f(1) = 0,

† Alternatively, consider all probability distributions on the compactification [0, ∞], again equipped with the
weak topology.


by Jensen’s inequality and the concavity of f , with equality only if p0 / p = 1 almost surely
under P, and then also under P0 . This completes the proof. 
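The maximizer F̂ can in fact be computed explicitly: evaluated at the ordered check-up times, it coincides with the isotonic (nondecreasing) least-squares fit to the indicators Δi sorted by Ci, a standard fact not proved in this section, which the pool-adjacent-violators algorithm delivers. The sketch below (our own, with our own simulation settings) computes this estimator and measures its average distance to the true F0.

```python
import math
import random

def pava(y):
    """Nondecreasing least-squares fit to y (unit weights) via
    pool-adjacent-violators."""
    level, weight = [], []
    for v in y:
        level.append(float(v))
        weight.append(1)
        # merge adjacent blocks while monotonicity is violated
        while len(level) > 1 and level[-2] >= level[-1]:
            w = weight[-2] + weight[-1]
            m = (weight[-2] * level[-2] + weight[-1] * level[-1]) / w
            level[-2:] = [m]
            weight[-2:] = [w]
    out = []
    for m, w in zip(level, weight):
        out.extend([m] * w)
    return out

random.seed(4)
n = 2000
# T ~ Exp(1) death times, C ~ Uniform(0, 3) check-up times; observe (C, Delta)
pairs = sorted((random.uniform(0.0, 3.0), random.expovariate(1.0))
               for _ in range(n))
deltas = [1.0 if t <= c else 0.0 for c, t in pairs]
F_hat = pava(deltas)          # NPMLE of F at the ordered check-up times

# mean absolute error against the true F0(c) = 1 - exp(-c)
mae = sum(abs(F_hat[i] - (1 - math.exp(-pairs[i][0])))
          for i in range(n)) / n
```

The fitted values are automatically a nondecreasing sequence in [0, 1], as a distribution function evaluated at increasing times must be.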

5.3 Asymptotic Normality


Suppose a sequence of estimators θ̂n is consistent for a parameter θ that ranges over an
open subset of a Euclidean space. The next question of interest concerns the order at which
the discrepancy θ̂n − θ converges to zero. The answer depends on the specific situation, but
for estimators based on n replications of an experiment the order is often n −1/2 . Then
multiplication with the inverse of this rate creates a proper balance, and the sequence

n(θ̂n −θ ) converges in distribution, most often to a normal distribution. This is interesting
from a theoretical point of view. It also makes it possible to obtain approximate confidence
sets. In this section we derive the asymptotic normality of M-estimators.
We can use a characterization of M-estimators either by maximization or by solving
estimating equations. Consider the second possibility. Let X 1 , . . . , X n be a sample from
some distribution P, and let a random and a “true” criterion function be of the form:

Ψn(θ) ≡ (1/n) ∑ᵢ₌₁ⁿ ψθ(Xi) = ℙn ψθ,    Ψ(θ) = Pψθ.

Assume that the estimator θ̂n is a zero of Ψn and converges in probability to a zero θ0 of Ψ.
Because θ̂n →P θ0, it makes sense to expand Ψn(θ̂n) in a Taylor series around θ0. Assume
for simplicity that θ is one-dimensional. Then

0 = Ψn(θ̂n) = Ψn(θ0) + (θ̂n − θ0) Ψ̇n(θ0) + ½(θ̂n − θ0)² Ψ̈n(θ̃n),

where θ̃n is a point between θ̂n and θ0. This can be rewritten as

√n(θ̂n − θ0) = −√n Ψn(θ0) / (Ψ̇n(θ0) + ½(θ̂n − θ0) Ψ̈n(θ̃n)).    (5.18)
If Pψθ0² is finite, then the numerator −√n Ψn(θ0) = −n^(−1/2) ∑ ψθ0(Xi) is asymptotically
normal by the central limit theorem. The asymptotic mean and variance are Pψθ0 =
Ψ(θ0) = 0 and Pψθ0², respectively. Next consider the denominator. The first term Ψ̇n(θ0) is
an average and can be analyzed by the law of large numbers: Ψ̇n(θ0) →P Pψ̇θ0, provided the
expectation exists. The second term in the denominator is a product of θ̂n − θ0 = oP(1) and
Ψ̈n(θ̃n) and converges in probability to zero under the reasonable condition that Ψ̈n(θ̃n)
(which is also an average) is OP(1). Together with Slutsky's lemma, these observations
yield

√n(θ̂n − θ0) ⇝ N(0, Pψθ0² / (Pψ̇θ0)²).    (5.19)

The preceding derivation can be made rigorous by imposing appropriate conditions, often
called "regularity conditions." The only real challenge is to show that Ψ̈n(θ̃n) = OP(1)
(see Problem 5.20 or section 5.6).
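The limit (5.19) can be checked by simulation. The following sketch (ours, not from the book) uses the smooth estimating function ψθ(x) = tanh(x − θ) with standard normal data, for which θ0 = 0 by symmetry, and compares the Monte Carlo variance of √n(θ̂n − θ0) with a plug-in value of Pψθ0²/(Pψ̇θ0)², where ψ̇θ(x) = −sech²(x − θ); all numerical settings are our own.

```python
import math
import random

def psi(theta, x):
    return math.tanh(x - theta)

def solve(xs):
    """Bisection for the zero of the decreasing map theta -> sum psi."""
    lo, hi = min(xs), max(xs)
    for _ in range(40):
        mid = (lo + hi) / 2
        if sum(psi(mid, x) for x in xs) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

random.seed(5)
n, reps = 400, 300
roots = []
for _ in range(reps):
    xs = [random.gauss(0.0, 1.0) for _ in range(n)]
    roots.append(math.sqrt(n) * solve(xs))      # sqrt(n)(theta_hat - 0)
mean_root = sum(roots) / reps
mc_var = sum(r * r for r in roots) / reps

# plug-in estimates of P psi^2 and P psi-dot at theta_0 = 0
big = [random.gauss(0.0, 1.0) for _ in range(200_000)]
num = sum(math.tanh(x) ** 2 for x in big) / len(big)
den = sum(-1.0 / math.cosh(x) ** 2 for x in big) / len(big)
asym_var = num / den ** 2
```

The Monte Carlo variance should agree with the sandwich-type ratio within simulation error.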
The derivation can be extended to higher-dimensional parameters. For a k-dimensional
parameter, we use k estimating equations. Then the criterion functions are maps Ψn: ℝ^k →

ℝ^k and the derivatives Ψ̇n(θ0) are (k × k)-matrices that converge to the (k × k)-matrix Pψ̇θ0
with entries P(∂/∂θj)ψθ0,i. The final statement becomes

√n(θ̂n − θ0) ⇝ N(0, (Pψ̇θ0)⁻¹ Pψθ0 ψθ0ᵀ ((Pψ̇θ0)⁻¹)ᵀ).    (5.20)

Here the invertibility of the matrix Pψ̇θ0 is a condition.


In the preceding derivation it is implicitly understood that the function θ ↦ ψθ(x)
possesses two continuous derivatives with respect to the parameter, for every x. This is true
in many examples but fails, for instance, for the function ψθ(x) = sign(x − θ), which yields
the median. Nevertheless, the median is asymptotically normal. That such a simple, but
important, example cannot be treated by the preceding approach has motivated much effort
to derive the asymptotic normality of M-estimators by more refined methods. One result
is the following theorem, which assumes less than one derivative (a Lipschitz condition)
instead of two derivatives.

5.21 Theorem. For each θ in an open subset of Euclidean space, let x ↦ ψθ(x) be a mea-
surable vector-valued function such that, for every θ1 and θ2 in a neighborhood of θ0 and
a measurable function ψ̇ with Pψ̇² < ∞,

‖ψθ1(x) − ψθ2(x)‖ ≤ ψ̇(x) ‖θ1 − θ2‖.

Assume that P‖ψθ0‖² < ∞ and that the map θ ↦ Pψθ is differentiable at a zero θ0, with
nonsingular derivative matrix Vθ0. If ℙn ψθ̂n = oP(n^(−1/2)) and θ̂n →P θ0, then

√n(θ̂n − θ0) = −Vθ0⁻¹ (1/√n) ∑ᵢ₌₁ⁿ ψθ0(Xi) + oP(1).

In particular, the sequence √n(θ̂n − θ0) is asymptotically normal with mean zero and
covariance matrix Vθ0⁻¹ Pψθ0 ψθ0ᵀ (Vθ0⁻¹)ᵀ.

Proof. For a fixed measurable function f, we abbreviate √n(ℙn − P)f to 𝔾n f, the
empirical process evaluated at f. The consistency of θ̂n and the Lipschitz condition on the
maps θ ↦ ψθ imply that

𝔾n ψθ̂n − 𝔾n ψθ0 →P 0.    (5.22)

For a nonrandom sequence θn this is immediate from the fact that the means of these variables
are zero, while the variances are bounded by P‖ψθn − ψθ0‖² ≤ Pψ̇² ‖θn − θ0‖² and hence
converge to zero. A proof for estimators θ̂n under the present mild conditions takes more
effort. The appropriate tools are developed in Chapter 19. In Example 19.7 it is seen that
the functions ψθ form a Donsker class. Next, (5.22) follows from Lemma 19.24. Here we
accept the convergence as a fact and give the remainder of the proof.
By the definitions of θ̂n and θ0, we can rewrite 𝔾n ψθ̂n as √n P(ψθ0 − ψθ̂n) + oP(1).
Combining this with the delta method (or Lemma 2.12) and the differentiability of the map
θ ↦ Pψθ, we find that

−𝔾n ψθ0 + oP(1) = √n P(ψθ̂n − ψθ0) = √n Vθ0 (θ̂n − θ0) + oP(√n ‖θ̂n − θ0‖).


In particular, by the invertibility of the matrix Vθ0,

√n ‖θ̂n − θ0‖ ≲ ‖√n Vθ0 (θ̂n − θ0)‖ = OP(1) + oP(√n ‖θ̂n − θ0‖).

This implies that θ̂n is √n-consistent: The left side is bounded in probability. Inserting this
in the previous display, we obtain that √n Vθ0 (θ̂n − θ0) = −𝔾n ψθ0 + oP(1). We conclude the
proof by taking the inverse Vθ0⁻¹ left and right. Because matrix multiplication is a continuous
map, the inverse of the remainder term still converges to zero in probability. ∎

The preceding theorem is a reasonable compromise between simplicity and general
applicability, but, unfortunately, it does not cover the sample median. Because the function
θ ↦ sign(x − θ) is not Lipschitz, the Lipschitz condition is apparently still stronger
than necessary. Inspection of the proof shows that it is used only to ensure (5.22). It is
seen in Lemma 19.24 that (5.22) can be ascertained under the weaker conditions that the
collection of functions x ↦ ψθ(x) forms a "Donsker class" and that the map θ ↦ ψθ is
continuous in probability. The functions sign(x − θ) do satisfy these conditions, but a proof
and the definition of a Donsker class are deferred to Chapter 19.
If the functions θ ↦ ψθ(x) are continuously differentiable, then the natural candidate
for ψ̇(x) is sup_θ ‖ψ̇θ(x)‖, with the supremum taken over a neighborhood of θ0. Then the
main condition is that the partial derivatives are "locally dominated" by a square-integrable
function: There should exist a square-integrable function ψ̇ with ‖ψ̇θ(x)‖ ≤ ψ̇(x) for every θ
close to θ0. If θ ↦ ψ̇θ(x) is also continuous at θ0, then the dominated-convergence theorem
readily yields that Vθ0 = Pψ̇θ0.
The properties of M-estimators can typically be obtained under milder conditions by
using their characterization as maximizers. The following theorem is in the same spirit
as the preceding one but does cover the median. It concerns M-estimators defined as
maximizers of a criterion function θ ↦ ℙn mθ, which are assumed to be consistent for a
point of maximum θ0 of the function θ ↦ Pmθ. If the latter function is twice continuously
differentiable at θ0, then, of course, it allows a two-term Taylor expansion of the form

Pmθ = Pmθ0 + ½(θ − θ0)ᵀ Vθ0 (θ − θ0) + o(‖θ − θ0‖²).

It is this expansion rather than the differentiability that is needed in the following theorem.

5.23 Theorem. For each θ in an open subset of Euclidean space let x ↦ mθ(x) be a mea-
surable function such that θ ↦ mθ(x) is differentiable at θ0 for P-almost every x† with
derivative ṁθ0(x) and such that, for every θ1 and θ2 in a neighborhood of θ0 and a measur-
able function ṁ with Pṁ² < ∞,

|mθ1(x) − mθ2(x)| ≤ ṁ(x) ‖θ1 − θ2‖.

Furthermore, assume that the map θ ↦ Pmθ admits a second-order Taylor expansion
at a point of maximum θ0 with nonsingular symmetric second-derivative matrix Vθ0. If
ℙn mθ̂n ≥ sup_θ ℙn mθ − oP(n⁻¹) and θ̂n →P θ0, then

√n(θ̂n − θ0) = −Vθ0⁻¹ (1/√n) ∑ᵢ₌₁ⁿ ṁθ0(Xi) + oP(1).

† Alternatively, it suffices that θ ↦ mθ is differentiable at θ0 in P-probability.


In particular, the sequence √n(θ̂n − θ0) is asymptotically normal with mean zero and
covariance matrix Vθ0⁻¹ Pṁθ0 ṁθ0ᵀ Vθ0⁻¹.

*Proof. The Lipschitz property and the differentiability of the maps θ ↦ mθ imply that,
for every random sequence h̃n that is bounded in probability,

𝔾n [√n(mθ0+h̃n/√n − mθ0) − h̃nᵀ ṁθ0] →P 0.

For nonrandom sequences hn this follows, because the variables have zero means, and vari-
ances that converge to zero, by the dominated convergence theorem. For general sequences
h̃n this follows from Lemma 19.31.
A second fact that we need and that is proved subsequently is the √n-consistency of the
sequence θ̂n. By Corollary 5.53, the Lipschitz condition, and the twice differentiability of
the map θ ↦ Pmθ, the sequence √n(θ̂n − θ0) is bounded in probability.
The remainder of the proof is self-contained. In view of the twice differentiability of the
map θ ↦ Pmθ, the preceding display can be rewritten as

n ℙn(mθ0+h̃n/√n − mθ0) = ½ h̃nᵀ Vθ0 h̃n + h̃nᵀ 𝔾n ṁθ0 + oP(1).

Because the sequence θ̂n is √n-consistent, this is valid both for h̃n equal to ĥn = √n(θ̂n − θ0)
and for h̃n = −Vθ0⁻¹ 𝔾n ṁθ0. After simple algebra in the second case, we obtain the equations

n ℙn(mθ0+ĥn/√n − mθ0) = ½ ĥnᵀ Vθ0 ĥn + ĥnᵀ 𝔾n ṁθ0 + oP(1),
n ℙn(mθ0−Vθ0⁻¹𝔾nṁθ0/√n − mθ0) = −½ 𝔾n ṁθ0ᵀ Vθ0⁻¹ 𝔾n ṁθ0 + oP(1).

By the definition of θ̂n, the left side of the first equation is larger than the left side of the
second equation (up to oP(1)) and hence the same relation is true for the right sides. Take
the difference, complete the square, and conclude that

½ (ĥn + Vθ0⁻¹ 𝔾n ṁθ0)ᵀ Vθ0 (ĥn + Vθ0⁻¹ 𝔾n ṁθ0) + oP(1) ≥ 0.

Because the matrix Vθ0 is strictly negative-definite, the quadratic form must converge to
zero in probability. The same must be true for ‖ĥn + Vθ0⁻¹ 𝔾n ṁθ0‖. ∎
The assertions of the preceding theorems must be in agreement with each other and
also with the informal derivation leading to (5.20). If θ ↦ mθ(x) is differentiable, then a
maximizer of θ ↦ ℙn mθ typically solves ℙn ψθ = 0 for ψθ = ṁθ. Then the theorems and
(5.20) are in agreement provided that

Vθ = (∂²/∂θ²) Pmθ = (∂/∂θ) Pψθ = Pψ̇θ = Pm̈θ.

This involves changing the order of differentiation (with respect to θ) and integration (with
respect to x), and is usually permitted. However, for instance, the second derivative of Pmθ
may exist without θ ↦ mθ(x) being differentiable for all x, as is seen in the following
example.

5.24 Example (Median). The sample median maximizes the criterion function θ ↦
−∑ᵢ₌₁ⁿ |Xi − θ|. Assume that the distribution function F of the observations is differentiable

Figure 5.3. The distribution function of the sample median (dotted curve) and its normal approxi-
mation for a sample of size 25 from the Laplace distribution.

at its median θ0 with positive derivative f(θ0). Then the sample median is asymptotically
normal.
This follows from Theorem 5.23 applied with mθ(x) = |x − θ| − |x|. As a consequence
of the triangle inequality, this function satisfies the Lipschitz condition with ṁ(x) ≡ 1.
Furthermore, the map θ ↦ mθ(x) is differentiable at θ0 except if x = θ0, with ṁθ0(x) =
−sign(x − θ0). By partial integration,

Pmθ = θF(0) + ∫₍₀,θ₎ (θ − 2x) dF(x) − θ(1 − F(θ)) = 2 ∫₀^θ F(x) dx − θ.

If F is sufficiently regular around θ0, then Pmθ is twice differentiable with first derivative
2F(θ) − 1 (which vanishes at θ0) and second derivative 2f(θ). More generally, under the
minimal condition that F is differentiable at θ0, the function Pmθ has a Taylor expansion
Pmθ0 + ½(θ − θ0)² 2f(θ0) + o(|θ − θ0|²), so that we set Vθ0 = 2f(θ0). Because Pṁθ0²
= E1 = 1, the asymptotic variance of the median is 1/(2f(θ0))². Figure 5.3 gives an
impression of the accuracy of the approximation. □
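A simulation in the spirit of Figure 5.3 (our own sketch, with our own sample sizes): for the Laplace distribution the median is θ0 = 0 and f(θ0) = 1/2, so the asymptotic variance 1/(2f(θ0))² equals 1, and the Monte Carlo variance of √n times the sample median should be close to 1.

```python
import math
import random

def laplace():
    """Standard Laplace draw by inversion: density exp(-|x|)/2."""
    u = random.random() - 0.5
    return -math.copysign(math.log(1.0 - 2.0 * abs(u)), u)

def sample_median(xs):
    s = sorted(xs)
    return s[len(s) // 2]      # middle order statistic (odd n)

random.seed(6)
n, reps = 101, 2000
vals = []
for _ in range(reps):
    xs = [laplace() for _ in range(n)]
    vals.append(math.sqrt(n) * sample_median(xs))
var_hat = sum(v * v for v in vals) / reps   # Monte Carlo variance (theta_0 = 0)
```

With n = 101 the normal approximation is already reasonable, as the figure suggests for n = 25.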

5.25 Example (Misspecified model). Suppose an experimenter postulates a model {pθ: θ
∈ Θ} for a sample of observations X1, ..., Xn. However, the model is misspecified in that
the true underlying distribution does not belong to the model. The experimenter decides to
use the postulated model anyway, and obtains an estimate θ̂n from maximizing the likelihood
∑ log pθ(Xi). What is the asymptotic behavior of θ̂n?
At first sight, it might appear that θ̂n would behave erratically due to the use of the wrong
model. However, this is not the case. First, we expect that θ̂n is asymptotically consistent
for a value θ0 that maximizes the function θ ↦ P log pθ, where the expectation is taken
under the true underlying distribution P. The density pθ0 can be viewed as the "projection"

of the true underlying distribution P on the model using the Kullback-Leibler divergence,
which is defined as −P log(pθ/p), as a "distance" measure: pθ0 minimizes this quantity
over all densities in the model. Second, we expect that √n(θ̂n − θ0) is asymptotically
normal with mean zero and covariance matrix

Vθ0⁻¹ P ℓ̇θ0 ℓ̇θ0ᵀ Vθ0⁻¹.

Here ℓθ = log pθ, and Vθ0 is the second-derivative matrix of the map θ ↦ P log pθ. The
preceding theorem with mθ = log pθ gives sufficient conditions for this to be true.
The asymptotics give insight into the practical value of the experimenter's estimate θ̂n.
This depends on the specific situation. However, if the model is not too far off from the truth,
then the estimated density pθ̂n may be a reasonable approximation for the true density. □
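A worked instance of this phenomenon (ours, not the book's): fit the exponential model pθ(x) = θe^(−θx) to Uniform(0, 2) data. The Kullback-Leibler projection is θ0 = 1/EX = 1, the MLE θ̂n = 1/X̄n remains consistent for it, and with ℓ̇θ(x) = 1/θ − x and ℓ̈θ = −1/θ² the sandwich formula gives the limit variance Vθ0⁻¹ Pℓ̇θ0² Vθ0⁻¹ = θ0⁴ Var X = 1/3, rather than the model-based value θ0² = 1. The simulation settings below are our own.

```python
import math
import random

random.seed(7)
n, reps = 500, 2000
vals = []
for _ in range(reps):
    xs = [random.uniform(0.0, 2.0) for _ in range(n)]   # truth: not exponential
    theta_hat = n / sum(xs)             # exponential MLE: 1 / sample mean
    vals.append(math.sqrt(n) * (theta_hat - 1.0))
var_hat = sum(v * v for v in vals) / reps

sandwich = 1.0 / 3.0     # V^{-1} P l-dot^2 V^{-1} = theta_0^4 Var X
model_based = 1.0        # inverse Fisher information theta_0^2 at theta_0 = 1
```

The Monte Carlo variance matches the sandwich value and is far from the model-based one, so naive likelihood-based standard errors would be misleading here.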

5.26 Example (Exponential frailty model). Suppose that the observations are a random
sample (X1, Y1), ..., (Xn, Yn) of pairs of survival times. For instance, each Xi is the
survival time of a "father" and Yi the survival time of a "son." We assume that given
an unobservable value zi, the survival times Xi and Yi are independent and exponentially
distributed with parameters zi and θzi, respectively. The value zi may be different for each
observation. The problem is to estimate the ratio θ of the parameters.
To fit this example into the i.i.d. set-up of this chapter, we assume that the values z1, ..., zn
are realizations of a random sample Z1, ..., Zn from some given distribution (that we do
not have to know or parametrize).
One approach is based on the sufficiency of the variable Xi + θYi for zi in the case
that θ is known. Given Zi = z, this "statistic" possesses the gamma-distribution with
shape parameter 2 and scale parameter z. Corresponding to this, the conditional density of
an observation (X, Y) factorizes, for a given z, as hθ(x, y) gθ(x + θy | z), for gθ(s | z) =
z²se^(−zs) the gamma-density and

hθ(x, y) = θ / (x + θy).

Because the density of Xi + θYi depends on the unobservable value zi, we might wish to
discard the factor gθ(s | z) from the likelihood and use the factor hθ(x, y) only. Unfor-
tunately, this "conditional likelihood" does not behave as an ordinary likelihood, in that
the corresponding "conditional likelihood equation," based on the function (ḣθ/hθ)(x, y) =
(∂/∂θ) log hθ(x, y), does not have mean zero under θ. The bias can be corrected by condi-
tioning on the sufficient statistic. Let

ψθ(X, Y) = 2θ (ḣθ/hθ)(X, Y) − 2θ Eθ((ḣθ/hθ)(X, Y) | X + θY) = (X − θY)/(X + θY).

Next define an estimator θ̂n as the solution of ℙn ψθ = 0.


This works fairly nicely. Because the function 0 t-+ 1/Ie (X, y) is continuous, and de-
creases strictly from 1 to -Ion (0, 00) for every x, y > 0, the equation JP>n 1/Ie = 0 has a
unique solution. The sequence of solutions On can be seen to be consistent by Lemma 5.10.
By straightforward calculation, as 0 -+ 00 ,
o+ 00
P90 1/Ie = - 0 _ 00 -
200 0
(0 _ (0)2 log
00
e = 3010 (00 - 0) + 0(00 - 0).

https://doi.org/10.1017/CBO9780511802256.006 Published online by Cambridge University Press


5.3 Asymptotic Normality 57

Hence the zero of θ ↦ Pθ0 ψθ is taken uniquely at θ = θ0. Next, the sequence √n(θ̂n − θ0)
can be shown to be asymptotically normal by Theorem 5.21. In fact, the functions ψ̇θ(x, y)
are uniformly bounded in x, y > 0 and θ ranging over compacta in (0, ∞), so that, by the
mean value theorem, the Lipschitz function ψ̇ in this theorem may be taken equal to a constant.
On the other hand, although this estimator is easy to compute, it can be shown that it is
not asymptotically optimal. In Chapter 25 on semiparametric models, we discuss estimators
with a smaller asymptotic variance. □
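As a quick numerical sanity check (not from the text), the equation ℙn ψθ = 0 with ψθ(x, y) = (x − θy)/(x + θy) can be solved by bracketing, exploiting the strict monotonicity of θ ↦ ψθ(x, y). The frailty distribution (a gamma) and all parameter values below are illustrative assumptions:

```python
import numpy as np
from scipy.optimize import brentq

rng = np.random.default_rng(0)

theta0 = 2.0
n = 20000
# Unobserved frailties Z_i; given Z_i = z, X ~ Exp(z) and Y ~ Exp(theta0 * z).
z = rng.gamma(shape=3.0, scale=1.0, size=n)
x = rng.exponential(1.0 / z)
y = rng.exponential(1.0 / (theta0 * z))

def psi_bar(theta):
    # Empirical mean of psi_theta(x, y) = (x - theta*y)/(x + theta*y), which
    # decreases strictly from 1 (theta -> 0) to -1 (theta -> infinity).
    return np.mean((x - theta * y) / (x + theta * y))

# The estimating equation has a unique root; bracket it widely and solve.
theta_hat = brentq(psi_bar, 1e-8, 1e8)
```

With n = 20000 the estimate should land within a few hundredths of the true ratio θ0 = 2.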

5.27 Example (Nonlinear least squares). Suppose that we observe a random sample (X₁,
Y₁), …, (Xₙ, Yₙ) from the distribution of a vector (X, Y) that follows the regression
model

Y = fθ0(X) + e,    E(e | X) = 0.

Here fθ is a parametric family of regression functions, for instance fθ(x) = θ₁ + θ₂e^{θ₃x},
and we aim at estimating the unknown vector θ. (We assume that the independent variables
are a random sample in order to fit the example into our i.i.d. notation, but the analysis could
be carried out conditionally as well.) The least squares estimator that minimizes

θ ↦ Σᵢ₌₁ⁿ (Yᵢ − fθ(Xᵢ))²

is an M-estimator for mθ(x, y) = (y − fθ(x))² (or rather minus this function). It should
be expected to converge to the minimizer of the limit criterion function

θ ↦ Pmθ = E(Y − fθ(X))² = P(fθ − fθ0)² + Ee².

Thus the least squares estimator should be consistent if θ0 is identifiable from the model,
in the sense that θ ≠ θ0 implies that fθ(X) ≠ fθ0(X) with positive probability.
For sufficiently regular regression models, we have, as θ → θ0,

Pmθ = P((θ − θ0)ᵀ ḟθ0)² + o(‖θ − θ0‖²) + Ee².

This suggests that the conditions of Theorem 5.23 are satisfied with Vθ0 = 2P ḟθ0 ḟθ0ᵀ and
ṁθ0(x, y) = −2(y − fθ0(x)) ḟθ0(x). If e and X are independent, then this leads to the
asymptotic covariance matrix Ee² (P ḟθ0 ḟθ0ᵀ)⁻¹. □
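A minimal simulation sketch of this example (the regression function fθ(x) = θ1 + θ2e^{θ3x}, the error scale, and the sample size are illustrative assumptions, not prescriptions from the text):

```python
import numpy as np
from scipy.optimize import least_squares

rng = np.random.default_rng(1)
n = 5000
theta0 = np.array([1.0, 2.0, -0.5])          # f_theta(x) = th1 + th2*exp(th3*x)
x = rng.uniform(0.0, 3.0, size=n)
y = theta0[0] + theta0[1] * np.exp(theta0[2] * x) + rng.normal(0.0, 0.3, size=n)

def resid(theta):
    # Residuals y_i - f_theta(x_i); their sum of squares is minimized.
    return y - (theta[0] + theta[1] * np.exp(theta[2] * x))

fit = least_squares(resid, x0=np.array([0.5, 1.0, -1.0]))
theta_hat = fit.x

# Plug-in version of the asymptotic covariance Ee^2 (P f' f'^T)^{-1}:
# fit.jac has rows -f'_theta(x_i), so fit.jac.T @ fit.jac = sum_i f' f'^T.
cov_hat = np.mean(resid(theta_hat) ** 2) * np.linalg.inv(fit.jac.T @ fit.jac)
```

The square roots of the diagonal of `cov_hat` are the usual plug-in standard errors for the three coordinates.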

Besides giving the asymptotic normality of √n(θ̂n − θ0), the preceding theorems give
an asymptotic representation

θ̂n − θ0 = (1/n) Σᵢ₌₁ⁿ (−Vθ0⁻¹ ψθ0(Xᵢ)) + oP(1/√n).

If we neglect the remainder term,† then this means that θ̂n − θ0 behaves as the average of the
variables −Vθ0⁻¹ ψθ0(Xᵢ). Then the (asymptotic) "influence" of the nth observation on the

† To make the following derivation rigorous, more information concerning the remainder term would be necessary.


value of θ̂n can be computed, by comparing the averages with and without the nth
observation, as approximately −Vθ0⁻¹ ψθ0(Xₙ)/n.

Because the "influence" of an extra observation x is proportional to Vθ0⁻¹ ψθ0(x), the function
x ↦ Vθ0⁻¹ ψθ0(x) is called the asymptotic influence function of the estimator θ̂n. Influence
functions can be defined for many other estimators as well, but the method of Z-estimation
is particularly convenient to obtain estimators with given influence functions. Because Vθ0
is a constant (matrix), any shape of influence function can be obtained by simply choosing
the right functions ψθ.
For the purpose of robust estimation, perhaps the most important aim is to bound the
influence of each individual observation. Thus, a Z-estimator is called B-robust if the
function ψθ is bounded.

5.28 Example (Robust regression). Consider a random sample of observations (X₁, Y₁),
…, (Xₙ, Yₙ) following the linear regression model

Yᵢ = θ0ᵀXᵢ + eᵢ,

for i.i.d. errors e₁, …, eₙ that are independent of X₁, …, Xₙ. The classical estimator for the
regression parameter θ is the least squares estimator, which minimizes θ ↦ Σᵢ₌₁ⁿ (Yᵢ − θᵀXᵢ)².
Outlying values of Xᵢ ("leverage points") or extreme values of (Xᵢ, Yᵢ) jointly ("influence
points") can have an arbitrarily large influence on the value of the least squares estimator,
which therefore is nonrobust. As in the case of location estimators, a more robust estimator
for θ can be obtained by replacing the square by a function m(x) that grows less rapidly
as x → ∞, for instance m(x) = |x| or m(x) equal to the primitive function of Huber's ψ.
Usually, minimizing an expression of the type Σᵢ m(Yᵢ − θᵀXᵢ) is equivalent to solving a
system of equations

Σᵢ₌₁ⁿ ψ(Yᵢ − θᵀXᵢ) Xᵢ = 0.

Because Eψ(Y − θ0ᵀX)X = Eψ(e)EX, we can expect the resulting estimator to be consistent
provided Eψ(e) = 0. Furthermore, we should expect that, for Vθ0 = Eψ′(e)XXᵀ,

√n(θ̂n − θ0) = Vθ0⁻¹ (1/√n) Σᵢ₌₁ⁿ ψ(Yᵢ − θ0ᵀXᵢ) Xᵢ + oP(1).

Consequently, even for a bounded function ψ, the influence function (x, y) ↦ Vθ0⁻¹ ψ(y −
θ0ᵀx) x may be unbounded, and an extreme value of an Xᵢ may still have an arbitrarily
large influence on the estimate (asymptotically). Thus, the estimators obtained in this way
are protected against influence points but may still suffer from leverage points and hence
are only partly robust. To obtain fully robust estimators, we can change the estimating


equations to

Σᵢ₌₁ⁿ ψ((Yᵢ − θᵀXᵢ) v(Xᵢ)) w(Xᵢ) = 0.

Here we protect against leverage points by choosing w bounded. For more flexibility we
have also allowed a weighting factor v(Xᵢ) inside ψ. The choices ψ(x) = x, v(x) = 1 and
w(x) = x correspond to the (nonrobust) least squares estimator.
The solution θ̂n of our final estimating equation should be expected to be consistent for
the solution of

0 = Eψ((Y − θᵀX) v(X)) w(X) = Eψ((e + θ0ᵀX − θᵀX) v(X)) w(X).

If the function ψ is odd, then the true value θ0 will be a solution
whenever e is symmetric about zero, because then Eψ(ea) = 0 for every a.
Precise conditions for the asymptotic normality of √n(θ̂n − θ0) can be obtained from
Theorems 5.21 and 5.9. The verification of the conditions of Theorem 5.21, which are "local"
in nature, is relatively easy, and, if necessary, the Lipschitz condition can be relaxed by
using results on empirical processes introduced in Chapter 19 directly. Proving the
consistency of θ̂n is perhaps harder. The biggest technical problem may be to show that θ̂n = OP(1),
so it would help if θ could a priori be restricted to a bounded set. On the other hand,
for bounded functions ψ, the case of most interest in the present context, the functions
(x, y) ↦ ψ((y − θᵀx) v(x)) w(x) readily form a Glivenko-Cantelli class when θ ranges
freely, so that verification of the strong uniqueness of θ0 as a zero becomes the main
challenge when applying Theorem 5.9. This leads to a combination of conditions on ψ, v,
w, and the distributions of e and X. □
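The doubly weighted estimating equation can be solved numerically with a general root-finder. The sketch below (not from the text) uses Huber's ψ, v ≡ 1, and a norm-shrinking leverage weight w; the weight functions, tuning constants, and simulated design are illustrative assumptions:

```python
import numpy as np
from scipy.optimize import root

rng = np.random.default_rng(2)
n, k = 2000, 2
theta0 = np.array([1.5, -1.0])
X = rng.normal(size=(n, k))
e = rng.standard_t(df=3, size=n)             # heavy-tailed but symmetric errors
Y = X @ theta0 + e

def psi(u, c=1.345):
    # Huber's psi: bounds the influence of large residuals.
    return np.clip(u, -c, c)

def w(x, b=3.0):
    # Leverage-bounding weight: shrink covariate vectors longer than b.
    norms = np.linalg.norm(x, axis=1)
    return x * np.minimum(1.0, b / np.maximum(norms, 1e-12))[:, None]

wX = w(X)

def estimating_eq(theta):
    # (1/n) sum_i psi((Y_i - theta'X_i) v(X_i)) w(X_i), with v = 1 here.
    return wX.T @ psi(Y - X @ theta) / n

theta_hat = root(estimating_eq, x0=np.zeros(k)).x
```

Since ψ is odd and the t-errors are symmetric about zero, the estimator targets the true θ0 despite the reweighting.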

5.29 Example (Optimal robust estimators). Every sufficiently regular function ψ defines
a location estimator θ̂n through the equation Σᵢ₌₁ⁿ ψ(Xᵢ − θ) = 0. In order to choose among
the different estimators, we could compare their asymptotic variances and use the one with
the smallest variance under the postulated (or estimated) distribution P of the observations.
On the other hand, if we also wish to guard against extreme observations, then we should
find a balance between robustness and asymptotic variance. One possibility is to use the
estimator with the smallest asymptotic variance at the postulated, ideal distribution P under
the side condition that its influence function be uniformly bounded by some constant c. In
this example we show that for P the normal distribution, this leads to the Huber estimator.
The Z-estimator is consistent for the solution θ0 of the equation Pψ(· − θ) = Eψ(X₁ −
θ) = 0. Suppose that we fix an underlying, ideal P whose "location" θ0 is zero. Then the
problem is to find ψ that minimizes the asymptotic variance Pψ²/(Pψ′)² under the two
side conditions, for a given constant c,

|ψ(x)| / Pψ′ ≤ c for every x,    and    Pψ = 0.

The problem is homogeneous in ψ, and hence we may assume that Pψ′ = 1 without loss
of generality. Next, minimization of Pψ² under the side conditions Pψ = 0, Pψ′ = 1 and
‖ψ‖∞ ≤ c can be achieved by using Lagrange multipliers, as in Problem 14.6. This leads
to minimizing

E( ψ²(X) + λψ(X) + μψ(X)(p′/p)(X) )


for fixed "multipliers" λ and μ under the side condition ‖ψ‖∞ ≤ c with respect to ψ. This
expectation is minimized by minimizing the integrand pointwise, for every fixed x. Thus
the minimizing ψ has the property that, for every x separately, y = ψ(x) minimizes the
parabola y² + λy + μy(p′/p)(x) over y ∈ [−c, c]. This readily gives the solution,

ψ(x) = [ −½λ − ½μ(p′/p)(x) ]_{−c}^{c},

where [y]_{−c}^{c} denotes the value y truncated to the interval [−c, c].
The constants λ and μ can be solved from the side conditions Pψ = 0 and Pψ′ = 1. The
normal distribution P = Φ has location score function (p′/p)(x) = −x, and by symmetry
it follows that λ = 0 in this case. Then the optimal ψ reduces to Huber's ψ function. □
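For the standard normal distribution the remaining multiplier μ (equivalently the slope a = μ/2 of the truncated line) can be computed numerically from the side condition Pψ′ = 1. A sketch, in which the influence bound c = 1.5 is an arbitrary illustrative choice:

```python
import numpy as np
from math import erf, sqrt
from scipy.optimize import brentq

def Phi(t):
    # Standard normal distribution function.
    return 0.5 * (1.0 + erf(t / sqrt(2.0)))

c = 1.5   # bound on the influence function

def mean_psi_prime(a):
    # For psi(x) = clip(a*x, -c, c) we have psi'(x) = a on {|x| < c/a},
    # so under N(0, 1):  P psi' = a * (2 * Phi(c/a) - 1).
    return a * (2.0 * Phi(c / a) - 1.0)

# lambda = 0 by symmetry; choose the slope a = mu/2 so that P psi' = 1.
a = brentq(lambda t: mean_psi_prime(t) - 1.0, 1.0, 50.0)

def psi_opt(x):
    # Huber's psi: linear near zero, truncated at +-c.
    return np.clip(a * x, -c, c)
```

The side condition Pψ = 0 holds automatically because ψ_opt is odd and the normal density is symmetric.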

*5.4 Estimated Parameters


In many situations, the estimating equations for the parameters of interest contain prelim-
inary estimates for "nuisance parameters." For example, many robust location estimators
are defined as the solutions of equations of the type

Σᵢ₌₁ⁿ ψ((Xᵢ − θ)/σ̂) = 0.    (5.30)

Here σ̂ is an initial (robust) estimator of scale, which is meant to stabilize the robustness
of the location estimator. For instance, the "cut-off" parameter k in Huber's ψ-function
determines the amount of robustness of Huber's estimator, but the effect of a particular
choice of k on bounding the influence of outlying observations is relative to the range of
the observations. If the observations are concentrated in the interval [−k, k], then Huber's
ψ yields nothing else but the sample mean, whereas if all observations are outside [−k, k], we get
the median. Scaling the observations to a standard scale gives a clear meaning to the value
of k. The median absolute deviation from the median (see section 21.3) is often
recommended for this purpose.
If the scale estimator is itself a Z-estimator, then we can treat the pair (θ̂, σ̂) as a Z-
estimator for a system of equations, and next apply the preceding theorems. More generally,
we can apply the following result. In this subsection we allow a condition in terms of
Donsker classes, which are discussed in Chapter 19. The proof of the following theorem
follows the same steps as the proof of Theorem 5.21.

5.31 Theorem. For each θ in an open subset of ℝᵏ and each η in a metric space, let x ↦
ψθ,η(x) be an ℝᵏ-valued measurable function such that the class of functions {ψθ,η: ‖θ −
θ0‖ < δ, d(η, η0) < δ} is Donsker for some δ > 0, and such that P‖ψθ,η − ψθ0,η0‖² → 0
as (θ, η) → (θ0, η0). Assume that Pψθ0,η0 = 0, and that the maps θ ↦ Pψθ,η are differ-
entiable at θ0, uniformly in η in a neighborhood of η0, with nonsingular derivative matrices
Vθ0,η such that Vθ0,η → Vθ0,η0. If √n ℙn ψθ̂n,η̂n = oP(1) and (θ̂n, η̂n) →P (θ0, η0), then

√n(θ̂n − θ0) = −Vθ0,η0⁻¹ 𝔾n ψθ0,η0 − Vθ0,η0⁻¹ √n Pψθ0,η̂n + oP(1 + √n ‖Pψθ0,η̂n‖).

Under the conditions of this theorem, the limiting distribution of the sequence √n(θ̂n −
θ0) depends on the estimator η̂n through the "drift" term √n Pψθ0,η̂n. In general, this gives
a contribution to the limiting distribution, and η̂n must be chosen with care. If η̂n is √n-
consistent and the map η ↦ Pψθ0,η is differentiable, then the drift term can be analyzed
using the delta-method.
It may happen that the drift term is zero. If the parameters θ and η are "orthogonal"
in this sense, then the auxiliary estimators η̂n may converge at an arbitrarily slow rate and
affect the limit distribution of θ̂n only through their limiting value η0.

5.32 Example (Symmetric location). Suppose that the distribution of the observations
is symmetric about θ0. Let x ↦ ψ(x) be an antisymmetric function, and consider the
Z-estimators that solve equation (5.30). Because Pψ((X − θ0)/σ) = 0 for every σ, by
the symmetry of P and the antisymmetry of ψ, the "drift term" due to σ̂ in the pre-
ceding theorem is identically zero. The estimator θ̂n has the same limiting distribu-
tion whether we use an arbitrary consistent estimator σ̂ of a "true scale" σ0 or σ0
itself. □
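A small simulation illustrating this invariance (the t-distributed sample and Huber's ψ are illustrative assumptions): the location estimate computed with the MAD as scale and the one computed with the true scale are nearly identical.

```python
import numpy as np
from scipy.optimize import brentq

rng = np.random.default_rng(3)
theta0, sigma0 = 5.0, 2.0
x = theta0 + sigma0 * rng.standard_t(df=4, size=20000)   # symmetric about theta0

def psi(u, k=1.345):
    return np.clip(u, -k, k)                 # Huber's psi (odd function)

def solve_location(scale):
    # Solve (1/n) sum_i psi((x_i - theta)/scale) = 0 in theta.
    eq = lambda th: np.mean(psi((x - th) / scale))
    return brentq(eq, x.min(), x.max())

# Any consistent scale works: the MAD (with the usual normal consistency
# constant 0.6745) and the true sigma0 give asymptotically equivalent results.
sigma_mad = np.median(np.abs(x - np.median(x))) / 0.6745
theta_mad = solve_location(sigma_mad)
theta_true = solve_location(sigma0)
```

Note that the MAD is not even consistent for σ0 under the t-distribution used here; by the orthogonality above, only its limiting value matters.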

5.33 Example (Robust regression). In the linear regression model considered in Exam-
ple 5.28, suppose that we choose the weight functions v and w dependent on the data, as
v̂n and ŵn, and solve the robust estimator θ̂n of the regression parameters from

Σᵢ₌₁ⁿ ψ((Yᵢ − θᵀXᵢ) v̂n(Xᵢ)) ŵn(Xᵢ) = 0.

This corresponds to defining a nuisance parameter η = (v, w) and setting ψθ,v,w(x, y) =
ψ((y − θᵀx) v(x)) w(x). If the functions ψθ,v,w run through a Donsker class (and they
easily do), and are continuous in (θ, v, w), and the map θ ↦ Pψθ,v,w is differentiable at
θ0 uniformly in (v, w), then the preceding theorem applies. If Eψ(ea) = 0 for every a,
then Pψθ0,v,w = 0 for every v and w, and the limit distribution of √n(θ̂n − θ0) is the same
whether we use the random weight functions (v̂n, ŵn) or their limit (v0, w0) (assuming that
this exists).
The purpose of using random weight functions could be, besides stabilizing the robust-
ness, to improve the asymptotic efficiency of θ̂n. The limit (v0, w0) typically is not the
same for every underlying distribution P, and the estimators (v̂n, ŵn) can be chosen in such
a way that the asymptotic variance is minimal. □

5.5 Maximum Likelihood Estimators


Maximum likelihood estimators are examples of M-estimators. In this section we special-
ize the consistency and the asymptotic normality results of the preceding sections to this
important special case. Our approach reverses the historical order. Maximum likelihood
estimators were shown to be asymptotically normal first by Fisher in the 1920s and rigor-
ously by Cramér, among others, in the 1940s. General M-estimators were not introduced
and studied systematically until the 1960s, when they became essential in the development
of robust estimators.


If X₁, …, Xₙ are a random sample from a density pθ, then the maximum likelihood
estimator θ̂n maximizes the function θ ↦ Σ log pθ(Xᵢ), or equivalently, the function

Mn(θ) = (1/n) Σᵢ₌₁ⁿ log (pθ/pθ0)(Xᵢ) = ℙn log (pθ/pθ0).

(Subtraction of the "constant" Σ log pθ0(Xᵢ) turns out to be mathematically convenient.)
If we agree that log 0 = −∞, then this expression is with probability 1 well defined if pθ0
is the true density. The asymptotic function corresponding to Mn is†

M(θ) = Eθ0 log (pθ/pθ0)(X) = Pθ0 log (pθ/pθ0).

The number −M(θ) is called the Kullback-Leibler divergence of pθ and pθ0; it is often
considered a measure of "distance" between pθ and pθ0, although it does not have the
properties of a mathematical distance. Based on the results of the previous sections, we
may expect the maximum likelihood estimator to converge to a point of maximum of M(θ).
Is the true value θ0 always a point of maximum? The answer is affirmative, and, moreover,
the true value is a unique point of maximum if the true measure is identifiable:

Pθ ≠ Pθ0,    for every θ ≠ θ0.    (5.34)

This requires that the model for the observations is not the same under the parameters θ
and θ0. Identifiability is a natural and even a necessary condition: If the parameter is not
identifiable, then consistent estimators cannot exist.

5.35 Lemma. Let {pθ: θ ∈ Θ} be a collection of subprobability densities such that
(5.34) holds and such that Pθ0 is a probability measure. Then M(θ) = Pθ0 log (pθ/pθ0)
attains its maximum uniquely at θ0.

Proof. First note that M(θ0) = Pθ0 log 1 = 0. Hence we wish to show that M(θ) is strictly
negative for θ ≠ θ0.
Because log x ≤ 2(√x − 1) for every x ≥ 0, we have, writing μ for the dominating
measure,

Pθ0 log (pθ/pθ0) ≤ 2Pθ0 (√(pθ/pθ0) − 1) = 2∫ √(pθ pθ0) dμ − 2 ≤ −∫ (√pθ − √pθ0)² dμ.

(The last inequality is an equality if ∫ pθ dμ = 1.) This is always nonpositive, and is zero
only if pθ and pθ0 are equal. By assumption the latter happens only if θ = θ0. ▪

Thus, under conditions such as in section 5.2 and identifiability, the sequence of maxi-
mum likelihood estimators is consistent for the true parameter.

† Presently we take the expectation Pθ0 under the parameter θ0, whereas the derivation in section 5.3 is valid for a
generic underlying probability structure and does not conceptually require that the set of parameters θ indexes
a set of underlying distributions.


This conclusion is derived from viewing the maximum likelihood estimator as an M-
estimator for mθ = log pθ. Sometimes it is technically advantageous to use a different
starting point. For instance, consider the function

mθ = log ((pθ + pθ0)/(2pθ0)).

By the concavity of the logarithm, the maximum likelihood estimator θ̂ satisfies

ℙn mθ̂ ≥ ℙn ½ log (pθ̂/pθ0) + ℙn ½ log 1 ≥ 0 = ℙn mθ0.

Even though θ̂ does not maximize θ ↦ ℙn mθ, this inequality can be used as the starting
point for a consistency proof, because Theorem 5.7 requires only that Mn(θ̂n) ≥ Mn(θ0) − oP(1).
The true parameter is still identifiable from this criterion function, because, by the
preceding lemma, Pθ0 mθ = 0 implies that (pθ + pθ0)/2 = pθ0, or pθ = pθ0. A technical
advantage is that mθ ≥ log(1/2). For another variation, see Example 5.17.
Consider asymptotic normality. The maximum likelihood estimator solves the likelihood
equations

∂/∂θ Σᵢ₌₁ⁿ log pθ(Xᵢ) = 0.

Hence it is a Z-estimator for ψθ equal to the score function ℓ̇θ = ∂/∂θ log pθ of the model.
In view of the results of section 5.3, we expect that the sequence √n(θ̂n − θ) is, under θ,
asymptotically normal with mean zero and covariance matrix

(Pθ ℓ̈θ)⁻¹ Pθ ℓ̇θ ℓ̇θᵀ (Pθ ℓ̈θ)⁻¹.    (5.36)

Under regularity conditions, this reduces to the inverse of the Fisher information matrix
Iθ = Pθ ℓ̇θ ℓ̇θᵀ.
To see this in the case of a one-dimensional parameter, differentiate the identity ∫ pθ dμ = 1
twice with respect to θ. Assuming that the order of differentiation and integration can be
reversed, we obtain ∫ ṗθ dμ = ∫ p̈θ dμ = 0. Together with the identities

ℓ̇θ = ṗθ/pθ,    ℓ̈θ = p̈θ/pθ − (ṗθ/pθ)²,

this implies that Pθ ℓ̇θ = 0 (scores have mean zero), and Pθ ℓ̈θ = −Iθ (the curvature of the
likelihood is equal to minus the Fisher information). Consequently, (5.36) reduces to Iθ⁻¹.
The higher-dimensional case follows in the same way, in which we should interpret the
identities Pθ ℓ̇θ = 0 and Pθ ℓ̈θ = −Iθ as a vector and a matrix identity, respectively.
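These identities are easy to check numerically for a concrete model. The sketch below (an illustration, not from the text) uses the exponential density pθ(x) = θe^{−θx}, for which ℓ̇θ(x) = 1/θ − x, ℓ̈θ(x) = −1/θ², and Iθ = 1/θ²; the parameter value and sample size are arbitrary:

```python
import numpy as np

# Monte Carlo check of P ldot = 0 and P ldot^2 = I_theta for the exponential
# model p_theta(x) = theta * exp(-theta * x):
#   ldot(x) = 1/theta - x,   lddot(x) = -1/theta^2,   I_theta = 1/theta^2.
rng = np.random.default_rng(4)
theta = 1.7
x = rng.exponential(scale=1.0 / theta, size=400000)

score = 1.0 / theta - x        # score evaluated at the true parameter
info = 1.0 / theta ** 2        # Fisher information (also equals -E lddot)
```

Averaging `score` and `score**2` over the simulated sample recovers 0 and `info` up to Monte Carlo error.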
We conclude that maximum likelihood estimators typically satisfy

√n(θ̂n − θ) ⇝ N(0, Iθ⁻¹),    under θ.

This is a very important result, as it implies that maximum likelihood estimators are asymp-
totically optimal. The convergence in distribution means roughly that the maximum likeli-
hood estimator θ̂n is N(θ, (nIθ)⁻¹)-distributed for every θ, for large n. Hence, it is asymp-
totically unbiased and asymptotically of variance (nIθ)⁻¹. According to the Cramér-Rao


theorem, the variance of an unbiased estimator is at least (nIθ)⁻¹. Thus, we could in-
fer that the maximum likelihood estimator is asymptotically uniformly minimum-variance
unbiased, and in this sense optimal. We write "could" because the preceding reasoning is
informal and unsatisfying. The asymptotic normality does not warrant any conclusion about
the convergence of the moments Eθ θ̂n and varθ θ̂n; we have not introduced an asymptotic
version of the Cramér-Rao theorem; and the Cramér-Rao bound does not make any assertion
concerning asymptotic normality. Moreover, the unbiasedness required by the Cramér-Rao
theorem is restrictive and can be relaxed considerably in the asymptotic situation.
However, the message that maximum likelihood estimators are asymptotically efficient
is correct. We give a precise discussion in Chapter 8. The justification through asymptotics
appears to be the only general justification of the method of maximum likelihood. In some
form, this result was found by Fisher in the 1920s, but a better and more general insight
was only obtained in the period from 1950 through 1970 through the work of Le Cam and
others.
In the preceding informal derivations and discussion, it is implicitly understood that the
density pθ possesses at least two derivatives with respect to the parameter. Although this
can be relaxed considerably, a certain amount of smoothness of the dependence θ ↦ pθ is
essential for the asymptotic normality. Compare the behavior of the maximum likelihood
estimators in the case of uniformly distributed observations: They are neither asymptotically
normal nor asymptotically optimal.

5.37 Example (Uniform distribution). Let X₁, …, Xₙ be a sample from the uniform
distribution on [0, θ]. Then the maximum likelihood estimator is the maximum X₍ₙ₎ of the
observations. Because the variance of X₍ₙ₎ is of the order O(n⁻²), we expect that a suitable
norming rate in this case is not √n, but n. Indeed, for each x > 0,

Pθ( n(θ − X₍ₙ₎) > x ) = Pθ( X₍ₙ₎ < θ − x/n ) = (1 − x/(θn))ⁿ → e^{−x/θ}.

Thus, the sequence n(θ − X₍ₙ₎) converges in distribution to an exponential distribution
with mean θ. Consequently, the sequence √n(X₍ₙ₎ − θ) converges to zero in probability.
Note that most of the informal operations in the preceding introduction are illegal or not
even defined for the uniform distribution, starting with the definition of the likelihood equa-
tions. The informal conclusion that the maximum likelihood estimator is asymptotically
optimal is also wrong in this case; see section 9.4. □
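A short simulation illustrating the n-rate and the exponential limit (the sample size, number of replications, and θ = 3 are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(5)
theta = 3.0
reps, n = 20000, 500

x = rng.uniform(0.0, theta, size=(reps, n))
mle = x.max(axis=1)                  # maximum likelihood estimator X_(n)
errors = n * (theta - mle)           # normed at rate n, not sqrt(n)
```

The array `errors` behaves like an Exp(mean θ) sample: its average is close to θ and the fraction exceeding θ is close to e⁻¹.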

We conclude this section with a theorem that establishes the asymptotic normality of
maximum likelihood estimators rigorously. Clearly, the asymptotic normality follows from
Theorem 5.23 applied to mθ = log pθ, or from Theorem 5.21 applied with ψθ = ℓ̇θ equal
to the score function of the model. The following result is a minor variation on the first
theorem. Its conditions somehow also ensure the relationship Pθ ℓ̈θ = −Iθ and the twice-
differentiability of the map θ ↦ Pθ0 log pθ, even though the existence of second derivatives
is not part of the assumptions. This remarkable phenomenon results from the trivial fact
that square roots of probability densities have squares that integrate to 1. To exploit this,
we require the differentiability of the maps θ ↦ √pθ, rather than of the maps θ ↦ log pθ.
A statistical model (Pθ: θ ∈ Θ) is called differentiable in quadratic mean if there exists a

measurable vector-valued function ℓ̇θ0 such that, as θ → θ0,

∫ [ √pθ − √pθ0 − ½(θ − θ0)ᵀ ℓ̇θ0 √pθ0 ]² dμ = o(‖θ − θ0‖²).    (5.38)

This property also plays an important role in asymptotic optimality theory. A discussion,
including simple conditions for its validity, is given in Chapter 7. It should be noted that

∂/∂θ √pθ = (1/(2√pθ)) ∂/∂θ pθ = ½ (∂/∂θ log pθ) √pθ.

Thus, the function ℓ̇θ0 in the integral really is the score function of the model (as the
notation suggests), and the expression Iθ0 = Pθ0 ℓ̇θ0 ℓ̇θ0ᵀ defines the Fisher information matrix.
However, condition (5.38) does not require existence of ∂/∂θ pθ(x) for every x.

5.39 Theorem. Suppose that the model (Pθ: θ ∈ Θ) is differentiable in quadratic mean
at an inner point θ0 of Θ ⊂ ℝᵏ. Furthermore, suppose that there exists a measurable
function ℓ̇ with Pθ0 ℓ̇² < ∞ such that, for every θ₁ and θ₂ in a neighborhood of θ0,

|log pθ₁(x) − log pθ₂(x)| ≤ ℓ̇(x) ‖θ₁ − θ₂‖.

If the Fisher information matrix Iθ0 is nonsingular and θ̂n is consistent, then

√n(θ̂n − θ0) = Iθ0⁻¹ (1/√n) Σᵢ₌₁ⁿ ℓ̇θ0(Xᵢ) + oPθ0(1).

In particular, the sequence √n(θ̂n − θ0) is asymptotically normal with mean zero and
covariance matrix Iθ0⁻¹.

*Proof. This theorem is a corollary of Theorem 5.23. We shall show that the conditions
of the latter theorem are satisfied for mθ = log pθ and Vθ0 = −Iθ0.
Fix an arbitrary converging sequence of vectors hn → h, and set

Wn = 2( √(pθ0+hn/√n / pθ0) − 1 ).

By the differentiability in quadratic mean, the sequence √n Wn converges in L₂(Pθ0) to the
function hᵀℓ̇θ0. In particular, it converges in probability, whence by a delta method

√n( log pθ0+hn/√n − log pθ0 ) = 2√n log(1 + ½Wn) →P hᵀℓ̇θ0.

In view of the Lipschitz condition on the map θ ↦ log pθ, we can apply the dominated-
convergence theorem to strengthen this to convergence in L₂(Pθ0). This shows that the map
θ ↦ log pθ is differentiable in probability, as required in Theorem 5.23. (The preceding
argument considers only sequences θn of the special form θ0 + hn/√n approaching θ0.
Because hn can be any converging sequence and √(n+1)/√n → 1, these sequences are
actually not so special. By re-indexing, the result can be seen to be true for any θn → θ0.)
Next, by computing means (which are zero) and variances, we see that

𝔾n( √n(log pθ0+hn/√n − log pθ0) − hᵀℓ̇θ0 ) →P 0.

Equating this result to the expansion given by Theorem 7.2, we see that

n Pθ0 log (pθ0+hn/√n / pθ0) → −½ hᵀ Iθ0 h.

Hence the map θ ↦ Pθ0 log pθ is twice-differentiable with second derivative matrix −Iθ0,
or at least permits the corresponding Taylor expansion of order 2. ▪

5.40 Example (Binary regression). Suppose that we observe a random sample (X₁,
Y₁), …, (Xₙ, Yₙ) consisting of k-dimensional vectors of "covariates" Xᵢ, and 0-1 "response
variables" Yᵢ, following the model

Pθ(Yᵢ = 1 | Xᵢ = x) = Ψ(θᵀx).

Here Ψ: ℝ ↦ [0, 1] is a known continuously differentiable, monotone function. The choices
Ψ(θ) = 1/(1 + e^{−θ}) (the logistic distribution function) and Ψ = Φ (the normal distribution
function) correspond to the logit model and probit model, respectively. The maximum
likelihood estimator θ̂n maximizes the (conditional) likelihood function

θ ↦ ∏ᵢ₌₁ⁿ pθ(Yᵢ | Xᵢ) := ∏ᵢ₌₁ⁿ Ψ(θᵀXᵢ)^{Yᵢ} (1 − Ψ(θᵀXᵢ))^{1−Yᵢ}.

The consistency and asymptotic normality of θ̂n can be proved, for instance, by combining
Theorems 5.7 and 5.39. (Alternatively, we may follow the classical approach given in sec-
tion 5.6. The latter is particularly attractive for the logit model, for which the log likelihood
is strictly concave in θ, so that the point of maximum is unique.) For identifiability of θ we
must assume that the distribution of the Xᵢ is not concentrated on a (k − 1)-dimensional
affine subspace of ℝᵏ. For simplicity we assume that the range of Xᵢ is bounded.
The consistency can be proved by applying Theorem 5.7 with mθ = log((pθ + pθ0)/2).
Because pθ0 is bounded away from 0 (and ∞), the function mθ is somewhat better behaved
than the function log pθ.
By Lemma 5.35, the parameter θ is identifiable from the density pθ. We can redo the
proof of that lemma to obtain a quantitative bound: up to a constant, the difference
Pθ0 mθ0 − Pθ0 mθ dominates the squared distance between the densities pθ and pθ0.
This shows that θ0 is the unique point of maximum of θ ↦ Pθ0 mθ. Furthermore, if Pθ0 mθk →
Pθ0 mθ0, then θkᵀX →P θ0ᵀX. If the sequence θk is also bounded, then E((θk − θ0)ᵀX)² → 0,
whence θk → θ0 by the nonsingularity of the matrix EXXᵀ. On the other hand, ‖θk‖ cannot
have a diverging subsequence, because in that case θkᵀX/‖θk‖ →P 0 and hence θk/‖θk‖ → 0
by the same argument, which is impossible for vectors of unit length. This verifies condition (5.8).
Checking the uniform convergence to zero of sup_θ |ℙn mθ − Pmθ| is not trivial, but
it becomes an easy exercise if we employ the Glivenko-Cantelli theorem, as discussed in
Chapter 19. The functions x ↦ Ψ(θᵀx) form a VC-class, and the functions mθ take the
form mθ(x, y) = φ(Ψ(θᵀx), y, Ψ(θ0ᵀx)), where the function φ(·, y, η) is Lipschitz in its
first argument with Lipschitz constant bounded above by 1/η + 1/(1 − η). This is enough to


ensure that the functions mθ form a Donsker class and hence certainly a Glivenko-Cantelli
class, in view of Example 19.20.
The asymptotic normality of √n(θ̂n − θ0) is now a consequence of Theorem 5.39. The
score function

ℓ̇θ(x, y) = x (y − Ψ(θᵀx)) Ψ′(θᵀx) / ( Ψ(θᵀx)(1 − Ψ(θᵀx)) )

is uniformly bounded in x, y and θ ranging over compacta, and continuous in θ for every
x and y. The Fisher information matrix

Iθ = E[ XXᵀ Ψ′(θᵀX)² / ( Ψ(θᵀX)(1 − Ψ(θᵀX)) ) ]

is continuous in θ, and is bounded below by a multiple of EXXᵀ and hence is nonsingular.
The differentiability in quadratic mean follows by calculus, or by Lemma 7.6. □
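A minimal sketch of the logit-model computation on simulated data (the Newton-Raphson recursion below is a standard way to maximize the strictly concave log likelihood, not code from the text; all parameter values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(6)
n, k = 5000, 2
theta0 = np.array([1.0, -2.0])
X = rng.normal(size=(n, k))
Y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ theta0)))

def newton_logit(X, Y, steps=30):
    # Newton-Raphson on the log likelihood. For the logit model the log
    # likelihood is strictly concave, so the point of maximum is unique.
    theta = np.zeros(X.shape[1])
    for _ in range(steps):
        mu = 1.0 / (1.0 + np.exp(-X @ theta))         # Psi(theta'x_i)
        grad = X.T @ (Y - mu)                          # score (likelihood eq.)
        hess = -(X * (mu * (1 - mu))[:, None]).T @ X   # = -observed Fisher info
        theta = theta - np.linalg.solve(hess, grad)
    return theta

theta_hat = newton_logit(X, Y)
```

The Hessian used in the update is exactly minus the Fisher information matrix evaluated at the current iterate, so the final iteration also yields the plug-in covariance estimate (nIθ̂)⁻¹.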

*5.6 Classical Conditions


In this section we discuss the "classical conditions" for asymptotic normality of M-estima-
tors. These conditions were formulated in the 1930s and 1940s to make the informal deriva-
tions of the asymptotic normality of maximum likelihood estimators, for instance by Fisher,
mathematically rigorous. Although Theorem 5.23 requires less than a first derivative of
the criterion function, the "classical conditions" require existence of third derivatives. It
is clear that the classical conditions are too stringent, but they are still of interest, because
they are simple, lead to simple proofs, and nevertheless apply to many examples. The
classical conditions also ensure existence of Z-estimators and have a little to say about their
consistency.
We describe the classical approach for general Z-estimators and vector-valued parame-
ters. The higher-dimensional case requires more skill in calculus and matrix algebra than
is necessary for the one-dimensional case. When simplified to dimension one, the argu-
ments do not go much beyond making the informal derivation leading from (5.18) to (5.19)
rigorous.
Let the observations X₁, …, Xₙ be a sample from a distribution P, and consider the
estimating equations

Ψn(θ) = ℙn ψθ,    Ψ(θ) = Pψθ.

The estimator θ̂n is a zero of Ψn, and the true value θ0 a zero of Ψ. The essential condition
of the following theorem is that the second-order partial derivatives of ψθ(x) with respect
to θ exist for every x and satisfy

‖ψ̈θ(x)‖ ≤ ψ̈(x),

for some integrable measurable function ψ̈. This should be true at least for every θ in a
neighborhood of θ0.


5.41 Theorem. For each θ in an open subset of Euclidean space, let θ ↦ ψθ(x) be
twice continuously differentiable for every x. Suppose that Pψθ0 = 0, that P‖ψθ0‖² < ∞
and that the matrix P ψ̇θ0 exists and is nonsingular. Assume that the second-order partial
derivatives are dominated by a fixed integrable function ψ̈(x) for every θ in a neighborhood
of θ0. Then every consistent estimator sequence θ̂n such that Ψn(θ̂n) = 0 for every n
satisfies

√n(θ̂n − θ0) = −(P ψ̇θ0)⁻¹ (1/√n) Σᵢ₌₁ⁿ ψθ0(Xᵢ) + oP(1).

In particular, the sequence √n(θ̂n − θ0) is asymptotically normal with mean zero and
covariance matrix (P ψ̇θ0)⁻¹ Pψθ0 ψθ0ᵀ (P ψ̇θ0)⁻¹.

Proof. By Taylor’s theorem there exist (random) vectors θ̃n on the line segment between
θ0 and θ̂n (possibly different for each coordinate of the function n ) such that

0 = n (θ̂n ) = n (θ0 ) +  ˙ n (θ0 )(θ̂n − θ0 ) + 1 (θ̂n − θ0 )T 


¨ n (θ̃n )(θ̂n − θ0 ).
2
The first term on the right, Ψn(θ0), is an average of the i.i.d. random vectors ψθ0(Xi), which
have mean Pψθ0 = 0. By the central limit theorem, the sequence √n Ψn(θ0) converges
in distribution to a multivariate normal distribution with mean 0 and covariance matrix
Pψθ0ψθ0ᵀ. The derivative Ψ̇n(θ0) in the second term is an average also. By the law of
large numbers it converges in probability to the matrix V = Pψ̇θ0. The second derivative
Ψ̈n(θ̃n) is a k-vector of (k × k) matrices depending on the second-order derivatives ψ̈θ. By
assumption, there exists a ball B around θ0 such that ψ̈θ is dominated by ψ̈ for every
θ ∈ B. The probability of the event {θ̂n ∈ B} tends to 1. On this event

‖Ψ̈n(θ̃n)‖ = ‖(1/n) Σ_{i=1}^n ψ̈θ̃n(Xi)‖ ≤ (1/n) Σ_{i=1}^n ψ̈(Xi).

This is bounded in probability by the law of large numbers. Combination of these facts
allows us to rewrite the preceding display as

−Ψn(θ0) = (V + oP(1) + ½(θ̂n − θ0)ᵀ OP(1))(θ̂n − θ0) = (V + oP(1))(θ̂n − θ0),

because the sequence (θ̂n − θ0)ᵀ OP(1) = oP(1)OP(1) converges to 0 in probability if
θ̂n is consistent for θ0. The probability that the matrix V + oP(1) is invertible tends to 1.
Multiply the preceding equation by √n and apply (V + oP(1))⁻¹ left and right to complete
the proof. ∎
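The sandwich formula lends itself to a quick numerical check. The sketch below is not from the book: the estimating function ψθ(x) = tanh(x − θ), the standard normal data, and all names are our own choices. It solves Pnψθ = 0 by Newton's method and compares the empirical variance of √n(θ̂n − θ0) with the sandwich variance (Pψ̇θ0)⁻¹ Pψθ0² (Pψ̇θ0)⁻¹.

```python
import numpy as np

rng = np.random.default_rng(0)

def psi(x, theta):           # bounded, smooth estimating function
    return np.tanh(x - theta)

def psi_dot(x, theta):       # derivative with respect to theta
    return np.tanh(x - theta) ** 2 - 1.0

def z_estimate(x, theta=0.0, steps=25):
    """Newton iteration on the sample equation (1/n) sum_i psi(X_i, theta) = 0."""
    for _ in range(steps):
        theta -= np.mean(psi(x, theta)) / np.mean(psi_dot(x, theta))
    return theta

# Sandwich variance (P psi_dot)^{-1} P psi^2 (P psi_dot)^{-1}, with P the
# standard normal distribution, approximated by Monte Carlo integration.
z = rng.standard_normal(10**6)
V = np.mean(psi_dot(z, 0.0))
sandwich = np.mean(psi(z, 0.0) ** 2) / V**2

# Empirical variance of sqrt(n)(theta_hat - theta_0) over many replications.
n, reps = 2000, 400
est = np.array([z_estimate(rng.standard_normal(n)) for _ in range(reps)])
empirical = n * np.var(est)
print(sandwich, empirical)   # the two variances should roughly agree
```

Because ψ is bounded, the same recipe works for data without moments, exactly as for the Cauchy score later in this chapter.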

In the preceding sections, the existence and consistency of solutions θ̂n of the estimating
equations are assumed from the start. The present smoothness conditions actually ensure the
existence of solutions. (Again the conditions could be significantly relaxed, as shown in
the next proof.) Moreover, provided there exists a consistent estimator sequence at all, it is
always possible to select a consistent sequence of solutions.

5.42 Theorem. Under the conditions of the preceding theorem, the probability that the
equation Pnψθ = 0 has at least one root tends to 1, as n → ∞, and there exists a sequence
of roots θ̂n such that θ̂n → θ0 in probability. If ψθ = ṁθ is the gradient of some function
mθ and θ0 is a point of local maximum of θ ↦ Pmθ, then the sequence θ̂n can be chosen
to be local maxima of the maps θ ↦ Pnmθ.

Proof. Integrate the Taylor expansion of θ ↦ ψθ(x) with respect to x to find that,
for points θ̃ = θ̃(x) on the line segment between θ0 and θ (possibly different for each
coordinate of the function θ ↦ Pψθ),

Pψθ = Pψθ0 + Pψ̇θ0(θ − θ0) + ½(θ − θ0)ᵀ Pψ̈θ̃ (θ − θ0).

By the domination condition, ‖Pψ̈θ̃‖ is bounded by Pψ̈ < ∞ if θ is sufficiently close
to θ0. Thus, the map Ψ(θ) = Pψθ is differentiable at θ0. By the same argument Ψ is dif-
ferentiable throughout a small neighborhood of θ0, and by a similar expansion (but now to
first order) the derivative Pψ̇θ can be seen to be continuous throughout this neighborhood.
Because Pψ̇θ0 is nonsingular by assumption, we can make the neighborhood still smaller,
if necessary, to ensure that the derivative of Ψ is nonsingular throughout the neighborhood.
Then, by the inverse function theorem, there exists, for every sufficiently small δ > 0, an
open neighborhood Gδ of θ0 such that the map Ψ: Gδ ↦ ball(0, δ) is a homeomorphism.
The diameter of Gδ is bounded by a multiple of δ, by the mean-value theorem and the fact
that the norms of the derivatives (Pψ̇θ)⁻¹ of the inverse Ψ⁻¹ are bounded.
Combining the preceding Taylor expansion with a similar expansion for the sample
version Ψn(θ) = Pnψθ, we see

sup_{θ∈Gδ} ‖Ψn(θ) − Ψ(θ)‖ ≤ oP(1) + δ oP(1) + δ² OP(1),

where the oP(1) terms and the OP(1) term result from the law of large numbers and are
uniform in small δ. Because P(oP(1) + δ oP(1) > ½δ) → 0 for every δ > 0, there exists
δn ↓ 0 such that P(oP(1) + δn oP(1) > ½δn) → 0. If Kn,δ is the event where the left side
of the preceding display is bounded above by δ, then P(Kn,δn) → 1 as n → ∞.
On the event Kn,δ the map θ ↦ θ − Ψn ∘ Ψ⁻¹(θ) maps ball(0, δ) into itself, by the
definitions of Gδ and Kn,δ. Because the map is also continuous, it possesses a fixed point
in ball(0, δ), by Brouwer's fixed-point theorem. This yields a zero of Ψn in the set Gδ,
whence the first assertion of the theorem.
For the final assertion, first note that the Hessian Pψ̇θ0 of θ ↦ Pmθ at θ0 is negative-
definite, by assumption. A Taylor expansion as in the proof of Theorem 5.41 shows that
Pnψ̇θ̂n − Pnψ̇θ0 → 0 in probability for every θ̂n → θ0 in probability. Hence the Hessian
Pnψ̇θ̂n of θ ↦ Pnmθ at any consistent zero θ̂n converges in probability to the negative-
definite matrix Pψ̇θ0 and is negative-definite with probability tending to 1. ∎

The assertion of the theorem that there exists a consistent sequence of roots of the
estimating equations is easily misunderstood. It does not guarantee the existence of a
consistent sequence of estimators. The only claim is that a clairvoyant
statistician (with foreknowledge of θ0) can choose a consistent sequence of roots. In reality,
it may be impossible to choose the right solutions based only on the data (and knowledge
of the model). In this sense the preceding theorem, a standard result in the literature, looks
better than it is.
The situation is not as bad as it seems. One interesting situation is if the solution of the
estimating equation is unique for every n. Then our solutions must be the same as those of
the clairvoyant statistician and hence the sequence of solutions is consistent.


In general, the deficit can be repaired with the help of a preliminary sequence of estimators
θ̃n. If the sequence θ̃n is consistent, then it works to choose the root θ̂n of Pnψθ = 0 that
is closest to θ̃n. Because ‖θ̂n − θ̃n‖ is smaller than the distance ‖θ̂n* − θ̃n‖ between the
clairvoyant sequence θ̂n* and θ̃n, both distances converge to zero in probability. Thus the
sequence of closest roots is consistent.
The assertion of the theorem can also be used in a negative direction. The point θ0 in
the theorem is required to be a zero of θ ↦ Pψθ, but, apart from that, it may be arbitrary.
Thus, the theorem implies at the same time that a malicious statistician can always choose
a sequence of roots θ̂n that converges to any given zero. These may include other points
besides the "true" value of θ. Furthermore, inspection of the proof shows that the sequence
of roots can also be chosen to jump back and forth between two (or more) zeros. If the
function θ ↦ Pψθ has multiple roots, we must exercise care. We can be sure that certain
roots of θ ↦ Pnψθ are bad estimators.
Part of the problem here is caused by using estimating equations, rather than maximiza-
tion, to find estimators, which blurs the distinction between points of absolute maximum,
local maximum, and even minimum. In the light of the results on consistency in Section 5.2,
we may expect the location of the point of absolute maximum of θ ↦ Pnmθ to converge
to a point of absolute maximum of θ ↦ Pmθ. As long as the latter is unique, the absolute
maximizers of the criterion function are typically consistent.

5.43 Example (Weibull distribution). Let X1, ..., Xn be a sample from the Weibull dis-
tribution with density

pθ,σ(x) = (θ/σ) x^{θ−1} e^{−x^θ/σ},  x > 0, θ > 0, σ > 0.

(Then σ^{1/θ} is a scale parameter.) The score function is given by the partial derivatives of
the log density with respect to θ and σ:

ℓ̇θ,σ(x) = ( 1/θ + log x − (x^θ/σ) log x , −1/σ + x^θ/σ² ).

The likelihood equations Σ ℓ̇θ,σ(Xi) = 0 reduce to

σ = (1/n) Σ_{i=1}^n Xi^θ,
1/θ + (1/n) Σ_{i=1}^n log Xi − (Σ_{i=1}^n Xi^θ log Xi)/(Σ_{i=1}^n Xi^θ) = 0.

The left side of the second equation is strictly decreasing in θ, from ∞ at θ = 0 to
\overline{log x} − log x_{(n)} at θ = ∞. Hence a solution exists, and is unique, unless all Xi are equal.
Provided the higher-order derivatives of the score function exist and can be dominated,
the sequence of maximum likelihood estimators (θ̂n, σ̂n) is asymptotically normal by
Theorems 5.41 and 5.42. There exist four different third-order derivatives, given by

∂³ℓθ,σ(x)/∂θ³ = 2/θ³ − (x^θ/σ) log³ x,
∂³ℓθ,σ(x)/∂θ²∂σ = (x^θ/σ²) log² x,
∂³ℓθ,σ(x)/∂θ∂σ² = −(2x^θ/σ³) log x,
∂³ℓθ,σ(x)/∂σ³ = −2/σ³ + 6x^θ/σ⁴.

For θ and σ ranging over sufficiently small neighborhoods of θ0 and σ0, these functions are
dominated by a function of the form

M(x) = A(1 + x^B)(1 + |log x| + ⋯ + |log x|³),

for sufficiently large A and B. Because the Weibull distribution has an exponentially small
tail, the mixed moment E_{θ0,σ0} X^p |log X|^q is finite for every p, q ≥ 0. Thus, all moments
of ℓ̇θ0,σ0 exist and M is integrable. □
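Because the profiled likelihood equation in θ is strictly monotone, the likelihood equations are easy to solve numerically. The following sketch is our own (the function name, bracketing interval, and tolerance are arbitrary choices); it finds the root by bisection and then computes σ̂ from the first equation.

```python
import numpy as np

def weibull_mle(x, lo=1e-3, hi=50.0, tol=1e-10):
    """Maximum likelihood for the Weibull density (theta/sigma) x^(theta-1) exp(-x^theta/sigma)."""
    logx = np.log(x)

    def g(theta):
        # left side of the profiled likelihood equation in theta;
        # strictly decreasing, from +infinity at 0 to a negative limit
        xt = x ** theta
        return 1.0 / theta + logx.mean() - (xt * logx).sum() / xt.sum()

    while hi - lo > tol:          # bisection on the monotone function g
        mid = 0.5 * (lo + hi)
        if g(mid) > 0:
            lo = mid
        else:
            hi = mid
    theta = 0.5 * (lo + hi)
    sigma = np.mean(x ** theta)   # first likelihood equation
    return theta, sigma

rng = np.random.default_rng(1)
x = rng.weibull(2.0, size=5000)   # numpy's Weibull: theta = 2, sigma = 1
theta_hat, sigma_hat = weibull_mle(x)
print(theta_hat, sigma_hat)       # both close to the true values
```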

*5.7 One-Step Estimators


The method of Z-estimation as discussed so far has two disadvantages. First, it may be
hard to find the roots of the estimating equations. Second, for the roots to be consistent,
the estimating equation needs to behave well throughout the parameter set. For instance,
the existence of a second root close to the boundary of the parameter set may cause trouble.
The one-step method overcomes these problems by building on and improving a preliminary
estimator θ̃n.
The idea is to solve the estimator from a linear approximation to the original estimating
equation Ψn(θ) = 0. Given a preliminary estimator θ̃n, the one-step estimator is the solution
(in θ) to

Ψn(θ̃n) + Ψ̇n(θ̃n)(θ − θ̃n) = 0.

This corresponds to replacing Ψn(θ) by its tangent at θ̃n, and is known as the method of
Newton-Raphson in numerical analysis. The solution θ = θ̂n is

θ̂n = θ̃n − Ψ̇n(θ̃n)⁻¹ Ψn(θ̃n).
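In code the one-step update is a single line. The sketch below is our own (the choice ψ = the Cauchy location score of Example 5.50, and the median as √n-consistent preliminary estimator, are just one possibility):

```python
import numpy as np

def one_step(psi, psi_dot, x, theta_tilde):
    """One Newton-Raphson update of theta_tilde for Psi_n(theta) = (1/n) sum_i psi(x_i, theta)."""
    return theta_tilde - np.mean(psi(x, theta_tilde)) / np.mean(psi_dot(x, theta_tilde))

def cauchy_score(x, t):            # derivative of the Cauchy log density in t
    u = x - t
    return 2 * u / (1 + u * u)

def cauchy_score_dot(x, t):        # its derivative with respect to t
    u = x - t
    return (2 * u * u - 2) / (1 + u * u) ** 2

rng = np.random.default_rng(2)
x = rng.standard_cauchy(5000)      # true location 0
theta_hat = one_step(cauchy_score, cauchy_score_dot, x, np.median(x))
print(theta_hat)                   # close to 0
```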

In numerical analysis this procedure is iterated a number of times, taking θ̂n as the new
preliminary guess, and so on. Provided that the starting point θ̃n is well chosen, the sequence
of solutions converges to a root of Ψn. Our interest here goes in a different direction. We
suppose that the preliminary estimator θ̃n is already within range n^{-1/2} of the true value
of θ. Then, as we shall see, just one iteration of the Newton-Raphson scheme produces
an estimator θ̂n that is as good as the Z-estimator defined by Ψn. In fact, it is better in
that its consistency is guaranteed, whereas the true Z-estimator may be inconsistent or not
uniquely defined.
In this way consistency and asymptotic normality are effectively separated, which is
useful because these two aims require different properties of the estimating equations.
Good initial estimators can be constructed by ad hoc methods and take care of consistency.
Next, these initial estimators can be improved by the one-step method. Thus, for instance,
the good properties of maximum likelihood estimation can be retained, even in cases in
which the consistency fails.
In this section we impose the following condition on the random criterion functions Ψn:
For every constant M and a given nonsingular matrix Ψ̇0,

sup_{√n‖θ−θ0‖≤M} ‖√n(Ψn(θ) − Ψn(θ0)) − Ψ̇0 √n(θ − θ0)‖ →P 0.  (5.44)

Condition (5.44) suggests that Ψn is differentiable at θ0, with derivative tending to Ψ̇0, but
this is not an assumption. We do not require that a derivative exists, and introduce
a further refinement of the Newton-Raphson scheme by replacing Ψ̇n(θ̃n) by arbitrary
estimators: Given nonsingular, random matrices Ψ̇n,0 that converge in probability to Ψ̇0,
define the one-step estimator

θ̂n = θ̃n − Ψ̇n,0⁻¹ Ψn(θ̃n).

Call an estimator sequence θ̃n √n-consistent if the sequence √n(θ̃n − θ0) is uniformly
tight. The interpretation is that θ̃n already determines the value θ0 within n^{-1/2}-range.

5.45 Theorem (One-step estimation). Let √n Ψn(θ0) ⇝ Z and let (5.44) hold. Then the
one-step estimator θ̂n, for a given √n-consistent estimator sequence θ̃n and estimators
Ψ̇n,0 →P Ψ̇0, satisfies

√n(θ̂n − θ0) ⇝ −Ψ̇0⁻¹ Z.

5.46 Addendum. For Ψn(θ) = Pnψθ, condition (5.44) is satisfied under the conditions of
Theorem 5.21 with Ψ̇0 = Vθ0, and under the conditions of Theorem 5.41 with Ψ̇0 = Pψ̇θ0.
Proof. The standardized estimator √n(θ̂n − θ0) equals

√n(θ̃n − θ0) − Ψ̇n,0⁻¹ [√n(Ψn(θ̃n) − Ψn(θ0)) + √n Ψn(θ0)].

By (5.44) the second term in the square brackets can be replaced by Ψ̇0 √n(θ̃n − θ0) + oP(1).
Thus the expression can be rewritten as

(I − Ψ̇n,0⁻¹ Ψ̇0) √n(θ̃n − θ0) − Ψ̇n,0⁻¹ √n Ψn(θ0) + oP(1).

The first term converges to zero in probability, and the theorem follows after application of
Slutsky's lemma.
For a proof of the addendum, see the proofs of the corresponding theorems. ∎

If the sequence √n(θ̃n − θ0) converges in distribution, then it is certainly uniformly tight.
Consequently, a sequence of one-step estimators is √n-consistent and can itself be used as
preliminary estimator for a second iteration of the modified Newton-Raphson algorithm.
Presumably, this would give a value closer to a root of Ψn. However, the limit distribution
of this "two-step estimator" is the same, so that repeated iteration gives no asymptotic
improvement. In practice a multistep method may nevertheless give better results.
We close this section with a discussion of the discretization trick. This device is mostly
of theoretical value and has been introduced to relax condition (5.44) to the following: For
every nonrandom sequence θn = θ0 + O(n^{-1/2}),

√n(Ψn(θn) − Ψn(θ0)) − Ψ̇0 √n(θn − θ0) →P 0.  (5.47)

This new condition is less stringent and much easier to check. It is sufficiently strong if
the preliminary estimators θ̃n are discretized on grids of mesh width n^{-1/2}. For instance,
θ̃n is suitably discretized if all its realizations are points of the grid n^{-1/2}ℤ^k (consisting
of the points n^{-1/2}(i1, ..., ik) for integers i1, ..., ik). This is easy to achieve, but perhaps
unnatural. Any preliminary estimator sequence θ̃n can be discretized by replacing its values
by the closest points of the grid. Because this changes each coordinate by at most n^{-1/2},
the √n-consistency of θ̃n is retained by discretization.
Define a one-step estimator θ̂n as before, but now use a discretized version of the pre-
liminary estimator.
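Discretization itself is a one-line operation; a sketch (the helper name is ours):

```python
import numpy as np

def discretize(theta, n):
    """Replace theta by the closest point of the grid n^(-1/2) Z^k."""
    theta = np.asarray(theta, dtype=float)
    return np.round(theta * np.sqrt(n)) / np.sqrt(n)

print(discretize([0.4837, -1.2101], n=100))   # grid of mesh width 0.1
```

Each coordinate moves by at most ½n^{-1/2}, so √n-consistency is preserved.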

5.48 Theorem (Discretized one-step estimation). Let √n Ψn(θ0) ⇝ Z and let (5.47) hold.
Then the one-step estimator θ̂n, for a given √n-consistent, discretized estimator sequence
θ̃n and estimators Ψ̇n,0 →P Ψ̇0, satisfies

√n(θ̂n − θ0) ⇝ −Ψ̇0⁻¹ Z.

5.49 Addendum. For Ψn(θ) = Pnψθ and Pn the empirical measure of a random sample
from a density pθ that is differentiable in quadratic mean (5.38), condition (5.47) is satisfied,
with Ψ̇0 = −Pθ0 ψθ0 ℓ̇θ0ᵀ, if, as θ → θ0,

∫ ‖ψθ √pθ − ψθ0 √pθ0‖² dμ → 0.

Proof. The arguments of the previous proof apply, except that it must be shown that

R(θ̃n) := √n(Ψn(θ̃n) − Ψn(θ0)) − Ψ̇0 √n(θ̃n − θ0)

converges to zero in probability. Fix ε > 0. By the √n-consistency, there exists M with
P(√n‖θ̃n − θ0‖ > M) < ε. If √n‖θ̃n − θ0‖ ≤ M, then θ̃n equals one of the values in the
set Sn = {θ ∈ n^{-1/2}ℤ^k: ‖θ − θ0‖ ≤ n^{-1/2}M}. For each M and n there are only finitely
many elements in this set. Moreover, for fixed M the number of elements is bounded
independently of n. Thus

P(‖R(θ̃n)‖ > ε) ≤ ε + Σ_{θn∈Sn} P(‖R(θn)‖ > ε).

The maximum of the terms in the sum corresponds to a sequence of nonrandom vectors θn
with θn = θ0 + O(n^{-1/2}). It converges to zero by (5.47). Because the number of terms in
the sum is bounded independently of n, the sum converges to zero.
For a proof of the addendum, see proposition A.10 in [139]. ∎

If the score function ℓ̇θ of the model also satisfies the conditions of the addendum,
then the estimators Ψ̇n,0 = −Pn ψθ̃n ℓ̇θ̃nᵀ are consistent for Ψ̇0. This shows that discretized
one-step estimation can be carried through under very mild regularity conditions. Note
that the addendum requires only continuity of θ ↦ ψθ, whereas (5.47) appears to require
differentiability.

5.50 Example (Cauchy distribution). Suppose X1, ..., Xn are a sample from the Cauchy
location family pθ(x) = π⁻¹(1 + (x − θ)²)⁻¹. Then the score function is given by

ℓ̇θ(x) = 2(x − θ)/(1 + (x − θ)²).


Figure 5.4. Cauchy log-likelihood function of a sample of 25 observations, showing three local
maxima. The value of the absolute maximum is well separated from the other maxima, and its
location is close to the true value zero of the parameter.

This function behaves like 1/x for x → ±∞ and is bounded in between. The second
moment of ℓ̇θ(X1) therefore exists, unlike the moments of the Cauchy distribution itself. Because
the sample mean possesses the same (Cauchy) distribution as a single observation X1, the
sample mean is a very inefficient estimator. Instead we could use the median, or another
M-estimator. However, the asymptotically best estimator should be based on maximum
likelihood. We have

∂²ℓ̇θ(x)/∂θ² = 4(x − θ)((x − θ)² − 3)/(1 + (x − θ)²)³.

The tails of this function are of the order 1/x³, and the function is bounded in between.
These bounds are uniform in () varying over a compact interval. Thus the conditions of
Theorems 5.41 and 5.42 are satisfied. Since the consistency follows from Example 5.16,
the sequence of maximum likelihood estimators is asymptotically normal.
The Cauchy likelihood estimator has gained a bad reputation, because the likelihood
equation Σ ℓ̇θ(Xi) = 0 typically has several roots. The number of roots behaves asymp-
totically as two times a Poisson(1/π) variable plus 1. (See [126].) Therefore, the one-step
(or possibly multistep) method is often recommended, with, for instance, the median as the
initial estimator. Perhaps a better solution is not to use the likelihood equations, but to deter-
mine the maximum likelihood estimator by, for instance, visual inspection of a graph of the
likelihood function, as in Figure 5.4. This is particularly appropriate because the difficulty of
multiple roots does not occur in the two-parameter location-scale model: In the model with
density pθ(x/σ)/σ, the maximum likelihood estimator for (θ, σ) is unique. (See [25].) □
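The multiplicity of roots is easy to observe in simulation. The sketch below is our own (grid resolution and the number of replications are arbitrary choices); it counts the roots of the likelihood equation for repeated Cauchy samples of size 25 by counting sign changes over a fine grid:

```python
import numpy as np

def n_likelihood_roots(x, grid):
    """Count roots of sum_i 2(x_i - t)/(1 + (x_i - t)^2) = 0 via sign changes."""
    u = x[:, None] - grid[None, :]
    f = (2 * u / (1 + u * u)).sum(axis=0)
    return int(np.sum(np.sign(f[:-1]) != np.sign(f[1:])))

rng = np.random.default_rng(3)
counts = []
for _ in range(100):
    x = rng.standard_cauchy(25)
    grid = np.linspace(x.min() - 5, x.max() + 5, 4001)
    counts.append(n_likelihood_roots(x, grid))
print(np.mean(counts), max(counts))   # most samples have one root, some three or more
```

The score sum is positive far to the left of the data and negative far to the right, so the number of detected roots is always odd.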

5.51 Example (Mixtures). Let f and g be given, positive probability densities on the real
line. Consider estimating the parameter θ = (μ, ν, σ, τ, p) based on a random sample from
the mixture density

x ↦ (p/σ) f((x − μ)/σ) + ((1 − p)/τ) g((x − ν)/τ).
If f and g are sufficiently regular, then this is a smooth five-dimensional parametric model,
and the standard theory should apply. Unfortunately, the supremum of the likelihood over
the natural parameter space is ∞, and there exists no maximum likelihood estimator. This
is seen, for instance, from the fact that the likelihood is bigger than

(p/σ) f((x1 − μ)/σ) ∏_{i=2}^n ((1 − p)/τ) g((xi − ν)/τ).

If we set μ = x1 and next maximize over σ > 0, then we obtain the value ∞ whenever
p > 0, irrespective of the values of ν and τ .
A one-step estimator appears reasonable in this example. In view of the smoothness of
the likelihood, the general theory yields the asymptotic efficiency of a one-step estimator
if started with an initial √n-consistent estimator. Moment estimators could be appropriate
initial estimators. □
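The degeneracy is easy to reproduce numerically. In the sketch below (our own choices: f = g = the standard normal density, an arbitrary normal sample, p = ½), fixing μ = x1 and letting σ ↓ 0 eventually sends the log likelihood to ∞, at the rate −log σ. Note that the divergence is only eventual: for moderate σ the likelihood may first decrease, because the shrinking first component stops covering the observations other than x1.

```python
import numpy as np

def mixture_loglik(x, p, mu, sigma, nu, tau):
    """Log likelihood of the mixture p*phi((x-mu)/sigma)/sigma + (1-p)*phi((x-nu)/tau)/tau."""
    phi = lambda z: np.exp(-0.5 * z * z) / np.sqrt(2 * np.pi)
    dens = p * phi((x - mu) / sigma) / sigma + (1 - p) * phi((x - nu) / tau) / tau
    return np.log(dens).sum()

rng = np.random.default_rng(4)
x = rng.standard_normal(100)

sigmas = [1.0, 1e-10, 1e-30, 1e-60]
lls = [mixture_loglik(x, p=0.5, mu=x[0], sigma=s, nu=0.0, tau=1.0) for s in sigmas]
for s, ll in zip(sigmas, lls):
    print(s, ll)        # the log likelihood grows without bound as sigma -> 0
```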

∗ 5.8 Rates of Convergence


In this section we discuss some results that give the rate of convergence of M-estimators.
These results are useful as intermediate steps in deriving a limit distribution, but are also of
interest on their own. Applications include both classical estimators of "regular" parameters
and estimators that converge at a slower than √n rate. The main result is simple enough,
but its conditions include a maximal inequality, for which results such as in Chapter 19 are
needed.
Let Pn be the empirical distribution of a random sample of size n from a distribution
P, and, for every θ in a metric space Θ, let x ↦ mθ(x) be a measurable function. Let θ̂n
(nearly) maximize the criterion function θ ↦ Pnmθ.
The criterion function may be viewed as the sum of the deterministic map θ ↦ Pmθ
and the random fluctuations θ ↦ Pnmθ − Pmθ. The rate of convergence of θ̂n depends on
the combined behavior of these maps. If the deterministic map changes rapidly as θ moves
away from the point of maximum and the random fluctuations are small, then θ̂n has a high
rate of convergence. For convenience of notation we measure the fluctuations in terms of
the empirical process Gn mθ = √n(Pnmθ − Pmθ).

5.52 Theorem (Rate of convergence). Assume that for fixed constants C and α > β, for
every n, and for every sufficiently small δ > 0,

sup_{δ/2 ≤ d(θ,θ0) < δ} P(mθ − mθ0) ≤ −Cδ^α,

E* sup_{d(θ,θ0) < δ} |Gn(mθ − mθ0)| ≤ Cδ^β.

If the sequence θ̂n satisfies Pnmθ̂n ≥ Pnmθ0 − OP(n^{α/(2β−2α)}) and converges in outer
probability to θ0, then n^{1/(2α−2β)} d(θ̂n, θ0) = OP*(1).


Proof. Set rn = n^{1/(2α−2β)} and suppose that θ̂n maximizes the map θ ↦ Pnmθ up to a
variable Rn = OP(rn^{−α}).
For each n, the parameter space minus the point θ0 can be partitioned into the "shells"
Sj,n = {θ: 2^{j−1} < rn d(θ, θ0) ≤ 2^j}, with j ranging over the integers. If rn d(θ̂n, θ0) is
larger than 2^M for a given integer M, then θ̂n is in one of the shells Sj,n with j ≥ M. In
that case the supremum of the map θ ↦ Pnmθ − Pnmθ0 over this shell is at least −Rn by
the property of θ̂n. Conclude that, for every ε > 0,

P*( rn d(θ̂n, θ0) > 2^M ) ≤ Σ_{j≥M, 2^j≤εrn} P*( sup_{θ∈Sj,n} (Pnmθ − Pnmθ0) ≥ −K rn^{−α} ) + P*( 2 d(θ̂n, θ0) ≥ ε ) + P( rn^α Rn ≥ K ).

If the sequence θ̂n is consistent for θ0, then the second probability on the right converges
to 0 as n → ∞, for every fixed ε > 0. The third probability on the right can be made
arbitrarily small by choice of K, uniformly in n. Choose ε > 0 small enough to ensure that
the conditions of the theorem hold for every δ ≤ ε. Then for every j involved in the sum,
we have

sup_{θ∈Sj,n} P(mθ − mθ0) ≤ −C 2^{(j−1)α}/rn^α.

For ½C2^{(M−1)α} ≥ K, the series can be bounded, in terms of the empirical process Gn, by

Σ_{j≥M} P*( sup_{θ∈Sj,n} |Gn(mθ − mθ0)| ≥ √n C 2^{(j−1)α}/(2rn^α) ) ≤ Σ_{j≥M} (C(2^j/rn)^β · 2rn^α)/(√n C 2^{(j−1)α}) ≲ Σ_{j≥M} 2^{jβ−jα+α},

by Markov's inequality and the definition of rn. The right side converges to zero for every
M = Mn → ∞. ∎

Consider the special case that the parameter θ is a Euclidean vector. If the map θ ↦ Pmθ
is twice differentiable at the point of maximum θ0, then its first derivative at θ0 vanishes
and a Taylor expansion of the limit criterion function takes the form

Pmθ = Pmθ0 + ½(θ − θ0)ᵀ V(θ − θ0) + o(‖θ − θ0‖²).

Then the first condition of the theorem holds with α = 2 provided that the second-derivative
matrix V is nonsingular.
The second condition of the theorem is a maximal inequality and is harder to verify. In
"regular" cases it is valid with β = 1 and the theorem yields the "usual" rate of convergence
√n. The theorem also applies to nonstandard situations and yields, for instance, the rate
n^{1/3} if α = 2 and β = ½. Lemmas 19.34, 19.36, and 19.38 and Corollary 19.35 are examples
of maximal inequalities that can be appropriate for the present purpose. They give bounds
in terms of the entropies of the classes of functions {mθ − mθ0: d(θ, θ0) < δ}.
A Lipschitz condition on the maps θ ↦ mθ is one possibility to obtain simple estimates
on these entropies and is applicable in many applications. The result of the following
corollary is used earlier in this chapter.


5.53 Corollary. For each θ in an open subset of Euclidean space let x ↦ mθ(x) be a
measurable function such that, for every θ1 and θ2 in a neighborhood of θ0 and a measurable
function ṁ such that Pṁ² < ∞,

|mθ1(x) − mθ2(x)| ≤ ṁ(x) ‖θ1 − θ2‖.

Furthermore, suppose that the map θ ↦ Pmθ admits a second-order Taylor expansion at
the point of maximum θ0 with nonsingular second derivative. If Pnmθ̂n ≥ Pnmθ0 − oP(n⁻¹),
then √n(θ̂n − θ0) = OP(1), provided that θ̂n →P θ0.

Proof. By assumption, the first condition of Theorem 5.52 is valid with α = 2. To see
that the second one is valid with β = 1, we apply Corollary 19.35 to the class of functions
F = {mθ − mθ0: ‖θ − θ0‖ < δ}. This class has envelope function F = ṁδ, whence

E* sup_{‖θ−θ0‖<δ} |Gn(mθ − mθ0)| ≲ ∫₀^{‖ṁδ‖_{P,2}} √(log N_[](ε, F, L2(P))) dε.

The bracketing entropy of the class F is estimated in Example 19.7. Inserting the upper
bound obtained there into the integral, we obtain that the preceding display is bounded
above by a multiple of

∫₀^{δ‖ṁ‖_{P,2}} √(log(δ‖ṁ‖_{P,2}/ε)) dε.

Change the variables in the integral to see that this is a multiple of δ. ∎

Rates of convergence different from √n are quite common for M-estimators of infinite-
dimensional parameters and may also be obtained through the application of Theorem 5.52.
See Chapters 24 and 25 for examples. Rates slower than √n may also arise for fairly simple
parametric estimates.

5.54 Example (Modal interval). Suppose that we define an estimator θ̂n of location as the
center of an interval of length 2 that contains the largest possible fraction of the observations.
This is an M-estimator for the functions mθ = 1_{[θ−1, θ+1]}.
For many underlying distributions the first condition of Theorem 5.52 holds with α = 2.
It suffices that the map θ ↦ Pmθ = P[θ − 1, θ + 1] is twice differentiable and has
a proper maximum at some point θ0. Using the maximal inequality Corollary 19.35 (or
Lemma 19.38), we can show that the second condition is valid with β = ½. Indeed, the
bracketing numbers of the intervals in the real line are of the order δ/ε², and the envelope
function of the class of functions 1_{[θ−1,θ+1]} − 1_{[θ0−1,θ0+1]}, as θ ranges over (θ0 − δ, θ0 + δ),
is bounded by 1_{[θ0−1−δ, θ0−1+δ]} + 1_{[θ0+1−δ, θ0+1+δ]}, whose squared L2-norm is bounded by
4δ‖p‖∞.
Thus Theorem 5.52 applies with α = 2 and β = ½ and yields the rate of convergence
n^{1/3}. The resulting location estimator is very robust against outliers. However, in view of
its slow convergence rate, one should have good reasons to use it.
The use of an interval of length 2 is somewhat awkward. Every other fixed length would
give the same result. More interestingly, we can also replace the fixed-length interval by the
smallest interval that contains a fixed fraction, for instance 1/2, of the observations. This
still yields a rate of convergence of n^{1/3}. The intuitive reason for this is that the length of a
"shorth" settles down at a √n rate, and hence its randomness is asymptotically negligible
relative to that of its center. □
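A simple implementation anchors the left endpoint of the candidate intervals at the observations, which loses nothing because any maximizing interval can be slid until its left endpoint hits an observation (a sketch; the function name is ours):

```python
import numpy as np

def modal_interval_center(x, half_width=1.0):
    """Center of an interval of length 2*half_width containing the most observations."""
    xs = np.sort(x)
    # number of points in [xs[i], xs[i] + 2*half_width], for every i
    counts = np.searchsorted(xs, xs + 2 * half_width, side="right") - np.arange(len(xs))
    return xs[np.argmax(counts)] + half_width

rng = np.random.default_rng(5)
x = rng.standard_normal(10000)
print(modal_interval_center(x))   # near 0, but converging only at the n^(1/3) rate
```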

The preceding theorem requires the consistency of θ̂n as a condition. This consistency is
implied if the other conditions are valid for every δ > 0, not just for small values of δ. This
can be seen from the proof, or from the more general theorem in the next section. Because the
conditions are not natural for large values of δ, it is usually better to argue the consistency
by other means.

5.8.1 Nuisance Parameters

In Chapter 25 we need an extension of Theorem 5.52 that allows for a "smoothing" or
"nuisance" parameter. We also take the opportunity to insert a number of other refinements,
which are sometimes useful.
Let x ↦ m_{θ,η}(x) be measurable functions indexed by parameters (θ, η), and consider
estimators θ̂n contained in a set Θn that, for a given η̂n contained in a set Hn, maximize the
map

θ ↦ Pn m_{θ,η̂n}.

The sets Θn and Hn need not be metric spaces, but instead we measure the discrepancies
between θ̂n and θ0, and η̂n and a limiting value η0, by nonnegative functions θ ↦ d_η(θ, θ0)
and η ↦ d(η, η0), which may be arbitrary.

5.55 Theorem. Assume that, for arbitrary functions e_n: Θn × Hn ↦ ℝ and φn: (0, ∞) ↦
ℝ such that δ ↦ φn(δ)/δ^β is decreasing for some β < 2, every (θ, η) ∈ Θn × Hn, and
every δ > 0,

P(m_{θ,η} − m_{θ0,η}) + e_n(θ, η) ≤ −d_η²(θ, θ0) + d²(η, η0),

E* sup_{d_η(θ,θ0)<δ, (θ,η)∈Θn×Hn} |Gn(m_{θ,η} − m_{θ0,η}) − √n e_n(θ, η)| ≤ φn(δ).

Let δn > 0 satisfy φn(δn) ≤ √n δn² for every n. If P(θ̂n ∈ Θn, η̂n ∈ Hn) → 1 and
Pn m_{θ̂n,η̂n} ≥ Pn m_{θ0,η̂n} − OP(δn²), then d_{η̂n}(θ̂n, θ0) = OP*(δn + d(η̂n, η0)).

Proof. For simplicity assume that Pn m_{θ̂n,η̂n} ≥ Pn m_{θ0,η̂n}, without a tolerance term. For
each n ∈ ℕ, j ∈ ℤ, and M > 0, let S_{n,j,M} be the set

{(θ, η) ∈ Θn × Hn: 2^{j−1}δn < d_η(θ, θ0) ≤ 2^j δn, d(η, η0) ≤ 2^{j−M} δn}.

Then the intersection of the events {(θ̂n, η̂n) ∈ Θn × Hn} and {d_{η̂n}(θ̂n, θ0) ≥ 2^M(δn + d(η̂n, η0))}
is contained in the union of the events {(θ̂n, η̂n) ∈ S_{n,j,M}} over j ≥ M. By the definition
of θ̂n, the supremum of Pn(m_{θ,η} − m_{θ0,η}) over the set of parameters (θ, η) ∈ S_{n,j,M} is
nonnegative on the event {(θ̂n, η̂n) ∈ S_{n,j,M}}. Conclude that

P*( d_{η̂n}(θ̂n, θ0) ≥ 2^M(δn + d(η̂n, η0)), (θ̂n, η̂n) ∈ Θn × Hn )
≤ Σ_{j≥M} P*( sup_{(θ,η)∈S_{n,j,M}} Pn(m_{θ,η} − m_{θ0,η}) ≥ 0 ).

For every j, (θ, η) ∈ S_{n,j,M}, and every sufficiently large M,

P(m_{θ,η} − m_{θ0,η}) + e_n(θ, η) ≤ −d_η²(θ, θ0) + d²(η, η0)
≤ −(1 − 2^{2−2M}) d_η²(θ, θ0) ≤ −(2^{2j}/4)(δn²/2).

From here on the proof is the same as the proof of Theorem 5.52, except that we use that
φn(cδ) ≤ c^β φn(δ) for every c > 1, by the assumption on φn. ∎

*5.9 Argmax Theorem


The consistency of a sequence of M-estimators can be understood as the points of maximum
θ̂n of the criterion functions θ ↦ Mn(θ) converging in probability to a point of maximum
of the limit criterion function θ ↦ M(θ). So far we have made no attempt to understand
the distributional limit properties of a sequence of M-estimators in a similar way. This is
possible, but it is somewhat more complicated and is perhaps best studied after developing
the theory of weak convergence of stochastic processes, as in Chapters 18 and 19.
Because the estimators θ̂n typically converge to constants, it is necessary to rescale them
before studying distributional limit properties. Thus, we start by searching for a sequence
of numbers rn → ∞ such that the sequence ĥn = rn(θ̂n − θ0) is uniformly tight. The results
of the preceding section should be useful. If θ̂n maximizes the function θ ↦ Mn(θ), then
the rescaled estimators ĥn are maximizers of the local criterion functions

h ↦ Mn(θ0 + h/rn) − Mn(θ0).

Suppose that these, if suitably normed, converge to a limit process h ↦ M(h). Then the
general principle is that the sequence ĥn converges in distribution to the maximizer of this
limit process.
For simplicity of notation we shall write the local criterion functions as h ↦ Mn(h).
Let {Mn(h): h ∈ Hn} be arbitrary stochastic processes indexed by subsets Hn of a given
metric space. We wish to prove that the argmax functional is continuous: If Mn ⇝ M and
Hn → H in a suitable sense, then the (near) maximizers ĥn of the random maps h ↦ Mn(h)
converge in distribution to the maximizer ĥ of the limit process h ↦ M(h). It is easy to
find examples in which this is not true, but given the right definitions it is, under some
conditions. Given a set B, set

M(B) = sup_{h∈B} M(h).

Then convergence in distribution of the vectors (Mn(A), Mn(B)) for given pairs of sets A
and B is an appropriate form of convergence of Mn to M. The following theorem gives some
flexibility in the choice of the indexing sets. We implicitly either assume that the suprema
Mn(B) are measurable or understand the weak convergence in terms of outer probabilities,
as in Chapter 18.
The result we are looking for is not likely to be true if the maximizer of the limit process
is not well defined. Exactly as in Theorem 5.7, the maximum should be "well separated."
Because in the present case the limit is a stochastic process, we require that every sample
path h ↦ M(h) possesses a well-separated maximum (condition (5.57)).


5.56 Theorem (Argmax theorem). Let Mn and M be stochastic processes indexed by sub-
sets Hn and H of a given metric space such that, for every pair of a closed set F and a set
K in a given collection K,

(Mn(F ∩ K ∩ Hn), Mn(K ∩ Hn)) ⇝ (M(F ∩ K ∩ H), M(K ∩ H)).

Furthermore, suppose that every sample path of the process h ↦ M(h) possesses a well-
separated point of maximum ĥ in that, for every open set G and every K ∈ K,

M(ĥ) > M(G^c ∩ K ∩ H),  if ĥ ∈ G, a.s.  (5.57)

If Mn(ĥn) ≥ Mn(Hn) − oP(1) and for every ε > 0 there exists K ∈ K such that sup_n P(ĥn ∉ K) < ε
and P(ĥ ∉ K) < ε, then ĥn ⇝ ĥ.

Proof. If ĥn ∈ F ∩ K, then Mn(F ∩ K ∩ Hn) ≥ Mn(ĥn) ≥ Mn(B) − oP(1) for any set B.
Hence, for every closed set F and every K ∈ K,

P(ĥn ∈ F ∩ K) ≤ P( Mn(F ∩ K ∩ Hn) ≥ Mn(K ∩ Hn) − oP(1) )

≤ P( M(F ∩ K ∩ H) ≥ M(K ∩ H) ) + o(1),

by Slutsky's lemma and the portmanteau lemma. If ĥ ∈ F^c, then M(F ∩ K ∩ H) is strictly
smaller than M(ĥ) by (5.57), and hence on the intersection with the event on the far right
side ĥ cannot be contained in K ∩ H. It follows that

lim sup P(ĥn ∈ F ∩ K) ≤ P(ĥ ∈ F) + P(ĥ ∉ K ∩ H).

By assumption we can choose K such that the left and right sides change by less than ε if
we replace K by the whole space. Hence ĥn ⇝ ĥ by the portmanteau lemma. ∎

The theorem works most smoothly if we can take K to consist only of the whole space.
However, then we are close to assuming some sort of global uniform convergence of Mn
to M, and this may not hold or may be hard to prove. It is usually more economical in terms
of conditions to show that the maximizers ĥn are contained in certain sets K with high
probability. Then uniform convergence of Mn to M on K is sufficient. The choice of
compact sets K corresponds to establishing the uniform tightness of the sequence ĥn before
applying the argmax theorem.
If the sample paths of the processes Mn are bounded on K and Hn = H for every n, then
the convergence of the processes Mn viewed as elements of the space ℓ∞(K) implies
the convergence condition of the argmax theorem. This follows by the continuous-mapping
theorem, because the map

z ↦ (z(A ∩ K), z(B ∩ K))

from ℓ∞(K) to ℝ² is continuous, for every pair of sets A and B. The weak convergence in
ℓ∞(K) remains sufficient if the sets Hn depend on n but converge in a suitable way. Write
Hn → H if H is the set of all limits lim hn of converging sequences hn with hn ∈ Hn
for every n and, moreover, the limit h = lim_i h_{n_i} of every converging sequence h_{n_i} with
h_{n_i} ∈ H_{n_i} for every i is contained in H.

5.9 Argmax Theorem 81

5.58 Corollary. Suppose that Mn ⇝ M in ℓ∞(K) for every compact subset K of ℝᵏ, for
a limit process M with continuous sample paths that have unique points of maximum ĥ. If
Hn → H, Mn(ĥn) ≥ Mn(Hn) − oP(1), and the sequence ĥn is uniformly tight, then ĥn ⇝ ĥ.

Proof. The compactness of K and the continuity of the sample paths h ↦ M(h) imply
that the (unique) points of maximum ĥ are automatically well separated in the sense of
(5.57). Indeed, if this fails for a given open set G ∋ ĥ and K (and a given ω in the
underlying probability space), then there exists a sequence hm in Gᶜ ∩ K ∩ H such that
M(hm) → M(ĥ). If K is compact, then this sequence can be chosen convergent. The limit
h₀ must be in the closed set Gᶜ and hence cannot be ĥ. By the continuity of M it also has
the property that M(h₀) = lim M(hm) = M(ĥ). This contradicts the assumption that ĥ is
a unique point of maximum.
If we can show that (Mn(F ∩ Hn), Mn(K ∩ Hn)) converges to the corresponding limit
for every compact set F ⊂ K, then the theorem is a corollary of Theorem 5.56. If Hn = H
for every n, then this convergence is immediate from the weak convergence of Mn to M
in ℓ∞(K), by the continuous-mapping theorem. For Hn changing with n this convergence
may fail, and we need to refine the proof of Theorem 5.56. This goes through with minor
changes if

lim sup_{n→∞} P(Mn(F ∩ Hn) − Mn(K ∩ Hn) ≥ x) ≤ P(M(F ∩ H) − M(K ∩ H) ≥ x),

for every x, every compact set F, and every large closed ball K. Define functions gn: ℓ∞(K) → ℝ by

gn(z) = sup_{h∈F∩Hn} z(h) − sup_{h∈K∩Hn} z(h),

and g similarly, but with H replacing Hn. By an argument as in the proof of Theorem
18.11, the desired result follows if lim sup gn(zn) ≤ g(z) for every sequence zn → z
in ℓ∞(K) and continuous function z. (Then lim sup P(gn(Mn) ≥ x) ≤ P(g(M) ≥ x) for
every x, for any weakly converging sequence Mn ⇝ M with a limit with continuous sample
paths.) This in turn follows if for every precompact set B ⊂ K,

sup_{h∈B∩H} z(h) ≤ lim inf_{n→∞} sup_{h∈B∩Hn} zn(h) ≤ lim sup_{n→∞} sup_{h∈B∩Hn} zn(h) ≤ sup_{h∈B̄∩H} z(h).

To prove the upper inequality, select hn ∈ B ∩ Hn such that

sup_{h∈B∩Hn} zn(h) = zn(hn) + o(1) = z(hn) + o(1).

Because the closure of B is compact, every subsequence of hn has a converging subsequence. Because
Hn → H, the limit h must be in B̄ ∩ H. Because z(hn) → z(h), the upper bound follows.
To prove the lower inequality, select for given ε > 0 an element h ∈ B ∩ H such that

sup_{h∈B∩H} z(h) ≤ z(h) + ε.

Because Hn → H, there exists hn ∈ Hn with hn → h. This sequence must be in B̄ ⊂ K
eventually, whence z(h) = lim z(hn) = lim zn(hn) is bounded above by lim inf_{n→∞} sup_{h∈B∩Hn} zn(h). ∎


The argmax theorem can also be used to prove consistency, by applying it to the original
criterion functions θ ↦ Mn(θ). Then the limit process θ ↦ M(θ) is degenerate and has
a fixed point of maximum θ₀. Weak convergence becomes convergence in probability, and
the theorem now gives conditions for the consistency θ̂n →P θ₀. Condition (5.57) reduces
to the well-separation of θ₀, and the convergence

sup_{θ∈F∩K∩Θn} Mn(θ) →P sup_{θ∈F∩K∩Θ} M(θ)

is, apart from allowing Θn to depend on n, weaker than the uniform convergence of Mn to
M.
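The consistency use of the argmax theorem can be illustrated numerically. The sketch below (an illustration, not from the book) takes the sample median as an M-estimator: Mn(θ) = −n⁻¹Σ|Xᵢ − θ| converges uniformly on compacts to M(θ) = −E|X − θ|, which has a well-separated maximum at the population median θ₀ = 0, so the grid maximizers of Mn approach θ₀.

```python
import numpy as np

# Illustration (not from the book): the sample median as an M-estimator.
# Mn(theta) = -(1/n) * sum |Xi - theta| converges uniformly on compacts to
# M(theta) = -E|X - theta|, with a well-separated maximum at theta0 = 0.
rng = np.random.default_rng(0)
grid = np.linspace(-2.0, 2.0, 401)  # grid of candidate parameter values

estimates = []
for n in (100, 1000, 10000):
    x = rng.normal(size=n)  # theta0 = 0 is the N(0,1) median
    mn = -np.abs(x[:, None] - grid[None, :]).mean(axis=0)  # Mn on the grid
    estimates.append(grid[np.argmax(mn)])
# The grid maximizers in `estimates` approach theta0 = 0 as n grows.
```

The grid resolution (0.01) and sample sizes are arbitrary choices for the demonstration; any well-separated criterion would do in place of the absolute-deviation one.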

Notes
In the section on consistency we have given two main results (uniform convergence and
Wald's proof) that have proven their value over the years, but there is more to say on this
subject. The two approaches can be unified by replacing the uniform convergence by "one-
sided uniform convergence," which in the case of i.i.d. observations can be established
under the conditions of Wald's theorem by a bracketing approach as in Example 19.8 (but
then one-sided). Furthermore, the use of special properties, such as convexity of the ψ or
m functions, is often helpful. Examples such as Lemma 5.10, or the treatment of maximum
likelihood estimators in exponential families in Chapter 4, appear to indicate that no single
approach can be satisfactory.
The study of the asymptotic properties of maximum likelihood estimators and other
M-estimators has a long history. Fisher [48], [50] was a strong advocate of the method of
maximum likelihood and noted its asymptotic optimality as early as the 1920s. What we
have labelled the classical conditions correspond to the rigorous treatment given by Cramer
[27] in his authoritative book. Huber initiated the systematic study of M -estimators, with
the purpose of developing robust statistical procedures. His paper [78] contains important
ideas that are precursors for the application of techniques from the theory of empirical
processes by, among others, Pollard, as in [117], [118], and [120]. For one-dimensional
parameters these empirical process methods can be avoided by using a maximal inequality
based on the Lᵣ-norm (see, e.g., Theorem 2.2.4 in [146]). Surprisingly, then a Lipschitz
condition on the Hellinger distance (an integrated quantity) suffices; see for example, [80] or
[94]. For higher-dimensional parameters the results are also not the best possible, but I do
not know of any simple better ones.
The books by Huber [79] and by Hampel, Ronchetti, Rousseeuw, and Stahel [73] are
good sources for applications of M -estimators in robust statistics. These references also
discuss the relative efficiency of the different M -estimators, which motivates, for instance,
the use of Huber's ψ-function. In this chapter we have derived Huber's estimator as the
solution of the problem of minimizing the asymptotic variance under the side condition
of a uniformly bounded influence function. Originally Huber derived it as the solution to
the problem of minimizing the maximum asymptotic variance sup_P σ² for P ranging over
a contamination neighborhood: P = (1 − ε)Φ + εQ with Q arbitrary. For M-estimators
these two approaches turn out to be equivalent.
The one-step method can be traced back to numerical schemes for solving the likelihood
equations, including Fisher's method of scoring. One-step estimators were introduced for
their asymptotic efficiency by Le Cam in 1956, who later developed them for general locally
asymptotically quadratic models, and also introduced the discretization device (see [93]).

PROBLEMS
1. Let XI, ... , Xn be a sample from a density that is strictly positive and symmetric about some
point. Show that the Huber M-estimator for location is consistent for the symmetry point.
2. Find an expression for the asymptotic variance of the Huber estimator for location if the obser-
vations are normally distributed.
3. Define ψ(x) = p − 1, 0, p if x < 0, x = 0, x > 0. Show that Eψ(X − θ) = 0 implies that P(X < θ) ≤ p ≤ P(X ≤ θ).
4. Let X₁, ..., Xn be i.i.d. N(μ, σ²)-distributed. Derive the maximum likelihood estimator for (μ, σ²) and show that it is asymptotically normal. Calculate the Fisher information matrix for this parameter and its inverse.
5. Let X₁, ..., Xn be i.i.d. Poisson(1/θ)-distributed. Derive the maximum likelihood estimator for θ and show that it is asymptotically normal.
6. Let X₁, ..., Xn be i.i.d. N(θ, θ)-distributed. Derive the maximum likelihood estimator for θ and show that it is asymptotically normal.
7. Find a sequence of fixed (nonrandom) functions Mn: ℝ → ℝ that converges pointwise to a limit M₀ and such that each Mn has a unique maximum at a point θn, but the sequence θn does not converge to θ₀. Can you also find a sequence Mn that converges uniformly?
8. Find a sequence of fixed (nonrandom) functions Mn: ℝ → ℝ that converges pointwise but not uniformly to a limit M₀ such that each Mn has a unique maximum at a point θn and the sequence θn converges to θ₀.
9. Let X₁, ..., Xn be i.i.d. observations from a uniform distribution on [0, θ]. Show that the sequence of maximum likelihood estimators is asymptotically consistent. Show that it is not asymptotically normal.
10. Let X₁, ..., Xn be i.i.d. observations from an exponential density θe^{−θx}. Show that the sequence of maximum likelihood estimators is asymptotically normal.
11. Let 𝔽ₙ⁻¹(p) be a pth sample quantile of a sample from a cumulative distribution F on ℝ that is differentiable with positive derivative f at the population pth quantile F⁻¹(p) = inf{x: F(x) ≥ p}. Show that √n(𝔽ₙ⁻¹(p) − F⁻¹(p)) is asymptotically normal with mean zero and variance p(1 − p)/f(F⁻¹(p))².
12. Derive a minimal condition on the distribution function F that guarantees the consistency of the
sample pth quantile.
13. Calculate the asymptotic variance of √n(θ̂n − θ) in Example 5.26.
14. Suppose that we observe a random sample from the distribution of (X, Y) in the following errors-in-variables model:

X = Z + e
Y = α + βZ + f,

where (e, f) is bivariate normally distributed with mean 0 and covariance matrix σ²I and is independent of the unobservable variable Z. In analogy to Example 5.26, construct a system of estimating equations for (α, β) based on a conditional likelihood, and study the limit properties of the corresponding estimators.
15. In Example 5.27, for what point is the least squares estimator θ̂n consistent if we drop the condition that E(e | X) = 0? Derive an (implicit) solution in terms of the function E(e | X). Is it necessarily θ₀ if Ee = 0?


16. In Example 5.27, consider the asymptotic behavior of the least absolute-value estimator θ̂ that minimizes Σᵢ₌₁ⁿ |Yᵢ − φθ(Xᵢ)|.
17. Let X₁, ..., Xn be i.i.d. with density f_{λ,a}(x) = λe^{−λ(x−a)}1{x ≥ a}, where the parameters λ > 0 and a ∈ ℝ are unknown. Calculate the maximum likelihood estimator (λ̂n, ân) of (λ, a) and derive its asymptotic properties.
18. Let X be Poisson-distributed with density p_θ(x) = θˣe^{−θ}/x!. Show by direct calculation that E_θ ℓ̇_θ(X) = 0 and E_θ ℓ̇_θ(X)² = −E_θ ℓ̈_θ(X). Compare this with the assertions in the introduction. Apparently, differentiation under the integral (sum) is permitted in this case. Is that obvious from results from measure theory or (complex) analysis?
19. Let X₁, ..., Xn be a sample from the N(θ, 1) distribution, where it is known that θ ≥ 0. Show that the maximum likelihood estimator is not asymptotically normal under θ = 0. Why does this not contradict the theorems of this chapter?
20. Show that the term (θ̂ − θ₀)ᵀΨ̈ₙ(θ̃ₙ) in formula (5.18) converges in probability to zero if θ̃ₙ →P θ₀ and there exists an integrable function M and δ > 0 with ‖ψ̈_θ(x)‖ ≤ M(x) for every x and every ‖θ − θ₀‖ < δ.
21. If θ̂ₙ maximizes Mₙ, then it also maximizes Mₙ⁺. Show that this may be used to relax the conditions of Theorem 5.7 to sup_θ |Mₙ⁺ − M⁺|(θ) → 0 in probability (if M(θ₀) > 0).
22. Suppose that for every ε > 0 there exists a set Θ_ε with lim inf P(θ̂ₙ ∈ Θ_ε) ≥ 1 − ε. Then uniform convergence of Mₙ to M in Theorem 5.7 can be relaxed to uniform convergence on every Θ_ε.
23. Show that Wald's consistency proof yields almost sure convergence of θ̂ₙ, rather than convergence in probability, if the parameter space is compact and Mₙ(θ̂ₙ) ≥ Mₙ(θ₀) − o(1).
24. Suppose that (X₁, Y₁), ..., (Xₙ, Yₙ) are i.i.d. and satisfy the linear regression relationship Yᵢ = θᵀXᵢ + eᵢ for (unobservable) errors e₁, ..., eₙ independent of X₁, ..., Xₙ. Show that the mean absolute deviation estimator, which minimizes Σ|Yᵢ − θᵀXᵢ|, is asymptotically normal under a mild condition on the error distribution.
25. (i) Verify the conditions of Wald's theorem for m_θ the log likelihood function of the N(μ, σ²)-distribution if the parameter set for θ = (μ, σ²) is a compact subset of ℝ × ℝ⁺.
(ii) Extend m_θ by continuity to the compactification of ℝ × ℝ⁺. Show that the conditions of Wald's theorem fail at the points (μ, 0).
(iii) Replace m_θ by the log likelihood function of a pair of two independent observations from the N(μ, σ²)-distribution. Show that Wald's theorem now does apply, also with a compactified parameter set.
26. A distribution on ℝᵏ is called ellipsoidally symmetric if it has a density of the form x ↦ g((x − μ)ᵀΣ⁻¹(x − μ)) for a function g: [0, ∞) → [0, ∞), a vector μ, and a symmetric positive-definite matrix Σ. Study the Z-estimators for location μ̂ that solve an equation of the form

Σᵢ₌₁ⁿ ψ((Xᵢ − μ)ᵀ Σ̂ₙ⁻¹ (Xᵢ − μ)) = 0,

for given estimators Σ̂ₙ and, for instance, Huber's ψ-function. Is the asymptotic distribution of Σ̂ₙ important?
27. Suppose that Θ is a compact metric space and M: Θ → ℝ is continuous. Show that (5.8) is equivalent to the point θ₀ being a point of unique global maximum. Can you relax the continuity of M to some form of "semicontinuity"?



6
Contiguity

"Contiguity" is another name for "asymptotic absolute continuity."


Contiguity arguments are a technique to obtain the limit distribution
of a sequence of statistics under underlying laws Qn from a limiting
distribution under laws Pn. Typically, the laws Pn describe a null distri-
bution under investigation, and the laws Qn correspond to an alternative
hypothesis.

6.1 Likelihood Ratios


Let P and Q be measures on a measurable space (Ω, 𝒜). Then Q is absolutely continuous
with respect to P if P(A) = 0 implies Q(A) = 0 for every measurable set A; this is denoted
by Q ≪ P. Furthermore, P and Q are orthogonal if Ω can be partitioned as Ω = Ω_P ∪ Ω_Q
with Ω_P ∩ Ω_Q = ∅ and P(Ω_Q) = 0 = Q(Ω_P). Thus P "charges" only Ω_P and Q "lives
on" the set Ω_Q, which is disjoint with the "support" of P. Orthogonality is denoted by
P ⊥ Q.
In general, two measures P and Q need be neither absolutely continuous nor orthogonal.
The relationship between their supports can best be described in terms of densities. Suppose
P and Q possess densities p and q with respect to a measure μ, and consider the sets

Ω_P = {p > 0},  Ω_Q = {q > 0}.

See Figure 6.1. Because P(p = 0) = ∫_{p=0} p dμ = 0, the measure P is supported on the set
Ω_P. Similarly, Q is supported on Ω_Q. The intersection Ω_P ∩ Ω_Q receives positive measure
from both P and Q provided its measure under μ is positive. The measure Q can be written
as the sum Q = Qᵃ + Q^⊥ of the measures

Qᵃ(A) = Q(A ∩ {p > 0}),  Q^⊥(A) = Q(A ∩ {p = 0}). (6.1)

As proved in the next lemma, Qᵃ ≪ P and Q^⊥ ⊥ P. Furthermore, for every measurable
set A

Qᵃ(A) = ∫_A (q/p) dP.
The decomposition Q = Qᵃ + Q^⊥ is called the Lebesgue decomposition of Q with respect
to P. The measures Qᵃ and Q^⊥ are called the absolutely continuous part and the orthogonal


https://doi.org/10.1017/CBO9780511802256.007 Published online by Cambridge University Press



Figure 6.1. Supports of measures.

part (or singular part) of Q with respect to P, respectively. In view of the preceding
display, the function q/p is a density of Qᵃ with respect to P. It is denoted dQ/dP (not:
dQᵃ/dP), so that

dQ/dP = q/p,  P-a.s.
As long as we are only interested in the properties of the quotient q/ p under P-probability,
we may leave the quotient undefined for p = 0. The density d Q/d P is only P-almost
surely unique by definition. Even though we have used densities to define them, d Q/d P
and the Lebesgue decomposition are actually independent of the choice of densities and
dominating measure.
In statistics a more common name for a Radon–Nikodym density is likelihood ratio.
We shall think of it as a random variable dQ/dP: Ω → [0, ∞) and shall study its law
under P.

6.2 Lemma. Let P and Q be probability measures with densities p and q with respect to
a measure μ. Then for the measures Qᵃ and Q^⊥ defined in (6.1)
(i) Q = Qᵃ + Q^⊥, Qᵃ ≪ P, Q^⊥ ⊥ P.
(ii) Qᵃ(A) = ∫_A (q/p) dP for every measurable set A.
(iii) Q ≪ P if and only if Q(p = 0) = 0 if and only if ∫ (q/p) dP = 1.

Proof. The first statement of (i) is obvious from the definitions of Qᵃ and Q^⊥. For the
second, we note that P(A) can be zero only if p(x) = 0 for μ-almost all x ∈ A. In this
case, μ(A ∩ {p > 0}) = 0, whence Qᵃ(A) = Q(A ∩ {p > 0}) = 0 by the absolute
continuity of Q with respect to μ. The third statement of (i) follows from P(p = 0) = 0
and Q^⊥(p > 0) = Q(∅) = 0.
Statement (ii) follows from

Qᵃ(A) = ∫_{A∩{p>0}} q dμ = ∫_{A∩{p>0}} (q/p) p dμ = ∫_A (q/p) dP.

For (iii) we note first that Q ≪ P if and only if Q^⊥ = 0. By (6.1) the latter happens if
and only if Q(p = 0) = 0. This yields the first "if and only if." For the second, we note
that by (ii) the total mass of Qᵃ is equal to Qᵃ(Ω) = ∫ (q/p) dP. This is 1 if and only if
Qᵃ = Q. ∎

It is not true in general that ∫ f dQ = ∫ f (dQ/dP) dP. For this to be true for every
measurable function f, the measure Q must be absolutely continuous with respect to P. On
the other hand, for any P and Q and nonnegative f,

∫ f dQ ≥ ∫_{p>0} f q dμ = ∫_{p>0} f (q/p) p dμ = ∫ f (dQ/dP) dP.

This inequality is used freely in the following. The inequality may be strict, because
dividing by zero is not permitted.†
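The decomposition and Lemma 6.2(iii) are easy to see in a finite example. The sketch below (an illustration, not from the book) takes μ to be counting measure on three points and computes Qᵃ, Q^⊥, and the P-integral of dQ/dP for hypothetical densities p and q:

```python
# A discrete sketch of the Lebesgue decomposition (not from the book).
# mu is counting measure on {0, 1, 2}; p and q are densities of P and Q.
p = {0: 0.5, 1: 0.5, 2: 0.0}
q = {0: 0.25, 1: 0.25, 2: 0.5}

# Q^a charges only {p > 0}; Q^perp lives on {p = 0}, hence is orthogonal to P.
Qa = {x: (q[x] if p[x] > 0 else 0.0) for x in q}
Qperp = {x: (q[x] if p[x] == 0 else 0.0) for x in q}

# dQ/dP = q/p on {p > 0}, and its P-integral is the total mass of Q^a.
integral = sum((q[x] / p[x]) * p[x] for x in p if p[x] > 0)

# Here Q(p = 0) = 0.5 > 0, so Q is not absolutely continuous w.r.t. P,
# and the integral equals Q^a(Omega) = 0.5 < 1, in line with Lemma 6.2(iii).
```

With these particular numbers the strict inequality ∫ f dQ > ∫ f (dQ/dP) dP holds already for f ≡ 1, because Q^⊥ puts mass 0.5 on the point where p vanishes.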

6.2 Contiguity
If a probability measure Q is absolutely continuous with respect to a probability measure
P, then the Q-law of a random vector X: Ω → ℝᵏ can be calculated from the P-law of the
pair (X, dQ/dP) through the formula

E_Q f(X) = E_P f(X) (dQ/dP).

With P^{X,V} equal to the law of the pair (X, V) = (X, dQ/dP) under P, this relationship
can also be expressed as

Q(X ∈ B) = E_P 1_B(X) (dQ/dP) = ∫_{B×ℝ} v dP^{X,V}(x, v).

The validity of these formulas depends essentially on the absolute continuity of Q with
respect to P, because a part of Q that is orthogonal with respect to P cannot be recovered
from any P-law.
Consider an asymptotic version of the problem. Let (Ωn, 𝒜n) be measurable spaces,
each equipped with a pair of probability measures Pn and Qn. Under what conditions can
a Qn-limit law of random vectors Xn: Ωn → ℝᵏ be obtained from suitable Pn-limit laws?
In view of the above it is necessary that Q n is “asymptotically absolutely continuous” with
respect to Pn in a suitable sense. The right concept is contiguity.

6.3 Definition. The sequence Qn is contiguous with respect to the sequence Pn if
Pn(An) → 0 implies Qn(An) → 0 for every sequence of measurable sets An. This is
denoted Qn ◁ Pn. The sequences Pn and Qn are mutually contiguous if both Pn ◁ Qn and
Qn ◁ Pn. This is denoted Pn ◁▷ Qn.

The name “contiguous” is standard, but perhaps conveys a wrong image. “Contiguity”
suggests sequences of probability measures living next to each other, but the correct image
is “on top of each other” (in the limit).
† The algebraic identity dQ = (dQ/dP) dP is false, because the notation dQ/dP is used as shorthand for
dQᵃ/dP: If we write dQ/dP, then we are not implicitly assuming that Q ≪ P.
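The definition can be made concrete with the uniform example of Problem 3 below; the computation sketched here (an illustration, not from the book) is exact rather than simulated:

```python
import math

# Sketch of Definition 6.3 on the uniform example of Problem 3 (not from
# the book): Pn = law of n i.i.d. U[0,1], Qn = law of n i.i.d. U[0, 1+1/n].
# For An = {max_i Xi > 1}: Pn(An) = 0 for every n, yet
# Qn(An) = 1 - (n/(n+1))^n -> 1 - 1/e > 0, so Qn is NOT contiguous w.r.t. Pn.
qn_an = [1 - (n / (n + 1)) ** n for n in (10, 100, 1000)]

# In the other direction Pn <| Qn: the density of Pn relative to Qn is
# bounded by (1 + 1/n)^n <= e, so Pn(A) <= e * Qn(A) -> 0 whenever Qn(A) -> 0.
bound = max((1 + 1 / n) ** n for n in (10, 100, 1000))
```

So the one-sided picture "Qn sits on top of Pn in the limit" fails here in one direction only, matching the asymmetry of the definition.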


Before answering the question of interest, we give two characterizations of contiguity
in terms of the asymptotic behavior of the likelihood ratios of Pn and Qn. The likelihood
ratios dQn/dPn and dPn/dQn are nonnegative and satisfy

E_{Pn} dQn/dPn ≤ 1 and E_{Qn} dPn/dQn ≤ 1.

Thus, the sequences of likelihood ratios dQn/dPn and dPn/dQn are uniformly tight under
Pn and Qn, respectively. By Prohorov's theorem, every subsequence has a further weakly
converging subsequence. The next lemma shows that the properties of the limit points
determine contiguity. This can be understood in analogy with the nonasymptotic situation.
For probability measures P and Q, the following three statements are equivalent by (iii) of
Lemma 6.2:

Q ≪ P,  Q(dP/dQ = 0) = 0,  E_P dQ/dP = 1.

This equivalence persists if the three statements are replaced by their asymptotic counterparts:
Sequences Pn and Qn satisfy Qn ◁ Pn, if and only if the weak limit points of dPn/dQn
under Qn give mass 0 to {0}, if and only if the weak limit points of dQn/dPn under Pn have
mean 1.

6.4 Lemma (Le Cam's first lemma). Let Pn and Qn be sequences of probability measures
on measurable spaces (Ωn, 𝒜n). Then the following statements are equivalent:
(i) Qn ◁ Pn.
(ii) If dPn/dQn ⇝ U under Qn along a subsequence, then P(U > 0) = 1.
(iii) If dQn/dPn ⇝ V under Pn along a subsequence, then EV = 1.
(iv) For any statistics Tn: Ωn → ℝᵏ: If Tn → 0 in Pn-probability, then Tn → 0 in Qn-probability.

Proof. The equivalence of (i) and (iv) follows directly from the definition of contiguity:
Given statistics Tn, consider the sets An = {‖Tn‖ > ε}; given sets An, consider the statistics
Tn = 1_{An}.
(i) ⇒ (ii). For simplicity of notation, we write just {n} for the given subsequence
along which dPn/dQn ⇝ U under Qn. For given n, we define the function gn(ε) = Qn(dPn/dQn <
ε) − P(U < ε). By the portmanteau lemma, lim inf gn(ε) ≥ 0 for every ε > 0. Then, for
εn ↓ 0 at a sufficiently slow rate, also lim inf gn(εn) ≥ 0. Thus,

P(U = 0) = lim P(U < εn) ≤ lim inf Qn(dPn/dQn < εn).

On the other hand,

Pn(dPn/dQn ≤ εn, qn > 0) = ∫_{dPn/dQn ≤ εn} (dPn/dQn) dQn ≤ εn → 0.

If Qn is contiguous with respect to Pn, then the Qn-probability of the set on the left goes
to zero also. But this is the probability on the right in the first display. Combination shows
that P(U = 0) = 0.
(iii) ⇒ (i). If Pn(An) → 0, then the sequence 1_{Ωn−An} converges to 1 in Pn-probability.
By Prohorov's theorem, every subsequence of {n} has a further subsequence along which
(dQn/dPn, 1_{Ωn−An}) ⇝ (V, 1) under Pn, for some weak limit V. The function (v, t) ↦ vt
is continuous and nonnegative on the set [0, ∞) × {0, 1}. By the portmanteau lemma

lim inf Qn(Ωn − An) ≥ lim inf ∫ 1_{Ωn−An} (dQn/dPn) dPn ≥ E1·V.

Under (iii) the right side equals EV = 1. Then the left side is 1 as well and the sequence
Qn(An) = 1 − Qn(Ωn − An) converges to zero.
(ii) ⇒ (iii). The probability measures μn = ½(Pn + Qn) dominate both Pn and Qn, for
every n. The sum of the densities of Pn and Qn with respect to μn equals 2. Hence, each of
the densities takes its values in the compact interval [0, 2]. By Prohorov's theorem every
subsequence possesses a further subsequence along which

dPn/dQn ⇝ U under Qn,  dQn/dPn ⇝ V under Pn,  Wn := dPn/dμn ⇝ W under μn,

for certain random variables U, V, and W. Every Wn has expectation 1 under μn. In view
of the boundedness, the weak convergence of the sequence Wn implies convergence of
moments, and the limit variable has mean EW = 1 as well. For a given bounded, continuous
function f, define a function g: [0, 2] → ℝ by g(w) = f(w/(2 − w))(2 − w) for 0 ≤ w < 2
and g(2) = 0. Then g is bounded and continuous. Because dPn/dQn = Wn/(2 − Wn) and
dQn/dμn = 2 − Wn, the portmanteau lemma yields

E_{Qn} f(dPn/dQn) = E_{μn} f(dPn/dQn) (dQn/dμn) = E_{μn} g(Wn) → E f(W/(2 − W))(2 − W),

where the integrand on the right side is understood to be g(2) = 0 if W = 2. By assumption,
the left side converges to Ef(U). Thus Ef(U) equals the right side of the display for every
continuous and bounded function f. Take a sequence of such functions with 1 ≥ fm ↓ 1_{{0}},
and conclude by the dominated-convergence theorem that

P(U = 0) = E1_{{0}}(U) = E1_{{0}}(W/(2 − W))(2 − W) = 2P(W = 0).

By a similar argument, Ef(V) = Ef((2 − W)/W)W for every continuous and bounded
function f, where the integrand on the right is understood to be zero if W = 0. Take a
sequence 0 ≤ fm(x) ↑ x and conclude by the monotone convergence theorem that

EV = E((2 − W)/W)W = E(2 − W)1_{W>0} = 2P(W > 0) − 1.

Combination of the last two displays shows that P(U = 0) + EV = 1. ∎

6.5 Example (Asymptotic log normality). The following special case plays an important
role in the asymptotic theory of smooth parametric models. Let Pn and Qn be probability
measures on arbitrary measurable spaces such that

dPn/dQn ⇝ e^{N(μ,σ²)} under Qn.

Then Qn ◁ Pn. Furthermore, Qn ◁▷ Pn if and only if μ = −½σ².
Because the (log normal) variable on the right is positive, the first assertion is immediate
from (ii) of the lemma. The second follows from (iii) with the roles of Pn and Qn switched,
on noting that E exp N(μ, σ²) = 1 if and only if μ = −½σ².


A mean equal to minus half times the variance looks peculiar, but we shall see that this
situation arises naturally in the study of the asymptotic optimality of statistical procedures. □
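A quick simulation makes Example 6.5 tangible. The sketch below (an illustration, not from the book) uses Pn = N(0,1)ⁿ and Qn = N(h/√n, 1)ⁿ, for which log dQn/dPn under Pn is exactly N(−σ²/2, σ²) with σ² = h², the "mean equals minus half the variance" situation:

```python
import numpy as np

# Numerical sketch of Example 6.5 (not from the book): for Pn = N(0,1)^n
# and Qn = N(h/sqrt(n), 1)^n the log likelihood ratio under Pn is
#   log dQn/dPn = (h/sqrt(n)) * sum(Xi) - h^2/2,
# which is exactly N(-sigma^2/2, sigma^2) with sigma^2 = h^2, so the mean
# is minus half the variance and Pn and Qn are mutually contiguous.
rng = np.random.default_rng(1)
h, n, reps = 2.0, 400, 20000

x = rng.normal(size=(reps, n))                       # samples under Pn
loglr = (h / np.sqrt(n)) * x.sum(axis=1) - h**2 / 2  # log dQn/dPn

mean_loglr = loglr.mean()  # close to -h^2/2 = -2
var_loglr = loglr.var()    # close to sigma^2 = h^2 = 4
```

The sample size n = 400 and the number of replications are arbitrary; any smooth parametric model would give the same log-normal limit shape by the expansion of Chapter 7.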

The following theorem solves the problem of obtaining a Qn-limit law from a Pn-limit
law that we posed in the introduction. The result, a version of Le Cam's third lemma, is in
perfect analogy with the nonasymptotic situation.

6.6 Theorem. Let Pn and Qn be sequences of probability measures on measurable spaces
(Ωn, 𝒜n), and let Xn: Ωn → ℝᵏ be a sequence of random vectors. Suppose that Qn ◁ Pn
and

(Xn, dQn/dPn) ⇝ (X, V) under Pn.

Then L(B) = E1_B(X)V defines a probability measure, and Xn ⇝ L under Qn.


Proof. Because V ≥ 0, it follows with the help of the monotone convergence theorem
that L defines a measure. By contiguity, EV = 1 and hence L is a probability measure.
It is immediate from the definition of L that ∫ f dL = Ef(X)V for every measurable
indicator function f. Conclude, in steps, that the same is true for every simple function f,
every nonnegative measurable function, and every integrable function.
If f is continuous and nonnegative, then so is the function (x, v) ↦ f(x)v on ℝᵏ ×
[0, ∞). Thus

lim inf E_{Qn} f(Xn) ≥ lim inf ∫ f(Xn) (dQn/dPn) dPn ≥ Ef(X)V,

by the portmanteau lemma. Apply the portmanteau lemma in the converse direction to
conclude the proof that Xn ⇝ L under Qn. ∎

6.7 Example (Le Cam's third lemma). The name Le Cam's third lemma is often reserved
for the following result. If

(Xn, log dQn/dPn) ⇝ N_{k+1}((μ, −½σ²), (Σ, τ; τᵀ, σ²)) under Pn,

then

Xn ⇝ N_k(μ + τ, Σ) under Qn.

In this situation the asymptotic covariance matrices of the sequence Xn are the same under
Pn and Qn, but the mean vectors differ by the asymptotic covariance τ between Xn and the
log likelihood ratios.†
The statement is a special case of the preceding theorem. Let (X, W) have the given
(k + 1)-dimensional normal distribution. By the continuous-mapping theorem, the sequence
(Xn, dQn/dPn) converges in distribution under Pn to (X, e^W). Because W is N(−½σ², σ²)-
distributed, the sequences Pn and Qn are mutually contiguous. According to the abstract

† We set log 0 = −∞; because the normal distribution does not charge the point −∞, the assumed asymptotic
normality of log dQn/dPn includes the assumption that Pn(dQn/dPn = 0) → 0.

version of Le Cam's third lemma, Xn ⇝ L under Qn with L(B) = E1_B(X)e^W. The characteristic
function of L is ∫ e^{itᵀx} dL(x) = E e^{itᵀX}e^W. This is the characteristic function of the given
normal distribution at the vector (t, −i). Thus

E e^{itᵀX}e^W = e^{itᵀ(μ+τ) − ½tᵀΣt}.

The right side is the characteristic function of the N_k(μ + τ, Σ) distribution. □
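The reweighting identity L(B) = E1_B(X)e^W behind this computation can be checked by simulation. The sketch below (an illustration, not from the book) takes Pn = N(0,1)ⁿ, Qn = N(h/√n, 1)ⁿ and Xn = √n·X̄, for which (Xn, log dQn/dPn) is exactly N₂((0, −h²/2), [[1, h],[h, h²]]) under Pn, so τ = h and the Qn-mean of Xn should equal h:

```python
import numpy as np

# Numerical sketch of Example 6.7 (not from the book): Pn = N(0,1)^n,
# Qn = N(h/sqrt(n), 1)^n, Xn = sqrt(n)*mean(X). Under Pn the pair
# (Xn, log dQn/dPn) is exactly N2((0, -h^2/2), [[1, h], [h, h^2]]),
# so tau = h and the lemma predicts Xn ~ N(tau, 1) under Qn.
rng = np.random.default_rng(2)
h, n, reps = 1.0, 100, 50000

x = rng.normal(size=(reps, n))     # samples under Pn
xn = np.sqrt(n) * x.mean(axis=1)
w = h * xn - h**2 / 2              # W = log dQn/dPn

# Estimate E_{Qn} Xn via the reweighting L(B) = E 1_B(X) e^W:
qn_mean = np.mean(xn * np.exp(w))  # should be close to mu + tau = h
```

Drawing under Pn and reweighting by e^W, instead of simulating under Qn directly, mirrors exactly how the theorem transfers a Pn-limit law to a Qn-limit law.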

Notes
The concept and theory of contiguity was developed by Le Cam in [92]. In his paper the
results that were later to become known as Le Cam's lemmas are listed as a single theorem.
The names "first" and "third" appear to originate from [71]. (The second lemma is on
product measures, and the first lemma is actually only the implication (iii) ⇒ (i).)

PROBLEMS
1. Let Pn = N(0, 1) and Qn = N(μn, 1). Show that the sequences Pn and Qn are mutually contiguous if and only if the sequence μn is bounded.
2. Let Pn and Qn be the distribution of the mean of a sample of size n from the N(0, 1) and the N(θn, 1) distribution, respectively. Show that Pn ◁▷ Qn if and only if θn = O(1/√n).
3. Let Pn and Qn be the law of a sample of size n from the uniform distribution on [0, 1] or [0, 1 + 1/n], respectively. Show that Pn ◁ Qn. Is it also true that Qn ◁ Pn? Use Lemma 6.4 to derive your answers.
4. Suppose that ‖Pn − Qn‖ → 0, where ‖·‖ is the total variation distance ‖P − Q‖ = sup_A |P(A) − Q(A)|. Show that Pn ◁▷ Qn.
5. Given ε > 0, find an example of sequences such that Pn ◁▷ Qn, but ‖Pn − Qn‖ → 1 − ε. (The maximum total variation distance between two probability measures is 1.) This exercise shows that it is wrong to think of contiguous sequences as being close. (Try measures that are supported on just two points.)
6. Give a simple example in which Pn ◁ Qn, but it is not true that Qn ◁ Pn.
7. Show that the constant sequences {P} and {Q} are contiguous if and only if P and Q are absolutely continuous.
8. If P ≪ Q, then Q(An) → 0 implies P(An) → 0 for every sequence of measurable sets. How does this follow from Lemma 6.4?



7
Local Asymptotic Normality

A sequence of statistical models is "locally asymptotically normal" if,
asymptotically, their likelihood ratio processes are similar to those for
a normal location parameter. Technically, this is the case if the likelihood ratio
processes admit a certain quadratic expansion. An important example
in which this arises is repeated sampling from a smooth parametric
model. Local asymptotic normality implies convergence of the models to
a Gaussian model after a rescaling of the parameter.

7.1 Introduction
Suppose we observe a sample X₁, ..., Xn from a distribution P_θ on some measurable space
(𝒳, 𝒜) indexed by a parameter θ that ranges over an open subset Θ of ℝᵏ. Then the full
observation is a single observation from the product Pθⁿ of n copies of P_θ, and the statistical
model is completely described as the collection of probability measures {Pθⁿ: θ ∈ Θ}
on the sample space (𝒳ⁿ, 𝒜ⁿ). In the context of the present chapter we shall speak of a
statistical experiment, rather than of a statistical model. In this chapter it is shown that
many statistical experiments can be approximated by Gaussian experiments after a suitable
reparametrization.
The reparametrization is centered around a fixed parameter θ₀, which should be regarded
as known. We define a local parameter h = √n(θ − θ₀), rewrite Pθⁿ as P^n_{θ₀+h/√n}, and thus
obtain an experiment with parameter h. In this chapter we show that, for large n, the
experiments

(P^n_{θ₀+h/√n}: h ∈ ℝᵏ) and (N(h, I_{θ₀}⁻¹): h ∈ ℝᵏ)

are similar in statistical properties, whenever the original experiments θ ↦ P_θ are "smooth"
in the parameter. The second experiment consists of observing a single observation from a
normal distribution with mean h and known covariance matrix (equal to the inverse of the
Fisher information matrix). This is a simple experiment, which is easy to analyze, whence
the approximation yields much information about the asymptotic properties of the original
experiments. This information is extracted in several chapters to follow and concerns both
asymptotic optimality theory and the behavior of statistical procedures such as the maximum
likelihood estimator and the likelihood ratio test.


https://doi.org/10.1017/CBO9780511802256.008 Published online by Cambridge University Press



We have taken the local parameter set equal to ℝ^k, which is not correct if the parameter
set Θ is a true subset of ℝ^k. If θ_0 is an inner point of the original parameter set, then the
vector θ = θ_0 + h/√n is a parameter in Θ, for a given h, for every sufficiently large n,
and the local parameter set √n(Θ − θ_0) converges to the whole of ℝ^k as n → ∞. Then taking the local
parameter set equal to ℝ^k does not cause errors. To give a meaning to the results of this
chapter, the measure P_{θ_0+h/√n} may be defined arbitrarily if θ_0 + h/√n ∉ Θ.

7.2 Expanding the Likelihood


The convergence of the local experiments is defined and established later in this chapter.
First, we discuss the main technical tool: a Taylor expansion of the logarithm of the likelihood.
Let p_θ be a density of P_θ with respect to some measure μ. Assume for simplicity that
the parameter is one-dimensional and that the log likelihood ℓ_θ(x) = log p_θ(x) is twice
differentiable with respect to θ, for every x, with derivatives ℓ̇_θ(x) and ℓ̈_θ(x). Then, for
every fixed x,

log (p_{θ+h}/p_θ)(x) = h ℓ̇_θ(x) + ½ h² ℓ̈_θ(x) + o_x(h²).

The subscript x in the remainder term is a reminder of the fact that this term depends on x
as well as on h. It follows that

log ∏_{i=1}^n (p_{θ+h/√n}/p_θ)(X_i) = (h/√n) ∑_{i=1}^n ℓ̇_θ(X_i) + ½ (h²/n) ∑_{i=1}^n ℓ̈_θ(X_i) + Rem_n.

Here the score has mean zero, P_θ ℓ̇_θ = 0, and −P_θ ℓ̈_θ = P_θ ℓ̇_θ² = I_θ equals the Fisher infor-
mation for θ (see, e.g., section 5.5). Hence the first term can be rewritten as h Δ_{n,θ}, where
Δ_{n,θ} = n^{−1/2} ∑_{i=1}^n ℓ̇_θ(X_i) is asymptotically normal with mean zero and variance I_θ, by
the central limit theorem. Furthermore, the second term in the expansion is asymptotically
equivalent to −½ h² I_θ, by the law of large numbers. The remainder term should behave as
o(1/n) times a sum of n terms and hopefully is asymptotically negligible. Consequently,
under suitable conditions we have, for every h,

log ∏_{i=1}^n (p_{θ+h/√n}/p_θ)(X_i) = h Δ_{n,θ} − ½ I_θ h² + o_{P_θ}(1).

In the next section we see that this is similar in form to the likelihood ratio process of a Gaus-
sian experiment. Because this expansion concerns the likelihood process in a neighborhood
of θ, we speak of "local asymptotic normality" of the sequence of models {P_θ^n : θ ∈ Θ}.
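The expansion can be checked numerically. The following sketch is an illustration of ours, not from the text: it uses a Poisson(θ) sample (an assumed concrete model, for which ℓ̇_θ(x) = x/θ − 1 and I_θ = 1/θ) and compares the exact local log likelihood ratio with h Δ_{n,θ} − ½ I_θ h².

```python
import numpy as np

# Numerical check of the quadratic ("LAN") expansion of the local log
# likelihood ratio for a Poisson(theta) model: the score is x/theta - 1
# and the Fisher information is 1/theta. (The Poisson model and all
# numerical values are choices made for this sketch.)
rng = np.random.default_rng(0)
theta, h, n = 2.0, 1.0, 100_000
x = rng.poisson(theta, size=n)

delta = h / np.sqrt(n)                      # local perturbation theta + h/sqrt(n)
log_lr = x.sum() * np.log((theta + delta) / theta) - n * delta

score = x / theta - 1.0                     # score function at theta
Delta_n = score.sum() / np.sqrt(n)          # asymptotically N(0, I_theta)
I_theta = 1.0 / theta
remainder = log_lr - (h * Delta_n - 0.5 * h**2 * I_theta)
print(remainder)                            # o_P(1): tiny for n this large
```

For a sample of this size the remainder is of order n^{−1/2}, in line with the heuristic that Rem_n is asymptotically negligible.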
The preceding derivation can be made rigorous under moment or continuity conditions
on the second derivative of the log likelihood. Local asymptotic normality was originally
deduced in this manner. Surprisingly, it can also be established under a single condition that
only involves a first derivative: differentiability of the root density θ ↦ √p_θ in quadratic
mean. This entails the existence of a vector of measurable functions ℓ̇_θ = (ℓ̇_{θ,1}, ..., ℓ̇_{θ,k})^T
such that, as h → 0,

∫ [√p_{θ+h} − √p_θ − ½ h^T ℓ̇_θ √p_θ]² dμ = o(‖h‖²).        (7.1)


If this condition is satisfied, then the model (P_θ : θ ∈ Θ) is called differentiable in quadratic
mean at θ.

Usually, ½ h^T ℓ̇_θ(x) √p_θ(x) is the derivative of the map h ↦ √p_{θ+h}(x) at h = 0 for
(almost) every x. In this case

ℓ̇_θ(x) = 2 (1/√p_θ(x)) (∂/∂θ)√p_θ(x) = (∂/∂θ) log p_θ(x).

Condition (7.1) does not require differentiability of the map θ ↦ p_θ(x) for any single x, but
rather differentiability in (quadratic) mean. Admittedly, the latter is typically established by
pointwise differentiability plus a convergence theorem for integrals. Because the condition
is exactly right for its purpose, we establish in the following theorem local asymptotic
normality under (7.1). A lemma following the theorem gives easily verifiable conditions in
terms of pointwise derivatives.

7.2 Theorem. Suppose that Θ is an open subset of ℝ^k and that the model (P_θ : θ ∈ Θ)
is differentiable in quadratic mean at θ. Then P_θ ℓ̇_θ = 0 and the Fisher information matrix
I_θ = P_θ ℓ̇_θ ℓ̇_θ^T exists. Furthermore, for every converging sequence h_n → h, as n → ∞,

log ∏_{i=1}^n (p_{θ+h_n/√n}/p_θ)(X_i) = (1/√n) ∑_{i=1}^n h^T ℓ̇_θ(X_i) − ½ h^T I_θ h + o_{P_θ}(1).

Proof. Given a converging sequence h_n → h, we use the abbreviations p_n, p, and g for
p_{θ+h_n/√n}, p_θ, and h^T ℓ̇_θ, respectively. By (7.1) the sequence √n(√p_n − √p) converges in
quadratic mean (i.e., in L₂(μ)) to ½ g √p. This implies that the sequence √p_n converges in
quadratic mean to √p. By the continuity of the inner product,

√n ∫ (√p_n − √p)(√p_n + √p) dμ → ∫ ½ g √p · 2 √p dμ = Pg.

The left side equals √n(1 − 1) = 0 for every n, because both probability densities integrate
to 1. Thus Pg = 0.
The random variable W_{ni} = 2[√(p_n/p)(X_i) − 1] is with P-probability 1 well defined.
By (7.1),

var( ∑_{i=1}^n W_{ni} − (1/√n) ∑_{i=1}^n g(X_i) ) ≤ E(√n W_{n1} − g(X_1))² → 0,

E ∑_{i=1}^n W_{ni} = 2n( ∫ √p_n √p dμ − 1 ) = −n ∫ [√p_n − √p]² dμ → −¼ Pg².        (7.3)

Here Pg² = ∫ g² dP = h^T I_θ h by the definitions of g and I_θ. If both the means and the
variances of a sequence of random variables converge to zero, then the sequence converges
to zero in probability. Therefore, combining the preceding pair of displayed equations, we
find

∑_{i=1}^n W_{ni} = (1/√n) ∑_{i=1}^n g(X_i) − ¼ Pg² + o_P(1).        (7.4)


Next, we express the log likelihood ratio in ∑_{i=1}^n W_{ni} through a Taylor expansion of the
logarithm. If we write log(1 + x) = x − ½x² + x²R(2x), then R(x) → 0 as x → 0, and

log ∏_{i=1}^n (p_n/p)(X_i) = 2 ∑_{i=1}^n log(1 + ½ W_{ni})
        = ∑_{i=1}^n W_{ni} − ¼ ∑_{i=1}^n W_{ni}² + ½ ∑_{i=1}^n W_{ni}² R(W_{ni}).        (7.5)

As a consequence of (7.3), it is possible to write n W_{ni}² = g²(X_i) + A_{ni} for
random variables A_{ni} such that E|A_{ni}| → 0. The averages Ā_n converge in mean and hence
in probability to zero. Combination with the law of large numbers yields

¼ ∑_{i=1}^n W_{ni}² = ¼ (1/n) ∑_{i=1}^n g²(X_i) + ¼ Ā_n →_P ¼ Pg².

By the triangle inequality followed by Markov's inequality,

n P(|W_{n1}| > ε√2) ≤ n P(g²(X_1) > nε²) + n P(|A_{n1}| > nε²)
        ≤ ε^{−2} Pg² 1{g² > nε²} + ε^{−2} E|A_{n1}| → 0.

The left side is an upper bound for P(max_{1≤i≤n} |W_{ni}| > ε√2). Thus the sequence
max_{1≤i≤n} |W_{ni}| converges to zero in probability. By the property of the function R, the sequence
max_{1≤i≤n} |R(W_{ni})| converges in probability to zero as well. The last term on the right
in (7.5) is bounded by max_{1≤i≤n} |R(W_{ni})| ∑_{i=1}^n W_{ni}². Thus it is o_P(1) O_P(1), and converges
in probability to zero. Combine these facts to obtain that

log ∏_{i=1}^n (p_n/p)(X_i) = ∑_{i=1}^n W_{ni} − ¼ Pg² + o_P(1).

Together with (7.4) this yields the theorem. •

Establishing the differentiability in quadratic mean of specific models requires a conver-
gence theorem for integrals. Usually one proceeds by showing differentiability of the map
θ ↦ p_θ(x) for almost every x plus μ-equi-integrability (e.g., domination). The following
lemma takes care of most examples.

7.6 Lemma. For every θ in an open subset of ℝ^k let p_θ be a μ-probability density. Assume
that the map θ ↦ s_θ(x) = √p_θ(x) is continuously differentiable for every x. If the elements
of the matrix I_θ = ∫ (ṗ_θ/p_θ)(ṗ_θ/p_θ)^T p_θ dμ are well defined and continuous in θ, then the
map θ ↦ √p_θ is differentiable in quadratic mean (7.1) with ℓ̇_θ given by ṗ_θ/p_θ.

Proof. By the chain rule, the map θ ↦ p_θ(x) = s_θ²(x) is differentiable for every x with
gradient ṗ_θ = 2 s_θ ṡ_θ. Because s_θ is nonnegative, its gradient ṡ_θ at a point at which s_θ = 0
must be zero. Conclude that we can write ṡ_θ = ½ (ṗ_θ/p_θ) √p_θ, where the quotient ṗ_θ/p_θ
may be defined arbitrarily if p_θ = 0. By assumption, the map θ ↦ I_θ = 4 ∫ ṡ_θ ṡ_θ^T dμ is
continuous.
Because the map θ ↦ s_θ(x) is continuously differentiable, the difference s_{θ+h}(x) − s_θ(x)
can be written as the integral ∫₀¹ h^T ṡ_{θ+uh}(x) du of its derivative. By Jensen's (or Cauchy–
Schwarz's) inequality, the square of this integral is bounded by the integral ∫₀¹ (h^T ṡ_{θ+uh}(x))²


du of the square. Conclude that

∫ ((s_{θ+th_t} − s_θ)/t)² dμ ≤ ∫ ∫₀¹ (h_t^T ṡ_{θ+uth_t})² du dμ = ¼ ∫₀¹ h_t^T I_{θ+uth_t} h_t du,

where the last equality follows by Fubini's theorem and the definition of I_θ. For h_t → h
the right side converges to ¼ h^T I_θ h = ∫ (h^T ṡ_θ)² dμ by the continuity of the map θ ↦ I_θ.
By the differentiability of the map θ ↦ s_θ(x) the integrand in

∫ ((s_{θ+th_t} − s_θ)/t − h^T ṡ_θ)² dμ

converges pointwise to zero. The result of the preceding paragraph combined with Propo-
sition 2.29 shows that the integral converges to zero. •

7.7 Example (Exponential families). The preceding lemma applies to most exponential
family models

p_θ(x) = d(θ) h(x) e^{Q(θ)^T t(x)}.

An exponential family model is smooth in its natural parameter (away from the boundary of
the natural parameter space). Thus the maps θ ↦ √p_θ(x) are continuously differentiable
if the maps θ ↦ Q(θ) are continuously differentiable and map the parameter set Θ into the
interior of the natural parameter space. The score function and information matrix equal

ℓ̇_θ(x) = Q̇(θ)^T (t(x) − E_θ t(X)),   I_θ = Q̇(θ)^T Cov_θ t(X) Q̇(θ).

Thus the asymptotic expansion of the local log likelihood is valid for most exponential
families. □
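The displayed formulas can be checked on a concrete family. The sketch below (our own illustration) writes the Poisson(θ) family in the exponential-family form with Q(θ) = log θ and t(x) = x, for which Q̇(θ) = 1/θ and E_θ t(X) = Cov_θ t(X) = θ, and compares the resulting score with a numerical derivative of the log density.

```python
from math import lgamma, log

# Poisson(theta) as an exponential family: p_theta(x) = d(theta) h(x) e^{Q(theta) t(x)}
# with Q(theta) = log(theta) and t(x) = x (an assumed concrete example).
theta = 2.0
Q_dot = 1.0 / theta          # derivative of Q(theta) = log(theta)
E_t, Var_t = theta, theta    # mean and variance of t(X) = X under Poisson(theta)

def score(x):
    # score = Q_dot(theta)^T (t(x) - E_theta t(X))
    return Q_dot * (x - E_t)

I_theta = Q_dot * Var_t * Q_dot   # Q_dot^T Cov_theta t(X) Q_dot = 1/theta

def log_p(th, x):
    # log of the Poisson density, for a numerical-derivative cross-check
    return -th + x * log(th) - lgamma(x + 1)

eps = 1e-6
for x in (0, 1, 5):
    numeric = (log_p(theta + eps, x) - log_p(theta - eps, x)) / (2 * eps)
    assert abs(numeric - score(x)) < 1e-6
print(I_theta)   # 1/theta = 0.5
```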

7.8 Example (Location models). The preceding lemma also includes all location models
{f(x − θ) : θ ∈ ℝ} for a positive, continuously differentiable density f with finite Fisher
information for location

I_f = ∫ (f′/f)²(x) f(x) dx.

The score function ℓ̇_θ(x) can be taken equal to −(f′/f)(x − θ). The Fisher information is
equal to I_f for every θ and hence certainly continuous in θ.
By a refinement of the lemma, differentiability in quadratic mean can also be established
for slightly irregular shapes, such as the Laplace density f(x) = ½ e^{−|x|}. For the Laplace
density the map θ ↦ log f(x − θ) fails to be differentiable at the single point θ = x.
At other points the derivative exists and equals sign(x − θ). It can be shown that the
Laplace location model is differentiable in quadratic mean with score function ℓ̇_θ(x) =
sign(x − θ). This may be proved by writing the difference √f(x − h) − √f(x) as the
integral ½ h ∫₀¹ sign(x − uh) √f(x − uh) du of its derivative, which is possible even though
the derivative does not exist everywhere. Next the proof of the preceding lemma applies. □
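The quadratic-mean differentiability of the Laplace family can also be seen numerically. The sketch below (grid sizes and the values of h are our own choices) approximates the L₂ remainder in (7.1) at θ = 0 with score sign(x) by a Riemann sum, and checks that the remainder divided by h² still tends to zero (it is in fact of order h³, driven by the kink region |x| < h).

```python
import numpy as np

# Remainder in (7.1) for the Laplace location family at theta = 0:
# integral of (sqrt(f(x-h)) - sqrt(f(x)) - (h/2) sign(x) sqrt(f(x)))^2 dx
# with f(x) = exp(-|x|)/2, approximated on a fine grid.
x = np.linspace(-25.0, 25.0, 2_000_001)
dx = x[1] - x[0]
root_f = np.sqrt(0.5 * np.exp(-np.abs(x)))

def remainder(h):
    root_fh = np.sqrt(0.5 * np.exp(-np.abs(x - h)))
    diff = root_fh - root_f - 0.5 * h * np.sign(x) * root_f
    return (diff**2).sum() * dx            # Riemann-sum approximation

ratio1 = remainder(0.1) / 0.1**2           # roughly h/6, so about 0.017
ratio2 = remainder(0.01) / 0.01**2         # roughly ten times smaller
print(ratio1, ratio2)
```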

7.9 Counterexample (Uniform distribution). The family of uniform distributions on
[0, θ] is nowhere differentiable in quadratic mean. The reason is that the support of the
uniform distribution depends too much on the parameter. Differentiability in quadratic
mean (7.1) does not require that all densities p_θ have the same support. However, restric-
tion of the integral in (7.1) to the set {p_θ = 0} yields

P_{θ+h}(p_θ = 0) = ∫_{p_θ=0} p_{θ+h} dμ = o(h²).

Thus, under (7.1) the total mass P_{θ+h}(p_θ = 0) of P_{θ+h} that is orthogonal to P_θ must
"disappear" as h → 0 at a rate faster than h².
This is not true for the uniform distribution, because, for h ≥ 0,

P_{θ+h}(p_θ = 0) = ∫_{[0,θ]^c} (1/(θ + h)) 1_{[0,θ+h]}(x) dx = h/(θ + h).

The orthogonal part does converge to zero, but only at the rate O(h). □
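A one-line computation makes the failure concrete. The sketch below (with θ = 1, an arbitrary choice of ours) evaluates the orthogonal mass h/(θ + h): divided by h² it diverges as h ↓ 0, so the o(h²) requirement cannot hold.

```python
import numpy as np

# Orthogonal mass P_{theta+h}(p_theta = 0) = h/(theta + h) for uniform[0, theta]:
# of exact order h, so h^{-2} times it blows up as h -> 0.
theta = 1.0
hs = np.array([0.1, 0.01, 0.001])
mass = hs / (theta + hs)           # exact orthogonal mass for each h
print(mass / hs**2)                # increases without bound as h decreases
```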

7.3 Convergence to a Normal Experiment


The true meaning of local asymptotic normality is convergence of the local statistical
experiments to a normal experiment. In Chapter 9 the notion of convergence of statistical
experiments is introduced in general. In this section we bypass this general theory and
establish a direct relationship between the local experiments and a normal limit experiment.
The limit experiment is the experiment that consists of observing a single observation X
with the N(h, I_θ^{-1})-distribution. The log likelihood ratio process of this experiment equals

log (dN(h, I_θ^{-1})/dN(0, I_θ^{-1}))(X) = h^T I_θ X − ½ h^T I_θ h.

The right side is very similar in form to the right side of the expansion of the log likelihood
ratio log dP_{θ+h/√n}^n/dP_θ^n given in Theorem 7.2. In view of the similarity, the possibility of
a normal approximation is not a complete surprise. The approximation in this section is
"local" in nature: We fix θ and think of

(P_{θ+h/√n}^n : h ∈ ℝ^k)

as a statistical model with parameter h, for "known" θ. We show that this can be approxi-
mated by the statistical model (N(h, I_θ^{-1}) : h ∈ ℝ^k).
A motivation for studying a local approximation is that, usually, asymptotically, the
"true" parameter can be known with unlimited precision. The true statistical difficulty is
therefore determined by the nature of the measures P_θ for θ in a small neighbourhood of
the true value. In the present situation "small" turns out to be "of size O(1/√n)."
A relationship between the models that can be statistically interpreted will be described
through the possible (limit) distributions of statistics. For each n, let T_n = T_n(X_1, ..., X_n)
be a statistic in the experiment (P_{θ+h/√n}^n : h ∈ ℝ^k) with values in a fixed Euclidean space.
Suppose that the sequence of statistics T_n converges in distribution under every possible
(local) parameter:

T_n ⇝ L_{θ,h} under P_{θ+h/√n}^n,   every h.


Here ⇝ means convergence in distribution under the parameter θ + h/√n, and L_{θ,h}
may be any probability distribution. According to the following theorem, the distributions
{L_{θ,h} : h ∈ ℝ^k} are necessarily the distributions of a statistic T in the normal experiment
(N(h, I_θ^{-1}) : h ∈ ℝ^k). Thus, every weakly converging sequence of statistics is "matched"
by a statistic in the limit experiment. (In the present set-up the vector θ is considered known
and the vector h is the statistical parameter. Consequently, by "statistics" T_n and T are
understood measurable maps that do not depend on h but may depend on θ.)
This principle of matching estimators is a method to give the convergence of models
a statistical interpretation. Most measures of quality of a statistic can be expressed in the
distribution of the statistic under different parameters. For instance, if a certain hypothesis
is rejected for values of a statistic Tn exceeding a number c, then the power function
h ↦ P_h(T_n > c) is relevant; alternatively, if T_n is an estimator of h, then the mean square
error h ↦ E_h(T_n − h)², or a similar quantity, determines the quality of T_n. Both quality
measures depend on the laws of the statistics only. The following theorem asserts that as a
function of h the law of a statistic Tn can be well approximated by the law of some statistic
T. Then the quality of the approximating T is the same as the "asymptotic quality" of the
sequence Tn. Investigation of the possible T should reveal the asymptotic performance of
possible sequences Tn. Concrete applications of this principle to testing and estimation are
given in later chapters.
A minor technical complication is that it is necessary to allow randomized statistics in
the limit experiment. A randomized statistic T based on the observation X is defined as a
measurable map T = T (X, U) that depends on X but may also depend on an independent
variable U with a uniform distribution on [0, 1]. Thus, the statistician working in the limit
experiment is allowed to base an estimate or test on both the observation and the outcome of
an extra experiment that can be run without knowledge of the parameter. In most situations
such randomization is not useful, but the following theorem would not be true without it.†

7.10 Theorem. Assume that the experiment (P_θ : θ ∈ Θ) is differentiable in quadratic
mean (7.1) at the point θ with nonsingular Fisher information matrix I_θ. Let T_n be statistics
in the experiments (P_{θ+h/√n}^n : h ∈ ℝ^k) such that the sequence T_n converges in distribution
under every h. Then there exists a randomized statistic T in the experiment (N(h, I_θ^{-1}) : h ∈
ℝ^k) such that T_n ⇝ T under every h.
Proof. For later reference, it is useful to use the abbreviations

P_{n,h} = P_{θ+h/√n}^n,   Δ_n = (1/√n) ∑_{i=1}^n ℓ̇_θ(X_i),   J = I_θ.

By assumption, the marginals of the sequence (T_n, Δ_n) converge in distribution under
h = 0; hence they are uniformly tight by Prohorov's theorem. Because marginal tightness
implies joint tightness, Prohorov's theorem can be applied in the other direction to see the
existence of a subsequence of {n} along which

(T_n, Δ_n) ⇝ (S, Δ), under h = 0,

† It is not important that U is uniformly distributed. Any randomization mechanism that is sufficiently rich will
do.


jointly, for some random vector (S, Δ). The vector Δ is necessarily a marginal weak limit
of the sequence Δ_n and hence it is N(0, J)-distributed. Combination with Theorem 7.2
yields

(T_n, log dP_{n,h}/dP_{n,0}) ⇝ (S, h^T Δ − ½ h^T J h), under h = 0.

In particular, the sequence log dP_{n,h}/dP_{n,0} converges to the normal N(−½ h^T J h, h^T J h)-
distribution. By Example 6.5, the sequences P_{n,h} and P_{n,0} are contiguous. The limit law
L_h of T_n under h can therefore be expressed in the joint law on the right, by the general
form of Le Cam's third lemma: For each Borel set B,

L_h(B) = E 1_B(S) e^{h^T Δ − ½ h^T J h}.


We need to find a statistic T in the normal experiment having this law under h (for every h),
using only the knowledge that Δ is N(0, J)-distributed.
By the lemma below there exists a randomized statistic T such that, with U uniformly
distributed and independent of Δ,†

(T(Δ, U), Δ) ∼ (S, Δ).

Because the random vectors on the left and right sides have the same second marginal
distribution, this is the same as saying that T(δ, U) is distributed according to the conditional
distribution of S given Δ = δ, for almost every δ. As shown in the next lemma, this can be
achieved by using the quantile transformation.
Let X be an observation in the limit experiment (N(h, J^{-1}) : h ∈ ℝ^k). Then JX is under
h = 0 normally N(0, J)-distributed and hence it is equal in distribution to Δ. Furthermore,
by Fubini's theorem,

P_h(T(JX, U) ∈ B) = ∫ P(T(Jx, U) ∈ B) dN(h, J^{-1})(x)
        = E_0 1_B(T(JX, U)) e^{h^T JX − ½ h^T J h}.

This equals L_h(B), because, by construction, the vector (T(JX, U), JX) has the same
distribution under h = 0 as (S, Δ). The randomized statistic T(JX, U) has law L_h under
h and hence satisfies the requirements. •

7.11 Lemma. Given a random vector (S, Δ) with values in ℝ^d × ℝ^k and an independent
uniformly [0, 1] distributed random variable U (defined on the same probability space), there exists a
jointly measurable map T on ℝ^k × [0, 1] such that (T(Δ, U), Δ) and (S, Δ) are equal in
distribution.

Proof. For simplicity of notation we only give a construction for d = 2. It is possible
to produce two independent uniform [0, 1] variables U₁ and U₂ from one given uniform [0, 1]
variable U. (For instance, construct U₁ and U₂ from the even- and odd-numbered digits in
the decimal expansion of U.) Therefore it suffices to find a statistic T = T(Δ, U₁, U₂)
such that (T, Δ) and (S, Δ) are equal in law. Because the second marginals are equal, it

† The symbol ∼ means "equal in law."


suffices to construct T such that T(δ, U₁, U₂) is equal in distribution to S given Δ = δ, for
every δ ∈ ℝ^k. Let Q₁(u₁ | δ) and Q₂(u₂ | δ, s₁) be the quantile functions of the conditional
distributions

P^{S₁ | Δ=δ}   and   P^{S₂ | Δ=δ, S₁=s₁},

respectively. These are measurable functions of their two and three arguments, respectively.
Furthermore, Q₁(U₁ | δ) has law P^{S₁ | Δ=δ} and Q₂(U₂ | δ, s₁) has law P^{S₂ | Δ=δ, S₁=s₁}, for every δ
and s₁. Set

T(δ, U₁, U₂) = (Q₁(U₁ | δ), Q₂(U₂ | δ, Q₁(U₁ | δ))).

Then the first coordinate Q₁(U₁ | δ) of T(δ, U₁, U₂) possesses the distribution P^{S₁ | Δ=δ}.
Given that this first coordinate equals s₁, the second coordinate is distributed as Q₂(U₂ | δ, s₁),
which has law P^{S₂ | Δ=δ, S₁=s₁} by construction. Thus T satisfies the requirements. •
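The quantile-transform construction is easy to simulate. In the sketch below (a toy setting of ours, in dimension d = 1 rather than the proof's d = 2) the pair (S, Δ) is bivariate standard normal with correlation ρ, so the conditional quantile function of S given Δ = δ is Q(u | δ) = ρδ + √(1 − ρ²) Φ⁻¹(u); feeding Δ and an independent uniform U through Q reproduces the joint law of (S, Δ).

```python
import numpy as np
from statistics import NormalDist

# Lemma 7.11 in dimension 1: T(delta, U) = Q(U | delta), the conditional
# quantile function of S given Delta = delta, for (S, Delta) bivariate
# standard normal with correlation rho (an assumed toy example).
rng = np.random.default_rng(1)
rho, n = 0.6, 100_000
inv_cdf = np.vectorize(NormalDist().inv_cdf)   # standard normal quantile

delta = rng.standard_normal(n)                 # Delta ~ N(0, 1)
u = rng.uniform(size=n)                        # independent uniform U
t = rho * delta + np.sqrt(1 - rho**2) * inv_cdf(u)   # T(Delta, U)

# (T(Delta, U), Delta) should be distributed as (S, Delta): standard normal
# marginal for T and correlation rho with Delta.
print(t.mean(), t.var(), np.corrcoef(t, delta)[0, 1])
```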

7.4 Maximum Likelihood


Maximum likelihood estimators in smooth parametric models were shown to be asymp-
totically normal in Chapter 5. The convergence of the local experiments to a normal limit
experiment gives an insightful explanation of this fact.
By the representation theorem, Theorem 7.10, every sequence of statistics in the local ex-
periments (P_{θ+h/√n}^n : h ∈ ℝ^k) is matched in the limit by a statistic in the normal experiment.
Although it does not follow from this theorem, a sequence of maximum likelihood esti-
mators is typically matched by the maximum likelihood estimator in the limit experiment.
Now the maximum likelihood estimator for h in the experiment (N(h, I_θ^{-1}) : h ∈ ℝ^k) is the
observation X itself (the mean of a sample of size one), and this is normally distributed.
Thus, we should expect that the maximum likelihood estimators ĥ_n for the local param-
eter h in the experiments (P_{θ+h/√n}^n : h ∈ ℝ^k) converge in distribution to X. In terms of
the original parameter θ, the local maximum likelihood estimator ĥ_n is the standardized
maximum likelihood estimator ĥ_n = √n(θ̂_n − θ). Furthermore, the local parameter h = 0
corresponds to the value θ of the original parameter. Thus, we should expect that under
θ the sequence √n(θ̂_n − θ) converges in distribution to X under h = 0, that is, to the
N(0, I_θ^{-1})-distribution.
As a heuristic explanation of the asymptotic normality of maximum likelihood estimators
the preceding argument is much more insightful than the proof based on linearization of the
score equation. It also explains why, or in what sense, the maximum likelihood estimator
is asymptotically optimal: in the same sense as the maximum likelihood estimator of a
Gaussian location parameter is optimal.
This heuristic argument cannot be justified under just local asymptotic normality, which is
too weak a connection between the sequence of local experiments and the normal limit exper-
iment for this purpose. Clearly, the argument is valid under the conditions of Theorem 5.39,
because the latter theorem guarantees the asymptotic normality of the maximum likelihood
estimator. This theorem adds a Lipschitz condition on the maps θ ↦ log p_θ(x), and the
"global" condition that θ̂_n is consistent, to differentiability in quadratic mean. In the fol-
lowing theorem, we give a direct argument, and also allow that θ is not an inner point of
the parameter set, so that the local parameter spaces may not converge to the full space ℝ^k.


Then the maximum likelihood estimator in the limit experiment is a "projection" of X and
the limit distribution of √n(θ̂_n − θ) may change accordingly.
Let Θ be an arbitrary subset of ℝ^k and define H_n as the local parameter space H_n =
√n(Θ − θ). Then ĥ_n is the maximizer over H_n of the random function (or "process")

h ↦ log (dP_{θ+h/√n}^n/dP_θ^n).

If the experiment (P_θ : θ ∈ Θ) is differentiable in quadratic mean, then this sequence of
processes converges (marginally) in distribution to the process

h ↦ log (dN(h, I_θ^{-1})/dN(0, I_θ^{-1}))(X) = −½ (X − h)^T I_θ (X − h) + ½ X^T I_θ X.

If the sequence of sets H_n converges in a suitable sense to a set H, then we should expect,
under regularity conditions, that the sequence ĥ_n converges to the maximizer ĥ of the latter
process over H. This maximizer is the projection of the vector X onto the set H relative
to the metric d(x, y)² = (x − y)^T I_θ (x − y) (where a "projection" means a closest point); if
H = ℝ^k, this projection reduces to X itself.
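As a small illustration of this projection (our own example; the diagonal information matrix and the orthant-shaped H are assumed choices), when I_θ is diagonal and H = [0, ∞)^k the minimization of (X − h)^T I_θ (X − h) over h ∈ H separates per coordinate, so the closest point is simply h_j = max(X_j, 0). A brute-force grid search confirms this.

```python
import numpy as np

# Projection of X onto H = [0, inf)^2 in the metric d(x, h)^2 = (x-h)^T I (x-h),
# with a diagonal I (assumed): the problem separates per coordinate and the
# minimizer is coordinatewise clipping of X at zero.
I = np.diag([2.0, 0.5])            # assumed diagonal Fisher information
x = np.array([1.3, -0.7])          # "observation" X in the limit experiment

grid = [np.array([a, b]) for a in np.linspace(0, 3, 61)
                         for b in np.linspace(0, 3, 61)]
d2 = [(x - h) @ I @ (x - h) for h in grid]   # squared I-distance to each point
h_hat = grid[int(np.argmin(d2))]
print(h_hat)   # coordinatewise max(x, 0)
```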
An appropriate notion of convergence of sets is the following. Write H_n → H if H
is the set of all limits lim h_n of converging sequences h_n with h_n ∈ H_n for every n and,
moreover, the limit h = lim_i h_{n_i} of every converging sequence h_{n_i} with h_{n_i} ∈ H_{n_i} for every
i is contained in H.†

7.12 Theorem. Suppose that the experiment (P_θ : θ ∈ Θ) is differentiable in quadratic
mean at θ₀ with nonsingular Fisher information matrix I_{θ₀}. Furthermore, suppose that for
every θ₁ and θ₂ in a neighborhood of θ₀ and a measurable function ℓ̇ with P_{θ₀} ℓ̇² < ∞,

|log p_{θ₁}(x) − log p_{θ₂}(x)| ≤ ℓ̇(x) ‖θ₁ − θ₂‖.

If the sequence of maximum likelihood estimators θ̂_n is consistent and the sets H_n =
√n(Θ − θ₀) converge to a nonempty, convex set H, then the sequence √n(θ̂_n − θ₀)
converges under θ₀ in distribution to the projection of a standard normal vector onto the
set I_{θ₀}^{1/2} H.

*Proof. Let 𝔾_n = √n(ℙ_n − P_{θ₀}) be the empirical process. In the proof of Theorem 5.39
it is shown that the map θ ↦ log p_θ is differentiable at θ₀ in L₂(P_{θ₀}) with derivative
ℓ̇_{θ₀} and that the map θ ↦ P_{θ₀} log p_θ permits a Taylor expansion of order 2 at θ₀, with
"second-derivative matrix" −I_{θ₀}. Therefore, the conditions of Lemma 19.31 are satisfied
for m_θ = log p_θ, whence, for every M,

sup_{‖h‖≤M} | n ℙ_n log (p_{θ₀+h/√n}/p_{θ₀}) − h^T 𝔾_n ℓ̇_{θ₀} + ½ h^T I_{θ₀} h | →_P 0.

By Corollary 5.53 the estimators θ̂_n are √n-consistent under θ₀.
The preceding display is also valid for every sequence M_n that diverges to ∞ sufficiently
slowly. Fix such a sequence. By the √n-consistency of θ̂_n, the local maximum likelihood

† See Chapter 16 for examples.


estimators ĥ_n are bounded in probability and hence belong to the balls of radius M_n with
probability tending to 1. Furthermore, the sequence of intersections H_n ∩ ball(0, M_n)
converges to H, as do the original sets H_n. Thus, we may assume that the ĥ_n are the maximum
likelihood estimators relative to local parameter sets H_n that are contained in the balls of
radius M_n. Fix an arbitrary closed set F. If ĥ_n ∈ F, then the log likelihood is maximal on
F ∩ H_n. Hence P(ĥ_n ∈ F) is bounded above by

P( sup_{h∈F∩H_n} n ℙ_n log (p_{θ₀+h/√n}/p_{θ₀}) ≥ sup_{h∈H_n} n ℙ_n log (p_{θ₀+h/√n}/p_{θ₀}) )
  = P( sup_{h∈F∩H_n} (h^T 𝔾_n ℓ̇_{θ₀} − ½ h^T I_{θ₀} h) ≥ sup_{h∈H_n} (h^T 𝔾_n ℓ̇_{θ₀} − ½ h^T I_{θ₀} h) + o_P(1) )
  = P( ‖I_{θ₀}^{-1/2} 𝔾_n ℓ̇_{θ₀} − I_{θ₀}^{1/2}(F ∩ H_n)‖² ≤ ‖I_{θ₀}^{-1/2} 𝔾_n ℓ̇_{θ₀} − I_{θ₀}^{1/2} H_n‖² + o_P(1) ),

by completing the square. By Lemma 7.13 (ii) and (iii) ahead, we can replace H_n by H on
both sides, at the cost of adding a further o_P(1)-term and increasing the probability. Next,
by the continuous-mapping theorem and the continuity of the map z ↦ ‖z − A‖ for every
set A, the probability is asymptotically bounded above by, with Z a standard normal vector,

P( ‖Z − I_{θ₀}^{1/2}(F ∩ H)‖ ≤ ‖Z − I_{θ₀}^{1/2} H‖ ).

The projection ΠZ of the vector Z onto the set I_{θ₀}^{1/2} H is unique, because the latter set is
convex by assumption and automatically closed. If the distance of Z to I_{θ₀}^{1/2}(F ∩ H) is
smaller than its distance to the set I_{θ₀}^{1/2} H, then ΠZ must be in I_{θ₀}^{1/2}(F ∩ H). Consequently,
the probability in the last display is bounded by P(ΠZ ∈ I_{θ₀}^{1/2} F). The theorem follows from
the portmanteau lemma. •

7.13 Lemma. If the sequence of subsets H_n of ℝ^k converges to a nonempty set H and
the sequence of random vectors X_n converges in distribution to a random vector X, then
(i) ‖X_n − H_n‖ ⇝ ‖X − H‖;
(ii) ‖X_n − H_n ∩ F‖ ≥ ‖X_n − H ∩ F‖ + o_P(1), for every closed set F;
(iii) ‖X_n − H_n ∩ G‖ ≤ ‖X_n − H ∩ G‖ + o_P(1), for every open set G.

Proof. (i). Because the map x ↦ ‖x − H‖ is (Lipschitz) continuous for any set H,
we have that ‖X_n − H‖ ⇝ ‖X − H‖ by the continuous-mapping theorem. If we also show
that ‖X_n − H_n‖ − ‖X_n − H‖ → 0 in probability, then the proof is complete after an application of
Slutsky's lemma. By the uniform tightness of the sequence X_n, it suffices to show that
‖x − H_n‖ → ‖x − H‖ uniformly for x ranging over compact sets, or equivalently that
‖x_n − H_n‖ → ‖x − H‖ for every converging sequence x_n → x.
For every fixed vector x_n, there exists a vector h_n ∈ H_n with ‖x_n − H_n‖ ≥ ‖x_n − h_n‖ − 1/n.
Unless ‖x_n − H_n‖ is unbounded, we can choose the sequence h_n bounded. Then every
subsequence of h_n has a further subsequence along which it converges, to a limit h in H.
Conclude that, in any case,

lim inf ‖x_n − H_n‖ ≥ lim inf ‖x_n − h_n‖ ≥ ‖x − h‖ ≥ ‖x − H‖.

Conversely, for every ε > 0 there exist h ∈ H and a sequence h_n → h with h_n ∈ H_n and

‖x − H‖ ≥ ‖x − h‖ − ε = lim ‖x_n − h_n‖ − ε ≥ lim sup ‖x_n − H_n‖ − ε.


Combination of the last two displays yields the desired convergence of the sequence ‖x_n −
H_n‖ to ‖x − H‖.
(ii). The assertion is equivalent to the statement P(‖X_n − H_n ∩ F‖ − ‖X_n − H ∩ F‖ >
−ε) → 1 for every ε > 0. In view of the uniform tightness of the sequence X_n, this follows
if lim inf ‖x_n − H_n ∩ F‖ ≥ ‖x − H ∩ F‖ for every converging sequence x_n → x. We can
prove this by the method of the first half of the proof of (i), replacing H_n by H_n ∩ F.
(iii). Analogously to the situation under (ii), it suffices to prove that lim sup ‖x_n − H_n ∩
G‖ ≤ ‖x − H ∩ G‖ for every converging sequence x_n → x. This follows as the second
half of the proof of (i). •

∗ 7.5 Limit Distributions under Alternatives


Local asymptotic normality is a convenient tool in the study of the behavior of statistics
under “contiguous alternatives.” Under local asymptotic normality,
log (dP_{θ+h/√n}^n/dP_θ^n) ⇝ N(−½ h^T I_θ h, h^T I_θ h).

Therefore, in view of Example 6.5 the sequences of distributions P_{θ+h/√n}^n and P_θ^n are
mutually contiguous. This is of great use in many proofs. With the help of Le Cam's third
lemma it also allows one to obtain limit distributions of statistics under the parameters θ +
h/√n, once the limit behavior under θ is known. Such limit distributions are of interest,
for instance, in studying the asymptotic efficiency of estimators or tests.
The general scheme is as follows. Many sequences of statistics T_n allow an
approximation by an average of the type

√n(T_n − μ_θ) = (1/√n) ∑_{i=1}^n ψ_θ(X_i) + o_{P_θ}(1).

According to Theorem 7.2, the sequence of log likelihood ratios can be approximated
by an average as well: It is asymptotically equivalent to an affine transformation
of n^{−1/2} ∑ ℓ̇_θ(X_i). The sequence of joint averages n^{−1/2} ∑ (ψ_θ(X_i), ℓ̇_θ(X_i)) is
asymptotically multivariate normal under θ by the central limit theorem (provided ψ_θ has
mean zero and finite second moment). With the help of Slutsky's lemma we obtain the joint
limit distribution of T_n and the log likelihood ratios under θ:

( √n(T_n − μ_θ), log dP_{θ+h/√n}^n/dP_θ^n ) ⇝ N( ( 0, −½ h^T I_θ h ),
        ( P_θ ψ_θ ψ_θ^T    P_θ ψ_θ h^T ℓ̇_θ
          P_θ ψ_θ^T h^T ℓ̇_θ    h^T I_θ h ) ).

Finally we can apply Le Cam's third lemma, Example 6.7, to obtain the limit distribution
of √n(T_n − μ_θ) under θ + h/√n. Concrete examples of this scheme are discussed in later
chapters.
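The scheme can be seen at work in the simplest possible case (a toy example of ours, not from the text): in the N(θ, 1) model with T_n the sample mean, ψ_θ(x) = ℓ̇_θ(x) = x − θ, so P_θ ψ_θ ℓ̇_θ = 1 and Le Cam's third lemma predicts √n(T_n − θ) ⇝ N(h, 1) under θ + h/√n.

```python
import numpy as np

# Under theta + h/sqrt(n), sqrt(n)(mean - theta) should be N(h, 1): the
# asymptotic mean shift equals h * P_theta psi_theta l_dot_theta = h.
rng = np.random.default_rng(2)
theta, h, n, reps = 0.0, 1.5, 100, 20_000

samples = rng.normal(theta + h / np.sqrt(n), 1.0, size=(reps, n))
stat = np.sqrt(n) * (samples.mean(axis=1) - theta)
print(stat.mean(), stat.var())   # approximately h = 1.5 and 1
```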

∗ 7.6 Local Asymptotic Normality


The preceding sections of this chapter are restricted to the case of independent, identically
distributed observations. However, the general ideas have a much wider applicability. A


wide variety of models satisfy a general form of local asymptotic normality and for that
reason allow a unified treatment. These include models with independent, not identically
distributed observations, but also models with dependent observations, such as those used in time
series analysis, or certain random fields. Because local asymptotic normality underlies a
large part of asymptotic optimality theory and also explains the asymptotic normality of
certain estimators, such as maximum likelihood estimators, it is worthwhile to formulate a
general concept.
Suppose the observation at "time" n is distributed according to a probability measure
P_{n,θ}, for a parameter θ ranging over an open subset Θ of ℝ^k.
7.14 Definition. The sequence of statistical models (P_{n,θ} : θ ∈ Θ) is locally asymptoti-
cally normal (LAN) at θ if there exist matrices r_n and I_θ and random vectors Δ_{n,θ} such that
Δ_{n,θ} ⇝ N(0, I_θ) and for every converging sequence h_n → h

log (dP_{n,θ+r_n^{−1}h_n}/dP_{n,θ}) = h^T Δ_{n,θ} − ½ h^T I_θ h + o_{P_{n,θ}}(1).

7.15 Example. If the experiment (P_θ : θ ∈ Θ) is differentiable in quadratic mean, then the
sequence of models (P_θ^n : θ ∈ Θ) is locally asymptotically normal with norming matrices
r_n = √n I. □

An inspection of the proof of Theorem 7.10 readily reveals that it depends on the local
asymptotic normality property only. Thus, the local experiments

(P_{n,θ+r_n^{−1}h} : h ∈ ℝ^k)

of a locally asymptotically normal sequence converge to the experiment (N(h, I_θ^{−1}) : h ∈
ℝ^k), in the sense of this theorem. All results for the case of i.i.d. observations that are based
on this approximation extend to general locally asymptotically normal models. To illustrate
the wide range of applications we include, without proof, three examples, two of which
involve dependent observations.

7.16 Example (Autoregressive processes). An autoregressive process {X_t: t ∈ ℤ} of order 1 satisfies the relationship X_t = θX_{t−1} + Z_t for a sequence of independent, identically distributed variables …, Z_{−1}, Z_0, Z_1, … with mean zero and finite variance. There exists a stationary solution …, X_{−1}, X_0, X_1, … to the autoregressive equation if and only if |θ| ≠ 1. To identify the parameter it is usually assumed that |θ| < 1. If the density of the noise variables Z_j has finite Fisher information for location, then the sequence of models corresponding to observing X_1, …, X_n with parameter set (−1, 1) is locally asymptotically normal at θ with norming matrices r_n = √n I.
The observations in this model form a stationary Markov chain. The result extends to general ergodic Markov chains with smooth transition densities (see [130]). □
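For Gaussian innovations the central sequence can be taken to be the normalized score of the Gaussian likelihood, Δ_{n,θ} = n^{−1/2} Σ_t X_{t−1}(X_t − θX_{t−1}), which should be approximately N(0, I_θ) with I_θ = E X_t² = 1/(1 − θ²) for unit innovation variance. A simulation sketch (values chosen for illustration, not from the text):

```python
import numpy as np

rng = np.random.default_rng(1)
theta, n, reps, burn = 0.6, 2_000, 2_000, 200
I_theta = 1.0 / (1.0 - theta**2)       # = E X_t^2 for unit innovation variance

# simulate `reps` independent stationary AR(1) paths of length n (plus burn-in)
Z = rng.normal(size=(reps, n + burn))
X = np.empty_like(Z)
X[:, 0] = Z[:, 0] / np.sqrt(1.0 - theta**2)    # start in the stationary law
for t in range(1, n + burn):
    X[:, t] = theta * X[:, t - 1] + Z[:, t]
X = X[:, burn:]

# normalized Gaussian-likelihood score, the analogue of the central sequence
deltas = np.sum(X[:, :-1] * (X[:, 1:] - theta * X[:, :-1]), axis=1) / np.sqrt(n)

assert abs(deltas.mean()) < 0.15
assert abs(deltas.var() - I_theta) < 0.2
```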

7.17 Example (Gaussian time series). This example requires some knowledge of time-series models. Suppose that at time n the observations are a stretch X_1, …, X_n from a stationary, Gaussian time series {X_t: t ∈ ℤ} with mean zero. The covariance matrix of n consecutive variables is given by the (Toeplitz) matrix

    T_n(f_θ) = ( ∫ e^{i(s−t)λ} f_θ(λ) dλ )_{s,t=1,…,n}.

The function f_θ is the spectral density of the series. It is convenient to let the parameter enter the model through the spectral density, rather than directly through the density of the observations.
Let P_{n,θ} be the distribution (on ℝ^n) of the vector (X_1, …, X_n), a normal distribution with mean zero and covariance matrix T_n(f_θ). The periodogram of the observations is the function

    I_n(λ) = (1/2πn) | Σ_{t=1}^n X_t e^{itλ} |².

Suppose that f_θ is bounded away from zero and infinity, and that there exists a vector-valued function ℓ̇_θ: ℝ → ℝ^d such that, as h → 0,

    ∫ [ ( f_{θ+h} − f_θ − h^T ℓ̇_θ f_θ ) / ‖h‖ ]² dλ → 0.

Then the sequence of experiments (P_{n,θ}: θ ∈ Θ) is locally asymptotically normal at θ with norming matrices r_n = √n I and

    I_θ = (1/4π) ∫ ℓ̇_θ ℓ̇_θ^T dλ.

The proof is elementary, but involved, because it has to deal with the quadratic forms in the n-variate normal density, which involve vectors whose dimension converges to infinity (see [30]). □
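The periodogram is conveniently computed with a fast Fourier transform. The following sketch (an added illustration; the sample size is an arbitrary choice) checks, for Gaussian white noise with unit variance, that the periodogram ordinates at the nonzero Fourier frequencies average to the constant spectral density 1/2π:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4_096
X = rng.normal(size=n)                  # unit-variance Gaussian white noise

# I_n(lambda_j) = |sum_t X_t e^{i t lambda_j}|^2 / (2 pi n) at Fourier frequencies;
# np.fft.fft uses e^{-2 pi i j t / n}, but the modulus is the same
dft = np.fft.fft(X)
I_n = np.abs(dft[1:n // 2]) ** 2 / (2 * np.pi * n)

# spectral density of unit-variance white noise: f(lambda) = 1/(2 pi)
assert abs(I_n.mean() - 1 / (2 * np.pi)) < 0.015
```

Individual periodogram ordinates do not concentrate (they are roughly exponential around f), which is why inference is based on averages or on the likelihood, as in the example above.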

7.18 Example (Almost regular densities). Consider estimating a location parameter θ based on a sample of size n from the density f(x − θ). If f is smooth, then this model is differentiable in quadratic mean and hence locally asymptotically normal by Example 7.8. If f possesses points of discontinuity, or other strong irregularities, then a locally asymptotically normal approximation is impossible.† Examples of densities that are on the boundary between these "extremes" are the triangular density f(x) = (1 − |x|)⁺ and the gamma density f(x) = x e^{−x} 1{x > 0}. These yield models that are locally asymptotically normal, but with norming rate √(n log n) rather than √n. The existence of singularities in the density makes the estimation of the parameter θ easier, and hence a faster rescaling rate is necessary. (For the triangular density, the true singularities are the points −1 and 1; the singularity at 0 is statistically unimportant, as in the case of the Laplace density.)
For a more general result, consider densities f that are absolutely continuous except possibly in small neighborhoods U_1, …, U_k of finitely many fixed points c_1, …, c_k. Suppose that f′/√f is square-integrable on the complement of ∪_j U_j, that f(c_j) = 0 for every j, and that, for fixed constants a_1, …, a_k and b_1, …, b_k, each of the functions

† See Chapter 9 for some examples.


is twice continuously differentiable. If Σ_j (a_j + b_j) > 0, then the model is locally asymptotically normal at θ = 0 with, for V_n equal to the interval (n^{−1/2}(log n)^{−1/4}, (log n)^{−1}) around zero,†

    r_n = √(n log n),    I_0 = Σ_j (a_j + b_j),

    Δ_{n,0} = (1/√(n log n)) Σ_{i=1}^n Σ_{j=1}^k [ 1{X_i − c_j ∈ V_n} / (X_i − c_j) − ∫_{V_n} (1/x) f(x + c_j) dx ].

The sequence Δ_{n,0} may be thought of as "asymptotically sufficient" for the local parameter h. The definition of Δ_{n,0} shows that, asymptotically, all the "information" about the parameter is contained in the observations falling into the neighborhoods V_n + c_j. Thus, asymptotically, the problem is determined by the points of irregularity.
The remarkable rescaling rate √(n log n) can be explained by computing the Hellinger distance between the densities f(x − θ) and f(x) (see section 14.5). □
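That computation can be sketched numerically (an added illustration, assuming the triangular density; grid sizes and the values of θ are arbitrary choices): the squared Hellinger distance H²(θ) = ∫ (√f(x − θ) − √f(x))² dx behaves like a constant times θ² log(1/θ), so H²(θ)/θ² grows as θ ↓ 0; a quadratic Hellinger expansion would correspond to the usual √n rate, and this extra logarithm is what produces √(n log n).

```python
import numpy as np

def hellinger2(theta):
    # squared Hellinger distance between f(. - theta) and f for the
    # triangular density f(x) = max(1 - |x|, 0), by trapezoidal quadrature
    f = lambda x: np.maximum(1.0 - np.abs(x), 0.0)
    kinks = (-1.0, -1.0 + theta, 0.0, theta, 1.0, 1.0 + theta)
    # cluster grid points geometrically around every kink of the integrand
    near = np.concatenate([k + s * np.logspace(-12, -1, 2_000)
                           for k in kinks for s in (-1.0, 1.0)])
    base = np.linspace(-1.0, 1.0 + theta, 200_001)
    x = np.unique(np.clip(np.concatenate([base, near]), -1.0, 1.0 + theta))
    g = (np.sqrt(f(x - theta)) - np.sqrt(f(x))) ** 2
    return float(np.sum(0.5 * (g[1:] + g[:-1]) * np.diff(x)))

ratios = [hellinger2(t) / t**2 for t in (1e-2, 1e-3, 1e-4)]
# H^2(theta)/theta^2 keeps growing (like (1/2) log(1/theta)) as theta -> 0:
# the distance is "super-quadratic", hence the faster sqrt(n log n) rate
assert ratios[0] < ratios[1] < ratios[2]
```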

Notes
Local asymptotic normality was introduced by Le Cam [92], apparently motivated by the study and construction of asymptotically similar tests. In this paper Le Cam defines two sequences of models (P_{n,θ}: θ ∈ Θ) and (Q_{n,θ}: θ ∈ Θ) to be differentially equivalent if

    sup_{h∈K} ‖ P_{n,θ+h/√n} − Q_{n,θ+h/√n} ‖ → 0,

for every bounded set K and every θ. He next shows that a sequence of statistics T_n in a given asymptotically differentiable sequence of experiments (roughly LAN) that is asymptotically equivalent to the centering sequence Δ_{n,θ} is asymptotically sufficient, in the sense that the original experiments and the experiments consisting of observing the T_n are differentially equivalent. After some interpretation this gives roughly the same message as Theorem 7.10. The latter is a concrete example of an abstract result in [95], with a different (direct) proof.

PROBLEMS
1. Show that the Poisson distribution with mean θ satisfies the conditions of Lemma 7.6. Find
the information.
2. Find the Fisher information for location for the normal, logistic, and Laplace distributions.
3. Find the Fisher information for location for the Cauchy distributions.
4. Let f be a density that is symmetric about zero. Show that the Fisher information matrix (if it exists) of the location-scale family f((x − μ)/σ)/σ is diagonal.
5. Find an explicit expression for the o_{P_θ}(1)-term in Theorem 7.2 in the case that p_θ is the density of the N(θ, 1)-distribution.
6. Show that the Laplace location family is differentiable in quadratic mean.

† See, for example, [80, pp. 133–139] for a proof, and also a discussion of other almost regular situations, for instance, singularities of the form f(x) ∼ f(c_j) + |x − c_j|^{1/2} at points c_j with f(c_j) > 0.


7. Find the form of the score function for a location-scale family f((x − μ)/σ)/σ with parameter θ = (μ, σ) and apply Lemma 7.6 to find a sufficient condition for differentiability in quadratic mean.
8. Investigate for which parameters k the location family f(x − θ), for f the gamma(k, 1) density, is differentiable in quadratic mean.
9. Let P_{n,θ} be the distribution of the vector (X_1, …, X_n) if {X_t: t ∈ ℤ} is a stationary Gaussian time series satisfying X_t = θX_{t−1} + Z_t for a given number |θ| < 1 and independent standard normal variables Z_t. Show that the model is locally asymptotically normal.
10. Investigate whether the log normal family of distributions with density

    ( 1 / (σ√(2π)(x − ξ)) ) exp( −(log(x − ξ) − μ)² / (2σ²) ) 1{x > ξ}

is differentiable in quadratic mean with respect to θ = (ξ, μ, σ).



8
Efficiency of Estimators

One purpose of asymptotic statistics is to compare the performance of estimators for large sample sizes. This chapter discusses asymptotic lower bounds for estimation in locally asymptotically normal models. These show, among other things, in what sense maximum likelihood estimators are asymptotically efficient.

8.1 Asymptotic Concentration


Suppose the problem is to estimate ψ(θ) based on observations from a model governed by the parameter θ. What is the best asymptotic performance of an estimator sequence T_n for ψ(θ)?
To simplify the situation, we shall in most of this chapter assume that the sequence √n(T_n − ψ(θ)) converges in distribution under every possible value of θ. Next we rephrase the question as: What are the best possible limit distributions? In analogy with the Cramér–Rao theorem a "best" limit distribution is referred to as an asymptotic lower bound. Under certain restrictions the normal distribution with mean zero and covariance the inverse Fisher information is an asymptotic lower bound for estimating ψ(θ) = θ in a smooth parametric model. This is the main result of this chapter, but it needs to be qualified.
The notion of a "best" limit distribution is understood in terms of concentration. If the limit distribution is a priori assumed to be normal, then this is usually translated into asymptotic unbiasedness and minimum variance. The statement that √n(T_n − ψ(θ)) converges in distribution to a N(μ(θ), σ²(θ))-distribution can be roughly understood in the sense that eventually T_n is approximately normally distributed with mean and variance given by

    ψ(θ) + μ(θ)/√n   and   σ²(θ)/n.

Because T_n is meant to estimate ψ(θ), optimal choices for the asymptotic mean and variance are μ(θ) = 0 and variance σ²(θ) as small as possible. These choices ensure not only that the asymptotic mean square error is small but also that the limit distribution N(μ(θ), σ²(θ)) is maximally concentrated near zero. For instance, the probability of the interval (−a, a) is maximized by choosing μ(θ) = 0 and σ²(θ) minimal.
We do not wish to assume a priori that the estimators are asymptotically normal. That normal limits are best will actually be an interesting conclusion. The concentration of a general limit distribution L_θ cannot be measured by mean and variance alone. Instead, we


can employ a variety of concentration measures, such as

    ∫ |x| dL_θ(x);    ∫ 1{|x| > a} dL_θ(x);    ∫ (|x| ∧ a) dL_θ(x).

A limit distribution is "good" if quantities of this type are small. More generally, we focus on minimizing ∫ ℓ dL_θ for a given nonnegative function ℓ. Such a function is called a loss function and its integral ∫ ℓ dL_θ is the asymptotic risk of the estimator. The method of measuring concentration (or rather lack of concentration) by means of loss functions applies to one- and higher-dimensional parameters alike.
The following example shows that a definition of what constitutes asymptotic optimality is not as straightforward as it might seem.

8.1 Example (Hodges' estimator). Suppose that T_n is a sequence of estimators for a real parameter θ with standard asymptotic behavior in that, for each θ and certain limit distributions L_θ,

    √n( T_n − θ ) ⇝_θ L_θ.

As a specific example, let T_n be the mean of a sample of size n from the N(θ, 1)-distribution. Define a second estimator S_n through

    S_n = T_n 1{ |T_n| ≥ n^{−1/4} }.

If the estimator T_n is already close to zero, then it is changed to exactly zero; otherwise it is left unchanged. The truncation point n^{−1/4} has been chosen in such a way that the limit behavior of S_n is the same as that of T_n for every θ ≠ 0, but for θ = 0 there appears to be a great improvement. Indeed, for every r_n,

    r_n S_n ⇝_0 0;    √n( S_n − θ ) ⇝_θ L_θ,  every θ ≠ 0.

To see this, note first that the probability that T_n falls in the interval (θ − Mn^{−1/2}, θ + Mn^{−1/2}) converges to L_θ(−M, M) for most M and hence is arbitrarily close to 1 for M and n sufficiently large. For θ ≠ 0, the intervals (θ − Mn^{−1/2}, θ + Mn^{−1/2}) and (−n^{−1/4}, n^{−1/4}) are centered at different places and eventually disjoint. This implies that truncation will rarely occur: P_θ(T_n = S_n) → 1 if θ ≠ 0, whence the second assertion. On the other hand, the interval (−Mn^{−1/2}, Mn^{−1/2}) is contained in the interval (−n^{−1/4}, n^{−1/4}) eventually. Hence under θ = 0 we have truncation with probability tending to 1 and hence P_0(S_n = 0) → 1; this is stronger than the first assertion.
At first sight, S_n is an improvement on T_n. For every θ ≠ 0 the estimators behave the same, while for θ = 0 the sequence S_n has an "arbitrarily fast" rate of convergence. However, this reasoning is a bad use of asymptotics.
Consider the concrete situation that T_n is the mean of a sample of size n from the normal N(θ, 1)-distribution. It is well known that T_n = X̄ is optimal in many ways for every fixed n and hence it ought to be asymptotically optimal also. Figure 8.1 shows why S_n = X̄ 1{|X̄| ≥ n^{−1/4}} is no improvement. It shows the graph of the risk function θ ↦ nE_θ(S_n − θ)² for three different values of n. These functions are close to 1 on most of the domain but possess peaks close to zero. As n → ∞, the locations and widths of the peaks converge to zero but their heights to infinity. The conclusion is that S_n "buys" its better asymptotic behavior at θ = 0 at the expense of erratic behavior close to zero. Because the values of θ at which S_n is bad differ from n to n, the erratic behavior is not visible in the pointwise limit distributions under fixed θ. □

Figure 8.1. Risk function θ ↦ nE_θ(S_n − θ)² of the Hodges estimator based on the means of samples of size 10 (dashed), 100 (dotted), and 1000 (solid) observations from the N(θ, 1)-distribution.
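The risk peaks of Figure 8.1 are easy to reproduce by simulation. The sketch below (an added illustration; sample size, evaluation points, and replication count are arbitrary choices) exploits that X̄ ~ N(θ, 1/n), and evaluates the normalized risk at zero, near the truncation point, and far from zero:

```python
import numpy as np

rng = np.random.default_rng(3)

def hodges_risk(theta, n, reps=20_000):
    # n * E_theta (S_n - theta)^2 for S_n = Xbar * 1{|Xbar| >= n^(-1/4)},
    # using Xbar ~ N(theta, 1/n) directly
    xbar = rng.normal(theta, 1 / np.sqrt(n), size=reps)
    s = np.where(np.abs(xbar) >= n ** -0.25, xbar, 0.0)
    return n * np.mean((s - theta) ** 2)

n = 10_000
assert hodges_risk(0.0, n) < 0.1              # "superefficiency" at theta = 0 ...
assert hodges_risk(n ** -0.25, n) > 5.0       # ... paid for by a peak near zero
assert abs(hodges_risk(1.0, n) - 1.0) < 0.1   # standard behavior away from zero
```

As n grows, the peak (second line) grows roughly like √n/2 while the pointwise risks at fixed θ ≠ 0 stay near 1, exactly the phenomenon in the figure.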

8.2 Relative Efficiency


In order to choose between two estimator sequences, we compare the concentration of their

limit distributions. In the case of normal limit distributions and convergence rate √n, the
quotient of the asymptotic variances is a good numerical measure of their relative efficiency.
This number has an attractive interpretation in terms of the numbers of observations needed
to attain the same goal with each of two sequences of estimators.
Let ν → ∞ be a "time" index, and suppose that it is required that, as ν → ∞, our estimator sequence attains mean zero and variance 1 (after rescaling by √ν, that is, variance 1/ν). Assume that an estimator T_n based on n observations has the property that, as n → ∞,

    √n( T_n − ψ(θ) ) ⇝_θ N(0, σ²(θ)).

Then the requirement is to use at time ν an appropriate number n_ν of observations such that, as ν → ∞,

    √ν( T_{n_ν} − ψ(θ) ) ⇝_θ N(0, 1).
Given two available estimator sequences, let n ν,1 and n ν,2 be the numbers of observations


needed to meet the requirement with each of the estimators. Then, if it exists, the limit

    lim_{ν→∞} n_{ν,2} / n_{ν,1}

is called the relative efficiency of the estimators. (In general, it depends on the parameter θ.)
Because √ν(T_{n_ν} − ψ(θ)) can be written as √(ν/n_ν) √(n_ν)(T_{n_ν} − ψ(θ)), it follows that necessarily n_ν → ∞, and also that n_ν/ν → σ²(θ). Thus, the relative efficiency of two estimator sequences with asymptotic variances σ_i²(θ) is just

    lim_{ν→∞} (n_{ν,2}/ν) / (n_{ν,1}/ν) = σ_2²(θ) / σ_1²(θ).

If the value of this quotient is bigger than 1, then the second estimator sequence needs proportionally that many observations more than the first to achieve the same (asymptotic) precision.
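A classical instance (an added illustration, not worked out in the text) is the comparison of the sample mean and the sample median for normal data: their asymptotic variances are 1 and π/2, so the relative efficiency of the median with respect to the mean is π/2 ≈ 1.57, meaning the median needs about 57% more observations for the same precision.

```python
import numpy as np

rng = np.random.default_rng(4)
n, reps = 500, 5_000
X = rng.normal(0.0, 1.0, size=(reps, n))

# n * variance of each estimator of the center theta = 0
var_mean = n * np.mean(X.mean(axis=1) ** 2)
var_med = n * np.mean(np.median(X, axis=1) ** 2)

assert abs(var_mean - 1.0) < 0.1               # asymptotic variance 1
assert abs(var_med - np.pi / 2) < 0.15         # asymptotic variance pi/2
assert abs(var_med / var_mean - np.pi / 2) < 0.15
```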

8.3 Lower Bound for Experiments

It is certainly impossible to give a nontrivial lower bound on the limit distribution of a standardized estimator √n(T_n − ψ(θ)) for a single θ. Hodges' example shows that it is not even enough to consider the behavior under every θ, pointwise for all θ. Different values of the parameters must be taken into account simultaneously when taking the limit as n → ∞. We shall do this by studying the performance of estimators under parameters in a "shrinking" neighborhood of a fixed θ.
We consider parameters θ + h/√n for θ fixed and h ranging over ℝ^k and suppose that, for certain limit distributions L_{θ,h},

    √n( T_n − ψ(θ + h/√n) ) ⇝_{θ+h/√n} L_{θ,h},  every h.  (8.2)

Then T_n can be considered a good estimator for ψ(θ) if the limit distributions L_{θ,h} are maximally concentrated near zero. If they are maximally concentrated for every h and some fixed θ, then T_n can be considered locally optimal at θ. Unless specified otherwise, we assume in the remainder of this chapter that the parameter set Θ is an open subset of ℝ^k, and that ψ maps Θ into ℝ^m. The derivative of θ ↦ ψ(θ) is denoted by ψ̇_θ.
Suppose that the observations are a sample of size n from a distribution P_θ. If P_θ depends smoothly on the parameter, then

    ( P^n_{θ+h/√n} : h ∈ ℝ^k ) → ( N(h, I_θ^{−1}) : h ∈ ℝ^k )

as experiments, in the sense of Theorem 7.10. This theorem shows which limit distributions are possible and can be specialized to the estimation problem in the following way.

8.3 Theorem. Assume that the experiment (P_θ: θ ∈ Θ) is differentiable in quadratic mean (7.1) at the point θ with nonsingular Fisher information matrix I_θ. Let ψ be differentiable at θ. Let T_n be estimators in the experiments (P^n_{θ+h/√n}: h ∈ ℝ^k) such that


(8.2) holds for every h. Then there exists a randomized statistic T in the experiment (N(h, I_θ^{−1}): h ∈ ℝ^k) such that T − ψ̇_θ h has distribution L_{θ,h} for every h.

Proof. Apply Theorem 7.10 to S_n = √n(T_n − ψ(θ)). In view of the definition of L_{θ,h} and the differentiability of ψ, the sequence

    S_n = √n( T_n − ψ(θ + h/√n) ) + √n( ψ(θ + h/√n) − ψ(θ) )

converges in distribution under h to L_{θ,h} ∗ δ_{ψ̇_θ h}, where convolution with the Dirac measure δ_h denotes a translation by h. According to Theorem 7.10, there exists a randomized statistic T in the normal experiment such that T has distribution L_{θ,h} ∗ δ_{ψ̇_θ h} for every h. This satisfies the requirements. ∎

This theorem shows that for most estimator sequences T_n there is a randomized estimator T such that the distribution of √n(T_n − ψ(θ + h/√n)) under θ + h/√n is, for large n, approximately equal to the distribution of T − ψ̇_θ h under h. Consequently, the standardized distribution of the best possible estimator T_n for ψ(θ + h/√n) is approximately equal to the standardized distribution of the best possible estimator T for ψ̇_θ h in the limit experiment. If we know the best estimator T for ψ̇_θ h, then we know the "locally best" estimator sequence T_n for ψ(θ).
In this way, the asymptotic optimality problem is reduced to optimality in the experiment based on one observation X from a N(h, I_θ^{−1})-distribution, in which θ is known and h ranges over ℝ^k. This experiment is simple and easy to analyze. The observation itself is the customary estimator for its expectation h, and the natural estimator for ψ̇_θ h is ψ̇_θ X. This has several optimality properties: It is minimum variance unbiased, minimax, best equivariant, and Bayes with respect to the noninformative prior. Some of these properties are reviewed in the next section.
Let us agree, at least for the moment, that ψ̇_θ X is a "best" estimator for ψ̇_θ h. The distribution of ψ̇_θ X − ψ̇_θ h is normal with zero mean and covariance ψ̇_θ I_θ^{−1} ψ̇_θ^T for every h. The parameter h = 0 in the limit experiment corresponds to the parameter θ in the original problem. We conclude that the "best" limit distribution of √n(T_n − ψ(θ)) under θ is the N(0, ψ̇_θ I_θ^{−1} ψ̇_θ^T)-distribution.
This is the main result of the chapter. The remaining sections discuss several ways of making this reasoning more rigorous. Because the expression ψ̇_θ I_θ^{−1} ψ̇_θ^T is precisely the Cramér–Rao lower bound for the covariance of unbiased estimators for ψ(θ), we can think of the results of this chapter as asymptotic Cramér–Rao bounds. This is helpful, even though it does not do justice to the depth of the present results. For instance, the Cramér–Rao bound in no way suggests that normal limiting distributions are best. Also, it is not completely true that an N(h, I_θ^{−1})-distribution is "best" (see section 8.8). We shall see exactly to what extent the optimality statement is false.
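The bound ψ̇_θ I_θ^{−1} ψ̇_θ^T can be seen attained in a concrete case (an added sketch; the model and values are arbitrary choices): for a sample from N(θ, 1) with ψ(θ) = θ², one has I_θ = 1 and ψ̇_θ = 2θ, and the plug-in estimator X̄² satisfies √n(X̄² − θ²) ⇝ N(0, 4θ²) by the delta method.

```python
import numpy as np

rng = np.random.default_rng(5)
theta, n, reps = 2.0, 2_000, 20_000
bound = (2 * theta) ** 2 * 1.0        # psi_dot * I^{-1} * psi_dot^T = 4 theta^2

# Xbar ~ N(theta, 1/n) exactly, so sample it directly
xbar = rng.normal(theta, 1 / np.sqrt(n), size=reps)
stand = np.sqrt(n) * (xbar ** 2 - theta ** 2)   # sqrt(n)(estimator - psi(theta))

assert abs(stand.mean()) < 0.2                  # asymptotically centered
assert abs(stand.var() - bound) < 0.8           # variance near the bound 4 theta^2
```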

8.4 Estimating Normal Means


According to the preceding section, the asymptotic optimality problem reduces to optimality
in a normal location (or "Gaussian shift") experiment. This section has nothing to do with
asymptotics but reviews some facts about Gaussian models.


Based on a single observation X from a N(h, Σ)-distribution, it is required to estimate Ah for a given matrix A. The covariance matrix Σ is assumed known and nonsingular. It is well known that AX is minimum variance unbiased. It will be shown that AX is also best equivariant and minimax for many loss functions.
A randomized estimator T is called equivariant-in-law for estimating Ah if the distribution of T − Ah under h does not depend on h. An example is the estimator AX, whose "invariant law" (the law of AX − Ah under h) is the N(0, AΣA^T)-distribution. The following proposition gives an interesting characterization of the law of general equivariant-in-law estimators: These are distributed as the sum of AX and an independent variable.

8.4 Proposition. The null distribution L of any randomized equivariant-in-law estimator of Ah can be decomposed as L = N(0, AΣA^T) ∗ M for some probability measure M. The only randomized equivariant-in-law estimator for which M is degenerate at 0 is AX.
The measure M can be interpreted as the distribution of a noise factor that is added to the estimator AX. If no noise is best, then it follows that AX is best equivariant-in-law. A more precise argument can be made in terms of loss functions. In general, convoluting a measure with another measure decreases its concentration. This is immediately clear in terms of variance: The variance of a sum of two independent variables is the sum of the variances, whence convolution increases variance. For normal measures this extends to all "bowl-shaped" symmetric loss functions. The name should convey the form of their graph. Formally, a function ℓ is defined to be bowl-shaped if the sublevel sets {x: ℓ(x) ≤ c} are convex and symmetric about the origin; it is called subconvex if, moreover, these sets are closed. A loss function is any function with values in [0, ∞). The following lemma quantifies the loss in concentration under convolution (for a proof, see, e.g., [80] or [114]).

8.5 Lemma (Anderson's lemma). For any bowl-shaped loss function ℓ on ℝ^k, every probability measure M on ℝ^k, and every covariance matrix Σ,

    ∫ ℓ dN(0, Σ) ≤ ∫ ℓ d[ N(0, Σ) ∗ M ].
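A quick Monte Carlo check of the lemma (an added sketch; the noise distribution M and the loss functions are arbitrary choices): convolving N(0, 1) with an independent uniform variable can only increase the risk for the truncated absolute-error losses ℓ(x) = |x| ∧ a, which are bowl-shaped.

```python
import numpy as np

rng = np.random.default_rng(6)
reps = 200_000
Z = rng.normal(size=reps)                 # N(0, Sigma) with Sigma = 1
W = rng.uniform(-2.0, 2.0, size=reps)     # independent "noise" with law M

for a in (0.5, 1.0, 2.0):
    loss = lambda x: np.minimum(np.abs(x), a)    # bowl-shaped loss |x| ^ a
    # E loss(N(0,1)) <= E loss(N(0,1) * M): convolution spreads mass out
    assert loss(Z).mean() < loss(Z + W).mean()
```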


Next consider the minimax criterion. According to this criterion the "best" estimator, relative to a given loss function, minimizes the maximum risk

    sup_h E_h ℓ(T − Ah)

over all (randomized) estimators T. For every bowl-shaped loss function ℓ, this leads again to the estimator AX.

8.6 Proposition. For any bowl-shaped loss function ℓ, the maximum risk of any randomized estimator T of Ah is bounded below by E_0 ℓ(AX). Consequently, AX is a minimax estimator for Ah. If Ah is real and E_0(AX)²ℓ(AX) < ∞, then AX is the only minimax estimator for Ah up to changes on sets of probability zero.

Proofs. For a proof of the uniqueness of the minimax estimator, see [18] or [80]. We
prove the other assertions for subconvex loss functions, using a Bayesian argument.


Let H be a random vector with a normal N(0, Λ)-distribution, and consider the original N(h, Σ)-distribution as the conditional distribution of X given H = h. The randomization variable U in T(X, U) is constructed independently of the pair (X, H). In this notation, the distribution of the variable T − AH is equal to the "average" of the distributions of T − Ah under the different values of h in the original set-up, averaged over h using a N(0, Λ)-"prior distribution."
By a standard calculation, we find that the "a posteriori" distribution, the distribution of H given X, is the normal distribution with mean (Σ^{−1} + Λ^{−1})^{−1}Σ^{−1}X and covariance matrix (Σ^{−1} + Λ^{−1})^{−1}. Define the random vectors

    G_Λ = A(Σ^{−1} + Λ^{−1})^{−1}Σ^{−1}X − AH,    W_Λ = T − A(Σ^{−1} + Λ^{−1})^{−1}Σ^{−1}X.

These vectors are independent, because W_Λ is a function of (X, U) only, and the conditional distribution of G_Λ given X is normal with mean 0 and covariance matrix A(Σ^{−1} + Λ^{−1})^{−1}A^T, independent of X. As Λ = λI for a scalar λ → ∞, the sequence G_Λ converges in distribution to a N(0, AΣA^T)-distributed vector G. The sum of the two vectors yields T − AH, for every Λ.
Because a supremum is larger than an average, we obtain, where on the left we take the expectation with respect to the original model,

    sup_h E_h ℓ(T − Ah) ≥ E ℓ(T − AH) = E ℓ(G_Λ + W_Λ) ≥ E ℓ(G_Λ),

by Anderson's lemma. This is true for every Λ. The lim inf of the right side as λ → ∞ is at least Eℓ(G), by the portmanteau lemma. This concludes the proof that AX is minimax.
If T is equivariant-in-law with invariant law L, then the distribution of G_Λ + W_Λ = T − AH is L, for every Λ. It follows that

    ∫ e^{it^T x} dL(x) = E e^{it^T G_Λ} E e^{it^T W_Λ}.

As λ → ∞, the left side remains fixed; the first factor on the right side converges to the characteristic function of G, which is positive. Conclude that the characteristic functions of W_Λ converge to a continuous function, whence W_Λ converges in distribution to some vector W, by Lévy's continuity theorem. By the independence of G_Λ and W_Λ for every Λ, the sequence (G_Λ, W_Λ) converges in distribution to a pair (G, W) of independent vectors with marginal distributions as before. Next, by the continuous-mapping theorem, the distribution of G_Λ + W_Λ, which is fixed at L, "converges" to the distribution of G + W. This proves that L can be written as a convolution, as claimed in Proposition 8.4.
If T is an equivariant-in-law estimator and t(X) = E(T(X, U)| X), then

    E_h( t(X) − AX ) = E_h( T − Ah ) − E_h( AX − Ah )

is independent of h. By the completeness of the normal location family, we conclude that t − AX is constant, almost surely. If T has the same law as AX, then the constant is zero. Furthermore, T must be equal to its projection t almost surely, because otherwise it would have a bigger second moment than t = AX. Thus T = AX almost surely. ∎


8.5 Convolution Theorem

An estimator sequence T_n is called regular at θ for estimating a parameter ψ(θ) if, for every h,

    √n( T_n − ψ(θ + h/√n) ) ⇝_{θ+h/√n} L_θ.

The probability measure L_θ may be arbitrary but should be the same for every h.
A regular estimator sequence attains its limit distribution in a "locally uniform" manner. This type of regularity is common and is often considered desirable: A small change in the parameter should not change the distribution of the estimator too much; a disappearing small change should not change the (limit) distribution at all. However, some estimator sequences of interest, such as shrinkage estimators, are not regular.
In terms of the limit distributions L_{θ,h} in (8.2), regularity is exactly that all L_{θ,h} are equal, for the given θ. According to Theorem 8.3, every estimator sequence is matched by an estimator T in the limit experiment (N(h, I_θ^{−1}): h ∈ ℝ^k). For a regular estimator sequence this matching estimator has the property

    T − ψ̇_θ h ∼ L_θ under h,  every h.  (8.7)

Thus a regular estimator sequence is matched by an equivariant-in-law estimator for ψ̇_θ h. A more informative name for "regular" is asymptotically equivariant-in-law.
It is now easy to determine a best estimator sequence from among the regular estimator sequences (a best regular sequence): It is the sequence T_n that corresponds to the best equivariant-in-law estimator T for ψ̇_θ h in the limit experiment, which is ψ̇_θ X by Proposition 8.4. The best possible limit distribution of a regular estimator sequence is the law of this estimator, a N(0, ψ̇_θ I_θ^{−1} ψ̇_θ^T)-distribution.
The characterization as a convolution of the invariant laws of equivariant-in-law estimators carries over to the asymptotic situation.

8.8 Theorem (Convolution). Assume that the experiment (P_θ: θ ∈ Θ) is differentiable in quadratic mean (7.1) at the point θ with nonsingular Fisher information matrix I_θ. Let ψ be differentiable at θ. Let T_n be an at θ regular estimator sequence in the experiments (P_θ^n: θ ∈ Θ) with limit distribution L_θ. Then there exists a probability measure M_θ such that

    L_θ = N( 0, ψ̇_θ I_θ^{−1} ψ̇_θ^T ) ∗ M_θ.

In particular, if L_θ has covariance matrix Σ_θ, then the matrix Σ_θ − ψ̇_θ I_θ^{−1} ψ̇_θ^T is nonnegative definite.

Proof. Apply Theorem 8.3 to conclude that L_θ is the distribution of an equivariant-in-law estimator T in the limit experiment, satisfying (8.7). Next apply Proposition 8.4. ∎

8.6 Almost-Everywhere Convolution Theorem

Hodges' example shows that there is no hope for a nontrivial lower bound for the limit distribution of a standardized estimator sequence √n(T_n − ψ(θ)) for every θ. It is always possible to improve on a given estimator sequence for selected parameters. In this section it is shown that improvement over an N(0, ψ̇_θ I_θ^{−1} ψ̇_θ^T)-distribution can be made on at most a Lebesgue null set of parameters. Thus the possibilities for improvement are very much restricted.

8.9 Theorem. Assume that the experiment (P_θ: θ ∈ Θ) is differentiable in quadratic mean (7.1) at every θ with nonsingular Fisher information matrix I_θ. Let ψ be differentiable at every θ. Let T_n be an estimator sequence in the experiments (P_θ^n: θ ∈ Θ) such that √n(T_n − ψ(θ)) converges to a limit distribution L_θ under every θ. Then there exist probability distributions M_θ such that for Lebesgue almost every θ

    L_θ = N( 0, ψ̇_θ I_θ^{−1} ψ̇_θ^T ) ∗ M_θ.

In particular, if L_θ has covariance matrix Σ_θ, then the matrix Σ_θ − ψ̇_θ I_θ^{−1} ψ̇_θ^T is nonnegative definite for Lebesgue almost every θ.

The theorem follows from the convolution theorem in the preceding section combined with the following remarkable lemma. Any estimator sequence with limit distributions is automatically regular at almost every θ along a subsequence of {n}.

8.10 Lemma. Let T_n be estimators in experiments (P_{n,θ}: θ ∈ Θ) indexed by a measurable subset Θ of ℝ^k. Assume that the map θ ↦ P_{n,θ}(A) is measurable for every measurable set A and every n, and that the map θ ↦ ψ(θ) is measurable. Suppose that there exist distributions L_θ such that for Lebesgue almost every θ

    r_n( T_n − ψ(θ) ) ⇝_θ L_θ.

Then for every γ_n → 0 there exists a subsequence of {n} such that, for Lebesgue almost every (θ, h), along the subsequence,

    r_n( T_n − ψ(θ + γ_n h) ) ⇝_{θ+γ_n h} L_θ.

Proof. Assume without loss of generality that Θ = ℝ^k; otherwise, fix some θ_0 and let P_{n,θ} = P_{n,θ_0} for every θ not in Θ. Write T_{n,θ} = r_n(T_n − ψ(θ)). There exists a countable collection F of uniformly bounded, left- or right-continuous functions f such that weak convergence of a sequence of maps T_n is equivalent to Ef(T_n) → ∫ f dL for every f ∈ F.† Suppose that for every f there exists a subsequence of {n} along which

    E_{θ+γ_n h} f( T_{n,θ+γ_n h} ) → ∫ f dL_θ,   λ^{2k}-a.e. (θ, h).

Even in case the subsequence depends on f, we can, by a diagonalization scheme, construct a subsequence for which this is valid for every f in the countable set F. Along this subsequence we have the desired convergence.

† For continuous distributions L we can use the indicator functions of cells (−∞, c] with c ranging over ℚ^k. For general L replace every such indicator by an approximating sequence of continuous functions. Alternatively, see, e.g., Theorem 1.12.2 in [146]. Also see Lemma 2.25.


Setting g_n(θ) = E_θ f(T_{n,θ}) and g(θ) = ∫ f dL_θ, we see that the lemma is proved once we have established the following assertion: Every sequence of bounded, measurable functions g_n that converges almost everywhere to a limit g has a subsequence along which

    g_n(θ + γ_n h) → g(θ),   λ^{2k}-a.e. (θ, h).

We may assume without loss of generality that the function g is integrable; otherwise we first multiply each g_n and g with a suitable, fixed, positive, continuous function. It should also be verified that, under our conditions, the functions g_n are measurable.
Write p for the standard normal density on ℝ^k and p_n for the density of the N(0, (1 + γ_n²)I)-distribution. By Scheffé's lemma, the sequence p_n converges to p in L_1. Let Θ and H denote independent standard normal vectors. Then, by the triangle inequality and the dominated-convergence theorem,

    E| g_n(Θ + γ_n H) − g(Θ + γ_n H) | = ∫ | g_n(u) − g(u) | p_n(u) du → 0.

Second, for any fixed continuous and bounded function g_c the sequence E|g_c(Θ + γ_n H) − g_c(Θ)| converges to zero as n → ∞ by the dominated-convergence theorem. Thus, by the triangle inequality, we obtain

    E| g(Θ + γ_n H) − g(Θ) | ≤ ∫ |g − g_c|(u) (p_n + p)(u) du + o(1)
                            = 2 ∫ |g − g_c|(u) p(u) du + o(1).

Because any measurable integrable function g can be approximated arbitrarily closely in L_1 by continuous functions, the first term on the far right side can be made arbitrarily small by the choice of g_c. Thus the left side converges to zero.
By combining this with the preceding display, we see that E|g_n(Θ + γ_n H) − g(Θ)| → 0. In other words, the sequence of functions (θ, h) ↦ g_n(θ + γ_n h) − g(θ) converges to zero in mean and hence in probability, under the standard normal measure. There exists a subsequence along which it converges to zero almost surely. ∎

*8.7 Local Asymptotic Minimax Theorem


The convolution theorems discussed in the preceding sections are not completely satisfying.
The convolution theorem designates a best estimator sequence among the regular estimator
sequences, and thus imposes an a priori restriction on the set of permitted estimator se-
quences. The almost-everywhere convolution theorem imposes no (serious) restriction but
yields no information about some parameters, albeit a null set of parameters.
This section gives a third attempt to "prove" that the normal N(0, ψ̇_θ I_θ^{-1} ψ̇_θ^T)-distribution
is the best possible limit. It is based on the minimax criterion and gives a lower bound for the
maximum risk over a small neighborhood of a parameter θ. In fact, it bounds the expression

  lim_{δ→0} liminf_{n→∞} sup_{‖θ′−θ‖<δ} E_{θ′} ℓ(√n(T_n − ψ(θ′))).

This is the asymptotic maximum risk over an arbitrarily small neighborhood of θ. The
following theorem concerns an even more refined (and smaller) version of the local maxi-
mum risk.



118 Efficiency of Estimators

8.11 Theorem. Let the experiment (P_θ : θ ∈ Θ) be differentiable in quadratic mean (7.1) at
θ with nonsingular Fisher information matrix I_θ. Let ψ be differentiable at θ. Let T_n be
any estimator sequence in the experiments (P_θ^n : θ ∈ ℝ^k). Then for any bowl-shaped loss
function ℓ

  sup_I liminf_{n→∞} sup_{h∈I} E_{θ+h/√n} ℓ(√n(T_n − ψ(θ + h/√n))) ≥ ∫ ℓ dN(0, ψ̇_θ I_θ^{-1} ψ̇_θ^T).

Here the first supremum is taken over all finite subsets I of ℝ^k.

Proof. We only give the proof under the further assumptions that the sequence √n(T_n −
ψ(θ)) is uniformly tight under θ and that ℓ is (lower) semicontinuous.† Then Prohorov's
theorem shows that every subsequence of {n} has a further subsequence along which the
vectors

  (√n(T_n − ψ(θ)), n^{-1/2} Σ_{i=1}^n ℓ̇_θ(X_i))

converge in distribution to a limit under θ. By Theorem 7.2 and Le Cam's third lemma, the
sequence √n(T_n − ψ(θ)) converges in law also under every θ + h/√n along the subsequence.
By differentiability of ψ, the same is true for the sequence √n(T_n − ψ(θ + h/√n)), whence
(8.2) is satisfied. By Theorem 8.3, the distributions L_{θ,h} are the distributions of T − ψ̇_θ h
under h for a randomized estimator T based on an N(h, I_θ^{-1})-distributed observation. By
Proposition 8.6,

  sup_{h∈ℝ^k} E_h ℓ(T − ψ̇_θ h) ≥ E_0 ℓ(ψ̇_θ X) = ∫ ℓ dN(0, ψ̇_θ I_θ^{-1} ψ̇_θ^T).

It suffices to show that the left side of this display is a lower bound for the left side of the
theorem.
The complicated construction that defines the asymptotic minimax risk (the liminf sand-
wiched between two suprema) requires that we apply the preceding argument to a carefully
chosen subsequence. Place the rational vectors in an arbitrary order, and let I_k consist of
the first k vectors in this sequence. Then the left side of the theorem is larger than

  R := lim_{k→∞} liminf_{n→∞} sup_{h∈I_k} E_{θ+h/√n} ℓ(√n(T_n − ψ(θ + h/√n))).

There exists a subsequence {n_k} of {n} such that this expression is equal to

  lim_{k→∞} sup_{h∈I_k} E_{θ+h/√n_k} ℓ(√n_k(T_{n_k} − ψ(θ + h/√n_k))).

We apply the preceding argument to this subsequence and find a further subsequence along
which T_n satisfies (8.2). For simplicity of notation write this as {n′} rather than with a
double subscript. Because ℓ is nonnegative and lower semicontinuous, the portmanteau
lemma gives, for every h,

  liminf_{n′→∞} E_{θ+h/√n′} ℓ(√n′(T_{n′} − ψ(θ + h/√n′))) ≥ ∫ ℓ dL_{θ,h}.


† See, for example, [146, Chapter 3.11] for the general result, which can be proved along the same lines, but using
a compactification device to induce tightness.


Every rational vector h is contained in I_k for every sufficiently large k. Conclude that

  R ≥ sup_{h∈ℚ^k} ∫ ℓ dL_{θ,h} = sup_{h∈ℚ^k} E_h ℓ(T − ψ̇_θ h).

The risk function in the supremum on the right is lower semicontinuous in h, by the
continuity of the Gaussian location family and the lower semicontinuity of ℓ. Thus
the expression on the right does not change if ℚ^k is replaced by ℝ^k. This concludes the
proof. ∎

*8.8 Shrinkage Estimators


The theorems of the preceding sections seem to prove in a variety of ways that the best
possible limit distribution is the N(0, ψ̇_θ I_θ^{-1} ψ̇_θ^T)-distribution. At closer inspection, the
situation is more complicated, and to a certain extent optimality remains a matter of taste,
asymptotic optimality being no exception. The "optimal" normal limit is the distribution
of the estimator ψ̇_θ X in the normal limit experiment. Because this estimator has several
optimality properties, many statisticians consider it best. Nevertheless, one might prefer a
Bayes estimator or a shrinkage estimator. With a changed perception of what constitutes
"best" in the limit experiment, the meaning of "asymptotically best" changes also. This
becomes particularly clear in the example of shrinkage estimators.

8.12 Example (Shrinkage estimator). Let X_1, ..., X_n be a sample from a multivariate
normal distribution with mean θ and covariance the identity matrix. The dimension k of
the observations is assumed to be at least 3. This is essential! Consider the estimator

  T_n = X̄_n − (k − 2) X̄_n / (n‖X̄_n‖²).

Because X̄_n converges in probability to the mean θ, the second term in the definition of T_n
is O_P(n^{-1}) if θ ≠ 0. In that case √n(T_n − X̄_n) converges in probability to zero, whence
the estimator sequence T_n is regular at every θ ≠ 0. For θ = h/√n, the variable √n X̄_n is
distributed as a variable X with an N(h, I)-distribution, and for every n the standardized
estimator √n(T_n − h/√n) is distributed as T − h for

  T(X) = X − (k − 2) X/‖X‖².

This is the Stein shrinkage estimator. Because the distribution of T − h depends on h, the
sequence T_n is not regular at θ = 0. The Stein estimator has the remarkable property that,
for every h (see, e.g., [99, p. 300]),

  E_h‖T − h‖² < E_h‖X − h‖² = k.

It follows that, in terms of joint quadratic loss ℓ(x) = ‖x‖², the local limit distributions
L_{0,h} of the sequence √n(T_n − h/√n) under θ = h/√n are all better than the N(0, I)-limit
distribution of the best regular estimator sequence X̄_n. □
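The risk inequality E_h‖T − h‖² < k is easy to check numerically. The following Monte Carlo sketch (the function name, replication count, and choice of h are my own, not from the text) estimates the risk of the Stein estimator at h = 0, where the gain is largest, and at a distant h, where the risk approaches, but stays below, k:

```python
import random

def stein_risk(h, reps=20000, seed=0):
    """Monte Carlo estimate of E_h ||T - h||^2 for the Stein estimator
    T(X) = X - (k - 2) X / ||X||^2, where X ~ N(h, I_k)."""
    k = len(h)
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        # Draw X ~ N(h, I_k) and apply the shrinkage factor 1 - (k-2)/||X||^2.
        x = [hi + rng.gauss(0.0, 1.0) for hi in h]
        shrink = (k - 2) / sum(v * v for v in x)
        total += sum(((1.0 - shrink) * xi - hi) ** 2 for xi, hi in zip(x, h))
    return total / reps

k = 5
risk_at_zero = stein_risk([0.0] * k)  # exact value is k - (k - 2) = 2
risk_far = stein_risk([5.0] * k)      # close to, but still below, k = 5
print(risk_at_zero, risk_far)
```

At h = 0 the risk is exactly k − (k − 2) = 2, because ‖X‖² has a χ²_k-distribution with E(1/χ²_k) = 1/(k − 2); far from the origin the shrinkage term is negligible and the risk creeps up to k.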

The example of shrinkage estimators shows that, depending on the optimality criterion, a
normal N(0, ψ̇_θ I_θ^{-1} ψ̇_θ^T)-limit distribution need not be optimal. In this light, is it reasonable
to uphold that maximum likelihood estimators are asymptotically optimal? Perhaps not. On
the other hand, the possibility of improvement over the N(0, ψ̇_θ I_θ^{-1} ψ̇_θ^T)-limit is restricted
in two important ways.
First, improvement can be made only on a null set of parameters, by Theorem 8.9.
Second, improvement is possible only for special loss functions, and improvement for one
loss function necessarily implies worse performance for other loss functions. This follows
from the next lemma.
Suppose that we require the estimator sequence to be locally asymptotically minimax for
a given loss function ℓ in the sense that

  sup_I limsup_{n→∞} sup_{h∈I} E_{θ+h/√n} ℓ(√n(T_n − ψ(θ + h/√n))) ≤ ∫ ℓ dN(0, ψ̇_θ I_θ^{-1} ψ̇_θ^T).

This is a reasonable requirement, and few statisticians would challenge it. The following
lemma shows that for one-dimensional parameters ψ(θ) local asymptotic minimaxity for
even a single loss function implies regularity. Thus, if it is required that all coordinates of a
certain estimator sequence be locally asymptotically minimax for some loss function, then
the best regular estimator sequence is optimal without competition.

8.13 Lemma. Assume that the experiment (P_θ : θ ∈ Θ) is differentiable in quadratic mean
(7.1) at θ with nonsingular Fisher information matrix I_θ. Let ψ be a real-valued map
that is differentiable at θ. Then an estimator sequence T_n in the experiments (P_θ^n : θ ∈ ℝ^k)
can be locally asymptotically minimax at θ for a bowl-shaped loss function ℓ such that
0 < ∫ x² ℓ(x) dN(0, ψ̇_θ I_θ^{-1} ψ̇_θ^T)(x) < ∞ only if T_n is best regular at θ.

Proof. We only give the proof under the further assumption that the sequence √n(T_n −
ψ(θ)) is uniformly tight under θ. Then, by the same arguments as in the proof of Theo-
rem 8.11, every subsequence of {n} has a further subsequence along which the sequence
√n(T_n − ψ(θ + h/√n)) converges in distribution under θ + h/√n to the distribution
L_{θ,h} of T − ψ̇_θ h under h, for a randomized estimator T based on an N(h, I_θ^{-1})-distributed
observation. Because T_n is locally asymptotically minimax, it follows that

  sup_{h∈ℝ^k} E_h ℓ(T − ψ̇_θ h) = sup_{h∈ℝ^k} ∫ ℓ dL_{θ,h} ≤ ∫ ℓ dN(0, ψ̇_θ I_θ^{-1} ψ̇_θ^T).

Thus T is a minimax estimator for ψ̇_θ h in the limit experiment. By Proposition 8.6,
T = ψ̇_θ X, whence L_{θ,h} is independent of h. ∎

*8.9 Achieving the Bound


If the convolution theorem is taken as the basis for asymptotic optimality, then an estimator
sequence is best if it is asymptotically regular with a N(0, ψ̇_θ I_θ^{-1} ψ̇_θ^T)-limit distribution.
An estimator sequence has this property if and only if the estimator is asymptotically linear
in the score function.

8.14 Lemma. Assume that the experiment (P_θ : θ ∈ Θ) is differentiable in quadratic
mean (7.1) at θ with nonsingular Fisher information matrix I_θ. Let ψ be differentiable at
θ. Let T_n be an estimator sequence in the experiments (P_θ^n : θ ∈ ℝ^k) such that

  √n(T_n − ψ(θ)) = (1/√n) Σ_{i=1}^n ψ̇_θ I_θ^{-1} ℓ̇_θ(X_i) + o_{P_θ}(1).

Then T_n is a best regular estimator for ψ(θ) at θ. Conversely, every best regular estimator
sequence satisfies this expansion.

Proof. The sequence Δ_{n,θ} = n^{-1/2} Σ ℓ̇_θ(X_i) converges in distribution to a vector Δ_θ with
a N(0, I_θ)-distribution. By Theorem 7.2 the sequence log dP^n_{θ+h/√n}/dP^n_θ is asymptotically
equivalent to h^T Δ_{n,θ} − ½ h^T I_θ h. If T_n is asymptotically linear, then √n(T_n − ψ(θ)) is
asymptotically equivalent to the function ψ̇_θ I_θ^{-1} Δ_{n,θ}. Apply Slutsky's lemma to find that

  (√n(T_n − ψ(θ)), log dP^n_{θ+h/√n}/dP^n_θ) ⇝ N((0, −½ h^T I_θ h), ((ψ̇_θ I_θ^{-1} ψ̇_θ^T, ψ̇_θ h), (h^T ψ̇_θ^T, h^T I_θ h)))  under θ.

The limit distribution of the sequence √n(T_n − ψ(θ)) under θ + h/√n follows by Le Cam's
third lemma, Example 6.7, and is normal with mean ψ̇_θ h and covariance matrix ψ̇_θ I_θ^{-1} ψ̇_θ^T.
Combining this with the differentiability of ψ, we obtain that T_n is regular.
Next suppose that S_n and T_n are both best regular estimator sequences. By the same
arguments as in the proof of Theorem 8.11 it can be shown that, at least along subsequences,
the joint estimators (S_n, T_n) for (ψ(θ), ψ(θ)) satisfy for every h

  √n(S_n − ψ(θ + h/√n), T_n − ψ(θ + h/√n)) ⇝ (S − ψ̇_θ h, T − ψ̇_θ h),  under h,

for a randomized estimator (S, T) in the normal limit experiment. Because S_n and T_n are
best regular, the estimators S and T are best equivariant-in-law. Thus S = T = ψ̇_θ X almost
surely by Proposition 8.6, whence √n(S_n − T_n) converges in distribution to S − T = 0.
Thus every two best regular estimator sequences are asymptotically equivalent. The
second assertion of the lemma follows on applying this to T_n and the estimators

  S_n = ψ(θ) + (1/√n) ψ̇_θ I_θ^{-1} Δ_{n,θ}.

Because the parameter θ is known in the local experiments (P^n_{θ+h/√n} : h ∈ ℝ^k), this indeed
defines an estimator sequence within the present context. It is best regular by the first part
of the lemma. ∎

Under regularity conditions, for instance those of Theorem 5.39, the maximum likeli-
hood estimator θ̂_n in a parametric model satisfies

  √n(θ̂_n − θ) = (1/√n) Σ_{i=1}^n I_θ^{-1} ℓ̇_θ(X_i) + o_{P_θ}(1).

Then the maximum likelihood estimator is asymptotically optimal for estimating θ in terms
of the convolution theorem. By the delta method, the estimator ψ(θ̂_n) for ψ(θ) can be seen
to be asymptotically linear as in the preceding lemma, so that it is asymptotically regular
and optimal as well.
Actually, regular and asymptotically optimal estimators for θ exist in every parametric
model (P_θ : θ ∈ Θ) that is differentiable in quadratic mean with nonsingular Fisher infor-
mation throughout Θ, provided the parameter θ is identifiable. This can be shown using
the discretized one-step method discussed in section 5.7 (see [93]).
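As an illustration of the one-step idea (in simplified form, without the discretization used for the general theory, and with function names of my own choosing), the following sketch computes a one-step estimator in the Cauchy location model: starting from the sample median, one Newton step using the score ℓ̇_θ(x) = 2(x − θ)/(1 + (x − θ)²) and Fisher information I_θ = 1/2 yields an asymptotically efficient estimator.

```python
import math
import random
import statistics

def one_step_cauchy(xs):
    """One-step estimator for Cauchy location: a single Newton step
    theta_1 = theta_0 + I^{-1} * (average score at theta_0),
    started from the (consistent but inefficient) sample median."""
    theta0 = statistics.median(xs)
    avg_score = statistics.fmean(
        2 * (x - theta0) / (1 + (x - theta0) ** 2) for x in xs)
    return theta0 + avg_score / 0.5  # Fisher information is 1/2

# Sample from a Cauchy distribution located at 3 via the inverse-cdf method.
rng = random.Random(1)
xs = [3.0 + math.tan(math.pi * (rng.random() - 0.5)) for _ in range(2000)]
print(one_step_cauchy(xs))  # close to the true location 3.0
```

The starting point only needs to be √n-consistent; the single Newton step then transfers the full efficiency of the score to the estimator, which is the content of the one-step method.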

*8.10 Large Deviations


Consistency of an estimator sequence T_n entails that the probability of the event
d(T_n, ψ(θ)) > ε tends to zero under θ, for every ε > 0. This is a very weak require-
ment. One method to strengthen it is to make ε dependent on n and to require that the
probabilities P_θ(d(T_n, ψ(θ)) > ε_n) converge to 0, or are bounded away from 1, for a given
sequence ε_n ↓ 0. The results of the preceding sections address this question and give very
precise lower bounds for these probabilities using an "optimal" rate ε_n = r_n^{-1}, typically
n^{-1/2}.
Another method of strengthening the consistency is to study the speed at which the
probabilities P_θ(d(T_n, ψ(θ)) > ε) converge to 0 for a fixed ε > 0. This method appears
to be of less importance but is of some interest. Typically, the speed of convergence is
exponential, and there is a precise lower bound for the exponential rate in terms of the
Kullback-Leibler information.
We consider the situation that T_n is based on a random sample of size n from a distribution
P_θ, indexed by a parameter θ ranging over an arbitrary set Θ. We wish to estimate the value
of a function ψ : Θ ↦ 𝔻 that takes its values in a metric space.

8.15 Theorem. Suppose that the estimator sequence T_n is consistent for ψ(θ) under every
θ. Then, for every ε > 0 and every θ_0,

  limsup_{n→∞} −(1/n) log P_{θ_0}(d(T_n, ψ(θ_0)) > ε) ≤ inf_{θ: d(ψ(θ), ψ(θ_0)) > ε} −P_θ log (p_{θ_0}/p_θ).

Proof. If the right side is infinite, then there is nothing to prove. The Kullback-Leibler
information −P_θ log p_{θ_0}/p_θ can be finite only if P_θ ≪ P_{θ_0}. Hence, it suffices to prove that
−P_θ log p_{θ_0}/p_θ is an upper bound for the left side for every θ such that P_θ ≪ P_{θ_0} and
d(ψ(θ), ψ(θ_0)) > ε. The variable Λ_n = n^{-1} Σ_{i=1}^n log(p_θ/p_{θ_0})(X_i) is well defined
(possibly −∞). For every constant M,

  P_{θ_0}(d(T_n, ψ(θ_0)) > ε) ≥ P_{θ_0}(d(T_n, ψ(θ_0)) > ε, Λ_n < M)
                            ≥ E_θ 1{d(T_n, ψ(θ_0)) > ε, Λ_n < M} e^{−nΛ_n}
                            ≥ e^{−nM} P_θ(d(T_n, ψ(θ_0)) > ε, Λ_n < M).

Take logarithms and multiply by −(1/n) to conclude that

  −(1/n) log P_{θ_0}(d(T_n, ψ(θ_0)) > ε) ≤ M − (1/n) log P_θ(d(T_n, ψ(θ_0)) > ε, Λ_n < M).

For M > P_θ log p_θ/p_{θ_0}, we have that P_θ(Λ_n < M) → 1 by the law of large numbers.
Furthermore, by the consistency of T_n for ψ(θ), the probability P_θ(d(T_n, ψ(θ_0)) > ε)
converges to 1 for every θ such that d(ψ(θ), ψ(θ_0)) > ε. Conclude that the probability on
the right side of the preceding display converges to 1, whence the limsup of the left side is
bounded by M. ∎
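In the normal location model the bound of the theorem is attained by the sample mean, and this can be checked by direct computation rather than simulation. For P_θ = N(θ, 1) and ψ the identity, the Kullback-Leibler infimum over |θ − θ_0| > ε equals ε²/2, while P_{θ_0}(|X̄_n − θ_0| > ε) = 2Φ(−ε√n) exactly. The sketch below (function name and grid of sample sizes my own) compares −(1/n) log of this probability with the bound:

```python
import math

def neg_log_rate(n, eps):
    """-(1/n) log P_0(|mean of n N(0,1) observations| > eps),
    computed exactly from P = 2 * Phi(-eps * sqrt(n)) via erfc."""
    p = math.erfc(eps * math.sqrt(n) / math.sqrt(2))  # = 2 * Phi(-eps*sqrt(n))
    return -math.log(p) / n

eps = 0.5
bound = eps ** 2 / 2  # Kullback-Leibler infimum over |theta| > eps
rates = [neg_log_rate(n, eps) for n in (10, 100, 1000, 2000)]
print(rates, bound)  # rates decrease toward the bound 0.125
```

The finite-n rates always exceed ε²/2 (a Chernoff bound for the normal tail) and converge to it, so the limsup in the theorem equals the Kullback-Leibler bound here.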

Notes
Chapter 32 of the famous book by Cramér [27] gives a rigorous proof of what we now
know as the Cramér-Rao inequality and next goes on to define the asymptotic efficiency of
an estimator as the quotient of the inverse Fisher information and the asymptotic variance.
Cramér defines an estimator as asymptotically efficient if its efficiency (the quotient men-
tioned previously) equals one. These definitions lead to the conclusion that the method of
maximum likelihood produces asymptotically efficient estimators, as already conjectured
by Fisher [48, 50] in the 1920s. That there is a conceptual hole in the definitions was clearly
realized in 1951 when Hodges produced his example of a superefficient estimator. Not long
after this, in 1953, Le Cam proved that superefficiency can occur only on a Lebesgue null
set. Our present result, almost without regularity conditions, is based on later work by Le
Cam (see [95]). The asymptotic convolution and minimax theorems were obtained in the
present form by Hájek in [69] and [70] after initial work by many authors. Our present
proofs follow the approach based on limit experiments, initiated by Le Cam in [95].

PROBLEMS
1. Calculate the asymptotic relative efficiency of the sample mean and the sample median for
estimating θ, based on a sample of size n from the normal N(θ, 1)-distribution.
2. As the previous problem, but now for the Laplace distribution (density p(x) = ½e^{−|x|}).
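A Monte Carlo sanity check can accompany the analytic computation in the two preceding problems. The sketch below (names and simulation sizes my own) estimates the ratio of the variances of the two estimators over repeated samples, which approximates the asymptotic relative efficiency of the median with respect to the mean:

```python
import random
import statistics

def are_median_vs_mean(sampler, n=500, reps=3000, seed=2):
    """Monte Carlo estimate of var(sample mean) / var(sample median)
    over repeated samples of size n centered at location 0."""
    rng = random.Random(seed)
    means, medians = [], []
    for _ in range(reps):
        xs = [sampler(rng) for _ in range(n)]
        means.append(statistics.fmean(xs))
        medians.append(statistics.median(xs))
    return statistics.pvariance(means) / statistics.pvariance(medians)

normal = lambda rng: rng.gauss(0.0, 1.0)
# Laplace variable (density (1/2)e^{-|x|}) as a difference of two exponentials
laplace = lambda rng: rng.expovariate(1.0) - rng.expovariate(1.0)

r_normal = are_median_vs_mean(normal)    # near 2/pi for the normal model
r_laplace = are_median_vs_mean(laplace)  # near 2 for the Laplace model
print(r_normal, r_laplace)
```

The two ratios can be compared with the closed-form answers of the problems: the median is less efficient than the mean under normality but twice as efficient under the Laplace distribution.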
3. Consider estimating the distribution function P(X ≤ x) at a fixed point x based on a sample
X_1, ..., X_n from the distribution of X. The "nonparametric" estimator is n^{-1} #(X_i ≤ x). If it
is known that the true underlying distribution is normal N(θ, 1), another possible estimator is
Φ(x − X̄). Calculate the relative efficiency of these estimators.
4. Calculate the relative efficiency of the empirical p-quantile and the estimator Φ^{-1}(p)S_n + X̄_n
for estimating the p-th quantile of the distribution of a sample from the normal N(μ, σ²)-
distribution.
5. Consider estimating the population variance by either the sample variance S² (which is unbiased)
or else n^{-1} Σ_{i=1}^n (X_i − X̄)² = ((n − 1)/n)S². Calculate the asymptotic relative efficiency.
6. Calculate the asymptotic relative efficiency of the sample standard deviation and the interquartile
range (corrected for unbiasedness) for estimating the standard deviation based on a sample of
size n from the normal N(μ, σ²)-distribution.
7. Given a sample of size n from the uniform distribution on [0, θ], the maximum X_(n) of the
observations is biased downwards. Because E_θ(θ − X_(n)) = E_θ X_(1), the bias can be removed by
adding the minimum of the observations. Is X_(1) + X_(n) a good estimator for θ from an asymptotic
point of view?
8. Consider the Hodges estimator S_n based on the mean of a sample from the N(θ, 1)-distribution.
(i) Suppose δ_n → 0 in such a way that n^{1/4}δ_n → 0 and n^{1/2}δ_n → ∞.
(ii) Show that S_n is not regular at θ = 0.


(iii) Show that sup_{−δ<θ<δ} P_θ(√n|S_n − θ| > k_n) → 1 for every k_n that converges to infinity
sufficiently slowly.
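Part (iii) can be previewed numerically. Taking the version S_n = X̄_n·1{|X̄_n| ≥ n^{−1/4}} of the Hodges estimator, the sketch below (names and simulation sizes my own) estimates the normalized quadratic risk E_θ n(S_n − θ)² at θ = 0, near the truncation threshold, and away from 0; the risk collapses at 0 but blows up nearby, which is the price of superefficiency.

```python
import math
import random

def hodges_risk(theta, n, reps=4000, seed=3):
    """Monte Carlo estimate of E_theta[ n (S_n - theta)^2 ] for the Hodges
    estimator S_n = mean * 1{|mean| >= n^(-1/4)}, using that the sample
    mean of n observations from N(theta, 1) is N(theta, 1/n)-distributed."""
    rng = random.Random(seed)
    cutoff = n ** -0.25
    total = 0.0
    for _ in range(reps):
        m = theta + rng.gauss(0.0, 1.0) / math.sqrt(n)
        s = m if abs(m) >= cutoff else 0.0
        total += n * (s - theta) ** 2
    return total / reps

n = 10000
print(hodges_risk(0.0, n))               # superefficient: risk near 0
print(hodges_risk(0.5 * n ** -0.25, n))  # near the threshold: risk blows up
print(hodges_risk(1.0, n))               # fixed theta != 0: risk near 1
```

Near the threshold the estimator rounds a nonzero mean down to 0, so the normalized risk is of the order nθ², which diverges along θ = θ_n shrinking more slowly than n^{−1/2}.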
9. Show that a loss function ℓ: ℝ ↦ ℝ is bowl-shaped if and only if it has the form ℓ(x) = ℓ_0(|x|)
for a nondecreasing function ℓ_0.
10. Show that a function of the form ℓ(x) = ℓ_0(‖x‖) for a nondecreasing function ℓ_0 is bowl-shaped.
11. Prove Anderson's lemma for the one-dimensional case, for instance by calculating the derivative
of ∫ ℓ(x + h) dN(0, 1)(x). Does the proof generalize to higher dimensions?
12. What does Lemma 8.13 imply about the coordinates of the Stein estimator? Are they good
estimators of the coordinates of the expectation vector?
13. All results in this chapter extend in a straightforward manner to general locally asymptotically
normal models. Formulate Theorem 8.9 and Lemma 8.14 for such models.
