Statistical Limit
Theory
CHAPMAN & HALL/CRC
Texts in Statistical Science Series
Series Editors
Bradley P. Carlin, University of Minnesota, USA
Julian J. Faraway, University of Bath, UK
Martin Tanner, Northwestern University, USA
Jim Zidek, University of British Columbia, Canada
Introduction to
Statistical Limit
Theory
Alan M. Polansky
Northern Illinois University
Dekalb, Illinois, USA
Chapman & Hall/CRC
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742
© 2011 by Taylor and Francis Group, LLC
Chapman & Hall/CRC is an imprint of Taylor & Francis Group, an Informa business
This book contains information obtained from authentic and highly regarded sources. Reasonable
efforts have been made to publish reliable data and information, but the author and publisher cannot
assume responsibility for the validity of all materials or the consequences of their use. The authors and
publishers have attempted to trace the copyright holders of all material reproduced in this publication
and apologize to copyright holders if permission to publish in this form has not been obtained. If any
copyright material has not been acknowledged please write and let us know so we may rectify in any
future reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced,
transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or
hereafter invented, including photocopying, microfilming, and recording, or in any information stor-
age or retrieval system, without written permission from the publishers.
For permission to photocopy or use material electronically from this work, please access www.copy-
right.com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222
Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that pro-
vides licenses and registration for a variety of users. For organizations that have been granted a pho-
tocopy license by the CCC, a separate system of payment has been arranged.
Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are
used only for identification and explanation without intent to infringe.
Visit the Taylor & Francis Web site at
http://www.taylorandfrancis.com
and the CRC Press Web site at
http://www.crcpress.com
Preface xvii
3 Convergence of Random Variables 101
3.1 Introduction 101
3.2 Convergence in Probability 102
3.3 Stronger Modes of Convergence 107
3.4 Convergence of Random Vectors 117
3.5 Continuous Mapping Theorems 121
3.6 Laws of Large Numbers 124
3.7 The Glivenko–Cantelli Theorem 135
3.8 Sample Moments 140
3.9 Sample Quantiles 147
3.10 Exercises and Experiments 152
3.10.1 Exercises 152
3.10.2 Experiments 157
The motivation for writing a new book is heavily dependent on the books that
are currently available in the area you wish to present. In many cases one can
make the case that a new book is needed when there are no books available in
that area. In other cases one can make the argument that the current books
are out of date or are of poor quality. I wish to make it clear from the onset
that I do not have this opinion of any of the books on statistical asymptotic
theory that are currently available, or have been available in the past. In fact,
I find myself humbled as I search my book shelves and find the many well
written books on this subject. Indeed, I feel strongly that some of the best
minds in statistics have written books in this area. This leaves me with the
task of explaining why I found it necessary to write this book on asymptotic
statistical theory.
Many students of statistics are finding themselves specializing away from
mathematical theory in favor of newer and more complex statistical methods
that necessarily require specialized study. Inevitably, this has led to a dimin-
ished focus on pure mathematics. However, this does not diminish the need for
a good understanding of asymptotic theory. Good students of statistics, for
example, should know exactly what the central limit theorem says and what
exactly it means. They should be able to understand the assumptions required
for the theory to work. There are many modern methods that are complex
enough that they cannot be justified directly, and therefore must be justified
from an asymptotic viewpoint. Students should have a good understanding as
to what such a justification guarantees and what it does not. Students should
also have a good understanding of what can go wrong.
Asymptotic theory is mathematical in nature. Over the years I have helped
many students understand results from asymptotic theory, often as part of
their research or classwork. In this time I began to realize that the decreased
exposure to mathematical theory, particularly in real analysis, was making it
much more difficult for students to understand this theory.
I wrote this book with the goal of explaining as much of the background ma-
terial as I could, while still keeping a reasonable presentation and length. The
reader will not find a detailed review of the whole of real analysis or of other
important subjects from mathematics. Instead the reader will find sufficient
background of those subjects that are important for the topic at hand, along
xvii
with references which the reader may explore for a more detailed understand-
ing. I have also attempted to present a much more detailed account of the
modes of convergence of random variables, distributions, and moments than
can be found in many other texts. This creates a firm foundation for the appli-
cations that appear in the book, along with further study that students may
do on their own.
As I began the job of writing this book, I recalled a quote from one of my
favorite authors, Sir Arthur Conan Doyle in his Sherlock Holmes adventure
“The Dancing Men.” Mr. Holmes is conversing with Dr. Watson and has
just presented him with one of his famous deductions that leaves Dr. Watson
startled by Holmes’ deductive abilities. Mr. Holmes replies as follows:
“You see, my dear Watson,”—he propped his test-tube in the rack, and began to
lecture with the air of a professor addressing his class—“it is not really difficult
to construct a series of inferences, each dependent upon its predecessor and each
simple in itself. If, after doing so, one simply knocks out all the central inferences
and presents one’s audience with the starting-point and the conclusion, one may
produce a startling, though possibly a meretricious, effect.”
A mathematical theorem essentially falls into this category in the sense that
it is a series of assumptions followed by a logical result. The central inferences
have been removed, sometimes producing a somewhat startling result. When
one uses the theorem as a tool to obtain further results, the central inferences
are usually unimportant. If one wishes to truly understand the result, one
must understand the central inferences, which is usually called the “proof” of
the theorem. As this book is meant to be either a textbook for a course, or a
reference book for someone unfamiliar with many of the results within, I feel
duty bound to not produce startling results in this book, but to include all
of the central inferences of each result in such a way that the simple logical
progression is clear to the reader. As a result, most of the proofs presented in
this book are very detailed.
I have included proofs to many of the results in the book. Nearly all of these
arguments come from those before me, but I have attempted to include de-
tailed explanations for the results while pointing out important tools and
techniques along the way. I have attempted to give due credit to the sources I
used for the proofs. If I have left anyone out, please accept my apologies and
be assured that it was not on purpose, but out of my ignorance. I have also
not steered away from the more complicated proofs in this field. These proofs
often contain valuable techniques that a user of asymptotic theory should be
familiar with. For many of the standard results there are several proofs that
one can choose from. I have not always chosen the shortest or the “slickest”
method of proof. In my choice of proofs I weighed the length of the argument
along with its pedagogical value and how well the proof provides insight into
the final result. This is particularly true when complicated or strange looking
assumptions are part of the result. I have attempted to point out in the proofs
where these conditions come from.
This book is mainly concerned with providing a detailed introduction into
the common modes of convergence and their related tools used in statistics.
However, it is also useful to consider how these results can be applied to several
common areas of statistics. Therefore, I have included several chapters that
deal more with the application of the theory developed in the first part of the
book. These applications are not an exhaustive offering by any stretch of the
imagination, and to be sure an exhaustive offering would have enlarged the
book to the size of a phone book from a large metropolitan area. Therefore, I
have attempted to include a few topics whose deeper understanding benefits
greatly from asymptotic theory and whose applications provide illustrative
examples of the theory developed earlier.
Many people have helped me along the way in developing the ideas behind
this book and implementing them on paper. Bob Stern and David Grubbs at
CRC/Chapman & Hall have been very supportive of this project and have
put up with my countless delays. There were several students who took a
course from me based on an early draft of this book: Devrim Bilgili, Arpita
Chatterjee, Ujjwal Das, Santu Ghosh, Priya Kohli, and Suchitrita Sarkar.
They all provided me with useful comments, found numerous typographical
errors, and asked intelligent questions, which helped develop the book from
its early phase to a more coherent and complete document. I would also like
to thank Qian Dong, Lingping Liu, and Kristin McCullough who studied from
a later version of the book and were able to point out several typographical
errors and places where there could be some improvement in the presentation. I
also want to thank Sanjib Basu who helped me by answering numerous questions
about Bayesian theory.
My friends and colleagues who have supported me through the years also
deserve a special note of thanks: My advisor, Bill Schucany, has also been a
constant supporter of my activities. He recently announced his retirement and
I wish him all the best. I also want to thank Bob Mason, Youn-Min Chou,
Dick Gunst, Wayne Woodward, Bennie Pierce, Pat Gerard, Michael Ernst,
Carrie Helmig, Donna Lynn, and Jane Eesley for their support over the years.
Jeff Reynolds, my bandmate, my colleague, my student, and my friend, has
always shown me incredible support. He has been a constant source for reality
checks, help, and motivation. We have faced many battles together, many of
them quite enjoyable: “My brother in arms.”
There is always family, and mine is the best. My wife Catherine and my son
Alton have been there for me all throughout this process. I love and cherish
both of them; I could not have completed this project without their support,
understanding, sacrifices, and patience. I also wish to thank my extended
family: my Mom and Dad, Kay and Al Polansky, who celebrated their 60th
wedding anniversary in 2010; my brother Gary and his family: Marcia, Kara,
Krista, Mandy, Nellie and Jack; my brother Dale and his new family, which we
welcomed in the summer of 2009: Jennifer and Sydney; and my wife’s family:
Karen, Mike, Ginnie, Christopher, Jonathan, Jackie, Frank, and Mila.
Finally, I would like to thank those who always provide me with the diversions
I need during such a project as this. Matt Groening, who started a little show
called The Simpsons the year before I started graduate school, will never
know how much joy his creation has brought to my life. I also wish to thank
David Silverman who visited Southern Methodist University while I was there
working on my Ph.D.; he drew me a wonderful Homer Simpson, which hangs
on my wall to this day. Jebediah was right: “A noble spirit embiggens the
smallest man.” There were also those times when what I needed was a good
polka. For this I usually turned to the music of Carl Finch and Jeffrey Barnes
of Brave Combo. See you at Westfest!
Much has happened during the time I was writing this book. I was very
saddened to hear of the passing of Professor Erich Lehmann on September 12,
2009, at the age of 91. One cannot overstate the contributions of Professor
Lehmann to the field of statistics, and particularly to much of the material
presented in this book. I did not know Professor Lehmann personally, but did
meet him once at the Joint Statistical Meetings where he was kind enough to
sign a copy of the new edition of his book Theory of Point Estimation. His
classic books, which have always had a special place on my bookshelf since
beginning my career as a graduate student, have been old and reliable friends
for many years. On July 8, 2010, David Blackwell, the statistician and math-
ematician who wrote many groundbreaking papers on probability and game
theory, passed away as well. Besides his many contributions to statistics, Pro-
fessor Blackwell also held the distinction of being the first African American
scholar to be admitted to the National Academy of Sciences and was the first
African American tenured professor at Berkeley. He too will be missed.
During the writing of this book I also passed my 40th birthday and began to
think about the turmoil and fear that pervaded my birth year of 1968. What
a time it must have been to bring a child into the world. There must have
been few hopes and many fears. I find myself looking at my own child Alton
and wondering what the world will have in store for him. It has again been a
time of turmoil and fear, and I can only hope that humanity begins to heed
the words of Dr. Martin Luther King:
“We must learn to live together as brothers or perish together as fools.”
Peace to All,
Alan M. Polansky
Creston, Illinois, USA
CHAPTER 1

Sequences of Real Numbers and Functions
K. felt slightly abandoned as, probably observed by the priest, he walked by himself between the empty pews, and the size of the cathedral seemed to be just at the limit of what a man could bear.
1.1 Introduction
while the simple alternating sequence x_n = (−1)^n has values

x_1 = −1, x_2 = 1, x_3 = −1, x_4 = 1, . . . .
One can also consider real sequences of the form xt with an uncountable
domain such as the real line which is specified by a function xt : R → R. Such
sequences are essentially just real functions. This section will consider only
real sequences whose index set is N.
The asymptotic behavior of such sequences is often of interest. That is, what
general conclusions can be made about the behavior of the sequence as n
becomes very large? In particular, do the values in the sequence appear to
“settle down” and become arbitrarily close to a single number x ∈ R as
n → ∞? For example, the sequence specified by xn = n−1 appears to become
closer and closer to 0 as n becomes very large. If a sequence has this type of
property, then the sequence is said to converge to x as n → ∞, or that the
limit of x_n as n → ∞ is x, usually written as

lim_{n→∞} x_n = x.

Definition 1.1. A sequence of real numbers {x_n}_{n=1}^∞ converges to a limit x ∈ R as n → ∞ if for every ε > 0 there exists an integer n_ε such that |x_n − x| < ε for all n ≥ n_ε.

This definition ensures the behavior described above. Specify any distance
ε > 0 to x, and all of the terms in a convergent sequence will eventually be
closer than that distance to x.
Example 1.1. Consider the harmonic sequence defined by x_n = n^{-1} for all n ∈ N. This sequence appears to monotonically become closer to zero as n increases. In fact, it can be proven that the limit of this sequence is zero. Let ε > 0. Then there exists n_ε ∈ N such that n_ε^{-1} < ε, which can be seen by taking n_ε to be any integer greater than ε^{-1}. It follows that any n ≥ n_ε will also have the property that n^{-1} < ε. Therefore, according to Definition 1.1, the sequence x_n = n^{-1} converges to zero, or

lim_{n→∞} n^{-1} = 0.
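As a simple numerical check of Definition 1.1 and Example 1.1, the following Python sketch, given only as an illustration, computes an integer n_ε greater than ε^{-1} for several values of ε and verifies that |x_n| < ε for a block of indices n ≥ n_ε.

import math

def n_epsilon(eps):
    # Any integer greater than 1/eps works for the harmonic sequence x_n = 1/n.
    return math.ceil(1.0 / eps) + 1

for eps in [0.1, 0.01, 0.001]:
    n_eps = n_epsilon(eps)
    # Verify the defining inequality |x_n - 0| < eps for a block of indices n >= n_eps.
    ok = all(1.0 / n < eps for n in range(n_eps, n_eps + 1000))
    print(eps, n_eps, ok)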
For the real number system there is an equivalent development of the concept
of a limit that is dependent on the concept of a Cauchy sequence.
Definition 1.2. Let {x_n}_{n=1}^∞ be a sequence of real numbers. The sequence is a Cauchy sequence if for every ε > 0 there is an integer n_ε such that |x_n − x_m| < ε for every n > n_ε and m > n_ε.
In general, not every Cauchy sequence converges to a limit. For example, not
every Cauchy sequence of rational numbers has a rational limit. See Exam-
ple 6.9 of Sprecher (1970). There are, however, some spaces in which every Cauchy
sequence has a unique limit. Such spaces are said to be complete. The real
number system is an example of a complete space.
Theorem 1.1. Every Cauchy sequence of real numbers has a unique limit.
The advantage of using Cauchy sequences is that we sometimes only need to
show the existence of a limit and Theorem 1.1 can be used to show that a real
sequence has a limit even if we do not know what the limit may be.
Simple algebraic transformations can be applied to convergent sequences with
the resulting limit being subject to the same transformation. For example,
adding a constant to each term of a convergent sequence results in a conver-
gent sequence whose limit equals the limit of the original sequence plus the
constant. A similar result applies to sequences that have been multiplied by
a constant.
Theorem 1.2. Let {x_n}_{n=1}^∞ be a sequence of real numbers such that

lim_{n→∞} x_n = x,

and let c ∈ R be a constant. Then

lim_{n→∞} (x_n + c) = x + c,

and

lim_{n→∞} c x_n = c x.
Proof. We will prove the first result. The second result is proven in Exercise 6. Let {x_n}_{n=1}^∞ be a sequence of real numbers that converges to x, let c be a real constant, and let ε > 0. Definition 1.1 and the convergence of the sequence {x_n}_{n=1}^∞ imply that there exists a positive integer n_ε such that |x_n − x| < ε for all n ≥ n_ε. Now consider the sequence {x_n + c}_{n=1}^∞. Note that for ε > 0 we have that |(x_n + c) − (x + c)| = |x_n − x|. Therefore, |(x_n + c) − (x + c)| is also less than ε for all n ≥ n_ε and the result follows from Definition 1.1.
Theorem 1.3. Let {x_n}_{n=1}^∞ and {y_n}_{n=1}^∞ be sequences of real numbers such that

lim_{n→∞} x_n = x,

and

lim_{n→∞} y_n = y.

Then

lim_{n→∞} (x_n + y_n) = x + y.
Proof. Only the first result will be proven here. The remaining results are proven as part of Exercise 6. Let {x_n}_{n=1}^∞ and {y_n}_{n=1}^∞ be convergent sequences with limits x and y, respectively. Let ε > 0, then Definition 1.1 implies that there exist integers n_{ε,x} and n_{ε,y} such that |x_n − x| < ε/2 for all n ≥ n_{ε,x} and |y_n − y| < ε/2 for all n ≥ n_{ε,y}. Now, note that Theorem A.18 implies that

|(x_n + y_n) − (x + y)| = |(x_n − x) + (y_n − y)| ≤ |x_n − x| + |y_n − y|.

Let n_ε = max{n_{ε,x}, n_{ε,y}} so that |x_n − x| < ε/2 and |y_n − y| < ε/2 for all n ≥ n_ε. Therefore |(x_n + y_n) − (x + y)| < ε/2 + ε/2 = ε for all n ≥ n_ε and the result follows from Definition 1.1.
The focus of our study of limits so far has been for convergent sequences. Not
all sequences of real numbers are convergent, and the limits of non-convergent
real sequences do not exist.
Example 1.2. Consider the alternating sequence defined by x_n = (−1)^n for all n ∈ N. It is intuitively clear that the alternating sequence does not “settle down” at all, or that it does not converge to any real number. To prove this, let l ∈ R be any real number. Take 0 < ε < max{|l − 1|, |l + 1|}; then for any n_0 ∈ N there will exist at least one n′ > n_0 such that |x_{n′} − l| > ε. Hence, this sequence does not have a limit.
While a non-convergent sequence does not have a limit, the asymptotic be-
havior of non-convergent sequences can be described to a certain extent by
considering the asymptotic behavior of the upper and lower bounds of the
sequence. Let {x_n}_{n=1}^∞ be a sequence of real numbers; then u ∈ R is an upper bound for {x_n}_{n=1}^∞ if x_n ≤ u for all n ∈ N. Similarly, l ∈ R is a lower bound for {x_n}_{n=1}^∞ if x_n ≥ l for all n ∈ N. The least upper bound of {x_n}_{n=1}^∞ is u_l ∈ R if u_l is an upper bound for {x_n}_{n=1}^∞ and u_l ≤ u for any upper bound u of {x_n}_{n=1}^∞. The least upper bound will be denoted by

u_l = sup_{n∈N} x_n,

and is often called the supremum of the sequence {x_n}_{n=1}^∞. Similarly, the greatest lower bound of {x_n}_{n=1}^∞ is l_u ∈ R if l_u is a lower bound for {x_n}_{n=1}^∞ and l_u ≥ l for any lower bound l of {x_n}_{n=1}^∞. The greatest lower bound of {x_n}_{n=1}^∞ will be denoted by

l_u = inf_{n∈N} x_n,

and is often called the infimum of {x_n}_{n=1}^∞.
It is a property of the real numbers
that any sequence that has a lower bound also has a greatest lower bound and
that any sequence that has an upper bound also has a least upper bound.
See Page 33 of Royden (1988) for further details. The asymptotic behavior
of non-convergent sequences can be studied in terms of how the supremum
and infimum of a sequence behaves as n → ∞. That is, we can consider for
example the asymptotic behavior of the upper limit of a sequence {x_n}_{n=1}^∞ by calculating

lim_{n→∞} sup_{k≥n} x_k.

If

lim inf_{n→∞} x_n = lim sup_{n→∞} x_n = c ∈ R,

then the limit of {x_n}_{n=1}^∞ exists and is equal to c.
The usefulness of the limit supremum and the limit infimum can be demon-
strated through some examples.
Example 1.3. Consider the sequence {x_n}_{n=1}^∞ where x_n = n^{-1} for all n ∈ N. Note that

sup_{k≥n} x_k = sup_{k≥n} k^{-1} = n^{-1},

so that

lim sup_{n→∞} x_n = inf_{n∈N} sup_{k≥n} x_k = inf_{n∈N} n^{-1} = 0.

Note that zero is a lower bound of n^{-1} since n^{-1} > 0 for all n ∈ N. Further, zero is the greatest lower bound since for any ε > 0 there exists an n_ε ∈ N such that n_ε^{-1} < ε. Similar arguments can be used to show that

lim inf_{n→∞} x_n = sup_{n∈N} inf_{k≥n} x_k = sup_{n∈N} 0 = 0.
Example 1.4. Consider once again the alternating sequence {x_n}_{n=1}^∞ where x_n = (−1)^n for all n ∈ N from Example 1.2. Note that

sup_{k≥n} x_k = sup_{k≥n} (−1)^k = 1,

so that

lim sup_{n→∞} x_n = inf_{n∈N} sup_{k≥n} x_k = inf_{n∈N} 1 = 1.

Similarly,

inf_{k≥n} x_k = inf_{k≥n} (−1)^k = −1,

so that

lim inf_{n→∞} x_n = sup_{n∈N} inf_{k≥n} x_k = sup_{n∈N} (−1) = −1.

In this case it is clear that

lim inf_{n→∞} x_n ≠ lim sup_{n→∞} x_n,

so that, as shown in Example 1.2, the limit of the sequence does not exist. The limit infimum and limit supremum indicate the extent of the limiting variation of {x_n}_{n=1}^∞ as n → ∞.
Example 1.5. Consider the sequence {x_n}_{n=1}^∞ where x_n = (−1)^n (1 + n^{-1}) for all n ∈ N. Note that

sup_{k≥n} x_k = sup_{k≥n} (−1)^k (1 + k^{-1}) =
    1 + n^{-1}        if n is even,
    1 + (n + 1)^{-1}  if n is odd.

In either case,

inf_{n∈N} (1 + n^{-1}) = inf_{n∈N} [1 + (n + 1)^{-1}] = 1,

so that

lim sup_{n→∞} x_n = 1.

Similarly,

inf_{k≥n} x_k = inf_{k≥n} (−1)^k (1 + k^{-1}) =
    −(1 + n^{-1})        if n is odd,
    −[1 + (n + 1)^{-1}]  if n is even,

and

sup_{n∈N} −(1 + n^{-1}) = sup_{n∈N} −[1 + (n + 1)^{-1}] = −1,

so that

lim inf_{n→∞} x_n = −1.

As in Example 1.4,

lim inf_{n→∞} x_n ≠ lim sup_{n→∞} x_n,

so that the limit does not exist. Note that this sequence has the same asymptotic behavior in its upper and lower bounds as the much simpler sequence in Example 1.4.
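The tail suprema and infima in Examples 1.3 through 1.5 can also be examined numerically. The Python sketch below, given purely as an illustration, approximates sup_{k≥n} x_k and inf_{k≥n} x_k over a long finite horizon for the sequence of Example 1.5; the tail suprema decrease toward 1 and the tail infima increase toward −1 as n grows.

def x(n):
    # The sequence of Example 1.5.
    return (-1) ** n * (1.0 + 1.0 / n)

N = 100000  # finite horizon used to approximate the tail supremum and infimum
values = [x(n) for n in range(1, N + 1)]

for n in [1, 10, 100, 1000]:
    tail = values[n - 1:]
    # max and min over k = n, ..., N approximate sup and inf over k >= n.
    print(n, max(tail), min(tail))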
The properties of the limit supremum and limit infimum are similar to those
of the limit with some notable exceptions.
Theorem 1.5. Let {x_n}_{n=1}^∞ be a sequence of real numbers. Then

lim inf_{n→∞} (−x_n) = − lim sup_{n→∞} x_n,

and

lim sup_{n→∞} (−x_n) = − lim inf_{n→∞} x_n.
Proof. The second property will be proven here. The first property is proven in Exercise 10. Let {x_n}_{n=1}^∞ be a sequence of real numbers. Note that the negative of any lower bound of {x_n}_{n=1}^∞ is an upper bound of the sequence {−x_n}_{n=1}^∞. To see why, let l be a lower bound of {x_n}_{n=1}^∞. Then l ≤ x_n for all n ∈ N. Multiplying each side of the inequality by −1 yields −l ≥ −x_n for all n ∈ N. Therefore −l is an upper bound of {−x_n}_{n=1}^∞, and it follows that the negative of the greatest lower bound of {x_n}_{n=1}^∞ is the least upper bound of {−x_n}_{n=1}^∞. That is,

− inf_{k≥n} x_k = sup_{k≥n} (−x_k),   (1.3)

and

− sup_{k≥n} x_k = inf_{k≥n} (−x_k).   (1.4)
Therefore, an application of Equations (1.3) and (1.4) implies that

lim sup_{n→∞} (−x_n) = lim_{n→∞} sup_{k≥n} (−x_k) = − lim_{n→∞} inf_{k≥n} x_k = − lim inf_{n→∞} x_n,

which is the second result.

Theorem 1.6. Let {x_n}_{n=1}^∞ and {y_n}_{n=1}^∞ be sequences of real numbers.

1. If x_n ≤ y_n for all n ∈ N, then

lim sup_{n→∞} x_n ≤ lim sup_{n→∞} y_n,

and

lim inf_{n→∞} x_n ≤ lim inf_{n→∞} y_n.
2. If

lim sup_{n→∞} x_n < ∞,

and

lim inf_{n→∞} x_n < ∞,

then

lim inf_{n→∞} x_n + lim inf_{n→∞} y_n ≤ lim inf_{n→∞} (x_n + y_n),

and

lim sup_{n→∞} (x_n + y_n) ≤ lim sup_{n→∞} x_n + lim sup_{n→∞} y_n.

3. If

0 < lim sup_{n→∞} x_n < ∞,

and

0 < lim sup_{n→∞} y_n < ∞,

then

lim sup_{n→∞} x_n y_n ≤ lim sup_{n→∞} x_n lim sup_{n→∞} y_n.
Proof. Part of the first property is proven here. The remaining properties are proven in Exercises 12 and 13. Suppose that {x_n}_{n=1}^∞ and {y_n}_{n=1}^∞ are real sequences such that x_n ≤ y_n for all n ∈ N. Then x_k ≤ y_k for all k ∈ {n, n + 1, . . .}. This implies that any upper bound for {y_n, y_{n+1}, . . .} is also an upper bound for {x_n, x_{n+1}, . . .}. It follows that the least upper bound for {x_n, x_{n+1}, . . .} is less than or equal to the least upper bound for {y_n, y_{n+1}, . . .}. That is,

sup_{k≥n} x_k ≤ sup_{k≥n} y_k.   (1.5)
Example 1.7. Consider two sequences of real numbers {x_n}_{n=1}^∞ and {y_n}_{n=1}^∞ where x_n = (1/2)[(−1)^n − 1] and y_n = (−1)^n for all n ∈ N. Using Definition 1.3 it can be shown that

lim sup_{n→∞} x_n = 0,

and

lim sup_{n→∞} y_n = 1,

so that

lim sup_{n→∞} x_n · lim sup_{n→∞} y_n = 0.
Theorem 1.7. lim_{n→∞} (1 + n^{-1})^n = e.

Proof. The standard proof of this result, such as this one which is adapted from Sprecher (1970), is based on Theorem A.22. Using Theorem A.22, note that when n ∈ N is fixed

(1 + n^{-1})^n = \sum_{i=0}^{n} \binom{n}{i} n^{-i} (1)^{n-i} = \sum_{i=0}^{n} \binom{n}{i} n^{-i}.   (1.8)

Consider the ith term in the series on the right hand side of Equation (1.8), and note that

\binom{n}{i} n^{-i} = \frac{n!}{i!(n-i)!} n^{-i} = \frac{1}{i!} \prod_{j=0}^{i-1} \frac{n-j}{n} ≤ \frac{1}{i!}.

Therefore it follows that

(1 + n^{-1})^n ≤ \sum_{i=0}^{n} \frac{1}{i!} ≤ e,

for every n ∈ N. The second inequality comes from the fact that

e = \sum_{i=0}^{∞} \frac{1}{i!},

where all of the terms in the sequence are positive. It then follows that

lim sup_{n→∞} (1 + n^{-1})^n ≤ e.
Theorem 1.8 (Stirling).

lim_{n→∞} \frac{n^n (2nπ)^{1/2}}{\exp(n)\, n!} = 1.

A proof of Theorem 1.8 can be found in Slomson (1991). Theorem 1.8 implies that when n is large,

\frac{n^n (2nπ)^{1/2}}{\exp(n)\, n!} ≃ 1,

and hence one can approximate n! with n^n \exp(−n)(2nπ)^{1/2}.
Example 1.9. Let n and k be positive integers such that k ≤ n. The number of combinations of k items selected from a set of n items is

\binom{n}{k} = \frac{n!}{k!(n-k)!}.

Theorem 1.8 implies that when both n and k are large, we can approximate \binom{n}{k} as

\binom{n}{k} ≃ \frac{n^n \exp(−n)(2nπ)^{1/2}}{k^k \exp(−k)(2kπ)^{1/2} (n-k)^{n-k} \exp(k-n)[2(n-k)π]^{1/2}}.
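The quality of the approximations in Theorem 1.8 and Example 1.9 is easy to examine numerically. The Python sketch below, included only as an illustration, compares n! with n^n exp(−n)(2nπ)^{1/2}, and a binomial coefficient with its Stirling-based approximation; the ratios approach one as the arguments grow.

import math

def stirling(n):
    # Stirling's approximation to n! from Theorem 1.8.
    return n ** n * math.exp(-n) * math.sqrt(2.0 * math.pi * n)

for n in [5, 10, 20, 50]:
    print(n, math.factorial(n) / stirling(n))

def binom_approx(n, k):
    # Approximation of the binomial coefficient from Example 1.9.
    return stirling(n) / (stirling(k) * stirling(n - k))

n, k = 100, 40
print(math.comb(n, k) / binom_approx(n, k))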
However, if x = 1 then

lim_{n→∞} f_n(x) = lim_{n→∞} 1 = 1.

Therefore f_n →^{pw} f as n → ∞ where f(x) = δ{x; {1}}, and δ is the indicator function defined by

δ{x; A} =
    1   if x ∈ A,
    0   if x ∉ A.

Note that this example also demonstrates that the limit of a sequence of continuous functions need not also be continuous.
Because the definition of pointwise convergence for sequences of functions is
closely related to Definition 1.1, the definition for limit for real sequences,
many of the properties of limits also hold for sequences of functions that
converge pointwise. For example, if {fn (x)}∞ ∞
n=1 and {gn (x)}n=1 are sequences
of functions that converge to the functions f and g, respectively, then the
sequence {fn (x) + gn (x)}∞
n=1 converges pointwise to f (x) + g(x). See Exercise
14.
A different approach to defining convergence for sequences of real functions
requires not only that the sequence of functions converge pointwise to a lim-
iting function, but that the convergence must be uniform in x ∈ R. That is,
if {f_n}_{n=1}^∞ is a sequence of functions that converges pointwise to a function f, we further require that the rate of convergence of f_n(x) to f(x) as n → ∞ does not depend on x ∈ R.
Definition 1.5. A sequence of functions {f_n(x)}_{n=1}^∞ converges uniformly to a function f(x) as n → ∞ if for every ε > 0 there exists an integer n_ε such that |f_n(x) − f(x)| < ε for all n ≥ n_ε and x ∈ R. This type of convergence will be represented as f_n →^{u} f as n → ∞.
Example 1.12. Consider once again the sequence of functions {f_n(x)}_{n=1}^∞ given by f_n(x) = 1 + n^{-1} x^2 for all n ∈ N and x ∈ [0, 1]. It was shown in Example 1.10 that f_n →^{pw} f = 1 as n → ∞ on [0, 1]. Now we investigate whether this convergence is uniform or not. Let ε > 0, and note that

|f_n(x) − f(x)| = |1 + n^{-1} x^2 − 1| = |n^{-1} x^2| ≤ n^{-1},

for all x ∈ [0, 1]. Take n_ε to be any integer greater than ε^{-1} and we have that |f_n(x) − f(x)| < ε for all n ≥ n_ε. Because n_ε does not depend on x, we have that f_n →^{u} f as n → ∞ on [0, 1].
Example 1.13. Consider the sequence of functions {f_n(x)}_{n=1}^∞ given by f_n(x) = x^n for all x ∈ [0, 1] and n ∈ N. It was shown in Example 1.11 that f_n →^{pw} f as n → ∞ where f(x) = δ{x; {1}} on [0, 1]. Consider ε ∈ (0, 1). For any value of n ∈ N, |f_n(x) − 0| < ε on (0, 1) when x^n < ε, which implies that n > log(ε)/log(x), where we note that log(x) < 0 when x ∈ (0, 1). Hence, such a bound on n will always depend on x since log(ε)/log(x) is unbounded in the interval (0, 1), and therefore the sequence of functions {f_n}_{n=1}^∞ does not converge uniformly to f as n → ∞.
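The contrast between Examples 1.12 and 1.13 can be illustrated numerically. In the Python sketch below, given only as an illustration, the supremum of |f_n(x) − f(x)| over [0, 1] in Example 1.12 equals n^{-1} and shrinks to zero, while in Example 1.13 evaluating |f_n(x) − f(x)| at a point that drifts toward 1 with n shows a discrepancy that stays near exp(−1/2), so no single n_ε can work for all x.

for n in [1, 10, 100, 1000, 10000]:
    # Example 1.12: sup over [0, 1] of |f_n(x) - f(x)| = |n^{-1} x^2| is exactly 1/n.
    gap_quadratic = 1.0 / n
    # Example 1.13: evaluate |f_n(x) - f(x)| = x^n at a point drifting toward 1 with n.
    x_near_one = 1.0 - 1.0 / (2.0 * n)
    gap_power = x_near_one ** n
    print(n, gap_quadratic, gap_power)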
Example 1.12 demonstrates one characteristic of sequences of functions that
are uniformly convergent, in that the limit of a sequence of uniformly conver-
gent continuous functions must also be continuous.
Theorem 1.9. Suppose that {f_n(x)}_{n=1}^∞ is a sequence of functions on a subset R of R that converge uniformly to a function f. If each f_n is continuous at a point x ∈ R, then f is also continuous at x.
A proof of Theorem 1.9 can be found in Section 9.4 of Apostol (1974). Note
that uniform convergence is a sufficient, but not a necessary condition, for
the limit function to be continuous, as is demonstrated by Example 1.13. An
alternate view of uniformly convergent sequences of functions can be defined
in terms of Cauchy sequences, as shown below.
Theorem 1.10. Suppose that {f_n(x)}_{n=1}^∞ is a sequence of functions on a subset R of R. There exists a function f such that f_n →^{u} f as n → ∞ if and only if for every ε > 0 there exists n_ε ∈ N such that |f_n(x) − f_m(x)| < ε for all n > n_ε and m > n_ε, for every x ∈ R.
A proof of Theorem 1.10 can be found in Section 9.5 of Apostol (1974).
Another important property of sequences of functions {f_n(x)}_{n=1}^∞ is whether a limit and an integral can be exchanged. That is, let a and b be real constants that do not depend on n such that a < b. Does it necessarily follow that

lim_{n→∞} \int_a^b f_n(x) dx = \int_a^b lim_{n→∞} f_n(x) dx = \int_a^b f(x) dx?
An example can be used to demonstrate that such an exchange is not always
justified.
Example 1.14. Consider a sequence of real functions defined by f_n(x) = 2^n δ{x; (2^{-n}, 2^{-(n-1)})} for all n ∈ N. The integral of f_n is given by

\int_{−∞}^{∞} f_n(x) dx = \int_{2^{-n}}^{2^{-(n-1)}} 2^n dx = 1,
While Example 1.14 shows it is not always possible to interchange a limit and
an integral, there are some instances where the change is allowed. One of the
most useful of these cases occurs when the sequence of functions {fn (x)}∞n=1
is dominated by an integrable function, that is, a function whose integral
exists and is finite. This result is usually called the Dominated Convergence
Theorem.
Theorem 1.11 (Lebesgue). Let {f_n(x)}_{n=1}^∞ be a sequence of real functions. Suppose that f_n →^{pw} f as n → ∞ for some real valued function f, and that there exists a real function g such that

\int_{−∞}^{∞} |g(x)| dx < ∞,

and |f_n(x)| ≤ g(x) for all x ∈ R and n ∈ N. Then

lim_{n→∞} \int_{−∞}^{∞} f_n(x) dx = \int_{−∞}^{∞} lim_{n→∞} f_n(x) dx = \int_{−∞}^{∞} f(x) dx.
This function dominates f_n(x) for all n ∈ N in that |f_n(x)| ≤ g(x) for all x ∈ R and n ∈ N. But note that

\int_{−∞}^{∞} g(x) dx = \sum_{n=1}^{∞} \int_{−∞}^{∞} 2^n δ{x; (2^{-n}, 2^{-(n-1)})} dx = \sum_{n=1}^{∞} 1 = ∞.   (1.10)
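The failure of the interchange in Example 1.14 can be seen directly with a small computation. In the Python sketch below, given only as an illustration, each f_n integrates to one (approximated here by a Riemann sum), while at any fixed point x > 0 the values f_n(x) are eventually zero, so the integral of the pointwise limit is zero.

def f(n, x):
    # f_n(x) = 2^n on the interval (2^{-n}, 2^{-(n-1)}) and 0 elsewhere (Example 1.14).
    return 2.0 ** n if 2.0 ** (-n) < x < 2.0 ** (-(n - 1)) else 0.0

def integral(n, num_points=200000):
    # Midpoint Riemann sum of f_n over (0, 1]; the exact value is 1 for every n.
    h = 1.0 / num_points
    return sum(f(n, (i + 0.5) * h) for i in range(num_points)) * h

for n in [1, 2, 5, 10]:
    # The integrals stay near 1 while f_n(0.3) is 0 for every n >= 3.
    print(n, integral(n), f(n, 0.3))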
A related result that allows for the interchange of a limit and an integral
is based on the assumption that the sequence of functions is monotonically
increasing or decreasing to a limiting function.
Theorem 1.12 (Lebesgue's Monotone Convergence Theorem). Let {f_n(x)}_{n=1}^∞ be a sequence of real functions that are monotonically increasing to f on R. That is, f_i(x) ≤ f_j(x) for all x ∈ R when i < j, for positive integers i and j, and f_n →^{pw} f as n → ∞ on R. Then

lim_{n→∞} \int_{−∞}^{∞} f_n(x) dx = \int_{−∞}^{∞} lim_{n→∞} f_n(x) dx = \int_{−∞}^{∞} f(x) dx.   (1.11)
Proofs of Theorem 1.12 and Corollary 1.1 can be found in Gut (2005) or
Sprecher (1970).
Example 1.16. The following setup is often used when using arguments that rely on the truncation of random variables. Let g be an integrable function on R and define a sequence of functions {f_n(x)}_{n=1}^∞ as f_n(x) = g(x)δ{|x|; (0, n)}. It follows that for all x ∈ R,

lim_{n→∞} f_n(x) = lim_{n→∞} g(x)δ{|x|; (0, n)} = g(x),

since for each x ∈ R there exists an integer n_x such that f_n(x) = g(x) for all n ≥ n_x. Noting that {f_n(x)}_{n=1}^∞ is a monotonically increasing sequence of functions allows us to use Theorem 1.12 to conclude that

lim_{n→∞} \int_{−∞}^{∞} f_n(x) dx = \int_{−∞}^{∞} lim_{n→∞} f_n(x) dx = \int_{−∞}^{∞} g(x) dx.
Figure 1.1 The linear approximation of a curved function. The solid line indicates the form of the function f and the dotted line shows the linear approximation of f(z + δ) given by f(z + δ) ≃ f(z) + δf′(z) for the point z indicated on the graph.
the error of the approximation as a function of both x and δ. Note that using Theorem A.3 yields

f(x + δ) − f(x) = \int_x^{x+δ} f′(t) dt,

and

δf′(x) = f′(x) \int_x^{x+δ} dt,

so that

E_1(x, δ) = \int_x^{x+δ} [f′(t) − f′(x)] dt.

An application of Theorem A.4 yields

E_1(x, δ) = −(x + δ − t)[f′(t) − f′(x)] \big|_x^{x+δ} + \int_x^{x+δ} (x + δ − t) f″(t) dt = \int_x^{x+δ} (x + δ − t) f″(t) dt,

which establishes the role of the second derivative in the error of the approximation.
Note that if f has a high degree of curvature, with no change in the direction of the concavity of f, the absolute value of the integral in E_1(x, δ) will be large. This indicates that the function continues turning away from the linear approximation as shown in Figure 1.1. If f has an inflection point in the interval (x, x + δ), the direction of the concavity will change and the integral will become smaller. The change in the sign of f″ indicates that the function is turning back toward the linear approximation and the error will decrease. See Figure 1.2.
A somewhat simpler form of the error term E_1(x, δ) can be obtained through an application of Theorem A.5, as

E_1(x, δ) = \int_x^{x+δ} (x + δ − t) f″(t) dt = f″(ξ) \int_x^{x+δ} (x + δ − t) dt = (1/2) f″(ξ) δ²,

for some ξ ∈ [x, x + δ]. The exact value of ξ will depend on x, δ, and f. Assume that f‴(x) exists, is finite, and continuous in the interval (x, x + δ). Define

E_2(x, δ) = f(x + δ) − f(x) − δf′(x) − (1/2)δ²f″(x).   (1.15)

Note that the first three terms on the right hand side of Equation (1.15) equal E_1(x, δ), so that

E_2(x, δ) = E_1(x, δ) − (1/2)δ²f″(x) = \int_x^{x+δ} (x + δ − t) f″(t) dt − (1/2)δ²f″(x).
Figure 1.2 The linear approximation of a curved function. The solid line indicates the form of the function f and the dotted line shows the linear approximation of f(z + δ) given by f(z + δ) ≃ f(z) + δf′(z) for the point z indicated on the graph. In this case, the concavity of f changes and the linear approximation becomes more accurate again for larger values of δ, for the range of the values plotted in the figure.
so that

E_2(x, δ) = \int_x^{x+δ} (x + δ − t)[f″(t) − f″(x)] dt.

An application of Theorem A.4 yields

E_2(x, δ) = (1/2) \int_x^{x+δ} (x + δ − t)² f‴(t) dt,

which indicates that the third derivative is the essential determining factor in the accuracy of this approximation. As with E_1(x, δ), the error term can be restated in a somewhat simpler form by using an application of Theorem A.5. That is,

E_2(x, δ) = (1/2) \int_x^{x+δ} (x + δ − t)² f‴(t) dt = (1/2) f‴(ξ) \int_x^{x+δ} (x + δ − t)² dt = (1/6) δ³ f‴(ξ),
for some ξ ∈ [x, x + δ], where the value of ξ will depend on x, δ and f. Note that δ is cubed in E_2(x, δ) as opposed to being squared in E_1(x, δ). This would imply, depending on the relative sizes of f″ and f‴ in the interval [x, x + δ], that E_2(x, δ) will generally be smaller than E_1(x, δ) for small values of δ. In fact, one can note that

\frac{E_2(x, δ)}{E_1(x, δ)} = \frac{(1/6) δ³ f‴[ξ_2(δ)]}{(1/2) δ² f″[ξ_1(δ)]} = \frac{δ f‴[ξ_2(δ)]}{3 f″[ξ_1(δ)]}.

The values ξ_1(δ) and ξ_2(δ) are the values of ξ for E_1(x, δ) and E_2(x, δ), respectively, written here as a function of δ to emphasize the fact that these values change as δ → 0. In fact, as δ → 0, ξ_1(δ) → x and ξ_2(δ) → x. Now assume that the derivatives are continuous so that f″[ξ_1(δ)] → f″(x) and f‴[ξ_2(δ)] → f‴(x) as δ → 0, and that |f‴(x)/f″(x)| < ∞. Then

lim_{δ→0} \frac{E_2(x, δ)}{E_1(x, δ)} = lim_{δ→0} \frac{δ f‴[ξ_2(δ)]}{3 f″[ξ_1(δ)]} = 0.   (1.16)
Hence, under these conditions, the error from the quadratic approximation
will be dominated by the error from the linear approximation.
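The limit in Equation (1.16) can be checked numerically for a specific smooth function. In the Python sketch below, given only as an illustration with the arbitrary choice f(x) = exp(x) and x = 1, the error terms E_1(x, δ) and E_2(x, δ) are computed directly from their definitions, and the ratio E_2/E_1 behaves like δ/3 as δ decreases.

import math

x = 1.0  # point of expansion (an arbitrary illustrative choice)

def E1(delta):
    # E_1(x, delta) = f(x + delta) - f(x) - delta f'(x) with f = exp.
    return math.exp(x + delta) - math.exp(x) - delta * math.exp(x)

def E2(delta):
    # E_2(x, delta) = E_1(x, delta) - (1/2) delta^2 f''(x).
    return E1(delta) - 0.5 * delta ** 2 * math.exp(x)

for delta in [0.5, 0.1, 0.01, 0.001]:
    print(delta, E1(delta), E2(delta), E2(delta) / E1(delta))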
If f is a sufficiently smooth function, ensured by the existence of a required
number of derivatives, then the process described above can be iterated further
to obtain potentially smaller error terms when δ is small. This results in what
is usually known as Taylor’s Theorem.
Theorem 1.13 (Taylor). Let f be a function that has p + 1 bounded and continuous derivatives in the interval (x, x + δ). Then

f(x + δ) = \sum_{k=0}^{p} \frac{δ^k f^{(k)}(x)}{k!} + E_p(x, δ),

where

E_p(x, δ) = \frac{1}{p!} \int_x^{x+δ} (x + δ − t)^p f^{(p+1)}(t) dt = \frac{δ^{p+1} f^{(p+1)}(ξ)}{(p+1)!},

for some ξ ∈ (x, x + δ).
For a proof of Theorem 1.13 see Exercises 24 and 25. What is so special about the approximation that is obtained using Theorem 1.13? Aside from the motivation given earlier in this section, consider taking the first derivative of the pth order approximation at δ = 0, which is

\frac{d}{dδ} \sum_{k=0}^{p} \frac{δ^k f^{(k)}(x)}{k!} \Big|_{δ=0} = \frac{d}{dδ} \Big[ f(x) + δf′(x) + \sum_{k=2}^{p} \frac{δ^k f^{(k)}(x)}{k!} \Big] \Big|_{δ=0} = f′(x).

Hence, the approximating function has the same derivative at x as the actual function f(x). In general it can be shown that

\frac{d^j}{dδ^j} \sum_{k=0}^{p} \frac{δ^k f^{(k)}(x)}{k!} \Big|_{δ=0} = f^{(j)}(x),   (1.17)

for j = 1, . . . , p. Therefore, the pth-order approximating function has the same derivatives of order 1, . . . , p as the actual function f(x) at the point x. A proof of Equation (1.17) is given in Exercise 27.
Note that an alternative form of the expansion given in Theorem 1.13 can be
obtained by setting y = x + δ and x = y0 so that δ = y − x = y − y0 and the
expansion has the form
f(y) = \sum_{k=0}^{p} \frac{(y − y_0)^k f^{(k)}(y_0)}{k!} + E_p(y, y_0),   (1.18)

where

E_p(y, y_0) = \frac{(y − y_0)^{p+1} f^{(p+1)}(ξ)}{(p+1)!},
and ξ is between y and y0 . The expansion given in Equation (1.18) is usually
called the expansion of f around the point y0 .
Figure 1.3 The exponential function (solid line) and three approximations based on Theorem 1.13 using p = 1 (dashed line), p = 2 (dotted line) and p = 3 (dash-dot line).
absolute errors from the higher order polynomial approximations are domi-
nated by the absolute errors from the lower order polynomial approximations
as δ → 0. One can also observe from Figure 1.5 that the error of the quadratic approximation relative to that of the linear approximation is much larger than the absolute error of the cubic approximation relative to that of the linear approximation.
Example 1.18. Consider the distribution function of a N(0, 1) distribution
given by
Φ(x) = \int_{−∞}^{x} (2π)^{-1/2} \exp(−t²/2) dt.
We would like to approximate Φ(x), an integral that has no simple closed form,
with a simple function for values of x near 0. As in the previous example, three
approximations based on Theorem 1.13 will be considered. Applying Theorem
1.13 to Φ(x + δ) with p = 1 yields the approximation
Φ(x + δ) = Φ(x) + δΦ′(x) + E_1(x, δ) = Φ(x) + δφ(x) + E_1(x, δ).
Figure 1.4 The absolute error for approximating the exponential function using the three approximations based on Theorem 1.13 with p = 1 (dashed line), p = 2 (dotted line) and p = 3 (dash-dot line).
Setting x = 0 and noting that Φ(0) = 1/2 yields

Φ(δ) = 1/2 + δφ(0) + E_1(δ) = 1/2 + δ(2π)^{-1/2} + E_1(δ).

The quadratic approximation has the form

Φ(δ) = 1/2 + (2π)^{-1/2}δ + (1/2)φ′(0)δ² + E_2(δ),

where

φ′(0) = \frac{d}{dx} φ(x) \Big|_{x=0} = −x(2π)^{-1/2} \exp(−x²/2) \Big|_{x=0} = −xφ(x) \Big|_{x=0} = 0.

Hence, the quadratic approximation is the same as the linear one. This indicates that the linear approximation is more accurate in this case than what would usually be expected. The cubic approximation has the form

Φ(δ) = 1/2 + (2π)^{-1/2}δ + (1/2)φ′(0)δ² + (1/6)φ″(0)δ³ + E_3(δ),   (1.19)
Figure 1.5 The absolute relative errors for approximating the exponential function using the three approximations based on Theorem 1.13 with p = 1, p = 2 and p = 3. The relative errors are |E2(δ)/E1(δ)| (solid line), |E3(δ)/E1(δ)| (dashed line) and |E3(δ)/E2(δ)| (dotted line).
where

φ″(0) = \frac{d}{dx} [−xφ(x)] \Big|_{x=0} = (x² − 1)φ(x) \Big|_{x=0} = −(2π)^{-1/2}.
Figure 1.6 The standard normal distribution function (solid line) and two approximations based on Theorem 1.13 using p = 1 (dashed line) and p = 3 (dotted line).
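The approximations of Example 1.18 are easily compared with the exact standard normal distribution function. The Python sketch below, given only as an illustration, evaluates Φ(δ) through the error function along with the linear approximation and the cubic approximation from Equation (1.19), using φ″(0) = −(2π)^{-1/2}.

import math

def Phi(z):
    # Standard normal distribution function expressed through the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

c = 1.0 / math.sqrt(2.0 * math.pi)  # (2 pi)^{-1/2}

for delta in [0.1, 0.5, 1.0]:
    linear = 0.5 + c * delta
    cubic = 0.5 + c * delta - c * delta ** 3 / 6.0
    print(delta, Phi(delta), linear, cubic)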
From Example 1.18 it is clear that the derivatives of the standard normal
density have a specific form in that they are all a polynomial multiplied by
the standard normal density. The polynomial multipliers, called Hermite polynomials, are quite useful and will be used later in the book.

Definition 1.6. Let φ(x) be the density of a standard normal random variable. The kth derivative of φ(x) can be expressed as

\frac{d^k}{dx^k} φ(x) = (−1)^k H_k(x) φ(x),

where H_k(x) is a kth-order polynomial in x called the kth Hermite polynomial. That is,

H_k(x) = \frac{(−1)^k φ^{(k)}(x)}{φ(x)}.
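Definition 1.6 can be used directly to generate the Hermite polynomials with a computer algebra system. The Python sketch below, given only as an illustration and assuming the sympy library is available, differentiates the standard normal density repeatedly and divides by φ(x), recovering H_1(x) = x, H_2(x) = x² − 1, H_3(x) = x³ − 3x, and so on.

import sympy as sp

x = sp.symbols('x')
phi = sp.exp(-x ** 2 / 2) / sp.sqrt(2 * sp.pi)  # standard normal density

for k in range(1, 5):
    # H_k(x) = (-1)^k phi^{(k)}(x) / phi(x), as in Definition 1.6.
    Hk = sp.simplify((-1) ** k * sp.diff(phi, x, k) / phi)
    print(k, sp.expand(Hk))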
Hermite polynomials are an example of a set of orthogonal polynomials. See
Exercise 33. Hermite polynomials also have many interesting properties in-
cluding a simple recurrence relation between successive polynomials.
Theorem 1.14. Let H_k(x) be the kth Hermite polynomial, then H_{k+1}(x) = x H_k(x) − k H_{k−1}(x) for all k ∈ N, where H_0(x) = 1.
where the properties of the geometric series have been used. See Theorem
A.23. However, if δ is fixed so that δ > 1 then
\sum_{k=0}^{p} (−1)^{k+1} δ^k → ∞,
The exact form of the error terms of asymptotic expansions are typically not
of interest. However, the asymptotic behavior of these errors as δ → 0 or as
n → ∞ terms is important. For example, in Section 1.4 it was argued that
when certain conditions are met, the error term E2 (x, δ) from Theorem 1.13
is dominated as δ → 0 by E1 (x, δ), when x is fixed. Because the asymptotic
behavior of the error term, and not its exact form, is important, a specific
type of notation has been developed to symbolize the limit behavior of these
sequences. Asymptotic order notation is a simple type of shorthand that indi-
cates the asymptotic behavior of a sequence with respect to another sequence.
Definition 1.7. Let {x_n}_{n=1}^∞ and {y_n}_{n=1}^∞ be real sequences.

1. The notation x_n = o(y_n) as n → ∞ means that

lim_{n→∞} \frac{x_n}{y_n} = 0.

2. The notation x_n = O(y_n) as n → ∞ means that the sequence x_n/y_n remains bounded as n → ∞.
This indicates that a_n = o(n^{-1/2}) and b_n = o(n^{-1/2}) as n → ∞, and the conclusion is then that the sequences {a_n}_{n=1}^∞ and {b_n}_{n=1}^∞ converge to zero at a faster rate than n^{-1/2}. To emphasize the fact that these representations are not unique, we can also conclude that a_n = o(n^{-1/4}) and a_n = o(n^{-1/256}) as n → ∞ as well, with similar conclusions for the sequence {b_n}_{n=1}^∞. Note finally that any sequence that converges to zero, including the sequences {a_n}_{n=1}^∞ and {b_n}_{n=1}^∞, is also o(1) as n → ∞.
The main tool that we have encountered for deriving asymptotic expansions
is given by Theorem 1.13 (Taylor), which provided fairly specific forms of the
error terms in the expansion. These error terms can also be written in terms
of the asymptotic order notation to provide a simple asymptotic form of the
errors.
Theorem 1.15. Let f be a function that has p + 1 bounded and continuous
derivatives in the interval (x, x + δ). Then
f(x + δ) = \sum_{k=0}^{p} \frac{δ^k f^{(k)}(x)}{k!} + E_p(x, δ),

where E_p(x, δ) = o(δ^p) and E_p(x, δ) = O(δ^{p+1}) as δ → 0.
Proof. We will prove that E_p(x, δ) = O(δ^{p+1}) as δ → 0. The fact that E_p(x, δ) = o(δ^p) is proven in Exercise 34. From Theorem 1.13 we have that

E_p(x, δ) = \frac{δ^{p+1} f^{(p+1)}(ξ)}{(p+1)!},

for some ξ ∈ (x, x + δ). Hence, the sequence E_p(x, δ)/δ^{p+1} has the form

\frac{f^{(p+1)}(ξ)}{(p+1)!},

which depends on δ through the value of ξ. The assumption that f has p + 1 bounded and continuous derivatives in the interval (x, x + δ) ensures that this sequence remains bounded for all ξ ∈ (x, x + δ). Hence it follows from Definition 1.7 that E_p(x, δ) = O(δ^{p+1}) as δ → 0.
Example 1.22. Consider the asymptotic expansions developed in Example 1.17, which considered approximating the function exp(δ) as δ → 0. Theorem 1.15 can be applied to these results to conclude that exp(δ) = 1 + δ + O(δ²), exp(δ) = 1 + δ + (1/2)δ² + O(δ³), and exp(δ) = 1 + δ + (1/2)δ² + (1/6)δ³ + O(δ⁴), as δ → 0. This allows us to easily evaluate the asymptotic properties of the error sequences. In particular, the asymptotically most accurate approximation has an error term that converges to zero at the same rate as δ⁴. Alternatively, we could also apply Theorem 1.15 to these approximations to conclude that exp(δ) = 1 + δ + o(δ), exp(δ) = 1 + δ + (1/2)δ² + o(δ²), and

exp(δ) = 1 + δ + (1/2)δ² + (1/6)δ³ + o(δ³),

as δ → 0. Hence, the asymptotically most accurate approximation considered here has an error term that converges to 0 at a rate faster than δ³.
Example 1.23. Consider the asymptotic expansions developed in Example 1.18 that approximated the standard normal distribution function near zero. The first and second-order approximations coincide in this case, so that we can conclude using Theorem 1.15 that Φ(δ) = 1/2 + δ(2π)^{-1/2} + O(δ³) and Φ(δ) = 1/2 + δ(2π)^{-1/2} + o(δ²) as δ → 0. The third order approximation has the forms Φ(δ) = 1/2 + δ(2π)^{-1/2} − (1/6)δ³(2π)^{-1/2} + O(δ⁴) and Φ(δ) = 1/2 + δ(2π)^{-1/2} − (1/6)δ³(2π)^{-1/2} + o(δ³) as δ → 0.
There are other methods besides the Taylor expansion which can also be used
to generate asymptotic expansions. A particular method that is useful for
approximating integral functions of a certain form is based on Theorem A.4.
Integral functions with an exponential type form often fall into this category,
and the normal integral is a particularly interesting example.
Example 1.24. Consider the problem of approximating the tail probability function of a standard normal distribution given by

Φ̄(z) = \int_z^∞ φ(t) dt,

for large values of z, or as z → ∞. To apply integration by parts, first note that from Definition 1.6 it follows that φ′(t) = −H_1(t)φ(t) = −tφ(t), or equivalently φ(t) = −φ′(t)/t. Therefore

Φ̄(z) = −\int_z^∞ t^{-1} φ′(t) dt.
A single application of Theorem A.4 yields

−\int_z^∞ t^{-1} φ′(t) dt = lim_{t→∞} [−t^{-1}φ(t)] + z^{-1}φ(z) − \int_z^∞ t^{-2} φ(t) dt = z^{-1}φ(z) − \int_z^∞ t^{-2} φ(t) dt.   (1.23)

The integral on the right hand side of Equation (1.23) can be rewritten as

−\int_z^∞ t^{-2} φ(t) dt = \int_z^∞ t^{-3} φ′(t) dt,

so that

E_2(z) = 3z^{-5} − \frac{1}{φ(z)} \int_z^∞ 15 t^{-6} φ(t) dt.
Now

z^5 E_2(z) = 3 − \frac{z^5}{φ(z)} \int_z^∞ 15 t^{-6} φ(t) dt.

The first term is bounded, and noting that φ(t) is a decreasing function for t > 0 it follows that

\frac{z^5}{φ(z)} \int_z^∞ 15 t^{-6} φ(t) dt ≤ \frac{z^5}{φ(z)} \int_z^∞ 15 t^{-6} φ(z) dt = z^5 \int_z^∞ 15 t^{-6} dt = 3.
Therefore z 5 E2 (z) remains bounded and positive for all z > 0, and it follows
that E2 (z) = O(z −5 ) as z → ∞. This process can be iterated further by
applying integration by parts to the error term E2 (z) which will result in
an error term that is O(z −7 ) as z → ∞. Barndorff-Nielsen and Cox (1989)
point out several interesting properties of the resulting asymptotic expansion,
including the fact that if the process is continued the resulting sequence has
alternating signs, and that each successive approximation provides a lower
or upper bound for the true value Φ̄(z). Moreover, the infinite sequence that
results from continuing the expansion indefinitely is divergent when z is fixed.
See Example 3.1 of Barndorff-Nielsen and Cox (1989) for further details.
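Carrying the integration by parts in Example 1.24 one step further yields the three-term approximation Φ̄(z) ≈ φ(z)(z^{-1} − z^{-3} + 3z^{-5}). The Python sketch below, given only as an illustration, compares the one-, two-, and three-term approximations with the exact tail probability; in line with the remarks above, the successive approximations alternate between overshooting and undershooting Φ̄(z).

import math

def phi(z):
    return math.exp(-z ** 2 / 2.0) / math.sqrt(2.0 * math.pi)

def Phi_bar(z):
    # Exact upper tail probability through the complementary error function.
    return 0.5 * math.erfc(z / math.sqrt(2.0))

for z in [2.0, 3.0, 5.0]:
    one = phi(z) / z
    two = phi(z) * (1.0 / z - 1.0 / z ** 3)
    three = phi(z) * (1.0 / z - 1.0 / z ** 3 + 3.0 / z ** 5)
    print(z, Phi_bar(z), one, two, three)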
More general theorems on the relationship between integration and the asymp-
totic order notation can be developed as well. See, for example, Section 1.1 of
Erdélyi (1956). It is important to note that it is generally not permissible to
exchange a derivative and an asymptotic order relation, though some results
are possible if additional assumptions can be made. For example, the following
result is based on the development of Section 7.3 of De Bruijn (1958), which
contains a proof of the result.
Theorem 1.17. Let g be a real function that is integrable over a finite interval and define

G(t) = \int_0^t g(x) dx.

If g is non-decreasing and G(t) ∼ (α + 1)^{-1} t^{α+1} as t → ∞, then g(t) ∼ t^α as t → ∞.

Example 1.25. Consider the function G(t) = t³ + t² and note that G(t) ∼ t³ as t → ∞. Knowing the exact form of G(t) in this case allows us to compute the derivative using direct calculations to be g(t) = 3t² + 2t, so that g(t) ∼ 3t² as t → ∞. Therefore, differentiating the asymptotic rate is permissible here. In fact, if we did not know the exact form of g(t), but knew that g is a non-decreasing function, the same rate could be obtained from Theorem 1.17. Note also that G(t) = O(t³) and g(t) = O(t²) here as t → ∞.
Example 1.26. Consider the function G(t) = t^{1/2} + sin(t) and note that G(t) = O(t^{1/2}) as t → ∞. Direct differentiation yields g(t) = (1/2)t^{-1/2} + cos(t), but in this case it is not true that g(t) = O((1/2)t^{-1/2}) as t → ∞. Indeed, note that

\frac{(1/2)t^{-1/2} + cos(t)}{t^{-1/2}} = 1/2 + t^{1/2} cos(t)

does not remain bounded as t → ∞. The reason that differentiation is not applicable here is due to the cyclic nature of G(t). As t → ∞ the t^{1/2} term dominates the sin(t) term in G(t), so this periodic pattern is damped out in the limit. However, the t^{-1/2} term, which converges to 0, in g(t) is dominated by the cos(t) term as t → ∞, so the periodic nature of the function results. Note that Theorem 1.17 is not applicable in this case as g(t) is not non-decreasing.
Proof. We will prove the first result, leaving the proofs of the remaining results
as Exercise 37. Definition 1.7 implies that
lim_{n→∞} \frac{a_n}{b_n} = 0,

and

lim_{n→∞} \frac{c_n}{d_n} = 0.
Theorem 1.18 essentially yields two types of results. First, one can observe the
multiplicative effect of the asymptotic behavior of the sequences. Second, one
can also observe the dominating effect of sequences that have o-type behavior
over those with O-type behavior, in that the product of a sequence with o-
type behavior with a sequence that has O-type behavior yields a sequence
with o-type behavior. The reason for this dominance comes from the fact that
the product of a bounded sequence with a sequence that converges to zero,
also converges to zero.
Returning to the discussion on the asymptotic expansion for the product of
the functions f (x)g(y), it is now clear from Theorem 1.18 that the form of the
asymptotic expansion for f(x)g(y) can be written as

f(x)g(y) = d_0(x)d_0′(y) + O(n^{-1/2}) + O(n^{-1/2}) + O(n^{-1}).

The next step in simplifying this asymptotic expansion is to consider the behavior of the sum of the sequences that are O(n^{-1/2}) and O(n^{-1}) as n → ∞. Define error terms E_1(n), E_1′(n) and E_2(n) such that E_1(n) = O(n^{-1/2}), E_1′(n) = O(n^{-1/2}) and E_2(n) = O(n^{-1}) as n → ∞. Then

n^{1/2}[E_1(n) + E_1′(n) + E_2(n)] = n^{1/2}E_1(n) + n^{1/2}E_1′(n) + n^{1/2}E_2(n).
The fact that the first two sequences in the sum remain bounded for all n ∈ N follows directly from Definition 1.7. Because the third sequence in the sum is O(n^{-1}) as n → ∞, it follows that nE_2(n) remains bounded for all n ∈ N. Because |n^{1/2}E_2(n)| ≤ |nE_2(n)| for all n ∈ N, it follows that n^{1/2}E_2(n) remains bounded for all n ∈ N. Therefore E_1(n) + E_1′(n) + E_2(n) = O(n^{-1/2}) as n → ∞, and the asymptotic expansion for f(x)g(y) can be written as f(x)g(y) = d_0(x)d_0′(y) + O(n^{-1/2}) as n → ∞. Is it possible that the error sequence converges to zero at a faster rate than n^{-1/2}? Such a result cannot be found using the assumptions that are given because the error sequences E_1(n) and E_1′(n) are only guaranteed to remain bounded when multiplied by n^{1/2}, and not by any larger sequence in n. This type of result is generalized in Theorem 1.19.
Theorem 1.19. Consider two real sequences {a_n}_{n=1}^∞ and {b_n}_{n=1}^∞ and positive real numbers k and m where k ≤ m. Then
1. If an = o(n−k ) and bn = o(n−m ) as n → ∞, then an + bn = o(n−k ) as
n → ∞.
2. If an = O(n−k ) and bn = O(n−m ) as n → ∞, then an + bn = O(n−k ) as
n → ∞.
3. If an = O(n−k ) and bn = o(n−m ) as n → ∞, then an + bn = O(n−k ) as
n → ∞.
4. If an = o(n−k ) and bn = O(n−m ) as n → ∞, then an + bn = O(n−k ) as
n → ∞.
Proof. Only the first result will be proven, leaving the proofs of the remaining results as the subject of Exercise 38. Suppose a_n = o(n^{-k}) and b_n = o(n^{-m}) as n → ∞ and consider the sequence n^k(a_n + b_n). Because a_n = o(n^{-k}) it follows that n^k a_n → 0 as n → ∞. Similarly, n^k b_n → 0 as n → ∞ due to the fact that |n^k b_n| ≤ |n^m b_n| → 0 as n → ∞. It follows that n^k(a_n + b_n) → 0 as n → ∞, which yields the result.
Example 1.27. In Example 1.20 it was established that the density of Z_n = n^{1/2}[X̄_n − µ(α, β)]/σ(α, β), where X̄_n is the sample mean from a sample of size n from a Beta(α, β) distribution, has asymptotic expansion

f(x) = φ(x) − (1/6) n^{-1/2} φ(x) H_3(x) κ_3(α, β) + R_2(x, n).

It will be shown later that R_2(x, n) = O(n^{-1}) as n → ∞. In some applications κ_3(α, β) is not known exactly and is replaced by a sequence κ̂_3(α, β) where κ̂_3(α, β) = κ_3(α, β) + O(n^{-1/2}) as n → ∞. Theorems 1.18 and 1.19 can be employed to yield

f̂(x) = φ(x) − (1/6) n^{-1/2} φ(x) H_3(x) κ̂_3(α, β) + O(n^{-1})
     = φ(x) − (1/6) n^{-1/2} φ(x) H_3(x) [κ_3(α, β) + O(n^{-1/2})] + O(n^{-1})
     = φ(x) − (1/6) n^{-1/2} φ(x) H_3(x) κ_3(α, β) + O(n^{-1}),   (1.24)
as n → ∞. Therefore, it is clear that replacing κ3 (α, β) with κ̂3 (α, β) does not
change the asymptotic order of the error in the asymptotic expansion. That
is |f̂(x) − f(x)| = O(n^{-1}) as n → ∞, for a fixed value of x ∈ R. Note that it is not proper to conclude that f̂(x) = f(x), even though both functions have asymptotic expansions of the form φ(x) − (1/6)n^{-1/2}φ(x)H_3(x)κ_3(α, β) + O(n^{-1}). It is clear from the development given in Equation (1.24) that the error terms of the two expansions differ, even though they are both O(n^{-1}) as n → ∞.
The final result of this section provides a more accurate version of Stirling's approximation to factorials given in Theorem 1.8 by specifying the asymptotic behavior of the error of this approximation.
Theorem 1.20. n! = n^n \exp(−n)(2nπ)^{1/2}[1 + O(n^{-1})] as n → ∞.
For a proof of Theorem 1.20 see Example 3.5 of Barndorff-Nielsen and Cox
(1989). The theory of asymptotic expansions and divergent series is far more
expansive than has been presented in this brief overview. The material pre-
sented in this section is sufficient for understanding the expansion theory used
in the rest of this book. Several book length treatments of this topic can be
consulted for further information. These include Barndorff-Nielsen and Cox
(1989), Copson (1965), De Bruijn (1958), Erdélyi (1956), and Hardy (1949).
Some care must be taken when consulting some references on asymptotic ex-
pansions as many presentations are for analytic functions in the complex do-
main. The theoretical properties of asymptotic expansions for these functions
can differ greatly in some cases from those for real functions.
As presented, this method is somewhat ad hoc, and we have not provided any
general guidelines as to when the method is applicable and what happens in
cases such as when the inverse is not unique. For the problems encountered in
this book the method is generally reliable. For a rigorous justification of the
method see De Bruijn (1958).
Example 1.28. Example 1.20 showed that the density of Zn = n1/2 (X̄n −
µ(α, β))/σ(α, β), where X̄n is computed from a sample of size n from a
Beta(α, β) distribution, has asymptotic expansion
as n → ∞, where it is noted that from Definition 1.6 it follows that the integral
of −H3 (z)φ(z), which is the third derivative of the standard normal density,
is given by H2 (z)φ(z), which is the second derivative of the standard normal
density. We assume that the integration of the error term with respect to z
does not change the order of the error term. This actually follows from the
fact that the error term can be shown to be uniform in z. Denote the αth
quantile of Fn as fα,n and assume that fα,n has an asymptotic expansion of
the form fα,n = v0 (α) + n−1/2 v1 (α) + O(n−1 ), as n → ∞. To obtain v0 (α)
and v1 (α) set Fn (fα,n ) = α + O(n−1 ), which is the property that fα,n should
have to be the αth quantile of Fn up to order O(n−1 ), as n → ∞. Therefore,
it follows that F_n[v_0(α) + n^{-1/2}v_1(α) + O(n^{-1})] = α + O(n^{-1}), or equivalently

Φ[v_0(α) + n^{-1/2}v_1(α) + O(n^{-1})] − (1/6)n^{-1/2}φ[v_0(α) + n^{-1/2}v_1(α) + O(n^{-1})] × H_2[v_0(α) + n^{-1/2}v_1(α) + O(n^{-1})]κ_3(α, β) = α + O(n^{-1}),   (1.27)

as n → ∞. Now expand each term in Equation (1.27) using Theorem 1.13 and the related theorems in Section 1.5. Applying Theorem 1.13 to the standard normal distribution function yields

F_n(f_{α,n}) = Φ[v_0(α)] + n^{-1/2}φ[v_0(α)]{v_1(α) − (1/6)κ_3(α, β)H_2[v_0(α)]} + O(n^{-1}),

as n → ∞. Using the notation of this section we have that r_0(α; v_0) = Φ[v_0(α)] and r_1(α; v_0, v_1) = φ[v_0(α)]{v_1(α) − (1/6)κ_3(α, β)H_2[v_0(α)]}. Setting r_0(α; v_0) = Φ[v_0(α)] = α implies that v_0(α) = z_α, the αth quantile of a N(0, 1) distribution. Similarly, setting r_1(α; v_0, v_1) = φ[v_0(α)]{v_1(α) − (1/6)κ_3(α, β)H_2[v_0(α)]} = 0 implies that v_1(α) = (1/6)κ_3(α, β)H_2[v_0(α)] = (1/6)κ_3(α, β)H_2(z_α). Therefore, an asymptotic expansion for the αth quantile of F_n is given by

f_{α,n} = v_0(α) + n^{-1/2}v_1(α) + O(n^{-1}) = z_α + n^{-1/2}(1/6)κ_3(α, β)H_2(z_α) + O(n^{-1}),
as n → ∞.
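The expansion derived in Example 1.28 can be checked by simulation. The following R sketch is not part of the original example: it takes κ_3(α, β) to be the standardized skewness of the Beta(α, β) distribution and compares the empirical α-quantile of Z_n with both z_α and the corrected expansion; the parameter values are illustrative choices.

## compare the empirical quantile of Z_n with z_alpha and with
## z_alpha + n^{-1/2} (1/6) kappa_3 H_2(z_alpha), where H_2(z) = z^2 - 1
set.seed(42)
a <- 2; b <- 5; n <- 10; p <- 0.95; reps <- 100000
mu     <- a / (a + b)
sigma  <- sqrt(a * b / ((a + b)^2 * (a + b + 1)))
kappa3 <- 2 * (b - a) * sqrt(a + b + 1) / ((a + b + 2) * sqrt(a * b))  # Beta skewness
z <- replicate(reps, sqrt(n) * (mean(rbeta(n, a, b)) - mu) / sigma)
c(empirical = unname(quantile(z, p)),
  normal    = qnorm(p),
  expansion = qnorm(p) + kappa3 * (qnorm(p)^2 - 1) / (6 * sqrt(n)))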
It should be noted that if closed forms for the derivatives of f are known,
then it can be easier to derive an asymptotic expansion for the inverse of a
function f (x) using Theorem 1.13 or Theorem 1.15 directly. The derivatives
of the inverse of f (x) are required for this approach. The following result from
calculus can be helpful with this calculation.
Theorem 1.21. Assume that g is a strictly increasing and continuous real
function on an interval [a, b] and let h be the inverse of g. If the derivative
of g exists and is non-zero at a point x ∈ (a, b) then the derivative of h also
exists and is non-zero at the corresponding point y = g(x) and

(d/dy) h(y) = [(d/dx) g(x)]^{−1} |_{x=h(y)}.
A proof of Theorem 1.21 can be found in Section 6.20 of Apostol (1967). Note
the importance of the monotonicity condition in Theorem 1.21, which ensures
that the function g has a unique inverse. Further, the restriction that g is
strictly increasing implies that the derivative of the inverse will be positive.
Example 1.29. Consider the standard normal distribution function Φ(z), and suppose we wish to obtain an asymptotic expansion for the standard normal quantile function Φ^{−1}(α) for values of α near 1/2. Theorem 1.15 implies that

Φ^{−1}(α + δ) = Φ^{−1}(α) + δ (d/dα) Φ^{−1}(α) + O(δ^2),

as δ → 0. Noting that Φ(t) is monotonically increasing and that Φ′(t) ≠ 0 for all t ∈ R, we can apply Theorem 1.21 to Φ^{−1}(α) to find that

(d/dα) Φ^{−1}(α) = [(d/dz) Φ(z)]^{−1} |_{z=Φ^{−1}(α)} = 1/φ(z_α).
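The result in Example 1.29 is easy to verify numerically. The short R check below (not from the text) compares a central difference approximation of the derivative of Φ^{−1}(α) with 1/φ(z_α) at an arbitrary value of α.

## numerical check that d/d(alpha) qnorm(alpha) equals 1 / dnorm(qnorm(alpha))
alpha <- 0.7
h <- 1e-6
numeric_deriv <- (qnorm(alpha + h) - qnorm(alpha - h)) / (2 * h)
exact_deriv   <- 1 / dnorm(qnorm(alpha))
c(numeric = numeric_deriv, exact = exact_deriv)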
1.7.1 Exercises
1. Let {x_n}_{n=1}^∞ be a sequence of real numbers defined by

x_n = −1 if n = 1 + 3(k − 1) for some k ∈ N,
       0 if n = 2 + 3(k − 1) for some k ∈ N,
       1 if n = 3 + 3(k − 1) for some k ∈ N.

Compute lim inf_{n→∞} x_n and lim sup_{n→∞} x_n. Determine if the limit of x_n as n → ∞ exists.
2. Let {x_n}_{n=1}^∞ be a sequence of real numbers defined by

x_n = n/(n + 1) − (n + 1)/n,

for all n ∈ N. Compute lim inf_{n→∞} x_n and lim sup_{n→∞} x_n. Determine if the limit of x_n as n → ∞ exists.
3. Let {x_n}_{n=1}^∞ be a sequence of real numbers defined by x_n = n^{(−1)^n − n} for all n ∈ N. Compute lim inf_{n→∞} x_n and lim sup_{n→∞} x_n. Determine if the limit of x_n as n → ∞ exists.
4. Let {x_n}_{n=1}^∞ be a sequence of real numbers defined by x_n = n2^{−n}, for all n ∈ N. Compute lim inf_{n→∞} x_n and lim sup_{n→∞} x_n. Determine if the limit of x_n as n → ∞ exists.
5. Each of the sequences given below converges to zero. Specify the smallest value of n_ε so that |x_n| < ε for every n > n_ε as a function of ε.
a. x_n = n^{−2}
b. x_n = n(n + 1)^{−1} − 1
c. x_n = [log(n + 1)]^{−1}
d. x_n = 2(n^2 + 1)^{−1}
6. Let {x_n}_{n=1}^∞ and {y_n}_{n=1}^∞ be sequences of real numbers such that
lim_{n→∞} x_n = x and lim_{n→∞} y_n = y.
b. Prove that
lim_{n→∞} (x_n + y_n) = x + y.
c. Prove that
lim_{n→∞} x_n y_n = xy.
d. Prove that
lim_{n→∞} x_n / y_n = x/y,
where y_n ≠ 0 for all n ∈ N and y ≠ 0.
7. Let {x_n}_{n=1}^∞ and {y_n}_{n=1}^∞ be sequences of real numbers such that x_n ≤ y_n for all n ∈ N. Prove that if the limits of the two sequences exist, then
lim_{n→∞} x_n ≤ lim_{n→∞} y_n.
8. Let {x_n}_{n=1}^∞ and {y_n}_{n=1}^∞ be sequences of real numbers such that
lim_{n→∞} (x_n + y_n) = s and lim_{n→∞} (x_n − y_n) = d.
Prove that
lim_{n→∞} x_n y_n = (1/4)(s^2 − d^2).
9. Find the supremum and infimum limits for each sequence given below.
a. x_n = (−1)^n (1 + n^{−1})
b. x_n = (−1)^n
c. x_n = (−1)^n n
d. x_n = n^2 sin^2(nπ/2)
e. x_n = sin(n)
f. x_n = (1 + n^{−1}) cos(nπ)
g. x_n = sin(nπ/2) cos(nπ/2)
h. x_n = (−1)^n n(1 + n)^{−n}
10. Let {x_n}_{n=1}^∞ be a sequence of real numbers.
a. Prove that
inf_{n∈N} x_n ≤ lim inf_{n→∞} x_n ≤ lim sup_{n→∞} x_n ≤ sup_{n∈N} x_n.
b. Prove that
lim inf_{n→∞} x_n = lim sup_{n→∞} x_n = l,
if and only if
lim_{n→∞} x_n = l.
11. Let {x_n}_{n=1}^∞ and {y_n}_{n=1}^∞ be sequences of real numbers such that x_n ≤ y_n for all n ∈ N. Prove that
lim inf_{n→∞} x_n ≤ lim inf_{n→∞} y_n.
and
lim sup_{n→∞} y_n < ∞.
Then prove that
lim inf_{n→∞} x_n + lim inf_{n→∞} y_n ≤ lim inf_{n→∞} (x_n + y_n),
and
lim sup_{n→∞} (x_n + y_n) ≤ lim sup_{n→∞} x_n + lim sup_{n→∞} y_n.
and
0 < lim sup_{n→∞} y_n < ∞.
Prove that
lim sup_{n→∞} x_n y_n ≤ lim sup_{n→∞} x_n · lim sup_{n→∞} y_n.
b. Compute
lim_{n→∞} ∫_{−∞}^{∞} f_n(x) dx.
Does this match the result derived above?
c. State whether Theorem 1.11 applies to this case, and use it to explain the results you found.
a. Prove that
lim_{n→∞} f_n(x) = δ{x; (0, 1)}
for all x ∈ R, and hence conclude that
∫_{−∞}^{∞} lim_{n→∞} f_n(x) dx = 1.
b. Compute
lim_{n→∞} ∫_{−∞}^{∞} f_n(x) dx.
Does this match the result you found above?
c. State whether Theorem 1.11 applies to this case, and use it to explain the results you found above.
18. Let g(x) = exp(−|x|) and define a sequence of functions {f_n(x)}_{n=1}^∞ as f_n(x) = g(x)δ{|x|; (n, ∞)}, for all n ∈ N.
a. Calculate
f(x) = lim_{n→∞} f_n(x),
for each fixed x ∈ R.
b. Calculate
lim_{n→∞} ∫_{−∞}^{∞} f_n(x) dx and ∫_{−∞}^{∞} f(x) dx.
Is the exchange of the limit and the integral justified in this case? Why or why not?
19. Define a sequence of functions {f_n(x)}_{n=1}^∞ as f_n(x) = n^2 x(1 − x)^n for x ∈ R and for all n ∈ N.
a. Calculate
f(x) = lim_{n→∞} f_n(x),
for each fixed x ∈ R.
b. Calculate
lim_{n→∞} ∫_{−∞}^{∞} f_n(x) dx and ∫_{−∞}^{∞} f(x) dx.
Is the exchange of the limit and the integral justified in this case? Why or why not?
24. Prove Theorem 1.13 using induction. That is, assume that
E_1(x, δ) = ∫_{x}^{x+δ} (x + δ − t) f″(t) dt,
38. Prove the remaining three results of Theorem 1.19. That is, consider two real sequences {a_n}_{n=1}^∞ and {b_n}_{n=1}^∞ and positive integers k and m where k ≤ m. Then
a. Suppose a_n = O(n^{−k}) and b_n = O(n^{−m}) as n → ∞. Then prove that a_n + b_n = O(n^{−k}) as n → ∞.
b. Suppose a_n = O(n^{−k}) and b_n = o(n^{−m}) as n → ∞. Then prove that a_n + b_n = O(n^{−k}) as n → ∞.
c. Suppose a_n = o(n^{−k}) and b_n = O(n^{−m}) as n → ∞. Then prove that a_n + b_n = O(n^{−k}) as n → ∞.
39. For each specified pair of functions G(t) and g(t), determine the value of α and c so that G(t) ∼ ct^{α−1} as t → ∞ and determine if there is a function g(t) ∼ dt^{α} for some d as t → ∞, where c and d are real constants. State whether Theorem 1.17 is applicable in each case.
a. G(t) = 2t^4 + t
b. G(t) = t + t^{−1}
c. G(t) = t^2 + cos(t)
d. G(t) = t^{1/2} + cos(t)
40. Consider a real function f that can be approximated with the asymptotic expansion

f_n(x) = πx + (1/2)n^{−1/2}π^2 x^{1/2} − (1/3)n^{−1}π^3 x^{1/4} + O(n^{−3/2}),

as n → ∞, uniformly in x, where x is assumed to be positive. Use the first method demonstrated in Section 1.6 to find an asymptotic expansion with error O(n^{−3/2}) as n → ∞ for x_a where f(x_a) = a + O(n^{−3/2}) as n → ∞.
41. Consider the problem of approximating the function sin(x) and its inverse
for values of x near 0.
1.7.2 Experiments
1. Refer to the three approximations derived for each of the four functions in
Exercise 26. For each function use R to construct a line plot of the function,
along with the three approximations versus δ on a single plot. The lines
corresponding to each approximation and the original function should be
different, that is, the plots should look like the one given in Figure 1.3.
You may need to try several ranges of δ to find one that provides a good
indication of the behavior of each approximation. What do these plots
suggest about the errors of the three approximations?
2. Refer to the three approximations derived for each of the four functions in
Exercise 26. For each function use R to construct a line plot of the error
terms E1 (x, δ), E2 (x, δ) and E3 (x, δ) versus δ on a single plot. The lines
corresponding to each error function should be different so that the plots
should look like the one given in Figure 1.4. What do these plots suggest
about the errors of the three approximations?
3. Refer to the three approximations derived for each of the four functions in
Exercise 26. For each function use R to construct a line plot of the error
terms E2 (x, δ) and E3 (x, δ) relative to the error term E1 (x, δ). That is, for
each function, plot E2 (x, δ)/E1 (x, δ) and E3 (x, δ)/E1 (x, δ) versus δ. The
lines corresponding to each relative error function should be different. What
do these plots suggest about the relative error rates?
4. Consider the approximation for the normal tail integral Φ̄(z) studied in Example 1.24 given by

Φ̄(z) ≈ z^{−1}φ(z)(1 − z^{−2} + 3z^{−4} − 15z^{−6} + 105z^{−8}).

A slight rearrangement of the approximation implies that

zΦ̄(z)/φ(z) ≈ 1 − z^{−2} + 3z^{−4} − 15z^{−6} + 105z^{−8}.

Define S_1(z) = 1 − z^{−2}, S_2(z) = 1 − z^{−2} + 3z^{−4}, S_3(z) = 1 − z^{−2} + 3z^{−4} − 15z^{−6} and S_4(z) = 1 − z^{−2} + 3z^{−4} − 15z^{−6} + 105z^{−8}, which are the successive approximations of zΦ̄(z)/φ(z). Using R, compute zΦ̄(z)/φ(z), S_1(z), S_2(z), S_3(z), and S_4(z) for z = 1, . . . , 10. Comment on which approximation performs best for each value of z and whether the approximations become better as z becomes larger.
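One possible way to carry out Experiment 4 in R is sketched below; the layout of the output is an arbitrary choice, and pnorm(z, lower.tail = FALSE) is used for the tail integral Φ̄(z).

## successive approximations of z * (tail integral) / phi(z) for z = 1, ..., 10
z   <- 1:10
lhs <- z * pnorm(z, lower.tail = FALSE) / dnorm(z)
S1  <- 1 - z^-2
S2  <- S1 + 3 * z^-4
S3  <- S2 - 15 * z^-6
S4  <- S3 + 105 * z^-8
round(cbind(z, lhs, S1, S2, S3, S4), 6)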
CHAPTER 2

Random Variables and Characteristic Functions

2.1 Introduction
This chapter begins with a short review of probability measures and random
variables. A sound formal understanding of random variables is crucial to have
a complete understanding of much of the asymptotic theory that follows. In-
equalities are also very useful in asymptotic theory, and the second section
of this chapter reviews several basic inequalities for both probabilities and
expectations, as well as some more advanced results that will have specific
applications later in the book. The next section develops some limit theory
that is useful for working with probabilities of sequences of events, including
the Borel-Cantelli lemmas. We conclude the chapter with a review of moment
generating functions, characteristic functions, and cumulant generating func-
tions. Moment generating functions and characteristic functions are often a
useful surrogate for distributions themselves. While the moment generating
function may be familiar to many readers, the characteristic function may
not, due to the need for some complex analysis. However, the extra effort re-
quired to use the characteristic function is worthwhile as many of the results
presented later in the book are more useful when derived using characteristic
functions.
2.2 Probability Measures and Random Variables

The possible outcomes of an experiment are collected together in a set Ω called the sample space. Subsets of Ω are called events, and are taken to be members of a collection of subsets of Ω called a σ-field.
called a σ-field.
Definition 2.1. Let Ω be a set, then F is a σ-field of subsets of Ω if F has
the following properties:
1. ∅ ∈ F and Ω ∈ F.
2. A ∈ F implies that Ac ∈ F.
3. If A_i ∈ F for i ∈ N then

⋃_{i=1}^{∞} A_i ∈ F and ⋂_{i=1}^{∞} A_i ∈ F.
In some cases a σ-field will be generated from a sample space Ω. This σ-field
is the smallest σ-field that contains the events in Ω. The term smallest in
this case means that this σ-field is a subset of any other σ-field that contains
the events in Ω. For further information about σ-fields, their generators, and
their use in probability theory see Section 1.2 of Gut (2005) or Section 2.1 of
Pollard (2002).
The experiment selects an outcome in Ω according to a probability measure
P , which is a set function that maps F to R.
Definition 2.2. Let Ω be a sample space and F be a σ-algebra of subsets of
Ω. A set function P : F → R is a probability measure if P satisfies the axioms
of probability set forth by Kolmogorov (1933). The axioms are:
1. P(A) ≥ 0 for every A ∈ F.
2. P(Ω) = 1.
3. If {A_i}_{i=1}^{∞} is a sequence of mutually exclusive events in F, then
P(⋃_{i=1}^{∞} A_i) = Σ_{i=1}^{∞} P(A_i).
The term mutually exclusive refers to the property that A_i and A_j are disjoint, or that A_i ∩ A_j = ∅ for all i ≠ j.
From Definition 2.2 it is clear that there are three elements that are required
to assign a set of probabilities to outcomes from an experiment: the sample
space Ω, which identifies the possible outcomes of the experiment; the σ-field
F, which identifies the events in Ω for which the probability measure is able to compute probabilities; and the probability measure P , which assigns
probabilities to the events in F. These elements are often collected together
in a triple (Ω, F, P ), called a probability space.
When the sample space of an experiment is R, or an interval subset of R, the
σ-field used to define the probability space is usually generated from the open
subsets of the sample space.
Definition 2.3. Let Ω be a sample space. The Borel σ-field corresponding to
Ω is the σ-field generated by the collection of open subsets of Ω. The Borel
σ-field generated by Ω is denoted by B{Ω}.
In the case where Ω = R, it can be shown that B{R} can be generated from
simpler collections of events. In particular, B{R} can be generated from the
collection of intervals {(−∞, b] : b ∈ R}, with a similar result when Ω is a
subset of R such as Ω = [0, 1]. Other simple collections of intervals can also
be used to generate B{R}. See Section 3.3 of Gut (2005) for further details.
The main purpose of this section is to introduce the concept of a random
variable. Random variables provide a convenient way of referring to events
within a sample space that often have simple interpretations with regard to
the underlying experiment. Intuitively, random variables are often thought of
as mathematical variables that are subject to random behavior. This informal
way of thinking about random variables may be helpful to understand cer-
tain concepts in probability theory, but a true understanding, especially with
regard to statistical limit theorems, comes from the formal mathematical def-
inition below.
Definition 2.4. Let (Ω, F, P ) be a probability space, X be a function that
maps Ω to R, and B be a σ-algebra of subsets of R. The function X is a
random variable if X −1 (B) = {ω ∈ Ω : X(ω) ∈ B} ∈ F, for all B ∈ B.
Note that according to Definition 2.4, there is actually nothing random about
a random variable. When the experiment is performed an element of Ω is
chosen at random according to the probability measure P . The role of the
random variable is to map this outcome to the real line. Therefore, the output
of the random variable is random, but the mapping itself is not. The restriction
that X^{−1}(B) = {ω ∈ Ω : X(ω) ∈ B} ∈ F for all B ∈ B ensures that the probability of the inverse mapping can be calculated.
Events written in terms of random variables are interpreted by selecting out-
comes from the sample space Ω that satisfy the event. That is, if A ∈ B then
the event {X ∈ A} is equivalent to the event that consists of all outcomes
ω ∈ Ω such that X(ω) ∈ A. This allows for the computation of probabilities
of events written in terms of random variables. That is, P (X ∈ A) = P (ω :
X(ω) ∈ A), where it is assumed that the event will be empty when A is not
a subset of the range of the function X. Random variables need not be one-
to-one functions, but in the case where X is a one-to-one function and a ∈ R
the computation simplifies to P (X = a) = P [X −1 (a)].
Example 2.1. Consider the simple experiment where a fair coin is flipped
three times, and the sequence of flips is observed. The elements of the sample
space will be represented by triplets containing the symbols Hi , signifying
that the ith flip is heads, and Ti , signifying that the ith flip is tails. The
order of the symbols in the triplet signify the order in which the outcomes
are observed. For example, the event H1 T2 H3 corresponds to the event that
heads was observed first, then tails, then heads again. The sample space for
this experiment is given by
Ω = {T1 T2 T3 , T1 T2 H3 , T1 H2 T3 , H1 T2 T3 ,
H1 H2 T3 , H1 T2 H3 , T1 H2 H3 , H1 H2 H3 }.
Because the coin is fair, the probability measure P on this sample space is uniform so that each outcome has a probability of 1/8. A suitable σ-field for Ω is given by the power set of Ω. Now consider a random variable X defined as

X(ω) = 0 if ω = {T_1T_2T_3},
       1 if ω ∈ {T_1T_2H_3, T_1H_2T_3, H_1T_2T_3},
       2 if ω ∈ {H_1H_2T_3, H_1T_2H_3, T_1H_2H_3},
       3 if ω = {H_1H_2H_3}.

Hence, the random variable X counts the number of heads in the three flips of the coin. For example, the event that two heads are observed in the three flips of the coin can be represented by the event {X = 2}. The probability of this event is computed by considering all of the outcomes in the original sample space that satisfy this event. That is

P(X = 2) = P[ω ∈ Ω : X(ω) = 2] = P(H_1H_2T_3, H_1T_2H_3, T_1H_2H_3) = 3/8.
Because the inverse image of each possible value of X is in F, we are always
able to compute the probability of the corresponding inverse image.
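The calculation in Example 2.1 can be reproduced directly in R by enumerating the sample space; the following sketch is not part of the original example and the object names are arbitrary.

## enumerate the sample space of three fair coin flips, define X as the
## number of heads, and compute P(X = 2) from the uniform measure on Omega
omega <- expand.grid(flip1 = c("H", "T"), flip2 = c("H", "T"), flip3 = c("H", "T"))
prob  <- rep(1 / nrow(omega), nrow(omega))  # each of the 8 outcomes has probability 1/8
X     <- rowSums(omega == "H")              # the random variable maps outcomes to the real line
sum(prob[X == 2])                           # should equal 3/8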
Example 2.2. Consider a countable sample space of the form
Ω = {H1 , T1 H2 , T1 T2 H3 , T1 T2 T3 H4 , . . . , },
that corresponds to the experiment of flipping a coin repeatedly until the
first heads is observed where the notation of Example 2.1 has been used to
represent the possible outcomes of the experiment. If the coin is fair then the
probability measure P defined as

P(ω) = 1/2 if ω = {H_1},
       2^{−k} if ω = {T_1T_2 · · · T_{k−1}H_k},
       0 otherwise,
The usual σ-field used for random vectors is the Borel sets on Rd , denoted by
B{Rd }.
Example 2.4. Consider a probability space (Ω, F, P) where Ω = (0, 1) × (0, 1) is the unit square and P is a bivariate extension of Lebesgue measure. That is, if R is a rectangle of the form
R = {(ω_1, ω_2) : ω̲_1 ≤ ω_1 ≤ ω̄_1, ω̲_2 ≤ ω_2 ≤ ω̄_2},
for all B ∈ B{R^q}. As with the univariate case, we shall simply call such a function a Borel function, and we get a parallel result to Theorem 2.1.
Theorem 2.2. Let X be a d-dimensional random vector and g : Rd → Rq be
a Borel function. Then g(X) is a q-dimensional random vector.
inf_{n∈N} X_n and sup_{n∈N} X_n
are also random variables. See Section 2.1.1 of Gut (2005) for further details.
2.3 Some Important Inequalities
When events are not necessarily mutually exclusive, the Bonferroni Inequality
is useful for obtaining an upper bound on the probability of the union of the
events. In the special case were the events are mutually exclusive, Axiom 3 of
Definition 2.2 applies and an equality results.
Theorem 2.4 (Bonferroni). Let {A_i}_{i=1}^n be a sequence of events from a probability space (Ω, F, P). Then

P(⋃_{i=1}^{n} A_i) ≤ Σ_{i=1}^{n} P(A_i).
Markov’s Theorem is a general result that places a bound on the tail proba-
bilities of a random variable using the fact that a certain set of moments of
the random variable are finite. Essentially the result states that only so much
probability can be in the tails of the distribution of a random variable X when
E(|X|r ) < ∞ for some r > 0.
Theorem 2.6 (Markov). Consider a random variable X where E(|X|^r) < ∞ for some r > 0 and let δ > 0. Then P(|X| > δ) ≤ δ^{−r} E(|X|^r).
Proof. Assume for simplicity that X is a continuous random variable with distribution function F. If δ > 0 then

E(|X|^r) = ∫_{−∞}^{∞} |x|^r dF(x)
         = ∫_{{x:|x|≤δ}} |x|^r dF(x) + ∫_{{x:|x|>δ}} |x|^r dF(x)
         ≥ ∫_{{x:|x|>δ}} |x|^r dF(x),

since

∫_{{x:|x|≤δ}} |x|^r dF(x) ≥ 0.

Note that |x|^r ≥ δ^r within the set {x : |x| > δ}, so that

∫_{{x:|x|>δ}} |x|^r dF(x) ≥ ∫_{{x:|x|>δ}} δ^r dF(x) = δ^r P(|X| > δ).
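A simple Monte Carlo illustration of Theorem 2.6 can be carried out in R. The sketch below is not from the text; it uses r = 2 and Exponential(1) data, both arbitrary choices, and confirms that the simulated tail probabilities never exceed the bound δ^{−r}E(|X|^r).

## Markov bound check with r = 2 and X ~ Exponential(1)
set.seed(1)
x <- rexp(100000)
delta <- c(1, 2, 3, 4)
tail_prob <- sapply(delta, function(d) mean(abs(x) > d))
bound     <- delta^-2 * mean(x^2)      # delta^{-r} E(|X|^r) with r = 2
cbind(delta, tail_prob, bound)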
Note that Theorem A.18 is a special case of Theorems 2.8 and 2.9 when r is taken to be one. Theorem 2.9 can be proven using Theorem A.18 and Hölder's Inequality, given below.
Theorem 2.10 (Hölder). Let X and Y be random variables such that E|X|^p < ∞ and E|Y|^q < ∞ where p and q are real numbers such that p^{−1} + q^{−1} = 1. Then

E(|XY|) ≤ [E(|X|^p)]^{1/p} [E(|Y|^q)]^{1/q}.
For proofs of Theorems 2.9 and 2.10, see Section 3.2 of Gut (2005).
A more general result is Jensen’s Inequality, which is based on the properties
of convex functions.
Definition 2.6. Let f be a real function such that f [λx + (1 − λ)y] ≤ λf (x) +
(1 − λ)f (y) for all x ∈ R, y ∈ R and λ ∈ (0, 1), then f is a convex function.
If the inequality is strict then the function is strictly convex. If the function
−f (x) is convex, then the function f (x) is concave.
Theorem 2.11 (Jensen). Let X be a random variable and let f be a convex function. Then f[E(X)] ≤ E[f(X)], provided the expectations exist. For a proof of Theorem 2.11, see Section 5.3 of Fristedt and Gray (1997).
Example 2.5. Let X be a random variable such that E(|X|^r) < ∞, for some r > 0. Let s be a real number such that 0 < s < r. Since r/s > 1 it follows that f(x) = x^{r/s} is a convex function and therefore Theorem 2.11 implies that [E(|X|^s)]^{r/s} ≤ E[(|X|^s)^{r/s}] = E(|X|^r) < ∞, so that it follows that E(|X|^s) < ∞. This establishes the fact that the existence of higher order moments implies the existence of lower order moments.
The following result establishes a bound for the absolute expectation of the
sum of truncated random variables. Truncation is often a useful tool that can
be applied when some of the moments of a random variable do not exist. If the
random variables are truncated at some finite value, then all of the moments
of the truncated random variables must exist, and tools such as Theorem 2.7
can be used.
Theorem 2.12. Let X_1, . . . , X_n be a set of independent and identically distributed random variables such that E(X_1) = 0. Then, for any ε > 0,

|E(Σ_{i=1}^{n} X_i δ{|X_i|; [0, ε]})| ≤ n E(|X_1| δ{|X_1|; (ε, ∞)}).
The following inequality is an analog of Theorem 2.7 for the case when the
random variable of interest is a sum or an average. While the bound has
the same form as Theorem 2.7, the event in the probability concerns a much
stronger event in terms of the maximal value of the random variable. The
result is usually referred to as Kolmogorov’s Maximal Inequality.
Theorem 2.13 (Kolmogorov). Let {X_n}_{n=1}^∞ be a sequence of independent random variables where E(X_n) = 0 and V(X_n) < ∞ for all n ∈ N. Let

S_n = Σ_{i=1}^{n} X_i.

Then, for any ε > 0,

P(max_{i∈{1,...,n}} |S_i| > ε) ≤ ε^{−2} V(S_n).
Proof. The proof provided here is fairly standard for this result, though this
particular version runs most closely along what is shown in Gut (2005) and
Sen and Singer (1993). Let ε > 0 and define S0 ≡ 0 and define a sequence of
events {Ai }ni=0 as
A0 = {|Si | ≤ ε; i ∈ {0, 1, . . . , n}},
and
Ai = {|Sk | ≤ ε; k ∈ {0, 1, . . . , i − 1}} ∩ {|Si | > ε},
for i ∈ {1, . . . , n}. The essential idea in deriving the bound is based on the fact that

{ max_{i∈{1,...,n}} |S_i| > ε } = ⋃_{i=1}^{n} A_i,   (2.2)

where the events in the sequence {A_i}_{i=0}^{n} are mutually exclusive. We now consider S_n to be a random variable that maps some sample space Ω to R. Equation (2.2) implies that ⋃_{i=1}^{n} A_i ⊂ Ω, so that

S_n^2(ω) ≥ S_n^2(ω) δ{ω; ⋃_{i=1}^{n} A_i} = Σ_{i=1}^{n} S_n^2(ω) δ{ω; A_i},

for all ω ∈ Ω, where we have used the fact that the events in the sequence {A_i}_{i=0}^{n} are mutually exclusive. Therefore, Theorem A.16 implies that

E(S_n^2) ≥ Σ_{i=1}^{n} E(S_n^2 δ{A_i}),

where we have suppressed the ω argument in the indicator function. Note that we can write S_n as (S_n − S_i) + S_i and therefore S_n^2 = S_i^2 + 2S_i(S_n − S_i) + (S_n − S_i)^2. Hence

E(S_n^2) ≥ Σ_{i=1}^{n} E{[S_i^2 + 2S_i(S_n − S_i) + (S_n − S_i)^2] δ{A_i}}
        ≥ Σ_{i=1}^{n} E{[S_i^2 + 2S_i(S_n − S_i)] δ{A_i}}
        = Σ_{i=1}^{n} E(S_i^2 δ{A_i}) + 2 Σ_{i=1}^{n} E[S_i(S_n − S_i) δ{A_i}],

where the second inequality is due to the fact that [S_n(ω) − S_i(ω)]^2 ≥ 0 for all ω ∈ Ω. Now note that

S_n − S_i = Σ_{k=i+1}^{n} X_k

is independent of S_i δ{A_i} because the event A_i and the sum S_i depend only on X_1, . . . , X_i. Therefore
Note that any event smaller than this union will not be an upper bound of the sequence. Following similar arguments, the infimum of a sequence of real numbers is defined to be the greatest lower bound. Therefore, the infimum of a sequence of events is the largest event that is contained by all of the events in the sequence. Hence, the infimum of a sequence of events {A_n}_{n=1}^∞ is defined as

inf_{n∈N} A_n = ⋂_{n=1}^{∞} A_n.
These concepts can be combined to define the limit supremum and the limit
infimum of a sequence of events.
Definition 2.7. Let {A_n}_{n=1}^∞ be a sequence of events in a σ-field F generated from a sample space Ω. Then

lim inf_{n→∞} A_n = sup_{n∈N} inf_{k≥n} A_k = ⋃_{n=1}^{∞} ⋂_{k=n}^{∞} A_k,

and

lim sup_{n→∞} A_n = inf_{n∈N} sup_{k≥n} A_k = ⋂_{n=1}^{∞} ⋃_{k=n}^{∞} A_k.
If

lim inf_{n→∞} A_n ≠ lim sup_{n→∞} A_n,

then the limit of the sequence {A_n}_{n=1}^∞ does not exist.
Example 2.6. Consider the probability space (Ω, F, P) where Ω = (0, 1), F = B{(0, 1)}, and the sequence of events {A_n}_{n=1}^∞ is defined by A_n = (1/3 − (3n)^{−1}, 2/3 + (3n)^{−1}) for all n ∈ N. Now

inf_{k≥n} A_k = ⋂_{k=n}^{∞} (1/3 − (3k)^{−1}, 2/3 + (3k)^{−1}) = [1/3, 2/3],
Example 2.7. Consider the probability space (Ω, F, P) where Ω = (0, 1), F = B{(0, 1)}, and the sequence of events {A_n}_{n=1}^∞ is defined by A_n = (1/2 + (−1)^n (1/4), 1) for all n ∈ N. Definition 2.7 implies that

lim inf_{n→∞} A_n = ⋃_{n=1}^{∞} ⋂_{k=n}^{∞} (1/2 + (−1)^k (1/4), 1) = ⋃_{n=1}^{∞} (3/4, 1) = (3/4, 1).
The sequence of events studied in Example 2.6 has a property that is very important in the theory of limits of sequences of events. In that example, A_{n+1} ⊂ A_n for all n ∈ N, which corresponds to a monotonically decreasing sequence of events. The computation of the limits of such sequences is simplified by this structure.
Theorem 2.15. Let {An }∞ n=1 be a sequence of events from a σ-field F of
subsets of a sample space Ω.
Proof. We will prove the first result. The second result is proven in Exercise 7. Let {A_n}_{n=1}^∞ be a sequence of monotonically increasing events from F. That is, A_n ⊂ A_{n+1} for all n ∈ N. Then Definition 2.7 implies

lim inf_{n→∞} A_n = ⋃_{n=1}^{∞} ⋂_{k=n}^{∞} A_k.

Since the sequence is monotonically increasing, ⋂_{k=n}^{∞} A_k = A_n for each n ∈ N, and therefore

lim inf_{n→∞} A_n = ⋃_{n=1}^{∞} A_n.

Similarly, Definition 2.7 implies that

lim sup_{n→∞} A_n = ⋂_{n=1}^{∞} ⋃_{k=n}^{∞} A_k.
Proof. To prove this result break the sequence up into mutually exclusive events and use Definition 2.2. If {A_n}_{n=1}^∞ is a sequence of monotonically increasing events then define B_{n+1} = A_{n+1} ∩ A_n^c for n ∈ N, where B_1 is defined to be A_1. Note that the sequence {B_n}_{n=1}^∞ is defined so that

⋃_{i=1}^{n} A_i = ⋃_{i=1}^{n} B_i,

for all n ∈ N. Therefore, taking the limit of each side of the equation as n → ∞ yields

lim_{n→∞} P(A_n) = Σ_{n=1}^{∞} P(B_n).

Definition 2.2, Equation (2.5), and Theorem 2.15 then imply

Σ_{n=1}^{∞} P(B_n) = P(⋃_{n=1}^{∞} B_n) = P(⋃_{n=1}^{∞} A_n) = P(lim_{n→∞} A_n),
which proves the result for monotonically increasing events. For monotonically
decreasing events take the complement as shown in Equation (2.4) and note
that the resulting sequence is monotonically increasing. The above result is
then applied to this resulting sequence of monotonically increasing events.
The Borel-Cantelli Lemmas relate the probability of the limit supremum of
a sequence of events to the convergence of the sum of the probabilities of the
events in the sequence.
Theorem 2.17 (Borel and Cantelli). Let {A_n}_{n=1}^∞ be a sequence of events. If

Σ_{n=1}^{∞} P(A_n) < ∞,

then

P(lim sup_{n→∞} A_n) = 0.
for each n ∈ N, where the inequality follows from Theorem 2.3 and the fact that

⋂_{n=1}^{∞} ⋃_{m=n}^{∞} A_m ⊂ ⋃_{m=n}^{∞} A_m.

Now, Theorem 2.4 (Bonferroni) implies that

P(lim sup_{n→∞} A_n) ≤ Σ_{m=n}^{∞} P(A_m),

for each n ∈ N. The convergence of the sum in the statement of the theorem implies that

lim_{n→∞} Σ_{m=n}^{∞} P(A_m) = 0,

so that Theorem 2.16 implies that

P(lim sup_{n→∞} A_n) = 0.
then the probability that An occurs infinitely often is zero. That is, there will
exist an n0 ∈ N such that none of the events in the sequence {An0 +1 , An0 +2 , . . .}
will occur, with probability one. The second Borel and Cantelli Lemma relates
the divergence of the sum of the probabilities of the events in the sequence
to the case where An occurs infinitely often with probability one. This result
only applies to the case where the events in the sequence are independent.
Theorem 2.18 (Borel and Cantelli). Let {A_n}_{n=1}^∞ be a sequence of independent events. If

Σ_{n=1}^{∞} P(A_n) = ∞,

then

P(lim sup_{n→∞} A_n) = P(A_n i.o.) = 1.
Proof. We use the method of Billingsley (1986) to prove this result. Note that by Theorem A.2

(lim sup_{n→∞} A_n)^c = (⋂_{n=1}^{∞} ⋃_{m=n}^{∞} A_m)^c = ⋃_{n=1}^{∞} ⋂_{m=n}^{∞} A_m^c.

then the result will follow. In fact, note that if we are able to show that

P(⋂_{m=n}^{∞} A_m^c) = 0,   (2.6)

for each n ∈ N, then Theorem 2.4 implies that

P(⋃_{n=1}^{∞} ⋂_{m=n}^{∞} A_m^c) ≤ Σ_{n=1}^{∞} P(⋂_{m=n}^{∞} A_m^c) = 0.

Now, note that 1 − x ≤ exp(−x) for all positive and real x, so that

P(A_m^c) = 1 − P(A_m) ≤ exp[−P(A_m)],

for all m ∈ {n, n + 1, . . .}. Therefore,

P(⋂_{m=n}^{n+k} A_m^c) ≤ Π_{m=n}^{n+k} exp[−P(A_m)] = exp[−Σ_{m=n}^{n+k} P(A_m)].   (2.7)

We wish to take the limit of both sides of Equation (2.7), which means that we need to evaluate the limit of the sum on the right hand side. By supposition we have that

Σ_{m=1}^{∞} P(A_m) = Σ_{m=1}^{n−1} P(A_m) + Σ_{m=n}^{∞} P(A_m) = ∞,

and

Σ_{m=1}^{n−1} P(A_m) ≤ n − 1 < ∞,

since P(A_m) ≤ 1 for each m ∈ N. Therefore, it follows that

Σ_{m=n}^{∞} P(A_m) = ∞,   (2.8)
Results such as the one given in Corollary 2.1 are called zero-one laws because
the probability of the event of interest can only take on the values zero and
one.
Example 2.8. Let {U_n}_{n=1}^∞ be a sequence of independent random variables where U_n has a Uniform{1, 2, . . . , n} distribution for all n ∈ N. Define a sequence of events {A_n}_{n=1}^∞ as A_n = {U_n = 1} for all n ∈ N. Note that

Σ_{n=1}^{∞} P(A_n) = Σ_{n=1}^{∞} n^{−1} = ∞,
In this case there will be a last occurrence of an event in the sequence An with
probability one. That is, there will exist an integer n0 such that {Un = 1}
will not be observed for all n > n0 , with probability one. That is, the event
{Un = 1} will occur in this sequence only a finite number of times, with
probability one. Squaring the size of the sample space in this case creates too
many opportunities for events other than those events in the sequence An to
occur as n → ∞ for an event An to ever occur again after a certain point.
2.5 Generating and Characteristic Functions
The integrals used in Definition 2.9 are Lebesgue-Stieltjes integrals, which can
be applied to any random variable, discrete or continuous.
Definition 2.10. Let X be a random variable with distribution function F. Let g be any real function. Then the integral

∫_{−∞}^{∞} g(x) dF(x)

is defined as

∫_{−∞}^{∞} g(x) dF(x) = ∫_{−∞}^{∞} g(x)f(x) dx,

when X is a continuous random variable with density f, provided

∫_{−∞}^{∞} |g(x)|f(x) dx < ∞.

If X is a discrete random variable that takes on values in the set {x_1, x_2, . . .} with probability distribution function f then

∫_{−∞}^{∞} g(x) dF(x) = Σ_i g(x_i)f(x_i),

provided

Σ_i |g(x_i)|f(x_i) < ∞.
The use of this notation will allow us to keep the presentation of the book
simple without having to present essentially the same material twice: once
for the discrete case and once for the continuous case. For example, in the
particular case when X is a discrete random variable that takes on values in
the set {x1 , x2 , . . .} with probability distribution function f then Definitions
2.9 and 2.10 imply that
µ′_k = E(X^k) = Σ_{i=1}^{∞} x_i^k f(x_i),
The condition given in Equation (2.11) is called the Carleman Condition and
has sufficient conditions given below.
Theorem 2.20. Let X be a random variable with moments {µ′_k}_{k=1}^∞. If

lim sup_{k→∞} k^{−1} [∫_{−∞}^{∞} |x|^k dF(x)]^{1/k} < ∞,

or

Σ_{k=1}^{∞} µ′_k λ^k / k!,   (2.12)

converges absolutely when |λ| < λ_0, for some λ_0 > 0, then the Carleman Condition of Equation (2.11) holds.

The Carleman Condition, as well as the individually sufficient conditions given in Theorem 2.20, essentially put restrictions on the rate of growth of the sequence µ′_{2k}. For further details on these results see Akhiezer (1965) and Shohat and Tamarkin (1943).

Example 2.9. Consider a continuous random variable X with density

f(x) = 2x for 0 ≤ x ≤ 1, and f(x) = 0 elsewhere.

Direct calculation shows that for i ∈ N,

µ′_i = E(X^i) = ∫_0^1 2x^{i+1} dx = 2(i + 2)^{−1}.

Therefore (µ′_{2i})^{−1/2i} = (i + 1)^{1/2i} for all i ∈ N. Noting that (i + 1)^{1/2i} ≥ 1 for all i ∈ N implies that

Σ_{i=1}^{∞} (µ′_{2i})^{−1/2i} = ∞,
We first show that fθ (x) is a valid density function. Since f (x) is a density it
follows that f (x) ≥ 0 for all x ∈ R. Now | sin(x)| ≤ 1 for all x ∈ R and |θ| ≤ 1
so it follows that 1 + θ sin[2π log(x)] ≥ 0 for all x > 0. Hence it follows that
f_θ(x) ≥ 0 for all x > 0 and θ ∈ [−1, 1]. Further, noting that

∫_0^∞ f(x){1 + θ sin[2π log(x)]} dx = 1 + θ ∫_0^∞ f(x) sin[2π log(x)] dx,

it suffices to show that the integral on the right hand side is zero to show that the proposed density function integrates to one. Consider the change of variable log(x) = u so that du = dx/x and note that

lim_{x→0} log(x) = −∞ and lim_{x→∞} log(x) = ∞.

Therefore,

∫_0^∞ f(x) sin[2π log(x)] dx = ∫_{−∞}^{∞} (2π)^{−1/2} exp(−u^2/2) sin(2πu) du.
The function in the integrand is odd so that the integral is zero. Therefore, it follows that f_θ(x) is a valid density. Now consider the kth moment of f_θ given by

µ′_k(θ) = ∫_0^∞ x^k f(x){1 + θ sin[2π log(x)]} dx
        = ∫_0^∞ x^k f(x) dx + θ ∫_0^∞ x^k f(x) sin[2π log(x)] dx
        = µ′_k + θ ∫_0^∞ x^k f(x) sin[2π log(x)] dx,

where µ′_k is the kth moment of f(x). Using the same change of variable as above, it follows that

∫_0^∞ x^k f(x) sin[2π log(x)] dx = ∫_{−∞}^{∞} (2π)^{−1/2} exp(k^2/2) exp(−u^2/2) sin[2π(u + k)] du
                                 = ∫_{−∞}^{∞} (2π)^{−1/2} exp(k^2/2) exp(−u^2/2) sin(2πu) du = 0,

for each k ∈ N. Hence, µ′_k(θ) = µ′_k for all θ ∈ [−1, 1], and we have demonstrated that this family of distributions all have the same sequence of moments.
Therefore, the moment sequence of this distribution does not uniquely iden-
tify this distribution. A plot of the lognormal density along with the density
given in Equation (2.13) when θ = 1 is given in Figure 2.3. This example is
based on one from Heyde (1963).
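A plot in the spirit of Figure 2.3 and a numerical check of the moment calculation above can be produced in R as follows; the code is a sketch that assumes the standard lognormal density for f(x) in Equation (2.13), and it uses the same change of variable u = log(x) as the text.

## the lognormal density f and the perturbed density f_theta share their moments
f      <- function(x) dlnorm(x)
ftheta <- function(x, theta = 1) f(x) * (1 + theta * sin(2 * pi * log(x)))
curve(f, from = 0.01, to = 4, ylab = "density", lty = 1)
curve(ftheta, from = 0.01, to = 4, add = TRUE, lty = 2)
## numerical check of the first three moments via the substitution u = log(x)
moment <- function(k, theta) {
  integrate(function(u) exp(k * u) * dnorm(u) * (1 + theta * sin(2 * pi * u)), -Inf, Inf)$value
}
rbind(lognormal = sapply(1:3, moment, theta = 0), perturbed = sapply(1:3, moment, theta = 1))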
provided

∫_{−∞}^{∞} exp(tx) dF(x) < ∞.

Figure 2.3 The lognormal density (solid line) and the density given in Equation (2.13) with θ = 1 (dashed line). Both densities have the same moment sequence.
m(t) = ∫_{−∞}^{∞} exp(tx) dF(x)
     = Σ_{x=0}^{n} exp(tx) \binom{n}{x} p^x (1 − p)^{n−x}
     = Σ_{x=0}^{n} \binom{n}{x} [p exp(t)]^x (1 − p)^{n−x}
     = [1 − p + p exp(t)]^n,

Note that the integral in Equation (2.14) diverges unless the restriction |t| < β^{−1} is employed. Now consider the change of variable v = x(β^{−1} − t) so that dx = (β^{−1} − t)^{−1} dv. The moment generating function is then given by

m(t) = β^{−1}(β^{−1} − t)^{−1} ∫_0^∞ exp(−v) dv = (1 − tβ)^{−1}.
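The closed form (1 − tβ)^{−1} can be verified numerically. The following R sketch is not from the text; it integrates exp(tx) against the Exponential(β) density for illustrative values of β and t satisfying |t| < β^{−1}.

## numerical check of the Exponential(beta) moment generating function
beta <- 2; t <- 0.3
mgf_numeric <- integrate(function(x) exp(t * x) * dexp(x, rate = 1 / beta), 0, Inf)$value
mgf_closed  <- 1 / (1 - t * beta)
c(numeric = mgf_numeric, closed_form = mgf_closed)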
When the convergence condition given in Definition 2.11 holds then all of
the moments of X are finite, and the moment generating function contains
information about all of the moments of F . To observe why this is true,
consider the Taylor series for exp(tX) given by

exp(tX) = Σ_{i=0}^{∞} (tX)^i / i!.

Taking expectations term by term yields

m(t) = E[exp(tX)] = 1 + µ′_1 t + (1/2)µ′_2 t^2 + · · · + µ′_n t^n/n! + O(t^n),   (2.15)

as t → 0, where we have assumed that the exchange between the infinite sum and the expectation is permissible. Note that the coefficient of t^i/i! is equal to the ith moment of F. As indicated by its name, the moment generating function can be used to generate the moments of the corresponding distribution. The operation is outlined in Theorem 2.21, which can be proven using a standard induction argument. See Problem 16.
Theorem 2.21. Let X be a random variable that has moment generating function m(t) that converges on some radius |t| ≤ b for some b > 0. Then

µ′_k = d^k m(t)/dt^k |_{t=0}.
We do not formally prove Theorem 2.22, but to informally see why the result
would be true one can compare Equation (2.15) to Equation (2.12). When
the moment generating function exists then the expansion given in Theorem 2.20
must converge for some radius of convergence. Therefore, Theorems 2.19 and
2.20 imply that the distribution is uniquely characterized by its moment se-
quence. Note that since Equation (2.15) is a polynomial in t whose coefficients
are functions of the moments, the two moment generating functions will be
equal only when all of the moments are equal, which will mean that the two
distributions are equal.
Another useful result relates the distribution of a function of a random variable
to the moment generating function of the random variable.
Theorem 2.23. Let X be a random variable with distribution function F . If
g is a real function then the moment generating function of g(X) is
m(t) = E{exp[tg(X)]} = ∫_{−∞}^{∞} exp[tg(x)] dF(x),
The results of Theorems 2.22 and 2.23 can be combined to identify the distri-
butions of transformed random variables.
Example 2.14. Suppose X is an Exponential(1) random variable. Example 2.12 showed that the moment generating function of X is m_X(t) = (1 − t)^{−1} when |t| < 1. Now consider a new random variable Y = βX where β > 0. Theorem 2.23 implies that the moment generating function of Y is m_Y(t) = E{exp[t(βX)]} = E{exp[(tβ)X]} = m_X(tβ) as long as tβ < 1, or equivalently t < β^{−1}. Evaluating the moment generating function of X at tβ yields m_Y(t) = (1 − tβ)^{−1}, which is the moment generating function of an Exponential(β) random variable. Theorem 2.22 can be used to conclude that if X is an Exponential(1) random variable then βX is an Exponential(β) random variable.
Example 2.15. Let Z be a N(0, 1) random variable and define a new random variable Y = Z^2. Theorem 2.23 implies that the moment generating function of Y is given by

m_Y(t) = E[exp(tZ^2)]
       = ∫_{−∞}^{∞} (2π)^{−1/2} exp(−z^2/2 + tz^2) dz
       = ∫_{−∞}^{∞} (2π)^{−1/2} exp[−z^2(1 − 2t)/2] dz.

Assuming that 1 − 2t > 0, that is t < 1/2, so that the integral converges, use the change of variable v = z(1 − 2t)^{1/2} to obtain

m_Y(t) = (1 − 2t)^{−1/2} ∫_{−∞}^{∞} (2π)^{−1/2} exp(−v^2/2) dv = (1 − 2t)^{−1/2}.

Note that m_Y(t) is the moment generating function of a Chi-Squared(1) random variable. Therefore, Theorem 2.22 implies that if Z is a N(0, 1) random variable then Z^2 is a Chi-Squared(1) random variable.
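A quick simulation check of Example 2.15 in R, which is not part of the original example, compares empirical quantiles of Z^2 with those of the Chi-Squared(1) distribution.

## squares of standard normal variates should follow a Chi-Squared(1) distribution
set.seed(3)
y <- rnorm(100000)^2
probs <- c(0.5, 0.9, 0.99)
rbind(empirical = quantile(y, probs), chisq_1 = qchisq(probs, df = 1))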
where the interchange of the product and the expectation follows from the
independence of the random variables. The second result follows by setting
m1 (t) = · · · = mn (t) = m(t).
Example 2.16. Suppose that X_1, . . . , X_n are a set of independent N(0, 1) random variables. The moment generating function of X_k is m(t) = exp(t^2/2) for all t ∈ R. Theorem 2.25 implies that the moment generating function of

S_n = Σ_{i=1}^{n} X_i

is m_{S_n}(t) = m^n(t) = exp(nt^2/2), which is the moment generating function of a N(0, n) random variable. Therefore, Theorem 2.22 implies that the sum of n independent N(0, 1) random variables is a N(0, n) random variable.
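Example 2.16 can likewise be checked by simulation; the R sketch below is not from the text and verifies that the sum of n independent N(0, 1) variables has variance close to n.

## the sum of n standard normal variables behaves like a N(0, n) variable
set.seed(11)
n <- 5
s <- replicate(50000, sum(rnorm(n)))
c(mean = mean(s), variance = var(s), theoretical_variance = n)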
The moment generating function can be a useful tool in asymptotic theory, but
has the disadvantage that it does not exist for many distributions of interest. A
function that has many similar properties to the moment generating function
is the characteristic function. Using the characteristic function requires a little
more work as it is based on some ideas from complex analysis. However, the
benefits far outweigh this inconvenience as the characteristic function always
exists, can be used to generate moments when they exist, and also uniquely
identifies a distribution.
Definition 2.12. Let X be a random variable with distribution function F .
The characteristic function of X, or equivalently of F , is
ψ(t) = E[exp(itX)] = ∫_{−∞}^{∞} exp(itx) dF(x).   (2.16)

by Theorem A.6. Now Definitions A.5 and A.6 (Euler) imply that

|exp(itx)| = |cos(tx) + i sin(tx)| = [cos^2(tx) + sin^2(tx)]^{1/2} = 1.

Therefore, it follows that

|∫_{−∞}^{∞} exp(itx) dF(x)| ≤ ∫_{−∞}^{∞} dF(x) = 1.
where both of the integrals in Equation (2.18) are of real functions integrated
over the real line.
Example 2.17. Let X be a Binomial(n, p) random variable. The characteristic function of X is given by

ψ(t) = ∫_{−∞}^{∞} exp(itx) dF(x)
     = Σ_{x=0}^{n} exp(itx) \binom{n}{x} p^x (1 − p)^{n−x}
     = Σ_{x=0}^{n} \binom{n}{x} [p exp(it)]^x (1 − p)^{n−x}
     = [1 − p + p exp(it)]^n,

where the final equality results from Theorem A.22.
Example 2.18. Let X have an Exponential(β) distribution. The characteristic function of X is given by

ψ(t) = ∫_{−∞}^{∞} exp(itx) dF(x)
     = ∫_{0}^{∞} exp(itx) β^{−1} exp(−β^{−1}x) dx
     = ∫_{0}^{∞} β^{−1} cos(tx) exp(−β^{−1}x) dx + i ∫_{0}^{∞} β^{−1} sin(tx) exp(−β^{−1}x) dx.
In both of the previous two examples the characteristic function turns out to
be a function whose range is in the complex field. This is not always the case,
as there are some circumstances under which the characteristic function is a
real valued function.
Theorem 2.26. Let X be a random variable. The characteristic function of
X is real valued if and only if X has the same distribution as −X.
Theorem 2.26 is proven in Exercise 22. Note that the condition that X has
the same distribution as −X implies that P (X ≥ x) = P (X ≤ −x) for all
x > 0, which is equivalent to the case where X has a symmetric distribution
about the origin. For example, a random variable with a N(0, 1) distribution
has a real valued characteristic function, whereas a random variable with a
non-symmetric distribution like a Gamma distribution has a complex val-
ued characteristic function. Note that Theorem 2.26 requires that X have a
distribution that is symmetric about the origin. That is, if X has a N(µ, σ)
distribution where µ 6= 0, then the characteristic function of X is complex
valued.
As with the moment generating function, the characteristic function uniquely
characterizes the distribution of a random variable, though the characteristic
function can be used in more cases as there is no need to consider potential
convergence issues with the characteristic function.
Theorem 2.27. Let F and G be two distribution functions whose character-
istic functions are ψF (t) and ψG (t) respectively. If ψF (t) = ψG (t) for all t ∈ R
then F (x) = G(x) for all x ∈ R.
For discrete distributions we will focus on random variables that take on values
on a regular lattice. That is, for random variables X that take on values in
the set {kd + l : k ∈ Z} for some d > 0 and l ∈ R. Many of the common
discrete distributions, such as the Binomial, Geometric, and Poisson, have
supports on a regular lattice with d = 1 and l = 0.
Theorem 2.29. Consider a random variable X that takes on values in the set
{kd + l : k ∈ Z} for some d > 0 and l ∈ R. Suppose that X has characteristic
function ψ(t), then for any x ∈ {kd + l : k ∈ Z},
P(X = x) = d(2π)^{−1} ∫_{−π/d}^{π/d} exp(−itx) ψ(t) dt.
Example 2.20. Let X be a discrete random variable that takes on values in the set {0, 1, . . . , n} where n ∈ N is fixed so that d = 1 and l = 0. If X has characteristic function ψ(t) = [1 − p + p exp(it)]^n where p ∈ (0, 1) and x ∈ {1, . . . , n}, then Theorem 2.29 implies that

P(X = x) = (2π)^{−1} ∫_{−π}^{π} exp(−itx)[1 − p + p exp(it)]^n dt
         = (2π)^{−1} ∫_{−π}^{π} exp(−itx) Σ_{k=0}^{n} \binom{n}{k} p^k exp(itk) (1 − p)^{n−k} dt
         = (2π)^{−1} Σ_{k=0}^{n} \binom{n}{k} p^k (1 − p)^{n−k} ∫_{−π}^{π} exp[it(k − x)] dt,

where Theorem A.22 has been used to expand the polynomial. There are two distinct cases to consider for the value of the index k. When k = x the exponential function becomes one and the expression simplifies to

(2π)^{−1} \binom{n}{x} p^x (1 − p)^{n−x} ∫_{−π}^{π} dt = \binom{n}{x} p^x (1 − p)^{n−x}.

When k ≠ x the integral expression in the sum can be calculated as

∫_{−π}^{π} exp[it(k − x)] dt = ∫_{−π}^{π} cos[t(k − x)] dt + i ∫_{−π}^{π} sin[t(k − x)] dt.

The second integral is zero because sin[−t(k − x)] = −sin[t(k − x)]. For the first integral we note that k − x is a non-zero integer and therefore

∫_{−π}^{π} cos[t(k − x)] dt = 2 ∫_{0}^{π} cos[t(k − x)] dt = 0,

since the range of the integral over the cosine function is an integer multiple of π. Therefore, the integral expression is zero when k ≠ x and we have that

P(X = x) = \binom{n}{x} p^x (1 − p)^{n−x},

which corresponds to the probability of a Binomial(n, p) random variable.
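The inversion formula of Theorem 2.29 can also be evaluated numerically. The R sketch below is not from the text; it integrates the Binomial(n, p) characteristic function as in Example 2.20 for illustrative values of n and p and recovers the binomial probabilities.

## numerical inversion of the Binomial(n, p) characteristic function (d = 1, l = 0)
n <- 5; p <- 0.3
psi <- function(t) (1 - p + p * exp(1i * t))^n
invert <- function(x) {
  integrand <- function(t) Re(exp(-1i * t * x) * psi(t))  # imaginary parts cancel over (-pi, pi)
  integrate(integrand, -pi, pi)$value / (2 * pi)
}
rbind(inverted = sapply(0:n, invert), exact = dbinom(0:n, n, p))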
Due to the similar form of the definition of the characteristic function to the
moment generating function, the two functions have similar properties. In
particular, the characteristic function of a random variable can also be used
to obtain moments of the random variable if they exist. We first establish
that when the associated moments exist, the characteristic function can be
approximated by partial expansions whose terms correspond to the Taylor
series for the exponential function.
Theorem 2.30. Let X be a random variable with characteristic function ψ. If E(|X|^n) < ∞ for some n ∈ N, then

|ψ(t) − Σ_{k=0}^{n} (it)^k E(X^k)/k!| ≤ E(2|t|^n |X|^n / n!),

and

|ψ(t) − Σ_{k=0}^{n} (it)^k E(X^k)/k!| ≤ E[|t|^{n+1} |X|^{n+1} / (n + 1)!].

Proof. We will prove the first statement. The second statement follows using similar arguments. Theorem A.11 implies that for y ∈ R,

|exp(iy) − Σ_{k=0}^{n} (iy)^k / k!| ≤ 2|y|^n / n!.   (2.19)
as t → 0 and

d^k ψ(t)/dt^k |_{t=0} = i^k µ′_k.
A proof of Theorem 2.31 is the subject of Exercise 33. Note that the char-
acteristic function allows one to find the finite moments of distributions that
may not have all finite moments, as opposed to the moment generating func-
tion which does not exist when any of the moments of a random variable are
infinite.
Example 2.21. Let X be a N(0, 1) random variable, which has characteristic function ψ(t) = exp(−t^2/2). Taking a first derivative yields

dψ(t)/dt |_{t=0} = [d exp(−t^2/2)/dt] |_{t=0} = −t exp(−t^2/2) |_{t=0} = 0,
The characteristic function also has a simple relationship with linear transfor-
mations of a random variable.
Theorem 2.32. Suppose that X is a random variable with characteristic
function ψ(t). Let Y = αX + β where α and β are real constants. Then the
characteristic function of Y is ψY (t) = exp(itβ)ψ(tα).
The usefulness of the cumulant generating function may not be apparent from
the definition given above, though one can immediately note that when the
moment generating function of a random variable exists, the cumulant gen-
erating function will also exist and will uniquely characterize the distribution
of the random variable as the moment generating function does. This follows
from the fact that the cumulant generating function is a one-to-one function
of the moment generating function. As indicated by its name, c(t) can be
used to generate the cumulants of a random variable, which are related to the
moments of a random variable. Before defining the cumulants of a random
GENERATING AND CHARACTERISTIC FUNCTIONS 89
variable, some expansion theory is required to investigate the structure of the
cumulant generating function more closely.
We begin by assuming that the moment generating function m(t) is defined on
a radius of convergence |t| < b for some b > 0. As shown in Equation (2.15),
the moment generating function can be written as
m(t) = 1 + µ′_1 t + (1/2)µ′_2 t^2 + (1/6)µ′_3 t^3 + · · · + µ′_n t^n/n! + O(t^n),   (2.20)

as t → 0. An application of Theorem 1.15 to the logarithmic function can be used to show that

log(1 + δ) = δ − (1/2)δ^2 + (1/3)δ^3 − (1/4)δ^4 + · · · + (1/n)(−1)^{n+1} δ^n + O(δ^n),   (2.21)

as δ → 0. Substituting

δ = Σ_{i=1}^{n} µ′_i t^i / i! + O(t^n),   (2.22)

into Equation (2.21) yields a polynomial in t of the same form as given in Equation (2.20), but with different coefficients. That is, the cumulant generating function has the form

c(t) = κ_1 t + (1/2)κ_2 t^2 + (1/6)κ_3 t^3 + · · · + κ_n t^n/n! + R_c(t),   (2.23)
where the coefficients κi are called the cumulants of X, and Rc (t) is an error
term that depends on t whose order is determined below. Note that since
c(t) and m(t) have the same form, the cumulants can be generated from the
cumulant generating function in the same way that moments can be generated
from the moment generating function through Theorem 2.21. That is,
κ_i = d^i c(t)/dt^i |_{t=0}.
Matching the coefficients of ti in Equations (2.21) and (2.23) yields expres-
sions for the cumulants of X in terms of the moments of X. For example,
matching the coefficients of t yields the relation µ01 t = κ1 t so that the first
cumulant is equal to µ01 , the mean. The remaining cumulants are not equal to
the corresponding moments. Matching the coefficients of t^2 yields

(1/2)κ_2 t^2 = (1/2)µ′_2 t^2 − (1/2)(µ′_1)^2 t^2,

so that κ_2 = µ′_2 − (µ′_1)^2, the variance of X. Similarly, matching the coefficients of t^3 yields

(1/6)κ_3 t^3 = (1/6)µ′_3 t^3 − (1/2)µ′_1µ′_2 t^3 + (1/3)(µ′_1)^3 t^3,

so that κ_3 = µ′_3 − 3µ′_1µ′_2 + 2(µ′_1)^3. In this form this expression may not seem familiar, but note that

E[(X − µ′_1)^3] = µ′_3 − 3µ′_1µ′_2 + 2(µ′_1)^3,

so that κ_3 is the skewness of X. It can be similarly shown that κ_4 is the
kurtosis of X. For further cumulants, see Exercises 34 and 35, and Chapter 3
of Kendall and Stuart (1977).
The remainder term R_c(t) from the expansion given in Equation (2.23) will now be quantified. Consider the powers of t given in the expansion for δ in Equation (2.22) when δ is substituted into Equation (2.21). The remainder term from the linear term is O(t^n) as t → 0. When the expansion given for δ is squared, there will be terms that range in powers of t from t^2 to t^{2n}. All of the powers of t less than n + 1 have coefficients that are used to identify the first n cumulants in terms of the moments of X. All the remaining terms are O(t^n) as t → 0, assuming that all of the moments are finite, which must follow if the moment generating function of X converges on a radius of convergence. This argument can be applied to the remaining terms to obtain the result

c(t) = Σ_{i=1}^{n} κ_i t^i / i! + O(t^n),   (2.24)

as t → 0.
Example 2.22. Suppose X has a N(µ, σ^2) distribution. The moment generating function of X is given by m(t) = exp(µt + σ^2t^2/2). From Definition 2.13, the cumulant generating function of X is

c(t) = log[m(t)] = µt + (1/2)σ^2 t^2.

Matching this cumulant generating function to the general form given in Equation (2.24) implies the N(µ, σ^2) distribution has cumulants κ_1 = µ, κ_2 = σ^2, and κ_i = 0 for i = 3, 4, . . . .
In some cases additional calculations are required to obtain the necessary form
of the cumulant generating function.
Example 2.23. Suppose that X has an Exponential(β) distribution. The moment generating function of X is m(t) = (1 − βt)^{−1} when |t| < β^{−1} so that the cumulant generating function of X is c(t) = −log(1 − βt). This cumulant generating function is not in the form given in Equation (2.24) so the cumulants cannot be obtained by directly observing c(t). However, when |βt| < 1 it follows from the Taylor expansion given in Equation (2.21) that

log(1 − βt) = (−βt) − (1/2)(−βt)^2 + (1/3)(−βt)^3 + · · ·
            = −βt − (1/2)β^2t^2 − (1/3)β^3t^3 − · · · .

Therefore c(t) = βt + (1/2)β^2t^2 + (1/3)β^3t^3 + · · · , which is now in the form of Equation (2.24). It follows that the ith cumulant of X can be found by solving

κ_i t^i / i! = β^i t^i / i,

for t ≠ 0. This implies that the ith cumulant of X is κ_i = (i − 1)!β^i for i ∈ N. Alternatively, one could also find the cumulants by differentiating the cumulant generating function. For example, the first cumulant is given by

dc(t)/dt |_{t=0} = d[−log(1 − βt)]/dt |_{t=0} = β(1 − βt)^{−1} |_{t=0} = β.

The remaining cumulants can be found by taking additional derivatives.
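The formula κ_i = (i − 1)!β^i is easy to check numerically. The R sketch below is not from the text; it approximates the second derivative of c(t) = −log(1 − βt) at t = 0 by a central difference and compares it with 1!β^2 for an illustrative value of β.

## numerical check of the second cumulant of an Exponential(beta) random variable
beta  <- 2
c_fun <- function(t) -log(1 - beta * t)
h <- 1e-3
second_deriv <- (c_fun(h) - 2 * c_fun(0) + c_fun(-h)) / h^2
c(numeric = second_deriv, exact = factorial(1) * beta^2)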
Cumulant generating functions are particularly easy to work with for sums of
independent random variables.
Theorem 2.34. Let X_1, . . . , X_n be a sequence of independent random variables where X_i has cumulant generating function c_i(t) for i = 1, . . . , n. Then the cumulant generating function of

S_n = Σ_{i=1}^{n} X_i

is

c_{S_n}(t) = Σ_{i=1}^{n} c_i(t).

If X_1, . . . , X_n are also identically distributed with cumulant generating function c(t) then the cumulant generating function of S_n is nc(t).
Theorem 2.34 is proven in Exercise 36. The fact that the cumulant gener-
ating functions add for sums of independent random variables implies that
the coefficients of ti /i! add as well. Therefore the ith cumulant of the sum of
independent random variables is equal to the sum of the corresponding cu-
mulants of the individual random variables. This result gives some indication
as to why cumulants are often preferable to work with, as the moments or
central moments of a sum of independent random variables can be a complex
function of the individual moments. For further information about cumulants
see Barndorff-Nielsen and Cox (1989), Gut (2005), Kendall and Stuart (1977),
and Severini (2005).
2.6.1 Exercises
lim sup_{n→∞} A_n,
and determine if the limit of the sequence {A_n}_{n=1}^∞ exists.

lim sup_{n→∞} A_n,
and determine if the limit of the sequence {A_n}_{n=1}^∞ exists.
10. Let {A_n}_{n=1}^∞ be a sequence of events from F, a σ-field on the sample space
lim sup_{n→∞} A_n,
and determine if the limit of the sequence {A_n}_{n=1}^∞ exists.
11. Let {A_n}_{n=1}^∞ be a sequence of events from F, a σ-field on the sample space Ω = R, defined by

A_n = [1/2, 1/2 + n^{−1}) if n is even, and A_n = (1/2 − n^{−1}, 1/2] if n is odd,

for all n ∈ N. Compute
lim inf_{n→∞} A_n,
lim sup_{n→∞} A_n,
and determine if the limit of the sequence {A_n}_{n=1}^∞ exists.
12. Consider a probability space (Ω, F, P) where Ω = (0, 1), F = B{(0, 1)} and P is Lebesgue measure on (0, 1). Let {A_n}_{n=1}^∞ be a sequence of events in F defined by A_n = (0, (1/2)(1 + n^{−1})) for all n ∈ N. Show that
lim_{n→∞} P(A_n) = P(lim_{n→∞} A_n).
13. Consider tossing a fair coin repeatedly and define Hn to be the event that
the nth toss of the coin yields a head. Prove that
P (lim sup Hn ) = 1,
n→∞
and interpret this result in terms of how often the event occurs.
14. Consider the case where {An }∞ n=1 is a sequence of independent events that
all have the same probability p ∈ (0, 1). Prove that
P (lim sup An ) = 1,
n→∞
and interpret this result in terms of how often the event occurs.
15. Let {Un }∞n=1 be a sequence of independent Uniform(0, 1) random vari-
ables. For each definition of An given below, calculate
P lim sup An .
n→∞
is

ψ_{S_n}(t) = Π_{i=1}^{n} ψ_i(t).

Further, prove that if X_1, . . . , X_n are identically distributed with characteristic function ψ(t) then the characteristic function of S_n is ψ_{S_n}(t) = ψ^n(t).
26. Let X_1, . . . , X_n be a sequence of independent random variables where X_i has a Gamma(α_i, β) distribution for i = 1, . . . , n. Let

S_n = Σ_{i=1}^{n} X_i.
a. Prove that

ψ(t) = 1 + Σ_{k=1}^{n} µ′_k (it)^k / k! + o(|t|^n),

as t → 0.
b. Prove that

d^k ψ(t)/dt^k |_{t=0} = i^k µ′_k.
34. a. Prove that κ_4 = µ′_4 − 4µ′_3µ′_1 − 3(µ′_2)^2 + 12µ′_2(µ′_1)^2 − 6(µ′_1)^4.
b. Prove that κ_4 = µ_4 − 3µ_2^2, which is often called the kurtosis of a random variable.
c. Suppose that X is an Exponential(θ) random variable. Compute the fourth cumulant of X.
35. a. Prove that
κ_5 = µ′_5 − 5µ′_4µ′_1 − 10µ′_3µ′_2 + 20µ′_3(µ′_1)^2 + 30(µ′_2)^2µ′_1 − 60µ′_2(µ′_1)^3 + 24(µ′_1)^5.
is

c_{S_n}(t) = Σ_{i=1}^{n} c_i(t).
37. Suppose that X is a Poisson(λ) random variable, so that the moment
generating function of X is m(t) = exp{λ[exp(t) − 1]}. Find the cumulant
generating function of X, and put it into the form given in Equation (2.24).
Using the form of the cumulant generating function, find a general form for
the cumulants of X.
38. Suppose that X is a Gamma(α, β) random variable, so that the moment generating function of X is m(t) = (1 − tβ)^{−α}. Find the cumulant generating function of X, and put it into the form given in Equation (2.24). Using the form of the cumulant generating function, find a general form for the cumulants of X.
39. Suppose that X is a Laplace(α, β) random variable, so that the moment generating function of X is m(t) = (1 − t^2β^2)^{−1} exp(tα) when |t| < β^{−1}. Find the cumulant generating function of X, and put it into the form given in Equation (2.24). Using the form of the cumulant generating function, find a general form for the cumulants of X.
40. One consequence of defining the cumulant generating function in terms of
the moment generating function is that the cumulant generating function
will not exist any time the moment generating function does not. An alter-
nate definition of the cumulant generating function is defined in terms of
the characteristic function. That is, if X has characteristic function ψ(t),
then the cumulant generating function can be defined as c(t) = log[ψ(t)].
a. Assume all of the cumulants (and moments) of X exist. Prove that the
coefficient of (it)k /k! for the cumulant generating function defined using
the characteristic function is the k th cumulant of X. You may want to
use Theorem 2.31.
b. Find the cumulant generating function of a random variable X that has a Cauchy(0, 1) distribution based on the fact that the characteristic function of X is ψ(t) = exp(−|t|). Use the form of this cumulant generating function to argue that the cumulants of X do not exist.
2.6.2 Experiments
1. For each of the distributions listed below, use R to compute P(|X − µ| > δ) and compare the result to the bound given by Theorem 2.7 as δ^{−2}σ^2 for δ = 1/2, 1, 3/2, 2. Which distributions become closest to achieving the bound?
What are the properties of these distributions?
a. N(0, 1)
b. T(3)
c. Gamma(1, 1)
d. Uniform(0, 1)
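A minimal R sketch of this comparison is given below. The means and variances used in the bound (0 and 1 for N(0, 1); 0 and 3 for T(3); 1 and 1 for Gamma(1, 1); 1/2 and 1/12 for Uniform(0, 1)) are standard values rather than quantities taken from the text.

delta <- c(0.5, 1, 1.5, 2)
tail.prob <- function(cdf, mu, delta, ...) {
  # P(|X - mu| > delta) computed from the distribution function
  1 - (cdf(mu + delta, ...) - cdf(mu - delta, ...))
}
exact <- rbind(normal  = tail.prob(pnorm,  0,   delta),
               t3      = tail.prob(pt,     0,   delta, df = 3),
               gamma11 = tail.prob(pgamma, 1,   delta, shape = 1, rate = 1),
               unif01  = tail.prob(punif,  0.5, delta))
sigma2 <- c(normal = 1, t3 = 3, gamma11 = 1, unif01 = 1/12)
bound <- outer(sigma2, delta^(-2))           # Tchebysheff bound
print(round(exact, 4)); print(round(bound, 4))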
2. For each distribution listed below, plot the corresponding characteristic
function of the density as a function of t if the characteristic function is
real-valued, or as a function of t on the complex plane if the function is
complex-valued. Describe each characteristic function. Are there any prop-
erties of the associated random variables that have an apparent effect on
the properties of the characteristic function? See Section B.3 for details on
plotting complex functions in the complex plane.
a. Bernoulli(1/2)
b. Binomial(5, 1/2)
c. Geometric(1/2)
d. Poisson(2)
e. Uniform(0, 1)
f. Exponential(2)
g. Cauchy(0, 1)
3. For each value of µ and σ listed below, plot the characteristic function
of the corresponding N(µ, σ 2 ) distribution as a function of t in the com-
plex plane. Describe how the changes in the parameter values affect the
properties of the corresponding characteristic function. This will require a
three-dimensional plot. See Section B.3 for further details.
a. µ = 0, σ =1
b. µ = 1, σ =1
c. µ = 0, σ =2
d. µ = 1, σ =2
4. Random walks are a special type of discrete stochastic process that are able
to change from one state to any adjacent state according to a conditional
probability distribution. This experiment will investigate the properties of
random walks in one, two, and three dimensions.
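As a minimal illustration of the one-dimensional case, the following R sketch simulates a simple symmetric random walk with independent ±1 steps; the step distribution, the number of steps, and the seed are illustrative choices rather than part of the experiment as stated.

set.seed(100)                                # illustrative seed
n <- 1000                                    # number of steps
steps <- sample(c(-1, 1), size = n, replace = TRUE)
walk <- cumsum(steps)                        # position after each step
plot(1:n, walk, type = "l", xlab = "step", ylab = "position",
     main = "One-dimensional simple random walk")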
The man from the country has not expected such difficulties: the law should
always be accessible for everyone, he thinks, but as he now looks more closely at
the gatekeeper in his fur coat, at his large pointed nose and his long, thin, black
Tartars beard, he decides that it would be better to wait until he gets permission
to go inside.
Before the Law by Franz Kafka
3.1 Introduction
Let {X_n}_{n=1}^∞ be a sequence of random variables and let X be some other random variable. Under what conditions is it possible to say that X_n converges to X as n → ∞? That is, is it possible to define a limit for a sequence of random variables so that the statement
lim_{n→∞} X_n = X,
{X_n}_{n=1}^∞ match X with probability one, which would provide the definition that X_n converges to X as n → ∞ if
P(lim_{n→∞} X_n = X) = 1.
Under these conditions note that Theorem 2.7 (Tchebysheff) implies that for any ε > 0, P(|θ̂_n − θ| > ε) ≤ ε^{-2}V(θ̂_n). The limiting condition on the variance of θ̂_n and Definition 2.2 imply that
0 ≤ lim_{n→∞} P(|θ̂_n − θ| > ε) ≤ lim_{n→∞} ε^{-2}V(θ̂_n) = 0,
lim_{n→∞} P(|θ̂_n − θ| ≥ ε) = 0,
and Definition 3.1 implies that θ̂_n →p θ as n → ∞. In estimation theory this
property is called consistency. That is, θ̂n is a consistent estimator of θ. A
special case of this result applies to the sample mean. Suppose that X1 , . . . , Xn
are a set of independent and identically distributed random variables from a
distribution with mean θ and finite variance σ 2 . The sample mean X̄n is
an unbiased estimator of θ with variance n^{-1}σ², which converges to zero as n → ∞ as long as σ² < ∞. Therefore it follows that X̄_n →p θ as n → ∞ and
the sample mean is a consistent estimator of θ. This result is a version of what
are known as Laws of Large Numbers. In particular, this result is known as
the Weak Law of Large Numbers. Various results of this type can be proven
under many different conditions. In particular, it will be shown in Section 3.6
that the condition that the variance is finite can be relaxed. This result can be
visualized with the aid of simulated data. Consider simulating samples from a
N(0, 1) distribution of size n = 5, 10, 15, . . . , 250, where the sample mean X̄n
is computed on each sample. The Weak Law of Large Numbers states that
these sample means should converge in probability to 0 as n → ∞. Figure
3.2 shows the results of five such simulated sequences. An ε-band has been
plotted around 0. Note that all of the sequences generally become closer to 0
as n becomes larger, and that there is a point where all of the sequences are
within the ε-band. Remember that the definition of convergence in probability
is a result for random sequences. This does not mean that all such sequences
will be within the ε-band for a given sample size, only that the probability that
the sequences are within the ε-band converges to one as n → ∞. This can also
be observed from the fact that the individual sequences do not monotonically
converge to 0 as n becomes large. There are random fluctuations in all of the
sequences, but the overall behavior of the sequence does become closer to 0
as n becomes large.
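A simulation of this type requires only a few lines of R; the sketch below generates five sequences of sample means for n = 5, 10, . . . , 250 from a N(0, 1) population and plots them with an ε-band around zero. The value ε = 0.1 and the seed are arbitrary illustrative choices.

set.seed(200)                                # illustrative seed
n.vals <- seq(5, 250, by = 5)
epsilon <- 0.1                               # arbitrary width of the band
plot(range(n.vals), c(-1, 1), type = "n", xlab = "n", ylab = "sample mean")
abline(h = c(-epsilon, 0, epsilon), lty = c(2, 1, 2))
for (r in 1:5) {
  means <- sapply(n.vals, function(n) mean(rnorm(n)))
  lines(n.vals, means)
}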
Example 3.3. Let {c_n}_{n=1}^∞ be a sequence of real constants where lim_{n→∞} c_n = c, for some constant c ∈ R. Let {X_n}_{n=1}^∞ be a sequence of random variables with a degenerate distribution at c_n for all n ∈ N. That is, P(X_n = c_n) = 1 for all n ∈ N. Let ε > 0, then
P(|X_n − c| ≥ ε) = P(|X_n − c| ≥ ε | X_n = c_n)P(X_n = c_n) = P(|c_n − c| ≥ ε).
Definition 1.1 implies that for any ε > 0 there exists an n_ε ∈ N such that |c_n − c| < ε for all n > n_ε. Therefore P(|c_n − c| ≥ ε) = 0 for all n > n_ε, and it follows that
lim_{n→∞} P(|X_n − c| ≥ ε) = lim_{n→∞} P(|c_n − c| ≥ ε) = 0.
Therefore, by Definition 3.1 it follows that X_n →p c as n → ∞.
Example 3.4. Let X1 , . . . , Xn be a set of independent and identically dis-
tributed random variables from a Uniform(θ, θ + 1) distribution for some
θ ≥ 0. Let X(1) be the first sample order statistic. That is
X(1) = min{X1 , . . . , Xn }.
The distribution function of X(1) can be found by using the fact that if X(1) ≥ t
for some t ∈ R, then Xi ≥ t for all i = 1, . . . , n. Therefore, the distribution
function of X(1) is given by
Let ε > 0 and consider the inequality |X(1) −θ| ≤ ε. If ε ≥ 1 then |X(1) −θ| ≤ ε
with probability one because X(1) ∈ (θ, θ+1) with probability one. If ε ∈ (0, 1)
then
P (|X(1) − θ| < ε) = P (−ε < X(1) − θ < ε)
= P (θ − ε < X(1) < θ + ε)
= P (θ < X(1) < θ + ε)
= F(θ + ε)
= 1 − (1 − ε)^n,
where the fact that X(1) must be greater than θ has been used. Therefore
lim_{n→∞} P(|X_(1) − θ| < ε) = 1,
since 0 < 1 − ε < 1. Definition 3.1 then implies that X_(1) →p θ as n → ∞, or
that X(1) is a consistent estimator of θ.
Hence, almost certain convergence requires that the set of all ω for which
Xn (ω) converges to X(ω), have probability one. Note that the limit used in
Equation (3.1) is the usual limit for a sequence of constants given in Definition
1.1, as when ω is fixed, Xn (ω) is a sequence of constants. See Figure 3.3.
Example 3.5. Consider the sample space Ω = [0, 1] with probability measure
P such that ω is chosen according to a Uniform[0, 1] distribution on the Borel
σ-field B[0, 1]. Define a sequence {X_n}_{n=1}^∞ of random variables as X_n(ω) = δ{ω; [0, n^{-1})}. Let ω ∈ [0, 1] be fixed and note that there exists an n_ω ∈ N such that n^{-1} < ω for all n ≥ n_ω. Therefore X_n(ω) = 0 for all n ≥ n_ω, and it follows that for this value of ω
lim_{n→∞} X_n(ω) = 0.
"
Definition 3.2 can be difficult to apply! in practice, and is not always use-
ful when studying the properties of almost certain convergence. By applying
Definition 1.1 to the limit inside the probability in Equation (3.1), an equiva-
lent definition that relates almost certain convergence to a statement that is
similar to the one used in Definition 3.1 can be obtained.
Theorem 3.1. Let {X_n}_{n=1}^∞ be a sequence of random variables. Then X_n converges almost certainly to a random variable X as n → ∞ if and only if
for every ε > 0,
Proof. This result is most easily proven by rewriting the definition of a limit
using set operations. This is the method used by Halmos (1950), Serfling
(1980), and many others. To prove the equivalence, consider the set
A = {ω : lim_{n→∞} X_n(ω) = X(ω)}.
Definition 1.1 implies that
This implies that the sequence of events within the intersection on the right
hand side of Equation (3.4) is monotonically decreasing as ε → 0. Therefore,
Theorem 2.15 implies that
A = lim_{ε→0} ⋃_{n=1}^{∞} {ω : |X_m(ω) − X(ω)| < ε for all m ≥ n}.
and hence
P(lim_{n→∞} X_n = X) = 1,
so that X_n →a.c. X as n → ∞. Now suppose that X_n →a.c. X as n → ∞ and
let ε > 0 and note that Equation (3.5) implies that
1 = P(lim_{n→∞} X_n = X) =
lim_{ε→0} lim_{n→∞} P[ω : |X_m(ω) − X(ω)| < ε for all m ≥ n] ≤
lim_{n→∞} P[ω : |X_m(ω) − X(ω)| < ε for all m ≥ n],
so that
lim_{n→∞} P[ω : |X_m(ω) − X(ω)| < ε for all m ≥ n] = 1,
and the result is proven.
Example 3.6. Suppose that {U_n}_{n=1}^∞ is a sequence of independent Uniform(0, 1) random variables and let U_(1,n) be the smallest order statistic of U_1, . . . , U_n defined by U_(1,n) = min{U_1, . . . , U_n}. Let ε > 0, then because
U(1,n) ≥ 0 with probability one, it follows that
P (|U(1,m) − 0| < ε for all m ≥ n) = P (U(1,m) < ε for all m ≥ n)
= P (U(1,n) < ε),
where the second equality follows from the fact that if U(1,n) < ε then U(1,m) <
ε for all m ≥ n. Similarly, if U(1,m) < ε for all m ≥ n then U(1,n) < ε so that
it follows that the two events are equivalent. Now note that the independence
of the random variables in the sequence implies that when ε < 1,
P(U_(1,n) < ε) = 1 − P(U_(1,n) ≥ ε) = 1 − ∏_{k=1}^{n} P(U_k ≥ ε) = 1 − (1 − ε)^n.
When ε ≥ 1 the probability equals one for all n ∈ N. Hence Theorem 3.1 implies that U_(1,n) →a.c. 0 as n → ∞.
Theorem 3.1 is also useful in beginning to understand the relationship between
convergence in probability and almost certain convergence.
Theorem 3.2. Let {X_n}_{n=1}^∞ be a sequence of random variables that converge almost certainly to a random variable X as n → ∞. Then X_n →p X as n → ∞.
Proof. Suppose that {Xn }∞ n=1 is a sequence of random variables that converge
almost certainly to a random variable X as n → ∞. Then Theorem 3.1 implies
that for every ε > 0,
lim P (|Xm − X| < ε for all m ≥ n) = 1.
n→∞
which implies
lim P (|Xn (ω) − X(ω)| < ε) = 1.
n→∞
Therefore Definition 3.1 implies that X_n →p X as n → ∞.
Figure 3.4 The first ten subsets of the unit interval used in Example 3.7.
and, therefore,
P[lim_{n→∞} X_n(ω) = 0] = 0.
Therefore, this sequence does not converge almost certainly to 0. In fact, the
probability that Xn converges at all is zero. Note the fundamental difference
between the two modes of convergence demonstrated by this example. Con-
vergence in probability requires that Xn be arbitrarily close to X for a set of
ω whose probability limits to one, while almost certain convergence requires
that the set of ω for which Xn (ω) converges to X have probability one. This
latter set does not depend on n.
Complete convergence therefore requires that the series of these probabilities converges. For example, if P(|X_n − X| > ε) = n^{-1} then the sequence of random variables {X_n}_{n=1}^∞ would converge in probability to X, but not completely, as n → ∞. If P(|X_n − X| > ε) = n^{-2} then the sequence of random variables {X_n}_{n=1}^∞ would converge in probability and completely to X, as n → ∞.
Example 3.8. Let U be a Uniform(0, 1) random variable and define a sequence of random variables {X_n}_{n=1}^∞ such that X_n = δ{U; (0, n^{-2})}. Let ε > 0, then
P(X_n > ε) = 0 if ε ≥ 1, and P(X_n > ε) = n^{-2} if ε < 1.
Therefore, for every ε > 0,
Σ_{n=1}^∞ P(X_n > ε) ≤ Σ_{n=1}^∞ n^{-2} < ∞,
so that it follows from Definition 3.3 that X_n →c 0 as n → ∞.
Example 3.9. Let {θ̂_n}_{n=1}^∞ be a sequence of random variables such that E(θ̂_n) = c for all n ∈ N where c ∈ R is a constant that does not depend on n. Suppose further that V(θ̂_n) = n^{-2}τ where τ is a positive and finite constant that does not depend on n. Under these conditions, note that for any ε > 0, Theorem 2.7 implies that
P(|θ̂_n − c| > ε) ≤ ε^{-2}V(θ̂_n) = n^{-2}ε^{-2}τ.
Therefore, for every ε > 0,
Σ_{n=1}^∞ P(|θ̂_n − c| > ε) ≤ Σ_{n=1}^∞ n^{-2}ε^{-2}τ ≤ ε^{-2}τ Σ_{n=1}^∞ n^{-2} < ∞,
where we have used the fact that ε and τ do not depend on n. Therefore, θ̂_n →c c as n → ∞.
While complete convergence is sufficient to ensure convergence in probability,
we must still investigate the relationship between complete convergence and
almost certain convergence. As it turns out, complete convergence also implies
almost certain convergence, and is one of the strongest concepts of convergence
of random variables that we will study in this book.
Theorem 3.3. Let {X_n}_{n=1}^∞ be a sequence of random variables that converges completely to a random variable X as n → ∞. Then X_n →a.c. X as n → ∞.
Proof. There are several approaches to proving this result, including one based
on Theorems 2.17 and 2.18. See Exercise 13. The approach we use here is
used by Serfling (1980). Suppose that X_n →c X as n → ∞. This method of
proof shows that the complement of the event in Equation (3.2) has limiting
probability zero, which in turn proves that the event has limiting probability
one, and the almost certain convergence of the sequence then follows from
Theorem 3.1. Note that the complement of {|Xm (ω) − X(ω)| < ε for all m ≥
n} contains all ω ∈ Ω where |Xm (ω) − X(ω)| ≥ ε for at least one m ≥ n. That
is,
because |Xm (ω) − X(ω)| > ε must be true for at least one m ≥ n. Theorem
2.4 then implies that
P(⋃_{m=n}^{∞} {|X_m(ω) − X(ω)| > ε}) ≤ Σ_{m=n}^{∞} P(ω : |X_m(ω) − X(ω)| > ε)
must hold. Therefore, Definition 2.2 implies that for every ε > 0
lim_{n→∞} P(ω : |X_m(ω) − X(ω)| > ε for some m ≥ n) = 0.
Because the probabilities of an event and its complement always add to one,
the probability of the complement of the event given earlier must converge to
one. That is,
lim_{n→∞} P(ω : |X_m(ω) − X(ω)| ≤ ε for all m ≥ n) = 1.
Theorem 3.1 then implies that X_n →a.c. X as n → ∞.
The result in Theorem 3.3, coupled with Definition 3.3, essentially implies that
if a sequence {Xn }∞n=1 converges in probability to X as n → ∞ at a sufficiently
fast rate, then the sequence will also converge almost certainly to X. The fact
that complete convergence is not equivalent to almost certain convergence is
established by Example 3.5. The sequence of random variables is shown in that
example to converge almost certainly, but because P (ω : |Xn (ω) − X(ω)| >
ε) = n−1 if ε < 1, the sequence does not converge completely.
There are some conditions under which almost certain convergence and com-
plete convergence are equivalent.
Theorem 3.4. Let {X_n}_{n=1}^∞ be a sequence of independent random variables, and let c be a real constant. If X_n →a.c. c as n → ∞ then X_n →c c as n → ∞.
Proof. Suppose that X_n →a.c. c as n → ∞ where c is a real constant. Then,
for every ε > 0
lim_{n→∞} P(|X_m − c| ≤ ε for all m ≥ n) = 1,
or equivalently
lim_{n→∞} P(|X_m − c| > ε for at least one m ≥ n) = 0.
Note that
where the second equality follows from Theorem 2.16 and the fact that
{⋃_{m=n}^{∞} {|X_m − c| > ε}}_{n=1}^∞,
Note that Theorem 3.4 obtains an equivalence between almost certain conver-
gence and convergence in probability. When a sequence converges in probabil-
ity at a fast enough rate to a constant, convergence in probability and almost
certain convergence are equivalent. Such a result was the main motivation of
Hsu and Robbins (1947). Note further that convergence to a constant plays an
important role in the proof and the application of Corollary 2.1. If X_n →c X as n → ∞, then the sequence is not independent and Corollary 2.1 cannot be applied to the sequence of events {|X_n − X| > ε}_{n=1}^∞. But when X_n →c c as
n → ∞ the sequence is independent and Corollary 2.1 can be applied.
As with convergent sequences of real numbers, subsequences of convergent
sequences of random variables can play an important role in the development
of asymptotic theory.
Theorem 3.5. Let {X_n}_{n=1}^∞ be a sequence of random variables that converge in probability to a random variable X. Then there exists a non-decreasing sequence of positive integers {n_k}_{k=1}^∞ such that X_{n_k} →c X and X_{n_k} →a.c. X as k → ∞.
This section will investigate how the three modes of convergence studied in
Sections 3.2 and 3.3 can be applied to random vectors. Let {X_n}_{n=1}^∞ be a sequence of d-dimensional random vectors and let X be another d-dimensional random vector. For an arbitrary d-dimensional vector x' = (x_1, . . . , x_d) ∈ R^d
let ||x|| be the usual vector norm in d-dimensional Euclidean space defined by
||x|| = (Σ_{i=1}^{d} x_i²)^{1/2}.
When d = 1 the norm reduces to the absolute value of x, that is ||x|| = |x|. Therefore, we can generalize the one-dimensional requirement that |X_n(ω) − X(ω)| > ε to ||X_n(ω) − X(ω)|| > ε in the d-dimensional case.
Definition 3.4. Let {Xn }∞n=1 be a sequence of d-dimensional random vectors
and let X be another d-dimensional random vector.
This turns out to be the essential relationship required to establish that the
convergence of a random vector is equivalent to the convergence of the indi-
vidual elements of the random vector.
Theorem 3.6. Let {X_n}_{n=1}^∞ be a sequence of d-dimensional random vectors and let X be another d-dimensional random vector where X' = (X_1, . . . , X_d) and X'_n = (X_{1,n}, . . . , X_{d,n}) for all n ∈ N.
1. X_n →p X as n → ∞ if and only if X_{k,n} →p X_k as n → ∞ for all k ∈ {1, . . . , d}.
2. X_n →a.c. X as n → ∞ if and only if X_{k,n} →a.c. X_k as n → ∞ for all k ∈ {1, . . . , d}.
Proof. We will prove this result for convergence in probability. The remaining result is proven in Exercise 15. Suppose that X_n →p X as n → ∞. Then, from Definition 3.4 it follows that for every ε > 0
lim_{n→∞} P(||X_n − X|| ≤ ε) = 1,
where we have used the relationship in Equation (3.8). Now let k ∈ {1, . . . , d}.
Theorem 2.3 implies that since
⋂_{i=1}^{d} {ω : |X_{i,n}(ω) − X_i(ω)| ≤ ε} ⊂ {ω : |X_{k,n}(ω) − X_k(ω)| ≤ ε},
Now,
⋂_{i=1}^{d} {ω : |X_{i,n}(ω) − X_i(ω)| ≤ d^{-1}ε} ⊂ {ω : |X_{i,n}(ω) − X_i(ω)| ≤ ε},
so that
P(⋂_{i=1}^{d} {ω : |X_{i,n}(ω) − X_i(ω)| ≤ d^{-1}ε}) ≤ P(ω : |X_{i,n}(ω) − X_i(ω)| ≤ ε).
Proof. We will prove the second result of the theorem, leaving the proof of the first part as Exercise 18. Suppose that X_n →p c as n → ∞, so that Definition 3.1 implies that for every δ > 0
lim_{n→∞} P(|X_n − c| < δ) = 1.
Proof. We will prove the first result in this case. See Exercise 19 for proof of the second result. Definition 3.2 implies that if X_n →a.c. X as n → ∞ then
P[ω : lim_{n→∞} X_n(ω) = X(ω)] = 1.
Let
N = {ω ∈ Ω : lim_{n→∞} X_n(ω) = X(ω)},
and note that by assumption P(N) = P[C(g)] = 1. Consider ω ∈ N ∩ C(g). For such ω it follows from Theorem 1.3 that
lim_{n→∞} g[X_n(ω)] = g[X(ω)].
Example 3.2 discussed some general conditions under which an estimator θ̂n
of a parameter θ converges in probability to θ as n → ∞. In the special
case where the estimator is the sample mean calculated from a sequence of
independent and identically distributed random variables, Example 3.2 states
that the sample mean will converge in probability to the population mean
as long as the variance of the population is finite. This result is often called
the Weak Law of Large Numbers. The purpose of this section is to explore
other versions of this result. In particular we will consider alternate sets of
conditions under which the result remains the same. We will also consider
under what conditions the result can be strengthened to the Strong Law of
Large Numbers, for which the sample mean converges almost certainly to the
population mean. The first result given in Theorem 3.10 below shows that the
assumption that the variance of the population is finite can be removed as
long as the mean of the population exists and is finite.
Theorem 3.10 (Weak Law of Large Numbers). Let X1 , . . . , Xn be a set of
independent and identically distributed random variables from a distribution
F with finite mean θ and let X̄n be the sample mean computed on the random
variables. Then X̄_n →p θ as n → ∞.
To find a bound on the second term in Equation (3.9) first note that
A^c = (⋂_{k=1}^{n} {|X_k| ≤ nε³})^c = ⋃_{k=1}^{n} {|X_k| ≤ nε³}^c = ⋃_{k=1}^{n} {|X_k| > nε³},
which follows from Theorem A.2. Therefore, Theorems 2.3 and 2.4 and the
fact that the random variables are identically distributed imply
P({|S_n − E(T_n)| > nε} ∩ A^c) ≤ P(⋃_{k=1}^{n} {|X_k| > nε³})
≤ Σ_{k=1}^{n} P(|X_k| > nε³)
= nP(|X_1| > nε³). (3.11)
Combining the results of Equations (3.9)–(3.11) implies that
P (|Sn − E(Tn )| > nε) ≤ εE(|X1 |) + nP (|X1 | > nε3 ).
Let G be the distribution of |X1 |, then note that Theorem A.7 implies that
nP(|X_1| > nε³) = n ∫_{nε³}^{∞} dG(t) = ε^{-3} ∫_{nε³}^{∞} nε³ dG(t) ≤ ε^{-3} ∫_{nε³}^{∞} t dG(t).
We use the limit supremum in the limit instead of the usual limit since we do
not yet know whether the sequence converges or not. Equivalently, we have
shown that
lim sup_{n→∞} P(|n^{-1}S_n − n^{-1}E(T_n)| > ε) ≤ εE(|X_1|).
The Strong Law of Large Numbers keeps the same essential result as Theorem
3.10 except that the mode of convergence is strengthened from convergence
in probability to almost certain convergence. The path to this stronger result
requires slightly more complicated mathematics, and we will therefore develop
some intermediate results before presenting the final result and its proof. The
general approach used here is the development used by Feller (1971). Some-
what different approaches to this result can be found in Gut (2005), Gnedenko
(1962), and Sen and Singer (1993), though the basic ideas are essentially the
same.
Theorem 3.11. Let {X_n}_{n=1}^∞ be a sequence of independent random variables where E(X_n) = 0 for all n ∈ N and
Σ_{n=1}^∞ E(X_n²) < ∞.
where the second inequality is due to the fact that the terms of the sequence are non-negative. Now, the right hand side of Equation (3.12) does not depend on m, so that we can take the limit of the left hand side as m → ∞. Theorem 2.16 then implies that
P(sup_{k≥n} |S_k − S_n| > ε) = P(⋃_{k=n}^{∞} {|S_k − S_n| > ε}) ≤ ε^{-2} Σ_{k=n+1}^{∞} V(X_k).
Theorem 3.11 actually completes much of the work we need to prove the
Strong Law of Large Numbers in that we now know that the sum converges
almost certainly to a limit. However, the assumption on the variance of Xn is
quite strong and we will need to find a way to weaken this assumption. The
method will be the same as used in the proof of Theorem 3.10 in that we
will use truncated random variables. In order to apply the result in Theorem
3.11 to these truncated random variables a slight generalization of the result
is required.
Corollary 3.1. Let {X_n}_{n=1}^∞ be a sequence of independent random variables where E(X_n) = 0 for all n ∈ N. Let {b_n}_{n=1}^∞ be a monotonically increasing sequence of real numbers such that b_n → ∞ as n → ∞. If
Σ_{n=1}^∞ b_n^{-2} E(X_n²) < ∞,
then
Σ_{n=1}^∞ b_n^{-1} X_n
converges almost certainly to some limit and b_n^{-1} S_n →a.c. 0 as n → ∞.
Proof. The first result is obtained directly from Theorem 3.11 using {b_n^{-1} X_n}_{n=1}^∞ as the sequence of random variables of interest. In that case we require
Σ_{n=1}^∞ E[(b_n^{-1} X_n)²] = Σ_{n=1}^∞ b_n^{-2} E(X_n²) < ∞,
which is the condition that is assumed. To prove the second result see Exercise
23.
The final result required to prove the Strong Law of Large Numbers is a
condition on the existence of the mean of a random variable.
Theorem 3.12. Let {X_n}_{n=1}^∞ be a sequence of random variables where X_n has distribution function F for all n ∈ N. Then, for any ε > 0,
Σ_{n=1}^∞ P(|X_n| > nε) < ∞,
if and only if E(|X_n|) < ∞.
Proof. Suppose that E(Xn ) exists and equals θ. We will consider the case
when θ = 0. The case when θ 6= 0 can be proven using the same methodology
used at the end of the proof of Theorem 3.10. Define two new sequences
of random variables {Y_n}_{n=1}^∞ and {Z_n}_{n=1}^∞ as Y_n = X_n δ{|X_n|; [0, n]} and Z_n = X_n δ{|X_n|; (n, ∞)}. Hence, it follows that
X̄_n = n^{-1} Σ_{k=1}^{n} X_k = n^{-1} Σ_{k=1}^{n} Y_k + n^{-1} Σ_{k=1}^{n} Z_k.
Because E(Xn ) exists, Theorem 3.12 implies that for every ε > 0,
Σ_{n=1}^∞ P(|X_n| > nε) < ∞,
and Theorem 2.17 then implies that P ({Zn 6= 0} i.o.) = 0. This means that
with probability one there will only be a finite number of times that Zn 6= 0
over all the values of n ∈ N, which implies that the sum
Σ_{n=1}^∞ Z_n
converges with probability one, and therefore
n^{-1} Σ_{k=1}^{n} Z_k →a.c. 0,
as n → ∞. The convergence behavior of the sum
Σ_{k=1}^{n} Y_k,
will be studied with the aid of Corollary 3.1. To apply this result we must show that
Σ_{n=1}^∞ n^{-2} E(Y_n²) < ∞.
First note that since Yn is the version of Xn truncated at n and −n we have
that
E(Y_n²) = ∫_{-∞}^{∞} x² δ{|x|; [0, n]} dF(x) = ∫_{-n}^{n} x² dF(x) = Σ_{k=1}^{n} ∫_{R_k} x² dF(x),
Note that the set of pairs (n, k) for which n ∈ {1, 2, . . .} and k ∈ {1, 2, . . . , n}
are the same set of pairs for which k ∈ {1, 2, . . .} and n ∈ {k, k + 1, . . .}. This
allows us to change the order in the double sum as
Σ_{n=1}^∞ n^{-2} E(Y_n²) = Σ_{k=1}^∞ Σ_{n=k}^∞ n^{-2} ∫_{R_k} x² dF(x) = Σ_{k=1}^∞ [∫_{R_k} x² dF(x)] (Σ_{n=k}^∞ n^{-2}),
Now
Σ_{n=k}^{∞} n^{-2} ≤ 2k^{-1},
so that
Σ_{n=1}^∞ n^{-2} E(Y_n²) ≤ Σ_{k=1}^∞ ∫_{R_k} 2x²k^{-1} dF(x).
where the last inequality follows from our assumptions. Therefore we have
shown that
Σ_{n=1}^∞ n^{-2} E(Y_n²) < ∞.
We are now in the position where Corollary 3.1 can be applied to the centered
sequence, which allows us to conclude that
n^{-1} Σ_{k=1}^{n} [Y_k − E(Y_k)] = n^{-1} Σ_{k=1}^{n} Y_k − E(n^{-1} Σ_{k=1}^{n} Y_k) →a.c. 0,
as n → ∞, leaving us to evaluate the asymptotic behavior of
E(n^{-1} Σ_{k=1}^{n} Y_k) = n^{-1} Σ_{k=1}^{n} E(Y_k).
This is the same problem encountered in the proof of Theorem 3.10, and the
same solution based on Theorem 2.12 implies that
lim_{n→∞} n^{-1} Σ_{k=1}^{n} E(Y_k) = 0.
as n → ∞.
Aside from the stronger conclusion about the mode of convergence about the
sample mean, there is another major difference between the Weak Law of
Large Numbers and the Strong Law of Large Numbers. Section VII.8 of Feller
(1971) actually shows that the existence of the mean of Xn is both a necessary
and sufficient condition to assure the almost certain convergence of the sample
mean. As it turns out, the existence of the mean is not a necessary condition
for a properly centered sample mean to converge in probability to a limit.
Theorem 3.14. Let {X_n}_{n=1}^∞ be a sequence of independent random variables each having a common distribution F. Then X̄_n − E(X_1 δ{|X_1|; [0, n]}) →p 0 as n → ∞ if and only if
lim_{n→∞} nP(|X_1| > n) = 0.
Figure 3.5 The results of a small simulation demonstrating the behavior of sample
means computed from a Cauchy(0, 1) distribution. Each line represents a sequence
of sample means computed on a sequence of independent Cauchy(0, 1) random vari-
ables. The means were computed when n = 5, 10, . . . , 250.
sample mean will not converge to zero. To observe the behavior of the sample
mean in this case see Figure 3.5, where five realizations of the sample mean
have been plotted for n = 5, 10, . . . , 250. Note that the values of the mean do
not appear to be settling down as we observed in Figure 3.2.
Example 3.22. Let {X_n}_{n=1}^∞ be a sequence of independent and identically distributed random variables from a continuous distribution with density
f(x) = (1/2)x^{-2} for |x| > 1, and f(x) = 0 for |x| ≤ 1.
We first note that
∫_{1}^{n} x f(x) dx = (1/2) ∫_{1}^{n} x^{-1} dx = (1/2) log(n),
so that
∫_{1}^{n} x f(x) dx → ∞,
as n → ∞, and therefore the mean of X1 does not exist. Checking the condition
in Theorem 3.14 we have that, due to the symmetry of the density,
nP(|X_1| > n) = 2n ∫_{n}^{∞} dF(x) = n ∫_{n}^{∞} x^{-2} dx = 1.
Therefore, Theorem 3.14 implies that X̄_n does not converge in probability to the truncated mean E(X_1 δ{|X_1|; [0, n]}), which in this case is zero due to symmetry. On the other hand, if we modify the tails of the density so that
they drop off at a slightly faster rate, then we can achieve convergence. For
example, consider the density suggested in Section 6.4 of Gut (2005), given
by
f(x) = c[x² log(|x|)]^{-1} for |x| > 2 and f(x) = 0 for |x| ≤ 2,
where c is a normalizing constant. In this case it can be shown that nP (|X1 | >
n) → 0 as n → ∞, but that the mean does not exist. However, we can still
conclude that X̄_n →p 0 as n → ∞ due to Theorem 3.14.
The Laws of Large Numbers given by Theorems 3.10 and 3.13 provide a char-
acterization of the limiting behavior of the sample mean as the sample size
n → ∞. The Law of the Iterated Logarithm provides information about the
extreme fluctuations of the sample mean as n → ∞.
Theorem 3.15 (Hartman and Wintner). Let {X_n}_{n=1}^∞ be a sequence of independent random variables each having a common distribution F such that E(X_n) = µ and V(X_n) = σ² < ∞. Then
P(lim sup_{n→∞} n^{1/2}(X̄_n − µ)/{2σ² log[log(n)]}^{1/2} = 1) = 1,
and
P(lim inf_{n→∞} n^{1/2}(X̄_n − µ)/{2σ² log[log(n)]}^{1/2} = −1) = 1.
A proof of Theorem 3.15 can be found in Section 8.3 of Gut (2005). The-
orem 3.15 shows that the extreme fluctuations of the sequence of random variables given by {Z_n}_{n=1}^∞ = {n^{1/2}σ^{-1}(X̄_n − µ)}_{n=1}^∞ are about the same size as {2 log[log(n)]}^{1/2}. More precisely, let ε > 0 and consider the interval
I_{ε,n} = [−(1 + ε){2 log[log(n)]}^{1/2}, (1 + ε){2 log[log(n)]}^{1/2}],
then all but a finite number of values in the sequence {Z_n}_{n=1}^∞ will be contained in I_{ε,n} with probability one. On the other hand if we define the interval
J_{ε,n} = [−(1 − ε){2 log[log(n)]}^{1/2}, (1 − ε){2 log[log(n)]}^{1/2}],
then Z_n ∉ J_{ε,n} an infinite number of times with probability one.
Example 3.23. Theorem 3.15 is somewhat difficult to visualize, but some
simulated results can help. In Figure 3.6 we have plotted a realization of
n1/2 X̄n for n = 1, . . . , 500, where the population is N(0, 1), along with its
Figure 3.6 A simulated example of the behavior indicated by Theorem 3.15. The
solid line is a realization of n1/2 X̄n for a sample of size n from a N(0, 1) dis-
tribution with n = 1, . . . , 500. The dotted line indicates the extent of the envelope
±{2 log[log(n)]}1/2 and the dashed line indicates the extreme fluctuations of n1/2 X̄n .
extreme fluctuations and the limits ±{2 log[log(n)]}1/2 . One would not ex-
pect the extreme fluctuations of n1/2 X̄n to exactly follow the limits given by
±{2 log[log(n)]}1/2 , but note in our realization that the general shape of the
fluctuations does follow the envelope fairly well as n becomes larger.
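A realization of this type can be generated with a few lines of R; the sketch below plots n^{1/2}X̄_n for a single N(0, 1) sample of size 500 together with the envelope ±{2 log[log(n)]}^{1/2}. The seed is an illustrative choice, and the envelope is undefined (and simply not drawn) for n = 1, 2.

set.seed(300)                                # illustrative seed
n.max <- 500
x <- rnorm(n.max)
zn <- sqrt(1:n.max) * cumsum(x) / (1:n.max)  # n^(1/2) * sample mean
env <- sqrt(2 * log(log(1:n.max)))           # NaN for n = 1, 2; not drawn
plot(1:n.max, zn, type = "l", xlab = "n", ylab = "sqrt(n) * sample mean")
lines(1:n.max, env, lty = 3)
lines(1:n.max, -env, lty = 3)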
The next step in our development is to extend the consistency result of The-
orem 3.16 to the entire empirical distribution function. That is, we wish to
conclude that F̂_n is a consistent estimator of F, or that F̂_n converges al-
most certainly to F as n → ∞. This differs from the previous result in that
we wish to show that the random function F̂n becomes arbitrarily close to F
with probability one as n → ∞. Therefore, we require a measure of distance
between two distribution functions. Many distance functions, or metrics, can
be defined on the space of distribution functions. For examples, see Young
(1988). A common metric for comparing two distribution functions in statis-
tical inference is based on the supremum metric.
Theorem 3.17. Let F and G be two distribution functions. Then
d_∞(F, G) = ||F − G||_∞ = sup_{t∈R} |F(t) − G(t)|
is a metric in the space of distribution functions.
A stronger result can actually be proven in that the metric defined in Theorem
3.17 is actually a metric over the space of all functions. Now that a metric in
the space of distribution functions has been defined, it is relevant to ascertain
whether the empirical distribution function is a consistent estimator of F with
respect to this metric. That is, we would conclude that F̂n converges almost
certainly to F as n → ∞ if,
P(lim_{n→∞} ||F̂_n − F||_∞ = 0) = 1,
or equivalently that ||F̂_n − F||_∞ →a.c. 0 as n → ∞.
Theorem 3.18 (Glivenko and Cantelli). Let X1 , . . . , Xn be a set of indepen-
dent and identically distributed random variables from a distribution F , and
let F̂_n be the empirical distribution function computed on X_1, . . . , X_n. Then ||F̂_n − F||_∞ →a.c. 0 as n → ∞.
Proof. For a fixed value of t ∈ R, Theorem 3.16 implies that F̂_n(t) →a.c. F(t) as n → ∞. The result we wish to prove states that the maximum difference between F̂_n and F also converges to 0 as n → ∞, a stronger result. Rather than attempt to quantify the behavior of the maximum difference directly, we will instead prove that F̂_n(t) →a.c. F(t) uniformly in t as n → ∞. We will
follow the method of proof used by van der Vaart (1998). Alternate approaches
can be found in Sen and Singer (1993) and Serfling (1980). The result was first
proven under various conditions by Glivenko (1933) and Cantelli (1933).
Let ε > 0 be given. Then, there exists a partition of R given by −∞ = t0 <
t1 < . . . < tk = ∞ such that
lim_{t↑t_i} F(t) − F(t_{i−1}) < ε,
for some k ∈ N. We will begin by arguing that such a partition exists. First
consider the endpoints of the partition t1 and tk−1 . Because F is a distribution
function we know that
lim_{t→t_0} F(t) = lim_{t→−∞} F(t) = 0,
and hence there must be a point t_1 such that F(t_1) < ε. Similarly, since
lim_{t→t_k} F(t) = lim_{t→∞} F(t) = 1,
it follows that there must be a point tk−1 such that F (tk−1 ) is within ε of
F(t_k) = 1. If the distribution function is continuous on an interval (a, b),
then the definition of continuity implies that there must exist two points ti
and ti−1 such that F (ti ) − F (ti−1 ) < ε. Noting that for points where F is
continuous we have that
lim_{t↑t_i} F(t) = F(t_i),
which shows that the partition exists on any interval (a, b) where the distri-
bution function is continuous.
Now consider an interval (a, b) where there exists a point t′ ∈ (a, b) such that F has a jump of size δ′ at t′. That is,
δ′ = F(t′) − lim_{t↑t′} F(t).
There is no problem if δ′ < ε, for a specific value of ε, but this cannot be guaranteed for every ε > 0. However, the partition can still be created by setting one of the points of the partition exactly at t′. First, consider the case where t_i = t′. Considering the limit of F(t) as t ↑ t′ results in the value that F would take at t′ if F were left continuous at t′. It then follows that there must exist a point t_{i−1} such that
lim_{t↑t′} F(t) − F(t_{i−1}) < ε.
See Figure 3.7. In the case where t_{i+1} = t′, the property follows from the fact that F is always right continuous, and is therefore continuous on the interval (t_{i+1}, b) for some b. Therefore, there does exist a partition with the indicated
property. The partition is finite because the range of F , which is [0, 1], is
bounded.
Now consider t ∈ (ti−1 , ti ) for some i ∈ {0, . . . , k} and note that because F̂n
and F are non-decreasing it follows that
F̂_n(t) ≤ lim_{t↑t_i} F̂_n(t),
and
F(t) ≥ F(t_{i−1}) > lim_{t↑t_i} F(t) − ε,
so that it follows that
F̂_n(t) − F(t) ≤ lim_{t↑t_i} F̂_n(t) − lim_{t↑t_i} F(t) + ε.
Similar computations can be used to show that F̂n (t) − F (t) ≥ F̂n (ti−1 ) −
F (ti−1 ) − ε.
We already know that F̂_n(t) →a.c. F(t) for every t ∈ R. However, because the partition t_0 < t_1 < . . . < t_k is finite, it follows that F̂_n(t) →a.c. F(t) uniformly on the partition t_0 < t_1 < . . . < t_k. That is, for every ε > 0, there exists a positive integer n_ε such that |F̂_n(t) − F(t)| < ε for all n ≥ n_ε and t ∈ {t_0, t_1, . . . , t_k}, with probability one. To prove this we need only find n_{ε,t} such that |F̂_n(t) − F(t)| < ε for all n ≥ n_{ε,t} with probability one and assign n_ε = max{n_{ε,t_0}, . . . , n_{ε,t_k}}.
This implies that for every ε > 0 and t ∈ (t_{i−1}, t_i) there is a positive integer n_ε such that
F̂_n(t) − F(t) ≤ lim_{t↑t_i} F̂_n(t) − lim_{t↑t_i} F(t) + ε ≤ 2ε,
and
F̂_n(t) − F(t) ≥ F̂_n(t_{i−1}) − F(t_{i−1}) − ε ≥ −2ε.
Hence, for every ε > 0 and t ∈ (t_{i−1}, t_i) there is a positive integer n_ε such that |F̂_n(t) − F(t)| < 2ε for all n ≥ n_ε, with probability one. Noting that the value of n_ε does not depend on i, we have proven that for every ε > 0 there is a positive integer n_ε such that |F̂_n(t) − F(t)| < 2ε for every n ≥ n_ε and t ∈ R, with probability one. Therefore, F̂_n(t) converges uniformly to F(t)
with probability one. This uniform convergence implies that
P(lim_{n→∞} sup_{t∈R} |F̂_n(t) − F(t)| = 0) = 1,
or that
sup_{t∈R} |F̂_n(t) − F(t)| →a.c. 0,
as n → ∞.
Figure 3.7 Constructing the partition used in the proof of Theorem 3.18 when there
is a discontinuity in the distribution function. By locating a partition point at the
jump point (grey line), the continuity of the distribution function to the left of this
point can be used to find a point (dotted line) such that the difference between the
distribution function at these two points does not exceed ε for any specified ε > 0.
"%&
"%#
"%"
!! " ! # $
)
where the integral is evaluated using Definition 2.10. This estimate is known
as the k th sample moment. The properties of this estimate are detailed below.
Theorem 3.19. Let X1 , . . . , Xn be a set of independent and identically dis-
tributed random variables from a distribution F .
The kth central moment can be handled in a similar way. From Definition 2.9 the kth central moment of F is
µ_k = ∫_{-∞}^{∞} (x − µ'_1)^k dF(x), (3.16)
where we will again use the assumption in Equation (3.15) that E(|X|m ) < ∞
for a value of m to be determined later. Substituting the empirical distribution
".4
".2
"."
!1 " 1 2 3 4
x
function for F in Equation (3.16) provides an estimate of µ_k given by
µ̂_k = ∫_{-∞}^{∞} [x − ∫_{-∞}^{∞} t dF̂_n(t)]^k dF̂_n(x) = n^{-1} Σ_{i=1}^{n} (X_i − µ̂'_1)^k.
This estimate has a more complex structure than that of the k th sample
moment which makes the bias and standard error more difficult to obtain.
One result which makes this job slightly easier is given below.
Theorem 3.20. Let {X_n}_{n=1}^∞ be a sequence of independent and identically distributed random variables such that E(X_1) = 0 and E(|X_n|^k) < ∞ for some k ≥ 2. Then
E|n^{-1} Σ_{i=1}^{n} X_i|^k = O(n^{-k/2}),
as n → ∞.
A proof of Theorem 3.20 can be found in Chapter 19 of Loéve (1977). A similar
result is given in Lemma 9.2.6.A of Serfling (1980).
Theorem 3.21. Let X1 , . . . , Xn be a set of independent and identically dis-
tributed random variables from a distribution F .
".%
".#
"."
as n → ∞.
as n → ∞.
3. If E(|X_1|^k) < ∞ then µ̂_k →a.c. µ_k as n → ∞.
Proof. We follow the method of Serfling (1980) to prove the first result. We
begin by noting that
µ̂_k = n^{-1} Σ_{i=1}^{n} (X_i − µ̂'_1)^k
= n^{-1} Σ_{i=1}^{n} [(X_i − µ'_1) + (µ'_1 − µ̂'_1)]^k
= n^{-1} Σ_{i=1}^{n} Σ_{j=0}^{k} \binom{k}{j} (µ'_1 − µ̂'_1)^j (X_i − µ'_1)^{k−j}
= Σ_{j=0}^{k} \binom{k}{j} (µ'_1 − µ̂'_1)^j [n^{-1} Σ_{i=1}^{n} (X_i − µ'_1)^{k−j}].
Therefore,
E(µ̂_k) = Σ_{j=0}^{k} \binom{k}{j} E{(µ'_1 − µ̂'_1)^j [n^{-1} Σ_{i=1}^{n} (X_i − µ'_1)^{k−j}]}. (3.17)
Σ_{j=1}^{k} \binom{k}{j} E{(µ'_1 − µ̂'_1)^j [n^{-1} Σ_{i=1}^{n} (X_i − µ'_1)^{k−j}]}. (3.18)
( " n
#)
k 0 0 −1
X
0 k−1
E (µ1 − µ̂1 ) n (Xi − µ1 ) =
1 i=1
n X
X n
kn−2 E[(µ01 − Xi )(Xj − µ01 )k−1 ].
i=1 j=1
Note that when i 6= j that the two terms in the sum are independent, and
therefore the expectation of the product is the product of the expectations. In
all of these cases E(Xi − µ01 ) = 0 and the term vanishes. When i = j the term
in the sum equals −E[(Xi − µ01 )k ] = −µk . Combining these results implies
that the first term of Equation (3.18) is −n−1 kµk . The second term (j = 2)
of Equation (3.18) equals
\binom{k}{2} E{(µ'_1 − µ̂'_1)² [n^{-1} Σ_{i=1}^{n} (X_i − µ'_1)^{k−2}]} =
(1/2)k(k − 1)n^{-3} E{[Σ_{i=1}^{n} (µ'_1 − X_i)]² Σ_{j=1}^{n} (X_j − µ'_1)^{k−2}} =
(1/2)k(k − 1)n^{-3} Σ_{i=1}^{n} Σ_{j=1}^{n} Σ_{l=1}^{n} E[(µ'_1 − X_i)(µ'_1 − X_j)(X_l − µ'_1)^{k−2}]. (3.19)
When i differs from j and l the expectation is zero due to independence. When
i = j, the sum in Equation (3.19) becomes
(1/2)k(k − 1)n^{-3} Σ_{i=1}^{n} Σ_{j=1}^{n} E[(µ'_1 − X_i)²(X_j − µ'_1)^{k−2}]. (3.20)
When i ≠ j the two terms in the sum in Equation (3.20) are independent, and therefore Equation (3.20) equals
(1/2)k(k − 1)n^{-3}[n(n − 1)µ_2µ_{k−2} + nµ_k] = (1/2)k(k − 1)n^{-1}µ_2µ_{k−2} + O(n^{-2}),
" n
#j " n
#
k −1
X
0 −1
X
0 k−j
n (µ1 − Xi ) n (Xi − µ1 ) .
j i=1 i=1
Now apply Theorem 2.10 (Hölder) to the two sums to yield
E|[n^{-1} Σ_{i=1}^{n} (µ'_1 − X_i)]^j [n^{-1} Σ_{i=1}^{n} (X_i − µ'_1)^{k−j}]| ≤
{E|[n^{-1} Σ_{i=1}^{n} (µ'_1 − X_i)]^j|^{k/j}}^{j/k} × {E|n^{-1} Σ_{i=1}^{n} (X_i − µ'_1)^{k−j}|^{k/(k−j)}}^{(k−j)/k}. (3.21)
The first factor can be written as
{E|[n^{-1} Σ_{i=1}^{n} (µ'_1 − X_i)]^j|^{k/j}}^{j/k} = {E|n^{-1} Σ_{i=1}^{n} (µ'_1 − X_i)|^{k}}^{j/k} = [O(n^{-k/2})]^{j/k} = O(n^{-j/2}),
The population quantile as defined above is always unique, even though the
inverse of the distribution function may not be unique in every case. There
are three essential examples to consider. In the case where F (x) is contin-
uous and strictly increasing in the neighborhood of the quantile, then the
distribution function has a unique inverse in that neighborhood and it follows
that F (ξp ) = F [F −1 (p)] = p. The continuity of the function in this case also
guarantees that F (ξp −) = F (ξp ) = p. That is, the quantile ξp can be seen
as the unique solution to the equation F (ξp −) = F (ξp ) = p with respect to
ξp . See Figure 3.11. In the case where a discontinuous jump occurs at the
quantile, the distribution function does not have an inverse in the sense that
F [F −1 (p)] = p. The quantile ξp as defined above is located at the jump point.
In this case F (ξp −) < p < F (ξp ), and once again the quantile can be defined
to be the unique solution to the equation F (ξp −) < p < F (ξp ) with respect
to ξp . See Figure 3.12. In the last case F is continuous in a neighborhood
of the quantile, but is not increasing. In this case, due to the continuity of
F in the neighborhood of the quantile, F (ξp −) = p = F (ξp ). However, the
difference in this case is that the quantile is not the unique solution to the
equation F (ξp −) = p = F (ξp ) in that any point in the non-increasing neigh-
borhood of the quantile will also be a solution to this equation. See Figure
3.13. Therefore, for the first two situations (Figures 3.11 and 3.12) there is
Figure 3.12 When the distribution function has a discontinuous jump at the quantile, then F(ξ_p−) < p < F(ξ_p).
Figure 3.13 When the distribution function is not increasing in a neighborhood of the
quantile, then F (ξp −) = p = F (ξp ), but there is no unique solution to this equation.
Figure 3.14 Example of computing a sample quantile when p = kn−1 . In this exam-
ple, the empirical distribution function for a Uniform(0, 1) sample of size n = 5 is
plotted and we wish to estimate ξ0.6 . In this case p = 0.6 = kn−1 where k = 3 so
that the estimate corresponds to the third order statistic.
Figure 3.15 Example of computing a sample quantile when p 6= kn−1 . In this exam-
ple, the empirical distribution function for a Uniform(0, 1) sample of size n = 5 is
plotted and we wish to estimate the median ξ0.5 . In this case there is not a value of
k such that p = 0.5 = kn−1 , but when k = 3 we have that (k − 1)n−1 < p < kn−1
and therefore the estimate of the median corresponds to the third order statistic.
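The rule illustrated in Figures 3.14 and 3.15 amounts to taking the kth order statistic with k the smallest integer satisfying k ≥ np. A small R sketch of this convention is given below; note that it may differ from the default convention used by R's built-in quantile function, and the seed is an illustrative choice.

sample.quantile <- function(x, p) {
  # estimate of xi_p: the order statistic X_(k) with k = ceiling(n * p)
  sort(x)[ceiling(length(x) * p)]
}
set.seed(400)                                # illustrative seed
u <- runif(5)                                # a Uniform(0, 1) sample of size n = 5
sample.quantile(u, 0.6)                      # third order statistic, as in Figure 3.14
sample.quantile(u, 0.5)                      # also the third order statistic (Figure 3.15)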
with common distribution F . Suppose that p ∈ (0, 1) and that ξp is the unique
solution to F(ξ_p−) ≤ p ≤ F(ξ_p). Then ξ̂_p →a.c. ξ_p as n → ∞.
Proof. Let ε > 0 and note that the assumption that ξ_p is the unique solution to F(ξ_p−) ≤ p ≤ F(ξ_p) implies that F(ξ_p − ε) < p < F(ξ_p + ε). Now, Theorem 3.16 implies that F̂_n(x) →a.c. F(x) for every x ∈ R so that F̂_n(ξ_p − ε) →a.c. F(ξ_p − ε) and F̂_n(ξ_p + ε) →a.c. F(ξ_p + ε) as n → ∞. Theorem 3.1 implies that for every δ > 0
lim_{n→∞} P[|F̂_m(ξ_p − ε) − F(ξ_p − ε)| < δ for all m ≥ n] = 1,
and
lim_{n→∞} P[|F̂_m(ξ_p + ε) − F(ξ_p + ε)| < δ for all m ≥ n] = 1.
Now, take δ small enough so that
lim_{n→∞} P[F̂_m(ξ_p − ε) < p < F̂_m(ξ_p + ε) for all m ≥ n] = 1.
3.10.1 Exercises
is a consistent estimator of θ.
4. Let U be a Uniform(0,1) random variable and define a sequence of random variables {X_n}_{n=1}^∞ as X_n = δ{U; (0, n^{-1})}. Prove that X_n →p 0 as n → ∞.
5. Let {c_n}_{n=1}^∞ be a sequence of real constants such that
lim_{n→∞} c_n = c,
b. Prove that S_n² is an unbiased estimator of µ_2, that is, prove that E(S_n²) = µ_2 for all µ_2 > 0.
c. Prove that the variance of µ̂_2 is
V(µ̂_2) = n^{-1}[µ_4 − (n − 3)(n − 1)^{-1}µ_2²].
d. Use the results derived above to prove that µ̂_2 is a consistent estimator of µ_2. That is, prove that µ̂_2 →p µ_2 as n → ∞.
e. Relate the results observed here with the results given in Theorem 3.21.
8. Consider a sequence of independent random variables {X_n}_{n=1}^∞ where X_n has probability distribution function
f_n(x) = 2^{-(n+1)} for x = −2^{n(1−ε)} or x = 2^{n(1−ε)}, f_n(x) = 1 − 2^{-n} for x = 0, and f_n(x) = 0 elsewhere,
where ε > 1/2 (Sen and Singer, 1993).
a. Compute the mean and variance of Xn .
b. Let
X̄_n = n^{-1} Σ_{k=1}^{n} X_k,
for all n ∈ N. Compute the mean and variance of X̄n .
c. Prove that X̄_n →p 0 as n → ∞.
9. Let {X_n}_{n=1}^∞ be a sequence of random variables such that
where α ∈ R.
a. For what values of α does X_n →p 0 as n → ∞?
b. For what values of α does X_n →a.c. 0 as n → ∞?
c. For what values of α does X_n →c 0 as n → ∞?
implies that
lim_{n→∞} b_n^{-1} Σ_{k=1}^{n} x_k = 0.
Use Kronecker’s Lemma to prove the second result in Corollary 3.1. That is, let {X_n}_{n=1}^∞ be a sequence of independent random variables where E(X_n) = 0 for all n ∈ N. Prove that if
Σ_{n=1}^∞ b_n^{-2} E(X_n²) < ∞,
then b_n^{-1} S_n →a.c. 0 as n → ∞.
24. Let {X_n}_{n=1}^∞ be a sequence of independent and identically distributed random variables from a Cauchy(0, 1) distribution. Prove that the mean of the distribution does not exist, and further prove that nP(|X_1| > n) → 2π^{-1} as n → ∞, so that the condition of Theorem 3.14 does not hold.
25. Let {X_n}_{n=1}^∞ be a sequence of independent and identically distributed random variables from a density of the form
f(x) = c[x² log(|x|)]^{-1} for |x| > 2 and f(x) = 0 for |x| ≤ 2,
where c is a normalizing constant. Prove that nP(|X_1| > n) → 0 as n → ∞, but that the mean does not exist. Hence, we can still conclude that X̄_n →p 0 as n → ∞ due to Theorem 3.14.
26. Prove Theorem 3.17. That is, let F and G be two distribution functions. Show that
||F − G||_∞ = sup_{t∈R} |F(t) − G(t)|
is a metric in the space of distribution functions.
27. In the proof of Theorem 3.18, verify that F̂_n(t) − F(t) ≥ F̂_n(t_{i−1}) − F(t_{i−1}) − ε.
28. Let X1 , . . . , Xn be a set of independent and identically distributed random
variables from a distribution F .
a. Prove that
3.10.2 Experiments
1. Consider an experiment that flips a fair coin 100 times. Define an indicator random variable B_k for k = 1, . . . , 100 so that B_k = 1 if the kth flip is heads and B_k = 0 if the kth flip is tails.
Run a simulation that will repeat this experiment 25 times. On the same set of axes, plot p̂_n versus n for n = 1, . . . , 100 with the points connected by lines, for each replication of the experiment. For comparison, plot a horizontal line at 1/2, the true probability of flipping heads. Comment on the outcome of the experiments. What results are being demonstrated by this set of experiments? How do these results relate to what is usually called the frequency method for computing probabilities?
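A minimal R sketch for this experiment is given below, where p̂_n is taken to be the proportion of heads among the first n flips; the seed is an illustrative choice.

set.seed(500)                                # illustrative seed
plot(c(1, 100), c(0, 1), type = "n", xlab = "n", ylab = "proportion of heads")
abline(h = 0.5, lty = 2)                     # true probability of heads
for (r in 1:25) {
  b <- rbinom(100, size = 1, prob = 0.5)     # B_1, ..., B_100
  p.hat <- cumsum(b) / (1:100)               # running sample proportion
  lines(1:100, p.hat)
}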
2. Write a program in R that generates a sample X1 , . . . , Xn from a specified
distribution F , computes the empirical distribution function of X1 , . . . , Xn ,
and plots both the empirical distribution function and the specified distri-
bution function F on the same set of axes. Use this program with n =
5, 10, 25, 50, and 100 to demonstrate the consistency of the empirical dis-
tribution function given by Theorem 3.16. Repeat this experiment for each
of the following distributions: N(0, 1), Binomial(10, 0.25), Cauchy(0, 1),
and Gamma(2, 4).
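A hedged R sketch for the N(0, 1) case of this experiment is given below; the helper function plot.edf, the plotting grid, and the seed are illustrative choices, and the other distributions can be handled by passing the corresponding functions (for the discrete Binomial case the true distribution function should be drawn as a step function instead).

set.seed(600)                                # illustrative seed
plot.edf <- function(x, pdist, ...) {
  # plot the empirical distribution function of x with the true F overlaid
  plot(ecdf(x), main = paste("n =", length(x)), xlab = "x", ylab = "F(x)")
  grid.x <- seq(min(x) - 1, max(x) + 1, length.out = 200)
  lines(grid.x, pdist(grid.x, ...), lty = 2)
}
for (n in c(5, 10, 25, 50, 100)) plot.edf(rnorm(n), pnorm)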
3. Write a program in R that generates a sample X1 , . . . , Xn from a specified
distribution F and computes the sample mean X̄n . Use this program with
n = 5, 10, 25, 50, 100, and 1000 and plot the sample size against X̄n . Repeat
the experiment five times, and plot all the results on a single set of axes.
Produce the plot described above for each of the following distributions
N(0, 1), T(1), and T(2). For each distribution state whether the Strong
Law of Large Numbers or the Weak Law of Large Numbers regulates the
behavior of X̄n . What differences in behavior are observed on the plots?
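One possible R sketch for this experiment is given below; the helper plot.means, the logarithmic axis for n, and the seed are illustrative choices rather than requirements of the exercise.

set.seed(700)                                # illustrative seed
n.vals <- c(5, 10, 25, 50, 100, 1000)
plot.means <- function(rdist, label, ...) {
  # five replications of the sample mean for each sample size
  means <- replicate(5, sapply(n.vals, function(n) mean(rdist(n, ...))))
  matplot(n.vals, means, type = "b", log = "x", pch = 1, lty = 1,
          xlab = "n", ylab = "sample mean", main = label)
}
plot.means(rnorm, "N(0,1)")
plot.means(rt, "T(1)", df = 1)
plot.means(rt, "T(2)", df = 2)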
4. Write a program in R that generates independent Uniform(0, 1) random
variables U1 , . . . , Un . Define two sequences of random variables X1 , . . . , Xn
and Y_1, . . . , Y_n as X_k = δ{U_k; (0, k^{-1})} and Y_k = δ{U_k; (0, k^{-2})}. Plot X_1, . . . , X_n and Y_1, . . . , Y_n against k = 1, . . . , n on the same set of axes for n = 25. Is it apparent from the plot that X_n →p 0 as n → ∞ but that Y_n →c 0 as n → ∞? Repeat this process five times to get an idea of the average behavior in each plot.
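A minimal R sketch for this experiment is given below; the seed and the plotting symbols are illustrative choices.

set.seed(800)                                # illustrative seed
n <- 25
u <- runif(n)
k <- 1:n
x <- as.numeric(u < 1 / k)                   # X_k = delta{U_k; (0, k^(-1))}
y <- as.numeric(u < 1 / k^2)                 # Y_k = delta{U_k; (0, k^(-2))}
plot(k, x, type = "b", pch = 1, ylim = c(0, 1), xlab = "k", ylab = "value")
points(k, y, pch = 16)
legend("topright", legend = c("X_k", "Y_k"), pch = c(1, 16))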
5. Write a program in R that generates a sample X1 , . . . , Xn from a specified
distribution F , computes the empirical distribution function of X1 , . . . , Xn ,
computes the maximum distance between F̂n and F , and computes the lo-
cation of the maximum distance between F̂n and F . Use this program with
n = 5, 10, 25, 50, and 100 and plot the sample size versus the maximum
distance to demonstrate Theorem 3.18. Separately, plot the location of the
maximum distance between F̂n and F against the sample size. Is there an
area where the maximum tends to stay, or does it tend to occur where F
has certain properties? Repeat this experiment for each of the following dis-
tributions: N(0, 1), Binomial(10, 0.25), Cauchy(0, 1), and Gamma(2, 4).
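A sketch of this experiment for the N(0, 1) case is given below. For a continuous F the supremum distance is attained at a jump of the empirical distribution function, which is what the helper sup.dist exploits; the discrete Binomial case would require evaluating the difference over the support directly. The seed is an illustrative choice.

set.seed(900)                                # illustrative seed
sup.dist <- function(x, pdist, ...) {
  n <- length(x)
  Fx <- pdist(sort(x), ...)
  # the supremum occurs just before or at a jump of the empirical cdf
  d <- pmax(abs((1:n) / n - Fx), abs((0:(n - 1)) / n - Fx))
  list(distance = max(d), location = sort(x)[which.max(d)])
}
n.vals <- c(5, 10, 25, 50, 100)
dists <- sapply(n.vals, function(n) sup.dist(rnorm(n), pnorm)$distance)
plot(n.vals, dists, type = "b", xlab = "n", ylab = "maximum distance")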
6. Write a program in R that generates a sample from a population with
distribution function
F(x) = 0 for x < −1, F(x) = 1 + x for −1 ≤ x < −1/2, F(x) = 1/2 for −1/2 ≤ x < 1/2, F(x) = x for 1/2 ≤ x < 1, and F(x) = 1 for x ≥ 1.
This distribution is Uniform on the set [−1, −1/2] ∪ [1/2, 1]. Use this program to generate samples of size n = 5, 10, 25, 50, 100, 500, and 1000. For each sample compute the sample median ξ̂_{0.5}. Repeat this process five times and plot the results on a single set of axes. What effect does the flat area of the distribution have on the convergence of the sample median? For comparison, repeat the entire experiment but compute ξ̂_{0.75} instead.
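One way to implement the sampler is through the inverse transform method, since the quantile function of this distribution is u − 1 for u < 1/2 and u otherwise. The sketch below uses R's built-in median function for ξ̂_{0.5}, a simplification of the order-statistic definition of Section 3.9, and the seed is an illustrative choice.

set.seed(1000)                               # illustrative seed
r.split.unif <- function(n) {
  u <- runif(n)
  ifelse(u < 0.5, u - 1, u)                  # inverse transform for this F
}
n.vals <- c(5, 10, 25, 50, 100, 500, 1000)
plot(range(n.vals), c(-1, 1), type = "n", log = "x", xlab = "n",
     ylab = "sample median")
for (r in 1:5) {
  med <- sapply(n.vals, function(n) median(r.split.unif(n)))
  lines(n.vals, med, type = "b")
}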
CHAPTER 4
Convergence of Distributions
“Ask them then,” said the deputy director. “It’s not that important,” said K.,
although in that way his earlier excuse, already weak enough, was made even
weaker. As he went, the deputy director continued to speak about other things.
The Trial by Franz Kafka
4.1 Introduction
In statistical inference it is often the case that we are not interested in whether
a random variable converges to another specific random variable, rather we are
just interested in the distribution of the limiting random variable. Statistical
hypothesis testing provides a good example of this situation. Suppose that we
have a random sample from a distribution F with mean µ, and we wish to
test some hypothesis about µ. The most common test statistic to use in this
situation is Zn = n1/2 σ̂n−1 (µ̂n −µ0 ) where µ̂n and σ̂n are the sample mean and
standard deviation, respectively. The value µ0 is a constant that is specified
by the null hypothesis. In order to derive a statistical hypothesis test for the
null hypothesis based on this test statistic we need to know the distribution
of Zn when the null hypothesis is true, which in this case we will take to be
the condition that µ = µ0 . If the parametric form of F is not known explicitly,
then this distribution can be approximated using the Central Limit Theorem,
which states that Zn approximately has a N(0, 1) distribution when n is large
and µ = µ0 . See Section 4.4. This asymptotic result does not identify a specific
random variable Z that Zn converges to as n → ∞. There is no need because
all we are interested in is the distribution of the limiting random variable. This
chapter will introduce and study a type of convergence that only specifies the
distribution of the random variable of interest as n → ∞.
functions of the random variables in the sequence {X_n}_{n=1}^∞ converge to the
distribution function of X.
If all of the random variables in the sequence {Xn }∞ n=1 and the random vari-
able X were all continuous then it might make sense to consider the densities
of the random variables. Or, if all of the random variables were discrete we
could consider the convergence of the probability distribution functions of the
random variables. Such approaches are unnecessarily restrictive. In fact, some
of the more interesting examples of convergence of distributions of sequences
of random variables are for sequences of discrete random variables that have
distributions that converge to a continuous distribution as n → ∞. By defining
the mode of convergence in terms of distribution functions, which are defined
for all types of random variables, our definition will allow for such results.
Definition 4.1. Let {X_n}_{n=1}^∞ be a sequence of random variables where X_n has distribution function F_n for all n ∈ N. Then X_n converges in distribution to a random variable X with distribution function F as n → ∞ if
lim_{n→∞} F_n(x) = F(x),
for all x ∈ C(F), the set of points where F is continuous. This relationship will be represented by X_n →d X as n → ∞.
It is clear from Definition 4.1 that the concept of convergence in distribution
literally requires that the distribution of the random variables in the sequence
converge to a distribution that matches the distribution of the limiting random
variable. Convergence in distribution is also often called weak convergence
since the random variables play a secondary role in Definition 4.1. In fact, the
concept of convergence in distribution can be defined without them.
Definition 4.2. Let {F_n}_{n=1}^∞ be a sequence of distribution functions. Then F_n converges weakly to F as n → ∞ if
lim_{n→∞} F_n(x) = F(x),
for all x ∈ R, we would not be able to conclude that the convergence takes
place in this instance. However, Definitions 4.1 and 4.2 do not have this strict
requirement, and noting that 0 ∉ C(F) in this case allows us to conclude that F_n converges weakly to F, or X_n →d X, as n → ∞.
where the change of variable v = t(1 + n^{-1})^{-1/2} has been used. This limit is valid for all x ∈ R, and since the distribution function of X is given by Φ(x), it follows from Definition 4.1 that X_n →d X as n → ∞. Note that we are required to verify that the limit is valid for all x ∈ R in this case because C(Φ) = R.
Example 4.3. Let {X_n}_{n=1}^∞ be a sequence of independent random variables where the distribution function of X_n is
F_n(x) = 1 − exp(θ − x) for x ∈ [θ, ∞), and F_n(x) = 0 for x ∈ (−∞, θ).
Define a new sequence of random variables given by Yn = min{X1 , . . . , Xn }
for all n ∈ N. The distribution function of Yn can be found by noting that
Yn > y if and only if Xk > y for all k ∈ {1, . . . , n}. Therefore, for y ∈ [θ, ∞),
Further, note that if y ∈ (−∞, θ), then P (Yn ≤ y) is necessarily zero due
to the fact that P (Xn ≤ θ) = 0 for all n ∈ N. Therefore, the distribution
function of Y_n is
G_n(y) = 1 − exp[n(θ − y)] for y ∈ [θ, ∞), and G_n(y) = 0 for y ∈ (−∞, θ).
Now
lim_{n→∞} ∏_{k=1}^{i} (n − k + 1)/(n − λ) = 1,
and Theorem 1.7 implies that
lim_{n→∞} (1 − λ/n)^n = exp(−λ).
Therefore, it follows that
lim_{n→∞} \binom{n}{i} (λ/n)^i (1 − λ/n)^{n−i} = λ^i exp(−λ)/i!,
for each i ∈ {0, 1, . . . , n}, and hence for 0 ≤ x ≤ n we have that
lim_{n→∞} F_n(x) = lim_{n→∞} Σ_{i=0}^{⌊x⌋} \binom{n}{i} (λ/n)^i (1 − λ/n)^{n−i} = Σ_{i=0}^{⌊x⌋} λ^i exp(−λ)/i!, (4.1)
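The convergence in Equation (4.1) is easy to check numerically; the short R sketch below compares the Binomial(n, λ/n) and Poisson(λ) distribution functions for λ = 2, an arbitrary illustrative value.

lambda <- 2
x <- 0:10
for (n in c(10, 50, 250)) {
  # maximum difference between F_n and the Poisson limit over x = 0, ..., 10
  cat("n =", n, " max difference =",
      max(abs(pbinom(x, size = n, prob = lambda / n) - ppois(x, lambda))), "\n")
}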
and
lim_{x→−∞} F(x) = 0.
The example given below shows that these properties do not always follow for
the limiting distribution.
Example 4.5. Lehmann (1999) considers a sequence of random variables {X_n}_{n=1}^∞ such that the distribution function of X_n is given by
F_n(x) = 0 for x < 0, F_n(x) = 1 − p_n for 0 ≤ x < n, and F_n(x) = 1 for x ≥ n,
where {p_n}_{n=1}^∞ is a sequence of real numbers such that
lim_{n→∞} p_n = p.
If p = 0 then
lim_{n→∞} F_n(x) = F(x) = 0 for x < 0, and F(x) = 1 for x ≥ 0,
which is a degenerate distribution at zero. However, if 0 < p < 1 then the
limiting function is given by
F(x) = 0 for x < 0, and F(x) = 1 − p for x ≥ 0.
It is clear in this case that
lim_{x→∞} F(x) = 1 − p < 1.
where {p_n}_{n=1}^∞ is a sequence of real numbers such that
lim_{n→∞} p_n = p.
If we first consider the case where p = 0, then for every ε > 0, we need only
use Definition 1.1 and find a value nε such that pn < ε for all n ≥ nε . For
this value of n, it will follow that P (|Xn | ≤ 0) ≥ 1 − ε for all n > nε , and
by Definition 4.3, the sequence is bounded in probability. On the other hand,
consider the case where p > 0 and we set a value of ε such that 0 < ε < p.
Let x be a positive real value. For any n > x we have the property that
P (|Xm | ≤ x) = 1 − p ≤ 1 − ε for all m > n. Therefore, it is not possible to
find the value of x required in Definition 4.3, and the sequence is not bounded
in probability.
In Examples 4.5 and 4.6 we found that when the sequence in question was
bounded in probability, the corresponding limiting distribution function was
a valid distribution function. When the sequence was not bounded in proba-
bility, the limiting distribution function was not a valid distribution function.
Hence, for that example, the property that the sequence is bounded in prob-
ability is equivalent to the condition that the limiting distribution function is
valid. This property is true in general.
Theorem 4.1. Let {X_n}_{n=1}^∞ be a sequence of random variables where X_n has distribution function F_n for all n ∈ N. Suppose that F_n ⇝ F as n → ∞ where F may or may not be a valid distribution function. Then,

    lim_{x→−∞} F(x) = 0   and   lim_{x→∞} F(x) = 1,

if and only if the sequence {X_n}_{n=1}^∞ is bounded in probability.
Theorem 4.2. Let {F_n}_{n=1}^∞ be a sequence of distribution functions that converge weakly to a distribution function F as n → ∞. Let {F_n^{-1}}_{n=1}^∞ be the corresponding sequence of quantile functions and let F^{-1} be the quantile function corresponding to F. Define N to be the set of points where F_n^{-1} does not converge pointwise to F^{-1}. That is,

    N = (0, 1) \ { t ∈ (0, 1) : lim_{n→∞} F_n^{-1}(t) = F^{-1}(t) }.

Then N is countable.
A proof of Theorem 4.2 can be found in Section 1.5.6 of Serfling (1980). As
Theorem 4.2 implies, there may be as many as a countable number of points
where the convergence does not take place. Certainly for the points where
the distribution function does not converge, we cannot expect the quantile
function to necessarily converge at the inverse of those points. However, the
result is not specific about the points at which the convergence may fail,
and other points may be included in the set N as well, such as the inverse
of points that occur where the distribution functions are not increasing. On
the other hand, there may be cases where the convergence of the quantile
functions may occur at all points in (0, 1), as the next example demonstrates.
Example 4.7. Let {X_n}_{n=1}^∞ be a sequence of random variables where X_n is an Exponential(θ + n^{-1}) random variable for all n ∈ N, where θ is a positive real constant. Let X be an Exponential(θ) random variable. It can be shown that X_n →d X as n → ∞. See Exercise 2. The quantile function associated with X_n is given by F_n^{-1}(t) = −(θ + n^{-1})^{-1} log(1 − t) for all n ∈ N and t ∈ (0, 1). Similarly, the quantile function associated with X is given by F^{-1}(t) = −θ^{-1} log(1 − t). Let t ∈ (0, 1) and note that

    lim_{n→∞} F_n^{-1}(t) = lim_{n→∞} −(θ + n^{-1})^{-1} log(1 − t) = −θ^{-1} log(1 − t) = F^{-1}(t).
Therefore, the convergence of the quantile functions holds for all t ∈ (0, 1),
and therefore N = ∅.
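A direct numerical check of this convergence is straightforward. The sketch below is our own, with θ = 2 chosen arbitrarily; it evaluates the quantile functions on a grid of t values and reports the largest discrepancy. Note that the Exponential distributions here are parameterized by their rate, matching the quantile functions above.

    import numpy as np

    theta = 2.0
    t = np.linspace(0.01, 0.99, 99)          # grid of probabilities in (0, 1)
    q_limit = -np.log(1 - t) / theta          # F^{-1}(t) for the Exponential(theta) limit

    for n in (1, 10, 100, 1000):
        q_n = -np.log(1 - t) / (theta + 1 / n)   # F_n^{-1}(t) for Exponential(theta + 1/n)
        print(n, np.max(np.abs(q_n - q_limit)))  # discrepancy shrinks on the order of 1/n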
Proof. We follow the development of this result given in Sen and Singer (1993).
Let ε > 0 and consider a partition of the interval [a, b] given by a = x0 <
x1 < · · · < xm < xm+1 = b. We will assume that xk is a continuity point of
F for all k ∈ {0, 1, . . . , m + 1} and that xk+1 − xk < ε for k ∈ {0, 1, . . . , m}.
We will form a step function to approximate g on the interval [a, b]. Define
g_m(x) = g[½(x_k + x_{k+1})] whenever x ∈ (x_k, x_{k+1}) and note that because a = x_0 < · · · < x_{m+1} = b it follows that for every m ∈ N and x ∈ [a, b] we can write g_m(x) as

    g_m(x) = Σ_{k=0}^{m} g[½(x_k + x_{k+1})] δ{x; (x_k, x_{k+1})}.
To bound the first term we note that since x_{k+1} − x_k < ε for all k ∈ {0, . . . , m}, it follows that if x ∈ (x_k, x_{k+1}) then |x − ½(x_k + x_{k+1})| < ε. Because g is a continuous function, there exists η_ε > 0 such that |g_m(x) − g(x)| = |g[½(x_k + x_{k+1})] − g(x)| < η_ε for all x ∈ (x_k, x_{k+1}). Therefore, there exists δ_ε > 0 such that

    sup_{x∈[a,b]} |g_m(x) − g(x)| < δ_ε/3.
Therefore,

    ∫_a^b g_m(x) dF_n(x) − ∫_a^b g_m(x) dF(x)
        = Σ_{k=0}^{m} g[½(x_k + x_{k+1})] [F_n(x_{k+1}) − F(x_{k+1}) − F_n(x_k) + F(x_k)].
The third term can also be bounded by δ_ε/3 using similar arguments to those above. See Exercise 19. Since all three terms can be made smaller than δ_ε/3 for any δ_ε > 0, it follows that

    lim_{n→∞} | ∫_a^b g(x) dF_n(x) − ∫_a^b g(x) dF(x) | < δ_ε,

and since δ_ε > 0 is arbitrary, the result follows.
The restriction that the range of the integral is bounded can be weakened if
we are willing to assume that the function of interest is instead bounded. This
result is usually called the extended or generalized theorem of Helly and Bray.
Theorem 4.4 (Helly and Bray). Let g be a continuous and bounded function and let {F_n}_{n=1}^∞ be a sequence of distribution functions such that F_n ⇝ F as n → ∞, where F is a distribution function. Then,

    lim_{n→∞} ∫_{−∞}^{∞} g(x) dF_n(x) = ∫_{−∞}^{∞} g(x) dF(x).
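Before turning to the proof, here is a brief numerical illustration of the theorem; this is our own sketch, not part of the formal development. Take F_n to be the N(1/n, 1 + 1/n) distribution function, which converges weakly to the N(0, 1) distribution function F, and take g(x) = arctan(x), which is continuous and bounded; the integrals are approximated by Monte Carlo averages.

    import numpy as np

    rng = np.random.default_rng(0)
    g = np.arctan                      # a continuous, bounded test function
    m = 1_000_000                      # Monte Carlo sample size per distribution

    # Approximate the limiting integral of g with respect to F, the N(0, 1) distribution
    limit = g(rng.standard_normal(m)).mean()

    for n in (1, 10, 100):
        # Approximate the integral of g with respect to F_n, the N(1/n, 1 + 1/n) distribution
        x_n = rng.normal(loc=1 / n, scale=np.sqrt(1 + 1 / n), size=m)
        print(n, abs(g(x_n).mean() - limit))   # difference shrinks as n grows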
Proof. Once again, we will use the method of proof from Sen and Singer
(1993). This method of proof breaks up the integrals in the difference
    ∫_{−∞}^{∞} g(x) dF_n(x) − ∫_{−∞}^{∞} g(x) dF(x),
into two basic parts. In the first part, the integrals are integrated over a finite
range, and hence Theorem 4.3 (Helly and Bray) can be used to show that the
difference converges to zero. The second part corresponds to the integrals of
the leftover tails of the range. These differences will be made arbitrarily small
by appealing to both the assumed boundedness of the function g and the fact
that F is a distribution function. To begin, let ε > 0 and let
    g̃ = sup_{x∈R} |g(x)|,
Theorem 4.3 (Helly and Bray) can be applied to the second term as long as a
and b are finite constants that do not depend on n, to obtain
    lim_{n→∞} | ∫_a^b g(x) dF_n(x) − ∫_a^b g(x) dF(x) | = 0.
To find bounds for the remaining two terms, we note that since F is a distribution function it follows that

    lim_{x→∞} F(x) = 1   and   lim_{x→−∞} F(x) = 0.
Therefore, Definition 1.1 implies that for every δ > 0 there exist finite continuity points a < b such that

    ∫_{−∞}^{a} dF(x) = F(a) < δ,

and

    ∫_{b}^{∞} dF(x) = 1 − F(b) < δ.
Therefore, for these values of a and b it follows that

    | ∫_{−∞}^{a} g(x) dF_n(x) | ≤ ∫_{−∞}^{a} g̃ dF_n(x) = g̃ ∫_{−∞}^{a} dF_n(x) = g̃ F_n(a).
The limit property in Theorem 4.4 (Helly and Bray) can be further shown to
be equivalent to the weak convergence of the sequence of distribution func-
tions, thus characterizing weak convergence. Another characterization of weak
convergence is based on the convergence of the characteristic functions corre-
sponding to the sequence of distribution functions. However, in order to prove
this equivalence, we require another of Helly’s Theorems.
Theorem 4.5 (Helly). Let {F_n}_{n=1}^∞ be a sequence of non-decreasing functions that are uniformly bounded. Then the sequence {F_n}_{n=1}^∞ contains at least one subsequence {F_{n_m}}_{m=1}^∞, where {n_m}_{m=1}^∞ is an increasing sequence in N, such that F_{n_m} ⇝ F as m → ∞ where F is a non-decreasing function.
A proof of Theorem 4.5 can be found in Section 37 of Gnedenko (1962).
Theorem 4.5 is somewhat general in that it deals with sequences that may
or may not be distribution functions. The result does apply to sequences of
distribution functions since they are uniformly bounded between zero and
one. However, as our discussion earlier in this chapter suggests, the limiting
function F may not be a valid distribution function, even when the sequence
is comprised of distribution functions. There are two main potential problems
in this case. The first is that F need not be right continuous. But, since
weak convergence is defined on the continuity points of F , it follows that we
can always define F in such a way that F is right continuous at the points
of discontinuity of F , without changing the weak convergence properties of
the sequence. See Exercise 10. The second potential problem is that F may
not have the proper limits as x → ∞ and x → −∞. This problem cannot
be addressed without further assumptions. See Theorem 4.1. It turns out
that the convergence of the corresponding distribution functions provides the
additional assumptions that are required. With this assumption, the result
given below provides other equivalent methods for assessing weak convergence.
Theorem 4.6. Let {F_n}_{n=1}^∞ be a sequence of distribution functions and let {ψ_n}_{n=1}^∞ be a sequence of characteristic functions such that ψ_n is the characteristic function of F_n for all n ∈ N. Let F be a distribution function with characteristic function ψ. The following three statements are equivalent:

1. F_n ⇝ F as n → ∞.

2. For each t ∈ R,

    lim_{n→∞} ψ_n(t) = ψ(t).

3. For every continuous and bounded function g,

    lim_{n→∞} ∫_{−∞}^{∞} g(x) dF_n(x) = ∫_{−∞}^{∞} g(x) dF(x).
Note that g is continuous and bounded. See Figure 4.1. Therefore, Condition
3 implies that
    lim_{n→∞} ∫_{−∞}^{∞} g(x) dF_n(x) = ∫_{−∞}^{∞} g(x) dF(x).
Now, because g(x) = 0 when x ≥ t + ε, it follows that
    ∫_{−∞}^{∞} g(x) dF_n(x) = ∫_{−∞}^{t} dF_n(x) + ∫_{t}^{t+ε} g(x) dF_n(x).
where we have used the limit superior because we do not know whether F_n(t) converges. Now
Z ∞ Z t Z t+ε
g(x)dF (x) = dF (x) + g(x)dF (x)
−∞ −∞ t
Z t Z t+ε
≤ dF (x) + dF (x)
−∞ t
= F (t + ε),
since g(x) ≤ 1 for all x ∈ [t, t + ε]. Therefore, we have shown that
    limsup_{n→∞} F_n(t) ≤ F(t + ε).
or equivalently that

    lim_{n→∞} F_n(t) = F(t),

by Definition 1.3, when t is a continuity point of F. Therefore, Definition 4.2 implies that F_n ⇝ F as n → ∞. Hence, we have shown that Conditions 1 and 3 are equivalent.
To prove that Conditions 1 and 2 are equivalent we use the method of proof given by Gnedenko (1962). We will first show that Condition 1 implies Condition 2. Therefore, let us assume that {F_n}_{n=1}^∞ is a sequence of distribution functions that converge weakly to a distribution function F as n → ∞. Define g(x) = exp(itx) where t is a constant and note that g(x) is continuous and bounded. Therefore, we use the equivalence between Conditions 1 and 3 to conclude that

    lim_{n→∞} ψ_n(t) = lim_{n→∞} ∫_{−∞}^{∞} exp(itx) dF_n(x) = ∫_{−∞}^{∞} exp(itx) dF(x) = ψ(t),
for all t ∈ R, which establishes Condition 2. To show that Condition 2 implies Condition 1, we first prove that {F_n}_{n=1}^∞ converges to a distribution function. We then finish the proof by showing that the characteristic function of F must be ψ(t). To obtain the weak convergence we begin by concluding from Theorem 4.5 (Helly) that there is a subsequence {F_{n_m}}_{m=1}^∞, where {n_m}_{m=1}^∞
is an increasing sequence in N, that converges weakly to some non-decreasing
function F . From the discussion following Theorem 4.5, we know that we can
assume that F is a right continuous function. The proof that F has the correct
limit properties is rather technical and is somewhat beyond the scope of this
book. For a complete argument see Gnedenko (1962). In turn, we shall, for the
rest of this argument, assume that F has the necessary properties to be a dis-
tribution function. It follows from the fact that Condition 1 implies Condition
2 that the characteristic function ψ must correspond to the distribution func-
tion F . To complete the proof, we must now show that the sequence {Fn }∞ n=1
converges weakly to F as n → ∞. We will use a proof by contradiction. That
is, let us suppose that the sequence {Fn }∞n=1 does not converge weakly to F as
n → ∞. In this case we would be able to find a sequence of integers {cn }∞ n=1
such that Fcn converges weakly to some distribution function G that differs
from F at at least one point of continuity. However, as we have stated above,
G must have a characteristic function equal to ψ(t) for all t ∈ R. But Theo-
rem 2.27 implies that G must be the same distribution function as F , thereby
contradicting our assumption, and hence it follows that {Fn }∞ n=1 converges
weakly to F as n → ∞.
Therefore, Theorem 4.6 implies that X_n →d X as n → ∞.
Example 4.10. Let {Xn }∞ n=1 be a sequence of random variables where Xn
has distribution Fn for all n ∈ N. Suppose that Xn converges in distribution
to a random variable X with distribution F as n → ∞. Now let g(x) =
xδ{x; (−δ, δ)}, for a specified 0 < δ < ∞, which is a bounded and continuous
Figure 4.1 The bounded and continuous function used in the proof of Theorem 4.6.
but in this case we note that E(Xn ) = E(Xn δ{Xn ; (−δ, δ)}) and E(X) =
E(Xδ{X; (−δ, δ)}). Therefore, in this case we have proven that
    lim_{n→∞} E(X_n) = E(X),
under the assumption that the random variables stay within a bounded subset
of the real line.
Note that the boundedness of the sequence of random variables is not a nec-
essary condition for the expectations of sequences of random variables that
converge in distribution to also converge, though it is sufficient. We will return
to this topic in Chapter 5 where we will develop equivalent conditions to the
convergence of the expectations.
Noting that the convergence detailed in Definitions 4.1 and 4.2 is pointwise in
terms of the sequence of distribution functions, it may be somewhat surprising
that the convergence can be shown to be uniform as well. This is due to
the special properties associated with distribution functions in that they are
bounded and non-decreasing.
Theorem 4.7 (Pólya). Suppose that {F_n}_{n=1}^∞ is a sequence of distribution functions such that F_n ⇝ F as n → ∞, where F is a continuous distribution function. Then

    lim_{n→∞} sup_{t∈R} |F_n(t) − F(t)| = 0.
The proof of Theorem 4.7 essentially follows along the same lines as that of
Theorem 3.18 where we considered the uniform convergence of the empirical
distribution function. See Exercise 12.
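As a numerical illustration of this uniform convergence (our own sketch), consider again F_n equal to the Exponential(θ + 1/n) distribution function and F the Exponential(θ) distribution function, with θ = 1 chosen arbitrarily; the supremum is approximated by a maximum over a fine grid.

    import numpy as np

    theta = 1.0
    x = np.linspace(0.0, 20.0, 20001)               # fine grid approximating the supremum
    F = 1.0 - np.exp(-theta * x)                    # Exponential(theta) distribution function

    for n in (1, 10, 100, 1000):
        F_n = 1.0 - np.exp(-(theta + 1.0 / n) * x)  # Exponential(theta + 1/n) distribution function
        print(n, np.max(np.abs(F_n - F)))           # sup |F_n - F| decreases toward zero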
When considering the modes of convergence studied thus far, it would appear
conceptually that convergence in distribution is a rather weak concept in that
it does not require that the random variables Xn and X should be close to one
another when n is large. Only the corresponding distributions of the sequence
need coincide with the distribution of X as n → ∞. Therefore, it would seem
that any of the modes of convergence studied in Chapter 3, which all require
that the random variables Xn and X coincide in some sense in the limit, would also imply that the distributions of the random variables coincide. This should essentially guarantee convergence in distribution for these
sequences. It suffices to prove this property for convergence in probability, the
weakest of the modes of convergence studied in Chapter 3.
Theorem 4.8. Let {X_n}_{n=1}^∞ be a sequence of random variables that converge in probability to a random variable X as n → ∞. Then X_n →d X as n → ∞.
Proof. Let ε > 0, the distribution function of Xn be Fn (x) = P (Xn ≤ x) for
all n ∈ N, and F denote the distribution function of X. Then
Fn (x) = P (Xn ≤ x)
= P ({Xn ≤ x} ∩ {|Xn − X| < ε}) + P ({Xn ≤ x} ∩ {|Xn − X| ≥ ε})
≤ P ({Xn ≤ x} ∩ {|Xn − X| < ε}) + P ({|Xn − X| ≥ ε}),
by Theorem 2.3. Now note that
{Xn ≤ x} ∩ {|Xn − X| < ε} ⊂ {X ≤ x + ε},
so that Theorem 2.3 implies
P ({Xn ≤ x} ∩ {|Xn − X| < ε}) ≤ P (X ≤ x + ε),
and therefore,
Fn (x) ≤ P (X ≤ x + ε) + P (|Xn − X| ≥ ε).
Hence, without assuming that the limit exists, we can use Theorem 1.6 to show that

    limsup_{n→∞} F_n(x) ≤ F(x + ε),

and a parallel argument yields liminf_{n→∞} F_n(x) ≥ F(x − ε). See Exercise 13. Suppose that x ∈ C(F). Since ε > 0 is arbitrary, we have shown that

    F(x) = F(x−) ≤ liminf_{n→∞} F_n(x) ≤ limsup_{n→∞} F_n(x) ≤ F(x+) = F(x).
Proof. This proof is based on the development of Serfling (1980). Like many
existence proofs, this one is constructive in nature. Let F be the distribution
function of X and Fn be the distribution function of Xn for all n ∈ N. Define
the random variable Y : [0, 1] → R as Y (ω) = F −1 (ω). Similarly define
Yn (ω) = Fn−1 (ω) for all n ∈ N. We now prove that these random variables
have the properties listed earlier. We begin by proving that X and Y have
the same distribution. The distribution function of Y is given by G(y) =
P [ω : Y (ω) ≤ y] = P [ω : F −1 (ω) ≤ y]. Now Theorem 3.22 implies that
P [ω : F −1 (ω) ≤ y] = P [ω : ω ≤ F (y)]. Because P is a uniform probability
measure, it follows that G(y) = P [ω : ω ≤ F (y)] = F (y), and therefore we
have proven that X and Y have the same distribution. Using similar arguments
it can be shown that Xn and Yn have the same distribution for all n ∈ N.
Hence, it remains for us to show that Yn converges almost certainly to Y as
n → ∞. Note that the sequence of random variables {Yn }∞ n=1 is actually the
sequence {Fn−1 }∞n=1 , which is a sequence of quantile functions corresponding to
a sequence of distribution functions that converge weakly. Theorem 4.2 implies
that if we collect together all ω ∈ [0, 1] such that the sequence {Yn (ω)}∞ n=1
does not converge pointwise to Y (ω) = F −1 (ω) we get a countable set, which
has probability zero with respect to the continuous probability measure P on
Ω. Therefore, Definition 3.2 implies that Y_n →a.c. Y as n → ∞.
The most common example of using Theorem 4.10 arises when proving that
continuous functions of a sequence of random variables that converge weakly
also converge. See Theorem 4.12 and the corresponding proof.
We often encounter problems that concern a sequence of random variables that
converge in distribution and we perturb the sequence with another sequence
that also has some convergence properties. In these cases it is important to
determine how such perturbations affect the convergence of the sequence. For
example, we might know that a sequence of random variables {Xn }∞ n=1 con-
verges in distribution to a random variable Z that has a N(µ, 1) distribution.
We may wish to standardize this sequence, but may not know µ. Suppose
we have a consistent estimator of µ. That is, we can compute µ̂_n such that µ̂_n →p µ as n → ∞. Given this information, is it possible to conclude that
the sequence {Xn − µ̂n }∞ n=1 converges in distribution to a standard normal
distribution? Slutsky’s Theorem is a result that considers both additive and
multiplicative perturbations of sequences that converge in distribution.
Theorem 4.11 (Slutsky). Let {X_n}_{n=1}^∞ be a sequence of random variables that converge weakly to a random variable X. Let {Y_n}_{n=1}^∞ be a sequence of random variables that converge in probability to a real constant c. Then,

1. X_n + Y_n →d X + c as n → ∞.

2. X_n Y_n →d cX as n → ∞.

3. X_n / Y_n →d X/c as n → ∞, as long as c ≠ 0.
Proof. The first result will be proven here. The remaining results are proven
in Exercises 14 and 15. Denote the distribution function of Xn + Yn to be Gn
and let F be the distribution function of X. Let ε > 0 and set x such that
x − c ∈ C(F ) and x + ε − c ∈ C(F ). Then
Gn (x) = P (Xn + Yn ≤ x)
= P ({Xn + Yn ≤ x} ∩ {|Yn − c| ≤ ε}) +
P ({Xn + Yn ≤ x} ∩ {|Yn − c| > ε}).
Now, Theorem 2.3 implies that
P ({Xn + Yn ≤ x} ∩ {|Yn − c| ≤ ε}) ≤ P (Xn + c ≤ x + ε),
and
P ({Xn + Yn ≤ x} ∩ {|Yn − c| > ε}) ≤ P (|Yn − c| > ε).
Therefore,
Gn (x) ≤ P (Xn + c ≤ x + ε) + P (|Yn − c| > ε),
and Theorem 1.6 implies that

    limsup_{n→∞} G_n(x) ≤ limsup_{n→∞} P(X_n + c ≤ x + ε) + limsup_{n→∞} P(|Y_n − c| > ε) = P(X ≤ x + ε − c),

where we have used the fact that X_n →d X and Y_n →p c as n → ∞. Similarly,
it can be shown that
P (Xn ≤ x − ε − c) ≤ Gn (x) + P (|Yn − c| > ε).
See Exercise 16. Therefore, Theorem 1.6 implies that
    liminf_{n→∞} P(X_n ≤ x − ε − c) ≤ liminf_{n→∞} G_n(x) + liminf_{n→∞} P(|Y_n − c| > ε),

so that

    liminf_{n→∞} G_n(x) ≥ P(X ≤ x − ε − c).
Because ε > 0 is arbitrary, we have proven that

    limsup_{n→∞} G_n(x) ≤ P(X ≤ x − c) ≤ liminf_{n→∞} G_n(x),

so that G_n(x) converges to P(X + c ≤ x) at every x for which x − c ∈ C(F), and the first statement follows.
Proof. In this proof we follow the method of Serfling (1980) which translates
the problem into one concerning almost certain convergence using Theorem
4.10 (Skorokhod), where the convergence of the transformation is known to
hold. The result is then translated back to weak convergence using the fact that
almost certain convergence implies convergence in distribution. To proceed,
let us suppose that {Xn }∞ n=1 is a sequence of random variables that converge
in distribution to a random variable X as n → ∞. Theorem 4.10 then implies
that there exists a sequence of random variables {Y_n}_{n=1}^∞ such that X_n and Y_n have the same distribution for all n ∈ N, where Y_n →a.c. Y as n → ∞, and Y has the same distribution as X. It then follows from Theorem 3.8 that g(Y_n) →a.c. g(Y) as n → ∞ as long as g is continuous with probability one with respect to the distribution of Y. To show this, let D(g) denote the set of discontinuities of the function g. Let P be the probability measure from the measure space used to define X and P* be the probability measure from the measure space used to define Y. Then it follows, since X and Y have the same distribution, that P*[Y ∈ D(g)] = P[X ∈ D(g)] = 0. Therefore, it follows that g(Y_n) →a.c. g(Y) as n → ∞. Theorems 3.2 and 4.8 then imply that g(Y_n) →d g(Y) as n → ∞. But g(Y_n) has the same distribution as g(X_n) for all n ∈ N and g(Y) has the same distribution as g(X). Therefore, it follows that g(X_n) →d g(X) as n → ∞ as well, and the result is proven.
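A quick simulation sketch (ours, with an arbitrary choice of g) illustrates the continuous mapping result: if X_n is the standardized mean of n Exponential(1) variables, so that X_n converges in distribution to a N(0, 1) random variable X, then g(X_n) = X_n² should converge in distribution to X², which has a ChiSquared(1) distribution.

    import numpy as np

    rng = np.random.default_rng(2)
    reps, n = 50_000, 200

    # X_n: standardized mean of n Exponential(1) variables, approximately N(0, 1) for large n
    e = rng.exponential(scale=1.0, size=(reps, n))
    x_n = np.sqrt(n) * (e.mean(axis=1) - 1.0)

    # g(x) = x^2 is continuous, so g(X_n) should be approximately ChiSquared(1)
    g_x = x_n ** 2
    print(np.mean(g_x <= 1.0))     # ChiSquared(1) distribution function at 1 is about 0.6827
    print(np.mean(g_x <= 3.8415))  # ChiSquared(1) distribution function at 3.8415 is about 0.95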
Figure 4.2 Probability contours of the discrete bivariate distribution function from Example 4.17. The dotted lines indicate the location of discrete steps in the distribution function, with the height of the steps being indicated on the plot. It is clear from this plot that the point (0, 0) is a point of discontinuity of the distribution function.
and that

    F(x) = ∫_{B(x)} (2π)^{−d/2} |Σ|^{−1/2} exp[−½(t − µ)' Σ^{−1} (t − µ)] dt,
where B(x) is defined in Theorem 4.13. Because the limiting distribution is
continuous, P [X ∈ ∂B(x)] = 0 for all x ∈ Rd . To show that weak convergence
follows, we note that F_n(x) can be written as

    F_n(x) = ∫_{Σ_n^{−1/2}[B(x)−µ_n]} (2π)^{−d/2} exp(−½ t't) dt,

and similarly

    F(x) = ∫_{Σ^{−1/2}[B(x)−µ]} (2π)^{−d/2} exp(−½ t't) dt.

In both cases we have used the shorthand AB(x) + c to represent the linear transformation {At + c : t ∈ B(x)}. Now it follows that Σ_n^{−1/2}[B(x) − µ_n] → Σ^{−1/2}[B(x) − µ] for all x ∈ R^d. Therefore, it follows that

    lim_{n→∞} F_n(x) = F(x),

for all x ∈ R^d and, hence, it follows from Definition 4.4 that X_n →d X as n → ∞.
If a sequence of d-dimensional distribution functions {F_n}_{n=1}^∞ converges weakly to a distribution function F as n → ∞, then Definition 4.4 implies that

    lim_{n→∞} F_n(x) = F(x),
Example 4.20. Consider the setup of Example 4.19 where {X_n}_{n=1}^∞ is a sequence of d-dimensional random vectors such that X_n has a N(µ_n, Σ_n) distribution and X_n →d X as n → ∞ where X has a N(µ, Σ) distribution. Define a region

    E(α) = {x ∈ R^d : (x − µ)' Σ^{−1} (x − µ) ≤ χ²_{d;α}},

where χ²_{d;α} is the α quantile of a ChiSquared(d) distribution. The boundary region of E(α) is given by

    ∂E(α) = {x ∈ R^d : (x − µ)' Σ^{−1} (x − µ) = χ²_{d;α}},

which is an ellipsoid in R^d. Now

    P[X ∈ ∂E(α)] = P[(X − µ)' Σ^{−1} (X − µ) = χ²_{d;α}] = 0,

since X is a continuous random vector. It then follows from Theorem 4.14 that

    lim_{n→∞} P[X_n ∈ E(α)] = P[X ∈ E(α)] = α.
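This limit is easy to probe by Monte Carlo. The sketch below is our own, with d = 2, α = 0.9, and an arbitrary perturbation of the mean and covariance; scipy is assumed to be available for the chi-squared quantile.

    import numpy as np
    from scipy.stats import chi2

    rng = np.random.default_rng(3)
    d, alpha, reps = 2, 0.90, 200_000
    mu, Sigma = np.zeros(d), np.eye(d)
    Sigma_inv = np.linalg.inv(Sigma)
    q = chi2.ppf(alpha, df=d)                      # chi-squared quantile defining E(alpha)

    for n in (1, 10, 100):
        mu_n = mu + 1.0 / n                        # perturbed mean, converging to mu
        Sigma_n = Sigma * (1.0 + 1.0 / n)          # perturbed covariance, converging to Sigma
        x = rng.multivariate_normal(mu_n, Sigma_n, size=reps)
        stat = np.einsum('ij,jk,ik->i', x - mu, Sigma_inv, x - mu)
        print(n, np.mean(stat <= q))               # should approach alpha = 0.9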
The result of Theorem 4.14 is actually part of a larger result that generalizes
Theorem 4.6 to the multivariate case.
Theorem 4.15. Let {X_n}_{n=1}^∞ be a sequence of d-dimensional random vectors where X_n has distribution function F_n for all n ∈ N and let X be a d-dimensional random vector with distribution function F. Then the following statements are equivalent.

1. F_n ⇝ F as n → ∞.

2. For any bounded and continuous function g,

    lim_{n→∞} ∫_{R^d} g(x) dF_n(x) = ∫_{R^d} g(x) dF(x).
Proof. We shall follow the method of proof given by Billingsley (1986), which
first shows the equivalence of Conditions 2–5, and then proves that Condition
5 is equivalent to Condition 1. We begin by proving that Condition 2 implies
Condition 3. Let C be a closed subset of Rd . Define a metric ∆(x, C) that
measures the distance between a point x ∈ Rd and the set C as the smallest
distance between x and any point in C. That is,

    ∆(x, C) = inf_{c∈C} ||x − c||.
In order to effectively use Condition 2 we need to define a bounded and con-
tinuous function. Therefore, define
    h_k(t) = { 1,       t < 0,
             { 1 − kt,  0 ≤ t ≤ k^{-1},
             { 0,       k^{-1} ≤ t.
Let gk (x) = hk [∆(x, C)]. It follows that the function is continuous and bounded
between zero and one for all k ∈ N. Now, suppose that x ∈ C so that
∆(x, C) = 0 and hence,
    lim_{k→∞} h_k[∆(x, C)] = lim_{k→∞} 1 = 1,
where the final equality follows from Condition 2. It can further be proven
that g_k converges monotonically (decreasing) to δ{x; C} as k → ∞, so that Theorem 1.12 (Lebesgue) implies that

    lim_{k→∞} ∫_{R^d} g_k(x) dF(x) = ∫_{R^d} lim_{k→∞} g_k(x) dF(x) = ∫_{R^d} δ{x; C} dF(x) = P(X ∈ C),
and Condition 2 follows. We will finally show that Condition 5 implies Con-
dition 1. To show this define sets B(x) = {t ∈ Rd : t ≤ x}. Theorem 4.13
implies that x is a continuity point of F if and only if P [X ∈ ∂B(x)] = 0.
Therefore, if x is a continuity point of F , we have from Condition 5 that
    lim_{n→∞} F_n(x) = lim_{n→∞} P[X_n ∈ B(x)] = P[X ∈ B(x)] = F(x).
    C_n = [ (1 + n^{-1})^{1/2}          0
                   0            (1 + n^{-1})^{1/2} ].

In this case Z_n →d Z as n → ∞ where Z has a N(0, Σ) distribution. Hence,
there are an infinite number of choices of Z depending on the covariance
between Xn and Yn . There are even more choices for the limiting distribution
because the limiting joint distribution need not even be multivariate Normal,
due to the fact that the joint distribution of two normal random variables need
not be multivariate Normal.
The converse of this property is true. That is, if a sequence of random vectors converges in distribution to another random vector, then each element of the random vectors in the sequence must also converge in distribution to the corresponding element of the limiting random vector. This result follows from Theorem 4.14.
The convergence of random vectors was simplified to the univariate case using
Theorem 3.6. As Example 4.21 demonstrates, the same simplification is not
applicable to the convergence of distributions. The Cramér-Wold Theorem
does provide a method for reducing the convergence in distribution of random
vectors to the univariate case. Before presenting this result some preliminary
setup is required. The result depends on multivariate characteristic functions,
which are defined below.
where we have used the independence assumption and the fact that the characteristic function of a standard normal random variable is exp(−½t²). Therefore, it follows that

    ψ(t) = exp(−½ Σ_{k=1}^{n} t_k²) = exp(−½ t't).
We are now in a position to present the theorem of Cramér and Wold, which reduces the task of proving that a sequence of random vectors converges weakly to another random vector to the univariate case by considering the convergence of all possible linear combinations of the components of the random vectors.

Theorem 4.17 (Cramér and Wold). Let {X_n}_{n=1}^∞ be a sequence of random vectors in R^d and let X be a d-dimensional random vector. Then X_n →d X as n → ∞ if and only if v'X_n →d v'X as n → ∞ for all v ∈ R^d.
Proof. We will follow the proof of Serfling (1980). Let us first suppose that for any v ∈ R^d we have v'X_n →d v'X as n → ∞. Theorem 4.6 then implies that the characteristic function of v'X_n converges to the characteristic function of v'X. Let ψ_n(t) be the characteristic function of X_n for all n ∈ N. The characteristic function of v'X_n is then given by E[exp(itv'X_n)] = E{exp[i(tv)'X_n]} = ψ_n(tv) by Definition 4.5. Similarly, if ψ(t) is the characteristic function of X, then the characteristic function of v'X is given by ψ(tv). Theorem 4.6 then implies that if v'X_n →d v'X as n → ∞ for all v ∈ R^d, then

    lim_{n→∞} ψ_n(tv) = ψ(tv),
X_n is independent of Y_n for all n ∈ N. We assume that {µ_n}_{n=1}^∞ and {ν_n}_{n=1}^∞
Proof. We will use the method of proof suggested by Lehmann (1999). Define a sequence of 3d-dimensional random variables {W_n}_{n=1}^∞ where W_n = (X_n', Y_n', Z_n')' for all n ∈ N, and similarly define W = (X', y', z')'. Theorem 3.6 and Example 4.25 imply that W_n →d W as n → ∞. Now define g(w) = g(x, y, z) = diag(y)x + z, which is an everywhere continuous function so that P[W ∈ C(g)] = 1. Theorem 4.18 implies that g(W_n) = diag(Y_n)X_n + Z_n →d g(W) = diag(y)X + z as n → ∞.
Example 4.29. Let {X_n}_{n=1}^∞ be a sequence of d-dimensional random vectors such that n^{1/2} X_n →d Z as n → ∞ where Z is a N_p(0, I) random vector. Suppose that {Y_n}_{n=1}^∞ is any sequence of random vectors that converges in probability to a vector θ. Then Theorem 4.19 implies that n^{1/2} X_n + Y_n →d W as n → ∞ where W has a N_p(θ, I) distribution.
The central limit theorem, as the name given to it by G. Pólya in 1920 implies,
is the key asymptotic result in statistics. The result in some form has existed
since 1733 when De Moivre proved the result for a sequence of independent
and identically distributed Bernoulli(θ) random variables. In some sense,
one can question how far the field of statistics could have progressed without
this essential result. It is the Central Limit Theorem, in its various forms, that allows us to construct approximate normal tests and confidence intervals for
unknown means when the sample size is large. Without such a result we would
be required to develop tests for each possible population. The result allows us
to approximate Binomial probabilities under certain circumstances when the
number of Bernoulli experiments is large. These probabilities, with such an
approximation, would have been very difficult to compute, especially before
the advent of the digital computer. Another key attribute of the Central Limit
Theorem is its widespread applicability. The Central Limit Theorem, with
appropriate modifications, applies not only to the case of independent and
identically distributed random variables, but can also be applied to dependent
sequences of variables, sequences that have varying distributions, and other
cases as well. Finally, the accuracy of the normal approximation is also quite important. When the parent population is not too far from normality,
then the Central Limit Theorem provides quite accurate approximations to
the distribution of the sample mean, even when the sample size is quite small.
In this section we will introduce the simplest form of the central limit theorem
which applies to sequences of independent and identically distributed random
variables with finite mean and variance and present the usual proof which is
based on limits of characteristic functions. We will also present the simple
form of the multivariate version of the central limit theorem. We will revisit
this topic with much more detail in Chapter 6 where we consider several
generalizations of this result.
Theorem 4.20 (Lindeberg and Lévy). Let {X_n}_{n=1}^∞ be a sequence of independent and identically distributed random variables such that E(X_n) = µ and V(X_n) = σ² < ∞ for all n ∈ N. Then Z_n = n^{1/2} σ^{-1}(X̄_n − µ) →d Z as n → ∞, where Z has a N(0, 1) distribution.
The left hand side of this last equation is a Binomial probability and the right
hand side is a Normal probability. Therefore, this last equation provides a
method for approximating Binomial probabilities with Normal probabil-
ities under the condition that n is large. Figures 4.3 and 4.4 compare the
Binomial(n, p) and N[nθ, nθ(1 − θ)] distribution functions for n = 5 and
n = 10 when θ = 14 . One can observe that the Normal distribution function,
though continuous, does capture the general shape of the Binomial distribu-
tion function, with the approximation improving as n becomes larger.
such that σ_n^{-1}(X_n − µ) →d Z as n → ∞, where Z is a N(0, 1) random variable and {σ_n}_{n=1}^∞ is a sequence of real numbers such that

    lim_{n→∞} σ_n = σ ∈ R,

and µ ∈ R. Then X_n →p µ if and only if σ = 0.
Proof. Let us first assume that σ = 0, which implies that σ_n → 0 as n → ∞. Part 2 of Theorem 4.11 (Slutsky) implies that σ_n · σ_n^{-1}(X_n − µ) = X_n − µ →d 0 as n → ∞. Theorem 4.9 then implies that X_n − µ →p 0 as n → ∞, which is equivalent to the result that X_n →p µ as n → ∞. On the other hand, if we assume that X_n →p µ as n → ∞, then again we automatically conclude that X_n − µ →p 0. Suppose that σ ≠ 0 and let us find a contradiction. Since σ ≠ 0 it follows that σ_n/σ →p 1 as n → ∞, and therefore Theorem 4.11 implies that (σ_n/σ) · σ_n^{-1}(X_n − µ) = σ^{-1}(X_n − µ) →d Z as n → ∞, where Z is a N(0, 1) random variable. This is a contradiction since we know that X_n − µ →p 0, and hence σ^{-1}(X_n − µ) →p 0, as n → ∞. Therefore, σ cannot be non-zero.
The proof of Theorem 4.22 is similar to that of the univariate case given by
Theorem 4.20, the only real difference being that the multivariate character-
istic function is used.
Example 4.34. Consider a sequence of discrete random vectors {X_n}_{n=1}^∞ where X_n has probability distribution

    f(x) = { θ,          x' = (1, 0, 0),
           { η,          x' = (0, 1, 0),
           { 1 − θ − η,  x' = (0, 0, 1),
           { 0,          otherwise,

for all n ∈ N, where θ and η are parameters such that 0 < θ < 1, 0 < η < 1 and 0 < θ + η < 1. The mean vector of X_n is given by E(X_n) = µ = (θ, η, 1 − θ − η)'. The covariance matrix of X_n is given by

    V(X_n) = Σ = [  θ(1 − θ)        −θη             −θ(1 − θ − η)
                   −θη              η(1 − η)        −η(1 − θ − η)
                   −θ(1 − θ − η)   −η(1 − θ − η)    (θ + η)(1 − θ − η) ].
Theorem 4.22 implies that n^{1/2} Σ^{-1/2}(X̄_n − µ) →d Z as n → ∞, where Z is a three-dimensional N(0, I) random vector. That is, when n is large, it follows that

    P[n^{1/2} Σ^{-1/2}(X̄_n − µ) ≤ t] ≃ Φ(t),

where Φ is the multivariate distribution function of a N(0, I) random vector. Equivalently, we have that

    P(Σ_{k=1}^{n} X_k ≤ t) ≃ Φ[n^{-1/2} Σ^{-1/2}(t − nµ)],
for all z ∈ R. Definition 1.1 then implies that for large values of n, we can
use the approximation P (Zn ≤ z) ' Φ(z). One of the first concerns when one
uses any approximation should be about the accuracy of the approximation.
In the case of the normal approximation given by Theorem 4.20, we are in-
terested in how well the normal distribution approximates probabilities of the
standardized sample mean and how the quality of the approximation depends
on n and on the parameters of the distribution of the random variables in
the sequence. In some cases it is possible to study these effects using direct
calculation.
Example 4.35. Let {Xn }∞ n=1 be a sequence of independent and identically
distributed random variables where Xn has a Bernoulli(θ) distribution. In
this case it is well known that the sum Sn has a Binomial(n, θ) distribu-
tion. The distribution of the standardized sample mean, Zn = n1/2 θ−1/2 (1 −
θ)−1/2 (X̄n −θ) then has a Binomial(n, θ) distribution that has been scaled so
that Zn has support {n1/2 θ−1/2 (1 − θ)−1/2 (0 − θ), n1/2 θ−1/2 (1 − θ)−1/2 (n−1 −
θ), . . . , n1/2 θ−1/2 (1 − θ)−1/2 (1 − θ)}. That is,
    P[Z_n = n^{1/2} θ^{-1/2} (1 − θ)^{-1/2} (kn^{-1} − θ)] = \binom{n}{k} θ^k (1 − θ)^{n−k},

for k ∈ {0, 1, . . . , n}. As shown in Example 4.30, Theorem 4.20 implies that
Z_n →d Z as n → ∞ where Z is a N(0, 1) random variable. This means that
for large n, we have the approximation P (Zn ≤ z) ' Φ(z). Because the
distribution of the standardized mean is known in this case we can assess the
accuracy of the normal approximation directly. For example, when θ = 1/4 and n = 5 we have that the Kolmogorov distance between P(Zn ≤ t) and Φ(t) is 0.2346. See Figure 4.7. Similarly, when θ = 1/2 and n = 10 we have that the Kolmogorov distance between P(Zn ≤ t) and Φ(t) is 0.1230. See Figure 4.8.
A more complete table of comparisons is given in Table 4.1. It is clear from
the table that both n and θ affect the accuracy of the normal approximation.
As n increases, the Kolmogorov distance becomes smaller. This is guaranteed
by Theorem 4.7. However, another effect can be observed from Table 4.1.
The approximation becomes progressively worse as θ approaches zero. This
is due to the fact that the binomial distribution becomes more skewed as θ
approaches zero. The normal approximation requires larger sample sizes to
overcome this skewness.
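The Kolmogorov distances reported in Table 4.1 can be reproduced with a few lines of code. The sketch below is our own (scipy is assumed to be available for the Binomial and Normal distribution functions); it computes sup_t |P(Z_n ≤ t) − Φ(t)| by evaluating the difference just before and at each jump of the scaled Binomial distribution function.

    import numpy as np
    from scipy.stats import binom, norm

    def kolmogorov_distance(n, theta):
        k = np.arange(n + 1)
        z = np.sqrt(n) * (k / n - theta) / np.sqrt(theta * (1 - theta))  # support of Z_n
        cdf_right = binom.cdf(k, n, theta)                  # P(Z_n <= z_k)
        cdf_left = np.concatenate(([0.0], cdf_right[:-1]))  # P(Z_n < z_k)
        phi = norm.cdf(z)
        return np.max(np.maximum(np.abs(cdf_right - phi), np.abs(cdf_left - phi)))

    print(round(kolmogorov_distance(5, 0.25), 4))   # compare with Table 4.1
    print(round(kolmogorov_distance(10, 0.50), 4))  # compare with Table 4.1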
Table 4.1 The normal approximation of the Binomial(n, θ) distribution for n =
5, 10, 25, 50, 100 and θ = 0.01, 0.05, 0.10, 0.25, 0.50. The value reported is the Kol-
mogorov distance between the scaled binomial distribution function and the normal
distribution function.
θ
n 0.01 0.05 0.10 0.25 0.50
5 0.5398 0.4698 0.3625 0.2347 0.1726
10 0.5291 0.3646 0.2361 0.1681 0.1230
25 0.4702 0.2331 0.1677 0.1071 0.0793
50 0.3664 0.1677 0.1161 0.0758 0.0561
100 0.2358 0.1160 0.0832 0.0535 0.0398
Gamma(n, n^{-1/2}) distribution given by

    P(Z_n ≤ z) = P[n^{1/2} θ^{-1}(X̄_n − θ) ≤ z]                                                      (4.3)
               = ∫_{−n^{1/2}}^{z} [n^{n/2}/Γ(n)] (t + n^{1/2})^{n−1} exp[−n^{1/2}(t + n^{1/2})] dt.    (4.4)
See Exercise 27. Theorem 4.20 implies that Z_n →d Z as n → ∞ where Z has a N(0, 1) distribution, and therefore for large n, we have the approximation P(Z_n ≤ z) ≃ Φ(z). Because the exact density of Z_n is known in this case, we can compute the Kolmogorov distance between P(Z_n ≤ z) and Φ(z) to get an overall view of the accuracy of this approximation. Note that in this case the distribution of Z_n does not depend on θ, so that we need only consider the sample size n for our calculations. For example, when n = 2, the Kolmogorov distance between P(Z_n ≤ z) and Φ(z) is approximately 0.0945, and when n = 5 the Kolmogorov distance between P(Z_n ≤ z) and Φ(z) is approximately 0.0596. See Figures 4.9 and 4.10. A plot of the Kolmogorov distance against n is given in Figure 4.11. We observe once again that the distance decreases with n, as required by Theorem 4.7.
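These distances can be checked numerically; the sketch below is our own and assumes scipy is available. Since Z_n + n^{1/2} has a Gamma(n, n^{-1/2}) distribution, P(Z_n ≤ z) can be evaluated directly, and the supremum over z is approximated by a maximum over a fine grid.

    import numpy as np
    from scipy.stats import gamma, norm

    def kolmogorov_distance(n):
        z = np.linspace(-np.sqrt(n), 10.0, 200_001)                      # grid covering the support
        exact = gamma.cdf(z + np.sqrt(n), a=n, scale=1.0 / np.sqrt(n))   # P(Z_n <= z)
        return np.max(np.abs(exact - norm.cdf(z)))

    print(round(kolmogorov_distance(2), 4))  # compare with the value 0.0945 reported above
    print(round(kolmogorov_distance(5), 4))  # compare with the value 0.0596 reported above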
Figure 4.9 The normal approximation of the Gamma(n, n^{-1/2}) distribution when n = 2. The solid line is the Gamma(n, n^{-1/2}) distribution function translated by its mean, which equals n^{1/2}; the dashed line is the N(0, 1) distribution function.
or may not have a simple form. In these cases one can often use simulations
to approximate the behavior of the normal approximation.
Example 4.37. Let {{X_{n,m}}_{m=1}^{n}}_{n=1}^∞ be a triangular array of N(0, 1) random variables that are mutually independent both within and between rows. Define a new sequence of random variables {Y_n}_{n=1}^∞ where

for y > 0. See Arnold, Balakrishnan and Nagaraja (1993). Now let Ȳ_n be the sample mean of Y_1, . . . , Y_n, and let Z_n = n^{1/2} σ^{-1}(Ȳ_n − µ) where µ = E(Y_n) and σ² = V(Y_n). While the distribution function in Equation (4.5) can be
computed numerically with some ease, the distributions of Ȳn and Zn are not
so simple to compute and approximations based on simulations are usually
easier.
Figure 4.10 The normal approximation of the Gamma(n, n^{-1/2}) distribution when n = 5. The solid line is the Gamma(n, n^{-1/2}) distribution function translated by its mean, which equals n^{1/2}; the dashed line is the N(0, 1) distribution function.
not be known. This problem prevents one from studying the accuracy of the
normal approximation using either direct calculation or simulations. In this
case one needs to be able to study the accuracy of the normal approximation in
such a way that it does not depend on the distribution of the random variables
in the sequence. One may be surprised to learn that there are theoretical
results which can be used in this case. Specifically, one can find universal
bounds on the accuracy of the normal approximation that do not depend on
the distribution of the random sequence. For the case we study in this section
we will have to make some further assumptions about the distribution F .
Specifically, we will need to know something about the third moment of F .
Figure 4.11 The Kolmogorov distance between P(Zn ≤ z) and Φ(z) for different values of n, where Zn is the standardized sample mean computed on n Exponential(θ) random variables.
and

    ∫_{−∞}^{∞} t² dF(t) = 1.

Let G be a differentiable distribution function with characteristic function ζ such that ζ'(0) = 0. Then for T > 0,

    π|F(x) − G(x)| ≤ ∫_{−T}^{T} |ψ(t) − ζ(t)|/|t| dt + 24 T^{-1} sup_{t∈R} |G'(t)|.
A proof of Theorem 4.23 can be found in Section 2.5 of Kolassa (2006). The
form of the Smoothing Theorem given in Theorem 4.23 allows us to bound
the difference between distribution functions based on an integral of the cor-
responding characteristic functions. A more general version of this result will
be considered later in the book. Theorem 4.23 allows us to find a universal
bound for the accuracy of the normal approximation that depends only on
the absolute third moment of F . This famous result is known as the theorem
of Berry and Esseen.
Theorem 4.24 (Berry and Esseen). Let {X_n}_{n=1}^∞ be a sequence of independent and identically distributed random variables from a distribution such that E(X_n) = 0 and V(X_n) = 1. If ρ = E(|X_n|³) < ∞ then

    sup_{t∈R} |P(n^{1/2} X̄_n ≤ t) − Φ(t)| ≤ n^{-1/2} B ρ,    (4.6)
Proof. The proof of this result is based on Theorem 4.23, and follows the
method of proof given by Feller (1971) and Kolassa (2006). To avoid certain
overly technical arguments, we will make the assumption in this proof that
the distribution of the random variables is symmetric, which results in a real
valued characteristic function. The general proof follows the same overall path.
See Section 6.2 of Gut (2005) for further details. We will assume that B = 3
and first dispense with the case where n < 10. To simplify notation, let X
be a random variable following the distribution F . Note that Theorem 2.11
(Jensen) implies that [E(|X|2 )]3/2 ≤ E(|X|3 ) = ρ, and by assumption we have
that E(|X|²) = 1. Therefore, it follows that ρ ≥ 1. Now, when B = 3, the bound given in Equation (4.6) has the property n^{-1/2}Bρ = 3n^{-1/2}ρ where ρ ≥ 1. Hence, if n^{1/2} ≤ 3, or equivalently if n < 10, it follows that n^{-1/2}Bρ ≥ 1.
Since the difference between two distribution functions is always bounded
above by one, it follows that the result is true without any further calculation
when B = 3 and n < 10. For the remainder of the proof we will consider only
the case where n ≥ 10.
Let ψ(t) be the characteristic function of Xn , and focusing our attention on
the characteristic function of the standardized mean, let
    Z_n = n^{1/2} X̄_n = n^{-1/2} Σ_{k=1}^{n} X_k.
Theorems 2.32 and 2.33 imply that the characteristic function of Z_n is given by ψ^n(n^{-1/2}t). Letting F_n denote the distribution function of Z_n, Theorem 4.23 implies that

    |F_n(x) − Φ(x)| ≤ π^{-1} ∫_{−T}^{T} |t|^{-1} |ψ^n(n^{-1/2}t) − exp(−½t²)| dt + 24 T^{-1} π^{-1} sup_{t∈R} |φ(t)|.    (4.7)
The bound of integration used in Equation (4.7) is chosen to be T = (4/3)ρ^{-1}n^{1/2}, where it follows that since ρ ≥ 1, we have that T ≤ (4/3)n^{1/2}. Now,
Note that 48/5 = 9.6, matching the bound given in Feller (1971). In order to find a bound on the integral in Equation (4.7) we will use Theorem A.10,
find a bound on the integral in Equation (4.7) we will use Theorem A.10,
which states that for any two real numbers ξ and γ we have that

    |ξ^n − γ^n| ≤ n |ξ − γ| ζ^{n−1},    (4.8)

if |ξ| ≤ ζ and |γ| ≤ ζ. We will use this result with ξ = ψ(n^{-1/2}t) and γ =
exp(−½n^{-1}t²), and we therefore need to find a value for ζ. To begin finding such a bound, we first note that

    |ψ(t) − 1 + ½t²| = | ∫_{−∞}^{∞} exp(itx) dF(x) − ∫_{−∞}^{∞} dF(x) − ∫_{−∞}^{∞} itx dF(x) + ½t² ∫_{−∞}^{∞} x² dF(x) |,
where we have used the fact that our assumptions imply that

    ∫_{−∞}^{∞} exp(itx) dF(x) = ψ(t),   ∫_{−∞}^{∞} dF(x) = 1,   ∫_{−∞}^{∞} x dF(x) = 0,   and   ∫_{−∞}^{∞} x² dF(x) = 1.
Now we use the assumption that F has a symmetric distribution about the origin. Note that ρ = E(|X|³), and is not the skewness of F, so that ρ > 0 as long as F is not a degenerate distribution at the origin. In this case, Theorem 2.26 implies that ψ(t) is real valued, and the bound |ψ(t) − 1 + ½t²| ≤ (1/6)ρ|t|³ implies that ψ(t) − 1 + ½t² ≤ (1/6)ρ|t|³, or that

    ψ(t) ≤ 1 − ½t² + (1/6)ρ|t|³.    (4.11)
Note that such a comparison would not make sense if ψ(t) were complex
valued. When ψ(t) is real-valued we have that ψ(t) ≥ 0 for all t ∈ R. Also, if 1 − ½t² > 0, or equivalently if t² < 2, it follows that the right hand side of Equation (4.11) is positive. Hence, if (n^{-1/2}t)² < 2 we have that

    |ψ(n^{-1/2}t)| ≤ 1 − ½(n^{-1/2}t)² + (1/6)ρ|n^{-1/2}t|³ = 1 − ½n^{-1}t² + (1/6)n^{-3/2}ρ|t|³.
Now, if we assume that t is also within our proposed region of integration, that is, |t| ≤ T = (4/3)ρ^{-1}n^{1/2}, it follows that

    (1/6)ρn^{-3/2}|t|³ = (1/6)ρn^{-3/2}|t|²|t| ≤ (4/18)n^{-1}t²,

and therefore |ψ(n^{-1/2}t)| ≤ 1 − (5/18)n^{-1}t². From Theorem A.21 we have that exp(x) ≥ 1 + x for all x ∈ R, so that it follows that 1 − (5/18)n^{-1}t² ≤ exp(−(5/18)n^{-1}t²) for all t ∈ R. Therefore, we have established that |ψ(n^{-1/2}t)| ≤ exp(−(5/18)n^{-1}t²) for all |t| ≤ T, the bound of integration established earlier. It then follows that |ψ(n^{-1/2}t)|^{n−1} ≤ exp[−(5/18)n^{-1}(n − 1)t²]. But note that (5/18)n^{-1}(n − 1) ≥ 1/4 for n ≥ 10, so that |ψ(n^{-1/2}t)|^{n−1} ≤ exp(−¼t²) when n ≥ 10. We will use this bound for ζ in Equation (4.8) to find that
Now

    T^{-1} (2/9) ∫_{−∞}^{∞} t² exp(−¼t²) dt = (2/3) π^{1/2} ρ n^{-1/2},

and

    T^{-1} (1/18) ∫_{−∞}^{∞} |t|³ exp(−¼t²) dt = (2/3) ρ n^{-1/2}.

See Exercise 28. Similarly, (48/5) T^{-1} = (36/5) ρ n^{-1/2}. Therefore, it follows that

    π^{-1} ∫_{−T}^{T} |t|^{-1} |ψ^n(n^{-1/2}t) − exp(−½t²)| dt + 24 T^{-1} π^{-1} sup_{t∈R} |φ(t)|
        ≤ π^{-1} ρ n^{-1/2} [(2/3)π^{1/2} + 2/3 + 36/5] ≤ (136/15) π^{-1} ρ n^{-1/2},

where we have used the fact that π^{1/2} ≤ 9/5 for the second inequality. Now,
to finish up, note that π^{-1} ≤ 135/408, so that we finally have the conclusion from Equation (4.7) that

    |F_n(x) − Φ(x)| ≤ (135/408)(136/15) ρ n^{-1/2} = (18360/6120) ρ n^{-1/2} = 3 ρ n^{-1/2},

which completes the proof.
Note that the more general case where E(Xn ) = µ and V (Xn ) = σ 2 can
be addressed by applying Theorem 4.24 to the standardized sequence Zn =
σ −1 (Xn − µ) for all n ∈ N, yielding the result below.
Corollary 4.2. Let {X_n}_{n=1}^∞ be a sequence of independent and identically distributed random variables from a distribution such that E(X_n) = µ and V(X_n) = σ². If E(|X_n − µ|³) < ∞ then

    sup_{t∈R} |P[n^{1/2} σ^{-1}(X̄_n − µ) ≤ t] − Φ(t)| ≤ n^{-1/2} B σ^{-3} E(|X_n − µ|³).
Corollary 4.2 is proven in Exercise 29. The bound given in Theorem 4.24 and
Corollary 4.2 were first derived by Berry (1941) and Esseen (1942). Extension
to the case of non-identical distributions has been studied by Esseen (1945).
The exact value of the constant B specified in Theorem 4.24 and Corollary
4.2 is not known, though there has been a considerable amount of research
devoted to the topic of finding upper and lower bounds for B. The proof of
Theorem 4.24 uses the constant B = 3. Esseen's original value for the constant is 7.59. Esseen and Wallace also showed, in unpublished work, that B = 2.9 and B = 2.05 work as well. See page 26 of Kolassa (2006). The recent best
upper bounds for B have been shown to be 0.7975 by van Beek (1972) and
0.7655 by Shiganov (1986). Chen (2002) provides further refinements of B.
Further information about this constant can be found in Petrov (1995) and
Zolotarev (1986). A lower bound of 0.4097 is given by Esseen (1956), using a
type of Bernoulli population for the sequence of random variables. A lower
bound based on similar arguments is derived in Example 4.38 below.
Example 4.38. Lower bounds for the constant B used in Theorem 4.24 and
Corollary 4.2 can be found by computing the observed distance between the
standardized distribution of the mean and the standard normal distribution
for specific examples. Any such distance must provide a lower bound for the
maximum distance, and therefore can be used to provide a lower bound for
B. The most useful examples provide a distance that is a multiple of n−1/2
so that a lower bound for B can be derived that does not depend on n.
For example, Petrov (2000) considers the case where {Xn }∞n=1 is a sequence
of independent and identically distributed random variables where Xn has
probability distribution function

    f(x) = { 1/2,  x ∈ {−1, 1},
           { 0,    otherwise.
Now apply Theorem 1.20 (Stirling) to the factorial operators in the combination to find

    \binom{n}{n/2} = n! / [(n/2)! (n/2)!]
                   = [n^n (2nπ)^{1/2} exp(n/2) exp(n/2) [1 + o(1)]] / [nπ (n/2)^{n/2} (n/2)^{n/2} exp(n) [1 + o(1)][1 + o(1)]]
                   = 2^n [2/(nπ)]^{1/2} [1 + o(1)],
where B = 0.7655 can be used as the constant. This bound is given for the
cases studied in Table 4.1 in Table 4.2. Note that in each of the cases studied,
the actual error given in Table 4.1 is lower than the error given by Corollary
4.2.
Table 4.2 Upper bounds on the error of the normal approximation of the
Binomial(n, θ) distribution for n = 5, 10, 25, 50, 100 and θ = 0.01, 0.05, 0.10,
0.25, 0.50 provided by the Corollary 4.2 with B = 0.7655.
θ
n 0.01 0.05 0.10 0.25 0.50
5 3.3725 1.4215 0.9357 0.4941 0.3423
10 2.3847 1.0052 0.6617 0.3494 0.2421
25 1.5082 0.6357 0.4185 0.221 0.1531
50 1.0665 0.4495 0.2959 0.1563 0.1083
100 0.7541 0.3179 0.2092 0.1105 0.0765
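The entries of Table 4.2 follow directly from Corollary 4.2. For a Bernoulli(θ) random variable, σ² = θ(1 − θ) and E(|X_n − θ|³) = θ(1 − θ)[(1 − θ)² + θ²], so the bound equals n^{-1/2} B [(1 − θ)² + θ²]/[θ(1 − θ)]^{1/2}. The short sketch below is our own and reproduces the table with B = 0.7655.

    import numpy as np

    B = 0.7655  # upper bound for the Berry-Esseen constant (Shiganov, 1986)

    def berry_esseen_bound(n, theta):
        # Corollary 4.2 for Bernoulli(theta): sigma^2 = theta(1 - theta) and
        # E|X - theta|^3 = theta(1 - theta)[(1 - theta)^2 + theta^2]
        return B * ((1 - theta) ** 2 + theta ** 2) / np.sqrt(n * theta * (1 - theta))

    for n in (5, 10, 25, 50, 100):
        row = [round(berry_esseen_bound(n, th), 4) for th in (0.01, 0.05, 0.10, 0.25, 0.50)]
        print(n, row)  # compare with Table 4.2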
In Section 3.9 we proved that the sample quantiles converge almost certainly
to the population quantiles under some assumptions on the local behavior of
the distribution function in a neighborhood of the quantile. In this section
we establish that sample quantiles also have an asymptotic Normal distri-
bution. Of interest in this case is the fact that the results again depend on
local properties of the distribution function, in particular, the derivative of the
distribution function at the point of the quantile. This result differs greatly
from the case of the sample moments whose asymptotic Normality depends
on global properties of the distribution, which in that case was dependent
on the moments of the distribution. The main result given below establishes
Normal limits for some specific forms of probabilities involving sample quan-
tiles. These will then be used to establish the asymptotic Normality of the
sample quantiles under certain additional assumptions.
Theorem 4.26. Let {X_n}_{n=1}^∞ be a sequence of independent random variables that have a common distribution F. Let p ∈ (0, 1) and suppose that F is continuous at ξ_p. Then,

1. If F'(ξ_p−) exists and is positive, then for x < 0,

    lim_{n→∞} P{n^{1/2} F'(ξ_p−)(ξ̂_{p,n} − ξ_p)[p(1 − p)]^{-1/2} ≤ x} = Φ(x).

2. If F'(ξ_p+) exists and is positive, then for x > 0,

    lim_{n→∞} P{n^{1/2} F'(ξ_p+)(ξ̂_{p,n} − ξ_p)[p(1 − p)]^{-1/2} ≤ x} = Φ(x).

3. In any case,

    lim_{n→∞} P(ξ̂_{p,n} ≤ ξ_p) = ½.
Proof. Fix t ∈ R and let v be a normalizing constant whose specific value will
be specified later in the proof. Define Gn (t) = P [n1/2 v −1 (ξˆpn − ξp ) ≤ t], which
is the standardized distribution of the pth sample quantile. Now,
The development in Section 3.7 implies that nF̂n (ξp + tvn−1/2 ) has a Bi-
nomial[n, F (ξp + tvn−1/2 )] distribution. Let θ = F (ξp + tvn−1/2 ) and note
that
where the random variable n^{1/2}[F̂_n(ξ_p + tvn^{-1/2}) − θ][θ(1 − θ)]^{-1/2} is a standardized Binomial(n, θ) random variable. Now let us consider the case when t = 0. When t = 0 it follows that θ = F(ξ_p + tvn^{-1/2}) = F(ξ_p) = p and n^{1/2}(p − θ)[θ(1 − θ)]^{-1/2} = 0 as long as p ∈ (0, 1). Theorem 4.20 (Lindeberg and Lévy) then implies that

    lim_{n→∞} P{ n^{1/2}[F̂_n(ξ_p + tvn^{-1/2}) − θ][θ(1 − θ)]^{-1/2} ≥ 0 } = 1 − Φ(0) = ½,
which proves Statement 3. Note that the normalizing constant v does not
enter into this result as it is cancelled out when t = 0. To prove the remaining
statements, note that

    Φ(t) − G_n(t)
      = Φ(t) − P{ n^{1/2}[F̂_n(ξ_p + tvn^{-1/2}) − θ][θ(1 − θ)]^{-1/2} ≥ n^{1/2}(p − θ)[θ(1 − θ)]^{-1/2} }
      = P{ n^{1/2}[F̂_n(ξ_p + tvn^{-1/2}) − θ][θ(1 − θ)]^{-1/2} < n^{1/2}(p − θ)[θ(1 − θ)]^{-1/2} } − 1 + Φ(t)
      = P{ n^{1/2}[F̂_n(ξ_p + tvn^{-1/2}) − θ][θ(1 − θ)]^{-1/2} < n^{1/2}(p − θ)[θ(1 − θ)]^{-1/2} }
          − Φ( n^{1/2}(p − θ)[θ(1 − θ)]^{-1/2} ) + Φ( n^{1/2}(p − θ)[θ(1 − θ)]^{-1/2} ) − 1 + Φ(t)
      = [ P{ n^{1/2}[F̂_n(ξ_p + tvn^{-1/2}) − θ][θ(1 − θ)]^{-1/2} < n^{1/2}(p − θ)[θ(1 − θ)]^{-1/2} }
          − Φ( n^{1/2}(p − θ)[θ(1 − θ)]^{-1/2} ) ] + [ Φ(t) − Φ( n^{1/2}(θ − p)[θ(1 − θ)]^{-1/2} ) ].
To obtain a bound on the first difference we use Theorem 4.24 (Berry and Esseen), which implies that

    sup_{t∈R} | P{ n^{1/2}[F̂_n(ξ_p + tvn^{-1/2}) − θ][θ(1 − θ)]^{-1/2} < t } − Φ(t) | ≤ n^{-1/2} B γ τ^{-3},

where B is a constant that does not depend on n,

    γ = θ|1 − θ|³ + (1 − θ)|−θ|³ = θ(1 − θ)[(1 − θ)² + θ²],

and τ² = θ(1 − θ). Therefore,

    n^{-1/2} B γ τ^{-3} = B θ(1 − θ)[(1 − θ)² + θ²] / [n^{1/2} θ^{3/2} (1 − θ)^{3/2}] = B n^{-1/2} [(1 − θ)² + θ²] / [θ^{1/2}(1 − θ)^{1/2}].
Therefore, it follows that

    | P{ n^{1/2}[F̂_n(ξ_p + tvn^{-1/2}) − θ][θ(1 − θ)]^{-1/2} < n^{1/2}(p − θ)[θ(1 − θ)]^{-1/2} }
        − Φ( n^{1/2}(p − θ)[θ(1 − θ)]^{-1/2} ) | ≤ B n^{-1/2} [(1 − θ)² + θ²] / [θ^{1/2}(1 − θ)^{1/2}],

and, hence,

    |Φ(t) − G_n(t)| ≤ B n^{-1/2} [(1 − θ)² + θ²] / [θ^{1/2}(1 − θ)^{1/2}] + | Φ(t) − Φ( n^{1/2}(θ − p)[θ(1 − θ)]^{-1/2} ) |.
To complete the arguments we must investigate the limiting behavior of θ.
Note that θ = F (ξp + tvn−1/2 ), which is a function of n. Because we have
assumed that F is continuous at ξp it follows that for a fixed value of t ∈ R,
    lim_{n→∞} θ = lim_{n→∞} F(ξ_p + n^{-1/2}vt) = F(ξ_p) = p,

and therefore it follows that

    lim_{n→∞} θ(1 − θ) = p(1 − p).
    lim_{n→∞} |G_n(t) − Φ(t)| ≤ lim_{n→∞} B n^{-1/2} [(1 − θ)² + θ²] / [θ^{1/2}(1 − θ)^{1/2}]
        + lim_{n→∞} | Φ(t) − Φ( n^{1/2}(θ − p)[θ(1 − θ)]^{-1/2} ) | = 0.
Hence, we have shown that

    lim_{n→∞} P[n^{1/2} v^{-1}(ξ̂_{p,n} − ξ_p) ≤ t] = Φ(t),

for all t > 0, which proves Statement 2. Similar arguments are used to prove Statement 1. See Exercise 30.
The result of Theorem 4.26 simplifies when we are able to make additional assumptions about the structure of F. In the first case we assume that F is differentiable at the point ξ_p and that F'(ξ_p) > 0.

Corollary 4.3. Let {X_n}_{n=1}^∞ be a sequence of independent random variables that have a common distribution F. Let p ∈ (0, 1) and suppose that F is differentiable at ξ_p and that F'(ξ_p) > 0. Then,

    n^{1/2} F'(ξ_p)(ξ̂_{p,n} − ξ_p) / [p(1 − p)]^{1/2} →d Z

as n → ∞ where Z is a N(0, 1) random variable.
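Corollary 4.3 can be checked by simulation; the sketch below is our own. For a standard Exponential sample and p = 1/2, the median is ξ_{1/2} = log 2 and F'(ξ_{1/2}) = 1/2, so n^{1/2} F'(ξ_{1/2})(ξ̂_{1/2,n} − ξ_{1/2})/[p(1 − p)]^{1/2} should be approximately N(0, 1) for large n.

    import numpy as np

    rng = np.random.default_rng(4)
    reps, n, p = 20_000, 500, 0.5
    xi_p = np.log(2.0)   # median of the Exponential(1) distribution
    f_xi = 0.5           # density of the Exponential(1) distribution at its median

    x = rng.exponential(size=(reps, n))
    q_hat = np.quantile(x, p, axis=1)   # sample medians
    z = np.sqrt(n) * f_xi * (q_hat - xi_p) / np.sqrt(p * (1 - p))

    print("mean, sd:", z.mean(), z.std())         # approximately 0 and 1
    print("P(Z <= 1.645):", np.mean(z <= 1.645))  # approximately 0.95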
The mean and median of this distribution is zero, and the variance is 25. The asymptotic standard error of the median is

    [2f(ξ_{1/2})]^{-1} = [2^{1/2} π^{-1/2} exp(−9/2)]^{-1} = 2^{-1/2} π^{1/2} exp(9/2).
Figure 4.12 The densities of the bimodal Normal mixture (solid line) and the N(0, 5²) distribution (dashed line) used in Example 4.41.
Let us compare this standard error with that of the sample median based on
a sample of size n from a N(0, 5²) distribution, which has the same median,
mean and variance as the bimodal Normal mixture considered earlier. In this
case the asymptotic standard error is given by
so that the ratio of the asymptotic standard error of the sample median for
the bimodal Normal mixture relative to that of the N(0, 5²) distribution is
Therefore, the sample median has a much larger asymptotic standard error
for estimating the median of the bimodal Normal mixture. This is because
the density of the mixture distribution is much lower near the location of the
median. See Figure 4.12.
It is clear from this result that the distribution of ξˆ1/4 has a longer tail below
ξ1/4 and a shorter tail above ξ1/4 . This is due to the fact that there is less
density, and hence less data, that will typically be observed on average below
ξ1/4 . See Experiment 6.
4.8.1 Exercises
1. Let {X_n}_{n=1}^∞ be a sequence of random variables such that X_n has a Uniform{0, n^{-1}, 2n^{-1}, . . . , 1} distribution for all n ∈ N. Prove that X_n →d X as n → ∞ where X has a Uniform[0, 1] distribution.
2. Let {X_n}_{n=1}^∞ be a sequence of random variables where X_n is an Exponential(θ + n^{-1}) random variable for all n ∈ N, where θ is a positive real constant. Let X be an Exponential(θ) random variable. Prove that X_n →d X as n → ∞.
3. Let {X_n}_{n=1}^∞ be a sequence of random variables such that for each n ∈ N, X_n has a Gamma(α_n, β_n) distribution, where {α_n}_{n=1}^∞ and {β_n}_{n=1}^∞ are sequences of positive real numbers such that α_n → α and β_n → β as n → ∞, for some positive real numbers α and β. Prove that X_n →d X as n → ∞ where X has a Gamma(α, β) distribution.

Figure 4.13 The distribution function considered in Example 4.42. Note that the derivative of the distribution function does not exist at the point 1/2, which equals ξ_{1/4} for this population. According to Theorem 4.26, this is enough to ensure that the asymptotic distribution of the sample quantile is not normal.
4. Let {X_n}_{n=1}^∞ be a sequence of random variables where for each n ∈ N, X_n has a Bernoulli[1/2 + (n + 2)^{-1}] distribution, and let X be a Bernoulli(1/2) random variable. Prove that X_n →d X as n → ∞.
5. Let {X_n}_{n=1}^∞ be a sequence of independent and identically distributed random variables where the distribution function of X_n is

    F_n(x) = { 1 − x^{−θ},  x ∈ (1, ∞),
             { 0,           x ∈ (−∞, 1].
for all x ∈ R for some function F (x). Prove the following properties of
F (x).
14. Prove the second result of Theorem 4.11. That is, let {X_n}_{n=1}^∞ be a sequence of random variables that converge weakly to a random variable X. Let {Y_n}_{n=1}^∞ be a sequence of random variables that converge in probability to a real constant c. Prove that X_n Y_n →d cX as n → ∞.
15. Prove the third result of Theorem 4.11. That is, let {X_n}_{n=1}^∞ be a sequence of random variables that converge weakly to a random variable X. Let {Y_n}_{n=1}^∞ be a sequence of random variables that converge in probability to a real constant c ≠ 0. Prove that X_n / Y_n →d X/c as n → ∞.
16. In the context of the proof of the first result of Theorem 4.11, prove that
P (Xn ≤ x − ε − c) ≤ Gn (x) + P (|Yn − c| > ε).
17. Use Theorem 4.11 to prove that if {X_n}_{n=1}^∞ is a sequence of random variables that converge in probability to a random variable X as n → ∞, then X_n →d X as n → ∞.
18. Use Theorem 4.11 to prove that if {Xn}∞n=1 is a sequence of random variables that converge in distribution to a real constant c as n → ∞, then Xn →p c as n → ∞.
19. In the context of the proof of Theorem 4.3, prove that
lim_{n→∞} |∫_a^b gm(x)dF(x) − ∫_a^b g(x)dF(x)| < (1/3)δε,
for any δε > 0.
20. Let {Xn}∞n=1 be a sequence of d-dimensional random vectors where Xn has distribution function Fn for all n ∈ N and let X be a d-dimensional random vector with distribution function F. Prove that if for any closed set C ⊂ R^d,
lim sup_{n→∞} P(Xn ∈ C) ≤ P(X ∈ C),
then Xn →d X as n → ∞.
22. Let {Xn}∞n=1 be a sequence of d-dimensional random vectors that converge in distribution to a random vector X as n → ∞. Let Xn' = (Xn1, . . . , Xnd) and X' = (X1, . . . , Xd). Prove that if Xn →d X as n → ∞ then Xnk →d Xk as n → ∞ for all k ∈ {1, . . . , d}.
23. Prove the converse part of the proof of Theorem 4.17. That is, let {Xn}∞n=1 be a sequence of d-dimensional random vectors and let X be a d-dimensional random vector. Prove that if Xn →d X as n → ∞ then v'Xn →d v'X as n → ∞ for all v ∈ R^d.
24. Let {Xn}∞n=1 and {Yn}∞n=1 be sequences of random variables where Xn has a N(µn, σn²) distribution and Yn has a N(νn, τn²) distribution, and νn → ν as n → ∞ for
26. In the context of the proof of Theorem 4.20, use Theorem A.22 to prove that [1 − ½n^{-1}t² + o(n^{-1})]^n = [1 − ½n^{-1}t²]^n + o(1) as n → ∞ for fixed t.
27. Let {Xn}∞n=1 be a sequence of independent and identically distributed random variables where Xn has an Exponential(θ) distribution. Prove that the standardized sample mean Zn = n^{1/2}θ^{-1}(X̄n − θ) has a translated Gamma(n, n^{-1/2}) distribution given by
P(Zn ≤ z) = P[n^{1/2}θ^{-1}(X̄n − θ) ≤ z] = ∫_{−n^{1/2}}^{z} [n^{n/2}/Γ(n)] (t + n^{1/2})^{n−1} exp[−n^{1/2}(t + n^{1/2})] dt.
For each sample, compute the sample quantile ξˆ1/4 . When the b samples
have been simulated, a histogram of the b sample values of ξˆ1/4 should be
produced. Run this simulation for n = 5, 10, 25, 50, and 100 with b = 10,000
and discuss the resulting histograms in terms of the theoretical result of
Example 4.42.
CHAPTER 5
Convergence of Moments
And at such a moment, his solitary supporter, Karl comes along wanting to give
him a piece of advice, but instead only shows that all is lost.
Amerika by Franz Kafka
Figure 5.2 Convergence in rth mean when r = 1 in Example 5.1 with X(ω) =
δ{ω; [0, 1]} and Xn (ω) = n−1 (n − 1 + ω) for all ω ∈ [0, 1] and n ∈ N. In this case
E(|Xn − X|) = (2n)−1 corresponds to the triangular area between Xn and X. The
area for n = 3 is shaded on the plot.
variables from a common distribution that has mean µ and variance σ² < ∞. Let X̄n be the sample mean. Then E[(X̄n − µ)²] = n^{-1}σ² → 0 as n → ∞, so that Definition 5.1 implies that X̄n →qm µ as n → ∞. Are we able to also conclude that X̄n converges in fourth mean (r = 4) to µ as n → ∞? This would require that E[(X̄n − µ)⁴] → 0 as n → ∞.
We note, as in Example 2.5, that x^{r/s} is a convex function, so that Theorem 2.11 (Jensen) implies that
lim_{n→∞} [E(|Xn − X|^s)]^{r/s} ≤ lim_{n→∞} E[(|Xn − X|^s)^{r/s}] = lim_{n→∞} E(|Xn − X|^r) = 0,
The sum in Equation (5.2) has n³ terms in all, which can be partitioned as follows. There are n terms where i = j = k, for which the expectation in the sum has the form E(Xi³) = γ, where γ will be used to denote the third moment. There are 3n(n − 1) terms where two of the indices are the same while the other is different. For these terms the expectation has the form E(Xi²)E(Xj) = µ³ + µσ². Finally, there are n(n − 1)(n − 2) terms where i, j, and k are all unequal. In these cases the expectation has the form E(Xi)E(Xj)E(Xk) = µ³. Therefore, it follows that
E(X̄n³) = n^{-3}[nγ + 3n(n − 1)(µ³ + µσ²) + n(n − 1)(n − 2)µ³] = A(n) + n^{-2}(n − 1)(n − 2)µ³,
where A(n) = O(n^{-1}) as n → ∞. One can find E(X̄n⁴) using the same basic approach. That is,
E(X̄n⁴) = n^{-4} Σ_{i=1}^n Σ_{j=1}^n Σ_{k=1}^n Σ_{m=1}^n E(Xi Xj Xk Xm). (5.3)
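As a check on this counting argument, and on the analogous partition of the quadruple sum in Equation (5.3), note that the counts of the index patterns sum to the total number of terms in each case:
n + 3n(n − 1) + n(n − 1)(n − 2) = n³,
and
n + 4n(n − 1) + 3n(n − 1) + 6n(n − 1)(n − 2) + n(n − 1)(n − 2)(n − 3) = n⁴,
where the five counts in the second identity correspond to the patterns in which all four indices are equal, exactly three are equal, the indices form two distinct pairs, exactly two are equal with the rest distinct, and all four are distinct, respectively.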
Proof. We begin proving this result by first showing that under the stated conditions it follows that P(|X| ≤ |Y|) = 1. To show this let δ > 0. Since
P (|Xn | ≤ |Y |) = 1 for all n ∈ N it follows that P (|X| > |Y | + δ) ≤ P (|X| >
|Xn | + δ) = P (|X| − |Xn | > δ) for all n ∈ N. Now Theorem A.18 implies that
|X|−|Xn | ≤ |Xn −X| which implies that P (|X| > |Y |+δ) ≤ P (|Xn −X| > δ),
for all n ∈ N. Therefore,
P (|X| > |Y | + δ) ≤ lim P (|Xn − X| > δ) = 0,
n→∞
where the limiting value follows from Definition 3.1 because we have assumed
p
that Xn − → X as n → ∞. Therefore, we can conclude that P (|X| > |Y |+δ) = 0
for all δ > 0 and hence it follows that P (|X| > |Y |) = 0. Therefore, P (|X| ≤
|Y |) = 1. Now we can begin to work on the problem of interest. Theorem A.18
implies that P (|Xn − X| ≤ |Xn | + |X|) = 1 for all n ∈ N. We have assumed
that P (|Xn | ≤ |Y |) = 1 and have established in the arguments above that
P (|X| ≤ |Y |) = 1, hence it follows that P (|Xn − X| ≤ |2Y |) = 1. The
assumption that E(|Y|^r) < ∞ implies that
lim_{b→∞} E(|Y|^r δ{|Y|; (b, ∞)}) = lim_{b→∞} ∫_b^∞ |y|^r dF(y) = 0.
Hence
lim_{n→∞} E(|Xn − X|^r) ≤ (2^r + 1)ε^r,
for all ε > 0. Since ε is arbitrary, it follows that
lim_{n→∞} E(|Xn − X|^r) = 0,
and hence we have proven that Xn →r X as n → ∞.
The proof of Theorem 5.4 is given in Exercise 6. The result of Theorem 5.4 can actually be strengthened to the conclusion that Xn →c X as n → ∞ without any change to the assumptions. Theorem 5.4 therefore provides us with a condition under which convergence in rth mean implies complete convergence and almost certain convergence. Could convergence in rth mean alone imply almost certain convergence? The following example of Serfling (1980) shows that this is not always the case.
Example 5.6. Let {Xn}∞n=1 be a sequence of independent and identically distributed random variables such that E(Xn) = 0 and V(Xn) = 1. Consider a new sequence of random variables defined by
Yn = {n log[log(n)]}^{-1/2} Σ_{k=1}^n Xk,
for all n ∈ N. In Example 5.3 we showed that Yn →qm 0 as n → ∞. Does it also follow that Yn →a.c. 0? It does not follow, since Theorem 3.15 (Hartman and Wintner) and Definition 1.3 imply that
P( lim_{n→∞} {n log[log(n)]}^{-1/2} Σ_{k=1}^n Xk = 0 ) = 0.
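A simulation makes the distinction visible. The sketch below, which is an illustration and not part of the text, tracks Yn along a single realization of a standard normal sequence; over any finite horizon the path need not reach the ±2^{1/2} envelope suggested by the law of the iterated logarithm, but it visibly fails to settle down at zero.

# Illustration: one path of Y_n = {n*log(log(n))}^(-1/2) * (X_1 + ... + X_n)
# for standard normal X_k. The path keeps oscillating rather than converging to 0.
set.seed(7)
N <- 1e6
x <- rnorm(N)
n <- 3:N                                      # log(log(n)) requires n >= 3
y <- cumsum(x)[n] / sqrt(n * log(log(n)))
plot(n, y, type = "l", log = "x", xlab = "n", ylab = expression(Y[n]))
abline(h = c(-sqrt(2), 0, sqrt(2)), lty = 2)  # envelope suggested by the LIL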
Note that Definition 5.2 requires more than just that each random variable in the sequence be integrable. Indeed, if this were true, then E(|Xn|δ{|Xn|; (a, ∞)}) would converge to zero as a → ∞ for all n ∈ N. Rather, Definition 5.2 requires that this convergence be uniform in n, meaning that the rate at which E(|Xn|δ{|Xn|; (a, ∞)}) converges to zero as a → ∞ must be the same for all n ∈ N, or that the rate of convergence cannot depend on n.
Example 5.7. Let {Xn }∞ n=1 be a sequence of independent and identically dis-
tributed random variables from a distribution F with finite mean µ. That is
E(X) = µ < ∞. The fact that the mean is finite implies that the expectation
E(|Xn |δ{|Xn |; (a, ∞)}) converges to zero as a → ∞, and since the random
variables have identical distributions, the convergence rate for each expecta-
tion is exactly the same. Therefore, by Definition 5.2, the sequence {Xn }∞ n=1
is uniformly integrable.
Example 5.8. Let {Xn}∞n=1 be a sequence of independent N(0, σn²) random variables where {σn}∞n=1 is a sequence of real numbers such that 0 < σn ≤ σ <
∞ for all n ∈ N. For this sequence of random variables it is true that E(|Xn |) <
∞ for all n ∈ N, which ensures that E(|Xn |δ{|Xn |; (a, ∞)}) converges to zero
as a → ∞. To have uniform integrability we must further establish that this
convergence is uniform in n ∈ N. To see this note that for every n ∈ N, we
have that
E(|Xn|δ{|Xn|; (a, ∞)}) = ∫_{|x|>a} (2πσn²)^{-1/2} |x| exp(−½σn^{-2}x²) dx
 = 2 ∫_a^∞ (2πσn²)^{-1/2} x exp(−½σn^{-2}x²) dx
 = 2σn(2π)^{-1/2} ∫_{aσn^{-1}}^∞ v exp(−½v²) dv
 = 2σn(2π)^{-1/2} exp(−½a²σn^{-2}) ≤ 2σ(2π)^{-1/2} exp(−½a²σ^{-2}),
where the final bound does not depend on n and converges to zero as a → ∞. Hence, if aε is chosen so that 2σ(2π)^{-1/2} exp(−½aε²σ^{-2}) < ε,
then E(|Xn|δ{|Xn|; (aε, ∞)}) < ε for all n ∈ N. Since aε does not depend on n,
the convergence of E(|Xn |δ{|Xn |; (a, ∞)}) to zero is uniform in n, and hence
Definition 5.2 implies that the sequence {Xn }∞ n=1 is uniformly integrable.
Example 5.9. Let {Xn}∞n=1 be a sequence of independent random variables where Xn has probability distribution function
fn(x) = 1 − n^{-α} for x = 0, fn(x) = n^{-α} for x = n, and fn(x) = 0 otherwise,
where α > 1. Note that the expected value of Xn is given by E(Xn ) = n1−α ,
240 CONVERGENCE OF MOMENTS
which is finite. Now, for fixed values of n ∈ N we have that for a > 0,
E(|Xn|δ{|Xn|; (a, ∞)}) = n^{1−α} if a < n, and 0 if a ≥ n.
Let ε > 0 be given. Then E(|Xn|δ{|Xn|; (a, ∞)}) < ε as long as a ≥ n. This makes it clear that the convergence of E(|Xn|δ{|Xn|; (a, ∞)}) is not uniform in this case, since the value of a required to obtain E(|Xn|δ{|Xn|; (a, ∞)}) < ε is directly related to n. The only way to obtain a value of a that would ensure E(|Xn|δ{|Xn|; (a, ∞)}) < ε for all n ∈ N would be to let a → ∞.
One fact that follows from the uniform integrability of a sequence of random variables is that the sequence of expectations, and hence its least upper bound, must remain finite.
Theorem 5.5. Let {Xn }∞
n=1 be a sequence of uniformly integrable random
variables. Then,
sup E(|Xn |) < ∞.
n∈N
Proof. Note that E(|Xn |) = E(|Xn |δ{|Xn |; [0, a]}) + E(|Xn |δ{|Xn |; (a, ∞)}).
We can bound the first term as
E(|Xn|δ{|Xn|; [0, a]}) = ∫_0^a |x| dFn(x) ≤ ∫_0^a a dFn(x) = a ∫_0^a dFn(x) ≤ a,
where Fn is the distribution function of Xn for all n ∈ N. For the second term
we note that from Definition 5.2 we can choose a < ∞ large enough so that
E(|Xn |δ{|Xn |; (a, ∞)}) ≤ 1. The uniformity of the convergence implies that
a single choice of a will suffice for all n ∈ N. Hence, it follows that for a large enough, E(|Xn|) ≤ a + 1 for all n ∈ N. Therefore,
sup E(|Xn |) ≤ a + 1 < ∞.
n∈N
If ε > 0 and sup_{n∈N} E(|Xn|^{1+ε}) < ∞, then the upper bound in Equation (5.5) will converge to zero as a → ∞. The convergence is guaranteed to be uniform because the upper bound does not depend on n.
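For reference, the bound in question, which is presumably the content of Equation (5.5), follows from a Markov-type argument: on the event |Xn| > a we have 1 ≤ a^{-ε}|Xn|^ε, so that
E(|Xn|δ{|Xn|; (a, ∞)}) ≤ a^{-ε} E(|Xn|^{1+ε} δ{|Xn|; (a, ∞)}) ≤ a^{-ε} sup_{n∈N} E(|Xn|^{1+ε}),
and the right hand side does not depend on n and converges to zero as a → ∞.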
Example 5.10. Suppose that {Xn }∞ n=1 is a sequence of independent random
variables where Xn has a Gamma(αn , βn ) distribution for all n ∈ N where
{αn }∞ ∞
n=1 and {βn }n=1 are real sequences. Suppose that αn ≤ α < ∞ and
βn ≤ β < ∞ for all n ∈ N. Then E(|Xn |2 ) = αn βn2 + αn2 βn2 ≤ αβ 2 + α2 β 2 < ∞
for all n ∈ N. Then,
sup E(|Xn |2 ) ≤ αβ 2 + α2 β 2 < ∞,
n∈N
Theorem 5.7 is proven in Exercise 8. The need for having an integrable random variable that bounds the sequence can be eliminated by replacing the random variable Y in Theorem 5.7 with the supremum of the sequence {Xn}∞n=1.
Corollary 5.1. Let {Xn}∞n=1 be a sequence of random variables such that E(sup_{n∈N} |Xn|) < ∞. Then the sequence {Xn}∞n=1 is uniformly integrable.
Corollary 5.1 is proven in Exercise 9. The final result we highlight in this section shows that a sequence that is bounded by a uniformly integrable sequence is itself uniformly integrable.
Theorem 5.8. Let {Xn }∞ ∞
n=1 be a sequence of random variables and {Yn }n=1
be a sequence of positive integrable random variables such that P (|Xn | ≤ Yn ) =
1 for all n ∈ N. If the sequence {Yn }∞ n=1 is uniformly integrable then the
sequence {Xn }∞n=1 is uniformly integrable.
The final result of this section links the uniform integrability of the sum of
sequences of random variables to the uniform integrability of the individual
sequences.
Theorem 5.9. Let {Xn }∞ ∞
n=1 and {Yn }n=1 be sequences of uniformly inte-
grable random variables. Then the sequence {Xn + Yn }∞
n=1 is uniformly inte-
grable.
Proof. From Definition 5.2 we know that to show that the sequence {Xn + Yn}∞n=1 is uniformly integrable, we must show that
lim_{a→∞} 2E(|Xn|δ{|Xn|; (½a, ∞)}) = 0,
and
lim_{a→∞} 2E(|Yn|δ{|Yn|; (½a, ∞)}) = 0,
and that in both cases the convergence is uniform. This means that for any
ε > 0 there exist bε and cε that do not depend on n such that
E(|Xn |δ{|Xn |; ( 12 a, ∞)}) < 12 ε,
for all a > bε and
E(|Yn |δ{|Yn |; ( 12 a, ∞)}) < 12 ε,
for all a > cε . Let aε = max{bε , cε } and note that aε does not depend on n.
It then follows from Equation (5.6) that
E(|Xn + Yn|δ{|Xn + Yn|; (a, ∞)}) ≤ ε,
for all a > aε . Definition 5.2 then implies that the sequence {Xn + Yn }∞
n=1 is
uniformly integrable.
where the second equality follows from the fact that Xn →a.c. X as n → ∞. Note further that
inf_{k≥n} |Xk| ≤ |Xn|,
Note that the second equality follows from Definition 1.3 because we have
proven above that the limit in the second term exists and equals E(|X|).
We now have developed sufficient theory to prove the main result which
equates uniform integrability with the convergence of moments.
Theorem 5.11. Let {Xn}∞n=1 be a sequence of random variables that converge almost certainly to a random variable X as n → ∞. Let r > 0. Then
lim_{n→∞} E(|Xn|^r) = E(|X|^r) < ∞
if and only if the sequence {|Xn|^r}∞n=1 is uniformly integrable.
Proof. Note that |Xn − X|^r ≤ (|Xn| + |X|)^r, and Theorem A.20 implies that (|Xn| + |X|)^r ≤ 2^r(|Xn|^r + |X|^r) for r > 0. Therefore |Xn − X|^r ≤ 2^r(|Xn|^r + |X|^r), and Theorem 5.9 then implies that since both {|Xn|^r}∞n=1 and |X|^r are uniformly integrable, it follows that the sequence {|Xn − X|^r}∞n=1 is uniformly integrable. Let ε > 0 and note that
E(|Xn − X|^r) = E(|Xn − X|^r δ{|Xn − X|; [0, ε]}) + E(|Xn − X|^r δ{|Xn − X|; (ε, ∞)}) ≤ ε^r + E(|Xn − X|^r δ{|Xn − X|; (ε, ∞)}).
Therefore, Theorem 1.6 implies that
lim sup E(|Xn − X|r ) ≤ εr + lim sup E(|Xn − X|r δ{|Xn − X|; (ε, ∞)}).
n→∞ n→∞
For the second term on the right hand side we use Theorem 2.14 (Fatou) to
conclude that
Since ε is arbitrary and E(|Xn − X|^r) ≥ 0, it follows that we have proven that
lim_{n→∞} E(|Xn − X|^r) = 0,
or equivalently that Xn →r X as n → ∞. We now intend to use this result
to show that the corresponding expectations converge. To do this we need to
consider two cases. In the first case we assume that 0 < r ≤ 1, for which
Theorem 2.8 implies that
E(|Xn |r ) = E[|(Xn − X) + X|r ] ≤ E(|Xn − X|r ) + E(|X|r )
or equivalently that E(|Xn|^r) − E(|X|^r) ≤ E(|Xn − X|^r). Therefore, the fact that Xn →r X as n → ∞ implies that
lim_{n→∞} [E(|Xn|^r) − E(|X|^r)] ≤ lim_{n→∞} E(|Xn − X|^r) = 0.
Similar arguments, based on Theorem 2.9 in place of Theorem 2.8, are used
for the case where r > 1. See Exercise 11. For a proof on the converse see
Section 5.5 of Gut (2005).
The result of Theorem 5.11 also holds for convergence in probability. The proof in this case is nearly the same, except that one needs to prove Theorem 5.10 for the case when the random variables converge in probability.
Theorem 5.12. Let {Xn}∞n=1 be a sequence of random variables that converge in probability to a random variable X as n → ∞. Let r > 0. Then
lim_{n→∞} E(|Xn|^r) = E(|X|^r) < ∞
if and only if the sequence {|Xn|^r}∞n=1 is uniformly integrable.
The results of Theorems 5.11 and 5.12 also hold for convergence in distribu-
tion, though the proof is slightly different.
Theorem 5.13. Let {Xn}∞n=1 be a sequence of random variables that converge in distribution to a random variable X as n → ∞. Let r > 0. Then
lim_{n→∞} E(|Xn|^r) = E(|X|^r) < ∞
if and only if the sequence {|Xn|^r}∞n=1 is uniformly integrable.
Proof. We will prove the sufficiency of the uniform integrability of the se-
quence following the method of proof used by Serfling (1980). For a proof of
the necessity see Section 5.5 of Gut (2005). Suppose that the limiting random
variable X has a distribution function F . Let ε > 0 and choose a positive real
value a such that both a and −a are continuity points of F and that
sup E(|Xn |r δ{|Xn |; [a, ∞)}) < ε.
n∈N
This is possible because we have assumed that the sequence {|Xn |r }∞ n=1 is
uniformly integrable and we can therefore find a real value a that does not
depend on n such that E(|Xn |r δ{|Xn |; [a, ∞)}) < ε for all n ∈ N. Now choose
a real number b such that b > a and that b and −b are also continuity points
of F. Consider the function |x|^r δ{|x|; [a, b]}, which is a continuous function on the interval [a, b]. It therefore follows from Theorem 4.3 (Helly and Bray) that
lim E(|Xn |r δ{|Xn |; [a, b]}) = E(|X|r δ{|X|; [a, b]}).
n→∞
Now, note that for every n ∈ N we have that
|Xn |r δ{|Xn |; [a, b]} ≤ |Xn |r δ{|Xn |; [a, ∞)},
with probability one. Theorem A.16 then implies that
E(|Xn |r δ{|Xn |; [a, b]}) ≤ E(|Xn |r δ{|Xn |; [a, ∞)}) < ε,
for every n ∈ N. Therefore it follows that
lim E(|Xn |r δ{|Xn |; [a, b]}) = E(|X|r δ{|X|; [a, b]}) < ε.
n→∞
This result holds for every b > a. Hence, it further follows from Theorem 1.12
(Lebesgue) that
lim E(|X|r δ{|X|; [a, b]}) = E(|X|r δ{|X|; [a, ∞)}) < ε.
b→∞
This also in turn implies that E(|X|r ) < ∞. Keeping the value of a as specified
above, we have that
|E(|Xn |r ) − E(|X|r )| = |E(|Xn |r δ{|Xn |; [0, a]}) + E(|Xn |r δ{|Xn |; (a, ∞)})
− E(|X|r δ{|X|; [0, a]}) − E(|X|r δ{|X|; (a, ∞)})|.
Theorem A.18 implies that
|E(|Xn |r ) − E(|X|r )| ≤ |E(|Xn |r δ{|Xn |; [0, a]}) − E(|X|r δ{|X|; [0, a]})|+
|E(|Xn |r δ{|Xn |; (a, ∞)}) − E(|X|r δ{|X|; (a, ∞)})|.
The expectations in the second term are each less than ε, so that
|E(|Xn |r ) − E(|X|r )| ≤
|E(|Xn |r δ{|Xn |; [0, a]}) − E(|X|r δ{|X|; [0, a]})| + 2ε.
Noting once again that the function |x|r δ{|x|; [0, a]} is continuous on [0, a], we
apply Theorem 4.3 (Helly and Bray) to find that
so that
lim_{n→∞} P(|Xn| > ε) = 0.
Hence Xn →p 0 as n → ∞. Theorem 5.12 further implies that
lim_{n→∞} E(Xn) = E(0) = 0.
Example 5.13. Let {Un}∞n=1 be a sequence of random variables where Un has a Uniform(0, n^{-1}) distribution for all n ∈ N. It can be shown that Un →a.c. 0 as n → ∞. Let fn(u) = nδ{u; (0, n^{-1})} denote the density of Un for all n ∈ N, and let Fn denote the corresponding distribution function. Note that E(|Un|δ{|Un|; (a, ∞)}) = 0 for all a > 1. This proves that the sequence is uniformly integrable, and hence it follows from Theorem 5.11 that
lim_{n→∞} E(Un) = 0.
Note that this property could have been addressed directly by noting that E(Un) = (2n)^{-1} for all n ∈ N, and therefore
lim_{n→∞} E(Un) = lim_{n→∞} (2n)^{-1} = 0.
Example 5.14. Suppose that {Xn }∞ n=1 is a sequence of independent random
variables. Suppose that Xn has a Gamma(αn , βn ) distribution for all n ∈ N
where {αn }∞ ∞
n=1 and {βn }n=1 are real sequences. Suppose that αn ≤ α < ∞
and βn ≤ β < ∞ for all n ∈ N and that αn → α and βn → β as n → ∞. It was
shown in Example 5.10 that the sequence {Xn }∞ n=1 is uniformly integrable.
It also follows that Xn →d X as n → ∞ where X is a Gamma(α, β) random variable. Theorem 5.13 implies that E(Xn) → E(X) = αβ as n → ∞.
5.4.1 Exercises
Prove that ||X||r is a norm. That is, show that ||X||r has the following
properties:
for all n ∈ N.
a. Prove that Xn →p 0 as n → ∞.
b. Let r > 0. Determine whether Xn →r 0 as n → ∞.
5. Suppose that {Xn}∞n=1 is a sequence of independent random variables from a common distribution that has mean µ and variance σ², such that E(|Xn|⁴) < ∞. Prove that
E(X̄n⁴) = n^{-4}[nλ + 4n(n − 1)µγ + 3n(n − 1)(µ² + σ²)² + 6n(n − 1)(n − 2)µ²(µ² + σ²) + n(n − 1)(n − 2)(n − 3)µ⁴] = B(n) + n^{-3}(n − 1)(n − 2)(n − 3)µ⁴,
where B(n) = O(n^{-1}) as n → ∞, γ = E(Xn³), and λ = E(Xn⁴).
6. Let {Xn}∞n=1 be a sequence of random variables that converge in rth mean to a random variable X as n → ∞ for some r > 0. Prove that if
Σ_{n=1}^∞ E(|Xn − X|^r) < ∞,
then Xn →a.c. X as n → ∞.
7. Let {Xn}∞n=1 be a sequence of independent random variables where Xn has probability distribution function fn(x) = 1 − n^{-α} for x = 0, fn(x) = n^{-α} for x = n, and fn(x) = 0 otherwise, where 0 < α < 1. Prove that there does not exist a random variable Y such that P(|Xn| ≤ |Y|) = 1 for all n ∈ N and E(|Y|) < ∞.
8. Let {Xn }∞n=1 be a sequence of random variables such that P (|Xn | ≤ Y ) = 1
for all n ∈ N where Y is a positive integrable random variable. Prove that
the sequence {Xn }∞ n=1 is uniformly integrable.
9. Let {Xn}∞n=1 be a sequence of random variables such that E(sup_{n∈N} |Xn|) < ∞. Prove that the sequence {Xn}∞n=1 is uniformly integrable.
5.4.2 Experiments
a. F is a N(0, 1) distribution.
b. F is a Cauchy(0, 1) distribution, where µ is taken to be the median of
the distribution.
c. F is a Exponential(1) distribution.
d. F is a T(2) distribution.
e. F is a T(3) distribution.
against n on the same set of axes. Describe the behavior observed for the
sequence in each case in terms of the result of Example 5.3. Repeat the
experiment for each θ ∈ {0.01, 0.10, 0.50, 0.75}.
3. Write a program in R that simulates a sequence of independent random variables X1, . . . , X100 where Xn has probability distribution function fn(x) = 1 − n^{-α} for x = 0, fn(x) = n^{-α} for x = n, and fn(x) = 0 elsewhere. Repeat the experiment five times and plot each sequence Xn against n on the same set of axes. Describe the behavior observed for the sequence in each case. Repeat the experiment for each α ∈ {0.5, 1.0, 1.5, 2.0}. (A sketch of one possible implementation is given after this list of experiments.)
4. Write a program in R that simulates a sequence of independent random
variables X1 , . . . , X100 where Xn is a N(0, σn2 ) random variable where the
sequence {σn }∞ n=1 is specified below. Repeat the experiment five times and
plot each sequence Xn against n on the same set of axes. Describe the
behavior observed for the sequence in each case, and relate the behavior to
the results of Exercise 17.
a. σn = n−1 for all n ∈ N.
b. σn = n for all n ∈ N.
c. σn = 10 + (−1)n for all n ∈ N.
d. {σn }∞
n=1 is a sequence of independent random variables where σn has an
Exponential(θ) distribution for each n ∈ N.
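A sketch of one possible implementation of Experiment 3 is given below; the plotting choices are arbitrary.

# Sketch of Experiment 3: five realizations of X_1,...,X_100 where
# P(X_n = n) = n^(-alpha) and P(X_n = 0) = 1 - n^(-alpha).
alpha <- 0.5
n <- 1:100
plot(NULL, xlim = c(1, 100), ylim = c(0, 100), xlab = "n", ylab = expression(X[n]))
for (r in 1:5) {
  x <- ifelse(runif(100) < n^(-alpha), n, 0)   # two-point distribution for X_n
  lines(n, x, type = "b", pch = r, col = r)
}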
CHAPTER 6
Central Limit Theorems
They formed a unit of the sort that normally can be formed only by matter that
is lifeless.
The Trial by Franz Kafka
6.1 Introduction
One of the important and interesting features of the Central Limit Theorem
is that the weak convergence of the mean holds under many situations beyond
the simple situation where we observe a sequence of independent and iden-
tically distributed random variables. In this chapter we will explore some of
these extensions. The two main direct extensions of the Central Limit Theo-
rem we will consider are to non-identically distributed random variables and to
triangular arrays. Of course, other generalizations are possible, and we only
present some of the simpler cases. For a more general presentation of this
subject see Gnedenko and Kolmogorov (1968). We will also consider transfor-
mations of asymptotically normal statistics that either result in asymptoti-
cally Normal statistics, or statistics that follow a ChiSquared distribution.
As we will show, the difference between these two outcomes depends on the
smoothness of the transformation.
The Lindeberg, Lévy, and Feller version of the Central Limit Theorem relaxes
the assumption that the random variables in the sequence need to be identi-
cally distributed, but still retains the assumption of independence. The result
originates from the work of Lindeberg (1922), Lévy (1925), and Feller (1935),
who each proved various parts of the final result.
Theorem 6.1 (Lindeberg, Lévy, and Feller). Let {Xn }∞ n=1 be a sequence of
independent random variables where E(Xn ) = µn and V (Xn ) = σn2 < ∞ for
all n ∈ N, where {µn}∞n=1 and {σn²}∞n=1 are sequences of real numbers. Let
µ̄n = n^{-1} Σ_{k=1}^n µk,
and
τn² = Σ_{k=1}^n σk². (6.1)
Then Zn = nτn^{-1}(X̄n − µ̄n) →d Z as n → ∞, where Z has a N(0, 1) distribution, if and only if
lim_{n→∞} τn^{-2} Σ_{k=1}^n E(|Xk − µk|² δ{|Xk − µk|; (ετn, ∞)}) = 0, (6.3)
for every ε > 0.
Proof. We will prove the sufficiency of the condition given in Equation (6.3),
but not its necessity. See Section 7.2 of Gut (2005) for details on the necessity
part of the proof. The main argument of this proof is based on the same idea
that we used in proving Theorem 4.20 (Lindeberg and Lévy) in that we will
show that the characteristic function of Zn converges to the characteristic
function of a N(0, 1) random variable. As in the proof of Theorem 4.20 we
begin by assuming that µn = 0 for all n ∈ N which does not reduce the
generality of the proof since the numerator of Zn has the form
X̄n − µ̄n = n^{-1} Σ_{k=1}^n (Xk − µk) = n^{-1} Σ_{k=1}^n Xk*,
where E(Xk∗ ) = 0 for all k ∈ {1, . . . , n}. Let ψk be the characteristic function
of Xk for all k ∈ N. Theorem 2.33 implies that the characteristic function of
the sum of X1 , . . . , Xn is
Π_{k=1}^n ψk(t).
Note that
Zn = nτn^{-1}(X̄n − µ̄n) = τn^{-1} Σ_{k=1}^n Xk,
Note that the final term in the product has the form
exp{−½τn^{-2} Σ_{k=1}^n σk² t²} = exp{−½t² τn^{-2} Σ_{k=1}^n σk²} = exp(−½t²),
and
lim_{n→∞} exp{ Σ_{k=1}^n ψk(τn^{-1}t) − Σ_{k=1}^n (1 − ½τn^{-2}σk²t²) } = 1,
which is equivalent to showing that
lim_{n→∞} [ Σ_{k=1}^n log[ψk(τn^{-1}t)] + Σ_{k=1}^n (1 − ψk(τn^{-1}t)) ] = 0, (6.4)
and
lim_{n→∞} [ Σ_{k=1}^n ψk(τn^{-1}t) − Σ_{k=1}^n (1 − ½τn^{-2}σk²t²) ] = 0. (6.5)
We work on showing the property in Equation (6.4) first. Theorem 2.30 implies that |ψk(t) − 1 − itE(Xk)| ≤ E(½t²Xk²). In our case we are evaluating the characteristic function at τn^{-1}t and we have assumed that E(Xk) = 0. Therefore, we have that
|ψk(τn^{-1}t) − 1| ≤ E(t²Xk²/(2τn²)) = t²σk²/(2τn²) ≤ ½t²τn^{-2} max_{1≤k≤n} σk². (6.6)
This completes our first task in this proof since we have proven that Equation
(6.4) is true. We now take up the task of proving that Equation (6.5) is true.
Theorem 2.30 implies that
|ψk(tτn^{-1}) − 1 − itτn^{-1}E(Xk) + ½t²τn^{-2}E(Xk²)| ≤ E(t²Xk²/τn²), (6.8)
which, due to the assumption that E(Xk) = 0, simplifies to
|ψk(tτn^{-1}) − 1 + t²σk²/(2τn²)| ≤ E(t²Xk²/τn²). (6.9)
Theorem 2.30 and similar calculations to those used earlier can also be used to establish that
|ψk(tτn^{-1}) − 1 + t²σk²/(2τn²)| ≤ E(|t|³|Xk|³/(6τn³)).
Now, Theorem A.18 and Equations (6.8) and (6.9) imply that
| Σ_{k=1}^n [ψk(tτn^{-1}) − 1 + t²σk²/(2τn²)] | ≤ Σ_{k=1}^n |ψk(tτn^{-1}) − 1 + t²σk²/(2τn²)| ≤ Σ_{k=1}^n min{ E(t²Xk²/τn²), E(|t|³|Xk|³/(6τn³)) }.
We now split each of the expectations across small and large values of |Xk|. Let ε > 0. Then
E(t²Xk²/τn²) = E(t²Xk²τn^{-2} δ{|Xk|; [0, ετn]}) + E(t²Xk²τn^{-2} δ{|Xk|; (ετn, ∞)}),
and
E(|t|³|Xk|³/(6τn³)) = E(|t|³|Xk|³(6τn³)^{-1} δ{|Xk|; [0, ετn]}) + E(|t|³|Xk|³(6τn³)^{-1} δ{|Xk|; (ετn, ∞)}).
The first term on the right hand side can be simplified by bounding |Xk|³ ≤
ετn|Xk|² due to the condition imposed by the indicator function. That is,
Σ_{k=1}^n E(|t|³|Xk|³(6τn³)^{-1} δ{|Xk|; [0, ετn]}) ≤ (|t|³ετn/(6τn³)) Σ_{k=1}^n E(|Xk|² δ{|Xk|; [0, ετn]}) ≤ (1/6)|t|³ετn^{-2} Σ_{k=1}^n E(|Xk|²) = (1/6)|t|³ετn^{-2} Σ_{k=1}^n σk² = (1/6)|t|³ε.
Equation (6.3) implies that the second term on the right hand side of Equation
(6.11) converges to zero as n → ∞, therefore
lim sup_{n→∞} Σ_{k=1}^n |ψk(tτn^{-1}) − 1 + t²σk²/(2τn²)| ≤ (1/6)|t|³ε,
τn² = Σ_{k=1}^n σk²,
then Zn = nτn^{-1}(X̄n − µ̄n) →d Z as n → ∞, where Z has a N(0, 1) distribution.
Proof. We will follow the method of proof of Serfling (1980). We will show that
the conditions of Equations (6.2) and (6.3) follow from the condition given in
Equation (6.12). Let ε > 0 and focus on the term inside the summation of
Equation (6.3). We note that
E(|Xk − µk|² δ{|Xk − µk|; (ετn, ∞)}) ≤ (ετn)^{2−η} E(|Xk − µk|^η δ{|Xk − µk|; (ετn, ∞)}),
where we note that the inequality comes from the fact that η > 2, so that the exponent 2 − η < 0, and hence
|Xk − µk|^{2−η} δ{|Xk − µk|; (ετn, ∞)} ≤ (ετn)^{2−η} δ{|Xk − µk|; (ετn, ∞)}.
The inequality then follows from an application of Theorem A.16. It then further follows that
lim sup_{n→∞} τn^{-2} Σ_{k=1}^n E(|Xk − µk|² δ{|Xk − µk|; (ετn, ∞)}) ≤ ε^{2−η} lim sup_{n→∞} τn^{-η} Σ_{k=1}^n E(|Xk − µk|^η) = 0,
by Equation (6.12). A similar result follows for the limit infimum, so that it follows that
lim_{n→∞} τn^{-2} Σ_{k=1}^n E(|Xk − µk|² δ{|Xk − µk|; (ετn, ∞)}) = 0,
and therefore the condition of Equation (6.3) is satisfied. We now show that
the condition of Equation (6.2) is also satisfied. To do this, we note that for
all k ∈ {1, . . . , n}, we have that
Therefore,
The condition given in Equation (6.3) implies that the second term on the
right hand side of Equation (6.13) converges to zero as n → ∞. Therefore
lim_{n→∞} max_{k∈{1,...,n}} τn^{-2}σk² ≤ ε²,
and since ε is arbitrary, the limit must be zero, so that the condition of Equation (6.2) is satisfied.
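Before turning to triangular arrays, the conclusion of Corollary 6.1 can be illustrated with a small simulation. The choice of independent Exponential variables with means 1, 2, . . . , n below is an assumption made only for illustration; it satisfies the Liapounov condition with η = 4 since Σ E(|Xk − µk|⁴) = 9 Σ k⁴ = o[(Σ k²)²].

# Independent but not identically distributed: X_k Exponential with mean k.
# Z_n = (sum X_k - sum k)/sqrt(sum k^2) should be approximately N(0,1).
set.seed(21)
n <- 200
b <- 10000
mu <- 1:n
tau <- sqrt(sum(mu^2))
z <- replicate(b, sum(rexp(n, rate = 1 / mu) - mu) / tau)
hist(z, breaks = 50, freq = FALSE, main = "", xlab = expression(Z[n]))
curve(dnorm(x), add = TRUE, lwd = 2)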
Under certain conditions the result of Theorem 4.20 can be extended to double
arrays as well. For simplicity we give the result for triangular arrays.
Theorem 6.2. Let {{Xnk }nk=1 }∞ n=1 be a triangular array where Xn1 , . . . , Xnn
are mutually independent random variables for each n ∈ N. Suppose that Xnk
has mean µnk and variance σnk² < ∞ for all k ∈ {1, . . . , n} and n ∈ N. Let
µn· = Σ_{k=1}^n µnk,
and
σn·² = Σ_{k=1}^n σnk².
Then
lim_{n→∞} max_{k∈{1,...,n}} P(|Xnk − µnk| > εσn·) = 0, (6.14)
and
σn·² = Σ_{k=1}^n σnk².
Suppose that for some η > 2,
Σ_{k=1}^n E(|Xnk − µnk|^η) = o(σn·^η),
as n → ∞. Then
Zn = σn·^{-1}( Σ_{k=1}^n Xnk − µn· ) →d Z,
as n → ∞, where Z is a N(0, 1) random variable.
For the proof of Corollary 6.2, see Exercise 5.
Example 6.2. Consider a triangular array {{Xnk }nk=1 }∞ n=1 where the se-
quence Xn1 , . . . , Xnn is a set of independent and identically distributed ran-
dom variables from an Exponential(θn ) distribution where {θn }∞ n=1 is a
sequence of positive real numbers that converges to a real value θ as n → ∞.
In this case µnk = θn and σnk² = θn² for all k ∈ {1, . . . , n} and n ∈ N, so that
µn· = Σ_{k=1}^n µnk = Σ_{k=1}^n θn = nθn,
and
σn·² = Σ_{k=1}^n σnk² = Σ_{k=1}^n θn² = nθn².
We will use Corollary 6.2 with η = 4, so that we have that
Σ_{k=1}^n E(|Xnk − µnk|⁴) = Σ_{k=1}^n E(|Xnk − θn|⁴) = Σ_{k=1}^n 9θn⁴ = 9nθn⁴,
and hence,
Σ_{k=1}^n E(|Xnk − µnk|⁴) = o(σn·⁴),
as n → ∞. Therefore, Corollary 6.2 implies that
Zn = n^{-1/2}θn^{-1}( Σ_{k=1}^n Xnk − nθn ) →d Z,
as n → ∞, where Z is a N(0, 1) random variable.
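A quick numerical check of this conclusion can be carried out in R; the sequence θn = 1 + n^{-1} used below is an arbitrary assumed choice that converges to θ = 1.

# Triangular array check: X_n1,...,X_nn iid Exponential with mean theta_n = 1 + 1/n.
# Z_n = (sum_k X_nk - n*theta_n)/(sqrt(n)*theta_n) should be approximately N(0,1).
set.seed(42)
b <- 10000
zn <- function(n) {
  theta <- 1 + 1 / n
  x <- rexp(n, rate = 1 / theta)               # Exponential with mean theta
  (sum(x) - n * theta) / (sqrt(n) * theta)
}
z <- replicate(b, zn(100))
hist(z, breaks = 50, freq = FALSE, main = "", xlab = expression(Z[n]))
curve(dnorm(x), add = TRUE, lwd = 2)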
lim_{n→∞} σn = 0.
Proof. The method of proof used here is to use Theorem 4.11 (Slutsky) to
show that σn−1 (Xn − µ) and [σn g 0 (µ)]−1 [g(Xn ) − g(µ)] have the same limiting
distribution. To this end, define a function h as h(x) = (x − µ)−1 [g(x) −
g(µ)] − g 0 (µ) for all x 6= µ. The definition of derivative motivates us to define
p
h(µ) = 0. Now, Theorem 4.21 implies than Xn − → µ as n → ∞, and therefore
p
Theorem 3.7 implies that h(Xn ) − → h(µ) = 0 as n → ∞. Theorem 4.11 then
p
implies that h(Xn )σn−1 (Xn − µ) − → 0 as n → ∞. But, this implies that
[ (g(Xn) − g(µ))/(Xn − µ) − g'(µ) ] (Xn − µ)/σn = (g(Xn) − g(µ))/σn − g'(µ)(Xn − µ)/σn →p 0,
as n → ∞, which in turn implies that
(g(Xn) − g(µ))/(σn g'(µ)) − (Xn − µ)/σn →p 0,
as n → ∞. Therefore, Theorem 4.11 implies that
(g(Xn) − g(µ))/(σn g'(µ)) − (Xn − µ)/σn + (Xn − µ)/σn →d Z,
where Z is a N(0, 1) random variable. Hence
(g(Xn) − g(µ))/(σn g'(µ)) →d Z,
as n → ∞, and the result is proven.
Theorem 6.3 indicates that the change in variation required to maintain the
asymptotic normality of a transformation of an asymptotically normal random
variable is related to the derivative of the transformation. This is because the
asymptotic variation in g(Xn ) is related to the local change in g near µ due
p
to the fact that Xn − → µ as n → ∞. To visualize this consider Figures 6.1
and 6.2. In Figure 6.1 the variation around µ decreases through the function
g due to the small derivative of g in a neighborhood of µ, while in Figure 6.2 the variation around µ increases through the function g due to the large derivative of g in a neighborhood of µ. If the derivative of g is zero at µ, then there will
be no variation in g as Xn approaches µ as n → ∞. This fact does not allow
us to obtain an asymptotic Normal result for the transformed sequence of
random variables, though other types of weak convergence are possible.
Example 6.3. Let {Xn }∞ n=1 be a sequence of independent and identically
distributed random variables from a distribution with variance σ 2 < ∞. Let
Sn2 be the sample variance. Then, under some minor conditions, it can be
d
shown that n1/2 (µ4 − σ 4 )−1/2 (Sn2 − σ 2 ) −
→ Z as n → ∞ where Z is a N(0, 1)
random variable and µ4 is the fourth moment of Xn , which is assumed to
be finite. Asymptotic normality can also be obtained for the sample standard deviation by considering the function g(x) = x^{1/2}, where
(d/dx) g(x) |_{x=σ²} = (d/dx) x^{1/2} |_{x=σ²} = ½σ^{-1},
so that Theorem 6.3 implies that 2σn^{1/2}(µ4 − σ⁴)^{-1/2}(Sn − σ) →d Z as n → ∞.
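A hedged numerical check of this conclusion can be carried out with Exponential(1) data, an arbitrary choice for which σ² = 1 and µ4 = 9.

# Delta method check: 2*sigma*sqrt(n)*(S_n - sigma)/sqrt(mu4 - sigma^4) ~ N(0,1),
# illustrated with Exponential(1) data (sigma = 1, mu4 = 9).
set.seed(9)
n <- 200
b <- 10000
sigma <- 1
mu4 <- 9
s <- replicate(b, {
  x <- rexp(n)
  sqrt(mean((x - mean(x))^2))   # S_n with the n^(-1) normalization used in the text
})
z <- 2 * sigma * sqrt(n) * (s - sigma) / sqrt(mu4 - sigma^4)
hist(z, breaks = 50, freq = FALSE, main = "", xlab = "standardized sample standard deviation")
curve(dnorm(x), add = TRUE, lwd = 2)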
but
(d²/dx²) g(x) |_{x=0} = 2.
In this case Theorem 6.4 implies that σn^{-2}(Xn² − µ²) →d Z² as n → ∞. Therefore, in the case where µ = 0, the limiting distribution is a ChiSquared(1) distribution.
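This degenerate-derivative case is easy to see numerically. In the sketch below Xn is taken to be the mean of n standard normal observations, an assumed choice for which µ = 0 and σn² = n^{-1}, so that σn^{-2}Xn² = n X̄n².

# When g(x) = x^2 and mu = 0, n*Xbar_n^2 has a ChiSquared(1) limiting distribution.
set.seed(3)
n <- 100
b <- 10000
w <- replicate(b, n * mean(rnorm(n))^2)
hist(w, breaks = 100, freq = FALSE, main = "", xlab = expression(n * bar(X)[n]^2))
curve(dchisq(x, df = 1), from = 0.01, add = TRUE, lwd = 2)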
Proof. We generalize the argument used to prove Theorem 6.3. Define a func-
tion h that maps R^d to R as
h(x) = ‖x − θ‖^{-1}[g(x) − g(θ) − d'(θ)(x − θ)],
where we define h(θ) = 0. Therefore,
n^{1/2}[g(Xn) − g(θ)] = n^{1/2}‖Xn − θ‖h(Xn) + n^{1/2}d'(θ)(Xn − θ).
By assumption we know that n^{1/2}(Xn − θ) →d Z as n → ∞ where Z has a N(0, Σ) distribution. Therefore Theorem 4.17 (Cramér and Wold) implies that n^{1/2}d'(θ)(Xn − θ) →d Z1 as n → ∞ where Z1 is a N[0, d'(θ)Σd(θ)] random variable. Since ‖ · ‖ denotes a vector norm on R^d we have that n^{1/2}‖Xn − θ‖ = ‖n^{1/2}(Xn − θ)‖. Assuming that the norm is continuous, Theorem 4.18 implies that ‖n^{1/2}(Xn − θ)‖ →d ‖Z‖ as n → ∞. Serfling (1980) argues that the function h is continuous, and therefore since Xn →p θ as n → ∞ it follows that h(Xn) →p h(θ) = 0 as n → ∞. Therefore, Theorem 4.11 (Slutsky) implies that n^{1/2}‖Xn − θ‖h(Xn) →p 0 as n → ∞, and another application of Theorem 4.11 implies that
n^{1/2}‖Xn − θ‖h(Xn) + n^{1/2}d'(θ)(Xn − θ) →d Z,
Si² = n^{-1} Σ_{k=1}^n (Xik − X̄i)²,
and
X̄i = n^{-1} Σ_{k=1}^n Xik,
for i = 1, 2. We will first show that a properly normalized function of the random vector (S1², S2², S12) converges in distribution to a Normal distribution.
Following the arguments of Sen and Singer (1993) we first note that
S12 = n^{-1} Σ_{k=1}^n (X1k − X̄1)(X2k − X̄2)
 = n^{-1} Σ_{k=1}^n [(X1k − µ1) + (µ1 − X̄1)][(X2k − µ2) + (µ2 − X̄2)]
 = n^{-1} Σ_{k=1}^n (X1k − µ1)(X2k − µ2) + n^{-1} Σ_{k=1}^n (X1k − µ1)(µ2 − X̄2)
   + n^{-1} Σ_{k=1}^n (µ1 − X̄1)(X2k − µ2) + n^{-1} Σ_{k=1}^n (µ1 − X̄1)(µ2 − X̄2). (6.16)
The two middle terms in Equation (6.16) can be simplified as
n^{-1} Σ_{k=1}^n (X1k − µ1)(µ2 − X̄2) = n^{-1}(µ2 − X̄2) Σ_{k=1}^n (X1k − µ1) = (µ2 − X̄2)(X̄1 − µ1),
and
n^{-1} Σ_{k=1}^n (µ1 − X̄1)(X2k − µ2) = (X̄1 − µ1)(µ2 − X̄2).
Therefore, it follows that
S12 = n^{-1} Σ_{k=1}^n (X1k − µ1)(X2k − µ2) − (µ1 − X̄1)(µ2 − X̄2)
    = n^{-1} Σ_{k=1}^n (X1k − µ1)(X2k − µ2) + R12.
Theorem 3.10 (Weak Law of Large Numbers) implies that X̄1 →p µ1 and X̄2 →p µ2 as n → ∞, so that Theorem 3.9 implies that R12 →p 0 as n → ∞. It can similarly be shown that
Si² = n^{-1} Σ_{k=1}^n (Xik − µi)² + Ri,
where Ri →p 0 as n → ∞ for i = 1, 2. See Exercise 11. Now define a random
vector Un' = (S1² − σ1², S2² − σ2², S12 − τ) for all n ∈ N. Let λ ∈ R³ where λ' = (λ1, λ2, λ3) and observe that
n^{1/2}λ'Un = n^{1/2}[λ1(S1² − σ1²) + λ2(S2² − σ2²) + λ3(S12 − τ)]
 = n^{1/2}{ λ1[n^{-1} Σ_{k=1}^n (X1k − µ1)² + R1 − σ1²] + λ2[n^{-1} Σ_{k=1}^n (X2k − µ2)² + R2 − σ2²] + λ3[n^{-1} Σ_{k=1}^n (X1k − µ1)(X2k − µ2) + R12 − τ] }.
Combine the remainder terms into one term and combine the remaining terms into one sum to find that
n^{1/2}λ'Un = n^{-1/2} Σ_{k=1}^n { λ1[(X1k − µ1)² − σ1²] + λ2[(X2k − µ2)² − σ2²] + λ3[(X1k − µ1)(X2k − µ2) − τ] } + R,
where R = n1/2 (R1 + R2 + R12 ). Note that even though each individual
remainder term converges to zero in probability, it may not necessarily follow
that n1/2 times the remainder also converges to zero. However, we will show in
Chapter 8 that this does follow in this case. Now, define a sequence of random
variables {Vn}∞n=1 as
Vk = λ1[(X1k − µ1)² − σ1²] + λ2[(X2k − µ2)² − σ2²] + λ3[(X1k − µ1)(X2k − µ2) − τ],
so that it follows that
n^{1/2}λ'Un = n^{-1/2} Σ_{k=1}^n Vk + R,
where R →p 0 as n → ∞. The random variables V1, . . . , Vn are a set of inde-
pendent and identically distributed random variables. The expectation of Vk
is given by
E(Vk ) = E{λ1 [(X1k − µ1 )2 − σ12 ] + λ2 [(X2k − µ2 )2 − σ22 ]
+λ3 [(X1k − µ1 )(X2k − µ2 ) − τ ]}
= λ1 {E[(X1k − µ1 )2 ] − σ12 } + λ2 {E[(X2k − µ2 )2 ] − σ22 }
+λ3 {E[(X1k − µ1 )(X2k − µ2 )] − τ ]}
= 0,
where we have used the fact that E[(X1k − µ1 )2 ] = σ12 , E[(X2k − µ2 )2 ] = σ22 ,
and E[(X1k − µ1 )(X2k − µ2 )] = τ . The variance of Vk need not be found
explicitly, but is equal to λ'Λλ where Λ = V(Un). Therefore, Theorem 4.20 (Lindeberg and Lévy) implies that
n^{-1/2} Σ_{k=1}^n Vk →d Z,
as n → ∞ where Z has a N(0, λ'Λλ) distribution. Theorem 4.11 (Slutsky) then further implies that
n^{1/2}λ'Un = n^{-1/2} Σ_{k=1}^n Vk + R →d Z,
as n → ∞, since R →p 0 as n → ∞. Because λ is arbitrary, it follows from Theorem 4.17 (Cramér and Wold) that n^{1/2}Un →d Z as n → ∞, where Z has a N(0, Λ) distribution. Using this result, we shall prove that there is a function of the sample correlation that converges in distribution to a Normal distribution. Let θ' = (σ1², σ2², τ) and consider the function g(x) = g(x1, x2, x3) = x3(x1x2)^{-1/2}. Then, it follows that
d'(x) = [−½x3 x1^{-3/2} x2^{-1/2}, −½x3 x1^{-1/2} x2^{-3/2}, (x1x2)^{-1/2}],
so that
d'(θ) = [−½ρσ1^{-2}, −½ρσ2^{-2}, (σ1σ2)^{-1}],
which has been written in terms of the correlation ρ = (σ1σ2)^{-1}τ. Theorem 6.5 then implies that n^{1/2}(ρ̂ − ρ) →d Z as n → ∞ where Z is a N[0, d'(θ)Λd(θ)] random variable.
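As a numerical illustration, with a bivariate normal population assumed purely for convenience, the sampling distribution of n^{1/2}(ρ̂ − ρ) is approximately normal; the normal curve overlaid below uses the empirical standard deviation rather than the variance d'(θ)Λd(θ), which is the subject of Exercise 12.

# Sampling distribution of sqrt(n)*(rho_hat - rho) for assumed bivariate normal
# data with correlation rho = 0.5.
set.seed(11)
n <- 200
b <- 10000
rho <- 0.5
r <- replicate(b, {
  x1 <- rnorm(n)
  x2 <- rho * x1 + sqrt(1 - rho^2) * rnorm(n)  # Corr(x1, x2) = rho
  cor(x1, x2)
})
z <- sqrt(n) * (r - rho)
hist(z, breaks = 50, freq = FALSE, main = "", xlab = "sqrt(n)*(rho_hat - rho)")
curve(dnorm(x, mean = 0, sd = sd(z)), add = TRUE, lwd = 2)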
The result of Theorem 6.5 extends to the more general case where g is a func-
tion that maps Rd to Rm . The main requirement for the transformed sequence
to remain asymptotically Normal is that the matrix of partial derivatives of
g must have elements that exist and are non-zero at θ.
Theorem 6.6. Let {Xn}∞n=1 be a sequence of random vectors from a d-dimensional distribution such that n^{1/2}(Xn − θ) →d Z as n → ∞ where Z is a d-dimensional N(0, Σ) random vector, θ is a d × 1 constant vector, and Σ is a d × d covariance matrix. Let g be a real function that maps R^d to R^m such that g(x) = [g1(x), . . . , gm(x)]' for all x ∈ R^d where gk(x) is a real function that maps R^d to R. Let D(θ) be the m × d matrix of partial derivatives of g whose (i, j)th element is given by
Dij(θ) = ∂gi(x)/∂xj |_{x=θ}.
The second term of the right hand side of Equation (6.17) can be written as
Σ_{i=1}^m vi di'(θ)(Xn − θ) = v'D(θ)(Xn − θ).
By assumption we have that n^{1/2}(Xn − θ) →d Z as n → ∞ where Z is a N(0, Σ) random vector. Therefore, Theorem 4.17 implies that n^{1/2}v'D(θ)(Xn − θ) →d v'D(θ)Z = W, where W is a N[0, v'D(θ)ΣD'(θ)v] random variable. For the first term in Equation (6.17), we note that
Σ_{i=1}^m vi‖Xn − θ‖hi(Xn) = v'h(Xn)‖Xn − θ‖,
where h'(x) = [h1(x), . . . , hm(x)]. The fact that n^{1/2}‖Xn − θ‖ = ‖n^{1/2}(Xn − θ)‖ and that n^{1/2}(Xn − θ) →d Z implies once again that ‖n^{1/2}(Xn − θ)‖ →d ‖Z‖ as n → ∞, as in the proof of Theorem 6.5. It also follows that h(Xn) →p h(θ) = 0 as n → ∞, so that Theorem 4.11 (Slutsky) and Theorem 3.9 imply that n^{1/2}v'h(Xn)‖Xn − θ‖ →p v'0‖Z‖ = 0 as n → ∞, and another application of Theorem 4.11 (Slutsky) implies that
Xn = Σ_{k=1}^n Dk,
A popular test statistic for testing the null hypothesis that the probability vector is equal to a proposed model p is given by
Tn = Σ_{k=1}^d (npk)^{-1}(Xnk − npk)²,
6.5.1 Exercises
1. Prove that Theorem 6.1 (Lindeberg, Lévy, and Feller) reduces to Theorem
4.20 (Lindeberg and Lévy) when {Xn }∞ n=1 is a sequence of independent
and identically distributed random variables.
2. Let {Xn }∞
n=1 be a sequence of independent random variables where Xn has
a Gamma(θn , 2) distribution where {θn }∞
n=1 is a sequence of positive real
numbers.
3. Let {Xn }∞
n=1 be a sequence of independent random variables where Xn has
a Bernoulli(θn ) distribution where {θn }∞ n=1 is a sequence of real numbers.
Find a non-trivial sequence {θn }∞n=1 such that the assumptions of Theorem
6.1 (Lindeberg, Lévy, and Feller) hold, and describe the resulting conclusion
for the weak convergence of X̄n .
4. In the context of Theorem 6.1 (Lindeberg, Lévy, and Feller), prove that
Equation (6.3) implies Equation (6.2).
5. Prove Corollary 6.2. That is, let {{Xnk}^n_{k=1}}∞n=1 be a triangular array where Xn1, . . . , Xnn are mutually independent random variables for each n ∈ N. Suppose that Xnk has mean µnk and variance σnk² for all k ∈ {1, . . . , n} and n ∈ N. Let
µn· = Σ_{k=1}^n µnk,
and
σn·² = Σ_{k=1}^n σnk².
Suppose that for some η > 2,
Σ_{k=1}^n E(|Xnk − µnk|^η) = o(σn·^η),
as n → ∞. Prove that then
Zn = σn·^{-1}( Σ_{k=1}^n Xnk − µn· ) →d Z,
as n → ∞, where Z is a N(0, 1) random variable.
11. Let {Xn}∞n=1 be a set of independent and identically distributed random variables from a distribution with mean µ and finite variance σ². Show that
S² = n^{-1} Σ_{k=1}^n (Xk − X̄)² = n^{-1} Σ_{k=1}^n (Xk − µ)² + R,
where R →p 0 as n → ∞.
12. In Example 6.6, find Λ and d'(θ)Λd(θ).
13. Let {Xn} be a sequence of d-dimensional random vectors where Xn →d Z as n → ∞ where Z has a N(0, I) distribution. Let A be an m × d matrix and find the asymptotic distribution of the sequence {AXn}∞n=1 as n → ∞. Describe any additional assumptions that may need to be made for the matrix A.
14. Let {Xn} be a sequence of d-dimensional random vectors where Xn →d Z as n → ∞ where Z has a N(0, I) distribution. Let A be an m × d matrix and let b be an m × 1 real valued vector. Find the asymptotic distribution of the sequence {AXn + b}∞n=1 as n → ∞. Describe any additional assumptions that may need to be made for the matrix A and the vector b. What effect does adding the vector b have on the asymptotic result?
15. Let {Xn} be a sequence of d-dimensional random vectors where Xn →d Z as n → ∞ where Z has a N(0, I) distribution. Let A be a symmetric d × d matrix and find the asymptotic distribution of the sequence {Xn'AXn}∞n=1 as n → ∞. Describe any additional assumptions that need to be made for the matrix A.
16. Let {Xn}∞n=1 be a sequence of two-dimensional random vectors where Xn →d Z as n → ∞ where Z has a N(0, I) distribution. Consider the transformation g(x) = x1 + x2 + x1x2 where x' = (x1, x2). Find the asymptotic distribution of g(Xn) as n → ∞.
17. Let {Xn}∞n=1 be a sequence of three-dimensional random vectors where Xn →d Z as n → ∞ where Z has a N(0, I) distribution. Consider the transformation g(x) = [x1x2 + x3, x1x3 + x2, x2x3 + x1] where x' = (x1, x2, x3). Find the asymptotic distribution of g(Xn) as n → ∞.
6.5.2 Experiments
Plot the 1000 values of Zn on a density histogram and overlay the histogram
with a plot of a N(0, 1) density. Repeat the experiment for n = 5, 10, 25,
100 and 500 and describe how the distribution converges.
4. Write a program in R that simulates 1000 samples of size n from a Uni-
form(θ1 , θ2 ) distribution where n, θ1 , and θ2 are specified below. For each
sample compute Zn = n1/2 σ −1 (X̄n2 − µ2 ) where X̄n is the mean of the
observed sample and µ and σ correspond to the mean and standard de-
viation of a Uniform(θ1 , θ2 ) distribution. Plot a histogram of the 1000
observed values of Zn for each case listed below and compare the shape of
the histograms to what would be expected.
a. θ1 = −1, θ2 = 1.
b. θ1 = 0, θ2 = 1.
CHAPTER 7
Asymptotic Expansions for Distributions
But when he talks about them and compares them with himself and his colleagues
there’s a small error running through what he says, and, just for your interest,
I’ll tell you about it.
The Trial by Franz Kafka
for all t ∈ R. Define Rn (t) = Φ(t) − P (Zn ≤ t) for all t ∈ R and n ∈ N. This
in turn implies that P (Zn ≤ t) = Φ(t) + Rn (t). Theorem 4.24 (Berry and
Esseen) implies that |Rn (t)| ≤ n−1/2 Bρ where B is a finite constant that does
not depend on n or t and ρ is the third absolute moment about the mean of
F . Noting that ρ also does not depend on n and t, we have that
lim n1/2 |Rn (t)| ≤ Bρ,
n→∞
In this chapter we will specifically consider the case of obtaining an asymp-
totic expansion that has an error term that is o(n−1/2 ) as n → ∞. We will
also consider inverting this expansion to obtain an asymptotic expansion for
the quantile function whose error term is also o(n−1/2 ) as n → ∞. These re-
sults can be extended beyond the sample mean to smooth functions of vector
means through what is usually known as the smooth function model and we
will explore both this model and the corresponding expansion theory. This
chapter concludes with a brief description of saddlepoint expansions which
are designed to provide more accuracy to these approximations under certain
conditions.
where the inequality comes from an application of Theorem A.6. Noting that
| exp(−itx)| ≤ 1, it then follows that
|fn(x) − f(x)| ≤ (2π)^{-1} ∫_{−∞}^{∞} |ψn(t) − ψ(t)| dt. (7.1)
For further details see Section XV.3 of Feller (1971). The general method
of proof for determining the error associated with the one-term Edgeworth
expansion is based on computing the integral of the distance between the
characteristic function of the standardized density of the sample mean and the
characteristic function of the approximating expansion, which then bounds the
difference between the corresponding densities. It is important to note that the
bound given in Equation (7.1) is uniform in x, a property that will translate
to the error term of the Edgeworth expansions.
A slight technicality should be addressed at this point. The expansion φ(t) + (1/6)σ^{-3}n^{-1/2}µ3(t³ − 3t)φ(t) is typically not a valid density function as it usually does not integrate to one and is not always non-negative. In this case we
does not integrate to one and is not always non-negative. In this case we
are really computing a Fourier transformation on the expansion. Under the
assumptions we shall impose it follows that the Fourier inversion theorem
still works as detailed in Theorem 2.28 with the exception that the Fourier
transformation of the expansion may not strictly be a valid characteristic
function. See Theorem 4.1 of Bhattacharya and Rao (1976). The arguments
producing the bound in Equation (7.1) also follow, with the right hand side
being called the Fourier norm by Feller (1971). See Exercise 1.
Another technical matter arises when taking the Fourier transformation of the
expansion. The second term in the expansion is given by a constant multiplied
by H3 (x)φ(x). It is therefore convenient to have a mechanism for computing
the Fourier transformation of a function of this type.
Theorem 7.1. The Fourier transformation of Hk (x)φ(x) is (it)k exp(− 21 t2 ).
Proof. We will prove one specific case of this result. For a proof of the general
result see Exercise 2. In this particular case we will evaluate the integral
∫_{−∞}^{∞} φ^{(3)}(x) exp(−itx) dx,
which from Definition 1.6 is the Fourier transformation of −H3(x)φ(x). Using Theorem A.4 with u = exp(−itx), v = φ^{(2)}(x), du = −it exp(−itx) dx, and dv = φ^{(3)}(x) dx, we have that
∫_{−∞}^{∞} φ^{(3)}(x) exp(−itx) dx = [exp(−itx)φ^{(2)}(x)]_{−∞}^{∞} + (it) ∫_{−∞}^{∞} φ^{(2)}(x) exp(−itx) dx. (7.2)
A proof of Theorem 7.3 can be found in Section 4.1 of Gut (2005). We now
have a collection of theory that is suitable for determining the asymptotic
characteristics of the error term of the Edgeworth expansion.
Theorem 7.4. Let {Xn }∞ n=1 be a sequence of independent and identically
distributed random variables where Xn has characteristic function ψ for all
n ∈ N. Let Fn (x) = P [n1/2 σ −1 (X̄n − µ) ≤ x], with density fn (x) for all
x ∈ R where µ = E(Xn ) and σ 2 = V (Xn ). Assume that E(Xn3 ) < ∞, |ψ|ν is
integrable for some ν ≥ 1, and that fn (x) exists for n ≥ ν. Then,
fn(x) − φ(x) − (1/6)σ^{-3}n^{-1/2}µ3(x³ − 3x)φ(x) = o(n^{-1/2}), (7.3)
as n → ∞.
Proof. As usual, we will consider without loss of generality the case where µ = 0. Let us first consider computing the Fourier transform of the function
fn(x) − φ(x) − (1/6)σ^{-3}n^{-1/2}µ3(x³ − 3x)φ(x). (7.4)
Because the Fourier transform is an integral, we can accomplish this with term-by-term integration. Previous calculations using Theorems 2.32 and 2.33 imply that the Fourier transform of fn(x) can be written as ψ^n(tσ^{-1}n^{-1/2}). Similarly, it is also known that the characteristic function of φ(x) is exp(−½t²). Finally, we require the Fourier transform of −(1/6)σ^{-3}n^{-1/2}µ3(x³ − 3x)φ(x). To simplify this matter we note that by Definition 1.6, the third Hermite polynomial is given by H3(x) = x³ − 3x and, therefore, Definition 1.6 implies that
−(x³ − 3x)φ(x) = (−1)³H3(x)φ(x) = φ^{(3)}(x).
Theorem 7.1 therefore implies that the Fourier transformation of −H3(x)φ(x) = φ^{(3)}(x) is (it)³exp(−½t²). Therefore, the Fourier transformation of Equation (7.4) is given by
ψ^n(n^{-1/2}σ^{-1}t) − exp(−½t²) − (1/6)µ3σ^{-3}n^{-1/2}(it)³exp(−½t²).
Now, using the Fourier norm of Equation (7.1), we have that
Our task now is to show that the integral of the right hand side of Equation
(7.5) is o(n−1/2 ) as n → ∞. Let δ > 0 and note that since ψ is a characteristic
function of a density we have from Theorem 7.2 that |ψ(t)| ≤ 1 for all t ∈ R
and that |ψ(t)| < 1 for all t 6= 0. This result, combined with the result of
Theorem 7.3 (Riemann and Lebesgue) implies that there is a real number qδ
such that |ψ(t)| ≤ qδ < 1 for all |t| ≥ δ. We now begin approximating the
Fourier norm in Equation (7.5). We begin by breaking up the Fourier norm
in Equation (7.5) into two integrals: the first over an interval near the origin
and the second over the remaining tails. That is,
∫_{−∞}^{∞} |ψ^n(n^{-1/2}σ^{-1}t) − exp(−½t²) − (1/6)µ3σ^{-3}n^{-1/2}(it)³exp(−½t²)| dt =
∫_{|t|>δσn^{1/2}} |ψ^n(n^{-1/2}σ^{-1}t) − exp(−½t²) − (1/6)µ3σ^{-3}n^{-1/2}(it)³exp(−½t²)| dt +
∫_{|t|<δσn^{1/2}} |ψ^n(n^{-1/2}σ^{-1}t) − exp(−½t²) − (1/6)µ3σ^{-3}n^{-1/2}(it)³exp(−½t²)| dt. (7.6)
Working with the first term on the right hand side of Equation (7.6), we note
that Theorem A.18 implies that
∫_{|t|>δσn^{1/2}} |ψ^n(n^{-1/2}σ^{-1}t) − exp(−½t²) − (1/6)µ3σ^{-3}n^{-1/2}(it)³exp(−½t²)| dt ≤
∫_{|t|>δσn^{1/2}} |ψ^n(n^{-1/2}σ^{-1}t)| + exp(−½t²)(1 + |(1/6)µ3σ^{-3}n^{-1/2}(it)³|) dt =
∫_{|t|>δσn^{1/2}} |ψ^n(n^{-1/2}σ^{-1}t)| dt + ∫_{|t|>δσn^{1/2}} exp(−½t²)(1 + |(1/6)µ3σ^{-3}n^{-1/2}(it)³|) dt. (7.7)
For the first integral on the right hand side of Equation (7.7), we note that
ψ n (n−1/2 σ −1 t) = ψ ν (n−1/2 σ −1 t)ψ n−ν (n−1/2 σ −1 t).
Now |t| > δσn^{1/2} implies that n^{-1/2}σ^{-1}|t| > δ, so the fact that |ψ(t)| ≤ qδ for all |t| ≥ δ implies that |ψ^{n−ν}(n^{-1/2}σ^{-1}t)| ≤ qδ^{n−ν}. Hence, it follows from Theorem A.7 that
∫_{|t|>δσn^{1/2}} |ψ^n(n^{-1/2}σ^{-1}t)| dt ≤ qδ^{n−ν} ∫_{|t|>δσn^{1/2}} |ψ^ν(n^{-1/2}σ^{-1}t)| dt
 ≤ qδ^{n−ν} ∫_{−∞}^{∞} |ψ^ν(n^{-1/2}σ^{-1}t)| dt (7.8)
 = n^{1/2}σ qδ^{n−ν} ∫_{−∞}^{∞} |ψ^ν(u)| du, (7.9)
so that the right hand side of Equation (7.9) can be written as Cn^{1/2}qδ^{n−ν}, where C is a constant that does not depend on n. Noting that |qδ| < 1, so that qδ^{n−ν} converges to zero faster than any power of n^{-1}, we have that this term is o(n^{-1/2}) as n → ∞. For the second integral on the right hand side of Equation (7.7)
we have that
∫_{|t|>δσn^{1/2}} exp(−½t²)(1 + |(1/6)µ3σ^{-3}n^{-1/2}(it)³|) dt ≤
∫_{|t|>δσn^{1/2}} exp(−½t²)(1 + |µ3σ^{-3}t³|) dt =
∫_{|t|>δσn^{1/2}} exp(−½t²) dt + ∫_{|t|>δσn^{1/2}} exp(−½t²)|µ3σ^{-3}t³| dt, (7.10)
and hence,
lim_{n→∞} n ∫_{|t|>δσn^{1/2}} exp(−½t²) dt ≤ lim_{n→∞} n^{-1/2}(2π)^{1/2}(δσ)^{-3}E(|Z|³) = 0,
as n → ∞. For the second integral on the right hand side of Equation (7.10)
we note that
∫_{|t|>δσn^{1/2}} exp(−½t²)|µ3σ^{-3}t³| dt = |µ3σ^{-3}| ∫_{|t|>δσn^{1/2}} |t|³exp(−½t²) dt = 2|µ3σ^{-3}| ∫_{t>δσn^{1/2}} t³exp(−½t²) dt. (7.11)
Note that the final equality in Equation (7.11) follows from the fact that the integrand is an even function. Now consider a change of variable where u = ½t², so that du = t dt and the lower limit of the integral becomes ½δ²σ²n. Hence, we have that
∫_{|t|>δσn^{1/2}} exp(−½t²)|µ3σ^{-3}t³| dt = 4|µ3σ^{-3}| ∫_{½δ²σ²n}^{∞} u exp(−u) du.
Now use Theorem A.4 with dv = exp(−u) du, so that v = −exp(−u), to find that
4|µ3σ^{-3}| ∫_{½δ²σ²n}^{∞} u exp(−u) du = [−4|µ3σ^{-3}|u exp(−u)]_{½δ²σ²n}^{∞} + 4|µ3σ^{-3}| ∫_{½δ²σ²n}^{∞} exp(−u) du
 = 2|µ3σ^{-3}|δ²σ²n exp(−½δ²σ²n) + 4|µ3σ^{-3}| exp(−½δ²σ²n), (7.12)
where we have used the fact that
lim_{n→∞} n^k exp(−n) = 0,
for all fixed k ∈ N. Therefore, using the same property, we can also conclude using Definition 1.7 that
∫_{|t|>δσn^{1/2}} exp(−½t²)|µ3σ^{-3}t³| dt = o(n^{-1}),
where
lim_{n→∞} n^{1/2}(2π)^{-1} ∫_{|t|<δσn^{1/2}} ε|t|³n^{-1/2} exp(−¼t²) dt = lim_{n→∞} ε(2π)^{-1} ∫_{|t|<δσn^{1/2}} |t|³ exp(−¼t²) dt = ε(2π)^{-1} ∫_{−∞}^{∞} |t|³ exp(−¼t²) dt. (7.16)
Note that the integral on the right hand side of Equation (7.16) is finite. Let B1 be a finite bound that is larger than this integral; then it follows that
lim_{n→∞} n^{1/2}(2π)^{-1} ∫_{|t|<δσn^{1/2}} ε|t|³n^{-1/2} exp(−¼t²) dt ≤ εB1,
since
∫_{−∞}^{∞} |it|⁶ exp(−¼t²) dt < ∞.
It then follows that
(2π)^{-1} ∫_{|t|<δσn^{1/2}} exp(−½t²)| exp[nΛ(tσ^{-1}n^{-1/2})] −
The error in Equation 7.3 can also be characterized as having the property
O(n−1 ) as n → ∞. Theorem 7.4 offers a potential improvement in the accu-
racy of approximating the distribution of the standardized sample mean over
Theorem 4.20 (Lindeberg and Lévy) which approximates fn (t) with a normal
density. In that case, we have that fn (t) − φ(t) = o(1) = O(n−1/2 ) as n → ∞.
Further reductions in the order of the error are possible if terms are added
to the expansion and further conditions on the existence of moments and the
smoothness of f can be assumed.
Theorem 7.5. Let {Xn }∞ n=1 be a sequence of independent and identically
distributed random variables. Let Fn (x) = P [n1/2 σ −1 (X̄n − µ) ≤ x], with
density fn (x) for all x ∈ R. Assume that E(|Xn |p ) < ∞, |ψ|ν is integrable for
some ν ≥ 1, and that fn (x) exists for n ≥ ν. Then,
fn(x) − φ(x) − φ(x) Σ_{k=1}^p n^{-k/2}pk(x) = o(n^{-p/2}), (7.17)
nc(n^{-1/2}t) = ½t² + (1/6)n^{-1/2}κ3t³ + (1/24)n^{-1}κ4t⁴ + o(n^{-1}),
as n → ∞. We now convert this result to an equivalent expression for the moment generating function of the standardized mean. Definition 2.13 implies that the moment generating function of n^{1/2}X̄n is
m(t) = exp[½t² + (1/6)n^{-1/2}κ3t³ + (1/24)n^{-1}κ4t⁴ + o(n^{-1})].
Using Theorem 1.13 on the exponential function implies that
m(t) = exp(½t²)[1 + (1/6)n^{-1/2}κ3t³ + n^{-1}((1/24)κ4t⁴ + (1/72)κ3²t⁶) + o(n^{-1})],
as n → ∞.
Figure 7.1 The true density of Zn = n1/2 σ −1 (X̄n − µ) (solid line), the standard nor-
mal density (dashed line) and the one-term Edgeworth expansion (dotted line) when
n = 5 and X1 , . . . , Xn is a set of independent and identically distributed random
variables following an Exponential(θ) distribution.
Figure 7.2 The true density of Zn = n1/2 σ −1 (X̄n − µ) (solid line), the standard
normal density (dashed line) and the one-term Edgeworth expansion (dotted line)
when n = 10 and X1 , . . . , Xn is a set of independent and identically distributed
random variables following an Exponential(θ) distribution.
µ = 2/3 − (1/3)θ, σ² = µ2 = 1/18 + (1/9)θ − (1/9)θ², µ3 = −1/135 − (1/45)θ + (1/9)θ² − (2/27)θ³, and µ4 = 1/135 + (4/135)θ − (1/15)θ² + (2/27)θ³ − (1/27)θ⁴. It then follows that the standardized cumulants are given by
ρ3 = [−1/135 − (1/45)θ + (1/9)θ² − (2/27)θ³] / [1/18 + (1/9)θ − (1/9)θ²]^{3/2},
and
ρ4 = (µ4 − 3σ⁴)/σ⁴ = [1/135 + (4/135)θ − (1/15)θ² + (2/27)θ³ − (1/27)θ⁴] / [1/18 + (1/9)θ − (1/9)θ²]² − 3.
Figure 7.3 The third (solid line) and fourth (dashed line) standardized cumulants of
the linear density given in Equation (7.21) as a function of θ.
and
ζ(t + 2πb^{-1}) = Σ_{k∈Z} pk exp[ibk(t + 2πb^{-1})] = Σ_{k∈Z} pk exp(itbk) exp(2πik) = Σ_{k∈Z} pk exp(itbk) = ζ(t),
since exp(2πik) = 1 when k ∈ Z.
To prove the second result, we now assume that X is a random variable with a
periodic characteristic function that has period a. We wish now to show that
X is a lattice random variable. Noting that ψ(0) = E[exp(i0X)] = E(1) = 1
it follows from the periodicity of ψ that ψ(0) = ψ(a) = 1. Now, Definition A.6
implies that
and
lim_{x→−∞} |Φ(x) − G(x)| = 0.
Suppose that G has a bounded derivative that has a continuously differentiable Fourier transformation γ such that γ(0) = 1 and
(d/dt) γ(t) |_{t=0} = 0.
If F is a distribution function with zero expectation, then
|F(x) − G(x)| ≤ ∫_{−t}^{t} |ψ(x) − ζ(x)| / (π|x|) dx + 24(πt)^{-1} sup_{x∈R} |G'(x)|,
A proof of Theorem 7.7 can be found in Section XVI.3 of Feller (1971). We now
have the required results to justify the Edgeworth expansion for non-lattice
distributions.
Theorem 7.8. Let {Xn }∞ n=1 be a sequence of independent and identically dis-
tributed random variables from a distribution F . Let Fn (x) = P [n1/2 σ −1 (X̄n −
µ) ≤ x] and assume that E(Xn³) < ∞. If F is a non-lattice distribution then
Fn(x) − Φ(x) − (1/6)σ^{-3}n^{-1/2}µ3(1 − x²)φ(x) = o(n^{-1/2}), (7.24)
as n → ∞ uniformly in x.
Proof. We shall give a partial proof of the result, leaving the details for Ex-
ercise 6. Let
G(x) = Φ(x) + (1/6)σ^{-3}n^{-1/2}µ3(1 − x²)φ(x).
We must first prove that our choice for G(x) satisfies the assumptions of
Theorem 7.7. We begin by noting that since φ(x) → 0 as x → ±∞ at a faster rate than (1 − x²) diverges to −∞, it follows that
lim_{x→∞} (1 − x²)φ(x) = 0,
and
lim_{x→−∞} (1 − x²)φ(x) = 0.
Therefore,
lim_{x→∞} G(x) = lim_{x→∞} Φ(x) + lim_{x→∞} (1/6)σ^{-3}n^{-1/2}µ3(1 − x²)φ(x) = 1,
and
lim_{x→−∞} G(x) = lim_{x→−∞} Φ(x) + lim_{x→−∞} (1/6)σ^{-3}n^{-1/2}µ3(1 − x²)φ(x) = 0.
and
lim_{x→∞} |Fn(x) − G(x)| = 0.
The derivative of G(x) is given by
(d/dx) G(x) = (d/dx)[Φ(x) + (1/6)σ^{-3}n^{-1/2}µ3(1 − x²)φ(x)]
 = φ(x) − (1/6)σ^{-3}n^{-1/2}µ3(3x − x³)φ(x)
 = φ(x) + (1/6)σ^{-3}n^{-1/2}µ3 H3(x)φ(x),
which can be shown to be bounded. We now compute the Fourier transformation of G'(x) as
γ(t) = ∫_{−∞}^{∞} exp(itx)[φ(x) + (1/6)σ^{-3}n^{-1/2}µ3 H3(x)φ(x)] dx
 = ∫_{−∞}^{∞} exp(itx)φ(x) dx + (1/6)σ^{-3}n^{-1/2}µ3 ∫_{−∞}^{∞} exp(itx)H3(x)φ(x) dx
 = exp(−½t²) + (1/6)µ3σ^{-3}n^{-1/2}(it)³exp(−½t²)
 = exp(−½t²)[1 + (1/6)µ3σ^{-3}n^{-1/2}(it)³]. (7.25)
The first term in Equation (7.25) comes from the fact that the characteristic
function of a N(0, 1) distribution is exp(− 21 t2 ). The second term in Equation
(7.25) is the result of an application of Theorem 7.1. Note that γ(0) = 1. The
derivative of the Fourier transformation is given by

d/dt {exp(−t^2/2)[1 + (1/6)µ_3σ^{-3}n^{-1/2}(it)^3]} |_{t=0} =
−t exp(−t^2/2)[1 + (1/6)µ_3σ^{-3}n^{-1/2}(it)^3] |_{t=0} + exp(−t^2/2)[(1/2)µ_3σ^{-3}n^{-1/2}i^3t^2] |_{t=0} = 0.
Therefore, we can apply Theorem 7.7. The remainder of the proof proceeds much along the same path as the proof
of Theorem 7.4. See Exercise 6.
As in the case of expansions for densities, the result of Theorem 7.8 can be
expanded under some additional assumptions so that the error is o(n−p/2 ) as
n → ∞.
Theorem 7.9. Let {X_n}_{n=1}^∞ be a sequence of independent and identically distributed random variables from a distribution F that has characteristic function ψ. Let F_n(x) = P[n^{1/2}σ^{-1}(X̄_n − µ) ≤ x] and assume that E(X_n^p) < ∞. If

lim sup_{|t|→∞} |ψ(t)| < 1,   (7.26)

then

F_n(x) − Φ(x) − φ(x) Σ_{k=1}^{p} n^{-k/2} r_k(x) = o(n^{-p/2}),   (7.27)

uniformly in x as n → ∞, where r_k(x) is a real polynomial whose coefficients depend only on µ_1, . . . , µ_k. In particular, r_k(x) does not depend on n, p or on other properties of F.
To find the polynomials r1 , . . . , rk we note that the Edgeworth expansion for
the distribution function is the integrated version of the expansion for the
density. Therefore, it follows that
φ(x)p_k(x) = d/dx [φ(x)r_k(x)],   (7.28)

which yields

−r_1(x) = (1/6)ρ_3 H_2(x)   (7.29)

and

−r_2(x) = (1/24)ρ_4 H_3(x) + (1/72)ρ_3^2 H_5(x).   (7.30)

See Exercise 13.
Example 7.4. Let {Xn }∞ n=1 be a sequence of independent and identically
distributed random variables where Xn has an Exponential(θ) distribu-
tion for all n ∈ N, where the third and fourth standardized cumulants are
given by ρ_3 = 2 and ρ_4 = 6. Therefore we have that −r_1(x) = (1/3)H_2(x) and −r_2(x) = (1/4)H_3(x) + (1/18)H_5(x), with the two-term Edgeworth expansion for the distribution function of Z_n = n^{1/2}σ^{-1}(X̄_n − µ) being given by

Φ(x) − (1/3)n^{-1/2}φ(x)H_2(x) − n^{-1}φ(x)[(1/4)H_3(x) + (1/18)H_5(x)] + o(n^{-1}),

as n → ∞. To illustrate the correction that the Edgeworth expansion makes in this case, we note from Example 4.36 that Z_n has a location translated Gamma(n, n^{-1/2}) density whose form is given in Equation (4.3). Figure 7.4
compares the true distribution function of Zn with the standard normal dis-
tribution function and the one-term Edgeworth expansion when n = 5. One
can observe from Figure 7.4 that the Edgeworth expansion does a much better
job of capturing the true distribution function of Zn than the normal approx-
imation does. Figure 7.5 provides the same comparison when n = 10. Both
approximations are better, but one can observe that the one-term Edgeworth
expansion is still more accurate.
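The comparison in Figures 7.4 and 7.5 is easy to reproduce numerically. The following R sketch (not part of the original text) assumes Exponential data with mean θ, in which case the scale cancels and the exact distribution function of Z_n reduces to a Gamma probability; the value n = 5 is arbitrary.

n <- 5
z <- seq(-3, 4, by = 0.01)
exact <- pgamma(n + sqrt(n) * z, shape = n, scale = 1)        # exact P(Z_n <= z)
normal <- pnorm(z)                                            # standard normal approximation
edgeworth <- pnorm(z) - (1/3) * n^(-1/2) * dnorm(z) * (z^2 - 1)  # one-term Edgeworth expansion
plot(z, exact, type = "l", ylim = c(0, 1), ylab = "distribution function")
lines(z, normal, lty = 2)
lines(z, edgeworth, lty = 3)
legend("topleft", legend = c("exact", "normal", "one-term Edgeworth"), lty = 1:3)

Setting n = 10 in the sketch reproduces the comparison of Figure 7.5.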
Figure 7.4 The true distribution function of Zn = n^{1/2}σ^{-1}(X̄n − µ) (solid line), the standard normal distribution function (dashed line) and the one-term Edgeworth expansion (dotted line) when n = 5 and X1, . . . , Xn is a set of independent and identically distributed random variables following an Exponential(θ) distribution.
Expansions for the quantiles of the random variable Zn are called Cornish–
Fisher expansions and were first developed by Cornish and Fisher (1937)
and Fisher and Cornish (1960). See also Hall (1983a). We will develop these
expansions using the inversion method described in Section 1.6. In particular
we will begin with the two-term Edgeworth expansion for the distribution
function of Zn given by
Figure 7.5 The true distribution function of Zn = n^{1/2}σ^{-1}(X̄n − µ) (solid line), the standard normal distribution function (dashed line) and the one-term Edgeworth expansion (dotted line) when n = 10 and X1, . . . , Xn is a set of independent and identically distributed random variables following an Exponential(θ) distribution.
Φ(g_{α,n}) = Φ[v_0(α)] + [n^{-1/2}v_1(α) + n^{-1}v_2(α) + o(n^{-1})]φ[v_0(α)] −
(1/2)[n^{-1/2}v_1(α) + n^{-1}v_2(α) + o(n^{-1})]^2 H_1[v_0(α)]φ[v_0(α)] +
o{[n^{-1/2}v_1(α) + n^{-1}v_2(α) + o(n^{-1})]^2},   (7.32)
as n → ∞. We will simplify each term and consolidate all terms that are of
order o(n−1 ) or smaller into the error term. The first term on the right hand
side of Equation (7.32) needs no simplification, so we move to the second term
where we have that
based once again on the fact that v_1(α) and v_2(α) do not depend on n. Similarly, it can be shown that n^{-2}v_2^2(α) = o(n^{-1}) as n → ∞. For the term 2n^{-1/2}v_1(α)R_n we note that R_n = o(n^{-1}) as n → ∞, which implies that

lim_{n→∞} n[2n^{-1/2}v_1(α)R_n] = lim_{n→∞} 2n^{-1/2}v_1(α)(nR_n) = 0,
φ(g_{α,n}) = φ[v_0(α)] − v_0(α)φ[v_0(α)][n^{-1/2}v_1(α) + n^{-1}v_2(α) + o(n^{-1})] + o{[n^{-1/2}v_1(α) + n^{-1}v_2(α)]^2},

as n → ∞. Keeping only terms of order larger than o(n^{-1/2}) yields

φ(g_{α,n}) = φ[v_0(α)] − n^{-1/2}v_0(α)v_1(α)φ[v_0(α)] + o(n^{-1/2}),

as n → ∞. Now H_2(x) = x^2 − 1 so that using calculations similar to those detailed above

H_2(g_{α,n}) = [v_0(α) + n^{-1/2}v_1(α) + n^{-1}v_2(α) + o(n^{-1})]^2 − 1
            = v_0^2(α) + n^{-1}v_1^2(α) + n^{-2}v_2^2(α) + 2n^{-1/2}v_0(α)v_1(α) + 2n^{-1}v_0(α)v_2(α) + 2n^{-3/2}v_1(α)v_2(α) − 1 + o(n^{-1}),

as n → ∞. But, noting that

n^{-1}v_1^2(α) + n^{-2}v_2^2(α) + 2n^{-1}v_0(α)v_2(α) + 2n^{-3/2}v_1(α)v_2(α) = o(n^{-1/2}),

implies that

H_2(g_{α,n}) = v_0^2(α) + 2n^{-1/2}v_0(α)v_1(α) − 1 + o(n^{-1/2})
            = H_2[v_0(α)] + 2n^{-1/2}v_0(α)v_1(α) + o(n^{-1/2}),

as n → ∞. Therefore,

φ(g_{α,n})H_2(g_{α,n}) = {φ[v_0(α)] − n^{-1/2}v_0(α)v_1(α)φ[v_0(α)] + o(n^{-1/2})} × {H_2[v_0(α)] + 2n^{-1/2}v_0(α)v_1(α) + o(n^{-1/2})}
 = φ[v_0(α)]H_2[v_0(α)] + 2n^{-1/2}v_0(α)v_1(α)φ[v_0(α)] − n^{-1/2}v_0(α)v_1(α)H_2[v_0(α)]φ[v_0(α)] − 2n^{-1}v_0^2(α)v_1^2(α)φ[v_0(α)] + o(n^{-1/2}),

as n → ∞. However 2n^{-1}v_0^2(α)v_1^2(α)φ[v_0(α)] = o(n^{-1/2}) so that

n^{-1}φ(g_{α,n})[(1/24)ρ_4 H_3(g_{α,n}) + (1/72)ρ_3^2 H_5(g_{α,n})] =
n^{-1}φ[v_0(α)]{(1/24)ρ_4 H_3[v_0(α)] + (1/72)ρ_3^2 H_5[v_0(α)]} + o(n^{-1}),   (7.36)
as n → ∞. Combining the results of Equations (7.34), (7.35), and (7.36)
implies that
then

g_{n,α} − z_α − Σ_{k=1}^{p} n^{-k/2} q_k(z_α) = o(n^{-p/2}),   (7.39)

as n → ∞ uniformly in ε < α < 1 − ε for each ε > 0, where q_1, . . . , q_p are polynomials that depend on the moments of X_n and not on n.
We shall not endeavor to present a more formal proof of Theorem 7.10 other
than what is presented above. The polynomials q1 , . . . , qp are related to the
polynomials r1 , . . . , rp , though the relationship can become quite complicated
as more terms are added to the expansion. In particular, q_1(x) = −r_1(x) and q_2(x) = r_1(x)r_1'(x) − (1/2)x r_1^2(x) − r_2(x). These relationships can be determined by
directly inverting the asymptotic expansion in Equation (7.27). See Exercise
14. As with the Edgeworth expansion, the expansion given by Theorem 7.10
can also be written so that the error term has order O(n−(p+1)/2 ) as n → ∞.
Because of the relationship between the two methods, many similar conclu-
sions about the accuracy of the normal approximation for the distribution of
n1/2 σ −1 (X̄n −µ) also hold for the quantiles of n1/2 σ −1 (X̄n −µ). In general, we
can observe from Theorem 7.10 that the normal approximation provides an approximation for the quantiles of n^{1/2}σ^{-1}(X̄_n − µ) whose error is o(1), or O(n^{-1/2}), as n → ∞. However, if the third standardized cumulant ρ_3 is zero,
corresponding to the case where the population is symmetric, the normal ap-
proximation provides an approximation for the quantiles of n1/2 σ −1 (X̄n − µ),
where the error is o(n−1/2 ), or O(n−1 ), as n → ∞. Similarly, if the fourth stan-
dardized cumulant ρ4 is zero, corresponding to the case where the population
has the same kurtosis as a Normal population, the normal approximation
provides an approximation for the quantiles of n1/2 σ −1 (X̄n − µ), where the
error is o(n−1 ), or O(n−3/2 ), as n → ∞.
Example 7.5. Let {Xn }∞ n=1 be a sequence of independent and identically
distributed random variables where Xn has an Exponential(θ) distribu-
tion for all n ∈ N. In Example 7.4 it was shown that the distribution of
n^{1/2}σ^{-1}(X̄_n − θ) has an Edgeworth expansion with −r_1(x) = (1/3)H_2(x) and −r_2(x) = (1/4)H_3(x) + (1/18)H_5(x). Theorem 7.10 then implies that the αth quantile of the distribution of n^{1/2}σ^{-1}(X̄_n − θ) has Cornish–Fisher expansion

g_{α,n} = z_α + n^{-1/2}q_1(z_α) + n^{-1}q_2(z_α) + o(n^{-1}),

as n → ∞, where q_1(z_α) = −r_1(z_α) = (1/3)H_2(z_α) and q_2(z_α) = r_1(z_α)r_1'(z_α) − (1/2)z_α r_1^2(z_α) − r_2(z_α). Evaluating the derivative of r_1 we find that

r_1'(z_α) = d/dx r_1(x) |_{x=z_α} = −(2/3)H_1(z_α) = −(2/3)z_α.
Therefore

q_2(z_α) = (2/9)z_α H_2(z_α) − (1/18)z_α H_2^2(z_α) + (1/4)H_3(z_α) + (1/18)H_5(z_α).
The sample mean is not the only function of a set of independent and identi-
cally distributed random variables that has an asymptotic Normal distribu-
tion. For example, in Section 6.4, we studied several conditions under which
transformations of an asymptotically Normal random vector also have an
asymptotic Normal distribution. The key condition for such transformations
to be asymptotically Normal was based on the smoothness of the trans-
formation. One particular application of this theory is based on looking at
smooth functions of sample mean vectors. Edgeworth and Cornish–Fisher expansions can also be applied to these problems, though the results are
slightly more complicated and the function of the sample mean must have a
certain type of smooth representation. This section will focus on the model
for which these expansions are valid. Section 7.5 will provide details about the
expansions themselves.
The specification of the smooth function model begins with a sequence of d-
dimensional random vectors {Xn }∞ n=1 following a d-dimensional distribution
F . It is assumed that these random vectors are independent and identically
distributed. Let µ = E(Xn ) and assume that the components of µ are finite.
The parameter of interest is assumed to be a smooth function of the vector
mean µ. That is, we assume that there exists a smooth function g : Rd → R
such that θ = g(µ). The exact requirements on the smoothness of g will be
detailed in Section 7.5.
The parameter of interest will be estimated using a plug-in estimate based on
the sample mean. That is, we will assume that the mean µ is estimated by
the sample mean

µ̂ = X̄_n = n^{-1} Σ_{k=1}^{n} X_k.
An estimate of θ can then be computed by substituting the sample mean into
the function g. That is, θ̂n = g(µ̂). Note that under the conditions we have
stated thus far θ̂n is a consistent estimator of θ owing to the consistency of the
sample mean from Theorem 3.10 (Weak Law of Large Numbers) and Theorem
3.9.
In order to correctly standardize the distribution of θ̂n we also require the
standard error of θ̂_n. In particular we will assume that the asymptotic variance of n^{1/2}θ̂_n is a constant σ^2 that is also a smooth function of µ. That is, we assume that there is a smooth function h: R^d → R such that

σ^2 = h^2(µ) = lim_{n→∞} V(n^{1/2}θ̂_n).
Since the parameter and the asymptotic variance rely on the first two moments of W_n we can specify a smooth function model for θ with d = 2 and X_n' = (W_n, W_n^2) so that the vector mean is given by µ' = E(X_n') = (µ_1', µ_2') for all n ∈ N. The functions g and h are given by g(x) = x_1 and h(x) = (x_2 − x_1^2)^{1/2}, where x' = (x_1, x_2).
Example 7.7. Let {Wn }∞ n=1 be a sequence of independent and identically
distributed random variables from a distribution F with mean η and variance
θ. We wish to represent the parameter θ in a smooth function model. Assuming that µ_k' is the kth moment of W_n, we note that θ = µ_2' − (µ_1')^2 so that we will need at least the first two moments of W_n to be represented in our X_n vector. Theorem 3.21 implies that the asymptotic variance of the sample variance is

µ_4 − µ_2^2 = µ_4' − 4µ_1'µ_3' + 6(µ_1')^2µ_2' − 3(µ_1')^4 − [µ_2' − (µ_1')^2]^2,
so that to represent the variance in the smooth function model we require d = 4 with X_n' = (W_n, W_n^2, W_n^3, W_n^4) for all n ∈ N. With this representation we have g(x) = x_2 − x_1^2 and

h(x) = [x_4 − 4x_1x_3 + 6x_1^2x_2 − 3x_1^4 − (x_2 − x_1^2)^2]^{1/2},

where x' = (x_1, x_2, x_3, x_4). See Exercise 15.
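The plug-in estimate in this smooth function model is simple to compute. The following R sketch (not part of the original text) uses a hypothetical Exponential(1) sample; the square root in h reflects the convention σ = h(µ).

g <- function(x) x[2] - x[1]^2
h <- function(x) sqrt(x[4] - 4 * x[1] * x[3] + 6 * x[1]^2 * x[2] -
                      3 * x[1]^4 - (x[2] - x[1]^2)^2)
set.seed(1)
W <- rexp(50)                         # hypothetical sample
X <- cbind(W, W^2, W^3, W^4)          # rows are X_n' = (W_n, W_n^2, W_n^3, W_n^4)
xbar <- colMeans(X)                   # the sample mean vector
theta.hat <- g(xbar)                  # plug-in estimate of the variance
se.hat <- h(xbar) / sqrt(length(W))   # estimated standard error of theta.hat
c(theta.hat, se.hat)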
Example 7.8. Let {W_n}_{n=1}^∞ be a sequence of independent and identically distributed bivariate random vectors from a distribution F having mean vector η and covariance matrix Σ. Let W_n' = (W_{n1}, W_{n2}) for all n ∈ N and define µ_{ij} = E{[W_{n1} − E(W_{n1})]^i [W_{n2} − E(W_{n2})]^j} and µ_{ij}' = E(W_{n1}^i W_{n2}^j). The parameter of interest in this example is the correlation between W_{n1} and W_{n2}. That is, θ = µ_{11}µ_{20}^{-1/2}µ_{02}^{-1/2}. The estimate of θ, based on replacing the population moments with the sample moments as described above, has the form

θ̂_n = Σ_{k=1}^{n} (W_{k1} − W̄_1)(W_{k2} − W̄_2) / {[Σ_{k=1}^{n} (W_{k1} − W̄_1)^2]^{1/2} [Σ_{k=1}^{n} (W_{k2} − W̄_2)^2]^{1/2}}.
When constructing the smooth function model representation for this parameter, we will need to specify the sequence {X_n}_{n=1}^∞ to include moments of both W_{n1} and W_{n2}, but also various products of these random variables as well. The correlation parameter itself can be specified readily enough as a function of these moments as shown above, but the asymptotic variance is more challenging. For our current application we will use the result from Section 27.8 of Cramér (1946) which gives the asymptotic variance of n^{1/2}θ̂_n as

σ^2 = (1/4)θ^2(µ_{40}µ_{20}^{-2} + µ_{04}µ_{02}^{-2} + 2µ_{22}µ_{20}^{-1}µ_{02}^{-1} + 4µ_{22}µ_{11}^{-2} − 4µ_{31}µ_{11}^{-1}µ_{20}^{-1} − 4µ_{13}µ_{11}^{-1}µ_{02}^{-1}),

which makes it apparent that we require moments up to order four from each random variable plus several products of powers of these random variables. It then suffices to define the sequence {X_n}_{n=1}^∞ as
The results in this section rely heavily on the smooth function model presented
in Section 7.4. In particular, we will assume that {Xn }∞ n=1 is a sequence of
independent and identically distributed d-dimensional random vectors from a
d-dimensional distribution F . Let µ = E(Xn ) and assume that the parameter
of interest θ is related to µ through a smooth function g. The parameter θ is
estimated with the plug-in estimate g(X̄n ). Finally, we assume that
σ^2 = lim_{n→∞} V[n^{1/2}g(X̄_n)],

is also related to µ through a smooth function h.
It is worth noting the effect that transformations have on the normal approximation. We know from Theorem 4.22 that n^{1/2}Σ^{-1/2}(X̄_n − µ) →d Z as n → ∞ where Z is a d-dimensional N(0, I) random variable and Σ is the covariance matrix of X_n as long as the elements of Σ are all finite. Given this result, we can then apply Theorem 6.5 to find that n^{1/2}[g(X̄_n) − g(µ)] →d Z as n → ∞ where Z is a N[0, d'(µ)Σd(µ)] random variable and d(µ) is the vector of partial derivatives of g evaluated at µ. To simplify the notation in this section we will let d_i represent the ith element of d(µ). We now encounter some smoothness conditions required in our smooth function model. In particular, we must now assume that d(µ) is not equal to the zero vector and d(x) is continuous in a neighborhood of µ.
An important issue in developing Edgeworth expansions under the smooth function model is related to finding expressions for the cumulants of θ̂_n in
terms of the moments of the distribution F . This is important because, as with
the usual Edgeworth expansion, the coefficients of the terms of the expansion
will be related to the cumulants of θ̂n , and therefore the specification of the
cumulants is required to specify the exact form for the expansion. As an
example we will consider the case of the specification of σ 2 . Suppose that
X0 = (X1 , . . . , Xd ) where X has distribution F and define µij = E{[Xi −
E(Xi )][Xj − E(Xj )]} so that the (i, j)th element of Σ is given by µij . The
quadratic form representing the asymptotic variance of n^{1/2}(θ̂_n − θ), which is equivalently the asymptotic variance of n^{1/2}θ̂_n, is then given by

σ^2 = d'(µ)Σd(µ) = Σ_{i=1}^{d} Σ_{j=1}^{d} d_i d_j µ_{ij}.
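This quadratic form is straightforward to evaluate numerically. The following R sketch (not part of the original text) uses a hypothetical function g(x) = x_1/x_2 with hypothetical values of µ and Σ purely for illustration.

g.grad <- function(x) c(1 / x[2], -x[1] / x[2]^2)   # partial derivatives of g(x) = x1 / x2
mu <- c(1, 2)                                       # hypothetical mean vector
Sigma <- matrix(c(1.0, 0.3, 0.3, 2.0), 2, 2)        # hypothetical covariance matrix of X_n
d <- g.grad(mu)
sigma2 <- as.numeric(t(d) %*% Sigma %*% d)          # sigma^2 = d(mu)' Sigma d(mu)
sigma2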
It turns out that all of the cumulants of n1/2 (θ̂n − θ) can be written using sim-
ilar expressions to these forms, though the proof becomes rather complicated
for each additional cumulant. We shall present only the results. A detailed
argument supporting these results can be found in Chapter 2 of Hall (1992).
Define A(x) = [g(x) − g(µ)]/h(µ) where x' = (x_1, . . . , x_d). Let

a_{i_1···i_k} = ∂^k A(x) / ∂x_{i_1} · · · ∂x_{i_k} |_{x=µ}.

It then follows that the first cumulant of n^{1/2}A(X̄_n) = n^{1/2}σ^{-1}(θ̂_n − θ) equals n^{-1/2}A_1 + O(n^{-1}) as n → ∞ where

A_1 = (1/2) Σ_{i=1}^{d} Σ_{j=1}^{d} a_{ij}µ_{ij}.   (7.40)
If we have chosen our h function correctly, the second cumulant of n^{1/2}A(X̄_n) should be one. See Page 55 of Hall (1992) for further details. The third cumulant of n^{1/2}A(X̄_n) is given by n^{-1/2}A_2 + O(n^{-1}) where

A_2 = Σ_{i=1}^{d} Σ_{j=1}^{d} Σ_{k=1}^{d} a_i a_j a_k µ_{ijk} + 3 Σ_{i=1}^{d} Σ_{j=1}^{d} Σ_{k=1}^{d} Σ_{l=1}^{d} a_i a_j a_{kl} µ_{ik}µ_{jl},   (7.41)
In most applications we will have σ̃ = 1. One may note the extra term involving the cumulant A_1 in Equation (7.44) that does not appear in the polynomial defined in Theorem 7.9. This is due to the fact that the first
cumulant of X̄n − µ is zero. Further polynomials can be obtained using the
methodology of Withers (1983, 1984). For example, Polansky (1995) obtains
and

A_{43} = (1/2) Σ_{i=1}^{d} Σ_{j=1}^{d} Σ_{k=1}^{d} Σ_{l=1}^{d} a_i a_j a_k a_l µ_{ijkl} − 3[(1/2) Σ_{i=1}^{d} Σ_{j=1}^{d} a_i a_j µ_{ij}]^2 +
12 Σ_{i=1}^{d} Σ_{j=1}^{d} Σ_{k=1}^{d} Σ_{l=1}^{d} Σ_{m=1}^{d} a_i a_j a_k a_{lm} µ_{il}µ_{jkm} +
12 Σ_{i=1}^{d} Σ_{j=1}^{d} Σ_{k=1}^{d} Σ_{l=1}^{d} Σ_{m=1}^{d} Σ_{u=1}^{d} a_i a_j a_{kl} a_{mu} µ_{ik}µ_{jm}µ_{lu} +
4 Σ_{i=1}^{d} Σ_{j=1}^{d} Σ_{k=1}^{d} Σ_{l=1}^{d} Σ_{m=1}^{d} Σ_{u=1}^{d} a_i a_j a_k a_{lmu} µ_{il}µ_{jm}µ_{ku}.   (7.48)
and a_3 = a_4 = 0. Similarly,

a_{11} = ∂^2/∂x_1^2 {σ^{-1}[x_2 − x_1^2 − θ]} |_{x=µ} = −2σ^{-1},

a_{12} = a_{21} = a_{22} = 0, a_{3i} = a_{i3} = 0, and a_{4i} = a_{i4} = 0 for i ∈ {1, 2, 3, 4}. Therefore, it follows from Equation (7.40) that

A_1 = (1/2) Σ_{i=1}^{d} Σ_{j=1}^{d} a_{ij}µ_{ij} = (1/2)a_{11}µ_{11} = −σ^{-1}µ_{11},
With this definition, the Edgeworth type correction for the distribution func-
tion of B(X̄n ) then has the same form as that of A(X̄n ), with the exception
that the constants ai1 ···ik are replaced by bi1 ···ik .
Theorem 7.13. Let {Xn }∞ n=1 be a sequence of independent and identically
distributed d-dimensional random vectors from a distribution F with mean
vector µ. Let θ = g(µ) for some function g and suppose that θ̂n = g(X̄n ). Let
σ 2 be the asymptotic variance of n1/2 θ̂n and assume that σ = h(µ) for some
function h. Define B(x) = [g(x) − g(µ)]/h(x) and assume that B has p + 2
continuous derivatives in a neighborhood of µ and that E(||X||^{p+2}) < ∞. Let ψ be the characteristic function of F and assume that

lim sup_{||t||→∞} |ψ(t)| < 1.   (7.49)
B_2 = Σ_{i=1}^{d} Σ_{j=1}^{d} Σ_{k=1}^{d} b_i b_j b_k µ_{ijk} + 3 Σ_{i=1}^{d} Σ_{j=1}^{d} Σ_{k=1}^{d} Σ_{l=1}^{d} b_i b_j b_{kl} µ_{ik}µ_{jl},   (7.53)
and µ_{ij} and µ_{ijk} are as defined previously. Even though r_1 and v_1 have the same form, the two polynomials are not equal as a_{i_1···i_k} ≠ b_{i_1···i_k}. In fact, Hall (1988a) points out that

r_1(x) − v_1(x) = −(1/2)σ^{-3} Σ_{i=1}^{d} Σ_{j=1}^{d} a_i c_j µ_{ij} x^2,

where

c_k = ∂/∂x_k g(x) |_{x=µ}.
See Exercise 18. Further polynomials may be obtained using similar methods.
as n → ∞ where s1 (x) = −v1 (x) and s2 (x) = v1 (x)v10 (x) − 21 xv12 (x) − v2 (x).
Example 7.10. Let {Wn }∞ n=1 be a sequence of independent and identically
distributed random variables from a distribution F with mean θ and variance
σ 2 . It was shown in Example 7.6 that this parameter can be represented in the
smooth function model with X_n' = (W_n, W_n^2), g(x) = x_1, and h(x) = (x_2 − x_1^2)^{1/2}, where we would have θ̂_n = W̄_n and

σ̂_n^2 = n^{-1} Σ_{k=1}^{n} (W_k − W̄_n)^2.
In this example, we will derive the one-term Edgeworth expansion for the
studentized distribution of θ̂_n and compare it to Equation (7.24). For simplicity, we will consider the case where θ = 0 and σ = 1. To obtain this expansion, we must first find the constants b_1, b_2, b_{11}, b_{12}, b_{21} and b_{22}, where B(x) = (x_1 − θ)(x_2 − x_1^2)^{-1/2}. We first note that

b_1 = ∂/∂x_1 B(x) |_{x=µ} = [x_1(x_1 − θ)(x_2 − x_1^2)^{-3/2} + (x_2 − x_1^2)^{-1/2}] |_{x=µ}
    = θ(θ − θ)(β − θ^2)^{-3/2} + (β − θ^2)^{-1/2} = (β − θ^2)^{-1/2} = 1,
b_{12} = b_{21} = ∂^2/∂x_1∂x_2 B(x) |_{x=µ} = [−(3/2)x_1(x_1 − θ)(x_2 − x_1^2)^{-5/2} − (1/2)(x_2 − x_1^2)^{-3/2}] |_{x=µ}
       = −(1/2)(β − θ^2)^{-3/2} = −1/2,

and

b_{22} = ∂^2/∂x_2^2 B(x) |_{x=µ} = (3/4)(x_1 − θ)(x_2 − x_1^2)^{-5/2} |_{x=µ} = 0.
The expressions for B1 and B2 also require us to find the moments of the form
µij and µijk for i = 1, 2, j = 1, 2, and k = 1, 2. However, due to the fact that
b_2 = b_{11} = b_{22} = 0 we only need to find µ_{11}, µ_{12}, and µ_{111}. Letting κ_3 denote the third cumulant of W_n we have in this case that µ_{11} = E[(W_n − θ)^2] = β − θ^2 = 1, µ_{12} = µ_{21} = E(W_n^3) − θβ = κ_3, and µ_{111} = E[(W_n − θ)^3] = κ_3. We now have enough information to compute B_1 and B_2. From Equation (7.52), we have that

B_1 = (1/2) Σ_{i=1}^{2} Σ_{j=1}^{2} b_{ij}µ_{ij} = (1/2)(b_{12}µ_{12} + b_{21}µ_{21}) = −(1/2)κ_3.
Similarly,

Σ_{i=1}^{2} Σ_{j=1}^{2} Σ_{k=1}^{2} Σ_{l=1}^{2} b_i b_j b_{kl} µ_{ik}µ_{jl} = b_1^2 b_{12}µ_{11}µ_{12} + b_1^2 b_{21}µ_{12}µ_{11} = −κ_3.
Figure 7.6 The Edgeworth expansion for the standardized sample mean (solid line)
and the studentized mean (dashed line) when n = 5 and κ3 = 1.
Figure 7.7 The function φ(t)H3 (t) from the first term of an Edgeworth expansion
for the standardized sample mean.
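A comparison like that in Figure 7.6 can be drawn with the short R sketch below (not part of the original text). It uses the standard forms of the one-term polynomials for the standardized and studentized sample mean (see, for example, Hall, 1992), with κ_3 denoting the third cumulant of F; the values n = 5 and κ_3 = 1 match the figure.

n <- 5
kappa3 <- 1
x <- seq(-4, 4, by = 0.01)
standardized <- pnorm(x) - (1/6) * kappa3 * n^(-1/2) * (x^2 - 1) * dnorm(x)
studentized  <- pnorm(x) + (1/6) * kappa3 * n^(-1/2) * (2 * x^2 + 1) * dnorm(x)
plot(x, standardized, type = "l", ylab = "one-term expansion")
lines(x, studentized, lty = 2)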
To begin our development, let f be the density associated with F and assume
that f has moment generating function m(u) and cumulant generating func-
tion c(u), both of which are assumed to exist. Define fλ (t) = exp(λt)f (t)/m(λ)
for some real parameter λ. It then follows that fλ (t) is a density since fλ (t) ≥ 0
and

∫_{−∞}^{∞} exp(λt)f(t)dt = m(λ),

which implies

∫_{−∞}^{∞} f_λ(t)dt = [m(λ)]^{-1} ∫_{−∞}^{∞} exp(λt)f(t)dt = 1.

Several properties of this density can be obtained. Let E_λ denote the expectation with respect to f_λ and suppose that Y is a random variable following the density f_λ. That is,

E_λ(Y) = ∫_{−∞}^{∞} t f_λ(t)dt = [m(λ)]^{-1} ∫_{−∞}^{∞} t exp(λt)f(t)dt.
Figure 7.8 The function H3 (t) from the first term of an Edgeworth expansion for
the standardized sample mean which provides an approximation to the relative error
of the Edgeworth expansion as a function of t.
it follows that if we invert m(u) we get the density f (t). Similarly, we note
that
m(u + λ) = ∫_{−∞}^{∞} exp[t(u + λ)]f(t)dt = ∫_{−∞}^{∞} exp(ut)[exp(λt)f(t)]dt.
7.8.1 Exercises
1. Let f be a real function and define the Fourier norm as Feller (1971) does
as

(2π)^{-1} ∫_{−∞}^{∞} |f(x)|dx.
For a fixed value of x, is this function a norm?
2. Prove that the Fourier transformation of H_k(x)φ(x) is (it)^k exp(−t^2/2).
Hint: Use induction and integration by parts as in the partial proof of The-
orem 7.1.
3. Let {Xn }∞
n=1 be a sequence of independent and identically distributed ran-
dom variables where Xn has a Gamma(α, β) distribution for all n ∈ N.
a. Compute one- and two-term Edgeworth expansions for the density of n^{1/2}σ^{-1}(X̄_n − µ) where in this case µ = αβ and σ^2 = αβ^2. What effect
do the values of α and β have on the accuracy of the expansion? Is it
possible to eliminate either the first or second term through a specific
choice of α and β?
b. Compute one- and two-term Edgeworth expansions for the distribution
function of n1/2 σ −1 (X̄n − µ).
c. Compute one- and two-term Cornish–Fisher expansions for the quantile
function of n1/2 σ −1 (X̄n − µ).
4. Let {Xn }∞
n=1 be a sequence of independent and identically distributed ran-
dom variables where Xn has a Beta(α, β) distribution for all n ∈ N.
6. Prove Theorem 7.8. That is, let {Xn }∞ n=1 be a sequence of independent
and identically distributed random variables from a distribution F . Let
F_n(t) = P[n^{1/2}σ^{-1}(X̄_n − µ) ≤ t] and assume that E(X_n^3) < ∞. If F is a non-lattice distribution then prove that

F_n(x) − Φ(x) − (1/6)σ^{-3}n^{-1/2}µ_3(1 − x^2)φ(x) = o(n^{-1/2}),
as n → ∞ uniformly in x. The first part of this proof is provided after
Theorem 7.4. At what point is it important that the distribution be non-
lattice?
7. Let {R_n}_{n=1}^∞ be a sequence of real numbers such that R_n = o(n^{-1}) as n → ∞. Prove that R_n^2 = o(n^{-1}) as n → ∞.
8. Suppose that v1 (α) and v2 (α) are constant with respect to n. Prove that if
Rn = [n−1/2 v1 (α) + n−1 v2 (α) + o(n−1 )]2 then a sequence that is o(Rn ) as
n → ∞ is also o(n−1 ) as n → ∞.
9. Suppose that gα,n = v0 (α) + n−1/2 v1 (α) + n−1 v2 (α) + o(n−1 ) as n → ∞
where v0 (α), v1 (α), and v2 (α) are constant with respect to n. Prove that
H3 (gα,n ) = H3 [v0 (α)] + o(1) and H4 (gα,n ) = H4 [v0 (α)] + o(1) as n → ∞.
10. Prove that

∫_{−∞}^{∞} exp(tx)φ(x)H_k(x)dx = t^k exp(t^2/2).
a. Prove that b_i = a_iσ^{-1}.
b. Prove that b_{ij} = a_{ij}σ^{-1} − (1/2)(a_ic_j + a_jc_i)σ^{-3}, where

c_k = ∂/∂x_k g(x) |_{x=µ}.
19. Let {Wn }∞n=1 be a sequence of independent and identically distributed ran-
dom variables from a distribution F with mean θ and variance σ 2 . It was
shown in Example 7.6 that this parameter can be represented in the smooth
function model with X_n' = (W_n, W_n^2), g(x) = x_1, and h(x) = (x_2 − x_1^2)^{1/2} where we have θ̂_n = W̄_n and

σ̂^2 = n^{-1} Σ_{k=1}^{n} (W_k − W̄_n)^2.

Assuming that θ = 0 and σ = 1 and using the form from Equations (7.44)–(7.48), derive a two-term Edgeworth expansion for the studentized distribution of θ̂ and compare it to Equation (7.54). In particular, show that

q_2(x) = x[(1/12)κ_4(x^2 − 3) − (1/18)κ_3^2(x^4 + 2x^2 − 3) − (1/4)(x^2 + 3)],

which can be found in Section 2.6 of Hall (1992), where κ_3 and κ_4 are the third and fourth cumulants of F, respectively.
20. Let {X_n}_{n=1}^∞ be a sequence of independent and identically distributed random variables from a distribution F that has density f, characteristic function ψ(u), and cumulant generating function c(u). Assume that the characteristic function is real valued in this case and use the alternate definition of the cumulant generating function given by c(u) = log[ψ(u)]. Define the density f_λ(t) = exp(λt)f(t)/ψ(λ) and let {Y_n}_{n=1}^∞ be a sequence of independent and identically distributed random variables following the density f_λ. Let f_n denote the density of nX̄_n and f_{n,λ} denote the density of nȲ_n. Using characteristic functions, prove that f_{n,λ}(t) = exp[λt − nc(λ)]f_n(t).
21. Let X be a random variable with moment generating function m(u) and cumulant generating function c(u). Assuming that both functions exist, prove that

d^2/du^2 [m(u + λ)/m(λ)] |_{u=0} = c''(λ).
22. Let {Xn }∞
n=1 be a sequence of independent and identically distributed ran-
dom variables following an Exponential(1) density.
6. Let {Xn }∞
n=1 be a sequence of independent and identically distributed ran-
dom variables following a N(µ, σ 2 ) distribution. A saddlepoint expansion
that approximates the density of nX̄n at a point x with relative error
O(n^{-1}) as n → ∞ was shown in Example 7.11 to have the form

f_n(x) = [2πnc''(λ̃)]^{-1/2} exp[nc(λ̃) − λ̃x][1 + O(n^{-1})]
       = (2πnσ^2)^{-1/2} exp[(1/2)σ^{-2}(−nµ^2 + n^{-1}x^2) − xσ^{-2}(n^{-1}x − µ)][1 + O(n^{-1})]
       = (2πnσ^2)^{-1/2} exp[−(1/2)n^{-1}σ^{-2}(x − nµ)^2][1 + O(n^{-1})],

which has a leading term equal to a N(nµ, nσ^2) density, which matches the exact density of nX̄_n. Plot a N(nµ, nσ^2) density and the saddlepoint
approximation for µ = 1, σ 2 = 1, and n = 2, 5, 10, 25 and 50. Discuss how
well the saddlepoint approximation appears to be doing in each case.
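One way to carry out this experiment in R is sketched below (not part of the original text); because the leading term of the saddlepoint approximation coincides with the exact N(nµ, nσ^2) density here, the two curves should overlap for every n. The plotting layout is an arbitrary choice.

mu <- 1
sigma2 <- 1
par(mfrow = c(2, 3))
for (n in c(2, 5, 10, 25, 50)) {
  x <- seq(n * mu - 4 * sqrt(n * sigma2), n * mu + 4 * sqrt(n * sigma2), length.out = 200)
  exact <- dnorm(x, mean = n * mu, sd = sqrt(n * sigma2))                  # exact density of n * Xbar_n
  saddle <- (2 * pi * n * sigma2)^(-1/2) * exp(-0.5 * (x - n * mu)^2 / (n * sigma2))  # saddlepoint leading term
  plot(x, exact, type = "l", main = paste("n =", n), ylab = "density")
  lines(x, saddle, lty = 2)
}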
7. Let {Xn }∞
n=1 be a sequence of independent and identically distributed ran-
dom variables following a ChiSquared(θ) distribution. A saddlepoint ex-
pansion that approximates the density of nX̄n at a point x with relative
error O(n^{-1}) as n → ∞ was shown in Example 7.12 to have the form

f_n(x) = [4πn^{-1}θ^{-1}x^2]^{-1/2} exp[−(1/2)nθ log(nθx^{-1}) − (1/2)(x − nθ)][1 + O(n^{-1})].
Plot the correct Gamma density of nX̄n and the saddlepoint approximation
for θ = 1 and n = 2, 5, 10, 25 and 50. Discuss how well the saddlepoint
approximation appears to be doing in each case.
CHAPTER 8

Asymptotic Expansions for Random Variables
I trust them far more than what Barnabas says. Even if they are old, worthless
letters picked at random out of a pile of equally worthless letters, with no more
understanding than the canaries at fairs have, pecking out people’s fortunes at
random, well, even if that is so, at least these letters bear some relation to my
work.
The Castle by Franz Kafka
So far in this book we have relied at many times on the ability to approximate
a function f (x + δ) for a sequence of constants δ that converge to zero based
on the value f (x). That is, we are able to approximate values of f (x + δ) for
small values of δ as long as we know f (x). The main tool for developing these
approximations was based on Theorem 1.13 (Taylor), though we also briefly
talked about other methods as well. The main strength of the theory is based
on the idea that the accuracy of these approximations are well known and have
properties that are easily represented using the asymptotic order notation
introduced in Section 1.5. For instance, in Example 1.23 we found that the distribution function of a N(0, 1) random variable could be approximated near zero with 1/2 + δ(2π)^{-1/2} − (1/6)δ^3(2π)^{-1/2}. The error of this approximation can be represented as o(δ^3), which means that the error, when divided by δ^3, converges to zero as δ converges to zero. We also saw that this error can be represented as O(δ^4), which means that the error, when divided by δ^4, remains bounded as δ converges to zero.
In some cases it would be useful to develop methods for approximating ran-
dom variables. As an example, consider the situation where we have observed
X1 , . . . , Xn , a set of independent and identically distributed random vari-
ables from a distribution F with mean θ and variance σ 2 . If the distribu-
tion F is not known, then an approximate 100α% upper confidence limit for
θ is given by Ûn (X1 , . . . , Xn ) = X̄n − n−1/2 σ̂z1−α , where σ̂ is the sample
standard deviation. This confidence limit is approximate in the sense that
P[θ ≤ Û_n(X_1, . . . , X_n)] ≈ α. Let us assume that there is an exact upper confidence limit U_n(X_1, . . . , X_n) such that P[θ ≤ U_n(X_1, . . . , X_n)] = α. In this case
it would be useful to be able to compare Ûn (X1 , . . . , Xn ) and Un (X1 , . . . , Xn )
to determine the quality of the approximation. That is, we are interested in
the behavior of Rn = |Ûn (X1 , . . . , Xn ) − Un (X1 , . . . , Xn )| as n → ∞. For
example, we would like to be able to conclude that Rn → 0 as n → ∞. How-
ever, there is a small problem in this case in that both Un (X1 , . . . , Xn ) and
Ûn (X1 , . . . , Xn ) are random variables due to their dependence on the sample
X1 , . . . , Xn . Therefore, we cannot characterize the asymptotic behavior of Rn
using a limit for real sequences; we must characterize this behavior in terms of one of the modes of convergence for random variables discussed in Chapter 3. For example, we can use Definition 3.1 and determine whether R_n →p 0 as n → ∞.
An alternative method for approximating the upper confidence limit would
be to use the correction suggested by the Cornish-Fisher expansion of Theo-
rem 7.10. That is, replace z1−α with h1−α,n to obtain an approximate upper
confidence limit given by Ũn (X1 , . . . , Xn ) = X̄n − n−1/2 σ̂h1−α,n . By defining
R_n = |Ũ_n(X_1, . . . , X_n) − U_n(X_1, . . . , X_n)| we could again ascertain whether R_n →p 0 as n → ∞. However, a more useful analysis would consider whether
Ũn (X1 , . . . , Xn ) is an asymptotically more accurate approximate upper confi-
dence limit than Ûn (X1 , . . . , Xn ). If all we know is that both methods converge
to zero in probability then we cannot compare the two methods directly. In
this case we require some information about the rate at which the two methods
converge in probability to zero.
The goal of this chapter is to develop methods for approximating a random
variable, or a sequence of random variables, with an asymptotic expansion
whose terms may be random variables. The error terms of these sequences
will also necessarily be random variables as well. That is, let {Xn }∞ n=1 be a
sequence of random variables. Then we would like to find random variables Y_0, Y_1, . . . , Y_p such that X_n = Y_0 + n^{-1/2}Y_1 + n^{-1}Y_2 + · · · + n^{-p/2}Y_p + R(n, p),
where R(n, p) is a random variable that serves as a remainder term that de-
pends on both n and p. Note that the random variables Y0 , . . . , Yp themselves
do not depend on n. As with asymptotic expansions for sequences of real num-
bers, the form of the expansion and the rate at which the error term converges
to zero are both important properties of the expansion. Therefore, another fo-
cus of this chapter is on defining rates of convergence for sequences of random
variables. In particular, we will develop stochastic analogs of the asymptotic
order notation from Definition 1.7. We will then apply these methods to ap-
plications, such as the delta method and the asymptotic distribution of the
sample central moments.
One can observe from Example 8.5 that a very useful special case of sequences
that are bounded in probability are those that converge in distribution. As
discussed in Section 4.2, we will always assume that sequences that converge
in distribution do so to distributions that have valid distribution functions. It
is this property that assures that such sequences are bounded in probability.
Theorem 8.1. Let {X_n}_{n=1}^∞ be a sequence of random variables that converges in distribution to a random variable X as n → ∞. Then X_n = O_p(1) as n → ∞.
Proof. Let F_n denote the distribution function of X_n for all n ∈ N and let F denote the distribution of X. Since X_n →d X as n → ∞ it follows from Definition 4.1 that

lim_{n→∞} F_n(x) = F(x),

for all x ∈ C(F). By assumption F is a distribution function such that

lim_{x→∞} F(x) = 1 and lim_{x→−∞} F(x) = 0.

Therefore Theorem 4.1 implies that {X_n}_{n=1}^∞ is bounded in probability, and Definition 8.2 then implies that X_n = O_p(1) as n → ∞.
Example 8.7. Let {Xn }∞ n=1 be a sequence of independent random variables
from a distribution F with mean µ and variance σ 2 . Theorem 4.20 (Linde-
berg and Lévy) implies that n^{1/2}σ^{-1}(X̄_n − µ) →d Z where Z has a N(0, 1) distribution. Therefore Theorem 8.1 implies that n^{1/2}σ^{-1}(X̄_n − µ) = O_p(1) as n → ∞ and Definition 8.2 implies that X̄_n − µ = O_p(n^{-1/2}), or equivalently that X̄_n = µ + O_p(n^{-1/2}), as n → ∞.
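A small simulation (not part of the original text) illustrates the conclusion of Example 8.7: the scaled and centered sample mean stays bounded in probability as n grows. The sketch below assumes Exponential(1) data, so µ = 1; the sample sizes and number of replications are arbitrary.

set.seed(1)
ns <- c(10, 100, 1000, 10000)
sapply(ns, function(n) {
  z <- replicate(2000, sqrt(n) * (mean(rexp(n)) - 1))
  quantile(abs(z), 0.95)    # roughly stable across n, consistent with O_p(1)
})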
Example 8.8. Let Z1 , . . . , Zm be a set of independent and identically dis-
tributed N(0, 1) random variables and let δ > 0. Define
X_{m,δ} = (Z_1 + δ)^2 + Σ_{k=2}^{m} Z_k^2.
When using the order notation for real valued sequences, we found that if a
sequence was o(yn ) as n → ∞ for some real valued sequence {yn }∞
n=1 then the
sequence was also O(yn ) as n → ∞. This same type of relationship holds for
sequences of random variables using the stochastic order notation.
Theorem 8.2. Let {Xn }∞ ∞
n=1 and {Yn }n=1 be sequences of random variables.
Suppose that Xn = op (Yn ) as n → ∞, then Xn = Op (Yn ) as n → ∞.
For a proof of Theorem 8.2 see Exercise 2. To effectively work with stochastic
order notation we must also establish how the sequences of each order interact
with each other. The result below also provides results as to how real sequences
and sequences of random variables interact with one another.
Theorem 8.3. Let {Xn }∞ ∞
n=1 and {Yn }n=1 be sequences of random variables
and let {yn }∞
n=1 be a sequence of real numbers.
1. If Xn = Op (n−a ) and Yn = Op (n−b ) as n → ∞, then Xn Yn = Op (n−(a+b) )
as n → ∞.
2. If Xn = Op (n−a ) and yn = o(n−b ) as n → ∞, then Xn yn = op (n−(a+b) )
as n → ∞.
3. If Xn = Op (n−a ) and Yn = op (n−b ) as n → ∞, then Xn Yn = op (n−(a+b) )
as n → ∞.
4. If Xn = op (n−a ) and yn = o(n−b ) as n → ∞, then Xn yn = op (n−(a+b) ) as
n → ∞.
5. If Xn = Op (n−a ) and yn = O(n−b ) as n → ∞, then Xn yn = Op (n−(a+b) )
as n → ∞.
6. If X_n = o_p(n^{-a}) and y_n = O(n^{-b}) as n → ∞, then X_n y_n = o_p(n^{-(a+b)}) as n → ∞.
7. If Xn = op (n−a ) and Yn = op (n−b ) as n → ∞, then Xn Yn = op (n−(a+b) )
as n → ∞.
Proof. We will prove the first two parts of this theorem, leaving the remaining parts for Exercise 9. To prove the first result, suppose that {X_n}_{n=1}^∞ and {Y_n}_{n=1}^∞ are sequences of random variables such that X_n = O_p(n^{-a}) and Y_n = O_p(n^{-b}) as n → ∞. Therefore, it follows from Definition 8.2 that the sequences {n^aX_n}_{n=1}^∞ and {n^bY_n}_{n=1}^∞ are both bounded in probability. Therefore, for every ε > 0 there exist bounds x_ε and y_ε and positive integers n_{x,ε} and n_{y,ε} such that P(|n^aX_n| ≤ x_ε) > 1 − ε and P(|m^bY_m| ≤ y_ε) > 1 − ε for all n > n_{x,ε} and m > n_{y,ε}. Define b_ε = max{x_ε, y_ε} and n_ε = max{n_{x,ε}, n_{y,ε}}. Therefore, it follows that P(|n^aX_n| ≤ b_ε) > 1 − ε and P(|n^bY_n| ≤ b_ε) > 1 − ε for all n > n_ε. In accordance with Definition 8.2, we must now prove that the sequence {n^{a+b}X_nY_n}_{n=1}^∞ is bounded in probability. Let ε > 0 and note that
≥ P (|nb Yn | ≤ bε )P (|na Xn | ≤ bε )
≥ (1 − ε)2
= 1 − 2ε + ε2
> 1 − 2ε.
lim_{n→∞} P(|n^{a+b}X_ny_n| ≤ ξ) = 1,

and therefore Definition 3.1 implies that n^{a+b}X_ny_n →p 0 as n → ∞. Therefore, Definition 8.1 implies that X_ny_n = o_p(n^{-(a+b)}) as n → ∞.
With the introduction of the stochastic order notation in Definitions 8.1 and
8.2, we can now define an asymptotic expansion for a sequence of random
variables {Xn }∞
n=1 as an expansion of the form
or of the form
U_n(X_1, . . . , X_n) = X̄_n − n^{-1/2}σ̂_nz_{1−α} − n^{-1}σ̂_ns_1(z_{1−α}) − n^{-3/2}σ̂_ns_2(z_{1−α}) + o(n^{-3/2}),
as n → ∞. Hence, if we define Rn = |Ûn (X1 , . . . , Xn ) − Un (X1 , . . . , Xn )|, then
it follows that
Rn = |n−1 σ̂n s1 (z1−α ) + n−3/2 σ̂n s2 (z1−α ) + o(n−3/2 )|,
as n → ∞. Assuming that σ̂n is a consistent estimator of σ, which can be ver-
ified using Theorem 3.19, it follows that σ̂n = σ + op (1), as n → ∞. Therefore,
it follows that n−1 σ̂n s1 (z1−α ) = op (n−1/2 ) and hence Rn = op (n−1/2 ), as n →
∞. Hence we have shown that Ûn (X1 , . . . , Xn ) = Un (X1 , . . . , Xn ) + op (n−1/2 )
as n → ∞. Now recall that s1 (z1−α ) and s2 (z1−α ) are polynomials whose
coefficients are functions of the moments of F . Suppose that we can estimate
s1 (z1−α ) and s2 (z1−α ) with consistent estimators ŝ1 (z1−α ) and ŝ2 (z1−α ) so
that ŝ1 (z1−α ) = s1 (z1−α ) + op (1) and ŝ2 (z1−α ) = s2 (z1−α ) + op (1), as n → ∞.
We can then approximate the true upper confidence limit with

Ũ_n(X_1, . . . , X_n) = X̄_n − n^{-1/2}σ̂_nz_{1−α} − n^{-1}σ̂_nŝ_1(z_{1−α}) − n^{-3/2}σ̂_nŝ_2(z_{1−α}).

To find the accuracy of this approximation note that

n^{-1}σ̂_nŝ_1(z_{1−α}) = n^{-1}σs_1(z_{1−α}) + o_p(n^{-1}),

and

n^{-3/2}σ̂_nŝ_2(z_{1−α}) = n^{-3/2}σs_2(z_{1−α}) + o_p(n^{-3/2}),

as n → ∞. Therefore,

Ũ_n(X_1, . . . , X_n) = X̄_n − n^{-1/2}σ̂_nz_{1−α} − n^{-1}σ̂_ns_1(z_{1−α}) + o_p(n^{-1}),

as n → ∞ and it follows that |Ũ_n(X_1, . . . , X_n) − U_n(X_1, . . . , X_n)| = o_p(n^{-1}), as n → ∞, which is more accurate than the normal approximation given by Û_n(X_1, . . . , X_n). Note that estimating s_2(z_{1−α}) in this case makes no difference from an asymptotic viewpoint, because the error from estimating s_1(z_{1−α}) is as large, asymptotically, as this term. Therefore, an asymptotically equivalent substitute for Ũ_n(X_1, . . . , X_n) is the approximation X̄_n − n^{-1/2}σ̂_nz_{1−α} − n^{-1}σ̂_nŝ_1(z_{1−α}).
8.3 The Delta Method
then,

n^{1/2}(µ̂_k − µ_k) = n^{1/2}(µ̃_k − µ_k − kµ_{k−1}µ̃_1) + o_p(1),   (8.5)
as n → ∞.
Proof. To begin this argument, first note that using Theorem A.22 we have that

n^{-1} Σ_{i=1}^{n} (X_i − X̄_n)^k = n^{-1} Σ_{i=1}^{n} [(X_i − µ) − (X̄_n − µ)]^k
 = n^{-1} Σ_{i=1}^{n} Σ_{j=0}^{k} \binom{k}{j} (−1)^{k−j} (X_i − µ)^j (X̄_n − µ)^{k−j}
 = Σ_{j=0}^{k} (−1)^{k−j} \binom{k}{j} (X̄_n − µ)^{k−j} [n^{-1} Σ_{i=1}^{n} (X_i − µ)^j]
 = Σ_{j=0}^{k} (−1)^{k−j} \binom{k}{j} (X̄_n − µ)^{k−j} µ̃_j.
Now Theorem 4.20 implies that n^{1/2}µ̃_1 = n^{1/2}(X̄_n − µ) →d Z as n → ∞ where Z is a N(0, µ_2) random variable. Therefore, Theorem 8.1 implies that n^{1/2}µ̃_1 = O_p(1) as n → ∞. Next we note that Theorem 3.10 implies that

µ̃_{k−1} = n^{-1} Σ_{i=1}^{n} (X_i − µ)^{k−1} →p E[(X_i − µ)^{k−1}] = µ_{k−1},   (8.6)

as n → ∞ so that Theorem 3.8 implies that kµ_{k−1} − kµ̃_{k−1} →p 0 as n → ∞. Finally, we note that in the special case when k = 2 in Equation (8.6),

µ̃_1 = n^{-1} Σ_{i=1}^{n} (X_i − µ) →p E(X_i − µ) = µ_1 = 0,

so that µ̃_1 →p 0 as n → ∞. Therefore, Theorem 3.8 implies that µ̃_1^{k−j−1}µ̃_j →p 0 as n → ∞ for j = 0, . . . , k − 2, and hence

Σ_{j=0}^{k−2} (−1)^{k−j} \binom{k}{j} µ̃_1^{k−j−1}µ̃_j = o_p(1),

as n → ∞. Therefore,

n^{1/2}(µ̂_k − µ_k) = n^{1/2}(µ̃_k − µ_k − kµ_{k−1}µ̃_1) + o_p(1),

as n → ∞.
Y_n' = [(X_n − µ)^2 − µ_2 − 2µ_1(X_n − µ), . . . , (X_n − µ)^k − µ_k − kµ_{k−1}(X_n − µ)],
for all n ∈ N. Note that
E[(Xn − µ)k − µk − kµk−1 (Xn − µ)] = µk − µk − kµk−1 µ1 = 0,
since µ_1 = 0. Then it follows that W_n = n^{1/2}Ȳ_n and Theorem 4.22 implies that W_n →d Z as n → ∞ where Z is a N(0, Σ) random vector. As discussed
above, the random vector Zn has this same limit distribution, and therefore all
there is left to do is verify the form of the covariance matrix. The form of the
covariance matrix can be determined from the covariance matrix of Yn . The
(i, j)th element of this covariance matrix is given by the covariance between
(Xn −µ)i+1 −µi+1 −(i+1)µi (Xn −µ) and (Xn −µ)j+1 −µj+1 −(j+1)µj (Xn −µ).
Since both random variables have expectation equal to zero, this covariance
is equal to the expectation of the product
8.5.1 Exercises
1. Let {Xn }∞
n=1 be a sequence of independent random variables where Xn
has a N(0, n−1 ) distribution for all n ∈ N. Prove that Xn = Op (n−1/2 ) as
n → ∞.
2. Let {Xn }∞ ∞
n=1 and {Yn }n=1 be sequences of random variables. Suppose that
Xn = op (Yn ) as n → ∞. Prove that Xn = Op (Yn ) as n → ∞.
3. Let {Xn }∞n=1 be a sequence of independent and identically distributed ran-
dom variables where Xn = n−1 Un where Un has a Uniform(0, 1) distribu-
tion for all n ∈ N. Prove that Xn = op (n−1/2 ) and that Xn = Op (n−1 ) as
n → ∞.
4. Let {Xn }∞ ∞
n=1 and {Yn }n=1 be sequences of independent random variables.
Suppose that Yn is a Beta(αn , βn ) random variable where {αn }∞ n=1 and
{βn }∞
n=1 are sequences of positive real numbers that converge to α and β,
respectively. Suppose further that, conditional on Yn , the random variable
Xn has a Binomial(m, Yn ) distribution where m is a fixed positive integer
for all n ∈ N. Prove that Xn = Op (1) as n → ∞.
5. Let {Xn }∞ ∞
n=1 and {Yn }n=1 be sequences of independent random variables.
Suppose that Yn is a Poisson(θ) random variable where θ is a positive real
number. Suppose further that, conditional on Yn , the random variable Xn
has a Binomial(Yn , τ ) distribution for all n ∈ N where τ is a fixed real
number in the interval [0, 1]. Prove that Xn = Op (Yn ) as n → ∞.
6. Let {Xn }∞
n=1 be a sequence of independent random variables where Xn
has a Gamma(αn , βn ) distribution for all n ∈ N and {αn }∞ ∞
n=1 and {βn }n=1
are bounded sequences of positive real numbers. That is, there exist real
numbers α and β such that 0 < αn ≤ α and 0 < βn ≤ β for all n ∈ N.
Prove that Xn = Op (1) as n → ∞.
7. Let {Xn }∞
n=1 be a sequence of independent random variables where Xn has
a Geometric(θn ) distribution where {θn }∞
n=1 is described below. For each
sequence determine whether Xn = Op (1) as n → ∞.
8. Let {Xn }∞ ∞
n=1 and {Yn }n=1 be sequences of independent random variables,
where Xn has a Uniform(0, n) distribution and Yn has a Uniform(0, n2 )
distribution for all n ∈ N. Prove that Xn = op (Yn ) as n → ∞.
9. Let {Xn }∞ ∞ ∞
n=1 and {Yn }n=1 be sequences of random variables and let {yn }n=1
be a sequence of real numbers.
a. Prove that if Xn = Op (n−a ) and Yn = op (n−b ) as n → ∞, then Xn Yn =
op (n−(a+b) ) as n → ∞.
b. Prove that if Xn = op (n−a ) and yn = o(n−b ) as n → ∞, then Xn yn =
op (n−(a+b) ) as n → ∞.
c. Prove that if Xn = Op (n−a ) and yn = O(n−b ) as n → ∞, then Xn yn =
Op (n−(a+b) ) as n → ∞.
d. Prove that if Xn = op (n−a ) and yn = O(n−b ) as n → ∞, then Xn yn =
op (n−(a+b) ) as n → ∞.
e. Prove that if Xn = op (n−a ) and Yn = op (n−b ) as n → ∞, then Xn Yn =
op (n−(a+b) ) as n → ∞.
10. Suppose that {Wn }∞ n=1 is a sequence of independent random variables such
that Wn has a N(θ, σ 2 ) distribution for all n ∈ N where θ 6= 0. Define a
sequence of random variables {X_n}_{n=1}^∞ where X_n = W̄_n for all n ∈ N so that X_n has a N(θ, n^{-1}σ^2) distribution for all n ∈ N and X_n →p θ as n → ∞. Find the asymptotic distribution of n^{1/2}[exp(−X_n^2) − exp(−θ^2)].
11. Let {Bn }∞n=1 be a sequence of independent random variables where Bn has
a Bernoulli(θ) distribution for all n ∈ N. Define a sequence of random
variables {Xn }∞
n=1 where
X_n = n^{-1} Σ_{k=1}^{n} B_k,
8.5.2 Experiments
a. N(0, 1)
b. Cauchy(0, 1)
c. T(3)
d. Exponential(1)
a. N(0, 1)
b. Cauchy(0, 1)
c. T(3)
d. Exponential(1)
5. Write a program in R that simulates the sequence {Xn }100 n=1 where Xn has
a Geometric(θn ) distribution where the sequence θn is specified below.
Repeat the experiment five times and plot the five realizations against n on
EXERCISES AND EXPERIMENTS 357
the same set of axes. Describe the behavior of each sequence and compare
this to the theoretical results of Exercise 7.
a. θn = n(n + 10)−1 for all n ∈ N.
b. θn = n−1 for all n ∈ N.
c. θn = n−2 for all n ∈ N.
d. θn = 12 for all n ∈ N.
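One way to carry out the simulation described in Experiment 5 is sketched below (not part of the original text), here for case (a); the other cases only change the definition of θ_n. Note that rgeom counts failures before the first success, which may differ from the parameterization used in the text.

set.seed(1)
n <- 1:100
theta <- n / (n + 10)                          # case (a)
x <- replicate(5, rgeom(100, prob = theta))    # five realizations of the sequence
matplot(n, x, type = "l", lty = 1, xlab = "n", ylab = "X_n")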
CHAPTER 9

Differentiable Statistical Functionals
9.1 Introduction
This chapter will introduce a class of parameters that are known as func-
tional parameters. These types of parameters are functions of a distribution
function and can therefore be estimated by taking the same function of the
empirical distribution function computed on a sample from the distribution.
A novel approach to finding the asymptotic distribution of statistics of this
type was developed by von Mises (1947). In essence von Mises (1947) showed
that statistics of this type could be approximated based on a type of Taylor
expansion. Under some regularity conditions the asymptotic distribution of
the approximation can be found using standard methods such as Theorem 4.20 (Lindeberg and Lévy). In this chapter we will first introduce functional
parameters and statistics. We will then develop the Taylor type expansion by
first introducing a differential for functional statistics, and then introducing
the expansion itself. We will then proceed to develop the asymptotic theory.
There have been many advances in this theory since its inception, mostly
in regard to developing more useful differentials. Much of the mathematical
theory required to study these advances is beyond the mathematical level of
this book. The purpose of this chapter is to provide a general overview of
the subject at a mathematical level that is consistent with the rest of our
presentation so far. Those who find an interest in this topic should consult
Fernholz (1983), Chapter 6 of Serfling (1980), and Chapter 20 of van der Vart
(1998).
observed a set of independent and identically distributed random variables
from a distribution F , and we are interested in a certain characteristic θ of
this distribution. Usually, θ can be written as a function of F . That is, we
can take θ = T (F ) for some function T . It is important to note that the
domain of this function is not the real line. Rather, T takes a distribution
function and maps it to the real line. Therefore the domain of this function is
a space containing distribution functions. We will work with a few spaces of
distribution functions. In particular, we will consider the set of all continuous
distribution functions on R, the set of all discrete distribution functions on
R, and the set of all distribution functions on R. Some examples of functional
parameters are given below.
Example 9.1. Let F ∈ F, the collection of all distribution functions on the
real line, and let θ be the kth moment of F. Then θ is a functional parameter that can be written as

θ = T(F) = ∫_{−∞}^{∞} t^k dF(t).
Noting that this parameter may not exist for all F ∈ F, we may consider
taking F from Fk , the collection of all distribution functions that have at
least k finite moments.
Example 9.2. Let F ∈ Fk , where Fk is defined in Example 9.1, and let θ be
the k th moment of F about the mean. Then θ is a functional parameter that
can be written as
θ = T(F) = ∫_{−∞}^{∞} [t − ∫_{−∞}^{∞} u dF(u)]^k dF(t).
for all i ∈ {1, . . . , k}. The hypothesized model can then be compared to the
true model, using the functional
θ = T(F) = Σ_{i=1}^{k} p_i^{-1}[∫_{R_i} dF − p_i]^2,
which is the sum of the square differences of the probabilities relative to the
model probabilities. Note that when the proposed model is correct then θ = 0,
and that when the proposed model is incorrect then θ > 0.
A key property that applies to many plug-in estimators based on the empirical
distribution function is that, conditional on the observed sample X1 , . . . , Xn ,
the empirical distribution function is a discrete distribution that associates a
probability of n−1 with each value in the sample. Therefore, integrals with
respect to the empirical distribution function simplify according to Definition
2.10 as a sum. That is, if g is a real valued function we have that
∫_{−∞}^{∞} g(x)dF̂_n(x) = n^{-1} Σ_{i=1}^{n} g(X_i).
which is the kth sample moment µ̂_k'.
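In R, this plug-in calculation is a one-line sample average. The sketch below (not part of the original text) uses a hypothetical Exponential sample and an arbitrary value of k.

k <- 3
set.seed(1)
x <- rexp(25)        # hypothetical sample from F
mean(x^k)            # T(F.hat), the k-th sample moment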
Example 9.5. Let F be a distribution with finite kth moment and let θ be the kth moment of F about the mean. Then θ is a functional parameter that can be written as

θ = T(F) = ∫_{−∞}^{∞} [t − ∫_{−∞}^{∞} u dF(u)]^k dF(t).
which compares the hypothesized model to the true model. Suppose that a
sample X1 , . . . , Xn is observed from F and the empirical distribution function
F̂_n is used to estimate F. Then a plug-in estimator for θ is given by

θ̂_n = T(F̂_n) = Σ_{i=1}^{k} p_i^{-1}[∫_{R_i} dF̂_n − p_i]^2
     = Σ_{i=1}^{k} p_i^{-1}[n^{-1} Σ_{j=1}^{n} δ(X_j; R_i) − p_i]^2
     = Σ_{i=1}^{k} p_i^{-1}(p̂_i − p_i)^2,
where p̂i is the proportion of the sample that was observed in subset Ri . Sup-
pose that we alternatively considered estimating the distribution function F
with a Normal distribution whose parameters were estimated from the ob-
served sample. That is, we could estimate F with a N(X̄n , S) distribution,
conditional on the observed sample X1 , . . . , Xn , where S is the sample stan-
dard deviation. In this case, the plug-in estimator for θ is given by
In this case, the plug-in estimator for θ is given by

θ̂_n = T(F̂_n) = Σ_{i=1}^{k} p_i^{-1}(p̃_i − p_i)^2,
where p̃i is the probability that a N(X̄n , S) random variable is in the region
Ri for all i ∈ {1, . . . , k}.
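Both plug-in estimates are easy to compute. The following R sketch (not part of the original text) uses a hypothetical partition of the real line into k = 3 regions with hypothetical model probabilities; sd is used for S, which divides by n − 1 rather than n, a minor difference for illustration purposes.

set.seed(1)
x <- rnorm(50)
breaks <- c(-Inf, -0.675, 0.675, Inf)                 # hypothetical regions R_1, R_2, R_3
p <- c(0.25, 0.5, 0.25)                               # hypothesized model probabilities
p.hat <- as.numeric(table(cut(x, breaks))) / length(x)
theta.emp <- sum((p.hat - p)^2 / p)                   # plug-in based on the empirical distribution
p.tilde <- diff(pnorm(breaks, mean = mean(x), sd = sd(x)))
theta.norm <- sum((p.tilde - p)^2 / p)                # plug-in based on the fitted Normal distribution
c(theta.emp, theta.norm)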
9.3 Differentiation of Statistical Functionals
which is the usual derivative, from the right, of the real function h evaluated at
zero. This result allows us to easily define higher order Gâteaux differentials.
Definition 9.2 (Gâteaux). Let T be a functional that maps a space of func-
tions F to R and let F and G be members of F. The k th order Gâteaux dif-
ferential of T at F in the direction of G is defined to be
∆_kT(F; G − F) = d^k/dδ^k h(δ) |_{δ↓0},
Many of the functionals studied in this chapter will all have a relatively sim-
ple form that can be written as a multiple integral of a symmetric function
where each integral is integrated with respect to dF . In this case the Gâteaux
differential has a simple form.
Theorem 9.1. Consider a functional of the form
T(F) = ∫_{−∞}^{∞} · · · ∫_{−∞}^{∞} h(x_1, . . . , x_r) Π_{i=1}^{r} dF(x_i),
If k > r then ∆k T (F ; G − F ) = 0.
Proof. We will prove this result for r = 1 and r = 2. For a general proof see
Exercise 9. We first note that when r = 1 we have that
T[F + δ(G − F)] = ∫_{−∞}^{∞} h(x_1)d{F(x_1) + δ[G(x_1) − F(x_1)]}
                = ∫_{−∞}^{∞} h(x_1)dF(x_1) + δ ∫_{−∞}^{∞} h(x_1)d[G(x_1) − F(x_1)].

Therefore, Definition 9.2 implies that

∆_1T(F; G − F) = d/dδ {∫_{−∞}^{∞} h(x_1)dF(x_1) + δ ∫_{−∞}^{∞} h(x_1)d[G(x_1) − F(x_1)]} |_{δ↓0}
               = ∫_{−∞}^{∞} h(x_1)d[G(x_1) − F(x_1)]
               = ∫_{−∞}^{∞} h(x_1)dG(x_1) − ∫_{−∞}^{∞} h(x_1)dF(x_1)
               = T(G) − T(F).
T[F + δ(G − F)] = ∫∫ h(x_1, x_2)d{F(x_1) + δ[G(x_1) − F(x_1)]}d{F(x_2) + δ[G(x_2) − F(x_2)]},

where each double integral in this argument is taken over R^2. To simplify this expression we first work with the differential in the double integral. Note that

Therefore,

T[F + δ(G − F)] = ∫∫ h(x_1, x_2)dF(x_1)dF(x_2) + δ ∫∫ h(x_1, x_2)d[G(x_1) − F(x_1)]dF(x_2) + δ ∫∫ h(x_1, x_2)dF(x_1)d[G(x_2) − F(x_2)] + δ^2 ∫∫ h(x_1, x_2)d[G(x_1) − F(x_1)]d[G(x_2) − F(x_2)],

and, therefore,

T[F + δ(G − F)] = ∫∫ h(x_1, x_2)dF(x_1)dF(x_2) + 2δ ∫∫ h(x_1, x_2)d[G(x_1) − F(x_1)]dF(x_2) + δ^2 ∫∫ h(x_1, x_2)d[G(x_1) − F(x_1)]d[G(x_2) − F(x_2)].
Therefore, the first two Gâteaux differentials follow from Definition 9.2 as

∆_1T(F; G − F) = d/dδ {∫∫ h(x_1, x_2)dF(x_1)dF(x_2) + 2δ ∫∫ h(x_1, x_2)d[G(x_1) − F(x_1)]dF(x_2) + δ^2 ∫∫ h(x_1, x_2)d[G(x_1) − F(x_1)]d[G(x_2) − F(x_2)]} |_{δ↓0}
 = 2 ∫∫ h(x_1, x_2)d[G(x_1) − F(x_1)]dF(x_2) + 2δ ∫∫ h(x_1, x_2)d[G(x_1) − F(x_1)]d[G(x_2) − F(x_2)] |_{δ↓0}
 = 2 ∫∫ h(x_1, x_2)d[G(x_1) − F(x_1)]dF(x_2),

and

∆_2T(F; G − F) = d/dδ {2 ∫∫ h(x_1, x_2)d[G(x_1) − F(x_1)]dF(x_2) + 2δ ∫∫ h(x_1, x_2)d[G(x_1) − F(x_1)]d[G(x_2) − F(x_2)]} |_{δ↓0}
 = 2 ∫∫ h(x_1, x_2)d[G(x_1) − F(x_1)]d[G(x_2) − F(x_2)]
 = 2T(G − F).
This functional has the form given in Theorem 9.1 with r = 1 and h(x) = x.
Therefore, Theorem 9.1 implies that the first Gâteaux differential has the form
∆_1T(F; G − F) = T(G) − T(F) = ∫_{−∞}^{∞} x dG(x) − ∫_{−∞}^{∞} x dF(x).
The Gâteaux differential is not the only approach to defining a derivative type
methodology for functionals, and in some sense this differential does not have
sufficient properties for a full development of the type of asymptotic theory
that we wish to seek in general settings. The original approach to using differ-
entials of this type was developed by von Mises (1947), who used a differential
similar to the Gâteaux differential. If we are working in a linear space that
is equipped with a norm then one can define what is commonly known as
the Fréchet derivative. See Dieudonné (1960), Section 2.3 of Fernholz (1983),
Fréchet (1925), Nashed (1971), and Chapter 6 of Serfling (1980), for further
details on this type of derivative. However, Fernholz (1983) points out that few
statistical functions are Fréchet differentiable. The Hadamard differential, first
used in this context by Reeds (1976), exists under weaker conditions than the
Fréchet derivative, but still exhibits the required properties. The Hadamard
differential is preferred by Fernholz (1983) as a compromise. For further details
on the Hadamard differential see Averbukh and Smolyanov (1968), Fernholz
(1983), Keller (1974), and Yamamuro (1974). In our presentation we will con-
tinue to use the Gâteaux differential and will limit ourselves to problems where
this differential is useful. This generally follows the development of Serfling
(1980), without addressing the issues related to the Fréchet derivative. This
provides a reasonable and practical overview of this topic without becoming
too deep with mathematical technicalities.
One case that is of particular interest comes from the fact that the statistical
properties of plug-in estimators of T can be studied by replacing G with the
empirical distribution function of Definition 3.5. That is, we can consider the
expansion
for all G ∈ F.
Proof. We prove this result for the special cases when r = 1 and r = 2. For
the general case see Exercise 10. For the case when r = 1, we follow the
constructive proof of Serfling (1980) and consider the function
t̃(x_1|F) = t(x_1) − ∫_{−∞}^{∞} t(x_2)dF(x_2).
so that, with all integrals taken over the real line,

∫∫ t̃_2(x_1, x_2|F)dG(x_1)dG(x_2) = ∫∫ t(x_1, x_2)dG(x_1)dG(x_2) − ∫∫∫ t(t_1, x_2)dF(t_1)dG(x_1)dG(x_2) − ∫∫∫ t(x_1, t_2)dF(t_2)dG(x_1)dG(x_2) + ∫∫∫∫ t(t_1, t_2)dF(t_1)dF(t_2)dG(x_1)dG(x_2)
 = ∫∫ t(x_1, x_2)dG(x_1)dG(x_2) − ∫∫ t(t_1, x_2)dF(t_1)dG(x_2) − ∫∫ t(x_1, t_2)dG(x_1)dF(t_2) + ∫∫ t(t_1, t_2)dF(t_1)dF(t_2)
 = ∫∫ t(x_1, x_2)d[G(x_1) − F(x_1)]d[G(x_2) − F(x_2)],
The result of Theorem 9.2 can have a profound impact on the form of a
Gâteaux differential. For example, Theorem 9.1 implies that if T is a functional
of the form

T(F) = ∫_{−∞}^{∞} · · · ∫_{−∞}^{∞} h(x_1, . . . , x_r) Π_{i=1}^{r} dF(x_i),

then

∆_kT(F, F̂_n − F) = ∫_{−∞}^{∞} · · · ∫_{−∞}^{∞} t(x_1, . . . , x_k|F) Π_{i=1}^{k} d[F̂_n(x_i) − F(x_i)],

where

t(x_1, . . . , x_k|F) = ∫_{−∞}^{∞} · · · ∫_{−∞}^{∞} h(x_1, . . . , x_k, y_1, . . . , y_{r−k}) Π_{i=1}^{r−k} dF(y_i).
This form of the differential can often be used to establish the weak conver-
gence properties of the differential. For example, if k = 1, we have that
∆_1T(F, F̂_n − F) = n^{-1} Σ_{i=1}^{n} t̃(X_i|F),
as n → ∞.
Proof. Theorem 9.2 implies that there exists a function t̃(x1 , . . . , xr |F ) such
that
Z ∞ Z ∞ r
Y
··· t(x1 , . . . , xr ) d[F̂n (xi ) − F (xi )] =
−∞ −∞ i=1
Z ∞ Z ∞
··· t̃(x1 , . . . , xr |F )dF̂n (x1 ) · · · dF̂n (xr ).
−∞ −∞
(9.2)
Now we take advantage of the fact that
Z ∞
t̃(xi1 , . . . , xir |F )dF (xik ) = 0,
−∞
for all ik ∈ {1, . . . , r} and k ∈ {1, . . . , r}. See Exercise 8. This implies that the
expectation in Equation (9.2) will be zero unless each index in the function
occurs at least twice. Serfling (1980) concludes that the number of non-zero
terms is O(nr ), as n → ∞. Assuming that the remaining absolute expectations
are bounded by some real value, as indicated by the assumptions, it follows
that
\[
\left\{ E\left[ \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} t(x_1,\ldots,x_r) \prod_{i=1}^{r} d[\hat{F}_n(x_i) - F(x_i)] \right] \right\}^2
\leq
E\left\{ \left[ \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} t(x_1,\ldots,x_r) \prod_{i=1}^{r} d[\hat{F}_n(x_i) - F(x_i)] \right]^2 \right\}
= n^{-2r} \sum_{i_1=1}^{n} \cdots \sum_{i_r=1}^{n} \sum_{j_1=1}^{n} \cdots \sum_{j_r=1}^{n} E[\tilde{t}(X_{i_1},\ldots,X_{i_r}|F)\,\tilde{t}(X_{j_1},\ldots,X_{j_r}|F)] = O(n^{-r}),
\]
as n → ∞.
We now combine these results in order to prove that the error term in the
expansion in Equation (9.1) has the desired asymptotic properties. We begin
by noting that Definition 9.2 implies that
\[
T(\hat{F}_n) = T(F) + \Delta_1 T(F, \hat{F}_n - F) + E_1(F, \hat{F}_n)
= T(F) + \frac{d}{d\delta} T[F + \delta(\hat{F}_n - F)]\bigg|_{\delta \downarrow 0} + E_1(F, \hat{F}_n). \tag{9.3}
\]
Let us first consider the case when r = 1. Noting that ∆1 T (F, F̂n − F ) =
T (F̂n ) − T (F ) implies that T (F ) + ∆1 T (F, F̂n − F ) = T (F ) + T (F̂n ) − T (F ) =
T (F̂n ). Therefore, E1 (F, F̂n ) is identically zero for all n ∈ N.
For the case when r = 2 we define v(δ) = T [F + δ(F̂n − F )] as a function of
δ so that v(0) = T (F ) and v(1) = T (F̂n ). With this notation, Equation (9.3)
can be written as
\[
v(1) = v(0) + \frac{d}{d\delta} v(\delta)\bigg|_{\delta=0} + E_1(\delta) = v(0) + v'(0) + E_1(\delta), \tag{9.4}
\]
which has the same form as a Taylor expansion for the function $v$ provided by Theorem 1.13, and hence $E_1(\delta) = \tfrac{1}{2}\delta^2 v''(\xi)$, for some $\xi \in (0,1)$.
Let δ be an arbitrary member of the unit interval. Following the arguments
of the proof of Theorem 9.1, if the functional T has the form
\[
T(F) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} h(x_1,x_2)\,dF(x_1)\,dF(x_2),
\]
then
\[
v''(\xi) = 2 \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} h(x_1,x_2)\,d[\hat{F}_n(x_1) - F(x_1)]\,d[\hat{F}_n(x_2) - F(x_2)],
\]
where we note that the derivative does not depend on ξ. Theorem 9.2 implies
that there exists a function h̃(x1 , x2 |F ) such that
\[
v''(\xi) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} \tilde{h}(x_1,x_2|F)\,d\hat{F}_n(x_1)\,d\hat{F}_n(x_2) = n^{-2} \sum_{i=1}^{n}\sum_{j=1}^{n} \tilde{h}(X_i,X_j|F).
\]
Suppose that $E[\tilde{h}^2(X_i,X_j|F)] < \infty$ for all $i$ and $j$ from the index set $\{1,\ldots,n\}$. Then Theorem 9.3 implies that
\[
n^{1/2} E\left\{ \left[ n^{-2} \sum_{i=1}^{n}\sum_{j=1}^{n} \tilde{h}(X_i,X_j|F) \right]^2 \right\} = O(n^{-3/2}).
\]
Definition 5.1 implies that $n^{1/2}|v''(\xi)| \xrightarrow{\mathrm{qm}} 0$ as $n \to \infty$ and hence Theorem 5.2 implies that $n^{1/2}|v''(\xi)| \xrightarrow{p} 0$ as $n \to \infty$. This in turn implies that $n^{1/2} E_1(\delta) \xrightarrow{p} 0$ as $n \to \infty$ for all $\delta \in (0,1)$. This type of argument can be generalized to the following result.
Theorem 9.4. Let X1 , . . . , Xn be a set of independent and identically dis-
tributed random variables from a distribution F and let θ be a functional
parameter of the form
\[
\theta = T(F) = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} h(x_1,\ldots,x_r) \prod_{i=1}^{r} dF(x_i).
\]
Then
\[
n^{1/2} \sup_{\lambda \in [0,1]} \frac{d^2}{d\lambda^2} T[F + \lambda(\hat{F}_n - F)] \xrightarrow{p} 0,
\]
as $n \to \infty$, and therefore $n^{1/2} E_1(F, \hat{F}_n) \xrightarrow{p} 0$, as $n \to \infty$.
A discussion of some general conditions under which the result of Theorem
9.4 holds for other types of functional parameters and differentials can be
found in Chapter 6 of Serfling (1980).
Example 9.9. Consider the variance functional of Example 9.8, which has the form
\[
T(F) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} h(x_1,x_2)\,dF(x_1)\,dF(x_2),
\]
where $h(x_1,x_2) = \tfrac{1}{2}(x_1^2 + x_2^2 - 2x_1x_2)$. In Example 9.8 it was shown that $\Delta_1 T(F, \hat{F}_n - F) = \hat{\mu}_2' - \mu_2' - 2\hat{\mu}_1'\mu_1' + 2(\mu_1')^2$. Therefore, it follows that $T(F) + \Delta_1 T(F, \hat{F}_n - F) = \hat{\mu}_2' + (\mu_1')^2 - 2\hat{\mu}_1'\mu_1'$, and hence in this case we have that
\[
E_1(F, \hat{F}_n) = \hat{\mu}_2' - (\hat{\mu}_1')^2 - \hat{\mu}_2' - (\mu_1')^2 + 2\hat{\mu}_1'\mu_1' = -(\hat{\mu}_1' - \mu_1')^2.
\]
Theorems 8.1 and 8.5 then imply that E1 (F, F̂n ) = op (n−1/2 ) as n → ∞,
which verifies the arguments given earlier for this example.
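The behavior of this error term is easy to examine numerically. The following sketch (Python with NumPy; the Exponential(1) population, sample sizes, and replication count are illustrative choices and not part of the example) draws repeated samples, computes $E_1(F,\hat{F}_n) = -(\hat{\mu}_1' - \mu_1')^2$, and reports the average of $n^{1/2}|E_1(F,\hat{F}_n)|$, which should shrink toward zero as n grows.

```python
import numpy as np

rng = np.random.default_rng(0)
mu1 = 1.0  # first moment of the Exponential(1) population (an illustrative choice)

for n in [25, 100, 400, 1600]:
    reps = 2000
    samples = rng.exponential(scale=1.0, size=(reps, n))
    mu1_hat = samples.mean(axis=1)
    # Error term for the variance functional: E_1(F, F_n hat) = -(mu1_hat - mu1)^2
    e1 = -(mu1_hat - mu1) ** 2
    print(n, np.sqrt(n) * np.abs(e1).mean())
```

Since $n^{1/2}|E_1(F,\hat{F}_n)|$ is of order $n^{-1/2}$ in probability, the reported averages should roughly halve each time the sample size is quadrupled.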
In this section we will use the results of the previous section, along with the
expansion given in Equation (9.1), to find the asymptotic distribution of the
sample functional T(F̂n). At first glance, the use of the expansion given in Equation (9.1) may not seem particularly useful, as it is not readily apparent that the asymptotic properties of ∆1T(F, F̂n − F) would be easier to establish than those of T(F̂n) directly. Indeed, in some problems either approach may present the same level of difficulty. However, we have established that in some cases ∆1T(F, F̂n − F) can be written as a sum of random variables whose asymptotic properties follow from established results such as Theorem 4.20 (Lindeberg-Lévy). Another important key in-
gredient in establishing these results is the asymptotic behavior of the error
for the expansion given in Equation (9.1). In the previous section we observed
that in certain cases this error term can be guaranteed to converge to zero at
a certain rate. For the development of the result below, the error term will
need to converge in probability to zero at least as fast as n−1/2 as n → ∞.
Theorem 9.5. Let $\{X_n\}_{n=1}^{\infty}$ be a sequence of independent and identically distributed random variables from a distribution F. Let θ = T(F) be a functional with estimator θ̂n = T(F̂n), where F̂n is the empirical distribution function of X1, . . . , Xn. Suppose that
\[
\Delta_1 T(F, \hat{F}_n - F) = n^{-1} \sum_{i=1}^{n} \tilde{t}(X_i|F),
\]
for some function t̃(Xi|F). Let μ̃ = E[t̃(Xi|F)] and σ̃² = V[t̃(Xi|F)], and assume that 0 < σ̃² < ∞. If
\[
n^{1/2} E_1(F, \hat{F}_n) = n^{1/2}[T(\hat{F}_n) - T(F) - \Delta_1 T(F, \hat{F}_n - F)] \xrightarrow{p} 0,
\]
as n → ∞, then $n^{1/2}(\hat{\theta}_n - \theta - \tilde{\mu}) \xrightarrow{d} Z$ as n → ∞, where Z has a N(0, σ̃²) distribution.
Proof. We begin proving this result by first finding the asymptotic distribution of ∆1T(F, F̂n − F). First, note that $\{\tilde{t}(X_n|F)\}_{n=1}^{\infty}$ is a sequence of independent and identically distributed random variables with mean μ̃ and variance 0 < σ̃² < ∞. Therefore, Theorem 4.20 (Lindeberg-Lévy) implies that
\[
n^{1/2}\left[ n^{-1} \sum_{i=1}^{n} \tilde{t}(X_i|F) - \tilde{\mu} \right] \xrightarrow{d} Z, \tag{9.5}
\]
as n → ∞, where Z is a N(0, σ̃²) random variable. Now we use the fact that we have defined E1(F, F̂n) so that T(F̂n) = T(F) + ∆1T(F, F̂n − F) + E1(F, F̂n), so that it follows that
\[
n^{1/2}(\hat{\theta}_n - \theta - \tilde{\mu}) = n^{1/2}[\Delta_1 T(F, \hat{F}_n - F) - \tilde{\mu}] + n^{1/2} E_1(F, \hat{F}_n). \tag{9.6}
\]
It follows that the first term on the right hand side of Equation (9.6) converges
in distribution to Z, while the second term converges in probability to zero as
n → ∞. Therefore, Theorem 4.11 (Slutsky) implies that the sum of these two
terms converges to Z, and the result is proven.
When the functional parameter has the form studied in the previous section,
the conditions under which we obtain the result given in Theorem 9.5 greatly
simplify.
Corollary 9.1. Let $\{X_n\}_{n=1}^{\infty}$ be a sequence of independent and identically distributed random variables from a distribution F. Let θ = T(F) be a functional parameter of the form
\[
T(F) = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} h(x_1,\ldots,x_r) \prod_{i=1}^{r} dF(x_i), \tag{9.7}
\]
with estimator θ̂n = T(F̂n), where F̂n is the empirical distribution function of X1, . . . , Xn. Define μ̃ = E[t̃(Xi|F)] and σ̃² = V[t̃(Xi|F)], where
\[
\Delta_1 T(F, \hat{F}_n - F) = n^{-1} \sum_{i=1}^{n} \tilde{t}(X_i|F),
\]
for some function t̃(Xi|F). Assume that 0 < σ̃² < ∞ and that ∆1T(F, F̂n − F) is not functionally equal to zero. Then $n^{1/2}(\hat{\theta}_n - \theta - \tilde{\mu}) \xrightarrow{d} Z$ as n → ∞, where Z has a N(0, σ̃²) distribution.
Proof. We begin by noting that if the functional T has the form indicated in
Equation (9.7) then Theorem 9.1 implies that
\[
\Delta_1 T(F; \hat{F}_n - F) = T(\hat{F}_n) - T(F)
= r \int_{-\infty}^{\infty} \left( \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} h(x_1, y_1, \ldots, y_{r-1}) \prod_{i=1}^{r-1} dF(y_i) \right) d[\hat{F}_n(x_1) - F(x_1)]
= \int_{-\infty}^{\infty} \tilde{h}(x_1|F)\,d[\hat{F}_n(x_1) - F(x_1)].
\]
Now, Theorem 9.2 implies that there exists a function $\tilde{t}(x_1|F)$ such that
\[
\Delta_1 T(F; \hat{F}_n - F) = \int_{-\infty}^{\infty} \tilde{t}(x_1|F)\,d\hat{F}_n(x_1) = n^{-1} \sum_{i=1}^{n} \tilde{t}(X_i|F),
\]
which yields the form required by Theorem 9.5. Theorem 9.4 then provides
the required behavior of the error term, which then proves the result.
where $\tilde{t}(X_i|F) = (X_i - \mu_1')^2 - \theta$. Example 9.9 shows that $E_1(F,\hat{F}_n) = o_p(n^{-1/2})$ as n → ∞, so that Theorem 9.5 implies that if σ̃ is finite, then $n^{1/2}(\hat{\theta}_n - \theta - \tilde{\mu}) \xrightarrow{d} Z$ as n → ∞, where Z is a N(0, σ̃²) random variable. In this case θ̂n is the sample variance given by
\[
\hat{\theta}_n = n^{-1} \sum_{i=1}^{n} (X_i - \bar{X}_n)^2.
\]
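As a numerical illustration of this limit, the sketch below (Python with NumPy; the Exponential(1) population, sample size, and number of replications are illustrative choices) compares the Monte Carlo variance of $n^{1/2}(\hat{\theta}_n - \theta)$ with the limiting variance $\tilde{\sigma}^2 = V[(X - \mu_1')^2] = \mu_4 - \theta^2$, which equals 8 for an Exponential(1) population since the fourth central moment is $\mu_4 = 9$ and $\theta = 1$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps = 200, 5000

theta, mu4 = 1.0, 9.0  # variance and fourth central moment of an Exponential(1) population
sigma_tilde2 = mu4 - theta ** 2  # limiting variance from Theorem 9.5

x = rng.exponential(scale=1.0, size=(reps, n))
theta_hat = x.var(axis=1)  # plug-in (divisor n) sample variance
z = np.sqrt(n) * (theta_hat - theta)

print("simulated variance of n^(1/2)(theta_hat - theta):", z.var())
print("limiting variance mu_4 - theta^2:                 ", sigma_tilde2)
```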
9.6.1 Exercises
1. Let $\{X_n\}_{n=1}^{\infty}$ be a sequence of independent and identically distributed random variables from a distribution F. Let θ be defined as the pth quantile of F.
where the derivative in the integral is taken with respect to θ. Discuss any
additional assumptions that you need to make.
8. Let G ∈ F and let F be a fixed distribution from F. Consider a functional
of the form
\[
I(G) = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} t(x_1,\ldots,x_r) \prod_{k=1}^{r} d[G(x_k) - F(x_k)],
\]
for all G ∈ F as given by Theorem 9.2. Prove that, for the functions t̃(x1, . . . , xr|F) specified in the proof of Theorem 9.2,
\[
\int_{-\infty}^{\infty} \tilde{t}(x_{i_1},\ldots,x_{i_r}|F)\,dF(x_{i_k}) = 0.
\]
9.6.2 Experiments
a. N(0, 1)
b. Exponential(1)
c. Uniform(0, 1)
d. Cauchy(0, 1)
Parametric Inference
But sitting in front of him and taken by surprise by his dismissal, K. would be
able easily to infer everything he wanted from the lawyer’s face and behaviour,
even if he could not be induced to say very much.
The Trial by Franz Kafka
10.1 Introduction
Definition 10.1. Any function T that maps a sample X1 , . . . , Xn to a pa-
rameter space Ω is a point estimator of θ.
We will usually denote an estimator of a parameter θ as θ̂n = T (X1 , . . . , Xn ),
or simply as θ̂n. Note that there is nothing special about a point estimator; it is simply a function of the observed data that does not depend on θ or any other unknown quantities. The search for a good point estimator requires us
to define the types of properties that a good estimator should have. Usually
we would like our estimator to be close to θ in some respect. Let ρ be a metric
on Ω. Then we can measure the distance between θ̂n and θ as ρ(θ̂n , θ). But
ρ(θ̂n , θ) is a random variable and hence we need some way of summarizing the
behavior of ρ(θ̂n , θ). This is usually accomplished by taking the expectation
of ρ(θ̂n , θ). In decision theory the distance ρ(θ̂n , θ) is usually called the loss
associated with estimating θ with θ̂n and the function ρ is called the loss
function. The expected loss, given by R(θ̂n, θ) = E[ρ(θ̂n, θ)], is called the risk
associated with estimating θ with the estimator θ̂n .
A common loss function that is often used in practice is ρ(θ̂n, θ) = (θ̂n − θ)²,
which is called squared error loss. The associated risk, given by MSE(θ̂n ) =
E[(θ̂n − θ)2 ] is called the mean squared error, which measures the expected
square distance between θ̂n and θ. It can be shown that MSE(θ̂n ) can be
decomposed into two parts given by MSE(θ̂n ) = Bias2 (θ̂n ) + V (θ̂n ), where
Bias(θ̂n ) = E(θ̂n ) − θ is called the bias of the estimator θ̂n . See Exercise 1.
The bias of an estimator θ̂n measures the expected systematic departure of
θ̂n from θ. A special case occurs when the bias always equals zero.
Definition 10.2. An estimator θ̂n of θ is unbiased if Bias(θ̂n) = 0 for all
θ ∈ Ω.
If an estimator is unbiased then the mean squared error and the variance
of the estimator coincide. In this case the variance of the estimator can be
used alone as a measure of the quality of the estimator. The standard deviation of the estimator, called the standard error of θ̂n, is often reported as a measure of the quality of the estimator, since it is in the same units as the parameter, whereas the variance is in squared units.
An important special case in estimation theory is the case of estimating the
mean of a population that has a finite variance. Let $\{X_n\}_{n=1}^{\infty}$ be a sequence of
independent and identically distributed random variables from a distribution
F with mean θ and finite variance σ 2 . It is well known that if we estimate
θ with the sample mean X̄n then Bias(X̄n ) = 0 and V (X̄n ) = n−1 σ 2 for all
θ ∈ Ω. Suppose now that we are interested in estimating g(θ) for some real
function g. An obvious estimator of g(θ) is g(X̄n ). If g is a linear function of the
form g(x) = a+bx then the bias and variance of g(X̄n ) can be found by directly
using the properties of X̄n . That is E[g(X̄n )] = a + bE(X̄n ) = a + bθ = g(θ)
so that g(X̄n ) is an unbiased estimator of g(θ). The variance of g(X̄n ) can be
found to be V [g(X̄n )] = b2 V (X̄n ) = n−1 b2 σ 2 .
If g is not a linear function then the bias and variance of g(X̄n ) cannot be
found using only the properties of X̄n . In this case more information about
the population F must be known.
Example 10.1. Let $\{X_n\}_{n=1}^{\infty}$ be a sequence of independent and identically
distributed random variables from a distribution F with mean θ and finite
variance σ 2 . Suppose now that we are interested in estimating g(θ) = θ2 using
the estimator g(X̄n ) = X̄n2 . We first find the bias for this estimator. Taking
the expectation we have that
\[
E(\bar{X}_n^2) = E\left( n^{-2} \sum_{i=1}^{n}\sum_{j=1}^{n} X_i X_j \right)
= n^{-2} \sum_{i=1}^{n} E(X_i^2) + n^{-2} \sum_{i=1}^{n} \sum_{\substack{j=1 \\ j\neq i}}^{n} E(X_i)E(X_j)
= n^{-1}(\sigma^2 + \theta^2) + n^{-2} n(n-1)\theta^2
= \theta^2 + n^{-1}\sigma^2,
\]
so that the bias of $\bar{X}_n^2$ as an estimator of $\theta^2$ is $n^{-1}\sigma^2$.
Example 10.1 highlights the increased complexity one finds when estimating a non-linear function of θ. The fact that we are able to compute the bias and variance in the closed forms indicated is a result of the function being a sum of powers of X̄n. If we alternatively considered functions such as sin(X̄n) or exp(−X̄n²), such a direct approach would no longer be possible. However, an approximate approach can be developed for certain functions by approximating the function of interest with a linear expression obtained using a Taylor expansion from Theorem 1.13.
Example 10.2. Let $\{X_n\}_{n=1}^{\infty}$ be a sequence of independent and identically
distributed random variables from a distribution F with mean θ and finite
variance σ 2 . Suppose now that we are interested in estimating g(θ) = exp(−θ2 )
using the estimator g(X̄n) = exp(−X̄n²). Without further specific information about the form of F, simple closed form expressions for the mean and variance of exp(−X̄n²) are difficult to obtain. However, note that Theorem 1.13 implies that
we can find a Taylor expansion for the exponential function at −X̄n2 around
the point −θ2 . That is,
exp(−X̄n2 ) = exp(−θ2 ) − 2(X̄n − θ)θ exp(−θ2 ) + E2 (X̄n , θ),
where E2 (X̄n , θ) = (X̄n − θ)2 (2ξ 2 − 1) exp(−ξ 2 ), where ξ is a random variable
that is always between X̄n and θ with probability one. Taking expectations,
we find that
E[exp(−X̄n2 )] = exp(−θ2 ) − 2θE(X̄n − θ) exp(−θ2 ) + E[E2 (X̄n , θ)]
= exp(−θ2 ) + E[(X̄n − θ)2 (2ξ 2 − 1) exp(−ξ 2 )],
since E(X̄n − θ) = 0. Therefore, the bias of exp(−X̄n2 ) as an estimator of
exp(−θ2 ) is given by E[(X̄n − θ)2 (2ξ 2 − 1) exp(−ξ 2 )]. The expectation of the
error can be troublesome to compute due to the random variable ξ unless
we happen to note in this case that the function (2ξ 2 − 1) exp(−ξ 2 ) is a
bounded function so that there exists a finite real value m such that |(2ξ 2 −
1) exp(−ξ 2 )| < m for all ξ ∈ R. Therefore, it follows that
\[
E[g(\bar{X}_n)] = g(\theta) + \tfrac{1}{2} n^{-1}\sigma^2 g''(\theta) + \tfrac{1}{6} E[(\bar{X}_n - \theta)^3] g'''(\theta) + \tfrac{1}{24} E[(\bar{X}_n - \theta)^4 g''''(\xi)], \tag{10.6}
\]
since E(X̄n − θ) = 0 and E[(X̄n − θ)2 ] = n−1 σ 2 . Note that we cannot factor
g 0000 (ξ) out of the expectation in Equation (10.6) because ξ is a random variable
due to the fact that ξ is always between θ and X̄n . For the third term in
Equation (10.6), we note that
\[
E[(\bar{X}_n - \theta)^3] = n^{-3} E\left[ \sum_{i=1}^{n} (X_i - \theta) \right]^3
= n^{-3} \sum_{i=1}^{n} E[(X_i - \theta)^3] + 3 n^{-3} \sum_{i=1}^{n} \sum_{\substack{j=1 \\ j \neq i}}^{n} E[(X_i - \theta)^2 (X_j - \theta)]
+ n^{-3} \sum_{i=1}^{n} \sum_{\substack{j=1 \\ j \neq i}}^{n} \sum_{\substack{k=1 \\ k \neq i,\, k \neq j}}^{n} E[(X_i - \theta)(X_j - \theta)(X_k - \theta)]. \tag{10.7}
\]
The second and third terms on the right hand side of Equation (10.7) are zero due to independence and the fact that E(Xi − θ) = 0. Therefore, it follows that E[(X̄n − θ)³] = n⁻²E[(Xi − θ)³] = O(n⁻²), as n → ∞. By assumption, there exists a bound m ∈ R such that g''''(t) ≤ m for all t ∈ R. Therefore, it follows that g''''(ξ)(X̄n − θ)⁴ ≤ m(X̄n − θ)⁴, with probability one. Hence Theorem A.16 implies that $E[\tfrac{1}{24} g''''(\xi)(\bar{X}_n - \theta)^4] \leq \tfrac{1}{24} m E[(\bar{X}_n - \theta)^4] < \infty$,
since we have assumed that the fourth moment is finite. To obtain the rate of
convergence for the error term we note that
\[
E[(\bar{X}_n - \theta)^4] = E\left[ n^{-4} \left( \sum_{i=1}^{n} (X_i - \theta) \right)^4 \right]
= n^{-4} \sum_{i=1}^{n} E[(X_i - \theta)^4]
+ 4 n^{-4} \sum_{i=1}^{n} \sum_{\substack{j=1 \\ j \neq i}}^{n} E[(X_i - \theta)^3 (X_j - \theta)]
+ 3 n^{-4} \sum_{i=1}^{n} \sum_{\substack{j=1 \\ j \neq i}}^{n} E[(X_i - \theta)^2 (X_j - \theta)^2]
+ 6 n^{-4} \sum_{i=1}^{n} \sum_{\substack{j=1 \\ j \neq i}}^{n} \sum_{\substack{k=1 \\ k \neq i,\, k \neq j}}^{n} E[(X_i - \theta)^2 (X_j - \theta)(X_k - \theta)]
+ n^{-4} \sum_{i=1}^{n} \sum_{\substack{j=1 \\ j \neq i}}^{n} \sum_{\substack{k=1 \\ k \neq i,\, k \neq j}}^{n} \sum_{\substack{l=1 \\ l \neq i,\, l \neq j,\, l \neq k}}^{n} E[(X_i - \theta)(X_j - \theta)(X_k - \theta)(X_l - \theta)].
\]
Using similar arguments to those used for the third moment we find that
E[(X̄n − θ)4 ] = n−3 E[(Xn − θ)4 ] + 3n−3 (n − 1)E[(Xn − θ)2 (Xm − θ)2 ] so that
it follows that E[(X̄n − θ)4 ] = O(n−2 ) as n → ∞. Therefore, it follows that
\[
E[g(\bar{X}_n)] = g(\theta) + \tfrac{1}{2} n^{-1}\sigma^2 g''(\theta) + O(n^{-2}),
\]
as n → ∞.
Example 10.3. Let X1 , . . . , Xn be a set of independent and identically dis-
tributed random variables from a distribution F with mean θ and finite vari-
ance σ 2 . In Example 10.1 we considered estimating θ2 using the estimator X̄n2 .
Taking g(t) = t² we have that g′(t) = 2t, g″(t) = 2, and g⁽ᵏ⁾(t) = 0 for all k ≥ 3. Therefore, we can apply Theorem 10.1, assuming that the fourth
moment of F is finite, to find E(X̄n2 ) = θ2 + n−1 σ 2 + O(n−2 ), as n → ∞. This
compares exactly with the result of Example 10.1 with the exception of the error term, which in this case is identically zero because the derivatives of order higher than two are zero for the function g(t) = t².
For the variance, we consider derivatives of the function h(t) = g 2 (t) = t4 ,
where we have that h0 (t) = 4t3 , h00 (t) = 12t2 , h000 (t) = 24t and h0000 (t) = 24.
The fourth derivative of h is bounded and therefore we can apply Theorem 10.1 to find that V(X̄n²) = 4n⁻¹θ²σ² + O(n⁻²), as n → ∞, which matches the variance expansion found in Example 10.1. The error term in this case
is not identically zero as we encountered non-zero terms of order O(n−2 ) in
Example 10.1. This is also indicated by the fact that the fourth derivative of
h is not zero.
Example 10.4. Consider the framework presented in Example 10.2 where $\{X_n\}_{n=1}^{\infty}$ is a sequence of independent and identically distributed random vari-
ables from a distribution F with mean θ and finite variance σ 2 . We are inter-
ested in estimating g(θ) = exp(−θ2 ) using the estimator g(X̄n ) = exp(−X̄n2 ).
If we assume that F has a finite fourth moment then the assumptions of
Theorem 10.1 hold and we have that
E[exp(−X̄n2 )] = exp(−θ2 ) − n−1 σ 2 (1 − 2θ2 ) exp(−θ2 ) + O(n−2 ),
and
V [exp(−X̄n2 )] = 4n−1 σ 2 θ2 exp(−2θ2 ) + O(n−2 ),
as n → ∞. We can compare this result to the result supplied by Theorem 6.3, which implies that $n^{1/2}(2\theta\sigma)^{-1}\exp(\theta^2)[\exp(-\bar{X}_n^2) - \exp(-\theta^2)] \xrightarrow{d} Z$, as n → ∞, where Z is a N(0, 1) random variable. Note that the asymptotic variance
given by Theorem 6.3 matches the first term of the asymptotic expansion for
the variance given in Theorem 10.1.
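These expansions are simple to check by simulation. The sketch below (Python with NumPy; the N(θ, σ²) population with θ = 0.5 and σ = 1, the sample size, and the number of replications are illustrative choices) compares the Monte Carlo bias and variance of exp(−X̄n²) with the leading terms given above.

```python
import numpy as np

rng = np.random.default_rng(2)
theta, sigma, n, reps = 0.5, 1.0, 100, 50000

x_bar = rng.normal(theta, sigma, size=(reps, n)).mean(axis=1)
est = np.exp(-x_bar ** 2)

g = np.exp(-theta ** 2)
bias_term = -(sigma ** 2 / n) * (1 - 2 * theta ** 2) * g               # leading bias term
var_term = 4 * sigma ** 2 * theta ** 2 * np.exp(-2 * theta ** 2) / n   # leading variance term

print("simulated bias:    ", est.mean() - g, "  leading term:", bias_term)
print("simulated variance:", est.var(), "  leading term:", var_term)
```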
While the result of Theorem 10.1 is general in the sense that the distribution F need not be specified, the assumption on the boundedness of the fourth derivative of g will often be violated in practice. This does not mean that no asymptotic result of this kind can be obtained, but such results will probably rely on methods more specific to the problem considered.
Example 10.5. Let $\{X_n\}_{n=1}^{\infty}$ be a sequence of independent and identically
distributed random variables following a N(θ, σ 2 ) distribution and consider
estimating exp(θ) with exp(X̄n ). We are not able to apply Theorem 10.1
directly to this case because the fourth derivative of exp(t), which is still
exp(t), is not a bounded function. If we attempt to apply Theorem 1.13 to this problem we will end up with an error term of the form $\tfrac{1}{24}(\bar{X}_n - \theta)^4 \exp(\xi)$,
where ξ is a random variable that is always between θ and X̄n . This type of
error term is difficult to deal with in this case because the exponential function
is not bounded and hence we cannot directly bound the expectation as we did
in Example 10.2 and the proof of Theorem 10.1. Instead we follow the approach
of Lehmann (1999) and use the convergent Taylor series for the exponential
function instead. That is,
\[
\exp(\bar{X}_n) = \sum_{i=0}^{\infty} \frac{\exp(\theta)(\bar{X}_n - \theta)^i}{i!}. \tag{10.8}
\]
We will assume that it is permissible in this case to exchange the expectation
and the infinite sum, so that taking the expectation of both sides of Equation
(10.8) yields
\[
E[\exp(\bar{X}_n)] = \sum_{i=0}^{\infty} \frac{\exp(\theta) E[(\bar{X}_n - \theta)^i]}{i!} = \exp(\theta) \sum_{i=0}^{\infty} \frac{E[(\bar{X}_n - \theta)^{2i}]}{(2i)!}. \tag{10.9}
\]
The second equality in Equation (10.9) is due to the fact that X̄n − θ has a
N(0, n−1 σ 2 ) distribution, whose odd moments are zero. Hence it follows that
we must evaluate
\[
E[(\bar{X}_n - \theta)^{2i}] = \left( \frac{n}{2\pi\sigma^2} \right)^{1/2} \int_{-\infty}^{\infty} t^{2i} \exp(-\tfrac{1}{2} n\sigma^{-2} t^2)\,dt.
\]
This integral can be evaluated in terms of the gamma function, where
\[
\Gamma(i + \tfrac{1}{2}) = (i - \tfrac{1}{2})(i - \tfrac{3}{2}) \cdots \tfrac{3}{2}\cdot\tfrac{1}{2}\,\Gamma(\tfrac{1}{2})
= 2^{-i}(2i-1)(2i-3)\cdots(3)(1)\,\pi^{1/2}
= \frac{(2i)!\,\pi^{1/2}}{2^{i}(2i)(2i-2)\cdots(4)(2)}
= (2i)!\,\pi^{1/2}\,2^{-2i}(i!)^{-1}.
\]
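Carrying the series through, each term equals $\exp(\theta)[\sigma^2/(2n)]^i/i!$, so that $E[\exp(\bar{X}_n)] = \exp[\theta + \sigma^2/(2n)]$; this is also the moment generating function of the N(θ, n⁻¹σ²) distribution of X̄n evaluated at one. The sketch below (Python with NumPy; the values of θ, σ, and n are illustrative choices) checks this expression against a Monte Carlo average.

```python
import numpy as np

rng = np.random.default_rng(3)
theta, sigma, n, reps = 0.3, 1.5, 50, 100000

x_bar = rng.normal(theta, sigma, size=(reps, n)).mean(axis=1)

print("Monte Carlo E[exp(X_bar)]:   ", np.exp(x_bar).mean())
print("exp(theta + sigma^2 / (2n)):", np.exp(theta + sigma ** 2 / (2 * n)))
```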
It is noteworthy that not all functions of the sample mean have nice proper-
ties, even when the population is normal. For example, when X1 , . . . , Xn are
independent and identically distributed N(θ, 1) random variables, $n^{1/2}(\bar{X}_n^{-1} - \theta^{-1}) \xrightarrow{d} Z$ as n → ∞, where Z is a N(0, θ⁻⁴) random variable, but E(X̄n⁻¹) does not exist for any n ∈ N. See Example 4.2.4 of Lehmann (1999) and
Lehmann and Shaffer (1988) for further details.
If we consider the case where θ̂n is an estimator of a parameter θ with an
expansion for its expectation of the form E(θ̂n ) = θ + n−1 b + O(n−2 ) as
n → ∞ where b is a real constant, then it follows that the bias of θ̂n is
O(n−1 ), and hence the square bias of θ̂n is O(n−2 ), as n → ∞. If the variance
of θ̂n has an expansion of the form V (θ̂n ) = n−1 v + O(n−2 ) as n → ∞ where
v is a real constant, then it follows that the mean squared error of θ̂n has the
expansion MSE(θ̂n ) = n−1 v + O(n−2 ) as n → ∞. Therefore, it is the constant
v that is the important factor in determining the asymptotic performance of
θ̂n , under these assumptions on the form of the bias and variance.
Now suppose that θ̃n is another estimator of θ, and that θ̃n has similar prop-
erties to θ̂n in the sense that the mean squared error for θ̃n has asymptotic
expansion MSE(θ̃n ) = n−1 w + O(n−2 ) as n → ∞, where w is a real con-
stant. If we wish to compare the performance of these two estimators from an
asymptotic viewpoint it follows that we need only compare the constants v
and w.
Definition 10.3. Let θ̂n and θ̃n be two estimators of a parameter θ such
that MSE(θ̂n ) = n−1 v + O(n−2 ) and MSE(θ̃n ) = n−1 w + O(n−2 ) as n → ∞
where v and w are real constants. Then the asymptotic relative efficiency of
θ̂n relative to θ̃n is given by ARE(θ̂n , θ̃n ) = wv −1 .
The original motivation for the form of the asymptotic relative efficiency comes
from comparing the sample sizes required for each estimator to have the same
mean squared error. From the asymptotic viewpoint we would require sample
sizes n and m so that n−1 v = m−1 w, with the asymptotic relative efficiency
of θ̂n relative to θ̃n is given by ARE(θ̂n , θ̃n ) = mn−1 . However, note that
if n−1 v = m−1 w then it follows that wv −1 = mn−1 , yielding the form of
Definition 10.3.
Example 10.6. Let $\{X_n\}_{n=1}^{\infty}$ be a sequence of independent and identically
distributed random variables from a distribution F with finite variance σ 2
and continuous density f . We will assume that F has a unique median θ
such that f (θ) > 0. In the case where f is symmetric about θ we have two
immediate possible choices for estimating θ given by the sample mean θ̂n =
X̄n and the sample median θ̃n. It is known that V(θ̂n) = n⁻¹σ², and it follows from Corollary 4.4 that the leading term for the variance of θ̃n is $\tfrac{1}{4} n^{-1}[f(\theta)]^{-2}$. Therefore, Definition 10.3 implies that ARE(θ̂n, θ̃n) = $\tfrac{1}{4}[\sigma f(\theta)]^{-2}$.
If F corresponds to a N(θ, σ²) distribution then f(θ) = (2πσ²)⁻¹ᐟ² and we have that ARE(θ̂n, θ̃n) = ½π ≈ 1.5708, which indicates that the sample mean is about one and one half times more efficient than the sample median.
If F corresponds to a T(ν) distribution where ν > 2 we have that θ = 0, σ² = ν/(ν − 2), and
\[
f(\theta) = \frac{\Gamma(\tfrac{\nu+1}{2})}{(\pi\nu)^{1/2}\,\Gamma(\tfrac{\nu}{2})}.
\]
Therefore, it follows that
\[
\mathrm{ARE}(\hat{\theta}_n, \tilde{\theta}_n) = \frac{\pi(\nu - 2)\,\Gamma^2(\tfrac{\nu}{2})}{4\,\Gamma^2(\tfrac{\nu+1}{2})}.
\]
The values of ARE(θ̂n , θ̃n ) for ν = 3, 4, 5, 10, 25, 50, and 100 are given in Table
10.1. From Table 10.1 one can observe that for heavy tailed T(ν) distributions,
the sample median is more efficient than the sample mean. In these cases the
variance of the sample mean is increased by the high likelihood of observing
outliers in samples. However, as the degrees of freedom increase, the median
becomes less efficient so that when ν = 5 the sample median and the sample
mean are almost equally efficient from an asymptotic viewpoint. From this
point on, the sample mean becomes more efficient. In the limit we find that ARE(θ̂n, θ̃n) approaches the value associated with the normal distribution given by ½π.
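The entries of Table 10.1 below follow directly from this expression, as the short sketch below verifies (Python with SciPy's gamma function; nothing is assumed beyond the formula just derived).

```python
import math
from scipy.special import gamma

def are_mean_vs_median(nu):
    """ARE of the sample mean relative to the sample median for a T(nu) population."""
    return math.pi * (nu - 2) * gamma(nu / 2) ** 2 / (4 * gamma((nu + 1) / 2) ** 2)

for nu in [3, 4, 5, 10, 25, 50, 100]:
    print(nu, round(are_mean_vs_median(nu), 4))
```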
Example 10.7. Let B1 , . . . , Bn be a set of independent and identically dis-
tributed Bernoulli(θ) random variables, where the success probability θ is
the parameter of interest. We will also assume that the parameter space is
Ω = (0, 1). The usual unbiased estimator of θ is the sample mean θ̂n = B̄n ,
which corresponds to the proportion of successes observed in the sample. The
properties of the sample mean imply that this estimator is unbiased with
variance n−1 θ(1 − θ). When θ is small the estimator θ̂n is often considered
unsatisfactory because θ̂n will be equal to zero with a large probability. For
example, calculations based on the Binomial(n, θ) distribution can be used
to show that if n = 100 and θ = 0.001 then P (θ̂n = 0) = 0.9048. Since zero
is not in the parameter space of θ, this may not be considered a reasonable
estimator of θ. An alternative approach to estimating θ in this case is based
on adding in one success and one failure to the observed sample. That is, we
consider the alternative estimator
\[
\tilde{\theta}_n = (n+2)^{-1}\left( \sum_{i=1}^{n} B_i + 1 \right) = (1 + 2n^{-1})^{-1}(\bar{B}_n + n^{-1}).
\]
Table 10.1 The asymptotic relative efficiency of the sample mean relative to the
sample median when the population follows a T(ν) distribution.
ν 3 4 5 10 25 50 100
ARE(θ̂n , θ̃n ) 0.617 0.889 1.041 1.321 1.4743 1.5231 1.5471
Therefore, from an asymptotic viewpoint, the estimators have the same effi-
ciency.
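The behavior described in Example 10.7 can be verified directly. The sketch below (Python with SciPy; n = 100 and θ = 0.001 are the values quoted in the example, while the exact mean squared error comparison is an added illustration) computes P(θ̂n = 0) and the exact mean squared errors of the two estimators over the Binomial(n, θ) support.

```python
from scipy.stats import binom

n, theta = 100, 0.001
print("P(theta_hat = 0):", binom.pmf(0, n, theta))  # approximately 0.9048

# Exact MSE comparison of the two estimators over the Binomial(n, theta) support
support = range(n + 1)
pmf = [binom.pmf(y, n, theta) for y in support]
mse_hat = sum(p * (y / n - theta) ** 2 for y, p in zip(support, pmf))
mse_tilde = sum(p * ((y + 1) / (n + 2) - theta) ** 2 for y, p in zip(support, pmf))
print("MSE of theta_hat:  ", mse_hat)
print("MSE of theta_tilde:", mse_tilde)
```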
with respect to θ can be obtained by exchanging the derivative and the in-
tegral.
5. The first two derivatives of log[f (x|θ)] with respect to θ exist for all x ∈ R
and θ ∈ Ω.
Then V (θ̂n ) ≥ [nI(θ)]−1 where
\[
I(\theta) = V\left\{ \frac{\partial}{\partial\theta} \log[f(X_n|\theta)] \right\}.
\]
The development of this result can be found in Section 2.6 of Lehmann and
Casella (1998). The value I(θ) is called the Fisher information number. Noting
that
\[
\frac{\partial}{\partial\theta} \log[f(x|\theta)] = \frac{f'(x|\theta)}{f(x|\theta)},
\]
it then follows that the random variable within the expectation measures the
relative rate of change of f (x|θ) with respect to changes in θ. If this rate of
change is large, then samples with various values of θ will be easily distin-
guished from one another and θ will be easier to estimate. In this case the
bound on the variance will be small. If this rate of change is small then the
parameter is more difficult to estimate and the variance bound will be larger.
Several alternate expressions are available for I(θ) under various assumptions.
To develop some of these, let X be a random variable that follows the distri-
bution F (x|θ) and note that
\[
V\left\{ \frac{\partial}{\partial\theta} \log[f(X|\theta)] \right\} = V\left[ \frac{f'(X|\theta)}{f(X|\theta)} \right]
= E\left\{ \left[ \frac{f'(X|\theta)}{f(X|\theta)} \right]^2 \right\} - E^2\left[ \frac{f'(X|\theta)}{f(X|\theta)} \right]. \tag{10.11}
\]
Evaluating the second term on the right hand side of Equation (10.11) yields
\[
E\left[ \frac{f'(X|\theta)}{f(X|\theta)} \right] = \int_{-\infty}^{\infty} \frac{f'(x|\theta)}{f(x|\theta)} f(x|\theta)\,dx = \int_{-\infty}^{\infty} f'(x|\theta)\,dx, \tag{10.12}
\]
and hence,
\[
\frac{\partial}{\partial\theta} \int_{-\infty}^{\infty} f(x|\theta)\,dx = 0.
\]
If we can exchange the partial derivative and the integral, then it follows that
\[
\int_{-\infty}^{\infty} f'(x|\theta)\,dx = 0. \tag{10.13}
\]
which depends on the unknown parameter µ. Note however that the bound is
attained asymptotically since
\[
\lim_{n\to\infty} \mathrm{ARE}(\hat{\theta}_n, \tilde{\theta}_n) = \lim_{n\to\infty} \frac{2n^{-1}\theta^2}{2(n-1)^{-1}\theta^2} = 1.
\]
and
\[
\lim_{n\to\infty} \frac{\partial}{\partial\theta} \mathrm{Bias}(\hat{\theta}_n) = 0,
\]
which would avoid super-efficient estimators. Another assumption that can
avoid this difficulty is to require v(θ) to be a continuous function in θ. Still
another possibility suggested by Rao (1963) and Wolfowitz (1965) is to require
the weak convergence on n1/2 (θ̂n − θ) to Z to be uniform in θ. Further results
are proven by Pfanzagl (1970).
We now consider a specific methodology for obtaining estimators that is ap-
plicable to many types of problems: maximum likelihood estimation. Under
specific conditions, we will be able to show that this method provides asymp-
totically optimal estimators that are also asymptotically Normal. If we ob-
serve X1 , . . . , Xn , a set of independent and identically distributed random
variables from a distribution F(x|θ), then the joint density of the observed sample is given by
\[
f(x_1,\ldots,x_n|\theta) = \prod_{i=1}^{n} f(x_i|\theta),
\]
where θ ∈ Ω. We have assumed that F (x|θ) has density f (x|θ) and that the
observed random variables are continuous. In the discrete case the density
f (x|θ) is replaced by the probability distribution function associated with
F (x|θ). The likelihood function considers the joint density f (x1 , . . . , xn |θ) as
a function of θ where the observed sample is taken to be fixed. That is
\[
L(\theta|x_1,\ldots,x_n) = \prod_{i=1}^{n} f(x_i|\theta).
\]
For simpler notation we will often use L(θ|x) in place of L(θ|x1 , . . . , xn ) where
x0 = (x1 , . . . , xn ). The maximum likelihood estimators of θ are taken to be the
points that maximize the function L(θ|x1 , . . . , xn ) with respect to θ. That is,
θ̂n is a maximum likelihood estimator of θ if
\[
L(\hat{\theta}_n|x_1,\ldots,x_n) = \sup_{\theta\in\Omega} L(\theta|x_1,\ldots,x_n).
\]
Assuming that L(θ|x1 , . . . , xn ) has at least two derivatives and that the pa-
rameter space of θ is Ω, candidates for the maximum likelihood estimator of
θ have the properties
\[
\frac{d}{d\theta} L(\theta|x_1,\ldots,x_n)\bigg|_{\theta=\hat{\theta}_n} = 0,
\]
and
\[
\frac{d^2}{d\theta^2} L(\theta|x_1,\ldots,x_n)\bigg|_{\theta=\hat{\theta}_n} < 0.
\]
Other candidates for a maximum likelihood estimator are the points on the
boundary of the parameter space. The maximum likelihood estimators are
the candidates for which the likelihood is the largest. Therefore, maximum
likelihood estimators may not be unique and hence there may be two or more
values of θ̂n that all maximize the likelihood.
One must be careful when interpreting a maximum likelihood estimator. A
maximum likelihood estimator is not the most likely value of θ given the
observed data. Rather, a maximum likelihood estimator is a value of θ which
has the largest probability of generating the data when the distribution is
discrete. In the continuous case a maximum likelihood estimator is a value of
θ for which the joint density of the sample is greatest.
In many cases the distribution of interest often contains forms that may not
be simple to differentiate after the product is taken to form the likelihood
function. In these cases the problem can be simplified by taking the natural
logarithm of the likelihood function. The resulting function is often called
the log-likelihood function. Note that because the natural logarithm function
is monotonic, the points that maximize L(θ|x1 , . . . , xn ) will also maximize
l(θ) = log[L(θ|x1 , . . . , xn )]. Therefore, a maximum likelihood estimator of θ
can be defined as the value θ̂n such that
\[
l(\hat{\theta}_n) = \sup_{\theta\in\Omega} l(\theta).
\]
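When the likelihood equations do not have a closed form solution, the log-likelihood can be maximized numerically. The sketch below (Python with NumPy and SciPy; the Gamma model, the simulated sample, and the use of scipy.optimize.minimize with the Nelder-Mead method are illustrative choices and are not drawn from the text) obtains a maximum likelihood estimate by minimizing the negative log-likelihood.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import gamma

rng = np.random.default_rng(4)
x = rng.gamma(shape=2.0, scale=3.0, size=200)  # simulated sample from a Gamma(2, 3) population

def neg_log_likelihood(params):
    shape, scale = params
    if shape <= 0 or scale <= 0:
        return np.inf  # stay inside the parameter space
    return -np.sum(gamma.logpdf(x, a=shape, scale=scale))

result = minimize(neg_log_likelihood, x0=[1.0, 1.0], method="Nelder-Mead")
print("maximum likelihood estimates (shape, scale):", result.x)
```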
so that
\[
\frac{d^2}{d\theta^2} l(\theta)\bigg|_{\theta=\hat{\theta}_n} = n\bar{x}_n^{-2} - 2n\bar{x}_n^{-2} = -n\bar{x}_n^{-2} < 0,
\]
since x̄n will be positive with probability one. It follows that x̄n is a local
maximum. We need only now check the boundary points of Ω = (0, ∞). Noting
that
\[
\lim_{\theta\to 0} \theta^{-n} \exp\left( -\theta^{-1} \sum_{k=1}^{n} x_k \right) = \lim_{\theta\to\infty} \theta^{-n} \exp\left( -\theta^{-1} \sum_{k=1}^{n} x_k \right) = 0,
\]
we need only show that L(x̄n |X1 , . . . , Xn ) > 0 to conclude that x̄n is the
maximum likelihood estimator of θ. To see this, note that
\[
L(\bar{x}_n|X_1,\ldots,X_n) = \bar{x}_n^{-n} \exp\left( -\bar{x}_n^{-1} \sum_{k=1}^{n} x_k \right) = \bar{x}_n^{-n} \exp(-n) > 0.
\]
When taken as a function of θ, note that δ{xk ; (0, θ)} is zero unless θ > xk .
Therefore, the product on the right hand side of Equation (10.16) is zero
unless θ > xk for all k ∈ {1, . . . , n}, or equivalently if θ > x(n) , where x(n)
is the largest value in the sample. Therefore, the likelihood function has the
form
\[
L(\theta|x_1,\ldots,x_n) = \begin{cases} 0, & \theta \leq x_{(n)}, \\ \theta^{-n}, & \theta > x_{(n)}. \end{cases}
\]
It then follows that the likelihood function is maximized at θ̂n = x(n).
See Figure 10.1.
Maximum likelihood estimators have many useful properties. For example,
maximum likelihood estimators have an invariance property that guarantees
that the maximum likelihood estimator of a function of a parameter is the
same function of the maximum likelihood estimator of the parameter. See
Theorem 7.2.10 of Casella and Berger (2002). In this section we will establish
some asymptotic properties of maximum likelihood estimators. In particular,
we will establish conditions under which maximum likelihood estimators are
consistent and asymptotically efficient.
The main impediment to establishing a coherent asymptotic theory for max-
imum likelihood estimators is that the derivative of the likelihood, or log-
likelihood, function may have multiple roots. This can cause problems for
consistency, for example, since the maximum likelihood estimator may jump
from root to root as new observations are obtained from the population. Be-
cause we intend to provide the reader with an overview of this subject we
will concentrate on problems that have a single unique root. A detailed ac-
count of the case where there are multiple roots can be found in Section 6.4
of Lehmann and Casella (1998).
1. The parameter θ is identifiable. That is, f (x|θ) is distinct for each value of
θ in Ω.
2. The set {x ∈ R : f (x|θ) > 0} is the same for all θ ∈ Ω.
for all θ ∈ Ω \ θ0 .
Proof. We begin by noting that L(θ0 |X1 , . . . , Xn ) > L(θ|X1 , . . . , Xn ) will oc-
cur if and only if
\[
\prod_{i=1}^{n} f(X_i|\theta_0) > \prod_{i=1}^{n} f(X_i|\theta),
\]
which is equivalent to the condition that
\[
\left[ \prod_{i=1}^{n} f(X_i|\theta_0) \right]^{-1} \prod_{i=1}^{n} f(X_i|\theta) < 1.
\]
Taking the logarithm of this last expression and dividing by n implies that
\[
n^{-1} \sum_{i=1}^{n} \log\{ f(X_i|\theta)[f(X_i|\theta_0)]^{-1} \} < 0.
\]
Therefore, it follows that log[E{f (X1 |θ)[f (X1 |θ0 )]−1 }] = 0, and hence Equa-
tion (10.18) implies that E[log{f (X1 |θ)[f (X1 |θ0 )]−1 }] < 0. Therefore, we have
from Equation (10.17) that
\[
n^{-1} \sum_{i=1}^{n} \log\{ f(X_i|\theta)[f(X_i|\theta_0)]^{-1} \} \xrightarrow{p} c,
\]
which is equivalent to
\[
\lim_{n\to\infty} P[L(\theta_0|X_1,\ldots,X_n) > L(\theta|X_1,\ldots,X_n)] = 1.
\]
1. The parameter θ is identifiable. That is, f (x|θ) is distinct for each value of
θ in Ω.
2. The set {x ∈ R : f (x|θ) > 0} is the same for all θ ∈ Ω.
3. The parameter space Ω contains an open interval W where the true value
of θ is an interior point.
4. For almost all x ∈ R, f (x|θ) is differentiable with respect to θ in W .
5. The equation
\[
\frac{\partial}{\partial\theta} L(\theta|X) = 0, \tag{10.19}
\]
has a single unique root for each n ∈ N and all X ∈ Rn .
Then, with probability one, if θ̂n is the root of Equation (10.19), then $\hat{\theta}_n \xrightarrow{p} \theta$, as n → ∞.
since θ0 − δ and θ0 + δ are not the true values of θ. Note that if x ∈ Gn (δ) then
it follows that there must be a local maximum in the interval (θ0 − δ, θ0 + δ)
since L(θ0 |X) > L(θ0 − δ|x) and L(θ0 |X) > L(θ0 + δ|x). Condition 4 implies
that there always exists θ̂n ∈ (θ0 − δ, θ0 + δ) such that L0 (θ̂n |X) = 0. Hence
Equation (10.20) implies that there is a sequence of roots $\{\hat{\theta}_n\}_{n=1}^{\infty}$ such that
For further details on the result of Theorem 10.4 when there are multiple
roots, see Section 6.4 of Lehmann and Casella (1998).
Example 10.12. Suppose that X1 , . . . , Xn is a set of independent and iden-
tically distributed random variables having an Exponential(θ) distribution.
In Example 10.10, the sample mean was shown to be the unique maximum
likelihood estimator of θ. Assuming that θ is an interior point of the parame-
ter space Ω = (0, ∞), the properties of Theorem 10.4 are satisfied and we can
conclude that $\bar{X}_n \xrightarrow{p} \theta$ as n → ∞. This result can also be found directly using
Theorem 3.10.
where
\[
S_i^2 = k^{-1} \sum_{j=1}^{k} (X_{ij} - \bar{X}_i)^2,
\]
for i = 1, . . . , n, which is the sample variance computed on Xi1 , . . . , Xik . Define
Ci = kθ−1 Si2 for i = 1, . . . , n and note that Ci has a ChiSquared(k − 1)
distribution for all i = 1, . . . , n. Then
\[
\hat{\theta}_n = n^{-1} \sum_{i=1}^{n} S_i^2 = (nk)^{-1} \theta \sum_{i=1}^{n} C_i.
\]
1. Ω is an open interval.
2. The set {x : f (x|θ) > 0} is the same for all θ ∈ Ω.
3. The density f (x|θ) has three continuous derivatives with respect to θ for
each x ∈ {x : f (x|θ) > 0}.
4. The integral
\[
\int_{-\infty}^{\infty} f(x|\theta)\,dx,
\]
can be differentiated three times by exchanging the integral and the deriva-
tives.
5. The Fisher information number I(θ0 ) is defined, positive, and finite.
6. For any θ0 ∈ Ω there exists a positive constant d and function B(x) such
that
\[
\left| \frac{\partial^3}{\partial\theta^3} \log[f(x|\theta)] \right| \leq B(x),
\]
for all x ∈ {x : f (x|θ) > 0} and θ ∈ [θ0 −d, θ0 +d] such that E[B(X1 )] < ∞.
7. There is a unique maximum likelihood estimator θ̂n for each n ∈ N and
θ ∈ Ω.
Then $n^{1/2}(\hat{\theta}_n - \theta) \xrightarrow{d} Z$ as n → ∞, where Z has a N[0, I⁻¹(θ₀)] distribution.
where the derivative indicated by f 0 (Xi |θ) is taken with respect to θ. Now
apply Theorem 1.13 (Taylor) to l0 (θ|X) to expand l0 (θ̂n |X) about a point
θ0 ∈ Ω as
\[
l'(\hat{\theta}_n|X) = l'(\theta_0|X) + (\hat{\theta}_n - \theta_0) l''(\theta_0|X) + \tfrac{1}{2}(\hat{\theta}_n - \theta_0)^2 l'''(\xi_n|X),
\]
where ξn is a random variable that is between θ0 and θ̂n with probability one.
Because θ̂n is the unique root of l0 (θ|X), it follows that
\[
l'(\theta_0|X) + (\hat{\theta}_n - \theta_0) l''(\theta_0|X) + \tfrac{1}{2}(\hat{\theta}_n - \theta_0)^2 l'''(\xi_n|X) = 0,
\]
or equivalently
\[
n^{1/2}(\hat{\theta}_n - \theta_0) = \frac{n^{-1/2}\, l'(\theta_0|X)}{-n^{-1} l''(\theta_0|X) - \tfrac{1}{2} n^{-1} (\hat{\theta}_n - \theta_0)\, l'''(\xi_n|X)}. \tag{10.22}
\]
The remainder of the proof is based on analyzing the asymptotic behavior of
each of the terms on the right hand side of Equation (10.22). We first note
that
\[
n^{-1/2} l'(\theta_0|X) = n^{-1/2} \sum_{i=1}^{n} \frac{f'(X_i|\theta_0)}{f(X_i|\theta_0)}.
\]
Under the assumption that it is permissible to exchange the partial derivative
and the integral, Equation (10.13) implies that
\[
\int_{-\infty}^{\infty} f'(x|\theta)\,dx = 0,
\]
and hence
\[
E\left[ \frac{f'(X_i|\theta_0)}{f(X_i|\theta_0)} \right] = 0.
\]
Therefore,
\[
n^{-1/2} l'(\theta_0|X) = n^{1/2}\left\{ n^{-1} \sum_{i=1}^{n} \frac{f'(X_i|\theta_0)}{f(X_i|\theta_0)} - E\left[ \frac{f'(X_i|\theta_0)}{f(X_i|\theta_0)} \right] \right\}.
\]
We can apply Theorem 4.20 (Lindeberg and Lévy) and Theorem 10.2 (Cramér and Rao) to this last expression to find that $n^{-1/2} l'(\theta_0|X) \xrightarrow{d} Z$ as n → ∞, where Z is a N[0, I(θ₀)] random variable. We now consider the term −n⁻¹l″(θ₀). First note that
\begin{align*}
-n^{-1} l''(\theta_0) &= -n^{-1} \frac{\partial^2}{\partial\theta^2} \log[L(\theta|X)] \bigg|_{\theta=\theta_0}
= -n^{-1} \sum_{i=1}^{n} \frac{\partial^2}{\partial\theta^2} \log[f(X_i|\theta)] \bigg|_{\theta=\theta_0} \\
&= -n^{-1} \frac{\partial}{\partial\theta} \sum_{i=1}^{n} \frac{f'(X_i|\theta)}{f(X_i|\theta)} \bigg|_{\theta=\theta_0}
= -n^{-1} \sum_{i=1}^{n} \frac{f''(X_i|\theta_0)}{f(X_i|\theta_0)} + n^{-1} \sum_{i=1}^{n} \frac{[f'(X_i|\theta_0)]^2}{f^2(X_i|\theta_0)} \\
&= n^{-1} \sum_{i=1}^{n} \frac{[f'(X_i|\theta_0)]^2 - f(X_i|\theta_0)\, f''(X_i|\theta_0)}{f^2(X_i|\theta_0)},
\end{align*}
which is the sum of a set of independent and identically distributed random
variables, each with expectation
\[
E\left\{ \frac{[f'(X_i|\theta_0)]^2 - f(X_i|\theta_0)\, f''(X_i|\theta_0)}{f^2(X_i|\theta_0)} \right\}
= E\left\{ \left[ \frac{f'(X_i|\theta_0)}{f(X_i|\theta_0)} \right]^2 \right\} - E\left[ \frac{f''(X_i|\theta_0)}{f(X_i|\theta_0)} \right]. \tag{10.23}
\]
To evaluate the second term of the right hand side of Equation (10.23), we
note that Equation (10.15) implies that
\[
E\left[ \frac{f''(X_i|\theta_0)}{f(X_i|\theta_0)} \right] = 0,
\]
and hence we have that
\[
E\left\{ \frac{[f'(X_i|\theta_0)]^2 - f(X_i|\theta_0)\, f''(X_i|\theta_0)}{f^2(X_i|\theta_0)} \right\} = I(\theta_0).
\]
Therefore, Theorem 3.10 (Weak Law of Large Numbers) implies that
\[
n^{-1} \sum_{i=1}^{n} \frac{[f'(X_i|\theta_0)]^2 - f(X_i|\theta_0)\, f''(X_i|\theta_0)}{f^2(X_i|\theta_0)} \xrightarrow{p} I(\theta_0),
\]
with probability one for all θ ∈ (θ0 − c, θ0 + c). Now ξn is a random variable
that is between θ0 and θ̂n with probability one, and Theorem 10.4 implies
that $\hat{\theta}_n \xrightarrow{p} \theta$ as n → ∞. Therefore, it follows that for any c > 0,
\[
\lim_{n\to\infty} P[\xi_n \in (\theta_0 - c, \theta_0 + c)] = 1,
\]
From the assumptions of Theorem 10.5 it is evident that the asymptotic ef-
ficiency of a maximum likelihood estimator depends heavily on f , including
its support and smoothness properties. The main assumption that may not always be obvious to verify is that the integral of the density and the derivative of the density with respect to θ may be interchanged. In this context the
following result is often useful.
Theorem 10.6. Let f (x|θ) be a function that is differentiable with respect to
θ ∈ Ω. Suppose there exists a function g(x|θ) and a real constant δ > 0 such
that
\[
\int_{-\infty}^{\infty} g(x|\theta)\,dx < \infty,
\]
for all θ ∈ Ω and
\[
\left| \frac{\partial}{\partial\theta} f(x|\theta) \bigg|_{\theta=\theta_0} \right| \leq g(x|\theta),
\]
for all θ0 ∈ Ω such that |θ0 − θ| ≤ δ. Then
\[
\frac{\partial}{\partial\theta} \int_{-\infty}^{\infty} f(x|\theta)\,dx = \int_{-\infty}^{\infty} \frac{\partial}{\partial\theta} f(x|\theta)\,dx.
\]
The proof of Theorem 10.6 follows from Theorem 1.11 (Lebesgue). For further details on this result see Casella and Berger (2002) or Section 7.10 of Khuri (2003). The conditions of Theorem 10.6 hold for a wide range of problems, including those that fall within the exponential family.
Definition 10.5. Let X be a continuous random variable with a density of
the form f (x|θ) = exp[θT (x) − A(θ)] for all x ∈ R where θ is a parameter
with parameter space Ω, T is a function that does not depend on θ, and A
is a function that does not depend on x. Then X has a density from a one
parameter exponential family.
Of importance for the exponential family in relation to our current discussion
is the fact that derivatives and integrals of the density can be exchanged.
Theorem 10.7. Let h be an integrable function and let θ be an interior point
of Ω, then the integral
\[
\int_{-\infty}^{\infty} h(x) \exp[\theta T(x) - A(\theta)]\,dx,
\]
is continuous and has derivatives of all orders with respect to θ, and these
derivatives can be obtained by exchanging the derivative and the integral.
A proof of Theorem 10.7 can be found in Section 7.1 of Barndorff-Nielsen
(1978) or Chapter 2 of Lehmann (1986).
Corollary 10.1. Let X1 , . . . , Xn be a set of independent and identically dis-
tributed random variables from a distribution that has density
f (x|θ) = exp[θT (x) − A(θ)],
where θ is a parameter with parameter space Ω that is an open interval, T is a
function that does not depend on θ, and A is a function that does not depend
on x. Then the likelihood function of θ has a unique solution that is consistent
and asymptotically normal and efficient.
Noting that the first integral on the right hand side of Equation (10.26) is the
expectation E{[T (Xi )−A0 (θ)]2 } and that previous arguments have shown that
E[T (Xi )] = A0 (θ), it follows that E{[T (Xi ) − A0 (θ)]2 } = V [T (Xi )] = A00 (θ).
Since the variance must be positive, we have that
\[
A''(\theta) = \frac{d}{d\theta} A'(\theta) = \frac{d}{d\theta} E[T(X_i)] > 0.
\]
Hence, the right hand side of Equation (10.25) is a strictly increasing function
of θ, and therefore can have at most one solution. The remainder of the proof
of this result is based on verifying the assumptions of Theorem 10.5 for this
case. We have already assumed that Ω is an open interval. The form of the
density from Definition 10.5, along with the fact that T (x) is not a function
of θ and that A(θ) is not a function of x, implies that the set {x : f (x|θ) > 0}
does not depend on θ. The first three derivatives of f (x|θ), taken with respect
to θ can be shown to be continuous in θ under the assumption that A(θ)
has at least three continuous derivatives. The fact that the integral of f (x|θ)
taken with respect to x can be differentiated three times with respect to θ by
exchanging the integral and the derivative follows from Theorem 10.7. The
Fisher information number for the parameter θ for this model is given by
\[
I(\theta) = E\left\{ \left[ \frac{d}{d\theta} \log f(X_i|\theta) \right]^2 \right\}
= E\left\{ \left[ \frac{d}{d\theta} \big(\theta T(X_i) - A(\theta)\big) \right]^2 \right\}
= E\{[T(X_i) - A'(\theta)]^2\} \geq 0.
\]
Under the assumption that E[T 2 (Xi )] < ∞ we obtain the required behavior
for Assumption 5. For Assumption 6 we note that the third derivative of
log[f (x|θ)], taken with respect to θ, is given by
\[
\frac{\partial^3}{\partial\theta^3} \log[f(x|\theta)] = -A'''(\theta),
\]
where we note that a suitable constant function is given by
\[
B(x) = \sup_{\theta \in (\theta_0 - c,\, \theta_0 + c)} |A'''(\theta)|.
\]
For further details see Lemma 2.5.3 of Lehmann and Casella (1998). It follows
that the assumptions of Theorem 10.5 are satisfied and the result follows.
Example 10.15. Let X1 , . . . , Xn be a set of independent and identically
distributed random variables from a density of the form
\[
f(x|\theta) = \begin{cases} \theta\exp(-\theta x), & x > 0, \\ 0, & x \leq 0, \end{cases} \tag{10.27}
\]
where θ ∈ Ω = (0, ∞). Calculations similar to those given in Example 10.10
can be used to show that the maximum likelihood estimator of θ is θ̂n = X̄n−1 .
Alternatively, one can use the fact that maximum likelihood estimators are
transformation respecting. Because the density in Equation (10.27) has the
form given in Definition 10.5 with A(θ) = −log(θ) and T(x) = −x, Corollary 10.1 implies that $n^{1/2}(\hat{\theta}_n - \theta) \xrightarrow{d} Z$ as n → ∞, where Z is a N(0, σ²) random variable. The asymptotic variance σ² is given by
\[
\sigma^2 = I^{-1}(\theta), \quad \text{where} \quad
I(\theta) = E\left\{ -\frac{\partial^2}{\partial\theta^2} \log[f(X_i|\theta)] \right\}
= E\left\{ -\frac{\partial^2}{\partial\theta^2} [\log(\theta) - \theta X_i] \right\}
= \theta^{-2},
\]
so that σ² = θ². Therefore, it follows that $n^{1/2}\theta^{-1}(\hat{\theta}_n - \theta)$ converges in distribution to a N(0, 1) distribution.
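A short simulation illustrates this limit. The sketch below (Python with NumPy; the sample size, replication count, and the value θ = 2 are illustrative choices) computes θ̂n = X̄n⁻¹ from repeated samples and checks that n¹ᐟ²θ⁻¹(θ̂n − θ) has mean near zero and variance near one.

```python
import numpy as np

rng = np.random.default_rng(5)
theta, n, reps = 2.0, 200, 20000

x = rng.exponential(scale=1 / theta, size=(reps, n))  # Exponential samples with rate theta
theta_hat = 1 / x.mean(axis=1)  # maximum likelihood estimator of the rate

z = np.sqrt(n) * (theta_hat - theta) / theta
print("mean of the standardized estimator:    ", z.mean())
print("variance of the standardized estimator:", z.var())
```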
Similar definitions can be used to define the correctness and accuracy of one-
sided confidence intervals.
Example 10.16. Let X1 , . . . , Xn be a set of independent and identically
distributed random variables from a distribution F with mean θ and finite
variance σ 2 . Even when F does not have a normal distribution, Theorem 4.20
(Lindeberg and Lévy) implies that $n^{1/2}\sigma^{-1}(\bar{X}_n - \theta) \xrightarrow{d} Z$ as n → ∞, where Z has a N(0, 1) distribution. Therefore, an asymptotically accurate 100α%
Z has a N(0, 1) distribution. Therefore, an asymptotically accurate 100α%
confidence interval for θ is given by
Cn (α) = [X̄n − n−1/2 σz(1+α)/2 , X̄n − n−1/2 σz(1−α)/2 ],
under the condition that σ is known. If σ is unknown then a consistent esti-
mator of σ is given by the usual sample standard deviation σ̂n . Therefore, in
this case an asymptotically accurate 100α% confidence interval for θ is given
by
Cn (α) = [X̄n − n−1/2 σ̂n z(1+α)/2 , X̄n − n−1/2 σ̂n z(1−α)/2 ].
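A minimal sketch of this interval (Python with NumPy and SciPy; the simulated data and the confidence coefficient α = 0.95 are illustrative choices) is given below, using the sample standard deviation in place of σ.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(6)
x = rng.exponential(scale=2.0, size=150)  # an arbitrary sample; the true mean is 2.0
alpha = 0.95

n = x.size
x_bar, sigma_hat = x.mean(), x.std(ddof=1)
z_hi, z_lo = norm.ppf((1 + alpha) / 2), norm.ppf((1 - alpha) / 2)

# C_n(alpha) = [X_bar - n^(-1/2) sigma_hat z_(1+alpha)/2, X_bar - n^(-1/2) sigma_hat z_(1-alpha)/2]
interval = (x_bar - sigma_hat * z_hi / np.sqrt(n), x_bar - sigma_hat * z_lo / np.sqrt(n))
print("asymptotically accurate interval for the mean:", interval)
```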
For two-sided confidence intervals we can use the interval C̃n (α) = [θ̂n −
n−1/2 σz(1+α)/2 , θ̂n − n−1/2 σz(1−α)/2 ], so that the asymptotic probability that
the interval will contain the true parameter value is given by
\[
\lim_{n\to\infty} P[\theta \in \tilde{C}_n(\alpha)] = \lim_{n\to\infty} P[n^{-1/2}\sigma z_{(1-\alpha)/2} \leq \hat{\theta}_n - \theta \leq n^{-1/2}\sigma z_{(1+\alpha)/2}]
\]
which uses the fact that (n − 1)θ−1 θ̂n is a pivotal quantity for θ that has
a ChiSquared(n − 1) distribution, and θ̂n is the unbiased version of the
sample variance. This pivotal function is only valid for the normal distribution.
If F is unknown then we can use the fact that Theorem 8.5 implies that
$n^{1/2}(\mu_4 - \theta^2)^{-1/2}(\hat{\theta}_n - \theta) \xrightarrow{d} Z$ as n → ∞, where Z has a N(0, 1) distribution, to construct an approximate confidence interval for θ. If E(|X1|⁴) < ∞ then Theorems 3.21 and 3.9 imply that $\hat{\mu}_4 - \hat{\theta}_n^2 \xrightarrow{p} \mu_4 - \theta^2$ as n → ∞ and an
Theorems 3.21 and 3.9 imply that µ̂4 − θ̂n2 − → µ4 − θ2 as n → ∞ and an
asymptotically accurate confidence interval for θ is given by
Ĉn (α) = [θ̂n − z(1+α)/2 n−1/2 (µ̂4 − θ̂n2 )1/2 , θ̂n − z(1−α)/2 n−1/2 (µ̂4 − θ̂n2 )1/2 ].
Example 10.19. Suppose X1 , . . . , Xn is a set of independent and identically
distributed random variables from a Poisson(θ) distribution. Garwood (1936)
suggested a 100α% confidence interval for θ using the form
\[
C_n(\alpha) = [\tfrac{1}{2} n^{-1} \chi^2_{2Y;(1-\alpha)/2},\ \tfrac{1}{2} n^{-1} \chi^2_{2(Y+1);(1+\alpha)/2}],
\]
where
\[
Y = n\hat{\theta}_n = \sum_{i=1}^{n} X_i.
\]
The coverage probability of this interval is at least α, but may also be quite
conservative in some cases. See Figure 9.2.5 of Casella and Berger (2002).
An asymptotically accurate confidence interval for θ based on Theorem 4.20
(Lindeberg and Lévy) and Theorem 3.10 (Weak Law of Large Numbers) is
given by
\[
\hat{C}_n(\alpha) = [\hat{\theta}_n - z_{(1+\alpha)/2}\, n^{-1/2} \hat{\theta}_n^{1/2},\ \hat{\theta}_n - z_{(1-\alpha)/2}\, n^{-1/2} \hat{\theta}_n^{1/2}].
\]
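Both intervals are simple to compute. The sketch below (Python with NumPy and SciPy; the simulated Poisson sample and α = 0.95 are illustrative choices) evaluates the Garwood interval and the asymptotic interval for the same data.

```python
import numpy as np
from scipy.stats import chi2, norm

rng = np.random.default_rng(7)
x = rng.poisson(lam=3.0, size=60)  # an arbitrary Poisson(3) sample
alpha = 0.95

n, y = x.size, x.sum()
theta_hat = y / n

# Garwood (1936) interval based on chi-squared quantiles
garwood = (chi2.ppf((1 - alpha) / 2, 2 * y) / (2 * n),
           chi2.ppf((1 + alpha) / 2, 2 * (y + 1)) / (2 * n))

# Asymptotic interval based on the central limit theorem and the weak law of large numbers
z_hi, z_lo = norm.ppf((1 + alpha) / 2), norm.ppf((1 - alpha) / 2)
asymptotic = (theta_hat - z_hi * np.sqrt(theta_hat / n),
              theta_hat - z_lo * np.sqrt(theta_hat / n))

print("Garwood interval:   ", garwood)
print("asymptotic interval:", asymptotic)
```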
If required, the asymptotic variance will be estimated with σ̂n2 = h2 (X̄n ). Let
Gn (t) = P [n1/2 σ −1 (θ̂n − θ) ≤ t] and Hn (t) = P [n1/2 σ̂n−1 (θ̂n − θ) ≤ t] and
define gα,n and hα,n to be the corresponding α quantiles of Gn and Hn so
that Gn (gα,n ) = α and Hn (hα,n ) = α.
In exactly the same way that one would develop the confidence interval for a
population mean we can develop a confidence interval for θ using the quantiles
of Gn and Hn . In particular, if σ is known then it follows that a 100α%
upper confidence limit for θ is given by θ̂n,ord = θ̂n − n−1/2 σg1−α and if σ is
unknown then it follows that a 100α% upper confidence limit for θ is given by
θ̂n,stud = θ̂n − n−1/2 σ̂n h1−α . In this case we are borrowing the notation and
terminology of Hall (1988a) where θ̂n,ord is called the ordinary confidence limit
and θ̂n,stud is called the studentized confidence limit, making reference to the
t-interval for the mean where the population standard deviation is replaced
by the sample standard deviation. In both cases these upper confidence limits
are accurate and correct. See Exercise 18.
Example 10.21. Consider the same setup as Example 10.20 where an exact
and correct upper confidence limit for θ has asymptotic expansion
\[
\hat{\theta}_n(\alpha) = \hat{\theta}_n + n^{-1/2}\hat{\sigma}_n z_\alpha + n^{-1}\hat{\sigma}_n \hat{u}_1(z_\alpha) + n^{-3/2}\hat{\sigma}_n \hat{u}_2(z_\alpha) + O_p(n^{-2}), \tag{10.28}
\]
\[
n^{-1/2} v_1[-z_\alpha - n^{-1/2}u_1(z_\alpha) - n^{-1}u_2(z_\alpha)]\,\phi[-z_\alpha - n^{-1/2}u_1(z_\alpha) - n^{-1}u_2(z_\alpha)]
= n^{-1/2} v_1(z_\alpha)\phi(z_\alpha) + n^{-1}u_1(z_\alpha)v_1'(z_\alpha)\phi(z_\alpha) - n^{-1}z_\alpha u_1(z_\alpha)v_1(z_\alpha)\phi(z_\alpha) + O(n^{-3/2}), \tag{10.39}
\]
as n → ∞. The third term in Equation (10.33) requires less sophisticated
arguments as we are only retaining terms of order O(n−1 ) and larger, and the
leading coefficient on this term is O(n−1 ). Hence, similar calculations to those
used above can be used to show that
\[
n^{-1} u_\alpha[-z_\alpha - n^{-1/2}u_1(z_\alpha) - n^{-1}u_2(z_\alpha)]\,\phi[-z_\alpha - n^{-1/2}u_1(z_\alpha) - n^{-1}u_2(z_\alpha)]
= -n^{-1} u_\alpha z_\alpha \phi(z_\alpha) + O(n^{-3/2}),
\]
as n → ∞. Therefore, it follows that
\[
\pi_n(\alpha) = \alpha + n^{-1/2}[u_1(z_\alpha) - v_1(z_\alpha)]\phi(z_\alpha) - n^{-1}[\tfrac{1}{2} z_\alpha u_1^2(z_\alpha) - u_2(z_\alpha) + u_1(z_\alpha)v_1'(z_\alpha) - z_\alpha u_1(z_\alpha)v_1(z_\alpha) - v_2(z_\alpha) + u_\alpha z_\alpha]\phi(z_\alpha) + O(n^{-3/2}), \tag{10.41}
\]
as n → ∞. From Equation (10.41) it is clear that the determining factor in
the accuracy of the one-sided confidence interval is the relationship between
the polynomials u1 and v1 . In particular, if u1 (zα ) = −s1 (zα ) = v1 (zα ) then
the expansion for the coverage probability simplifies to
\[
\tilde{\pi}_n(\alpha) = \alpha - n^{-1/2} v_1(z_\alpha)\phi(z_\alpha) + n^{-1}[v_2(z_\alpha) - u_\alpha z_\alpha]\phi(z_\alpha) + O(n^{-3/2}),
\]
as n → ∞. Hence, the normal approximation is first-order accurate.
Example 10.23. Consider the general setup of Example 10.20 with Edge-
worth corrected upper confidence limit given by
\[
\bar{\theta}_{n,\mathrm{stud}}(\alpha) = \hat{\theta}_n - n^{-1/2}\hat{\sigma}_n z_{1-\alpha} - n^{-1}\hat{\sigma}_n \hat{s}_1(z_{1-\alpha})
= \hat{\theta}_n + n^{-1/2}\hat{\sigma}_n z_\alpha - n^{-1}\hat{\sigma}_n \hat{s}_1(z_\alpha).
\]
In terms of the generic expansion given in Equation (10.28) we have that
u1 (zα ) = −ŝ1 (zα ). Therefore, Equation (10.41) implies that the coverage prob-
ability of θ̄n,stud (α) has asymptotic expansion α + O(n−1 ), as n → ∞. There-
fore, the Edgeworth corrected upper confidence limit for θ is second-order
accurate.
Statistical hypothesis tests are procedures that decide the truth of a hypoth-
esis about an unknown population parameter based on a sample from the
population. The decision is usually made in accordance with a known rate of
error specified by the researcher.
For this section we consider X1 , . . . , Xn to be a set of independent and iden-
tically distributed random variables from a distribution F with functional
parameter θ that has parameter space Ω. A hypothesis test begins by divid-
ing the parameter space into a partition of two regions called Ω0 and Ω1 for
which there are associated hypotheses H0 : θ ∈ Ω0 and H1 : θ ∈ Ω1 , respec-
tively. The structure of the test is such that the hypothesis H0 : θ ∈ Ω0 , called
the null hypothesis, is initially assumed to be true. The data are observed
and evidence that H0 : θ ∈ Ω0 is actually false is extracted from the data.
If the amount of evidence against H0 : θ ∈ Ω0 is large enough, as specified
by an acceptable error rate given by the researcher, then the null hypothesis
H0 : θ ∈ Ω0 is rejected as being false, and the hypothesis H1 : θ ∈ Ω1 , called
the alternative hypothesis, is accepted as truth. Otherwise we fail to reject the
null hypothesis and we conclude that there was not sufficient evidence in the
observed data to conclude that the null hypothesis is false.
The evidence in the observed data against the null hypothesis is measured by a statistic Tn = Tn(X1, . . . , Xn), called a test statistic. Prior
to observing the sample X1 , . . . , Xn the researcher specifies a set R that is
a subset of the range of Tn . This set is constructed so that when Tn ∈ R,
the researcher considers the evidence in the sample to be sufficient to warrant
rejection of the null hypothesis. That is, if Tn ∈ R, then the null hypothesis
is rejected and if Tn ∈/ R then the null hypothesis is not rejected. The set R
is called the rejection region.
The rejection region is usually constructed so that the probability of rejecting
the null hypothesis when the null hypothesis is really true is set to a level
α called the significance level. That is, α = P (Tn ∈ R|θ ∈ Ω0 ). The error
corresponding to this conclusion is called the Type I error, and it should be
noted that the probability of this error may depend on the specific value of θ in
Ω0 . In this case we control the largest probability of a Type I error for points
in Ω0 to be α with a slight difference in terminology separating the cases where
the probability α can be achieved and where it cannot. Another terminology
will be used when the error rate asymptotically achieves the probability α.
Definition 10.8. Consider a test of a null hypothesis H0 : θ ∈ Ω0 with test
statistic Tn and rejection region R.
The other type of error that one can make is a Type II error, which occurs when
the null hypothesis is not rejected but the alternative hypothesis is actually
true. The probability of avoiding this error is called the power of the test.
This probability, taken as a function of θ, is called the power function of the
test and will be denoted as βn (θ). That is, βn (θ) = P (Tn ∈ R|θ). The domain
of the power function can be taken to be Ω so that the function βn (θ) will
also reflect the probability of a Type I error when θ ∈ Ω0 . In this context we
would usually want βn (θ) when θ ∈ Ω0 to be smaller than βn (θ) when θ ∈ Ω1 .
That is, we should have a larger probability of rejecting the null hypothesis
when the alternative is true than when the null is true. A test that has this
property is called an unbiased test.
Definition 10.9. A test with power function βn (θ) is unbiased if βn (θ0 ) ≤
βn (θ1 ) for all θ0 ∈ Ω0 and θ1 ∈ Ω1 .
While unbiased tests are important, there are also asymptotic considerations
that can be accounted for as well. The most common of these is that a test
should be consistent against values of θ in the alternative hypothesis. This
is an extension of the idea of consistency in the case of point estimation.
In point estimation we like to have consistent estimators, that is ones that
converge to the correct value of θ as n → ∞. The justification for requiring
this property is that if we could examine the entire population then we should
know the parameter value exactly. The consistency concept is extended to
statistical hypothesis tests by requiring that if we could examine the entire
population, then we should be able to make a correct decision. That is, if the
alternative hypothesis is true, then we should reject with probability one as
n → ∞. Note that this behavior is only specified for points in the alternative
hypothesis since we always insist on erroneously rejecting the null hypothesis
with probability at most α no matter what the sample size is.
Definition 10.10. Consider a test of H0 : θ ∈ Ω0 against H1 : θ ∈ Ω1 that
has power function βn (θ). The test is consistent against the alternative θ ∈ Ω1
if
\[
\lim_{n\to\infty} \beta_n(\theta) = 1.
\]
If the test is consistent against all alternatives θ ∈ Ω1 , then the test is called
a consistent test.
A consistent hypothesis test assures us that when the sample is large enough
that we will reject the null hypothesis when the alternative hypothesis is true.
For fixed sample sizes the probability of rejecting the null hypothesis when
the alternative is true depends on the actual value of θ in the alternative
hypothesis. Many tests will perform well when the actual value of θ is far
away from the null hypothesis, and in the limiting case have a power equal to
one in the limit. That is,
\[
\lim_{d(\theta,\Omega_0)\to\infty} \beta_n(\theta) = 1,
\]
\[
\lim_{n\to\infty} d(\theta_n, \Omega_0) = 0,
\]
at a specified rate. Then the asymptotic power of the test against the sequence
of alternatives $\{\theta_n\}_{n=1}^{\infty}$ is given by
\[
\lim_{n\to\infty} \beta_n(\theta_n).
\]
for all θl < θ0 . Hence, Definition 10.8 implies that this test has asymptotic
size α. Similar arguments can be used to show that this test is also unbiased.
See Exercise 26.
Example 10.24. Let X1 , . . . , Xn be a set of independent and identically dis-
tributed random variables following an Exponential(θ) distribution where
θ ∈ Ω = (0, ∞). We will consider testing the null hypothesis H0 : θ ≤ θ0
against the alternative hypothesis H1 : θ > θ0 for some specified θ0 > 0. The-
orem 4.20 (Lindeberg-Lévy) implies that the test statistic Zn = n1/2 θ0−1 (X̄n −
θ0 ) converges in distribution to a N(0, 1) distribution as n → ∞ when θ = θ0 .
Therefore, rejecting the null hypothesis when Zn exceeds z1−α will result in a
test with asymptotic level equal to α.
Example 10.25. Let X1 , . . . , Xn be a set of independent and identically
distributed random variables following a Poisson(θ) distribution where θ ∈
Ω = (0, ∞). We will consider testing the null hypothesis H0 : θ ≤ θ0 against
the alternative hypothesis H1 : θ > θ0 for some specified θ0 > 0. Once again, Theorem 4.20 implies that the test statistic Zn = n^{1/2}θ0^{-1/2}(X̄n − θ0) converges in distribution to a N(0, 1) distribution as n → ∞ when θ = θ0. Therefore, rejecting the null hypothesis when Zn exceeds z_{1−α} will result in a test with asymptotic level equal to α.
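As a concrete illustration, the following minimal R sketch carries out the approximate test described in Example 10.25. The value of θ0, the level α, and the simulated data are arbitrary choices made only for illustration; they are not taken from the text.

# Approximate large-sample test of H0: theta <= theta0 against H1: theta > theta0
# for Poisson data, in the spirit of Example 10.25 (illustrative values only).
set.seed(1)
theta0 <- 2                           # hypothesized value (assumed for illustration)
alpha  <- 0.05                        # nominal level
x      <- rpois(50, lambda = 2.4)     # simulated sample standing in for observed data
n      <- length(x)
zn     <- sqrt(n) * (mean(x) - theta0) / sqrt(theta0)  # Zn = n^{1/2} theta0^{-1/2} (Xbar - theta0)
reject <- zn > qnorm(1 - alpha)       # reject when Zn exceeds z_{1-alpha}
c(Zn = zn, critical.value = qnorm(1 - alpha), reject = reject)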
The variance σ 2 can either be known as a separate parameter, or can be known
through the specification of the null hypothesis, as in the problem studied in
Example 10.25. In some cases, however, σ will not be known. In these cases
there is often a consistent estimator of σ that can be used. That is, we consider the case where σ can be estimated by σ̂n where σ̂n →p σ as n → ∞. Theorem 4.11 (Slutsky) implies that n^{1/2}σ̂n^{-1}(θ̂n − θ0) →d Z as n → ∞ and hence

lim_{n→∞} P[n^{1/2}σ̂n^{-1}(θ̂n − θ0) ≥ r_{α,n}] = α,
for all θu ∈ (θ0, ∞). Therefore, Definition 10.10 implies that the test is consistent against any alternative θu ∈ (θ0, ∞). See Exercise 27 for further details. The asymptotic power of the test can be studied using the sequence of alternative hypotheses {θ1,n}∞_{n=1}, where θ1,n = θ0 + n^{-1/2}δ and δ is a positive real constant. Note that θ1,n → θ0 as n → ∞ and that θ1,n = θ0 + O(n^{-1/2}) as n → ∞. For this sequence we have that

βn(θ1,n) = P(Zn ≥ r_{α,n} | θ = θ1,n)
= P[n^{1/2}(θ̂n − θ1,n)/σ ≥ r_{α,n} − n^{1/2}(θ1,n − θ0)/σ | θ = θ1,n]
= P[n^{1/2}(θ̂n − θ1,n)/σ ≥ r_{α,n} − σ^{-1}δ | θ = θ1,n].
As in the above case we must make some additional assumptions about the
limiting distribution of the sequence n1/2 σ −1 (θ̂n − θ1,n ) when θ = θ1,n . In this
case we will assume that

P[n^{1/2}σ^{-1}(θ̂n − θ1,n) ≤ t | θ = θ1,n] ⇝ Φ(t),

as n → ∞, for all t ∈ R. Under this assumption, and noting that r_{α,n} → z_{1−α} as n → ∞, it follows that the asymptotic power of the test against this sequence of alternatives is

lim_{n→∞} βn(θ1,n) = 1 − Φ(z_{1−α} − σ^{-1}δ) = Φ(σ^{-1}δ − z_{1−α}).   (10.43)

For δ near zero, a further approximation based on the results of Theorem 1.15 can be used to find that

lim_{n→∞} βn(θ1,n) = Φ(−z_{1−α}) + σ^{-1}δφ(zα) + O(δ²) = α + σ^{-1}δφ(zα) + O(δ²),

as δ → 0.
Note that in some cases, such as in Example 10.26, the standard deviation σ depends on θ. That is, the standard deviation is σ(θ) for some function σ. Hence, considering a sequence of alternatives {θ1,n}∞_{n=1} implies that we must also consider a sequence of standard deviations {σ1,n}∞_{n=1} where σ1,n = σ(θ1,n) for all n ∈ N. Therefore, if we can assume that σ(θ) is a continuous function of θ, it follows from Theorem 1.3 that σ1,n → σ as n → ∞ for some positive finite constant σ = σ(θ0). In this case

βn(θ1,n) = P[n^{1/2}(θ̂n − θ0)/σ(θ0) ≥ r_{α,n} | θ = θ1,n]
= P[n^{1/2}(θ̂n − θ1,n)/σ(θ0) ≥ r_{α,n} − n^{1/2}(θ1,n − θ0)/σ(θ0) | θ = θ1,n].
Now, assume that n^{1/2}(θ̂n − θ)/σ(θ) →d Z as n → ∞, where Z is a N(0, 1) random variable, for all θ ∈ (θ0 − ε, θ0 + ε) for some ε > 0. Further, assume that

P[n^{1/2}(θ̂n − θ1,n)/σ(θ1,n) ≤ t | θ = θ1,n] ⇝ Φ(t),

as n → ∞. Then the fact that

lim_{n→∞} σ(θ1,n)/σ(θ0) = 1

implies that the result of Equation (10.43) holds in this case as well.
These results can be summarized in the following theorem.
Theorem 10.10. Let X1, . . . , Xn be a set of independent and identically distributed random variables from a distribution F with parameter θ. Consider testing the null hypothesis H0 : θ ≤ θ0 against the alternative hypothesis H1 : θ > θ0 by rejecting the null hypothesis if n^{1/2}(θ̂n − θ0)/σ(θ0) > r_{α,n}. Assume that
1. r_{α,n} → z_{1−α} as n → ∞.
2. σ(θ) is a continuous function of θ.
3. n^{1/2}(θ̂n − θ)/σ(θ) →d Z as n → ∞ for all θ > θ0 − ε for some ε > 0, where Z is a N(0, 1) random variable.
4. n^{1/2}(θ̂n − θ1,n)/σ(θ1,n) →d Z as n → ∞ where θ = θ1,n and {θ1,n}∞_{n=1} is a sequence such that θ1,n → θ0 as n → ∞.
as n → ∞ for all t ∈ R and any sequence {θ1,n}∞_{n=1} such that θ1,n → θ0 as n → ∞. To see why this holds we follow the approach of Lehmann (1999) and note that for each n ∈ N, Theorem 4.24 (Berry and Esseen) implies that

|P[n^{1/2}(X̄n − θ1,n)/θ1,n^{1/2} ≤ t | θ = θ1,n] − Φ(t)| ≤ n^{-1/2} B θ1,n^{-3/2} E(|X1 − θ1,n|³),   (10.44)

where B is a constant that does not depend on n. Since θ1,n → θ0 as n → ∞, the terms θ1,n^{-3/2} and E(|X1 − θ1,n|³) remain bounded, and hence the right hand side of Equation (10.44) converges to zero as n → ∞. Therefore, it follows that for each t ∈ R,

lim_{n→∞} P[n^{1/2}(X̄n − θ1,n)/θ1,n^{1/2} ≤ t | θ = θ1,n] = Φ(t).

It then follows from Theorem 10.10 that the asymptotic power of the test is Φ(θ0^{-1/2}δ − z_{1−α}) for the sequence of alternatives given by θ1,n = θ0 + n^{-1/2}δ, where δ > 0. This test is also consistent against all alternatives θ > θ0.
Example 10.29. Let X1 , . . . , Xn be a set of independent and identically
distributed random variables from an Exponential(θ) distribution. We wish
to test the null hypothesis H0 : θ ≤ θ0 against the alternative hypothesis
H1 : θ > θ0 using the test statistic n^{1/2}θ0^{-1}(X̄n − θ0). Theorem 4.20 (Lindeberg and Lévy) implies that n^{1/2}θ^{-1}(X̄n − θ) →d Z as n → ∞ for all θ > 0, where Z is a N(0, 1) random variable. Hence, the first three assumptions of Theorem 10.10 are satisfied and it only remains to show that n^{1/2}θ1,n^{-1}(X̄n − θ1,n) →d Z as n → ∞ when θ = θ1,n and θ1,n → θ0 as n → ∞. Using the same approach as shown in Example 10.28, which is based on Theorem 4.24 (Berry and Esseen), it follows that the assumptions of Theorem 10.10 hold and hence the test is consistent with asymptotic power function given by Φ(θ0^{-1}δ − z_{1−α}), where the sequence of alternatives is given by θ1,n = θ0 + n^{-1/2}δ.
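The asymptotic power expression in Example 10.29 can be checked by simulation. The R sketch below estimates βn(θ1,n) by Monte Carlo under the local alternative θ1,n = θ0 + n^{-1/2}δ and compares it with Φ(θ0^{-1}δ − z_{1−α}); the values of θ0, δ, α, and n are arbitrary illustrative choices, and Exponential(θ) is taken to be parameterized by its mean, as in the example.

# Monte Carlo check of the asymptotic power for the Exponential test of Example 10.29.
set.seed(2)
theta0 <- 1; delta <- 1.5; alpha <- 0.05; n <- 100   # illustrative values
theta1n <- theta0 + delta / sqrt(n)                  # local alternative theta_{1,n}
B <- 10000
reject <- replicate(B, {
  x  <- rexp(n, rate = 1 / theta1n)                  # Exponential with mean theta_{1,n}
  zn <- sqrt(n) * (mean(x) - theta0) / theta0
  zn > qnorm(1 - alpha)
})
c(simulated.power = mean(reject),
  asymptotic.power = pnorm(delta / theta0 - qnorm(1 - alpha)))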
In the case where σ is unknown, but is replaced by a consistent estimator given by σ̂n, we have that the test statistic Tn = n^{1/2}σ̂n^{-1}(θ̂n − θ) →d Z as n → ∞, where Z is a N(0, 1) random variable, by Theorem 4.11 (Slutsky). Under similar conditions to those given in Theorem 10.10, the consistency and asymptotic power of the test have the properties given in Theorem 10.10. See Section 3.3 of Lehmann (1999) for a complete development of this case.
More precise asymptotic behavior under the null hypothesis can be determined
by adding further structure to our model. In particular we will now restrict our
attention to the framework of the smooth function model described in Section
7.4. We will first consider the case where σ, which denotes the asymptotic
variance of n1/2 θ̂n , is known. In this case we will consider using the test
statistic Zn = n1/2 σ −1 (θ̂n − θ0 ) which follows the distribution Gn when θ0 is
the true value of θ. Calculations similar to those used above can be used to
show that an unbiased test of size α of the null hypothesis H0 : θ ≤ θ0 against
the alternative hypothesis H1 : θ > θ0 rejects the null hypothesis if Zn > g1−α ,
where we recall that g1−α is the (1 − α)th quantile of the distribution Gn . See
Exercise 31.
From an asymptotic viewpoint it follows from the smooth function model that Zn →d Z as n → ∞, where Z is a N(0, 1) random variable. Therefore, if the dis-
tribution Gn is unknown then one can develop a large sample test by rejecting
H0 : θ ≤ θ0 if Zn ≥ z1−α . This test is similar to the approximate normal tests
studied above, except with the additional framework of the smooth function
model, we can study the effect of this approximation, as well as alternate
approximations, more closely. Note that Theorem 7.11 (Bhattacharya and
Ghosh) implies that the quantile g1−α has a Cornish-Fisher expansion given
by g1−α = z1−α +n−1/2 q1 (z1−α )+O(n−1 ), as n → ∞. Therefore, we have that
|g1−α −z1−α | = O(n−1/2 ), as n → ∞. To see what effect this has on the size of
the test we note that Theorem 7.11 implies that the Edgeworth expansion for
the distribution Zn is given by P (Zn ≤ t) = Φ(t) + n−1/2 r1 (t)φ(t) + O(n−1 ),
as n → ∞. Therefore, the probability of rejecting the null hypothesis when
θ = θ0 is given by
P (Zn > z1−α |θ = θ0 ) = 1 − Φ(z1−α ) − n−1/2 r1 (z1−α )φ(z1−α ) + O(n−1 )
= α − n−1/2 r1 (zα )φ(zα ) + O(n−1 ), (10.45)
as n → ∞, where we have used the fact that both r1 and φ are even functions.
Therefore, Definition 10.8 implies that this test is first-order accurate. Note
that the test may even be more accurate depending on the form of the term
r1 (zα )φ(zα ) in Equation (10.45). For example, if r1 (zα ) = 0 then the test is
at least second-order accurate.
Another strategy for obtaining a more accurate test is to use a quantile of the
form z1−α + n−1/2 q1 (z1−α ), under the assumption that the polynomial q1 is
known. Note that
P (Zn > z1−α + n−1/2 q1 (z1−α )|θ = θ0 ) = 1 − Φ[z1−α + n−1/2 q1 (z1−α )]−
n−1/2 r1 [z1−α + n−1/2 q1 (z1−α )]φ[z1−α + n−1/2 q1 (z1−α )] + O(n−1 ),
as n → ∞. Now, using Theorem 1.15 we have that
Φ[z1−α + n−1/2 q1 (z1−α )] = Φ(z1−α ) + n−1/2 q1 (z1−α )φ(z1−α ) + O(n−1 ),
r1 [z1−α +n−1/2 q1 (z1−α )] = r1 (z1−α )+O(n−1/2 ), and φ[z1−α +n−1/2 q1 (z1−α )] =
φ(z1−α ) + O(n−1/2 ), as n → ∞. Therefore,
where ξ is a random variable that is between θ0 and θ̂n with probability one. Apply Theorem 1.13 to the second term on the right hand side of Equation (10.47) to obtain

n^{-1/2}l′(θ0) = n^{-1/2}l′(θ̂n) + n^{-1/2}(θ0 − θ̂n)l″(ζ),   (10.48)

where ζ is a random variable that is between θ̂n and θ0 with probability one. To simplify the expression in Equation (10.48) we note that since θ̂n is the maximum likelihood estimator of θ it follows that l′(θ̂n) = 0. Therefore
Some other common test statistics that have asymptotic ChiSquared distributions include Wald's statistic and Rao's efficient score statistic. See Exercises 35 and 36. Under a sequence of alternative hypotheses {θ1,n}∞_{n=1} where θ1,n = θ0 + n^{-1/2}δ, it follows that Wilks' likelihood ratio test statistic has an asymptotic non-central ChiSquared distribution with a non-zero non-centrality parameter.
Theorem 10.12. Suppose that X1, . . . , Xn is a set of independent and identically distributed random variables from a distribution F(x|θ) with density f(x|θ), where θ has parameter space Ω. Suppose that the conditions of Theorem 10.5 hold. Then, under the sequence of alternative hypotheses {θ1,n}∞_{n=1} where θ1,n = θ0 + n^{-1/2}δ, Λn →d X as n → ∞ where X has a ChiSquared[1, δ²I(θ0)] distribution.
Proof. We begin by noting that if θ1,n = θ0 + n^{-1/2}δ, then retracing the steps in the proof of Theorem 10.11 implies that under the sequence of alternative hypotheses {θ1,n}∞_{n=1} we have that Λn = n(θ̂n − θ0)²I(θ0) + op(1), as n → ∞. See Exercise 34. Now, note that

n^{1/2}(θ̂n − θ0) = n^{1/2}(θ̂n − θ1,n + n^{-1/2}δ) = n^{1/2}(θ̂n − θ1,n) + δ.

Therefore, Theorem 4.11 (Slutsky) implies that under the sequence of alternative hypotheses we have that n^{1/2}I^{1/2}(θ0)(θ̂n − θ0) →d Z as n → ∞, where Z is a N(δ[I(θ0)]^{1/2}, 1) random variable. Therefore, Theorem 4.12 implies that n(θ̂n − θ0)²I(θ0) →d X as n → ∞, where X has a ChiSquared[1, δ²I(θ0)] distribution, and the result is proven.
Similar results also hold for the asymptotic distributions of Wald’s statistic
and Rao’s efficient score statistic under the same sequence of alternative hy-
potheses. See Exercises 37 and 38.
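The limiting behavior described in Theorem 10.12 can also be examined numerically. The sketch below uses the Exponential(θ) model, parameterized by its mean, purely as an illustrative choice; it simulates Wilks' likelihood ratio statistic under the local alternative θ1,n = θ0 + n^{-1/2}δ and compares its upper tail with that of a ChiSquared[1, δ²I(θ0)] distribution, where I(θ) = θ^{-2} for this model.

# Simulated Wilks' statistic under a local alternative, compared with the
# non-central ChiSquared limit of Theorem 10.12 (illustrative Exponential model).
set.seed(3)
theta0 <- 1; delta <- 2; n <- 200; B <- 5000
theta1n <- theta0 + delta / sqrt(n)
lrt <- replicate(B, {
  x <- rexp(n, rate = 1 / theta1n)
  thetahat <- mean(x)                                  # maximum likelihood estimator
  2 * (sum(dexp(x, rate = 1 / thetahat, log = TRUE)) -
       sum(dexp(x, rate = 1 / theta0,  log = TRUE)))   # Lambda_n
})
ncp <- delta^2 / theta0^2                              # delta^2 I(theta0)
q <- c(1, 2, 4, 8)
rbind(simulated   = sapply(q, function(t) mean(lrt > t)),
      theoretical = pchisq(q, df = 1, ncp = ncp, lower.tail = FALSE))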
Observed confidence levels provide useful information about the relative truth
of hypotheses in multiple testing problems. In place of the repeated tests of
hypothesis usually associated with multiple comparison techniques, observed
confidence levels provide a level of confidence for each of the hypotheses. This
level of confidence measures the amount of confidence there is, based on the
observed data, that each of the hypotheses are true. This results in a relatively
simple method for assessing the truth of a sequence of hypotheses.
The development of observed confidence levels begins by constructing a for-
mal framework for the problem. This framework is based on the problem of
regions, which was first formally proposed by Efron and Tibshirani (1998). The
problem is constructed as follows. Let X be a d-dimensional random vector
following a d-dimensional distribution function F . We will assume that F is a
member of a collection, or family, of distributions given by F. The collection
F may correspond to a parametric family such as the collection of all d-variate
normal distributions, or may be nonparametric, such as the collection of all
continuous distributions with finite mean. Let θ be the parameter of interest
and assume that θ is a functional parameter of F of the form θ = T (F ),
with parameter space Ω. Let {Ωi}∞_{i=1} be a countable sequence of subsets, or regions, of Ω such that

∪_{i=1}^{∞} Ωi = Ω.
For simplicity each region will usually be assumed to be connected, though
in many problems Ωi is made up of disjoint regions. Further, in many prac-
tical examples the sequence is finite, but there are also practical examples
that require countable sequences. There are also examples where the sequence {Ωi}∞_{i=1} forms a partition of Ω in the sense that Ωi ∩ Ωj = ∅ for all i ≠ j. In the general case, the possibility that the regions can overlap will be allowed. In many practical problems the subsets technically overlap on their boundaries, but the sequence can often be thought of as a partition from a practical viewpoint. The statistical interest in such a sequence of regions arises from the structure of a specific inferential problem. Typically the regions correspond to competing models for the distribution of the random vector X and one is interested in determining which of the models is most reasonable based on the observed data vector X. Therefore the problem of regions is concerned with determining to which of the regions in the sequence {Ωi}∞_{i=1} the parameter θ belongs, based on the observed data X.
An obvious simple solution to this problem would be to estimate θ based on
X and conclude that θ is in the region Ωi whenever the estimate θ̂n is in
the region Ωi . We will consider the estimator θ̂n = T (F̂n ) where F̂n is the
empirical distribution function defined in Definition 3.5. The problem with
simply concluding that θ ∈ Ωi whenever θ̂n ∈ Ωi is that θ̂n is subject to sample variability. Therefore, even though we may observe θ̂n ∈ Ωi, it may actually be true that θ ∈ Ωj for some i ≠ j where Ωi ∩ Ωj = ∅, and that θ̂n ∈ Ωi was observed simply due to chance. If such an outcome were rare,
then the method may be acceptable. However, if such an outcome occurred
relatively often, then the method would not be useful. Therefore, it is clear
that the inherent variability in θ̂n must be accounted for in order to develop
a useful solution to the problem of regions.
Multiple comparison techniques solve the problem of regions using a sequence
of hypothesis tests. Adjustments to the testing technique help control the
overall significance level of the sequence of tests. Modern techniques have been
developed by Stefansson, Kim and Hsu (1988) and Finner and Strassburger
(2002). Some general references that address issues concerned with multiple
comparison techniques include Hochberg and Tamhane (1987), Miller (1981)
and Westfall and Young (1993). Some practitioners find the results of these
procedures difficult to interpret as the number of required tests can sometimes
be quite large.
An alternate approach to multiple testing techniques was formally introduced
by Efron and Tibshirani (1998). This approach computes a measure of con-
fidence for each of the regions. This measure reflects the amount of confi-
dence there is that θ lies within the region based on the observed sample
X. The method used for computing the observed confidence levels studied here is based on the methodology of Polansky (2003a, 2003b, 2007). Let C(α, ω) ⊂ Ω be a 100α% confidence region for θ based on the sample X. That is, C(α, ω) is a function of the sample X with the property that

P[θ ∈ C(α, ω)] = α.
The vector ω ∈ Wα ⊂ Rq is called the shape parameter vector as it con-
tains a set of parameters that control the shape and orientation of the con-
fidence region, but do not have an effect on the confidence coefficient. Even
though Wα is usually a function of α, the subscript α will often be omitted
to simplify mathematical expressions. Now suppose that there exist sequences {αi}∞_{i=1} ∈ [0, 1] and {ωi}∞_{i=1} ∈ Wα such that C(αi, ωi) = Ωi for i = 1, 2, . . ., conditional on X. Then the sequence of confidence coefficients is defined to be the sequence of observed confidence levels for {Ωi}∞_{i=1}. In particular, αi is defined to be the observed confidence level of the region Ωi. That is, the region Ωi corresponds to a 100αi% confidence region for θ based on the observed data.
This measure is similar to the measure suggested by Efron and Gong (1983),
Felsenstein (1985) and Efron, Halloran, and Holmes (1996). It is also similar in
application to the methods of Efron and Tibshirani (1998), though the formal
definition of the measure differs slightly from the definition used above. See
Efron and Tibshirani (1998) for further details on this definition.
Example 10.32. To demonstrate this idea, consider a simple example where
X1 , . . . , Xn is a set of independent and identically distributed random vari-
ables from a N(θ, σ 2 ) distribution. Let θ̂n and σ̂n be the usual sample mean
and variance computed on X1 , . . . , Xn . A confidence interval for the mean
that is based on the assumption that the population is Normal is based on
percentiles from the T(n − 1) distribution and has the form

C(α, ω) = (θ̂n − t_{n−1;1−ωL} n^{-1/2}σ̂n, θ̂n − t_{n−1;1−ωU} n^{-1/2}σ̂n),   (10.51)

where t_{ν;ξ} is the ξth quantile of a T(ν) distribution. In order for the confidence interval in Equation (10.51) to have a confidence level equal to 100α% we take ω′ = (ωL, ωU) to be the shape parameter vector where
Wα = {ω : ωU − ωL = α, ωL ∈ [0, 1], ωU ∈ [0, 1]},
for α ∈ (0, 1). Note that selecting ω ∈ Wα not only ensures that the confidence
level is 100α%, but also allows for several orientations and shapes of the
interval. For example, a symmetric two-tailed interval can be constructed by
selecting ωL = (1 − α)/2 and ωU = (1 + α)/2. An upper one-tailed interval is
constructed by setting ωL = 0 and ωU = α. A lower one-tailed interval uses
ωL = 1 − α and ωU = 1.
Now consider the problem of computing observed confidence levels for the
Normal mean for a sequence of interval regions of the form Ωi = [ti , ti+1 ]
where −∞ < ti < ti+1 < ∞ for i ∈ N. Setting Ωi = C(α, ω), where the confidence interval used for this calculation is the one given in Equation (10.51), yields

θ̂n − t_{n−1;1−ωL} n^{-1/2}σ̂n = ti,   (10.52)

and

θ̂n − t_{n−1;1−ωU} n^{-1/2}σ̂n = ti+1.   (10.53)

Solving Equations (10.52) and (10.53) for ωL and ωU yields

ωL = 1 − T_{n−1}[n^{1/2}σ̂n^{-1}(θ̂n − ti)],

and

ωU = 1 − T_{n−1}[n^{1/2}σ̂n^{-1}(θ̂n − ti+1)],

where T_{n−1} is the distribution function of a T(n − 1) distribution. Because ω ∈ Wα if and only if ωU − ωL = α, it follows that the observed confidence level for the region Ωi is given by

T_{n−1}[n^{1/2}σ̂n^{-1}(θ̂n − ti)] − T_{n−1}[n^{1/2}σ̂n^{-1}(θ̂n − ti+1)].
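A minimal R version of this calculation is given below; the simulated sample and the endpoints ti and ti+1 are arbitrary illustrative choices.

# Observed confidence level that the Normal mean lies in [t_i, t_{i+1}],
# computed by inverting the T(n-1) interval as in Example 10.32.
set.seed(4)
x    <- rnorm(25, mean = 0.3, sd = 1)   # illustrative sample
n    <- length(x)
ti   <- 0; tip1 <- 1                    # the region Omega_i = [t_i, t_{i+1}]
thetahat <- mean(x)
sigmahat <- sd(x)
ocl <- pt(sqrt(n) * (thetahat - ti)   / sigmahat, df = n - 1) -
       pt(sqrt(n) * (thetahat - tip1) / sigmahat, df = n - 1)
ocl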
This section will consider the problem of developing the asymptotic theory of
observed confidence levels for the case when there is a single parameter, that
is, when Ω ⊂ R. To simplify the development in this case, it will be further
assumed that Ω is an interval subset of R. Most standard single parameter
problems in statistical inference fall within these assumptions.
An observed confidence level is simply a function that takes a subset of Ω
and maps it to a real number between 0 and 1. Formally, let α be a function
and let T be a collection of subsets of Ω. Then an observed confidence level
is a function α : T → [0, 1]. Because confidence levels are closely related to
probabilities, it is reasonable to assume that α has the axiomatic properties
given in Definition 2.2 where we will suppose that T is a sigma-field of subsets
of Ω. For most reasonable problems in statistical inference it should suffice to
take T to be the Borel σ-field on Ω. Given this structure, it suffices to develop
observed confidence levels for interval subsets of Ω. Observed confidence levels
for other regions can be obtained through operations derived from the axioms
in Definition 2.2.
To develop observed confidence levels for a general scalar parameter θ, consider
a single interval region of the form Ψ = (tL , tU ) ∈ T. To compute the observed
confidence level of Ψ, a confidence interval for θ based on the sample X is
required. The general form of a confidence interval for θ based on X can
usually be written as C(α, ω) = [L(ωL ), U (ωU )], where ωL and ωU are shape
parameters such that (ωL , ωU ) ∈ Wα , for some Wα ⊂ R2 . It can often be
assumed that L(ωL ) and U (ωU ) are continuous monotonic functions of ωL
and ωU onto Ω, respectively, conditional on the observed sample X. See, for
example, Section 9.2 of Casella and Berger (2002). If such an assumption is
true, the observed confidence level of Ψ is computed by setting Ψ = C(α, ω)
and solving for ω. The value of α for which ω ∈ Wα is the observed confidence
level of Ψ. For the form of the confidence interval given above, the solution is obtained by setting ωL = L^{-1}(tL) and ωU = U^{-1}(tU), conditional on X. A unique solution will exist for both shape parameters given the assumptions on the functions L and U. Therefore, the observed confidence level of Ψ is the value of α such that ω = (ωL, ωU) ∈ Wα. Thus, the calculation of observed confidence levels in the single parameter case is equivalent to inverting the endpoints of a confidence interval for θ. Some simple examples illustrating this method are given below.
Example 10.33. Continuing with the setup of Example 10.32, suppose that
X1 , . . . ,Xn is a set of independent and identically distributed random variables
from a N(µ, θ) distribution where θ < ∞. For the variance, the parameter
space is Ω = (0, ∞), so that the region Ψ = (tL , tU ) is assumed to follow
the restriction that 0 < tL ≤ tU < ∞. Let θ̂n be the unbiased version of the
sample variance, then a 100α% confidence interval for θ is given by

C(α, ω) = [(n − 1)θ̂n/χ²_{n−1;1−ωL}, (n − 1)θ̂n/χ²_{n−1;1−ωU}],   (10.54)

where ω ∈ Wα with

Wα = {ω′ = (ωL, ωU) : ωL ∈ [0, 1], ωU ∈ [0, 1], ωU − ωL = α},

and χ²_{ν;ξ} is the ξth quantile of a ChiSquared(ν) distribution. Therefore

L(ωL) = (n − 1)θ̂n/χ²_{n−1;1−ωL},

and

U(ωU) = (n − 1)θ̂n/χ²_{n−1;1−ωU}.

Setting Ψ = C(α, ω) and solving for ωL and ωU yields

ωL = L^{-1}(tL) = 1 − χ²_{n−1}[t_L^{-1}(n − 1)θ̂n],

and

ωU = U^{-1}(tU) = 1 − χ²_{n−1}[t_U^{-1}(n − 1)θ̂n],

where χ²_ν is the distribution function of a ChiSquared(ν) distribution. This implies that the observed confidence level for Ψ is given by

α(Ψ) = χ²_{n−1}[t_L^{-1}(n − 1)θ̂n] − χ²_{n−1}[t_U^{-1}(n − 1)θ̂n].
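The corresponding computation in R is direct; the sample and the region (tL, tU) below are arbitrary illustrative choices.

# Observed confidence level that the Normal variance theta lies in (tL, tU),
# obtained by inverting the ChiSquared interval of Equation (10.54).
set.seed(5)
x  <- rnorm(30, mean = 5, sd = 2)       # illustrative sample
n  <- length(x)
tL <- 2; tU <- 6                        # illustrative region for the variance
thetahat <- var(x)                      # unbiased sample variance
ocl <- pchisq((n - 1) * thetahat / tL, df = n - 1) -
       pchisq((n - 1) * thetahat / tU, df = n - 1)
ocl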
Example 10.34. Suppose (X1 , Y1 ), . . . , (Xn , Yn ) is a set of independent and
identically distributed bivariate random vectors from a bivariate normal dis-
tribution with mean vector µ′ = (µX, µY) and covariance matrix

Σ = ( σ²_X   σ_XY )
    ( σ_XY   σ²_Y ),

where V(Xi) = σ²_X, V(Yi) = σ²_Y and the covariance between X and Y is σ_XY. Let θ = σ_XY σ_X^{-1} σ_Y^{-1}, the correlation coefficient between X and Y. The problem of constructing a reliable confidence interval for θ is usually simplified using Fisher's normalizing transformation. See Fisher (1915) and Winterbottom (1979). Using the fact that tanh^{-1}(θ̂n) has an approximate N[tanh^{-1}(θ), (n − 3)^{-1}] distribution, the resulting approximate confidence interval for θ has the form

C(α, ω) = [tanh(tanh^{-1}(θ̂n) − z_{1−ωL}(n − 3)^{-1/2}), tanh(tanh^{-1}(θ̂n) − z_{1−ωU}(n − 3)^{-1/2})],   (10.55)

where θ̂n is the sample correlation coefficient given by

θ̂n = Σ_{i=1}^n (Xi − X̄)(Yi − Ȳ) / {[Σ_{i=1}^n (Xi − X̄)²]^{1/2}[Σ_{i=1}^n (Yi − Ȳ)²]^{1/2}},   (10.56)

and ω′ = (ωL, ωU) ∈ Wα with

Wα = {ω′ = (ωL, ωU) : ωL ∈ [0, 1], ωU ∈ [0, 1], ωU − ωL = α}.

Therefore

L(ωL) = tanh(tanh^{-1}(θ̂n) − z_{1−ωL}(n − 3)^{-1/2}),

and

U(ωU) = tanh(tanh^{-1}(θ̂n) − z_{1−ωU}(n − 3)^{-1/2}).

Setting Ψ = (tL, tU) = C(α, ω) yields

ωL = 1 − Φ[(n − 3)^{1/2}(tanh^{-1}(θ̂n) − tanh^{-1}(tL))],

and

ωU = 1 − Φ[(n − 3)^{1/2}(tanh^{-1}(θ̂n) − tanh^{-1}(tU))],

so that the observed confidence level for Ψ is given by

α(Ψ) = Φ[(n − 3)^{1/2}(tanh^{-1}(θ̂n) − tanh^{-1}(tL))] − Φ[(n − 3)^{1/2}(tanh^{-1}(θ̂n) − tanh^{-1}(tU))].
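Again the computation is easily carried out in R; the bivariate sample and the region (tL, tU) below are arbitrary illustrative choices.

# Observed confidence level that the correlation coefficient lies in (tL, tU),
# based on Fisher's normalizing transformation (atanh) as in Example 10.34.
set.seed(6)
n  <- 50
x  <- rnorm(n)
y  <- 0.6 * x + rnorm(n, sd = 0.8)      # illustrative correlated pair
r  <- cor(x, y)                         # sample correlation coefficient
tL <- 0.2; tU <- 0.8                    # illustrative region
ocl <- pnorm(sqrt(n - 3) * (atanh(r) - atanh(tL))) -
       pnorm(sqrt(n - 3) * (atanh(r) - atanh(tU)))
ocl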
For the asymptotic development in this section we will consider problems that
occur within the smooth function model described in Section 7.4. As observed
in Section 10.3, confidence regions for θ can be constructed using the ordinary
upper confidence limit
θ̂ord (α) = θ̂n − n−1/2 σg1−α , (10.57)
for the case when σ is known, and the studentized critical point
θ̂stud (α) = θ̂n − n−1/2 σ̂n h1−α , (10.58)
for the case when σ is unknown. Two-sided confidence intervals can be devel-
oped using each of these upper confidence limits. For example, if ωL ∈ (0, 1)
and ωU ∈ (0, 1) are such that α = ωU − ωL ∈ (0, 1) then Cord (α, ω) =
[θ̂ord (ωL ), θ̂ord (ωU )] is a 100α% confidence interval for θ when σ is known.
Similarly
Cstud (α, ω) = [θ̂stud (ωL ), θ̂stud (ωU )],
is a 100α% confidence interval for θ when σ is unknown.
For a region Ψ = (tL , tU ) the observed confidence level corresponding to each
of the theoretical critical points can be computed by setting Ψ equal to the
confidence interval and solving for ωL and ωU . For example, in the case of the
ordinary theoretical critical point, setting Ψ = Cord (α, ω; X) yields the two
equations
tL = L(ω) = θ̂n − n−1/2 σg1−ωL , (10.59)
and
tU = U (ω) = θ̂n − n−1/2 σg1−ωU . (10.60)
Let Gn(x) = P[n^{1/2}σ^{-1}(θ̂n − θ) ≤ x] and Hn(x) = P[n^{1/2}σ̂n^{-1}(θ̂n − θ) ≤ x]. Solving for
ωL and ωU in Equations (10.59) and (10.60) yields
ωL = L−1 (tL ) = 1 − Gn [n1/2 σ −1 (θ̂n − tL )],
and
ωU = U −1 (tU ) = 1 − Gn [n1/2 σ −1 (θ̂n − tU )],
so that the observed confidence level corresponding to the ordinary theoretical
confidence limit is
αord (Ψ) = Gn [n1/2 σ −1 (θ̂n − tL )] − Gn [n1/2 σ −1 (θ̂n − tU )]. (10.61)
Similarly, the observed confidence levels corresponding to the studentized con-
fidence interval is
αstud (Ψ) = Hn [n1/2 σ̂n−1 (θ̂n − tL )] − Hn [n1/2 σ̂n−1 (θ̂n − tU )]. (10.62)
In the case where F is unknown, the asymptotic Normal behavior of the
distributions Gn and Hn can be used to compute approximate observed con-
fidence levels of the form
α̂ord (Ψ) = Φ[n1/2 σ −1 (θ̂n − tL )] − Φ[n1/2 σ −1 (θ̂n − tU )], (10.63)
and
α̂stud (Ψ) = Φ[n1/2 σ̂n−1 (θ̂n − tL )] − Φ[n1/2 σ̂n−1 (θ̂n − tU )], (10.64)
for the cases where σ is known, and unknown, respectively.
The observed confidence level based on the Normal approximation is just
one of several alternative methods available for computing an observed confi-
dence level for any given parameter. Indeed, as was pointed out earlier, any
function that maps regions to the unit interval such that the three properties
of a probability measure are satisfied is technically a method for computing
an observed confidence level. Even if we focus on observed confidence levels
that are derived from confidence intervals that at least guarantee their cov-
erage level asymptotically, there may be many methods to choose from, and
techniques for comparing the methods become paramount in importance.
This motivates the question as to what properties we would wish the observed
confidence levels to possess. Certainly the issue of consistency would be rele-
vant in that an observed confidence level computed on a region Ω0 = (t0L , t0U )
that contains θ should converge to one as the sample size becomes large. Corre-
spondingly, an observed confidence level computed on a region Ω1 = (t1L , t1U )
that does not contain θ should converge to zero as the sample size becomes
large. The issue of consistency is relatively simple to decide within the smooth
function model studied. The normal approximation provides the simplest case
and is a good starting point. Consider the ordinary observed confidence level
given in Equation (10.63). Note that
Φ[n^{1/2}σ^{-1}(θ̂n − t0L)] = Φ[n^{1/2}σ^{-1}(θ − t0L) + n^{1/2}σ^{-1}(θ̂n − θ)],

where n^{1/2}(θ − t0L)/σ → ∞ and n^{1/2}(θ̂n − θ)/σ →d Z as n → ∞, where Z is a N(0, 1) random variable. It is clear that the second sequence is bounded in probability, so that the first sequence dominates. It follows that

Φ[n^{1/2}σ^{-1}(θ̂n − t0L)] →p 1,

as n → ∞. Similarly, it can be shown that

Φ[n^{1/2}σ^{-1}(θ̂n − t0U)] →p 0,

as n → ∞, so that it follows that α̂ord(Ω0) →p 1 as n → ∞ when θ ∈ Ω0. A similar argument, using the fact that σ̂n^{-1}σ →p 1 as n → ∞, can be used to show that α̂stud(Ω0) →p 1 as n → ∞ as well. Arguments to show that αord(Ω0) and αstud(Ω0) are also consistent follow in a similar manner, though one must use the fact that Gn ⇝ Φ and Hn ⇝ Φ as n → ∞.
Beyond consistency, it is desirable for an observed confidence level to provide
an accurate representation of the level of confidence there is that θ ∈ Ψ, given
the observed sample X1 , . . . , Xn . Considering the definition of an observed
confidence level, it is clear that if Ψ corresponds to a 100α% confidence interval
for θ, conditional on X1 , . . . , Xn , the observed confidence level for Ψ should
be α. When σ is known the interval Cord (α, ω) will be used as the standard for
a confidence interval for θ. Hence, a measure α̃ of an observed confidence level
is accurate if α̃[Cord (α, ω)] = α. For the case when σ is unknown the interval
Cstud (α, ω) will be used as the standard for a confidence interval for θ, and
an arbitrary measure α̃ is defined to be accurate if α̃[Cstud (α, ω)] = α. Using
this definition, it is clear that αord and αstud are accurate when σ is known
and unknown, respectively. When σ is known and α̃[Cord (α, ω)] 6= α or when
σ is unknown and α̃[Cstud (α, ω)] 6= α one can analyze how close α̃[Cord (α, ω)]
or α̃[Cstud(α, ω)] is to α using asymptotic expansion theory. In particular, if σ is known then a measure of confidence α̃ is said to be kth-order accurate if α̃[Cord(α, ω)] = α + O(n^{-k/2}), as n → ∞. Similarly, if σ is unknown a measure α̃ is said to be kth-order accurate if α̃[Cstud(α, ω)] = α + O(n^{-k/2}), as n → ∞.
To analyze the normal approximations, let us first suppose that σ is known.
If α = ωU − ωL then

α̂ord[Cord(α, ω)] = Φ{n^{1/2}σ^{-1}[θ̂n − θ̂ord(ωL)]} − Φ{n^{1/2}σ^{-1}[θ̂n − θ̂ord(ωU)]}
= Φ(g_{1−ωL}) − Φ(g_{1−ωU}).   (10.65)

If Gn = Φ, then α̂ord[Cord(α, ω)] = α, and the method is accurate. When Gn ≠ Φ the Cornish-Fisher expansion for a quantile of Gn, along with an application of Theorem 1.13 (Taylor) to Φ, yields

Φ(g_{1−ω}) = 1 − ω − n^{-1/2}r1(zω)φ(zω) + O(n^{-1}),

as n → ∞, for an arbitrary value of ω ∈ (0, 1). It is then clear that

α̂ord[Cord(α, ω)] = α + n^{-1/2}Δ(ωL, ωU) + O(n^{-1}),   (10.66)

as n → ∞, where

Δ(ωL, ωU) = r1(z_{ωU})φ(z_{ωU}) − r1(z_{ωL})φ(z_{ωL}).
One can observe that α̂ord is first-order accurate, unless the first-order term in Equation (10.66) is functionally zero. If it happens that ωL = ωU or ωL = 1 − ωU, then it follows that r1(z_{ωL})φ(z_{ωL}) = r1(z_{ωU})φ(z_{ωU}), since r1 and φ are both even functions, and the first-order term vanishes. The first case corresponds to a degenerate interval with confidence measure equal to zero. The second case corresponds to the situation where θ̂n is the midpoint of the interval (tL, tU). Otherwise, the term is typically nonzero.
The statistical inference methods used so far in this book are classified as frequentist methods. In these methods the unknown parameter is considered
to be a fixed constant that is an element of the parameter space. The random
mechanism that produces the observed data is based on a distribution that
depends on this unknown, but fixed, parameter. These methods are called
frequentist methods because the results of the statistical analyses are inter-
preted using the frequency interpretation of probability. That is, the methods
are justified by properties that hold under repeated sampling from the distri-
bution of interest. For example, a 100α% confidence set is justified in that it contains the true parameter value with probability α before the sample is taken, or that the expected proportion of the confi-
dence sets that contain the parameter over repeated sampling from the same
distribution is α.
An alternative view of statistical inference is based on Bayesian methods. In
the Bayesian framework the unknown parameter is considered to be a random
variable and the observed data is based on the joint distribution between the
parameter and the observed random variables. In the usual formulation the
distribution of the parameter, called the prior distribution, and the condi-
tional distribution of the observed random variables given the parameter are
specified. Inference is then carried out using the distribution of the parame-
ter conditional on the sample that was observed. This distribution is called
the posterior distribution. The computation of the posterior distribution is
based on calculations justified by Bayes’ theorem, from which the methods
are named. The interpretation of using Bayes’ theorem in this way is that the
prior distribution can be interpreted as the knowledge of the parameter before
the data was observed, while the posterior distribution is the knowledge of the
parameter that has been updated based on the information from the observed
sample. The advantage of this type of inference is that the theoretical prop-
erties of Bayesian methods are interpreted for the sample that was actually
observed, and not over all possible samples. The interpretation of the results
is also simplified due to the randomness of the parameter value. For example,
while a confidence interval must be interpreted in view of all of the possible
samples that could have been observed, a Bayesian confidence set produces a
set that has a posterior probability of α, which can be interpreted solely on
the basis of the observed sample and the prior distribution.
In some sense Bayesian methods do not need to rely on asymptotic properties
for their justification because of their interpretability on the current sam-
ple. However, many Bayesian methods can be justified within the frequentist
framework as well. In this section we will demonstrate that Bayes estimators
can also be asymptotically efficient and have an asymptotic Normal distri-
bution within the frequentist framework.
To formalize the framework for our study, consider a set of independent and identically distributed random variables from a distribution F(x|θ) which has either a density or probability distribution function given by f(x|θ), where we will assume for simplicity that x ∈ R. The parameter θ will be assumed to follow the prior distribution π(θ) over some parameter space Ω, which again for simplicity will often be taken to be R. The object of a Bayesian analysis is then to obtain the posterior distribution π(θ|x1, . . . , xn), which can be obtained using an argument based on Bayes' theorem of the form

π(θ|x1, . . . , xn) = f(x1, . . . , xn, θ)/m(x1, . . . , xn),

where f(x1, . . . , xn, θ) is the joint distribution of X1, . . . , Xn and θ, and the marginal distribution of X1, . . . , Xn is given by

m(x1, . . . , xn) = ∫_Ω f(x1, . . . , xn, θ)dθ.

Using the fact that the joint distribution of X1, . . . , Xn and θ can be found as f(x1, . . . , xn, θ) = f(x1, . . . , xn|θ)π(θ), it follows that the posterior distribution can be found directly from f(x1, . . . , xn|θ) and π(θ) as

π(θ|x1, . . . , xn) = f(x1, . . . , xn|θ)π(θ) / ∫_Ω f(x1, . . . , xn|θ)π(θ)dθ.   (10.67)

Noting that the denominator of Equation (10.67) is a constant, it is often enough to conclude that

π(θ|x1, . . . , xn) ∝ f(x1, . . . , xn|θ)π(θ),

which eliminates the need to compute the integral, which can be difficult in some cases. Once the posterior distribution is computed, a Bayesian analysis will either use the distribution itself as the updated knowledge about the parameter, or point estimates, confidence regions, and tests can be constructed, though their interpretation is necessarily different than that of the parallel frequentist methods. For an introduction to Bayesian methods see Bolstad (2007).
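When the normalizing integral in Equation (10.67) is inconvenient to evaluate analytically, the posterior can be approximated numerically by normalizing f(x1, . . . , xn|θ)π(θ) over a grid of θ values. The R sketch below does this for an illustrative Normal likelihood with a Normal prior; the data, the prior, and the grid are all assumptions made only for the example.

# Grid approximation of the posterior density, in the spirit of Equation (10.67).
set.seed(7)
x      <- rnorm(20, mean = 1, sd = 1)                 # illustrative data
grid   <- seq(-2, 4, length.out = 2001)               # grid over the parameter space
h      <- grid[2] - grid[1]                           # grid spacing
prior  <- dnorm(grid, mean = 0, sd = 2)               # illustrative prior on theta
loglik <- sapply(grid, function(th) sum(dnorm(x, mean = th, sd = 1, log = TRUE)))
unnorm <- exp(loglik - max(loglik)) * prior           # proportional to the posterior
post   <- unnorm / sum(unnorm * h)                    # normalize to integrate to one
c(posterior.mean = sum(grid * post) * h)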
The derivation of a Bayes estimator of a parameter θ begins by specifying
a loss function L[θ, δ(x1 , . . . , xn )] where δ(x1 , . . . , xn ) is a point estimator,
called a decision rule. The posterior expected loss, or Bayes risk, is then given by

∫_{−∞}^{∞} L[θ, δ(x1, . . . , xn)]π(θ|x1, . . . , xn)dθ.
The Bayes estimator of θ is then taken to be the decision rule δ that minimizes
the Bayes risk. The result given below, which is adapted from Lehmann and
Casella (1998) provides conditions under which a Bayes estimator of θ can be
found for two common loss functions.
Theorem 10.13. Let θ have a prior distribution π over Ω and suppose that the density or probability distribution of X1, . . . , Xn, conditional on θ, is given by f(x1, . . . , xn|θ). If
1. L(θ, δ) is a non-negative loss function,
2. There exists a decision rule δ that has finite risk,
3. For almost all (x1, . . . , xn) ∈ R^n there exists a decision rule δ(x1, . . . , xn) that minimizes the Bayes risk,
then δ is the Bayes estimator. In particular, under squared error loss the Bayes estimator is the mean of the posterior distribution, and under absolute error loss the Bayes estimator is a median of the posterior distribution.
For a proof of Theorem 10.13 see Section 4.1 of Lehmann and Casella (1998).
P(X = 0) = P(X = 0|θ = 1/4)P(θ = 1/4) + P(X = 0|θ = 1/2)P(θ = 1/2) + P(X = 0|θ = 3/4)P(θ = 3/4) = (3/4)(1/3) + (1/2)(1/3) + (1/4)(1/3) = 1/2.

Therefore, the posterior probability is P(θ = 1/4|X = 0) = (3/4)(1/3)/(1/2) = 1/2. Similar calculations can be used to show that P(θ = 1/2|X = 0) = 1/3 and P(θ = 3/4|X = 0) = 1/6. One can note that the lowest value of θ, which corresponds to the highest probability for X = 0, has the largest posterior probability. If we consider using squared error loss, then Theorem 10.13 implies that the Bayes estimator of θ is given by the mean of the posterior distribution, which is θ̃ = 5/12.
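The arithmetic in this example is easy to verify directly in R; the sketch below uses only the conditional probabilities P(X = 0|θ) quoted above and reproduces the posterior probabilities 1/2, 1/3, and 1/6 and the Bayes estimate 5/12.

# Posterior over the three support points and the Bayes estimator under squared error loss.
theta <- c(1/4, 1/2, 3/4)
prior <- rep(1/3, 3)                  # uniform prior over the three values of theta
lik   <- c(3/4, 1/2, 1/4)             # P(X = 0 | theta), as given in the example
post  <- lik * prior / sum(lik * prior)
post                                   # 1/2, 1/3, 1/6
sum(theta * post)                      # posterior mean = 5/12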
Example 10.36. Let B1 , . . . , Bn be a set of independent and identically dis-
tributed random variables each having a Bernoulli(θ) distribution. Suppose
that θ has a Beta(α, β) prior distribution where both α and β are speci-
fied and hence can be treated as constants. The conditional distribution of
B1, . . . , Bn given θ is given by

P(B1 = b1, . . . , Bn = bn|θ) = θ^{nB̄n}(1 − θ)^{n−nB̄n},

where bi can either be 0 or 1 for each i ∈ {1, . . . , n}, and B̄n is the sample mean of b1, . . . , bn. Therefore it follows that the posterior distribution for θ given B1 = b1, . . . , Bn = bn is proportional to

[θ^{nB̄n}(1 − θ)^{n−nB̄n}][θ^{α−1}(1 − θ)^{β−1}] = θ^{α−1+nB̄n}(1 − θ)^{β+n−1−nB̄n},

which is proportional to a Beta(α + nB̄n, β + n − nB̄n) density, so that the posterior distribution of θ is Beta(α + nB̄n, β + n − nB̄n).
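The conjugate updating is easy to display in R. The sketch below mirrors the situation shown in Figure 10.3, namely a Beta(5, 5) prior with n = 10 and B̄n = 1/2, so that the posterior is Beta(10, 10); the plotting choices are otherwise arbitrary.

# Beta prior and Beta posterior for Bernoulli data (cf. Figure 10.3).
alpha <- 5; beta <- 5                  # prior parameters
n <- 10; nbar <- 5                     # number of observations and number of successes
curve(dbeta(x, alpha, beta), from = 0, to = 1, lty = 1,
      xlab = expression(theta), ylab = "density")
curve(dbeta(x, alpha + nbar, beta + n - nbar), add = TRUE, lty = 2)   # posterior density
legend("topright", legend = c("prior", "posterior"), lty = c(1, 2))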
Figure 10.2 The Beta(5, 10) prior density (solid line) and the Beta(10, 25) posterior density (dashed line) on θ from Example 10.37 when n = 10 and B̄n = 1/2. The sample proportion is located at 1/2 (dotted line) and the Bayes estimate is located at 10/35 (dash and dot line).
l(θ̂n|X) = l(θ0|X) + (θ̂n − θ0)l′(θ0|X) + (1/2)(θ̂n − θ0)²l″(ξn|X),   (10.69)
Figure 10.3 The Beta(5, 5) prior density (solid line) and the Beta(10, 10) posterior density (dashed line) on θ from Example 10.37 when n = 10 and B̄n = 1/2. The sample proportion and Bayes estimate are both located at 1/2 (dotted line).
1. Ω is an open interval.
2. The set {x : f (x|θ) > 0} is the same for all θ ∈ Ω.
3. The density f (x|θ) has two continuous derivatives with respect to θ for each
x ∈ {x : f (x|θ) > 0}.
4. The integral

∫_{−∞}^{∞} f(x|θ)dx

can be twice differentiated by exchanging the integral and the derivative.
5. The Fisher information number I(θ0) is defined, positive, and finite.
6. For any θ0 ∈ Ω there exists a positive constant d and function B(x) such that

|∂² log[f(x|θ)]/∂θ²| ≤ B(x),
Figure 10.4 The Beta(5, 10) prior density (solid line) and the Beta(55, 60) posterior density (dashed line) on θ from Example 10.37 when n = 100 and B̄n = 1/2. The sample proportion is located at 1/2 (dotted line) and the Bayes estimate is located at 55/115 (dash and dot line).
Then,

∫_{−∞}^{∞} |τ(t|x) − [I(θ0)]^{1/2}φ{t[I(θ0)]^{1/2}}|dt →p 0,   (10.70)

as n → ∞. If we can additionally assume that the expectation

∫_Ω |θ|π(θ)dθ,
Figure 10.5 The Beta(100, 500) prior density (solid line) and the Beta(150, 550) posterior density (dashed line) on θ from Example 10.37 when n = 100 and B̄n = 1/2.
is finite, then

∫_{−∞}^{∞} (1 + |t|)|τ(t|x) − [I(θ0)]^{1/2}φ{t[I(θ0)]^{1/2}}|dt →p 0,   (10.71)

as n → ∞.
The proof of Theorem 10.14 is rather complicated, and can be found in Section 6.8 of Lehmann and Casella (1998). Note that in Theorem 10.14 the integrals in Equations (10.70) and (10.71) are being used as a type of norm in the space of density functions, so that the results state that τ(t|x) and [I(θ0)]^{1/2}φ{t[I(θ0)]^{1/2}} coincide as n → ∞, since the integrals converge in probability to zero. Therefore, the conclusions of Theorem 10.14 state that the posterior density of n^{1/2}[θ − θ0 − n^{-1}l′(θ0)/I(θ0)] converges to a N{0, [I(θ0)]^{-1}} density. Equivalently, one can conclude that when n is large, the posterior distribution of θ is approximately a N{θ0 + n^{-1}l′(θ0)[I(θ0)]^{-1}, [nI(θ0)]^{-1}} distribution. The results of Equations (10.70) and (10.71) make the same type of conclusion, the difference being that the rate of convergence is faster in the tails of the density in Equation (10.71) due to the factor (1 + |t|) in the integral.
Note that the assumptions required for Theorem 10.14 are quite a bit stronger than what is required to develop the asymptotic theory of maximum likelihood estimators. While Assumptions 1–6 are the same as in Theorem 10.5, the assumptions used for likelihood theory that imply that n^{-1}Rn(θ0) →p 0 as n → ∞ are replaced by the stronger Assumption 7 in Theorem 10.14. Additionally,
in the asymptotic theory of maximum likelihood estimation, the consistency
and asymptotic efficiency of the maximum likelihood estimator requires us
to only specify the behavior of the likelihood function in a neighborhood of
the true parameter value. Because Bayes estimators involve integration of the
likelihood over the entire range of the parameter space, Assumption 8 ensures
that the likelihood function is well behaved away from θ0 as well.
When the squared error loss function is used, the result of Theorem 10.14 is
sufficient to conclude that the Bayes estimator is both consistent and efficient.
Theorem 10.15. Suppose that X′ = (X1, . . . , Xn) is a vector of independent and identically distributed random variables from a distribution f(x|θ), conditional on θ, where the prior distribution of θ is π(θ) for θ ∈ Ω. Let τ(t|x) be the posterior density of n^{1/2}(θ − θ̃0,n), where θ̃0,n = θ0 + n^{-1}[I(θ0)]^{-1}l′(θ0) and θ0 ∈ Ω is the true value of θ. Suppose that the conditions of Theorem 10.14 hold. Then, when the loss function is squared error loss, it follows that n^{1/2}(θ̃n − θ0) →d Z as n → ∞, where θ̃n is the Bayes estimator of θ and Z is a N{0, [I(θ0)]^{-1}} random variable.
Note that θ̃0,n does not depend on t so that the first term in Equation (10.73) is θ̃0,n. Therefore,

θ̃n = θ̃0,n + n^{-1/2}∫_{Ω̃} tτ(t|x)dt,

or that

n^{1/2}(θ̃n − θ̃0,n) = ∫_{Ω̃} tτ(t|x)dt.

Now, note that

n^{1/2}|θ̃n − θ̃0,n| = |∫_{Ω̃} tτ(t|x)dt| = |∫_{Ω̃} {tτ(t|x) − t[I(θ0)]^{1/2}φ{t[I(θ0)]^{1/2}}}dt|,

since the subtracted integral represents the expectation of a N{0, [I(θ0)]^{-1}} random variable, which is zero. Theorem A.6 implies that

|∫_{Ω̃} {tτ(t|x) − t[I(θ0)]^{1/2}φ{t[I(θ0)]^{1/2}}}dt| ≤ ∫_{Ω̃} |tτ(t|x) − t[I(θ0)]^{1/2}φ{t[I(θ0)]^{1/2}}|dt = ∫_{Ω̃} |t||τ(t|x) − [I(θ0)]^{1/2}φ{t[I(θ0)]^{1/2}}|dt.

Therefore

n^{1/2}|θ̃n − θ̃0,n| ≤ ∫_{Ω̃} |t||τ(t|x) − [I(θ0)]^{1/2}φ{t[I(θ0)]^{1/2}}|dt.   (10.74)
Theorem 10.14 implies that the integral on the right hand side of Equation (10.74) converges in probability to zero as n → ∞. Therefore, it follows that n^{1/2}|θ̃n − θ̃0,n| →p 0 as n → ∞. Combining this result with the fact that n^{1/2}(θ̃0,n − θ0) →d Z as n → ∞ and using Theorem 4.11 (Slutsky) implies that n^{1/2}(θ̃n − θ0) →d Z as n → ∞, which proves the result.
The conditions of Theorem 10.15 appear to be quite complicated, but in fact
can be shown to hold for exponential families.
Theorem 10.16. Consider an exponential family density of the form f (x|θ) =
exp[θT (x) − A(θ)] where the parameter space Ω is an open interval, T (x) is
not a function of θ, and A(θ) is not a function of x. Then for this density the assumptions of Theorem 10.15 are satisfied.
For a proof of Theorem 10.16 see Example 6.8.4 of Lehmann and Casella
(1998).
Example 10.38. Let X1 , . . . , Xn be a set of independent and identically
distributed random variables from a N(θ, σ 2 ) distribution, conditional on θ,
where θ has a N(λ, τ 2 ) distribution where σ 2 , λ and τ 2 are known. In this
case it can be shown that the posterior distribution of θ is N(θ̃n, σ̃n²), where

θ̃n = (τ²X̄n + n^{-1}σ²λ)/(τ² + n^{-1}σ²),

and

σ̃n² = σ²τ²/(nτ² + σ²).
See Exercise 44. Under squared error loss, the Bayes estimator of θ is given by E(θ|X1, . . . , Xn) = θ̃n. Treating X1, . . . , Xn as a random sample from a N(θ, σ²) distribution where θ is fixed, we have that X̄n →p θ as n → ∞ by Theorem 3.10 (Weak Law of Large Numbers), and hence Theorem 3.7 implies that

θ̃n = (τ²X̄n + n^{-1}σ²λ)/(τ² + n^{-1}σ²) →p θ,

as n → ∞. Therefore the Bayes estimator is consistent. Further, Theorem 4.20 (Lindeberg and Lévy) implies that n^{1/2}(X̄n − θ) →d Z as n → ∞, where Z is a N(0, σ²) random variable. The result in this case is, in fact, exact for any sample size, but for the sake of argument here we will base our calculations on the asymptotic result. Note that
and

lim_{n→∞} (n^{-1}α + n^{-1}β + 1) = 1.

Therefore, Theorem 4.11 (Slutsky) implies that n^{1/2}(θ̃n − θ) →d Z as n → ∞, which is the asymptotic behavior specified by Theorem 10.15.
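The behavior described in Example 10.38 can also be examined with a short simulation; the true mean, the known variance, the prior parameters, and the sample size below are all illustrative assumptions.

# Sampling behavior of the Bayes estimator of a Normal mean with a Normal prior.
set.seed(8)
theta <- 1; sigma2 <- 1                # true mean and known variance (illustrative)
lambda <- 0; tau2 <- 0.5               # prior mean and variance (illustrative)
n <- 100; B <- 5000
bayes <- replicate(B, {
  xbar <- mean(rnorm(n, mean = theta, sd = sqrt(sigma2)))
  (tau2 * xbar + sigma2 * lambda / n) / (tau2 + sigma2 / n)   # theta_tilde_n
})
c(mean = mean(bayes), sd = sd(bayes),
  asymptotic.sd = sqrt(sigma2 / n))    # Theorem 10.15 suggests a standard error near this value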
10.7.1 Exercises
which is the proportion of values in the sample that are equal to zero, and
θ̃n = exp(−X̄n ). Compute the asymptotic relative efficiency of θ̂n relative
to θ̃n and comment on the results.
12. Let X1 , . . . , Xn be a set of independent and identically distributed random
variables following a N(θ, 1) distribution.
a. Prove that if θ ≠ 0 then δ{|X̄n|; [0, n^{-1/4})} →p 0 as n → ∞.
b. Prove that if θ = 0 then δ{|X̄n|; [0, n^{-1/4})} →p 1 as n → ∞.
17. Consider a sequence of random variables {{Xij}_{j=1}^{k}}_{i=1}^{n} that are assumed to be mutually independent, each having a N(µi, θ) distribution for i = 1, . . . , n. Prove that the maximum likelihood estimators of µi and θ are

µ̂i = X̄i = k^{-1} Σ_{j=1}^{k} Xij,
πn(α) = α + n^{-1/2}[u1(zα) − v1(zα)]φ(zα) − n^{-1}[(1/2)zα u1²(zα) − u2(zα) + u1(zα)v1′(zα) − zα u1(zα)v1(zα) − v2(zα) + uα zα]φ(zα) + O(n^{-3/2}),

as n → ∞. Prove that when u1(zα) = s1(zα) and u2(zα) = s2(zα) it follows that πn(α) = n^{-1}uα zα φ(zα) + O(n^{-3/2}), as n → ∞.
21. Let X1 , . . . , Xn be a sequence of independent and identically distributed
random variables from a distribution F with mean θ and assume the
framework of the smooth function model. Let σ 2 = E[(X1 − θ)2 ], γ =
σ −3 E[(X1 − θ)3 ], and κ = σ −4 E[(X1 − θ)4 ] − 3.
a. Prove that an exact 100α% upper confidence limit for θ has asymptotic
expansion
a. Find an asymptotic expansion for the confidence limit θ̂n (α) and find
the order of asymptotic correctness of the method.
b. Find an asymptotic expansion for the coverage probability of the up-
per confidence θ̂n (α) and find the order of asymptotic accuracy of the
method.
c. Find an asymptotic expansion for the coverage probability of a two-sided
confidence interval based on this method.
a. Prove that the test is consistent against all alternatives θ < θ0 . State
any additional assumptions that must be made in order for this result
to be true.
b. Develop an expression similar to that given in Theorem 10.10 for the
asymptotic power of this test for a sequence of alternatives given by
θ1,n = θ0 − n−1/2 δ where δ > 0 is a constant. State any additional
assumptions that must be made in order for this result to be true.
31. Consider the framework of the smooth function model where σ, which de-
notes the asymptotic variance of n1/2 θ̂n , is known. Consider using the test
statistic Zn = n1/2 σ −1 (θ̂n − θ0 ) which follows the distribution Gn when θ0
is the true value of θ. Prove that an unbiased test of size α of the null hy-
pothesis H0 : θ ≤ θ0 against the alternative hypothesis H1 : θ > θ0 rejects
the null hypothesis if Zn > g1−α , where we recall that g1−α is the (1 − α)th
quantile of the distribution Gn .
32. Consider the framework of the smooth function model where σ, which de-
notes the asymptotic variance of n1/2 θ̂n , is unknown, and the test statistic
Tn = n1/2 σ̂n−1 (θ̂n −θ0 ) follows the distribution Hn when θ0 is the true value
of θ.
and β̃ = n−1 . Note that even though the prior distribution is improper,
the posterior distribution is not.
b. Compute the Bayes estimator of θ using the squared error loss function.
Is this estimator consistent and asymptotically Normal in accordance
with Theorem 10.15?
10.7.2 Experiments
which is the proportion of values in the sample that are equal to zero,
and θ̃n = exp(−X̄n ). Use the 1000 samples to estimate the bias, standard
error, and the mean squared error for each case. Discuss the results of the
simulations in terms of the theoretical findings of Exercise 11. Repeat the
experiment for θ = 1, 2, and 5 with n = 5, 10, 25, 50, and 100.
3. Write a program in R that simulates 1000 samples of size n from a dis-
tribution F with mean θ, where both n and F are specified below. For
each sample compute two 90% upper confidence limits for the mean: the
first one of the form X̄n − n−1/2 σ̂n z0.10 and the second of the form X̄n −
n−1/2 σ̂n t0.10,n−1 , and determine whether θ is less than the upper confidence
limit for each method. Use the 1000 samples to estimate the coverage prob-
ability for each method. How do these estimated coverage probabilities
compare for the two methods with relation to the theory presented in this
chapter? Formally analyze your results and determine if there is a signifi-
cant difference between the two methods at each sample size. Use n = 5,
10, 25, 50 and 100 for each of the distributions listed below.
a. N(0, 1)
b. T(3)
c. Exponential(1)
d. Exponential(10)
e. LaPlace(0, 1)
f. Uniform(0, 1)
4. Write a program in R that simulates 1000 samples of size n from a N(θ, 1)
distribution. For each sample compute the sample mean given by X̄n and
the Hodges super-efficient estimator θ̂n = X̄n + (a − 1)X̄n δn where δn =
δ{|X̄n |; [0, n−1/4 )}. Using the results of the 1000 simulated samples estimate
the standard error of each estimator for each combination of n = 5, 10, 25,
50, and 100 and a = 0.25, 0.50, 1.00 and 2.00. Repeat the entire experiment
once for θ = 0 and once for θ = 1. Compare the estimated standard errors
of the two estimators for each combination of parameters given above and
comment on the results in terms of the theory presented in Example 10.9.
5. Write a program in R that simulates 1000 samples of size n from a distribu-
tion F with mean θ where n, θ and F are specified below. For each sample
compute the observed confidence level that θ is in the interval Ψ = [−1, 1]
as
Tn−1 [n1/2 σ̂n−1 (θ̂n + 1)] − Tn−1 [n1/2 σ̂n−1 (θ̂n − 1)],
where θ̂n is the sample mean and σ̂n is the sample standard deviation.
Keep track of the average observed confidence level over the 1000 samples.
Repeat the experiment for n = 5, 10, 25, 50 and 100 and comment on the
results in terms of the consistency of the method.
a. F is a N(θ, 1) distribution with θ = 0.0, 0.25, . . . , 2.0.
b. F is a LaPlace(θ, 1) distribution with θ = 0.0, 0.25, . . . , 2.0.
c. F is a Cauchy(θ, 1) distribution with θ = 0.0, 0.25, . . . , 2.0, where θ is taken to be the median (instead of the mean) of the distribution.
6. Write a program in R that simulates 1000 samples of size n from a dis-
tribution F with mean θ, where n, F , and θ are specified below. For each
sample test the null hypothesis H0 : θ ≤ 0 against the alternative hy-
pothesis H1 : θ > 0 using two different tests. In the first test the null
hypothesis is rejected if n1/2 σ̂n−1 X̄n > z0.90 and in the second test the null
hypothesis is rejected if n1/2 σ̂n−1 X̄n > t0.90,n−1 . Keep track of how many
times each test rejects the null hypothesis over the 1000 replications for
θ = 0.0, 0.10σ, 0.20σ, . . . , 2.0σ where σ is the standard deviation of F . Plot
the number of rejections against θ for each test on the same set of axes,
and repeat the experiment for n = 5, 10, 25, 50 and 100. Discuss the results
in terms of the power functions of the two tests.
a. F is a N(θ, 1) distribution.
b. F is a LaPlace(θ, 1) distribution.
c. F is an Exponential(θ) distribution.
d. F is a Cauchy(θ, 1) distribution with θ = 0.0, 0.25, . . . , 2.0, where θ is taken to be the median (instead of the mean) of the distribution.
7. The interpretation of frequentist results of Bayes estimators is somewhat
difficult because of the sometimes conflicting views of the resulting theo-
retical properties. This experiment will look at two ways of looking at the
asymptotic results of this section.
a. Write a program in R that simulates a sample of size n from a N(0, 1)
distribution and computes the Bayes estimator under the assumption
that the mean parameter θ has a N(0, 1/2) prior distribution. Repeat the
experiment 1000 times for n = 10, 25, 50 and 100, and make a density
histogram of the resulting Bayes estimates for each sample size. Place a
comparative plot of the asymptotic Normal distribution for θ̃n as spec-
ified by Theorem 10.15. How well do the distributions agree, particularly
when n is larger?
b. Write a program in R that first simulates θ from a N(0, 1/2) prior distri-
bution and then simulates a sample of size n from a N(θ, 1) distribution,
conditional on the simulated value of θ. Compute the Bayes estimator of
θ for each sample. Repeat the experiment 1000 times for n = 10, 25, 50
and 100, and make a density histogram of the resulting Bayes estimates
for each sample size. Place a comparative plot of the asymptotic Nor-
mal distribution for θ̃n as specified by Theorem 10.15. How well do the
distributions agree, particularly when n is larger?
c. Write a program in R that first simulates θ from a N(1/2, 1/2) prior distri-
bution and then simulates a sample of size n from a N(θ, 1) distribution,
conditional on the simulated value of θ. Compute the Bayes estimator
of θ for each sample under the assumption that θ has a N(0, 1/2) prior
distribution. Repeat the experiment 1000 times for n = 10, 25, 50 and
100, and make a density histogram of the resulting Bayes estimates for
each sample size. Place a comparative plot of the asymptotic Normal
distribution for θ̃n as specified by Theorem 10.15. How well do the dis-
tributions agree, particularly when n is larger? What effect does the
misspecification of the prior have on the results?
CHAPTER 11
Nonparametric Inference
I had assumed you’d be wanting to go to the bank. As you’re paying close atten-
tion to every word I’ll add this: I’m not forcing you to go to the bank, I’d just
assumed you wanted to.
The Trial by Franz Kafka
11.1 Introduction
distribution F where F ∈ A, a collection of joint distributions in R^n. The function T(X1, . . . , Xn) is distribution free over A if the distribution of T is the same for all F ∈ A.
Example 11.1. Let {Xn}∞_{n=1} be a set of independent and identically distributed random variables from a continuous distribution F that has median equal to θ. Consider the statistic

T(X1, . . . , Xn) = Σ_{k=1}^{n} δ{Xk − θ; (−∞, 0]}.
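Although the example is interrupted at this point, the statistic T simply counts the number of observations at or below θ, and its distribution-freeness over continuous distributions with median θ is easy to examine by simulation. The R sketch below, with an arbitrary sample size and two arbitrary choices of F, compares the simulated distributions of T with the common Binomial(n, 1/2) law that both should follow.

# T = number of X_k at or below the median theta is distribution free over
# continuous distributions with median theta.
set.seed(9)
n <- 15; B <- 20000
T.normal <- replicate(B, sum(rnorm(n)   <= 0))   # N(0, 1), median 0
T.cauchy <- replicate(B, sum(rcauchy(n) <= 0))   # Cauchy(0, 1), median 0
p.normal <- as.numeric(table(factor(T.normal, levels = 0:n))) / B
p.cauchy <- as.numeric(table(factor(T.cauchy, levels = 0:n))) / B
round(rbind(normal = p.normal, cauchy = p.cauchy,
            binomial = dbinom(0:n, size = n, prob = 0.5)), 3)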
A proof of Theorem 11.2 can be found in Section 2.4 of Randles and Wolfe
(1979).
Example 11.3. Let (X1 , Y1 ), . . . , (Xn , Yn ) be a set of independent and identi-
cally distributed paired random variables from a continuous bivariate distribu-
tion F . Let G and H be the marginal distributions of Xn and Yn , respectively.
Assume that H(x) = G(x − θ) for some shift parameter θ and that G is a
symmetric distribution about zero. Let Zi = Xi −Yi and let R1 , . . . , Rn denote
the ranks of the absolute differences |Z1|, . . . , |Zn|. Define Ci = δ{Zi; (0, ∞)} for i = 1, . . . , n, with C′ = (C1, . . . , Cn) and R′ = (R1, . . . , Rn). The Wilcoxon signed rank statistic is then given by Wn = C′R, which is the sum of the ranks of the absolute differences that correspond to positive differences.
H0 : θ = θ0 the value of θ in Wn is replaced by θ0 . Under this null hypothesis
it follows from Theorem 11.2 that Wn is distribution free. The distribution of
Wn under the null hypothesis can be found by enumerating the value of Wn
over all possible equally likely permutations of the elements of C and R. For
further details see Section 3.1 of Hollander and Wolfe (1999).
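The computation described in Example 11.3 is easy to carry out directly. The following R sketch uses simulated paired data with an arbitrary structure and tests the null value θ0 = 0; these choices are illustrative only.

## Computing the Wilcoxon signed rank statistic of Example 11.3.
set.seed(11)
n <- 12
x <- rnorm(n)
y <- x + rnorm(n, mean = 0.25)          # paired observations with a small shift
z <- (x - y) - 0                        # differences, with theta0 = 0
r <- rank(abs(z))                       # ranks of the absolute differences
W <- sum(r[z > 0])                      # sum of ranks of the positive differences
W
wilcox.test(x, y, paired = TRUE)$statistic   # the same value from base R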
The analysis of the asymptotic behavior of test statistics like those studied
in Examples 11.1 to 11.3 is the subject of this chapter. The next section will
develop a methodology for showing that such statistics are asymptotically
Normal using the theory of U -statistics.
This implies that h∗ (x1 ) must be a linear function of the form a + bx1 for
some constants a and b. However, direct integration implies that
$$\int_{\eta - \frac{1}{2}}^{\eta + \frac{1}{2}} (a + bx_1)\,dx_1 = a + b\eta$$
where Bn,r is a set that contains all vectors whose elements correspond to
unique selections of r integers from the set {1, . . . , n}, taken without replace-
ment.
Example 11.5. Let X1 , . . . , Xn be a set of independent and identically dis-
tributed random variables from a distribution F with finite mean θ. Let
h(x) = x and note that E[h(Xi )] = θ for all i = 1, . . . , n. Therefore
$$U_n = n^{-1}\sum_{i=1}^{n} X_i,$$
$$W_n = \sum_{i=1}^{n} R_i\,\delta\{Z_i; (0, \infty)\} = \sum_{i=1}^{n} \tilde{R}_i\,\delta\{Z_{(i)}; (0, \infty)\}, \qquad (11.5)$$
where R̃i is the absolute rank associated with Z(i) . There are two possibilities
for the ith term in the sum in Equation (11.5). The term will be zero if Zi < 0.
If Zi > 0 then the ith term will add R̃i to the sum. Suppose for the moment
that R̃i = r for some r ∈ {1, . . . , n}. This means that there are r − 1 values
from |Z1 |, . . . , |Zn | such that |Zj | < |Z(i) | along with the one value that equals
|Z(i) |. Hence, the ith term will add
$$\sum_{j=1}^{n}\delta\{|Z_j|; (0, |Z_{(i)}|]\} = \sum_{j=1}^{n}\delta\{|Z_{(j)}|; (0, |Z_{(i)}|]\},$$
to the sum in Equation (11.5) when Zi > 0. Now let us combine these two
conditions. Let Z(1) , . . . , Z(n) denote the order statistics of Z1 , . . . , Zn . Let
i < j and note that δ{Z(i) + Z(j) ; (0, ∞)} = 1 if and only if Z(j) > 0 and
|Z(i) | < Z(j) . To see why this is true consider the following cases. If Z(j) < 0
then Z(i) < 0 since i < j and hence δ{Z(i) + Z(j) ; (0, ∞)} = 0. Similarly, it is
possible that Z(j) > 0 but |Z(i)| > |Z(j)|. This can only occur when Z(i) < 0, or else Z(i) would be larger than Z(j), which cannot occur because i < j.
Therefore |Z(i) | > |Z(j) | and Z(i) < 0 implies that Z(i) + Z(j) < 0 and hence
δ{Z(i) + Z(j) ; (0, ∞)} = 0. However, if Z(j) > 0 and |Z(i) | < |Z(j) | then it must
follow that δ{Z(i) + Z(j) ; (0, ∞)} = 1. Therefore, it follows that the term in
Equation (11.5) can be written as
$$\sum_{j=1}^{n}\delta\{|Z_{(j)}|; (0, Z_{(i)})\} = \sum_{j=1}^{i}\delta\{Z_{(i)} + Z_{(j)}; (0, \infty)\},$$
where the upper limit of the sum on the right hand side of the equation reflects
the fact that we only add in observations less than or equal to Z(j) , which is
the signed rank of Z(i) . Therefore, it follows that
$$W = \sum_{i=1}^{n}\sum_{j=1}^{i}\delta\{Z_{(i)} + Z_{(j)}; (0, \infty)\} = \sum_{i=1}^{n}\sum_{j=1}^{i}\delta\{Z_i + Z_j; (0, \infty)\} = \sum_{i=1}^{n}\delta\{2Z_i; (0, \infty)\} + \sum_{i=1}^{n}\sum_{j=i+1}^{n}\delta\{Z_i + Z_j; (0, \infty)\}. \qquad (11.6)$$
The first term in Equation (11.6) can be written as nU1,n where U1,n is a
U -statistic of the form
$$U_{1,n} = n^{-1}\sum_{i=1}^{n}\delta\{2Z_i; (0, \infty)\},$$
and the second term in Equation (11.6) can be written as $\binom{n}{2}U_{2,n}$ where $U_{2,n}$
If the terms in this sum were independent of one another then we could exchange the variance and the sum. However, it is clear that if we choose two distinct elements b and b′ from $B_{n,r}$, the two terms $h(X_{b_1}, \ldots, X_{b_r})$ and $h(X_{b'_1}, \ldots, X_{b'_r})$ could have as many as r − 1 of the random variables from the set $\{X_1, \ldots, X_n\}$ in common, but may have as few as zero of these variables in common. Note that the two terms could not have all r variables in common
because we are assuming that b and b0 are distinct. We also note that if n
was not sufficiently large then there may also be a lower bound on the number
of random variables that the two terms could have in common. In general we
will assume that n is large enough so that the lower limit is always zero. In
this case we have that
$$V\left[\sum_{b \in B_{n,r}} h(X_{b_1}, \ldots, X_{b_r})\right] = \sum_{b \in B_{n,r}}\sum_{b' \in B_{n,r}} C[h(X_{b_1}, \ldots, X_{b_r}), h(X_{b'_1}, \ldots, X_{b'_r})].$$
To simplify this expression let us consider the case where the sets $\{b_1, \ldots, b_r\}$ and $\{b'_1, \ldots, b'_r\}$ have exactly c elements in common. For example, we can
consider the term,
C[h(X1 , . . . , Xc , Xc+1 , . . . , Xr ), h(X1 , . . . , Xc , Xr+1 , . . . , X2r−c )],
where we assume that n > 2r − c. Now consider comparing this covariance to
another term that also has exactly c variables in common such as
C[h(X1 , . . . , Xc , Xc+1 , . . . , Xr ), h(X1 , . . . , Xc , Xr+2 , . . . , X2r−c+1 )].
Note that the two covariances will be equal because the joint distribution
of Xr+1 , . . . , X2r−c is exactly the same as the joint distribution of Xr+2 , . . .,
X2r−c+1 because X1 , . . . , Xn are assumed to be a sequence of independent
and identically distributed random variables. This fact, plus the symmetry of
the function h will imply that any two terms that have exactly c variables in
common will have the same covariance. Therefore, define
ζc = C[h(X1 , . . . , Xc , Xc+1 , . . . , Xr ), h(X1 , . . . , Xc , Xr+1 , . . . , X2r−c )],
for c = 0, . . . , r−1. The number of terms of this form in the sum of covariances
equals the number of ways to choose r indices from the set of n, which is $\binom{n}{r}$,
The variance of a U -statistic can be quite complicated, but it does turn out
that the leading term in Equation (11.8) is dominant from an asymptotic
viewpoint.
Theorem 11.4. Let X1 , . . . , Xn be a set of independent and identically dis-
tributed random variables from a distribution F . Let Un be an rth -order U -
statistic with symmetric kernel function h(x1 , . . . , xr ). If E[h2 (X1 , . . . , Xr )] <
$\infty$ then $V(U_n) = n^{-1}r^2\zeta_1 + o(n^{-1})$, as $n \to \infty$.
where ζ1 , . . . , ζr are finite because we have assumed that E[h2 (X1 , . . . , Xr )] <
∞. It is the asymptotic behavior of the coefficients of ζc that will determine
the behavior of nV (Un ). We first note that for c = 1 we have that
$$n\binom{n}{r}^{-1}\binom{r}{c}\binom{n-r}{r-c}\zeta_c = n\binom{n}{r}^{-1}\binom{r}{1}\binom{n-r}{r-1}\zeta_1 = \frac{r^2[(n-r)!]^2}{(n-2r+1)!(n-1)!}\zeta_1.$$
Canceling the identical terms in the numerator and the denominator of the
right hand side of this equation yields
$$n\binom{n}{r}^{-1}\binom{r}{1}\binom{n-r}{r-1}\zeta_1 = \frac{r^2(n-r)\cdots(n-2r+1)}{(n-1)\cdots(n-r+1)}\zeta_1 = r^2\zeta_1\prod_{i=1}^{r-1}\frac{n-i+1-r}{n-i},$$
where it is instructive to note that the number of terms in the product does
not depend on n. Therefore, since
$$\lim_{n\to\infty} r^2\prod_{i=1}^{r-1}\frac{n-i+1-r}{n-i} = r^2,$$
$$\frac{[(n-r)!]^2}{(n-2r+c)!(n-1)!} = \frac{(n-r)\cdots(n-2r+c+1)}{(n-1)\cdots(n-r+1)} = \left[\prod_{i=1}^{c-1}(n-r+c-i)^{-1}\right]\left[\prod_{i=1}^{r-c}\frac{n-r-i+1}{n-i}\right],$$
where again we note that the number of terms in each of the products does
not depend on n. Therefore, we have that
$$\lim_{n\to\infty}\prod_{i=1}^{c-1}(n-r+c-i)^{-1} = 0,$$
and
$$\lim_{n\to\infty}\prod_{i=1}^{r-c}\frac{n-r-i+1}{n-i} = 1.$$
Therefore, it follows that
$$\lim_{n\to\infty} n\binom{n}{r}^{-1}\binom{r}{c}\binom{n-r}{r-c}\zeta_c = 0,$$
and hence the result follows.
Example 11.10. Let X1 , . . . , Xn be a set of independent and identically
distributed random variables from a distribution with mean θ and finite vari-
ance σ 2 . Continuing with Example 11.9 we consider estimating the param-
eter θ2 with a U -statistic of degree r = 2 with symmetric kernel function
h(x1 , x2 ) = x1 x2 . The variance of this U -statistic was computed in Example
11.9 as
$$V(U_n) = \frac{4\theta^2\sigma^2}{n} + \frac{2\sigma^4}{n(n-1)}.$$
One can verify the result of Theorem 11.4 by noting that
$$\lim_{n\to\infty} nV(U_n) = \lim_{n\to\infty}\left[4\theta^2\sigma^2 + \frac{2\sigma^4}{n-1}\right] = 4\theta^2\sigma^2 = r^2\zeta_1.$$
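A quick Monte Carlo check of this limit is also possible. In the R sketch below the data are drawn from a N(2, 1) distribution, so that θ = 2, σ² = 1 and the limiting value of nV(Uₙ) should be 4θ²σ² = 16; the sample sizes, number of replications, and sampling distribution are arbitrary illustrative choices.

## Monte Carlo check of Example 11.10: n * Var(U_n) should approach 16 here.
u.stat <- function(x) {
  pairs <- combn(x, 2)                # all pairs (i, j) with i < j
  mean(pairs[1, ] * pairs[2, ])       # average of the kernel h(x1, x2) = x1 * x2
}
set.seed(1)
for (n in c(10, 50, 200)) {
  u <- replicate(2000, u.stat(rnorm(n, mean = 2)))
  cat("n =", n, "  n * Var(U_n) =", round(n * var(u), 2), "\n")
}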
We are now in a position to develop conditions under which U -statistics are
asymptotically normal. We begin with the case where r = 1. In this case the
U -statistic defined in Equation (11.4) has the form
$$U_n = n^{-1}\sum_{i=1}^{n} h(X_i),$$
which is the sum of independent and identically distributed random variables. Therefore, it follows from Theorem 4.20 (Lindeberg and Lévy) that $n^{1/2}\zeta_1^{-1/2}(U_n - \theta) \xrightarrow{d} Z$ as $n \to \infty$ where Z is a N(0, 1) random variable.
For the case when r > 1 the problem becomes more complicated as the terms
in the sum in a U -statistic are no longer necessarily independent. The approach
for establishing asymptotic normality for these types of U -statistics is based
on finding a function of the observed data that has the same asymptotic
behavior as the U -statistic, but is the sum of independent and identically
distributed random variables. Theorem 4.20 can then be applied to this related
function, thus establishing the asymptotic normality of the U -statistic. To
simplify matters, we will actually first center the U -statistic about the origin.
That is, if Un is a U -statistic of order r, then we will actually be working with
the function Un − θ which has expectation zero. The method for finding a
function of the data that is a sum of independent and identically distributed
terms that has the same asymptotic behavior as Un − θ is based on finding a
projection of Un − θ onto the space of functions that are sums of independent
and identically distributed random variables.
Recall that a projection of a point in a metric space on a subspace is accom-
plished by finding a point in the subspace that is closest to the specified point.
For example, we can consider the vector space R3 with vector x ∈ R3 . Let P
denote a two-dimensional subspace of R3 corresponding to a plane. Then the
vector x is projected onto P by finding a vector $p \in P$ that minimizes $\|x - p\|$.
For our purpose we will consider projecting Un − θ onto the space of functions
given by
$$\mathcal{V}_n = \left\{V_n : V_n = \sum_{i=1}^{n} k(X_i)\right\},$$
where k is a real-valued function. The function k that will result from this
projection will usually depend on some unknown parameters of F . However,
this does not affect the usefulness of the results since we are not actually
interested in computing the function; we only need to establish its asymptotic
properties.
Consider a U-statistic of order r given by $U_n$. We wish to project $U_n - \theta$ onto the space $\mathcal{V}_n$. In order to do this we need a measure of distance between $U_n - \theta$ and functions that are in $\mathcal{V}_n$. For this we will use the expected square distance between the two functions. That is, we use $\|U_n - \theta - V_n\| = E[(U_n - \theta - V_n)^2]$.
Theorem 11.5. Let Un be a U -statistic of order r calculated on X1 , . . . , Xn , a
set of independent and identically distributed random variables. The projection
of Un − θ onto Vn is given by
$$V_n = rn^{-1}\sum_{i=1}^{n}\{E[h(X_i, X_2, \ldots, X_r)|X_i] - \theta\}. \qquad (11.11)$$
Proof. In order to prove this result we must show that $V_n \in \mathcal{V}_n$ and that $V_n$ minimizes $\|U_n - \theta - V_n\|$. To show that $V_n \in \mathcal{V}_n$, we need only note that the conditional expectation $E[h(X_i, X_2, \ldots, X_r)|X_i]$ is only a function of $X_i$ and
hence we can take the function k(Xi ) to be defined as
k̃(Xi ) = rn−1 {E[h(Xi , X2 , . . . , Xr )|Xi ] − θ}.
To prove that $V_n$ minimizes $\|U_n - \theta - V\|$ we let V be an arbitrary member of $\mathcal{V}_n$, and note that
$$\|U_n - \theta - V\| = E[(U_n - \theta - V)^2] = E\{[(U_n - \theta - V_n) + (V_n - V)]^2\} = E[(U_n - \theta - V_n)^2] + E[(V_n - V)^2] + 2E[(U_n - \theta - V_n)(V_n - V)].$$
Now, suppose that V has the form
$$V = \sum_{i=1}^{n} k(X_i),$$
$$E(U_n|X_i) = \binom{n}{r}^{-1}\left[\binom{n-1}{r} + \binom{n-1}{r-1}\right]\theta + \binom{n}{r}^{-1}\binom{n-1}{r-1}r^{-1}n\tilde{k}(X_i) = \theta + \tilde{k}(X_i).$$
Similarly, we have that
$$E(V_n|X_i) = E\left[\sum_{j=1}^{n}\tilde{k}(X_j)\,\bigg|\,X_i\right] = \sum_{j=1}^{n}E[\tilde{k}(X_j)|X_i].$$
Now, when i = j we have that $E[\tilde{k}(X_i)|X_i] = \tilde{k}(X_i)$ and when $i \neq j$, Theorem
A.17 implies
$$E[\tilde{k}(X_j)|X_i] = E[\tilde{k}(X_j)] = rn^{-1}E\{E[h(X_j, X_2, \ldots, X_r)|X_j]\} - rn^{-1}\theta = rn^{-1}E[h(X_j, X_2, \ldots, X_r)] - rn^{-1}\theta = 0.$$
Therefore, E(Vn |Xi ) = k̃(Xi ) and E(Un − θ − Vn |Xi ) = 0, from which we can
conclude that E{(Un − θ − Vn )[k̃(Xi ) − k(Xi )]} = 0 for all i = 1, . . . , n. Hence,
it follows that $\|U_n - \theta - V\| = E[(U_n - \theta - V_n)^2] + E[(V_n - V)^2]$. Because both
terms are non-negative and the first term does not depend on V , it follows
that minimizing ||Un − θ − V || is equivalent to minimizing the second term
E[(Vn − V )2 ], which can be made zero by choosing V = Vn . Therefore, Vn
minimizes ||Un − θ − V || and the result follows.
Proof. The proof of this result proceeds in two parts. We first establish that
the projection of Un − θ onto Vn has an asymptotic Normal distribution. We
then prove that kUn − θ − Vn k converges to zero as n → ∞, which will then be
used to establish that the two statistics have the same limiting distribution.
Let Un have the form
$$U_n = \binom{n}{r}^{-1}\sum_{b \in B_{n,r}} h(X_{b_1}, \ldots, X_{b_r}),$$
$$\|U_n - \theta - V_n\| = E[(U_n - \theta - V_n)^2] = E[(U_n - \theta)^2] - 2E[V_n(U_n - \theta)] + E[V_n^2]. \qquad (11.18)$$
To evaluate the first term on the right hand side of Equation (11.18) we note
that E(Un ) = θ and hence Theorem 11.4 implies that
$$E[(U_n - \theta)^2] = V(U_n) = n^{-1}r^2\zeta_1 + o(n^{-1}), \qquad (11.19)$$
as n → ∞. To evaluate the second term on the right hand side of Equation
(11.18),
" #
n −1 X
X n
E[Vn (Un − θ)] = E k̃(Xi ) h(Xb1 , . . . , Xbr ) − θ
i=1
r
b∈Bn,r
−1 Xn
n X
= E{k̃(Xi )[h(Xb1 , . . . , Xbr ) − θ]}.
r i=1 b∈Bn,r
Now, if $i \notin \{b_1, \ldots, b_r\}$ then $\tilde{k}(X_i)$ and $h(X_{b_1}, \ldots, X_{b_r})$ will be independent and hence
$$E\{\tilde{k}(X_i)[h(X_{b_1}, \ldots, X_{b_r}) - \theta]\} = E[\tilde{k}(X_i)]E[h(X_{b_1}, \ldots, X_{b_r}) - \theta] = 0.$$
For the remaining $\binom{n-1}{r-1}$ terms where $i \in \{b_1, \ldots, b_r\}$, we apply Theorem A.17
to find that
$$E\{\tilde{k}(X_i)[h(X_{b_1}, \ldots, X_{b_r}) - \theta]\} = E[E\{\tilde{k}(X_i)[h(X_{b_1}, \ldots, X_{b_r}) - \theta]|X_i\}] = E[\tilde{k}(X_i)E\{h(X_{b_1}, \ldots, X_{b_r}) - \theta|X_i\}] = nr^{-1}E[\tilde{k}^2(X_i)] = n^{-1}r\zeta_1.$$
Therefore, it follows that
$$E[V_n(U_n - \theta)] = \sum_{i=1}^{n}\frac{r\binom{n-1}{r-1}\zeta_1}{n\binom{n}{r}} = r^2 n^{-1}\zeta_1. \qquad (11.20)$$
To evaluate the third term on the right hand side of Equation (11.18), we have
that
" # n
X n X X n X n
E(Vn2 ) = E k̃(Xi ) k̃(Xj ) = E[k̃(Xi )k̃(Xj )].
i=1 j=1 i=1 j=1
Now when $i \neq j$, $\tilde{k}(X_i)$ and $\tilde{k}(X_j)$ are independent and $E[\tilde{k}(X_i)\tilde{k}(X_j)] = E[\tilde{k}(X_i)]E[\tilde{k}(X_j)] = 0$. Therefore,
$$E(V_n^2) = \sum_{i=1}^{n}E[\tilde{k}^2(X_i)] = nE[\tilde{k}^2(X_1)] = n^{-1}r^2\zeta_1. \qquad (11.21)$$
$$U_{1,n} = n^{-1}\sum_{i=1}^{n}\delta\{2Z_i; (0, \infty)\},$$
and
$$U_{2,n} = \frac{2}{n(n-1)}\sum_{i=1}^{n}\sum_{j=i+1}^{n}\delta\{Z_i + Z_j; (0, \infty)\}.$$
$$n^{1/2}\binom{n}{2}^{-1}[W - E(W)] = \frac{2n^{1/2}}{n-1}[U_{1,n} - E(U_{1,n})] + n^{1/2}[U_{2,n} - E(U_{2,n})]. \qquad (11.22)$$
For the first term on the right hand side of Equation (11.22) we note that $V(U_{1,n}) = n^{-1}V(\delta\{2Z_i; (0, \infty)\})$ where
$$E(\delta\{2Z_i; (0, \infty)\}) = P(\delta\{2Z_i; (0, \infty)\} = 1) = P(Z_i > 0) = \tfrac12,$$
since $Z_i$ has a symmetric distribution about zero. Since $\delta^2\{2Z_i; (0, \infty)\} = \delta\{2Z_i; (0, \infty)\}$ it follows also that $E(\delta^2\{2Z_i; (0, \infty)\}) = \tfrac12$. Therefore, we have that $V(\delta\{2Z_i; (0, \infty)\}) = \tfrac14$, and hence $V(U_{1,n}) = \tfrac14 n^{-1}$. Theorem 3.10 (Weak Law of Large Numbers) then implies that $U_{1,n} \xrightarrow{p} \tfrac12$ as $n \to \infty$. Noting that $2n^{1/2}(n-1)^{-1} \to 0$ as $n \to \infty$ we can then apply Theorem 4.11 (Slutsky) to find that $2n^{1/2}(n-1)^{-1}[U_{1,n} - E(U_{1,n})] \xrightarrow{p} 0$ as $n \to \infty$. Therefore, it follows that the asymptotic distribution of $n^{1/2}\binom{n}{2}^{-1}[W - E(W)]$ is the same as that of $n^{1/2}[U_{2,n} - E(U_{2,n})]$. We wish to apply Theorem 11.6 to the second term
on the right hand side of Equation (11.22), and therefore we need to verify the
assumptions required for Theorem 11.6. Let f be the density of Z1 where, by
assumption, f is symmetric about zero. Noting that the independence between
$Z_i$ and $Z_j$ implies that the joint density of $Z_i$ and $Z_j$ is $f(z_i)f(z_j)$,
it follows that
$$E[\delta^2\{Z_i + Z_j; (0, \infty)\}] = E[\delta\{Z_i + Z_j; (0, \infty)\}] = P(Z_i + Z_j > 0) = \int_{-\infty}^{\infty}\int_{-z_i}^{\infty} f(z_i)f(z_j)\,dz_j\,dz_i = \int_{-\infty}^{\infty} f(z_i)\left[\int_{-z_i}^{\infty} f(z_j)\,dz_j\right]dz_i = \int_{-\infty}^{\infty} f(z_i)[1 - F(-z_i)]\,dz_i = \int_{-\infty}^{\infty} F(z_i)f(z_i)\,dz_i = \int_{0}^{1} t\,dt = \tfrac12,$$
where we have used the fact that the symmetry of f implies that $1 - F(-z) = F(z)$. Since $\tfrac12 < \infty$ we have verified that $E[h^2(X_1, \ldots, X_r)] < \infty$. To verify the
second assumption we note that
$$\zeta_1 = E[\delta\{Z_i + Z_j; (0, \infty)\}\delta\{Z_i + Z_k; (0, \infty)\}] - \tfrac14 = P(\{Z_i + Z_j > 0\} \cap \{Z_i + Z_k > 0\}) - \tfrac14 = \int_{-\infty}^{\infty}\int_{-z_i}^{\infty}\int_{-z_i}^{\infty} f(z_i)f(z_j)f(z_k)\,dz_k\,dz_j\,dz_i - \tfrac14 = \int_{-\infty}^{\infty} f(z_i)\int_{-z_i}^{\infty} f(z_j)[1 - F(-z_i)]\,dz_j\,dz_i - \tfrac14 = \int_{-\infty}^{\infty} f(z_i)F^2(z_i)\,dz_i - \tfrac14 = \int_{0}^{1} t^2\,dt - \tfrac14 = \tfrac13 - \tfrac14 = \tfrac{1}{12} > 0.$$
Hence, the second assumption is verified. Theorem 11.6 then implies that $n^{1/2}[U_{2,n} - \tfrac12] \xrightarrow{d} Z_2$ as $n \to \infty$, where $Z_2$ has a N(0, $\tfrac13$) distribution, and therefore
$$n^{1/2}\binom{n}{2}^{-1}[W - E(W)] \xrightarrow{d} Z_2,$$
as n → ∞. Further calculations can be used to refine this result to find that
$$\frac{W - \tfrac14 n(n+1)}{[\tfrac{1}{24}n(n+1)(2n+1)]^{1/2}} \xrightarrow{d} Z, \qquad (11.23)$$
as n → ∞ where Z has a N(0, 1) distribution. See Exercise 5. This result is
suitable for using W to test the null hypothesis H0 : θ = 0 using approximate
rejection regions. Figures 11.1–11.3 plot the exact distribution of W under
the null hypothesis for n = 5, 7, and 10. It is clear in Figure 11.3 that the
Figure 11.1 The exact distribution of the signed-rank statistic when n = 5.
normal approximation should work well for this sample size and larger. Table
11.1 compares some exact quantiles of the distribution of W with some given
by the normal approximation. Note that Equation (11.23) implies that the α
quantile of the distribution of W can be approximated by
$$\tfrac14 n(n+1) + z_\alpha\left[\tfrac{1}{24}n(n+1)(2n+1)\right]^{1/2}. \qquad (11.24)$$
Figure 11.2 The exact distribution of the signed-rank statistic when n = 7.
Figure 11.3 The exact distribution of the signed-rank statistic when n = 10.
Table 11.1 A comparison of the exact quantiles of the signed rank statistic against
those given by the normal approximation given in Equation (11.24). The approximate
quantiles have been rounded to the nearest integer.
Exact Quantiles Normal Approximation Relative Error (%)
n 0.10 0.05 0.01 0.10 0.05 0.01 0.10 0.05 0.01
5 12 14 15 10 11 14 16.7 21.5 6.7
6 17 18 21 14 15 19 17.6 16.7 9.5
7 22 24 27 18 20 24 18.2 16.7 11.1
8 27 30 34 23 26 31 14.8 13.3 8.8
9 34 36 41 29 32 38 14.7 11.1 7.3
10 40 44 49 35 39 45 12.5 11.4 8.16
25 211 224 248 198 211 236 6.2 5.8 4.8
50 771 808 877 745 783 853 3.4 3.1 2.7
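The exact null distribution of W is available in R through the psignrank and qsignrank functions, so a comparison of this type is easy to reproduce. The sketch below takes the exact upper critical value at level α to be the (1 − α) quantile returned by qsignrank and computes the approximation from Equation (11.24) directly; this quantile convention is an assumption, so the resulting integers may differ slightly from those reported in Table 11.1.

## Exact signed rank quantiles versus the Normal approximation in (11.24).
exact.q  <- function(n, alpha) qsignrank(1 - alpha, n)
approx.q <- function(n, alpha)
  n * (n + 1) / 4 + qnorm(1 - alpha) * sqrt(n * (n + 1) * (2 * n + 1) / 24)
alpha <- c(0.10, 0.05, 0.01)
for (n in c(5, 10, 25, 50)) {
  cat("n =", n,
      "  exact:",  sapply(alpha, exact.q,  n = n),
      "  approx:", round(sapply(alpha, approx.q, n = n), 1), "\n")
}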
is called a linear rank statistic. The set of constants c(1), . . . , c(n) are called
the regression constants, and the set of constants a(1), . . . , a(n) are called the
scores of the statistic.
The important ingredients of Definition 11.5 are that the elements of the vector
R correspond to a random permutation of the integers in the set {1, . . . , n}
so that a(r1 ), . . . , a(rn ) are random variables, but the regression constants
c(1), . . . , c(n) are not random. The frequent use of ranks in classical nonpara-
metric statistics makes linear rank statistics an important topic.
Example 11.13. Let us consider the Wilcoxon, Mann, and Whitney Rank
Sum test statistic from Example 11.2. That is, let X1 , . . . , Xm be a set of
independent and identically distributed random variables from a continu-
ous distribution F and let Y1 , . . . , Yn be a set of independent and identi-
cally distributed random variables from a continuous distribution G where
G(t) = F (t − θ), where θ is an unknown parameter. Denoting the combined
sample as Z1 , . . . , Zn+m , and the corresponding ranks as R1 , . . . , Rn+m , the
test statistic
$$M_{m,n} = \sum_{i=1}^{n+m} D_i R_i, \qquad (11.25)$$
where $D_i = 1$ when $Z_i$ is from the sample $X_1, \ldots, X_m$ and $D_i = 0$ when $Z_i$ is from the sample $Y_1, \ldots, Y_n$. For simplicity assume that $Z_i = X_i$ for $i = 1, \ldots, m$ and $Z_j = Y_{j-m}$ for $j = m + 1, \ldots, n + m$. Then the statistic
given in Equation (11.25) can be written as a linear rank statistic of the form
given in Definition 11.5 with a(i) = i and c(i) = δ{i; {m + 1, . . . , n + m}} for
all i = 1, . . . , n + m.
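The representation in Example 11.13 can be checked numerically: with a(i) = i and c(i) equal to one for the positions occupied by the second sample, the linear rank statistic is simply the sum of the ranks of the second sample within the combined sample. The data, sample sizes, and shift in the following R sketch are arbitrary choices for illustration.

## The rank sum statistic written as a linear rank statistic.
set.seed(7)
m <- 8; n <- 6
x <- rnorm(m)
y <- rnorm(n) + 0.5
z <- c(x, y)                                  # combined sample, X's first
r <- rank(z)                                  # ranks of the combined sample
cc <- c(rep(0, m), rep(1, n))                 # regression constants c(i)
sum(cc * r)                                   # the linear rank statistic
sum(r[(m + 1):(m + n)])                       # equals the rank sum of the Y's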
Example 11.14. Once again consider the two-sample setup studied in Ex-
ample 11.2, except in this case we will use the median test statistic proposed
by Mood (1950) and Westenberg (1948). For this test we compute the me-
dian of the combined sample X1 , . . . , Xm , Y1 , . . . , Yn and then compute the
number of values in the sample Y1 , . . . , Yn that exceed the median. Note that
under the null hypothesis that θ = 0, where the combined sample all comes from the same distribution, we would expect that half of these values would be above the median. If θ ≠ 0 then we would expect either more than half, or fewer than half, of these values to exceed the median. Therefore, counting the number
of values that exceed the combined median provides a reasonable test statis-
tic for the null hypothesis that θ = 0. This test statistic can be written in
the form of a linear rank statistic with $c(i) = \delta\{i; \{m+1, \ldots, n+m\}\}$ and $a(i) = \delta\{i; \{\tfrac12(n+m+1), \ldots, n+m\}\}$. The form of the score function is derived from the fact that if the rank of a value from the combined sample exceeds $\tfrac12(n+m+1)$ then the corresponding value exceeds the median. This
score function is called the median score function.
Typically, under a null hypothesis the elements of the vector R correspond to
a random permutation of the integers in the set {1, . . . , n} that is uniformly
distributed over the set Rn that is defined in Equation (11.1). In this case
the distribution of S can be found by enumerating the values of S over the n! equally likely permutations in Rn. There are also some general results that
are helpful in the practical application of tests based on linear rank statistics.
Theorem 11.7. Let S be a linear rank statistic of the form
$$S = \sum_{i=1}^{n} c(i)a(r_i).$$
where
$$\bar{a} = n^{-1}\sum_{i=1}^{n} a(i),$$
and
$$\bar{c} = n^{-1}\sum_{i=1}^{n} c(i).$$
Theorem 11.7 can be proven using direct calculations. See Exercise 6. More
complex arguments are required to obtain further properties on the distribu-
tion of linear rank statistics. For example, the symmetry of the distribution
of a linear rank statistic can be established under fairly general conditions
using arguments based on the composition of permutations. This result was
first proven by Hájek (1969).
Theorem 11.8 (Hájek). Let S be a linear rank statistic of the form
$$S = \sum_{i=1}^{n} c(i)a(r_i).$$
Let $c_{(1)}, \ldots, c_{(n)}$ and $a_{(1)}, \ldots, a_{(n)}$ denote the ordered values of $c(1), \ldots, c(n)$ and $a(1), \ldots, a(n)$, respectively. Suppose that R is a vector whose elements correspond to a random permutation of the integers in the set $\{1, \ldots, n\}$ that is uniformly distributed over Rn. If $a_{(i)} + a_{(n+1-i)}$ or $c_{(i)} + c_{(n+1-i)}$ is constant for $i = 1, \ldots, n$, then the distribution of S is symmetric about $n\bar{a}\bar{c}$.
A proof of Theorem 11.8 can be found in Section 8.2 of Randles and Wolfe
(1979).
Example 11.15. Let us consider the rank sum test statistic from Exam-
ple 11.2 which can be written as a linear rank statistic of the form given
in Definition 11.5 with a(i) = i and c(i) = δ{i; {m + 1, . . . , n + m}} for
all i = 1, . . . , n + m. See Example 11.13. Note that $a_{(i)} = a(i)$ and that $a_{(i)} + a_{(m+n-i+1)} = m + n + 1$ for all $i \in \{1, \ldots, m + n\}$ so that Theorem
11.8 implies that the distribution of the rank sum test statistic is symmetric
when the null hypothesis that θ = 0 is true. Some examples of the distribution
are plotted in Figures 11.4–11.6.
where we emphasize that both the regression constants and the scores depend
on the sample size n. Section 8.3 of Randles and Wolfe (1979) points out that
the regression constants c(1, n), . . . , c(n, n) are usually determined by the type
of problem under consideration. For example, in Examples 11.13 and 11.14,
the regression constants are used to distinguish between the two samples.
Therefore, it is advisable to put as few restrictions on the types of regression
constants that can be considered so that as many different types of problems
Figure 11.4 The distribution of the rank sum test statistic when n = m = 3.
Figure 11.5 The distribution of the rank sum test statistic when n = m = 4.
Figure 11.6 The distribution of the rank sum test statistic when n = m = 5.
as possible can be addressed by the asymptotic theory. A typical restriction
is given by Noether’s condition.
Definition 11.6 (Noether). Let c(1, n), . . . , c(n, n) be a set of regression con-
stants for a linear rank statistic of the form given in Equation (11.26). The
regression constants follow Noether’s condition if
$$\lim_{n\to\infty}\frac{\sum_{i=1}^{n} d(i, n)}{\max_{i \in \{1, \ldots, n\}} d(i, n)} = \infty,$$
where
$$d(i, n) = \left[c(i, n) - n^{-1}\sum_{j=1}^{n} c(j, n)\right]^2,$$
for i = 1, . . . , n and n ∈ N.
This condition originates from Noether (1949) and essentially keeps one of the
constants from dominating the others.
Greater latitude is given in choosing the score function, and hence more restrictive assumptions can be imposed on it. In particular, the usual
approach is to consider score functions of the form a(i, n) = α[i(n + 1)−1 ],
where α is a function that does not depend on n and is assumed to have
certain properties.
Definition 11.7. Let α be a function that maps the open unit interval (0, 1)
to R such that,
Not all of the common score functions can be written strictly in the form
a(i, n) = α[i(n + 1)−1 ] where α is a square integrable score function. However,
a slight adjustment to this form will not change the asymptotic behavior of a
properly standardized linear rank statistic, and therefore it suffices to consider
this form.
To establish the asymptotic normality of a linear rank statistic, we will need
to develop several properties of the ranks, order statistics, and square inte-
grable score functions. The first result establishes independence between the
rank vector and the order statistics for sets of independent and identically
distributed random variables.
Theorem 11.9. Let X1 , . . . , Xn be a set of independent and identically dis-
tributed random variables from a continuous distribution F . Let R1 , . . . , Rn
denote the ranks of X1 , . . . , Xn and let X(1) , . . . , X(n) denote the order statis-
tics. Then R1 , . . . , Rn and X(1) , . . . , X(n) are mutually independent.
A proof of Theorem 11.9 can be found in Section 8.3 of Randles and Wolfe
(1979).
Not surprisingly, we also need to establish several properties of square inte-
grable score functions. The property we establish below is related to limiting
properties of the expectation of the score function.
Theorem 11.10 (Hájek and Šidák). Let U be a Uniform(0, 1) random variable and let $\{g_n\}_{n=1}^{\infty}$ be a sequence of functions that map the open unit interval (0, 1) to $\mathbb{R}$, where $g_n(U) \xrightarrow{a.c.} g(U)$ as $n \to \infty$, and
$$\limsup_{n\to\infty} E[g_n^2(U)] \le E[g^2(U)]. \qquad (11.27)$$
Then
$$\lim_{n\to\infty} E\{[g_n(U) - g(U)]^2\} = 0.$$
For the second term on the right hand side of Equation (11.28) we refer to
Theorem II.4.2 of Hájek and Šidák (1967), which shows that
$$\lim_{n\to\infty} E[g_n(U)g(U)] = E[g^2(U)]. \qquad (11.30)$$
In most situations the linear rank statistic is used as a test statistic to test a
null hypotheses that implies that the rank vector R is uniformly distributed
over the set Rn . In this case each component of R has a marginal distribution
that is uniform over the integers $\{1, \ldots, n\}$. Hence, the expectation $E\{\alpha^2[(n+1)^{-1}R_1]\}$ is equivalent to the expectation $E\{\alpha^2[(n+1)^{-1}U_n]\}$ where $U_n$ is a Uniform$\{1, \ldots, n\}$ random variable. It can be shown that $U_n \xrightarrow{d} U$ as $n \to \infty$ where U is a Uniform(0, 1) random variable. The result given in Theorem
11.11 below establishes the fact that the expectations converge as well.
Theorem 11.11. Let α be a square integrable score function and let $U_n$ be a Uniform$\{1, \ldots, n\}$ random variable. Then
$$\lim_{n\to\infty} E\{\alpha^2[(n+1)^{-1}U_n]\} = \lim_{n\to\infty} n^{-1}\sum_{i=1}^{n}\alpha^2[(n+1)^{-1}i] = \int_{0}^{1}\alpha^2(t)\,dt. \qquad (11.31)$$
A proof of Theorem 11.12 can be found in Section 8.3 of Randles and Wolfe
(1979). We finally require a result that shows that α[(n + 1)−1 R1 ], where R1
is the rank of the first observation U1 from a set of n independent and identi-
cally distributed random variables from a Uniform(0, 1) distribution can be
approximated by α(U1 ). The choice of the first observation is by convenience,
what matters is that we have the rank of an observation from a Uniform sam-
ple. This turns out to be the key approximation in developing the asymptotic
Normality of linear rank statistics.
Theorem 11.13. Let α be a square integrable score function and let U1 , . . . , Un
be a set of independent and identically distributed Uniform(0, 1) random vari-
ables. Suppose that $R_1$ is the rank of $U_1$. Then $|\alpha[(n+1)^{-1}R_1] - \alpha(U_1)| \xrightarrow{q.m.} 0$ as $n \to \infty$.
Let
$$Y_n = \sum_{i=2}^{n}\delta\{(U_1 - U_i); [0, \infty)\},$$
and note that Yn is a Binomial(n − 1, γ) random variable where γ = P (U1 −
Ui ≥ 0). Therefore, Theorem A.17 implies that
We are now ready to tackle the asymptotic normality of a linear rank statistic
when the null hypothesis is true. This result, first proven by Hájek (1961), is proven using a similar approach to Theorem 11.6 (Hoeffding), in that the
linear rank statistic is approximated by a simpler statistic whose asymptotic
distribution is known. Therefore, the crucial part of the proof is based on
showing that the two statistics have the same limiting distribution.
Theorem 11.14 (Hájek). Let Sn be a linear rank statistic with regression
constants c(1, n), . . . , c(n, n) and score function a. Suppose that
to establish Equation (11.37). The key idea to proving the desired result is
based on approximating the statistic in Equation (11.38) with one whose
asymptotic distribution can easily be found. Let Wn be a Uniform{1, . . . , n+
1} random variable. Then, it can be shown that $(n+1)^{-1}W_n \xrightarrow{d} U$ as $n \to \infty$
where U is a Uniform(0, 1) random variable. This suggests approximating
(n + 1)−1 Ri with U , or approximating the asymptotic behavior of Sn using
the statistic
$$V_n = \sum_{i=1}^{n}[c(i, n) - \bar{c}_n]\alpha(U_i) + n\bar{c}_n\bar{a}_n.$$
The first step is to find the asymptotic distribution of Vn . Note that
" n #
X
E(Vn ) = E [c(i, n) − c̄n ]α(Ui ) + nc̄n ān
i=1
n
X
= [c(i, n) − c̄n ]E[α(Ui )] + nc̄n ān .
i=1
LINEAR RANK STATISTICS 507
Note that because U1 , . . . , Un are identically distributed, it follows that
$$E(V_n) = E[\alpha(U_1)]\sum_{i=1}^{n}[c(i, n) - \bar{c}_n] + n\bar{c}_n\bar{a}_n = n\bar{c}_n\bar{a}_n = \mu_n,$$
since
$$\sum_{i=1}^{n}[c(i, n) - \bar{c}_n] = 0.$$
Similarly, since U1 , . . . , Un are mutually independent, it follows that
" n #
X
V (Vn ) = V [c(i, n) − c̄n ]α(Ui )
i=1
n
X
= [c(i, n) − c̄n ]2 V [α(Ui )]
i=1
n
X
= V [α(U1 )] [c(i, n) − c̄n ]2
i=1
n
X
= α̃2 [c(i, n) − c̄n ]2 ,
i=1
where
Z 1
α̃2 = V [α(U1 )] = [α(t) − ᾱ]2 dt.
0
Now,
$$V_n - \mu_n = \sum_{i=1}^{n}[c(i, n) - \bar{c}_n]\alpha(U_i) = \sum_{i=1}^{n}[c(i, n) - \bar{c}_n]\alpha(U_i) - \sum_{i=1}^{n}[c(i, n) - \bar{c}_n]\bar{\alpha} = \sum_{i=1}^{n}\{[c(i, n) - \bar{c}_n]\alpha(U_i) - [c(i, n) - \bar{c}_n]\bar{\alpha}\},$$
where
$$\bar{\alpha} = E[\alpha(U_i)] = \int_{0}^{1}\alpha(t)\,dt.$$
Define $Y_{i,n} = [c(i, n) - \bar{c}_n]\alpha(U_i)$, $\mu_{i,n} = [c(i, n) - \bar{c}_n]\bar{\alpha}$, and $\sigma^2_{i,n} = [c(i, n) - \bar{c}_n]^2\tilde{\alpha}^2$, for $i = 1, \ldots, n$. Then Theorem 6.1 (Lindeberg, Lévy, and Feller) will imply that
$$Z_n = n^{1/2}\tau_n^{-1/2}(\bar{Y}_n - \bar{\mu}_n) \xrightarrow{d} Z \qquad (11.39)$$
as n → ∞ where Z has a N(0, 1) distribution, as long as we can show that
the associated assumptions hold. In terms of the notation of Theorem 6.1, we
have that
$$\bar{\mu}_n = n^{-1}\sum_{i=1}^{n}\mu_{i,n} = n^{-1}\sum_{i=1}^{n}[c(i, n) - \bar{c}_n]\bar{\alpha},$$
and
$$\tau_n^2 = \sum_{i=1}^{n}\sigma^2_{i,n} = \sum_{i=1}^{n}[c(i, n) - \bar{c}_n]^2\tilde{\alpha}^2. \qquad (11.40)$$
Now,
$$\tau_n^{-2}\sigma^2_{i,n} = \frac{\tilde{\alpha}^2[c(i, n) - \bar{c}_n]^2}{\tilde{\alpha}^2\sum_{i=1}^{n}[c(i, n) - \bar{c}_n]^2} = \frac{[c(i, n) - \bar{c}_n]^2}{\sum_{i=1}^{n}[c(i, n) - \bar{c}_n]^2}.$$
for every ε > 0. Let ε > 0, then we begin showing this condition by noting
that
where Li (ε, n) = {u : |c(i, n) − c̄n ||α(u) − ᾱ| > ετn }. Now let
$$\Delta_n = \tau_n\left[\max_{1 \le j \le n}|c(j, n) - \bar{c}_n|\right]^{-1},$$
where we note that the integral no longer depends on the index i. Therefore,
Equation (11.40) implies that
$$\tau_n^{-2}\sum_{i=1}^{n}E(|Y_i - \mu_{i,n}|^2\delta\{|Y_i - \mu_{i,n}|; (\varepsilon\tau_n, \infty)\}) \le \tau_n^{-2}\left\{\int_{\bar{L}(\varepsilon,n)}[\alpha(u) - \bar{\alpha}]^2\,du\right\}\left\{\sum_{i=1}^{n}[c(i, n) - \bar{c}_n]^2\right\} = \tau_n^{-2}\tilde{\alpha}^2\left\{\int_{\bar{L}(\varepsilon,n)}\tilde{\alpha}^{-2}[\alpha(u) - \bar{\alpha}]^2\,du\right\}\left\{\sum_{i=1}^{n}[c(i, n) - \bar{c}_n]^2\right\} = \int_{\bar{L}(\varepsilon,n)}\tilde{\alpha}^{-2}[\alpha(u) - \bar{\alpha}]^2\,du.$$
$$\Delta_n = \tau_n\left[\max_{1 \le j \le n}|c(j, n) - \bar{c}_n|\right]^{-1} = \left\{\tilde{\alpha}^2\sum_{i=1}^{n}[c(i, n) - \bar{c}_n]^2\right\}^{1/2}\left[\max_{1 \le j \le n}|c(j, n) - \bar{c}_n|\right]^{-1}.$$
and the second condition is proven, and hence the convergence described in
Equation (11.39) follows.
We will now consider the mean square difference between Sn and Vn . We begin
by noting that
$$S_n - V_n = \sum_{i=1}^{n}[c(i, n) - \bar{c}_n]a(R_i, n) + n\bar{c}_n\bar{a}_n - \sum_{i=1}^{n}[c(i, n) - \bar{c}_n]\alpha(U_i) - n\bar{c}_n\bar{a}_n = \sum_{i=1}^{n}[c(i, n) - \bar{c}_n][a(R_i, n) - \alpha(U_i)].$$
Therefore, using the fact that the conditional distribution of R is still uniform
over Rn conditional on U, we have that
$$E[(S_n - V_n)^2|U = u] = E\left\{\left[\sum_{i=1}^{n}[c(i, n) - \bar{c}_n][a(R_i, n) - \alpha(U_i)]\right]^2\,\bigg|\,U = u\right\}.$$
Denote $c^*(i, n) = c(i, n) - \bar{c}_n$ and $a^*(i, n) = a(i, n) - \alpha(U_i)$, where we note
$$E[(S_n - V_n)^2|U = u] = (n-1)^{-1}\left\{\sum_{i=1}^{n}[a^*(i, n) - \bar{a}^*_n]^2\right\}\left\{\sum_{i=1}^{n}[c^*(i, n) - \bar{c}^*_n]^2\right\} = (n-1)^{-1}\left\{\sum_{i=1}^{n}[a(i, n) - \alpha(U_i) - \bar{a} + \bar{\alpha}_U]^2\right\}\left\{\sum_{i=1}^{n}[c(i, n) - \bar{c}_n]^2\right\},$$
where the last equality follows from the definitions of a∗ (i, n), c∗ (i, n), and
Equation (11.37). We also note that the order statistic U(i) is associated with
rank i, and that
$$\bar{\alpha}_U = n^{-1}\sum_{i=1}^{n}\alpha(U_{(i)}).$$
Recall that for a random variable Z such that V (Z) < ∞, it follows that
$V(Z) \le E(Z^2)$. Applying this formula to the case where Z has a Uniform$\{x_1, \ldots, x_n\}$ distribution implies that
$$\sum_{i=1}^{n}(x_i - \bar{x}_n)^2 \le \sum_{i=1}^{n}x_i^2.$$
Note that
$$E\{[a(R_1^*, n) - \alpha(U_1)]^2|U = u\} = \sum_{i=1}^{n}[a(i, n) - \alpha(U_1)]^2 P(R_1^* = i).$$
Theorem 11.9 implies that $P(R_1^* = i|U = u) = n^{-1}$ for $i = 1, \ldots, n$ and hence
$$E\{[a(R_1^*, n) - \alpha(U_1)]^2|U = u\} = n^{-1}\sum_{i=1}^{n}[a(i, n) - \alpha(U_1)]^2.$$
Therefore,
$$(n-1)^{-1}\left\{\sum_{i=1}^{n}[a(i, n) - \alpha(U_{(i)}) - \bar{a} - \bar{\alpha}_U]^2\right\}\left\{\sum_{i=1}^{n}[c(i, n) - \bar{c}_n]^2\right\} \le n(n-1)^{-1}\left\{\sum_{i=1}^{n}[c(i, n) - \bar{c}_n]^2\right\}E\{[a(R_1^*, n) - \alpha(U_1)]^2|U = u\},$$
so that
$$E[(S_n - V_n)^2|U = u] \le n(n-1)^{-1}\left\{\sum_{i=1}^{n}[c(i, n) - \bar{c}_n]^2\right\}E\{[a(R_1^*, n) - \alpha(U_1)]^2|U = u\}. \qquad (11.42)$$
Using Theorem A.17 and taking the expectation of both sides of Equation
(11.42) yields
$$E\{E[(S_n - V_n)^2|U = u]\} = E[(S_n - V_n)^2],$$
and
$$E\{E\{[a(R_1^*, n) - \alpha(U_1)]^2|U = u\}\} = E\{[a(R_1^*, n) - \alpha(U_1)]^2\},$$
so that
$$E[(S_n - V_n)^2] \le n(n-1)^{-1}\left\{\sum_{i=1}^{n}[c(i, n) - \bar{c}_n]^2\right\}E\{[a(R_1^*, n) - \alpha(U_1)]^2\}.$$
Recalling that
$$\tau_n^2 = \tilde{\alpha}^2\sum_{i=1}^{n}[c(i, n) - \bar{c}_n]^2,$$
we have that
by Theorem 11.13. Hence $\tau_n^{-1}(S_n - V_n) \xrightarrow{q.m.} 0$ as $n \to \infty$. Using the same type
of arguments used in proving Theorem 11.6, it then follows that σn−1 (Sn − µn )
and τn−1 (Vn − µn ) have the same limiting distribution, and hence the result is
proven.
Example 11.16. Consider the rank sum test statistic M which has the form
of a linear rank statistic with regression constants c(i, m, n) = δ{i; {m +
1, . . . , n + m}} and score function a(i) = i for i = 1, . . . , n + m. For these
regression constants, it follows that
$$\bar{c}_{m,n} = (n+m)^{-1}\sum_{i=1}^{n+m} c(i, m, n) = (n+m)^{-1}\sum_{i=m+1}^{n+m} 1 = n(n+m)^{-1},$$
and
$$\sum_{i=1}^{n+m}[c(i, m, n) - \bar{c}_{m,n}]^2 = \sum_{i=1}^{n+m}[c(i, m, n) - n(n+m)^{-1}]^2 = \sum_{i=1}^{m}[0 - n(n+m)^{-1}]^2 + \sum_{i=m+1}^{n+m}[1 - n(n+m)^{-1}]^2 = mn(n+m)^{-1}.$$
To verify Noether’s condition of Definition 11.6, we note that
and
$$\tilde{\alpha}^2 = \int_{0}^{1}\left(t - \tfrac12\right)^2 dt = \tfrac{1}{12}.$$
Definition 11.7 then implies that α is a square integrable score function. Now
One of the reasons that many nonparametric statistical methods have re-
mained popular in applications is that few assumptions need to be made
about the underlying population, and that this flexibility results in only a
small loss of efficiency in many cases. The use of a nonparametric method,
which is valid for a large collection of distributions, necessarily entails the
possible loss of efficiency. This can manifest itself in larger standard errors
for point estimates, wider confidence intervals, or hypothesis tests that have
lower power. This is because a parametric method, which is valid for a specific
parametric family, is able to take advantage of the structure of the problem
to produce a finely tuned statistical method. On the other hand, nonparamet-
ric methods have fewer assumptions to rely on and must be valid for a much
larger array of distributions. Therefore, these methods cannot take advantage
of this additional structure.
A classic example of this difference can be observed by considering the problem
of estimating the location of the mode of a continuous unimodal density. If we
are able to reasonably assume that the population is Normal, then we can
estimate the location of the mode using the sample mean. On the other hand,
if the exact parametric form of the density is not known, then the problem
can become very complicated. It is worthwhile to note at this point that any
potential increase in the efficiency that may be realized by using a parametric
method is only valid if the parametric model is at least approximately true. For
example, the sample mean will only be a reasonable estimator of the location
of the mode of a density for certain parametric models. If one of these models
does not hold, then the sample mean may be a particularly unreasonable
estimator, and may even have an infinite bias, for example.
To assess the efficiency of statistical hypothesis tests, we must borrow some
of the ideas that we encountered in Section 10.4. Statistical tests are usually
compared on the basis of their power functions. That is, we would prefer to
have a test that rejects the null hypothesis more often when the alternative
hypothesis is true. It is important that when two tests are compared on the
basis of their power functions that the significance levels of the two tests be
the same. This is due to the fact that the power of any test can be arbitrarily
increased by increasing the value of the significance level. Therefore, if β1 (θ)
and β2 (θ) are the power functions of two tests of the set of hypotheses H0 :
θ ∈ Ω0 and H1 : θ ∈ Ω1 , based on the same sample size we would prefer the
test with power function β1 if β1 (θ) ≥ β2 (θ) for all θ ∈ Ω1 , where
sup β1 (θ) = sup β2 (θ).
θ∈Ω0 θ∈Ω0
This view may be too simplistic as it insists that one of the tests be uniformly
better than the other. Further, from our discussion in Section 10.4, we know
that many tests will do well when the distance between θ and the boundary of
Ω0 is large. Another complication comes from the fact that there are so many
parameters which can be varied, including the sample size, the value of θ in
the alternative hypothesis, and the distribution. We can remove the sample
size from the problem by considering asymptotic relative efficiency using a
similar concept encountered for point estimation in Section 10.2. The value of
θ in the alternative hypothesis can be eliminated if we consider a sequence of
alternative hypotheses that converge to the null hypothesis as n → ∞. This
is similar to the idea used in Section 10.4 to compute the asymptotic power
of a hypothesis test.
Definition 11.8 (Pitman). Consider two competing tests of a point null hy-
pothesis H0 : θ = θ0 where θ0 is a specified parameter value in the parameter
space Ω. Let Sn and Tn denote the test statistics for the two tests based on a
sample of size n. Let βS,n (θ) and βT,n (θ) be the power functions of the tests
based on the test statistics Sn and Tn , respectively, when the sample size equals
n.
$$\lim_{k\to\infty}\theta_k = \theta_0.$$
3. Let $\{m(k)\}_{k=1}^{\infty}$ and $\{n(k)\}_{k=1}^{\infty}$ be increasing sequences of positive integers such that both tests have the same limiting significance level and
$$\lim_{k\to\infty}\beta_{S,m(k)}(\theta_k) = \lim_{k\to\infty}\beta_{T,n(k)}(\theta_k) \in (\alpha, 1).$$
Then, the asymptotic relative efficiency of the test based on the test statistic $S_n$ against the test based on the test statistic $T_n$ is given by
$$\lim_{k\to\infty} m(k)[n(k)]^{-1}.$$
This concept of relative efficiency establishes the relative sample sizes required
for the two tests to have the same asymptotic power. This type of efficiency
is based on Pitman (1948), and is often called Pitman relative asymptotic
efficiency.
For simplicity we will assume that both tests reject the null hypothesis H0 :
θ = θ0 for large values of the test statistic. The theory presented here can be
easily adapted to other types of rejection regions as well. We will also limit
our discussion to test statistics that have an asymptotic Normal distribution
under both the null and alternative hypotheses. The assumptions about the
asymptotic distributions of the test statistics used in this section are very
similar to those used in the study of asymptotic power in Section 10.4. In
particular we will assume that there exist functions µn (θ), ηn (θ), σn (θ) and
τn (θ) such that
$$P\left[\frac{S_{m(k)} - \mu_{m(k)}(\theta_k)}{\sigma_{m(k)}(\theta_k)} \le t\,\bigg|\,\theta = \theta_k\right] \to \Phi(t),$$
The power function for the test using the test statistic Sm(k) is given by
with a similar form for the power function of the test using the test statistic
Tn(k) . Therefore, the property in Definition 11.8 that requires
$$\lim_{k\to\infty}\beta_{S,m(k)}(\theta_k) = \lim_{k\to\infty}\beta_{T,n(k)}(\theta_k),$$
is equivalent to
$$\lim_{k\to\infty} P\left[\frac{S_{m(k)} - \mu_{m(k)}(\theta_k)}{\sigma_{m(k)}(\theta_k)} \ge \frac{s_{m(k)}(\alpha) - \mu_{m(k)}(\theta_k)}{\sigma_{m(k)}(\theta_k)}\,\bigg|\,\theta = \theta_k\right] = \lim_{k\to\infty} P\left[\frac{T_{n(k)} - \eta_{n(k)}(\theta_k)}{\tau_{n(k)}(\theta_k)} \ge \frac{t_{n(k)}(\alpha) - \eta_{n(k)}(\theta_k)}{\tau_{n(k)}(\theta_k)}\,\bigg|\,\theta = \theta_k\right],$$
which can in turn be shown to require
$$\lim_{k\to\infty}\frac{s_{m(k)}(\alpha) - \mu_{m(k)}(\theta_k)}{\sigma_{m(k)}(\theta_k)} = \lim_{k\to\infty}\frac{t_{n(k)}(\alpha) - \eta_{n(k)}(\theta_k)}{\tau_{n(k)}(\theta_k)}. \qquad (11.45)$$
Similarly, for both tests to have the same limiting significance level we require
that
$$\lim_{k\to\infty} P\left[\frac{S_{m(k)} - \mu_{m(k)}(\theta_0)}{\sigma_{m(k)}(\theta_0)} \ge \frac{s_{m(k)}(\alpha) - \mu_{m(k)}(\theta_0)}{\sigma_{m(k)}(\theta_0)}\,\bigg|\,\theta = \theta_0\right] = \lim_{k\to\infty} P\left[\frac{T_{n(k)} - \eta_{n(k)}(\theta_0)}{\tau_{n(k)}(\theta_0)} \ge \frac{t_{n(k)}(\alpha) - \eta_{n(k)}(\theta_0)}{\tau_{n(k)}(\theta_0)}\,\bigg|\,\theta = \theta_0\right],$$
Under this type of framework, Noether (1955) shows that the Pitman asymp-
totic relative efficiency is a function of the derivatives of µm(k) (θ) and ηn(k) (θ)
relative to σm(k) (θ) and τn(k) (θ), respectively.
Theorem 11.15 (Noether). Let Sn and Tn be test statistics based on a sample
of size n that reject a null hypothesis H0 : θ = θ0 when Sn ≥ sn (α) and
$T_n \ge t_n(\alpha)$, respectively. Let $\{\theta_k\}_{k=1}^{\infty}$ be a sequence of real values greater than $\theta_0$ such that $\theta_k \to \theta_0$ as $k \to \infty$. Let $\{m(k)\}_{k=1}^{\infty}$ and $\{n(k)\}_{k=1}^{\infty}$ be increasing sequences of positive integers. Let $\{\mu_{m(k)}(\theta)\}_{k=1}^{\infty}$, $\{\eta_{n(k)}(\theta)\}_{k=1}^{\infty}$, $\{\sigma_{m(k)}(\theta)\}_{k=1}^{\infty}$, and $\{\tau_{n(k)}(\theta)\}_{k=1}^{\infty}$ be sequences of real numbers that satisfy
the following assumptions:
1. For all $t \in \mathbb{R}$,
$$P\left[\frac{S_{m(k)} - \mu_{m(k)}(\theta_k)}{\sigma_{m(k)}(\theta_k)} \le t\,\bigg|\,\theta = \theta_k\right] \to \Phi(t),$$
and
$$P\left[\frac{T_{n(k)} - \eta_{n(k)}(\theta_k)}{\tau_{n(k)}(\theta_k)} \le t\,\bigg|\,\theta = \theta_k\right] \to \Phi(t),$$
as $n \to \infty$.
2. For all $t \in \mathbb{R}$,
$$P\left[\frac{S_{m(k)} - \mu_{m(k)}(\theta_0)}{\sigma_{m(k)}(\theta_0)} \le t\,\bigg|\,\theta = \theta_0\right] \to \Phi(t),$$
and
$$P\left[\frac{T_{n(k)} - \eta_{n(k)}(\theta_0)}{\tau_{n(k)}(\theta_0)} \le t\,\bigg|\,\theta = \theta_0\right] \to \Phi(t),$$
as $n \to \infty$.
3.
$$\lim_{k\to\infty}\frac{\sigma_{m(k)}(\theta_k)}{\sigma_{m(k)}(\theta_0)} = 1,$$
and
$$\lim_{k\to\infty}\frac{\tau_{n(k)}(\theta_k)}{\tau_{n(k)}(\theta_0)} = 1.$$
4. The derivatives of the functions µm(k) (θ) and ηn(k) (θ) taken with respect to
θ exist, are continuous on an interval [θ0 − δ, θ0 + δ] for some δ > 0, are
non-zero when evaluated at θ0 , and
$$\lim_{k\to\infty}\frac{\mu'_{m(k)}(\theta_k)}{\mu'_{m(k)}(\theta_0)} = \lim_{k\to\infty}\frac{\eta'_{n(k)}(\theta_k)}{\eta'_{n(k)}(\theta_0)} = 1.$$
and
$$E_T = \lim_{n\to\infty}[n\tau_n^2(\theta_0)]^{-1/2}\eta'_n(\theta_0).$$
Then the Pitman asymptotic relative efficiency of the test based on the test
statistic $S_n$, relative to the test based on the test statistic $T_n$ is given by $E_S^2 E_T^{-2}$.
Proof. The approach to proving this result is based on showing that the limiting ratio of the sample size sequences m(k) and n(k), when the asymptotic power functions are equal, is the same as the ratio $E_S^2 E_T^{-2}$. We begin by applying Theorem 1.13 (Taylor) to the functions $\mu_{m(k)}(\theta_k)$ and $\eta_{n(k)}(\theta_k)$, where we are taking advantage of Assumption 4, to find that $\mu_{m(k)}(\theta_k) = \mu_{m(k)}(\theta_0) + (\theta_k - \theta_0)\mu'_{m(k)}(\bar{\theta}_k)$ and $\eta_{n(k)}(\theta_k) = \eta_{n(k)}(\theta_0) + (\theta_k - \theta_0)\eta'_{n(k)}(\tilde{\theta}_k)$
where θ̄k ∈ (θ0 , θk ) and θ̃k ∈ (θ0 , θk ) for all k ∈ N. Note that even though θ̄k
and θ̃k are always in the same interval, they will generally not be equal to one
another. Now note that
$$\frac{s_{m(k)}(\alpha) - \mu_{m(k)}(\theta_k)}{\sigma_{m(k)}(\theta_k)} = \frac{s_{m(k)}(\alpha) - \mu_{m(k)}(\theta_0) + \mu_{m(k)}(\theta_0) - \mu_{m(k)}(\theta_k)}{\sigma_{m(k)}(\theta_0)}\cdot\frac{\sigma_{m(k)}(\theta_0)}{\sigma_{m(k)}(\theta_k)}.$$
Assumption 2 and Equation (11.45) imply that
$$\lim_{k\to\infty}\frac{s_{m(k)}(\alpha) - \mu_{m(k)}(\theta_0)}{\sigma_{m(k)}(\theta_0)} = z_{1-\alpha}.$$
Combining this result with Assumption 3 implies that
$$\lim_{k\to\infty}\frac{s_{m(k)}(\alpha) - \mu_{m(k)}(\theta_k)}{\sigma_{m(k)}(\theta_k)} = z_{1-\alpha} + \lim_{k\to\infty}\frac{\mu_{m(k)}(\theta_0) - \mu_{m(k)}(\theta_k)}{\sigma_{m(k)}(\theta_0)}. \qquad (11.46)$$
Performing the same calculations with the test based on the test statistic Tn(k)
yields
$$\lim_{k\to\infty}\frac{t_{n(k)}(\alpha) - \eta_{n(k)}(\theta_k)}{\tau_{n(k)}(\theta_k)} = z_{1-\alpha} + \lim_{k\to\infty}\frac{\eta_{n(k)}(\theta_0) - \eta_{n(k)}(\theta_k)}{\tau_{n(k)}(\theta_0)}. \qquad (11.47)$$
Combining these results with the requirement of Equation (11.45) yields
$$\lim_{k\to\infty}\frac{\mu_{m(k)}(\theta_0) - \mu_{m(k)}(\theta_k)}{\sigma_{m(k)}(\theta_0)}\cdot\frac{\tau_{n(k)}(\theta_0)}{\eta_{n(k)}(\theta_0) - \eta_{n(k)}(\theta_k)} = 1.$$
Equations (11.46) and (11.47) then imply that
has the same limit as $E_S^2 E_T^{-2}$. Therefore, the Pitman asymptotic relative efficiency is given by $E_S^2 E_T^{-2}$.
The values ES and ET are called the efficacies of the tests based on the test
statistics Sn and Tn , respectively. Randles and Wolfe (1979) point out some
important issues when interpreting the efficacies of test statistics. When one
examines the form of the efficacy of a test statistic we see that it measures
the rate of change of the function µn , in the case of the test based on the test
statistic Sn , at the point of the null hypothesis θ0 , relative to σn at the same
point. Therefore, the efficacy is a measure of how fast the distribution of Sn
changes at points near θ0 . In particular, the efficacy given in Theorem 11.15
measures the rate of change of the location of the distribution of Sn near the
null hypothesis point θ0 . Test statistics whose distributions change a great
deal near θ0 result in tests that are more sensitive to differences between the
θ0 and the actual value of θ in the alternative hypothesis. A more sensitive test
will be more powerful, and such tests will have a larger efficacy. Therefore,
if ES > ET , then the Pitman asymptotic relative efficiency is greater than
one, and the test using the test statistic Sn has a higher asymptotic power.
Similarly, if ES < ET then the relative efficiency is less than one, and the test
based on the test statistic Tn has a higher asymptotic power.
In the development of the concept of asymptotic power for individual tests
in Section 10.4, a particular sequence of alternative hypotheses of the form
θn = θ0 + O(n−1/2 ), as n → ∞ was considered. In Theorem 11.15 no explicit
form for the sequence {θk }∞k=1 is discussed, though there is an implicit form
for this sequence given in the assumptions. In particular, it follows that θk =
θ0 + O(k −1/2 ) as k → ∞, matching the asymptotic form considered in Section
10.4. See Section 5.2 of Randles and Wolfe (1979) for further details on this
result.
This section will close with some examples of computing efficacies for the t-
test, the signed rank test, and the sign test. In order to make similar compar-
isons between these tests we will begin by making some general assumptions
about the setup of the testing problem that we will consider. Let X1 , . . . , Xn
be a set of independent and identically distributed random variables from a
distribution F (x − θ) that is symmetric about θ. We will assume that F has
a density f that is also continuous, except perhaps at a countable number
of points. Let θ0 be a fixed real value and assume that we are interested in
testing H0 : θ ≤ θ0 against H1 : θ > θ0 . In the examples given below we will
not concentrate on verifying the assumptions of Theorem 11.15. For details
on verifying these assumptions see Section 5.4 of Randles and Wolfe (1979).
Example 11.17. Consider the t-test statistic Tn = n1/2 σ −1 (X̄n − θ0 ) where
the null hypothesis is rejected when Tn > t1−α;n−1 . Note that t1−α;n−1 →
z1−α as n → ∞ in accordance with the assumptions of Theorem 11.15. The
form of the test statistic implies that µT (θ0 ) = θ0 and σn (θ0 ) = n−1/2 σ so
that the efficacy of the test is given by
$$E_T = \lim_{n\to\infty}\frac{\mu_T'(\theta_0)}{n^{1/2}n^{-1/2}\sigma} = \sigma^{-1}.$$
Example 11.18. Consider the signed rank test statistic Wn given by the
sum of the ranks of |X1 − θ0 |, . . . , |Xn − θ0 | that correspond to the cases
where Xi > θ0 for i = 1, . . . , n. Without loss of generality we will consider
the case where θ0 = 0. Following the approach of Randles and Wolfe (1979) we consider the equivalent test statistic $V_n = \binom{n}{2}^{-1}W_n$. Using the results of Example 11.12, it follows that
$$V_n = \binom{n}{2}^{-1}\left[nU_{1,n} + \tfrac12 n(n-1)U_{2,n}\right] = 2(n-1)^{-1}U_{1,n} + U_{2,n},$$
where
$$U_{1,n} = n^{-1}\sum_{i=1}^{n}\delta\{2X_i; (0, \infty)\},$$
and
$$U_{2,n} = 2[n(n-1)]^{-1}\sum_{i=1}^{n}\sum_{j=i+1}^{n}\delta\{X_i + X_j; (0, \infty)\}.$$
Similarly,
Therefore,
$$\mu_n(\theta_1) = 2(n-1)^{-1}[1 - F(-\theta_1)] + \int_{-\infty}^{\infty}[1 - F(-u - 2\theta_1)]\,dF(u). \qquad (11.48)$$
where we have used the fact that f is symmetric about zero. To find the variance we note that we can use the result of Example 11.7, which found the variance to have the form $\tfrac13 n^{-1}\binom{n}{2}^2$ for the statistic $W_n$, and hence the variance of $V_n$ is $\sigma_n^2(\theta_0) = \tfrac13 n^{-1}$. Therefore, the efficacy of the signed rank
Table 11.2 The efficacies and Pitman asymptotic relative efficiencies of the t-test
(T ), the signed rank test (V ), and the sign test (B) under sampling from various
populations.
                          Efficacy (squared)         Asymptotic Relative Efficiency
 Distribution              T       V       B          V vs. T    B vs. T    B vs. V
 N(0, 1)                   1       3/π     2/π        3/π        2/π        2/3
 Uniform(−1/2, 1/2)        12      12      4          1          1/3        1/3
 LaPlace(0, 1)             1/2     3/4     1          3/2        2          4/3
 Logistic(0, 1)            3/π²    1/3     1/4        π²/9       π²/12      3/4
 Triangular(−1, 1, 0)      6       16/3    4          8/9        2/3        3/4
test is given by
$$\lim_{n\to\infty}\frac{\mu_n'(\theta_0)}{n^{1/2}\sigma_n(\theta_0)} = \lim_{n\to\infty}\frac{\mu_n'(0)}{n^{1/2}\sigma_n(0)} = \lim_{n\to\infty}\left[2(3)^{1/2}(n-1)^{-1}f(0) + 2(3)^{1/2}\int_{-\infty}^{\infty}f^2(u)\,du\right] = 2(3)^{1/2}\int_{-\infty}^{\infty}f^2(u)\,du,$$
and hence
$$E_V^2 = 12\left[\int_{-\infty}^{\infty}f^2(u)\,du\right]^2. \qquad (11.49)$$
The value of the integral in Equation (11.49) has been computed for many
distributions. For example see Table B.2 of Wand and Jones (1995).
Example 11.19. Consider the test statistic used by the sign test of Example
11.1, which has the form
$$B = \sum_{i=1}^{n}\delta\{X_i - \theta; (0, \infty)\},$$
One can make several interesting conclusions by analyzing the results of the
efficacy calculations from Examples 11.17–11.19. These efficacies, along with
the associated asymptotic relative efficiencies, are summarized in Table 11.2
for several distributions. We begin by considering the results for the N(0, 1)
distribution. We first note that the efficiencies relative to the t-test observed
in Table 11.2 are less than one, indicating that the t-test has a higher efficacy,
and is therefore more powerful in this case. The observed asymptotic relative
efficiency of the signed rank test is 3π −1 ' 0.955, which indicates that the
signed rank test has about 95% of the efficiency of the t-test when the pop-
ulation is N(0, 1). It is not surprising that the t-test is more efficient than
the signed rank test, since the t-test is derived under assumption that the
population is Normal, but what may be surprising is that the signed rank
test does so well. In fact, the results indicate that if a sample of size n = 100
is required by the signed rank test, then a sample of size n = 95 is required
by the t-test to obtain the same asymptotic power. Therefore, there is little
penalty for using the signed rank test even when the population is normal.
The sign test does not fare as well. The observed asymptotic relative efficiency of the sign test is $2\pi^{-1} \approx 0.637$, which indicates that the sign test has about
64% of the efficiency of the t-test when the population is N(0, 1). Therefore,
if a sample of size n = 100 is required by the sign test, then a sample of size
n = 64 is required by the t-test to obtain the same asymptotic power. The
sign test also does not compare well with the signed rank test.
The Uniform(− 12 , 12 ) distribution is an interesting example because there is
little chance of outliers in samples from this distribution, but the shape of the
distribution is far from Normal. In this case the signed rank test and the t-
test perform equally well with an asymptotic relative efficiency of one. The sign
test performs poorly in this case with an asymptotic relative efficiency equal to $\tfrac13$, which indicates that the sign test has an asymptotic relative efficiency of about 33%, or that if a sample of size n = 33 is required for the t-test or the signed rank test, then a sample of size n = 100 is required for the sign test to obtain the same asymptotic power.
For the LaPlace(0, 1) distribution the trend begins to turn in favor of the
nonparametric tests. The observed asymptotic relative efficiency of the signed
rank test is $\tfrac32$, which indicates that the signed rank test has about 150% of
the efficiency of the t-test when the population is LaPlace(0, 1). Therefore, if
the signed rank test requires a sample of size n = 100, then the t-test requires
a sample of size n = 150 to obtain the same asymptotic power. The sign test
does even better in this case with an asymptotic relative efficiency equal to 2
when compared to the t-test. In this case, if the sign test requires a sample of
size n = 100, then the t-test requires a sample of size n = 200 to obtain the
same asymptotic power. This is due to the heavy tails of the LaPlace(0, 1)
distribution. The sign test, and to a lesser extent, the signed rank test, are
robust to the presence of outliers while the t-test is not. An outlying value in
one direction may result in failing to reject a null hypothesis even when the
remainder of the data supports the alternative hypothesis.
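These comparisons can also be seen, at least roughly, in finite samples. The following R sketch estimates the power of the one-sided t-test, signed rank test, and sign test for a small shift of a LaPlace(0, 1) population; the shift, sample size, and number of replications are arbitrary, and the simulated powers should only be read as being qualitatively consistent with the asymptotic relative efficiencies discussed above.

## Rough power comparison under LaPlace(0, 1) data shifted by 0.3.
set.seed(42)
n <- 30; theta <- 0.3; reps <- 2000
rlaplace <- function(n) rexp(n) * sample(c(-1, 1), n, replace = TRUE)
rej <- matrix(0, reps, 3, dimnames = list(NULL, c("t", "signed rank", "sign")))
for (r in 1:reps) {
  x <- rlaplace(n) + theta
  rej[r, 1] <- t.test(x, alternative = "greater")$p.value < 0.05
  rej[r, 2] <- wilcox.test(x, alternative = "greater")$p.value < 0.05
  rej[r, 3] <- binom.test(sum(x > 0), n, alternative = "greater")$p.value < 0.05
}
colMeans(rej)                                 # estimated power of each test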
For the Logistic(0, 1) distribution we get similar results, but not as drastic.
The Logistic(0, 1) distribution also has heavier tails than the N(0, 1) distri-
bution, but not as heavy as the LaPlace(0, 1) distribution. This is reflected
in the asymptotic relative efficiencies. The signed rank test has an asymptotic relative efficiency equal to $\tfrac19\pi^2 \approx 1.097$, which gives a small advantage to this test over the t-test, while the sign test has an asymptotic relative efficiency equal to $\tfrac{1}{12}\pi^2 \approx 0.822$, which gives an advantage to the t-test.
For the case when the population follows a Triangular(−1, 1, 0) distribution
we find that the signed rank test has an asymptotic relative efficiency equal to $\tfrac89 \approx 0.889$, which gives a slight edge to the t-test, and the sign test has an asymptotic relative efficiency equal to $\tfrac23 \approx 0.667$, which implies that the t-test is better than the sign test in this case. This is probably due to the fact that
the shape of the Triangular(−1, 1, 0) distribution is closer to the general
shape of a N(0, 1) distribution than many of the other distributions studied
here.
Pitman asymptotic relative efficiency is not the only viewpoint that has been
developed for comparing statistical hypothesis tests. For example, Hodges
and Lehmann (1970) developed the concept of deficiency, where expansion
theory similar to what was used in this section is carried out to higher order
terms. The concept of Bahadur efficiency, developed by Bahadur (1960a,
1960b, 1967) considers fixed alternative hypothesis values and power function
values and determines the rate at which the significance levels of the two tests
converge to zero. Other approaches to asymptotic relative efficiency can be
found in Cochran (1952), and Anderson and Goodman (1957).
for all x ∈ R. Note that fˆ(x) is the observed proportion of points in the sample
that are equal to the point x. This estimate can be shown to be pointwise
unbiased and consistent. See Exercise 14.
Figure 11.7 The estimator F̄n (t) computed on the example set of data indicated by
the points on the horizontal axis. The location of the grid points are indicated by the
vertical grey lines.
$$\hat{F}(g_i) = \hat{F}_n(g_i) = n^{-1}\sum_{k=1}^{n}\delta\{X_k; (-\infty, g_i]\},$$
$$\bar{F}_n(x) = \hat{F}_n(g_i) + (g_{i+1} - g_i)^{-1}(x - g_i)[\hat{F}_n(g_{i+1}) - \hat{F}_n(g_i)],$$
when x ∈ [gi , gi+1 ]. It can be shown that F̄n is a valid distribution function
under the assumptions given above. See Exercise 15. See Figure 11.7 for an
example of the form of this estimator.
To estimate the density at x ∈ (gi , gi+1 ) we take the derivative of F̄n (x) to
obtain the estimator
$$\bar{f}_n(x) = \frac{d}{dx}\bar{F}_n(x) = \frac{d}{dx}\left\{\hat{F}_n(g_i) + (g_{i+1} - g_i)^{-1}(x - g_i)[\hat{F}_n(g_{i+1}) - \hat{F}_n(g_i)]\right\} = (g_{i+1} - g_i)^{-1}[\hat{F}_n(g_{i+1}) - \hat{F}_n(g_i)] = n^{-1}(g_{i+1} - g_i)^{-1}\left[\sum_{k=1}^{n}\delta\{X_k; (-\infty, g_{i+1}]\} - \sum_{k=1}^{n}\delta\{X_k; (-\infty, g_i]\}\right] = (g_{i+1} - g_i)^{-1}n^{-1}\sum_{k=1}^{n}\delta\{X_k; (g_i, g_{i+1}]\}, \qquad (11.50)$$
which is the proportion of observations in the range (gi , gi+1 ], divided by the
length of the range. The estimator specified in Equation 11.50 is called a his-
togram. This estimator is also often called a density histogram, to differentiate
it from the frequency histogram which is a plot of the frequency of observations
within each of the ranges (gi , gi+1 ]. Note that a frequency histogram does not
produce a valid density, and is technically not a density estimate. Note that
f¯n will usually not exist at the grid points g1 , . . . , gd as F̄n will usually not be
differentiable at these points. In practice this makes little difference and we
can either ignore these points, or can set the estimate at these points equal to
one of the neighboring estimate values. The form of this estimate is a series of
horizontal steps within each range (gi , gi+1 ). See Figure 11.8 for an example
form of this estimator.
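The estimator in Equation (11.50) is simple to compute directly. In the R sketch below the data, the grid points, and the evaluation point are arbitrary choices; the built-in hist function produces the same density histogram when it is given the same grid points as breaks.

## The histogram density estimate of Equation (11.50) at a single point.
set.seed(21)
x <- rnorm(50)
g <- seq(-4, 4, by = 0.5)                     # equally spaced grid points
fbar <- function(t) {
  i <- findInterval(t, g)                     # index i with g[i] <= t < g[i + 1]
  sum(x > g[i] & x <= g[i + 1]) / (length(x) * (g[i + 1] - g[i]))
}
fbar(0.25)
hist(x, breaks = g, freq = FALSE)             # the same estimate as a plot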
Now that we have specified the form of the histogram we must consider the
placement and number of grid points g1 , . . . , gd . We would like to choose these
grid points so that the histogram provides a good estimate of the underlying
density, and hence we must develop a measure of discrepancy between the
true density f and the estimate f¯n . For univariate parameters we often use
the mean squared error, which is the expected square distance between the es-
timator and the true parameter value, as a measure of discrepancy. Estimators
that are able to minimize the mean squared error are considered reasonable
estimators of the parameter.
The mean squared error does not directly generalize to the case of estimat-
ing a density, unless we consider the pointwise behavior of the density esti-
mate. That is, for the case of the histogram, we would consider the mean
squared error of f¯n (x) as a pointwise estimator of f (x) at a fixed point x ∈ R
as MSE[f¯n (x), f (x)] = E{[f¯n (x) − f (x)]2 }. To obtain an overall measure of
the performance of this estimator we can then integrate the pointwise mean
squared error over the real line. This results in the mean integrated squared
Figure 11.8 The estimator f¯n (t) computed on the example set of data indicated by
the points on the horizontal axis. The location of the grid points are indicated by the
vertical grey lines. These are the same data and grid points used in Figure 11.7.
error given by
$$\mathrm{MISE}(\bar{f}_n, f) = \int_{-\infty}^{\infty}\mathrm{MSE}[\bar{f}_n(x), f(x)]\,dx = \int_{-\infty}^{\infty}E\{[\bar{f}_n(x)-f(x)]^2\}\,dx = E\left\{\int_{-\infty}^{\infty}[\bar{f}_n(x)-f(x)]^2\,dx\right\}, \qquad (11.51)$$
where we have assumed in the final equality that the interchange of the integral
and the expectation is permissible. As usual, the mean squared error can be
written as the sum of the square bias and the variance of the estimator.
The same operation can be performed here to find that the mean integrated
squared error can be written as the sum of the integrated square bias and the
integrated variance. That is MISE(f¯n , f ) = ISB(f¯n , f ) + IV(f¯n ) where
$$\mathrm{ISB}(\bar{f}_n, f) = \int_{-\infty}^{\infty}\mathrm{Bias}^2[\bar{f}_n(x), f(x)]\,dx = \int_{-\infty}^{\infty}\{E[\bar{f}_n(x)] - f(x)\}^2\,dx,$$
and
$$\mathrm{IV}(\bar{f}_n) = \int_{-\infty}^{\infty}E\{[\bar{f}_n(x) - E[\bar{f}_n(x)]]^2\}\,dx.$$
See Exercise 17. Using the mean integrated squared error as a measure of
discrepancy between our density estimate and the true density, we will use an
asymptotic analysis to specify how the grid points should be chosen. We will
begin by making a few simplifying assumptions. First, we will assume that the
grid spacing is even over the range of the distribution. That is, gi+1 − gi = h,
for all i = 1, . . . , d where h > 0 is a value called the bin width. We will
not concern ourselves with the placement of the grid points. We will only
focus on choosing h that will minimize an asymptotic expression for the mean
integrated squared error of the histogram estimator.
We will assume that f has a certain amount of smoothness. For the moment
we can assume that f is continuous, but for later calculations we will have to
assume that f 0 is continuous. In either case it should be clear that f cannot
be a step function, which is the form of the histogram. Therefore, if we were
to expect the histogram to provide any reasonable estimate asymptotically it
is apparent that the bin width h must change with n. In fact, the bins must
get smaller as n gets larger in order for the histogram to become a smooth
function as n gets large. Therefore, we will assume that
$$\lim_{n\to\infty} h = 0.$$
This is, in fact, a necessary condition for the histogram estimator to be con-
sistent. On the other hand, we must be careful that the bin width does not
converge to zero at too fast a rate. If h becomes too small too fast then there
will not be enough data within each of the bins to provide a consistent es-
timate of the true density in that region. Therefore, we will further assume
that
$$\lim_{n\to\infty} nh = \infty.$$
For further information on these assumptions, see Scott (1992) and Section
2.1 of Simonoff (1996).
We will begin by considering the integrated bias of the histogram estimator. To
find the bias we begin by assuming that x ∈ (gi , gi+1 ] for some i ∈ {1, . . . , d−1}
where h = gi+1 − gi and note that
" n
#
X
¯
E[fn (x)] = E (gi+1 − gi ) n −1 −1
δ{Xi , ; (gi , gi+1 ]}
k=1
n
X
= (nh)−1 E(δ{Xi , ; (gi , gi+1 ]})
k=1
= h−1 E(δ{X1 , ; (gi , gi+1 ]})
Z gi+1
−1
= h dF (t).
gi
where we assume that ξ < ∞. Then, assuming that |t − x| < h we have that
$$h^{-3}\int_{g_i}^{g_{i+1}}\tfrac{1}{2}f''(c)(t-x)^2\,dt \le \tfrac{1}{2}\xi h^{-3}\int_{g_i}^{g_{i+1}}(t-x)^2\,dt \le \tfrac{1}{2}\xi h^{-3}\int_{g_i}^{g_{i+1}}h^2\,dt = \tfrac{1}{2}h^{-1}\xi(g_{i+1}-g_i) = \tfrac{1}{2}|\xi| < \infty,$$
where we have used the fact that δ{Xk , ; (gi , gi+1 ]} is a Bernoulli random
variable. To simplify this expression we begin by finding an asymptotic form
for the integrals in Equation (11.53). To this end, we apply Theorem 1.15 to
the density to find that, for x ∈ (gi, gi+1],
$$\int_{g_i}^{g_{i+1}}dF(t) = \int_{g_i}^{g_{i+1}}\left[f(x) + f'(x)(t-x) + \tfrac{1}{2}f''(c)(t-x)^2\right]dt. \qquad (11.54)$$
as n → ∞. See Exercise 19. We have previously shown that the last integral
in Equation (11.54) is O(h3 ) as n → ∞, from which it follows from Theorem
1.19 that
$$\int_{g_i}^{g_{i+1}}dF(t) = hf(x) + O(h^2),$$
$$V[\bar{f}_n(x)] = n^{-1}h^{-2}[hf(x)+O(h^2)][1-hf(x)+O(h^2)] = n^{-1}[f(x)+O(h)][h^{-1}-f(x)+O(h)]$$
$$= n^{-1}[h^{-1}f(x)+O(1)] = (nh)^{-1}f(x) + O(n^{-1}),$$
To obtain the mean integrated squared error, we integrate the pointwise mean
squared error separately over each of the grid intervals. That is,
$$\mathrm{MISE}(\bar{f}_n, f) = \int_{-\infty}^{\infty}\mathrm{MSE}[\bar{f}_n(x), f(x)]\,dx = \sum_{k=1}^{d}\int_{g_k}^{g_{k+1}}\mathrm{MSE}[\bar{f}_n(x), f(x)]\,dx.$$
For the grid interval (gk , gk+1 ] we have that
$$\int_{g_k}^{g_{k+1}}\mathrm{MSE}[\bar{f}_n(x), f(x)]\,dx = (nh)^{-1}\int_{g_k}^{g_{k+1}}f(x)\,dx + \tfrac{1}{4}\int_{g_k}^{g_{k+1}}[f'(x)]^2[h-2(x-g_k)]^2\,dx + O(n^{-1}) + O(h^3),$$
for some ηk ∈ (gk , gk+1 ], using Theorem A.5. Integrating the polynomial
within the integral yields
$$\int_{g_k}^{g_{k+1}}[h-2(x-g_k)]^2\,dx = h^3 - 2h^3 + \tfrac{4}{3}h^3 = \tfrac{1}{3}h^3,$$
so that
$$\tfrac{1}{4}\int_{g_k}^{g_{k+1}}[f'(x)]^2[h-2(x-g_k)]^2\,dx = \tfrac{1}{12}h^3[f'(\eta_k)]^2.$$
Taking the sum over all of the grid intervals gives the total mean integrated
squared error,
$$\mathrm{MISE}(\bar{f}_n, f) = \sum_{k=1}^{d}(nh)^{-1}\int_{g_k}^{g_{k+1}}dF(x) + \sum_{k=1}^{d}\tfrac{1}{12}h^3[f'(\eta_k)]^2 + O(n^{-1}) + O(h^3)$$
$$= (nh)^{-1}\int_{-\infty}^{\infty}dF(x) + \tfrac{1}{12}h^2\sum_{k=1}^{d}h[f'(\eta_k)]^2 + O(n^{-1}) + O(h^3)$$
$$= (nh)^{-1} + \tfrac{1}{12}h^2\sum_{k=1}^{d}h[f'(\eta_k)]^2 + O(n^{-1}) + O(h^3).$$
Figure 11.9 This figure demonstrates how a histogram with a smaller bin width is
better able to follow the curvature of an underlying density, resulting in a smaller
asymptotic bias.
where
$$R(f') = \int_{-\infty}^{\infty}[f'(t)]^2\,dt.$$
Note that the mean integrated squared error of the histogram contains the
classic tradeoff between bias and variance seen with so many estimators. That
is, if the bin width is chosen to be small, the bias will be small since there
will be many small bins that are able to capture the curvature of f , as shown
by the (1/12)h²R(f′) term. But in this case the variance, as shown by the (nh)−1
term, will be large, due to the fact that there will be fewer observations per
bin. When the bin width is chosen to be large, the bias becomes large as the
curvature of f will not be modeled as well by the wide steps
in the histogram, while the variance will be small due to the large number of
observations per bin. See Figures 11.9 and 11.10.
To find the bin width that minimizes this tradeoff we first truncate the ex-
pansion for the mean integrated squared error to obtain the asymptotic mean
integrated squared error given by
$$\mathrm{AMISE}(\bar{f}_n, f) = (nh)^{-1} + \tfrac{1}{12}h^2R(f'). \qquad (11.55)$$
Figure 11.10 This figure demonstrates how a histogram with a large bin width is less
able to follow the curvature of an underlying density, resulting in a larger asymptotic
bias.
zero, and solving for h gives the asymptotically optimal bin width given by
$$h_{opt} = n^{-1/3}\left[\frac{6}{R(f')}\right]^{1/3}.$$
See Exercise 20. The resulting asymptotic mean integrated squared error when using the
optimal bandwidth is therefore
$$\mathrm{AMISE}_{opt}(\bar{f}_n, f) = n^{-2/3}\left[\tfrac{9}{16}R(f')\right]^{1/3}.$$
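For example (this computation is an added illustration, not part of the text), if f is a N(0, σ²) density then R(f′) = 1/(4π^{1/2}σ³) in closed form, and the optimal bin width reduces to approximately 3.49σn^{−1/3}, the familiar normal reference rule. A quick check in R:

# optimal bin width h.opt = n^(-1/3)*(6/R(f'))^(1/3) for a N(0, sigma^2) density,
# where R(f') = 1/(4*sqrt(pi)*sigma^3)
hopt.normal <- function(n, sigma = 1)
{
  Rfprime <- 1/(4*sqrt(pi)*sigma^3)
  n^(-1/3)*(6/Rfprime)^(1/3)
}
hopt.normal(100)       # approximately 0.75
3.4908*100^(-1/3)      # same value via the constant (24*sqrt(pi))^(1/3)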
Note the dependence of the optimal bandwidth on the integrated square
derivative of the underlying density. This dependence has two major implica-
tions. First, it is clear that densities that have smaller derivatives over their
range will require a larger bandwidth, and will result in a smaller asymptotic
mean integrated square error. This is due to the fact that these densities are
more flat and will be easier to estimate with large bin widths in the step
function in the histogram. When the derivative is large over the range of the
distribution, the optimal bin width is smaller and the resulting asymptotic
mean integrated squared error is larger. Such densities are more difficult to
estimate using the histogram. The second implication is that the asymptoti-
cally optimal bandwidth depends on the form of the underlying density. This
means that we must estimate the bandwidth from the observed data, which
requires an estimate of the integral of the squared derivative of the density.
Wand (1997) suggests using a kernel estimator to achieve this goal, and argues
that most of the usual informal rules do not choose the bin width to be small
enough. Kernel estimators can also be used to estimate the density itself, and
they form the second type of density estimator we discuss in this section.
One problem with the histogram is that it is always a step function and there-
fore does not usually reflect our notion of a continuous and smooth density.
As such, there have been numerous techniques developed to provide a smooth
density estimate. The estimator we will consider in this section is known as a
kernel density estimator, which appears to have been first studied by Fix and
Hodges (1951). See Fix and Hodges (1989) for a reprint of this paper. The
first asymptotic analysis of this method, which follows along the lines of the
developments in this chapter, was carried out by Parzen (1962) and Rosenblatt
(1956).
To motivate the kernel density estimate, return once again to the problem of
estimating the distribution function F . It is clear that if we wish to have a
smooth density estimate based on an estimate of F , we must find an estimator
for F that itself is smooth. Indeed, we require more than just continuity in
this case. As we saw with the histogram, we specified a continuous estimator
for F that yielded a step function for the density estimate. Therefore, it seems
if we are to improve on this idea we should not only require an estimator for F
that is continuous, but it should also be differentiable everywhere. To motivate
an approach to finding such an estimate we write the empirical distribution
function as
Xn
F̂n (x) = n−1 δ{Xi ; (−∞, x]}
k=1
Xn
= n−1 δ{x − Xi ; [0, ∞)}
k=1
Xn
= n−1 K(x − Xi ), (11.56)
k=1
where K(t) = δ{t; [0, ∞)} for all t ∈ R. Note that K in this case can be taken
to be a distribution function for a degenerate random variable that concentrates all of
its mass at zero, and is therefore a step function with a single step of size one
at zero. The key idea behind developing the kernel density estimator is to note
that if we replace the function K in Equation (11.56) with any other valid
distribution function that is centered around zero, the estimator itself remains
a distribution function. See Exercise 21. That is, let K be any non-decreasing
right continuous function such that
$$\lim_{t\to\infty}K(t) = 1, \qquad \lim_{t\to-\infty}K(t) = 0, \qquad \text{and} \qquad \int_{-\infty}^{\infty}t\,dK(t) = 0.$$
Now define the kernel estimator of the distribution function F to be
$$\tilde{F}_n(x) = n^{-1}\sum_{k=1}^{n}K(x - X_k). \qquad (11.57)$$
The problem with the proposed estimator in Equation (11.57) is that the
properties of the estimator are a function of the variance of the distribution
K. To control this property we introduce a scale parameter h to the function
K. That is, define the kernel estimator of the distribution function F as
$$\tilde{F}_{n,h}(x) = n^{-1}\sum_{k=1}^{n}K\left(\frac{x - X_k}{h}\right). \qquad (11.58)$$
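As an illustration (the code below is a sketch that is not part of the text; the Gaussian kernel and the function names are assumptions made for the example), taking K to be the standard normal distribution function gives the kernel estimate of F in Equation (11.58), and differentiating with respect to x gives the corresponding density estimate (nh)−1 Σ K′[(x − Xk)/h], the form used later in this section.

# kernel estimates of F and f with a Gaussian kernel: K = pnorm, K' = dnorm
Ftilde <- function(x, data, h)
  sapply(x, function(xi) mean(pnorm((xi - data)/h)))
ftilde <- function(x, data, h)
  sapply(x, function(xi) mean(dnorm((xi - data)/h))/h)
set.seed(1)
data <- rnorm(100)
curve(ftilde(x, data, h = 0.4), from = -4, to = 4, ylab = "density estimate")
curve(dnorm(x), add = TRUE, lty = 2)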
The question of what function should be used for the kernel function k is a
rather complicated issue that we will not address in depth. For finite sam-
ples the choice obviously makes a difference, but a theoretical quantification
of these differences is rather complicated, so researchers have turned to the
question of what effect does the choice of the kernel function have asymp-
totically as n → ∞? It turns out that for large samples the form of k does
not affect the rate at which the optimal asymptotic mean squared error of the
kernel density estimator approaches zero. Therefore, from an asymptotic view-
point the choice matters little. See Section 3.1.2 of Simonoff (1996). Hence, for
the remainder of this section we shall make the following generic assumptions
about the form of the kernel function k. We will assume that k is a symmetric,
continuous density with zero mean and finite variance. Given this assumption,
we will now show how asymptotic calculations can be used to determine the
asymptotically optimal bandwidth.
As with the histogram, we will use the mean integrated squared error given
in Equation (11.51) as a measure of the performance of the kernel density
estimator. We will assume that f is a smooth density, namely that f 00 is
continuous and square integrable. As with the histogram, we shall assume
that the bandwidth has the properties that
$$\lim_{n\to\infty}h = 0 \qquad \text{and} \qquad \lim_{n\to\infty}nh = \infty.$$
We will finally assume that k is a bounded density that is symmetric and has a
finite fourth moment. To simplify notation, we will assume that X is a generic
random variable with distribution F . We begin by obtaining an expression for
the integrated bias. The expected value of the kernel density estimator at a
point x ∈ R is given by
" n #
˜ −1
X x − Xk
E[fn,h (x)] = E (nh) k
h
k=1
n
X x − Xk
= (nh)−1 E k
h
k=1
x−X
= h−1 E k
h
Z ∞
x −t
= h−1 k dF (t).
−∞ h
where the second equality follows because we have assumed that k is sym-
metric about the origin. Now, apply Theorem 1.15 (Taylor) to f (x + vh) to
find
$$f(x+vh) = f(x) + vhf'(x) + \tfrac{1}{2}(vh)^2f''(x) + \tfrac{1}{6}(vh)^3f'''(x) + O(h^4),$$
as h → 0. Therefore, assuming that the integral of the remainder term remains
O(h4 ) as h → 0, it follows that
$$E[\tilde{f}_{n,h}(x)] = \int_{-\infty}^{\infty}k(v)f(x+vh)\,dv = \int_{-\infty}^{\infty}f(x)k(v)\,dv + \int_{-\infty}^{\infty}vhf'(x)k(v)\,dv$$
$$+ \int_{-\infty}^{\infty}\tfrac{1}{2}(vh)^2f''(x)k(v)\,dv + \int_{-\infty}^{\infty}\tfrac{1}{6}(vh)^3f'''(x)k(v)\,dv + O(h^4),$$
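Because k integrates to one, has mean zero, and has second moment σk², the expansion above leads to the standard approximation E[f̃n,h(x)] ≈ f(x) + ½h²σk²f′′(x). The small simulation below is an added illustration (not from the text) that checks this approximation for a Gaussian kernel and a N(0, 1) population at x = 0.

set.seed(2)
n <- 200; h <- 0.3; x <- 0
est <- replicate(5000, {
  data <- rnorm(n)
  mean(dnorm((x - data)/h))/h            # kernel density estimate at x
})
mean(est)                                 # simulated expected value, about 0.382
dnorm(x) + 0.5*h^2*(x^2 - 1)*dnorm(x)     # f(x) + (1/2)h^2*f''(x), about 0.381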
Note that the usual asymptotic properties related to the empirical distribution
hold, conditional on X1 , . . . , Xn . For example, Theorem 3.18 (Glivenko and
Cantelli) implies that ‖Ĥn − H̃n,b‖∞ converges almost certainly to 0 as b → ∞, where the convergence
is relative only to the sampling from F̂n , conditional on X1 , . . . , Xn .
The relative complexity of the bootstrap algorithm, coupled with the
fact that the bootstrap is generally considered to be most useful in the non-
parametric framework where the exact form of the population distribution is
unknown, means that the theoretical justification for the bootstrap has been
typically based on computer-based empirical studies and asymptotic theory. A
detailed study of both of these types of properties can be found in Mammen
(1992). This section will focus on the consistency of several common boot-
strap estimates and the asymptotic properties of several types of bootstrap
confidence intervals. We begin by focusing on the consistency of the bootstrap
estimate of the distribution Hn (t). There are two ways in which we can view
consistency in this case. In the first case we can concern ourselves with the
pointwise consistency of Ĥn(t) as an estimator of Hn(t). That is, we can conclude
that Ĥn(t) is a pointwise consistent estimator of Hn(t) if Ĥn(t) converges in probability to Hn(t)
as n → ∞ for all t ∈ R. Alternatively, we can define Ĥn to be a consistent
estimator of Hn if some metric between Ĥn and Hn converges in probability
to zero as n → ∞. That is, let d be a metric on F, the space of all distribution
functions. Then we will conclude that Ĥn is a consistent estimator of Hn if
d(Ĥn, Hn) converges in probability to 0 as n → ∞. Most research on the consistency of the bootstrap
uses this definition. Both of the concepts above are based on convergence in
probability, and in this context these concepts are often referred to as weak
consistency. In the case where convergence in probability is replaced by almost
certain convergence, the concepts above are referred to as strong consistency.
Because there is more than one metric used on the space F, the consistency
of Ĥn as an estimator of Hn is often qualified by the metric that is being
used. For example, if d(Ĥn, Hn) converges in probability to 0 as n → ∞ then Ĥn is called a weakly
d-consistent estimator of Hn. In this section we will use the supremum metric
d∞ that is based on the inner product defined in Theorem 3.17.
The smooth function model, introduced in Section 7.4, was shown to be a
flexible model that contains many of the common types of smooth estimators
encountered in practice. The framework of the smooth function model affords
us sufficient structure to obtain the strong consistency of the bootstrap esti-
mate of Hn (t).
Theorem 11.16. Let X1 , . . . , Xn be a set of independent and identically dis-
tributed d-dimensional random vectors from a distribution F with mean vector
µ where E(kXn k2 ) < ∞. Let θ = g(µ) where g : Rd → R is a continuously
differentiable function at µ such that
$$\left.\frac{\partial}{\partial x_i}g(x)\right|_{x=\mu} \neq 0,$$
For a proof of Theorem 11.16, see Section 3.2.1 of Shao and Tu (1995). The
necessity of the condition that E(||Xn ||2 ) < ∞ has been the subject of con-
siderable research; Babu (1984), Athreya (1987), and Knight (1989) have all
supplied examples where the violation of this condition results in an incon-
sistent bootstrap estimate. In the special case where d = 1 and g(x) = x,
the condition has been shown to be necessary and sufficient by Giné and Zinn
(1989) and Hall (1990). The smoothness of the function g is also an important
aspect of the consistency of the bootstrap. For functions that are not smooth
functions of mean vectors there are numerous examples where the bootstrap
estimate of the sampling distribution is not consistent. The following example
can be found in Efron and Tibshirani (1993).
Example 11.20. Let X1 , . . . , Xn be a set of independent and identically
distributed random variables from a Uniform(0, θ) distribution where θ ∈
Ω = (0, ∞), and let θ̂n = X(n) , the maximum value in the sample. Suppose
that X1∗ , . . . , Xn∗ is a set of independent and identically distributed random
variables from the empirical distribution of X1, . . . , Xn. Then P∗(θ̂n∗ = θ̂n) =
P∗(X(n)∗ = X(n)). Note that X(n)∗ will equal X(n) any time that X(n) occurs in
the resample at least once. Recalling that, conditional on the observed sample
X1, . . . , Xn, the empirical distribution places a mass of n−1 on each of the
values in the sample, it follows that X(n)∗ ≠ X(n) with probability (1 − n−1)n.
Therefore, the bootstrap estimates the probability P(θ̂n = θ) with P∗(θ̂n∗ =
θ̂n) = 1 − (1 − n−1)n, and Theorem 1.7 implies that
$$\lim_{n\to\infty}P^*(\hat{\theta}_n^* = \hat{\theta}_n) = 1 - \exp(-1) \simeq 0.632.$$
Noting that X(n) is a continuous random variable, it follows that the actual
probability is P (θ̂n = θ) = 0 for all n ∈ N. Therefore, the bootstrap estimate of
the probability is not consistent. We have plotted the actual distribution of θ̂n
along with a histogram of the bootstrap estimate of the sampling distribution
of θ̂n for a set of simulated data from a Uniform(0, 1) distribution in Figures
11.11 and 11.12. In Figure 11.12 the observed data are represented by the
plotted points along the horizontal axis. Note that since θ̂n∗ is the maximum
of a sample taken from the observed sample X1 , . . . , Xn , θ̂n∗ will be equal to
one of the observed points with probability one with respect to the probability
measure P∗, which is conditional on the observed sample X1, . . . , Xn.
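A small simulation makes this failure easy to see. The R code below (an illustration added here, not from the text) resamples from an observed Uniform(0, 1) sample of size n = 10 and estimates P∗(θ̂n∗ = θ̂n), which should be close to 1 − (1 − n−1)n ≈ 0.651, even though P(θ̂n = θ) = 0.

set.seed(3)
n <- 10
x <- runif(n)                        # observed sample
thetahat <- max(x)
b <- 5000                            # number of resamples
stars <- replicate(b, max(sample(x, n, replace = TRUE)))
mean(stars == thetahat)              # bootstrap estimate of P(thetahat = theta)
1 - (1 - 1/n)^n                      # 0.6513; converges to 1 - exp(-1) as n increases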
The problem with the bootstrap estimate in Example 11.20 is that the parent
population is continuous, but the empirical distribution is discrete. The boot-
strap usually overcomes this problem in the case where θ̂n is a smooth function
of the data because the bootstrap estimate of the sampling distribution be-
comes virtually continuous at a very fast rate as n → ∞. This is due to the
large number of atoms in the bootstrap estimate of the sampling distribution.
See Appendix I of Hall (1992). However, when θ̂n is not a smooth function of
the data, such as in Example 11.20, this continuity is never realized and the
bootstrap estimate of the sampling distribution of θ̂n can fail to be consistent.
Figure 11.11 The actual density of θ̂n = X(n) for samples of size 10 from a
Uniform(0, 1) distribution.
Figure 11.12 An example of the bootstrap estimate of θ̂n = X(n) for a sample of size
10 taken from a Uniform(0, 1) distribution. The simulated sample is represented by
the points plotted along the horizontal axis.
is not constant with respect to x1 . Therefore, under the condition that F has
a finite fourth moment, Theorem 11.17 implies that d∞(Ĥn, Hn) converges almost certainly to 0 as
n → ∞ where Hn (t) = P [n1/2 (Un − µ2 ) ≤ t] and Ĥn (t) is the bootstrap
estimate given by Ĥn (t) = P ∗ [n1/2 (Un∗ − µ̂2,n ) ≤ t].
There are many other consistency results for the bootstrap estimate of the
sampling distribution that include results for L-statistics, differentiable statis-
tical functionals, empirical processes, and quantile processes. For an overview
of these results see Section 3.2 of Shao and Tu (1995).
Beyond the bootstrap estimate of the sampling distribution, we can also con-
sider the bootstrap estimate of the variance of an estimator, or equivalently
the standard error of an estimator. For example, suppose that we take Jn (t) =
P (θ̂n ≤ t) and estimate Jn (t) using the bootstrap to get Jˆn (t) = P ∗ (θ̂n∗ ≤ t).
The bias of θ̂n is given by
$$\mathrm{Bias}(\hat{\theta}_n) = E(\hat{\theta}_n) - \theta = \int_{-\infty}^{\infty}t\,dJ_n(t) - \theta, \qquad (11.61)$$
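In practice the integral in Equation (11.61) is approximated by resampling, as described in Exercise 27. A minimal R sketch (illustrative only; the function name and the choice of the sample variance as the statistic are assumptions made here) is:

boot.bias <- function(x, statistic, b = 1000)
{
  thetahat <- statistic(x)
  stars <- replicate(b, statistic(sample(x, length(x), replace = TRUE)))
  mean(stars) - thetahat               # approximates E*(thetahat*) - thetahat
}
set.seed(4)
x <- rexp(30)
boot.bias(x, var)                      # close to -var(x)/length(x), apart from resampling error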
Not surprisingly, the conditions that ensure the consistency of the bootstrap
estimate of the variance are similar to what is required to ensure the consis-
tency of the bootstrap estimate of the sampling distribution.
Theorem 11.18. Let X1 , . . . , Xn be a set of independent and identically dis-
tributed d-dimensional random vectors with mean vector µ and covariance
matrix Σ. Let θ = g(µ) for a real valued function g that is differentiable in
a neighborhood of µ. Define a d × 1 vector d(µ) to be the vector of partial
derivatives of g evaluated at µ. That is, the ith element of d(µ) is given by
$$d_i(\mu) = \left.\frac{\partial}{\partial x_i}g(x)\right|_{x=\mu},$$
as n → ∞ where
$$\hat{\theta}_n(X_{i_1},\ldots,X_{i_n}) = g\left(n^{-1}\sum_{k=1}^{n}X_{i_k}\right),$$
A proof of Theorem 11.18 can be found in Section 3.2.2 of Shao and Tu (1995).
The condition given in Equation (11.63) is required because there are cases
where the bootstrap estimate of the variance diverges to infinity, a result that
is caused by the fact that |θ̂n∗ − θ̂n | may take on some exceptionally large values.
Note the role of resampling in this condition. A sample from the empirical
distribution F̂n consists of values from the original sample X1 , . . . , Xn . Hence,
for any particular resample, θ̂n may be computed on any set of values Xi1 , . . . , Xin .
The condition given in Theorem 11.18 ensures that none of the values will be
too far away from θ̂n as n → ∞. An example of a case where the bootstrap
estimator is not consistent is given by Ghosh et al. (1984).
As exhibited in Example 11.20, the bootstrap can behave very differently
when dealing with non-smooth statistics such as sample quantiles. However,
the result given below shows that the bootstrap can still provide consistent
variance estimates in such cases.
Theorem 11.19. Let X1 , . . . , Xn be a set of independent and identically
distributed random variables from a distribution F . Let θ = F −1 (p) and
θ̂n = F̂n−1 (p) where p ∈ (0, 1) is a fixed constant. Suppose that f = F 0 exists
and is positive in a neighborhood of θ. If E(|X1 |ε ) < ∞ for some ε > 0 then
the bootstrap estimate of the variance σn2 = n−1 p(1 − p)[f (θ)]−2 is consistent.
That is, σ̂n2 σn−2 converges almost certainly to one as n → ∞.
A proof of Theorem 11.19 can be found in Ghosh et al. (1984). Babu (1986)
considers the same problem and is able to prove the result under slightly
weaker conditions. It is worthwhile to compare the assumptions of Theorem
11.19 to those of Corollary 4.4, which are used to establish the asymptotic
Normality of the sample quantile. These assumptions are required in order
to be able to obtain the form of the asymptotic variance of the sample quantile.
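The following R sketch (an added illustration, not from the text) compares the bootstrap variance estimate for the sample median with the asymptotic variance σn2 = n−1p(1 − p)[f(θ)]−2 of Theorem 11.19 for a N(0, 1) population, where p = 1/2, θ = 0, and f(0) = φ(0).

set.seed(5)
n <- 100
x <- rnorm(n)
stars <- replicate(2000, median(sample(x, n, replace = TRUE)))
var(stars)                     # bootstrap estimate of the variance of the sample median
0.25/(n*dnorm(0)^2)            # asymptotic variance, approximately 0.0157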
The ability of the bootstrap to provide consistent estimates of the sampling
distribution and the variance of a statistic is only a small part of the theory
that supports the usefulness of the bootstrap in many situations. One of the
more surprising results is that under the smooth function model of Section
7.4, the bootstrap automatically performs an Edgeworth type correction. This
type of result was first observed in the early work of Babu and Singh (1983,
1984, 1985), Beran (1982), Bickel and Freedman (1980), Hall (1986a, 1986b),
and Singh (1981). A fully developed theory appears in the work of Hall (1988a,
1992).
The essential idea is based on the following result. Suppose that X1 , . . . , Xn
is a set of independent and identically distributed d-dimensional random vec-
tors from a distribution F with parameter θ that falls within the smooth
function model. Theorem 7.11 implies that the distribution function Gn (x) =
P [n1/2 σ −1 (θ̂n − θ) ≤ x] has asymptotic expansion
$$G_n(x) = \Phi(x) + \sum_{k=1}^{p}n^{-k/2}r_k(x)\phi(x) + o(n^{-p/2}), \qquad (11.64)$$
as n → ∞, where rk is a polynomial whose coefficients depend on the mo-
ments of F . Theorem 5.1 of Hall (1992) implies that the bootstrap estimate of
Gn (x), which is given by Ĝn (x) = P ∗ [n1/2 σ̂n−1 (θ̂n∗ − θ̂n ) ≤ x], has asymptotic
expansion
$$\hat{G}_n(x) = \Phi(x) + \sum_{k=1}^{p}n^{-k/2}\hat{r}_k(x)\phi(x) + o_p(n^{-p/2}), \qquad (11.65)$$
as n → ∞. The polynomial r̂k has the same form as rk , except that the
moments of F in the coefficients of the polynomial have been replaced by
the corresponding sample moments. One should also note that the error term
o(n−p/2 ) in Equation (11.64) has been replaced with the error term op (n−p/2 )
in Equation (11.65), which reflects the fact that the error term in the expansion
is now a random variable. A proof of this result can be found in Section 5.2.2
of Hall (1992). The same result holds for the Edgeworth expansion of the
studentized distribution. That is, if Ĥn (x) = P ∗ [n1/2 (σ̂n∗ )−1 (θ̂n∗ − θ̂n ) ≤ x] is
the bootstrap estimate of the distribution Hn (x) = P [n1/2 σ̂n−1 (θ̂n − θ) ≤ x]
then
$$\hat{H}_n(x) = \Phi(x) + \sum_{k=1}^{p}n^{-k/2}\hat{v}_k(x)\phi(x) + o_p(n^{-p/2}),$$
as n → ∞ where v̂k (x) is the sample version of vk (x) for k = 1, . . . , p. Similar
results hold for the Cornish–Fisher expansions for the quantile functions of
Ĝn (x) and Ĥn (x). Let ĝα = Ĝn−1 (α) and ĥα = Ĥn−1 (α) be the bootstrap
estimates of the quantiles of the distributions of Ĝn and Ĥn , respectively.
Then Theorem 5.2 of Hall (1992) implies that
$$\hat{g}_\alpha = z_\alpha + \sum_{k=1}^{p}n^{-k/2}\hat{q}_k(z_\alpha) + o_p(n^{-p/2}), \qquad (11.66)$$
and
$$\hat{h}_\alpha = z_\alpha + \sum_{k=1}^{p}n^{-k/2}\hat{s}_k(z_\alpha) + o_p(n^{-p/2}),$$
as n → ∞ where q̂k and ŝk are the sample versions of qk and sk , respectively,
for all k = 1, . . . , p. The effect of these results is immediate. Because r̂k (x) =
rk (x) + Op (n−1/2 ) and v̂k (x) = vk (x) + Op (n−1/2 ), it follows from Equations
(11.64) and (11.65) that
Ĝn (x) = Φ(x) + n−1/2 rk (x)φ(x) + op (n−p/2 ) = Gn (x) + op (n−1/2 ),
and
Ĥn (x) = Φ(x) + n−1/2 vk (x)φ(x) + op (n−p/2 ) = Hn (x) + op (n−1/2 ),
as n → ∞. Therefore, it is clear that the bootstrap does a better job of esti-
mating Gn and Hn than the Normal approximation, which would estimate
both of these distributions by Φ(x), resulting in an error term that is op (1) as
n → ∞. This effect has far reaching consequences for other bootstrap meth-
ods, most notably confidence intervals. Hall (1988a) identifies six common
bootstrap confidence intervals and describes their asymptotic behavior. We
consider 100α% upper confidence limits using four of these methods.
The percentile method, introduced by Efron (1979), estimates the sampling
distribution Jn (x) = P (θ̂n ≤ x) with the bootstrap estimate Jˆn (x) = P ∗ (θ̂n∗ ≤
x). The 100α% upper confidence limit is then given by θ̂back∗ (α) = Jˆn−1 (α),
where we are using the notation of Hall (1988a) to identify the confidence
limit. Note that
as n → ∞, where it follows that θ̂stud∗ (α) is second-order correct and accu-
rate. Therefore, from the asymptotic viewpoint this interval is superior to the
percentile and hybrid methods. The practical application of this confidence
limit can be difficult in some applications. The two main problems are that
the confidence limit can be computationally burdensome to compute, and that
when n is small the confidence limit can be numerically unstable. See Polansky
(2000) and Tibshirani (1988) for further details on stabilizing this method.
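To make the percentile method concrete, the R sketch below (illustrative code added here, not from the text; the statistic and the names are assumptions) approximates Ĵn−1(α) by the empirical α quantile of the resampled values of θ̂n.

percentile.upper <- function(x, statistic, alpha = 0.95, b = 2000)
{
  stars <- replicate(b, statistic(sample(x, length(x), replace = TRUE)))
  unname(quantile(stars, alpha))       # approximates J-hat_n^{-1}(alpha)
}
set.seed(6)
x <- rexp(50)
percentile.upper(x, mean)              # 95% upper confidence limit for the mean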
In an effort to fix the theoretical and practical deficiencies of the percentile
method, Efron (1981, 1987) suggests computing a 100α% upper confidence
limit of the form θ̂back∗ [β(α)], where β(α) is an adjusted confidence level that
is designed to reduce the bias of the upper confidence limit. The first method
is called the bias corrected method, and is studied in Exercise 25 of Chapter
10. In this section we will explore the properties of the second method called
the bias corrected and accelerated method. It is worthwhile to note that Efron
(1981, 1987) did not develop these methods based on considerations pertain-
ing to Edgeworth expansion theory. However, the methods can be justified
using this theory. The development we present here is based on the argu-
ments of Hall (1988a). Define a function β̂(α) = Φ[zα + 2m̂ + âzα2 + O(n−1 )]
as n → ∞ where m̂ = Φ−1 [Ĝn (0)] is called the bias correction parameter and
â = −n−1/2 zα−2 [2r̂1 (0) − r̂1 (zα ) − v̂1 (zα )] is called the acceleration constant.
Note that θ̂back∗ [β̂(α)] = θ̂n + n−1/2 σ̂n ĝβ̂(α) , where ĝβ̂(α) has a Cornish–Fisher
expansion involving zβ̂(α) . Therefore, we begin our analysis of this method by
noting that
zβ̂(α) = Φ−1 {Φ[zα + 2m̂ + âzα2 + O(n−1 )]} = zα + 2m̂ + âzα2 + O(n−1 ),
as n → ∞. Therefore, it follows that
11.7.1 Exercises
a. Find a symmetric kernel function for this parameter, and develop the
corresponding U -statistic.
b. Using Theorem 11.5, find the projection of this U -statistic.
c. Find conditions under which Theorem 11.6 (Hoeffding) applies to this
statistic, and specify its weak convergence properties.
a. Using direct calculations, prove that under the null hypothesis that θ = 0
it follows that E(W) = n(n + 1)/4.
b. Using direct calculations, prove that under the null hypothesis that θ = 0
it follows that V(W) = n(n + 1)(2n + 1)/24.
c. Prove that
$$\frac{W - \tfrac{1}{4}n(n+1)}{[\tfrac{1}{24}n(n+1)(2n+1)]^{1/2}} \xrightarrow{d} Z,$$
as n → ∞ where Z has a N(0, 1) distribution.
where
$$\bar{a} = n^{-1}\sum_{i=1}^{n}a(i) \qquad \text{and} \qquad \bar{c} = n^{-1}\sum_{i=1}^{n}c(i).$$
7. Consider the rank sum test statistic from Example 11.2, which is a linear
rank statistic with a(i) = i and c(i) = δ{i; {m + 1, . . . , n + m}} for all
i = 1, . . . , n + m. Under the null hypothesis that the shift parameter θ is
zero, find the mean and variance of this test statistic.
8. Consider the median test statistic described in Example 11.14, which is
a linear rank statistic with a(i) = δ{i; {(m + n + 1)/2, . . . , m + n}} and
c(i) = δ{i; {m + 1, . . . , n + m}} for all i = 1, . . . , n + m.
a. Under the null hypothesis that the shift parameter θ is zero, find the
mean and variance of the median test statistic.
b. Determine if there are conditions under which the distribution of the
median test statistic under the null hypothesis is symmetric.
c. Prove that the regression constants satisfy Noether’s condition.
d. Define α(t) such that a(i) = α[(m + n + 1)−1 i] for all i = 1, . . . , m + n
and show that α is a square integrable function.
e. Prove that the linear rank statistic
$$D = \sum_{i=1}^{n}a(i, n)c(i, n),$$
is too large. Let α denote the desired significance level of this test. Without
loss of generality assume that θ0 = 0.
a. Show that the critical value for this test converges to z1−α as n → ∞.
b. In the context of Theorem 11.15 show that we can take µn (θ) = F (θ)
and σn2 = n/4.
c. Using the result derived above, prove that the efficacy of this test is
given by 2f (0).
equals (1/2)π−1/2, 1, 1/4, 1/6, and 2/3 for the N(0, 1), Uniform(−1/2, 1/2), LaPlace(0, 1),
Logistic(0, 1), and Triangular(−1, 1, 0) densities, respectively.
11. Prove that the square efficacy of the t-test equals 1, 12, 1/2, 3π−2, and
6 for the N(0, 1), Uniform(−1/2, 1/2), LaPlace(0, 1), Logistic(0, 1), and
Triangular(−1, 1, 0) densities, respectively.
12. Prove that the square efficacy of the sign test equals 2π−1, 4, 1, 1/4, and
4 for the N(0, 1), Uniform(−1/2, 1/2), LaPlace(0, 1), Logistic(0, 1), and
Triangular(−1, 1, 0) densities, respectively.
13. Consider the density f(x) = (3/20)5−1/2(5 − x2)δ{x; (−51/2, 51/2)}. Prove that
EV2 ET−2 ≈ 0.864, which is a lower bound for this asymptotic relative efficiency
established by Hodges and Lehmann (1956). Comment on the importance of this
lower bound in statistical applications.
14. Let X1 , . . . , Xn be a set of independent and identically distributed ran-
dom variables from a discrete distribution with distribution function F
and probability distribution function f . Assume that F is a step function
with steps at points contained in the countable set D. Consider estimating
the probability distribution function as
$$\hat{f}_n(x) = \hat{F}_n(x) - \hat{F}_n(x-) = n^{-1}\sum_{k=1}^{n}\delta\{X_k; \{x\}\},$$
for all x ∈ R. Prove that fˆn (x) is an unbiased and consistent estimator of
f (x) for each point x ∈ R.
15. Let X1 , . . . , Xn be a set of independent and identically distributed ran-
dom variables from a discrete distribution with distribution function F and
probability distribution function f . Let −∞ < g1 < g2 < · · · < gd < ∞
be a fixed grid of points in R. Assume that these points are selected in-
dependent of the sample X1 , . . . , Xn and that g1 < min{X1 , . . . , Xn } and
gd > max{X1 , . . . , Xn }. Consider the estimate of F given by
F̄n (x) = F̂n (gi ) + (gi+1 − gi )−1 (x − gi )[F̂n (gi+1 ) − F̂n (gi )],
when x ∈ [gi , gi+1 ]. Prove that this estimate is a valid distribution function
conditional on X1 , . . . , Xn .
16. Let X1 , . . . , Xn be a set of independent and identically distributed random
variables from a distribution F with continuous density f . Prove that the
histogram estimate with fixed grid points −∞ < g1 < · · · < gd < ∞ such
that g1 < min{X1 , . . . , Xn } and gd > max{X1 , . . . , Xn } given by
$$\bar{f}_n(x) = (g_{i+1}-g_i)^{-1}n^{-1}\sum_{k=1}^{n}\delta\{X_k; (g_i, g_{i+1}]\},$$
and
$$\mathrm{IV}(\bar{f}_n) = \int_{-\infty}^{\infty}E\{[\bar{f}_n(x) - E(\bar{f}_n(x))]^2\}\,dx.$$
18. Using the fact that the pointwise bias of the histogram is given by
$$\mathrm{Bias}[\bar{f}_n(x)] = \tfrac{1}{2}f'(x)[h - 2(x - g_i)] + O(h^2),$$
as h → 0, prove that the square bias is given by
$$\mathrm{Bias}^2[\bar{f}_n(x)] = \tfrac{1}{4}[f'(x)]^2[h - 2(x - g_i)]^2 + O(h^3).$$
19. Let f be a density with at least two continuous and bounded derivatives
and let gi < gi+1 be grid points such that h = gi+1 − gi .
Prove that
$$\int_{g_i}^{g_{i+1}}f'(x)(t-x)\,dt = O(h^2),$$
as n → ∞, where
$$\lim_{n\to\infty}h = 0.$$
20. Given that the asymptotic mean integrated squared error for the histogram
with bin width h is given by
$$\mathrm{AMISE}(\bar{f}_n, f) = (nh)^{-1} + \tfrac{1}{12}h^2R(f'),$$
show that the value of h that minimizes this function is given by
$$h_{opt} = n^{-1/3}\left[\frac{6}{R(f')}\right]^{1/3}.$$
21. Let K be any non-decreasing right-continuous function such that
$$\lim_{t\to\infty}K(t) = 1, \qquad \lim_{t\to-\infty}K(t) = 0, \qquad \text{and} \qquad \int_{-\infty}^{\infty}t\,dK(t) = 0.$$
Define the kernel estimator of the distribution function F to be
$$\tilde{F}_n(x) = n^{-1}\sum_{k=1}^{n}K(X_k - x).$$
23. Using the fact that the asymptotic mean integrated squared error of the
kernel estimator with bandwidth h is given by
$$\mathrm{AMISE}(\tilde{f}_{n,h}, f) = (nh)^{-1}R(k) + \tfrac{1}{4}h^4\sigma_k^4R(f''),$$
show that the asymptotically optimal bandwidth is given by
$$h_{opt} = n^{-1/5}\left[\frac{R(k)}{\sigma_k^4R(f'')}\right]^{1/5}.$$
24. Consider the Epanechnikov kernel given by k(t) = (3/4)(1 − t2)δ{t; [−1, 1]}.
Prove that σkR(k) = 3/(5√5).
25. Compute the efficiency of each of the kernel functions given below relative
to the Epanechnikov kernel.
a. The Biweight kernel function, given by (15/16)(1 − t2)2δ{t; [−1, 1]}.
b. The Triweight kernel function, given by (35/32)(1 − t2)3δ{t; [−1, 1]}.
c. The Normal kernel function given by φ(t).
d. The Uniform kernel function given by (1/2)δ{t; [−1, 1]}.
26. Let fˆn (t) denote a kernel density estimator with kernel function k computed
on a sample X1 , . . . , Xn . Prove that,
$$E\left[\int_{-\infty}^{\infty}\hat{f}_n(t)f(t)\,dt\right] = h^{-1}E\left[\int_{-\infty}^{\infty}k\left(\frac{t-X}{h}\right)f(t)\,dt\right],$$
where the expectation on the right hand side of the equation is taken with
respect to X.
27. Let X1 , . . . , Xn be a set of independent and identically distributed random
variables from a distribution F with functional parameter θ. Let θ̂n be an
estimator of θ. Consider estimating the sampling distribution of θ̂n given
by Jn (t) = P (θ̂n ≤ t) using the bootstrap resampling algorithm described
in Section 11.6. In this case let θ̂1∗ , . . . , θ̂b∗ be the values of θ̂n computed on
b resamples from the original sample X1 , . . . , Xn .
a. Show that the bootstrap estimate of the bias given by Equation (11.61)
can be approximated by $\widetilde{\mathrm{Bias}}(\hat{\theta}_n) = \bar{\theta}_n^* - \hat{\theta}_n$, where
$$\bar{\theta}_n^* = b^{-1}\sum_{i=1}^{b}\hat{\theta}_i^*.$$
b. Show that the standard error estimate of Equation (11.62) can be ap-
proximated by
$$\tilde{\sigma}_n = \left\{b^{-1}\sum_{i=1}^{b}(\hat{\theta}_i^* - \bar{\theta}_n^*)^2\right\}^{1/2}.$$
11.7.2 Experiments
For each case make a histogram of the 1000 values of the statistic and
evaluate the results in terms of the theory developed in Exercise 3. Repeat
the experiment for n = 5, 10, 25, 50, and 100 for each of the distributions
listed below.
a. F is a N(0, 1) distribution.
b. F is a Uniform(0, 1) distribution.
c. F is a LaPlace(0, 1) distribution.
d. F is a Cauchy(0, 1) distribution.
e. F is an Exponential(1) distribution.
2. Write a program in R that simulates 1000 samples of size n from a dis-
tribution F with mean θ, where n, F and θ are specified below. For each
sample compute the U -statistic for the parameter θ2 given by
$$U_n = 2[n(n-1)]^{-1}\sum_{i=2}^{n}\sum_{j=1}^{i-1}X_iX_j.$$
a. F is a N(0, 1) distribution.
b. F is a Uniform(0, 1) distribution.
c. F is a LaPlace(0, 1) distribution.
d. F is a Cauchy(0, 1) distribution.
e. F is an Exponential(1) distribution.
a. F is a N(θ, 1) distribution.
b. F is a Uniform(θ − 1/2, θ + 1/2) distribution.
c. F is a LaPlace(θ, 1) distribution.
d. F is a Logistic(θ, 1) distribution.
e. F is a Triangular(−1 + θ, 1 + θ, θ) distribution.
a. F is a N(θ, 1) distribution.
b. F is a Uniform(0, 1) distribution.
c. F is a Cauchy(0, 1) distribution.
d. F is a Triangular(−1, 1, 0) distribution.
e. F corresponds to the mixture of a N(0, 1) distribution with a N(2, 1)
distribution. That is, F has corresponding density (1/2)φ(x) + (1/2)φ(x − 2).
a. F is a N(θ, 1) distribution.
b. F is a Uniform(0, 1) distribution.
c. F is a Cauchy(0, 1) distribution.
d. F is a Triangular(−1, 1, 0) distribution.
e. F corresponds to the mixture of a N(0, 1) distribution with a N(2, 1)
distribution. That is, F has corresponding density (1/2)φ(x) + (1/2)φ(x − 2).
6. Write a function in R that will simulate 100 samples of size n from the
distributions specified below. For each sample use the nonparametric boot-
strap algorithm based on b resamples to estimate the distribution function
Hn (t) = P [n1/2 (θ̂n − θ) ≤ t] for parameters and estimates specified below.
For each bootstrap estimate of Hn (t) compute d∞ (Ĥn , Hn ). The function
should return the sample mean of the values of d∞ (Ĥn , Hn ) taken over the
k simulated samples. Compare these observed means for the cases speci-
fied below, and relate the results to the consistency result given in Theorem
11.16. Also comment on the role that the population distribution and b have
on the results. Treat the results as a designed experiment and use an ap-
propriate linear model to find whether the population distribution, the pa-
rameter, b and n have a significant effect on the mean value of d∞ (Ĥn , Hn ).
For further details on using R to compute bootstrap estimates, see Section
B.4.16.
7. Write a function in R that will simulate 100 samples of size n from the distri-
butions specified below. For each sample use the nonparametric bootstrap
algorithm based on b = 100 resamples to estimate the standard error of the
sample median. For each distribution and sample size, make a histogram
of these bootstrap estimates and compare the results with the asymptotic
standard error for the sample median given in Theorem 11.19. Comment on
how well the bootstrap estimates the standard error in each case. For fur-
ther details on using R to compute bootstrap estimates, see Section B.4.16.
Repeat the experiment for n = 5, 10, 25, and 50.
a. F is a N(0, 1) distribution.
b. F is an Exponential(1) distribution.
c. F is a Cauchy(0, 1) distribution.
d. F is a Triangular(−1, 1, 0) distribution.
e. F is a T(2) distribution.
f. F is a Uniform(0, 1) distribution.
APPENDIX A
Suppose Ω is the universal set, that is, the set that contains all of the elements
of interest. Membership of an element to a set is indicated by the ∈ relation.
Hence, a ∈ A indicates that the element a is contained in the set A. The
relation ∈/ indicates that an element is not contained in the indicated set. A
set A is a subset of Ω if all of the elements in A are also in Ω. This relationship
will be represented with the notation A ⊂ Ω. Hence A ⊂ Ω if and only if a ∈ A
implies a ∈ Ω. If A and B are subsets of Ω then A ⊂ B if all the elements
in A are also in B, that is A ⊂ B if and only if a ∈ A implies a ∈ B.
The union of two sets A and B is a set that contains all elements that are
either in A or B or both sets. This set will be denoted by A ∪ B. Therefore
A ∪ B = {ω ∈ Ω : ω ∈ A or ω ∈ B}. The intersection of two sets A and B is a
set that contains all elements that are common to both A and B. This set will
be denoted by A ∩ B. Therefore A ∩ B = {ω ∈ Ω : ω ∈ A and ω ∈ B}. The
complement of a set A is denoted by Ac and is defined as Ac = {ω ∈ Ω : ω ∈ /
A}, which is the set that contains all the elements in Ω that are not in A. If
A ⊂ B then the elements of A can be subtracted from A using the operation
B\A = {ω ∈ B : ω ∈ / A} = Ac ∩ B.
Unions and intersections distribute in much the same way as sums and prod-
ucts do.
Theorem A.1. Let {Ak }nk=1 be a sequence of sets and let B be another set.
Then
$$B \cup \left(\bigcap_{k=1}^{n}A_k\right) = \bigcap_{k=1}^{n}(A_k \cup B),$$
and
$$B \cap \left(\bigcup_{k=1}^{n}A_k\right) = \bigcup_{k=1}^{n}(A_k \cap B).$$
Theorem A.2 (De Morgan). Let {Ak }nk=1 be a sequence of sets. Then
$$\left(\bigcap_{k=1}^{n}A_k\right)^c = \bigcup_{k=1}^{n}A_k^c,$$
and
$$\left(\bigcup_{k=1}^{n}A_k\right)^c = \bigcap_{k=1}^{n}A_k^c.$$
The number systems have their usual notation. That is, N will denote the
natural numbers, Z will denote the integers, and R will denote the real num-
bers.
Some results in this book rely on some fundamental concepts from metric
spaces and point-set topology. A detailed review of this subject can be found
in Binmore (1981). Consider a space Ω and a metric δ. Such a pairing is known
as a metric space.
Definition A.1. A metric space consists of a set Ω and a function ρ : Ω×Ω →
R, where ρ satisfies
Let ω ∈ Ω and A ⊂ Ω, then we define the distance from the point ω to the
set A as
$$\delta(\omega, A) = \inf_{a \in A}\delta(\omega, a).$$
It follows that for any non-empty set A, there exists at least one point in
Ω such that δ(ω, A) = 0, noting that the fact that δ(ω, A) = 0 does not
necessarily imply that ω ∈ A. This allows us to define the boundary of a set.
Definition A.2. Suppose that A ⊂ Ω. A boundary point of A is a point ω ∈ Ω
such that δ(ω, A) = 0 and δ(ω, Ac ) = 0. The set of all boundary points of a
set A is denoted by ∂A.
Note that ∂A = ∂Ac and that ∂∅ = ∅. The concept of boundary points now
makes it possible to define open and closed sets.
Definition A.3. A set A ⊂ Ω is open if ∂A ⊂ Ac . A set A ⊂ Ω is closed if
∂A ⊂ A.
It follows that A is closed if and only if δ(ω, A) = 0 implies that ω ∈ A. It
also follows that ∅ and Ω are open, the union of the collection of open sets is
open, and that the intersection of any finite collection of open sets is open.
On the other hand, ∅ and Ω are also closed, the intersection of any collection
of closed sets is closed, and the finite union of any set of closed sets is closed.
Definition A.4. Let A be a subset of Ω in a metric space. The interior of A
is defined as A◦ = A \ ∂A. The closure of A is defined as A− = A ∪ ∂A.
The results listed below are results from basic calculus referred to in this book.
These results, along with their proofs, can be found in Apostol (1967).
Theorem A.3. Let f be an integrable function on the interval [a, x] for each
x ∈ [a, b]. Let c ∈ [a, b] and define
$$F(x) = \int_{c}^{x}f(t)\,dt$$
for x ∈ [a, b]. Then F'(x) exists at each point x ∈ (a, b) where f is continuous
and F'(x) = f(x).
Theorem A.4. Suppose that f and g are integrable functions with at least
one derivative on the interval [a, b]. Then
$$\int_{a}^{b}f(x)g'(x)\,dx = f(b)g(b) - f(a)g(a) - \int_{a}^{b}f'(x)g(x)\,dx.$$
Theorem A.5. Suppose that f and w are continuous functions on the interval
[a, b]. If w does not change sign on [a, b] then
$$\int_{a}^{b}w(x)f(x)\,dx = f(\xi)\int_{a}^{b}w(x)\,dx,$$
for some ξ ∈ [a, b].
Theorem A.7. If both f and g are integrable on the real interval [a, b] and
f (x) ≤ g(x) for every x ∈ [a, b] then
$$\int_{a}^{b}f(x)\,dx \le \int_{a}^{b}g(x)\,dx.$$
It is helpful to note that the results from integral calculus transfer to expec-
tations as well. For example, if X is a random variable with distribution F ,
and f and g are Borel functions such that f (X(ω)) ≤ g(X(ω)) for all ω ∈ Ω,
then
$$E[f(X)] = \int f(X(\omega))\,dF(\omega) \le \int g(X(\omega))\,dF(\omega) = E[g(X)]. \qquad (A.1)$$
In many cases when we are dealing with random variables, we may use results
that have slightly weaker conditions than that from classical calculus. For
example, the result of Equation (A.1) remains true under the assumption
that P [f (X(ω)) ≤ g(X(ω))] = 1.
While complex analysis does not usually play a large role in the theory of
probability and statistics, many of the arguments in this book are based on
characteristic functions, which require a basic knowledge of complex analysis.
Complete reviews of complex analysis can be found in Ahlfors (1979) and
Conway (1975). In this section x will denote a complex number of the form
x1 + ix2 ∈ C where i = (−1)1/2 .
Definition A.5. The absolute value or modulus of x ∈ C is |x| = [x21 +x22 ]1/2 .
Definition A.6 (Euler). The complex value exp(iy), where y ∈ R, can be
written as cos(y) + i sin(y).
Theorem A.8. Suppose that x ∈ C such that |x| ≤ 1/2. Then |log(1 − x) + x| ≤ |x|2.
Theorem A.9. Suppose that x ∈ C and y ∈ C, then |exp(x) − 1 − y| ≤
(|x − y| + (1/2)|y|2) exp(γ) where γ ≥ |x| and γ ≥ |y|.
The following result is useful for obtaining bounds involving characteristic
functions.
Theorem A.10. For x ∈ C and y ∈ C we have that |xn − y n | ≤ n|x − y|z n−1
if |x| ≤ z and |y| ≤ z, where z ∈ R.
Theorem A.11. Let n ∈ N and y ∈ R. Then
$$\left|\exp(iy) - \sum_{k=0}^{n}\frac{(iy)^k}{k!}\right| \le \frac{2|y|^n}{n!},$$
and
$$\left|\exp(iy) - \sum_{k=0}^{n}\frac{(iy)^k}{k!}\right| \le \frac{|y|^{n+1}}{(n+1)!}.$$
Theorem A.13. Let {Bn }kn=1 be a partition of a sample space Ω and let A
be any other event in Ω. Then
$$P(A) = \sum_{n=1}^{k}P(A|B_n)P(B_n).$$
A.6 Inequalities
Theorem A.23. Suppose that x is a real value such that |x| < 1. Then
$$\sum_{k=0}^{\infty}x^k = (1 - x)^{-1}.$$
where p ∈ (0, 1). The expectation and variance of X are p and p(1 − p),
respectively. The moment generating function of X is m(t) = 1 − p + p exp(t)
and the characteristic function of X is ψ(t) = 1 − p + p exp(it).
where n ∈ N and p ∈ (0, 1). The expectation and variance of X are np and
np(1−p), respectively. The moment generating function of X is m(t) = [1−p+
p exp(t)]n and the characteristic function of X is ψ(t) = [1 − p + p exp(it)]n . A
Bernoulli(p) random variable is a special case of a Binomial(n, p) random
variable with n = 1.
A.8.3 The Geometric Distribution
The expectation and variance of X are θ−1 and θ−2 (1 − θ), respectively. The
moment generating function of X is m(t) = [1 − (1 − θ) exp(t)]−1 θ exp(t) and
the characteristic function of X is ψ(t) = [1 − (1 − θ) exp(it)]−1 θ exp(it).
The mean vector of X is np and the covariance matrix of X has (i, j)th element
Σij = npi (δij − pj ), for i = 1, . . . , d and j = 1, . . . , d.
where λ, which is called the rate, is a positive real number. The expectation
and variance of X are both equal to λ. The moment generating function of X is m(t) =
exp{λ[exp(t) − 1]}, the characteristic function of X is ψ(t) = exp{λ[exp(it) −
1]}, and the cumulant generating function of X is
$$c(t) = \lambda[\exp(t) - 1] = \lambda\sum_{k=1}^{\infty}\frac{t^k}{k!}.$$
A.8.6 The (Discrete) Uniform Distribution
where α > 0 and β > 0. The expectation and variance of X are αβ and αβ 2 ,
respectively. The moment generating function of X is m(t) = (1 − tβ)−α and
the characteristic function of X is ψ(t) = (1 − itβ)−α .
A.9.6 The LaPlace Distribution
The expectation and the variance of X are α and 2β 2 , respectively. The mo-
ment generating function of X is m(t) = (1 − t2 β 2 )−1 exp(tα) and the char-
acteristic function of X is ψ(t) = (1 + t2 β 2 )−1 exp(itα).
B.1 An Introduction to R
The results of simulations are most effectively understood using visual repre-
sentations such as plots and histograms. The basic mechanism for plotting a
set of data pairs in R is the plot function:
plot(x,y,type,xlim,ylim,main,xlab,ylab,lty,pch,col)
Technically, the plot function has the header plot(x,y,...) where the op-
tional arguments type, main, xlab, ylab, and lty are passed to the par
function. The result of the plot function is to send a plot of the pairs given
by the arguments x and y to the current graphics device, usually a separate
graphics window, depending on the specific way your version of R has been
set up. If no optional arguments are used then the plot is a simple scatterplot
and the labels for the horizontal and vertical axes are taken to be the names
of the objects passed to the function for the arguments x and y.
The optional argument type specifies what type of plot should be constructed.
Some of the possible values of type, along with the resulting type of plot
produced are
"l", which connects the specified points with lines, but does not plot the
points themselves;
"b", which plots both the lines and the points as described in the two options
given above; and
"n", which sets up the axes for the plot, but does not actually plot any values.
The specification type="n" can be used to set up a pair of axes upon which
other objects will be plotted later. If the type argument is not used, then
the plot will use the value stored by the par command, which in most cases
corresponds to the option type="p". The current settings can be viewed by
executing the par command without any arguments.
The arguments xlim and ylim specify the range of the horizontal and vertical
axes, respectively. The range is expressed by an array of length two whose
first element corresponds to the minimum value for the axis and whose second
component corresponds to the maximum value for the axis. For example, the
specification xlim=c(0,1) specifies that the axis should have a range from zero
to one. If these arguments are not specified then R uses a specific algorithm
to compute what these ranges should be based on the ranges of the specified
data. In most cases R does a good job of selecting ranges that make the plot
visually appealing. However, when many sets of data are plotted on a single
set of axes, as will be discussed later in this section, the ranges of the axes
for the initial plot may need to be specified so that the axes are sufficient to
contain all of the objects to be plotted. If ranges are specified and points lie
outside the specified ranges, then the plot is still produced with the specified
ranges, and R will usually return a warning for each point that it encounters
that is outside the specified ranges.
The arguments main, xlab, and ylab specify the labels used for the main
title, the label for the horizontal axis, and the vertical axis, respectively.
The argument lty specifies the type of line used when the argument type="l"
or type="b" is used. The line types can either be specified as an integer or as
a character string. The possible line types include
Integer   Character String   Line Type Produced
0         "blank"            No line is drawn
1         "solid"            Solid line
2         "dashed"           Dashed line
3         "dotted"           Dotted line
4         "dotdash"          Dots alternating with dashes
5         "longdash"         Long dashes
6         "twodash"          Two dashes followed by blank space
Argument Symbol
pch=19 solid circle
pch=20 small circle
pch=21 circle
pch=22 square
pch=23 diamond
pch=24 triangle point-up
pch=25 triangle point down
Other options are also available. See the R help page for the points command
for further details.
The col argument specifies the color used for plotting. Colors can be specified
in several different ways. The simplest way is to specify a character string that
contains the name of the color. Some examples of common character strings
that R understands are "red", "blue", "green", "orange", and "black". A
complete list of the possible colors can be obtained by executing the function
colors with no arguments. Another option is to use one of the many color
specification functions provided by R. For example, a gray color can be spec-
ified using the gray(level) function where the argument level is set to a
number between 0 and 1 that specifies how dark the gray shading should be.
Alternatively, colors can be specified directly in terms of their red-green-blue
(RGB) components with a character string of the form "#RRGGBB", where each
of the pairs RR, GG, BB are of two digit hexadecimal numbers giving a value
between 00 and FF. For example, specifying the argument col="#FF0000" is
equivalent to using the argument col="red".
Many of the results shown in this book consist of plots of more than one
set of data on a single plot. There are many ways that this can be achieved
using R, but probably the simplest approach is to take advantage of the two
functions points and lines. The points function adds points to the current
plot at points specified by two arguments x and y as in the plot function.
The optional arguments pch and col can also be used with this function to
change the plotting symbol or the color of the plotting symbol. The lines
function adds lines to the current plot that connect the points specified by
the two arguments x and y. The optional arguments lty and col can also be
used with this function.
It is important to note that when plotting multiple sets of data on the same
set of axes, the range of the two axes, which are set in the original plot func-
tion, should be sufficient to handle all of the points specified in the multiple
plots. For example, suppose that we wish to plot linear, quadratic, and cubic
functions for a range of x values between 0 and 2, all on the same set of axes,
using different line types for each function. If we execute the commands
x <- seq(0,2,0.001)
y1 <- x
y2 <- x^2
y3 <- x^3
plot(x,y1,type="l",lty=1)
lines(x,y2,lty=2)
lines(x,y3,lty=3)
the resulting plot will cut off the quadratic and cubic functions because the
original plot command set up the vertical axis based on the range of y1. To
fix this problem we need only find the minimum and maximum values before
we execute the original plot command, and then specify this range in the plot
command. That is
x <- seq(0,2,0.001)
y1 <- x
y2 <- x^2
y3 <- x^3
yl <- c(min(y1,y2,y3),max(y1,y2,y3))
plot(x,y1,type="l",lty=1,ylim=yl)
lines(x,y2,lty=2)
lines(x,y3,lty=3)
The final plotting function that will generally be helpful in performing the ex-
periments suggested in this book is the hist function, which plots a histogram
of a set of data. The usage of the hist function is given by
hist(x, breaks = "Sturges", freq = NULL, right = T, col = NULL,
border = NULL, main, xlim, ylim, xlab, ylab)
In all but the first case, R uses the specified number as a suggestion only.
The only way to force R to use the number you wish is to specify the
endpoints of the cells. The default values of breaks specifies the algorithm
"Sturges". Other possible algorithm names are "FD" and "Scott". Consult
the R help page for the hist function for further details on these algorithms.
freq is a logical argument that specifies whether a frequency or density
histogram is plotted. If freq=T is specified, then a frequency histogram is
plotted. If freq=F is specified then a density histogram is plotted. In this
case the histogram has a total area of one. When comparing a histogram to
a known density on the same plot, using a density histogram usually gives
better results.
right is a logical argument that specifies whether the endpoints of the cells
are included on the left or right hand side of the cell interval. If right=T
is specified then the cells include the right endpoint but not the left. If
right=F is specified then the cells include the left endpoint but not the
right.
col specifies the color of the bars plotted on the histogram. The default value
of NULL yields unfilled bars.
border specifies the color of the border around the bars. The default is to
use the color used for plotting the axes.
main, xlab, ylab can be used to specify the main title, the label for the
horizontal axis, and the label for the vertical axis, respectively.
xlim and ylim can be used to specify the ranges of the horizontal and vertical
axes. These options are useful when overlaying a histogram with a plot of
a density for comparison.
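As a simple illustration of these arguments, the hypothetical example below (the simulated data, title, and labels are arbitrary) plots a density histogram of a simulated sample with gray bars and a specified range for the horizontal axis.
z <- rnorm(200)
hist(z, breaks="FD", freq=F, col="gray", border="black",
     main="A Density Histogram", xlab="z", xlim=c(-4,4))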
Wand (1997) argues that many of the schemes for selecting the number of bins that are implemented in standard statistical software use too few bins. A more
reasonable method for selecting the bin width of a histogram is provided in the
KernSmooth library and has the form dpih(x), where x is the observed vector
of data; we have omitted several technical arguments that have reasonable default values. Note that this function estimates the optimal width of the bins based on the observed data, rather than the number of bins. Therefore, the
endpoints of the histogram classes must be calculated to implement this func-
tion. For example, if we wish to simulate a sample of size 100 from a N(0, 1)
distribution and create a histogram of the data based on the methodology of
Wand (1997), then we can use the following code:
library(KernSmooth)
x <- rnorm(100)
h <- dpih(x)
bins <- seq(min(x)-h,max(x)+2*h,by=h)
hist(x,breaks=bins)
The standard R package has the ability to handle complex numbers in a na-
tive format. The basic function used for creating complex valued vectors is the
complex function. The real and imaginary parts of a complex vector can be re-
covered using the functions Re and Im. The modulus of a complex number can be obtained using the Mod function. Further information about these functions, including some optional arguments not discussed here, can be found at www.r-project.org.
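For example, a short sketch of how these functions behave is given below; the particular complex number is arbitrary.
z <- complex(real=3, imaginary=4)   # the complex number 3 + 4i
Re(z)                               # returns 3
Im(z)                               # returns 4
Mod(z)                              # returns 5, the modulus of z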
The R package includes functions that can compute the density (or probability
distribution function), distribution function, and quantile function for many
standard distributions including nearly all of the distributions used in this
book. There are also functions that will easily generate random samples from
these distributions as well. This section provides information about the R
functions for each of the distributions used in this book. Further information
about these functions can be found at www.r-project.org.
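These functions follow the usual R naming convention, in which the prefixes d, p, q, and r denote the density, distribution function, quantile function, and random generation, respectively. For example, for the N(0, 1) distribution:
dnorm(0)        # density of the N(0,1) distribution at zero
pnorm(1.96)     # distribution function evaluated at 1.96 (approximately 0.975)
qnorm(0.975)    # quantile function evaluated at 0.975 (approximately 1.96)
rnorm(5)        # a simulated sample of size five from a N(0,1) distribution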
There does not appear to be a standard R library at this time that supports
the LaPlace(a, b) distribution. The relatively simple form of the distribution
makes it fairly easy to work with, however. For example, a function for the
density function can be programmed as
dlaplace <- function(x, a=0, b=1)
return(exp(-1*abs(x-a)/b)/(2*b))
A function for the distribution function can be programmed as
plaplace <- function(q, a=0, b=1)
{
if(q<a) return(0.5*exp((q-a)/b))
else return(1-0.5*exp((a-q)/b))
}
and a function for the quantile function can be programmed as
qlaplace <- function(p, a=0, b=1)
return(a-b*sign(p-0.5)*log(1-2*abs(p-0.5)))
A function to generate a sample of size n from a LaPlace(a, b) distribution
can be programmed as
rlaplace <- function(n, a=0, b=1)
{
u <- runif(n,-0.5,0.5)
return(a-b*sign(u)*log(1-2*abs(u)))
}
Please note that these are fairly primitive functions in that they do no error
checking and may not be the most numerically efficient methods. They should
be sufficient for performing the experiments in this book as long as one is
careful with their use.
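One informal check, offered only as a sketch with arbitrary values, is to verify that the quantile function approximately inverts the distribution function.
p <- c(0.10,0.25,0.50,0.75,0.90)
q <- qlaplace(p)
sapply(q, plaplace)    # should return values very close to p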
Support for the Wald distribution can be found in the library SuppDists,
which is available from most official R internet sites.
For example, given two n × m matrix objects A and B, one could compute their element-wise product using nested loops as
C <- matrix(0,n,m)
for(i in 1:n) for(j in 1:m) C[i,j] <- A[i,j] * B[i,j]
However, it is much more efficient to use the simple command C <- A*B,
which multiplies the corresponding elements. This also works with addition,
subtraction, and division. One should note that the matrix product of A and B, assuming they are conformable, is given by the command A%*%B. Many standard R functions will also operate element-wise on a vector or matrix. For example,
cos(A) will return a matrix object whose elements correspond to the cosine
of the elements of A. Similarly, if one wishes to compute a vector of N(0, 1)
quantiles for p = 0.01, . . . , 0.99, one can avoid using a loop by simply executing
the command qnorm(seq(0.01,0.99,0.01)).
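A small sketch contrasting these operations is given below; the matrices are arbitrary.
A <- matrix(1:4,2,2)
B <- matrix(5:8,2,2)
A * B       # element-wise product
A %*% B     # matrix product (A and B are conformable)
cos(A)      # cosine applied element-wise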
One should also note that R offers vector versions of many functions that have
the option of returning vectors. This is particularly useful when simulating a
set of data. For example, one could simulate a sample of size 10 from a N(0, 1) distribution using the commands:
z <- matrix(0,10,1)
for(i in 1:10) z[i] <- rnorm(1)
It is much more efficient to just use the command z <- rnorm(10). In fact,
many of the experiments in this book specify simulating 1000 samples of size
10, for example. If all the samples are from the same distribution, and are in-
tended to be independent of one another, then all of the samples can be simu-
lated at once using the command z <- matrix(rnorm(10*1000),1000,10).
Each sample will then correspond to a row of the object z.
The final suggested method for avoiding loops in R is to use the apply com-
mand whenever possible. The apply command allows one to compute many
R functions row-wise or column-wise on an array in R without using a loop.
The apply command has the following form:
apply(X, MARGIN, FUN, ...)
where X is the object that a function will be applied to, MARGIN indicates
whether the function will be applied to rows (MARGIN=1), columns (MARGIN=2),
or both (MARGIN=c(1,2)), and FUN is the function that will be applied. Op-
tional arguments for the function that is being applied can be passed after
these three arguments have been specified. For example, suppose that we wish
to compute a 10% trimmed mean on the rows of a matrix object X. This can
be accomplished using the command apply(X,1,mean,trim=0.10).
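For example, combining apply with the simulation approach described above, the sample mean of each of 1000 simulated samples of size 10 can be computed without a loop; the object names below are arbitrary.
z <- matrix(rnorm(10*1000),1000,10)   # 1000 independent samples of size 10
sample.means <- apply(z,1,mean)       # one sample mean per row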
Some of the experiments in this book require the reader to simulate samples from normal mixtures. There are libraries that provide functions for this purpose, but for the experiments in this book a simple approach will suffice. Suppose
that we wish to simulate a sample of size n from a normal mixture density of
the form
\[ f(x) = \sum_{i=1}^{p} \omega_i \sigma_i^{-1} \phi\left[\sigma_i^{-1}(x - \mu_i)\right], \tag{B.1} \]
where ω1 , . . . , ωp are the weights of the mixture which are assumed to add to
one, µ1 , . . . , µp are the means of the normal densities, and σ12 , . . . , σp2 are the
variances of the normal densities. The essential idea behind simulating a sample from such a density is the fact that if $X$ has the density given in Equation (B.1), then $X$ has the same distribution as $Z = Y'W$, where $Y$ has a $\text{Multinomial}(1, p, \omega)$ distribution and $W$ has a $N(\mu, \Sigma)$ distribution, with $\omega' = (\omega_1, \ldots, \omega_p)$, $\mu' = (\mu_1, \ldots, \mu_p)$, and $\Sigma = \text{diag}\{\sigma_1^2, \ldots, \sigma_p^2\}$. Therefore,
suppose we wish to simulate an observation from the density
\[ f(x) = \tfrac{1}{4}\phi(x - 1) + \tfrac{1}{2}\phi(x) + \tfrac{1}{4}\phi(x + 1), \]
then we could use the commands (note that the mvrnorm function is provided by the MASS library)
library(MASS)
omega <- c(0.25,0.50,0.25)
mu <- c(-1,0,1)
Sigma <- diag(1,3,3)
W <- mvrnorm(1, mu, Sigma)
Y <- rmultinom(1,1,omega)
Z <- t(W)%*%Y
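For repeated use in the experiments one might prefer a small helper function. The sketch below is not from the text; it uses the equivalent approach of first sampling a component label for each observation and then drawing from the corresponding normal component. The function name rnormmix and its arguments are hypothetical.
rnormmix <- function(n, omega, mu, sigma)
{
# sample a component index for each of the n observations
j <- sample(seq_along(omega), n, replace=TRUE, prob=omega)
# draw each observation from the selected normal component
rnorm(n, mean=mu[j], sd=sigma[j])
}
# For example: rnormmix(100, c(0.25,0.50,0.25), c(-1,0,1), c(1,1,1))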
In this simulation, we consider flipping a fair coin n times. For each flip we
wish to keep track of the proportion of flips that are heads. The experiment is
repeated b times and the resulting proportions are plotted together on a single
plot that demonstrates how the proportion converges to 1/2 as n gets large. For the example we have used n = 100 and b = 5, but these parameters are easily changed.
n <- 100
b <- 5
p <- matrix(0,b,n)
plot(c(0,100),c(0.5,0.5),type="l",lty=2,ylim=c(0,1),
xlab="Flip Number",ylab="Proportion Heads")
# One possible implementation of the loop described above: simulate b sequences
# of n fair coin flips and plot the running proportion of heads for each sequence.
for(i in 1:b) p[i,] <- cumsum(rbinom(n,1,0.5))/(1:n)
for(i in 1:b) lines(1:n,p[i,])
The next simulation investigates the Central Limit Theorem by simulating samples of size 10 from an Exponential(1) distribution and comparing a histogram of the observed sample means to a normal density whose parameters are computed to match the mean and variance of the observed sample means. Note that this simulation sets up the samples in a 100 × 10 matrix and uses the apply function to compute the sample mean of each row. This simulation also sets up the ranges of the horizontal and vertical axes so that the overlay of the density function can be observed. In setting up the range of the vertical axis we use the hist function with the argument plot=F. Using this argument causes the histogram not to be plotted, but it does return a list that contains the calculated heights of the density bars. Therefore, we are able to compare the maximum value of the density curve, contained in the y object, with the maximum density from the histogram, contained in the object returned by the command hist(obs.means,plot=F)$density.
x <- matrix(rexp(1000,1),100,10)
obs.means <- apply(x,1,mean)
norm.mean <- mean(obs.means)
norm.sd <- sd(obs.means)
xl <- c(norm.mean-4*norm.sd,norm.mean+4*norm.sd)
x.grid <- seq(xl[1],xl[2],length.out=1000)
y <- dnorm(x.grid,norm.mean,norm.sd)
yl <- c(0,max(y,hist(obs.means,plot=F)$density))
Figure B.2 Example output from the simulation that investigates the Central Limit Theorem.
hist(obs.means,freq=F,xlab="observed means",ylab="density",
main="The Central Limit Theorem", xlim=xl,ylim=yl)
lines(x.grid,y)
The resulting output should be a plot similar to the one in Figure B.2.
The next example uses the scatterplot3d library to plot the characteristic function $\exp(it - t^2/2)$ of a N(1, 1) distribution as a curve in three dimensions, with the real and imaginary parts of the function plotted against t.
library(scatterplot3d)
Figure B.3 Example output for plotting the normal characteristic function.
t <- seq(-10,10,0.001)
cf <- exp(complex(1,0,1)*t-0.5*t*t)   # complex(1,0,1) is the imaginary unit 0+1i
scatterplot3d(Re(cf),Im(cf),t,type="l",xlab="real",ylab="imag")
The resulting output should be a plot similar to the one in Figure B.3.
The next example compares the exact density of the standardized sample mean of n = 3 Exponential(1) random variables with the normal approximation and with one- and two-term Edgeworth expansions of the density.
n <- 3
x <- seq(-3,4,0.01)
y1 <- dnorm(x)
y2 <- dgamma(x+sqrt(n),n,sqrt(n))
y3 <- y1 + y1/(3*sqrt(n))*(x^3-3*x)
y4 <- y3 + y1/n*((1/18)*(x^6-15*x^4+45*x^2-15)+
(3/8)*(x^4-6*x^2+3))
plot(x,y1,type="l",xlab="",ylab="",ylim=c(0, max(y1,y2,y3,y4)),
lty=2)
lines(x,y2)
lines(x,y3,lty=3)
lines(x,y4,lty=4)
The resulting output should be a plot similar to the one in Figure B.4.
Theorem 3.15 (Hartman and Wintner) provides a result on the extreme fluc-
tuations of the sample mean. The complexity of this result makes it difficult
to visualize. The code below was used to produce Figure 3.6.
ss <- seq(5,500,1)
x <- rnorm(max(ss))
sm <- matrix(0,length(ss),1)
lf <- matrix(0,length(ss),1)
uf <- matrix(0,length(ss),1)
ul <- matrix(0,length(ss),1)
ll <- matrix(0,length(ss),1)
for(i in seq(1,length(ss),1))
{
sm[i] <- sqrt(ss[i])*mean(x[1:ss[i]])
uf[i] <- max(sm[1:i])
lf[i] <- min(sm[1:i])
ul[i] <- sqrt(2*log(log(ss[i])))
ll[i] <- -1*ul[i]
}
yl <- c(min(sm,uf,lf,ul,ll),max(sm,uf,lf,ul,ll))
plot(ss,sm,type="l",ylim=yl)
lines(ss,uf,lty=2)
lines(ss,lf,lty=2)
lines(ss,ll,lty=3)
lines(ss,ul,lty=3)
References
Bahadur, R. R. (1960b). Stochastic comparison of tests. The Annals of Math-
ematical Statistics, 31, 276–295.
Bahadur, R. R. (1964). On Fisher’s bound for asymptotic variances. The
Annals of Mathematical Statistics, 35, 1545–1552.
Bahadur, R. R. (1967). Rates of convergence of estimates and test statistics. The Annals of Mathematical Statistics, 38, 303–324.
Barnard, G. A. (1970). Discussion on paper by Dr. Kalbfleisch and Dr. Sprott.
Journal of the Royal Statistical Society, Series B, 32, 194–195.
Barndorff-Nielsen, O. E. (1978). Information and Exponential Families in
Statistical Theory. New York: John Wiley and Sons.
Barndorff-Nielsen, O. E. and Cox, D. R. (1989). Asymptotic Techniques for
Use in Statistics. London: Chapman and Hall.
Bartlett, M. S. (1963). Statistical estimation of density functions. Sankhyā,
Series A, 25, 245–254.
Basu, D. (1955). An inconsistency of the method of maximum likelihood.
The Annals of Mathematical Statistics, 26, 144–145.
Beran, R. (1987). Prepivoting to reduce level error of confidence sets. Biometrika,
74, 457–468.
Beran, R. (1982). Estimated sampling distributions: The bootstrap and com-
petitors. The Annals of Statistics, 10, 212–225.
Beran, R. and Ducharme, G. R. (1991). Asymptotic Theory for Bootstrap
Methods in Statistics. Montréal: Centre De Recherches Mathematiques.
Berry, A. C. (1941). The accuracy of the Gaussian approximation to the
sum of independent variates. Transactions of the American Mathematical
Society, 49, 122–136.
Bhattacharya, R. N. and Ghosh, J. K. (1978). On the validity of the formal
Edgeworth expansion. The Annals of Statistics, 6, 434–451.
Bhattacharya, R. N. and Rao, C. R. (1976). Normal Approximation and
Asymptotic Expansions. New York: John Wiley and Sons.
Bickel, P. J. and Freedman, D. A. (1980). On Edgeworth Expansions and the
Bootstrap. Unpublished manuscript.
Bickel, P. J. and Freedman, D. A. (1981). Some asymptotic theory for the
bootstrap. The Annals of Statistics, 9, 1196–1217.
Billingsley, P. (1986). Probability and Measure. New York: John Wiley and
Sons.
Billingsley, P. (1999). Convergence of Probability Measures. New York: John
Wiley and Sons.
Binmore, K. G. (1981). Topological Ideas. Cambridge: Cambridge University
Press.
Bolstad, W. M. (2007). Introduction to Bayesian Statistics. New York: John
Wiley and Sons.
Bowman, A. W. (1984). An alternative method of cross-validation for the
smoothing of density estimates. Biometrika, 71, 353–360.
Bowman, A.W. and Azzalini, A. (1997). Applied Smoothing Techniques for
Data Analysis: the Kernel Approach with S-Plus Illustrations. Oxford: Ox-
ford University Press.
Bratley, P., Fox, B. L., and Schrage, L. E. (1987). A Guide to Simulation.
New York: Springer-Verlag.
Buck, R. C. (1965). Advanced Calculus. New York: McGraw-Hill.
Burman, P. (1985). A data dependent approach to density estimation. Zeitschrift
für Wahrscheinlichkeitstheorie und Verwandte Gebiete, 69, 609–628.
Butler, R. W. (2007). Saddlepoint Approximations with Applications. Cam-
bridge: Cambridge University Press.
Cantelli, F. P. (1933). Sulla determinazione empirica delle leggi di probabilità. Giornale dell'Istituto Italiano degli Attuari, 4, 421–424.
Casella, G. and Berger, R. L. (2002). Statistical Inference. Pacific Grove, CA:
Duxbury.
Chen, P.-N. (2002). Asymptotic refinement of the Berry-Esseen constant.
Unpublished manuscript.
Christensen, R. (1996). Plane Answers to Complex Questions. New York:
Springer.
Chow, Y. S. and Teicher, H. (2003). Probability Theory: Independence, In-
terchangeability, Martingales. New York: Springer.
Chung, K. L. (1974). A Course in Probability. Boston, MA: Academic Press.
Cochran, W. G. (1952). The χ2 test of goodness of fit. The Annals of Math-
ematical Statistics, 23, 315–345.
Conway, J. B. (1975). Functions of One Complex Variable. New York: Springer-
Verlag.
Copson, E. T. (1965). Asymptotic Expansions. Cambridge: Cambridge Uni-
versity Press.
Cornish, E. A. and Fisher, R. A. (1937). Moments and cumulants in the
specification of distributions. International Statistical Review, 5, 307–322.
Cramér, H. (1928). On the composition of elementary errors. Skandinavisk
Aktuarietidskrift, 11, 13–74, 141–180.
Cramér, H. (1946). Mathematical Methods of Statistics. Princeton, NJ: Prince-
ton University Press.
Cramér, H. (1970). Random Variables and Probability Distributions. 3rd Ed.
Cambridge: Cambridge University Press.
Daniels, H. E. (1954). Saddlepoint approximations in statistics. The Annals
of Mathematical Statistics, 25, 631–650.
De Bruijn, N. G. (1958). Asymptotic Methods in Analysis. New York: Dover.
Dieudonné, J. (1960). Foundations of Modern Analysis. New York: John Wi-
ley and Sons.
Edgeworth, F. Y. (1896). The asymmetrical probability curve. Philosophical Magazine, Fifth Series, 41, 90–99.
Edgeworth, F. Y. (1905). The law of error. Proceedings of the Cambridge
Philosophical Society, 20, 36–65.
Edgeworth, F. Y. (1907). On the representation of a statistical frequency by
a series. Journal of the Royal Statistical Society, Series A, 70, 102–106.
Efron, B. (1979). Bootstrap methods: Another look at the jackknife. The
Annals of Statistics, 7, 1–26.
Efron, B. (1981). Nonparametric standard errors and confidence intervals.
Canadian Journal of Statistics, 9, 139–172.
Efron, B. (1982). The Jackknife, the Bootstrap, and Other Resampling Plans.
Philadelphia, PA: Society for Industrial and Applied Mathematics.
Efron, B. (1987). Better bootstrap confidence intervals. Journal of the Amer-
ican Statistical Association, 82, 171–200.
Efron, B. and Gong, G. (1983). A leisurely look at the bootstrap, the jack-
knife, and cross validation. The American Statistician, 37, 36–48.
Efron, B., Halloran, E., and Holmes, S. (1996). Bootstrap confidence levels for phylogenetic trees. Proceedings of the National Academy of Sciences of the United States of America, 93, 13429–13434.
Efron, B. and Tibshirani, R. J. (1993). An Introduction to the Bootstrap.
London: Chapman and Hall.
Efron, B. and Tibshirani, R. J. (1998). The problem of regions. The Annals
of Statistics, 26, 1687–1718.
Epanechnikov, V. A. (1969). Non-parametric estimation of a multivariate
probability density. Theory of Probability and its Applications, 14, 153–
158.
Erdélyi, A. (1956). Asymptotic Expansions. New York: Dover Publications.
Esseen, C.-G. (1942). On the Liapounoff limit of error in the theory of prob-
ability. Arkiv för Matematik, Astronomi och Fysik, 28A, 1–19.
Esseen, C.-G. (1945). Fourier analysis of distribution functions. A mathe-
matical study of the Laplace-Gaussian law. Acta Mathematica, 77, 1–125.
Esseen, C.-G. (1956). A moment inequality with an application to the central
limit theorem. Skandinavisk Aktuarietidskrift, 39, 160–170.
Feller, W. (1935). Über den zentralen Grenzwertsatz der Wahrscheinlichkeitsrechnung. Mathematische Zeitschrift, 40, 521–559.
Feller, W. (1971). An Introduction to Probability Theory and its Application,
Volume 2. 2nd Ed. New York: John Wiley and Sons.
Felsenstein, J. (1985). Confidence limits on phylogenies: An approach using the bootstrap. Evolution, 39, 783–791.
Fernholz, L. T. (1983). von Mises Calculus for Statistical Functionals. New
York: Springer-Verlag.
Finner, H. and Strassburger, K. (2002). The partitioning principle. The An-
nals of Statistics, 30, 1194–1213.
Fisher, R. A. (1915). Frequency distribution of the values of the correlation
coefficient in samples from an indefinitely large population. Biometrika, 10,
507–521.
Fisher, N. I. and Hall, P. (1991). Bootstrap algorithms for small sample sizes.
Journal of Statistical Planning and Inference, 27, 157–169.
Fisher, R. A. and Cornish, E. A. (1960). The percentile points of distributions
having known cumulants. Technometrics, 2, 209–226.
Fix, E. and Hodges, J. L. (1951). Discriminatory analysis - nonparametric
discrimination: consistency properties. Report No. 4, Project No. 21-29-
004. Randolph Field, TX: USAF School of Aviation Medicine.
Fix, E. and Hodges, J. L. (1989). Discriminatory analysis - nonparametric
discrimination: consistency properties. International Statistical Review, 57,
238–247.
Fréchet, M. (1925). La notion de differentielle dans l’analyse generale. An-
nales Scientifiques de l’École Normale Supérieure, 42, 293–323.
Fristedt, B. and Gray, L. (1997). A Modern Approach to Probability Theory.
Boston, MA: Birkhäuser.
Garwood, F. (1936). Fiducial limits for the Poisson distribution. Biometrika,
28, 437–442.
Ghosh, M. (1994). On some Bayesian solutions of the Neyman-Scott problem.
Statistical Decision Theory and Related Topics, Volume V. J. Berger and
S. S. Gupta, eds. New York: Springer-Verlag. 267–276.
Ghosh, M., Parr, W. C., Singh, K., and Babu, G. J. (1984). A note on
bootstrapping the sample median. The Annals of Statistics, 12, 1130–1135.
Gibbons, J. D. and Chakraborti, S. (2003). Nonparametric Statistical Infer-
ence. Boca Raton, FL: CRC Press.
Giné, E. and Zinn, J. (1989). Necessary conditions for the bootstrap of the
mean. The Annals of Statistics, 17, 684–691.
Glivenko, V. (1933). Sulla determinazione empirica delle leggi di probabilità. Giornale dell'Istituto Italiano degli Attuari, 4, 92–99.
Gnedenko, B. V. (1962). The Theory of Probability. New York: Chelsea Pub-
lishing Company.
Gnedenko, B. V. and Kolmogorov, A. N. (1968). Limit Distributions for Some
Sums of Independent Random Variables. Reading, MA: Addison-Wesley.
Graybill, F. A. (1976). Theory and Application of the Linear Model. Pacific
Grove, CA: Wadsworth and Brooks.
Gut, A. (2005). Probability: A Graduate Course. New York: Springer.
Hájek, J. (1969). Nonparametric Statistics. San Francisco, CA: Holden-Day.
Hájek, J. (1972). Local asymptotic minimax and admissibility in estimation.
Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics
and Probability, Volume I, 175–194.
Hájek, J. and Šidák, Z. (1967). Theory of Rank Tests. New York: Academic
Press.
Hall, P. (1983a). Inverting an Edgeworth expansion. The Annals of Statistics,
11, 569–576.
Hall, P. (1983b). Large sample optimality of least squares cross-validation in
density estimation. The Annals of Statistics, 11, 1156–1174.
Hall, P. (1986a). On the bootstrap and confidence intervals. The Annals of
Statistics, 14, 1431–1452.
Hall, P. (1986b). On the number of bootstrap simulations required to con-
struct a confidence interval. The Annals of Statistics, 14, 1453–1462.
Hall, P. (1988a). Theoretical comparison of bootstrap confidence intervals.
The Annals of Statistics, 16, 927–953.
Hall, P. (1988b). Introduction to the Theory of Coverage Processes. New
York: John Wiley and Sons.
Hall, P. (1990). Asymptotic properties of the bootstrap for heavy-tailed dis-
tributions. The Annals of Probability, 18, 1342–1360.
Hall, P. (1992). The Bootstrap and Edgeworth Expansion. New York: Springer.
Hall, P. and Martin, M. A. (1988). On bootstrap resampling and iteration.
Biometrika, 75, 661–671.
Hall, P. and Martin, M. A. (1991). On the error incurred using the bootstrap variance estimator when constructing confidence intervals for quantiles. Journal of Multivariate Analysis, 38, 70–81.
Halmos, P. R. (1958). Finite-Dimensional Vector Spaces. 2nd Ed. Princeton,
NJ: Van Nostrand.
Halmos, P. R. (1974). Measure Theory. New York: Springer-Verlag.
Hardy, G. H. (1949). Divergent Series. Providence, RI: AMS Chelsea Publishing.
Heyde, C. C. (1963). On a property of the lognormal distribution. Journal
of the Royal Statistical Society, Series B, 25, 392–393.
Hochberg, Y. and Tamhane, A. C. (1987). Multiple Comparison Procedures.
New York: John Wiley and Sons.
Hodges, J. L. and Lehmann, E. L. (1956). The efficiency of some nonparametric competitors of the t-test. The Annals of Mathematical Statistics, 27, 324–335.
Hodges, J. L. and Lehmann, E. L. (1970). Deficiency. The Annals of Mathe-
matical Statistics, 41, 783–801.
Hoeffding, W. (1948). A class of statistics with asymptotically normal dis-
tribution. The Annals of Mathematical Statistics, 19, 293–325.
Hollander, M. and Wolfe, D. A. (1999). Nonparametric Statistical Methods.
2nd Ed. New York: John Wiley and Sons.
Hsu, P. L. and Robbins, H. (1947). Complete convergence and the law of large
numbers. Proceedings of the National Academy of Sciences of the United
States of America, 33, 25–31.
Huber, P. J. (1966). Strict efficiency excludes superefficiency. The Annals of
Mathematical Statistics, 37, 1425.
Jensen, J. L. (1988). Uniform saddlepoint approximations. Advances in Ap-
plied Probability, 20, 622–634.
Johnson, N. L., Kotz, S., and Balakrishnan, N. (1994). Distributions in Statis-
tics: Continuous Univariate Distributions. Volume I. 2nd Ed. New York:
John Wiley and Sons.
Keller, H. H. (1974). Differential Calculus on Locally Convex Spaces. Lecture
Notes in Mathematics Number 417. Berlin: Springer-Verlag.
Kendall, M. and Stuart, A. (1977). The Advanced Theory of Statistics, Vol-
ume 1: Distribution Theory. 4th Ed. New York: Macmillan Publishing Com-
pany.
Khuri, A. I. (2003). Advanced Calculus with Applications in Statistics. New
York: John Wiley and Sons.
Knight, K. (1989). On the bootstrap of the sample mean in the infinite variance case. The Annals of Statistics, 17, 1168–1175.
Koenker, R. W. and Bassett, G. W. (1984). Four (Pathological) examples in
asymptotic statistics. The American Statistician, 38, 209–212.
Kolassa, J. E. (2006). Series Approximation Methods in Statistics. 3rd Ed.
New York: Springer.
Kolmogorov, A. N. (1956). Foundations of the Theory of Probability. New
York: Chelsea Publishing Company.
Kowalski, J. and Tu, X. M. (2008). Modern Applied U-Statistics. New York:
John Wiley and Sons.
Landau, E. (1974). Handbuch der Lehre von der Verteilung der Primzahlen.
Providence, RI: AMS Chelsea Publishing.
Le Cam, L. (1953). On some asymptotic properties of maximum likelihood
estimates and related Bayes’ estimates. University of California Publica-
tions in Statistics, 1, 277–330.
Le Cam, L. (1979). Maximum Likelihood Estimation: An Introduction. Lec-
ture Notes in Statistics Number 18. University of Maryland, College Park,
MD.
Lee, A. J. (1990). U-Statistics: Theory and Practice. New York: Marcel
Dekker.
Lehmann, E. L. (1983). Theory of Point Estimation. New York: John Wiley
and Sons.
Lehmann, E. L. (1986). Testing Statistical Hypotheses. Pacific Grove, CA:
Wadsworth and Brooks/Cole.
Lehmann, E. L. (1999). Elements of Large-Sample Theory. New York: Springer.
Lehmann, E. L. and Casella, G. (1998). Theory of Point Estimation. New
York: Springer.
Lehmann, E. L. and Shaffer, J. (1988). Inverted distributions. The American
Statistician, 42, 191–194.
Lévy, P. (1925). Calcul des Probabilités. Paris: Gauthier-Villars.
Lindeberg, J. W. (1922). Eine neue Herleitung des Exponentialgesetzes in der Wahrscheinlichkeitsrechnung. Mathematische Zeitschrift, 15, 211–225.
Loève, M. (1977). Probability Theory. 4th Ed. New York: Springer-Verlag.
Loh, W.-Y. (1987). Calibrating confidence coefficients. Journal of the Amer-
ican Statistical Association, 82, 155–162.
Lukacs, E. (1956). On certain periodic characteristic functions. Compositio
Mathematica, 13, 76–80.
Mammen, E. (1992). When Does Bootstrap Work? Asymptotic Results and
Simulations. New York: Springer.
Miller, R. G. (1981). Simultaneous Statistical Inference. 2nd Ed. New York:
Springer-Verlag.
Mood, A. M. (1950). Introduction to the Theory of Statistics. 3rd Ed. New
York: McGraw-Hill.
Nadaraya, E. A. (1974). On the integral mean squared error of some non-
parametric estimates for the density function. Theory of Probability and Its
Applications, 19, 133–141.
Nashed, M. Z. (1971). Differentiability and related properties of nonlinear
operators: some aspects of the role of differentials in nonlinear functional
analysis. Nonlinear Functional Analysis and Applications. L. B. Rall, ed.
New York: Academic Press. 103–309.
Neyman, J. and Scott, E. (1948). Consistent estimates based on partially consistent observations. Econometrica, 16, 1–32.
Noether, G. E. (1949). On a theorem by Wald and Wolfowitz. The Annals
of Mathematical Statistics, 20, 455–458.
Noether, G. E. (1955). On a theorem of Pitman. The Annals of Mathematical
Statistics, 26, 64–68.
Parzen, E. (1962). On the estimation of a probability density function and
the mode. The Annals of Mathematical Statistics, 33, 1065–1076.
Petrov, V. V. (1995). Limit Theorems of Probability Theory: Sequences of
Independent Random Variables. New York: Oxford University Press.
Petrov, V. V. (2000). Classical-type limit theorems for sums of independent
random variables. Limit Theorems of Probability Theory, Encyclopedia of
Mathematical Sciences. Number 6, 1–24. New York: Springer.
Pfanzagl, J. (1970). On the asymptotic efficiency of median unbiased estimates. The Annals of Mathematical Statistics, 41, 1500–1509.
Pitman, E. J. G. (1948). Notes on Non-Parametric Statistical Inference. Un-
published notes from Columbia University.
Polansky, A. M. (1995). Kernel Smoothing to Improve Bootstrap Confidence
Intervals. Ph.D. Dissertation. Dallas, TX: Southern Methodist University.
Polansky, A. M. (1999). Upper bounds on the true coverage of bootstrap
percentile type confidence intervals. The American Statistician, 53, 362–
369.
Polansky, A. M. (2000). Stabilizing bootstrap-t confidence intervals for small
samples. Canadian Journal of Statistics, 28, 501–516.
Polansky, A. M. (2003a). Supplier selection based on bootstrap confidence
regions of process capability indices. International Journal of Reliability,
Quality and Safety Engineering, 10, 1–14.
Polansky, A. M. (2003b). Selecting the best treatment in designed experi-
ments. Statistics in Medicine, 22, 3461–3471.
Polansky, A. M. (2007). Observed Confidence Levels: Theory and Application.
Boca Raton, FL: Chapman Hall/CRC Press.
Polansky, A. M. and Schucany, W. R. (1997). Kernel smoothing to improve
bootstrap confidence intervals. Journal of the Royal Statistical Society, Se-
ries B, 59, 821–838.
Pollard, D. (2002). A User’s Guide to Measure Theoretic Probability. Cam-
bridge: Cambridge University Press.
Putter, H. and Van Zwet, W. R. (1996). Resampling: consistency of substi-
tution estimators. The Annals of Statistics, 24, 2297–2318.
Randles, R. H. and Wolfe, D. A. (1979). Introduction to the Theory of Non-
parametric Statistics. New York: John Wiley and Sons.
Rao, C. R. (1963). Criteria for estimation in large samples. Sankhyā, 25,
189–206.
Reeds, J. A. (1976). On the Definition of von Mises Functionals. Ph.D. Dis-
sertation. Cambridge, MA: Harvard University.
Rosenblatt, M. (1956). Remarks on some nonparametric estimates of a den-
sity function. The Annals of Mathematical Statistics, 27, 832–837.
Royden, H. L. (1988). Real Analysis. 3rd Ed. New York: Macmillan.
Rudemo, M. (1982). Empirical choice of histograms and kernel density esti-
mators. Scandinavian Journal of Statistics, 9, 65–78.
Scott, D. W. (1992). Multivariate Density Estimation: Theory, Practice and
Visualization. New York: John Wiley and Sons.
Sen, P. K. and Singer, J. M. (1993). Large Sample Methods in Statistics.
London: Chapman and Hall.
Serfling, R. J. (1980). Approximation Theorems of Mathematical Statistics.
New York: John Wiley and Sons.
Severini, T. A. (2005). Elements of Distribution Theory. Cambridge: Cam-
bridge University Press.
Shao, J. and Tu, D. (1995). The Jackknife and Bootstrap. New York: Springer.
Shi, X. (1986). A note on bootstrapping U -statistics. Chinese Journal of
Applied Probability and Statistics, 2, 144–148.
Shiganov, I. S. (1986). Refinement of the upper bound of the constant in the central limit theorem. Journal of Soviet Mathematics, 35, 2545–2550.
Shohat, J. A. and Tamarkin, J. D. (1943). The Problem of Moments. 4th Ed.
Providence, RI: American Mathematical Society.
Silverman, B. W. (1986). Density Estimation. London: Chapman and Hall.
Simmons, G. (1971). Identifying probability limits. The Annals of Mathe-
matical Statistics, 42, 1429–1433.
Simonoff, J. S. (1996). Smoothing Methods in Statistics. New York: Springer.
Singh, K. (1981). On the asymptotic accuracy of Efron's bootstrap. The Annals of Statistics, 9, 1187–1195.
Slomson, A. B. (1991). An Introduction to Combinatorics. Boca Raton, FL: CRC Press.
Sprecher, D. A. (1970). Elements of Real Analysis. New York: Dover.
Sprent, P. and Smeeton, N. C. (2007). Applied Nonparametric Statistical
Methods. 4th Ed. Boca Raton, FL: Chapman and Hall/CRC Press.
Stefansson, G., Kim, W.-C., and Hsu, J. C. (1988). On confidence sets in
multiple comparisons. Statistical Decision Theory and Related Topics IV.
S.S. Gupta and J. O. Berger, Eds. New York: Academic Press. 89–104.
Stone, C. J. (1984). An asymptotically optimal window selection rule for
kernel density estimates. The Annals of Statistics, 12, 1285–1297.
Tchebycheff, P. (1890). Sur deux théorèmes relatifs aux probabilités. Acta Mathematica, 14, 305–315.
Tibshirani, R. (1988). Variance stabilization and the bootstrap. Biometrika,
75, 433–444.
van Beek, P. (1972). An application of Fourier methods to the problem
of sharpening of the Berry-Esseen inequality. Zeitschrift für Wahrschein-
lichkeitstheorie und Verwandte Gebiete, 23, 183–196.
van der Vaart, A. W. (1998). Asymptotic Statistics. Cambridge: Cambridge
University Press.
von Mises, R. (1947). On the asymptotic distribution of differentiable statis-
tical functionals. The Annals of Mathematical Statistics, 18, 309–348.
Wand, M. P. (1997). Data-based choice of histogram bin width. The Ameri-
can Statistician, 51, 59–64.
Wand, M. P. and Jones, M. C. (1995). Kernel Smoothing. London: Chapman
and Hall.
Westenberg, J. (1948). Significance test for median and interquartile range in
samples from continuous populations of any form. Proceedings Koningklijke
Nederlandse Akademie van Wetenschappen, 51, 252–261.
Westfall, P. H. and Young, S. S. (1993). Resampling-Based Multiple Testing.
New York: John Wiley and Sons.
Winterbottom, A. (1979). A note on the derivation of Fisher’s transformation
of the correlation coefficient. The American Statistician, 33, 142–143.
Withers, C. S. (1983). Asymptotic expansions for the distribution and quan-
tiles of a regular function of the empirical distribution with applications to
nonparametric confidence intervals. The Annals of Statistics, 11, 577–587.
Withers, C. S. (1984). Asymptotic expansions for distributions and quan-
tiles with power series cumulants. Journal of The Royal Statistical Society,
Series B, 46, 389–396.
Wolfowitz, J. (1965). Asymptotic efficiency of the maximum likelihood esti-
mator. Theory of Probability and its Applications, 10, 247–260.
Woodroofe, M. (1970). On choosing a delta-sequence. The Annals of Math-
ematical Statistics, 41, 1665–1671.
Yamamuro, S. (1974). Differential Calculus in Topological Linear Spaces.
Lecture Notes in Mathematics, Number 374. Berlin: Springer-Verlag.
Young, N. (1988). An Introduction to Hilbert Space. Cambridge: Cambridge
University Press.
Zolotarev, V. M. (1986). Sums of Independent Random Variables. New York:
John Wiley and Sons.