
Introduction to

Statistical Limit
Theory
CHAPMAN & HALL/CRC
Texts in Statistical Science Series
Series Editors
Bradley P. Carlin, University of Minnesota, USA
Julian J. Faraway, University of Bath, UK
Martin Tanner, Northwestern University, USA
Jim Zidek, University of British Columbia, Canada

Analysis of Failure and Survival Data
P. J. Smith
The Analysis of Time Series — An Introduction, Sixth Edition
C. Chatfield
Applied Bayesian Forecasting and Time Series Analysis
A. Pole, M. West and J. Harrison
Applied Nonparametric Statistical Methods, Fourth Edition
P. Sprent and N.C. Smeeton
Applied Statistics — Handbook of GENSTAT Analysis
E.J. Snell and H. Simpson
Applied Statistics — Principles and Examples
D.R. Cox and E.J. Snell
Applied Stochastic Modelling, Second Edition
B.J.T. Morgan
Bayesian Data Analysis, Second Edition
A. Gelman, J.B. Carlin, H.S. Stern and D.B. Rubin
Bayesian Ideas and Data Analysis: An Introduction for Scientists and Statisticians
R. Christensen, W. Johnson, A. Branscum, and T.E. Hanson
Bayesian Methods for Data Analysis, Third Edition
B.P. Carlin and T.A. Louis
Beyond ANOVA — Basics of Applied Statistics
R.G. Miller, Jr.
Computer-Aided Multivariate Analysis, Fourth Edition
A.A. Afifi and V.A. Clark
A Course in Categorical Data Analysis
T. Leonard
A Course in Large Sample Theory
T.S. Ferguson
Data Driven Statistical Methods
P. Sprent
Decision Analysis — A Bayesian Approach
J.Q. Smith
Design and Analysis of Experiment with SAS
J. Lawson
Elementary Applications of Probability Theory, Second Edition
H.C. Tuckwell
Elements of Simulation
B.J.T. Morgan
Epidemiology — Study Design and Data Analysis, Second Edition
M. Woodward
Essential Statistics, Fourth Edition
D.A.G. Rees
Exercises and Solutions in Biostatistical Theory
L.L. Kupper, B.H. Neelon, and S.M. O’Brien
Extending the Linear Model with R — Generalized Linear, Mixed Effects and Nonparametric Regression Models
J.J. Faraway
A First Course in Linear Model Theory
N. Ravishanker and D.K. Dey
Generalized Additive Models: An Introduction with R
S. Wood
Graphics for Statistics and Data Analysis with R
K.J. Keen
Interpreting Data — A First Course in Statistics
A.J.B. Anderson
Introduction to General and Generalized Linear Models
H. Madsen and P. Thyregod
An Introduction to Generalized Linear Models, Third Edition
A.J. Dobson and A.G. Barnett
Introduction to Multivariate Analysis
C. Chatfield and A.J. Collins
Introduction to Optimization Methods and Their Applications in Statistics
B.S. Everitt
Introduction to Probability with R
K. Baclawski
Introduction to Randomized Controlled Clinical Trials, Second Edition
J.N.S. Matthews
Introduction to Statistical Inference and Its Applications with R
M.W. Trosset
Introduction to Statistical Limit Theory
A.M. Polansky
Introduction to Statistical Methods for Clinical Trials
T.D. Cook and D.L. DeMets
Large Sample Methods in Statistics
P.K. Sen and J. da Motta Singer
Linear Models with R
J.J. Faraway
Logistic Regression Models
J.M. Hilbe
Markov Chain Monte Carlo — Stochastic Simulation for Bayesian Inference, Second Edition
D. Gamerman and H.F. Lopes
Mathematical Statistics
K. Knight
Modeling and Analysis of Stochastic Systems, Second Edition
V.G. Kulkarni
Modelling Binary Data, Second Edition
D. Collett
Modelling Survival Data in Medical Research, Second Edition
D. Collett
Multivariate Analysis of Variance and Repeated Measures — A Practical Approach for Behavioural Scientists
D.J. Hand and C.C. Taylor
Multivariate Statistics — A Practical Approach
B. Flury and H. Riedwyl
Pólya Urn Models
H. Mahmoud
Practical Data Analysis for Designed Experiments
B.S. Yandell
Practical Longitudinal Data Analysis
D.J. Hand and M. Crowder
Practical Statistics for Medical Research
D.G. Altman
A Primer on Linear Models
J.F. Monahan
Probability — Methods and Measurement
A. O’Hagan
Problem Solving — A Statistician’s Guide, Second Edition
C. Chatfield
Randomization, Bootstrap and Monte Carlo Methods in Biology, Third Edition
B.F.J. Manly
Readings in Decision Analysis
S. French
Sampling Methodologies with Applications
P.S.R.S. Rao
Statistical Analysis of Reliability Data
M.J. Crowder, A.C. Kimber, T.J. Sweeting, and R.L. Smith
Statistical Methods for Spatial Data Analysis
O. Schabenberger and C.A. Gotway
Statistical Methods for SPC and TQM
D. Bissell
Statistical Methods in Agriculture and Experimental Biology, Second Edition
R. Mead, R.N. Curnow, and A.M. Hasted
Statistical Process Control — Theory and Practice, Third Edition
G.B. Wetherill and D.W. Brown
Statistical Theory, Fourth Edition
B.W. Lindgren
Statistics for Accountants
S. Letchford
Statistics for Epidemiology
N.P. Jewell
Statistics for Technology — A Course in Applied Statistics, Third Edition
C. Chatfield
Statistics in Engineering — A Practical Approach
A.V. Metcalfe
Statistics in Research and Development, Second Edition
R. Caulcutt
Stochastic Processes: An Introduction, Second Edition
P.W. Jones and P. Smith
Survival Analysis Using S — Analysis of Time-to-Event Data
M. Tableman and J.S. Kim
The Theory of Linear Models
B. Jørgensen
Time Series Analysis
H. Madsen
Time Series: Modeling, Computation, and Inference
R. Prado and M. West
Texts in Statistical Science

Introduction to
Statistical Limit
Theory

Alan M. Polansky
Northern Illinois University
DeKalb, Illinois, USA
Chapman & Hall/CRC
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742
© 2011 by Taylor and Francis Group, LLC
Chapman & Hall/CRC is an imprint of Taylor & Francis Group, an Informa business

No claim to original U.S. Government works

Printed in the United States of America on acid-free paper


10 9 8 7 6 5 4 3 2 1

International Standard Book Number-13: 978-1-4200-7661-5 (Ebook-PDF)

This book contains information obtained from authentic and highly regarded sources. Reasonable
efforts have been made to publish reliable data and information, but the author and publisher cannot
assume responsibility for the validity of all materials or the consequences of their use. The authors and
publishers have attempted to trace the copyright holders of all material reproduced in this publication
and apologize to copyright holders if permission to publish in this form has not been obtained. If any
copyright material has not been acknowledged please write and let us know so we may rectify in any
future reprint.

Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced,
transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or
hereafter invented, including photocopying, microfilming, and recording, or in any information stor-
age or retrieval system, without written permission from the publishers.

For permission to photocopy or use material electronically from this work, please access www.copy-
right.com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222
Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that pro-
vides licenses and registration for a variety of users. For organizations that have been granted a pho-
tocopy license by the CCC, a separate system of payment has been arranged.

Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are
used only for identification and explanation without intent to infringe.
Visit the Taylor & Francis Web site at
http://www.taylorandfrancis.com
and the CRC Press Web site at
http://www.crcpress.com
TO

CATHERINE AND ALTON

AND

ADLER, AGATHA, FLUFFY, AND KINSEY

AND THE MEMORY OF

FLATNOSE, HOLMES, LISA, MILHOUSE, MIRROR,


RALPH, AND SEYMOUR
Contents

Preface xvii

1 Sequences of Real Numbers and Functions 1


1.1 Introduction 1
1.2 Sequences of Real Numbers 1
1.3 Sequences of Real Functions 12
1.4 The Taylor Expansion 18
1.5 Asymptotic Expansions 28
1.6 Inversion of Asymptotic Expansions 39
1.7 Exercises and Experiments 42
1.7.1 Exercises 42
1.7.2 Experiments 50

2 Random Variables and Characteristic Functions 53


2.1 Introduction 53
2.2 Probability Measures and Random Variables 53
2.3 Some Important Inequalities 59
2.4 Some Limit Theory for Events 65
2.5 Generating and Characteristic Functions 74
2.6 Exercises and Experiments 91
2.6.1 Exercises 91
2.6.2 Experiments 97

3 Convergence of Random Variables 101
3.1 Introduction 101
3.2 Convergence in Probability 102
3.3 Stronger Modes of Convergence 107
3.4 Convergence of Random Vectors 117
3.5 Continuous Mapping Theorems 121
3.6 Laws of Large Numbers 124
3.7 The Glivenko–Cantelli Theorem 135
3.8 Sample Moments 140
3.9 Sample Quantiles 147
3.10 Exercises and Experiments 152
3.10.1 Exercises 152
3.10.2 Experiments 157

4 Convergence of Distributions 159


4.1 Introduction 159
4.2 Weak Convergence of Random Variables 159
4.3 Weak Convergence of Random Vectors 182
4.4 The Central Limit Theorem 195
4.5 The Accuracy of the Normal Approximation 201
4.6 The Sample Moments 214
4.7 The Sample Quantiles 215
4.8 Exercises and Experiments 221
4.8.1 Exercises 221
4.8.2 Experiments 227

5 Convergence of Moments 229


5.1 Convergence in rth Mean 229
5.2 Uniform Integrability 237
5.3 Convergence of Moments 243
5.4 Exercises and Experiments 248
5.4.1 Exercises 248
5.4.2 Experiments 252
6 Central Limit Theorems 255
6.1 Introduction 255
6.2 Non-Identically Distributed Random Variables 255
6.3 Triangular Arrays 263
6.4 Transformed Random Variables 265
6.5 Exercises and Experiments 278
6.5.1 Exercises 278
6.5.2 Experiments 281

7 Asymptotic Expansions for Distributions 283


7.1 Approximating a Distribution 283
7.2 Edgeworth Expansions 284
7.3 The Cornish–Fisher Expansion 304
7.4 The Smooth Function Model 311
7.5 General Edgeworth and Cornish–Fisher Expansions 314
7.6 Studentized Statistics 319
7.7 Saddlepoint Expansions 324
7.8 Exercises and Experiments 330
7.8.1 Exercises 330
7.8.2 Experiments 335

8 Asymptotic Expansions for Random Variables 339


8.1 Approximating Random Variables 339
8.2 Stochastic Order Notation 340
8.3 The Delta Method 348
8.4 The Sample Moments 350
8.5 Exercises and Experiments 354
8.5.1 Exercises 354
8.5.2 Experiments 355
9 Differentiable Statistical Functionals 359
9.1 Introduction 359
9.2 Functional Parameters and Statistics 359
9.3 Differentiation of Statistical Functionals 363
9.4 Expansion Theory for Statistical Functionals 369
9.5 Asymptotic Distribution 375
9.6 Exercises and Experiments 378
9.6.1 Exercises 378
9.6.2 Experiments 381

10 Parametric Inference 383


10.1 Introduction 383
10.2 Point Estimation 383
10.3 Confidence Intervals 414
10.4 Statistical Hypothesis Tests 424
10.5 Observed Confidence Levels 438
10.6 Bayesian Estimation 447
10.7 Exercises and Experiments 459
10.7.1 Exercises 459
10.7.2 Experiments 470

11 Nonparametric Inference 475


11.1 Introduction 475
11.2 Unbiased Estimation and U-Statistics 478
11.3 Linear Rank Statistics 495
11.4 Pitman Asymptotic Relative Efficiency 514
11.5 Density Estimation 524
11.6 The Bootstrap 541
11.7 Exercises and Experiments 553
11.7.1 Exercises 553
11.7.2 Experiments 560
A Useful Theorems and Notation 565
A.1 Sets and Set Operators 565
A.2 Point-Set Topology 566
A.3 Results from Calculus 567
A.4 Results from Complex Analysis 568
A.5 Probability and Expectation 569
A.6 Inequalities 569
A.7 Miscellaneous Mathematical Results 570
A.8 Discrete Distributions 570
A.8.1 The Bernoulli Distribution 570
A.8.2 The Binomial Distribution 570
A.8.3 The Geometric Distribution 571
A.8.4 The Multinomial Distribution 571
A.8.5 The Poisson Distribution 571
A.8.6 The (Discrete) Uniform Distribution 572
A.9 Continuous Distributions 572
A.9.1 The Beta Distribution 572
A.9.2 The Cauchy Distribution 572
A.9.3 The Chi-Squared Distribution 573
A.9.4 The Exponential Distribution 573
A.9.5 The Gamma Distribution 573
A.9.6 The LaPlace Distribution 574
A.9.7 The Logistic Distribution 574
A.9.8 The Lognormal Distribution 574
A.9.9 The Multivariate Normal Distribution 574
A.9.10 The Non-Central Chi-Squared Distribution 575
A.9.11 The Normal Distribution 575
A.9.12 Student’s t Distribution 575
A.9.13 The Triangular Distribution 575
A.9.14 The (Continuous) Uniform Distribution 576
A.9.15 The Wald Distribution 576
B Using R for Experimentation 577
B.1 An Introduction to R 577
B.2 Basic Plotting Techniques 577
B.3 Complex Numbers 582
B.4 Standard Distributions and Random Number Generation 582
B.4.1 The Bernoulli and Binomial Distributions 583
B.4.2 The Beta Distribution 583
B.4.3 The Cauchy Distribution 584
B.4.4 The Chi-Squared Distribution 584
B.4.5 The Exponential Distribution 585
B.4.6 The Gamma Distribution 585
B.4.7 The Geometric Distribution 586
B.4.8 The LaPlace Distribution 587
B.4.9 The Lognormal Distribution 587
B.4.10 The Multinomial Distribution 588
B.4.11 The Normal Distribution 588
B.4.12 The Multivariate Normal Distribution 589
B.4.13 The Poisson Distribution 589
B.4.14 Student’s t Distribution 590
B.4.15 The Continuous Uniform Distribution 590
B.4.16 The Discrete Uniform Distribution 591
B.4.17 The Wald Distribution 592
B.5 Writing Simulation Code 592
B.6 Kernel Density Estimation 594
B.7 Simulating Samples from Normal Mixtures 594
B.8 Some Examples 595
B.8.1 Simulating Flips of a Fair Coin 595
B.8.2 Investigating the Central Limit Theorem 595
B.8.3 Plotting the Normal Characteristic Function 597
B.8.4 Plotting Edgeworth Corrections 598
B.8.5 Simulating the Law of the Iterated Logarithm 599
References 601

Author Index 613

Subject Index 617


Preface

The motivation for writing a new book is heavily dependent on the books that
are currently available in the area you wish to present. In many cases one can
make the case that a new book is needed when there are no books available in
that area. In other cases one can make the argument that the current books
are out of date or are of poor quality. I wish to make it clear from the outset
that I do not have this opinion of any of the books on statistical asymptotic
theory that are currently available, or have been available in the past. In fact,
I find myself humbled as I search my book shelves and find the many well
written books on this subject. Indeed, I feel strongly that some of the best
minds in statistics have written books in this area. This leaves me to the
task of explaining why I found it necessary to write this book on asymptotic
statistical theory.
Many students of statistics find their training increasingly specialized away
from mathematical theory in favor of newer and more complex statistical methods
that necessarily require specialized study. Inevitably, this has led to a
diminished focus on pure mathematics. However, this does not diminish the need for
a good understanding of asymptotic theory. Good students of statistics, for
example, should know exactly what the central limit theorem says and what
exactly it means. They should be able to understand the assumptions required
for the theory to work. There are many modern methods that are complex
enough so that they cannot be justified directly, and therefore must be justified
from an asymptotic viewpoint. Students should have a good understanding as
to what such a justification guarantees and what it does not. Students should
also have a good understanding of what can go wrong.
Asymptotic theory is mathematical in nature. Over the years I have helped
many students understand results from asymptotic theory, often as part of
their research or classwork. In this time I began to realize that the decreased
exposure to mathematical theory, particularly in real analysis, was making it
much more difficult for students to understand this theory.
I wrote this book with the goal of explaining as much of the background ma-
terial as I could, while still keeping a reasonable presentation and length. The
reader will not find a detailed review of the whole of real analysis or of the other
important subjects from mathematics. Instead the reader will find sufficient
background in those subjects that are important for the topic at hand, along

with references which the reader may explore for a more detailed understand-
ing. I have also attempted to present a much more detailed account of the
modes of convergence of random variables, distributions, and moments than
can be found in many other texts. This creates a firm foundation for the appli-
cations that appear in the book, along with further study that students may
do on their own.
As I began the job of writing this book, I recalled a quote from one of my
favorite authors, Sir Arthur Conan Doyle in his Sherlock Holmes adventure
“The Dancing Men.” Mr. Holmes is conversing with Dr. Watson and has
just presented him with one of his famous deductions that leaves Dr. Watson
startled by Holmes’ deductive abilities. Mr. Holmes replies as follows:
“You see, my dear Watson,”—he propped his test-tube in the rack, and began to
lecture with the air of a professor addressing his class—“it is not really difficult
to construct a series of inferences, each dependent upon its predecessor and each
simple in itself. If, after doing so, one simply knocks out all the central inferences
and presents one’s audience with the starting-point and the conclusion, one may
produce a startling, though possibly a meretricious, effect.”

A mathematical theorem essentially falls into this category in the sense that
it is a series of assumptions followed by a logical result. The central inferences
have been removed, sometimes producing a somewhat startling result. When
one uses the theorem as a tool to obtain further results, the central inferences
are usually unimportant. If one wishes to truly understand the result, one
must understand the central inferences, which is usually called the “proof” of
the theorem. As this book is meant to be either a textbook for a course, or a
reference book for someone unfamiliar with many of the results within, I feel
duty bound to not produce startling results in this book, but to include all
of the central inferences of each result in such a way that the simple logical
progression is clear to the reader. As a result, most of the proofs presented in
this book are very detailed.
I have included proofs to many of the results in the book. Nearly all of these
arguments come from those before me, but I have attempted to include de-
tailed explanations for the results while pointing out important tools and
techniques along the way. I have attempted to give due credit to the sources I
used for the proofs. If I have left anyone out, please accept my apologies and
be assured that it was not on purpose, but out of my ignorance. I have also
not steered away from the more complicated proofs in this field. These proofs
often contain valuable techniques that a user of asymptotic theory should be
familiar with. For many of the standard results there are several proofs that
one can choose from. I have not always chosen the shortest or the “slickest”
method of proof. In my choice of proofs I weighed the length of the argument
along with its pedagogical value and how well the proof provides insight into
the final result. This is particularly true when complicated or strange looking
assumptions are part of the result. I have attempted to point out in the proofs
where these conditions come from.
This book is mainly concerned with providing a detailed introduction into
the common modes of convergence and their related tools used in statistics.
However, it is also useful to consider how these results can be applied to several
common areas of statistics. Therefore, I have included several chapters that
deal more with the application of the theory developed in the first part of the
book. These applications are not an exhaustive offering by any stretch of the
imagination, and to be sure an exhaustive offering would have enlarged the
book to the size of a phone book from a large metropolitan area. Therefore, I
have attempted to include a few topics whose deeper understanding benefits
greatly from asymptotic theory and whose applications provide illustrative
examples of the theory developed earlier.
Many people have helped me along the way in developing the ideas behind
this book and implementing them on paper. Bob Stern and David Grubbs at
CRC/Chapman & Hall have been very supportive of this project and have
put up with my countless delays. There were several students who took a
course from me based on an early draft of this book: Devrim Bilgili, Arpita
Chatterjee, Ujjwal Das, Santu Ghosh, Priya Kohli, and Suchitrita Sarkar.
They all provided me with useful comments, found numerous typographical
errors, and asked intelligent questions, which helped develop the book from
its early phase to a more coherent and complete document. I would also like
to thank Qian Dong, Lingping Liu, and Kristin McCullough who studied from
a later version of the book and were able to point out several typographical
errors and places where there could be some improvement in the presentation. I
also want to thank Sanjib Basu, who helped me by answering numerous questions
about Bayesian theory.
My friends and colleagues who have supported me through the years also
deserve a special note of thanks: My advisor, Bill Schucany, has also been a
constant supporter of my activities. He recently announced his retirement and
I wish him all the best. I also want to thank Bob Mason, Youn-Min Chou,
Dick Gunst, Wayne Woodward, Bennie Pierce, Pat Gerard, Michael Ernst,
Carrie Helmig, Donna Lynn, and Jane Eesley for their support over the years.
Jeff Reynolds, my bandmate, my colleague, my student, and my friend, has
always shown me incredible support. He has been a constant source for reality
checks, help, and motivation. We have faced many battles together, many of
them quite enjoyable: “My brother in arms.”
There is always family, and mine is the best. My wife Catherine and my son
Alton have been there for me all throughout this process. I love and cherish
both of them; I could not have completed this project without their support,
understanding, sacrifices, and patience. I also wish to thank my extended
family; My Mom and Dad, Kay and Al Polansky who celebrated their 60th
wedding anniversary in 2010; my brother Gary and his family: Marcia, Kara,
Krista, Mandy, Nellie and Jack; my brother Dale and his new family, which we
welcomed in the summer of 2009: Jennifer and Sydney; and my wife’s family:
Karen, Mike, Ginnie, Christopher, Jonathan, Jackie, Frank, and Mila.
Finally, I would like to thank those who always provide me with the diversions
I need during such a project as this. Matt Groening, who started a little show
called The Simpsons the year before I started graduate school, will never
know how much joy his creation has brought to my life. I also wish to thank
David Silverman who visited Southern Methodist University while I was there
working on my Ph.D.; he drew me a wonderful Homer Simpson, which hangs
on my wall to this day. Jebediah was right: “A noble spirit embiggens the
smallest man.” There were also those times when what I needed was a good
polka. For this I usually turned to the music of Carl Finch and Jeffrey Barnes
of Brave Combo. See you at Westfest!
Much has happened during the time while I was writing this book. I was very
saddened to hear of the passing of Professor Erich Lehmann on September 12,
2009, at the age of 91. One cannot overstate the contributions of Professor
Lehmann to the field of statistics, and particularly to much of the material
presented in this book. I did not know Professor Lehmann personally, but did
meet him once at the Joint Statistical Meetings where he was kind enough to
sign a copy of the new edition of his book Theory of Point Estimation. His
classic books, which have always had a special place on my bookshelf since
beginning my career as a graduate student, have been old and reliable friends
for many years. On July 8, 2010, David Blackwell, the statistician and math-
ematician who wrote many groundbreaking papers on probability and game
theory passed away as well. Besides his many contributions to statistics, Pro-
fessor Blackwell also held the distinction of being the first African American
scholar to be admitted to the National Academy of Sciences and was the first
African American tenured professor at Berkeley. He too will be missed.
During the writing of this book I also passed my 40th birthday and began to
think about the turmoil and fear that pervaded my birth year of 1968. What
a time it must have been to bring a child into the world. There must have
been few hopes and many fears. I find myself looking at my own child Alton
and wondering what the world will have in store for him. It has again been a
time of turmoil and fear, and I can only hope that humanity begins to heed
the words of Dr. Martin Luther King:
“We must learn to live together as brothers or perish together as fools.”

Peace to All,
Alan M. Polansky
Creston, Illinois, USA
CHAPTER 1

Sequences of Real Numbers and


Functions

K. felt slightly abandoned as, probably observed by the priest, he walked by him-
self between the empty pews, and the size of the cathedral seemed to be just at
the limit of what a man could bear.

The Trial by Franz Kafka

1.1 Introduction

The purpose of this chapter is to introduce much of the mathematical limit


theory used throughout the remainder of the book. For many readers, many
of the topics will consist of review material, while other topics may be quite
new. The study of asymptotic properties in statistics relies heavily on results
and concepts from real analysis and calculus. As such, we begin with a review
of limits of sequences of real numbers and sequences of real functions. This is
followed by the development of what will be a most useful tool, Taylor’s The-
orem. We then introduce the concept of asymptotic expansions, a topic that
may be new to some readers and is vitally important to many of the results
treated later in the book. Of particular importance is the introduction of the
asymptotic order notation, which is popular in modern research in probability
and statistics. The related topic of inversion of asymptotic expansions is then
briefly introduced.

1.2 Sequences of Real Numbers

An infinite sequence of real numbers given by {xn }∞ n=1 is specified by the


function xn : N → R. That is, for each n ∈ N, the sequence has a real value
xn . The set N is usually called the index set. In this case the domain of the
function is countable, and the sequence is usually thought to evolve sequen-
tially through the increasing values in N. For example, the simple harmonic
sequence specified by xn = n−1 has the values
x1 = 1, x2 = 1/2, x3 = 1/3, x4 = 1/4, . . . ,

while the simple alternating sequence xn = (−1)n has values
x1 = −1, x2 = 1, x3 = −1, x4 = 1, . . . .
One can also consider real sequences of the form xt with an uncountable
domain such as the real line which is specified by a function xt : R → R. Such
sequences are essentially just real functions. This section will consider only
real sequences whose index set is N.
The asymptotic behavior of such sequences is often of interest. That is, what
general conclusions can be made about the behavior of the sequence as n
becomes very large? In particular, do the values in the sequence appear to
“settle down” and become arbitrarily close to a single number x ∈ R as
n → ∞? For example, the sequence specified by xn = n−1 appears to become
closer and closer to 0 as n becomes very large. If a sequence has this type of
property, then the sequence is said to converge to x as n → ∞, or that the
limit of xn as n → ∞ is x, usually written as
lim xn = x,
n→∞

or as xn → x as n → ∞. To decide whether a given sequence has this type of


behavior, a mathematical definition of the convergence or limiting concept is
required. The most common definition is given below.
Definition 1.1. Let {xn }∞
n=1 be a sequence of real numbers and let x ∈ R
be a real number. Then xn converges to x as n → ∞, written as xn → x as
n → ∞, or as
lim xn = x,
n→∞
if and only if for every ε > 0 there exists nε ∈ N such that |xn − x| < ε for
every n ≥ nε .

This definition ensures the behavior described above. Specify any distance
ε > 0 to x, and all of the terms in a convergent sequence will eventually be
closer than that distance to x.
Example 1.1. Consider the harmonic sequence defined by xn = n−1 for all
n ∈ N. This sequence appears to monotonically become closer to zero as n
increases. In fact, it can be proven that the limit of this sequence is zero. Let
ε > 0. Then there exists n_ε ∈ N such that n_ε^{-1} < ε, which can be seen by
taking n_ε to be any integer greater than ε^{-1}. It follows that any n ≥ n_ε will
also have the property that n^{-1} < ε. Therefore, according to Definition 1.1,
the sequence x_n = n^{-1} converges to zero, or

\lim_{n\to\infty} n^{-1} = 0.
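This argument is easy to examine numerically. The following short R experiment, in the spirit of the experiments collected in Appendix B, is only a minimal sketch (the tolerance eps and the number of terms computed are arbitrary choices): for a given ε it locates an index n_ε beyond which every computed term of the harmonic sequence is within ε of zero.

# Numerical illustration of Definition 1.1 for the harmonic sequence x_n = 1/n.
# The tolerance eps and the number of terms checked are arbitrary choices.
eps <- 0.01
x <- 1 / (1:2000)                        # first 2000 terms of the sequence
n.eps <- which(abs(x - 0) < eps)[1]      # smallest index with |x_n - 0| < eps
n.eps                                    # 101, the first n with 1/n < 0.01
all(abs(x[n.eps:length(x)] - 0) < eps)   # every later term is also within eps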

For the real number system there is an equivalent development of the concept
of a limit that is dependent on the concept of a Cauchy sequence.
Definition 1.2. Let {xn }∞ n=1 be a sequence of real numbers. The sequence
is a Cauchy sequence if for every ε > 0 there is an integer nε such that
|xn − xm | < ε for every n > nε and m > nε .
In general, not every Cauchy sequence converges to a limit. For example, not
every Cauchy sequence of rational numbers has a rational limit. See Exam-
ple 6.9 of Sprecher (1970). There are, however, some spaces where every Cauchy
sequence has a unique limit. Such spaces are said to be complete. The real
number system is an example of a complete space.
Theorem 1.1. Every Cauchy sequence of real numbers has a unique limit.
The advantage of using Cauchy sequences is that we sometimes only need to
show the existence of a limit and Theorem 1.1 can be used to show that a real
sequence has a limit even if we do not know what the limit may be.
Simple algebraic transformations can be applied to convergent sequences with
the resulting limit being subject to the same transformation. For example,
adding a constant to each term of a convergent sequence results in a conver-
gent sequence whose limit equals the limit of the original sequence plus the
constant. A similar result applies to sequences that have been multiplied by
a constant.
Theorem 1.2. Let {xn }∞ n=1 be a sequence of real numbers such that

lim xn = x,
n→∞

and let c ∈ R be a constant. Then


lim (xn + c) = x + c,
n→∞

and
lim cxn = cx.
n→∞

Proof. We will prove the first result. The second result is proven in Exercise 6.
Let {xn }∞n=1 be a sequence of real numbers that converges to x, let c be a real
constant, and let ε > 0. Definition 1.1 and the convergence of the sequence
{xn }∞
n=1 implies that there exists a positive integer nε such that |xn − x| < ε
for all n ≥ nε . Now consider the sequence {xn + c}∞ n=1 . Note that for ε > 0
we have that |(xn + c) − (x + c)| = |xn − x|. Therefore, |(xn + c) − (x + c)| is
also less than ε for all n ≥ nε and the result follows from Definition 1.1.

Theorem 1.2 can be generalized to continuous transformations of convergent


sequences.
Theorem 1.3. Let {x_n}_{n=1}^∞ be a sequence of real numbers such that

\lim_{n\to\infty} x_n = x,

and let f be a function that is continuous at x. Then

\lim_{n\to\infty} f(x_n) = f(x).
A proof of Theorem 1.3 can be found in Section 2.2 of Buck (1965).
In many cases we may consider combining two or more convergent sequences
through a simple algebraic transformation. For example, we might consider a
sequence whose value is equal to the sum of the corresponding terms of two
other convergent sequences. This new sequence is also convergent, and has a
limit equal to the sum of the limits of the two sequences. Similar results also
apply to the product and ratio of two convergent sequences.
Theorem 1.4. Let {xn }∞ ∞
n=1 and {yn }n=1 be sequences of real numbers such
that
lim xn = x,
n→∞
and
lim yn = y.
n→∞
Then
lim (xn + yn ) = x + y,
n→∞
and
lim xn yn = xy.
n→∞
If, in addition to the assumptions above, y_n ≠ 0 for all n ∈ N and y ≠ 0, then

\lim_{n\to\infty} \frac{x_n}{y_n} = \frac{x}{y}.

Proof. Only the first result will be proven here. The remaining results are
proven as part of Exercise 6. Let {xn }∞ ∞
n=1 and {yn }n=1 be convergent se-
quences with limits x and y respectively. Let ε > 0, then Definition 1.1 implies
that there exists integers nε,x and nε,y such that |xn −x| < ε/2 for all n ≥ nε,x
and |yn − y| < ε/2 for all n ≥ nε,y . Now, note that Theorem A.18 implies that
|(xn + yn ) − (x + y)| = |(xn − x) + (yn − y)| ≤ |xn − x| + |yn − y|.
Let nε = max{nε,x , nε,y } so that |xn − x| < ε/2 and |yn − y| < ε/2 for all
n ≥ nε . Therefore |(xn + yn ) − (x + y)| < ε/2 + ε/2 = ε for all n ≥ nε and the
result follows from Definition 1.1.

The focus of our study of limits so far has been for convergent sequences. Not
all sequences of real numbers are convergent, and the limits of non-convergent
real sequences do not exist.
Example 1.2. Consider the alternating sequence defined by xn = (−1)n for
all n ∈ N. It is intuitively clear that the alternating sequence does not “settle
down” at all, or that it does not converge to any real number. To prove this,
let l ∈ R be any real number. Take 0 < ε < max{|l − 1|, |l + 1|}; then for any
n′ ∈ N there will exist at least one n′′ > n′ such that |x_{n′′} − l| > ε. Hence, this
sequence does not have a limit.

While a non-convergent sequence does not have a limit, the asymptotic be-
havior of non-convergent sequences can be described to a certain extent by
considering the asymptotic behavior of the upper and lower bounds of the
sequence. Let {x_n}_{n=1}^∞ be a sequence of real numbers, then u ∈ R is an upper
bound for {x_n}_{n=1}^∞ if x_n ≤ u for all n ∈ N. Similarly, l ∈ R is a lower bound for
{x_n}_{n=1}^∞ if x_n ≥ l for all n ∈ N. The least upper bound of {x_n}_{n=1}^∞ is u_l ∈ R
if u_l is an upper bound for {x_n}_{n=1}^∞ and u_l ≤ u for any upper bound u of
{x_n}_{n=1}^∞. The least upper bound will be denoted by

u_l = \sup_{n\in\mathbb{N}} x_n,

and is often called the supremum of the sequence {x_n}_{n=1}^∞. Similarly, the
greatest lower bound of {x_n}_{n=1}^∞ is l_u ∈ R if l_u is a lower bound for {x_n}_{n=1}^∞
and l_u ≥ l for any lower bound l of {x_n}_{n=1}^∞. The greatest lower bound of
{x_n}_{n=1}^∞ will be denoted by

l_u = \inf_{n\in\mathbb{N}} x_n,

and is often called the infimum of {x_n}_{n=1}^∞.
It is a property of the real numbers
that any sequence that has a lower bound also has a greatest lower bound and
that any sequence that has an upper bound also has a least upper bound.
See Page 33 of Royden (1988) for further details. The asymptotic behavior
of non-convergent sequences can be studied in terms of how the supremum
and infimum of a sequence behaves as n → ∞. That is, we can consider for
example the asymptotic behavior of the upper limit of a sequence {xn }∞ n=1 by
calculating
\lim_{n\to\infty} \sup_{k\ge n} x_k.

Note that the sequence

\{ \sup_{k\ge n} x_k \}_{n=1}^{\infty},   (1.1)

is a monotonically decreasing sequence of real numbers. The limiting value of
this sequence should occur at its smallest value, or in mathematical terms

\lim_{n\to\infty} \sup_{k\ge n} x_k = \inf_{n\in\mathbb{N}} \sup_{k\ge n} x_k.   (1.2)

From the discussion above it is clear that if the sequence {x_n}_{n=1}^∞ is bounded,
then the sequence given in Equation (1.1) is also bounded and the limit in
Equation (1.2) will always exist. A similar concept can be developed to study
the asymptotic behavior of the infimum of a sequence as well.
Definition 1.3. Let {x_n}_{n=1}^∞ be a sequence of real numbers. The limit
supremum of {x_n}_{n=1}^∞ is

\limsup_{n\to\infty} x_n = \inf_{n\in\mathbb{N}} \sup_{k\ge n} x_k,

and the limit infimum of {x_n}_{n=1}^∞ is

\liminf_{n\to\infty} x_n = \sup_{n\in\mathbb{N}} \inf_{k\ge n} x_k.

If

\liminf_{n\to\infty} x_n = \limsup_{n\to\infty} x_n = c ∈ R,

then the limit of {x_n}_{n=1}^∞ exists and is equal to c.

The usefulness of the limit supremum and the limit infimum can be demon-
strated through some examples.
Example 1.3. Consider the sequence {x_n}_{n=1}^∞ where x_n = n^{-1} for all n ∈ N.
Note that

\sup_{k\ge n} x_k = \sup_{k\ge n} k^{-1} = n^{-1},

since n^{-1} is the largest element in the sequence. Therefore,

\limsup_{n\to\infty} x_n = \inf_{n\in\mathbb{N}} \sup_{k\ge n} x_k = \inf_{n\in\mathbb{N}} n^{-1} = 0.

Note that zero is a lower bound of n^{-1} since n^{-1} > 0 for all n ∈ N. Further,
zero is the greatest lower bound since there exists an n_ε ∈ N such that n_ε^{-1} < ε
for any ε > 0. Similar arguments can be used to show that

\liminf_{n\to\infty} x_n = \sup_{n\in\mathbb{N}} \inf_{k\ge n} x_k = \sup_{n\in\mathbb{N}} 0 = 0.

Therefore Definition 1.3 implies that

\lim_{n\to\infty} n^{-1} = \liminf_{n\to\infty} n^{-1} = \limsup_{n\to\infty} n^{-1} = 0,

and the sequence is convergent.


Example 1.4. Consider the sequence {x_n}_{n=1}^∞ where x_n = (−1)^n for all
n ∈ N. Note that

\sup_{k\ge n} x_k = \sup_{k\ge n} (−1)^k = 1,

so that,

\limsup_{n\to\infty} x_n = \inf_{n\in\mathbb{N}} \sup_{k\ge n} x_k = \inf_{n\in\mathbb{N}} 1 = 1.

Similarly,

\inf_{k\ge n} x_k = \inf_{k\ge n} (−1)^k = −1,

so that

\liminf_{n\to\infty} x_n = \sup_{n\in\mathbb{N}} \inf_{k\ge n} x_k = \sup_{n\in\mathbb{N}} (−1) = −1.

In this case it is clear that

\liminf_{n\to\infty} x_n ≠ \limsup_{n\to\infty} x_n,

so that, as shown in Example 1.2, the limit of the sequence does not exist.
The limit infimum and limit supremum indicate the extent of the limit of the
variation of {x_n}_{n=1}^∞ as n → ∞.
Example 1.5. Consider the sequence {x_n}_{n=1}^∞ where x_n = (−1)^n (1 + n^{-1})
for all n ∈ N. Note that

\sup_{k\ge n} x_k = \sup_{k\ge n} (−1)^k (1 + k^{-1}) = \begin{cases} 1 + n^{-1} & \text{if } n \text{ is even,} \\ 1 + (n+1)^{-1} & \text{if } n \text{ is odd.} \end{cases}

In either case,

\inf_{n\in\mathbb{N}} (1 + n^{-1}) = \inf_{n\in\mathbb{N}} [1 + (n+1)^{-1}] = 1,

so that

\limsup_{n\to\infty} x_n = 1.

Similarly,

\inf_{k\ge n} x_k = \inf_{k\ge n} (−1)^k (1 + k^{-1}) = \begin{cases} −(1 + n^{-1}) & \text{if } n \text{ is odd,} \\ −[1 + (n+1)^{-1}] & \text{if } n \text{ is even,} \end{cases}

and

\sup_{n\in\mathbb{N}} −(1 + n^{-1}) = \sup_{n\in\mathbb{N}} −[1 + (n+1)^{-1}] = −1,

so that

\liminf_{n\to\infty} x_n = −1.

As in Example 1.4

\liminf_{n\to\infty} x_n ≠ \limsup_{n\to\infty} x_n,

so that the limit does not exist. Note that this sequence has the same asymptotic
behavior on its upper and lower bounds as the much simpler sequence
in Example 1.4.
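The limit supremum and limit infimum in Examples 1.4 and 1.5 can also be approximated by direct computation. The R sketch below is only a rough numerical check, not a proof, since just finitely many terms are available (the truncation point N = 500 is an arbitrary choice): it computes the tail suprema and infima of x_n = (−1)^n (1 + n^{-1}) and shows them approaching 1 and −1.

# Approximate lim sup and lim inf for x_n = (-1)^n (1 + 1/n) from Example 1.5.
# Only finitely many terms are available, so the tail suprema and infima are
# computed over k = n, ..., N for a fixed truncation point N (a rough check).
N <- 500
n <- 1:N
x <- (-1)^n * (1 + 1/n)
tail.sup <- rev(cummax(rev(x)))   # tail.sup[n] = max of x[n], ..., x[N]
tail.inf <- rev(cummin(rev(x)))   # tail.inf[n] = min of x[n], ..., x[N]
tail.sup[c(1, 10, 100, 400)]      # decreases toward 1
tail.inf[c(1, 10, 100, 400)]      # increases toward -1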

The properties of the limit supremum and limit infimum are similar to those
of the limit with some notable exceptions.
Theorem 1.5. Let {x_n}_{n=1}^∞ be a sequence of real numbers. Then

\inf_{n\in\mathbb{N}} x_n ≤ \liminf_{n\to\infty} x_n ≤ \limsup_{n\to\infty} x_n ≤ \sup_{n\in\mathbb{N}} x_n,

and

\limsup_{n\to\infty} (−x_n) = −\liminf_{n\to\infty} x_n.

Proof. The second property will be proven here. The first property is proven
in Exercise 10. Let {xn }∞ n=1 be a sequence of real numbers. Note that the
negative of any lower bound of {xn }∞ n=1 is an upper bound of the sequence
{−xn }∞n=1 . To see why, let l be a lower bound of {xn }∞
n=1 . Then l ≤ xn for all
n ∈ N. Multiplying each side of the inequality by −1 yields −l ≥ −xn for all
n ∈ N. Therefore −l is an upper bound of {−xn }∞ n=1 , and it follows that the
negative of the greatest lower bound of {xn }∞ n=1 is the least upper bound of
{−xn }∞n=1 . That is
−\inf_{k\ge n} x_k = \sup_{k\ge n} (−x_k),   (1.3)

and

−\sup_{k\ge n} x_k = \inf_{k\ge n} (−x_k).   (1.4)

Therefore, an application of Equations (1.3) and (1.4) implies that

\limsup_{n\to\infty} (−x_n) = \inf_{n\in\mathbb{N}} \sup_{k\ge n} (−x_k) = \inf_{n\in\mathbb{N}} \left[ −\inf_{k\ge n} x_k \right] = −\sup_{n\in\mathbb{N}} \inf_{k\ge n} x_k = −\liminf_{n\to\infty} x_n.

Combining the limit supremum and infimum of two or more sequences is


also possible, though the results are not as simple as in the case of limits of
convergent sequences.
Theorem 1.6. Let {xn }∞ ∞
n=1 and {yn }n=1 be sequences of real numbers.

1. If xn ≤ yn for all n ∈ N, then


lim sup xn ≤ lim sup yn ,
n→∞ n→∞

and
lim inf xn ≤ lim inf yn .
n→∞ n→∞

2. If

lim sup xn < ∞,
n→∞
and
lim inf xn < ∞,

n→∞
then
lim inf xn + lim inf yn ≤ lim inf (xn + yn ),
n→∞ n→∞ n→∞
and
lim sup(xn + yn ) ≤ lim sup xn + lim sup yn .
n→∞ n→∞ n→∞

3. If xn > 0 and yn > 0 for all n ∈ N,


0 < lim sup xn < ∞,
n→∞

and
0 < lim sup yn < ∞,
n→∞
then

\limsup_{n\to\infty} x_n y_n ≤ \left( \limsup_{n\to\infty} x_n \right) \left( \limsup_{n\to\infty} y_n \right).

Proof. Part of the first property is proven here. The remaining properties are
proven in Exercises 12 and 13. Suppose that {xn }∞ ∞
n=1 and {yn }n=1 are real se-
quences such that xn ≤ yn for all n ∈ N. Then xk ≤ yk for all k ∈ {n, n+1, . . .}.
This implies that any upper bound for {yn , yn+1 , . . . , } is also an upper bound
for {xn , xn+1 , . . .}. It follows that the least upper bound for {xn , xn+1 , . . .} is
less than or equal to the least upper bound for {yn , yn+1 , . . . , }. That is
\sup_{k\ge n} x_k ≤ \sup_{k\ge n} y_k.   (1.5)

Because n is arbitrary in the above argument, the property in Equation (1.5)
holds for all n ∈ N. Now, note that a lower bound for the sequence

\{ \sup_{k\ge n} x_k \}_{n=1}^{\infty},   (1.6)

is also a lower bound for

\{ \sup_{k\ge n} y_k \}_{n=1}^{\infty},   (1.7)

due to the property in Equation (1.5). Therefore, the greatest lower bound
for the sequence in Equation (1.6) is also a lower bound for the sequence in
Equation (1.7). Hence, it follows that the greatest lower bound for the sequence
in Equation (1.6) is bounded above by the greatest lower bound for the sequence
in Equation (1.7). That is

\limsup_{n\to\infty} x_n = \inf_{n\in\mathbb{N}} \sup_{k\ge n} x_k ≤ \inf_{n\in\mathbb{N}} \sup_{k\ge n} y_k = \limsup_{n\to\infty} y_n.

The property for the limit infimum is proven in Exercise 11.


Example 1.6. Consider two sequences of real numbers {x_n}_{n=1}^∞ and {y_n}_{n=1}^∞
given by x_n = (−1)^n and y_n = (1 + n^{-1}) for all n ∈ N. Using Definition 1.3 it
can be shown that

\limsup_{n\to\infty} (−1)^n = 1

and

\lim_{n\to\infty} y_n = \limsup_{n\to\infty} y_n = 1.

Therefore,

\left( \limsup_{n\to\infty} x_n \right) \left( \limsup_{n\to\infty} y_n \right) = 1.

It was shown in Example 1.5 that

\limsup_{n\to\infty} x_n y_n = 1.

Therefore, in this case,

\limsup_{n\to\infty} x_n y_n = \left( \limsup_{n\to\infty} x_n \right) \left( \limsup_{n\to\infty} y_n \right).


Example 1.7. Consider two sequences of real numbers {x_n}_{n=1}^∞ and {y_n}_{n=1}^∞
where x_n = (1/2)[(−1)^n − 1] and y_n = (−1)^n for all n ∈ N. Using Definition 1.3
it can be shown that

\limsup_{n\to\infty} x_n = 0,

and

\limsup_{n\to\infty} y_n = 1,

so that

\left( \limsup_{n\to\infty} x_n \right) \left( \limsup_{n\to\infty} y_n \right) = 0.

Now the sequence x_n y_n can be defined as x_n y_n = (1/2)[1 − (−1)^n], for all n ∈ N,
so that

\limsup_{n\to\infty} x_n y_n = 1,

which would apparently contradict Theorem 1.6, except for the fact that the
assumption

\limsup_{n\to\infty} x_n > 0

is violated.
Example 1.8. Consider two sequences of real numbers {x_n}_{n=1}^∞ and {y_n}_{n=1}^∞
where x_n = (−1)^n and y_n = (−1)^{n+1} for all n ∈ N. From Definition 1.3 it
follows that

\limsup_{n\to\infty} x_n = \limsup_{n\to\infty} y_n = 1.

The sequence x_n y_n can be defined by x_n y_n = −1 for all n ∈ N, so that

\limsup_{n\to\infty} x_n y_n = −1.

Therefore, in this case

\limsup_{n\to\infty} x_n y_n < \left( \limsup_{n\to\infty} x_n \right) \left( \limsup_{n\to\infty} y_n \right).

One sequence of interest is specified by xn = (1 + n−1 )n , which will be used


in later chapters. The computation of the limit of this sequence is slightly
more involved and specifically depends on the definition of limit given in
Definition 1.3, and as such is an excellent example of using the limit infimum
and supremum to compute a limit.
Theorem 1.7. Define the sequence {x_n}_{n=1}^∞ as x_n = (1 + n^{-1})^n. Then

\lim_{n\to\infty} x_n = e.

Proof. The standard proof of this result, such as this one which is adapted
from Sprecher (1970), is based on Theorem A.22. Using Theorem A.22,
note that when n ∈ N is fixed

(1 + n^{-1})^n = \sum_{i=0}^{n} \binom{n}{i} n^{-i} (1)^{n-i} = \sum_{i=0}^{n} \binom{n}{i} n^{-i}.   (1.8)

Consider the ith term in the series on the right hand side of Equation (1.8),
and note that

\binom{n}{i} n^{-i} = \frac{n!}{i!(n-i)!} n^{-i} = \frac{1}{i!} \prod_{j=0}^{i-1} \frac{n-j}{n} ≤ \frac{1}{i!}.

Therefore it follows that

(1 + n^{-1})^n ≤ \sum_{i=0}^{n} \frac{1}{i!} ≤ e,

for every n ∈ N. The second inequality comes from the fact that

e = \sum_{i=0}^{\infty} \frac{1}{i!},

where all of the terms in the sequence are positive. It then follows that

\limsup_{n\to\infty} (1 + n^{-1})^n ≤ e.

See Theorem 1.6. Now suppose that m ∈ N such that m ≤ n. Then

(1 + n^{-1})^n = \sum_{i=0}^{n} \binom{n}{i} n^{-i} ≥ \sum_{i=0}^{m} \binom{n}{i} n^{-i},

since each term in the sum is positive. Note that for fixed i,

\lim_{n\to\infty} \binom{n}{i} n^{-i} = \lim_{n\to\infty} \frac{1}{i!} \prod_{j=0}^{i-1} \frac{n-j}{n} = \frac{1}{i!}.

Therefore, using the same argument as above, it follows that

\liminf_{n\to\infty} (1 + n^{-1})^n ≥ \sum_{i=0}^{m} \frac{1}{i!}.

Letting m → ∞ establishes the inequality

\liminf_{n\to\infty} (1 + n^{-1})^n ≥ e.

Therefore it follows that

\limsup_{n\to\infty} (1 + n^{-1})^n ≤ e ≤ \liminf_{n\to\infty} (1 + n^{-1})^n.

The result of Theorem 1.5 then implies that

\limsup_{n\to\infty} (1 + n^{-1})^n = e = \liminf_{n\to\infty} (1 + n^{-1})^n,

so that Definition 1.3 implies that

\lim_{n\to\infty} (1 + n^{-1})^n = e.
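The convergence in Theorem 1.7 is slow but easy to observe numerically. The following R sketch (an illustration over an arbitrary grid of values of n) compares (1 + n^{-1})^n with exp(1).

# Numerical illustration of Theorem 1.7: (1 + 1/n)^n converges to e.
n <- c(1, 10, 100, 1000, 10000, 100000)
x <- (1 + 1/n)^n
cbind(n = n, x = x, error = exp(1) - x)
# The error is positive and decreases roughly like e/(2n), reflecting the
# monotone increase of the sequence toward e.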

Another limiting result that is used in this book is Stirling’s Approximation,


which is used to approximate the factorial operation.
Theorem 1.8 (Stirling).

\lim_{n\to\infty} \frac{n^n (2n\pi)^{1/2}}{\exp(n)\, n!} = 1.

A proof of Theorem 1.8 can be found in Slomson (1991). Theorem 1.8 implies
that when n is large,

\frac{n^n (2n\pi)^{1/2}}{\exp(n)\, n!} ≈ 1,

and hence one can approximate n! with n^n \exp(-n) (2n\pi)^{1/2}.
Example 1.9. Let n and k be positive integers such that k ≤ n. The number
of combinations of k items selected from a set of n items is

\binom{n}{k} = \frac{n!}{k!(n-k)!}.

Theorem 1.8 implies that when both n and k are large, we can approximate
\binom{n}{k} as

\binom{n}{k} ≈ \frac{n^n \exp(-n) (2n\pi)^{1/2}}{k^k \exp(-k) (2k\pi)^{1/2} (n-k)^{n-k} \exp(k-n) [2(n-k)\pi]^{1/2}}
            = (2\pi)^{-1/2} n^{n+1/2} k^{-k-1/2} (n-k)^{k-n-1/2}.

For example, \binom{10}{5} = 252 exactly, while the approximation yields \binom{10}{5} ≈ 258.4,
giving a relative error of 2.54%. The approximation improves as both n and k
increase. For example, \binom{20}{10} = 184756 exactly, while the approximation yields
\binom{20}{10} ≈ 187078.973, giving a relative error of 1.26%.
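The calculations in Example 1.9 can be reproduced directly. The R sketch below is a minimal check of Theorem 1.8 and of the binomial coefficient approximation derived above; the helper functions stirling and choose.approx are introduced here only for illustration and are not part of any package.

# Stirling's approximation (Theorem 1.8) and the binomial coefficient
# approximation of Example 1.9.
stirling <- function(n) n^n * exp(-n) * sqrt(2 * pi * n)
choose.approx <- function(n, k) {
  (2 * pi)^(-1/2) * n^(n + 1/2) * k^(-k - 1/2) * (n - k)^(k - n - 1/2)
}
stirling(10) / factorial(10)                   # ratio approaches 1 as n grows
c(exact = choose(10, 5), approx = choose.approx(10, 5))    # 252 versus about 258.4
c(exact = choose(20, 10), approx = choose.approx(20, 10))  # 184756 versus about 187079
(choose.approx(10, 5) - choose(10, 5)) / choose(10, 5)     # relative error, about 0.0254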

1.3 Sequences of Real Functions

The convergence of sequences of functions will play an important role through-


out this book due to the fact that sequences of random variables are really
just sequences of functions. This section will review the basic definitions and
results for convergent sequences of functions from a calculus based viewpoint.
More sophisticated results, such as those based on measure theory, will be
introduced in Chapter 3.
The main difference between studying the convergence properties of a sequence
of real numbers and a sequence of real valued functions is that there is not
a single definition that characterizes a convergent sequence of functions. This
section will study two basic types of convergence: pointwise convergence and
uniform convergence.
Definition 1.4. Let {fn (x)}∞n=1 be a sequence of real valued functions. The
sequence converges pointwise to a real function f if
lim fn (x) = f (x) (1.9)
n→∞
pw
for all x ∈ R. We will represent this property as fn −−→ f as n → ∞.
In Definition 1.4 the problem of convergence is reduced to looking at the
convergence properties of the real sequence of numbers given by {fn (x)}∞ n=1
when x is a fixed real number. Therefore, the limit used in Equation (1.9) is
the same limit that is defined in Definition 1.1.
Example 1.10. Consider a sequence of functions {fn (x)}∞ n=1 defined on the
unit interval by fn (x) = 1 + n−1 x2 for all n ∈ N. Suppose that x ∈ [0, 1] is a
fixed real number. Then
lim fn (x) = lim (1 + n−1 x2 ) = 1,
n→∞ n→∞
pw
regardless of the value of x ∈ [0, 1]. Therefore fn −−→ f as n → ∞. 
Example 1.11. Let {fn (x)}∞ n=1 be a sequence of real valued functions defined
on the unit interval [0, 1] as fn (x) = xn for all n ∈ N. First suppose that
x ∈ [0, 1) and note that
lim fn (x) = lim xn = 0.
n→∞ n→∞

However, if x = 1 then
lim fn (x) = lim 1 = 1.
n→∞ n→∞
pw
Therefore fn −−→ f as n → ∞ where f(x) = δ{x; {1}}, and δ is the indicator
function defined by

δ\{x; A\} = \begin{cases} 1 & \text{if } x ∈ A, \\ 0 & \text{if } x ∉ A. \end{cases}
Note that this example also demonstrates that the limit of a sequence of
continuous functions need not also be continuous. 
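Pointwise convergence can be examined one value of x at a time. The R sketch below (with an arbitrary choice of grid points and sample sizes) evaluates f_n(x) = x^n at a few fixed points of [0, 1], showing the values decaying to zero for x < 1 and remaining equal to one at x = 1.

# Pointwise convergence of f_n(x) = x^n on [0, 1] (Example 1.11).
# For each fixed x the numbers f_n(x) form an ordinary real sequence.
f <- function(x, n) x^n
x.values <- c(0.5, 0.9, 0.99, 1.0)
n.values <- c(1, 10, 100, 1000)
sapply(x.values, function(x) f(x, n.values))
# Each column corresponds to one fixed x: the first three columns decay to 0,
# while the column for x = 1 is identically 1, matching the limit function
# f(x) = delta{x; {1}}.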
Because the definition of pointwise convergence for sequences of functions is
closely related to Definition 1.1, the definition for limit for real sequences,
many of the properties of limits also hold for sequences of functions that
converge pointwise. For example, if {fn (x)}∞ ∞
n=1 and {gn (x)}n=1 are sequences
of functions that converge to the functions f and g, respectively, then the
sequence {fn (x) + gn (x)}∞
n=1 converges pointwise to f (x) + g(x). See Exercise
14.
A different approach to defining convergence for sequences of real functions
requires not only that the sequence of functions converge pointwise to a lim-
iting function, but that the convergence must be uniform in x ∈ R. That is,
if {fn }∞
n=1 is a sequence of functions that converges pointwise to a function
f , we further require that the rate of convergence of fn (x) to f (x) as n → ∞
does not depend on x ∈ R.
Definition 1.5. A sequence of functions {fn (x)}∞ n=1 converges uniformly to
a function f (x) as n → ∞ if for every ε > 0 there exists an integer nε such
that |fn (x) − f (x)| < ε for all n ≥ nε and x ∈ R. This type of convergence
u
will be represented as fn −→ f as n → ∞.
Example 1.12. Consider once again the sequence of functions {fn (x)}∞ n=1
given by fn (x) = 1 + n−1 x2 for all n ∈ N and x ∈ [0, 1]. It was shown in
pw
Example 1.10 that fn −−→ f = 1 as n → ∞ on [0, 1]. Now we investigate
whether this convergence is uniform or not. Let ε > 0, and note that for
x ∈ [0, 1],
|fn (x) − f (x)| = |1 + n−1 x2 − 1| = |n−1 x2 | ≤ n−1 ,
for all x ∈ [0, 1]. Take nε = ε−1 + 1 and we have that |fn (x) − f (x)| < ε for all
u
n > nε . Because nε does not depend on x, we have that fn − → f , as n → ∞
on [0, 1]. 

Example 1.13. Consider the sequence of functions {fn (x)}n=1 given by
fn (x) = xn for all x ∈ [0, 1] and n ∈ N. It was shown in Example 1.11 that
pw
fn −−→ f as n → ∞ where f(x) = δ{x; {1}} on [0, 1]. Consider ε ∈ (0, 1).
For any value of n ∈ N, |fn (x) − 0| < ε on (0, 1) when xn < ε which im-
plies that n > log(ε)/ log(x), where we note that log(x) < 0 when x ∈ (0, 1).
Hence, such a bound on n will always depend on x since log(x) is unbounded
in the interval (0, 1), and therefore the sequence of functions {fn }∞ n=1 does
not converge uniformly to f as n → ∞. 
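The difference between Examples 1.12 and 1.13 can be seen by computing sup_x |f_n(x) − f(x)| for each n. The R sketch below approximates these suprema over a fine grid of [0, 1); this is only a numerical approximation, since the exact supremum of x^n over [0, 1) equals one for every n and is not attained on a finite grid.

# Approximate sup |f_n(x) - f(x)| over a grid of [0, 1) for the two examples.
x <- seq(0, 0.9999, length.out = 10000)  # grid approximation of [0, 1)
n.values <- c(1, 10, 100, 1000)
sup.quadratic <- sapply(n.values, function(n) max(abs((1 + x^2 / n) - 1)))
sup.power     <- sapply(n.values, function(n) max(abs(x^n - 0)))
rbind(n = n.values, sup.quadratic = sup.quadratic, sup.power = sup.power)
# sup.quadratic decreases like 1/n (uniform convergence), while sup.power
# remains near 1 (the exact supremum over [0, 1) equals 1 for every n; the
# small decay seen here is an artifact of the finite grid).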
Example 1.12 demonstrates one characteristic of sequences of functions that
are uniformly convergent, in that the limit of a sequence of uniformly conver-
gent continuous functions must also be continuous.
Theorem 1.9. Suppose that {fn (x)}∞ n=1 is a sequence of functions on a subset
R of R that converge uniformly to a function f . If each fn is continuous at a
point x ∈ R, then f is also continuous at x.
A proof of Theorem 1.9 can be found in Section 9.4 of Apostol (1974). Note
that uniform convergence is a sufficient, but not a necessary condition, for
the limit function to be continuous, as is demonstrated by Example 1.13. An
alternate view of uniformly convergent sequences of functions can be defined
in terms of Cauchy sequences, as shown below.
Theorem 1.10. Suppose that {fn (x)}∞ n=1 is a sequence of functions on a
u
subset R of R. There exists a function f such that fn − → f as n → ∞ if and
only if for every ε > 0 there exists nε ∈ N such that |fn (x) − fm (x)| < ε for
all n > nε and m > nε , for every x ∈ R.
A proof of Theorem 1.10 can be found in Section 9.5 of Apostol (1974).
Another important property of sequences of functions {fn (x)}∞ n=1 is whether
a limit and an integral can be exchanged. That is, let a and b be real constants
that do not depend on n such that a < b. Does it necessarily follow that

\lim_{n\to\infty} \int_a^b f_n(x)\,dx = \int_a^b \lim_{n\to\infty} f_n(x)\,dx = \int_a^b f(x)\,dx?
An example can be used to demonstrate that such an exchange is not always
justified.
Example 1.14. Consider a sequence of real functions defined by f_n(x) =
2^n δ{x; (2^{-n}, 2^{-(n-1)})} for all n ∈ N. The integral of f_n is given by

\int_{-\infty}^{\infty} f_n(x)\,dx = \int_{2^{-n}}^{2^{-(n-1)}} 2^n\,dx = 1,

for all n ∈ N. Therefore

\lim_{n\to\infty} \int_{-\infty}^{\infty} f_n(x)\,dx = 1.

However, for each x ∈ R,

\lim_{n\to\infty} f_n(x) = 0,

so that

\int_{-\infty}^{\infty} \lim_{n\to\infty} f_n(x)\,dx = \int_{-\infty}^{\infty} 0\,dx = 0.

Therefore, in this case

\lim_{n\to\infty} \int_a^b f_n(x)\,dx ≠ \int_a^b \lim_{n\to\infty} f_n(x)\,dx.


While Example 1.14 shows it is not always possible to interchange a limit and
an integral, there are some instances where the change is allowed. One of the
most useful of these cases occurs when the sequence of functions {fn (x)}∞n=1
is dominated by an integrable function, that is, a function whose integral
exists and is finite. This result is usually called the Dominated Convergence
Theorem.
Theorem 1.11 (Lebesgue). Let {f_n(x)}_{n=1}^∞ be a sequence of real functions.
Suppose that f_n \xrightarrow{\text{pw}} f as n → ∞ for some real valued function f, and that
there exists a real function g such that

\int_{-\infty}^{\infty} |g(x)|\,dx < \infty,

and |f_n(x)| ≤ g(x) for all x ∈ R and n ∈ N. Then

\int_{-\infty}^{\infty} |f(x)|\,dx < \infty,

and

\lim_{n\to\infty} \int_{-\infty}^{\infty} f_n(x)\,dx = \int_{-\infty}^{\infty} \lim_{n\to\infty} f_n(x)\,dx = \int_{-\infty}^{\infty} f(x)\,dx.

A proof of this result can be found in Chapter 9 of Sprecher (1970). Example


1.14 demonstrated a situation where the interchange between an integral and a
limit is not justified. As such, the sequence of functions in Example 1.14 must
violate the assumptions of Theorem 1.11 in some way. The main assumption
in Theorem 1.11 is that the sequence of functions {fn (x)}∞ n=1 is dominated by
an integrable function g. This is the assumption that is violated in Example
1.14. Consider the function defined by

g(x) = \sum_{n=1}^{\infty} 2^n δ\{x; (2^{-n}, 2^{-(n-1)})\}.

This function dominates f_n(x) for all n ∈ N in that |f_n(x)| ≤ g(x) for all
x ∈ R and n ∈ N. But note that

\int_{-\infty}^{\infty} g(x)\,dx = \sum_{n=1}^{\infty} \int_{-\infty}^{\infty} 2^n δ\{x; (2^{-n}, 2^{-(n-1)})\}\,dx = \sum_{n=1}^{\infty} 1 = \infty.   (1.10)

Therefore, g is not an integrable function. Could there be another function


that dominates the sequence {fn (x)}∞ n=1 that is integrable? Such a function
would have to be less than g for at least some values of x in order to make the
integral in Equation (1.10) finite. This is not possible because g(x) = fn (x)
when x ∈ (2−n , 2−(n−1) ), for all n ∈ N. Therefore, there is not an integrable
function g that dominates the sequence {fn (x)}∞ n=1 for all n ∈ N.

A common application of Theorem 1.11 in statistical limit theory is to show


that the convergence of the density of a sequence of random variables also
implies that the distribution function of the sequence converges.
Example 1.15. Consider a sequence of real functions {f_n(x)}_{n=1}^∞ defined by

f_n(x) = \begin{cases} (1 + n^{-1}) \exp[-x(1 + n^{-1})] & \text{for } x > 0, \\ 0 & \text{for } x ≤ 0, \end{cases}

for all n ∈ N. The pointwise limit of the sequence of functions is given by

\lim_{n\to\infty} f_n(x) = \lim_{n\to\infty} (1 + n^{-1}) \exp[-x(1 + n^{-1})] = \exp(-x),

for all x > 0. Now, note that

|f_n(x)| = (1 + n^{-1}) \exp[-x(1 + n^{-1})] ≤ 2\exp(-x),

for all n ∈ N. Therefore, if we define

g(x) = \begin{cases} 2\exp(-x) & \text{for } x > 0, \\ 0 & \text{for } x ≤ 0, \end{cases}

we have that |f_n(x)| ≤ g(x) for all x ∈ R and n ∈ N. Now

\int_{-\infty}^{\infty} g(x)\,dx = \int_{0}^{\infty} 2\exp(-x)\,dx = 2 < \infty,

so that g is an integrable function. Theorem 1.11 then implies that

\lim_{n\to\infty} \int_0^x f_n(t)\,dt = \lim_{n\to\infty} \int_0^x (1 + n^{-1}) \exp[-t(1 + n^{-1})]\,dt
   = \int_0^x \lim_{n\to\infty} (1 + n^{-1}) \exp[-t(1 + n^{-1})]\,dt
   = \int_0^x \exp(-t)\,dt
   = 1 - \exp(-x).


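The conclusion of Example 1.15 can be checked numerically. The R sketch below uses the numerical integrator integrate from base R to evaluate the integral of f_n over (0, x) for increasing n and compares the result with 1 − exp(−x); the choice x = 2 is arbitrary.

# Numerical check of Example 1.15 using the dominated convergence theorem.
f.n <- function(t, n) (1 + 1/n) * exp(-t * (1 + 1/n))
x <- 2
n.values <- c(1, 10, 100, 1000)
F.n <- sapply(n.values, function(n) integrate(f.n, lower = 0, upper = x, n = n)$value)
rbind(n = n.values, integral = F.n, limit = 1 - exp(-x))
# The integrals converge to 1 - exp(-x), the value obtained by passing the
# limit inside the integral.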
A related result that allows for the interchange of a limit and an integral
is based on the assumption that the sequence of functions is monotonically
increasing or decreasing to a limiting function.
Theorem 1.12 (Lebesgue’s Monotone Convergence Theorem). Let {fn (x)}∞ n=1
be a sequence of real functions that are monotonically increasing to f on R.
That is fi (x) ≤ fj (x) for all x ∈ R when i < j, for positive integers i and j
pw
and fn −−→ f as n → ∞ on R. Then
\lim_{n\to\infty} \int_{-\infty}^{\infty} f_n(x)\,dx = \int_{-\infty}^{\infty} \lim_{n\to\infty} f_n(x)\,dx = \int_{-\infty}^{\infty} f(x)\,dx.   (1.11)

It is important to note that the integrals in Equation (1.11) may be infinite,


unless the additional assumption that f is integrable is added. The result is not
just limited to monotonically increasing sequences of functions. The corollary
below provides a similar result for monotonically decreasing functions.
Corollary 1.1. Let {fn (x)}∞ n=1 be a sequence of non-negative real functions
that are monotonically decreasing to f on R. That is fi (x) ≥ fj (x) for all
pw
x ∈ R when i < j, for positive integers i and j and fn −−→ f as n → ∞ on
R. If f1 is integrable then
\lim_{n\to\infty} \int_{-\infty}^{\infty} f_n(x)\,dx = \int_{-\infty}^{\infty} \lim_{n\to\infty} f_n(x)\,dx = \int_{-\infty}^{\infty} f(x)\,dx.

Proofs of Theorem 1.12 and Corollary 1.1 can be found in Gut (2005) or
Sprecher (1970).
Example 1.16. The following setup is often used when using arguments that
rely on the truncation of random variables. Let g be an integrable function on
R and define a sequence of functions {f_n(x)}_{n=1}^{∞} as f_n(x) = g(x)δ{|x|; (0, n)}.
It follows for all x ∈ R that
\lim_{n→∞} f_n(x) = \lim_{n→∞} g(x)δ\{|x|; (0, n)\} = g(x),
since for each x ∈ R there exists an integer n_x such that f_n(x) = g(x) for all n ≥ n_x. Noting that {f_n(x)}_{n=1}^{∞} is a monotonically increasing sequence of functions allows us to use Theorem 1.12 to conclude that
\lim_{n→∞} \int_{-∞}^{∞} f_n(x)\,dx = \int_{-∞}^{∞} \lim_{n→∞} f_n(x)\,dx = \int_{-∞}^{∞} g(x)\,dx.
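A quick numerical illustration of this truncation device is given by the sketch below, which assumes the concrete choice g(x) = exp(−|x|), an integrable function whose integral over the real line is 2; the truncated integrals increase to the full integral as n grows.

```python
import numpy as np
from scipy.integrate import quad

def g(x):
    # an assumed integrable function; its integral over the real line is 2
    return np.exp(-np.abs(x))

# use symmetry to avoid integrating through the kink at zero
full_integral = 2.0 * quad(g, 0.0, np.inf)[0]
for n in [1, 2, 5, 10, 20]:
    # f_n(x) = g(x) on (-n, n) and zero elsewhere, so integrate over (-n, n)
    truncated, _ = quad(g, -n, n)
    print(n, truncated, full_integral - truncated)
```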

1.4 The Taylor Expansion

Taylor’s Theorem, and the associated approximation of smooth functions, are


crucial elements of asymptotic theory in both mathematics and statistics. In
particular, the Central Limit Theorem is based on this result. Adding to the
usefulness of this idea is the fact that the general method can be extended to
vector functions and spaces of functions. Throughout this book various forms
of Taylor’s Theorem will serve as important tools in many diverse situations.
Before proceeding to the main result, it is useful to consider some motivation
of the form of the most common use of Taylor’s Theorem, which is the ap-
proximation of smooth functions near a known value of the function. Consider
a generic function f that is reasonably smooth. That is, f has no jumps or
kinks. This can be equivalently stated in terms of the derivatives of f . In this
example it will be assumed that f has at least three continuous derivatives in
a sufficiently large neighborhood of x.
Suppose that the value of the function f is known at a point x, but the value at
x+δ is desired, where δ is some positive small real number. The sign of δ is not
crucial to this discussion, but restricting δ to be positive will simplify the use
of the figures that illustrate this example. If f is linear, that is f (x) = a + bx
for some real constants a and b, then it follows that
f(x + δ) = f(x) + δf'(x). (1.12)
Now suppose that f is not linear, but is approximately linear in a small neigh-
borhood of x. This will be true of all reasonably smooth functions as long as
the neighborhood is small enough. In this case, the right hand side of Equation
(1.12) can be used to approximate f (x + δ). That is
f(x + δ) ≈ f(x) + δf'(x). (1.13)
See Figure 1.1. One can observe from Figure 1.1 that the quality of the ap-
proximation will depend on many factors. It is clear, at least visually, that the
approximation becomes more accurate as δ → 0. This is generally true and
can be proven using the continuity of f . It is also clear from Figure 1.1 that
the accuracy of the approximation will also depend on the curvature of f . The
less curvature f has, the more accurate the approximation will be. Hence, the
accuracy of the approximation depends on |f 00 | in a neighborhood of x as well.
The accuracy will also depend on higher derivatives when they exist and are
non-zero. To consider the accuracy of this approximation more closely, define
E_1(x, δ) = f(x + δ) − f(x) − δf'(x),

Figure 1.1 The linear approximation of a curved function. The solid line indicates the form of the function f and the dotted line shows the linear approximation of f(z + δ) given by f(z + δ) ≈ f(z) + δf'(z) for the point z indicated on the graph.

the error of the approximation as a function of both x and δ. Note that using
Theorem A.3 yields
f(x + δ) − f(x) = \int_x^{x+δ} f'(t)\,dt,
and
δf'(x) = f'(x) \int_x^{x+δ} dt,
so that
E_1(x, δ) = \int_x^{x+δ} [f'(t) − f'(x)]\,dt.
An application of Theorem A.4 yields
E_1(x, δ) = −(x + δ − t)[f'(t) − f'(x)] \Big|_x^{x+δ} + \int_x^{x+δ} (x + δ − t) f''(t)\,dt = \int_x^{x+δ} (x + δ − t) f''(t)\,dt,

which establishes the role of the second derivative in the error of the approx-
imation.
20 SEQUENCES OF REAL NUMBERS AND FUNCTIONS
Note that if f has a high degree of curvature, with no change in the direction
of the concavity of f , the absolute value of the integral in E1 (x, δ) will be
large. This indicates that the function continues turning away from the linear
approximation as shown in Figure 1.1. If f has an inflection point in the
interval (x, x + δ), the direction of the concavity will change and the integral
will become smaller. The change in the sign of f 00 indicates that the function
is turning back toward the linear approximation and the error will decrease.
See Figure 1.2.

A somewhat simpler form of the error term E1 (x, δ) can be obtained through
an application of Theorem A.5, as
E_1(x, δ) = \int_x^{x+δ} (x + δ − t) f''(t)\,dt = f''(ξ) \int_x^{x+δ} (x + δ − t)\,dt = ½ f''(ξ) δ^2

for some ξ ∈ [x, x + δ]. The exact value of ξ will depend on x, δ, and f .

If f is a quadratic polynomial it follows that f (x + δ) can be written as


f(x + δ) = f(x) + δf'(x) + ½δ^2 f''(x). (1.14)
See Exercise 21. As in the case of the linear approximation, if f is not a
quadratic polynomial then the right hand side of Equation (1.14) can be
used to approximate f (x + δ). Note that in the case where f is linear, the
approximation would revert back to the linear approximation given in Equa-
tion (1.13) and the result would be exact. In the case where the right hand
side of Equation (1.14) is an approximation, it is logical to consider whether
the quadratic approximation is better than the linear approximation given in
Equation (1.13). Obviously the linear approximation could be better in some
specialized cases for specific values of δ even when f is not linear. See Figure
1.2.

Assume that f'''(x) exists, is finite, and continuous in the interval (x, x + δ). Define
E_2(x, δ) = f(x + δ) − f(x) − δf'(x) − ½δ^2 f''(x). (1.15)
Note that the first three terms on the right hand side of Equation (1.15) equal E_1(x, δ) so that
E_2(x, δ) = E_1(x, δ) − ½δ^2 f''(x) = \int_x^{x+δ} (x + δ − t) f''(t)\,dt − ½δ^2 f''(x).

Now note that


½δ^2 f''(x) = f''(x) \int_x^{x+δ} (x + δ − t)\,dt,

Figure 1.2 The linear approximation of a curved function. The solid line indicates the form of the function f and the dotted line shows the linear approximation of f(z + δ) given by f(z + δ) ≈ f(z) + δf'(z) for the point z indicated on the graph. In this case, the concavity of f changes and the linear approximation becomes more accurate again for larger values of δ, for the range of the values plotted in the figure.

so that
E_2(x, δ) = \int_x^{x+δ} (x + δ − t)[f''(t) − f''(x)]\,dt.
An application of Theorem A.4 yields
E_2(x, δ) = ½ \int_x^{x+δ} (x + δ − t)^2 f'''(t)\,dt,

which indicates that the third derivative is the essential determining factor in
the accuracy of this approximation. As with E1 (x, δ), the error term can be
restated in a somewhat simpler form by using an application of Theorem A.5.
That is,
E_2(x, δ) = ½ \int_x^{x+δ} (x + δ − t)^2 f'''(t)\,dt = ½ f'''(ξ) \int_x^{x+δ} (x + δ − t)^2\,dt = (1/6) δ^3 f'''(ξ),
for some ξ ∈ [x, x + δ], where the value of ξ will depend on x, δ and f . Note
that δ is cubed in E2 (x, δ) as opposed to being squared in E1 (x, δ). This would
imply, depending on the relative sizes of f 00 and f 000 in the interval [x, x + δ],
that E2 (x, δ) will generally be smaller than E1 (x, δ) for small values of δ. In
fact, one can note that
\frac{E_2(x, δ)}{E_1(x, δ)} = \frac{(1/6) δ^3 f'''[ξ_2(δ)]}{(1/2) δ^2 f''[ξ_1(δ)]} = \frac{δ f'''[ξ_2(δ)]}{3 f''[ξ_1(δ)]}.
The values ξ1 (δ) and ξ2 (δ) are the values of ξ for E1 (x, δ) and E2 (x, δ), re-
spectively, written here as a function of δ to emphasize the fact that these
values change as δ → 0. In fact, as δ → 0, ξ1 (δ) → x and ξ2 (δ) → x. Now
assume that the derivatives are continuous so that f''[ξ_1(δ)] → f''(x) and f'''[ξ_2(δ)] → f'''(x) as δ → 0, and that |f'''(x)/f''(x)| < ∞. Then
\lim_{δ→0} \frac{E_2(x, δ)}{E_1(x, δ)} = \lim_{δ→0} \frac{δ f'''[ξ_2(δ)]}{3 f''[ξ_1(δ)]} = 0. (1.16)

Hence, under these conditions, the error from the quadratic approximation
will be dominated by the error from the linear approximation.
If f is a sufficiently smooth function, ensured by the existence of a required
number of derivatives, then the process described above can be iterated further
to obtain potentially smaller error terms when δ is small. This results in what
is usually known as Taylor’s Theorem.
Theorem 1.13 (Taylor). Let f be a function that has p + 1 bounded and
continuous derivatives in the interval (x, x + δ). Then
f(x + δ) = \sum_{k=0}^{p} \frac{δ^k f^{(k)}(x)}{k!} + E_p(x, δ),
where
E_p(x, δ) = \frac{1}{p!} \int_x^{x+δ} (x + δ − t)^p f^{(p+1)}(t)\,dt = \frac{δ^{p+1} f^{(p+1)}(ξ)}{(p + 1)!},
for some ξ ∈ (x, x + δ).
For a proof of Theorem 1.13 see Exercises 24 and 25. What is so special
about the approximation that is obtained using Theorem 1.13? Aside from
the motivation given earlier in this section, consider taking the first derivative
of the pth order approximation at δ = 0, which is
\frac{d}{dδ} \left[ \sum_{k=0}^{p} \frac{δ^k f^{(k)}(x)}{k!} \right]_{δ=0} = \frac{d}{dδ} \left[ f(x) + δf'(x) + \sum_{k=2}^{p} \frac{δ^k f^{(k)}(x)}{k!} \right]_{δ=0} = f'(x).
Hence, the approximating function has the same derivative at x as the actual
function f (x). In general it can be shown that
\frac{d^j}{dδ^j} \left[ \sum_{k=0}^{p} \frac{δ^k f^{(k)}(x)}{k!} \right]_{δ=0} = f^{(j)}(x), (1.17)
for j = 1, . . . , p. Therefore, the pth -order approximating function has the same
derivatives of order 1, . . . , p as the actual function f (x) at the point x. A proof
of Equation (1.17) is given in Exercise 27.
Note that an alternative form of the expansion given in Theorem 1.13 can be
obtained by setting y = x + δ and x = y0 so that δ = y − x = y − y0 and the
expansion has the form
f(y) = \sum_{k=0}^{p} \frac{(y − y_0)^k f^{(k)}(y_0)}{k!} + E_p(y, y_0), (1.18)
where
E_p(y, y_0) = \frac{(y − y_0)^{p+1} f^{(p+1)}(ξ)}{(p + 1)!},
and ξ is between y and y0 . The expansion given in Equation (1.18) is usually
called the expansion of f around the point y0 .

Example 1.17. Suppose we wish to approximate the exponential function for


positive arguments near zero. That is, we wish to approximate exp(δ) for small
values of δ. A simple approximation for these values may be useful since the
exponential function does not have a simple closed form from which it can be
evaluated. Several approximations based on Theorem 1.13 will be considered
in detail. For the first approximation, take p = 1 in Theorem 1.13 to obtain
exp(x+δ) = exp(x)+δ exp(x)+E1 (x, δ) so that when x = 0 the approximation
simplifies to exp(δ) = 1 + δ + E1 (δ). The error term is now written only as
a function of δ since the value of x is now fixed. Similarly, when p = 2 and
p = 3, Theorem 1.13 yields the approximations exp(δ) = 1 + δ + ½δ^2 + E_2(δ)
and exp(δ) = 1 + δ + ½δ^2 + (1/6)δ^3 + E_3(δ), respectively. The simple polynomial
approximations given by Theorem 1.13 in this case are due to the simple form
of the derivative of the exponential function. The exponential function, along
with these three approximations are plotted in Figure 1.3. One can observe
from the figure that all of the approximations, even the linear one, do well
for very small values of δ. This is due to the fact that all of the error terms
converge to zero as δ → 0. As more terms are added to the Taylor expansion,
the approximation gets better, and has relatively smaller error even for larger
values of δ. For example, the absolute error for the cubic approximation for
δ = 1 is smaller than that of the linear approximation at δ = 0.3.
To emphasize the difference in the behavior of the error terms for each of the
approximations, the size of the absolute error for each approximation has been
plotted in Figure 1.4. The large differences in the absolute error for each of
the three approximations is quite apparent from Figure 1.4, as well as the fact
that all three error terms converge to zero as δ → 0.
The relative absolute errors of each approximation are plotted in Figure 1.5.
One can visually observe the effect that is derived in Equation (1.16). All
three of the relative errors converge to zero as δ → 0, demonstrating that the

Figure 1.3 The exponential function (solid line) and three approximations based on Theorem 1.13 using p = 1 (dashed line), p = 2 (dotted line) and p = 3 (dash-dot line).

absolute errors from the higher order polynomial approximations are domi-
nated by the absolute errors from the lower order polynomial approximations
as δ → 0. One can also observe from Figure 1.5 that the error of the quadratic approximation relative to that of the linear approximation is much larger than the error of the cubic approximation relative to that of the linear approximation. 
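The three approximations in Example 1.17 are easy to reproduce; the short Python sketch below evaluates them on a few values of δ and prints the absolute errors against exp(δ).

```python
import math

def taylor_exp(delta, p):
    # p-th order Taylor polynomial of exp at zero, evaluated at delta
    return sum(delta ** k / math.factorial(k) for k in range(p + 1))

for delta in [0.1, 0.3, 0.5, 1.0]:
    errors = [abs(math.exp(delta) - taylor_exp(delta, p)) for p in (1, 2, 3)]
    print(delta, errors)
```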
Example 1.18. Consider the distribution function of a N(0, 1) distribution
given by
Φ(x) = \int_{-∞}^{x} (2π)^{-1/2} exp(−t^2/2)\,dt.

We would like to approximate Φ(x), an integral that has no simple closed form,
with a simple function for values of x near 0. As in the previous example, three
approximations based on Theorem 1.13 will be considered. Applying Theorem
1.13 to Φ(x + δ) with p = 1 yields the approximation
Φ(x + δ) = Φ(x) + δΦ'(x) + E_1(x, δ) = Φ(x) + δφ(x) + E_1(x, δ),

Figure 1.4 The absolute error for approximating the exponential function using the
three approximations based on Theorem 1.13 with p = 1 (dashed line), p = 2 (dotted
line) and p = 3 (dash-dot line).

where φ(x) = Φ'(x) is the density function of a N(0, 1) distribution. Setting x = 0 yields
Φ(δ) = ½ + δφ(0) + E_1(δ) = ½ + δ(2π)^{-1/2} + E_1(δ).
With p = 2 and x = 0, Theorem 1.13 yields
Φ(δ) = ½ + (2π)^{-1/2}δ + ½φ'(0)δ^2 + E_2(δ),

where

φ'(0) = \left. \frac{d}{dx} φ(x) \right|_{x=0} = \left. −x(2π)^{-1/2} exp(−x^2/2) \right|_{x=0} = \left. −xφ(x) \right|_{x=0} = 0.

Hence, the quadratic approximation is the same as the linear one. This indi-
cates that the linear approximation is more accurate in this case than what
would usually be expected. The cubic approximation has the form

Φ(δ) = ½ + (2π)^{-1/2}δ + ½φ'(0)δ^2 + (1/6)φ''(0)δ^3 + E_3(δ), (1.19)

Figure 1.5 The absolute relative errors for approximating the exponential function
using the three approximations based on Theorem 1.13 with p = 1, p = 2 and p = 3.
The relative errors are |E2 (δ)/E1 (δ)| (solid line), |E3 (δ)/E1 (δ)| (dashed line) and
|E3 (δ)/E2 (δ)| (dotted line).

where

φ''(0) = \left. −\frac{d}{dx} [xφ(x)] \right|_{x=0} = \left. (x^2 − 1)φ(x) \right|_{x=0} = −(2π)^{-1/2}.
Hence, the cubic approximation has the form
Φ(δ) = ½ + δ(2π)^{-1/2} − (1/6)(2π)^{-1/2}δ^3 + E_3(δ).
The linear and cubic approximations are plotted in Figure 1.6. It is again
clear that both approximations are accurate for very small values of δ. The cu-
bic approximation does relatively well for δ ∈ [0, 1], but quickly becomes worse
for larger values of δ. Note further that the approximations do not provide a
valid distribution function. The linear approximation quickly exceeds one and
the cubic approximation is not a non-decreasing function. Most approxima-
tions for distribution functions have a limited range where the approximation
is both accurate, and provides a valid distribution function. The decreased
error of the linear approximation when x = 0 is due to the fact that the stan-

Figure 1.6 The standard normal distribution function (solid line) and two approxi-
mations based on Theorem 1.13 using p = 1 (dashed line) and p = 3 (dotted line).

dard normal distribution function is nearly linear at x = 0. This increased


accuracy does not hold for other values of x, as the term ½φ'(x)δ^2 is only zero
at x = 0. What is the purpose of the cubic term in this approximation? While
the normal distribution function is nearly linear at around 0, there is some
curvature present. In fact, the concavity of Φ(x) changes at x = 0. The linear
term, whose concavity does not change, is unable to capture this behavior.
However, a cubic term can. Note that the cubic term is negative when δ > 0
and is positive when δ < 0. This allows the cubic approximation to capture
the slight curvature around x = 0 that the linear approximation cannot. 
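A quick numerical check of these approximations is possible with the standard library alone; the minimal sketch below evaluates Φ(δ) through the error function and compares it with the linear and cubic approximations from Example 1.18.

```python
import math

def std_normal_cdf(x):
    # Φ(x) written in terms of the error function
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

c = 1.0 / math.sqrt(2.0 * math.pi)   # (2π)^{-1/2}
for delta in [0.1, 0.5, 1.0, 2.0]:
    linear = 0.5 + c * delta
    cubic = 0.5 + c * delta - c * delta ** 3 / 6.0
    exact = std_normal_cdf(delta)
    print(delta, abs(exact - linear), abs(exact - cubic))
```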

From Example 1.18 it is clear that the derivatives of the standard normal
density have a specific form in that they are all a polynomial multiplied by
the standard normal density. The polynomial multipliers, called Hermite poly-
nomials are quite useful and will be used later in the book.
Definition 1.6. Let φ(x) be the density of a standard normal random variable.
The kth derivative of φ(x) can be expressed as
\frac{d^k}{dx^k} φ(x) = (−1)^k H_k(x) φ(x),
where H_k(x) is a kth-order polynomial in x called the kth Hermite polynomial. That is,
H_k(x) = \frac{(−1)^k φ^{(k)}(x)}{φ(x)}.
Hermite polynomials are an example of a set of orthogonal polynomials. See
Exercise 33. Hermite polynomials also have many interesting properties in-
cluding a simple recurrence relation between successive polynomials.
Theorem 1.14. Let Hk (x) be the k th Hermite polynomial, then

1. For any positive integer k,
H_k(x) = \sum_{i=0}^{\lfloor k/2 \rfloor} (−1)^i \frac{(2i)!}{2^i i!} \binom{k}{2i} x^{k−2i},
where \lfloor k/2 \rfloor is the greatest integer less than or equal to k/2.
2. For any integer k ≥ 2, H_k(x) = xH_{k−1}(x) − (k − 1)H_{k−2}(x).
3. For any positive integer k,
\frac{d}{dx} H_k(x) = kH_{k−1}(x).
For a proof of Theorem 1.14 see Exercises 30, 31, and 32.
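The recurrence in part 2 of Theorem 1.14 gives a convenient way to generate the Hermite polynomials; the sketch below builds them as coefficient lists in Python, starting from H_0(x) = 1 and H_1(x) = x.

```python
def hermite_polynomials(order):
    # Coefficient lists in increasing powers of x: H_0 = 1, H_1 = x, and
    # H_k(x) = x*H_{k-1}(x) - (k-1)*H_{k-2}(x) for k >= 2.
    polys = [[1.0], [0.0, 1.0]]
    for k in range(2, order + 1):
        shifted = [0.0] + polys[k - 1]                     # multiply H_{k-1} by x
        prev = polys[k - 2] + [0.0] * (len(shifted) - len(polys[k - 2]))
        polys.append([a - (k - 1) * b for a, b in zip(shifted, prev)])
    return polys

# H_3(x) = x^3 - 3x, that is, coefficients [0, -3, 0, 1]
print(hermite_polynomials(3)[3])
```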

1.5 Asymptotic Expansions

An asymptotic expansion is an approximation of a function, usually written as a partial series, whose accuracy improves as a certain parameter approaches a specified value, typically zero or infinity. Asymptotic expansions have already
been encountered in Section 1.4 where the Taylor expansion was used. The
general form of an asymptotic expansion is
f (x, δ) = d0 (x) + δd1 (x) + δ 2 d2 (x) + · · · + δ p dp (x) + Ep (x, δ). (1.20)
This approximation is asymptotic in the sense that the accuracy of the ap-
proximation gets better as δ → 0 for fixed values of x. This type of expansion
is usually called a pth -order expansion since the highest power of δ in the ap-
proximation is p. The functions d1 (x), . . . , dp (x) are functions of x only, and
do not depend on δ. The powers of δ need not be integers, but the power of
δ should go up by a set amount for each additional term. For example, many
asymptotic expansions may have the form
f (x, δ) = d0 (x) + δ 1/2 d1 (x) + δd2 (x) + · · · + δ p/2 dp (x) + Ep (x, δ),
which is written in terms of powers of the square root of δ. Asymptotic ex-
pansions are also often defined in terms of the general form
f (x, δ) = f0 (x)[1 + δd1 (x) + · · · + δ p dp (x) + Ep (x, δ)], (1.21)
for approximating f (x, δ) around a leading term f0 (x). See Chapter 3 of
Barndorff-Nielsen and Cox (1989). An essential issue when dealing with asymp-
totic expansion is the fact that adding terms to an expansion, for example
going from a pth -order expansion to a (p + 1)st -order expansion, does not nec-
essarily increase the accuracy of the corresponding approximation for fixed δ.
In fact, the expansion may not even have a finite sum as p → ∞. The accuracy
of the expansion is an asymptotic property. That is, the error term Ep+1 (x, δ)
will usually be dominated by Ep (x, δ) only in the limit, or as δ → 0.
Example 1.19. Consider the function f (x) = x−1 . Suppose we wish to ap-
proximate f (x+δ) when x = 1 and δ is small. The first few derivatives of f (x)
are f 0 (x) = −x−2 , f 00 (x) = 2x−3 , and f 000 (x) = −6x−4 . An analysis of these
derivatives suggests the general form f (k) (x) = (−1)k (k!)x−(k+1) . Therefore,
the k th term of the Taylor expansion of f (1 + δ) is
\frac{δ^k f^{(k)}(1)}{k!} = (−1)^k δ^k.
Hence, Theorem 1.13 implies
(1 + δ)^{-1} = \sum_{k=0}^{p} (−1)^k δ^k + E_p(δ). (1.22)

When δ is small, specifically when δ is fixed so that 0 ≤ δ < 1, it follows that


(1 + δ)^{-1} = \lim_{p→∞} \sum_{k=0}^{p} (−1)^k δ^k = \sum_{k=0}^{∞} (−1)^k δ^k,

where the properties of the geometric series have been used. See Theorem
A.23. However, if δ is fixed so that δ > 1 then
\left| \sum_{k=0}^{p} (−1)^k δ^k \right| → ∞,

as p → ∞. Therefore, for fixed values of p, the asymptotic expansion given in


Equation 1.22 becomes more accurate as δ → 0, but for fixed δ the asymptotic
expansion may become less accurate as p → ∞. 
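The contrast between the two limits in Example 1.19 is easy to see numerically; the sketch below evaluates the partial sums for a small δ and for a value of δ greater than one.

```python
def partial_sum(delta, p):
    # Partial sum of the expansion of (1 + delta)^(-1) from Example 1.19
    return sum((-delta) ** k for k in range(p + 1))

for delta in [0.5, 1.5]:
    target = 1.0 / (1.0 + delta)
    # errors shrink in p when delta < 1 and grow in p when delta > 1
    print(delta, [abs(partial_sum(delta, p) - target) for p in (1, 5, 10, 20)])
```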

The asymptotic expansions typically encountered in statistical applications


are based on an increasing sample size. These types of expansions are usually
of the form
f (x, n) = d0 (x) + n−1 d1 (x) + n−2 d2 (x) + · · · + n−p dp (x) + Ep (x, n),
or
f (x, n) = d0 (x) + n−1/2 d1 (x) + n−1 d2 (x) + · · · + n−p/2 dp (x) + Ep (x, n),
as n → ∞. This form of the expansions is obtained by setting δ = n−1 in
the previous expansions, where δ → 0 as n → ∞. In many applications the
function f is a density or distribution function that is being approximated
for large sample sizes, or as n → ∞. The function d0 (x) is usually a known
density or distribution function that the function f converges to as n → ∞.
In many cases d0 (x) is the standard normal density or distribution function.
Example 1.20. Let X1 , . . . , Xn be a set of independent and identically dis-
tributed random variables following a Beta(α, β) distribution, and let X̄n
denote the sample mean. The distribution of Z_n = n^{1/2}[X̄_n − µ(α, β)]/σ(α, β) is approximately N(0, 1) when n is large, where µ(α, β) = α(α + β)^{-1} and
σ(α, β) = \left[ \frac{αβ}{(α + β)^2 (α + β + 1)} \right]^{1/2}.
In fact, more precise information about the asymptotic distribution of Z_n can be obtained using an Edgeworth expansion. These expansions will be covered in great detail later, but for the current presentation assume that the density of Z_n has the asymptotic expansion
f(z) = φ(z) + (1/6) n^{-1/2} φ(z) H_3(z) κ_3(α, β) + R_2(z, n),
where
κ_3(α, β) = \frac{2(β − α)(α + β + 1)^{1/2}}{(α + β + 2)(αβ)^{1/2}},
which is the coefficient of skewness of the Beta(α, β) distribution, and H3 (z)
is the third-order Hermite polynomial from Definition 1.6 . The error term for
this expansion is not the same as given by Theorem 1.13, and its form will
not be considered at this time, other than the fact that
\lim_{n→∞} n^{1/2} R_2(z, n) = 0,

as n → ∞. Therefore, the density converges pointwise to the standard normal


density φ(z) as n → ∞. In fact, one can use the first-order term
(1/6) n^{-1/2} φ(z) H_3(z) κ_3(α, β),
to more precisely describe the shape of this density when n is large. 
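The expansion can be checked by simulation. The sketch below is only an illustration under assumed values α = 2, β = 8, and n = 10: it integrates the one-term expansion to the distribution-function scale (the term-by-term integration is carried out explicitly in Example 1.28) and compares both it and the plain normal approximation with a Monte Carlo estimate of P(Z_n ≤ z).

```python
import numpy as np
from scipy.stats import norm

# Assumed illustration values: a Beta(2, 8) population and sample size n = 10.
a, b, n, m = 2.0, 8.0, 10, 200_000
mu = a / (a + b)
sigma = (a * b / ((a + b) ** 2 * (a + b + 1))) ** 0.5
kappa3 = 2 * (b - a) * (a + b + 1) ** 0.5 / ((a + b + 2) * (a * b) ** 0.5)

rng = np.random.default_rng(0)
z_sim = np.sqrt(n) * (rng.beta(a, b, size=(m, n)).mean(axis=1) - mu) / sigma

for z in [-1.5, -0.5, 0.5, 1.5]:
    monte_carlo = np.mean(z_sim <= z)
    # One-term Edgeworth correction on the distribution-function scale.
    edgeworth = norm.cdf(z) - kappa3 * (z ** 2 - 1) * norm.pdf(z) / (6 * np.sqrt(n))
    print(z, monte_carlo, norm.cdf(z), edgeworth)
```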

The exact form of the error terms of asymptotic expansions are typically not
of interest. However, the asymptotic behavior of these errors as δ → 0 or as
n → ∞ terms is important. For example, in Section 1.4 it was argued that
when certain conditions are met, the error term E2 (x, δ) from Theorem 1.13
is dominated as δ → 0 by E1 (x, δ), when x is fixed. Because the asymptotic
behavior of the error term, and not its exact form, is important, a specific
type of notation has been developed to symbolize the limit behavior of these
sequences. Asymptotic order notation is a simple type of shorthand that indi-
cates the asymptotic behavior of a sequence with respect to another sequence.
Definition 1.7. Let {x_n}_{n=1}^{∞} and {y_n}_{n=1}^{∞} be real sequences.
1. The notation x_n = o(y_n) as n → ∞ means that
\lim_{n→∞} \frac{x_n}{y_n} = 0.
That is, the sequence {x_n}_{n=1}^{∞} is dominated by {y_n}_{n=1}^{∞} as n → ∞.
2. The notation x_n = O(y_n) as n → ∞ means that the sequence {|x_n/y_n|}_{n=1}^{∞} remains bounded as n → ∞.
3. The notation x_n ∼ y_n means that
\lim_{n→∞} \frac{x_n}{y_n} = 1,
or that the two sequences are asymptotically equivalent as n → ∞.

The O-notation is often called Bachmann-Landau notation. See Landau (1974).


The asymptotic order notation is often used to replace the error term in an
asymptotic expansion. For example, the error term of the expansion f (x) =
d0 (x) + n−1/2 d1 (x) + E1 (x, n) may have the property that E1 (x, n) = O(n−1 )
as n → ∞. In this case the expansion is often rewritten as f (x) = d0 (x) +
n−1/2 d1 (x) + O(n−1 ) as n → ∞. The term O(n−1 ) in the expansion repre-
sents an error sequence that has the property that when it is multiplied by
n, the ratio remains bounded as n → ∞. Convention usually stipulates that
the sign of the error term is always positive when using asymptotic order no-
tation, regardless of the sign of the actual sequence. It is always important
to remember that the actual error sequence usually does not have the exact
form specified by the asymptotic order notation. Rather, the notation pro-
vides a simple form for the asymptotic behavior of the sequence. Because the
order notation is asymptotic in nature, it must be qualified by the limit that
is being taken. In this book any limiting notation in the sample size n will be
understood to be qualified by the statement n → ∞, unless specifically noted
otherwise. Similarly, any limiting notation in a real value δ will be understood
to be qualified by the statement δ → 0, unless specifically noted otherwise.
For example, E(δ) = o(δ) will automatically be taken to mean
\lim_{δ→0} \frac{E(δ)}{δ} = 0,
or that E(δ) is dominated by δ as δ → 0. We will usually continue to qualify
these types of statements to emphasize their importance, but it is common in
statistical literature for this notation to be understood.
Another important note on the order notation is that the asymptotic behavior
of sequences is not unique. That is, if we suppose that the sequence an is
O(n−1 ) as n → ∞, it then follows that the sequence can also be characterized
as O(n−1/2 ) and O[(n + 1)−1 ] as n → ∞. To see why this is true, we refer
to Definition 1.7, which tells us that a_n = O(n^{-1}) as n → ∞ means that the
sequence |nan | remains bounded for all n ∈ N. Because |n1/2 an | ≤ |nan | for
all n ∈ N, it follows that the sequence |n1/2 an | remains bounded for all n ∈ N
and therefore an = O(n−1/2 ) as n → ∞. A similar argument can be used to
establish the fact that an = O[(n + 1)−1 ] as n → ∞.
Example 1.21. Consider two real sequences {a_n}_{n=1}^{∞} and {b_n}_{n=1}^{∞} defined by a_n = (n − 1)^{-1} and b_n = (n + 1)^{-1} for all n ∈ N. Note that
\lim_{n→∞} \frac{a_n}{b_n} = 1,

so that a_n ∼ b_n as n → ∞. The fact that the sequence a_n/b_n remains bounded


as n → ∞ also implies that an = O(bn ) and bn = O(an ) as n → ∞. In
practice we would usually pick the simpler form of the sequence and conclude
that an = O(n−1 ) and bn = O(n−1 ) as n → ∞. The key idea for many
error sequences is that they converge to zero as n → ∞. In these cases the
asymptotic order notation allows us to study how quickly such sequences
converge to zero. Hence, the results that an = O(n−1 ) and bn = O(n−1 ) are
usually interpreted by concluding that the sequences {an }∞ ∞
n=1 and {bn }n=1
converge to zero at the same rate as each other, and at the same rate as n−1 .
Now consider the limits
\lim_{n→∞} n^{1/2} a_n = \lim_{n→∞} n^{1/2} b_n = 0.

This indicates that a_n = o(n^{-1/2}) and b_n = o(n^{-1/2}) as n → ∞, and the conclusion is then that the sequences {a_n}_{n=1}^{∞} and {b_n}_{n=1}^{∞} converge to zero at a faster rate than n^{-1/2}. To emphasize the fact that these representations are not unique, we can also conclude that a_n = o(n^{-1/4}) and a_n = o(n^{-1/256}) as n → ∞ as well, with similar conclusions for the sequence {b_n}_{n=1}^{∞}. Note finally that any sequence that converges to zero, including the sequences {a_n}_{n=1}^{∞} and {b_n}_{n=1}^{∞}, is also o(1) as n → ∞. 
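These rates can be seen directly by tabulating the scaled sequences; a minimal sketch follows (starting at n = 2 so that a_n is defined).

```python
# a_n = (n - 1)^(-1) and b_n = (n + 1)^(-1) from Example 1.21
for n in [2, 10, 100, 1000, 10000]:
    a_n = 1.0 / (n - 1)
    b_n = 1.0 / (n + 1)
    # n * a_n stays bounded (consistent with O(1/n)); sqrt(n) * a_n tends to zero
    print(n, n * a_n, n * b_n, n ** 0.5 * a_n, n ** 0.5 * b_n)
```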

The main tool that we have encountered for deriving asymptotic expansions
is given by Theorem 1.13 (Taylor), which provided fairly specific forms of the
error terms in the expansion. These error terms can also be written in terms
of the asymptotic order notation to provide a simple asymptotic form of the
errors.
Theorem 1.15. Let f be a function that has p + 1 bounded and continuous
derivatives in the interval (x, x + δ). Then
f(x + δ) = \sum_{k=0}^{p} \frac{δ^k f^{(k)}(x)}{k!} + E_p(x, δ),

where Ep (x, δ) = O(δ p+1 ) and Ep (x, δ) = o(δ p ) as δ → 0.

Proof. We will prove that Ep (x, δ) = O(δ p+1 ) as δ → 0. The fact that
Ep (x, δ) = o(δ p ) is proven in Exercise 34. From Theorem 1.13 we have that
E_p(x, δ) = \frac{δ^{p+1} f^{(p+1)}(ξ)}{(p + 1)!},
for some ξ ∈ (x, x + δ). Hence, the sequence Ep (x, δ)/δ p+1 has the form
\frac{f^{(p+1)}(ξ)}{(p + 1)!},
which depends on δ through the value of ξ. The assumption that f has p + 1
bounded and continuous derivatives in the interval (x, x + δ) ensures that
this sequence remains bounded for all ξ ∈ (x, x + δ). Hence it follows from
Definition 1.7 that Ep (x, δ) = O(δ p+1 ) as δ → 0.
Example 1.22. Consider the asymptotic expansions developed in Example
1.17, that considered approximating the function exp(δ) as δ → 0. Theorem
1.15 can be applied to these results to conclude that exp(δ) = 1 + δ + O(δ^2), exp(δ) = 1 + δ + ½δ^2 + O(δ^3), and exp(δ) = 1 + δ + ½δ^2 + (1/6)δ^3 + O(δ^4), as
δ → 0. This allows us to easily evaluate the asymptotic properties of the error
sequences. In particular, the asymptotically most accurate approximation has
an error term that converges to zero at the same rate as δ 4 . Alternatively,
we could also apply Theorem 1.15 to these approximations to conclude that
exp(δ) = 1 + δ + o(δ), exp(δ) = 1 + δ + ½δ^2 + o(δ^2), and
exp(δ) = 1 + δ + ½δ^2 + (1/6)δ^3 + o(δ^3),
as δ → 0. Hence, the asymptotically most accurate approximation considered
here has an error term that converges to 0 at a rate faster than δ 3 . 
Example 1.23. Consider the asymptotic expansions developed in Example
1.18 that approximated the standard normal distribution function near zero.
The first and second-order approximations coincide in this case so that we
can conclude using Theorem 1.15 that Φ(δ) = ½ + δ(2π)^{-1/2} + O(δ^3) and Φ(δ) = ½ + δ(2π)^{-1/2} + o(δ^2) as δ → 0. The third order approximation has the forms Φ(δ) = ½ + δ(2π)^{-1/2} − (1/6)δ^3(2π)^{-1/2} + O(δ^4) and Φ(δ) = ½ + δ(2π)^{-1/2} − (1/6)δ^3(2π)^{-1/2} + o(δ^3) as δ → 0. 
There are other methods besides the Taylor expansion which can also be used
to generate asymptotic expansions. A particular method that is useful for
approximating integral functions of a certain form is based on Theorem A.4.
Integral functions with an exponential type form often fall into this category,
and the normal integral is a particularly interesting example.
Example 1.24. Consider the problem of approximating the tail probability
function of a standard normal distribution given by
Φ̄(z) = \int_z^{∞} φ(t)\,dt,
for large values of z, or as z → ∞. To apply integration by parts, first note that
from Definition 1.6 it follows that φ'(t) = −H_1(t)φ(t) = −tφ(t), or equivalently φ(t) = −φ'(t)/t. Therefore
Φ̄(z) = −\int_z^{∞} t^{-1} φ'(t)\,dt.
A single application of Theorem A.4 yields
−\int_z^{∞} t^{-1} φ'(t)\,dt = \lim_{t→∞} [−t^{-1} φ(t)] + z^{-1} φ(z) − \int_z^{∞} t^{-2} φ(t)\,dt
= z^{-1} φ(z) − \int_z^{∞} t^{-2} φ(t)\,dt. (1.23)

The integral on the right hand side of Equation (1.23) can be rewritten as
−\int_z^{∞} t^{-2} φ(t)\,dt = \int_z^{∞} t^{-3} φ'(t)\,dt,

which suggests an additional application of integration by parts to yield


\int_z^{∞} t^{-3} φ'(t)\,dt = −z^{-3} φ(z) + \int_z^{∞} 3t^{-4} φ(t)\,dt.

Therefore, an approximation for Φ̄(z) can be written as


Φ̄(z) = z^{-1} φ(z) − z^{-3} φ(z) + \tilde{E}_2(z),
where
\tilde{E}_2(z) = \int_z^{∞} 3t^{-4} φ(t)\,dt.
Note that the terms of this approximation are not in the form of the asymp-
totic expansion of Equation (1.20), but are in the form of Equation (1.21),
written as
Φ̄(z) = φ(z)[z^{-1} − z^{-3} + E_2(z)],
where
E_2(z) = \frac{1}{φ(z)} \int_z^{∞} 3t^{-4} φ(t)\,dt.
To evaluate the asymptotic behavior of the error term E2 (z), note that iter-
ating the process described above one more time yields
\int_z^{∞} 3t^{-4} φ(t)\,dt = 3z^{-5} φ(z) − \int_z^{∞} 15t^{-6} φ(t)\,dt,
so that
E_2(z) = 3z^{-5} − \frac{1}{φ(z)} \int_z^{∞} 15t^{-6} φ(t)\,dt.
Now

z^5 E_2(z) = 3 − \frac{z^5}{φ(z)} \int_z^{∞} 15t^{-6} φ(t)\,dt.
The first term is bounded, and noting that φ(t) is a decreasing function for
t > 0 it follows that
\frac{z^5}{φ(z)} \int_z^{∞} 15t^{-6} φ(t)\,dt ≤ \frac{z^5}{φ(z)} \int_z^{∞} 15t^{-6} φ(z)\,dt = z^5 \int_z^{∞} 15t^{-6}\,dt = 3.

Therefore z 5 E2 (z) remains bounded and positive for all z > 0, and it follows
that E2 (z) = O(z −5 ) as z → ∞. This process can be iterated further by
applying integration by parts to the error term E2 (z) which will result in
an error term that is O(z −7 ) as z → ∞. Barndorff-Nielsen and Cox (1989)
point out several interesting properties of the resulting asymptotic expansion,
including the fact that if the process is continued the resulting sequence has
alternating signs, and that each successive approximation provides a lower
or upper bound for the true value Φ̄(z). Moreover, the infinite sequence that
results from continuing the expansion indefinitely is divergent when z is fixed.
See Example 3.1 of Barndorff-Nielsen and Cox (1989) for further details. 
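The quality of the expansion is easy to examine numerically; the sketch below compares the two- and three-term versions with Φ̄(z) computed from the complementary error function in Python's standard library.

```python
import math

def normal_tail(z):
    # Φ̄(z) = P(Z > z) for a standard normal random variable
    return 0.5 * math.erfc(z / math.sqrt(2.0))

def phi(z):
    return math.exp(-z ** 2 / 2.0) / math.sqrt(2.0 * math.pi)

for z in [2.0, 3.0, 4.0, 5.0]:
    two_term = phi(z) * (1.0 / z - 1.0 / z ** 3)
    three_term = phi(z) * (1.0 / z - 1.0 / z ** 3 + 3.0 / z ** 5)
    exact = normal_tail(z)
    print(z, exact, exact - two_term, exact - three_term)
```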

The approximation of exponential type integrals can be generalized using


the Laplace expansion. See Section 3.3 of Barndorff-Nielsen and Cox (1989),
Chapter 5 of Copson (1965), Chapter 4 of De Bruijn (1958), and Erdélyi
(1956) for further details on this and other techniques for deriving asymptotic
expansions. Example 1.24 also suggests that it might be of interest to study
how asymptotic relations relate to integration and differentiation. General
results for integration can be established as shown in the theorem below.
Theorem 1.16. Suppose f (z) = O[g(z)] as z → ∞ and that
\int_z^{∞} |g(t)|\,dt < ∞,
for all z ∈ R. Then
\int_z^{∞} f(t)\,dt = O\left[ \int_z^{∞} |g(t)|\,dt \right],
as z → ∞.

Proof. The assumption that f (z) = O[g(z)] as z → ∞ implies that |f (z)/g(z)|


remains bounded as z → ∞, or that there exists a positive real number b < ∞
such that |f (z)| ≤ b|g(z)| as z → ∞. Therefore,
\left| \int_z^{∞} f(t)\,dt \right| ≤ \int_z^{∞} |f(t)|\,dt ≤ b \int_z^{∞} |g(t)|\,dt,
as z → ∞, which implies that
\frac{\left| \int_z^{∞} f(t)\,dt \right|}{\int_z^{∞} |g(t)|\,dt} ≤ b < ∞,
as z → ∞, which yields the result.

More general theorems on the relationship between integration and the asymp-
totic order notation can be developed as well. See, for example, Section 1.1 of
Erdélyi (1956). It is important to note that it is generally not permissible to
exchange a derivative and an asymptotic order relation, though some results
are possible if additional assumptions can be made. For example, the following
result is based on the development of Section 7.3 of De Bruijn (1958), which
contains a proof of the result.
Theorem 1.17. Let g be a real function that is integrable over a finite interval
and define
G(t) = \int_0^{t} g(x)\,dx.
If g is non-decreasing and G(t) ∼ (α + 1)^{-1} t^{α+1} as t → ∞, then g(t) ∼ t^α as t → ∞.
Example 1.25. Consider the function G(t) = t^3 + t^2 and note that G(t) ∼ t^3 as t → ∞. Knowing the exact form of G(t) in this case allows us to compute the derivative using direct calculations to be g(t) = 3t^2 + 2t so that g(t) ∼ 3t^2 as t → ∞. Therefore, differentiating the asymptotic rate is permissible here. In fact, if we did not know the exact form of g(t), but knew that g is a non-decreasing function, the same rate could be obtained from Theorem 1.17. Note also that G(t) = O(t^3) and g(t) = O(t^2) here as t → ∞. 
Example 1.26. Consider the function G(t) = t^{1/2} + sin(t) and note that G(t) = O(t^{1/2}) as t → ∞. Direct differentiation yields g(t) = ½t^{-1/2} + cos(t), but in this case it is not true that g(t) = O(½t^{-1/2}) as t → ∞. Indeed, note that
\frac{½t^{-1/2} + cos(t)}{t^{-1/2}} = ½ + t^{1/2} cos(t),
does not remain bounded as t → ∞. The reason that differentiation is not
applicable here is due to the cyclic nature of G(t). As t → ∞ the t1/2 term
dominates the sin(t) term in G(t), so this periodic pattern is damped out in
the limit. However, the t−1/2 term, which converges to 0, in g(t) is dominated
by the cos(t) term as t → ∞ so the periodic nature of the function results.
Note that Theorem 1.17 is not applicable in this case as g(t) is not non-decreasing. 

Other common operations, such as multiplication and addition, are permissi-


ble with order relations subject to a set of simple rules. As a motivational
example, consider asymptotic expansions in n for two functions given by
f (x) = d0 (x) + O(n−1/2 ) and g(y) = h0 (y) + O(n−1/2 ). Suppose that we
are interested in approximating the product f (x)g(y). Multiplying the two
asymptotic expansions yields an expansion of the form
f (x)g(y) = d0 (x)h0 (y)+O(n−1/2 )h0 (y)+d0 (x)O(n−1/2 )+O(n−1/2 )O(n−1/2 ).
Here, the notation O(n−1/2 )h0 (y) indicates that the error sequence from ap-
proximating f with d0 (x) that has the asymptotic property that the sequence
is O(n−1/2 ) as n → ∞, is multiplied by h0 (y). What is the asymptotic behav-
ior of the resulting sequence? This question can be answered by appealing to
Definition 1.7. Assume that h0 (y) is finite for the value of y we are interested
in and is constant with respect to n. Let E0 (x, n) be the error term from this
expansion that has the property that E0 (x, n) = O(n−1/2 ) as n → ∞. Then
it follows that the sequence n1/2 E0 (x, n)h0 (y) remains bounded for all n ∈ N
because the sequence n1/2 E0 (x, n) is guaranteed to remain bounded by Defi-
nition 1.7 and the fact that h0 (y) is finite and does not depend on n. Hence
it follows that O(n−1/2 )h0 (y) = O(n−1/2 ) as n → ∞. Since h0 (y) is finite for
all n ∈ N it follows that h0 (y) = O(1) and we have proved that if we multiply
a sequence that is O(n−1/2 ) by a sequence that is O(1) we obtain a sequence
that is O(n−1/2 ) as n → ∞. This type of behavior is generalized in Theorem
1.18.
Theorem 1.18. Let {a_n}_{n=1}^{∞}, {b_n}_{n=1}^{∞}, {c_n}_{n=1}^{∞}, and {d_n}_{n=1}^{∞} be real sequences.

1. If an = o(bn ) and cn = o(dn ) as n → ∞ then an cn = o(bn dn ) as n → ∞.


2. If an = o(bn ) and cn = O(dn ) as n → ∞ then an cn = o(bn dn ) as n → ∞.
3. If an = O(bn ) and cn = O(dn ) as n → ∞ then an cn = O(bn dn ) as n → ∞.

Proof. We will prove the first result, leaving the proofs of the remaining results
as Exercise 37. Definition 1.7 implies that

\lim_{n→∞} \frac{a_n}{b_n} = 0,
and
\lim_{n→∞} \frac{c_n}{d_n} = 0.

Therefore, from Theorem 1.4, it follows that



\lim_{n→∞} \frac{a_n c_n}{b_n d_n} = \left( \lim_{n→∞} \frac{a_n}{b_n} \right) \left( \lim_{n→∞} \frac{c_n}{d_n} \right) = 0,
and hence a_n c_n = o(b_n d_n) as n → ∞.

Theorem 1.18 essentially yields two types of results. First, one can observe the
multiplicative effect of the asymptotic behavior of the sequences. Second, one
can also observe the dominating effect of sequences that have o-type behavior
over those with O-type behavior, in that the product of a sequence with o-
type behavior with a sequence that has O-type behavior yields a sequence
with o-type behavior. The reason for this dominance comes from the fact that
the product of a bounded sequence with a sequence that converges to zero,
also converges to zero.
Returning to the discussion on the asymptotic expansion for the product of
the functions f (x)g(y), it is now clear from Theorem 1.18 that the form of the
asymptotic expansion for f (x)g(y) can be written as
f(x)g(y) = d_0(x)h_0(y) + O(n^{-1/2}) + O(n^{-1/2}) + O(n^{-1}).
The next step in simplifying this asymptotic expansion is to consider the
behavior of the sum of the sequences that are O(n−1/2 ) and O(n−1 ) as n → ∞.
Define error terms E_1(n), E_1'(n) and E_2(n) such that E_1(n) = O(n^{-1/2}), E_1'(n) = O(n^{-1/2}) and E_2(n) = O(n^{-1}) as n → ∞. Then
n^{1/2}[E_1(n) + E_1'(n) + E_2(n)] = n^{1/2} E_1(n) + n^{1/2} E_1'(n) + n^{1/2} E_2(n).
The fact that the first two sequences in the sum remain bounded for all n ∈ N follows directly from Definition 1.7. Because the third sequence in the sum is O(n^{-1}) as n → ∞ it follows that nE_2(n) remains bounded for all n ∈ N. Because |n^{1/2} E_2(n)| ≤ |nE_2(n)| for all n ∈ N, it follows that n^{1/2} E_2(n) remains bounded for all n ∈ N. Therefore E_1(n) + E_1'(n) + E_2(n) = O(n^{-1/2}) as n → ∞ and the asymptotic expansion for f(x)g(y) can be written as f(x)g(y) = d_0(x)h_0(y) + O(n^{-1/2}) as n → ∞. Is it possible that the error
sequence converges to zero at a faster rate than n−1/2 ? Such a result cannot
be found using the assumptions that are given because the error sequences
E1 (n) and E10 (n) are only guaranteed to remain bounded when multiplied by
n1/2 , and not any larger sequence in n. This type of result is generalized in
Theorem 1.19.
Theorem 1.19. Consider two real sequences {a_n}_{n=1}^{∞} and {b_n}_{n=1}^{∞} and positive real numbers k and m where k ≤ m. Then
1. If an = o(n−k ) and bn = o(n−m ) as n → ∞, then an + bn = o(n−k ) as
n → ∞.
2. If an = O(n−k ) and bn = O(n−m ) as n → ∞, then an + bn = O(n−k ) as
n → ∞.
3. If an = O(n−k ) and bn = o(n−m ) as n → ∞, then an + bn = O(n−k ) as
n → ∞.
4. If an = o(n−k ) and bn = O(n−m ) as n → ∞, then an + bn = O(n−k ) as
n → ∞.

Proof. Only the first result will be proven, leaving the proofs of the remaining
results the subject of Exercise 38. Suppose an = o(n−k ) and bn = o(n−m ) as
n → ∞ and consider the sequence nk (an + bn ). Because an = o(n−k ) it follows
that nk an → 0 as n → ∞. Similarly, nk bn → 0 as n → ∞ due to the fact that
|nk bn | ≤ |nm bn | → 0 as n → ∞. It follows that nk (an + bn ) → 0 as n → ∞
which yields the result.
Example 1.27. In Example 1.20 it was established that the density of Z_n = n^{1/2}[X̄_n − µ(α, β)]/σ(α, β), where X̄_n is the sample mean from a sample of size n from a Beta(α, β) distribution, has asymptotic expansion
f(x) = φ(x) + (1/6) n^{-1/2} φ(x) H_3(x) κ_3(α, β) + R_2(x, n).
It will be shown later that R2 (x, n) = O(n−1 ) as n → ∞. In some applications
κ3 (α, β) is not known exactly and is replaced by a sequence κ̂3 (α, β) where
κ̂3 (α, β) = κ3 (α, β) + O(n−1/2 ) as n → ∞. Theorems 1.18 and 1.19 can be
employed to yield
\hat{f}(x) = φ(x) + (1/6) n^{-1/2} φ(x) H_3(x) κ̂_3(α, β) + O(n^{-1})
= φ(x) + (1/6) n^{-1/2} φ(x) H_3(x) [κ_3(α, β) + O(n^{-1/2})] + O(n^{-1})
= φ(x) + (1/6) n^{-1/2} φ(x) H_3(x) κ_3(α, β) + O(n^{-1}), (1.24)
as n → ∞. Therefore, it is clear that replacing κ3 (α, β) with κ̂3 (α, β) does not
change the asymptotic order of the error in the asymptotic expansion. That
is |\hat{f}(x) − f(x)| = O(n^{-1}) as n → ∞, for a fixed value of x ∈ R. Note that it is not proper to conclude that \hat{f}(x) = f(x), even though both functions have asymptotic expansions of the form φ(x) + (1/6) n^{-1/2} φ(x) H_3(x) κ_3(α, β) + O(n^{-1}).
It is clear from the development given in Equation (1.24) that the error terms
of the two expansions differ, even though they are both O(n−1 ) as n → ∞. 

The final result of this section will provide a more accurate representation
of the approximation given by Stirling’s approximation to factorials given
in Theorem 1.8 by specifying the asymptotic behavior of the error of this
approximation.
Theorem 1.20. n! = n^n exp(−n)(2nπ)^{1/2}[1 + O(n^{-1})] as n → ∞.

For a proof of Theorem 1.20 see Example 3.5 of Barndorff-Nielsen and Cox
(1989). The theory of asymptotic expansions and divergent series is far more
expansive than has been presented in this brief overview. The material pre-
sented in this section is sufficient for understanding the expansion theory used
in the rest of this book. Several book length treatments of this topic can be
consulted for further information. These include Barndorff-Nielsen and Cox
(1989), Copson (1965), De Bruijn (1958), Erdélyi (1956), and Hardy (1949).
Some care must be taken when consulting some references on asymptotic ex-
pansions as many presentations are for analytic functions in the complex do-
main. The theoretical properties of asymptotic expansions for these functions
can differ greatly in some cases from those for real functions.
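As a small numerical illustration of Theorem 1.20, the sketch below computes the relative error of Stirling's approximation on the logarithmic scale and multiplies it by n; the product remains bounded (it settles near 1/12), which is consistent with the O(n^{-1}) error term.

```python
import math

for n in [5, 10, 50, 100, 1000]:
    log_factorial = math.lgamma(n + 1)                     # log(n!)
    log_stirling = n * math.log(n) - n + 0.5 * math.log(2.0 * math.pi * n)
    relative_error = 1.0 - math.exp(log_stirling - log_factorial)
    # n times the relative error stays bounded, consistent with the O(1/n) term
    print(n, relative_error, n * relative_error)
```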

1.6 Inversion of Asymptotic Expansions

Suppose that f is a function with asymptotic expansion given by

f (x, n) = d0 (x) + n−1/2 d1 (x) + n−1 d2 (x) + · · ·


+ n−p/2 dp (x) + O(n−(p+1)/2 ), (1.25)
as n → ∞, and we wish to obtain an asymptotic expansion for the in-
verse of the function with respect to x in terms of powers of n−1/2 . That
is, we wish to obtain an asymptotic expansion for a point xa,n such that
f (xa,n ) = a + O(n−(p+1)/2 ), as n → ∞. Note that from the onset we work
under the assumption that the inverse will not be exact, or that we would
be able to find xa,n such that f (xa,n ) = a exactly. This is due to the error
term whose behavior is only known asymptotically. This section will begin by
demonstrating a method that is often useful for finding such an inverse for
an asymptotic expansion of the type given in Equation (1.25). The method
can easily be adapted to asymptotic expansions that are represented in other
forms, such as powers of n−1 , δ 1/2 and δ.
To begin the process, assume that xa,n has an asymptotic expansion of the
form that matches the asymptotic expansion of f (x, n). That is, assume that

xa,n = v0 (a) + n−1/2 v1 (a) + n−1 v2 (a) + · · · + n−p/2 vp (a) + O(n−(p+1)/2 ),

as n → ∞. Now, substitute the asymptotic expansion for xa,n for x in the


asymptotic expansion for f (x, n) given in Equation (1.25), and use Theorem
1.13, and other methods, to obtain an expansion for the expression in terms
of the powers of n−1/2 . That is, find functions r0 , . . . , rp , such that

f (xa,n , n) = r0 + n−1/2 r1 + · · · + n−p/2 rp + O(n−(p+1)/2 ),

as n → ∞, where rk is a function of a, and v0 , . . . , vk , for k = 1, . . . , p. That


is rk = rk (a; v0 , . . . , vk ), for k = 1, . . . , p. Now note that f (xa , n) is also equal
to a + O(n−(p+1)/2 ) as n → ∞. Equating the coefficients of the powers of n
implies that v0 , . . . , vp should satisfy r0 (a; v0 ) = a and rk (a; v0 , . . . , vk ) = 0
for k = 1, . . . , p. Therefore, solving this system of equations will result in
expressions for v0 , . . . , vp , and an asymptotic expansion for xa,n is obtained.

As presented, this method is somewhat ad hoc, and we have not provided any
general guidelines as to when the method is applicable and what happens in
cases such as when the inverse is not unique. For the problems encountered in
this book the method is generally reliable. For a rigorous justification of the
method see De Bruijn (1958).

Example 1.28. Example 1.20 showed that the density of Zn = n1/2 (X̄n −
µ(α, β))/σ(α, β), where X̄n is computed from a sample of size n from a
Beta(α, β) distribution, has asymptotic expansion

f_n(z) = φ(z) + (1/6) n^{-1/2} H_3(z) φ(z) κ_3(α, β) + O(n^{-1}), (1.26)

as n → ∞. In this example we will consider finding an asymptotic expansion


for the quantile of this distribution. Term by term integration of the expansion
in Equation (1.26) with respect to z yields an expansion for the distribution
function of Zn given by

F_n(z) = Φ(z) − (1/6) n^{-1/2} H_2(z) φ(z) κ_3(α, β) + O(n^{-1}),

as n → ∞, where it is noted that from Definition 1.6 it follows that the integral
of −H3 (z)φ(z), which is the third derivative of the standard normal density,
is given by H2 (z)φ(z), which is the second derivative of the standard normal
density. We assume that the integration of the error term with respect to z
does not change the order of the error term. This actually follows from the
fact that the error term can be shown to be uniform in z. Denote the αth
quantile of Fn as fα,n and assume that fα,n has an asymptotic expansion of
the form fα,n = v0 (α) + n−1/2 v1 (α) + O(n−1 ), as n → ∞. To obtain v0 (α)
and v1 (α) set Fn (fα,n ) = α + O(n−1 ), which is the property that fα,n should
have to be the αth quantile of Fn up to order O(n−1 ), as n → ∞. Therefore,
it follows that Fn [v0 (α) + n−1/2 v1 (α) + O(n−1 )] = α + O(n−1 ), or equivalently

Φ[v_0(α) + n^{-1/2} v_1(α) + O(n^{-1})] − (1/6) n^{-1/2} φ[v_0(α) + n^{-1/2} v_1(α) + O(n^{-1})] × H_2[v_0(α) + n^{-1/2} v_1(α) + O(n^{-1})] κ_3(α, β) = α + O(n^{-1}), (1.27)
as n → ∞. Now expand each term in Equation (1.27) using Theorem 1.13
and the related theorems in Section 1.5. Applying Theorem 1.13 the standard
normal distribution function yields

Φ[v0 (α) + n−1/2 v1 (α) + O(n−1 )] =


Φ[v0 (α)] + [n−1/2 v1 (α) + O(n−1 )]φ[v0 (α)] + O(n−1 ) =
Φ[v0 (α)] + n−1/2 v1 (α)φ[v0 (α)] + O(n−1 ),
as n → ∞. To expand the second term in Equation (1.27), first apply Theorem
1.13 to the standard normal density. This yields
φ[v0 (α) + n−1/2 v1 (α) + O(n−1 )] = φ[v0 (α)] + O(n−1/2 ),
as n → ∞, where all terms of order n−1/2 are absorbed into the error term.
Direct evaluation of H_2[v_0(α) + n^{-1/2} v_1(α) + O(n^{-1})] using Definition 1.6 can
be used to show that
H2 [v0 (α) + n−1/2 v1 (α) + O(n−1 )] = H2 [v0 (α)] + O(n−1/2 ),
as n → ∞. Therefore, the second term in Equation (1.27) has the form −(1/6) n^{-1/2} κ_3(α, β) H_2[v_0(α)] φ[v_0(α)] + O(n^{-1}). Combining these results yields
F_n(f_{α,n}) = Φ[v_0(α)] + n^{-1/2} φ[v_0(α)]{v_1(α) − (1/6) κ_3(α, β) H_2[v_0(α)]} + O(n^{-1}),
as n → ∞. Using the notation of this section we have that r_0(α; v_0) = Φ[v_0(α)] and r_1(α; v_0, v_1) = φ[v_0(α)]{v_1(α) − (1/6) κ_3(α, β) H_2[v_0(α)]}. Setting r_0(α; v_0) = Φ[v_0(α)] = α implies that v_0(α) = z_α, the αth quantile of a N(0, 1) distribution. Similarly, setting r_1(α; v_0, v_1) = φ[v_0(α)]{v_1(α) − (1/6) κ_3(α, β) H_2[v_0(α)]} = 0 implies that v_1(α) = (1/6) κ_3(α, β) H_2[v_0(α)] = (1/6) κ_3(α, β) H_2(z_α). Therefore, an asymptotic expansion for the αth quantile of F_n is given by
f_{α,n} = v_0(α) + n^{-1/2} v_1(α) + O(n^{-1}) = z_α + n^{-1/2} (1/6) κ_3(α, β) H_2(z_α) + O(n^{-1}),
as n → ∞. 
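The corrected quantile can be checked by simulation; the following sketch, again under the assumed Beta(2, 8) population with n = 10, compares the coverage of the normal quantile z_α with that of the expansion z_α + n^{-1/2}(1/6)κ_3 H_2(z_α).

```python
import numpy as np
from statistics import NormalDist

# Assumed illustration values: a Beta(2, 8) population and sample size n = 10.
a, b, n, m = 2.0, 8.0, 10, 200_000
mu = a / (a + b)
sigma = (a * b / ((a + b) ** 2 * (a + b + 1))) ** 0.5
kappa3 = 2 * (b - a) * (a + b + 1) ** 0.5 / ((a + b + 2) * (a * b) ** 0.5)

rng = np.random.default_rng(1)
z_sim = np.sqrt(n) * (rng.beta(a, b, size=(m, n)).mean(axis=1) - mu) / sigma

for level in [0.05, 0.10, 0.90, 0.95]:
    z_a = NormalDist().inv_cdf(level)
    corrected = z_a + kappa3 * (z_a ** 2 - 1) / (6.0 * np.sqrt(n))
    # compare simulated P(Z_n <= quantile) with the nominal level
    print(level, np.mean(z_sim <= z_a), np.mean(z_sim <= corrected))
```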

It should be noted that if closed forms for the derivatives of f are known,
then it can be easier to derive an asymptotic expansion for the inverse of a
function f (x) using Theorem 1.13 or Theorem 1.15 directly. The derivatives
of the inverse of f (x) are required for this approach. The following result from
calculus can be helpful with this calculation.
Theorem 1.21. Assume that g is a strictly increasing and continuous real
function on an interval [a, b] and let h be the inverse of g. If the derivative
of g exists and is non-zero at a point x ∈ (a, b) then the derivative of h also
exists and is non-zero at the corresponding point y = g(x) and
" #−1
d d
h(y) = g(x)
.
dy dx x=h(y)

A proof of Theorem 1.21 can be found in Section 6.20 of Apostol (1967). Note
the importance of the monotonicity condition in Theorem 1.21, which ensures
that the function g has a unique inverse. Further, the restriction that g is
strictly increasing implies that the derivative of the inverse will be positive.
Example 1.29. Consider the standard normal distribution function Φ(z),
and suppose we wish to obtain an asymptotic expansion for the standard
normal quantile function Φ^{-1}(α) for values of α near ½. Theorem 1.15 implies that
Φ^{-1}(α + δ) = Φ^{-1}(α) + δ \frac{d}{dα} Φ^{-1}(α) + O(δ^2),
as δ → 0. Noting that Φ(t) is monotonically increasing and that Φ'(t) ≠ 0 for all t ∈ R, we can apply Theorem 1.21 to Φ^{-1}(α) to find that
\frac{d}{dα} Φ^{-1}(α) = \left[ \left. \frac{d}{dz} Φ(z) \right|_{z=Φ^{-1}(α)} \right]^{-1} = \frac{1}{φ(z_α)}.
Therefore a one-term asymptotic expansion for the standard normal quantile function is given by
Φ^{-1}(α + δ) = z_α + \frac{δ}{φ(z_α)} + O(δ^2),
as δ → 0. Now take α = ½ as in Example 1.18 to yield z_{½+δ} = δ(2π)^{1/2} + O(δ^2), as δ → 0. 
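This one-term expansion can be compared with the exact quantile function; a small sketch using the standard library's NormalDist follows.

```python
from statistics import NormalDist
import math

norm = NormalDist()
for delta in [0.2, 0.1, 0.05, 0.01]:
    exact = norm.inv_cdf(0.5 + delta)
    approx = delta * math.sqrt(2.0 * math.pi)
    # The error shrinks much faster than delta, consistent with the O(delta^2) term.
    print(delta, exact, approx, exact - approx)
```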
as δ → 0. 

1.7 Exercises and Experiments

1.7.1 Exercises

1. Let {xn }∞
n=1 be a sequence of real numbers defined by

−1
 n = 1 + 3(k − 1), k ∈ N,
xn = 0 n = 2 + 3(k − 1), k ∈ N,

1 n = 3 + 3(k − 1), k ∈ N.

Compute
lim inf xn ,
n→∞
and
lim sup xn .
n→∞
Determine if the limit of xn as n → ∞ exists.
2. Let {xn }∞
n=1 be a sequence of real numbers defined by
x_n = \frac{n}{n + 1} − \frac{n + 1}{n},
for all n ∈ N. Compute
lim inf xn ,
n→∞
and
lim sup xn .
n→∞
Determine if the limit of xn as n → ∞ exists.
3. Let {x_n}_{n=1}^{∞} be a sequence of real numbers defined by x_n = n^{(−1)^n} − n for all n ∈ N. Compute
all n ∈ N. Compute
lim inf xn ,
n→∞
and
lim sup xn .
n→∞
Determine if the limit of xn as n → ∞ exists.
4. Let {x_n}_{n=1}^{∞} be a sequence of real numbers defined by x_n = n2^{−n}, for all n ∈ N. Compute
lim inf xn ,
n→∞
and
lim sup xn .
n→∞
Determine if the limit of xn as n → ∞ exists.
5. Each of the sequences given below converges to zero. Specify the smallest
value of nε so that |xn | < ε for every n > nε as a function of ε.

a. xn = n−2
b. xn = n(n + 1)−1 − 1
c. xn = [log(n + 1)]−1
d. xn = 2(n2 + 1)−1

6. Let {xn }∞ ∞
n=1 and {yn }n=1 be sequences of real numbers such that

lim xn = x,
n→∞

and
lim yn = y.
n→∞

a. Prove that if c ∈ R is a constant, then


lim cxn = cx.
n→∞

b. Prove that
lim (xn + yn ) = x + y.
n→∞
c. Prove that
lim xn yn = xy.
n→∞

d. Prove that
\lim_{n→∞} \frac{x_n}{y_n} = \frac{x}{y},
where y_n ≠ 0 for all n ∈ N and y ≠ 0.

7. Let {xn }∞ ∞
n=1 and {yn }n=1 be sequences of real numbers such that xn ≤ yn
for all n ∈ N. Prove that if the limit of the two sequences exist, then
lim xn ≤ lim yn .
n→∞ n→∞

8. Let {xn }∞ ∞
n=1 and {yn }n=1 be sequences of real numbers such that

lim (xn + yn ) = s,
n→∞

and
lim (xn − yn ) = d.
n→∞
Prove that
\lim_{n→∞} x_n y_n = ¼(s^2 − d^2).

9. Find the supremum and infimum limits for each sequence given below.

a. x_n = (−1)^n (1 + n^{-1})
b. x_n = (−1)^n
c. x_n = (−1)^n n
d. x_n = n^2 sin^2(nπ/2)
e. x_n = sin(n)
f. x_n = (1 + n^{-1}) cos(nπ)
g. x_n = sin(nπ/2) cos(nπ/2)
h. x_n = (−1)^n n(1 + n)^{-n}

10. Let {xn }∞


n=1 be a sequence of real numbers.

a. Prove that
\inf_{n∈N} x_n ≤ \liminf_{n→∞} x_n ≤ \limsup_{n→∞} x_n ≤ \sup_{n∈N} x_n.
b. Prove that
\liminf_{n→∞} x_n = \limsup_{n→∞} x_n = l,
if and only if
\lim_{n→∞} x_n = l.
11. Let {xn }∞ ∞
n=1 and {yn }n=1 be a sequences of real numbers such that xn ≤ yn
for all n ∈ N. Prove that
lim inf xn ≤ lim inf yn .
n→∞ n→∞

12. Let {xn }∞ ∞


n=1 and {yn }n=1 be a sequences of real numbers such that


lim sup xn < ∞,
n→∞

and

lim sup yn < ∞,
n→∞
Then prove that
lim inf xn + lim inf yn ≤ lim inf (xn + yn ),
n→∞ n→∞ n→∞

and
lim sup(xn + yn ) ≤ lim sup xn + lim sup yn .
n→∞ n→∞ n→∞

13. Let {xn }∞ and


n=1 {yn }∞
n=1be a sequences of real numbers such that xn > 0
and yn > 0 for all n ∈ N,
0 < lim sup xn < ∞,
n→∞

and
0 < lim sup yn < ∞.
n→∞
Prove that
\limsup_{n→∞} x_n y_n ≤ \left( \limsup_{n→∞} x_n \right) \left( \limsup_{n→∞} y_n \right).

14. Let {fn (x)}∞ ∞


n=1 and {gn (x)}n=1 be sequences of real valued functions that
converge pointwise to the real functions f and g, respectively.
a. Prove that cf_n \xrightarrow{pw} cf as n → ∞ where c is any real constant.
b. Prove that f_n + c \xrightarrow{pw} f + c as n → ∞ where c is any real constant.
c. Prove that f_n + g_n \xrightarrow{pw} f + g as n → ∞.
d. Prove that f_n g_n \xrightarrow{pw} fg as n → ∞.
e. Suppose that g_n(x) > 0 and g(x) > 0 for all x ∈ R. Prove that f_n/g_n \xrightarrow{pw} f/g as n → ∞.

15. Let {fn (x)}∞ ∞


n=1 and {gn (x)}n=1 be sequences of real valued functions that
converge uniformly on R to the real functions f and g as n → ∞, respectively. Prove that f_n + g_n \xrightarrow{u} f + g on R as n → ∞.
16. Let {f_n(x)}_{n=1}^{∞} be a sequence of real functions defined by f_n(x) = ½ n δ{x; (n − n^{-1}, n + n^{-1})} for all n ∈ N.
a. Prove that
\lim_{n→∞} f_n(x) = 0
for all x ∈ R, and hence conclude that
\int_{-∞}^{∞} \lim_{n→∞} f_n(x)\,dx = 0.
b. Compute
\lim_{n→∞} \int_{-∞}^{∞} f_n(x)\,dx.
Does this match the result derived above?
c. State whether Theorem 1.11 applies to this case, and use it to explain
the results you found.

17. Let {fn (x)}∞n=1 be a sequence of real functions defined by fn (x) = (1 +


n−1 )δ{x; (0, 1)} for all n ∈ N.

a. Prove that
lim fn (x) = δ{x; (0, 1)}
n→∞
for all x ∈ R, and hence conclude that
\int_{-∞}^{∞} \lim_{n→∞} f_n(x)\,dx = 1.
b. Compute
\lim_{n→∞} \int_{-∞}^{∞} f_n(x)\,dx.
Does this match the result you found above?
c. State whether Theorem 1.11 applies to this case, and use it to explain
the results you found above.

18. Let g(x) = exp(−|x|) and define a sequence of functions {fn (x)}∞
n=1 as
fn (x) = g(x)δ{|x|; (n, ∞)}, for all n ∈ N.

a. Calculate
f (x) = lim fn (x),
n→∞
for each fixed x ∈ R.
b. Calculate
\lim_{n→∞} \int_{-∞}^{∞} f_n(x)\,dx,
and
\int_{-∞}^{∞} f(x)\,dx.
Is the exchange of the limit and the integral justified in this case? Why
or why not?
19. Define a sequence of functions {f_n(x)}_{n=1}^{∞} as f_n(x) = n^2 x(1 − x)^n for x ∈ R and for all n ∈ N.

a. Calculate
f (x) = lim fn (x),
n→∞
for each fixed x ∈ R.
b. Calculate
\lim_{n→∞} \int_{-∞}^{∞} f_n(x)\,dx,
and
\int_{-∞}^{∞} f(x)\,dx.
Is the exchange of the limit and the integral justified in this case? Why
or why not?

20. Define a sequence of functions {f_n(x)}_{n=1}^{∞} as f_n(x) = n^2 x(1 − x)^n for x ∈ [0, 1]. Determine whether
\lim_{n→∞} \int_0^1 f_n(x)\,dx = \int_0^1 \lim_{n→∞} f_n(x)\,dx.

21. Suppose that f is a quadratic polynomial. Prove that for δ ∈ R,


f(x + δ) = f(x) + δf'(x) + ½δ^2 f''(x).

22. Suppose that f is a cubic polynomial. Prove that for δ ∈ R,


f(x + δ) = f(x) + δf'(x) + ½δ^2 f''(x) + (1/6)δ^3 f'''(x).

23. Prove that if f is a polynomial of degree p then


f(x + δ) = \sum_{i=0}^{p} \frac{δ^i f^{(i)}(x)}{i!}.

24. Prove Theorem 1.13 using induction. That is, assume that
E_1(x, δ) = \int_x^{x+δ} (x + δ − t) f''(t)\,dt,
which has been shown to be true, and that
E_p(x, δ) = \frac{1}{p!} \int_x^{x+δ} (x + δ − t)^p f^{(p+1)}(t)\,dt,
and show that these imply
E_{p+1}(x, δ) = \frac{1}{(p + 1)!} \int_x^{x+δ} (x + δ − t)^{p+1} f^{(p+2)}(t)\,dt.
25. Given that Ep (x, δ) from Theorem 1.13 can be written as
E_p(x, δ) = \frac{1}{p!} \int_x^{x+δ} (x + δ − t)^p f^{(p+1)}(t)\,dt,
show that Ep (x, δ) = δ p+1 f (p+1) (ξ)/(p + 1)! for some ξ ∈ [x, x + δ].
26. Use Theorem 1.13 with p = 1, 2 and 3 to find approximations for each of
the functions listed below for small values of δ.
a. f (δ) = 1/(1 + δ)
b. f (δ) = sin2 (π/4 + δ)
c. f (δ) = log(1 + δ)
d. f (δ) = (1 + δ)/(1 − δ)
27. Prove that the pth -order Taylor expansion of a function f (x) has the same
derivatives of order 1, . . . , p as f (x). That is, show that
p

dj X δ k f (k) (x)
= f (j) (x),
dδ j k!


k=0 δ=0
for j = 1, . . . , p. What assumptions are required for this result to be true?
28. Show that by taking successive derivatives of the standard normal density
that H3 (x) = x3 − 3x, H4 (x) = x4 − 6x2 + 3 and H5 (x) = x5 − 10x3 + 15x.
29. Use Theorem 1.13 (Taylor) to find fourth and fifth order polynomials that
are approximations to the standard normal distribution function Φ(x). Is
there a difference between the approximations? What can be said in general
about two consecutive even and odd order polynomial approximations of
Φ(x)? Prove your conjecture using the results of Theorem 1.14.
30. Prove Part 1 of Theorem 1.14 using induction. That is, prove that for any
non-negative integer k,
bk/2c  
X (2i)! k k−2i
Hk (x) = (−1)i i x ,
i=0
2 i! 2i
where bk/2c is the greatest integer less than or equal to k/2. It may prove
useful to use the result of Exercise 32.
31. Prove Part 2 of Theorem 1.14. That is, prove that for any non-negative
integer k ≥ 2,
Hk (x) = xHk−1 (x) − (k − 1)Hk−2 (x).
The simplest approach is to use Definition 1.6.
32. Prove Part 3 of Theorem 1.14 using only Definition 1.6. That is, prove that
for any non-negative integer k,
d
Hk (x) = kHk−1 (x).
dx
Do not use the result of Part 1 of Theorem 1.14.
EXERCISES AND EXPERIMENTS 49
33. The Hermite polynomials are often called a set of orthogonal polynomi-
als. Consider the Hermite polynomials up to a specified order d. Let hk
be a vector in Rd whose elements correspond to the coefficients of the
Hemite polynomial Hk (x). That is, for example, h01 = (1, 0, 0, 0 · · · 0), h02 =
(0, 1, 0, 0 · · · 0), and h03 = (−1, 0, 1, 0 · · · 0). Then the polynomials Hi (x) and
Hj (x) are said to be orthogonal if h0i hj = 0. Show that the first six Hermite
polynomials are all orthogonal to one another.
34. In Theorem 1.15 prove that Ep (x, δ) = o(δ p ), as δ → 0.
35. Consider approximating the normal tail integral
Z ∞
Φ̄(z) = φ(t)dt,
z

for large values of z using integration by parts as discussed in Example


1.24. Use repeated integration by parts to show that
Φ̄(z) = z −1 φ(z) − z −3 φ(z) + 3z −5 φ(z) − 15z −7 φ(z) + O(z −9 ),
as z → ∞.
36. Using integration by parts, show that the exponential integral
Z ∞
t−1 e−t dt,
z

has asymptotic expansion


z −1 e−z − z −2 e−z + 2z −3 e−z − 6z −4 e−z + O(z −5 ),
as z → ∞.
37. Prove the second and third results of Theorem 1.18. That is, let {an }∞
n=1 ,
{bn }∞ ∞ ∞
n=1 , {cn }n=1 , and {dn }n=1 be real sequences.

a. Prove that if an = o(bn ) and cn = O(dn ) as n → ∞ then an bn = o(cn dn )


as n → ∞.
b. Prove that if an = O(bn ) and cn = O(dn ) as n → ∞ then an bn =
O(cn dn ) as n → ∞.

38. Prove the remaining three results of Theorem 1.19. That is, consider two
real sequences {an }∞ ∞
n=1 and {bn }n=1 and positive integers k and m where
k ≤ m. Then
a. Suppose an = O(n−k ) and bn = O(n−m ) as n → ∞. Then prove that
an + bn = O(n−k ) as n → ∞.
b. Suppose an = O(n−k ) and bn = o(n−m ) as n → ∞. Then prove that
an + bn = O(n−k ) as n → ∞.
c. Suppose an = o(n−k ) and bn = O(n−m ) as n → ∞. Then prove that
an + bn = O(n−k ) as n → ∞.
50 SEQUENCES OF REAL NUMBERS AND FUNCTIONS
39. For each specified pair of functions G(t) and g(t), determine the value of α
and c so that G(t)  ctα−1 as t → ∞ and determine if there is a function
g(t)  dtα for some d as t → ∞ where c and d are real constants. State
whether Theorem 1.17 is applicable in each case.

a. G(t) = 2t4 + t
b. G(t) = t + t−1
c. G(t) = t2 + cos(t)
d. G(t) = t1/2 + cos(t)

40. Consider a real function f that can be approximated with the asymptotic
expansion
fn (x) = πx + 21 n−1/2 π 2 x1/2 − 13 n−1 π 3 x1/4 + O(n−3/2 ),
as n → ∞, uniformly in x, where x is assumed to be positive. Use the first
method demonstrated in Section 1.6 to find an asymptotic expansion with
error O(n−3/2 ) as n → ∞ for xa where f (xa ) = a + O(n−3/2 ) as n → ∞.
41. Consider the problem of approximating the function sin(x) and its inverse
for values of x near 0.

a. Using Theorem 1.15 show that sin(δ) = δ − 61 δ 3 + O(δ 4 ) as δ → 0.


b. Using Theorems 1.15 and the known derivatives of the inverse sine func-
tion show that sin−1 (δ) = δ + 61 δ 3 + O(δ 4 ) as δ → 0.
c. Recompute the first term of the expansion found in Part (b) using The-
orem 1.21. Do they match? What restrictions on x are required to apply
Theorem 1.21?

1.7.2 Experiments

1. Refer to the three approximations derived for each of the four functions in
Exercise 26. For each function use R to construct a line plot of the function,
along with the three approximations versus δ on a single plot. The lines
corresponding to each approximation and the original function should be
different, that is, the plots should look like the one given in Figure 1.3.
You may need to try several ranges of δ to find one that provides a good
indication of the behavior of each approximation. What do these plots
suggest about the errors of the three approximations?
2. Refer to the three approximations derived for each of the four functions in
Exercise 26. For each function use R to construct a line plot of the error
terms E1 (x, δ), E2 (x, δ) and E3 (x, δ) versus δ on a single plot. The lines
corresponding to each error function should be different so that the plots
should look like the one given in Figure 1.4. What do these plots suggest
about the errors of the three approximations?
EXERCISES AND EXPERIMENTS 51
3. Refer to the three approximations derived for each of the four functions in
Exercise 26. For each function use R to construct a line plot of the error
terms E2 (x, δ) and E3 (x, δ) relative to the error term E1 (x, δ). That is, for
each function, plot E2 (x, δ)/E1 (x, δ) and E3 (x, δ)/E1 (x, δ) versus δ. The
lines corresponding to each relative error function should be different. What
do these plots suggest about the relative error rates?
4. Consider the approximation for the normal tail integral Φ̄(z) studied in
Example 1.24 given by
Φ̄(z) ' z −1 φ(z)(1 − z −2 + 3z −4 − 15z −6 + 105z −8 ).
A slight rearrangement of the approximation implies that
z Φ̄(z)
' 1 − z −2 + 3z −4 − 15z −6 + 105z −8 .
φ(z)
Define S1 (z) = 1 − z −2 , S2 (z) = 1 − z −2 + 3z −4 , S3 (z) = 1 − z −2 + 3z −4 −
15z −6 and S4 (z) = 1 − z −2 + 3z −4 − 15z −6 + 105z −8 , which are the succes-
sive approximations of z Φ̄(z)/φ(z). Using R, compute z Φ̄(z)/φ(z), S1 (z),
S2 (z), S3 (z), and S4 (z) for z = 1, . . . , 10. Comment on which approxima-
tion performs best for each value of z and whether the approximations
become better as z becomes larger.
CHAPTER 2

Random Variables and Characteristic


Functions

Self-control means wanting to be effective at some random point in the infinite


radiations of my spiritual existence.
Franz Kafka

2.1 Introduction

This chapter begins with a short review of probability measures and random
variables. A sound formal understanding of random variables is crucial to have
a complete understanding of much of the asymptotic theory that follows. In-
equalities are also very useful in asymptotic theory, and the second section
of this chapter reviews several basic inequalities for both probabilities and
expectations, as well as some more advanced results that will have specific
applications later in the book. The next section develops some limit theory
that is useful for working with probabilities of sequences of events, including
the Borel-Cantelli lemmas. We conclude the chapter with a review of moment
generating functions, characteristic functions, and cumulant generating func-
tions. Moment generating functions and characteristic functions are often a
useful surrogate for distributions themselves. While the moment generating
function may be familiar to many readers, the characteristic function may
not, due to the need for some complex analysis. However, the extra effort re-
quired to use the characteristic function is worthwhile as many of the results
presented later in the book are more useful when derived using characteristic
functions.

2.2 Probability Measures and Random Variables

Consider an experiment, an action that selects a point from a set Ω called a


sample space. The point that is selected is called the outcome of the experi-
ment. In probability and statistics the focus is usually on random or stochastic
experiments where the point that is selected cannot be predicted with abso-
lute certainty, except in very specialized cases. Subsets of the sample space

53
54 RANDOM VARIABLES AND CHARACTERISTIC FUNCTIONS
are called events, and are taken to be members of a collection of subsets of Ω
called a σ-field.
Definition 2.1. Let Ω be a set, then F is a σ-field of subsets of Ω if F has
the following properties:

1. ∅ ∈ F and Ω ∈ F.
2. A ∈ F implies that Ac ∈ F.
3. If Ai ∈ F for i ∈ N then

[ ∞
\
Ai ∈ F and Ai ∈ F.
i=1 i=1

In some cases a σ-field will be generated from a sample space Ω. This σ-field
is the smallest σ-field that contains the events in Ω. The term smallest in
this case means that this σ-field is a subset of any other σ-field that contains
the events in Ω. For further information about σ-fields, their generators, and
their use in probability theory see Section 1.2 of Gut (2005) or Section 2.1 of
Pollard (2002).
The experiment selects an outcome in Ω according to a probability measure
P , which is a set function that maps F to R.
Definition 2.2. Let Ω be a sample space and F be a σ-algebra of subsets of
Ω. A set function P : F → R is a probability measure if P satisfies the axioms
of probability set forth by Kolmogorov (1933). The axioms are:

Axiom 1: P (A) ≥ 0 for every A ∈ F.


Axiom 2: P (Ω) = 1.
i=1 is a sequence of mutually exclusive events in F, then
Axiom 3: If {Ai }∞
∞ ∞
!
[ X
P Ai = P (Ai ).
i=1 i=1

The term mutually exclusive refers to the property that Ai and Aj are disjoint,
or that Ai ∩ Aj = ∅ for all i 6= j.

From Definition 2.2 it is clear that there are three elements that are required
to assign a set of probabilities to outcomes from an experiment: the sample
space Ω, which identifies the possible outcomes of the experiment; the σ-field
F, which identifies which events in Ω that the probability measure is able
to compute probabilities for; and the probability measure P , which assigns
probabilities to the events in F. These elements are often collected together
in a triple (Ω, F, P ), called a probability space.
When the sample space of an experiment is R, or an interval subset of R, the
σ-field used to define the probability space is usually generated from the open
subsets of the sample space.
PROBABILITY MEASURES AND RANDOM VARIABLES 55
Definition 2.3. Let Ω be a sample space. The Borel σ-field corresponding to
Ω is the σ-field generated by the collection of open subsets of Ω. The Borel
σ-field generated by Ω is denoted by B{Ω}.

In the case where Ω = R, it can be shown that B{R} can be generated from
simpler collections of events. In particular, B{R} can be generated from the
collection of intervals {(−∞, b] : b ∈ R}, with a similar result when Ω is a
subset of R such as Ω = [0, 1]. Other simple collections of intervals can also
be used to generate B{R}. See Section 3.3 of Gut (2005) for further details.
The main purpose of this section is to introduce the concept of a random
variable. Random variables provide a convenient way of referring to events
within a sample space that often have simple interpretations with regard to
the underlying experiment. Intuitively, random variables are often thought of
as mathematical variables that are subject to random behavior. This informal
way of thinking about random variables may be helpful to understand cer-
tain concepts in probability theory, but a true understanding, especially with
regard to statistical limit theorems, comes from the formal mathematical def-
inition below.
Definition 2.4. Let (Ω, F, P ) be a probability space, X be a function that
maps Ω to R, and B be a σ-algebra of subsets of R. The function X is a
random variable if X −1 (B) = {ω ∈ Ω : X(ω) ∈ B} ∈ F, for all B ∈ B.

Note that according to Definition 2.4, there is actually nothing random about
a random variable. When the experiment is performed an element of Ω is
chosen at random according to the probability measure P . The role of the
random variable is to map this outcome to the real line. Therefore, the output
of the random variable is random, but the mapping itself is not. The restriction
that X −1 (B) = {ω ∈ Ω : X(ω) ∈ B}, for all B ∈ B assures that probability
of the inverse mapping can be calculated.
Events written in terms of random variables are interpreted by selecting out-
comes from the sample space Ω that satisfy the event. That is, if A ∈ B then
the event {X ∈ A} is equivalent to the event that consists of all outcomes
ω ∈ Ω such that X(ω) ∈ A. This allows for the computation of probabilities
of events written in terms of random variables. That is, P (X ∈ A) = P (ω :
X(ω) ∈ A), where it is assumed that the event will be empty when A is not
a subset of the range of the function X. Random variables need not be one-
to-one functions, but in the case where X is a one-to-one function and a ∈ R
the computation simplifies to P (X = a) = P [X −1 (a)].
Example 2.1. Consider the simple experiment where a fair coin is flipped
three times, and the sequence of flips is observed. The elements of the sample
space will be represented by triplets containing the symbols Hi , signifying
that the ith flip is heads, and Ti , signifying that the ith flip is tails. The
order of the symbols in the triplet signify the order in which the outcomes
are observed. For example, the event H1 T2 H3 corresponds to the event that
56 RANDOM VARIABLES AND CHARACTERISTIC FUNCTIONS
heads was observed first, then tails, then heads again. The sample space for
this experiment is given by

Ω = {T1 T2 T3 , T1 T2 H3 , T1 H2 T3 , H1 T2 T3 ,
H1 H2 T3 , H1 T2 H3 , T1 H2 H3 , H1 H2 H3 }.
Because the coin is fair, the probability measure P on this sample space is
uniform so that each outcome has a probability of 18 . A suitable σ-field for Ω
is given by the power set of Ω. Now consider a random variable X defined as


 0 if ω = {T1 T2 T3 },

1 if ω ∈ {T T H , T H T , H T T },
1 2 3 1 2 3 1 2 3
X(ω) =


 2 if ω ∈ {H H
1 2 3T , H T H
1 2 3 , T 1 H2 H3 },
3 if ω = {H1 H2 H3 }.

Hence, the random variable X counts the number of heads in the three flips
of the coin. For example, the event that two heads are observed in the three
flips of the coin can be represented by the event {X = 2}. The probability
of this event is computed by considering all of the outcomes in the original
sample space that satisfy this event. That is
P (X = 2) = P [ω ∈ Ω : X(ω) = 2] = P (H1 H2 T3 , H1 T2 H3 , T1 H2 H3 ) = 38 .
Because the inverse image of each possible value of X is in F, we are always
able to compute the probability of the corresponding inverse image. 
Example 2.2. Consider a countable sample space of the form
Ω = {H1 , T1 H2 , T1 T2 H3 , T1 T2 T3 H4 , . . . , },
that corresponds to the experiment of flipping a coin repeatedly until the
first heads is observed where the notation of Example 2.1 has been used to
represent the possible outcomes of the experiment. If the coin is fair then the
probability measure P defined as

1
2,
 ω = {H1 },
P (ω) = 2−k , ω = {T1 T2 · · · Tk−1 Hk },

0 otherwise,

is appropriate for the described experiment with the σ-algebra F defined to be


the power set of Ω. A useful random variable to define for this experiment is
to set X(H1 ) = 1 and X(ω) = k when ω = {T1 T2 · · · Tk−1 Hk }, which counts
the number of flips until the first head is observed. 
Example 2.3. Consider the probability space (Ω, F, P ) where Ω = (0, 1], F
is the Borel sets on the interval (0, 1], and P is Lebesgue measure on (0, 1].
The measure P assigns an interval that is contained in (0, 1] of length l the
probability l. Probabilities of other events are obtained using the three axioms
of Definition 2.2. Define a random variable X(ω) = − log(ω) for all ω ∈ (0, 1]
so that the range of X is the positive real line. Let 0 < a < b < ∞ and
PROBABILITY MEASURES AND RANDOM VARIABLES 57
consider computing the probability that X ∈ (a, b). Working with the inverse
image, we find that
P [X ∈ (a, b)] = P [ω : X(ω) ∈ (a, b)]
= P [ω : a < − log(ω) < b]
= P [ω : exp(−b) < ω < exp(−a)]
= exp(−b) − exp(−a),
where we have used the monotonicity of the log function. This results in an
Exponential(1) random variable. 
Note that the inverse image of an event written in terms of a random variable
may be empty if there is not a non-empty event in F that is mapped to the
event in question. This does not present a problem as ∅ is guaranteed to be a
member of F by Definition 2.1. Definition 2.2 then implies that the probability
of the corresponding event is zero.
Functions of random variables can also be shown to be random variables them-
selves as long as the function has certain properties. According to Definition
2.4, a random variable is a function that maps the sample space ω to the real
line in such a way that the inverse image of a Borel set is in F. Suppose that X
is a random variable and g is a real function. It is clear that the composition
g[X(ω)] maps ω to R. But for g(X) to be a random variable we also require
that
[g(X)]−1 (B) = {ω ∈ Ω : g[X(ω)] ∈ B} ∈ F,
for all B ∈ B{R}. This can be guaranteed by requiring that the inverse image
from g of a Borel set will always be a Borel set. That is, we require that
g −1 (B) = {b ∈ R : g(b) ∈ B} ∈ B{R},
for all B ∈ B{R}. We will call such functions Borel functions. This develop-
ment yields the following result.
Theorem 2.1. Let X be a random variable and g be a Borel function. Then
g(X) is a random variable.
The development of random vectors follows closely the development of random
variables. Let (Ω, F, P ) be a probability space and let X1 , . . . , Xd be a set
of random variables defined on (Ω, F, P ). We can construct a d-dimensional
vector X by putting X1 , . . . , Xd into a d × 1 array as X = (X1 , . . . , Xd )0 .
What does this construction represent? According to Definition 2.4, each Xi
is a function that maps Ω to R in such a way that X −1 (B) = {ω ∈ Ω : X(ω) ∈
B} ∈ F for all B ∈ B{R}. In the case of X, a point ω ∈ Ω will be mapped
to X(ω) = [X1 (ω), . . . , Xd (ω)]0 ∈ Rd . Therefore, X is a function that maps
Ω to Rd . However, in order for X to be a random vector we must be assured
that we can compute the probability of events that are written in terms of
the random vector. That is, we need to be able to compute the probability of
the inverse mapping of any reasonable subset of Rd . Reasonable subsets will
be those that come from a σ-field on Rd .
58 RANDOM VARIABLES AND CHARACTERISTIC FUNCTIONS
Definition 2.5. Let (Ω, F, P ) be a probability space, X be a function that
maps Ω to Rd , and Bd be a σ-algebra of subsets of Rd . The function X is a
d-dimensional random vector if X−1 (B) = {ω ∈ Ω : X(ω) ∈ B} ∈ F, for all
B ∈ Bd .

The usual σ-field used for random vectors is the Borel sets on Rd , denoted by
B{Rd }.
Example 2.4. Consider a probability space (Ω, F, P ) where Ω = (0, 1)×(0, 1)
is the unit square and P is a bivariate extension of Lebesgue measure. That
is, if R is a rectangle of the form

R = {(ω1 , ω2 ) : ω 1 ≤ ω1 ≤ ω 1 , ω 2 ≤ ω1 ≤ ω 2 },

where ω 1 ≤ ω 1 and ω 2 ≤ ω 2 then P (R) = (ω 1 − ω 1 )(ω 2 − ω 1 ), which corre-


sponds to the area of the rectangle. Let F = B{(0, 1) × (0, 1)} and in general
define P (B) to be the area of B for any B ∈ B{(0, 1) × (0, 1)}. It is shown in
Exercise 3 that this is a probability measure. A simple bivariate random vari-
able defined on (Ω, F, P ) is X(ω) = (ω1 , ω2 )0 which would return the ordered
pair from the sample space. Another choice, X(ω) = [− log(ω1 ), − log(ω2 )]0
yields a bivariate Exponential random vector with independent Exponen-
tial(1) marginal distributions. 

As with the case of random variables there is the technical question as to


whether a function of a random vector is also a random vector. The develop-
ment leading to Theorem 2.1 can be extended to random vectors by consid-
ering a function g : Rd → Rq that has the property that

g −1 (B) = {b ∈ Rd : g(b) ∈ B} ∈ B{Rd },

for all B ∈ B{Rq }. As with the univariate case, we shall call such a function
as simply a Borel function, and we get a parallel result to Theorem 2.1.
Theorem 2.2. Let X be a d-dimensional random vector and g : Rd → Rq be
a Borel function. Then g(X) is a q-dimensional random vector.

If X0 = (X1 , . . . , Xd ) is a random vector in Rb then some examples of functions


of X that are random variables include min{X1 , . . . , Xd } and max{X1 , . . . , Xd },
as well as each of the components of X. Further, it follows that if {Xn }∞ n=1 is
a sequence of random variables the

inf Xn ,
n∈N

and
sup Xn ,
n∈N

are also random variables. See Section 2.1.1 of Gut (2005) for further details.
SOME IMPORTANT INEQUALITIES 59
2.3 Some Important Inequalities

In many applications in asymptotic theory, inequalities provide useful upper


or lower bounds on probabilities and expectations. These bounds are often
helpful in proving important properties. This section briefly reviews many of
the inequalities that are used later in the book. We begin with some basic
inequalities that are based on probability measures.
Theorem 2.3. Let A and B be events in a probability space (Ω, F, P ). Then

1. If A ⊂ B then P (A) ≤ P (B).


2. P (A ∩ B) ≤ P (A)

Proof. Note that if A ⊂ B then B can be partitioned into two mutually


exclusive events as B = A ∪ (Ac ∩ B). Axiom 3 of Definition 2.2 then implies
that P (B) = P (A) + P (Ac ∩ B). Axiom 1 of Definition 2.2 then yields the
result by noting that P (Ac ∩ AB) ≥ 0. The second property follows from the
first by noting that A ∩ B ⊂ A.

When events are not necessarily mutually exclusive, the Bonferroni Inequality
is useful for obtaining an upper bound on the probability of the union of the
events. In the special case were the events are mutually exclusive, Axiom 3 of
Definition 2.2 applies and an equality results.
Theorem 2.4 (Bonferroni). Let {Ai }ni=1 be a sequence of events from a prob-
ability space (Ω, F, P ). Then
n
! n
[ X
P Ai ≤ P (Ai ).
i=1 i=1

Theorem 2.4 is proven in Exercise 4.


The moments of random variables play an important role in obtaining bounds
for probabilities. The first result given below is rather simple, but useful, and
can be proven by noting that the variance of a random variable is always
non-negative.
Theorem 2.5. Consider a random variable X where V (X) < ∞. Then
V (X) ≤ E(X 2 ).

Markov’s Theorem is a general result that places a bound on the tail proba-
bilities of a random variable using the fact that a certain set of moments of
the random variable are finite. Essentially the result states that only so much
probability can be in the tails of the distribution of a random variable X when
E(|X|r ) < ∞ for some r > 0.
Theorem 2.6 (Markov). Consider a random variable X where E(|X|r ) < ∞
for some r > 0 and let δ > 0. Then P (|X| > δ) ≤ δ −r E(|X|r ).
60 RANDOM VARIABLES AND CHARACTERISTIC FUNCTIONS
Proof. Assume for simplicity that X is a continuous random variable with
distribution function F . If δ > 0 then
Z ∞
r
E(|X| ) = |x|r dF (x)
−∞
Z Z
r
= |x| dF (x) + |x|r dF (x)
{x:|x|≤δ} {x:|x|>δ}
Z
≥ |x|r dF (x),
{x:|x|>δ}

since Z
|x|r dF (x) ≥ 0.
{x:|x|≤δ}
Note that |x|r ≥ δ r within the set {x : |x| > δ}, so that
Z Z
|x|r dF (x) ≥ δ r dF (x) = δ r P (|X| > δ).
{x:|x|>δ} {x:|x|>δ}

Therefore E(|X|r ) ≥ δ r P (|X| > δ) and the result follows.


Tchebysheff ’s Theorem is an important special case of Theorem 2.6 that re-
states the tail probability in terms of distance from the mean when the vari-
ance of the random variable is finite.
Theorem 2.7 (Tchebysheff). Consider a random variable X with E(X) = µ
and V (X) = σ 2 < ∞. Then P (|X − µ| > δ) ≤ δ −2 σ 2 .
Theorem 2.7 is proven in Exercise 6. Several inequalities for expectations are
also used in this book. This first result is a direct consequence of Theorems
A.16 and A.18.
Theorem 2.8. Let 0 < r ≤ 1 and suppose that X and Y are random variables
such that E(|X|r ) < ∞ and E(|Y |r ) < ∞. Then E(|X + Y |r ) ≤ E(|X|r ) +
E(|Y |r ).
Minkowski’s Inequality provides a similar result to Theorem 2.8, expect that
the power r now exceeds one.
Theorem 2.9 (Minkowski). Let X1 , . . . , Xn be random variables such that
E(|Xi |r ) < ∞ for i = 1, . . . , n. Then for r ≥ 1,
X r 1/r X
" n !# n
E Xi ≤ [E(|Xi |r )]1/r .


i=1 i=1

Note that Theorem A.18 is a special case of Theorems 2.8 and 2.9 when r is
take to be one. Theorem 2.9 can be proven using Theorem A.18 and Hölder’s
Inequality, given below.
Theorem 2.10 (Hölder). Let X and Y be random variables such that E|X|p <
∞ and E|Y |q < ∞ where p and q are real numbers such that p−1 + q −1 = 1.
Then
1/p 1/q
E(|XY |) ≤ [E(|X|p )] [E(|Y |q )] .
SOME IMPORTANT INEQUALITIES 61

Figure 2.1 An example of a convex function.

x
z

For proofs of Theorems 2.9 and 2.10, see Section 3.2 of Gut (2005).
A more general result is Jensen’s Inequality, which is based on the properties
of convex functions.
Definition 2.6. Let f be a real function such that f [λx + (1 − λ)y] ≤ λf (x) +
(1 − λ)f (y) for all x ∈ R, y ∈ R and λ ∈ (0, 1), then f is a convex function.
If the inequality is strict then the function is strictly convex. If the function
−f (x) is convex, then the function f (x) is concave.

Note that if λ = 0 then λf (x) + (1 − λ)f (y) = f (y), and if λ = 1 it follows


that λf (x) + (1 − λ)f (y) = f (x). Noting that λf (x) + (1 − λ)f (y) = λ[f (x) −
f (y)] + f (y) we see that when x and y are fixed, λf (x) + (1 − λ)f (y) = f (y)
is a linear function of λ. Therefore, λf (x) + (1 − λ)f (y) = f (y) represents the
line that connects f (x) to f (y). Hence, convex functions are ones such that
f (z) is always below the line connecting f (x) and f (y) for all real numbers
x, y, and z such that x < z < y. Figure 2.1 gives an example of a convex
function while Figure 2.2 gives an example of a function that is not convex.
This property of convex functions allows us to develop the following inequality
for expectations of convex functions of a random variable.
Theorem 2.11 (Jensen). Let X be a random variable and let g be a convex
function, then E[g(X)] ≥ g[E(X)]. If g is strictly convex then E[g(X)] >
g[E(X)].

For a proof of Theorem 2.11, see Section 5.3 of Fristedt and Gray (1997).
62 RANDOM VARIABLES AND CHARACTERISTIC FUNCTIONS

Figure 2.2 An example of function that is not convex.

Example 2.5. Let X be a random variable such that E(|X|r ) < ∞, for
some r > 0. Let s be a real number such that 0 < s < r. Since r/s > 1
it follows that f (x) = xr/s is a convex function and therefore Theorem 2.11
implies that [E(|X|s )]r/s ≤ E[(|X|s )r/s ] = E(|X|r ) < ∞, so that it follows
that E(|X|s ) < ∞. This establishes the fact that the existence of higher order
moments implies the existence of lower order moments. 

The following result establishes a bound for the absolute expectation of the
sum of truncated random variables. Truncation is often a useful tool that can
be applied when some of the moments of a random variable do not exist. If the
random variables are truncated at some finite value, then all of the moments
of the truncated random variables must exist, and tools such as Theorem 2.7
can be used.
Theorem 2.12. Let X1 , . . . , Xn be a set of independent and identically dis-
tributed random variables such that E(X1 ) = 0. Then, for any ε > 0,
n
!
X
E Xi δ{|Xi |; [0, ε]} ≤ nE(|X1 |δ{|X1 |; (ε, ∞)}).


i=1

Proof. Note that for all ω ∈ Ω and i ∈ {1, . . . , n} we have that


Xi (ω) = Xi (ω)δ{|Xi (ω)|; [0, ε]} + Xi (ω)δ{|Xi (ω)|; (ε, ∞)},
so that
E(X) = E(Xi δ{|Xi |; [0, ε]}) + E(Xi δ{|Xi |; (ε, ∞)}).
SOME IMPORTANT INEQUALITIES 63
Because E(Xi ) = 0 for all i ∈ {1, . . . , n} we have that
E(Xi δ{|Xi |; [0, ε]}) = −E(Xi δ{|Xi |; (ε, ∞)}). (2.1)
Now
!
n
X Xn
E Xi δ{|Xi |; [0, ε]} = E(Xi δ{|Xi |; [0, ε]})


i=1 i=1
n
X
≤ |E(Xi δ{|Xi |; [0, ε]})|
i=1
Xn
≤ |E(Xi δ{|Xi |; (ε, ∞)})|
i=1
Xn
≤ E(|Xi |δ{|Xi |; (ε, ∞)})
i=1
= nE(|X1 |δ{|X1 |; (ε, ∞)}),
where the first inequality follows from Theorem A.18, the second inequality
follows from Equation (2.1), and the third inequality follows from Theorem
A.6.

The following inequality is an analog of Theorem 2.7 for the case when the
random variable of interest is a sum or an average. While the bound has
the same form as Theorem 2.7, the event in the probability concerns a much
stronger event in terms of the maximal value of the random variable. The
result is usually referred to as Kolmogorov’s Maximal Inequality.
Theorem 2.13 (Kolmogorov). Let {Xn }∞ n=1 be a sequence of independent
random variables where E(Xn ) = 0 and V (Xn ) < ∞ for all n ∈ N. Let
n
X
Sn = Xi .
i=1

Then, for any ε > 0,


 
P max |Si | > ε ≤ ε−2 V (Sn ).
i∈{1,...,n}

Proof. The proof provided here is fairly standard for this result, though this
particular version runs most closely along what is shown in Gut (2005) and
Sen and Singer (1993). Let ε > 0 and define S0 ≡ 0 and define a sequence of
events {Ai }ni=0 as
A0 = {|Si | ≤ ε; i ∈ {0, 1, . . . , n}},
and
Ai = {|Sk | ≤ ε; k ∈ {0, 1, . . . , i − 1}} ∩ {|Si | > ε},
for i ∈ {1, . . . , n}. The essential idea to deriving the bound is based on the
64 RANDOM VARIABLES AND CHARACTERISTIC FUNCTIONS
fact that
  [n
max |Si | > ε = Ai , (2.2)
i∈{1,...,n}
i=1

where the events in the sequence {Ai }ni=0 are mutually exclusive. We now
consider Sn to be a random variable that maps some sample space Ω to R.
Equation (2.2) implies that
n
[
Ai ⊂ Ω,
i=1

so that
( n
) n
[ X
Sn2 (ω) ≥ Sn2 (ω)δ ω; Ai = Sn2 (ω)δ{ω; Ai },
i=1 i=1

for all ω ∈ Ω, where we have used the fact that the events in the sequence
{Ai }ni=0 are mutually exclusive. Therefore, Theorem A.16 implies that
n
X
E(Sn2 ) ≥ E(Sn2 δ {Ai }),
i=1

where we have suppressed the ω argument in the indicator function. Note that
we can write Sn as (Sn − Si ) + Si and therefore Sn2 = Si2 + 2Si (Sn − Si ) +
(Sn − Si )2 . Hence
n
X
E(Sn2 ) ≥ E{[Si2 + 2Si (Sn − Si ) + (Sn − Si )2 ]δ {Ai }}
i=1
n
X
≥ E{[Si2 + 2Si (Sn − Si )]δ {Ai }}
i=1
n
X n
X
= E(Si2 δ {Ai }) + 2 E[Si (Sn − Si )δ {Ai }],
i=1 i=1

where the second inequality is due to the fact that [Sn (ω) − Si (ω)]2 ≥ 0 for
all ω ∈ Ω. Now note that
n
X
Sn − Si = Xk ,
k=i+1

is independent of Si δ {Ai } because the event Ai and the sum Si depend only
on X1 , . . . , Xi . Therefore

E[Si (Sn − Si )δ {Ai }] = E[(Sn − Si )]E(Si δ {Ai }) =


n
!
X
E Xk E(Si δ {Ai }) = 0,
k=i+1
SOME LIMIT THEORY FOR EVENTS 65
since E(Xi ) = 0 for all i ∈ N. Therefore,
n
X n
X
E(Sn2 ) ≥ E(Si2 δ {Ai }) ≥ ε2 E(δ{Ai }). (2.3)
i=1 i=1

To obtain the second inequality in Equation (2.3), note that when ω ∈ Ai


then |Si | > ε, and therefore Si2 (ω)δ {ω; Ai } ≥ ε2 . When ω ∈ / Ai we have that
Si2 (ω)δ {ω; Ai } = 0. It follows that Si2 (ω)δ{ω; Ai } ≥ ε2 δ{ω; Ai }, for all ω ∈ Ω
and hence Theorem A.16 implies that E(Si2 δ{Ai }) ≥ ε2 E(δ{Ai }). Now note
that
n n n
!  
X X [
2 2 2
ε E(δ{Ai }) = ε P (Ai ) = ε P Ai = ε2 P max |Si | > ε .
i∈{1,...,n}
i=1 i=1 i=1

Therefore, we have established that


 
2 2
E(Sn ) ≥ ε P max |Si | > ε ,
i∈{1,...,n}

which yields the final result.

The exchange of a limit and an expectation is equivalent to the problem


of exchanging a limit and an integral or sum. Therefore, concepts such as
monotone or dominated convergence play a role in determining when such an
exchange can take place. A related result is Fatou’s Lemma, which provides
inequalities concerning such exchanges for sequences of random variables. The
result is also related to the problem of whether the convergence of a sequence
of random variables implies the convergence of the corresponding moments,
which we shall consider in detail in Chapter 5.
Theorem 2.14 (Fatou). Let {Xn }∞ n=1 be a sequence of non-negative random
variables and suppose there exist random variables L and U such that P (L ≤
Xn ≤ U ) = 1 for all n ∈ N, E(|L|) < ∞, and E(|U |) < ∞. Then
   
E lim inf Xn ≤ lim inf E(Xn ) ≤ lim sup E(Xn ) ≤ E lim sup Xn .
n→∞ n→∞ n→∞ n→∞

A proof of Theorem 2.14 can be found in Section 2.5 of Gut (2005).

2.4 Some Limit Theory for Events

The concept of a limit can also be applied to sequences of events. Conceptually,


the limiting behavior of sequences of events is more abstract than it is with
sequences of real numbers. We will begin by defining the limit supremum
and infimum for sequences of events. Recall from Definition 1.3 that the limit
supremum of a sequence of real numbers {xn }∞ n=1 is given by

lim sup xn = inf sup xk .


n→∞ n∈N k≥n
66 RANDOM VARIABLES AND CHARACTERISTIC FUNCTIONS
To convert this concept to sequences of events note that the supremum of a
sequence of real numbers is the least upper bound of the sequence. In terms of
events we will say that an event A is an upper bound of an event B if B ⊂ A.
For a sequence of events {An }∞n=1 in F, the supremum will be defined to be
the smallest event that contains the sequence, which is the least upper bound
of the sequence. That is, we will define

[
sup Ak = An .
n∈N n=1

Note that any event smaller than this union will not be an upper bound of
the sequence. Following similar arguments, the infimum of a sequence of real
numbers is defined to be the greatest lower bound. Therefore, the infimum of a
sequence of events is the largest event that is contained by all of the events in
the sequence. Hence, the infimum of a sequence of events {An }∞ n=1 is defined
as

\
inf An = An .
n∈N
n=1
These concepts can be combined to define the limit supremum and the limit
infimum of a sequence of events.
n=1 be a sequence of events in a σ-field F generated
Definition 2.7. Let {An }∞
from a sample space Ω. Then
∞ \
[ ∞
lim inf An = sup inf An = Ak ,
n→∞ n∈N k≥n n=1 k=n

and
∞ [
\ ∞
lim sup An = inf sup An = Ak .
n→∞ n∈N k≥n
n=1 k=n

As with sequences of real numbers, the limit of a sequence of events exists


when the limit infimum and limit supremum of the sequence agree.
Definition 2.8. Let {An }∞n=1 be a sequence of events in a σ-field F generated
from a sample space Ω. If
A = lim inf An = lim sup An ,
n→∞ n→∞

then the limit of the sequence {An }∞


n=1 exists and
lim An = A.
n→∞

If
lim inf An 6= lim sup An ,
n→∞ n→∞
then the limit of the sequence {An }∞
n=1 does not exist.
Example 2.6. Consider the probability space (Ω, F, P ) where Ω = (0, 1),
SOME LIMIT THEORY FOR EVENTS 67
F = B{(0, 1)}, and the sequence of events {An }∞ 1
n=1 is defined by An = ( 3 −
−1 2 −1
(3n) , 3 + (3n) ) for all n ∈ N. Now

\
inf An = ( 13 − (3k)−1 , 23 + (3k)−1 ) = [ 13 , 23 ],
k≥n
k=n

which does not depend on n so that by Definition 2.7


∞ \
[ ∞ ∞
[
lim inf An = ( 13 − (3k)−1 , 23 + (3k)−1 ) = [ 31 , 23 ] = [ 13 , 23 ].
n→∞
n=1 k=n n=1

Similarly, by Definition 2.7


∞ [
\ ∞
lim sup An = ( 31 − (3k)−1 , 23 + (3k)−1 )
n→∞
n=1 k=n
\∞
= ( 13 − (3n)−1 , 23 + (3n)−1 )
n=1
= [ 13 , 23 ],
which matches the limit infimum. Therefore, by Definition 2.8,
lim An = [ 31 , 23 ].
n→∞


Example 2.7. Consider the probability space (Ω, F, P ) where Ω = (0, 1),
F = B{(0, 1)}, and the sequence of events {An }∞ 1
n=1 is defined by An = ( 2 +
n1
(−1) 4 , 1) for all n ∈ N. Definition 2.7 implies that
∞ \
[ ∞ ∞
[
lim inf An = ( 21 + (−1)k 14 , 1) = ( 34 , 1) = ( 34 , 1).
n→∞
n=1 k=n n=1

Similarly, Definition 2.7 implies that


\ ∞
∞ [ ∞
\
lim sup An = ( 21 + (−1)k 14 , 1) = ( 41 , 1) = ( 14 , 1).
n→∞
n=1 k=n n=1

In this case the limit of the sequence of events {An }∞


n=1 does not exist. 

The sequence of events studied in Example 2.6 has a property that is very
important in the theory of limits of sequences of events. In that example,
An+1 ⊂ An for all n ∈ N, which corresponds to a monotonically increasing se-
quence of events. The computation of the limits of such sequences is simplified
by this structure.
Theorem 2.15. Let {An }∞ n=1 be a sequence of events from a σ-field F of
subsets of a sample space Ω.

1. If An ⊂ An+1 for all n ∈ N then the sequence is monotonically increasing


68 RANDOM VARIABLES AND CHARACTERISTIC FUNCTIONS
and has limit

[
lim An = An .
n→∞
n=1

2. If An+1 ⊂ An for all n ∈ N then the sequence is monotonically decreasing


and has limit

\
lim An = An .
n→∞
n=1

Proof. We will prove the first result. The second result is proven in Exercise 7.
n=1 be a sequence of monotonically increasing events from F. That
Let {An }∞
is An ⊂ An+1 for all n ∈ N. Then Definition 2.7 implies
∞ \
[ ∞
lim inf An = Ak .
n→∞
n=1 k=n

Because An ⊂ An+1 it follows that



\
Ak = An ,
k=n

and therefore

[
lim inf An = An .
n→∞
n=1
Similarly, Definition 2.7 implies that
∞ [
\ ∞
lim sup An = Ak .
n→∞
n=1 k=n

The monotonicity of the sequence implies that



[ ∞
[
Ak = Ak ,
k=1 k=n

for all n ∈ N. Therefore,


∞ [
\ ∞ ∞
[
lim sup An = Ak = Ak .
n→∞
n=1 k=1 k=1

Therefore, Definition 2.8 implies



[
lim An = An .
n→∞
n=1

It is important to note that the limits of the sequences are members of F.


SOME LIMIT THEORY FOR EVENTS 69
For monotonically increasing sequences this follows from the fact that count-
able unions of events in F are also in F by Definition 2.1. For monotonically
decreasing sequences note that by Theorem A.2,

!c ∞
\ [
An = Acn . (2.4)
n=1 n=1

The complement of each event is a member of F by Defintion 2.1 and there-


fore the countable union is also a member of F. Because the limits of these
sequences of events are members of F, the probabilities of the limits can also
be computed. In particular, Theorem 2.16 shows that the limit of the probabil-
ities of the sequence and the probability of the limit of the sequence coincide.
Theorem 2.16. Let {An }∞ n=1 be a sequence of events from a σ-field F of
subsets of a sample space Ω. If the sequence {An }∞ n=1 is either monotonically
increasing or decreasing then
 
lim P (An ) = P lim An .
n→∞ n→∞

Proof. To prove this result break the sequence up into mutually exclusive
events and use Definition 2.2. If {An }∞
n=1 is a sequence of monotonically in-
creasing events then define Bn+1 = An+1 ∩ Acn for n ∈ N where B1 is defined
to be A1 . Note that the sequence {Bn }∞n=1 is defined so that
n
[ n
[
Ai = Bi ,
i=1 i=1

for all n ∈ N and hence



[ ∞
[
An = Bn . (2.5)
n=1 n=1
Definition 2.2 implies that
n
X
P (An ) = P (Bi ),
i=1

for all n ∈ N. Therefore, taking the limit of each side of the equation as n → ∞
yields
X∞
lim P (An ) = P (Bn ).
n→∞
n=1
Definition 2.2, Equation (2.5), and Theorem 2.15 then imply
∞ ∞ ∞
! !
X [ [  
P (Bn ) = P Bn = P An = P lim An ,
n→∞
n=1 n=1 n=1

which proves the result for monotonically increasing events. For monotonically
decreasing events take the complement as shown in Equation (2.4) and note
that the resulting sequence is monotonically increasing. The above result is
then applied to this resulting sequence of monotonically increasing events.
70 RANDOM VARIABLES AND CHARACTERISTIC FUNCTIONS
The Borel-Cantelli Lemmas relate the probability of the limit supremum of
a sequence of events to the convergence of the sum of the probabilities of the
events in the sequence.
Theorem 2.17 (Borel and Cantelli). Let {An }∞
n=1 be a sequence of events.
If
X∞
P (An ) < ∞,
n=1
then  
P lim sup An = 0.
n→∞

Proof. From Definition 2.7



∞ [ ∞
  ! !
\ [
P lim sup An = P Am ≤P Am ,
n→∞
n=1 m=n m=n

for each n ∈ N, where the inequality follows from Theorem 2.3 and the fact
that
∞ [
\ ∞ ∞
[
Am ⊂ Am .
n=1 m=n m=n
Now, Theorem 2.4 (Bonferroni) implies that
  X∞
P lim sup An ≤ P (Am ),
n→∞
m=n

for each n ∈ N. The assumed condition



X
P (An ) < ∞,
n=1

implies that

X
lim P (Am ) = 0,
n→∞
m=n
so that Theorem 2.16 implies that
 
P lim sup An = 0.
n→∞

The usual interpretation of Theorem 2.17 relies on considering how often, in


an asymptotic sense, the events in the sequence occur. Recall that the event

[
Am ,
m=n
SOME LIMIT THEORY FOR EVENTS 71
occurs when at least one event in the sequence An , An+1 , . . . occurs. Now, if
the event
∞ [
\ ∞
Am ,
n=1 m=n
occurs, then at least one event in the sequence An , An+1 , . . . occurs for every
n ∈ N. This means that whenever one of the events An occurs in the sequence,
we are guaranteed with probability one that An0 will occur for some n0 > n.
That is, if the event
∞ [
\ ∞
lim sup An = Am ,
n→∞
n=1 m=n
occurs, then events in the sequence occur infinitely often. This event is usually
represented as {An i.o.}. In this context, Theorem 2.17 implies that if

X
P (An ) < ∞,
n=1

then the probability that An occurs infinitely often is zero. That is, there will
exist an n0 ∈ N such that none of the events in the sequence {An0 +1 , An0 +2 , . . .}
will occur, with probability one. The second Borel and Cantelli Lemma relates
the divergence of the sum of the probabilities of the events in the sequence
to the case where An occurs infinitely often with probability one. This result
only applies to the case where the events in the sequence are independent.
Theorem 2.18 (Borel and Cantelli). Let {An }∞
n=1 be a sequence of indepen-
dent events. If

X
P (An ) = ∞,
n=1
then  
P lim sup An = P (An i.o.) = 1.
n→∞

Proof. We use the method of Billingsley (1986) to prove this result. Note that
by Theorem A.2
 c ∞ [ ∞
!c ∞ \ ∞
\ [
lim sup An = Am = Acm .
n→∞
n=1 m=n n=1 m=n

Therefore, if we can prove that


∞ \

!
[
P Acm = 0,
n=1 m=n

then the result will follow. In fact, note that if we are able to show that

!
\
c
P Am = 0, (2.6)
m=n
72 RANDOM VARIABLES AND CHARACTERISTIC FUNCTIONS
for each n ∈ N, then Theorem 2.4 implies that
∞ \ ∞ ∞ ∞
! !
[ X \
c c
P Am ≤ P Am = 0.
n=1 m=n n=1 m=n

Therefore, it suffices to show that the property in Equation (2.6) is true.


We will first consider a finite part of the intersection in Equation (2.6). The
independence of the {An }∞ n=1 sequence implies that
n+k
! n+k
\ Y
P Acm = P (Acm ).
m=n m=n

Now, note that 1 − x ≤ exp(−x) for all positive and real x, so that
P (Acm ) = 1 − P (Am ) ≤ exp[−P (Am )],
for all m ∈ {n, n + 1, . . .}. Therefore,
n+k
! n+k " n+k #
\ Y X
P Acm ≤ exp[−P (Am )] = exp − P (Am ) . (2.7)
m=n m=n m=n

We wish to take the limit of both sides of Equation (2.7), which means that we
need to evaluate the limit of the sum on the right hand side. By supposition
we have that

X n−1
X ∞
X
P (Am ) = P (Am ) + P (Am ) = ∞,
m=1 m=1 m=n

and
n−1
X
P (Am ) ≤ n − 1 < ∞,
m=1
since P (Am ) ≤ 1 for each m ∈ N. Therefore, it follows that

X
P (Am ) = ∞, (2.8)
m=n

for each n ∈ N. Finally, to take the limit, note that


( n+k )∞
\
c
Am ,
m=n k=1
is a monotonically decreasing sequence of events so that Theorem 2.16 implies
that

n+k
! n+k
! !
\ \ \
lim P Acm = P lim Acm = P Acm .
k→∞ k→∞
m=n m=n m=n
But Equation (2.8) implies that
n+k
! " n+k
#
\ X
c
lim P Am ≤ exp − lim P (Am ) = 0,
k→∞ k→∞
m=n m=n
SOME LIMIT THEORY FOR EVENTS 73
and the result follows.

Note that Theorem 2.18 requires the sequence of events to be independent,


whereas Theorem 2.17 does not. However, Theorem 2.17 is still true under
the assumption of independence, and combining the results of Theorems 2.17
and 2.18 under this assumption yields the following zero-one law.
Corollary 2.1. Let {An }∞
n=1 be a sequence of independent events. Then
  ( P∞
0 if n=1 P (An ) < ∞,
P lim sup An = P∞
n→∞ 1 if n=1 P (An ) = ∞.

Results such as the one given in Corollary 2.1 are called zero-one laws because
the probability of the event of interest can only take on the values zero and
one.
Example 2.8. Let {Un }∞ n=1 be a sequence of independent random variables
where Un has a Uniform{1, 2, . . . , n} distribution for all n ∈ N. Define a
sequence of events {An }∞
n=1 as An = {Un = 1} for all n ∈ N. Note that

X ∞
X
P (An ) = n−1 = ∞,
n=1 n=1

so that Corollary 2.1 implies that


 
P lim sup An = P (An i.o.) = 1.
n→∞

Therefore, each time we observe {Un = 1}, we are guaranteed to observe


{Un0 = 1} for some n0 ≥ n, with probability one. Hence, the event An will
occur an infinite number of times in the sequence, with probability one. Now
suppose that Un has a Uniform{1, 2, . . . , n2 } distribution for all n ∈ N, and
assume that An has the same definition as above. In this case
∞ ∞
π2
X X
P (An ) = n−2 = 6 ,
n=1 n=1

so that Corollary 2.1 implies that


 
P lim sup An = P (An i.o.) = 0.
n→∞

In this case there will be a last occurrence of an event in the sequence An with
probability one. That is, there will exist an integer n0 such that {Un = 1}
will not be observed for all n > n0 , with probability one. That is, the event
{Un = 1} will occur in this sequence only a finite number of times, with
probability one. Squaring the size of the sample space in this case creates too
many opportunities for events other than those events in the sequence An to
occur as n → ∞ for an event An to ever occur again after a certain point. 
74 RANDOM VARIABLES AND CHARACTERISTIC FUNCTIONS
2.5 Generating and Characteristic Functions

Let X be a random variable with distribution function F . Many of the char-


acteristics of the behavior of a random variable can be studied by considering
the moments of X.
Definition 2.9. The k th moment of a random variable X with distribution
function F is Z ∞
µ0k = E(X k ) = xk dF (x), (2.9)
−∞
provided Z ∞
|x|k dF (x) < ∞.
−∞
The k th central moment of a random variable X with distribution function F
is Z ∞
µk = E[(X − µ01 )k ] = (x − µ01 )k dF (x), (2.10)
−∞
provided Z ∞
|x − µ01 |k dF (x) < ∞.
−∞

The integrals used in Definition 2.9 are Lebesgue-Stieltjes integrals, which can
be applied to any random variable, discrete or continuous.
Definition 2.10. Let X be a random variable with distribution function F .
Let g be any real function. Then the integral
Z ∞
g(x)dF (x),
−∞

will be interpreted as a Lebesgue-Stieltjes integral. That is, if X is continuous


with density f then
Z ∞ Z ∞
g(x)dF (x) = g(x)f (x)dx,
−∞ −∞

provided Z ∞
|g(x)|f (x)dx < ∞.
−∞
If X is a discrete random variable that takes on values in the set {x1 , x2 , . . .}
with probability distribution function f then
Z ∞ Xn
g(x)dF (x) = g(xi )f (xi ),
−∞ i=1

provided
n
X
|g(xi )|f (xi ) < ∞.
i=1
GENERATING AND CHARACTERISTIC FUNCTIONS 75
The use of this notation will allow us to keep the presentation of the book
simple without having to present essentially the same material twice: once
for the discrete case and once for the continuous case. For example, in the
particular case when X is a discrete random variable that takes on values in
the set {x1 , x2 , . . .} with probability distribution function f then Definitions
2.9 and 2.10 imply that

X
µ0k = E(X k ) = xki f (xi ),
i=1

provided the sum converges absolutely, where f is the probability distribution


function of X.
The moments of a distribution characterize the behavior of the random vari-
able, and under certain conditions uniquely characterize the entire probability
distribution of a random variable. As detailed in the theorem below, the entire
sequence of moments is required for this characterization.
Theorem 2.19. The sequence of moments {µ0k }∞ k=1 of a random variable X
uniquely determine the distribution function of X if

X
(µ02i )−1/2i = ∞. (2.11)
i=1

The condition given in Equation (2.11) is called the Carleman Condition and
has sufficient conditions given below.
Theorem 2.20. Let X be a random variable with moments {µ0k }∞ k=1 . If
Z ∞ 1/k
lim sup k −1 |x|k dF (x) < ∞,
k→∞ −∞
or

X µ0 λk
k
, (2.12)
k!
k=1
converges absolutely when |λ| < λ0 , for some λ0 > 0, then the Carleman
Condition of Equation (2.11) holds.
The Carleman Condition, as well as the individually sufficient conditions given
in Theorem 2.20, essentially put restrictions on the rate of growth of the
sequence µ02k . For further details on these results see Akhiezer (1965) and
Shohat and Tamarkin (1943).
Example 2.9. Consider a continuous random variable X with density
(
2x 0 ≤ x ≤ 1,
f (x) =
0 elsewhere.
Direct calculation shows that for i ∈ N,
Z 1
µ0i = E(X i ) = 2xi+1 dx = 2(i + 2)−1 .
0
76 RANDOM VARIABLES AND CHARACTERISTIC FUNCTIONS
Therefore (µ02i )−1/2i = (i + 1)1/2i for all i ∈ N. Noting that (i + 1)1/2i ≥ 1 for
all i ∈ N implies that

X
(µ02i )−1/2i = ∞,
i=1

and the Carleman condition of Theorem 2.19 is satisfied. Therefore Theorem


2.19 implies that the moment sequence µ0i = 2(i + 2)−1 uniquely identifies this
distribution. 
Example 2.10. Suppose that X is a continuous random variable on the pos-
itive real line that has a Lognormal(0, 1) distribution. That is, the density
of X is given by
(
(2πx2 )−1/2 exp{− 12 [log(x)]2 } x ≥ 0,
f (x) =
0 x < 0.

Let θ ∈ [−1, 1] and consider a new function


(
f (x){1 + θ sin[2π log(x)]} x≥0
fθ (x) = (2.13)
0 x < 0.

We first show that fθ (x) is a valid density function. Since f (x) is a density it
follows that f (x) ≥ 0 for all x ∈ R. Now | sin(x)| ≤ 1 for all x ∈ R and |θ| ≤ 1
so it follows that 1 + θ sin[2π log(x)] ≥ 0 for all x > 0. Hence it follows that
fθ (x) ≥ 0 for all x > 0 and θ ∈ [−1, 1]. Further, noting that
Z ∞ Z ∞
f (x){1 + θ sin[2π log(x)]}dx = 1 + θ f (x) sin[2π log(x)]dx,
0 0

it suffices to show that


Z ∞
f (x) sin[2π log(x)]dx = 0,
0

to show that the proposed density function integrates to one. Consider the
change of variable log(x) = u so that du = dx/x and note that
lim log(x) = −∞,
x→0

and
lim log(x) = ∞.
x→∞

Therefore,
Z ∞ Z ∞
f (x) sin[2π log(x)]dx = (2π)−1/2 exp(− 21 u2 ) sin(2πu)du.
0 −∞

The function in the integrand is odd so that the integral is zero. Therefore, it
follows that fθ (x) is a valid density. Now consider the k th moment of fθ given
GENERATING AND CHARACTERISTIC FUNCTIONS 77
by
Z ∞
µ0k (θ) = xk f (x){1 + θ sin[2π log(x)]}dx
Z0 ∞ Z ∞
= xk f (x)dx + θ xk f (x) sin[2π log(x)]dx
0 0
Z ∞
= µ0k + θ xk f (x) sin[2π log(x)]dx,
0

where µ0k is the k th moment of f (x). Using the same change of variable as
above, it follows that
Z ∞
xk f (x) sin[2π log(x)]dx =
0
Z ∞
(2π)−1/2 exp( 21 k 2 ) exp(− 21 u2 ) sin[2π(u + k)]du =
−∞
Z ∞
(2π)−1/2 exp( 12 k 2 ) exp(− 21 u2 ) sin(2πu)du = 0,
−∞

for each k ∈ N. Hence, µ0k (θ) = µ0k for all θ ∈ [−1, 1], and we have demon-
strated that this family of distributions all have the same sequence of moments.
Therefore, the moment sequence of this distribution does not uniquely iden-
tify this distribution. A plot of the lognormal density along with the density
given in Equation (2.13) when θ = 1 is given in Figure 2.3. This example is
based on one from Heyde (1963). 

The moment generating function is a function that contains the information


about all of the moments of a distribution. As suggested above, under certain
conditions the moment generating function can serve as a surrogate for a
distribution. This is useful in some cases where the distribution function itself
may not be convenient to work with.
Definition 2.11. Let X be a random variable with distribution function F .
The moment generating function of X, or equivalently of F , is
Z ∞
m(t) = E[exp{tX}] = exp(tx)dF (x),
−∞

provided
Z ∞
exp(tx)dF (x) < ∞
−∞

for |t| < b for some b > 0.


Example 2.11. Let X be a Binomial(n, p) random variable. The moment
78 RANDOM VARIABLES AND CHARACTERISTIC FUNCTIONS

Figure 2.3 The lognormal density (solid line) and the density given in Equation
(2.13) with θ = 1 (dashed line). Both densities have the same moment sequence.
1.5
1.0
0.5
0.0

0 1 2 3 4

generating function of X is given by

Z ∞
m(t) = exp(tx)dF (x)
−∞
n  
X n x
= exp(tx) p (1 − p)n−x
x=0
x
n  
X n
= [p exp(t)]x (1 − p)n−x
x=0
x
= [1 − p + p exp(t)]n ,

where the final inequality results from Theorem A.22. 

Example 2.12. Let X be an Exponential(β) distribution. Suppose that


GENERATING AND CHARACTERISTIC FUNCTIONS 79
|t| < β −1 , then the moment generating function of X is given by
Z ∞
m(t) = exp(tx)dF (x)
−∞
Z ∞
= exp(tx)β −1 exp(−β −1 x)dx
0
Z ∞
= β −1 exp[−x(β −1 − t)]dx. (2.14)
0

Note that the integral in Equation (2.14) diverges unless the restriction |t| <
β −1 is employed. Now consider the change of variable v = x(β −1 − t) so that
dx = (β −1 − t)−1 dv. The moment generating function is then given by
Z ∞
−1 −1 −1
m(t) = β (β − t) exp(−v)dv = (1 − tβ)−1 .
0

When the convergence condition given in Definition 2.11 holds then all of
the moments of X are finite, and the moment generating function contains
information about all of the moments of F . To observe why this is true,
consider the Taylor series for exp(tX) given by

X (tX)i
exp(tX) = .
i=0
i!

Taking the expectation of both sides yields


∞ i 0
X tµ i
m(t) = E[exp(tX)] = , (2.15)
i=0
i!

where we have assumed that the exchange between the infinite sum and the
expectation is permissible. Note that the coefficient of ti , divided by i!, is
equal to the ith moment of F . As indicated by its name, the moment gen-
erating function can be used to generate the moments of the corresponding
distribution. The operation is outlined in Theorem 2.21, which can be proven
using a standard induction argument. See Problem 16.
Theorem 2.21. Let X be a random variable that has moment generating
function m(t) that converges on some radius |t| ≤ b for some b > 0. Then

dk m(t)

µ0k = .
dtk t=0

Example 2.13. Suppose that X is a Binomial(n, p) random variable. It


was shown in Example 2.11 that the moment generating function of X is
80 RANDOM VARIABLES AND CHARACTERISTIC FUNCTIONS
m(t) = [1 − p + p exp(t)]n . To find the mean of X use Theorem 2.21 to obtain

dm(t)
E(X) =
dt t=0

d n

= [1 − p + p exp(t)]
dt t=0
= np exp(t)[1 − p + p exp(t)]n−1

t=0
= np.


Of equal importance is the ability of the moment generating function to


uniquely identify the distribution of the random variable. This is what allows
one to use the moment generating function of a random variable as a surrogate
for the distribution of the random variable. In particular, the Central Limit
Theorem can be proven by showing that the limiting moment generating func-
tion of a particular sum of random variables matches the moment generating
function of the normal distribution.
Theorem 2.22. Let F and G be two distribution functions all of whose mo-
ments exist, and whose moment generating functions are mF (t) and mG (t),
respectively. If mF (t) and mG (t) exist and mF (t) = mG (t) for all |t| ≤ b, for
some b > 0 then F (x) = G(x) for all x ∈ R.

We do not formally prove Theorem 2.22, but to informally see why the result
would be true one can compare Equation (2.15) to Equation (2.12). When
the moment generating function exists then expansion given in Theorem 2.20
must converge for some radius of convergence. Therefore, Theorems 2.19 and
2.20 imply that the distribution is uniquely characterized by its moment se-
quence. Note that since Equation (2.15) is a polynomial in t whose coefficients
are functions of the moments, the two moment generating functions will be
equal only when all of the moments are equal, which will mean that the two
distributions are equal.
Another useful result relates the distribution of a function of a random variable
to the moment generating function of the random variable.
Theorem 2.23. Let X be a random variable with distribution function F . If
g is a real function then the moment generating function of g(X) is
Z ∞
m(t) = E{exp[tg(X)]} = exp[tg(x)]dF (x),
−∞

provided m(t) < ∞ when |t| < b for some b > 0.

The results of Theorems 2.22 and 2.23 can be combined to identify the distri-
butions of transformed random variables.
Example 2.14. Suppose X is an Exponential(1) random variable. Ex-
ample 2.12 showed that the moment generating function of X is mX (t) =
GENERATING AND CHARACTERISTIC FUNCTIONS 81
(1 − t)−1 when |t| < 1. Now consider a new random variable Y = βX where
β ∈ R. Theorem 2.23 implies that the moment generating function of Y is
mY (t) = E{exp[t(βX)]} = E{exp[(tβ)X]} = mX (tβ) as long as tβ < 1 or
equivalently t < β −1 . Evaluating the moment generating function of X at tβ
yields mY (t) = (1−tβ)−1 , which is the moment generating function of an Ex-
ponential(β) random variable. Theorem 2.22 can be used to conclude that
if X is an Exponential(1) random variable then βX is an Exponential(β)
random variable. 
Example 2.15. Let Z be a N(0, 1) random variable and define a new random
variable Y = Z 2 . Theorem 2.23 implies that the moment generating function
of Y is given by
mY (t) = E[exp(tZ 2 )]
Z ∞
= (2π)−1/2 exp(− 21 z 2 + tz 2 )dz
−∞
Z ∞
= (2π)−1/2 exp[− 12 z 2 (1 − 2t)]dz.
−∞

Assuming that 1 − 2t > 0, or that |t| < 12 so that integral converges, use the
change of variable v = z(1 − 2t)1/2 to obtain
Z ∞
−1/2 1
mY (t) = (1 − 2t) (2π)−1/2 exp(− v 2 )dv = (1 − 2t)−1/2 .
−∞ 2
Note that mY (t) is the moment generating function of a Chi-Squared(1)
random variable. Therefore, Theorem 2.22 implies that if Z is a N(0, 1) random
variable then Z 2 is a Chi-Squared(1) random variable. 

In Example 2.14 it is clear that the moment generating functions of linear


transformations of random variables can be evaluated directly without having
to re-evaluate the integral in Definition 2.11. This result can be generalized.
Theorem 2.24. Let X be a random variable with moment generating func-
tion mX (t) that exists and is finite for |t| < b for some b > 0. Suppose that Y
is a new random variable defined by Y = αX + β where α and β are real con-
stants. Then the moment generating function of Y is mY (t) = exp(tβ)mX (αt)
provided |αt| < b.

Theorem 2.24 is proven in Exercise 23. Another important property relates
the moment generating function of the sum of independent random variables
to the moment generating functions of the individual random variables.
Theorem 2.25. Let X1 , . . . , Xn be a sequence of independent random vari-
ables where Xi has moment generating function mi (t) for i = 1, . . . , n. Suppose
that m1 (t), . . . , mn (t) all exist and are finite when |t| < b for some b > 0. If
S_n = \sum_{i=1}^{n} X_i,
then the moment generating function of S_n is
m_{S_n}(t) = \prod_{k=1}^{n} m_k(t),
when |t| < b. If X_1, . . . , X_n are identically distributed with moment generating
function m(t) then m_{S_n}(t) = m^n(t) provided |t| < b.

Proof. The moment generating function of S_n is given by
m_{S_n}(t) = E[\exp(tS_n)] = E\left[\exp\left(t\sum_{i=1}^{n} X_i\right)\right] = E\left[\exp\left(\sum_{i=1}^{n} tX_i\right)\right]
= E\left[\prod_{i=1}^{n} \exp(tX_i)\right] = \prod_{i=1}^{n} E[\exp(tX_i)] = \prod_{i=1}^{n} m_i(t),
where the interchange of the product and the expectation follows from the
independence of the random variables. The second result follows by setting
m_1(t) = · · · = m_n(t) = m(t).
Example 2.16. Suppose that X1 , . . . , Xn are a set of independent N(0, 1)
random variables. The moment generating function of X_k is m(t) = \exp(\tfrac{1}{2}t^2)
for all t ∈ R. Theorem 2.25 implies that the moment generating function of
S_n = \sum_{i=1}^{n} X_i
is m_{S_n}(t) = m^n(t) = \exp(\tfrac{1}{2}nt^2), which is the moment generating function of
a N(0, n) random variable. Therefore, Theorem 2.22 implies that the sum of
n independent N(0, 1) random variables is a N(0, n) random variable. 
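A small simulation along the lines of the experiments in Section 2.6.2 can be used to illustrate this result; in the R sketch below the number of replications, the seed, and the choice n = 10 are arbitrary.

set.seed(2)
n <- 10
s <- replicate(5000, sum(rnorm(n)))    # 5000 simulated values of the sum
c(mean = mean(s), variance = var(s))   # should be close to 0 and n = 10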

The moment generating function can be a useful tool in asymptotic theory, but
has the disadvantage that it does not exist for many distributions of interest. A
function that has many similar properties to the moment generating function
is the characteristic function. Using the characteristic function requires a little
more work as it is based on some ideas from complex analysis. However, the
benefits far outweigh this inconvenience as the characteristic function always
exists, can be used to generate moments when they exist, and also uniquely
identifies a distribution.
Definition 2.12. Let X be a random variable with distribution function F .
The characteristic function of X, or equivalently of F , is
ψ(t) = E[\exp(itX)] = \int_{-\infty}^{\infty} \exp(itx)\,dF(x).   (2.16)
In the case where F has density f, the characteristic function is equivalent to
ψ(t) = \int_{-\infty}^{\infty} \exp(itx)f(x)\,dx.   (2.17)
One may recognize the integral in Equation (2.17) as the Fourier transforma-
tion of the function f . Such a transformation exists for other functions beside
densities, but for the special case of density functions the Fourier transforma-
tion is called the characteristic function. Similarly, the integral in Equation
(2.16) is called the Fourier–Stieltjes transformation of F . These transforma-
tions are defined for all bounded measures, and in the special case of a measure
that is normed to one, we again call the transformation a characteristic func-
tion.
There is no provision for a radius of convergence in Definition 2.12 because
the integral always exists. To see this note that
\left|\int_{-\infty}^{\infty} \exp(itx)\,dF(x)\right| \le \int_{-\infty}^{\infty} |\exp(itx)|\,dF(x),
by Theorem A.6. Now Definitions A.5 and A.6 (Euler) imply that
|\exp(itx)| = |\cos(tx) + i\sin(tx)| = [\cos^2(tx) + \sin^2(tx)]^{1/2} = 1.
Therefore, it follows that
\left|\int_{-\infty}^{\infty} \exp(itx)\,dF(x)\right| \le \int_{-\infty}^{\infty} dF(x) = 1.

Before giving some examples on deriving characteristic functions for specific
distributions it is worth pointing out that complex integration is not required
to find characteristic functions, since we are not integrating over the complex
plane. In fact, Definition A.6 implies that the characteristic function of a
random variable that has distribution function F is given by
ψ(t) = \int_{-\infty}^{\infty} \exp(itx)\,dF(x) = \int_{-\infty}^{\infty} \cos(tx)\,dF(x) + i\int_{-\infty}^{\infty} \sin(tx)\,dF(x),   (2.18)

where both of the integrals in Equation (2.18) are of real functions integrated
over the real line.
Example 2.17. Let X be a Binomial(n, p) random variable. The character-
istic function of X is given by
ψ(t) = \int_{-\infty}^{\infty} \exp(itx)\,dF(x) = \sum_{x=0}^{n} \exp(itx)\binom{n}{x}p^x(1 − p)^{n−x}
= \sum_{x=0}^{n} \binom{n}{x}[p\exp(it)]^x(1 − p)^{n−x} = [1 − p + p\exp(it)]^n,
where the final equality results from Theorem A.22. 
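The closed form derived above can be verified numerically; the short R sketch below compares a direct evaluation of E[exp(itX)] from the Binomial(n, p) probabilities with the formula, for arbitrary illustrative values of n, p, and t.

n <- 5; p <- 0.3; t <- 1.7
x <- 0:n
direct <- sum(exp(1i * t * x) * dbinom(x, n, p))   # E[exp(itX)] computed from the probabilities
closed.form <- (1 - p + p * exp(1i * t))^n
c(direct, closed.form)                             # the two complex numbers agree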
Example 2.18. Let X be an Exponential(β) random variable. The character-
istic function of X is given by
ψ(t) = \int_{-\infty}^{\infty} \exp(itx)\,dF(x) = \int_{0}^{\infty} \exp(itx)β^{-1}\exp(-β^{-1}x)\,dx
= \int_{0}^{\infty} β^{-1}\cos(tx)\exp(-β^{-1}x)\,dx + i\int_{0}^{\infty} β^{-1}\sin(tx)\exp(-β^{-1}x)\,dx.
Now, using standard results from calculus, it follows that
\int_{0}^{\infty} \cos(tx)\exp(-β^{-1}x)\,dx = \frac{β^{-1}}{β^{-2} + t^2},
and
\int_{0}^{\infty} \sin(tx)\exp(-β^{-1}x)\,dx = \frac{t}{β^{-2} + t^2}.
Therefore
ψ(t) = β^{-1}\left(\frac{β^{-1}}{β^{-2} + t^2} + \frac{it}{β^{-2} + t^2}\right) = \frac{1 + itβ}{1 + t^2β^2} = \frac{1 + itβ}{(1 + itβ)(1 − itβ)} = \frac{1}{1 − itβ}.

In both of the previous two examples the characteristic function turns out to
be a function whose range is in the complex field. This is not always the case,
as there are some circumstances under which the characteristic function is a
real valued function.
Theorem 2.26. Let X be a random variable. The characteristic function of
X is real valued if and only if X has the same distribution as −X.

Theorem 2.26 is proven in Exercise 22. Note that the condition that X has
the same distribution as −X implies that P (X ≥ x) = P (X ≤ −x) for all
x > 0, which is equivalent to the case where X has a symmetric distribution
about the origin. For example, a random variable with a N(0, 1) distribution
has a real valued characteristic function, whereas a random variable with a
non-symmetric distribution like a Gamma distribution has a complex val-
ued characteristic function. Note that Theorem 2.26 requires that X have a
distribution that is symmetric about the origin. That is, if X has a N(µ, σ^2)
distribution where µ 6= 0, then the characteristic function of X is complex
valued.
As with the moment generating function, the characteristic function uniquely
characterizes the distribution of a random variable, though the characteristic
function can be used in more cases as there is no need to consider potential
convergence issues with the characteristic function.
Theorem 2.27. Let F and G be two distribution functions whose character-
istic functions are ψF (t) and ψG (t) respectively. If ψF (t) = ψG (t) for all t ∈ R
then F (x) = G(x) for all x ∈ R.

Inversion provides a method for deriving the distribution function of a random
variable directly from its characteristic function. The two main results used in
this book focus on inversion of characteristic functions for certain continuous
and discrete distributions.
Theorem 2.28. Consider a random variable X with characteristic function
ψ(t). If
\int_{-\infty}^{\infty} |ψ(t)|\,dt < ∞,
then X has an absolutely continuous distribution F with a bounded and con-
tinuous density f given by
f(x) = \frac{1}{2π}\int_{-\infty}^{\infty} \exp(-itx)ψ(t)\,dt.

A proof of Theorem 2.28 can be found in Chapter 4 of Gut (2005).


Example 2.19. Consider a random variable X that has characteristic func-
tion ψ(t) = \exp(-\tfrac{1}{2}t^2), which satisfies the condition given in Theorem 2.28.
Therefore, X has an absolutely continuous distribution function F that has a
continuous and bounded density given by
f(x) = \frac{1}{2π}\int_{-\infty}^{\infty} \exp(-itx)\exp(-\tfrac{1}{2}t^2)\,dt = \frac{1}{2π}\int_{-\infty}^{\infty} \exp[-\tfrac{1}{2}(t^2 + 2itx)]\,dt
= \frac{1}{2π}\int_{-\infty}^{\infty} \exp[-\tfrac{1}{2}(t + ix)^2]\exp(-\tfrac{1}{2}x^2)\,dt = \frac{1}{2π}\exp(-\tfrac{1}{2}x^2)\int_{-\infty}^{\infty} \exp[-\tfrac{1}{2}(t + ix)^2]\,dt.

Note that (2π)^{-1/2} multiplied by the integrand corresponds to a N(−ix, 1)
distribution, and the corresponding integral is one. Therefore
f(x) = (2π)^{-1/2}\exp(-\tfrac{1}{2}x^2),
which is the density of a N(0, 1) random variable. 
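Theorem 2.28 can also be applied numerically. In the R sketch below the inversion integral is evaluated with integrate() for ψ(t) = exp(−t^2/2) and compared with dnorm(); the imaginary part of the integrand vanishes by symmetry, so only the cosine term is integrated. The function name f.inv and the evaluation points are illustrative choices.

f.inv <- function(x) {
  integrand <- function(t) cos(t * x) * exp(-t^2 / 2)   # real part of exp(-itx) * psi(t)
  integrate(integrand, lower = -Inf, upper = Inf)$value / (2 * pi)
}
rbind(inversion = c(f.inv(0), f.inv(1)), normal.density = dnorm(c(0, 1)))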

For discrete distributions we will focus on random variables that take on values
on a regular lattice. That is, for random variables X that take on values in
the set {kd + l : k ∈ Z} for some d > 0 and l ∈ R. Many of the common
discrete distributions, such as the Binomial, Geometric, and Poisson, have
supports on a regular lattice with d = 1 and l = 0.
Theorem 2.29. Consider a random variable X that takes on values in the set
{kd + l : k ∈ Z} for some d > 0 and l ∈ R. Suppose that X has characteristic
function ψ(t), then for any x ∈ {kd + l : k ∈ Z},
P(X = x) = \frac{d}{2π}\int_{-π/d}^{π/d} \exp(-itx)ψ(t)\,dt.
Example 2.20. Let X be a discrete random variable that takes on values
in the set {0, 1, . . . , n} where n ∈ N is fixed so that d = 1 and l = 0. If X
has characteristic function ψ(t) = [1 − p + p exp(it)]n where p ∈ (0, 1) and
x ∈ {1, . . . , n}, then Theorem 2.29 implies that
P(X = x) = \frac{1}{2π}\int_{-π}^{π} \exp(-itx)[1 − p + p\exp(it)]^n\,dt
= \frac{1}{2π}\int_{-π}^{π} \exp(-itx)\sum_{k=0}^{n}\binom{n}{k}p^k\exp(itk)(1 − p)^{n−k}\,dt
= \frac{1}{2π}\sum_{k=0}^{n}\binom{n}{k}p^k(1 − p)^{n−k}\int_{-π}^{π} \exp[it(k − x)]\,dt,
where Theorem A.22 has been used to expand the polynomial. There are
two distinct cases to consider for the value of the index k. When k = x the
exponential function becomes one and the expression simplifies to
\frac{1}{2π}\binom{n}{x}p^x(1 − p)^{n−x}\int_{-π}^{π} dt = \binom{n}{x}p^x(1 − p)^{n−x}.
When k ≠ x the integral expression in the sum can be calculated as
\int_{-π}^{π} \exp[it(k − x)]\,dt = \int_{-π}^{π} \cos[t(k − x)]\,dt + i\int_{-π}^{π} \sin[t(k − x)]\,dt.
The second integral is zero because \sin[−t(k − x)] = −\sin[t(k − x)]. For the
first integral we note that k − x is a nonzero integer and therefore
\int_{-π}^{π} \cos[t(k − x)]\,dt = 2\int_{0}^{π} \cos[t(k − x)]\,dt = 0,
since the range of the integral over the cosine function is an integer multiple
of π. Therefore, the integral expression is zero when k ≠ x and we have that
P(X = x) = \binom{n}{x}p^x(1 − p)^{n−x},
which corresponds to the probability of a Binomial(n, p) random variable. 
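A numerical version of this calculation is given in the R sketch below, which recovers P(X = x) from the characteristic function using Theorem 2.29 with d = 1 and compares the result with dbinom(); the imaginary part of the integrand integrates to zero, so only the real part is passed to integrate(). The values of n, p, and x are arbitrary.

n <- 5; p <- 0.3; x <- 2
psi <- function(t) (1 - p + p * exp(1i * t))^n
integrand <- function(t) Re(exp(-1i * t * x) * psi(t))
c(inversion = integrate(integrand, -pi, pi)$value / (2 * pi),
  direct = dbinom(x, n, p))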
Due to the similar form of the definition of the characteristic function to the
moment generating function, the two functions have similar properties. In
particular, the characteristic function of a random variable can also be used
to obtain moments of the random variable if they exist. We first establish
that when the associated moments exist, the characteristic function can be
approximated by partial expansions whose terms correspond to the Taylor
series for the exponential function.
Theorem 2.30. Let X be a random variable with characteristic function ψ.
If E(|X|^n) < ∞ for some n ∈ N, then
\left|ψ(t) − \sum_{k=0}^{n} \frac{(it)^kE(X^k)}{k!}\right| \le E\left(\frac{2|t|^n|X|^n}{n!}\right),
and
\left|ψ(t) − \sum_{k=0}^{n} \frac{(it)^kE(X^k)}{k!}\right| \le E\left(\frac{|t|^{n+1}|X|^{n+1}}{(n+1)!}\right).

Proof. We will prove the first statement. The second statement follows using
similar arguments. Theorem A.11 implies that for y ∈ R,
\left|\exp(iy) − \sum_{k=0}^{n} \frac{(iy)^k}{k!}\right| \le \frac{2|y|^n}{n!}.   (2.19)
Let y = tX where t ∈ R and note then that
\left|ψ(t) − \sum_{k=0}^{n} \frac{(it)^kE(X^k)}{k!}\right| = \left|E[\exp(itX)] − \sum_{k=0}^{n} \frac{E[(itX)^k]}{k!}\right| \le E\left|\exp(itX) − \sum_{k=0}^{n} \frac{(itX)^k}{k!}\right| \le E\left(\frac{2|t|^n|X|^n}{n!}\right),
where the final inequality follows from the bound in Equation (2.19).
The results given in Theorem 2.30 allow us to determine the asymptotic be-
havior of the error that is incurred by approximating a characteristic function
with the corresponding partial series. This, in turn, allows us to obtain a
method by which the moments of a distribution can be obtained from a char-
acteristic function.
Theorem 2.31. Let X be a random variable with characteristic function ψ. If
E(|X|n ) < ∞ for some n ∈ {1, 2, . . .} then ψ (k) exists, is uniformly continuous
for k ∈ {1, 2, . . . , n},
ψ(t) = 1 + \sum_{k=1}^{n} \frac{µ_k'(it)^k}{k!} + o(|t|^n),
as t → 0, and
\left.\frac{d^kψ(t)}{dt^k}\right|_{t=0} = i^kµ_k'.
A proof of Theorem 2.31 is the subject of Exercise 33. Note that the char-
acteristic function allows one to find the finite moments of distributions that
may not have all finite moments, as opposed to the moment generating func-
tion which does not exist when any of the moments of a random variable are
infinite.
Example 2.21. Let X be a N(0, 1) random variable, which has characteristic
function ψ(t) = \exp(-\tfrac{1}{2}t^2). Taking a first derivative yields
\left.\frac{d}{dt}ψ(t)\right|_{t=0} = \left.\frac{d}{dt}\exp(-\tfrac{1}{2}t^2)\right|_{t=0} = \left.-t\exp(-\tfrac{1}{2}t^2)\right|_{t=0} = 0,
which is the mean of the N(0, 1) distribution. 
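This differentiation can also be carried out with R's symbolic derivative function D(), as in the short sketch below; by Theorem 2.31 the derivative at zero equals i times the first moment, so a value of zero confirms that the mean is zero.

psi <- expression(exp(-t^2 / 2))
dpsi <- D(psi, "t")       # symbolic derivative with respect to t
eval(dpsi, list(t = 0))   # equals 0, and psi'(0) = i * E(X), so E(X) = 0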

The characteristic function also has a simple relationship with linear transfor-
mations of a random variable.
Theorem 2.32. Suppose that X is a random variable with characteristic
function ψ(t). Let Y = αX + β where α and β are real constants. Then the
characteristic function of Y is ψY (t) = exp(itβ)ψ(tα).

Theorem 2.32 is proven in Exercise 24. As with moment generating functions,
the characteristic function of the sum of independent random variables is the
product of the characteristic functions.
Theorem 2.33. Let X1 , . . . , Xn be a sequence of independent random vari-
ables where Xi has characteristic function ψi (t), for i = 1, . . . , n, then the
characteristic function of
S_n = \sum_{i=1}^{n} X_i,
is
ψ_{S_n}(t) = \prod_{i=1}^{n} ψ_i(t).
If X_1, . . . , X_n are identically distributed with characteristic function ψ(t) then
the characteristic function of S_n is ψ_{S_n}(t) = ψ^n(t).

Theorem 2.33 is proven in Exercise 25.
A function that is related to the moment generating function is the cumulant
generating function, defined below.
Definition 2.13. Let X be a random variable with moment generating func-
tion m(t) defined on a radius of convergence |t| < b for some b > 0. The
cumulant generating function of X is given by c(t) = log[m(t)].

The usefulness of the cumulant generating function may not be apparent from
the definition given above, though one can immediately note that when the
moment generating function of a random variable exists, the cumulant gen-
erating function will also exist and will uniquely characterize the distribution
of the random variable as the moment generating function does. This follows
from the fact that the cumulant generating function is a one-to-one function
of the moment generating function. As indicated by its name, c(t) can be
used to generate the cumulants of a random variable, which are related to the
moments of a random variable. Before defining the cumulants of a random
variable, some expansion theory is required to investigate the structure of the
cumulant generating function more closely.
We begin by assuming that the moment generating function m(t) is defined on
a radius of convergence |t| < b for some b > 0. As shown in Equation (2.15),
the moment generating function can be written as
m(t) = 1 + µ_1't + \tfrac{1}{2}µ_2't^2 + \tfrac{1}{6}µ_3't^3 + · · · + \frac{µ_n't^n}{n!} + O(t^n),   (2.20)
as t → 0. An application of Theorem 1.15 to the logarithmic function can be
used to show that
\log(1 + δ) = δ − \tfrac{1}{2}δ^2 + \tfrac{1}{3}δ^3 − \tfrac{1}{4}δ^4 + · · · + \tfrac{1}{n}(−1)^{n+1}δ^n + O(δ^n),   (2.21)
as δ → 0. Substituting
δ = \sum_{i=1}^{n} \frac{µ_i't^i}{i!} + O(t^n),   (2.22)
into Equation (2.21) yields a polynomial in t of the same form as given in
Equation (2.20), but with different coefficients. That is, the cumulant gener-
ating function has the form
c(t) = κ_1t + \tfrac{1}{2}κ_2t^2 + \tfrac{1}{6}κ_3t^3 + · · · + \frac{κ_nt^n}{n!} + R_c(t),   (2.23)
where the coefficients κ_i are called the cumulants of X, and R_c(t) is an error
term that depends on t whose order is determined below. Note that since
c(t) and m(t) have the same form, the cumulants can be generated from the
cumulant generating function in the same way that moments can be generated
from the moment generating function through Theorem 2.21. That is,
κ_i = \left.\frac{d^ic(t)}{dt^i}\right|_{t=0}.
Matching the coefficients of t^i in Equations (2.21) and (2.23) yields expres-
sions for the cumulants of X in terms of the moments of X. For example,
matching the coefficients of t yields the relation µ_1't = κ_1t so that the first
cumulant is equal to µ_1', the mean. The remaining cumulants are not equal to
the corresponding moments. Matching the coefficients of t^2 yields
\tfrac{1}{2}κ_2t^2 = \tfrac{1}{2}µ_2't^2 − \tfrac{1}{2}(µ_1')^2t^2,
so that κ_2 = µ_2' − (µ_1')^2, the variance of X. Similarly, matching the coefficients
of t^3 yields
\tfrac{1}{6}κ_3t^3 = \tfrac{1}{6}µ_3't^3 − \tfrac{1}{2}µ_1'µ_2't^3 + \tfrac{1}{3}(µ_1')^3t^3,
so that κ_3 = µ_3' − 3µ_1'µ_2' + 2(µ_1')^3. In this form this expression may not seem
familiar, but note that
E[(X − µ_1')^3] = µ_3' − 3µ_1'µ_2' + 2(µ_1')^3,
so that κ3 is the skewness of X. It can be similarly shown that κ4 is the
kurtosis of X. For further cumulants, see Exercises 34 and 35, and Chapter 3
of Kendall and Stuart (1977).
The remainder term R_c(t) from the expansion given in Equation (2.23) will
now be quantified. Consider the powers of t given in the expansion for δ in
Equation (2.22) when δ is substituted into Equation (2.21). The remainder
term from the linear term is O(t^n) as t → 0. When the expansion given for δ
is squared, there will be terms that range in powers of t from t^2 to t^{2n}. All of
the powers of t less than n + 1 have coefficients that are used to identify the
first n cumulants in terms of the moments of X. All the remaining terms are
O(tn ) as t → 0, assuming that all of the moments are finite, which must follow
if the moment generating function of X converges on a radius of convergence.
This argument can be applied to the remaining terms to obtain the result
c(t) = \sum_{i=1}^{n} \frac{κ_it^i}{i!} + O(t^n),   (2.24)
as t → 0.
Example 2.22. Suppose X has a N(µ, σ 2 ) distribution. The moment gener-
ating function of X is given by m(t) = \exp(µt + \tfrac{1}{2}σ^2t^2). From Definition 2.13,
the cumulant generating function of X is
c(t) = \log[m(t)] = µt + \tfrac{1}{2}σ^2t^2.
Matching this cumulant generating function to the general form given in Equa-
tion (2.24) implies the N(µ, σ^2) distribution has cumulants κ_1 = µ, κ_2 = σ^2,
and κ_i = 0 for i = 3, 4, . . . .


In some cases additional calculations are required to obtain the necessary form
of the cumulant generating function.
Example 2.23. Suppose that X has an Exponential(β) distribution. The
moment generating function of X is m(t) = (1 − βt)−1 when |t| < β −1 so
that the cumulant generating function of X is c(t) = − log(1 − βt). This
cumulant generating function is not in the form given in Equation (2.24) so
the cumulants cannot be obtained by directly observing c(t). However, when
|βt| < 1 it follows from the Taylor expansion given in Equation (2.21) that
\log(1 − βt) = (−βt) − \tfrac{1}{2}(−βt)^2 + \tfrac{1}{3}(−βt)^3 − · · · = −βt − \tfrac{1}{2}β^2t^2 − \tfrac{1}{3}β^3t^3 − · · · .
Therefore c(t) = βt + \tfrac{1}{2}β^2t^2 + \tfrac{1}{3}β^3t^3 + · · · , which is now in the form of Equation
(2.24). It follows that the i^{th} cumulant of X can be found by solving
\frac{κ_it^i}{i!} = \frac{β^it^i}{i},
for t ≠ 0. This implies that the i^{th} cumulant of X is κ_i = (i − 1)!β^i for
i ∈ N. Alternatively, one could also find the cumulants by differentiating the
cumulant generating function. For example, the first cumulant is given by
\left.\frac{d}{dt}c(t)\right|_{t=0} = \left.\frac{d}{dt}[−\log(1 − βt)]\right|_{t=0} = \left.β(1 − βt)^{-1}\right|_{t=0} = β.
The remaining cumulants can be found by taking additional derivatives. 
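The same cumulants can be obtained in R by repeated symbolic differentiation of c(t) = −log(1 − βt) with D(), as in the sketch below; the value β = 2 is an arbitrary illustrative choice, and the results should match κ_i = (i − 1)!β^i, that is 2, 4, and 16.

beta <- 2
c1 <- D(quote(-log(1 - beta * t)), "t")   # first derivative of the cumulant generating function
c2 <- D(c1, "t")
c3 <- D(c2, "t")
c(eval(c1, list(t = 0, beta = beta)),
  eval(c2, list(t = 0, beta = beta)),
  eval(c3, list(t = 0, beta = beta)))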

Cumulant generating functions are particularly easy to work with for sums of
independent random variables.
Theorem 2.34. Let X1 , . . . , Xn be a sequence of independent random vari-
ables where Xi has cumulant generating function ci (t) for i = 1, . . . , n. Then
the cumulant generating function of
S_n = \sum_{i=1}^{n} X_i,
is
c_{S_n}(t) = \sum_{i=1}^{n} c_i(t).
If X1 , . . . , Xn are also identically distributed with cumulant generating func-
tion c(t) then the cumulant generating function of Sn is nc(t).

Theorem 2.34 is proven in Exercise 36. The fact that the cumulant gener-
ating functions add for sums of independent random variables implies that
the coefficients of ti /i! add as well. Therefore the ith cumulant of the sum of
independent random variables is equal to the sum of the corresponding cu-
mulants of the individual random variables. This result gives some indication
as to why cumulants are often preferable to work with, as the moments or
central moments of a sum of independent random variables can be a complex
function of the individual moments. For further information about cumulants
see Barndorff-Nielsen and Cox (1989), Gut (2005), Kendall and Stuart (1977),
and Severini (2005).

2.6 Exercises and Experiments

2.6.1 Exercises

1. Verify that F = {∅, ω1 , ω2 , ω3 , ω1 ∪ ω2 , ω1 ∪ ω3 , ω2 ∪ ω3 , ω1 ∪ ω2 ∪ ω3 } is a


σ-field containing the sets ω1 , ω2 , and ω3 . Prove that this is the smallest
possible σ-field containing these sets by showing that F is no longer a σ-field
if any of the sets are eliminated from F.
2. Let {An }∞ n=1 be a sequence of monotonically increasing events from a σ-
field F of subsets of a sample space Ω. Prove that the sequence {Acn }∞
n=1 is
a monotonically decreasing sequence of events from F.
3. Consider a probability space (Ω, F, P ) where Ω = (0, 1) × (0, 1) is the unit
square and P is a bivariate extension of Lebesgue measure. That is, if R is
a rectangle of the form
R = {(ω_1, ω_2) : \underline{ω}_1 ≤ ω_1 ≤ \bar{ω}_1, \underline{ω}_2 ≤ ω_2 ≤ \bar{ω}_2},
where \underline{ω}_1 ≤ \bar{ω}_1 and \underline{ω}_2 ≤ \bar{ω}_2, then P(R) = (\bar{ω}_1 − \underline{ω}_1)(\bar{ω}_2 − \underline{ω}_2), which
corresponds to the area of the rectangle. Let F = B{(0, 1) × (0, 1)} and
in general define P (B) to be the area of B for any B ∈ B{(0, 1) × (0, 1)}.
Prove that P is a probability measure.
4. Prove Theorem 2.4. That is, prove that
P\left(\bigcup_{i=1}^{n} A_i\right) \le \sum_{i=1}^{n} P(A_i).

The most direct approach is based on mathematical induction using the


general addition rule to prove the basis and the induction step. The general
addition rule states that for any two events A1 and A2 , P (A1 ∪ A2 ) =
P (A1 ) + P (A2 ) − P (A1 ∩ A2 ).
5. Prove Theorem 2.6 (Markov) for the case when X is a discrete random
variable on N with probability distribution function p(x).
6. Prove Theorem 2.7 (Tchebysheff). That is, prove that if X is a random
variable such that E(X) = µ and V(X) = σ^2 < ∞, then P(|X − µ| > δ) ≤ δ^{-2}σ^2.
7. Let {A_n}_{n=1}^∞ be a sequence of events from a σ-field F of subsets of a sample
space Ω. Prove that if A_{n+1} ⊂ A_n for all n ∈ N then the sequence has limit
\lim_{n→∞} A_n = \bigcap_{n=1}^{∞} A_n.
8. Let {A_n}_{n=1}^∞ be a sequence of events from F, a σ-field on the sample space
Ω = (0, 1), defined by A_n = (1/3, 2/3) if n is even and A_n = (1/4, 3/4) if n is odd,
for all n ∈ N. Compute
\liminf_{n→∞} A_n  and  \limsup_{n→∞} A_n,
and determine if the limit of the sequence {A_n}_{n=1}^∞ exists.
9. Let {A_n}_{n=1}^∞ be a sequence of events from F, a σ-field on the sample space
Ω = R, defined by A_n = (−1 − n^{-1}, 1 + n^{-1}) for all n ∈ N. Compute
\liminf_{n→∞} A_n  and  \limsup_{n→∞} A_n,
and determine if the limit of the sequence {A_n}_{n=1}^∞ exists.
10. Let {A_n}_{n=1}^∞ be a sequence of events from F, a σ-field on the sample space
Ω = (0, 1), defined by A_n = B if n is even and A_n = B^c if n is odd,
for all n ∈ N where B is a fixed member of F. Compute
\liminf_{n→∞} A_n  and  \limsup_{n→∞} A_n,
and determine if the limit of the sequence {A_n}_{n=1}^∞ exists.
11. Let {A_n}_{n=1}^∞ be a sequence of events from F, a σ-field on the sample space
Ω = R, defined by A_n = [1/2, 1/2 + n^{-1}) if n is even and A_n = (1/2 − n^{-1}, 1/2] if n is odd,
for all n ∈ N. Compute
\liminf_{n→∞} A_n  and  \limsup_{n→∞} A_n,
and determine if the limit of the sequence {A_n}_{n=1}^∞ exists.
12. Consider a probability space (Ω, F, P ) where Ω = (0, 1), F = B{(0, 1)} and
P is Lebesgue measure on (0, 1). Let {An }∞ n=1 be a sequence of events in F
defined by An = (0, 12 (1 + n−1 )) for all n ∈ N. Show that
 
lim P (An ) = P lim An .
n→∞ n→∞

13. Consider tossing a fair coin repeatedly and define Hn to be the event that
the nth toss of the coin yields a head. Prove that
P (lim sup Hn ) = 1,
n→∞

and interpret this result in terms of how often the event occurs.
14. Consider the case where {An }∞ n=1 is a sequence of independent events that
all have the same probability p ∈ (0, 1). Prove that
P (lim sup An ) = 1,
n→∞

and interpret this result in terms of how often the event occurs.
15. Let {U_n}_{n=1}^∞ be a sequence of independent Uniform(0, 1) random vari-
ables. For each definition of A_n given below, calculate
P(\limsup_{n→∞} A_n).

a. An = {Un < n−1 } for all n ∈ N.


b. An = {Un < n−3 } for all n ∈ N.
c. An = {Un < exp(−n)} for all n ∈ N.
d. An = {Un < 2−n } for all n ∈ N.
16. Let X be a random variable that has moment generating function m(t)
that converges on some radius |t| ≤ b for some b > 0. Using induction,
prove that
µ_k' = \left.\frac{d^km(t)}{dt^k}\right|_{t=0}.
17. Let X be a Poisson(λ) random variable.

a. Prove that the moment generating function of X is exp{λ[exp(t) − 1]}.
b. Prove that the characteristic function of X is exp{λ[exp(it) − 1]}.
c. Using the moment generating function, derive the first three moments
of X. Repeat the process using the characteristic function.

18. Let Z be a N(0, 1) random variable.

a. Prove that the moment generating function of Z is exp(t^2/2).
b. Prove that the characteristic function of Z is exp(−t^2/2).
c. Using the moment generating function, derive the first three moments
of Z. Repeat the process using the characteristic function.

19. Let Z be a N(0, 1) random variable and define X = µ + σZ for some µ ∈ R


and 0 < σ < ∞. Using the fact that X is a N(µ, σ 2 ) random variable,
derive the moment generating function and the characteristic function of a
N(µ, σ 2 ) random variable.
20. Let X be a N(µ, σ 2 ) random variable. Using the moment generating func-
tion, derive the first three moments of X. Repeat the process using the
characteristic function.
21. Let X be a Uniform(α, β) random variable.

a. Prove that the moment generating function of X is [t(β − α)]^{-1}[exp(tβ) − exp(tα)].
b. Prove that the characteristic function of X is [it(β − α)]−1 [exp(itβ) −
exp(itα)].
c. Using the moment generating function, derive the first three moments
of X. Repeat the process using the characteristic function.
22. Let X be a random variable. Prove that the characteristic function of X is
real valued if and only if X has the same distribution as −X.
23. Prove Theorem 2.24. That is, suppose that X is a random variable with
moment generating function mX (t) that exists and is finite for |t| < b
for some b > 0. Suppose that Y is a new random variable defined by
Y = αX + β where α and β are real constants. Prove that the moment
generating function of Y is mY (t) = exp(tβ)mX (αt) provided |αt| < b.
24. Prove Theorem 2.32. That is, suppose that X is a random variable with
characteristic function ψ(t). Let Y = αX + β where α and β are real con-
stants. Prove that the characteristic function of Y is ψY (t) = exp(itβ)ψ(αt).
25. Prove Theorem 2.33. That is, suppose that X1 , . . . , Xn be a sequence of
independent random variables where Xi has characteristic function ψi (t),
for i = 1, . . . , n. Prove that the characteristic function of
n
X
Sn = Xi ,
i=1

is
n
Y
ψSn (t) = ψi (t).
i=1
Further, prove that if X1 , . . . , Xn are identically distributed with character-
istic function ψ(t) then the characteristic function of Sn is ψSn (t) = ψ n (t).
26. Let X1 , . . . , Xn be a sequence of independent random variables where Xi
has a Gamma(αi , β) distribution for i = 1, . . . , n. Let
n
X
Sn = Xi .
i=1

Find the moment generating function of Sn , and identify the corresponding


distribution of the random variable.
27. Suppose that X is a discrete random variable that takes on non-negative in-
teger values and has characteristic function ψ(t) = exp{θ[exp(it) − 1]}. Use
Theorem 2.29 to find the probability that X equals k where k ∈ {0, 1, . . .}.
28. Suppose that X is a discrete random variable that takes on the values {0, 1}
and has characteristic function ψ(t) = cos(t). Use Theorem 2.29 to find the
probability that X equals k where k ∈ {0, 1}.
29. Suppose that X is a discrete random variable that takes on positive integer
values and has characteristic function
ψ(t) = \frac{p\exp(it)}{1 − (1 − p)\exp(it)}.
Use Theorem 2.29 to find the probability that X equals k where k ∈
{1, 2, . . .}.
30. Suppose that X is a continuous random variable that takes on real values
and has characteristic function ψ(t) = exp(−|t|). Use Theorem 2.28 to find
the density of X.
31. Suppose that X is a continuous random variable that takes on values in
(0, 1) and has characteristic function ψ(t) = [exp(it) − 1]/it. Use Theorem
2.28 to find the density of X.
32. Suppose that X is a continuous random variable that takes on positive real
values and has characteristic function ψ(t) = (1 − θit)−α . Use Theorem
2.28 to find the density of X.
33. Let X be a random variable with characteristic function ψ. Suppose that
E(|X|n ) < ∞ for some n ∈ {1, 2, . . .} and that ψ (k) exists and is uniformly
continuous for k ∈ {1, 2, . . . , n}.

a. Prove that
ψ(t) = 1 + \sum_{k=1}^{n} \frac{µ_k'(it)^k}{k!} + o(|t|^n),
as t → 0.
b. Prove that
\left.\frac{d^kψ(t)}{dt^k}\right|_{t=0} = i^kµ_k'.

34. a. Prove that κ4 = µ04 − 4µ03 µ01 − 3(µ02 )2 + 12µ02 (µ01 )2 − 6(µ01 )4 .
b. Prove that κ4 = µ4 − 3µ22 , which is often called the kurtosis of a random
variable.
c. Suppose that X is an Exponential(θ) random variable. Compute the
fourth cumulant of X.
35. a. Prove that
κ5 = µ05 −5µ04 µ01 −10µ03 µ02 +20µ03 (µ01 )2 +30(µ02 )2 µ01 −60µ02 (µ01 )3 +24(µ01 )5 .

b. Prove that κ5 = µ5 − 10µ2 µ3 .


c. Suppose that X is an Exponential(θ) random variable. Compute the
fifth cumulant of X.
36. Prove Theorem 2.34. That is, suppose that X1 , . . . , Xn be a sequence of
independent random variables where Xi has cumulant generating function
ci (t) for i = 1, . . . , n. Then prove that the cumulant generating function of
n
X
Sn = Xi ,
i=1

is
n
X
cSn (t) = ci (t).
i=1
37. Suppose that X is a Poisson(λ) random variable, so that the moment
generating function of X is m(t) = exp{λ[exp(t) − 1]}. Find the cumulant
generating function of X, and put it into the form given in Equation (2.24).
Using the form of the cumulant generating function, find a general form for
the cumulants of X.
38. Suppose that X is a Gamma(α, β) random variable, so that the moment
generating function of X is m(t) = (1 − tβ)^{-α}. Find the cumulant generating
function of X, and put it into the form given in Equation (2.24). Using
the form of the cumulant generating function, find a general form for the
cumulants of X.
39. Suppose that X is a Laplace(α, β) random variable, so that the moment
generating function of X is m(t) = (1 − t2 β 2 )−1 exp(tα) when |t| < β −1 .
Find the cumulant generating function of X, and put it into the form given
in Equation (2.24). Using the form of the cumulant generating function,
find a general form for the cumulants of X.
40. One consequence of defining the cumulant generating function in terms of
the moment generating function is that the cumulant generating function
will not exist any time the moment generating function does not. An alter-
nate definition of the cumulant generating function is defined in terms of
the characteristic function. That is, if X has characteristic function ψ(t),
then the cumulant generating function can be defined as c(t) = log[ψ(t)].

a. Assume all of the cumulants (and moments) of X exist. Prove that the
coefficient of (it)k /k! for the cumulant generating function defined using
the characteristic function is the k th cumulant of X. You may want to
use Theorem 2.31.
b. Find the cumulant generating function of a random variable X that has
a Cauchy(0, 1) distribution based on the fact that the characteristic
function of X is ψ(t) = exp(−|t|). Use the form of this cumulant generating
function to argue that the cumulants of X do not exist.

2.6.2 Experiments

1. For each of the distributions listed below, use R to compute P (|X − µ| > δ)
and compare the result to the bound given by Theorem 2.7 as δ^{-2}σ^2 for
δ = 1/2, 1, 3/2, 2. Which distributions become closest to achieving the bound?
What are the properties of these distributions?

a. N(0, 1)
b. T(3)
c. Gamma(1, 1)
d. Uniform(0, 1)
2. For each distribution listed below, plot the corresponding characteristic
function of the density as a function of t if the characteristic function is
real-valued, or as a function of t on the complex plane if the function is
complex-valued. Describe each characteristic function. Are there any prop-
erties of the associated random variables that have an apparent effect on
the properties of the characteristic function? See Section B.3 for details on
plotting complex functions in the complex plane.

a. Bernoulli(1/2)
b. Binomial(5, 1/2)
c. Geometric(1/2)
d. Poisson(2)
e. Uniform(0, 1)
f. Exponential(2)
g. Cauchy(0, 1)

3. For each value of µ and σ listed below, plot the characteristic function
of the corresponding N(µ, σ 2 ) distribution as a function of t in the com-
plex plane. Describe how the changes in the parameter values affect the
properties of the corresponding characteristic function. This will require a
three-dimensional plot. See Section B.3 for further details.

a. µ = 0, σ =1
b. µ = 1, σ =1
c. µ = 0, σ =2
d. µ = 1, σ =2

4. Random walks are a special type of discrete stochastic process that are able
to change from one state to any adjacent state according to a conditional
probability distribution. This experiment will investigate the properties of
random walks in one, two, and three dimensions.

a. Consider a sequence of random variables {X_n}_{n=1}^∞ where X_1 = 0 with
probability one and P(X_n = k|X_{n−1} = x) = (1/2)δ{k; {x − 1, x + 1}}, for
n = 2, 3, . . .. Such a sequence is known as a symmetric one-dimensional
random walk. Write a program in R that simulates such a sequence of
length 1000 and keeps track of the number of times the origin is visited
after the initial start of the sequence. Use this program to repeat the
experiment 100 times to estimate the average number of visits to the
origin for a sequence of length 1000.
b. Consider a sequence of two-dimensional random vectors {Y_n}_{n=1}^∞ where
Y_1' = (0, 0) with probability one and P(Y_n = y_n|Y_{n−1} = y) = (1/4)δ{y_n −
y; D}, for n = 2, 3, . . . where D = {(1, 0), (0, 1), (−1, 0), (0, −1)}. Such a
sequence is known as a symmetric two-dimensional random walk. Write
a program in R that simulates such a sequence of length 1000 and keeps
track of the number of times the origin is visited after the initial start
of the sequence. Use this program to repeat the experiment 100 times
to estimate the average number of visits to the origin for a sequence of
length 1000.
c. Consider a sequence of three-dimensional random vectors {Z_n}_{n=1}^∞ where
Z_1' = (0, 0, 0) with probability one and P(Z_n = z_n|Z_{n−1} = z) = (1/6)δ{z_n −
z; D}, for n = 2, 3, . . . where
D = {(1, 0, 0), (0, 1, 0), (0, 0, 1), (−1, 0, 0), (0, −1, 0), (0, 0, −1)}.
Such a sequence is known as a symmetric three-dimensional random
walk. Write a program in R that simulates such a sequence of length
1000 and keeps track of the number of times the origin is visited after the
initial start of the sequence. Use this program to repeat the experiment
100 times to estimate the average number of visits to the origin for a
sequence of length 1000.
d. The theory of Markov chains can be used to show that with probability
one, the one and two-dimensional symmetric random walks will return
to the origin infinitely often, whereas the three-dimensional random walk
will not. Discuss, as much as is possible, the properties of the probability
of the process visiting the origin at step n for each case in terms of
Theorem 2.17. Note that Theorem 2.18 cannot be applied to this case
because the events in the sequence are dependent. Discuss whether your
simulation results provide evidence about this behavior. Do you think
that a property such as this could ever be verified empirically?
CHAPTER 3

Convergence of Random Variables

The man from the country has not expected such difficulties: the law should
always be accessible for everyone, he thinks, but as he now looks more closely at
the gatekeeper in his fur coat, at his large pointed nose and his long, thin, black
Tartars beard, he decides that it would be better to wait until he gets permission
to go inside.
Before the Law by Franz Kafka

3.1 Introduction

Let {Xn }∞
n=1 be a sequence of random variables and let X be some other
random variable. Under what conditions is it possible to say that Xn converges
to X as n → ∞? That is, is it possible to define a limit for a sequence of
random variables so that the statement
lim Xn = X,
n→∞

has a well defined mathematical meaning? The answer, or answers, to this


question arise from a detailed consideration of the mathematical structure
of random variables. There are two ways to conceptualize random variables.
From an informal viewpoint one can view the sequence {Xn }∞ n=1 as a sequence
of quantities that are random, and this random behavior somehow depends
on the index n. The quantity X is also random. It is clear in this context that
the mathematical definition of the limit of a sequence of real numbers could
only be applied to an observed sequence of these random variables. That is,
if we observed Xn = xn for all n ∈ N and X = x, then it is easy to ascertain
whether
lim xn = x,
n→∞
using Definition 1.1. But what can be concluded about the sequence before
the random variables are observed? It is clear that the usual mathematical
definition of convergence and limit will not suffice, and that a new view of
convergence will need to be established for random variables. In some sense
we wish Xn and X to act the same when n is large, but how can this be
guaranteed? It is clear that probability should play a role, but what exactly
should the role be? For example, we could insist that the limit of the sequence

{Xn }∞
n=1 match X with probability one, which would provide the definition
that Xn converges to X as n → ∞ if
 
P(\lim_{n→∞} X_n = X) = 1.

Another viewpoint might insist that Xn and X be arbitrarily close with a


probability converging to one. That is, one could say that Xn converges to X
as n → ∞ if for every ε > 0,
\lim_{n→∞} P(|X_n − X| < ε) = 1.

Now there are two definitions of convergence of random variables to contend


with, which yields several more questions. Are the two notions of convergence
equivalent? Are there other competing definitions which should be considered?
Fortunately, a more formal analysis of this problem reveals a fully established
mathematical framework for this problem. Recall from Section 2.2 that ran-
dom variables are not themselves random quantities, but are functions that
map points in the sample space to the real line. This indicates that {Xn }∞
n=1 is
really a sequence of functions and that X is a possible limiting function. Well
established methods that originate in real analysis and measure theory yield
definitions for the convergence of functions as well as many other relevant
properties that will be used throughout this book. From the mathematician’s
point of view, statistical limit theory may seem like a specialized application
of real analysis and measure theory to problems that specifically involve a
probability (normed) measure. However, the application is not trivial, and
the viewpoint of problems considered by statisticians may considerably differ
from those that interest pure mathematicians.
The purpose of this chapter is to introduce the concept of convergence of ran-
dom variables within a statistical framework. The presentation in this chapter
will stress the mathematical viewpoint of these developments, as this view-
point provides the best understanding of the true nature of the modes of
convergence, their properties, and how they relate to one another. The chap-
ter begins by considering convergence in probability, which corresponds to the
second of the definitions considered above.

3.2 Convergence in Probability

We now formally develop the second definition of convergence of random vari-


ables that was proposed informally in Section 3.1.
Definition 3.1. Let {Xn }∞ n=1 be a sequence of random variables. The se-
quence converges in probability to a random variable X if for every ε > 0,
\lim_{n→∞} P(|X_n − X| ≥ ε) = 0.
This relationship is represented by X_n \xrightarrow{p} X as n → ∞.
A simple application of the complement rule provides an equivalent condition
for convergence in probability. In particular, X_n \xrightarrow{p} X as n → ∞ if for every
ε > 0,
\lim_{n→∞} P(|X_n − X| < ε) = 1.
Either representation of the condition implies the same idea behind this mode
of convergence. That is, Xn should be close to X with a high probability
for large values of n. While the sequence {|Xn − X|}∞ n=1 is a sequence of
random variables indexed by n, the sequence of probabilities {P (|Xn − X| >
ε)}∞n=1 is a sequence of real constants indexed by n ∈ N. Hence, Definition
1.1 can be applied to the latter sequence. Therefore an equivalent definition
of convergence in probability is that X_n \xrightarrow{p} X as n → ∞ if for every ε > 0
and δ > 0 there exists an integer nδ ∈ N such that P (|Xn − X| > ε) < δ for
all n ≥ nδ .
Another viewpoint of convergence in probability can be motivated by Defini-
tion 2.4. A random variable is a measurable mapping from a sample space
Ω to R. Hence the sequence of random variables {X_n}_{n=1}^∞ is really a sequence
of functions and the random variable X is a limiting function. Convergence
in probability requires that for any ε > 0 the sequence of functions must be
within an ε-band of the function X over a set of points from the sample space
whose probability increases to one as n → ∞. Equivalently, the set of points
for which the sequence of functions is not within the ε-band must decrease to
zero as n → ∞. See Figure 3.1.
Example 3.1. Let Z be a N(0, 1) random variable and let {Xn }∞n=1 be a
sequence of random variables such that Xn = Z + n−1 for all n ∈ N. Let
ε > 0, then
P (|Xn − Z| ≥ ε) = P (|Z + n−1 − Z| ≥ ε) = P (n−1 ≥ ε),
for all n ∈ N. There exists an nε ∈ N such that n−1 < ε for all n ≥ nε so that
P (n−1 > ε) = 0 for all n ≥ nε . Therefore,
lim P (|Xn − Z| ≥ ε) = 0,
n→∞
and it follows from Definition 3.1 that X_n \xrightarrow{p} Z as n → ∞. 
Example 3.2. Suppose that θ̂n is an unbiased estimator of θ, that is E(θ̂n ) =
θ for all values of θ within the parameter space. In many cases the standard
error, or equivalently, the variance of θ̂n converges to zero as n → ∞. That is,
lim V (θ̂n ) = 0.
n→∞

Figure 3.1 Convergence in probability from the viewpoint of convergence of functions.
The solid line represents the random variable X and the dashed lines represent an ε-
band around the function where the horizontal axis is the sample space Ω. The dotted
line represents a random variable X_n. Convergence in probability requires that more
of the function X_n, with respect to the probability measure on Ω, be within the ε-band
as n becomes larger.

Under these conditions note that Theorem 2.7 (Tchebysheff) implies that for
any ε > 0, P(|θ̂_n − θ| > ε) ≤ ε^{-2}V(θ̂_n). The limiting condition on the variance
of θ̂_n and Definition 2.2 imply that
0 ≤ \lim_{n→∞} P(|θ̂_n − θ| > ε) ≤ \lim_{n→∞} ε^{-2}V(θ̂_n) = 0,
since ε is constant with respect to n. Therefore,
\lim_{n→∞} P(|θ̂_n − θ| ≥ ε) = 0,
and Definition 3.1 implies that θ̂_n \xrightarrow{p} θ as n → ∞. In estimation theory this
property is called consistency. That is, θ̂n is a consistent estimator of θ. A
special case of this result applies to the sample mean. Suppose that X1 , . . . , Xn
are a set of independent and identically distributed random variables from a
distribution with mean θ and finite variance σ 2 . The sample mean X̄n is
an unbiased estimator of θ with variance n−1 σ 2 which converges to zero as
n → ∞ as long as σ^2 < ∞. Therefore it follows that X̄_n \xrightarrow{p} θ as n → ∞ and
the sample mean is a consistent estimator of θ. This result is a version of what
are known as Laws of Large Numbers. In particular, this result is known as
the Weak Law of Large Numbers. Various results of this type can be proven
under many different conditions. In particular, it will be shown in Section 3.6
that the condition that the variance is finite can be relaxed. This result can be
visualized with the aid of simulated data. Consider simulating samples from a
N(0, 1) distribution of size n = 5, 10, 15, . . . , 250, where the sample mean X̄n
is computed on each sample. The Weak Law of Large Numbers states that
these sample means should converge in probability to 0 as n → ∞. Figure
3.2 shows the results of five such simulated sequences. An ε-band has been
plotted around 0. Note that all of the sequences generally become closer to 0
as n becomes larger, and that there is a point where all of the sequences are
within the ε-band. Remember that the definition of convergence in probability
is a result for random sequences. This does not mean that all such sequences
will be within the ε-band for a given sample size, only that the probability that
the sequences are within the ε-band converges to one as n → ∞. This can also
be observed from the fact that the individual sequences do not monotonically
converge to 0 as n becomes large. There are random fluctuations in all of the
sequences, but the overall behavior of the sequence does become closer to 0
as n becomes large. 

Figure 3.2 The results of a small simulation demonstrating convergence in probability
due to the weak law of large numbers. Each line represents a sequence of sample
means computed on a sequence of independent N(0, 1) random variables. The means
were computed when n = 5, 10, . . . , 250. An ε-band of size ε = 0.10 has been placed
around the limiting value.
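A minimal R version of the simulation described above (one of the sequences plotted in Figure 3.2) is sketched below; the seed is an arbitrary choice and the dashed lines mark an ε-band with ε = 0.10.

set.seed(3)
z <- rnorm(250)
n.grid <- seq(5, 250, by = 5)
xbar <- cumsum(z)[n.grid] / n.grid   # running sample means at n = 5, 10, ..., 250
plot(n.grid, xbar, type = "l", ylim = c(-1, 1), xlab = "n", ylab = "sample mean")
abline(h = c(-0.1, 0.1), lty = 2)    # the epsilon-band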
Example 3.3. Let {cn }∞
n=1 be a sequence of real constants where

lim cn = c,
n→∞

for some constant c ∈ R. Let {Xn }∞n=1 be a sequence of random variables with
a degenerate distribution at cn for all n ∈ N. That is P (Xn = cn ) = 1 for all
n ∈ N. Let ε > 0, then
P (|Xn − c| ≥ ε) = P (|Xn − c| ≥ ε|Xn = cn )P (Xn = cn ) = P (|cn − c| ≥ ε).
Definition 1.1 implies that for any ε > 0 there exists an nε ∈ N such that
|cn − c| < ε for all n > nε . Therefore P (|cn − c| > ε) = 0 for all n > nε , and
it follows that
\lim_{n→∞} P(|X_n − c| ≥ ε) = \lim_{n→∞} P(|c_n − c| ≥ ε) = 0.
Therefore, by Definition 3.1 it follows that X_n \xrightarrow{p} c as n → ∞. 
Example 3.4. Let X1 , . . . , Xn be a set of independent and identically dis-
tributed random variables from a Uniform(θ, θ + 1) distribution for some
θ ≥ 0. Let X(1) be the first sample order statistic. That is
X(1) = min{X1 , . . . , Xn }.
The distribution function of X(1) can be found by using the fact that if X(1) ≥ t
for some t ∈ R, then Xi ≥ t for all i = 1, . . . , n. Therefore, the distribution
function of X(1) is given by

F(t) = P(X_{(1)} ≤ t) = 1 − P(X_{(1)} > t) = 1 − P\left(\bigcap_{i=1}^{n} \{X_i > t\}\right) = 1 − \prod_{i=1}^{n} P(X_i > t).
If t ∈ (θ, θ + 1) then P(X_i > t) = 1 + θ − t so that the distribution function
of X_{(1)} is
F(t) = 0 for t < θ,   F(t) = 1 − (1 + θ − t)^n for t ∈ (θ, θ + 1),   F(t) = 1 for t > θ + 1.

Let ε > 0 and consider the inequality |X(1) −θ| ≤ ε. If ε ≥ 1 then |X(1) −θ| ≤ ε
with probability one because X(1) ∈ (θ, θ+1) with probability one. If ε ∈ (0, 1)
then
P (|X(1) − θ| < ε) = P (−ε < X(1) − θ < ε)
= P (θ − ε < X(1) < θ + ε)
= P (θ < X(1) < θ + ε)
= F (θ + ε)
= 1 − (1 − ε)n ,
where the fact that X_{(1)} must be greater than θ has been used. Therefore
\lim_{n→∞} P(|X_{(1)} − θ| < ε) = 1,
since 0 < 1 − ε < 1. Definition 3.1 then implies that X_{(1)} \xrightarrow{p} θ as n → ∞, or
that X(1) is a consistent estimator of θ. 
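The consistency of X_{(1)} can be illustrated with a quick simulation; in the R sketch below the value θ = 2, the seed, and the sample sizes are arbitrary choices, and the reported minima should approach θ as n grows.

set.seed(4)
theta <- 2
sapply(c(10, 100, 1000), function(n) min(runif(n, theta, theta + 1)))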

3.3 Stronger Modes of Convergence

In Section 3.1 we considered another reasonable interpretation of the limit of


a sequence of random variables, which essentially requires that the sequence
converge with probability one. This section formalizes this definition and stud-
ies how this type of convergence is related to convergence in probability. We
also introduce another concept of convergence of random variables that is
based on modifying the definition of convergence in probability.
Definition 3.2. Let {Xn }∞ n=1 be a sequence of random variables. The se-
quence converges almost certainly to a random variable X if
 
P(\lim_{n→∞} X_n = X) = 1.   (3.1)
This relationship is represented by X_n \xrightarrow{a.c.} X as n → ∞.

To better understand this type of convergence it is sometimes helpful to


rewrite Equation (3.1) as
P[ω : \lim_{n→∞} X_n(ω) = X(ω)] = 1.

Hence, almost certain convergence requires that the set of all ω for which
Xn (ω) converges to X(ω), have probability one. Note that the limit used in
Equation (3.1) is the usual limit for a sequence of constants given in Definition
1.1, as when ω is fixed, Xn (ω) is a sequence of constants. See Figure 3.3.
Example 3.5. Consider the sample space Ω = [0, 1] with probability measure
P such that ω is chosen according to a Uniform[0, 1] distribution on the Borel
σ-field B[0, 1]. Define a sequence {Xn }∞ n=1 of random variables as Xn (ω) =
δ{ω; [0, n−1 )}. Let ω ∈ [0, 1] be fixed and note that there exists an nω ∈ N
such that n−1 < ω for all n ≥ nω . Therefore Xn (ω) = 0 for all n ≥ nω , and it
follows that for this value of ω
lim Xn (ω) = 0.
n→∞

The exception is when ω = 0 for which Xn (ω) = 1 for all n ∈ N. Therefore


 
P(\lim_{n→∞} X_n = 0) = P[ω ∈ (0, 1)] = 1,
and it follows from Definition 3.2 that X_n \xrightarrow{a.c.} 0 as n → ∞. 

Figure 3.3 Almost certain convergence is characterized by the point-wise convergence


of the random variable sequence to the limiting random variable. In this figure, the
limiting random variable is represented by the function given by the solid black line,
and random variables in the sequence are represented by the functions plotted with
the dotted line. The horizontal axis represents the sample space. For a fixed value
of ω ∈ Ω, the corresponding sequence is simply a sequence of real constants as
represented by the black points.

Definition 3.2 can be difficult to apply in practice, and is not always use-
ful when studying the properties of almost certain convergence. By applying
Definition 1.1 to the limit inside the probability in Equation (3.1), an equiva-
lent definition that relates almost certain convergence to a statement that is
similar to the one used in Definition 3.1 can be obtained.
Theorem 3.1. Let {Xn }∞ n=1 be a sequence of random variables. Then Xn
converges almost certainly to a random variable X as n → ∞ if and only if
for every ε > 0,

\lim_{n→∞} P(|X_m − X| < ε for all m ≥ n) = 1.   (3.2)

Proof. This result is most easily proven by rewriting the definition of a limit
using set operations. This is the method used by Halmos (1950), Serfling
(1980), and many others. To prove the equivalence, consider the set
A = {ω : \lim_{n→∞} X_n(ω) = X(ω)}.
Definition 1.1 implies that
A = {ω : for every ε > 0 there exists n ∈ N such that |X_m(ω) − X(ω)| < ε for all m ≥ n}.   (3.3)
Since the condition in the event on the right hand side of Equation (3.3) must
be true for every ε > 0, we have that
A = \bigcap_{ε>0} {ω : there exists n ∈ N such that |X_m(ω) − X(ω)| < ε for all m ≥ n}.   (3.4)
The condition within each event of the intersection on the right hand side of
Equation (3.4) needs only to be true for at least one n ∈ N. Therefore
A = \bigcap_{ε>0} \bigcup_{n=1}^{∞} {ω : |X_m(ω) − X(ω)| < ε for all m ≥ n}.
Now consider 0 < ε < δ, then we have that
\bigcup_{n=1}^{∞} {ω : |X_m(ω) − X(ω)| < ε for all m ≥ n} ⊂ \bigcup_{n=1}^{∞} {ω : |X_m(ω) − X(ω)| < δ for all m ≥ n}.   (3.5)
This implies that the sequence of events within the intersection on the right
hand side of Equation (3.4) is monotonically decreasing as ε → 0. Therefore,
Theorem 2.15 implies that
A = \lim_{ε→0} \bigcup_{n=1}^{∞} {ω : |X_m(ω) − X(ω)| < ε for all m ≥ n}.
Similarly, note that
{ω : |X_m(ω) − X(ω)| < ε for all m ≥ n + 1} ⊂ {ω : |X_m(ω) − X(ω)| < ε for all m ≥ n},
so that the sequence of events within the union on the right hand side of
Equation (3.4) is monotonically increasing as n → ∞. Therefore, Theorem
2.15 implies that
A = \lim_{ε→0} \lim_{n→∞} {ω : |X_m(ω) − X(ω)| < ε for all m ≥ n}.
Therefore, Theorem 2.16 implies that
P[ω : \lim_{n→∞} X_n(ω) = X(ω)] = \lim_{ε→0} \lim_{n→∞} P[ω : |X_m(ω) − X(ω)| < ε for all m ≥ n].
Now, suppose that for every ε > 0,
\lim_{n→∞} P[ω : |X_m(ω) − X(ω)| < ε for all m ≥ n] = 1.
It then follows that
\lim_{ε→0} \lim_{n→∞} P[ω : |X_m(ω) − X(ω)| < ε for all m ≥ n] = 1,
and hence
P(\lim_{n→∞} X_n = X) = 1,
so that X_n \xrightarrow{a.c.} X as n → ∞. Now suppose that X_n \xrightarrow{a.c.} X as n → ∞ and
let ε > 0 and note that Equation (3.5) implies that
1 = P(\lim_{n→∞} X_n = X) = \lim_{ε→0} \lim_{n→∞} P[ω : |X_m(ω) − X(ω)| < ε for all m ≥ n] ≤ \lim_{n→∞} P[ω : |X_m(ω) − X(ω)| < ε for all m ≥ n],
so that
\lim_{n→∞} P[ω : |X_m(ω) − X(ω)| < ε for all m ≥ n] = 1,
and the result is proven.
Example 3.6. Suppose that {Un }∞ n=1 is a sequence of independent Uni-
form(0, 1) random variables and let U(1,n) be the smallest order statistic of
U1 , . . . , Un defined by U(1,n) = min{U1 , . . . , Un }. Let ε > 0, then because
U(1,n) ≥ 0 with probability one, it follows that
P (|U(1,m) − 0| < ε for all m ≥ n) = P (U(1,m) < ε for all m ≥ n)
= P (U(1,n) < ε),
where the second equality follows from the fact that if U(1,n) < ε then U(1,m) <
ε for all m ≥ n. Similarly, if U(1,m) < ε for all m ≥ n then U(1,n) < ε so that
it follows that the two events are equivalent. Now note that the independence
of the random variables in the sequence implies that when ε < 1,
P(U_{(1,n)} < ε) = 1 − P(U_{(1,n)} ≥ ε) = 1 − \prod_{k=1}^{n} P(U_k ≥ ε) = 1 − (1 − ε)^n.

Therefore, it follows that


\lim_{n→∞} P(U_{(1,m)} < ε for all m ≥ n) = \lim_{n→∞} [1 − (1 − ε)^n] = 1.
When ε ≥ 1 the probability equals one for all n ∈ N. Hence Theorem 3.1
implies that U_{(1,n)} \xrightarrow{a.c.} 0 as n → ∞. 
Theorem 3.1 is also useful in beginning to understand the relationship between
convergence in probability and almost certain convergence.
Theorem 3.2. Let {Xn }∞ n=1 be a sequence of random variables that converge
almost certainly to a random variable X as n → ∞. Then X_n \xrightarrow{p} X as n → ∞.
Proof. Suppose that {X_n}_{n=1}^∞ is a sequence of random variables that converge
almost certainly to a random variable X as n → ∞. Then Theorem 3.1 implies
that for every ε > 0,
\lim_{n→∞} P(|X_m − X| < ε for all m ≥ n) = 1.
Now note that
{ω : |X_m(ω) − X(ω)| < ε for all m ≥ n} ⊂ {ω : |X_n(ω) − X(ω)| < ε},
so that it follows that
P(|X_m(ω) − X(ω)| < ε for all m ≥ n) ≤ P(|X_n(ω) − X(ω)| < ε),
for each n ∈ N. Therefore
\lim_{n→∞} P(|X_m(ω) − X(ω)| < ε for all m ≥ n) ≤ \lim_{n→∞} P(|X_n(ω) − X(ω)| < ε),
which implies
\lim_{n→∞} P(|X_n(ω) − X(ω)| < ε) = 1.
Therefore Definition 3.1 implies that X_n \xrightarrow{p} X as n → ∞.

Theorem 3.2 makes it clear that almost certain convergence is potentially


a stronger concept of convergence of random variables than convergence in
probability. However, the question remains as to whether they may be equiv-
alent. The following example demonstrates that convergence in probability is
actually a weaker concept of convergence than almost certain convergence by
identifying a sequence of random variables that converges in probability, but
does not converge almost certainly.
Example 3.7. This example is a version of a popular example used before by
Serfling (1980), Royden (1988), and Halmos (1974). Consider a sample space
Ω = [0, 1] with a uniform probability measure P . That is, the probability
associated with the interval [a, b] ⊂ Ω is b − a. Let m(n) be a sequence of
intervals of the form [i/k, (i+1)/k] for i = 0, . . . , k−1 and k ∈ N, where m(1) =
[0, 1], m(2) = [0, 1/2], m(3) = [1/2, 1], m(4) = [0, 1/3], m(5) = [1/3, 2/3],
m(6) = [2/3, 1],. . . . See Figure 3.4. Define a sequence of random variables on
this sample space as Xn (ω) = δ{ω; m(n)} for ω ∈ [0, 1]. Let ε > 0, and note
that since X_n(ω) is 1 only on the interval m(n), P(|X_n| < ε) ≥ 1 − k(n),
where k(n), the length of m(n), is a sequence that converges to 0 as n → ∞. It
is then clear that for any ε > 0,
\lim_{n→∞} P(|X_n| < ε) = 1,
so that it follows that X_n \xrightarrow{p} 0 as n → ∞. Now consider any fixed ω ∈ [0, 1].
Note that for any n ∈ N there exists n′ ∈ N and n″ ∈ N such that n′ > n,
n″ > n, X_{n′}(ω) = 1, and X_{n″}(ω) = 0. Hence the limit of the sequence X_n(ω)
does not exist for any fixed ω ∈ [0, 1]. Since ω is arbitrary it follows that
{ω : \lim_{n→∞} X_n(ω) = 0} = ∅,
112 CONVERGENCE OF RANDOM VARIABLES

Figure 3.4 The first ten subsets of the unit interval used in Example 3.7.
and, therefore,

P[ lim_{n→∞} X_n(ω) = 0 ] = 0.
Therefore, this sequence does not converge almost certainly to 0. In fact, the
probability that Xn converges at all is zero. Note the fundamental difference
between the two modes of convergence demonstrated by this example. Con-
vergence in probability requires that Xn be arbitrarily close to X for a set of
ω whose probability limits to one, while almost certain convergence requires
that the set of ω for which Xn (ω) converges to X have probability one. This
latter set does not depend on n. □
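A short sketch, not part of the original text, can make this behavior concrete. Assuming NumPy is available, the code below constructs the sliding intervals m(1), m(2), . . . of Example 3.7 and evaluates X_n(ω) = δ{ω; m(n)} at one fixed ω, illustrating that the sequence keeps returning to both 0 and 1 even though the interval lengths shrink to zero.

import numpy as np

def intervals(n_max):
    # the intervals m(1), m(2), ... of Example 3.7: [0,1], [0,1/2], [1/2,1], [0,1/3], ...
    out, k = [], 1
    while len(out) < n_max:
        for i in range(k):
            out.append((i / k, (i + 1) / k))
            if len(out) == n_max:
                break
        k += 1
    return out

omega = 0.37                               # a fixed point of the sample space [0, 1]
m = intervals(30)
x = np.array([1 if a <= omega <= b else 0 for (a, b) in m])
print(x)                                   # contains a 1 in every block of intervals, so it has no limit
print([round(b - a, 3) for (a, b) in m])   # interval lengths, which converge to zero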

While almost certain convergence requires a stronger condition on the sequence of random variables than convergence in probability does, Theorem 3.2
indicates that there are sequences that converge in probability that also con-
verge almost certainly. Therefore, some sequences that converge in probability
have additional properties that allow these sequences to meet the stronger re-
quirements of a sequence that converges almost certainly. The quantification
of the properties of such sequences motivates an additional concept of conver-
gence defined by Hsu and Robbins (1947).
Definition 3.3. Let {X_n}_{n=1}^∞ be a sequence of random variables. The sequence converges completely to a random variable X if for every ε > 0,

Σ_{n=1}^∞ P(|X_n − X| > ε) < ∞.    (3.6)

This relationship is represented by X_n →^{c} X as n → ∞.

Complete convergence is stronger than convergence in probability since convergence in probability only requires that for every ε > 0, P(|X_n − X| > ε) converges to zero as n → ∞. Complete convergence not only requires that P(|X_n − X| > ε) converge to zero, it must do so at a rate fast enough to ensure that the infinite sum

Σ_{n=1}^∞ P(|X_n − X| > ε)

converges. For example, if P(|X_n − X| > ε) = n^{-1} then the sequence of random variables {X_n}_{n=1}^∞ would converge in probability to X, but not completely, as n → ∞. If P(|X_n − X| > ε) = n^{-2} then the sequence would converge both in probability and completely to X as n → ∞.
Example 3.8. Let U be a Uniform(0,1) random variable and define a sequence of random variables {X_n}_{n=1}^∞ such that X_n = δ{U; (0, n^{-2})}. Let ε > 0; then

P(X_n > ε) = 0 if ε ≥ 1, and P(X_n > ε) = n^{-2} if ε < 1.

Therefore, for every ε > 0,

Σ_{n=1}^∞ P(X_n > ε) ≤ Σ_{n=1}^∞ n^{-2} < ∞,

so that it follows from Definition 3.3 that X_n →^{c} 0 as n → ∞. □
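The practical meaning of the summability requirement can be illustrated with a brief simulation; the sketch below is not part of the original text and assumes NumPy is available. For each simulated value of U it counts how many of X_1, . . . , X_N are nonzero: because X_n = 1 only when U < n^{-2}, each realization contains only finitely many ones, which is consistent with the almost certain convergence guaranteed by Theorem 3.3.

import numpy as np

rng = np.random.default_rng(1)
N = 10_000                                 # truncation point for the infinite sequence
for u in rng.uniform(size=5):              # five independent realizations of U
    n = np.arange(1, N + 1)
    X = (u < n ** (-2.0)).astype(int)      # X_n = delta{U; (0, n^{-2})}
    print(round(u, 4), "number of nonzero terms:", int(X.sum()))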
Example 3.9. Let {θ̂_n}_{n=1}^∞ be a sequence of random variables such that E(θ̂_n) = c for all n ∈ N, where c ∈ R is a constant that does not depend on n. Suppose further that V(θ̂_n) = n^{-2}τ, where τ is a positive and finite constant that does not depend on n. Under these conditions, note that for any ε > 0, Theorem 2.7 implies that

P(|θ̂_n − c| > ε) ≤ ε^{-2} V(θ̂_n) = n^{-2} ε^{-2} τ.

Therefore, for every ε > 0,

Σ_{n=1}^∞ P(|θ̂_n − c| > ε) ≤ Σ_{n=1}^∞ n^{-2} ε^{-2} τ = ε^{-2} τ Σ_{n=1}^∞ n^{-2} < ∞,

where we have used the fact that ε and τ do not depend on n. Therefore, θ̂_n →^{c} c as n → ∞. □
While complete convergence is sufficient to ensure convergence in probability,
we must still investigate the relationship between complete convergence and
almost certain convergence. As it turns out, complete convergence also implies
almost certain convergence, and is one of the strongest concepts of convergence
of random variables that we will study in this book.
Theorem 3.3. Let {X_n}_{n=1}^∞ be a sequence of random variables that converges completely to a random variable X as n → ∞. Then X_n →^{a.c.} X as n → ∞.

Proof. There are several approaches to proving this result, including one based on Theorems 2.17 and 2.18; see Exercise 13. The approach we use here is that of Serfling (1980). Suppose that X_n →^{c} X as n → ∞. This method of proof shows that the complement of the event in Equation (3.2) has limiting probability zero, which in turn proves that the event has limiting probability one, and the almost certain convergence of the sequence then follows from Theorem 3.1. Note that the complement of {|X_m(ω) − X(ω)| < ε for all m ≥ n} contains all ω ∈ Ω where |X_m(ω) − X(ω)| ≥ ε for at least one m ≥ n. That is,

{|X_m(ω) − X(ω)| < ε for all m ≥ n}^c = {|X_m(ω) − X(ω)| ≥ ε for some m ≥ n}.

Now let ε > 0, and note that

P(ω : |X_m(ω) − X(ω)| > ε for some m ≥ n) = P( ∪_{m=n}^∞ {|X_m(ω) − X(ω)| > ε} ),

because |X_m(ω) − X(ω)| > ε must be true for at least one m ≥ n. Theorem 2.4 then implies that

P( ∪_{m=n}^∞ {|X_m(ω) − X(ω)| > ε} ) ≤ Σ_{m=n}^∞ P(ω : |X_m(ω) − X(ω)| > ε).

Definition 3.3 implies that for every ε > 0,

Σ_{n=1}^∞ P(ω : |X_n(ω) − X(ω)| > ε) < ∞.

In order for this sum to converge, the limit

lim_{n→∞} Σ_{m=n}^∞ P(ω : |X_m(ω) − X(ω)| > ε) = 0

must hold. Therefore, Definition 2.2 implies that for every ε > 0,

lim_{n→∞} P(ω : |X_m(ω) − X(ω)| > ε for some m ≥ n) = 0.

Because the probabilities of an event and its complement always add to one,
the probability of the complement of the event given earlier must converge to one. That is,

lim_{n→∞} P(ω : |X_m(ω) − X(ω)| ≤ ε for all m ≥ n) = 1.

Theorem 3.1 then implies that X_n →^{a.c.} X as n → ∞. □
The result in Theorem 3.3, coupled with Definition 3.3, essentially implies that
if a sequence {Xn }∞n=1 converges in probability to X as n → ∞ at a sufficiently
fast rate, then the sequence will also converge almost certainly to X. The fact
that complete convergence is not equivalent to almost certain convergence is
established by Example 3.5. The sequence of random variables is shown in that
example to converge almost certainly, but because P (ω : |Xn (ω) − X(ω)| >
ε) = n−1 if ε < 1, the sequence does not converge completely.
There are some conditions under which almost certain convergence and com-
plete convergence are equivalent.
Theorem 3.4. Let {X_n}_{n=1}^∞ be a sequence of independent random variables, and let c be a real constant. If X_n →^{a.c.} c as n → ∞ then X_n →^{c} c as n → ∞.
Proof. Suppose that X_n →^{a.c.} c as n → ∞, where c is a real constant. Then, for every ε > 0,

lim_{n→∞} P(|X_m − c| ≤ ε for all m ≥ n) = 1,

or equivalently,

lim_{n→∞} P(|X_m − c| > ε for at least one m ≥ n) = 0.

Note that

lim_{n→∞} P(|X_m − c| > ε for at least one m ≥ n) = lim_{n→∞} P( ∪_{m=n}^∞ {|X_m − c| > ε} ) = P( ∩_{n=1}^∞ ∪_{m=n}^∞ {|X_m − c| > ε} ) = P( limsup_{n→∞} {|X_n − c| > ε} ),    (3.7)

where the second equality follows from Theorem 2.16 and the fact that

{ ∪_{m=n}^∞ {|X_m − c| > ε} }_{n=1}^∞

is a monotonically decreasing sequence of events. Now note that since {X_n}_{n=1}^∞ is a sequence of independent random variables, it follows that {|X_n − c| > ε}_{n=1}^∞ is a sequence of independent events for every ε > 0. Therefore, Corollary 2.1 implies that since

P( limsup_{n→∞} {|X_n − c| > ε} ) = 0

for every ε > 0, then

Σ_{n=1}^∞ P(|X_n − c| > ε) < ∞

for every ε > 0. Definition 3.3 then implies that X_n →^{c} c as n → ∞. □
Note that Theorem 3.4 obtains an equivalence between almost certain convergence and complete convergence. When a sequence converges in probability at a fast enough rate to a constant, convergence in probability and almost certain convergence are equivalent. Such a result was the main motivation of Hsu and Robbins (1947). Note further that convergence to a constant plays an important role in the proof and the application of Corollary 2.1. If the limit X is a random variable rather than a constant, then the sequence of events {|X_n − X| > ε}_{n=1}^∞ is not necessarily independent, and Corollary 2.1 cannot be applied to it. But when the limit is a constant c, the independence of {X_n}_{n=1}^∞ implies that the events {|X_n − c| > ε}_{n=1}^∞ are independent, and Corollary 2.1 can be applied.
As with convergent sequences of real numbers, subsequences of convergent
sequences of random variables can play an important role in the development
of asymptotic theory.
Theorem 3.5. Let {X_n}_{n=1}^∞ be a sequence of random variables that converges in probability to a random variable X. Then there exists a non-decreasing sequence of positive integers {n_k}_{k=1}^∞ such that X_{n_k} →^{c} X and X_{n_k} →^{a.c.} X as k → ∞.

For a proof of Theorem 3.5, see Chapter 5 of Gut (2005).


Example 3.10. Let {U_n}_{n=1}^∞ be a sequence of independent Uniform(0,1) random variables and define a sequence of random variables {X_n}_{n=1}^∞ as X_n = δ{U_n; (0, n^{-1})}, so that X_n →^{p} 0 as n → ∞. Let ε > 0 and note that when ε < 1 it follows that P(|X_n − 0| > ε) = n^{-1}, and therefore

Σ_{n=1}^∞ P(|X_n − 0| > ε) = Σ_{n=1}^∞ n^{-1} = ∞.

Hence, X_n does not converge completely to 0 as n → ∞. Now define a non-decreasing sequence of positive integers n_k = k^2. In this case P(|X_{n_k} − 0| > ε) = k^{-2} and therefore

Σ_{k=1}^∞ P(|X_{n_k} − 0| > ε) = Σ_{k=1}^∞ k^{-2} < ∞.

Therefore X_{n_k} →^{c} 0 as k → ∞. Hence, we have found a subsequence that converges completely to zero. □
It is important to note that Theorem 3.5 is not constructive in that it does
not identify a particular subsequence that will converge completely to the
random variable of interest. Theorem 3.5 is an existence result in that it
merely guarantees the existence of at least one subsequence that converges
CONVERGENCE OF RANDOM VECTORS 117
completely. Such a result may not be seen as being useful at first, but there
are several important applications of results such as these. It is sometimes
possible to prove properties for the entire sequence by working with properties
of the subsequence.
Example 3.11. This example, which is based on Theorem 3.5 of Gut (2005), highlights how convergent subsequences can be used to prove properties of the entire sequence. Let {X_n}_{n=1}^∞ be a sequence of monotonically increasing random variables that converges in probability to a random variable X as n → ∞. That is, P(X_n ≤ X_{n+1}) = 1 for all n ∈ N and X_n →^{p} X as n → ∞. Note that this also implies that P(X_n ≤ X) = 1 for all n ∈ N; see Exercise 14. This example demonstrates that the monotonicity of the sequence, along with the convergence in probability, is enough to conclude that X_n converges almost certainly to X as n → ∞. Theorem 3.5 implies that there exists a monotonically increasing sequence of integers {n_k}_{k=1}^∞ such that X_{n_k} →^{a.c.} X as k → ∞. Theorem 3.1 then implies that for every ε > 0,

lim_{k→∞} P(|X_{n_{k′}} − X| ≤ ε for all k′ ≥ k) = 1.

We now appeal to the monotonicity of the sequence. Let M_1 = {ω : X_n(ω) ≤ X_{n+1}(ω) for all n ∈ N}, M_2 = {ω : X_n(ω) ≤ X(ω) for all n ∈ N}, and M = M_1 ∩ M_2. By assumption P(M_1) = P(M_2) = 1. Theorems 2.4 and A.2 imply that P(M^c) = P(M_1^c ∪ M_2^c) ≤ P(M_1^c) + P(M_2^c) = 0, so that P(M) = 1. Suppose that ω ∈ M and

ω ∈ {|X_{n_{k′}}(ω) − X(ω)| ≤ ε for all k′ ≥ k}.

The monotonicity of the sequence implies that |X_{n_{k′}}(ω) − X(ω)| = X(ω) − X_{n_{k′}}(ω), so that |X_{n_{k′}}(ω) − X(ω)| ≤ ε implies that X(ω) − X_{n_{k′}}(ω) ≤ ε. Note further that because the sequence is monotonically increasing, if X(ω) − X_{n_{k′}}(ω) ≤ ε then X(ω) − X_n(ω) ≤ ε for all n ≥ n_{k′} and ω ∈ M. In fact, the two events are equivalent, so that it follows that

lim_{n→∞} P(|X_m − X| ≤ ε for all m ≥ n) = 1,

and therefore X_n →^{a.c.} X as n → ∞. □
Further applications of Theorem 3.5 can be found in Simmons (1971). Another application is given in Exercise 19.

3.4 Convergence of Random Vectors

This section will investigate how the three modes of convergence studied in
Sections 3.2 and 3.3 can be applied to random vectors. Let {X_n}_{n=1}^∞ be a sequence of d-dimensional random vectors and let X be another d-dimensional random vector. For an arbitrary d-dimensional vector x′ = (x_1, . . . , x_d) ∈ R^d,
let ‖x‖ be the usual vector norm in d-dimensional Euclidean space defined by

‖x‖ = ( Σ_{i=1}^d x_i^2 )^{1/2}.

When d = 1 the norm reduces to the absolute value of x, that is, ‖x‖ = |x|. Therefore, we can generalize the one-dimensional requirement that |X_n(ω) − X(ω)| > ε to ‖X_n(ω) − X(ω)‖ > ε in the d-dimensional case.
Definition 3.4. Let {X_n}_{n=1}^∞ be a sequence of d-dimensional random vectors and let X be another d-dimensional random vector.

1. The sequence {X_n}_{n=1}^∞ converges in probability to X as n → ∞ if for every ε > 0,

lim_{n→∞} P(‖X_n − X‖ ≥ ε) = 0.

2. The sequence {X_n}_{n=1}^∞ converges almost certainly to X as n → ∞ if

P( lim_{n→∞} X_n = X ) = 1.

3. The sequence {X_n}_{n=1}^∞ converges completely to X as n → ∞ if for every ε > 0,

Σ_{n=1}^∞ P(‖X_n − X‖ ≥ ε) < ∞.

While Definition 3.4 is intuitively appealing, it is not easy to apply in practice. The definition can be simplified by finding an equivalent condition on the convergence of the individual elements of the random vector. Recall from Definition 2.5 that X is a random vector if it is a measurable function that maps the sample space Ω to B{R^d}. That is, X^{-1}(B) = {ω ∈ Ω : X(ω) ∈ B} ∈ F for all B ∈ B{R^d}. Note that X′ = (X_1, . . . , X_d), where X_i is a measurable function that maps Ω to R for i = 1, . . . , d. That is, X_i^{-1}(B) = {ω ∈ Ω : X_i(ω) ∈ B} ∈ F for all B ∈ B{R}. Therefore a d-dimensional random vector is made up of d one-dimensional random variables. Similarly, let X_n′ = (X_{1,n}, . . . , X_{d,n}) have the same measurability properties for each of its components for all n ∈ N.
Suppose ε > 0, and consider the inequality ‖X_n(ω) − X(ω)‖ < ε for a specified ω ∈ Ω. The inequality implies that |X_{i,n}(ω) − X_i(ω)| < ε for i = 1, . . . , d. Therefore,

{ω : ‖X_n(ω) − X(ω)‖ < ε} ⊂ ∩_{i=1}^d {ω : |X_{i,n}(ω) − X_i(ω)| < ε}.    (3.8)

This turns out to be the essential relationship required to establish that the
convergence of a random vector is equivalent to the convergence of the indi-
vidual elements of the random vector.
Theorem 3.6. Let {X_n}_{n=1}^∞ be a sequence of d-dimensional random vectors and let X be another d-dimensional random vector, where X′ = (X_1, . . . , X_d) and X_n′ = (X_{1,n}, . . . , X_{d,n}) for all n ∈ N.

1. X_n →^{p} X as n → ∞ if and only if X_{k,n} →^{p} X_k as n → ∞ for all k ∈ {1, . . . , d}.

2. X_n →^{a.c.} X as n → ∞ if and only if X_{k,n} →^{a.c.} X_k as n → ∞ for all k ∈ {1, . . . , d}.

Proof. We will prove this result for convergence in probability; the remaining result is proven in Exercise 15. Suppose that X_n →^{p} X as n → ∞. Then, from Definition 3.4 it follows that for every ε > 0,

lim_{n→∞} P(‖X_n − X‖ ≤ ε) = 1,

which in turn implies that

lim_{n→∞} P( ∩_{i=1}^d {ω : |X_{i,n}(ω) − X_i(ω)| ≤ ε} ) = 1,

where we have used the relationship in Equation (3.8). Now let k ∈ {1, . . . , d}. Theorem 2.3 implies that since

∩_{i=1}^d {ω : |X_{i,n}(ω) − X_i(ω)| ≤ ε} ⊂ {ω : |X_{k,n}(ω) − X_k(ω)| ≤ ε}

for all n ∈ N, it follows that

P( ∩_{i=1}^d {ω : |X_{i,n}(ω) − X_i(ω)| ≤ ε} ) ≤ P(ω : |X_{k,n}(ω) − X_k(ω)| ≤ ε)

for all n ∈ N. Therefore,

lim_{n→∞} P( ∩_{i=1}^d {ω : |X_{i,n}(ω) − X_i(ω)| ≤ ε} ) ≤ lim_{n→∞} P(ω : |X_{k,n}(ω) − X_k(ω)| ≤ ε).

Definition 2.2 then implies that

lim_{n→∞} P(ω : |X_{k,n}(ω) − X_k(ω)| ≤ ε) = 1,

and Definition 3.1 implies that X_{k,n} →^{p} X_k as n → ∞. Because k is arbitrary in the above arguments, we have proven that X_n →^{p} X as n → ∞ implies that X_{k,n} →^{p} X_k as n → ∞ for all k ∈ {1, . . . , d}, or that the convergence of the random vector in probability implies the convergence in probability of the elements of the random vector.
Now suppose that X_{k,n} →^{p} X_k as n → ∞ for all k ∈ {1, . . . , d}, and let ε > 0 be given. Then Definition 3.1 implies that

lim_{n→∞} P(ω : |X_{k,n}(ω) − X_k(ω)| > d^{-1}ε) = 0,

for k ∈ {1, . . . , d}. Theorem 2.4 implies that

P( ∪_{i=1}^d {ω : |X_{i,n}(ω) − X_i(ω)| > d^{-1}ε} ) ≤ Σ_{i=1}^d P(ω : |X_{i,n}(ω) − X_i(ω)| > d^{-1}ε)

for each n ∈ N. Therefore,

lim_{n→∞} P( ∪_{i=1}^d {ω : |X_{i,n}(ω) − X_i(ω)| > d^{-1}ε} ) ≤ Σ_{i=1}^d lim_{n→∞} P(ω : |X_{i,n}(ω) − X_i(ω)| > d^{-1}ε) = 0.

Now, because ‖X_n(ω) − X(ω)‖ ≤ Σ_{i=1}^d |X_{i,n}(ω) − X_i(ω)|,

∩_{i=1}^d {ω : |X_{i,n}(ω) − X_i(ω)| ≤ d^{-1}ε} ⊂ {ω : ‖X_n(ω) − X(ω)‖ ≤ ε},

so that

P( ∩_{i=1}^d {ω : |X_{i,n}(ω) − X_i(ω)| ≤ d^{-1}ε} ) ≤ P(ω : ‖X_n(ω) − X(ω)‖ ≤ ε).

Hence, Theorem A.2 then implies that

lim_{n→∞} P(ω : ‖X_n(ω) − X(ω)‖ ≤ ε) ≥ lim_{n→∞} P( ∩_{i=1}^d {ω : |X_{i,n}(ω) − X_i(ω)| ≤ d^{-1}ε} ) = 1.

Therefore Definition 3.4 implies that X_n →^{p} X as n → ∞, and the result is proven. □
Example 3.12. As a simple example of these results consider a set of in-
dependent and identically distributed random variables X1 , . . . , Xn having
distribution F with mean µ and another set of independent and identically
distributed random variables Y1 , . . . , Yn having distribution G with mean ν.
Let X̄_n and Ȳ_n be the sample means of the two samples, respectively. The results from Example 3.2 can be used to show that X̄_n →^{p} µ and Ȳ_n →^{p} ν as n → ∞ as long as F and G both have finite variances. Now define Z_i′ = (X_i, Y_i)
for i = 1, . . . , n and

Z̄_n = n^{-1} Σ_{i=1}^n Z_i = (X̄_n, Ȳ_n)′.

Theorem 3.6 implies that Z̄_n →^{p} θ as n → ∞, where θ′ = (µ, ν). Therefore,
the result of Example 3.2 has been extended to the bivariate case. Note that
we have not assumed anything specific about the joint distribution of Xi
and Yi , other than that the covariance must be finite which follows from the
assumption that the variances of the marginal distributions are finite. 
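A brief simulation sketch, not part of the original text, illustrates Example 3.12 for one convenient choice of F and G: exponential marginals with means µ = 1 and ν = 2, generated with NumPy. The vector of sample means settles near θ′ = (µ, ν) as n grows.

import numpy as np

rng = np.random.default_rng(2)
mu, nu = 1.0, 2.0
for n in [10, 100, 10_000]:
    X = rng.exponential(scale=mu, size=n)  # X_1, ..., X_n with mean mu
    Y = rng.exponential(scale=nu, size=n)  # Y_1, ..., Y_n with mean nu
    Z_bar = np.array([X.mean(), Y.mean()]) # componentwise sample means
    print(n, np.round(Z_bar, 3))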

Example 3.13. Let U be a Uniform[0,1] random variable and define a sequence of random vectors {X_n}_{n=1}^∞ such that X_n′ = (X_{1,n}, X_{2,n}, X_{3,n}) for all n ∈ N, where X_{1,n} = δ{U; (0, n^{-1})}, X_{2,n} = ½δ{U; (0, 1 − n^{-1})}, and X_{3,n} = δ{U; (0, 1 − n^{-2})}. From Definition 3.2 it follows that X_{1,n} →^{a.c.} 0, X_{2,n} →^{a.c.} ½, and X_{3,n} →^{a.c.} 1 as n → ∞. Therefore, Theorem 3.6 implies that X_n →^{a.c.} (0, ½, 1)′ as n → ∞. □
Example 3.14. Let {X_n}_{n=1}^∞, {Y_n}_{n=1}^∞, and {Z_n}_{n=1}^∞ be sequences of random variables that converge in probability to the random variables X, Y, and Z, respectively, as n → ∞. Suppose that X has a N(θ_x, σ_x^2) distribution, Y has a N(θ_y, σ_y^2) distribution, and Z has a N(θ_z, σ_z^2) distribution, where X, Y and Z are independent. Define a sequence of random vectors {W_n}_{n=1}^∞ as W_n = (X_n, Y_n, Z_n)′ for all n ∈ N and let W = (X, Y, Z)′. Then Theorem 3.6 implies that W_n →^{p} W as n → ∞, where the independence of the components of W implies that W has a N(µ, Σ) distribution with µ = (θ_x, θ_y, θ_z)′ and Σ = diag(σ_x^2, σ_y^2, σ_z^2). □

3.5 Continuous Mapping Theorems

When considering a sequence of constants, a natural result that arises is that if a sequence converges to a point c ∈ R, and we apply a continuous real
function g to the sequence, then it follows that the transformed sequence
converges to g(c). This is the result given in Theorem 1.3. In some sense,
this is the essential property of continuous functions. Extension of this result
to sequences of random variables can prove to be very useful. For example,
suppose that we are able to prove that under certain conditions the sample
variance is a consistent estimator of the population variance as the sample
size increases to ∞. That is, the sample variance converges in probability to
the population variance. Can we automatically conclude that the square root
of the sequence converges in probability to the square root of the limit? That
is, does it follow that the sample standard deviation converges in probability to the population standard deviation, or that the sample standard deviation
is a consistent estimator of the population standard deviation? As we show
in this section, such operations are usually permissible under the additional
assumption that the transforming function g is continuous with probability
one with respect to the distribution of the limiting random variable. We begin
by considering the simple case where a sequence of random variables converges
to a real constant. In this case the transformation need only be continuous at
the constant.
Theorem 3.7. Let {X_n}_{n=1}^∞ be a sequence of random variables, c be a real constant, and g be a Borel function that is continuous at c.

1. If X_n →^{a.c.} c as n → ∞, then g(X_n) →^{a.c.} g(c) as n → ∞.

2. If X_n →^{p} c as n → ∞, then g(X_n) →^{p} g(c) as n → ∞.

Proof. We will prove the second result of the theorem, leaving the proof of the first part as Exercise 18. Suppose that X_n →^{p} c as n → ∞, so that Definition 3.1 implies that for every δ > 0,

lim_{n→∞} P(|X_n − c| < δ) = 1.

Because g is a continuous function at the point c, it follows that for every ε > 0 there exists a real number δ_ε > 0 such that |x − c| < δ_ε implies that |g(x) − g(c)| < ε, where x is a real number. This in turn implies that for each n ∈ N,

{ω ∈ Ω : |X_n(ω) − c| < δ_ε} ⊂ {ω ∈ Ω : |g[X_n(ω)] − g(c)| < ε},

and therefore Theorem 2.3 implies that for each n ∈ N,

P(|X_n − c| < δ_ε) ≤ P(|g(X_n) − g(c)| < ε).

Therefore, for every ε > 0,

lim_{n→∞} P(|g(X_n) − g(c)| < ε) ≥ lim_{n→∞} P(|X_n − c| < δ_ε) = 1,

and therefore g(X_n) →^{p} g(c) as n → ∞. □
Example 3.15. Let {X_n}_{n=1}^∞ be a sequence of independent random variables where X_n has a Poisson(θ) distribution. Let X̄_n be the sample mean computed on X_1, . . . , X_n. Example 3.2 implies that X̄_n →^{p} θ as n → ∞. If we wish to find a consistent estimator of the standard deviation of X_n, which is θ^{1/2}, we can consider X̄_n^{1/2}. Theorem 3.7 implies that, since the square root transformation is continuous at θ when θ > 0, X̄_n^{1/2} →^{p} θ^{1/2} as n → ∞. □
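A quick numerical sketch of Example 3.15, not part of the original text, uses NumPy and the assumed value θ = 4 (so that θ^{1/2} = 2).

import numpy as np

rng = np.random.default_rng(3)
theta = 4.0
for n in [10, 100, 10_000]:
    x = rng.poisson(lam=theta, size=n)
    print(n, round(float(np.sqrt(x.mean())), 4))   # approaches theta**0.5 = 2 as n grows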

Example 3.16. Let {X_n}_{n=1}^∞ be a sequence of independent random variables where X_n has a N(0,1) distribution. Let X̄_n be the sample mean computed on X_1, . . . , X_n. Example 3.2 implies that X̄_n →^{p} 0 as n → ∞. Consider the function g(x) = δ{x; {0}} and note that P[g(X̄_n) = 0] = P(X̄_n ≠ 0) = 1 for all n ∈ N. Therefore, it is clear that g(X̄_n) →^{p} 0 as n → ∞. However, note that g(0) = 1, so that in this case g(X̄_n) does not converge in probability to g(0). This, of course, is due to the discontinuity of g at 0. □
Theorem 3.7 can be extended to the case where X_n converges to a random variable X, instead of a real constant, using essentially the same argument of proof, if we are willing to assume that g is uniformly continuous. This is due to the fact that when g is uniformly continuous, for every ε > 0 there exists a real number δ_ε > 0 such that |x − y| < δ_ε implies |g(x) − g(y)| < ε, no matter what values of x and y are considered. This implies that

P(|X_n − X| ≤ δ_ε) ≤ P(|g(X_n) − g(X)| ≤ ε)

for all n ∈ N, and the corresponding results follow. However, the assumption of uniform continuity turns out to be needlessly strong, and in fact can be weakened to the assumption that the transformation is continuous with probability one with respect to the distribution of the limiting random variable.
Theorem 3.8. Let {X_n}_{n=1}^∞ be a sequence of random variables, X be a random variable, and g be a Borel function on R. Let C(g) be the set of continuity points of g and suppose that P[X ∈ C(g)] = 1.

1. If X_n →^{a.c.} X as n → ∞, then g(X_n) →^{a.c.} g(X) as n → ∞.

2. If X_n →^{p} X as n → ∞, then g(X_n) →^{p} g(X) as n → ∞.

Proof. We will prove the first result in this case; see Exercise 19 for a proof of the second result. Definition 3.2 implies that if X_n →^{a.c.} X as n → ∞ then

P[ω : lim_{n→∞} X_n(ω) = X(ω)] = 1.

Let

N = {ω ∈ Ω : lim_{n→∞} X_n(ω) = X(ω)},

and note that by assumption P(N) = P[C(g)] = 1. Consider ω ∈ N ∩ C(g). For such ω it follows from Theorem 1.3 that

lim_{n→∞} g[X_n(ω)] = g[X(ω)].

Noting that P[N ∩ C(g)] = 1 yields the result. □
Example 3.17. Let {X_n}_{n=1}^∞ be a sequence of random variables where X_n = Z + n^{-1} for all n ∈ N, and Z is a N(0,1) random variable. As was shown in Example 3.1, we have that X_n →^{p} Z as n → ∞. The transformation g(x) = x^2 is continuous with probability one with respect to the normal distribution, and hence it follows from Theorem 3.8 that X_n^2 →^{p} Y = Z^2, where Y is a random variable with a ChiSquared(1) distribution. □

The extension of Theorem 3.8 to the case of multivariate transformations of random vectors is almost transparent. The arguments simply consist of
applying Theorem 3.8 element-wise to the random vectors and appealing to
Theorem 3.6 to obtain the convergence of the corresponding random vectors.
Note that we also are using the fact that a Borel function of a random vector
is also a random vector from Theorem 2.2.
Theorem 3.9. Let {X_n}_{n=1}^∞ be a sequence of d-dimensional random vectors, X be a d-dimensional random vector, and g : R^d → R^q be a Borel function. Let C(g) be the set of continuity points of g and suppose that P[X ∈ C(g)] = 1.

1. If X_n →^{a.c.} X as n → ∞, then g(X_n) →^{a.c.} g(X) as n → ∞.

2. If X_n →^{p} X as n → ∞, then g(X_n) →^{p} g(X) as n → ∞.

Theorem 3.9 is proven in Exercise 20.
Example 3.18. Suppose that X_1, . . . , X_n is a sample from a N(µ, σ^2) distribution. Example 3.2 shows that X̄_n is a consistent estimator of µ and Exercise 7 shows that the sample variance S_n^2 is a consistent estimator of σ^2. Define a parameter vector θ = (µ, σ^2)′ along with a sequence of random vectors given by θ̂_n = (X̄_n, S_n^2)′ for all n ∈ N, as a sequence of estimators of θ. Theorem 3.6 implies that θ̂_n →^{p} θ as n → ∞. The αth quantile of a N(µ, σ^2) distribution is given by µ + σz_α, where z_α is the αth quantile of a N(0,1) distribution. This quantile can be estimated from the sample with X̄_n + S_n z_α, which is a continuous transformation of the sequence of random vectors {θ̂_n}_{n=1}^∞. Therefore, Theorem 3.9 implies that X̄_n + S_n z_α →^{p} µ + σz_α as n → ∞, or that the estimator is consistent. □
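The plug-in quantile estimator of Example 3.18 is easy to check numerically. The sketch below is not part of the original text; it uses NumPy with the assumed values µ = 10, σ = 2, and α = 0.95, for which µ + σz_α ≈ 13.29.

import numpy as np

rng = np.random.default_rng(4)
mu, sigma, z_alpha = 10.0, 2.0, 1.6449     # 1.6449 is the 0.95 quantile of a N(0, 1) distribution
for n in [25, 250, 25_000]:
    x = rng.normal(loc=mu, scale=sigma, size=n)
    estimate = x.mean() + x.std(ddof=1) * z_alpha
    print(n, round(float(estimate), 3), "target:", round(mu + sigma * z_alpha, 3))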
Example 3.19. Consider the result of Example 3.14, where a sequence of random vectors {W_n}_{n=1}^∞ was created that converged in probability to a random vector W that has a N(µ, Σ) distribution, where µ = (θ_x, θ_y, θ_z)′ and Σ = diag(σ_x^2, σ_y^2, σ_z^2). Let 1 be a 3 × 1 vector where each element is equal to 1. Then Theorem 3.9 implies that 1′W_n →^{p} 1′W, where 1′W has a N(θ_x + θ_y + θ_z, σ_x^2 + σ_y^2 + σ_z^2) distribution. In the case where µ = 0 and Σ = I, Theorem 3.9 also implies that W_n′W_n converges in probability to the random variable W′W, which has a ChiSquared(3) distribution.

3.6 Laws of Large Numbers

Example 3.2 discussed some general conditions under which an estimator θ̂n
of a parameter θ converges in probability to θ as n → ∞. In the special
case where the estimator is the sample mean calculated from a sequence of
independent and identically distributed random variables, Example 3.2 states
that the sample mean will converge in probability to the population mean
as long as the variance of the population is finite. This result is often called
the Weak Law of Large Numbers. The purpose of this section is to explore
other versions of this result. In particular we will consider alternate sets of
conditions under which the result remains the same. We will also consider
under what conditions the result can be strengthened to the Strong Law of
Large Numbers, for which the sample mean converges almost certainly to the
population mean. The first result given in Theorem 3.10 below shows that the
assumption that the variance of the population is finite can be removed as
long as the mean of the population exists and is finite.
Theorem 3.10 (Weak Law of Large Numbers). Let X1 , . . . , Xn be a set of
independent and identically distributed random variables from a distribution
F with finite mean θ and let X̄n be the sample mean computed on the random
variables. Then X̄_n →^{p} θ as n → ∞.

Proof. Because there is no assumption that the variance of X_1 is finite, Theorem 2.7 cannot be used directly to establish the result. Instead, we will apply
Theorem 2.7 to a version of X1 whose values are truncated at a finite value.
This will insure that the variance of the truncated random variables is finite.
The remainder of the proof is then concerned with showing that the results
obtained for the truncated random variables can be translated into equivalent
results for the original random variables.
This proof is based on one used in Section 6.3 of Gut (2005) with some mod-
ifications. A simpler proof, based on characteristic functions, will be given in
Chapter 3, where some additional results will provide a simpler argument.
However, the proof given here is worth considering as several important con-
cepts will be used that will also prove useful later.
For simplicity we begin by assuming that θ = 0. Define a truncated version of X_k as Y_k = X_k δ{|X_k|; [0, nε^3]} for all k ∈ N, where ε > 0 is arbitrary. That is, Y_k will be equal to X_k when |X_k| ≤ nε^3, but will be zero otherwise. Define the partial sums

S_n = Σ_{k=1}^n X_k

and

T_n = Σ_{k=1}^n Y_k.
We will now consider the asymptotic relationship between S_n and E(T_n). Definition 2.2 implies that

P(|S_n − E(T_n)| > nε) = P({|S_n − E(T_n)| > nε} ∩ A) + P({|S_n − E(T_n)| > nε} ∩ A^c),    (3.9)

where

A = ∩_{k=1}^n {|X_k| ≤ nε^3}.

Considering the first term in Equation (3.9) we note that if the event A is true, then S_n = T_n because none of the random variables are truncated. That is, the events {|S_n − E(T_n)| > nε} ∩ A and {|T_n − E(T_n)| > nε} ∩ A are the
same. Therefore
P({|S_n − E(T_n)| > nε} ∩ A) = P({|T_n − E(T_n)| > nε} ∩ A) ≤ P(|T_n − E(T_n)| > nε),

where the inequality follows from Theorem 2.3. Regardless of the distribution of S_n, the variance of T_n must be finite due to the fact that the support of Y_k is finite for all k ∈ {1, . . . , n}. Therefore, Theorem 2.7 can be applied to T_n to obtain

P(|T_n − E(T_n)| > nε) ≤ V(T_n)/(n^2 ε^2).

Now

V(T_n) = V( Σ_{k=1}^n Y_k ) = nV(Y_1).

Therefore, Theorem 2.5 implies

P(|T_n − E(T_n)| > nε) ≤ (nε^2)^{-1} V(Y_1) ≤ n^{-1} ε^{-2} E(Y_1^2).

Now, recall that Y_1 = X_1 δ{|X_1|; [0, nε^3]}, so that

E(Y_1^2) = E(X_1^2 δ{|X_1|; [0, nε^3]}) ≤ E[(nε^3)|X_1|δ{|X_1|; [0, nε^3]}] = nε^3 E(|X_1|δ{|X_1|; [0, nε^3]}) ≤ nε^3 E(|X_1|),

where we have used Theorem A.7 and the facts that

X_1^2(ω)δ{|X_1(ω)|; [0, nε^3]} ≤ nε^3 |X_1(ω)|δ{|X_1(ω)|; [0, nε^3]}

and

|X_1(ω)|δ{|X_1(ω)|; [0, nε^3]} ≤ |X_1(ω)|

for all ω ∈ Ω. Therefore

P(|T_n − E(T_n)| > nε) ≤ n^{-1} ε^{-2} (nε^3) E(|X_1|) = εE(|X_1|).    (3.10)

To find a bound on the second term in Equation (3.9), first note that

A^c = ( ∩_{k=1}^n {|X_k| ≤ nε^3} )^c = ∪_{k=1}^n {|X_k| ≤ nε^3}^c = ∪_{k=1}^n {|X_k| > nε^3},

which follows from Theorem A.2. Therefore, Theorems 2.3 and 2.4 and the fact that the random variables are identically distributed imply

P({|S_n − E(T_n)| > nε} ∩ A^c) ≤ P( ∪_{k=1}^n {|X_k| > nε^3} ) ≤ Σ_{k=1}^n P(|X_k| > nε^3) = nP(|X_1| > nε^3).    (3.11)
Combining the results of Equations (3.9)–(3.11) implies that

P(|S_n − E(T_n)| > nε) ≤ εE(|X_1|) + nP(|X_1| > nε^3).

Let G be the distribution of |X_1|, and note that Theorem A.7 implies that

nP(|X_1| > nε^3) = n ∫_{nε^3}^∞ dG(t) = ε^{-3} ∫_{nε^3}^∞ nε^3 dG(t) ≤ ε^{-3} ∫_{nε^3}^∞ t dG(t).

Since we have assumed that E(|X_1|) < ∞, it follows that

lim_{n→∞} ∫_{nε^3}^∞ t dG(t) = 0.

Therefore, Theorem 1.6 implies that

limsup_{n→∞} P(|S_n − E(T_n)| > nε) ≤ εE(|X_1|).

We use the limit supremum instead of the usual limit since we do not yet know whether the sequence converges or not. Equivalently, we have shown that

limsup_{n→∞} P(|n^{-1}S_n − n^{-1}E(T_n)| > ε) ≤ εE(|X_1|).

Because ε > 0 is arbitrary it follows that we have shown that n^{-1}S_n − n^{-1}E(T_n) →^{p} 0 as n → ∞; see Exercise 22. It is tempting to want to conclude that n^{-1}S_n →^{p} n^{-1}E(T_n) as n → ∞, but this is not permissible since the limit value in such a statement depends on n. To finish the proof we note that Theorem 2.12, together with the fact that E(X_k) = 0 implies E(X_k δ{|X_k|; [0, nε^3]}) = −E(X_k δ{|X_k|; (nε^3, ∞)}), yields

|E(T_n)| = |E( Σ_{k=1}^n Y_k )| = |E( Σ_{k=1}^n X_k δ{|X_k|; [0, nε^3]} )| ≤ nE(|X_1|δ{|X_1|; (nε^3, ∞)}).

Therefore

lim_{n→∞} n^{-1}|E(T_n)| ≤ lim_{n→∞} E(|X_1|δ{|X_1|; (nε^3, ∞)}) = 0,

since for every ω ∈ Ω,

lim_{n→∞} |X_1(ω)|δ{|X_1(ω)|; (nε^3, ∞)} = 0.

Theorem 3.8 then implies that n^{-1}S_n →^{p} 0 as n → ∞, and the result is proven for θ = 0. If θ ≠ 0 one can simply use this same proof for the transformed random variables X_k − θ to make the conclusion that n^{-1}S_n − θ →^{p} 0 as n → ∞, or equivalently that n^{-1}S_n →^{p} θ as n → ∞. □
Example 3.20. Suppose that {X_n}_{n=1}^∞ is a sequence of independent random variables where X_n has a T(2) distribution. The variance of X_n is not finite, but Theorem 3.10 still applies to this case, and we can therefore still conclude that X̄_n →^{p} 0 as n → ∞. □
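A small sketch, not part of the original text, illustrates Example 3.20 using NumPy: the running means of T(2) observations settle near zero even though the variance is infinite.

import numpy as np

rng = np.random.default_rng(5)
x = rng.standard_t(df=2, size=100_000)     # heavy tails: finite mean, infinite variance
running_means = np.cumsum(x) / np.arange(1, x.size + 1)
for n in [10, 100, 1_000, 10_000, 100_000]:
    print(n, round(float(running_means[n - 1]), 4))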

The Strong Law of Large Numbers keeps the same essential result as Theorem
3.10 except that the mode of convergence is strengthened from convergence
in probability to almost certain convergence. The path to this stronger result
requires slightly more complicated mathematics, and we will therefore develop
some intermediate results before presenting the final result and its proof. The
general approach used here is the development used by Feller (1971). Some-
what different approaches to this result can be found in Gut (2005), Gnedenko
(1962), and Sen and Singer (1993), though the basic ideas are essentially the
same.
Theorem 3.11. Let {X_n}_{n=1}^∞ be a sequence of independent random variables where E(X_n) = 0 for all n ∈ N and

Σ_{n=1}^∞ E(X_n^2) < ∞.

Then the sequence of partial sums S_n = Σ_{k=1}^n X_k converges almost certainly to a limit S.

Proof. Let ε > 0, and consider the probability

P( max_{n≤k≤m} |S_k − S_n| > ε ) = P( max_{n<k≤m} |Σ_{j=n+1}^k X_j| > ε ).

Theorem 2.13 implies that

P( max_{n<k≤m} |Σ_{j=n+1}^k X_j| > ε ) ≤ ε^{-2} Σ_{k=n+1}^m V(X_k) ≤ ε^{-2} Σ_{k=n+1}^∞ V(X_k),    (3.12)

where the second inequality is due to the fact that the terms of the sequence are non-negative. Now, the right hand side of Equation (3.12) does not depend on m, so that we can take the limit of the left hand side as m → ∞. Theorem 2.16 then implies that

P( sup_{k≥n} |S_k − S_n| > ε ) = P( ∪_{k=n}^∞ {|S_k − S_n| > ε} ) ≤ ε^{-2} Σ_{k=n+1}^∞ V(X_k).

Now take the limit as n → ∞ to obtain

lim_{n→∞} P( sup_{k≥n} |S_k − S_n| > ε ) ≤ lim_{n→∞} ε^{-2} Σ_{k=n+1}^∞ V(X_k) = 0,    (3.13)

where the limit on the right hand side of Equation (3.13) follows from the assumption that

Σ_{n=1}^∞ E(X_n^2) < ∞.

It follows from Equation (3.13) that the sequence {S_n}_{n=1}^∞ is a Cauchy sequence with probability one, or by Theorem 1.1, {S_n}_{n=1}^∞ has a limit with probability one. Therefore S_n →^{a.c.} S as n → ∞ for some S. □

Theorem 3.11 actually completes much of the work we need to prove the
Strong Law of Large Numbers in that we now know that the sum converges
almost certainly to a limit. However, the assumption on the variance of Xn is
quite strong and we will need to find a way to weaken this assumption. The
method will be the same as used in the proof of Theorem 3.10 in that we
will use truncated random variables. In order to apply the result in Theorem
3.11 to these truncated random variables a slight generalization of the result
is required.
Corollary 3.1. Let {X_n}_{n=1}^∞ be a sequence of independent random variables where E(X_n) = 0 for all n ∈ N. Let {b_n}_{n=1}^∞ be a monotonically increasing sequence of real numbers such that b_n → ∞ as n → ∞. If

Σ_{n=1}^∞ b_n^{-2} E(X_n^2) < ∞,

then

Σ_{n=1}^∞ b_n^{-1} X_n

converges almost certainly to some limit and b_n^{-1} S_n →^{a.c.} 0 as n → ∞.

Proof. The first result is obtained directly from Theorem 3.11 using {b_n^{-1} X_n}_{n=1}^∞ as the sequence of random variables of interest. In that case we require

Σ_{n=1}^∞ E[(b_n^{-1} X_n)^2] = Σ_{n=1}^∞ b_n^{-2} E(X_n^2) < ∞,

which is the condition that is assumed. To prove the second result see Exercise 23. □
The final result required to prove the Strong Law of Large Numbers is a
condition on the existence of the mean of a random variable.
Theorem 3.12. Let {X_n}_{n=1}^∞ be a sequence of random variables where X_n has distribution function F for all n ∈ N. Then for any ε > 0,

Σ_{n=1}^∞ P(|X_n| > nε) < ∞

if and only if E(|X_n|) < ∞.
A proof of Theorem 3.12 can be found in Section VII.8 of Feller (1971). We
now have enough tools to consider the result of interest.

Theorem 3.13. Let {X_n}_{n=1}^∞ be a sequence of independent and identically distributed random variables. If E(X_n) exists and is equal to θ then X̄_n →^{a.c.} θ.

Proof. Suppose that E(X_n) exists and equals θ. We will consider the case when θ = 0; the case when θ ≠ 0 can be proven using the same methodology used at the end of the proof of Theorem 3.10. Define two new sequences of random variables {Y_n}_{n=1}^∞ and {Z_n}_{n=1}^∞ as Y_n = X_n δ{|X_n|; [0, n]} and Z_n = X_n δ{|X_n|; (n, ∞)}. Hence, it follows that

X̄_n = n^{-1} Σ_{k=1}^n X_k = n^{-1} Σ_{k=1}^n Y_k + n^{-1} Σ_{k=1}^n Z_k.

Because E(X_n) exists, Theorem 3.12 implies that for every ε > 0,

Σ_{n=1}^∞ P(|X_n| > nε) < ∞.

Therefore, taking ε = 1, P(|X_n| > n) = P(Z_n ≠ 0). Hence

Σ_{n=1}^∞ P(Z_n ≠ 0) < ∞,

and Theorem 2.17 then implies that P({Z_n ≠ 0} i.o.) = 0. This means that with probability one there will only be a finite number of times that Z_n ≠ 0 over all the values of n ∈ N, which implies that the sum

Σ_{n=1}^∞ Z_n

will be finite with probability one. Hence

P( lim_{n→∞} n^{-1} Σ_{k=1}^n Z_k = 0 ) = 1,

and therefore

n^{-1} Σ_{k=1}^n Z_k →^{a.c.} 0

as n → ∞. The convergence behavior of the sum

Σ_{k=1}^n Y_k

will be studied with the aid of Corollary 3.1. To apply this result we must show that

Σ_{n=1}^∞ n^{-2} E(Y_n^2) < ∞.
First note that since Y_n is the version of X_n truncated at n and −n, we have that

E(Y_n^2) = ∫_{−∞}^∞ x^2 δ{|x|; [0, n]} dF(x) = ∫_{−n}^n x^2 dF(x) = Σ_{k=1}^n ∫_{R_k} x^2 dF(x),

where R_k = {x ∈ R : k − 1 ≤ |x| < k} and F is the distribution function of X_n. Hence

Σ_{n=1}^∞ n^{-2} E(Y_n^2) = Σ_{n=1}^∞ n^{-2} Σ_{k=1}^n ∫_{R_k} x^2 dF(x).

Note that the set of pairs (n, k) for which n ∈ {1, 2, . . .} and k ∈ {1, 2, . . . , n} is the same set of pairs for which k ∈ {1, 2, . . .} and n ∈ {k, k + 1, . . .}. This allows us to change the order in the double sum as

Σ_{n=1}^∞ n^{-2} E(Y_n^2) = Σ_{k=1}^∞ Σ_{n=k}^∞ n^{-2} ∫_{R_k} x^2 dF(x) = Σ_{k=1}^∞ [ ∫_{R_k} x^2 dF(x) ]( Σ_{n=k}^∞ n^{-2} ).

Now

Σ_{n=k}^∞ n^{-2} ≤ 2k^{-1},

so that

Σ_{n=1}^∞ n^{-2} E(Y_n^2) ≤ Σ_{k=1}^∞ ∫_{R_k} 2x^2 k^{-1} dF(x).

But when x ∈ R_k we have that |x| < k, so that

Σ_{k=1}^∞ ∫_{R_k} 2x^2 k^{-1} dF(x) ≤ Σ_{k=1}^∞ ∫_{R_k} 2k|x|k^{-1} dF(x) ≤ 2 Σ_{k=1}^∞ ∫_{R_k} |x| dF(x) < ∞,

where the last inequality follows from our assumptions. Therefore we have shown that

Σ_{n=1}^∞ n^{-2} E(Y_n^2) < ∞.

Now, consider the centered sequence of random variables Y_n − E(Y_n), which have mean zero. The fact that E{[Y_n − E(Y_n)]^2} ≤ E(Y_n^2) implies that

Σ_{n=1}^∞ n^{-2} E{[Y_n − E(Y_n)]^2} < ∞.

We are now in the position where Corollary 3.1 can be applied to the centered sequence, which allows us to conclude that

n^{-1} Σ_{k=1}^n [Y_k − E(Y_k)] = n^{-1} Σ_{k=1}^n Y_k − E( n^{-1} Σ_{k=1}^n Y_k ) →^{a.c.} 0,
as n → ∞, leaving us to evaluate the asymptotic behavior of

E( n^{-1} Σ_{k=1}^n Y_k ) = n^{-1} Σ_{k=1}^n E(Y_k).

This is the same problem encountered in the proof of Theorem 3.10, and the same solution based on Theorem 2.12 implies that

lim_{n→∞} n^{-1} Σ_{k=1}^n E(Y_k) = 0.

Therefore, Theorem 3.9 implies that

n^{-1} Σ_{k=1}^n Y_k →^{a.c.} 0

as n → ∞. Combining this with the fact that n^{-1} Σ_{k=1}^n Z_k →^{a.c.} 0 as n → ∞ yields X̄_n →^{a.c.} 0 as n → ∞, which completes the proof for θ = 0. □

Aside from the stronger conclusion about the mode of convergence about the
sample mean, there is another major difference between the Weak Law of
Large Numbers and the Strong Law of Large Numbers. Section VII.8 of Feller
(1971) actually shows that the existence of the mean of Xn is both a necessary
and sufficient condition to assure the almost certain convergence of the sample
mean. As it turns out, the existence of the mean is not a necessary condition
for a properly centered sample mean to converge in probability to a limit.
Theorem 3.14. Let {X_n}_{n=1}^∞ be a sequence of independent random variables each having a common distribution F. Then X̄_n − E(X_1 δ{|X_1|; [0, n]}) →^{p} 0 as n → ∞ if and only if

lim_{n→∞} nP(|X_1| > n) = 0.

A proof of Theorem 3.14 can be found in Section 6.4 of Gut (2005). It is important to note that the result in Theorem 3.14 does not imply that the sample mean converges to any value in this case. Rather, the conclusion is that the difference between the sample mean and the truncated mean converges to zero as n → ∞. In the special case where the distribution of X_1 is symmetric about zero, X̄_n →^{p} 0 as n → ∞, but if the mean of X_1 does not exist then X̄_n is not converging to the population mean. We could, however, conclude that X̄_n converges in probability to the population median as n → ∞ in this special case. Note further that the condition that nP(|X_1| > n) → 0 as n → ∞ is both necessary and sufficient to ensure this convergence. This implies that there are cases where the convergence does not take place.
Example 3.21. Let {Xn }∞ n=1 be a sequence of independent and identically
distributed random variables from a Cauchy(0, 1) distribution. The mean of
the distribution does not exist, and further it can be shown that nP (|X1 | >
n) → 2π −1 as n → ∞, so that the condition of Theorem 3.14 does not hold.
Therefore, even though the distribution of X1 is symmetric about zero, the

Figure 3.5 The results of a small simulation demonstrating the behavior of sample
means computed from a Cauchy(0, 1) distribution. Each line represents a sequence
of sample means computed on a sequence of independent Cauchy(0, 1) random vari-
ables. The means were computed when n = 5, 10, . . . , 250.

sample mean will not converge to zero. To observe the behavior of the sample
mean in this case see Figure 3.5, where five realizations of the sample mean
have been plotted for n = 5, 10, . . . , 250. Note that the values of the mean do
not appear to be settling down as we observed in Figure 3.2. 
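The behavior plotted in Figure 3.5 can be reproduced with a few lines of code. The sketch below is not part of the original text; it uses NumPy to generate five independent sequences of running Cauchy(0,1) sample means, whose values keep being displaced by occasional extreme observations and never settle down.

import numpy as np

rng = np.random.default_rng(6)
ns = np.arange(5, 251, 5)                  # n = 5, 10, ..., 250 as in Figure 3.5
for path in range(5):
    x = rng.standard_cauchy(size=ns.max())
    means = np.cumsum(x) / np.arange(1, x.size + 1)
    print(np.round(means[ns - 1][::10], 2))  # every tenth recorded value of the running mean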
Example 3.22. Let {X_n}_{n=1}^∞ be a sequence of independent and identically distributed random variables from a continuous distribution with density

f(x) = ½x^{-2} for |x| > 1, and f(x) = 0 for |x| ≤ 1.

We first note that

∫_1^n x f(x) dx = ½ ∫_1^n x^{-1} dx = ½ log(n),

so that

∫_1^n x f(x) dx → ∞

as n → ∞, and therefore the mean of X_1 does not exist. Checking the condition in Theorem 3.14 we have that, due to the symmetry of the density,

nP(|X_1| > n) = 2n ∫_n^∞ f(x) dx = n ∫_n^∞ x^{-2} dx = 1.

Therefore, Theorem 3.14 implies that X̄_n − E(X_1 δ{|X_1|; [0, n]}) does not converge in probability to zero; since the truncated mean is zero due to symmetry, X̄_n does not converge in probability to zero. On the other hand, if we modify the tails of the density so that they drop off at a slightly faster rate, then we can achieve convergence. For example, consider the density suggested in Section 6.4 of Gut (2005), given by

f(x) = c[x^2 log(|x|)]^{-1} for |x| > 2, and f(x) = 0 for |x| ≤ 2,

where c is a normalizing constant. In this case it can be shown that nP(|X_1| > n) → 0 as n → ∞, but that the mean does not exist. However, we can still conclude that X̄_n →^{p} 0 as n → ∞ due to Theorem 3.14. □

The Laws of Large Numbers given by Theorems 3.10 and 3.13 provide a char-
acterization of the limiting behavior of the sample mean as the sample size
n → ∞. The Law of the Iterated Logarithm provides information about the
extreme fluctuations of the sample mean as n → ∞.
Theorem 3.15 (Hartman and Wintner). Let {X_n}_{n=1}^∞ be a sequence of independent random variables each having a common distribution F such that E(X_n) = µ and V(X_n) = σ^2 < ∞. Then

P( limsup_{n→∞} n^{1/2}(X̄_n − µ) / {2σ^2 log[log(n)]}^{1/2} = 1 ) = 1

and

P( liminf_{n→∞} n^{1/2}(X̄_n − µ) / {2σ^2 log[log(n)]}^{1/2} = −1 ) = 1.

A proof of Theorem 3.15 can be found in Section 8.3 of Gut (2005). Theorem 3.15 shows that the extreme fluctuations of the sequence of random variables given by {Z_n}_{n=1}^∞ = {n^{1/2}σ^{-1}(X̄_n − µ)}_{n=1}^∞ are about the same size as {2 log[log(n)]}^{1/2}. More precisely, let ε > 0 and consider the interval

I_{ε,n} = [−(1 + ε){2 log[log(n)]}^{1/2}, (1 + ε){2 log[log(n)]}^{1/2}];

then all but a finite number of values in the sequence {Z_n}_{n=1}^∞ will be contained in I_{ε,n} with probability one. On the other hand, if we define the interval

J_{ε,n} = [−(1 − ε){2 log[log(n)]}^{1/2}, (1 − ε){2 log[log(n)]}^{1/2}],

then Z_n ∉ J_{ε,n} an infinite number of times with probability one.
Example 3.23. Theorem 3.15 is somewhat difficult to visualize, but some
simulated results can help. In Figure 3.6 we have plotted a realization of
n1/2 X̄n for n = 1, . . . , 500, where the population is N(0, 1), along with its

Figure 3.6 A simulated example of the behavior indicated by Theorem 3.15. The
solid line is a realization of n1/2 X̄n for a sample of size n from a N(0, 1) dis-
tribution with n = 1, . . . , 500. The dotted line indicates the extent of the envelope
±{2 log[log(n)]}1/2 and the dashed line indicates the extreme fluctuations of n1/2 X̄n .
extreme fluctuations and the limits ±{2 log[log(n)]}1/2 . One would not ex-
pect the extreme fluctuations of n1/2 X̄n to exactly follow the limits given by
±{2 log[log(n)]}1/2 , but note in our realization that the general shape of the
fluctuations does follow the envelope fairly well as n becomes larger. 
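The following sketch, not part of the original text, mimics Figure 3.6 numerically with NumPy: it generates n^{1/2}X̄_n for a N(0,1) sample and compares its largest fluctuation with the envelope {2 log[log(n)]}^{1/2}.

import numpy as np

rng = np.random.default_rng(7)
n_max = 500
x = rng.normal(size=n_max)
n = np.arange(1, n_max + 1)
z = np.sqrt(n) * (np.cumsum(x) / n)        # n^{1/2} times the sample mean, n = 1, ..., 500
mask = n >= 10                             # log(log(n)) is only meaningful for larger n
envelope = np.sqrt(2.0 * np.log(np.log(n[mask].astype(float))))
print("largest fluctuation for n >= 10:", round(float(np.abs(z[mask]).max()), 3))
print("envelope value at n = 500:", round(float(envelope[-1]), 3))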

3.7 The Glivenko–Cantelli Theorem

In many settings statistical inference is based on the nonparametric framework, which attempts to make as few assumptions as possible about the un-
derlying process that produces a set of observations. A major assumption in
many methods of statistical inference is that the population has a certain
distribution. For example, statistical inference on population means and vari-
ances often assumes that the underlying population is at least approximately
normal. If the assumption about the normality of the population is not true,
then there can be an effect on the reliability of the associated methods of
statistical inference.
The empirical distribution function provides a method for estimating a distribution function F based only on the assumption that the observed data consist
of independent and identically distributed random variables following the dis-
tribution F . This estimator is useful in two ways. First, the reasonableness of
assumptions about the distribution of the sample can be studied using this
estimator. Second, if no assumption about the distribution is to be made, the
empirical distribution can often be used in place of the unknown distribution
to obtain approximate methods for statistical inference. This section intro-
duces the empirical distribution function and studies many of the asymptotic
properties of the estimate.
Definition 3.5. Let X_1, . . . , X_n be a set of independent and identically distributed random variables from a distribution F. The empirical distribution function of X_1, . . . , X_n is

F̂_n(t) = n^{-1} Σ_{i=1}^n δ{X_i; (−∞, t]}.

It is useful to consider the structure of the estimate of F proposed in Definition 3.5 in some detail. For an observed sample X_1, . . . , X_n, the empirical
distribution function is a step function that has steps of size n−1 at each of the
observed sample values, under the assumption that each of the sample values
is unique. Therefore, if we consider the empirical distribution function condi-
tional on the observed sample X1 , . . . , Xn , the empirical distribution function
corresponds to a discrete distribution that has a probability of n−1 at each of
the observations in the sample.
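Definition 3.5 translates directly into code. The short sketch below is not part of the original text; it implements F̂_n with NumPy, and ecdf is an illustrative name rather than a library routine.

import numpy as np

def ecdf(sample):
    # return the empirical distribution function of the observed sample
    x = np.sort(np.asarray(sample, dtype=float))
    n = x.size
    def F_hat(t):
        # proportion of observations less than or equal to t: a step of size 1/n at each X_i
        return np.searchsorted(x, t, side="right") / n
    return F_hat

rng = np.random.default_rng(8)
F_hat = ecdf(rng.exponential(size=10))
print([round(float(F_hat(t)), 2) for t in (0.0, 0.5, 1.0, 2.0)])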
The properties of the empirical distribution function can be studied in two
ways. For a fixed point t ∈ R we can consider F̂n (t) as an estimator of F (t), at
that point. Alternatively we can consider the entire function F̂n over the real
line simultaneously as an estimate of the function F . We begin by considering
the point-wise viewpoint first. Finite sample properties follow from the fact
that when t ∈ R is fixed,
P (δ{Xi ; (−∞, t]} = 1) = P (Xi ∈ (−∞, t]) = P (Xi ≤ t) = F (t).
Similarly, P (δ{Xi ; (−∞, t]} = 0) = 1 − F (t). Therefore δ{Xi ; (−∞, t]} is a
Bernoulli[F (t)] random variable for each i = 1, . . . , n, and hence
nF̂_n(t) = Σ_{i=1}^n δ{X_i; (−∞, t]},

is a Binomial[n, F(t)] random variable. Using this fact, it can be proven that for a fixed value of t ∈ R, F̂_n(t) is an unbiased estimator of F(t) with
standard error n−1/2 {F (t)[1 − F (t)]}1/2 . The empirical distribution function
is also point-wise consistent.
Theorem 3.16. Let X_1, . . . , X_n be a set of independent and identically distributed random variables from a distribution F, and let F̂_n be the empirical distribution function computed on X_1, . . . , X_n. Then for each t ∈ R, F̂_n(t) →^{a.c.} F(t) as n → ∞.

The next step in our development is to extend the consistency result of The-
orem 3.16 to the entire empirical distribution function. That is, we wish to
conclude that F̂_n is a consistent estimator of F, or that F̂_n converges al-
most certainly to F as n → ∞. This differs from the previous result in that
we wish to show that the random function F̂n becomes arbitrarily close to F
with probability one as n → ∞. Therefore, we require a measure of distance
between two distribution functions. Many distance functions, or metrics, can
be defined on the space of distribution functions. For examples, see Young
(1988). A common metric for comparing two distribution functions in statis-
tical inference is based on the supremum metric.
Theorem 3.17. Let F and G be two distribution functions. Then

d_∞(F, G) = ‖F − G‖_∞ = sup_{t∈R} |F(t) − G(t)|

is a metric in the space of distribution functions, called the supremum metric.

A stronger result can actually be proven in that the metric defined in Theorem
3.17 is actually a metric over the space of all functions. Now that a metric in
the space of distribution functions has been defined, it is relevant to ascertain
whether the empirical distribution function is a consistent estimator of F with
respect to this metric. That is, we would conclude that F̂n converges almost
certainly to F as n → ∞ if

P( lim_{n→∞} ‖F̂_n − F‖_∞ = 0 ) = 1,

or equivalently that ‖F̂_n − F‖_∞ →^{a.c.} 0 as n → ∞.
Theorem 3.18 (Glivenko and Cantelli). Let X_1, . . . , X_n be a set of independent and identically distributed random variables from a distribution F, and let F̂_n be the empirical distribution function computed on X_1, . . . , X_n. Then ‖F̂_n − F‖_∞ →^{a.c.} 0 as n → ∞.
Proof. For a fixed value of t ∈ R, Theorem 3.16 implies that F̂_n(t) →^{a.c.} F(t) as n → ∞. The result we wish to prove states that the maximum difference between F̂_n and F also converges to 0 as n → ∞, a stronger result. Rather than attempt to quantify the behavior of the maximum difference directly, we will instead prove that F̂_n(t) →^{a.c.} F(t) uniformly in t as n → ∞. We will follow the method of proof used by van der Vaart (1998); alternate approaches can be found in Sen and Singer (1993) and Serfling (1980). The result was first proven under various conditions by Glivenko (1933) and Cantelli (1933).

Let ε > 0 be given. Then, there exists a partition of R given by −∞ = t_0 < t_1 < . . . < t_k = ∞ such that

lim_{t↑t_i} F(t) − F(t_{i−1}) < ε,
for some k ∈ N. We will begin by arguing that such a partition exists. First consider the endpoints of the partition t_1 and t_{k−1}. Because F is a distribution function we know that

lim_{t→t_0} F(t) = lim_{t→−∞} F(t) = 0,

and hence there must be a point t_1 such that F(t_1) < ε. Similarly, since

lim_{t→t_k} F(t) = lim_{t→∞} F(t) = 1,

it follows that there must be a point t_{k−1} such that F(t_{k−1}) is within ε of F(t_k) = 1. If the distribution function is continuous within an interval (a, b), then the definition of continuity implies that there must exist two points t_i and t_{i−1} such that F(t_i) − F(t_{i−1}) < ε. Noting that for points where F is continuous we have that

lim_{t↑t_i} F(t) = F(t_i),

this shows that the partition exists on any interval (a, b) where the distribution function is continuous.

Now consider an interval (a, b) where there exists a point t′ ∈ (a, b) such that F has a jump of size δ′ at t′. That is,

δ′ = F(t′) − lim_{t↑t′} F(t).

There is no problem if δ′ < ε, for a specific value of ε, but this cannot be guaranteed for every ε > 0. However, the partition can still be created by setting one of the points of the partition exactly at t′. First, consider the case where t_i = t′. The limit of F(t) as t ↑ t′ is the value F would take at t′ if F were left continuous at t′. It then follows that there must exist a point t_{i−1} such that

lim_{t↑t′} F(t) − F(t_{i−1}) < ε.

See Figure 3.7. In the case where t_{i−1} = t′, the property follows from the fact that F is always right continuous, and is therefore continuous on the interval (t_{i−1}, b) for some b. Therefore, there does exist a partition with the indicated property. The partition is finite because the range of F, which is [0, 1], is bounded.
Now consider t ∈ (t_{i−1}, t_i) for some i ∈ {1, . . . , k} and note that because F̂_n and F are non-decreasing it follows that

F̂_n(t) ≤ lim_{t↑t_i} F̂_n(t)

and

F(t) ≥ F(t_{i−1}) > lim_{t↑t_i} F(t) − ε,

so that it follows that

F̂_n(t) − F(t) ≤ lim_{t↑t_i} F̂_n(t) − lim_{t↑t_i} F(t) + ε.

Similar computations can be used to show that F̂_n(t) − F(t) ≥ F̂_n(t_{i−1}) − F(t_{i−1}) − ε.
We already know that F̂_n(t) →^{a.c.} F(t) for every t ∈ R. However, because the partition t_0 < t_1 < . . . < t_k is finite, it follows that F̂_n(t) →^{a.c.} F(t) uniformly on the partition t_0 < t_1 < . . . < t_k. That is, for every ε > 0, there exists a positive integer n_ε such that |F̂_n(t) − F(t)| < ε for all n ≥ n_ε and t ∈ {t_0, t_1, . . . , t_k}, with probability one. To prove this we need only find n_{ε,t} such that |F̂_n(t) − F(t)| < ε for all n ≥ n_{ε,t} with probability one and assign n_ε = max{n_{ε,t_0}, . . . , n_{ε,t_k}}.

This implies that for every ε > 0 and t ∈ (t_{i−1}, t_i) there is a positive integer n_ε such that

F̂_n(t) − F(t) ≤ lim_{t↑t_i} F̂_n(t) − lim_{t↑t_i} F(t) + ε ≤ 2ε

and

F̂_n(t) − F(t) ≥ F̂_n(t_{i−1}) − F(t_{i−1}) − ε ≥ −2ε.

Hence, for every ε > 0 and t ∈ (t_{i−1}, t_i) there is a positive integer n_ε such that |F̂_n(t) − F(t)| ≤ 2ε for all n ≥ n_ε, with probability one. Noting that the value of n_ε does not depend on i, we have proven that for every ε > 0 there is a positive integer n_ε such that |F̂_n(t) − F(t)| ≤ 2ε for every n ≥ n_ε and t ∈ R, with probability one. Therefore, F̂_n(t) converges uniformly to F(t) with probability one. This uniform convergence implies that

P( lim_{n→∞} sup_{t∈R} |F̂_n(t) − F(t)| = 0 ) = 1,

or that

sup_{t∈R} |F̂_n(t) − F(t)| →^{a.c.} 0

as n → ∞. □

Example 3.24. Let {X_n}_{n=1}^∞ be a sequence of independent and identically distributed random variables each having an Exponential(θ) distribution. Let F̂_n be the empirical distribution function computed on X_1, . . . , X_n. We can visualize the convergence of F̂_n to F as described in Theorem 3.18 by looking at the results of a small simulation. In Figures 3.8–3.10 we compare the empirical distribution function computed on simulated samples of size n = 5, 25 and 50 from an Exponential(θ) distribution with θ = 1 to the true distribution function, which in this case is given by F(t) = [1 − exp(−t)]δ{t; (0, ∞)}. One can observe in Figures 3.8–3.10 that the empirical distribution function becomes uniformly closer to the true distribution function as the sample size becomes larger. Theorem 3.18 guarantees that this type of behavior will occur with probability one. □
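The uniform closeness seen in Figures 3.8–3.10 can also be summarized numerically through the supremum metric of Theorem 3.17. The sketch below is not part of the original text; it uses NumPy with θ = 1, and it exploits the fact that for the step function F̂_n the supremum is attained at the jump points, so only F̂_n and its left-hand limits at the order statistics need to be checked.

import numpy as np

rng = np.random.default_rng(9)
F = lambda t: 1.0 - np.exp(-t)             # true Exponential(1) distribution function

def sup_distance(sample):
    x = np.sort(sample)
    n = x.size
    F_at_jumps = F(x)
    upper = np.arange(1, n + 1) / n        # value of F-hat_n at each order statistic
    lower = np.arange(0, n) / n            # left-hand limit of F-hat_n at each order statistic
    return max(np.abs(upper - F_at_jumps).max(), np.abs(F_at_jumps - lower).max())

for n in [5, 25, 50, 5_000]:
    print(n, round(float(sup_distance(rng.exponential(size=n))), 4))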

Figure 3.7 Constructing the partition used in the proof of Theorem 3.18 when there
is a discontinuity in the distribution function. By locating a partition point at the
jump point (grey line), the continuity of the distribution function to the left of this
point can be used to find a point (dotted line) such that the difference between the
distribution function at these two points does not exceed ε for any specified ε > 0.

3.8 Sample Moments

Let X_1, . . . , X_n be a sequence of independent and identically distributed random variables from a distribution F and consider the problem of estimating the kth moment of F defined in Definition 2.9 as

µ′_k = ∫_{−∞}^∞ x^k dF(x).    (3.14)

We will assume for the moment that

∫_{−∞}^∞ |x|^m dF(x) < ∞,    (3.15)

for some m ≥ k, with further emphasis on this assumption to be considered


later. The empirical distribution function introduced in Section 3.7 provides
a nonparametric estimate of F based on a sample X1 , . . . , Xn . Substituting
the empirical distribution function into Equation (3.14) we obtain a nonpara-
metric estimate of the k th moment of F given by
Z ∞ ∞
X
µ̂0k = xk dF̂n (x) = n−1 Xik ,
−∞ n=1

Figure 3.8 The empirical distribution function computed on a simulated sample of size n = 5 from an Exponential(θ) distribution (solid line) compared to the actual distribution function (dashed line).
where the integral is evaluated using Definition 2.10. This estimate is known
as the k th sample moment. The properties of this estimate are detailed below.
Theorem 3.19. Let X1 , . . . , Xn be a set of independent and identically dis-
tributed random variables from a distribution F .

1. If E(|X1 |k ) < ∞ then µ̂0k is an unbiased estimator of µ0k .


2. If E(|X1|^{2k}) < ∞ then the standard error of µ̂′_k is n^{-1/2}[µ′_{2k} − (µ′_k)^2]^{1/2}.
3. If E(|X1|^k) < ∞ then µ̂′_k →^{a.c.} µ′_k as n → ∞.

For a proof of Theorem 3.19 see Exercise 28.
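As a quick numerical illustration of Theorem 3.19, the following sketch checks the mean and standard error of a sample moment by simulation; the Exponential(1) distribution (for which µ′_k = k!), the choices k = 2 and n = 50, and the number of replications are assumptions made only for this example.

# Sketch: Monte Carlo check of the mean and standard error of the k-th sample moment.
set.seed(1)
k <- 2; n <- 50
moments <- replicate(5000, mean(rexp(n)^k))   # simulated copies of the sample moment
c(simulated.mean = mean(moments), true.moment = factorial(k))
c(simulated.se = sd(moments),
  theoretical.se = sqrt((factorial(2 * k) - factorial(k)^2) / n))

The simulated mean should be close to µ′_2 = 2 and the simulated standard error close to n^{-1/2}[µ′_4 − (µ′_2)^2]^{1/2}.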

The kth central moment can be handled in a similar way. From Definition 2.9 the kth central moment of F is

µ_k = ∫_{−∞}^{∞} (x − µ′_1)^k dF(x), (3.16)

where we will again use the assumption in Equation (3.15) that E(|X|^m) < ∞ for a value of m to be determined later. Substituting the empirical distribution function for F in Equation (3.16) provides an estimate of µ_k given by

µ̂_k = ∫_{−∞}^{∞} [x − ∫_{−∞}^{∞} t dF̂n(t)]^k dF̂n(x) = n^{-1} Σ_{i=1}^{n} (X_i − µ̂′_1)^k.

Figure 3.9 The empirical distribution function computed on a simulated sample of size n = 25 from an Exponential(θ) distribution (solid line) compared to the actual distribution function (dashed line).

This estimate has a more complex structure than that of the k th sample
moment which makes the bias and standard error more difficult to obtain.
One result which makes this job slightly easier is given below.
Theorem 3.20. Let {Xn}∞_{n=1} be a sequence of independent and identically distributed random variables such that E(X1) = 0 and E(|Xn|^k) < ∞ for some k ≥ 2. Then

E(|n^{-1} Σ_{i=1}^{n} X_i|^k) = O(n^{-k/2}),

as n → ∞.
A proof of Theorem 3.20 can be found in Chapter 19 of Loève (1977). A similar result is given in Lemma 9.2.6.A of Serfling (1980).
Theorem 3.21. Let X1, . . . , Xn be a set of independent and identically distributed random variables from a distribution F.

1. If E(|X1|^k) < ∞ then the bias of µ̂_k as an estimator of µ_k is

½n^{-1}k(k − 1)µ_{k−2}µ_2 − n^{-1}kµ_k + O(n^{-2}),

as n → ∞.

2. If E(|X1|^{2k}) < ∞ then the variance of µ̂_k is

n^{-1}(µ_{2k} − µ_k^2 − 2kµ_{k−1}µ_{k+1} + k^2µ_2µ_{k−1}^2) + O(n^{-2}),

as n → ∞.

3. If E(|X1|^k) < ∞ then µ̂_k →^{a.c.} µ_k as n → ∞.

Figure 3.10 The empirical distribution function computed on a simulated sample of size n = 50 from an Exponential(θ) distribution (solid line) compared to the actual distribution function (dashed line).

Proof. We follow the method of Serfling (1980) to prove the first result. We
begin by noting that

µ̂_k = n^{-1} Σ_{i=1}^{n} (X_i − µ̂′_1)^k
    = n^{-1} Σ_{i=1}^{n} [(X_i − µ′_1) + (µ′_1 − µ̂′_1)]^k
    = n^{-1} Σ_{i=1}^{n} Σ_{j=0}^{k} C(k, j)(µ′_1 − µ̂′_1)^j (X_i − µ′_1)^{k−j}
    = Σ_{j=0}^{k} C(k, j)(µ′_1 − µ̂′_1)^j [n^{-1} Σ_{i=1}^{n} (X_i − µ′_1)^{k−j}],

where C(k, j) denotes the binomial coefficient.

Therefore,

E(µ̂_k) = Σ_{j=0}^{k} C(k, j) E{(µ′_1 − µ̂′_1)^j [n^{-1} Σ_{i=1}^{n} (X_i − µ′_1)^{k−j}]}. (3.17)

Now, the j = 0 term in the sum in Equation (3.17) equals

E[n^{-1} Σ_{i=1}^{n} (X_i − µ′_1)^k] = µ_k.

It follows that the bias of µ̂_k is given by

Σ_{j=1}^{k} C(k, j) E{(µ′_1 − µ̂′_1)^j [n^{-1} Σ_{i=1}^{n} (X_i − µ′_1)^{k−j}]}. (3.18)

The first term of this sum is given by

C(k, 1) E{(µ′_1 − µ̂′_1)[n^{-1} Σ_{i=1}^{n} (X_i − µ′_1)^{k−1}]} = kn^{-2} Σ_{i=1}^{n} Σ_{j=1}^{n} E[(µ′_1 − X_i)(X_j − µ′_1)^{k−1}].

Note that when i ≠ j the two terms in the product are independent, and therefore the expectation of the product is the product of the expectations. In all of these cases E(X_i − µ′_1) = 0 and the term vanishes. When i = j the term in the sum equals −E[(X_i − µ′_1)^k] = −µ_k. Combining these results implies that the first term of Equation (3.18) is −n^{-1}kµ_k. The second term (j = 2)
of Equation (3.18) equals

C(k, 2) E{(µ′_1 − µ̂′_1)^2 [n^{-1} Σ_{i=1}^{n} (X_i − µ′_1)^{k−2}]}
  = ½k(k − 1)n^{-3} E{[Σ_{i=1}^{n} (µ′_1 − X_i)]^2 Σ_{j=1}^{n} (X_j − µ′_1)^{k−2}}
  = ½k(k − 1)n^{-3} Σ_{i=1}^{n} Σ_{j=1}^{n} Σ_{l=1}^{n} E[(µ′_1 − X_i)(µ′_1 − X_j)(X_l − µ′_1)^{k−2}]. (3.19)

When i differs from both j and l the expectation is zero due to independence. When i = j, the sum in Equation (3.19) becomes

½k(k − 1)n^{-3} Σ_{i=1}^{n} Σ_{j=1}^{n} E[(µ′_1 − X_i)^2 (X_j − µ′_1)^{k−2}]. (3.20)

When i ≠ j the two terms in the sum in Equation (3.20) are independent, and therefore

E[(µ′_1 − X_i)^2 (X_j − µ′_1)^{k−2}] = E[(µ′_1 − X_i)^2] E[(X_j − µ′_1)^{k−2}] = µ_2 µ_{k−2}.

When i = j the term in the sum in Equation (3.20) is given by

E[(µ′_1 − X_i)^2 (X_i − µ′_1)^{k−2}] = E[(X_i − µ′_1)^k] = µ_k.

Therefore, the sum in Equation (3.20) equals

½k(k − 1)n^{-3}[n(n − 1)µ_2 µ_{k−2} + nµ_k] = ½k(k − 1)n^{-1}µ_2 µ_{k−2} + O(n^{-2}),

as n → ∞. When j = 3, the term in the sum in Equation (3.18) can be shown to be O(n^{-2}) as n → ∞. See Exercise 29. To obtain the behavior of the remaining terms we note that the jth term in the sum in Equation (3.18) has the form

C(k, j) [n^{-1} Σ_{i=1}^{n} (µ′_1 − X_i)]^j [n^{-1} Σ_{i=1}^{n} (X_i − µ′_1)^{k−j}].
Now apply Theorem 2.10 (Hölder) to the two sums to yield

E{[n^{-1} Σ_{i=1}^{n} (µ′_1 − X_i)]^j [n^{-1} Σ_{i=1}^{n} (X_i − µ′_1)^{k−j}]} ≤
  (E{|[n^{-1} Σ_{i=1}^{n} (µ′_1 − X_i)]^j|^{k/j}})^{j/k} × (E{|n^{-1} Σ_{i=1}^{n} (X_i − µ′_1)^{k−j}|^{k/(k−j)}})^{(k−j)/k}. (3.21)

Applying Theorem 3.20 to the first term of Equation (3.21) implies

(E{|[n^{-1} Σ_{i=1}^{n} (µ′_1 − X_i)]^j|^{k/j}})^{j/k} = (E{|n^{-1} Σ_{i=1}^{n} (µ′_1 − X_i)|^k})^{j/k} = [O(n^{-k/2})]^{j/k} = O(n^{-j/2}),

as n → ∞. To simplify the second term in Equation (3.21), apply Theorem 2.9 (Minkowski's Inequality) to find that

(E{|n^{-1} Σ_{i=1}^{n} (X_i − µ′_1)^{k−j}|^{k/(k−j)}})^{(k−j)/k} ≤ Σ_{i=1}^{n} n^{-1}{E[|(X_i − µ′_1)^{k−j}|^{k/(k−j)}]}^{(k−j)/k}
  = Σ_{i=1}^{n} n^{-1}{E[|X_i − µ′_1|^k]}^{(k−j)/k} = {E(|X_1 − µ′_1|^k)}^{(k−j)/k} = O(1), (3.22)

as n → ∞. Therefore Theorem 1.18 implies that the expression in Equation


(3.21) is of order O(1)O(n−j/2 ) = O(n−2 ) as n → ∞ when j ≥ 4. Combining
these results implies that
E(µ̂k ) = −n−1 kµk + 12 n−1 k(k − 1)µ2 µk−2 + O(n−2 ),
as n → ∞, which yields the result. The second result is proven in Exercise 30
and the third result is proven in Exercise 31.

Example 3.25. Let X1, . . . , Xn be a sequence of independent and identically distributed random variables from a distribution F with finite fourth moment. Then Theorem 3.21 implies that the sample variance given by

µ̂_2 = n^{-1} Σ_{i=1}^{n} (X_i − X̄_n)^2,

is a consistent estimator of µ_2, with bias

n^{-1}µ_0µ_2 − 2n^{-1}µ_2 + O(n^{-2}) = −n^{-1}µ_2 + O(n^{-2}),

as n → ∞. Similarly, the variance is given by

n^{-1}(µ_4 − µ_2^2 − 4µ_1µ_3 + 4µ_2µ_1^2) + O(n^{-2}) = n^{-1}(µ_4 − µ_2^2) + O(n^{-2}),

as n → ∞, where we note that µ_1 = 0 by definition. In fact, a closer analysis of these results indicates that the error terms are identically zero in this case. See Exercise 7 for an alternate approach to determining the corresponding results for the unbiased version of the sample variance. 
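A short simulation sketch of this bias, assuming N(0, 1) data (so that µ_2 = 1) and n = 10, is given below.

# Sketch: the average of mu.hat_2 over many samples falls below mu_2 = 1 by about 1/n,
# matching the bias -mu_2/n, which is exact for the sample variance.
set.seed(1)
n <- 10
mu2.hat <- replicate(10000, {x <- rnorm(n); mean((x - mean(x))^2)})
c(simulated.bias = mean(mu2.hat) - 1, theoretical.bias = -1 / n)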

3.9 Sample Quantiles

This section investigates under what conditions a sample quantile provides


a consistent estimate of the corresponding population quantile. Let X be a random variable with distribution function

F(x) = P(X ≤ x) = ∫_{−∞}^{x} dF(t).

The pth quantile of X is defined to be ξ_p = F^{-1}(p) = inf{x : F(x) ≥ p}.
The population quantile as defined above is always unique, even though the
inverse of the distribution function may not be unique in every case. There
are three essential examples to consider. In the case where F (x) is contin-
uous and strictly increasing in the neighborhood of the quantile, then the
distribution function has a unique inverse in that neighborhood and it follows
that F (ξp ) = F [F −1 (p)] = p. The continuity of the function in this case also
guarantees that F (ξp −) = F (ξp ) = p. That is, the quantile ξp can be seen
as the unique solution to the equation F (ξp −) = F (ξp ) = p with respect to
ξp . See Figure 3.11. In the case where a discontinuous jump occurs at the
quantile, the distribution function does not have an inverse in the sense that
F [F −1 (p)] = p. The quantile ξp as defined above is located at the jump point.
In this case F (ξp −) < p < F (ξp ), and once again the quantile can be defined
to be the unique solution to the equation F (ξp −) < p < F (ξp ) with respect
to ξp . See Figure 3.12. In the last case F is continuous in a neighborhood
of the quantile, but is not increasing. In this case, due to the continuity of
F in the neighborhood of the quantile, F (ξp −) = p = F (ξp ). However, the
difference in this case is that the quantile is not the unique solution to the
equation F (ξp −) = p = F (ξp ) in that any point in the non-increasing neigh-
borhood of the quantile will also be a solution to this equation. See Figure
3.13. Therefore, for the first two situations (Figures 3.11 and 3.12) there is

Figure 3.11 When the distribution function is continuous in a neighborhood of the


quantile, then F (ξp ) = F [F −1 (p)] = p and F (ξp −) = F (ξp ) = p.

Figure 3.12 When the distribution function has a discontinuous jump at the quantile, then F(ξp−) < p < F(ξp).

Figure 3.13 When the distribution function is not increasing in a neighborhood of the quantile, then F(ξp−) = p = F(ξp), but there is no unique solution to this equation.

a unique solution to the equation F(ξp−) ≤ p ≤ F(ξp) that corresponds to the quantile ξp as defined earlier. The third situation, where there is not a unique solution, is problematic, as we shall see in the development below. A few other facts about distribution functions and quantiles will also be helpful in establishing the results of this section.
Theorem 3.22. Let F be a distribution function and let ξp = F −1 (p) =
inf{x : F (x) ≥ p}. Then

1. ξp = F −1 (p) is a non-decreasing function of p ∈ (0, 1).


2. ξp = F^{-1}(p) is a left-continuous function of p ∈ (0, 1).
3. F −1 [F (x)] ≤ x for all x ∈ R.
4. F [F −1 (p)] ≥ p for all p ∈ (0, 1).
5. F (x) ≥ p if and only if x ≥ F −1 (p).

Proof. To prove Part 1 we consider p1 ∈ (0, 1) and p2 ∈ (0, 1) such that p1 < p2. Then we have that ξ_{p1} = F^{-1}(p1) = inf{x : F(x) ≥ p1}. The key idea of this proof is establishing that ξ_{p1} = F^{-1}(p1) = inf{x : F(x) ≥ p1} ≤ inf{x : F(x) ≥ p2} = ξ_{p2}. The reason for this follows from the fact that F is a non-decreasing function. Therefore, the smallest value of x such that F(x) ≥ p2 must be at least as large as the smallest value of x such that F(x) ≥ p1.
Figure 3.14 Example of computing a sample quantile when p = kn^{-1}. In this example, the empirical distribution function for a Uniform(0, 1) sample of size n = 5 is plotted and we wish to estimate ξ_{0.6}. In this case p = 0.6 = kn^{-1} where k = 3, so that the estimate corresponds to the third order statistic.

Now suppose that X1 , . . . , Xn is a set of independent and identically dis-


tributed random variables following the distribution F and we wish to esti-
mate ξp , the pth quantile of F . If no assumptions can be made about F , a
logical estimator would be the corresponding quantile of the empirical distri-
bution function from Definition 3.5. That is, let ξˆp = inf{x : F̂n (x) ≥ p},
where F̂n (x) is the empirical distribution function computed on the sample
X1 , . . . , Xn . There are two specific cases that can occur when using this esti-
mate of a quantile. The empirical distribution function is a step function with
steps of size n−1 at each of the observed values in the sample. Therefore, if
p = kn−1 for some k ∈ {1, . . . , n} then ξˆp = inf{x : F̂n (x) ≥ kn−1 } = X(k) ,
the k th order statistic of the sample. For an example of this case, see Fig-
ure 3.14. If there does not exist a k ∈ {1, . . . , n} such that p = kn^{-1} then there will be a value of k such that (k − 1)n^{-1} < p < kn^{-1} and therefore p will be between steps of the empirical distribution function. In this case ξ̂p = inf{x : F̂n(x) ≥ p} = X_{(k)} as well, since F̂n first reaches or exceeds p at the kth order statistic. For an example of this case, see Figure 3.15.
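A brief sketch of these two cases, assuming a Uniform(0, 1) sample of size n = 5 as in Figures 3.14 and 3.15, is the following.

# Sketch: the sample quantile inf{x : F.hat_n(x) >= p} is always an order statistic.
set.seed(1)
x <- sort(runif(5))                                  # the ordered sample, n = 5
xi.hat <- function(p) x[which(ecdf(x)(x) >= p)[1]]   # first order statistic where F.hat_n >= p
xi.hat(0.6)   # p = 3/5 = kn^{-1} with k = 3: the third order statistic x[3]
xi.hat(0.5)   # (k - 1)n^{-1} < p < kn^{-1} with k = 3: again x[3]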
Figure 3.15 Example of computing a sample quantile when p ≠ kn^{-1}. In this example, the empirical distribution function for a Uniform(0, 1) sample of size n = 5 is plotted and we wish to estimate the median ξ_{0.5}. In this case there is not a value of k such that p = 0.5 = kn^{-1}, but when k = 3 we have that (k − 1)n^{-1} < p < kn^{-1} and therefore the estimate of the median corresponds to the third order statistic.

Theorem 3.23. Let {Xn}∞_{n=1} be a sequence of independent random variables
with common distribution F. Suppose that p ∈ (0, 1) and that ξp is the unique solution to F(ξp−) ≤ p ≤ F(ξp). Then ξ̂p →^{a.c.} ξp as n → ∞.

Proof. Let ε > 0 and note that the assumption that ξp is the unique solution to F(ξp−) ≤ p ≤ F(ξp) implies that F(ξp − ε) < p < F(ξp + ε). Now, Theorem 3.16 implies that F̂n(x) →^{a.c.} F(x) for every x ∈ R, so that F̂n(ξp − ε) →^{a.c.} F(ξp − ε) and F̂n(ξp + ε) →^{a.c.} F(ξp + ε) as n → ∞. Theorem 3.1 implies that for every δ > 0

lim_{n→∞} P[|F̂m(ξp − ε) − F(ξp − ε)| < δ for all m ≥ n] = 1,

and

lim_{n→∞} P[|F̂m(ξp + ε) − F(ξp + ε)| < δ for all m ≥ n] = 1.

Now, take δ small enough so that

lim_{n→∞} P[F̂m(ξp − ε) < p < F̂m(ξp + ε) for all m ≥ n] = 1.

Theorem 3.22 then implies that

lim_{n→∞} P(ξp − ε < F̂m^{-1}(p) < ξp + ε for all m ≥ n) = 1,

but note that F̂m^{-1}(p) = ξ̂_{p,m}, the sample quantile computed on X1, . . . , Xm, so that

lim_{n→∞} P(ξp − ε < ξ̂_{p,m} < ξp + ε for all m ≥ n) = 1,

which in turn implies that ξ̂_{p,n} →^{a.c.} ξp as n → ∞.
Example 3.26. Let X1, . . . , Xn be a set of independent and identically distributed random variables, each having a N(µ, σ²) distribution. It was shown in Example 3.18 that a consistent estimator of the αth quantile of this distribution is given by X̄n + σ̂n z_α, where z_α is the αth quantile of a N(0, 1) distribution. In the special case where we are interested in estimating the third quartile, which corresponds to the 0.75 quantile, this estimate approximately equals X̄n + (1349/2000)σ̂n, that is, X̄n + 0.6745σ̂n. An alternative consistent estimate is given by Theorem 3.23 as the third sample quartile. 
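A small numerical comparison of the two estimators, assuming N(0, 1) data so that the true third quartile is approximately 0.6745, is sketched below; the function sd() divides by n − 1 and stands in for σ̂n here, and type = 1 in quantile() requests the inverse of the empirical distribution function.

# Sketch: both estimators of the third quartile should be close to qnorm(0.75).
set.seed(1)
x <- rnorm(1000)
mean(x) + qnorm(0.75) * sd(x)    # normal-theory estimate Xbar + z_{0.75} * sigma.hat
quantile(x, 0.75, type = 1)      # third sample quartile, as in Theorem 3.23
qnorm(0.75)                      # true value of xi_{0.75}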

3.10 Exercises and Experiments

3.10.1 Exercises

1. Let {Xn}∞_{n=1} be a sequence of independent random variables where Xn is a Gamma(α, β) random variable with α = n and β = n^{-1} for n ∈ N. Prove that Xn →^{p} 1 as n → ∞.
2. Let Z be a N(0, 1) random variable and let {Xn}∞_{n=1} be a sequence of random variables such that Xn = Yn + Z where Yn is a N(n^{-1}, n^{-1}) random variable for all n ∈ N. Prove that Xn →^{p} Z as n → ∞.
3. Consider a sequence of independent random variables {Xn}∞_{n=1} where Xn has a Binomial(1, θ) distribution. Prove that the estimator

θ̂n = n^{-1} Σ_{k=1}^{n} Xk

is a consistent estimator of θ.
4. Let U be a Uniform(0,1) random variable and define a sequence of random
p
variables {Xn }∞n=1 as Xn = δ{U ; (0, n
−1
)}. Prove that Xn −
→ 0 as n → ∞.
5. Let {cn }∞
n=1 be a sequence of real constants such that

lim cn = c,
n→∞

for some constant c ∈ R. Let {Xn }∞ n=1 be a sequence of random variables


such that P (Xn = cn ) = 1 − n−1 and P (Xn = 0) = n−1 . Prove that
p
Xn −→ c as n → ∞.
6. Let X1 , . . . , Xn be a set of independent and identically distributed random
variables from a shifted exponential density of the form
(
exp[−(x − θ)] for x ≥ θ
f (x) =
0 for x < θ.
p
Let X(1) = min{X1 , . . . , Xn }. Prove that X(1) −→ θ as n → ∞.
7. Let X1 , . . . , Xn be a set of independent and identically distributed random
variables from a distribution F with variance µ2 where E(|X1 |4 ) < ∞.
Let X̄n be the sample mean and Sn2 be the unbiased version of the sample
variance.
a. Prove that the sample variance can be rewritten as

Sn^2 = [2n(n − 1)]^{-1} Σ_{i=1}^{n} Σ_{j=1}^{n} (X_i − X_j)^2.

b. Prove that Sn2 is an unbiased estimator of µ2 , that is, prove that E(Sn2 ) =
µ2 for all µ2 > 0.
c. Prove that the variance of µ̂_2 is

V(µ̂_2) = n^{-1}[µ_4 − (n − 3)(n − 1)^{-1}µ_2^2].

d. Use the results derived above to prove that µ̂2 is a consistent estimator
p
of µ2 . That is, prove that µ̂2 −
→ µ2 as n → ∞.
e. Relate the results observed here with the results given in Theorem 3.21.
8. Consider a sequence of independent random variables {Xn}∞_{n=1} where Xn has probability distribution function

fn(x) = 2^{−(n+1)} for x = −2^{n(1−ε)} and x = 2^{n(1−ε)}, fn(x) = 1 − 2^{−n} for x = 0, and fn(x) = 0 elsewhere,

where ε > ½ (Sen and Singer, 1993).
a. Compute the mean and variance of Xn .
b. Let
n
X
X̄n = n−1 Xk ,
k=1
for all n ∈ N. Compute the mean and variance of X̄n .
p
c. Prove that X̄n −→ 0 as n → ∞.
9. Let {Xn }∞
n=1 be a sequence of random variables such that

lim E(|Xn − c|) = 0,


n→∞
p
for some c ∈ R. Prove that Xn −
→ c as n → ∞. Hint: Use Theorem 2.6.
10. Let U be a Uniform[0, 1] random variable and let {Xn}∞_{n=1} be a sequence of random variables such that

Xn = δ{U; (0, ½ − [2(n + 1)]^{-1})} + δ{U; (½ + [2(n + 1)]^{-1}, 1)},

for all n ∈ N. Prove that Xn →^{a.c.} 1 as n → ∞.
11. Let {Xn}∞_{n=1} be a sequence of independent random variables where Xn has probability distribution function

f(x) = 1 − n^{-1} for x = 0, f(x) = n^{-1} for x = n^α, and f(x) = 0 elsewhere,

where α ∈ R.
a. For what values of α does Xn →^{p} 0 as n → ∞?
b. For what values of α does Xn →^{a.c.} 0 as n → ∞?
c. For what values of α does Xn →^{c} 0 as n → ∞?

12. Let X1 , . . . , Xn be a set of independent and identically distributed random


variables following the distribution F . Prove that for a fixed value of t ∈ R,
the empirical distribution function F̂n (t) is an unbiased estimator of F (t)
with standard error n−1/2 {F (t)[1 − F (t)]}1/2 .
13. Prove Theorem 3.3 using the theorems of Borel and Cantelli. That is, let
{Xn }∞
n=1 be a sequence of random variables that converges completely to
a.c.
a random variable X as n → ∞. Then prove that Xn −−→ X as n → ∞
using Theorems 2.17 and 2.18.
14. Let {Xn }∞
n=1 be a sequence of monotonically increasing random variables
that converge in probability to a random variable X. That is, P (Xn <
p
Xn+1 ) = 1 for all n ∈ N and Xn − → X as n → ∞. Prove that P (Xn <
X) = 1 for all n ∈ N.
15. Prove Part 2 of Theorem 3.6. That is, let {Xn }∞
n=1 be a sequence of d-
dimensional random vectors and let X be another d-dimensional random
a.c. a.c.
vector. and prove that Xn −−→ X as n → ∞ if and only if Xkn −−→ Xk as
n → ∞ for all k ∈ {1, . . . , d}.
16. Let {Xn }∞n=1 be a sequence of random vectors that converge almost cer-
tainly to a random vector X as n → ∞. Prove that for every ε > 0 it
follows that
lim P (kXm − Xk < ε for all m ≥ n) = 1.
n→∞

Prove the converse as well.


17. Let {Un }∞ n=1 be a sequence of independent and identically distributed Uni-
form(0, 1) random variables and let U(n) be the largest order statistic of
a.c.
U1 , . . . , Un . That is, U(n) = max{U1 , . . . , Un }. Prove that U(n) −−→ 1 as
n → ∞.
18. Prove the first part of Theorem 3.7. That is let {Xn }∞
n=1 be a sequence of
random variables, c be a real constant, and g be a Borel function on R that
a.c. a.c.
is continuous at c. Prove that if Xn −−→ c as n → ∞, then g(Xn ) −−→ g(c)
as n → ∞.
19. Prove the second part of Theorem 3.8. That is let {Xn }∞ n=1 be a sequence
of random variables, X be a random variable, and g be a Borel function on
R. Let C(g) be the set of continuity points of g and suppose that P [X ∈
p p
C(g)] = 1. Prove that if Xn − → X as n → ∞, then g(Xn ) − → g(X) as
n → ∞. Hint: Prove by contradiction using Theorem 3.5.
20. Let {Xn } be a sequence of d-dimensional random vectors, X be a d-
dimensional random vector, and g : Rd → Rq be a Borel function. Let C(g)
be the set of continuity points of g and suppose that P [X ∈ C(g)] = 1.
a.c. a.c.
a. Prove that if Xn −−→ X as n → ∞, then g(Xn ) −−→ g(X) as n → ∞.
p p
b. Prove that if Xn −
→ X as n → ∞, then g(Xn ) −
→ g(X) as n → ∞.
21. Let {Xn }∞ ∞ ∞
n=1 , {Yn }n=1 , and {Zn }n=1 be independent sequences of random
variables that converge in probability to the random variables X, Y , and
Z, respectively. Suppose that X, Y , and Z are independent and that each
of these random variables has a N(0, 1) distribution.
a. Let {Wn}∞_{n=1} be a sequence of three-dimensional random vectors defined by Wn = (Xn, Yn, Zn)′ and let W = (X, Y, Z)′. Prove that Wn →^{p} W as n → ∞. Identify the distribution of W. Would you be able to completely identify this distribution if the random variables X, Y, and Z are not independent? What additional information would be required?
b. Prove that ⅓1′Wn →^{p} ⅓1′W and identify the distribution of ⅓1′W.
22. Let {Xn }∞
n=1 be a sequence of random variables. Suppose that for every
ε > 0 we have that
lim sup P (|Xn | > ε) ≤ cε,
n→∞
p
where c is a finite real constant. Prove that Xn −
→ 0 as n → ∞.
23. A result from calculus is Kronecker's Lemma, which states that if {bn}∞_{n=1} is a monotonically increasing sequence of real numbers such that bn → ∞ as n → ∞, then the convergence of the series

Σ_{n=1}^{∞} b_n^{-1} x_n

implies that

lim_{n→∞} b_n^{-1} Σ_{k=1}^{n} x_k = 0.

Use Kronecker's Lemma to prove the second result in Corollary 3.1. That is, let {Xn}∞_{n=1} be a sequence of independent random variables where E(Xn) = 0 for all n ∈ N. Prove that if

Σ_{n=1}^{∞} b_n^{-2} E(X_n^2) < ∞,

then b_n^{-1} Sn →^{a.c.} 0 as n → ∞.
24. Let {Xn}∞_{n=1} be a sequence of independent and identically distributed random variables from a Cauchy(0, 1) distribution. Prove that the mean of the distribution does not exist, and further show that nP(|X1| > n) → 2π^{-1} as n → ∞, so that the condition of Theorem 3.14 does not hold.
25. Let {Xn}∞_{n=1} be a sequence of independent and identically distributed random variables from a density of the form

f(x) = c[x^2 log(|x|)]^{-1} for |x| > 2, and f(x) = 0 for |x| ≤ 2,

where c is a normalizing constant. Prove that nP(|X1| > n) → 0 as n → ∞, but that the mean does not exist. Hence, we can still conclude that X̄n →^{p} 0 as n → ∞ due to Theorem 3.14.
26. Prove Theorem 3.17. That is, let F and G be two distribution functions. Show that

‖F − G‖_∞ = sup_{t∈R} |F(t) − G(t)|

is a metric on the space of distribution functions.
27. In the proof of Theorem 3.18, verify that F̂n (t)−F (t) ≥ F̂n (t)−F (ti−1 )−ε.
28. Let X1 , . . . , Xn be a set of independent and identically distributed random
variables from a distribution F .

a. Prove that if E(|X1|^k) < ∞ then µ̂′_k is an unbiased estimator of µ′_k.
b. Prove that if E(|X1|^{2k}) < ∞ then the standard error of µ̂′_k is n^{-1/2}[µ′_{2k} − (µ′_k)^2]^{1/2}.
c. Prove that if E(|X1|^k) < ∞ then µ̂′_k →^{a.c.} µ′_k as n → ∞.

29. Let X1, . . . , Xn be a set of independent and identically distributed random variables from a distribution F where E(|X1|^k) < ∞. Consider the sum

Σ_{j=1}^{k} C(k, j) E{(µ′_1 − µ̂′_1)^j [n^{-1} Σ_{i=1}^{n} (X_i − µ′_1)^{k−j}]}.

Prove that when j = 3 the term in the sum is O(n^{-2}) as n → ∞.


30. Let X1 , . . . , Xn be a set of independent and identically distributed random
variables from a distribution F with E(|X1 |2k ) < ∞.

a. Prove that

E(µ̂2k ) = µ2k + n−1 [µ2k − µ2k − 2k(µ2k + µk−1 µk+1 ) + k 2 µ2k−1 µ2 +


k(k − 1)µk µk−2 µ2 ] + O(n−2 ),
as n → ∞.
EXERCISES AND EXPERIMENTS 157
b. Use this result to prove that
V (µ̂k ) = n−1 (µ2k − µ2k − 2kµk−1 µk+1 + k 2 µ2 µ2k−1 ) + O(n−2 ),
as n → ∞.

31. Let X1 , . . . , Xn be a set of independent and identically distributed random


a.c.
variables from a distribution F with E(|X1 |k ) < ∞. Prove that µ̂k −−→ µk
as n → ∞.

3.10.2 Experiments

1. Consider an experiment that flips a fair coin 100 times. Define an indicator random variable Bk so that Bk = 1 if the kth flip is heads and Bk = 0 if the kth flip is tails. The proportion of heads up to the nth flip is then

p̂n = n^{-1} Σ_{k=1}^{n} Bk.

Run a simulation that will repeat this experiment 25 times. On the same set of axes, plot p̂n versus n for n = 1, . . . , 100 with the points connected by lines, for each replication of the experiment. For comparison, plot a horizontal line at ½, the true probability of flipping heads. Comment on the outcome of the experiments. What results are being demonstrated by this set of experiments? How do these results relate to what is usually called the frequency method for computing probabilities?
2. Write a program in R that generates a sample X1 , . . . , Xn from a specified
distribution F , computes the empirical distribution function of X1 , . . . , Xn ,
and plots both the empirical distribution function and the specified distri-
bution function F on the same set of axes. Use this program with n =
5, 10, 25, 50, and 100 to demonstrate the consistency of the empirical dis-
tribution function given by Theorem 3.16. Repeat this experiment for each
of the following distributions: N(0, 1), Binomial(10, 0.25), Cauchy(0, 1),
and Gamma(2, 4).
3. Write a program in R that generates a sample X1 , . . . , Xn from a specified
distribution F and computes the sample mean X̄n . Use this program with
n = 5, 10, 25, 50, 100, and 1000 and plot the sample size against X̄n . Repeat
the experiment five times, and plot all the results on a single set of axes.
Produce the plot described above for each of the following distributions
N(0, 1), T(1), and T(2). For each distribution state whether the Strong
Law of Large Numbers or the Weak Law of Large Numbers regulates the
behavior of X̄n . What differences in behavior are observed on the plots?
4. Write a program in R that generates independent Uniform(0, 1) random
variables U1 , . . . , Un . Define two sequences of random variables X1 , . . . , Xn
and Y1 , . . . , Yn as Xk = δ{Uk ; (0, k −1 )} and Yk = δ{Uk ; (0, k −2 )}. Plot
X1 , . . . , Xn and Y1 , . . . , Yn against k = 1, . . . , n on the same set of axes for
p
n = 25. Is it apparent from the plot that Xn − → 0 as n → ∞ but that
c
Yn →− 0 as n → ∞? Repeat this process five times to get an idea of the
average behavior in each plot.
5. Write a program in R that generates a sample X1 , . . . , Xn from a specified
distribution F , computes the empirical distribution function of X1 , . . . , Xn ,
computes the maximum distance between F̂n and F , and computes the lo-
cation of the maximum distance between F̂n and F . Use this program with
n = 5, 10, 25, 50, and 100 and plot the sample size versus the maximum
distance to demonstrate Theorem 3.18. Separately, plot the location of the
maximum distance between F̂n and F against the sample size. Is there an
area where the maximum tends to stay, or does it tend to occur where F
has certain properties? Repeat this experiment for each of the following dis-
tributions: N(0, 1), Binomial(10, 0.25), Cauchy(0, 1), and Gamma(2, 4).
6. Write a program in R that generates a sample from a population with distribution function

F(x) = 0 for x < −1, F(x) = 1 + x for −1 ≤ x < −½, F(x) = ½ for −½ ≤ x < ½, F(x) = x for ½ ≤ x < 1, and F(x) = 1 for x ≥ 1.

This distribution is Uniform on the set [−1, −½] ∪ [½, 1]. Use this program to generate samples of size n = 5, 10, 25, 50, 100, 500, and 1000. For each sample compute the sample median ξ̂_{0.5}. Repeat this process five times and plot the results on a single set of axes. What effect does the flat area of the distribution have on the convergence of the sample median? For comparison, repeat the entire experiment but compute ξ̂_{0.75} instead. One possible way of generating samples from this distribution is sketched below.
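A minimal sketch of such a sampling function, based on the observation that a draw from this distribution is a random sign times a Uniform(½, 1) variable, is the following; the function name r.flat and the sample sizes used are illustrative assumptions.

# Sketch: generate from the distribution of Experiment 6 and compute two sample quantiles.
r.flat <- function(n) sample(c(-1, 1), n, replace = TRUE) * runif(n, 0.5, 1)
set.seed(1)
for (n in c(10, 100, 1000)) {
  x <- r.flat(n)
  print(c(n = n,
          median = unname(quantile(x, 0.5, type = 1)),    # unstable: F is flat on (-1/2, 1/2)
          upper = unname(quantile(x, 0.75, type = 1))))    # settles near xi_0.75 = 0.75
}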
CHAPTER 4

Convergence of Distributions

“Ask them then,” said the deputy director. “It’s not that important,” said K.,
although in that way his earlier excuse, already weak enough, was made even
weaker. As he went, the deputy director continued to speak about other things.
The Trial by Franz Kafka

4.1 Introduction

In statistical inference it is often the case that we are not interested in whether
a random variable converges to another specific random variable, rather we are
just interested in the distribution of the limiting random variable. Statistical
hypothesis testing provides a good example of this situation. Suppose that we
have a random sample from a distribution F with mean µ, and we wish to
test some hypothesis about µ. The most common test statistic to use in this
situation is Zn = n1/2 σ̂n−1 (µ̂n −µ0 ) where µ̂n and σ̂n are the sample mean and
standard deviation, respectively. The value µ0 is a constant that is specified
by the null hypothesis. In order to derive a statistical hypothesis test for the
null hypothesis based on this test statistic we need to know the distribution
of Zn when the null hypothesis is true, which in this case we will take to be
the condition that µ = µ0 . If the parametric form of F is not known explicitly,
then this distribution can be approximated using the Central Limit Theorem,
which states that Zn approximately has a N(0, 1) distribution when n is large
and µ = µ0 . See Section 4.4. This asymptotic result does not identify a specific
random variable Z that Zn converges to as n → ∞. There is no need because
all we are interested in is the distribution of the limiting random variable. This
chapter will introduce and study a type of convergence that only specifies the
distribution of the random variable of interest as n → ∞.

4.2 Weak Convergence of Random Variables

We wish to define a type of convergence for a sequence of random variables


{Xn }∞
n=1 to a random variable X that focuses on how the distribution of the
random variables in the sequence converge to the distribution of X as n →
∞. To define this type of convergence we will consider how the distribution

functions of the random variables in the sequence {Xn }∞
n=1 converge to the
distribution function of X.
If all of the random variables in the sequence {Xn }∞ n=1 and the random vari-
able X were all continuous then it might make sense to consider the densities
of the random variables. Or, if all of the random variables were discrete we
could consider the convergence of the probability distribution functions of the
random variables. Such approaches are unnecessarily restrictive. In fact, some
of the more interesting examples of convergence of distributions of sequences
of random variables are for sequences of discrete random variables that have
distributions that converge to a continuous distribution as n → ∞. By defining
the mode of convergence in terms of distribution functions, which are defined
for all types of random variables, our definition will allow for such results.
Definition 4.1. Let {Xn }∞ n=1 be a sequence of random variables where Xn
has distribution function Fn for all n ∈ N. Then Xn converges in distribution
to a random variable X with distribution function F as n → ∞ if
lim Fn (x) = F (x),
n→∞

for all x ∈ C(F ), the set of points where F is continuous. This relationship
d
will be represented by Xn −
→ X as n → ∞.
It is clear from Definition 4.1 that the concept of convergence in distribution
literally requires that the distribution of the random variables in the sequence
converge to a distribution that matches the distribution of the limiting random
variable. Convergence in distribution is also often called weak convergence
since the random variables play a secondary role in Definition 4.1. In fact, the
concept of convergence in distribution can be defined without them.
Definition 4.2. Let {Fn }∞ n=1 be a sequence of distribution functions. Then
Fn converges weakly to F as n → ∞ if
lim Fn (x) = F (x),
n→∞

for all x ∈ C(F ). This relationship will be represented by Fn ; F as n → ∞.


Definitions 4.1 and 4.2 do not require that the point-wise convergence of the
distribution functions take place where there are discontinuities in the limiting
distribution function. It turns out that requiring point-wise convergence for
all real values is too strict a requirement in that there are situations where
the distribution obviously converges to a limit, but the stricter requirement is
not met.
Example 4.1. Consider a sequence of random variables {Xn }∞ n=1 whose dis-
tribution function has the form Fn (x) = δ{x; [n−1 , ∞)}. That is, P (Xn =
d
n−1 ) = 1 for all n ∈ N. In the limit we would expect Xn −
→ X as n → ∞ where
X has distribution function F (x) = δ{x; [0, ∞)}. That is, P (X = 0) = 1. If
we consider any x ∈ (−∞, 0) it is clear that
lim Fn (x) = lim δ{x; [n−1 , ∞)} = 0 = F (x).
n→∞ n→∞
Similarly, if x ∈ (0, ∞) we have that there exists a value nx ∈ N such that
n−1 < x for all n ≥ nx . Therefore, for any x ∈ (0, ∞) we have that
lim Fn (x) = lim δ{x; [n−1 , ∞)} = 1 = F (x).
n→∞ n→∞

Now consider what occurs at x = 0. For every n ∈ N, Fn(0) = 0 so that

lim_{n→∞} Fn(0) = lim_{n→∞} δ{0; [n^{-1}, ∞)} = 0 ≠ F(0).

Therefore, if we insist that Fn ; F as n → ∞ only if


lim Fn (x) = F (x),
n→∞

for all x ∈ R, we would not be able to conclude that the convergence takes
place in this instance. However, Definitions 4.1 and 4.2 do not have this strict
requirement, and noting that 0 ∉ C(F) in this case allows us to conclude that Fn ; F, or Xn →^{d} X, as n → ∞. 

The two definitions of convergence in distribution arise in different applica-


tions. In the first application we have a sequence of random variables and we
wish to determine the distribution of the limiting random variable. The Cen-
tral Limit Theorem is an example of this application. In the second application
we are only interested in the limit of a sequence of distribution functions. Ap-
proximation results, such as the normal approximation to the binomial distri-
bution, are examples of this application. Some examples of both applications
are given below.
Example 4.2. Let {Xn }∞ n=1 be a sequence of independent Normal(0, 1 +
n−1 ) random variables, and let X be a Normal(0, 1) random variable. Taking
the limit of the distribution function of Xn as n → ∞ yields

lim_{n→∞} ∫_{−∞}^{x} [2π(1 + n^{-1})]^{-1/2} exp{−½(1 + n^{-1})^{-1}t^2} dt
  = lim_{n→∞} ∫_{−∞}^{x(1+n^{-1})^{-1/2}} (2π)^{-1/2} exp{−½v^2} dv
  = ∫_{−∞}^{x} (2π)^{-1/2} exp{−½v^2} dv = Φ(x),

where the change of variable v = t(1 + n−1 )−1/2 has been used. This limit
is valid for all x ∈ R, and since the distribution function of X is given by
d
Φ(x), it follows from Definition 4.1 that Xn − → X as n → ∞. Note that we
are required to verify that the limit is valid for all x ∈ R in this case because
C(Φ) = R. 
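A one-line numerical check of this limit, with the evaluation point x = 1 chosen arbitrarily, is sketched below.

# Sketch: F_n(1) for the N(0, 1 + 1/n) distribution approaches Phi(1) as n grows.
sapply(c(1, 10, 100, 1000), function(n) pnorm(1, sd = sqrt(1 + 1/n)))
pnorm(1)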

Example 4.3. Let {Xn}∞_{n=1} be a sequence of independent random variables where the distribution function of Xn is

Fn(x) = 1 − exp(θ − x) for x ∈ [θ, ∞), and Fn(x) = 0 for x ∈ (−∞, θ).

Define a new sequence of random variables given by Yn = min{X1, . . . , Xn} for all n ∈ N. The distribution function of Yn can be found by noting that Yn > y if and only if Xk > y for all k ∈ {1, . . . , n}. Therefore, for y ∈ [θ, ∞),

P(Yn ≤ y) = 1 − P(Yn > y) = 1 − P(∩_{i=1}^{n} {Xi > y}) = 1 − Π_{i=1}^{n} P(Xi > y) = 1 − Π_{i=1}^{n} exp(θ − y) = 1 − exp[n(θ − y)].

Further, note that if y ∈ (−∞, θ), then P (Yn ≤ y) is necessarily zero due
to the fact that P (Xn ≤ θ) = 0 for all n ∈ N. Therefore, the distribution
function of Yn is
(
1 − exp[n(θ − y)] for y ∈ [θ, ∞),
Gn (y) =
0 for y ∈ (−∞, θ).

For y ∈ (θ, ∞) we have that

lim_{n→∞} Gn(y) = lim_{n→∞} {1 − exp[n(θ − y)]} = 1.

For y ∈ (−∞, θ) the distribution function Gn(y) is zero for all n ∈ N.


Therefore, it follows from Definition 4.2 that Gn ; G as n → ∞ where
G(y) = δ(y; [θ, ∞)). In terms of random variables it follows that
d
Yn = min{X1 , . . . , Xn } −
→ Y,

as n → ∞ where P(Y = θ) = 1, a degenerate distribution at θ. In subsequent discussions we will use the simpler notation Yn →^{d} θ as n → ∞, where it is understood that convergence in distribution to a constant indicates convergence to a degenerate random variable at that constant. 
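A short simulation sketch of this limit, assuming θ = 2 and an arbitrary number of replications, is given below.

# Sketch: Y_n = min{X_1, ..., X_n} for shifted Exponential data piles up at theta.
set.seed(1)
theta <- 2
for (n in c(5, 50, 500)) {
  y <- replicate(2000, min(theta + rexp(n)))    # simulated copies of Y_n
  print(c(n = n, mean = mean(y), sd = sd(y)))   # mean is about theta + 1/n, sd about 1/n
}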
Example 4.4. This example motivates the Poisson approximation to the
Binomial distribution. Consider a sequence of random variables {Xn }∞ n=1
where Xn has a Binomial(n, n−1 λ) distribution where λ is a positive con-
stant. Let x be a positive real value. The distribution function of Xn is given by

Fn(x) = P(Xn ≤ x) = 0 for x < 0, Fn(x) = Σ_{i=0}^{⌊x⌋} C(n, i)(λ/n)^i(1 − λ/n)^{n−i} for 0 ≤ x ≤ n, and Fn(x) = 1 for x > n.

Consider the limiting behavior of the term

C(n, i)(λ/n)^i(1 − λ/n)^{n−i},

as n → ∞ when i is a fixed integer between 0 and n. Some care must be taken with this limit to keep a single term from diverging while the remaining terms converge to zero, resulting in an indeterminate form. Therefore, note that

C(n, i)(λ/n)^i(1 − λ/n)^{n−i} = [n!/(i!(n − i)!)](λ/n)^i(1 − λ/n)^{−i}(1 − λ/n)^n
  = [n(n − 1) · · · (n − i + 1)/i!] λ^i (n − λ)^{−i} (1 − λ/n)^n
  = (λ^i/i!)(1 − λ/n)^n Π_{k=1}^{i} (n − k + 1)/(n − λ).

Now

lim_{n→∞} Π_{k=1}^{i} (n − k + 1)/(n − λ) = 1,

and Theorem 1.7 implies that

lim_{n→∞} (1 − λ/n)^n = exp(−λ).

Therefore, it follows that

lim_{n→∞} C(n, i)(λ/n)^i(1 − λ/n)^{n−i} = λ^i exp(−λ)/i!,

for each i ∈ {0, 1, . . . , n}, and hence for 0 ≤ x ≤ n we have that

lim_{n→∞} Fn(x) = lim_{n→∞} Σ_{i=0}^{⌊x⌋} C(n, i)(λ/n)^i(1 − λ/n)^{n−i} = Σ_{i=0}^{⌊x⌋} λ^i exp(−λ)/i!, (4.1)

which is the distribution function of a Poisson(λ) random variable. Therefore, Definition 4.1 implies that we have proven that Xn →^{d} X as n → ∞, where X is a Poisson(λ) random variable. It is also clear from the limit exhibited in Equation (4.1) that when n is large we may make use of the approximation P(Xn ≤ x) ≈ P(X ≤ x). Hence, we have proven that under certain conditions, probabilities of a Binomial(n, n^{-1}λ) random variable can be approximated with Poisson(λ) probabilities. 
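The quality of this approximation can be examined numerically with the distribution functions pbinom() and ppois() in R; the value λ = 2 and the grid of points below are assumptions chosen only for illustration.

# Sketch: Binomial(n, lambda/n) probabilities approach Poisson(lambda) probabilities.
lambda <- 2
x <- 0:8
for (n in c(10, 50, 500)) {
  cat("n =", n, "\n")
  print(round(rbind(binomial = pbinom(x, n, lambda / n),
                    poisson  = ppois(x, lambda)), 4))
}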

An important question related to the convergence of distributions is under


what conditions that limiting distribution is a valid distribution function.
That is, suppose that {Fn }∞
n=1 is a sequence of distribution functions such
that
lim Fn (x) = F (x),
n→∞
for all x ∈ C(F ) for some function F (x). Must F (x) necessarily be a distribu-
tion function? Standard mathematical arguments from calculus can be used
to show that F (x) ∈ [0, 1] for all x ∈ R and that F (x) is non-decreasing. See
Exercise 6. We do not have to worry about the right-continuity of F as we
do not necessarily require convergence at the points of discontinuity of F , so
that we can make these points right-continuous if we wish. The final property
that is required for F to be a valid distribution function is that
lim F (x) = 1,
x→∞

and
lim F (x) = 0.
x→−∞
The example given below shows that these properties do not always follow for
the limiting distribution.
Example 4.5. Lehmann (1999) considers a sequence of random variables
{Xn }∞
n=1 such that the distribution function of Xn is given by

0
 x<0
Fn (x) = 1 − pn , 0 ≤ x < n,

1 x ≥ n,

where {pn }∞
n=1 is a sequence of real numbers such that

lim pn = p.
n→∞

If p = 0 then (
0 x<0
lim Fn (x) = F (x) =
n→∞ 1 x ≥ 0,
which is a degenerate distribution at zero. However, if 0 < p < 1 then the
limiting function is given by
(
0 x < 0,
F (x) =
1 − p x ≥ 0.
It is clear in this case that
lim F (x) = 1 − p < 1.
x→∞

Therefore, even though Fn ; F as n → ∞, it does not follow that F is a


valid distribution function. 

In Example 4.5 we observed a situation where a sequence of distribution func-


tions converged to a function that was not a valid distribution function. The
problem in this case was that the sequence of distribution functions has a
non-zero mass located at a point that diverged to ∞ as n → ∞, leaving the
upper tail of the distribution function less than one. It turns out that this is
the essential problem which keeps us from concluding that a limiting function
is a valid distribution function. Sequences that do not have this problem are
said to be bounded in probability.
Definition 4.3. Let {Xn }∞ n=1 be a sequence of random variables. The se-
quence is bounded in probability if for every ε > 0 there exists an xε ∈ R and
nε ∈ N such that P (|Xn | ≤ xε ) > 1 − ε for all n > nε .
Example 4.6. Reconsider the situation in Example 4.5 where {Xn }∞ n=1 is
a sequence of random variables such that the distribution function of Xn is
given by 
0
 x<0
Fn (x) = 1 − pn , 0 ≤ x < n,

1 x ≥ n,

where {pn }∞
n=1 is a sequence of real numbers such that

lim pn = p.
n→∞

If we first consider the case where p = 0, then for every ε > 0, we need only
use Definition 1.1 and find a value nε such that pn < ε for all n ≥ nε . For
this value of n, it will follow that P (|Xn | ≤ 0) ≥ 1 − ε for all n > nε , and
by Definition 4.3, the sequence is bounded in probability. On the other hand,
consider the case where p > 0 and we set a value of ε such that 0 < ε < p.
Let x be a positive real value. For any n > x we have the property that
P (|Xm | ≤ x) = 1 − p ≤ 1 − ε for all m > n. Therefore, it is not possible to
find the value of x required in Definition 4.3, and the sequence is not bounded
in probability. 

In Examples 4.5 and 4.6 we found that when the sequence in question was
bounded in probability, the corresponding limiting distribution function was
a valid distribution function. When the sequence was not bounded in proba-
bility, the limiting distribution function was not a valid distribution function.
Hence, for that example, the property that the sequence is bounded in prob-
ability is equivalent to the condition that the limiting distribution function is
valid. This property is true in general.
Theorem 4.1. Let {Xn }∞ n=1 be a sequence of random variables where Xn has
distribution function Fn for all n ∈ N. Suppose that Fn ; F as n → ∞ where
F may or may not be a valid distribution function. Then,
lim F (x) = 0,
x→−∞

and
lim F (x) = 1,
x→∞
if and only if the sequence {Xn }∞
n=1 is bounded in probability.

Proof. We will first assume that the sequence {Xn }∞


n=1 is bounded in proba-
d
bility and that Xn −→ X as n → ∞. Let ε > 0. Because {Xn }∞ n=1 is bounded
in probability there exist xε/2 ∈ R and nε/2 ∈ N such that P (|Xn | ≤ xε/2 ) ≥
1 − ε/2 for all n > nε/2 . This implies that Fn (x) ≥ 1 − ε/2 and because
Fn (x) ≤ 1 for all x ∈ R and n ∈ N, it follows that |Fn (x) − 1| < ε/2 for
d
all x > xε/2 and n > nε/2 . Because Xn − → X as n → ∞ it follows that
there exists n′_{ε/2} ∈ N such that |Fn(x) − F(x)| < ε/2 for all n > n′_{ε/2}. Let nε = max{n_{ε/2}, n′_{ε/2}}. Theorem A.18 implies that

|F(x) − 1| ≤ |F(x) − Fn(x)| + |Fn(x) − 1| ≤ ε,

for all x ∈ C(F) such that x > x_{ε/2} and n > nε. Hence, it follows that
|F (x) − 1| ≤ ε for all x ∈ C(F ) such that x > xε/2 . It follows then that
lim F (x) = 1.
x→∞

Similar arguments can be used to calculate the limit of F as x → −∞. See


Exercise 7. To prove the converse, see Exercise 8.
The convergence of sequences that are not bounded in probability can still
be studied in a more general framework called vague convergence. See Section
5.8.3 of Gut (2005) for further information on this type of convergence.
Recall that a quantile function is defined as the inverse of the distribution
function, where we have used ξp = F −1 (p) = inf{x : F (x) ≥ p} as the inverse
in cases where there is not a unique inverse. Therefore, associated with each
sequence of distribution functions {Fn }∞n=1 , is a sequence of quantile functions
{Fn−1 (t)}∞
n=1 . The weak convergence of the distribution functions to a distri-
bution function F as n → ∞ implies the convergence of the quantile functions
as well. However, as distribution functions are not required to converge at ev-
ery point, a similar result holds for the convergence of the quantile functions.

Theorem 4.2. Let {Fn }∞ n=1 be a sequence of distribution functions that con-
verge weakly to a distribution function F as n → ∞. Let {Fn−1 }∞ n=1 be the
corresponding sequence of quantile functions and let F −1 be the quantile func-
tion corresponding to F . Define N to be the set of points where Fn−1 does not
converge pointwise to F −1 . That is
n o
N = (0, 1) \ t ∈ (0, 1) : lim Fn−1 (t) = F −1 (t) .
n→∞

Then N is countable.
A proof of Theorem 4.2 can be found in Section 1.5.6 of Serfling (1980). As
Theorem 4.2 implies, there may be as many as a countable number of points
where the convergence does not take place. Certainly for the points where
the distribution function does not converge, we cannot expect the quantile
function to necessarily converge at the inverse of those points. However, the
result is not specific about at which points the convergence may not take place,
and other points may be included in the set N as well, such as the inverse
of points that occur where the distribution functions are not increasing. On
the other hand, there may be cases where the convergence of the quantile
functions may occur at all points in (0, 1), as the next example demonstrates.
Example 4.7. Let {Xn }∞ n=1 be a sequence of random variables where Xn
is an Exponential(θ + n−1 ) random variable for all n ∈ N where θ is a
positive real constant. Let X be an Exponential(θ) random variable. It can
d
be shown that Xn − → X as n → ∞. See Exercise 2. The quantile function
associated with Xn is given by Fn−1 (t) = −(θ + n−1 )−1 log(1 − t) for all n ∈ N
and t ∈ (0, 1). Similarly, the quantile function associated with X is given by
F −1 (t) = −θ−1 log(1 − t). Let t ∈ (0, 1) and note that
lim Fn−1 (t) = lim −(θ + n−1 )−1 log(1 − t) = −θ−1 log(1 − t) = F −1 (t).
n→∞ n→∞

Therefore, the convergence of the quantile functions holds for all t ∈ (0, 1),
and therefore N = ∅. 

We also present an example where the convergence of the quantile function


does not hold at some points t ∈ (0, 1).
Example 4.8. Let {Xn}∞_{n=1} be a sequence of random variables where Xn has a Bernoulli[½ + (n + 2)^{-1}] distribution for all n ∈ N and let X be a Bernoulli(½) random variable. It can be shown that Xn →^{d} X as n → ∞. See Exercise 4. The distribution function of Xn is given by

Fn(x) = 0 for x < 0, Fn(x) = ½ − (n + 2)^{-1} for 0 ≤ x < 1, and Fn(x) = 1 for x ≥ 1,

and the distribution function of X is given by

F(x) = 0 for x < 0, F(x) = ½ for 0 ≤ x < 1, and F(x) = 1 for x ≥ 1.

Therefore, Fn^{-1}(½) = 1 for all n ∈ N but F^{-1}(½) = 0. Therefore,

lim_{n→∞} Fn^{-1}(½) ≠ F^{-1}(½),

and hence ½ ∈ N. 
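The two quantile functions can be evaluated directly with qbinom(), which uses the same infimum definition of the quantile; the sketch below is for illustration only.

# Sketch: F_n^{-1}(1/2) = 1 for every n, while F^{-1}(1/2) = 0.
sapply(c(1, 10, 100), function(n) qbinom(0.5, size = 1, prob = 0.5 + 1/(n + 2)))
qbinom(0.5, size = 1, prob = 0.5)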

Another important question is whether the expectations of random variables


that convergence in distribution to a random variable also converge. The an-
swer to this question is not straightforward, and as we will show in Chapter
5, weak convergence is not enough to ensure that the corresponding moments
will also converge. In this section we will develop some limited results along
this line of inquiry that will be specifically developed to result in an equiva-
lence between the convergence of certain expectations and weak convergence.
We begin with the result of Helly and Bray, which concerns expectations
truncated to a finite real interval.
Theorem 4.3 (Helly and Bray). Let {Fn }∞ n=1 be a sequence of distribution
functions and let g be a function that is continuous on [a, b] where −∞ <
a < b < ∞ and a and b are continuity points of a distribution function F. If Fn ; F as n → ∞, then

lim_{n→∞} ∫_{a}^{b} g(x) dFn(x) = ∫_{a}^{b} g(x) dF(x).

Proof. We follow the development of this result given in Sen and Singer (1993).
Let ε > 0 and consider a partition of the interval [a, b] given by a = x0 <
x1 < · · · < xm < xm+1 = b. We will assume that xk is a continuity point of
F for all k ∈ {0, 1, . . . , m + 1} and that xk+1 − xk < ε for k ∈ {0, 1, . . . , m}.
We will form a step function to approximate g on the interval [a, b]. Define gm(x) = g[½(xk + xk+1)] whenever x ∈ (xk, xk+1) and note that because a = x0 < · · · < xm+1 = b it follows that for every m ∈ N and x ∈ [a, b] we can write gm(x) as

gm(x) = Σ_{k=0}^{m} g[½(xk + xk+1)]δ{x; (xk, xk+1)}.

Now, repeated application of Theorem A.18 implies that

|∫_{a}^{b} g(x) dFn(x) − ∫_{a}^{b} g(x) dF(x)| ≤ |∫_{a}^{b} g(x) dFn(x) − ∫_{a}^{b} gm(x) dFn(x)|
  + |∫_{a}^{b} gm(x) dFn(x) − ∫_{a}^{b} gm(x) dF(x)|
  + |∫_{a}^{b} gm(x) dF(x) − ∫_{a}^{b} g(x) dF(x)|.

To bound the first term we note that since xk+1 − xk < ε for all k ∈ {0, . . . , m} it then follows that if x ∈ (xk, xk+1) then |x − ½(xk + xk+1)| < ε. Because g is a continuous function it follows that there exists ηε > 0 such that |gm(x) − g(x)| = |g[½(xk + xk+1)] − g(x)| < ηε, for all x ∈ (xk, xk+1). Therefore, there exists δε > 0 such that

sup_{x∈[a,b]} |gm(x) − g(x)| < ⅓δε.

Now, Theorem A.6 implies that

|∫_{a}^{b} g(x) dFn(x) − ∫_{a}^{b} gm(x) dFn(x)| = |∫_{a}^{b} [g(x) − gm(x)] dFn(x)|
  ≤ ∫_{a}^{b} |g(x) − gm(x)| dFn(x)
  ≤ sup_{x∈[a,b]} |g(x) − gm(x)| ∫_{a}^{b} dFn(x)
  ≤ ⅓δε.
Hence, this term can be made arbitrarily small by choosing ε to be small, or equivalently by choosing m to be large. For the second term we note that since gm is a step function, we have that

∫_{a}^{b} gm(x) dFn(x) = Σ_{k=0}^{m} g[½(xk + xk+1)][Fn(xk+1) − Fn(xk)],

where the equality follows from the definition of gm. Similarly,

∫_{a}^{b} gm(x) dF(x) = Σ_{k=0}^{m} g[½(xk + xk+1)][F(xk+1) − F(xk)].

Therefore,

∫_{a}^{b} gm(x) dFn(x) − ∫_{a}^{b} gm(x) dF(x) = Σ_{k=0}^{m} g[½(xk + xk+1)][Fn(xk+1) − F(xk+1) − Fn(xk) + F(xk)].
Now, since Fn ; F as n → ∞ and xk and xk+1 are continuity points of F, it follows from Definition 4.1 that

lim_{n→∞} Fn(xk) = F(xk),

for all k ∈ {0, 1, . . . , m + 1}. Therefore, it follows that

lim_{n→∞} |∫_{a}^{b} gm(x) dFn(x) − ∫_{a}^{b} gm(x) dF(x)| = 0.

The third term can also be bounded by ⅓δε using similar arguments to those above. See Exercise 19. Since all three terms can be made smaller than ⅓δε for any δε > 0, it follows that

lim_{n→∞} |∫_{a}^{b} g(x) dFn(x) − ∫_{a}^{b} g(x) dF(x)| < δε,

for any δε > 0 and therefore it follows that

lim_{n→∞} |∫_{a}^{b} g(x) dFn(x) − ∫_{a}^{b} g(x) dF(x)| = 0,

which completes the proof.

The restriction that the range of the integral is bounded can be weakened if
we are willing to assume that the function of interest is instead bounded. This
result is usually called the extended or generalized theorem of Helly and Bray.
Theorem 4.4 (Helly and Bray). Let g be a continuous and bounded function and let {Fn}∞_{n=1} be a sequence of distribution functions such that Fn ; F as n → ∞, where F is a distribution function. Then,

lim_{n→∞} ∫_{−∞}^{∞} g(x) dFn(x) = ∫_{−∞}^{∞} g(x) dF(x).

Proof. Once again, we will use the method of proof from Sen and Singer
(1993). This method of proof breaks up the integrals in the difference
Z ∞ Z ∞
g(x)dFn (x) − g(x)dF (x),
−∞ −∞
into two basic parts. In the first part, the integrals are integrated over a finite
range, and hence Theorem 4.3 (Helly and Bray) can be used to show that the
difference converges to zero. The second part corresponds to the integrals of
the leftover tails of the range. These differences will be made arbitrarily small
by appealing to both the assumed boundedness of the function g and the fact
that F is a distribution function. To begin, let ε > 0 and let
g̃ = sup |g(x)|,
x∈R

where g̃ < ∞ by assumption. Let a and b be continuity points of F. Repeated use of Theorem A.18 implies that

|∫_{−∞}^{∞} g(x) dFn(x) − ∫_{−∞}^{∞} g(x) dF(x)| ≤ |∫_{−∞}^{a} g(x) dFn(x) − ∫_{−∞}^{a} g(x) dF(x)|
  + |∫_{a}^{b} g(x) dFn(x) − ∫_{a}^{b} g(x) dF(x)|
  + |∫_{b}^{∞} g(x) dFn(x) − ∫_{b}^{∞} g(x) dF(x)|.

Theorem 4.3 (Helly and Bray) can be applied to the second term as long as a and b are finite constants that do not depend on n, to obtain

lim_{n→∞} |∫_{a}^{b} g(x) dFn(x) − ∫_{a}^{b} g(x) dF(x)| = 0.

To find bounds for the remaining two terms, we note that since F is a distribution function it follows that

lim_{x→∞} F(x) = 1,

and

lim_{x→−∞} F(x) = 0.

Therefore, Definition 1.1 implies that for every δ > 0 there exist finite continuity points a < b such that

∫_{−∞}^{a} dF(x) = F(a) < δ,

and

∫_{b}^{∞} dF(x) = 1 − F(b) < δ.
Therefore, for these values of a and b it follows that

|∫_{−∞}^{a} g(x) dFn(x)| ≤ ∫_{−∞}^{a} g̃ dFn(x) = g̃ ∫_{−∞}^{a} dFn(x) = g̃Fn(a).

For the choice of a described above we have that

lim_{n→∞} g̃Fn(a) = g̃F(a) < g̃δ,

since a is a continuity point of F, where δ can be chosen so that g̃δ < ε. Therefore,

lim_{n→∞} g̃Fn(a) < ε.

Similarly, for this same choice of a we have that

|∫_{−∞}^{a} g(x) dF(x)| < g̃δ,

and therefore Theorem A.18 implies that

lim_{n→∞} |∫_{−∞}^{a} g(x) dFn(x) − ∫_{−∞}^{a} g(x) dF(x)| ≤ lim_{n→∞} [|∫_{−∞}^{a} g(x) dFn(x)| + |∫_{−∞}^{a} g(x) dF(x)|] < 2ε,

for every ε > 0. Therefore, it follows that

lim_{n→∞} |∫_{−∞}^{a} g(x) dFn(x) − ∫_{−∞}^{a} g(x) dF(x)| = 0.

Similarly, it can be shown that

lim_{n→∞} |∫_{b}^{∞} g(x) dFn(x) − ∫_{b}^{∞} g(x) dF(x)| = 0,

and the result follows. See Exercise 9.
and the result follows. See Exercise 9.

The limit property in Theorem 4.4 (Helly and Bray) can be further shown to
be equivalent to the weak convergence of the sequence of distribution func-
tions, thus characterizing weak convergence. Another characterization of weak
convergence is based on the convergence of the characteristic functions corre-
sponding to the sequence of distribution functions. However, in order to prove
this equivalence, we require another of Helly’s Theorems.
Theorem 4.5 (Helly). Let {Fn}∞_{n=1} be a sequence of non-decreasing functions that are uniformly bounded. Then the sequence {Fn}∞_{n=1} contains at least one subsequence {F_{n_m}}∞_{m=1}, where {n_m}∞_{m=1} is an increasing sequence in N, such that F_{n_m} ; F as m → ∞ where F is a non-decreasing function.
A proof of Theorem 4.5 can be found in Section 37 of Gnedenko (1962).
Theorem 4.5 is somewhat general in that it deals with sequences that may
or may not be distribution functions. The result does apply to sequences of
distribution functions since they are uniformly bounded between zero and
one. However, as our discussion earlier in this chapter suggests, the limiting
function F may not be a valid distribution function, even when the sequence
is comprised of distribution functions. There are two main potential problems
in this case. The first is that F need not be right continuous. But, since
weak convergence is defined on the continuity points of F , it follows that we
can always define F in such a way that F is right continuous at the points
of discontinuity of F , without changing the weak convergence properties of
the sequence. See Exercise 10. The second potential problem is that F may
not have the proper limits as x → ∞ and x → −∞. This problem cannot
be addressed without further assumptions. See Theorem 4.1. It turns out
that the convergence of the corresponding distribution functions provides the
additional assumptions that are required. With this assumption, the result
given below provides other equivalent methods for assessing weak convergence.
Theorem 4.6. Let {Fn }∞ n=1 be a sequence of distribution functions and let
{ψn }∞
n=1 be a sequence of characteristic functions such that ψn is the char-
acteristic function of Fn for all n ∈ N. Let F be a distribution function with
characteristic function ψ. The following three statements are equivalent:

1. Fn ; F as n → ∞.
2. For each t ∈ R,
lim ψn (t) = ψ(t).
n→∞

3. For each bounded and continuous function g,

lim_{n→∞} ∫ g(x) dFn(x) = ∫ g(x) dF(x).

Proof. We begin by showing that Conditions 1 and 3 are equivalent. The


fact that Condition 1 implies Condition 3 follows directly from Theorem 4.4
(Helly and Bray). To prove that Condition 3 implies Condition 1 we follow
the method of proof given by Serfling (1980). Let ε > 0 and let t ∈ R be a
continuity point of F. Consider a function g given by

g(x) = 1 for x ≤ t, g(x) = 1 − ε^{-1}(x − t) for t < x < t + ε, and g(x) = 0 for x ≥ t + ε.

Note that g is continuous and bounded. See Figure 4.1. Therefore, Condition 3 implies that

lim_{n→∞} ∫_{−∞}^{∞} g(x) dFn(x) = ∫_{−∞}^{∞} g(x) dF(x).

Now, because g(x) = 0 when x ≥ t + ε, it follows that

∫_{−∞}^{∞} g(x) dFn(x) = ∫_{−∞}^{t} dFn(x) + ∫_{t}^{t+ε} g(x) dFn(x).

Since g(x) ≥ 0 for all x ∈ [t, t + ε] it follows that

∫_{−∞}^{∞} g(x) dFn(x) ≥ ∫_{−∞}^{t} dFn(x) = Fn(t),

for all n ∈ N. Therefore, Theorem 1.6 implies that

lim sup_{n→∞} Fn(t) ≤ lim_{n→∞} ∫_{−∞}^{∞} g(x) dFn(x) = ∫_{−∞}^{∞} g(x) dF(x),

where we have used the limit superior because we do not know whether the limit of Fn(t) exists or not. Now

∫_{−∞}^{∞} g(x) dF(x) = ∫_{−∞}^{t} dF(x) + ∫_{t}^{t+ε} g(x) dF(x) ≤ ∫_{−∞}^{t} dF(x) + ∫_{t}^{t+ε} dF(x) = F(t + ε),

since g(x) ≤ 1 for all x ∈ [t, t + ε]. Therefore, we have shown that

lim sup_{n→∞} Fn(t) ≤ F(t + ε).

Similar arguments can be used to show that


lim inf Fn (t) ≥ F (t − ε).
n→∞

See Exercise 11. Hence, we have shown that


F (t − ε) ≤ lim inf Fn (t) ≤ lim sup Fn (t) ≤ F (t + ε),
n→∞ n→∞

for every ε > 0. Since t is a continuity point of F , this implies that


lim inf Fn (t) = lim sup Fn (t)
n→∞ n→∞

or equivalently that
lim Fn (t) = F (t),
n→∞
by Definition 1.3, when t is a continuity point of F . Therefore, Definition 4.2
implies that Fn ; F as n → ∞. Hence, we have shown that Conditions 1 and
3 are equivalent.
To prove that Conditions 1 and 2 are equivalent we use the method of proof given by Gnedenko (1962). We will first show that Condition 1 implies Con-
dition 2. Therefore, let us assume that {Fn }∞
n=1 is a sequence of distribution
functions that converge weakly to a distribution function F as n → ∞. Define
g(x) = exp(itx) where t is a constant and note that g(x) is continuous and
bounded. Therefore, we use the equivalence between Conditions 1 and 3 to conclude that

lim_{n→∞} ψn(t) = lim_{n→∞} ∫_{−∞}^{∞} exp(itx) dFn(x) = ∫_{−∞}^{∞} exp(itx) dF(x) = ψ(t),

for all t ∈ R. Therefore, Condition 2 follows. The converse is somewhat more


difficult to prove because we need to account for the fact that F may not be
a distribution function. To proceed, we assume that
lim ψn (t) = ψ(t),
n→∞

for all t ∈ R. We first prove that {Fn }∞ n=1 converges to a distribution func-
tion. We then finish the proof by showing that the characteristic function of
F must be ψ(t). To obtain the weak convergence we begin by concluding from
Theorem 4.5 (Helly) that there is a subsequence {Fnm }∞ m=1 , where {nm }∞ m=1
is an increasing sequence in N, that converges weakly to some non-decreasing
function F . From the discussion following Theorem 4.5, we know that we can
assume that F is a right continuous function. The proof that F has the correct
limit properties is rather technical and is somewhat beyond the scope of this
book. For a complete argument see Gnedenko (1962). We shall, for the
rest of this argument, assume that F has the necessary properties to be a dis-
tribution function. It follows from the fact that Condition 1 implies Condition
2 that the characteristic function ψ must correspond to the distribution func-
tion F . To complete the proof, we must now show that the sequence {Fn }∞ n=1
converges weakly to F as n → ∞. We will use a proof by contradiction. That
is, let us suppose that the sequence {Fn }∞n=1 does not converge weakly to F as
n → ∞. In this case we would be able to find a sequence of integers {cn }∞ n=1
such that Fcn converges weakly to some distribution function G that differs
from F at at least one point of continuity. However, as we have stated above,
G must have a characteristic function equal to ψ(t) for all t ∈ R. But Theo-
rem 2.27 implies that G must be the same distribution function as F , thereby
contradicting our assumption, and hence it follows that {Fn }∞ n=1 converges
weakly to F as n → ∞.

Example 4.9. Let {Xn }∞ n=1 be a sequence of random variables where Xn
has a N(0, 1 + n⁻¹) distribution for all n ∈ N and let X be a N(0, 1) random
variable. Hence Xn has characteristic function ψn(t) = exp[−½t²(1 + n⁻¹)]
and X has characteristic function ψ(t) = exp(−½t²). Let t ∈ R and note that

lim_{n→∞} ψn(t) = lim_{n→∞} exp[−½t²(1 + n⁻¹)] = exp(−½t²) = ψ(t).

Therefore, Theorem 4.6 implies that Xn →d X as n → ∞. □
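The convergence of the characteristic functions in Example 4.9 can also be checked numerically. The following minimal Python sketch (assuming the NumPy library; the grid of t values is an arbitrary choice) evaluates the maximum pointwise difference between ψn and ψ:

    import numpy as np

    # Characteristic functions from Example 4.9; both are real-valued here.
    def psi_n(t, n):
        return np.exp(-0.5 * t**2 * (1.0 + 1.0 / n))

    def psi(t):
        return np.exp(-0.5 * t**2)

    t = np.array([-3.0, -1.0, 0.5, 1.0, 2.0])
    for n in [1, 10, 100, 1000]:
        print(n, np.max(np.abs(psi_n(t, n) - psi(t))))  # decreases toward zero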
Example 4.10. Let {Xn }∞ n=1 be a sequence of random variables where Xn
has distribution Fn for all n ∈ N. Suppose the Xn converges in distribution
to a random variable X with distribution F as n → ∞. Now let g(x) =
xδ{x; (−δ, δ)}, for a specified 0 < δ < ∞, which is a bounded and continuous

Figure 4.1 The bounded and continuous function used in the proof of Theorem 4.6.

function. Then, it follows from Theorem 4.6 that


lim_{n→∞} E(Xn δ{Xn ; (−δ, δ)}) = lim_{n→∞} ∫ g(x) dFn(x)
                                = ∫ g(x) dF(x)
                                = E(X δ{X; (−δ, δ)}).


Thus we have shown that trimmed expectations of sequences of random vari-
ables that converge in distribution also converge. 

It is important to note that we cannot take g(x) = x in Theorem 4.6 be-


cause this function is not bounded. Therefore, Theorem 4.6 cannot be used to
conclude that

lim_{n→∞} E(Xn) = E(X),

whenever Xn →d X as n → ∞. In fact, such a result is not true in general,
though Theorem 4.6 does provide some important clues as to what conditions
may be required to make such a conclusion.
Example 4.11. Let {Xn }∞ n=1 be a sequence of random variables where Xn
has distribution Fn for all n ∈ N. Suppose the Xn converges in distribution
to a random variable X with distribution F as n → ∞. Assume in this case
that there exists an interval (−a, a) where 0 < a < ∞ such that P [Xn ∈
(−a, a)] = P [X ∈ (−a, a)] = 1 for all n ∈ N. Define g(x) = xδ{x; (−δ, δ)} as
before where we specify δ such that a < δ < ∞. Once again, this is a bounded
and continuous function. It follows from Theorem 4.6 that
lim_{n→∞} E(Xn δ{Xn ; (−δ, δ)}) = E(X δ{X; (−δ, δ)}),

but in this case we note that E(Xn) = E(Xn δ{Xn ; (−δ, δ)}) and E(X) =
E(X δ{X; (−δ, δ)}). Therefore, in this case we have proven that

lim_{n→∞} E(Xn) = E(X),

under the assumption that the random variables stay within a bounded subset
of the real line. 

Note that the boundedness of the sequence of random variables is not a nec-
essary condition for the expectations of sequences of random variables that
converge in distribution to also converge, though it is sufficient. We will return
to this topic in Chapter 5 where we will develop equivalent conditions to the
convergence of the expectations.
Noting that the convergence detailed in Definitions 4.1 and 4.2 is pointwise in
terms of the sequence of distribution functions, it may be somewhat surprising
that the convergence can be shown to be uniform as well. This is due to
the special properties associated with distribution functions in that they are
bounded and non-decreasing.
Theorem 4.7 (Pólya). Suppose that {Fn }∞ n=1 is a sequence of distribution
functions such that Fn ; F as n → ∞ where F is a continuous distribution
function. Then
lim_{n→∞} sup_{t∈R} |Fn(t) − F(t)| = 0.

The proof of Theorem 4.7 essentially follows along the same lines as that of
Theorem 3.18 where we considered the uniform convergence of the empirical
distribution function. See Exercise 12.
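The uniform convergence asserted by Theorem 4.7 can be illustrated numerically for the sequence of Example 4.9. The sketch below (assuming Python with NumPy and SciPy; approximating the supremum on a finite grid is an assumption of the sketch) computes sup_t |Fn(t) − F(t)| for increasing n:

    import numpy as np
    from scipy.stats import norm

    # F_n is the N(0, 1 + 1/n) distribution function; F is the N(0, 1) distribution function.
    t = np.linspace(-6.0, 6.0, 100001)
    F = norm.cdf(t)
    for n in [1, 10, 100, 1000]:
        Fn = norm.cdf(t, scale=np.sqrt(1.0 + 1.0 / n))
        print(n, np.max(np.abs(Fn - F)))  # the approximate supremum shrinks toward zero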
When considering the modes of convergence studied thus far, it would appear
conceptually that convergence in distribution is a rather weak concept in that
it does not require that the random variables Xn and X should be close to one
another when n is large. Only the corresponding distributions of the sequence
need coincide with the distribution of X as n → ∞. Therefore, it would seem
that any of the modes of convergence studied in Chapter 3, which all require
that the random variables Xn and X coincide in some sense in the limit,
would also require that the distributions of the random variables coincide.
This should essentially guarantee convergence in distribution for these
sequences. It suffices to prove this property for convergence in probability, the
weakest of the modes of convergence studied in Chapter 3.
Theorem 4.8. Let {Xn }∞ n=1 be a sequence of random variables that converge
in probability to a random variable X as n → ∞. Then Xn →d X as n → ∞.
Proof. Let ε > 0, the distribution function of Xn be Fn (x) = P (Xn ≤ x) for
all n ∈ N, and F denote the distribution function of X. Then
Fn (x) = P (Xn ≤ x)
= P ({Xn ≤ x} ∩ {|Xn − X| < ε}) + P ({Xn ≤ x} ∩ {|Xn − X| ≥ ε})
≤ P ({Xn ≤ x} ∩ {|Xn − X| < ε}) + P ({|Xn − X| ≥ ε}),
by Theorem 2.3. Now note that
{Xn ≤ x} ∩ {|Xn − X| < ε} ⊂ {X ≤ x + ε},
so that Theorem 2.3 implies
P ({Xn ≤ x} ∩ {|Xn − X| < ε}) ≤ P (X ≤ x + ε),
and therefore,
Fn (x) ≤ P (X ≤ x + ε) + P (|Xn − X| ≥ ε).
Hence, without assuming that the limit exists, we can use Theorem 1.6 to
show that

lim sup_{n→∞} Fn(x) ≤ lim sup_{n→∞} P(X ≤ x + ε) + lim sup_{n→∞} P(|Xn − X| ≥ ε)
                    = P(X ≤ x + ε) = F(x + ε),
where the second term in the sum converges to zero because Xn →p X as
n → ∞. It can similarly be shown that

lim inf_{n→∞} Fn(x) ≥ F(x − ε).

See Exercise 13. Suppose that x ∈ C(F). Since ε > 0 is arbitrary, we have
shown that

F(x) = F(x−) ≤ lim inf_{n→∞} Fn(x) ≤ lim sup_{n→∞} Fn(x) ≤ F(x+) = F(x).

Therefore, it follows that

lim inf_{n→∞} Fn(x) = lim sup_{n→∞} Fn(x) = F(x),

and Definition 1.3 implies that

lim_{n→∞} Fn(x) = F(x),

for all x ∈ C(F). Therefore, it follows that Xn →d X as n → ∞.

Given the previous discussion it would be quite reasonable to conclude that


the converse of Theorem 4.8 is not true. This can be proven using the example
below.
Example 4.12. Let {Xn }∞ n=1 be a sequence of random variables where Xn =
(−1)ⁿZ where Z has a N(0, 1) distribution for all n ∈ N. Because Z and −Z
both have a N(0, 1) distribution, it follows that Xn →d Z as n → ∞. But note
that for ε > 0 it follows that P(|Xn − Z| > ε) = P(|2Z| > ε) for all odd valued
n ∈ N, which is a non-zero constant. Therefore, it does not follow that

lim_{n→∞} P(|Xn − Z| > ε) = 0,

and therefore the sequence {Xn }∞ n=1 does not converge in probability to Z as
n → ∞. □
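Example 4.12 can also be examined by simulation. The minimal Python sketch below (assuming NumPy; the value of ε and the seed are arbitrary choices) shows that P(|Xn − Z| > ε) is zero for even n but remains a fixed non-zero constant for odd n, so the sequence cannot converge in probability to Z:

    import numpy as np

    rng = np.random.default_rng(0)
    z = rng.standard_normal(100_000)
    eps = 0.5
    for n in [9, 10, 99, 100]:
        x_n = (-1) ** n * z                       # X_n = (-1)^n Z
        print(n, np.mean(np.abs(x_n - z) > eps))  # 0 for even n, about 0.80 for odd n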

Additional assumptions may be added to a sequence that converges in distri-
bution to imply that the sequence converges in probability as well. One such
condition occurs when the sequence converges in distribution to a degenerate
distribution.
Theorem 4.9. Let {Xn }∞ n=1 be a sequence of random variables that converge
in distribution to a real constant c as n → ∞. Then Xn →p c as n → ∞.

Proof. Let ε > 0 and suppose that Xn →d c as n → ∞. Denote the distribution
function of Xn by Fn for all n ∈ N. The limiting distribution function F is
given by F(x) = δ{x; [c, ∞)} so that C(F) = R \ {c}. Then

P(|Xn − c| ≤ ε) = P(c − ε ≤ Xn ≤ c + ε) = Fn(c + ε) − Fn[(c − ε)−].

Noting that c + ε and c − ε are both elements of C(F) for all ε > 0 implies
that

lim_{n→∞} P(|Xn − c| ≤ ε) = lim_{n→∞} {Fn(c + ε) − Fn[(c − ε)−]}
                          = F(c + ε) − F[(c − ε)−]
                          = 1 − 0 = 1,

where we have used the fact that F[(c − ε)−] = F(c − ε) since (c − ε) ∈ C(F).
Therefore, Definition 3.1 implies that Xn →p c as n → ∞.

While a sequence that converges in distribution may not always converge in


probability, it turns out that there does exist a sequence of random variables
with the same convergence in distribution properties, that also converge al-
most certainly. This result is known as the Skorokhod Representation Theorem.
Theorem 4.10 (Skorokhod). Let {Xn }∞ n=1 be a sequence of random variables
that converge in distribution to a random variable X. Then there exists a
sequence of random variables {Yn }∞ n=1 and a random variable Y defined on
a probability space (Ω, F, P ), where Ω = [0, 1], F = B{[0, 1]}, and P is a
continuous uniform probability measure on Ω, such that

1. X and Y have the same distribution.
2. Xn and Yn have the same distribution for all n ∈ N.
3. Yn →a.c. Y as n → ∞.

Proof. This proof is based on the development of Serfling (1980). Like many
existence proofs, this one is constructive in nature. Let F be the distribution
WEAK CONVERGENCE OF RANDOM VARIABLES 179
function of X and Fn be the distribution function of Xn for all n ∈ N. Define
the random variable Y : [0, 1] → R as Y (ω) = F −1 (ω). Similarly define
Yn (ω) = Fn−1 (ω) for all n ∈ N. We now prove that these random variables
have the properties listed earlier. We begin by proving that X and Y have
the same distribution. The distribution function of Y is given by G(y) =
P [ω : Y (ω) ≤ y] = P [ω : F −1 (ω) ≤ y]. Now Theorem 3.22 implies that
P [ω : F −1 (ω) ≤ y] = P [ω : ω ≤ F (y)]. Because P is a uniform probability
measure, it follows that G(y) = P [ω : ω ≤ F (y)] = F (y), and therefore we
have proven that X and Y have the same distribution. Using similar arguments
it can be shown that Xn and Yn have the same distribution for all n ∈ N.
Hence, it remains for us to show that Yn converges almost certainly to Y as
n → ∞. Note that the sequence of random variables {Yn }∞ n=1 is actually the
sequence {Fn−1 }∞n=1 , which is a sequence of quantile functions corresponding to
a sequence of distribution functions that converge weakly. Theorem 4.2 implies
that if we collect together all ω ∈ [0, 1] such that the sequence {Yn (ω)}∞ n=1
does not converge pointwise to Y (ω) = F −1 (ω) we get a countable set, which
has probability zero with respect to the continuous probability measure P on
Ω. Therefore, Definition 3.2 implies that Yn →a.c. Y as n → ∞.

The most common example of using Theorem 4.10 arises when proving that
continuous functions of a sequence of random variables that converge weakly
also converge. See Theorem 4.12 and the corresponding proof.
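The construction used in the proof of Theorem 4.10 can be made concrete with a small numerical sketch. Assuming Python with NumPy and SciPy, and taking Fn to be the N(0, 1 + 1/n) distribution function of Example 4.9, the quantile transforms Yn(ω) = Fn⁻¹(ω) and Y(ω) = F⁻¹(ω) are evaluated at a few points ω ∈ [0, 1]:

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(1)
    omega = rng.uniform(size=5)                # points of the sample space Omega = [0, 1]
    y = norm.ppf(omega)                        # Y(omega) = F^{-1}(omega), F the N(0, 1) cdf
    for n in [1, 10, 100, 10_000]:
        y_n = norm.ppf(omega, scale=np.sqrt(1.0 + 1.0 / n))  # Y_n(omega) = F_n^{-1}(omega)
        print(n, np.max(np.abs(y_n - y)))      # Y_n(omega) approaches Y(omega)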
We often encounter problems that concern a sequence of random variables that
converge in distribution and we perturb the sequence with another sequence
that also has some convergence properties. In these cases it is important to
determine how such perturbations affect the convergence of the sequence. For
example, we might know that a sequence of random variables {Xn }∞ n=1 con-
verges in distribution to a random variable Z that has a N(µ, 1) distribution.
We may wish to standardize this sequence, but may not know µ. Suppose
we have a consistent estimator of µ. That is, we can compute µ̂n such that
µ̂n →p µ as n → ∞. Given this information, is it possible to conclude that
the sequence {Xn − µ̂n }∞ n=1 converges in distribution to a standard normal
distribution? Slutsky’s Theorem is a result that considers both additive and
multiplicative perturbations of sequences that converge in distribution.
Theorem 4.11 (Slutsky). Let {Xn }∞ n=1 be a sequence of random variables
that converge weakly to a random variable X. Let {Yn }∞ n=1 be a sequence of
random variables that converge in probability to a real constant c. Then,
1. Xn + Yn →d X + c as n → ∞.
2. Xn Yn →d cX as n → ∞.
3. Xn /Yn →d X/c as n → ∞ as long as c ≠ 0.

Proof. The first result will be proven here. The remaining results are proven
in Exercises 14 and 15. Denote the distribution function of Xn + Yn to be Gn
and let F be the distribution function of X. Let ε > 0 and set x such that
x − c ∈ C(F ) and x + ε − c ∈ C(F ). Then
Gn (x) = P (Xn + Yn ≤ x)
= P ({Xn + Yn ≤ x} ∩ {|Yn − c| ≤ ε}) +
P ({Xn + Yn ≤ x} ∩ {|Yn − c| > ε}).
Now, Theorem 2.3 implies that
P ({Xn + Yn ≤ x} ∩ {|Yn − c| ≤ ε}) ≤ P (Xn + c ≤ x + ε),
and
P ({Xn + Yn ≤ x} ∩ {|Yn − c| > ε}) ≤ P (|Yn − c| > ε).
Therefore,
Gn (x) ≤ P (Xn + c ≤ x + ε) + P (|Yn − c| > ε),
and Theorem 1.6 implies that
lim sup_{n→∞} Gn(x) ≤ lim sup_{n→∞} P(Xn + c ≤ x + ε) + lim sup_{n→∞} P(|Yn − c| > ε)
                    = P(X ≤ x + ε − c),
where we have used the fact that Xn →d X and Yn →p c as n → ∞. Similarly,
it can be shown that
P (Xn ≤ x − ε − c) ≤ Gn (x) + P (|Yn − c| > ε).
See Exercise 16. Therefore, Theorem 1.6 implies that
lim inf_{n→∞} P(Xn ≤ x − ε − c) ≤ lim inf_{n→∞} Gn(x) + lim inf_{n→∞} P(|Yn − c| > ε),

so that

lim inf_{n→∞} Gn(x) ≥ P(X ≤ x − ε − c).
Because ε > 0 is arbitrary, we have proven that
lim sup_{n→∞} Gn(x) ≤ P(X ≤ x − c) ≤ lim inf_{n→∞} Gn(x),

which can only occur when

lim inf_{n→∞} Gn(x) = lim sup_{n→∞} Gn(x) = lim_{n→∞} Gn(x) = P(X ≤ x − c).

The result then follows by noting that P (X ≤ x − c) = F (x − c) is the


distribution function of the random variable X + c.
Example 4.13. Suppose that {Zn }∞ n=1 is a sequence of random variables such
that Zn/σ →d Z as n → ∞ where Z is a N(0, 1) random variable. Suppose
that σ is unknown, but can be estimated by a consistent estimator σ̂n. That
is, σ̂n →p σ as n → ∞. Noting that σ →d σ as n → ∞, it follows from Part 3
of Theorem 4.11 that σ/σ̂n →d 1 as n → ∞. Theorem 4.9 then implies that
σ/σ̂n →p 1 as n → ∞, so that Part 2 of Theorem 4.11 can be used to show
that Zn/σ̂n →d Z as n → ∞. This type of argument is used in later sections to
justify replacing an unknown variance with a consistent estimate when stan-
dardizing random variables that are asymptotically Normal. □
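The studentization argument in Example 4.13 can be checked by simulation. The following minimal Python sketch (assuming NumPy; the choices of σ, n, and the number of replications are arbitrary) replaces σ by the sample standard deviation and verifies that the resulting statistic behaves like a N(0, 1) random variable:

    import numpy as np

    rng = np.random.default_rng(2)
    sigma, n, reps = 3.0, 200, 10_000
    x = rng.normal(0.0, sigma, size=(reps, n))
    z_n = np.sqrt(n) * x.mean(axis=1)           # Z_n, with Z_n / sigma approximately N(0, 1)
    sigma_hat = x.std(axis=1, ddof=1)           # consistent estimator of sigma
    print(np.mean(np.abs(z_n / sigma_hat) <= 1.96))  # close to 0.95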
Example 4.14. Let {Xn }∞ n=1 be a sequence of random variables such that Xn
has a Gamma(αn , βn ) distribution where {αn }∞ n=1 and {βn }∞ n=1 are sequences
of positive real numbers such that αn → α and βn → β as n → ∞, for some
positive real numbers α and β. Let α̂n and β̂n be consistent estimators of α
and β in the sense that α̂n →p α and β̂n →p β as n → ∞. It can be shown that
Xn →d X as n → ∞ where X has a Gamma(α, β) distribution. See Exercise
3. Part 3 of Theorem 4.11 implies that Xn/β̂n →d X/β as n → ∞ where
X/β has a Gamma(α, 1) distribution. Similarly, an additional application of
Part 1 of Theorem 4.11 implies that if we define a new random variable Yn =
Xn/βn − αn for all n ∈ N, then Yn →d Y as n → ∞ where Y has a shifted
Gamma(α, 1) distribution with density

f(y) = [Γ(α)]⁻¹ (y + α)^{α−1} exp[−(y + α)] δ{y; (−α, ∞)}.


More general transformations of weakly convergent sequences also converge,
under the condition that the transformation is continuous with respect to the
limiting distribution.

Theorem 4.12. Let {Xn }∞ n=1 be a sequence of random variables that converge
in distribution to a random variable X as n → ∞. Let g be a Borel function
on R and suppose that P [X ∈ C(g)] = 1. Then g(Xn) →d g(X) as n → ∞.

Proof. In this proof we follow the method of Serfling (1980) which translates
the problem into one concerning almost certain convergence using Theorem
4.10 (Skorokhod), where the convergence of the transformation is known to
hold. The result is then translated back to weak convergence using the fact that
almost certain convergence implies convergence in distribution. To proceed,
let us suppose that {Xn }∞ n=1 is a sequence of random variables that converge
in distribution to a random variable X as n → ∞. Theorem 4.10 then implies
that there exists a sequence of random variables {Yn }∞ n=1 such that Xn and
Yn have the same distribution for all n ∈ N where Yn →a.c. Y as n → ∞,
and Y has the same distribution as X. It then follows from Theorem 3.8 that
g(Yn) →a.c. g(Y) as n → ∞ as long as g is continuous with probability one
with respect to the distribution of Y . To show this, let D(g) denote the set
of discontinuities of the function g. Let P be the probability measure from
the measure space used to define X and P ∗ be the probability measure from
the measure space used to define Y . Then it follows that since X and Y have
the same distribution, that P∗[Y ∈ D(g)] = P[X ∈ D(g)] = 0. Therefore, it
follows that g(Yn) →a.c. g(Y) as n → ∞. Theorems 3.2 and 4.8 then imply
that g(Yn) →d g(Y) as n → ∞. But g(Yn) has the same distribution as g(Xn)
for all n ∈ N and g(Y) has the same distribution as g(X). Therefore, it follows
that g(Xn) →d g(X) as n → ∞ as well, and the result is proven.

Proofs based on Theorem 4.10 (Skorokhod) can sometimes be confusing and


indeed one must be careful with their application. It may seem that Theorem
4.10 could be used to prove that any result that holds for almost certain
convergence also holds for convergence in distribution. This of course is not
true. The key step in the proof of Theorem 4.12 is that we can translate the
desired result on the random variables that converge almost certainly back to
a parallel result for weak convergence. In the proof of Theorem 4.12 this is
possible because the two sets of random variables have the same distributions.
Example 4.15. Let {Xn }∞ n=1 be a sequence of independent N(0, 1 + n⁻¹) ran-
dom variables and let X be a N(0, 1) random variable. We have already shown
in Example 4.2 that Xn →d X as n → ∞. Consider the function g(x) = x²,
which is continuous with probability one with respect to a N(0, 1) distribu-
tion. It follows from Theorem 4.12 that the sequence {Yn }∞ n=1 converges in
distribution to a random variable Y, where Yn = g(Xn) = Xn² for all n ∈ N
and Y = g(X) = X². Using the fact that Y has a ChiSquared(1) distribu-
tion, it follows that we have proven that {Xn²}∞ n=1 converges in distribution
to a ChiSquared(1) distribution as n → ∞. □
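A quick simulation illustrates Example 4.15. The sketch below (assuming Python with NumPy and SciPy; the sample size and seed are arbitrary choices) compares the squared observations with a ChiSquared(1) distribution using the Kolmogorov–Smirnov statistic:

    import numpy as np
    from scipy.stats import chi2, kstest

    rng = np.random.default_rng(3)
    n = 1_000
    x_n = rng.normal(0.0, np.sqrt(1.0 + 1.0 / n), size=50_000)  # X_n ~ N(0, 1 + 1/n)
    print(kstest(x_n**2, chi2(df=1).cdf))   # small statistic: close to ChiSquared(1)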
Example 4.16. Let {Xn }∞ n=1 be a sequence of random variables where Xn
has an Exponential(θn) distribution for all n ∈ N where {θn }∞ n=1 is a se-
quence of positive real numbers such that θn → θ as n → ∞. Note that
Xn →d X as n → ∞ where X has an Exponential(θ) distribution. Consider
the transformation g(x) = log(x) and note that g is continuous with proba-
bility one with respect to the random variable X. Therefore, Theorem 4.12
implies that log(Xn) →d log(X) as n → ∞. □

4.3 Weak Convergence of Random Vectors

The extension of the concept of convergence in distribution to the case of


random vectors is relatively straightforward in that the univariate definition
is directly generalized to the multivariate definition. While the generaliza-
tion follows directly, there are some important consequences of moving to the
multivariate setting which must be studied carefully.
Definition 4.4. Let {Xn }∞ n=1 be a sequence of d-dimensional random vectors
where Xn has distribution function Fn for all n ∈ N. Then Xn converges in
distribution to a d-dimensional random vector X with distribution function F
if

lim_{n→∞} Fn(x) = F(x),

for all x ∈ C(F).
While the definition of convergence in distribution is essentially the same as
the univariate case, there are some hidden differences that arise due to the
fact that the distribution functions in this case are functions of d-dimensional
vectors. This is important because in the univariate case the set C(F ) corre-
sponds to all real x such that P (X = x) = 0. This is not necessarily true in
the multivariate case as demonstrated in the example below.
Example 4.17. This example is based on a discussion in Lehmann (1999).
Consider a discrete bivariate random vector X with probability distribution
function

f(x) = 1/2 for x ∈ {(0, −1)′, (−1, 0)′}, and f(x) = 0 elsewhere.    (4.2)

The distribution function of X is F(x) = P(X ≤ x), where the inequality
between the vectors is interpreted element-wise. Therefore, the distribution
function is given by

F(x) = 0 if x1 < 0, x2 < 0; x1 < −1, x2 > 0; or x1 > 0, x2 < −1,
F(x) = 1/2 if −1 ≤ x1 < 0, x2 ≥ 0 or x1 ≥ 0, −1 ≤ x2 < 0,
F(x) = 1 if x1 ≥ 0, x2 ≥ 0,

where x = (x1 , x2 )′. The probability contours of this distribution function are
given in Figure 4.2. It is clear from the figure that the point x = (0, 0)′ is a
point of discontinuity of the distribution function, yet P[X = (0, 0)′] = 0. □

Continuity points can be found by examining the probability of the boundary


of the set {t : t ≤ x} where we continue to use the convention in this book
that inequalities between vectors are interpreted pointwise.
Theorem 4.13. Let X be a d-dimensional random vector with distribution
function F and let B(x) = {t ∈ Rd : t ≤ x}. Then a point x ∈ Rd is a
continuity point of F if and only if P [X ∈ ∂B(x)] = 0.

A proof of Theorem 4.13 can be found in Section 5.1 of Lehmann (1999).


Example 4.18. Consider the random vector with distribution function in-
troduced in Example 4.17. Consider once again the point x′ = (0, 0), which
has boundary set

∂B(x) = ∂{t ∈ R2 : t ≤ 0}
       = {(x1 , x2 ) ∈ R2 : x1 = 0, x2 ≤ 0} ∪ {(x1 , x2 ) ∈ R2 : x1 ≤ 0, x2 = 0}.

Now, from Equation (4.2) it follows that P[X ∈ ∂B(x)] = P({(0, −1)′, (−1, 0)′}) = 1.
Therefore, Theorem 4.13 implies that x′ = (0, 0) is not a continuity point. □

When the limiting distribution is continuous, the problem of proving weak


convergence for random vectors simplifies greatly.
Example 4.19. Let {Xn }∞
n=1 be a sequence of d-dimensional random vectors

Figure 4.2 Probability contours of the discrete bivariate distribution function from
Example 4.17. The dotted lines indicate the location of discrete steps in the distribu-
tion function, with the height of the steps being indicated on the plot. It is clear from
this plot that the point (0, 0) is a point of discontinuity of the distribution function.

such that Xn has a N(µn , Σn ) distribution where {µn }∞ n=1 is a sequence of
d-dimensional means such that

lim_{n→∞} µn = µ,

for some µ ∈ Rd and {Σn }∞ n=1 is a sequence of d × d positive definite covariance
matrices such that

lim_{n→∞} Σn = Σ,
for some d × d positive definite covariance matrix Σ. It follows that Xn →d X
as n → ∞ where X has a N(µ, Σ) distribution. To show this we note that

Fn(x) = ∫_{B(x)} (2π)^{−d/2} |Σn|^{−1/2} exp[−½(t − µn)′Σn⁻¹(t − µn)] dt,

and that

F(x) = ∫_{B(x)} (2π)^{−d/2} |Σ|^{−1/2} exp[−½(t − µ)′Σ⁻¹(t − µ)] dt,
where B(x) is defined in Theorem 4.13. Because the limiting distribution is
continuous, P [X ∈ ∂B(x)] = 0 for all x ∈ Rd . To show that weak convergence
follows, we note that Fn(x) can be written as

Fn(x) = ∫_{Σn^{−1/2}[B(x)−µn]} (2π)^{−d/2} exp(−½t′t) dt,

and similarly

F(x) = ∫_{Σ^{−1/2}[B(x)−µ]} (2π)^{−d/2} exp(−½t′t) dt.

In both cases we have used the shorthand AB(x) + c to represent the linear
transformation {At + c : t ∈ B(x)}. Now it follows that Σn^{−1/2}[B(x) − µn] →
Σ^{−1/2}[B(x) − µ] for all x ∈ Rd. Therefore, it follows that
lim_{n→∞} Fn(x) = F(x),

for all x ∈ Rd and, hence, it follows from Definition 4.4 that Xn →d X as
n → ∞. □
If a sequence of d-dimensional distribution functions {Fn }∞ n=1 converges weakly
to a distribution function F as n → ∞, then Definition 4.4 implies that

lim_{n→∞} Fn(x) = F(x),

as long as x is a continuity point of F . Let Xn be a d-dimensional random


vector with distribution function Fn for all n ∈ N and X have distribution
function F . Then the fact that Fn ; F as n → ∞ implies that
lim P (Xn ≤ x) = P (X ≤ x),
n→∞

as long as x is a continuity point of F . Using Theorem 4.13, this property can


also be written equivalently as
lim P [Xn ∈ B(x)] = P [X ∈ B(x)],
n→∞

as long as P [X ∈ ∂B(x)] = 0. This type of result can be extended to any


subset of Rd .
Theorem 4.14. Let {Xn }∞ n=1 be a sequence of d-dimensional random vec-
tors where Xn has distribution function Fn for all n ∈ N and let X be a d-
dimensional random vector with distribution function F . Suppose that Fn ; F
as n → ∞ and let B be any subset of Rd . If P (X ∈ ∂B) = 0, then,
lim P (Xn ∈ B) = P (X ∈ B).
n→∞

Example 4.20. Consider the setup of Example 4.19 where {Xn }∞ n=1 is a
sequence of d-dimensional random vectors such that Xn has a N(µn , Σn )
distribution and Xn →d X as n → ∞ where X has a N(µ, Σ) distribution.
Define a region
E(α) = {x ∈ Rd : (x − µ)′Σ⁻¹(x − µ) ≤ χ²_{d;α}},

where χ²_{d;α} is the α quantile of a ChiSquared(d) distribution. The boundary
region of E(α) is given by

∂E(α) = {x ∈ Rd : (x − µ)′Σ⁻¹(x − µ) = χ²_{d;α}},
which is an ellipsoid in Rd . Now
P[X ∈ ∂E(α)] = P[(X − µ)′Σ⁻¹(X − µ) = χ²_{d;α}] = 0,
since X is a continuous random vector. It then follows from Theorem 4.14
that
lim_{n→∞} P[Xn ∈ E(α)] = P[X ∈ E(α)] = α.
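The limit in Example 4.20 can be estimated by Monte Carlo. The following minimal Python sketch (assuming NumPy and SciPy; the mean vector, covariance matrix, and α are arbitrary choices) approximates P[X ∈ E(α)] for X with a N(µ, Σ) distribution:

    import numpy as np
    from scipy.stats import chi2

    rng = np.random.default_rng(4)
    mu = np.array([1.0, -2.0])
    Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
    alpha = 0.90
    x = rng.multivariate_normal(mu, Sigma, size=100_000)
    d = x - mu
    q = np.einsum("ij,jk,ik->i", d, np.linalg.inv(Sigma), d)  # (x - mu)' Sigma^{-1} (x - mu)
    print(np.mean(q <= chi2(df=2).ppf(alpha)))                 # close to alpha = 0.90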


The result of Theorem 4.14 is actually part of a larger result that generalizes
Theorem 4.6 to the multivariate case.
Theorem 4.15. Let {Xn }∞ n=1 be a sequence of d-dimensional random vec-
tors where Xn has distribution function Fn for all n ∈ N and let X be a
d-dimensional random vector with distribution function F . Then the following
statements are equivalent.

1. Fn ; F as n → ∞.
2. For any bounded and continuous function g,
   lim_{n→∞} ∫_{Rd} g(x) dFn(x) = ∫_{Rd} g(x) dF(x).
3. For any closed set C ⊂ Rd,
   lim sup_{n→∞} P(Xn ∈ C) ≤ P(X ∈ C).
4. For any open set G ⊂ Rd,
   lim inf_{n→∞} P(Xn ∈ G) ≥ P(X ∈ G).
5. For any set B where P(X ∈ ∂B) = 0,
   lim_{n→∞} P(Xn ∈ B) = P(X ∈ B).

Proof. We shall follow the method of proof given by Billingsley (1986), which
first shows the equivalence on Conditions 2–5, and then proves that Condition
5 is equivalent to Condition 1. We begin by proving that Condition 2 implies
Condition 3. Let C be a closed subset of Rd . Define a metric ∆(x, C) that
measures the distance between a point x ∈ Rd and the set C as the smallest
distance between x and any point in C. That is,
∆(x, C) = inf_{c∈C} ||x − c||.
In order to effectively use Condition 2 we need to define a bounded and con-
tinuous function. Therefore, define

1
 if t < 0,
hk (t) = 1 − tk if 0 ≤ t ≤ k −1 ,
if k −1 ≤ t.

0

Let gk (x) = hk [∆(x, C)]. It follows that the function is continuous and bounded
between zero and one for all k ∈ N. Now, suppose that x ∈ C so that
∆(x, C) = 0 and hence,
lim_{k→∞} hk[∆(x, C)] = lim_{k→∞} 1 = 1,

since hk(0) = 1 for all k ∈ N. On the other hand, if x ∉ C so that ∆(x, C) > 0,
then there exists a positive integer kx such that k⁻¹ < ∆(x, C) for all k > kx.
Therefore, it follows that

lim_{k→∞} hk[∆(x, C)] = 0.

Hence, Definition 1.4 implies that gk →pw δ{x; C} as k → ∞. It further follows
that gk (x) ≥ δ{x; C} for all x ∈ Rd . Therefore, Theorem 1.6 implies
lim sup_{n→∞} P(Xn ∈ C) = lim sup_{n→∞} ∫_C dFn(x)
                        = lim sup_{n→∞} ∫_{Rd} δ{x; C} dFn(x)
                        ≤ lim sup_{n→∞} ∫_{Rd} gk(x) dFn(x)
                        = ∫_{Rd} gk(x) dF(x),

where the final equality follows from Condition 2. It can further be proven
that gk converges monotonically (decreasing) to δ{x; C} as k → ∞ so that
Theorem 1.12 (Lebesgue) implies that

lim_{k→∞} ∫_{Rd} gk(x) dF(x) = ∫_{Rd} δ{x; C} dF(x) = P(X ∈ C).

Therefore, we have proven that

lim sup_{n→∞} P(Xn ∈ C) ≤ P(X ∈ C),

which corresponds to Condition 3. For a proof that Condition 3 implies Condi-
tion 4, see Exercise 20. We will now prove that Conditions 3 and 4 imply Con-
dition 5. Let A be a subset of Rd such that P(X ∈ ∂A) = 0. Let A◦ = A \ ∂A
be the open interior of A and let A− = A ∪ ∂A be the closure of A. From
Condition 4 and the fact that A◦ ⊂ A, we have

P(X ∈ A◦) ≤ lim inf_{n→∞} P(Xn ∈ A◦) ≤ lim inf_{n→∞} P(Xn ∈ A).

Similarly, Condition 3 and the fact that A ⊂ A− implies

lim sup_{n→∞} P(Xn ∈ A) ≤ lim sup_{n→∞} P(Xn ∈ A−) ≤ P(X ∈ A−).

Theorem 1.5 implies then that

P(X ∈ A◦) ≤ lim inf_{n→∞} P(Xn ∈ A) ≤ lim sup_{n→∞} P(Xn ∈ A) ≤ P(X ∈ A−).

But Condition 5 assumes that P(X ∈ ∂A) = 0 so that P(X ∈ A◦) = P(X ∈ A−).
Therefore, Definition 1.3 implies that

lim inf_{n→∞} P(Xn ∈ A) = lim sup_{n→∞} P(Xn ∈ A) = lim_{n→∞} P(Xn ∈ A) = P(X ∈ A),

and Condition 5 is proven. We now prove that Condition 5 implies Condition 2,


from which we can conclude that Conditions 2–5 are equivalent. Suppose that
g is continuous and bounded so that |g(x)| < b for all x ∈ Rd for some b > 0.
Let ε > 0 and define a partition of [−b, b] given by a0 < a1 < · · · < am where
ak − ak−1 < ε for all k = 1, . . . , m. We will assume that P[g(X) = ak] = 0
for all k = 0, . . . , m. Let Ak = {x ∈ Rd : ak−1 < g(x) ≤ ak} for k = 1, . . . , m.
From the way the sets A1 , . . . , Am are constructed it follows that

∑_{k=1}^{m} [ ∫_{Ak} g(x) dFn(x) − ak P(Xn ∈ Ak) ] = ∑_{k=1}^{m} ∫_{Ak} [g(x) − ak] dFn(x)
                                                   ≤ ∑_{k=1}^{m} ∫_{Ak} ε dFn(x)
                                                   = ε ∑_{k=1}^{m} ∫_{Ak} dFn(x)
                                                   = ε,

since

∪_{k=1}^{m} Ak = {x ∈ Rd : |g(x)| ≤ b},

and P[|g(Xn)| ≤ b] = 1. This same property then implies that

∫_{Rd} g(x) dFn(x) − ∑_{k=1}^{m} ak P(Xn ∈ Ak) ≤ ε.

It can similarly be shown that

∫_{Rd} g(x) dFn(x) − ∑_{k=1}^{m} ak P(Xn ∈ Ak) ≥ −ε,

so that it follows that

| ∫_{Rd} g(x) dFn(x) − ∑_{k=1}^{m} ak P(Xn ∈ Ak) | ≤ ε.

See Exercise 21. The same arguments can be used with Fn replaced by F and
Xn replaced by X to obtain

| ∫_{Rd} g(x) dF(x) − ∑_{k=1}^{m} ak P(X ∈ Ak) | ≤ ε.

Now, it can be shown that P(X ∈ ∂Ak) = 0 for k = 1, . . . , m. See Section 29
of Billingsley (1986). Therefore, Condition 5 implies that when m is fixed,

lim_{n→∞} ∑_{k=1}^{m} ak P(Xn ∈ Ak) = ∑_{k=1}^{m} ak P(X ∈ Ak).

Therefore, it follows that

lim_{n→∞} | ∫_{Rd} g(x) dFn(x) − ∫_{Rd} g(x) dF(x) | ≤ 2ε.

Because ε > 0 is arbitrary, it follows that

lim_{n→∞} ∫_{Rd} g(x) dFn(x) = ∫_{Rd} g(x) dF(x),

and Condition 2 follows. We will finally show that Condition 5 implies Con-
dition 1. To show this define sets B(x) = {t ∈ Rd : t ≤ x}. Theorem 4.13
implies that x is a continuity point of F if and only if P [X ∈ ∂B(x)] = 0.
Therefore, if x is a continuity point of F , we have from Condition 5 that
lim_{n→∞} Fn(x) = lim_{n→∞} P[Xn ∈ B(x)] = P[X ∈ B(x)] = F(x).

Therefore, it follows that Fn ; F as n → ∞ and Condition 1 follows. It


remains to show that Condition 1 implies one of Conditions 2–5. This proof is
beyond the scope of this book, and can be found in Section 29 of Billingsley
(1986).

In the case of convergence of random variables it followed that the convergence


of the individual elements of a random vector was equivalent to the conver-
gence of the random vector itself. The same equivalence does not hold for
convergence in distribution of random vectors. This is due to the fact that a
set of marginal distributions do not uniquely determine the joint distribution
of a random vector.
Example 4.21. Let {Xn }∞ n=1 be a sequence of random variables such that
Xn has a N(0, 1 + n−1 ) distribution for all n ∈ N. Similarly, let {Yn }∞ n=1 be
a sequence of random variables such that Yn has a N(0, 1 + n−1 ) distribution
for all n ∈ N. Now consider the random vector Zn = (Xn , Yn )0 for all n ∈ N.
Of interest is the limiting behavior of the sequence {Zn }∞
n=1 . That is, is there
a unique random vector Z such that Zn →d Z as n → ∞? The answer is no,
unless further information is known about the joint behavior of Xn and Yn .
For example, if Xn and Yn are independent for all n ∈ N, then Z is a N(0, I)
distribution, where I is the identity matrix. On the other hand, suppose that
Zn = Cn W where W is a N(0, Σ) random vector where
 
Σ = ( 1  τ
      τ  1 ),

where τ is a constant such that τ ∈ (0, 1), and {Cn }∞ n=1 is a sequence of 2 × 2
matrices defined by

Cn = ( (1 + n⁻¹)^{1/2}           0
             0           (1 + n⁻¹)^{1/2} ).

In this case Zn →d Z as n → ∞ where Z has a N(0, Σ) distribution. Hence,
there are an infinite number of choices of Z depending on the covariance
between Xn and Yn . There are even more choices for the limiting distribution
because the limiting joint distribution need not even be multivariate Normal,
due to the fact that the joint distribution of two normal random variables need
not be multivariate Normal. 

The converse of this property is true. That is, if a sequence of random vectors
converge in distribution to another random vector, then all of the elements
in the sequence of random vectors must also converge to the elements of the
limiting random vector. This result follows from Theorem 4.14.

Corollary 4.1. Let {Xn }∞ n=1 be a sequence of d-dimensional random vectors
that converge in distribution to a random vector X as n → ∞. Let X′n =
(Xn1 , . . . , Xnd ) and X′ = (X1 , . . . , Xd ). Then Xnk →d Xk as n → ∞ for all
k ∈ {1, . . . , d}.

For a proof of Corollary 4.1, see Exercise 22.

The convergence of random vectors was simplified to the univariate case using
Theorem 3.6. As Example 4.21 demonstrates, the same simplification is not
applicable to the convergence of distributions. The Cramér-Wold Theorem
does provide a method for reducing the convergence in distribution of random
vectors to the univariate case. Before presenting this result some preliminary
setup is required. The result depends on multivariate characteristic functions,
which are defined below.

Definition 4.5. Let X be a d-dimensional random vector. The characteristic
function of X is given by ψ(t) = E[exp(it′X)] for all t ∈ Rd.

Example 4.22. Suppose that X1 , . . . , Xd are independent standard normal
random variables and let X′ = (X1 , . . . , Xd ). Suppose that t′ = (t1 , . . . , td ),
then Definition 4.5 implies that the characteristic function of X is given by

ψ(t) = E[exp(it′X)]
     = E[exp(i ∑_{k=1}^{d} tk Xk)]
     = E[∏_{k=1}^{d} exp(itk Xk)]
     = ∏_{k=1}^{d} E[exp(itk Xk)]
     = ∏_{k=1}^{d} exp(−½tk²),

where we have used the independence assumption and the fact that the char-
acteristic function of a standard normal random variable is exp(−½t²). There-
fore, it follows that

ψ(t) = exp(−½ ∑_{k=1}^{d} tk²) = exp(−½t′t).

As with the univariate case, characteristic functions uniquely identify a dis-


tribution, and in particular, a convergent sequence of characteristic functions
is equivalent to the weak convergence of the corresponding random vectors
or distribution functions. This result is the multivariate generalization of the
equivalence of Conditions 1 and 2 of Theorem 4.6.
Theorem 4.16. Let {Xn }∞ n=1 be a sequence of d-dimensional random vec-
tors where Xn has characteristic function ψn(t) for all n ∈ N. Let X be a d-
dimensional random vector with characteristic function ψ(t). Then Xn →d X
as n → ∞ if and only if

lim_{n→∞} ψn(t) = ψ(t),

for every t ∈ Rd.

A proof of Theorem 4.16 can be found in Section 29 of Billingsley (1986).


Example 4.23. Let {Xn }∞ n=1 be a sequence of d-dimensional random vectors
where Xn has a Nd(µn , Σn ) distribution for all n ∈ N. Suppose that µn is
a sequence of d-dimensional vectors defined as µn = n⁻¹1, where 1 is a d-
dimensional vector of the form 1′ = (1, 1, . . . , 1). Further suppose that Σn is a
sequence of d × d covariance matrices of the form Σn = I + n⁻¹(J − I), where I
is the d × d identity matrix and J is a d × d matrix of ones. The characteristic
function of Xn is given by ψn(t) = exp(it′µn − ½t′Σn t). Note that

lim_{n→∞} µn = 0,

and

lim_{n→∞} Σn = I,

where 0 is a d × 1 vector of the form 0′ = (0, 0, . . . , 0). Therefore, it follows
that for every t ∈ Rd that

lim_{n→∞} ψn(t) = lim_{n→∞} exp(it′µn − ½t′Σn t) = exp(−½t′t),

which is the characteristic function of a Nd(0, I) distribution. Therefore, it
follows from Theorem 4.16 that Xn →d X as n → ∞ where X is a Nd(0, I)
random vector. □

We are now in a position to present the theorem of Cramér and Wold, which
reduces the task of proving that a sequence of random vectors converge weakly
to another random vector to the univariate case by considering the convergence
of all possible linear combinations of the components of the random vectors.
Theorem 4.17 (Cramér and Wold). Let {Xn }∞ n=1 be a sequence of random
vectors in Rd and let X be a d-dimensional random vector. Then Xn →d X as
n → ∞ if and only if v′Xn →d v′X as n → ∞ for all v ∈ Rd.

Proof. We will follow the proof of Serfling (1980). Let us first suppose that for
any v ∈ Rd we have v′Xn →d v′X as n → ∞. Theorem 4.6 then implies that the
characteristic function of v′Xn converges to the characteristic function of v′X.
Let ψn(t) be the characteristic function of Xn for all n ∈ N. The characteristic
function of v′Xn is then given by E[exp(itv′Xn)] = E{exp[i(tv′)Xn]} =
ψn(tv) by Definition 4.5. Similarly, if ψ(t) is the characteristic function of X,
then the characteristic function of v′X is given by ψ(tv). Theorem 4.6 then
implies that if v′Xn →d v′X as n → ∞ for all v ∈ Rd, then

lim_{n→∞} ψn(tv) = ψ(tv),

for all v ∈ Rd and t ∈ R. This is equivalent to concluding that

lim_{n→∞} ψn(u) = ψ(u),

for all u ∈ Rd. Therefore, Theorem 4.16 implies that Xn →d X as n → ∞. For
a proof of the converse, see Exercise 23.
Example 4.24. Let {Xn }∞ n=1 and {Yn }∞ n=1 be sequences of random variables
where Xn has a N(µn , σn²) distribution, Yn has a N(νn , τn²) distribution, and
Xn is independent of Yn for all n ∈ N. We assume that {µn }∞ n=1 and {νn }∞ n=1
are sequences of real numbers such that µn → µ and νn → ν as n → ∞ for
some real numbers µ and ν. Similarly, assume that {σn }∞ n=1 and {τn }∞ n=1 are
sequences of positive real numbers such that σn → σ and τn → τ as n → ∞ for
some positive real numbers σ and τ. Let v1 and v2 be arbitrary real numbers,
then E(v1 Xn + v2 Yn ) = v1 µn + v2 νn and V(v1 Xn + v2 Yn ) = v1²σn² + v2²τn²,
for all n ∈ N. It follows that v1 Xn + v2 Yn has a N(v1 µn + v2 νn , v1²σn² + v2²τn²)
distribution for all n ∈ N, due to the assumed independence between Xn and
Yn. Similarly, if X and Y are independent random variables where X has a
N(µ, σ²) distribution and Y has a N(ν, τ²) distribution, then v1 X + v2 Y has a
N(v1 µ + v2 ν, v1²σ² + v2²τ²) distribution. Example 4.2 implies that
v1 Xn + v2 Yn →d v1 X + v2 Y as n → ∞ for all v1 ∈ R
and v2 ∈ R. See Exercise 24 for further details on this conclusion. Now let
Z′n = (Xn , Yn ) for all n ∈ N and let Z′ = (X, Y). Because v1 and v2 are
arbitrary, Theorem 4.17 implies that Zn →d Z as n → ∞. □

Example 4.25. Let {Xn }∞ n=1 be a sequence of dx-dimensional random vec-
tors that converge in distribution to a random vector X as n → ∞, and let
{Yn }∞ n=1 be a sequence of dy-dimensional random vectors that converge in
probability to a constant vector y as n → ∞. Let d = dx + dy and consider
the random vector defined by Z′n = (X′n , Y′n ) for all n ∈ N and similarly de-
fine Z′ = (X′ , y′ ). Let v ∈ Rd be an arbitrary vector that can be partitioned
as v′ = (v′1 , v′2 ) where v1 ∈ R^{dx} and v2 ∈ R^{dy}. Now v′Zn = v′1 Xn + v′2 Yn
and v′Z = v′1 X + v′2 y. Theorem 4.17 implies that v′1 Xn →d v′1 X as n → ∞
and Theorem 3.9 implies that v′2 Yn →p v′2 y as n → ∞. Theorem 4.11 (Slut-
sky) then implies that v′1 Xn + v′2 Yn →d v′1 X + v′2 y as n → ∞. Because v is
arbitrary, Theorem 4.17 implies that Zn →d Z as n → ∞. □
Linear functions are not the only function of convergent sequences of ran-
dom vectors that converge in distribution. The result of Theorem 4.12 can be
generalized to Borel functions of sequences of random vectors that converge
weakly as long as the function is continuous with respect to the distribution
of the limiting random vector.
Theorem 4.18. Let {Xn }∞ n=1 be a sequence of d-dimensional random vectors
that converge in distribution to a random vector X as n → ∞. Let g be a
Borel function that maps Rd to Rm and suppose that P [X ∈ C(g)] = 1. Then
g(Xn) →d g(X) as n → ∞.
Theorem 4.18 can be proven using an argument that parallels the proof of
Theorem 4.12 (Continuous Mapping Theorem) using a multivariate version of
Theorem 4.10 (Skorokhod).
Example 4.26. Let {Zn }∞ n=1 be a sequence of bivariate random variables that
converge in distribution to a bivariate random variable Z as n → ∞ where Z has
a N(0, I) distribution. Let g(z) = g(z1 , z2 ) = z1/z2, which is continuous except
on the line z2 = 0. Therefore, P[Z ∈ C(g)] = 1 and Theorem 4.12 implies that
g(Zn) →d g(Z) as n → ∞ where g(Z) has a Cauchy(0, 1) distribution. □
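The limiting Cauchy(0, 1) distribution in Example 4.26 can be checked by simulation. The sketch below (assuming Python with NumPy and SciPy; the seed and quantile levels are arbitrary choices) compares empirical quantiles of the ratio with Cauchy quantiles:

    import numpy as np
    from scipy.stats import cauchy

    rng = np.random.default_rng(5)
    z = rng.standard_normal((100_000, 2))      # two independent N(0, 1) coordinates
    ratio = z[:, 0] / z[:, 1]
    for p in [0.25, 0.50, 0.90]:
        print(p, np.quantile(ratio, p), cauchy.ppf(p))  # empirical versus Cauchy quantiles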
Example 4.27. Let {Rn }∞ n=1 be a sequence of random variables where Rn →d R
as n → ∞ where R² has a ChiSquared(2) distribution. Let {Tn }∞ n=1 be
a sequence of random variables where Tn →d T as n → ∞ where T has a
a sequence of random variables where Tn − → T as n → ∞ where T has a
Uniform(0, 2π) distribution. Assume that the bivariate sequence of random
vectors {(Rn , Tn )}∞
n=1 converge in distribution to the random vector (R, T )
as n → ∞, where R and T are independent of one another. Consider the
transformation g : R2 → R2 defined by g(R, T ) = (R cos(T ), R sin(T )), which
is continuous with probability one with respect to the joint distribution of R
and T . Then it follows from Theorem 4.12 that the sequence {g(Rn , Tn )}∞ n=1
converges in distribution to a bivariate normal distribution with mean vector
0 and covariance matrix I. 
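The transformation in Example 4.27 is easily simulated. The following minimal Python sketch (assuming NumPy; the seed and sample size are arbitrary) generates R and T with the limiting distributions and checks that the transformed pairs have mean near 0 and covariance near I:

    import numpy as np

    rng = np.random.default_rng(6)
    r = np.sqrt(rng.chisquare(df=2, size=100_000))    # R, with R^2 ~ ChiSquared(2)
    t = rng.uniform(0.0, 2.0 * np.pi, size=100_000)   # T ~ Uniform(0, 2*pi)
    xy = np.column_stack((r * np.cos(t), r * np.sin(t)))
    print(xy.mean(axis=0))
    print(np.cov(xy, rowvar=False))                   # approximately the identity matrix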
Example 4.28. Let {Xn }∞ n=1 and {Yn }∞ n=1 be sequences of random variables
that converge in distribution to random variables X and Y as n → ∞, respec-
tively. Can we conclude necessarily that Xn + Yn →d X + Y as n → ∞ based
on Theorem 4.12? Without further information, we cannot make such a con-
clusion. The reason is that the weak convergence of the sequences {Xn }∞ n=1
and {Yn }∞n=1 do not imply the weak convergence of the associated sequence
of random vectors {(Xn , Yn )}∞n=1 and Theorem 4.12 requires that the random
vectors converge in distribution. Therefore, without further information about
the two sequences, and the convergence behavior, no conclusion can be made.
For example, if Xn and Yn converge to independent normal random variables
X and Y , then the conclusion does follow. But for example, take Xn to have
a standard normal distribution and Yn = −Xn so that Yn also has a standard
normal distribution. In this case X and Y are standard normal distributions,
and if we assume that X and Y were independent then X + Y would have a
N(0, 2) distribution, whereas Xn + Yn has a degenerate distribution at 0 for all
n ∈ N, which converges to a degenerate distribution at 0 as n → ∞. Therefore,
we would have to know about the relationship between Xn and Yn in order
to draw the correct conclusion in this case. 
As in the univariate case, we are often interested in perturbed sequences of
random vectors that have some weak convergence property. The results of
Theorem 4.11 generalize to the multivariate case.

Theorem 4.19. Let {Xn }∞ n=1 be a sequence of d-dimensional random vectors


that converge in distribution to a random vector X as n → ∞. Let {Yn }∞ n=1
be a sequence of d-dimensional random vectors that converge in probability to
a constant vector y and let {Zn }∞n=1 be a sequence of d-dimensional random
vectors that converge in probability to a constant vector z as n → ∞. Then
diag(Yn)Xn + Zn →d diag(y)X + z as n → ∞, where diag(y) is a d × d diagonal
matrix whose diagonal values equal the elements of y.

Proof. We will use the method of proof suggested by Lehmann (1999). De-
fine a sequence of 3d-dimensional random vectors {Wn }∞ n=1 where W′n =
(X′n , Y′n , Z′n ) for all n ∈ N and similarly define W′ = (X′, y′, z′). Theo-
rem 3.6 and Example 4.25 imply that Wn →d W as n → ∞. Now de-
fine g(w) = g(x, y, z) = diag(y)x + z, which is an everywhere continuous
function so that P[W ∈ C(g)] = 1. Theorem 4.18 implies that g(Wn) =
diag(Yn)Xn + Zn →d g(W) = diag(y)X + z as n → ∞.
Example 4.29. Let {Xn }∞ n=1 be a sequence of d-dimensional random vectors
such that n^{1/2}Xn →d Z as n → ∞ where Z is a Nd(0, I) random vector.
Suppose that {Yn }∞ n=1 is any sequence of random vectors that converges in
probability to a vector θ. Then Theorem 4.19 implies that n^{1/2}Xn + Yn →d W
as n → ∞ where W has a Nd(θ, I) distribution. □

4.4 The Central Limit Theorem

The central limit theorem, as the name given to it by G. Pólya in 1920 implies,
is the key asymptotic result in statistics. The result in some form has existed
since 1733 when De Moivre proved the result for a sequence of independent
and identically distributed Bernoulli(θ) random variables. In some sense,
one can question how far the field of statistics could have progressed without
this essential result. It is the Central Limit Theorem, in its various forms, that
allows us to construct approximate normal tests and confidence intervals for
unknown means when the sample size is large. Without such a result we would
be required to develop tests for each possible population. The result allows us
to approximate Binomial probabilities under certain circumstances when the
number of Bernoulli experiments is large. These probabilities, with such an
approximation, would have been very difficult to compute, especially before
the advent of the digital computer. Another key attribute of the Central Limit
Theorem is its widespread applicability. The Central Limit Theorem, with
appropriate modifications, applies not only to the case of independent and
identically distributed random variables, but can also be applied to dependent
sequences of variables, sequences that have varying distributions, and other
cases as well. Finally, the accuracy of the normal approximation is also
quite important. When the parent population is not too far from normality,
the Central Limit Theorem provides quite accurate approximations to
the distribution of the sample mean, even when the sample size is quite small.
In this section we will introduce the simplest form of the central limit theorem
which applies to sequences of independent and identically distributed random
variables with finite mean and variance and present the usual proof which is
based on limits of characteristic functions. We will also present the simple
form of the multivariate version of the central limit theorem. We will revisit
this topic with much more detail in Chapter 6 where we consider several
generalizations of this result.
Theorem 4.20 (Lindeberg and Lévy). Let {Xn }∞ n=1 be a sequence of inde-
pendent and identically distributed random variables such that E(Xn ) = µ and
V(Xn) = σ² < ∞ for all n ∈ N, then Zn = n^{1/2}σ⁻¹(X̄n − µ) →d Z as n → ∞
where Z has a N(0, 1) distribution.

Proof. We will begin by assuming, without loss of generality, that µ = 0 and


σ = 1. We can do this since if µ and σ are known, we can always transform
the sequence {Xn }∞
n=1 as σ
−1
(Xn − µ) to get a sequence of random variables
with zero mean and variance equal to one. When µ = 0 and σ = 1 the random
variable of interest is
Zn = n^{1/2}X̄n = n^{−1/2} ∑_{k=1}^{n} Xk.

The general method of proof involves computing the characteristic function


of Zn and then showing that this characteristic function converges to that of
a standard normal random variable as n → ∞. An application of Theorem
4.6 then completes the proof. Suppose that Xn has characteristic function
ψ(t) for all n ∈ N. Because the random variables in the sequence {Xn }∞ n=1
are independent and identically distributed, Theorem 2.33 implies that the
characteristic function of the sum of X1 , . . . , Xn is ψ n (t) for all n ∈ N. An
application of Theorem 2.32 then implies that the characteristic function of
Zn equals ψ n (n−1/2 t) for all n ∈ N. By assumption the variance σ 2 is finite,
and hence the second moment is also finite and Theorem 2.31 implies that
ψ(t) = 1 − ½t² + o(t²) as t → 0. Therefore, the characteristic function of Zn
equals

ψⁿ(n^{−1/2}t) = [1 − ½n⁻¹t² + o(n⁻¹t²)]ⁿ = [1 − ½n⁻¹t²]ⁿ + o(n⁻¹t²).

The second equality can be justified using Theorem A.22. See Exercise 26.
Therefore, for fixed t ∈ R, Theorem 1.7 implies that

lim_{n→∞} ψⁿ(n^{−1/2}t) = lim_{n→∞} [1 − ½n⁻¹t²]ⁿ + o(n⁻¹t²) = exp(−½t²),

which is the characteristic function of a standard normal random variable.
Therefore, Theorem 4.6 implies that n^{1/2}X̄n →d Z as n → ∞ where Z is a
standard normal random variable.
Example 4.30. Suppose that {Bn }∞ n=1 is a sequence of independent and
identically distributed Bernoulli(θ) random variables. Let B̄n be the sam-
ple mean computed on B1 , . . . , Bn , which in this case will correspond to the
sample proportion. Theorem 4.20 then implies that n^{1/2}σ⁻¹(B̄n − µ) →d Z
as n → ∞ where Z is a standard normal random variable. In this case
µ = E(B1) = θ and σ² = V(B1) = θ(1 − θ) so that the result above is
equivalent to n^{1/2}(B̄n − θ)[θ(1 − θ)]^{−1/2} →d Z as n → ∞. This implies
that when n is large,

P{n^{1/2}(B̄n − θ)[θ(1 − θ)]^{−1/2} ≤ t} ≈ Φ(t),

or equivalently that

P{∑_{k=1}^{n} Bk ≤ nθ + n^{1/2}t[θ(1 − θ)]^{1/2}} ≈ Φ(t),

which is also equivalent to

P(∑_{k=1}^{n} Bk ≤ t) ≈ Φ{(t − nθ)[nθ(1 − θ)]^{−1/2}}.
Figure 4.3 The distribution function of a Binomial(n, θ) distribution and a
N[nθ, nθ(1 − θ)] distribution when n = 5 and θ = 1/4.

The left hand side of this last equation is a Binomial probability and the right
hand side is a Normal probability. Therefore, this last equation provides a
method for approximating Binomial probabilities with Normal probabil-
ities under the condition that n is large. Figures 4.3 and 4.4 compare the
Binomial(n, θ) and N[nθ, nθ(1 − θ)] distribution functions for n = 5 and
n = 10 when θ = 1/4. One can observe that the Normal distribution function,
though continuous, does capture the general shape of the Binomial distribu-
tion function, with the approximation improving as n becomes larger. 
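The approximation described in Example 4.30 is easily computed. The sketch below (assuming Python with NumPy and SciPy; the values of n and θ are arbitrary choices) compares the exact Binomial distribution function with its Normal approximation:

    import numpy as np
    from scipy.stats import binom, norm

    n, theta = 10, 0.25
    t = np.arange(n + 1)
    exact = binom.cdf(t, n, theta)
    approx = norm.cdf((t - n * theta) / np.sqrt(n * theta * (1.0 - theta)))
    print(np.max(np.abs(exact - approx)))   # worst-case error over the support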

Example 4.31. Let {Xn }∞ n=1 be a sequence of independent and identically
distributed random variables where Xn has a Uniform(0, 1) distribution, so
that it follows that E(Xn) = 1/2 and V(Xn) = 1/12 for all n ∈ N. Therefore,
Theorem 4.20 implies that Zn = (12n)^{1/2}(X̄n − 1/2) →d Z as n → ∞ where Z is
a N(0, 1) random variable. The power of Theorem 4.20 is particularly impres-
sive in this case. When n = 2 the true distribution of Zn is a Triangular
distribution, which already begins to show a somewhat more normal shape
when compared to the population. Figures 4.5 to 4.6 show histograms based
on 100,000 simulated samples of Zn when n = 5 and 10. Again, it is apparent
how quickly the normal approximation becomes accurate, particularly near
the center of the distribution, even for smaller values of n. In fact, a simple
Figure 4.4 The distribution function of a Binomial(n, θ) distribution and a
N[nθ, nθ(1 − θ)] distribution when n = 10 and θ = 1/4.

algorithm for generating approximate normal random variables in computer


simulations is to generate independent random variables U1 , . . . , U12 where Uk
has a Uniform(0, 1) distribution and then compute
Z = (∑_{k=1}^{12} Uk) − 6,

which has an approximate N(0, 1) distribution by Theorem 4.20. This algo-


rithm, however, is not highly recommended as there are more efficient and
accurate methods to generate standard normal random variables. See Section
5.3 of Bratley, Fox and Schrage (1987). 
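The algorithm described above amounts to a few lines of Python. The sketch below (assuming NumPy; the seed and number of replications are arbitrary) generates such approximate normal variates and checks their mean, variance, and tail quantiles:

    import numpy as np

    rng = np.random.default_rng(7)
    z = rng.uniform(size=(100_000, 12)).sum(axis=1) - 6.0  # sum of twelve Uniform(0, 1) variates minus 6
    print(z.mean(), z.var())                               # approximately 0 and 1
    print(np.quantile(z, [0.025, 0.975]))                  # approximately -1.96 and 1.96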
In the usual setting in Theorem 4.20, the random variable Zn can be written as
n^{1/2}σ⁻¹(X̄n − µ). From this formulation it is clear that if σ < ∞, as required
by the assumptions of Theorem 4.20, then it follows that

lim_{n→∞} n^{−1/2}σ = 0.

Note that if Zn is to have a normal distribution in the limit then it must
follow that X̄n approaches µ at the same rate. That is, the result of
Theorem 4.20 implies that X̄n is a consistent estimator of µ.
Theorem 4.21. Suppose that {Xn }∞ n=1 is a sequence of random variables

Figure 4.5 Density histogram of 100,000 simulated values of Zn = (12n)^{1/2}(X̄n − 1/2)
for samples of size n = 5 from a Uniform(0, 1) distribution compared to a N(0, 1)
density.

such that σn⁻¹(Xn − µ) →d Z as n → ∞ where Z is a N(0, 1) random variable
and {σn }∞ n=1 is a sequence of real numbers such that

lim_{n→∞} σn = σ ∈ R,

and µ ∈ R. Then Xn →p µ as n → ∞ if and only if σ = 0.

Proof. Let us first assume that σ = 0, which implies that σn →p 0 as n → ∞.
Part 2 of Theorem 4.11 (Slutsky) implies that σn σn⁻¹(Xn − µ) = Xn − µ →d 0
as n → ∞. Theorem 4.9 then implies that Xn − µ →p 0 as n → ∞, which is
equivalent to the result that Xn →p µ as n → ∞. On the other hand, if we
assume that Xn →p µ as n → ∞, then again we automatically conclude that
Xn − µ →p 0. Suppose that σ ≠ 0 and let us find a contradiction. Since σ ≠ 0
it follows that σn/σ →p 1 as n → ∞ and therefore Theorem 4.11 implies that
σn σ⁻¹ σn⁻¹(Xn − µ) = σ⁻¹(Xn − µ) →d Z as n → ∞ where Z is a N(0, 1) random
variable. This is a contradiction since we know that Xn − µ →p 0 as n → ∞.
Therefore, σ cannot be non-zero.

Figure 4.6 Density histogram of 100,000 simulated values of Zn = (12n)^{1/2}(X̄n − 1/2)
for samples of size n = 10 from a Uniform(0, 1) distribution compared to a N(0, 1)
density.

Example 4.32. Suppose that {Bn }∞ n=1 is a sequence of independent and
identically distributed Bernoulli(θ) random variables. In Example 4.30 it
was shown that Zn = σn⁻¹(B̄n − θ) →d Z as n → ∞ where Z is a standard
normal random variable and σn = n^{−1/2}[θ(1 − θ)]^{1/2}. Since

lim_{n→∞} σn = lim_{n→∞} n^{−1/2}[θ(1 − θ)]^{1/2} = 0,

it follows from Theorem 4.21 that B̄n →p θ as n → ∞. □

Example 4.33. Let {Xn }∞ n=1 be a sequence of independent and identically
distributed random variables where Xn has a Uniform(0, 1) distribution. In
Example 4.31 it was shown that Zn = σn⁻¹(X̄n − 1/2) →d Z as n → ∞ where Z
is a standard normal random variable and where σn = (12n)^{−1/2}. Since

lim_{n→∞} σn = lim_{n→∞} (12n)^{−1/2} = 0,

it follows from Theorem 4.21 that X̄n →p 1/2 as n → ∞. □

The multivariate version of Theorem 4.20 is a direct generalization to the case


of a sequence of independent and identically distributed random vectors from
a distribution that has a covariance matrix with finite elements.
Theorem 4.22. Let {Xn }∞ n=1 be a sequence of independent and identically
distributed d-dimensional random vectors such that E(Xn ) = µ and V (Xn ) =
Σ, where the covariance matrix Σ has elements that are all finite. Then
n^{1/2}Σ^{−1/2}(X̄n − µ) →d Z as n → ∞ where Z is a d-dimensional N(0, I)
random vector.

The proof of Theorem 4.22 is similar to that of the univariate case given by
Theorem 4.20, the only real difference being that the multivariate character-
istic function is used.
Example 4.34. Consider a sequence of discrete three-dimensional random vectors
{Xn }∞ n=1 where Xn has probability distribution

f(x) = θ if x′ = (1, 0, 0),
f(x) = η if x′ = (0, 1, 0),
f(x) = 1 − θ − η if x′ = (0, 0, 1),
f(x) = 0 otherwise,

for all n ∈ N where θ and η are parameters such that 0 < θ < 1, 0 < η < 1 and
0 < θ + η < 1. The mean vector of Xn is given by E(Xn) = µ = (θ, η, 1 − θ − η)′.
The covariance matrix of Xn is given by

V(Xn) = Σ = (  θ(1 − θ)         −θη              −θ(1 − θ − η)
               −θη              η(1 − η)         −η(1 − θ − η)
               −θ(1 − θ − η)    −η(1 − θ − η)    (θ + η)(1 − θ − η)  ).
Theorem 4.22 implies that n^{1/2}Σ^{−1/2}(X̄n − µ) →d Z as n → ∞ where Z is a
three-dimensional N(0, I) random vector. That is, when n is large, it follows
that

P[n^{1/2}Σ^{−1/2}(X̄n − µ) ≤ t] ≈ Φ(t),

where Φ is the multivariate distribution function of a N(0, I) random vector.
Equivalently, we have that

P(∑_{k=1}^{n} Xk ≤ t) ≈ Φ[n^{−1/2}Σ^{−1/2}(t − nµ)],

which results in a Normal approximation to a three-dimensional Multino-
mial distribution. □

mial distribution. 

4.5 The Accuracy of the Normal Approximation

Let {Xn }∞n=1 be a sequence of independent and identically distributed random


variables from a distribution F such that E(Xn ) = µ and V (Xn ) = σ 2 < ∞
for all n ∈ N. Theorem 4.20 implies that the standardized sample mean given
202 CONVERGENCE OF DISTRIBUTIONS
by Zn = σ −1 n1/2 (X̄n − µ) converges in distribution to a random variable Z
that has a N(0, 1) distribution as n → ∞. That is,

lim P (Zn ≤ z) = Φ(z),


n→∞

for all z ∈ R. Definition 1.1 then implies that for large values of n, we can
use the approximation P (Zn ≤ z) ' Φ(z). One of the first concerns when one
uses any approximation should be about the accuracy of the approximation.
In the case of the normal approximation given by Theorem 4.20, we are in-
terested in how well the normal distribution approximates probabilities of the
standardized sample mean and how the quality of the approximation depends
on n and on the parameters of the distribution of the random variables in
the sequence. In some cases it is possible to study these effects using direct
calculation.
Example 4.35. Let {Xn }∞ n=1 be a sequence of independent and identically
distributed random variables where Xn has a Bernoulli(θ) distribution. In
this case it is well known that the sum Sn has a Binomial(n, θ) distribu-
tion. The distribution of the standardized sample mean, Zn = n1/2 θ−1/2 (1 −
θ)−1/2 (X̄n −θ) then has a Binomial(n, θ) distribution that has been scaled so
that Zn has support {n1/2 θ−1/2 (1 − θ)−1/2 (0 − θ), n1/2 θ−1/2 (1 − θ)−1/2 (n−1 −
θ), . . . , n1/2 θ−1/2 (1 − θ)−1/2 (1 − θ)}. That is,
 
1/2 −1/2 −1/2 −1 n k
P [Zn = n θ (1 − θ) (kn − θ)] = θ (1 − θ)n−k ,
k
for k ∈ {1, . . . , n}. As shown in Example 4.30, Theorem 4.20 implies that
d
Zn −→ Z as n → ∞ where Z is a N(0, 1) random variable. This means that
for large n, we have the approximation P (Zn ≤ z) ' Φ(z). Because the
distribution of the standardized mean is known in this case we can assess the
accuracy of the normal approximation directly. For example, when θ = 14 and
n = 5 we have that the Kolmogorov distance between P (Zn ≤ t) and Φ(t) is
0.2346. See Figure 4.7. Similarly, when θ = 12 and n = 10 we have that the
Kolmogorov distance between P (Zn ≤ t) and Φ(t) is 0.1230. See Figure 4.8.
A more complete table of comparisons is given in Table 4.1. It is clear from
the table that both n and θ affect the accuracy of the normal approximation.
As n increases, the Kolmogorov distance becomes smaller. This is guaranteed
by Theorem 4.7. However, another effect can be observed from Table 4.1.
The approximation becomes progressively worse as θ approaches zero. This
is due to the fact that the binomial distribution becomes more skewed as θ
approaches zero. The normal approximation requires larger sample sizes to
overcome this skewness. 

Example 4.36. Let {Xn }∞ n=1 be a sequence of independent and identically


distributed random variables where Xn has a Exponential(θ) distribution.
In this case it is well known that the sum Sn has a Gamma(n, θ) distribution,
and therefore the standardized mean Zn = n1/2 θ−1 (X̄n − θ) has a translated
THE ACCURACY OF THE NORMAL APPROXIMATION 203

Figure 4.7 The normal approximation to the Binomial(5, 14 ) distribution. The


largest difference is indicated by the dotted line.
1.0
0.8
0.6
F(x)

0.4
0.2
0.0

!2 0 2 4 6

x
Table 4.1 The normal approximation of the Binomial(n, θ) distribution for n =
5, 10, 25, 50, 100 and θ = 0.01, 0.05, 0.10, 0.25, 0.50. The value reported is the Kol-
mogorov distance between the scaled binomial distribution function and the normal
distribution function.
θ
n 0.01 0.05 0.10 0.25 0.50
5 0.5398 0.4698 0.3625 0.2347 0.1726
10 0.5291 0.3646 0.2361 0.1681 0.1230
25 0.4702 0.2331 0.1677 0.1071 0.0793
50 0.3664 0.1677 0.1161 0.0758 0.0561
100 0.2358 0.1160 0.0832 0.0535 0.0398
204 CONVERGENCE OF DISTRIBUTIONS

Figure 4.8 The normal approximation to the Binomial(10, 12 ) distribution. The


largest difference is indicated by the dotted line.
1.0
0.8
0.6
F(x)

0.4
0.2
0.0

!4 !2 0 2 4

x
Gamma(n, n−1/2 ) distribution given by
P (Zn ≤ z) = P [n1/2 θ−1 (X̄n − θ) ≤ z] (4.3)
Z z 1/2
n
= (t + n1/2 )n−1 exp[−n1/2 (t + n1/2 )]dt. (4.4)
−n1/2 Γ(n)
d
See Exercise 27. Theorem 4.20 implies that Zn − → Z as n → ∞ where Z is
a N(0, 1) distribution, and therefore for large n, we have the approximation
P (Zn ≤ z) ' Φ(z). Because the exact density of Zn is known in this case, we
can compute the Kolmogorov distance between P (Zn ≤ z) and Φ(z) to get an
overall view of the accuracy of this approximation. Note that in this case the
distribution of Zn does not depend on θ, so that we need only consider the
sample size n for our calculations. For example, when n = 2, the Kolmorogov
distance between P (Zn ≤ z) and Φ(z) is approximately 0.0945, and when
n = 5 the Kolmorogov distance between P (Zn ≤ z) and Φ(z) is approximately
0.0596. See Figures 4.9 and 4.10. A plot of the Kolmogorov distance against
n is given in Figure 4.11. We observe once again that the distance decreases
with n, as required by Theorem 4.7. 

It is not always possible to perform this type of calculation directly. In many


cases the exact distribution of the standardized sample mean is not known,
THE ACCURACY OF THE NORMAL APPROXIMATION 205

Figure 4.9 The normal approximation of the Gamma(n, n−1/2 ) distribution when
n = 2. The solid line is the Gamma(n, n−1/2 ) distribution function translated by its
mean which equals n1/2 , the dashed line is the N(0, 1) distribution function.
1.0
0.8
0.6
0.4
0.2
0.0

!1 0 1 2

or may not have a simple form. In these cases one can often use simulations
to approximate the behavior of the normal approximation.
Example 4.37. Let {{Xn,m }nm=1 }∞ n=1 be a triangular array of N(0, 1) ran-
dom variables that are mutually independent both within and between rows.
Define a new sequence of random variables {Yn }∞n=1 where

Yn = max{Xn,1 , . . . , Xn,n } − min{Xn,1 , . . . , Xn,n }


for all n ∈ N. The distribution function of Yn has the form
Z ∞
P (Yn ≤ y) = n [Φ(t + y) − Φ(t)]n−1 φ(t)dt, (4.5)
−∞

for y > 0. See Arnold, Balakrishnan and Nagaraja (1993). Now let Ȳn be the
sample mean of Y1 , . . . , Yn , and let Zn = n1/2 σ(Ȳn − µ) where µ = E(Yn )
and σ 2 = V (Yn ). While the distribution function in Equation (4.5) can be
computed numerically with some ease, the distributions of Ȳn and Zn are not
so simple to compute and approximations based on simulations are usually
easier. 

When applying the normal approximation in practice, the distribution F may


206 CONVERGENCE OF DISTRIBUTIONS

Figure 4.10 The normal approximation of the Gamma(n, n−1/2 ) distribution when
n = 5. The solid line is the Gamma(n, n−1/2 ) distribution function translated by its
mean which equals n1/2 , the dashed line is the N(0, 1) distribution function.
1.0
0.8
0.6
0.4
0.2
0.0

!2 !1 0 1 2

not be known. This problem prevents one from studying the accuracy of the
normal approximation using either direct calculation or simulations. In this
case one needs to be able to study the accuracy of the normal approximation in
such a way that it does not depend of the distribution of the random variables
in the sequence. One may be surprised to learn that there are theoretical
results which can be used in this case. Specifically, one can find universal
bounds on the accuracy of the normal approximation that do not depend on
the distribution of the random sequence. For the case we study in this section
we will have to make some further assumptions about the distribution F .
Specifically, we will need to know something about the third moment of F .

Developing these bounds depends heavily on the theory of characteristic func-


tions. For a given distribution F , it is not always an easy matter to obtain
analytic results about the sum, or average, of n random variables having that
distribution since convolutions are based on integrals or sums. However, the
characteristic function of a sum or average is much easier to compute because
of the results in Theorems 2.32 and 2.33. Because of the uniqueness of charac-
teristic functions guaranteed by Theorem 2.27 it may not be surprising that
THE ACCURACY OF THE NORMAL APPROXIMATION 207

Figure 4.11 The Kolmorogov distance between P (Zn ≤ z) and Φ(z) for different val-
ues of n where Zn is the standardized sample mean computed on n Exponential(θ)
random variables.
0.15
Kolmogorov Distance

0.10
0.05

0 20 40 60 80 100

we are able to study differences between distributions by studying differences


between their corresponding distribution functions. The main tool for study-
ing these differences is based on what is commonly known as the smoothing
theorem, a version of which is given below.
Theorem 4.23 (Smoothing Theorem). Suppose that F is a distribution func-
tion with characteristic function ψ such that
Z ∞
tdF (t) = 0,
−∞

and Z ∞
t2 dF (t) = 1.
−∞
Let G be a differentiable distribution function with characteristic function ζ
such that ζ 0 (0) = 0. Then for T > 0,
Z T
|ψ(t) − ζ(t)|
π|F (x) − G(x)| ≤ dt + 24T −1 sup |G0 (t)|.
−T |t| t∈R
208 CONVERGENCE OF DISTRIBUTIONS
A proof of Theorem 4.23 can be found in Section 2.5 of Kolassa (2006). The
form of the Smoothing Theorem given in Theorem 4.23 allows us to bound
the difference between distribution functions based on an integral of the cor-
responding characteristic functions. A more general version of this result will
be considered later in the book. Theorem 4.23 allows us to find a universal
bound for the accuracy of the normal approximation that depends only on
the absolute third moment of F . This famous result is known as the theorem
of Berry and Esseen.
Theorem 4.24 (Berry and Esseen). Let {Xn }∞ n=1 be a sequence of indepen-
dent and identically distributed random variables from a distribution such that
E(Xn ) = 0 and V (Xn ) = 1. If ρ = E(|Xn |3 ) < ∞ then

sup P (n1/2 X̄n ≤ t) − Φ(t) ≤ n−1/2 Bρ, (4.6)

t∈R

for all n ∈ N where B is a constant that does not depend on n.

Proof. The proof of this result is based on Theorem 4.23, and follows the
method of proof given by Feller (1971) and Kolassa (2006). To avoid certain
overly technical arguments, we will make the assumption in this proof that
the distribution of the random variables is symmetric, which results in a real
valued characteristic function. The general proof follows the same overall path.
See Section 6.2 of Gut (2005) for further details. We will assume that B = 3
and first dispense with the case where n < 10. To simplify notation, let X
be a random variable following the distribution F . Note that Theorem 2.11
(Jensen) implies that [E(|X|2 )]3/2 ≤ E(|X|3 ) = ρ, and by assumption we have
that E(|X|2 ) = 1. Therefore, it follows that ρ ≥ 1. Now, when B = 3, the
bound given in Equation (4.6) has the property n−1/2 Bρ = 2n−1/2 ρ where
ρ ≥ 1. Hence, if n1/2 ≤ 3, or equivalently if n < 10, it follows that nBρ ≥ 1.
Since the difference between two distribution functions is always bounded
above by one, it follows that the result is true without any further calculation
when B = 3 and n < 10. For the remainder of the proof we will consider only
the case where n ≥ 10.
Let ψ(t) be the characteristic function of Xn , and focusing our attention on
the characteristic function of the standardized mean, let
n
X
Zn = n1/2 X̄n = n−1/2 Xk .
k=1

Theorems 2.32 and 2.33 imply that the characteristic function of Zn is given
by ψ n (n−1/2 t). Letting Fn denote the distribution function of Zn , Theorem
4.23 implies that
Z T
|Fn (t) − Φ(x)| ≤ π −1 |t|−1 |ψ n (n−1/2 t) − exp(− 21 t2 )|dt
−T
+ 24T −1 π −1 sup |φ(t)|. (4.7)
t∈R
THE ACCURACY OF THE NORMAL APPROXIMATION 209
The bound of integration used in Equation (4.7) is chosen to be T = 43 ρ−1 n1/2 ,
where it follows that since ρ ≥ 1, we have that T ≤ 43 n1/2 . Now,

sup |φ(t)| = (2π)−1/2 < 52 ,


t∈R

so that the second term can be bounded as

24T −1 π −1 sup |φ(t)| < 48 −1 −1


5 T π .
t∈R

Note that 48
5 = 9.6, matching the bound given in Feller (1971). In order to
find a bound on the integral in Equation (4.7) we will use Theorem A.10,
which states that for any two real numbers ξ and γ we have that

|ξ n − γ n | ≤ n|ξ − γ|ζ n−1 , (4.8)

if |ξ| ≤ ζ and |γ| ≤ ζ. We will use this result with ξ = ψ(n−1/2 t) and γ =
exp(− 12 n−1 t2 ), and we therefore need to find a value for ζ. To begin finding
such a bound, we first note that
Z ∞ Z ∞
1 2

|ψ(t) − 1 + 2 t | = exp(itx)dF (x) − dF (x)−
−∞ −∞
Z ∞ Z ∞
1 2 2

itxdF (x) + 2 t x dF (x) ,
−∞ −∞

where we have used the fact that our assumptions imply that
Z ∞
exp(itx)dF (x) = ψ(t),
−∞

Z ∞
dF (x) = 1,
−∞

Z ∞
xdF (x) = 0,
−∞

and
Z ∞
x2 dF (x) = 1.
−∞

Theorem A.12 can be applied to real integrals of complex functions so that


we have that
Z ∞
|ψ(t) − 1 + 21 t2 | ≤ exp(itx) − 1 − itx + 1 t2 x2 dF (x).

2 (4.9)
−∞

Now, Theorem A.11 implies that


exp(itx) − 1 − itx + 1 t2 x2 ≤ 1 |tx|3 ,

2 6 (4.10)
210 CONVERGENCE OF DISTRIBUTIONS
so that it follows that
Z ∞
exp(itx) − 1 − itx + 1 t2 x2 dF (x) ≤

2
−∞
Z ∞ Z ∞
3 1 3
1
6 |tx| dF (x) = 6 |t| x3 dF (x) = 61 |t|3 ρ.
−∞ −∞

Now we use the assumption that F has a symmetric distribution about the
origin. Note that ρ = E(|X|3 ), and is not the skewness of F , so that ρ > 0 as
long as F is not a degenerate distribution at the origin. In this case, Theorem
2.26 implies that ψ(t) is real valued and that |ψ(t) − 1 + 21 t2 | ≤ 16 ρ|t|3 implies
that ψ(t) − 1 + 12 t2 ≤ 16 ρ|t|3 , or that
ψ(t) ≤ 1 − 21 t2 + 16 ρ|t|3 . (4.11)
Note that such a comparison would not make sense if ψ(t) were complex
valued. When ψ(t) is real-valued we have that ψ(t) ≥ 0 for all t ∈ R. Also, if
1 − 21 t2 > 0, or equivalently if t2 < 2, it follows that the right hand side of
Equation (4.11) is positive. Hence, if (n−1/2 t)2 < 2 we have that
|ψ(n−1/2 t)| ≤ 1 − 21 (n−1/2 t)2 + 61 ρ|n−1/2 t|3
= 1 − 12 n−1 t2 + 61 n−3/2 ρ|t|3 .
Now, if we assume that t is also within our proposed region of integration,
that is that |t| ≤ T = 43 ρ−1 n1/2 it follows that
−3/2 3
1
6 ρn |t| = 16 ρn−3/2 |t|2 |t| ≤ 4 −1 2
18 n t ,
and therefore |ψ(n−1/2 t)| ≤ 1 − 18 5 −1 2
n t . From Theorem A.21 we have that
5 −1 2
exp(x) ≥ 1+x for all x ∈ R, so that it follows that 1− 18 n t ≤ exp(−n−1 18 5 2
t )
−1/2 −1 5 2
for all t ∈ R. Therefore, we have established that |ψ(n t)| ≤ exp(−n 18 t ),
for all |t| ≤ T , the bound of integration established earlier. It then follows that
|ψ(n−1/2 t)|n−1 ≤ exp[− 18 5 −1
n (n − 1)t2 ]. But note that 185 −1
n (n − 1) ≥ 41 for
−1/2 n−1 1 2
n ≥ 10 so that |ψ(n t)| ≤ exp(− 4 t ) when n ≥ 10. We will use this
bound for ζ in Equation (4.8) to find that

|ψ n (n−1/2 t) − exp(− 12 t2 )| ≤ n|ψ(n−1/2 t) − exp(− 21 t2 )||ψ(n−1/2 t)|n−1 ≤


n|ψ(n−1/2 t) − exp(− 12 t2 )| exp(− 14 t2 ), (4.12)
when n ≥ 10. Now add and subtract 1 − 21 n−1 t2 inside the absolute value on
the right hand side of Equation 4.12 and apply Theorem A.18 to yield

n|ψ(n−1/2 t) − exp(− 12 t2 )| ≤ n|ψ(n−1/2 t) − 1 + 21 n−1 t2 |+


n|1 − 12 n−1 t2 − exp(− 12 n−1 t2 )|. (4.13)
The first term on the right hand side of Equation (4.13) can be bounded using
Equations (4.9) and (4.10) to find that
|ψ(n−1/2 t) − 1 + 21 n−1 t2 | ≤ 16 n−3/2 ρ|t|3 .
THE ACCURACY OF THE NORMAL APPROXIMATION 211
For the second term on the right hand side of Equation (4.13) we use Theorem
A.21 to find that
| exp(− 12 n−1 t2 ) − 1 + 21 n−1 t2 | ≤ 18 n−2 t4 .
Therefore,
n|ψ(n−1/2 t) − exp(− 12 t2 )| ≤ 16 n−1/2 ρ|t|3 + 18 n−1 t4 . (4.14)
Now, using Equation (4.8), the integrand in Equation (4.7) can be bounded
as

|t|−1 |ψ n (n−1/2 t) − exp(− 21 t2 )| ≤


n|t|−1 |ψ(n−1/2 t) − exp(− 12 n−1 t2 )| exp(− 14 t2 ) ≤
( 16 n−1/2 ρ|t|2 + 18 n−1 |t|3 ) exp(− 41 t2 ), (4.15)
where the second inequality follows from Equation (4.14). If we multiply and
divide the bound in Equation (4.15) by T = 34 ρ−1 n1/2 , we get that the bound
equals
T −1 ( 29 t2 + 16 ρ−1 n−1/2 |t|3 ) exp(− 41 t2 ).
Recalling that ρ ≥ 1 and that n1/2 > 3 when n ≥ 10, we have that ρ−1 n−1/2 ≤
1
3 , so that

T −1 ( 29 t2 + 61 ρ−1 n−1/2 |t|3 ) exp(− 14 t2 ) ≤ T −1 ( 29 t2 + 1 3 1 2


18 |t| ) exp(− 4 t ).

This function is non-negative and integrable over R and, therefore, we can


bound the integral in Equation (4.7) by
Z ∞
−1
π |t|−1 |ψ n (n−1/2 t) − exp(− 21 t2 )|dt + 24T −1 π −1 sup |φ(t)| ≤
−∞ t∈R
Z ∞
(πT )−1 ( 29 t2 + 18
1
|t|3 ) exp(− 14 t2 )dt + 48
5 T
−1 −1
π .
−∞

Now Z ∞
T −1 29 t2 exp(− 14 t2 )dt = 23 π 1/2 ρn−1/2 ,
−∞
and Z ∞
T −1 18
1
|t|3 exp(− 41 t2 )dt = 23 ρn−1/2 .
−∞
−1 −1/2
See Exercise 28. Similarly, 48 5 T = 36
5 ρn . Therefore, it follows that
Z ∞
π −1 |t|−1 |ψ n (n−1/2 t) − exp(− 21 t2 )|dt + 24T −1 π −1 sup |φ(t)| ≤
−∞ t∈R

π −1 ρn−1/2 ( 23 π 1/2 + 2
3 + 36
5 ) ≤ 136 −1
15 π ρn−1/2 ,

where we have used the fact that π 1/2 ≤ 95 for the second inequality. Now,
to finish up, note that π −1 ≤ 408
135
so that we finally have the conclusion that
212 CONVERGENCE OF DISTRIBUTIONS
from Equation (4.7) that
−1/2 −1/2
|Fn (t) − Φ(x)| ≤ 135 136
408 15 ρn = 18360
6120 ρn = 3ρn−1/2 ,
which completes the proof.

Note that the more general case where E(Xn ) = µ and V (Xn ) = σ 2 can
be addressed by applying Theorem 4.24 to the standardized sequence Zn =
σ −1 (Xn − µ) for all n ∈ N, yielding the result below.
Corollary 4.2. Let {Xn }∞ n=1 be a sequence of independent and identically
distributed random variables from a distribution such that E(Xn ) = µ and
V (Xn ) = σ 2 . If E(|Xn − µ|3 ) < ∞ then

sup P [n1/2 σ −1 (X̄n − µ) ≤ t] − Φ(t) ≤ n−1/2 Bσ −3 E(|Xn − µ|3 )

t∈R

for all n ∈ N where B is a constant that does not depend on n.

Corollary 4.2 is proven in Exercise 29. The bound given in Theorem 4.24 and
Corollary 4.2 were first derived by Berry (1941) and Esseen (1942). Extension
to the case of non-identical distributions has been studied by Esseen (1945).
The exact value of the constant B specified in Theorem 4.24 and Corollary
4.2 is not known, though there has been a considerable amount of research
devoted to the topic of finding upper and lower bounds for B. The proof of
Theorem 4.24 uses the constant B = 3. Esseen’s original value for the constant
is 7.59. Esseen and Wallace also showed that B = 2.9 and B = 2.05 works
as well in unpublished works. See page 26 of Kolassa (2006). The recent best
upper bounds for B have been shown to be 0.7975 by van Beek (1972) and
0.7655 by Shiganov (1986). Chen (2002) provides further refinements of B.
Further information about this constant can be found in Petrov (1995) and
Zolotarev (1986). A lower bound of 0.4097 is given by Esseen (1956), using a
type of Bernoulli population for the sequence of random variables. A lower
bound based on similar arguments is derived in Example 4.38 below.
Example 4.38. Lower bounds for the constant B used in Theorem 4.24 and
Corollary 4.2 can be found by computing the observed distance between the
standardized distribution of the mean and the standard normal distribution
for specific examples. Any such distance must provide a lower bound for the
maximum distance, and therefore can be used to provide a lower bound for
B. The most useful examples provide a distance that is a multiple of n−1/2
so that a lower bound for B can be derived that does not depend on n.
For example, Petrov (2000) considers the case where {Xn }∞n=1 is a sequence
of independent and identically distributed random variables where Xn has
probability distribution function
(
1
x ∈ {−1, 1},
f (x) = 2
0 otherwise.

In this case we have that E(Xn ) = 0, V (Xn ) = 1, and E(|Xn |3 ) = 1 so that


THE ACCURACY OF THE NORMAL APPROXIMATION 213
Theorem 4.24 implies that

sup P (n−1/2 X̄n ≤ t) − Φ(t) ≤ n−1/2 B. (4.16)

t∈R

In this case the 12 (Xn + 1) has a Bernoulli( 21 ) distribution so that 12 (Sn + n)


has a Binomial(n, 12 ) distribution. Therefore, it follows that
 
1 n −n
P [ 2 (Sn + n) = k] = P (Sn = 2k − n) = 2 ,
k
for k ∈ {0, 1, . . . , n}. Suppose n is an even integer, then
 
−1/2 n
P (n X̄n = 0) = P (Sn = 0) = P [ 12 (Sn + n) = n
2] = n 2−n .
2

Now apply Theorem 1.20 (Stirling) to the factorial operators in the combina-
tion to find
 
n n!
n = n n
2 ( 2 )!( 2 )!
nn (2nπ)1/2 exp( n2 ) exp( n2 )[1 + o(1)]
=
nπ( n2 )n/2 ( n2 )n/2 exp(n)[1 + o(1)][1 + o(1)]
= 2n ( nπ
2 1/2
) [1 + o(1)],

as n → ∞. Therefore it follows that P (n−1/2 X̄n = 0) ' ( nπ 2 1/2


) , or more
−1/2 2 1/2
accurately P (n X̄n = 0) = ( nπ ) [1 + o(1)] as n → ∞. It follows that the
distribution function P (n−1/2 X̄n ≤ x) has a jump of size ( nπ2 −1/2
) [1 + o(1)]
−1/2
at x = 0. Therefore, in a neighborhood of x = 0, P (n X̄n ≤ x) cannot be
approximated by a continuous with error less than 12 × ( nπ 2 1/2
) [1 + o(1)] =
−1/2
(2nπ) [1 + o(1)] as n → ∞. From Equation 4.16, this suggests that B ≥
(2π)−1/2 ' 0.3989. 
Example 4.39. Let {Xn }∞ n=1 be a sequence of independent and identically
distributed random variables where Xn has a Bernoulli(θ) distribution as
discussed in Example 4.35. Therefore, in this case, µ = E(Xn ) = θ, σ 2 =
V (Xn ) = θ(1 − θ), and E(|Xn − θ|3 ) = |0 − θ|3 (1 − θ) + |1 − θ|3 θ = θ(1 −
θ)[θ2 + (1 − θ)2 ]. Therefore, Corollary 4.2 implies that

sup P {n1/2 [θ(1 − θ)]−1/2 (X̄n − θ) ≤ t} − Φ(t) ≤

t∈R

n−1/2 B[θ(1 − θ)]−1/2 [θ2 + (1 − θ)2 ],

where B = 0.7655 can be used as the constant. This bound is given for the
cases studied in Table 4.1 in Table 4.2. Note that in each of the cases studied,
the actual error given in Table 4.1 is lower than the error given by Corollary
4.2. 
214 CONVERGENCE OF DISTRIBUTIONS

Table 4.2 Upper bounds on the error of the normal approximation of the
Binomial(n, θ) distribution for n = 5, 10, 25, 50, 100 and θ = 0.01, 0.05, 0.10,
0.25, 0.50 provided by the Corollary 4.2 with B = 0.7655.
θ
n 0.01 0.05 0.10 0.25 0.50
5 3.3725 1.4215 0.9357 0.4941 0.3423
10 2.3847 1.0052 0.6617 0.3494 0.2421
25 1.5082 0.6357 0.4185 0.221 0.1531
50 1.0665 0.4495 0.2959 0.1563 0.1083
100 0.7541 0.3179 0.2092 0.1105 0.0765

4.6 The Sample Moments

Let {Xn }∞n=1 be a sequence of independent and identically distributed random


variables and let µ0k be the k th moment defined in Definition 2.9 and let µ̂0k
be the k th sample moment defined in Section 3.8 as
n
X
µ̂0k = n−1 Xik .
i=1

Note that for a fixed value of k, {Xnk }∞


is a sequence of independent and
n=1
identically distributed random variables with E(Xnk ) = µ0k and V (Xnk ) = µ02k −
(µ0k )2 , so that Theorem 4.20 (Lindeberg and Lévy) implies that if µ02k < ∞
d
then n1/2 (µ̂0k − µ0k )[µ02k − (µ0k )2 ]−1/2 −
→ Z as n → ∞ where Z is a N(0, 1)
random variable.
This general argument can be extended to the joint distribution of any set of
sample moments, as long as the required moments conditions are met.
Theorem 4.25. Let {Xn }∞ n=1 be a sequence of independent and identically
distributed random variables. Let µd = (µ01 , . . . , µ0d )0 , µ̂d = (µ̂01 , . . . , µ̂0d )0 and
d
assume that µ02d < ∞. Then n1/2 Σ−1/2 (µ̂n − µ) − → Z as n → ∞ where Z
is a d-dimensional N(0, I) random variable where the (i, j)th element of Σ is
µ0i+j − µ0i µ0j , for i = 1, . . . , d and j = 1, . . . , d.

Proof. Define a sequence of d-dimensional random vectors {Yn }∞ 0


n=1 as Yn =
2 d
(Xn , Xn , . . . , Xn ) for all n ∈ N and note that
 −1 Pn   0
n Pk=1 Xk µ̂1
n n−1 n Xk2  µ̂02 
X k=1
Ȳn = n−1 Yk =   =  ..  = µ̂d .
   
..
k=1

Pn .  .
n −1 d
k=1 Xk µ̂0d
Taking the expectation of Ȳn elementwise implies that E(Yn ) = µd . Let Σ
THE SAMPLE QUANTILES 215
be the covariance matrix of Yn so that the (i, j)th element of Σ is given by
C(Xni , Xnj ) = E(Xni Xnj ) − E(Xni )E(Xnj ) = µ0i+j − µ0i µ0j for i = 1, . . . , d and
j = 1, . . . , d. The result now follows by applying Theorem 4.22 to the sequence
{Yn }∞n=1 .
Example 4.40. Let {Xn }∞ n=1 be a sequence of independent and identically
distributed random variables from a distribution where µ04 < ∞. Theorem 4.25
d
implies that n1/2 Σ−1/2 (µ̂2 −µ2 ) −
→ Z as n → ∞ where Z is a two-dimensional
N(0, I) random vector, where µ02 = (µ01 , µ02 ) and
 0
µ2 − (µ01 )2 µ03 − µ01 µ02

Σ= 0 .
µ3 − µ01 µ02 µ04 − (µ02 )2
Suppose that the distribution of {Xn }∞ 2 0 0
n=1 is N(µ, σ ). Then µ1 = µ, µ2 =
2 2 0 3 2 0 4 2 2 4
σ + µ , µ3 = µ + 3µσ , µ4 = µ + 6µ σ + 3σ , and
 2
2µσ 2

σ
Σ= .
2µσ 2 4µ2 σ 2 + 2σ 4
In the special case of a standard normal distribution where µ = 0 and σ 2 = 1
the covariance matrix simplifies to
 
1 0
Σ= .
0 2


4.7 The Sample Quantiles

In Section 3.9 we proved that the sample quantiles converge almost certainly
to the population quantiles under some assumptions on the local behavior of
the distribution function in a neighborhood of the quantile. In this section
we establish that sample quantiles also have an asymptotic Normal distri-
bution. Of interest in this case is the fact that the results again depend on
local properties of the distribution function, in particular, the derivative of the
distribution function at the point of the quantile. This result differs greatly
from the case of the sample moments whose asymptotic Normality depends
on global properties of the distribution, which in that case was dependent
on the moments of the distribution. The main result given below establishes
Normal limits for some specific forms of probabilities involving sample quan-
tiles. These will then be used to establish the asymptotic Normality of the
sample quantiles under certain additional assumptions.
Theorem 4.26. Let {Xn }∞ n=1 be a sequence of independent random variables
that have a common distribution F . Let p ∈ (0, 1) and suppose that F is
continuous at ξp . Then,
1. If F 0 (ξp −) exists and is positive then for x < 0,
lim P {n1/2 F 0 (ξp −)(ξˆp,n − ξp )[p(1 − p)]−1/2 ≤ x} = Φ(x).
n→∞
216 CONVERGENCE OF DISTRIBUTIONS
2. If F 0 (ξp +) exists and is positive then for x > 0,

lim P {n1/2 F 0 (ξp +)(ξˆp,n − ξp )[p(1 − p)]−1/2 ≤ x} = Φ(x).


n→∞

3. In any case

lim P [n1/2 (ξˆp,n − ξp ) ≤ 0] = Φ(0) = 21 .


n→∞

Proof. Fix t ∈ R and let v be a normalizing constant whose specific value will
be specified later in the proof. Define Gn (t) = P [n1/2 v −1 (ξˆpn − ξp ) ≤ t], which
is the standardized distribution of the pth sample quantile. Now,

Gn (t) = P [n1/2 v −1 (ξˆpn − ξp ) ≤ t]


= P (ξˆpn ≤ ξp + tvn−1/2 )
= P [F̂n (ξˆpn ) ≤ F̂n (ξp + tvn−1/2 )]

where F̂n is the empirical distribution function computed on X1 , . . . , Xn and


the last equality follows from the fact that the F̂n is a non-decreasing function.
Theorem 3.22 implies that

P [F̂n (ξˆpn ) ≤ F̂n (ξp + tvn−1/2 )] = P [p ≤ F̂n (ξp + tvn−1/2 )]


= P [np ≤ nF̂n (ξp + tvn−1/2 )].

The development in Section 3.7 implies that nF̂n (ξp + tvn−1/2 ) has a Bi-
nomial[n, F (ξp + tvn−1/2 )] distribution. Let θ = F (ξp + tvn−1/2 ) and note
that

Gn (t) = P [nF̂n (ξp + tvn−1/2 ) ≥ np]


= P [nF̂n (ξp + tvn−1/2 ) − nθ ≥ np − nθ]
" #
n1/2 [F̂n (ξp + tvn−1/2 ) − nθ] n1/2 (p − θ)
= P ≥ , (4.17)
[θ(1 − θ)]1/2 [θ(1 − θ)]1/2

where the random variable n1/2 [F̂n (ξp + tvn−1/2 ) − nθ][θ(1 − θ)]−1/2 is a
standardized Binomial(n, θ) random variable. Now let us consider the case
when t = 0. When t = 0 is follows that θ = F (ξp + tvn−1/2 ) = F (ξp ) = p and
n1/2 (p − θ)[θ(1 − θ)]−1/2 = 0 as long as p ∈ (0, 1). Theorem 4.20 (Lindeberg
and Lévy) then implies that
" #
n1/2 [F̂n (ξp + tvn−1/2 ) − nθ]
lim P ≥ 0 = 1 − Φ(0) = 12 ,
n→∞ [θ(1 − θ)]1/2

which proves Statement 3. Note that the normalizing constant v does not
enter into this result as it is cancelled out when t = 0. To prove the remaining
THE SAMPLE QUANTILES 217
statements, note that
( )
n1/2 [nF̂n (ξp + tvn−1/2 ) − nθ] n1/2 (p − θ)
Φ(t)−Gn (t) = Φ(t)−P ≥ =
[θ(1 − θ)]1/2 [θ(1 − θ)]1/2
( )
n1/2 [nF̂n (ξp + tvn−1/2 ) − nθ] n1/2 (p − θ)
P < − 1 + Φ(t) =
[θ(1 − θ)]1/2 [θ(1 − θ)]1/2
( )
n1/2 [nF̂n (ξp + tvn−1/2 ) − nθ] n1/2 (p − θ)
P < − 1 + Φ(t)
[θ(1 − θ)]1/2 [θ(1 − θ)]1/2
 1/2   1/2 
n (p − θ) n (p − θ)
−Φ +Φ
[θ(1 − θ)]1/2 [θ(1 − θ)]1/2
( )
n1/2 [nF̂n (ξp + tvn−1/2 ) − nθ] n1/2 (p − θ)
 1/2 
n (p − θ)
=P < − Φ
[θ(1 − θ)]1/2 [θ(1 − θ)]1/2 [θ(1 − θ)]1/2
 1/2 
n (θ − p)
+ Φ(t) − Φ .
[θ(1 − θ)]1/2
To obtain a bound on the first difference we use Theorem 4.24 (Berry and
Esseen) which implies that
( )
n1/2 [nF̂n (ξp + tvn−1/2 ) − nθ]
sup P < t − Φ(t) ≤ n−1/2 Bγτ −3 ,

[θ(1 − θ)]1/2
t∈R
where B is a constant that does not depend on n, and
γ = θ|1 − θ|3 + (1 − θ)| − θ3 | = θ(1 − θ)[(1 − θ)2 + θ2 ],
and τ 2 = θ(1 − θ). Therefore,
Bθ(1 − θ)[(1 − θ)2 + θ2 ] (1 − θ)2 + θ2
n−1/2 Bγτ −3 = 1/2 3/2 3/2
= Bn−1/2 1/2 .
n θ (1 − θ) θ (1 − θ)1/2
Therefore, it follows that
( ) 
n1/2 [nF̂n (ξp + tvn−1/2 ) − nθ] n1/2 (p − θ)
 1/2
n (p − θ)
P < −Φ

[θ(1 − θ)]1/2 [θ(1 − θ)]1/2 [θ(1 − θ)]1/2


(1 − θ)2 + θ2
≤ Bn−1/2
θ1/2 (1 − θ)1/2
and, hence,
(1 − θ)2 + θ2
 1/2 
−1/2
n (θ − p)
|Φ(t) − Gn (t)| ≤ Bn + Φ(t) − Φ
.
θ1/2 (1 − θ)1/2 [θ(1 − θ)]1/2
To complete the arguments we must investigate the limiting behavior of θ.
Note that θ = F (ξp + tvn−1/2 ), which is a function of n. Because we have
assumed that F is continuous at ξp it follows that for a fixed value of t ∈ R,
lim θ = lim F (ξp + n−1/2 vt) = F (ξp ) = p,
n→∞ n→∞
218 CONVERGENCE OF DISTRIBUTIONS
and therefore it follows that
lim θ(1 − θ) = p(1 − p).
n→∞

Now note that


n1/2 (θ − p) n1/2 [F (ξp + n−1/2 vt) − p]
= =
[θ(1 − θ)]1/2 [θ(1 − θ)]1/2
tv F (ξp + n−1/2 vt) − p
,
[θ(1 − θ)]1/2 tvn−1/2
where we have used the fact that F (ξp ) = p. Noting that the second term of
the last expression is in the form of a derivative, we have that if t > 0
n1/2 (θ − p) tv
lim = lim F 0 (ξp +).
n→∞ [θ(1 − θ)]1/2 n→∞ [θ(1 − θ)]1/2

Therefore, if we choose the normalizing constant as v = [p(1 − p)]1/2 /F 0 (ξp +)


we have that
n1/2 (θ − p)
lim = t.
n→∞ [θ(1 − θ)]1/2

Therefore, when t > 0 we have that

(1 − θ)2 + θ2
lim |Gn (t) − Φ(t)| ≤ lim Bn−1/2 +
n→∞ n→∞ θ1/2 (1 − θ)1/2
 1/2 
n (θ − p)
lim Φ(t) − Φ = 0.
n→∞ [θ(1 − θ)]1/2
Hence, we have shown that
lim P [n1/2 v −1 (ξˆpn − ξp ) ≤ t] = Φ(t),
n→∞

for all t > 0, which proves Statement 2. Similar arguments are used to prove
Statement 3. See Exercise 30.

The result of Theorem 4.26 simplifies when we are able to make additional
assumptions about the structure of F . In the first case we assume that F is
differentiable at the point ξp and that F (ξp ) > 0.
Corollary 4.3. Let {Xn }∞ n=1 be a sequence of independent random variables
that have a common distribution F . Let p ∈ (0, 1) and suppose that F is
differentiable at ξp and that F (ξp ) > 0. Then,
n1/2 F 0 (ξp )(ξˆpn − ξp ) d

→Z
[p(1 − p)]1/2
as n → ∞ where Z is a N(0, 1) random variable.

Corollary 4.3 is proven in Exercise 31. An additional simplification of the


results occurs when F has a density f in a neighborhood of ξp .
THE SAMPLE QUANTILES 219
Corollary 4.4. Let {Xn }∞ n=1 be a sequence of independent random variables
that have a common distribution F . Let p ∈ (0, 1) and suppose that F has
density f in a neighborhood of ξp and that f is positive and continuous at ξp .
Then,
n1/2 f (ξp )(ξˆpn − ξp ) d

→ Z,
[p(1 − p)]1/2
as n → ∞ where Z is a N(0, 1) random variable.
Corollary 4.4 is proven in Exercise 32. Corollary 4.4 highlights the difference
referred to earlier between the asymptotic Normality of a sample moment
and a sample quantile. Let X1 , . . . , Xn be a set of independent and identically
distributed random variables from a distribution F that has a density f . The
sample mean has a standard error equal to
(Z 2 )1/2
∞  Z ∞
−1/2
n u− tdF (t) dF (u) .
−∞ −∞

This is a global property of F in that many distributions with vastly dif-


ferent local properties may have the same variance. On the other hand, the
asymptotic standard error of the sample median is given by 12 [f (ξ1/2 )]−1 . This
standard error depends only on the behavior of the density f near the popula-
tion median, and is therefore a local property. Note in fact that the standard
error is inversely related to the density at the median. This is due to the fact
that the sample median is determined by the values from the sample that are
closest to the middle of the data, and these values will tend to be centered
around the sample median on average. If f (ξ1/2 ) is large, then there will tend
to be a large amount of data concentrated around the population median,
which will provide a less variable estimate. On the other hand, if f (ξ1/2 ) is
small, then there will tend to be little data concentrated around the popula-
tion median, and hence the sample median will be more variable.

Example 4.41. Suppose that {Xn }∞ n=1 is a sequence of independent ran-


dom variables that have a common distribution F with positive density f
d
at ξ1/2 . Then Corollary 4.4 implies that 2f (ξ1/2 )(ξˆ1/2,n − ξ1/2 ) − → Z as
n → ∞ where Z is a N(0, 1) random variable. In the special case where f
is a N(µ, σ 2 ) density we have that f (ξ1/2 ) = f (µ) = (2πσ 2 )1/2 and hence
d
21/2 (πσ 2 )1/2 (ξˆ1/2,n − ξ1/2 ) −
→ Z as n → ∞ where Z is a N(0, 1) random
variable. Note that even when the population is normal, the finite sample dis-
tribution of the sample median is not normal, but is asymptotically Normal.
Now, consider a bimodal Normal mixture of the form
f (x) = 2−3/2 π −1/2 exp[−2(x + 23 )2 ] + exp[−2(x − 32 )2 ] .


The mean and median of this distribution is zero, and the variance is 25 . The
asymptotic standard error of the median is
[2f (ξ1/2 )]−1 = [21/2 π −1/2 exp(− 92 )]−1 = 2−1/2 π 1/2 exp( 92 ).
220 CONVERGENCE OF DISTRIBUTIONS

Figure 4.12 The densities of the bimodal Normal (solid line) mixture and the
N(0, 52 ) distribution (dashed line) used in Example 4.41.
0.4
0.3
0.2
0.1
0.0

!4 !2 0 2 4

Let us compare this standard error with that of the sample median based on
a sample of size n from a N(0, 52 ) distribution, which has the same median,
mean and variance as the bimodal Normal mixture considered earlier. In this
case the asymptotic standard error is given by

[2f (ξ1/2 )]−1 = [2(5π)−1/2 ]−1 = 12 (5π)1/2 ,

so that the ratio of the asymptotic standard error of the sample median for
the bimodal Normal mixture relative to that of the N(0, 52 ) distribution is

2−1/2 π 1/2 exp( 92 )


1 1/2
= ( 25 )1/2 exp( 92 ) ' 56.9.
2 (5π)

Therefore, the sample median has a much larger asymptotic standard error
for estimating the median of the bimodal Normal mixture. This is because
the density of the mixture distribution is much lower near the location of the
median. See Figure 4.12. 

Example 4.42. Suppose that {Xn }∞


n=1 is a sequence of independent random
EXERCISES AND EXPERIMENTS 221
variables that have a common distribution F given by


0 x < 0,
1x

0 ≤ x < 12 ,
F (x) = 21 1
 (3x − 1) 2 ≤ x < 1,
2


1 x ≥ 1.
This distribution function is plotted in Figure 4.13. It is clear that for this
distribution ξ1/4 = 21 , but note that F 0 (ξ1/4 −) = 21 and F 0 (ξ1/4 +) = 32 so
that the derivative of F does not exist at ξ1/4 . This means that ξˆ1/4,n does
not have an asymptotically normal distribution, but according to Theorem
4.26, probabilities of appropriately standardized functions of ξˆ1/4,n can be
approximated by normal probabilities. In particular, Theorem 4.26 implies
that
lim P [3−1/2 n1/2 (ξˆ1/4,n − ξ1/4 ) ≤ t] = Φ(t),
n→∞
for t < 0,
lim P [31/2 n1/2 (ξˆ1/4,n − ξ1/4 ) ≤ t] = Φ(t),
n→∞
for t > 0, and
lim P [n1/2 (ξˆ1/4,n − ξ1/4 ) ≤ 0] = 21 .
n→∞

It is clear from this result that the distribution of ξˆ1/4 has a longer tail below
ξ1/4 and a shorter tail above ξ1/4 . This is due to the fact that there is less
density, and hence less data, that will typically be observed on average below
ξ1/4 . See Experiment 6. 

It is important to note that the regularity conditions are crucially important to


these results. For an example of what can occur when the regularity conditions
are not met, see Koenker and Bassett (1984).

4.8 Exercises and Experiments

4.8.1 Exercises

1. Let {Xn }∞
n=1 be a sequence of random variables such that Xn has a Uni-
d
form{0, n−1 , 2n−2 , . . . , 1} distribution for all n ∈ N. Prove that Xn −
→X
as n → ∞ where X has a Uniform[0, 1] distribution.
2. Let {Xn }∞n=1 be a sequence of random variables where Xn is an Expo-
nential(θ + n−1 ) random variable for all n ∈ N where θ is a positive
real constant. Let X be an Exponential(θ) random variable. Prove that
d
Xn −
→ X as n → ∞.
3. Let {Xn }∞
n=1 be a sequence of random variables such that for each n ∈ N,
Xn has a Gamma(αn , βn ) distribution where {αn }∞ ∞
n=1 and {βn }n=1 are
sequences of positive real numbers such that αn → α and βn → β as
222 CONVERGENCE OF DISTRIBUTIONS

Figure 4.13 The distribution function considered in Example 4.42. Note that the
derivative of the distribution function does not exist at the point 12 , which equals
ξ1/4 for this population. According to Theorem 4.26, this is enough to ensure that
the asymptotic distribution of the sample quantile is not normal.
1.0
0.8
0.6
F(x)

0.4
0.2
0.0

0.0 0.2 0.4 0.6 0.8 1.0

d
n → ∞, some some positive real numbers α and β. Prove that Xn −
→ X as
n → ∞ where X has a Gamma(α, β) distribution.
4. Let {Xn }∞
n=1 be a sequence of random variables where for each n ∈ N, Xn
has an Bernoulli[ 12 +(n+2)−1 ] distribution, and let X be a Bernoulli( 12 )
d
random variable. Prove that Xn −
→ X as n → ∞.
5. Let {Xn } be a sequence of independent and identically distributed random
variables where the distribution function of Xn is
(
1 − x−θ for x ∈ (1, ∞),
Fn (x) =
0 for x ∈ (−∞, 1].

Define a new sequence of random variables given by


Yn = n−1/θ max{X1 , . . . , Xn },
for all n ∈ N.
EXERCISES AND EXPERIMENTS 223
a. Prove that the distribution function of Yn is
(
[1 − (nxθ )−1 ]n x > 1
Gn (y) =
0 for x ≤ 1.

b. Consider the distribution function of a random variable Y given by


(
exp(−x−θ ) x > 1
G(y) =
0 for x ≤ 1.
Prove that Gn ; G as n → ∞.

6. Suppose that {Fn }∞


n=1 is a sequence of distribution functions such that

lim Fn (x) = F (x),


n→∞

for all x ∈ R for some function F (x). Prove the following properties of
F (x).

a. F (x) ∈ [0, 1] for all x ∈ R.


b. F (x) is a non-decreasing function.

7. Let {Xn }∞n=1 be a sequence of random variables that converge in distribu-


tion to a random variable X where Xn has distribution function Fn for all
n ∈ N and X has distribution function F , which may or may not be a valid
distribution function. Prove that if the sequence {Xn }∞
n=1 is bounded in
probability then
lim F (x) = 0.
x→−∞

8. Let {Xn }∞n=1 be a sequence of random variables that converge in distribu-


tion to a random variable X where Xn has distribution function Fn for all
n ∈ N and X has a valid distribution function F . Prove that the sequence
{Xn }∞n=1 is bounded in probability.
9. Let g be a continuous and bounded function and let {Fn }∞ n=1 be a se-
quence of distribution functions such that Fn ; F as n → ∞ where F is
a distribution function. Prove that when b is a finite continuity point of F ,
that Z ∞ Z ∞

lim g(x)dFn (x) − g(x)dF (x) = 0.
n→∞ b b

10. Consider the sequence of distribution functions {Fn }∞


n=1 where

0
 x<0
Fn (x) = 21 + (n + 2)−1 0 ≤ x < 1
1 − (n + 2)−1 x ≥ 1.

a. Specify a function G such that Fn ; G as n → ∞ and G is a right


continuous function.
224 CONVERGENCE OF DISTRIBUTIONS
b. Specify a function G such that Fn ; G as n → ∞ and G is a left
continuous function.
c. Specify a function G such that Fn ; G as n → ∞ and G is neither right
continuous nor left continuous.
11. Let {Fn }∞
n=1 be a sequence of distribution functions and let F be a distri-
bution function such that for each bounded and continuous function g,
Z ∞ Z ∞
lim g(x)dFn (x) = g(x)dF (x).
n→∞ −∞ −∞

Prove that if ε > 0 and t is a continuity point of F , then


lim inf Fn (t) ≥ F (t − ε).
n→∞

12. Let {Fn }∞


n=1 be a sequence of distribution functions that converge in dis-
tribution to a distribution function F as n → ∞. Prove that
lim sup |Fn (x) − F (x)| = 0.
n→∞ x∈R

Hint: Adapt the proof of Theorem 3.18 to the current case.


13. In the context of the proof of Theorem 4.8 prove that
lim inf Fn (x) ≥ F (x − ε).
n→∞

14. Prove the second result of Theorem 4.11. That is, let {Xn }∞
n=1 be a sequence
of random variables that converge weakly to a random variable X. Let
{Yn }∞
n=1 be a sequence of random variables that converge in probability to
d
a real constant c. Prove that Xn Yn −
→ cX as n → ∞.
15. Prove the third result of Theorem 4.11. That is, let {Xn }∞
n=1 be a sequence
of random variables that converge weakly to a random variable X. Let
{Yn }∞
n=1 be a sequence of random variables that converge in probability to
d
a real constant c 6= 0. Prove that Xn /Yn − → X/c as n → ∞.
16. In the context of the proof of the first result of Theorem 4.11, prove that
P (Xn ≤ x − ε − c) ≤ Gn (x) + P (|Yn − c| > ε).
17. Use Theorem 4.11 to prove that if {Xn }∞ n=1 is a sequence of random vari-
ables that converge in probability to a random variable X as n → ∞, then
d
Xn −→ X as n → ∞.
18. Use Theorem 4.11 to prove that if {Xn }∞ n=1 is a sequence of random vari-
ables that converge in distribution to a real constant c as n → ∞, then
p
Xn −→ c as n → ∞.
19. In the context of the proof of Theorem 4.3, prove that
Z
b Z b
lim gm (x)dF (x) − g(x)dF (x) < 31 δε ,

n→∞ a a
for any δε > 0.
EXERCISES AND EXPERIMENTS 225
20. Let {Xn }∞n=1 be a sequence of d-dimensional random vectors where Xn
has distribution function Fn for all n ∈ N and let X be a d-dimensional
random vector with distribution function F . Prove that if for any closed
set of C ⊂ Rd ,
lim sup P (Xn ∈ C) = P (X ∈ C),
n→∞

then for any open set of G ⊂ Rd ,


lim inf P (Xn ∈ G) = P (X ∈ G).
n→∞
d
Hint: Let C = R \ G and show that C is closed.
21. Let X be a d-dimensional random vector with distribution function F . Let
g : Rd → R be a continuous function such that |g(x)| ≤ b for a finite real
value b for all x ∈ R. Let ε > 0 and define a partition of [−b, b] given by
a0 < a1 < · · · < am where ak − ak−1 < ε for all k = 1, . . . , m and let
Ak = {x ∈ Rd : ak−1 < f (x) ≤ ak } for k = 0, . . . , m. Prove that
Xm Z 
g(x)dF (x) − ak P (X ∈ Ak ) ≥ −ε.
k=1 Ak

22. Let {Xn }∞n=1 be a sequence of d-dimensional random vectors that converge
in distribution to a random vector X as n → ∞. Let X0n = (Xn1 , . . . , Xnd )
d d
and X0 = (X1 , . . . , Xd ). Prove that if Xn −
→ X as n → ∞ then Xnk −
→ Xk
as n → ∞ for all k ∈ {1, . . . , d}.
23. Prove the converse part of the proof of Theorem 4.17. That is, let {Xn }∞
n=1
be a sequence of d-dimensional random vectors and let X be a d-dimensional
d d
random vector. Prove that if Xn − → X as n → ∞ then v0 Xn − → v0 X as
d
n → ∞ for all v ∈ R .
24. Let {Xn }∞ ∞
n=1 and {Yn }n=1 be sequences of random variables where Xn has
a N(µn , σn ) distribution and Yn has a N(νn , τn2 ) and νn → ν as n → ∞ for
2

some real numbers µ and ν. Assume that σn → σ and τn → τ as n → ∞ for


some positive real numbers σ and τ , and that Xn and Yn are independent
for all n ∈ N. Let v1 and v2 be arbitrary real numbers. Then it follows that
v1 Xn + v2 Yn has a N(v1 µn + v2 νn , v12 σn2 + v2 τn2 ) distribution for all n ∈ N.
Similarly v1 X + v2 Y has a N(v1 µ + v2 ν, v12 σ 2 + v2 τ 2 ) distribution. Provide
d
the details for the argument that v1 Xn + v2 Yn −→ v1 X + v2 Y as n → ∞ for
all v1 ∈ R and v2 ∈ R. In particular, consider the cases where v1 = 0 and
v2 6= 0, v1 6= 0 and v2 = 0, and finally v1 = v2 = 0.
25. Let {Xn }∞ ∞
n=1 and {Yn }n=1 be sequences of random variables that converge
in distribution as n → ∞ to the random variables X and Y , respectively.
Suppose that Xn has a N(0, 1 + n−1 ) distribution for all n ∈ N and that
Yn has a N(0, 1 + n−1 ) distribution for all n ∈ N.

a. Identify the distributions of the random variables X and Y .


226 CONVERGENCE OF DISTRIBUTIONS
d
b. Find some conditions under which we can conclude that Xn Yn−1 −
→Z
as n → ∞ where Z has a Cauchy(0, 1) distribution.
d
c. Find some conditions under which we can conclude that Xn Yn−1 −
→W
as n → ∞ where W is a degenerate distribution at one.

26. In the context of the proof of Theorem 4.20, use Theorem A.22 to prove
that [1 − 12 n−1 t2 + o(n−1 )]n = [1 − 12 n−1 t2 ] + o(n−1 ) as n → ∞ for fixed t.
27. Let {Xn }∞
n=1 be a sequence of independent and identically distributed ran-
dom variables where Xn has a Exponential(θ) distribution. Prove that
the standardized sample mean Zn = n1/2 θ−1 (X̄n − θ) has a translated
Gamma(n, n−1/2 ) distribution given by
P (Zn ≤ z) = P [n1/2 θ−1 (X̄n − θ) ≤ z]
Z z
n1/2
= (t + n1/2 )n−1 exp[−n1/2 (t + n1/2 )]dt.
−n1/2 Γ(n)

28. Prove that


Z ∞
T −1 92 t2 exp(− 14 t2 )dt = 32 π 1/2 ρn−1/2 ,
−∞
and
Z ∞
T −1 18
1
|t|3 exp(− 14 t2 )dt = 23 ρn−1/2 ,
−∞

where T = 43 ρ−1 n1/2 .


29. Use Theorem 4.24 to prove Corollary 4.2.
30. Prove Statement 3 of Theorem 4.26.
31. Prove Corollary 4.3. That is, let {Xn }∞ n=1 be a sequence of independent
random variables that have a common distribution F . Let p ∈ (0, 1) and
suppose that F is differentiable at ξp and that F (ξp ) > 0. Then, prove that

n1/2 F 0 (ξp )(ξˆpn − ξp ) d



→ Z,
[p(1 − p)]1/2
as n → ∞ where Z is a N(0, 1) random variable.
32. Prove Corollary 4.4. That is, let {Xn }∞n=1 be a sequence of independent
random variables that have a common distribution F . Let p ∈ (0, 1) and
suppose that F has density f in a neighborhood of ξp and that f is positive
and continuous at ξp . Then, prove that

n1/2 f (ξp )(ξˆpn − ξp ) d



→ Z,
[p(1 − p)]1/2
as n → ∞ where Z is a N(0, 1) random variable.
EXERCISES AND EXPERIMENTS 227
4.8.2 Experiments

1. Write a program in R that simulates b samples of size n from an Exponen-


tial(1) distribution. For each of the b samples compute the minimum value
of the sample. When the b samples have been simulated, a histogram of the
b sample minimums should be produced. Run this simulation for n = 5,
10, 25, 50, and 100 with b = 10,000 and discuss the resulting histograms in
terms of the theoretical result of Example 4.3.
2. Write a program in R that simulates a sample of size b from a Bino-
mial(n, n−1 ) distribution and a sample of size b from a Poisson(1) distri-
bution. For each sample, compute the relative frequencies associated with
each of the values {1, . . . , n}. Notice that it is possible that not all of the
sample from the Poisson distribution will be used. Run this simulation for
n = 10, 25, 50, and 100 with b = 10,000 and discuss how well the relative
frequencies from each sample compare in terms of the theoretical result
given in Example 4.4.
3. Write a program in R that generates a sample of size b from a specified
distribution Fn (specified below) that weakly converges to a distribution F
as n → ∞. Compute the Kolmogorov distance between F and the empirical
distribution from the sample. Repeat this for n = 5, 10, 25, 50, 100, 500,
and 1000 with b = 10,000 for each of the cases given below. Discuss the
behavior of the Kolmogorov distance and n becomes large in terms of the
result of Theorem 4.7 (Pólya). Be sure to notice the specific assumptions
that are a part of Theorem 4.7.
a. Fn corresponds to a Normal(n−1 , 1 + n−1 ) distribution and F corre-
sponds to a Normal(0, 1) distribution.
b. Fn corresponds to a Gamma(1 + n−1 , 1 + n−1 ) distribution and F cor-
responds to a Gamma(1, 1) distribution.
c. Fn corresponds to a Binomial(n, n−1 ) distribution and F corresponds
to a Poisson(1) distribution.
4. Write a program in R that generates b samples of size n from a N(0, Σn )
distribution where
1 + n−1 n−1
 
Σn = .
n−1 1 + n−1
Transform each of the b samples using the bivariate transformation
g(x1 , x2 ) = 21 x1 + 14 x2 ,
and produce a histogram of the resulting transformed values. Run this
simulation for n = 10, 25, 50, and 100 with b = 10,000 and discuss how this
histogram compares to what would be expected for large n as regulated by
the underlying theory.
5. Write a program in R that generates b samples of size n from a specified
distribution F . For each sample compute the statistic Zn = n1/2 σ −1 (X̄n −
228 CONVERGENCE OF DISTRIBUTIONS
µ) where µ and σ correspond to the mean and standard deviation of the
specified distribution F . Produce of histogram of the b observed values of
Zn . Run this simulation for n = 10, 25, 50, and 100 with b = 10,000 for each
of the distributions listed below and discuss how these histograms compare
to what would be expected for large n as regulated by the underlying theory
given by Theorem 4.20.

a. F corresponds to a N(0, 1) distribution.


b. F corresponds to an Exponential(1) distribution.
c. F corresponds to a Gamma(2, 2) distribution.
d. F corresponds to a Uniform(0, 1) distribution.
e. F corresponds to a Binomial(10, 12 ) distribution.
1
f. F corresponds to a Binomial(10, 10 ) distribution.
6. Write a program in R that simulates b samples of size n from a distribution
that has distribution function


 0 x<0
1x

0 ≤ x < 21
F (x) = 12
 (3x − 1) 12 ≤ x < 1
2


1 x ≥ 1.

For each sample, compute the sample quantile ξˆ1/4 . When the b samples
have been simulated, a histogram of the b sample values of ξˆ1/4 should be
produced. Run this simulation for n = 5, 10, 25, 50, and 100 with b = 10,000
and discuss the resulting histograms in terms of the theoretical result of
Example 4.42.
CHAPTER 5

Convergence of Moments

And at such a moment, his solitary supporter, Karl comes along wanting to give
him a piece of advice, but instead only shows that all is lost.
Amerika by Franz Kafka

5.1 Convergence in rth Mean

Consider an arbitrary probability measure space (Ω, F, P ) and let Xr be the


collection of all possible random variables X that map Ω to R subject to the
restriction that E(|X|r ) < ∞. One can note that Xr is a vector space if we
introduce the operators ⊕ and ⊗ such that X ⊕Y is the function X(ω)+Y (ω)
for all ω ∈ Ω when X ∈ Xr and Y ∈ Xr , and for a scalar a ∈ R, a ⊗ X is the
function aX(ω) for all ω ∈ Ω. See Chapter 1 of Halmos (1958) and Exercise
1. By defining a function ||X||r for X ∈ Xr as
Z 1/r
||X||r = |X(ω)|r dP (ω) = [E(|X|r )]1/r ,

it can be shown that Xr is a normed vector space. See Chapter 3 of Halmos


(1958) and Exercise 2. In such a normed vector space we are able to define the
distance between two elements as dr (X, Y ) = ||X − Y ||r = [E(|X − Y |r )]1/r .
This development suggests defining a mode of convergence for random vari-
ables by requiring that the distance between a sequence of random variables
and the limiting random variable converge to zero.
Definition 5.1. Let {Xn }∞n=1 be a sequence of random variables. Then Xn
converges to a random variable X in rth mean for a specified r > 0 if
lim E(|Xn − X|r ) = 0.
n→∞
r
This type of convergence will be represented as Xn −
→ X as n → ∞.

Therefore, convergence in rth mean is equivalent to requiring that the dis-


tance dr (Xn , X) converge to zero as n → ∞. A different interpretation of this
mode of convergence is based on interpreting the expectation from a statistical
viewpoint. That is, Xn converges in rth mean to X if the expected absolute
difference between the two random variables converges to zero as n → ∞.

229
230 CONVERGENCE OF MOMENTS

Figure 5.1 Convergence in rth mean when r = 1 and P is a uniform probability


measure. The criteria that E(|Xn − X|) converges to zero as n → ∞ implies that
the area between X and Xn , as indicated by the shaded area between the two random
variables on the plot, converges to zero as n → ∞.

Xn

When r = 1 the absolute difference between Xn and X corresponds to the


area between Xn and X, weighted with respect to the probability measure P .
r=1
Hence Xn −−→ X as n → ∞ when the area between Xn and X, weighted
by the probability measure P , converges to zero as n → ∞. When P is a
uniform probability measure, this area is the usual geometric area between
Xn and X. See Figure 5.1. When r = 2 then convergence in rth mean is
usually called convergence in quadratic mean. We will represent this case as
qm
Xn −−→ X as n → ∞. When r = 1 then convergence in rth mean is usually
am
called convergence in absolute mean. We will represent this case as Xn −−→ X
as n → ∞.
Example 5.1. Consider the probability space (Ω, F, P ) where Ω = [0, 1],
F = B{[0, 1]} and P is a uniform probability measure on [0, 1]. Define a
random variable X as X(ω) = δ{ω; [0, 1]} and a sequence of random variables
{Xn }∞
n=1 as Xn (ω) = n
−1
(n − 1 + ω) for all ω ∈ [0, 1] and n ∈ N. Then
−1
E(|Xn − X|) = (2n) and therefore
lim E(|Xn − X|) = lim (2n)−1 = 0.
n→∞ n→∞
am
See Figure 5.2. Hence, Definition 5.1 implies that Xn −−→ X as n → ∞. 
Example 5.2. Suppose that {Xn }∞
n=1 is a sequence of independent random
CONVERGENCE IN RTH MEAN 231

Figure 5.2 Convergence in rth mean when r = 1 in Example 5.1 with X(ω) =
δ{ω; [0, 1]} and Xn (ω) = n−1 (n − 1 + ω) for all ω ∈ [0, 1] and n ∈ N. In this case
E(|Xn − X|) = (2n)−1 corresponds to the triangular area between Xn and X. The
area for n = 3 is shaded on the plot.
1.0

X
0.8

X3
0.6

X2
y

0.4
0.2
0.0

X1

0.0 0.2 0.4 0.6 0.8 1.0

variables from a common distribution that has mean µ and variance σ 2 < ∞.
Let X̄n be the sample mean. Then,

lim E(|X̄n − µ|2 ) = lim V (X̄n ) = lim n−1 σ 2 = 0.


n→∞ n→∞ n→∞

qm
Therefore, Definition 5.1 implies that X̄n −−→ µ as n → ∞. Are we able to
r=4
also conclude that X̄n −−→ µ as n → ∞? This would require that

lim E(|X̄n − µ|4 ) = 0.


n→∞

Note that

E(|X̄n − µ|4 ) = E(X̄n4 ) − 4µE(X̄n3 ) + 6µ2 E(X̄n2 ) − 4µ3 E(X̄n ) + µ4 .

This expectation can be computed explicitly, though it is not necessary to do


so here to see that E(|X̄n − µ|4 ) depends on E(Xn4 ), which is not guaranteed
to be finite by the assumption σ 2 < ∞. Therefore, we cannot conclude that
232 CONVERGENCE OF MOMENTS
r=4
X̄n −−→ µ as n → ∞ without further assumptions on the moments of the
distribution of Xn . 
Example 5.3. Let {Bn }∞ n=1 be a sequence of independent random variables
where Bn has a Bernoulli(θ) distribution for all n ∈ N. Let B̄n be the sample
mean computed on B1 , . . . , Bn , which in this case corresponds to the sample
proportion, which is an unbiased estimator of θ. Are we able to conclude that
qm
n1/2 {log[log(n)]}−1/2 (B̄n − θ)} −−→ 0 as n → ∞? To evaluate this possibility
we compute
 2  n
E n1/2 {log[log(n)]}−1/2 (B̄n − θ) − 0 = E[(B̄n − θ)2 ]

log[log(n)]
n
= V (B̄n )
log[log(n)]
nθ(1 − θ)
=
n log[log(n)]
θ(1 − θ)
= .
log[log(n)]
Now, note that
θ(1 − θ)
lim = 0,
n→∞ log[log(n)]
qm
so that Definition 5.1 implies that n1/2 {log[log(n)]}−1/2 (B̄n − θ)} −−→ 0 as
n → ∞. 

The following result establishes a hierarchy in the convergence of means in


that convergence of higher order means implies the convergence of lower order
means.
Theorem 5.1. Let {Xn }∞ n=1 be a sequence of random variables that converge
in rth mean to a random variable X as n → ∞. Suppose that s is a real
s
number such that 0 < s < r. Then Xn −→ X as n → ∞.
r
Proof. Suppose that Xn −
→ X as n → ∞. Definition 5.1 implies that
lim E(|Xn − X|r ) = 0.
n→∞

We note that, as in Example 2.5, that xr/s is a convex function so that The-
orem 2.11 (Jensen) implies that
lim [E(|Xn − X|s )]r/s ≤ lim E[(|Xn − X|s )r/s ] = lim E(|Xn − X|r ) = 0,
n→∞ n→∞ n→∞

which implies that


lim E(|Xn − X|s ) = 0.
n→∞
s
Definition 5.1 then implies that Xn −
→ X as n → ∞.
Example 5.4. Suppose that {Xn }∞
n=1 is a sequence of independent random
CONVERGENCE IN RTH MEAN 233
variables from a common distribution that has mean µ and variance σ 2 , such
that E(|Xn |4 ) < ∞. Recall from Example 5.2 that

E(|X̄n − µ|4 ) = E(X̄n4 ) − 4µE(X̄n3 ) + 6µ2 E(X̄n2 ) − 3µ4 . (5.1)

Calculations in Example 5.2 also show that E(X̄n2 ) = n−1 σ 2 + µ2 . To obtain


E(X̄n3 ) we first note that
 !3 
n
X
E(X̄n3 ) = E  n−1 Xk 
k=1
 
n X
X n X
n
= E n−3 Xi Xj Xk 
i=1 j=1 k=1
n X
X n X
n
= n−3 E(Xi Xj Xk ). (5.2)
i=1 j=1 k=1

The sum in Equation 5.2 has n3 terms in all, which can be partitioned as
follows. There are n terms where i = j = k for which the expectation in
the sum has the formE(Xi3 ) = γ, where γ will be used to denote the third
moment. There are 32 n(n − 1) terms where two of the indices are the same,
while the other is different. For these terms the expectation has the form
E(Xi2 )E(Xj ) = µ3 + µσ 2 . Finally, there are n(n − 1)(n − 2) terms where
i, j, and k are all unequal. In these cases the expectation has the form
E(Xi )E(Xj )E(Xk ) = µ3 . Therefore, it follows that

E(X̄n3 ) = n−3 [nγ + 3n(n − 1)(µ3 + µσ 2 ) + n(n − 1)(n − 2)µ3 ]


= A(n) + n−2 (n − 1)(n − 2)µ3 ,

where A(n) = O(n−1 ) as n → ∞. One can find E(X̄n4 ) using the same basic
approach. That is
n X
X n X
n X
n
E(X̄n4 ) = n−4 E(Xi Xj Xk Xm ). (5.3)
i=1 j=1 k=1 m=1

An analysis of the number of terms of each type in Equation (5.3) yields

E(X̄n4 ) = n−4 [nλ + 4n(n − 1)µγ + 6n(n − 1)(n − 2)µ2 (µ2 + σ 2 )


+n(n − 1)(n − 2)(n − 3)µ4 ]
= B(n) + n−3 (n − 1)(n − 2)(n − 3)µ4 ,

where λ = E(Xi4 ) and B(n) = O(n−1 ) as n → ∞. See Exercise 5. Substituting


the results of Equation (5.2) and Equation (5.3) into Equation (5.1) implies
234 CONVERGENCE OF MOMENTS
that
\begin{align*}
E(|\bar{X}_n - \mu|^4) &= B(n) + n^{-3}(n-1)(n-2)(n-3)\mu^4 - 4\mu[A(n) + n^{-2}(n-1)(n-2)\mu^3] + 6\mu^2[n^{-1}\sigma^2 + \mu^2] - 3\mu^4 \\
&= C(n) + \mu^4[n^{-3}(n-1)(n-2)(n-3) - 4n^{-2}(n-1)(n-2) + 3],
\end{align*}
where $C(n) = O(n^{-1})$ as $n \to \infty$. Note that
\[ \lim_{n\to\infty} n^{-3}(n-1)(n-2)(n-3) - 4n^{-2}(n-1)(n-2) = -3, \]
so that it follows that
\[ \lim_{n\to\infty} E(|\bar{X}_n - \mu|^4) = 0, \]
and therefore Definition 5.1 implies that $\bar{X}_n$ converges in fourth mean ($r = 4$) to $\mu$ as $n \to \infty$. Theorem 5.1 also allows us to conclude that $\bar{X}_n$ converges to $\mu$ in third mean ($r = 3$), in quadratic mean, and in absolute mean as $n \to \infty$. Note of course that the converse is not true. If $E(|X_n|^4) = \infty$ but $E(|X_n|^2) < \infty$, it still follows that $\bar{X}_n$ converges in quadratic mean and in absolute mean to $\mu$ as $n \to \infty$, but the conclusion that $\bar{X}_n$ converges in fourth mean to $\mu$ is no longer valid. □
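The decay of the fourth central moment of the sample mean in Example 5.4 can be illustrated numerically. The R sketch below is not from the text; it uses Exponential(1) random variables (for which all moments are finite and µ = 1) and Monte Carlo sample sizes chosen only for illustration.

set.seed(200)
for (n in c(5, 25, 125, 625)) {
  # simulate the sample mean repeatedly and estimate E|Xbar - mu|^4
  xbar <- replicate(5000, mean(rexp(n, rate = 1)))
  cat(sprintf("n = %4d  estimated E|Xbar - 1|^4 = %.6f\n",
              n, mean(abs(xbar - 1)^4)))
}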

The dependence of the definition of rth mean convergence on the concept of expectation makes it difficult to determine how this mode of convergence fits in with the other modes of convergence studied up to this point. We begin to determine these relationships by considering convergence in probability.
Theorem 5.2. Let $\{X_n\}_{n=1}^{\infty}$ be a sequence of random variables that converge in $r$th mean to a random variable $X$ as $n \to \infty$ for some $r > 0$. Then $X_n$ converges in probability to $X$ as $n \to \infty$.
Proof. Let $\varepsilon > 0$ and suppose that $X_n$ converges in $r$th mean to $X$ as $n \to \infty$ for some $r > 0$. Theorem 2.6 (Markov) implies that
\[ \lim_{n\to\infty} P(|X_n - X| > \varepsilon) \le \lim_{n\to\infty} \varepsilon^{-r}E(|X_n - X|^r) = 0. \]
Definition 3.1 then implies that $X_n$ converges in probability to $X$ as $n \to \infty$.
r
Theorem 5.2, combined with Theorem 4.8, also implies that if $X_n$ converges in $r$th mean to $X$ as $n \to \infty$, then $X_n$ converges in distribution to $X$ as $n \to \infty$. Convergence in probability and convergence in $r$th mean are not equivalent, as there are sequences that converge in probability but do not converge in $r$th mean.
Example 5.5. Let $\{X_n\}_{n=1}^{\infty}$ be a sequence of independent random variables where $X_n$ has probability distribution function
\[ f_n(x) = \begin{cases} 1 - n^{-\alpha} & x = 0 \\ n^{-\alpha} & x = n \\ 0 & \text{otherwise,} \end{cases} \]

where $\alpha > 0$. Let $\varepsilon > 0$; then it follows that
\[ \lim_{n\to\infty} P(|X_n - 0| > \varepsilon) \le \lim_{n\to\infty} n^{-\alpha} = 0, \]
as long as $\alpha > 0$. Therefore, $X_n$ converges in probability to 0 as $n \to \infty$. However, it does not necessarily follow that $X_n$ converges in $r$th mean to 0 as $n \to \infty$. To see this note that
\[ E(|X_n - 0|^r) = E(|X_n|^r) = (0)(1 - n^{-\alpha}) + (n^r)(n^{-\alpha}) = n^{r-\alpha}. \]
If $\alpha > r$ then
\[ \lim_{n\to\infty} E(|X_n|^r) = 0, \]
and it follows from Definition 5.1 that $X_n$ converges in $r$th mean to 0 as $n \to \infty$. However, if $\alpha \le r$ then
\[ \lim_{n\to\infty} E(|X_n|^r) = \begin{cases} 1 & \alpha = r, \\ \infty & \alpha < r. \end{cases} \]
In either case $X_n$ does not converge in $r$th mean to 0 as $n \to \infty$. □
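The dichotomy in Example 5.5 can be read off directly from the exact moment formula. The short R sketch below is purely illustrative (the grid of n and α values is an arbitrary choice); it tabulates the point mass P(Xn = n) = n^(−α), which vanishes for every α > 0, next to the first absolute moment E(|Xn|) = n^(1−α), which vanishes only when α > 1.

n <- c(10, 100, 1000, 10000)
for (alpha in c(0.5, 1, 2)) {
  cat(sprintf("alpha = %.1f  P(X_n = n): %s   E|X_n|: %s\n",
              alpha,
              paste(signif(n^(-alpha), 3), collapse = " "),
              paste(signif(n^(1 - alpha), 3), collapse = " ")))
}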

Since convergence in probability does not always imply convergence in rth mean, it is of interest to determine under what conditions such an implication might follow. The following result gives one such condition.
Theorem 5.3. Suppose that $\{X_n\}_{n=1}^{\infty}$ is a sequence of random variables that converge in probability to a random variable $X$ as $n \to \infty$. Suppose further that there is a random variable $Y$ such that $P(|X_n| \le |Y|) = 1$ for all $n \in \mathbb{N}$ and $E(|Y|^r) < \infty$. Then $X_n$ converges in $r$th mean to $X$ as $n \to \infty$.

Proof. We begin proving this result by first showing that under the stated
conditions it follows that P(|X| ≤ |Y|) = 1. To show this let δ > 0. Since
P (|Xn | ≤ |Y |) = 1 for all n ∈ N it follows that P (|X| > |Y | + δ) ≤ P (|X| >
|Xn | + δ) = P (|X| − |Xn | > δ) for all n ∈ N. Now Theorem A.18 implies that
|X|−|Xn | ≤ |Xn −X| which implies that P (|X| > |Y |+δ) ≤ P (|Xn −X| > δ),
for all n ∈ N. Therefore,
P (|X| > |Y | + δ) ≤ lim P (|Xn − X| > δ) = 0,
n→∞

where the limiting value follows from Definition 3.1 because we have assumed
p
that Xn − → X as n → ∞. Therefore, we can conclude that P (|X| > |Y |+δ) = 0
for all δ > 0 and hence it follows that P (|X| > |Y |) = 0. Therefore, P (|X| ≤
|Y |) = 1. Now we can begin to work on the problem of interest. Theorem A.18
implies that P (|Xn − X| ≤ |Xn | + |X|) = 1 for all n ∈ N. We have assumed
that P (|Xn | ≤ |Y |) = 1 and have established in the arguments above that
P (|X| ≤ |Y |) = 1, hence it follows that P (|Xn − X| ≤ |2Y |) = 1. The
assumption that E(|Y |r ) < ∞ implies that
\[ \lim_{b\to\infty} E(|Y|^r\delta\{|Y|; (b, \infty)\}) = \lim_{b\to\infty} \int_b^\infty |y|^r \, dF(y) = 0, \]

where $F$ is taken to be the distribution of $Y$. Therefore, there exists a constant $A_\varepsilon \in \mathbb{R}$ such that $A_\varepsilon > \varepsilon$ and
\[ E(|Y|^r\delta\{|2Y|; (A_\varepsilon, \infty)\}) \le \varepsilon. \]
Hence,
\begin{align*}
E(|X_n - X|^r) &= E(|X_n - X|^r\delta\{|X_n - X|; (A_\varepsilon, \infty)\}) + E(|X_n - X|^r\delta\{|X_n - X|; [0, \varepsilon]\}) \\
&\quad + E(|X_n - X|^r\delta\{|X_n - X|; (\varepsilon, A_\varepsilon]\}).
\end{align*}
We have established earlier that $P(|X_n - X| \le |2Y|) = 1$, so that it follows that
\[ E(|X_n - X|^r\delta\{|X_n - X|; (A_\varepsilon, \infty)\}) \le E(|2Y|^r\delta\{|2Y|; (A_\varepsilon, \infty)\}). \]
Also,
\[ E(|X_n - X|^r\delta\{|X_n - X|; [0, \varepsilon]\}) \le E(\varepsilon^r\delta\{|X_n - X|; [0, \varepsilon]\}) = \varepsilon^r P(|X_n - X| \le \varepsilon) \le \varepsilon^r. \]
For the remaining term in the sum we have that
\[ E(|X_n - X|^r\delta\{|X_n - X|; (\varepsilon, A_\varepsilon]\}) \le E(A_\varepsilon^r\delta\{|X_n - X|; (\varepsilon, A_\varepsilon]\}) = A_\varepsilon^r P[|X_n - X| \in (\varepsilon, A_\varepsilon]]. \]
Because $(\varepsilon, A_\varepsilon] \subset (\varepsilon, \infty)$, Theorem 2.3 implies that $A_\varepsilon^r P[|X_n - X| \in (\varepsilon, A_\varepsilon]] \le A_\varepsilon^r P(|X_n - X| > \varepsilon)$. Combining these results implies that
\[ E(|X_n - X|^r) \le E(|2Y|^r\delta\{|2Y|; (A_\varepsilon, \infty)\}) + \varepsilon^r + A_\varepsilon^r P(|X_n - X| > \varepsilon). \]
Now,
\[ E(|2Y|^r\delta\{|2Y|; (A_\varepsilon, \infty)\}) = 2^r E(|Y|^r\delta\{|2Y|; (A_\varepsilon, \infty)\}) \le 2^r\varepsilon. \]
p
Therefore E(|Xn − X|r ) ≤ 2r ε + εr + Arε P (|Xn − X| > ε). Since Xn −
→ X as
n → ∞ it follows that
lim P (|Xn − X| > ε) = 0.
n→∞

Hence
lim E(|Xn − X|r ) ≤ (2r + 1)εr ,
n→∞
for all ε > 0. Since ε is arbitrary, it follows that
lim E(|Xn − X|r ) = 0,
n→∞
r
and hence we have proven that Xn −
→ X as n → ∞.
The relationship between convergence in rth mean, complete convergence, and almost certain convergence is more difficult to establish, as one mode does not necessarily imply the others. For example, to ensure that a sequence that converges in rth mean also converges almost certainly, additional assumptions are required.
Theorem 5.4. Let $\{X_n\}_{n=1}^{\infty}$ be a sequence of random variables that converge in $r$th mean to a random variable $X$ as $n \to \infty$ for some $r > 0$. If
\[ \sum_{n=1}^{\infty} E(|X_n - X|^r) < \infty, \]
then $X_n$ converges almost certainly to $X$ as $n \to \infty$.

The proof of Theorem 5.4 is given in Exercise 6. The result of Theorem 5.4 can actually be strengthened to the conclusion that $X_n$ converges completely to $X$ as $n \to \infty$ without any change to the assumptions. Theorem 5.4 provides us with a condition under which convergence in rth mean implies complete convergence and almost certain convergence. Could almost certain convergence imply convergence in rth mean? The following example of Serfling (1980) shows that this is not always the case.
Example 5.6. Let $\{X_n\}_{n=1}^{\infty}$ be a sequence of independent and identically distributed random variables such that $E(X_n) = 0$ and $V(X_n) = 1$. Consider a new sequence of random variables defined by
\[ Y_n = \{n\log[\log(n)]\}^{-1/2}\sum_{k=1}^{n} X_k, \]
for all $n \in \mathbb{N}$. The same calculation as in Example 5.3 shows that $Y_n$ converges in quadratic mean to 0 as $n \to \infty$. Does it also follow that $Y_n$ converges almost certainly to 0? It does not, since Theorem 3.15 (Hartman and Wintner) and Definition 1.3 imply that
\[ P\left(\lim_{n\to\infty} \{n\log[\log(n)]\}^{-1/2}\sum_{k=1}^{n} X_k = 0\right) = 0. \]
Hence, $Y_n$ does not converge almost certainly to 0 as $n \to \infty$. □

5.2 Uniform Integrability

In the previous section we considered defining the convergence of a sequence of random variables in terms of expectation. We were able to derive a few
results which established the relationship between this type of convergence
and the types of convergence studied previously. In this section we consider
the converse problem. That is, suppose that {Xn }∞ n=1 is a sequence of ran-
dom variables that converge in some mode to a random variable X. Does it
necessarily follow that the moments of the random variable converge as well?
It is not surprising that when we consider the case where $X_n$ converges in rth mean to $X$, the answer is relatively straightforward. For example, when $X_n$ converges in absolute mean to $X$ as $n \to \infty$ we know that
\[ \lim_{n\to\infty} E(|X_n - X|) = 0. \]
Now, noting that $|E(X_n) - E(X)| = |E(X_n - X)| \le E(|X_n - X|)$, it follows that
\[ \lim_{n\to\infty} E(X_n) = E(X), \tag{5.4} \]
and the corresponding sequence of first moments converges. In fact, Theorem 5.1 implies that if $X_n$ converges in $r$th mean to $X$ as $n \to \infty$ for any $r > 1$, the result in Equation (5.4) follows. Can we find similar results for other modes of convergence? For example, suppose that $X_n$ converges in probability to $X$ as $n \to \infty$. Under what conditions can we conclude that at least some of the corresponding moments of $X_n$ converge to the moments of $X$? Example 5.5 and Theorem 5.3 provide important clues to solving this problem.
In Example 5.5 we consider a sequence of random variables that converge in
probability to a degenerate random variable at zero. Hence, the expectation
of the limiting distribution is also zero. However, the moments of the random
variables in the sequence do not always converge to zero. One example where
the convergence does not occur is when α = 1, in which case the distribution
of $X_n$ is given by
\[ f_n(x) = \begin{cases} 1 - n^{-1} & x = 0 \\ n^{-1} & x = n \\ 0 & \text{otherwise.} \end{cases} \]

What occurs with this distribution is that, as $n \to \infty$, the random variable can take on an arbitrarily large value ($n$). The probability associated with this value converges to zero, so that it follows that $X_n$ converges in probability to 0 as $n \to \infty$. However, this probability is not small enough for the expectation of $X_n$ to converge to zero as well. In fact, in the case where $\alpha = 1$, the probability associated with the point at $n$ is such that $E(X_n) = 1$ for all $n \in \mathbb{N}$. Hence the expectations do not converge to zero as $n \to \infty$.
Considering Theorem 5.3 with r = 1 provides a demonstration of a condition
under which the expectations do converge to the proper value when the se-
quence converges in probability. The key idea in this result is that all of the
random variables in the sequence must be absolutely bounded by an integrable
random variable Y . This condition does not allow the type of behavior that
is observed in Example 5.5. See Exercise 7. Hence, Theorem 5.3 provides a
sufficient condition under which convergence in probability implies that the
corresponding expectations of the sequence converge to the expectation of the
limiting random variable. The main question of interest is then in developing
a set of minimal conditions under which the result remains valid. Such devel-
opment requires a more refined view of the integrability of sequences, known
as uniform integrability.
Definition 5.2. Let {Xn }∞ n=1 be a sequence of random variables. The se-
quence is uniformly integrable if E(|Xn |δ{|Xn |; (a, ∞)}) converges to zero uni-
formly in n as a → ∞.

Note that Definition 5.2 requires more than just that each random variable in the sequence be integrable. Indeed, if this were all that was required, then $E(|X_n|\delta\{|X_n|; (a, \infty)\})$ would converge to zero as $a \to \infty$ for each fixed $n \in \mathbb{N}$. Rather, Definition 5.2 requires that this convergence be uniform in $n$, meaning that the rate at which $E(|X_n|\delta\{|X_n|; (a, \infty)\})$ converges to zero as $a \to \infty$ must be the same for all $n \in \mathbb{N}$; that is, the rate of convergence cannot depend on $n$.
Example 5.7. Let {Xn }∞ n=1 be a sequence of independent and identically dis-
tributed random variables from a distribution F with finite mean µ. That is
E(X) = µ < ∞. The fact that the mean is finite implies that the expectation
E(|Xn |δ{|Xn |; (a, ∞)}) converges to zero as a → ∞, and since the random
variables have identical distributions, the convergence rate for each expecta-
tion is exactly the same. Therefore, by Definition 5.2, the sequence {Xn }∞ n=1
is uniformly integrable. 
Example 5.8. Let $\{X_n\}_{n=1}^{\infty}$ be a sequence of independent N$(0, \sigma_n^2)$ random variables where $\{\sigma_n\}_{n=1}^{\infty}$ is a sequence of real numbers such that $0 < \sigma_n \le \sigma < \infty$ for all $n \in \mathbb{N}$. For this sequence of random variables it is true that $E(|X_n|) < \infty$ for all $n \in \mathbb{N}$, which ensures that $E(|X_n|\delta\{|X_n|; (a, \infty)\})$ converges to zero as $a \to \infty$. To have uniform integrability we must further establish that this convergence is uniform in $n \in \mathbb{N}$. To see this note that for every $n \in \mathbb{N}$, we have that
\begin{align*}
E(|X_n|\delta\{|X_n|; (a, \infty)\}) &= \int_{|x|>a} (2\pi\sigma_n^2)^{-1/2}|x|\exp(-\tfrac{1}{2}\sigma_n^{-2}x^2)\,dx \\
&= 2\int_a^\infty (2\pi\sigma_n^2)^{-1/2}x\exp(-\tfrac{1}{2}\sigma_n^{-2}x^2)\,dx \\
&= 2\sigma_n\int_{a\sigma_n^{-1}}^\infty (2\pi)^{-1/2}v\exp(-\tfrac{1}{2}v^2)\,dv.
\end{align*}
If $\sigma_n \le \sigma$ then $\sigma_n^{-1} \ge \sigma^{-1}$ and hence $a\sigma_n^{-1} \ge a\sigma^{-1}$, so that
\[ E(|X_n|\delta\{|X_n|; (a, \infty)\}) \le 2\sigma\int_{a\sigma^{-1}}^\infty (2\pi)^{-1/2}v\exp(-\tfrac{1}{2}v^2)\,dv < \infty, \]
for all $n \in \mathbb{N}$. Therefore, if we choose $0 < a_\varepsilon < \infty$ so that
\[ 2\sigma\int_{a_\varepsilon\sigma^{-1}}^\infty (2\pi)^{-1/2}v\exp(-\tfrac{1}{2}v^2)\,dv < \varepsilon, \]
then $E(|X_n|\delta\{|X_n|; (a_\varepsilon, \infty)\}) < \varepsilon$ for all $n \in \mathbb{N}$. Since $a_\varepsilon$ does not depend on $n$, the convergence of $E(|X_n|\delta\{|X_n|; (a, \infty)\})$ to zero is uniform in $n$, and hence Definition 5.2 implies that the sequence $\{X_n\}_{n=1}^{\infty}$ is uniformly integrable. □
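The uniform bound in Example 5.8 can be checked numerically. The R sketch below is an illustration only; the values of σn, the bound σ = 1, and the cutoff a are assumptions made for the example. It compares a numerical evaluation of E(|Xn|δ{|Xn|; (a, ∞)}) with the closed form σn(2/π)^(1/2) exp{−a²/(2σn²)}, which is increasing in σn, and with the uniform bound obtained by replacing σn with σ.

tail.expect <- function(a, sigma) sigma * sqrt(2 / pi) * exp(-a^2 / (2 * sigma^2))
sigma.n <- c(0.2, 0.5, 0.8, 1.0)   # assumed sequence, bounded above by sigma = 1
a <- 3
cbind(sigma.n,
      numeric.check = sapply(sigma.n, function(s)
        integrate(function(x) 2 * x * dnorm(x, sd = s), lower = a, upper = Inf)$value),
      closed.form = tail.expect(a, sigma.n),
      uniform.bound = tail.expect(a, 1))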

Example 5.9. Let $\{X_n\}_{n=1}^{\infty}$ be a sequence of independent random variables where $X_n$ has probability distribution function
\[ f_n(x) = \begin{cases} 1 - n^{-\alpha} & x = 0 \\ n^{-\alpha} & x = n \\ 0 & \text{otherwise,} \end{cases} \]
where $0 < \alpha \le 1$. Note that the expected value of $X_n$ is given by $E(X_n) = n^{1-\alpha}$, which is finite for each $n \in \mathbb{N}$. Now, for a fixed value of $n \in \mathbb{N}$ we have that for $a > 0$,
\[ E(|X_n|\delta\{|X_n|; (a, \infty)\}) = \begin{cases} n^{1-\alpha} & a < n, \\ 0 & a \ge n. \end{cases} \]
Let $0 < \varepsilon < 1$ be given. Then $E(|X_n|\delta\{|X_n|; (a, \infty)\}) < \varepsilon$ only when $a \ge n$, since $n^{1-\alpha} \ge 1 > \varepsilon$ for every $n \in \mathbb{N}$ when $\alpha \le 1$. This makes it clear that the convergence of $E(|X_n|\delta\{|X_n|; (a, \infty)\})$ to zero is not uniform in this case, since the value of $a$ required to obtain $E(|X_n|\delta\{|X_n|; (a, \infty)\}) < \varepsilon$ is directly related to $n$. The only way to obtain a value of $a$ that would ensure $E(|X_n|\delta\{|X_n|; (a, \infty)\}) < \varepsilon$ for all $n \in \mathbb{N}$ would be to let $a \to \infty$. □

One fact that follows from the uniform integrability of a sequence of random variables is that the expectations in the sequence must remain bounded, in the sense that their least upper bound is finite.
Theorem 5.5. Let {Xn }∞
n=1 be a sequence of uniformly integrable random
variables. Then,
sup E(|Xn |) < ∞.
n∈N

Proof. Note that $E(|X_n|) = E(|X_n|\delta\{|X_n|; [0, a]\}) + E(|X_n|\delta\{|X_n|; (a, \infty)\})$. We can bound the first term as
\[ E(|X_n|\delta\{|X_n|; [0, a]\}) = \int_0^a |x|\,dF_n(x) \le \int_0^a a\,dF_n(x) = a\int_0^a dF_n(x) \le a, \]
where $F_n$ is the distribution function of $X_n$ for all $n \in \mathbb{N}$. For the second term we note from Definition 5.2 that we can choose $a < \infty$ large enough so that $E(|X_n|\delta\{|X_n|; (a, \infty)\}) \le 1$. The uniformity of the convergence implies that a single choice of $a$ will suffice for all $n \in \mathbb{N}$. Hence, it follows that for $a$ large enough, $E(|X_n|) \le a + 1$ for all $n \in \mathbb{N}$. Therefore,
\[ \sup_{n\in\mathbb{N}} E(|X_n|) \le a + 1 < \infty. \]

Uniform integrability can be somewhat difficult to verify in practice, as the uniformity of the convergence of $E(|X_n|\delta\{|X_n|; (a, \infty)\})$ to zero is not always easy to establish. There are several equivalent ways to characterize uniform integrability, though many of these conditions are technical in nature and are not always easier to apply than Definition 5.2. Therefore we will leave these equivalent conditions to the reader for further study. See Gut (2005). There are, however, several sufficient conditions for uniform integrability that are relatively simple to apply, and we present these next.
Theorem 5.6. Let $\{X_n\}_{n=1}^{\infty}$ be a sequence of random variables such that
\[ \sup_{n\in\mathbb{N}} E(|X_n|^{1+\varepsilon}) < \infty, \]
for some $\varepsilon > 0$. Then $\{X_n\}_{n=1}^{\infty}$ is uniformly integrable.
Proof. Let $F_n$ be the distribution function of $X_n$ for all $n \in \mathbb{N}$, and let $\varepsilon > 0$ be as in the statement of the theorem. Then
\begin{align*}
E(|X_n|\delta\{|X_n|; (a, \infty)\}) &= E(|X_n|^{1+\varepsilon}|X_n|^{-\varepsilon}\delta\{|X_n|; (a, \infty)\}) \\
&\le E(|X_n|^{1+\varepsilon}a^{-\varepsilon}\delta\{|X_n|; (a, \infty)\}) \\
&= a^{-\varepsilon}E(|X_n|^{1+\varepsilon}\delta\{|X_n|; (a, \infty)\}) \\
&\le a^{-\varepsilon}E(|X_n|^{1+\varepsilon}) \\
&\le a^{-\varepsilon}\sup_{n\in\mathbb{N}} E(|X_n|^{1+\varepsilon}). \tag{5.5}
\end{align*}
If $\varepsilon > 0$ and
\[ \sup_{n\in\mathbb{N}} E(|X_n|^{1+\varepsilon}) < \infty, \]
then the upper bound in Equation (5.5) will converge to zero as $a \to \infty$. The convergence is guaranteed to be uniform because the upper bound does not depend on $n$.
Example 5.10. Suppose that $\{X_n\}_{n=1}^{\infty}$ is a sequence of independent random variables where $X_n$ has a Gamma$(\alpha_n, \beta_n)$ distribution for all $n \in \mathbb{N}$, where $\{\alpha_n\}_{n=1}^{\infty}$ and $\{\beta_n\}_{n=1}^{\infty}$ are real sequences. Suppose that $\alpha_n \le \alpha < \infty$ and $\beta_n \le \beta < \infty$ for all $n \in \mathbb{N}$. Then $E(|X_n|^2) = \alpha_n\beta_n^2 + \alpha_n^2\beta_n^2 \le \alpha\beta^2 + \alpha^2\beta^2 < \infty$ for all $n \in \mathbb{N}$, so that
\[ \sup_{n\in\mathbb{N}} E(|X_n|^2) \le \alpha\beta^2 + \alpha^2\beta^2 < \infty, \]
and Theorem 5.6 implies that the sequence $\{X_n\}_{n=1}^{\infty}$ is uniformly integrable. □
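A quick numerical sketch of Example 5.10 follows; the particular bounded parameter sequences αn and βn below are arbitrary choices used only for illustration. It confirms that the second moments αnβn² + αn²βn² never exceed the bound αβ² + α²β² used with Theorem 5.6.

n <- 1:20
alpha.n <- 2 + 1 / n           # assumed sequence, bounded above by alpha = 3
beta.n <- 1 + exp(-n)          # assumed sequence, bounded above by beta = 2
second.moment <- alpha.n * beta.n^2 + alpha.n^2 * beta.n^2
c(max.second.moment = max(second.moment),
  bound = 3 * 2^2 + 3^2 * 2^2)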

A similar result requires that the sequence of random variables be bounded almost certainly by an integrable random variable.
Theorem 5.7. Let {Xn }∞ n=1 be a sequence of random variables such that
P (|Xn | ≤ Y ) = 1 for all n ∈ N where Y is a positive integrable random
variable. Then the sequence {Xn }∞
n=1 is uniformly integrable.

Theorem 5.7 is proven in Exercise 8. The need for an integrable random variable that bounds the sequence can be eliminated by replacing the random variable $Y$ in Theorem 5.7 with the supremum of the sequence $\{X_n\}_{n=1}^{\infty}$.
Corollary 5.1. Let $\{X_n\}_{n=1}^{\infty}$ be a sequence of random variables such that
\[ E\left(\sup_{n\in\mathbb{N}} |X_n|\right) < \infty. \]
Then the sequence $\{X_n\}_{n=1}^{\infty}$ is uniformly integrable.

Corollary 5.1 is proven in Exercise 9. The next result shows that it is also sufficient for a sequence to be bounded by a uniformly integrable sequence in order to conclude that the bounded sequence is itself uniformly integrable.
Theorem 5.8. Let {Xn }∞ ∞
n=1 be a sequence of random variables and {Yn }n=1
be a sequence of positive integrable random variables such that P (|Xn | ≤ Yn ) =
1 for all n ∈ N. If the sequence {Yn }∞ n=1 is uniformly integrable then the
sequence {Xn }∞n=1 is uniformly integrable.

Proof. Because $P(|X_n| \le Y_n) = 1$ for all $n \in \mathbb{N}$, it follows that $E(|X_n|\delta\{|X_n|; (a, \infty)\}) \le E(Y_n\delta\{Y_n; (a, \infty)\})$. For every $\varepsilon > 0$ there exists an $a_\varepsilon$ such that $E(Y_n\delta\{Y_n; (a_\varepsilon, \infty)\}) < \varepsilon$ uniformly in $n$, since the sequence $\{Y_n\}_{n=1}^{\infty}$ is uniformly integrable. This value of $a_\varepsilon$ also ensures that $E(|X_n|\delta\{|X_n|; (a_\varepsilon, \infty)\}) < \varepsilon$ uniformly in $n \in \mathbb{N}$. Therefore, Definition 5.2 implies that the sequence $\{X_n\}_{n=1}^{\infty}$ is uniformly integrable.

Example 5.11. Let $\{Y_n\}_{n=1}^{\infty}$ be a sequence of independent random variables where $Y_n$ has a N$(0, 1)$ distribution for all $n \in \mathbb{N}$. Define a new sequence $\{X_n\}_{n=1}^{\infty}$ where $X_n = \min\{|Y_1|, \ldots, |Y_n|\}$ for all $n \in \mathbb{N}$. Since $P(|X_n| \le |Y_n|) = 1$ for all $n \in \mathbb{N}$, the uniform integrability of the sequence $\{X_n\}_{n=1}^{\infty}$ follows from the uniform integrability of the sequence $\{|Y_n|\}_{n=1}^{\infty}$ by Theorem 5.8. □

The final result of this section links the uniform integrability of the sum of
sequences of random variables to the uniform integrability of the individual
sequences.
Theorem 5.9. Let {Xn }∞ ∞
n=1 and {Yn }n=1 be sequences of uniformly inte-
grable random variables. Then the sequence {Xn + Yn }∞
n=1 is uniformly inte-
grable.

Proof. From Definition 5.2 we know that to show that the sequence $\{X_n + Y_n\}_{n=1}^{\infty}$ is uniformly integrable, we must show that
\[ \lim_{a\to\infty} E(|X_n + Y_n|\delta\{|X_n + Y_n|; (a, \infty)\}) = 0, \]

uniformly in n. To accomplish this we will bound the expectation by a sum of


expectations depending on the individual sequences {Xn }∞ ∞
n=1 and {Yn }n=1 .
Let a be a positive real number. Theorem A.19 implies that |Xn + Yn | ≤
2 max{|Xn |, |Yn |}, and hence
δ{|Xn + Yn |; (a, ∞)} ≤ δ{2 max{|Xn |, |Yn |}; (a, ∞)},
with probability one. Therefore,
\[ |X_n + Y_n|\delta\{|X_n + Y_n|; (a, \infty)\} \le 2\max\{|X_n|, |Y_n|\}\delta\{2\max\{|X_n|, |Y_n|\}; (a, \infty)\}, \]
with probability one. Now, it can be shown that
\[ 2\max\{|X_n|, |Y_n|\}\delta\{2\max\{|X_n|, |Y_n|\}; (a, \infty)\} \le 2|X_n|\delta\{|X_n|; (\tfrac{1}{2}a, \infty)\} + 2|Y_n|\delta\{|Y_n|; (\tfrac{1}{2}a, \infty)\}. \]
See Exercise 10. Therefore, we have proven that
\[ |X_n + Y_n|\delta\{|X_n + Y_n|; (a, \infty)\} \le 2|X_n|\delta\{|X_n|; (\tfrac{1}{2}a, \infty)\} + 2|Y_n|\delta\{|Y_n|; (\tfrac{1}{2}a, \infty)\}, \]
with probability one. Theorem A.16 then implies that
\[ E(|X_n + Y_n|\delta\{|X_n + Y_n|; (a, \infty)\}) \le 2E(|X_n|\delta\{|X_n|; (\tfrac{1}{2}a, \infty)\}) + 2E(|Y_n|\delta\{|Y_n|; (\tfrac{1}{2}a, \infty)\}), \]
and hence,
\[ \lim_{a\to\infty} E(|X_n + Y_n|\delta\{|X_n + Y_n|; (a, \infty)\}) \le \lim_{a\to\infty} 2E(|X_n|\delta\{|X_n|; (\tfrac{1}{2}a, \infty)\}) + \lim_{a\to\infty} 2E(|Y_n|\delta\{|Y_n|; (\tfrac{1}{2}a, \infty)\}). \tag{5.6} \]
Now we use the fact that the individual sequences $\{X_n\}_{n=1}^{\infty}$ and $\{Y_n\}_{n=1}^{\infty}$ are
uniformly integrable. This fact implies that
\[ \lim_{a\to\infty} 2E(|X_n|\delta\{|X_n|; (\tfrac{1}{2}a, \infty)\}) = 0, \]
and
\[ \lim_{a\to\infty} 2E(|Y_n|\delta\{|Y_n|; (\tfrac{1}{2}a, \infty)\}) = 0, \]
and that in both cases the convergence is uniform in $n$. This means that for any $\varepsilon > 0$ there exist $b_\varepsilon$ and $c_\varepsilon$ that do not depend on $n$ such that
\[ 2E(|X_n|\delta\{|X_n|; (\tfrac{1}{2}a, \infty)\}) < \tfrac{1}{2}\varepsilon, \]
for all $a > b_\varepsilon$ and
\[ 2E(|Y_n|\delta\{|Y_n|; (\tfrac{1}{2}a, \infty)\}) < \tfrac{1}{2}\varepsilon, \]
for all $a > c_\varepsilon$. Let $a_\varepsilon = \max\{b_\varepsilon, c_\varepsilon\}$ and note that $a_\varepsilon$ does not depend on $n$. It then follows from the bound preceding Equation (5.6) that
\[ E(|X_n + Y_n|\delta\{|X_n + Y_n|; (a, \infty)\}) \le \varepsilon, \]
for all $a > a_\varepsilon$, uniformly in $n \in \mathbb{N}$. Definition 5.2 then implies that the sequence $\{X_n + Y_n\}_{n=1}^{\infty}$ is uniformly integrable.

5.3 Convergence of Moments

In this section we will explore the relationship between the convergence of a sequence of random variables and the convergence of the corresponding moments. As we shall observe, the uniform integrability of the sequence is the key idea in establishing this correspondence. We will first develop the results for almost certain convergence in some detail, and then point out parallel results for convergence in probability, whose development is quite similar. We will end the section by addressing weak convergence. To develop the results for the case of almost certain convergence we need a preliminary result that is another version of Fatou's Lemma.
Theorem 5.10 (Fatou). Let $\{X_n\}_{n=1}^{\infty}$ be a sequence of random variables that converge almost certainly to a random variable $X$ as $n \to \infty$. Then,
\[ E(|X|) \le \liminf_{n\to\infty} E(|X_n|). \]
Proof. Consider the sequence of random variables given by
\[ \left\{\inf_{k\ge n} |X_k|\right\}_{n=1}^{\infty}, \]
and note that this sequence is monotonically increasing. Therefore, it follows from Theorem 1.12 (Lebesgue) that
\[ \lim_{n\to\infty} E\left(\inf_{k\ge n} |X_k|\right) = E\left(\lim_{n\to\infty}\inf_{k\ge n} |X_k|\right) = E(|X|), \]
where the second equality follows from the fact that $X_n$ converges almost certainly to $X$ as $n \to \infty$. Note further that
\[ \inf_{k\ge n} |X_k| \le |X_n|, \]
for all $n \in \mathbb{N}$. Therefore, it follows from Theorem A.16 that
\[ E\left(\inf_{k\ge n} |X_k|\right) \le E(|X_n|), \]
for all $n \in \mathbb{N}$, and hence
\[ E(|X|) = \lim_{n\to\infty} E\left(\inf_{k\ge n} |X_k|\right) = \liminf_{n\to\infty} E\left(\inf_{k\ge n} |X_k|\right) \le \liminf_{n\to\infty} E(|X_n|). \]
Note that the second equality follows from Definition 1.3 because we have proven above that the limit in the second term exists and equals $E(|X|)$.

We have now developed sufficient theory to prove the main result, which equates uniform integrability with the convergence of moments.
Theorem 5.11. Let $\{X_n\}_{n=1}^{\infty}$ be a sequence of random variables that converge almost certainly to a random variable $X$ as $n \to \infty$. Let $r > 0$. Then
\[ \lim_{n\to\infty} E(|X_n|^r) = E(|X|^r) \]
if and only if $\{|X_n|^r\}_{n=1}^{\infty}$ is uniformly integrable.

Proof. We begin by assuming that the sequence $\{|X_n|^r\}_{n=1}^{\infty}$ is uniformly integrable. Then, Theorem 5.5 implies that
\[ \sup_{n\in\mathbb{N}} E(|X_n|^r) < \infty, \]
and Theorem 5.10 implies that
\[ E(|X|^r) \le \liminf_{n\to\infty} E(|X_n|^r). \]
From Theorem 1.5 we further have that
\[ \liminf_{n\to\infty} E(|X_n|^r) \le \sup_{n\in\mathbb{N}} E(|X_n|^r), \]
so that it follows that $E(|X|^r) < \infty$. We will now establish the uniform integrability of the sequence $\{|X_n - X|^r\}_{n=1}^{\infty}$. Theorem A.18 implies that $|X_n - X| \le |X_n| + |X|$, and Theorem A.20 implies that $(|X_n| + |X|)^r \le 2^r(|X_n|^r + |X|^r)$ for $r > 0$. Therefore $|X_n - X|^r \le 2^r(|X_n|^r + |X|^r)$, and Theorem 5.9 then implies that, since both $\{|X_n|^r\}_{n=1}^{\infty}$ and the constant sequence $|X|^r$ are uniformly integrable, the sequence $\{|X_n - X|^r\}_{n=1}^{\infty}$ is uniformly integrable. Let $\varepsilon > 0$ and note that
\begin{align*}
E(|X_n - X|^r) &= E(|X_n - X|^r\delta\{|X_n - X|; [0, \varepsilon]\}) + E(|X_n - X|^r\delta\{|X_n - X|; (\varepsilon, \infty)\}) \\
&\le \varepsilon^r + E(|X_n - X|^r\delta\{|X_n - X|; (\varepsilon, \infty)\}).
\end{align*}
Therefore, Theorem 1.6 implies that
\[ \limsup_{n\to\infty} E(|X_n - X|^r) \le \varepsilon^r + \limsup_{n\to\infty} E(|X_n - X|^r\delta\{|X_n - X|; (\varepsilon, \infty)\}). \]
For the second term on the right hand side we use Theorem 2.14 (Fatou) to conclude that
\[ \limsup_{n\to\infty} E(|X_n - X|^r\delta\{|X_n - X|; (\varepsilon, \infty)\}) \le E\left(\limsup_{n\to\infty} |X_n - X|^r\delta\{|X_n - X|; (\varepsilon, \infty)\}\right). \]
The fact that $X_n$ converges almost certainly to $X$ as $n \to \infty$ implies that
\[ P\left(\limsup_{n\to\infty} |X_n - X| \le \varepsilon\right) = P\left(\limsup_{n\to\infty}\delta\{|X_n - X|; (\varepsilon, \infty)\} = 0\right) = 1. \]
Therefore, Theorem A.14 implies that
\[ E\left(\limsup_{n\to\infty} |X_n - X|^r\delta\{|X_n - X|; (\varepsilon, \infty)\}\right) = 0, \]
and we have shown that
\[ \limsup_{n\to\infty} E(|X_n - X|^r) \le \varepsilon^r. \]

Since $\varepsilon$ is arbitrary and $E(|X_n - X|^r) \ge 0$, it follows that we have proven that
\[ \lim_{n\to\infty} E(|X_n - X|^r) = 0, \]
or equivalently that $X_n$ converges in $r$th mean to $X$ as $n \to \infty$. We now intend to use this result to show that the corresponding expectations converge. To do this we need to consider two cases. In the first case we assume that $0 < r \le 1$, for which Theorem 2.8 implies that
\[ E(|X_n|^r) = E[|(X_n - X) + X|^r] \le E(|X_n - X|^r) + E(|X|^r), \]
or equivalently that $E(|X_n|^r) - E(|X|^r) \le E(|X_n - X|^r)$. Reversing the roles of $X_n$ and $X$ yields the same bound for $E(|X|^r) - E(|X_n|^r)$, so that $|E(|X_n|^r) - E(|X|^r)| \le E(|X_n - X|^r)$. Therefore, the fact that $X_n$ converges in $r$th mean to $X$ as $n \to \infty$ implies that
\[ \lim_{n\to\infty} |E(|X_n|^r) - E(|X|^r)| \le \lim_{n\to\infty} E(|X_n - X|^r) = 0, \]
and hence we have proven that
\[ \lim_{n\to\infty} E(|X_n|^r) = E(|X|^r). \]
Similar arguments, based on Theorem 2.9 in place of Theorem 2.8, are used for the case where $r > 1$. See Exercise 11. For a proof of the converse see Section 5.5 of Gut (2005).

The result of Theorem 5.11 also holds for convergence in probability. The proof in this case is nearly the same, except that one needs to prove Theorem 5.10 for the case where the random variables converge in probability.
Theorem 5.12. Let $\{X_n\}_{n=1}^{\infty}$ be a sequence of random variables that converge in probability to a random variable $X$ as $n \to \infty$. Let $r > 0$. Then
\[ \lim_{n\to\infty} E(|X_n|^r) = E(|X|^r) \]
if and only if $\{|X_n|^r\}_{n=1}^{\infty}$ is uniformly integrable.

The results of Theorems 5.11 and 5.12 also hold for convergence in distribution, though the proof is slightly different.
Theorem 5.13. Let $\{X_n\}_{n=1}^{\infty}$ be a sequence of random variables that converge in distribution to a random variable $X$ as $n \to \infty$. Let $r > 0$. Then
\[ \lim_{n\to\infty} E(|X_n|^r) = E(|X|^r) \]
if and only if $\{|X_n|^r\}_{n=1}^{\infty}$ is uniformly integrable.

Proof. We will prove the sufficiency of the uniform integrability of the se-
quence following the method of proof used by Serfling (1980). For a proof of
the necessity see Section 5.5 of Gut (2005). Suppose that the limiting random
variable X has a distribution function F . Let ε > 0 and choose a positive real
value $a$ such that both $a$ and $-a$ are continuity points of $F$ and that
\[ \sup_{n\in\mathbb{N}} E(|X_n|^r\delta\{|X_n|; [a, \infty)\}) < \varepsilon. \]
This is possible because we have assumed that the sequence $\{|X_n|^r\}_{n=1}^{\infty}$ is uniformly integrable, and we can therefore find a real value $a$ that does not depend on $n$ such that $E(|X_n|^r\delta\{|X_n|; [a, \infty)\}) < \varepsilon$ for all $n \in \mathbb{N}$. Now choose a real number $b$ such that $b > a$ and such that $b$ and $-b$ are also continuity points of $F$. Consider the function $|x|^r\delta\{|x|; [a, b]\}$, which is a continuous function on the interval $[a, b]$. It therefore follows from Theorem 4.3 (Helly and Bray) that
\[ \lim_{n\to\infty} E(|X_n|^r\delta\{|X_n|; [a, b]\}) = E(|X|^r\delta\{|X|; [a, b]\}). \]
Now, note that for every $n \in \mathbb{N}$ we have that
\[ |X_n|^r\delta\{|X_n|; [a, b]\} \le |X_n|^r\delta\{|X_n|; [a, \infty)\}, \]
with probability one. Theorem A.16 then implies that
\[ E(|X_n|^r\delta\{|X_n|; [a, b]\}) \le E(|X_n|^r\delta\{|X_n|; [a, \infty)\}) < \varepsilon, \]
for every $n \in \mathbb{N}$. Therefore it follows that
\[ \lim_{n\to\infty} E(|X_n|^r\delta\{|X_n|; [a, b]\}) = E(|X|^r\delta\{|X|; [a, b]\}) \le \varepsilon. \]
This result holds for every $b > a$. Hence, it further follows from Theorem 1.12 (Lebesgue) that
\[ \lim_{b\to\infty} E(|X|^r\delta\{|X|; [a, b]\}) = E(|X|^r\delta\{|X|; [a, \infty)\}) \le \varepsilon. \]
This in turn implies that $E(|X|^r) < \infty$. Keeping the value of $a$ as specified above, we have that
\begin{align*}
|E(|X_n|^r) - E(|X|^r)| = |E(|X_n|^r\delta\{|X_n|; [0, a]\}) &+ E(|X_n|^r\delta\{|X_n|; (a, \infty)\}) \\
{}- E(|X|^r\delta\{|X|; [0, a]\}) &- E(|X|^r\delta\{|X|; (a, \infty)\})|.
\end{align*}
Theorem A.18 implies that
\begin{align*}
|E(|X_n|^r) - E(|X|^r)| \le{}& |E(|X_n|^r\delta\{|X_n|; [0, a]\}) - E(|X|^r\delta\{|X|; [0, a]\})| \\
&+ |E(|X_n|^r\delta\{|X_n|; (a, \infty)\}) - E(|X|^r\delta\{|X|; (a, \infty)\})|.
\end{align*}
The expectations in the second term are each less than $\varepsilon$, so that
\[ |E(|X_n|^r) - E(|X|^r)| \le |E(|X_n|^r\delta\{|X_n|; [0, a]\}) - E(|X|^r\delta\{|X|; [0, a]\})| + 2\varepsilon. \]
Noting once again that the function $|x|^r\delta\{|x|; [0, a]\}$ is continuous on $[0, a]$, we apply Theorem 4.3 (Helly and Bray) to find that
\[ \lim_{n\to\infty} |E(|X_n|^r) - E(|X|^r)| \le \lim_{n\to\infty} |E(|X_n|^r\delta\{|X_n|; [0, a]\}) - E(|X|^r\delta\{|X|; [0, a]\})| + 2\varepsilon = 2\varepsilon. \]
Because $\varepsilon$ is arbitrary, it follows that
\[ \lim_{n\to\infty} E(|X_n|^r) = E(|X|^r). \]
Example 5.12. Let $\{Y_n\}_{n=1}^{\infty}$ be a sequence of independent random variables where $Y_n$ has a N$(0, 1)$ distribution for all $n \in \mathbb{N}$. Define a new sequence $\{X_n\}_{n=1}^{\infty}$ where $X_n = \min\{|Y_1|, \ldots, |Y_n|\}$ for all $n \in \mathbb{N}$. In Example 5.11 it was shown that the sequence $\{X_n\}_{n=1}^{\infty}$ is uniformly integrable. We can also note that for every $\varepsilon > 0$,
\[ P(|X_n - 0| > \varepsilon) = P(\min\{|Y_1|, \ldots, |Y_n|\} > \varepsilon) = P\left(\bigcap_{i=1}^{n}\{|Y_i| > \varepsilon\}\right) = \prod_{i=1}^{n} P(|Y_i| > \varepsilon) = \{2[1 - \Phi(\varepsilon)]\}^n, \]
so that
\[ \lim_{n\to\infty} P(|X_n| > \varepsilon) = 0. \]
Hence $X_n$ converges in probability to 0 as $n \to \infty$. Theorem 5.12 further implies that
\[ \lim_{n\to\infty} E(X_n) = E(0) = 0. \]


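A small simulation sketch of Example 5.12 follows; the choices of n, ε, the seed, and the number of replications are illustrative only. It reports both the exact probability {2[1 − Φ(ε)]}ⁿ and the simulated mean of Xn, both of which shrink rapidly with n, consistent with E(Xn) → 0.

set.seed(300)
eps <- 0.1
for (n in c(1, 5, 25, 125)) {
  # X_n is the minimum of n absolute standard normal variables
  xn <- replicate(5000, min(abs(rnorm(n))))
  cat(sprintf("n = %3d  P(X_n > %.1f) = %.4f  estimated E(X_n) = %.4f\n",
              n, eps, (2 * (1 - pnorm(eps)))^n, mean(xn)))
}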
Example 5.13. Let $\{U_n\}_{n=1}^{\infty}$ be a sequence of random variables where $U_n$ has a Uniform$(0, n^{-1})$ distribution for all $n \in \mathbb{N}$. It can be shown that $U_n$ converges almost certainly to 0 as $n \to \infty$. Let $f_n(u) = n\delta\{u; (0, n^{-1})\}$ denote the density of $U_n$ for all $n \in \mathbb{N}$, and let $F_n$ denote the corresponding distribution function. Note that $E(|U_n|\delta\{|U_n|; (a, \infty)\}) = 0$ for all $a > 1$. This proves that the sequence is uniformly integrable, and hence it follows from Theorem 5.11 that
\[ \lim_{n\to\infty} E(U_n) = 0. \]
Note that this property could also have been obtained directly by noting that $E(U_n) = (2n)^{-1}$ for all $n \in \mathbb{N}$, and therefore
\[ \lim_{n\to\infty} E(U_n) = \lim_{n\to\infty} (2n)^{-1} = 0. \]

Example 5.14. Suppose that $\{X_n\}_{n=1}^{\infty}$ is a sequence of independent random variables where $X_n$ has a Gamma$(\alpha_n, \beta_n)$ distribution for all $n \in \mathbb{N}$, where $\{\alpha_n\}_{n=1}^{\infty}$ and $\{\beta_n\}_{n=1}^{\infty}$ are real sequences. Suppose that $\alpha_n \le \alpha < \infty$ and $\beta_n \le \beta < \infty$ for all $n \in \mathbb{N}$ and that $\alpha_n \to \alpha$ and $\beta_n \to \beta$ as $n \to \infty$. It was shown in Example 5.10 that the sequence $\{X_n\}_{n=1}^{\infty}$ is uniformly integrable. It also follows that $X_n$ converges in distribution to $X$ as $n \to \infty$, where $X$ is a Gamma$(\alpha, \beta)$ random variable. Theorem 5.13 implies that $E(X_n) \to E(X) = \alpha\beta$ as $n \to \infty$. □
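A brief numerical sketch of Example 5.14 follows, using one arbitrary pair of parameter sequences satisfying αn ≤ α = 2 and βn ≤ β = 3 (these particular sequences are assumptions for illustration, not choices made in the text). It compares the exact means αnβn with simulated means and with the limit αβ = 6; note that R's rgamma is called with shape and scale so that the mean equals the product of the two parameters, matching the parameterization used here.

set.seed(400)
n <- c(1, 10, 100, 1000)
alpha.n <- 2 - 1 / (n + 1)     # assumed sequence, bounded above by alpha = 2
beta.n <- 3 - 1 / (n + 1)      # assumed sequence, bounded above by beta = 3
sim.mean <- sapply(seq_along(n), function(i)
  mean(rgamma(20000, shape = alpha.n[i], scale = beta.n[i])))
cbind(n, exact.mean = alpha.n * beta.n, simulated.mean = sim.mean, limit = 2 * 3)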

5.4 Exercises and Experiments

5.4.1 Exercises

1. Consider an arbitrary probability measure space (Ω, F, P ) and let Xr be


the collection of all possible random variables X that map Ω to R subject
to the restriction that E(|X|r ) < ∞. Define the operators ⊕ and ⊗ such
that X ⊕ Y is the function X(ω) + Y (ω) for all ω ∈ Ω when X ∈ Xr and
Y ∈ Xr , and for a scalar a ∈ R, a ⊗ X is the function aX(ω) for all ω ∈ Ω.
Prove that Xr is a vector space by showing the following properties:
a. For each X ∈ Xr and Y ∈ Xr , X ⊕ Y ∈ Xr .
b. The operation ⊕ is commutative, that is X ⊕Y = Y ⊕X for each X ∈ Xr
and Y ∈ Xr .
c. The operation ⊕ is associative, that is X ⊕ (Y ⊕ Z) = (X ⊕ Y ) ⊕ Z for
each X ∈ Xr , Y ∈ Xr and Z ∈ Xr .
d. There exists a random variable 0 ∈ Xr , called the origin, such that
0 ⊕ X = X for all X ∈ Xr .
e. For every X ∈ Xr there exists a unique random variable −X such that
X + (−X) = 0.
f. Multiplication by scalars is associative, that is a ⊗ (b ⊗ X) = (ab) ⊗ X
for every a ∈ R, b ∈ R and X ∈ Xr .
g. For every X ∈ Xr , 1 ⊗ X = X.
h. Multiplication by scalars is distributive, that is a⊗(X⊕Y ) = a⊗X+a⊗Y
for all a ∈ R, X ∈ Xr and Y ∈ Xr .
i. Multiplication by scalars is distributive over scalar addition, that is (a + b) ⊗ X = (a ⊗ X) ⊕ (b ⊗ X) for all a ∈ R, b ∈ R and X ∈ Xr.

2. Within the context of Exercise 1, let $||X||_r$ be defined for $X \in \mathcal{X}_r$ as
\[ ||X||_r = \left(\int |X(\omega)|^r\,dP(\omega)\right)^{1/r}. \]
Prove that ||X||r is a norm. That is, show that ||X||r has the following
properties:

a. ||X||r ≥ 0 for all X ∈ Xr .


b. ||X||r = 0 if and only if X = 0.
c. ||a ⊗ X||r = |a| · ||X||r
d. Prove the triangle inequality, that is ||X ⊕ Y || ≤ ||X|| + ||Y || for all
X ∈ Xr and Y ∈ Xr .

3. Let X1 , . . . , Xn be a set of independent and identically distributed random


variables from a distribution F that has parameter θ. Let θ̂n be an unbiased
estimator of θ based on the observed sample where V (θ̂n ) = τn2 where
lim τn2 = 0.
n→∞
Prove that $\hat{\theta}_n$ converges in quadratic mean to $\theta$ as $n \to \infty$.
4. Consider a sequence of random variables $\{X_n\}_{n=1}^{\infty}$ where $X_n$ has probability distribution function
\[ f_n(x) = \begin{cases} [\log(n+1)]^{-1} & x = n \\ 1 - [\log(n+1)]^{-1} & x = 0 \\ 0 & \text{elsewhere,} \end{cases} \]
for all n ∈ N.
a. Prove that $X_n$ converges in probability to 0 as $n \to \infty$.
b. Let $r > 0$. Determine whether $X_n$ converges in $r$th mean to 0 as $n \to \infty$.
5. Suppose that {Xn }∞ n=1 is a sequence of independent random variables
from a common distribution that has mean µ and variance σ 2 , such that
$E(|X_n|^4) < \infty$. Prove that
\begin{align*}
E(\bar{X}_n^4) &= n^{-4}[n\lambda + 4n(n-1)\mu\gamma + 3n(n-1)(\mu^2+\sigma^2)^2 + 6n(n-1)(n-2)\mu^2(\mu^2+\sigma^2) \\
&\qquad + n(n-1)(n-2)(n-3)\mu^4] \\
&= B(n) + n^{-3}(n-1)(n-2)(n-3)\mu^4,
\end{align*}
where $B(n) = O(n^{-1})$ as $n \to \infty$, $\gamma = E(X_n^3)$, and $\lambda = E(X_n^4)$.
6. Let $\{X_n\}_{n=1}^{\infty}$ be a sequence of random variables that converge in $r$th mean to a random variable $X$ as $n \to \infty$ for some $r > 0$. Prove that if
\[ \sum_{n=1}^{\infty} E(|X_n - X|^r) < \infty, \]
then $X_n$ converges almost certainly to $X$ as $n \to \infty$.
7. Let {Xn }∞n=1 be a sequence of independent random variables where Xn has
probability distribution function
\[ f_n(x) = \begin{cases} 1 - n^{-\alpha} & x = 0 \\ n^{-\alpha} & x = n \\ 0 & \text{otherwise,} \end{cases} \]
and 0 < α < 1. Prove that there does not exist a random variable Y such
that P (|Xn | ≤ |Y |) = 1 for all n ∈ N and E(|Y |) < ∞.
8. Let {Xn }∞n=1 be a sequence of random variables such that P (|Xn | ≤ Y ) = 1
for all n ∈ N where Y is a positive integrable random variable. Prove that
the sequence {Xn }∞ n=1 is uniformly integrable.

9. Let $\{X_n\}_{n=1}^{\infty}$ be a sequence of random variables such that
\[ E\left(\sup_{n\in\mathbb{N}} |X_n|\right) < \infty. \]
Prove that the sequence $\{X_n\}_{n=1}^{\infty}$ is uniformly integrable.
10. Prove that if $a$, $x$, and $y$ are positive real numbers then
\[ 2\max\{x, y\}\delta\{2\max\{x, y\}; (a, \infty)\} \le 2x\delta\{x; (\tfrac{1}{2}a, \infty)\} + 2y\delta\{y; (\tfrac{1}{2}a, \infty)\}. \]
11. Suppose that $\{X_n\}_{n=1}^{\infty}$ is a sequence of random variables such that $X_n$ converges in $r$th mean to $X$ as $n \to \infty$ for some random variable $X$. Prove that for $r > 1$,
\[ \lim_{n\to\infty} E(|X_n|^r) = E(|X|^r). \]

12. Let {Xn }∞


n=1 be a sequence of independent random variables where Xn has
an Exponential(θn ) distribution for all n ∈ N, and {θn }∞
n=1 is a sequence
of real numbers such that θn > 0 for all n ∈ N. Find the necessary proper-
ties for the sequence {θn }∞ ∞
n=1 that will ensure that {Xn }n=1 is uniformly
integrable.
13. Let {Xn }∞ n=1 be a sequence of independent random variables where Xn
has a Triangular(αn , βn , γn ) distribution for all n ∈ N, where {αn }∞ n=1 ,
{βn }∞
n=1 , and {γ }∞
n n=1 are sequences of real numbers such that αn < γn <
βn for all n ∈ N. Find the necessary properties for the sequences {αn }∞ n=1 ,
{βn }∞
n=1 , and {γ } ∞
n n=1 that will ensure that {X } ∞
n n=1 is uniformly inte-
grable.
14. Let {Xn }∞
n=1 be a sequence of independent random variables where Xn has
a Beta(αn , βn ) distribution for all n ∈ N, where {αn }∞ ∞
n=1 and {βn }n=1 are
sequences of real numbers such that αn > 0 and βn > 0 for all n ∈ N. Find
the necessary properties for the sequences {αn }∞ ∞
n=1 , and {βn }n=1 that will

ensure that {Xn }n=1 is uniformly integrable.
15. Let {Xn }∞
n=1 be a sequence of random variables where Xn has distribution
Fn which has mean θn for all n ∈ N. Suppose that
lim θn = θ,
n→∞

for some θ ∈ (0, ∞). Hence it follows that


lim E(Xn ) = E(X),
n→∞

where X has distribution F with mean θ. Does it necessarily follow that


the sequence {Xn }∞
n=1 is uniformly integrable?
16. Let {Xn }∞n=1 be a sequence of random variables such that Xn has distribu-
a.c.
tion Fn for all n ∈ N. Suppose that Xn −−→ X as n → ∞ for some random
variable X. Suppose that there exists a subset A ⊂ R such that
\[ \int_A dF_n(t) = 1, \]
for all $n \in \mathbb{N}$, and
\[ \int_A dt < \infty. \]
Do these conditions imply that $E(X_n) \to E(X)$ as $n \to \infty$?
17. Let {Xn }∞ 2
n=1 be a sequence of random variables such that Xn has a N(0, σn )
distribution, conditional on σn . For each case detailed below, determine if
the sequence {Xn }∞ n=1 is uniformly integrable and determine if Xn con-
verges weakly to a random variable X as n → ∞.

a. σn = n−1 for all n ∈ N.


b. σn = n for all n ∈ N.
c. σn = 10 + (−1)n for all n ∈ N.
d. {σn }∞
n=1 is a sequence of independent random variables where σn has an
Exponential(θ) distribution for each n ∈ N.
18. Let {Xn }∞ 2
n=1 be a sequence of random variables such that Xn has a N(µn , σn )
distribution, conditional on µn and σn . For each case detailed below deter-
mine if the sequence {Xn }∞n=1 is uniformly integrable and determine if Xn
converges weakly to a random variable X as n → ∞.

a. µn = n and σn = n−1 for all n ∈ N.


b. µn = n−1 and σn = n for all n ∈ N.
c. µn = n−1 and σn = n−1 for all n ∈ N.
d. µn = n and σn = n for all n ∈ N.
e. µn = (−1)n and σn = 10 + (−1)n for all n ∈ N.
f. {µn }∞
n=1 is a sequence of independent random variables where µn has a
N(0, 1) distribution for each n ∈ N, and {σn }∞
n=1 is a sequence of random
variables where σn has an Exponential(θ) distribution for each n ∈ N.

5.4.2 Experiments

1. Write a program in R that simulates a sequence of independent and identi-


cally distributed random variables X1 , . . . , X100 where Xn follows a distri-
bution F that is specified below. For each n = 1, . . . , 100 compute X̄n on
X1 , . . . , Xn along with |X̄n − µ|2 where µ is the mean of the distribution F .
Repeat the experiment five times and plot each sequence {|X̄n − µ|2 }∞ n=1
against n on the same set of axes. Describe the behavior observed in each
qm
case and compare it to whether Xn −−→ µ as n → ∞.

a. F is a N(0, 1) distribution.
b. F is a Cauchy(0, 1) distribution, where µ is taken to be the median of
the distribution.
c. F is a Exponential(1) distribution.
d. F is a T(2) distribution.
e. F is a T(3) distribution.

2. Write a program in R that simulates a sequence of independent and identi-


cally distributed random variables B1 , . . . , B500 where Bn is a Bernoulli(θ)
random variable where θ is specified below. For each n = 3, . . . , 500, com-
pute
n1/2 {log[log(n)]}−1/2 (B̄n − θ),
where B̄n is the sample mean computed on B1 , . . . , Bn . Repeat the exper-
iment five times and plot each sequence
{n1/2 {log[log(n)]}−1/2 (B̄n − θ)}500
n=3

against n on the same set of axes. Describe the behavior observed for the
sequence in each case in terms of the result of Example 5.3. Repeat the
experiment for each θ ∈ {0.01, 0.10, 0.50, 0.75}.
3. Write a program in R that simulates a sequence of independent random
variables X1, . . . , X100 where Xn has probability distribution function
\[ f_n(x) = \begin{cases} 1 - n^{-\alpha} & x = 0 \\ n^{-\alpha} & x = n \\ 0 & \text{elsewhere.} \end{cases} \]

Repeat the experiment five times and plot each sequence Xn against n on
the same set of axes. Describe the behavior observed for the sequence in
each case. Repeat the experiment for each α ∈ {0.5, 1.0, 1.5, 2.0}.
4. Write a program in R that simulates a sequence of independent random
variables X1 , . . . , X100 where Xn is a N(0, σn2 ) random variable where the
sequence {σn }∞ n=1 is specified below. Repeat the experiment five times and
plot each sequence Xn against n on the same set of axes. Describe the
behavior observed for the sequence in each case, and relate the behavior to
the results of Exercise 17.
a. σn = n−1 for all n ∈ N.
b. σn = n for all n ∈ N.
c. σn = 10 + (−1)n for all n ∈ N.
d. {σn }∞
n=1 is a sequence of independent random variables where σn has an
Exponential(θ) distribution for each n ∈ N.

5. Write a program in R that simulates a sequence of independent random


variables X1 , . . . , X100 where Xn is a N(µn , σn2 ) random variable where the
sequences {µn }∞ ∞
n=1 and {σn }n=1 are specified below. Repeat the experiment
five times and plot each sequence Xn against n on the same set of axes.
Describe the behavior observed for the sequence in each case, and relate
the behavior to the results of Exercise 18.

a. µn = n and σn = n−1 for all n ∈ N.


b. µn = n−1 and σn = n for all n ∈ N.
c. µn = n−1 and σn = n−1 for all n ∈ N.
d. µn = n and σn = n for all n ∈ N.
e. µn = (−1)n and σn = 10 + (−1)n for all n ∈ N.
f. {µn }∞
n=1 is a sequence of independent random variables where µn has a
N(0, 1) distribution for each n ∈ N, and {σn }∞
n=1 is a sequence of random
variables where σn has an Exponential(θ) distribution for each n ∈ N.
CHAPTER 6

Central Limit Theorems

They formed a unit of the sort that normally can be formed only by matter that
is lifeless.
The Trial by Franz Kafka

6.1 Introduction

One of the important and interesting features of the Central Limit Theorem
is that the weak convergence of the mean holds under many situations beyond
the simple situation where we observe a sequence of independent and iden-
tically distributed random variables. In this chapter we will explore some of
these extensions. The two main direct extensions of the Central Limit Theo-
rem we will consider are to non-identically distributed random variables and to
triangular arrays. Of course, other generalizations are possible, and we only
present some of the simpler cases. For a more general presentation of this
subject see Gnedenko and Kolmogorov (1968). We will also consider transfor-
mations of asymptotically normal statistics that either result in asymptoti-
cally Normal statistics, or statistics that follow a ChiSquared distribution.
As we will show, the difference between these two outcomes depends on the
smoothness of the transformation.

6.2 Non-Identically Distributed Random Variables

The Lindeberg, Lévy, and Feller version of the Central Limit Theorem relaxes
the assumption that the random variables in the sequence need to be identi-
cally distributed, but still retains the assumption of independence. The result
originates from the work of Lindeberg (1922), Lévy (1925), and Feller (1935),
who each proved various parts of the final result.
Theorem 6.1 (Lindeberg, Lévy, and Feller). Let $\{X_n\}_{n=1}^{\infty}$ be a sequence of independent random variables where $E(X_n) = \mu_n$ and $V(X_n) = \sigma_n^2 < \infty$ for all $n \in \mathbb{N}$, where $\{\mu_n\}_{n=1}^{\infty}$ and $\{\sigma_n^2\}_{n=1}^{\infty}$ are sequences of real numbers. Let
\[ \bar{\mu}_n = n^{-1}\sum_{k=1}^{n} \mu_k, \]
\[ \tau_n^2 = \sum_{k=1}^{n} \sigma_k^2, \tag{6.1} \]
and suppose that
\[ \lim_{n\to\infty} \max_{k\in\{1,2,\ldots,n\}} \tau_n^{-2}\sigma_k^2 = 0. \tag{6.2} \]
Then $Z_n = n\tau_n^{-1}(\bar{X}_n - \bar{\mu}_n)$ converges in distribution to $Z$ as $n \to \infty$, where $Z$ has a N$(0, 1)$ distribution, if and only if
\[ \lim_{n\to\infty} \tau_n^{-2}\sum_{k=1}^{n} E(|X_k - \mu_k|^2\delta\{|X_k - \mu_k|; (\varepsilon\tau_n, \infty)\}) = 0, \tag{6.3} \]
for every $\varepsilon > 0$.

Proof. We will prove the sufficiency of the condition given in Equation (6.3),
but not its necessity. See Section 7.2 of Gut (2005) for details on the necessity
part of the proof. The main argument of this proof is based on the same idea
that we used in proving Theorem 4.20 (Lindeberg and Lévy) in that we will
show that the characteristic function of Zn converges to the characteristic
function of a N(0, 1) random variable. As in the proof of Theorem 4.20 we
begin by assuming that µn = 0 for all n ∈ N which does not reduce the
generality of the proof, since the numerator of $Z_n$ has the form
\[ \bar{X}_n - \bar{\mu}_n = n^{-1}\sum_{k=1}^{n}(X_k - \mu_k) = n^{-1}\sum_{k=1}^{n} X_k^*, \]
where $E(X_k^*) = 0$ for all $k \in \{1, \ldots, n\}$. Let $\psi_k$ be the characteristic function of $X_k$ for all $k \in \mathbb{N}$. Theorem 2.33 implies that the characteristic function of the sum of $X_1, \ldots, X_n$ is
\[ \prod_{k=1}^{n} \psi_k(t). \]
Note that
\[ Z_n = n\tau_n^{-1}(\bar{X}_n - \bar{\mu}_n) = \tau_n^{-1}\sum_{k=1}^{n} X_k, \]
under the assumption that $\mu_k = 0$ for all $k \in \mathbb{N}$. Therefore, it follows from Theorem 2.32 that the characteristic function of $Z_n$, which we will denote as $\psi$, is
\[ \psi(t) = \prod_{k=1}^{n} \psi_k(\tau_n^{-1}t) = \exp\left\{\sum_{k=1}^{n} \log[\psi_k(\tau_n^{-1}t)]\right\}. \]
Some algebra in the exponent can be used to show that
\begin{align*}
\exp\left\{\sum_{k=1}^{n} \log[\psi_k(\tau_n^{-1}t)]\right\}
={}& \exp\left\{\sum_{k=1}^{n} \log[\psi_k(\tau_n^{-1}t)] + \sum_{k=1}^{n}[1 - \psi_k(\tau_n^{-1}t)]\right\} \\
&\times \exp\left\{\sum_{k=1}^{n}[\psi_k(\tau_n^{-1}t) - (1 - \tfrac{1}{2}\tau_n^{-2}\sigma_k^2t^2)]\right\}
\exp\left\{-\tfrac{1}{2}\sum_{k=1}^{n}\tau_n^{-2}\sigma_k^2t^2\right\}.
\end{align*}
Note that the final term in the product has the form
\[ \exp\left\{-\tfrac{1}{2}\sum_{k=1}^{n}\tau_n^{-2}\sigma_k^2t^2\right\} = \exp\left\{-\tfrac{1}{2}t^2\tau_n^{-2}\sum_{k=1}^{n}\sigma_k^2\right\} = \exp(-\tfrac{1}{2}t^2), \]

which is the characteristic function of a standard normal random variable.


Therefore, the proof now depends on demonstrating that
( n n
)
X X
−1 −1
 
lim exp log[ψk (τn t)] + 1 − ψk (τn t) = 1,
n→∞
k=1 k=1

and ( )
n
X n
X
lim exp ψk (τn−1 t) − (1 − 1 −2 2 2
2 τn σk t ) = 1,
n→∞
k=1 k=1
which is equivalent to showing that

Xn n
X
log[ψk (τn−1 t)] + 1 − ψk (τn−1 t) = 0,
 
lim (6.4)

n→∞
k=1 k=1

and n
X n
X
−1 1 −2 2 2
lim ψk (τn t) − (1 − 2 τn σk t ) = 0. (6.5)

n→∞
k=1 k=1
We work on showing the property in Equation (6.4) first. Theorem 2.30 implies that $|\psi_k(t) - 1 - itE(X_k)| \le E(\tfrac{1}{2}t^2X_k^2)$. In our case we are evaluating the characteristic function at $\tau_n^{-1}t$ and we have assumed that $E(X_k) = 0$. Therefore, we have that
\[ |\psi_k(\tau_n^{-1}t) - 1| \le E\left(\frac{t^2X_k^2}{2\tau_n^2}\right) = \frac{t^2\sigma_k^2}{2\tau_n^2} \le \tfrac{1}{2}t^2\tau_n^{-2}\max_{1\le k\le n}\sigma_k^2. \tag{6.6} \]
The assumption given in Equation (6.2) then implies that
\[ \lim_{n\to\infty} |\psi_k(\tau_n^{-1}t) - 1| \le \lim_{n\to\infty} \tfrac{1}{2}t^2\tau_n^{-2}\max_{1\le k\le n}\sigma_k^2 = 0. \tag{6.7} \]
It is also noteworthy that the convergence is uniform in $k$, due to the bound provided in the assumption of Equation (6.2). Now, Theorem A.8 implies that for $z \in \mathbb{C}$ with $|z| \le \tfrac{1}{2}$ we have that $|\log(1-z) + z| \le |z|^2$. Equation (6.7) implies that there exists an integer $n_0$ such that $|\psi_k(\tau_n^{-1}t) - 1| \le \tfrac{1}{2}$ for all $n \ge n_0$. The uniformity
of the convergence implies that n0 does not depend on k. Therefore, Theorem
A.8 implies that for all n ≥ n0 we have that
| log{1 − [1 − ψk (τn−1 t)]} + 1 − ψk (τn−1 t)| ≤ |ψk (τn−1 t) − 1|2 .
Hence, for all n ≥ n0 ,

Xn n
−1
X  −1

log[ψk (τn t)] + 1 − ψk (τn t) =



k=1 k=1

Xn
 −1
 −1

log{1 − [1 − ψk (τn t)]} + 1 − ψk (τn t) ≤



k=1
n
X n
 X
log{1 − [1 − ψk (τn−1 t)]} + 1 − ψk (τn−1 t) ≤ |ψk (τn−1 t) − 1|2 .
 

k=1 k=1

Now we use the fact that


|ψk (τn−1 t) − 1| ≤ max |ψk (τn−1 t) − 1|,
1≤k≤n

and Equation (6.6) to show that


n
X n
X
|ψk (τn−1 t) − 1| 2
≤ max |ψk (τn−1 t) − 1| |ψk (τn−1 t) − 1|
1≤k≤n
k=1 k=1
 n
X
1 −2 2
≤ 2 τn t max σk2 E( 12 τn−2 t2 Xk2 )
1≤k≤n
k=1
  n
X
1 4 −2
= 4 t τn max σk2 τn−2 σk2
1≤k≤n
k=1
 
1 4 −2
= 4 t τn max σk2 .
1≤k≤n

Therefore, Equation (6.2) implies that


n n

X X
log[ψk (τn−1 t)] + 1 − ψk (τn−1 t) ≤
 
lim

n→∞
k=1 k=1
 
1 4 −2
lim t τn max σk2 = 0.
n→∞ 4 1≤k≤n

This completes our first task in this proof since we have proven that Equation
(6.4) is true. We now take up the task of proving that Equation (6.5) is true.
Theorem 2.30 implies that
 2 2
t Xk
|ψk (tτn−1 ) − 1 − itτn−1 E(Xk ) + 12 t2 τn−2 E(Xk2 )| ≤ E , (6.8)
τn2
which, due to the assumption that E(Xk ) = 0 simplifies to
2 2
 2 2
ψk (tτn−1 ) − 1 + t σk ≤ E t Xk .

(6.9)
2τn2 τn2
Theorem 2.30 and similar calculations to those used earlier can also be used
to establish that
2 2
 3 3

ψk (tτn−1 ) − 1 + t σk ≤ E |t| |Xk | .

2τn2 6τn3
Now, Theorem A.18 and Equations (6.8) and (6.9) imply that

n  2 2 n
t2 σk2
  
X
−1 t σ k
X −1
ψk (tτn ) − 1 − ≤ ψk (tτn ) − 1 − 2τ 2

2τn2


k=1

k=1 n
n 2 2
|t|3 |Xk |3
    
X t Xk
≤ min E , E .
τn2 6τn3
k=1

We now split each of the expectations across small and large values of |Xk |.
Let ε > 0, then
 2 2  2 2   2 2 
t Xk t Xk t Xk
E = E δ{|Xk |; [0, ετn ]} + E δ{|Xk |; (ετn , ∞)} ,
τn2 τn2 τn2
and

|t|3 |Xk |3 |t|3 |Xk |3


   
E =E δ{|Xk |; [0, ετn ]} +
6τn3 6τn3
 3
|t| |Xk |3

E δ{|Xk |; (ετn , ∞)} ,
6τn3
which implies that
  2 2  3
|t| |Xk |3
 3
|t| |Xk |3
 
t Xk
min E , E ≤ E δ{|Xk |; [0, ετ n ]} +
τn2 6τn3 6τn3
 2 2 
t Xk
E δ{|X k |; (ετn , ∞)} .
τn2
Therefore,
n   n
t2 σk2 X
 3
|t| |Xk |3
X  
−1
ψ (tτ ) − 1 − ≤ E δ{|X |; [0, ετ ]} +

k n k n
2τn2 6τn3


k=1 k=1
n  2 2 
X t Xk
E δ{|Xk |; (ετn , ∞)} .
τn2
k=1

The first term on the right hand side can be simplified by bounding |Xk |3 ≤
ετn |Xk |2 due to the condition imposed by the indicator function. That is,
n  3 n
|t| |Xk |3 |t|3 ετn
X  X
E 3
δ{|Xk |; [0, ετn ]} ≤ E(|Xk |2 δ{|Xk |; [0, ετn ]})
6τn 6τn3
k=1 k=1
n
X
1 3 −2
≤ 6 |t| ετn E(|Xk |2 )
k=1
Xn
1 3 −2
= 6 |t| ετn σk2
k=1
1 3
= 6 |t| ε.

The second inequality follows from the fact that


E(|Xk |2 δ{|Xk |; [0, ετn ]}) ≤ E(|Xk |2 ), (6.10)
where we note that expectation on the right hand side of Equation (6.10) is
finite by assumption. This was the reason for bounding one of the |Xk | in the
term |Xk |3 by ετn . Therefore,
n  
2 2

X t σ
ψk (tτn−1 ) − 1 − k 1 3
≤ 6 |t| ε+

2τn2


k=1
n
X
t2 τn−2 E(|Xk |2 δ{|Xk |; (ετn , ∞)}). (6.11)
k=1

Equation (6.3) implies that the second term on the right hand side of Equation
(6.11) converges to zero as n → ∞, therefore

n   2 2
X t σ
ψk (tτn−1 ) − 1 − k 1 3
lim sup ≤ 6 |t| ε,

2τn2
n→∞
k=1

for every ε > 0. Since ε is arbitrary, it follows that


n  
t2 σk2
X 
−1
lim ψk (tτn ) − 1 − = 0,

n→∞ 2τn2
k=1

which verifies Equation (6.5), and the result follows.

The condition given in Equation (6.3) is known as the Lindeberg Condition.


As Serfling (1980) points out, the Lindeberg Condition actually implies the
condition given in Equation (6.2), so that the main focus in applying this
result is on the verification of Equation (6.3). This condition regulates the
tail behavior of the sequence of distribution functions that correspond to the
sequence of random variables {Xn }∞ n=1 . Without becoming too technical about
the meaning of this condition, it is clear in the proof of Theorem 6.1 where the
condition arises. Indeed, this condition is exactly what is required to complete
the final step of the proof. Unfortunately, the Lindeberg Condition can be difficult to verify in practice, and hence we will explore a simpler sufficient condition.
Corollary 6.1. Let $\{X_n\}_{n=1}^{\infty}$ be a sequence of independent random variables where $E(X_n) = \mu_n$ and $V(X_n) = \sigma_n^2 < \infty$ for all $n \in \mathbb{N}$, where $\{\mu_n\}_{n=1}^{\infty}$ and $\{\sigma_n^2\}_{n=1}^{\infty}$ are sequences of real numbers. Let
\[ \bar{\mu}_n = n^{-1}\sum_{k=1}^{n} \mu_k, \qquad \tau_n^2 = \sum_{k=1}^{n} \sigma_k^2, \]
and suppose that for some $\eta > 2$,
\[ \sum_{k=1}^{n} E(|X_k - \mu_k|^\eta) = o(\tau_n^\eta), \tag{6.12} \]
as $n \to \infty$. Then $Z_n = n\tau_n^{-1}(\bar{X}_n - \bar{\mu}_n) \xrightarrow{d} Z$ as $n \to \infty$, where $Z$ has a N$(0, 1)$ distribution.

Proof. We will follow the method of proof of Serfling (1980). We will show that
the conditions of Equations (6.2) and (6.3) follow from the condition given in
Equation (6.12). Let ε > 0 and focus on the term inside the summation of
Equation (6.3). We note that

E(|Xk − µk |2 δ{|Xk − µk |; (ετn , ∞)}) =


E(|Xk − µk |2−η |Xk − µk |η δ{|Xk − µk |; (ετn , ∞)}) ≤
E(|ετn |2−η |Xk − µk |η δ{|Xk − µk |; (ετn , ∞)}),

where we note that the inequality comes from the fact that η > 2 so that the
exponent 2 − η < 0, and hence

|Xk − µk |2−η δ{|Xk − µk |; (ετn , ∞)} ≤ |ετn |2−η δ{|Xk − µk |; (ετn , ∞)}.

The inequality follows from an application of Theorem A.16. It, then, further
follows that

E(|ετn |2−η |Xk − µk |η δ{|Xk − µk |; (ετn , ∞)}) =


|ετn |2−η E(|Xk − µk |η δ{|Xk − µk |; (ετn , ∞)}) ≤
|ετn |2−η E(|Xk − µk |η ).
Now,
n
X
lim sup τn−2 E(|Xk − µk |2 δ{|Xk − µk |; (ετn , ∞)}) ≤
n→∞
k=1
n
X
lim sup τn−2 |ετn |2−η E(|Xk − µk |η ) =
n→∞
k=1
n
X
lim sup ετn−η E(|Xk − µk |η ) = 0,
n→∞
k=1

by Equation (6.12). A similar result follows for limit infimum, so that it follows
that
Xn
lim τn−2 E(|Xk − µk |2 δ{|Xk − µk |; (ετn , ∞)}) = 0,
n→∞
k=1
and therefore the condition of Equation (6.3) is satisfied. We now show that
the condition of Equation (6.2) is also satisfied. To do this, we note that for
all k ∈ {1, . . . , n}, we have that

σk2 = E[(Xk − µk )2 ] = E[(Xk − µk )2 δ{|Xk − µk |; [0, ετn ]}]+


E[(Xk − µk )2 δ{|Xk − µk |; (ετn , ∞)}].
The first term can be bounded as
E[(Xk − µk )2 δ{|Xk − µk |; [0, ετn ]}] ≤ E[(ετn )2 δ{|Xk − µk |; [0, ετn ]}]
= (ετn )2 E[δ{|Xk − µk |; [0, ετn ]}]
= (ετn )2 P (|Xk − µk | ≤ ετn )
≤ (ετn )2 ,
where the final inequality follows because the probability is bounded above
by one. Therefore, it follows that
σk2 ≤ (ετn )2 + E[(Xk − µk )2 δ{|Xk − µk |; (ετn , ∞)}],
for all k ∈ {1, . . . , n}. Hence, it follows that
n
X
max σk2 ≤ (ετn )2 + E[(Xk − µk )2 δ{|Xk − µk |; (ετn , ∞)}].
k∈{1,...,n}
k=1

Therefore,

lim max τn−2 σk2 ≤


n→∞ k∈{1,...,n}
n
X
lim ε2 + lim τn−2 E[(Xk − µk )2 δ{|Xk − µk |; (ετn , ∞)}]. (6.13)
n→∞ n→∞
k=1

The condition given in Equation (6.3) implies that the second term on the
right hand side of Equation (6.13) converges to zero as n → ∞. Therefore
lim max τn−2 σk2 ≤ ε2 ,
n→∞ k∈{1,...,n}

for all ε > 0, so that


lim max τn−2 σk2 = 0,
n→∞ k∈{1,...,n}

and the result follows.


Example 6.1. Let $\{X_n\}_{n=1}^{\infty}$ be a sequence of independent random variables where $X_n$ has an Exponential$(\theta_n)$ distribution. In this case $\mu_n = \theta_n$ and $\sigma_n^2 = \theta_n^2$ for all $n \in \mathbb{N}$. Let us consider the case where $\theta_n = n^{-1/2}$ for all $n \in \mathbb{N}$ and take $\eta = 4$. In this case $E(|X_n - \theta_n|^4) = 9\theta_n^4 = 9n^{-2}$ for all $n \in \mathbb{N}$. Hence,
\[ \lim_{n\to\infty} \frac{\sum_{k=1}^{n} E(|X_k - \theta_k|^4)}{\left(\sum_{k=1}^{n}\sigma_k^2\right)^2} = \lim_{n\to\infty} \frac{9\sum_{k=1}^{n} k^{-2}}{\left(\sum_{k=1}^{n} k^{-1}\right)^2} = 0, \]
since the harmonic series in the denominator diverges, while the series in the numerator converges ($\sum_{k=1}^{\infty} k^{-2} = \tfrac{1}{6}\pi^2$). Therefore, it follows from Definition 1.7 that
\[ \sum_{k=1}^{n} E(|X_k - \theta_k|^4) = o(\tau_n^4), \]
as $n \to \infty$. Therefore, Corollary 6.1 implies that $Z_n = n\tau_n^{-1}(\bar{X}_n - \bar{\mu}_n) \xrightarrow{d} Z$, as $n \to \infty$, where $Z$ has a N$(0, 1)$ distribution. □
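The conclusion of Example 6.1 can be checked by simulation. The R sketch below is an illustration, not part of the text; the sample size, number of replications, and the seed are arbitrary choices. It simulates the standardized sum Zn with θk = k^(−1/2) and compares its moments and a tail probability with those of a standard normal random variable.

set.seed(500)
n <- 500
theta <- (1:n)^(-1/2)               # means (and standard deviations) of the X_k
tau.n <- sqrt(sum(theta^2))
z <- replicate(2000, sum(rexp(n, rate = 1 / theta) - theta) / tau.n)
c(mean = mean(z), var = var(z),
  prop.below.1.96 = mean(z <= 1.96), normal.value = pnorm(1.96))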

6.3 Triangular Arrays

Triangular arrays generalize the sequences of non-identically distributed random variables studied in Section 6.2. When the value of $n$ changes in a triangular array, the distribution of the entire sequence of random variables up to the $n$th random variable may change as well. Such sequences can be represented as doubly indexed sequences of random variables.
Definition 6.1. Let $\{\{X_{nm}\}_{m=1}^{u_n}\}_{n=1}^{\infty}$ be a doubly indexed sequence of random variables where $\{u_n\}_{n=1}^{\infty}$ is a sequence of increasing integers such that $u_n \to \infty$ as $n \to \infty$. Then $\{\{X_{nm}\}_{m=1}^{u_n}\}_{n=1}^{\infty}$ is called a double array of random variables. In the special case that $u_n = n$ for all $n \in \mathbb{N}$, the sequence is called a triangular array.

Under certain conditions the result of Theorem 4.20 can be extended to double
arrays as well. For simplicity we give the result for triangular arrays.
Theorem 6.2. Let $\{\{X_{nk}\}_{k=1}^{n}\}_{n=1}^{\infty}$ be a triangular array where $X_{n1}, \ldots, X_{nn}$ are mutually independent random variables for each $n \in \mathbb{N}$. Suppose that $X_{nk}$ has mean $\mu_{nk}$ and variance $\sigma_{nk}^2 < \infty$ for all $k \in \{1, \ldots, n\}$ and $n \in \mathbb{N}$. Let
\[ \mu_{n\cdot} = \sum_{k=1}^{n} \mu_{nk}, \]
and
\[ \sigma_{n\cdot}^2 = \sum_{k=1}^{n} \sigma_{nk}^2. \]
Then
\[ \lim_{n\to\infty} \max_{k\in\{1,\ldots,n\}} P(|X_{nk} - \mu_{nk}| > \varepsilon\sigma_{n\cdot}) = 0, \tag{6.14} \]
for each $\varepsilon > 0$, and
\[ Z_n = \sigma_{n\cdot}^{-1}\left(\sum_{k=1}^{n} X_{nk} - \mu_{n\cdot}\right) \xrightarrow{d} Z, \]
as $n \to \infty$, where $Z$ is a N$(0, 1)$ random variable, together hold if and only if
\[ \lim_{n\to\infty} \sigma_{n\cdot}^{-2}\sum_{k=1}^{n} E[(X_{nk} - \mu_{nk})^2\delta\{|X_{nk} - \mu_{nk}|; (\varepsilon\sigma_{n\cdot}, \infty)\}] = 0, \tag{6.15} \]
for each $\varepsilon > 0$.

Theorem 6.2 can be generalized to double arrays without much modification to the result above. See Theorem 1.9.3 of Serfling (1980). The condition in Equation (6.14) is called the uniform asymptotic negligibility condition, which essentially establishes bounds on the amount of probability in the tails of the distribution of $X_{nk}$ uniformly within each row. Double arrays of random variables that have this property are said to be holospoudic. See Section 7.1 of Chung (1974). The condition given in Equation (6.15) is the same Lindeberg Condition used in Theorem 6.1. One can also note that Theorem 6.1 is in fact a special case of Theorem 6.2 if we assume that the distribution of the random variable $X_{nk}$ does not change with $n$. Since the Lindeberg Condition also shows up in this result, we are once again confronted with the fact that Equation (6.15) can be difficult to apply in practice. However, there is an analog of Corollary 6.1 which can be applied to the case of triangular arrays as well.
Corollary 6.2. Let $\{\{X_{nk}\}_{k=1}^{n}\}_{n=1}^{\infty}$ be a triangular array where $X_{n1}, \ldots, X_{nn}$ are mutually independent random variables for each $n \in \mathbb{N}$. Suppose that $X_{nk}$ has mean $\mu_{nk}$ and variance $\sigma_{nk}^2 < \infty$ for all $k \in \{1, \ldots, n\}$ and $n \in \mathbb{N}$. Let
\[ \mu_{n\cdot} = \sum_{k=1}^{n} \mu_{nk}, \]
and
\[ \sigma_{n\cdot}^2 = \sum_{k=1}^{n} \sigma_{nk}^2. \]
Suppose that for some $\eta > 2$,
\[ \sum_{k=1}^{n} E(|X_{nk} - \mu_{nk}|^\eta) = o(\sigma_{n\cdot}^\eta), \]
as $n \to \infty$. Then
\[ Z_n = \sigma_{n\cdot}^{-1}\left(\sum_{k=1}^{n} X_{nk} - \mu_{n\cdot}\right) \xrightarrow{d} Z, \]
as $n \to \infty$, where $Z$ is a N$(0, 1)$ random variable.
For the proof of Corollary 6.2, see Exercise 5.
Example 6.2. Consider a triangular array $\{\{X_{nk}\}_{k=1}^{n}\}_{n=1}^{\infty}$ where the sequence $X_{n1}, \ldots, X_{nn}$ is a set of independent and identically distributed random variables from an Exponential$(\theta_n)$ distribution, where $\{\theta_n\}_{n=1}^{\infty}$ is a sequence of positive real numbers that converges to a real value $\theta$ as $n \to \infty$. In this case $\mu_{nk} = \theta_n$ and $\sigma_{nk}^2 = \theta_n^2$ for all $k \in \{1, \ldots, n\}$ and $n \in \mathbb{N}$, so that
\[ \mu_{n\cdot} = \sum_{k=1}^{n} \mu_{nk} = \sum_{k=1}^{n} \theta_n = n\theta_n, \]
and
\[ \sigma_{n\cdot}^2 = \sum_{k=1}^{n} \sigma_{nk}^2 = \sum_{k=1}^{n} \theta_n^2 = n\theta_n^2. \]
We will use Corollary 6.2 with $\eta = 4$, so that we have that
\[ \sum_{k=1}^{n} E(|X_{nk} - \mu_{nk}|^4) = \sum_{k=1}^{n} E(|X_{nk} - \theta_n|^4) = \sum_{k=1}^{n} 9\theta_n^4 = 9n\theta_n^4. \]
Therefore, it follows that
\[ \lim_{n\to\infty} \frac{\sum_{k=1}^{n} E(|X_{nk} - \mu_{nk}|^4)}{\sigma_{n\cdot}^4} = \lim_{n\to\infty} \frac{9n\theta_n^4}{(n\theta_n^2)^2} = \lim_{n\to\infty} 9n^{-1} = 0, \]
and hence,
\[ \sum_{k=1}^{n} E(|X_{nk} - \mu_{nk}|^4) = o(\sigma_{n\cdot}^4), \]
as $n \to \infty$. Therefore, Corollary 6.2 implies that
\[ Z_n = n^{-1/2}\theta_n^{-1}\left(\sum_{k=1}^{n} X_{nk} - n\theta_n\right) \xrightarrow{d} Z, \]
as $n \to \infty$, where $Z$ is a N$(0, 1)$ random variable. □
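A simulation sketch of Example 6.2 follows for one illustrative sequence θn = 2 + 1/n (an assumption made only for this demonstration, not a choice from the text). Each row of the triangular array is an independent Exponential(θn) sample, and the standardized row sum should be approximately standard normal for large n.

set.seed(600)
for (n in c(10, 100, 1000)) {
  theta.n <- 2 + 1 / n
  # standardized row sum: (sum - n * theta_n) / (sqrt(n) * theta_n)
  z <- replicate(2000, (sum(rexp(n, rate = 1 / theta.n)) - n * theta.n) /
                         (sqrt(n) * theta.n))
  cat(sprintf("n = %4d  mean(Z_n) = %6.3f  var(Z_n) = %.3f  P(Z_n <= 0) = %.3f\n",
              n, mean(z), var(z), mean(z <= 0)))
}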


One application of triangular arrays occurs when we consider estimates based
on the empirical distribution function such as bootstrap estimates as described
in Efron (1979). See Beran and Ducharme (1991) for further details on this
type of application.

6.4 Transformed Random Variables

Consider a problem in statistical inference where we have an asymptotically Normal estimate $\hat{\theta}_n$ for a parameter $\theta$, which is computed from a sequence of independent and identically distributed random variables $X_1, \ldots, X_n$. Suppose that the real interest is not in the parameter $\theta$ itself, but in a function
of the parameter given by g(θ). An obvious estimate of g(θ) is g(θ̂n ). The
properties of this estimate depend on the properties of both the estimator θ̂n
and the function g. For example, if θ̂n is a consistent estimator of θ, and g is
continuous at θ, then g(θ̂n ) is a consistent estimator of g(θ) by Theorem 3.7.
On the other hand, if θ̂n is an unbiased estimator of θ, then g(θ̂n ) will typically
not be an unbiased estimator of g(θ) unless g is a linear function. However, it
could be the case that g(θ̂n ) is asymptotically unbiased as n → ∞. See Chap-
ter 10. In this chapter we have been examining conditions under which the
sample mean is asymptotically Normal. We now examine under what condi-
tions a function of an asymptotically Normal sequence of random variables
remains asymptotically Normal. For example, suppose that σn^{-1}(θ̂n − θ) →d Z as n → ∞, where {σn}∞n=1 is a sequence of positive real numbers such that σn → 0 as n → ∞. Under what conditions can we find a normalized version of g(θ̂n) of the form τn^{-1}[g(θ̂n) − g(θ)] that converges in distribution to a N(0, 1) random variable as n → ∞, where {τn}∞n=1 is a sequence of positive constants that converges to zero as n → ∞? The answer turns out to depend on the properties of the function g and on finding the correct sequence {τn}∞n=1 to properly scale the sequence. To find the sequence {τn}∞n=1 we must account for how the transformation of θ̂n changes the variability of the sequence.
Theorem 6.3. Let {Xn}∞n=1 be a sequence of random variables such that σn^{-1}(Xn − µ) →d Z as n → ∞, where Z is a N(0, 1) random variable and {σn}∞n=1 is a real sequence such that
$$\lim_{n\to\infty}\sigma_n = 0.$$
Let g be a real function that has a non-zero derivative at µ. Then,
$$\frac{g(X_n)-g(\mu)}{\sigma_n g'(\mu)} \xrightarrow{d} Z,$$
as n → ∞.

Proof. The method of proof used here is to use Theorem 4.11 (Slutsky) to show that σn^{-1}(Xn − µ) and [σn g′(µ)]^{-1}[g(Xn) − g(µ)] have the same limiting distribution. To this end, define a function h as h(x) = (x − µ)^{-1}[g(x) − g(µ)] − g′(µ) for all x ≠ µ. The definition of the derivative motivates us to define h(µ) = 0. Now, Theorem 4.21 implies that Xn →p µ as n → ∞, and therefore Theorem 3.7 implies that h(Xn) →p h(µ) = 0 as n → ∞. Theorem 4.11 then implies that h(Xn)σn^{-1}(Xn − µ) →p 0 as n → ∞. But this implies that
$$\left[\frac{g(X_n)-g(\mu)}{X_n-\mu} - g'(\mu)\right]\left(\frac{X_n-\mu}{\sigma_n}\right) = \frac{g(X_n)-g(\mu)}{\sigma_n} - g'(\mu)\,\frac{X_n-\mu}{\sigma_n} \xrightarrow{p} 0,$$
as n → ∞, which in turn implies that
$$\frac{g(X_n)-g(\mu)}{\sigma_n g'(\mu)} - \frac{X_n-\mu}{\sigma_n} \xrightarrow{p} 0,$$
as n → ∞. Therefore, Theorem 4.11 implies that
$$\frac{g(X_n)-g(\mu)}{\sigma_n g'(\mu)} - \frac{X_n-\mu}{\sigma_n} + \frac{X_n-\mu}{\sigma_n} \xrightarrow{d} Z,$$
where Z is a N(0, 1) random variable. Hence
$$\frac{g(X_n)-g(\mu)}{\sigma_n g'(\mu)} \xrightarrow{d} Z,$$
as n → ∞, and the result is proven.

Theorem 6.3 indicates that the change in variation required to maintain the asymptotic normality of a transformation of an asymptotically normal random variable is related to the derivative of the transformation. This is because the asymptotic variation in g(Xn) is related to the local change in g near µ, due to the fact that Xn →p µ as n → ∞. To visualize this, consider Figures 6.1 and 6.2. In Figure 6.1 the variation around µ decreases through the function g due to the small derivative of g in a neighborhood of µ, while in Figure 6.2 the variation around µ increases through the function g due to the large derivative of g in a neighborhood of µ. If the derivative of g is zero at µ, then there is no first-order variation in g as Xn approaches µ as n → ∞. This fact does not allow us to obtain an asymptotic Normal result for the transformed sequence of random variables, though other types of weak convergence are possible.
Example 6.3. Let {Xn}∞n=1 be a sequence of independent and identically distributed random variables from a distribution with variance σ² < ∞. Let Sn² be the sample variance. Then, under some minor conditions, it can be shown that n^{1/2}(µ4 − σ⁴)^{-1/2}(Sn² − σ²) →d Z as n → ∞, where Z is a N(0, 1) random variable and µ4 is the fourth moment of Xn about its mean, which is assumed to be finite. Asymptotic normality can also be obtained for the sample standard deviation by considering the function g(x) = x^{1/2}, where
$$\frac{d}{dx}g(x)\Big|_{x=\sigma^2} = \frac{d}{dx}x^{1/2}\Big|_{x=\sigma^2} = \tfrac{1}{2}\sigma^{-1}.$$
Theorem 6.3 then implies that
$$\frac{n^{1/2}[g(S_n^2)-g(\sigma^2)]}{(\mu_4-\sigma^4)^{1/2}\,g'(\sigma^2)} = \frac{2n^{1/2}\sigma(S_n-\sigma)}{(\mu_4-\sigma^4)^{1/2}} \xrightarrow{d} Z,$$
as n → ∞. □
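A quick simulation can illustrate the conclusion of Example 6.3. The sketch below uses Exponential(1) data, for which σ² = 1 and the fourth central moment is 9, so that 2n^{1/2}σ(Sn − σ)/(µ4 − σ⁴)^{1/2} reduces to n^{1/2}(Sn − 1)/√2; the choice of distribution, the sample size, and the use of the divisor-n form of the sample variance are illustrative assumptions.

# Illustration of Example 6.3 with Exponential(1) data, where sigma^2 = 1 and
# the fourth central moment is mu_4 = 9, so (mu_4 - sigma^4)^{1/2} = 2*sqrt(2).
set.seed(1)
n <- 200
reps <- 2000
z <- replicate(reps, {
  x <- rexp(n)
  s2 <- mean((x - mean(x))^2)            # sample variance with divisor n
  2 * sqrt(n) * 1 * (sqrt(s2) - 1) / (2 * sqrt(2))
})
c(mean = mean(z), sd = sd(z))            # should be roughly 0 and 1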

Figure 6.1 When the derivative of g is small in a neighborhood of µ, the variation in g, represented by the vertical grey band about µ, decreases. This is indicated by the horizontal grey band around g(µ).

Example 6.4. Let {Xn}∞n=1 be a sequence of independent and identically distributed random variables from a distribution with E(Xn) = µ and V(Xn) = σ² < ∞. Theorem 4.20 (Lindeberg and Lévy) implies that n^{1/2}σ^{-1}(X̄n − µ) →d Z as n → ∞, where Z is a N(0, 1) random variable. Suppose that we wish to estimate θ = exp(µ) with the plug-in estimator θ̂ = exp(X̄n). Letting g(x) = exp(x) implies that g′(µ) = exp(µ), so that Theorem 6.3 then implies that n^{1/2}exp(−µ)σ^{-1}[exp(X̄n) − exp(µ)] →d Z as n → ∞. □
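The scaling in Example 6.4 is easy to check by simulation; the sketch below uses N(0, 1) data, so that µ = 0, σ = 1, and exp(µ) = 1, all of which are arbitrary choices made for illustration only.

# Example 6.4: n^{1/2} exp(-mu) sigma^{-1} [exp(xbar) - exp(mu)] should be
# approximately N(0,1); here mu = 0 and sigma = 1 by the choice of N(0,1) data.
set.seed(2)
n <- 100
reps <- 2000
z <- replicate(reps, sqrt(n) * (exp(mean(rnorm(n))) - 1))
qqnorm(z); qqline(z)                     # points should lie close to the line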

In the case where the derivative of g is zero at µ, the limiting distribution is no longer Normal. In this case the asymptotic distribution depends on how many derivatives at µ are equal to zero. For example, if g′(µ) = 0 but g″(µ) ≠ 0, it follows that there is a function of the sequence that converges in distribution to Z² as n → ∞, where Z has a N(0, 1) distribution and hence Z² has a ChiSquared(1) distribution.
Figure 6.2 When the derivative of g is large in a neighborhood of µ, the variation in g, represented by the vertical grey band about µ, increases. This is indicated by the horizontal grey band around g(µ).

Theorem 6.4. Let {Xn}∞n=1 be a sequence of random variables such that σn^{-1}(Xn − µ) →d Z as n → ∞, where Z is a N(0, 1) random variable and {σn}∞n=1 is a real sequence such that
$$\lim_{n\to\infty}\sigma_n = 0.$$
Let g be a real function such that
$$\frac{d^m}{dx^m}g(x)\Big|_{x=\mu} \neq 0,$$
for some m ∈ N such that
$$\frac{d^k}{dx^k}g(x)\Big|_{x=\mu} = 0,$$
for all k ∈ {1, . . . , m − 1}. Then,
$$\frac{m![g(X_n)-g(\mu)]}{\sigma_n^m g^{(m)}(\mu)} \xrightarrow{d} Z^m,$$
as n → ∞.

The proof of Theorem 6.4 is the subject of Exercise 8.


Example 6.5. Let {Xn}∞n=1 be a sequence of random variables such that σn^{-1}(Xn − µ) →d Z as n → ∞, where Z is a N(0, 1) random variable and {σn}∞n=1 is a real sequence such that
$$\lim_{n\to\infty}\sigma_n = 0.$$
Consider the transformation g(x) = x². If µ ≠ 0 then
$$\frac{d}{dx}g(x)\Big|_{x=\mu} = 2\mu,$$
and Theorem 6.3 implies that (2µσn)^{-1}(Xn² − µ²) →d Z as n → ∞. On the other hand, if µ = 0 then
$$\frac{d}{dx}g(x)\Big|_{x=\mu} = 0,$$
but
$$\frac{d^2}{dx^2}g(x)\Big|_{x=0} = 2.$$
In this case Theorem 6.4 implies that σn^{-2}(Xn² − µ²) →d Z² as n → ∞. Therefore, in the case where µ = 0, the limiting distribution is a ChiSquared(1) distribution. □
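For the µ = 0 case of Example 6.5, one concrete instance takes Xn to be the sample mean of Uniform(−1, 1) data, so that µ = 0, the variance is 1/3, and σn² = 1/(3n); the following R sketch (with an arbitrary n) compares σn^{-2}Xn² = 3nX̄n² to a ChiSquared(1) distribution.

# Example 6.5 with mu = 0: X_n = sample mean of Uniform(-1,1) data, which has
# variance 1/3, so sigma_n^2 = 1/(3n) and sigma_n^{-2} X_n^2 = 3*n*xbar^2
# should converge in distribution to ChiSquared(1).
set.seed(3)
n <- 100
reps <- 2000
q <- replicate(reps, 3 * n * mean(runif(n, -1, 1))^2)
hist(q, freq = FALSE, breaks = 40, main = "3*n*xbar^2 versus ChiSquared(1)")
curve(dchisq(x, df = 1), add = TRUE)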

It is also possible to extend these results to the multivariate case, though


the result gets slightly more complicated, as the proper transformation now
depends on the matrix of partial derivatives. We will begin by first extending
the result of Theorem 6.3 to the case where g is a function that maps Rd to
R.
Theorem 6.5. Let {Xn}∞n=1 be a sequence of random vectors from a d-dimensional distribution such that n^{1/2}(Xn − θ) →d Z as n → ∞, where Z has a N(0, Σ) distribution, θ is a d × 1 constant vector, and Σ is a positive definite covariance matrix. Let g be a real function that maps Rd to R and let
$$\mathbf{d}(\boldsymbol{\theta}) = \frac{\partial}{\partial\mathbf{x}}g(\mathbf{x})\Big|_{\mathbf{x}=\boldsymbol{\theta}},$$
be the vector of partial derivatives of g evaluated at θ. If d(θ) is not equal to the zero vector and d(x) is continuous in a neighborhood of θ, then n^{1/2}[g(Xn) − g(θ)] →d Z as n → ∞, where Z is a N[0, d′(θ)Σd(θ)] random variable.

Proof. We generalize the argument used to prove Theorem 6.3. Define a function h that maps Rd to R as
$$h(\mathbf{x}) = \|\mathbf{x}-\boldsymbol{\theta}\|^{-1}[g(\mathbf{x})-g(\boldsymbol{\theta})-\mathbf{d}'(\boldsymbol{\theta})(\mathbf{x}-\boldsymbol{\theta})],$$
where we define h(θ) = 0. Therefore,
$$n^{1/2}[g(\mathbf{X}_n)-g(\boldsymbol{\theta})] = n^{1/2}\|\mathbf{X}_n-\boldsymbol{\theta}\|h(\mathbf{X}_n) + n^{1/2}\mathbf{d}'(\boldsymbol{\theta})(\mathbf{X}_n-\boldsymbol{\theta}).$$
By assumption we know that n^{1/2}(Xn − θ) →d Z as n → ∞, where Z has a N(0, Σ) distribution. Therefore Theorem 4.17 (Cramér and Wold) implies that n^{1/2}d′(θ)(Xn − θ) →d Z1 as n → ∞, where Z1 is a N[0, d′(θ)Σd(θ)] random variable. Since ‖·‖ denotes a vector norm on Rd we have that n^{1/2}‖Xn − θ‖ = ‖n^{1/2}(Xn − θ)‖. Since the norm is continuous, Theorem 4.18 implies that ‖n^{1/2}(Xn − θ)‖ →d ‖Z‖ as n → ∞. Serfling (1980) argues that the function h is continuous, and therefore, since Xn →p θ as n → ∞, it follows that h(Xn) →p h(θ) = 0 as n → ∞. Therefore, Theorem 4.11 (Slutsky) implies that n^{1/2}‖Xn − θ‖h(Xn) →p 0 as n → ∞, and another application of Theorem 4.11 implies that
$$n^{1/2}\|\mathbf{X}_n-\boldsymbol{\theta}\|h(\mathbf{X}_n) + n^{1/2}\mathbf{d}'(\boldsymbol{\theta})(\mathbf{X}_n-\boldsymbol{\theta}) \xrightarrow{d} Z_1,$$
as n → ∞, and the result is proven.

Example 6.6. Let {Xn}∞n=1 be a sequence of independent and identically distributed bivariate random vectors from a distribution with mean vector µ and positive definite covariance matrix Σ, where we will assume that X′n = (X1n, X2n) for all n ∈ N, µ′ = (µ1, µ2), and that
$$\boldsymbol{\Sigma} = \begin{pmatrix} \sigma_1^2 & \tau \\ \tau & \sigma_2^2 \end{pmatrix},$$
where all of the elements of µ and Σ will be assumed to be finite. Suppose we are interested in the correlation coefficient given by ρ = τσ1^{-1}σ2^{-1}, which can be estimated with ρ̂n = S12S1^{-1}S2^{-1}, where
$$S_{12} = n^{-1}\sum_{k=1}^{n}(X_{1k}-\bar{X}_1)(X_{2k}-\bar{X}_2),$$
$$S_i^2 = n^{-1}\sum_{k=1}^{n}(X_{ik}-\bar{X}_i)^2,$$
and
$$\bar{X}_i = n^{-1}\sum_{k=1}^{n}X_{ik},$$

for i = 1, 2. We will first show that a properly normalized function of the random vector (S1², S2², S12) converges in distribution to a Normal distribution. Following the arguments of Sen and Singer (1993) we first note that
$$S_{12} = n^{-1}\sum_{k=1}^{n}(X_{1k}-\bar{X}_1)(X_{2k}-\bar{X}_2) = n^{-1}\sum_{k=1}^{n}[(X_{1k}-\mu_1)+(\mu_1-\bar{X}_1)][(X_{2k}-\mu_2)+(\mu_2-\bar{X}_2)]$$
$$= n^{-1}\sum_{k=1}^{n}(X_{1k}-\mu_1)(X_{2k}-\mu_2) + n^{-1}\sum_{k=1}^{n}(X_{1k}-\mu_1)(\mu_2-\bar{X}_2) + n^{-1}\sum_{k=1}^{n}(\mu_1-\bar{X}_1)(X_{2k}-\mu_2) + n^{-1}\sum_{k=1}^{n}(\mu_1-\bar{X}_1)(\mu_2-\bar{X}_2). \qquad (6.16)$$
The two middle terms in Equation (6.16) can be simplified as
$$n^{-1}\sum_{k=1}^{n}(X_{1k}-\mu_1)(\mu_2-\bar{X}_2) = n^{-1}(\mu_2-\bar{X}_2)\sum_{k=1}^{n}(X_{1k}-\mu_1) = (\mu_2-\bar{X}_2)(\bar{X}_1-\mu_1),$$
and
$$n^{-1}\sum_{k=1}^{n}(\mu_1-\bar{X}_1)(X_{2k}-\mu_2) = (\bar{X}_1-\mu_1)(\mu_2-\bar{X}_2).$$
Therefore, it follows that
$$S_{12} = n^{-1}\sum_{k=1}^{n}(X_{1k}-\mu_1)(X_{2k}-\mu_2) - (\mu_1-\bar{X}_1)(\mu_2-\bar{X}_2) = n^{-1}\sum_{k=1}^{n}(X_{1k}-\mu_1)(X_{2k}-\mu_2) + R_{12}.$$
Theorem 3.10 (Weak Law of Large Numbers) implies that X̄1 →p µ1 and X̄2 →p µ2 as n → ∞, so that Theorem 3.9 implies that R12 →p 0 as n → ∞. It can similarly be shown that
$$S_i^2 = n^{-1}\sum_{k=1}^{n}(X_{ik}-\mu_i)^2 + R_i,$$
where Ri →p 0 as n → ∞ for i = 1, 2. See Exercise 11. Now define a random vector U′n = (S1² − σ1², S2² − σ2², S12 − τ) for all n ∈ N. Let λ ∈ R³ where λ′ = (λ1, λ2, λ3), and observe that
$$n^{1/2}\boldsymbol{\lambda}'\mathbf{U}_n = n^{1/2}[\lambda_1(S_1^2-\sigma_1^2) + \lambda_2(S_2^2-\sigma_2^2) + \lambda_3(S_{12}-\tau)]$$
$$= n^{1/2}\left\{\lambda_1\left[n^{-1}\sum_{k=1}^{n}(X_{1k}-\mu_1)^2 + R_1 - \sigma_1^2\right] + \lambda_2\left[n^{-1}\sum_{k=1}^{n}(X_{2k}-\mu_2)^2 + R_2 - \sigma_2^2\right] + \lambda_3\left[n^{-1}\sum_{k=1}^{n}(X_{1k}-\mu_1)(X_{2k}-\mu_2) + R_{12} - \tau\right]\right\}.$$
Combining the remainder terms into one term and the remaining terms into one sum, we find that
$$n^{1/2}\boldsymbol{\lambda}'\mathbf{U}_n = n^{-1/2}\sum_{k=1}^{n}\left\{\lambda_1[(X_{1k}-\mu_1)^2-\sigma_1^2] + \lambda_2[(X_{2k}-\mu_2)^2-\sigma_2^2] + \lambda_3[(X_{1k}-\mu_1)(X_{2k}-\mu_2)-\tau]\right\} + R,$$
where R = n^{1/2}(λ1R1 + λ2R2 + λ3R12). Note that even though each individual remainder term converges to zero in probability, it does not necessarily follow that n^{1/2} times the remainder also converges to zero. However, we will show in Chapter 8 that this does follow in this case. Now, define a sequence of random variables {Vn}∞n=1 as
$$V_k = \lambda_1[(X_{1k}-\mu_1)^2-\sigma_1^2] + \lambda_2[(X_{2k}-\mu_2)^2-\sigma_2^2] + \lambda_3[(X_{1k}-\mu_1)(X_{2k}-\mu_2)-\tau],$$
so that it follows that
$$n^{1/2}\boldsymbol{\lambda}'\mathbf{U}_n = n^{-1/2}\sum_{k=1}^{n}V_k + R,$$
where R →p 0 as n → ∞. The random variables V1, . . . , Vn are a set of independent and identically distributed random variables. The expectation of Vk is given by
$$E(V_k) = E\{\lambda_1[(X_{1k}-\mu_1)^2-\sigma_1^2] + \lambda_2[(X_{2k}-\mu_2)^2-\sigma_2^2] + \lambda_3[(X_{1k}-\mu_1)(X_{2k}-\mu_2)-\tau]\}$$
$$= \lambda_1\{E[(X_{1k}-\mu_1)^2]-\sigma_1^2\} + \lambda_2\{E[(X_{2k}-\mu_2)^2]-\sigma_2^2\} + \lambda_3\{E[(X_{1k}-\mu_1)(X_{2k}-\mu_2)]-\tau\} = 0,$$
where we have used the fact that E[(X1k − µ1)²] = σ1², E[(X2k − µ2)²] = σ2², and E[(X1k − µ1)(X2k − µ2)] = τ. The variance of Vk need not be found explicitly, but it is equal to λ′Λλ, where Λ is the limiting covariance matrix of n^{1/2}Un. Therefore, Theorem 4.20 (Lindeberg and Lévy) implies that
$$n^{-1/2}\sum_{k=1}^{n}V_k \xrightarrow{d} Z,$$
as n → ∞, where Z has a N(0, λ′Λλ) distribution. Theorem 4.11 (Slutsky) then further implies that
$$n^{1/2}\boldsymbol{\lambda}'\mathbf{U}_n = n^{-1/2}\sum_{k=1}^{n}V_k + R \xrightarrow{d} Z,$$
as n → ∞, since R →p 0 as n → ∞. Because λ is arbitrary, it follows from Theorem 4.17 (Cramér and Wold) that n^{1/2}Un →d Z as n → ∞,
where Z has a N(0, Λ) distribution. Using this result, we shall prove that
there is a function of the sample correlation that converges in distribution
to a Normal distribution. Let θ′ = (σ1², σ2², τ) and consider the function g(x) = g(x1, x2, x3) = x3(x1x2)^{-1/2}. Then, it follows that
$$\mathbf{d}'(\mathbf{x}) = \left[-\tfrac{1}{2}x_3x_1^{-3/2}x_2^{-1/2},\;\; -\tfrac{1}{2}x_3x_1^{-1/2}x_2^{-3/2},\;\; (x_1x_2)^{-1/2}\right],$$
so that
$$\mathbf{d}'(\boldsymbol{\theta}) = \left[-\tfrac{1}{2}\rho\sigma_1^{-2},\;\; -\tfrac{1}{2}\rho\sigma_2^{-2},\;\; (\sigma_1\sigma_2)^{-1}\right],$$
which has been written in terms of the correlation ρ = τσ1^{-1}σ2^{-1}. Theorem 6.5 then implies that n^{1/2}(ρ̂n − ρ) →d Z as n → ∞, where Z is a N[0, d′(θ)Λd(θ)] random variable. □
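A simulation can make the conclusion of Example 6.6 concrete. The sketch below draws bivariate normal pairs with a specified correlation and examines the distribution of n^{1/2}(ρ̂n − ρ); the use of MASS::mvrnorm, the parameter values, and the overlaid limiting standard deviation 1 − ρ² (the classical value for bivariate normal data) are assumptions made for this illustration.

# Example 6.6: behavior of sqrt(n)*(rho_hat - rho) for bivariate normal data.
library(MASS)
set.seed(4)
n <- 200; rho <- 0.5; reps <- 2000
Sigma <- matrix(c(1, rho, rho, 1), 2, 2)
z <- replicate(reps, {
  x <- mvrnorm(n, mu = c(0, 0), Sigma = Sigma)
  s12 <- mean((x[, 1] - mean(x[, 1])) * (x[, 2] - mean(x[, 2])))
  s1 <- sqrt(mean((x[, 1] - mean(x[, 1]))^2))
  s2 <- sqrt(mean((x[, 2] - mean(x[, 2]))^2))
  sqrt(n) * (s12 / (s1 * s2) - rho)
})
hist(z, freq = FALSE, breaks = 40)
curve(dnorm(x, 0, 1 - rho^2), add = TRUE)   # (1 - rho^2) is the limiting sd here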

The result of Theorem 6.5 extends to the more general case where g is a func-
tion that maps Rd to Rm . The main requirement for the transformed sequence
to remain asymptotically Normal is that the matrix of partial derivatives of
g must have elements that exist and are non-zero at θ.
Theorem 6.6. Let {Xn}∞n=1 be a sequence of random vectors from a d-dimensional distribution such that n^{1/2}(Xn − θ) →d Z as n → ∞, where Z is a d-dimensional N(0, Σ) random vector, θ is a d × 1 constant vector, and Σ is a d × d covariance matrix. Let g be a real function that maps Rd to Rm such that g(x) = [g1(x), . . . , gm(x)]′ for all x ∈ Rd, where gk(x) is a real function that maps Rd to R. Let D(θ) be the m × d matrix of partial derivatives of g whose (i, j)th element is given by
$$D_{ij}(\boldsymbol{\theta}) = \frac{\partial}{\partial x_j}g_i(\mathbf{x})\Big|_{\mathbf{x}=\boldsymbol{\theta}},$$
for i = 1, . . . , m and j = 1, . . . , d, where x′ = (x1, . . . , xd). If D(θ) exists and Dij(θ) ≠ 0 for all i = 1, . . . , m and j = 1, . . . , d, then n^{1/2}[g(Xn) − g(θ)] →d Z as n → ∞, where Z is an m-dimensional random vector that has a N[0, D(θ)ΣD′(θ)] distribution.

Proof. We follow the method of Serfling (1980), which is based on Theorem 4.17 (Cramér and Wold) and on generalizing the argument used to prove Theorem 6.5. Define functions h1, . . . , hm where
$$h_i(\mathbf{x}) = \|\mathbf{x}-\boldsymbol{\theta}\|^{-1}[g_i(\mathbf{x})-g_i(\boldsymbol{\theta})-\mathbf{d}_i'(\boldsymbol{\theta})(\mathbf{x}-\boldsymbol{\theta})],$$
for i = 1, . . . , m, where d′i(θ) is the ith row of the matrix D(θ), and we note once again that we define hi(θ) = 0 and that hi(x) is continuous for all i = 1, . . . , m. Suppose that v ∈ Rm and note that
$$\mathbf{v}'[\mathbf{g}(\mathbf{X}_n)-\mathbf{g}(\boldsymbol{\theta})] = \sum_{i=1}^{m}v_i[g_i(\mathbf{X}_n)-g_i(\boldsymbol{\theta})] = \sum_{i=1}^{m}v_i[\|\mathbf{X}_n-\boldsymbol{\theta}\|h_i(\mathbf{X}_n) + \mathbf{d}_i'(\boldsymbol{\theta})(\mathbf{X}_n-\boldsymbol{\theta})]$$
$$= \sum_{i=1}^{m}v_i\|\mathbf{X}_n-\boldsymbol{\theta}\|h_i(\mathbf{X}_n) + \sum_{i=1}^{m}v_i\mathbf{d}_i'(\boldsymbol{\theta})(\mathbf{X}_n-\boldsymbol{\theta}). \qquad (6.17)$$
The second term on the right hand side of Equation (6.17) can be written as
$$\sum_{i=1}^{m}v_i\mathbf{d}_i'(\boldsymbol{\theta})(\mathbf{X}_n-\boldsymbol{\theta}) = \mathbf{v}'\mathbf{D}(\boldsymbol{\theta})(\mathbf{X}_n-\boldsymbol{\theta}).$$
By assumption we have that n^{1/2}(Xn − θ) →d Z as n → ∞, where Z is a N(0, Σ) random vector. Therefore, Theorem 4.17 implies that n^{1/2}v′D(θ)(Xn − θ) →d v′D(θ)Z = W as n → ∞, where W is a N[0, v′D(θ)ΣD′(θ)v] random variable. For the first term in Equation (6.17), we note that
$$\sum_{i=1}^{m}v_i\|\mathbf{X}_n-\boldsymbol{\theta}\|h_i(\mathbf{X}_n) = \mathbf{v}'\mathbf{h}(\mathbf{X}_n)\|\mathbf{X}_n-\boldsymbol{\theta}\|,$$
where h′(x) = [h1(x), . . . , hm(x)]. The fact that n^{1/2}‖Xn − θ‖ = ‖n^{1/2}(Xn − θ)‖ and that n^{1/2}(Xn − θ) →d Z implies once again that ‖n^{1/2}(Xn − θ)‖ →d ‖Z‖ as n → ∞, as in the proof of Theorem 6.5. It also follows that h(Xn) →p h(θ) = 0 as n → ∞, so that Theorem 4.11 (Slutsky) and Theorem 3.9 imply that n^{1/2}v′h(Xn)‖Xn − θ‖ →p v′0‖Z‖ = 0 as n → ∞, and another application of Theorem 4.11 (Slutsky) implies that
$$n^{1/2}\mathbf{v}'[\mathbf{g}(\mathbf{X}_n)-\mathbf{g}(\boldsymbol{\theta})] = n^{1/2}\mathbf{v}'\mathbf{h}(\mathbf{X}_n)\|\mathbf{X}_n-\boldsymbol{\theta}\| + n^{1/2}\mathbf{v}'\mathbf{D}(\boldsymbol{\theta})(\mathbf{X}_n-\boldsymbol{\theta}) \xrightarrow{d} \mathbf{v}'\mathbf{D}(\boldsymbol{\theta})\mathbf{Z},$$
as n → ∞. Theorem 4.17 (Cramér and Wold) then implies that n^{1/2}[g(Xn) − g(θ)] →d D(θ)Z as n → ∞, due to the fact that v is arbitrary. Therefore, the result is proven.
Example 6.7. Let {Xn}∞n=1 be a sequence of independent and identically distributed random variables from a distribution where E(|Xn|⁶) < ∞. Consider the sequence of three-dimensional random vectors defined by
$$\mathbf{Y}_n = \begin{pmatrix} n^{-1}\sum_{k=1}^{n}X_k \\ n^{-1}\sum_{k=1}^{n}X_k^2 \\ n^{-1}\sum_{k=1}^{n}X_k^3 \end{pmatrix} = \begin{pmatrix} \hat{\mu}_1' \\ \hat{\mu}_2' \\ \hat{\mu}_3' \end{pmatrix},$$
for all n ∈ N. Let µ′ = (µ1′, µ2′, µ3′); then Theorem 4.25 implies that n^{1/2}(Yn − µ) →d Z as n → ∞, where Z has a N(0, Σ) distribution with
$$\boldsymbol{\Sigma} = \begin{pmatrix} \mu_2'-(\mu_1')^2 & \mu_3'-\mu_1'\mu_2' & \mu_4'-\mu_1'\mu_3' \\ \mu_3'-\mu_1'\mu_2' & \mu_4'-(\mu_2')^2 & \mu_5'-\mu_2'\mu_3' \\ \mu_4'-\mu_1'\mu_3' & \mu_5'-\mu_2'\mu_3' & \mu_6'-(\mu_3')^2 \end{pmatrix}.$$
We will consider a function that maps the first three moments to the first three moments about the mean. That is, consider the function
$$\mathbf{g}(\mathbf{x}) = \begin{pmatrix} g_1(\mathbf{x}) \\ g_2(\mathbf{x}) \\ g_3(\mathbf{x}) \end{pmatrix} = \begin{pmatrix} x_1 \\ x_2 - x_1^2 \\ x_3 - 3x_1x_2 + 2x_1^3 \end{pmatrix},$$
where x′ = (x1, x2, x3) ∈ R³. In this case the matrix D(θ) is given by
$$\mathbf{D}(\boldsymbol{\theta}) = \begin{pmatrix} 1 & 0 & 0 \\ -2\mu_1' & 1 & 0 \\ -3\mu_2'+6(\mu_1')^2 & -3\mu_1' & 1 \end{pmatrix},$$
and Theorem 6.6 implies that n^{1/2}[g(Yn) − g(µ)] →d Z as n → ∞, where Z is a random vector in R³ that has a N[0, D(θ)ΣD′(θ)] distribution. □
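To see that the map g in Example 6.7 really does convert raw moments to moments about the mean, one can apply it to sample moments; the following R sketch does so for Gamma(2, 1) data, which is an arbitrary choice of distribution.

# Example 6.7: the map g sends (m1', m2', m3') to the first three moments
# about the mean; here it is applied to sample moments of Gamma(2,1) data.
set.seed(5)
x <- rgamma(5000, shape = 2, rate = 1)
m1 <- mean(x); m2 <- mean(x^2); m3 <- mean(x^3)
g <- c(m1, m2 - m1^2, m3 - 3 * m1 * m2 + 2 * m1^3)
central <- c(mean(x), mean((x - mean(x))^2), mean((x - mean(x))^3))
rbind(g, central)                        # the two rows should agree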
Theorem 6.4 showed how it is possible in some cases to obtain limiting distributions that are related to the ChiSquared distribution. In those cases the result depended on the fact that the function of the asymptotically normal random variable had a first derivative that vanished at the limiting value, but whose second derivative did not. This is not the only way to find sequences of random variables that have an asymptotic ChiSquared distribution. For example, suppose that {Xn}∞n=1 is a sequence of random variables such that Xn →d Z as n → ∞, where Z is a N(0, 1) random variable. Then Theorem 4.12 implies that Xn² →d Z² as n → ∞, where Z² has a ChiSquared(1) distribution. In the multivariate case we extend this result to quadratic forms of sequences of random vectors that have a limiting normal distribution. This development is very important to the asymptotic theory of regression analysis and linear models.
Definition 6.2. Let X be a d-dimensional random variable and let C be a d × d symmetric matrix of real values; then X′CX is a quadratic form of X.
It is clear that a quadratic form is a polynomial function of the elements of X,
which is a function that is continuous everywhere. Theorem 4.18 then implies
that a quadratic form of any sequence of random vectors that converge in
distribution to a random vector, converges in distribution to the quadratic
form of the limiting random vector.
Theorem 6.7. Let {Xn}∞n=1 be a sequence of d-dimensional random vectors that converge in distribution to a random vector X as n → ∞. Let C be a d × d symmetric matrix of real values; then X′nCXn →d X′CX as n → ∞.
When a random vector X has a normal distribution, under certain conditions
on the covariance matrix of X and the form of the matrix C it is possible
to obtain a quadratic form that has a non-central ChiSquared distribution.
There are many conditions under which this type of result can be obtained. We
will consider one such condition, following along the general development of
Serfling (1980). Additional results on quadratic forms can be found in Chapter
1 of Christensen (1996) and Chapter 4 of Graybill (1976).
Theorem 6.8. Suppose that X has a d-dimensional N(µ, Σ) distribution and let C be a d × d symmetric matrix. Assume that η′Σ = 0 implies that η′µ = 0; then X′CX has a non-central ChiSquared distribution with trace(CΣ) degrees of freedom and non-centrality parameter equal to µ′Cµ if and only if ΣCΣCΣ = ΣCΣ.
Theorem 6.8 implies that if {Xn }∞n=1 is a sequence of d-dimensional random
vectors that converge to a N(µ, Σ) distribution, then a quadratic form of this
sequence will converge to a non-central ChiSquared distribution as long as
the limiting covariance matrix and the form of the quadratic form follow the
assumptions outlined in the result.
TRANSFORMED RANDOM VARIABLES 277
Example 6.8. Suppose that {Xn }∞ n=1 is a sequence of random vectors such
that Xn has a Multinomial(n, d, p) distribution for all n ∈ N where p0 =
(p1 , p2 , . . . , pd ) and pk > 0 for all k ∈ {1, . . . , d}. Note that Xn can be gen-
erated by summing n independent Multinomial(1, d, p) random variables.
That is, we can take

n
X
Xn = Dk ,
k=1

where Dk has a Multinomial(1, d, p) distribution for k = 1, . . . , n and D1 ,


. . ., Dn are mutually independent. Therefore, noting that E(Xn ) = np, The-
d
orem 4.22 implies that n1/2 Σ−1/2 (n−1 Xn − p) − → Z, as n → ∞ where Z is
a N(0, I) random vector, and Σ is the covariance matrix of Dn which has
(i, j)th element given by pi (δij − pj ) for i = 1, . . . , d and j = 1, . . . , d, where
(
1 when i = j,
δij =
0 when i 6= j.

A popular test statistic for testing the null hypothesis that the probability vector is equal to a proposed model p is given by
$$T_n = \sum_{k=1}^{d}(np_k)^{-1}(X_{nk}-np_k)^2,$$
where we assume that X′n = (Xn1, . . . , Xnd) for all n ∈ N. Defining Yn = n^{1/2}(n^{-1}Xn − p), we note that Tn can be written as a quadratic form in terms of Yn as Tn = Y′nCYn where
$$\mathbf{C} = \begin{pmatrix} p_1^{-1} & 0 & \cdots & 0 \\ 0 & p_2^{-1} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & p_d^{-1} \end{pmatrix}.$$

Therefore, C has (i, j)th element Cij = δij pi^{-1}. In order to verify that Tn has an asymptotic non-central ChiSquared distribution, we first need to verify that for every vector η such that η′Σ = 0 it follows that η′µ = 0. In this case the limiting distribution of Yn is a N(0, Σ) distribution, so that µ = 0. Hence η′µ = 0 for all η and the property follows. We now need to verify that ΣCΣCΣ = ΣCΣ. To verify this property we first note that the (i, j)th element of the product CΣ has the form
$$(\mathbf{C}\boldsymbol{\Sigma})_{ij} = \sum_{k=1}^{d}\delta_{ik}p_i^{-1}p_k(\delta_{kj}-p_j) = \delta_{ii}(\delta_{ij}-p_j) = \delta_{ij}-p_j.$$
This implies that the (i, j)th element of CΣCΣ is
$$(\mathbf{C}\boldsymbol{\Sigma}\mathbf{C}\boldsymbol{\Sigma})_{ij} = \sum_{k=1}^{d}(\delta_{ik}-p_k)(\delta_{kj}-p_j) = \sum_{k=1}^{d}(\delta_{ik}\delta_{kj} - p_k\delta_{kj} - \delta_{ik}p_j + p_kp_j)$$
$$= \delta_{ij} - p_j - p_j + p_j\sum_{k=1}^{d}p_k = \delta_{ij} - p_j - p_j + p_j = \delta_{ij} - p_j.$$
Thus, it follows that CΣCΣ = CΣ and hence ΣCΣCΣ = ΣCΣ. It then follows from Theorem 6.8 that Tn →d Q as n → ∞, where Q is a ChiSquared(trace(CΣ), µ′Cµ) random variable. Noting that
$$\mathrm{trace}(\mathbf{C}\boldsymbol{\Sigma}) = \sum_{i=1}^{d}(\delta_{ii}-p_i) = \sum_{i=1}^{d}(1-p_i) = d-1,$$
and that µ′Cµ = 0, it follows that the random variable Q has a ChiSquared(d − 1) distribution. □
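The matrix identities used in Example 6.8 are easy to confirm numerically; the R sketch below does so for an arbitrary probability vector p, constructing Σ from the multinomial covariance formula given above.

# Example 6.8: numerical check that (C Sigma)(C Sigma) = C Sigma and
# trace(C Sigma) = d - 1 for a multinomial covariance matrix Sigma.
p <- c(0.2, 0.3, 0.5)                    # arbitrary probability vector
d <- length(p)
Sigma <- diag(p) - p %*% t(p)            # (i,j) element p_i*(delta_ij - p_j)
C <- diag(1/p)
CS <- C %*% Sigma
max(abs(CS %*% CS - CS))                 # should be numerically zero
sum(diag(CS))                            # should equal d - 1 = 2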

6.5 Exercises and Experiments

6.5.1 Exercises

1. Prove that Theorem 6.1 (Lindeberg, Lévy, and Feller) reduces to Theorem
4.20 (Lindeberg and Lévy) when {Xn }∞ n=1 is a sequence of independent
and identically distributed random variables.
2. Let {Xn }∞
n=1 be a sequence of independent random variables where Xn has
a Gamma(θn , 2) distribution where {θn }∞
n=1 is a sequence of positive real
numbers.

a. Find a non-trivial sequence {θn }∞


n=1 such that the assumptions of The-
orem 6.1 (Lindeberg, Lévy, and Feller) hold, and describe the resulting
conclusion for the weak convergence of X̄n .
b. Find a non-trivial sequence {θn }∞
n=1 such that the assumptions of The-
orem 6.1 (Lindeberg, Lévy, and Feller) do not hold.

3. Let {Xn }∞
n=1 be a sequence of independent random variables where Xn has
a Bernoulli(θn ) distribution where {θn }∞ n=1 is a sequence of real numbers.
Find a non-trivial sequence {θn }∞n=1 such that the assumptions of Theorem
6.1 (Lindeberg, Lévy, and Feller) hold, and describe the resulting conclusion
for the weak convergence of X̄n .
4. In the context of Theorem 6.1 (Lindeberg, Lévy, and Feller), prove that
Equation (6.3) implies Equation (6.2).
5. Prove Corollary 6.2. That is, let {{Xnk}ⁿₖ₌₁}∞ₙ₌₁ be a triangular array where Xn1, . . . , Xnn are mutually independent random variables for each n ∈ N. Suppose that Xnk has mean µnk and variance σ²nk for all k ∈ {1, . . . , n} and n ∈ N. Let
$$\mu_{n\cdot} = \sum_{k=1}^{n}\mu_{nk},$$
and
$$\sigma_{n\cdot}^{2} = \sum_{k=1}^{n}\sigma_{nk}^{2}.$$
Prove that if for some η > 2,
$$\sum_{k=1}^{n}E(|X_{nk}-\mu_{nk}|^{\eta}) = o(\sigma_{n\cdot}^{\eta}),$$
as n → ∞, then
$$Z_n = \sigma_{n\cdot}^{-1}\left(\sum_{k=1}^{n}X_{nk}-\mu_{n\cdot}\right) \xrightarrow{d} Z,$$
as n → ∞, where Z is a N(0, 1) random variable.


6. Let {{Xn,k}ⁿₖ₌₁}∞ₙ₌₁ be a triangular array of random variables where Xn,k has a Gamma(θn,k, 2) distribution, where {{θn,k}ⁿₖ₌₁}∞ₙ₌₁ is a triangular array of positive real numbers. Find a non-trivial triangular array of the form {{θn,k}ⁿₖ₌₁}∞ₙ₌₁ such that the assumptions of Theorem 6.2 hold and describe the resulting conclusion for the weak convergence of
$$\sum_{k=1}^{n}X_{nk}.$$

7. Let {{Xn,k}ⁿₖ₌₁}∞ₙ₌₁ be a triangular array of random variables where Xn,k has a Bernoulli(θn,k) distribution, where {{θn,k}ⁿₖ₌₁}∞ₙ₌₁ is a triangular array of real numbers that are between zero and one. Find a non-trivial triangular array {{θn,k}ⁿₖ₌₁}∞ₙ₌₁ such that the assumptions of Theorem 6.2 hold and describe the resulting conclusion for the weak convergence of
$$\sum_{k=1}^{n}X_{nk}.$$

8. Prove Theorem 6.4. That is, suppose that {Xn}∞n=1 is a sequence of random variables such that σn^{-1}(Xn − µ) →d Z as n → ∞, where Z is a N(0, 1) random variable and {σn}∞n=1 is a real sequence such that
$$\lim_{n\to\infty}\sigma_n = 0.$$
Let g be a real function such that
$$\frac{d^m}{dx^m}g(x)\Big|_{x=\mu} \neq 0,$$
for some m ∈ N where
$$\frac{d^k}{dx^k}g(x)\Big|_{x=\mu} = 0,$$
for all k ∈ {1, . . . , m − 1}. Then prove that
$$\frac{m![g(X_n)-g(\mu)]}{\sigma_n^m g^{(m)}(\mu)} \xrightarrow{d} Z^m,$$
as n → ∞.
9. Let {Xn}∞n=1 be a sequence of random variables such that σn^{-1}(Xn − µ) →d Z as n → ∞, where Z is a N(0, 1) random variable and {σn}∞n=1 is a real sequence such that
$$\lim_{n\to\infty}\sigma_n = 0.$$
Consider the transformation g(x) = ax + b, where a and b are known real constants and a ≠ 0. Derive the asymptotic behavior, normal or otherwise, of g(Xn) as n → ∞.
10. Let {Xn}∞n=1 be a sequence of random variables such that σn^{-1}(Xn − µ) →d Z as n → ∞, where Z is a N(0, 1) random variable and {σn}∞n=1 is a real sequence such that
$$\lim_{n\to\infty}\sigma_n = 0.$$
Consider the transformation g(x) = x³.
a. Suppose that µ > 0. Derive the asymptotic behavior, normal or otherwise, of g(Xn) as n → ∞.
b. Suppose that µ = 0. Derive the asymptotic behavior, normal or otherwise, of g(Xn) as n → ∞.

11. Let {Xn}∞n=1 be a set of independent and identically distributed random variables from a distribution with mean µ and finite variance σ². Show that
$$S^2 = n^{-1}\sum_{k=1}^{n}(X_k-\bar{X})^2 = n^{-1}\sum_{k=1}^{n}(X_k-\mu)^2 + R,$$
where R →p 0 as n → ∞.
12. In Example 6.6, find Λ and d′(θ)Λd(θ).
13. Let {Xn} be a sequence of d-dimensional random vectors where Xn →d Z as n → ∞, where Z has a N(0, I) distribution. Let A be an m × d matrix and find the asymptotic distribution of the sequence {AXn}∞n=1 as n → ∞. Describe any additional assumptions that may need to be made for the matrix A.
14. Let {Xn} be a sequence of d-dimensional random vectors where Xn →d Z as n → ∞, where Z has a N(0, I) distribution. Let A be an m × d matrix and let b be an m × 1 real valued vector. Find the asymptotic distribution of the sequence {AXn + b}∞n=1 as n → ∞. Describe any additional assumptions that may need to be made for the matrix A and the vector b. What effect does adding the vector b have on the asymptotic result?
15. Let {Xn} be a sequence of d-dimensional random vectors where Xn →d Z as n → ∞, where Z has a N(0, I) distribution. Let A be a symmetric d × d matrix and find the asymptotic distribution of the sequence {X′nAXn}∞n=1 as n → ∞. Describe any additional assumptions that need to be made for the matrix A.
16. Let {Xn}∞n=1 be a sequence of two-dimensional random vectors where Xn →d Z as n → ∞, where Z has a N(0, I) distribution. Consider the transformation g(x) = x1 + x2 + x1x2 where x′ = (x1, x2). Find the asymptotic distribution of g(Xn) as n → ∞.
17. Let {Xn}∞n=1 be a sequence of three-dimensional random vectors where Xn →d Z as n → ∞, where Z has a N(0, I) distribution. Consider the transformation g(x) = [x1x2 + x3, x1x3 + x2, x2x3 + x1] where x′ = (x1, x2, x3). Find the asymptotic distribution of g(Xn) as n → ∞.

6.5.2 Experiments

1. Write a program in R that simulates 1000 samples of size n from an Ex-


ponential(1) distribution. On each sample compute n1/2 (X̄n − 1) and
n1/2 (X̄n2 − 1). Make a density histogram of the 1000 values of n1/2 (X̄n − 1),
and on a separate plot make a histogram with the same scale of the 1000
values of n1/2 (X̄n2 − 1). Compare the variability of the two histograms with
what is predicted by theory, and describe how the transformation changes
the variability of the sequences. Repeat the experiment for n = 5, 10, 25,
100 and 500 and describe how both sequences converge to a Normal dis-
tribution.
2. Write a program in R that simulates 1000 observations from a Multinomial(n, 3, p) distribution where p′ = (1/4, 1/4, 1/2). On each observation compute
$$T_n = \sum_{k=1}^{3}(np_k)^{-1}(X_{nk}-np_k)^2,$$
where X′n = (Xn1, Xn2, Xn3). Make a density histogram of the 1000 values of Tn and overlay the plot with a plot of the ChiSquared(2) density for comparison. Repeat the experiment for n = 5, 10, 25, 100 and 500 and describe how the distribution of Tn converges to a ChiSquared(2) distribution.
3. Write a program in R that simulates 1000 sequences of independent random variables of length n where the kth variable in the sequence has an Exponential(θk) distribution where θk = k^{-1/2} for all k ∈ N. For each simulated sequence, compute Zn = n^{1/2}τn^{-1}(X̄n − µ̄n), where µ̄n = n^{-1}(θ1 + · · · + θn) and
$$\tau_k^2 = k^{-1}\sum_{i=1}^{k}i^{-1}.$$
Plot the 1000 values of Zn on a density histogram and overlay the histogram with a plot of a N(0, 1) density. Repeat the experiment for n = 5, 10, 25, 100 and 500 and describe how the distribution converges.
4. Write a program in R that simulates 1000 samples of size n from a Uni-
form(θ1 , θ2 ) distribution where n, θ1 , and θ2 are specified below. For each
sample compute Zn = n1/2 σ −1 (X̄n2 − µ2 ) where X̄n is the mean of the
observed sample and µ and σ correspond to the mean and standard de-
viation of a Uniform(θ1 , θ2 ) distribution. Plot a histogram of the 1000
observed values of Zn for each case listed below and compare the shape of
the histograms to what would be expected.
a. θ1 = −1, θ2 = 1.
b. θ1 = 0, θ2 = 1.
CHAPTER 7

Asymptotic Expansions for


Distributions

But when he talks about them and compares them with himself and his colleagues
there’s a small error running through what he says, and, just for your interest,
I’ll tell you about it.
The Trial by Franz Kafka

7.1 Approximating a Distribution

Let us consider the case of a sequence of independent and identically distributed random variables {Xn}∞n=1 from a distribution F whose mean is µ and variance is σ². Assume that E(|Xn|³) < ∞. Then Theorem 4.20 (Lindeberg and Lévy) implies that Zn = n^{1/2}σ^{-1}(X̄n − µ) →d Z as n → ∞, where Z has a N(0, 1) distribution. This implies that
$$\lim_{n\to\infty}P(Z_n \leq t) = \Phi(t),$$
for all t ∈ R. Define Rn(t) = Φ(t) − P(Zn ≤ t) for all t ∈ R and n ∈ N. This in turn implies that P(Zn ≤ t) = Φ(t) + Rn(t). Theorem 4.24 (Berry and Esseen) implies that |Rn(t)| ≤ n^{-1/2}Bρ, where B is a finite constant that does not depend on n or t and ρ is the third absolute moment about the mean of F. Noting that ρ also does not depend on n and t, we have that
$$\lim_{n\to\infty}n^{1/2}|R_n(t)| \leq B\rho,$$
uniformly in t. Therefore, Definition 1.7 implies that Rn(t) = O(n^{-1/2}) and we obtain the asymptotic expansion P(Zn ≤ t) = Φ(t) + O(n^{-1/2}) as n → ∞. Similar arguments also lead to the alternate expansion P(Zn ≤ t) = Φ(t) + o(1) as n → ∞.
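The rate O(n^{-1/2}) can also be observed numerically. For Exponential(1) data, Zn = n^{1/2}(X̄n − 1) has an exactly known distribution (the sum of the observations is Gamma), so the maximum error supt |P(Zn ≤ t) − Φ(t)| can be computed directly; the evaluation grid in the R sketch below is an arbitrary choice.

# Error of the normal approximation for Exponential(1) data:
# Z_n = sqrt(n)*(xbar - 1), whose exact CDF is a shifted/scaled Gamma CDF.
t <- seq(-4, 4, by = 0.01)               # arbitrary evaluation grid
for (n in c(5, 10, 25, 100, 500)) {
  exact <- pgamma(n + t * sqrt(n), shape = n, rate = 1)
  err <- max(abs(exact - pnorm(t)))
  cat(n, err, sqrt(n) * err, "\n")       # sqrt(n)*err stays roughly constant
}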
The purpose of this chapter is to extend this idea to higher order expansions.
Our focus will be on distributions that are asymptotically normal via Theorem
4.20 (Lindeberg and Lévy). In the case of distributions of a sample mean, it
is possible to obtain an asymptotic expansion for the density or distribution
function that has an error term that is O(n−(p+1)/2 ) or o(n−p/2 ) as n → ∞
with the addition of several assumptions.

283
In this chapter we will specifically consider the case of obtaining an asymp-
totic expansion that has an error term that is o(n−1/2 ) as n → ∞. We will
also consider inverting this expansion to obtain an asymptotic expansion for
the quantile function whose error term is also o(n−1/2 ) as n → ∞. These re-
sults can be extended beyond the sample mean to smooth functions of vector
means through what is usually known as the smooth function model and we
will explore both this model and the corresponding expansion theory. This
chapter concludes with a brief description of saddlepoint expansions which
are designed to provide more accuracy to these approximations under certain
conditions.

7.2 Edgeworth Expansions

An Edgeworth expansion is an asymptotic expansion for the standardized


distribution of a sample mean. In essence, the expansion improves the accuracy
of Theorem 4.20 (Lindeberg and Lévy) under additional assumptions on the
finiteness of the moments and the smoothness of the underlying distribution
of the population. Some key elements of Edgeworth expansions are that the
error term is uniform over the real line and that the terms of the expansion
depend on the moments of the underlying distribution of the data. The fact
that the terms of the expansion depend on the moments of the population
implies that we are able to observe what properties of a population affect the
accuracy of Theorem 4.20.
The historical development of the Edgeworth expansions begins with the work
of P. Tchebycheff and F. Y. Edgeworth. Specifically, one can refer to Tcheby-
cheff (1890) and Edgeworth (1896, 1905). In both cases the distribution func-
tion was the focus of the work, and both worked with sums of independent
random variables as we do in this section. A rigorous treatment of these ex-
pansions was considered by Cramér (1928). Chapter VII of Cramér (1946) also
addresses this subject. Similar types of expansions, known as Gram–Charlier
and Brun–Charlier expansions, can be seen as the basis of the work of Tcheby-
cheff and Edgeworth, though these expansions have less developed convergence
properties. See Cramér (1946,1970), Edgeworth (1907), and Johnson, Kotz,
and Balakrishnan (1994) for further details.
In this section we will explicitly derive a one-term Edgeworth expansion, which has the form φ(t) + (1/6)σ^{-3}n^{-1/2}µ3(t³ − 3t)φ(t), and we will demonstrate that the error associated with this expansion is o(n^{-1/2}) as n → ∞. This result provides an approximation for the standardized distribution of a sample mean that is more accurate than the approximation given by Theorem 4.20 (Lindeberg and Lévy). We provide a detailed proof of this result, relying on the method of Feller (1971) for our arguments. The main idea of the proof is based on the inversion formula for characteristic functions, given by Theorem 2.28. Let {ψn(t)}∞n=1 be a sequence of integrable characteristic functions and
let ψ(t) be another integrable characteristic function such that
$$\lim_{n\to\infty}\int_{-\infty}^{\infty}|\psi_n(t)-\psi(t)|\,dt = 0.$$
If Fn is the distribution associated with ψn for all n ∈ N and F is the distribution function associated with ψ, then Theorem 2.28 implies that Fn has a bounded and continuous density fn for all n ∈ N and that F has a bounded and continuous density f. Further, Theorem 2.28 implies that
$$|f_n(x)-f(x)| = \left|(2\pi)^{-1}\int_{-\infty}^{\infty}\exp(-itx)\psi_n(t)\,dt - (2\pi)^{-1}\int_{-\infty}^{\infty}\exp(-itx)\psi(t)\,dt\right|$$
$$= \left|(2\pi)^{-1}\int_{-\infty}^{\infty}\exp(-itx)[\psi_n(t)-\psi(t)]\,dt\right| \leq (2\pi)^{-1}\int_{-\infty}^{\infty}|\exp(-itx)||\psi_n(t)-\psi(t)|\,dt,$$
where the inequality comes from an application of Theorem A.6. Noting that |exp(−itx)| ≤ 1, it then follows that
$$|f_n(x)-f(x)| \leq (2\pi)^{-1}\int_{-\infty}^{\infty}|\psi_n(t)-\psi(t)|\,dt. \qquad (7.1)$$

For further details see Section XV.3 of Feller (1971). The general method
of proof for determining the error associated with the one-term Edgeworth
expansion is based on computing the integral of the distance between the
characteristic function of the standardized density of the sample mean and the
characteristic function of the approximating expansion, which then bounds the
difference between the corresponding densities. It is important to note that the
bound given in Equation (7.1) is uniform in x, a property that will translate
to the error term of the Edgeworth expansions.
A slight technicality should be addressed at this point. The expansion φ(t) + (1/6)σ^{-3}n^{-1/2}µ3(t³ − 3t)φ(t) is typically not a valid density function as it usually
does not integrate to one and is not always non-negative. In this case we
are really computing a Fourier transformation on the expansion. Under the
assumptions we shall impose it follows that the Fourier inversion theorem
still works as detailed in Theorem 2.28 with the exception that the Fourier
transformation of the expansion may not strictly be a valid characteristic
function. See Theorem 4.1 of Bhattacharya and Rao (1976). The arguments
producing the bound in Equation (7.1) also follow, with the right hand side
being called the Fourier norm by Feller (1971). See Exercise 1.
Another technical matter arises when taking the Fourier transformation of the
expansion. The second term in the expansion is given by a constant multiplied
by H3 (x)φ(x). It is therefore convenient to have a mechanism for computing
the Fourier transformation of a function of this type.
Theorem 7.1. The Fourier transformation of Hk(x)φ(x) is (it)^k exp(−½t²).
Proof. We will prove one specific case of this result. For a proof of the general result see Exercise 2. In this particular case we will evaluate the integral
$$\int_{-\infty}^{\infty}\phi^{(3)}(x)\exp(-itx)\,dx,$$
which from Definition 1.6 is the Fourier transformation of −H3(x)φ(x). Using Theorem A.4 with u = exp(−itx), v = φ^{(2)}(x), du = −it exp(−itx)dx, and dv = φ^{(3)}(x)dx, we have that
$$\int_{-\infty}^{\infty}\phi^{(3)}(x)\exp(-itx)\,dx = \exp(-itx)\phi^{(2)}(x)\Big|_{-\infty}^{\infty} + (it)\int_{-\infty}^{\infty}\phi^{(2)}(x)\exp(-itx)\,dx. \qquad (7.2)$$
Since u = exp(−itx) is a bounded function, it follows by taking the appropriate limits of φ^{(2)}(x) that the first term of Equation (7.2) is zero. To evaluate the second term in Equation (7.2) we use another application of Theorem A.4 to show that
$$(it)\int_{-\infty}^{\infty}\phi^{(2)}(x)\exp(-itx)\,dx = (it)^2\int_{-\infty}^{\infty}\phi^{(1)}(x)\exp(-itx)\,dx,$$
for which yet another application of Theorem A.4 implies that
$$(it)^2\int_{-\infty}^{\infty}\phi^{(1)}(x)\exp(-itx)\,dx = (it)^3\exp(-\tfrac{1}{2}t^2).$$
This proves the result for the special case when k = 3.


The development of the Edgeworth expansion is also heavily dependent on the
properties of characteristic functions. We first require a bound for the char-
acteristic function for continuous random variables. The result given below
provides a somewhat more general result that characterizes the behavior of
characteristic functions in relation to the properties of F .
Theorem 7.2. Let X be a random variable with distribution F and charac-
teristic function ψ. Then either
1. |ψ(t)| < 1 for all t ≠ 0, or
2. |ψ(u)| = 1 and |ψ(t)| < 1 for t ∈ (0, u) for which the values of X are
concentrated on a regular grid, or
3. |ψ(t)| = 1 for all t ∈ R, for which X is concentrated at a single point.
A more specific result concerning the nature of the random variable and the
corresponding characteristic function for the second condition is addressed in
Theorem 7.6, which is presented later in this section. Further discussion about
Theorem 7.2 and its proof can be found in Section XV.1 of Feller (1971). We
will also require some characterization about the asymptotic tail behavior of
characteristic functions.
Theorem 7.3 (Riemann and Lebesgue). Suppose that X is an absolutely
continuous random variable with characteristic function ψ. Then,
$$\lim_{t\to\pm\infty}|\psi(t)| = 0.$$

A proof of Theorem 7.3 can be found in Section 4.1 of Gut (2005). We now
have a collection of theory that is suitable for determining the asymptotic
characteristics of the error term of the Edgeworth expansion.
Theorem 7.4. Let {Xn}∞n=1 be a sequence of independent and identically distributed random variables where Xn has characteristic function ψ for all n ∈ N. Let Fn(x) = P[n^{1/2}σ^{-1}(X̄n − µ) ≤ x], with density fn(x) for all x ∈ R, where µ = E(Xn) and σ² = V(Xn). Assume that E(Xn³) < ∞, |ψ|^ν is integrable for some ν ≥ 1, and that fn(x) exists for n ≥ ν. Then,
$$f_n(x) - \phi(x) - \tfrac{1}{6}\sigma^{-3}n^{-1/2}\mu_3(x^3-3x)\phi(x) = o(n^{-1/2}), \qquad (7.3)$$
as n → ∞.

Proof. As usual, we will consider without loss of generality the case where µ = 0. Let us first consider computing the Fourier transform of the function
$$f_n(x) - \phi(x) - \tfrac{1}{6}\sigma^{-3}n^{-1/2}\mu_3(x^3-3x)\phi(x). \qquad (7.4)$$
Because the Fourier transform is an integral, we can accomplish this with term-by-term integration. Previous calculations using Theorems 2.32 and 2.33 imply that the Fourier transform of fn(x) can be written as ψ^n(tσ^{-1}n^{-1/2}). Similarly, it is also known that the characteristic function of φ(x) is exp(−½t²). Finally, we require the Fourier transform of −(1/6)σ^{-3}n^{-1/2}µ3(x³ − 3x)φ(x). To simplify this matter we note that by Definition 1.6, the third Hermite polynomial is given by H3(x) = x³ − 3x and, therefore, Definition 1.6 implies that
$$-(x^3-3x)\phi(x) = (-1)^3H_3(x)\phi(x) = \phi^{(3)}(x).$$
Theorem 7.1 therefore implies that the Fourier transformation of −H3(x)φ(x) = φ^{(3)}(x) is (it)³exp(−½t²). Therefore, the Fourier transformation of Equation (7.4) is given by
$$\psi^n(n^{-1/2}\sigma^{-1}t) - \exp(-\tfrac{1}{2}t^2) - \tfrac{1}{6}\mu_3\sigma^{-3}n^{-1/2}(it)^3\exp(-\tfrac{1}{2}t^2).$$
Now, using the Fourier norm of Equation (7.1), we have that
$$|f_n(x) - \phi(x) - \tfrac{1}{6}\sigma^{-3}n^{-1/2}\mu_3(x^3-3x)\phi(x)| \leq \int_{-\infty}^{\infty}|\psi^n(n^{-1/2}\sigma^{-1}t) - \exp(-\tfrac{1}{2}t^2) - \tfrac{1}{6}\mu_3\sigma^{-3}n^{-1/2}(it)^3\exp(-\tfrac{1}{2}t^2)|\,dt. \qquad (7.5)$$
Our task now is to show that the integral on the right hand side of Equation (7.5) is o(n^{-1/2}) as n → ∞. Let δ > 0 and note that since ψ is a characteristic function of a density, we have from Theorem 7.2 that |ψ(t)| ≤ 1 for all t ∈ R and that |ψ(t)| < 1 for all t ≠ 0. This result, combined with the result of
Theorem 7.3 (Riemann and Lebesgue) implies that there is a real number qδ such that |ψ(t)| ≤ qδ < 1 for all |t| ≥ δ. We now begin approximating the Fourier norm in Equation (7.5) by breaking it up into two integrals: the first over an interval near the origin and the second over the remaining tails. That is,
$$\int_{-\infty}^{\infty}|\psi^n(n^{-1/2}\sigma^{-1}t) - \exp(-\tfrac{1}{2}t^2) - \tfrac{1}{6}\mu_3\sigma^{-3}n^{-1/2}(it)^3\exp(-\tfrac{1}{2}t^2)|\,dt$$
$$= \int_{|t|>\delta\sigma n^{1/2}}|\psi^n(n^{-1/2}\sigma^{-1}t) - \exp(-\tfrac{1}{2}t^2) - \tfrac{1}{6}\mu_3\sigma^{-3}n^{-1/2}(it)^3\exp(-\tfrac{1}{2}t^2)|\,dt$$
$$+ \int_{|t|<\delta\sigma n^{1/2}}|\psi^n(n^{-1/2}\sigma^{-1}t) - \exp(-\tfrac{1}{2}t^2) - \tfrac{1}{6}\mu_3\sigma^{-3}n^{-1/2}(it)^3\exp(-\tfrac{1}{2}t^2)|\,dt. \qquad (7.6)$$
Working with the first term on the right hand side of Equation (7.6), we note that Theorem A.18 implies that
$$\int_{|t|>\delta\sigma n^{1/2}}|\psi^n(n^{-1/2}\sigma^{-1}t) - \exp(-\tfrac{1}{2}t^2) - \tfrac{1}{6}\mu_3\sigma^{-3}n^{-1/2}(it)^3\exp(-\tfrac{1}{2}t^2)|\,dt$$
$$\leq \int_{|t|>\delta\sigma n^{1/2}}\left[|\psi^n(n^{-1/2}\sigma^{-1}t)| + \exp(-\tfrac{1}{2}t^2)\big(1 + |\tfrac{1}{6}\mu_3\sigma^{-3}n^{-1/2}(it)^3|\big)\right]dt$$
$$= \int_{|t|>\delta\sigma n^{1/2}}|\psi^n(n^{-1/2}\sigma^{-1}t)|\,dt + \int_{|t|>\delta\sigma n^{1/2}}\exp(-\tfrac{1}{2}t^2)\big(1 + |\tfrac{1}{6}\mu_3\sigma^{-3}n^{-1/2}(it)^3|\big)\,dt. \qquad (7.7)$$
For the first integral on the right hand side of Equation (7.7), we note that
$$\psi^n(n^{-1/2}\sigma^{-1}t) = \psi^{\nu}(n^{-1/2}\sigma^{-1}t)\psi^{n-\nu}(n^{-1/2}\sigma^{-1}t).$$
Now |t| > δσn^{1/2} implies that n^{-1/2}σ^{-1}|t| > δ, so the fact that |ψ(t)| ≤ qδ for all |t| ≥ δ implies that |ψ^{n−ν}(n^{-1/2}σ^{-1}t)| ≤ qδ^{n−ν}. Hence, it follows from Theorem A.7 that
$$\int_{|t|>\delta\sigma n^{1/2}}|\psi^n(n^{-1/2}\sigma^{-1}t)|\,dt \leq q_\delta^{\,n-\nu}\int_{|t|>\delta\sigma n^{1/2}}|\psi^{\nu}(n^{-1/2}\sigma^{-1}t)|\,dt \leq q_\delta^{\,n-\nu}\int_{-\infty}^{\infty}|\psi^{\nu}(n^{-1/2}\sigma^{-1}t)|\,dt \qquad (7.8)$$
$$= n^{1/2}\sigma q_\delta^{\,n-\nu}\int_{-\infty}^{\infty}|\psi^{\nu}(u)|\,du. \qquad (7.9)$$
Equation (7.9) follows from a change of variable in the integral in Equation (7.8). It is worthwhile to note the importance of the method used in establishing Equation (7.8). We cannot assume that |ψ|^n is integrable for every n ∈ N since we have only assumed that |ψ|^ν is integrable. By bounding |ψ|^{n−ν} separately we are then able to take advantage of the integrability of |ψ|^ν. Now,
using the integrability of |ψ|^ν, we note that
$$nq_\delta^{\,n-\nu}\sigma n^{1/2}\int_{-\infty}^{\infty}|\psi^{\nu}(u)|\,du = n^{3/2}q_\delta^{\,n-\nu}C,$$
where C is a constant that does not depend on n. Noting that |qδ| < 1, we have that
$$\lim_{n\to\infty}n^{3/2}q_\delta^{\,n-\nu}C = 0.$$
Therefore, using Definition 1.7, we have that
$$q_\delta^{\,n-\nu}\int_{-\infty}^{\infty}|\psi^{\nu}(n^{-1/2}\sigma^{-1}t)|\,dt = o(n^{-1}),$$
as n → ∞. For the second integral on the right hand side of Equation (7.7) we have that
$$\int_{|t|>\delta\sigma n^{1/2}}\exp(-\tfrac{1}{2}t^2)\big(1 + |\tfrac{1}{6}\mu_3\sigma^{-3}n^{-1/2}(it)^3|\big)\,dt \leq \int_{|t|>\delta\sigma n^{1/2}}\exp(-\tfrac{1}{2}t^2)\big(1 + |\mu_3\sigma^{-3}t^3|\big)\,dt$$
$$= \int_{|t|>\delta\sigma n^{1/2}}\exp(-\tfrac{1}{2}t^2)\,dt + \int_{|t|>\delta\sigma n^{1/2}}\exp(-\tfrac{1}{2}t^2)|\mu_3\sigma^{-3}t^3|\,dt, \qquad (7.10)$$
since |(1/6)n^{-1/2}i³| < 1 for all n ∈ N. Now


Z Z
exp(− 12 t2 )dt = (2π)1/2 (2π)−1/2 exp(− 12 t2 )dt =
|t|>δσn1/2 |t|>δσn1/2

(2π)1/2 P (|Z| > δσn1/2 ),

where Z is a N(0, 1) random variable. Theorem 2.6 (Markov) implies that


P (|Z| > δσn1/2 ) ≤ (δσn1/2 )−3 E(|Z|3 ). Therefore, we have that
Z
exp(− 12 t2 )dt ≤ (2π)1/2 (δσn1/2 )−3 E(|Z|3 ),
|t|>δσn1/2

and hence,
Z
lim n exp(− 12 t2 )dt ≤ lim n−1/2 (2π)1/2 (δσ)−3 E(|Z|3 ) = 0.
n→∞ |t|>δσn1/2 n→∞

Therefore, using Definition 1.7, we have proven that


Z
exp(− 12 t2 )dt = o(n−1 ),
|t|>δσn1/2

as n → ∞. For the second integral on the right hand side of Equation (7.10)
we note that
$$\int_{|t|>\delta\sigma n^{1/2}}\exp(-\tfrac{1}{2}t^2)|\mu_3\sigma^{-3}t^3|\,dt = |\mu_3\sigma^{-3}|\int_{|t|>\delta\sigma n^{1/2}}|t|^3\exp(-\tfrac{1}{2}t^2)\,dt = 2|\mu_3\sigma^{-3}|\int_{t>\delta\sigma n^{1/2}}t^3\exp(-\tfrac{1}{2}t^2)\,dt. \qquad (7.11)$$
Note that the final equality in Equation (7.11) follows from the fact that the integrand is an even function. Now consider a change of variable where u = ½t², so that du = t dt and the lower limit of the integral becomes ½δ²σ²n. Hence, we have that
$$\int_{|t|>\delta\sigma n^{1/2}}\exp(-\tfrac{1}{2}t^2)|\mu_3\sigma^{-3}t^3|\,dt = 4|\mu_3\sigma^{-3}|\int_{\delta^2\sigma^2n/2}^{\infty}u\exp(-u)\,du.$$
Now use Theorem A.4 with dv = exp(−u)du, so that v = −exp(−u), to find that
$$4|\mu_3\sigma^{-3}|\int_{\delta^2\sigma^2n/2}^{\infty}u\exp(-u)\,du = -4|\mu_3\sigma^{-3}|u\exp(-u)\Big|_{\delta^2\sigma^2n/2}^{\infty} + 4|\mu_3\sigma^{-3}|\int_{\delta^2\sigma^2n/2}^{\infty}\exp(-u)\,du$$
$$= 2|\mu_3\sigma^{-3}|\delta^2\sigma^2n\exp(-\tfrac{1}{2}\delta^2\sigma^2n) + 4|\mu_3\sigma^{-3}|\exp(-\tfrac{1}{2}\delta^2\sigma^2n), \qquad (7.12)$$
where we have used the fact that
$$\lim_{n\to\infty}n^k\exp(-n) = 0,$$
for all fixed k ∈ N. Therefore, using the same property, we can also conclude using Definition 1.7 that
$$\int_{|t|>\delta\sigma n^{1/2}}\exp(-\tfrac{1}{2}t^2)|\mu_3\sigma^{-3}t^3|\,dt = o(n^{-1}),$$

as n → ∞. Therefore, it follows that
$$(2\pi)^{-1}\int_{-\infty}^{\infty}|\psi^n(t\sigma^{-1}n^{-1/2}) - \exp(-\tfrac{1}{2}t^2) - \tfrac{1}{6}n^{-1/2}\sigma^{-3}\mu_3(it)^3\exp(-\tfrac{1}{2}t^2)|\,dt$$
$$= (2\pi)^{-1}\int_{|t|<\delta\sigma n^{1/2}}|\psi^n(t\sigma^{-1}n^{-1/2}) - \exp(-\tfrac{1}{2}t^2) - \tfrac{1}{6}n^{-1/2}\sigma^{-3}\mu_3(it)^3\exp(-\tfrac{1}{2}t^2)|\,dt + o(n^{-1}),$$
as n → ∞. We will now follow Feller (1971) in simplifying the problem through the introduction of the function Λ(t) = log[ψ(t)] + ½σ²t². To see why this function is useful, note that
$$\exp[n\Lambda(t\sigma^{-1}n^{-1/2})] = \exp\{n\log[\psi(t\sigma^{-1}n^{-1/2})] + \tfrac{1}{2}n\sigma^2t^2\sigma^{-2}n^{-1}\} = \exp\{\log[\psi^n(t\sigma^{-1}n^{-1/2})]\}\exp(\tfrac{1}{2}t^2) = \psi^n(t\sigma^{-1}n^{-1/2})\exp(\tfrac{1}{2}t^2).$$
Therefore, it follows that
$$(2\pi)^{-1}\int_{|t|<\delta\sigma n^{1/2}}|\psi^n(t\sigma^{-1}n^{-1/2}) - \exp(-\tfrac{1}{2}t^2) - \tfrac{1}{6}n^{-1/2}\mu_3\sigma^{-3}(it)^3\exp(-\tfrac{1}{2}t^2)|\,dt$$
$$= (2\pi)^{-1}\int_{|t|<\delta\sigma n^{1/2}}\exp(-\tfrac{1}{2}t^2)\,|\exp[n\Lambda(t\sigma^{-1}n^{-1/2})] - 1 - \tfrac{1}{6}n^{-1/2}\mu_3\sigma^{-3}(it)^3|\,dt. \qquad (7.13)$$
To place a bound on the integral in Equation (7.13), we will use the inequality from Theorem A.9, which states that |exp(a) − 1 − b| ≤ (|a − b| + ½|b|²)exp(γ), where γ ≥ max{|a|, |b|}. Note that this inequality is valid for a and b whether they are real or complex valued. In this instance we will take a = nΛ(tσ^{-1}n^{-1/2}) and b = (1/6)n^{-1/2}µ3σ^{-3}(it)³. Therefore, it follows that
$$|\exp[n\Lambda(t\sigma^{-1}n^{-1/2})] - 1 - \tfrac{1}{6}n^{-1/2}\mu_3\sigma^{-3}(it)^3| \leq \left[|n\Lambda(t\sigma^{-1}n^{-1/2}) - \tfrac{1}{6}n^{-1/2}\mu_3\sigma^{-3}(it)^3| + \tfrac{1}{72}n^{-1}\mu_3^2\sigma^{-6}|it|^6\right]\exp(\gamma).$$
We now develop some properties of the function Λ. Recalling that Λ(t) = log[ψ(t)] + ½σ²t², we have that Λ(0) = log[ψ(0)] = log(1) = 0. We will also need to evaluate the first three derivatives of Λ at zero so that we may approximate Λ(t) with a Taylor expansion. The first derivative is Λ′(t) = ψ′(t)/ψ(t) + σ²t, so that Theorem 2.31 implies that Λ′(0) = ψ′(0)/ψ(0) = iµ = 0, since we have assumed that µ = 0. Similarly, the second derivative is given by Λ″(t) = −[ψ′(t)/ψ(t)]² + ψ″(t)/ψ(t) + σ², so that Theorem 2.31 implies that
$$\Lambda''(0) = -[\psi'(0)/\psi(0)]^2 + \psi''(0)/\psi(0) + \sigma^2 = -(i\mu)^2 + (i\sigma)^2 + \sigma^2 = 0.$$
For the third derivative we have
$$\Lambda'''(t) = 2\left[\frac{\psi'(t)}{\psi(t)}\right]^3 - \frac{3\psi'(t)\psi''(t)}{\psi^2(t)} + \frac{\psi'''(t)}{\psi(t)},$$
so that Λ‴(0) = i³µ3. Theorem 1.13 implies that
$$\Lambda(t+\delta) = \Lambda(t) + \delta\Lambda'(t) + \tfrac{1}{2}\delta^2\Lambda''(t) + E_2(t,\delta),$$
so that
$$\Lambda(\delta) = \Lambda(0) + \delta\Lambda'(0) + \tfrac{1}{2}\delta^2\Lambda''(0) + E_2(\delta) = E_2(\delta), \qquad (7.14)$$
where E2(δ) = (1/6)Λ‴(ξ)δ³ for some ξ ∈ (0, δ). Let ε > 0 and note that since Λ‴(t) is a continuous function, there is a neighborhood |t| < δ such that Λ‴(t) varies less than 6εσ³ from Λ‴(0) = i³µ3. Therefore |Λ‴(ξ) − i³µ3| ≤ 6εσ³ and hence |(1/6)Λ‴(ξ)t³ − (1/6)i³µ3t³| ≤ εσ³|t|³. Since Λ(t) = (1/6)Λ‴(ξ)t³, we have that
$$|\Lambda(t\sigma^{-1}n^{-1/2}) - \tfrac{1}{6}\mu_3i^3(t\sigma^{-1}n^{-1/2})^3| \leq |\varepsilon t^3n^{-3/2}|. \qquad (7.15)$$
Multiplying both sides of Equation (7.15) by n implies that
$$|n\Lambda(t\sigma^{-1}n^{-1/2}) - \tfrac{1}{6}\mu_3(it)^3\sigma^{-3}n^{-1/2}| \leq |\varepsilon t^3n^{-1/2}|,$$
for |t| < δ. Therefore, using the bound from Theorem A.9, we have that
$$|\exp[n\Lambda(t\sigma^{-1}n^{-1/2})] - 1 - \tfrac{1}{6}n^{-1/2}\mu_3\sigma^{-3}(it)^3| \leq \left[\varepsilon n^{-1/2}|t|^3 + \tfrac{1}{72}n^{-1}\mu_3^2\sigma^{-6}|it|^6\right]\exp(\gamma).$$
To find an appropriate value for γ we note that
$$\gamma \geq \max\{|n\Lambda(t\sigma^{-1}n^{-1/2})|,\ |\tfrac{1}{6}n^{-1/2}\mu_3\sigma^{-3}(it)^3|\}.$$
To find this bound we first note that
$$|\Lambda(t)| = |\tfrac{1}{6}\Lambda'''(\xi)t^3| \leq |i^3\mu_3 + \sigma^3\varepsilon||t^3|.$$
Suppose that δ is chosen small enough so that |δ| < ¼σ²|i³µ3 + σ³ε|^{-1}. Then
$$|\Lambda(t)| \leq |i^3\mu_3+\sigma^3\varepsilon||t^3| \leq |i^3\mu_3+\sigma^3\varepsilon||\delta||t^2| = \tfrac{1}{4}\sigma^2t^2,$$
since |t| < δ. Therefore,
$$|\Lambda(t\sigma^{-1}n^{-1/2})| \leq \tfrac{1}{4}\sigma^2t^2(\sigma^{-2}n^{-1}) = \tfrac{1}{4}n^{-1}t^2,$$
and hence |nΛ(tσ^{-1}n^{-1/2})| ≤ ¼t². Similarly, we note that
$$|\tfrac{1}{6}\mu_3(it)^3| = |\tfrac{1}{6}\mu_3t^3| = |\tfrac{1}{6}\mu_3||t^2||t|,$$
and we additionally choose δ small enough so that δ < ¼|⅙µ3|^{-1}σ². Therefore |(1/6)µ3(it)³| < ¼σ²t² for |t| < δ. Evaluating this bound at tσ^{-1}n^{-1/2} implies that
$$n|\tfrac{1}{6}\mu_3i^3t^3\sigma^{-3}n^{-3/2}| = |\tfrac{1}{6}\mu_3(it)^3\sigma^{-3}n^{-1/2}| \leq \tfrac{1}{4}t^2.$$
Therefore, we can take γ = ¼t², and it follows that
$$|\exp[n\Lambda(t\sigma^{-1}n^{-1/2})] - 1 - \tfrac{1}{6}n^{-1/2}\mu_3\sigma^{-3}(it)^3| \leq \left[\varepsilon n^{-1/2}|t|^3 + \tfrac{1}{72}n^{-1}\mu_3^2\sigma^{-6}|it|^6\right]\exp(\tfrac{1}{4}t^2).$$
It then follows that the integral in Equation (7.13) can be bounded as
$$(2\pi)^{-1}\int_{|t|<\delta\sigma n^{1/2}}\exp(-\tfrac{1}{2}t^2)\,|\exp[n\Lambda(t\sigma^{-1}n^{-1/2})] - 1 - \tfrac{1}{6}n^{-1/2}\mu_3\sigma^{-3}(it)^3|\,dt$$
$$\leq (2\pi)^{-1}\int_{|t|<\delta\sigma n^{1/2}}\left[\varepsilon n^{-1/2}|t|^3 + \tfrac{1}{72}n^{-1}\mu_3^2\sigma^{-6}|it|^6\right]\exp(-\tfrac{1}{4}t^2)\,dt,$$
where
$$\lim_{n\to\infty}n^{1/2}(2\pi)^{-1}\int_{|t|<\delta\sigma n^{1/2}}\varepsilon|t|^3n^{-1/2}\exp(-\tfrac{1}{4}t^2)\,dt = \lim_{n\to\infty}(2\pi)^{-1}\varepsilon\int_{|t|<\delta\sigma n^{1/2}}|t|^3\exp(-\tfrac{1}{4}t^2)\,dt = (2\pi)^{-1}\varepsilon\int_{-\infty}^{\infty}|t|^3\exp(-\tfrac{1}{4}t^2)\,dt. \qquad (7.16)$$
Note that the integral on the right hand side of Equation (7.16) is finite. Let B1 be a finite bound that is larger than this integral; then it follows that
$$\lim_{n\to\infty}n^{1/2}(2\pi)^{-1}\int_{|t|<\delta\sigma n^{1/2}}\varepsilon|t|^3n^{-1/2}\exp(-\tfrac{1}{4}t^2)\,dt \leq \varepsilon B_1,$$
for every ε > 0, and hence the limit is zero. Similarly,
$$\lim_{n\to\infty}n^{1/2}(2\pi)^{-1}\int_{|t|<\delta\sigma n^{1/2}}\tfrac{1}{72}n^{-1}\mu_3^2\sigma^{-6}|it|^6\exp(-\tfrac{1}{4}t^2)\,dt = \lim_{n\to\infty}n^{-1/2}(2\pi)^{-1}\tfrac{1}{72}\mu_3^2\sigma^{-6}\int_{|t|<\delta\sigma n^{1/2}}|it|^6\exp(-\tfrac{1}{4}t^2)\,dt = 0,$$
since
$$\int_{-\infty}^{\infty}|it|^6\exp(-\tfrac{1}{4}t^2)\,dt < \infty.$$
It then follows that
$$(2\pi)^{-1}\int_{|t|<\delta\sigma n^{1/2}}\exp(-\tfrac{1}{2}t^2)\,|\exp[n\Lambda(t\sigma^{-1}n^{-1/2})] - 1 - \tfrac{1}{6}n^{-1/2}\mu_3\sigma^{-3}(it)^3|\,dt = o(n^{-1/2}),$$
as n → ∞, and the result follows.

The error in Equation 7.3 can also be characterized as having the property
O(n−1 ) as n → ∞. Theorem 7.4 offers a potential improvement in the accu-
racy of approximating the distribution of the standardized sample mean over
Theorem 4.20 (Lindeberg and Lévy) which approximates fn (t) with a normal
density. In that case, we have that fn (t) − φ(t) = o(1) = O(n−1/2 ) as n → ∞.
Further reductions in the order of the error are possible if terms are added
to the expansion and further conditions on the existence of moments and the
smoothness of f can be assumed.
Theorem 7.5. Let {Xn}∞n=1 be a sequence of independent and identically distributed random variables. Let Fn(x) = P[n^{1/2}σ^{-1}(X̄n − µ) ≤ x], with density fn(x) for all x ∈ R. Assume that E(|Xn|^p) < ∞, |ψ|^ν is integrable for some ν ≥ 1, and that fn(x) exists for n ≥ ν. Then,
$$f_n(x) - \phi(x) - \phi(x)\sum_{k=1}^{p}n^{-k/2}p_k(x) = o(n^{-p/2}), \qquad (7.17)$$
uniformly in x as n → ∞, where pk(x) is a real polynomial whose coefficients depend only on µ1, . . . , µk. In particular, pk(x) does not depend on n, p or on other properties of F.

A proof of Theorem 7.5 can be found in Section XVI.2 of Feller (1971). The polynomial p1(x) has already been identified in Theorem 7.4 as being equal to p1(x) = (1/6)µ3σ^{-3}H3(x). It can be similarly shown that the polynomial p2(x) is given by p2(x) = (1/72)µ3²σ^{-6}H6(x) + (1/24)σ^{-4}(µ4 − 3σ⁴)H4(x). It
turns out that these polynomials are often easier to write in terms of cumulants. In particular, noting that κ4 = µ4 − 3σ⁴, we have that p2(x) = (1/72)µ3²σ^{-6}H6(x) + (1/24)σ^{-4}κ4H4(x). Even further simplification is possible if we define the standardized cumulants as ρk = σ^{-k}κk for all k ∈ N. In this case we have p1(x) = (1/6)ρ3H3(x) and p2(x) = (1/72)ρ3²H6(x) + (1/24)ρ4H4(x). The form of these, and additional terms, can be motivated by considering the form of the cumulant generating function for a standardized mean. For simplicity, suppose that X1, . . . , Xn are a set of independent and identically distributed random variables from a distribution F that has mean equal to zero, unit variance, and cumulant generating function c(t). Theorem 2.24 and Definition 2.13 imply that the cumulant generating function of n^{1/2}X̄n, the standardized version of the mean, is nc(n^{-1/2}t). Equation (2.23) gives the form of the cumulant generating function for the special case where κ1 = 0 and κ2 = 1 as c(t) = ½t² + (1/6)κ3t³ + (1/24)κ4t⁴ + o(t⁴), as t → 0. Therefore,
$$nc(n^{-1/2}t) = \tfrac{1}{2}t^2 + \tfrac{1}{6}n^{-1/2}\kappa_3t^3 + \tfrac{1}{24}n^{-1}\kappa_4t^4 + o(n^{-1}),$$
as n → ∞. We now convert this result to an equivalent expression for the moment generating function of the standardized mean. Definition 2.13 implies that the moment generating function of n^{1/2}X̄n is
$$m(t) = \exp[\tfrac{1}{2}t^2 + \tfrac{1}{6}n^{-1/2}\kappa_3t^3 + \tfrac{1}{24}n^{-1}\kappa_4t^4 + o(n^{-1})].$$
Using Theorem 1.13 on the exponential function implies that
$$m(t) = \exp(\tfrac{1}{2}t^2) + [\tfrac{1}{6}n^{-1/2}\kappa_3t^3 + \tfrac{1}{24}n^{-1}\kappa_4t^4 + o(n^{-1})]\exp(\tfrac{1}{2}t^2) + \tfrac{1}{2}[\tfrac{1}{6}n^{-1/2}\kappa_3t^3 + \tfrac{1}{24}n^{-1}\kappa_4t^4 + o(n^{-1})]^2\exp(\tfrac{1}{2}t^2) + o\{[\tfrac{1}{6}n^{-1/2}\kappa_3t^3 + \tfrac{1}{24}n^{-1}\kappa_4t^4 + o(n^{-1})]^2\},$$
as n → ∞. Collecting terms of order o(n^{-1}) and higher into the error term yields
$$m(t) = \exp(\tfrac{1}{2}t^2) + \tfrac{1}{6}n^{-1/2}\kappa_3t^3\exp(\tfrac{1}{2}t^2) + \tfrac{1}{24}n^{-1}\kappa_4t^4\exp(\tfrac{1}{2}t^2) + \tfrac{1}{72}n^{-1}\kappa_3^2t^6\exp(\tfrac{1}{2}t^2) + o(n^{-1}), \qquad (7.18)$$
as n → ∞. To obtain an expansion for the density of the standardized mean we must now invert the expansion for the moment generating function term by term. While we have not explicitly discussed inversion of moment generating functions, we need only one specialized result in this case. It can be shown that
$$\int_{-\infty}^{\infty}\exp(tx)\phi(x)H_k(x)\,dx = t^k\exp(\tfrac{1}{2}t^2). \qquad (7.19)$$
See Exercise 10. We again find ourselves dealing with functions that are not strictly densities, but nonetheless we can view the left hand side of Equation (7.19) as the moment generating function of φ(x)Hk(x). Therefore, if we invert the term on the right hand side of Equation (7.19) we should get the function φ(x)Hk(x). Therefore, inverting (1/6)n^{-1/2}κ3t³exp(½t²) results in (1/6)n^{-1/2}κ3φ(x)H3(x). The remaining terms can be inverted using the same method to obtain (1/24)n^{-1}κ4H4(x)φ(x) and (1/72)n^{-1}κ3²H6(x)φ(x), respectively. Therefore, the expansion for the density of n^{1/2}X̄n that results from inverting the expansion for the moment generating function given in Equation (7.18) is given by
$$\phi(x) + \tfrac{1}{6}n^{-1/2}\kappa_3\phi(x)H_3(x) + \tfrac{1}{24}n^{-1}\kappa_4H_4(x)\phi(x) + \tfrac{1}{72}n^{-1}\kappa_3^2H_6(x)\phi(x) + E_n,$$
where En is the error term, whose asymptotic properties are not considered at this time. Note that we did not end up with the standardized version of the cumulants in our expansion because we have assumed that the variance is one. Additional terms of the expansion are available, though the computation of these terms does become rather tedious. These terms are based on an asymptotic expansion of the cumulants of Zn. See Chapter 2 of Hall (1992) and Exercises 11–12 for further details.
Note that the error term of the pth -order Edgeworth expansion given in Equa-
tion (7.17) is o(n−p/2 ), as n → ∞. We can also show that this error term is
O(n−(p+1)/2 ) as n → ∞. To see this, note that Equation (7.17) implies that
a (p + 1)st -order Edgeworth expansion has the form
p
X
fn (x) − φ(x) − φ(x) n−k/2 pk (x) =
k=1

n−p/2−1/2 φ(x)pp+1 (x) + o(n−(p+1)/2 ), (7.20)


as n → ∞, where we have moved the last term to the right hand side of the
equation. Note that
n−(p+1)/2 φ(x)pp+1 (x)
= φ(x)pp+1 (x).
n−(p+1)/2
Since pp+1 (x) is a polynomial that does not depend on n, |φ(x)pp+1 (x)| re-
mains bounded as n → ∞ for each fixed value of x. Therefore, it follows that
n−(p+1)/2 φ(x)pp+1 (x) = O(n−(p+1)/2 ) as n → ∞. The second term on the
right hand side of Equation (7.20) converges to zero as n → ∞ when divided
by n−(p+1)/2 . Hence, it follows that
f_n(x) − φ(x) − φ(x) Σ_{k=1}^{p} n^{-k/2} p_k(x) = O(n^{-(p+1)/2}),
as n → ∞.
Example 7.1. Let {Xn }∞ n=1 be a sequence of independent and identically
distributed random variables where Xn has an Exponential(θ) distribution
for all n ∈ N. In this case the third and fourth standardized cumulants are
given by ρ_3 = 2 and ρ_4 = 6. Therefore we have that p_1(x) = (1/3)H_3(x) and p_2(x) = (1/18)H_6(x) + (1/4)H_4(x), with the two-term Edgeworth expansion for the density of Z_n = n^{1/2}σ^{-1}(X̄_n − µ) being given by

φ(x) + (1/3)n^{-1/2}φ(x)H_3(x) + n^{-1}φ(x)[(1/18)H_6(x) + (1/4)H_4(x)] + o(n^{-1}),
[Figure 7.1. The true density of Z_n = n^{1/2}σ^{-1}(X̄_n − µ) (solid line), the standard normal density (dashed line), and the one-term Edgeworth expansion (dotted line) when n = 5 and X_1, . . . , X_n is a set of independent and identically distributed random variables following an Exponential(θ) distribution.]
as n → ∞. To illustrate the correction that the Edgeworth expansion makes in this case, we note from Example 4.36 that Z_n has a location translated
Gamma(n, n−1/2 ) density whose form is given in Equation (4.3). Figure 7.1
compares the true density of Zn with the standard normal density and the
one-term Edgeworth expansion when n = 5. One can observe from Figure 7.1
that the Edgeworth expansion does a much better job of capturing the true
density of Zn than the normal approximation does. Figure 7.2 provides the
same comparison when n = 10. Both approximations are better, but one can
observe that the one-term Edgeworth expansion is still more accurate.
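The comparison shown in Figures 7.1 and 7.2 can be reproduced numerically. The following sketch, which assumes that SciPy is available, evaluates the exact density of Z_n through its location translated Gamma(n, n^{-1/2}) form, the standard normal density, and the one-term Edgeworth expansion with ρ_3 = 2.

import numpy as np
from scipy.stats import gamma, norm

n = 5
x = np.linspace(-3, 4, 400)

# Exact density: Z_n is a Gamma(n, n^{-1/2}) random variable shifted by -n^{1/2}.
exact = gamma.pdf(x + np.sqrt(n), a=n, scale=1 / np.sqrt(n))

H3 = x**3 - 3 * x                                            # Hermite polynomial H_3(x)
edgeworth = norm.pdf(x) * (1 + 2 * H3 / (6 * np.sqrt(n)))    # one-term expansion with rho_3 = 2

print(np.max(np.abs(exact - norm.pdf(x))))    # maximum error of the normal approximation
print(np.max(np.abs(exact - edgeworth)))      # maximum error of the one-term Edgeworth expansion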
Example 7.2. Let {Xn }∞ n=1 be a sequence of independent and identically
distributed random variables where Xn has a linear type density of the form
f (x) = 2[θ + x(1 − 2θ)]δ{x; (0, 1)}, (7.21)
where θ ∈ [0, 1]. The kth moment about the origin for this density is given by

µ'_k = E(X^k) = 2/(k + 2) − 2kθ/[(k + 1)(k + 2)].
[Figure 7.2. The true density of Z_n = n^{1/2}σ^{-1}(X̄_n − µ) (solid line), the standard normal density (dashed line), and the one-term Edgeworth expansion (dotted line) when n = 10 and X_1, . . . , X_n is a set of independent and identically distributed random variables following an Exponential(θ) distribution.]

Therefore, the first four moments of this density are given by µ = µ_1 = 2/3 − (1/3)θ, σ^2 = µ_2 = 1/18 + (1/9)θ − (1/9)θ^2, µ_3 = −1/135 − (1/45)θ + (1/9)θ^2 − (2/27)θ^3, and µ_4 = 1/135 + (4/135)θ − (1/15)θ^2 + (2/27)θ^3 − (1/27)θ^4. It then follows that the standardized cumulants are given by

ρ_3 = [−1/135 − (1/45)θ + (1/9)θ^2 − (2/27)θ^3] / [1/18 + (1/9)θ − (1/9)θ^2]^{3/2},

and

ρ_4 = (µ_4 − 3σ^4)/σ^4 = [1/135 + (4/135)θ − (1/15)θ^2 + (2/27)θ^3 − (1/27)θ^4] / [1/18 + (1/9)θ − (1/9)θ^2]^2 − 3.

The standardized cumulants are plotted as a function of θ ∈ [0, 1] in Figure 7.3. Of interest here is the fact that the third standardized cumulant equals zero when θ = 1/2, which corresponds to the linear distribution being equal to a Uniform(0, 1) distribution, which is symmetric. When the third standardized cumulant is zero it follows that the second term of the Edgeworth expansion, which has the form (1/6)n^{-1/2}φ(x)ρ_3 H_3(x), is also zero and the term
[Figure 7.3. The third (solid line) and fourth (dashed line) standardized cumulants of the linear density given in Equation (7.21) as a function of θ.]
is eliminated. This means that we now have an expansion of the form

φ(x) + n^{-1}φ(x)[(1/72)ρ_3^2 H_6(x) + (1/24)ρ_4 H_4(x)] + o(n^{-1}) = φ(x) + o(n^{-1/2}),
as n → ∞. Hence, the normal approximation provided by Theorem 4.20 (Lin-
deberg and Lévy) which usually has an error term that is o(1) as n → ∞, is
now o(n−1/2 ) as n → ∞. Therefore, in the case where the population is sym-
metric, Theorem 4.20 is more accurate than when the population is not sym-
metric. Is it possible to choose θ such that the second term of the expansion, which has the form n^{-1}φ(x)[(1/72)ρ_3^2 H_6(x) + (1/24)ρ_4 H_4(x)], would also be eliminated? That is, can we find a value of θ for which (1/72)ρ_3^2 H_6(x) + (1/24)ρ_4 H_4(x) = 0 for all x ∈ R? This would require both the third and fourth standardized cumulants to be equal to zero, which does not occur for any value of θ ∈ [0, 1]. Therefore, it is clear that there are no situations where we can gain any further accuracy beyond the case where the distribution is symmetric and the first term of the expansion is eliminated. 
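The standardized cumulants of the linear density can be verified symbolically. The following sketch, which assumes that the SymPy library is available, computes ρ_3 and ρ_4 as functions of θ and confirms the behavior described above.

import sympy as sp

x, theta = sp.symbols('x theta')
f = 2 * (theta + x * (1 - 2 * theta))              # the density in Equation (7.21) on (0, 1)

def raw_moment(k):
    return sp.integrate(x**k * f, (x, 0, 1))

m1 = raw_moment(1)
mu2 = sp.simplify(raw_moment(2) - m1**2)
mu3 = sp.simplify(raw_moment(3) - 3 * m1 * raw_moment(2) + 2 * m1**3)
mu4 = sp.simplify(raw_moment(4) - 4 * m1 * raw_moment(3)
                  + 6 * m1**2 * raw_moment(2) - 3 * m1**4)

rho3 = sp.simplify(mu3 / mu2**sp.Rational(3, 2))
rho4 = sp.simplify(mu4 / mu2**2 - 3)

print(sp.simplify(rho3.subs(theta, sp.Rational(1, 2))))   # 0: symmetric at theta = 1/2
print(sp.solve(sp.Eq(rho4, 0), theta))                    # no root falls in [0, 1]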
Example 7.3. Let {X_n}_{n=1}^∞ be a sequence of independent and identically distributed random variables where X_n has a density that is a mixture of two Normal densities given by f(x) = (1/2)φ(x) + (1/2)φ(x − θ) for all x ∈ R, where
θ ∈ R. When considering the moments of a normal mixture such as this, it is
worthwhile to note that
E(X_n^k) = ∫_{-∞}^{∞} x^k[(1/2)φ(x) + (1/2)φ(x − θ)] dx = (1/2)∫_{-∞}^{∞} x^kφ(x) dx + (1/2)∫_{-∞}^{∞} x^kφ(x − θ) dx.

Therefore, if we suppose that Y is a N(0, 1) random variable and W is a N(θ, 1) random variable, then E(X_n^k) = (1/2)E(Y^k) + (1/2)E(W^k). Therefore, we have that E(X_n) = (1/2)θ, E(X_n^2) = 1 + (1/2)θ^2, E(X_n^3) = (1/2)θ^3 + (3/2)θ, and E(X_n^4) = (1/2)θ^4 + 3θ^2 + 3. From these calculations we can find that the third and fourth central moments of the normal mixture are µ_3 = 0 and µ_4 = (1/16)θ^4 + (3/2)θ^2 + 3. The third cumulant is also zero, and the fourth cumulant is κ_4 = µ_4 − 3σ^4 = −(1/8)θ^4. There
are two key elements to the Edgeworth expansion in this case. First, note that
since the third cumulant is always zero, as this particular normal mixture is
always symmetric, the first term of the Edgeworth expansion is always zero,
and hence the error term for this expansion is always o(n−1/2 ), as n → ∞, no
matter what the value of θ is. Secondly, note that the fourth cumulant is zero
when θ = 0, which means that the error term of the Edgeworth expansion is
no larger than o(n−1 ) as n → ∞ in this case. In fact, all of the cumulants
except the second are zero when θ = 0 since the normal mixture in this case
coincides with a N(0, 1) distribution, where the Edgeworth expansion has no
terms beyond the φ(x) term and the normal approximation of Theorem 4.20
(Lindeberg and Lévy) is exact. 
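The cumulant computations in this example can be checked directly from the cumulant generating function of the mixture. The following sketch, which assumes that SymPy is available, differentiates the cumulant generating function to verify that the third cumulant is identically zero and that the fourth cumulant vanishes only at θ = 0.

import sympy as sp

t, theta = sp.symbols('t theta')
mgf = sp.Rational(1, 2) * sp.exp(t**2 / 2) + sp.Rational(1, 2) * sp.exp(theta * t + t**2 / 2)
cgf = sp.log(mgf)                                  # cumulant generating function of the mixture

kappa3 = sp.simplify(sp.diff(cgf, t, 3).subs(t, 0))
kappa4 = sp.simplify(sp.diff(cgf, t, 4).subs(t, 0))
print(kappa3)                                      # 0: the mixture is symmetric about theta/2
print(kappa4)                                      # vanishes only when theta = 0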
The development of the Edgeworth expansion in Theorems 7.4 and 7.5 focuses
on the density of the random variable Zn = n1/2 σ −1 (X̄n − µ). Similar expan-
sions can also be developed for the distribution function of the standardized
random variable Zn . From one viewpoint such an expansion can be obtained
through term-by-term integration of the expansion given in Equation (7.3),
for which we would obtain the result
F_n(x) − Φ(x) − (1/6)σ^{-3}n^{-1/2}µ_3(1 − x^2)φ(x) = o(n^{-1/2}),   (7.22)

under the conditions of Theorem 7.4. However, such an expansion is still valid under even weaker conditions. In fact, it is not necessary that the distribution F have a density, but certain distributions known as lattice distributions must be avoided.

Definition 7.1. A distribution F is a lattice distribution if the jump points of F are all on a set of the form {a + bz : z ∈ Z}, where a ∈ R and b is a real number called the span of the lattice.

We have encountered lattice distributions previously when we considered the
inversion of the characteristic function, where we pointed out that examples of
distributions that are lattice distributions include the Binomial and Poisson
distributions, which both have span equal to one. See Example 2.20. The rea-
son that lattice distributions are important is that their characteristic func-
tions have a particular form that make the derivation of Edgeworth expansions
of the form given in Theorem 7.5 impossible without changing the form of the
expansion.
Theorem 7.6. Let X be a random variable with distribution F and charac-
teristic function ψ.

1. If F is a lattice distribution on the set {a + bz : z ∈ Z}, then the function ζ(t) = exp(−ita)ψ(t) is a periodic function with period 2πb^{-1}.
2. If ψ(t) is a periodic function, then X is a lattice random variable.

Proof. If F is a lattice distribution on the set {a + bz : z ∈ Z}, then the


characteristic function of X is given by
ψ(t) = E[exp(itX)] = Σ_{k∈Z} p_k exp[it(a + bk)] = exp(ita) Σ_{k∈Z} p_k exp(itbk),

where pk = P (X = a + bk) for all k ∈ Z. Therefore,


ζ(t) = Σ_{k∈Z} p_k exp(itbk),

and
ζ(t + 2πb^{-1}) = Σ_{k∈Z} p_k exp[ibk(t + 2πb^{-1})]
= Σ_{k∈Z} p_k exp(itbk) exp(2πik)
= Σ_{k∈Z} p_k exp(itbk)
= ζ(t),
since exp(2πik) = 1 when k ∈ Z.
To prove the second result, we now assume that X is a random variable with a
periodic characteristic function that has period a. We wish now to show that
X is a lattice random variable. Noting that ψ(0) = E[exp(i0X)] = E(1) = 1
it follows from the periodicity of ψ that ψ(0) = ψ(a) = 1. Now, Definition A.6
implies that

ψ(a) = E[exp(iaX)] = E[cos(aX) + i sin(aX)] = E[cos(aX)] + iE[sin(aX)] = 1.   (7.23)
Note that the right hand side of Equation (7.23) is a real number. Therefore,
iE[sin(aX)] = 0 and hence E[cos(aX)] = 1, or equivalently E[1 − cos(aX)] =
0. Since P [−1 ≤ cos(aX) ≤ 1] = 1 it follows that P [1 − cos(aX) ≥ 0] = 1.
Therefore Theorem A.15 implies that since E[1 − cos(aX)] = 0 it follows that
P [1 − cos(aX) = 0] = 1. This implies that the only values of X that have
non-zero probabilities are x ∈ R such that cos(ax) = 1. The periodicity of the
cosine function implies that X must be a lattice random variable.
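Part 1 of Theorem 7.6 can be illustrated numerically. The following sketch, which assumes that NumPy is available, evaluates the characteristic function of a Poisson(λ) distribution, a lattice distribution with a = 0 and span b = 1, and confirms that it has period 2π and does not decay as |t| → ∞, in contrast with the N(0, 1) characteristic function.

import numpy as np

lam = 3.0
psi = lambda t: np.exp(lam * (np.exp(1j * t) - 1.0))   # Poisson(lam) characteristic function

t = np.linspace(-5.0, 5.0, 11)
print(np.max(np.abs(psi(t + 2 * np.pi) - psi(t))))      # essentially zero: psi has period 2*pi
print(abs(np.exp(-0.5 * (2 * np.pi)**2)), abs(psi(2 * np.pi)))  # the N(0,1) cf decays, psi does not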

Further discussion on this topic can be found in Chapter XV of Feller (1971)


and Chapter 8 of Chow and Teicher (2003). See also Lukacs (1956). From
Theorem 7.6 we can observe that when F corresponds to a lattice distribution,
then the characteristic function of F has periodic components that do not
decay as |t| → ∞. This means that the truncation argument used to prove
Theorem 7.4 can no longer be used, and in fact the form of the expansion
must be changed. For further information on Edgeworth type expansions for
lattice distributions see Esseen (1945) and Chapter XVI of Feller (1971).
The proof of the validity of the expansion in Equation (7.22) for non-lattice
distributions is based on Theorem 4.23 (Smoothing Theorem). However, as
with the expansion for densities, Theorem 4.23 is not directly applicable be-
cause the expansion does not yield a valid distribution function. In particular,
Φ(x) + (1/6)σ^{-3}n^{-1/2}µ_3(1 − x^2)φ(x) may not be non-decreasing, may go outside
the range [0, 1], and may not have the proper limits as x → ±∞. To address
this problem we require a more general version of Theorem 4.23. For our par-
ticular case we will present a specific version of this theorem that focuses on
the case of bounding the difference between the normal distribution function
and another function.
Theorem 7.7. Let G be a function such that

lim_{x→∞} |Φ(x) − G(x)| = 0 and lim_{x→−∞} |Φ(x) − G(x)| = 0.

Suppose that G has a bounded derivative that has a continuously differentiable Fourier transformation γ such that γ(0) = 1 and

d/dt γ(t)|_{t=0} = 0.

If F is a distribution function with zero expectation and characteristic function ψ, then

|F(x) − G(x)| ≤ ∫_{-t}^{t} |ψ(u) − γ(u)| / (π|u|) du + 24(πt)^{-1} sup_{x∈R} |G'(x)|,

for all x ∈ R and t > 0.

A proof of Theorem 7.7 can be found in Section XVI.3 of Feller (1971). We now
have the required results to justify the Edgeworth expansion for non-lattice
distributions.
Theorem 7.8. Let {Xn }∞ n=1 be a sequence of independent and identically dis-
tributed random variables from a distribution F. Let F_n(x) = P[n^{1/2}σ^{-1}(X̄_n − µ) ≤ x] and assume that E(X_n^3) < ∞. If F is a non-lattice distribution then

F_n(x) − Φ(x) − (1/6)σ^{-3}n^{-1/2}µ_3(1 − x^2)φ(x) = o(n^{-1/2}),   (7.24)
as n → ∞ uniformly in x.

Proof. We shall give a partial proof of the result, leaving the details for Ex-
ercise 6. Let
G(x) = Φ(x) + (1/6)σ^{-3}n^{-1/2}µ_3(1 − x^2)φ(x).
We must first prove that our choice for G(x) satisfies the assumptions of
Theorem 7.7. We begin by noting that since φ(x) → 0 as x → ±∞ at a faster
rate than (1 − x2 ) diverges to −∞ it follows that
lim_{x→∞} (1 − x^2)φ(x) = 0 and lim_{x→−∞} (1 − x^2)φ(x) = 0. Therefore,

lim_{x→∞} G(x) = lim_{x→∞} Φ(x) + lim_{x→∞} (1/6)σ^{-3}n^{-1/2}µ_3(1 − x^2)φ(x) = 1,

and

lim_{x→−∞} G(x) = lim_{x→−∞} Φ(x) + lim_{x→−∞} (1/6)σ^{-3}n^{-1/2}µ_3(1 − x^2)φ(x) = 0.

Under the assumptions, a similar conclusion can be obtained for F_n(x). Therefore, it follows that

lim_{x→∞} |F_n(x) − G(x)| = 0 and lim_{x→−∞} |F_n(x) − G(x)| = 0.
The derivative of G(x) is given by
d/dx G(x) = d/dx {Φ(x) + (1/6)σ^{-3}n^{-1/2}µ_3(1 − x^2)φ(x)}
= φ(x) − (1/6)σ^{-3}n^{-1/2}µ_3(3x − x^3)φ(x)
= φ(x) + (1/6)σ^{-3}n^{-1/2}µ_3 H_3(x)φ(x),

which can be shown to be bounded. We now compute the Fourier transformation of G'(x) as

γ(t) = ∫_{-∞}^{∞} exp(itx)[φ(x) + (1/6)σ^{-3}n^{-1/2}µ_3 H_3(x)φ(x)] dx
= ∫_{-∞}^{∞} exp(itx)φ(x) dx + (1/6)σ^{-3}n^{-1/2}µ_3 ∫_{-∞}^{∞} exp(itx)H_3(x)φ(x) dx
= exp(−(1/2)t^2) + (1/6)σ^{-3}n^{-1/2}µ_3(it)^3 exp(−(1/2)t^2)
= exp(−(1/2)t^2)[1 + (1/6)µ_3σ^{-3}n^{-1/2}(it)^3].   (7.25)
The first term in Equation (7.25) comes from the fact that the characteristic function of a N(0, 1) distribution is exp(−(1/2)t^2). The second term in Equation (7.25) is the result of an application of Theorem 7.1. Note that γ(0) = 1. The derivative of the Fourier transformation is given by

d/dt {exp(−(1/2)t^2)[1 + (1/6)µ_3σ^{-3}n^{-1/2}(it)^3]}|_{t=0} = {−t exp(−(1/2)t^2)[1 + (1/6)µ_3σ^{-3}n^{-1/2}(it)^3]}|_{t=0} + {exp(−(1/2)t^2)[(1/2)µ_3σ^{-3}n^{-1/2}i^3t^2]}|_{t=0} = 0.
Therefore, we can apply Theorem 7.7 to find that

|F_n(x) − G(x)| ≤ π^{-1} ∫_{-T}^{T} |t^{-1}[ψ^n(tσ^{-1}n^{-1/2}) − γ(t)]| dt + 24(Tπ)^{-1} sup_{x∈R} |G'(x)|.

Let ε > 0 and choose α to be large enough so that 24|G'(x)| < αε for all x ∈ R. Then let T = αn^{1/2} and we have that

|F_n(x) − G(x)| ≤ π^{-1} ∫_{-T}^{T} |t^{-1}[ψ^n(tσ^{-1}n^{-1/2}) − γ(t)]| dt + εn^{-1/2}.

The remainder of the proof proceeds much along the same path as the proof
of Theorem 7.4. See Exercise 6.

As in the case of expansions for densities, the result of Theorem 7.8 can be
expanded under some additional assumptions so that the error is o(n−p/2 ) as
n → ∞.
Theorem 7.9. Let {Xn }∞ n=1 be a sequence of independent and identically dis-
tributed random variables from a distribution F that has characteristic func-
tion ψ. Let F_n(x) = P[n^{1/2}σ^{-1}(X̄_n − µ) ≤ x] and assume that E(X_n^p) < ∞. If

lim sup_{|t|→∞} |ψ(t)| < 1,   (7.26)

then

F_n(x) − Φ(x) − φ(x) Σ_{k=1}^{p} n^{-k/2} r_k(x) = o(n^{-p/2}),   (7.27)
k=1
uniformly in x as n → ∞, where rk (x) is a real polynomial whose coefficients
depend only on µ1 , . . . , µk . In particular, rk (x) does not depend on n, p or on
other properties of F .
To find the polynomials r1 , . . . , rk we note that the Edgeworth expansion for
the distribution function is the integrated version of the expansion for the
density. Therefore, it follows that

d/dx [φ(x)r_k(x)] = φ(x)p_k(x),   (7.28)

which yields

−r_1(x) = (1/6)ρ_3 H_2(x)   (7.29)

and

−r_2(x) = (1/24)ρ_4 H_3(x) + (1/72)ρ_3^2 H_5(x).   (7.30)
See Exercise 13.
Example 7.4. Let {Xn }∞ n=1 be a sequence of independent and identically
distributed random variables where Xn has an Exponential(θ) distribu-
tion for all n ∈ N, where the third and fourth standardized cumulants are
given by ρ_3 = 2 and ρ_4 = 6. Therefore we have that −r_1(x) = (1/3)H_2(x) and −r_2(x) = (1/4)H_3(x) + (1/18)H_5(x), with the two-term Edgeworth expansion for the distribution function of Z_n = n^{1/2}σ^{-1}(X̄_n − µ) being given by

Φ(x) − (1/3)n^{-1/2}φ(x)H_2(x) − n^{-1}φ(x)[(1/4)H_3(x) + (1/18)H_5(x)] + o(n^{-1}),
as n → ∞. To illustrate the correction that the Edgeworth expansion makes in this case, we note from Example 4.36 that Z_n has a location translated Gamma(n, n^{-1/2}) density whose form is given in Equation (4.3). Figure 7.4
compares the true distribution function of Zn with the standard normal dis-
tribution function and the one-term Edgeworth expansion when n = 5. One
can observe from Figure 7.4 that the Edgeworth expansion does a much better
job of capturing the true distribution function of Zn than the normal approx-
imation does. Figure 7.5 provides the same comparison when n = 10. Both
approximations are better, but one can observe that the one-term Edgeworth
expansion is still more accurate. 
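The distribution-function comparison in Figures 7.4 and 7.5 can be reproduced in the same way as the density comparison of Example 7.1. The following sketch, which assumes that SciPy is available, tabulates the exact distribution function of Z_n, the standard normal distribution function, and the one-term Edgeworth expansion when n = 5.

import numpy as np
from scipy.stats import gamma, norm

n = 5
x = np.linspace(-3, 4, 8)

exact = gamma.cdf(x + np.sqrt(n), a=n, scale=1 / np.sqrt(n))   # exact distribution function of Z_n
edgeworth = norm.cdf(x) - 2 * (x**2 - 1) * norm.pdf(x) / (6 * np.sqrt(n))  # one-term expansion, rho_3 = 2

for xi, e, p, c in zip(x, exact, norm.cdf(x), edgeworth):
    print(round(xi, 2), round(e, 4), round(p, 4), round(c, 4))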

7.3 The Cornish–Fisher Expansion

The Edgeworth expansion given in Equation (7.27) provides a more accurate


approximation to the distribution function of Zn = n1/2 σ −1 (X̄n − µ) than the
normal approximation given by Theorem 4.20 (Lindeberg and Lévy). While
it is useful to see how the distribution of Zn changes with respect to the
moments of F , it might also be useful to use these results to provide new
methods of statistical inference based on these corrections. For example, if
X1 , . . . , Xn is a set of independent and identically distributed random vari-
ables from a N(θ, σ 2 ) distribution, then a 100α% confidence interval for θ is
given by [X̄n − n−1/2 σz(1+α)/2 , X̄n − n−1/2 σz(1−α)/2 ] when σ is known. The
form of this interval is based on the fact that when F has a N(θ, σ 2 ) dis-
tribution, Zn has a N(0, 1) distribution. When F does not have a N(θ, σ 2 )
distribution, then the confidence interval can still be justified asymptotically
in the sense that Z_n →_d Z as n → ∞, where Z is a N(0, 1) random variable.
This interval is approximate for finite sample sizes in that the actual coverage
probability may not be exactly α. Of interest, then, is the possibility that
the performance of this interval could be improved using the information that
we obtain about the distribution of Zn from the Edgeworth expansion given
[Figure 7.4. The true distribution function of Z_n = n^{1/2}σ^{-1}(X̄_n − µ) (solid line), the standard normal density (dashed line) and the one-term Edgeworth expansion (dotted line) when n = 5 and X_1, . . . , X_n is a set of independent and identically distributed random variables following an Exponential(θ) distribution.]
in Equation (7.27). However, in order to improve the confidence interval we


need to have corrections to the normal approximation for the quantiles of Zn ,
not the distribution function of Zn . Therefore, we would like to develop an
Edgeworth-type expansion for the quantiles of Zn .

Expansions for the quantiles of the random variable Zn are called Cornish–
Fisher expansions and were first developed by Cornish and Fisher (1937)
and Fisher and Cornish (1960). See also Hall (1983a). We will develop these
expansions using the inversion method described in Section 1.6. In particular
we will begin with the two-term Edgeworth expansion for the distribution
function of Zn given by

F_n(x) = Φ(x) − (1/6)n^{-1/2}φ(x)ρ_3 H_2(x) − n^{-1}φ(x)[(1/24)ρ_4 H_3(x) + (1/72)ρ_3^2 H_5(x)] + o(n^{-1}),   (7.31)
where we would like to find an asymptotic expansion for a value gα,n where
Fn (gα,n ) = α + o(n−1 ), as n → ∞. To begin the process we will assume that
the asymptotic expansion for gα,n has the form gα,n = v0 (α) + n−1/2 v1 (α) +
n−1 v2 (α) + o(n−1 ) as n → ∞. We now substitute this expansion for x in
[Figure 7.5. The true distribution function of Z_n = n^{1/2}σ^{-1}(X̄_n − µ) (solid line), the standard normal density (dashed line) and the one-term Edgeworth expansion (dotted line) when n = 10 and X_1, . . . , X_n is a set of independent and identically distributed random variables following an Exponential(θ) distribution.]
Equation (7.31) and use Theorem 1.15 to find an asymptotic expansion in


terms of the functions v0 , v1 , and v2 . We first recall from Theorem 1.15 that
Φ(x + δ) = Φ(x) + δφ(x) − (1/2)δ^2 H_1(x)φ(x) + o(δ^2), as δ → 0 for any fixed x ∈ R.
Therefore, taking x = v0 (α) and δ = n−1/2 v1 (α) + n−1 v2 (α) + o(n−1 ) we have
that

Φ(g_{α,n}) = Φ[v_0(α)] + [n^{-1/2}v_1(α) + n^{-1}v_2(α) + o(n^{-1})]φ[v_0(α)] − (1/2)[n^{-1/2}v_1(α) + n^{-1}v_2(α) + o(n^{-1})]^2 H_1[v_0(α)]φ[v_0(α)] + o{[n^{-1/2}v_1(α) + n^{-1}v_2(α) + o(n^{-1})]^2},   (7.32)
as n → ∞. We will simplify each term and consolidate all terms that are of
order o(n−1 ) or smaller into the error term. The first term on the right hand
side of Equation (7.32) needs no simplification, so we move to the second term
where we have that

[n^{-1/2}v_1(α) + n^{-1}v_2(α) + o(n^{-1})]φ[v_0(α)] = n^{-1/2}v_1(α)φ[v_0(α)] + n^{-1}v_2(α)φ[v_0(α)] + o(n^{-1}),

as n → ∞, where we conclude that the term that has the form o(n^{-1})φ[v_0(α)] = o(n^{-1}) because φ[v_0(α)] is constant with respect to n. The third term is simplified by noting that

[n^{-1/2}v_1(α) + n^{-1}v_2(α) + o(n^{-1})]^2 = n^{-1}v_1^2(α) + o(n^{-1}),   (7.33)
as n → ∞. To see why this is true, let {Rn }∞
n=1 be a sequence of real numbers
such that Rn = o(n−1 ) as n → ∞. Then, note that

[n^{-1/2}v_1(α) + n^{-1}v_2(α) + R_n]^2 = n^{-1}v_1^2(α) + 2n^{-3/2}v_1(α)v_2(α) + 2n^{-1/2}v_1(α)R_n + n^{-2}v_2^2(α) + 2n^{-1}v_2(α)R_n + R_n^2.

Now, 2n−3/2 v1 (α)v2 (α) = o(n−1 ) as n → ∞ since


lim_{n→∞} n[2n^{-3/2}v_1(α)v_2(α)] = lim_{n→∞} 2n^{-1/2}v_1(α)v_2(α) = 0,

based once again on the fact that v1 (α) and v2 (α) do not depend on n. Sim-
ilarly, it can be shown that n−2 v22 (α) = o(n−1 ) as n → ∞. For the term
2n−1/2 v1 (α)Rn we note that Rn = o(n−1 ) as n → ∞, which implies that
lim_{n→∞} n[2n^{-1/2}v_1(α)R_n] = lim_{n→∞} 2n^{-1/2}v_1(α)(nR_n) = 0,

so that 2n−1/2 v1 (α)Rn = o(n−1 ) as n → ∞. For the remaining term we also


have that Rn2 = o(n−1 ) as n → ∞. See Exercise 7. Consolidating these results
leads to the result given in Equation (7.33). We finally note that H1 [v0 (α)] =
v0 (α), so that H1 [v0 (α)]φ[v0 (α)] = v0 (α)φ[v0 (α)] which does not depend on
n. Therefore, it follows that

H_1[v_0(α)][n^{-1/2}v_1(α) + n^{-1}v_2(α) + o(n^{-1})]^2 φ[v_0(α)] = n^{-1}v_0(α)v_1^2(α)φ[v_0(α)] + o(n^{-1}).

Similar arguments can also be used to show that

o{[n^{-1/2}v_1(α) + n^{-1}v_2(α) + o(n^{-1})]^2} = o(n^{-1}),

as n → ∞. See Exercise 8. Therefore, it follows that

Φ(g_{α,n}) = Φ[v_0(α)] + n^{-1/2}v_1(α)φ[v_0(α)] + n^{-1}φ[v_0(α)][v_2(α) − (1/2)v_0(α)v_1^2(α)] + o(n^{-1}),   (7.34)
as n → ∞.
We now consider the second term in Equation (7.31). Note that the leading coefficient of this term is n^{-1/2}. Because we are only keeping terms that are of larger order than o(n^{-1}), we only need to keep track of the terms of (1/6)ρ_3φ(g_{α,n})H_2(g_{α,n}) that are of larger order than o(n^{-1/2}) as n → ∞. We again rely on Theorem 1.15 to find that
φ(x + δ) = φ(x) − δH1 (x)φ(x) + o(δ 2 ) = φ(x) − δxφ(x) + o(δ 2 ),
as δ → 0. Therefore,

φ(gα,n ) = φ[v0 (α)] − v0 (α)φ[v0 (α)][n−1/2 v1 (α) + n−1 v2 (α) + o(n−1 )]+
o{[n−1/2 v1 (α) + n−1 v2 (α)]2 },
as n → ∞. Keeping only terms of order larger than o(n−1/2 ) yields
φ(gα,n ) = φ[v0 (α)] − n−1/2 v0 (α)v1 (α)φ[v0 (α)] + o(n−1/2 ),
as n → ∞. Now H2 (x) = x2 − 1 so that using calculations similar to those
detailed above
H2 (gα,n ) = [v0 (α) + n−1/2 v1 (α) + n−1 v2 (α) + o(n−1 )]2 − 1
= v02 (α) + n−1 v12 (α) + n−2 v22 (α) + 2n−1/2 v0 (α)v1 (α)
+2n−1 v0 (α)v2 (α) + 2n−3/2 v1 (α)v2 (α) − 1 + o(n−1 ),
as n → ∞. But, noting that
n−1 v12 (α) + n−2 v22 (α) + 2n−1 v0 (α)v2 (α) + 2n−3/2 v1 (α)v2 (α) = o(n−1/2 ),
implies that
H2 (gα,n ) = v02 (α) + 2n−1/2 v0 (α)v1 (α) − 1 + o(n−1/2 )
= H2 [v0 (α)] + 2n−1/2 v0 (α)v1 (α) + o(n−1/2 ),
as n → ∞. Therefore,
φ(gα,n )H2 (gα,n ) = {φ[v0 (α)] − n−1/2 v0 (α)v1 (α)φ[v0 (α)] + o(n−1/2 )} ×
{H2 [v0 (α)] + 2n−1/2 v0 (α)v1 (α) + o(n−1/2 )}
= φ[v0 (α)]H2 [v0 (α)] + 2n−1/2 v0 (α)v1 (α)φ[v0 (α)]
−n−1/2 v0 (α)v1 (α)H2 [v0 (α)]φ[v0 (α)]
−2n−1 v02 (α)v12 (α)φ[v0 (α)] + o(n−1/2 ),
as n → ∞. However 2n−1 v02 (α)v12 (α)φ[v0 (α)] = o(n−1/2 ) so that

φ(gα,n )H2 (gα,n ) = φ[v0 (α)]H2 [v0 (α)]+


n−1/2 v0 (α)v1 (α){2 − H2 [v0 (α)]}φ[v0 (α)] + o(n−1/2 ),
as n → ∞. This, therefore, implies that
(1/6)n^{-1/2}ρ_3 φ(g_{α,n})H_2(g_{α,n}) = (1/6)n^{-1/2}ρ_3 φ[v_0(α)]H_2[v_0(α)] + (1/6)n^{-1}ρ_3 v_0(α)v_1(α){2 − H_2[v_0(α)]}φ[v_0(α)] + o(n^{-1}),   (7.35)

as n → ∞. For the final term in Equation (7.31) we require an asymptotic expansion for φ(g_{α,n})[(1/24)ρ_4 H_3(g_{α,n}) + (1/72)ρ_3^2 H_5(g_{α,n})] whose coefficient, which is n^{-1}, implies that our error term for this expansion may be o(1) as n → ∞. We begin by noting that H_3(g_{α,n}) = g_{α,n}^3 − 3g_{α,n}, where it can be shown that

g_{α,n}^3 = [v_0(α) + n^{-1/2}v_1(α) + n^{-1}v_2(α) + o(n^{-1})]^3 = v_0^3(α) + o(1),

as n → ∞. This, along with the direct result that 3g_{α,n} = 3v_0(α) + o(1) as n → ∞, implies that H_3(g_{α,n}) = H_3[v_0(α)] + o(1) as n → ∞. Similarly, it can be shown that H_5(g_{α,n}) = H_5[v_0(α)] + o(1) as n → ∞. See Exercise 9. Previous calculations imply that φ(g_{α,n}) = φ[v_0(α)] + o(1) as n → ∞ so that

n^{-1}φ(g_{α,n})[(1/24)ρ_4 H_3(g_{α,n}) + (1/72)ρ_3^2 H_5(g_{α,n})] = n^{-1}φ[v_0(α)]{(1/24)ρ_4 H_3[v_0(α)] + (1/72)ρ_3^2 H_5[v_0(α)]} + o(n^{-1}),   (7.36)
as n → ∞. Combining the results of Equations (7.34), (7.35), and (7.36)
implies that

F_n(g_{α,n}) = Φ[v_0(α)] + n^{-1/2}φ[v_0(α)]{v_1(α) − (1/6)ρ_3 H_2[v_0(α)]} + n^{-1}φ[v_0(α)]{v_2(α) − (1/2)v_0(α)v_1^2(α) − (1/6)ρ_3 v_0(α)v_1(α)(2 − H_2[v_0(α)]) − (1/24)ρ_4 H_3[v_0(α)] − (1/72)ρ_3^2 H_5[v_0(α)]} + o(n^{-1}),   (7.37)

as n → ∞. To obtain expressions for v_0(α), v_1(α), and v_2(α) we set F_n(g_{α,n}) = α + o(n^{-1}) and match the coefficients of the successive powers of n^{-1/2}. Matching the constant terms with respect to n in Equation (7.37) with the constant term α implies that Φ[v_0(α)] = α, or equivalently that v_0(α) = z_α. We similarly match the coefficient of n^{-1/2} in Equation (7.37) with zero, which implies that φ(z_α)[v_1(α) − (1/6)ρ_3 H_2(z_α)] = 0, where we have already made the substitution v_0(α) = z_α. This implies that v_1(α) = (1/6)ρ_3 H_2(z_α). We finally match the coefficients of n^{-1} in Equation (7.37) also to zero. This implies

v_2(α) + (1/72)z_α ρ_3^2 H_2^2(z_α) − (1/18)ρ_3^2 z_α H_2(z_α) − (1/24)ρ_4 H_3(z_α) − (1/72)ρ_3^2 H_5(z_α) = 0,

so that

v_2(α) = (1/18)ρ_3^2 z_α H_2(z_α) − (1/72)z_α ρ_3^2 H_2^2(z_α) + (1/24)ρ_4 H_3(z_α) + (1/72)ρ_3^2 H_5(z_α).

Therefore, it follows that

g_{α,n} = v_0(α) + n^{-1/2}v_1(α) + n^{-1}v_2(α) + o(n^{-1})
= z_α + (1/6)n^{-1/2}ρ_3 H_2(z_α) + n^{-1}[(1/18)ρ_3^2 z_α H_2(z_α) − (1/72)z_α ρ_3^2 H_2^2(z_α) + (1/24)ρ_4 H_3(z_α) + (1/72)ρ_3^2 H_5(z_α)] + o(n^{-1}),   (7.38)

as n → ∞. The asymptotic expansion given in Equation (7.38) is called a second-order Cornish–Fisher expansion. The expansion can also be terminated after the first-order term to obtain an expansion of the form z_α + (1/6)n^{-1/2}ρ_3 H_2(z_α) + o(n^{-1/2}) as n → ∞, which is called a first-order Cornish–Fisher expansion. Note that the constant term of the expansion is z_α, so as in the case of the Edgeworth expansions, Cornish–Fisher expansions can be interpreted as corrections to the approximation given by Theorem 4.20 (Lindeberg and Lévy). The result can be expanded to form a pth-order Cornish–Fisher expansion that has an error term of order o(n^{-p/2}) as n → ∞.
Theorem 7.10. Let {Xn }∞ n=1 be a sequence of independent and identically
distributed random variables from a distribution F that has characteristic
function ψ. Let F_n(t) = P[n^{1/2}σ^{-1}(X̄_n − µ) ≤ t], g_{α,n} = inf{x ∈ R : F_n(x) ≥ α}, and assume that E(X_n^p) < ∞. If

lim sup_{|t|→∞} |ψ(t)| < 1,

then

g_{α,n} − z_α − Σ_{k=1}^{p} n^{-k/2} q_k(z_α) = o(n^{-p/2}),   (7.39)

as n → ∞ uniformly in ε < α < 1 − ε for each ε > 0, where q_1, . . . , q_p are polynomials that depend on the moments of X_n and not on n.

We shall not endeavor to present a more formal proof of Theorem 7.10 other than what is presented above. The polynomials q_1, . . . , q_p are related to the polynomials r_1, . . . , r_p, though the relationship can become quite complicated as more terms are added to the expansion. In particular, q_1(x) = −r_1(x) and q_2(x) = r_1(x)r_1'(x) − (1/2)x r_1^2(x) − r_2(x). These relationships can be determined by directly inverting the asymptotic expansion in Equation (7.27). See Exercise 14. As with the Edgeworth expansion, the expansion given by Theorem 7.10 can also be written so that the error term has order O(n^{-(p+1)/2}) as n → ∞.
Because of the relationship between the two methods, many similar conclu-
sions about the accuracy of the normal approximation for the distribution of
n1/2 σ −1 (X̄n −µ) also hold for the quantiles of n1/2 σ −1 (X̄n −µ). In general, we
can observe from Theorem 7.10 that the normal approximation provides an ap-
proximation for the quantiles of n1/2 σ −1 (X̄n −µ) that has error that is o(1), or
O(n−1/2 ), as n → ∞. However, if the third standardized cumulant ρ3 is zero,
corresponding to the case where the population is symmetric, the normal ap-
proximation provides an approximation for the quantiles of n1/2 σ −1 (X̄n − µ),
where the error is o(n−1/2 ), or O(n−1 ), as n → ∞. Similarly, if the fourth stan-
dardized cumulant ρ4 is zero, corresponding to the case where the population
has the same kurtosis as a Normal population, the normal approximation
provides an approximation for the quantiles of n1/2 σ −1 (X̄n − µ), where the
error is o(n−1 ), or O(n−3/2 ), as n → ∞.
Example 7.5. Let {Xn }∞ n=1 be a sequence of independent and identically
distributed random variables where Xn has an Exponential(θ) distribu-
tion for all n ∈ N. In Example 7.4 it was shown that the distribution of
n^{1/2}σ^{-1}(X̄_n − θ) has an Edgeworth expansion with −r_1(x) = (1/3)H_2(x) and −r_2(x) = (1/4)H_3(x) + (1/18)H_5(x). Theorem 7.10 then implies that the αth quantile of the distribution of n^{1/2}σ^{-1}(X̄_n − θ) has Cornish–Fisher expansion

g_{α,n} = z_α + n^{-1/2}q_1(z_α) + n^{-1}q_2(z_α) + o(n^{-1}),

as n → ∞, where q_1(z_α) = −r_1(z_α) = (1/3)H_2(z_α) and q_2(z_α) = r_1(z_α)r_1'(z_α) − (1/2)z_α r_1^2(z_α) − r_2(z_α). Evaluating the derivative of r_1 we find that

r_1'(z_α) = d/dx r_1(x)|_{x=z_α} = −(2/3)H_1(z_α) = −(2/3)z_α.

Therefore

q_2(z_α) = (2/9)z_α H_2(z_α) − (1/18)z_α H_2^2(z_α) + (1/4)H_3(z_α) + (1/18)H_5(z_α).
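The Cornish–Fisher correction in this example can be compared with the exact quantile of Z_n, which is available through the location translated Gamma(n, n^{-1/2}) form of Example 4.36. The following sketch, which assumes that SciPy is available, evaluates the exact 0.95 quantile together with the normal, first-order, and second-order approximations when n = 10.

import numpy as np
from scipy.stats import gamma, norm

n, alpha = 10, 0.95
z = norm.ppf(alpha)

exact = gamma.ppf(alpha, a=n, scale=1 / np.sqrt(n)) - np.sqrt(n)   # exact quantile of Z_n

H2, H3, H5 = z**2 - 1, z**3 - 3 * z, z**5 - 10 * z**3 + 15 * z
q1 = H2 / 3
q2 = (2 / 9) * z * H2 - (1 / 18) * z * H2**2 + (1 / 4) * H3 + (1 / 18) * H5

print(exact)                                   # exact quantile
print(z)                                       # normal approximation
print(z + q1 / np.sqrt(n))                     # first-order Cornish-Fisher approximation
print(z + q1 / np.sqrt(n) + q2 / n)            # second-order Cornish-Fisher approximation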

7.4 The Smooth Function Model

The sample mean is not the only function of a set of independent and identi-
cally distributed random variables that has an asymptotic Normal distribu-
tion. For example, in Section 6.4, we studied several conditions under which
transformations of an asymptotically Normal random vector also have an
asymptotic Normal distribution. The key condition for such transformations
to be asymptotically Normal was based on the smoothness of the trans-
formation. One particular application of this theory is based on looking at
smooth functions of sample mean vectors. Edgeworth and Cornish–Fisher ex-
pansions can also be applied to these problems as well, though the results are
slightly more complicated and the function of the sample mean must have a
certain type of smooth representation. This section will focus on the model
for which these expansions are valid. Section 7.5 will provide details about the
expansions themselves.
The specification of the smooth function model begins with a sequence of d-
dimensional random vectors {Xn }∞ n=1 following a d-dimensional distribution
F . It is assumed that these random vectors are independent and identically
distributed. Let µ = E(Xn ) and assume that the components of µ are finite.
The parameter of interest is assumed to be a smooth function of the vector
mean µ. That is, we assume that there exists a smooth function g : Rd → R
such that θ = g(µ). The exact requirements on the smoothness of g will be
detailed in Section 7.5.
The parameter of interest will be estimated using a plug-in estimate based on
the sample mean. That is, we will assume that the mean µ is estimated by
the sample mean

µ̂ = X̄_n = n^{-1} Σ_{k=1}^{n} X_k.
An estimate of θ can then be computed by substituting the sample mean into
the function g. That is, θ̂n = g(µ̂). Note that under the conditions we have
stated thus far θ̂n is a consistent estimator of θ owing to the consistency of the
sample mean from Theorem 3.10 (Weak Law of Large Numbers) and Theorem
3.9.
In order to correctly standardize the distribution of θ̂_n we also require the standard error of θ̂_n. In particular, we will assume that the asymptotic variance of n^{1/2}θ̂_n is a constant σ^2 that is also a smooth function of µ. That is, we assume that there is a smooth function h : R^d → R such that

σ^2 = h^2(µ) = lim_{n→∞} V(n^{1/2}θ̂_n).

The smooth function model can be summarized as a model for parameters


that are smooth functions of a vector mean where the standard error of the
estimated parameter is also a smooth function of the vector mean. The smooth
function model allows one to study statistics well beyond just the mean. Pa-
rameters that have representations within the smooth function model include
the variance, the correlation between two random variables, a ratio of means,
a ratio of variances, the skewness, the kurtosis, and many more. The key to
setting up the model correctly comes from identifying the sequence of random
vectors {Xn }∞ n=1 , their dimension d, and the functions g and h. Note that the
specification of the function h further requires one to evaluate the asymptotic
variance of θ̂n . It is important to note that when specifying a smooth function
model, the dimension d must be chosen large enough so that both θ and σ
can be specified. In a typical specification of the smooth function model, the
sequence {X_n}_{n=1}^∞ will often consist of random vectors whose components
are powers of another random variable or vector. Therefore, the vector mean µ
will be a vector that contains various moments of the original random variable
or vector. Therefore both {Xn }∞ n=1 and the dimension d must be specified so
that there are sufficient moments to specify both θ and σ in the model. Some
example smooth function model specifications are given below.
Example 7.6. Let {Wn }∞ n=1 be a sequence of independent and identically
distributed random variables from a distribution F with mean θ. We wish to
represent the parameter θ in a smooth function model. Assuming that µ0k is
the k th moment of Wn , we note that θ = µ01 = E(Wn ). While we have not
specified the sequence {Xn }∞ n=1 , the dimension d, or the function h as of yet,
the estimate of θ will consist of the component of the sample mean of the
vectors X1 , . . . , Xn corresponding to Wn . Therefore, the estimate of θ will be
given by θ̂n = W̄n . Hence, the asymptotic variance of θ̂n is given by
σ^2 = lim_{n→∞} V(n^{1/2}θ̂_n) = lim_{n→∞} V(n^{1/2}W̄_n) = µ'_2 − (µ'_1)^2.

Since the parameter and the asymptotic variance rely on the first two moments of W_n we can specify a smooth function model for θ with d = 2 and X'_n = (W_n, W_n^2), so that the vector mean is given by µ' = E(X'_n) = (µ'_1, µ'_2) for all n ∈ N. The functions g and h are given by g(x) = x_1 and h(x) = x_2 − x_1^2, where x' = (x_1, x_2). 
Example 7.7. Let {Wn }∞ n=1 be a sequence of independent and identically
distributed random variables from a distribution F with mean η and variance
θ. We wish to represent the parameter θ in a smooth function model. Assuming
that µ0k is the k th moment of Wn , we note that θ = µ02 − (µ01 )2 so that we will
need at least the first two moments of Wn to be represented in our Xn vector.
Theorem 3.21 implies that the asymptotic variance of the sample variance is
µ4 − µ22 = µ04 − 4µ01 µ03 + 6(µ01 )2 µ2 − 3(µ01 )4 − [µ02 − (µ01 )2 ]2 ,
THE SMOOTH FUNCTION MODEL 313
so that to represent the variance in the smooth function model we require
d = 4 with X0n = (Wn , Wn2 , Wn3 , Wn4 ) for all n ∈ N. With this representation
we have g(x) = x2 − x21 and
h(x) = x4 − 4x1 x3 + 6x21 x2 − 3x41 − (x2 − x21 )2 ,
where x0 = (x1 , x2 , x3 , x4 ). See Exercise 15. 
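The plug-in structure of this representation is easy to illustrate numerically. The following sketch, which assumes that NumPy is available and uses a simulated Exponential sample purely for illustration, forms the vector mean of X'_n = (W_n, W_n^2, W_n^3, W_n^4) and evaluates the functions g and h of this example at that vector mean; the function named asy_var below evaluates the expression denoted h(x) in this example.

import numpy as np

rng = np.random.default_rng(0)
w = rng.exponential(scale=2.0, size=1000)          # any population with four finite moments

xbar = np.array([np.mean(w**k) for k in (1, 2, 3, 4)])   # the vector mean of X_n

g = lambda x: x[1] - x[0]**2
asy_var = lambda x: (x[3] - 4 * x[0] * x[2] + 6 * x[0]**2 * x[1]
                     - 3 * x[0]**4 - (x[1] - x[0]**2)**2)

print(g(xbar), np.var(w))                          # both equal the 1/n sample variance
print(asy_var(xbar))                               # plug-in estimate of mu_4 - mu_2^2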
Example 7.8. Let {W_n}_{n=1}^∞ be a sequence of independent and identically distributed bivariate random vectors from a distribution F having mean vector η and covariance matrix Σ. Let W'_n = (W_{n1}, W_{n2}) for all n ∈ N and define µ_{ij} = E{[W_{n1} − E(W_{n1})]^i [W_{n2} − E(W_{n2})]^j} and µ'_{ij} = E(W_{n1}^i W_{n2}^j). The
parameter of interest in this example is the correlation between Wn1 and
W_{n2}. That is, θ = µ_{11}µ_{20}^{-1/2}µ_{02}^{-1/2}. The estimate of θ, based on replacing the population moments with the sample moments as described above, has the form

θ̂_n = Σ_{k=1}^{n}(W_{k1} − W̄_1)(W_{k2} − W̄_2) / {[Σ_{k=1}^{n}(W_{k1} − W̄_1)^2]^{1/2}[Σ_{k=1}^{n}(W_{k2} − W̄_2)^2]^{1/2}}.
When constructing the smooth function model representation for this param-
eter, we will need to specify the sequence {X}∞ n=1 to include moments of both
Wn1 and Wn2 , but also various products of these random variables as well.
The correlation parameter itself can be specified readily enough as a function
of these moments as shown above, but the asymptotic variance is more chal-
lenging. For our current application we will use the result from Section 27.8
of Cramér (1946) which gives the asymptotic variance of n1/2 θ̂n as

σ^2 = (1/4)θ^2(µ_{40}µ_{20}^{-2} + µ_{04}µ_{02}^{-2} + 2µ_{22}µ_{20}^{-1}µ_{02}^{-1} + 4µ_{22}µ_{11}^{-2} − 4µ_{31}µ_{11}^{-1}µ_{20}^{-1} − 4µ_{13}µ_{11}^{-1}µ_{02}^{-1}),

which makes it apparent that we require moments up to order four from each random variable plus several products of powers of these random variables. It then suffices to define the sequence {X_n}_{n=1}^∞ as

X'_n = (W_{n1}, W_{n1}^2, W_{n1}^3, W_{n1}^4, W_{n2}, W_{n2}^2, W_{n2}^3, W_{n2}^4, W_{n1}W_{n2}, W_{n1}^2W_{n2}, W_{n1}W_{n2}^2, W_{n1}^2W_{n2}^2, W_{n1}^3W_{n2}, W_{n1}W_{n2}^3),

for all n ∈ N, where we have d = 14 and

µ' = (µ'_{10}, µ'_{20}, µ'_{30}, µ'_{40}, µ'_{01}, µ'_{02}, µ'_{03}, µ'_{04}, µ'_{11}, µ'_{21}, µ'_{12}, µ'_{22}, µ'_{31}, µ'_{13}).
From this definition we can define the function

g(x) = (x_9 − x_1x_5) / [(x_2 − x_1^2)^{1/2}(x_6 − x_5^2)^{1/2}],

for which θ = g(µ) and θ̂_n = g(X̄_n). The asymptotic variance calculation is somewhat complicated, so we will introduce some helpful functions to make the resulting expressions more compact. We begin by noting that

µ_{40} = µ'_{40} − 4µ'_{10}µ'_{30} + 6µ'_{20}(µ'_{10})^2 − 3(µ'_{10})^4,
we can define a function h_{40}(x) as

h_{40}(x) = x_4 − 4x_1x_3 + 6x_2x_1^2 − 3x_1^4,

so that h_{40}(µ) = µ_{40}. Similarly, we can define

h_{04}(x) = x_8 − 4x_5x_7 + 6x_6x_5^2 − 3x_5^4,

so that h_{04}(µ) = µ_{04}. Extending this idea we define the functions h_{20}(x) = x_2 − x_1^2, h_{02}(x) = x_6 − x_5^2, h_{11}(x) = x_9 − x_1x_5,

h_{22}(x) = x_{12} − 2x_5x_{10} + x_5^2x_2 − 2x_1x_{11} + 4x_1x_5x_9 − 3x_1^2x_5^2 + x_1^2x_6,

h_{31}(x) = x_{13} − 3x_1x_{10} + 3x_1^2x_9 − x_3x_5 + 3x_1x_5x_2 − 3x_1^3x_5,

and

h_{13}(x) = x_{14} − 3x_5x_{11} + 3x_5^2x_9 − x_7x_1 + 3x_1x_5x_6 − 3x_1x_5^3,

where it follows that h_{20}(µ) = µ_{20}, h_{02}(µ) = µ_{02}, h_{11}(µ) = µ_{11}, h_{22}(µ) = µ_{22}, h_{31}(µ) = µ_{31}, and h_{13}(µ) = µ_{13}. See Exercise 16. In terms of these functions the asymptotic variance can be written as

σ^2 = h^2(µ) = (1/4)g^2(µ){h_{40}(µ)[h_{20}(µ)]^{-2} + h_{04}(µ)[h_{02}(µ)]^{-2} + 2h_{22}(µ)[h_{20}(µ)h_{02}(µ)]^{-1} + 4h_{22}(µ)[h_{11}(µ)]^{-2} − 4h_{31}(µ)[h_{11}(µ)h_{20}(µ)]^{-1} − 4h_{13}(µ)[h_{11}(µ)h_{02}(µ)]^{-1}}.

Therefore, the correlation between two random variables is a parameter that can be represented in the smooth function model with d = 14. 
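The function g of this example can be checked against the usual sample correlation coefficient. The following sketch, which assumes that NumPy is available and uses simulated bivariate data purely for illustration, evaluates g at the relevant components of the vector mean and compares the result with the sample correlation.

import numpy as np

rng = np.random.default_rng(1)
w1 = rng.normal(size=500)
w2 = 0.6 * w1 + 0.8 * rng.normal(size=500)          # a correlated pair of variables

# Only the components of X_n that enter g are needed: W1, W1^2, W2, W2^2, W1*W2.
x1, x2 = np.mean(w1), np.mean(w1**2)
x5, x6 = np.mean(w2), np.mean(w2**2)
x9 = np.mean(w1 * w2)

theta_hat = (x9 - x1 * x5) / np.sqrt((x2 - x1**2) * (x6 - x5**2))
print(theta_hat, np.corrcoef(w1, w2)[0, 1])         # the two computations agree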
The previous examples demonstrate that reasonably smooth functions of mo-
ments can always be represented in the smooth function model, though ex-
pressions for the h function can become quite complicated. It is worthwhile to
note that there are also many parameters which cannot be represented in the
smooth function model. For example, a median, or any other quantile for that
matter, cannot be written in terms of the smooth function model as there is
not any general method for representing the median in terms of the moments
of a distribution. Another example of such a parameter is the mode of a den-
sity. Further, non-smooth functions of the moments of a distribution will also
not fit within the smooth function model.

7.5 General Edgeworth and Cornish–Fisher Expansions

The results in this section rely heavily on the smooth function model presented
in Section 7.4. In particular, we will assume that {Xn }∞ n=1 is a sequence of
independent and identically distributed d-dimensional random vectors from a
d-dimensional distribution F . Let µ = E(Xn ) and assume that the parameter
of interest θ is related to µ through a smooth function g. The parameter θ is
estimated with the plug-in estimate g(X̄n ). Finally, we assume that
σ^2 = lim_{n→∞} V[n^{1/2}g(X̄_n)],
is also related to µ through a smooth function h.
It is worth noting the effect that transformations have on the normal approximation. We know from Theorem 4.22 that n^{1/2}Σ^{-1/2}(X̄_n − µ) →_d Z as n → ∞, where Z is a d-dimensional N(0, I) random variable and Σ is the covariance matrix of X_n, as long as the elements of Σ are all finite. Given this result, we can then apply Theorem 6.5 to find that n^{1/2}[g(X̄_n) − g(µ)] →_d Z as n → ∞, where Z is a N[0, d'(µ)Σd(µ)] random variable and d(µ) is the vector of partial derivatives of g evaluated at µ. To simplify the notation in this section we will let d_i represent the ith element of d(µ). We now encounter some smoothness conditions required in our smooth function model. In particular, we must now assume that d(µ) is not equal to the zero vector and d(x) is continuous in a neighborhood of µ.
An important issue in developing Edgeworth expansions under the smooth
function model is related to finding expressions for the cumulants of θ̂n in
terms of the moments of the distribution F . This is important because, as with
the usual Edgeworth expansion, the coefficients of the terms of the expansion
will be related to the cumulants of θ̂n , and therefore the specification of the
cumulants is required to specify the exact form for the expansion. As an
example we will consider the case of the specification of σ 2 . Suppose that
X0 = (X1 , . . . , Xd ) where X has distribution F and define µij = E{[Xi −
E(Xi )][Xj − E(Xj )]} so that the (i, j)th element of Σ is given by µij . The
quadratic form representing the asymptotic variance of n1/2 (θ̂n − θ), which is
equivalently the asymptotic variance of n1/2 θ̂n , is then given by
σ^2 = d'(µ)Σd(µ) = Σ_{i,j=1}^{d} d_i d_j µ_{ij}.

It turns out that all of the cumulants of n1/2 (θ̂n − θ) can be written using sim-
ilar expressions to these forms, though the proof becomes rather complicated
for each additional cumulant. We shall present only the results. A detailed
argument supporting these results can be found in Chapter 2 of Hall (1992).
Define A(x) = [g(x) − g(µ)]/h(µ), where x' = (x_1, . . . , x_d). Let

a_{i_1···i_k} = ∂^k A(x)/∂x_{i_1}···∂x_{i_k} |_{x=µ}.

It then follows that the first cumulant of n^{1/2}A(X̄_n) = n^{1/2}σ^{-1}(θ̂_n − θ) equals n^{-1/2}A_1 + O(n^{-1}) as n → ∞, where

A_1 = (1/2) Σ_{i,j=1}^{d} a_{ij} µ_{ij}.   (7.40)

If we have chosen our h function correctly, the second cumulant of n1/2 A(X̄n )
should be one. See page 55 of Hall (1992) for further details. The third cumulant of n^{1/2}A(X̄_n) is given by n^{-1/2}A_2 + O(n^{-1}), where

A_2 = Σ_{i,j,k=1}^{d} a_i a_j a_k µ_{ijk} + 3 Σ_{i,j,k,l=1}^{d} a_i a_j a_{kl} µ_{ik} µ_{jl},   (7.41)

where µijk = E{[Xi − E(Xi )][Xj − E(Xj )][Xk − E(Xk )]}.


The main result of this section is that the form of the Edgeworth expansion
for the distribution function of n1/2 σ −1 (θ̂n − θ) has the same form as that
for the standardized sample mean, with the exception that the cumulants are
replaced by those above. This result was established by Bhattacharya and
Ghosh (1978).
Theorem 7.11 (Bhattacharya and Ghosh). Let {Xn }∞ n=1 be a sequence of
independent and identically distributed d-dimensional random vectors from a
distribution F with mean vector µ. Let θ = g(µ) for some function g and
suppose that θ̂n = g(X̄n ). Let σ 2 be the asymptotic variance of n1/2 θ̂n and
assume that σ = h(µ) for some function h. Define A(x) = σ −1 [g(x) − g(µ)]
and assume that A has p+2 continuous derivatives in a neighborhood of µ and
that E(||X||p+2 ) < ∞. Let ψ be the characteristic function of F and assume
that
lim sup_{||t||→∞} |ψ(t)| < 1.   (7.42)

Let G_n(x) = P[n^{1/2}A(X̄_n) ≤ x]; then

G_n(x) = Φ(x) + Σ_{k=1}^{p} n^{-k/2} r_k(x)φ(x) + o(n^{-p/2}),   (7.43)

as n → ∞, uniformly in x where r1 , . . . , rp are polynomials that depend on


the moments of X up to order p + 2.
As indicated previously, the form of the polynomials is the same as in Theorem
7.9. In particular we have that
r_1(x) = −[σ̃^{-1}A_1 + (1/6)σ̃^{-3}A_2 H_2(x)],   (7.44)

where A_1 and A_2 are defined in Equations (7.40) and (7.41) and

σ̃ = Σ_{i,j=1}^{d} a_i a_j µ_{ij}.   (7.45)

In most applications we will have σ̃ = 1. One may note the extra term involving the cumulant A_1 in Equation (7.44) that does not appear in the polynomial defined in Theorem 7.9. This is due to the fact that the first cumulant of X̄_n − µ is zero. Further polynomials can be obtained using the methodology of Withers (1983, 1984). For example, Polansky (1995) obtains

r_2(x) = (1/2)σ̃^{-2}(A_1^2 + A_{22})H_1(x) + (1/24)(4σ̃^{-2}A_1A_2 + σ̃^{-4}A_{43})H_3(x) + (1/72)σ̃^{-2}A_2^2 H_5(x),   (7.46)
where
A_{22} = Σ_{i,j,k=1}^{d} a_i a_{jk} µ_{ijk} + (1/2) Σ_{i,j,k,l=1}^{d} a_{ij} a_{kl} µ_{ik} µ_{jl} + Σ_{i,j,k,l=1}^{d} a_i a_{jkl} µ_{ij} µ_{kl},   (7.47)

and
 2
d X
X d X
d X
d d X
X d
1
A43 = 2 ai aj ak al µijkl − 3  21 ai aj µij  +
i=1 j=1 k=1 l=1 i=1 j=1

X d X
d X d X
d X
d
12 ai aj ak alm µil µjkm +
i=1 j=1 k=1 l=1 m=1
d X
X d X
d X
d X
d X
d
12 ai aj akl amu µik µjm µlu +
i=1 j=1 k=1 l=1 m=1 u=1

X d X
d X d X
d X d
d X
4 ai aj ak almu µil µjm µku . (7.48)
i=1 j=1 k=1 l=1 m=1 u=1

Inversion of the asymptotic expansion given in Equation (7.43) will result in


a Cornish–Fisher expansion for the quantile function of n1/2 A(X̄n ), which
we will denote as g_{α,n} = G_n^{-1}(α). The form of this expansion and the
involved polynomials are the same as is detailed in Theorem 7.10, with the
exception that the cumulants are defined as above. We will again call these
polynomials qk . Edgeworth expansions also exist for the density of n1/2 A(X̄n )
under additional assumptions. These expansions have the same form as given
in Theorem 7.5 with the exception of the form of the cumulants. See Theorem
2.5 of Hall (1992) for further details.
An important assumption in Theorem 7.11 is the multivariate form of Cramér's continuity condition given by Equation (7.42), which is equivalent to the random vector X_n having a non-degenerate absolutely continuous component. The importance of this condition is that it implies that all the points in the distribution of X̄_n have exponentially small probabilities so that the distribution of n^{1/2}A(X̄_n) is virtually continuous. We will not attempt to verify this
condition directly, but will use a result from Hall (1992) which verifies the
condition in the cases we will encounter in this chapter.
Theorem 7.12 (Hall). Let W be a random variable that has a nonsingular distribution and define X' = (W, W^2, . . . , W^d). If ψ is the characteristic function of X, then

lim sup_{||t||→∞} |ψ(t)| < 1.
A proof of Theorem 7.12 is given in Chapter 2 of Hall (1992). Note
that the form of the random vector used in Theorem 7.12 is of the same form
we used in our specification of the smooth function models in Examples 7.6–
7.8. Therefore, as long as the distribution in those examples is nonsingular,
the required result for Cramér’s continuity condition should follow.
Example 7.9. Let {Wn }∞ n=1 be a sequence of independent and identically
distributed random variables from a distribution F with mean η and variance
θ. It was shown in Example 7.7 that θ can be represented using a smooth
function model with d = 4 and X'_n = (W_n, W_n^2, W_n^3, W_n^4) for all n ∈ N. In this representation we specified g(x) = x_2 − x_1^2 and

h(x) = x_4 − 4x_1x_3 + 6x_1^2x_2 − 3x_1^4 − (x_2 − x_1^2)^2,
where x0 = (x1 , x2 , x3 , x4 ). We will now demonstrate how this model can
be used to find the form of the Edgeworth expansion for the standardized
distribution of θ̂n . For simplicity we will examine the special case where η = 0
and θ = 1. In this special case direct calculations can be used to show that
µ' = (0, 1, γ, κ), where we have defined γ = E(W_n^3) and κ = E(W_n^4). It then
follows that σ 2 = h(µ) = κ − θ2 = κ − 1. To obtain the polynomial r1 we
need to find expressions for the constants ai , aij , and aijk for i = 1, . . . , 4,
j = 1, . . . , 4, and k = 1, . . . , 4. In this case A(x) = σ −1 [g(x) − g(µ)] where
σ^2 = h(µ). Therefore,

a_1 = ∂/∂x_1 σ^{-1}[x_2 − x_1^2 − θ]|_{x=µ} = −2σ^{-1}η = 0,

since we have assumed that η = 0. Similarly,

a_2 = ∂/∂x_2 σ^{-1}[x_2 − x_1^2 − θ]|_{x=µ} = σ^{-1},

and a_3 = a_4 = 0. Similarly,

a_{11} = ∂^2/∂x_1^2 σ^{-1}[x_2 − x_1^2 − θ]|_{x=µ} = −2σ^{-1},

a_{12} = a_{21} = a_{22} = 0, a_{3i} = a_{i3} = 0, and a_{4i} = a_{i4} = 0 for i ∈ {1, 2, 3, 4}. Therefore, it follows from Equation (7.40) that

A_1 = (1/2) Σ_{i,j=1}^{d} a_{ij} µ_{ij} = (1/2)a_{11}µ_{11} = −σ^{-1}µ_{11},

where µ_{11} = E(W_n^2) = θ = 1. Therefore A_1 = −σ^{-1}. Similarly, Equation (7.41) implies that

A_2 = Σ_{i,j,k=1}^{d} a_i a_j a_k µ_{ijk} + 3 Σ_{i,j,k,l=1}^{d} a_i a_j a_{kl} µ_{ik} µ_{jl} = a_2^3 µ_{222} + 3a_{11}a_2^2 µ_{21}^2.


Let
µ222 = E[(Wn2 − θ)3 ] = E[(Wn2 − 1)3 ] = E[Wn6 − 3Wn4 + 3Wn2 − 1] = ζ,
and
µ21 = E[(Wn2 − θ)Wn ] = E[Wn3 − Wn ] = γ.
Hence
A2 = σ −3 ζ − 6σ −3 γ 2 .
Finally, Equation (7.45) implies that
σ̃ = Σ_{i,j=1}^{d} a_i a_j µ_{ij} = a_2^2 µ_{22} = σ^{-2}µ_{22},

where µ_{22} = E[(W_n^2 − 1)^2] = E(W_n^4) − 2E(W_n^2) + 1 = κ − 1 = σ^2. Therefore, σ̃ = 1 and it follows that the polynomial r_1 is given by

r_1(x) = σ^{-1} − σ^{-3}[(1/6)ζ − γ^2]H_2(x).


7.6 Studentized Statistics

In the theory of Edgeworth and Cornish–Fisher expansions developed so far


we have dealt with standardized statistics of the form n1/2 σ −1 (θ̂n − θ) where
θ is a parameter that fits within the smooth function model. Note that this
setup requires that the asymptotic variance of n1/2 θ̂n be known. In most
practical cases the variance is unknown and therefore must be estimated from
the observed sample. If we note that within the smooth function model we
have σ = h(µ) and that X̄n is a consistent estimate of µ, then it follows
from Theorem 3.9 that we can obtain a consistent estimate of σ as σ̂n =
h(X̄n ). We can now apply Theorem 4.11 (Slutsky) as detailed in Example
4.13 and conclude that n1/2 σ̂n−1 (θ̂n − θ) is still asymptotically Normal. The
most well known example of this type of application occurs when θ̂n is the
sample mean. Even when the population is Normal, n1/2 σ̂n−1 (θ̂n −θ) does not
have a N(0, 1) distribution but instead has a T(n − 1) distribution. However,
n1/2 σ̂n−1 (θ̂n − θ) still has an asymptotic Normal distribution. The case of the
sample mean under a normal population is interesting because the distribution
of n1/2 σ̂n−1 (θ̂n − θ) is known for finite values of n. The T(n − 1) distribution
is similar in shape to a N(0, 1) distribution in that both distributions are
symmetric with a single peak at the origin. However, the T(n−1) distribution
has heavier tails, with the difference between the two distributions vanishing
as n → ∞. This suggests that an Edgeworth type correction to the N(0, 1)
distribution might be possible for this case. It turns out that this type of
correction is possible even when the population is not normal.
We will need to specify some additional notation before presenting the result.
Define B(x) = [g(x) − g(µ)]/h(x), where x ∈ R^d, and note that B(X̄_n) = σ̂_n^{-1}(θ̂_n − θ). While n^{1/2}A(X̄_n) was called the standardized version of θ̂_n, the function n^{1/2}B(X̄_n) is usually called the studentized version of θ̂_n, alluding to the distribution of n^{1/2}B(X̄_n) when the population is Normal. Let

b_{i_1···i_k} = ∂^k B(x)/∂x_{i_1}···∂x_{i_k} |_{x=µ}.

With this definition, the Edgeworth type correction for the distribution func-
tion of B(X̄n ) then has the same form as that of A(X̄n ), with the exception
that the constants ai1 ···ik are replaced by bi1 ···ik .
Theorem 7.13. Let {Xn }∞ n=1 be a sequence of independent and identically
distributed d-dimensional random vectors from a distribution F with mean
vector µ. Let θ = g(µ) for some function g and suppose that θ̂n = g(X̄n ). Let
σ 2 be the asymptotic variance of n1/2 θ̂n and assume that σ = h(µ) for some
function h. Define B(x) = [g(x) − g(µ)]/h(x) and assume that B has p + 2
continuous derivatives in a neighborhood of µ and that E(||X||p+2 ) < ∞. Let
ψ be the characteristic function of F and assume that
lim sup_{||t||→∞} |ψ(t)| < 1.   (7.49)

Let H_n(x) = P[n^{1/2}B(X̄_n) ≤ x]; then

H_n(x) = Φ(x) + Σ_{k=1}^{p} n^{-k/2} v_k(x)φ(x) + o(n^{-p/2}),   (7.50)

as n → ∞, uniformly in x where v1 , . . . , vp are polynomials that depend on


the moments of X up to order p + 2.

The polynomial v_1 is given by

v_1(x) = −[B_1 + (1/6)B_2 H_2(x)],   (7.51)

where

B_1 = (1/2) Σ_{i,j=1}^{d} b_{ij} µ_{ij},   (7.52)

B_2 = Σ_{i,j,k=1}^{d} b_i b_j b_k µ_{ijk} + 3 Σ_{i,j,k,l=1}^{d} b_i b_j b_{kl} µ_{ik} µ_{jl},   (7.53)

and µ_{ij} and µ_{ijk} are as defined previously. Even though r_1 and v_1 have the same form, the two polynomials are not equal, as a_{i_1···i_k} ≠ b_{i_1···i_k}. In fact, Hall (1988a) points out that

r_1(x) − v_1(x) = −(1/2)σ^{-3} Σ_{i,j=1}^{d} a_i c_j µ_{ij} x^2,
where

c_k = ∂g(x)/∂x_k |_{x=µ}.

See Exercise 18. Further polynomials may be obtained using similar methods.

The expansion in Equation (7.50) can also be inverted to obtain a Cornish–


Fisher type expansion for the quantiles of the distribution of B(X̄_n). This expansion has the form

h_{α,n} = z_α + Σ_{k=1}^{p} n^{-k/2} s_k(z_α) + o(n^{-p/2}),

as n → ∞, where s_1(x) = −v_1(x) and s_2(x) = v_1(x)v_1'(x) − (1/2)x v_1^2(x) − v_2(x).
Example 7.10. Let {Wn }∞ n=1 be a sequence of independent and identically
distributed random variables from a distribution F with mean θ and variance
σ^2. It was shown in Example 7.6 that this parameter can be represented in the smooth function model with X'_n = (W_n, W_n^2), g(x) = x_1, and h(x) = x_2 − x_1^2, where we would have θ̂_n = W̄_n and

σ̂_n^2 = n^{-1} Σ_{k=1}^{n} (W_k − W̄_n)^2.

In this example, we will derive the one-term Edgeworth expansion for the studentized distribution of θ̂_n and compare it to Equation (7.24). For simplicity, we will consider the case where θ = 0 and σ = 1. To obtain this expansion, we must first find the constants b_1, b_2, b_{11}, b_{12}, b_{21}, and b_{22}, where B(x) = (x_1 − θ)(x_2 − x_1^2)^{-1/2}. We first note that


b_1 = ∂B(x)/∂x_1 |_{x=µ}
= [x_1(x_1 − θ)(x_2 − x_1^2)^{-3/2} + (x_2 − x_1^2)^{-1/2}]|_{x=µ}
= θ(θ − θ)(β − θ^2)^{-3/2} + (β − θ^2)^{-1/2}
= (β − θ^2)^{-1/2}
= 1,

where β = E(W_n^2), so that β − θ^2 = σ^2 = 1. Similarly,

b_2 = ∂B(x)/∂x_2 |_{x=µ}
= −(1/2)(x_1 − θ)(x_2 − x_1^2)^{-3/2}|_{x=µ}
= 0.
The second partial derivatives are given by

b_{11} = ∂^2 B(x)/∂x_1^2 |_{x=µ}
= [2x_1(x_2 − x_1^2)^{-3/2} + 3(x_1 − θ)x_1^2(x_2 − x_1^2)^{-5/2} + (x_1 − θ)(x_2 − x_1^2)^{-3/2}]|_{x=µ}
= 0,

b_{12} = b_{21} = ∂^2 B(x)/∂x_1∂x_2 |_{x=µ}
= [−(3/2)x_1(x_1 − θ)(x_2 − x_1^2)^{-5/2} − (1/2)(x_2 − x_1^2)^{-3/2}]|_{x=µ}
= −(1/2)(β − θ^2)^{-3/2}
= −(1/2),

and

b_{22} = ∂^2 B(x)/∂x_2^2 |_{x=µ}
= (3/4)(x_1 − θ)(x_2 − x_1^2)^{-5/2}|_{x=µ}
= 0.
The expressions for B1 and B2 also require us to find the moments of the form
µij and µijk for i = 1, 2, j = 1, 2, and k = 1, 2. However, due to the fact that
b2 = b11 = b22 = 0 we only need to find µ11 , µ12 , and µ111 . Letting κ3 denote
the third cumulant of Wn we have in this case that µ11 = E[(Wn − θ)2 ] =
β − θ2 = 1, µ12 = µ21 = E[Wn3 ] − θβ = κ3 , µ111 = E[(Wn − θ)3 ] = κ3 , We
now have enough information to compute B1 and B2 . From Equation (7.52),
we have that
2 X
X 2
1
B1 = 2 bij µij = 12 (b12 µ12 + b21 µ21 ) = − 12 κ3 .
i=1 j=1

To find B2 we first note that


2 X
X 2 X
2
bi bj bk µijk = b31 µ111 = κ3 .
i=1 j=1 k=1

Similarly,
2 X
X 2 X
2 X
2
bi bj bkl µik µjl = b21 b12 µ11 µ12 + b21 b21 µ12 µ11 = −κ3 .
i=1 j=1 k=1 l=1

Hence, it follows from Equation (7.53) that B2 = −2κ3 , so that Equation


[Figure 7.6. The Edgeworth expansion for the standardized sample mean (solid line) and the studentized mean (dashed line) when n = 5 and κ_3 = 1.]
(7.51) implies that

v_1(x) = (1/2)κ_3 + (1/3)κ_3 H_2(x) = (1/6)κ_3(2x^2 + 1).

Hence, under the assumptions of Theorem 7.13 we have that

P[n^{1/2}σ̂_n^{-1}(X̄_n − θ) ≤ x] = Φ(x) + (1/6)n^{-1/2}φ(x)κ_3(2x^2 + 1) + o(n^{-1/2}),   (7.54)
as n → ∞. For comparison, the one-term Edgeworth expansion from Equation
(7.24) and the expansion from Equation (7.54) are plotted in Figure 7.6 for
the case when κ3 = 1 and n = 5. One can observe from these plots the heavier
lower tail associated with the studentized distribution of the sample mean.
This heavier tail accounts for the extra variability that is introduced into
the distribution because the parameter σ has been replaced by the random
variable σ̂_n, along with the fact that the distribution is positively skewed. From the viewpoint of statistical inference, this heavier tail can be interpreted as
having less precise information about the underlying population which results
in a wider upper confidence limit and possibly less powerful hypothesis tests
for the mean. 
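The two one-term approximations compared in Figure 7.6 can be tabulated directly. The following sketch, which assumes that SciPy is available, evaluates the expansion for the standardized mean from Equation (7.24), with σ = 1 and µ_3 = κ_3, and the expansion for the studentized mean from Equation (7.54), when κ_3 = 1 and n = 5.

import numpy as np
from scipy.stats import norm

n, kappa3 = 5, 1.0
x = np.linspace(-4, 4, 9)

standardized = norm.cdf(x) + kappa3 * (1 - x**2) * norm.pdf(x) / (6 * np.sqrt(n))    # Equation (7.24)
studentized = norm.cdf(x) + kappa3 * (2 * x**2 + 1) * norm.pdf(x) / (6 * np.sqrt(n)) # Equation (7.54)

for xi, a, b in zip(x, standardized, studentized):
    print(round(xi, 1), round(a, 4), round(b, 4))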

Further terms in the Edgeworth expansion for studentized statistics can be


determined using calculations of the same form used to find r2 in the case of
standardized statistics. See, for example, Exercise 19. Cornish–Fisher type ex-
pansions can also be obtained for the quantile functions of studentized statis-
tics within the smooth function model. The form of these expansions is the same as that given in Section 7.3.

7.7 Saddlepoint Expansions

Exponential tilting is a methodology that has been developed to increase the


accuracy of the Edgeworth expansion in the tails of the distribution of the stan-
dardized sample mean. Consider the case where {Xn }∞ n=1 is a sequence of
independent and identically distributed random variables from a distribution
F . Under the assumptions of Theorem 7.4 we have that
fn(t) = φ(t) + (1/6) n^{-1/2} ρ3 H3(t) φ(t) + O(n^{-1}),
as n → ∞ where fn (t) is the density of n1/2 σ −1 (X̄n − µ) and ρ3 is the third
standardized cumulant of F . The error term of this expansion is O(n−1 ) as
n → ∞ which provides a description of the asymptotic error of the expansion.
This error is uniform in t, meaning that it applies to any value of t ∈ R.
However, there are points where the error of the expansion may be quite a bit
less than is indicated by the form of the error term. For example, if we consider
the form of the expansion at the point x = 0 we note that φ(0)H3 (0) = 0
and therefore we obtain the expansion fn (0) = φ(0) + O(n−1 ) as n → ∞.
Hence, the normal approximation provided by Theorem 4.20 (Lindeberg-Lévy)
is actually more accurate near zero, or near the center of the distribution of
the standardized mean.
An idea of how this term affects the expansion can be obtained by observing
the behavior in Figure 7.7. The function is zero at the origin but quickly increases in magnitude as we move away from the origin. Asymptotically, however, the exponential factor in the Normal density eventually dominates the polynomial H3(t) and the term returns to near zero once again. Hence,
the point t = 0 is the only point where the normal approximation attains the
smaller asymptotic error term. In practical terms this analysis may provide
a somewhat distorted view of how well the Edgeworth expansion performs
in the tails of the distribution. While the actual error of the first term of
the Edgeworth expansion may become smaller as t → ±∞, the error may be
large compared to the actual density we are attempting to approximate. A
more practical viewpoint looks at this error relative to the actual density of
the standardized sample mean. In general this density is unknown without
further specification of F , but we can obtain an idea of the relative error by
comparing it to the normal density. That is, we would just look at the function
H3 (x), which is plotted in Figure 7.8. It is clear from this plot that the relative
error of the one-term Edgeworth expansion becomes quite large as t → ±∞.
The idea behind exponential tilting is to take advantage of the error term in
the Edgeworth expansion when t = 0. The consequence of using this method-
ology is that the resulting approximation will have a relative error of O(n−1 )

Figure 7.7 The function φ(t)H3(t) from the first term of an Edgeworth expansion for the standardized sample mean.

as n → ∞ over a large interval of t values in R whereas the usual Edgeworth


expansion has an absolute error of O(n−1 ) as n → ∞ only at the origin.

To begin our development, let f be the density associated with F and assume
that f has moment generating function m(u) and cumulant generating func-
tion c(u), both of which are assumed to exist. Define fλ (t) = exp(λt)f (t)/m(λ)
for some real parameter λ. It then follows that fλ (t) is a density since fλ (t) ≥ 0
and

∫_{-∞}^{∞} exp(λt) f(t) dt = m(λ),

which implies

∫_{-∞}^{∞} fλ(t) dt = [m(λ)]^{-1} ∫_{-∞}^{∞} exp(λt) f(t) dt = 1.
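A concrete case may help fix ideas. If f is an Exponential(1) density then m(λ) = (1 − λ)^{-1} for λ < 1, and fλ(t) = (1 − λ)exp[−(1 − λ)t] is again an exponential density, now with rate parameter 1 − λ. The short R sketch below simply verifies these two facts numerically for an illustrative value of λ.

# Sketch: exponential tilting of an Exponential(1) density f(t) = exp(-t), t > 0.
lambda <- 0.4
f.tilt <- function(t) exp(lambda * t) * exp(-t) * (1 - lambda)
# The tilted function integrates to one ...
integrate(f.tilt, 0, Inf)$value
# ... and coincides with the exponential density with rate 1 - lambda.
max(abs(f.tilt(0:20) - dexp(0:20, rate = 1 - lambda)))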

Several properties of this density can be obtained. Let Eλ denote the expec-
tation with respect to fλ and suppose that Y is a random variable following
the density fλ. That is,

Eλ(Y) = ∫_{-∞}^{∞} t fλ(t) dt = [m(λ)]^{-1} ∫_{-∞}^{∞} t exp(λt) f(t) dt.

Figure 7.8 The function H3(t) from the first term of an Edgeworth expansion for the standardized sample mean, which provides an approximation to the relative error of the Edgeworth expansion as a function of t.

Now the moment generating function of fλ (t) is given by


mλ(u) = Eλ[exp(uY)] = [m(λ)]^{-1} ∫_{-∞}^{∞} exp(ut) exp(λt) f(t) dt
      = [m(λ)]^{-1} ∫_{-∞}^{∞} exp[(u + λ)t] f(t) dt = [m(λ)]^{-1} m(u + λ).

Now suppose that {Yn}_{n=1}^{∞} is a sequence of independent and identically distributed random variables following the density fλ. Then Theorem 2.25 implies that the moment generating function of nȲn can be found to be

Eλ[exp(unȲn)] = {Eλ[exp(uYn)]}^n = [m(u + λ)/m(λ)]^n.

The central idea of exponential tilting is to develop a connection between


the distribution of nX̄n and nȲn . The density of nX̄n can then be written in
terms of the density of nȲn where we are free to choose the parameter λ as we
please. By making a specific choice for λ, the density of nX̄n can be written in terms of the center of the density of nȲn. An Edgeworth expansion is then
applied to the center of the density of nȲn to obtain the expansion we want.
We must first find the connection between the density of nX̄n and nȲn . Let
fn denote the density of nX̄n and fn,λ denote the density of nȲn . We have ob-
tained the fact that the moment generating functions of fn and fn,λ are mn (u)
and mn (u + λ)/mn (λ) respectively. We now informally follow the argument
in Section 4.3 of Barndorff-Nielsen and Cox (1989). Note that since
m(u) = ∫_{-∞}^{∞} exp(ut) f(t) dt,

it follows that if we invert m(u) we get the density f (t). Similarly, we note
that
m(u + λ) = ∫_{-∞}^{∞} exp[t(u + λ)] f(t) dt = ∫_{-∞}^{∞} exp(ut)[exp(λt) f(t)] dt.

Therefore, if we invert the moment generating function m(u + λ) we should


get the function exp(λt)f (t). Noting that m(λ) is a constant with respect to
u, it follows that if we invert m(u + λ)/m(λ) we obtain exp(λt)f (t)/m(λ).
We now shift this same argument to the moment generating functions of nX̄n
and nȲn . We know that mn (u) is the moment generating function of nX̄n
so that mn (u) should invert to fn (t). Similarly, following the above pattern,
mn(u + λ) should invert to exp(λt)fn(t), and hence the density associated
with mn (u + λ)/mn (λ) is exp(λt)fn (t)/mn (λ). This establishes that fn,λ (t) =
exp(λt)fn (t)/mn (λ). Note that from Definition 2.13 we have that mn (λ) =
exp{n log[m(λ)]} = exp[nc(λ)] so that we have established the identity
fn,λ (t) = exp[λt − nc(λ)]fn (t), (7.55)
which connects the density of nX̄n given by fn (t) to the density of nȲn given by
fn,λ (t). The informality in this argument comes from the fact that exp(λt)f (t)
may not be a density, but we may think of the inversion of the moment
generating functions as a type of Laplace transformation. Alternatively, we
could work with characteristic functions. See Section 4.3 of Barndorff-Nielsen
and Cox (1989) for further details.
Suppose that our aim is to obtain an approximation of fn (t) at a point t ∈ R
and we wish to take advantage of the identity in Equation (7.55) so that we
can choose λ to give us an accurate Edgeworth expansion. To begin, we note
that Equation (7.55) implies that fn (t) = exp[nc(λ) − λt]fn,λ (t). We will take
an Edgeworth expansion for fn,λ but we wish to do so where the expansion is
most accurate, which is in the center. The key to this is to choose λ so that t
corresponds to the expected value of nȲn . That is, we choose λ = λ̃ so that
Eλ (nȲn ) = t. Now Eλ (nȲn ) = nEλ (Y ) where we can find the expectation of
Y using Theorem 2.21 along with the moment generating function of Y . That
is,

Eλ(Y) = d/du [m(u + λ)/m(λ)] |_{u=0} = m′(λ)/m(λ) = d/du log[m(u)] |_{u=λ} = c′(λ).

Therefore, it follows that we choose λ = λ̃ such that nc′(λ̃) = t.
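In applications the equation nc′(λ̃) = t is usually solved numerically. The R sketch below does this with uniroot for the Exponential(1) cumulant generating function c(u) = −log(1 − u), where Exercise 22 gives the closed-form solution λ̃ = (t − n)/t as a check; the values of n and t are illustrative.

# Sketch: solve the saddlepoint equation n c'(lambda) = t numerically for
# Exponential(1) summands, where c'(u) = 1/(1 - u) for u < 1.
c.prime <- function(u) 1 / (1 - u)
n <- 10
tval <- 15
lambda.tilde <- uniroot(function(l) n * c.prime(l) - tval,
                        lower = -50, upper = 1 - 1e-8)$root
lambda.tilde       # numerical root of the saddlepoint equation
(tval - n) / tval  # closed-form solution, for comparison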


To find the approximation we must now find an Edgeworth expansion for
fn,λ (t) evaluated at the mean of nȲn . Theorem 7.4 provides an expansion for
the density of n1/2 σ −1 (Ȳn − µ). To connect these two densities, we first note
that

Fn,λ(t) = P(nȲn ≤ t) = P[n^{1/2} σ̃^{-1}(Ȳn − µ̃) ≤ n^{-1/2} σ̃^{-1}(t − nµ̃)] = Gn[n^{-1/2} σ̃^{-1}(t − nµ̃)],
where Gn is the distribution function of n1/2 σ −1 (Ȳn − µ), µ̃ = E(Y ), and
σ̃ 2 = V (Y ). Therefore, it follows that the density of nȲn can be written as
fn,λ(t) = d/dt Gn[n^{-1/2} σ̃^{-1}(t − nµ̃)] = n^{-1/2} σ̃^{-1} gn[n^{-1/2} σ̃^{-1}(t − nµ̃)].
Now, Theorem 7.4 implies that
gn(t) = φ(t) + (1/6) n^{-1/2} φ(t) ρ3 H3(t) + O(n^{-1}),   (7.56)
as n → ∞. Evaluating fn,λ at the mean of nȲn implies that fn,λ (nµ̃) =
n−1/2 σ̃ −1 gn (0). Equation (7.56) then implies that
gn(0) = φ(0) + (1/6) n^{-1/2} φ(0) ρ3 H3(0) + O(n^{-1}) = (2π)^{-1/2}[1 + O(n^{-1})].
Therefore, choosing λ = λ̃ so that t = nµ̃ implies that
fn (t) = exp[nc(λ̃) − λ̃t]fn,λ (nµ̃)
= exp[nc(λ̃) − λ̃t]n−1/2 σ̃ −1 (2π)−1/2 [1 + O(n−1 )].
To complete the expansion we must find an expression for σ̃. It can be shown
that σ̃^2 = c″(λ̃). See Exercise 21. Therefore, the final form of the expansion is given by

fn(t) = [2πn c″(λ̃)]^{-1/2} exp[nc(λ̃) − λ̃t][1 + O(n^{-1})],   (7.57)
as n → ∞. The expansion in Equation (7.57) is called a tilted Edgeworth ex-
pansion, indirect Edgeworth expansion, or a saddlepoint expansion. The name
tilted Edgeworth expansion comes from the fact that forming the density fλ is
called exponential tilting. The name saddlepoint expansion has its basis in an
alternate method for deriving the approximation using a contour integral. See
Daniels (1954). The derivation used in this section is based on the derivation
from Section 4.3 of Barndorff-Nielsen and Cox (1989).
There are several relevant issues about the expansion in Equation (7.57) that
warrant further discussion. The first concerns the parameter value λ̃ which is
defined implicitly through the equation nc′(λ̃) = t, and whether one should
expect that the equation can always be solved. This will not present a prob-
lem for the cases studied in this book, but for a more general discussion see
Chapter 6 of Barndorff-Nielsen and Cox (1989). The second issue concerns the
applicability of the method for various values of t. The expansion in Equation
(7.57) guarantees a relative error of O(n−1 ), as long as we are able to find
the value λ̃ that corresponds to the t value we wish to create the expansion
around. Section 4.3 of Barndorff-Nielsen and Cox (1989) points out that the
expansion should be valid for any value of t for which |t − nE(Xn )| < bn
for a fixed value of b. There are even cases where the expansion is valid for
all t ∈ R. For further information see Daniels (1954) and Jensen (1988). For
a detailed book-length exposition of saddlepoint methods, with many useful
applications, see Butler (2007).
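The derivation above translates directly into a small generic routine. The R sketch below is not a formal algorithm from this section; it simply evaluates the leading term of Equation (7.57), taking the cumulant generating function and its first two derivatives as arguments and solving the saddlepoint equation numerically.

# Sketch: saddlepoint (tilted Edgeworth) approximation to the density of
# n * Xbar_n at the point t, using the leading term of Equation (7.57).
saddle.density <- function(t, n, cgf, cgf1, cgf2, interval) {
  lambda <- uniroot(function(l) n * cgf1(l) - t, interval = interval)$root
  (2 * pi * n * cgf2(lambda))^(-1/2) * exp(n * cgf(lambda) - lambda * t)
}
# Check with N(0, 1) summands, c(u) = u^2/2, where the approximation is exact
# (see Example 7.11): compare with the N(0, n) density of n * Xbar_n.
n <- 5
saddle.density(3, n, cgf = function(u) u^2 / 2, cgf1 = function(u) u,
               cgf2 = function(u) 1, interval = c(-10, 10))
dnorm(3, mean = 0, sd = sqrt(n))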
Example 7.11. Let {Xn }∞ n=1 be a sequence of independent and identically
distributed random variables following a N(µ, σ 2 ) distribution. We wish to
derive a saddlepoint expansion that approximates the density of nX̄n at a
point x with relative error O(n−1 ) as n → ∞. From Example 2.22 we know
that the cumulant generating function of Xn is c(t) = tµ + (1/2)t^2 σ^2 for all
t ∈ R. Therefore, setting x = nc′(λ̃) implies that x = nµ + nλ̃σ^2, so that λ̃ = σ^{-2}(n^{-1}x − µ). Further, we have that c″(t) = σ^2 for all t ∈ R so that c″(λ̃) = σ^2. We also need an expression for nc(λ̃), which is given by
nc(λ̃) = nµ[σ^{-2}(n^{-1}x − µ)] + (1/2)nσ^2[σ^{-4}(n^{-1}x − µ)^2]
       = nσ^{-2}[n^{-1}µx − µ^2 + (1/2)n^{-2}x^2 − µn^{-1}x + (1/2)µ^2]
       = (1/2)σ^{-2}(−nµ^2 + n^{-1}x^2).
It then follows that the saddlepoint expansion given in Equation (7.57) has
the form
fn(x) = [2πn c″(λ̃)]^{-1/2} exp[nc(λ̃) − λ̃x][1 + O(n^{-1})]
      = (2πnσ^2)^{-1/2} exp[(1/2)σ^{-2}(−nµ^2 + n^{-1}x^2) − xσ^{-2}(n^{-1}x − µ)][1 + O(n^{-1})]
      = (2πnσ^2)^{-1/2} exp[−(1/2)n^{-1}σ^{-2}(x − nµ)^2][1 + O(n^{-1})],
which has a leading term equal to a N(nµ, nσ 2 ) density, which matches the
exact density of nX̄n . Therefore, the saddlepoint expansion provides the exact
answer in the leading term. 
Example 7.12. Let {Xn }∞ n=1 be a sequence of independent and identically
distributed random variables following a ChiSquared(θ) distribution. We
wish to derive a saddlepoint expansion that approximates the density of nX̄n
at a point x with relative error O(n−1 ) as n → ∞. The cumulant generating
function of a ChiSquared(θ) distribution is given by c(t) = −(1/2)θ log(1 − 2t) for t < 1/2. Therefore, it follows that c′(t) = θ(1 − 2t)^{-1}, and solving nc′(λ̃) = x for λ̃ yields λ̃ = 1/2 − (1/2)nθx^{-1}. Note that since c(t) only exists for t < 1/2 we can only find the correct value of λ̃ for values of x such that λ̃ < 1/2. This turns out not to be a problem, since λ̃ < 1/2 for all x > 0. Further, it follows that c″(t) = 2θ(1 − 2t)^{-2} so that c″(λ̃) = 2x^2 n^{-2} θ^{-1}. Finally, we have that c(λ̃) = −(1/2)θ log(nθx^{-1}). Therefore, the saddlepoint expansion has the form

fn(x) = [4πn^{-1}x^2 θ^{-1}]^{-1/2} exp[−(1/2)nθ log(nθx^{-1}) − (1/2)(x − nθ)][1 + O(n^{-1})].


Barndorff-Nielsen and Cox (1989) point out that an application of Theorem
1.8 (Stirling) implies that the leading term of this expansion has the same
form as a Gamma distribution. See Exercise 25. 
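The quality of this approximation can be examined directly, since the exact density of nX̄n is that of a ChiSquared(nθ) random variable. The R sketch below plots the exact density against the leading term of the expansion for illustrative values θ = 1 and n = 5, computing the expansion from the cumulant generating function rather than from the closed form above.

# Sketch: leading saddlepoint term versus the exact chi-squared density of
# n * Xbar_n for ChiSquared(theta) summands.
theta <- 1
n <- 5
x <- seq(0.5, 25, length.out = 200)
lambda <- (1 - n * theta / x) / 2                  # solves n c'(lambda) = x
cgf  <- function(u) -theta * log(1 - 2 * u) / 2    # c(u) for ChiSquared(theta)
cgf2 <- function(u) 2 * theta * (1 - 2 * u)^(-2)   # c''(u)
f.saddle <- (2 * pi * n * cgf2(lambda))^(-1/2) * exp(n * cgf(lambda) - lambda * x)
plot(x, dchisq(x, df = n * theta), type = "l", xlab = "x", ylab = "density")
lines(x, f.saddle, lty = 2)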

7.8 Exercises and Experiments

7.8.1 Exercises

1. Let f be a real function and define the Fourier norm as Feller (1971) does
as (2π)^{-1} ∫_{-∞}^{∞} |f(x)| dx.
For a fixed value of x, is this function a norm?
2. Prove that the Fourier transformation of Hk(x)φ(x) is (it)^k exp(−(1/2)t^2).
Hint: Use induction and integration by parts as in the partial proof of The-
orem 7.1.
3. Let {Xn }∞
n=1 be a sequence of independent and identically distributed ran-
dom variables where Xn has a Gamma(α, β) distribution for all n ∈ N.

a. Compute one- and two-term Edgeworth expansions for the density of
n1/2 σ −1 (X̄n − µ) where in this case µ = αβ and σ 2 = αβ 2 . What effect
do the values of α and β have on the accuracy of the expansion? Is it
possible to eliminate either the first or second term through a specific
choice of α and β?
b. Compute one- and two-term Edgeworth expansions for the distribution
function of n1/2 σ −1 (X̄n − µ).
c. Compute one- and two-term Cornish–Fisher expansions for the quantile
function of n1/2 σ −1 (X̄n − µ).

4. Let {Xn }∞
n=1 be a sequence of independent and identically distributed ran-
dom variables where Xn has a Beta(α, β) distribution for all n ∈ N.

a. Compute one- and two-term Edgeworth expansions for the density of


n^{1/2}σ^{-1}(X̄n − µ) where in this case µ = α/(α + β) and

σ^2 = αβ/[(α + β)^2(α + β + 1)].
What effect do the values of α and β have on the accuracy of the expan-
sion? Is it possible to eliminate either the first or second term through
a specific choice of α and β?
b. Compute one- and two-term Edgeworth expansions for the distribution
function of n1/2 σ −1 (X̄n − µ).
c. Compute one- and two-term Cornish–Fisher expansions for the quantile
function of n1/2 σ −1 (X̄n − µ).
5. Let {Xn }∞ n=1 be a sequence of independent and identically distributed ran-
dom variables where Xn has a density that is a mixture of two Normal
densities of the form f (x) = θφ(x) + (1 − θ)φ(x − 1) for all x ∈ R, where
θ ∈ [0, 1].

a. Compute one- and two-term Edgeworth expansions for the density of


n1/2 σ −1 (X̄n − µ) where in this case µ = E(Xn ) and σ 2 = V (Xn ). What
effect does the value of θ have on the accuracy of the expansion? Is it
possible to eliminate either the first or second term through a specific
choice of θ?
b. Compute one- and two-term Edgeworth expansions for the distribution
function of n1/2 σ −1 (X̄n − µ).
c. Compute one- and two-term Cornish–Fisher expansions for the quantile
function of n1/2 σ −1 (X̄n − µ).

6. Prove Theorem 7.8. That is, let {Xn }∞ n=1 be a sequence of independent
and identically distributed random variables from a distribution F . Let
Fn (t) = P [n1/2 σ −1 (X̄n − µ) ≤ t] and assume that E(Xn3 ) < ∞. If F is a
non-lattice distribution then prove that
Fn(x) − Φ(x) − (1/6)σ^{-3} n^{-1/2} µ3 (1 − x^2) φ(x) = o(n^{-1/2}),
as n → ∞ uniformly in x. The first part of this proof is provided after
Theorem 7.4. At what point is it important that the distribution be non-
lattice?
7. Let {Rn}_{n=1}^{∞} be a sequence of real numbers such that Rn = o(n^{-1/2}) as n → ∞. Prove that Rn^2 = o(n^{-1}) as n → ∞.
8. Suppose that v1 (α) and v2 (α) are constant with respect to n. Prove that if
Rn = [n−1/2 v1 (α) + n−1 v2 (α) + o(n−1 )]2 then a sequence that is o(Rn ) as
n → ∞ is also o(n−1 ) as n → ∞.
9. Suppose that gα,n = v0 (α) + n−1/2 v1 (α) + n−1 v2 (α) + o(n−1 ) as n → ∞
where v0 (α), v1 (α), and v2 (α) are constant with respect to n. Prove that
H3 (gα,n ) = H3 [v0 (α)] + o(1) and H4 (gα,n ) = H4 [v0 (α)] + o(1) as n → ∞.
10. Prove that ∫_{-∞}^{∞} exp(tx) φ(x) Hk(x) dx = t^k exp((1/2)t^2).

11. Suppose that X1 , . . . , Xn are a set of independent and identically dis-


tributed random variables from a distribution F that has mean equal to
zero, unit variance, and cumulant generating function c(t). Find the form
of the polynomial p3 (x) from Theorem 7.5 by considering the form of an
expansion for the cumulant generating function of n1/2 X̄n that has an er-
ror term equal to o(n−3/2 ) as n → ∞. What assumptions must be made in
order to apply Theorem 7.5 to this problem?
12. Suppose that X1 , . . . , Xn are a set of independent and identically dis-
tributed random variables from a distribution F that has mean equal to
θ, variance equal to σ 2 , and cumulant generating function c(t). Find the
form of the polynomials p1 (x), p2 (x) and p3 (x) from Theorem 7.5 by con-
sidering the form of an expansion for the cumulant generating function of
n1/2 σ −1 (X̄n − θ) that has an error term equal to o(n−3/2 ) as n → ∞.
13. Using Equation (7.28), prove that −r1(x) = (1/6)ρ3 H2(x) and

−r2(x) = (1/24)ρ4 H3(x) + (1/72)ρ3^2 H5(x).

14. Let {Xn }∞


n=1 be a sequence of independent and identically distributed ran-
dom variables from a distribution F . Let Fn (t) = P [n1/2 σ −1 (X̄n − µ) ≤ t]
and assume that E(Xn3 ) < ∞. Suppose that F is a non-lattice distribution,
then
Fn (x) = Φ(x) + n−1/2 r1 (x)φ(x) + n−1 r2 (x)φ(x) + o(n−1 ), (7.58)
as n → ∞ uniformly in x. Prove that q1 (x) = −r1 (x) and
q2(x) = r1(x)r1′(x) − (1/2)x r1^2(x) − r2(x)
by inverting the expansion given in Equation (7.58).
15. Let {Wn }∞n=1 be a sequence of independent and identically distributed ran-
dom variables from a distribution F with mean η and variance θ. Prove
that the parameter θ can be represented in the smooth function model
with d = 4, Xn′ = (Wn, Wn^2, Wn^3, Wn^4) for all n ∈ N, g(x) = x2 − x1^2, and

h(x) = x4 − 4x1x3 + 6x1^2 x2 − 3x1^4 − (x2 − x1^2)^2,
where x0 = (x1 , x2 , x3 , x4 ). This will verify the results of Example 7.7.
16. In the context of Example 7.8, let {Wn }∞ n=1 be a sequence of independent
and identically distributed bivariate random vectors from a distribution F
having mean vector η and covariance matrix Σ. Let Wn′ = (Wn1, Wn2) for all n ∈ N and define µij = E{[Wn1 − E(Wn1)]^i [Wn2 − E(Wn2)]^j} and µ′ij = E(Wn1^i Wn2^j). Let

µ′ = (µ′10, µ′20, µ′30, µ′40, µ′01, µ′02, µ′03, µ′04, µ′11, µ′21, µ′12, µ′22, µ′31, µ′13),

and prove that if we define h20(x) = x2 − x1^2, h02(x) = x6 − x5^2, h11(x) = x9 − x1x5,

h22(x) = x12 − 2x5x10 + x5^2 x2 − 2x1x11 + 4x1x5x9 − 3x1^2 x5^2 + x1^2 x6,

h31(x) = x13 − 3x1x10 + 3x1^2 x9 − x3x5 + 3x1x5x2 − 3x1^3 x5,

and

h13(x) = x14 − 3x5x11 + 3x5^2 x9 − x1x7 + 3x1x5x6 − 3x1x5^3,

then it follows that h20(µ′) = µ20, h02(µ′) = µ02, h11(µ′) = µ11, h22(µ′) = µ22, h31(µ′) = µ31, and h13(µ′) = µ13.
17. Prove that the polynomials given in Equations (7.44) and (7.46) reduce
to those given in Equations (7.29) and (7.30), when θ is taken to be the
univariate mean.
18. Let {Xn }∞n=1 be a sequence of independent and identically distributed d-
dimensional random vectors from a distribution F with mean vector µ.
Let θ = g(µ) for some function g and suppose that θ̂n = g(X̄n ). Let σ 2
be the asymptotic variance of n1/2 θ̂n and assume that σ = h(µ) for some
function h. Define A(x) = [g(x) − g(µ)]/h(µ), B(x) = [g(x) − g(µ)]/h(x)
and assume that B has p + 2 continuous derivatives in a neighborhood of µ
and that E(||X||p+2 ) < ∞. Finally, define the constants ai1 ···ik and bi1 ···ik
by

a_{i1···ik} = ∂^k A(x)/(∂x_{i1} · · · ∂x_{ik}) |_{x=µ},

and

b_{i1···ik} = ∂^k B(x)/(∂x_{i1} · · · ∂x_{ik}) |_{x=µ}.

a. Prove that bi = ai σ^{-1}.
b. Prove that bij = aij σ^{-1} − (1/2)(ai cj + aj ci)σ^{-3}, where

ck = ∂g(x)/∂xk |_{x=µ}.

c. Use the first two parts of this problem to prove that

r1(x) − q1(x) = −(1/2)σ^{-3} Σ_{i=1}^{d} Σ_{j=1}^{d} ai cj µij x^2,

where r1(x) = −[A1 + (1/6)A2 H2(x)], q1(x) = −[B1 + (1/6)B2 H2(x)], and A1, A2, B1, and B2 are defined in Equations (7.40), (7.41), (7.52), and (7.53).

19. Let {Wn }∞n=1 be a sequence of independent and identically distributed ran-
dom variables from a distribution F with mean θ and variance σ 2 . It was
shown in Example 7.6 that this parameter can be represented in the smooth
function model with Xn′ = (Wn, Wn^2), g(x) = x1, and h(x) = x2 − x1^2, where we have θ̂n = W̄n and

σ̂^2 = n^{-1} Σ_{k=1}^{n} (Wk − W̄n)^2.

Assuming that θ = 0 and σ = 1 and using the form from Equations (7.44)–
(7.48), derive a two-term Edgeworth expansion for the studentized distri-
bution of θ̂ and compare it to Equation (7.54). In particular, show that
q2(x) = x[(1/12)κ4(x^2 − 3) − (1/18)κ3^2(x^4 + 2x^2 − 3) − (1/4)(x^2 + 3)],
which can be found in Section 2.6 of Hall (1992), where κ3 and κ4 are the
third and fourth cumulants of F , respectively.
20. Let {Xn }∞
n=1 be a sequence of independent and identically distributed ran-
dom variables from a distribution F that has density f , characteristic func-
tion ψ(u), and cumulant generating function c(u). Assume that the charac-
teristic function is real valued in this case and use the alternate definition
of the cumulant generating function given by c(u) = log[ψ(u)]. Define the
density fλ (t) = exp(λt)f (t)/ψ(λ) and let {Yn }∞ n=1 be a sequence of inde-
pendent and identically distributed random variables following the density
fλ . Let fn denote the density of nX̄n and fn,λ denote the density of nȲn .
Using characteristic functions, prove that fn,λ (t) = exp[λt − nc(λ)]fn (t).
21. Let X be a random variable with moment generating function m(u) and
cumulant generating function c(u). Assuming that both functions exist,
prove that

d^2/du^2 log[m(u + λ)/m(λ)] |_{u=0} = c″(λ).
22. Let {Xn}_{n=1}^{∞} be a sequence of independent and identically distributed random variables following an Exponential(1) density.

a. Prove that if λ̃ = x^{-1}(x − n) then nc′(λ̃) = x.


b. Prove that c(λ̃) = log(n^{-1}x) and that c″(λ̃) = n^{-2}x^2.
c. Prove that the leading term of the saddlepoint expansion for fn (x) is
given by (2π)−1/2 n−n+1/2 xn−1 exp(n − x).
d. The exact form for fn (x) in this case is xn−1 exp(−x)/Γ(n). Approx-
imate Γ(n) using Theorem 1.8 (Stirling), and show that the resulting
approximation matches the saddlepoint approximation.

23. Let {Xn}_{n=1}^{∞} be a sequence of independent and identically distributed random variables following a Gamma(α, β) density.

a. Find the value of λ̃ that is the solution to nc′(λ̃) = x.


b. Find c(λ̃) and c″(λ̃).
c. Derive the saddlepoint expansion for fn (x).

24. Let {Xn}_{n=1}^{∞} be a sequence of independent and identically distributed random variables following a Wald(α, β) density.

a. Find the value of λ̃ that is the solution to nc′(λ̃) = x.


b. Find c(λ̃) and c″(λ̃).
c. Derive the saddlepoint expansion for fn (x).

25. Let {Xn}_{n=1}^{∞} be a sequence of independent and identically distributed random variables following a ChiSquared(θ) distribution. In Example 7.12 we derived a saddlepoint expansion that approximates the density of nX̄n at a point x with relative error O(n^{-1}) as n → ∞, given by

fn(x) = [4πn^{-1}x^2 θ^{-1}]^{-1/2} exp[−(1/2)nθ log(nθx^{-1}) − (1/2)(x − nθ)][1 + O(n^{-1})].
Prove that an application of Theorem 1.8 (Stirling) implies that the leading
term of this expansion has the same form as a Gamma distribution.
7.8.2 Experiments

1. Write a program in R that generates b samples of size n from a specified


distribution F (specified below). For each sample compute the statistic
Zn = n1/2 σ −1 (X̄n −µ) where µ and σ correspond to the mean and standard
deviation of the specified distribution F . Produce a density histogram of
the b values of Zn . Run this simulation for n = 10, 25, 50, and 100 for each
of the distributions listed below. On each histogram overlay a plot of the
standard normal density and the function given by the one-term Edgeworth
expansion for each case studied. Discuss how these histograms compare to
what would be expected for large n, as regulated by the underlying theory
given by Theorems 4.20 and 7.4. A minimal sketch of this simulation, for the Exponential(1) case in part b, is given after Experiment 7.
a. F corresponds to a N(0, 1) distribution.
b. F corresponds to an Exponential(1) distribution.
c. F corresponds to a Gamma(2, 2) distribution.
d. F corresponds to a Uniform(0, 1) distribution.
2. Write a program in R that generates b samples of size n from a specified
distribution F (specified below). For each sample compute the statistic
Zn = n1/2 σ −1 (X̄n −µ) where µ and σ correspond to the mean and standard
deviation of the specified distribution F . Produce a plot of the empirical
distribution function of the b values of Zn . On each plot overlay a plot of
the standard normal distribution function and the function given by the
one-term Edgeworth expansion for each case studied. Discuss how these
functions compare to what would be expected for large n as regulated by
the underlying theory given by Theorems 4.20 and 7.4. Run this simulation
for n = 10, 25, 50, and 100 for each of the distributions listed below.
a. F corresponds to a N(0, 1) distribution.
b. F corresponds to an Exponential(1) distribution.
c. F corresponds to a Gamma(2, 2) distribution.
d. F corresponds to a Uniform(0, 1) distribution.
3. Write a program in R that generates b samples of size n from the linear
density f (x) = 2[θ + x(1 − 2θ)]δ{x; (0, 1)} studied in Example 7.2. Recall
that the first four moments of this density are given by µ = µ1 = 2/3 − (1/3)θ, σ^2 = µ2 = 1/18 + (1/9)θ − (1/9)θ^2, µ3 = −1/135 − (1/45)θ + (1/9)θ^2 − (2/27)θ^3, and µ4 = 1/135 + (4/135)θ − (1/15)θ^2 + (2/27)θ^3 − (1/27)θ^4. For each sample compute the statistic Zn = n^{1/2}σ^{-1}(X̄n − µ) where µ and σ correspond to the mean and stan-
dard deviation specified above. Produce a plot of the empirical distribution
function of the b values of Zn . On each plot overlay a plot of the standard
normal distribution function and the function given by the one-term Edge-
worth expansion. Discuss how these functions compare to what would be
expected for large n as regulated by the underlying theory given by The-
orems 4.20 and 7.4. Run this simulation for n = 10, 25, 50, and 100 each
with θ = 0.10, 0.25, and 0.50.
4. Write a program in R that generates b samples of size n from a density that
is a mixture of two Normal densities given by f(x) = (1/2)φ(x) + (1/2)φ(x − θ) for all x ∈ R, where θ ∈ R. In Example 7.3 we found that E(Xn) = (1/2)θ, E(Xn^2) = 1 + (1/2)θ^2, E(Xn^3) = (1/2)θ^3 + (3/2)θ, and E(Xn^4) = (1/2)θ^4 + 3θ^2 + 3. The third and fourth central moments of the normal mixture are µ3 = 0 and µ4 = (1/16)θ^4 + (3/2)θ^2 + 3. The third cumulant is also zero, and the fourth cumulant is κ4 = −(1/8)θ^4. For each sample compute the statistic Zn = n^{1/2}σ^{-1}(X̄n − µ) where
µ and σ correspond to the mean and standard deviation specified above.
Produce a plot of the empirical distribution function of the b values of Zn .
On each plot overlay a plot of the standard normal distribution function and
the function given by the one-term Edgeworth expansion. Discuss how these
functions compare to what would be expected for large n as regulated by
the underlying theory given by Theorems 4.20 and 7.4. Run this simulation
for n = 10, 25, 50, and 100 each with θ = 0, 0.50, 1.00, and 3.00.
5. Write a program in R that generates b samples of size n from a specified
distribution F (specified below). For each sample compute the approximate
100α% confidence interval for the mean given by [X̄n −n−1/2 σz(1+α)/2 , X̄n −
n−1/2 σz(1−α)/2 ] using the known value of the population variance from
the distributions specified below. Also compute two additional confidence
intervals for the mean where the N(0, 1) quantiles are replaced by ones
based on one- and two-term Cornish–Fisher expansions for g(1−α)/2,n and
g(1+α)/2,n . Compute the percentage of time each method for computing
the confidence interval contained the true value of the mean over the b
simulated samples. Use α = 0.10, b = 1000 and repeat the simulation for
the sample sizes n = 5, 10, 25, 50, 100, 250, and 500.

a. F corresponds to a N(0, 1) distribution where µ = 0, σ is known to be


one, ρ3 = 0 and ρ4 = 0. Note: In this case one needs only to compute
the confidence interval for the mean given by the normal approximation.
b. F corresponds to an Exponential(1) distribution.
c. F corresponds to a Gamma(2, 2) distribution.
d. F corresponds to a Uniform(0, 1) distribution.

6. Let {Xn }∞
n=1 be a sequence of independent and identically distributed ran-
dom variables following a N(µ, σ 2 ) distribution. A saddlepoint expansion
that approximates the density of nX̄n at a point x with relative error
O(n−1 ) as n → ∞ was shown in Example 7.11 to have the form
fn(x) = [2πn c″(λ̃)]^{-1/2} exp[nc(λ̃) − λ̃x][1 + O(n^{-1})]
      = (2πnσ^2)^{-1/2} exp[(1/2)σ^{-2}(−nµ^2 + n^{-1}x^2) − xσ^{-2}(n^{-1}x − µ)][1 + O(n^{-1})]
      = (2πnσ^2)^{-1/2} exp[−(1/2)n^{-1}σ^{-2}(x − nµ)^2][1 + O(n^{-1})],
which has a leading term equal to a N(nµ, nσ 2 ) density, which matches
the exact density of nX̄n . Plot a N(nµ, nσ 2 ) density and the saddlepoint
approximation for µ = 1, σ 2 = 1, and n = 2, 5, 10, 25 and 50. Discuss how
well the saddlepoint approximation appears to be doing in each case.
7. Let {Xn }∞
n=1 be a sequence of independent and identically distributed ran-
dom variables following a ChiSquared(θ) distribution. A saddlepoint ex-
pansion that approximates the density of nX̄n at a point x with relative
error O(n−1 ) as n → ∞ was shown in Example 7.12 to have the form
fn(x) = [4πn^{-1}x^2 θ^{-1}]^{-1/2} exp[−(1/2)nθ log(nθx^{-1}) − (1/2)(x − nθ)][1 + O(n^{-1})].
Plot the correct Gamma density of nX̄n and the saddlepoint approximation
for θ = 1 and n = 2, 5, 10, 25 and 50. Discuss how well the saddlepoint
approximation appears to be doing in each case.
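The following R sketch illustrates the simulation described in Experiment 1 for the Exponential(1) distribution of part (b), using b = 10000 and n = 10; it assumes the one-term Edgeworth density φ(z)[1 + (1/6)n^{-1/2}ρ3 H3(z)] with H3(z) = z^3 − 3z and ρ3 = 2 for the Exponential(1) distribution.

# Sketch of Experiment 1, part (b): histogram of Z_n with the standard normal
# density and the one-term Edgeworth density overlaid.
b <- 10000
n <- 10
mu <- 1; sigma <- 1; rho3 <- 2
z <- replicate(b, sqrt(n) * (mean(rexp(n, rate = 1)) - mu) / sigma)
hist(z, breaks = 50, freq = FALSE, main = "Exponential(1), n = 10")
s <- seq(-4, 4, length.out = 401)
lines(s, dnorm(s))                                                        # normal approximation
lines(s, dnorm(s) * (1 + rho3 * (s^3 - 3 * s) / (6 * sqrt(n))), lty = 2)  # one-term Edgeworth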
CHAPTER 8

Asymptotic Expansions for Random


Variables

I trust them far more than what Barnabas says. Even if they are old, worthless
letters picked at random out of a pile of equally worthless letters, with no more
understanding than the canaries at fairs have, pecking out people’s fortunes at
random, well, even if that is so, at least these letters bear some relation to my
work.
The Castle by Franz Kafka

8.1 Approximating Random Variables

So far in this book we have relied at many times on the ability to approximate
a function f (x + δ) for a sequence of constants δ that converge to zero based
on the value f (x). That is, we are able to approximate values of f (x + δ) for
small values of δ as long as we know f (x). The main tool for developing these
approximations was based on Theorem 1.13 (Taylor), though we also briefly
talked about other methods as well. The main strength of the theory is based
on the idea that the accuracy of these approximations are well known and have
properties that are easily represented using the asymptotic order notation
introduced in Section 1.5. For instance, in Example 1.23 we found that the
distribution function of a N(0, 1) random variable could be approximated near
zero with 1/2 + δ(2π)^{-1/2} − (1/6)δ^3(2π)^{-1/2}. The error of this approximation can
be represented as o(δ 3 ), which means that the error, when divided by δ 3 ,
converges to zero as δ converges to zero. We also saw that this error can be
represented as O(δ 4 ), which means that the error, when divided by δ 4 , remains
bounded as δ converges to zero.
In some cases it would be useful to develop methods for approximating ran-
dom variables. As an example, consider the situation where we have observed
X1 , . . . , Xn , a set of independent and identically distributed random vari-
ables from a distribution F with mean θ and variance σ 2 . If the distribu-
tion F is not known, then an approximate 100α% upper confidence limit for
θ is given by Ûn (X1 , . . . , Xn ) = X̄n − n−1/2 σ̂z1−α , where σ̂ is the sample
standard deviation. This confidence limit is approximate in the sense that
P[θ ≤ Ûn(X1, . . . , Xn)] ≈ α. Let us assume that there is an exact upper confi-
dence limit Un (X1 , . . . , Xn ) such that P [θ ≤ Un (X1 , . . . , Xn )] = α. In this case

it would be useful to be able to compare Ûn (X1 , . . . , Xn ) and Un (X1 , . . . , Xn )
to determine the quality of the approximation. That is, we are interested in
the behavior of Rn = |Ûn (X1 , . . . , Xn ) − Un (X1 , . . . , Xn )| as n → ∞. For
example, we would like to be able to conclude that Rn → 0 as n → ∞. How-
ever, there is a small problem in this case in that both Un (X1 , . . . , Xn ) and
Ûn (X1 , . . . , Xn ) are random variables due to their dependence on the sample
X1 , . . . , Xn . Therefore, we cannot characterize the asymptotic behavior of Rn
using a limit for real sequences, we must characterize this behavior in terms
of one of the modes of convergence for random variables discussed in Chapter
3. For example, we can use Definition 3.1 and determine whether Rn → 0 in probability as n → ∞.
An alternative method for approximating the upper confidence limit would
be to use the correction suggested by the Cornish-Fisher expansion of Theo-
rem 7.10. That is, replace z1−α with h1−α,n to obtain an approximate upper
confidence limit given by Ũn (X1 , . . . , Xn ) = X̄n − n−1/2 σ̂h1−α,n . By defining
Rn = |Ũn (X1 , . . . , Xn ) − Un (X1 , . . . , Xn )| we could again ascertain whether
Rn → 0 in probability as n → ∞. However, a more useful analysis would consider whether
Ũn (X1 , . . . , Xn ) is an asymptotically more accurate approximate upper confi-
dence limit than Ûn (X1 , . . . , Xn ). If all we know is that both methods converge
to zero in probability then we cannot compare the two methods directly. In
this case we require some information about the rate at which the two methods
converge in probability to zero.
The goal of this chapter is to develop methods for approximating a random
variable, or a sequence of random variables, with an asymptotic expansion
whose terms may be random variables. The error terms of these sequences
will also necessarily be random variables as well. That is, let {Xn }∞ n=1 be a
sequence of random variables. Then we would like to find random variables
Y0, Y1, . . . , Yp such that Xn = Y0 + n^{-1/2}Y1 + n^{-1}Y2 + · · · + n^{-p/2}Yp + R(n, p),
where R(n, p) is a random variable that serves as a remainder term that de-
pends on both n and p. Note that the random variables Y0 , . . . , Yp themselves
do not depend on n. As with asymptotic expansions for sequences of real num-
bers, the form of the expansion and the rate at which the error term converges
to zero are both important properties of the expansion. Therefore, another fo-
cus of this chapter is on defining rates of convergence for sequences of random
variables. In particular, we will develop stochastic analogs of the asymptotic
order notation from Definition 1.7. We will then apply these methods to ap-
plications, such as the delta method and the asymptotic distribution of the
sample central moments.

8.2 Stochastic Order Notation

Extending the asymptotic order notation from sequences of real numbers to


sequences of random variables in principle involves converting the limits in
the asymptotic order notation to limits of random variables based on one of
the modes of convergence studied in Chapter 3. For sequences of real numbers
{xn}_{n=1}^{∞} and {yn}_{n=1}^{∞} we conclude that xn = o(yn) as n → ∞ if

lim_{n→∞} xn/yn = 0.   (8.1)

Therefore, to define a stochastic version of this notation, we replace the limit


in Equation (8.1) with convergence in probability.
Definition 8.1. Let {Xn}_{n=1}^{∞} and {Yn}_{n=1}^{∞} be sequences of random variables. The notation Xn = op(Yn) as n → ∞ means that Xn Yn^{-1} → 0 in probability as n → ∞.
Example 8.1. Let {Xn }∞ n=1 be a sequence of independent discrete random
variables where Xn has probability distribution function
fn(x) = 1/2 for x ∈ {0, n^{-1}}, and fn(x) = 0 otherwise,

for all n ∈ N. Note that n^{1/2}Xn therefore has probability distribution function

gn(x) = 1/2 for x ∈ {0, n^{-1/2}}, and gn(x) = 0 otherwise,
for all n ∈ N, where it can then be shown that for any ε > 0,
lim_{n→∞} P(n^{1/2}Xn > ε) ≤ lim_{n→∞} n^{-1/2} = 0,

and therefore Definition 3.1 implies that n^{1/2}Xn → 0 in probability as n → ∞. Therefore, it
follows from Definition 8.1 that Xn = op (n−1/2 ) as n → ∞. Note that Xn is
not op(n^{-1}) since nXn has a Bernoulli(1/2) distribution for all n ∈ N, which is a sequence of random variables whose distribution does not depend on n.
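A short simulation makes the distinction concrete. The R sketch below draws from the distribution of Xn in Example 8.1 for several values of n: the mean of n^{1/2}Xn shrinks toward zero, while nXn continues to take the value one with probability close to 1/2.

# Sketch: Example 8.1 in simulation; X_n equals 0 or 1/n with probability 1/2.
sim <- function(n, b = 10000) sample(c(0, 1 / n), size = b, replace = TRUE)
for (n in c(10, 1000, 100000)) {
  x <- sim(n)
  cat("n =", n,
      "  mean of n^{1/2} X_n:", signif(mean(sqrt(n) * x), 3),
      "  P(n X_n = 1):", mean(n * x > 0.5), "\n")
}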
Example 8.2. Let {Xn }∞ n=1 be a sequence of independent discrete random
variables where Xn has probability distribution function
fn(x) = 1/2 for x ∈ {0, n^{-1}}, and fn(x) = 0 otherwise,

for all n ∈ N, and let {Yn}_{n=1}^{∞} be a sequence of independent random variables where Yn has probability distribution function

gn(y) = 1/2 for y ∈ {n^{-1/2}, 1}, and gn(y) = 0 otherwise,
for all n ∈ N. We will assume that Xn is independent of Yn for all n ∈ N.
Consider the sequence of random variables given by {Zn }∞ n=1 where Zn =
Xn Yn−1 for all n ∈ N. Because of the independence between Xn and Yn , the
distribution of Zn can be found to be

hn(z) = 1/2 for z = 0, hn(z) = 1/4 for z ∈ {n^{-1/2}, n^{-1}}, and hn(z) = 0 otherwise,

for all n ∈ N. Now, it can then be shown that for any ε > 0,
lim_{n→∞} P(Zn > ε) ≤ lim_{n→∞} n^{-1/2} = 0,

and therefore Definition 3.1 implies that Zn → 0 in probability as n → ∞. Therefore, it
follows from Definition 8.1 that Xn = op (Yn ) as n → ∞. 
Example 8.3. Let {Xn }∞ n=1 be a sequence of independent random variables
from a distribution F with mean µ. From Theorem 3.10 (Weak Law of Large
Numbers) we know that the sample mean converges in probability to µ as
n → ∞. Further, Theorem 4.11 (Slutsky) implies that X̄n − µ → 0 in probability as n → ∞.
Therefore, it follows from Definition 8.1 that X̄n − µ = op (1) as n → ∞. 

For sequences of real numbers {xn}_{n=1}^{∞} and {yn}_{n=1}^{∞} we conclude that xn = O(yn) as n → ∞ if |xn yn^{-1}| remains bounded as n → ∞. To develop a stochas-
tic version of this operator we first note that a sequence of real numbers
{zn }∞
n=1 is bounded if there exists a positive real number b such that |zn | < b
for all n ∈ N. The same sequence remains bounded as n → ∞ if there exists
a positive real number b and an integer nb such that |zn| < b for all n > nb. For
a stochastic version of this property we require that behavior to apply with
at least probability 1 − ε for every ε > 0. This matches the concept that a
sequence is bounded in probability from Definition 4.3. Using this definition,
we can now define a stochastic version of the O operator.
Definition 8.2. Let {Xn }∞ ∞
n=1 and {Yn }n=1 be sequences of random variables.
The notation Xn = Op (Yn ) as n → ∞ means that |Xn Yn−1 | is bounded in
probability.
Example 8.4. Let {Xn }∞ ∞
n=1 and {Yn }n=1 be sequences of independent ran-
dom variables. Suppose that Yn is a Poisson(θ) random variable for all
n ∈ N, and that, conditional on Yn , the random variable Xn has a Uni-
form{1, 2, . . . , Yn } distribution. Therefore, it follows that Xn Yn−1 has a dis-
tribution that guarantees that P (0 ≤ Xn Yn−1 ≤ 1) = 1 for all n ∈ N. Let
ε > 0 and choose bε = 1. Then it follows that P (|Xn Yn−1 | ≤ bε ) = 1 for all
n ∈ N and therefore Theorem 4.3 implies that |Xn Yn−1 | is bounded in proba-
bility. Definition 8.2 then implies that Xn Yn−1 = Op (1), or equivalently that
Xn = Op (Yn ) as n → ∞. 
Example 8.5. Let {Xn }∞ n=1 be a sequence of independent random variables
where Xn has a Uniform(−n−1 , 1 + n−1 ) distribution for all n ∈ N. Let ε > 0
and consider the positive real number bε = 2. Then it follows that P(|Xn| ≤
bε ) = 1 for all n ∈ N and therefore by Definition 4.3 the sequence {Xn }∞n=1
is bounded in probability. Hence, Definition 8.2 implies that Xn = Op (1) as
n → ∞. 
Example 8.6. Let {Xn }∞ n=1 be a sequence of independent random variables
where Xn has a N(0, n−1 ) distribution for all n ∈ N. Let ε > 0 and let b be
any positive real number. Then
lim P (|Xn | < b) = lim P (|Z| ≤ n1/2 b) = 1,
n→∞ n→∞
where Z is a N(0, 1) random variable. Therefore, there exists a positive real
number bε and positive integer nε such that P (|Xn | ≤ bε ) > 1 − ε for all
n > nε , and hence Xn = Op (1), as n → ∞. 

One can observe from Example 8.5 that a very useful special case of sequences
that are bounded in probability is the class of sequences that converge in distribution. As
discussed in Section 4.2, we will always assume that sequences that converge
in distribution do so to distributions that have valid distribution functions. It
is this property that assures that such sequences are bounded in probability.
Theorem 8.1. Let {Xn }∞ n=1 be a sequence of random variables that converge
in distribution to a random variable X as n → ∞, then Xn = Op (1) as
n → ∞.

Proof. Let Fn denote the distribution function of Xn for all n ∈ N and let
F denote the distribution function of X. Since Xn → X in distribution as n → ∞, it follows from
Definition 4.1 that
lim_{n→∞} Fn(x) = F(x),
for all x ∈ C(F ). By assumption F is a distribution function such that
lim_{x→∞} F(x) = 1 and lim_{x→−∞} F(x) = 0.
Therefore, Theorem 4.1 implies that {Xn}_{n=1}^{∞} is bounded in probability, and Definition 8.2 then implies that Xn = Op(1) as n → ∞.
Example 8.7. Let {Xn }∞ n=1 be a sequence of independent random variables
from a distribution F with mean µ and variance σ 2 . Theorem 4.20 (Linde-
berg and Lévy) implies that n^{1/2}σ^{-1}(X̄n − µ) → Z in distribution, where Z has a N(0, 1) distribution. Therefore, Theorem 8.1 implies that n^{1/2}σ^{-1}(X̄n − µ) = Op(1)
as n → ∞ and Definition 8.2 implies that X̄n − µ = Op (n−1/2 ) or equivalently
that X̄n = µ + Op (n−1/2 ) as n → ∞. 
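The conclusion of Example 8.7 can be visualized with a simulation such as the R sketch below, which uses Exponential(1) data (so that µ = σ = 1) for illustration: the spread of X̄n − µ shrinks at the rate n^{-1/2}, while the spread of n^{1/2}(X̄n − µ) is stable in n.

# Sketch: Xbar_n = mu + Op(n^{-1/2}) in simulation.
set.seed(1)
for (n in c(10, 100, 1000, 10000)) {
  xbar <- replicate(5000, mean(rexp(n, rate = 1)))
  cat("n =", n,
      "  sd(Xbar_n - mu):", signif(sd(xbar - 1), 3),
      "  sd(n^{1/2}(Xbar_n - mu)):", signif(sd(sqrt(n) * (xbar - 1)), 3), "\n")
}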
Example 8.8. Let Z1 , . . . , Zm be a set of independent and identically dis-
tributed N(0, 1) random variables and let δ > 0. Define
Xm,δ = (Z1 + δ)^2 + Σ_{k=2}^{m} Zk^2.

Note that if δ = 0 then Xm,δ has a ChiSquared(m) distribution. If δ > 0


then Xm,δ has a non-central ChiSquared distribution. Following Section
2.3 of Barndorff-Nielsen and Cox (1989), we will investigate the asymptotic
distribution of (2δ)−1 (Xm,δ − δ 2 ) as δ → ∞. We first note that
Xm,δ = δ^2 + 2δZ1 + Σ_{k=1}^{m} Zk^2,
so that
(2δ)^{-1}(Xm,δ − δ^2) = Z1 + (2δ)^{-1} Σ_{k=1}^{m} Zk^2 = Z1 + Rδ.
Now Z1^2 + · · · + Zm^2 has a ChiSquared(m) distribution and does not depend on δ. Hence, it follows that δRδ is a random variable equal to half of a ChiSquared(m) random variable, and hence δRδ → R in distribution as δ → ∞, where
2R has a ChiSquared(m) distribution. Therefore Theorem 8.1 implies that
δRδ = Op (1) or equivalently that Rδ = Op (δ −1 ) as δ → ∞. It then follows that
(2δ)−1 (Xm,δ − δ 2 ) = Z1 + Op (δ −1 ) as δ → ∞. Note that this expansion pro-
vides a normal approximation for the non-central ChiSquared distribution
when the non-centrality parameter is large. 
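The quality of this normal approximation can be checked by simulation. The R sketch below draws non-central ChiSquared variables using R's ncp parameterization (ncp = δ^2) and examines the mean and standard deviation of (2δ)^{-1}(Xm,δ − δ^2) as δ grows; the values of m and δ are illustrative.

# Sketch: (2 delta)^{-1} (X_{m,delta} - delta^2) is approximately N(0, 1) for
# large delta (plus a remainder of order delta^{-1}).
m <- 5
for (delta in c(2, 10, 50)) {
  x <- rchisq(10000, df = m, ncp = delta^2)
  w <- (x - delta^2) / (2 * delta)
  cat("delta =", delta, "  mean:", signif(mean(w), 3), "  sd:", signif(sd(w), 3), "\n")
}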
Example 8.9. Let {Xn}_{n=1}^{∞} and {Yn}_{n=1}^{∞} be sequences of independent and
identically distributed random variables from distributions F and G respec-
tively. Suppose that the two sequences are mutually independent of one an-
other and that the distributions F and G both have means equal to zero
and variances equal to one. Theorem 4.22 implies that n^{1/2}X̄n → Z1 and n^{1/2}Ȳn → Z2 in distribution, where Z1 and Z2 are independent N(0, 1) random variables. Theorem 4.18 implies that X̄n Ȳn^{-1} → W in distribution as n → ∞, where W is a Cauchy(0, 1)
random variable. Therefore, Theorem 8.1 implies that X̄n Ȳn−1 = Op (1) as
n → ∞, or equivalently X̄n = Op(Ȳn) as n → ∞.

It is important to note that a sequence being bounded in probability is not


equivalent to the sequence converging in distribution, or in any other mode
of convergence. A sequence that is bounded in probability need not even be
convergent; it only needs to stay within some bounds as n → ∞.
Example 8.10. Let {Xn}_{n=1}^{∞} be a sequence of random variables such that Xn = (−1)^n X, where X is a Bernoulli(1/2) random variable. This sequence does not converge in distribution to any random variable, but the sequence is bounded in probability. To see this, let ε > 0 and define bε = 3/2, and note that P(|Xn| ≤ bε) = 1 > 1 − ε for all n ∈ N.

When using the order notation for real valued sequences, we found that if a
sequence was o(yn ) as n → ∞ for some real valued sequence {yn }∞
n=1 then the
sequence was also O(yn ) as n → ∞. This same type of relationship holds for
sequences of random variables using the stochastic order notation.
Theorem 8.2. Let {Xn }∞ ∞
n=1 and {Yn }n=1 be sequences of random variables.
Suppose that Xn = op (Yn ) as n → ∞, then Xn = Op (Yn ) as n → ∞.

For a proof of Theorem 8.2 see Exercise 2. To effectively work with stochastic
order notation we must also establish how the sequences of each order interact
with each other. The result below also provides results as to how real sequences
and sequences of random variables interact with one another.
Theorem 8.3. Let {Xn }∞ ∞
n=1 and {Yn }n=1 be sequences of random variables
and let {yn }∞
n=1 be a sequence of real numbers.
1. If Xn = Op (n−a ) and Yn = Op (n−b ) as n → ∞, then Xn Yn = Op (n−(a+b) )
as n → ∞.
2. If Xn = Op (n−a ) and yn = o(n−b ) as n → ∞, then Xn yn = op (n−(a+b) )
as n → ∞.
3. If Xn = Op (n−a ) and Yn = op (n−b ) as n → ∞, then Xn Yn = op (n−(a+b) )
as n → ∞.
4. If Xn = op (n−a ) and yn = o(n−b ) as n → ∞, then Xn yn = op (n−(a+b) ) as
n → ∞.
5. If Xn = Op (n−a ) and yn = O(n−b ) as n → ∞, then Xn yn = Op (n−(a+b) )
as n → ∞.
6. If Xn = op(n^{-a}) and yn = O(n^{-b}) as n → ∞, then Xn yn = op(n^{-(a+b)})
as n → ∞.
7. If Xn = op (n−a ) and Yn = op (n−b ) as n → ∞, then Xn Yn = op (n−(a+b) )
as n → ∞.

Proof. We will prove the first two parts of this theorem, leaving the remaining
parts for Exercise 9. To prove the first result, suppose that {Xn }∞ n=1 and
{Yn }∞
n=1 are sequences of random variables such that Xn = Op (n
−a
) and Yn =
Op (n−b ) as n → ∞. Therefore, it follows from Definition 8.2 that the sequences
{na Xn }∞ b ∞
n=1 and {n Yn }n=1 are both bounded in probability. Therefore, for
every ε > 0 there exist bounds xε and yε and positive integers nx,ε and ny,ε
such that P (|na Xn | ≤ xε ) > 1 − ε and P (|mb Ym | ≤ yε ) > 1 − ε for all
n > nx,ε and m > ny,ε . Define bε = max{xε , yε } and nε = max{nx,ε , ny,ε }.
Therefore, it follows that P (|na Xn | ≤ bε ) > 1 − ε and P (|nb Yn | ≤ bε ) > 1 − ε
for all n > nε . In accordance with Definition 8.2, we must now prove that the
sequence {na+b Xn Yn }∞ n=1 is bounded in probability. Let ε > 0 and note that

P(|n^{a+b} Xn Yn| ≤ bε^2) = P(|n^{a+b} Xn Yn| ≤ bε^2 | |n^a Xn| > bε) P(|n^a Xn| > bε)
    + P(|n^{a+b} Xn Yn| ≤ bε^2 | |n^a Xn| ≤ bε) P(|n^a Xn| ≤ bε)
  ≥ P(|n^{a+b} Xn Yn| ≤ bε^2 | |n^a Xn| ≤ bε) P(|n^a Xn| ≤ bε)
  ≥ P(|n^b Yn| ≤ bε) P(|n^a Xn| ≤ bε)
  ≥ (1 − ε)^2
  = 1 − 2ε + ε^2
  > 1 − 2ε.

Therefore, Definition 4.3 implies that the sequence {na+b Xn Yn }∞


n=1 is bounded
in probability and Definition 8.2 implies that Xn Yn = Op (n−(a+b) ) as n → ∞.
To prove the second result suppose that {Xn }∞ n=1 is a sequence of random
variables and that {yn }∞n=1 is a sequence of real numbers such that Xn =
Op (n−a ) and yn = o(n−b ) as n → ∞. Since Xn = Op (n−a ) as n → ∞ it
follows from Definitions 4.3 and 8.2 that for every ε > 0 there exists a positive
real number bε and a positive integer nε such that P (|na Xn | ≤ bε ) ≥ 1 − ε for
all n > nε . Since yn = o(n−b ) as n → ∞ it follows from Definition 1.7 that
for every δ > 0 there exists a positive integer nδ such that |n^b yn| < δ for all
n > nδ . Now let ε > 0 and ξ > 0 be given, and choose δ so that bε δ < ξ. Then
it follows that for all n > max{nε , nδ } we have that

P(|n^{a+b} Xn yn| ≤ ξ) ≥ P(|n^b yn||n^a Xn| ≤ bε δ) ≥ P(|n^a Xn| ≤ bε) ≥ 1 − ε.

Since ε is arbitrary it follows from Definition 1.1 that

lim_{n→∞} P(|n^{a+b} Xn yn| ≤ ξ) = 1,

and therefore Definition 3.1 implies that n^{a+b} Xn yn → 0 in probability as n → ∞. Therefore,
Definition 8.1 implies that Xn yn = op (n−(a+b) ) as n → ∞.

With the introduction of the stochastic order notation in Definitions 8.1 and
8.2, we can now define an asymptotic expansion for a sequence of random
variables {Xn }∞
n=1 as an expansion of the form

Xn = Y0 + n−1/2 Y1 + n−1 Y2 + · · · + Yp n−p/2 + Op (n−(p+1)/2 ), (8.2)

or of the form

Xn = Y0 + n−1/2 Y1 + n−1 Y2 + · · · + Yp n−p/2 + op (n−p/2 ), (8.3)

as n → ∞. Such expansions are often called stochastic asymptotic expansions.


We can also define these expansions in terms of the powers of n^{-1}, or any other
sequence in n that converges to zero as n → ∞. We have already seen some
expansions of this form in Examples 8.7 and 8.8. If we consider a stochastic
expansion of the form given in Equation (8.2) or (8.3) with p = 1 then we
have that Xn = Y0 + op (1) as n → ∞, so that the error in this expansion
converges in probability to zero. Therefore, it follows that Xn − Y0 → 0 in probability, or equivalently that Xn → Y0 in probability. In this sense, such an expansion can be seen as
having a leading term equal to a limiting random variable plus some random
error that converges to zero in probability as n → ∞. We now return to our
motivating example from Section 8.1.
Example 8.11. Let {Xn }∞ n=1 be a sequence of independent and identically
distributed random variables from a distribution F with mean θ and variance σ^2. Earlier in this section we considered an approximate 100α% upper
confidence limit for θ given by Ûn (X1 , . . . , Xn ) = X̄n − n−1/2 σ̂z1−α . Now
let us consider the accuracy of this confidence limit. Define the studentized
distribution of X̄n as Hn (x) = P [n1/2 σ̂n−1 (X̄n − θ) ≤ x] and let hα,n be
the αth quantile of this distribution, where σ̂n is the sample standard de-
viation. Note that an exact 100α% upper confidence limit for θ is given by
Un (X1 , . . . , Xn ) = X̄n − n−1/2 σ̂n h1−α,n since
P [θ < Un (X1 , . . . , Xn )] = P (θ < X̄n − n−1/2 σ̂n h1−α,n )
= P [n1/2 σ̂n−1 (X̄n − θ) > h1−α,n ]
= 1 − Hn(h1−α,n)
= α.
Now suppose that F follows the assumptions of Theorem 7.13, then hα,n has
an asymptotic expansion of the form
h1−α,n = z1−α + n−1/2 s1 (z1−α ) + n−1 s2 (z1−α ) + o(n−1 ),
as n → ∞. Therefore

Un (X1 , . . . , Xn ) =
X̄n − n−1/2 σ̂n z1−α − n−1 σ̂n s1 (z1−α ) − n−3/2 σ̂n s2 (z1−α ) + o(n−3/2 ),
as n → ∞. Hence, if we define Rn = |Ûn (X1 , . . . , Xn ) − Un (X1 , . . . , Xn )|, then
it follows that
Rn = |n−1 σ̂n s1 (z1−α ) + n−3/2 σ̂n s2 (z1−α ) + o(n−3/2 )|,
as n → ∞. Assuming that σ̂n is a consistent estimator of σ, which can be ver-
ified using Theorem 3.19, it follows that σ̂n = σ + op (1), as n → ∞. Therefore,
it follows that n−1 σ̂n s1 (z1−α ) = op (n−1/2 ) and hence Rn = op (n−1/2 ), as n →
∞. Hence we have shown that Ûn (X1 , . . . , Xn ) = Un (X1 , . . . , Xn ) + op (n−1/2 )
as n → ∞. Now recall that s1 (z1−α ) and s2 (z1−α ) are polynomials whose
coefficients are functions of the moments of F . Suppose that we can estimate
s1 (z1−α ) and s2 (z1−α ) with consistent estimators ŝ1 (z1−α ) and ŝ2 (z1−α ) so
that ŝ1 (z1−α ) = s1 (z1−α ) + op (1) and ŝ2 (z1−α ) = s2 (z1−α ) + op (1), as n → ∞.
We can then approximate the true upper confidence limit with
Ũn(X1, . . . , Xn) = X̄n − n^{-1/2}σ̂n z1−α − n^{-1}σ̂n ŝ1(z1−α) − n^{-3/2}σ̂n ŝ2(z1−α).
To find the accuracy of this approximation, note that
n−1/2 σ̂n ŝ1 (z1−α ) = n−1/2 σs1 (z1−α ) + op (n−1/2 ),
and
n−1 σ̂n ŝ2 (z1−α ) = n−1 σs2 (z1−α ) + op (n−1 ),
as n → ∞. Therefore,
Ũn (X1 , . . . , Xn ) = X̄n − n−1/2 σ̂n z1−α − n−1 σ̂n s1 (z1−α ) + op (n−1 ),
as n → ∞ and it follows that |Ũn (X1 , . . . , Xn ) − Un (X1 , . . . , Xn )| = op (n−1 ),
as n → ∞, which is more accurate than the normal approximation given by
Ûn (X1 , . . . , Xn ). Note that estimating s2 (z1−α ) in this case makes no dif-
ference from an asymptotic viewpoint, because the error from estimating
s1 (z1−α ) is as large, asymptotically, as this term. Therefore, an asymptot-
ically equivalent substitute for Ũn (X1 , . . . , Xn ) is the approximation X̄n −
n−1/2 σ̂n z1−α − n−1 σ̂n ŝ1 (z1−α ).
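A simulation along the lines of Experiment 5 illustrates the gain. The R sketch below uses Exponential(1) data, for which the standardized skewness is 2, and compares the tail probability P[n^{1/2}σ̂n^{-1}(X̄n − θ) > q] at q = z1−α with the same probability at the Cornish–Fisher corrected quantile z1−α + n^{-1/2}s1(z1−α), taking s1(x) = −(1/6)γ(2x^2 + 1) as implied by Equation (7.54); the target value is α.

# Sketch: normal quantile versus the skewness-corrected quantile of Example 8.11.
alpha <- 0.10
n <- 20
gamma3 <- 2                                   # standardized skewness of Exponential(1)
z <- qnorm(1 - alpha)
q.cf <- z - gamma3 * (2 * z^2 + 1) / (6 * sqrt(n))
t.stat <- replicate(20000, {
  x <- rexp(n, rate = 1)
  sqrt(n) * (mean(x) - 1) / sqrt(mean((x - mean(x))^2))
})
c(normal = mean(t.stat > z), corrected = mean(t.stat > q.cf), target = alpha)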
8.3 The Delta Method

In several instances we have considered how taking a function of a sequence


of convergent random variables affects the convergence properties of the se-
quence. For example, in certain cases the function of the sequence of random
variables converges to the same function of the limit random variable. That is,
if {Xn }∞
n=1 is a sequence of random variables that converges in some mode to
a random variable X as n → ∞, and g is a smooth function, then we observed
in Theorems 3.8 and 4.12 that g(Xn ) will converge in the same mode to g(X)
as n → ∞. In Section 6.4 we observed that certain functions of asymptoti-
cally Normal sequences are also asymptotically Normal. In this section we
use the asymptotic properties of a sequence of random variables to develop a
stochastic asymptotic expansion for a function of the sequence of random vari-
ables. In this development we will require both that the sequence converges
in probability to a constant and that a monotonically increasing function of n
multiplied by the sequence converges in distribution. Our results will include
a stochastic asymptotic expansion for the function of the sequence along with
a conclusion about the weak convergence of the transformed sequence. This
method is often called the delta method or approximate local linearization.

To begin this development, let {Xn}_{n=1}^{∞} be a sequence of random variables such that Xn → θ in probability as n → ∞ for some real number θ, and assume that there is a random variable Z such that

n^{1/2}(Xn − θ) = Z + op(1),   (8.4)

as n → ∞. Therefore, it follows from Theorem 4.11 (Slutsky) that n^{1/2}(Xn − θ) → Z in distribution as n → ∞, and hence Xn has the stochastic asymptotic expansion
Xn = θ + n−1/2 Z + op (n−1/2 ) as n → ∞.

Now consider transforming this sequence with a function g, where we will


assume that g has two continuous derivatives on the real line such that g′(θ) ≠ 0 and g″ is bounded. The transformed sequence is {g(Xn)}_{n=1}^{∞}. Theorem 1.13
(Taylor) implies that g(x) = g(θ) + (x − θ)g′(θ) + Rx where Rx = (1/2)g″(ξx)(x − θ)^2 and ξx is a real number between x and θ that depends on x. Therefore, substituting the random variable Xn for x we have that g(Xn) = g(θ) + (Xn − θ)g′(θ) + Rn where Rn = (1/2)g″(ξn)(Xn − θ)^2 and ξn is a random variable that is between Xn and θ with probability one, and therefore depends on n. Now, it follows from Equation (8.4) that

n^{1/2}[g(Xn) − g(θ)] = n^{1/2}[g(θ) + (Xn − θ)g′(θ) + Rn − g(θ)]
                      = n^{1/2}(Xn − θ)g′(θ) + n^{1/2}Rn
                      = [Z + op(1)]g′(θ) + n^{1/2}Rn
                      = Zg′(θ) + n^{1/2}Rn + op(1),
as n → ∞. To ascertain the asymptotic behavior of the term n1/2 Rn we first
note that
n1/2 (Xn − θ)2 = n1/2 (Xn − θ)(Xn − θ)
= [Z + op (1)][n−1/2 Z + op (n−1/2 )]
= n−1/2 Z 2 + (n−1/2 Z)op (1) + (Z)op (n−1/2 )
+op (1)op (n−1/2 ),
as n → ∞. Considering the first term of this expression, we observe that
n−1/2 = o(1) and assuming that Z has a valid distribution function, Defini-
tion 8.2 implies that Z = Op (1) as n → ∞ and hence Z 2 = Op (1)Op (1) =
Op (1) as n → ∞ by Theorem 8.3. Therefore, Theorem 8.3 also implies that
n−1/2 Z 2 = op (1) as n → ∞. Similar arguments using Theorem 8.3 can be
used to show that (n−1/2 Z)op (1) = o(1)Op (1)op (1) = op (1), (Z)op (n−1/2 ) =
Op (1)op (n−1/2 ) = op (n−1/2 ) = op (1), and finally op (1)op (n−1/2 ) = op (n−1/2 ) =
op (1), as n → ∞. Because we have assumed that g 00 is uniformly bounded in
the range of Xn it follows that g 00 (ξn ) is uniformly bounded with probability
one and therefore g 00 (ξn ) = Op (1) as n → ∞. Therefore, it follows that
n^{1/2}Rn = (1/2)n^{1/2}g″(ξn)(Xn − θ)^2 = op(1)Op(1) = op(1),
as n → ∞, and hence n^{1/2}[g(Xn) − g(θ)] = Zg′(θ) + op(1) as n → ∞. Applying Theorem 4.11 to this stochastic asymptotic expansion leads us to the conclusion that n^{1/2}[g(Xn) − g(θ)] → Zg′(θ) in distribution as n → ∞.
A common application of this theory is when Z is a N(0, σ 2 ) random vari-
able for which the conclusion would be, under the assumptions of this section
on g, that n1/2 [g(Xn ) − g(θ)] converges in distribution to a g 0 (θ)Z random
variable which has a N(0, [g 0 (θ)σ]2 ) distribution. This is the same conclusion
we encountered in Theorem 6.3. As this section shows, the conclusion is more
general and can be motivated through the use of stochastic asymptotic ex-
pansions. If the condition that g 0 (θ) 6= 0 is violated, then additional terms in
the Taylor expansion must be used and we obtain a result that has the same
form as in Theorem 6.4.
Example 8.12. Suppose that {W_n}_{n=1}^∞ is a sequence of independent random variables such that W_n has a N(θ, σ^2) distribution for all n ∈ N where θ ≠ 0. Define a sequence of random variables {X_n}_{n=1}^∞ where X_n = W̄_n for all n ∈ N, so that X_n has a N(θ, n^{-1}σ^2) distribution for all n ∈ N and X_n →_p θ as n → ∞. Let g(x) = x^2 so that g′(x) = 2x and g″(x) = 2, which is bounded on the real line. The earlier conclusions imply that n^{1/2}(X_n^2 − θ^2) = 2θZ + o_p(1) for a random variable Z that has a N(0, σ^2) distribution, and that n^{1/2}(X_n^2 − θ^2) converges in distribution to a random variable that has a N(0, 4θ^2σ^2) distribution. □
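The limiting distribution in Example 8.12 is easy to examine empirically. The following minimal R sketch (with arbitrary, illustrative choices of θ = 2, σ = 1, n = 500, and 10000 replications) simulates the statistic n^{1/2}(X_n^2 − θ^2) and compares its sample standard deviation with the value 2|θ|σ suggested by the delta method.

  # Simulation sketch for Example 8.12 (illustrative values assumed)
  set.seed(1)
  theta <- 2; sigma <- 1; n <- 500; reps <- 10000
  stat <- replicate(reps, {
    xbar <- mean(rnorm(n, mean = theta, sd = sigma))  # X_n = sample mean
    sqrt(n) * (xbar^2 - theta^2)                      # n^{1/2}(X_n^2 - theta^2)
  })
  sd(stat)                   # should be close to the asymptotic value below
  2 * abs(theta) * sigma     # 2|theta|*sigma = 4 for these choices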
Example 8.13. One problem in statistical inference is that the standard error of an estimator may depend on the unknown parameter of interest. In this case the standard error can only be estimated using the estimate of the unknown parameter. Variance stabilization is a method that can sometimes be used to circumvent this problem by transforming the estimator so that the standard error does not depend on the unknown parameter. As an example, consider {X_n}_{n=1}^∞, a sequence of random variables for which we will assume that n^{1/2}(X_n − θ) →_d Z and that X_n →_p θ as n → ∞. Hence, we will consider X_n to be a consistent estimator of θ. The distribution of Z is not overly important for this process, but for simplicity we will assume that Z has a N[0, h(θ)] distribution, where h is some known function of θ. We will further assume, consistent with the assumptions above, that the variance of X_n is n^{-1}h(θ). We would now like to transform this sequence so that the variance of the transformed estimator does not depend on θ. Let g be a function that satisfies the assumptions of this section. Then n^{1/2}[g(X_n) − g(θ)] →_d Y, where Y has a N{0, [g′(θ)]^2 h(θ)} distribution. Therefore, to eliminate θ from the variance we need to choose g such that [g′(θ)]^2 h(θ) = 1, or equivalently such that g′(θ) = [h(θ)]^{-1/2}. □
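As a concrete illustration of Example 8.13, suppose that X_n is the mean of n independent Poisson(θ) observations, so that h(θ) = θ. Solving g′(θ) = θ^{-1/2} gives g(θ) = 2θ^{1/2}, and the result above then implies that n^{1/2}[2X_n^{1/2} − 2θ^{1/2}] converges in distribution to a N(0, 1) random variable, whatever the value of θ. The R sketch below (the parameter values, sample size, and replication count are arbitrary choices for illustration) checks that the standard deviation of the transformed statistic is close to one for several values of θ.

  # Variance stabilization for a Poisson mean: g(x) = 2*sqrt(x) (illustrative)
  set.seed(2)
  n <- 400; reps <- 5000
  for (theta in c(1, 5, 25)) {
    stat <- replicate(reps, {
      xbar <- mean(rpois(n, lambda = theta))
      sqrt(n) * (2 * sqrt(xbar) - 2 * sqrt(theta))
    })
    cat("theta =", theta, " sd of statistic =", round(sd(stat), 3), "\n")
  }
  # Each reported standard deviation should be close to 1.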

8.4 The Sample Moments

Let {X_n}_{n=1}^∞ be a sequence of independent and identically distributed random variables. In Section 4.6 we established that the sample moments µ̂′_k are asymptotically Normal. In this section we demonstrate how to use stochastic asymptotic expansions to prove that the sample central moments are also asymptotically Normal. Therefore, let µ_k be the kth central moment defined in Definition 2.9 and let µ̂_k be the kth sample central moment defined in Section 3.8 as

µ̂_k = n^{-1} Σ_{i=1}^{n} (X_i − X̄_n)^k.

The presence of the sample mean in the expression for µ̂_k complicates the discussion of the limiting distribution of these statistics. If the sample mean were replaced by the population mean µ′_1, then the asymptotic Normality of µ̂_k could be established directly through the use of Theorem 4.20. One approach to simplifying this problem is based on finding an approximation to µ̂_k that has this simpler form.

Theorem 8.4. Let {X_n}_{n=1}^∞ be a sequence of independent and identically distributed random variables from a distribution F with mean µ and whose first k moments exist. Define

µ̃_k = n^{-1} Σ_{i=1}^{n} (X_i − µ)^k.

Then

n^{1/2}(µ̂_k − µ_k) = n^{1/2}(µ̃_k − µ_k − kµ_{k−1}µ̃_1) + o_p(1),    (8.5)

as n → ∞.

Proof. To begin this argument, first note that using Theorem A.22 we have that

n^{-1} Σ_{i=1}^{n} (X_i − X̄_n)^k = n^{-1} Σ_{i=1}^{n} [(X_i − µ) − (X̄_n − µ)]^k
  = n^{-1} Σ_{i=1}^{n} Σ_{j=0}^{k} C(k,j) (−1)^{k−j} (X_i − µ)^j (X̄_n − µ)^{k−j}
  = Σ_{j=0}^{k} (−1)^{k−j} C(k,j) (X̄_n − µ)^{k−j} [n^{-1} Σ_{i=1}^{n} (X_i − µ)^j]
  = Σ_{j=0}^{k} (−1)^{k−j} C(k,j) (X̄_n − µ)^{k−j} µ̃_j,

where C(k, j) denotes the binomial coefficient. Now note that

µ̃_1 = n^{-1} Σ_{i=1}^{n} (X_i − µ) = X̄_n − µ,

so that

n^{-1} Σ_{i=1}^{n} (X_i − X̄_n)^k = Σ_{j=0}^{k} (−1)^{k−j} C(k,j) µ̃_1^{k−j} µ̃_j
  = Σ_{j=0}^{k−2} (−1)^{k−j} C(k,j) µ̃_1^{k−j} µ̃_j + (−1)^1 C(k,k−1) µ̃_1 µ̃_{k−1} + (−1)^0 C(k,k) µ̃_k
  = Σ_{j=0}^{k−2} (−1)^{k−j} C(k,j) µ̃_1^{k−j} µ̃_j − kµ̃_1 µ̃_{k−1} + µ̃_k.

Therefore, it follows that

n^{1/2}(µ̂_k − µ_k) = n^{1/2}[ µ̃_k − µ_k − kµ̃_1 µ̃_{k−1} + Σ_{j=0}^{k−2} (−1)^{k−j} C(k,j) µ̃_1^{k−j} µ̃_j ]
  = n^{1/2}[ µ̃_k − µ_k − kµ̃_1 µ_{k−1} + kµ̃_1 µ_{k−1} − kµ̃_1 µ̃_{k−1} + Σ_{j=0}^{k−2} (−1)^{k−j} C(k,j) µ̃_1^{k−j} µ̃_j ]
  = n^{1/2}(µ̃_k − µ_k − kµ̃_1 µ_{k−1}) + n^{1/2}µ̃_1 [ kµ_{k−1} − kµ̃_{k−1} + Σ_{j=0}^{k−2} (−1)^{k−j} C(k,j) µ̃_1^{k−j−1} µ̃_j ].

Now Theorem 4.20 implies that n^{1/2}µ̃_1 = n^{1/2}(X̄_n − µ) →_d Z as n → ∞, where Z is a N(0, µ_2) random variable. Therefore, Theorem 8.1 implies that n^{1/2}µ̃_1 = O_p(1) as n → ∞. Next we note that Theorem 3.10 implies that

µ̃_{k−1} = n^{-1} Σ_{i=1}^{n} (X_i − µ)^{k−1} →_p E[(X_i − µ)^{k−1}] = µ_{k−1},    (8.6)

as n → ∞, so that Theorem 3.8 implies that kµ_{k−1} − kµ̃_{k−1} →_p 0 as n → ∞. Finally, we note that in the special case when k = 2 in Equation (8.6),

µ̃_1 = n^{-1} Σ_{i=1}^{n} (X_i − µ) →_p E(X_i − µ) = µ_1 = 0,

so that µ̃_1 →_p 0 as n → ∞. Therefore, Theorem 3.8 implies that µ̃_1^{k−j−1} µ̃_j →_p 0 as n → ∞ for j = 0, . . . , k − 2, and hence

Σ_{j=0}^{k−2} (−1)^{k−j} C(k,j) µ̃_1^{k−j−1} µ̃_j = o_p(1),

as n → ∞. Combining these results, the term in square brackets above is o_p(1) as n → ∞, and since n^{1/2}µ̃_1 = O_p(1) as n → ∞, Theorem 8.3 implies that their product is o_p(1) as n → ∞. Therefore,

n^{1/2}(µ̂_k − µ_k) = n^{1/2}(µ̃_k − µ_k − kµ_{k−1}µ̃_1) + o_p(1),

as n → ∞.
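To see the approximation of Theorem 8.4 in action, the following R sketch (with the Exponential(1) distribution and the sample sizes chosen arbitrarily for illustration; for that distribution µ_2 = 1 and µ_3 = 2) compares the two sides of Equation (8.5) for k = 3. The absolute difference between the two normalized quantities should shrink as n grows, in accordance with the o_p(1) remainder.

  # Numerical look at the approximation in Theorem 8.4 with k = 3 (illustrative)
  set.seed(6)
  mu <- 1; mu2 <- 1; mu3 <- 2          # mean and central moments of the Exponential(1)
  diffs <- sapply(c(50, 200, 1000, 5000), function(n) {
    x <- rexp(n)
    muhat3 <- mean((x - mean(x))^3)    # third sample central moment
    mutil1 <- mean(x - mu)             # first central moment about the true mean
    mutil3 <- mean((x - mu)^3)         # third central moment about the true mean
    lhs <- sqrt(n) * (muhat3 - mu3)
    rhs <- sqrt(n) * (mutil3 - mu3 - 3 * mu2 * mutil1)
    abs(lhs - rhs)
  })
  diffs                                # the differences should shrink toward zero as n grows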

The key idea in developing the stochastic approximation in Theorem 8.4 is that the statistic on the right hand side of Equation (8.5) can be written as a sum of independent and identically distributed random variables, whose asymptotic behavior can then be linked directly to Theorem 4.20 (Lindeberg and Lévy). In the following result we use Theorem 8.4 along with Theorem 4.22 to develop a multivariate Normal result for a vector of estimated central moments.
Theorem 8.5. Let {X_n}_{n=1}^∞ be a sequence of independent and identically distributed random variables from a distribution F with mean µ and kth central moment equal to µ_k. If µ_{2k} < ∞ then

n^{1/2}(µ̂_2 − µ_2, . . . , µ̂_k − µ_k)′ →_d Z,

as n → ∞, where Z is a N(0, Σ) random vector. The covariance matrix Σ has (i, j)th element equal to

µ_{i+j+2} − µ_{i+1}µ_{j+1} − (i+1)µ_i µ_{j+2} − (j+1)µ_{i+2}µ_j + (i+1)(j+1)µ_i µ_j µ_2.    (8.7)

Proof. Let Z_n = n^{1/2}(µ̂_2 − µ_2, . . . , µ̂_k − µ_k)′ and W_n = n^{1/2}(µ̃_2 − µ_2 − 2µ_1µ̃_1, . . . , µ̃_k − µ_k − kµ_{k−1}µ̃_1)′. Let us assume for the moment that Z_n →_d Z for some random vector Z whose distribution we will specify later in the proof. Theorem 8.4 tells us that W_n − Z_n →_p 0 as n → ∞. Therefore, Theorem 4.19 (Multivariate Slutsky Theorem) implies that W_n = Z_n + (W_n − Z_n) →_d Z as n → ∞. Hence Z_n and W_n converge in distribution to the same random variable. Therefore, we need only find the limit distribution of W_n. To find this limiting distribution, we first note that

µ̃_k − µ_k − kµ_{k−1}µ̃_1 = n^{-1} Σ_{i=1}^{n} (X_i − µ)^k − µ_k − kµ_{k−1} n^{-1} Σ_{i=1}^{n} (X_i − µ)
  = n^{-1} Σ_{i=1}^{n} [(X_i − µ)^k − µ_k − kµ_{k−1}(X_i − µ)].

Define a sequence of random vectors {Y_n}_{n=1}^∞ as

Y_n′ = [(X_n − µ)^2 − µ_2 − 2µ_1(X_n − µ), . . . , (X_n − µ)^k − µ_k − kµ_{k−1}(X_n − µ)],

for all n ∈ N. Note that

E[(X_n − µ)^k − µ_k − kµ_{k−1}(X_n − µ)] = µ_k − µ_k − kµ_{k−1}µ_1 = 0,

since µ_1 = 0. Then it follows that W_n = n^{1/2}Ȳ_n and Theorem 4.22 implies that W_n →_d Z as n → ∞, where Z is a N(0, Σ) random vector. As discussed above, the random vector Z_n has this same limit distribution, and therefore all that is left to do is verify the form of the covariance matrix. The form of the covariance matrix can be determined from the covariance matrix of Y_n. The (i, j)th element of this covariance matrix is given by the covariance between (X_n − µ)^{i+1} − µ_{i+1} − (i+1)µ_i(X_n − µ) and (X_n − µ)^{j+1} − µ_{j+1} − (j+1)µ_j(X_n − µ). Since both random variables have expectation equal to zero, this covariance is equal to the expectation of the product

E{[(X_n − µ)^{i+1} − µ_{i+1} − (i+1)µ_i(X_n − µ)][(X_n − µ)^{j+1} − µ_{j+1} − (j+1)µ_j(X_n − µ)]}
  = E[(X_n − µ)^{i+1}(X_n − µ)^{j+1}] − E[(X_n − µ)^{i+1}µ_{j+1}] − E[(X_n − µ)^{i+1}(j+1)µ_j(X_n − µ)]
    − E[µ_{i+1}(X_n − µ)^{j+1}] + E[µ_{i+1}µ_{j+1}] + E[µ_{i+1}(j+1)µ_j(X_n − µ)]
    − E[(i+1)µ_i(X_n − µ)(X_n − µ)^{j+1}] + E[(i+1)µ_i(X_n − µ)µ_{j+1}] + E[(i+1)(j+1)(X_n − µ)^2 µ_i µ_j]
  = µ_{i+j+2} − µ_{i+1}µ_{j+1} − (j+1)µ_j µ_{i+2} − µ_{i+1}µ_{j+1} + µ_{i+1}µ_{j+1} + (j+1)µ_{i+1}µ_j µ_1
    − (i+1)µ_i µ_{j+2} + (i+1)µ_i µ_{j+1}µ_1 + (i+1)(j+1)µ_i µ_j µ_2
  = µ_{i+j+2} − µ_{i+1}µ_{j+1} − (j+1)µ_j µ_{i+2} − (i+1)µ_i µ_{j+2} + (i+1)(j+1)µ_i µ_j µ_2,

where we note that (i+1)µ_i µ_{j+1}µ_1 = (j+1)µ_{i+1}µ_j µ_1 = 0. This expression matches the covariance given in Equation (8.7).

Example 8.14. Let {X_n}_{n=1}^∞ be a sequence of independent and identically distributed random variables from a distribution F with kth central moment given by µ_k. Suppose that µ_6 < ∞. Then Theorem 8.5 implies that n^{1/2}(µ̂_2 − µ_2, µ̂_3 − µ_3)′ →_d Z as n → ∞, where Z is a N(0, Σ) random vector with

Σ = [ µ_4 − µ_2^2          µ_5 − 4µ_2µ_3
      µ_5 − 4µ_2µ_3        µ_6 − µ_3^2 − 6µ_2µ_4 + 9µ_2^3 ],

where we note that µ_1 = 0. □
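The covariance matrix in Example 8.14 can be examined numerically. The following R sketch (with an arbitrary choice of the Exponential(1) distribution, sample size, and number of replications, used purely for illustration) simulates the second and third sample central moments and compares n times their sample covariance matrix with Σ evaluated at the Exponential(1) central moments µ_2 = 1, µ_3 = 2, µ_4 = 9, µ_5 = 44, and µ_6 = 265; only rough agreement should be expected at moderate sample sizes.

  # Empirical check of the covariance matrix in Example 8.14 (illustrative setup)
  set.seed(3)
  n <- 200; reps <- 5000
  m <- t(replicate(reps, {
    x <- rexp(n)                                        # Exponential(1) sample
    c(mean((x - mean(x))^2), mean((x - mean(x))^3))     # (muhat_2, muhat_3)
  }))
  n * cov(m)                                            # empirical analogue of Sigma
  mu <- c(NA, 1, 2, 9, 44, 265)                         # mu[k] holds mu_k for Exponential(1)
  matrix(c(mu[4] - mu[2]^2,           mu[5] - 4 * mu[2] * mu[3],
           mu[5] - 4 * mu[2] * mu[3], mu[6] - mu[3]^2 - 6 * mu[2] * mu[4] + 9 * mu[2]^3),
         nrow = 2)                                      # theoretical Sigma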

8.5 Exercises and Experiments

8.5.1 Exercises

1. Let {X_n}_{n=1}^∞ be a sequence of independent random variables where X_n has a N(0, n^{-1}) distribution for all n ∈ N. Prove that X_n = O_p(n^{-1/2}) as n → ∞.
2. Let {X_n}_{n=1}^∞ and {Y_n}_{n=1}^∞ be sequences of random variables. Suppose that X_n = o_p(Y_n) as n → ∞. Prove that X_n = O_p(Y_n) as n → ∞.
3. Let {X_n}_{n=1}^∞ be a sequence of random variables where X_n = n^{-1}U_n and {U_n}_{n=1}^∞ is a sequence of independent and identically distributed Uniform(0, 1) random variables. Prove that X_n = o_p(n^{-1/2}) and that X_n = O_p(n^{-1}) as n → ∞.
4. Let {X_n}_{n=1}^∞ and {Y_n}_{n=1}^∞ be sequences of independent random variables. Suppose that Y_n is a Beta(α_n, β_n) random variable where {α_n}_{n=1}^∞ and {β_n}_{n=1}^∞ are sequences of positive real numbers that converge to α and β, respectively. Suppose further that, conditional on Y_n, the random variable X_n has a Binomial(m, Y_n) distribution, where m is a fixed positive integer for all n ∈ N. Prove that X_n = O_p(1) as n → ∞.
5. Let {X_n}_{n=1}^∞ and {Y_n}_{n=1}^∞ be sequences of independent random variables. Suppose that Y_n is a Poisson(θ) random variable where θ is a positive real number. Suppose further that, conditional on Y_n, the random variable X_n has a Binomial(Y_n, τ) distribution for all n ∈ N, where τ is a fixed real number in the interval [0, 1]. Prove that X_n = O_p(Y_n) as n → ∞.
6. Let {X_n}_{n=1}^∞ be a sequence of independent random variables where X_n has a Gamma(α_n, β_n) distribution for all n ∈ N and {α_n}_{n=1}^∞ and {β_n}_{n=1}^∞ are bounded sequences of positive real numbers. That is, there exist real numbers α and β such that 0 < α_n ≤ α and 0 < β_n ≤ β for all n ∈ N. Prove that X_n = O_p(1) as n → ∞.
7. Let {X_n}_{n=1}^∞ be a sequence of independent random variables where X_n has a Geometric(θ_n) distribution, where {θ_n}_{n=1}^∞ is described below. For each sequence determine whether X_n = O_p(1) as n → ∞.

a. θ_n = n(n + 10)^{-1} for all n ∈ N.
b. θ_n = n^{-1} for all n ∈ N.
c. θ_n = n^{-2} for all n ∈ N.
d. θ_n = 1/2 for all n ∈ N.

8. Let {X_n}_{n=1}^∞ and {Y_n}_{n=1}^∞ be sequences of independent random variables, where X_n has a Uniform(0, n) distribution and Y_n has a Uniform(0, n^2) distribution for all n ∈ N. Prove that X_n = o_p(Y_n) as n → ∞.
9. Let {X_n}_{n=1}^∞ and {Y_n}_{n=1}^∞ be sequences of random variables and let {y_n}_{n=1}^∞ be a sequence of real numbers.

a. Prove that if X_n = O_p(n^{-a}) and Y_n = o_p(n^{-b}) as n → ∞, then X_nY_n = o_p(n^{-(a+b)}) as n → ∞.
b. Prove that if X_n = o_p(n^{-a}) and y_n = o(n^{-b}) as n → ∞, then X_ny_n = o_p(n^{-(a+b)}) as n → ∞.
c. Prove that if X_n = O_p(n^{-a}) and y_n = O(n^{-b}) as n → ∞, then X_ny_n = O_p(n^{-(a+b)}) as n → ∞.
d. Prove that if X_n = o_p(n^{-a}) and y_n = O(n^{-b}) as n → ∞, then X_ny_n = o_p(n^{-(a+b)}) as n → ∞.
e. Prove that if X_n = o_p(n^{-a}) and Y_n = o_p(n^{-b}) as n → ∞, then X_nY_n = o_p(n^{-(a+b)}) as n → ∞.

10. Suppose that {W_n}_{n=1}^∞ is a sequence of independent random variables such that W_n has a N(θ, σ^2) distribution for all n ∈ N where θ ≠ 0. Define a sequence of random variables {X_n}_{n=1}^∞ where X_n = W̄_n for all n ∈ N, so that X_n has a N(θ, n^{-1}σ^2) distribution for all n ∈ N and X_n →_p θ as n → ∞. Find the asymptotic distribution of n^{1/2}[exp(−X_n^2) − exp(−θ^2)].
11. Let {B_n}_{n=1}^∞ be a sequence of independent random variables where B_n has a Bernoulli(θ) distribution for all n ∈ N. Define a sequence of random variables {X_n}_{n=1}^∞ where

X_n = n^{-1} Σ_{k=1}^{n} B_k,

which is the proportion of the first n Bernoulli random variables that equal one. Prove that n^{1/2}(X_n − θ) →_d Z as n → ∞, where Z has a N[0, θ(1 − θ)] distribution, and that X_n →_p θ as n → ∞. Using these conclusions, find the asymptotic distribution of n^{1/2}[X_n(1 − X_n) − θ(1 − θ)].
12. Let {X_n}_{n=1}^∞ be a sequence of independent and identically distributed random variables from a distribution F with kth central moment given by µ_k. Suppose that µ_{10} < ∞. Use Theorem 8.5 to find the asymptotic distribution of n^{1/2}(µ̂_3 − µ_3, µ̂_4 − µ_4, µ̂_5 − µ_5)′.

8.5.2 Experiments

1. Write a program in R that first simulates 1000 observations from a Poisson(10) distribution. For each observation, simulate a Binomial(n, 1/2) observation where n is equal to the corresponding observation from the Poisson(10) distribution. Repeat this experiment five times and plot the resulting sequences of ratios of the Binomial observations to the Poisson observations. Describe the plots and address whether the behavior in the plots appears to indicate that the theory given in Exercise 5 has been observed.
2. Write a program in R that simulates two sequences of random variables. The first sequence is given by X_1, . . . , X_{100} where X_n has a Uniform(0, n) distribution for n = 1, . . . , 100. The second sequence is given by Y_1, . . . , Y_{100} where Y_n has a Uniform(0, n^2) distribution for n = 1, . . . , 100. Given the two sequences, compute the sequence X_1Y_1^{-1}, . . . , X_{100}Y_{100}^{-1}. Repeat the experiment five times and plot the resulting sequences of ratios. Describe the plots and address whether the behavior in the plots appears to indicate that the theory given in Exercise 8 has been observed.
3. Write a program in R that simulates 1000 samples of size 100 from a distribution F, where F is specified below. For each sample compute the second and third sample central moments. Use the 1000 simulated values of the two moments to estimate the variance of each sample moment, along with the covariance between the moments. Note that to estimate the covariance it is important to keep the sample central moments from each sample paired with one another. Compare these estimates to what would be expected given the theory in Example 8.14, noting that the assumptions of the example are not met for all of the specified distributions given below.

a. N(0, 1)
b. Cauchy(0, 1)
c. T(3)
d. Exponential(1)

4. Write a program in R that simulates samples of size n = 1, . . . , 100 from a distribution F, where F is specified below. For each sample compute the fourth sample central moment, as well as µ̃_1 and µ̃_4, where

µ̃_k = n^{-1} Σ_{i=1}^{n} (X_i − µ)^k,

and X_1, . . . , X_n denotes the simulated sample. For each value of n, compute n^{1/2}(µ̂_4 − µ_4), n^{1/2}(µ̃_4 − µ_4 − 4µ_3µ̃_1), and the absolute difference between the two expressions. Repeat the experiment five times and plot the resulting sequences of differences against n. Compare the limiting behavior of the absolute differences to what would be expected given the theory in Theorem 8.4, noting that the assumptions of the theorem are not met for all of the specified distributions given below.

a. N(0, 1)
b. Cauchy(0, 1)
c. T(3)
d. Exponential(1)

5. Write a program in R that simulates the sequence {X_n}_{n=1}^{100} where X_n has a Geometric(θ_n) distribution, where the sequence θ_n is specified below. Repeat the experiment five times and plot the five realizations against n on the same set of axes. Describe the behavior of each sequence and compare this to the theoretical results of Exercise 7.

a. θ_n = n(n + 10)^{-1} for all n ∈ N.
b. θ_n = n^{-1} for all n ∈ N.
c. θ_n = n^{-2} for all n ∈ N.
d. θ_n = 1/2 for all n ∈ N.
CHAPTER 9

Differentiable Statistical Functionals

K. stopped talking with them; do I, he thought to himself, do I really have to


carry on getting tangled up with the chattering of base functionaries like this?
The Trial by Franz Kafka

9.1 Introduction

This chapter will introduce a class of parameters that are known as func-
tional parameters. These types of parameters are functions of a distribution
function and can therefore be estimated by taking the same function of the
empirical distribution function computed on a sample from the distribution.
A novel approach to finding the asymptotic distribution of statistics of this
type was developed by von Mises (1947). In essence von Mises (1947) showed
that statistics of this type could be approximated based on a type of Taylor
expansion. Under some regularity conditions the asymptotic distribution of
the approximation can be found using standard methods such as Theorem
4.20 (Lindeberg and Lévy). In this chapter we will first introduce functional
parameters and statistics. We will then develop the Taylor type expansion by
first introducing a differential for functional statistics, and then introducing
the expansion itself. We will then proceed to develop the asymptotic theory.
There have been many advances in this theory since its inception, mostly
in regard to developing more useful differentials. Much of the mathematical
theory required to study these advances is beyond the mathematical level of
this book. The purpose of this chapter is to provide a general overview of
the subject at a mathematical level that is consistent with the rest of our
presentation so far. Those who find an interest in this topic should consult
Fernholz (1983), Chapter 6 of Serfling (1980), and Chapter 20 of van der Vaart
(1998).

9.2 Functional Parameters and Statistics

In many cases in statistical inference the parameter of interest can be written


as a function of the underlying distribution. That is, suppose that we have

observed a set of independent and identically distributed random variables
from a distribution F , and we are interested in a certain characteristic θ of
this distribution. Usually, θ can be written as a function of F . That is, we
can take θ = T (F ) for some function T . It is important to note that the
domain of this function is not the real line. Rather, T takes a distribution
function and maps it to the real line. Therefore the domain of this function is
a space containing distribution functions. We will work with a few spaces of
distribution functions. In particular, we will consider the set of all continuous
distribution functions on R, the set of all discrete distribution functions on
R, and the set of all distribution functions on R. Some examples of functional
parameters are given below.
Example 9.1. Let F ∈ F, the collection of all distribution functions on the real line, and let θ be the kth moment of F. Then θ is a functional parameter that can be written as

θ = T(F) = ∫_{−∞}^{∞} t^k dF(t).

Noting that this parameter may not exist for all F ∈ F, we may consider taking F from F_k, the collection of all distribution functions that have at least k finite moments. □

Example 9.2. Let F ∈ F_k, where F_k is defined in Example 9.1, and let θ be the kth moment of F about the mean. Then θ is a functional parameter that can be written as

θ = T(F) = ∫_{−∞}^{∞} [t − ∫_{−∞}^{∞} u dF(u)]^k dF(t).

The variance parameter is a special case of this functional parameter with k = 2. □

Example 9.3. Let F be a distribution on R and let {R_i}_{i=1}^{k} be a sequence of subsets of R that form a partition of R. That is, R_i ∩ R_j = ∅ for all i ≠ j, and

∪_{i=1}^{k} R_i = R.

For simplicity we will assume that k is finite. Let {p_i}_{i=1}^{k} be a hypothesized model for the probabilities associated with the subsets in the sequence {R_i}_{i=1}^{k}. That is, if the model is correct, then

p_i = ∫_{R_i} dF,

for all i ∈ {1, . . . , k}. The hypothesized model can then be compared to the true model using the functional

θ = T(F) = Σ_{i=1}^{k} p_i^{-1} [∫_{R_i} dF − p_i]^2,

which is the sum of the squared differences of the probabilities relative to the model probabilities. Note that when the proposed model is correct then θ = 0, and that when the proposed model is incorrect then θ > 0. □

Estimation of functional parameters can be based on finding an estimator


of the distribution function F̂ based on the observed sample X1 , . . . , Xn . An
estimator of θ = T (F ) is then developed by computing the functional of the
estimator of the distribution function. That is, θ̂n = T (F̂ ). Such estimates are
often called plug-in or substitution estimators. The most common estimator of
F is the empirical distribution function defined in Definition 3.5, though other
estimates can be used as well. See Putter and Van Zwet (1996) for general
conditions about the consistency of such estimators.

A key property that applies to many plug-in estimators based on the empirical distribution function is that, conditional on the observed sample X_1, . . . , X_n, the empirical distribution function is a discrete distribution that associates a probability of n^{-1} with each value in the sample. Therefore, integrals with respect to the empirical distribution function simplify according to Definition 2.10 as a sum. That is, if g is a real valued function we have that

∫_{−∞}^{∞} g(x) dF̂_n(x) = n^{-1} Σ_{i=1}^{n} g(X_i).

Example 9.4. Let F be a distribution with at least k finite moments, and let θ be the kth moment of F, which is a functional parameter that can be written as

θ = T(F) = ∫_{−∞}^{∞} t^k dF(t).

Suppose that a sample X_1, . . . , X_n is observed from F and the empirical distribution function F̂_n is used to estimate F. Then a plug-in estimator for θ is given by

θ̂_n = T(F̂_n) = ∫_{−∞}^{∞} t^k dF̂_n(t) = n^{-1} Σ_{i=1}^{n} X_i^k,

which is the kth sample moment µ̂′_k. □

Example 9.5. Let F be a distribution with finite kth moment and let θ be the kth moment of F about the mean. Then θ is a functional parameter that can be written as

θ = T(F) = ∫_{−∞}^{∞} [t − ∫_{−∞}^{∞} u dF(u)]^k dF(t).

Suppose that a sample X_1, . . . , X_n is observed from F and the empirical distribution function F̂_n is used to estimate F. Then a plug-in estimator for θ is given by

θ̂_n = T(F̂_n) = ∫_{−∞}^{∞} [t − ∫_{−∞}^{∞} u dF̂_n(u)]^k dF̂_n(t)
  = ∫_{−∞}^{∞} [t − n^{-1} Σ_{i=1}^{n} X_i]^k dF̂_n(t)
  = n^{-1} Σ_{j=1}^{n} [X_j − n^{-1} Σ_{i=1}^{n} X_i]^k,

which is the kth sample central moment µ̂_k. □

Example 9.6. Let F be a distribution on R and let {R_i}_{i=1}^{k} be a sequence of subsets of R that form a partition of R, where we will assume that k is finite. Let {p_i}_{i=1}^{k} be a hypothesized model for the probabilities associated with the subsets in the sequence {R_i}_{i=1}^{k}. In Example 9.3 we considered the functional parameter

θ = T(F) = Σ_{i=1}^{k} p_i^{-1} [∫_{R_i} dF − p_i]^2,

which compares the hypothesized model to the true model. Suppose that a sample X_1, . . . , X_n is observed from F and the empirical distribution function F̂_n is used to estimate F. Then a plug-in estimator for θ is given by

θ̂_n = T(F̂_n) = Σ_{i=1}^{k} p_i^{-1} [∫_{R_i} dF̂_n − p_i]^2
  = Σ_{i=1}^{k} p_i^{-1} [n^{-1} Σ_{j=1}^{n} δ(X_j; R_i) − p_i]^2
  = Σ_{i=1}^{k} p_i^{-1} (p̂_i − p_i)^2,

where p̂_i is the proportion of the sample that was observed in subset R_i. Suppose that we alternatively considered estimating the distribution function F with a Normal distribution whose parameters were estimated from the observed sample. That is, we could estimate F with a N(X̄_n, S) distribution, conditional on the observed sample X_1, . . . , X_n, where S is the sample standard deviation. In this case, the plug-in estimator for θ is given by

θ̂_n = T(F̂_n) = Σ_{i=1}^{k} p_i^{-1} (p̃_i − p_i)^2,

where p̃_i is the probability that a N(X̄_n, S) random variable is in the region R_i for all i ∈ {1, . . . , k}. □
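Both plug-in estimators in Example 9.6 are easy to compute in practice. The following R sketch (the partition of the real line, the hypothesized probabilities, and the data-generating distribution are all arbitrary illustrative choices, and sd() uses the usual n − 1 divisor) uses the partition (−∞, −1], (−1, 1], (1, ∞) with hypothesized probabilities taken from the standard Normal distribution, and computes both the empirical plug-in estimator and the Normal plug-in estimator of θ.

  # Plug-in estimators of theta from Example 9.6 (illustrative setup)
  set.seed(4)
  x <- rnorm(100, mean = 0.2)                           # observed sample
  breaks <- c(-Inf, -1, 1, Inf)                         # partition R_1, R_2, R_3
  p <- diff(pnorm(breaks))                              # hypothesized probabilities from N(0, 1)
  phat <- as.vector(table(cut(x, breaks))) / length(x)  # p-hat_i from the empirical distribution
  theta_hat_edf <- sum((phat - p)^2 / p)
  ptilde <- diff(pnorm(breaks, mean = mean(x), sd = sd(x)))  # p-tilde_i from a N(xbar, s) fit
  theta_hat_norm <- sum((ptilde - p)^2 / p)
  c(theta_hat_edf, theta_hat_norm)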
9.3 Differentiation of Statistical Functionals

The development of a Taylor type expansion for functional parameters and statistics requires us first to develop a derivative, or differential, for functional parameters. To motivate defining such a derivative, we will first briefly recall how the derivative of a real valued function is computed. Let g be a real valued function and suppose that we wish to compute the derivative of g at a point x ∈ R. The derivative of g at x is the instantaneous slope of the function g at the point x. This instantaneous slope can be defined by taking the slope of a line that connects g(x) with the point g(x + δ) as δ approaches zero. That is,

g′(x) = lim_{δ→0} [g(x + δ) − g(x)]/δ.

Of course we could approach the point x from the opposite direction and define the derivative to be

g′(x) = lim_{δ→0} [g(x) − g(x − δ)]/δ.

The derivative in this case is said to exist if both definitions agree. If the two definitions do not agree then the function is not differentiable at x.

Developing a derivative for functionals is more complicated since the domain of a functional is a space of functions. Let T be a functional that maps a function space F to the real line. In order to compute a derivative of the functional, we need to find the instantaneous change in the functional at a point in the space F. As with the derivative of a real function, we will define this instantaneous change by comparing the difference between T(F) and T(F_δ), where F_δ is a function such that F_δ ∈ F for all δ ∈ R, and F_δ converges to F as δ → 0. It is important to note that there are many potential paths through the space F that F_δ may take to reach F as δ → 0, and that, just as with the derivative of a real function, this path may have an effect on the derivative at F. Another problem that arises when we attempt to build such a derivative is that we also need a measure of the amount of change between F_δ and F for the denominator of the limit.

The Gâteaux differential of T uses the path in F defined by F_δ = F + δ(G − F) for some function G ∈ F and δ ∈ [0, 1]. The amount of change between F_δ and F is measured by δ.

Definition 9.1 (Gâteaux). Let T be a functional that maps a space of functions F to R and let F and G be members of F. The Gâteaux differential of T at F in the direction of G is defined to be

∆_1 T(F; G − F) = lim_{δ↓0} δ^{-1}[T(F_δ) − T(F)],

provided the limit exists.

The function ∆_1 T(F; G − F) is usually called a differential and not a derivative due to the fact that the function may not always have the same properties as a derivative for real functions. However, if we define a real valued function h(δ) = T[F + δ(G − F)] then we have that h(0) = T(F) and hence

∆_1 T(F; G − F) = lim_{δ↓0} δ^{-1}[h(δ) − h(0)] = (d/dδ) h(δ) |_{δ↓0},

which is the usual derivative, from the right, of the real function h evaluated at zero. This result allows us to easily define higher order Gâteaux differentials.

Definition 9.2 (Gâteaux). Let T be a functional that maps a space of functions F to R and let F and G be members of F. The kth order Gâteaux differential of T at F in the direction of G is defined to be

∆_k T(F; G − F) = (d^k/dδ^k) h(δ) |_{δ↓0},

where h(δ) = T[F + δ(G − F)], provided the derivative exists.

Many of the functionals studied in this chapter have a relatively simple form that can be written as a multiple integral of a symmetric function where each integral is taken with respect to dF. In this case the Gâteaux differential has a simple form.

Theorem 9.1. Consider a functional of the form

T(F) = ∫_{−∞}^{∞} · · · ∫_{−∞}^{∞} h(x_1, . . . , x_r) Π_{i=1}^{r} dF(x_i),

where F ∈ F, and F is a collection of distribution functions. Then if k ≤ r,

∆_k T(F; G − F) = [Π_{i=1}^{k} (r − i + 1)] ∫_{−∞}^{∞} · · · ∫_{−∞}^{∞} h(x_1, . . . , x_k, y_1, . . . , y_{r−k}) [Π_{i=1}^{r−k} dF(y_i)] {Π_{i=1}^{k} d[G(x_i) − F(x_i)]}.

If k > r then ∆_k T(F; G − F) = 0.

Proof. We will prove this result for r = 1 and r = 2. For a general proof see Exercise 9. We first note that when r = 1 we have that

T[F + δ(G − F)] = ∫_{−∞}^{∞} h(x_1) d{F(x_1) + δ[G(x_1) − F(x_1)]}
  = ∫_{−∞}^{∞} h(x_1) dF(x_1) + δ ∫_{−∞}^{∞} h(x_1) d[G(x_1) − F(x_1)].

Therefore, Definition 9.2 implies that

∆_1 T(F; G − F) = (d/dδ) { ∫_{−∞}^{∞} h(x_1) dF(x_1) + δ ∫_{−∞}^{∞} h(x_1) d[G(x_1) − F(x_1)] } |_{δ↓0}
  = ∫_{−∞}^{∞} h(x_1) d[G(x_1) − F(x_1)]
  = ∫_{−∞}^{∞} h(x_1) dG(x_1) − ∫_{−∞}^{∞} h(x_1) dF(x_1)
  = T(G) − T(F).

We also note that when k > 1 it follows that ∆_k T(F; G − F) = 0. When r = 2, we have that

T[F + δ(G − F)] = ∫_{−∞}^{∞} ∫_{−∞}^{∞} h(x_1, x_2) d{F(x_1) + δ[G(x_1) − F(x_1)]} d{F(x_2) + δ[G(x_2) − F(x_2)]}.

To simplify this expression we first work with the differential in the double integral. Note that

d{F(x_1) + δ[G(x_1) − F(x_1)]} d{F(x_2) + δ[G(x_2) − F(x_2)]}
  = {dF(x_1) + δd[G(x_1) − F(x_1)]}{dF(x_2) + δd[G(x_2) − F(x_2)]}
  = dF(x_1)dF(x_2) + δd[G(x_1) − F(x_1)]dF(x_2) + δdF(x_1)d[G(x_2) − F(x_2)] + δ^2 d[G(x_1) − F(x_1)]d[G(x_2) − F(x_2)].

Therefore, with all double integrals taken over R^2,

T[F + δ(G − F)] = ∫∫ h(x_1, x_2) dF(x_1)dF(x_2) + δ ∫∫ h(x_1, x_2) d[G(x_1) − F(x_1)]dF(x_2)
  + δ ∫∫ h(x_1, x_2) dF(x_1)d[G(x_2) − F(x_2)] + δ^2 ∫∫ h(x_1, x_2) d[G(x_1) − F(x_1)]d[G(x_2) − F(x_2)].

We now take advantage of the fact that h(x_1, x_2) is a symmetric function of its arguments so that h(x_1, x_2) = h(x_2, x_1). With this assumption it follows that

∫∫ h(x_1, x_2) d[G(x_1) − F(x_1)]dF(x_2) = ∫∫ h(x_1, x_2) dF(x_1)d[G(x_2) − F(x_2)],

and, therefore,

T[F + δ(G − F)] = ∫∫ h(x_1, x_2) dF(x_1)dF(x_2) + 2δ ∫∫ h(x_1, x_2) d[G(x_1) − F(x_1)]dF(x_2)
  + δ^2 ∫∫ h(x_1, x_2) d[G(x_1) − F(x_1)]d[G(x_2) − F(x_2)].

Therefore, the first two Gâteaux differentials follow from Definition 9.2 as

∆_1 T(F; G − F) = { 2 ∫∫ h(x_1, x_2) d[G(x_1) − F(x_1)]dF(x_2) + 2δ ∫∫ h(x_1, x_2) d[G(x_1) − F(x_1)]d[G(x_2) − F(x_2)] } |_{δ↓0}
  = 2 ∫∫ h(x_1, x_2) d[G(x_1) − F(x_1)]dF(x_2),

and

∆_2 T(F; G − F) = (d/dδ) { 2 ∫∫ h(x_1, x_2) d[G(x_1) − F(x_1)]dF(x_2) + 2δ ∫∫ h(x_1, x_2) d[G(x_1) − F(x_1)]d[G(x_2) − F(x_2)] } |_{δ↓0}
  = 2 ∫∫ h(x_1, x_2) d[G(x_1) − F(x_1)]d[G(x_2) − F(x_2)]
  = 2T(G − F).

We also note that when k > 2 it follows that ∆_k T(F; G − F) = 0.


Example 9.7. Consider the mean functional given by

T(F) = ∫_{−∞}^{∞} x dF(x).

This functional has the form given in Theorem 9.1 with r = 1 and h(x) = x. Therefore, Theorem 9.1 implies that the first Gâteaux differential has the form

∆_1 T(F; G − F) = T(G) − T(F) = ∫_{−∞}^{∞} x dG(x) − ∫_{−∞}^{∞} x dF(x).

Higher order Gâteaux differentials are equal to zero. □

Example 9.8. Consider the variance functional given by

T(F) = ∫_{−∞}^{∞} [t − ∫_{−∞}^{∞} t dF(t)]^2 dF(t),

where F ∈ F, a collection of distribution functions with mean µ and finite variance σ^2. In order to take advantage of the result given in Theorem 9.1 we must first write this functional as a multiple integral of a symmetric function. To find such a function we can note that if X_1 and X_2 are independent random variables both following the distribution F, then

E[(X_1 − X_2)^2] = E(X_1^2 + X_2^2 − 2X_1X_2) = 2µ^2 + 2σ^2 − 2µ^2 = 2σ^2,

so that ½E[(X_1 − X_2)^2] = σ^2. Therefore, it follows that T(F) can equivalently be written as

T(F) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} h(x_1, x_2) dF(x_1)dF(x_2),

where h(x_1, x_2) = ½(x_1^2 + x_2^2 − 2x_1x_2), which is a symmetric function in x_1 and x_2. We can now apply Theorem 9.1 to find the Gâteaux differentials of this functional. We first find that, with all double integrals taken over R^2,

∆_1 T(F; G − F) = 2 ∫∫ ½(x_1^2 + x_2^2 − 2x_1x_2) d[G(x_1) − F(x_1)]dF(x_2)
  = ∫∫ x_1^2 d[G(x_1) − F(x_1)]dF(x_2) + ∫∫ x_2^2 d[G(x_1) − F(x_1)]dF(x_2) − 2 ∫∫ x_1x_2 d[G(x_1) − F(x_1)]dF(x_2).

Now

∫∫ x_1^2 d[G(x_1) − F(x_1)]dF(x_2) = ∫∫ x_1^2 dG(x_1)dF(x_2) − ∫∫ x_1^2 dF(x_1)dF(x_2)
  = [∫ x_1^2 dG(x_1)][∫ dF(x_2)] − [∫ x_1^2 dF(x_1)][∫ dF(x_2)] = ν′_2 − µ′_2,

where ν′_2 and µ′_2 are the second moments of G and F, respectively. Similarly, the second term is given by

∫∫ x_2^2 d[G(x_1) − F(x_1)]dF(x_2) = [∫ dG(x_1) − ∫ dF(x_1)][∫ x_2^2 dF(x_2)] = 0.

Finally, the third term is given by

2 ∫∫ x_1x_2 d[G(x_1) − F(x_1)]dF(x_2) = 2[∫ x_1 dG(x_1)][∫ x_2 dF(x_2)] − 2[∫ x_1 dF(x_1)][∫ x_2 dF(x_2)] = 2ν′_1µ′_1 − 2(µ′_1)^2,

where ν′_1 and µ′_1 are the first moments of G and F, respectively. Therefore, ∆_1 T(F; G − F) = ν′_2 − µ′_2 − 2ν′_1µ′_1 + 2(µ′_1)^2. The second differential has the form

∆_2 T(F; G − F) = 2T(G − F) = ∫∫ x_1^2 d[G(x_1) − F(x_1)]d[G(x_2) − F(x_2)] + ∫∫ x_2^2 d[G(x_1) − F(x_1)]d[G(x_2) − F(x_2)] − 2 ∫∫ x_1x_2 d[G(x_1) − F(x_1)]d[G(x_2) − F(x_2)].

Simplifying the first term we observe that

∫∫ x_1^2 d[G(x_1) − F(x_1)]d[G(x_2) − F(x_2)] = [∫ x_1^2 d[G(x_1) − F(x_1)]][∫ d[G(x_2) − F(x_2)]] = 0.

Similarly,

∫∫ x_2^2 d[G(x_1) − F(x_1)]d[G(x_2) − F(x_2)] = 0.

Therefore,

∆_2 T(F; G − F) = −2 ∫∫ x_1x_2 d[G(x_1) − F(x_1)]d[G(x_2) − F(x_2)]
  = −2[∫ x_1 d[G(x_1) − F(x_1)]][∫ x_2 d[G(x_2) − F(x_2)]]
  = −2[∫ x_1 dG(x_1) − ∫ x_1 dF(x_1)]^2 = −2(ν′_1 − µ′_1)^2. □
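The closed form for ∆_1 T(F; G − F) in Example 9.8 can be checked numerically whenever F and G are discrete distributions, since T[F + δ(G − F)] is then a finite sum. The R sketch below (the two discrete distributions and the step size are arbitrary illustrative choices) approximates the right-hand derivative of h(δ) = T[F + δ(G − F)] at zero by a finite difference and compares it with ν′_2 − µ′_2 − 2ν′_1µ′_1 + 2(µ′_1)^2.

  # Numerical check of the first Gateaux differential of the variance functional
  support <- 0:4
  pF <- c(0.1, 0.2, 0.4, 0.2, 0.1)        # probabilities defining F
  pG <- c(0.3, 0.3, 0.2, 0.1, 0.1)        # probabilities defining G
  varT <- function(p) sum(p * support^2) - sum(p * support)^2   # T(.) for a discrete distribution
  h <- function(delta) varT(pF + delta * (pG - pF))             # h(delta) = T[F + delta(G - F)]
  delta <- 1e-6
  (h(delta) - h(0)) / delta               # finite-difference approximation to Delta_1 T(F; G - F)
  nu1 <- sum(pG * support); nu2 <- sum(pG * support^2)
  mu1 <- sum(pF * support); mu2 <- sum(pF * support^2)
  nu2 - mu2 - 2 * nu1 * mu1 + 2 * mu1^2   # closed form from Example 9.8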


The Gâteaux differential is not the only approach to defining a derivative type
methodology for functionals, and in some sense this differential does not have
sufficient properties for a full development of the type of asymptotic theory
that we wish to seek in general settings. The original approach to using differ-
entials of this type was developed by von Mises (1947), who used a differential
similar to the Gâteaux differential. If we are working in a linear space that
is equipped with a norm then one can define what is commonly known as
the Fréchet derivative. See Dieudonné (1960), Section 2.3 of Fernholz (1983),
Fréchet (1925), Nashed (1971), and Chapter 6 of Serfling (1980), for further
details on this type of derivative. However, Fernholz (1983) points out that few
statistical functions are Fréchet differentiable. The Hadamard differential, first
used in this context by Reeds (1976), exists under weaker conditions than the
Fréchet derivative, but still exhibits the required properties. The Hadamard
differential is preferred by Fernholz (1983) as a compromise. For further details
on the Hadamard differential see Averbukh and Smolyanov (1968), Fernholz
(1983), Keller (1974), and Yamamuro (1974). In our presentation we will con-
tinue to use the Gâteaux differential and will limit ourselves to problems where
this differential is useful. This generally follows the development of Serfling
(1980), without addressing the issues related to the Fréchet derivative. This
provides a reasonable and practical overview of this topic without becoming
too deep with mathematical technicalities.

9.4 Expansion Theory for Statistical Functionals

The concept that a differential can be developed for a functional motivates


a Taylor type approximation for functionals. Recall from Theorem 1.13 that
a function f can be approximated at points near x ∈ R by f (x) + δf 0 (x)
where the error between f (x + δ) and the approximation converges to zero as
δ → 0. In principle a similar approximation can be obtained for functionals
evaluated at distributions near F . That is, we can approximate T (G) with
T (F ) + ∆1 T (F, G − F ) where the distribution G is near F . As Serfling (1980)
points out, no further theory is required to use this approximation. However,
if we wish for this approximation to have similar properties to the Taylor
approximation, then we need to consider how the error of the approximation
behaves. That is, define E1 (F, G) such that T (G) = T (F ) + ∆1 T (F, G − F ) +
E1 (F, G) for all F , G and T where ∆1 T (F, G − F ) exists. Then we must show
that E1 (F, G) → 0 at some rate as G → F .

One case that is of particular interest comes from the fact that the statistical properties of plug-in estimators of T can be studied by replacing G with the empirical distribution function of Definition 3.5. That is, we can consider the expansion

T(F̂_n) = T(F) + ∆_1 T(F, F̂_n − F) + E_1(F, F̂_n).    (9.1)

We know from Theorem 3.18 (Glivenko and Cantelli) that F̂_n converges uniformly to F with probability one as n → ∞, since ||F̂_n − F||_∞ →_{a.c.} 0 as n → ∞. We would like to use this fact to prove that E_1(F, F̂_n) →_p 0 as n → ∞. We now begin developing a theory that will lead to the asymptotic behavior of this error term. The following result helps to simplify the form of the error term.
Theorem 9.2. Let G ∈ F and let F be a fixed distribution from F. Consider a functional of the form

T(G) = ∫_{−∞}^{∞} · · · ∫_{−∞}^{∞} t(x_1, . . . , x_r) Π_{k=1}^{r} d[G(x_k) − F(x_k)].

Then there exists a function t̃(x_1, . . . , x_r|F) such that

T(G) = ∫_{−∞}^{∞} · · · ∫_{−∞}^{∞} t̃(x_1, . . . , x_r|F) Π_{k=1}^{r} dG(x_k),

for all G ∈ F.

Proof. We prove this result for the special cases when r = 1 and r = 2, with all integrals taken over the real line. For the general case see Exercise 10. For the case when r = 1, we follow the constructive proof of Serfling (1980) and consider the function

t̃(x_1|F) = t(x_1) − ∫ t(x_2) dF(x_2).

Using this function, we observe that

∫ t̃(x_1|F) dG(x_1) = ∫ [t(x_1) − ∫ t(x_2) dF(x_2)] dG(x_1)
  = ∫ t(x_1) dG(x_1) − ∫∫ t(x_2) dF(x_2)dG(x_1)
  = ∫ t(x_1)[dG(x_1) − dF(x_1)],

which proves the result when r = 1. For r = 2 we use the function

t̃_2(x_1, x_2|F) = t(x_1, x_2) − ∫ t(t_1, x_2) dF(t_1) − ∫ t(x_1, t_2) dF(t_2) + ∫∫ t(t_1, t_2) dF(t_1)dF(t_2),

so that

∫∫ t̃_2(x_1, x_2|F) dG(x_1)dG(x_2) = ∫∫ t(x_1, x_2) dG(x_1)dG(x_2) − ∫∫∫ t(t_1, x_2) dF(t_1)dG(x_1)dG(x_2)
  − ∫∫∫ t(x_1, t_2) dF(t_2)dG(x_1)dG(x_2) + ∫∫∫∫ t(t_1, t_2) dF(t_1)dF(t_2)dG(x_1)dG(x_2)
  = ∫∫ t(x_1, x_2) dG(x_1)dG(x_2) − ∫∫ t(t_1, x_2) dF(t_1)dG(x_2) − ∫∫ t(x_1, t_2) dG(x_1)dF(t_2) + ∫∫ t(t_1, t_2) dF(t_1)dF(t_2)
  = ∫∫ t(x_1, x_2) d[G(x_1) − F(x_1)]d[G(x_2) − F(x_2)],

which completes the proof for r = 2.

The result of Theorem 9.2 can have a profound impact on the form of a Gâteaux differential. For example, Theorem 9.1 implies that if T is a functional of the form

T(F) = ∫_{−∞}^{∞} · · · ∫_{−∞}^{∞} h(x_1, . . . , x_r) Π_{i=1}^{r} dF(x_i),

then

∆_k T(F, F̂_n − F) = [Π_{i=1}^{k} (r − i + 1)] ∫_{−∞}^{∞} · · · ∫_{−∞}^{∞} t(x_1, . . . , x_k|F) Π_{i=1}^{k} d[F̂_n(x_i) − F(x_i)],

where

t(x_1, . . . , x_k|F) = ∫_{−∞}^{∞} · · · ∫_{−∞}^{∞} h(x_1, . . . , x_k, y_1, . . . , y_{r−k}) Π_{i=1}^{r−k} dF(y_i).

Theorem 9.2 then implies that

∆_k T(F, F̂_n − F) = ∫_{−∞}^{∞} · · · ∫_{−∞}^{∞} t̃(x_1, . . . , x_k|F) Π_{i=1}^{k} dF̂_n(x_i),

for some function t̃(x_1, . . . , x_k|F), where the constant factor has been absorbed into t̃. Applying Definition 2.10 to the integrals with respect to the empirical distribution function then yields

∆_k T(F, F̂_n − F) = n^{-k} Σ_{i_1=1}^{n} · · · Σ_{i_k=1}^{n} t̃(X_{i_1}, . . . , X_{i_k}|F).

This form of the differential can often be used to establish the weak convergence properties of the differential. For example, if k = 1, we have that

∆_1 T(F, F̂_n − F) = n^{-1} Σ_{i=1}^{n} t̃(X_i|F),

which is asymptotically Normal by Theorem 4.20 (Lindeberg and Lévy). Hence, if it can also be shown that E_1(F, F̂_n) →_p 0 as n → ∞, then Theorem 4.11 can be used to show that T(F̂_n) is asymptotically Normal. To establish this type of property we use the result given below.

Theorem 9.3. Let {X_n}_{n=1}^∞ be a sequence of independent and identically distributed random variables from a distribution F and let t(x_1, . . . , x_r) be a function such that E[t^2(X_{i_1}, . . . , X_{i_r})] < ∞ for all sets of indices i_1, . . . , i_r where i_j ∈ {1, . . . , r} for all j ∈ {1, . . . , r}. Then

E{ [ ∫_{−∞}^{∞} · · · ∫_{−∞}^{∞} t(x_1, . . . , x_r) Π_{i=1}^{r} d[F̂_n(x_i) − F(x_i)] ]^2 } = O(n^{-r}),

as n → ∞.

Proof. Theorem 9.2 implies that there exists a function t̃(x_1, . . . , x_r|F) such that

∫_{−∞}^{∞} · · · ∫_{−∞}^{∞} t(x_1, . . . , x_r) Π_{i=1}^{r} d[F̂_n(x_i) − F(x_i)] = ∫_{−∞}^{∞} · · · ∫_{−∞}^{∞} t̃(x_1, . . . , x_r|F) dF̂_n(x_1) · · · dF̂_n(x_r).

Repeated application of Definition 2.10 then implies that

∫_{−∞}^{∞} · · · ∫_{−∞}^{∞} t̃(x_1, . . . , x_r|F) dF̂_n(x_1) · · · dF̂_n(x_r) = n^{-r} Σ_{i_1=1}^{n} · · · Σ_{i_r=1}^{n} t̃(X_{i_1}, . . . , X_{i_r}|F).

Therefore,

E{ [ ∫_{−∞}^{∞} · · · ∫_{−∞}^{∞} t(x_1, . . . , x_r) Π_{i=1}^{r} d[F̂_n(x_i) − F(x_i)] ]^2 }
  = E{ [ n^{-r} Σ_{i_1=1}^{n} · · · Σ_{i_r=1}^{n} t̃(X_{i_1}, . . . , X_{i_r}|F) ]^2 }
  = n^{-2r} Σ_{i_1=1}^{n} · · · Σ_{i_r=1}^{n} Σ_{j_1=1}^{n} · · · Σ_{j_r=1}^{n} E[t̃(X_{i_1}, . . . , X_{i_r}|F) t̃(X_{j_1}, . . . , X_{j_r}|F)].    (9.2)

Now we take advantage of the fact that

∫_{−∞}^{∞} t̃(x_{i_1}, . . . , x_{i_r}|F) dF(x_{i_k}) = 0,

for all i_k ∈ {1, . . . , r} and k ∈ {1, . . . , r}. See Exercise 8. This implies that each expectation in Equation (9.2) will be zero unless each index in the function occurs at least twice. Serfling (1980) concludes that the number of non-zero terms is O(n^r) as n → ∞. Assuming that the remaining absolute expectations are bounded by some real value, as indicated by the assumptions, it follows that the sum in Equation (9.2) is O(n^{-2r})O(n^r) = O(n^{-r}),

as n → ∞.

We now combine these results in order to prove that the error term in the expansion in Equation (9.1) has the desired asymptotic properties. We begin by noting that Definition 9.2 implies that

T(F̂_n) = T(F) + ∆_1 T(F, F̂_n − F) + E_1(F, F̂_n)
  = T(F) + (d/dδ) T[F + δ(F̂_n − F)] |_{δ↓0} + E_1(F, F̂_n).    (9.3)

Let us first consider the case when r = 1. Noting that ∆_1 T(F, F̂_n − F) = T(F̂_n) − T(F) implies that T(F) + ∆_1 T(F, F̂_n − F) = T(F) + T(F̂_n) − T(F) = T(F̂_n). Therefore, E_1(F, F̂_n) is identically zero for all n ∈ N.
For the case when r = 2 we define v(δ) = T[F + δ(F̂_n − F)] as a function of δ, so that v(0) = T(F) and v(1) = T(F̂_n). With this notation, Equation (9.3) can be written as

v(1) = v(0) + (d/dδ) v(δ) |_{δ=0} + E_1(δ) = v(0) + v′(0) + E_1(δ),    (9.4)

which has the same form as a Taylor expansion for the function v provided by Theorem 1.13, and hence E_1(δ) = ½δ^2 v″(ξ), for some ξ ∈ (0, 1).

Let δ be an arbitrary member of the unit interval. Following the arguments of the proof of Theorem 9.1, if the functional T has the form

T(F) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} h(x_1, x_2) dF(x_1)dF(x_2),

then

v″(ξ) = 2 ∫_{−∞}^{∞} ∫_{−∞}^{∞} h(x_1, x_2) d[F̂_n(x_1) − F(x_1)] d[F̂_n(x_2) − F(x_2)],

where we note that the derivative does not depend on ξ. Theorem 9.2 implies that there exists a function h̃(x_1, x_2|F) such that

v″(ξ) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} h̃(x_1, x_2|F) dF̂_n(x_1)dF̂_n(x_2) = n^{-2} Σ_{i=1}^{n} Σ_{j=1}^{n} h̃(X_i, X_j|F),

where the constant factor of two has been absorbed into h̃. Suppose that we can assume that E[h̃^2(X_i, X_j|F)] < ∞ for all i and j from the index set {1, . . . , n}. Then Theorem 9.3 implies that

n^{1/2} E{ [ n^{-2} Σ_{i=1}^{n} Σ_{j=1}^{n} h̃(X_i, X_j|F) ]^2 } = O(n^{-3/2}),

as n → ∞. Therefore, it follows that

lim_{n→∞} n^{1/2} E{ [ n^{-2} Σ_{i=1}^{n} Σ_{j=1}^{n} h̃(X_i, X_j|F) ]^2 } = 0.

Definition 5.1 implies that n^{1/2}|v″(ξ)| →_{qm} 0 as n → ∞ and hence Theorem 5.2 implies that n^{1/2}|v″(ξ)| →_p 0 as n → ∞. This in turn implies that n^{1/2}E_1(δ) →_p 0 as n → ∞ for all δ ∈ (0, 1). This type of argument can be generalized to the following result.
Theorem 9.4. Let X_1, . . . , X_n be a set of independent and identically distributed random variables from a distribution F and let θ be a functional parameter of the form

θ = T(F) = ∫_{−∞}^{∞} · · · ∫_{−∞}^{∞} h(x_1, . . . , x_r) Π_{i=1}^{r} dF(x_i).

Then

n^{1/2} sup_{λ∈[0,1]} | (d^2/dλ^2) T[F + λ(F̂_n − F)] | →_p 0,

as n → ∞, and therefore n^{1/2} E_1(F, F̂_n) →_p 0 as n → ∞.

Discussion about some general conditions under which the result of Theorem 9.4 holds for other types of functional parameters and differentials can be found in Chapter 6 of Serfling (1980).

Example 9.9. Consider the variance functional of Example 9.8, which has the form

T(F) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} h(x_1, x_2) dF(x_1)dF(x_2),

where h(x_1, x_2) = ½(x_1^2 + x_2^2 − 2x_1x_2). In Example 9.8 it was shown that, setting G = F̂_n in the first differential, ∆_1 T(F, F̂_n − F) = µ̂′_2 − µ′_2 − 2µ̂′_1µ′_1 + 2(µ′_1)^2. Therefore, it follows that T(F) + ∆_1 T(F, F̂_n − F) = µ̂′_2 + (µ′_1)^2 − 2µ̂′_1µ′_1, and hence in this case we have that

E_1(F, F̂_n) = µ̂′_2 − (µ̂′_1)^2 − µ̂′_2 − (µ′_1)^2 + 2µ̂′_1µ′_1 = −(µ̂′_1 − µ′_1)^2.

Theorems 8.1 and 8.5 then imply that E_1(F, F̂_n) = o_p(n^{-1/2}) as n → ∞, which verifies the arguments given earlier for this example. □

9.5 Asymptotic Distribution

In this section we will use the results of the previous section, along with the expansion given in Equation (9.1), to find the asymptotic distribution of the sample functional T(F̂_n). At first glance, the use of the expansion given in Equation (9.1) may not seem particularly useful, as it is not readily apparent that the asymptotic properties of ∆_1 T(F, F̂_n − F) would be easier to establish than those of T(F̂_n) directly. Indeed, it is the case in some problems that either approach may have the same level of difficulty. However, we have established that in some cases ∆_1 T(F, F̂_n − F) can be written as a sum of random variables whose asymptotic properties follow from established results such as Theorem 4.20 (Lindeberg and Lévy). Another key ingredient in establishing these results is the asymptotic behavior of the error for the expansion given in Equation (9.1). In the previous section we observed that in certain cases this error term can be guaranteed to converge to zero at a certain rate. For the development of the result below, the error term will need to converge in probability to zero at least as fast as n^{-1/2} as n → ∞.

Theorem 9.5. Let {X_n}_{n=1}^∞ be a sequence of independent and identically distributed random variables from a distribution F. Let θ = T(F) be a functional parameter with estimator θ̂_n = T(F̂_n), where F̂_n is the empirical distribution function of X_1, . . . , X_n. Suppose that

∆_1 T(F, F̂_n − F) = n^{-1} Σ_{i=1}^{n} t̃(X_i|F),

for some function t̃(X_i|F). Let µ̃ = E[t̃(X_i|F)] and σ̃^2 = V[t̃(X_i|F)] and assume that 0 < σ̃^2 < ∞. If

n^{1/2} E_1(F, F̂_n) = n^{1/2}[T(F̂_n) − T(F) − ∆_1 T(F, F̂_n − F)] →_p 0,

as n → ∞, then n^{1/2}(θ̂_n − θ − µ̃) →_d Z as n → ∞, where Z has a N(0, σ̃^2) distribution.

Proof. We begin proving this result by first finding the asymptotic distribution of ∆_1 T(F, F̂_n − F). First, note that {t̃(X_n|F)}_{n=1}^∞ is a sequence of independent and identically distributed random variables with mean µ̃ and variance 0 < σ̃^2 < ∞. Therefore, Theorem 4.20 (Lindeberg and Lévy) implies that

n^{1/2} [ n^{-1} Σ_{i=1}^{n} t̃(X_i|F) − µ̃ ] →_d Z,    (9.5)

as n → ∞, where Z is a N(0, σ̃^2) random variable. Now we use the fact that we have defined E_1(F, F̂_n) so that T(F̂_n) = T(F) + ∆_1 T(F, F̂_n − F) + E_1(F, F̂_n), so that it follows that

n^{1/2}(θ̂_n − θ − µ̃) = n^{1/2}[∆_1 T(F, F̂_n − F) − µ̃] + n^{1/2} E_1(F, F̂_n).    (9.6)

It follows that the first term on the right hand side of Equation (9.6) converges in distribution to Z, while the second term converges in probability to zero as n → ∞. Therefore, Theorem 4.11 (Slutsky) implies that the sum of these two terms converges in distribution to Z, and the result is proven.

When the functional parameter has the form studied in the previous section, the conditions under which we obtain the result given in Theorem 9.5 simplify greatly.

Corollary 9.1. Let {X_n}_{n=1}^∞ be a sequence of independent and identically distributed random variables from a distribution F. Let θ = T(F) be a functional parameter of the form

T(F) = ∫_{−∞}^{∞} · · · ∫_{−∞}^{∞} h(x_1, . . . , x_r) Π_{i=1}^{r} dF(x_i),    (9.7)

with estimator θ̂_n = T(F̂_n), where F̂_n is the empirical distribution function of X_1, . . . , X_n. Define µ̃ = E[t̃(X_i|F)] and σ̃^2 = V[t̃(X_i|F)], where

∆_1 T(F, F̂_n − F) = n^{-1} Σ_{i=1}^{n} t̃(X_i|F),

for some function t̃(X_i|F). If 0 < σ̃^2 < ∞ and ∆_1 T(F, F̂_n − F) is not functionally equal to zero, then n^{1/2}(θ̂_n − θ − µ̃) →_d Z as n → ∞, where Z has a N(0, σ̃^2) distribution.

Proof. We begin by noting that if the functional T has the form indicated in Equation (9.7) then Theorem 9.1 implies that

∆_1 T(F; F̂_n − F) = r ∫_{−∞}^{∞} [ ∫_{−∞}^{∞} · · · ∫_{−∞}^{∞} h(x_1, y_1, . . . , y_{r−1}) Π_{i=1}^{r−1} dF(y_i) ] d[F̂_n(x_1) − F(x_1)]
  = ∫_{−∞}^{∞} h̃(x_1|F) d[F̂_n(x_1) − F(x_1)],

where h̃(x_1|F) denotes r times the inner integral above. Now, Theorem 9.2 implies that there exists a function t̃(x_1|F) such that

∆_1 T(F; F̂_n − F) = ∫_{−∞}^{∞} t̃(x_1|F) dF̂_n(x_1) = n^{-1} Σ_{i=1}^{n} t̃(X_i|F),

which yields the form required by Theorem 9.5. Theorem 9.4 then provides the required behavior of the error term, which proves the result.

Hence, the form of the differential is key in establishing asymptotic Normality in this case, in that ∆_1 T(F; F̂_n − F) can be written as a sum of independent and identically distributed random variables. If ∆_1 T(F; F̂_n − F) = 0, then the first term of the approximation is zero and the asymptotic behavior changes. In particular, Theorem 6.4.B of Serfling (1980) demonstrates conditions under which ∆_1 T(F; F̂_n − F) = 0 and ∆_2 T(F; F̂_n − F) ≠ 0, and the resulting asymptotic distribution is a weighted sum of independent ChiSquared random variables.
Example 9.10. Let {X_n}_{n=1}^∞ be a sequence of independent and identically distributed random variables from a distribution F and consider once again the variance functional θ = T(F) from Example 9.8. In Example 9.9 it was shown that T(F̂_n) = T(F) + ∆_1 T(F, F̂_n − F) + E_1(F, F̂_n), where

∆_1 T(F, F̂_n − F) = n^{-1} Σ_{i=1}^{n} X_i^2 − µ′_2 − 2X̄_n µ′_1 + 2(µ′_1)^2
  = n^{-1} Σ_{i=1}^{n} [X_i^2 − 2X_i µ′_1 + (µ′_1)^2 + (µ′_1)^2 − µ′_2]
  = n^{-1} Σ_{i=1}^{n} [(X_i − µ′_1)^2 − θ]
  = n^{-1} Σ_{i=1}^{n} t̃(X_i|F),

where t̃(X_i|F) = (X_i − µ′_1)^2 − θ. Example 9.9 shows that E_1(F, F̂_n) = o_p(n^{-1/2}) as n → ∞, so that Theorem 9.5 implies that if σ̃^2 is finite, then n^{1/2}(θ̂_n − θ − µ̃) →_d Z as n → ∞, where Z is a N(0, σ̃^2) random variable. In this case θ̂_n is the sample variance given by

θ̂_n = n^{-1} Σ_{i=1}^{n} (X_i − X̄_n)^2.

Direct calculations can be used to show that µ̃ = E[t̃(X_i|F)] = 0 and σ̃^2 = V[t̃(X_i|F)] = V[(X_i − µ′_1)^2 − θ] = µ_4 − θ^2. Therefore we have shown that n^{1/2}(θ̂_n − θ) →_d Z as n → ∞, where Z is a N(0, µ_4 − θ^2) random variable. □
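A quick simulation illustrates the conclusion of Example 9.10. The R sketch below (the Exponential(1) distribution, sample size, and replication count are arbitrary illustrative choices) compares the sample standard deviation of n^{1/2}(θ̂_n − θ) with (µ_4 − θ^2)^{1/2}; for the Exponential(1) distribution θ = 1 and µ_4 = 9, so the limiting standard deviation is 8^{1/2} ≈ 2.83.

  # Asymptotic distribution of the sample variance (Example 9.10), Exponential(1) case
  set.seed(5)
  n <- 1000; reps <- 5000
  theta <- 1; mu4 <- 9                         # variance and fourth central moment of Exp(1)
  stat <- replicate(reps, {
    x <- rexp(n)
    thetahat <- mean((x - mean(x))^2)          # sample variance with divisor n
    sqrt(n) * (thetahat - theta)
  })
  c(sd(stat), sqrt(mu4 - theta^2))             # should be roughly equal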

9.6 Exercises and Experiments

9.6.1 Exercises

1. Let {X_n}_{n=1}^∞ be a sequence of independent and identically distributed random variables from a distribution F. Let θ be defined as the pth quantile of F.

a. Write θ as a functional parameter T(F) of F.
b. Develop a plug-in estimator for θ based on using the empirical distribution function to estimate F.
c. Consider estimating F with a N(X̄_n, S) distribution. Write this estimator in terms of z_p, the pth quantile of a N(0, 1) distribution.

2. Consider a functional of the form

T(F) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} h(x_1, x_2) dF(x_1)dF(x_2),

where F ∈ F, a collection of distribution functions. Find an expression for the kth Gâteaux differential of T(F) without using the assumption that h is a symmetric function of its arguments. Compare this result to that of Theorem 9.1.
3. Consider the skewness functional given by

T(F) = ∫_{−∞}^{∞} [t − ∫_{−∞}^{∞} t dF(t)]^3 dF(t),

where F ∈ F, a collection of distribution functions with mean µ, variance σ^2, and third central moment γ.

a. Using the method demonstrated in Example 9.8, write this functional in a form suitable for the application of Theorem 9.1.
b. Using Theorem 9.1, find the first two Gâteaux differentials of T(F).
c. Determine whether Corollary 9.1 applies to this functional and derive the asymptotic normality of T(F̂_n) if it does.

4. Consider the kth moment functional given by

T(F) = ∫_{−∞}^{∞} t^k dF(t),

where F ∈ F, a collection of distribution functions where µ′_k < ∞.

a. Using the method demonstrated in Example 9.8, write this functional in a form suitable for the application of Theorem 9.1.
b. Using Theorem 9.1, find the first two Gâteaux differentials of T(F).
c. Determine whether Corollary 9.1 applies to this functional and derive the asymptotic normality of T(F̂_n) if it does.

5. Consider the kth central moment functional given by

T(F) = ∫_{−∞}^{∞} [t − ∫_{−∞}^{∞} t dF(t)]^k dF(t),

where F ∈ F, a collection of distribution functions where µ_k < ∞.

a. Using the method demonstrated in Example 9.8, write this functional in a form suitable for the application of Theorem 9.1.
b. Using Theorem 9.1, find the first two Gâteaux differentials of T(F).
c. Determine whether Corollary 9.1 applies to this functional and derive the asymptotic normality of T(F̂_n) if it does.

6. Let F be a symmetric and continuous distribution function and consider the functional parameter θ that corresponds to the expectation

θ = (1 − 2α)^{-1} E(Xδ{X; [ξ_α, ξ_{1−α}]}),

where α ∈ [0, 1/2) is a fixed constant, ξ_α = F^{-1}(α), and ξ_{1−α} = F^{-1}(1 − α).

a. Prove that this functional parameter can be written as

T(F) = (1 − 2α)^{-1} ∫_{α}^{1−α} ξ_u du.

b. Let G denote a degenerate distribution at a real constant δ. Prove that

∆_1 T(F; G − F) = (1 − 2α)^{-1} ∫_{α}^{1−α} [u − δ{F(δ); (−∞, u]}] / f(ξ_u) du.

c. Prove that the differential given above can be written as

∆_1 T(F; G − F) = (1 − 2α)^{-1}[F^{-1}(α) − ξ_{1/2}] for δ ∈ (−∞, ξ_α),
∆_1 T(F; G − F) = (1 − 2α)^{-1}(δ − ξ_{1/2}) for δ ∈ [ξ_α, ξ_{1−α}], and
∆_1 T(F; G − F) = (1 − 2α)^{-1}[F^{-1}(1 − α) − ξ_{1/2}] for δ ∈ (ξ_{1−α}, ∞).

7. Let X be a random variable following a distribution F(x|θ) where θ ∈ Ω. Assume that F(x|θ) has a continuous density f(x|θ). The maximum likelihood estimator of θ is given by the value of θ that maximizes f(x|θ) with respect to θ. Assuming that f(x|θ) is maximized at a unique interior point of Ω, argue that the functional corresponding to the maximum likelihood estimator can be written implicitly as

∫_{−∞}^{∞} [f′(x|θ)/f(x|θ)] dF(x|θ) = 0,

where the derivative in the integral is taken with respect to θ. Discuss any additional assumptions that you need to make.
8. Let G ∈ F and let F be a fixed distribution from F. Consider a functional of the form

I(G) = ∫_{−∞}^{∞} · · · ∫_{−∞}^{∞} t(x_1, . . . , x_r) Π_{k=1}^{r} d[G(x_k) − F(x_k)],

where there exists a function t̃(x_1, . . . , x_r|F) such that

I(G) = ∫_{−∞}^{∞} · · · ∫_{−∞}^{∞} t̃(x_1, . . . , x_r|F) Π_{k=1}^{r} dG(x_k),

for all G ∈ F, as given by Theorem 9.2. Prove that, for the functions t̃(x_1, . . . , x_r|F) specified in the proof of Theorem 9.2,

∫_{−∞}^{∞} t̃(x_{i_1}, . . . , x_{i_r}|F) dF(x_{i_k}) = 0.

9. Consider a functional of the form

T(F) = ∫_{−∞}^{∞} · · · ∫_{−∞}^{∞} h(x_1, . . . , x_r) Π_{i=1}^{r} dF(x_i),

where F ∈ F, a collection of distribution functions. Prove that if k ≤ r then

∆_k T(F; G − F) = [Π_{i=1}^{k} (r − i + 1)] ∫_{−∞}^{∞} · · · ∫_{−∞}^{∞} h(x_1, . . . , x_k, y_1, . . . , y_{r−k}) [Π_{i=1}^{r−k} dF(y_i)] {Π_{i=1}^{k} d[G(x_i) − F(x_i)]},

and if k > r then ∆_k T(F; G − F) = 0. This result has been proven for r = 1 and r = 2 in Section 9.3. Establish this result for a general value of r.
10. Let t(x_1, . . . , x_r) be a real valued function and define

I(G) = ∫_{−∞}^{∞} · · · ∫_{−∞}^{∞} t(x_1, . . . , x_r) Π_{i=1}^{r} d[G(x_i) − F(x_i)],

where F and G are members of a collection of distribution functions F and F is fixed. Prove that there exists a function t̃(x_1, . . . , x_r|F) such that

I(G) = ∫_{−∞}^{∞} · · · ∫_{−∞}^{∞} t̃(x_1, . . . , x_r|F) dG(x_1) · · · dG(x_r).

Use a constructive proof based on the function

t̃(x_1, . . . , x_r|F) = t(x_1, . . . , x_r) − Σ_{i=1}^{r} ∫_{−∞}^{∞} t(x_1, . . . , x_r) dF(x_i) + Σ_{i=1}^{r} Σ_{j=i+1}^{r} ∫_{−∞}^{∞} ∫_{−∞}^{∞} t(x_1, . . . , x_r) dF(x_i)dF(x_j) − · · · + (−1)^r ∫_{−∞}^{∞} · · · ∫_{−∞}^{∞} t(x_1, . . . , x_r) dF(x_1) · · · dF(x_r).

9.6.2 Experiments

1. Write a program in R that simulates 100 samples of size n from the distributions that are specified below. Let T(F) be the variance functional described in Example 9.8. For each sample compute T(F̂_n) − T(F) and ∆_1 T(F, F̂_n − F), where the differential was found in Example 9.8. For each sample size and distribution, construct a scatterplot of the 100 values of T(F̂_n) − T(F) and ∆_1 T(F, F̂_n − F). Describe the behavior observed in the plots in relation to the expansion given in Equation (9.1). Use n = 5, 10, 25, 50, and 100.

a. N(0, 1)
b. Exponential(1)
c. Uniform(0, 1)
d. Cauchy(0, 1)

2. Write a program in R that simulates 100 samples of size n from distributions


that are specified below. Let T (F ) correspond to the quantile functional
so that T (F ) = F −1 (α) where α ∈ (0, 1) is a specified constant. For each
sample compute θ̂n = T (F̂ ) where F̂ corresponds to the empirical distri-
bution function, and again where F̂ corresponds to a Normal distribution
with mean equal to the sample mean, and variance equal to the sample
variance. Construct a scatterplot of the pairs of estimated quantiles and
overlay a line on the plot that corresponds to the true value of the popu-
lation quantile in each case. Repeat these calculations for α = 0.05, 0.10,
0.25, 0.50, 0.75, 0.90, and 0.95, and for n = 10, 25, 50 and 100. Describe
the behavior found in each of these plots and discuss how the assump-
tion of Normality affects the performance of the estimator based on the
Normal distribution.
382 DIFFERENTIABLE STATISTICAL FUNCTIONALS
a. N(0, 1)
b. Exponential(1)
c. Uniform(0, 1)
d. Cauchy(0, 1)
CHAPTER 10

Parametric Inference

But sitting in front of him and taken by surprise by his dismissal, K. would be
able easily to infer everything he wanted from the lawyer’s face and behaviour,
even if he could not be induced to say very much.
The Trial by Franz Kafka

10.1 Introduction

Classical statistical inference is usually concerned with estimation and hy-


pothesis testing for a unknown parameter θ within a parametric framework.
This means that we will assume that the unknown population follows a known
parametric family and that only the parameter is unknown. Furthermore, the
parameter space is assumed to have a finite dimension. In this chapter we will
be interested in developing asymptotic methods for studying how these meth-
ods perform. For point estimation, asymptotic methods will be developed to
obtain approximate expressions for the bias and variance of estimators that
are functions of the sample mean. We will also be interested in establishing
if asymptotically unbiased estimators have a variance that approaches opti-
mality as the sample size increases to infinity. The optimality properties of
maximum likelihood estimators will also be studied. In the case of confidence
intervals we will consider how the confidence coefficient behaves as n → ∞. In
statistical hypothesis testing we will use asymptotic comparisons to compare
the power functions of tests. We will also consider the asymptotic properties
of observed confidence levels, a method for solving multiple testing problems.
We will conclude the chapter with a look at conditions under which Bayes
estimators are asymptotically optimal within the frequentist framework.

10.2 Point Estimation

Let X1 , . . . , Xn be a set of real valued independent and identically distributed


random variables from a distribution F (x|θ) where θ is an unknown parameter
with parameter space Ω. In point estimation we are concerned with estimating
θ based on X1 , . . . , Xn . That is, we would like to find a plausible value for θ
based on an observed sample from F (x|θ).

383
384 PARAMETRIC INFERENCE
Definition 10.1. Any function T that maps a sample X1 , . . . , Xn to a pa-
rameter space Ω is a point estimator of θ.
We will usually denote an estimator of a parameter θ as θ̂n = T (X1 , . . . , Xn ),
or simply as θ̂n . Note that there is nothing special about a point estimator, it
is simply a function of the observed data that does not depend on θ, or any
other unknown quantities. The search for a good point estimator requires us
to define the types of properties that a good estimator should have. Usually
we would like our estimator to be close to θ in some respect. Let ρ be a metric
on Ω. Then we can measure the distance between θ̂n and θ as ρ(θ̂n , θ). But
ρ(θ̂n , θ) is a random variable and hence we need some way of summarizing the
behavior of ρ(θ̂n , θ). This is usually accomplished by taking the expectation
of ρ(θ̂n , θ). In decision theory the distance ρ(θ̂n , θ) is usually called the loss
associated with estimating θ with θ̂n and the function ρ is called the loss
function. The expected loss, given by R(θ̂, θ) = E[ρ(θ̂n , θ)] is called the risk
associated with estimating θ with the estimator θ̂n .
A common loss function that is often used in practice is ρ(θ̂, θ) = (θ̂n − θ)2 ,
which is called squared error loss. The associated risk, given by MSE(θ̂n ) =
E[(θ̂n − θ)2 ] is called the mean squared error, which measures the expected
square distance between θ̂n and θ. It can be shown that MSE(θ̂n ) can be
decomposed into two parts given by MSE(θ̂n ) = Bias2 (θ̂n ) + V (θ̂n ), where
Bias(θ̂n ) = E(θ̂n ) − θ is called the bias of the estimator θ̂n . See Exercise 1.
The bias of an estimator θ̂n measures the expected systematic departure of
θ̂n from θ. A special case occurs when the bias always equals zero.
Definition 10.2. An estimator θ̂ of θ is unbiased if Bias(θ̂n ) = 0 for all
θ ∈ Ω.
If an estimator is unbiased then the mean squared error and the variance
of the estimator coincide. In this case the variance of the estimator can be
used alone as a measure of the quality of the estimator. Usually the standard
deviation of the estimator, called the standard error of θ̂n is often reported as
a measure of the quality of the estimator, since it is in the same units as the
parameter space, whereas the variance is in square units.
An important special case in estimation theory is the case of estimating the
mean of a population that has a finite variance. Let {Xn }∞ n=1 be a sequence of
independent and identically distributed random variables from a distribution
F with mean θ and finite variance σ 2 . It is well known that if we estimate
θ with the sample mean X̄n then Bias(X̄n ) = 0 and V (X̄n ) = n−1 σ 2 for all
θ ∈ Ω. Suppose now that we are interested in estimating g(θ) for some real
function g. An obvious estimator of g(θ) is g(X̄n ). If g is a linear function of the
form g(x) = a+bx then the bias and variance of g(X̄n ) can be found by directly
using the properties of X̄n . That is E[g(X̄n )] = a + bE(X̄n ) = a + bθ = g(θ)
so that g(X̄n ) is an unbiased estimator of g(θ). The variance of g(X̄n ) can be
found to be V [g(X̄n )] = b2 V (X̄n ) = n−1 b2 σ 2 .
POINT ESTIMATION 385
If g is not a linear function then the bias and variance of g(X̄n ) cannot be
found using only the properties of X̄n . In this case more information about
the population F must be known.
Example 10.1. Let {Xn }∞ n=1 be a sequence of independent and identically
distributed random variables from a distribution F with mean θ and finite
variance σ 2 . Suppose now that we are interested in estimating g(θ) = θ2 using
the estimator g(X̄n ) = X̄n2 . We first find the bias for this estimator. Taking
the expectation we have that
 
X n X n
E(X̄n2 ) = E n−2 Xi Xj 
i=1 j=1
n
X n X
X n
= n−2 E(Xi2 ) + n−2 E(Xi )E(Xj )
i=1 i=1 j=1
j6=i

= n−1 θ2 + n−1 σ 2 + n−1 (n − 1)θ2


= θ2 + n−1 σ 2 .

Hence the bias is given by Bias[g(X̄n )] = −n−1 σ 2 = O(n−1 ), as n → ∞. To


find the variance we have that E[(X̄n2 )2 ] = E(X̄n4 ). Therefore we must find an
expression for E(X̄n4 ). Direct calculations yield
 
n X
X n X
n X
n
E(X̄n4 ) = E n−4 Xi Xj Xk Xl  =
i=1 j=1 k=1 l=1
n
X n X
X n n X
X n
n−4 E(Xi4 ) + 4n−4 E(Xi3 )E(Xj ) + 3n−4 E(Xi2 )E(Xj2 )+
i=1 i=1 j=1 i=1 j=1
j6=i j6=i
XXn n Xn
6n−4 E(Xi2 )E(Xj )X(Xk )+
i=1 j=1 k=1
j6=i k6=j
k6=i
n n
XXXX n n
n−4 E(Xi )E(Xj )E(Xk )E(Xl ) =
i=1 j=1 k=1 l=1
j6=i k6=j l6=k
k6=i l6=j
l6=i

n−3 µ04 + 4n−3 (n − 1)θµ03 + 3n−3 (n − 1)(µ02 )2 +


6n−3 (n − 1)(n − 2)θ2 µ02 + n−3 (n − 1)(n − 2)(n − 3)θ4 .

A great deal of algebraic perseverance could be used on these expressions to


obtain an exact, but quite complicated, expression for the variance. To simplify
matters, we will only keep terms that are larger than O(n−2 ) as n → ∞ for
386 PARAMETRIC INFERENCE
this analysis. Therefore,

E(X̄n4 ) = 6n−3 (n − 1)(n − 2)θ2 (θ2 + σ 2 )+


n−3 (n − 1)(n − 2)(n − 3)θ4 + O(n−2 ), (10.1)
as n → ∞. However, this expression does not eliminate all of the terms that
are O(n−2 ) or smaller as n → ∞. Expanding the first term on the right hand
side of Equation (10.1) yields

6n−3 (n − 1)(n − 2)θ2 (θ2 + σ 2 ) = 6n−3 (n2 − 3n + 2)θ2 (θ2 + σ 2 ) =


6n−1 θ2 (θ2 + σ 2 ) + O(n−2 ),
as n → ∞. Similarly, expanding the second term on the right hand side of
Equation (10.1) yields
n−3 (n−1)(n−2)(n−3)θ4 = n−3 (n3 −6n2 +11n−6)θ4 = θ4 −6n−1 θ4 +O(n−2 ),
as n → ∞. Therefore, it follows that
E[X̄n4 ] = 6n−1 θ2 σ 2 + θ4 + O(n−2 ),
as n → ∞. To complete finding the variance we note that
E 2 (X̄n2 ) = (θ2 + n−1 σ 2 )2 = θ4 + 2θ2 n−1 σ 2 + O(n−2 ),
as n → ∞. This yields a variance of the form V (X̄n2 ) = 4n−1 θ2 σ 2 + O(n−2 ),
as n → ∞. 

Example 10.1 highlights the increased complexity one finds when working with
estimating a non-linear function of θ. The fact that we are able to compute
the bias and variance in the closed forms indicated are a result of the function
being a sum of powers of X̄n . If we alternatively considered functions such
as sin(X̄n ) or exp(−X̄n2 ) such a direct approach would no longer be possible.
However, an approximate approach can be developed for certain functions by
approximating the function of interest with a linear expression obtained using
a Taylor expansion from Theorem 1.13.
Example 10.2. Let {Xn }∞ n=1 be a sequence of independent and identically
distributed random variables from a distribution F with mean θ and finite
variance σ 2 . Suppose now that we are interested in estimating g(θ) = exp(−θ2 )
using the estimator g(X̄n ) = exp(−X̄n2 ). Without further specific information
of the form of F , simple closed form expressions for the mean and variance of
exp(X̄n2 ) are difficult to obtain. However, note that Theorem 1.13 implies that
we can find a Taylor expansion for the exponential function at −X̄n2 around
the point −θ2 . That is,
exp(−X̄n2 ) = exp(−θ2 ) − 2(X̄n − θ)θ exp(−θ2 ) + E2 (X̄n , θ),
where E2 (X̄n , θ) = (X̄n − θ)2 (2ξ 2 − 1) exp(−ξ 2 ), where ξ is a random variable
that is always between X̄n and θ with probability one. Taking expectations,
POINT ESTIMATION 387
we find that
E[exp(−X̄n2 )] = exp(−θ2 ) − 2θE(X̄n − θ) exp(−θ2 ) + E[E2 (X̄n , θ)]
= exp(−θ2 ) + E[(X̄n − θ)2 (2ξ 2 − 1) exp(−ξ 2 )],
since E(X̄n − θ) = 0. Therefore, the bias of exp(−X̄n2 ) as an estimator of
exp(−θ2 ) is given by E[(X̄n − θ)2 (2ξ 2 − 1) exp(−ξ 2 )]. The expectation of the
error can be troublesome to compute due to the random variable ξ unless
we happen to note in this case that the function (2ξ 2 − 1) exp(−ξ 2 ) is a
bounded function so that there exists a finite real value m such that |(2ξ 2 −
1) exp(−ξ 2 )| < m for all ξ ∈ R. Therefore, it follows that

|E[(X̄n − θ)2 (2ξ 2 − 1) exp(−ξ 2 )]| ≤


E[|(X̄n − θ)2 (2ξ 2 − 1) exp(−ξ 2 )|] ≤ mE[(X̄n − θ)2 ] = n−1 mσ 2 ,
and hence Definition 1.7 implies that E[E2 (X̄n , θ)] = O(n−1 ), as n → ∞.
Therefore, we have proven that Bias[exp(−X̄n )2 ] = O(n−1 ), as n → ∞. To
find the approximate variance we note [exp(−X̄n2 )]2 = exp(−2X̄n2 ), and use a
Taylor expansion for this function to find
exp(−2X̄n2 ) = exp(−2θ2 ) − 4θ(X̄n − θ) exp(−2θ2 ) + Ẽ2 (X̄n , θ), (10.2)
where Ẽ2 (X̄n , θ) = 2(X̄n − θ)2 (4ξ 2 − 1) exp(−2ξ 2 ) for some random variable
ξ that is between θ and X̄n with probability one. Taking the expectation of
both sides of Equation (10.2), we note that the second term is zero since
E(X̄n − θ) = 0, and therefore we need only find the rate of convergence for
the error term. Using the same reasoning as above, we note that the function
(4ξ 2 − 1) exp(−2ξ 2 ) is bounded for all ξ ∈ R and therefore it follows that
E[2(X̄n − θ)2 (4ξ 2 − 1) exp(−2ξ 2 )] ≤ 2mE[(X̄n − θ)2 ] = 2mn−1 σ 2 ,

for some real valued bound m. Hence, it follows that


E[exp(−2X̄n2 )] = exp(−2θ2 ) + O(n−1 ), (10.3)
as n → ∞. To obtain an expression for the variance we also need to find an
expansion for the square expectation of exp(−X̄n2 ). From the previous result,
we note that
E 2 [exp(−X̄n2 )] = [exp(−θ2 ) + O(n−1 )]2 = exp(−2θ2 ) + O(n−1 ), (10.4)
as n → ∞. Combining the results of Equations (10.3) and (10.4) implies that
V [exp(−X̄n2 )] = O(n−1 ) as n → ∞. 
The general methodology of Example 10.2, based on taking expectations of
Taylor expansions, can be generalized to a wider class of functions. However, it
is worth noting that the key property of this development is the boundedness
of certain derivatives of the function of X̄n . In this context, assumptions of this
form cannot be relaxed without additional assumptions on the distribution F .
Theorem 10.1. Let {Xn }∞ n=1 be a sequence of independent and identically
distributed random variables from a distribution F with mean θ and variance
388 PARAMETRIC INFERENCE
σ 2 . Suppose that f has a finite fourth moment and let g be a function that has
at least four derivatives.

1. If the fourth derivative of g is bounded, then


E[g(X̄n )] = g(θ) + 12 n−1 σ 2 g 00 (θ) + O(n−2 ),
as n → ∞.
2. If the fourth derivative of g 2 is also bounded, then
V [g(X̄n )] = n−1 σ 2 [g 0 (θ)]2 + O(n−2 ),
as n → ∞.

Proof. We prove Part 1, leaving the proof of Part 2 as Exercise 2. We begin


by noting that Theorem 1.13 (Taylor) implies that

g(X̄n ) = g(θ) + (X̄n − θ)g 0 (θ)+


1
2 (X̄n − θ)2 g 00 (θ) + 16 (X̄n − θ)3 g 000 (θ) + 1
24 (X̄n − θ)4 g 0000 (ξ), (10.5)
for some ξ that is between θ and X̄n with probability one. Taking the expec-
tation of both sides of Equation (10.5) yields

E[g(X̄n )] = g(θ)+
1 −1 2 00
2n σ g (θ) + 16 E[(X̄n − θ)3 ]g 000 (θ) + 1
24 E[(X̄n − θ)4 g 0000 (ξ)], (10.6)
since E(X̄n − θ) = 0 and E[(X̄n − θ)2 ] = n−1 σ 2 . Note that we cannot factor
g 0000 (ξ) out of the expectation in Equation (10.6) because ξ is a random variable
due to the fact that ξ is always between θ and X̄n . For the third term in
Equation (10.6), we note that
" #3 
 X n 
E[(X̄n − θ)3 ] = n−3 E (Xi − θ)
 
i=1
n
X n X
X n
= n−3 E[(Xi − θ)3 ] + 3n−3 E[(Xi − θ)2 (Xj − θ)]
i=1 i=1 j=1
j6=i
n X
X n X
n
+n−3 E[(Xi − θ)(Xj − θ)(Xk − θ)]. (10.7)
i=1 j=1 k=1
j6=i k6=i
k6=j

The second and third term on the right hand side of Equation (10.7) are zero
due to independence and the fact that E(Xi − θ) = 0. Therefore, it follows
that E[(X̄n − θ)3 ] = n−2 E[(Xi − θ)3 ] = O(n−2 ), as n → ∞. By assumption,
there exists a bound m ∈ R such that g 0000 (t) ≤ m for all t ∈ R. Therefore,
it follows that g 0000 (ξ)(X̄n − θ)4 ≤ m(X̄n − θ)4 , with probability one. Hence
1 0000 1
Theorem A.16 implies that E[ 24 g (ξ)(X̄n − θ)4 ] ≤ 24 mE[(X̄n − θ)4 ] < ∞,
POINT ESTIMATION 389
since we have assumed that the fourth moment is finite. To obtain the rate of
convergence for the error term we note that
 " n #4 
 X 
E[(X̄n − θ)4 ] = E n−4 (Xi − θ)
 
i=1
n
X n X
X n
= n−4 E[(Xi − θ)4 ] + 4n−4 E[(Xi − θ)3 (Xj − θ)]
i=1 i=1 j=1
j6=i
n X
X n
+3n−4 E[(Xi − θ)2 (Xj − θ)2 ]
i=1 j=1
j6=i
n
XX n Xn
+6n−4 E[(Xi − θ)2 (Xj − θ)(Xk − θ)]
i=1 j=1 k=1
j6=i k6=i
k6=j
n
XXXXn n
+n−4 E[(Xi − θ)(Xj − θ)(Xk − θ)(Xl − θ)].
i=1 j=1 k=1 l=1
j6=i k6=i l6=i
k6=j l6=j
l6=k

Using similar arguments to those used for the third moment we find that
E[(X̄n − θ)4 ] = n−3 E[(Xn − θ)4 ] + 3n−3 (n − 1)E[(Xn − θ)2 (Xm − θ)2 ] so that
it follows that E[(X̄n − θ)4 ] = O(n−2 ) as n → ∞. Therefore, it follows that
E[g(X̄n )] = g(θ) + 21 n−1 σ 2 g 00 (θ) + O(n−2 ),
as n → ∞.
Example 10.3. Let X1 , . . . , Xn be a set of independent and identically dis-
tributed random variables from a distribution F with mean θ and finite vari-
ance σ 2 . In Example 10.1 we considered estimating θ2 using the estimator X̄n2 .
Taking g(t) = t2 we have that g 0 (t) = 2t, g 00 (t) = 2, and that g k (t) = 0 for
all k ≥ 3. Therefore, we can apply Theorem 10.1, assuming that the fourth
moment of F is finite, to find E(X̄n2 ) = θ2 + n−1 σ 2 + O(n−2 ), as n → ∞. This
compares exactly with the result of Example 10.1 with the exception of the
error term. The error term in this case is identically zero in this case because
the derivatives of order higher than two are zero for the function g(t) = t2 .
For the variance, we consider derivatives of the function h(t) = g 2 (t) = t4 ,
where we have that h0 (t) = 4t3 , h00 (t) = 12t2 , h000 (t) = 24t and h0000 (t) = 24.
The fourth derivative of h is bounded and therefore we can apply Theorem
10.1 to find that V (X̄n ) = 4n−1 θ2 σ 2 + O(n−2 ), as n → ∞, which matches
the variance expansion found in Example 10.1. The error term in this case
is not identically zero as we encountered non-zero terms of order O(n−2 ) in
Example 10.1. This is also indicated by the fact that the fourth derivative of
h is not zero. 
390 PARAMETRIC INFERENCE
Example 10.4. Consider the framework presented in Example 10.2 where
{Xn }∞n=1 is a sequence of independent and identically distributed random vari-
ables from a distribution F with mean θ and finite variance σ 2 . We are inter-
ested in estimating g(θ) = exp(−θ2 ) using the estimator g(X̄n ) = exp(−X̄n2 ).
If we assume that F has a finite fourth moment then the assumptions of
Theorem 10.1 hold and we have that
E[exp(−X̄n2 )] = exp(−θ2 ) − n−1 σ 2 (1 − 2θ2 ) exp(−θ2 ) + O(n−2 ),
and
V [exp(−X̄n2 )] = 4n−1 σ 2 θ2 exp(−2θ2 ) + O(n−2 ),
as n → ∞. We can compare this result to the result supplied by Theorem 6.3,
d
which implies that n1/2 (2θσ)−1 exp(θ2 )[exp(−X̄n2 ) − exp(−θ2 )] −
→ Z, as n →
∞ where Z is a N(0, 1) random variable. Note that the asymptotic variance
given by Theorem 6.3 matches the first term of the asymptotic expansion for
the variance given in Theorem 10.1. 

While the result of Theorem 10.1 is general in the sense that the distribution
F need not be specified, the assumption on the boundedness of the fourth
derivative of F will often be violated in practice. This does not mean that
no asymptotic result of this kind can be obtained, but that such results will
probably rely on methods more specific to the problem considered.
Example 10.5. Let {Xn }∞ n=1 be a sequence of independent and identically
distributed random variables following a N(θ, σ 2 ) distribution and consider
estimating exp(θ) with exp(X̄n ). We are not able to apply Theorem 10.1
directly to this case because the fourth derivative of exp(t), which is still
exp(t), is not a bounded function. If we attempt to apply Theorem 1.13 to
1
this problem we will end up with an error term of the form 24 (X̄n −θ)2 exp(ξ),
where ξ is a random variable that is always between θ and X̄n . This type of
error term is difficult to deal with in this case because the exponential function
is not bounded and hence we cannot directly bound the expectation as we did
in Example 10.2 and the proof of Theorem 10.1. Instead we follow the approach
of Lehmann (1999) and use the convergent Taylor series for the exponential
function instead. That is,

X exp(θ)(X̄n − θ)i
exp(X̄n ) = . (10.8)
i=0
i!
We will assume that it is permissible in this case to exchange the expectation
and the infinite sum, so that taking the expectation of both sides of Equation
(10.8) yields
∞ ∞
X exp(θ)E[(X̄n − θ)i ] X E[(X̄n − θ)2i ]
E[exp(X̄n )] = = exp(θ) . (10.9)
i=0
i! i=0
(2i)!

The second equality in Equation (10.9) is due to the fact that X̄n − θ has a
N(0, n−1 σ 2 ) distribution, whose odd moments are zero. Hence it follows that
POINT ESTIMATION 391
we must evaluate
 n 1/2 Z ∞
2i
E[(X̄n − θ) ] = t2i exp(− 21 nσ −2 t2 )dt.
2πσ 2 −∞

To evaluate this integral consider the change of variable u = 21 nσ −2 t2 so that


du = nσ −2 tdt, where one must be careful to note that the transformation in
the change of variable is not one-to-one. It follows that
Z ∞
E[(X̄n − θ)2i ] = π −1/2 σ 2i n−i 2i ui−1/2 exp(−u)du
0
−1/2 2i −i i
= π σ n 2 Γ(i + 12 ).

Now, noting that Γ( 12 ) = π 1/2 we have that

Γ(i + 12 ) = (i − 21 )(i − 23 ) · · · 32 12 Γ( 21 )
= 2−i (2i − 1)(2i − 3) · · · (3)(1)π 1/2
(2i)!π 1/2
=
2i (2i)(2i − 1) · · · (4)(2)
= (2i)!π 1/2 2−2i (i!)−1 .

Therefore, it follows that


σ 2i (2i)!
E[(X̄n − θ)2i ] = . (10.10)
ni 2i i!
Putting the result of Equation (10.10) into the sum in Equation (10.9) yields

X E[(X̄n − θ)2i ]
E[exp(X̄n )] = exp(θ)
i=0
(2i)!

X σ 2i (2i)!
= exp(θ)
i=0
ni 2i i!(2i)!

X σ 2i
= exp(θ)
i=0
ni 2i i!
∞  i
X 1 σ2
= exp(θ)
i=0
i! 2n
= exp(θ) exp( 12 n−1 σ 2 ).

Now, Theorem 1.13 (Taylor) implies that

exp( 12 n−1 σ 2 ) = exp(0) + 12 n−1 σ 2 exp(0) + O(n−2 ) =


1 + 12 n−1 σ 2 + O(n−2 ),
392 PARAMETRIC INFERENCE
as n → ∞. Therefore, it follows that

E[exp(X̄n )] = exp(θ)(1 + 12 n−1 σ 2 ) + O(n−2 ) =


exp(θ) + 12 n−1 σ 2 exp(θ) + O(n−2 ),
as n → ∞. Hence, it follows that the bias is O(n−1 ) as n → ∞. The variance
can be found using a similar argument. See Exercise 7. 

It is noteworthy that not all functions of the sample mean have nice proper-
ties, even when the population is normal. For example, when X1 , . . . , Xn are
independent and identically distributed N(θ, 1) random variables, n1/2 (X̄n−1 −
d
θ−1 ) −
→ Z as n → ∞ where Z is a N(0, θ−4 ) random variable, but E(X̄n−1 )
does not exist for any n ∈ N. See Example 4.2.4 of Lehmann (1999) and
Lehmann and Shaffer (1988) for further details.
If we consider the case where θ̂n is an estimator of a parameter θ with an
expansion for its expectation of the form E(θ̂n ) = θ + n−1 b + O(n−2 ) as
n → ∞ where b is a real constant, then it follows that the bias of θ̂n is
O(n−1 ), and hence the square bias of θ̂n is O(n−2 ), as n → ∞. If the variance
of θ̂n has an expansion of the form V (θ̂n ) = n−1 v + O(n−2 ) as n → ∞ where
v is a real constant, then it follows that the mean squared error of θ̂n has the
expansion MSE(θ̂n ) = n−1 v + O(n−2 ) as n → ∞. Therefore, it is the constant
v that is the important factor in determining the asymptotic performance of
θ̂n , under these assumptions on the form of the bias and variance.
Now suppose that θ̃n is another estimator of θ, and that θ̃n has similar prop-
erties to θ̂n in the sense that the mean squared error for θ̃n has asymptotic
expansion MSE(θ̃n ) = n−1 w + O(n−2 ) as n → ∞, where w is a real con-
stant. If we wish to compare the performance of these two estimators from an
asymptotic viewpoint it follows that we need only compare the constants v
and w.
Definition 10.3. Let θ̂n and θ̃n be two estimators of a parameter θ such
that MSE(θ̂n ) = n−1 v + O(n−2 ) and MSE(θ̃n ) = n−1 w + O(n−2 ) as n → ∞
where v and w are real constants. Then the asymptotic relative efficiency of
θ̂n relative to θ̃n is given by ARE(θ̂n , θ̃n ) = wv −1 .

The original motivation for the form of the asymptotic relative efficiency comes
from comparing the sample sizes required for each estimator to have the same
mean squared error. From the asymptotic viewpoint we would require sample
sizes n and m so that n−1 v = m−1 w, with the asymptotic relative efficiency
of θ̂n relative to θ̃n is given by ARE(θ̂n , θ̃n ) = mn−1 . However, note that
if n−1 v = m−1 w then it follows that wv −1 = mn−1 , yielding the form of
Definition 10.3.
Example 10.6. Let {Xn }∞ n=1 be a sequence of independent and identically
distributed random variables from a distribution F with finite variance σ 2
and continuous density f . We will assume that F has a unique median θ
POINT ESTIMATION 393
such that f (θ) > 0. In the case where f is symmetric about θ we have two
immediate possible choices for estimating θ given by the sample mean θ̂n =
X̄n and the sample median θ̃n . It is known that V (θ̂n ) = n−1 σ 2 , and it
follows from Corollary 4.4 that the leading term for the variance of θ̃n is
−2
1
4 [f (θ)] . Therefore, Definition 10.3 implies that ARE(θ̂n , θ̃n ) = 41 [σf (θ)]−2 .
If F corresponds to N(θ, σ 2 ) random variable then f (θ) = (2πσ 2 )−1/2 and
we have that ARE(θ̂n , θ̃n ) = 12 π ' 1.5708, which indicates that the sample
mean is about one and one half times more efficient than the sample median.
If F corresponds to a T(ν) distribution where ν > 2 we have that θ = 0,
σ 2 = v/(v − 2), and
Γ( ν+1
2 )
f (θ) = .
(πv)1/2 Γ( ν2 )
Therefore, it follows that
π(ν − 2)Γ2 ( ν2 )
ARE(θ̂n , θ̃n ) = .
4Γ2 ( ν+1
2 )

The values of ARE(θ̂n , θ̃n ) for ν = 3, 4, 5, 10, 25, 50, and 100 are given in Table
10.1. From Table 10.1 one can observe that for heavy tailed T(ν) distributions,
the sample median is more efficient than the sample mean. In these cases the
variance of the sample mean is increased by the high likelihood of observing
outliers in samples. However, as the degrees of freedom increase, the median
becomes less efficient so that when ν = 5 the sample median and the sample
mean are almost equally efficient from an asymptotic viewpoint. From this
point on, the sample mean becomes more efficient. In the limit we find that
ARE(θ̂n , θ̃n ) approaches the value associated with the normal distribution
given by 21 π. 
Example 10.7. Let B1 , . . . , Bn be a set of independent and identically dis-
tributed Bernoulli(θ) random variables, where the success probability θ is
the parameter of interest. We will also assume that the parameter space is
Ω = (0, 1). The usual unbiased estimator of θ is the sample mean θ̂n = B̄n ,
which corresponds to the proportion of successes observed in the sample. The
properties of the sample mean imply that this estimator is unbiased with
variance n−1 θ(1 − θ). When θ is small the estimator θ̂n is often considered
unsatisfactory because θ̂n will be equal to zero with a large probability. For
example, calculations based on the Binomial(n, θ) distribution can be used
to show that if n = 100 and θ = 0.001 then P (θ̂n = 0) = 0.9048. Since zero
is not in the parameter space of θ, this may not be considered a reasonable
estimator of θ. An alternative approach to estimating θ in this case is based
on adding in one success and one failure to the observed sample. That is, we
consider the alternative estimator
n
!
X
−1
θ̃n = (n + 2) Bi + 1 = (1 + 2n−1 )−1 (B̄n + n−1 ).
i=1
394 PARAMETRIC INFERENCE

Table 10.1 The asymptotic relative efficiency of the sample mean relative to the
sample median when the population follows a T(ν) distribution.

ν 3 4 5 10 25 50 100
ARE(θ̂n , θ̃n ) 0.617 0.889 1.041 1.321 1.4743 1.5231 1.5471

This estimator will never equal zero, but is not unbiased as


E(θ̃n ) = (1 + 2n−1 )−1 (θ + n−1 ) = θ + O(n−1 ),
as n → ∞, so that the bias of θ̃n is O(n−1 ) as n → ∞. Focusing on the
variance we have that
V (θ̃n ) = (1 + 2n−1 )−2 [n−1 θ(1 − θ)] = (n + 2)−2 nθ(1 − θ).
Hence, the efficiency of θ̂n , relative to θ̃n is given by
nθ(1 − θ) n n2
ARE(θ̂n , θ̃n ) = lim = lim = 1.
n→∞ (n + 2)2 θ(1 − θ) n→∞ (n + 2)2

Therefore, from an asymptotic viewpoint, the estimators have the same effi-
ciency. 

Asymptotic optimality refers to the condition that an estimator achieves the


best possible performance as n → ∞. For unbiased estimators, optimality is
defined in terms of the Cramér–Rao bound.
Theorem 10.2 (Cramér and Rao). Let X1 , . . . , Xn be a set of independent
and identically distributed random variables from a distribution F (x|θ) with
parameter θ, parameter space Ω, and density f (x|θ). Let θ̂n be any estimator
of θ computed on X1 , . . . , Xn such that V (θ̂n ) < ∞. Assume that the following
regularity conditions hold.

1. The parameter space Ω is an open interval which can be finite, semi-infinite,


or infinite.
2. The set {x : f (x|θ) > 0} does not depend on θ.
3. For any x ∈ A and θ ∈ Ω the derivative of f (x|θ) with respect to θ exists
and is finite.
4. The first two derivatives of
Z ∞
f (x|θ)dx,
−∞

with respect to θ can be obtained by exchanging the derivative and the in-
tegral.
5. The first two derivatives of log[f (x|θ)] with respect to θ exist for all x ∈ R
and θ ∈ Ω.
POINT ESTIMATION 395
Then V (θ̂n ) ≥ [nI(θ)]−1 where
 

I(θ) = V log[f (Xn |θ)] .
∂θ

The development of this result can be found in Section 2.6 of Lehmann and
Casella (1998). The value I(θ) is called the Fisher information number. Noting
that
∂ f 0 (x|θ)
log[f (x|θ)] = ,
∂θ f (x|θ)
it then follows that the random variable within the expectation measures the
relative rate of change of f (x|θ) with respect to changes in θ. If this rate of
change is large, then samples with various values of θ will be easily distin-
guished from one another and θ will be easier to estimate. In this case the
bound on the variance will be small. If this rate of change is small then the
parameter is more difficult to estimate and the variance bound will be larger.
Several alternate expressions are available for I(θ) under various assumptions.
To develop some of these, let X be a random variable that follows the distri-
bution F (x|θ) and note that
   0 
∂ f (X|θ)
V log[f (X|θ)] = V =
∂θ f (X|θ)
( 2 )
f 0 (X|θ)
 0 
2 f (X|θ)
E −E . (10.11)
f (X|θ) f (X|θ)

Evaluating the second term on the right hand side of Equation (10.11) yields
 0  Z ∞ 0 Z ∞
f (X|θ) f (x|θ)
E = f (x|θ)dx = f 0 (x|θ)dx. (10.12)
f (X|θ) −∞ f (x|θ) −∞

Because f (x|θ) is a density we know that


Z ∞
f (x|θ)dx = 1,
−∞

and hence,
Z ∞

f (x|θ)dx = 0.
∂θ −∞
If we can exchange the partial derivative and the integral, then it follows that
Z ∞
f 0 (x|θ)dx = 0. (10.13)
−∞

Therefore, under this condition,


( 2 )
f 0 (X|θ)
I(θ) = E . (10.14)
f (X|θ)
396 PARAMETRIC INFERENCE
Note further that
∂2 ∂ f 0 (x|θ) f 00 (x|θ) [f 0 (x|θ)]2
2
log[f (x|θ)] = = − 2 .
∂θ ∂θ f (x|θ) f (x|θ) f (x|θ)
Therefore,
 00 ( 2 )
∂2 f 0 (x|θ)
  
f (x|θ)
E log[f (X|θ)] = E −E .
∂θ2 f (x|θ) f (x|θ)
Under the assumption that the second partial derivative with respect to θ and
the integral in the expectation can be exchanged we have that
 00  Z ∞
f (x|θ)
E = f 00 (x|θ)dx = 0. (10.15)
f (x|θ) −∞

Therefore, under this assumption


∂2
 
I(θ) = E log[f (X|θ)] ,
∂θ2
which is usually the simplest form useful for computing I(θ).
Classical estimation theory strives to find unbiased estimators that are optimal
in the sense that they have a mean squared error that attains the lower bound
specified by Theorem 10.2. This can be a somewhat restrictive approach due to
the fact that the lower bound is not sharp, and hence is not always attainable.
Example 10.8. Let X1 , . . . , Xn be a set of independent and identically dis-
tributed random variables from a N(µ, θ) distribution where θ is finite. Theo-
rem 10.2 (Cramér and Rao) implies that the optimal mean squared error for
unbiased estimators of θ is 2n−1 θ2 . The usual unbiased estimator of θ is given
by
Xn
θ̂n = (n − 1)−1 (Xi − X̄n )2 .
i=1
The mean squared error of this estimator is given by 2(n − 1)−1 θ2 , which
is strictly larger than the bound given in Theorem 10.2. As pointed out in
Example 7.3.16 of Casella and Berger (2002), the bound is not attainable in
this case because the optimal estimator of θ is given by
n
X
θ̃n = n−1 (Xi − µ)2 ,
i=1

which depends on the unknown parameter µ. Note however that the bound is
attained asymptotically since
2n−1 θ2
lim ARE(θ̂n , θ̃n ) = lim = 1.
n→∞ n→∞ 2(n − 1)−1 θ 2

A less restrictive approach is to consider estimators that attain the lower


POINT ESTIMATION 397
bound asymptotically as the sample size increases to ∞. Within this approach
we will consider consistent estimators of θ that have an asymptotic Normal
distribution. For these estimators, mild regularity conditions exist for which
there are estimators that attain the bound given in Theorem 10.2. To de-
velop this approach let X1 , . . . , Xn be a set of independent and identically
distributed random variables from a distribution F with parameter θ. Let θ̂n
d
be an estimator of θ based on X1 , . . . , Xn such that n1/2 (θ̂n − θ) − → Z as
n → ∞ where Z is a N[0, σ 2 (θ)] random variable. Note that in this setup θ̂n
is consistent and that σ 2 (θ) is the asymptotic variance of n1/2 θ̂n , which does
not depend on n. The purpose of this approach is then to determine under
what conditions σ 2 (θ) = [I(θ)]−1 .
Definition 10.4. Let X1 , . . . , Xn be a set of independent and identically dis-
tributed random variables from a distribution F with parameter θ. Let θ̂n be an
d
estimator of θ such that n1/2 (θ̂n − θ) −
→ Z as n → ∞ where Z is a N[0, σ 2 (θ)]
random variable. If σ 2 (θ) = [I(θ)]−1 for all θ ∈ Ω, then θ̂n is an asymptotically
efficient estimator of θ.
There are several differences between finding an efficient estimator, that is
one that attains the bound given in Theorem 10.2 for every n ∈ N, and one
that is asymptotically efficient, which attains this bound only in the limit as
n → ∞. We first note that an asymptotically efficient estimator is not unique,
and may not even be unbiased, even asymptotically. However, some regular-
ity conditions on the type of estimator considered are generally necessary as
demonstrated by the famous example given below, which is due to Hodges.
See Le Cam (1953) for further details.
Example 10.9. Let X1 , . . . , Xn be a set of independent and identically dis-
tributed random variables following a N(θ, 1) distribution. In this case I(θ) =
1 so that an asymptotically efficient estimator is one with σ 2 (θ) = 1. Consider
the estimator given by θ̂n = X̄n + (a − 1)X̄n δn , where δn = δ{|X̄n |; [0, n−1/4 )}
for all n ∈ N. We know from Theorem 4.20 (Lindeberg and Lévy) that
d
n1/2 (X̄n −θ) −→ Z as n → ∞ where Z is a N(0, 1) random variable. Hence X̄n is
asymptotically efficient by Definition 10.4. To establish the asymptotic behav-
p
ior of θ̂n we consider two distinct cases. When θ 6= 0 we have that X̄n −
→ θ 6= 0
p
as n → ∞ by Theorem 3.10 and hence it follows that δ{|X̄n | : [0, n−1/4 )} −
→0
p
as n → ∞. See Exercise 12. Therefore, Theorem 3.9 implies that θ̂n − → θ
as n → ∞ when θ 6= 0. However, we can demonstrate an even stronger re-
sult which will help us establish the weak convergence of θ̂n . Consider the
sequence of random variables given by n1/2 (a − 1)X̄n δn . Note that Theorem
d
4.20 (Lindeberg and Lévy) implies that n1/2 X̄n −
→ Z as n → ∞ where Z
is a N(0, 1) random variable. Therefore, Theorem 4.11 (Slutsky) implies that
p
n1/2 (a − 1)X̄n δn −
→ 0 as n → ∞. Now note that Theorem 4.11 implies that
n1/2 (θ̂n − θ) = n1/2 [X̄n + (a − 1)δn X̄n − θ]
= n1/2 (X̄n − θ) + n1/2 (a − 1)δn X̄n ,
398 PARAMETRIC INFERENCE
converges in distribution to a N(0, 1) distribution as n → ∞. Therefore, we
have established that σ 2 (θ) = 1 when θ 6= 0 and hence θ̂n is asymptotically
efficient in this case.
p
For the case when θ = 0 we have that δn − → 1 as n → ∞. See Exercise 12.
1/2 d
Noting that n X̄n − → Z as n → ∞ where Z is a N(0, 1) random variable,
it then follows from Theorem 4.11 (Slutsky) that n1/2 θ̂n = n1/2 X̄n [1 + (a −
d
→ aZ as n → ∞, so that σ 2 (θ) = a2 . Note then that if |a| < 1 then
1)δn ] −
it follows that σ 2 (θ) < 1, and therefore in this case the estimator is more
efficient that the best estimator. 
Example 10.9 demonstrates the somewhat disturbing result that there are
conditions under which we might have what are known as super-efficient es-
timators, whose variance is less than the minimum given by the bound of
Theorem 10.2. However, it is noteworthy that we only obtain such an estima-
tor at a single point in the parameter space Ω. In fact, Le Cam (1953) shows
that under some regularity conditions similar to those given in Theorem 10.2,
the set of points in Ω for which there are super-efficient estimators always
has a Lebesgue measure equal to zero. In particular, there are no uniformly
super-efficient estimators over Ω. For further details on this result see Ba-
hadur (1964) and Section 6.1 of Lehmann and Casella (1998). However, it has
been pointed out by Le Cam (1953), Huber (1966), and Hájek (1972) that the
violation of the inequality in Theorem 10.2 even at a single point can produce
certain unfavorable properties of the risk of the estimators in a neighborhood
of that point. See Example 1.1 in Section 6.2 of Lehmann and Casella (1998)
for further details.
Lehmann and Casella (1998) also point out that there are no additional as-
sumptions on the distribution of F that can be made which will avoid this
difficulty. However, restricting the types of estimators considered is possible.
For example, one could require both that
lim V [n1/2 (θ̂n − θ)] = v(θ),
n→∞

and

lim Bias(θ̂n ) = 0,
n→∞ ∂θ
which would avoid super-efficient estimators. Another assumption that can
avoid this difficulty is to require v(θ) to be a continuous function in θ. Still
another possibility suggested by Rao (1963) and Wolfowitz (1965) is to require
the weak convergence on n1/2 (θ̂n − θ) to Z to be uniform in θ. Further results
are proven by Phanzagl (1970).
We now consider a specific methodology for obtaining estimators that is ap-
plicable to many types of problems: maximum likelihood estimation. Under
specific conditions, we will be able to show that this method provides asymp-
totically optimal estimators that are also asymptotically Normal. If we ob-
serve X1 , . . . , Xn , a set of independent and identically distributed random
POINT ESTIMATION 399
variables from a distribution F (x|θ), then the joint density of the observed
sample if given by
n
Y
f (x1 , . . . , xn |θ) = f (xi |θ),
k=1
where θ ∈ Ω. We have assumed that F (x|θ) has density f (x|θ) and that the
observed random variables are continuous. In the discrete case the density
f (x|θ) is replaced by the probability distribution function associated with
F (x|θ). The likelihood function considers the joint density f (x1 , . . . , xn |θ) as
a function of θ where the observed sample is taken to be fixed. That is
n
Y
L(θ|x1 , . . . , xn ) = f (xi |θ).
k=1

For simpler notation we will often use L(θ|x) in place of L(θ|x1 , . . . , xn ) where
x0 = (x1 , . . . , xn ). The maximum likelihood estimators of θ are taken to be the
points that maximize the function L(θ|x1 , . . . , xn ) with respect to θ. That is,
θ̂n is a maximum likelihood estimator of θ if
L(θ̂n |x1 , . . . , xn ) = sup L(θ|x1 , . . . , xn ).
θ∈Ω

Assuming that L(θ|x1 , . . . , xn ) has at least two derivatives and that the pa-
rameter space of θ is Ω, candidates for the maximum likelihood estimator of
θ have the properties

d
L(θ|x1 , . . . , xn ) = 0,
dθ θ=θ̂n

and
d2


L(θ|x1 , . . . , xn ) < 0.
dθ2 θ=θ̂n
Other candidates for a maximum likelihood estimator are the points on the
boundary of the parameter space. The maximum likelihood estimators are
the candidates for which the likelihood is the largest. Therefore, maximum
likelihood estimators may not be unique and hence there may be two or more
values of θ̂n that all maximize the likelihood.
One must be careful when interpreting a maximum likelihood estimator. A
maximum likelihood estimator is not the most likely value of θ given the
observed data. Rather, a maximum likelihood estimator is a value of θ which
has the largest probability of generating the data when the distribution is
discrete. In the continuous case a maximum likelihood estimator is a value of
θ for which the joint density of the sample is greatest.
In many cases the distribution of interest often contains forms that may not
be simple to differentiate after the product is taken to form the likelihood
function. In these cases the problem can be simplified by taking the natural
logarithm of the likelihood function. The resulting function is often called
400 PARAMETRIC INFERENCE
the log-likelihood function. Note that because the natural logarithm function
is monotonic, the points that maximize L(θ|x1 , . . . , xn ) will also maximize
l(θ) = log[L(θ|x1 , . . . , xn )]. Therefore, a maximum likelihood estimator of θ
can be defined as the value θ̂n such that
l(θ̂n ) = sup l(θ).
θ∈Ω

Example 10.10. Suppose that X1 , . . . , Xn is a set of independent and iden-


tically distributed random variables from an Exponential(θ) distribution,
where the parameter space for θ is Ω = (0, ∞). We will denote the correspond-
ing observed sample as x1 , . . . , xn . The likelihood function in this case is given
by
n
Y
L(θ|x1 , . . . , xn ) = f (xk |θ)
k=1
Yn
= θ−1 exp(−θ−1 xk )
k=1
n
!
X
−n −1
= θ exp −θ xk .
k=1

Therefore, the log-likelihood function is given by


n
X
l(θ) = −n log(θ) − θ−1 xk .
k=1

The first derivative is


n
d X
l(θ) = −nθ−1 + θ−2 xk .

k=1

Setting this derivative equal to zero and solving for θ gives


n
X
θ̂n = x̄n = n−1 xk ,
k=1

as a candidate for the maximum likelihood estimator. The second derivative


of the log-likelihood is given by
n
d2 X
2
l(θ) = nθ−2 − 2θ−3 xk ,

k=1

so that
d2

= nx̄−2 −2 2

l(θ) n − 2nx̄n = −nx̄n < 0,
dθ2 θ=θ̂n
since x̄n will be positive with probability one. It follows that x̄n is a local
maximum. We need only now check the boundary points of Ω = (0, ∞). Noting
POINT ESTIMATION 401
that
n
! n
!
X X
lim θ−n exp −θ−1 xk = lim θ−n exp −θ−1 xk = 0,
θ→0 θ→∞
k=1 k=1

we need only show that L(x̄n |X1 , . . . , Xn ) > 0 to conclude that x̄n is the
maximum likelihood estimator of θ. To see this, note that
n
!
X
L(x̄n |X1 , . . . , Xn ) = x̄−n −1
n exp −x̄n xk = x̄−1
n exp(−n) > 0.
k=1

Therefore, it follows that x̄n is the maximum likelihood estimator of θ. 


Example 10.11. Suppose that X1 , . . . , Xn is a set of independent and iden-
tically distributed random variables from a Uniform(0, θ) distribution where
the parameter space for θ is Ω = (0, ∞). The likelihood function for this case
is given by
n
Y n
Y
−1 −n
L(θ|X1 , . . . , Xn ) = θ δ{xk ; (0, θ)} = θ δ{xk ; (0, θ)}. (10.16)
k=1 k=1

When taken as a function of θ, note that δ{xk ; (0, θ)} is zero unless θ > xk .
Therefore, the product on the right hand side of Equation (10.16) is zero
unless θ > xk for all k ∈ {1, . . . , n}, or equivalently if θ > x(n) , where x(n)
is the largest value in the sample. Therefore, the likelihood function has the
form (
0 θ ≤ x(n)
L(θ|x1 , . . . , xn ) = −n
θ θ > x(n) .
It follows then that the likelihood function is then maximized at θ̂n = x(n) .
See Figure 10.1. 
Maximum likelihood estimators have many useful properties. For example,
maximum likelihood estimators have an invariance property that guarantees
that the maximum likelihood estimator of a function of a parameter is the
same function of the maximum likelihood estimator of the parameter. See
Theorem 7.2.10 of Casella and Berger (2002). In this section we will establish
some asymptotic properties of maximum likelihood estimators. In particular,
we will establish conditions under which maximum likelihood estimators are
consistent and asymptotically efficient.
The main impediment to establishing a coherent asymptotic theory for max-
imum likelihood estimators is that the derivative of the likelihood, or log-
likelihood, function may have multiple roots. This can cause problems for
consistency, for example, since the maximum likelihood estimator may jump
from root to root as new observations are obtained from the population. Be-
cause we intend to provide the reader with an overview of this subject we
will concentrate on problems that have a single unique root. A detailed ac-
count of the case where there are multiple roots can be found in Section 6.4
of Lehmann and Casella (1998).
402 PARAMETRIC INFERENCE

Figure 10.1 The likelihood function for θ where X1 , . . . , Xn is a set of independent


and identically distributed random variables from a Uniform(0, θ) distribution. The
horizontal axis corresponds to the parameter space, and the jump occurs when θ =
x(n) .

A first asymptotic motivation for maximum likelihood estimators comes from


the fact that the likelihood function of θ computed on a set of independent
and identically distributed random variables from a distribution F (x|θ) is
asymptotically maximized at the true value of θ.
Theorem 10.3. Let X1 , . . . , Xn be a set of independent and identically dis-
tributed random variables from a distribution F (x|θ) with density or probabil-
ity distribution f (x|θ), where θ is the parameter of interest. Suppose that the
following conditions hold.

1. The parameter θ is identifiable. That is, f (x|θ) is distinct for each value of
θ in Ω.
2. The set {x ∈ R : f (x|θ) > 0} is the same for all θ ∈ Ω.

If θ0 is the true value of θ, then


lim P [L(θ0 |X1 , . . . , Xn ) > L(θ|X1 , . . . , Xn )] = 1,
n→∞

for all θ ∈ Ω \ θ0 .

Proof. We begin by noting that L(θ0 |X1 , . . . , Xn ) > L(θ|X1 , . . . , Xn ) will oc-
POINT ESTIMATION 403
cur if and only if
n
Y n
Y
f (Xi |θ0 ) > f (Xi |θ),
i=1 i=1
which is equivalent to the condition that
" n #−1 n
Y Y
f (Xi |θ0 ) f (Xi |θ) < 1.
i=1 i=1

Taking the logarithm of this last expression and dividing by n implies that
n
X
n−1 log{f (Xi |θ)[f (Xi |θ0 )]−1 } < 0.
i=1

Since X1 , . . . , Xn are independent and identically distributed, it follows from


Theorem 3.10 (Weak Law of Large Numbers) that
n
X p
n−1 log[f (Xi |θ)/f (Xi |θ0 )] −
→ E{log[f (X1 |θ)/f (X1 |θ0 )]}, (10.17)
i=1

as n → ∞, where we assume that the expectation exists. Note that − log(t)


is a strictly convex function on R so that Theorem 2.11 (Jensen) implies that
−E[log{f (X1 |θ)[f (X1 |θ0 )]−1 }] > − log[E{f (X1 |θ)[f (X1 |θ0 )]−1 }]. (10.18)
We now use the fact that θ0 is the true value of θ to find
Z ∞ Z ∞
−1 f (x|θ)
E{f (X1 |θ)[f (X1 |θ0 )] } = f (x|θ0 )dx = f (x|θ)dx = 1.
−∞ f (x|θ0 ) −∞

Therefore, it follows that log[E{f (X1 |θ)[f (X1 |θ0 )]−1 }] = 0, and hence Equa-
tion (10.18) implies that E[log{f (X1 |θ)[f (X1 |θ0 )]−1 }] < 0. Therefore, we have
from Equation (10.17) that
n
X p
n−1 log{f (Xi |θ)[f (Xi |θ0 )]−1 } −
→ c,
i=1

where c < 0 is a real constant. Definition 3.1 then implies that


n
!
X
−1 −1
lim P n log{f (Xi |θ)[f (Xi |θ0 )] } < 0 = 1,
n→∞
i=1

which is equivalent to
lim P [L(θ0 |X1 , . . . , Xn ) > L(θ|X1 , . . . , Xn )] = 1.
n→∞

The assumptions of Theorem 10.3 are sufficient to ensure the consistency of


the maximum likelihood estimator for the case when Ω has a finite number
of elements. See Corollary 6.2 of Lehmann and Casella (1998). However, even
404 PARAMETRIC INFERENCE
when Ω is countable the result breaks down. See Bahadur (1958), Le Cam
(1979), and Example 6.2.6 of Lehmann and Casella (1998), for further details.
However, a few additional regularity conditions can provide a consistency
result for the maximum likelihood estimator.
Theorem 10.4. Let X1 , . . . , Xn be a set of independent and identically dis-
tributed random variables from a distribution F (x|θ) with density f (x|θ),
where θ is the parameter of interest. Let X = (X1 , . . . , Xn )0 and suppose
that the following conditions hold.

1. The parameter θ is identifiable. That is, f (x|θ) is distinct for each value of
θ in Ω.
2. The set {x ∈ R : f (x|θ) > 0} is the same for all θ ∈ Ω.
3. The parameter space Ω contains an open interval W where the true value
of θ is an interior point.
4. For almost all x ∈ R, f (x|θ) is differentiable with respect to θ in W .
5. The equation

L(θ|X) = 0, (10.19)
∂θ
has a single unique root for each n ∈ N and all X ∈ Rn .
p
Then, with probability one, if θ̂n is the root of Equation (10.19), then θ̂n −
→ θ,
as n → ∞.

Proof. Let θ0 ∈ Ω be the true value of the parameter. Condition 3 implies


that we can select a positive real number δ such that (θ0 − δ, θ0 + δ) ⊂ W .
Define the set
Gn (δ) = {X ∈ Rn : L(θ0 |X) > L(θ0 − δ|X) and L(θ0 |X) > L(θ0 + δ|X)}.
Theorem 10.3 implies that
lim P [Gn (δ)] = 1, (10.20)
n→∞

since θ0 − δ and θ0 + δ are not the true values of θ. Note that if x ∈ Gn (δ) then
it follows that there must be a local maximum in the interval (θ0 − δ, θ0 + δ)
since L(θ0 |X) > L(θ0 − δ|x) and L(θ0 |X) > L(θ0 + δ|x). Condition 4 implies
that there always exists θ̂n ∈ (θ0 − δ, θ0 + δ) such that L0 (θ̂n |X) = 0. Hence
Equation (10.20) implies that there is a sequence of roots {θ̂n }∞ n=1 such that

lim P (|θ̂n − θ0 | < δ) = 1,


n→∞
p
and hence θ̂n −
→ θ, as n → ∞.

For further details on the result of Theorem 10.4 when there are multiple
roots, see Section 6.4 of Lehmann and Casella (1998).
POINT ESTIMATION 405
Example 10.12. Suppose that X1 , . . . , Xn is a set of independent and iden-
tically distributed random variables having an Exponential(θ) distribution.
In Example 10.10, the sample mean was shown to be the unique maximum
likelihood estimator of θ. Assuming that θ is an interior point of the parame-
ter space Ω = (0, ∞), the properties of Theorem 10.4 are satisfied and we can
p
conclude that X̄n −→ θ as n → ∞. This result can also be found directly using
Theorem 3.10. 

In Example 10.12 we can observe that if the maximum likelihood estimator of


θ has a closed form, then it is often easier to establish the consistency directly
using results like Theorem 3.10. The interest in results like Theorem 10.4
is that we can observe what types of assumptions are required to establish
consistency. Note, however, that the assumptions of Theorem 10.4 are not
necessary; there are examples of maximum likelihood estimators that do not
follow all of the assumptions of Theorem 10.4 that are still consistent.
Example 10.13. Suppose that X1 , . . . , Xn is a set of independent and iden-
tically distributed random variables having an Uniform(0, θ) distribution,
where the unique maximum likelihood estimator was found in Example 10.11
to be θ̂n = X(n) , the largest order statistic of the sample. This example vi-
olates the second assumption of Theorem 10.4, so that we cannot use that
result to directly obtain the consistency of θ̂n . However, noting that the pa-
rameter space in this case is Ω = (0, ∞), we can use a direct approach to find
that if we let 0 < ε < θ, then
P (|θ̂n − θ| < ε) = P (θ − ε < θ̂n < θ)
n
!
\
= 1−P {0 ≤ Xi ≤ θ − ε}
i=1
n
Y
= 1− P (0 ≤ Xi ≤ θ − ε)
i=1
−1
= 1 − [θ (θ − ε)]n .
Therefore, it follows that
lim P (|θ̂n − θ| < ε) = 1.
n→∞
p
When ε > θ we have that P (|θ̂n − θ| < ε) = 1 for all n ∈ N. Therefore θ̂n −
→θ
as n → ∞. 

During the development of the fundamental aspects of estimation theory, there


were several conjectures that maximum likelihood estimators had many glob-
ally applicable properties such as consistency. The following example demon-
strates that there are, in fact, inconsistent maximum likelihood estimators.
Example 10.14. Consider a sequence of random variables {{Xij }kj=1 }ni=1
that are assumed to be mutually independent, each having a N(µi , θ) distribu-
tion for i = 1, . . . , n. It can be shown that the maximum likelihood estimators
406 PARAMETRIC INFERENCE
of µi and θ are
k
X
µ̂i = X̄i = k −1 Xij ,
j=1
for each i = 1, . . . , n and
n X
X k
θ̂n = (nk)−1 (Xij − X̄i )2 ,
i=1 j=1

respectively. See Exercise 17. Note that


n X
X k
θ̂n = (nk)−1 (Xij − X̄i )2 =
i=1 j=1
 
n
X k
X n
X
n−1 k −1 (Xij − X̄i )2  = n−1 Si2 ,
i=1 j=1 i=1

where
k
X
Si2 = k −1 (Xij − X̄i )2 ,
j=1
for i = 1, . . . , n, which is the sample variance computed on Xi1 , . . . , Xik . Define
Ci = kθ−1 Si2 for i = 1, . . . , n and note that Ci has a ChiSquared(k − 1)
distribution for all i = 1, . . . , n. Then
n
X n
X
θ̂n = n−1 Si2 = (nk)−1 θCi .
i=1 i=1

Noting that C1 , . . . , Cn are mutually independent, if follows from Theorem


p
3.10 (Weak Law of Large Numbers) that θ̂n − → k −1 θE(C1 ) = θk −1 (k − 1)
as n → ∞, so that the maximum likelihood estimator does not converge in
probability to θ. The inconsistency in this example is due to the fact that
the bias of the maximum likelihood estimator does not converge to zero as
n → ∞. 
Example 10.14 was the first example of an inconsistent maximum likelihood
estimator and is given by Neyman and Scott (1948). An interesting story
regarding this example can be found in Barnard (1970). Other examples of
inconsistent maximum likelihood estimators can be found in Bahadur (1958),
Le Cam (1953), Basu (1955). Simple adjustments or a Bayesian approach can
often be used to obtain consistent estimates in these cases. For instance, in
Example 10.14, the estimator k(k − 1)−1 θ̂n will provide a consistent estimator
of θ, though the adjusted estimator is not the maximum likelihood estimator.
See Ghosh (1994) for further examples.
With a few more assumptions added to those of Theorem 10.4, we can establish
conditions under which maximum likelihood estimators are both consistent
and asymptotically efficient.
POINT ESTIMATION 407
Theorem 10.5. Suppose that X1 , . . . , Xn are a set of independent and iden-
tically distributed random variables from a distribution F (x|θ) with density
f (x|θ), where θ has parameter space Ω. Let θ0 denote the true value of θ.
Suppose that

1. Ω is an open interval.
2. The set {x : f (x|θ) > 0} is the same for all θ ∈ Ω.
3. The density f (x|θ) has three continuous derivatives with respect to θ for
each x ∈ {x : f (x|θ) > 0}.
4. The integral
Z ∞
f (x|θ)dx,
−∞
can be differentiated three times by exchanging the integral and the deriva-
tives.
5. The Fisher information number I(θ0 ) is defined, positive, and finite.
6. For any θ0 ∈ Ω there exists a positive constant d and function B(x) such
that 3

log[f (x|θ)] ≤ B(x),
∂θ3
for all x ∈ {x : f (x|θ) > 0} and θ ∈ [θ0 −d, θ0 +d] such that E[B(X1 )] < ∞.
7. There is a unique maximum likelihood estimator θ̂n for each n ∈ N and
θ ∈ Ω.
d
→ Z as n → ∞ where Z has a N[0, I −1 (θ0 )] distribution.
Then n1/2 (θ̂n − θ) −

Proof. Suppose that X1 , . . . , Xn is a set of independent and identically dis-


tributed random variables from a distribution F (x|θ) with density f (x|θ). Let
l(θ|X) denote the log-likelihood function of θ given by
" n # n
Y X
l(θ|X) = log f (Xi |θ) = log[f (Xi |θ)],
i=1 i=1
0
where X = (X1 , . . . , Xn ) . Assume that the maximum likelihood estimator,
denoted by θ̂n , is the unique solution to the equation
n n
0 d X X f 0 (Xi |θ)
l (θ|X) = log[f (Xi |θ)] = = 0, (10.21)
dθ i=1 i=1
f (Xi |θ)

where the derivative indicated by f 0 (Xi |θ) is taken with respect to θ. Now
apply Theorem 1.13 (Taylor) to l0 (θ|X) to expand l0 (θ̂n |X) about a point
θ0 ∈ Ω as
l0 (θ̂n |X) = l0 (θ0 |X) + (θ̂n − θ0 )l00 (θ0 |X) + 12 (θ̂n − θ0 )2 l000 (ξn |X),
408 PARAMETRIC INFERENCE
where ξn is a random variable that is between θ0 and θ̂n with probability one.
Because θ̂n is the unique root of l0 (θ|X), it follows that
l0 (θ0 |X) + (θ̂n − θ0 )l00 (θ0 |X) + 21 (θ̂n − θ0 )2 l000 (ξn |X) = 0,
or equivalently
n−1/2 l0 (θ0 |X)
n1/2 (θ̂n − θ0 ) = . (10.22)
−n−1 l00 (θ0 |X) − 12 n−1 (θ̂n − θ0 )l000 (ξn |X)
The remainder of the proof is based on analyzing the asymptotic behavior of
each of the terms on the right hand side of Equation (10.22). We first note
that
n
X f 0 (Xi |θ0 )
n−1/2 l0 (θ0 |X) = n−1/2 .
i=1
f (Xi |θ0 )
Under the assumption that it is permissible to exchange the partial derivative
and the integral, Equation (10.13) implies that
Z ∞
f 0 (x|θ)dx = 0,
−∞

and hence
f 0 (Xi |θ0 )
 
E = 0.
f (Xi |θ0 )
Therefore,
( n )
f 0 (Xi |θ0 ) f 0 (Xi |θ0 )
X 
−1/2 0 1/2 −1
n l (θ0 |X) = n n −E .
i=1
f (Xi |θ0 ) f (Xi |θ0 )

We can apply Theorem 4.20 (Lindeberg and Lévy) and Theorem 10.2 (Cramér
d
and Rao) to this last expression to find that n−1/2 l0 (θ0 |X) − → Z as n →
∞ where Z is a N[0, I(θ0 )] random variable. We now consider the term
−n−1 l00 (θ0 ). First note that
∂2

−n−1 l00 (θ0 ) = −n−1

2
log[L(θ|X)]
∂θ
θ=θ0
n

2 X
−1 ∂

= −n log[f (X |θ)]

i
∂θ2 i=1


θ=θ0
n

0
∂ X f (Xi |θ)

= −n−1
∂θ f (Xi |θ)

i=1 θ=θ0
n n
X f 00 (Xi |θ0 ) X [f 0 (Xi |θ0 )]2
= −n−1 + n−1
i=1
f (Xi |θ0 ) i=1
f 2 (Xi |θ0 )
n
X [f 0 (Xi |θ0 )]2 − f (Xi |θ0 )f 00 (Xi |θ0 )
= n−1 ,
i=1
f 2 (Xi |θ0 )
POINT ESTIMATION 409
which is the sum of a set of independent and identically distributed random
variables, each with expectation
 0
[f (Xi |θ0 )]2 − f (Xi |θ0 )f 00 (Xi |θ0 )

E =
f 2 (Xi |θ0 )
( 2 )
f 0 (Xi |θ0 )
 00 
f (Xi |θ0 )
E −E , (10.23)
f (Xi |θ0 ) f (Xi |θ0 )

where Equation (10.14) implies that


( 2 )
f 0 (Xi |θ0 )
I(θ0 ) = E .
f (Xi |θ0 )

To evaluate the second term of the right hand side of Equation (10.23), we
note that Equation (10.15) implies that
 00 
f (Xi |θ0 )
E = 0,
f (Xi |θ0 )
and hence we have that
 0
[f (Xi |θ0 )]2 − f (Xi |θ0 )f 00 (Xi |θ0 )

E = I(θ0 ).
f 2 (Xi |θ0 )
Therefore, Theorem 3.10 (Weak Law of Large Numbers) implies that
n
X [f 0 (Xi |θ0 )]2 − f (Xi |θ0 )f 00 (Xi |θ0 ) p
n−1 −
→ I(θ0 ),
i=1
f 2 (Xi |θ0 )

as n → ∞. For the last term we note that


" n #
−1 000 −1 ∂3 Y
n l (θ) = n log f (Xi |θ)
∂θ3 i=1
n
∂3 X
= n−1 log[f (Xi |θ)]
∂θ3 i=1
n
−1
X ∂3
= n log[f (Xi |θ)].
i=1
∂θ3

Therefore, Assumption 6 implies that


n
n 3
−1 X ∂ 3

−1 000

−1
X ∂
|n l (θ)| = n log[f (X |θ)] ≤ n ∂θ3 log[f (Xi |θ)]

i

∂θ3


i=1 i=1
n
X
≤ n−1 B(Xi ),
i=1

with probability one for all θ ∈ (θ0 − c, θ0 + c). Now ξn is a random variable that is between θ0 and θ̂n with probability one, and Theorem 10.4 implies that θ̂n converges in probability to θ0 as n → ∞. Therefore, it follows that for any c > 0,
\[ \lim_{n\to\infty} P[\xi_n \in (\theta_0 - c, \theta_0 + c)] = 1, \]
and hence
\[ \lim_{n\to\infty} P\left[ |n^{-1} l'''(\xi_n)| \leq n^{-1} \sum_{i=1}^{n} B(X_i) \right] = 1. \]
Now Theorem 3.10 (Weak Law of Large Numbers) implies that
\[ n^{-1} \sum_{i=1}^{n} B(X_i) \xrightarrow{p} E[B(X_1)] < \infty, \]
as n → ∞. Therefore, it follows from Definition 4.3 that |n⁻¹ l′′′(ξn)| is bounded in probability as n → ∞, and therefore ½n⁻¹(θ̂n − θ0)l′′′(ξn) converges in probability to zero as n → ∞. Hence, Theorem 4.11 (Slutsky) implies that n¹ᐟ²(θ̂n − θ0) converges in distribution to I⁻¹(θ0)Z0 as n → ∞, where Z0 has a N[0, I(θ0)] distribution, and it follows that n¹ᐟ²(θ̂n − θ0) converges in distribution to Z as n → ∞, where Z is a random variable with a N[0, I⁻¹(θ0)] distribution.
From the assumptions of Theorem 10.5 it is evident that the asymptotic efficiency of a maximum likelihood estimator depends heavily on f, including its support and smoothness properties. The most important assumption that may not always be obvious is that integration of the density and differentiation of the density with respect to θ may be interchanged. In this context the following result is often useful.
Theorem 10.6. Let f(x|θ) be a function that is differentiable with respect to θ ∈ Ω. Suppose there exists a function g(x|θ) and a real constant δ > 0 such that
\[ \int_{-\infty}^{\infty} g(x|\theta)\,dx < \infty, \]
for all θ ∈ Ω and
\[ \left| \left. \frac{\partial}{\partial\theta} f(x|\theta) \right|_{\theta=\theta_0} \right| \leq g(x|\theta) \]
for all θ0 ∈ Ω such that |θ0 − θ| ≤ δ. Then
\[ \frac{d}{d\theta} \int_{-\infty}^{\infty} f(x|\theta)\,dx = \int_{-\infty}^{\infty} \frac{\partial}{\partial\theta} f(x|\theta)\,dx. \]
The proof of Theorem 10.6 follows from Theorem 1.11 (Lebesgue). For further details on this result see Casella and Berger (2002) or Section 7.10 of Khuri (2003). The conditions of Theorem 10.6 hold for a wide range of problems, including those that fall within the exponential family.
Definition 10.5. Let X be a continuous random variable with a density of the form f(x|θ) = exp[θT(x) − A(θ)] for all x ∈ R, where θ is a parameter with parameter space Ω, T is a function that does not depend on θ, and A is a function that does not depend on x. Then X has a density from a one-parameter exponential family.
Of importance for the exponential family in relation to our current discussion
is the fact that derivatives and integrals of the density can be exchanged.
Theorem 10.7. Let h be an integrable function and let θ be an interior point of Ω. Then the integral
\[ \int_{-\infty}^{\infty} h(x) \exp[\theta T(x) - A(\theta)]\,dx \]
is continuous and has derivatives of all orders with respect to θ, and these derivatives can be obtained by exchanging the derivative and the integral.
A proof of Theorem 10.7 can be found in Section 7.1 of Barndorff-Nielsen
(1978) or Chapter 2 of Lehmann (1986).
Corollary 10.1. Let X1, . . . , Xn be a set of independent and identically distributed random variables from a distribution that has density
\[ f(x|\theta) = \exp[\theta T(x) - A(\theta)], \]
where θ is a parameter with parameter space Ω that is an open interval, T is a function that does not depend on θ, and A is a function that does not depend on x. Then the likelihood equation for θ has a unique solution, and this solution is a consistent, asymptotically normal, and asymptotically efficient estimator of θ.
Proof. The likelihood function of θ is given by
\[ L(\theta|\mathbf{X}) = \prod_{i=1}^{n} \exp[\theta T(X_i) - A(\theta)] = \exp\left[ \theta \sum_{i=1}^{n} T(X_i) - nA(\theta) \right], \]
so that the log-likelihood is given by
\[ l(\theta|\mathbf{X}) = \theta \sum_{i=1}^{n} T(X_i) - nA(\theta). \]
Therefore, the maximum likelihood estimator of θ is the solution to the equation
\[ \sum_{i=1}^{n} T(X_i) - nA'(\theta) = 0. \qquad (10.24) \]
Now, noting that
\[ \int_{-\infty}^{\infty} \exp[\theta T(x) - A(\theta)]\,dx = 1, \]
it follows from Theorem 10.7 that
\[ \frac{d}{d\theta} \int_{-\infty}^{\infty} \exp[\theta T(x) - A(\theta)]\,dx = \int_{-\infty}^{\infty} \frac{d}{d\theta} \exp[\theta T(x) - A(\theta)]\,dx = \int_{-\infty}^{\infty} [T(x) - A'(\theta)] \exp[\theta T(x) - A(\theta)]\,dx = 0. \]
This implies that E[T(Xi)] = A′(θ), and hence the likelihood equation in Equation (10.24) is equivalent to
\[ E[T(X_i)] = n^{-1} \sum_{i=1}^{n} T(X_i). \qquad (10.25) \]
Note further that
\[ \frac{d^2}{d\theta^2} \int_{-\infty}^{\infty} \exp[\theta T(x) - A(\theta)]\,dx = \frac{d}{d\theta} \int_{-\infty}^{\infty} [T(x) - A'(\theta)] \exp[\theta T(x) - A(\theta)]\,dx \]
\[ = \int_{-\infty}^{\infty} [T(x) - A'(\theta)]^2 \exp[\theta T(x) - A(\theta)]\,dx - \int_{-\infty}^{\infty} A''(\theta) \exp[\theta T(x) - A(\theta)]\,dx = 0. \qquad (10.26) \]
Noting that the first integral on the right hand side of Equation (10.26) is the expectation E{[T(Xi) − A′(θ)]²} and that previous arguments have shown that E[T(Xi)] = A′(θ), it follows that E{[T(Xi) − A′(θ)]²} = V[T(Xi)] = A′′(θ). Since the variance must be positive, we have that
\[ A''(\theta) = \frac{d}{d\theta} A'(\theta) = \frac{d}{d\theta} E[T(X_i)] > 0. \]
Hence, the left hand side of Equation (10.25) is a strictly increasing function of θ, and therefore the equation can have at most one solution. The remainder of the proof of this result is based on verifying the assumptions of Theorem 10.5 for this case. We have already assumed that Ω is an open interval. The form of the density from Definition 10.5, along with the fact that T(x) is not a function of θ and that A(θ) is not a function of x, implies that the set {x : f(x|θ) > 0} does not depend on θ. The first three derivatives of f(x|θ), taken with respect to θ, can be shown to be continuous in θ under the assumption that A(θ) has at least three continuous derivatives. The fact that the integral of f(x|θ) taken with respect to x can be differentiated three times with respect to θ by exchanging the integral and the derivative follows from Theorem 10.7. The
Fisher information number for the parameter θ for this model is given by
\[ I(\theta) = E\left\{ \left( \frac{d}{d\theta} \log[f(X_i|\theta)] \right)^2 \right\} = E\left\{ \left( \frac{d}{d\theta} [\theta T(X_i) - A(\theta)] \right)^2 \right\} = E\{[T(X_i) - A'(\theta)]^2\} \geq 0. \]
Under the assumption that E[T²(Xi)] < ∞ we obtain the required behavior for Assumption 5. For Assumption 6 we note that the third derivative of log[f(x|θ)], taken with respect to θ, is given by
\[ \frac{\partial^3}{\partial\theta^3} \log[f(x|\theta)] = -A'''(\theta), \]
where we note that a suitable bounding function, constant in x, is given by
\[ B(x) = \sup_{\theta \in (\theta_0 - c,\, \theta_0 + c)} |A'''(\theta)|. \]
For further details see Lemma 2.5.3 of Lehmann and Casella (1998). It follows
that the assumptions of Theorem 10.5 are satisfied and the result follows.
Example 10.15. Let X1, . . . , Xn be a set of independent and identically distributed random variables from a density of the form
\[ f(x|\theta) = \begin{cases} \theta \exp(-\theta x) & x > 0 \\ 0 & x \leq 0, \end{cases} \qquad (10.27) \]
where θ ∈ Ω = (0, ∞). Calculations similar to those given in Example 10.10 can be used to show that the maximum likelihood estimator of θ is θ̂n = X̄n⁻¹. Alternatively, one can use the fact that maximum likelihood estimators are transformation respecting. Because the density in Equation (10.27) has the form given in Definition 10.5 with A(θ) = −log(θ) and T(x) = −x, Corollary 10.1 implies that n¹ᐟ²(θ̂n − θ) converges in distribution to Z as n → ∞, where Z is a N(0, σ²) random variable. The asymptotic variance σ² is given by σ² = I⁻¹(θ), where
\[ I(\theta) = E\left\{ -\frac{\partial^2}{\partial\theta^2} \log[f(X_i|\theta)] \right\} = E\left\{ -\frac{\partial^2}{\partial\theta^2} [\log(\theta) - \theta X_i] \right\} = \theta^{-2}, \]
so that σ² = θ². Therefore, it follows that n¹ᐟ²θ⁻¹(θ̂n − θ) converges in distribution to a N(0, 1) distribution. □
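The limiting behavior in Example 10.15 is easy to examine numerically. The following sketch is an illustration only: it assumes the NumPy and SciPy libraries are available, and the rate, sample size, and seed are arbitrary choices. It simulates θ̂n = 1/X̄n repeatedly and checks that n¹ᐟ²θ⁻¹(θ̂n − θ) behaves approximately like a N(0, 1) random variable.

```python
# Monte Carlo check of the asymptotic normality of the exponential MLE.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
theta, n, replications = 2.0, 200, 10000

# Exponential(theta) here means density theta * exp(-theta * x), i.e. rate theta.
samples = rng.exponential(scale=1.0 / theta, size=(replications, n))
theta_hat = 1.0 / samples.mean(axis=1)          # MLE in each replication
z = np.sqrt(n) * (theta_hat - theta) / theta    # standardized by sigma = theta

print("mean of standardized MLE:    ", z.mean())   # should be near 0
print("variance of standardized MLE:", z.var())    # should be near 1
print("Kolmogorov-Smirnov distance to N(0,1):",
      stats.kstest(z, "norm").statistic)
```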
The definition of an exponential family given in Definition 10.5 is somewhat simplistic and has a restrictive form. We have used this form in order to keep
the presentation simple. More general results are available that can be applied
to many different distributions and parameters. As with the consistency of
maximum likelihood estimators, it is sometimes easier to bypass Theorem 10.5
and try to obtain the asymptotic normality of a maximum likelihood estimator
directly, especially if the estimator has a closed form. In many cases, results
like Theorem 6.3, and other similar results, can be used to aid in this process.
Additionally, there are maximum likelihood estimators that do not satisfy all of the assumptions of Theorem 10.5 but are still asymptotically normal.
Efficiency results for maximum likelihood estimators can be extended to more general cases in several ways. A key assumption used in this section was that the equation l′(θ|X) = 0 has a unique root for all n ∈ N. This need not be the case in general. When there are multiple roots, consistency and asymptotic efficiency results are obtainable, though the assumptions and arguments used for establishing the results are more involved. These results can also be extended to the case where there is more than one parameter. See Sections 6.5–6.7 of Lehmann and Casella (1998) for further details on these results.
10.3 Confidence Intervals

Confidence intervals specify an estimator for a parameter θ that accounts for the inherent random error associated with a point estimate of the parameter.
This is achieved by identifying a function of the observed sample data that
produces an interval, or region, that contains the true parameter value with
a probability that is specified prior to sampling.
Definition 10.6. Let X1 , . . . , Xn be a set of independent and identically dis-
tributed random variables following a distribution F with parameter θ that has
parameter space Ω ⊂ R. Let θ̂L,n and θ̂U,n be functions of X1 , . . . , Xn and let
α ∈ (0, 1). Then Cn (α) = [θ̂L,n , θ̂U,n ] is a 100α% confidence interval for θ if
P[θ ∈ Cn(α)] = α for all θ ∈ Ω and n ∈ N.
The value α is called the confidence coefficient of the interval. It is important to remember that the confidence coefficient is the probability that the interval contains the parameter based on the random mechanism associated with taking the sample. After the sample is taken we can no longer say that the parameter is contained within the interval with probability α. Rather, in the post-sample interpretation α is usually interpreted as the long-run proportion of intervals, calculated using the same method, that will contain the parameter. One-sided confidence intervals can also be defined by taking θ̂L,n = −∞ for an upper confidence interval or θ̂U,n = ∞ for a lower confidence interval.
The development of a confidence interval usually results from inverting a sta-
tistical hypothesis test, or from a pivotal quantity, which is a function of the
data and the unknown parameter θ, whose distribution does not depend on θ,
or on any other unknown quantities. The typical example of a pivotal quantity
comes from considering X1 , . . . , Xn to be a set of independent and identically
distributed random variables from a N(θ, σ 2 ) distribution where σ is known.
In this case the function σ −1 n1/2 (X̄n − θ) has a N(0, 1) distribution which
does not depend on θ. It is from this fact that we are able to conclude that
Cn (α) = [X̄n − n−1/2 σz(1+α)/2 , X̄n − n−1/2 σz(1−α)/2 ] is a 100α% confidence
interval for θ. Similarly, when σ is not known, the function σ̂n−1 n1/2 (X̄n − θ)
has a T(n − 1) distribution which does not depend on θ or σ. Using this pivot
we obtain the usual t-interval for the mean.
In many cases pivotal functions are difficult to obtain and we may then choose
to use confidence intervals that may not have a specific confidence coefficient
for finite sample sizes, but may have a confidence coefficient that converges to
α as n → ∞. That is, we wish to specify an approximate confidence interval
Cn(α) such that
\[ \lim_{n\to\infty} P[\theta \in C_n(\alpha)] = \alpha, \]
for all θ ∈ Ω and α ∈ (0, 1). We can further refine such a property to reflect
both the accuracy and the correctness of the confidence interval.
Definition 10.7. Suppose that Cn(α) = [θ̂L,n, θ̂U,n] is a 100α% confidence interval for a parameter θ such that P[θ ∈ Cn(α)] = α for all θ ∈ Ω and n ∈ N. Let Dn(α) = [θ̃L,n, θ̃U,n] be an approximate 100α% confidence interval for a parameter θ such that P[θ ∈ Dn(α)] → α as n → ∞ for all θ ∈ Ω.
1. The approximate confidence interval Dn(α) is accurate if P[θ ∈ Dn(α)] = α for all n ∈ N, θ ∈ Ω and α ∈ (0, 1).
2. The approximate confidence interval Dn(α) is kth-order accurate if P[θ ∈ Dn(α)] = α + O(n⁻ᵏᐟ²) as n → ∞ for all θ ∈ Ω and α ∈ (0, 1).
3. The approximate confidence interval Dn(α) is correct if θ̃L,n = θ̂L,n and θ̃U,n = θ̂U,n for all n ∈ N, θ ∈ Ω and α ∈ (0, 1).
4. The approximate confidence interval Dn(α) is kth-order correct if θ̃L,n = θ̂L,n + O(n^{-(k+1)/2}) and θ̃U,n = θ̂U,n + O(n^{-(k+1)/2}) as n → ∞.
Similar definitions can be used to define the correctness and accuracy of one-sided confidence intervals.
Example 10.16. Let X1, . . . , Xn be a set of independent and identically distributed random variables from a distribution F with mean θ and finite variance σ². Even when F does not have a normal distribution, Theorem 4.20 (Lindeberg and Lévy) implies that n¹ᐟ²σ⁻¹(X̄n − θ) converges in distribution to Z as n → ∞, where Z has a N(0, 1) distribution. Therefore, an asymptotically accurate 100α% confidence interval for θ is given by
\[ C_n(\alpha) = [\bar{X}_n - n^{-1/2}\sigma z_{(1+\alpha)/2},\; \bar{X}_n - n^{-1/2}\sigma z_{(1-\alpha)/2}], \]
under the condition that σ is known. If σ is unknown then a consistent estimator of σ is given by the usual sample standard deviation σ̂n. Therefore, in this case an asymptotically accurate 100α% confidence interval for θ is given by
\[ C_n(\alpha) = [\bar{X}_n - n^{-1/2}\hat{\sigma}_n z_{(1+\alpha)/2},\; \bar{X}_n - n^{-1/2}\hat{\sigma}_n z_{(1-\alpha)/2}]. \quad\Box \]
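The asymptotic accuracy described in Example 10.16 can be examined by simulation. The following sketch is an illustration only, assuming NumPy and SciPy; the exponential population, sample size, and seed are arbitrary choices. It estimates the coverage probability of the interval that uses σ̂n.

```python
# Estimated coverage of the normal-approximation interval for a mean.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
alpha, n, replications = 0.95, 50, 20000
theta = 1.0                                  # mean of the Exponential(1) population
z_hi = stats.norm.ppf((1 + alpha) / 2)

x = rng.exponential(scale=theta, size=(replications, n))
xbar = x.mean(axis=1)
sd = x.std(axis=1, ddof=1)                   # sample standard deviation
half_width = z_hi * sd / np.sqrt(n)
covered = (xbar - half_width <= theta) & (theta <= xbar + half_width)
print("nominal:", alpha, "estimated coverage:", covered.mean())
```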
Moving beyond the case of constructing a confidence interval for a normal mean, we first consider the case where θ is a general parameter that can be estimated using an estimator θ̂n, where n¹ᐟ²σ⁻¹(θ̂n − θ) converges in distribution to Z as n → ∞ and Z has a N(0, 1) distribution. At this time we will assume that σ is known.
In this case we can consider the approximate upper confidence limit given by θ̃U,n = θ̂n − n⁻¹ᐟ²σz₁₋α. Note that
\[ \lim_{n\to\infty} P(\theta \leq \tilde{\theta}_{U,n}) = \lim_{n\to\infty} P(\theta \leq \hat{\theta}_n - n^{-1/2}\sigma z_{1-\alpha}) = \lim_{n\to\infty} P[n^{1/2}\sigma^{-1}(\hat{\theta}_n - \theta) \geq z_{1-\alpha}] = 1 - \Phi(z_{1-\alpha}) = \alpha. \]
Hence, the upper confidence limit θ̃U,n is asymptotically accurate. The lower confidence limit θ̃L,n = θ̂n − n⁻¹ᐟ²σzα is also asymptotically accurate since
\[ \lim_{n\to\infty} P(\theta \geq \tilde{\theta}_{L,n}) = \lim_{n\to\infty} P[n^{1/2}\sigma^{-1}(\hat{\theta}_n - \theta) \leq z_{\alpha}] = \Phi(z_{\alpha}) = \alpha. \]
For two-sided confidence intervals we can use the interval C̃n(α) = [θ̂n − n⁻¹ᐟ²σz₍₁₊α₎⁄₂, θ̂n − n⁻¹ᐟ²σz₍₁₋α₎⁄₂], so that the asymptotic probability that the interval will contain the true parameter value is given by
\[ \lim_{n\to\infty} P[\theta \in \tilde{C}_n(\alpha)] = \lim_{n\to\infty} P[n^{-1/2}\sigma z_{(1-\alpha)/2} \leq \hat{\theta}_n - \theta \leq n^{-1/2}\sigma z_{(1+\alpha)/2}] \]
\[ = \lim_{n\to\infty} P[z_{(1-\alpha)/2} \leq n^{1/2}\sigma^{-1}(\hat{\theta}_n - \theta) \leq z_{(1+\alpha)/2}] = \Phi(z_{(1+\alpha)/2}) - \Phi(z_{(1-\alpha)/2}) = \alpha. \]
Therefore the two-sided interval is also asymptotically accurate.
In the case where σ is unknown we consider the upper confidence limit given by θ̄U,n = θ̂n − n⁻¹ᐟ²σ̂n z₁₋α, where we will assume that σ̂n is a consistent estimator of σ. That is, we assume that σ̂n converges in probability to σ as n → ∞. If n¹ᐟ²σ⁻¹(θ̂n − θ) converges in distribution to Z as n → ∞, then Theorem 4.11 (Slutsky) can be used to show that n¹ᐟ²σ̂n⁻¹(θ̂n − θ) converges in distribution to Z as n → ∞ as well. Therefore,
\[ \lim_{n\to\infty} P(\theta \leq \bar{\theta}_{U,n}) = \lim_{n\to\infty} P(\theta \leq \hat{\theta}_n - n^{-1/2}\hat{\sigma}_n z_{1-\alpha}) = \lim_{n\to\infty} P[n^{1/2}\hat{\sigma}_n^{-1}(\hat{\theta}_n - \theta) \geq z_{1-\alpha}] = 1 - \Phi(z_{1-\alpha}) = \alpha. \]
Hence, the upper confidence limit θ̄U,n is asymptotically accurate. Similar calculations to those used above can then be used to show that the corresponding lower confidence limit and two-sided confidence interval are also asymptotically accurate.
Example 10.17. Let X1, . . . , Xn be a set of independent and identically distributed random variables from a Bernoulli(θ) distribution. Theorem 4.20 (Lindeberg and Lévy) implies that n¹ᐟ²[θ(1 − θ)]⁻¹ᐟ²(X̄n − θ) converges in distribution to Z as n → ∞, where Z has a N(0, 1) distribution. Therefore, an asymptotically accurate 100α% confidence interval for θ could be thought of as
\[ [\bar{X}_n - n^{-1/2}[\theta(1-\theta)]^{1/2} z_{(1+\alpha)/2},\; \bar{X}_n - n^{-1/2}[\theta(1-\theta)]^{1/2} z_{(1-\alpha)/2}], \]
except for the fact that the standard deviation of X̄n in this case depends on θ, the unknown parameter. However, Theorem 3.10 (Weak Law of Large Numbers) implies that X̄n converges in probability to θ as n → ∞, and hence Theorem 3.7 implies that [X̄n(1 − X̄n)]¹ᐟ² converges in probability to [θ(1 − θ)]¹ᐟ² as n → ∞, which provides a consistent estimator of [θ(1 − θ)]¹ᐟ². Therefore, it follows that
\[ \hat{C}_n(\alpha) = [\bar{X}_n - n^{-1/2}[\bar{X}_n(1-\bar{X}_n)]^{1/2} z_{(1+\alpha)/2},\; \bar{X}_n - n^{-1/2}[\bar{X}_n(1-\bar{X}_n)]^{1/2} z_{(1-\alpha)/2}] \]
is an asymptotically accurate 100α% confidence interval for θ. □
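A small simulation indicates how the nominal level of the interval in Example 10.17 is approached as n grows. The sketch below is illustrative only and assumes NumPy and SciPy; the value of θ, the sample sizes, and the seed are arbitrary choices.

```python
# Coverage of the Wald-type Bernoulli interval for several sample sizes.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
alpha, theta, replications = 0.95, 0.3, 20000
z_hi = stats.norm.ppf((1 + alpha) / 2)

for n in (20, 100, 500):
    x = rng.binomial(1, theta, size=(replications, n))
    p_hat = x.mean(axis=1)
    half_width = z_hi * np.sqrt(p_hat * (1 - p_hat) / n)
    covered = (p_hat - half_width <= theta) & (theta <= p_hat + half_width)
    print(f"n = {n:4d}: estimated coverage = {covered.mean():.3f}")
```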
Example 10.18. Let X1, . . . , Xn be a set of independent and identically distributed random variables from a distribution F with mean η and variance θ < ∞. If F corresponds to a N(η, θ) distribution then an exact 100α% confidence interval for θ is given by
\[ C_n(\alpha) = [(n-1)\hat{\theta}_n [\chi^2_{n-1;(1+\alpha)/2}]^{-1},\; (n-1)\hat{\theta}_n [\chi^2_{n-1;(1-\alpha)/2}]^{-1}], \]
which uses the fact that (n − 1)θ⁻¹θ̂n is a pivotal quantity for θ that has a ChiSquared(n − 1) distribution, where θ̂n is the unbiased version of the sample variance. This pivotal function is only valid for the normal distribution. If F is unknown then we can use the fact that Theorem 8.5 implies that n¹ᐟ²(µ4 − θ²)⁻¹ᐟ²(θ̂n − θ) converges in distribution to Z as n → ∞, where Z has a N(0, 1) distribution, to construct an approximate confidence interval for θ. If E(|X1|⁴) < ∞ then Theorems 3.21 and 3.9 imply that µ̂4 − θ̂n² converges in probability to µ4 − θ² as n → ∞, and an asymptotically accurate confidence interval for θ is given by
\[ \hat{C}_n(\alpha) = [\hat{\theta}_n - z_{(1+\alpha)/2} n^{-1/2} (\hat{\mu}_4 - \hat{\theta}_n^2)^{1/2},\; \hat{\theta}_n - z_{(1-\alpha)/2} n^{-1/2} (\hat{\mu}_4 - \hat{\theta}_n^2)^{1/2}]. \quad\Box \]
Example 10.19. Suppose X1, . . . , Xn is a set of independent and identically distributed random variables from a Poisson(θ) distribution. Garwood (1936) suggested a 100α% confidence interval for θ using the form
\[ C_n(\alpha) = [\tfrac{1}{2} n^{-1} \chi^2_{2Y;(1-\alpha)/2},\; \tfrac{1}{2} n^{-1} \chi^2_{2(Y+1);(1+\alpha)/2}], \]
where
\[ Y = n\hat{\theta}_n = \sum_{i=1}^{n} X_i. \]
The coverage probability of this interval is at least α, but the interval may also be quite conservative in some cases. See Figure 9.2.5 of Casella and Berger (2002). An asymptotically accurate confidence interval for θ based on Theorem 4.20 (Lindeberg and Lévy) and Theorem 3.10 (Weak Law of Large Numbers) is given by
\[ \hat{C}_n(\alpha) = [\hat{\theta}_n - z_{(1+\alpha)/2} n^{-1/2} \hat{\theta}_n^{1/2},\; \hat{\theta}_n - z_{(1-\alpha)/2} n^{-1/2} \hat{\theta}_n^{1/2}]. \quad\Box \]
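The conservative behavior of the Garwood interval relative to the interval based on the normal approximation can be seen in a short simulation such as the one sketched below. It is illustrative only, assumes NumPy and SciPy, and the choices of θ, n, and seed are arbitrary.

```python
# Coverage of the Garwood interval versus the Wald-type Poisson interval.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
alpha, theta, n, replications = 0.95, 1.5, 25, 20000
z_hi = stats.norm.ppf((1 + alpha) / 2)

x = rng.poisson(theta, size=(replications, n))
y = x.sum(axis=1)                      # Y = n * theta_hat
theta_hat = y / n

# Garwood interval; the lower limit is taken to be 0 when Y = 0.
lower = np.zeros(replications)
positive = y > 0
lower[positive] = stats.chi2.ppf((1 - alpha) / 2, 2 * y[positive]) / (2 * n)
upper = stats.chi2.ppf((1 + alpha) / 2, 2 * (y + 1)) / (2 * n)
garwood_covered = (lower <= theta) & (theta <= upper)

# Wald-type interval from the same example.
half_width = z_hi * np.sqrt(theta_hat / n)
wald_covered = (theta_hat - half_width <= theta) & (theta <= theta_hat + half_width)

print("Garwood coverage:", garwood_covered.mean())
print("Wald coverage:   ", wald_covered.mean())
```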

More accurate asymptotic properties, such as the order of correctness and


418 PARAMETRIC INFERENCE
accuracy, can be obtained if we assume the framework of the smooth function
model introduced in Section 7.4. That is, consider a sequence of independent
and identically distributed d-dimensional random vectors {Xn }∞ n=1 from a d-
dimensional distribution F . Let µ = E(Xn ) and assume that the components
of µ are finite and that there exists a smooth function g : Rd → R such that
θ = g(µ) and we estimate θ with θ̂n = g(µ̂). Finally, assume that there is a
smooth function h : Rd → R such that

σ 2 = h2 (µ) = lim V (n1/2 θ̂n ).


n→∞

If required, the asymptotic variance will be estimated with σ̂n2 = h2 (X̄n ). Let
Gn (t) = P [n1/2 σ −1 (θ̂n − θ) ≤ t] and Hn (t) = P [n1/2 σ̂n−1 (θ̂n − θ) ≤ t] and
define gα,n and hα,n to be the corresponding α quantiles of Gn and Hn so
that Gn (gα,n ) = α and Hn (hα,n ) = α.

In exactly the same way that one would develop the confidence interval for a population mean, we can develop a confidence interval for θ using the quantiles of Gn and Hn. In particular, if σ is known then it follows that a 100α% upper confidence limit for θ is given by θ̂n,ord = θ̂n − n⁻¹ᐟ²σg₁₋α, and if σ is unknown then it follows that a 100α% upper confidence limit for θ is given by θ̂n,stud = θ̂n − n⁻¹ᐟ²σ̂n h₁₋α. In this case we are borrowing the notation and terminology of Hall (1988a), where θ̂n,ord is called the ordinary confidence limit and θ̂n,stud is called the studentized confidence limit, making reference to the t-interval for the mean where the population standard deviation is replaced by the sample standard deviation. In both cases these upper confidence limits are accurate and correct. See Exercise 18.
Note that in the case where F is a N(θ, σ²) distribution, the distribution Gn is a N(0, 1) distribution and Hn is a T(n − 1) distribution, as discussed above. When F is not a normal distribution, but θ is still contained within the smooth function model, it is often the case that the distributions Gn and Hn are unknown, complicated, or may depend on unknown parameters. In these cases the normal approximation, motivated by the fact that Gn ⇝ Φ and Hn ⇝ Φ as n → ∞, is often used. The normal approximation replaces the quantiles gα and hα with zα, obtaining approximate upper 100α% confidence limits of the form θ̃n,ord = θ̂n − n⁻¹ᐟ²σz₁₋α if σ is known, and θ̃n,stud = θ̂n − n⁻¹ᐟ²σ̂n z₁₋α if σ is unknown. The accuracy and correctness of these confidence limits can be studied with the aid of Edgeworth expansions.
Example 10.20. Suppose that X1, . . . , Xn is a set of independent and identically distributed random variables from a distribution F with parameter θ. Suppose that F and θ fall within the smooth function model described above. A correct and exact 100α% upper confidence limit for θ is given by θ̂n,stud = θ̂n − n⁻¹ᐟ²σ̂n h₁₋α. According to Theorem 7.13, the quantile h₁₋α has an asymptotic expansion of the form h₁₋α = z₁₋α + n⁻¹ᐟ²s1(z₁₋α) + n⁻¹s2(z₁₋α) + O(n⁻³ᐟ²), as n → ∞. Therefore, it follows that the exact 100α% upper confidence limit for θ has asymptotic expansion
\[ \hat{\theta}_{n,stud} = \hat{\theta}_n - n^{-1/2}\hat{\sigma}_n h_{1-\alpha} = \hat{\theta}_n - n^{-1/2}\hat{\sigma}_n[z_{1-\alpha} + n^{-1/2}s_1(z_{1-\alpha}) + n^{-1}s_2(z_{1-\alpha})] + O_p(n^{-2}) \]
\[ = \hat{\theta}_n - n^{-1/2}\hat{\sigma}_n z_{1-\alpha} - n^{-1}\hat{\sigma}_n s_1(z_{1-\alpha}) - n^{-3/2}\hat{\sigma}_n s_2(z_{1-\alpha}) + O_p(n^{-2}) = \tilde{\theta}_{n,stud} + O_p(n^{-1}), \]
as n → ∞, where we have used the fact that σ̂n = σ + Op(n⁻¹ᐟ²) as n → ∞. From this result we find that |θ̂n,stud − θ̃n,stud| = Op(n⁻¹) as n → ∞, so that the normal approximation is first-order correct. □
Example 10.21. Consider the same setup as Example 10.20, where an exact and correct upper confidence limit for θ has asymptotic expansion
\[ \hat{\theta}_{n,stud} = \hat{\theta}_n - n^{-1/2}\hat{\sigma}_n z_{1-\alpha} - n^{-1}\hat{\sigma}_n s_1(z_{1-\alpha}) + O_p(n^{-3/2}), \]
as n → ∞. The polynomial s1 depends on the moments of F, which are usually unknown but can be estimated using the corresponding sample moments. The resulting estimate of s1(z₁₋α), denoted by ŝ1(z₁₋α), has the property that ŝ1(z₁₋α) = s1(z₁₋α) + Op(n⁻¹ᐟ²), as n → ∞. Therefore, we can consider the Edgeworth corrected version of the normal approximation given by the upper confidence limit
\[ \bar{\theta}_{n,stud} = \hat{\theta}_n - n^{-1/2}\hat{\sigma}_n z_{1-\alpha} - n^{-1}\hat{\sigma}_n \hat{s}_1(z_{1-\alpha}) = \hat{\theta}_n - n^{-1/2}\hat{\sigma}_n z_{1-\alpha} - n^{-1}\hat{\sigma}_n s_1(z_{1-\alpha}) + O_p(n^{-3/2}) = \hat{\theta}_{n,stud} + O_p(n^{-3/2}), \]
as n → ∞. Therefore, it follows that θ̄n,stud is second-order correct. □
In order to use Edgeworth expansions to study the accuracy of confidence intervals, let θ̂n(α) denote a generic upper 100α% confidence limit for θ that has an asymptotic expansion of the form
\[ \hat{\theta}_n(\alpha) = \hat{\theta}_n + n^{-1/2}\hat{\sigma}_n z_{\alpha} + n^{-1}\hat{\sigma}_n \hat{u}_1(z_{\alpha}) + n^{-3/2}\hat{\sigma}_n \hat{u}_2(z_{\alpha}) + O_p(n^{-2}), \qquad (10.28) \]
as n → ∞, where û1(zα) = u1(zα) + Op(n⁻¹ᐟ²) for an even polynomial u1 and û2(zα) = u2(zα) + Op(n⁻¹ᐟ²) for an odd polynomial u2. Following the development of Hall (1988a), the 100α% upper confidence interval for θ given
by Cn(α) = (−∞, θ̂n(α)] has coverage probability given by
\[ \pi_n(\alpha) = P[\theta \leq \hat{\theta}_n(\alpha)] = P[\theta \leq \hat{\theta}_n + n^{-1/2}\hat{\sigma}_n z_{\alpha} + n^{-1}\hat{\sigma}_n \hat{u}_1(z_{\alpha}) + n^{-3/2}\hat{\sigma}_n \hat{u}_2(z_{\alpha}) + O_p(n^{-2})] \]
\[ = P[n^{1/2}\hat{\sigma}_n^{-1}(\theta - \hat{\theta}_n) \leq z_{\alpha} + n^{-1/2}\hat{u}_1(z_{\alpha}) + n^{-1}\hat{u}_2(z_{\alpha}) + O_p(n^{-3/2})] \]
\[ = P[n^{1/2}\hat{\sigma}_n^{-1}(\hat{\theta}_n - \theta) \geq -z_{\alpha} - n^{-1/2}\hat{u}_1(z_{\alpha}) - n^{-1}u_2(z_{\alpha}) + O_p(n^{-3/2})], \qquad (10.29) \]
where we have used the fact that û2(zα) = u2(zα) + Op(n⁻¹ᐟ²), as n → ∞. We still have to contend with two random terms on the right hand side of the probability in Equation (10.29). Adding n⁻¹ᐟ²[û1(zα) − u1(zα)] to both sides of the inequality yields
\[ \pi_n(\alpha) = P\{n^{1/2}\hat{\sigma}_n^{-1}(\hat{\theta}_n - \theta) + n^{-1/2}[\hat{u}_1(z_{\alpha}) - u_1(z_{\alpha})] \geq -z_{\alpha} - n^{-1/2}u_1(z_{\alpha}) - n^{-1}u_2(z_{\alpha}) + O_p(n^{-3/2})\}. \qquad (10.30) \]
The simplification of this probability is now taken in two steps. The first result
accounts for how the error term of Op (n−3/2 ) in the event contributes to the
corresponding probability.
Theorem 10.8 (Hall). Under the assumptions of this section it follows that
\[ P\{n^{1/2}\hat{\sigma}_n^{-1}(\hat{\theta}_n - \theta) + n^{-1/2}[\hat{u}_1(z_{\alpha}) - u_1(z_{\alpha})] \geq -z_{\alpha} - n^{-1/2}u_1(z_{\alpha}) - n^{-1}u_2(z_{\alpha}) + O_p(n^{-3/2})\} \]
\[ = P\{n^{1/2}\hat{\sigma}_n^{-1}(\hat{\theta}_n - \theta) + n^{-1/2}[\hat{u}_1(z_{\alpha}) - u_1(z_{\alpha})] \geq -z_{\alpha} - n^{-1/2}u_1(z_{\alpha}) - n^{-1}u_2(z_{\alpha})\} + O(n^{-3/2}), \]
as n → ∞.
See Hall (1986a) for further details on the exact assumptions required for this
result and its proof. The second step of the simplification of the probability
involves relating the distribution function of n1/2 σ̂n−1 (θ̂n −θ), whose expansion
we know from Theorem 7.13, to the distribution function of n1/2 σ̂n−1 (θ̂n − θ) +
n−1/2 [û1 (zα ) − u1 (zα )], whose expansion we are not yet familiar with.
Theorem 10.9 (Hall). Under the assumptions of this section it follows that for every x ∈ R,
\[ P\{n^{1/2}\hat{\sigma}_n^{-1}(\hat{\theta}_n - \theta) + n^{-1/2}[\hat{u}_1(z_{\alpha}) - u_1(z_{\alpha})] \leq x\} = P[n^{1/2}\hat{\sigma}_n^{-1}(\hat{\theta}_n - \theta) \leq x] - n^{-1}u_{\alpha} x\phi(x) + O(n^{-3/2}), \]
as n → ∞, where uα is a constant satisfying
\[ E\{n\hat{\sigma}_n^{-1}(\hat{\theta}_n - \theta)[\hat{u}_1(z_{\alpha}) - u_1(z_{\alpha})]\} = u_{\alpha} + O(n^{-1}), \qquad (10.31) \]
as n → ∞.
See Hall (1986a) for additional details on the exact assumptions required for this result and its proof. The Edgeworth expansion from Theorem 7.13 can be used to find an expression for P[n¹ᐟ²σ̂n⁻¹(θ̂n − θ) ≤ x], and the result of Theorem 10.8 can then be used to relate this to the asymptotic coverage of the upper confidence limit. Picking up where we left off with Equation (10.30), Theorem 10.8 tells us that the error term of order Op(n⁻³ᐟ²) in the event contributes O(n⁻³ᐟ²) to the corresponding probability. Therefore, the result of Theorem 10.9 yields
\[ \pi_n(\alpha) = P\{n^{1/2}\hat{\sigma}_n^{-1}(\hat{\theta}_n - \theta) + n^{-1/2}[\hat{u}_1(z_{\alpha}) - u_1(z_{\alpha})] \geq -z_{\alpha} - n^{-1/2}u_1(z_{\alpha}) - n^{-1}u_2(z_{\alpha})\} + O(n^{-3/2}) \]
\[ = 1 - P[n^{1/2}\hat{\sigma}_n^{-1}(\hat{\theta}_n - \theta) \leq -z_{\alpha} - n^{-1/2}u_1(z_{\alpha}) - n^{-1}u_2(z_{\alpha})] \]
\[ \quad + n^{-1}u_{\alpha}[-z_{\alpha} - n^{-1/2}u_1(z_{\alpha}) - n^{-1}u_2(z_{\alpha})]\,\phi[-z_{\alpha} - n^{-1/2}u_1(z_{\alpha}) - n^{-1}u_2(z_{\alpha})] + O(n^{-3/2}), \qquad (10.32) \]
as n → ∞, where uα is defined in Equation (10.31). The remainder of this argument is now concerned with simplifying the expressions in Equation (10.32). We begin by noting that Theorem 7.13 implies that
\[ P[n^{1/2}\hat{\sigma}_n^{-1}(\hat{\theta}_n - \theta) \leq x] = \Phi(x) + n^{-1/2}v_1(x)\phi(x) + n^{-1}v_2(x)\phi(x) + O(n^{-3/2}), \]
as n → ∞. Therefore, it follows that
\[ P[n^{1/2}\hat{\sigma}_n^{-1}(\hat{\theta}_n - \theta) \leq -z_{\alpha} - n^{-1/2}u_1(z_{\alpha}) - n^{-1}u_2(z_{\alpha})] = \Phi[-z_{\alpha} - n^{-1/2}u_1(z_{\alpha}) - n^{-1}u_2(z_{\alpha})] \]
\[ \quad + n^{-1/2}v_1[-z_{\alpha} - n^{-1/2}u_1(z_{\alpha}) - n^{-1}u_2(z_{\alpha})]\,\phi[-z_{\alpha} - n^{-1/2}u_1(z_{\alpha}) - n^{-1}u_2(z_{\alpha})] \]
\[ \quad + n^{-1}v_2[-z_{\alpha} - n^{-1/2}u_1(z_{\alpha}) - n^{-1}u_2(z_{\alpha})]\,\phi[-z_{\alpha} - n^{-1/2}u_1(z_{\alpha}) - n^{-1}u_2(z_{\alpha})] + O(n^{-3/2}), \qquad (10.33) \]
as n → ∞. We must now simplify the terms in Equation (10.33). From Ex-
ample 1.18 we have that the first term in Equation (10.33) has the form
\[ \Phi(-z_{\alpha}) - [n^{-1/2}u_1(z_{\alpha}) + n^{-1}u_2(z_{\alpha})]\phi(-z_{\alpha}) + \tfrac{1}{2}[n^{-1/2}u_1(z_{\alpha}) + n^{-1}u_2(z_{\alpha})]^2\phi'(-z_{\alpha}) + O(n^{-3/2}) \]
\[ = 1 - \alpha - n^{-1/2}u_1(z_{\alpha})\phi(z_{\alpha}) - n^{-1}u_2(z_{\alpha})\phi(z_{\alpha}) + \tfrac{1}{2}n^{-1}u_1^2(z_{\alpha})\phi'(-z_{\alpha}) + O(n^{-3/2}), \qquad (10.34) \]
as n → ∞. Recalling that φ′(t) = −tφ(t), it follows that the expression in Equation (10.34) is equivalent to
\[ 1 - \alpha - n^{-1/2}u_1(z_{\alpha})\phi(z_{\alpha}) - n^{-1}u_2(z_{\alpha})\phi(z_{\alpha}) + \tfrac{1}{2}n^{-1}z_{\alpha}u_1^2(z_{\alpha})\phi(z_{\alpha}) + O(n^{-3/2}), \qquad (10.35) \]
as n → ∞. For the second term in Equation (10.33) we can either use the
exact form of the polynomial v1 as specified in Section 7.6, or we can simply
use Theorem 1.15 to conclude that v1(t + δ) = v1(t) + δv1′(t) + O(δ²), as δ → 0. Therefore, keeping terms of order O(n⁻¹ᐟ²) or larger yields
\[ v_1[-z_{\alpha} - n^{-1/2}u_1(z_{\alpha}) - n^{-1}u_2(z_{\alpha})] = v_1(-z_{\alpha}) - [n^{-1/2}u_1(z_{\alpha}) + n^{-1}u_2(z_{\alpha})]v_1'(-z_{\alpha}) + O(n^{-1}) \]
\[ = v_1(-z_{\alpha}) - n^{-1/2}u_1(z_{\alpha})v_1'(-z_{\alpha}) + O(n^{-1}), \qquad (10.36) \]
as n → ∞. Now v1 is an even function, and hence v1′ is an odd function. Therefore, the expression in Equation (10.36) equals
\[ v_1(z_{\alpha}) + n^{-1/2}u_1(z_{\alpha})v_1'(z_{\alpha}) + O(n^{-1}), \qquad (10.37) \]
as n → ∞. Working further with the second term in Equation (10.33), we note that Theorem 1.15 implies that
\[ \phi[-z_{\alpha} - n^{-1/2}u_1(z_{\alpha}) - n^{-1}u_2(z_{\alpha})] = \phi(-z_{\alpha}) - [n^{-1/2}u_1(z_{\alpha}) + n^{-1}u_2(z_{\alpha})]\phi'(-z_{\alpha}) + O(n^{-1}) \]
\[ = \phi(z_{\alpha}) - n^{-1/2}z_{\alpha}u_1(z_{\alpha})\phi(z_{\alpha}) + O(n^{-1}), \qquad (10.38) \]
as n → ∞. Combining the expressions of Equations (10.37) and (10.38) yields
\[ n^{-1/2}v_1[-z_{\alpha} - n^{-1/2}u_1(z_{\alpha}) - n^{-1}u_2(z_{\alpha})]\,\phi[-z_{\alpha} - n^{-1/2}u_1(z_{\alpha}) - n^{-1}u_2(z_{\alpha})] \]
\[ = n^{-1/2}v_1(z_{\alpha})\phi(z_{\alpha}) + n^{-1}u_1(z_{\alpha})v_1'(z_{\alpha})\phi(z_{\alpha}) - n^{-1}z_{\alpha}u_1(z_{\alpha})v_1(z_{\alpha})\phi(z_{\alpha}) + O(n^{-3/2}), \qquad (10.39) \]
as n → ∞. The third term in Equation (10.33) requires less sophisticated arguments, as we are only retaining terms of order O(n⁻¹) and larger, and the leading coefficient on this term is O(n⁻¹). Hence, similar calculations to those used above can be used to show that
\[ n^{-1}v_2[-z_{\alpha} - n^{-1/2}u_1(z_{\alpha}) - n^{-1}u_2(z_{\alpha})]\,\phi[-z_{\alpha} - n^{-1/2}u_1(z_{\alpha}) - n^{-1}u_2(z_{\alpha})] = -n^{-1}v_2(z_{\alpha})\phi(z_{\alpha}) + O(n^{-3/2}), \qquad (10.40) \]
as n → ∞, where we have used the fact that v2 is an odd function. Combining the results of Equations (10.33), (10.35), (10.39), and (10.40) yields
\[ P[n^{1/2}\hat{\sigma}_n^{-1}(\hat{\theta}_n - \theta) \leq -z_{\alpha} - n^{-1/2}u_1(z_{\alpha}) - n^{-1}u_2(z_{\alpha})] \]
\[ = 1 - \alpha - n^{-1/2}u_1(z_{\alpha})\phi(z_{\alpha}) - n^{-1}u_2(z_{\alpha})\phi(z_{\alpha}) + \tfrac{1}{2}n^{-1}z_{\alpha}u_1^2(z_{\alpha})\phi(z_{\alpha}) \]
\[ \quad + n^{-1/2}v_1(z_{\alpha})\phi(z_{\alpha}) + n^{-1}u_1(z_{\alpha})v_1'(z_{\alpha})\phi(z_{\alpha}) - n^{-1}z_{\alpha}u_1(z_{\alpha})v_1(z_{\alpha})\phi(z_{\alpha}) - n^{-1}v_2(z_{\alpha})\phi(z_{\alpha}) + O(n^{-3/2}) \]
\[ = 1 - \alpha + n^{-1/2}[v_1(z_{\alpha}) - u_1(z_{\alpha})]\phi(z_{\alpha}) + n^{-1}[\tfrac{1}{2}z_{\alpha}u_1^2(z_{\alpha}) - u_2(z_{\alpha}) + u_1(z_{\alpha})v_1'(z_{\alpha}) - z_{\alpha}u_1(z_{\alpha})v_1(z_{\alpha}) - v_2(z_{\alpha})]\phi(z_{\alpha}) + O(n^{-3/2}), \]
as n → ∞. To complete simplifying the expression in Equation (10.32), we note that
\[ n^{-1}u_{\alpha}[-z_{\alpha} - n^{-1/2}u_1(z_{\alpha}) - n^{-1}u_2(z_{\alpha})]\,\phi[-z_{\alpha} - n^{-1/2}u_1(z_{\alpha}) - n^{-1}u_2(z_{\alpha})] = -n^{-1}u_{\alpha}z_{\alpha}\phi(z_{\alpha}) + O(n^{-3/2}), \]
as n → ∞. Therefore, it follows that
\[ \pi_n(\alpha) = \alpha + n^{-1/2}[u_1(z_{\alpha}) - v_1(z_{\alpha})]\phi(z_{\alpha}) - n^{-1}[\tfrac{1}{2}z_{\alpha}u_1^2(z_{\alpha}) - u_2(z_{\alpha}) + u_1(z_{\alpha})v_1'(z_{\alpha}) - z_{\alpha}u_1(z_{\alpha})v_1(z_{\alpha}) - v_2(z_{\alpha}) + u_{\alpha}z_{\alpha}]\phi(z_{\alpha}) + O(n^{-3/2}), \qquad (10.41) \]
as n → ∞. From Equation (10.41) it is clear that the determining factor in the accuracy of the one-sided confidence interval is the relationship between the polynomials u1 and v1. In particular, if u1(zα) = −s1(zα) = v1(zα), then the expansion for the coverage probability simplifies to
\[ \pi_n(\alpha) = \alpha - n^{-1}[\tfrac{1}{2}z_{\alpha}v_1^2(z_{\alpha}) - u_2(z_{\alpha}) + v_1(z_{\alpha})v_1'(z_{\alpha}) - v_2(z_{\alpha}) + u_{\alpha}z_{\alpha}]\phi(z_{\alpha}) + O(n^{-3/2}), \]
as n → ∞, and the resulting interval is second-order accurate. On the other hand, if u1(zα) ≠ v1(zα), then the interval is only first-order accurate.
Example 10.22. Consider the general setup of Example 10.20 with
\[ \tilde{\theta}_{n,stud}(\alpha) = \hat{\theta}_n - n^{-1/2}\hat{\sigma}_n z_{1-\alpha} \]
denoting the upper confidence limit given by the normal approximation. In terms of the generic expansion given in Equation (10.28) we have that u1(zα) = u2(zα) = 0 for all α ∈ (0, 1). Therefore, Equation (10.41) implies that the coverage probability of θ̃n,stud(α) has asymptotic expansion
\[ \tilde{\pi}_n(\alpha) = \alpha - n^{-1/2}v_1(z_{\alpha})\phi(z_{\alpha}) + n^{-1}[v_2(z_{\alpha}) - u_{\alpha}z_{\alpha}]\phi(z_{\alpha}) + O(n^{-3/2}), \]
as n → ∞. Hence, the normal approximation is first-order accurate. □
Example 10.23. Consider the general setup of Example 10.20 with Edgeworth corrected upper confidence limit given by
\[ \bar{\theta}_{n,stud}(\alpha) = \hat{\theta}_n - n^{-1/2}\hat{\sigma}_n z_{1-\alpha} - n^{-1}\hat{\sigma}_n \hat{s}_1(z_{1-\alpha}) = \hat{\theta}_n + n^{-1/2}\hat{\sigma}_n z_{\alpha} - n^{-1}\hat{\sigma}_n \hat{s}_1(z_{\alpha}). \]
In terms of the generic expansion given in Equation (10.28) we have that u1(zα) = −s1(zα). Therefore, Equation (10.41) implies that the coverage probability of θ̄n,stud(α) has asymptotic expansion α + O(n⁻¹), as n → ∞. Therefore, the Edgeworth corrected upper confidence limit for θ is second-order accurate. □
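The practical effect of the Edgeworth correction on one-sided coverage can be illustrated by simulation. The sketch below is illustrative only; it assumes NumPy and SciPy, uses a skewed exponential population with arbitrary sample size and seed, and borrows the estimate ŝ1(z) = −(1/6)γ̂n(2z² + 1) that is quoted later in Example 10.31 for the mean functional.

```python
# One-sided coverage: normal approximation versus Edgeworth-corrected limit.
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
alpha, n, replications = 0.95, 20, 50000
theta = 1.0                                  # mean of the Exponential population
z = stats.norm.ppf(1 - alpha)                # z_{1-alpha}, negative for alpha > 1/2

x = rng.exponential(scale=theta, size=(replications, n))
xbar = x.mean(axis=1)
sd = x.std(axis=1, ddof=1)
mu3_hat = ((x - xbar[:, None]) ** 3).mean(axis=1)
gamma_hat = mu3_hat / sd ** 3                # estimated standardized skewness
s1_hat = -(gamma_hat / 6.0) * (2.0 * z ** 2 + 1.0)

upper_normal = xbar - sd * z / np.sqrt(n)
upper_edgeworth = xbar - sd * z / np.sqrt(n) - sd * s1_hat / n

print("nominal level:                ", alpha)
print("normal approximation coverage:", np.mean(theta <= upper_normal))
print("Edgeworth-corrected coverage: ", np.mean(theta <= upper_edgeworth))
```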
A two-sided 100α% confidence interval for θ is given by
\[ [\hat{\theta}_n - n^{-1/2}\hat{\sigma}_n h_{(1+\alpha)/2},\; \hat{\theta}_n - n^{-1/2}\hat{\sigma}_n h_{(1-\alpha)/2}], \qquad (10.42) \]
which has coverage probability πn[(1 + α)/2] − πn[(1 − α)/2], where we are using the definition of πn(α) from earlier. Therefore, the coverage of the two-sided interval in Equation (10.42) is given by Equation (10.41) as
\[ \tfrac{1}{2}(1 + \alpha) - \tfrac{1}{2}(1 - \alpha) + n^{-1/2}[u_1(z_{(1+\alpha)/2}) - v_1(z_{(1+\alpha)/2})]\phi(z_{(1+\alpha)/2}) - n^{-1/2}[u_1(z_{(1-\alpha)/2}) - v_1(z_{(1-\alpha)/2})]\phi(z_{(1-\alpha)/2}) + O(n^{-1}), \]
as n → ∞. Noting that z₍₁₊α₎⁄₂ = −z₍₁₋α₎⁄₂ for all α ∈ (0, 1) and that u1, v1, and φ are even functions, it follows that u1(z₍₁₊α₎⁄₂) = u1(z₍₁₋α₎⁄₂), v1(z₍₁₊α₎⁄₂) = v1(z₍₁₋α₎⁄₂), and φ(z₍₁₊α₎⁄₂) = φ(z₍₁₋α₎⁄₂), so that the coverage of the two-sided interval is given by α + O(n⁻¹), as n → ∞, regardless of u1. Therefore, within the
smooth function model, second-order correctness has a major influence on
the coverage probability of one-sided confidence intervals, but its influence on
two-sided confidence intervals is greatly diminished.
The view of the accuracy and correctness of confidence intervals based on
Edgeworth expansion theory is useful, but also somewhat restrictive. This is
due to the fact that we require the added structure of the smooth function
model to obtain the required asymptotic expansions. For example, these re-
sults cannot be used to find the order accuracy of population quantiles based
on the quantiles of the empirical distribution function. This is because these
quantile estimates do not fit within the framework of the smooth function
model. The related expansion theory for sample quantiles turns out to be
quite different from what is presented here. See Appendix IV of Hall (1992) for
further details on this problem. Another limitation comes from the smooth-
ness assumptions that must be made on the distribution F . Therefore, the
theory presented above is not able to provide further details on the order of
correctness or accuracy of the approximate confidence intervals studied in Ex-
amples 10.17 and 10.19. Esseen (1945) provides Edgeworth expansions for the
case of lattice random variables, and again the resulting expansions are quite
different from those presented here. Another interesting case which we have
not studied in this section is the case where θ is a p-dimensional vector, and
we seek a region Cn (α) such that P [θ ∈ Cn (α)] = α. In this case there is a
multivariate form of the smooth function model and well defined Edgeworth
expansion theory that closely follows the general form of what is presented
here, though there are some notable differences. See Section 4.2 of Hall (1992)
and Chapter 3 of Polansky (2007) for further details. A very general view of
coverage processes, well beyond that of just confidence intervals and regions,
can be found in Hall (1988b).

10.4 Statistical Hypothesis Tests

Statistical hypothesis tests are procedures that decide the truth of a hypoth-
esis about an unknown population parameter based on a sample from the
population. The decision is usually made in accordance with a known rate of
error specified by the researcher.
For this section we consider X1 , . . . , Xn to be a set of independent and iden-
tically distributed random variables from a distribution F with functional
parameter θ that has parameter space Ω. A hypothesis test begins by divid-
ing the parameter space into a partition of two regions called Ω0 and Ω1 for
which there are associated hypotheses H0 : θ ∈ Ω0 and H1 : θ ∈ Ω1 , respec-
tively. The structure of the test is such that the hypothesis H0 : θ ∈ Ω0 , called
the null hypothesis, is initially assumed to be true. The data are observed
and evidence that H0 : θ ∈ Ω0 is actually false is extracted from the data.
If the amount of evidence against H0 : θ ∈ Ω0 is large enough, as specified
by an acceptable error rate given by the researcher, then the null hypothesis
H0 : θ ∈ Ω0 is rejected as being false, and the hypothesis H1 : θ ∈ Ω1 , called
the alternative hypothesis, is accepted as truth. Otherwise we fail to reject the
null hypothesis and we conclude that there was not sufficient evidence in the
observed data to conclude that the null hypothesis is false.
The measure of evidence in the observed data against the null hypothesis is
measured by a statistic Tn = Tn (X1 , . . . , Xn ), called a test statistic. Prior
to observing the sample X1 , . . . , Xn the researcher specifies a set R that is
a subset of the range of Tn . This set is constructed so that when Tn ∈ R,
the researcher considers the evidence in the sample to be sufficient to warrant
rejection of the null hypothesis. That is, if Tn ∈ R, then the null hypothesis
is rejected and if Tn ∉ R then the null hypothesis is not rejected. The set R
is called the rejection region.
The rejection region is usually constructed so that the probability of rejecting
the null hypothesis when the null hypothesis is really true is set to a level
α called the significance level. That is, α = P (Tn ∈ R|θ ∈ Ω0 ). The error
corresponding to this conclusion is called the Type I error, and it should be
noted that the probability of this error may depend on the specific value of θ in
Ω0. In this case we control the largest probability of a Type I error for points in Ω0 to be α, with a slight difference in terminology separating the cases where the probability α can be achieved and where it cannot. A separate term is used when the error rate achieves the probability α only asymptotically.
Definition 10.8. Consider a test of a null hypothesis H0: θ ∈ Ω0 with test statistic Tn and rejection region R.
1. The test is a size α test if
\[ \sup_{\theta_0 \in \Omega_0} P(T_n \in R \,|\, \theta = \theta_0) = \alpha. \]
2. The test is a level α test if
\[ \sup_{\theta_0 \in \Omega_0} P(T_n \in R \,|\, \theta = \theta_0) \leq \alpha. \]
3. The test has asymptotic size equal to α if
\[ \lim_{n\to\infty} P(T_n \in R \,|\, \theta = \theta_0) = \alpha. \]
4. The test is kth-order accurate if
\[ P(T_n \in R \,|\, \theta = \theta_0) = \alpha + O(n^{-k/2}), \]
as n → ∞.
The other type of error that one can make is a Type II error, which occurs when
the null hypothesis is not rejected but the alternative hypothesis is actually
true. The probability of avoiding this error is called the power of the test.
This probability, taken as a function of θ, is called the power function of the
test and will be denoted as βn (θ). That is, βn (θ) = P (Tn ∈ R|θ). The domain
of the power function can be taken to be Ω so that the function βn (θ) will
also reflect the probability of a Type I error when θ ∈ Ω0 . In this context we
would usually want βn (θ) when θ ∈ Ω0 to be smaller than βn (θ) when θ ∈ Ω1 .
That is, we should have a larger probability of rejecting the null hypothesis
when the alternative is true than when the null is true. A test that has this
property is called an unbiased test.
Definition 10.9. A test with power function βn (θ) is unbiased if βn (θ0 ) ≤
βn (θ1 ) for all θ0 ∈ Ω0 and θ1 ∈ Ω1 .

While unbiased tests are important, there are also asymptotic considerations
that can be accounted for as well. The most common of these is that a test
should be consistent against values of θ in the alternative hypothesis. This
is an extension of the idea of consistency in the case of point estimation.
In point estimation we like to have consistent estimators, that is ones that
converge to the correct value of θ as n → ∞. The justification for requiring
this property is that if we could examine the entire population then we should
know the parameter value exactly. The consistency concept is extended to
statistical hypothesis tests by requiring that if we could examine the entire
population, then we should be able to make a correct decision. That is, if the
alternative hypothesis is true, then we should reject with probability one as
n → ∞. Note that this behavior is only specified for points in the alternative
hypothesis since we always insist on erroneously rejecting the null hypothesis
with probability at most α no matter what the sample size is.
Definition 10.10. Consider a test of H0: θ ∈ Ω0 against H1: θ ∈ Ω1 that has power function βn(θ). The test is consistent against the alternative θ ∈ Ω1 if
\[ \lim_{n\to\infty} \beta_n(\theta) = 1. \]
If the test is consistent against all alternatives θ ∈ Ω1, then the test is called a consistent test.
A consistent hypothesis test assures us that, when the sample is large enough, we will reject the null hypothesis when the alternative hypothesis is true.
For fixed sample sizes the probability of rejecting the null hypothesis when
the alternative is true depends on the actual value of θ in the alternative
hypothesis. Many tests will perform well when the actual value of θ is far
away from the null hypothesis, and in the limiting case have power equal to one. That is,
\[ \lim_{d(\theta, \Omega_0) \to \infty} \beta_n(\theta) = 1, \]
where d is a measure of distance between θ and the set Ω0 . Of greater interest


then is how the tests will perform for values of θ that are in the alternative,
but are close to the boundary between Ω0 and Ω1 . One way to study this
behavior is to consider a sequence of points in the alternative hypothesis that
depend on n that converge to a point on the boundary of Ω0 . The rate at which
the points converge to the boundary must be chosen carefully, otherwise the
limit will be either α, if the sequence converges too quickly to the boundary,
or will be one if the sequence does not converge fast enough. For many tests,
choosing this sequence so that d(θ, Ω0 ) = O(n−1/2 ) as n → ∞ will ensure
that the resulting limit of the power function evaluated on this sequence will
be between α and one. Such a limit is called the asymptotic power of the test
against the specified sequence of points in the alternative hypothesis.
Definition 10.11. Consider a test of H0: θ ∈ Ω0 against H1: θ ∈ Ω1 that has power function βn(θ). Let {θn}_{n=1}^∞ be a sequence of points in Ω1 such that
\[ \lim_{n\to\infty} d(\theta_n, \Omega_0) = 0, \]
at a specified rate. Then the asymptotic power of the test against the sequence of alternatives {θn}_{n=1}^∞ is given by
\[ \lim_{n\to\infty} \beta_n(\theta_n). \]
In this section we will consider some asymptotic properties of hypothesis tests for the case when Ω is the real line and Ω0 and Ω1 are intervals. The null hypothesis will generally have the form H0: θ ∈ Ω0 = (−∞, θ0] or H0: θ ∈ Ω0 = [θ0, ∞), with the corresponding alternative hypotheses given by H1: θ ∈ Ω1 = (θ0, ∞) and H1: θ ∈ Ω1 = (−∞, θ0), respectively. We will first consider the case of asymptotically normal test statistics. In particular we will consider test statistics of the form Zn = n¹ᐟ²σ⁻¹(θ̂n − θ0), where Zn converges in distribution to Z as n → ∞, Z is a N(0, 1) random variable, and σ is a known constant that does not depend on n. Consider the problem of testing the null hypothesis H0: θ ≤ θ0 against the alternative hypothesis H1: θ > θ0. Because the alternative hypothesis specifies that θ is larger than θ0, we will consider rejecting the null hypothesis when Zn is too large, that is, when Zn > r_{α,n}, where {r_{α,n}}_{n=1}^∞ is a sequence of real numbers and α is the specified significance level of the test. If the distribution of Zn under the null hypothesis is known then the sequence {r_{α,n}}_{n=1}^∞ can be specified so that the test has level or size α in accordance with Definition 10.8. If this distribution is unknown then the test can be justified from an asymptotic viewpoint by specifying that r_{α,n} → z₁₋α as n → ∞. In this case we have that
\[ \lim_{n\to\infty} P(Z_n \geq r_{\alpha,n} \,|\, \theta = \theta_0) = \alpha. \]
Now consider this probability when θ = θl, a point that is away from the boundary of Ω0, so that θl is strictly less than θ0. In this case we have that
\[ P(Z_n \geq r_{\alpha,n} \,|\, \theta = \theta_l) = P\left[ \frac{n^{1/2}(\hat{\theta}_n - \theta_0)}{\sigma} \geq r_{\alpha,n} \,\Big|\, \theta = \theta_l \right] = P\left[ \frac{n^{1/2}(\hat{\theta}_n - \theta_l)}{\sigma} \geq r_{\alpha,n} + \frac{n^{1/2}(\theta_0 - \theta_l)}{\sigma} \,\Big|\, \theta = \theta_l \right] \leq P\left[ \frac{n^{1/2}(\hat{\theta}_n - \theta_l)}{\sigma} \geq r_{\alpha,n} \,\Big|\, \theta = \theta_l \right]. \]
Therefore, it follows that
\[ \lim_{n\to\infty} P(Z_n \geq r_{\alpha,n} \,|\, \theta = \theta_l) \leq \lim_{n\to\infty} P(Z_n \geq r_{\alpha,n} \,|\, \theta = \theta_0) = \alpha, \]
for all θl < θ0. Hence, Definition 10.8 implies that this test has asymptotic size α. Similar arguments can be used to show that this test is also unbiased. See Exercise 26.
Example 10.24. Let X1 , . . . , Xn be a set of independent and identically dis-
tributed random variables following an Exponential(θ) distribution where
θ ∈ Ω = (0, ∞). We will consider testing the null hypothesis H0 : θ ≤ θ0
against the alternative hypothesis H1 : θ > θ0 for some specified θ0 > 0. The-
orem 4.20 (Lindeberg-Lévy) implies that the test statistic Zn = n1/2 θ0−1 (X̄n −
θ0 ) converges in distribution to a N(0, 1) distribution as n → ∞ when θ = θ0 .
Therefore, rejecting the null hypothesis when Zn exceeds z1−α will result in a
test with asymptotic level equal to α. 
Example 10.25. Let X1, . . . , Xn be a set of independent and identically distributed random variables following a Poisson(θ) distribution where θ ∈ Ω = (0, ∞). We will consider testing the null hypothesis H0: θ ≤ θ0 against the alternative hypothesis H1: θ > θ0 for some specified θ0 > 0. Once again, Theorem 4.20 implies that the test statistic Zn = n¹ᐟ²θ0⁻¹ᐟ²(X̄n − θ0) converges in distribution to a N(0, 1) distribution as n → ∞ when θ = θ0. Therefore, rejecting the null hypothesis when Zn exceeds z₁₋α will result in a test with asymptotic level equal to α. □
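The asymptotic level claimed in Example 10.25 can be examined numerically. The following sketch is illustrative only; it assumes NumPy and SciPy, and the choices of θ0, sample sizes, and seed are arbitrary. It estimates the rejection probability of the test at θ = θ0.

```python
# Monte Carlo estimate of the size of the Poisson test at theta = theta_0.
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
alpha, theta0, replications = 0.05, 2.0, 50000
z_crit = stats.norm.ppf(1 - alpha)

for n in (10, 50, 250):
    x = rng.poisson(theta0, size=(replications, n))
    z_n = np.sqrt(n) * (x.mean(axis=1) - theta0) / np.sqrt(theta0)
    print(f"n = {n:4d}: estimated size = {np.mean(z_n > z_crit):.4f}")
```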
The variance σ² can either be known as a separate parameter, or can be known through the specification of the null hypothesis, as in the problem studied in Example 10.25. In some cases, however, σ will not be known. In these cases there is often a consistent estimator of σ that can be used. That is, we consider the case where σ can be estimated by σ̂n, where σ̂n converges in probability to σ as n → ∞. Theorem 4.11 (Slutsky) implies that n¹ᐟ²σ̂n⁻¹(θ̂n − θ0) converges in distribution to Z as n → ∞, and hence
\[ \lim_{n\to\infty} P[n^{1/2}\hat{\sigma}_n^{-1}(\hat{\theta}_n - \theta_0) \geq r_{\alpha,n}] = \alpha, \]
as long as r_{α,n} → z₁₋α as n → ∞.
Example 10.26. Let X1, . . . , Xn be a set of independent and identically distributed random variables following a distribution F with mean θ and finite variance σ². Let σ̂n be the usual sample standard deviation, which is a consistent estimator of σ by Theorem 3.21 as long as E(|Xn|⁴) < ∞. Therefore we have that Tn = n¹ᐟ²σ̂n⁻¹(X̄n − θ0) converges in distribution to Z as n → ∞, and a test that rejects the null hypothesis H0: θ ≤ θ0 in favor of the alternative hypothesis H1: θ > θ0 when Tn > z₁₋α is a test with asymptotic level equal to α. □
Example 10.27. Let X1, . . . , Xn be a set of independent and identically distributed random variables from a N(η, θ) distribution. Consider the problem of testing the null hypothesis H0: θ ≤ θ0 against the alternative hypothesis H1: θ > θ0. Let θ̂n be the usual unbiased version of the sample variance. Under these assumptions it follows that (n − 1)θ⁻¹θ̂n has a ChiSquared(n − 1) distribution, which motivates using the test statistic Xn = (n − 1)θ0⁻¹θ̂n, where the null hypothesis will be rejected when Xn > χ²_{n−1;1−α}. This is an exact test under these assumptions. An approximate test can be motivated by Theorem 8.5, where we see that for any F that has a finite fourth moment it follows that
\[ Z_n = \frac{n^{1/2}(\hat{\theta}_n - \theta_0)}{(\mu_4 - \theta^2)^{1/2}} \xrightarrow{d} Z \]
as n → ∞, where Z has a N(0, 1) distribution and µ4 is the fourth moment of F. Of course the statistic Zn could not be used to test the null hypothesis because the denominator is unknown, and is not fully specified by the null hypothesis. However, under the assumption that the fourth moment is finite, a consistent estimator of the denominator can be obtained by replacing the population moments with their sample counterparts. Therefore, it also follows that
\[ T_n = \frac{n^{1/2}(\hat{\theta}_n - \theta_0)}{(\hat{\mu}_4 - \hat{\theta}_n^2)^{1/2}} \xrightarrow{d} Z \]
as n → ∞, and hence a test that rejects the null hypothesis in favor of the alternative hypothesis when Tn > z₁₋α is a test with asymptotic level equal to α. □
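A short simulation also illustrates the behavior of the approximate variance test of Example 10.27 for a non-normal population, with the fourth moment estimated by its sample counterpart. The sketch below is illustrative only and assumes NumPy and SciPy; the uniform population, sample size, and seed are arbitrary choices.

```python
# Approximate variance test with estimated fourth moment, non-normal data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
alpha, theta0, n, replications = 0.05, 1.0, 200, 20000
z_crit = stats.norm.ppf(1 - alpha)

# Uniform(-sqrt(3), sqrt(3)) has variance 1, so theta = theta0 under the null.
x = rng.uniform(-np.sqrt(3.0), np.sqrt(3.0), size=(replications, n))
theta_hat = x.var(axis=1, ddof=1)                       # unbiased sample variance
mu4_hat = ((x - x.mean(axis=1, keepdims=True)) ** 4).mean(axis=1)
t_n = np.sqrt(n) * (theta_hat - theta0) / np.sqrt(mu4_hat - theta_hat ** 2)
print("estimated size:", np.mean(t_n > z_crit), "(nominal", alpha, ")")
```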
Under some additional assumptions we can address the issue of consistency. In particular we will assume that
\[ P\left[ \frac{n^{1/2}(\hat{\theta}_n - \theta_u)}{\sigma} \leq t \,\Big|\, \theta = \theta_u \right] \to \Phi(t), \]
as n → ∞ for all t ∈ R and θu ∈ (θ0, ∞). It then follows that
\[ \beta_n(\theta_u) = P(Z_n \geq r_{\alpha,n} \,|\, \theta = \theta_u) = P\left[ \frac{n^{1/2}(\hat{\theta}_n - \theta_0)}{\sigma} \geq r_{\alpha,n} \,\Big|\, \theta = \theta_u \right] = P\left[ \frac{n^{1/2}(\hat{\theta}_n - \theta_u)}{\sigma} \geq r_{\alpha,n} - \frac{n^{1/2}(\theta_u - \theta_0)}{\sigma} \,\Big|\, \theta = \theta_u \right], \]
where we note that r_{α,n} − n¹ᐟ²σ⁻¹(θu − θ0) → −∞ as n → ∞, under the assumption that r_{α,n} → z₁₋α as n → ∞. It then follows that
\[ \lim_{n\to\infty} \beta_n(\theta_u) = 1, \]
for all θu ∈ (θ0, ∞). Therefore, Definition 10.10 implies that the test is consistent against any alternative θu ∈ (θ0, ∞). See Exercise 27 for further details.
The asymptotic power of the test can be studied using the sequence of alternative hypotheses {θ_{1,n}}_{n=1}^∞ where θ_{1,n} = θ0 + n⁻¹ᐟ²δ and δ is a positive real constant. Note that θ_{1,n} → θ0 as n → ∞ and that θ_{1,n} = θ0 + O(n⁻¹ᐟ²) as n → ∞. For this sequence we have that
\[ \beta_n(\theta_{1,n}) = P(Z_n \geq r_{\alpha,n} \,|\, \theta = \theta_{1,n}) = P\left[ \frac{n^{1/2}(\hat{\theta}_n - \theta_{1,n})}{\sigma} \geq r_{\alpha,n} - \frac{n^{1/2}(\theta_{1,n} - \theta_0)}{\sigma} \,\Big|\, \theta = \theta_{1,n} \right] = P\left[ \frac{n^{1/2}(\hat{\theta}_n - \theta_{1,n})}{\sigma} \geq r_{\alpha,n} - \sigma^{-1}\delta \,\Big|\, \theta = \theta_{1,n} \right]. \]
As in the above case we must make some additional assumptions about the limiting distribution of the sequence n¹ᐟ²σ⁻¹(θ̂n − θ_{1,n}) when θ = θ_{1,n}. In this case we will assume that
\[ P\left[ \frac{n^{1/2}(\hat{\theta}_n - \theta_{1,n})}{\sigma} \leq t \,\Big|\, \theta = \theta_{1,n} \right] \to \Phi(t), \]
as n → ∞. Therefore, using the same arguments as above it follows that
\[ \lim_{n\to\infty} \beta_n(\theta_{1,n}) = 1 - \Phi(z_{1-\alpha} - \sigma^{-1}\delta). \]
The symmetry of the Normal distribution can be used to conclude that
\[ \lim_{n\to\infty} \beta_n(\theta_{1,n}) = \Phi(\sigma^{-1}\delta - z_{1-\alpha}). \qquad (10.43) \]
For δ near zero, a further approximation based on the results of Theorem 1.15 can be used to find that
\[ \lim_{n\to\infty} \beta_n(\theta_{1,n}) = \Phi(-z_{1-\alpha}) + \sigma^{-1}\delta\phi(z_{\alpha}) + O(\delta^2) = \alpha + \sigma^{-1}\delta\phi(z_{\alpha}) + O(\delta^2), \]
as δ → 0.
Note that in some cases, such as in Example 10.26, the standard deviation σ depends on θ. That is, the standard deviation is σ(θ), for some function σ. Hence, considering a sequence of alternatives {θ_{1,n}}_{n=1}^∞ implies that we must also consider a sequence of standard deviations {σ_{1,n}}_{n=1}^∞, where σ_{1,n} = σ(θ_{1,n}) for all n ∈ N. Therefore, if we can assume that σ(θ) is a continuous function of θ, it follows from Theorem 1.3 that σ_{1,n} → σ as n → ∞ for some positive finite constant σ = σ(θ0). In this case
\[ \beta_n(\theta_{1,n}) = P\left[ \frac{n^{1/2}(\hat{\theta}_n - \theta_0)}{\sigma(\theta_0)} \geq r_{\alpha,n} \,\Big|\, \theta = \theta_{1,n} \right] = P\left[ \frac{n^{1/2}(\hat{\theta}_n - \theta_{1,n})}{\sigma(\theta_0)} \geq r_{\alpha,n} - \frac{n^{1/2}(\theta_{1,n} - \theta_0)}{\sigma(\theta_0)} \,\Big|\, \theta = \theta_{1,n} \right]. \]
Now, assume that n¹ᐟ²(θ̂n − θ)/σ(θ) converges in distribution to Z as n → ∞, where Z is a N(0, 1) random variable, for all θ ∈ (θ0 − ε, θ0 + ε) for some ε > 0. Further, assume that
\[ P\left[ \frac{n^{1/2}(\hat{\theta}_n - \theta_{1,n})}{\sigma(\theta_{1,n})} \leq t \,\Big|\, \theta = \theta_{1,n} \right] \to \Phi(t), \]
as n → ∞. Then the fact that
\[ \lim_{n\to\infty} \frac{\sigma(\theta_{1,n})}{\sigma(\theta_0)} = 1, \]
together with Theorem 4.11 (Slutsky), implies that
\[ P\left[ \frac{n^{1/2}(\hat{\theta}_n - \theta_{1,n})}{\sigma(\theta_0)} \leq t \,\Big|\, \theta = \theta_{1,n} \right] \to \Phi(t), \]
as n → ∞, and hence the result of Equation (10.43) holds in this case as well. These results can be summarized with the following result.
Theorem 10.10. Let X1, . . . , Xn be a set of independent and identically distributed random variables from a distribution F with parameter θ. Consider testing the null hypothesis H0: θ ≤ θ0 against the alternative hypothesis H1: θ > θ0 by rejecting the null hypothesis if n¹ᐟ²(θ̂n − θ0)/σ(θ0) > r_{α,n}. Assume that
1. r_{α,n} → z₁₋α as n → ∞.
2. σ(θ) is a continuous function of θ.
3. n¹ᐟ²(θ̂n − θ)/σ(θ) converges in distribution to Z as n → ∞ for all θ > θ0 − ε for some ε > 0, where Z is a N(0, 1) random variable.
4. n¹ᐟ²(θ̂n − θ_{1,n})/σ(θ_{1,n}) converges in distribution to Z as n → ∞ when θ = θ_{1,n}, where {θ_{1,n}}_{n=1}^∞ is the sequence θ_{1,n} = θ0 + n⁻¹ᐟ²δ for some δ > 0, so that θ_{1,n} → θ0 as n → ∞.
Then the test is consistent against all alternatives θ > θ0 and
\[ \lim_{n\to\infty} \beta_n(\theta_{1,n}) = \Phi[\delta/\sigma(\theta_0) - z_{1-\alpha}]. \]
Example 10.28. Suppose that X1, . . . , Xn is a set of independent and identically distributed random variables from a Poisson(θ) distribution. In Example 10.25 we considered the test statistic Zn = n¹ᐟ²θ0⁻¹ᐟ²(X̄n − θ0). Theorem 4.20 implies that
\[ P\left[ \frac{n^{1/2}(\bar{X}_n - \theta_u)}{\theta_u^{1/2}} \leq t \,\Big|\, \theta = \theta_u \right] \to \Phi(t), \]
as n → ∞ for all t ∈ R and θu ∈ (θ0, ∞). Hence, it follows that the test based on Zn is consistent. To obtain the asymptotic power of the test we need to show that
\[ P\left[ \frac{n^{1/2}(\bar{X}_n - \theta_{1,n})}{\theta_{1,n}^{1/2}} \leq t \,\Big|\, \theta = \theta_{1,n} \right] \to \Phi(t), \]
as n → ∞ for all t ∈ R and any sequence {θ_{1,n}}_{n=1}^∞ such that θ_{1,n} → θ0 as n → ∞. To see why this holds we follow the approach of Lehmann (1999) and note that for each n ∈ N, Theorem 4.24 (Berry and Esseen) implies that
\[ \left| P\left[ \frac{n^{1/2}(\bar{X}_n - \theta_{1,n})}{\theta_{1,n}^{1/2}} \leq t \,\Big|\, \theta = \theta_{1,n} \right] - \Phi(t) \right| \leq n^{-1/2} B \theta_{1,n}^{-3/2} E(|X_1 - \theta_{1,n}|^3), \qquad (10.44) \]
where B is a constant that does not depend on n. Therefore, it follows that for each t ∈ R,
\[ \lim_{n\to\infty} P\left[ \frac{n^{1/2}(\bar{X}_n - \theta_{1,n})}{\theta_{1,n}^{1/2}} \leq t \,\Big|\, \theta = \theta_{1,n} \right] = \Phi(t), \]
as long as the right hand side of Equation (10.44) converges to zero as n → ∞. To show this we note that
\[ \theta_{1,n}^{-3/2} E(|X_1 - \theta_{1,n}|^3) = \theta_{1,n}^{-3/2} E(|X_1^3 - 3X_1^2\theta_{1,n} + 3X_1\theta_{1,n}^2 - \theta_{1,n}^3|) \leq \theta_{1,n}^{-3/2}[E(X_1^3) + 3E(X_1^2)\theta_{1,n} + 3E(X_1)\theta_{1,n}^2 + \theta_{1,n}^3]. \]
Now E(X1) = θ_{1,n}, E(X1²) = θ_{1,n} + θ_{1,n}², and E(X1³) = θ_{1,n}[(θ_{1,n} + 1)² + θ_{1,n}], so that
\[ \theta_{1,n}^{-3/2} E(|X_1 - \theta_{1,n}|^3) \leq \theta_{1,n}^{-1/2}(8\theta_{1,n}^2 + 6\theta_{1,n} + 1). \]
Therefore, it follows that
\[ \lim_{n\to\infty} \theta_{1,n}^{-3/2} E(|X_1 - \theta_{1,n}|^3) \leq \theta_0^{-1/2}(8\theta_0^2 + 6\theta_0 + 1) < \infty. \]
Hence we have that the right hand side of Equation (10.44) converges to zero as n → ∞. Therefore, it follows from Theorem 10.10 that the asymptotic power of the test is Φ(θ0⁻¹ᐟ²δ − z₁₋α) for the sequence of alternatives given by θ_{1,n} = θ0 + n⁻¹ᐟ²δ where δ > 0. This test is also consistent against all alternatives θ > θ0. □
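The asymptotic power expression obtained in Example 10.28 can be compared with simulated power at the local alternatives θ_{1,n} = θ0 + n⁻¹ᐟ²δ, as in the following sketch. It is illustrative only, assumes NumPy and SciPy, and uses arbitrary choices of θ0, δ, sample sizes, and seed.

```python
# Simulated power at local alternatives versus the asymptotic power formula.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
alpha, theta0, delta, replications = 0.05, 2.0, 1.5, 50000
z_crit = stats.norm.ppf(1 - alpha)
asymptotic_power = stats.norm.cdf(delta / np.sqrt(theta0) - z_crit)

for n in (25, 100, 400):
    theta1 = theta0 + delta / np.sqrt(n)
    x = rng.poisson(theta1, size=(replications, n))
    z_n = np.sqrt(n) * (x.mean(axis=1) - theta0) / np.sqrt(theta0)
    print(f"n = {n:4d}: simulated power = {np.mean(z_n > z_crit):.3f}")
print("asymptotic power:", round(asymptotic_power, 3))
```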
Example 10.29. Let X1, . . . , Xn be a set of independent and identically distributed random variables from an Exponential(θ) distribution. We wish to test the null hypothesis H0: θ ≤ θ0 against the alternative hypothesis H1: θ > θ0 using the test statistic n¹ᐟ²θ0⁻¹(X̄n − θ0). Theorem 4.20 (Lindeberg and Lévy) implies that n¹ᐟ²θ⁻¹(X̄n − θ) converges in distribution to Z as n → ∞ for all θ > 0, where Z is a N(0, 1) random variable. Hence, the first three assumptions of Theorem 10.10 have been satisfied, and it only remains to show that n¹ᐟ²θ_{1,n}⁻¹(X̄n − θ_{1,n}) converges in distribution to Z as n → ∞ when θ = θ_{1,n} and θ_{1,n} → θ0 as n → ∞. Using the same approach as in Example 10.28, based on Theorem 4.24 (Berry and Esseen), it follows that the assumptions of Theorem 10.10 hold, and hence the test is consistent with asymptotic power function given by Φ(θ0⁻¹δ − z₁₋α), where the sequence of alternatives is given by θ_{1,n} = θ0 + n⁻¹ᐟ²δ. □
In the case where σ is unknown, but is replaced by a consistent estimator given by σ̂n, we have that the test statistic Tn = n¹ᐟ²σ̂n⁻¹(θ̂n − θ) converges in distribution to Z as n → ∞, where Z is a N(0, 1) random variable, by Theorem 4.11 (Slutsky). Under similar conditions to those given in Theorem 10.10, the consistency and asymptotic power of the test have the properties given in Theorem 10.10. See Section 3.3 of Lehmann (1999) for a complete development of this case.
More precise asymptotic behavior under the null hypothesis can be determined by adding further structure to our model. In particular we will now restrict our attention to the framework of the smooth function model described in Section 7.4. We will first consider the case where σ², the asymptotic variance of n¹ᐟ²θ̂n, is known. In this case we will consider using the test statistic Zn = n¹ᐟ²σ⁻¹(θ̂n − θ0), which follows the distribution Gn when θ0 is the true value of θ. Calculations similar to those used above can be used to show that an unbiased test of size α of the null hypothesis H0: θ ≤ θ0 against the alternative hypothesis H1: θ > θ0 rejects the null hypothesis if Zn > g₁₋α, where we recall that g₁₋α is the (1 − α)th quantile of the distribution Gn. See Exercise 31.
From an asymptotic viewpoint it follows from the smooth function model that Zn converges in distribution to Z as n → ∞, where Z is a N(0, 1) random variable. Therefore, if the distribution Gn is unknown then one can develop a large sample test by rejecting H0: θ ≤ θ0 if Zn ≥ z₁₋α. This test is similar to the approximate normal tests studied above, except that with the additional framework of the smooth function model we can study the effect of this approximation, as well as alternate approximations, more closely. Note that Theorem 7.11 (Bhattacharya and Ghosh) implies that the quantile g₁₋α has a Cornish-Fisher expansion given by g₁₋α = z₁₋α + n⁻¹ᐟ²q1(z₁₋α) + O(n⁻¹), as n → ∞. Therefore, we have that |g₁₋α − z₁₋α| = O(n⁻¹ᐟ²), as n → ∞. To see what effect this has on the size of the test we note that Theorem 7.11 implies that the Edgeworth expansion for the distribution of Zn is given by P(Zn ≤ t) = Φ(t) + n⁻¹ᐟ²r1(t)φ(t) + O(n⁻¹), as n → ∞. Therefore, the probability of rejecting the null hypothesis when θ = θ0 is given by
\[ P(Z_n > z_{1-\alpha} \,|\, \theta = \theta_0) = 1 - \Phi(z_{1-\alpha}) - n^{-1/2}r_1(z_{1-\alpha})\phi(z_{1-\alpha}) + O(n^{-1}) = \alpha - n^{-1/2}r_1(z_{\alpha})\phi(z_{\alpha}) + O(n^{-1}), \qquad (10.45) \]
as n → ∞, where we have used the fact that both r1 and φ are even functions.
Therefore, Definition 10.8 implies that this test is first-order accurate. Note
that the test may even be more accurate depending on the form of the term
r1 (zα )φ(zα ) in Equation (10.45). For example, if r1 (zα ) = 0 then the test is
at least second-order accurate.
Another strategy for obtaining a more accurate test is to use a quantile of the form z₁₋α + n⁻¹ᐟ²q1(z₁₋α), under the assumption that the polynomial q1 is known. Note that
\[ P(Z_n > z_{1-\alpha} + n^{-1/2}q_1(z_{1-\alpha}) \,|\, \theta = \theta_0) = 1 - \Phi[z_{1-\alpha} + n^{-1/2}q_1(z_{1-\alpha})] - n^{-1/2}r_1[z_{1-\alpha} + n^{-1/2}q_1(z_{1-\alpha})]\phi[z_{1-\alpha} + n^{-1/2}q_1(z_{1-\alpha})] + O(n^{-1}), \]
as n → ∞. Now, using Theorem 1.15 we have that Φ[z₁₋α + n⁻¹ᐟ²q1(z₁₋α)] = Φ(z₁₋α) + n⁻¹ᐟ²q1(z₁₋α)φ(z₁₋α) + O(n⁻¹), r1[z₁₋α + n⁻¹ᐟ²q1(z₁₋α)] = r1(z₁₋α) + O(n⁻¹ᐟ²), and φ[z₁₋α + n⁻¹ᐟ²q1(z₁₋α)] = φ(z₁₋α) + O(n⁻¹ᐟ²), as n → ∞. Therefore,
\[ P(Z_n > z_{1-\alpha} + n^{-1/2}q_1(z_{1-\alpha}) \,|\, \theta = \theta_0) = 1 - \Phi(z_{1-\alpha}) - n^{-1/2}q_1(z_{1-\alpha})\phi(z_{1-\alpha}) - n^{-1/2}r_1(z_{1-\alpha})\phi(z_{1-\alpha}) + O(n^{-1}), \]
as n → ∞. Noting that q1(z₁₋α) = −r1(z₁₋α) implies that
\[ P(Z_n > z_{1-\alpha} + n^{-1/2}q_1(z_{1-\alpha}) \,|\, \theta = \theta_0) = \alpha + O(n^{-1}), \qquad (10.46) \]
as n → ∞, and therefore the test is second-order accurate.
Unfortunately, because the coefficients of q1 depend on the moments of the population, which are likely to be unknown, we cannot usually compute this rejection region in practice. On the other hand, if sample moments were substituted in place of the population moments, then we would have an estimator of q1 with the property that q̂1(z₁₋α) = q1(z₁₋α) + Op(n⁻¹ᐟ²), as n → ∞. Using this estimator we see that
\[ z_{1-\alpha} + n^{-1/2}\hat{q}_1(z_{1-\alpha}) = z_{1-\alpha} + n^{-1/2}q_1(z_{1-\alpha}) + O_p(n^{-1}), \]
as n → ∞. Using a result like Theorem 10.8, together with the result of Equation (10.46), implies that
\[ P(Z_n > z_{1-\alpha} + n^{-1/2}\hat{q}_1(z_{1-\alpha}) \,|\, \theta = \theta_0) = P(Z_n > z_{1-\alpha} + n^{-1/2}q_1(z_{1-\alpha}) + O_p(n^{-1}) \,|\, \theta = \theta_0) = P(Z_n > z_{1-\alpha} + n^{-1/2}q_1(z_{1-\alpha}) \,|\, \theta = \theta_0) + O(n^{-1}) = \alpha + O(n^{-1}), \]
as n → ∞. Therefore, this approximate Edgeworth-corrected test will also be second-order accurate.
For the case when σ is unknown we use the test statistic Tn = n1/2 σ̂n−1 (θ̂n −θ0 )
which has distribution Hn and associated quantile h1−α when θ = θ0 . Calcu-
lations similar to those given above can be used to show that approximating
STATISTICAL HYPOTHESIS TESTS 435
h1−α with z1−α results in a test that is first-order accurate while approxi-
mating h1−α with z1−α + n−1/2 ŝ1 (z1−α ) results in a test that is second-order
accurate. See Exercise 32.
Example 10.30. Let X1 , . . . , Xn be a sequence of independent and identi-
cally distributed random variables that have a N(θ, σ 2 ) distribution. Consider
testing the null hypothesis H0 : θ ≤ θ0 against the alternative hypothesis
H1 : θ > θ0 for some θ ∈ R. This model falls within the smooth function
model where Gn is a N(0, 1) distribution and Hn is a T(n − 1) distribution.
Of course, rejecting the null hypothesis when Tn > tn−1;1−α results in an ex-
act test. However, when n is large it is often suggested that rejecting the null
hypothesis when Tn = n1/2 σ̂n−1 (X̄n − θ0 ) > z1−α , where σ̂n2 is the unbiased
version of the sample variance, is a test that is approximately valid. Indeed,
this is motivated by the fact that the T(n − 1) distribution converges in distri-
bution to a N(0, 1) distribution as n → ∞. For the test based on the normal
approximation we have that
P (Tn > z1−α |θ = θ0 ) = α − n−1/2 v1 (z1−α )φ(z1−α ) + O(n−1 ),
as n → ∞, so that the test is at least first-order accurate. In the case of
the mean functional v1 (t) = 61 γ(2t2 + 1) where γ = σ −3 µ3 , which is the
standardized skewness of the population. For a normal population γ = 0 and
hence P (Tn > z1−α |θ = θ0 ) = α + O(n−1 ), as n → ∞. We can also note that
in this case
P (Tn > z1−α |θ = θ0 ) = α − n−1 v2 (z1−α )φ(z1−α ) + O(n−3/2 ),
as n → ∞. For the case of the mean functional we have that
v2 (t) 1
= t[ 12 κ(t2 − 3) − 1 2 4
18 γ (t + 2t
2
− 3) − 14 (t2 + 3)]
1
= t[ 12 κ(t2 − 3) − 1 2
4 (t + 3)],
since γ = 0 for the N(θ, σ 2 ) distribution, where κ = σ −4 µ4 − 3. The fourth
moment of a N(θ, σ 2 ) distribution is 3σ 4 so that κ = 0. Therefore v2 (t) =
− 14 t(t2 + 3), and hence
P (Tn > z1−α |θ = θ0 ) = α + 14 n−1 z1−α (z1−α
2
+ 3)φ(z1−α ) + O(n−3/2 ),
as n → ∞. Therefore, it follows that the normal approximation is second-order
accurate for samples from a N(θ, σ 2 ) distribution. Note that the second-order
accuracy will also hold for any symmetric population. If the population is not
symmetric the test will only be first-order accurate. 
Example 10.31. Let X1 , . . . , Xn be a sequence of independent and identi-
cally distributed random variables that have a distribution F for all n ∈ N. Let
θ be the mean of F , and assume that F falls within the assumptions of the
smooth function model. Consider testing the null hypothesis H0 : θ ≤ θ0
against the alternative hypothesis H1 : θ > θ0 for some θ ∈ R. Theo-
rem 8.5 implies that we can estimate γ = σ −3 µ3 with γ̂n = σ̂n−3 µ̂3 where
γ̂n = γ + Op (n−1/2 ) as n → ∞. In this case σ̂n and µˆ3 are the sample ver-
sions of the corresponding moments. Using this estimate, we can estimate
436 PARAMETRIC INFERENCE
s1 (z1−α ) with ŝ1 (z1−α ) = − 61 γ̂n (2z1−α
2
+ 1), where it follows that ŝ1 (z1−α ) =
−1/2
s1 (z1−α ) + Op (n ), as n → ∞. Now consider the test that rejects the null
hypothesis when Tn = n1/2 σ̂n−1 (X̄n − θ0 ) > z1−α + ŝ1 (z1−α ). Then

P [Tn > z1−α + n−1/2 ŝ1 (z1−α )|θ = θ0 ] =


P [Tn > z1−α + n−1/2 s1 (z1−α )|θ = θ0 ] + O(n−1 ) =
1 − Φ[z1−α + n−1/2 s1 (z1−α )]−
n−1/2 v1 [z1−α + n−1/2 s1 (z1−α )]φ[z1−α + n−1/2 s1 (z1−α )] + O(n−1 ) =
α − n−1/2 s1 (z1−α )φ(z1−α ) − n−1/2 v1 (z1−α )φ(z1−α ) + O(n−1 ),
as n → ∞. Noting that s1 (z1−α ) = −v1 (z1−α ) implies that rejecting the
null hypothesis when Tn > z1−α + ŝ1 (z1−α ) yields a test that is second-order
correct. 
The normal distribution is not the only asymptotic distribution that is com-
mon for test statistics. Let X1 , . . . , Xn be a set of independent and identically
distributed random variables from a distribution F with parameter θ. Let L(θ)
be the likelihood function of θ and let θ̂n be the maximum likelihood estima-
tor of θ, which we will assume is unique. We will consider Wilk’s likelihood
ratio test statistic for testing a null hypothesis of the form H0 : θ = θ0 against
an alternative hypothesis of the form H1 : θ 6= θ0 . The idea behind this test
statistic is based on comparing the likelihood for the estimated value of θ,
given by L(θ̂n ), to the likelihood of the value of θ under the null hypothesis,
given by L(θ0 ). Since θ̂n is the maximum likelihood estimator of θ it follows
that L(θ̂n ) ≥ L(θ0 ). If this difference is large then there is evidence that the
likelihood for θ0 is very low compared to the maximum likelihood and hence
θ0 is not a likely value of θ. In this case the null hypothesis would be rejected.
One the other hand, if this difference is near zero, then the likelihoods are
near one another, and there is little evidence that the null hypothesis is not
true. The usual likelihood ratio test compares L(θ̂n ) to L(θ0 ) using the ratio
L(θ0 )/L(θ̂n ), so that small values of this test statistic indicate that the null
hypothesis should be rejected. Wilk’s likelihood ratio test statistic is equal to
−2 times the logarithm of this ratio. That is,
" #
L(θ0 )
Λn = −2 log = 2{log[L(θ̂n )] − log[L(θ0 )]} = 2[l(θ̂n ) − l(θ0 )].
L(θ̂n )
Therefore, a test based on the statistic Λn rejects the null hypothesis when
Λn is large. Under certain conditions the test statistic Λn has an asymptotic
ChiSquared(1) distribution.
Theorem 10.11. Suppose that X1 , . . . , Xn is a set of independent and iden-
tically distributed random variables from a distribution F (x|θ) with density
f (x|θ), where θ has parameter space Ω. Suppose that the conditions of The-
d
orem 10.5 hold, then, under the null hypothesis H0 : θ = θ0 , Λn −
→ X as
n → ∞ where X has ChiSquared(1) distribution.
STATISTICAL HYPOTHESIS TESTS 437
Proof. Theorem 1.13 (Taylor) implies that under the stated assumptions we
have that
l(θ̂n ) = l(θ0 ) + (θ̂n − θ0 )l0 (θ0 ) + 21 (θ̂n − θ0 )2 l00 (ξ)
= l(θ0 ) + n1/2 (θ̂n − θ0 )[n−1/2 l0 (θ0 )]
+ 21 (θ̂n − θ0 )2 l00 (ξ), (10.47)

where ξ is a random variable that is between θ0 and θ̂n with probability one.
Apply Theorem 1.13 to the second term on the right hand side of Equation
(10.47) to obtain
n−1/2 l0 (θ0 ) = n−1/2 l0 (θ̂n ) + n−1/2 (θ0 − θ̂n )l00 (ζ), (10.48)
where ζ is a random variable that is between θ̂n and θ0 with probability one.
To simplify the expression in Equation (10.48) we note that since θ̂n is the
maximum likelihood estimator of θ it follows that l0 (θ̂n ) = 0. Therefore

n−1/2 l0 (θ0 ) = n−1/2 l0 (θ̂n ) + n−1/2 (θ0 − θ̂n )l00 (ζ) =


− n−1 l00 (ζ)[n1/2 (θ̂n − θ0 )]. (10.49)
Substituting the expression from Equation (10.49) into Equation (10.47) yields
l(θ̂n ) = l(θ0 ) + n1/2 (θ̂n − θ0 ){−n−1 l00 (ζ)[n1/2 (θ̂n − θ0 )]}
+ 21 (θ̂n − θ0 )2 l00 (ξ)
= l(θ0 ) − n(θ̂n − θ0 )2 [n−1 l00 (ζ)] + 21 (θ̂n − θ0 )2 l00 (ξ), (10.50)
or equivalently that
Λn = −2n(θ̂n − θ0 )2 [n−1 l00 (ζ)] + n(θ̂n − θ0 )2 [n−1 l00 (ξ)].
Using calculations similar to those used in the proof of Theorem 10.5, we have
p
that Theorem 3.10 (Weak Law of Large Numbers) implies that n−1 l00 (ξ) − →
p
−I(θ0 ) and n−1 l00 (ζ) −
→ −I(θ0 ) as n → ∞. Therefore it follows that n−1 l00 (ξ) =
−1 00
−I(θ0 ) + op (1) and n l (ζ) = −I(θ0 ) + op (1) as n → ∞. Therefore, it follows
that
Λn = n(θ̂n − θ0 )2 I(θ0 ) + op (1),
d
as n → ∞. Now, Theorem 10.5 implies that n1/2 (θ̂n − θ)I 1/2 (θ0 ) −
→ Z as n →
∞ where Z has a N(0, 1) distribution. Therefore, Theorem 4.12 implies that
d
n(θ̂n − θ)2 I(θ0 ) −
→ X as n → ∞, where X has a ChiSquared(1) distribution,
and the result is proven with an application of Theorem 4.11 (Slutsky).

Some other common test statistics that have asymtptotic ChiSquared dis-
tributions include Wald’s statistic and Rao’s efficient score statistic. See Ex-
ercises 35 and 36. Under a sequence of alternative hypothesis {θ1,n }∞ n=1 where
θ1,n = θ0 + n−1/2 δ, it follows that Wilk’s likelihood ratio test statistic has an
asymptotic ChiSquared statistic with a non-zero non-centrality parameter.
438 PARAMETRIC INFERENCE
Theorem 10.12. Suppose that X1 , . . . , Xn is a set of independent and iden-
tically distributed random variables from a distribution F (x|θ) with density
f (x|θ), where θ has parameter space Ω. Suppose that the conditions of Theorem
10.5 hold, then, under the sequence of alternative hypotheses {θ1,n }∞
n=1 where
d
θ1,n = θ0 + n−1/2 δ, Λn −
→ X as n → ∞ where X has ChiSquared[1, δ 2 I(θ0 )]
distribution.

Proof. We begin by noting that if θ1,n = θ0 + n−1/2 δ, then retracing the steps
in the proof of Theorem 10.11 implies that under the sequence of alternative
hypotheses {θ1,n }∞ 2
n=1 we have that Λn = n(θ̂n −θ0 ) I(θ0 )+op (n
−1
), as n → ∞.
See Exercise 34. Now, note that
n1/2 (θ̂n − θ0 ) = n1/2 (θ̂n − θ1,n + n−1/2 δ) = n1/2 (θ̂n − θ1,n ) + δ.
Therefore, Theorem 4.11 (Slutsky) implies that under the sequence of al-
d
ternative hypotheses we have that n1/2 I 1/2 (θ0 )(θ̂n − θ0 ) −
→ Z as n → ∞
where Z is a N(δ, 1) random variable. Therefore, Theorem 4.12 implies that
d
n(θ̂n − θ0 )2 I(θ0 ) −
→ X as n → ∞, where X has a ChiSquared[1, δ 2 I(θ0 )]
distribution, and the result is proven.

Similar results also hold for the asymptotic distributions of Wald’s statistic
and Rao’s efficient score statistic under the same sequence of alternative hy-
potheses. See Exercises 37 and 38.

10.5 Observed Confidence Levels

Observed confidence levels provide useful information about the relative truth
of hypotheses in multiple testing problems. In place of the repeated tests of
hypothesis usually associated with multiple comparison techniques, observed
confidence levels provide a level of confidence for each of the hypotheses. This
level of confidence measures the amount of confidence there is, based on the
observed data, that each of the hypotheses are true. This results in a relatively
simple method for assessing the truth of a sequence of hypotheses.
The development of observed confidence levels begins by constructing a for-
mal framework for the problem. This framework is based on the problem of
regions, which was first formally proposed by Efron and Tibshirani (1998). The
problem is constructed as follows. Let X be a d-dimensional random vector
following a d-dimensional distribution function F . We will assume that F is a
member of a collection, or family, of distributions given by F. The collection
F may correspond to a parametric family such as the collection of all d-variate
normal distributions, or may be nonparametric, such as the collection of all
continuous distributions with finite mean. Let θ be the parameter of interest
and assume that θ is a functional parameter of F of the form θ = T (F ),
with parameter space Ω. Let {Ωi }∞ i=1 be a countable sequence of subsets, or
OBSERVED CONFIDENCE LEVELS 439
regions, of Ω such that

[
Ωi = Ω.
i=1
For simplicity each region will usually be assumed to be connected, though
in many problems Ωi is made up of disjoint regions. Further, in many prac-
tical examples the sequence is finite, but there are also practical examples
that require countable sequences. There are also examples where the sequence
{Ωi }∞
i=1 forms a partition of Ω in the sense that Ωi ∩ Ωj = ∅ for all i 6= j. In
the general case, the possibility that the regions can overlap will be allowed.
In many practical problems the subsets technically overlap on their bound-
aries, but the sequence can often be thought of as a partition from a practical
viewpoint. The statistical interest in such a sequence of regions arises from the
structure of a specific inferential problem. Typically the regions correspond
to competing models for the distribution of the random vector X and one is
interested in determining which of the models is most reasonable based on the
observed data vector X. Therefore the problem of regions is concerned with
determining which of the regions in the sequence {Ωi }∞ i=1 that θ belongs to
based on the observed data X.
An obvious simple solution to this problem would be to estimate θ based on
X and conclude that θ is in the region Ωi whenever the estimate θ̂n is in
the region Ωi . We will consider the estimator θ̂n = T (F̂n ) where F̂n is the
empirical distribution function defined in Definition 3.5. The problem with
simply concluding that θ ∈ Ωi whenever θ̂n ∈ Ωi is that θ̂n is subject to
sample variability. Therefore, even though we may observe θ̂n ∈ Ωi , it may
actually be true that θ̂n ∈ Ωj for some i 6= j where Ωi ∩ Ωj = ∅, and that
θ̂ ∈ Ωi was observed simply due to chance. If such an outcome were rare,
then the method may be acceptable. However, if such an outcome occurred
relatively often, then the method would not be useful. Therefore, it is clear
that the inherent variability in θ̂n must be accounted for in order to develop
a useful solution to the problem of regions.
Multiple comparison techniques solve the problem of regions using a sequence
of hypothesis tests. Adjustments to the testing technique helps control the
overall significance level of the sequence of tests. Modern techniques have been
developed by Stefansson, Kim and Hsu (1988) and Finner and Strassburger
(2002). Some general references that address issues concerned with multiple
comparison techniques include Hochberg and Tamhane (1987), Miller (1981)
and Westfall and Young (1993). Some practitioners find the results of these
procedures difficult to interpret as the number of required tests can sometimes
be quite large.
An alternate approach to multiple testing techniques was formally introduced
by Efron and Tibshirani (1998). This approach computes a measure of con-
fidence for each of the regions. This measure reflects the amount of confi-
dence there is that θ lies within the region based on the observed sample
440 PARAMETRIC INFERENCE
X. The method used for computing the observed confidence levels studied
here is based on the methodology of Polansky (2003a, 2003b, 2007). Let
C(α, ω) ⊂ Θ be a 100α% confidence region for θ based on the sample X.
That is, C(α, ω) ⊂ Ω is a function of the sample X with the property that
P [θ ∈ C(α, ω)] = α.
The vector ω ∈ Wα ⊂ Rq is called the shape parameter vector as it con-
tains a set of parameters that control the shape and orientation of the con-
fidence region, but do not have an effect on the confidence coefficient. Even
though Wα is usually a function of α, the subscript α will often be omitted
to simplify mathematical expressions. Now suppose that there exist sequences
{αi }∞ ∞
i=1 ∈ [0, 1] and {ωi }i=1 ∈ Wα such that C(αi , ωi ) = Ωi for i = 1, 2, . . .,
conditional on X. Then the sequence of confidence coefficients are defined to
be the observed confidence levels for {Ωi }∞ i=1 . In particular, αi is defined to
be the observed confidence level of the region Ωi . That is, the region Ωi cor-
responds to a 100αi % confidence region for θ based on the observed data.
This measure is similar to the measure suggested by Efron and Gong (1983),
Felsenstein (1985) and Efron, Holloran, and Holmes (1996). It is also similar in
application to the methods of Efron and Tibshirani (1998), though the formal
definition of the measure differs slightly from the definition used above. See
Efron and Tibshirani (1998) for further details on this definition.
Example 10.32. To demonstrate this idea, consider a simple example where
X1 , . . . , Xn is a set of independent and identically distributed random vari-
ables from a N(θ, σ 2 ) distribution. Let θ̂n and σ̂n be the usual sample mean
and variance computed on X1 , . . . , Xn . A confidence interval for the mean
that is based on the assumption that the population is Normal is based on
percentiles from the T(n − 1) distribution and has the form
C(α, ω) = (θ̂n − tn−1;1−ωL n−1/2 σ̂n , θ̂n − tn−1;1−ωU n−1/2 σ̂n ), (10.51)
th
where tν;ξ is the ξ quantile of a T(ν) distribution. In order for the confidence
interval in Equation (10.51) to have a confidence level equal to 100α% we take
ω 0 = (ωL , ωU ) to be the shape parameter vector where
Wα = {ω : ωU − ωL = α, ωL ∈ [0, 1], ωU ∈ [0, 1]},
for α ∈ (0, 1). Note that selecting ω ∈ Wα not only ensures that the confidence
level is 100α%, but also allows for several orientations and shapes of the
interval. For example, a symmetric two-tailed interval can be constructed by
selecting ωL = (1 − α)/2 and ωU = (1 + α)/2. An upper one-tailed interval is
constructed by setting ωL = 0 and ωU = α. A lower one-tailed interval uses
ωL = 1 − α and ωU = 1.
Now consider the problem of computing observed confidence levels for the
Normal mean for a sequence of interval regions of the form Ωi = [ti , ti+1 ]
where −∞ < ti < ti+1 < ∞ for i ∈ N. Setting Ωi = C(α, ω) where the
confidence interval used for this calculation is the one given in Equation (10.51)
OBSERVED CONFIDENCE LEVELS 441
yields
θ̂n − tn−1;1−ωL n−1/2 σ̂n = ti , (10.52)
and
θ̂n − tn−1;1−ωU n−1/2 σ̂n = ti+1 . (10.53)
Solving Equations (10.52) and (10.53) for ωL and ωU yields
ωL = 1 − Tn−1 [n1/2 σ̂n−1 (θ̂n − ti )],
and
ωU = 1 − Tn−1 [n1/2 σ̂n−1 (θ̂n − ti+1 )],
where Tn−1 is the distribution function of a T(n − 1) distribution. Because
ω ∈ Wα if and only if ωU − ωL = α it follows that the observed confidence
level for the region Ωi is given by
Tn−1 [n1/2 σ̂n−1 (θ̂n − ti )] − Tn−1 [n1/2 σ̂n−1 (θ̂n − ti+1 )].


This section will consider the problem of developing the asymptotic theory of
observed confidence levels for the case when there is a single parameter, that
is, when Ω ⊂ R. To simplify the development in this case, it will be further
assumed that Ω is an interval subset of R. Most standard single parameter
problems in statistical inference fall within these assumptions.
An observed confidence level is simply a function that takes a subset of Ω
and maps it to a real number between 0 and 1. Formally, let α be a function
and let T be a collection of subsets of Ω. Then an observed confidence level
is a function α : T → [0, 1]. Because confidence levels are closely related to
probabilities, it is reasonable to assume that α has the axiomatic properties
given in Definition 2.2 where we will suppose that T is a sigma-field of subsets
of Ω. For most reasonable problems in statistical inference it should suffice to
take T to be the Borel σ-field on Ω. Given this structure, it suffices to develop
observed confidence levels for interval subsets of Θ. Observed confidence levels
for other regions can be obtained through operations derived from the axioms
in Definition 2.2.
To develop observed confidence levels for a general scalar parameter θ, consider
a single interval region of the form Ψ = (tL , tU ) ∈ T. To compute the observed
confidence level of Ψ, a confidence interval for θ based on the sample X is
required. The general form of a confidence interval for θ based on X can
usually be written as C(α, ω) = [L(ωL ), U (ωU )], where ωL and ωU are shape
parameters such that (ωL , ωU ) ∈ Wα , for some Wα ⊂ R2 . It can often be
assumed that L(ωL ) and U (ωU ) are continuous monotonic functions of ωL
and ωU onto Ω, respectively, conditional on the observed sample X. See, for
example, Section 9.2 of Casella and Berger (2002). If such an assumption is
true, the observed confidence level of Ψ is computed by setting Ψ = C(α, ω)
and solving for ω. The value of α for which ω ∈ Wα is the observed confidence
level of Ψ. For the form of the confidence interval given above, the solution
442 PARAMETRIC INFERENCE
is obtained by setting ωL = L−1 (tL ) and ωU = U −1 (tU ), conditional on X.
A unique solution will exist for both shape parameters given the assumptions
on the functions L and U . Therefore, the observed confidence level of Ψ is the
value of α such that ω = (ωL , ωU ) ∈ Ωα . Thus, the calculation of observed
confidence levels in the single parameter case is equivalent to inverting the
endpoints of a confidence interval for θ. Some simple examples illustrating
this method is given below.
Example 10.33. Continuing with the setup of Example 10.32, suppose that
X1 , . . . ,Xn is a set of independent and identically distributed random variables
from a N(µ, θ) distribution where θ < ∞. For the variance, the parameter
space is Ω = (0, ∞), so that the region Ψ = (tL , tU ) is assumed to follow
the restriction that 0 < tL ≤ tU < ∞. Let θ̂n be the unbiased version of the
sample variance, then a 100α% confidence interval for θ is given by
" #
(n − 1)θ̂n (n − 1)θ̂n
C(α, ω) = , , (10.54)
χ2n−1;1−ωL χ2n−1;1−ωU
where ω ∈ Wα with
Wα = {ω 0 = (ω1 , ω2 ) : ωL ∈ [0, 1], ωU ∈ [0, 1], ωU − ωL = α},
and χ2ν,ξ is the ξ th percentile of a ChiSquared(ν) distribution. Therefore

(n − 1)θ̂n
L(ω) = ,
χ2n−1;1−ωL
and
(n − 1)θ̂n
U (ω) = .
χ2n−1;1−ωU
Setting Ψ = C(α, ω) and solving for ωL and ωU yields
ωL = L−1 (tL ) = 1 − χ2n−1 [t−1
L (n − 1)θ̂n ],

and
ωU = U −1 (tU ) = 1 − χ2n−1 [t−1
U (n − 1)θ̂n ],
where χ2ν is the distribution function of a ChiSquared(ν) distribution. This
implies that the observed confidence limit for Ψ is given by
α(Ψ) = χ2n−1 [t−1 2 −1
L (n − 1)θ̂n ] − χn−1 [tU (n − 1)θ̂n ].


Example 10.34. Suppose (X1 , Y1 ), . . . , (Xn , Yn ) is a set of independent and
identically distributed bivariate random vectors from a bivariate normal dis-
tribution with mean vector µ0 = (µX , µY ) and covariance matrix
 2 
σX σXY
Σ= ,
σXY σY2
2
where V (Xi ) = σX , V (Yi ) = σY2 and the covariance between X and Y is
OBSERVED CONFIDENCE LEVELS 443
−1 −1
σXY . Let θ = σXY σX σY , the correlation coefficient between X and Y .
The problem of constructing a reliable confidence interval for θ is usually
simplified using Fisher’s normalizing transformation. See Fisher (1915) and
Winterbottom (1979). Using the fact that tanh−1 (θ̂n ) has an approximate
N[tanh−1 (θ), (n − 3)−1/2 ] distribution, the resulting approximate confidence
interval for θ has the form
C(α, ω) = [tanh(tanh−1 (θ̂n ) − z1−ωL (n − 3)−1/2 ),
tanh(tanh−1 (θ̂n ) − z1−ωU (n − 3)−1/2 )] (10.55)
where θ̂n is the sample correlation coefficient given by
Pn
i=1 (Xi − X̄)(Yi − Ȳ )
θ̂n = P  , (10.56)
n 2 1/2 2 1/2
 Pn
i=1 (Xi − X̄) i=1 (Yi − Ȳ )
and ω 0 = (ωL , ωU ) ∈ Wα with
Wα = {ω 0 = (ω1 , ω2 ) : ωL ∈ [0, 1], ωU ∈ [0, 1], ωU − ωL = α}.
Therefore
L(ωL ) = tanh(tanh−1 (θ̂n ) − z1−ωL (n − 3)−1/2 )
and
U (ωU ) = tanh(tanh−1 (θ̂n ) − z1−ωU (n − 3)−1/2 ).
Setting Ψ = (tL , tU ) = C(α, ω) yields
ωL = 1 − Φ[(n − 3)1/2 (tanh−1 (θ̂n ) − tanh−1 (tL ))],
and
ωU = 1 − Φ[(n − 3)1/2 (tanh−1 (θ̂n ) − tanh−1 (tU ))],
so that the observed confidence level for Ψ is given by
α(Ψ) = Φ[(n − 3)1/2 (tanh−1 (θ̂) − tanh−1 (tL ))]−
Φ[(n − 3)1/2 (tanh−1 (θ̂) − tanh−1 (tU ))].


For the asymptotic development in this section we will consider problems that
occur within the smooth function model described in Section 7.4. As observed
in Section 10.3, confidence regions for θ can be constructed using the ordinary
upper confidence limit
θ̂ord (α) = θ̂n − n−1/2 σg1−α , (10.57)
for the case when σ is known, and the studentized critical point
θ̂stud (α) = θ̂n − n−1/2 σ̂n h1−α , (10.58)
for the case when σ is unknown. Two-sided confidence intervals can be devel-
oped using each of these upper confidence limits. For example, if ωL ∈ (0, 1)
and ωU ∈ (0, 1) are such that α = ωU − ωL ∈ (0, 1) then Cord (α, ω) =
444 PARAMETRIC INFERENCE
[θ̂ord (ωL ), θ̂ord (ωU )] is a 100α% confidence interval for θ when σ is known.
Similarly
Cstud (α, ω) = [θ̂stud (ωL ), θ̂stud (ωU )],
is a 100α% confidence interval for θ when σ is unknown.
For a region Ψ = (tL , tU ) the observed confidence level corresponding to each
of the theoretical critical points can be computed by setting Ψ equal to the
confidence interval and solving for ωL and ωU . For example, in the case of the
ordinary theoretical critical point, setting Ψ = Cord (α, ω; X) yields the two
equations
tL = L(ω) = θ̂n − n−1/2 σg1−ωL , (10.59)
and
tU = U (ω) = θ̂n − n−1/2 σg1−ωU . (10.60)
Let Gn (x) = P [n1/2 σ −1 (θ̂n − θ)] and Hn (x) = P [n1/2 σ̂n−1 (θ̂n − θ)]. Solving for
ωL and ωU in Equations (10.59) and (10.60) yields
ωL = L−1 (tL ) = 1 − Gn [n1/2 σ −1 (θ̂n − tL )],
and
ωU = U −1 (tU ) = 1 − Gn [n1/2 σ −1 (θ̂n − tU )],
so that the observed confidence level corresponding to the ordinary theoretical
confidence limit is
αord (Ψ) = Gn [n1/2 σ −1 (θ̂n − tL )] − Gn [n1/2 σ −1 (θ̂n − tU )]. (10.61)
Similarly, the observed confidence levels corresponding to the studentized con-
fidence interval is
αstud (Ψ) = Hn [n1/2 σ̂n−1 (θ̂n − tL )] − Hn [n1/2 σ̂n−1 (θ̂n − tU )]. (10.62)
In the case where F is unknown, the asymptotic Normal behavior of the
distributions Gn and Hn can be used to compute approximate observed con-
fidence levels of the form
α̂ord (Ψ) = Φ[n1/2 σ −1 (θ̂n − tL )] − Φ[n1/2 σ −1 (θ̂n − tU )], (10.63)
and
α̂stud (Ψ) = Φ[n1/2 σ̂n−1 (θ̂n − tL )] − Φ[n1/2 σ̂n−1 (θ̂n − tU )], (10.64)
for the cases where σ is known, and unknown, respectively.
The observed confidence level based on the Normal approximation is just
one of several alternative methods available for computing an observed confi-
dence level for any given parameter. Indeed, as was pointed out earlier, any
function that maps regions to the unit interval such that the three properties
of a probability measure are satisfied is technically a method for computing
an observed confidence level. Even if we focus on observed confidence levels
that are derived from confidence intervals that at least guarantee their cov-
erage level asymptotically, there may be many methods to choose from, and
techniques for comparing the methods become paramount in importance.
OBSERVED CONFIDENCE LEVELS 445
This motivates the question as to what properties we would wish the observed
confidence levels to possess. Certainly the issue of consistency would be rele-
vant in that an observed confidence level computed on a region Ω0 = (t0L , t0U )
that contains θ should converge to one as the sample size becomes large. Corre-
spondingly, an observed confidence level computed on a region Ω1 = (t1L , t1U )
that does not contain θ should converge to zero as the sample size becomes
large. The issue of consistency is relatively simple to decide within the smooth
function model studied. The normal approximation provides the simplest case
and is a good starting point. Consider the ordinary observed confidence level
given in Equation (10.63). Note that
Φ[n1/2 σ −1 (θ̂n − t0L )] = Φ[n1/2 σ −1 (θ − t0L ) + n1/2 σ −1 (θ̂n − θ)]
d
where n1/2 (θ − t0L )/σ → ∞ and n1/2 (θ̂n − θ)/σ − → Z as n → ∞, where Z is
a N(0, 1) random variable. It is clear that the second sequence is bounded in
probability, so that the first sequence dominates. It follows that
p
Φ[n1/2 σ −1 (θˆn − t0L )] −
→1
as n → ∞. Similarly, it can be shown that
p
Φ[n1/2 σ −1 (θ̂n − t0U )] −
→0
p
as n → ∞, so that it follows that α̂ord (Ω0 ) − → 1 as n → ∞ when θ ∈ Ω0 .
−1 p
A similar argument, using the fact that σ̂n σ − → 1 as n → ∞ can be used
p
to show that α̂stud (Ω0 ) − → 1 as n → ∞, as well. Arguments to show that
αord (Ω0 ) and αstud (Ω0 ) are also consistent follow in a similar manner, though
one must use the fact that Gn ; Φ and Hn ; Φ as n → ∞.
Beyond consistency, it is desirable for an observed confidence level to provide
an accurate representation of the level of confidence there is that θ ∈ Ψ, given
the observed sample X1 , . . . , Xn . Considering the definition of an observed
confidence level, it is clear that if Ψ corresponds to a 100α% confidence interval
for θ, conditional on X1 , . . . , Xn , the observed confidence level for Ψ should
be α. When σ is known the interval Cord (α, ω) will be used as the standard for
a confidence interval for θ. Hence, a measure α̃ of an observed confidence level
is accurate if α̃[Cord (α, ω)] = α. For the case when σ is unknown the interval
Cstud (α, ω) will be used as the standard for a confidence interval for θ, and
an arbitrary measure α̃ is defined to be accurate if α̃[Cstud (α, ω)] = α. Using
this definition, it is clear that αord and αstud are accurate when σ is known
and unknown, respectively. When σ is known and α̃[Cord (α, ω)] 6= α or when
σ is unknown and α̃[Cstud (α, ω)] 6= α one can analyze how close α̃[Cord (α, ω)]
or α̃[Cstud (α, ω)] is to α using asymptotic expansion theory. In particular, if
σ is known then a measure of confidence α̃ is said to be k th -order accurate if
α̃[Cord (α, ω)] = α+O(n−k/2 ), as n → ∞. Similarly, if σ is unknown a measure
α̃ is said to be k th -order accurate if α̃[Cstud (α, ω)] = α +O(n−k/2 ), as n → ∞.
To analyze the normal approximations, let us first suppose that σ is known.
446 PARAMETRIC INFERENCE
If α = ωU − ωL then
α̂ord [Cord (α, ω)] = Φ{n1/2 σ −1 [θ̂ − θ̂ord (ωL )]} −
Φ{n1/2 σ −1 [θ̂ − θ̂ord (ωU )]}
= Φ(g1−ωL ) − Φ(g1−ωU ). (10.65)
If Gn = Φ, then α̂ord [Cord (α, ω)] = α, and the method is accurate. When
Gn 6= Φ the Cornish-Fisher expansion for a quantile of Gn , along with an
application of Theorem 1.13 (Taylor) to Φ yields
Φ(g1−ω ) = 1 − ω − n−1/2 r1 (zω )φ(zω ) + O(n−1 ),
as n → ∞, for an arbitrary value of ω ∈ (0, 1). It is then clear that
α̂ord [Cord (α, ω)] = α + n−1/2 ∆(ωL , ωU ) + O(n−1 ), (10.66)
as n → ∞ where
∆(ωL , ωU ) = r1 (zωU )φ(zωU ) − r1 (zωL )φ(zωL ).
One can observe that α̂ord is first-order accurate, unless the first-order term
in Equation (10.66) is functionally zero. If it happens that ωL = ωU or ωL =
1 − ωU , then it follows that r1 (zωL )φ(zωL ) = r1 (zωU )φ(zωU ) since r1 is an
even function and the first-order term vanishes. The first case corresponds
to a degenerate interval with confidence measure equal to zero. The second
case corresponds to the situation where θ̂ corresponds to the midpoint of the
interval (tL , tU ). Otherwise, the term is typically nonzero.

When σ is unknown and α = ωU − ωL we have that


α̂stud [Cstud (α, ω)] = Φ{n1/2 σ̂n−1 [θ̂ − θ̂stud (ωL )]} −
Φ{n1/2 σ̂n−1 [θ̂ − θ̂stud (ωU )]}
= Φ(h1−ωL ) − Φ(h1−ωU ).
A similar argument to the one given above yields
α̂stud [Cstud (α, ω)] = α + n−1/2 Λ(ωL , ωU ) + O(n−1 ),
as n → ∞, where
Λ(ωL , ωU ) = v1 (zωU )φ(zωU ) − v1 (zωL )φ(zωL ).
Therefore, the methods based on the normal approximation are first-order
accurate. This accuracy can be improved using Edgeworth type corrections. In
particular, by estimating the polynomials in the first term of the Edgeworth
expansions for the distributions Hn and Gn , second-order correct observed
confidence levels can be obtained. See Exercise 43. Observed confidence levels
can also be applied to problems where the parameter is a vector, along with
problems in regression, linear models, and density estimation. See Polansky
(2007) for further details on these applications.
BAYESIAN ESTIMATION 447
10.6 Bayesian Estimation

The statistical inference methods used so far in this book are classified as
frequentists methods. In these methods the unknown parameter is considered
to be a fixed constant that is an element of the parameter space. The random
mechanism that produces the observed data is based on a distribution that
depends on this unknown, but fixed, parameter. These methods are called
frequentist methods because the results of the statistical analyses are inter-
preted using the frequency interpretation of probability. That is, the methods
are justified by properties that hold under repeated sampling from the distri-
bution of interest. For example, a 100α% confidence set is justified in that the
probability that the set contains the true parameter value with a probability
of α before the sample is taken, or that the expected proportion of the confi-
dence sets that contain the parameter over repeated sampling from the same
distribution is α.
An alternative view of statistical inference is based on Bayesian methods. In
the Bayesian framework the unknown parameter is considered to be a random
variable and the observed data is based on the joint distribution between the
parameter and the observed random variables. In the usual formulation the
distribution of the parameter, called the prior distribution, and the condi-
tional distribution of the observed random variables given the parameter are
specified. Inference is then carried out using the distribution of the parame-
ter conditional on the sample that was observed. This distribution is called
the posterior distribution. The computation of the posterior distribution is
based on calculations justified by Bayes’ theorem, from which the methods
are named. The interpretation of using Bayes’ theorem in this way is that the
prior distribution can be interpreted as the knowledge of the parameter before
the data was observed, while the posterior distribution is the knowledge of the
parameter that has been updated based on the information from the observed
sample. The advantage of this type of inference is that the theoretical prop-
erties of Bayesian methods are interpreted for the sample that was actually
observed, and not over all possible samples. The interpretation of the results
is also simplified due to the randomness of the parameter value. For example,
while a confidence interval must be interpreted in view of all of the possible
samples that could have been observed, a Bayesian confidence set produces a
set that has a posterior probability of α, which can be interpreted solely on
the basis of the observed sample and the prior distribution.
In some sense Bayesian methods do not need to rely on asymptotic properties
for their justification because of their interpretability on the current sam-
ple. However, many Bayesian methods can be justified within the frequentist
framework as well. In this section we will demonstrate that Bayes estimators
can also be asymptotically efficient and have an asymptotic Normal distri-
bution within the frequentist framework.
To formalize the framework for our study, consider a set of independent and
448 PARAMETRIC INFERENCE
identically distributed random variables from a distribution F (x|θ) which has
either a density of probability distribution function given by f (x|θ) where
we will assume for simplicity that x ∈ R. The parameter θ will be assumed
to follow the prior distribution π(θ) over some parameter space Ω, which
again for simplicity we will often be taken to be R. The object of a Bayesian
analysis is then to obtain the posterior distribution π(θ|x1 , . . . , xn ), which can
be obtained using an argument based on Bayes’ theorem of the form
f (x1 , . . . , xn , θ)
π(θ|x1 , . . . , xn ) = ,
m(x1 , . . . , xn )
where f (x1 , . . . , xn , θ) is the joint distribution of X1 , . . . , Xn and θ, and the
marginal distribution of X1 , . . . , Xn is given by
Z
m(x1 , . . . , xn ) = f (x1 , . . . , xn , θ)dθ.

Using the fact that the joint distribution of X1 , . . . , Xn and θ can be found as
f (x1 , . . . , xn , θ) = f (x1 , . . . , xn |θ)π(θ) it follows that the posterior distribution
can be found directly from f (x1 , . . . , xn |θ) and π(θ) as
f (x1 , . . . , xn |θ)π(θ)
π(θ|x1 , . . . , xn ) = R . (10.67)

f (x1 , . . . , xn |θ)π(θ)dθ
Noting that the denominator of Equation (10.67) is a constant, it is often
enough to conclude that
π(θ|x1 , . . . , xn ) ∝ f (x1 , . . . , xn |θ)π(θ),
which eliminates the need to compute the integral, which can be difficult
in some cases. Once the posterior distribution is computed then a Bayesian
analysis will either use the distribution itself as the updated knowledge about
the parameter. Alternately, point estimates, confidence regions and tests can
be constructed, though their interpretation is necessarily different than the
parallel frequentists methods. For an introduction to Bayesian methods see
Bolstad (2007).
The derivation of a Bayes estimator of a parameter θ begins by specifying
a loss function L[θ, δ(x1 , . . . , xn )] where δ(x1 , . . . , xn ) is a point estimator,
called a decision rule. The posterior expected loss, or Bayes risk, is then given
by Z ∞
L[θ, δ(x1 , . . . , xn )]π(θ|x1 , . . . , xn )dθ.
−∞
The Bayes estimator of θ is then taken to be the decision rule δ that minimizes
the Bayes risk. The result given below, which is adapted from Lehmann and
Casella (1998) provides conditions under which a Bayes estimator of θ can be
found for two common loss functions.
Theorem 10.13. Let θ have a prior distribution π over Ω and suppose that
the density or probability distribution of X1 , . . . , Xn , conditional on θ is given
by fθ (x1 , . . . , xn |θ). If
BAYESIAN ESTIMATION 449
1. L(θ, δ) is a non-negative loss function,
2. There exists a decision rule δ that has finite risk,
3. For almost all (x1 , . . . , xn ) ∈ Rn there exists a decision rule δ(x1 , . . . , xn )
that minimizes the Bayes risk,
then δ is the Bayes estimator and

1. If L(θ, δ) = (θ − δ)2 then the Bayes estimator is E(θ|X1 , . . . , Xn ) and is


unique.
2. If L(θ, δ) = |θ − δ| then the Bayes estimator is any median of the posterior
distribution.

For a proof of Theorem 10.13 see Section 4.1 of Lehmann and Casella (1998).

Example 10.35. Let X be a single observation from a discrete distribution


with probability distribution function

1
2θ
 x ∈ {−1, 1},
f (x|θ) = 1 − θ x = 0,

0 elsewhere,

where θ ∈ Ω = { 14 , 12 , 34 }. Suppose that we place a Uniform{ 14 , 12 , 34 } prior


distribution on θ. The posterior distribution can then be found by direct
calculation. For example if we observe X = 0, the posterior probability for
θ = 14 is
P (X = 0|θ = 14 )P (θ = 14 ) 3 1
4 · 3
P (θ = 14 |X = 0) = =
P (X = 0) P (X = 0)
where

P (X = 0) = P (X = 0|θ = 14 )P (θ = 14 ) + P (X = 0|θ = 12 )P (θ = 12 )+
P (X = 0|θ = 34 )P (θ = 34 ) = 3
4 · 1
3 + 1
2 · 1
3 + 1
4 · 1
3 = 12 .
Therefore, the posterior probability is P (θ = 14 |X = 0) = 12
3
· 12 = 12 . Similar
calculations can be used to show that P (θ = 2 |X = 0) = 31 and P (θ =
1
3 1
4 |X = 0) = 6 . One can note that the lowest value of θ, which corresponds to
the highest probability for X = 0 has the largest posterior probability. If we
consider using squared error loss, then Theorem 10.13 implies that the Bayes
estimator of θ is given by the mean of the posterior distribution, which is
5
θ̃ = 12 . 
Example 10.36. Let B1 , . . . , Bn be a set of independent and identically dis-
tributed random variables each having a Bernoulli(θ) distribution. Suppose
that θ has a Beta(α, β) prior distribution where both α and β are speci-
fied and hence can be treated as constants. The conditional distribution of
B1 , . . . , Bn given θ is given by
P (B1 = b1 , . . . , Bn = bn |θ) = θnB̄n (1 − θ)n−nB̄n ,
450 PARAMETRIC INFERENCE
where bi can either be 0 or 1 for each i ∈ {1, . . . , n}, and B̄n is the sample
mean of b1 , . . . , bn . Therefore it follows that the posterior distribution for θ
given B1 = b1 , . . . , Bn = bn is proportional to
h i
θnB̄n (1 − θ)n−nB̄n θα−1 (1 − θ)β−1 = θα−1+nB̄n (1 − θ)β+n−1−nB̄n ,


which corresponds to a Beta(α + nB̄n , n + β − B̄n ) distribution. Theorem


10.13 then implies that the Bayes estimator of θ, when the loss function is
squared error loss, is given by the expectation of the posterior distribution
which is θ̃n = (α + β + n)−1 (α + nB̄n ). From the frequentist perspective, we
can first note that θ̃n is a consistent estimator of θ using Theorem 3.9 since
p
θ̃n = (n−1 α + n−1 β + 1)−1 (n−1 α + B̄n ) where B̄n − → θ by Theorem 3.10
(Weak Law of Large Numbers), n α → 0, and n (n α + n−1 β+1 )−1 →
−1 −1 −1

1, as n → ∞. The efficiency of θ̃n can then be studied by noting that the


expectation and variance of θ̃n are given by E(θ̃n ) = (α + β + n)−1 (α + θ)
and V (θ̃n ) = (α + β + n)−2 nθ(1 − θ). Using these results we can compare the
asymptotic relative efficiency of θ̃n to the maximum likelihood estimator of θ
which is given by θ̂n = B̄n . Using the fact that the variance of θ̂n is given by
n−1 θ(1 − θ) we find that
θ(1 − θ)(α + β + n)2
ARE(θ̃n , θ̂n ) = lim = 1.
n→∞ n2 θ(1 − θ)
It is known that θ̂n attains the lower bound for the variance given by Theorem
10.2 and hence it follows that θ̃n is asymptotically efficient. 
The asymptotic efficiency of the Bayes estimator in Example 10.36 is par-
ticularly intriguing because it demonstrates the possibility that there may
be general conditions that would allow for Bayes estimators to have frequen-
tists properties such as consistency and asymptotic efficiency. One necessary
requirement for such properties is that the prior information must asymptot-
ically have a negligible effect on the estimator as the sample size increases
to infinity. The reason for this is that the prior information necessarily intro-
duces a bias in the estimator, and this bias must be overcome by the sample
information for the estimator to be consistent. From an intuitive standpoint
we can argue that if we have full knowledge of the population, which is what
the limiting sample size might represent, then any prior information should
be essentially ignored. Note that this is a frequentist property and that the
Bayesian viewpoint may not necessarily follow this intuition.
Example 10.37. Let B1 , . . . , Bn be a set of independent and identically
distributed random variables, each having a Bernoulli(θ) distribution. Sup-
pose that θ has a Beta(α, β) prior distribution as in Example 10.36, where
the Bayes estimator of θ was found to be θ̃n = (α + β + n)−1 (α + nB̄n ). We
can observe the effect that the prior information has on θ̃n by looking at some
specific examples. In the first case consider taking α = 5 and β = 10 which
indicates that our prior information is that θ is very likely between zero and
4 1
10 . See Figure 10.2. If we observe B̄n = 2 with n = 10 then the posterior
BAYESIAN ESTIMATION 451
distribution of θ is Beta(10, 25) with θ̃n = 10 35 while the sample proportion
is θ̂n = 21 . One can observe from Figure 10.2 that the peak of the posterior
distribution has moved slightly toward the estimate θ̂n , a result of accounting
for both the prior information about θ and the information about θ from the
observed sample. Alternatively, if n = 10 but the prior distribution of θ is
taken to be a Beta(5, 5) distribution, which emphasizes a wider range of val-
ues in our prior information with a preference for θ being near 12 , an observed
value of B̄n = 12 results in a Beta(10, 10) distribution and a Bayes estimate
equal to 12 . In this case our prior information and our observed information
match very well, and the result is that the posterior distribution is still cen-
tered about 12 , but with a smaller variance indicating more posterior evidence
that θ is near 12 . See Figure 10.3.
We can now consider the question as to how this framework behaves asymp-
totically as n → ∞. Let us consider the first case where the prior distribution
is Beta(5, 10), and the sample size has been increased to 100. In this case the
55
posterior distribution is Beta(55, 60) with θ̃n = 115 . One can now observe a
significant change in the posterior distribution when compared to the prior
distribution as the large amount of information from the observed sample over-
whelms the information contained in the prior distribution. See Figure 10.4. It
is this type of behavior that must occur for Bayes estimators to be consistent
and efficient. Note that this effect depends on the choice of the prior distribu-
tion. For example, Figure 10.5 compares a Beta(100, 500) prior distribution
with the Beta(150, 550) posterior distribution under the same sample size of
n = 100 and observed sample proportion of B̄n = 21 . There is little difference
here because the variance of the prior distribution, which reflects the quality
of our knowledge of θ, is quite a bit less and hence more observed information
from the sample is required to overtake this prior information. However, as
the sample size increases, the information from the observed sample will even-
tually overtake the prior information. The fact that this occurs can be seen
from the fact that θ̃n is a consistent and asymptotically efficient estimator of
θ for all choices of α and β. See Example 10.36. 
Before proceeding to the main asymptotic results for this section, it is worth-
while to develop some connections between Bayesian estimation and the likeli-
hood function. If we focus momentarily on the case where X0 = (X1 , . . . , Xn )
is a vector of independent and identically distributed random variables from
a distribution F (x|θ) with density or probability distribution function f (x|θ),
then
Yn
f (X|θ) = f (Xi |θ) = L(θ|X).
i=1
Therefore, the posterior distribution of θ can be written as
L(θ|X)π(θ)
π(θ|X) = R . (10.68)

L(θ|X)π(θ)dθ
The assumptions required for the development of the asymptotic results rely
452 PARAMETRIC INFERENCE

Figure 10.2 The Beta(5, 10) prior density (solid line) and the Beta(10, 25) poste-
rior density (dashed line) on θ from Example 10.37 when n = 10 and B̄n = 21 . The
sample proportion is located at 12 (dotted line) and the Bayes estimate is located at
10
35
(dash and dot line).
6
5
4
3
2
1
0

0.0 0.2 0.4 0.6 0.8 1.0

on properties of the error term of a Taylor expansion of the log-likelihood


function. In particular, Theorem 1.13 implies that

l(θ̂n |X) = l(θ0 |X) + (θ̂n − θ0 )l0 (θ0 |X) + 12 (θ̂n − θ0 )2 l00 (ξn |X), (10.69)

where θ̂n is a sequence that converges in probability to θ0 as n → ∞, and


ξn is a random variable that is between θ̂n and θ0 with probability one. Note
p
that this implies that ξn −
→ θ0 as n → ∞. In the proof of Theorem 10.5 we
−1 00 p
show that n l (θ0 |X) − → I(θ0 ) as n → ∞, and hence the assumed continuity
p
of l (θ|X) with respect to θ implies that n−1 l00 (ξn |X) −
00
→ I(θ0 ), as n → ∞.
p
Therefore, we have that l00 (ξn |X) = −nI(θ0 ) − Rn (θ̂n ) where n−1 Rn (θ̂n ) −
→0
as n → ∞. For the Bayesian framework we will require additional conditions
on the convergence of Rn (θ̂n ) as detailed in Theorem 10.14 below.
Theorem 10.14. Suppose that X0 = (X1 , . . . , Xn ) is a vector of independent
and identically distributed random variables from a distribution f (x|θ), con-
ditional on θ, where the prior distribution of θ is π(θ) for θ ∈ Ω. Let τ (t|x)
be the posterior density of n1/2 (θ − θ̃0,n ), where θ̃0,n = θ0 + n−1 [I(θ0 )]−1 l0 (θ0 )
and θ0 ∈ Ω is the true value of θ. Suppose that
BAYESIAN ESTIMATION 453

Figure 10.3 The Beta(5, 5) prior density (solid line) and the Beta(10, 10) posterior
density (dashed line) on θ from Example 10.37 when n = 10 and B̄n = 21 . The sample
proportion and Bayes estimate are both located at 12 (dotted line).
4
3
2
1
0

0.0 0.2 0.4 0.6 0.8 1.0

1. Ω is an open interval.
2. The set {x : f (x|θ) > 0} is the same for all θ ∈ Ω.
3. The density f (x|θ) has two continuous derivatives with respect to θ for each
x ∈ {x : f (x|θ) > 0}.
4. The integral
Z ∞
f (x|θ)dx,
−∞
can be twice differentiated by exchanging the integral and the derivative.
5. The Fisher information number I(θ0 ) is defined, positive, and finite.
6. For any θ0 ∈ Ω there exists a positive constant d and function B(x) such
that
2

∂θ2 log[f (x|θ)] ≤ B(x),

for all x ∈ {x : f (x|θ) > 0} and θ ∈ [θ0 − d, θ0 + d] such that with


E[B(X1 )] < ∞.
454 PARAMETRIC INFERENCE

Figure 10.4 The Beta(5, 10) prior density (solid line) and the Beta(55, 60) poste-
rior density (dashed line) on θ from Example 10.37 when n = 100 and B̄n = 21 . The
sample proportion is located at 21 (dotted line) and the Bayes estimate is located at
55
115
(dash and dot line).
10
8
6
4
2
0

0.0 0.2 0.4 0.6 0.8 1.0

7. For any ε > 0 there exists δ > 0 such that


!
−1
lim P sup |n Rn (θ)| ≥ ε = 0.
n→∞
|θ̂n −θ0 |≤δ

8. For any δ > 0 there exists ε > 0 such that


!
lim P sup n−1 [l(θ̂n ) − l(θ0 )] ≤ −ε = 1.
n→∞
|θ̂n −θ0 |≥δ

9. The prior density π on θ is continuous and positive for all θ ∈ Ω.

Then,
Z ∞
p
|τ (t|x) − [I(θ0 )]1/2 φ{t[I(θ0 )]1/2 }|dt −
→ 0, (10.70)
−∞
as n → ∞. If we can additionally assume that the expectation
Z
|θ|πθdθ,

BAYESIAN ESTIMATION 455

Figure 10.5 The Beta(100, 500) prior density (solid line) and the Beta(150, 550)
posterior density (dashed line) on θ from Example 10.37 when n = 100 and B̄n = 12 .
25
20
15
10
5
0

0.0 0.2 0.4 0.6 0.8 1.0

is finite, then
Z ∞
p
(1 + |t|)|τ (t|x) − [I(θ0 )]1/2 φ{t[I(θ0 )]1/2 }|dt −
→ 0, (10.71)
−∞
as n → ∞.

The proof of Theorem 10.14 is rather complicated, and can be found in Section
6.8 of Lehmann and Casella (1998). Note that in Theorem 10.14 the integrals
in Equations (10.70) and (10.71) are being used as a type of norm in the space
of density functions, so that the results state that π(t|x) and I 1/2 φ{t[I(θ0 )]1/2 }
coincide as n → ∞, since the integrals converge in probability to zero. There-
fore, the conclusions of Theorem 10.14 state that the posterior density of
n1/2 [θ − θ0 − n−1 l0 (θ0 )/I(θ0 )] converges to that of a N{0, [I(θ0 )]−1 } density.
Equivalently, one can conclude that when n is large, the posterior distribution
of θ has an approximate N{[θ0 + n−1 l](θ0 )[I(θ0 )]−1 , [nI(θ0 )]−1 } distribution.
The results of Equations (10.70) and (10.71) make the same type of conclu-
sion, the difference being that the rate of convergence is faster in the tails of
the density in Equation (10.71) due to the factor (1 + |t|) in the integral.
Note that the assumptions required for Theorem 10.14 are quite a bit stronger
than what is required to develop the asymptotic theory of maximum likelihood
456 PARAMETRIC INFERENCE
estimators. While Assumptions 1–6 are the same as in Theorem 10.5, the
p
assumptions used for likelihood theory that imply that n−1 Rn (θ0 ) −
→ 0 as n →
∞, are replaced by the stronger Assumption 7 in Theorem 10.14. Additionally,
in the asymptotic theory of maximum likelihood estimation, the consistency
and asymptotic efficiency of the maximum likelihood estimator requires us
to only specify the behavior of the likelihood function in a neighborhood of
the true parameter value. Because Bayes estimators involve integration of the
likelihood over the entire range of the parameter space, Assumption 8 ensures
that the likelihood function is well behaved away from θ0 as well.
When the squared error loss function is used, the result of Theorem 10.14 is
sufficient to conclude that the Bayes estimator is both consistent and efficient.
Theorem 10.15. Suppose that X0 = (X1 , . . . , Xn ) is a vector of independent
and identically distributed random variables from a distribution f (x|θ), con-
ditional on θ, where the prior distribution of θ is π(θ) for θ ∈ Ω. Let τ (t|x)
be the posterior density of n1/2 (θ − θ̃0,n ), where θ̃0,n = θ0 + n−1 [I(θ0 )]−1 l0 (θ0 )
and θ0 ∈ Ω is the true value of θ. Suppose that the conditions of Theorem
10.14 hold, then when the loss function is squared error loss, it follows that
d
→ Z as n → ∞ where Z is a N{0, [I(θ0 )]−1 } random variable.
n1/2 (θ̂n − θ0 ) −

Proof. Note that


n1/2 (θ̂n − θ0 ) = n1/2 (θ̂n − θ̃0,n ) + n1/2 (θ̃0,n − θ0 ). (10.72)
The second term in Equation (10.72) has the form
n1/2 (θ̃0,n − θ0 ) = n−1/2 [I(θ0 )]−1 l0 (θ0 )
n

−1 ∂
X
−1/2
= n [I(θ0 )] log[f (Xi |θ)]

∂θ i=1
θ=θ0
n
X f 0 (Xi |θ0 )
= n−1/2 [I(θ0 )]−1 .
i=1
f (Xi |θ0 )

As noted in the proof of Theorem 10.5, Assumption 4 implies that


 0 
f (Xi |θ0 )
E = 0,
f (Xi |θ0 )
d
so that Theorem 4.20 (Lindeberg and Lévy) implies that n1/2 (θ̃0,n − θ0 ) −
→Z
as n → ∞ where Z is a N{0, [I(θ0 )]−1 } random variable. To study the first
term in Equation (10.72) we begin by noting that
π[n−1/2 (t + n1/2 θ̃0,n )|x]L(n−1/2 t + θ̃0,n |x)
τ (t|x) = R dt

π[n−1/2 (t + n1/2 θ̃0,n |x]L(n−1/2 t + θ̃0,n |x)
= n−1/2 π(θ̃0,n + n−1/2 t|x).
Theorem 10.13 implies that the Bayes estimator of θ using squared error loss
BAYESIAN ESTIMATION 457
is given by Z
θ̃n = θπ(θ|x)dθ.

Consider the change of variable where we let θ = θ̃0,n + n−1/2 t so that dθ =
n−1/2 dt. Let Ω̃ denote the corresponding transformation of Ω. Therefore
Z Z
θ̃n = θπ(θ|x)dθ = n−1/2 (θ̃0,n + n−1/2 t)π(θ̃0,n + n−1/2 t)dt.
Ω Ω̃
−1/2
Noting that τ (t|x) = n π(θ̃0,n + n−1/2 t|x) we have that
Z
θ̃n = (θ̃0,n + n−1/2 t)τ (t|x)dt
Ω̃
Z Z
−1/2
= θ̃0,n τ (t|x)dt + n tτ (t|x)dt. (10.73)
−Ω̃ Ω̃

Note that θ̃0,n does not depend on t so that the first term in Equation (10.73)
is θ̃0,n . Therefore, Z
θ̃n = θ̃0,n + n−1/2 tτ (t|x)dt,
Ω̃
or that Z
n1/2 (θ̃n − θ̃0,n ) = tτ (t|x)dt.
Ω̃
Now, note that
Z
1/2

n |θ̃n − θ̃0,n | = tτ (t|x)dt =

Ω̃
Z
{tτ (t|x) − t[I(θ0 )]1/2 φ{t[I(θ0 )]1/2 }}dt


Ω̃

which follows from the fact that


Z
t[I(θ0 )]1/2 φ{t[I(θ0 )]1/2 }dt = 0,
Ω̃

since the integral represents the expectation of a N{0, [I(θ0 )]−1 } random vari-
able. Theorem A.6 implies that
Z
{tτ (t|x) − tI 1/2 (θ0 )φ[tI 1/2 (θ0 )]}dt ≤


Ω̃
Z
tτ (t|x) − tI 1/2 (θ0 )φ[tI 1/2 (θ0 )] dt =

Ω̃
Z
|t| τ (t|x) − tI 1/2 (θ0 )φ[tI 1/2 (θ0 )] dt.

Ω̃

Therefore
Z
n1/2 |θ̃n − θ̃0,n | ≤ |t| τ (t|x) − tI 1/2 (θ0 )φ[tI 1/2 (θ0 )] dt. (10.74)

Ω̃
458 PARAMETRIC INFERENCE
Theorem 10.14 implies that the integral on the right hand side of Equation
(10.74) converges in probability to zero as n → ∞. Therefore, it follows that
p
n1/2 |θ̃n − θ̃0,n | −
→ 0 as n → ∞. Combining this result with the fact that
d
n1/2 (θ̃0,n − θ0 ) −
→ Z as n → ∞ and using Theorem 4.11 (Slutsky) implies that
d
n1/2 (θ̃n − θ0 ) −
→ Z as n → ∞, which proves the result.
The conditions of Theorem 10.15 appear to be quite complicated, but in fact
can be shown to hold for exponential families.
Theorem 10.16. Consider an exponential family density of the form f (x|θ) =
exp[θT (x) − A(θ)] where the parameter space Ω is an open interval, T (x) is
not a function of θ, and A(θ) is not a function of x. Then for this density the
assumptions of Theorem 10.15 are satisifed.
For a proof of Theorem 10.16 see Example 6.8.4 of Lehmann and Casella
(1998).
Example 10.38. Let X1 , . . . , Xn be a set of independent and identically
distributed random variables from a N(θ, σ 2 ) distribution, conditional on θ,
where θ has a N(λ, τ 2 ) distribution where σ 2 , λ and τ 2 are known. In this
case it can be shown that the posterior distribution of θ is N(θ̃n , σ̃n2 ) where
τ 2 X̄n + n−1 σ 2 λ
θ̃n = ,
τ 2 + n−1 σ 2
and
σ2 τ 2
σ̃n2 = .
nτ 2 + σ 2
See Exercise 44. Under squared error loss, the Bayes estimator of θ is given
by E(θ|X1 , . . . , Xn ) = θ̃n . Treating X1 , . . . , Xn as a random sample from a
p
N(θ, σ 2 ) where θ is fixed we have that X̄n − → θ as n → ∞, by Theorem 3.10
(Weak Law of Large Numbers) and hence Theorem 3.7 implies that
τ 2 X̄n + n−1 σ 2 λ p
θ̃n = −
→ θ,
τ 2 + n−1 σ 2
as n → ∞. Therefore the Bayes estimator is consistent. Further, Theorem
d
4.20 (Lindeberg and Lévy) implies that n1/2 X̄n − → Z as n → ∞, where Z is
2
a N (θ, σ ) random variable. The result in this case is, in fact, exact for any
sample size, but for the sake of argument here we will base our calculations
on the asymptotic result. Note that

n1/2 τ 2 X̄n + n−1/2 σ 2 λ τ2 n−1/2 σ 2 λ


n1/2 θ̃n = = n 1/2
X̄n + ,
τ 2 + n−1 σ 2 τ 2 + n−1 σ 2 τ 2 + n−1 σ 2
where
τ2
lim = 1,
n→∞ τ 2 + n−1 σ 2
and
n−1/2 σ 2 λ
lim = 0.
n→∞ τ 2 + n−1 σ 2
EXERCISES AND EXPERIMENTS 459
d
Therefore, Theorem 4.11 (Slutsky) implies that n1/2 θ̃n −
→ Z as n → ∞, which
is the asymptotic behavior specified by Theorem 10.15. 
Example 10.39. Let B1 , . . . , Bn be a set of independent and identically dis-
tributed Bernoulli(θ) random variables and suppose that θ has a Beta(α, β)
prior distribution where both α and β are specified. In Example 10.36 it was
shown that the Bayes estimator under squared error loss is given by
α + nB̄n
θ̃n = ,
α+β+n
which was shown to be a consistent estimator of θ. Theorem 4.20 (Lindeberg
d
and Lévy) implies that n1/2 B̄n −
→ Z as n → ∞ where Z is a N[θ, θ(1 − θ)]
random variable. Now, note that
n1/2 α + n3/2 B̄n n1/2 α n1/2 B̄n
n1/2 θ̃n = = + −1 ,
α+β+n α + β + n n α + n−1 β + 1
where
n1/2 α
lim = 0,
n→∞ α + β + n

and
lim n−1 α + n−1 β + 1 = 1.
n→∞
d
Therefore, Theorem 4.11 (Slutsky) implies that n1/2 θ̃n −
→ Z as n → ∞, which
is the asymptotic behavior specified by Theorem 10.15. 

10.7 Exercises and Experiments

10.7.1 Exercises

1. Prove that M SE(θ̂, θ) can be decomposed into two parts given by


MSE(θ̂, θ) = Bias2 (θ̂, θ) + V (θ̂),
where Bias(θ̂, θ) = E(θ̂) − θ.
2. Let {Xn }∞n=1 be a sequence of independent and identically distributed ran-
dom variables from a distribution F with mean θ and variance σ 2 . Suppose
that f has a finite fourth moment and let g be a function that has at least
four derivatives. If the fourth derivative of g 2 is bounded then prove that
V [g(X̄n )] = n−1 σ 2 [g 0 (θ)]2 + O(n−1 ),
as n → ∞.
3. Let {Xn }∞
n=1 be a sequence of independent and identically distributed ran-
dom variables from a distribution F with mean θ and finite variance σ 2 .
Suppose now that we are interested in estimating g(θ) = cθ3 using the
estimator g(X̄n ) = cX̄n3 where c is a known real constant.
460 PARAMETRIC INFERENCE
a. Find the bias and variance of g(X̄n ) as an estimator of g(θ) directly.
b. Find asymptotic expressions for the bias and variance of g(X̄n ) as an
estimator of g(θ) using Theorem 10.1. Compare this result to the exact
expressions derived above.

4. Let B1 , . . . , Bn be a set of independent and identically distributed random


variables from a Bernoulli(θ) distribution. Suppose we are interested
in estimating the variance g(θ) = θ(1 − θ) with the estimator g(B̄n ) =
B̄n (1 − B̄n ).

a. Find the bias and variance of g(B̄n) as an estimator of g(θ) directly.
b. Find asymptotic expressions for the bias and variance of g(B̄n) as an
estimator of g(θ) using Theorem 10.1. Compare this result to the exact
expressions derived above.

5. Let X1 , . . . , Xn be a set of independent and identically distributed random


variables from an Exponential(θ) distribution. Suppose that we are inter-
ested in estimating the variance g(θ) = θ2 with the estimator g(X̄n ) = X̄n2 .

a. Find the bias and variance of g(X̄n ) as an estimator of g(θ) directly.


b. Find asymptotic expressions for the bias and variance of g(X̄n ) as an
estimator of g(θ) using Theorem 10.1. Compare this result to the exact
expressions derived above.

6. Let X1 , . . . , Xn be a set of independent and identically distributed random


variables from a Poisson(θ) distribution. Suppose that we are interested
in estimating the variance g(θ) = exp(−θ) with the estimator g(X̄n ) =
exp(−X̄n ). Find asymptotic expressions for the bias and variance of g(X̄n )
as an estimator of g(θ) using Theorem 10.1.
7. Let {Xn }∞n=1 be a sequence of independent and identically distributed ran-
dom variables following a N(θ, σ 2 ) distribution and consider estimating
exp(θ) with exp(X̄n ). Find an asymptotic expression for the variance of
exp(X̄n ) using the methods detailed in Example 10.5.
8. Let X1 , . . . , Xn be a set of independent and identically distributed random
variables from a N(θ, σ 2 ) distribution where σ 2 is finite. Let k(n) be a
function of n ∈ N that returns an integer between 1 and n, and consider
the estimator of θ that returns the average of the first k(n) observations.
That is, define
X̄k(n) = [k(n)]^{-1} Σ_{i=1}^{k(n)} Xi.

a. Prove that X̄k(n) is an unbiased estimator of θ.


b. Find the mean squared error of X̄k(n) .
c. Assuming that the optimal mean squared error for unbiased estimators of
θ is n−1 σ 2 , under what conditions will X̄k(n) be asymptotically optimal?
9. Let {Xn }∞n=1 be a sequence of independent and identically distributed ran-
dom variables from a LaPlace(θ, 1) distribution. Let θ̂n denote the sample
mean and θ̃n denote the sample median computed on X1 , . . . , Xn . Compute
ARE(θ̂n, θ̃n).
10. Let X1 , . . . , Xn be a set of independent and identically distributed random
variables following a mixture of two Normal distributions, with a density
given by f(x) = ½φ(x − ζ) + ½φ(x + ζ) where ζ is a positive real number.
Compute the asymptotic relative efficiency of the sample mean relative to
the sample median as an estimator of the mean of this density. Comment
on the role that the parameter ζ has on the efficiency.
11. Let X1 , . . . , Xn be a set of independent and identically distributed ran-
dom variables from a Poisson(θ) distribution. Consider two estimators of
P (Xn = 0) = exp(−θ) given by
θ̂n = n^{-1} Σ_{i=1}^{n} δ{Xi; {0}},

which is the proportion of values in the sample that are equal to zero, and
θ̃n = exp(−X̄n ). Compute the asymptotic relative efficiency of θ̂n relative
to θ̃n and comment on the results.
12. Let X1 , . . . , Xn be a set of independent and identically distributed random
variables following a N(θ, 1) distribution.
a. Prove that if θ ≠ 0 then δ{|X̄n|; [0, n^{-1/4})} →p 0 as n → ∞.
b. Prove that if θ = 0 then δ{|X̄n|; [0, n^{-1/4})} →p 1 as n → ∞.

13. Let X1 , . . . , Xn be a set of independent and identically distributed random


variables from a Gamma(2, θ) distribution.

a. Find the maximum likelihood estimator for θ.


b. Determine whether the maximum likelihood estimator is consistent and
asymptotically efficient.

14. Let X1 , . . . , Xn be a set of independent and identically distributed random


variables from a Poisson(θ) distribution.

a. Find the maximum likelihood estimator for θ.


b. Determine whether the maximum likelihood estimator is consistent and
asymptotically efficient.

15. Let X1 , . . . , Xn be a set of independent and identically distributed random


variables from a N(θ, 1) distribution.

a. Find the maximum likelihood estimator for θ.


b. Determine whether the maximum likelihood estimator is consistent and
asymptotically efficient.
16. Let X1 , . . . , Xn be a set of independent and identically distributed random
variables from a N(0, θ) distribution.

a. Find the maximum likelihood estimator for θ.


b. Determine whether the maximum likelihood estimator is consistent and
asymptotically efficient.

17. Consider a sequence of random variables {{Xij }kj=1 }ni=1 that are assumed
to be mutually independent, each having a N(µi , θ) distribution for i =
1, . . . , n. Prove that the maximum likelihood estimators of µi and θ are
µ̂i = X̄i = k^{-1} Σ_{j=1}^{k} Xij,
for each i = 1, . . . , n and
θ̂n = (nk)^{-1} Σ_{i=1}^{n} Σ_{j=1}^{k} (Xij − X̄i)².

18. Consider a sequence of independent and identically distributed d-dimensional


random vectors {Xn }∞ n=1 from a d-dimensional distribution F . Assume
the structure smooth function model with µ = E(Xn ), θ = g(µ) with
θ̂n = g(X̄n ). Further, assume that
σ² = h²(µ) = lim_{n→∞} V(n^{1/2}θ̂n),

with σ̂n2 = h2 (X̄n ). Let Gn (t) = P [n1/2 σ −1 (θ̂n − θ) ≤ t] and Hn (t) =


P [n1/2 σ̂n−1 (θ̂n − θ) ≤ t] and define gα,n and hα,n to be the corresponding α
quantiles of Gn and Hn . Define the ordinary and studentized 100α% upper
confidence limits for θ as θ̂n,ord = θ̂n − n−1/2 σg1−α and θ̂n,stud = θ̂n −
n−1/2 σ̂n h1−α . Prove that θ̂n,ord and θ̂n,stud are accurate upper confidence
limits.
19. Suppose that X1 , . . . , Xn is a set of independent and identically distributed
random variables from a distribution F with parameter θ. Suppose that F
and θ fall within the smooth function model. For the case when σ is known,
prove that θ̃n,ord is a first-order correct and accurate 100α% confidence limit
for θ.
20. Equation (10.41) provides an asymptotic expansion for the coverage prob-
ability of an upper confidence limit that has asymptotic expansion
θ̂n (α) = θ̂n + n−1/2 σ̂n zα + n−1 σ̂n û1 (zα ) + n−3/2 σ̂n û2 (zα ) + Op (n−2 ),
as n → ∞, as

πn(α) = α + n^{-1/2}[u1(zα) − v1(zα)]φ(zα) − n^{-1}[½zα u1²(zα) − u2(zα) +
u1(zα)v1′(zα) − zα u1(zα)v1(zα) − v2(zα) + uα zα]φ(zα) + O(n^{-3/2}),
as n → ∞. Prove that when u1(zα) = s1(zα) and u2(zα) = s2(zα) it follows
that πn(α) = α + n^{-1}uα zα φ(zα) + O(n^{-3/2}), as n → ∞.
21. Let X1 , . . . , Xn be a sequence of independent and identically distributed
random variables from a distribution F with mean θ and assume the
framework of the smooth function model. Let σ 2 = E[(X1 − θ)2 ], γ =
σ −3 E[(X1 − θ)3 ], and κ = σ −4 E[(X1 − θ)4 ] − 3.
a. Prove that an exact 100α% upper confidence limit for θ has asymptotic
expansion

θ̂n,stud(α) = θ̂n + n^{-1/2}σ̂n{z_{1−α} + (1/6)n^{-1/2}γ(2z_{1−α}² + 1) +
n^{-1}z_{1−α}[−(1/12)κ(z_{1−α}² − 3) + (5/72)γ²(4z_{1−α}² − 1) + (1/4)(z_{1−α}² + 3)]} + Op(n^{-2}),
as n → ∞.
b. Let γ̂n and κ̂n be the sample skewness and kurtosis, respectively, and
assume that γ̂n = γ + Op (n−1/2 ) and κ̂n = κ + Op (n−1/2 ), as n → ∞.
Prove that the Edgeworth-corrected upper confidence limit for θ given
by

θ̂n + n^{-1/2}σ̂n{z_{1−α} + (1/6)n^{-1/2}γ̂n(2z_{1−α}² + 1) +
n^{-1}z_{1−α}[−(1/12)κ̂n(z_{1−α}² − 3) + (5/72)γ̂n²(4z_{1−α}² − 1) + (1/4)(z_{1−α}² + 3)]},
has an asymptotic expansion for its coverage probability given by
π̄n(α) = α − (1/6)n^{-1}(κ − (3/2)γ²)zα(2zα² + 1)φ(zα) + O(n^{-3/2}),
so that the confidence limit is second-order accurate. In this argument
you will need to prove that uα = γ^{-1}(κ − (3/2)γ²)u1(zα).
c. Let γ̂n be the sample skewness and assume that γ̂n = γ + Op (n−1/2 ), as
n → ∞. Find an asymptotic expansion for the coverage probability of
the Edgeworth-corrected upper confidence limit for θ given by
θ̂n + n^{-1/2}σ̂n[z_{1−α} + (1/6)n^{-1/2}γ̂n(2z_{1−α}² + 1)].
Does there appear to be any advantage from an asymptotic viewpoint
to estimating the kurtosis?
22. Let X1 , . . . , Xn be a sequence of independent and identically distributed
random variables from a distribution F with parameter θ and assume the
framework of the smooth function model. Consider a general 100α% upper
confidence limit for θ which has an asymptotic expansion of the form
θ̂n (α) = θ̂n + n−1/2 σ̂n zα + n−1 σ̂n û1 (zα ) + n−3/2 σ̂n û2 (zα ) + Op (n−2 ),
as n → ∞. Consider a two-sided confidence interval based on this confidence
limit of the form [θ̂n [(1 − α)/2], θ̂n [(1 + α)/2]]. Prove that the length of this
interval has an asymptotic expansion of the form
2n−1/2 σ̂n z(1+α)/2 + 2n−1 σ̂n u1 (z(1+α)/2 ) + Op (n−2 ),
as n → ∞.
23. Prove that the coverage probability of a 100α% upper confidence limit that
has an asymptotic expansion of the form
θ̂n (α) = θ̂n + n−1/2 σ̂n zα + n−1 σ̂n ŝ1 (zα ) + n−3/2 σ̂n ŝ2 (zα ) + Op (n−2 ),
has an asymptotic expansion of the form
πn(α) = α + n^{-1}uα zα φ(zα) + O(n^{-3/2}),
as n → ∞. Is there any case where uα = 0 so that the resulting confidence
limit would be third-order accurate?
24. Let X1 , . . . , Xn be a sequence of independent and identically distributed
random variables from a distribution F with parameter θ and assume the
framework of the smooth function model. A common 100α% upper confi-
dence limit proposed, using a very different motivation, for use with the
bootstrap methodology of Efron (1979), is the backwards confidence limit
given by θ̂n (α) = θ̂n + n−1/2 σ̂n gα , where it is assumed that the standard
deviation σ is unknown. Hall (1988a) named this confidence limit the back-
wards confidence limit because it is based on the upper quantile of G instead
of the lower quantile of H, which is the correct quantile given in the form of
the upper studentized confidence limit. Therefore, one justification of this
method is based on assuming gα ' h1−α , which will be approximately true
in the smooth function model when n is large.

a. Find an asymptotic expansion for the confidence limit θ̂n (α) and find
the order of asymptotic correctness of the method.
b. Find an asymptotic expansion for the coverage probability of the up-
per confidence θ̂n (α) and find the order of asymptotic accuracy of the
method.
c. Find an asymptotic expansion for the coverage probability of a two-sided
confidence interval based on this method.

25. Let X1 , . . . , Xn be a sequence of independent and identically distributed


random variables from a distribution F with parameter θ and assume the
framework of the smooth function model. Consider the backwards confi-
dence limit given by θ̂n (α) = θ̂n + n−1/2 σ̂n gα , where it is assumed that the
standard deviation σ is unknown. The asymptotic correctness and accuracy
is studied in Exercise 24. Efron (1981) attempted to improve the proper-
ties of the backwards method by adjusting the confidence coefficient to re-
move some of the bias from the method. The resulting method, called the
bias-corrected method, uses an upper confidence limit equal to θ̂n [β(α)] =
θ̂n + n−1/2 σ̂n gβ(α) , where β(α) = Φ(zα + 2µ̃) and µ̃ = Φ−1 [Gn (0)].

a. Prove that µ̃ = n−1/2 r1 (0) + O(n−1 ), as n → ∞.


b. Prove that the bias-corrected critical point has the form θ̂n [β(α)] =
θ̂n + n−1/2 σ̂n {zα + n−1/2 [2r1 (0) − r1 (zα )] + O(n−1 )}, as n → ∞.
c. Prove that the coverage probability of the bias-corrected upper confi-
dence limit is given by
πbc (α) = α + n−1/2 [2r1 (0) − r1 (zα ) − v1 (zα )]φ(zα ) + O(n−1 ),
as n → ∞.
d. Discuss the results given above in terms of the performance of this con-
fidence interval.

26. Consider the problem of testing the null hypothesis H0 : θ ≤ θ0 against


the alternative hypothesis H1 : θ > θ0 using the test statistic Zn =
n1/2 σ −1 (θ̂n − θ0 ) where σ is known and the null hypothesis is rejected
whenever Zn > rn,α , a constant that depends on n and α. Prove that this
test is unbiased.
27. Let {Fn}∞n=1 be a sequence of distribution functions such that Fn ⇝ F as
n → ∞ for some distribution function F. Let {tn}∞n=1 be a sequence of real
numbers.

a. Prove that if tn → ∞ as n → ∞ then


lim Fn (tn ) = 1.
n→∞

b. Prove that if tn → −∞ as n → ∞ then


lim Fn (tn ) = 0.
n→∞

c. Prove that if tn → t where t ∈ C(F ) as n → ∞ then


lim Fn (tn ) = F (t).
n→∞

28. Let B1 , . . . , Bn be a sequence of independent and identically distributed


random variables from a Bernoulli(θ) distribution where the parameter
space of θ is Ω = (0, 1). Consider testing the null hypothesis H0 : θ ≤ θ0
against the alternative hypothesis H1 : θ > θ0 .

a. Describe an exact test of H0 against H1 whose rejection region is based


on the Binomial distribution.
b. Find an approximate test of H0 against H1 using Theorem 4.20 (Linde-
berg and Lévy). Prove that this test is consistent and find an expression
for the asymptotic power of this test for the sequence of alternatives
given by θ1,n = θ0 + n−1/2 δ where δ > 0.

29. Let U1 , . . . , Un be a sequence of independent and identically distributed


random variables from a Uniform(0, θ) distribution where the parameter
space for θ is Ω = (0, ∞). Using the test statistic U(n) , where U(n) =
max{U1 , . . . , Un }, develop an unbiased test of the null hypothesis H0 : θ ≤
θ0 against the alternative hypothesis H1 : θ > θ0 that is not based on an
asymptotic Normal distribution.
30. Let X1 , . . . , Xn be a set of independent and identically distributed random
variables from a distribution F with parameter θ. Consider testing the null
hypothesis H0 : θ ≥ θ0 against the alternative hypothesis H1 : θ < θ0 .
Assume that the test statistic is of the form Zn = n1/2 [σ(θ0 )]−1 (θ̂n − θ0 )
d
where Zn −→ Z as n → ∞ and the null hypothesis is rejected when Zn <
rα,n where rα,n → zα as n → ∞.

a. Prove that the test is consistent against all alternatives θ < θ0 . State
any additional assumptions that must be made in order for this result
to be true.
b. Develop an expression similar to that given in Theorem 10.10 for the
asymptotic power of this test for a sequence of alternatives given by
θ1,n = θ0 − n−1/2 δ where δ > 0 is a constant. State any additional
assumptions that must be made in order for this result to be true.

31. Consider the framework of the smooth function model where σ, which de-
notes the asymptotic variance of n1/2 θ̂n , is known. Consider using the test
statistic Zn = n1/2 σ −1 (θ̂n − θ0 ) which follows the distribution Gn when θ0
is the true value of θ. Prove that an unbiased test of size α of the null hy-
pothesis H0 : θ ≤ θ0 against the alternative hypothesis H1 : θ > θ0 rejects
the null hypothesis if Zn > g1−α , where we recall that g1−α is the (1 − α)th
quantile of the distribution Gn .
32. Consider the framework of the smooth function model where σ, which de-
notes the asymptotic variance of n1/2 θ̂n , is unknown, and the test statistic
Tn = n1/2 σ̂n−1 (θ̂n −θ0 ) follows the distribution Hn when θ0 is the true value
of θ.

a. Prove that a test of the null hypothesis H0 : θ ≤ θ0 against the alterna-


tive hypothesis H1 : θ > θ0 that rejects the null hypothesis if Tn > z1−α
is a first-order accurate test.
b. Prove that a test of the null hypothesis H0 : θ ≤ θ0 against the al-
ternative hypothesis H1 : θ > θ0 that rejects the null hypothesis if
Tn > z1−α + n−1/2 ŝ1 (z1−α ) is a second-order accurate test.

33. Let X1 , . . . , Xn be a sequence of independent and identically distributed


random variables that have an Exponential(θ) distribution for all n ∈ N.
Consider testing the null hypothesis H0 : θ ≤ θ0 against the alternative
hypothesis H1 : θ > θ0 for some specified θ0 > 0. This model falls within the smooth
function model. When n is large it is often suggested that rejecting the null
hypothesis when Tn = n1/2 σ̂n−1 (X̄n − θ0 ) > z1−α , where σ̂n2 is the unbiased
version of the sample variance, is a test that is approximately valid. Find
an asymptotic expansion for the accuracy of this approximate test.
34. In the context of the proof of Theorem 10.12, prove that Λn = n(θ̂n −
θ0 )2 I(θ0 ) + op (n−1 ), as n → ∞ where θ1,n = θ0 + n−1/2 δ.
35. Under the assumptions outlined in Theorem 10.11, show that Wald’s statis-
tic, which is given by Q = n(θ̂n − θ0 )I(θ̂n ) where I(θ̂n ) denotes the Fisher
information number evaluated at the maximum likelihood statistics θ̂n ,
has an asymptotic ChiSquared(1) distribution under the null hypothesis
H0 : θ = θ 0 .
36. Under the assumptions outlined in Theorem 10.11, show that Rao’s efficient
score statistic, which is given by Q = n−1 Un2 (θ0 )I −1 (θ0 ) has an asymptotic
ChiSquared(1) distribution under the null hypothesis H0 : θ = θ0 , where
Un(θ0) = Σ_{i=1}^{n} [∂ log f(Xi; θ)/∂θ]|_{θ=θ0}.

Does this statistic require us to calculate the maximum likelihood estimator


of θ?
37. Under the assumptions outlined in Theorem 10.11, show that Wald’s statis-
tic, Q = n(θ̂n − θ0 )I(θ̂n ), has an asymptotic ChiSquared[1, δ 2 I(θ0 )] dis-
tribution under the sequence of alternative hypotheses {θ1,n }∞ n=1 where
θ1,n = θ0 + n−1/2 δ.
38. Under the assumptions outlined in Theorem 10.11, show that Rao’s efficient
score statistic, which is given by Q = n−1 Un2 (θ0 )I −1 (θ0 ) has an asymptotic
ChiSquared[1, δ²I(θ0)] distribution under the sequence of alternative hy-
potheses {θ1,n}∞n=1 where θ1,n = θ0 + n^{-1/2}δ.
39. Suppose that X1 , . . . , Xn is a set of independent and identically distributed
random variables from a continuous distribution F . Let ξ ∈ (0, 1) and define
θ = F −1 (ξ), the ξ th population quantile of F . To compute a confidence
region for θ, let X(1) ≤ X(2) ≤ · · · ≤ X(n) be the order statistics of the
sample and let B be a Binomial(n, ξ) random variable. The usual point
estimator of θ is θ̂ = X(⌊nξ⌋+1) where ⌊x⌋ is the largest integer strictly less
than x. A confidence interval for θ is given by C(α, ω) = [X(L) , X(U ) ], where
L and U are chosen so that P (B < L) = ωL and P (B ≥ U ) = 1 − ωU and
α = ωU − ωL . See Section 3.2 of Conover (1980) for examples using this
method.
a. Derive an observed confidence level based on this confidence interval for
an arbitrary interval subset Ψ = (tL , tU ) of the parameter space of θ.
b. Derive an approximate observed confidence level for an arbitrary interval
subset Ψ = (tL , tU ) of the parameter space of θ that is based on approx-
imating the Binomial distribution with the Normal distribution when
n is large.
40. Suppose X1 , . . . , Xn is a set of independent and identically distributed ran-
dom variables from a Poisson(θ) distribution where θ ∈ Ω = (0, ∞). Gar-
wood (1936) suggests a 100α% confidence interval for θ using the form
C(α, ω) = [χ²_{2Y;ωL}/(2n), χ²_{2(Y+1);ωU}/(2n)],
where
Y = Σ_{i=1}^{n} Xi,
and ω = (ωL, ωU) ∈ Wα where
Wα = {ω = (ωL, ωU) : ωL ∈ [0, 1], ωU ∈ [0, 1], ωU − ωL = α}.
Therefore L(ωL) = ½n^{-1}χ²_{2Y;1−ωL} and U(ωU) = ½n^{-1}χ²_{2(Y+1);1−ωU}. De-
rive an observed confidence level based on this confidence interval for an
arbitrary interval subset Ψ = (tL , tU ) of the parameter space of θ.
41. Suppose X1 , . . . , Xn is a random sample from an Exponential location
family of densities of the form f (x) = exp[−(x − θ)]δ{x; [θ, ∞)}, where
θ ∈ Ω = R.
a. Let X(1) be the first order-statistic of the sample X1 , . . . , Xn . That is
X(1) = min{X1 , . . . , Xn }. Prove that
C(α, ω) = [X(1) + n−1 log(1 − ωU ), X(1) + n−1 log(1 − ωL )],
is a 100α% confidence interval for θ when ωU − ωL = α where ωL ∈ [0, 1]
and ωU ∈ [0, 1]. Hint: Use the fact that the density of X(1) is f (x(1) ) =
n exp[−n(x(1) − θ)]δ{x(1) ; [θ, ∞)}.
b. Use the confidence interval given above to derive an observed confidence
level for an arbitrary region Ψ = (tL , tU ) ⊂ R where tL < tU .
42. Suppose X1 , . . . , Xn is a random sample from a Uniform(0, θ) density
where θ ∈ Ω = (0, ∞).
a. Find a 100α% confidence interval for θ when ωU − ωL = α where ωL ∈
[0, 1] and ωU ∈ [0, 1].
b. Use the confidence interval given above to derive an observed confidence
level for an arbitrary region Ψ = (tL , tU ) ⊂ R where 0 < tL < tU .
43. Let X1 , . . . , Xn be a set of independent and identically distributed d-
dimensional random vectors from a distribution F with real valued pa-
rameter θ that fits within the smooth function model. Let Ψ be an interval
subset of the parameter space of θ, which will be assumed to be a subset
of the real line. When σ is unknown, a correct observed confidence level for
Ψ is given by
αstud (Ψ) = Hn [n1/2 σ̂n−1 (θ̂ − tL )] − Hn [n1/2 σ̂n−1 (θ̂ − tU )],
where Hn is the distribution function of n1/2 σ̂n−1 (θ̂n − θ). Suppose that Hn
is unknown, but that we can estimate Hn using its Edgeworth expansion.
That is, we can estimate Hn with Ĥn (t) = Φ(t) + n−1/2 v̂1 (t)φ(t), where
v̂1 (t) = v1 (t) + Op (n−1/2 ), as n → ∞. The observed confidence level can
then be estimated with
α̃stud (Ψ) = Ĥn [n1/2 σ̂n−1 (θ̂ − tL )] − Ĥn [n1/2 σ̂n−1 (θ̂ − tU )].
Prove that α̃stud is second-order accurate.
44. Let X1 , . . . , Xn be a set of independent and identically distributed ran-
dom variables from a N(θ, σ 2 ) distribution conditional on θ, where θ has a
N(λ, τ 2 ) distribution where σ 2 , λ and τ 2 are known. Prove that the poste-
rior distribution of θ is N(θ̃n , σ̃n2 ) where
θ̃n = (τ²x̄n + n^{-1}σ²λ)/(τ² + n^{-1}σ²),
and
σ̃n² = σ²τ²/(nτ² + σ²).
45. Let X be a single observation from a discrete distribution with probability
distribution function

f(x|θ) = ¼θ for x ∈ {−2, −1, 1, 2},
        1 − θ for x = 0,
        0 elsewhere,
where θ ∈ Ω = {1/5, 2/5, 3/5, 4/5}. Suppose that the prior distribution on θ is a
Uniform{1/5, 2/5, 3/5, 4/5} distribution. Suppose that X = 2 is observed. Compute
the posterior distribution of θ and the Bayes estimator of θ using the squared
error loss function.
46. Let X be a single observation from a discrete distribution with probability
distribution function

f(x|θ) = n^{-1}θ for x ∈ {1, 2, . . . , n},
        1 − θ for x = 0,
        0 elsewhere,
where θ ∈ Ω = {(n + 1)−1 , 2(n + 1)−1 , . . . , n(n + 1)−1 }. Suppose that


the prior distribution on θ is a Uniform{(n + 1)−1 , 2(n + 1)−1 , . . . , n(n +
1)−1 } distribution. Suppose that X = x is observed. Compute the posterior
distribution of θ and the Bayes estimator of θ using the squared error loss
function.
47. Let X1 , . . . , Xn be a set of independent and identically distributed random
variables from a Poisson(θ) distribution and let θ have a Gamma(α, β)
prior distribution where α and β are known.

a. Prove that the posterior distribution of θ is a Gamma(α̃, β̃) distribution


where
α̃ = α + Σ_{i=1}^{n} Xi,
and β̃ = (β^{-1} + n)^{-1}.
b. Compute the Bayes estimator of θ using the squared error loss function.
Is this estimator consistent and asymptotically Normal in accordance
with Theorem 10.15?
48. Let X1 , . . . , Xn be a set of independent and identically distributed random
variables from a Poisson(θ) distribution and let θ have a prior distribution
of the form π(θ) = θ^{-1/2} for θ > 0, and π(θ) = 0 elsewhere.
This type of prior is known as an improper prior because it does not in-
tegrate to one. In particular, this prior is known as the Jeffreys prior for
the Poisson distribution. See Section 10.1 of Bolstad (2007) for further
information on this prior.

a. Prove that the posterior distribution of θ is a Gamma(α̃, β̃) distribution


where
α̃ = ½ + Σ_{i=1}^{n} Xi,
and β̃ = n^{-1}. Note that even though the prior distribution is improper,
the posterior distribution is not.
b. Compute the Bayes estimator of θ using the squared error loss function.
Is this estimator consistent and asymptotically Normal in accordance
with Theorem 10.15?

10.7.2 Experiments

1. Write a program in R that will simulate b = 1000 samples of size n from


a T(ν) distribution. For each sample compute the sample mean, the
sample mean with 5% trimming, the sample mean with 10% trimming,
and the sample median. Estimate the mean squared error of estimating the
population mean for each of these estimators over the b samples. Use your
program to obtain the mean squared error estimates when n = 25, 50, and 100
with ν = 3, 4, 5, 10, and 25.

a. Informally compare the results of these simulations. Does the sample


median appear to be more efficient than the sample mean as indicated
by the asymptotic relative efficiency when ν equals three and four? Does
the trend reverse itself when ν becomes larger? How do the trimmed
mean methods compare to the sample mean and the sample median?
b. Now formally compare the four estimators using an analysis of variance
with a randomized complete block design where the treatments are taken
to be the estimators, the blocks are taken to be the sample sizes, and
the observed mean squared errors are taken to be the observations. How
do the results of this analysis compare to the results observed above?

2. Write a program in R that simulates 1000 samples of size n from a Pois-


son(θ) distribution, where n and θ are specified below. For each sample
compute the two estimators of P (Xn = 0) = exp(−θ) given by
θ̂n = n^{-1} Σ_{i=1}^{n} δ{Xi; {0}},

which is the proportion of values in the sample that are equal to zero,
and θ̃n = exp(−X̄n ). Use the 1000 samples to estimate the bias, standard
error, and the mean squared error for each case. Discuss the results of the
simulations in terms of the theoretical findings of Exercise 11. Repeat the
experiment for θ = 1, 2, and 5 with n = 5, 10, 25, 50, and 100.
3. Write a program in R that simulates 1000 samples of size n from a dis-
tribution F with mean θ, where both n and F are specified below. For
each sample compute two 90% upper confidence limits for the mean: the
first one of the form X̄n − n−1/2 σ̂n z0.10 and the second of the form X̄n −
n−1/2 σ̂n t0.10,n−1 , and determine whether θ is less than the upper confidence
limit for each method. Use the 1000 samples to estimate the coverage prob-
ability for each method. How do these estimated coverage probabilities
compare for the two methods with relation to the theory presented in this
chapter? Formally analyze your results and determine if there is a signifi-
cant difference between the two methods at each sample size. Use n = 5,
10, 25, 50 and 100 for each of the distributions listed below.
a. N(0, 1)
b. T(3)
c. Exponential(1)
d. Exponential(10)
e. LaPlace(0, 1)
f. Uniform(0, 1)
4. Write a program in R that simulates 1000 samples of size n from a N(θ, 1)
distribution. For each sample compute the sample mean given by X̄n and
the Hodges super-efficient estimator θ̂n = X̄n + (a − 1)X̄n δn where δn =
δ{|X̄n |; [0, n−1/4 )}. Using the results of the 1000 simulated samples estimate
the standard error of each estimator for each combination of n = 5, 10, 25,
50, and 100 and a = 0.25, 0.50, 1.00 and 2.00. Repeat the entire experiment
once for θ = 0 and once for θ = 1. Compare the estimated standard errors
of the two estimators for each combination of parameters given above and
comment on the results in terms of the theory presented in Example 10.9.
5. Write a program in R that simulates 1000 samples of size n from a distribu-
tion F with mean θ where n, θ and F are specified below. For each sample
compute the observed confidence level that θ is in the interval Ψ = [−1, 1]
as
T_{n−1}[n^{1/2}σ̂n^{-1}(θ̂n + 1)] − T_{n−1}[n^{1/2}σ̂n^{-1}(θ̂n − 1)],
where θ̂n is the sample mean and σ̂n is the sample standard deviation.
Keep track of the average observed confidence level over the 1000 samples.
Repeat the experiment for n = 5, 10, 25, 50 and 100 and comment on the
results in terms of the consistency of the method.
a. F is a N(θ, 1) distribution with θ = 0.0, 0.25, . . . , 2.0.
b. F is a LaPlace(θ, 1) distribution with θ = 0.0, 0.25, . . . , 2.0.
c. F is a Cauchy(θ, 1) distribution with θ = 0.0, 0.25, . . . , 2.0, where θ is taken
to be the median (instead of the mean) of the distribution.
6. Write a program in R that simulates 1000 samples of size n from a dis-
tribution F with mean θ, where n, F , and θ are specified below. For each
sample test the null hypothesis H0 : θ ≤ 0 against the alternative hy-
pothesis H1 : θ > 0 using two different tests. In the first test the null
hypothesis is rejected if n1/2 σ̂n−1 X̄n > z0.90 and in the second test the null
hypothesis is rejected if n1/2 σ̂n−1 X̄n > t0.90,n−1 . Keep track of how many
times each test rejects the null hypothesis over the 1000 replications for
θ = 0.0, 0.10σ, 0.20σ, . . . , 2.0σ where σ is the standard deviation of F . Plot
the number of rejections against θ for each test on the same set of axes,
and repeat the experiment for n = 5, 10, 25, 50 and 100. Discuss the results
in terms of the power functions of the two tests.
a. F is a N(θ, 1) distribution.
b. F is a LaPlace(θ, 1) distribution.
c. F is a Exponential(θ) distribution.
d. F is a Cauchy(θ, 1) distribution with θ = 0.0, 0.25, . . . , 2.0, where θ is taken
to be the median (instead of the mean) of the distribution.
7. The interpretation of frequentist results of Bayes estimators is somewhat
difficult because of the sometimes conflicting views of the resulting theo-
retical properties. This experiment will look at two ways of looking at the
asymptotic results of this section.
a. Write a program in R that simulates a sample of size n from a N(0, 1)
distribution and computes the Bayes estimator under the assumption
that the mean parameter θ has a N(0, ½) prior distribution. Repeat the
experiment 1000 times for n = 10, 25, 50 and 100, and make a density
histogram of the resulting Bayes estimates for each sample size. Place a
comparative plot of the asymptotic Normal distribution for θ̃n as spec-
ified by Theorem 10.15. How well do the distributions agree, particularly
when n is larger?
b. Write a program in R that first simulates θ from a N(0, ½) prior distri-
bution and then simulates a sample of size n from a N(θ, 1) distribution,
conditional on the simulated value of θ. Compute the Bayes estimator of
θ for each sample. Repeat the experiment 1000 times for n = 10, 25, 50
and 100, and make a density histogram of the resulting Bayes estimates
for each sample size. Place a comparative plot of the asymptotic Nor-
mal distribution for θ̃n as specified by Theorem 10.15. How well do the
distributions agree, particularly when n is larger?
c. Write a program in R that first simulates θ from a N(½, ½) prior distri-
bution and then simulates a sample of size n from a N(θ, 1) distribution,
conditional on the simulated value of θ. Compute the Bayes estimator
of θ for each sample under the assumption that θ has a N(0, ½) prior
distribution. Repeat the experiment 1000 times for n = 10, 25, 50 and
100, and make a density histogram of the resulting Bayes estimates for
each sample size. Place a comparative plot of the asymptotic Normal
distribution for θ̃n as specified by Theorem 10.15. How well do the dis-
tributions agree, particularly when n is larger? What effect does the
misspecification of the prior have on the results?
CHAPTER 11

Nonparametric Inference

I had assumed you’d be wanting to go to the bank. As you’re paying close atten-
tion to every word I’ll add this: I’m not forcing you to go to the bank, I’d just
assumed you wanted to.
The Trial by Franz Kafka

11.1 Introduction

Nonparametric statistical methods are designed to provide valid statistical es-


timates, confidence intervals and hypothesis tests under very few assumptions
about the underlying model that generated the data. Typically, as the name
suggests, these methods avoid parametric models, which are models that can
be represented using a single parametric family of functions that depend on
a finite number of parameters. For example, a statistical method that makes
an assumption that the underlying distribution of the data is Normal is a
parametric method as the model for the data can be represented with a single
parametric family of densities that depend on two parameters. That is, the
distribution F comes from the family N where
N = {f(x) = (2πσ²)^{-1/2} exp[−½σ^{-2}(x − µ)²] : −∞ < µ < ∞, 0 < σ² < ∞}.
On the other hand, if we develop a statistical method that only makes the as-
sumption that the underlying distribution is symmetric and continuous, then
the method is nonparametric. In this case the underlying distribution can-
not be represented by a single parametric family that has a finite number
of parameters. Nonparametric methods are important to statistical theory as
parametric assumptions are not always valid in applications. A general intro-
duction to these methods can be found in Gibbons and Chakraborti (2003),
Hollander and Wolfe (1999), and Sprent and Smeeton (2007).
The development of many nonparametric methods depends on the ability to
find statistics that are functions of the data, and possibly a null hypothesis,
that have the same known distribution over a nonparametric family of distri-
butions. If the distribution is known, then the statistic can be used to develop
a hypothesis test that is valid over the entire nonparametric family.
Definition 11.1. Let {Xn }∞
n=1 be a set of random variables having joint

distribution F where F ∈ A, a collection of joint distributions in Rn . The
function T(X1, . . . , Xn) is distribution free over A if the distribution of T is
the same for all F ∈ A.
Example 11.1. Let {Xn }∞ n=1 be a set of independent and identically dis-
tributed random variables from a continuous distribution F that has median
equal to θ. Consider the statistic
T(X1, . . . , Xn) = Σ_{k=1}^{n} δ{Xk − θ; (−∞, 0]}.

It follows that δ{Xk − θ; (−∞, 0]} has a Bernoulli(½) distribution for k =


1, . . . , n, and hence the fact that X1 , . . . , Xn are independent and identi-
cally distributed random variables implies that T(X1, . . . , Xn) has a Bino-
mial(n, ½) distribution. From Definition 11.1 it follows that T(X1, . . . , Xn) is
distribution free over the class of continuous distributions with median equal
to θ. In applied problems θ is usually unknown, but the statistic T (X1 , . . . , Xn )
can be used to develop a hypothesis test of H0 : θ = θ0 against H1 : θ 6= θ0
when θ is replaced by θ0 . That is,
Bn(X1, . . . , Xn) = Σ_{k=1}^{n} δ{Xk − θ0; (−∞, 0]}.

Under the null hypothesis Bn has a Binomial(n, ½) distribution and therefore


the null hypothesis H0 : θ = θ0 can be tested at significance level α by rejecting
H0 whenever Bn ∈ R(α) where R(α) is any set such that P[Bn ∈ R(α)|θ = θ0] =
α. This type of test is usually called the sign test. 
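As an illustration, a minimal R sketch of the sign test described above is given below; the function name, the two-sided p-value convention, and the data used are illustrative assumptions rather than part of the example.

sign.test <- function(x, theta0) {
  n <- length(x)
  bn <- sum(x - theta0 <= 0)                      # the statistic Bn
  # two-sided p-value from the Binomial(n, 1/2) null distribution
  p <- 2 * min(pbinom(bn, n, 0.5), 1 - pbinom(bn - 1, n, 0.5))
  list(statistic = bn, p.value = min(1, p))
}
set.seed(1)
sign.test(rexp(30) - log(2), theta0 = 0)          # an Exponential(1) sample has median log(2)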
Many nonparametric procedures are based on ranking the observed data.
Definition 11.2. Let X1 , . . . , Xn be an observed sample from a distribution
F and let X(1) , . . . , X(n) denote the ordered sample. The rank of Xi , denoted
by Ri = R(Xi ), equals k if Xi = X(k) .
We will assume that the ranks are unique in that there are not two or more of
the observed sample values that equal one another. This will be assured with
probability one when the distribution F is continuous. Under this assumption,
an important property of the ranks is that their joint distribution does not
depend on F . This is due to the fact that the ranks always take on the values
1, . . . , n and the assignment of the ranks to the values X1 , . . . , Xn is a random
permutation of the integers in the set {1, . . . , n}.
Theorem 11.1. Let X1 , . . . , Xn be a set of independent and identically dis-
tributed random variables from a continuous distribution F and let R′ =
(R1, . . . , Rn) be the vector of ranks. Define the set
Rn = {r : r is a permutation of the integers 1, . . . , n}. (11.1)
Then R is uniformly distributed over Rn .
For a proof of Theorem 11.1 see Section 2.3 of Randles and Wolfe (1979).
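A quick empirical illustration of Theorem 11.1 can be obtained by simulation in R; the choice n = 3 and the number of replications below are arbitrary.

# Each of the 3! = 6 possible rank vectors should appear with probability close to 1/6.
set.seed(1)
r <- replicate(10000, paste(rank(rnorm(3)), collapse = ""))
table(r) / 10000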
Rank statistics find common use in nonparametric methods. These types of
statistics are often used to compare two or more populations as illustrated
below.
Example 11.2. Let X1 , . . . , Xm be a set of independent and identically dis-
tributed random variables from a continuous distribution F and let Y1 , . . . , Yn
be a set of independent and identically distributed random variables from a
continuous distribution G where G(t) = F (t − θ) and θ is an unknown pa-
rameter. This type of model for the distributions F and G is known as the
shift model, and in the special case where the means of F and G both exist,
θ = E(Y1 ) − E(X1 ). To test the null hypothesis H0 : θ = 0 against the al-
ternative hypothesis H1 : θ 6= 0 compute the ranks of the combined sample
{X1 , . . . , Xm , Y1 , . . . , Yn }. Denote this combined sample as Z1 , . . . , Zn+m and
denote the corresponding ranks as R1 , . . . , Rn+m . Note that under the null
hypothesis the combined sample can be treated as a single sample from the
distribution F . Now consider the test statistic
Mm,n = Σ_{i=1}^{n+m} Di Ri,

where Di = 0 when Zi is from the sample X1 , . . . , Xm and Di = 1 when Zi


is from the sample Y1 , . . . , Yn . Theorem 11.1 implies both that the random
vector D = (D1, . . . , Dn+m)′ is distribution free and that D is independent
of the vector of ranks given by R. Hence, it follows that Mm,n = D′R is
distribution free. A test based on Mm,n is known as the Wilcoxon, Mann, and
Whitney Rank Sum test. The exact distribution of the test statistic under
the null hypothesis can be derived by considering all possible equally likely
configurations of D and R, and computing the value of Mm,n for each. See
Chapter 4 of Hollander and Wolfe (1999) for further details. 
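A minimal R sketch of the statistic Mm,n is given below, together with a crude permutation approximation to its null distribution; the sample sizes, function names, and the two-sided p-value convention are illustrative assumptions.

rank.sum <- function(x, y) {
  z <- c(x, y)
  d <- c(rep(0, length(x)), rep(1, length(y)))    # the indicators Di
  sum(d * rank(z))                                # the statistic Mm,n
}
set.seed(1)
x <- rnorm(8); y <- rnorm(10) + 1
m.obs <- rank.sum(x, y)
m.null <- replicate(2000, {                       # randomly reassign the group labels
  z <- sample(c(x, y))
  rank.sum(z[1:8], z[-(1:8)])
})
mean(abs(m.null - mean(m.null)) >= abs(m.obs - mean(m.null)))   # crude two-sided p-value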

When considering populations that are symmetric about a shift parameter θ,


the sign of the observed value, which corresponds to whether the observed
value is greater than, or less than, the shift parameter can also be used to
construct distribution free statistics.
Theorem 11.2. Let Z1 , . . . , Zn be a set of independent and identically dis-
tributed random variables from a distribution F that is symmetric about zero.
Let R1 , . . . , Rn denote the ranks of the absolute observations |Z1 |, . . . , |Zn |
with R′ = (R1, . . . , Rn) and let Ci = δ{Zi; (0, ∞)} for i = 1, . . . , n, with
C′ = (C1, . . . , Cn). Then

1. The random variables in R and C are mutually independent.


2. R is uniformly distributed over the set Rn.
3. The components of C are a set of independent and identically distributed
Bernoulli(½) random variables.

A proof of Theorem 11.2 can be found in Section 2.4 of Randles and Wolfe
(1979).
Example 11.3. Let (X1 , Y1 ), . . . , (Xn , Yn ) be a set of independent and identi-
cally distributed paired random variables from a continuous bivariate distribu-
tion F . Let G and H be the marginal distributions of Xn and Yn , respectively.
Assume that H(x) = G(x − θ) for some shift parameter θ and that G is a
symmetric distribution about zero. Let Zi = Xi −Yi and let R1 , . . . , Rn denote
the ranks of the absolute differences |Z1 |, . . . , |Zn |. Define Ci = δ{Zi ; (0, ∞)}
for i = 1, . . . , n, with C0 = (C1 , . . . , Cn ). The Wilcoxon signed rank statistic is
then given by Wn = C0 R, which is the sum of the ranks of the absolute differ-
ences that correspond to positive differences. When testing the null hypothesis
H0 : θ = θ0 the value of θ in Wn is replaced by θ0 . Under this null hypothesis
it follows from Theorem 11.2 that Wn is distribution free. The distribution of
Wn under the null hypothesis can be found by enumerating the value of Wn
over all possible equally likely permutations of the elements of C and R. For
further details see Section 3.1 of Hollander and Wolfe (1999). 
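The following R sketch computes the signed rank statistic Wn for a set of paired observations; the data and function name are illustrative, and the comparison with wilcox.test is subject to that function's handling of zeros and ties.

signed.rank <- function(z, theta0 = 0) {
  z <- z - theta0
  sum(rank(abs(z)) * (z > 0))                     # the statistic Wn
}
set.seed(1)
x <- rnorm(15); y <- rnorm(15) - 0.5
signed.rank(x - y)
# Up to the treatment of zeros and ties, this matches
# wilcox.test(x, y, paired = TRUE)$statistic.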
The analysis of the asymptotic behavior of test statistics like those studied
in Examples 11.1 to 11.3 is the subject of this chapter. The next section will
develop a methodology for showing that such statistics are asymptotically
Normal using the theory of U -statistics.

11.2 Unbiased Estimation and U -Statistics

Let {Xn }∞ n=1 be a sequence of independent and identically distributed ran-


dom variables from a distribution F with functional parameter θ = T (F ).
Hoeffding (1948) considered a class of estimators of θ that are unbiased and
also have a distribution that converges weakly to a Normal distribution as
the sample size increases. Such statistics are not necessarily directly connected
with nonparametric methods. For example, the sample mean and variance are
examples of such statistics. However, many of the classical nonparametric test
statistics fall within this area. As such, the methodology of Hoeffding (1948)
provides a convenient method for finding the asymptotic distribution of such
test statistics.
To begin our development we consider the concept of estimability. Estimability
in this case refers to the smallest sample size for which an unbiased estimator
of a parameter exists.
Definition 11.3. Let {Xn }∞ n=1 be a sequence of independent and identically
distributed random variables from a distribution F ∈ F, where F is a collection
of distributions. A parameter θ ∈ Ω is estimable of degree r over F if r is the
smallest sample size for which there exists a function h∗ : Rr → Ω such that
E[h∗ (X1 , . . . , Xr )] = θ for every F ∈ F and θ ∈ Ω.
Example 11.4. Let {Xn }∞ n=1 be a sequence of independent and identically
distributed random variables from a distribution F ∈ F where F is a collection
of distributions that have a finite second moment. Let θ = V (X1 ). We will
show that θ is estimable of degree two over F. First, we demonstrate that
the degree of θ is not greater than two. To do this we note that if we define
h∗(X1, X2) = ½(X1 − X2)², then E[½(X1 − X2)²] = ½(θ + µ² − 2µ² + θ + µ²) = θ
where µ = E(X1 ). Therefore, the degree of θ is at most two. To show that
the degree of θ is not one we must show that there is not a function h∗1 (X1 )
such that E[h∗1 (X1 )] = θ for all F ∈ F and θ ∈ Ω. Following the suggestion
of Randles and Wolfe (1979) we first assume that such a function exists and
search for a contradiction. For example, if such an h∗1 exists then
E[X1² − h∗1(X1)] = θ + µ² − θ = µ². (11.2)
Now, is it possible that such a function exists? As suggested by Randles and
Wolfe (1979) we shall consider what happens when F corresponds to a Uni-
form(η − ½, η + ½) distribution where η ∈ R. In this case
E[X1² − h∗1(X1)] = ∫_{η−½}^{η+½} [x1² − h∗1(x1)] dx1 = ∫_{η−½}^{η+½} x1² dx1 − ∫_{η−½}^{η+½} h∗1(x1) dx1. (11.3)
The first integral in Equation (11.3) is given by η² + 1/12 and hence the second
integral must equal η². That is,
∫_{η−½}^{η+½} h∗1(x1) dx1 = η².

This implies that h∗1(x1) must be a linear function of the form a + bx1 for
some constants a and b. However, direct integration implies that
∫_{η−½}^{η+½} (a + bx1) dx1 = a + bη,

which is actually linear in η, which is a contradiction, and hence no such


function exists when η ≠ 0, and therefore θ is not estimable of degree one over
F. It is important to note that this result depends on the family of distributions
considered. See Exercise 1. 

Now, suppose that X1 , . . . , Xn is a set of independent and identically dis-


tributed random variables from a distribution F ∈ F and let θ be a param-
eter that is estimable of degree r with function h∗ (X1 , . . . , Xr ). Note that
h∗ (X1 , . . . , Xr ) is an unbiased estimator of θ since E[h∗ (X1 , . . . , Xr )] = θ for
all θ ∈ Ω. To compute this estimator the sample size n must be at least as
large as r. When n > r this estimator is unlikely to be very efficient since it
does not take advantage of the information available in the entire sample. In
fact, h∗ (Xi1 , . . . , Xir ) is also an unbiased estimator of θ for any set of indices
{i1 , . . . , ir } that are selected without replacement from the set {1, . . . , n}. The
central idea of a U -statistic is to form an efficient unbiased estimator of θ by av-
eraging together all (n choose r) possible unbiased estimators of θ corresponding to the
function h∗ computed on all (n choose r) possible samples of the form {Xi1, . . . , Xir} as
described above. This idea is much simpler to implement if h∗ is a symmetric
function of its arguments. That is, if h∗ (X1 , . . . , Xr ) = h∗ (Xi1 , . . . , Xir ) where
in this case {i1 , . . . , ir } is any permutation of the integers in the set {1, . . . , r}.
In fact, given any function h∗ (X1 , . . . , Xr ) that is an unbiased estimator of θ,
it is possible to construct a symmetric function h(X1 , . . . , Xr ) that is also an
unbiased estimator of θ. To see why this is true we construct h(X1 , . . . , Xr )
as
h(X1, . . . , Xr) = (r!)^{-1} Σ_{a∈Ar} h∗(Xa1, . . . , Xar),
where a′ = (a1, . . . , ar) and Ar is the set that contains all vectors whose
elements correspond to the permutations of the integers in the set {1, . . . , r}.
Note that because X1 , . . . , Xn are independent and identically distributed it
follows that E[h∗ (Xa1 , . . . , Xar )] = θ for all a ∈ Ar and hence
E[h(X1, . . . , Xr)] = (r!)^{-1} Σ_{a∈Ar} E[h∗(Xa1, . . . , Xar)] = θ.

Therefore h(X1 , . . . , Xr ) is a symmetric function that can be used in place of


h∗ (X1 , . . . , Xr ). With the symmetric function h(X1 , . . . , Xr ) we now define a
U -statistic as the average of all possible values of the function h computed over
all (n choose r) possible selections of r random variables from the set {X1, . . . , Xn}.


Definition 11.4. Let X1 , . . . , Xn be a set of independent and identically dis-


tributed random variables from a distribution F with functional parameter θ.
Suppose that θ is estimable of degree r with a symmetric function h which will
be called the kernel function. Then a U-statistic for θ is given by
Un = Un(X1, . . . , Xn) = (n choose r)^{-1} Σ_{b∈Bn,r} h(Xb1, . . . , Xbr), (11.4)

where Bn,r is a set that contains all vectors whose elements correspond to
unique selections of r integers from the set {1, . . . , n}, taken without replace-
ment.
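A minimal R sketch of Definition 11.4 is given below; the function names are illustrative. With the kernel h(x1, x2) = ½(x1 − x2)² of Example 11.4, the resulting U-statistic reproduces the usual unbiased sample variance.

u.stat <- function(x, h, r) {
  idx <- combn(length(x), r)                      # all subsets of size r
  mean(apply(idx, 2, function(b) h(x[b])))        # average of the kernel over the subsets
}
set.seed(1)
x <- rnorm(20)
h.var <- function(z) 0.5 * (z[1] - z[2])^2        # the kernel from Example 11.4
c(u.stat(x, h.var, 2), var(x))                    # agrees with the unbiased sample variance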
Example 11.5. Let X1 , . . . , Xn be a set of independent and identically dis-
tributed random variables from a distribution F with finite mean θ. Let
h(x) = x and note that E[h(Xi )] = θ for all i = 1, . . . , n. Therefore
Un = n^{-1} Σ_{i=1}^{n} Xi,

is a U -statistic of degree one. 


Example 11.6. Let X1 , . . . , Xn be a set of independent and identically
distributed random variables from a continuous distribution F and define
θ = P (Xn ≤ ξ) for a specified real constant ξ. Consider the function h(x) =
δ{x − ξ; (−∞, 0]} where E[h(Xn )] = P (Xn − ξ ≤ 0) = P (Xn ≤ ξ) = θ.
Therefore a U -statistic of degree one for the parameter θ is given by
Bn = n^{-1} Σ_{i=1}^{n} δ{Xi − ξ; (−∞, 0]}.
This statistic corresponds to the test statistic used by the sign test for testing
hypotheses about a quantile of F . 

Example 11.7. Let X1 , . . . , Xn be a set of independent and identically dis-


tributed random variables from a continuous distribution F that is symmetric
about a point θ. Let R be the vector of ranks of |X1 −θ|, . . . , |Xn −θ| and let C
be an n × 1 vector with ith element ci = δ{Xi − θ; (0, ∞)} for i = 1, . . . , n. The
signed rank statistic was seen in Example 11.3 to have the form Wn = C′R.
The main purpose of this example is to demonstrate that this test statistic can
be written as the sum of two U -statistics. To simplify the notation used in
this example, let Zi = Xi − θ for i = 1, . . . , n so that R contains the ranks of
|Z1 |, . . . , |Zn | and ci = δ{Zi ; (0, ∞)}. Let us examine each term in the statistic

Wn = Σ_{i=1}^{n} Ri δ{Zi; (0, ∞)} = Σ_{i=1}^{n} R̃i δ{Z(i); (0, ∞)}, (11.5)

where R̃i is the absolute rank associated with Z(i) . There are two possibilities
for the ith term in the sum in Equation (11.5). The term will be zero if Zi < 0.
If Zi > 0 then the ith term will add R̃i to the sum. Suppose for the moment
that R̃i = r for some r ∈ {1, . . . , n}. This means that there are r − 1 values
from |Z1 |, . . . , |Zn | such that |Zj | < |Z(i) | along with the one value that equals
|Z(i) |. Hence, the ith term will add

Σ_{j=1}^{n} δ{|Zj|; (0, |Z(i)|]} = Σ_{j=1}^{n} δ{|Z(j)|; (0, |Z(i)|]},

to the sum in Equation (11.5) when Zi > 0. Now let us combine these two
conditions. Let Z(1) , . . . , Z(n) denote the order statistics of Z1 , . . . , Zn . Let
i < j and note that δ{Z(i) + Z(j) ; (0, ∞)} = 1 if and only if Z(j) > 0 and
|Z(i) | < Z(j) . To see why this is true consider the following cases. If Z(j) < 0
then Z(i) < 0 since i < j and hence δ{Z(i) + Z(j) ; (0, ∞)} = 0. Similarly, it is
possible that Z(j) > 0 but |Z(i) | > |Z(j) |. This can only occur when Z(i) < 0,
or else Z(i) would be larger than Z(j) , which cannot occur because i < j.
Therefore |Z(i) | > |Z(j) | and Z(i) < 0 implies that Z(i) + Z(j) < 0 and hence
δ{Z(i) + Z(j) ; (0, ∞)} = 0. However, if Z(j) > 0 and |Z(i) | < |Z(j) | then it must
follow that δ{Z(i) + Z(j) ; (0, ∞)} = 1. Therefore, it follows that the term in
Equation (11.5) can be written as

Σ_{j=1}^{n} δ{|Z(j)|; (0, |Z(i)|]} = Σ_{j=1}^{i} δ{Z(i) + Z(j); (0, ∞)},

where the upper limit of the sum on the right hand side of the equation reflects
the fact that we only add in observations less than or equal to Z(i) , which is
the signed rank of Z(i) . Therefore, it follows that
Wn = Σ_{i=1}^{n} Σ_{j=1}^{i} δ{Z(i) + Z(j); (0, ∞)}
   = Σ_{i=1}^{n} Σ_{j=1}^{i} δ{Zi + Zj; (0, ∞)}
   = Σ_{i=1}^{n} δ{2Zi; (0, ∞)} + Σ_{i=1}^{n} Σ_{j=i+1}^{n} δ{Zi + Zj; (0, ∞)}. (11.6)

The first term in Equation (11.6) can be written as nU1,n where U1,n is a
U -statistic of the form
U1,n = n^{-1} Σ_{i=1}^{n} δ{2Zi; (0, ∞)},
and the second term in Equation (11.6) can be written as (n choose 2)U2,n where U2,n
is a U-statistic of the form
U2,n = (n choose 2)^{-1} Σ_{i=1}^{n} Σ_{j=i+1}^{n} δ{Zi + Zj; (0, ∞)}.

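The decomposition in Equation (11.6) can be checked numerically; the following R sketch uses illustrative data and object names.

set.seed(1)
z <- rnorm(12)                                    # plays the role of Zi = Xi - theta
n <- length(z)
w <- sum(rank(abs(z)) * (z > 0))                  # the signed rank statistic Wn
u1 <- mean(2 * z > 0)                             # U_{1,n}
pairs <- combn(n, 2)
u2 <- mean(z[pairs[1, ]] + z[pairs[2, ]] > 0)     # U_{2,n}
c(w, n * u1 + choose(n, 2) * u2)                  # the two values agree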
A key property of U -statistics which makes them an important topic in statis-


tical estimation theory is that they are optimal in that they have the lowest
variance of all unbiased estimators of θ.
Theorem 11.3. Let X1 , . . . , Xn be a set of independent and identically dis-
tributed random variables from a distribution F with parameter θ. Let Un
be a U -statistic for θ and let Tn be any other unbiased estimator of θ, then
V (Un ) ≤ V (Tn ).

A proof of Theorem 11.3 can be found in Section 5.1.4 of Serfling (1980).


The main purpose of this section is to develop conditions under which a U -
statistic is asymptotically normal. In order to obtain such a result, we first
need to develop an expression for the variance of a U -statistic. The form of
the U -statistic defined in Equation (11.4) suggests that we need to obtain an
expression for the variance of the sum
Σ_{b∈Bn,r} h(Xb1, . . . , Xbr).

If the terms in this sum were independent of one another then we could
exchange the variance and the sum. However, it is clear that if we choose
two distinct elements b and b′ from Bn,r, the two terms h(Xb1, . . . , Xbr) and
h(Xb′1, . . . , Xb′r) could have as many as r − 1 of the random variables from the
set {X1, . . . , Xn} in common, but may have as few as zero of these variables
in common. Note that the two terms could not have all r variables in common
because we are assuming that b and b0 are distinct. We also note that if n
was not sufficiently large then there may also be a lower bound on the number
of random variables that the two terms could have in common. In general we
will assume that n is large enough so that the lower limit is always zero. In
this case we have that
 
V[Σ_{b∈Bn,r} h(Xb1, . . . , Xbr)] =
Σ_{b∈Bn,r} Σ_{b′∈Bn,r} C[h(Xb1, . . . , Xbr), h(Xb′1, . . . , Xb′r)].

To simplify this expression let us consider the case where the sets {b1 , . . . , br }
and {b′1, . . . , b′r} have exactly c elements in common. For example, we can
consider the term,
C[h(X1 , . . . , Xc , Xc+1 , . . . , Xr ), h(X1 , . . . , Xc , Xr+1 , . . . , X2r−c )],
where we assume that n > 2r − c. Now consider comparing this covariance to
another term that also has exactly c variables in common such as
C[h(X1 , . . . , Xc , Xc+1 , . . . , Xr ), h(X1 , . . . , Xc , Xr+2 , . . . , X2r−c+1 )].
Note that the two covariances will be equal because the joint distribution
of Xr+1 , . . . , X2r−c is exactly the same as the joint distribution of Xr+2 , . . .,
X2r−c+1 because X1 , . . . , Xn are assumed to be a sequence of independent
and identically distributed random variables. This fact, plus the symmetry of
the function h will imply that any two terms that have exactly c variables in
common will have the same covariance. Therefore, define
ζc = C[h(X1 , . . . , Xc , Xc+1 , . . . , Xr ), h(X1 , . . . , Xc , Xr+1 , . . . , X2r−c )],
for c = 0, . . . , r. The number of terms of this form in the sum of covariances
equals the number of ways to choose r indices from the set of n, which is (n choose r),
multiplied by the number of ways to choose c common indices from the r
chosen indices, which is (r choose c), multiplied by the number of ways to choose the
remaining r − c non-common indices from the set of n − r indices not chosen
from the first selection, which is (n − r choose r − c). Therefore, it follows that the variance
of a U -statistic of the form defined in Equation (11.4) is
V(Un) = (n choose r)^{-2} Σ_{b∈Bn,r} Σ_{b′∈Bn,r} C[h(Xb1, . . . , Xbr), h(Xb′1, . . . , Xb′r)]
      = (n choose r)^{-2} Σ_{c=0}^{r} (n choose r)(r choose c)(n − r choose r − c) ζc
      = (n choose r)^{-1} Σ_{c=0}^{r} (r choose c)(n − r choose r − c) ζc. (11.7)
Note further that
ζ0 = C[h(X1 , . . . , Xr ), h(Xr+1 , . . . , X2r )] = 0,
since X1 , . . . , Xr are mutually independent of Xr+1 , . . . , X2r . Hence, the ex-
pression in Equation (11.7) simplifies to
V(Un) = (n choose r)^{-1} Σ_{c=1}^{r} (r choose c)(n − r choose r − c) ζc. (11.8)

Example 11.8. Let X1 , . . . , Xn be a set of independent and identically dis-


tributed random variables from a distribution with mean θ and finite variance
σ 2 . In Example 11.5 it was shown that the sample mean Un = X̄n is a U -
statistic of degree r = 1 for θ with function h(x) = x. The variance of this
U -statistic can then be computed using Equation (11.8) to find
V(Un) = (n choose 1)^{-1} Σ_{c=1}^{1} (1 choose c)(n − 1 choose 1 − c) ζc
      = n^{-1}ζ1
      = n^{-1}C(X1, X1)
      = n^{-1}σ²,
which is the well-known expression for the variance of the sample mean. 
Example 11.9. Let X1 , . . . , Xn be a set of independent and identically dis-
tributed random variables from a distribution with mean θ and finite variance
σ 2 . Consider estimating the parameter θ2 which can be accomplished with a
U -statistic of degree r = 2 with symmetric function h(x1 , x2 ) = x1 x2 . The
variance of this U -statistic can then be computed using Equation (11.8) to
find
V(Un) = (n choose 2)^{-1} Σ_{c=1}^{2} (2 choose c)(n − 2 choose 2 − c) ζc
      = (n choose 2)^{-1}(2 choose 1)(n − 2 choose 1) ζ1 + (n choose 2)^{-1}(2 choose 2)(n − 2 choose 0) ζ2
      = [2(n − 2)/(½n(n − 1))] ζ1 + [1/(½n(n − 1))] ζ2
      = [4(n − 2)/(n(n − 1))] ζ1 + [2/(n(n − 1))] ζ2.
Now
ζ1 = C(X1X2, X1X3)
   = E(X1²X2X3) − E(X1X2)E(X1X3)
   = θ²(θ² + σ²) − θ⁴
   = θ²σ²,
and
ζ2 = C(X1X2, X1X2)
   = E(X1²X2²) − E(X1X2)E(X1X2)
   = (θ² + σ²)² − θ⁴
   = 2θ²σ² + σ⁴.
Therefore, we can conclude that
V(Un) = 4(n − 2)θ²σ²/[n(n − 1)] + 4θ²σ²/[n(n − 1)] + 2σ⁴/[n(n − 1)]
      = 4θ²σ²/n + 2σ⁴/[n(n − 1)].


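The variance expression derived above can be checked by simulation; the following R sketch uses illustrative parameter values and object names.

set.seed(1)
n <- 15; theta <- 2; sigma <- 3; B <- 20000
pairs <- combn(n, 2)
u <- replicate(B, {
  x <- rnorm(n, mean = theta, sd = sigma)
  mean(x[pairs[1, ]] * x[pairs[2, ]])             # the U-statistic with kernel x1 * x2
})
c(simulated = var(u),
  formula = 4 * theta^2 * sigma^2 / n + 2 * sigma^4 / (n * (n - 1)))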
The variance of a U -statistic can be quite complicated, but it does turn out
that the leading term in Equation (11.8) is dominant from an asymptotic
viewpoint.
Theorem 11.4. Let X1 , . . . , Xn be a set of independent and identically dis-
tributed random variables from a distribution F . Let Un be an rth -order U -
statistic with symmetric kernel function h(x1, . . . , xr). If E[h²(X1, . . . , Xr)] <
∞ then V(Un) = n^{-1}r²ζ1 + o(n^{-1}), as n → ∞.

Proof. From Equation (11.7) we have that


nV(Un) = n(n choose r)^{-1} Σ_{c=1}^{r} (r choose c)(n − r choose r − c) ζc, (11.9)

where ζ1, . . . , ζr are finite because we have assumed that E[h²(X1, . . . , Xr)] <
∞. It is the asymptotic behavior of the coefficients of ζc that will determine
the behavior of nV (Un ). We first note that for c = 1 we have that
n(n choose r)^{-1}(r choose 1)(n − r choose r − 1) = r²[(n − r)!]²/[(n − 2r + 1)!(n − 1)!].
Canceling the identical terms in the numerator and the denominator of the
right hand side of this equation yields
n(n choose r)^{-1}(r choose 1)(n − r choose r − 1) = r²(n − r) · · · (n − 2r + 2)/[(n − 1) · · · (n − r + 1)]
= r² Π_{i=1}^{r−1} (n − i + 1 − r)/(n − i),
where it is instructive to note that the number of terms in the product does
not depend on n. Therefore, since
lim_{n→∞} r² Π_{i=1}^{r−1} (n − i + 1 − r)/(n − i) = r²,

it follows that the first term in Equation (11.9) converges to r²ζ1 as n → ∞.


For the case where c ∈ {2, . . . , r}, we have that
n(n choose r)^{-1}(r choose c)(n − r choose r − c) = (r!)²[(n − r)!]²/{c![(r − c)!]²(n − 2r + c)!(n − 1)!}. (11.10)
Again, we cancel identical terms on the numerator and the denominator of
the second term on the right hand side of Equation (11.10) to yield

[(n − r)!]2 (n − r) · · · (n − 2r + c + 1)
= =
(n − 2r + c)!(n − 1)! (n − 1) · · · (n − r + 1)
"c−1 # "r−c #
Y Y n−r−i+1
−1
(n − r + c − i) ,
i=1 i=1
n−i
where again we note that the number of terms in each of the products does
not depend on n. Therefore, we have that
c−1
Y
lim (n − r + c − i)−1 = 0,
n→∞
i=1

and
r−c
Y n−r−i+1
lim = 1.
n→∞
i=1
n−i
Therefore, it follows that
 −1   
n r n−r
lim n ζc = 0,
n→∞ r c r−c
and hence the result follows.
Example 11.10. Let X1 , . . . , Xn be a set of independent and identically
distributed random variables from a distribution with mean θ and finite vari-
ance σ 2 . Continuing with Example 11.9 we consider estimating the param-
eter θ2 with a U -statistic of degree r = 2 with symmetric kernel function
h(x1 , x2 ) = x1 x2 . The variance of this U -statistic was computed in Example
11.9 as
4θ2 σ 2 2σ 4
V (Un ) = + .
n n(n − 1)
One can verify the result of Theorem 11.4 by noting that
2σ 4
lim nV (Un ) = lim 4θ2 σ 2 + = 4θ2 σ 2 = r2 ζ1 .
n→∞ n→∞ n−1

We are now in a position to develop conditions under which U -statistics are
asymptotically normal. We begin with the case where r = 1. In this case the
U -statistic defined in Equation (11.4) has the form
n
X
Un = n−1 h(Xi ),
i=1
UNBIASED ESTIMATION AND U -STATISTICS 487
which is the sum of independent and identically distributed random vari-
ables. Therefore, it follows from Theorem 4.20 (Lindeberg and Lévy) that
−1/2 d
n1/2 ζ1 (Un − θ) −
→ Z as n → ∞ where Z is a N(0, 1) random variable.
For the case when r > 1 the problem becomes more complicated as the terms
in the sum in a U -statistic are no longer necessarily independent. The approach
for establishing asymptotic normality for these types of U -statistics is based
on finding a function of the observed data that has the same asymptotic
behavior as the U -statistic, but is the sum of independent and identically
distributed random variables. Theorem 4.20 can then be applied to this related
function, thus establishing the asymptotic normality of the U -statistic. To
simplify matters, we will actually first center the U -statistic about the origin.
That is, if Un is a U -statistic of order r, then we will actually be working with
the function Un − θ which has expectation zero. The method for finding a
function of the data that is a sum of independent and identically distributed
terms that has the same asymptotic behavior as Un − θ is based on finding a
projection of Un − θ onto the space of functions that are sums of independent
and identically distributed random variables.
Recall that a projection of a point in a metric space on a subspace is accom-
plished by finding a point in the subspace that is closest to the specified point.
For example, we can consider the vector space R3 with vector x ∈ R3 . Let P
denote a two-dimensional subspace of R3 corresponding to a plane. Then the
vector x is projected onto P by finding a vector p ∈ P that minimizes kx − pk.
For our purpose we will consider projecting Un − θ onto the space of functions
given by ( )
Xn
Vn = Vn Vn = k(Xi ) ,


i=1
where k is a real-valued function. The function k that will result from this
projection will usually depend on some unknown parameters of F . However,
this does not affect the usefulness of the results since we are not actually
interested in computing the function; we only need to establish its asymptotic
properties.
Consider a U -statistic of order r given by Un . We wish to project Un − θ onto
the space Vn . In order to do this we need a measure of distance between Un −θ
and functions that are in Vn . For this we will use the expected square distance
between the two functions. That is, we use kUn − θ − Vn k = E[(Un − θ − Vn )2 ].
Theorem 11.5. Let Un be a U -statistic of order r calculated on X1 , . . . , Xn , a
set of independent and identically distributed random variables. The projection
of Un − θ onto Vn is given by
n
X
−1
Vn = rn {E[h(Xi , X2 , . . . , Xr )|Xi ] − θ}. (11.11)
i=1

Proof. In order to prove this result we must show that Vn ∈ Vn and that Vn
488 NONPARAMETRIC INFERENCE
minimizes kUn − θ − Vn k. To show that Vn ∈ Vn , we need only note that the
conditional expectation E[h(Xi , X2 , . . . , Xr )|Xi ] is only a function of Xi and
hence we can take the function k(Xi ) to be defined as
k̃(Xi ) = rn−1 {E[h(Xi , X2 , . . . , Xr )|Xi ] − θ}.
To prove that V1,n minimizes kUn −θ −V1,n k we let V be an arbitrary member
of Vn , and note that
kUn − θ − V k = E[(Un − θ − V )2 ]
= E{[(Un − θ − Vn ) + (Vn − V )]2 }
= E[(Un − θ − Vn )2 ] + E[(Vn − V )2 ]
+2E[(Un − θ − Vn )(Vn − V )].
Now, suppose that V has the form
n
X
V = k(Xi ),
i=1

where k is a real valued function. Then


( n
)
X
E[(Un − θ − Vn )(Vn − V )] = E (Un − θ − Vn ) [k̃(Xi ) − k(Xi )]
i=1
n
X
= E{(Un − θ − Vn )[k̃(Xi ) − k(Xi )]}.
i=1
(11.12)
Evaluating the term in the sum on the right hand side of Equation (11.12) we
use Theorem A.17 to find that

E{(Un − θ − Vn )[k̃(Xi ) − k(Xi )]} =


E[E{(Un − θ − Vn )[k̃(Xi ) − k(Xi )]|Xi }], (11.13)
where the outer expectation on the right hand side of Equation (11.13) is
taken with respect to Xi . Therefore, it follows that

E[E{(Un − θ − Vn )[k̃(Xi ) − k(Xi )]|Xi }] =


E{[k̃(Xi ) − k(Xi )]E(Un − θ − Vn |Xi )}. (11.14)
To evaluate the conditional expectation on the right hand side of Equation
(11.14) we note that
 
 −1 X
n
E(Un |Xi ) = E  h(Xb1 , . . . , Xbr ) Xi 
r
b∈Bn,r
 −1 X
n
= E[h(Xb1 , . . . , Xbr )|Xi ]. (11.15)
r
b∈Bn,r
UNBIASED ESTIMATION AND U -STATISTICS 489
There are n−1

r terms in the sum on the right hand side of Equation (11.15)
where i ∈
/ {b1 , . . . , br }. For these terms we have
 that E[h(Xb1 , . . . , Xbr )|Xi ] =
E[h(Xb1 , . . . , Xbr )] = θ. The remaining n−1r−1 terms in the sum have the form

E[h(Xb1 , . . . , Xbr )|Xi ] = E[h(Xi , X1 , . . . , Xr−1 )|Xi ] = r−1 nk̃(Xi ) + θ,


which follows from the definition of the function k̃ and the fact that X1 , . . . , Xn
are a set of independent and identically distributed random variables. There-
fore,

E(Un |Xi ) =
 −1      −1  
n n−1 n−1 n n − 1 −1
+ θ+ r nk̃(Xi ) =
r r r−1 r r−1
θ + k̃(Xi ).
Similarly, we have that
 
n
X

Xn

E(Vn |Xi ) = E  k̃(Xj ) Xi  = E[k̃(Xj )|Xi ].
j=1 j=1

Now, when i = j we have that E[k̃(Xi )|Xi ] = k̃(Xi ) and when i 6= j, Theorem
A.17 implies
E[k̃(Xj )|Xi ] = E[k̃(Xj )]
= rn−1 E{E[h(Xj , X2 , . . . , Xr )|Xj ]} − rn−1 θ
= rn−1 E[h(Xj , X2 , . . . , Xr )] − rn−1 θ
= 0.
Therefore, E(Vn |Xi ) = k̃(Xi ) and E(Un − θ − Vn |Xi ) = 0, from which we can
conclude that E{(Un − θ − Vn )[k̃(Xi ) − k(Xi )]} = 0 for all i = 1, . . . , n. Hence,
it follows that ||Un − θ − V || = E[(Un − Vn )2 ] + E[(Vn − V )2 ]. Because both
terms are non-negative and the first term does not depend on V , it follows
that minimizing ||Un − θ − V || is equivalent to minimizing the second term
E[(Vn − V )2 ], which can be made zero by choosing V = Vn . Therefore, Vn
minimizes ||Un − θ − V || and the result follows.

Now that we have determined that the projection of Un − θ is given by Vn ,


we can now prove that Vn has an asymptotic Normal distribution and that
Un − θ has the same asymptotic properties as Vn . This result, which was
proven by Hoeffding (1948), establishes conditions under which a U -statistic
is asymptotically normal.
Theorem 11.6 (Hoeffding). Let X1 , . . . , Xn be a set of independent and iden-
tically distributed random variables from a distribution F . Let θ be a parame-
ter that estimable of degree r with symmetric kernel function h(x1 , . . . , xr ). If
490 NONPARAMETRIC INFERENCE
E[h2 (x1 , . . . , xr )] < ∞ and ζ1 > 0 then
 
 −1 X
n d
n1/2 (r2 ζ1 )−1/2  h(Xb1 , . . . , Xbr ) − θ −
→ Z,
r
b∈Bn,r

as n → ∞ where Z is a N(0, 1) random variable.

Proof. The proof of this result proceeds in two parts. We first establish that
the projection of Un − θ onto Vn has an asymptotic Normal distribution. We
then prove that kUn − θ − Vn k converges to zero as n → ∞, which will then be
used to establish that the two statistics have the same limiting distribution.
Let Un have the form
1 X
Un = n
 h(Xb1 , . . . , Xbr ),
r b∈Bn,r

and let Vn be the projection of Un − θ onto the space Vn . That is


n
X
Vn = k̃(Xi ),
i=1

where k̃ is defined in the proof of Theorem 11.5. Because X1 , . . . , Xn is a


set of independent and identically distributed random variables, it follows
that k̃(X1 ), . . . , k̃(Xn ) is also a set of independent and identically distributed
random variables and hence Theorem 4.20 (Lindeberg and Lévy) implies that
( n
)
d
X
1/2 −1 −1
n σ̃ n nk̃(Xi ) − E[nk̃(Xi )] −→ Z, (11.16)
i=1

as n → ∞ where Z is a N(0, 1) random variable. Now Theorem A.17 implies


that

E[nk̃(X1 )] = E {rE[h(X1 , . . . , Xr )|X1 ] − rθ}


= rE[h(X1 , . . . , Xr )] − rθ
= 0. (11.17)

Similarly, to find σ̃ 2 we note that

V [nk̃(X1 )] = E[n2 k̃ 2 (X1 )]


= r2 E{[E[h(X1 , . . . , Xr )|X1 ] − θ2 ]2 }
= r2 V {E[h(X1 , . . . , Xr )|X1 ]},

since the result of Equation (11.17) implies that E{E[h(X1 , . . . , Xr )|X1 ]} = θ.


UNBIASED ESTIMATION AND U -STATISTICS 491
Therefore, it follows that
V [nk̃(X1 )] = r2 E{E 2 [h(X1 , . . . , Xr )|X1 ]} − r2 θ2
= r2 E{E[h(X1 , . . . , Xr )|X1 ]E[h(X1 , . . . , Xr )|X1 ]} − r2 θ2
= r2 E{E[h(X1 , . . . , Xr )|X1 ]E[h(X1 , Xr+1 , . . . , X2r−1 )|X1 ]}
−r2 θ2
= r2 E{E[h(X1 , . . . , Xr )h(X1 , Xr+1 , . . . , X2r−1 )|X1 ]} − r2 θ2
= r2 E[h(X1 , . . . , Xr )h(X1 , Xr+1 , . . . , X2r−1 )] − r2 θ2 .
Now, note that
ζ1 = C[h(X1 , . . . , Xr ), h(X1 , Xr+1 , . . . , X2r−1 )]
= E[h(X1 , . . . , Xr )h(X1 , Xr+1 , . . . , X2r−1 )]
−E[h(X1 , . . . , Xr )]E[h(X1 , Xr+1 , . . . , X2r−1 )]
= E[h(X1 , . . . , Xr )h(X1 , Xr+1 , . . . , X2r−1 )] − θ2 .
Therefore, it follows that
E[h(X1 , . . . , Xr )h(X1 , Xr+1 , . . . , X2r−1 )] = ζ1 + θ2 ,
and hence we have shown that nV [k̃(X1 )] = r2 ζ1 , or equivalently we have
shown that V [k̃(X1 )] = n−2 r2 ζ1 . Substituting the expressions for the expec-
tation and variance of k̃(X1 ) into the result of Equation (11.16) yields the
d
result that r−1 (nζ1 )−1/2 Vn −
→ Z as n → ∞. For the next step, we begin by
proving that kUn − θ − Vn k → 0 as n → ∞. We first note that

kUn − θ − Vn k = E[(Un − θ − Vn )2 ] =
E[(Un − θ)2 ] − 2E[Vn (Un − θ)] + E[Vn2 ]. (11.18)
To evaluate the first term on the right hand side of Equation (11.18) we note
that E(Un ) = θ and hence Theorem 11.4 implies that
E[(Un − θ)2 ] = V (Un ) = n−1 r2 ζ1 + o(n−1 ), (11.19)
as n → ∞. To evaluate the second term on the right hand side of Equation
(11.18),
" #   
n −1 X
 X n 
E[Vn (Un − θ)] = E k̃(Xi )  h(Xb1 , . . . , Xbr ) − θ

i=1
r 
b∈Bn,r
 −1 Xn
n X
= E{k̃(Xi )[h(Xb1 , . . . , Xbr ) − θ]}.
r i=1 b∈Bn,r

Now, if i ∈
/ {b1 , . . . , br } then k̃(Xi ) and h(Xb1 , . . . , Xbr ) will be independent
and hence
E{k̃(Xi )[h(Xb1 , . . . , Xbr ) − θ]} = E[k̃(Xi )]E[h(Xb1 , . . . , Xbr ) − θ] = 0.
For the remaining n−1

r−1 terms where i ∈ {b1 , . . . , br }, we apply Theorem A.17
492 NONPARAMETRIC INFERENCE
to find that
E{k̃(Xi )[h(Xb1 , . . . , Xbr ) − θ]} = E[E{k̃(Xi )[h(Xb1 , . . . , Xbr ) − θ]|Xi }]
= E[k̃(Xi )E{[h(Xb1 , . . . , Xbr ) − θ]|Xi }]
= nr−1 E[k̃ 2 (Xi )]
= n−1 rζ1 .
Therefore, it follows that
n n−1

X r r−1 ζ1
E[Vn (Un − θ)] = = r2 n−1 ζ1 . (11.20)
n nr

i=1

To evaluate the third term on the right hand side of Equation (11.18), we have
that
" # n 
 X n X  X n X n
E(Vn2 ) = E k̃(Xi )  k̃(Xj ) = E[k̃(Xi )k̃(Xj )].
 
i=1 j=1 i=1 j=1

Now when i 6= j, k̃(Xi ) and k̃(Xj ) are independent and E[k̃(Xi )k̃(Xj )] =
E[k̃(Xi )]E[k̃(Xj )] = 0. Therefore,
n
X
E(Vn2 ) = E[k̃ 2 (Xi )] = nE[k̃ 2 (X1 )] = n−1 r2 ζ1 . (11.21)
i=1

Combining the results of Equations (11.18)–(11.21) yields


kUn − θ − Vn k = n−1 r2 ζ1 − 2n−1 r2 ζ1 + n−1 r2 ζ1 + o(n−1 ) = O(n−1 ),
as n → ∞, so that kUn − θ − Vn k → 0 as n → ∞. Therefore from Definition
qm
5.1 it follows that Un − θ − Vn −−→ 0 as n → ∞. Theorems 5.2 and 4.8 imply
d
then that Un − θ − Vn − → 0 as n → ∞, and therefore Un − θ and Vn converge
to the same distribution as n → ∞. Hence, Equation (11.16) implies that
d
r−1 (nζ1 )−1/2 (Un − θ) −
→ Z as n → ∞, and the result is proven.
Example 11.11. Let X1 , . . . , Xn be a set of independent and identically
distributed random variables from a distribution with mean θ and finite vari-
ance σ 2 . Continuing with Example 11.9 we consider estimating the parameter
E[X12 ] = θ2 with a U -statistic of degree r = 2 that has a symmetric kernel
function h(x1 , x2 ) = x1 x2 with ζ1 = θ2 σ 2 . The assumption that the variance
is finite implies that θ < ∞ and we have that E[h2 (X1 , X2 )] = E[X12 X22 ] =
E[X12 ]E[X22 ] = (θ2 + σ 2 )2 < ∞. Further ζ1 = θ2 σ 2 > 0 so that Theorem
d
11.6 implies that n1/2 (2θσ)−1 (Un − θ2 ) −
→ Z as n → ∞ where Z is a N(0, 1)
random variable. 
Example 11.12. Let X1 , . . . , Xn be a set of independent and identically dis-
tributed random variables from a continuous distribution F that is symmetric
about a point θ. Let R be the vector of ranks of |X1 − θ|, . . . , |Xn − θ|, let C
be an n × 1 vector with ith element ci = δ{Xi − θ; (0, ∞)} for i = 1, . . . , n, and
UNBIASED ESTIMATION AND U -STATISTICS 493
Zi = Xi − θ for i = 1, . . . , n. The Wilcoxon signed rank statistic was seen in
Example 11.3 to have the form W = C0 R. In Example 11.7 it was shown that
W = nU1,n + 21 n(n − 1)U2,n where U1,n and U2,n are U -statistics of orders
r = 1 and r = 2, respectively. These U -statistics are given by

n
X
U1,n = n−1 δ{2Zi ; (0, ∞)},
i=1

and

n n
2 X X
U2,n = δ{Zi + Zj ; (0, ∞)}.
n(n − 1) i=1 j=i+1

To find the asymptotic distribution of W , we first note that

 −1
1/2 n
n [W − E(W )] =
2
2n1/2
[U1,n − E(U1,n )] + n1/2 [U2,n − E(U2,n )]. (11.22)
n−1

For the first term on the right hand side of Equation (11.22) we note that
V (U1,n ) = n−1 V (δ{2Zi ; (0, ∞)}) where

1
E(δ{2Zi ; (0, ∞)}) = P (δ{2Zi ; (0, ∞)} = 1) = P (Zi > 0) = 2

since Zi has a symmetric distribution about zero. Since δ 2 {2Zi ; (0, ∞)} =
δ{2Zi ; (0, ∞)} it follows also that E(δ 2 {2Zi ; (0, ∞)}) = 21 . Therefore, we have
that V (δ{2Zi ; (0, ∞)}) = 14 , and hence V (U1,n ) = 41 n−1 . Theorem 3.10 (Weak
p
Law of Large Numbers) then implies that U1,n − → 12 as n → ∞. Noting that
2n1/2 (n − 1)−1 → 0 as n → ∞ we can then apply Theorem 4.11 (Slutsky)
p
to find that 2n1/2 (n − 1)−1 [U1,n − E(U1,n )] − → 0 as n → ∞. Therefore, it
−1
follows that the asymtptotic distribution of n1/2 n2 [W − E(W )] is the same
as n1/2 [U2,n − E(U2,n )]. We wish to apply Theorem 11.6 to the second term
on the right hand side of Equation (11.22), and therefore we need to verify the
assumptions required for Theorem 11.6. Let f be the density of Z1 where, by
assumption, f is symmetric about zero. Noting that the independence between
Zi and Zj implies that the joint distribution between Zi and Zj is f (zi )f (zj ),
494 NONPARAMETRIC INFERENCE
it follows that
E[δ 2 {Zi + Zj ; (0, ∞)}] = E[δ{Zi + Zj ; (0, ∞)}]
= P (Zi + Zj > 0)
Z ∞Z ∞
= f (zi )f (zj )dzj dzi
−∞ −zi
Z ∞ Z ∞
= f (zi ) f (zj )dzj dzi
−∞ −zi
Z ∞
= f (zi )[1 − F (−zi )]dzi
−∞
Z ∞
= F (zi )f (zi )dzi
−∞
Z1
= tdt = 21 ,
0

where we have used the fact that the symmetry of f implies that 1 − F (−z) =
F (z). Since 12 < ∞ we have verified that E[h2 (x1 , . . . , xr )] < ∞. To verify the
second assumption we note that
1
ζ1 = E[δ{Zi + Zj ; (0, ∞)}δ{Zi + Zk ; (0, ∞)}] − 4
= P ({Zi + Zj > 0} ∩ {Zi + Zk > 0}) − 14
Z ∞ Z ∞Z ∞
= f (zi )f (zj )f (zk )dzk dzj dzi − 14
−∞ −zi −zi
Z ∞ Z ∞
= f (zi ) f (zj )[1 − F (−zi )]dzj dzi − 14
−∞ −zi
Z ∞
= f (zi )F 2 (zi )dzi − 41
−∞
Z 1
2 1 1 1 1
= t dt − 4 = 3 − 4 = 12 > 0.
0

Hence, the second assumption is verified. Theorem 11.6 then implies that
d
n1/2 [U2,n − 12 ] −
→ Z2 where Z2 has a N(0, 13 ) distribution, and therefore
 −1
n d
n1/2 [W − E(W )] − → Z2 ,
2
as n → ∞. Further calculations can be used to refine this result to find that
W − 41 n(n + 1) d
1 −
→ Z, (11.23)
[ 24 n(n + 1)(2n + 1)]1/2
as n → ∞ where Z has a N(0, 1) distribution. See Exercise 5. This result is
suitable for using W to test the null hypothesis H0 : θ = 0 using approximate
rejection regions. Figures 11.1–11.3 plot the exact distribution of W under
the null hypothesis for n = 5, 7, and 10. It is clear in Figure 11.3 that the
LINEAR RANK STATISTICS 495

Figure 11.1 The exact distribution of the signed-rank statistic when n = 5.

0.09
0.08
0.07
P(W=w)

0.06
0.05
0.04
0.03

0 5 10 15

w
normal approximation should work well for this sample size and larger. Table
11.1 compares some exact quantiles of the distribution of W with some given
by the normal approximation. Note that Equation (11.23) implies that the α
quantile of the distribution of W can be approximated by
1
4 n(n
1
+ 1) + zα [ 24 n(n + 1)(2n + 1)]1/2 . (11.24)


The topic of U -statistics can be expanded in many ways. For example, an


overview of U -statistics for two or more samples can be found in Section 3.4
of Randles and Wolfe (1979). For other generalizations and asymptotic results
for U -statistics see Kowalski and Tu (2008), Lee (1990), and Chapter 5 of
Serfling (1980).

11.3 Linear Rank Statistics

Another class of statistics that commonly occur in nonparametric statistical


inference are linear rank statistics, which are linear functions of the rank vec-
tor. As with U -statistics, linear rank statistics are asymtptotically Normal
under some very general conditions.
496 NONPARAMETRIC INFERENCE

Figure 11.2 The exact distribution of the signed-rank statistic when n = 7.

0.06
0.05
0.04
P(W=w)

0.03
0.02
0.01

0 5 10 15 20 25

w
Figure 11.3 The exact distribution of the signed-rank statistic when n = 10.
0.04
0.03
P(W=w)

0.02
0.01
0.00

0 10 20 30 40 50

w
LINEAR RANK STATISTICS 497

Table 11.1 A comparison of the exact quantiles of the signed rank statistic against
those given by the normal approximation given in Equation (11.24). The approximate
quantiles have been rounded to the nearest integer.
Exact Quantiles Normal Approximation Relative Error (%)
n 0.10 0.05 0.01 0.10 0.05 0.01 0.10 0.05 0.01
5 12 14 15 10 11 14 16.7 21.5 6.7
6 17 18 21 14 15 19 17.6 16.7 9.5
7 22 24 27 18 20 24 18.2 16.7 11.1
8 27 30 34 23 26 31 14.8 13.3 8.8
9 34 36 41 29 32 38 14.7 11.1 7.3
10 40 44 49 35 39 45 12.5 11.4 8.16
25 211 224 248 198 211 236 6.2 5.8 4.8
50 771 808 877 745 783 853 3.4 3.1 2.7

Definition 11.5. Let X1 , . . . , Xn be a set of independent and identically dis-


tributed random variables from a distribution F and let R0 = (r1 , . . . , rn ) be
the vector a ranks associated with X1 , . . . , Xn . Let a and c denote functions
that map the set {1, . . . , n} to the real line. Then the statistic
n
X
S= c(i)a(ri ),
i=1

is called a linear rank statistic. The set of constants c(1), . . . , c(n) are called
the regression constants, and the set of constants a(1), . . . , a(n) are called the
scores of the statistic.

The important ingredients of Definition 11.5 are that the elements of the vector
R correspond to a random permutation of the integers in the set {1, . . . , n}
so that a(r1 ), . . . , a(rn ) are random variables, but the regression constants
c(1), . . . , c(n) are not random. The frequent use of ranks in classical nonpara-
metric statistics makes linear rank statistics an important topic.
Example 11.13. Let us consider the Wilcoxon, Mann, and Whitney Rank
Sum test statistic from Example 11.2. That is, let X1 , . . . , Xm be a set of
independent and identically distributed random variables from a continu-
ous distribution F and let Y1 , . . . , Yn be a set of independent and identi-
cally distributed random variables from a continuous distribution G where
G(t) = F (t − θ), where θ is an unknown parameter. Denoting the combined
sample as Z1 , . . . , Zn+m , and the corresponding ranks as R1 , . . . , Rn+m , the
test statistic
n+m
X
Mm,n = Di Ri , (11.25)
i=1
where Di = 1 when Zi is from the sample X1 , . . . , Xm and Di = 0 when
498 NONPARAMETRIC INFERENCE
Zi is from the sample Y1 , . . . , Yn . For simplicity assume that Zi = Xi for
i = 1, . . . , m and Zj = Yj−n+1 for j = m + 1, . . . , n + m. Then the statistic
given in Equation (11.25) can be written as a linear rank statistic of the form
given in Definition 11.5 with a(i) = i and c(i) = δ{i; {m + 1, . . . , n + m}} for
all i = 1, . . . , n + m. 
Example 11.14. Once again consider the two-sample setup studied in Ex-
ample 11.2, except in this case we will use the median test statistic proposed
by Mood (1950) and Westenberg (1948). For this test we compute the me-
dian of the combined sample X1 , . . . , Xm , Y1 , . . . , Yn and then compute the
number of values in the sample Y1 , . . . , Yn that exceed the median. Note that
under the null hypothesis that θ = 0, where the combined sample all comes
from the sample distribution, we would expect that half of these would be
above the median. If θ 6= 0 then we would expect either greater than, or fewer
than, of these values to exceed the median. Therefore, counting the number
of values that exceed the combined median provides a reasonable test statis-
tic for the null hypothesis that θ = 0. This test statistic can be written in
the form of a linear rank statistic with c(i) = δ{i; {m + 1, . . . , n + m}} and
a(i) = δ{i; { 21 (n + m + 1), . . . , n + m}}. The form of the score function is
derived from the fact that if the rank of a value from the combined sample
exceeds 21 (n + m + 1) then the corresponding value exceeds the median. This
score function is called the median score function. 
Typically, under a null hypothesis the elements of the vector R correspond to
a random permutation of the integers in the set {1, . . . , n} that is uniformly
distributed over the set Rn that is defined in Equation (11.1). In this case
the distribution of S can be found by enumerating the values of S over the
r! equally likely permutations in Rn . There are also some general results that
are helpful in the practical application of tests based on linear rank statistics.
Theorem 11.7. Let S be a linear rank statistic of the form
n
X
S= c(i)a(ri ).
i=1

If R is a vector whose elements correspond to a random permutation of the in-


tegers in the set {1, . . . , n} that is uniformly distributed over Rn then E(S) =
nāc̄ and
( n ) n 
X X 
V (S) = (n − 1)−1 [a(i) − ā]2 [c(j) − c̄]2
 
i=1 j=1

where
n
X
ā = n−1 a(i),
i=1
and
n
X
c̄ = n−1 c(i).
i=1
LINEAR RANK STATISTICS 499
Theorem 11.7 can be proven using direct calculations. See Exercise 6. More
complex arguments are required to obtain further properties on the distribu-
tion of linear rank statistics. For example, the symmetry of the distribution
of a linear rank statistic can be established under fairly general conditions
using arguments based on the composition of permutations. This result was
first proven by Hájek (1969).
Theorem 11.8 (Hájek). Let S be a linear rank statistic of the form
n
X
S= c(i)a(ri ).
i=1

Let c(1) , . . . , c(n) and a(1) , . . . , a(n) denote the ordered values of c(1), . . . , c(n)
and a(1), . . . , a(n), respectively. Suppose that R is a vector whose elements
correspond to a random permutation of the integers in the set {1, . . . , n} that
is uniformly distributed over Rn . If a(i) +a(n+1−i) or c(i) +c(n+1−i) is constant
for i = 1, . . . , n, then the distribution of S is symmetric about nāc̄.

A proof of Theorem 11.8 can be found in Section 8.2 of Randles and Wolfe
(1979).
Example 11.15. Let us consider the rank sum test statistic from Exam-
ple 11.2 which can be written as a linear rank statistic of the form given
in Definition 11.5 with a(i) = i and c(i) = δ{i; {m + 1, . . . , n + m}} for
all i = 1, . . . , n + m. See Example 11.13. Note that a(i) = a(i) and that
a(i) + a(m + n − i + 1) = m + n + 1 for all i ∈ {1, . . . , m + n} so that Theorem
11.8 implies that the distribution of the rank sum test statistic is symmetric
when the null hypothesis that θ = 0 is true. Some examples of the distribution
are plotted in Figures 11.4–11.6. 

The possible symmetry of the distribution of a linear rank statistic indicates


that its distribution could be a good candidate for the normal approximation,
an idea that is supported by the results provided in Example 11.15. The re-
mainder of this section is devoted to developing conditions under which linear
rank statistics have an asymptotic Normal distribution. To begin develop-
ment of the asymptotic results we now consider a sequence of linear rank
statistics {Sn }∞
n=1 where
n
X
Sn = c(i, n)a(Ri , n), (11.26)
i=1

where we emphasize that both the regression constants and the scores depend
on the sample size n. Section 8.3 of Randles and Wolfe (1979) points out that
the regression constants c(1, n), . . . , c(n, n) are usually determined by the type
of problem under consideration. For example, in Examples 11.13 and 11.14,
the regression constants are used to distinguish between the two samples.
Therefore, it is advisable to put as few restrictions on the types of regression
constants that can be considered so that as many different types of problems
500 NONPARAMETRIC INFERENCE

Figure 11.4 The distribution of the rank sum test statistic when n = m = 3.

0.14
0.12
P(W=w)

0.10
0.08
0.06

4 6 8 10 12

w
Figure 11.5 The distribution of the rank sum test statistic when n = m = 4.
0.10
0.08
P(W=w)

0.06
0.04
0.02

10 15 20

w
LINEAR RANK STATISTICS 501

Figure 11.6 The distribution of the rank sum test statistic when n = m = 5.

0.08
0.06
P(W=w)

0.04
0.02

10 15 20 25 30 35

w
as possible can be addressed by the asymptotic theory. A typical restriction
is given by Noether’s condition.
Definition 11.6 (Noether). Let c(1, n), . . . , c(n, n) be a set of regression con-
stants for a linear rank statistic of the form given in Equation (11.26). The
regression constants follow Noether’s condition if
Pn
i=1 d(i, n)
lim = ∞,
n→∞ maxi∈{1,...,n} d(i, n)

where " #2
n
X
−1
d(i, n) = c(i, n) − n c(i, n) ,
i=1
for i = 1, . . . , n and n ∈ N.

This condition originates from Noether (1949) and essentially keeps one of the
constants from dominating the others.
Greater latitude is given in choosing the score function, and hence more re-
strictive assumptions can be implemented on them. In particular, the usual
approach is to consider score functions of the form a(i, n) = α[i(n + 1)−1 ],
where α is a function that does not depend on n and is assumed to have
certain properties.
502 NONPARAMETRIC INFERENCE
Definition 11.7. Let α be a function that maps the open unit interval (0, 1)
to R such that,

1. α(t) = α1 (t) − α2 (t) where α1 and α2 are non-decreasing functions that


map the open unit interval (0, 1) to R.
2. The integral
Z 1 Z 1 2
α(t) − α(u)du dt
0 0
is non-zero and finite.

Then the function α(t) is called a square integrable score function.

Not all of the common score functions can be written strictly in the form
a(i, n) = α[i(n + 1)−1 ] where α is a square integrable score function. However,
a slight adjustment to this form will not change the asymptotic behavior of a
properly standardized linear rank statistic, and therefore it suffices to consider
this form.
To establish the asymptotic normality of a linear rank statistic, we will need
to develop several properties of the ranks, order statistics, and square inte-
grable score functions. The first result establishes independence between the
rank vector and the order statistics for sets of independent and identically
distributed random variables.
Theorem 11.9. Let X1 , . . . , Xn be a set of independent and identically dis-
tributed random variables from a continuous distribution F . Let R1 , . . . , Rn
denote the ranks of X1 , . . . , Xn and let X(1) , . . . , X(n) denote the order statis-
tics. Then R1 , . . . , Rn and X(1) , . . . , X(n) are mutually independent.

A proof of Theorem 11.9 can be found in Section 8.3 of Randles and Wolfe
(1979).
Not surprisingly, we also need to establish several properties of square inte-
grable score functions. The property we establish below is related to limiting
properties of the expectation of the the score function.
Theorem 11.10 (Hájek and Šidák). Let U be a Uniform(0, 1) random vari-
able and let {gn }n=1 be a sequence of functions that map the open unit interval
a.c.
(0, 1) to R, where gn (U ) −−→ g(U ) as n → ∞, and
lim sup E[gn2 (U )] ≤ E[g 2 (U )]. (11.27)
n→∞

Then
lim E{[gn (U ) − g(U )]2 } = 0.
n→∞

Proof. We begin by noting that


E{[gn (U ) − g(U )]2 } = E[gn2 (U )] − 2E[gn (U )g(U )] + E[g 2 (U )]. (11.28)
For the first term on the right hand side of Equation (11.28) we note that
LINEAR RANK STATISTICS 503
since {gn2 (U )}∞
n=1 is a sequence of non-negative random variables that converge
almost certainly to g(U ), it follows from Theorem 5.10 (Fatou) that
E[g(U )] ≤ lim inf E[gn (U )].
n→∞

Combining the result with Equation (11.28) yields


lim sup E[gn2 (U )] ≤ E[g 2 (U )] ≤ lim inf E[gn (U )],
n→∞ n→∞

so that Definition 1.3 implies that


lim E[gn2 (U )] = E[g(U )]. (11.29)
n→∞

For the second term on the right hand side of Equation (11.28) we refer to
Theorem II.4.2 of Hájek and Šidák (1967), which shows that
lim E[gn (U )g(U )] = E[g 2 (U )]. (11.30)
n→∞

Combining the results of Equations (11.28)–(11.30) yields the result.

In most situations the linear rank statistic is used as a test statistic to test a
null hypotheses that implies that the rank vector R is uniformly distributed
over the set Rn . In this case each component of R has a marginal distribution
that is uniform over the integers {1, . . . , n}. Hence, the expectation E{α2 [(n+
1)−1 R1 ]} is equivalent to the expectation E{α2 (n + 1)−1 Un ]} where Un is a
d
Uniform{1, . . . , n} random variable. It can be shown that Un −
→ U as n → ∞
where U is a Uniform(0, 1) random variable. The result given in Theorem
11.11 below establishes the fact that the expectations converge as well.
Theorem 11.11. Let α be a square integrable score function and let U be a
Uniform{1, . . . , n} random variable. Then
n
X
lim E{α2 [(n + 1)−1 Un ]} = lim n−1 α2 [(n + 1)−1 i] =
n→∞ n→∞
i=1
Z 1
α2 (t)dt. (11.31)
0

A complete proof of Theorem 11.11 is quite involved. The interested reader


should consult Section 8.3 of Randles and Wolfe (1979) for the complete de-
tails. A similar result is required to obtain the asymptotic behavior of the
score function evaluated at a Binomial random variable.
Theorem 11.12. Let α be a square integrable score function and let Yn be a
Binomial(n, θ) random variable. Then
lim E{α[(n + 2)−1 (Yn + 1)]} = α(θ)
n→∞

for all θ ∈ (0, 1) \ A where the Lebesgue measure of A is zero.

A proof of Theorem 11.12 can be found in Section 8.3 of Randles and Wolfe
504 NONPARAMETRIC INFERENCE
(1979). We finally require a result that shows that α[(n + 1)−1 R1 ], where R1
is the rank of the first observation U1 from a set of n independent and identi-
cally distributed random variables from a Uniform(0, 1) distribution can be
approximated by α(U1 ). The choice of the first observation is by convenience,
what matters is that we have the rank of an observation from a Uniform sam-
ple. This turns out to be the key approximation in developing the asymptotic
Normality of linear rank statistics.
Theorem 11.13. Let α be a square integrable score function and let U1 , . . . , Un
be a set of independent and identically distributed Uniform(0, 1) random vari-
qm
ables. Suppose that R1 is the rank of U1 . Then |α[(n + 1)−1 R1 ] − α(U1 )| −−→ 0
as n → ∞.

Proof. We begin by noting that

E[{α[(n + 1)−1 R1 ] − α(U1 )}2 ] =


E{α2 [(n + 1)−1 R1 ]} − 2E{α(U1 )α[(n + 1)−1 R1 ]} + E[α2 (U1 )]. (11.32)
Since U1 is a Uniform(0, 1) random variable it follows that
Z 1
E[α2 (U1 )] = α2 (t)dt. (11.33)
0

The marginal distribution of R1 is Uniform{1, . . . , n} so that Theorem 11.11


implies that
Z 1
lim E{α2 [(n + 1)−1 R1 ]} = α2 (t)dt. (11.34)
n→∞ 0
To evaluate the second term on the right hand side of Equation (11.32) we
note that the rank of U1 equals the number of values in {U1 , . . . , Un } that are
less than or equal to U1 (including U1 ). Equivalently, the rank of U1 equals
the number of differences U1 − Ui that are non-negative. Therefore
( " n #)
X
−1 −1
α[(n + 1) R1 ] = α (n + 1) δ{U1 − Ui ; [0, ∞)} .
i=1

Let
n
X
Yn = δ{(U1 − Ui ); [0, ∞)},
i=2
and note that Yn is a Binomial(n − 1, γ) random variable where γ = P (U1 −
Ui ≥ 0). Therefore, Theorem A.17 implies that

E{α(U1 )α[(n + 1)−1 R1 ]} = E{α(U1 )α[(n + 1)−1 (Yn + 1)]} =


E[E{α(U1 )α[(n + 1)−1 (Yn + 1)]}|U1 ]. (11.35)
Now, Yn |U1 is a Binomial(n − 1, λ) random variable where λ = P (U1 − Ui ≥
0|U1 ) = P (Ui ≤ U1 ) = U1 since Ui is a Uniform(0, 1) random variable for i =
2, . . . , n. Let βn (u) = E{α[(n+1)−1 (B +1)]} where B is a Binomial(n−1, u)
LINEAR RANK STATISTICS 505
random variable, then Equation (11.35) implies that
E{α[(n + 1)−1 R1 ]α(U1 )} = E[βn (U )α(U )],
where U is a Uniform(0, 1) random variable. Theorem 11.12 implies that
lim βn (u) = lim E{α[(n + 1)−1 (B + 1)]} = α(u),
n→∞ n→∞

so that the next step is then to show that


lim E[βn (U )α(U )] = E[α2 (U )],
n→∞

using Theorem 11.10. To verify the assumption of Equation (11.27) we note


that Theorem 2.11 (Jensen) implies that
E[βn2 (U )] = E[E 2 {α[(n + 1)−1 (B + 1)]|U }]
≤ E[E{α2 [(n + 1)−1 (B + 1)]|U }]
= E{α2 [(n + 1)−1 R1 ]},
since we have shown that R1 has the same distribution as B + 1. Theorem
11.11 implies that
lim E{α2 [(n + 1)−1 R1 ]} = E[α2 (U )].
n→∞

Therefore, Theorem 1.6 implies that


lim sup E[βn2 (U )] ≤ lim sup E{α2 [(n + 1)−1 R1 ]} = E[α2 (U )],
n→∞ n→∞

which verifies the assumption in Equation (11.27). Therefore, Theorem 11.10


implies that
Z 1
lim E[βn (U )α(U )] = E[α2 (U )] = α2 (t)dt. (11.36)
n→∞ 0

Conbining the results of Equations (11.32)–(11.36) yields the result.

We are now ready to tackle the asymptotic normality of a linear rank statistic
when the null hypothesis is true. This result, first proven by Hájek (1961)
is proven using a similar approach to Theorem 11.6 (Hoeffding), in that the
linear rank statistic is approximated by a simpler statistic whose asymptotic
distribution is known. Therefore, the crucial part of the proof is based on
showing that the two statistics have the same limiting distribution.
Theorem 11.14 (Hájek). Let Sn be a linear rank statistic with regression
constants c(1, n), . . . , c(n, n) and score function a. Suppose that

1. a(i, n) = α[(n + 1)−1 i] for i = 1, . . . , n where α is a square integrable score


function.
2. c(1, n), . . . , c(n, n) satisfy Noether’s condition given in Definition 11.6.
3. R is a rank vector that is uniformly distributed over the set Rn for each
n ∈ N.
506 NONPARAMETRIC INFERENCE
d
Then σn−1 (Sn − µn ) −
→ Z as n → ∞ where Z is a N(0, 1) random variable,
µn = nc̄n ān , and
( n )( n )
X X
2 −1 2 2
σn = (n − 1) [c(i, n) − c̄n ] [a(i, n) − ān ] .
i=1 i=1

Proof. Let U1 , . . . , Un be a set of independent and identically distributed ran-


dom variables from a Uniform(0, 1) distribution and let U(1) , . . . , U(n) denote
the corresponding order statistics, with U denoting the vector containing the
n order statistics.
Conside the linear rank statistic Sn and note that
Xn
Sn = c(i, n)a(Ri , n)
i=1
n
X
= c(i, n)a(Ri , n) − nc̄n ān + nc̄n ān
i=1
Xn n
X
= c(i, n)a(Ri , n) − c̄n a(i, n) + nc̄n ān (11.37)
i=1 i=1
Xn
= [c(i, n) − c̄n ]a(Ri , n) + nc̄n ān
i=1
Xn
= [c(i, n) − c̄n ]α[(n + 1)−1 Ri ] + nc̄n ān , (11.38)
i=1
where we have used the fact that
Xn n
X
a(i, n) = a(Ri , n),
i=1 i=1

to establish Equation (11.37). The key idea to proving the desired result is
based on approximating the statistic in Equation (11.38) with one whose
asymptotic distribution can easily be found. Let Wn be a Uniform{1, . . . , n+
d
1} random variable. Then, it can be shown that (n + 1)−1 Wn − → U as n → ∞
where U is a Uniform(0, 1) random variable. This suggests approximating
(n + 1)−1 Ri with U , or approximating the asymptotic behavior of Sn using
the statistic
Xn
Vn = [c(i, n) − c̄n ]α(Ui ) + nc̄n ān .
i=1
The first step is to find the asymptotic distribution of Vn . Note that
" n #
X
E(Vn ) = E [c(i, n) − c̄n ]α(Ui ) + nc̄n ān
i=1
n
X
= [c(i, n) − c̄n ]E[α(Ui )] + nc̄n ān .
i=1
LINEAR RANK STATISTICS 507
Note that because U1 , . . . , Un are identically distributed, it follows that
n
X
E(Vn ) = E[α(U1 )] [c(i, n) − c̄n ] + nc̄n ān = nc̄n ān = µn ,
i=1

since
n
X
[c(i, n) − c̄n ] = 0.
i=1
Similarly, since U1 , . . . , Un are mutually independent, it follows that
" n #
X
V (Vn ) = V [c(i, n) − c̄n ]α(Ui )
i=1
n
X
= [c(i, n) − c̄n ]2 V [α(Ui )]
i=1
n
X
= V [α(U1 )] [c(i, n) − c̄n ]2
i=1
n
X
= α̃2 [c(i, n) − c̄n ]2 ,
i=1

where
Z 1
α̃2 = V [α(U1 )] = [α(t) − ᾱ]2 dt.
0
Now,
n
X
Vn − µn = [c(i, n) − c̄n ]α(Ui )
i=1
Xn n
X
= [c(i, n) − c̄n ]α(Ui ) − [c(i, n) − c̄n ]ᾱ
i=1 i=1
Xn
= {[c(i, n) − c̄n ]α(Ui ) − [c(i, n) − c̄n ]ᾱ},
i=1

where
Z 1
ᾱ = E[α(Ui )] = α(t)dt.
0
2
Define Yi,n = [c(i, n) − c̄n ]α(Ui ), µi,n = [c(i, n) − c̄n ]ᾱ, and σi,n = [c(i, n) −
2 2
c̄n ] α̃ , for i = 1, . . . , n. Then Theorem 6.1 (Lindeberg, Lévy, and Feller) will
imply that
d
Zn = n1/2 τn−1/2 (Ȳn − µ̄n ) −
→Z (11.39)
as n → ∞ where Z has a N(0, 1) distribution, as long as we can show that
the associated assumptions hold. In terms of the notation of Theorem 6.1, we
508 NONPARAMETRIC INFERENCE
have that
n
X n
X
−1 −1
µ̄n = n µi,n = n [c(i, n) − c̄n ]ᾱ,
i=1 i=1

and
n
X n
X
τn2 = 2
σi,n = [c(i, n) − c̄n ]2 α̃2 . (11.40)
i=1 i=1

For the first condition we must show that

lim max τn−2 σi,n


2
= 0. (11.41)
n→∞ i∈{1,...,n}

Now,
α̃2 [c(i, n) − c̄n ]2 [c(i, n) − c̄n ]2
τn−2 σi,n
2
= P n = Pn .
α̃2 i=1 [c(i, n) − c̄n ]2 i=1 [c(i, n) − c̄n ]
2

It then follows that the assumption required in Equation (11.41) is implied by


Noether’s condition given in Definition 11.6. The second condition we need to
show is that
n
X
lim τn−2 E(|Yi − µi,n |2 δ{|Yi − µi,n |; (ετn , ∞)}) = 0,
n→∞
i=1

for every ε > 0. Let ε > 0, then we begin showing this condition by noting
that

E(|Yi − µi,n |2 δ{|Yi − µi,n |; (ετn , ∞)}) =


Z
|[c(i, n) − c̄n ]α(u) − [c(i, n) − c̄n ]ᾱ|2 du =
Li (ε,n)
Z
[c(i, n) − c̄n ]2 [α(u) − ᾱ]2 du,
Li (ε,n)

where Li (ε, n) = {u : |c(i, n) − c̄n ||α(u) − ᾱ| > ετn }. Now let
 −1
∆n = τn max |c(j, n) − c̄n | ,
1≤j≤n

and note that

Li (ε, n) = {u : |c(i, n) − c̄n ||α(u) − ᾱ| > ετn }


 
⊂ u : max |c(i, n) − c̄n ||α(u) − ᾱ| > ετn
1≤j≤n

= {u : |α(u) − ᾱ| > ε∆n }


= L̄(ε, n),

which no longer depends on the index i. Because the integrand is non-negative


LINEAR RANK STATISTICS 509
we have that
Z
[c(i, n) − c̄n ]2 [α(u) − ᾱ]2 du ≤
Li (ε,n)
Z
[c(i, n) − c̄n ]2 [α(u) − ᾱ]2 du =
L̄(ε,n)
Z
2
[c(i, n) − c̄n ] [α(u) − ᾱ]2 du,
L̄(ε,n)

where we note that the integral no longer depends on the index i. Therefore,
Equation (11.40) implies that

n
X
τn−2 E(|Yi − µi,n |2 δ{|Yi − µi,n |; (ετn , ∞}) ≤
i=1
(Z )( n )
X
τn−2 [α(u) − ᾱ] du2
[c(i, n) − c̄n ] 2
=
L̄(ε,n) i=1
(Z )( n )
X
τn−2 α̃2 α̃−2 [α(u) − ᾱ]2 du [c(i, n) − c̄n ]2 =
L̄(ε,n) i=1
Z
α̃−2 [α(u) − ᾱ]2 du.
L̄(ε,n)

To take the limit we note that

 −1
∆n = τn max |c(j, n) − c̄n | =
1≤j≤n
( n )1/2  −1
X
2
α̃ [c(i, n) − c̄n ] max |c(j, n) − c̄n | .
1≤j≤n
i=1

Noether’s condition of Definition 11.6 implies that ∆n → ∞ as n → ∞ and


hence ε∆n → ∞ as n → ∞ for each ε > 0. Therefore, it follows that
Z
lim α̃−2 [α(u) − ᾱ]2 du = 0,
n→∞ L̄(ε,n)

and the second condition is proven, and hence the convergence described in
Equation (11.39) follows.

We will now consider the mean square difference between Sn and Vn . We begin
510 NONPARAMETRIC INFERENCE
by noting that

Sn − Vn =
Xn n
X
[c(i, n) − c̄n ]a(Ri , n) + nc̄n ān − [c(i, n) − c̄n ]α(Ui ) − nc̄n ān =
i=1 i=1
Xn
[c(i, n) − c̄n ][a(Ri , n) − α(Ui )].
i=1

Therefore, using the fact that the conditional distribution of R is still uniform
over Rn conditional on U, we have that

E[(Sn − Vn )2 |U = u] =
" #2 
 Xn 

E [c(i, n) − c̄n ][a(Ri,n ) − α(Ui )] U = u .
 
i=1

Denote c (i, n) = c(i, n) − c̄n and a∗ (i, n) = a(i) − α(Ui ), where we note

that conditional on U = u, α(Ui ) will be a constant. Then we have that


E[(Sn − Vn )2 |U = u] = E[(Sn∗ )2 |U = u], where Sn∗ is a linear rank statistic of
the form
X n
Sn∗ = c∗ (i, n)a∗ (Ri , n).
i=1
Now, note that
n
X n
X
c̄∗n =n −1 ∗
c (i, n) = n −1
[c(i, n) − c̄n ]2 = 0.
i=1 i=1

Therefore, Theorem 11.7 implies that E(Sn∗ |U = u) = 0, and hence


E[(Sn − Vn )2 |U = u] = E[(Sn∗ )2 |U = u] = V (Sn∗ |U = u).
Thus, Theorems 11.7 and 11.9 imply that

E[(Sn − Vn )2 |U = u] =
( n )( n )
X X
−1 ∗ ∗ 2 ∗ ∗ 2
(n − 1) [a (i, n) − ān ] [c (i, n) − c̄n ] =
i=1 i=1
( n
)( n )
X X
−1 2 2
(n − 1) [a(i, n) − α(Ui ) − ā + ᾱU ] [c(i, n) − c̄n ] ,
i=1 i=1

where the last equality follows from the definitions of a∗ (i, n), c∗ (i, n), and
Equation (11.37). We also note that the order statistic U(i) is associated with
rank i, and that
n
X
ᾱU = n−1 α(U(i) ).
i=1
Recall that for a random variable Z such that V (Z) < ∞, it follows that
LINEAR RANK STATISTICS 511
V (Z) ≤ E(Z 2 ). Applying this formula to the case where Z has a Uni-
form{x1 , . . . , xn } distribution implies that
n
X n
X
(xi − x̄n )2 ≤ x2i .
i=1 i=1

Therefore, we have that


( n )( n )
X X
(n − 1)−1 [a(i, n) − α(Ui ) − ā + ᾱU ]2 [c(i, n) − c̄n ]2 ≤
i=1 i=1
( n )( n )
X X
(n − 1)−1 [a(i, n) − α(Ui )]2
[c(i, n) − c̄n ]2
=
i=1 i=1
( n
)( n )
X X
n(n − 1)−1 n−1 [a(i, n) − α(Ui )]2 [c(i, n) − c̄n ]2 .
i=1 i=1

Note that
n

X
2
[a(i, n) − α(U1 )]2 P (R1∗ = i).

E [a(R1 , n) − α(U1 )] |U = u] =
i=1

Theorem 11.9 implies that P (R1∗ = i|U = u) = n−1 for i = 1, . . . , n and hence
n
X
E [a(R1∗ , n) − α(U1 )]2 |U = u] = n−1 [a(i, n) − α(U1 )]2 .

i=1

Therefore,
( n )( n )
X X
(n − 1)−1 [a(i, n) − α(U(i) ) − ā − ᾱU ]2 [c(i, n) − c̄n ]2 ≤
i=1 i=1
( n )
X
−1 2
E [a(R1∗ , n) − α(U1 )]2 |U = u] ,

n(n − 1) [c(i, n) − c̄n ]
i=1

which in turn implies that

E[(Sn − Vn )2 |U = u] ≤
( n )
X
n(n − 1)−1 [c(i, n) − c̄n ]2 E [a(R1∗ , n) − α(U1 )]2 |U = u] . (11.42)

i=1

Using Theorem A.17 and taking the expectation of both sides of Equation
(11.42) yields
E{E[(Sn − Vn )2 |U = u]} = E[(Sn − Vn )2 ],
and
E{E [a(R1∗ , n) − α(U1 )]2 |U = u] } = E [a(R1∗ , n) − α(U1 )]2 ,
 
512 NONPARAMETRIC INFERENCE
so that
( n
)
X
E[(Sn − Vn )2 ] ≤ n(n − 1)−1 [c(i, n) − c̄n ]2 E [a(R1∗ , n) − α(U1 )]2 .

i=1

Recalling that
n
X
τn2 = α̃ [c(i, n) − c̄n ]2 ,
i=1

we have that

lim E[τn−2 (Sn − Vn )2 ] ≤


n→∞
( n )
X
−2 −1
[c(i, n) − c̄n ] E [a(R1∗ , n) − α(U1 )]2 =
2

lim τn n(n − 1)
n→∞
i=1
−1 −2
E [a(R1∗ , n) − α(U1 )]2 ≤

lim n(n − 1) α̃
n→∞
lim α̃−2 E [a(R1∗ , n) − α(U1 )]2 = 0,

n→∞

q.m.
by Theorem 11.13. Hence τn−1 (Sn −Vn ) −−−→ 0 as n → ∞. Using the same type
of arguments used in proving Theorem 11.6, it then follows that σn−1 (Sn − µn )
and τn−1 (Vn − µn ) have the same limiting distribution, and hence the result is
proven.

Example 11.16. Consider the rank sum test statistic M which has the form
of a linear rank statistic with regression constants c(i, m, n) = δ{i; {m +
1, . . . , n + m}} and score function a(i) = i for i = 1, . . . , n + m. For these
regression constants, it follows that
n+m
X n+m
X
−1 −1
c̄m,n = (n + m) c(i, m, n) = (n + m) 1 = n(n + m)−1 ,
i=1 i=m+1

and
n+m
X n+m
X
[c(i, m, n) − c̄m,n ]2 = [c(i, m, n) − n(n + m)−1 ]2
i=1 i=1
m
X
= [0 − n(n + m)−1 ]2
i=1
n+m
X
+ [1 − n(n + m)−1 ]2
i=m+1

= mn (n + m)−2 + nm2 (n + m)−2


2

= mn(n + m)−1 .
LINEAR RANK STATISTICS 513
To verify Noether’s condition of Definition 11.6, we note that

max [c(i, m, n) − c̄m,n ]2 = max{c̄2m,n , (1 − c̄m,n )2 }


i∈{1,...,n+m}

= max{n2 (n + m)−2 , m2 (n + m)−2 }


= (n + m)−2 max{n2 , m2 }
= (n + m)−2 (max{n, m})2 .

In order to simplify the condition in Definition 11.6, we must have that


 −1 n+m
X
Nm,n = max [c(i, m, n) − c̄m,n ]2 [c(i, m, n) − c̄m,n ]2 =
i∈{1,...,n+m}
i=1
2
nm(n + m) nm(n + m)
= →∞
(n + m)(max{n, m})2 (max{n, m})2

as n+m → ∞. Note that if n > m then we have that Nm,n = nm(n+m)n−2 =


mn−1 (n + m), and if m ≥ n then we have that Nm,n = nm(n + m)m−2 =
nm−1 (n + m). Therefore, it follows that

(n + m) min{m, n} n min{m, n} + m min{m, n}


Nm,n = = . (11.43)
max{m, n} max{m, n}
Note that no matter what the relative sizes of m and n are, one of the
terms in the sum on the right hand side of Equation (11.43) will have the
form min{m, n} and the other will have the form min2 {m, n}(max{m, n})−1 .
Therefore,

Nm,n = min{m, n}{1 + min{m, n}[max{m, n}]−1 }.

Therefore, if min{m, n} → ∞ as n + m → ∞ then Nm,n → ∞ as n + m → ∞


and Noether’s condition will hold. The score function can be written as a(i) =
i = (m + n + 1)α[(m + n + 1)−1 i] where α(t) = t for t ∈ (0, 1). Therefore,
Z 1
ᾱ = tdt = 21 ,
0

and
Z 1
α̃2 = (t − 21 )2 dt = 1
12 .
0

Definition 11.7 then implies that α is square integrable score function. Now

µm+n = (m + n)c̄m,n ām,n =


m+n
X
(m + n)n(m + n)−1 (m + n)−1 i = 21 n(m + n + 1),
i=1
514 NONPARAMETRIC INFERENCE
and
(m+n ) (m+n )
X X
2 −1 2 2
σm,n = (m + n − 1) [c(i, n) − c̄m,n ] [a(i, n) − ām,n ] =
i=1 i=1
m+n
X
(m + n − 1)−1 (m + n)−1 mn [i − 12 (m + n)(m + n − 1)]2 . (11.44)
i=1

The sum in Equation (11.44) is equal to (m + n) times the variance of a


Uniform{1, 2, . . . , m + n} random variable. Therefore,
2
σm,n = 1
12 (m + n − 1)−1 (m + n)−1 mn(m + n)(m + n + 1)(m + n − 1) =
1
12 mn(m + n + 1),
d
and Theorem 11.14 implies that [M − 12 n(m+n+1)][ 12
1
mn(m+n+1)]−1/2 −
→Z
as n → ∞, where Z is a N(0, 1) random variable. This verifies the results
observed in Figures 11.4–11.6. 

11.4 Pitman Asymptotic Relative Efficiency

One of the reasons that many nonparametric statistical methods have re-
mained popular in applications is that few assumptions need to be made
about the underlying population, and that this flexibility results in only a
small loss of efficiency in many cases. The use of a nonparametric method,
which is valid for a large collection of distributions, necessarily entails the
possible loss of efficiency. This can manifest itself in larger standard errors
for point estimates, wider confidence intervals, or hypothesis tests that have
lower power. This is because a parametric method, which is valid for a specific
parametric family, is able to take advantage of the structure of the problem
to produce a finely tuned statistical method. On the other hand, nonparamet-
ric methods have fewer assumptions to rely on and must be valid for a much
larger array of distributions. Therefore, these methods, cannot take advantage
of this additional structure.
A classic example of this difference can be observed by considering the problem
of estimating the location of the mode of a continuous unimodal density. If we
are able to reasonably assume that the population is Normal, then we can
estimate the location of the mode using the sample mean. On the other hand,
if the exact parametric form of the density is not known, then the problem
can become very complicated. It is worthwhile to note at this point that any
potential increase in the efficiency that may be realized by using a parametric
method is only valid if the parametric model is at least approximately true. For
example, the sample mean will only be a reasonable estimator of the location
of the mode of a density for certain parametric models. If one of these models
does not hold, then the sample mean may be a particularly unreasonable
estimator, and may even have an infinite bias, for example.
PITMAN ASYMPTOTIC RELATIVE EFFICIENCY 515
To assess the efficiency of statistical hypothesis tests, we must borrow some
of the ideas that we encountered in Section 10.4. Statistical tests are usually
compared on the basis of their power functions. That is, we would prefer to
have a test that rejects the null hypothesis more often when the alternative
hypothesis is true. It is important that when two tests are compared on the
basis of their power functions that the significance levels of the two tests be
the same. This is due to the fact that the power of any test can be arbitrarily
increased by increasing the value of the significance level. Therefore, if β1 (θ)
and β2 (θ) are the power functions of two tests of the set of hypotheses H0 :
θ ∈ Ω0 and H1 : θ ∈ Ω1 , based on the same sample size we would prefer the
test with power function β1 if β1 (θ) ≥ β2 (θ) for all θ ∈ Ω1 , where
sup β1 (θ) = sup β2 (θ).
θ∈Ω0 θ∈Ω0

This view may be too simplistic as it insists that one of the tests be uniformly
better than the other. Further, from our discussion in Section 10.4, we know
that many tests will do well when the distance between θ and the boundary of
Ω0 is large. Another complication comes from the fact that there are so many
parameters which can be varied, including the sample size, the value of θ in
the alternative hypothesis, and the distribution. We can remove the sample
size from the problem by considering asymptotic relative efficiency using a
similar concept encountered for point estimation in Section 10.2. The value of
θ in the alternative hypothesis can be eliminated if we consider a sequence of
alternative hypotheses that converge to the null hypothesis as n → ∞. This
is similar to the idea used in Section 10.4 to compute the asymptotic power
of a hypothesis test.
Definition 11.8 (Pitman). Consider two competing tests of a point null hy-
pothesis H0 : θ = θ0 where θ0 is a specified parameter value in the parameter
space Ω. Let Sn and Tn denote the test statistics for the two tests based on a
sample of size n. Let βS,n (θ) and βT,n (θ) be the power functions of the tests
based on the test statistics Sn and Tn , respectively, when the sample size equals
n.

1. Suppose that both tests have size α.


2. Let {θk }∞
k=1 be a sequence of values in Ω such that

lim θk = θ0 .
k→∞

3. Let {m(k)}∞ ∞
k=1 and {n(k)}k=1 be increasing sequences of positive integers
such that both tests have the same limiting significance level and
lim βS,m(k) (θk ) = lim βT,n(k) (θk ) ∈ (α, 1).
k→∞ k→∞

Then, the asymptotic relative efficiency of the test based on the test statistic
Sn against the test based on the test statistic Tn is given by
lim m(k)[n(k)]−1 .
k→∞
516 NONPARAMETRIC INFERENCE
This concept of relative efficiency establishes the relative sample sizes required
for the two tests to have the same asymptotic power. This type of efficiency
is based on Pitman (1948), and is often called Pitman relative asymptotic
efficiency.

For simplicity we will assume that both tests reject the null hypothesis H0 :
θ = θ0 for large values of the test statistic. The theory presented here can be
easily adapted to other types of rejection regions as well. We will also limit
our discussion to test statistics that have an asymptotic Normal distribution
under both the null and alternative hypotheses. The assumptions about the
asymptotic distributions of the test statistics used in this section are very
similar to those used in the study of asymptotic power in Section 10.4. In
particular we will assume that there exist functions µn (θ), ηn (θ), σn (θ) and
τn (θ) such that
Sm(k) − µm(k) (θk )
 
≤ t θ = θk ; Φ(t),

P
σm(k) (θk )

Tn(k) − ηn(k) (θk )


 
≤ t θ = θk ; Φ(t),

P
τn(k) (θk )

Sm(k) − µm(k) (θ0 )


 
≤ t θ = θ0 ; Φ(t),

P
σm(k) (θ0 )
and
Tn(k) − ηn(k) (θ0 )
 
≤ t θ = θ0 ; Φ(t),

P
τn(k) (θ0 )
as k → ∞. Let α ∈ (0, 1) be a fixed significance level and let {sm(k) (α)}∞k=1
and {tn(k) (α)}∞
k=1 be sequences of real numbers such that sm(k) (α) → z1−α
and tn(k) (α) → z1−α as n → ∞ where we are assuming that the null hypoth-
esis H0 : θ = θ0 is rejected when Sm(k) ≥ sm(k) (α) and Tm(k) ≥ tm(k) (α),
respectively. The tests are assumed to have a limiting significance level α. In
particular we assume that
lim P [Sm(k) ≥ sm(k) (α)|θ = θ0 ] = lim P [Tn(k) ≥ tn(k) (α)|θ = θ0 ] = α.
k→∞ k→∞

The power function for the test using the test statistic Sm(k) is given by

βS,m(k) (θ1 ) = P [Sm(k) ≥ sm(k) (α)|θ = θ1 ] =


Sm(k) − µm(k) (θ1 ) sm(k) (α) − µm(k) (θ1 )
 
P ≥ θ = θ 1 ,
σm(k) (θ1 ) σm(k) (θ1 )

with a similar form for the power function of the test using the test statistic
Tn(k) . Therefore, the property in Definition 11.8 that requires
lim βS,m(k) (θk ) = lim βT,n(k) (θk ),
k→∞ k→∞
PITMAN ASYMPTOTIC RELATIVE EFFICIENCY 517
is equivalent to
Sm(k) − µm(k) (θk ) sm(k) (α) − µm(k) (θk )
 
lim P ≥ θ = θ k =
k→∞ σm(k) (θk ) σm(k) (θk )
Tn(k) − ηn(k) (θk ) tn(k) (α) − ηn(k) (θk )
 
lim P ≥ θ = θk ,
k→∞ τn(k) (θk ) τn(k) (θk )
which can in turn be shown to require
sm(k) (α) − µm(k) (θk ) tn(k) (α) − ηn(k) (θk )
lim = lim . (11.45)
k→∞ σm(k) (θk ) k→∞ τn(k) (θk )
Similarly, for both tests to have the same limiting significance level we require
that
Sm(k) − µm(k) (θ0 ) sm(k) (α) − µm(k) (θ0 )
 
lim P ≥ θ = θ0 =
k→∞ σm(k) (θ0 ) σm(k) (θ0 )
Tn(k) − ηn(k) (θ0 ) tn(k) (α) − ηn(k) (θ0 )
 
lim P ≥ θ = θ 0 ,
k→∞ τn(k) (θ0 ) τn(k) (θ0 )

which in turn requires that


sm(k) (α) − µm(k) (θ0 ) tn(k) (α) − ηn(k) (θ0 )
lim = lim = z1−α .
k→∞ σm(k) (θ0 ) k→∞ τn(k) (θ0 )

Under this type of framework, Noether (1955) shows that the Pitman asymp-
totic relative efficiency is a function of the derivatives of µm(k) (θ) and ηn(k) (θ)
relative to σm(k) (θ) and τn(k) (θ), respectively.
Theorem 11.15 (Noether). Let Sn and Tn be test statistics based on a sample
of size n that reject a null hypothesis H0 : θ = θ0 when Sn ≥ sn (α) and
Tn ≥ tn (α), respectively. Let {θk }∞ k=1 be a sequence of real values greater
that θ0 such that θk → θ0 as k → ∞. Let {m(k)}∞ ∞
k=1 and {n(k)}k=1 be
increasing sequences of positive integers. Let {µm(k) (θ)}∞ k=1 , {η n(k) (θ)}∞
k=1 ,
{σm(k) (θ)}∞k=1 , and {τ n(k) (θ)}∞
k=1 , be sequences of real numbers that satisfy
the following assumptions:

1. For all t ∈ R,
Sm(k) − µm(k) (θk )
 
≤ t θ = θk ; Φ(t),

P
σm(k) (θk )
and
Tn(k) − ηn(k) (θk )
 
≤ t θ = θk ; Φ(t),

P
τn(k) (θk )
as n → ∞.
2. For all t ∈ R,
Sm(k) − µm(k) (θ0 )
 
≤ t θ = θ0 ; Φ(t),

P
σm(k) (θ0 )
518 NONPARAMETRIC INFERENCE
and
Tn(k) − ηn(k) (θ0 )
 
≤ t θ = θ0 ; Φ(t),

P
τn(k) (θ0 )
as n → ∞.
3.
σm(k) (θk )
lim = 1,
k→∞ σm(k) (θ0 )

and
τn(k) (θk )
lim = 1.
k→∞ τn(k) (θ0 )
4. The derivatives of the functions µm(k) (θ) and ηn(k) (θ) taken with respect to
θ exist, are continuous on an interval [θ0 − δ, θ0 + δ] for some δ > 0, are
non-zero when evaluated at θ0 , and
µ0m(k) (θk ) 0
ηn(k) (θk )
lim = lim = 1.
k→∞ µ0m(k) (θ0 ) k→∞ 0
ηn(k) (θ0 )

Define positive constants ES and ET as


ES = lim [nσn2 (θ0 )]−1/2 µ0n (θ0 )
n→∞

and
ET = lim [nτn2 (θ0 )]−1/2 ηn0 (θ0 ).
n→∞
Then the Pitman asymptotic relative efficiency of the test based on the test
statistic Sn , relative to the test based on the test statistic Tn is given by ES2 ET−2 .

Proof. The approach to proving this result is based on showing that the lim-
iting ratio of the sample size sequences m(k) and n(k), when the asymptotic
power functions are equal, is the same as the ratio ES2 ET−2 . We begin by
applying Theorem 1.13 (Taylor) to the functions µm(k) (θk ) and ηn(k) (θk ),
where we are taking advantage of Assumption 4, to find that µm(k) (θk ) =
µm(k) (θ0 ) + (θk − θ0 )µ0m(k) (θ̄k ) and ηn(k) (θk ) = ηn(k) (θ0 ) + (θk − θ0 )ηn(k)
0
(θ̃k )
where θ̄k ∈ (θ0 , θk ) and θ̃k ∈ (θ0 , θk ) for all k ∈ N. Note that even though θ̄k
and θ̃k are always in the same interval, they will generally not be equal to one
another. Now note that
sm(k) (α) − µm(k) (θk )
=
σm(k) (θk )
sm(k) (α) − µm(k) (θ0 ) + µm(k) (θ0 ) − µm(k) (θk ) σm(k) (θ0 )
.
σm(k) (θ0 ) σm(k) (θk )
Assumption 2 and Equation (11.45) imply that
sm(k) (α) − µm(k) (θ0 )
lim = z1−α .
k→∞ σm(k) (θ0 )
PITMAN ASYMPTOTIC RELATIVE EFFICIENCY 519
Combining this result with Assumption 3 implies that
sm(k) (α) − µm(k) (θk ) µm(k) (θ0 ) − µm(k) (θk )
lim = z1−α + lim . (11.46)
k→∞ σm(k) (θk ) k→∞ σm(k) (θ0 )
Performing the same calculations with the test based on the test statistic Tn(k)
yields
tn(k) (α) − ηn(k) (θk ) ηn(k) (θ0 ) − ηn(k) (θk )
lim = z1−α + lim . (11.47)
k→∞ τn(k) (θk ) k→∞ τn(k) (θ0 )
Combining these results with the requirement of Equation (11.45) yields
µm(k) (θ0 ) − µm(k) (θk ) τn(k) (θ0 )
lim · = 1.
k→∞ σm(k) (θ0 ) ηn(k) (θ0 ) − ηn(k) (θk )
Equations (11.46) and (11.47) then imply that

µm(k) (θ0 ) − µm(k) (θ̄k ) τn(k) (θ0 )


lim · =
k→∞ σm(k) (θ0 ) ηn(k) (θ0 ) − ηn(k) (θ̃k )
µ0m(k) (θ̄k )τn(k) (θ0 )
lim 0
=
k→∞ ηn(k) (θ̃k )σm(k) (θ0 )
m1/2 (k) µ0m(k) (θ̄k ) n1/2 (k)τn(k) (θ0 )
lim · =
k→∞ n1/2 (k) m1/2 (k)σm(k) (θ0 ) 0
ηn(k) (θ̃k )
 1/2
m(k)
lim ES ET−1 = 1.
k→∞ n(k)
Therefore, the Pitman asymptotic relative efficiency, which is given by
lim m(k)[n(k)]−1 ,
k→∞

has the same limit as ES2 ET−2 . Therefore, the Pitman asymptotic relative effi-
ciency is given by ES2 ET−2 .

The values ES and ET are called the efficacies of the tests based on the test
statistics Sn and Tn , respectively. Randles and Wolfe (1979) point out some
important issues when interpreting the efficacies of test statistics. When one
examines the form of the efficacy of a test statistic we see that it measures
the rate of change of the function µn , in the case of the test based on the test
statistic Sn , at the point of the null hypothesis θ0 , relative to σn at the same
point. Therefore, the efficacy is a measure of how fast the distribution of Sn
changes at points near θ0 . In particular, the efficacy given in Theorem 11.15
measures the rate of change of the location of the distribution of Sn near the
null hypothesis point θ0 . Test statistics whose distributions change a great
deal near θ0 result in tests that are more sensitive to differences between the
θ0 and the actual value of θ in the alternative hypothesis. A more sensitive test
will be more powerful, and such tests will have a larger efficacy. Therefore,
520 NONPARAMETRIC INFERENCE
if ES > ET , then the Pitman asymptotic relative efficiency is greater than
one, and the test using the test statistic Sn has a higher asymptotic power.
Similarly, if ES < ET then the relative efficiency is less than one, and the test
based on the test statistic Tn has a higher asymptotic power.
In the development of the concept of asymptotic power for individual tests
in Section 10.4, a particular sequence of alternative hypotheses of the form
θn = θ0 + O(n−1/2 ), as n → ∞ was considered. In Theorem 11.15 no explicit
form for the sequence {θk }∞k=1 is discussed, though there is an implicit form
for this sequence given in the assumptions. In particular, it follows that θk =
θ0 + O(k −1/2 ) as k → ∞, matching the asymptotic form considered in Section
10.4. See Section 5.2 of Randles and Wolfe (1979) for further details on this
result.
This section will close with some examples of computing efficacies for the t-
test, the signed rank test, and the sign test. In order to make similar compar-
isons between these tests we will begin by making some general assumptions
about the setup of the testing problem that we will consider. Let X1 , . . . , Xn
be a set of independent and identically distributed random variables from a
distribution F (x − θ) that is symmetric about θ. We will assume that F has
a density f that is also continuous, except perhaps at a countable number
of points. Let θ0 be a fixed real value and assume that we are interested in
testing H0 : θ ≤ θ0 against H1 : θ > θ0 . In the examples given below we will
not concentrate on verifying the assumptions of Theorem 11.15. For details
on verifying these assumptions see Section 5.4 of Randles and Wolfe (1979).
Example 11.17. Consider the t-test statistic Tn = n1/2 σ −1 (X̄n − θ0 ) where
the null hypothesis is rejected when Tn > t1−α;n−1 . Note that t1−α;n−1 →
z1−α as n → ∞ in accordance with the assumptions of Theorem 11.15. The
form of the test statistic implies that µT (θ0 ) = θ0 and σn (θ0 ) = n−1/2 σ so
that the efficacy of the test is given by
µ0T (θ0 )
ET = lim = σ −1 .
n→∞ n1/2 n−1/2 σ


Example 11.18. Consider the signed rank test statistic Wn given by the
sum of the ranks of |X1 − θ0 |, . . . , |Xn − θ0 | that correspond to the cases
where Xi > θ0 for i = 1, . . . , n. Without loss of generality we will consider
the case where θ0 = 0. Following the approach of Randles and Wolfe (1979)
−1
we consider the equivalent test statistic Vn = n2 Wn . Using the results of
Example 11.12, it follows that
 −1
n
Vn = [nU1,n + 21 n(n − 1)U2,n ] = 2(n − 1)−1 U1,n + U2,n ,
2
where
n
X
U1,n = n−1 δ{2Xi ; (0, ∞)},
i=1
PITMAN ASYMPTOTIC RELATIVE EFFICIENCY 521
and
n
X n
X
U2,n = 2[n(n − 1)]−1 δ{Xi + Xj ; (0, ∞)}.
i=1 j=i+1

The results from Example 11.12 suggest that


µn (θ1 ) = E[2(n − 1)−1 U1,n + U2,n |θ = θ1 ].
When θ1 = θ0 we can use the results directly from Example 11.12, but in this
case we need to find the conditional expectation for any θ1 > θ0 so we can
evaluate the derivative of µn (θ1 ). Using the shift model it follows that when
θ = θ1 the distribution is given by F (x − θ1 ) for all x ∈ R. Therefore,
Z ∞
E(U1,n |θ = θ1 ) = P (Xi > 0|θ = θ1 ) = dF (x − θ1 ) =
0
Z ∞
dF (u) = 1 − F (−θ1 ) = F (θ1 ).
−θ1

Similarly,

E(U2,n |θ = θ1 ) = P (Xi + Xj > 0|θ = θ1 ) =


Z ∞Z ∞ Z ∞Z ∞
dF (xj − θ1 )dF (xi − θ1 ) = dF (u)dF (xi − θ1 ) =
−∞ −xi −∞ −xi −θ1
Z ∞ Z ∞
[1 − F (−xi − θ1 )]dF (xi − θ1 ) = [1 − F (−u − 2θ1 )]dF (u).
−∞ −∞

Therefore,
Z ∞
−1
µn (θ1 ) = 2(n − 1) [1 − F (−θ1 )] + [1 − F (−u − 2θ1 )]dF (u). (11.48)
−∞

Assuming that we can exchange a derivative and the integral in Equation


(11.48), it follows that
Z ∞
µ0n (θ1 ) = 2(n − 1)−1 f (−θ1 ) + 2 f (−u − 2θ1 )dF (u).
−∞

Hence, when µ0n (θ1 )


is evaluated at θ0 = 0, we have that
Z ∞
0 −1
µn (θ0 ) = 2(n − 1) f (0) + 2 f (−u)dF (u) =
−∞
Z ∞
−1
2(n − 1) f (0) + 2 f 2 (u)du,
−∞

where we have used the fact that f is symmetric about zero. To find the
variance we note that we can use the result of Example 11.7, which found
2
the variance to have the form 31 n−1 n2 for the statistic Wn , and hence the
variance of Vn is σn2 (θ0 ) = 13 n−1 . Therefore, the efficacy of the signed rank
522 NONPARAMETRIC INFERENCE

Table 11.2 The efficacies and Pitman asymptotic relative efficiencies of the t-test
(T ), the signed rank test (V ), and the sign test (B) under sampling from various
populations.

Distribution ET2 EV2 2


EB EV2 ET−2 2 −2
EB ET 2 −2
EB EV
N(0, 1) 1 3π −1 2π −1 3π −1 2π −1 2
3

Uniform(− 12 , 12 ) 12 12 4 1 1
3
1
3
1 3 3 4
LaPlace(0, 1) 2 4 1 2 2 3

Logistic(0, 1) 3π −2 1
3
1
4
1 2

1 2
12 π
3
4
16 8 2 3
Triangular(−1, 1, 0) 6 3 4 9 3 4

test is given by
µ0n (θ0 ) µ0n (0)
lim = lim =
n→∞ n1/2 σn (θ0 ) n→∞ n1/2 σn (0)
Z ∞
lim 2(3)1/2 (n − 1)−1 f (0) + 2(3)1/2 f 2 (u)du =
n→∞ −∞
Z ∞
2(3)1/2 f 2 (u)du,
−∞

and hence 2
Z ∞
EV2 = 12 2
f (u)du . (11.49)
−∞
The value of the integral in Equation (11.49) has been computed for many
distributions. For example see Table B.2 of Wand and Jones (1995). 
Example 11.19. Consider the test statistic used by the sign test of Example
11.1, which has the form
n
X
B= δ{Xi − θ; (0, ∞)},
i=1

where the null hypothesis is rejected when B exceeds a specified quantile of


the Binomial( 12 , n) distribution, which has an asymptotic Normal distribu-
tion. The efficacy of this test can be shown to be EB = 2f (0), where we are
assuming, as in the previous examples, that θ0 = 0. See Exercise 9. 

One can make several interesting conclusions by analyzing the results of the
efficacy calculations from Examples 11.17–11.19. These efficacies, along with
the associated asymptotic relative effficiencies, are summarized in Table 11.2
for several distributions. We begin by considering the results for the N(0, 1)
PITMAN ASYMPTOTIC RELATIVE EFFICIENCY 523
distribution. We first note that the efficiencies relative to the t-test observed
in Table 11.2 are less than one, indicating that the t-test has a higher efficacy,
and is therefore more powerful in this case. The observed asymptotic relative
efficiency of the signed rank test is 3π −1 ' 0.955, which indicates that the
signed rank test has about 95% of the efficiency of the t-test when the pop-
ulation is N(0, 1). It is not surprising that the t-test is more efficient than
the signed rank test, since the t-test is derived under assumption that the
population is Normal, but what may be surprising is that the signed rank
test does so well. In fact, the results indicate that if a sample of size n = 100
is required by the signed rank test, then a sample of size n = 95 is required
by the t-test to obtain the same asymptotic power. Therefore, there is little
penalty for using the signed rank test even when the population is normal.
The sign test does not fair as well. The observed asymptotic relative efficiency
of the sign test is 2π −1 ' 0.637, which indicates that the sign test has about
64% of the efficiency of the t-test when the population is N(0, 1). Therefore,
if a sample of size n = 100 is required by the sign test, then a sample of size
n = 64 is required by the t-test to obtain the same asymptotic power. The
sign test also does not compare well with the signed rank test.
The Uniform(− 12 , 12 ) distribution is an interesting example because there is
little chance of outliers in samples from this distribution, but the shape of the
distribution is far from Normal. In this case the signed rank test and the t-
test perform equally well with an asymptotic relative efficiency of one. The sign
test performs poorly in this case with an asymptotic relative efficiency equal
to 31 , which indicates that the sign test has an asymptotic relative efficiency of
about 33%, or that a sample of n = 33 is required for the t-test or the signed
rank test, then a sample of 100 is required for the sign test to obtain the same
asymptotic power.
For the LaPlace(0, 1) distribution the trend begins to turn in favor of the
nonparametric tests. The observed asymptotic relative efficiency of the signed
rank test is 23 , which indicates that the signed rank test has about 150% of
the efficiency of the t-test when the population is LaPlace(0, 1). Therefore, if
the signed rank test requires a sample of size n = 100, then the t-test requires
a sample of size n = 150 to obtain the same asymptotic power. The sign test
does even better in this case with an asymptotic relative efficiency equal to 2
when compared to the t-test. In this case, if the sign test requires a sample of
size n = 100, then the t-test requires a sample of size n = 200 to obtain the
same asymptotic power. This is due to the heavy tails of the LaPlace(0, 1)
distribution. The sign test, and to a lesser extent, the signed rank test, are
robust to the presence of outliers while the t-test is not. An outlying value in
one direction may result in failing to reject a null hypothesis even when the
remainder of the data supports the alternative hypothesis.
For the Logistic(0, 1) distribution we get similar results, but not as drastic.
The Logistic(0, 1) distribution also has heavier tails than the N(0, 1) distri-
bution, but not as heavy as the LaPlace(0, 1) distribution. This is reflected
524 NONPARAMETRIC INFERENCE
in the asymptotic relative efficiencies. The signed rank test has an asymptotic
relative efficiency equal to 19 π 2 ' 1.097 which gives a small advantage to this
test over the t-test, while the sign test has an asymptotic relative efficiency
1 2
equal to 12 π ' 0.822 which gives an advantage to the the t-test.
For the case when the population follows a Triangular(−1, 1, 0) distribution
we find that the signed rank test has an asymptotic relative efficiency equal
to 98 ' 0.889, which gives a slight edge to the t-test, and the sign test has an
asymptotic relative efficiency equal to 32 ' 0.667, which implies that the t-test
is better than the sign test in this case. This is probably due to the fact that
the shape of the Triangular(−1, 1, 0) distribution is closer to the general
shape of a N(0, 1) distribution than many of the other distributions studied
here.
Pitman asymptotic relative efficiency is not the only viewpoint that has been
developed for comparing statistical hypothesis tests. For example, Hodges
and Lehmann (1970) developed the concept of deficiency, where expansion
theory similar to what was used in this section is carried out to higher order
terms. The concept of Bahadur efficiency, developed by Bahadur (1960a,
1960b, 1967) considers fixed alternative hypothesis values and power function
values and determines the rate at which the significance levels of the two tests
converge to zero. Other approaches to asymptotic relative efficiency can be
found in Cochran (1952), and Anderson and Goodman (1957).

11.5 Density Estimation

In the most common forms of statistical analysis one observes a sample X1 ,. . .,


Xn from a distribution F , and interest lies in performing statistical inference
on a characteristic of F , usually denoted by θ. Generally θ is some function of
F and is called a parameter. This type of inference is often performed using a
parametric model for F , which increases the power of the inferential methods,
assuming the model is true. A more general problem involves estimating F
without making any parametric assumptions about the form of F . That is,
we wish to compute a nonparametric estimate of F based on the sample
X1 , . . . , Xn .
In Section 3.7 we introduced the empirical distribution function as an esti-
mator of F that only relies on the assumption that the sample X1 , . . . , Xn
is a set of independent and identically distributed random variables from F .
This estimator was shown to be pointwise consistent, pointwise unbiased, and
consistent with respect to Kolmogorov distance. See Theorems 3.16 and 3.18.
Overall, the empirical distribution function can be seen as a reliable estima-
tor of a distribution function. However, in many cases, practitioners are not
interested in directly estimating F , but would rather estimate the density or
probability distribution associated with F , as the general shape of the popu-
DENSITY ESTIMATION 525
lation is much easier to visualize using the density of probability distribution
than with the distribution function.

We first briefly consider the case where F is a discrete distribution where


we would be interested in estimating the probability distribution given by
f (x) = P (X = x) = F (x) − F (x−), for all x ∈ R, where X will be assumed
to be a random variable following the distribution F . Let F̂n be the empirical
distribution function computed on X1 , . . . , Xn . Since F̂n is a step function,
and therefore corresponds to a discrete distribution, an estimator of F can be
derived by computing the probability distribution associated with F̂n . That
is, f can be estimated by
n
X
fˆn (x) = F̂n (x) − F̂n (x−) = n−1 δ{Xk ; {x}},
k=1

for all x ∈ R. Note that fˆ(x) is the observed proportion of points in the sample
that are equal to the point x. This estimate can be shown to be pointwise
unbiased and consistent. See Exercise 14.

In density estimation we assume that F has a continuous density f , and we


are interested in estimating the density f based on the sample X1 , . . . , Xn . In
this case the empirical distribution function offers little help as F̂n is a step
function corresponding to a discrete distribution and hence has no density
associated with it. This section explores the basic development and asymptotic
properties of two common estimators of f : the histogram and the kernel density
estimator. We will show how an asymptotic analysis of these estimators is
important in understanding the general behavior and application of these
estimates.

Let X1 , . . . , Xn be a set of independent and identically distributed random


variables from a distribution F that has continuous density f . For simplicity
we will assume that F is differentiable everywhere. We have already pointed
out that the empirical distribution function cannot be used directly to derive
an estimator of f , due to the fact that F̂n is a step function. Therefore, we
require an estimate of the distribution function that is not a step function,
and has a density associated with it. This estimate can then be differentiated
to estimate the underlying density.

The histogram is a density estimate that is based on using a piecewise linear


estimator for F . Let −∞ < g1 < g2 < · · · < gd < ∞ be a fixed grid of points
in R. For the moment we will not concern ourselves with how these points are
selected, but we will assume that they are selected independent of the sample
X1 , . . . , Xn and that they cover the range of the observed sample in that we
will assume that g1 < min{X1 , . . . , Xn } and gd > max{X1 , . . . , Xn }. At each
of these points in the grid, we can use the empirical distribution function to
526 NONPARAMETRIC INFERENCE

Figure 11.7 The estimator F̄n (t) computed on the example set of data indicated by
the points on the horizontal axis. The location of the grid points are indicated by the
vertical grey lines.
1.0
0.8
0.6
0.4
0.2
0.0

estimate the distribution function F at that point. That is

n
X
F̂ (gi ) = F̂n (gi ) = n−1 δ{Xk ; (−∞, gi ]},
k=1

for i = 1, . . . , d. Under our assumptions on the grid of points outlined above


we have that F̂n (g1 ) = 0 and F̂n (gd ) = 1. Given this assumption, a piecewise
linear estimate of F can be obtained by estimating F (x) at a point x ∈
[gi , gi+1 ) through linear interpolation between F̂n (gi ) and F̂n (gi+1 ). That is,
we estimate F (x) with

F̄n (x) = F̂n (gi ) + (gi+1 − gi )−1 (x − gi )[F̂n (gi+1 ) − F̂n (gi )],

when x ∈ [gi , gi+1 ]. It can be shown that F̄n is a valid distribution function
under the assumptions given above. See Exercise 15. See Figure 11.7 for an
example of the form of this estimator.

To estimate the density at x ∈ (gi , gi+1 ) we take the derivative of F̄n (x) to
DENSITY ESTIMATION 527
obtain the estimator

d
f¯n (x) = F̄n (x)
dx
d n o
= F̂n (gi ) + (gi+1 − gi )−1 (x − gi )[F̂n (gi+1 ) − F̂n (gi )]
dx
= (gi+1 − gi )−1 [F̂n (gi+1 ) − F̂n (gi )]
" n n
#
X X
−1 −1
= n (gi+1 − gi ) δ{Xi ; (−∞, gi+1 ]} − δ{Xi ; (−∞, gi ]}
k=1 k=1
n
X
= (gi+1 − gi )−1 n−1 δ{Xi , ; (gi , gi+1 ]}, (11.50)
k=1

which is the proportion of observations in the range (gi , gi+1 ], divided by the
length of the range. The estimator specified in Equation 11.50 is called a his-
togram. This estimator is also often called a density histogram, to differentiate
it from the frequency histogram which is a plot of the frequency of observations
within each of the ranges (gi , gi+1 ]. Note that a frequency histogram does not
produce a valid density, and is technically not a density estimate. Note that
f¯n will usually not exist at the grid points g1 , . . . , gd as F̄n will usually not be
differentiable at these points. In practice this makes little difference and we
can either ignore these points, or can set the estimate at these points equal to
one of the neighboring estimate values. The form of this estimate is a series of
horizontal steps within each range (gi , gi+1 ). See Figure 11.8 for an example
form of this estimator.

Now that we have specified the form of the histogram we must consider the
placement and number of grid points g1 , . . . , gd . We would like to choose these
grid points so that the histogram provides a good estimate of the underlying
density, and hence we must develop a measure of discrepancy between the
true density f and the estimate f¯n . For univariate parameters we often use
the mean squared error, which is the expected square distance between the es-
timator and the true parameter value, as a measure of discrepancy. Estimators
that are able to minimize the mean squared error are considered reasonable
estimators of the parameter.

The mean squared error does not directly generalize to the case of estimat-
ing a density, unless we consider the pointwise behavior of the density esti-
mate. That is, for the case of the histogram, we would consider the mean
squared error of f¯n (x) as a pointwise estimator of f (x) at a fixed point x ∈ R
as MSE[f¯n (x), f (x)] = E{[f¯n (x) − f (x)]2 }. To obtain an overall measure of
the performance of this estimator we can then integrate the pointwise mean
squared error over the real line. This results in the mean integrated squared
528 NONPARAMETRIC INFERENCE

Figure 11.8 The estimator f¯n (t) computed on the example set of data indicated by
the points on the horizontal axis. The location of the grid points are indicated by the
vertical grey lines. These are the same data and grid points used in Figure 11.7.
5
4
3
2
1
0

error given by
Z ∞
MISE(f¯n , f ) = MSE[f¯n (x), f (x)]dx
−∞
Z ∞
= E{[f¯n (x) − f (x)]2 }dx
−∞
Z ∞ 
= E ¯ 2
[fn (x) − f (x)] dx , (11.51)
−∞

where we have assumed in the final equality that the interchange of the integral
and the expectation is permissible. As usual, the mean squared error can be
written as the sum of the square bias and the variance of the estimator.
The same operation can be performed here to find that the mean integrated
squared error can be written as the sum of the integrated square bias and the
integrated variance. That is MISE(f¯n , f ) = ISB(f¯n , f ) + IV(f¯n ) where
Z ∞ Z ∞
ISB(f¯n , f ) = Bias2 [f¯n (x), f (x)]dx = {E[f¯n (x)] − f (x)}2 dx,
−∞ −∞

and Z ∞
IV(f¯n ) = E[{(f¯n (x) − E[f¯n (x)]}2 ]dx.
−∞
DENSITY ESTIMATION 529
See Exercise 17. Using the mean integrated squared error as a measure of
discrepancy between our density estimate and the true density, we will use an
asymptotic analysis to specify how the grid points should be chosen. We will
begin by making a few simplifying assumptions. First, we will assume that the
grid spacing is even over the range of the distribution. That is, gi+1 − g1 = h,
for all i = 1, . . . , d where h > 0 is a value called the bin width. We will
not concern ourselves with the placement of the grid points. We will only
focus on choosing h that will minimize an asymptotic expression for the mean
integrated squared error of the histogram estimator.
We will assume that f has a certain amount of smoothness. For the moment
we can assume that f is continuous, but for later calculations we will have to
assume that f 0 is continuous. In either case it should be clear that f cannot
be a step function, which is the form of the histogram. Therefore, if we were
to expect the histogram to provide any reasonable estimate asymptotically it
is apparent that the bin width h must change with n. In fact, the bins must
get smaller as n gets larger in order for the histogram to become a smooth
function as n gets large. Therefore, we will assume that
lim h = 0.
n→∞

This is, in fact, a necessary condition for the histogram estimator to be con-
sistent. On the other hand, we must be careful that the bin width does not
converge to zero at too fast a rate. If h becomes too small too fast then there
will not be enough data within each of the bins to provide a consistent es-
timate of the true density in that region. Therefore, we will further assume
that
lim nh = ∞.
n→∞
For further information on these assumptions, see Scott (1992) and Section
2.1 of Simonoff (1996).
We will begin by considering the integrated bias of the histogram estimator. To
find the bias we begin by assuming that x ∈ (gi , gi+1 ] for some i ∈ {1, . . . , d−1}
where h = gi+1 − gi and note that
" n
#
X
¯
E[fn (x)] = E (gi+1 − gi ) n −1 −1
δ{Xi , ; (gi , gi+1 ]}
k=1
n
X
= (nh)−1 E(δ{Xi , ; (gi , gi+1 ]})
k=1
= h−1 E(δ{X1 , ; (gi , gi+1 ]})
Z gi+1
−1
= h dF (t).
gi

Theorem 1.15 implies that f (t) = f (x) + f 0 (x)(t − x) + 21 f 00 (c)(t − x)2 as


|t−x| → 0 where c is some value in the interval (gi , gi+1 ] and we have assumed
530 NONPARAMETRIC INFERENCE
that f has two bounded and continuous derivatives. This implies that
Z gi+1 Z gi+1 Z gi+1 Z gi+1
f (t)dt = f (x)dt + f 0 (x)(t − x)dt + 1 00 2
2 f (c)(t − x) dt.
gi gi gi gi
(11.52)
Now Z gi+1 Z gi+1
f (x)dt = f (x) dt = (gi+1 − gi )f (x) = hf (x),
gi gi
and
Z gi+1 Z gi+1
f 0 (x)(t − x)dt = f 0 (x) (t − x)dt
gi gi
1 0 2 2
= 2 f (x)[gi+1 − gi − 2(gi+1 − gi )x]
1 0
= 2 f (x)[(gi+1 − gi )(gi+1 + gi ) − 2hx]
1 0
= 2 hf (x)(gi+1 + gi − 2x)
1 0
= 2 hf (x)[h − 2(x − gi )].

For the final term in Equation (11.52), we will let


ξ= sup f 00 (c),
c∈(gi ,gi+1 ]

where we assume that ξ < ∞. Then, assuming that |t − x| < h we have that
−3 gi+1 1 00 1 −3 gi+1
Z Z
2 2

h
2 f (c)(t − x) dt ≤ ξh
2 (t − x) dt ≤
gi g1
Z gi+1
1 −3
h dt = 12 h−1 ξ(gi+1 − gi ) = 21 |ξ| < ∞,
2

2h ξ
gi

for all n. Therefore, it follows that


Z gi+1
1 00 2 3
2 f (c)(t − x) dt = O(h ),
gi

as n → ∞. Thus, for x ∈ (gi , gi+1 ] we have that


Z gi+1
E[f¯n (x)] = h−1 f (t)dt
gi
−1
= h {hf (x) + 21 hf 0 (x)[h − 2(x − gi )] + O(h3 )}
= f (x) + 21 f 0 (x)[h − 2(x − gi )] + O(h2 ),
as n → ∞. Therefore, the pointwise bias is
Bias[f¯n (x)] = 21 f 0 (x)[h − 2(x − gi )] + O(h2 ),
as h → 0, or equivalently as n → ∞. It then follows that the square bias is
given by
Bias2 [f¯n (x)] = 41 [f 0 (x)]2 [h − 2(x − gi )]2 + O(h3 ).
See Exercise 18. The pointwise variance is developed using similar methods.
DENSITY ESTIMATION 531
That is, for x ∈ (gi , gi+1 ], we have that
n
X
V [f¯n (x)] = (nh)−2 V (δ{Xk ; (gi , gi+1 ]})
k=1
−1 −2
= n hV (δ{X1 ; (gi , gi+1 ]})
Z gi+1  Z gi+1 
−1 −2
= n h dF (x) 1 − dF (x) , (11.53)
gi gi

where we have used the fact that δ{Xk , ; (gi , gi+1 ]} is a Bernoulli random
variable. To simplify this expression we begin by finding an asymptotic form
for the integrals in Equation (11.53). To this end, we apply Theorem 1.15 to
the density to find for x ∈ (gi , gi+1 ], we have that
Z gi+1 Z gi+1
dF (t) = f (x) + f 0 (x)(t − x) + 21 f 00 (c)(t − x)2 dt. (11.54)
gi gi

Using methods similar to those used above, it can be shown that


Z gi+1
f 0 (x)(t − x)dt = O(h2 ),
gi

as n → ∞. See Exercise 19. We have previously shown that the last integral
in Equation (11.54) is O(h3 ) as n → ∞, from which it follows from Theorem
1.19 that
Z gi+1
dF (t) = hf (x) + O(h2 ),
gi

as n → ∞. Therefore, it follows from Theorems 1.18 and 1.19 that

V [f¯n (x)] = n−1 h−2 [hf (x) + O(h2 )][1 − hf (x) + O(h2 )]
= n−1 [f (x) + O(h)][h−1 − f (x) + O(h)]
= n−1 [h−1 f (x) + O(1)]
= (nh)−1 f (x) + O(n−1 ),

as n → ∞. Combining the expressions for the pointwise bias and variance


yields the pointwise mean squared error, which is given by

MSE[f¯n (x)] = V [f¯n (x)] + Bias2 [f¯n (x)] =


(nh)−1 f (x) + 41 [f 0 (x)]2 [h − 2(x − gi )]2 + O(n−1 ) + O(h3 ).

To obtain the mean integrated squared error, we integrate the pointwise mean
squared error separately over each of the grid intervals. That is,
Z ∞ d Z
X gk+1
MISE[f¯n (x)] = MSE[f¯n (x)]dx = MSE[f¯n (x)]dx.
−∞ k=1 gk
532 NONPARAMETRIC INFERENCE
For the grid interval (gk , gk+1 ] we have that
Z gk+1 Z gk+1
MSE[f¯n (x)]dx = (nh)−1 f (x)dx+
gk gk
Z gk+1
1
4 [f 0 (x)]2 [h − 2(x − gk )]2 dx + O(n−1 ) + O(h3 ).
gk

Now, Z gk+1 Z gk+1


(nh)−1 f (x)dx = (nh)−1 dF (x),
gk gk
and
Z gk+1 Z gk+1
0 2 2 1 0 2
1
4 [f (x)] [h − 2(x − gk )] dx = 4 [f (ηk )] [h − 2(x − gk )]2 dx,
gk gk

for some ηk ∈ (gk , gk+1 ], using Theorem A.5. Integrating the polynomial
within the integral yields
Z gk+1
[h − 2(x − gk )]2 dx = h3 − 2h3 + 34 h3 = 13 h3 ,
gk

so that Z gk+1
1
4 [f 0 (x)]2 [h − 2(x − gk )]2 dx = 1 3 0 2
12 h [f (ηk )] .
gk
Taking the sum over all of the grid intervals gives the total mean integrated
squared error,
Xd Z gk+1 X d
MISE(f¯, f ) = (nh)−1 dF (x) + 1 3 0
12 h [f (ηk )]
2

k=1 gk k=1
−1 3
+O(n ) + O(h )
Z ∞ d
X
= (nh)−1 dF (x) + 1 2
12 h h[f 0 (ηk )]2 + O(n−1 ) + O(h3 )
−∞ k=1
d
X
= (nh)−1 + 1 2
12 h h[f 0 (ηk )]2 + O(n−1 ) + O(h3 ).
k=1

To simplify this expression we utilize the definition of the Riemann integral


to obtain
Z ∞ d
X
0 2
[f (t)] dt = (h[f 0 (ηk )]2 + εk ),
−∞ k=1
where εk ≤ h[f (gk+1 ) − f 0 (gk )] = O(h) as long as f has bounded variation
0

on the interval (gk , gk+1 ]. Therefore, it follows that


Z ∞
¯
MISE(fn , f ) = (nh) + 12 h−1 1 2
[f 0 (t)]2 dt + O(h3 ) + O(n−1 )
−∞
= (nh)−1 + 1 2
12 h R(f 0
) + O(h3 ) + O(n−1 ),
DENSITY ESTIMATION 533

Figure 11.9 This figure demonstrates how a histogram with a smaller bin width is
better able to follow the curvature of an underlying density, resulting in a smaller
asymptotic bias.

where
Z ∞
0
R(f ) = [f 0 (t)]2 dt.
−∞

Note that the mean integrated squared error of the histogram contains the
classic tradeoff between bias and variance seen with so many estimators. That
is, if the bin width is chosen to be small, the bias will be small since there
will be many small bins that are able to capture the curvature of f , as shown
by the 12 h R(f 0 ) term. But in this case the variance, as shown by the (nh)−1
1 2

term, will be large, due to the fact that there will be fewer observations per
bin. When the bin width is chosen to be large, the bias becomes large as the
curvature of f will be not be able to be modeled as well by the wide steps
in the histogram, while the variance will be small due to the large number of
observations per bin. See Figures 11.9 and 11.10.

To find the bin width that minimizes this tradeoff we first truncate the ex-
pansion for the mean integrated squared error to obtain the asymptotic mean
integrated squared error given by

AMISE(f¯n , f ) = (nh)−1 + 1 2 0
12 h R(f ). (11.55)

Differentiating AMISE(f¯n , f ) with respect to h, setting the result equal to


534 NONPARAMETRIC INFERENCE

Figure 11.10 This figure demonstrates how a histogram with a large bin width is less
able to follow the curvature of an underlying density, resulting in a larger asymptotic
bias.

zero, and solving for h gives the asymptotically optimal bin width given by
 1/3
−1/3 6
hopt = n .
R(f 0 )
See Exercise 20. The resulting asymptotic mean squared error when using the
optimal bandwidth is therefore AMISEopt (f¯n , f ) = n−2/3 [ 16
9
R(f 0 )]1/3 .
Note the dependence of the optimal bandwidth on the integrated square
derivative of the underlying density. This dependence has two major implica-
tions. First, it is clear that densities that have smaller derivatives over their
range will require a larger bandwidth, and will result in a smaller asymptotic
mean integrated square error. This is due to the fact that these densities are
more flat and will be easier to estimate with large bin widths in the step
function in the histogram. When the derivative is large over the range of the
distribution, the optimal bin width is smaller and the resulting asymptotic
mean integrated squared error is larger. Such densities are more difficult to
estimate using the histogram. The second implication is that the asymptoti-
cally optimal bandwidth depends on the form of the underlying density. This
means that we must estimate the bandwidth from the observed data, which
requires an estimate of the integral of the squared derivative of the density.
Wand (1997) suggests using a kernel estimator to achieve this goal, and argues
that most of the usual informal rules do not choose the bin width to be small
DENSITY ESTIMATION 535
enough. Kernel estimators can also be used to estimate the density itself, and
is the second type of density estimator we discuss in this section.
One problem with the histogram is that it is always a step function and there-
fore does not usually reflect our notion of a continuous and smooth density.
As such, there have been numerous techniques developed to provide a smooth
density estimate. The estimator we will consider in this section is known as a
kernel density estimator, which appears to have been first studied by Fix and
Hodges (1951). See Fix and Hodges (1989) for a reprint of this paper. The
first asymptotic analysis of this method, which follows along the lines of the
developments in this chapter, where studied by Parzen (1962) and Rosenblatt
(1956).
To motivate the kernel density estimate, return once again to the problem of
estimating the distribution function F . It is clear that if we wish to have a
smooth density estimate based on an estimate of F , we must find an estimator
for F that itself is smooth. Indeed, we require more than just continuity in
this case. As we saw with the histogram, we specified a continuous estimator
for F that yielded a step function for the density estimate. Therefore, it seems
if we are to improve on this idea we should not only require an estimator for F
that is continuous, but it should also be differentiable everywhere. To motivate
an approach to finding such an estimate we write the empirical distribution
function as
Xn
F̂n (x) = n−1 δ{Xi ; (−∞, x]}
k=1
Xn
= n−1 δ{x − Xi ; [0, ∞)}
k=1
Xn
= n−1 K(x − Xi ), (11.56)
k=1

where K(t) = δ{t; [0, ∞)} for all t ∈ R. Note that K in this case can be taken
to be a distribution function for a degenerate random that concentrates all of
its mass at zero, and is therefore a step function with a single step of size one
at zero. The key idea behind developing the kernel density estimator is to note
that if we replace the function K in Equation (11.56) with any other valid
distribution function that is centered around zero, the estimator itself remains
a distribution function. See Exercise 21. That is, let K be any non-decreasing
right continuous function such that
lim K(t) = 1,
t→∞

lim K(t) = 0,
t→−∞
and Z ∞
tdK(t) = 0.
−∞
536 NONPARAMETRIC INFERENCE
Now define the kernel estimator of the distribution function F to be
n
X
F̃n (x) = n−1 K(x − Xi ). (11.57)
k=1

The problem with the proposed estimator in Equation (11.57) is that the
properties of the estimator are a function of the variance of the distribution
K. The control this property we introduce a scale parameter h to the function
K. That is, define the kernel estimator of the distribution function F as
n  
−1
X x − Xi
F̃n,h (x) = n K . (11.58)
h
k=1

This scale parameter is usually called a bandwidth. Now, if we further assume


that K is smooth enough, the kernel estimator given in Equation (11.58) will
be differentiable everywhere as we can obtain a continuous estimate of the
density f by differentiating the kernel estimator of F to obtain
n  
˜ −1
X x − Xi
fn,h (x) = (nh) k ,
h
k=1

the kernel density estimator of f with bandwidth h, where k(t) = K 0 (t). In


this case we again have two choices to make about the estimator. We need
to decide what kernel function, specified by k, should be used, and we also
need to decide what bandwidth should be used. As with the histogram, we
shall discuss these issues in terms of the mean integrated squared error of the
kernel estimate in terms of h and k.

The question of what function should be used for the kernel function k is a
rather complicated issue that we will not address in depth. For finite sam-
ples the choice obviously makes a difference, but a theoretical quantification
of these differences is rather complicated, so researchers have turned to the
question of what effect does the choice of the kernel function have asymp-
totically as n → ∞? It turns out that for large samples the form of k does
not affect the rate at which the optimal asymptotic mean squared error of the
kernel density estimator approaches zero. Therefore, from an asymptotic view-
point the choice matters little. See Section 3.1.2 of Simonoff (1996). Hence, for
the remainder of this section we shall make the following generic assumptions
about the form of the kernel function k. We will assume that k is a symmet-
ric continuous density with zero mean finite variance. Given this assumption,
we will now show how asymptotic calculations can be used to determine the
asymptotically optimal bandwidth.

As with the histogram, we will use the mean integrated squared error given
in Equation (11.51) as a measure of the performance of the kernel density
estimator. We will assume that f is a smooth density, namely that f 00 is
continuous and square integrable. As with the histogram, we shall assume
DENSITY ESTIMATION 537
that the bandwidth has the properties that
lim h = 0,
n→∞

and
lim nh = ∞.
n→∞
We will finally assume that k is a bounded density that is symmetric and has a
finite fourth moment. To simplify notation, we will assume that X is a generic
random variable with distribution F . We begin by obtaining an expression for
the integrated bias. The expected value of the kernel density estimator at a
point x ∈ R is given by
" n  #
˜ −1
X x − Xk
E[fn,h (x)] = E (nh) k
h
k=1
n   
X x − Xk
= (nh)−1 E k
h
k=1
  
x−X
= h−1 E k
h
Z ∞  
x −t
= h−1 k dF (t).
−∞ h

Now consider the change of variable v = h−1 (t − x) so that t = x + vh and


dt = hdv to obtain
Z ∞ Z ∞
E[f˜n,h (x)] = k(−v)f (x + vh)dv = k(v)f (x + vh)dv,
−∞ −∞

where the second inequality follows because we have assumed that k is sym-
metric about the origin. Now, apply Theorem 1.15 (Taylor) to f (x + vh) to
find
f (x + vh) = f (x) + vhf 0 (x) + 21 (vh)2 f 00 (x) + 61 (vh)3 f 000 (x) + O(h4 ),
as h → 0. Therefore, assuming that the integral of the remainder term remains
O(h4 ) as h → 0, it follows that
Z ∞
E[f˜n,h (x)] = k(v)f (x + vh)dv =
−∞
Z ∞ Z ∞ Z ∞
f (x)k(v)dv + vhf 0 (x)k(v)dv + 1 2 00
2 (vh) f (x)k(v)dv+
−∞ −∞ −∞
Z ∞
1 3 000 4
6 (vh) f (x)k(v)dv + O(h ),
−∞

as h → 0. Using the fact that k is a symmetric density about the origin, we


have that Z ∞ Z ∞
f (x)k(v)dv = f (x) k(v)dv = f (x),
−∞ −∞
538 NONPARAMETRIC INFERENCE
Z ∞ Z ∞
0 0
vhf (x)k(v)dv = hf (x) vk(v)dv = 0,
−∞ −∞
Z ∞ Z ∞
2 00
1
2 (vh) f (x)k(v)dv = 12 h2 f 00 (x) v 2 k(v)dv = 21 h2 f 00 (x)σk2 ,
−∞ −∞
and Z ∞ Z ∞
3 000
1
6 (vh) f (x)k(v)dv = 61 h3 f 000 (x) v 3 k(v)dv = 0,
−∞ −∞
where σk2 is the variance of the kernel function k. Therefore,
E[f˜n,h (x)] = f (x) + 21 h2 f 00 (x)σk2 + O(h4 ),
and the pointwise bias of the kernel density estimator with bandwidth h is
Bias[f˜n,h (x)] = 12 h2 f 00 (x)σk2 + O(h4 ).
To compute the mean integrated squared error we require the squared bias.
It can be shown that
Bias2 [f˜n,h (x)] = [ 12 h2 f 00 (x)σk2 + O(h4 )]2 = 41 h4 [f 00 (x)]2 σk4 + O(h6 ).
See Exercise 22. To find the pointwise variance of the kernel density estimator
we note that
n      
X x − Xi x−X
V [f˜n,h (x)] = (nh)−2 V k = n−1 h−2 V k .
h h
k=1

The variance part of this term can be written as


   Z ∞     
x−X 2 x−t 2 x−X
V k = k dF (t) − E k =
h −∞ h h
Z ∞   
2 2 x−X
h k (v)f (x + vh)dv − E k .
−∞ h
Theorem 1.15 implies that f (x + vh) = f (x) + O(h) as h → 0 so that
Z ∞ Z ∞
h k 2 (v)f (x + vh)dv = h k 2 (v)[f (x) + O(h)]dv
−∞ −∞
Z ∞
= hf (x) k 2 (v)dv + O(h2 )
−∞
= hf (x)R(k) + O(h2 ).
From previous calculations, we know that
  
x−X
E k = hf (x) + O(h4 ),
h
so that   
x−X
E2 k = h2 f 2 (x) + O(h4 ).
h
DENSITY ESTIMATION 539
Thus,
V [f˜n,h (x)] = n−1 h−2 [hf (x)R(k) + O(h2 ) − h2 f (x) + O(h4 )]
= (nh)−1 f (x)R(k) + n−1 O(1) − n−1 f (x) + O(n−1 h2 )
= (nh)−1 f (x)R(k) + O(n−1 ).
Therefore, the pointwise mean squared error of the kernel estimator with
bandwidth h is given by
MSE[f˜n,h (x)] = (nh)−1 f (x)R(k) + 14 h4 [f 00 (x)]2 σk4 + O(h6 ) + O(n−1 ),
as h → 0 and as n → ∞. Integrating we find that the mean integrated squared
error is given by
Z ∞ Z ∞
MISE(f˜n,h , f ) = (nh)−1 R(k) f (x)dx + 14 h4 σk4 [f 00 (x)]2 dx
−∞ −∞
6 −1
+O(h ) + O(n )
= (nh)−1 R(k) + 41 h4 σk4 R(f 00 ) + O(h6 ) + O(n−1 ),
as h → 0 which occurs under our assumptions when n → ∞. As with the
case of the histogram, we truncate the error term to get the asymptotic mean
integrated squared error, given by
AMISE(f˜n,h , f ) = (nh)−1 R(k) + 14 h4 σk4 R(f 00 ).
Minimizing the asymptotic mean integrated square error with respect to the
bandwidth h yields the asymptotically optimal bandwidth given by
 1/5
−1/5 R(k)
hopt = n .
σk4 R(f 00 )
See Exercise 23. If this bandwidth is known, the asymptotically optimal asymp-
totic mean integrated squared error using the kernel density estimate with
bandwidth hopt is given by
1/5 4/5
σk4 R(f 00 )
 
R(k)
AMISEopt (f˜n,hopt , f ) = n−4/5 R(k) + 14 n−4/5
R(k) σk4 R(f 00 )
= 54 [σk R(k)]4/5 [R(f 00 )]1/5 n−4/5 . (11.59)
Equation (11.59) provides a significant amount of information about the kernel
density estimation method. First we note that the asymptotic optimal mean
integrated squared error of the kernel density estimate converges to zero at
a faster rate than that of the histogram. This indicates, at least for large
samples, that the kernel density estimate should provide a better estimate of
the underlying density than the histogram from the viewpoint of the mean
integrated squared error. We also note that R(f 00 ) plays a prominent role in
both the size of the asymptotically optimal bandwidth, and the asymptotic
mean integrated squared error. Recall that R(f 00 ) is a measure of the smooth-
ness of the density f . Therefore, we see that if R(f 00 ) is large, then we have
540 NONPARAMETRIC INFERENCE
a density that is not very smooth. Such a density requires a small bandwidth
and at the same time is difficult to estimate as the asymptotically optimal
mean integrated squared error becomes larger. On the other hand, if R(f 00 )
is small, a larger bandwidth is required and the density is easier to estimate.
It is important to note that the value of R(f 00 ) does not depend on n, and
therefore cannot affect the convergence rate of the asymptotic mean squared
error.
Aside from the term R(f 00 ) in Equation (11.59), we can also observe that the
size of the asymptotic mean integrated squared error is controlled by σk R(k),
which is completely dependent upon the kernel function and is therefore com-
pletely within control of the user. That is, we are free to choose the kernel
function that will minimize the asymptotic mean integrated squared error.
It can be shown that the optimal form of the kernel function is given by
k(t) = 34 (1 − t)2 δ{t; [−1, 1]}, which is usually called the Epanechnikov ker-
nel, and for which σk R(k) = 3/(53/2 ). See Exercise 24. See Bartlett (1963),
Epanechnikov (1969) and Hodges and Lehmann (1956). The relative efficiency
of other kernel functions can be therefore computed as
3
Relative Efficiency = .
σk R(k)53/2
For example, it can be shown that the Normal kernel, taken with σk = 1 has
a relative efficiency approximately equal to 0.9512. In fact, it can be shown
that most of the standard kernel functions used in practice have efficiencies
of at least 0.90. See Exercise 25. This, coupled with the fact that the choice
of kernel does not have an effect on the rate at which the asymptotic mean
integrated square error converges to zero, indicates that the choice of kernel
is not as asymptotically important as the choice of bandwidth.
The fact that the optimal bandwidth depends on the unknown density through
R(f 00 ) is perhaps not surprising, but it does lend an additional level of diffi-
culty to the prospect of using kernel density estimators in practical situations.
There have been many advances in this area; one of the promising early ap-
proaches is based on cross validation which estimates the risk of the kernel
density estimator using a leave-one-out calculation similar to jackknife type
estimators. This is the approach proposed by Rudemo (1982) and Bowman
(1984). Consistency results for the cross-validation method for estimating the
optimal bandwidth can be found in Burman (1985), Hall (1983b) and Stone
(1984). Another promising approach is based on attempting to estimate R(f 00 )
directly based on the data. This estimate is usually based on a kernel estimator
itself, which also requires a bandwidth that depends on estimating functionals
of even higher order derivatives of f . This process is usually carried on for a
few iterations with the bandwidth required to estimate the functional of the
highest order derivative being computed for a parametric distribution, usually
normal. This approach, known as the plug-in approach was first suggested by
Woodroofe (1970) and Nadaraya (1974). Since then, a great deal of research
has been done with these methods, usually making improvements on the rate
THE BOOTSTRAP 541
at which the estimated bandwidth approaches the optimal one. The research
is too vast to summarize here, but a good presentation of these methods can
be found in Simonoff (1996) and Wand and Jones (1995).

11.6 The Bootstrap

In this section we will consider a very general nonparametric methodology


that was first proposed by Efron (1979). This general approach, known as
the bootstrap, provides a single tool that can be applied to many different
problems including the construction of confidence sets and statistical tests,
the reduction of estimator bias, and the computation of standard errors. The
popularity of the bootstrap arises from its applicability across many differ-
ent types of statistical problems. The same essential bootstrap technique can
be applied to problems in univariate and multivariate inference, regression
problems, linear models and time series. The bootstrap differs from the non-
parametric techniques like those based on rank statistics in that bootstrap
methods are usually approximate. This means that theory developed for the
bootstrap is usually asymptotic in nature, though some finite sample results
are available. For example, see Fisher and Hall (1991) and Polansky (1999).
The second major difference between the bootstrap and nonparametric tech-
niques based on ranks is that in practice the bootstrap is necessarily computa-
tionally intensive. While some classical nonparametric techniques also require
a large number of computations, the bootstrap algorithm differs in that the
computations are usually based on simulations. This section introduces the
bootstrap methodology and presents some general asymptotic properties such
as the consistency of bootstrap estimates and the correctness and accuracy of
bootstrap confidence intervals.
To introduce the bootstrap methodology formally, let {Xn }∞ n=1 be a sequence
of independent and identically distributed random variables following a dis-
tribution F . Let θ = T (F ) be a functional parameter with parameter space
Ω, and let θ̂n = T (F̂n ) be a point estimator for θ where F̂n is the empirical
distribution function defined in Definition 3.5. Inference on the parameter θ
typically requires knowledge of the sampling distribution of θ̂n , or some func-
tion of θ̂n . Let Rn (θ̂n , θ) be the function of interest. For example, when we wish
to construct a confidence interval we might take Rn (θ̂n , θ) = n1/2 σ −1 (θ̂n − θ)
where σ 2 is the asymptotic variance of n1/2 θ̂n . The distribution function of
Rn (θ̂n , θ) is defined to be
Hn (t) = P [Rn (θ̂n , θ) ≤ t|X1 , . . . , Xn ∼ F ]
where the notation X1 , . . . , Xn ∼ F is used to represent the situation where
X1 , . . . , Xn is a set of independent and identically distributed random vari-
ables from F . In some cases Hn (t) is known; for example when F is known or
when Rn (θ̂n , θ) is distribution free. In other cases Hn (t) will be unknown and
542 NONPARAMETRIC INFERENCE
must either be approximated using asymptotic arguments or must be esti-
mated using the observed data. For example, when F is unknown and θ is the
mean, the distribution of Rn (θ̂n , θ) = n1/2 σ −1 (θ̂n − θ) can be approximated
by a normal distribution. In the cases where such an approximation is not
readily available, or where such an approximation is not accurate enough, one
can estimate Hn (t) using the bootstrap.
The bootstrap estimate of the sampling distribution Hn (t) is obtained by
estimating the unknown distribution F with an estimate F̂ computed using the
observed random variables X1 , . . . , Xn . The distribution of Rn (θ̂n , θ) is then
found conditional on X1 , . . . , Xn , under the assumption that the distribution
is F̂ instead of F . That is, the bootstrap estimate is given by
Ĥn (t) = P ∗ [Rn (θ̂n∗ , θ̂n ) ≤ t|X1∗ , . . . , Xn∗ ∼ F̂ ], (11.60)
where P ∗ (A) = P ∗ (A|X1 , . . . , Xn ) is the probability measure induced by F̂
conditional on X1 , . . . , Xn , θ̂n = T (F̂n ), θ̂n∗ = T (F̂n∗ ), and F̂n∗ is the em-
pirical distribution function of X1∗ , . . . , Xn∗ . The usual reliance of computing
bootstrap estimates using computer simulations is based on the fact that the
conditional distribution given in Equation (11.60) is generally difficult to com-
pute in a closed form in most practical problems. An exception is outlined in
Section 3 of Efron (1979).
In cases where the bootstrap estimate Ĥn (t) cannot be found analytically
one could use a constructive method based on considering all possible equally
likely samples from F̂n . Computing Rn (θ̂n , θ) for each of these possible sam-
ples and combining the appropriate probabilities would then provide an exact
tabulation of the distribution Ĥn (t). This process can become computationally
prohibitive even when  the sample size is moderate, as the number of possi-
ble samples is 2n−1 n , which equals 92, 378 when n = 10. See Fisher and Hall
(1991). Alternately one can simulate the process that produced the data using
F̂n in place of the unknown distribution F . Using this algorithm, one simulates
b sets of n independent and identically distributed random variables from F̂n ,
conditional on the observed random variables X1 , . . . , Xn . For each set of n
random variables the function Rn (θ̂n , θ) is computed. Denote these values as
R1∗ , . . . , Rb∗ . Note that these observed values are independent and identically
distributed random variables from the bootstrap distribution estimate Ĥn (t),
conditional on X1 , . . . , Xn . The distribution Ĥn (t) can then be approximated
with the empirical distribution function computed on R1∗ , . . . , Rb∗ . That is,
b
X
Ĥn (t) ' H̃n,b (t) = b−1 δ{Ri∗ ; (−∞, t]}.
i=1

Note that the usual asymptotic properties related to the empirical distribution
hold, conditional on X1 , . . . , Xn . For example, Theorem 3.18 (Glivenko and
a.c.
Cantelli) implies that kĤn − H̃n,b k∞ −−→ 0 as b → ∞, where the convergence
is relative only to the sampling from F̂n , conditional on X1 , . . . , Xn .
THE BOOTSTRAP 543
The relative complexity of the bootstrap algorithm, coupled along with the
fact that the bootstrap is generally considered to be most useful in the non-
parametric framework where the exact form of the population distribution is
unknown, means that the theoretical justification for the bootstrap has been
typically based on computer-based empirical studies and asymptotic theory. A
detailed study of both of these types of properties can be found in Mammen
(1992). This section will focus on the consistency of several common boot-
strap estimates and the asymptotic properties of several types of bootstrap
confidence intervals. We begin by focusing on the consistency of the bootstrap
estimate of the distribution Hn (t). There are two ways in which we can view
consistency in this case. In the first case we can concern ourselves with the
pointwise consistency of Ĥn (t) as an estimator of Hn (t). That is, we can con-
p
clude that Ĥn (t) is a pointwise consistent estimator of Hn (t) if Hn (t) −
→ Hn (t)
as n → ∞ for all t ∈ R. Alternatively, we can define Ĥn to be a consistent
estimator of Hn if some metric between Ĥn and Hn converges in probability
to zero as n → ∞. That is, let d be a metric on F, the space of all distribution
functions. The we will conclude that Ĥn is a consistent estimator of Hn if
p
d(Ĥn , Hn ) −
→ 0 as n → ∞. Most research on the consistency of the bootstrap
uses this definition. Both of the concepts above are based on convergence in
probability, and in this context these concepts are often referred to as weak
consistency. In the case where convergence in probability is replaced by almost
certain convergence, the concepts above are referred to as strong consistency.
Because there is more than one metric used on the space F, the consistency
of Ĥn as an estimator of Hn is often qualified by the metric that is being
p
used. For example, if d(Ĥn , Hn ) −→ 0 as n → ∞ then Ĥn is called a strongly
d-consistent estimator of Hn . In this section we will use the supremum metric
d∞ that is based on the inner product defined in Theorem 3.17.
The smooth function model, introduced in Section 7.4, was shown to be a
flexible model that contains many of the common types of smooth estimators
encountered in practice. The framework of the smooth function model affords
us sufficient structure to obtain the strong consistency of the bootstrap esti-
mate of Hn (t).
Theorem 11.16. Let X1 , . . . , Xn be a set of independent and identically dis-
tributed d-dimensional random vectors from a distribution F with mean vector
µ where E(kXn k2 ) < ∞. Let θ = g(µ) where g : Rd → R is a continuously
differentiable function at µ such that


g(x) 6= 0,
∂xi x=µ

for i = 1, . . . , d where x0 = (x1 , . . . , xn ). Define Hn (t) = P [n1/2 (θ̂n − θ) ≤ t]


a.c.
where θ̂n = g(X̄n ). Then d∞ (Ĥn , Hn ) −−→ 0 as n → ∞ where Ĥn (t) =
P ∗ [n1/2 (θ̂n∗ − θ̂n ) ≤ t].

For a proof of Theorem 11.16, see Section 3.2.1 of Shao and Tu (1995). The
544 NONPARAMETRIC INFERENCE
necessity of the condition that E(||Xn ||2 ) < ∞ has been the subject of con-
siderable research; Babu (1984), Athreya (1987), and Knight (1989) have all
supplied examples where the violation of this condition results in an incon-
sistent bootstrap estimate. In the special case where d = 1 and g(x) = x,
the condition has been shown to be necessary and sufficient by Giné and Zinn
(1989) and Hall (1990). The smoothness of the function g is also an important
aspect of the consistency of the bootstrap. For functions that are not smooth
functions of mean vectors there are numerous examples where the bootstrap
estimate of the sampling distribution is not consistent. The following example
can be found in Efron and Tibshirani (1993).
Example 11.20. Let X1 , . . . , Xn be a set of independent and identically
distributed random variables from a Uniform(0, θ) distribution where θ ∈
Ω = (0, ∞), and let θ̂n = X(n) , the maximum value in the sample. Suppose
that X1∗ , . . . , Xn∗ is a set of independent and identically distributed random
variables from the empirical distribution of X1 , . . . , Xn . Then P ∗ (θ̂n∗ = θ̂n ) =
P ∗ (X(n)∗
= X(n) ). Note that X(n)∗
will equal X(n) any time that X(n) occurs in
the sample at least once. Recalling that, conditional on the observed sample
X1 , . . . , Xn , the empirical distribution places a mass of n−1 on each of the

values in the sample, it follows that X(n) 6= X(n) with probability (1 − n−1 )n .
Therefore, the bootstrap estimates the probability P (θ̂n = θ) with P ∗ (θ̂n∗ =
θ̂n ) = 1 − (1 − n−1 )n , and Theorem 1.7 implies that

lim P ∗ (θ̂n∗ = θ̂n ) = lim 1 − (1 − n−1 )n = 1 − exp(−1).


n→∞ n→∞

Noting that X(n) is a continuous random variable, it follows that the actual
probability is P (θ̂n = θ) = 0 for all n ∈ N. Therefore, the bootstrap estimate of
the probability is not consistent. We have plotted the actual distribution of θ̂n
along with a histogram of the bootstrap estimate of the sampling distribution
of θ̂n for a set of simulated data from a Uniform(0, 1) distribution in Figures
11.11 and 11.12. In Figure 11.12 the observed data are represented by the
plotted points along the horizontal axis. Note that since θ̂n∗ is the maximum
of a sample taken from the observed sample X1 , . . . , Xn , θ̂n∗ will be equal to
one of the observed points with probability one with respect to the probability
measure P ∗ , which is a conditional measure on θ̂n . 

The problem with the bootstrap estimate in Example 11.20 is that the parent
population is continuous, but the empirical distribution is discrete. The boot-
strap usually overcomes this problem in the case where θ̂n is a smooth function
of the data because the bootstrap estimate of the sampling distribution be-
comes virtually continuous at a very fast rate as n → ∞. This is due to the
large number of atoms in the bootstrap estimate of the sampling distribution.
See Appendix I of Hall (1992). However, when θ̂n is not a smooth function of
the data, such as in Example 11.20, this continuity is never realized and the
bootstrap estimate of the sampling distribution of θ̂n can fail to be consistent.
THE BOOTSTRAP 545

Figure 11.11 The actual density of θ̂n = X(n) for samples of size 10 from a
Uniform(0, 1) distribution.
10
8
6
4
2
0

0.0 0.2 0.4 0.6 0.8 1.0

Another class of estimators where the bootstrap estimate of Hn is consistent


is for the sampling distributions of certain U -statistics.
Theorem 11.17 (Bickel and Freedman). Let X1 , . . . , Xn be a set of inde-
pendent and identically distributed random variables from distribution F with
parameter θ. Let Un be a U -statistic of degree m = 2 for θ with kernel func-
tion h(x1 , x2 ). Suppose that E[h2 (X1 , X2 )] < ∞, E[|h(X1 , X1 )|] < ∞, and the
integral Z∞
h(x, y)dF (y),
−∞
a.c.
is not a constant with respect to x. The d∞ (Ĥn , Hn ) −−→ 0 as n → ∞.
For a proof of Theorem 11.17 see Bickel and Freedman (1981) and Shi (1986).
An example from Bickel and Freedman (1981) demonstrates that the condition
E[|h(X1 , X1 )|] < ∞ cannot be weakened.
Example 11.21. Let X1 , . . . , Xn be a set of independent and identically
distributed random variables from a distribution F . Consider a U -statistic
Un of degree m = 2 with kernel function k(X1 , X2 ) = 21 (X1 − X2 )2 , which
corresponds to the unbiased version of the sample variance. Note that
Z ∞Z ∞
E[h2 (X1 , X2 )] = 1 4
4 (x1 − x2 ) dF (x1 )dF (x2 ),
−∞ −∞
546 NONPARAMETRIC INFERENCE

Figure 11.12 An example of the bootstrap estimate of θ̂n = X(n) for a sample of size
10 taken from a Uniform(0, 1) distribution. The simulated sample is represented by
the points plotted along the horizontal axis.
12
10
8
6
4
2
0

0.0 0.2 0.4 0.6 0.8 1.0

will be finite as long as F has at least four finite moments. Similarly


Z ∞
1 2
E[|h(X1 , X1 )|] = 2 (x1 − x1 ) dF (x1 ) = 0 < ∞.
−∞

Finally, note that


Z ∞ Z ∞
h(x1 , x2 )dF (x2 ) = 1
2 (x1 − x2 )2 dF (x2 ) = 1 2
2 x1 + x1 µ01 + 12 µ02 ,
−∞ −∞

is not constant with respect to x1 . Therefore, under the condition that F has
a.c.
a finite fourth moment, Theorem 11.17 implies that d∞ (Ĥn , Hn ) −−→ 0 as
n → ∞ where Hn (t) = P [n1/2 (Un − µ2 ) ≤ t] and Ĥn (t) is the bootstrap
estimate given by Ĥn (t) = P ∗ [n1/2 (Un∗ − µ̂2,n ) ≤ t]. 

There are many other consistency results for the bootstrap estimate of the
sampling distribution that include results for L-statistics, differentiable statis-
tical functionals, empirical processes, and quantile processes. For an overview
of these results see Section 3.2 of Shao and Tu (1995).
Beyond the bootstrap estimate of the sampling distribution, we can also con-
sider the bootstrap estimate of the variance of an estimator, or equivalently
the standard error of an estimator. For example, suppose that we take Jn(t) = P(θ̂n ≤ t) and estimate Jn(t) using the bootstrap to get Ĵn(t) = P*(θ̂n* ≤ t).
The bias of θ̂n is given by
$$\mathrm{Bias}(\hat{\theta}_n) = E(\hat{\theta}_n) - \theta = \int_{-\infty}^{\infty} t\,dJ_n(t) - \theta, \tag{11.61}$$
which has bootstrap estimate
$$\widehat{\mathrm{Bias}}(\hat{\theta}_n) = \hat{E}(\hat{\theta}_n) - \hat{\theta}_n = \int_{-\infty}^{\infty} t\,d\hat{J}_n(t) - \hat{\theta}_n. \tag{11.62}$$
Similarly, the standard error of θ̂n equals
$$\sigma_n = \left\{ \int_{-\infty}^{\infty} \left[ t - \int_{-\infty}^{\infty} u\,dJ_n(u) \right]^2 dJ_n(t) \right\}^{1/2},$$
which has bootstrap estimate
$$\hat{\sigma}_n = \left\{ \int_{-\infty}^{\infty} \left[ t - \int_{-\infty}^{\infty} u\,d\hat{J}_n(u) \right]^2 d\hat{J}_n(t) \right\}^{1/2}.$$
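In practice these integrals with respect to Ĵn are approximated by resampling. The following R sketch illustrates this for the (hypothetical) estimator θ̂n = X̄n² of µ²; the data, seed, and number of resamples are arbitrary choices for the example.

# Minimal sketch: Monte Carlo approximations of the bootstrap bias and
# standard error estimates, in the spirit of Equations (11.61)-(11.62),
# for theta.hat = (sample mean)^2 as an estimator of mu^2.
set.seed(2)                           # arbitrary seed
x <- rexp(25)                         # hypothetical observed sample
theta.hat <- mean(x)^2
b <- 2000                             # number of bootstrap resamples
theta.star <- replicate(b, mean(sample(x, replace = TRUE))^2)
bias.boot <- mean(theta.star) - theta.hat   # approximates the estimate in Equation (11.62)
se.boot <- sd(theta.star)                   # approximates the bootstrap standard error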

Not surprisingly, the conditions that ensure the consistency of the bootstrap
estimate of the variance are similar to what is required to ensure the consis-
tency of the bootstrap estimate of the sampling distribution.
Theorem 11.18. Let X1, . . . , Xn be a set of independent and identically distributed d-dimensional random vectors with mean vector µ and covariance matrix Σ. Let θ = g(µ) for a real valued function g that is differentiable in a neighborhood of µ. Define a d × 1 vector d(µ) to be the vector of partial derivatives of g evaluated at µ. That is, the ith element of d(µ) is given by
$$d_i(\mu) = \left. \frac{\partial}{\partial x_i} g(x) \right|_{x = \mu},$$
where x′ = (x1, . . . , xd). Suppose that d(µ) ≠ 0. If E(‖X1‖²) < ∞ and
$$\max_{i_1, \ldots, i_n \in \{1, \ldots, n\}} \zeta_n^{-1} \left| \hat{\theta}_n(X_{i_1}, \ldots, X_{i_n}) - \hat{\theta}_n(X_1, \ldots, X_n) \right| \xrightarrow{\text{a.c.}} 0, \tag{11.63}$$
as n → ∞, where
$$\hat{\theta}_n(X_{i_1}, \ldots, X_{i_n}) = g\left( n^{-1} \sum_{k=1}^{n} X_{i_k} \right),$$
and ζn is a sequence of positive real numbers such that
$$\liminf_{n \to \infty} \zeta_n > 0,$$
and ζn = O[exp(n^δ)] where δ ∈ (0, ½), then the bootstrap estimator of σn² = n⁻¹d′(µ)Σd(µ) is consistent. That is, $\sigma_n^{-2}\hat{\sigma}_n^2 \xrightarrow{\text{a.c.}} 1$ as n → ∞.

A proof of Theorem 11.18 can be found in Section 3.2.2 of Shao and Tu (1995).
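As an informal illustration of the setting of Theorem 11.18, the following R sketch compares a Monte Carlo approximation of the bootstrap variance estimate with the asymptotic variance n⁻¹d′(µ)Σd(µ) for the smooth function g(x) = exp(x) of a univariate mean. The population, sample size, seed, and number of resamples are arbitrary choices for the example.

# Minimal sketch: bootstrap variance of theta.hat = g(mean) with g(x) = exp(x),
# compared with the asymptotic variance n^(-1) * g'(mu)^2 * sigma^2.
set.seed(3)
n <- 100
x <- rnorm(n)                         # hypothetical sample; here mu = 0 and sigma^2 = 1
theta.hat <- exp(mean(x))
b <- 2000
theta.star <- replicate(b, exp(mean(sample(x, replace = TRUE))))
var.boot <- var(theta.star)           # Monte Carlo approximation of the bootstrap variance
var.asym <- exp(2 * 0) * 1 / n        # n^(-1) d(mu)' Sigma d(mu) for this particular example
c(bootstrap = var.boot, asymptotic = var.asym)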
The condition given in Equation (11.63) is required because there are cases where the bootstrap estimate of the variance diverges to infinity, a result that is caused by the fact that |θ̂n∗ − θ̂n| may take on some exceptionally large values. Note the role of resampling in this condition. A sample from the empirical distribution F̂n consists of values from the original sample X1, . . . , Xn. Hence, any particular resampled value θ̂n∗ may be computed on any set of values Xi1, . . . , Xin. The condition given in Theorem 11.18 ensures that none of these values will be too far away from θ̂n as n → ∞. An example of a case where the bootstrap estimator is not consistent is given by Ghosh et al. (1984).
As exhibited in Example 11.20, the bootstrap can behave very differently
when dealing with non-smooth statistics such as sample quantiles. However,
the result given below shows that the bootstrap can still provide consistent
variance estimates in such cases.
Theorem 11.19. Let X1 , . . . , Xn be a set of independent and identically
distributed random variables from a distribution F . Let θ = F −1 (p) and
θ̂n = F̂n−1 (p) where p ∈ (0, 1) is a fixed constant. Suppose that f = F 0 exists
and is positive in a neighborhood of θ. If E(|X1 |ε ) < ∞ for some ε > 0 then
the bootstrap estimate of the variance σn2 = n−1 p(1 − p)[f (θ)]−2 is consistent.
That is, $\hat{\sigma}_n^2 \sigma_n^{-2} \xrightarrow{\text{a.c.}} 1$ as n → ∞.

A proof of Theorem 11.19 can be found in Ghosh et al. (1984). Babu (1986)
considers the same problem and is able to prove the result under slightly
weaker conditions. It is worthwhile to compare the assumptions of Theorem
11.19 to those of Corollary 4.4, which are used to establish the asymptotic
Normality of the sample quantile. These assumptions are required in order
to be able to obtain the form of the asymptotic variance of the sample quantile.
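The following R sketch illustrates Theorem 11.19 by comparing a Monte Carlo approximation of the bootstrap standard error of the sample median with the asymptotic standard error for a N(0, 1) population. The sample size, seed, and number of resamples are arbitrary choices for the example.

# Minimal sketch: bootstrap standard error of the sample median compared with
# the asymptotic value sigma_n = sqrt(p(1 - p) / (n f(theta)^2)) from Theorem 11.19,
# with p = 1/2 and f(theta) = dnorm(0) for a N(0, 1) population.
set.seed(4)
n <- 100
x <- rnorm(n)                         # hypothetical observed sample
b <- 2000
med.star <- replicate(b, median(sample(x, replace = TRUE)))
se.boot <- sd(med.star)               # bootstrap standard error (Monte Carlo approximation)
se.asym <- sqrt(0.25 / (n * dnorm(0)^2))   # asymptotic standard error, equals sqrt(pi / (2n))
c(bootstrap = se.boot, asymptotic = se.asym)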
The ability of the bootstrap to provide consistent estimates of the sampling
distribution and the variance of a statistic is only a small part of the theory
that supports the usefulness of the bootstrap in many situations. One of the
more surprising results is that under the smooth function model of Section
7.4, the bootstrap automatically performs an Edgeworth type correction. This
type of result was first observed in the early work of Babu and Singh (1983,
1984, 1985), Beran (1982), Bickel and Freedman (1980), Hall (1986a, 1986b),
and Singh (1981). A fully developed theory appears in the work of Hall (1988a,
1992).
The essential idea is based on the following result. Suppose that X1 , . . . , Xn
is a set of independent and identically distributed d-dimensional random vec-
tors from a distribution F with parameter θ that falls within the smooth
function model. Theorem 7.11 implies that the distribution function Gn (x) =
P [n1/2 σ −1 (θ̂n − θ) ≤ x] has asymptotic expansion
$$G_n(x) = \Phi(x) + \sum_{k=1}^{p} n^{-k/2} r_k(x)\phi(x) + o(n^{-p/2}), \tag{11.64}$$
as n → ∞, where rk is a polynomial whose coefficients depend on the mo-
ments of F . Theorem 5.1 of Hall (1992) implies that the bootstrap estimate of
Gn (x), which is given by Ĝn (x) = P ∗ [n1/2 σ̂n−1 (θ̂n∗ − θ̂n ) ≤ x], has asymptotic
expansion
$$\hat{G}_n(x) = \Phi(x) + \sum_{k=1}^{p} n^{-k/2} \hat{r}_k(x)\phi(x) + o_p(n^{-p/2}), \tag{11.65}$$
as n → ∞. The polynomial r̂k has the same form as rk , except that the
moments of F in the coefficients of the polynomial have been replaced by
the corresponding sample moments. One should also note that the error term
o(n−p/2 ) in Equation (11.64) has been replaced with the error term op (n−p/2 )
in Equation (11.65), which reflects the fact that the error term in the expansion
is now a random variable. A proof of this result can be found in Section 5.2.2
of Hall (1992). The same result holds for the Edgeworth expansion of the
studentized distribution. That is, if Ĥn (x) = P ∗ [n1/2 (σ̂n∗ )−1 (θ̂n∗ − θ̂n ) ≤ x] is
the bootstrap estimate of the distribution Hn(x) = P[n^{1/2}σ̂n⁻¹(θ̂n − θ) ≤ x], then
$$\hat{H}_n(x) = \Phi(x) + \sum_{k=1}^{p} n^{-k/2} \hat{v}_k(x)\phi(x) + o_p(n^{-p/2}),$$
as n → ∞ where v̂k (x) is the sample version of vk (x) for k = 1, . . . , p. Similar
results hold for the Cornish–Fisher expansions for the quantile functions of Ĝn(x) and Ĥn(x). Let ĝα = Ĝn⁻¹(α) and ĥα = Ĥn⁻¹(α) be the bootstrap estimates of the quantiles of the distributions of Ĝn and Ĥn, respectively.
Then Theorem 5.2 of Hall (1992) implies that
$$\hat{g}_\alpha = z_\alpha + \sum_{k=1}^{p} n^{-k/2} \hat{q}_k(z_\alpha) + o_p(n^{-p/2}), \tag{11.66}$$
and
$$\hat{h}_\alpha = z_\alpha + \sum_{k=1}^{p} n^{-k/2} \hat{s}_k(z_\alpha) + o_p(n^{-p/2}),$$
as n → ∞ where q̂k and ŝk are the sample versions of qk and sk , respectively,
for all k = 1, . . . , p. The effect of these results is immediate. Because r̂k (x) =
rk (x) + Op (n−1/2 ) and v̂k (x) = vk (x) + Op (n−1/2 ), it follows from Equations
(11.64) and (11.65) that
$$\hat{G}_n(x) = \Phi(x) + n^{-1/2} r_1(x)\phi(x) + o_p(n^{-1/2}) = G_n(x) + o_p(n^{-1/2}),$$
and
$$\hat{H}_n(x) = \Phi(x) + n^{-1/2} v_1(x)\phi(x) + o_p(n^{-1/2}) = H_n(x) + o_p(n^{-1/2}),$$
as n → ∞. Therefore, it is clear that the bootstrap does a better job of estimating Gn and Hn than the Normal approximation, which would estimate both of these distributions by Φ(x), resulting in an error term that is O(n^{-1/2}) as n → ∞. This effect has far-reaching consequences for other bootstrap meth-
ods, most notably confidence intervals. Hall (1988a) identifies six common
bootstrap confidence intervals and describes their asymptotic behavior. We
consider 100α% upper confidence limits using four of these methods.
The percentile method, introduced by Efron (1979), estimates the sampling
distribution Jn (x) = P (θ̂n ≤ x) with the bootstrap estimate Jˆn (x) = P ∗ (θ̂n∗ ≤
x). The 100α% upper confidence limit is then given by θ̂back(α) = Ĵn⁻¹(α), where we are using the notation of Hall (1988a) to identify the confidence limit. Note that
$$\hat{J}_n(x) = P^*(\hat{\theta}_n^* \leq x) = P^*[n^{1/2}\hat{\sigma}_n^{-1}(\hat{\theta}_n^* - \hat{\theta}_n) \leq n^{1/2}\hat{\sigma}_n^{-1}(x - \hat{\theta}_n)] = \hat{G}_n[n^{1/2}\hat{\sigma}_n^{-1}(x - \hat{\theta}_n)].$$
Therefore, it follows that Ĝn{n^{1/2}σ̂n⁻¹[Ĵn⁻¹(α) − θ̂n]} = α, or equivalently that n^{1/2}σ̂n⁻¹[Ĵn⁻¹(α) − θ̂n] = ĝα. Hence it follows that Ĵn⁻¹(α) = θ̂n + n^{-1/2}σ̂n ĝα, and therefore θ̂back(α) = θ̂n + n^{-1/2}σ̂n ĝα. The expansion given in Equation
(11.66) implies that
$$\hat{\theta}_{\mathrm{back}}(\alpha) = \hat{\theta}_n + n^{-1/2}\hat{\sigma}_n[z_\alpha + n^{-1/2}\hat{q}_1(z_\alpha) + n^{-1}\hat{q}_2(z_\alpha) + o_p(n^{-1})],$$
as n → ∞. Noting that q̂1(zα) = q1(zα) + Op(n^{-1/2}) as n → ∞ implies that
$$\hat{\theta}_{\mathrm{back}}(\alpha) = \hat{\theta}_n + n^{-1/2}\hat{\sigma}_n z_\alpha + n^{-1}\hat{\sigma}_n q_1(z_\alpha) + O_p(n^{-3/2}),$$
as n → ∞. In Section 10.3 it was shown that a correct 100α% upper confidence limit for θ has the expansion
$$\hat{\theta}_{\mathrm{stud}}(\alpha) = \hat{\theta}_n - n^{-1/2}\hat{\sigma}_n z_{1-\alpha} - n^{-1}\hat{\sigma}_n s_1(z_{1-\alpha}) + O_p(n^{-1}),$$
as n → ∞. Therefore |θ̂back(α) − θ̂stud(α)| = Op(n^{-1}) as n → ∞ and Definition 10.7 implies that the backwards, or percentile, method is first-order correct.
Calculations similar to those in Section 10.3 can be used to conclude that the
method is first-order accurate. Hall (1988a) refers to this limit as the backwards
limit because the upper confidence limit is based on the upper percentile of the
distribution of Gn whereas the form of the correct upper confidence limit is
based on the lower tail percentile of Hn . The upper confidence limit will have
better performance when q1 (zα ) = 0, which occurs when the distribution Jn
is symmetric. For the case where θ is the mean of the population, this occurs
when the underlying population is symmetric.
A second common upper confidence limit identified by Hall (1988a) is the hybrid limit given by θ̂hyb(α) = θ̂n − n^{-1/2}σ̂n ĝ_{1−α}, which has the expansion
$$\hat{\theta}_{\mathrm{hyb}}(\alpha) = \hat{\theta}_n - n^{-1/2}\hat{\sigma}_n[z_{1-\alpha} + n^{-1/2}\hat{q}_1(z_{1-\alpha}) + n^{-1}\hat{q}_2(z_{1-\alpha}) + O_p(n^{-1})],$$
as n → ∞. This confidence limit uses the lower percentile for the upper limit,
which is an improvement over the percentile method. However, the percentile
is still from the distribution Gn , which assumes that σ is known, rather than
the distribution Hn , which takes into account the fact that σ is unknown.
The hybrid method is still first order correct and accurate, though its finite
sample behavior has often been shown empirically to be superior to that of
the percentile method. See Chapter 4 of Shao and Tu (1995).
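To make these constructions concrete, the following R sketch computes Monte Carlo approximations of the percentile (backwards) and hybrid 100α% upper confidence limits for a mean. The data, confidence level, seed, and number of resamples are arbitrary choices for the example.

# Minimal sketch: percentile (backwards) and hybrid upper confidence limits
# for a mean, approximated by Monte Carlo resampling.
set.seed(5)
x <- rexp(30)                         # hypothetical observed sample
theta.hat <- mean(x)
alpha <- 0.95                         # arbitrary confidence level
b <- 2000
theta.star <- replicate(b, mean(sample(x, replace = TRUE)))
upper.back <- quantile(theta.star, alpha)                      # percentile (backwards) limit
upper.hyb  <- 2 * theta.hat - quantile(theta.star, 1 - alpha)  # hybrid limit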
The third common interval is the studentized bootstrap interval of Efron (1982), which has a 100α% upper confidence limit given by θ̂stud(α) = θ̂n − n^{-1/2}σ̂n ĥ_{1−α}. In some sense this interval is closest to mimicking the correct interval θ̂stud(α), and the advantages of this interval become apparent when we study the asymptotic properties of the confidence limit. We can first note that θ̂stud(α) has expansion
$$\hat{\theta}_{\mathrm{stud}}(\alpha) = \hat{\theta}_n - n^{-1/2}\hat{\sigma}_n[z_{1-\alpha} + n^{-1/2}\hat{s}_1(z_{1-\alpha}) + o_p(n^{-1/2})] = \hat{\theta}_n - n^{-1/2}\hat{\sigma}_n z_{1-\alpha} - n^{-1}\hat{\sigma}_n s_1(z_{1-\alpha}) + o_p(n^{-1}),$$
as n → ∞, where it follows that θ̂stud(α) is second-order correct and accu-
rate. Therefore, from the asymptotic viewpoint this interval is superior to the percentile and hybrid methods. In practice, however, this confidence limit can be difficult to use. The two main problems are that the confidence limit can be computationally burdensome, and that when n is small it can be numerically unstable. See Polansky
(2000) and Tibshirani (1988) for further details on stabilizing this method.
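A minimal sketch of the studentized limit is given below, using a nested bootstrap to estimate the standard error of each resampled statistic; it also illustrates why the method can be computationally burdensome. The data, confidence level, and resampling sizes are arbitrary choices for the example.

# Minimal sketch: studentized bootstrap upper confidence limit for a mean,
# using an inner (nested) bootstrap to estimate the standard error of each
# resampled statistic.
set.seed(6)
x <- rexp(30)
n <- length(x)
theta.hat <- mean(x)
se.hat <- sd(x) / sqrt(n)             # standard error estimate from the original sample
alpha <- 0.95
b <- 1000                             # outer resamples (arbitrary)
b.inner <- 100                        # inner resamples (arbitrary)
t.star <- replicate(b, {
  x.star <- sample(x, replace = TRUE)
  se.star <- sd(replicate(b.inner, mean(sample(x.star, replace = TRUE))))
  (mean(x.star) - theta.hat) / se.star
})
upper.stud <- theta.hat - se.hat * quantile(t.star, 1 - alpha)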
In an effort to fix the theoretical and practical deficiencies of the percentile
method, Efron (1981, 1987) suggests computing a 100α% upper confidence

limit of the form θ̂back[β(α)], where β(α) is an adjusted confidence level that is designed to reduce the bias of the upper confidence limit. The first method
is called the bias corrected method, and is studied in Exercise 25 of Chapter
10. In this section we will explore the properties of the second method called
the bias corrected and accelerated method. It is worthwhile to note that Efron
(1981, 1987) did not develop these methods based on considerations pertain-
ing to Edgeworth expansion theory. However, the methods can be justified
using this theory. The development we present here is based on the argu-
ments of Hall (1988a). Define a function β̂(α) = Φ[zα + 2m̂ + âzα² + O(n⁻¹)] as n → ∞, where m̂ = Φ⁻¹[Ĝn(0)] is called the bias correction parameter and â = −n^{-1/2}zα⁻²[2r̂1(0) − r̂1(zα) − v̂1(zα)] is called the acceleration constant.

Note that θ̂back [β̂(α)] = θ̂n + n−1/2 σ̂n ĝβ̂(α) , where ĝβ̂(α) has a Cornish–Fisher
expansion involving zβ̂(α) . Therefore, we begin our analysis of this method by
noting that
$$z_{\hat{\beta}(\alpha)} = \Phi^{-1}\{\Phi[z_\alpha + 2\hat{m} + \hat{a}z_\alpha^2 + O(n^{-1})]\} = z_\alpha + 2\hat{m} + \hat{a}z_\alpha^2 + O(n^{-1}),$$
as n → ∞. Therefore, it follows that
$$\hat{g}_{\hat{\beta}(\alpha)} = z_{\hat{\beta}(\alpha)} + n^{-1/2}\hat{q}_1[z_{\hat{\beta}(\alpha)}] + O_p(n^{-1}) = z_\alpha + 2\hat{m} + \hat{a}z_\alpha^2 + n^{-1/2}\hat{q}_1[z_{\hat{\beta}(\alpha)}] + O_p(n^{-1}).$$
Theorem 1.13 (Taylor) implies that
$$\begin{aligned}
\hat{q}_1[z_{\hat{\beta}(\alpha)}] &= \hat{q}_1[z_\alpha + 2\hat{m} + \hat{a}z_\alpha^2 + O(n^{-1})] \\
&= \hat{q}_1\{z_\alpha + 2\Phi^{-1}[\hat{G}_n(0)] - n^{-1/2}[2\hat{r}_1(0) - \hat{r}_1(z_\alpha) - \hat{v}_1(z_\alpha)] + O(n^{-1})\} \\
&= \hat{q}_1\{z_\alpha + 2\Phi^{-1}[\hat{G}_n(0)]\} + O_p(n^{-1/2}),
\end{aligned}$$
as n → ∞. Note that
Ĝn (0) = Φ(0) + n−1/2 r̂1 (0)φ(0) + O(n−1 ) = Φ(0) + n−1/2 r1 (0)φ(0) + Op (n−1 ),
as n → ∞. Using the expansion from Example 1.29, we have that
$$\Phi^{-1}[\Phi(0) + n^{-1/2}r_1(0)\phi(0) + O_p(n^{-1})] = \Phi^{-1}[\Phi(0)] + n^{-1/2}r_1(0)\phi(0)[\phi(0)]^{-1} + O_p(n^{-1}) = n^{-1/2}r_1(0) + O_p(n^{-1}) = O_p(n^{-1/2}),$$
as n → ∞. Therefore,
$$\hat{q}_1\{z_\alpha + 2\Phi^{-1}[\hat{G}_n(0)]\} + O_p(n^{-1/2}) = \hat{q}_1(z_\alpha) + O_p(n^{-1/2}) = q_1(z_\alpha) + O_p(n^{-1/2}).$$
Therefore, it follows that
$$\begin{aligned}
\hat{g}_{\hat{\beta}(\alpha)} &= z_\alpha + 2\Phi^{-1}[\hat{G}_n(0)] - n^{-1/2}[2\hat{r}_1(0) - \hat{r}_1(z_\alpha) - \hat{v}_1(z_\alpha)] + n^{-1/2}q_1(z_\alpha) + O_p(n^{-1}) \\
&= z_\alpha + 2n^{-1/2}r_1(0) - 2n^{-1/2}r_1(0) + n^{-1/2}r_1(z_\alpha) + n^{-1/2}v_1(z_\alpha) + n^{-1/2}q_1(z_\alpha) + O_p(n^{-1}) \\
&= z_\alpha + n^{-1/2}r_1(z_\alpha) + n^{-1/2}v_1(z_\alpha) + n^{-1/2}q_1(z_\alpha) + O_p(n^{-1}),
\end{aligned} \tag{11.67}$$
as n → ∞. Recall that q1 (zα ) = −r1 (zα ) and s1 (zα ) = −v1 (zα ) so that the
expansion in Equation (11.67) becomes ĝβ̂ (α) = zα − n−1/2 s1 (zα ) + Op (n−1 )
as n → ∞, and hence the upper confidence limit of the accelerated and bias
corrected percentile method is

θ̂bca (α) = θ̂n + n−1/2 σ̂n [zα − n−1/2 s1 (zα ) + Op (n−1 )]
= θ̂n + n−1/2 σ̂n zα − n−1 σ̂n s1 (zα ) + Op (n−3/2 )
= θ̂n − n−1/2 σ̂n z1−α − n−1 σ̂n s1 (zα ) + Op (n−3/2 ),
as n → ∞, which matches θ̂stud (α) to order O(n−3/2 ), and hence it follows
that the bias corrected and accelerated method is second-order accurate and
correct.
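In practice one rarely computes m̂ and â by hand. The sketch below, for example, obtains percentile and bias corrected and accelerated intervals for a mean from the boot package in R, assuming that package is installed. The data, confidence level, seed, and number of resamples are arbitrary choices for the example.

# Minimal sketch: percentile and BCa intervals via the boot package (assumed available).
library(boot)
set.seed(7)
x <- rexp(30)                                    # hypothetical observed sample
mean.stat <- function(data, indices) mean(data[indices])
boot.out <- boot(data = x, statistic = mean.stat, R = 2000)
boot.ci(boot.out, conf = 0.90, type = c("perc", "bca"))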
As with the confidence intervals studied in Section 10.3, the asymptotic accuracy of two-sided bootstrap confidence intervals is less affected by correctness. It is also worth noting that bootstrap confidence intervals may
behave quite differently outside the smooth function model. See, for example,
Hall and Martin (1991). There are also many other methods for construct-
ing bootstrap confidence intervals that are at least second-order correct. For
examples, see Hall and Martin (1988), Polansky and Schucany (1997), Beran
(1987), and Loh (1987).

11.7 Exercises and Experiments

11.7.1 Exercises

1. Let {Xn }∞n=1 be a sequence of independent and identically distributed ran-


dom variables from a distribution F ∈ F where F is the collection of all
distributions that have a finite second moment and have a mean equal to
zero. Let θ = V (X1 ) and prove that θ is estimable of degree one over F.
2. Suppose that X1 , . . . , Xn is a set of independent and identically distributed
random variables from a distribution F with finite variance θ. It was shown
in Example 11.4 that θ is estimable of degree two.

a. Prove that h(Xi, Xj) = ½(Xi − Xj)² is a symmetric function such that E[h(Xi, Xj)] = θ for all i ≠ j.
b. Using the function h(Xi, Xj), form a U-statistic for θ and prove that it is equivalent to the unbiased version of the sample variance.
c. Using Theorem 11.5, find the projection of this U -statistic.
d. Find conditions under which Theorem 11.6 (Hoeffding) applies to this
statistic, and specify its weak convergence properties.

3. Let X1 , . . . , Xn be a set of independent and identically distributed random


variables from a distribution F with mean θ. In Example 11.9 the U -statistic
$$U_n = 2[n(n-1)]^{-1} \sum_{i=2}^{n} \sum_{j=1}^{i-1} X_i X_j,$$

was considered as an unbiased estimator of θ2 .

a. Using Theorem 11.5, find the projection of this U -statistic.


b. Find conditions under which Theorem 11.6 (Hoeffding) applies to this
statistic, and specify its weak convergence properties.

4. Let X1 , . . . , Xn be a set of independent and identically distributed random


variables from a distribution F with mean θ. Suppose that we are interested
in estimating θ(1 − θ) using a U -statistic.

a. Find a symmetric kernel function for this parameter, and develop the
corresponding U -statistic.
b. Using Theorem 11.5, find the projection of this U -statistic.
c. Find conditions under which Theorem 11.6 (Hoeffding) applies to this
statistic, and specify its weak convergence properties.

5. Let X1 , . . . , Xn be a set of independent and identically distributed random


variables from a continuous distribution F that is symmetric about a point
θ. Let R be the vector of ranks of |X1 − θ|, . . . , |Xn − θ| and let C be an
n × 1 vector with ith element ci = δ{Xi − θ; (0, ∞)} for i = 1, . . . , n. The
signed rank statistic was seen in Example 11.3 to have the form W = C′R. In Example 11.12 it was further shown that
$$\binom{n}{2}^{-1} n^{1/2}[W - E(W)] \xrightarrow{d} Z_2,$$
as n → ∞, where Z2 has a N(0, ⅓) distribution.

a. Using direct calculations, prove that under the null hypothesis that θ = 0 it follows that E(W) = ¼n(n + 1).
b. Using direct calculations, prove that under the null hypothesis that θ = 0 it follows that V(W) = (1/24)n(n + 1)(2n + 1).
c. Prove that
$$\frac{W - \tfrac{1}{4}n(n+1)}{[\tfrac{1}{24}n(n+1)(2n+1)]^{1/2}} \xrightarrow{d} Z,$$
as n → ∞, where Z has a N(0, 1) distribution.

6. Let S be a linear rank statistic of the form
$$S = \sum_{i=1}^{n} c(i)\,a(r_i).$$
If R is a vector whose elements correspond to a random permutation of the integers in the set {1, . . . , n} then prove that E(S) = nāc̄ and
$$V(S) = (n-1)^{-1} \left\{ \sum_{i=1}^{n} [a(i) - \bar{a}]^2 \right\} \left\{ \sum_{j=1}^{n} [c(j) - \bar{c}]^2 \right\},$$
where
$$\bar{a} = n^{-1} \sum_{i=1}^{n} a(i), \qquad \bar{c} = n^{-1} \sum_{i=1}^{n} c(i).$$

7. Consider the rank sum test statistic from Example 11.2, which is a linear
rank statistic with a(i) = i and c(i) = δ{i; {m + 1, . . . , n + m}} for all
i = 1, . . . , n + m. Under the null hypothesis that the shift parameter θ is
zero, find the mean and variance of this test statistic.
8. Consider the median test statistic described in Example 11.14, which is
a linear rank statistic with a(i) = δ{i; {½(m + n + 1), . . . , m + n}} and
c(i) = δ{i; {m + 1, . . . , n + m}} for all i = 1, . . . , n + m.

a. Under the null hypothesis that the shift parameter θ is zero, find the
mean and variance of the median test statistic.
b. Determine if there are conditions under which the distribution of the
median test statistic under the null hypothesis is symmetric.
c. Prove that the regression constants satisfy Noether’s condition.
d. Define α(t) such that a(i) = α[(m + n + 1)−1 i] for all i = 1, . . . , m + n
and show that α is a square integrable function.
e. Prove that the linear rank statistic
$$D = \sum_{i=1}^{n} a(i, n)\, c(i, n),$$

converges weakly to a N(0, 1) distribution when it has been properly


standardized.

9. Let X1 , . . . , Xn be a set of independent and identically distributed random


variables from a distribution F with continuous and bounded density f ,
that is assumed to be symmetric about a point θ. Consider testing the null
hypothesis H0 : θ = θ0 against the alternative hypothesis H1 : θ > θ0 , by
rejecting the null hypothesis when the test statistic
$$B = \sum_{i=1}^{n} \delta\{X_i - \theta_0; (0, \infty)\},$$

is too large. Let α denote the desired significance level of this test. Without
loss of generality assume that θ0 = 0.

a. Show that the critical value for this test converges to z1−α as n → ∞.
b. In the context of Theorem 11.15 show that we can take µn (θ) = F (θ)
and σn² = ¼n.
c. Using the result derived above, prove that the efficacy of this test is
given by 2f (0).

10. Prove that
$$\int_{-\infty}^{\infty} f^2(x)\,dx$$
equals ½π^{-1/2}, 1, ¼, 1/6, and 2/3 for the N(0, 1), Uniform(−½, ½), LaPlace(0, 1), Logistic(0, 1), and Triangular(−1, 1, 0) densities, respectively.
11. Prove that the square efficacy of the t-test equals 1, 12, ½, 3π⁻², and 6 for the N(0, 1), Uniform(−½, ½), LaPlace(0, 1), Logistic(0, 1), and Triangular(−1, 1, 0) densities, respectively.
12. Prove that the square efficacy of the sign test equals 2π⁻¹, 4, 1, ¼, and 4 for the N(0, 1), Uniform(−½, ½), LaPlace(0, 1), Logistic(0, 1), and Triangular(−1, 1, 0) densities, respectively.
13. Consider the density f(x) = (3/20)·5^{-1/2}(5 − x²)δ{x; (−5^{1/2}, 5^{1/2})}. Prove that E_V² E_T⁻² ≃ 0.864, which is a lower bound for this asymptotic relative efficiency established by Hodges and Lehmann (1956). Comment on the importance of this lower bound in statistical applications.
14. Let X1 , . . . , Xn be a set of independent and identically distributed ran-
dom variables from a discrete distribution with distribution function F
and probability distribution function f . Assume that F is a step function
with steps at points contained in the countable set D. Consider estimating
the probability distribution function as
$$\hat{f}_n(x) = \hat{F}_n(x) - \hat{F}_n(x-) = n^{-1} \sum_{k=1}^{n} \delta\{X_k; \{x\}\},$$

for all x ∈ R. Prove that fˆn (x) is an unbiased and consistent estimator of
f (x) for each point x ∈ R.
15. Let X1 , . . . , Xn be a set of independent and identically distributed ran-
dom variables from a discrete distribution with distribution function F and
probability distribution function f . Let −∞ < g1 < g2 < · · · < gd < ∞
be a fixed grid of points in R. Assume that these points are selected in-
dependent of the sample X1 , . . . , Xn and that g1 < min{X1 , . . . , Xn } and
gd > max{X1 , . . . , Xn }. Consider the estimate of F given by
F̄n(x) = F̂n(gi) + (x − gi)(gi+1 − gi)⁻¹[F̂n(gi+1) − F̂n(gi)],
when x ∈ [gi , gi+1 ]. Prove that this estimate is a valid distribution function
conditional on X1 , . . . , Xn .
16. Let X1 , . . . , Xn be a set of independent and identically distributed random
variables from a distribution F with continuous density f . Prove that the
histogram estimate with fixed grid points −∞ < g1 < · · · < gd < ∞ such
that g1 < min{X1 , . . . , Xn } and gd > max{X1 , . . . , Xn } given by
$$\bar{f}_n(x) = (g_{i+1} - g_i)^{-1} n^{-1} \sum_{k=1}^{n} \delta\{X_k; (g_i, g_{i+1}]\},$$

is a valid density function, conditional on X1 , . . . , Xn .


17. Prove that the mean integrated squared error can be written as the sum of
the integrated square bias and the integrated variance. That is, prove that
MISE(f¯n , f ) = ISB(f¯n , f ) + IV(f¯n ), where
$$\mathrm{ISB}(\bar{f}_n, f) = \int_{-\infty}^{\infty} \mathrm{Bias}^2(\bar{f}_n(x), f(x))\,dx = \int_{-\infty}^{\infty} \{E[\bar{f}_n(x)] - f(x)\}^2\,dx,$$
and
$$\mathrm{IV}(\bar{f}_n) = \int_{-\infty}^{\infty} E\{[\bar{f}_n(x) - E(\bar{f}_n(x))]^2\}\,dx.$$
18. Using the fact that the pointwise bias of the histogram is given by
$$\mathrm{Bias}[\bar{f}_n(x)] = \tfrac{1}{2} f'(x)[h - 2(x - g_i)] + O(h^2),$$
as h → 0, prove that the square bias is given by
$$\mathrm{Bias}^2[\bar{f}_n(x)] = \tfrac{1}{4} [f'(x)]^2 [h - 2(x - g_i)]^2 + O(h^3).$$

19. Let f be a density with at least two continuous and bounded derivatives and let gi < gi+1 be grid points such that h = gi+1 − gi. Prove that
$$\int_{g_i}^{g_{i+1}} f'(x)(t - x)\,dt = O(h^2),$$
as n → ∞, where
$$\lim_{n \to \infty} h = 0.$$

20. Given that the asymptotic mean integrated squared error for the histogram with bin width h is given by
$$\mathrm{AMISE}(\bar{f}_n, f) = (nh)^{-1} + \tfrac{1}{12} h^2 R(f'),$$
show that the value of h that minimizes this function is given by
$$h_{\mathrm{opt}} = \left[ \frac{6}{R(f')} \right]^{1/3} n^{-1/3}.$$
21. Let K be any non-decreasing right-continuous function such that
$$\lim_{t \to \infty} K(t) = 1, \qquad \lim_{t \to -\infty} K(t) = 0, \qquad \int_{-\infty}^{\infty} t\,dK(t) = 0.$$
Define the kernel estimator of the distribution function F to be
$$\tilde{F}_n(x) = n^{-1} \sum_{i=1}^{n} K(x - X_i).$$
Prove that F̃n is a valid distribution function.


22. Use the fact that the pointwise bias of the kernel density estimator with bandwidth h is given by
$$\mathrm{Bias}[\tilde{f}_{n,h}(x)] = \tfrac{1}{2} h^2 f''(x) \sigma_k^2 + O(h^4),$$
as h → 0 to prove that the square bias is given by
$$\mathrm{Bias}^2[\tilde{f}_{n,h}(x)] = \tfrac{1}{4} h^4 [f''(x)]^2 \sigma_k^4 + O(h^6).$$

23. Using the fact that the asymptotic mean integrated squared error of the kernel estimator with bandwidth h is given by
$$\mathrm{AMISE}(\tilde{f}_{n,h}, f) = (nh)^{-1} R(k) + \tfrac{1}{4} h^4 \sigma_k^4 R(f''),$$
show that the asymptotically optimal bandwidth is given by
$$h_{\mathrm{opt}} = n^{-1/5} \left[ \frac{R(k)}{\sigma_k^4 R(f'')} \right]^{1/5}.$$
24. Consider the Epanechnikov kernel given by k(t) = ¾(1 − t²)δ{t; [−1, 1]}. Prove that σk R(k) = 3/(5√5).
25. Compute the efficiency of each of the kernel functions given below relative to the Epanechnikov kernel.
a. The Biweight kernel function, given by (15/16)(1 − t²)²δ{t; [−1, 1]}.
b. The Triweight kernel function, given by (35/32)(1 − t²)³δ{t; [−1, 1]}.
c. The Normal kernel function, given by φ(t).
d. The Uniform kernel function, given by ½δ{t; [−1, 1]}.
26. Let fˆn(t) denote a kernel density estimator with kernel function k computed on a sample X1, . . . , Xn. Prove that
$$E\left[ \int_{-\infty}^{\infty} \hat{f}_h(t) f(t)\,dt \right] = h^{-1} E\left[ \int_{-\infty}^{\infty} k\!\left( \frac{t - X}{h} \right) f(t)\,dt \right],$$
where the expectation on the right hand side of the equation is taken with respect to X.
27. Let X1 , . . . , Xn be a set of independent and identically distributed random
variables from a distribution F with functional parameter θ. Let θ̂n be an
estimator of θ. Consider estimating the sampling distribution of θ̂n given
by Jn (t) = P (θ̂n ≤ t) using the bootstrap resampling algorithm described
in Section 11.6. In this case let θ̂1∗ , . . . , θ̂b∗ be the values of θ̂n computed on
b resamples from the original sample X1 , . . . , Xn .
a. Show that the bootstrap estimate of the bias given by Equation (11.61) can be approximated by $\widetilde{\mathrm{Bias}}(\hat{\theta}_n) = \bar{\theta}_n^* - \hat{\theta}_n$, where
$$\bar{\theta}_n^* = b^{-1} \sum_{i=1}^{b} \hat{\theta}_i^*.$$

b. Show that the standard error estimate of Equation (11.62) can be approximated by
$$\tilde{\sigma}_n = \left\{ b^{-1} \sum_{i=1}^{b} (\hat{\theta}_i^* - \bar{\theta}_n^*)^2 \right\}^{1/2}.$$

28. Let X1 , . . . , Xn be a set of independent and identically distributed random


variables from a distribution F with mean θ with θ̂n = X̄n . Consider esti-
mating the sampling distribution of θ̂n given by Jn (t) = P (θ̂n ≤ t) using
the bootstrap resampling algorithm described in Section 11.6. Find closed
expressions for the bootstrap estimates of the bias and standard error of θ̂n .
This is an example where the bootstrap estimates have closed forms and
do not require approximate simulation methods to compute the estimates.
29. Let X1 , . . . , Xn be a set of independent and identically distributed random
variables from a distribution F with mean θ. Let Rn (θ̂, θ) = n1/2 (θ̂n − θ)
where θ̂n = X̄n , the sample mean. Let Hn (t) = P [Rn (θ̂n , θ) ≤ t) with
bootstrap estimate Ĥn (t) = P ∗ [Rn (θ̂n∗ , θ̂n ) ≤ t]. Using Theorem 11.16,
a.c.
under what conditions can we conclude that d∞ (Ĥn , Hn ) −−→ 0 as n → ∞?
30. Let X1 , . . . , Xn be a set of independent and identically distributed ran-
dom variables from a distribution F with mean µ. Define θ = g(µ) = µ2 ,
and let Rn(θ̂n, θ) = n^{1/2}(θ̂n − θ) where θ̂n = g(X̄n) = X̄n². Let Hn(t) = P[Rn(θ̂n, θ) ≤ t] with bootstrap estimate Ĥn(t) = P*[Rn(θ̂n*, θ̂n) ≤ t]. Using the theory related to Theorem 11.16, under what conditions can we conclude that $d_\infty(\hat{H}_n, H_n) \xrightarrow{\text{a.c.}} 0$ as n → ∞?
31. Let X1, . . . , Xn be a set of two-dimensional independent and identically distributed random vectors from a distribution F with mean vector µ. Let g(x) = x₂ − x₁² where x′ = (x₁, x₂). Define θ = g(µ) with θ̂n = g(X̄n). Let Rn(θ̂n, θ) = n^{1/2}(θ̂n − θ) and Hn(t) = P[Rn(θ̂n, θ) ≤ t] with bootstrap estimate Ĥn(t) = P*[Rn(θ̂n*, θ̂n) ≤ t]. Using Theorem 11.16, under what conditions can we conclude that $d_\infty(\hat{H}_n, H_n) \xrightarrow{\text{a.c.}} 0$ as n → ∞? Explain how this result can be used to determine the conditions under which the bootstrap estimate of the sampling distribution of the sample variance is strongly consistent.
32. Use Theorem 11.16 to determine the conditions under which the bootstrap
estimate of the sampling distribution of the sample correlation is strongly
consistent.
33. Let X1 , . . . , Xn be a set of independent and identically distributed random
variables from a distribution F with mean θ. Consider a U -statistic of
degree two with kernel function h(X1 , X2 ) = X1 X2 which corresponds
to an unbiased estimator of θ2 . Describe under what conditions, if any,
Theorem 11.17 could be applied to this U -statistic to obtain a consistent
bootstrap estimator of the sampling distribution.
34. Let X1 , . . . , Xn be a set of independent and identically distributed random
variables from a distribution F with mean θ with estimator θ̂n = X̄n .
a. Use Theorem 11.18 to find the simplest conditions under which the boot-
strap estimate of the standard error of θ̂n is consistent.
b. The standard error of θ̂n can also be estimated without the bootstrap
by n−1/2 σ̂n , where σ̂n is the sample standard deviation. Under what
conditions is this estimator consistent? Compare these conditions to the
simplest conditions required for the bootstrap.
35. For each method for computing bootstrap confidence intervals detailed in
Section 11.6, describe an algorithm for which the method could be imple-
mented practically using simulation for an observed set of data. Discuss the
computational cost of each method.
36. Prove that the bootstrap hybrid method is first-order accurate and correct.
37. Using the notation of Chapter 7, show that the acceleration constant used in the accelerated and bias corrected bootstrap confidence interval has the form
$$\hat{a} = \tfrac{1}{6} n^{-1/2} \sigma^{-3} \sum_{i=1}^{d} \sum_{j=1}^{d} \sum_{k=1}^{d} \hat{a}_i \hat{a}_j \hat{a}_k \hat{\mu}_{ijk},$$
under the smooth function model, where âi and µ̂ijk have the same form as ai and µijk except that the moments of F have been replaced with the corresponding sample moments.
38. In the context of the development of the bias corrected and accelerated
bootstrap confidence interval, prove that
$$\hat{m} + (\hat{m} + z_\alpha)[1 - \hat{a}(\hat{m} + z_\alpha)]^{-1} = z_\alpha + 2\hat{m} + z_\alpha^2 \hat{a} + O_p(n^{-1}),$$
as n → ∞. This expression is of the same form as that given by Efron (1987).

11.7.2 Experiments

1. Write a program in R that simulates 1000 samples of size n from a dis-


tribution F with mean θ, where n, F and θ are specified below. For each
sample compute the U -statistic for the parameter θ2 given by
$$U_n = 2[n(n-1)]^{-1} \sum_{i=2}^{n} \sum_{j=1}^{i-1} X_i X_j.$$

For each case make a histogram of the 1000 values of the statistic and
evaluate the results in terms of the theory developed in Exercise 3. Repeat
the experiment for n = 5, 10, 25, 50, and 100 for each of the distributions
listed below.
a. F is a N(0, 1) distribution.
b. F is a Uniform(0, 1) distribution.
c. F is a LaPlace(0, 1) distribution.
d. F is a Cauchy(0, 1) distribution.
e. F is a Exponential(1) distribution.
2. Write a program in R that simulates 1000 samples of size n from a dis-
tribution F with mean θ, where n, F and θ are specified below. For each
sample compute the U -statistic for the parameter θ2 given by
$$U_n = 2[n(n-1)]^{-1} \sum_{i=2}^{n} \sum_{j=1}^{i-1} X_i X_j.$$

In Exercise 3, the projection of Un was also found. Compute the projection


on each sample as well, noting that the projection may depend on pop-
ulation parameters that you will need to compute. For each case make a
histogram of the 1000 values of each statistic and comment on how well
each one does with estimating θ2 . Also construct a scatterplot of the pairs
of each estimate computed on the sample in order to study how the two
statistics relate to one another. If the projection is the better estimator of
θ, what would prevent us from using it in practice? Repeat the experiment
for n = 5, 10, 25, 50, and 100 for each of the distributions listed below.

a. F is a N(0, 1) distribution.
b. F is a Uniform(0, 1) distribution.
c. F is a LaPlace(0, 1) distribution.
d. F is a Cauchy(0, 1) distribution.
e. F is a Exponential(1) distribution.

3. Write a program in R that simulates 1000 samples of size n from a distri-


bution F with location parameter θ, where n, F and θ are specified below.
For each sample use the sign test, the signed rank test, and the t-test to test
the null hypothesis H0 : θ ≤ 0 against the alternative hypothesis H1 : θ > 0
at the α = 0.10 significance level. Over the 1000 samples keep track of how
often the null hypothesis is rejected. Repeat the experiment for n = 10, 25, and 50 with θ = 0.0, σ/20, 2σ/20, . . . , σ, where σ is the standard deviation of
F . For each sample size and distribution plot the proportion of rejections
for each test against the true value of θ. Discuss how the power of these
tests relate to one another in terms of the results of Table 11.2.

a. F is a N(θ, 1) distribution.
b. F is a Uniform(θ − ½, θ + ½) distribution.
c. F is a LaPlace(θ, 1) distribution.
d. F is a Logistic(θ, 1) distribution.
e. F is a Triangular(−1 + θ, 1 + θ, θ) distribution.

4. Write a program in R that simulates five samples of size n from a distri-


bution F , where n and F are specified below. For each sample compute a
histogram estimate of the density f = F 0 using the bin-width bandwidth
estimator of Wand (1997), which is supplied by R. See Section B.2. For each
sample size and distribution plot the original density, along with the five
histogram density estimates. Discuss the estimates and how well they are
able to capture the characteristics of the true density. What type of char-
acteristics appear to be difficult to estimate? Does there appear to be areas
where the kernel density estimator has more bias? Are there areas where
the kernel density estimator appears to have a higher variance? Repeat the
experiment for n = 50, 100, 250, 500, and 1000.

a. F is a N(θ, 1) distribution.
b. F is a Uniform(0, 1) distribution.
c. F is a Cauchy(0, 1) distribution.
d. F is a Triangular(−1, 1, 0) distribution.
e. F corresponds to the mixture of a N(0, 1) distribution with a N(2, 1)
distribution. That is, F has corresponding density ½φ(x) + ½φ(x − 2).

5. Write a program in R that simulates five samples of size n from a distri-


bution F , where n and F are specified below. For each sample compute a
kernel density estimate of the density f = F 0 using the plug-in bandwidth
estimator supplied by R. See Section B.6. Plot the original density, along
with the five kernel density estimates on the same set of axes, making the
density estimate a different line type than the true density for clarity. Dis-
cuss the estimates and how well they are able to capture the characteristics
of the true density. What type of characteristics appear to be difficult to
estimate? Does there appear to be areas where the kernel density estimator
has more bias? Are there areas where the kernel density estimator appears
to have a higher variance? Repeat the experiment for n = 50, 100, 250,
500, and 1000.

a. F is a N(θ, 1) distribution.
b. F is a Uniform(0, 1) distribution.
c. F is a Cauchy(0, 1) distribution.
d. F is a Triangular(−1, 1, 0) distribution.
e. F corresponds to the mixture of a N(0, 1) distribution with a N(2, 1)
distribution. That is, F has corresponding density ½φ(x) + ½φ(x − 2).

6. Write a function in R that will simulate 100 samples of size n from the
distributions specified below. For each sample use the nonparametric boot-
strap algorithm based on b resamples to estimate the distribution function
Hn (t) = P [n1/2 (θ̂n − θ) ≤ t] for parameters and estimates specified below.
For each bootstrap estimate of Hn (t) compute d∞ (Ĥn , Hn ). The function
should return the sample mean of the values of d∞ (Ĥn , Hn ) taken over the
k simulated samples. Compare these observed means for the cases speci-
fied below, and relate the results to the consistency result given in Theorem
11.16. Also comment on the role that the population distribution and b have
on the results. Treat the results as a designed experiment and use an ap-
propriate linear model to find whether the population distribution, the pa-
rameter, b and n have a significant effect on the mean value of d∞ (Ĥn , Hn ).
For further details on using R to compute bootstrap estimates, see Section
B.4.16.

a. N(0, 1) distribution, θ is the population mean, θ̂n is the usual sample


mean, b = 10 and 100, and n = 5, 10, 25, and 50.
b. T(2) distribution, θ is the population mean, θ̂n is the usual sample mean,
b = 10 and 100, and n = 5, 10, 25, and 50.
c. Poisson(2) distribution, θ is the population mean, θ̂n is the usual sample
mean, b = 10 and 100, and n = 5, 10, 25, and 50.
d. N(0, 1) distribution, θ is the population variance, θ̂n is the usual sample
variance, b = 10 and 100, and n = 5, 10, 25, and 50.
e. T(2) distribution, θ is the population variance, θ̂n is the usual sample
variance, b = 10 and 100, and n = 5, 10, 25, and 50.
f. Poisson(2) distribution, θ is the population variance, θ̂n is the usual
sample variance, b = 10 and 100, and n = 5, 10, 25, and 50.

7. Write a function in R that will simulate 100 samples of size n from the distri-
butions specified below. For each sample use the nonparametric bootstrap
algorithm based on b = 100 resamples to estimate the standard error of the
sample median. For each distribution and sample size, make a histogram
of these bootstrap estimates and compare the results with the asymptotic
standard error for the sample median given in Theorem 11.19. Comment on
how well the bootstrap estimates the standard error in each case. For fur-
ther details on using R to compute bootstrap estimates, see Section B.4.16.
Repeat the experiment for n = 5, 10, 25, and 50.
a. F is a N(0, 1) distribution.
b. F is an Exponential(1) distribution.
c. F is a Cauchy(0, 1) distribution.
d. F is a Triangular(−1, 1, 0) distribution.
e. F is a T(2) distribution.
f. F is a Uniform(0, 1) distribution.
APPENDIX A

Useful Theorems and Notation

A.1 Sets and Set Operators

Suppose Ω is the universal set, that is, the set that contains all of the elements
of interest. Membership of an element to a set is indicated by the ∈ relation.
Hence, a ∈ A indicates that the element a is contained in the set A. The
relation ∈/ indicates that an element is not contained in the indicated set. A
set A is a subset of Ω if all of the elements in A are also in Ω. This relationship
will be represented with the notation A ⊂ Ω. Hence A ⊂ Ω if and only if a ∈ A
implies a ∈ Ω. If A and B are subsets of Ω then A ⊂ B if all the elements
in A are also in B, that is A ⊂ B if and only if a ∈ A implies a ∈ B.
The union of two sets A and B is a set that contains all elements that are
either in A or B or both sets. This set will be denoted by A ∪ B. Therefore
A ∪ B = {ω ∈ Ω : ω ∈ A or ω ∈ B}. The intersection of two sets A and B is a
set that contains all elements that are common to both A and B. This set will
be denoted by A ∩ B. Therefore A ∩ B = {ω ∈ Ω : ω ∈ A and ω ∈ B}. The
complement of a set A is denoted by Aᶜ and is defined as Aᶜ = {ω ∈ Ω : ω ∉ A}, which is the set that contains all the elements in Ω that are not in A. If A ⊂ B then the elements of A can be subtracted from B using the operation B\A = {ω ∈ B : ω ∉ A} = Aᶜ ∩ B.

Unions and intersections distribute in much the same way as sums and prod-
ucts do.
Theorem A.1. Let {Ak }nk=1 be a sequence of sets and let B be another set.
Then
$$B \cup \left( \bigcap_{k=1}^{n} A_k \right) = \bigcap_{k=1}^{n} (A_k \cup B),$$
and
$$B \cap \left( \bigcup_{k=1}^{n} A_k \right) = \bigcup_{k=1}^{n} (A_k \cap B).$$

Taking complements over intersections or unions changes the operations to


unions and intersections, respectively. These results are usually called De Mor-
gan’s Laws.

Theorem A.2 (De Morgan). Let {Ak }nk=1 be a sequence of sets. Then
$$\left( \bigcap_{k=1}^{n} A_k \right)^c = \bigcup_{k=1}^{n} A_k^c,$$
and
$$\left( \bigcup_{k=1}^{n} A_k \right)^c = \bigcap_{k=1}^{n} A_k^c.$$

The number systems have their usual notation. That is, N will denote the
natural numbers, Z will denote the integers, and R will denote the real num-
bers.

A.2 Point-Set Topology

Some results in this book rely on some fundamental concepts from metric
spaces and point-set topology. A detailed review of this subject can be found
in Binmore (1981). Consider a space Ω and a metric δ. Such a pairing is known
as a metric space.
Definition A.1. A metric space consists of a set Ω and a function ρ : Ω×Ω →
R, where ρ satisfies

1. ρ(x, y) ≥ 0 for all x ∈ Ω and y ∈ Ω.


2. ρ(x, y) = 0 if and only if x = y.
3. ρ(x, y) = ρ(y, x) for all x ∈ Ω and y ∈ Ω.
4. ρ(x, z) ≤ ρ(x, y) + ρ(y, z) for all x ∈ Ω, y ∈ Ω, and z ∈ Ω.

Let ω ∈ Ω and A ⊂ Ω, then we define the distance from the point ω to the
set A as
δ(ω, A) = inf d(ω, a).
a∈A
It follows that for any non-empty set A, there exists at least one point in
Ω such that d(ω, A) = 0, noting that the fact that d(ω, A) = 0 does not
necessarily imply that ω ∈ A. This allows us to define the boundary of a set.
Definition A.2. Suppose that A ⊂ Ω. A boundary point of A is a point ω ∈ Ω
such that d(ω, A) = 0 and d(ω, Ac ) = 0. The set of all boundary points of a
set A is denoted by ∂A.

Note that ∂A = ∂Ac and that ∂∅ = ∅. The concept of boundary points now
makes it possible to define open and closed sets.
Definition A.3. A set A ⊂ Ω is open if ∂A ⊂ Ac . A set A ⊂ Ω is closed if
∂A ⊂ A.
It follows that A is closed if and only if δ(ω, A) = 0 implies that ω ∈ A. It
also follows that ∅ and Ω are open, the union of the collection of open sets is
open, and that the intersection of any finite collection of open sets is open.
On the other hand, ∅ and Ω are also closed, the intersection of any collection
of closed sets is closed, and the finite union of any set of closed sets is closed.
Definition A.4. Let A be a subset of Ω in a metric space. The interior of A
is defined as A◦ = A \ ∂A. The closure of A is defined as A− = A ∪ ∂A.

It follows that for any subsets A and B of Ω that ω ∈ A− if and only if


δ(ω, A) = 0, if A ⊂ B then A− ⊂ B − , and that A− is the smallest closed set
containing A. For properties about interior sets, one needs only to note that
[(Ac )− ]c = A◦ .

A.3 Results from Calculus

The results listed below are results from basic calculus referred to in this book.
These results, along with their proofs, can be found in Apostol (1967).
Theorem A.3. Let f be an integrable function on the interval [a, x] for each
x ∈ [0, b]. Let c ∈ [a, b] and define
Z x
F (x) = f (t)dt
c
0
for x ∈ [a, b]. Then F (x) exists at each point x ∈ (a, b) where f is continuous
and F 0 (x) = f (x).
Theorem A.4. Suppose that f and g are integrable functions with at least
one derivative on the interval [a, b]. Then
Z b Z b
0
f (x)g (x)dx = f (b)g(b) − f (a)g(a) − f 0 (x)g(x)dx.
a a
Theorem A.5. Suppose that f and w are continuous functions on the interval
[a, b]. If w does not change sign on [a, b] then
Z b Z b
w(x)f (x)dx = f (ξ) w(x)dx,
a a

for some ξ ∈ [a, b].


Theorem A.6. Suppose that f is any integrable function and R ⊂ R. Then
$$\left| \int_R f(t)\,dt \right| \leq \int_R |f(t)|\,dt.$$

Theorem A.7. If both f and g are integrable on the real interval [a, b] and
f (x) ≤ g(x) for every x ∈ [a, b] then
Z b Z b
f (x)dx ≤ g(x)dx.
a a
It is helpful to note that the results from integral calculus transfer to expec-
tations as well. For example, if X is a random variable with distribution F ,
and f and g are Borel functions such that f (X(ω)) ≤ g(X(ω)) for all ω ∈ Ω,
then
$$E[f(X)] = \int f(X(\omega))\,dF(\omega) \leq \int g(X(\omega))\,dF(\omega) = E[g(X)]. \tag{A.1}$$
In many cases when we are dealing with random variables, we may use results that have slightly weaker conditions than those from classical calculus. For
example, the result of Equation (A.1) remains true under the assumption
that P [f (X(ω)) ≤ g(X(ω))] = 1.

A.4 Results from Complex Analysis

While complex analysis does not usually play a large role in the theory of
probability and statistics, many of the arguments in this book are based on
characteristic functions, which require a basic knowledge of complex analysis.
Complete reviews of complex analysis can be found in Ahlfors (1979) and
Conway (1975). In this section x will denote a complex number of the form
x1 + ix2 ∈ C where i = (−1)1/2 .
Definition A.5. The absolute value or modulus of x ∈ C is |x| = [x21 +x22 ]1/2 .
Definition A.6 (Euler). The complex value exp(iy), where y ∈ R, can be
written as cos(y) + i sin(y).
Theorem A.8. Suppose that x ∈ C such that |x| ≤ ½. Then |log(1 − x) + x| ≤ |x²|.
Theorem A.9. Suppose that x ∈ C and y ∈ C, then |exp(x) − 1 − y| ≤ (|x − y| + ½|y|²) exp(γ) where γ ≥ |x| and γ ≥ |y|.
The following result is useful for obtaining bounds involving characteristic
functions.
Theorem A.10. For x ∈ C and y ∈ C we have that |xn − y n | ≤ n|x − y|z n−1
if |x| ≤ z and |y| ≤ z, where z ∈ R.
Theorem A.11. Let n ∈ N and y ∈ R. Then
$$\left| \exp(iy) - \sum_{k=0}^{n} \frac{(iy)^k}{k!} \right| \leq \frac{2|y|^n}{n!},$$
and
$$\left| \exp(iy) - \sum_{k=0}^{n} \frac{(iy)^k}{k!} \right| \leq \frac{|y|^{n+1}}{(n+1)!}.$$

A proof of Theorem A.11 can be found in the Appendix of Gut (2005) or in


Section 2.3 of Kolassa (2006).
The integration of complex functions is simplified in this book because we are
always integrating with respect to the real line, and not the complex plane.
Theorem A.12. Let f be an integrable function that maps the real line to
the complex plane. Then
$$\left| \int_{-\infty}^{\infty} f(x)\,dx \right| \leq \int_{-\infty}^{\infty} |f(x)|\,dx.$$

A.5 Probability and Expectation

Theorem A.13. Let {Bn }kn=1 be a partition of a sample space Ω and let A
be any other event in Ω. Then
k
X
P (A) = P (A|Bn )P (Bn ).
n=1

Theorem A.14. Let X be a random variable such that P (X = 0) = 1. Then


E(X) = 0.
Theorem A.15. Let X be a random variable such that P (X ≥ 0) = 1. If
E(X) = 0 then P (X = 0) = 1.
Theorem A.16. Let X and Y be random variables such that X ≤ Y with
probability one. Then E(X) ≤ E(Y ).
Theorem A.17. Let X and Y be any two random variables. Then E(X) =
E[E(X|Y )].

A.6 Inequalities

Theorem A.18. Let x and y be real numbers, then |x + y| ≤ |x| + |y|.


Theorem A.19. Let x and y be positive real numbers. Then |x + y| ≤
2 max{x, y}.

The following inequalities can be considered an extension of the Theorem A.18


to powers of sums. The results can be proven using the properties of convex
functions.
Theorem A.20. Suppose that x, y and r are positive real numbers. Then

1. (x + y)r ≤ 2r (xr + y r ) when r > 0.


2. (x + y)r ≤ xr + y r when 0 < r ≤ 1.
3. (x + y)r ≤ 2r−1 (xr + y r ) when r ≥ 1.

A proof of Theorem A.20 can be found in Section A.5 of Gut (2005).


Theorem A.21. For any x ∈ R, exp(x) ≥ x + 1, and for x > 0, exp(−x) − 1 + x ≤ ½x².

A proof of Theorem A.21 can be found in Section A.1 of Gut (2005).


A.7 Miscellaneous Mathematical Results

Theorem A.22. Let x and y be real numbers and n ∈ N. Then
$$(x + y)^n = \sum_{i=0}^{n} \binom{n}{i} x^i y^{n-i} = \sum_{i=0}^{n} \frac{n!}{(n-i)!\,i!}\, x^i y^{n-i}.$$

Theorem A.23. Suppose that x is a real value such that |x| < 1. Then
$$\sum_{k=0}^{\infty} x^k = (1 - x)^{-1}.$$

Theorem A.24. If a and b are positive real numbers, then xb exp(−ax) → 0


as x → ∞.

A.8 Discrete Distributions

A.8.1 The Bernoulli Distribution

A random variable X has a Bernoulli(p) distribution if the probability dis-


tribution function of X is given by
(
px (1 − p)1−x for x ∈ {0, 1}
f (x) =
0 otherwise,

where p ∈ (0, 1). The expectation and variance of X are p and p(1 − p),
respectively. The moment generating function of X is m(t) = 1 − p + p exp(t)
and the characteristic function of X is ψ(t) = 1 − p + p exp(it).

A.8.2 The Binomial Distribution

A random variable X has a Binomial(n, p) distribution if the probability


distribution function of X is given by
( 
n x n−x
f (x) = x p (1 − p) for x ∈ {0, 1, . . . , n}
0 otherwise,

where n ∈ N and p ∈ (0, 1). The expectation and variance of X are np and
np(1−p), respectively. The moment generating function of X is m(t) = [1−p+
p exp(t)]n and the characteristic function of X is ψ(t) = [1 − p + p exp(it)]n . A
Bernoulli(p) random variable is a special case of a Binomial(n, p) random
variable with n = 1.
A.8.3 The Geometric Distribution

A random variable X has a Geometric(θ) distribution if the probability


distribution function of X is given by
(
θ(1 − θ)x−1 x ∈ N
f (x) =
0 elsewhere.

The expectation and variance of X are θ−1 and θ−2 (1 − θ), respectively. The
moment generating function of X is m(t) = [1 − (1 − θ) exp(t)]θ exp(t) and
the characteristic function of X is ψ(t) = [1 − (1 − θ) exp(it)]θ exp(it).

A.8.4 The Multinomial Distribution

A d-dimensional random vector X has a Multinomial(n, d, p) distribution


if the joint probability distribution function of X is given by
$$f(x) = \begin{cases} n! \displaystyle\prod_{k=1}^{d} (n_k!)^{-1} p_k^{n_k} & \text{for } \sum_{k=1}^{d} n_k = n \text{ and } n_k \geq 0 \text{ for } k \in \{1, \ldots, d\}, \\ 0 & \text{otherwise.} \end{cases}$$

The mean vector of X is np and the covariance matrix of X has (i, j)th element
Σij = npi (δij − pj ), for i = 1, . . . , d and j = 1, . . . , d.

A.8.5 The Poisson Distribution

A random variable X has a Poisson(λ) distribution if the probability distri-


bution function of X is given by
( x
λ exp(−λ)
x! for x ∈ {0, 1, . . . }
f (x) =
0 otherwise,

where λ, which is called the rate, is a positive real number. The expectation
and variance of X are λ. The moment generating function of X is m(t) =
exp{λ[exp(t) − 1]}, the characteristic function of X is ψ(t) = exp{λ[exp(it) −
1]}, and the cumulant generating function of X is
∞ k
X t
c(t) = λ[exp(t) − 1] = λ .
k!
k=1
A.8.6 The (Discrete) Uniform Distribution

A random variable X has a Uniform{1, 2, . . . , u} distribution if the proba-


bility distribution function of X is given by
(
u−1 for x ∈ {1, 2, . . . , u},
f (x) =
0 otherwise,

where u is a positive integer. The expectation and variance of X are ½(u + 1) and (1/12)(u + 1)(u − 1), respectively. The moment generating function of X is
exp(t)[1 − exp(ut)]
m(t) = ,
u[1 − exp(t)]
and the characteristic function of X is
exp(it)[1 − exp(uit)]
ψ(t) = .
u[1 − exp(it)]

A.9 Continuous Distributions

A.9.1 The Beta Distribution

A random variable X has an Beta(α, β) distribution if the density function


of X is given by
(
1
xα−1 (1 − x)β−1 for x ∈ (0, 1)
f (x) = B(α,β)
0 elsewhere,
where α and β are positive real numbers and B(α, β) is the beta function
given by
Z 1
Γ(α)Γ(β)
B(α, β) = = xα−1 (1 − x)β−1 dx.
Γ(α + β) 0
The expectation and variance of X are α/(α + β) and
αβ
,
(α + β)2 (α + β + 1)
respectively.

A.9.2 The Cauchy Distribution

A random variable X has a Cauchy(α, β) distribution if the density function


of X is given by
$$f(x) = \left\{ \pi\beta \left[ 1 + \left( \frac{x - \alpha}{\beta} \right)^2 \right] \right\}^{-1},$$
for all x ∈ R. The moments and cumulants of X do not exist. The moment
generating function of X does not exist. The characteristic function of X is
exp(itα − |t|β).

A.9.3 The Chi-Squared Distribution

A random variable X has an ChiSquared(ν) distribution if the density func-


tion of X is given by
$$f(x) = \frac{x^{(\nu-2)/2} \exp(-\tfrac{1}{2}x)}{2^{\nu/2}\,\Gamma(\tfrac{1}{2}\nu)},$$
where x > 0 and ν is a positive integer. The mean of X is ν and the variance of X is 2ν. The moment generating function of X is m(t) = (1 − 2t)^{-ν/2}, the characteristic function of X is ψ(t) = (1 − 2it)^{-ν/2}, and the cumulant generating function is c(t) = −½ν log(1 − 2t).

A.9.4 The Exponential Distribution

A random variable X has an Exponential(β) distribution if the density


function of X is given by
$$f(x) = \begin{cases} \beta^{-1} \exp(-x/\beta) & \text{for } x > 0 \\ 0 & \text{otherwise,} \end{cases}$$

where β > 0. The expectation and variance of X are β and β 2 , respectively.


The moment generating function of X is m(t) = (1 − tβ)−1 , the characteristic
function of X is ψ(t) = (1 − itβ)−1 , and the cumulant generating function is
c(t) = − log(1 − tβ).

A.9.5 The Gamma Distribution

A random variable X has a Gamma(α, β) distribution if the density function


of X is given by
$$f(x) = \begin{cases} \dfrac{1}{\Gamma(\alpha)\beta^{\alpha}}\, x^{\alpha-1} \exp(-x/\beta) & \text{for } x > 0 \\ 0 & \text{otherwise,} \end{cases}$$

where α > 0 and β > 0. The expectation and variance of X are αβ and αβ 2 ,
respectively. The moment generating function of X is m(t) = (1 − tβ)−α and
the characteristic function of X is ψ(t) = (1 − itβ)−α .
A.9.6 The LaPlace Distribution

A random variable X has a LaPlace(α, β) distribution if the distribution function of X is given by
$$F(x) = \begin{cases} \tfrac{1}{2} \exp\!\left( \dfrac{x - \alpha}{\beta} \right), & x < \alpha \\ 1 - \tfrac{1}{2} \exp\!\left( -\dfrac{x - \alpha}{\beta} \right), & x \geq \alpha. \end{cases}$$

The expectation and the variance of X are α and 2β 2 , respectively. The mo-
ment generating function of X is m(t) = (1 − t2 β 2 )−1 exp(tα) and the char-
acteristic function of X is ψ(t) = (1 + t2 β 2 )−1 exp(itα).

A.9.7 The Logistic Distribution

A random variable X has an Logistic(µ, σ) distribution if the density func-


tion of X is given by
f (x) = σ −1 {1 + exp[−σ −1 (x − µ)]}−2 exp[−σ −1 (x − µ)],
for all x ∈ R. The expectation and the variance of X are µ and ⅓π²σ², respectively.

A.9.8 The Lognormal Distribution

A random variable X has a Lognormal(µ, σ) distribution if the density function of X is given by
$$f(x) = \begin{cases} \dfrac{1}{x\sigma(2\pi)^{1/2}} \exp\!\left\{ -\dfrac{[\log(x) - \mu]^2}{2\sigma^2} \right\} & \text{for } x > 0 \\ 0 & \text{elsewhere,} \end{cases}$$
where µ is a real number and σ is a positive real number. The expectation and
variance of X are exp(µ + 12 σ 2 ) and exp(2µ + σ 2 )[exp(σ 2 ) − 1], respectively.

A.9.9 The Multivariate Normal Distribution

A d-dimensional random vector X has a N(d, µ, Σ) distribution if the density


function of X is given by
$$f(x) = (2\pi)^{-d/2} |\Sigma|^{-1/2} \exp[-\tfrac{1}{2}(x - \mu)'\Sigma^{-1}(x - \mu)],$$
where µ is a d-dimensional real vector and Σ is a d × d covariance matrix. The expectation and covariance of X are µ and Σ, respectively. The moment generating function of X is m(t) = exp(µ′t + ½t′Σt). The characteristic function of X is ψ(t) = exp(iµ′t − ½t′Σt).
A.9.10 The Non-Central Chi-Squared Distribution

A random variable X has a ChiSquared(ν, δ) distribution if the density function of X is given by
$$f(x) = \begin{cases} 2^{-\nu/2} \exp[-\tfrac{1}{2}(x + \delta)] \displaystyle\sum_{k=0}^{\infty} \frac{x^{\nu/2+k-1}\,\delta^k}{\Gamma(\tfrac{\nu}{2} + k)\, 2^{2k}\, k!} & \text{for } x > 0, \\ 0 & \text{elsewhere.} \end{cases}$$
The expectation and variance of X are ν + δ and 2(ν + 2δ), respectively. When
δ = 0 the distribution is equivalent to a ChiSquared(ν) distribution.

A.9.11 The Normal Distribution

A random variable X has a N(µ, σ) distribution if the density function of X is given by
$$f(x) = (2\pi\sigma^2)^{-1/2} \exp[-(x - \mu)^2/(2\sigma^2)] \quad \text{for } x \in \mathbb{R},$$
where µ ∈ R and σ > 0. The expectation and variance of X are µ and σ², respectively. The moment generating function of X is m(t) = exp(tµ + ½σ²t²) and the characteristic function of X is ψ(t) = exp(itµ − ½σ²t²). A standard normal random variable is a normal random variable with µ = 0 and σ² = 1.

A.9.12 Student’s t Distribution

A random variable X has a T(ν) distribution if the density function of X is


given by
$$f(x) = \frac{\Gamma[\tfrac{1}{2}(\nu + 1)]}{\Gamma(\tfrac{1}{2}\nu)} (\pi\nu)^{-1/2} (1 + x^2\nu^{-1})^{-(\nu+1)/2},$$
where x ∈ R and ν is a positive integer known as the degrees of freedom. Assuming that ν > 1, the mean of X is 0, and if ν > 2 then the variance of X is ν(ν − 2)⁻¹. The moment generating function of X does not exist.

A.9.13 The Triangular Distribution

A random variable X has a Triangular(α, β, γ) distribution if the density


function of X is given by

−1
2[(β − α)(γ − α)] (x − α) x ∈ (α, γ)

−1
f (x) = 2[(β − α)(β − γ)] (β − x) x ∈ [γ, β)

0 elsewhere,

where α ∈ R, β ∈ R, γ ∈ R such that α ≤ γ ≤ β. The expectation and


1
the variance of X are 31 (α + β + γ) and 18 (α2 + β 2 + γ 2 − αβ − αγ − βγ),
respectively. The moment generating function of X is
2[(β − γ) exp(tα) − (β − α) exp(tγ) + (γ − α) exp(tβ)]
m(t) = ,
(β − α)(γ − α)(β − γ)t2
and the characteristic function of X is
−2[(β − γ) exp(itα) − (β − α) exp(itγ) + (γ − α) exp(itβ)]
ψ(t) = .
(β − α)(γ − α)(β − γ)t2

A.9.14 The (Continuous) Uniform Distribution

A random variable X has a Uniform(α, β) distribution if the density function


of X is given by
(
(β − α)−1 for α < x < β
f (x) =
0 otherwise,
where α ∈ R and β ∈ R such that α < β. The expectation and variance of X are ½(α + β) and (1/12)(β − α)², respectively. The moment generating function of X is
−1
m(t) = [t(β − α)] [exp(tβ) − exp(tα)] and the characteristic function of X is
ψ(t) = [it(β − α)]−1 [exp(itβ) − exp(itα)].

A.9.15 The Wald Distribution

A random variable X has a Wald(µ, λ) distribution if the density function of


X is given by
$$f(x) = \begin{cases} \lambda^{1/2}(2\pi)^{-1/2} x^{-3/2} \exp[-\tfrac{1}{2}\lambda\mu^{-2}x^{-1}(x - \mu)^2] & x > 0 \\ 0 & \text{elsewhere,} \end{cases}$$
where µ > 0 and λ > 0. The expectation and variance of X are µ and µ³λ⁻¹, respectively. The moment generating function of X is m(t) = exp{λµ⁻¹[1 − (1 − 2λ⁻¹µ²t)^{1/2}]} and the characteristic function of X is ψ(t) = exp{λµ⁻¹[1 − (1 − 2λ⁻¹µ²it)^{1/2}]}.
APPENDIX B

Using R for Experimentation

B.1 An Introduction to R

The statistical software package R is a statistical computing environment that


is similar in implementation to the S package developed at Bell Laboratories
by John Chambers and his colleagues. The R package is a GNU project and
is available under a free software license. Open source code for R, as well as
compiled implementations for Unix, Linux, OS X, and Windows are available
from www.r-project.org. A major strength of the R statistical computing
environment is the ability to simulate data from numerous distributions. This
allows R to be easily used for simulations and other types of statistical exper-
iments.
This appendix provides some useful tips for using R as a tool for visualizing
asymptotic results and is intended to be an aid to those wishing to solve
the experimental exercises in the book. The appendix assumes a basic working
knowledge of R, though many of the basic ideas used in R are reviewed along
the way.

B.2 Basic Plotting Techniques

The results of simulations are most effectively understood using visual repre-
sentations such as plots and histograms. The basic mechanism for plotting a
set of data pairs in R is the plot function:

plot(x,y,type,xlim,ylim,main,xlab,ylab,lty,pch,col)

Technically, the plot function has the header plot(x,y,...) where the op-
tional arguments type, main, xlab, ylab, and lty are passed to the par
function. The result of the plot function is to send a plot of the pairs given
by the arguments x and y to the current graphics device, usually a separate
graphics window, depending on the specific way your version of R has been
set up. If no optional arguments are used then the plot is a simple scatterplot
and the labels for the horizontal and vertical axes are taken to be the names
of the objects passed to the function for the arguments x and y.

The optional argument type specifies what type of plot should be constructed.
Some of the possible values of type, along with the resulting type of plot
produced are

"p", which produces a scatterplot of individual points;

"l", which connects the specified points with lines, but does not plot the
points themselves;

"b", which plots both the lines and the points as described in the two options
given above; and

"n", which sets up the axes for the plot, but does not actually plot any values.

The specification type="n" can be used to set up a pair of axes upon which
other objects will be plotted later. If the type argument is not used, then
the plot will use the value stored by the par command, which in most cases
corresponds to the option type="p". The current settings can be viewed by
executing the par command without any arguments.

The arguments xlim and ylim specify the range of the horizontal and vertical
axes, respectively. The range is expressed by an array of length two whose
first element corresponds to the minimum value for the axis and whose second
component corresponds to the maximum value for the axis. For example, the
specification xlim=c(0,1) specifies that the axis should have a range from zero
to one. If these arguments are not specified then R uses a specific algorithm
to compute what these ranges should be based on the ranges of the specified
data. In most cases R does a good job of selecting ranges that make the plot
visually appealing. However, when many sets of data are plotted on a single
set of axes, as will be discussed later in this section, the ranges of the axes
for the initial plot may need to be specified so that the axes are sufficient to
contain all of the objects to be plotted. If ranges are specified and points lie
outside the specified ranges, then the plot is still produced with the specified
ranges, and R will usually return a warning for each point that it encounters
that is outside the specified ranges.

The arguments main, xlab, and ylab specify the main title, the label for the
horizontal axis, and the label for the vertical axis, respectively.

The argument lty specifies the type of line used when the argument type="l"
or type="b" is used. The line types can either be specified as an integer or as
a character string. The possible line types include

Integer    Character String    Line Type Produced
0 "blank" No line is drawn
1 "solid" Solid line
2 "dashed" Dashed line
3 "dotted" Dotted line
4 "dotdash" Dots alternating with dashes
5 "longdash" Long dashes
6 "twodash" Two dashes followed by blank space

Alternatively, a character string of up to 8 characters may be specified, giving
the lengths of line segments which are alternately drawn and skipped. Consult
the help pages for the par command for further details on this specification.
The argument pch specifies the symbol or character to be used for plotting
points when the argument type="p" or type="b" is used. This can be either
specified by a single character or by an integer. The integers 1–18 specify the
set of plotting symbols originally used in the S software package. In addition,
there is a special set of R plotting symbols which can be obtained by specifying
integers between 19 and 25. These specifications produce the following
symbols:

Argument Symbol
pch=19 solid circle
pch=20 small circle
pch=21 circle
pch=22 square
pch=23 diamond
pch=24 triangle point-up
pch=25 triangle point down

Other options are also available. See the R help page for the points command
for further details.
The col argument specifies the color used for plotting. Colors can be specified
in several different ways. The simplest way is to specify a character string that
contains the name of the color. Some examples of common character strings
that R understands are "red", "blue", "green", "orange", and "black". A
complete list of the possible colors can be obtained by executing the function
colors with no arguments. Another option is to use one of the many color
specification functions provided by R. For example, a gray color can be spec-
ified using the gray(level) function where the argument level is set to a
number between 0 and 1 that specifies how dark the gray shading should be.
Alternatively, colors can be specified directly in terms of their red-green-blue
(RGB) components with a character string of the form "#RRGGBB", where each
of the pairs RR, GG, BB are of two digit hexadecimal numbers giving a value
between 00 and FF. For example, specifying the argument col="#FF0000" is
equivalent to using the argument col="red".
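As a brief illustration of these arguments, the following sketch (the function and
labels are chosen only for this example) plots a dashed blue sine curve with a
title and custom axis labels:

x <- seq(0,2*pi,0.01)
plot(x,sin(x),type="l",lty=2,col="blue",main="A Sine Curve",
     xlab="x",ylab="sin(x)",ylim=c(-1,1))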
Many of the results shown in this book consist of plots of more than one
set of data on a single plot. There are many ways that this can be achieved
using R, but probably the simplest approach is to take advantage of the two
functions points and lines. The points function adds points to the current
plot at points specified by two arguments x and y as in the plot function.
The optional arguments pch and col can also be used with this function to
change the plotting symbol or the color of the plotting symbol. The lines
function adds lines to the current plot that connect the points specified by
the two arguments x and y. The optional arguments lty and col can also be
used with this function.
It is important to note that when plotting multiple sets of data on the same
set of axes, the range of the two axes, which are set in the original plot func-
tion, should be sufficient to handle all of the points specified in the multiple
plots. For example, suppose that we wish to plot linear, quadratic, and cubic
functions for a range of x values between 0 and 2, all on the same set of axes,
using different line types for each function. If we execute the commands
x <- seq(0,2,0.001)
y1 <- x
y2 <- x^2
y3 <- x^3
plot(x,y1,type="l",lty=1)
lines(x,y2,lty=2)
lines(x,y3,lty=3)
the resulting plot will cut off the quadratic and cubic functions because the
original plot command set up the vertical axis based on the range of y1. To
fix this problem we need only find the minimum and maximum values before
we execute the original plot command, and then specify this range in the plot
command. That is
x <- seq(0,2,0.001)
y1 <- x
y2 <- x^2
y3 <- x^3
yl <- c(min(y1,y2,y3),max(y1,y2,y3))
plot(x,y1,type="l",lty=1,ylim=yl)
lines(x,y2,lty=2)
lines(x,y3,lty=3)
The final plotting function that will generally be helpful in performing the ex-
periments suggested in this book is the hist function, which plots a histogram
of a set of data. The usage of the hist function is given by
hist(x, breaks = "Sturges", freq = NULL, right = T, col = NULL,
border = NULL, main, xlim, ylim, xlab, ylab)

The arguments of the function are

x is a vector of values for which the histogram will be plotted.


breaks specifies how many cells are to be used when plotting the histogram.
This argument can also specify the location of the cell endpoints. breaks
can be one of the following:

• A vector giving the location of the endpoints of the histogram cells.


• A single number specifying the number of cells to be used when plotting
the histogram.
• A character string naming an algorithm to compute the number of cells
to be used when plotting the histogram.
• A function to compute the number of cells to be used when plotting the
histogram.

In all but the first case, R uses the specified number as a suggestion only.
The only way to force R to use the number you wish is to specify the
endpoints of the cells. The default value of breaks specifies the algorithm
"Sturges". Other possible algorithm names are "FD" and "Scott". Consult
the R help page for the hist function for further details on these algorithms.
freq is a logical argument that specifies whether a frequency or density
histogram is plotted. If freq=T is specified, then a frequency histogram is
plotted. If freq=F is specified then a density histogram is plotted. In this
case the histogram has a total area of one. When comparing a histogram to
a known density on the same plot, using a density histogram usually gives
better results.
right is a logical argument that specifies whether the endpoints of the cells
are included on the left or right hand side of the cell interval. If right=T
is specified then the cells include the right endpoint but not the left. If
right=F is specified then the cells include the left endpoint but not the
right.
col specifies the color of the bars plotted on the histogram. The default value
of NULL yields unfilled bars.
border specifies the color of the border around the bars. The default is to
use the color used for plotting the axes.
main, xlab, ylab can be used to specify the main title, the label for the
horizontal axis, and the label for the vertical axis, respectively.
xlim and ylim can be used to specify the ranges of the horizontal and vertical
axes. These options are useful when overlaying a histogram with a plot of
a density for comparison.
Wand (1997) argues that many schemes for selecting the number of bins im-
plemented in standard statistical software usually use too few bins. A more
reasonable method for selecting the bin width of a histogram is provided in the
KernSmooth library and has the form dpih(x), where x is the observed vector
of data and we have omitted many technical arguments which have reason-
able default values. In practice this function estimates the optimal width of the
bins based on the observed data, and not the number of bins. Therefore, the
endpoints of the histogram classes must be calculated to implement this func-
tion. For example, if we wish to simulate a sample of size 100 from a N(0, 1)
distribution and create a histogram of the data based on the methodology of
Wand (1997), then we can use the following code:

library(KernSmooth)
x <- rnorm(100)
h <- dpih(x)
bins <- seq(min(x)-h,max(x)+2*h,by=h)
hist(x,breaks=bins)

B.3 Complex Numbers

The standard R package has the ability to handle complex numbers in a na-
tive format. The basic function used for creating complex valued vectors is the
complex function. The real and imaginary parts of a complex vector can be re-
covered using the functions Re and Im. The modulus of a complex number can
be obtained using the Mod function. The usage of these functions is summarized
below. Further information about these functions, including some optional ar-
guments not summarized below, can be found at www.r-project.org.

complex(length.out=0, real, imaginary): creates a vector of complex numbers
whose real parts are stored in real and whose imaginary parts are stored in
imaginary; the argument length.out specifies the minimum length of the
resulting vector.
Re(x): returns a vector of the same size as x, whose elements correspond to
the real parts of the complex vector x.
Im(x): returns a vector of the same size as x, whose elements correspond to
the imaginary parts of the complex vector x.
Mod(x): returns a vector of the same size as x, whose elements correspond
to the moduli of the elements of the complex vector x.
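For example, the following short commands (a minimal sketch) create a complex
vector of length two and recover its real parts, imaginary parts, and moduli:

z <- complex(real=c(1,0),imaginary=c(1,2))
Re(z)   # the real parts, 1 and 0
Im(z)   # the imaginary parts, 1 and 2
Mod(z)  # the moduli of 1+1i and 0+2i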

B.4 Standard Distributions and Random Number Generation

The R package includes functions that can compute the density (or probability
distribution function), distribution function, and quantile function for many
standard distributions including nearly all of the distributions used in this
book. There are also functions that will easily generate random samples from
these distributions as well. This section provides information about the R
functions for each of the distributions used in this book. Further information
about these functions can be found at www.r-project.org.

B.4.1 The Bernoulli and Binomial Distributions

dbinom(x, size, prob, log=F)


Calculates the probability P (X = x) where X has a Binomial distribution
based on size independent Bernoulli experiments with success proba-
bility prob. The optional argument log indicates whether the logarithm of
the probability should be returned. The Bernoulli distribution is imple-
mented by specifying size=1.
pbinom(q, size, prob, lower.tail=T, log.p=F)
Calculates the cumulative probability P (X ≤ q) where X has a Bino-
mial distribution based on size independent Bernoulli experiments with
success probability prob. The optional argument lower.tail indicates
whether P (X ≤ q) should be returned (the default) or if P (X > q) should
be returned. The optional argument log.p indicates whether the natural
logarithm of the probability should be returned. The Bernoulli distribu-
tion is implemented by specifying size=1.
qbinom(p, size, prob, lower.tail=T, log.p=F)
Returns the pth quantile of a Binomial distribution based on size indepen-
dent Bernoulli experiments with success probability prob. The optional
argument lower.tail indicates whether p = P (X ≤ x) (the default) or
if p = P (X > x). The optional argument log.p indicates whether p is
the natural logarithm of the probability. The Bernoulli distribution is
implemented by specifying size=1.
rbinom(n, size, prob)
Generates a random sample of size n from a Binomial distribution based
on size independent Bernoulli experiments with success probability
prob. The Bernoulli distribution is implemented by specifying size=1.
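As a quick illustration of the d, p, q, and r naming convention shared by the
functions in this section, the following calls use a Binomial distribution based
on 10 Bernoulli experiments with success probability one half (the particular
values are chosen only for illustration):

dbinom(3,size=10,prob=0.5)     # P(X = 3)
pbinom(3,size=10,prob=0.5)     # P(X <= 3)
qbinom(0.95,size=10,prob=0.5)  # the smallest x such that P(X <= x) >= 0.95
rbinom(5,size=1,prob=0.5)      # five Bernoulli(0.5) observations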

B.4.2 The Beta Distribution

dbeta(x, shape1, shape2, log = F)


Calculates the density function f (x) of a random variable X that has a
Beta(α, β) distribution with the α parameter equal to shape1 and the β
parameter equal to shape2. The optional argument log indicates whether
the logarithm of the density should be returned.
pbeta(q, shape1, shape2, lower.tail = T, log.p = F)
Calculates the cumulative probability P (X ≤ q) where X has a Beta(α, β)
distribution with the α parameter equal to shape1 and the β parameter
equal to shape2. The optional argument lower.tail indicates whether
P (X ≤ q) should be returned (the default) or if P (X > q) should be
returned. The optional argument log.p indicates whether the natural log-
arithm of the probability should be returned.
qbeta(p, shape1, shape2, lower.tail=T, log.p = F)
Returns the pth quantile of a Beta(α, β) distribution with the α param-
eter equal to shape1 and the β parameter equal to shape2. The optional
argument lower.tail indicates whether p = P (X ≤ x) (the default) or if
p = P (X > x). The optional argument log.p indicates whether p is the
natural logarithm of the probability.
rbeta(n, shape1, shape2)
Generates a random sample of size n from a Beta(α, β) distribution with
the α parameter equal to shape1 and the β parameter equal to shape2.

B.4.3 The Cauchy Distribution

dcauchy(x, location=0, scale=1, log=F)


Calculates the density function f (x) of a random variable X that has a
Cauchy(α, β) distribution with the α parameter equal to location and
the β parameter equal to scale. The optional argument log indicates
whether the logarithm of the density should be returned.
pcauchy(q, location=0, scale=1, lower.tail=T, log.p=F)
Calculates the cumulative probability P (X ≤ q) where X has a Cauchy(α,
β) distribution with the α parameter equal to location and the β param-
eter equal to scale. The optional argument lower.tail indicates whether
P (X ≤ q) or P (X > q) should be returned. The optional argument log.p
indicates whether the natural logarithm of the probability should be re-
turned.
qcauchy(p, location=0, scale=1, lower.tail=T, log.p=F)
Returns the pth quantile of a Cauchy(α, β) distribution with the α param-
eter equal to location and the β parameter equal to scale. The optional
argument lower.tail indicates whether p equals P (X ≤ q) or P (X > q).
The optional argument log.p indicates whether p is the natural logarithm
of the probability.
rcauchy(n, location = 0, scale = 1)
Generates a random sample of size n from a Cauchy(α, β) distribution
with the α parameter equal to location and the β parameter equal to
scale.

B.4.4 The Chi-Squared Distribution

dchisq(x, df, log = F)


Calculates the density function f (x) of a random variable X that has a
ChiSquared(η) distribution with the η parameter equal to df. The op-
tional argument log indicates whether the logarithm of the density should
be returned.
pchisq(q, df, lower.tail = T, log.p = F)
Calculates the cumulative probability P (X ≤ q) where X is a random
variable with a ChiSquared(η) distribution with the η parameter equal
to df. The optional argument lower.tail indicates whether P (X ≤ q)
or P (X > q) should be returned. The optional argument log.p indicates
whether the natural logarithm of the probability should be returned.
qchisq(p, df, lower.tail = T, log.p = F)
Returns the pth quantile of a ChiSquared(η) distribution with the η pa-
rameter equal to df. The optional argument lower.tail indicates whether
p = P (X ≤ q) (the default) or if p = P (X > q). The optional argument
log.p indicates whether p is the natural logarithm of the probability.
rchisq(n, df)
Generates a sample of size n from a ChiSquared(η) distribution with the
η parameter equal to df.

B.4.5 The Exponential Distribution

dexp(x, rate = 1, log = F)


Calculates the density function f (x) of a random variable X that has an
Exponential(θ) distribution with θ−1 equal to rate. The optional argu-
ment log indicates whether the logarithm of the density should be returned.
pexp(q, rate = 1, lower.tail = T, log.p = F)
Calculates the cumulative probability P (X ≤ q) where X is a random
variable with an Exponential(θ) distribution with θ−1 equal to rate. The
optional argument lower.tail indicates whether P (X ≤ q) or P (X > q)
should be returned. The optional argument log.p indicates whether the
natural logarithm of the probability should be returned.
qexp(p, rate = 1, lower.tail = T, log.p = F)
Returns the pth quantile of an Exponential(θ) distribution with θ−1 equal
to rate. The optional argument lower.tail indicates whether p = P (X ≤
q) (the default) or if p = P (X > q). The optional argument log.p indicates
whether p is the natural logarithm of the probability.
rexp(n, rate = 1)
Generates a sample of size n from an Exponential(θ) distribution with
θ−1 equal to rate.

B.4.6 The Gamma Distribution

dgamma(x, shape, rate = 1, scale = 1/rate, log = F)


Calculates the density function f (x) of a random variable X that has a
Gamma(α, β) distribution with α equal to shape and β −1 equal to rate.
Alternatively, one can specify the scale parameter instead of the rate
parameter where scale is equal to β. The optional argument log indicates
whether the logarithm of the density should be returned.
pgamma(q, shape, rate=1, scale=1/rate, lower.tail=T, log.p=F)
Calculates the cumulative probability P (X ≤ q) where X is a random
variable with a Gamma(α, β) distribution with α equal to shape and β −1
equal to rate. Alternatively, one can specify the scale parameter instead
of the rate parameter where scale is equal to β. The optional argument
lower.tail indicates whether P (X ≤ q) or P (X > q) should be returned.
The optional argument log.p indicates whether the logarithm of the prob-
ability should be returned.
qgamma(p, shape, rate=1, scale=1/rate, lower.tail=T, log.p = F)
Returns the pth quantile of a Gamma(α, β) distribution with α equal to
shape and β −1 equal to rate. Alternatively, one can specify the scale
parameter instead of the rate parameter where scale is equal to β. The
optional argument lower.tail indicates whether p = P (X ≤ q) (the de-
fault) or if p = P (X > q). The optional argument log.p indicates whether
p is the natural logarithm of the probability.
rgamma(n, shape, rate = 1, scale = 1/rate)
Generates a sample of size n from a Gamma(α, β) distribution with α equal
to shape and β −1 equal to rate. Alternatively, one can specify the scale
parameter instead of the rate parameter where scale is equal to β.

B.4.7 The Geometric Distribution

dgeom(x, prob, log = F)


Calculates the probability P (X = x) where X has a Geometric(θ) dis-
tribution with θ specified by prob. The optional argument log indicates
whether the logarithm of the probability should be returned.
pgeom(q, prob, lower.tail = T, log.p = F)
Calculates the cumulative probability P (X ≤ q) where X has a Geo-
metric(θ) distribution with θ specified by prob. The optional argument
lower.tail indicates whether P (X ≤ q) or P (X > q) should be returned.
The optional argument log.p indicates whether the logarithm of the prob-
ability should be returned.
qgeom(p, prob, lower.tail = T, log.p = F)
Returns the pth quantile of a Geometric(θ) distribution with θ specified by
prob. The optional argument lower.tail indicates whether p = P (X ≤ q)
(the default) or if p = P (X > q). The optional argument log.p indicates
whether p is the natural logarithm of the probability.
rgeom(n, prob)
Generates a sample of size n from a Geometric(θ) distribution with θ
specified by prob.

B.4.8 The LaPlace Distribution

There does not appear to be a standard R library at this time that supports
the LaPlace(a, b) distribution. The relatively simple form of the distribution
makes it fairly easy to work with, however. For example, a function for the
density function can be programmed as
dlaplace <- function(x, a=0, b=1)
return(exp(-1*abs(x-a)/b)/(2*b))
A function for the distribution function can be programmed as
plaplace <- function(q, a=0, b=1)
{
if(q<a) return(0.5*exp((q-a)/b))
else return(1-0.5*exp((a-q)/b))
}
and a function for the quantile function can be programmed as
qlaplace <- function(p, a=0, b=1)
return(a-b*sign(p-0.5)*log(1-2*abs(p-0.5)))
A function to generate a sample of size n from a LaPlace(a, b) distribution
can be programmed as
rlaplace <- function(n, a=0, b=1)
{
u <- runif(n,-0.5,0.5)
return(a-b*sign(u)*log(1-2*abs(u)))
}
Please note that these are fairly primitive functions in that they do no error
checking and may not be the most numerically efficient methods. They should
be sufficient for performing the experiments in this book as long as one is
careful with their use.
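Assuming the functions above have been entered into R, a quick informal check
of their behavior can be carried out as follows:

x <- rlaplace(10000)
mean(x <= qlaplace(0.9))  # should be close to 0.9
plaplace(0)               # equals 0.5 for the LaPlace(0,1) distribution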

B.4.9 The Lognormal Distribution

dlnorm(x, meanlog = 0, sdlog = 1, log = F)


Calculates the density function f (x) of a random variable X that has a
Lognormal(µ, σ 2 ) density with µ equal to meanlog and σ equal to sdlog.
The optional argument log indicates whether the logarithm of the density
should be returned.
plnorm(q, meanlog = 0, sdlog = 1, lower.tail = T, log.p = F)
Calculates the cumulative probability P (X ≤ q) where X has a Lognor-
mal(µ, σ 2 ) density with µ equal to meanlog and σ equal to sdlog. The
optional argument lower.tail indicates whether P (X ≤ q) (the default)
or if P (X > q) should be returned. The optional argument log.p indicates
whether the natural logarithm of the probability should be returned.
qlnorm(p, meanlog = 0, sdlog = 1, lower.tail = T, log.p = F)
Returns the pth quantile of a Lognormal(µ, σ 2 ) density with µ equal
to meanlog and σ equal to sdlog. The optional argument lower.tail
indicates whether p = P (X ≤ q) (the default) or if p = P (X > q). The
optional argument log.p indicates whether p is the natural logarithm of
the probability.
rlnorm(n, meanlog = 0, sdlog = 1)
Generates a sample of size n from a Lognormal(µ, σ 2 ) density with µ
equal to meanlog and σ equal to sdlog.

B.4.10 The Multinomial Distribution

rmultinom(n, size, prob)


Generates n independent k × 1 random vectors. Each random vector
follows a Multinomial distribution where size outcomes are classified
into k categories that have associated probabilities specified by the k × 1
vector prob. The value of k is determined by the function from the length
of prob.
dmultinom(x, size=NULL, prob, log=F)
Returns the probability P (X = x) where x is specified by the k × 1 vector
x. The random vector X has a Multinomial distribution where size outcomes are
classified into k categories that have associated probabilities specified by
the k × 1 vector prob. By default size is set equal to sum(x) and need not
be specified. The optional argument log indicates whether the logarithm
of the probability should be returned.

B.4.11 The Normal Distribution

dnorm(x, mean = 0, sd = 1, log = F)


Calculates the density function f (x) of a random variable X that has a
N(µ, σ 2 ) distribution where µ is specified by mean and σ is specified by sd.
The optional argument log indicates whether the logarithm of the density
should be returned.
pnorm(q, mean = 0, sd = 1, lower.tail = T, log.p = F)
Calculates the cumulative probability P (X ≤ q) where X has a N(µ, σ 2 )
distribution where µ is specified by mean and σ is specified by sd. The
optional argument lower.tail indicates whether P (X ≤ q) (the default)
or if P (X > q) should be returned. The optional argument log.p indicates
whether the natural logarithm of the probability should be returned.
qnorm(p, mean = 0, sd = 1, lower.tail = T, log.p = F)
Returns the pth quantile of a N(µ, σ 2 ) distribution where µ is specified
by mean and σ is specified by sd. The optional argument lower.tail
indicates whether p = P (X ≤ q) (the default) or if p = P (X > q). The
optional argument log.p indicates whether p is the natural logarithm of
the probability.
rnorm(n, mean = 0, sd = 1)
Generates a sample of size n from a N(µ, σ 2 ) distribution where µ is spec-
ified by mean and σ is specified by sd.

B.4.12 The Multivariate Normal Distribution

Samples can be simulated from a N(µ, Σ) distribution using the mvrnorm(n


= 1, mu, Sigma) function which can be found in the MASS library. The argu-
ment n specifies the size of the sample to be generated, mu is the mean vector
and Sigma is the covariance matrix. The function returns an n × d matrix object
where d is the dimension of the vector mu.
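For example, a minimal sketch that simulates 100 observations from a bivariate
normal distribution with zero means, unit variances, and correlation 0.5 is given
by:

library(MASS)
mu <- c(0,0)
Sigma <- matrix(c(1,0.5,0.5,1),2,2)
W <- mvrnorm(100,mu,Sigma)  # a 100 x 2 matrix with one observation per row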

B.4.13 The Poisson Distribution

dpois(x, lambda, log = F)


Calculates the probability P (X = x) where X has a Poisson distribution
with rate specified by lambda. The optional argument log indicates whether
the logarithm of the probability should be returned.
ppois(q, lambda, lower.tail = T, log.p = F)
Calculates the cumulative probability P (X ≤ q) where X has a Pois-
son distribution with rate specified by lambda. The optional argument
lower.tail indicates whether P (X ≤ q) should be returned (the default)
or if P (X > q) should be returned. The optional argument log.p indicates
whether the natural logarithm of the probability should be returned.
qpois(p, lambda, lower.tail = T, log.p = F)
Returns the pth quantile of a Poisson distribution with rate specified by
lambda. The optional argument lower.tail indicates whether p = P (X ≤
q) (the default) or if p = P (X > q). The optional argument log.p indicates
whether p is the natural logarithm of the probability.
rpois(n, lambda)
Generates a random sample of size n from a Poisson distribution with rate
specified by lambda.
B.4.14 Student’s t Distribution

dt(x, df, log = F)


Calculates the density function f (x) of a random variable X that has a
T(ν) distribution with ν equal to df. The optional argument log indicates
whether the logarithm of the density should be returned.
pt(q, df, lower.tail = T, log.p = F)
Calculates the cumulative probability P (X ≤ q) where X has a T(ν) dis-
tribution with ν equal to df. The optional argument lower.tail indicates
whether P (X ≤ q) should be returned (the default) or if P (X > q) should
be returned. The optional argument log.p indicates whether the natural
logarithm of the probability should be returned.
qt(p, df, lower.tail = T, log.p = F)
Returns the pth quantile of a T(ν) distribution with ν equal to df. The op-
tional argument lower.tail indicates whether p = P (X ≤ q) (the default)
or if p = P (X > q). The optional argument log.p indicates whether p is
the natural logarithm of the probability.
rt(n, df)
Generates a random sample of size n from a T(ν) distribution with ν equal
to df.

B.4.15 The Continuous Uniform Distribution

dunif(x, min=0, max=1, log = F)


Calculates the density function f (x) of a random variable X that has a Uni-
form(α, β) distribution with α equal to min and β equal to max. The op-
tional argument log indicates whether the logarithm of the density should
be returned.
punif(q, min=0, max=1, lower.tail = T, log.p = F)
Calculates the cumulative probability P (X ≤ q) where X has a Uni-
form(α, β) distribution with α equal to min and β equal to max. The
optional argument lower.tail indicates whether P (X ≤ q) should be
returned (the default) or if P (X > q) should be returned. The optional
argument log.p indicates whether the natural logarithm of the probability
should be returned.
qunif(p, min=0, max=1, lower.tail = T, log.p = F)
Returns the pth quantile of a Uniform(α, β) distribution with α equal
to min and β equal to max. The optional argument lower.tail indicates
whether p = P (X ≤ q) (the default) or if p = P (X > q). The optional
argument log.p indicates whether p is the natural logarithm of the prob-
ability.
runif(n, min=0, max=1)
Generates a random sample of size n from a Uniform(α, β) distribution
with α equal to min and β equal to max.

B.4.16 The Discrete Uniform Distribution

There is not full support for the Uniform(x1 , x2 , . . . , xn ) distribution in R,


though sampling from this distribution can be accomplished using the sample
function which has the following form:
sample(x, size, replace = F, prob = NULL)
where x is a vector that contains the units to be sampled, size is a non-
negative integer that equals the sample size, replace is a logical object that
specifies whether the sampling should take place with or without replacement,
and prob is a vector of probabilities for each of the elements in the event
that non-uniform sampling is desired. Therefore, if we wish to generate a
sample of size 10 from a Uniform(1, 2, 3, 4, 5, 6) distribution then we can use
the command sample(1:6,10,replace=T). Specification of the vector prob
allows us to simulate sampling from any discrete distribution. For example,
if we wish to generate a sample of size 10 from a discrete distribution with
probability distribution function f(x) = 1/4 for x ∈ {−1, 1} and f(x) = 1/2 for x = 0,
then we can use the command
sample(c(-1,0,1),10,replace=T,prob=c(0.25,0.50,0.25))
Finally, if we wish to simulate resamples from a sample x in order to perform
bootstrap calculations, then we can use the command
sample(x,length(x),replace=T)
In fact, a rudimentary bootstrap function can be specified as
bootstrap <- function(x,b,fun,...)
{
n <- length(x)
xs <- matrix(sample(x,n*b,replace=T),b,n)
return(apply(xs,1,fun,...))
}
The function returns the b values of the statistic specified by fun computed
on the b resamples. For example, the following command simulates a sample
of size 25 from a N(0, 1) distribution, generates 1000 bootstrap resamples,
calculates the 10% trimmed mean of each, and makes a histogram of the
resulting values.
hist(bootstrap(rnorm(25),1000,mean,trim=0.10))

A more sophisticated implementation of the bootstrap can be found in the


boot library.

B.4.17 The Wald Distribution

Support for the Wald distribution can be found in the library SuppDists,
which is available from most official R internet sites.

dinvGauss(x, nu, lambda, log=F)


Calculates the density of X at x where X has a Wald(µ, λ) distribution
where µ is given by nu and λ is specified by lambda. The optional argu-
ment log indicates whether the natural logarithm of the density should be
returned.
pinvGauss(q, nu, lambda, lower.tail=T, log.p=F)
Calculates the cumulative probability P (X ≤ q) where X has a Wald(µ, λ)
distribution where µ is given by nu and λ is specified by lambda. The
optional argument lower.tail indicates whether P (X ≤ q) should be
returned (the default) or if P (X > q) should be returned. The optional
argument log.p indicates whether the natural logarithm of the probability
should be returned.
qinvGauss(p, nu, lambda, lower.tail=T, log.p=F)
Returns the pth quantile of a Wald(µ, λ) distribution where µ is given
by nu and λ is specified by lambda. The optional argument lower.tail
indicates whether p = P (X ≤ q) (the default) or if p = P (X > q). The
optional argument log.p indicates whether p is the natural logarithm of
the probability.
rinvGauss(n, nu, lambda)
Generates a random sample of size n from a Wald(µ, λ) distribution
where µ is given by nu and λ is specified by lambda.
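For example, assuming the SuppDists library has been installed, a sample from
a Wald(1, 2) distribution can be simulated and informally compared to the
theoretical mean µ = 1 using the commands:

library(SuppDists)
x <- rinvGauss(1000,nu=1,lambda=2)
mean(x)  # should be close to the theoretical mean of 1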

B.5 Writing Simulation Code

Most of the experiments suggested in this book should run in a reasonable


amount of time on a standard Mac or PC. If some of the simulations seem
to be taking too long, then some of the parameters can be changed to make
them run faster, many times without too much loss of information about the
concept under consideration. Instructors are encouraged to try out some of the
simulations on their local network before assigning some of the larger experi-
ments in order to establish whether the simulation parameters are reasonable
for their students to complete in a reasonable amount of time.
There are also some basic R concepts that will keep the simulations as efficient
as possible. In general, even though R does support looping, most notably
using the for loop, loops should be avoided whenever possible. This can be
accomplished using several helpful features that are built into R.
The first way to avoid loops is to note that R can do many calculations
on an entire vector or matrix at once. For example, suppose that A and B
are two matrix objects with n rows and m columns. One could multiply the
corresponding elements of the matrices using the double loop:

C <- matrix(0,n,m)
for(i in 1:n) for(j in 1:m) C[i,j] <- A[i,j] * B[i,j]

However, it is much more efficient to use the simple command C <- A*B,
which multiplies the corresponding elements. This also works with addition,
subtraction and division. One should note that the matrix product of A and
B, assuming they are conformable, is given by the command A%*%B. Many
standard R functions will also work on the elements of a vector. For example,
cos(A) will return a matrix object whose elements correspond to the cosine
of the elements of A. Similarly, if one wishes to compute a vector of N(0, 1)
quantiles for p = 0.01, . . . , 0.99, one can avoid using a loop by simply executing
the command qnorm(seq(0.01,0.99,0.01)).
One should also note that R offers vector versions of many functions that have
the option of returning vectors. This is particularly useful when simulating a
set of data. For example, one could simulate a sample of size 10 from a N(0, 1)
distribution using the commands:

z <- matrix(0,10,1)
for(i in 1:10) z[i] <- rnorm(1)

It is much more efficient to just use the command z <- rnorm(10). In fact,
many of the experiments in this book specify simulating 1000 samples of size
10, for example. If all the samples are from the same distribution, and are in-
tended to be independent of one another, then all of the samples can be simu-
lated at once using the command z <- matrix(rnorm(10*1000),1000,10).
Each sample will then correspond to a row of the object z.
The final suggested method for avoiding loops in R is to use the apply com-
mand whenever possible. The apply command allows one to compute many
R functions row-wise or column-wise on an array in R without using a loop.
The apply command has the following form:

apply(X, MARGIN, FUN, ...)

where X is the object that a function will be applied to, MARGIN indicates
whether the function will be applied to rows (MARGIN=1), columns (MARGIN=2),
or both (MARGIN=c(1,2)), and FUN is the function that will be applied. Op-
tional arguments for the function that is being applied can be passed after
these three arguments have been specified. For example, suppose that we wish
to compute a 10% trimmed mean on the rows of a matrix object X. This can
be accomplished using the command apply(X,1,mean,trim=0.10).
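Putting these ideas together, the following sketch simulates 1000 independent
samples of size 10 from a N(0, 1) distribution and computes all of the sample
means without an explicit loop:

z <- matrix(rnorm(10*1000),1000,10)  # each row is a sample of size 10
xbar <- apply(z,1,mean)              # the 1000 sample means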

B.6 Kernel Density Estimation

Kernel density estimation is supported in R through several add-on libraries.


Two of these libraries are particularly useful for kernel density estimation. The
sm library is a companion to the book of Bowman and Azzalini (1997), and
provides a variety of functions for several types of smoothing methods. The
second library is the KernSmooth library that is a companion to the book of
Wand and Jones (1995). The KernSmooth library is specific to kernel smooth-
ing methods, and we will briefly demonstrate how to obtain a kernel density
estimate using this library. There are two basic functions that we require from
this library. The first function estimates the optimal bandwidth based on an
iterated plug-in approach. The form of this function is dpik(x, level=2,
kernel="normal") where x is the observed data, level corresponds to how
many iterations are used, and kernel is the type of kernel function being
used. To keep matters simple, we have purposely left out many more tech-
nical arguments which can be left with their default value. To compute the
kernel estimator the function bkde(x, kernel = "normal", bandwidth) is
used, where x is the observed data, kernel is the type of kernel function be-
ing used, and bandwidth is the bandwidth to be used. The output from this
function is a list object that contains vectors x and y that correspond to a grid
on the range of the data, and the corresponding value of the kernel density
estimator at each of the grid points. The output from this function can be
passed directly to the plot function. For example, to simulate a set of 100
observations from a N(0, 1) distribution, estimate the optimal bandwidth, and
plot the corresponding estimate we can use the commands:
library(KernSmooth)
x <- rnorm(100)
plot(bkde(x,bandwidth=dpik(x)),type="l")

B.7 Simulating Samples from Normal Mixtures

Some of the experiments in this book require the reader to simulate samples
from normal mixtures. There are some libraries which offer some techniques,
but for the experiments in this book a simple approach will suffice. Suppose
that we wish to simulate a sample of size n from a normal mixture density of
the form
f(x) = Σ_{i=1}^{p} ωi σi^(-1) φ[σi^(-1)(x − µi)],   (B.1)
where ω1, . . . , ωp are the weights of the mixture, which are assumed to add to
one, µ1, . . . , µp are the means of the normal densities, and σ1², . . . , σp² are the
variances of the normal densities. The essential idea behind simulating a sample
from such a density is the fact that if X has the density given in
Equation (B.1), then X has the same distribution as Z = Y′W, where Y has a
Multinomial(1, p, ω) distribution and W has a N(µ, Σ) distribution where
ω′ = (ω1, . . . , ωp), µ′ = (µ1, . . . , µp), and Σ = Diag{σ1², . . . , σp²}. Therefore,
suppose we wish to simulate an observation from the density
f(x) = ¼φ(x − 1) + ½φ(x) + ¼φ(x + 1),
then we could use the commands
library(MASS)
omega <- c(0.25,0.50,0.25)
mu <- c(-1,0,1)
Sigma <- diag(1,3,3)
W <- mvrnorm(1, mu, Sigma)
Y <- rmultinom(1,1,omega)
Z <- t(W)%*%Y
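For repeated use it may be convenient to wrap this scheme in a small function.
The sketch below is a hypothetical helper, not part of any standard library, that
draws the component label for each observation directly with the sample function
and then simulates the corresponding normal observation:

rnormmix <- function(n,omega,mu,sigma2)
{
k <- sample(length(omega),n,replace=T,prob=omega)  # component labels
return(rnorm(n,mean=mu[k],sd=sqrt(sigma2[k])))
}
x <- rnormmix(1000,c(0.25,0.50,0.25),c(-1,0,1),c(1,1,1))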

B.8 Some Examples

B.8.1 Simulating Flips of a Fair Coin

In this simulation, we consider flipping a fair coin n times. For each flip we
wish to keep track of the proportion of flips that are heads. The experiment is
repeated b times and the resulting proportions are plotted together on a single
plot that demonstrates how the proportion converges to 1/2 as n gets large. For
the example we have used n = 100 and b = 5, but these parameters are easily
changed.
n <- 100
b <- 5
plot(c(0,n),c(0.5,0.5),type="l",lty=2,ylim=c(0,1),
     xlab="Flip Number",ylab="Proportion Heads")

for(i in 1:b) lines(seq(1,n,1),cumsum(rbinom(n,size=1,prob=0.5))/seq(1,n,1))
The resulting output should be a plot similar to the one in Figure B.1.

B.8.2 Investigating the Central Limit Theorem

In this simulation we simulate 100 samples of size 10 from an Exponential(1)
distribution and compute the mean of each sample. A histogram of the re-
sulting sample means is then plotted, along with a Normal density for com-
parison. The mean and variance of the normal density that is plotted are
computed to match the mean and variance of the observed sample means.
Note that this simulation sets up the samples in a 100 × 10 matrix and uses
the apply function to compute the sample mean of each row. This simulation
also sets up the ranges of the horizontal and vertical axes so that the overlay
of the density function can be observed. In setting up the range of the vertical
axis we use the hist function with the argument plot=F. Using this argument
causes the histogram not to be plotted, but does return a list that contains the
calculated heights of the density bars. Therefore, we are able to calculate the
maximum value of the density curve, contained in the y object, along with
the maximum density from the histogram contained in the object returned by
the command hist(obs.means,plot=F)$density.

x <- matrix(rexp(1000,1),100,10)
obs.means <- apply(x,1,mean)
norm.mean <- mean(obs.means)
norm.sd <- sd(obs.means)
xl <- c(norm.mean-4*norm.sd,norm.mean+4*norm.sd)
x.grid <- seq(xl[1],xl[2],length.out=1000)
y <- dnorm(x.grid,norm.mean,norm.sd)
yl <- c(0,max(y,hist(obs.means,plot=F)$density))
hist(obs.means,freq=F,xlab="observed means",ylab="density",
main="The Central Limit Theorem", xlim=xl,ylim=yl)
lines(x.grid,y)

The resulting output should be a plot similar to the one in Figure B.2.

B.8.3 Plotting the Normal Characteristic Function

The characteristic function of a N(µ, σ 2 ) random variable is difficult to visu-


alize because it is a complex valued function of t given by ψ(t) = exp(itµ −
σ²t²/2). One solution to this problem is to use a three-dimensional scatterplot
to visualize the function. In the following example code, we use the function
scatterplot3d from the package scatterplot3d. Using the native complex
data type in R makes this type of plot relatively easy to produce.

library(scatterplot3d)

t <- seq(-10,10,0.001)
cf <- exp(complex(1,0,1)*t-0.5*t*t)

scatterplot3d(Re(cf),Im(cf),t,type="l",xlab="real",ylab="imag")

The resulting output should be a plot similar to the one in Figure B.3.

B.8.4 Plotting Edgeworth Corrections

In the R program below we demonstrate how to plot the Edgeworth corrected
densities for the standardized distribution of the sample mean when the popu-
lation is a translated Gamma distribution. In this case we have used a sample
size equal to three. See Example 7.1 for further details.

n <- 3
x <- seq(-3,4,0.01)
y1 <- dnorm(x)
y2 <- dgamma(x+sqrt(n),n,sqrt(n))
y3 <- y1 + y1/(3*sqrt(n))*(x^3-3*x)
y4 <- y3 + y1/n*((1/18)*(x^6-15*x^4+45*x^2-15)+
(3/8)*(x^4-6*x^2+3))
plot(x,y1,type="l",xlab="",ylab="",ylim=c(0, max(y1,y2,y3,y4)),
lty=2)
lines(x,y2)
lines(x,y3,lty=3)
lines(x,y4,lty=4)

The resulting output should be a plot similar to the one in Figure B.4.

B.8.5 Simulating the Law of the Iterated Logarithm

Theorem 3.15 (Hartman and Wintner) provides a result on the extreme fluc-
tuations of the sample mean. The complexity of this result makes it difficult
to visualize. The code below was used to produce Figure 3.6.

ss <- seq(5,500,1)
x <- rnorm(max(ss))
sm <- matrix(0,length(ss),1)
lf <- matrix(0,length(ss),1)
uf <- matrix(0,length(ss),1)
ul <- matrix(0,length(ss),1)
ll <- matrix(0,length(ss),1)
for(i in seq(1,length(ss),1))
{
sm[i] <- sqrt(ss[i])*mean(x[1:ss[i]])
uf[i] <- max(sm[1:i])
lf[i] <- min(sm[1:i])
ul[i] <- sqrt(2*log(log(ss[i])))
ll[i] <- -1*ul[i]
}
yl <- c(min(sm,uf,lf,ul,ll),max(sm,uf,lf,ul,ll))
plot(ss,sm,type="l",ylim=yl)
lines(ss,uf,lty=2)
lines(ss,lf,lty=2)
lines(ss,ll,lty=3)
lines(ss,ul,lty=3)
References

Ahlfors, L. (1979). Complex Analysis. New York: McGraw-Hill.


Akhiezer, N. I. (1965). The Classical Moment Problem. New York: Hafner
Publishing Company.
Anderson, T. W. and Goodman, L. A. (1957). Statistical inference about
Markov chains. The Annals of Mathematical Statistics, 28, 89–109.
Apostol, T. M. (1967). Calculus: Volume I. 2nd Ed. New York: John Wiley
and Sons.
Apostol, T. M. (1974). Mathematical Analysis. 2nd Ed. Menlo Park, CA:
Addison-Wesley.
Arnold, B. C., Balakrishnan, N., and Nagaraja, H. N. (1993). A First Course
in Order Statistics. New York: John Wiley and Sons.
Athreya, K. B. (1987). Bootstrap of the mean in the infinite variance case.
The Annals of Statistics, 14, 724–731.
Averbukh, V. I. and Smolyanov, O. G. (1968). The various definitions of the
derivative in linear topological spaces. Russian Mathematical Surveys, 23,
67–113.
Babu, G. J. (1984). Bootstrapping statistics with linear combinations of chi-
square as weak limit. Sankhyā, Series A, 46, 85–93.
Babu, G. J. (1986). A note on bootstrapping the variance of the sample
quantile. The Annals of the Institute of Statistical Mathematics, 38, 439–
443.
Babu, G. J. and Singh, K. (1983). Inference on means using the bootstrap.
The Annals of Statistics, 11, 999–1003.
Babu, G. J. and Singh, K. (1984). One-term Edgeworth correction by Efron’s
bootstrap. Sankhyā Series A, 46, 219–232.
Babu, G. J. and Singh, K. (1985). Edgeworth expansions for sampling with-
out replacement from finite populations. Journal of Multivariate Analysis,
17, 261–278.
Bahadur, R. R. (1958). Examples of inconsistency of maximum likelihood
estimates. Sankhyā, 20, 207–210.
Bahadur, R. R. (1960a). Asymptotic efficiency of tests and estimators. Sankhyā,
20, 229–252.

Bahadur, R. R. (1960b). Stochastic comparison of tests. The Annals of Math-
ematical Statistics, 31, 276–295.
Bahadur, R. R. (1964). On Fisher’s bound for asymptotic variances. The
Annals of Mathematical Statistics, 35, 1545–1552.
Bahadur, R. R. (1964). Rates of convergence of estimates and test statistics.
The Annals of Mathematical Statistics, 38, 303–324.
Barnard, G. A. (1970). Discussion on paper by Dr. Kalbfleisch and Dr. Sprott.
Journal of the Royal Statistical Society, Series B, 32, 194–195.
Barndorff-Nielsen, O. E. (1978). Information and Exponential Families in
Statistical Theory. New York: John Wiley and Sons.
Barndorff-Nielsen, O. E. and Cox, D. R. (1989). Asymptotic Techniques for
Use in Statistics. London: Chapman and Hall.
Bartlett, M. S. (1963). Statistical estimation of density functions. Sankhyā,
Series A, 25, 245–254.
Basu, D. (1955). An inconsistency of the method of maximum likelihood.
The Annals of Mathematical Statistics, 26, 144–145.
Beran, R. (1987). Prepivoting to reduce level error of confidence sets. Biometrika,
74, 457–468.
Beran, R. (1982). Estimated sampling distributions: The bootstrap and com-
petitors. The Annals of Statistics, 10, 212–225.
Beran, R. and Ducharme, G. R. (1991). Asymptotic Theory for Bootstrap
Methods in Statistics. Montréal: Centre De Recherches Mathematiques.
Berry, A. C. (1941). The accuracy of the Gaussian approximation to the
sum of independent variates. Transactions of the American Mathematical
Society, 49, 122–136.
Bhattacharya, R. N. and Ghosh, J. K. (1978). On the validity of the formal
Edgeworth expansion. The Annals of Statistics, 6, 434–451.
Bhattacharya, R. N. and Rao, C. R. (1976). Normal Approximation and
Asymptotic Expansions. New York: John Wiley and Sons.
Bickel, P. J. and Freedman, D. A. (1980). On Edgeworth Expansions and the
Bootstrap. Unpublished manuscript.
Bickel, P. J. and Freedman, D. A. (1981). Some asymptotic theory for the
bootstrap. The Annals of Statistics, 9, 1196–1217.
Billingsley, P. (1986). Probability and Measure. New York: John Wiley and
Sons.
Billingsley, P. (1999). Convergence of Probability Measures. New York: John
Wiley and Sons.
Binmore, K. G. (1981). Topological Ideas. Cambridge: Cambridge University
Press.
Bolstad, W. M. (2007). Introduction to Bayesian Statistics. New York: John
Wiley and Sons.
Bowman, A. W. (1984). An alternative method of cross-validation for the
smoothing of density estimates. Biometrika, 71, 353–360.
Bowman, A.W. and Azzalini, A. (1997). Applied Smoothing Techniques for
Data Analysis: the Kernel Approach with S-Plus Illustrations. Oxford: Ox-
ford University Press.
Bratley, P., Fox, B. L., and Schrage, L. E. (1987). A Guide to Simulation.
New York: Springer-Verlag.
Buck, R. C. (1965). Advanced Calculus. New York: McGraw-Hill.
Burman, P. (1985). A data dependent approach to density estimation. Zeitschrift
für Wahrscheinlichkeitstheorie und Verwandte Gebiete, 69, 609–628.
Butler, R. W. (2007). Saddlepoint Approximations with Applications. Cam-
bridge: Cambridge University Press.
Cantelli, F. P. (1933). Sulla determinazione empirica delle leggi di probabilità.
Giornale dell’Istituto Italiano degli Attuari, 4, 421–424.
Casella, G. and Berger, R. L. (2002). Statistical Inference. Pacific Grove, CA:
Duxbury.
Chen, P.-N. (2002). Asymptotic refinement of the Berry-Esseen constant.
Unpublished manuscript.
Christensen, R. (1996). Plane Answers to Complex Questions. New York:
Springer.
Chow, Y. S. and Teicher, H. (2003). Probability Theory: Independence, In-
terchangeability, Martingales. New York: Springer.
Chung, K. L. (1974). A Course in Probability. Boston, MA: Academic Press.
Cochran, W. G. (1952). The χ2 test of goodness of fit. The Annals of Math-
ematical Statistics, 23, 315–345.
Conway, J. B. (1975). Functions of One Complex Variable. New York: Springer-
Verlag.
Copson, E. T. (1965). Asymptotic Expansions. Cambridge: Cambridge Uni-
versity Press.
Cornish, E. A. and Fisher, R. A. (1937). Moments and cumulants in the
specification of distributions. International Statistical Review, 5, 307–322.
Cramér, H. (1928). On the composition of elementary errors. Skandinavisk
Aktuarietidskrift, 11, 13–74, 141–180.
Cramér, H. (1946). Mathematical Methods of Statistics. Princeton, NJ: Prince-
ton University Press.
Cramér, H. (1970). Random Variables and Probability Distributions. 3rd Ed.
Cambridge: Cambridge University Press.
Daniels, H. E. (1954). Saddlepoint approximations in statistics. The Annals
of Mathematical Statistics, 25, 631–650.
De Bruijn, N. G. (1958). Asymptotic Methods in Analysis. New York: Dover.
Dieudonné, J. (1960). Foundations of Modern Analysis. New York: John Wi-
ley and Sons.
Edgeworth, F. Y. (1896). The asymmetrical probability curve. Philosophical
Magazine, Fifth Series, 41, 90–99.
Edgeworth, F. Y. (1905). The law of error. Proceedings of the Cambridge
Philosophical Society, 20, 36–65.
Edgeworth, F. Y. (1907). On the representation of a statistical frequency by
a series. Journal of the Royal Statistical Society, Series A, 70, 102–106.
Efron, B. (1979). Bootstrap methods: Another look at the jackknife. The
Annals of Statistics, 7, 1–26.
Efron, B. (1981). Nonparametric standard errors and confidence intervals.
Canadian Journal of Statistics, 9, 139–172.
Efron, B. (1982). The Jackknife, the Bootstrap, and Other Resampling Plans.
Philadelphia, PA: Society for Industrial and Applied Mathematics.
Efron, B. (1987). Better bootstrap confidence intervals. Journal of the Amer-
ican Statistical Association, 82, 171–200.
Efron, B. and Gong, G. (1983). A leisurely look at the bootstrap, the jack-
knife, and cross validation. The American Statistician, 37, 36–48.
Efron, B., Halloran, E., and Holmes, S. (1996). Bootstrap confidence levels
for phylogenetic trees. Proceedings of the National Academy of Sciences of
the United States of America, 93, 13429–13434.
Efron, B. and Tibshirani, R. J. (1993). An Introduction to the Bootstrap.
London: Chapman and Hall.
Efron, B. and Tibshirani, R. J. (1998). The problem of regions. The Annals
of Statistics, 26, 1687–1718.
Epanechnikov, V. A. (1969). Non-parametric estimation of a multivariate
probability density. Theory of Probability and its Applications, 14, 153–
158.
Erdélyi, A. (1956). Asymptotic Expansions. New York: Dover Publications.
Esseen, C.-G. (1942). On the Liapounoff limit of error in the theory of prob-
ability. Arkiv för Matematik, Astronomi och Fysik, 28A, 1–19.
Esseen, C.-G. (1945). Fourier analysis of distribution functions. A mathe-
matical study of the Laplace-Gaussian law. Acta Mathematica, 77, 1–125.
Esseen, C.-G. (1956). A moment inequality with an application to the central
limit theorem. Skandinavisk Aktuarietidskrift, 39, 160–170.
Feller, W. (1935). Über den zentralen Grenzwertsatz der Wahrscheinlichkeits-
rechnung. Mathematische Zeitschrift, 40, 521–559.
Feller, W. (1971). An Introduction to Probability Theory and its Application,
Volume 2. 2nd Ed. New York: John Wiley and Sons.
Felsenstein, J. (1985). Confidence limits on phylogenies: An approach using
the bootstrap. Evolution, 39, 783–791.
Fernholz, L. T. (1983). von Mises Calculus for Statistical Functionals. New
York: Springer-Verlag.
Finner, H. and Strassburger, K. (2002). The partitioning principle. The An-
nals of Statistics, 30, 1194–1213.
Fisher, R. A. (1915). Frequency distribution of the values of the correlation
coefficient in samples from an indefinitely large population. Biometrika, 10,
507–521.
Fisher, N. I. and Hall, P. (1991). Bootstrap algorithms for small sample sizes.
Journal of Statistical Planning and Inference, 27, 157–169.
Fisher, R. A. and Cornish, E. A. (1960). The percentile points of distributions
having known cumulants. Technometrics, 2, 209–226.
Fix, E. and Hodges, J. L. (1951). Discriminatory analysis - nonparametric
discrimination: consistency properties. Report No. 4, Project No. 21-29-
004. Randolph Field, TX: USAF School of Aviation Medicine.
Fix, E. and Hodges, J. L. (1989). Discriminatory analysis - nonparametric
discrimination: consistency properties. International Statistical Review, 57,
238–247.
Fréchet, M. (1925). La notion de differentielle dans l’analyse generale. An-
nales Scientifiques de l’École Normale Supérieure, 42, 293–323.
Fristedt, B. and Gray, L. (1997). A Modern Approach to Probability Theory.
Boston, MA: Birkhäuser.
Garwood, F. (1936). Fiducial limits for the Poisson distribution. Biometrika,
28, 437–442.
Ghosh, M. (1994). On some Bayesian solutions of the Neyman-Scott problem.
Statistical Decision Theory and Related Topics, Volume V. J. Berger and
S. S. Gupta, eds. New York: Springer-Verlag. 267–276.
Ghosh, M., Parr, W. C., Singh, K., and Babu, G. J. (1984). A note on
bootstrapping the sample median. The Annals of Statistics, 12, 1130–1135.
Gibbons, J. D. and Chakraborti, S. (2003). Nonparametric Statistical Infer-
ence. Boca Raton, FL: CRC Press.
Giné, E. and Zinn, J. (1989). Necessary conditions for the bootstrap of the
mean. The Annals of Statistics, 17, 684–691.
Glivenko, V. (1933). Sulla determinazione empirica delle leggi di probabilità.
Giornale dell’Istituto Italiano degli Attuari, 4, 92–99.
Gnedenko, B. V. (1962). The Theory of Probability. New York: Chelsea Pub-
lishing Company.
Gnedenko, B. V. and Kolmogorov, A. N. (1968). Limit Distributions for Some
Sums of Independent Random Variables. Reading, MA: Addison-Wesley.
Graybill, F. A. (1976). Theory and Application of the Linear Model. Pacific
Grove, CA: Wadsworth and Brooks.
Gut, A. (2005). Probability: A Graduate Course. New York: Springer.
Hájek, J. (1969). Nonparametric Statistics. San Francisco, CA: Holden-Day.
Hájek, J. (1972). Local asymptotic minimax and admissibility in estimation.
Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics
and Probability, Volume I, 175–194.
Hájek, J. and Šidák, Z. (1967). Theory of Rank Tests. New York: Academic
Press.
Hall, P. (1983a). Inverting an Edgeworth expansion. The Annals of Statistics,
11, 569–576.
Hall, P. (1983b). Large sample optimality of least squares cross-validation in
density estimation. The Annals of Statistics, 11, 1156–1174.
Hall, P. (1986a). On the bootstrap and confidence intervals. The Annals of
Statistics, 14, 1431–1452.
Hall, P. (1986b). On the number of bootstrap simulations required to con-
struct a confidence interval. The Annals of Statistics, 14, 1453–1462.
Hall, P. (1988a). Theoretical comparison of bootstrap confidence intervals.
The Annals of Statistics, 16, 927–953.
Hall, P. (1988b). Introduction to the Theory of Coverage Processes. New
York: John Wiley and Sons.
Hall, P. (1990). Asymptotic properties of the bootstrap for heavy-tailed dis-
tributions. The Annals of Probability, 18, 1342–1360.
Hall, P. (1992). The Bootstrap and Edgeworth Expansion. New York: Springer.
Hall, P. and Martin, M. A. (1988). On bootstrap resampling and iteration.
Biometrika, 75, 661–671.
Hall, P. and Martin, M. A. (1991). On the error incurred using the bootstrap
variance estimator when constructing confidence intervals for quantiles.
Journal of Multivariate Analysis, 38, 70–81.
Halmos, P. R. (1958). Finite-Dimensional Vector Spaces. 2nd Ed. Princeton,
NJ: Van Nostrand.
Halmos, P. R. (1974). Measure Theory. New York: Springer-Verlag.
Hardy, G. H. (1949). Divergent Series. Providence, RI: AMS Chelsea Pub-
lishing.
Heyde, C. C. (1963). On a property of the lognormal distribution. Journal
of the Royal Statistical Society, Series B, 25, 392–393.
Hochberg, Y. and Tamhane, A. C. (1987). Multiple Comparison Procedures.
New York: John Wiley and Sons.
Hodges, J. L. and Lehmann, E. L. (1956). The efficiency of some nonpara-
metric competitors of the t-test. The Annals of Mathematical Statistics, 27,
324–335.
Hodges, J. L. and Lehmann, E. L. (1970). Deficiency. The Annals of Mathe-
matical Statistics, 41, 783–801.
Hoeffding, W. (1948). A class of statistics with asymptotically normal dis-
tribution. The Annals of Mathematical Statistics, 19, 293–325.
Hollander, M. and Wolfe, D. A. (1999). Nonparametric Statistical Methods.
2nd Ed. New York: John Wiley and Sons.
Hsu, P. L. and Robbins, H. (1947). Complete convergence and the law of large
numbers. Proceedings of the National Academy of Sciences of the United
States of America, 33, 25–31.
Huber, P. J. (1966). Strict efficiency excludes superefficiency. The Annals of
Mathematical Statistics, 37, 1425.
Jensen, J. L. (1988). Uniform saddlepoint approximations. Advances in Ap-
plied Probability, 20, 622–634.
Johnson, N. L., Kotz, S., and Balakrishnan, N. (1994). Distributions in Statis-
tics: Continuous Univariate Distributions. Volume I. 2nd Ed. New York:
John Wiley and Sons.
Keller, H. H. (1974). Differential Calculus in Locally Convex Spaces. Lecture
Notes in Mathematics Number 417. Berlin: Springer-Verlag.
Kendall, M. and Stuart, A. (1977). The Advanced Theory of Statistics, Vol-
ume 1: Distribution Theory. 4th Ed. New York: Macmillan Publishing Com-
pany.
Khuri, A. I. (2003). Advanced Calculus with Applications in Statistics. New
York: John Wiley and Sons.
Knight, K. (1989). On the bootstrap of the sample mean in the infinite
variance case. The Annals of Statistics, 17, 1168–1175.
Koenker, R. W. and Bassett, G. W. (1984). Four (pathological) examples in
asymptotic statistics. The American Statistician, 38, 209–212.
Kolassa, J. E. (2006). Series Approximation Methods in Statistics. 3rd Ed.
New York: Springer.
Kolmogorov, A. N. (1956). Foundations of the Theory of Probability. New
York: Chelsea Publishing Company.
Kowalski, J. and Tu, X. M. (2008). Modern Applied U-Statistics. New York:
John Wiley and Sons.
Landau, E. (1974). Handbuch der Lehre von der Verteilung der Primzahlen.
Providence, RI: AMS Chelsea Publishing.
Le Cam, L. (1953). On some asymptotic properties of maximum likelihood
estimates and related Bayes’ estimates. University of California Publica-
tions in Statistics, 1, 277–330.
Le Cam, L. (1979). Maximum Likelihood Estimation: An Introduction. Lec-
ture Notes in Statistics Number 18. University of Maryland, College Park,
MD.
Lee, A. J. (1990). U-Statistics: Theory and Practice. New York: Marcel
Dekker.
Lehmann, E. L. (1983). Theory of Point Estimation. New York: John Wiley
and Sons.
Lehmann, E. L. (1986). Testing Statistical Hypotheses. Pacific Grove, CA:
Wadsworth and Brooks/Cole.
Lehmann, E. L. (1999). Elements of Large-Sample Theory. New York: Springer.
Lehmann, E. L. and Casella, G. (1998). Theory of Point Estimation. New
York: Springer.
Lehmann, E. L. and Shaffer, J. (1988). Inverted distributions. The American
Statistician, 42, 191–194.
Lévy, P. (1925). Calcul des Probabilités. Paris: Gauthier-Villars.
Lindeberg, J. W. (1922). Eine neue Herleitung des Exponentialgesetzes in der
Wahrscheinlichkeitsrechnung. Mathematische Zeitschrift, 15, 211–225.
Loève, M. (1977). Probability Theory. 4th Ed. New York: Springer-Verlag.
Loh, W.-Y. (1987). Calibrating confidence coefficients. Journal of the Amer-
ican Statistical Association, 82, 155–162.
Lukacs, E. (1956). On certain periodic characteristic functions. Compositio
Mathematica, 13, 76–80.
Mammen, E. (1992). When Does Bootstrap Work? Asymptotic Results and
Simulations. New York: Springer.
Miller, R. G. (1981). Simultaneous Statistical Inference. 2nd Ed. New York:
Springer-Verlag.
Mood, A. M. (1950). Introduction to the Theory of Statistics. 3rd Ed. New
York: McGraw-Hill.
Nadaraya, E. A. (1974). On the integral mean squared error of some non-
parametric estimates for the density function. Theory of Probability and Its
Applications, 19, 133–141.
Nashed, M. Z. (1971). Differentiability and related properties of nonlinear
operators: some aspects of the role of differentials in nonlinear functional
analysis. Nonlinear Functional Analysis and Applications. L. B. Rall, ed.
New York: Academic Press. 103–309.
Neyman, J. and Scott, E. (1948). Consistent estimates based on partially
consistent observations. Econometrica, 16, 1–32.
Noether, G. E. (1949). On a theorem by Wald and Wolfowitz. The Annals
of Mathematical Statistics, 20, 455–458.
Noether, G. E. (1955). On a theorem of Pitman. The Annals of Mathematical
Statistics, 26, 64–68.
Parzen, E. (1962). On the estimation of a probability density function and
the mode. The Annals of Mathematical Statistics, 33, 1065–1076.
Petrov, V. V. (1995). Limit Theorems of Probability Theory: Sequences of
Independent Random Variables. New York: Oxford University Press.
Petrov, V. V. (2000). Classical-type limit theorems for sums of independent
random variables. Limit Theorems of Probability Theory, Encyclopedia of
Mathematical Sciences. Number 6, 1–24. New York: Springer.
Pfanzagl, J. (1970). On the asymptotic efficiency of median unbiased esti-
mates. The Annals of Mathematical Statistics, 41, 1500–1509.
Pitman, E. J. G. (1948). Notes on Non-Parametric Statistical Inference. Un-
published notes from Columbia University.
Polansky, A. M. (1995). Kernel Smoothing to Improve Bootstrap Confidence
Intervals. Ph.D. Dissertation. Dallas, TX: Southern Methodist University.
Polansky, A. M. (1999). Upper bounds on the true coverage of bootstrap
percentile type confidence intervals. The American Statistician, 53, 362–
369.
Polansky, A. M. (2000). Stabilizing bootstrap-t confidence intervals for small
samples. Canadian Journal of Statistics, 28, 501–516.
Polansky, A. M. (2003a). Supplier selection based on bootstrap confidence
regions of process capability indices. International Journal of Reliability,
Quality and Safety Engineering, 10, 1–14.
Polansky, A. M. (2003b). Selecting the best treatment in designed experi-
ments. Statistics in Medicine, 22, 3461–3471.
Polansky, A. M. (2007). Observed Confidence Levels: Theory and Application.
Boca Raton, FL: Chapman Hall/CRC Press.
Polansky, A. M. and Schucany, W. R. (1997). Kernel smoothing to improve
bootstrap confidence intervals. Journal of the Royal Statistical Society, Se-
ries B, 59, 821–838.
Pollard, D. (2002). A User’s Guide to Measure Theoretic Probability. Cam-
bridge: Cambridge University Press.
Putter, H. and Van Zwet, W. R. (1996). Resampling: consistency of substi-
tution estimators. The Annals of Statistics, 24, 2297–2318.
Randles, R. H. and Wolfe, D. A. (1979). Introduction to the Theory of Non-
parametric Statistics. New York: John Wiley and Sons.
Rao, C. R. (1963). Criteria for estimation in large samples. Sankhyā, 25,
189–206.
Reeds, J. A. (1976). On the Definition of von Mises Functionals. Ph.D. Dis-
sertation. Cambridge, MA: Harvard University.
Rosenblatt, M. (1956). Remarks on some nonparametric estimates of a den-
sity function. The Annals of Mathematical Statistics, 27, 832–837.
Royden, H. L. (1988). Real Analysis. 3rd Ed. New York: Macmillan.
Rudemo, M. (1982). Empirical choice of histograms and kernel density esti-
mators. Scandinavian Journal of Statistics, 9, 65–78.
Scott, D. W. (1992). Multivariate Density Estimation: Theory, Practice and
Visualization. New York: John Wiley and Sons.
Sen, P. K. and Singer, J. M. (1993). Large Sample Methods in Statistics.
London: Chapman and Hall.
Serfling, R. J. (1980). Approximation Theorems of Mathematical Statistics.
New York: John Wiley and Sons.
Severini, T. A. (2005). Elements of Distribution Theory. Cambridge: Cam-
bridge University Press.
Shao, J. and Tu, D. (1995). The Jackknife and Bootstrap. New York: Springer.
Shi, X. (1986). A note on bootstrapping U -statistics. Chinese Journal of
Applied Probability and Statistics, 2, 144–148.
Shiganov, I. S. (1986). Refinement of the upper bound of the constant in
the central limit theorem. Journal of Soviet Mathematics, 35, 2545–2550.
Shohat, J. A. and Tamarkin, J. D. (1943). The Problem of Moments. 4th Ed.
Providence, RI: American Mathematical Society.
Silverman, B. W. (1986). Density Estimation. London: Chapman and Hall.
Simmons, G. (1971). Identifying probability limits. The Annals of Mathe-
matical Statistics, 42, 1429–1433.
Simonoff, J. S. (1996). Smoothing Methods in Statistics. New York: Springer.
Singh, K. (1981). On the asymptotic accuracy of Efron's bootstrap. The
Annals of Statistics, 9, 1187–1195.
Slomson, A. B. (1991). An Introduction to Combinatorics. Boca Raton, FL:
CRC Press.
Sprecher, D. A. (1970). Elements of Real Analysis. New York: Dover.
Sprent, P. and Smeeton, N. C. (2007). Applied Nonparametric Statistical
Methods. 4th Ed. Boca Raton, FL: Chapman and Hall/CRC Press.
Stefansson, G., Kim, W.-C., and Hsu, J. C. (1988). On confidence sets in
multiple comparisons. Statistical Decision Theory and Related Topics IV.
S. S. Gupta and J. O. Berger, eds. New York: Academic Press. 89–104.
Stone, C. J. (1984). An asymptotically optimal window selection rule for
kernel density estimates. The Annals of Statistics, 12, 1285–1297.
Tchebycheff, P. (1890). Sur deux théorèmes relatifs aux probabilités. Acta
Mathematica, 14, 305–315.
Tibshirani, R. (1988). Variance stabilization and the bootstrap. Biometrika,
75, 433–444.
van Beek, P. (1972). An application of Fourier methods to the problem
of sharpening of the Berry-Esseen inequality. Zeitschrift für Wahrschein-
lichkeitstheorie und Verwandte Gebiete, 23, 183–196.
van der Vaart, A. W. (1998). Asymptotic Statistics. Cambridge: Cambridge
University Press.
von Mises, R. (1947). On the asymptotic distribution of differentiable statis-
tical functionals. The Annals of Mathematical Statistics, 18, 309–348.
Wand, M. P. (1997). Data-based choice of histogram bin width. The Ameri-
can Statistician, 51, 59–64.
Wand, M. P. and Jones, M. C. (1995). Kernel Smoothing. London: Chapman
and Hall.
Westenberg, J. (1948). Significance test for median and interquartile range in
samples from continuous populations of any form. Proceedings Koninklijke
Nederlandse Akademie van Wetenschappen, 51, 252–261.
Westfall, P. H. and Young, S. S. (1993). Resampling-Based Multiple Testing.
New York: John Wiley and Sons.
Winterbottom, A. (1979). A note on the derivation of Fisher’s transformation
of the correlation coefficient. The American Statistician, 33, 142–143.
Withers, C. S. (1983). Asymptotic expansions for the distribution and quan-
tiles of a regular function of the empirical distribution with applications to
nonparametric confidence intervals. The Annals of Statistics, 11, 577–587.
Withers, C. S. (1984). Asymptotic expansions for distributions and quan-
tiles with power series cumulants. Journal of The Royal Statistical Society,
Series B, 46, 389–396.
Wolfowitz, J. (1965). Asymptotic efficiency of the maximum likelihood esti-
mator. Theory of Probability and its Applications, 10, 247–260.
Woodroofe, M. (1970). On choosing a delta-sequence. The Annals of Math-
ematical Statistics, 41, 1665–1671.
Yamamuro, S. (1974). Differential Calculus in Topological Linear Spaces.
Lecture Notes in Mathematics, Number 374. Berlin: Springer-Verlag.
Young, N. (1988). An Introduction to Hilbert Space. Cambridge: Cambridge
University Press.
Zolotarev, V. M. (1986). Sums of Independent Random Variables. New York:
John Wiley and Sons.