Probability Essentials (Jacod J., Protter P.)
Jean Jacod
Université de Paris VI
Laboratoire de Probabilités
4, place Jussieu - Tour 56
75252 Paris Cedex 05, France
e-mail: jj@ccr.jussieu.fr

Philip Protter
School of Operations Research and Industrial Engineering
Cornell University
219 Rhodes Hall
Ithaca, NY 14853, USA
e-mail: protter@orie.cornell.edu

Sketch of Carl Friedrich Gauß (by J. B. Listing; Nachlass Gauß, Posth. 26) by kind permission of Universitätsbibliothek Göttingen. Photograph of Paul Lévy by kind permission of Jean-Claude Lévy, Denise Piron, and Marie-Hélène Schwartz. Photograph of Andrei N. Kolmogorov by kind permission of Albert N. Shiryaev.

Mathematics Subject Classification (2000): 60-01, 60E05, 60E10, 60G42

Cataloging-in-Publication Data applied for.
Bibliographic information published by Die Deutsche Bibliothek. Die Deutsche Bibliothek lists this publication in the Deutsche Nationalbibliografie; detailed bibliographic data is available in the Internet.

ISBN 3-540-43871-8 2nd Edition Springer-Verlag Berlin Heidelberg New York
ISBN 3-540-66419-X 1st Edition Springer-Verlag Berlin Heidelberg New York

This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer-Verlag. Violations are liable for prosecution under the German Copyright Law.

Springer-Verlag Berlin Heidelberg New York, a member of BertelsmannSpringer Science+Business Media GmbH. http://www.springer.de

© Springer-Verlag Berlin Heidelberg 2000, 2003
Printed in Italy

Typesetting: Camera-ready copy from the author using a Springer TeX macro package
Cover design: design & production GmbH, Heidelberg
Printed on acid-free paper
SPIN: 10884210 41/3142LK - 5 4 3 2 1 0

To Diane and Sylvie
and
To Rachel, Margot, Olivier, Serge, Thomas, Vincent and Martin

Preface to the Second Edition

We have made small changes throughout the book, including the exercises, and we have tried to correct if not all, then at least most of the typos. We wish to thank the many colleagues and students who have commented constructively on the book since its publication two years ago, and in particular Professors Valentin Petrov, Esko Valkeila, Volker Priebe, and Frank Knight.

Jean Jacod, Paris
Philip Protter, Ithaca
March, 2002

Preface to the First Edition

We present here a one semester course on Probability Theory. We also treat measure theory and Lebesgue integration, concentrating on those aspects which are especially germane to the study of Probability Theory. The book is intended to fill a current need: there are mathematically sophisticated students and researchers (especially in Engineering, Economics, and Statistics) who need a proper grounding in Probability in order to pursue their primary interests. Many Probability texts available today are celebrations of Probability Theory, containing treatments of fascinating topics to be sure, but nevertheless they make it difficult to construct a lean one semester course that covers (what we believe) are the essential topics. Chapters 1-23 provide such a course. We have indulged ourselves a bit by including Chapters 24-28 which are highly optional,
but which may prove useful to Economists and Electrical Engineers. “‘\This book had its origins in a course the second author gave in Perugia. Italy in 1997; he used the samizdat “notes” of the first author. long used for courses at the University of Paris VI, augmenting them as needed, The result has been further tested at courses given at Purdue University. We thank the indulgence and patience of the students both in Perugia and in West Lafayette. We also thank our editor Catriona Byrne, as well as Nick Bingham for many superb suggestions, an anonymous referee for the same. and Judy Mitchell for her extraordinary typing skills. Jean Jacod, Paris Philip Protter, West Lafayette Table of Contents 1. Introduction .. 1 2. Axioms of Probability ...........0. 0.0 ccc eee cee eee eee eee 7 3. Conditional Probability and Independence. . 15 4, Probabilities on a Finite or Countable Space.............. 21 5. Random Variables on a Countable Space ... 27 6. Construction of a Probability Measure.............-.....5 35 7. Construction of a Probability Measure on R... 39 8. Random Variables ..........6 6.00. c cece teen ete tenes 47 9. Integration with Respect to a Probability Measure ....... 51 10. Independent Random Variables...................0000e0 ee 65 11. Probability Distributions on R............... 00. cece eee 77 12. Probability Distributions on R® ............66. 2... e eee eee 87 13. Characteristic Functions .......... 0.00. .0 00 eee eeeeee eee ee 103 14. Properties of Characteristic Functions ............-......+ 111 15. Sums of Independent Random Variables .................- 117 16. Gaussian Random Variables (The Normal and the Multivariate Normal Distributions) ... - 125 17. Convergence of Random Variables ................+0..5005 141 18. Weak Convergence .. x Table of Contents 19. Weak Convergence and Characteristic Functions... 20. The Laws of Large Numbers................0 00 cee e eves 173 21. The Central Limit Theorem ...................0.. eee eee 181 22. L? and Hilbert Spaces ...................... 022 e cece eee 189 23. Conditional Expectation ................ 0.0000 eee eee 197 24. Martingales....... 0.0... ccc cence etnies 211 25. Supermartingales and Submartingales ............-...--++ 219 26. Martingale Inequalities ............. 0.5.06 cece eee eee 223 27. Martingale Convergence Theorems ............---....++5+ 229 28. The Radon-Nikodym Theorem.................6-.00e eee 243, References ..... 6.6.66. cece tenet t tee tee eens 249 1. Introduction Almost everyone these days is familiar with the concept of Probability. Each day we are told the probability that it will rain the next day; frequently we discuss the probabilities of winning a lottery or surviving the crash of an air- plane. The insurance industry calculates (for example) the probability that. a man or woman will live past his or her eightieth birthday, given he or she is 22 years old and applying for life insurance. Probability is used in business too: for example, when deciding to build a waiting area in a restaurant, one wants to calculate the probability of needing space for more than n people each day; a bank wants to calculate the probability a loan will be repaid; a manufacturer wants to calculate the probable demand for his product in the future. In medicine a doctor needs to calculate the probability of success of various alternative remedies; drug companies calculate the probability of harmful side effects of drugs. 
An example that has recently achieved spec- tacular success is the use of Probability in Economics, and in particular in Stochastic Finance Theory. Here interest rates and security prices (such as stocks, bonds, currency exchanges) are modelled as varying randomly over time but subject to specific probability laws; one is then able to provide in- surance products (for example) to investors by using these models. One could go on with such a list. Probability theory is ubiquitous in modern society and in science. Probability theory is a reasonably old subject. Published references on games of chance (i.e., gambling) date to J. Cardan (1501-1576) with his book De Ludo Alae {4]. Probability also appears in the work of Kepler (1571-1630) and of Galileo (1564-1642). However historians seem to agree that the subject really began with the work of Pascal (1623-1662) and of Fermat (1601-1665). The two exchanged letters solving gambling “paradoxes” posed to them by the aristocrat de Méré. Later the Dutch mathematician Christian Huygens (1629-1695) wrote an influential book [13] elaborating on the ideas of Pascal and Fermat. Finally in 1685 it was Jacques Bernoulli (1654-1705) who pro- posed such interesting probability problems (in the “Journal des Scavans”) (see also [3]) that it was necessary to develop a serious theory to answer them. After the work of J. Bernoulli and his contemporary A. De Moivre (1667-1754) [6], many renowned mathematicians of the day worked on prob- ability problems, including Daniel Bernoulli (1700-1782), Euler (1707-1803), 2 1. Introduction Gauss (1777-1855), and Laplace (1749-1827). For a nice history of Probabil- ity before 1827 (the year of the death of Laplace) one can consult [21]. In the twentieth century it was Kolmogorov (1903-1987) who saw the connection between the ideas of Borel and Lebesgue and probability theory and he gave probability theory its rigorous measure theory basis. After the fundamental work of Kolmogorov. the French mathematician Paul Lévy (1886-1971) set the tone for modern Probability with his seminal work on Stochastic Pro- cesses as well as characteristic functions and limit theorems, We think of Probability Theory as a mathematical model of chance. or random events. The idea is to start with a few basic principles about how the laws of chance behave. These should be sufficiently simple that one can believe them readily to correspond to nature. Once these few principles are accepted, we then deduce a mathematical theory to guide us in more com- plicated situations. This is the goal of this book. We now describe the approach of this book. First we cover the bare essen- tials of discrete probability in order to establish the basic ideas concerning probability measures and conditional probability. We next consider proba- bilities on countable spaces, where it is easy and intuitive to fix the ideas. We then extend the ideas to general measures and of course probability mea- sures on the real numbers. This represents Chapters 2-7. Random variables are handled analogously: first on countable spaces and then in general. In- tegration is established as the expectation of random variables, and later the connection to Lebesgue integration is clarified. This brings us through Chapter 12. Chapters 13 through 21 are devoted to the study of limit theorems, the central feature of classical probability and statistics. We give a detailed treat- ment of Gaussian random variables and transformations of random variables. as well as weak convergence. 
Conditional expectation is not presented via the Radon-Nikodym theo- rem and the Hahn-Jordan decomposition. but rather we use Hilbert Space projections. This allows a rapid approach to the theory, To this end we cover the necessities of Hilbert space theory in Chapter 22: we nevertheless extend the concept of conditional expectation beyond the Hilbert space setting to include integrable randoin variables. This is done in Chapter 23. Last, in Chapters 24-28 we give a beginning taste of martingales, with an applica- tion to the Radon—-Nikodym Theorem. These last five chapters are not really needed for a course on the “essentials of probability”. We include them how- ever because many sophisticated applications of probability use martingales; also martingales serve as a nice introduction to the subject of stochastic pro- cesses. We have written the book independent of the exercises. That is, the im- portant material is in the text itself and not in the exercises. The exercises provide an opportunity to absorb the material by working with the subject. Starred exercises are suspected to be harder than the others. 1. Introduction 3 We wish to acknowledge that Allan Gut’s book [11] was useful in providing exercises, and part of our treatment of martingales was influenced by the delightful introduction to the book of Richard Bass [1}. No probability background is assumed. The reader should have a good knowledge of (advanced) calculus. some linear algebra, and also “mathemat- ical sophistication”. Random Experiments Random experiments are experiments whose output cannot be surely pre- dicted in advance. But when one repeats the same experiment a large number of times one can observe some “regularity” in the average output. A typical example is the toss of a coin: one cannot predict the result of a single toss, but if we toss the coin many times we get an average of about 50% of “heads” if the coin is fair. The theory of probability aims towards a mathematical theory which describes such phenomena. This theory contains three main ingredients: a) The state space: this is the set of all possible outcomes of the experiment, and it is usually denoted by 92. Examples: 1) A toss of a coin: 2 = {h.t}. 2) Two successive tosses of a coin: 2 = {hh.tt-ht,th}. 3) A toss of two dice: 2 = {(i,j):11 be a sequence of rationals decreasing to a and (bp )n>1 be a sequence of rationals increasing strictly to b. Then (a,b) = U%4 (an, Dn] = Una ((— 08, bn] 1 (90, an]®) Therefore C C o(D), whence o(C) C o(D). However since each element of D is a closed set, it is also a Borel set, and therefore o(D) is contained in the Borel sets B. Thus we have B=o(C)Co(D) cB, and hence o(D) = B. Qo On the state space 2 the family of all events will always be a o-algebra A: the axioms (1), (2) and (3) correspond to the “logical” operations described in Chapter 1, while Axiom (4) is necessary for mathematical reasons. The probability itself is described below: Definition 2.3. A probability measure defined on a a-algebra A of 2 is a function P : A— [0,1] that satisfies: 1. P(Q)=1 2. For every countable sequence (An )n>1 of elements of A, pairwise disjoint (that is, A, NAm = whenever n 4m), one has P(U® An) = So P(An)- n=1 Axiom (2) above is called countable additivity; the number P(A) is called the probability of the event A. In Definition 2.3 one might imagine a more naive condition than (2). namely: A,BeA, ANB=0 + P(AUB)=P(A)+P(B). (2.1) 2. 
Axioms of Probability 9 This property is called additivity (or “finite additivity”) and, by an elemen- tary induction , it implies that for every finite A;,...Am of pairwise disjoint events A; € A we have P(UnL1An) = D> P(An)- n=1 Theorem 2.2. If P is a probability measure on (92..A), then: (i) We have P(0) = (ii) P is additive. Proof. If in Axiom (2) we take A, = @ for all n, we see that the number a = P(@) is equal to an infinite sum of itself; since 0 < a < 1, this is possible only if a = 0, and we have (i). For (ii) it suffices to apply Axiom (2) with A; = A and Aj = Band Ag = Ay =... = 0, plus the fact that P(@) = 0, to obtain (2.1). o Conversely, countable additivity is not implied by additivity. In fact, in spite of its intuitive appeal, additivity is not enough to handle the mathe- matical problems of the theory, even in such a simple example as tossing a coin, as we shall see later. The next theorem (Theorem 2.3) shows exactly what is extra when we assume countable additivity instead of just finite additivity. Before stating this theorem, and to see that the last four conditions in it are meaningful, let us mention the following immediate consequence of Definition 2.3: A,CeA, ACC = P(A)< P(C) (take B = A°NC, hence AN B = @ and AU B = C, and apply (2.1)). Theorem 2.3. Let A be a o-algebra. Suppose that P : A — [0,1] satisfies (1) and is additive. Then the following are equivalent: (i) Aaiom (2) of Definition (2.3). (ii) If An € A and Ay, | 0, then P(An) | 0. (iii) If An € A and An | A, then P(An) | P(A). (iv) If An € A and An t 2, then P(An) 11. (v) If An € A and An | A, then P(An) t P(A). Proof. The notation A, | A means that Aj4) C An, each n, and N22, An = A. The notation A, f A means that Ay C Any; and U2 )An = Note that if A, | A, then AS 1 A‘, and by the finite additivity axiom P(AS) =1- P(Ay). Therefore (i) is equivalent to (iv) and similarly (iii) is equivalent to (v). Moreover by choosing A to be 2 we have that (v) implies (iv). Suppose now that we have (iv). Let A, € A with A, 7 A. Set B, A, UA®. Then B,, increases to (2, hence P(B,) increases to 1. Since A, C A we have A, A° = 0, whence P(A, U A‘) = P(A,) + P(A®). Thus 10 2. Axioms of Probability 1 = lim P(B,) = lim {P(An) + P(A}. whence lity: P(A,) = 1 — P(A‘) = P(A). and we have (v). It remains to show that (i) is equivalent to (v). Suppose we have (v). Let An € A be pairwise disjoint: that is. ifn # m. then A,QAyn = 0. Define B, = UrepenAp and B = UX.;A,. Then by the definition of a Probability Measure we have P(B,) = Soy P(Ap) which increases with n to D*, P(An). and also P(B,,) increases to P(B) by (v). We deduce lim,— P(B,) = P(B) and we haye x P(B) = P(UL1 An) = SO P(An) n=1 and thus we have (i). Finally assume we have (i). and we wish to establish (v). Let A, € A. with A, increasing to A, We construct a new sequence as follows: B, =A. Bz = Ag\ Ay = A2 (Aj). By = An\ Ant. Then UX, B, = A and the events (B,),>1 are pairwise disjoint. Therefore by (i) we have n P(A) = lim SY P(B,). p=l But also 37’_, P(Bp) = P(An). whence we deduce lim, P(An) = P(A) and we have (vy). Oo If A€ 2%. we define the indicator function by lifweA. Lae) = {0 ifw¢ A. We often do not explicitly write the w. and just write 1,4. We can say that A, € A converges to A (we write A, — A) if limp se La, (w) = la(w) for all w € 2. Note that if the sequence A, in- creases (resp. decreases) to A. then it also tends to A in the above sense. Theorem 2.4. Let P be a probability measure, and let An be a sequence of events in A which converges to A. 
Then A€ A and linn P(An) = P(A). Proof. Let us define \ lim sup Ap = Ny Um>n Am nC oi liminf A, = US; Amon Am- 150 2, Axioms of Probability 1 Since A is a o-algebra. we have limsup,_.., An € A and lim inf,.4, An € A (see Exercise 2.4). By hypothesis A, converges to A. which means lim, la, = 1a. all w. This is equivalent to saying that A = limsup, An = lim infty. An. Therefore A € A. Now let Bh = Om>nAm and Cy = Un>nAm- Then B, increases to A and C;, decreases to A, thus lima sx P(Bn) = limn—oc P(C,) = P(A), by Theorem 2.3. However B, C An C Cy. therefore P(Bn) < P(An) < P(C),). so lim, +5 P(An) = P(A) as well. q 12 2. Axioms of Probability Exercises for Chapter 2 2.1 Let 2 be a finite set. Show that the set of all subsets of 2, 2%, is also finite and that it is a o-algebra. 2.2 Let (Ga)aca be an arbitrary family of c-algebras defined on an abstract space 2. Show that H = NacaGa is also a o-algebra. 2.3 Let (A,)n>1 be a sequence of sets. Show that (De Morgan’s Laws) a) (Um An)® = Me An b) (M1 An)® = Urea An 2.4 Let A be a o-algebra and (A,)n>1 a sequence of events in A. Show that liminf A, €.A; limsupA, €A; and liminf A, C limsup An. n—00 noo nav00 n30 2.5 Let (An)no1 be a sequence of sets. Show that limsup 1a, ~ liminf 14, = 1 fimsup, An\limint, An} n—00 n-00 (where A\ B = A Be whenever BC A). 2.6 Let A be a o-algebra of subsets of (2 and let B € A. Show that F = {AN B: A€é A} is a o-algebra of subsets of B. Is it still true when B is a subset of §2 that does not belong to A ? 2.7 Let f be a function mapping 2 to another space E with a o-algebra €. Let A = {Ac 2: there exists B € € with A = f~'(B)}. Show that A is a o-algebra on 22. 2.8 Let f : R > R be a continuous function, and let A = {A C R: there exists B € B with A= f~1(B)} where B are the Borel subsets of the range space R. Show that A C B, the Borel subsets of the domain space R. For problems 2.9-2.15 we assume a fixed abstract space 2, a o-algebra A, and a Probability P defined on (Q,.A). The sets A, B, A;, etc... always belong to A. 2.9 For A,B € A with AN B = 0, show P(AUB) = P(A) + P(B). 2.10 For A, B € A, show P(AU B) = P(A) + P(B) ~ P(ANB). 2.11 For A € A, show P(A) = 1- P(A®). 2.12 For A,B € A, show P(ANB°) = P(A) — P(ANB). Exercises 13 2.13 Let Ay.....4 A, be given events. Show that P (UL, Ay) = SO P(Ad — SO P(A Aj) + SE P(A;N.A; Ag) = (=D P(ALN AD... An) i SPA) — PAY), i=l i 0. The conditional probability of A given B is P(A| B) = P(ANB)/P(B). Theorem 3.2. Suppose P(B) > 0. (a) A and B are independent if and only if P(A| B) = P(A). (b) The operation A — P(A| B) from A — [0,1] defines a new probability measure on A, called the “conditional probability measure given B”. Proof. We have already established (a) in the discussion preceding the the- orem. For (b), define Q(A) = P(A | B), with B fixed. We must show Q satisfies (1) and (2) of Definition 2.3, But P(QNB) P(B) _ Q(2) = P(2| B) = ER = BB = Therefore, Q satisfies (1). As for (2), note that if (Ap)n>1 is a sequence of elements of A which are pairwise disjoint, then PUUR An) OB) — P(UR (ALO B)) Q (Urdu) = P (Uf Ay | B) = Se 0) and also the sequence (A, 7 B)n>1 is pairwise disjoint as well; thus -> ee Yo P(An |B) = 3 QAn). i n=1 3. Conditional Probability and Independence 17 The next theorem connects independence with conditional probability for a finite number of events. Theorem 3.3. [f A,,..., A, € A and if P(A, N...N An-1) > 0, then P(A) MN... An) = P(A;)P(Ag | Ay) P(A3 | A, Ag) + P(Ap | APO... An-2). Proof. We use induction. 
For n = 2, the theorem is simply Definition 3.2. Suppose the theorem holds for n— 1 events. Let B= A, M...9 An—y. Then by Definition 3.2 P(BNA,) = P(A, | B)P(B); next we replace P(B) by its value given in the inductive hypothesis: P(B) = P(Ay)P(Ap | At)... P(An—-1 | Ar... Ana), and we get the result. a A collection of events (E,) is called a partition of 92 if E, € A, each n, they are pairwise disjoint, P(E,,) > 0, each n, and U,E, = 22. Theorem 3.4 (Partition Equation). Let (E,)n>, be a finite or countable partition of 2. Then if AG A, P(A) = YO P(A| En)P(En). Proof. Note that A=AN2= AN (UnEn) = Un(AN E,). Since the E,, are pairwise disjoint so also are (AN En)n>1, hence P(A) = P(Un(AN En) = 3) P(AN En) = ¥° P(A | En)P(En)- n n o Theorem 3.5 (Bayes’ Theorem). Let (E,) be a finite or countable parti- tion of 2, and suppose P(A) > 0. Then P(A| En)P(En) Sn P(A| Em)P( Em)” Proof. By Theorem 3.4 we have that the denominator YO PUA | Em) P(Em) = P(A). P(En | A) = Therefore the formula becomes P(A| E,)P(En) _— P(AN En) _ Pay Bay Pn | AD. Oo Bayes’ theorem is quite simple but it has profound consequences both in Probability and Statistics. See, for example, Exercise 3.6. 18 3. Conditional Probability and Independence Exercises for Chapter 3 In all exercises the probability space is fixed. and A. B, An, etc... are events. 3.1 Show that if AN B = 0. then A and B cannot be independent unless P(A) =0 or P(B) = 0. 3.2 Let P(C) > 0. Show that P(AUB | C) = P(A| C)+P(B | C)—P(AnB | C). 3.3 Suppose P(C) > 0 and Aj..... A, are all pairwise disjoint. Show that n P(UR,Ai |C) = 9 P(A: | C). 3.4 Let P(B) > 0. Show that P(ANB) = P(A| B)P(B). 3.5 Let 0 < P(B) <1 and A be any event. Show P(A) = P(A| B)P(B) + P(A| B°)P(B’). 3.6 Donated blood is screened for AIDS. Suppose the test has 99% accuracy, and that one in ten thousand people in your age group are HIV positive. The test has a 5% false positive rating. as well. Suppose the test screens you as positive. What is the probability you have AIDS? Is it 99%? (Hint: 99% refers to P (test positive|you have AIDS). You want to find P (you have AIDS|test is positive). 3.7 Let (An)n>i € A and (By)n>1 € Aand A, — A (see before Theorem 2.4 for the definition of A, — A) and B, — B, with P(B) > 0 and P(B,) > 0. all n. Show that a) limp P(An |B) = P(A| B). b) limp x P(A | Bp) = P(A| B). ¢) lima +x P(An | Ba) = P(A| B). 3.8 Suppose we model tossing a coin with two outcomes, H and T, repre- senting Heads and Tails. Let P(H) = P(T) = 3. Suppose now we toss two such coins, so that the sample space of outcoines {2 consists of four points: HH, HT, TH, TT. We assume that the tosses are independent. a) Find the conditional probability that both coins show a head given that the first shows a head (answer: 3). b) Find the conditional probability that both coins show heads given that at least one of them is a head (answer: 4). 3.9 Suppose A, B, C are independent events and P(AN B) ¥ 0. Show P(C| ANB) = P(C). Exercises 19 3.10 A box has r red and b black balls. A ball is chosen at random from the box (so that each ball is equally likely to be chosen). and then a second ball is drawn at random from the remaining balls in the box. Find the probabilities that é 2 nr) a) Both balls are red [Ans.: ==] b) The first ball is red and the second js black [Ans. apts 3.11 (Polya’s Urn) An urn contains r red balls and b blue balls. A ball is chosen at random from the urn. its color is noted, and it is returned together with d more balls of the same color. This is repeated indefinitely. 
What is the probability that a) The second ball drawn is blue? [Ans. its oF b) The first ball drawn js blue given that the second ball drawn is blue? s.: _btd [Ans.: 45 3.12 Consider the framework of Exercise 3.11. Let B,, denote the event that the nth ball drawn is blue. Show that P(B,) = P(B,) for all n > 1. 3.13 Consider the framework of Exercise 3.11. Find the probability that the first. ball is blue given that the n subsequent drawn balls are all blue. Find the limit of this probability as n tends to oo. [Ans.: -2#24); limit is 1| 3.14 An insurance company insures an equal number of male and female drivers. In any given year the probability that a male driver has an accident involving a claim is a, independently of other years. The analogous prob- ability for females is 9. Assume the insurance company selects a driver at random. a) What is the probability the selected driver will make a claim this year? 5: ate Ans.: 93" b) What is the probability the selected driver makes a claim in two consec- utive years? [Ans | 3.15 Consider the framework of Exercise 3.14 and let A;, Ag be the events that a randomly chosen driver makes a claim in each of the first and second years, respectively. Show that P(Az | Ai) > P(A;). [Ans. P(Ag | Ai) — P(At) = se 2a+3) 3.16 Consider the framework of Exercise 3.14 and find the probability that a claimant js female. [Ans.: =25] 3.17 Let Aj, Ao.....An be independent events. Show that the probability that none of the Ay...., An occur is less than or equal to exp(— 2”_, P(A,))- 20 3. Conditional Probability and Independence 3.18 Let A, B be events with P(A) > 0. Show P(ANB| AUB) < P(ANB| A). 4. Probabilities on a Finite or Countable Space For Chapter 4, we assume 2 is finite or countable, and we take the o-algebra A = 2 (the class of all subsets of 2). Theorem 4.1. (a) A probability on the finite or countable set 2 is charac- terized by its values on the atoms: p., = P({w}), w € 2. (b) Let (pu)wee be a family of real numbers indexed by the finite or count- able set 2. Then there exists a unique probability P such that P({w}) = pw if and only if p> 0 and Yeo Pw = 1. When £2 is countably infinite, >, p.. is the sum of an infinite number of terms which a priori are not ordered: although it is possible to enumerate the points of §2, such an enumeration is in fact arbitrary. So we do not have a proper series, but rather a “summable family”. In the Appendix to this chapter we gather some useful facts on summable families. Proof. Let A € A; then A = Usea{w}, a finite or countable union of pairwise disjoint singletons. If P is a probability, countable additivity yields P(A) = P (Useatw}) = YO P({w}) = YO vw. wea weA Therefore we have (a). For (b). note that if P({w}) = p,, then by definition p,, > 0, and also 1 = P(2) = P Useo{w}) = Yo Pw) = Yo rw. wen wen For the converse, if the p, satisfy p, > 0 and )>¢¢ Py = 1, then we define a probability P by P(A) = Yo,¢4 Pw, With the convention that an “empty” sum equals 0. Then P(@) = 0 and P(22) = cq Pw = 1. For countable additivity, it is trivial when 2 is finite; when 22 is countable it follows from the fact that one has the following associativity: 7je7 Duea, Pu = Lweuies A, Po if the Aj are pairwise disjoint. o Suppose first that §2 is finite. Any family of nonnegative terms summing up to 1 gives an example of a probability on (2. But among all these examples the following is particularly important: 22 4. Probabilities on a Finite or Countable Space Definition 4.1. 
A probability P on the finite set 2 is called uniform if p, = P({w}) does not depend on x. In this case. it is immediate that = #4 #Q) Then computing the probability of any event A amounts to counting the number of points in A. On a given finite set 2 there is one and only one uniform probability. We now give two examples which are important for applications. P(A) a) The Hypergeometric distribution. An urn contains N white balls and AM black balls. One draws n balls without replacement, so n < N + Mf. One gets X white balls and n — X black balls. One is looking for the probability that X = 2, where « is an arbitrary fixed integer. Since we draw the balls without replacement, we can as well suppose that the n balls are drawn at once. So it becomes natural to consider that an outcome is a subset with n elements of the set {1.2..... N+M} of all N+.M balls (which can be assumed to numbered from 1 to N +1). That is, (2 is the family of all subsets with n points, and the total number of possible outcomes is #(2) = (wE™) = wees: recall that for p and q two integers with p 0 is the probability P defined on N by 7 ast, nl 2. The Geometric distribution of parameter a € [0,1) is the probability defined on N by Pn Pn = (1—aja”, n=0,1,2.3,.... 24 4. Probabilities on a Finite or Countable Space Note that in the Binomial model if n is large. then while in theory Cra —p)"~/ is known exactly, in practice it can be hard to compute. (Often it is beyond the capacities of quite powerful hand calculators. for ex- ample.) If nis large and p is small, however (as is often the case), there is an alternative method which we now describe. Suppose p changes with n; call it p,. Suppose further lim, .5< NPyp = A. One can show (see Exercise 4.1) that n i (1 —p,)?-F =e Pe im, ("Joos (L= pn)" =e i and thus one can easily approximate a Binomial probability (in this case) with a Poisson. Appendix: Some useful result on series In this Appendix we give a summary, mostly without proofs, of some useful result on series and summable families: these are primarily useful for studying probabilities on countable state spaces. These results (with proofs) can be found in most texts on Calculus (for example, see Chapter 10 of [18]). First we establish some conventions. Quite often one is led to perform calculations involving +oc (written more simply as 0c) or —oc. For these calculations to make sense we always use the following conventions: +oot+00 = +00, —oo-00 =—00, atoo=+00, a-co=-oo ifaeR, 0x0=0, a€|0,co] + axc=+00, a€[-0,0[ = ax w=-x. Let up; be a sequence of numbers, and consider the “partial sums” S, = ty +... + Un. S1: The series >, un is called convergent if S, converges to a finite limit S, also denoted by S$ = 37, un (the “sum” of the series). S2: The series So, un is called absolutely convergent if the series >, |tn| converges. S3: If un, > 0 for all n, the sequence S;,, is increasing, hence always con- verges to a limit S € [0, oo]. We still write $ = }>,, un, although the series converges in the sense of (S1) if and only if S < oc. The summands uw, can even take their values in [0, oc] provided we use the conventions above concerning addition with oo. In general the convergence of a series depends on the order in which the terms are enumerated. There are however two important cases where the ordering of the terms has no influence, and one speaks rather of “summable families” instead of “series” in these cases, which are $4 and S5 below: 4. 
Probabilities on a Finite or Countable Space 25 S4: When the u, are reals and the series is absolutely convergent one can modify the order in which the terms are taken without changing the absolute convergence, nor the sum of the series. $5: When uw, € [0.00] for all n, the sum }>,, un (which is finite or infinite: cf. (S3) above) does not change if the order is changed. S6: When up € [0,00], or when the series is absolutely convergent, we have the following associativity property: let (A;)icy be a partition of N*, with J = {1,2,...,.N} for some integer N, or J = N*. For each i € J we set vi = Daca, Un: if A; is finite this is an ordinary sum, otherwise v; is itself the sum of a series. Then we have 37, un = Dye, Ui (this latter sum is again the sum of a series if J = N*). 26 4. Probabilities on a Finite or Countable Space Exercises for Chapter 4 4.1 (Poisson Approximation to the Binomial) Let P be a Binomial proba- bility with probability of success p and number of trials n. Let A = pn. Show that P(k successes) k n —k BOG) CS DYOa) Let n — oc and let p change so that \ remains constant. Conclude that for small p and large n, AK oy P(k successes) © where \ = pn. {Note: In general for this approximation technique to be good one needs n large, p small. and also \ = np to be of moderate size — for example A < 20.] 4.2 (Poisson Approximation to the Binomial, continued) In the setting of Exercise 4.1, let py = P({k}) and q, = 1— px. Show that the g, are the probabilities of singletons for a Binomial distribution B(1 — p,n). Deduce a Poisson approximation of the Binomial when n is large and p is close to 1. 4.3 We consider the setting of the hypergeometric distribution, except that we have m colors and N; balls of color i. Set N = Ny+...+Nmm.and call X; the number of balls of color i drawn among n balls. Of course X;+...+ Xm =n. Show that CpG) P(X, = 21,00 Xm=em)=4 CE) 5 eee 0 otherwise. 5. Random Variables on a Countable Space In Chapter 5 we again assume {2 is countable and A= 2°. A random vari- able X in this case is defined to be a function from 92 into a set T. A random variable represents an unknown quantity (hence the term variable) that varies not as a variable in an algebraic relation (such as 2? —9 = 0). but rather varies with the outcome of a random event. Before the random event. we know which values X could possibly assume, but we do not know which one it will take until the random event happens. This is analogous to algebra when we know that ¢ can take on a priori any real value, but we do not know which one (or ones) it will take on until we solve the equation x? — 9 = 0 (for example). Note that even if the state space (or range space) T is not countable, the image T’ of 2 under X (that is, all points {i} in T for which there exists an w € 2 such that X(w) = 7) is either finite or countably infinite. We can then define the distribution of X (also called the law of X) on the range space T’ of X by P*(A) = P({w: X(w) € A}) = P(X71(A)) = P(X € A). That this formula defines a Probability measure on T' (with the o-algebra 27 of all subsets of T’) is evident. Since T’ is at most countable. this probability is completely determined by the following numbers: PP=PX=f)= YO pe. {uiX(w)=j} Sometimes, the family (px :j € T’) is also called the distribution (or the law) of X. We have of course Px(A) = 0j<4 7}. If P* has a known distribution, for example Poisson. then we say that X is a Poisson random variable. Definition 5.1. Let X be a real-valued random variable on a countable space 92. 
The expectation of X, denoted E{X}, is defined to be F{X}= Xp. provided this sum makes sense: this is the case when is finite; this is also the case when 92 is countable, when the series is absolutely convergent or 28 5. Random Variables on a Countable Space X > 0 always (in the latter case. the above sum and hence E{X} as well may take the value +9c). This definition can be motivated as follows: If one repeats an experiment n times, and one records the values X;, X2,...,X, of X corresponding to the n outcomes, then the empirical mean 4(X1+...+Xn) is Duco X(w) fn({e}), where f,({w}) denotes the frequency of appearance of the singleton {w}. Since f,({w}) “converges” to P({w}), it follows (at least when 92 is finite) that the empirical mean converges to the expectation E{X } as defined above. Define L' to be the space of real valued random variables on (2,4, P) which have a finite expectation. The following facts follow easily: (i) C? is a vector space, and the expectation operator F is linear, (ii) the expectation operator FE is positive: if X € £' and X > 0, then E{X} > 0. More generally if X,Y € £' and X < Y then E{X} < E{Y}. (iii) C* contains all bounded random variables. If X = a, then E{X} =a. (iv) If X € £}, its expectation depends only on its distribution and, if T’ is the range of X, E{X} = So §P(X = 5). (5.1) jet’ (v) If X = 1, is the indicator function of an event A, then E{X} = P(A). We observe that if )>,,(X(w))?p. is absolutely convergent, then SX pos YM Xeps+ SY Xe). @ [X(w)|a} < FCO} for alla>0. Proof. Since X is an r.v. so also is Y = A(X); let A={¥~*((a,20))} = (ws W(X(w)) > a} = (A(X) = a}. Then A(X) > ala, hence E{h(X)} > E{ala} = aE{1a} = aP(A) and we have the result. o 5. Random Variables on a Countable Space 29 Corollary 5.1 (Markov’s Inequality). PAx| 2 a} < HUD Proof. Take h(«) = |2| in Theorem 5.1. o Definition 5.2. Let X be a real-valued random variable with X? in £1. The Variance of X is defined to be o? =o% = E{(X — E(X))?}. The standard deviation of X, ax, is the nonnegative square root of the vari- ance. The primary use of the standard deviation is to report statistics in the correct (and meaningful) units. An example of the problem units can pose is as follows: let X denote the number of children in a randomly chosen family. Then the units of the vari- ance will be “square children”, whereas the units for the standard deviation ox will be simply “children”. If E{X} represents the expected, or average, value of X (often called the mean), then E{|X — E(X)|} = E{|X — p|} where p = E{X}, represents the average difference from the mean, and is a measure of how “spread out” the values of X are. Indeed, it measures how the values vary from the mean. The variance is the average squared distance from the mean. This has the effect of diminishing small deviations from the mean and enlarging big ones. However the variance is usually easier to compute than is £{|X — |}, and often it has a simpler expression. (See for example Exercise 5.11.) The variance too can be thought of as a measure of variability of the random variable X. Corollary 5.2 (Chebyshev’s Inequality). [f X? is in L', then we have a) P{|X| >a} < Ett for a>0. (b) prt amas for ado. Proof. Both inequalities are known as Chebyshev’s inequality. For part (a), take h(x) = 2* and then by Theorem 5.1 P{\X| >a} = P{A(X) > a2} < roy. For part (b), let ¥ = |X — E{X}]. Then P{|X — E{X}| > a} = P{Y > a} = P{¥? > a} < Pont = 7 Corollary 5.2 is also known as the Bienaymé-Chebyshev inequality. 30 5. 
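As a quick numerical sanity check (an added illustration, not part of the original text), the Markov and Chebyshev bounds can be compared with empirical frequencies from simulated data; the Poisson parameter, the sample size and the threshold a used below are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
lam, n, a = 3.0, 100_000, 6.0            # arbitrary illustration parameters
x = rng.poisson(lam, size=n)             # X ~ Poisson(lam), so E{X} = Var(X) = lam

# Markov: P(|X| >= a) <= E{|X|} / a  (here X >= 0)
print(np.mean(x >= a), "<=", x.mean() / a)

# Chebyshev: P(|X - E{X}| >= a) <= Var(X) / a**2
print(np.mean(np.abs(x - x.mean()) >= a), "<=", x.var() / a**2)
```

Both empirical frequencies stay below the corresponding bounds, as the corollaries guarantee; the bounds themselves are typically far from tight.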
Random Variables on a Countable Space Examples: 1) X is Poisson with paraineter A. Then X: 2 — N (the natural numbers), and i P(X EA) = PPX =)=Yo ran jEA jEA The expectation of X is x ay ee A j=0 j=0 7 jt ay Gate 2) X has the Bernoulli distribution if X takes on only two values: 0 and 1. X corresponds to an experiment with only two outcomes, usually called “success” and “failure”. Usually {X = 1} corresponds to “success”. Also it is customary to call P({X = 1}) = p and P({X = 0}) =q=1-p Note E{X} =1P(X =1) + 0P(X =0) = Lp+0.q=p. 3) X has the Binomial distribution if P* is the Binomial probability. That is, for a given and fixed n, X can take on the values {0,1,2...., nj. P({X = k}) = (eka —pyr-*, where 0

0. The constant c is such that ¢ 7°, jr = 1. The function 1 (s)= Soa ooh k=1 is known as the Riemann zeta function, and it is extensively tabulated. = 1, Thus c= Wer: and 32 5. Random Variables on a Countable Space 1 . 1 PRED = Tay The mean is easily calculated in terms of the Riemann zeta function: 4S apyxe ye Si BUR} DPX i) Gary LRA _ 62) C(a+1) 7) Ifthe state space EF of a random variable X has only a finite number of points, say n, and each point is equally likely, then X is said to have a uniform distribution. In the case where 1,2....,n, 1 . P(X =j)=7, J then X has the Discrete Uniform distribution with parameter n. Using that D2, i= M4EY, we have PO}= Dare an = Lead! = Hee) K a Exercises 33 Exercises for Chapter 5 5.1 Let g : [0.00) — [0, 00) be strictly increasing and nonnegative. Show that Eg Xt 9(@) 5.2 Let h: R = [0,a] be a nonnegative (bounded) function. Show that for Oa)< for a> 0. P{R(X) > a} > 5.3 Show that 0} = E{X?} — E{X}, assuming both expectations exist. 5.4 Show that E{X}? < H{X?} always, assuming both expectations exist. 5.5 Show that 0} = H{X(X —1)}+px —pX, where px = E{X}, assuming all expectations exist. 5.6 Let X be Binomial B(p,n). For what value of j is P(X = j) the greatest’? (Hint: Calculate Pe) [Ans.: [(n + 1)p], where [2] denotes integer part of «.] 5.7 Let X be Binomial B(p,n). Find the probability X is even. [Ans.: }(1+ (1 = 2p)"),] 5.8 Let X, be Binomial B(p,,n) with A = np, being constant. Let A, {X,, > 1}, and let Y be Poisson (\). Show that limo. P(Xn = j | An) PY=j|Y>1). 5.9 Let X be Poisson (X). What value of j maximizes P(X = j)? [Ans.: [A].] (Hint: See Exercise 5.6.) 5.10 Let X be Poisson (A). For fixed j > 0, what value of \ maximizes P(X = jy? [Ans.: j.] 5.11 Let X be Poisson (A) with A a positive integer. Show E{|X — Al} = 2Me* 2 oar? and that 6% =A. 5.12* Let X be Binomial B(p,n). Show that for A > 0 and e > 0, P(X —np > ne) < E{exp(A(X — np — ne))}. 5.13 Let X,, be Binomial B(p,n) with p > 0 fixed. Show that for any fixed b> 0, P(X, 0 fixed. and a > 0. Show that x (2 >a) < vol?) vO min { VoD). avi} and also that P(|X — np| < ne) tends to 1 for all ¢ > 0. —?P 5.15 * Let X be a Binomial a where n = 2m. Let a(m, k) = = AY p(X =m+h). Show that limm—.<(a(m, k))™ = e7 5.16 Let X be Geometric. Show that for i,j > 0, P(X >itj|X>i)=P(X > Jj). 5.17 Let X be Geometric (p). Show & {ty} = eat - pr. 5.18 A coin is tossed independently and repeatedly with the probability of heads equal to p. a) What is the probability of only heads in the first n tosses? b) What is the probability of obtaining the first tail at the nt® toss? c) What is the expected number of tosses required to obtain the first tail? {Ans.: 45. —P 5.19 Show that for a sequence of events (A,)n>1. 20 oo E {= la, \ =o (An), n=1 n=l where oc is a possible value for each side of the equation. 5.20 Suppose X takes all its values in N (= {0,1,2.3,...}). Show that x B{X} = YO P(X > n). n=0 5.21 Let X be Poisson (\). Show for r = 2.3, 4,..., E{X(X —1)...(X —r+ 1} =r". 5.22 Let X be Geometric (p). Show for r = 2,3, 4,.... rip” E{X(X-1). (Xar+Dh= Ge. 6. Construction of a Probability Measure Here we no longer assume 2 is countable. We assume given 2 and a o- algebra A Cc 2°. (Q,.A) is called a measurable space. We want to construct probability measures on A. When {2 is finite or countable we have already seen this is simple to do. 
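On a finite or countable Ω this construction can also be carried out directly on a computer; the sketch below is an added illustration (the choice Ω = {0, 1, 2, ...} with Poisson point masses is arbitrary) that builds P(A) as the sum of the weights p_ω over ω in a finite event A.

```python
from math import exp, factorial

lam = 3.0                                  # arbitrary choice of point masses p_w

def p(w):
    """Point mass p_w = P({w}) on Omega = {0, 1, 2, ...}; here Poisson(lam) weights."""
    return exp(-lam) * lam**w / factorial(w)

def prob(event):
    """P(A) = sum of p_w over w in A, for a finite event A."""
    return sum(p(w) for w in event)

print(prob({0, 1, 2}))                     # probability of a small event
print(prob(range(50)))                     # essentially P(Omega) = 1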
When 2 is uncountable, the same technique does not work: indeed, a “typical” probability P will have P({w}) = 0 for all w, and thus the family of all numbers P({w}) for w € (2 does not characterize the probability P in general. It turns out in many “concrete” situations — in particular in the next chapter — that it is often relatively simple to construct _a “probability” on an algebra which generates the g-algebra A. and the problem at hand is then to extend this probability to the o-algebra itself. So, let us suppose A is the g-algebra generated by an algebra Ao, and let us further suppose we are given a probability P on the algebra Ag: that is. a function P : Ag > [0.1] satisfying 1. P(Q)=1. 2. (Countable Additivity) for any sequence (A,,) of elements of Ap, pairwise disjoint, and such that U,An € Ao, we have P(UnAn) = D>, P(An). It might seem natural to use for A the set of all subsets of 2, as we did in the case where {2 was countable. We do not do so for the following reason, illustrated by an example: suppose {2 = {0, 1], and let us define a set function P on intervals of the form P((a.b]) = 6 — a, where 0 1 with An Am = @ for n #m; then one can prove that no such P exists! The collection of sets 2!!! js simply too big for this to work. Borel realized that we can however do this on a smaller collection of sets, namely the smallest o-algebra containing intervals of the form (a, }]. This is the import of the next theorem: Theorem 6.1. Each probability P defined on the algebra Ag has a unique extension (also called P) on A. 36 6. Construction of a Probability Measure We will show only the uniqueness. For the existence on can consult any standard text on measure theory: for example [16] or [23]. First we need to establish a very useful theorem. Definition 6.1. A class C of subsets of 2 is closed under finite intersections if for when Aj,....An €C, then Ay AQN...N An EC as well (n arbitrary but finite). A class C is closed under increasing limits if wherever Ay C Ay C Ag C -CAn C... is a sequence of events in C, then UR, An € ¢ as well. A class C is closed under differences if whenever A,B EC with AC B, then B\ AEC. Theorem 6.2 (Monotone Class Theorem). Let C be a class of subsets of 22, closed under finite intersections and containing Q. Let B be the smallest class containing C which is closed under increasing limits and by difference. Then B = (C). Proof. First note that the intersection of classes of sets closed under increasing limits and differences is again a class of that type. So, by taking the intersec- tion of all such classes. there always exists a smallest class containing C which is closed_under increasing limit: differences. For each set B, denote Bp to be the collection of sets A such that A € B and AMB € B. Given the properties of B, one easily checks that Bg is closed under increasing limits and by difference. Let B € C; for each C € C one has BNC €C Cc Band C € B, thus C € Bg. Hence C C Bg C B. Therefore B = Bz, by the properties of B and of Bz. Now let B € B. For each C € C, we have B € Bo, and because of the preceding, BMC € B, hence C € Bg, whence C C Bg CB, hence B = Bz. Since B = Bg for all B € B, we conclude B is closed by finite intersec- tions. Furthermore 2 € B, and B is closed by difference, hence also under complementation. Since B is closed by increasing limits as well, we conclude B is a o-algebra, and it is clearly the smallest such containing C. 
o The proof of the uniqueness in Theorem 6.1 is an immediate consequence of Corollary 6.1 below, itself a consequence of the Monotone Class Theorem. Corollary 6.1. Let P and Q be two probabilities defined on A, and suppose P and Q agree on a class C C A which is closed under finite intersections. If a(C) =A, we have P=Q. ‘ Proof. 2.€ A because A is a o-algebra, and since P(2) = Q(22) = 1 because they are both probabilities, we can assume without loss of generality that 2.0 C. Let B= {A € A: P(A) = Q(A)}. By the definition of a Probability measure and Theorem 2.3, B is closed by difference and by increasing limits. ' B\ A denotes BN AP 6. Construction of a Probability Measure 37 Also B contains C by hypothesis. Therefore since o(C) = A, we have B= A by the Monotone Class Theorem (Theorem 6.2). a There is a version of Theorem 6.2 for functions. We will not have need of it in this book, but it is a useful theorem to know in general so we state it here without proof. For a proof the reader can consult [19, p. 365]. Let M be a class of functions mapping a given space 92 into R. We let o(M) denote the smallest o-algebra on §2 that makes all of the functions in M measurable: o(M) = {f -1(A);A € B(R); f € M}. Theorem 6.3 (Monotone Class Theorem). Lei M be a class of bounded functions mapping Q into R. Suppose M is closed under multiplication: f.g € M implies fg € M. Let A = o(M). Let H be a vector space of functions with H containing M. Suppose H contains the constant func- tions and is such that whenever (fn)n>1 is a sequence in H such that O B (note also that M92, (a.b + 4) = (a,b), so By C B and thus B = o(Bo)). The relation (7.1) implies that P((@y)) = FY) — F@), and if A € Bg is of the form A= Uicicn(ei, yi] with ys < tina, then P(A) = Dy cjen{F (ys) — Flei)}- If Q is another probability measure such that F(a) = Q((—00, a), 40 7. Construction of a Probability Measure on R then the preceding shows that P = Q on Bo. Theorem 6.1 then implies that P =Q on all of B, so they are the same Probability measure. Oo The significance of Theorem 7.1 is that we know, in principle. the complete probability measure P if we know its distribution function F’ : that is, we can in principle determine from F the probability P(A) for any given Borel set A. (Determining these probabilities in practice is another matter.) It is thus important to characterize all functions F which are distribution functions, and also to construct them easily. (Recall that a function F is right continuous if limy)» F(y) = F(x), for all « € R.) Theorem 7.2. A function F is the distribution function of a (unique) prob- ability on (R,B) if and only if one has: (i) F is non-decreasing; (ii) F is right continuous; (iii) Him, 0 F(x) = 0 and limy_.4., F(z) = 1. Proof. Assume that F is a distribution function. If y > 2, then (—90.2] C (—00, yj], so P((—o0, «]) < P((—oc, y]) and thus F(x) < F(y). Thus we have (i). Next let z, decrease to x. Then N3,(—00, tn] = (—00, 2], and the sequence of events {(—90,2,];n > 1} is a decreasing sequence. Therefore P(N) (00, fn]) = limps P((—26,tn]) = P((—0,2]) by Theorem 2.3, and we have (ii), Similarly, Theorem 2.3 gives us (iii) as well Next we assume that we have (i), (ii), and (iii) and we wish to show F is a distribution function. In accordance with (iii), let us set F(—oo) = 0 and F (+00) = 1. As in the proof of Theorem 7.1, let Bo be the set of finite disjoint unions of intervals of the form (z,y], with —o0 < x < y < +00. 
Define a set function P, P : By — (0, 1] as follows: for A =Urcien(in yi) with y; 0. By hypothesis (iii) there exists a a z such that F(—z) < € and 1— F(z) < «. For each n,i there exists a? € (a7, y?] such that F(a?) — F(x!) < sr, by (ii) (right continuity). Set By = Ureick, {a7 uF] 1(-2,2]}; Bn = UmenBr- Note that B/, € Bo and BY, C Ay, hence By € By and By C An. Furthermore, An\Bn C Umen(4m\Biy), hence P(An) ~ P(Bn) S P((-2,21°) + S3 P((Am\Br) 1 (-2 4) m= no ky < P((-z,2]°) + 32 YS P((2?, a?) m=1i=1 n kn < F(-2) +1-F(2)+ YO SO{F(a?) — F(e?)} < Be. (7.2) mal i=l Furthermore observe that B, C A, (where By, is the closure of Bn), hence ne_,B, = @ by hypothesis. Also B, S [-2, z], hence each B, is a compact: set. Tt is is a property of compact spaces! (known as “The Finite Intersection Property”) that for closed sets Fg, Nges Fg # 0 if and only if NgecFs #0 for all finite subcollections C’ of B. Since in our case N°2,B, = 0, by the Finite Intersection Property we must have that there exists an m such that B, = ¢ for all n > m. Therefore By, = ¢ for all n > m, hence P(Bn) = 0 for all n > m. Finally then P(An) = P(An) — P(Bn) $ 3e by (7.2), for all n > m. Since € was arbitrary, we have P(A,) | 0. (Observe that this rather lengthy proof would become almost trivial if the sequence k,, above were bounded; but although A, decreases to the empty set, it is not usually true). qa Corollary 7.1. Let F be the distribution function of the probability P on R. Denoting by F(x—) the left limit of F at x (which exists since F is nonde- creasing), for all x < y we have 1 For a definition of a compact space and the Finite intersection Property one can consult (for example) [12, p.81]. 42 7. Construction of a Probability Measure on R. (i) P((a.y] = Fly) ~ F(@). (ii) P({w.y]) = F(y) — F(a-), (ii) P([e.y)) = F(y-) - F(@-), Gv) P((x.y)) = Fy) ~ F(). (v) Pz}) = F(x) — F(e-), and in particular P({x}) =0 for all x if and only the function F is continu- ous. Proof. (i) has already been shown. For (ii) we write P(e— 2.) = Fy) Fe-4) by (i). The left side converges to F(y) — F(«—) as n — oc by definition of the left limit of F; the right side converges to P((z.y]) by Theorem 2.3 because the sequence of intervals (x — 4.y] decreases to [x.y]. The claims (iii), (iv) and (v) are proved similarly. oO Examples. We first consider two general examples: 1. If f is positive and Riemann-integrable and f™.. f («)dx = 1, the function F(z) = Poe f(y)dy is a distribution function of a probability on R; the function f is called its density. (It is not true that each distribution function admits a density, as the following example shows). 2. Let a € R. A “point mass” probability on R (also known as “Dirac measure”) is one that satisfies lifa€ A, P(A)= {5 otherwise. Its distribution function is Oifa 0 and f*. f(«)dz = 1, which the reader can check is indeed the case for examples 3-10. We abuse language a bit by referring to the density f alone as the distribution, since it does indeed determine uniquely the distribution. 7. Construction of a Probability Measure on R 43 ifasa0. 4 fe) = ifr <0. is called 7 Exponential distribution with parameter 3 > 0. The exponential distribution is often used to model the lifetime of objects whose decay has “no memory”; that is, if X is exponential, then the probability of an object lasting t more units of time given it has lasted s units already, is the same as the probability of a new object lasting ¢ units of time. 
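The memoryless property can be checked directly from the distribution function F(x) = 1 - e^{-βx}: P(X > s + t | X > s) should equal P(X > t). The short sketch below is an added illustration; the values of β, s and t are arbitrary.

```python
import math

beta, s, t = 0.5, 2.0, 3.0                # arbitrary illustration parameters

def survival(x):
    """P(X > x) = 1 - F(x) for an Exponential(beta) random variable."""
    return math.exp(-beta * x) if x >= 0 else 1.0

# P(X > s + t | X > s) = P(X > s + t) / P(X > s), since {X > s + t} is contained in {X > s}
print(survival(s + t) / survival(s))      # equals exp(-beta * t)
print(survival(t))                        # the same number
```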
The lifetimes of light bulbs (for example) are often modeled this way: thus if one believes the model it is pointless to replace a working light bulb with a new one. This memoryless property characterizes the exponential distribution: see Exercises 9.20 and 9.21. ge _ : 5. f(@) = ra 7 1 3. fa) = {ra xz <0, is called the Gamma distribution with parameters a, 3 (0 < a < co and 0 < 8 <0; TI denotes the gamma function) ? The Gamma distribution arises in various applications. One example is in reliability theory: if one has a part in a machine with an exponen- tial (3) lifetime. one can build in reliability by including n — 1 back-up components. When a component fails. a back-up is used. The result- ing lifetime then has a Gamma distribution with parameters (n. 3). (See Exercise 15.17 in this regard.) The Gamma distribution also has a rela- tionship to the Poisson distribution (see Exercise 9.22) as well as to the chi square distribution (see Example 6 in Chapter 15). The chi square distribution is important in Statistics: See the Remark at the end of Chapter 11. ana 1e(52)" if 2 > 0, 6. f(z) = ifx <0. is called i Weibull distribution with parameters a, 3 (0~ a* te *da, a > 0; it follows from the definition that (a) = (a ~ 1)! for a EN, and P(3) = vm 44 7. Construction of a Probability Measure on R- known as the Gaussian distribution. Standard notation for the Normal with parameters j: and 0? is N(y.07). We discuss the Normal Distribution at length in Chapters 16 and 21; it is certainly the most important distribution in probability and it is central to much of the subject of Statistics. 8. Let gyu,o2(t) = Tee ee, the normal density. Then 1 : f(a) = 7 9u.02(log x) ifz>0, : 0 ifx <0, is called the Lognormal distribution with parameters 1, ¢?(—00

i be any sequence of pairwise disjoint events and P a proba- bility. Show that lim, P(An) = 0. 7.2* Let (As)sen be a family of pairwise disjoint events. Show that if P(Ag) > 0. each 3 € B, then B must be countable. 7.3. Show that the maximum of the Gamma density occurs at 7 = ss for a>, Be 7.4. Show that the maximum of the Weibull density occurs at 2 = 3(*54)«, fora>1. 7.5. Show that the maximum of the Normal density occurs at 2 = p. 7.6 Show that the maximum of the Lognormal density occurs at 2 = e#e~?". 7.7 Show that the maximum of the double exponential density occurs at waa 7.8 Show that the Gamma and Weibull distributions both include the Ex- ponential as a special case by taking a = 1. 7.9’ Show that the uniform. normal, double exponential, and Cauchy densities are all symmetric about their midpoints. 7.10 A distribution is called unimodal if the density has exactly one absolute maximum. Show that the normal, exponential, double exponential, Cauchy, Gamma, Weibull, and Lognormal are unimodal. 7.11 Let P(A) = Lr la(z)f(x)dx for a nonnegative function f with Jo F(w)dx = 1. Let A = {zo}, a singleton (that is, the set A consists of one single point on the real line). Show that A is a Borel set and also a null set (that is, P(A) = 0). 7.12 Let P be as given in Exercise 7.11. Let B be a set with countable cardinality (that is, the number of points in B can be infinite, but only countably infinite). Show that B is a null set for P. 7.13 Let P and B be as given in Exercise 7.12. Suppose A is an event with P(A) = 3. Show that P(AU B) = 3 as well. 7.14 Let Aj,....An.... be a sequence of null sets. Show that B = UZ, A; is also a null set. 7.15 Let X be a r.v. defined on a countable Probability space. Suppose E{|X|} = 0. Show that X = 0 except possibly on a null set. Is it possible to conclude, in general, that X = 0 everywhere (i.e.. for all w)? [Ans.: No] 46 7. Construction of a Probability Measure on R. 7.16* Let F be a distribution function. Show that in general F can have an infinite number of jump discontinuities. but that there can be at most countably many. 7.17 Suppose a distribution function F is given by 1 , 1 1 F(a) = Fljo,coy(@) + 5 Urey (2) + 3G V2.00)(2)- Let P be given by P((-00,2]) = F(a). Then find the probabilities of the following events: a) A=(-3.3 b) B=(-4,2 ce) C=(§.3) d) D= (0,2) e) E= (3.00) 7.18 Suppose a function F is given by 1 F() = Viger i=1 Show that it is the distribution function of a probability on R. Let us define P by P((—oc,c]) = F(x). Find the probabilities of the following events: a) A=[l, 0) b) B=[5-00) c) C= {0} d) D=(0,3) e) E=(-cx.0) f) G= (0,00) 8. Random Variables In Chapter 5 we considered random variables defined on a countable prob- ability space ({2..A,P). We now wish to consider an arbitrary abstract space. countable or not. If X maps 2 into a state space (F.F), then what we will often want to compute is the probability that X takes its val- ues in a given subset of the state space. We take these subsets to be ele- anents of the g-algebra F of subsets of F. Thus, we will want to compute P({w: X(w) € A}) = P(X € A) = P(X71(A)), which are three equivalent ways to write the same quantity. The third is enlightening: in order to com- pute P(X~1(A)), we need X~1(A) to be an element of A, the g-algebra on 2 on which P is defined. This motivates the following definition. Definition 8.1. (a) Let (E,€) and (F,F) be two measurable spaces. A func- tion X : E— F is called measurable (relative to € and F) if X*(A) cE, for all A €F. 
(One also writes X~\(F) c €.) (b) When (E,€) = (9,A), a measurable function X is called a random variable (r.v.). (c) When F = R, we usually take F to be the Borel o-algebra B of R. We will do this henceforth without special mention. Theorem 8.1. Let C be a class of subsets of F such that o(C) = F. In order for a function X : E > F to be measurable (w.r.t. the o-algebras E and F ), it is necessary and sufficient that X~1(C) c €. Proof. The necessity is clear, and we show sufficiency. That is, suppose that X71(C) € € for all C € C. We need to show X~1(A) € € for all A € F. First note that X~1(UnAn) = UnX71(An), X7(AnAn) = On X7 (An), and X71(A°) = (X71(A))*. Define B = {A € F:X~1(A) € €}. Then C c B, and since X~! commutes with countable intersections, countable unions, and complements, we have that B is also a o-algebra. Therefore B > o(C), and also F > B, and since F = o(C) we conclude F = B, and thus X~1(F) ¢ o(X-*(C)) CE. aq We have seen that a probability measure P on R is characterized by the quantities P((—20,a]). Thus the distribution measure P* on R of a random variable X should be characterized by P*((—oo, a]) = P(X < a) and what is perhaps surprisingly nice is that being a random variable is 48 8. Random Variables further characterized only by events of the form {w:X(w) < a} = {X < a}. Indeed, what this amounts to is that a function is measurable — and hence a random variable — if and only if its distribution function is defined. Corollary 8.1. Let (F,F) = (R,B) and let (E.€) be an arbitrary measur- able space. Let X, Xp be real-valued functions on E. a) X is measurable if and only if {X < a} = {w: X(w) < a} = X71((—oe. a]) € €, for each a; or iff {X , Xm. We have just seen each Y;, is measurable, and we have also seen that inf, Y;, is therefore measur- able; hence lim sup,,_.., Xn is measurable. Analogously lim inf, 5. Xn = sup, infm>n Xm is measurable. If lim, X, = X, then X = limsup,_.., Xp = liminfpoo Xn (be- cause the limit exists by hypothesis). Since lim sup, _,., Xn is measurable and equal to X, we conclude X is measurable as well. (c aq Theorem 8.2. Let X be measurable from (E.€) into (F,F). and Y mea- surable from (F,F) into (G,G); then Y o X is measurable from (E,.€) into (G,g). Proof. Let A € G. Then (Y o X)~1(A) = X~1(Y¥~1(A)). Since Y is measur- able, B = Y~1(A) € F. Since X is measurable, X~1(B) € €. q A topological space is an abstract space with a collection of open sets;! the collection of open sets is called the topology of the space. An abstract definition of a continuous function is as follows: given two topological spaces (E,U) and (F,V) (where U are the open sets of E and V are the open sets of 2 A “collection of open sets” is a collection of sets such that any union of sets in the collection is also in the collection, and any finite intersection of open sets in the collection is also in the collection. 8. Random Variables 49 F), then a continuous function f:E — F is a function such that f~!(A) EU for each A € YV. (This is written concisely as f~1(V) C U.) The Borel o- algebra of a topological space (£,U4) is B = o(U). (The open sets do not form a o-algebra by themselves: they are not closed under complements or under countable intersections.) Theorem 8.3. Let (E,U) and (F,V) be two topological spaces, and let E, F be their Borel g-algebras. Every continuous function X from E into F is then measurable (also called “Borel”). Proof. Since F = o(V), by Theorem 6.4 it suffices to show that X~1(V) c E. 
But for O € V, we know X~1(Q) is open and therefore in €, as € being the Borel g-algebra, it contains the class U of open sets of EF. a Recall that for a subset A of E, the indicator function 14(x) is defined to be . A lifreA, ta)= {jie ea Thus the function 14(z), usually written 14 with the argument x being im- plicit, “indicates” whether or not a given z is in A. (Sometimes the function 1, is known as the “characteristic function of A” and it is also written x4; this terminology and notation is somewhat out of date.) Theorem 8.4. Let (F.F) = (R,B), and (E,€) be any measurable space. a) An indicator 14 on Eis measurable if and only if A € E. b) If Xy,...,Xp are real-valued measurable functions on (H,E), and if f is Borel on R”, then [(X1,-..,Xn) is measurable. c) If X.Y are measurable, so also are X +¥, XY, XV (ashort-hand for max(X,Y)), X AY (a short-hand for min(X,Y)), and X/Y (if ¥ #0). Proof. (a) If BC R, we have if O¢ B, 1¢B ym _)A if 0¢B,16B Gay"(B)= 9 ae if 0B. 1¢B E if0EB,1EB The result follows. (b) The Borel o-algebra B” on R” is generated by the quadrants Tli a; P(A,). ‘ (9.2) i=l (This is also written [ X(w)P(dw) and even more simply f XdP.) A little algebra shows that B{X} does not depend on the particular rep- resentation (9.1) chosen for X. Let X,Y be two simple random variables and 3 a real number. We clearly can write both X and Y in the form (9.1), with the same subsets A; which form a partition of 2, and with numbers a; for X and b; for Y. Then 8X and X +/Y are again in the form (9.1) with the same A; and with the respective numbers 3a; and a; +b;. Thus E{9X} = GE{X} and E{X+Y} = E{X}+ E{Y}; that is expectation is linear on the vector space of all simple r.v.’s. If further X < Y we have a; < 0; for all i, and thus E{X} < E{Y}. Next we define expectation for positive random variables. For X positive (by this, we assume that X may take all values in [0. oo], including +oc: this innocuous extension is necessary for the coherence of some of our further results), let 52 9. Integration with Respect to a Probability Measure E{X} =sup(B{Y}: Y a simple rv. withO 0, but we can have E{X} = oc. even when X is never equal to +0, Finally let X be an arbitrary r.v. Let X* = max(X,0) and X~ = —min(X,0). Then X = X* — X~, and X*, X~ are positive r.v.’s. Note that |X| = X++4X-. Definition 9.2. (a) A r.v. X has a finite expectation (is “integrable”) if both E{X*} and E{X~} are finite. In this case, its expectation is the number E{X} = E{X+}— B{X-}, (9.4) also written [ X(w)dP(w) or f XdP. (If X >0 then X- =0 and Xt =X and, since obviously E{0} = 0, this definition coincides with (9.3)). We write L to denote the set of all integrable random variables. (Some- times we write L1(2,A, P) to remove any possible ambiguity.) (b) A rv, X admits an expectation if E{X*+} and E{X~} are not both equal to +00. Then the expectation of X is still given by (9.4), with the conventions +00 + a = +00 and —00 + a = —oo when a € R. (If X > 0 this definition again coincides with (9.3); note that if X admits an expectation, then E{X} € [—oo, +00], and X is integrable if and only if its expectation is finite.) Remark 9.1. When £2 is finite or countable we have thus two different def- initions for the expectation of a r.v. X, the one above and the one given in Chapter 5. In fact these two definitions coincide: it is enough to verify this for a simple r.v. X, and in this case the formulas (5.1) and (9.2) are identical. 
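Before turning to the main properties of expectation, here is a minimal numerical sketch (not part of the text) of formula (9.2) and of Remark 9.1: for a simple random variable on a finite Omega, the pointwise sum of Chapter 5 and the partition formula (9.2) return the same number. The sample space, the probabilities, and the random variable X below are hypothetical choices made only for this illustration.

    # Minimal sketch of (9.2) and Remark 9.1; Omega, P and X are hypothetical.
    from fractions import Fraction as F

    omega = ["w1", "w2", "w3", "w4"]
    prob  = {"w1": F(1, 2), "w2": F(1, 4), "w3": F(1, 8), "w4": F(1, 8)}
    X     = {"w1": 3, "w2": 3, "w3": -1, "w4": 5}   # a simple r.v. (finitely many values)

    # Chapter 5 definition: E{X} = sum over omega of X(w) P({w}).
    e_pointwise = sum(X[w] * prob[w] for w in omega)

    # Formula (9.2): write X = sum_i a_i 1_{A_i} with A_i = {X = a_i} a
    # partition of Omega, and compute E{X} = sum_i a_i P(A_i).
    e_simple = sum(a * sum(prob[w] for w in omega if X[w] == a)
                   for a in set(X.values()))

    assert e_pointwise == e_simple   # the two definitions coincide (Remark 9.1)
    print(e_pointwise)               # 11/4

Running the same computation with any other finite Omega, probabilities, and simple X gives the same agreement, consistent with the fact that (9.2) does not depend on the representation (9.1) chosen.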
The next theorem contains the most important properties of the expec- tation operator, The proofs of (d), (e) and (f) are considered hard and could be skipped. Theorem 9.1. (a) L! is a vector space, and expectation is a linear map on L', and it is also positive (ie. X > 0 + E{X} > 0). If further 0 Y as. (Y € £}), alln, we have E {lim inf, .5. X;,} < liminf, ... E{X,}. In particular if X, > 0 a.s. all n, then E{lim infy Xn} < lim infy x E{Xn}. (f) (Lebesgue’s dominated convergence theorem): If the r.v.'s Xp converge as. to X and if |Xn| E{X}. The a.s. equality between random variables is clearly an equivalence rela- tion, and two equivalent (i.e. almost surely equal) random variables have the same expectation: thus one can define a space L! by considering “C! modulo this equivalence relation”. In other words, an element of L? is an equivalence class, that is a collection of all r.v. in £! which are pairwise a.s. equal. In view of (c) above, one may speak of the “expectation” of this equivalence class (which is the expectation of any one element belonging to this class). Since further the addition of random variables or the product of a rv. by a constant preserve a.s. equality, the set L! is also a vector space. Therefore we commit the (innocuous) abuse of identifying a r.v. with its equivalence class, and commonly write X € L! instead of X € £}, If 1 < p < 0x, we define L? to be the space of r.v.’s such that |X|? € £}, L® is defined analogously to L}, That is, L? is £? modulo the equivalence relation “almost surely”. Put more simply, two elements of £? that are a.s. equal are considered to be representatives of one element of L?. We will use in this book only the spaces L! and L? (that is p = 1 or 2). Before proceeding to the proof of Theorem 9.1 itself, we show two auxiliary results. Result 1. For every positive r.v. X there exists a sequence (Xp)n>1 of pos- itive simple r.v.’s which increases toward X as n increases to infinity. An example of such a sequence is given by Xg(w) = [REM EAI" < X (we) < (412 and OS bn" — 1, AMO) = 1 XW) Er (9.5) Result 2. If X is a positive r.v., and if (Xp)n>1 is any sequence of positive simple r.v.’s increasing to X, then E{X,} increases to E{X}. To see this, observe first that the sequence E{X,,} increases to a limit a, which satisfies a < E{X} by (9.3). To obtain that indeed a = E{X}, and in view of (9.3) again, it is clearly enough to prove that if Y is a simple r.v. such that 0 < Y < X, then E{Y} < a. The variable Y takes on m different values, say a,...,@m, and set Aj, = {Y = ax}. Choose € € (0,1]. The rv. Yne = (1—8)¥1ta-ev X,}. Furthermore it is obvious that Y;,-< < X;, hence using (9.2), we obtain 54 9. Integration with Respect to a Probability Measure E{¥ne} = (1—@) So axP (Anns) S E{Xn}. (9.6) k=1 Now recall that Y < lim, X,,. hence (1 — €)¥ < lim, X, as soon as Y > 0. hence clearly Aj.n,z — A, as n — oo. An application of Theorem 2.4 yields P(Ag.n.c) > P(Ag). hence taking the limit in (9.6) gives (l= 2) S7 ag P(Ay) = (1-2) E{Y} 0, Y > 0. Let A = {w: X(w) # Y(w)} = {X AY}. Then P(A) = 0. Also, FAY} = F{Y1a + Ylach} = E{Y1a} + E{¥ 1a} = BLY 14} + {X14}. Let Y, be simple and Y, increase to Y. Then Y,14 are simple and Y,14 increase to Y1,4 too. Since Y,, is simple it is bounded, say by N. Then 0 < E{Ynla} < B{N14} = NP(A) = 0. Therefore E{Y14} = 0. Analogously, E{X1,} = 0. Finally we have PY} = PAV 14} + E{X1 ac} = 04 B{X1 gc} = B{X 1 ac} + E{X 14} = E{X}. We conclude by noting that if Y = X a.s., then also Y* = Xt as. 
and Y~ = X~ a.s., and (c) follows. (d) For each fixed n choose an increasing sequence Y,,,. k = 1,2,3,... of positive simple r.v.’s increasing to X;,, (Result 1), and set 9. Integration with Respect to a Probability Measure 55, Zp = Max Ynip- nsk Then (Z,),>1 is a non-decreasing sequence of positive simple r.v.’s. and thus it has a limit Z = limg_.x Z,. Also Yn S Ze S Xp SX as. fornsk which implies that Xn 0. X, € £}, and Eflim inf, ss. Xn }< lim infy 00 E{Xp, } if and only if Elim inf, +o X,} < liminf,,... E{X,}, because lim inf X,, = (lim inf X,,) — n= 00 nao Therefore without loss of generality we assume X,, > 0 a.s., each n. Set Y;, = infy>, X,. Then Y, are also random variables and form a non- decreasing sequence. Moreover lim Y, = lim inf X,,. n= no Since Xp > Yn, we have E{Xp} > E{¥q}, whence liminf E{X,} > lim B(Y} = B{ lim, ¥,} = Ef{liminf X,} by the Monotone Convergence Theorem (part (d) of this theorem). (f) Set U = liminf,.x X, and V = limsup,_... Xn- By hypothesis U=V =X as. We also have |X,| < Y as., hence |X| < Y as well, hence X, and X are integrable. On the one hand X, > —Y as. and —Y € L}, so Fatou’s lemma (e) yields 56 9. Integration with Respect to a Probability Measure E{U} < liminf B{X,}. me We also have —X,, > —Y a.s. and —V = lim inf, ... —Xn, so another appli- cation of Fatou’s lemma yields ~B{V} = B{-V} 2 liminf E{—Xn} = —limsup E{Xn}. Putting together these two inequalities and applying (c) yields E{X} = E{U} < liminf E{X,} < limsup E{Xn} S E{V} = E{X}- This completes the proof. oO A useful consequence of Lebesgue’s Dominated Convergence Theorem (Theorem 9.1(f)) is the next result which allows us to interchange summa- tion and expectation. Since an infinite series is a limit of partial sums and an expectation is also a limit, the interchange of expectation and summation amounts to changing the order of taking two limits. Theorem 9.2. Let X,, be a sequence of random variables. (a) If the Xp.’s are all positive, then {Sx} = Soe Gh (9.7) n=l both sides being simultaneously finite or infinite. (b) If TO, E{|Xn|} < 00, then °°, Xp converges a.s. and the sum of this series is integrable and moreover (9.7) holds. Proof. Let Sn = 7p |Xn| and Ty = S3p_y Xz. Then PLS} = e{y pal} = eX}. k=l k=l and the sequence S,, clearly increases to the limit $ = 0%, |Xq| (which may be finite for some values of w and infinite for others). Therefore by the Monotone Convergence Theorem (Theorem 9.1(d)) we have: {5} = im B{5,} = 2 E(|Xil} < oe. k=l If all X,,’s are positive, then S,, = T, and this proves (a). If the X,’s are not necessarily positive, but 0%; H{|Xn|} < oc, we deduce also that H{S} < oo. 9. Integration with Respect to a Probability Measure 57 Now. for every € > 0 we have 1ys.} < €S. hence P(S = 00) = E{lis—coy} S EE {S}. Then E{S} < oo and since the choice of ¢ is arbitrary, we have that P(S = 2c) = 0: we deduce that }77_, X; is an absolutely convergent series a.s. and its sum, say T, is the limit of the sequence T,,. Moreover [Th] < Sn <8 and § is in L}. Hence by the Dominated Convergence Theorem (Theo- rem 9.1(f)) we have that ofS xi) = P{ lim Ta} = E{T}, k=1 J which is (9.7). a Recall that L1 and L? are the sets of equivalence classes of integrable (resp. square-integrable) random variables for the a.s. equivalence relation. Theorem 9.3. a) If X,Y € L?, we have XY € L} and the Cauchy-Schwarz inequality: EAXY}| < VERE}, b) We have L? CL}, and if X € L, then E{X}? < E{X?}; c) The space L? is a linear space, i.e. if X,Y € L? and a,3 € R, then aX + BY € L? 
(we will see in Chapter 22 that in fact L? is a Hilbert space). Proof. (a) We have |XY| < X?/24+¥?/2, hence X,Y € L? implies XY € L. For every z € R we have 0< E{(aX +Y)*} = 2? E{X*} + 2c B{XY} + E{Y?}. (9.8) The discriminant of the quadratic equation in z given in (9.8) is and since the equation is always nonnegative, {XY} ~ B{X?}B{Y?} <0, which gives the Cauchy-Schwarz inequality. (b) Let X € L?. Since X = X-1 and since the function equal to 1 identically obviously belongs to L? with E{1?} = 1, the claim follows readily from (a). (c) Let X.Y € L*, Then for constants a, 3, (aX + BY)? < a®X?/24+ G’Y?/2 is integrable and aX + GY € L* and L? is a vector space. Oo 58 9. Integration with Respect to a Probability Measure If X € L?. the variance of X. written 9?(X) or 0%. is Var(X) = 0?(X) = E{(X — E{X})?}. (Note that X € L? + X € L’. so E{X} exists.) Let p = E{X}. Then Var(X) = E((X ~ p)?} = F(X?) ~ 2 BX} + 1? = E{X*}— 22 + p? = E{X*}—,2. Thus we have as well the trivial but nonetheless very useful equality: o?(X) = E{X?}— E{X}. Theorem 9.4 (Chebyshev’s Inequality). P{|X|> a} < FAAS a Proof. Since a71,)x)>a} < X?. we have pl < E{X?}. or a2 P(|X| > a) < B{X?}: and dividing by a? gives the result. oO Chebyshev's inequality is also known as the Bienaymé-Chebyshev inequal- ity, and often is written equivalently as PIX — E(X}| >a} <2 oe) The next theorem is useful; both Theorem 9.5 and Corollary 9.1 we call the Expectation Rule. as they are vital tools for calculating expectations. It shows in particular that the expectation of a r.v. depends only on its distribution. Theorem 9.5 (Expectation Rule). Let X be a r.v. on (2,A.P), with values in (E.€), and distribution PX. Let h:(E,€) > (R, B) be measurable. a) We have h(X) € £1(2,A, P) if and only if h € L1(E.E, P*). b) If either h is positive, or if it satisfies the equivalent conditions in (a), we have: E{h(X)} = [rcyr* an. (9.9) Proof. Recall that the distribution measure P* is defined by P*(B) = P(X~1(B)), Therefore 9. Integration with Respect to a Probability Measure 59 E{1p(X)} = P(X71(B)) = PX(B) = / Lp(2)P* (dz). Thus if h is simple, (9.9) holds by the above and linearity. If h is positive, let h, be simple, positive and increase to h, Then E{h(X)} = E{ lim h,,(X)} n° = lim E{hn(X)} noe lim. / hy (x) P* (dx) / im hy (x) P* (dx) = [royP* aa) t where we have used the Monotone Convergence Theorem twice. This proves (b) when h is positive, and applied to |h| it also proves (a) (recalling that a t.v. belongs to C? if and only if the expectation of its absolute value is finite). If h is not positive. we write h = h* — h~ and deduce the result by subtraction, oO The next result can be proved as a consequence of Theorem 9.5. but we prove it in Chapter 11 (Corollary 11.1) so we omit its proof here. Corollary 9.1 (Expectation Rule). Suppose X is a random variable that has a density f. (That is, F(w) = P(X <2) and F(z) = f?_, f(u)du,—c0 < at <0.) If E{|h(X)|} < 0c or if h is positive, then E{h(X)} = f h(x) f(x)de. Examples: 1. Let X be exponential with parameter a. Then O° E{h(X)} = [ h(x)ae~°* da. ‘0 In particular, if h(x) = x, we have oe 1 E{X}= [ ane" dr = —, lo a by integration by parts. Thus the mean of an exponential random variable is 1/a. 2. Let X be normal (or Gaussian) with parameters (1,07). Then E{X} = p4, since i 1 E{X\= (ep)? /207 gy = [ager 60 9. Integration with Respect to a Probability Measure To see this, let y= 2 ~ pw; then « = y+ py, and x 1 2 (29? E{X}= wea {X} [sete y 2° md 2/992 1 2 992 = ye VO dy + [ eV 20" dy. I. 
Vino” a 0 for all z > 0 and >2z > gf for x > 1. That E{X~} = 00 is proved similarly. Exercises 61 Exercises for Chapters 8 and 9 Let X : (92..A) > (R, B) be ar-v. Let F={A:A=X7}(B), some B € B} = X71(B). Show that X is measurable as a function from (2, F) to (R, B). 9.2 * Let (2.4, P) be a probability space, and let F and G be two o-algebras on §2. Suppose F C A and G C A (we say in this case that F and G are sub a-algebras of A). The o-algebras F and G are independent if for any A € F, any B € G. P(AN B) = P(A)P(B). Suppose F and G are independent, and ar.v. X is measurable from both (2, F) to (R, B) and from (2, G) to (R, B). Show that X is a.s. constant; that is, P(X = c) = 1 for some constant c. 9.3* Given (2,4, P), let AU = {AUN:A € AN € NV}, where MV’ are the null sets (as in Theorem 6.4). Suppose X = Y a.s. where X and Y are two real-valued functions on 2. Show that X: ((2,A") — (R, B) is measurable if and only if Y: (2,.A4’) — (R, B) is measurable. 9.4* Let X € £! on (9,.A, P) and let A, be a sequence of events such that limp oc P(An) = 0. Show that lim, E{X14,,} = 0. (Caution: We are not assuming that lim, X14, = 0 a.s.) 9.5* Given (2,A, P), suppose X is a r.v. with X > 0 as. and E{X} = 1. Define Q : A > R by Q(A) = E{X14}. Show that Q defines a probability measure on ({2,.A). 9.6 For Q as in Exercise 9.5, show that if P(A) = 0, then Q(A) = 0. Give an example that shows that Q(A) = 0 does not in general imply P(A) = 0. 9.7 * For Q as in Exercise 9.5, suppose also P(X > 0) = 1. Let Eg denote expectation with respect to Q. Show that Eg{Y} = Ep{Y X}. 9.8 Let Q be as in Exercise 9.5, and suppose that P(X > 0) = 1. (a) Show that + is integrable for Q. (b) Define R:A > R by R(A) = Ee{ probability measure P of Exercise 9. 9.9 Let Q be as in Exercise 9.8. Show that Q(A) = 0 implies P(A) = 0 (compare with Exercise 9.6). La}. Show that R is exactly the (Hint: Use Exercise 9.7.) 9.10\Let X be uniform over (a,b). Show that E{X} = 244. 9.11 Let X be an integrable r.v. with density f(c). and let p = E{X}. Show that. oo Var(X) = 0?(X) = / (a — pw)? f(a)de. 62 9. Integration with Respect to a Probability Measure 9.12 Let X be uniform over (a.}). Show that 02(X) = 452, 9.18 Let X be Cauchy with density 1... Show that o2(X) is not mF (@—ay) defined, and E{X?} = oc. 9.14 The beta function is B(r, s) = eee. where I’ is the gamma function. Equivalently 1 B(r.s) = [ rots dt (r>0.s>0). oO X is said to have a beta distribution if the density f of its distribution measure is a1 -2)s 7) if0<2r <1, f(a) = B(r,s) —s 0 ife 1. Show that for X having a beta distribution with parameters (r,s) (r > 0.8 > 0), then B(r+k,s)_ P(r+hI(r +) E{x* =e { Bir. s) I(r)P(r+s+h)? for k > 0. Deduce that > r EX} = r+s" o(X)= i (+ 92r+s+1) The beta distribution is a rich family of distributions on the interval [0. 1). It is often used to model random proportions. 9.15 Let X have a lognormal distribution with parameters (1,07). Show that E(XT} = ert dere and deduce that E{X} = e437 and o% = e2#+ (e? —1). (Hint: E{X"} = I a” f(a)dx where f is the lognormal density; make the change of variables y = log(«) — p to obtain E{x" [. 1 (rutry—y? 207) g ) = e . noo VOR 4 9.16 The gamma distribution is often simplified to a one parameter distribu- tion. A r.v. X is said to have the standard gamma distribution with parameter @ if the density of its distribution measure is given by goerle-® fie)=) Tay 12° 0 ifa <0. Exercises 63 That is. 3 = 1. (Recall P'(a) = f° t°-1e~tdt.) Show that for X standard gainma with parameter a. 
then = Pet Gs), FO =F ke Deduce that X has mean @ and also variance a. 9.17 * Let X be a nonnegative r.v. with mean jz and variance o”, both finite. Show that for any b > 0, P{X > p+bo} < — (Hini: Consider the function g(x) = tease and that E{((X — p)b+ o)?} = 0%(b? + 1).) 9.18 Let X be ar.v. with mean and variance o?, both finite. Show that P{u-do 4. (Note that this is interesting only for d > 1.) 9.19 Let X be normal (or Gaussian) with parameters = 0 and o? = 1. Show that P(X > 2) < te e7 }*", for x > 0. 9.20 Let X be an exponential r.v.. Show that P{X > s+t| X > s} P{X > t} for s > 0, i > 0. This is known as the “memoryless property” o} the exponential. g, 9.21 * Let X be ar-.y. with the property that P{X > s+t| X > s} = P{X > t}. Show that if h(t) = P{X > t}, then h satisfies Cauchy's equation: h(s +t) = h(s)h(t) (s >0.t> 0) and show that X is exponentially distributed (Hint: use the fact that h is continuous from the right, so Cauchy’s equation can be solved). 9.22 Let a be an integer and suppose X has distribution Gamma (a, 3). Show that P(X < xz) = P(Y > a), where Y is Poisson with parameter = «§. (Hint: Recall (a) = (a— 1)! and write down P(X < 2), and then use integration by parts with u = £°—? and du = e7~‘/9dt.) 9.23 The Hazard Rate of a nonnegative random variable X is defined by <3 > hx(t) = lim Pitt) 20 € when the limit exists. The hazard rate can be thought of as the probability that an object does not survive an infinitesimal amount of time after time tf. The memoryless property of the exponential gives rise to a constant rate. A Weibull random variable can be used as well to model lifetimes. Show that: 64 9. Integration with Respect to a Probability Measure a) If X is exponential (\), then its hazard rate is hx (t) = A; b) If X is Weibull (a, 3), then its hazard rate is hx(t) = aGt0}. 9.24 A positive random variable X has the logistic distribution if its distri- bution function is given by 1 F(a) = P(X $2) = aa (a >0), for parameters (j1, 3), 3 > 0. a) Show that if 1 = 0 and 9 = 1, then a density for X is given by en? f= Gy b) Show that if X has a logistic distribution with parameters (1, 3), then X has a hazard rate and it is given by hx(t) = (G) F(t). 10. Independent Random Variables Recall that two events A and B are independent if knowledge that B has occurred does not change the probability that A will occur: that is, P(A | B) = P(A). This of course is algebraically equivalent to the statement P(AN B) = P(A)P(B). The latter expression generalizes easily to a finite number of events: Ay,..., A, are independent if P(N;=7A;) = Tes P(Aj), for every subset J of {1,...,n} (see Definition 3.1). For two random variables X and Y to be independent we want knowledge of Y to leave unchanged the probabilities that X will take on certain values, which roughly speaking means that the events {X € A} and {Y € B} are independent for any choice of A and B in the o-algebras of the state space of X and Y. This is more easily expressed in terms of the o-algebras generated by X and Y: Recall that if X:(2,A) — (B,€), then X~1(€) is a sub o- algebra of A, called the o-algebra generated by X. Definition 10.1. a) Sub o-algebras (A;)icr of A, are independent if for every finite subset J of I, and all A; € Aj, one has P (Nic sAi) = T] P(Ad- ied b) Random variables (X;)ier, with values in (Ej, &), are independent if the generated o-algebras X;1(E;) are independent. We will next, for notational simplicity, consider only pairs (X,Y) of ran- dom variables. 
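Since the definition will be used below mainly for pairs, here is a small simulation sketch (not part of the text) of Definition 10.1 in the form of condition (a) of Theorem 10.1 below: for an independent pair the empirical value of P(X in A, Y in B) is close to the product P(X in A)P(Y in B), while for a dependent pair it is not. The distributions, the events A and B, and the sample size are arbitrary illustrative choices.

    # Simulation sketch of the product rule P(X in A, Y in B) = P(X in A)P(Y in B).
    # All concrete choices (laws, events A and B, sample size) are hypothetical.
    import random

    random.seed(0)
    N = 200_000

    def estimate(pair_sampler):
        # Returns (empirical P(X in A, Y in B), empirical P(X in A) * P(Y in B))
        # with A = [0, 1] and B = [0, 1/2].
        both = in_a = in_b = 0
        for _ in range(N):
            x, y = pair_sampler()
            a = 0.0 <= x <= 1.0
            b = 0.0 <= y <= 0.5
            both += a and b
            in_a += a
            in_b += b
        return both / N, (in_a / N) * (in_b / N)

    def independent_pair():
        return random.gauss(0, 1), random.random()   # X ~ N(0,1), Y ~ U(0,1), independent

    def dependent_pair():
        x = random.gauss(0, 1)
        return x, x * x                              # Y = X^2 is a function of X

    print(estimate(independent_pair))   # the two estimates nearly agree
    print(estimate(dependent_pair))     # the two estimates differ markedly

The simulation of course proves nothing; it merely illustrates, on one pair of events, the equality that the definition requires for all events.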
However the results extend without difficulty to finite families of r.v.’s. Note that X and Y are not required to take values in the same space: X can take its values in (E,€) and Y in (F,F). Theorem 10.1. In order for X andY to be independent, it is necessary and sufficient to have any one of the following conditions holding: a) P(X © AY € B) = P(X € A)P(Y €B) forall ACE, BEF; b) P(X € AY € B) = P(X € A)P(Y €B) for all ACC, B ED, where C and D are respectively classes of sets stable under finite intersections which generate € and F; 66 10. Independent Random Variables c) f(X) and g(Y) are independent for each pair (f.g) of measurable func- tions; d) ELf(X)g(¥)} = ECLFOQ}E(G(Y)} for each pair (f.9) of functions bounded measurable. or positive measurable. e) Let E and F be metric a and let E. F be their Borel o-algebras. Then E{ f(X)g(¥)} = E{f(X)}E{g(¥)} for each pair (f.g) of bounded, continuous functions. Proof. (a) This is a restatement of the definition. since X~1(€) is exactly all events of the form {X € A}. for A € E. (a)=>(b): This is trivial since C C € and DCF. (a)=>(b): This is evident. (b)=>(a): The collection of sets A € € that verifies P(X € A,Y € B) = P(X € A)P(Y € B) for a given B € D is closed under increasing limits and by difference and it contains the class C by hypothesis, and this class C is closed by intersection. So the Monotone Class Theorem 6.2 yields that this collection is in fact € itself. In other words. Assumption (b) is satisfied with C = €. Then analogously by fixing A € € and letting J = {B € F:P(X € A,Y € B) = P(X € A)P(Y € B)}. we have J D o(D) and thus J = F. (c)=>(a): We need only to take f(x) = x and g(y) = y. (a)=+(c): Given f and g, note that AX) ME) = XP ME) CXTME). Also. g(¥)71(F) C Y~}(F), and since X~ (é), and Y~1(F) are independent. the two sub o-algebras f(X)~1(E) and g(Y)~1(F) will also be. (d)=+(a): Take f(x) = ta 2) and g(y) = 1p(y)- (a)=>(d): We have (d) holds for indicator functions. and thus for simple functions (ie.. f(@) = Thala, (x)) by linearity. If f and g are positive. let f, and g, be simple positive functions increasing to f and g respectively. Observe that the products fn(X)gn(Y) increase to f(X)g(Y). Then BUS (X)g(W)} = B (him, fa X)gu(V)} = fim, BE fal Xou(¥)} = Jaw EC fal X) ECan ¥)} = BUO}ELAY)} by the monotone convergence theorem. This gives the result when f and g are positive. When f and g are bounded we write f = f* — f~ and g = gt -g7 and we conclude by linearity. (d)(e): This is evident. (d)=>(b): It is enough to prove (b) when C and D are the classes of all closed sets of E and F (these classes are stable by intersection). Let for example A be a closed subset of EF. If d(a, A) denotes the distance between the point x and the set A. then f,(%) = min(1,nd(z. A)) is continuous. it satisfies 0 < f,, < 1, and the sequence (f,,) decreases to the indicator function 1,4. Similarly with B a closed subset of F we associate continuous functions 10. Independent Random Variables 67 Qn decreasing to 1g and having 0 < g, < 1. Then it suffices to reproduce the proof of the implication (a)=>(d). substituting the monotone convergence theorem for the dominated convergence theorem. o Example: Let E and F be finite or countable. For the couple (X,Y) let PAY = P(X =i.Y = j) = P({w:X(w) =iand Y = j}) =P((X=iN(¥ =5)}. Then X and Y are independent if and only if PXY = PX PY. as a conse- quence of Theorem 10.1. We present more examples in Chapter 12. We now wish to discuss “jointly measurable” functions. In general. 
if € and F are each o-algebras on spaces E and F respectively. then the Cartesian product Ex F = {AC ExF:A= AxI,A€ €andT € F} is notao-algebra on E x F., Consequently we write o(€ x F) to denote the smallest o-algebra on E x F generated by € x F. Such a construct is common, and we give it a special notation: €F =o(€xF). Theorem 10.2. Let f be measurable: (E x F.E & F) — (R,R). For each x € E (resp. y € F), the “section” y > f(a,y) (resp. « > f(x,y)) is an F-measurable (resp. E-measurable) function. Note: The converse to Theorem 10.2 is false in general. Proof. First assume f is of the form f(x,y) = 1c(«.y). for C € € & F. Let H = {C € E@F:y > 1c(x.y) is F-measurable for each fixed « € E}. Then H is a o-algebra and H contains € x F, hence o(€ x F) C H. But by construction H C o(€ x F), so we have H = € & F. Thus we have the result for indicators and hence also for simple functions by linearity. If f is positive, let fy be simple functions increasing to f. Then gn(y) = fn(a,y) for « fixed is F-measurable for each n, and since g(y) = lim gn(y) = fla,y). and since the limit of measurable functions is measurable. we have the result for f. Finally if f is arbitrary, take f = f+ — f7, and since the result holds for f+ and f~, it holds as well for f because the difference of two measurable functions is measurable. o Theorem 10.3 (Tonelli-Fubini). Let P and Q be two probabilities on (E,€) and (F,F) respectively. a) Define R(A x B) = P(A)Q(B), for A€ € and B € F. Then R extends uniquely to a probability on (Ex FE 2F), written PQ. 68 10. Independent Random Variables b) For each function f that is E © F-measurable, positive, or integrable with respect to P ®Q, the function « — f f(x,y)Q(dy) is E-measurable, the function y > f f(x,y)P(dx) is F-measurable and [rersa= [{ fe. necan} Prac) -/ { / sa.) Pda) b aca. Proof. (a) Let C € € & F, and let us write C(x) = {y:(x,y) € C}. If C = Ax B, we have in this case C(x) = B if « € A and C(x) = @ otherwise. hence: RC) = P@Q(C) = P(A)Q(B) = [ Paxatcc). Let H = {C € E@F:4 > Q[C(x)] is €-measurable}. Then H is closed under increasing limit and differences, while € x F CH C E@F, whence H = E®F by the monotone class theorem. For each C € H = €®F, we can now define (since Q[C(x)] is measurable and positive) RC) = f Pldz)Q{CWw)). We need to show R is a probability measure. We have R(2) = RE x F) =f. P(dz)Q[F] = 1. E Let C, € EF be pairwise disjoint and set C = U7{1Cn. Then since the C;,() also are pairwise disjoint and since Q is a probability measure, Q(C(2)] = S32) Q[Cn(x)]. Apply Theorem 9.2 to the probability measure P and to the functions f,(x) = Q[Cn(2)], to obtain YH) =O f ae = [Xo soar n=l = [ Panaicte)) = RC). Thus R is a probability measure. The uniqueness of R follows from Corol- lary 6.1. (b) Note that we have already established part (b) in our proof of (a) for functions f of the form f(x,y) = 1e(a,y). C € € ® F. The result follows for positive simple functions by linearity. If f is positive, € ® F-measurable, let fn be simple functions increasing to f. Then 10. Independent Random Variables 69 En(f) = jim Betfa) = im. f { f tatenQtau)} Peas) But « > f fn(z,y)Q(dy) are functions that increase to x > f f(x,y)Q(dy), hence by the monotone convergence theorem = [ {im txenaran} Pea, and again by monotone convergence = ff sm tue naan} Peder = f { f 102,nean} Pee An analogous argument gives = [{f renrcas)} aan. Finally for general f it suffices to take f = f+ — f~ and the result follows. oO Corollary 10.1. Let X and Y be two r.v. on (2,A,P), with values in (E. 
€) and (F,F) respectively. The pair Z = (X.Y) is a r.v. with values in (E x F,E QF), and the r.v.’s X,Y are independent if and only if the distribution P‘\XY) of the couple (X,Y) equals the product PX @ PY of the distributions of X andY. Proof. Since Z~1(A x B) = X—1(A) NY~1(B) belongs to A as soon as A € € and B € F, the measurability of Z follows from the definition of the product o-algebra € ® F and from Theorem 8.1. X and Y are independent iff for all A € € and B € F, we have P((X,Y) € Ax B) = P(X € A)P(Y € B), or equivalently . P®Y)(4 x B) = PX(A)PY(B). This is equivalent to saying that P(*-¥0(A x B) = (P* @ PY)(A x B) for all Ax B € EF), which by the uniqueness in Fubini’s theorem is in turn equivalent to the fact that PY) = P¥ @ PY on €@F. Oo We digress slightly to discuss the construction of a model with independent random variables. Let 4 be a probability measure on (E,€). It is easy to . with values in E, whose distribution measure is ju: simply take 2 = E; A= €; P = p; and let X be the identity: X(«) = ¢. Slightly more complicated is the construction of two independent random variables, X and Y, with values in (E£,€), (F,F), and given distribution 70 10. Independent Random Variables measures 1 and v. We can do this as follows: take Q = Ex F: A= ESF: P=p&v, and X(«.y) =x: Y(x,y) = y. where (t.y) € Ex F. Significantly more complicated. but very important for applications. is to construct an infinite sequence of independent random variables of given distributions. Specifically, for each n let X,, be defined on (Q,.A,Pn). and let us set co Q=T] 2, — (countable Cartesian product) n=l A=QAn nal where ®92,A, denotes the smallest o-algebra on 2 generated by all sets of the form Ay x Ag x... x Ag X Qpa1 X Oppo X..., ALE A k=1,2.3,.... That is, A is the smallest o-algebra generated by finite Cartesian products of sets from the coordinate o-algebras. The next theorem is from general measure theory, and can be considered a (non trivial) extension of Fubini’s theorem. We state it without proof. Theorem 10.4. Given (n.An,Pn) probability spaces and 2 = TTp_y Qn: A= 2%1An, then there exists a probability P on (2,A), and it is unique, such that k P(Ay x Ap x... Ap X Qh X Qe -.) = Tp A(40 i=1 for allk = 1.2,... and A; € Aj. For X;, defined on (@,,An,Pn) as in Theorem 10.4, let X, denote its natural extension to 2 as follows: for w € 2, let w = with w; € 2. each i. Then Xn(w) = Xn(wn): Corollary 10.2. Let X, be defined on (Qn.An. Px), each n, and let Xn be its natural extension to (2,A.P) as given above. Then (Xn )no1 are all independent. and the law of X, on (Q..A,P) is identical to the law of Xp on (2n,An: Pr): Proof. We have XZ(Bp) = 2X 00. Qn x XTBn) X Qnar x Qnae X + and by Theorem 10.4 we have for k = 1.2,...: 10. Independent Random Variables 71 P (AK, Xq"(Bn)) = P (XP"(Br) x... Xp 1 (Br) X Qear x...) k = [[ PX, € B,), n=1 and the result follows. oO Next we wish to discuss some significant properties of independence. Let A, be a sequence of events in A. We define: limsup Ay = R21 (Um>n4m) = lim (UmanAm)- noo This event can be interpreted probabilistically as: limsup A, = “A, occurs infinitely often”, n—00 which means that A, occurs for an infinite number of n. This is often abbre- viated “i.o.”. and thus we have: lim sup An = {An i.0.}. n Theorem 10.5 (Borel-Cantelli). Let A, be a sequence of events in (2,A, P). a) If 0, P(An) < x. then P(Ay i.0.) = 0. b) If P(A, io.) 
= 0 and if the A,’s are mutually independent, then So P(An) < o- Note: An alternative statement to (b) is: if 4, are mutually independent events, and if 7°, P(An) = oc, then P(A, i.o.) = 1. Hence for mutually independent events An, and since the sum )>,, P(A,) has to be either finite or infinite, the event { A, i.o.} has probability either 0 or 1; this is a particular case of the so-called zero-one law to be seen below. Proof. (a) Let an = P(An) = E{1a,}- By Theorem 9.2(b) 3%, an < 90 implies {7 14, < oc as. On the other hand, 7*, 14,(w) = oo if and only if w € limsup,.,. An. Thus we have (a). (b) Suppose now the A,’s are mutually independent. Then P(limsup A,) = lim lim P (US,_,, Am) n—s0 nc k00 lim lim (1~P (Nk,_,, AS) im=n Am noc k—90 a k eae Tes ( He- Pn) m=n by independence; 72 10. Independent Random Variables k =1> dy, fm, TEC ~ on) men where a, = P(Am). By hypothesis P(lim sup, An) = 0, 80 limy-soo limp oe [Then (1 — am) = 1. Therefore by taking logarithms we have k lim lim > log(1 ~ am) = 0, m0 k—+100 m=n or tim, YF log(1 — am) = 0, man which means that 57,,, log(1—a,) is a convergent series. Since |log(1—a)| > @ for 0< @ <1, we have that 3°, am is convergent as well. Qo Let now X,, be r.v.’s all defined on (§2,A, P). Define the o-algebras By = 0(Xn) Cr = 0 (Up>nBp) Coo = WE Cn Coo is called the tail o-algebra. Theorem 10.6 (Zero-one law). Let X,, be independent r.v.’s, all defined on (2,A,P), and let Co. be the corresponding tail c-algebra. If C € Coo, then P(C)=0or1. Proof. Let Dn = o(UpenBy). By the hypothesis, C,, and D,, are independent, hence if A €C,, BE Dy, then P(ANB) = P(A)P(B). (10.1) If A € Co. we hence have (10.1) for all B € UD,, hence also for all B€ D= a(UD,), by the Monotone Class Theorem (Theorem 6.2). However Coo CD, whence we have (10.1) for B = A € Co, which implies P(A) = P(A)P(A) = P(A)?, hence P(A) = 0 or 1. a Consequences: 1. {w: limn sco Xn(w) exists} € Coo, therefore X,, either converges a.s. or it diverges a.s. 2. Each r.v. which is C,, measurable is a.s. constant. In particular, lim sup X,,, lim inf X,,, a n=00 lim suy 1 X, li int | X, sup — > : m inf — > oo a ma? n—0o 1 ? Pen psn are all a.s. constant. (Recall we are still assuming that X, is a sequence of independent r.v.’s) Exercises 73 Exercises for Chapter 10 10.1 Let f = (ft. f2):@ 2 Ex F. Show that f:(2,A) > (Ex FE @F) is measurable if and only if fy is measurable from ({2,4) to (E,€) and fy is measurable from (§2..A) to (F,F). 10.2 Let R? = R x R, and let B? be the Borel sets of R?, while B denotes the Borel sets of R. Show that B? = B® B. 10.3 Let 2 = [0,1], A be the Borel sets of (0, 1], and let P(A) = f La(a)dr for A € A. Let X(x) = x. Show that X has the uniform distribution. 10.4 Let 2 =Rand A = B. Let P be given by P(A) = Te fla(ejer® Pde. Let X(x) = 2. Show that X has anormal distribution with parameters = 0 and o? = 1. 10.5 Construct an example to show that E{XY} = E{X}E{Y} does not imply in general that X and Y are independent r.v.’s (we assume X,Y and XY are all in L1). 10.6 Let X,Y be independent random variables taking values in N with : , . P(X=)=PY=)=5 G=L2Q..). Find the following probabilities: a) P(min(X,Y) X) [Ans Dys0 xacy] d) P(X divides Y) [Ans.: 3] e) P(X > kY) for a given positive integer k [Ans.: 54-4] 10.7 Let X,Y be independent geometric random variables with parameters Nand yw. Let Z = min(X,Y). Show Z is geometric and find its parame- ter. [Ans: Ay..] 10.8 Let X.Y € L?. Define the covariance of X and Y as Cov(X,Y) = B{(X ~ w(¥ —)} where E{X} =p and E{Y} = v. 
Show that Cov(X,Y) = E{XY} - pw and show further that X and Y independent implies Cov(X,.Y) =0. 10.9 Let X,Y € L’. If X and Y are independent, show that XY € L’. Give an example to show XY need not be in L! in general (i.e., if X and Y are not independent). 74 10. Independent Random Variables 10.10 * Let n be a prime number greater than 2: and let X.Y be independent and uniformly distributed on {0.1..... n— 1}. (That is. P(X =i) = P(Y i) = 2. for i = 0.1.....n ~1.) For each r. 0 < r < n~1. define Z, = X 4r¥ (mod n). a) Show that the rv.’s {Z,:0 limsup P(An)- m0 10.13 A sequence of r.v.’s X,, X2,... is said to be completely convergent to X if ce Yo PUK, — X| > 2) 0. n=1 Show that if the sequence X,, is independent then complete convergence is equivalent to convergence a.s. 10.14 Let y,v be two finite measures on (E.€). (F, F), respectively. i.e. they satisfy all axioms of probability measures except that u(E) and v(F) are positive reals, but not necessarily equal to 1. Let A= pSv on (Ex F.E& F) be defined by A(A x B) = p(A)v(B) for Cartesian products A x B (A € €, BeéF). a) Show that \ extends to a finite measure defined on € @ F: b) Let f : Ex F — R be measurable. Prove Fubini’s Theorem: if f is A-integrable, then « > f f(x.y)v(dy) and y > f f(x. y)u(dz) are re- spectively € and ¥ measurable. and moreover [tars [fe dudenay = ff te. vetasyutan). (Hint: Use Theorem 10.3.) 10.15 * A measure r is called o-finite on (G.G) if there exists a sequence of sets (G;)j1. G; € G, such that UX,G; = G and r(G;) < oe, each j. Show that if 4,7 are assumed to be o-finite and assuming that \ = ps & v exists. then a) A= @v is o-finite: and Exercises 7 b) (Fubini’s Theorem): If f:E x F — R is measurable and ,-integrable. then x > f f(x,y)u(dy) and y — J f(x. y)u(dx) are respectively € and F measurable, and moreover faq [f tenudewmay = ff 12nv(autaey. (Hint: Use Exercise 10.14 on sets Bj x Fy, where (Ej) < 2 and v(Fy) < x.) 10.16 * Toss a coin with P(Heads)= p repeatedly. Let A, be the event that k or more consecutive heads occurs amongst the tosses numbered 2*.2* + 1,..., 24+? — 1, Show that P(A, io.) = 1 if p > } and P(A, io.) = 0 if iy. 17 Let Xo. Xi, X2,. - be independent random variables with P(X, = =P(X,=-l)= “all n. Let Z, = HfL Xi. Show that 2), Z2, Z3,... are ecg 10.18 Let X,Y be independent and suppose P(X + Y =a) = 1, where a is a constant. Show that both X and Y are constant random variables. 11. Probability Distributions on R We have already seen that a probability measure P on (R,B) (with B the Borel sets of R) is characterized by its distribution function F(a) = P((-00,]). We now wish to use the tools we have developed to study Lebesgue measure on R. Definition 11.1. Lebesgue measure is a set function m:B — [0,00] that satisfies (i) (countable additivity) if A,, Az. A3.-.. are pairwise disjoint Borel sets, then m (US, Aj) = So mA) i=l (ii) if a,b R, a < b, then m((a,))) Theorem 11.1. Lebesgue measure is unique. Proof. Fix a Ry which clearly has P(R) = 1. Further if Aj. Ag...., Am... are all pairwise disjoint, then P (UE A) = | Fla) lyx,a,pla)de -/ (Svea) de et since the A; are pairwise disjoint: => [ foatnae = Pay i=l i=l by using Theorem 9.2. Therefore we have countable additivity and P is a true probability measure on (R,B). Taking A = (—oo, 2] in (11.4) yields P(-se.2)) = [Fad that is P admits the density f. We now show that P determines f up to a set of Lebesgue measure zero. Suppose f’ is another density for P. 
Then f’ will also satisfy (11.4) (to see this, define P’ by (11.4) with f’ and observe that both P and P’ have the same distribution function, implying that P = P’). Therefore, if we choose e >Oand set A= {a: f(x) +e < f'(x)} and if m(A) > 0, then P(A) +em(a) = [ (F(a) + e)talarde sf F(@)1alede = PLA), a contradiction. We conclude m({f +¢ < f’}) = 0. Since {f +e < f’} increases to {f < f’} as € decreases to 0, we obtain that m({f’ < f}) = 0. Analogously, m({f’ > f}) = 0, hence f’ = f almost everywhere (dm). [Almost everywhere” means except on a set of measure zero; for probability measures we say “almost surely” instead of “almost everywhere” .] oO 80 11. Probability Distributions on R Remark 11.1. Since the density f and the distribution function F satisfy F(x) = f*,. f(y)dy, one is tempted to conclude that F is differentiable, with derivative equal to f. This is true at each point x where f is continuous. One can show — and this is a difficult result due to Lebesgue — that F is differentiable dm-almost everywhere regardless of the nature of f. But this result is an almost everywhere result, and it is not true in general for all x. However in most “concrete” examples, when the density exists it turns out that F’ is piecewise differentiable: in this case one may take f = F’ (the derivative of F’) wherever it exists, and f = 0 elsewhere. Corollary 11.1 (Expectation Rule). Let X be an R-valued r.v. with den- sity f. Let g be a positive Borel measurable function. Then g is integrable (resp. admits an integral) with respect to P*, the distribution measure of X, if and only if the product fg is integrable (resp. admits an integral) with respect to Lebesgue measure, and in this case we have E{g(X)} = [oPXac) = [o@sarae. (115) Proof. The equality (11.5) holds for indicator functions by Theorem 11.3, because it reduces to (11.4). Therefore (11.5) holds for simple functions by linearity. For g nonnegative, let gp, be simple functions increasing to g. Then (11.5) holds by the monotone convergence theorem. For general g, let g = g* —g7, and the result follows by taking differences. Qo We presented examples of densities in Chapter 7. Note that all the exam- ples were continuous or piecewise continuous, while here we seem concerned with Borel measurable densities. Most practical examples of r.v.’s in Statis- tics turn out to have relatively smooth densities, but when we perform simple operations on random variables with nice densities (such as taking a condi- tional expectation), we quickly have need for a much more general theory that includes Borel measurable densities. Let X be a r.v. with density f. Suppose Y = g(X) for some g. Can we express the density of Y (if it exists) in terms of f? We can indeed in some “good” cases. We begin with a simple result: Theorem 11.4. Let X have density fx and let g be a Borel measurable function. Let ¥ = g(X). Then Fy =P 0. Then 11. Probability Distributions on R. 81 1 Fy(u) =P (-$1o«(X) exp(—Ay)) _ fl-e for y >0 ~ lO otherwise. Therefore (cf. Remark 11.1): d Ae ify > 0 ho) = FW) = {3 oe and we see that Y is exponential with parameter \. Caution: The preceding example is deceptively simple because g was injective, or one to one. The general result is given below: Corollary 11.2. Let X have a continuous density fx. Let g:R — R be continuously differentiable with a non-vanishing derivative (hence g is strictly monotone). Let h(y) = g7'(y) be the inverse function (also continuously differentiable). Then Y = 9(X) has the density fy(y) = fx(hy))|h(y)- Proof. 
Suppose g is increasing. Let Fy(y) = P(Y < y). Then Fy(y) = P(g(X) sy) = P(h(g(X)) < hy), since fh is monotone increasing because g is. Then the above gives A(y) = P(X Shy) = Fx(hyy) =f sle)ae. It is a standard result from calculus (see, e.g., [18, p.259]) that if a function g is injective (one-to-one), differentiable, and such that its derivative is never zero, then h = g~} is also differentiable and h'(z) = Ray: Therefore Fy(y) is differentiable and d qiw= F(A(y))n'(y) = F(ACy)|A(Y)|- If g is decreasing the same argument yields afew) = f(ACy)) (hy) = FAG) IAC) |- 82 11. Probability Distributions on R Corollary 11.3. Let X have a continuous density fx. Let g:R > R be piecewise strictly monotone and continuously differentiable: that is. there exist intervals Ty. 12 I,, which partition R such that g is strictly monotone and continuously differentiable on the interior of each I. For each i. g: I; +R is invertible on g(I;) and let hy be the inverse function. Let Y = g(X) and let A= {y:y = g(z).x € R}, the range of g. Then the density fy of Y exists and is given by fly) = Xo fr (hily)HAi(y)|1ay)- Remark: The proof. similar to the proof of the previous corollary, is left to the reader. Our method uses the continuity of fx. but the result holds when fx is simply measurable. Example: Let X be normal with parameters = 0; 0? = 1. Let Y = X?, Then in this case g(a) = x2, which is neither monotone nor injective. Take I, = (0,00) and Ip = (-0.0). Then g is injective and strictly monotone on Ty and Jy. and I; U Ip = R. g(I) = (0.00) and g(Iz) = (0,00). Then hy : (0.00) + R by hi(y) = Vy and hg : [0.90) > R by holy) = — Vy. mcnl = |g for i=1.2. na Therefore by Corollary 11.3, 1 1 1 1 Tee ag ew? 30 OTE Tm aye 79) 1 1 - Ta Vi loon) (y)- The random variable Y is called a x? random variable with one degree of freedom. (This is pronounced “chi square” .) The preceding example is sufficiently simple that it can also be derived “by hand”, without using Corollary 11.3. Indeed, Fy(y) = PUY < y) = P(X? 0) 11. Probability Distributions on R. 83 Similarly. ovi 2 =Fx(-V) = -[ edn, oC us whence d 1 -1 —(-F(- = ee VF ay! (-vy)) Vin yg 1 ay) pty 0): Van 2 (y>0)+ and adding yields the same result as we obtained using the Corollary. Remark: The chi square distribution plays an important role in Statistics. Let p be an integer. Then a random variable X with density 1 = pP/2—-le-F = Topper e 2, 0 0. Show that: fry) = wa S> (fax (haly)) + fx (Fi(y)) Yeaey() — for appropriate functions h; and k;. 11.8* Let X be uniform on (—7,7) and let Y = atan(X), a > 0. Find fy(y)- [Ans: fr(y) = aie : 11.9* Let X have a density, and let Y = ce * 1x50), (a >0,c> 0). Find fy-(y) in terms of fy. [Ans: fi-(y) = @# Ge BEM 1, 9(y),] 11.10 A density f is called symmetric if f(—x) = f(a), for all x. (That is, f is an even function.) A random variable X is symmetric if X and —X both have the same distribution. Suppose X has a density f. Show that X is sym- metric if and only if it has a density f which is symmetric. In this case, does it admit also a nonsymmetric density? [Ans.: Yes, just modify f on a non-empty set of Lebesgue measure zero in R4]. [Note: Examples of symmetric densi- ties are the uniform on (—a.a); the normal with parameters (0,07); double exponential with parameters (0,3); the Cauchy with parameters (0, 3). Exercises 85 11.11 Let X be positive with a density f. Let Y = x4; and find the density for Y. 11.12 Let X be normal with parameters (1,07). Show that Y = e* has a lognormal distribution. 11.13 Let X be ar-y. 
with distribution function F that is continuous. Show that Y = F(X) is uniform. 11.14 Let F be a distribution function that is continuous and is such that the inverse function F'~! exists. Let U be uniform on (0,1). Show that X = F-"(U) has distribution function F. 11.15 * Let F be a continuous distribution function and let U be uniform on (0,1). Define G(u) = inf{x : F(x) > u}. Show that G(U) has distribution function F. 11.16 Let Y = —}In(U), where U is uniform on (0,1). Show that Y is exponential with parameter \ by inverting the distribution function of the exponential. (Hint: If U is uniform on (0, 1) then so also is 1—U.) This gives a method to simulate exponential random variables. 12. Probability Distributions on R” In Chapter 11 we considered the simple case of distributions on (R, B). The case of distributions on (R”, B”) for n = 2,3, ... is both analogous and more complicated. [B” denotes the Borel sets of R”.] First let us note that by essentially the same proof as used in Theorem 2.1, we have that B” is generated by “quadrants” of the form Ttce-a: a,€Q: i=l note that BE BS... B = B": that is, B” is also the smallest o-algebra generated by the n-fold Cartesian product of B, the Borel sets on R. The n-dimensional distribution function of a probability measure on (R”, B”) is defined to be: P(a1,---54n) =P (I~) . i=1 It is more subtle to try to characterize P by using F for n > 2 than it is for n = 1, and consequently distribution functions are rarely used for n > 2. We have also seen that the density of a probability measure on R, when it exists. is a very convenient tool. Contrary to distribution functions, this notion of a density function extends easily and is exactly as convenient on R” as it is on R (but, as is the case for n = 1, it does not always exist). Definition 12.1. The Lebesgue measure m, on (R”,B") is defined on Cartesian product sets Ay x Az x... An by n (14) = TL. all A; € B, (12.1) aT ; where m is the one dimensional Lebesgue measure defined on (R,B). As in Theorem 10.3, one can extend the measure defined in (12.1) for Cartesian product sets uniquely to a measure m, on (R",B"), and m, will still have countable additivity. This measure m, is Lebesgue measure, and it is char- acterized also by the following seemingly weaker condition than (12.1): 88 12. Probability Distributions on R” a n Mn (I) =T]i-ai). all -x ajajei; > 0, for all (a),...,an) € R". 92 12. Probability Distributions on R" Proof. The symmetry is clear. since Cov(X;. Xj) = Cov(X;.X;) trivially. A simple calculation shows that n (Sex). a1 > aja ;Ci; and since variances are always nonnegative, we are done. Oo Theorem 12.5. Let X be an R”-valued r.v. with covariance matrix C. Let A be anm xn matrix and set Y = AX. Then Y is an R™-valued r.v. and its covariance matrix is C' = ACA*, where A* denotes A transpose. Proof. The proof is a simple calculation. oO We now turn our attention to functions of R"-valued random variables. We address the following problem: let g:R” — R” be Borel. Given X = (Xi... Xn) with density f, what is the density of Y = g(X) in terms of f, and to begin with, does it exist at all? We will need the following theorem from advanced calculus (see for example [22, p.83]). Let us recall first that if g is a differentiable function from an open set G in R® into R”, its Jacobian matric Jy(x) at point x € G is Jg(x) = 2(x) (that is, Jg(a)iy = HE (e), where g = (91, 92,---;9n))- The Jacobian of g at point x is the determinant of the matrix J,(x). 
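As a concrete sketch (not part of the text), one can approximate the Jacobian matrix of a specific smooth map by finite differences and compare its determinant with the closed form. The map chosen below, the polar-coordinate map g(r, theta) = (r cos theta, r sin theta), whose Jacobian determinant is r, and the difference step h are illustrative choices only.

    # Finite-difference sketch of the Jacobian matrix J_g and its determinant.
    # The map g (polar coordinates) and the step h are hypothetical choices.
    import math

    def g(r, theta):
        return (r * math.cos(theta), r * math.sin(theta))

    def jacobian(f, x, y, h=1e-6):
        # 2 x 2 matrix of partial derivatives of f = (f1, f2) at (x, y),
        # approximated by central differences.
        fxp, fxm = f(x + h, y), f(x - h, y)
        fyp, fym = f(x, y + h), f(x, y - h)
        return [[(fxp[0] - fxm[0]) / (2 * h), (fyp[0] - fym[0]) / (2 * h)],
                [(fxp[1] - fxm[1]) / (2 * h), (fyp[1] - fym[1]) / (2 * h)]]

    r, theta = 2.0, 0.7
    J = jacobian(g, r, theta)
    det = J[0][0] * J[1][1] - J[0][1] * J[1][0]
    print(det, r)   # the determinant is (approximately) r

In this example the Jacobian determinant equals r, hence is non-zero away from the origin; the general statements that follow apply at any point where the Jacobian of g does not vanish.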
If this Jacobian is not zero, then g is invertible on a neighborhood of x, and the Jabobian of the inverse g7} at point y = g(x) is the inverse of the Jacobian of g at x. Theorem 12.6 (Jacobi’s Transformation Formula). Lei G be an open set in R® and let g:G — R” be continuously differentiable.’ Suppose g is injective (one to one) on G and its Jacobian never vanishes. Then for f measurable and such that the product flgig) is positive or integrable with respect to Lebesgue measure, [tote [lacey aer(sg(e))\te 9(G) G where by g(G) we mean: g(G) = {y ER”: there exists x € G with g(x) = y}. The next theorem is simply an application of Theorem 12.6 to the density functions of random variables. Theorem 12.7. Let X = (X1,...,Xn) have joint density f. Let g:R” > R” be continuously differentiable and injective, with non-vanishing Jacobian. Then Y = g(X) has density 1 A function g is continuously differentiable if it is differentiable and also its deriva- tive is continuous. 12. Probability Distributions on R™ 93 fry) = Lx(g7"(y))| det Jg-2(y)) ify is in the range of g WM= 0 otherwise. Proof. We denote by G the range of g, that is G = {y € R”: there exists xc € R” with y = g(x)}. The properties of g imply that G is an open set and that the inverse function g~! is well defined on G and continuously differentiable with non-vanishing Jacobian. Let B € B”, and A = g-1(B). We have P(X €A)= [telus =f, fx(ode g7*(B) = f xo *@aer -:(2)|de, by Theorem 12.6 applied with g~!. But we also have P(Y € B) = P(X € A), hence povea)= f fru. Since B € B” is arbitrary we conclude fx(o-4(@))] det J-a(2))] = f(a), a.e., whence the result. a In analogy to Corollary 11.3 of Chapter 11, we can also treat a case where g is not injective but nevertheless smooth. Corollary 12.1. Let S € B” be partitioned into disjoint subsets So, Si,- Sm such that U™ 9S; = S, and such that Mp(So) = 0 and that for each i 1,...,m, g: S$; + R” is injective (one to one) and continuously differentiable with non-vanishing Jacobian. Let Y = g(X), where X is an R”-valued r.v. with values in S and with density fx. Then Y has a density given by m fry) = Yo fx Fw) det J,-1(y)| i=1 where gy! denotes the inverse map g;':9(S:) + S; and J,-1 is its corre- sponding Jacobian matrix. Examples: 1. Let X,Y be independent normal r.v.’s, each with parameters p = 0, o? = 1. Let us calculate the joint distribution of (U,V) = (X+Y,X-Y). Here 94 12. Probability Distributions on R” g(a. y) = (@+y.a—y) = (ur). riwo= (GS). The Jacobian in this simple case does not depend on (u,v) (that is, it is constant), and is and Jg-1 (u,v) = and det Jg-1 = Therefore fey) (uv) = for —co 0. This example shows also that the ratio of two independent normals with mean 0 is a Cauchy rv. (a =O and 8 =1). Exercises 99 Exercises for Chapter 12 12.1 Show that . SP ah wy?) [ / ea? dady = 2n0°. moe Jae and therefore that page tH) 20? is a true density. (Hint: Use polar coor- dinates.) 12.2 Suppose a joint density f(x.y)(x.y) factors: fix.vy(z,y) = g(x)h(y). Find fx(a) and fy(y). 12.3 Let (X,Y) have joint density 1 fry) = 2ro\ 02 exp ( 1 {Sop _ r(x = wi)(y — na) + y = Ha)? }) : ~ 20 =r?) o 0102 o3 Find fx=2(y). [Ans Seo OP aaay — pz — “22a —pn)}?),] 12.4 Let px,y denote the correlation coefficient for (X,Y). Let a >0,¢>0 and b € R. Show that PaX+b.cY¥ +b = PXY+ (This is useful since it shows that p is independent of the scale of measurement for X and Y.) 12.5 If a 4 0, show that a la’ so that if Y = aX +bis an affine non-constant function of X, then px,y = +1. 
PX,aX $b = 12.6 Let X,Y have finite variances and let 2-(4)¥-(82)8 Show that 0% = 1— p\y. and deduce that if pxyy = +1, then Y is a non-constant affine function of X. 12.7* (Gut (1995), p. 27.) Let (X,Y) be uniform on the unit ball: that is. 1 —ifer+y <1 fixyy(ay) = 4 7 0 ifa?+y? >1. Find the distribution of R = VX? + Y?. (Hint: Introduce an auxiliary r.v. S = Arctan (%).) [Ans: fr(r) = 2r1(0.1)(r).] 100 12. Probability Distributions on R” 12.8 Let (X.Y) have density f(x,y). Find the density of Z = X + Y. (Hint: Find the joint density of (Z,W) first where W = Y.) [Ans: fz(z) = Jo foxy) (2 — w, w) dw] 12.9 Let X be normal with 4 = 0 and o? < oo, and let © be uniform on (0, x): that is f(@) 216.2) (0)- Assume X and @ are independent. Find the distribution of Z = X + acos(@). (This is useful in electrical engineering.) 1 (2—acos w)2/202 [Ans: fz(z) = mel ea 008 w)"/207 day] 12.10 Let X and Y be independent and suppose Z = g(X) and W = h(Y), with g and h both injective and differentiable. Find a formula for fz.w(z.w), the joint density of (Z,W). 12.11 Let (X,Y) be independent normals, both with means » = 0 and variances 0”. Let x Z= VX? +Y? and W = Arctan (>), > Show that Z has a Rayleigh distribution, that W is uniform on (—$, $), and that Z and W are independent. © -t e'*) is bounded in modulus by 1, the continuity follows from Lebesgue’s dominated convergence theorem (Theorem 9.1(f)). Qo Actually one can show that ji is uniformly continuous, but we do not need such a result here. Theorem 13.2. Let X be an R” valued random variable and suppose E{|X|™} < oc for some integer m. Then the characteristic function yx of X has continuous partial derivatives up to order m, and a" " B{Xj, Xj, e8%) —————— 9 \= i" BD we XG elite 7 day. Oe ex(u) = i BEX; X56} dm Proof. We prove an equivalent formulation stated in terms of Fourier trans- forms of probability measures. (To see the equivalence. simply take p to be PX, the distribution measure on R” of X as in (13.2).) Let y be a probability measure on R® and assume f(2) = |z|" is integrable: 13. Characteristic Functions 105 f |)" pda) < 00. Re Then we wish to show that /i(u) is m-times continuously differentiable and an fi Tee ee =i" fay, ..23,08 ula), We give the proof only for the case m = 1. The general case can be established analogously by recurrence. In order to prove that fe exists at point wu, it is enough to prove that for every sequence of reals tp ‘ending to 0, and with v = (v1,..., Un) being the unit vector in 3” in the “direction j (ie. with coordinates v, = 0 for k 4 j and v; = 1), then the sequence - 5, (ilu + tye) — aw} = few u(dr). (13.3) converges to a limit independent of the sequence t,, and in this case this limit. equals pe (u)- The sequence of functions 2 + cay (where x; is the jth coordinate of s € R”) converges pointwise to ix; by differentiation; moreover S2I2|, eltety 1 | and [2aluae) <0 by hypothesis. Therefore by Lebesgue’s dominated convergence theorem (Theorem 9.1(f)) we have that (13.3) converges to if aye (dz), Therefore [oie sul. (13.4) The proof that the partial derivative in (13.4) above is continuous is exactly the same as that of Theorem 13.1. oO An immediate application of the above is to use characteristic functions to calculate the moments of random variables. (The kth moment of a rv. X is E{X*}.) For the first two moments (by far the most important) we note that for X real valued (by Theorem 13.2): E{X} = ig, (0) if E{|X|} < 00 (13.5) E{X?} = —p(0) if B{X?} < 00. (13.6) 106 13. Characteristic Functions Examples: 1. 
Bernoulli (p): If X is Bernoulli with parameter p. then ex(u) = Efe} = 69(1 — p) +ep =[pe™ 41 —p 2. Binomial B(p,n): If X is Binomial with parameters n. p. then ex(u) = Efe} => (Jena ~ py =| (pe™ + 1p)" } j=0 We could also have noted that n x=S°Y;, j=l where Yj,..., ¥, are independent and Bernoulli (p). Then =E {ti | = I Efe} j= px(u) = Efe%} = fe Lies jal by the independence of the Y;’s; : gy, (u) = (pe + 1—p)”. 1 a 3. Poisson (A): ex(u) = B(e%} = eX =1) k=0 oo k oo -> tuk AY =) _ yn e™ = e me = k=0 ° =e dere =| MED 4. Uniform on (—a,a): a aT ee = pfeexy 1} iurde = . ex(u) = Be} = Je de Qaiu using that e* = cosz + isin z, and that cos(a) = cos(—a), this equals 2i sin au sinau 2aiu au 13. Characteristic Functions 107 5, The Normal (u = 0:0? = 1): Calculating the characteristic function of the normal is a bit hard. It can be done via contour integrals and the residue theorein (using the theory of complex variables), or by analytic continuation (see Exercise 17 of Chapter 14); we present here a perhaps non-intuitive method that has the virtue of being elementary: 1 2 . = ft Le Pde ex (u) J Jia cos ur oe Pac es |S UE 2? Lap vin , Vir Qn Since sin ua e~*’/? is an odd and integrable function. we have that ee 2 / sinure* (de =0. and thus x yx(u) = l/l cosun ec” /2dx, = By Theorem 13.2 we cau differentiate both sides with respect to u to obtain: 1 se ee (u) = el +6; and exponentiating gives yx(u) = eh, Since yx (0) = 1, we deduce e© = 1, whence ex 108 13. Characteristic Functions Theorem 13.3. Let X be an R”-valued random variable anda € R™. Let A be anm xn matrix. Then par ax(u) =e px(A*u), for allu€ R™, where A* denotes A transpose. Proof. One easily verifies that eiluatAX) and then taking expectations of both sides gives the result. Oo Examples (continued) 6. The Normal (1,07): Let X be N(u,07). Then one easily checks (see Exercise 14.18) that Y = xu is Normal (0,1). Alternatively, X can be written X = y+ oY, where Y is N(0,1). Then using Theorem 13.3 and example 5 we have iup—u?o? /2 yx =e 7. The Exponential (A): Let X be Exponential with parameter A. Then 0 ex(u)= f ec Neda, 0 A formal calculation gives oe -[ Ae M8 dg = 0 os A= tu but this is not mathematically rigorous. It can be justified by, for example, a contour integral using complex analysis. Another method is as follows: It is easy to check that the functions eae M(-Acos(ur) + usin(ua)). cn *(—u cos(u) — Asin(ux)) Mw , have derivatives \e-** cos(ux) and Ae~** sin(ux) respectively. Thus oo Xd oc [ re cos(ur)da = —se“"(—Acos(ur) + usin(uz))| 0 ww 0 oc r oo [ Ae sin(ua)de = — se **(—ucos(ur) — Asin(ux)) fo Yu 0 Hence we get 13. Characteristic Functions 109 8. The Gamma (a, 3): One can show using contour integration and the residue theorem in the theory of complex variables that if X is Gamma (a. 3) then px(u) = oa One can also calculate the characteristic function of a Gamma random variable without resorting to contour integration: see Exercise 14.19. 14. Properties of Characteristic Functions We have seen several examples on how to calculate a characteristic function when given a random variable. Equivalently we have seen examples of how to calculate the Fourier transforms of probability measures. For such transforms to be useful, we need to know that knowledge of the transform characterizes the distribution that gives rise to it. The proof of the next theorem uses the Stone-Weierstrass theorem and thus is a bit advanced for this book. Nevertheless we include the proof for the sake of completeness. Theorem 14.1 (Uniqueness Theorem). 
The Fourier transform ji of a probability measure on R” characterizes pu: that is, if two probabilities on R" admit the same Fourier transorm, they are equal. Proof. Let 1 ~ le? 20? Sot) = aan le? /20? | and Flow) = mere, Then f(o.2) is the density of X = (Xj,...,Xn), where X; is N(0,07) for each j (1 < j < n). By Example 6 of Chapter 13 and the Tonelli-Fubini Theorem, we have n 4 1 — . f(o.ajel" da -{ eat He) day... de [se ve . oon 1 q Ih 2 1 = P (gph tte) eh 20? dz. I, avin ; Therefore Loaf ue Hou) = Gaaapl («. “= ") 1 i tate = arose |, Slon)e da, 112 14. Properties of Characteristic Functions Next suppose that jr and jz2 are two probability measures on R” with the same Fourier transforms fi; = fiz = ji. Then / f(o,u— v)yn(du) = f anya {| loa Par} pala = [ So.2) Geeamak (S) eae. (the reader will check that one can apply Fubini’s theorem here), and the exact same equalities hold for 42. We conclude that [ oeatae) = fe nalae) for all g € H, where 1 is the vector space generated by all functions of the form u— f(o,u — v). We then can apply the Stone-Weierstrass theorem? to conclude that H is dense in Co under uniform convergence. where Co is the set of functions “vanishing at 00”: that is, Co consists of all continuous functions on R” such that limy.y—< |f(x)| = 0. We then obtain that J, aenntas) =f aCeynatae) for all g € Co. Since the indicator function of an open set can be written as the increasing limit of functions in Co, the Monotone Convergence Theorem (Theorem 9.1(d)) then gives (A) = w2(A), all open sets AC R”. Finally the Monotone Class Theorem (Theorem 6.2) gives (A) = f2(A) — for all Borel sets A CR”, which means ji = 2. o Corollary 14.1. Let X = (X,...,Xp) be an R”-valued random variable. Then the real-valued r.v.’s (Xj)i and therefore by Theorem 14.1 we have UX = UX, @UX, @--- OMX, which is equivalent to independence. Oo R is not enough for the r.v.’s X; to be independent. Caution: In the above, having ¢x(u,u,-...u) =[T%; ex, (w) for all uw € 114 14, Properties of Characteristic Functions Exercises for Chapters 13 and 14 Note: The first three exercises require the use of contour integration and the residue theorem from complex analysis. These problems are giyen the symbol “2” 14.17 Let f(a) = a Cauchy density, for a r.v. X. Show that sae ex(u) =e", by integrating around a semicircle with diameter [—R. R] on the real axis to the left. from the point (~,0) to the point (2.0) over the real axis. 14.24 Let X be a gamma r.v. with parameters (a.3). Show using contour integration that yx(w) = <:. [Hint: Use the contour for 0 < ¢ < don the real axis, go from (d, ot back to (c,0). then descend vertically to the line y = —%S and descend southeast along the line, and then ascend vertically to (d, 0). ° 14.3* Let X be N(0.1) (i... normal with « = 0 and o? = 1), and show that ¢x(u) = e7“/? by contour integration. [Hint: use the contour from (R,0) to (—R,0) on the real axis; then descend vertically to (—R.—iu): then proceed horizontally to (R, —iu). and then ascend vertically back to the real axis.] 14.4 * Suppose E{|X|?} < oc and E{X} = 0. Show that Var(X) = 0? < 0, and that 1 yx(u) =1- gue + 0(u?) as u 0. [Recall that a function g is o(t) if limy.o %@! = 0,] 14.5 Let X = (Xj,...,X,) be an R” valued r.y. Show that a) yx(u,0.0.....0) = px, (u) (we R) b) yx(u.u,t....-U) =x; 4..4x,(U) (we R) 14.6 Let Z denote the complex conjugate of z. That is, if z = a+ ib then z= a-— ib (a,b € R). Show that for X ar.v., ex (u) = ex(-u). 14.7 Let X be a r.v. 
Show that yx(u) is a real-valued function (as opposed to a complex-valued function) if and only if X has a symmetric distribution. (That is, PX = P-*. where P* is the distribution measure of X.) [Hint: Use Exercise 14.6, Theorem 13.3, and Theorem 14.1.] 14.8 Show that if X and Y are iid. then Z = X — Y has a symmetric distribution. Exercises 115 14.9 Let X)..... X,, be independent. each with mean 0. and each with finite third moments. Show that e| (x) } = LA. (Hint: Use characteristic functions.) 14.10 Let j1.....j1n be probability measures. Suppose A; > 0 (1 < j < n) and 0}_1 Aj = 1. Let v = 3}, Ajj. Show that v is a probability measure, too. and that H(u) = 7 A5Ay(u)- j=l 14,11 Let X have the double exponential (or Laplace) distribution with a=0,8=1: fx(2) = sor -o 0 and a@ in the domain of definition of Tx. 116 14. Properties of Characteristic Functions 14.14 Let X be lognormal with parameters (j:.07). Find the Mellin trans- form (c.f. Exercise 14.13) Tx (0). Use this and the Peden that Tx(k) = E{X*} to calculate the ath moments of the lognormal distribution for k=... 14.15 Let X be N(0.1). Show that E{X?°+1} = 0 and E{X} = S— = (2n—-1)(2n —3)...3+1 14.16 * Let X be N(0,1). Let M(s) = Efee*} = [. ae (« - 3”) da. 00 VOR 2 Show that M(s) = e**/2, (Hint: Complete the square in the integrand.) 14.17 * Substitute s = iu in Exercise 14.16 to obtain the characteristic func- tion of the Normal yx (u) = e~"’/?; justify that one can do this by the theory of analytic continuation of functions of a complex variable. 14.18 Let X be N(u,02). Show that ¥ = *=# is N(0,1). 14.19* (Feller [9]) Let X be a Gamma r.v. with parameters (a, 3). One can calculate its characteristic function without using contour integration. Assume 3 = 1 and expand e‘* in a power series. Then show eu” © -egnto~ P(n+) 0, yn Fai ae Pf ant ‘ae = > Tray u) and show this is a binomial series which sums to Ta 15. Sums of Independent Random Variables Many of the important uses of Probability Theory flow from the study of sums of independent random variables. A simple example is from Statistics: if we perform an experiment repeatedly and independently, then the “average value” is given by F = ta X;, where Xj; represents the outcome of the j*® experiment. The r.v. F is then called an estimator for the mean j: of each of the X;. Statistical theory studies when (and how) % converges to ps as n tends to oo. Even once we show that 7 tends to pz as n tends to oo, we also need to know how large n should be in order to be reasonably sure that T is close to the true value 4 (which is, in general, unknown). There are other, more sophisticated questions that arise as well: what is the probability distribution of ©? If we cannot infer the exact distribution of Z, can we approximate it? How large need n be so that our approximation is sufficiently accurate? If we have prior information about 4, how do we use that to improve upon our estimator T? Even to begin to answer some of these fundamentally important questions we need to study sums of independent random variables. Theorem 15.1. Let X,Y be two R-valued independent random variables. The distribution measure uz of Z = X +Y is the convolution product of the probability measures x and py, defined by wx ny A)= ff 1ale+ vux(de)ur (dy. (15.1) Proof. Since X and Y are independent, we know that the joint distribution of (X,Y) is zx ® py. Therefore E{g(X.¥)} = [ [ocx arn (ay. and in particular, using g(x,y) = f(@ +y): E{f(X+Y)}= / f(a t y)ux (dz)py (dy). 
(15.2)

for any Borel function f on ℝ for which the integrals exist. It suffices to take f(x) = 1_A(x). □

Remark 15.1. Formula (15.2) above shows that for f : ℝ → ℝ Borel measurable and Z = X + Y with X and Y independent:

E{f(Z)} = ∫ f(z) (μ_X ∗ μ_Y)(dz) = ∫∫ f(x + y) μ_X(dx) μ_Y(dy).

Theorem 15.2. Let X, Y be independent real valued random variables, with Z = X + Y. Then the characteristic function φ_Z is the product of φ_X and φ_Y; that is,

φ_Z(u) = φ_X(u) φ_Y(u).

Proof. Let f(x) = e^{iux} and use formula (15.2). □

Caution: If Z = X + Y, the property that φ_Z(u) = φ_X(u)φ_Y(u) for all u ∈ ℝ is not enough to ensure that X and Y are independent.

Theorem 15.3. Let X, Y be independent real valued random variables and let Z = X + Y.
a) If X has a density f_X, then Z has a density f_Z and moreover

f_Z(z) = ∫ f_X(z − y) μ_Y(dy).

b) If in addition Y has a density f_Y, then

f_Z(z) = ∫ f_X(z − y) f_Y(y) dy = ∫ f_X(x) f_Y(z − x) dx.

Proof. (b): Suppose (a) is true. Then

f_Z(z) = ∫ f_X(z − y) μ_Y(dy).

However μ_Y(dy) = f_Y(y) dy, and we have the first equality. Interchanging the roles of X and Y gives the second equality.
(a): By Theorem 15.1 we have

μ_Z(A) = ∫∫ 1_A(x + y) μ_X(dx) μ_Y(dy) = ∫ {∫ 1_A(x + y) f_X(x) dx} μ_Y(dy).

Next let z = x + y; dz = dx:

= ∫ {∫ 1_A(z) f_X(z − y) dz} μ_Y(dy),

and applying the Tonelli-Fubini theorem:

= ∫_A {∫ f_X(z − y) μ_Y(dy)} dz.

Since A was arbitrary we have the result for all Borel sets A, which proves the theorem. □

The next theorem is trivial but surprisingly useful.

Theorem 15.4. Let X, Y be independent real valued random variables that are square integrable (that is, E{X²} < ∞ and E{Y²} < ∞). Then

σ²_{X+Y} = σ²_X + σ²_Y.

Proof. Since X and Y are independent we have E{XY} = E{X}E{Y}, and

σ²_{X+Y} = E{X²} + 2E{XY} + E{Y²} − (E{X} + E{Y})² = σ²_X + σ²_Y. □

Examples:

1. Let X₁, ..., X_n be i.i.d. Bernoulli (p). Then Y = Σ_{j=1}^n X_j is Binomial B(p, n). We have seen

E{Y} = E{Σ_{j=1}^n X_j} = Σ_{j=1}^n E{X_j} = Σ_{j=1}^n p = np.

Note that σ²_{X_j} = E{X_j²} − E{X_j}² = p − p² = p(1 − p). Therefore by Theorem 15.4,

σ²_Y = Σ_{j=1}^n σ²_{X_j} = np(1 − p).

Note that the above method of computing the variance is preferable to explicit use of the distribution of Y, which would give rise to the following calculation:

σ²_Y = Σ_{j=0}^n (j − np)² \binom{n}{j} p^j (1 − p)^{n−j}.

2. Let X be Poisson (λ) and Y be Poisson (μ), and X and Y are independent. Then Z = X + Y is also Poisson (λ + μ). Indeed, φ_Z = φ_X φ_Y implies

φ_Z(u) = e^{λ(e^{iu}−1)} e^{μ(e^{iu}−1)} = e^{(λ+μ)(e^{iu}−1)},

which is the characteristic function of a Poisson (λ + μ). Therefore Z is Poisson by the uniqueness of characteristic functions (Theorem 14.1).

3. Suppose X is Binomial B(p, n) and Y is Binomial B(p, m). (X and Y have the same p.) Let Z = X + Y. Then φ_Z = φ_X φ_Y, hence

φ_Z(u) = (pe^{iu} + (1 − p))^n (pe^{iu} + (1 − p))^m = (pe^{iu} + 1 − p)^{n+m},

which is the characteristic function of a Binomial B(p, m + n); hence Z is Binomial B(p, m + n) by Theorem 14.1. We did not really need characteristic functions for this result: simply note that

X = Σ_{j=1}^n U_j and Y = Σ_{j=1}^m V_j,

and thus Z = Σ_{j=1}^n U_j + Σ_{j=1}^m V_j, where the U_j and V_j are all i.i.d. Bernoulli (p). Hence Z = Σ_{j=1}^{m+n} W_j, where the W_j are i.i.d. Bernoulli (p). (The first n W_j's are the U_j's; the next m W_j's are the V_j's.)

4. Suppose X is normal N(μ, σ²) and Y is also normal N(ν, τ²), and X and Y are independent. Then Z = X + Y is normal N(μ + ν, σ² + τ²). Indeed, φ_Z = φ_X φ_Y implies

φ_Z(u) = e^{iuμ − u²σ²/2} e^{iuν − u²τ²/2} = e^{iu(μ+ν) − u²(σ²+τ²)/2},

which is the characteristic function of a normal N(μ + ν, σ² + τ²), and we again use Theorem 14.1. (A quick numerical check of this example is sketched below.)
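The convolution identity of Theorem 15.3(b) and the conclusion of Example 4 are easy to check numerically. The following Python fragment is a minimal sketch and is not part of the original text: the parameter values, the integration grid, and the evaluation points z are arbitrary illustrative choices. It approximates the integral ∫ f_X(z − y) f_Y(y) dy by a Riemann sum and compares it with the N(μ + ν, σ² + τ²) density predicted by Example 4.

import numpy as np

def normal_pdf(x, mu, sigma2):
    """Density of N(mu, sigma2); sigma2 is the variance."""
    return np.exp(-(x - mu) ** 2 / (2.0 * sigma2)) / np.sqrt(2.0 * np.pi * sigma2)

# Illustrative parameters (my choice, not from the text): X ~ N(1, 4), Y ~ N(-0.5, 2.25).
mu, sig2 = 1.0, 4.0
nu, tau2 = -0.5, 2.25

# Riemann-sum approximation of the convolution f_Z(z) = integral of f_X(z - y) f_Y(y) dy,
# which is Theorem 15.3(b).
dy = 0.001
y = np.arange(-30.0, 30.0, dy)

for z in [-2.0, 0.0, 0.5, 3.0]:
    fZ_conv = np.sum(normal_pdf(z - y, mu, sig2) * normal_pdf(y, nu, tau2)) * dy
    fZ_normal = normal_pdf(z, mu + nu, sig2 + tau2)  # Example 4: Z ~ N(mu+nu, sig2+tau2)
    print(f"z = {z:5.1f}: convolution = {fZ_conv:.6f}, N(mu+nu, sig2+tau2) density = {fZ_normal:.6f}")

The two printed values should agree to several decimal places, which is exactly the content of Theorem 15.3(b) combined with Example 4.

5.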
Let X be the Gamma (a, 3) and Y be Gamma (6,3) and suppose X and Y are independent. Then if Z = X + Y, ez = yxgy, and therefore Bord — tu)? (f whence Z has the characteristic function of a Gamma (a +6, 3), and thus by Theorem 14.1, Z is a Gamma (a + 6, 8). . In Chapter 11 we defined the chi square distribution with p degrees of freedom (denoted x3), and we observed that if X is x7, then x = 7? in distribution, where Z is N(0,1). We also noted that if X is Xpe then X is Gamma (§, 2)- Therefore let X be Xps and let Z1,...,Zp be iid. N(0,1). If Y = 1 Z?, by Example 5 we have that since each Z? is Gatzina G4 , then | Y is Gamma (§, 5) which is x2. We conclude that if X is x3, then X= hy 4? in pan where Z, are iid. N(0,1). Exercises 121 Exercises for Chapter 15 15.1 Let Xy,...,X, be independent random variables, and assume E{X;} = wand 0?(X)) = 0? E{Sn}P(N =n). n=0 (Hint: Show first that E{Sy}=E {> Sxtoxn} = So ES livony n=0 n=0 and justify the second equality above.) 15.6 Suppose E{N} < oc and E{|X;|} < 2. Show that E{Sy} = E{N}E{X;}. (Hint: Use Exercise 15.5.) 15.7 Suppose B{N} < oc and E{|X;|} < oc. Show that sy(u) = Ef(ex,(w))*}- (Hint: Show first ES sy (u) = > Efe" 1 yeny }) n=1 15.8 Solve Exercise 15.6 using Exercise 15.7. (Hint: Recall that E{Z} = ip (0), for a rv. Z in L.) 15.9 Let X.Y be real valued and independent. Suppose X and X + Y have the same distribution. Show that Y is a constant r.v. equal to 0 a.s. 15.10 Let f,g map R to Rx, such that [. f(v)dz < oc and [. g(a)dx 0 we have ¢(u) #0 for all u with |u| < a. Let W(u) = 3%) for |u| < a. and show w(u) = {w(u/2")}*"; then show this tends to 1 as n — oo. (See Exercise 14.4.) Deduce that y(t) = {<2(t/2")}4" and let n — >.) Exercises 123 15.13 Let Xy.Xo..... Xp be iid. Normal N(u, 07). Let # = + O%_, X} and let Yj = X; — x. Find the joint characteristic function of (7.%;..... Yn). Let S? = 1 07, Y?. Deduce that T and S? are independent. 15.14 Show that |1 — e'*|? = 2(1 —cosx) < x? for all z € R. Use this to show that |1 —¢x(u)| < E{|uX|}. 15.15 Let A = (—2. 2]. Show that 12 J Busta) < Fea (t- Re ex(u)}. (Hint: 1— cos > 0 and 1~cosx > 3x? — j-2%, all x € R; also if 2 = a+ ib. then Re z = a. where a.b,€ R.) 15.16 If y is a characteristic function, show that |y|? is one too, (Hint: Let X,Y be iid. and consider Z = X — Y.) 15.17 Let X,.....Xq be independent exponential random variables with parameter ,3 > 0. Show that Y = >, Xi is Gamma (a. 3). 16. Gaussian Random Variables (The Normal and the Multivariate Normal Distributions) Let us recall that a Normal random variable with parameters (y,07), where “© Rand o” > 0, is a random variable whose density is given by: f(a) = eee ne -w 0 this comes from Example 13.6, and when o? = 0 this is trivial. Let us recall also that when £(X) = N(j1,0”), then E{X}=p, — Var(X)=07. (16.3) At first glance it might seem strange to call a distribution with such an odd appearing density “normal”. The reason for this dates back to the early 18*” century, when the first versions of the Central Limit Theorem appeared in books by Jacob Bernoulli (1713) and A. de Moivre (1718). These early versions of the Central Limit Theorem were expanded upon by P. Laplace and especially C. F. Gauss. Indeed because of the fundamental work of Gauss normal random variables are often called Gaussian random variables, and the former 10 Deutsche Mark note in Germany has a picture of Gauss on it and a graph of the function f given in (16.1), which is known as the Gaussian density. 
(This use of Probability Theory on currency, perhaps inspired by the extensive use of probability in Finance, has disappeared now that the Mark has been replaced with the Euro.) The Gaussian version of the Central Limit 126 16. Gaussian Random Variables Theorem can be loosely interpreted as saying that sums of i.i.d. random vari- ables are approximately Gaussian. This is quite profound. since one needs to know almost nothing about the actual distributions one is summing to conclude the sum is approximately Gaussian. Finally we note that later Paul Lévy did much work to find minimal hypotheses for the Central Limit theo- rem to hold, It is this family of theorems that is central to much of Statistical Theory: it allows one to assume a precise Gaussian structure from minimal hypotheses. It is the “central” nature of this theorem in Statistics that gives it its name. and which in turn makes Gaussian random variables both impor- tant and ubiquitous. hence normal. We treat the Central Limit Theorem in Chapter 21: here we lay the groundwork by studying the Gaussian random variables that will arise as the limiting distributions, For a real-valued random variable X the definition £(X) = N(u,07) is clear: it is a r.v. X with a density given by (16.1) if 0? > 0. and it is X = pw if 0? = 0. For an R”-valued r.v. the definition is more subtle; the reason is that we are actually describing the class of random variables that can arise as limits in the Central Limit Theorem, and this class is more complicated in R® when n > 2. Definition 16.1. An R”-valued random variable X = (Xj...., Xn) is Gaussian (or Multivariate Normal) if every linear combination SY" _, a;X; has a (one-dimensional) Normal distribution (possibly degenerate; for evam- ple it has the distribution N(0,0) when a; = 0, all j). Characteristic functions are of help when studying Gaussian random vari- ables. Theorem 16.1. X is an R”-valued Gaussian random variable if and only if its characteristic function has the form x(t) = exp titan) ~ 5, Qu} (16.4) where 1. € R” and Q is an nxn symmetric nonnegative semi-definite matric. Q is then the covariance matrix of X and yu is the mean of X, that is p1, E{X;} for all j. Proof (Sufficiency): Suppose we have (16.4). Let Y= Me a;X; = (a, X) j=l be a linear combination of the components of X. We need to show Y is (univariate) normal. But then for v € R: v2 yy (v) = px(va) = exp {iva n) ~ ae Ga) 16. Gaussian Random Variables 127 and by equation (16.2). yy(v) is the characteristic function of a normal N((a.p). (a. Qa)), and thus by Theorem 14.1 we have Y is normal. (Necessity): Suppose X is Gaussian, and let Y = ¥0ajX; = (a. X) j=l be a linear combination of the components of X. Let Q = Cov(X) be the covariance matrix of X. Then E{Y} = (a.u) where p= (f1,...-Hn) and E{Xj} = pi, 1 < i iujny— 5 > Uj; j=l cx gilup)—4(uQu) where j1 = (j11,+.+;4tn) and Q is the diagonal matrix Since yx(u) is of the form (16.4), we know that X is multivariate normal. The converse of Example 1 is also true: Corollary 16.1. Let X be an R”-valued Gaussian random variable. The components X; are independent if and only if the covariance matrix Q of X is diagonal. Proof. Example 1 shows the necessity. Suppose then we know Q is diagonal, ie. By Equation (16.4) of Theorem 16.1 it follows that x factors: ex(u) = [] ex, (us). jel where px,(utj) = exp {iusu, - pot} . Corollary 14.1 then gives that the X; are independent, and they are each normal (N(5.03)) by Equation (16.2). 
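Before continuing, here is a small simulation sketch, not part of the original text, illustrating Theorem 16.1 and Example 1. A two-dimensional Gaussian vector is built as an affine transformation of independent standard normals, so that its covariance matrix is Q = AA*, and its empirical mean, covariance, and characteristic function are compared with formula (16.4). The matrix A, the mean μ, the sample size, and the test points u are illustrative choices.

import numpy as np

rng = np.random.default_rng(0)

# Illustrative construction (my choice, not from the text): each sample is x = mu + A z with
# z ~ N(0, I_2), so X is Gaussian with mean mu and covariance Q = A A^T (cf. Definition 16.1).
mu = np.array([1.0, -2.0])
A = np.array([[2.0, 0.0],
              [1.0, 0.5]])
Q = A @ A.T

n = 200_000
Z = rng.standard_normal((n, 2))
X = mu + Z @ A.T

# The empirical mean and covariance should be close to mu and Q.
print("empirical mean:", X.mean(axis=0))
print("empirical covariance:\n", np.cov(X, rowvar=False))

# Formula (16.4): phi_X(u) = exp(i<u, mu> - <u, Qu>/2); compare with the empirical
# characteristic function (1/n) * sum_k exp(i<u, X_k>).
for u in [np.array([0.3, -0.7]), np.array([1.0, 0.4])]:
    phi_emp = np.mean(np.exp(1j * (X @ u)))
    phi_thm = np.exp(1j * np.dot(u, mu) - 0.5 * np.dot(u, Q @ u))
    print("u =", u, " |empirical - theoretical| =", abs(phi_emp - phi_thm))

Running the same experiment with a diagonal Q illustrates Corollary 16.1: the empirical correlation of the two components is then close to zero.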
Qo The next theorem shows that all Gaussian random variables (i-e., Multi- variate Normal random variables) arise as linear transformations of vectors of independent univariate normals. (Recall that we use the terms normal and Gaussian interchangeably.) 16. Gaussian Random Variables 129 Theorem 16.2. Let X be an R"-valued Gaussian random variable with mean vector 4. Then there exist independent Normal random variables Y,, w Yn with L(¥j)=NOAj), AV20, (Si 0. Some of the 1; can sometimes take the value zero. In this case we have Y; = 0. Thus the number of independent normal random variables required in Theorem 16.2 can be strictly less in number than the number of non-trivial components in the Gaussian r.v. X. Proof of Theorem 16.2: Since Q is a covariance matrix it is symmetric. non- negative semi-definite and there always exists an orthogonal matrix A such that Q = AAA*, where A is a diagonal matrix with all entries nonnegative. (Recall that an orthogonal matrix is a matrix where the rows (or columns), considered as vectors, are orthonormal: that is they have length (or norm) one and the scalar product of any two of them is zero (i.e., they are orthog- onal).) A* means the transpose of the matrix A. Since A is orthogonal, then A* is also the inverse of A. We set Y=A'(X—p) where py = E{X;} for X; the j*® component of X. Since X is Gaussian by hypothesis, we have that Y must be Gaussian too, since any linear combina- tion of the components of Y is also a linear combination of the components of X and therefore univariate normal. Moreover the covariance matrix of Y is A*QA = A, the sought after diagonal matrix. Since X = ju + AY (because A*—! = A), we have proved the theorem. Oo Corollary 16.2. An R”-valued Gaussian random variable X has a density on R” if and only if the covariance matriz Q is non-degenerate (that is, there does not exist a vector a € R” such that Qa = 0, or equivalently that det(Q) # 0). Proof. By Theorem 16.2 we know there exist n independent normals ¥;,..-, Yn of laws N(0,A;), (l 0, for all j (1 < j < n), because det(Q) = det(A) = [J Ai. Since \j > 0 and L(¥;) = N(0,A;), we know that Y has a density given by es, ol fy(y) = fy(y) I Jie and since X = + AY, we deduce from Theorem 12.7 that X has the density 130 16. Gaussian Random Variables 1 IX) = 5 dea Next suppose Q is degenerate: that is. det(Q) = 0. Then there exists an a €R”.a#¢ Osuch that Qa = 0 (that is, the kernel of the linear transforma- tion represented by Q is non-trivial). The random variable Z = (a. X) has a variance equal to (a, Qa) = 0. so it is a.s. equal to its mean (a. 4). Therefore P(X € H) = 1. where H is an affine hyperplane orthogonal to the vector a and containing the vector j, that is H = {x € R”: (x — pt. a) = O}.) Since the dimension of H is n — 1, the n-dimensional Lebesgue measure of H is zero. If X were to have a density, we would need to have the property eT Een Q ap) (16.5) 1=P(X €H) =| f(a)da =| P(t... tn)dry... day. (16.6) A A However J talereecte)der dan =0 because H is a hyperplane (see Exercise 16.1), hence (16.6) cannot hold: whence X cannot have a density. oO Comment: Corollary 16.2 shows that when n > 2 there exist normal (Gaus- sian) non constant random variables without densities (when n = 1 a normal variable is either constant or with a density). Moreover since (as we shall see in Chapter 21) these random variables arise as limits in the Central Limit Theorem. they are important and cannot be ignored. 
Thus while it is tempt- ing to define Gaussian random variables as (for example) random variables having densities of the form given in (16.5), such a definition would not cover some important cases. An elementary but important property of R"-valued Gaussian random variables is as follows: Theorem 16.3. Let X be an R"-valued Gaussian random variable, and let Y be an R™-valued Gaussian r.v. If X and Y are independent then Z = (X.Y) is an R"+™-valued Gaussian r.v. Proof. We have pz(u) = ex(w)yy(v) where w=(w.v); weR", veEeR™; since X and ¥ are independent. By Theorem 16.1 ga(u) = exp {iw u®) - 3. Q*w) } exp {i W) - 5h ar} = exp {iqese).(u®.n) - pau}. 16. Gaussian Random Variables 131 where ox “0 a= (4 or): Again using Theorem 16.1 this is the characteristic function of a Gaussian ny. Oo We say that two random variables X.Y are uncorrelated if Cov (X,Y) = 0. Since Cov (X.Y) = E{XY} ~ E{X}E{Y}, this is equivalent to saying that E{XY} = E{X}E{Y}. This of course is true if X and Y are independent (Theorem 12.3) and thus X.Y independent implies that X,Y are uncorrelated. The converse is false is general. However it is true for the Multivariate Normal (or Gaussian) case, a surprising and useful fact. Theorem 16.4. Let X be an R”-valued Gaussian random variable. Two components X; and Xx of X are independent if and only if they are uncor- related. Proof. We have already seen the necessity. Conversely, suppose that X; and Xx are uncorrelated. We can consider the two dimensional random vector Y = (1, Y2), with Yj = X; and Yy = X,. Clearly Y is a bivariate normal vector, and since Cov(¥;, Y2) = 0 by hypothesis we have that the covariance matrix of Y is diagonal, and the theorem reduces to Corollary 16.1. oO A standard model used in science, engineering, and the social sciences is that of simple linear regression.’ Here one has random variables ¥;, 1 < i < n, of the form Yizatsu+e, (1 0, where = 1 37, ai. One can treat these models quite generally (see Chapter 12 of [5] for example), but we will limit our attention to the most important case: where the “errors” are normal. Indeed, let the random variables (€;)1 0: Z=YV(iy\a}- Then Z is also N(0,1) (see Exercise 16.2). but ¥ + Z = 2Y1py\ 2\a|) = 0 and Y + Z is not as. equal to a constant. Therefore X = (Y, Z) is an R?-valued r.v. which is not Gaussian, even though its two components are each normal (or Gaussian). It is worth emphasizing that the Multivariate Normal has several special properties not shared in general with other distributions. We have seen that 16. Gaussian Random Variables 135, . Components are independent if and only if they are uncorrelated: . We have that the components are univariate normal: thus the components. belong to the same distribution family as the vector random variable: 3. A Gaussian X with an N(u,Q) distribution with Q invertible can be linearly transformed into an N(0./) r.v. (Exercise 16.6); that is, linear transformations do not change the distribution family: 4, The density exists if and only if the covariance matrix is nondegenerate, giving a simple criterion for the existence of a density: and finally 5. We have that the conditional distributions of Multivariate Normal dis- tributions are also normal (Exercise 16.10). ne These six properties show a remarkable stability inherent in the Multi- variate Normal. There are many more special features of the normal that we do not go into here. It is interesting to note that the normal distribution does not really exist in nature. 
It arises via a limiting procedure (the Central Limit Theorem). and thus it is an approximation of reality, and often it is an excellent approx- imation. When one says, for example, that the heights of twenty year old men in the United States are normally distributed with mean y and variance a”, one actually means that the heights are approximately so distributed. Indeed, if the heights were in fact normally distributed, there would be a strictly positive probability of finding men that were of negative height and also of finding men taller than the Sears Tower in Chicago. Such results are of course nonsense. However these positive probabilities are so small as to be equal to zero to many decimal places, and since zero is the true probability of such events we do not have a contradiction to the result that the normal distribution is indeed an excellent approximation to the true distribution of men, which is itself not precisely known. 136 16. Gaussian Random Variables Exercises for Chapter 16 16.1 Let ae R”, a #0, and » € R”. Let H be the hyperplane in R” given by H = {« ER”: (w—p,a) = 0}. Show that m,(H) = 0 where m, is n-dimensional Lebesgue measure, and deduce that [ feoae= [of ston. for any Borel function f on R”. fn)la( n)day...dtn =0 16.2 Let L(Y) = N(0,1), and let a> 0. Let _f Y if|y|a. Show that £(Z) = N(0,1) as well. 16.3 Let X be N(0,1) and let Z be independent of X with P(Z = 1) = P(Z=-1) =3. Let Y = ZX. Show L(Y) = N(0,1), but that (X,Y) is not Gaussian (i.e., not Multivariate Normal). 16.4 Let (X,Y) be Gaussian with mean (ux, jy) and covariance matrix Q and det(Q) > 0. Let p be the correlation coefficient Cov (X,Y) © \/War(Xy vary) Show that if -1 < p < 1 the density of (X,Y) exists and is equal to: foe (t0) = sex { so (Ge) (YB Y. Qroxoy /1— p? i 2(1 — p*) ox Be em } oxoy oy . Show that: if p = —1 or p = 1, then the density of (X,Y) does not exist. 16.5 Let p be in between —1 and 1, and 5, «7 (j = 1,2) be given. Construct X,, X2 Normals with means j11, 12; variances 0, o3; and correlation p. (Hint: Let Yi, Y2 be iid. N(0,1) and set U; = Y; and Uz = p¥, + \/1 ~p? Yo. Then let X; = mj + 0j¥; (j = 1,2).) Exercises 137 16.6 Suppose X is Gaussian N(w.Q) on R”, with det(Q) > 0. Show that there exists a matrix B such that Y = B(X —) has the N(0, J) distribution, where / is the n x n identity matrix. (Special Note: This shows that any Gaus- sian r.v. with non-degenerate covariance matrix can be linearly transformed into a standard normal.) 16.7 Let X be Gaussian and let n Y= 0 a;X;, j=l X,,). Show that Y is univariate N(u,07) where w= Soa E{X}} j=l where X = (Xy and ° *= Yo eva) 420 1ja,Cov (Xj, Xp). ak 16.8 Let (X.Y) be bivariate normal N(y,Q), where o-( o% poner) poxoy oF and p is the correlation coefficient (|p| < 1), (det(Q) > 0). Then (X,Y) has a density f and show that its conditional density fy=e(y) is the density of a univariate normal with mean py + p2¢(a — jx) and variance o¥-(1 — p?). (cf. Theorem 12.2.) 16.9 Let X be N(1u,Q) with pe = (1,1) and Q = ¢ 2): Find the conditional distribution of Y = X, + X2 given Z = X, — Xp =0. Answer: fz-o(y) = aR {4 lly = }. 16.10 Let £(X) = N(u.Q) with det(Q) > 0. Show that the conditional distributions of any number of coordinates of X, knowing the others, are also multivariate normal (cf. Theorem 12.2). [This Exercise generalizes Ex- ercise 16.8.] 16.11 (Gut, 1995). Let (X,Y) have joint density fixsy(a.y) = cexp{-(1+2°)(1+y")}. 00 < 2,y < 00, where c is chosen so that f is a density. 
Show that f is not the density of a bivariate normal but that fx=,(y) and fy=,(z) are each normal densities. (This shows that the converse of Exercise 16.10 does not hold.) 138 16. Gaussian Random Variables 16.12 Let (X.Y) be Bivariate Normal with correlation coefficient p and mean (0,0). Show that if |p| <1, then Z = # is Cauchy with parameters a = pz and 3 = 2*y/1~—p’. (Note: This result was already established in Example 12.5 when X and Y were independent.) We conclude that the ratio of two centered Bivariate Normals is a Cauchy r.v. 16.13 * Let (X.Y) be bivariate normal with mean 0 and correlation coeffi- cient p. Let 3 be such that cos3=p (00,.¥ > 0} = P{X <0. 0.¥ <0} = P{X <0 >0}=4-5. 16.15 * Let (X,Y) be bivariate normal with density Fow(0.y) = 1 orm (4 ats) (OY Sroxoy V1 : Show that: a) E{XY} = poxoy b) E{X?y?} = E{X?}B{¥?} + (BL XY}? c) E{|XY|} = 2&*2¥ (cosa + asin a) where a is given by sina = p (—3 < a < $) (cf. Exercise 16.13). 16.16 Let (X.Y) be bivariate normal with correlation p and 0% = o3-. Show that X and Y — pX are independent. 16.17 Let X be N(u,@) with det(Q) > 0, with X R”-valued. Show that (X= p)"Q7U(X~p) is x7(n). Exercises 139 16.18 Let Xj.....2 X,, be iid. V(0. 07). and let ig 1< 7 ek —z)? =5 us and S$? = = S0(X; 7)’. j=) j=. Recall from Exercise 15.13 that Z and S? are independent. Show that n n SOX? = OK; — 2)? +o? j=) j=) and deduce that (n — 1)S?/o? has a y2_, distribution and that nZ?/o? has a x3 distribution. 16.19 Let e)..... En be iid. N(0.07) and suppose ¥; = a + 3: +e,1< i 3) repeatedly, and {X,, = 1} corresponds to heads on the nt toss and {X;, = 0} corresponds to tails on the nt toss. In the “long run”, we would expect the proportion of heads to be p; this would justify our model that claims the probability of heads is p. Mathematically we would want tim Xi (w) +... + Xnl ne n =p forallwe 2. This simply does not happen! For example let wo = {T.T,T,...}, the se- quence of all tails. For this wo, ig, im, = Se Xj (wo) = 0. jar More generally we have the event A = {w: only a finite number of heads occur}. Then _ 1s fi 2G) =0 for allwe A. We readily admit that the event A is very unlikely to occur. Indeed, we can show (Exercise 17.13) that P(A) = 0. In fact, what we will eventually show (see the Strong Law of Large Numbers [Chapter 20]) is that 142 17. Convergence of Random Variables P stim 137X,w)= =1 wv: tim . jw) =p} =1. j= This type of convergence of random variables. where we do not have conver- gence for all w but do have convergence for almost all w (i.e.. the set of w where we do have convergence has probability one). is what typically arises. Caveat: In this chapter we will assume that all random variables are defined on a given, fixed probability space (§2..A,P) and takes values in R or R". We also denote by |z| the Euclidean norm of « € R”. Definition 17.1. We say that a sequence of random variables (Xp)n>1 con- verges almost surely to a random variable X if N= {w: lim Xp(w) 4 X(w)} has P(N) =0. n= Recall that the set N is called a null set, or a negligible set. Note that Nes A= {ws lim X,(w) =X(w)} and then P(A) =1. na We usually abbreviate almost sure convergence by writing lim X, =X as. n0 We have given an example of almost sure convergence from coin tossing pre- ceding this definition. Just as we defined almost sure convergence because it naturally occurs when “pointwise convergence” (for all “points”) fails, we need to introduce two more types of convergence. These next two types of convergence also arise naturally when a.s. 
convergence fails, and they are also useful as tools to help to show that a.s. convergence holds. Definition 17.2. A sequence of random variables (X,)n>1 converges in L? to X (wherel < p <2) if |Xnl, |X| are in L? and: lim E{|X, —X nt ?} =0. Alternatively one says X;, converges to X in pth mean. and one writes x,2X. The most important cases for convergence in pth mean are when p = 1 and when p = 2. When p = 1 and all r.v.’s are one-dimensional, we have 17. Convergence of Random Variables 143 \E{Xq — X}| < BE|X, — |} and |E{/X,|} — EYX|} < BCX, — XI} because {|| — |yl| < ja — yle Hence X, 5X implies B{X,}— E{X} and E{|Xq|} > E{|X]}. (171) Similarly. when X,, “ X for p € (1,00), we have that F{|X,,|?} converges to E{|X|P}: see Exercise 17.14 for the case p = 2. Definition 17.3. A sequence of random variables (X,)n>, converges in probability to X if for any ¢ > 0 we have slim, P(e LX ~ X(u)|>e})=0. This is also written lim P(\|X, —X|>e)=0, n—00 and denoted XxX, 2X. Using the epsilon-delta definition of a limit, one could alternatively say that X, tends to X in probability if for any ¢ > 0, any 6 > 0, there exists N = N(6) such that P(\X,—X|>e) <6 for alln > N. Before we establish the relationships between the different types of con- vergence, we give a surprisingly useful small result which characterizes con- vergence in probability. Theorem 17.1. X;, 2x if and only if . |\Xn=X| | _ lim e{ BAL =0. Proof. There is no loss of generality by taking X = 0. Thus we want to show X, © 0 if and only if lim, 0 {225} = 0. First suppose that X, 4 0. Then for any ¢ > 0, limn—oo P(|Xn| > €) = 0. Note that [Xn [Xn —< lax Vyixnise} S Uixalsey +E T+ [Xa ~ Tr] Xn) tiXeteet FEM iXaised S ixalpey + Therefore BL lV < efigyysey} te = PUXal > 6) be 1+ [Xn] J ~ . Taking limits yields 144 17. Convergence of Random Variables . |Xnl Ei —" hb sme {Ba ° Since € > 0 is fixed, we conclude limn 4. P(|X»| > €) = 0. Qa Remark: What this theorem says is that X, 2X PE {f((X, — X|)} 0 for the function f(x) = a the same equivalence holds for any function f on R which is bounded, strictly increasing on [0,00), continuous, and with f(0) = 0. For example we have Xp, % X iff E{|X,—X|A1} > 0 and also iff H{arctan(|X;, — X|)} > 0. A careful examination of the proof shows that The next theorem shows that convergence in probability is the weakest of the three types of convergence (a.s., L?, and probability). Theorem 17.2. Let (Xn)n>1 be a sequence of random variables. a) If X, 5X, then X, 5X. b) If X, “3X, then X, 2X. Proof. (a) Recall that for an event A, P(A) = E{1,4}, where 14 is the indi- cator function of the event A. Therefore, PA|Xn — X| >e} = E {1 yx,.~x1>e}}- Note that. XaexP > lon the event {|X, — X| > e}, hence |Xp—X < 0 always, we can simply drop the indicator function to get: ? ‘The notation iff is a standard notation shorthand for “if and only if” 17. Convergence of Random Variables 145 0), which gives the result. (b) Since R= XT < 1 always, we have X,-X X, im EB 4 ~~ sth = dim, {a a}- {im Pe a} FXO} =0 by Lebegue’s Dominated Convergence Theorem (9.1(f)). We then apply The- orem 17.1. o The converse to Theorem 17.2 is not true; nevertheless we have two partial converses. The most delicate one concerns the relation with a.s. convergence, and goes as follows: Theorem 17.3. Suppose Xp, 4. X. Then there exists a subsequence n, such that limp oc Xn, = X almost surely. Proof. Since X, % X we have that lim, .o H{%s=XL} = 0 by The- +1Xn—X] orem 17.1. Choose a subsequence n;, such that Bee =*t} < sr. 
Then a ny — XI [Xn =X1) Soka PA PAR Sa} < 90 and by Theorem 9.2 we have that PS rex ¢) = ap, and as soon as an — 0 we deduce that Xn % 0 (that is, X,, tends to 0 in probability). More precisely, let X;,,; be the indicator of the interval [4,4], 1 X and also that |X,| 0 we have {IX]>¥ +e} ¢ |X] >[Xn| +e} < {|X| — |Xn] > e} c {|X —X,] >e}. hence P(|X|>Y¥ +e) < P(\X—X,| >). and since this is true for each n, we have P(\X|>Y +e) < lim P(|X — X,|> <2) ne by hypothesis. This is true for each ¢ > 0, hence 1 P(\X|>Y)< lim P(\X|>Y+—) mide m from which we get |X| < Y a.s. Therefore X € L? too. Suppose now that X;, does not converge to X in L?. There is a subse- quence (n,) such that E{|X,, — X|?} > e for all k. and for some ¢ > 0. The subsequence X,,,, trivially converges to X in probability. so by Theorem 17.3 it admits a further subsequence Xnx, which converges a.s. to X. Now, the r.v.’s X,, —X tend a.s. to 0 as j — oo, while staying smaller than 2Y. so by Lebesgue’s Dominated Convergence we get that E{|X;,, — X|P} > 0, which contradicts the property that FE {|X,,— X|?} > ¢ for all k: hence we are done. oO The next theorem is elementary but also quite useful to keep in mind. Theorem 17.5. Let f be a continuous function. a) Iflimp tx Xn =X as. then limy sx f(Xn) = f(X) as. b) IfXq 2X. then f(X,) ® F(X). 17, Convergence of Random Variables 147 = {w : lim,-x Xn(w) # X(w)}. Then P(N) = 0 by . then Proof. (a) Let N hypothesis. If w ¢ » slim, f(Xn(e)) = f (Bim Xue) = #K(). where the first equality is by the continuity of f. Since this is true for any w & N, and P(N) =0. we have the almost sure conyergence. (b) For each k > 0, let us set: {|f(Xn) — F(X)| > e} C {| f(Xn) — F(X] > es |X| SREY {]X] > BH. (17.2) Since f is continuous. it is uniformly continuous on any bounded interval. Therefore for our ¢ given, there exists a 6 > 0 such that |f(x) — f(y)| < e if | — y| <6 for ¢ and y in [—k.k]. This means that {|f(Xn) — F(X)| > |X] Sk} C {Xn — X] > 6 |X| Sh} C {|Xn—X| > 5}. Combining this with (17.2) gives {1f(Xn) = F(X) > ef C {|Xn — X| > OF U{|X) > Fh. (17.3) Using simple subadditivity (P(AUB) < P(A) + P(B)) we obtain from (17.3): PAF (Xn) — FX) > e} S P(\Xn — X| > 6) + PUA] > k). However {|X| > k} tends to the empty set as k increases to oc so limp oo P(|X| > k) = 0. Therefore for y > 0 we choose k so large that P(\|X| >k) < +. Once k is fixed, we obtain the 6 of (17.3), and therefore lim P (|f(Xn) — f(X)| > 2) < lim P(|X,-X|>6)+y7=7%. n=00 n—00 Since 7 > 0 was arbitrary. we deduce the result. o 148 17. Convergence of Random Variables Exercises for Chapter 17 17.1 Let X,,; be as given in Example 2. Let Zp,j =n?Xp,j. Let Ym, be the sequence obtained by ordering the Z,,; as was done in Example 2. Show that. Ym tends to 0 in probability but that (Yn)m>, does not tend to 0 in L?, although each Y,, belongs to L?. 17.2 Show that Theorem 17.5(b) is false in general if f is not assumed to be continuous. (Hint: Take f(z) = 149}(x) and the X,,’s tending to 0 in probability.) 17.3 Let Xp be iid. random variables with P(X, = 1) = 4 and P(X; = —1) = 3. Show that i nu ja converges to 0 in probability. (Hint: Let S, = Sx"_, Xj, and use Chebyshev’s inequality on P{|S;,| > ne}.) 17.4 Let X, and S,, be as in Exercise 17.3. Show that ae Sn2 converges to zero a.s. (Hint: Show that 5%, P{7s|S,2| > e} < co and use the Borel- Cantelli Theorem.) 17.5 * Suppose |X| < Y a.s., each n,n = 1,2,3.. Show that sup, [Xn] 1 have finite variances and zero means (i.e., Var(X,) = o%, < oc and E{X,} = 0, all n). Suppose limy 0%, = 0. 
Show Xp converges to 0 in L? and in probability. 17.10 Let X; be iid. with finite variances and zero means. Let S, = Sh, X). Show that +S, tends to 0 in both L? and in probability. Exercises 149 17.11* Suppose limn—. Xn = X a.s. and |X| <0o as, Let Y = sup, |Xn|- Show that Y <0 as. 17.12 * Suppose limn so Xn = X as. Let ¥ = sup, [Xp — X|. Show ¥ < 96 a.s. (see Exercise 17.11). and define a new probability measure Q by 1 1 1 Q(A) = te{usry}. where e= B {5}. Show that X,, tends to X in L) under the probability measure Q. 17.13 Let A be the event described in Example 1. Show that P(A) = 0. (Hint: Let A, = { Heads on n* toss }. Show that S7*, P(An) = oc and use the Borel-Cantelli Theorem (Theo- rem 10.5.) ) 17.14 Let X,, and X be real-valued r.v.’s in L?, and suppose that X,, tends to X in L?. Show that E{X?} tends to E{X?} (Hint: use that |z? — y?| = (x — y)? + 2\y||z — y| and the Cauchy-Schwarz inequality). 17.15 * (Another Dominated Convergence Theorem.) Let (Xn)n>1 be random variables with X, 2 X (limp oo Xn = X in probability). Suppose |X, (w)| < C for a constant C’ > 0 and all w. Show that lim, —o. E{|Xpn—X|} = 0. (Hint: First show that P(|X| < C) = 1.) 18. Weak Convergence In Chapter 17 we considered four types of convergence of random variables: pointwise everywhere. pointwise almost surely, convergence in p*” mean (L? convergence), and convergence in probability. While all but the first differ from types of convergence seen in elementary Calculus courses, they are nev- ertheless squarely in the analysis tradition, and they can be thought of as variants of standard pointwise convergence. While these types of convergence are natural and useful in probability, there is yet another notion of conver- gence which is profoundly different from the four we have already seen. This convergence, known as weak convergence, is fundamental to the study of Prob- ability and Statistics. As its name implies, it is a weak type of convergence. The weaker the requirements for convergence, the easier it is for a sequence of random variables to have a limit. What is unusual about weak conver- gence. however, is that the actual values of the random variables themselves are not important! It is simply the probabilities that they will assume those values that matter. That is, it is the probability distributions of the random variables that will be converging, and not the values of the random variables themselves. It is this difference that makes weak convergence a convergence of a different type than pointwise and its variants, Since we will be dealing with the convergence of distributions of random variables, we begin by considering probability measures on R4@, some d > 1. Definition 18.1. Let jin and jy be probability measures on R4 (d> 1). The Sequence [in converges weakly to p if f f()n(da) converges to f f(x)u(dr) for each f which is real-valued, continuous and bounded on R¢. At first glance this definition may look like it has a typographical error: one is used to considering sim, f futriutae) = [ feyu(a; but here f remains fixed and it is indeed yz that varies. Note also that we do not consider all bounded Borel measurable functions f, but only the subset that are bounded and continuous. Definition 18.2. Let (Xn)nzi, X be R¢-valued random variables. We say X,, converges in distribution to X (or equivalently X,, converges in law to X ) 15218, Weak Convergence if the distribution measures P*" converge weakly to PX. We write Xp, 2x : or equivalently Xn 4x. Theorem 18.1. 
Let (X_n)_{n≥1}, X be ℝ^d-valued random variables. Then X_n converges in distribution to X if and only if

lim_{n→∞} E{f(X_n)} = E{f(X)}

for all continuous, bounded functions f on ℝ^d.

Proof. This is just a combination of Definitions 18.1 and 18.2, once we observe that

E{f(X_n)} = ∫ f(x) P^{X_n}(dx),  E{f(X)} = ∫ f(x) P^X(dx). □

It is important to emphasize that if X_n converges in distribution to X, there is no requirement that (X_n)_{n≥1} and X be defined on the same probability space (Ω, A, P)! Indeed in Statistics, for example, it happens that a sequence (X_n)_{n≥1} all defined on one space will converge in distribution to a r.v. X that cannot exist on the same space the (X_n)_{n≥1} were defined on! Thus the notion of weak convergence permits random variables to converge in ways that would otherwise be fundamentally impossible.

In order to have almost sure or L^p convergence, or convergence in probability, one always needs that (X_n)_{n≥1}, X are all defined on the same space. Thus a priori convergence in distribution is not comparable to the other kinds of convergence. Nevertheless, if by good fortune (or by construction) all of the (X_n)_{n≥1} and X are defined on the same probability space, then we can compare the types of convergence.

Theorem 18.2. Let (X_n)_{n≥1}, X all be defined on a given and fixed probability space (Ω, A, P). If X_n converges to X in probability, then X_n converges to X in distribution as well.

Proof. Let f be bounded and continuous on ℝ^d. Then by Theorem 17.5 we know that f(X_n) converges to f(X) in probability too. Since f is bounded, f(X_n) converges to f(X) in L^1 by Theorem 17.4. Therefore lim_{n→∞} E{f(X_n)} = E{f(X)} by (17.1), and Theorem 18.1 gives the result. □

There is a (very) partial converse to Theorem 18.2.

Theorem 18.3. Let (X_n)_{n≥1}, X be defined on a given fixed probability space (Ω, A, P). If X_n converges to X in distribution, and if X is a r.v. equal a.s. to a constant, then X_n converges to X in probability as well.

Proof. Suppose that X is a.s. equal to the constant a. The function f(x) = |x − a|/(1 + |x − a|) is bounded and continuous; therefore lim_{n→∞} E{|X_n − a|/(1 + |X_n − a|)} = 0, and hence X_n converges to a in probability by Theorem 17.1.

It is tempting to think that convergence in distribution implies the following: that if X_n converges in distribution to X then P(X_n ∈ A) converges to P(X ∈ A) for all Borel sets A. This is almost never true. We do have P(X_n ∈ A) converges to P(X ∈ A) for some sets A, but these sets are quite special. This is related to the convergence (in the real valued case) of distribution functions: indeed, if the X_n are real valued and X_n converges in distribution to X, then if F_n(x) = P(X_n ≤ x) were to converge to F(x) = P(X ≤ x), we would need to have P(X_n ∈ (−∞, x]) converge to P(X ∈ (−∞, x]) for all x ∈ ℝ, and even this is not always true!

Let us henceforth assume that (X_n)_{n≥1}, X are real valued random variables and that (F_n)_{n≥1}, F are their respective distribution functions. The next theorem is rather difficult and can be skipped. We note that it is much simpler if we assume that F, the distribution function of the limiting random variable X, is itself continuous. This suffices for many applications, but we include a proof of Theorem 18.4 for completeness. For this theorem, recall that the distribution function F of a r.v. is nondecreasing and right-continuous, and so it has left limits everywhere, that is lim_{y→x, y<x} F(y) = F(x−) exists for all x (see Exercise 18.4).

Theorem 18.4. Let (X_n)_{n≥1}, X be real valued random variables.
a) If X_n converges in distribution to X, then lim_{n→∞} F_n(x) = F(x) for all x in the dense subset of ℝ given by D = {x : F(x−) = F(x)}.
(Fa(z) = P(Xn < 2); D is sometimes called the set of continuity points of F’.) b) Suppose lim, Fn(x) = F(x) for all x in a dense subset of R. Then x, 2x. Proof of (a): Assume X, ™ X. Let D = {2x: F(a—) = F(a)}. Then Disa dense subset of R. since its complement (the set of discontinuities of F) is at. most countably infinite (see Exercises 18.4 and 18.5), and the complement of a countable set is always dense in R. Let us fix ¢ € R. For each integer p > 1 let us introduce the following bounded, continuous functions: 1 ify 1. Note further that EX9p(Xn)} < Fr(w) < Et fo(Xn)} and hence E{gp(X)} < lim inf F(a) < lim sup F(x) < E{f,(X)}. each p > 1. n00 no ~ (18.1) Now limp fp(y) = l-oe,aj(y) and limp sc gp(y) = 1(-20,2)(y), hence by Lebesgue’s dominated convergence theorem (Theorem 9.1(f)) we have that jim, FA fp X)} = B{1(-cc.ai(X)} =P(X F(a). Proof of (b): Now we suppose that limps. F,(«) = F(x) for all c € A, where A is a dense subset of R. Let f be a bounded, continuous function on R and take ¢ > 0. Let r,s be in A such that P(X ¢ (r.s]) =1~ F(s) + FQ”) Se. (Such r and s exist. since F(a) decreases to 0 as a decreases to —oc. and increases to 1 as x increases to +00, and since J is dense). Since F,, converges to F on A by hypothesis, there exists an N, such that for n >). P(Xn € (r,8]) = Since [r, s] is a closed (compact) interval and f is continuous, we know f is uniformly continuous on [r. s]; hence there exists a finite number of points r=ro <<... <1rk = such that — Fi,(s) + Fy(r) < 2e. (18.3) 18. Weak Convergence 155 {f@)-f@pise if nasesry. and each of the r; are in A, 1 < j < k. (That we may choose r; in A follows from the fact that A is dense.) Next we set. k 9) = SOF )Ur,1-ryi(@) (18.4) and by the preceding we have |f(x) ~ g(x)| < ¢ on (r.s]. Therefore if a = sup, |f(x)|, we obtain |E{f(Xn)} — E{g(Xn)}] S OP(Xn ¢ (1,8) +e (18.5) and the same holds for X in place of X,,. Using the definition (18.4) for g, observe that k E{g(Xn)} = Yo (ry) (Fal) ~ Fa(rp—1)} and analogously ke E{g(X)} = S005) {F (19) — Fa) }- j=l Since all the rj’s are in A, we have limy—ce Fn(ry) = F(rj) for each j. Since there are only a finite number of rj’s, we know there exists an Nz such that for n > No, |E{g(Xn)} — E{g(X)}| <. (18.6) Let us now combine (18.5) for X, and X and (18.6): ifm > max(N), No). then |E{P(Xn)} — FLAX} S/E{f(Xn)} — Elg(Xn)}) + |El9(Xn)} — BL9(X)}] + |E{g(X)} — EXf(X)}I < (aP(Xn ¢ (r.8]) +2) +e + (aP(X ¢ (r,5]) +2) S ae +e) +e + (ae +e) S 8ae + 32. Since ¢ was arbitrary, we conclude that lim,.. E{f(Xn)} = E{f(X)} for all bounded, continuous f; hence by Theorem 18.1 we have the result, 0 156 18. Weak Convergence Examples: 1. Suppose that ({n)n>1 is a sequence of probability measures on R that are all point masses (or, Dirac measures ): that is. for each n there exists a point a» such that jin({a@,}) = 1 and Hn ({an}*) = Un(R \ {an }) = 0. Then yz, converges weakly to a limit y if and only if a, converges to a point a; and in this case p is point mass at a. [Special note:: “point mass” probability measures are usually written €, or 6, in the literature, which is used to denote point mass of size one at the point a.} Note that F,(2) = lo,.cc)(#), and therefore limn—. Fy(t) = F(x) on a dense subset easily implies that F must be of the form 1jq,.)(), where a = limp oc On+ 2. Let 1 0 if eS-7 1 1 Fre) stat if —~. nm Then tim, F(t) = F(x) = 10.<)(2) for all x except x = 0; ths the set D of Theorem 18.4 is D = R \ {0}. Thus if £(X,,) is given by F,,, then we have X;, 2, X. 
where X is constant and equal to 0 a.s. (£(X) is given by F.) What we have shown is that a sequence of uniform random variables (Xn)n>1, with X, uniform on 1). converge weakly to 0 (ie., the constant random variable equal ). 3. Let (Xn)n>i1, X be random variables with densities f,(«), f(x). Then the distribution function a F(x) -/ f(ujdu 00 is continuous; thus F(c—) = F(a) on all of R. Suppose f,() < g(a), all nand x, and f*. g(x)dx < 00, and limp. fn(w) = f(a) almost every- where. Then F,(x) converges to F(a) by Lebesgue’s dominated conver- gence theorem and thus X;, 2x. Note that alternatively in this example we have that tim, / h(w)P** (da) = lim. / h(t) fr(a)de = / h(a) lim f(e)dx = [ ro\fteyar = [rar (aa) 18. Weak Convergence 157 for any bounded continuous function h by Lebesgue’s dominated conver- gence theorem, and we have another proof that X,, 2. X. This proof works also in the multi-dimensional case, and we see that a slightly stronger form of convergence than weak convergence takes place here: we need h above to be bounded and measurable, but the continuity is superfluous. The previous example has the following extension, which might look a bit surprising: we can interchange limits and integrals in a case where the sequence of functions is not dominated by a single integrable function; this is due to the fact that all functions f, and f below have integrals equal to 1. Theorem 18.5. Let (Xn)n>i, X be r.v.’s with values in R4, with densities dn, f. If the sequence fn converges pointwise (or even almost everywhere) to f, then X, 2X. Proof. Let h be a bounded measurable function on R¢, and a = sup, {h(«)|. Put Ay (a) = h(v) +a and ho(x) = a—h(x). These two functions are positive, and thus so are hifn and he fn, all n. Since further for i = 1,2 the sequence hi fn converges almost everywhere to h;f, we can apply Fatou’s Lemma (see Theorem 9.1) to obtain E{hi(X)} = | feonceyar < tinint f fa(a)hs(e)de = lim inf £{h,(X,)}- aoe mae (18.7) Observe that E{h(Xn)} = B{hy(Xn)}—@ and E{h(Xn)} = a— E{ho(Xn)}, and the same equalities hold with X in place of Xp. Since liminf(z,) = —limsup(—2,), it follows from (18.7) applied successively to i= 1 and i that E{h(X)} < liminf, 5 E{h(Xn)}, E{h(X)} > limsup,, . E{h(Xn)}- Hence E{h(X,)} converges to E{h(X)}, and the theorem is proved. o The next theorem is a version of what is known as “Helly’s selection principle”. It is a difficult theorem, but we will need it to establish the relation between weak convergence and convergence of characteristic functions. The condition (18.8), that the measures can be made arbitrarily small, uniformly in n, on the complement of a compact set, is often called tightness. Theorem 18.6. Let (iin)n>1 be a sequence of probability measures on R. and suppose lim_ sup fn (/-m, m]*) = 0. (18.8) moo Then there exists a subsequence nx such that (jn, )k>1 converge weakly. 158 18, Weak Convergence Proof. Let F(t) = pn((—2¢. a]). Note that for each « € R. 0 < F(a) <1 for all n, thus (F,(©))nz1 is a bounded sequence of real numbers. Hence by the Bolzano-Weierstrass theorem there always exists a subsequence nx such that (Fn, (@))e>1 converges. (Of course the subsequence nx a priori depends on the point 2). We need to construct a limit in a countable fashion, so we restrict our attention to the rational numbers in R (denoted Q). Let ri.r2.....17; be an enumeration of the rationals. For ri, there exists a subsequence nj, of n such that the limit exists. 
We set: Gr) lim Fy, (r2)- koo6 For r2, there exists a sub-subsequence n2.4 such that the limit exists. Again. set: G(r2) lim Fr, ,(72). koe That is, n2,, is a subsequence of n1~. We continue this way: for rj, let nj4 be a subsequence of nj—1.x such that the limit exists. Again, set: Gry) = fim Fry (0))- We then form just one subsequence by taking ny := nx,4. Thus for rj, we have GOr3) = im, Fas(), since nz is a subsequence of nj, once k > j. Next we set: F(a) = inf G(y). (18.9) weQ yor Since the function G defined on Q is non-decreasing, so also is the function F given in (18.9), and it is right continuous by construction. Let ¢ > 0. By hypothesis there exists an m such that Hr{[-m,m]®) <€ for all n simultaneously. Therefore F,(t) Seif <—m, and F(z) >1—e ife>m; therefore we have the same for G, and finally F(a) Se if 2<-—m F(a)>1-c if a>m. (18.10) Since 0 < F <1, F is right continuous and non-decreasing. property (18.10) gives that F’ is a true distribution function, corresponding to a probability measure js on R. 18. Weak Convergence 159 Finally, suppose x is such that F(z—) = F(x). For ¢ > 0. there exist y.2€Q with y m}) = 0. (18.12) moe A useful observation is that in order to show weak convergence, one does not have to check that f f dun converges to f f dy for all bounded, continuous f. but only for a well chosen subset of them. We state the next result in terms of the convergence of random variables. Theorem 18.7. Let (X;,)n>1 be a sequence of random‘variables (R. or R4- valued). Then X, 2X if and only if littn ss. E{g(Xn)} = E{g(X)} for all bounded Lipschitz continuous functions 9. Proof. A function g is Lipschitz continuous if there exists a constant k such that lg(x) — g(y)| < kil — yll, all x,y. Note that necessity is trivial. so we show sufficiency. We need to show lim, E{f(Xn)} = E{F(X)} for all bounded, continuous functions f. Let f be bounded continuous, and let a = sup, |f(a)|. Suppose there exist Lipschitz continuous functions g;, with —a< gi S giz < f, and limi. gi(x) = f(x). Then Tim inf E{f(Xn)} = limint F{ge(Xn)} = Eg}, for each fixed 7. But the Monotone Convergence Theorem applied to g;(X)+a and f(X) + a implies 160 18. Weak Convergence Jim B{gi(X)} = EXD}. Therefore lim inf E{f(Xn)} 2 B{F(X)}. (18.13) Next, exactly the same argument applied to —f gives lim sup E{f(Xp)} < EC F(X}, (18.14) and combining (18.13) and (18.14) gives dim EUS(Xa)} = BUX): It remains then only to construct the functions g;. We need to find a sequence of Lipschitz functions {j1, j2....} such that sup, jx(x) = f(x) and jx(x) > —a; then we can take g;(«) = max{j,(x),...,Ji(«)}, and we will be done. By replacing f(x) by f(z) = f(x) +a if necessary, without loss of gener- ality we can assume the bounded function f(x) is positive for all «. For each Borel set A define a function representing distance from A by da(2) = inf {|r yllsy © A}. Then for rationals r > 0 and integers m, define Ino (@) = 7A (Md tyegeyyery()) « Note that |da(z)—da(y)] < Ja —yll for any set A. hence |jin.r()—Jm.r(y)| S ml|x — y||, and so jm,r is Lipschitz continuous. Moreover jy.,(@) 0. Choose a positive rational r such that. f(@)-e r for all y in a neighborhood of a. Therefore deysp(yyer}(@) > 0, hence jmr(z) = 7 > fle) —e. for m sufficiently large. Since the rationals and integers are countable, the collection {im rim € N,r € Q,} is countable. If {j;};>, represents an enumeration, we have seen that sup, j:(x) > f(x). Since j; < f, each i, we have sup, j;(2) = f(a) and we are done. a Corollary 18.1. 
Let (Xp)n>1 be a sequence of random variables (R or R4 valued). Then X, © X if and only if limp oo E{g(Xn)} = E{g(X)} for all bounded uniformly continuous functions g. Proof. If g is Lipschitz then it is uniformly continuous, so Theorem 18.7 gives the result. oO 18. Weak Convergence 161 Remark 18.2. In Theorem 18.7 we reduced the test class of functions for R or R¢ valued randoin variables to converge weakly: we reduced it from bounded continuous functions to bounded Lipschitz continuous functions. One may ask if it can be further reduced. It can in fact be further reduced to C* functions with compact support. See Exercises 18.19-18.22 in this regard, where the solutions to the exercises show that X,, converges to X in distribution if and only if E{f(X;,)} converges to E{ f(X)} for all bounded, C™ functions f. A consequence of Theorem 18.7 is Slutsky’s Theorem, which is useful in Statistics. Theorem 18.8 (Slutsky’s Theorem). Let (Xp)n>1 and (Yn)n>1 be two sequences of R? valued random variables, with Xp 2X and ||Xn - Yn|| 3 0 in probability. Then Y, 2 X. Proof. By Theorem 18.7 it suffices to show lim, E{f(Y,)} = E{f(X)} for all Lipschitz continuous, bounded f. Let then f be Lipschitz continuous. We have | f(x) — f(y)| < kl|x — y|| for some real k, and |f()| < @ for some real a. Then we have dim, PEF(Xn) = F¥a)H Stim, BC f(Xn) — F%e) Lh S he + lim EX (Xn) — £0) x.-vat 20} But limp soo E{|f (Xn) — f(¥n)|1 xn -¥al)>e}} < line 2aP {|| Xn = Yall > e} = 0, and since € > 0 is arbitrary we deduce that limn—x |E{f (Xn) - f(Yn)}| = 0. Therefore lim E(f(%)} = lim EU(%)} = BUD, and the theorem is proved. o We end this section with a consideration of the weak convergence of ran- dom variables that take on at most a countable number of values (e.g., the binomial, the Poisson, the hypergeometric, etc.). Since the state space is countable, we can assume that every function is continuous: this amounts to endowing the state space with the discrete topology (Caution: if the state space, say /, is naturally contained in R for example, then this discrete topology is induced by the usual topology on R. only when the minimum of |x—y| for x,y € EA[—m, m] is bounded away from 0 for all m > 0, like when E=N or E = Z, where Z denotes the integer). The next theorem gives a simple characterization of weak convergence in this case, and it is comparable to Theorem 18.5. Theorem 18.9. Let X,. X be random variables with at most countably many values. Then X;, > X if and only if 162 18. Weak Convergence lim P(Xn = J) = P(X = Jf) n= for each j in the state space of (Xn)nz1. X. Proof. First suppose Xp 2 X. Then lim E{f(Xn)} = ELF(X)} nox for every bounded, continuous function f (Theorem 18.1). Since all functions are continuous, choose f(x) = 14;}(a) and we obtain the result. Next, suppose limn oo P(Xn = j) = P(X = j) for all j in the state space E. Let f be a bounded function with a = sup, |f(j)|. Take € > 0. Since Mex =j=1 Jee is a convergent series, there must exist a finite subset A of FE such that Soe(X sj) 21-e jeA also for n large enough we have as well: Yo P(X, =f) > 1-28. jea Note that EAS(X)} = SO PPX = 3). jeE so we have, for n large enough: ELF} ~ Dye fO)P(X =H) See (18.15) |ELF%)} — Dye FUP (Xn = A)| < 2a. Finally we note that since A is finite we have dim SO PG) Pn =) = VO PV) P(X = 9)- (18.16) jEA jeA Thus from (18.15) and (18.16) we deduce Tim sup [4 F(Xn)} — BLE} S Bae, Since € was arbitrary, we have slim, E(f(Xn)} = BU} for each bounded (and a fortiori continuous) function f. 
Thus we have X;, % X by Theorem 18.1. oO 18. Weak Convergence 163 Examples: 4. If 4 denotes the Poisson distribution with parameter A, then ay nr(9) =e ‘ae and thus if Ay > A, we have j1,,,(J) > Ha(J) for each j = 1.2,3,... and by Theorem 18.9 we have that z,,, converges weakly to py. 5. If 4p denotes the Binomial B(p,n) distribution and if py — p, as in Example 4 and by Theorem 18.9 we have that ip, converges weakly to Mp 6. Let Ln.p denote the Binomial B(p,n). Consider the sequence jin,» Where limp oo npn = > 0. Then as in Exercise 4.1 we have rood = (8) FC) CO’) for 0 1), then X, 2 X. 18.2 Let a € R%. Show by constructing it that there exists a continuous function f : R¢ — R such that 0 < f(x) < 1 for all x € R*: f(a) = 0; and f(a) = 1 if |w—a| > ¢ for a given e > 0. (Hint: First solve this exercise when d=1 and then mimic your construction for d > 2.) 18.3 Let X be a real valued random variable with distribution function F’. Show that F(«—-) = F(«) if and only if P(X = 2) =0. 18.4* Let 9: RR. 0< g(a) < 1, g nondecreasing, and suppose g is right continuous (that is, limy—.,y>2 9(y) = g(x) for all «). Show that g has left limits everywhere (that is, limy—2,y<2 9(y) = g(x—) exists for all x) and that the set A = {x : g(e—) g(z)} is at most countably infinite. (Hint: First show there are only a finite number of points x such that g(x) — g(w—) > ¢; then let k tend to oc). 18.5 * Let F be the distribution function of a real valued random variable. Let D = {a : F(a—) = F(x)} (notation of Exercise 18.4). Show that D is dense in R. (Hint: Use Exercise 18.4 to show that the complement of D is at most countably infinite.) 18.6 Let (X;,)n>1 be a sequence of real valued random variables with £(X,) uniform on [—n,n]. In what sense(s) do X,, converge to a random variable X? [Answer: None.] 18.7 Let fn(x) be densities on R and suppose limnsoo fn(t) = €7*1(es¢)- If fn is the density for a random variable X,, each n, what can be said about the convergence of X;, as n tends to oo? [Answer: X 2 X, where X is exponential with parameter 1.]| 18.8 Let (Xn)nzi be iid. Cauchy with a = 0 and @ = 1. Let ¥, = Sate d Ny Show that Y, converges in distribution and find the limit. Does Y, converge in probability as well? 18.9 Let (Xn)nz1 be a sequence of random variables and suppose sup, E{(X")*} < co. Let pm be the distribution measure of X,,. Show that the sequence jn is tight (Hint: use Chebyshev’s inequality). 18.10 * Let X,, X and Y be real-valued r.v.’s, all defined on the same space (92,A, P). Assume that lim, EC f(Xn)o(¥)} = EL F(X) 9(¥)} whenever f and g are bounded, and f is continuous, and g is Borel. Show that the sequence (X,Y) converges in law to (X,Y). If furthermore X = h(Y) for some Borel function h, show that X, 5 X. Exercises 165 18.11 Let 444 denote the Pareto (or Zeta) distribution with parameter a. Let Qn — @ > 0 and show that jig,, tends weakly to pa. 18.12 Let yz. denote the Geometric distribution of parameter a. Let an > @ > 0, and show that ji, tends weakly to fla. 18.13 Let p(x.5.n) be a Hypergeometric distribution, and let N go to oo in such a way that p = 4 remains constant. The parameter n is held fixed Show as N tends to 00 as described above that j(v,5,n) converges weakly to the Binomial distribution B(p,n). 18.14 (Slutsky’s Theorem.) Let X,, converge in distribution to X and let Y,, converge in probability to a constant c. Show that (a) X,¥_ > eX (in distribution) and (b) ¥« 3, * (in distribution), (c ¥ 0) 18.15 Let (Xn)n2i- (Yx)nz1 all be defined on the same probability space. 
Suppose X,, 3 X and Y, converges in probability to 0. Show that X;, + Yn converges in distribution to X. 18.16 Suppose real valued (Xp)n>1 have distribution functions F,,, and that X,, 2X. Let p > 0 and show that for every positive N, N N [ |clPF(dz) < lim sup f |2\? Fy, (dx) < 00. -N noo JN 18.17 * Let real valued (X;,)n>1 have distribution functions F,,, and X have distribution function F’. Suppose for some r > 0, co lim / |F,(e) — F(2)|"dr = 0. ims Show that X,, Bx (Hint: Suppose there exists a continuity point y of F such that limp—oc Fn(y) # F(y). Then there exists ¢ > 0 and a subsequence (nx )uz1 such that |Fy,(y) — F(y)| > €. all k. Show then |Fn, (x) ~F(a)| > § for either x € [y,,y) or x € (y, ya] for appropriate yi, y2. Use this to derive a contradiction.) 18.18 * Suppose a sequence (Fp) n>1 of distribution functions on R converges to a continuous distribution function F on R. Show that the convergence is uniform in x (—oo < x < 00). (Hint: Begin by showing there exist points @),..+;0m such that F(a) < €, F(#j;41) — F(2;) < €, and 1— F(a) N, |F,(a;) — F(a;)| 1. X, Y by R-valued random variables, all on the same space. and suppose that X,, + ¢Y converges in distribution to X + 0Y for each fixed o > 0. Show that X,, converges to X in distribution. (Hint: Use Exercise 18.19.) 18.21 (Pollard [17]) Let X and Y be independent r.v.’s on the same space. with values in R and assume Y is N(0.1). Let f be bounded continuous. Show that BUX +0Y)} = EUfo(X)} 1 ios ~hle—al2/o? r) = —— ae BAI da, fo) Vino I. fe) Show that f, is bounded and C* where 18.22 Let (Xn)no1, X be R-valued random variables. Show that X;, con- verges to X in distribution if and only if E{f(X,,)} converges to E{ f(X)} for all bounded C* functions f. (Hint: Use Exercises 18.20 and 18.21.) 19. Weak Convergence and Characteristic Functions Weak convergence is at the heart of much of probability and statistics. Limit theorems provide much of the justification of statistics, and they also have a myriad of other applications. There is an intimate relationship between weak conyergence and characteristic functions. and it is indeed this relationship (provided by the next theorem) that makes characteristic functions so useful in the study of probability and statistics. Theorem 19.1 (Lévy’s Continuity Theorem), Let (jin)nz1 be @ se- quence of probability measures on R4, and let ({in)n>1 denote their Fourier transforms, or characteristic functions. a) If tn converges weakly to a probability measure ju, then jfin(u) > fi(u) for allu€ R¢; b) If fin(u) converges to a function f(u) for all u € R4, and if in addition f is continuous at 0, then there exists a probability 4 on R4 such that f(u) = fi(u), and py converges weakly to p. Proof. (a) Suppose ji, converges weakly to yz. Since e'“* is continuous and bounded in modulus, fins) = fe pa(de) converges to atu) = fe u(ar) by weak convergence (the function « + e'“* is complex-valued, but we can consider separately the real-valued part cos(u«) and the imaginary part sin(ur), which are both bounded and continuous). (b) Although we state the theorem for R%, we will give the proof only for d= 1. Suppose that limp. fin(w) = f(w) exists for all u. We begin by showing tightness (cf Theorem 18.6) of the sequence of probability measures Hn. Using Fubini’s theorem (Theorem 10.3 or more precisely Exercise 10.14) we have: [ fin(u)du = [. {f. onde) du 168 19. Weak Convergence and Characteristic Functions =f {ei and using that e"“* = cos(ur) + isin(uz), 2e a 7 [. {f- cos(uar) + isin(ur)au} bald). 
Since sin(ur) is an odd function, the imaginary integral is zero over the symmetric interval (—a,@), and thus: = 9 - / 2 sin(ae)in(de) noe © Since f°, ldu = 2a, we have 1 a an(u)du=2— [> 2s a al — fn (u))du = — [sialon nk 2) ff, sin(ax) =f 0-=@) pin(de) Now since 2(1 — S82) > 1 if Ju] > 2 and 2(1 — $22) > 0 always. the above is oo > [rae (aryn(de) 00 = fi 2ye.a/ai-(@)nn(ae) -2 2]° =tn( |=] }- Let G= 2 and we have the useful estimate: gps wn (B AVS F [a Aa(w))au (19.1) ~2/8 Let € > 0. Since by hypothesis f is continuous at 0, there exists a > 0 such that |1 — f(u)| < e/4 if jul < 2/a. (This is because fi,(0) = 1 for all n, whence limy-.90 fin(0) = f(0) = 1 as well.) Therefore, 2/0 a € € ofa (19.2) Since fi, (u) are characteristic functions, |fin(w)| <1, so by Lebesgue’s domi- nated convergence theorem (Theorem 9.1 (f)) we have a 2 fo sl, (= sed! < 2 fa 2 fo tim [ (= Antwan = f (1 = f(u))du. nee J 2a 2/a 19. Weak Convergence and Characteristic Functions 169 Therefore there exists an N such that n > N implies fo 2/o 0 ~ fatwa fo (1 — f(u))du € ~2/ a whence $ [74% (1 — fin(u))du < ¢. We next apply (19.1) to conclude pn ([—0, a]®) < ¢, for alln > N. There are only a finite number of n before N, and for each n < N, there exists an a, such that [in([—An,@n]°) < €. Let a = max(ay,..., Q@pi@). Then we have fin([-a,q]®) <, forall n. (19.3) The inequality (19.3) above means that for the sequence (f,)n>1, for any € > 0 there exists an a € R such that sup,, n([—a,a]°) < €. Therefore we have shown: lim sup sup fin ([—m, m]*) = 0 moc on for any fixed m € R. We have established tightness for the sequence {jin}n>1. We can next apply Theorem 18.6 to obtain a subsequence (ng)x>1 Such that pn, converges weakly to jz as k tends to oo. By part (a) of this theorem, dim Any (w) = fi(u) for all u, hence f(u) = f(u), and f is the Fourier transform of a probability measure. It remains to show that the sequence (jin),>1 itself (and not just (tn, 421) converges weakly to y. We show this by the method of contradiction. Let F,,, F be distribution functions of jz, and ys. That is, F(t) = un((—o0,4)); Fle) = (20,1). Let D be the set of continuity points of F’: that is, D ={x: F(a) = F(2)} Suppose that j,, does not converge weakly to yz, then by Theorem 18.4 there must exist at least one point « € D and a subsequence (nx)g>1 such that limy—.s0 Fn, (x) exists (by taking a further subsequence if necessary) and moreover limy— oo Fn, (2) = 3 # F(a). Next by Theorem 18.6 there also exists a subsequence of the subsequence (nx) (that is, a sub-subsequence (nx; )j21), such that (Hnx, )jz1 converges weakly to a limit v as j tends to oo. Exactly as we have argued, however, we get lim flax (w) =0(u), joe 170 19. Weak Convergence and Characteristic Functions and since lim fi, (u) = f(u), we conclude 0(u) = f(u). But we have seen that f(u) = fi(u). Therefore by Theorem 14.1 we must have ps = v. Finally. pn,, converging to v = y implies (by Theorem 18.4) that lim; Fre, (2) = F(x). since « is in D, the continuity set of 1, by hypothesis. But lim. Fry, (@) = 3 # F(x), and we have a contradiction. Oo Remark 19.1. Actually more is true in Theorem 19.1a than we proved: one can show that if 7, converges weakly to a probability measure jy on R4. then ji, converges to ji uniformly on compact subsets of R* Example. Let (X;,)n>1 be a sequence of Poisson random variables with parameter A, = n. Then if Zn = Fike =n). Zn 2 Z, where L(Z) =N(0,1). To see this, we have B{el%n} = Bfem(ornyh = iii {oto} = ewiuvAen(et"/ 71) by Example 13.3. 
Continuing and using a Taylor expansion for e*. we have the above equals = ea Might shat) = wi ting ut /2- AE _ er em bam) where h(u,n) stays bounded in n for each wand hence limo “4 = 0, Therefore, lim yz, (u) =e" /?, n="00 and since e~“’/? is the characteristic function of a N(0,1). (Example 13.5), we have that Z,, converges weakly to Z by Theorem 19.1 b. Exercises 171 Exercises for Chapter 19 19.1 Let (Xn)n>1 be N(jin. 02) random variables. Suppose jin > js € Rand o? — 0? > 0. Show that X, 3 X. where £(X) = N(u.0?) 19.2 Let (Xj)n>1 be N(#n,02) random variables. Suppose that X;, 2x for some random: variable X. Show that the, eae Hn and 0? have limits uw € Rand o? > 0, and that X is N(u.0") (Hint: ox, and yx being the characteristic functions of X,, and X, write PX = = cite SFE for some ys € R and o? > 0). and use Lévy’s Theorem to obtain that ¢x(u) =e * 19.3 Let (Xn)nz1. (Ya)nzi be sequences with X, and Yq defined on the same space for each n. Suppose X,, 2 X and ¥, 2 Y, and assume X;, and Y,, are independent for all n and that X and Y are independent. Show that XntY¥n BX+Y, 20. The Laws of Large Numbers One of the fundamental results of Probability Theory is the Strong Law of Large Numbers. It helps to justify our intuitive notions of what probability actually is (Example 1), and it has many direct applications, such as (for example) Monte Carlo estimation theory (see Example 2). Let (X;)n>1 be a sequence of random variables defined on the same prob- ability space and let Sp = 7}, Xj. A theorem that states that +5, con- verges in some sense is a law of large numbers. There are many such results; for example L? ergodic theorems or the Birkhoff ergodic theorem, considered when the measure space is actually a probability space, are examples of laws of large numbers, (See Theorem 20.3, for example). The convergence can be in probability, in L’, or almost sure. When the convergence is almost sure, we call it a strong law of large numbers. Theorem 20.1 (Strong Law of Large Numbers). Let (X;,)n>1 be inde- pendent and identically distributed (i.i.d.) and defined on the same space. Let M=E{Xj} and 0? =a, <0. Let Sp = YT}, Xj. Then tim 22 = jim tx, =p as. and in L?. noe n noe n Remark 20.1, We write 1.0? instead of yj, 0%, , since all the (Xj)j>1 have the same distribution and therefore the same mean and variance. Note also that lim,,..0 °* = y in probability, since L? and a.s. convergence both imply convergence in probability. It is easy to prove limp..oo 8 = yz in probability using Chebyshev’s inequality, and this is often called the Weak Law of Large Numbers. Since it is a corollary of the Strong Law given here, we do not include its proof. The proof of Theorem 20.1 is also simpler if we assume only X; € L* (all j), and it is often presented this way in textbooks. A stronger result, where the X,,’s are integrable but not necessarily square-integrable is stated in Theorem 20.2 and proved in Chapter 27. Proof of Theorem 20.1: First let us note that without loss of generality we can assume ys = E{X;} = 0. Indeed if 4 4 0, then we can replace X; with 174 20. The Laws of Large Numbers Z; = Xj — pw. We obtain limy—.. + D521 Zj = 0 and therefore _ lie a 1 - lim nS ~ ph) = Jim, (4d) -p=0 no from which we deduce the result. We henceforth assume js = 0. Recall S$, = 37’, X; and let Y, = Be, : Then E{Y,} = 5. B{Xj;} = 0. Moreover E{¥2} = Fs Dicjen E{X;X,}. However if j # k then * B{X)Xe} = E{Xj}E( Xe} = 0 since X; and X;, are assumed to be independent. 
Therefore n EAY2} = YO E{X}} (20.1) jal “ 1 > 2 2 j=) ° ne (no) o n and hence lim E{Y?} =0. Since Y, converges to 0 in L? we know there is a subsequence converging to 0 a.s. However we want to conclude the original sequence converges a.s To do this we find a subsequence converging a.s., and then treat the terms in between successive terms of the subsequence. Since E{Y?} = £, let us choose the subsequence n?: then ce oo 42 24 VaR} = OS *~_, Y2, < oc a.s., and hence the tail of this convergent series converges to 0; we conclude lim Y,2=0 as (20.2) n=36 Next let n € N. Let p(n) be the integer such that p(n)? Sn < (p(n) +1). Then 2 , n 1 y, — 2M y ye = - bX j=p(n)? +1 20. The Laws of Large Numbers 175, and as we saw in (20.1): 2 2 _ 2 ef ( 5 we) ‘uo \ ae An) o < 2Pin) +1 °, ~ ne? < 2yn+t a < 33 n because p(n) < /n. Now we apply the same argument as before. We have Sel (4 ey v0?) ‘Vey. “F< n=l Thus by Theorem 9.2 again, we have 2 2 2 > (% - pie) Yoon) 1 are in L’. An elegant way to prove Theorem 20.2 is to use the backwards martingale convergence theorem (see, e.g., Theorem 27.5). Let (§2..A, P) be a probability space, and let T : 2 — 92 be one to one (i.e. injective) such that T(A) C A (ie., T maps measurable sets to measurable sets) and if A € A, then P(T(A)) = P(A) (ie., T is measure preserving). Let T?(w) = T(T(w)) and define analogously powers of T. A set A is invariant under T if 14(w) = 1a(T(w)). 176 20, The Laws of Large Numbers Theorem 20.3 (Ergodic Strong Law of Large Numbers). Let T be a one-to-one measure preserving transformation of 2 onto itself. Assume the only T-invariant sets are sets of probability 0 or 1. If X € L* then im = SI X(T(w)) = B{X} gq a.s. and in L}. Theorem 20.3 is a consequence of the Birkhoff ergodic theorem; its advan- tage is that it replaces the hypothesis of independence with one of ergodicity. It is also called the strong law of large numbers for stationary sequences of random variables. Example 1: In Example 17.1 we let (X,)j>1 be a sequence of i.i.d. Bernoulli random variables, with P(X; = 1) = p and P(X; =0) = q =1—p (all j). Then Sp = 3>7_, X; is the number of “successes” in n trials, and +S, is the percentage of successes. The Strong Law of Large Numbers (Theorem 20.1) now tells us that tim 52 = p(X} =pas. (20.3) noo n This gives, essentially, a justification to our claim that the probability of success is p. Thus in some sense this helps to justify the original axioms of probability we presented in Section 2, since we are finally able to deduce the intuitively pleasing result (20.3) from our original axioms. Example 2: This is a simple example of a technique known as Monte Carlo approximations. (The etymology of the name is from the city of Monte Carlo of the Principality of Monaco, located in southern France. Gambling has long been legal there, and the name is a tribute to Monaco’s celebration of the “laws of chance” through the operation of elegant gambling casinos.) Suppose f is a measurable function on [0,1], and f |f(«)|dx < oc. Often we cannot obtain a closed form expression for a = i f(x)dz and we need to estimate it. If we let (U;);>1 be a sequence of independent uniform random variables on [0,1]. and we call I, = 1a f(U;), then by Theorem 20.2 we have . im Fe) =LUO)} = [soar 5 a.s. and in L?. Thus if we were to simulate the sequence (U;);>1 on a computer (using a random number generator to simulate uniform eu, variables, which is standard), we would get an approximation of Jot a)dx for large n. This is just one method to estimate to f(x)dz. 
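To make the Monte Carlo recipe concrete, here is a minimal Python sketch of the procedure just described. It is only an illustration, not part of the formal development: the integrand, the sample sizes, and the name mc_integrate are arbitrary choices made for the example, and the expectation is approximated by a sample average exactly as the Strong Law suggests.

import numpy as np

rng = np.random.default_rng(12345)  # arbitrary seed, fixed only for reproducibility

def mc_integrate(f, n):
    # Simulate n i.i.d. uniform random variables U_1, ..., U_n on [0,1].
    u = rng.uniform(0.0, 1.0, size=n)
    # By the Strong Law of Large Numbers, the sample mean of f(U_j)
    # converges a.s. to E{f(U_1)}, which is the integral of f over [0,1].
    return f(u).mean()

f = lambda x: np.exp(x ** 2)  # a test integrand with no closed-form antiderivative

for n in (10**2, 10**4, 10**6):
    print(n, mc_integrate(f, n))
# The printed estimates settle near the true value, roughly 1.4627.

This simulated average is precisely the estimate described above,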
and it is a not the best one except in the case where one wants to estimate a high dimensional 20. The Laws of Large Numbers 177 integral: that is, if one wants to estimate fi, f(x)dx for d large. The exact same ideas apply. Example 3: ({7. p.120]) Let 2 be a circle of radius r = 34. Let A be the Borel sets of the circle and let P be the Lebesgue measure on the circle (One can identify here the circle with the interval 0, 1)). Let a be irrational and T be rotation of 92 through @ radians about the center of the circle. Then one can verify that T is injective, measure preserving, and that the invariant sets all have probability zero or one (this is where the irrationality of a comes in). Therefore by Theorem 20.3 we have ig y . , dim, 5, Xe + fa) -[ X(«)de for any X € L’ defined on 2. for P-almost all x. 178 20. The Laws of Large Numbers Exercises for Chapter 20 20.1* (A Weak Law of Large Numbers). Let (X;) be a sequence of random variables such that sup; E{X}} = ¢ < 2 and E{X;X;} = 0 if j # k. Let Sn = Cha x}. a) Show that P(/+,| >) < = fore > 0: b) limy—oo 2.$n = 0 in L? and in probability. (Note: The usual i.i.d. assumptions have been considerably weakened here.) 20.2 Let (¥j)j>1 be a sequence of independent Binomial random variables, all defined on the same probability space, and with law B(p,1). Let X, = Den Yj. Show that X; is B(p, j) and that “2 converges a.s. to p. 20.3 Let (X;)j21 be iid. with X; in L?. Let ¥; = e*». Show that L a) converges to a constant @ a.s. [Answer: a = € 20.4 Let (Xj)j1 be iid. with X, in L? and B{X,} = pw. Let (¥j)j21 be also iid. with Y; in L! and E{Y;} = v ¢ 0.Show that FAX | x=" as. 1 SY ~ > 7 j5a 20.5 Let (Xj)jz1 be iid, with Xj in L* and suppose Je 77(X) — ») converges in distribution to a random variable Z, Show that n 1 lim — > Xj=v as. nN j=l (Hini: If Z, = Re Dh (X; — v), prove first that yaen converges in distri- bution to 0). 20.6 Let (X)j1 be iid. with X, in L?, Show that lim L yx = E{X?} as. nevcen 4 20.7 Let (X})jx1 be iid. N(1,3) random variables. Show that lim 22 13 +. noo X? + XP? Exercises 179 20.8 Let (Xj)j> be iid. with mean jy: and variance o?. Show that lim 1x pao as, ta Ke 8. 20.9 Let (Xj)j>1 be iid. integer valued random variables with E{|X;|} < oc. Let Sp = S07 Xj. (Sn)nza iscalled a random walk on the integers. Show that if #(X;) > 0 then lim S, = 00. as. noo 21. The Central Limit Theorem The Central Limit Theorem is one of the most impressive achievements of probability theory. From a simple description requiring minimal hypothe- ses, we are able to deduce precise results. The Central Limit Theorem thus serves as the basis for much of Statistical Theory. The idea is simple: let and o? = Var(X;) (all 7). The key observation is that absolutely nothing (except a finite variance) is assumed about the distribution of the random variables (X;)j>1. Therefore, if one can assume that a random variable in question is the sum of many iid. random variables with finite variances, that one can infer that the random variable’s distribution is approximately Gaussian. Next one can use data and do Statistical Tests to estimate js and o*, and then one knows essentially everything! Theorem 21.1 (Central Limit Theorem). Let (Xj;)j;>1 be iid. with E{X5} = w and Var(X;) = 0? (all j) with 0 < 0? < 00. Let Sn = Wh, Xj. Let Y, = Sage Then Yp, converges in distribution to Y, where L(Y) = N(O,1). Observe that if 0? = 0 above. then Xj = pr as. for all j, hence Sa = yw as. Proof. Let yj be the characteristic function of X, ~y. 
Since the (X;)j>1 are iid., ¢; does not depend on j and we write y. Let Y, = Sarge. Since the X; are independent, by Theorem 15.2 Py, (u) = Pa (X)— py (4) (21.1) yet _ u = PSO C5-0) \oVn 182 21, The Central Limit Theorem Next note that E{Xj;—p} = 0 and E{(Xj—p)?} = 0?. hence by Theorem 13.2 we know that y has two continuous derivatives and moreover o(u) = iB {(X; -1) yen, (uw) = -E{(X; wre}, Therefore y’(0) = 0 and 9” (0) = —o?. If we expand y in a Taylor expansion about u = 0, we get. 24,2 g(u) =1+0- S + u2h(u) (21.2) where h(w) +0 as u > 0 (because y” is continuous). Recall from (21.1): ont = (o(st9))" = er log o( wa) awa = erlos(- $F +a), where here “log” denotes the principal value of the complex valued logarithm. Taking limits as n tends to oo and using (for example) L’H6pital’s rule gives that > lim yy, (u) =e" ?: n30 Lévy's Continuity Theorem (Theorem 19.1) then implies that Y, converges in law to Z, where yz(u) = e~“”/?; but then we know that £(Z) = N(0, 1). using Example 13.5 and the fact that characteristic functions characterize distributions (Theorem 14.1). a Let us now discuss the relationship between laws of large numbers and the central limit theorem. Let (X});>1 be iid. with finite variances. and let 1 = E{X)}. Then by the Strong Law of Large Numbers, Sn «72 lim 2 = as. and in L?, (21.3) nose n where S, = Via X;. Thus we know the limit is , but a natural question is: How large must n be so that we are sufficiently close to 4? If we rewrite (21.8) as lim =0 as. and in L?, (21.4) no Sn al then what we wish to know is called a rate of convergence. We could ask, for example, does there exist an a € R, a 4 0. such that lim n® Sx | <0 as. (c #0)? 21. The Central Limit Theorem 183, In fact. no such @ exists. Indeed, one cannot have no (Se — #2) convergent to a non-zero constant or to a non-zero random variable a.s., or even in probability. However by the central limit theorem we know that if a = 3, Vni(S# — 1) converges in distribution to the normal distribution N(0,0?), In this sense. the rate of convergence of the strong law of large numbers is \/7i. One can weaken slightly the hypotheses of Theorem 21.1. Indeed with essentially the same proof. one can show: Theorem 21.2 (Central Limit Theorem). Let (X;);>1 be independent but not necessarily identically distributed. Let E{X;} = 0 (all j), and let 2 52. o} = 0%, . Assume sup E{|X;|***} < 00 , some e>0, i o 2 > 05 = 00 Then where £(Z) =.N(0,1) and where convergence is in distribution. While Theorem 21.1 is, in some sense, the “classical” Central Limit The- orem, Theorem 21.2 shows it is possible to change the hypotheses and get similar results. As a consequence there are in fact many different central limit theorems, all similar in that they give sufficient conditions for properly nor- malized sums of random variables to converge in distribution to a normally distributed random variable. Indeed, martingale theory allows us to weaken the hypotheses of Theorem 21.2 substantially. See Theorem 27.7. We note that one can also weaken the independence assumption to one of “asymptotic independence” via what is known as mixing conditions, but this is more difficult. Finally, we note that Theorem 21.1 has a d-dimensional version which again has essentially the same proof. Theorem 21.3 (Central Limit Theorem). Let (Xj)j>1 be iid. Rt- valued random variables. Let the (vector) wy = E{X;}, and let Q denote the covariance matrix: Q = (4k,e)i1 be iad. with P(X; = 1) = p and P(X; = ae =qH p. Then S, = Dyn? 3 is Binomial (C(S,) = B(p,n)). 
We have p = E{X;} =p and o? = 0% = pq = p(1 ~ p). By the a Law of Large Numbers we have . Sn lim — =pas. nse 1 and by the Central Limit Theorem (Theorem 21.1) we have (with con- vergence being in distribution): Sy, — np Dy vnp(l where £(Z) = N(0.1)- 2. Suppose (Xj) j>) are i.i.d. random variables. all in L?, and with (common) distribution function F. We assume F is unknown and we would like to estimate it. We give here a standard technique to do just that. Let Y3(x) = 11x, <2} Note that Y; are i.id. and in L?. Next define 1 Fra(e) = > we) for x fixed. = The function F;,(z) defined on R is called the empirical distribution func- tion (it should indeed be written as F;,(r,w), since it depends on w!). By the Strong Law of Large numbers we have Jim Fa(a) = im > Sve) = HMw}. jel However, EAM (2) } = BLA ex, , be iid. and suppose E{|X;P} < co. Let Gye) = P(SBE 1 be independent, double exponential with parameter 1 (that is, the common density is }e~!"!, -o0 < @ < 90). Show that sim. va (BE) =z j where £(Z) = N(0, 3), and where convergence is in distribution. (Hint: Use Slutsky’s theorem (Exercise 18.14).) 21.3 Construct a sequence of random variables (X,);>1, independent, such that lim;—... X; = 1 in probability, and E{X?} > j. Let Y be independent of the sequence (X;);>,, and L(Y) = N(0,1). Let Z; = YX;, j > 1. Show that a) E{Z;}=0 b) limj...00 9%, = 00 c) lim;-..0 Z; = Z in distribution, where £(Z) = N(0,1). (Hint: To construct Xj, let (92;,A;, P;) be ([0, 1], B[0, 1], m(ds)). where m is Lebesgue measure on [0, 1]. Let x; (J+ lors) + Layjai&). and take the infinite product as in Theorem 10.4. To prove (c) use Slutsky’s theorem (Exercise 18.14)). (Note that the hypotheses of the central limit the- orems presented here are not satisfied; of course, the theorems give sufficient conditions, not necessary ones.) 21.4 (Durrett, [8]). Let (Xj)j>1 be iid. with E{X,} = 1 and 63, =o% € (0, 20). Show that 2 (VB - va) Zz, with £(Z) = N(O, 1). (tine Sa _ (8a + Vn) = eS. va).) Exercises 187 21.5 Let (X;) be iid. Poisson random variables with parameter \ = 1. Let Sn = DF. Xj. Show that limy. Sa = Z, where £(Z) = N(0,1). va 21.6 Let Y* be a Poisson random variable with parameter \ > 0. Show that lim Atoc where £(Z) = N(0,1) and convergence is in distribution. (Hint: Use Exer- cise 21.5 and compare Y* with Sj; and Sj,)4.., where [A] denotes the largest integer less than or equal to A.) 21.7 Show that lim e~ n=390 (Hint: Use Exercise 21.5.) 21.8 Let (X;)j>1 be iid. with E{X;} = 0 and o%, = 0? < x. Let S, = Yo", Xj. Show that 2. does not converge in probability. 21.9* Let (X;)jz1 be iid. with E{X;} = 0 and 0%, = 0? < ox. Let Sn = S2"_, X}. Show that Sul] _ 2 slim, n{ Sal} - /2o. (Hint: Let £(Z) = N (0,07) and calculate E{|Z]|}.) 21.10 (Gut, [I1]). Let (X;)js1 be iid. with the uniform distribution on (1,1). Let mn Lia Xs Ven KF + Uj AF Show that //nY,, converges. (Answer: where £(Z) = .N(0,3).) Yn = VnY,, converges in distribution to Z 21.11 Let (X;);> be independent and let Xj have the uniform distribution on (~j.9)- a) Show that in distribution where £(Z) = N(0. 3) (Hint: Show that the characteristic function of Xj is yx, (u) = “222; compute gs, (u). then yg, jn2/2(U), and prove that the limit is e~“’/18 by using Ware aint HGnt)) 188 21. The Central Limit Theorem b) Show that lim n-+50 in distribution, where £(Z) = N(0,1). (Note: This is not a particular case of Theorem 21.2). 21.12 * Let X € L? and suppose X has the same distribution as wt 2). where Y. 
Z are independent and X, Y, Z all have the same distribution. Show that X is N(0.0?) with 0? < oo. (Hint: Show by iteration that X has the same law as I> 7, X; with (X;) iid., for n= 2") 22. L? and Hilbert Spaces We suppose given a probability space (92, F, P). Let L? denote all (equiva- lence classes for a.s. equality of) random variables X such that E{X?} < oo. We henceforth identify all random variables X,Y in L? that are equal as. and consider them to be representatives of the same random variable. This has the consequence that if E{X?} = 0. we can conclude that X = 0 (and not only X =0a.s.). We can define an inner product in L? as follows: for X,Y in L?, define (X,Y) = B{XY}. Note that |E{XY}| < E{X2}2E{Y?}2 < oc by the Cauchy-Schwarz in- equality. We have seen in Theorem 9.3 that L? is a linear space: if X,Y are both in L?, and a, 3 are constants, then aX + 3Y is in L? as well. We further note that the inner product is linear in each component: For example (aX + BY, Z) =a(X,Z) + BlY,Z). Finally, observe that (X,X) > 0, and (X,X) = 0 if and only if X =0 as. since X = 0 a.s. implies X = 0 by our convention of identifying almost surely equal random variables. This leads us to define a norm for L? as follows: IX) = (XX)? = BOPP, We then have ||X'|| = 0 implies X = 0 (recall that in L?, X = 0 is the same as X = 0a.s.)., and by bilinearity and the Cauchy-Schwarz inequality we get |X +Y|P = B{X?} + 2B{XY} + B(Y?} < XI? + 2X4 IYI + YIP = (A+ Y ID, and thus we obtain Minkowski’s inequality: |X +¥] < |X +1¥ 1, so that our norm satisfies the triangle inequality and is a true norm. We have shown the following: . 190 22. L? and Hilbert Spaces Theorem 22.1. L? is a normed linear space with an inner product (-,:). 1 Moreover one has || «|| = (-.+)?. We next want to show that L? is a complete normed linear space; that is, if X, is a sequence of random variables that is Cauchy under |] - ||, then there exists a limit in L? (recall that X,, is Cauchy if |X» —Xym|| + 0 when both m and n tend to infinity: every convergent sequence is Cauchy). Theorem 22.2 is sometimes known as the Riesz—Fischer Theorem. Theorem 22.2. L? is complete. Proof. Let Xn be a Cauchy sequence in L?. That is, for any ¢ > 0. there exists N such that n,m > N implies ||X, — Xm|] < ©. Choose a sequence of epsilons of the form 3+. Then we have a subsequence (Xp,.),>1 such that Xan — Xnvaall S ge Define 2 Ya = 37 [Xn, — Xnpail- By the triangle inequality we have 2 n Eye} < (= WXnp - Xveal) <1. at Let Y = limp—oo Yn which exists because Y,(w) is a nondecreasing sequence, each w (a.s.). Since E{¥,2} < 1 each n, by the Monotone Convergence Theo- rem (Theorem 9.1(d)) E{Y¥?} < 1 as well. Therefore Y < oo a.s., and hence the sequence Xn, + 37°, (Xnp.: ~ Xn,) converges absolutely a.s. Since it is a telescoping series we conclude X’,, (w) converges toward a limit X(w) as p — oo, and moreover |X (w)| < |Xn,(w)| + ¥Y(w). Since X,, and Y are in L?, so also X € L?. Next, note that m X-X,, = lim Zp = lim, YE (Xnger — Xnq): g=P Since |Zp,| < Y for each p,m, by Lebesgue’s dominated convergence theorem (Theorem 9.1(f)) we have : - . ' . - - 1 IX = Xopll = lim 2%) < tim Y> [Xng —Xagll S$ Sy q=p and we conclude lim,—s ||X ~ Xn,|| = 0. Therefore X,, converges to X in L’, 22, L? and Hilbert Spaces 191 Finally, Xn — X|| <||Xn — Xp!) + ||Xn, — XI). Hence letting n and p go to infinity, we deduce that X,, tends to X in L?. oO Definition 22.1. A Hilbert space H is a complete normed linear space with an inner product satisfying (x,x)? = |la||, allz €H. 
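As a quick numerical sanity check of the inner product and norm just introduced, one can verify the Cauchy-Schwarz and triangle inequalities on simulated data. The Python sketch below is only illustrative: the two random variables are arbitrary choices, and the expectations are approximated by sample averages in the spirit of Chapter 20.

import numpy as np

rng = np.random.default_rng(2023)   # arbitrary seed
n = 10**6                           # Monte Carlo sample size

# Two square-integrable (and dependent) random variables, chosen arbitrarily.
X = rng.standard_normal(n)
Y = X + rng.uniform(-1.0, 1.0, size=n)

inner = (X * Y).mean()                    # approximates (X, Y) = E{XY}
norm = lambda Z: np.sqrt((Z * Z).mean())  # approximates ||Z|| = E{Z^2}^(1/2)

print(abs(inner) <= norm(X) * norm(Y))    # Cauchy-Schwarz inequality: True
print(norm(X + Y) <= norm(X) + norm(Y))   # Minkowski (triangle) inequality: True

Both comparisons print True; indeed both inequalities hold exactly for the empirical (sample) versions of the expectations, whatever the simulated data.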
We now have established: Theorem 22.3. L? is a Hilbert space. Henceforth we will describe results for Hilbert spaces; of course these results apply as well for L?. From now on H will denote a Hilbert space with norm || - || and inner product (-,-), while a an 3 below always denote real numbers. Definition 22.2. Two vectors x and y in H are orthogonal if (x,y) =0. A vector « is orthogonal to a set of vectors I’ if (x,y) = 0 for every y EL. Observe that if (,y) = 0 then || + y||? = ||a||? + |ly||?; this is a Hilbert space version of the Pythagorean theorem. Theorem 22.4 (Continuity of the inner product). [fz, — x andy, > y in H, then (tn, Yn) > (a,y) in R (and thus also |lxn|| — |\zI). Proof. The Cauchy-Schwarz inequality implies (xr, y) < |Jz| |Iy|], hence (a, 9) = (ny Ym)| = |(@ = Gms Yn) + (@ = Fn Y — Yn) + (ne Y = Yn)| S [I — 2n[l[lgynll + le — ell ly = Yall + Mare ly Yall. Note that sup,, ||yn|| < 20 and sup, ||a,|] < 00, since x, and y, are both convergent sequences in H (for example, ||<;,|| < ||a,—<||+||z|] and ||z|] < co and ||, — £|| > 0). Thus the right side of the above inequality tends to 0 as n tends to ov. o Definition 22.3. A subset L of H is called a subspace if it is linear (that is, x,y €L implies ax+ By €L) and if it is closed (that is, if (tn)n>1 converges tox inL, thenx €L). Theorem 22.5. Let I be a set of vectors. Let I+ denote all vectors orthog- onal to all vectors in I’. Then I'+ is a subspace of H. Proof. First note that [+ is a linear space, even if I’ is not. Indeed, if x,y € I+, then (2, z) = 0 and (y,z) = 0, for each z € I’. Therefore (aa + By, 2) = a(x, z) + By, z) = 0, and az + 3y € I+ also. It follows from Theorem 22.4 that I+ is closed. O 192 22. L? and Hilbert Spaces Definition 22.4. For a subspace £ of H, let d(x, £) = inf{||x — yll:y € L} denote the distance from x to L. Note that if £ is a subspace, then x € CL iff d(x,£) = 0 (recall that a linear subspace of a closed space is always closed). Theorem 22.6. Let L be a subspace of H; « €H. There is a unique vector y €L such that || ~yl| = d(2, L). Proof. If « € £, then y = «. If is not in L, let yn € £ such that limy—o. |]e— Yn\| = d(a, £2). We want to show that (yn)n>1 is Cauchy in H. Note first that lly — Youll]? = |e — Yall? + [le — yn ll? — 2(@ — Ys B= Yn)- (22.1) We use the inequality jp — Inti < lit inl 4 le uel to conclude that lim, lo - I < ata. ), hence 4 tim, ze torte =d(x,L), since d(x, £) is an infimum and un tun € L because CL is a subspace. We now have ( = lim {lle — youll? + [lee — yall? + 200 ~ Yn — Yn) } /4 m d(x, Ly? = im, |e _ 00 Yn + Yn 2 and therefore lim_(@ — ym,@— Yn) = d(x, £L)°. (22.2) n,m—oo If we now combine (22.1) and (22.2) we see that (Yn)n>1 is Cauchy. Therefore lim yn = y exists and is in £, since L is closed. Moreover d(x,£) = ||x — y||, by the continuity of the distance function. It remains to show the uniqueness of y. Suppose z were another such vector in £. Then the sequence Wen =Y Want = 4, is again a Cauchy sequence in £ by the previous argument, and hence it converges to a unique limit; whence y = z. o 22. L? and Hilbert Spaces 193 We now consider the important concept of projections. We fix our Hilbert space and our (closed. linear) subspace CL. The projection of a vector x in HH onto £ consists of taking the (unique) y € L which is closest to 2. We let IT denote this projection operator. The next theorem gives useful properties of IT. Theorem 22.7. The projection operator IT of H onto a subspace L satisfies the following three properties: (i) IT is idempotent: IT? 
= IT: (ii) Ha =a force €£; Mx =0 fora elt; (iii) For every x € H, x — Ix is orthogonal to L. Proof. (i) follows immediately from the definition of projection. (ii) If # € CL, then d(x.) = 0, and since x is closest to x (||¢ — z|| = 0). Ix = x. Moreover if x € L+. then |]r—yl|? = (e—y,z—y) = |lel|? + lly? for y € £, and thus y = 0 minimizes d(a,C); hence Ha = 0. (iii) We first note that, for y € £: le — Hall? < je - (Ta + y)IP = lla — Hal? + lly? — 2(@ — Hay). and therefore 2(e — Ha,y) < |\y|?. Since y € £ was arbitrary and since C is linear we can replace y with ay, any a € R,. to obtain 2x — Ix, ay) < |lay||?. and dividing by a gives Q(x — Hx. y) < ally||?: we let a tend to zero to conclude (a —ITa,y) < 0. Analogously we obtain (x — ITx.y) > 0 by considering negative a. Thus « — /7z is orthogonal toL. Qo Corollary 22.1. Let IT be the projection operator of H onto a subspace L. Then x = IIx + (x — IT) is a unique representation of x as the sum of a vector in L and one in L+. Such a representation exists. Moreover x — Ix is the projection of x onto L+: and (£L+)*+ = L. 194 22. L? and Hilbert Spaces Proof. The existence of such a representation is shown in Theorem 22.7(iii). As for uniqueness. let « = y+z be another such representation. Then y—ITz = 2 — (x ~ ITz) is a vector simultaneously in £ and L~: therefore it must be 0 (because it is orthogonal to itself). and we have uniqueness. Next observe that £ Cc (£+)+. Indeed, if z € £L and y € £+ then (x,y) = 0, soz € (£+)+. On the other hand if x € (£+)+, then c= y+z2 with y EL and z € £+. But z must be 0. since otherwise we have (x. z) = (y, z) + (z.2). and (y,z) =Osince y € £ and z € £; and also (x, z) = 0 since z € £* and a € (£+)+. Thus (z,z) = 0, hence z = 0. Therefore ¢ = y, with y € L. hence weLand (LttcL. qd Corollary 22.2. Let IT be the projection operator H onto a subspace L. Then (i) (a. y) = (a, ITy), (ii) HT is a linear operator: (ax + By) = allx + BITy. Proof. (i) By Corollary 22.1 we write uniquely: T= 2, +22, a € Livy EL, y=nty. ye Liy ele. Then (Hx, y) = (ar.y) = (a1, yr + y2) = (1.91). since (z1, yz) = 0. Continuing in reverse for y, and using (x2, y,) = 0: = (a + 22,41) = (2, yn) = (e, Hy). (ii) Again using the unique decomposition of Corollary 22.1, we have: ax + By = (aa, + Gyr) + (ax2 + Fy), hence Maz + By) = ar, + By, = allx + BITy. Qo We end this treatment with a converse that says, in essence, that if an operator behaves like a projection then it is a projection. Theorem 22.8. Let T map H onto a subspace L. Suppose that x — Tx is orthogonal to £ for alla € H. Then T = I, the projection operator onto the subspace L. Proof. We can write « = Tx +(a— Tx), with Tz € £ and («—Tx) € L+. By Corollary 22.1 to Theorem 22.7, T'z must be the projection of x onto L. O Exercises 195 Exercises for Chapter 22 22.1 Using that (a — b)? > 0, prove that (a + b)? < 2a? + 267. 22.2 Let «,y € H. a Hilbert space, with (7.y) = 0. Prove the Pythagorean Theorem: || + y||? = |a||? + |lylP?- 22.3 Show that R” is a Hilbert space with an inner product given by the “dot product” if 7=(ay...., ap) and Y= (y1....,Yn), then (7, ¥) = TL, aiys- 22.4 Let £ be a linear subspace of H and IT projection onto £. Show that ITy is the unique element of £ such that (ITy,z) = (y, 2), for all z € £. 23. Conditional Expectation Let X and Y be two random variables with Y taking values in R with X taking on only countably many values. 
It often arises that we know already the value of X and want to calculate the expected value of Y taking into account the knowledge of X, That is. suppose we know that the event {X = j} for some value j has occurred. The expectation of Y may change given this knowledge. Indeed, if Q(A) = P(A|X = j), it makes more sense to calculate Eg{Y} than it does to calculate Ep{Y} (Ep{-} denotes expectation with respect to the Probability measure R.) Definition 23.1. Let X have values {1}, £2,..., Ln,...$ and Y be a random variable. Then if P(X = x;) > 0 the conditional expectation of Y given {X = 2;} is defined to be EY |X = 23} = EQfY}, where Q is the probability given by Q(A) = P(A|X = a,), provided E{|Y|} < x. Theorem 23.1. In the previous setting, and if further Y is countably valued with values {yy yo, .... Yn,.-} and if P(X = a3) > 0, then ELY|X = 23} = > mw P(Y = ya|X = 23), k=1 provided the series is absolutely convergent. Proof. « EAY|X = 23} = EQ (¥} = > yw QY = yx) = Yo yeP(Y = yulX = 25). xsl k=l Oo Next, still with X having at most a countable number of values, we wish to define the conditional expectation of any real valued r.v. Y given knowledge of the random variable X, rather than given only the event {X = 2;}. To this effect we consider the function 198 23. Conditional Expectation _ f E{Y|X = 05} if P(X = aj) >0 f(z) = {ae arbitrary value if P(X =2,;)=0. (23.1) Definition 23.2. Let X be countably valued and let Y be a real valued ran- dom variable, The conditional expectation of Y given X is defined to be E(YIX} = F(X), where f is given by (23.1) provided f is well defined (that is, Y is integrable with respect to the probability measure Q; defined by by Q;(A) = P(A|X = x), for all j such that P(X = x;) > 0), Remark 23.1. The above definition does not really define E{Y|X} every- where, but only almost everywhere since it is arbitrary on each set {X = x} such that P(X = x) = 0; this will be a distinctive feature of the conditional expectation for more general r.v, X’s as defined below. Example: Let X be a Poisson random variable with parameter A. When X =n, we have that each one of the n outcomes has a probability of success p, independently of the others. Let S denote the total number of successes. Let us find E{S|X} and E{X|S}. We first compute E{S|X = n}. If X =n, then $ is binomial with param- eters n and p, and E{S|X =n} = pn. Thus E{S|X} = pX. To compute E{X|S}, we need to compute E{X|S = k}: to do this we first compute P(X = n|S = k): P(S = h|X =n)P(X =n) P(X =njS=k)= P=) _ ") pk(L =p)" = Lime (PL pyr-® (Gar) > - moe eT mP)A for n > k. Thus, B(XIS =H} = Ya ~O-PIA = k+(1—p)d, n>k hence, E{X|S}=S+(1~p)d Finally, one can check directly that E{S} = E{E{S|X}}: also this follows from Theorem 23.3 below. Therefore, we also have that E{S} = pE{X} = pr. 23. Conditional Expectation 199 Next we wish to consider the general case: that is, we wish to treat E{Y|X} where X is no longer assumed to take only countably many val- ues. The preceding approach does not work, because the events {X = z} in general have probability zero. Nevertheless we found in the countable case that E{Y|X} = f(X) for a function f, and it is this idea that extends to the general case. with the aid of the next theorem. Let us recall a definition already given in Chapter 10: Definition 23.3. Let X:(2,A) > (R",B”) be measurable. The o-algebra generated by X is o(X) = X~}(B") (it is a o-algebra: see the proof of Theorem 8.1), which is also given by o(X) = {AC 2: X~\(B) =A, for some B € B"}. Theorem 23.2. 
Let X be an R” valued random variable and let Y be an R-valued random variable. Y is measurable with respect to o(X) if and only if there exists a Borel measurable function f on R” such that Y = f(X). Proof. Suppose such a function f exists. Let B € B. Then Y~'(B) = X~*(f-1(B)). But A = f-*(B) € B”, whence X~*(A) € o(X) (alterna- tively, see Theorem 8.2). Next suppose Y~!(B) € o(X), for each B € B. Suppose first Y = Yoh ala, for some k < oc, with the a;’s all distinct and the A,’s pair- wise disjoint. Then A; € o(X), hence there exists B; € B” such that A; = X71(B)). Let f(x) = T, ailp,(2), and we have Y = f(X), with f Borel measurable: so the result is proved for every simple r.v. Y which is o(X)-measurable. If Y is next assumed only positive, it can be written Y = limn—oo Yn, where Y,, are simple and non-decreasing in n. (See for ex- ample such a construction in Chapter 9.) Each Y,, is o(X) measurable and also Yn = fn(X) as we have just seen. Set f(x) = limsup, x, fn(). Then y lim Yn = lim f,(X). n° But (lim sup fn)(X) = lim sup(f,(X)). n90 n and since lim sup, 45 fn(«) is Borel measurable, we are done. For general Y, we can write Y = Y* — Y~, and we are reduced to the preceding case. Qo In what follows, let (92,A, P) be a fixed and given probability space, and let X : 2 R”. The space £?(2,.A, P) is the space of all random variables ¥ such that E{¥?} < oo. If we identify all random variables that are equal a.s., we get the space L?((2..A, P). We can define an inner product (or “scalar product”) by (Y,Z) = E{Y Z}. 200 23. Conditional Expectation Then L?(2..A,P) is a Hilbert space, as we saw in Chapter 22. Since o(X) C A, the set L?(2.0(X), P) is also a Hilbert space, and it is a (closed) Hilbert subspace of L*(,A,P). (Note that L?(2,o(X),P) has the same inner product as does L?(,.A, P).) Definition 23.4. Let Y € L?(Q, A, P). Then the conditional expectation of Y given X is the unique element Y in L?(2,0(X), P) such that E{YZ} = E{YZ} for all Z € L?(2,0(X), P). (23.2) We write EAY |X} for the conditional expectation of Y given X, namely Y. Note that Y is simply the Hilbert space projection of Y on the closed lin- ear subspace L?(2,0(X), P) of L?(2,A, P): this is a consequence of Corol- lary 22.1 (or Exercise 23.4), and thus the conditional expectation does exist. Observe that since E{Y|X} is o(X) measurable, by Theorem 23.2 there exists a Borel measurable f such that E{Y|X} = f(X). Therefore (23.2) is equivalent to ELF (X)g(X)} = ELV o(X)} (23.3) for each Borel g such that g(X) € £2. Next let us replace o(X) with simply a o-algebra G with G € A. Then L?(2,G, P) is a sub-Hilbert space of L?((2,.A, P), and we can make an anal- ogous definition: Definition 23.5. Let Y € L?(2,A,P) and let G be a sub o-algebra of A. Then the conditional expectation of Y given G is the unique element E{Y |G} of L?(2,G,P) such that E{YZ} = E{E{Y|g}2Z} (23.4) for all Z € L?(2,G.P). Important Note: The conditional expectation is an element of L?, that is an “equivalence class” of random variables. Thus any statement like E{Y|G} > Oor E{Y|G} = Z, etc... should be understood with an implicit “al- most surely” qualifier, or equivalently as such: there is a “version” of E{Y|G} that is positive, or equal to Z, etc... Theorem 23.3. Let Y € L?(2,A,P) andG be a sub o-algebra of A. a) If Y > 0 then E{Y|G} > 0; b) IfG = 0(X) for some random variable X , there exists a Borel measurable function f such that E{Y|G} = f(X); c) E{E{Y|9}} = E{Y}; d) The map ¥Y > E{Y|G} is linear. 23. Conditional Expectation 201 Proof. 
Property (b) we proved immediately preceding the theorem. For (c) we need only to apply (23.4) with Z = 1. Property (d) follows from (23.4) as well: if U,V are in L?. then E{(U + aV)Z} = E{UZ} + aE {VZ} = E{E{U|G}Z} + aE{E{V|G}Z} = EX(E{U|G} + aE {V|G})2}, and thus E{U + aV|G} = E{U|G} + aE{V|G} by uniqueness (alternatively, as said before, E{Y|G} is the projection of Y on the subspace L(2,9, P), and projections have been shown to be linear in Corollary 22.2). Finally for (a) we again use (23.4) and take Z to be 1, z¢yjg}<0}, assuming Y >0as. Then E{YZ} > 0 since both Y and Z are nonnegative, but EL ELY|G}2} = P{E{Y |G} 1 popvigy 0. This violates (23.3), so we conclude P({E{Y|G} < 0}) = 0. o Remark 23.2. As one can see from Theorem 23.3, the key property of conditional expectation is the property (23.4); our only use of Hilbert space projection was to show that the conditional expectation exists, We now wish to extend the conditional expectation of Definition 23.4 to random variables in LZ’, not just random variables in L?. Here the technique of Hilbert space projection is no longer available to us. Once again let £*(2,.A, P) be the space of all Z* random variables; we identify all random variables that are equal a.s. and we get the (Banach) space L*(92,A, P). Analogously, let L+({2,A, P) be all nonnegative random variables, again identifying all a.s. equal random variables. We allow random variables to assume the value +00. Lemma 23.1. Let Y € L*(92,A,P) and let G be a sub o-algebra of A. There exists a unique element E{Y|G} of L*+(92,G,P) such that E{YX} = E{E{Y|G}X} (23.5) for all X in L*(2,G,P) and this conditional expectation agrees with the one in Definition 23.5 if further Y € L?(2,A,P). Moreover, if 0< Y < Y’, then E{Y 9} < E(Y'IG}. (236) Proof. If Y is in L?(2,A,P) and positive, we define E{Y|G} as in Defini- tion 23.5. If X in L+(2,G,P) then X,, = X An is square-integrable. Hence the Monotone Convergence Theorem (applied twice) and (23.5) yield E{YX} = lim E{YX,} n = lim E{E{Y|G}Xp} n = E(E(Y|G}X} (23.7) and (23.5) holds for all positive X. 202 23, Conditional Expectation Let now Y be in L+(92. A. P). Each Yn = Y Am is bounded and hence in L?. and by Theorem 23.3, conditional expectation on L? is a positive operator. so E{Y A m|G} is increasing: therefore the following limit exists and we can set E(Y|G}= lim E{¥n/G}.- (23.8) If X € L*+(2.G, P), we apply the Monotone Convergence Theorem several times as well as (23.8)to deduce that: E{YX} = lim B{¥,.X} m =E {lim EY mIG}X} = E{E{y|G}X}. Furthermore if Y < Y’ we have YAm < Y'Am for allm, hence E{Y Am|G} < E{Y' A m|G} as well by Theorem 23,.3(a). Therefore (23.6) holds. It remains to establish the uniqueness of E{Y |G} as defined above. Let U and V be two versions of E{Y|G} and let A, = {U < V < n} and suppose P(A,) > 0. Note that An € G. We then have E{Y14,} = B{U1,,} = B{V1a,}. since E{Y 14} = E{E{Y|G}1,4} for all A €G by (23.7). Further, 0 < U1,, < Via, 0 implies that the rv. V14, and U1y,,, are not a.s. equal: we deduce that E{U1,4} < E{V1i,}. whence a contradiction. Therefore P(An) = 0 for all n, and since {U > V} = Un>1 An we get P{U < V}) = 0; analogously P({V > U}) = 0. and we have uniqueness. a Theorem 23.4. Let Y € L}(2,A,P) and let G be a sub o-algebra of A. There exists a unique element E{Y|G} of L*(2,G,P) such that E{YX} = E{E{Y|G}X} (23.9) for all bounded G-measurable X and this conditional expectation agrees with the one in Definition 23.5 (resp. Lemma 23.1) when further Y € L?(Q..A, P) (resp. 
Y > 0), and satisfies a) If Y >0 then E{Y|G} > 0; b) The map Y — E{Y|G} is linear. Proof. Since Y is in L}, we can write y=yt-y7 where Y* = max(Y,0) and Y~ =— min(Y,0): moreover Y* and Y~ are also in L'(2,G,P). Next set E{Y|G} = E{Y*|G} — E{y~ |g}. 23. Conditional Expectation 203 This formula makes sense: indeed the r.v. Y+ and Y~, hence E{Y*+|G} and E{Y~|G} as well by Theorem 23.3(c), are integrable, hence a.s. finite. That E{Y |G} satisfies (23.9) follows from Lemma 23.1. For uniqueness, let U.V be two versions of E{Y |G}, and let A= {U < V}. Then A € G, so 1, is bounded and G-measurable. Then E{¥14} = E{E{Y|G}14} = E{U14} = E{V 14}. But if P(A) > 0, then E{U1,} < E{V1,4}. which is a contradiction. So P(A) = 0 and analogously P({V < U}) = 0 as well. The final statements are trivial consequences of the previous definition of E{Y|G} and of Lemma 23.1 and Theorem 23.3. o Example: Let (X, Z) be real-valued random variables having a joint density f(a, 2). Let g be a bounded function and let Y =4(Z). We wish to compute E{Y|X} = E{g(Z)|X}. Recall that X has density fy given by fx(a)= f f2.2)de and we defined in Chapter 12 (see Theorem 12.2) a conditional density for Z given X = 2 by: f(@2) fra) = Fan whenever f(z) 4 0. Next consider na) = f a(e)fxae(ede. We then have, for any bounded Borel function A(.x): E(M XIX) = f MeayA(eVfx(ayde = ff aerteon(e)de ho) fx(v)ae = f(a,2) ) . -/ 92) Fag Medd (ade de = ff may s(e.2)ae ar = Ejg(Z)k(X)} = E{YR(X)}. Therefore by (23.9) we have that BLY|X} = h(X). This gives us an explicit way to calculate conditional expectations in the case when we have densities. 204 23. Conditional Expectation Theorem 23.5. Let Y be a positive or integrable r.v. on (2, FP). Let G be a sub o-algebra. Then E{Y|G} = Y if and only if Y is G-measurable. Proof. This is trivial from the definition of conditional expectation. oO Theorem 23.6. Let Y € L'(2,A,P) and suppose X and Y are indepen- dent. Then BLY|X} = E(Y}. Proof. Let g be bounded Borel. Then E{Yg(X)} = E{Y}E{g(X)} by in- dependence. Thus taking f(x) = E{Y} for all « (the constant function) in Theorem 23.2, we have the result by (23.9). Oo Theorem 23.7. Let X,Y be random variables on (92,A,P), let G be a sub a-algebra of A, and suppose that X is G-measurable. In the two following cases: a) the variables X, ¥ and XY are integrable, b) the variables X and Y are positive, we have E{XY|G} = XE{y|g}. Proof. Assume first (b). For any G-measurable positive r.v. Z we have E{XYZ} = E{XZE{Y|9}} by (23.5). Since XE{Y|G} is also G-measurable, we deduce the result by another application of the characterization (23.5). In case (a), we observe that X*Y*, X~Y+, XTY~ and X~Y~ are all integrable and positive. Then E{X*+Y*|G} = X+E{Y+|G} by what pre- cedes, and similarly for the other three products, and all these quantities are finite. It remains to apply the linearity of the conditional expectation and the property XY = X*+Y*+ + X-Y~ —X+Y~ -X-Y*>. Qo Let us note the important observation that the principal convergence theorems also hold for conditional expectations (we choose to emphasize be- low the fact that all statements about conditional expectations are “almost sure”): Theorem 23.8. Let (Yn)n>1 be a sequence of r.v.’s on (2,.A,P) and let G be a sub o-algebra of A. a) (Monotone Convergence.) If Y, > 0, n> 1, and Y,, increases to Y a.s., then lim E{Y,|G} = E{Y|G} as. n—00 23. Conditional Expectation 205 b) (Fatou’s Lemma.) If Yn > 0. n> 1, then Elim inf ¥,|G} 1) for some Z € L(2, A. P), then lim E{¥alG} = EAI} as. Proof. 
a) By (23.6) we have B{¥n+1|G} > E{Yn|G} a.s., each n; hence U = limy—ce E{¥nlG} exists as. Then for all positive and G-measurable rv. X we have: E{UX} = lim E{E{Yn|G}X} = Jim E{¥nX} by (23.5): and = lim E{YX} noo by the usual monotone convergence theorem. Thus U = E{Y|G}, again by (23.5). The proofs of (b) and (c) are analogous in a similar vein to the proofs of Fatou’s lemma and the Dominated Convergence Theorem without condition- ing. oO We end with three useful inequalities. Theorem 23.9 (Jensen’s Inequality). Let y:R — R be convex, and let X and ¢(X) be integrable random variables. For any o-algebra G, po B{X|G} < Efe(X)|G}. Proof. A result in real analysis is that if ¢ : R — R is convex, then y(2) = sup, (@n@ + by) for a countable collection of real numbers (dn, bn). Then EfanX + bn|G} = anE{X|G} + bn. But E{anX + bnlG} < E{y(X)|G}. hence an E{X|G} + bn < Efie(X)|G}, al n. Taking the supremum in n, we get the result. Note that y() = 2? is of course convex, and thus as a consequence of Jensen's inequality we have (E{X|9})? < E{X*|9}. An important consequence of Jensen’s inequality is Hélder’s inequality for random variables. 206 23. Conditional Expectation Theorem 23.10 (Hélder’s Inequality). Let X.Y be random variables with E{|X|P} < oc, B{Y|*} < x. where p > 1, and 5 + > = 1. Then |E{XY}] < E{IXY|} < BUX }PE(IX|)} 4. (Hence if X € L? and Y € L? with p,q as above, then the product XY belongs to L*). Proof. Without loss of generality we can assume X > 0, Y > Oand E{X?} > 0, since E{X?} = 0 implies XP = Oa.s., thus X = Oa.s. and there is nothing to prove. Let C = E{X?} < oo. Define a new probability measure Q by 1 A) = SE{1aX"}. QA) = GEMAX?} Next define Z = 34+1,x 0}. Since v(x) = |x|? is convex, Jensen's inequality (Theorem 23.9) yields (Eq{Z})! < E{Z%}. Thus. ~ C4 Xp-) y )\t - (H{s25}) y \¢ =to{(xi) } 1 y \fy -26((s)") 1 1 = hefrrg ger}. and q= =2; while (p— 1)q = p, hence 1 1 a G__ NP ze fy oN } 1 = —E{y¢ oe}. Lega 1 Y : aeEIXY}! = awe {ge} Thus E{XY}2 < CT E{Y}, and taking q’” roots yields E{XY}< Cr Ey 8, Since aoa = ; and C = E{X?}. we have the result. ao 23. Conditional Expectation 207 Corollary 23.1 (Minkowski’s Inequality). Let X.Y be random variables and1< p< x with E{|X|?} < 2% and E{|Y|?} < oc. Then E{\X +Y|P}> < E{X?}> + B{V?}o. Proof. If p = 1 the result is trivial. We therefore asume that p > 1. We use Hélder’s inequality (Theorem 23.10). We have E{\X+Y/P}= E {|X| |X + yp} +E {lY| |X + yy} < EUXPP EX + [POs + BYP} ELLX + YO} , hence But (p— 1)q =p, and i =1- 1 P = (EUXP} + EUV IP}*) B{IX + ¥ [PP and we have the result. oO Minkowski’s inequality allows one to define a norm (satisfying a triangle inequality) on the space L? of equivalence classes (for the relation “equality a.s.”) of random variables with E{|X|?} < oc. Definition 23.6. For X in L?, define a norm by 4 IXllp = E{X? i}. Note that Minkowski’s inequality shows that L? is a bonafide normed lin- ear space. In fact it is even a complete normed linear space (called a “Banach space”). But for p # 2 it is not a Hilbert space: the norm is not associated with an inner product. 208 23. Conditional Expectation Exercises for Chapter 23 For Exercises 23.1-23.6, let Y be a positive or integrable random variable on the space (.2,.4, P) and G be a sub o-algebra of A. 23.1 Show |E{Y|9}| < E{|Y||9}- 23.2 Suppose H C G where H is a sub o-algebra of G. Show that E{E{Y|G}|H} = E{Y|H}. 23.3 Show that E{Y|Y} = Y as. 23.4 Show that if |Y| t}) =e7* for t > 0. Calculate E{Y | Y At}, where Y At = min(t, Y). 
23.10 (Chebyshev's inequality). Prove that for X € L? and a > 0, P(|X| > alg) < C2. 23.11 (Cauchy-Schwarz). For X,Y in L? show (E{XY|G})? < E{X*|G}E{Y?|G}. 23.12 Let X € L?. Show that E{(X — E{X|G})*} < B{(X — B{X})}. 23.13 Let p >1 andr > p. Show that L? D L’, for expectation with respect to a probability measure. Exercises 209 23.14* Let Z be defined on (92.F, P) with Z > 0 and E{Z} = 1. Define a new probability Q by Q(A) = E{14Z}. Let G be a sub o-algebra of F. and let U = E{Z|G}. Show that Eg{X|g} = 2%2181, for any bounded F-measurable random variable X. (Here Eg{X|G} denotes the conditional expectation of X relative to the probability measure Q.) 23.15 Show that the normed linear space L? is complete for each p, 1 < p < oc. (Hint: See the proof of Theorem 22.2.) 23.16 Let X € L1(92,F.P) and let G,H be sub o-algebras of F. Moreover let H be independent of o(o(X),@). Show that E{X|o(G, H)} = E{X|G}. 23.17 Let (X;)no1 be independent and in L! and let S, = 0%, X; and Gn = O(Sn,Sn4i,--.)- Show that E{X,|Gn} = E{Xi | Sn} and also E{X;|Gn} = E{X, | Sn} for 1 0, having the property that F, C Fnai CF, alln > 0. Definition 24.1. A sequence of random variables (Xp)n>0 is called a mar- tingale if () E{\Xn|} < oc, each n; Xn is Fr measurable, each nj (iii) E{Xp|Fin} = Xm a.s., each m1 be independent with E{|X,|} < oo and E{X,} = 0, all n. For n> 1 let F, = o{X,;k < n} and S, = op_, Xp. For n=O let Fo = {6,2} be the “trivial” o-algebra and Sp = 0. Then (S,)n>0 is an (Fn)n>o martingale, since 212 24. Martingales E{Sn|Fin} = E{ Sm + (Sn — Sm)|Fin} = Sm + E{Sn — Sm|Fin} n = Sin +e{ > alr} k=m41 n =Sm+ > E{Xe} k=m4+1 =Sn- When the variables X;, have p = E{X,} 4 0, then using X;, — pz instead of X,, above we obtain similarly that (Sp — ny)n>0 is an (Fn)n>o martingale. Example 24.2, Let Y be measurable with E{|Y|} < oc and define Xn = E{Y|Fn}- Then E{|X,|} < E{|Y|} < oc and for m < n, E{Xn| Fn} = E{E{Y Fa }lFin} = FAY |Fin} = Xn (see Exercises 23.1 and 23.2). Definition 24.2. A martingale X = (Xy)n>0 ts said to be closed by a ran- dom variable Y if E{|Y|} < 00 and X, = E{Y|Fy}, each n. Example 24.2 shows that any rv. Y € F with E{|Y|} < oo gives an example of a closed martingale by taking X, = E{Y|F,}, n > 0. An important property of martingales is that a martingale has constant expectation: Theorem 24.1. If (Xn)n>0 is a martingale, then n > E{X,} is constant. That is, E{X,} = E{X}, all n > 0. Proof. E{Xn} = E{E{Xn|Fo}} = E{Xo}- Qo The converse of Theorem 24.1 is not true, but there is a partial converse using stopping times (see Theorem 24.7). Definition 24.3. A random variable T: 2 = N = NU {+00} is called a stopping time if {T o{n:X, > 12} if X, >12 for some n€N ~ to otherwise. That is. P(e) = inf (rs Xq(w) 2 12} if X,,(w) > 12 for some integer n, and T(w) = +00 if not. Note that the event {w:T(w) < n} can be expressed as: n {Psn}=U{Xe2> Whe Fa k=0 because {X;, > 12} € Fi, C Fy if k < n. The term “stopping time” comes from gambling: a gambler can decide to stop playing at a random time (de- pending for example on previous gains or losses), but when he or she actually decides to stop, his or her decision is based upon the knowledge of what hap- pened before and at that time, and obviously not on future outcomes: the reader can check that this corresponds to Definition 24.3. Theorem 24.1 extends to bounded stopping times (a stopping time is T is bounded if there exists a constant c such that P{T < c} = 1). If T is a finite stopping time, we denote by Xr the r.v. 
Xp(w) = X7(.)(w); that is, it takes the value X, whenever T = n. Theorem 24.2. Let T be a stopping time bounded by c and let (Xn)n>o be a martingale. Then E{X7} = E{Xo}. Proof. We have X7p(w) = 7%) Xn(w)1 ¢rqwy=n}- Therefore, assuming with- out loss of generality that c is itself an integer, E{Xr} = £{¥ stiren} n=0 = P{S Xtiren} n=0 e =O E{Xnl eran) n=0 Since {T = n} = {T < n}\{T < n—1} we see {T =n} € Fy, and we obtain =O E{E{X Fa} preny} n=0 214 24. Martingales = YO E{Xelprany} n=0 =E po» tir} n=0 = E{X,} = E{Xo}. with the last equality by Theorem 24.1. Oo The o-algebra F,, can be thought of as representing observable events up to and including time n. We wish to create an analogous notion of observable events up to a stopping time 7. Definition 24.4. Let T be a stopping time. The stopping time o-algebra Fr is defined to be Fr ={AE€F: AN{T 1 are in Fr. then ES 2° (U 4) MT o be @ martingale and let S,T be stopping times bounded by a constant c, with S < T a.s. Then E{Xr|Fs} = Xs as. Proof. First |X| < Y2%_9|Xnl is integrable (without loss of generality we can assume again that c is an integer), as well as Xs, and further Xg is F5- measurable by the previous theorem. So it remains to prove that E{X7Z} = E{XsZ} for every bounded Fs-measurable r.v. Z. By a standard argument it is even enough to prove that if A € Fs then E{Xrla} = E{Xsla} (if this holds, then E{ XZ} = E{XgZ} holds for simple Z by linearity, then for all F5-measurable and bounded Z by Lebesgue’s Dominated Convergence Theorem). So let A € Fs. Define a new random time R by R(w) = S(w)La(w) + T(w)Lac(w)- Then R is a stopping time also: indeed, {Ren} =AN{S 0 be a sequence of random variables with Xp, being Fy measurable, each n. Suppose E{\|X,|} < oo for each n, and E{Xr} = E{Xo} for all bounded stopping times T. Then X is a martin- gale. Proof. Let 0 << m o, With Fm C Fp form < n. 24.1 If T =n. show that Fr = Fy. 24.2 Show that $A T' = min(S,T) is a stopping time. 24.3 Show that S$ VT = max(S,7’) is a stopping time. 24.4 Show that 9 +7 is a stopping time. 24.5 Show that aT is a stopping time for a > 1, @ integer. 24.6 Show that Faar C Fr C Fev. 24.7 Show that T is a stopping time if and only if {7’ = n} € F,, each n > 0. 24.8 Let A € Fp and define nana {% tos Show that 7, is another stopping time. 24.9 Show that T' is Fp—measurable. 24.10 Show that {S < T}, {S < T}, and {S = T} are all in Fg Fr. 24.11* Show that E{ELY |Fr}| Fs} = E{E{Y|Fs}\Fr} = E{Y|Fsar}- 24.12 Let M = (Mn)n>o be a martingale with M,, € L?, each n. Let S,T be bounded stopping times with S < T. Show that Ms, Mr, are both in L?, and show that E{(Mr — Ms)?|Fs} = E{M# — M3|Fs}, and that E{(Mr — Ms)?} = B{M?} — E{M3}. 24,13 Let y be convex and let M = (Mn)n>o be a martingale. Show that. n — E{(M,)} is a nondecreasing function. (Hint: Use Jensen’s inequality [Theorem 23.9].) 24.14 Let X,, be a sequence of random variables with E{X, | Fri} = 0 and X,, F,-measurable, each n. Let S, = Vreo Xz. Show that (Sp)n>o is a martingale for (Fp )n>0- 25. Supermartingales and Submartingales In Chapter 24 we defined a martingale via an equality for certain conditional expectations. If we replace that equality with an inequality we obtain super- martingales and submartingales. Once again (2, F, P) is a probability space that is assumed given and fixed, and (F,)n>1 is an increasing sequence of o-algebras. Definition 25.1. 
A sequence of random variables (Xn)n>o is called a sub- martingale (respectively a supermartingale) if (i) E{|Xp|} < cc, each n; (ii) Xp is Fy-measurable, each n; (iii) E{X,|Fin} = Xm as. (resp. < Xm a8.) eachm o is a martingale if and only if it is a submartingale and a supermartingale. Theorem 25.1. If (Mn,)n>o is a martingale, and if y is conver and y( Mn) is integrable for each n, then (~(Mn))n>0 is a submartingale. Proof. Let m 0 is a martingale then Xp = |M,|, n > 0, is a submartingale. Proof. p(x) = |x| is a convex, so apply Theorem 25.1. o Theorem 25.2. Let T be a stopping time bounded by C € N and let (Xn)n>0 be a submartingale. Then E{X7} < E{Xc}-. Proof. The proof is analogous to the proof of Theorem 24.2, so we omit it. Oo The next theorem shows a connection between submartingales and mar- tingales. 220 25. Supermartingales and Submartingales Theorem 25.3 (Doob Decomposition). Let X = (Xn)n>0 be a sub- martingale. There eaists a martingale M = (Mn)n>o and a process A = (An)nzo with Angi > An as. and Ani, being F,-measurable, each n > 0, such that Xn = Xp +Mn+An, with Mg = Ao = 0. Moreover such a decomposition is a.s. unique. Proof. Define Ag = 0 and 7 An = SO E{X, —Xp-1|Fe-} forn> 1. k= Since X is a submartingale we have E{X, — X;,—1|Fxe—1} > 0 each k, hence Agi: = Ag as., and also Ay; being Fy-measurable. Note also that E{Xn | Fri} — Xn-1 = B{Xn — Xn-1 | Fr-i} = An - Anas and hence E{Xn | Fr-1} ~ An = Xn-1 ~ An-13 but An € Fr), so E{Xn — An|Fn-1} = Xn-1 — An-1- (25.1) Letting M,, = X;, — Ay we have from (25.1) that M is a martingale and we have the existence of the decomposition. As for uniqueness, suppose Xn =Xot+Mnt+An, 020, Xn =XotIn+Cn, n20, are two such decompositions. Subtracting one from the other gives Ln — Mn = An - Cn. (25.2) Since An, Cy, are F,-, measurable, L, — M,, is F,~; measurable as well; therefore Ly — My = E{Ln ~ Mn\Fa-i} = Dut ~ Muy =An-1—Cn-a as. Continuing inductively we see that L, ~ M, = Lo — My = 0 as. since Lo = Mo = 0. We conclude that L, = M,, a.s., whence A, = C, as. and we have uniqueness. o Corollary 25.2. Let X = (Xn)nzo0 be a supermartingale. There exists a unique decomposition X,=Xo4+My—An, n>0 with Mo = Ao = 0, (Mn)n>0 @ martingale, and Ay being F,—,-measurable with Ay > Aj, as. 25. Supermartingales and Submartingales 221 Proof. Let Yn = —Xn. Then (¥n)n>0 is a submartingale. Let the Doob de- composition be Yn = Yo+In+Ch. and then X, = Xo — Ly, — Cy; set M, = —L, and Ay, = Cy, n > 0. oO 222 25. Supermartingales and Submartingales Exercises for Chapter 25 25.1 Show that X = (Xn)n>o is a submartingale if and only if Y, = —Xn, n > 0, is a supermartingale. 25.2 Show that if X = (X,)n>o0 is both a submartingale and a supermartin- gale, then X is a martingale. 25.3 Let X =(Xn)n>o be a submartingale with Doob decomposition X, = Xo +My, + An. Show that E{A,} < oo, each n < 0. 25.4 Let M =(Mp)no0 be aimartingale with My = 0 and suppose E{M?2} < oo. each n. Show that X,, = M?. n > 0. is a submartingale, and let X, = Ly, +An be its Doob decomposition. Show that E{M?2} = E{A,}. 25.5 Let M and A be as in Exercise 25.4. Show that A, — An_, = E{(Mn— Mn-1)?\Fn—-1}+ 25.6 Let X = (Xy)no0 be a submartingale. Show that if y is convex and nondecreasing on R and if y(X») is integrable for each n, then Y,, = (Xn) is also a submartingale. 25.7 Let X =(Xn)nz>o be an increasing sequence of integrable r.v.. each Xp being F,,-measurable. Show that X is a submartingale. 26. 
Martingale Inequalities One of the reasons martingales have become central to probability theory is that their structure gives rise to some powerful inequalities. Our presentation follows Bass [1]. Once again (92, F, P) is a probability space that is assumed given and fixed. and (Fp )n>0 is an increasing sequence of o-algebras. Let M = (Mn)n>0 be a sequence of integrable r.v.’s. each M, being F,-measurable. and let My = sup; @) = E{cuz>0}} S a In the martingale case we can replace MW, with only |M],,| on the right side. Theorem 26.1 (Doob’s First Martingale Inequality). Lei M = (Mp )nzo be a martingale or a positive submartingale. Then P(M; > a) < EUMal}, Proof. Let T = min{j : |M;| > a} (recall our convention that the minimum of an empty subset of N is +00). Since g(x) = |2| is convex and increasing on R,, we have that |M,| is a submartingale (by Theorem 25.1 if M is a martingale, or by Exercise 24.6 if M is a positive submartingale). The set {T a} and {My > a} are equal, hence P(M* >a) =P(T 0)a) < GEtiMron!Ler 0 be a random variable, p > 0, and E{X?} < x. Then 00 E{X?} | pr P(X > d)dd. 0 Proof. We have oc oc | pr?“ P(X > Add = PP B{lixsay}dd, 0 0 and by Fubini’s Theorem (see Exercise 10.15) oo x = e{[ PAP exsayaa} = eff peal = E{X?}. 0 0 Theorem 26.2 (Doob’s L? Martingale Inequalities). Let M = (Mn)n>0 be a martingale or a positive submartingale. let 1 < p < 00. There exists a constant c depending only on p such that E{(My)?} < cB{|Mn|?}- oO Proof. We give the proof in the martingale case. Since y(x) = |x| is convex we have |M,,| is a submartingale as in Theorem 26.1. Let X, = Mnlija,|>3)- For n fixed define Z,=E(XalF} OSA Sn Note that Z;, 0 < j < nis a martingale. Note further that My < Z} + 9, since \M;| = |E{M, | F;}| = E{Mn lM iani>g) + Mal iatnisg)F} = |E{Xp + Mnlimyicgy | Fi} SERIA +5 a = (Z| + z By Doob’s First Inequality (Theorem 26.1) we have P(Mt>a)< P(z > 5) 2 2 - < SB (\Znl} = ZE(Xal} 2 | = CEA Malltiaaingy}- 26. Martingale Inequalities 225 By Lemma 26.1 we have 2 Bag} = [pws Paty > yan oe <[ 2p? E{|Mn lV ejang yay }A , } and using Fubini’s theorem (see Exercise 10.15): 2|Mn| =E {f naan fo = 2p Pp os 7 Ei Mni?}- ao Note that we showed in the proof of Theorem 26.2 that the constant c< 2, With more work one can show that ¢? = 52;- Thus Theorem 26.2 could be restated as: Theorem 26.3 (Doob’s L? Martingale Inequalities). Let M = (Mn)nz0, be a martingale or a positive submartingale. Let 1 < p< oc. Then E((Mz?}* <2. EtiMaP}}, p-l or in the notation of L? norms: (Melly < allMnlp- Our last inequality of this section is used to prove the Martingale Conver- gence Theorem of Chapter 27. We introduce Doob’s notion of uperossings. Let (Xn)n>o be a submartingale, and let a < b. The number of upcrossings of an interval [a.b] is the number of times a process crosses from below a to above b at a later time. We can express this idea nicely using stopping times. Define Ty = 0, and inductively for j > 0: Sjyy =minfk > Tj): X_ Sa}, Tjgy =min{k > Sy41: X_ > b}, (26.1) with the usual convention that the minimum of the empty set is +00; with the dual convention that the maximum of the empty set is 0, we can then define U, = max{j : T; < n} (26.2) and U,, is the number of upcrossings of [a,b] before time n. 226 26. Martingale Inequalities Theorem 26.4 (Doob’s Upcrossing Inequality). Let (Xj)n>0 be @ sub- martingale, let a . 
we obtain: ® n Yn = Yaran + > (Yran = Ysuan) +90 (¥Suaian—Yran)- (26.3) i=1 i=) Each upcrossing of (X;,) between times 0 and n corresponds to an integer i such that S; < T; < n, with Ys, = 0 and Yr, = Yran > b—a, while Yr,an — Ys,an = 0 by construction for all i. Hence n Onan = Ya,an) > (b= @)Un- ia By virtue of (26.3) we get a (b= a)Un < Ym —Yoian = Yo (Yeician = Yrvan)s i and since Ys,an > 0, we obtain n (b= @)Un < Yn = 7 (¥5..:4n > Yvan) + i= Take expectations on both sides: since (Y,,) is a submartingale and the stop- ping times T; An and S,, An are bounded (by n) and T; An < Siz; An, we have E{Y¥s,.,an — Yruan} > 0 and thus (b= a)E{Un} < E{¥n}. Exercises 227 Exercises for Chapter 26 26.1 Let Y, € L? and suppose limp. E(Y2) = 0. Let (F,)e>0 be an increasing sequence of o-algebras and let Xf = E{Y,|Fi.}. Show that limp oo B{sup,(XP)?} = 0. 26.2 Let X,Y be nonnegative and satisfy aP(X >a)< E{Y1,x30}}- for all a > 0. Show that B{X?} < E{qX?"Y}, where L+}=1;p>1. 26.3 Let X.Y be as in Exercise 26.2 and suppose that ||X|j, < oo and IY |p < 00. Show that ||X||, < q|[Y ||). (Hint: Use Exercise 26.2 and Hilder’s inequality.) 26.4 Establish Exercise 26.3 without the assumption that || X'||, < o°. 26.5 * Use Exercise 26.3 to prove Theorem 26.3. sisi Gy ASSES he SRBVEEIT AF 27. Martingale Convergence Theorems In Chapter 17 we studied convergence theorems, but they were all of the type that one form of convergence, plus perhaps an extra condition, implies another type of convergence. What is unusual about martingale convergence theorems is that no type of convergence is assumed — only a certain structure = yet convergence is concluded. This makes martingale convergence theorems special in analysis; the only similar situation arises in ergodic theory. Theorem 27.1 (Martingale Convergence Theorem). Let (Xp)n>1 be a submartingale such that sup, E{Xx} < oo. Then limp Xn = X exists a.s. (and is finite a.s.). Moreover, X is in L’. |Warning: we do not assert here that X,, converges to X in L’; this is not true in general.) Proof. Let U;, be the number of upcrossings of (a, b] before time n, as defined in (26.2). Then U,, is non-decreasing hence U(a, b) = limp. U, exists. By the Monotone Convergence Theorem E{U(a,d)} = im E{Un} IA + sup £{(Xn—a)*} IA 7 (sup 2(x3} + lal) < rm < 00 for some constant ¢; ¢ < 00 by our hypotheses, and the first inequality above comes from Theorem 26.4 and the second one from (x — a)* < xt + |a| for all reals a, x. Since E{U(a,b)} < 00, we have P{U(a,b) < oo} = 1. Then X, upcrosses [a,b] only finitely often a.s., and if we let Ag» = {limsup Xp > b: Jim inf Xn liminf X,}, n n 230 27. Martingale Convergence Theorems and we conclude limp. Xp exists a.s. It is still possible that the limit is infinite however. Since X,, is a sub- martingale, E{X,,} > E{Xo}. hence BA\Xnl} = B{XH} + BEXS} = 2E{X$}- F(X} < 2E{Xt} — E{Xo}. er.) hence E{lim |X,.|} < lim inf E{|X,|} < 2sup E{X7} — E{Xo} < 2, noe by Fatou’s lemma and (27.1) combined with the hypothesis that sup,, E{X7} < oo. Thus X, converges a.s. to a finite limit X. Note that we have also showed that E{|X|} = F{lim, 2 |X;,|} < oo, hence X is in L}. Oo Corollary 27.1. If X, is a nonnegative supermartingale, or a martingale bounded above or bounded below, then limp. Xn = X exists a.s., and X € D. Proof. If X,, is a nonnegative supermartingale then (—X,,)n>y is a submartin- gale bounded above by 0 and we can apply Theorem 27.1. If (Xn)n>1 is a martingale bounded below, then X, > —c a.s., all n, for some constant c, with c > 0. 
Let Y, = X, +c, then Y,, is a nonnegative martingale and hence a nonnegative supermartingale, and we need only to apply the first part of this corollary. If (Xn)n>1 is a martingale bounded above, then (—Xp)n>1 is a martingale bounded below and again we are done. Oo Theorem 27.1 gives the a.s. convergence to a r.v. X, which is in L. But it does not give L! convergence of X, to X. To obtain that we need a slightly stronger hypothesis, and we need to introduce the concept of uniform integrability. Definition 27.1. A subset H of L’ is said to be a uniformly integrable col- lection of random variables if lim sup E{1qxj>0}|X|} = 0. c7 NCH * Next we present two sufficient conditions to ensure uniform integrability. Theorem 27.2. Let H be a class of random variables a) Ifsupyen E{|X|?} < 00 for some p > 1, then H is uniformly integrable. b) If there exists ar.v. Y such that |X| c > 0, then 2'~? < ¢'-?, and multiplying by 2? yields a < clPz?. Therefore we have E{X Lax} Se PE (IXP Lax} S Ga hence lime +2. Sup yen FLIX yx}p0)} S lime soo gr = 0. (b) Since |X| < Y as. for all X € H, we have IX{Laxj>cy S Y¥lgyoe}- But limesao Y1jy><} = 0 as; thus by Lebesgue’s dominated convergence theorem we have Jim, sup BUX xincy} S im HY lyse} = E{ lim Ylpysqj} =0. emo Oo For more results on uniform integrability we recommend [15, pp. 16-21]. We next give a strengthening of Theorem 27.1 for the martingale case. Theorem 27.3 (Martingale Convergence Theorem). a) Let (Mn)n>1 be a martingale and suppose (Mn)n>1 is a uniformly integrable collection of random variables. Then lim Mn = Moo evists a.s., n=00 M,, is in L’, and M,, converges to Mz. in L’. Moreover M, = E{Mox | Fn}. b) Conversely let Y € L’ and consider the martingale M, = E{Y|Fn}. Then (Mn)nz1 is a uniformly integrable collection of r.v.’s. In other words, with the terminology of Definition 24.2, the martingale (M,,) is closed if and only if it is uniformly integrable. Proof. a) Since (M;,)n>1 is uniformly integrable, for ¢ > 0 there exists c such that supy E{|Mn|1\a1,\>c) + < €- Therefore E{|Mnl} = E{|Mn|Leiatg\>ep} + E{IMnl (iatgiey} Sete Therefore (M,)n>1 is bounded in L}. Therefore sup, E{M;t} < oo and by Theorem 27.1 we have lim M,, = M., exists a.s. and Mx is in L}. n=50 To show M,, converges to M,, in L’, define 232 27. Martingale Convergence Theorems c if >a fe(a)=4 2 if jal 0 given: E {\fe(Mn) ~ Mul} < §. B {\fel Mac) ~ Mal} < 5. (27.3) all n; (27.2) Since lim M, = Mx as. we have limp oo fe(Mn) = fe(Mox), and so by Lebesgue’s Dominated Convergence Theorem (Theorem 9.1(f)) we have for n> N, N large enough: E{\fe(Mn) ~ fe(Moo)|} < (7.4) Therefore using (27.2), (27.3), and (27.4) we have E{|Mn — Mol} N. Hence My, —+ Moo in L?. It remains to show E{ Mx | Fn} = Mn. Let A € Fin and n > m. Then E{Mnla) = E{Mm la} by the martingale property. However, |E{M, la} — E{Moola}| < E{|Mn — Moola} S E{|Mn — Mocl} which tends to 0 as n tends to oo. Thus E{M,,1a} = E{M..1a} and hence E{Mx | Fn} = Mn as. b) We already know that (Mp)n>1) is a martingale. If c > 0 we have Mnlgiaai>cy = ELV 1gjatai>cy | Fr} because {|M,,| > c} € Fy. Hence for any d > 0 we get E{\Mn\lgmniscy} S BUY Wgataiseyt S PUY Lgyi>ay} + dP (|Mn| > ©) , d S BEY gvisay} + 7 EUMal}- (27.5) Take ¢ > 0. We choose d such that the first term in (27.5) is smaller than e/2, then c such that the second term in (27.5) is smaller than ¢/2: thus E{|\Mn}1yat,j>¢}} < € for all n, and we are done. aq 27. Martingale Convergence Theorems 233, The martingale property is that E{X,, | F,} =X; as. is natural to think of n. 
m as positive counting numbers (i.e., in N), as we did above. But we can also consider the index set -N: the negative integers. In this case if |m| > |n|, but m and n are negative integers, then m < n. Let U_{-n} denote the number of upcrossings of [a, b] between time -n and 0. Then U_{-n} is increasing as n increases, and let U(a, b) = lim_{n→∞} U_{-n}, which exists. By Monotone Convergence E{U(a, b)} = lim_{n→∞} E{U_{-n}}

0 and X7,, + X* a.s. yield =n E{X*} 1 be an iid. sequence with E{|X,|} < oc. Then Xt. t Xn lim 27-7" noc n — B{X} as. Proof. Let S, = X, +...+ Xn, and Fon = o( Sn. Snai+Snz2....). Then Fn C Fm ifn > m, and the process Mon = E{Xi|F-n} is a backwards martingale. Note that E{M_,} — E{X,}, each n. Also note that by symmetry for 1 1 be independent random vari- ables, E{Yn} =0, all n, and E{Y2} < 00 alln. Suppose ~~~, E{¥2} < 20. Let Sn = Sihay Yj. Then limn soo Sn = D2, Yj ewists a.s., and it is finite as. Proof, Let Fy = o(Yi,.-.,¥n), and note that E{Sni1—Sn | Fn} = E{¥aan | Fn} = E{Y¥n41} = 0. hence (S,)n>1 is an ¥,-martingale. Note further that sup, E{S#} < sup, (E{S2} +1) < O, E{¥2}+1 < co. Thus the result follows from the Martingale Convergence Theorem (Theorem 27.1). oO The Martingale Convergence Theorems proved so far (Theorems 27.1 and 27.4) are strong convergence theorems: all random variables are defined on the same space and converge strongly to random variables on the same space, almost surely and in L’. We now give a theorem for a class of martingales 27. Martingale Convergence Theorems 235, that do not satisfy the hypotheses of Theorem 27.1 and moreover do not have a strong convergence result. Nevertheless we can obtain a weak convergence result, where the martingale converges in distribution as n — oc. The limit is of course a normal distribution, and such a theorem is known as a martingale central limit theorem. The result below is stated in a way similar to the Central Limit Theorem for ii.d. variables X,, with their partial sums S,,: Condition (i) implies that (Sp) is a martingale, but on the other hand an arbitrary martingale (S,) is the sequence of partial sums associated with the random variables X,, = Sn — Sp. and these also satisfy (i). Theorem 27.7 (Martingale Central L' a sequence of random variables satisfying (i) E{Xn | Fra} (i) E{Xn | Fraps (iit) E{IXpP | Fra} ee , and so for n large enough we have 0 < 1— ¢ < 1. Therefore we reduce “the left side of (27.12) by multiplying by (1 — we yr-p for n large enough, to obtain WV? wes, wWVTP Fits, luls a Eset (yo BoM RV cK (om) Pf} (-g) feel sea (27.13) Finally we use telescoping (finite) sums to observe ye £ {ems} and thus by the triangle inequality and (27.13) we have (always for n > ©): a 3 iw ony (7 KluP tule jefe } (: a) ; but this is the characteristic function of an N(0.1) random variable (cf Example 13.5), and characteristic functions characterize distributions (Theorem 14.1), so we are done. oO Remark 27.1. If S,, is the martingale of Theorem 27.7, we know that strong martingale convergence cannot hold: indeed if we had lim, Sp = Sas. with $ in L, then we would have limy 3% = 0 as., and the weak convergence of Se to a normal random variable would not be possible. What makes it not possible to have the strong martingale convergence is the behavior of the conditional variances of the martingale increments X,, (hypothesis (ii) of Theorem 27.7). o We end our treatment of martingales with an example from analysis: this example illustrates the versatile applicability of martingales; we use the mar- tingale convergence theorem to prove a convergence result for approximation of functions. Example 27.1. ((10]) Let f be a function in L?(0, 1] for Lebesgue measure restricted to [0,1]. Martingale theory can provide insights into approxima- tions of f by orthogonal polynomials. Let us define the Rademacher functions on [0,1] as follows. We set Ro(a) =1,0 1, we set for0 m. (See Exercise 27.8.) 
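A short numerical sketch (ours, not the book's) can make the Rademacher functions concrete. It assumes the standard convention that R_n(x) is +1 or -1 according to whether the n-th binary digit of x is 0 or 1, i.e., R_n alternates sign on the dyadic intervals of length 2^{-n}; the helper name rademacher and the Monte Carlo check below are illustrative assumptions, not notation from the text. Sampling x uniformly on [0, 1] illustrates that each R_n takes the values +1 and -1 with probability 1/2 and that R_n and R_m are independent when n ≠ m (Exercise 27.7).

import numpy as np

rng = np.random.default_rng(0)

def rademacher(n, x):
    # Sign given by the n-th binary digit of x: +1 on [2k/2^n, (2k+1)/2^n),
    # -1 on [(2k+1)/2^n, (2k+2)/2^n).
    return 1 - 2 * (np.floor(x * 2**n).astype(int) % 2)

x = rng.uniform(0.0, 1.0, size=1_000_000)
r2, r5 = rademacher(2, x), rademacher(5, x)

print(r2.mean(), r5.mean())            # both near 0: each R_n is +1 or -1 with probability 1/2
print((r2 * r5).mean())                # near 0: R_2 and R_5 are uncorrelated
print(((r2 == 1) & (r5 == 1)).mean())  # near 1/4 = P(R_2 = 1) P(R_5 = 1)

Since the R_n are {+1, -1}-valued, zero correlation of a pair is equivalent to its independence, so the last two printed estimates check the same property; both are subject only to Monte Carlo error of order 10^{-3} at this sample size.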
Next we define the Haar functions as follows: Ho(«) = Ro(2), Ay (a) = Ry(x). For n > 2. letn =14+24...4+2°7? += 27"! —14A, where r > 2 and 1<\< 2°). Then V2-TR,,(a) for 2=2 . Let 1 n= [ H, (2) f(a)de, Sn(a. f) = > 0-H, (2). (27.16) r=0 Then limn oo Sn(t, f) = f(a) a.e. Moreover if S*(x. f) = sup, |Sn(2.f)I, then 1 pa [ .nyars (4) [ Uf (w)|Pdz. Proof. We first show that S,,(z, f) is a martingale. We have E{Sn+1(@.f) | Fu} = Sn(@.f) + Ef{onsiAng1(@) | Fn} = Sn(@,f) + Ong E{Hn+1(2) | Fn} = Sn(«, f) 27. Martingale Convergence Theorems 239 where we used (27.15). However more is true: Sn(w.f) = Eff | Fr}, (27.17) which is the key result. Indeed to prove (27.17) is where we need the coeffi- cients a, given in (27.16). (See Exercise 27.10.) Next we show S,,(2, f) satisfies sup, E{S,(#,f)*} < 00, for p > 1 (the hypothesis for the Martingale Convergence Theorem; Theorem 27.1). We actually show more thanks to Jensen’s inequality (Theorem 23.9): since y(u) = |u|? is convex for p > 1, we have that 1 | iSa(0. f) Pdr = EXECS | Fa}} S E{ELif? | Fah} = EXifl’} 1 -[ f(a) ida < cc, 0 and thus sup E{S,,(x, f)*} < sup E{|Sp (x, f)|P} n n < E{|f|P} < 20. We now have by Theorem 27.1 that lim S,,(x, f) = f(x) almost everywhere. noc and also by Doob’s L? martingale inequalities (Theorem 26.2) we have p \P B(s(s}s (25) Bsa) < (2) ein. [ “(s(@. fy)Pde < (AY [ " flayirae. We remark that results similar to Theorem 27.8 above hold for classical Fourier series, although they are harder to prove. or equivalently oO 240 27. Martingale Convergence Theorems Exercises for Chapter 27 27.1 (A martingale proof of Kolmogorov’s zero-one law.) Let X,, be inde- pendent random variables and let C,, be the corresponding tail o-algebra (as defined in Theorem 10.6). Let C € C... Show that E{1c|F,} = P(C). all n. where F,, = 0(X;;0 < j < n). Show further lim, ,, E{lc|Fn} = le as. and deduce that P(C) = 0 or 1. 27.2 A martingale X is bounded in L? if sup, E{X2} < cc. Let X be a martingale with X,, in L?, each n. Show that X is bounded in L? if and only if cc SO E{(Xn = Xna1)?} < co. n=1 (Hint: Recall Exercise 24,12.) 27.3 Let X bea martingale that is bounded in L?: show that sup, E{|Xal} < oo, and conclude that lim, X, = X a.s., with E{|X|} < oo. 27.4* Let X be a martingale bounded in L?. Show that lim, X, = X a.s. and in L?, That is, show that limy oo E{(X, —X)?} =0. 27.5 (Random Signs) Let (Xp)n21 be iid. with PX = P(X, = =1) = }. Let (an)nz1 be a sequence of real numbers, Show a ee aX, is a.s. convergent if 7%, a2 < oo. 27.6 Let X1, X2,... be i.id. nonnegative random variables with E{X,} = 1. Let Ry = Thea X;, and show that A, is a martingale. 27.7 Show that if n # m, then the Rademacher functions R, and R,, are independent for P = \ Lebesgue measure restricted to (0, 1]. 27.8 Let H, be the Haar functions, and suppose A € F, = o(Ho, Mh...., Hf,,). Show that [toss ote = A 27.9 Let f be in L?(0, 1]. Let Sy (2, f) be as defined in (27.16) and show that E{f | Fn} = Sp(2, f). (Hint: Show that [ tea = [ Sp(a, fda for A € Fr A A by using that the Haar functions are an orthonormal system; that is, 1 1 [ Ay(t)Am(2)dz = 0 ifn # mand [ H,,(x)?da = 1) 0 0 Exercises 241 27.10 Use Martingale Convergence to prove the following 0—1 law. Let (Fn) be an increasing sequence of o-algebras and G,, a decreasing sequence of o- algebras. with G, C o(U%,Fn). Suppose that F,, and G,, are independent. for each n. Show that if A €M2L,Gn, then P(A) =0or 1. 27.11 Let H be a subset of L!. 
Let G be defined on (0,00) and suppose G is positive, increasing, and lim sca] = 00. tox t Suppose further that sup yey E{G((X))} < oc. Show that 4 is uniformly integrable. (This extends Theorem 27.2(a).) 28. The Radon-Nikodym Theorem Let (2.F, P) be a probability space. Suppose a random variable X > 0 a.s. has the property E{X}= 1. Then if we define a set function Q on F by Q(A) = E{aX} (28.1) then it is easy to see that Q defines a new probability. Indeed Q() = E{loX} = E{X} =1 and if A;, Ao, A3,... are disjoint in F then Q (U 4) = E(us,ayX} t=1 -2 {Suna} i=1 =e naxy i=l = oA) i=) and we have countable additivity. The interchange of the expectation and the summation is justified by the Monotone Convergence Theorem (Theo- rem 9.1(d)). Let us consider two properties enjoyed by Q: (i) If P(A) = 0 then Q(A) = 0. This is true since Q(A) = E{1,X}, and then 1, is a.s. 0, and hence 1,X = Oas. (ii) For every ¢ > 0 there exists 6 > 0 such that if A € F and P(A) < 6, then QA) 0 there exists 6 > 0 such that if A €F and P(A) <6, then Q(A) 0. Set A = limsup,, 2: An- By Borel-Cantelli Lemma (Theorem 10.5) we have P(A) = 0. Fatou’s lemma has a symmetric version for limsups, which we established in passing during the proof of Theorem 9.1(f): this gives Q(A) = limsup Q(A,n) >. and we obtain a contradiction. o It is worth noting that conditions (i) and (ii) are actually equivalent. Indeed we showed (i) implies (ii) in Theorem 28.1; that (ii) implies (i) is simple: suppose we have (ii) and P(A) = 0. Then for any e > 0, P(A) < 6 and so P(A) < . Since € was arbitrary we must have Q(A) = 0. Definition 28.1. Let P,Q be two finite measures. We say Q is absolutely continuous with respect to P if whenever P(A) = 0 for A € F, then Q(A) = 0. We denote this Q << P. Examples: We have seen that for any r.v. X > 0 with E{X} = 1. we have Q(A) = E{1,X} gives a probability measure with Q < P. A naturally occurring example is Q(A) = P(A | A), where P(A) > 0, It is trivial to check that P(A) = 0 implies Q(A) = 0. Note that this example is also of the form Q(A) = E{1,X}, where X = Puy la: The Radon-Nikodym theorem characterizes all absolutely continuous probabilities. Indeed we see that if Q < P, then Q must be of the form (28.1). Thus our original class of examples is all that there is. We first state a simplified version of the theorem, for separable o-fields. Our proof follows that of P. A. Meyer [15]. Definition 28.2. A sub o-algebra G of F is separable if G = o(Ai,..., An,...), with A; © F, alli. That is, G is generated by a countable sequence of events. Theorem 28.2 (Radon-Nikodym). Lei (2,F,P) be a probability space with a separable o-algebra F. If Q is a finite measure on F and if P(A) =0 implies Q(A) = 0 for any such \ € F, then there exists a unique integrable positive random variable X such that Q(A) = E(1nX}. We write X = 43. Further X is unique almost surely: that is if X’ satisfies the same properties, then X' = X P-a.s. Proof. Since the result is obvious when Q = 0, we can indeed assume that Q(2) > 0. Then we can normalize Q by taking Q = ame, so we assume 28. The Radon-Nikodym Theorem 245 without loss that Q is a probability measure. Let Aj, A2,....4, be a count- able enumeration of sets in F such that F = o( Ay. A2,..., An-...). We define a filtration (Fn)n>1 by (At... An): There then exists a finite partition of {2 into sets Ani, An2,--.; Ant, Such that each element of F;, is the (finite) union of some of these events. Such events are called “atoms”. 
We define kn “2 Fa ale “) (28.2) with the convention that § = 0 (since Q < P the numerator is 0 whenever the denominator is 0 above). We wish to show the process (Xp )n>1 is in fact a martingale. Observe first that X,, is ¥,-measurable. Next, let m Pay deanna? 7 — QAAns) = » Plan) Ane A). We can write Now, since A € F;,, the set \ can be written as the union of some of the (dis- joint) partition sets A,;, that is A = UyerAn, for a subset JC {1,..., kn}. Therefore AM An; = Ani if i € J and AN Any = ¢ otherwise, and we now obtain [Xue >> Fgh Pas nA) = Vi QAns) = QA) ie where we have used again the fact that Q(A,,;) = 0 whenever P(An,;) = 0. Since A € Fm we get similarly [, XmdP = Q(A). Hence (28.1) holds, and further if we take \ = 92 then we get f X,dP = Q(2) = 1 < 00, so Xp, is P-integrable. Therefore (X,)n>1 is a martingale. 246 28. The Radon-Nikodym Theorem We also have that the martingale (X,,) is uniformly integrable. Indeed. we have / X,dP = Q(Xn > &); Xn>c} by Markov’s inequality P(X,) < Fides EU} Fol c Let ¢ > 0, and let 6 be associated with < as in Theorem 28.1 (since Q << P by hypothesis). If ¢ > 1/6 then we have P(X, > c) < 6, hence Q(Xn > c) Se, hence Je X, ce} 4ndP < ¢: therefore the sequence (X;,) is uniformly integrable, and by our second Martingale Convergence Theorem (Theorem 27.3) we have that there exists a r.v. X in L! such that limp—oo Xn = X as. and in L’ and moreover E{X | Fy} = Xn. Let now A € F, and define R(A) = E{1,X}. Then R agrees with Q on each Fn, since if A € Fy. R(A) = E{A,X} = E{1,X,} = Q(A). The Monotone Class Theorem (6.3) now implies that R= Q, since F=o(Fain>1). O Remark 28.1. We can use Theorem 28.2 to prove a more general Radon— Nikodym theorem, without the separability hypothesis. For a proof of Theo- rem 28.3 below, see (24, pp.147-149]. Theorem 28.3 (Radon-Nikodym). Let P be a probability on (Q,F) and let Q be a finite measure on (82, F). If Q< P then there exists a nonnegative rv. such that = E{1,X} for all A € F. Moreover X is P-unique a.s. We write X The Radon-Nikodym theorem is directly related to conditional expecta- tion. Suppose given ({2,F, P) and let G be a sub o-algebra of F. Then for any nonnegative r.v. X with E{X} < 00, Q(A) = E{X1,} for A in G defines a finite measure on (2,G), and P(A) = 0 implies QA) = 0. Thus % exists on the space (2,G), and we define Y = #2. then Y is G-measurable. Note further that if A € G, then E{Y 1p} = Q(A) = E{X1)}. Thus Y is a version of E{X | G}. In fact, it is possible to prove the Radon— Nikodym Theorem with a purely measure-theoretic proof, not using martin- gales. Then one can define the conditional expectation as above: this is an alternative way for constructing conditional expectation, which does not use Hilbert space theory. Finally note that if P is a probability on R having a density f, and since P(A) = f, f(a)dz, then P is absolutely continuous with respect to Lebesgue measure m on R (here m is a o-finite measure, but the Radon-Nikodym Theorem “works” also in this case), and we sometimes write f = a. Exercises 247 Exercises for Chapter 28 28.1 Suppose Q < P and P 0a.s. (dP). 28.2 Suppose Q ~ P. Let X = 9%. Show that + = gs. 28.3 Let jy: be a measure such that p = aan QP, for P,, probability mea- sures and an > 0. all n. Suppose Q, < Py, each n, and that v= 7, BnQn and 3, > 0, all n, Show that (A) = 0 implies »(A) = 0. 28.4 Let P,Q be two probabilities and let R = Pea Show that P< R. 28.5 Suppose Q ~ P. Give an example of a P martingale which is not a martingale for Q. 
Also give an example of a process which is a martingale for both P and Q simultaneously.

References

1. R. Bass (1995), Probabilistic Techniques in Analysis; Springer-Verlag; New York.
2. H. Bauer (1996), Probability Theory; Walter de Gruyter; Berlin.
3. J. Bernoulli (1713), Ars Conjectandi; Thurnisiorum; Basel (Switzerland).
4. G. Cardano (1961), Liber de ludo aleae (The Book on Games of Chance); Sidney Gould (Translator); Holt, Rinehart and Winston.
5. G. Casella and R. L. Berger (1990), Statistical Inference; Wadsworth; Belmont, CA.
6. A. De Moivre (1718), The Doctrine of Chances; or, A Method of Calculating the Probability of Events in Play; W. Pearson; London. Also in 1756, The Doctrine of Chances (Third Edition), reprinted in 1967; Chelsea; New York.
7. J. Doob (1994), Measure Theory; Springer-Verlag; New York.
8. R. Durrett (1991), Probability: Theory and Examples; Wadsworth and Brooks/Cole; Belmont, CA.
9. W. Feller (1971), An Introduction to Probability Theory and Its Applications (Volume II); John Wiley; New York.
10. A. Garsia (1970), Topics in Almost Everywhere Convergence; Markham; Chicago.
11. A. Gut (1995), An Intermediate Course in Probability; Springer-Verlag; New York.
12. N. B. Haaser and J. A. Sullivan (1991), Real Analysis; Dover; New York.
13. C. Huygens (1657); see Oeuvres Complètes de Christiaan Huygens (with a French translation, 1920); The Hague: Nijhoff.
14. A. N. Kolmogorov (1933), Grundbegriffe der Wahrscheinlichkeitsrechnung. English translation: Foundations of the Theory of Probability (1950), Nathan Morrison (Translator); Chelsea; New York.
15. P. A. Meyer (1966), Probability and Potentials; Blaisdell; Waltham, MA (USA).
16. J. Neveu (1975), Mathematical Foundations of the Calculus of Probabilities; Holden-Day; San Francisco.
17. D. Pollard (1984), Convergence of Stochastic Processes; Springer-Verlag; New York.
18. M. H. Protter and P. Protter (1988), Calculus with Analytic Geometry (Fourth Edition); Jones and Bartlett; Boston.
19. M. Sharpe (1988), General Theory of Markov Processes; Academic Press; New York.
20. G. F. Simmons (1963), Introduction to Topology and Modern Analysis; McGraw-Hill; New York.
21. S. M. Stigler (1986), The History of Statistics: The Measurement of Uncertainty before 1900; Harvard University Press; Cambridge, MA.
22. D. W. Stroock (1990), A Concise Introduction to the Theory of Integration; World Scientific; Singapore.
23. S. J. Taylor (1973), Introduction to Measure and Integration; Cambridge University Press; Cambridge (U.K.).
24. D. Williams (1991), Probability with Martingales; Cambridge University Press; Cambridge (U.K.).

Index

1_A Indicator function 10, 49
2^Ω Set of all subsets of Ω 3, 7
A* A transpose 92
A_n → A convergence of the sets A_n to A 10
B(r, s) beta function 62
E{X} Expectation of X 27, 51, 52
E{Y | G} Conditional expectation of Y given G 200
E_Q{X | G} Conditional expectation of X given G under Q 209
H_n n-th Haar function 238
J_g Jacobian matrix 92
L^1 := ℒ^1 modulo a.s. equal 53
L^p as a normed linear space 207
L^p := ℒ^p modulo a.s. equal 53
N(μ, Σ) 127
N(μ, σ²) Normal distribution with mean μ and variance σ² 125
P(A | B) 16
P ⊗ Q 67
P^X Distribution measure of X 4, 27
P^X ⊗ P^Y Product of P^X and P^Y 69
P^(X,Y) Distribution measure of the pair (X, Y) 69
Q

Inner product 189 252 Index $8 Radon-Nikodym derivative of Q with respect to P 244 N Natural numbers 30 Q RationalsinR 8 R Real numbers (R= (—00,+00)) 8 Z The integers 161 B Borel sets of R 8,39 B" Borel sets of R” 87 C* Functions with an arbitrarily large number of derivatives 161 Cos Tail o-algebra 72 E@F=a(ExF) 67 Fy Stopping time o-algebra 214 £) Random variables with finite expectation 28, 52 £1(2,A,P) L} on the space (2,4, P) 52 LP Random variables with finite p' moment 53 N Null sets 37 Cov(X,Y) covariance of X and Y 73, 91 a.e, almost everywhere 88 a.s. almost surely 37,52 additivity 9 — o-additivity 8,35 algebra 7,35 atoms 245 Bayes’ Theorem 17 Bernoulli distribution 184 — characteristic function of 106 Bernoulli, Jacob 125 Berry-Esseen 185 beta function, distribution 62 biased estimator 139 BienayméChebyshev inequality: see 30, 119, 176, Chebyshev 29, 58 binomial distribution 23, 26, 30, 119, 163, 184 ~ characteristic function 106 Bolzano-Weierstrass theorem 158 Bonferroni inequalities 13 Borel (o-algebra) 8 -onR 8,39 -onR” 8&7 Borel sets 8 Borel-Cantelli Theorem 71 Box-Muller (simulation of) 101 Cauchy sequence 190 Cauchy distribution 44, 60, 98 138 114 — and bivariate normal — characteristic function Cauchy's equation 63 Cauchy-Schwarz inequality 57, 208 Central Limit Theorem 181, 183 Chebyshev inequality 29. 58.208 chi square distribution 82. 83, 96. 120 closed under differences 36 closed under finite intersections 36 closed under increasing limits 36 closure of a martingale 212 Cobb-Douglas distribution 44 completely convergent 74 conditional density 89, 203 conditional expectation 197, 200 ~ defined as Hilbert space projection 200 ~ defined in L1 202 ~ defined in L? 200 conditional probability 16 continuously differentiable 92 convergence (of random variables) - almost sure 142 —inL? 142 - inp mean 142 — in distribution 152 - in probability 143 — pointwise 141 convolution of functions 122 convolution product (of probability measures) 117 correlation coefficient 91 countable additivity 8, 35,77 countable cartesian product 70 covariance 73, 91 — and independence 91 ~ matrix 91 De Morgan's laws 12 density function -~oR 42,78 — on R™ 88 Dirac (mass. measure) discrete uniform distribution distribution function 39, 50 —onR" 87 distribution of a random variable 4, 27,50 Doob decomposition 220 Doob’s L? martingale inequalities 224, 225 Doob’s first martingale inequality 42, 156 22, 32 223 Doob’s optional sampling theorem 215 Doob’s upcrossing inequality 226 double exponential distribution 44 ~ characteristic function of 115 empirical distribution function 184 equivalence class 53 estimator 117, 132 event 3,8 expectation 27. 51,52, 67 — of asimple random variable 51 expectation rule 58,59, 80 exponential distribution 43, 53, 59, 95 ~ characteristic function of 108 exponential distribution ~ and Gamma distribution 123 Fatou’s lemma 53, 205 finite additivity 9 Fourier transform 103, 111 Fubini’s theorem 67.75 ~ (see also Tonelli-Fubini theorem) 67 Galton-McAlister distribution 44 gamma distribution 43, 83, 96, 120 — and sums of exponentials 123 ~ characteristic function of 109 — relation to x? 96, 120 gamma function 43 Gauss C.F. 125 Gaussian distribution = see Normal distribution 44 geometric distribution 23, 31 Glivenko-Cantelli theorem 185 Gosset, W. 97 Hilder inequality 206 Haar system 238 hazard rate 43, 63 Helly’s selection principle 157 hypergeometric distribution 22, 26 iid. 
(independent identically dis- tributed) 100, 173 iff (if and only if) 144 image of P by X50 independent ~ o-algebras 65 — events 15 ~ infinite sequence of random variables 65, 69 ~ pairwise 15 Index 253 = random variables 65 ~ random variables and their densities 39 indicator function 10.49 infinitely often (i.0.) 71 inner product 189 Jacobi’s transformation formula 92 Jacobian matrix 92 Jensen’s inequality 205 y’s continuity theorem 167 Lévy, P. 126 Laplace distribution — see double exponential distribution 44 Laplace, P. 125 law 50 Lebesgue measure -oR 77 -onR” 87 Lebesgue’s dominated convergence theorem 53, 205 lim inf 10 lim sup 10 linear estimator 132 linear regresion 131 logistic distribution 64 lognormal distribution 44 marginal densities 89 Markov’s inequality 29 martingale 211 ~ backwards martingales 233 ~ central limit theorem 235 ~ convergence theorems 229-231, 233 convergence with uniform integrabil- ity 231 measurable function 47 = jointly 67 measure preserving map 175 Mellin transform 115 Minkowski’s inequality Moivre, A. de 125 monotone class theorem 36, 37 monotone convergence theorem 52, 204 Monte Carlo approximation 176 multivariate normal 126 189, 207 natural numbers 30 negative binomial distribution 31 negligible set 37, 142 254 Index normal distribution 44. 60. 82, 93. 95-97, 111, 120, 125. 181 ~ characteristic function 120. 126 ~ multivariate 126 ~ non-degenerate 129 ~ simulation of 101 — standard 125 normed linear space 189 — complete 190 null set 37,142 107. 108, order statistics 100 orthogonal matrix 129 orthogonal vectors 191 orthonormal vectors 129 pairwise disjoint 8 pairwise independent 15 Pareto distribution 31 partition equation 17 Pascal distribution 31 point mass probability 42.156 Poisson distribution 23. 26, 30, 43, 119, 163 and conditional expectation 198 — approximation to the binomial 24 ~ characteristic function 106, 119 — convergence to the normal 170 positive semidefinite matrix 91 predictor variable 131 probability measure 4,8 projection operator (Hilbert space) 193 projections (Hilbert space) 193 Pythagorean theorem 191 Rademacher functions 237 Radon-Nikodym theorem 244, 246 random signs 240 random variable 4, 27 random walk on the integers 179 Rayleigh distribution 98 regression 131 — residuals 139 ~ simple linear regression 131 Riemann zeta function 31 Riesz-Fischer theorem 190 right continuous function 40 simple random variable 51 singleton 21 Slutsky’s theorem 161 Stone-Weierstrass theorem 112 stopping time 212 — bounded stopping time 213 strong law of large numbers 173. 175 233 — ergodic strong law of large numbers 176 -- Kolmogorov strong law of large numbers 175, 234 Student's t-distribution 97 subadditivity 13 submartingale 219 subspace (Hilbert space) 191 supermartingale 219 symmetric ~ density 84.97 ~ distribution 114 ~ random variable 84 tail g-algebra 72 tail event zero-one law 72 tightness 157 Tonelli-Fubini theorem 67 topological space 48 triangular distribution 48 uncorrelated random variables 131 uniform distribution 43. 80. 176 ~ characteristic function 106 — on the ball 99 uniform integrability 230 unimodal 45 uniqueness theorem for characteristic functions 111 uperossings 225 variance 29.58 weak convergence (of probability measures) 151 weak law of large numbers 178 Weibull distribution 43 zero-one law 72, 240 zeta distribution ~ see Pareto distribution 31 Universitext Aksoy, A; Khamsi, M. 
