
Probability theory

M. Math. 2nd year


Fall 2023

1 Review of measure theory and integration


1.1 Measure theory
Definition 1. Let Ω be a non-empty set, and F a class of subsets of Ω. The
class F is a field if
• Ω ∈ F;
• A ∈ F implies that Ac ∈ F;
• A, B ∈ F implies that A ∪ B ∈ F.
Definition 2. A field F of subsets of Ω is a σ-field if A1, A2, . . . ∈ F implies that ⋃_{n=1}^∞ An ∈ F.
Definition 3. Let F be a collection of subsets of Ω. A function µ from F to [0, ∞] is countably additive if µ(A) < ∞ for some A ∈ F and for disjoint F-sets A1, A2, . . . such that ⋃_{n=1}^∞ An ∈ F, it holds that

µ( ⋃_{n=1}^∞ An ) = Σ_{n=1}^∞ µ(An) .

If F is a σ-field, then a µ satisfying the above is a measure and (Ω, F, µ) is a


measure space. If F is a σ-field and µ(Ω) < ∞, then µ is a finite measure.
If F is a σ-field and there exist A1 , A2 , . . . ∈ F such that Ω = A1 ∪ A2 ∪ . . .
and µ(An ) < ∞ for all n, then µ is a σ-finite measure. If F is a σ-field
and µ(Ω) = 1, then µ is a probability measure and (Ω, F, µ) is a probability
space.
Exercise 1.1. If F is a field and µ is countably additive on F, show that ∅ ∈ F
and
µ(∅) = 0 .
Theorem 1.1 (Caratheodory extension theorem). Suppose that F is a field on
Ω, and µ is a countably additive function on F. Then, there exists a measure
µ∗ on (Ω, σ(F)) such that
µ∗ (A) = µ(A) for all A ∈ F .

The above theorem, which is ubiquitous in the construction of measures, is Theorem 11.2, pg 166, of Billingsley (1995).
Definition 4. A sequence of sets An increases to a set A, denoted by An ↑ A, if A1 ⊂ A2 ⊂ . . . and A = ⋃_{n=1}^∞ An. Similarly, An ↓ A is also defined.
Definition 5. A family M of subsets of Ω is a monotone class if
• (closed under monotone union) An ∈ M and An ↑ A implies that A ∈ M,
• (closed under monotone intersection) An ∈ M and An ↓ A implies that
A ∈ M.
The following is Theorem 3.4, pg 43, of Billingsley (1995).
Theorem 1.2 (Monotone class theorem). If F is a field, and M is a monotone
class, then F ⊂ M implies that σ(F) ⊂ M.
Theorem 1.3 (Uniqueness). Suppose F is a field, and µ1 , µ2 are measures on
(Ω, σ(F)) which agree on F and are σ-finite on F. Then
µ1 = µ2 .
Proof. Follows from Theorem 1.2.
Definition 6. A non-empty collection S of subsets of Ω is a semi-field if
A, B ∈ S implies A ∩ B ∈ S and A ∈ S implies
Ac = A1 ∪ . . . ∪ An ,
for some disjoint A1 , . . . , An ∈ S.
Exercise 1.2. If S is a semi-field, show that
F = {A1 ∪ . . . ∪ An : A1 , . . . , An ∈ S are disjoint}
is the smallest field generated by S.
The following corollary of Theorems 1.1 and 1.3 will be most useful for us.
Corollary 1.1. Suppose that S is a semi-field on Ω, and µ : S → [0, ∞] is a
countably additive function. Then, there exists a measure µ∗ on (Ω, σ(S)) such
that
µ∗ (A) = µ(A) for all A ∈ S .
Furthermore, if µ is σ-finite on S, then µ∗ is unique.
Exercise 1.3. Prove Corollary 1.1.
Exercise 1.4. Let µ1 and µ2 be measures on (R, B(R)) defined by
µ1 (A) = #(A ∩ Q) , µ2 (A) = 2#(A ∩ Q) , A ∈ B(R) .
Show that µ1, µ2 are σ-finite and agree on S = {(a, b] ∩ R : −∞ ≤ a ≤ b ≤ ∞}, which is a semi-field, but that they do not agree on (R, σ(S)). The uniqueness in Corollary 1.1 thus fails when µ is σ-finite as a measure but not σ-finite on S, even though everything else holds.

Definition 7. For d ≥ 1, B(Rd ) is the Borel σ-field on Rd , that is, the σ-field
generated by all open sets in Rd. A measure µ on (Rd, B(Rd)) is Radon if
µ(K) < ∞ for all compact K ⊂ Rd .
The following, which is Theorem 12.4, pg 176, of Billingsley (1995), is fun-
damental to probability theory.
Theorem 1.4. If F : R → R is a non-decreasing right continuous function,
then there exists a unique Radon measure µ on (R, B(R)) such that

µ((a, b]) = F(b) − F(a) , −∞ < a < b < ∞ .
Furthermore, if F (∞) = 1 and F (−∞) = 0, then µ is a probability measure.
In Theorem 1.4, the following conventions are used:
F(∞) = lim_{x→∞} F(x) if it exists, and F(−∞) = lim_{x→−∞} F(x) if it exists.

Unless mentioned otherwise, F (∞) and F (−∞) will mean the above throughout
the course.
Exercise 1.5. Show that a Radon measure µ on (Rd , B(Rd )) is regular, that is,
µ(A) = inf {µ(U ) : U open, and U ⊃ A} = sup {µ(F ) : F closed, and F ⊂ A} ,
for all A ∈ B(Rd ).
Exercise 1.6. Show that µ1 as in Exc 1.4 is σ-finite but not regular. Thus not
all σ-finite measures on (R, B(R)) are regular.
Exercise 1.7. Suppose that µ1 , µ2 are Radon measures on (Rd , B(Rd )) such
that
µ1 (U c ) = µ2 (U c ) = 0
for some open set U ⊂ Rd . If
µ1 (R) ≤ µ2 (R) ,
for all rectangles R = (a1 , b1 ] × . . . × (ad , bd ] ⊂ U with a1 , . . . , ad , b1 , . . . , bd ∈ Q,
then show that
µ1 (A) ≤ µ2 (A) , A ∈ B(Rd ) .

1.2 Integration
Definition 8. Let R̄ = R ∪ {−∞, ∞} and

B(R̄) = σ( B(R) ∪ { {−∞}, {∞} } ) .

That is, B(R̄) is the smallest σ-field which contains B(R) as well as the singleton sets {−∞}, {∞}. Given a measurable space (Ω, A) and a function f : Ω → R̄, f is A-measurable if

f⁻¹A ∈ A for all A ∈ B(R̄) ,

where f⁻¹A = {ω ∈ Ω : f(ω) ∈ A}.

Given a measure space (Ω, A, µ) and a measurable f : Ω → [0, ∞], the integral of f with respect to µ will be denoted by any of the following:

∫ f dµ , ∫ f(ω) dµ(ω) , ∫ f(ω) µ(dω) , ∫_Ω f dµ , etc.

For a measurable f : Ω → R̄, its integral is defined as

∫ f dµ = ∫ f⁺ dµ − ∫ f⁻ dµ ,

whenever either ∫ f⁺ dµ < ∞ or ∫ f⁻ dµ < ∞, where x⁺ = x ∨ 0 and x⁻ = (−x) ∨ 0 for all x ∈ R̄. The above is defined even when the right hand side is ±∞; “∞ − ∞” is the only case when it is undefined.

We say f is integrable if ∫ f⁺ dµ < ∞ and ∫ f⁻ dµ < ∞, which happens if and only if

∫ |f| dµ < ∞ .

In other words, “f is integrable”, “f has a finite integral”, “|f| is integrable”, “f⁺ and f⁻ are integrable”, “f ∈ L¹” all mean the same. However, “∫ f dµ is defined” is not the same as “f is integrable”, which deserves emphasis.
Exercise 1.8. Show that the integral of f , if defined, remains unchanged if the
underlying σ-field is changed to anything with respect to which f is measurable.
Theorem 1.5. Suppose g, f, f1, f2, . . . are measurable functions from a measure space (Ω, A, µ) to R̄.

1. (Monotone convergence theorem) If 0 ≤ fn ↑ f, then ∫ fn dµ ↑ ∫ f dµ.

2. (Fatou’s lemma) If fn ≥ 0, then

∫ ( lim inf_{n→∞} fn ) dµ ≤ lim inf_{n→∞} ∫ fn dµ .

3. (Dominated convergence theorem) If fn → f, |fn| ≤ g and g is integrable, then f is integrable and

∫ fn dµ → ∫ f dµ .

The above are Theorems 16.2, 16.3 and 16.4 on pg 208-209 of Billingsley (1995), respectively.
Definition 9. Suppose (Ω1, A1, µ) is a measure space, (Ω2, A2) is a measurable space and T : Ω1 → Ω2 is a measurable map, that is,

T⁻¹A ∈ A1 for all A ∈ A2 .

The push forward measure of µ under T is the measure µ ∘ T⁻¹ on (Ω2, A2) defined by

µ ∘ T⁻¹(A) = µ(T⁻¹A) , A ∈ A2 .

Theorem 1.6. Suppose (Ω1, A1, µ) is a measure space, (Ω2, A2) is a measurable space and T : Ω1 → Ω2 is a measurable map. Then, for a measurable f : Ω2 → R̄,

∫_{Ω1} f(T(x)) dµ(x) = ∫_{Ω2} f(y) d(µ ∘ T⁻¹)(y) ,

whenever the integral on either side is defined.


Exercise 1.9. Prove the above theorem by first showing it when f = 1A for
some A ∈ A2 , then for non-negative simple functions f and finally using the
monotone convergence theorem.

Definition 10. If µ and ν are measures on (Ω, A), then µ is absolutely continuous with respect to ν, written µ ≪ ν, if

ν(A) = 0 ⇒ µ(A) = 0 , for all A ∈ A .

Theorem 1.7 (Radon-Nikodym). Suppose (Ω, A) is a measurable space on which µ, ν are σ-finite measures such that µ ≪ ν. Then there exists a measurable f : Ω → [0, ∞) such that

∫_A f dν = µ(A) , A ∈ A . (1.1)

If (1.1) holds with f replaced by any other function g, then g = f ν-a.e.


The above is Theorem 32.2, page 422, of Billingsley (1995).
Definition 11. The function f satisfying (1.1) is the “Radon-Nikodym derivative of µ with respect to ν”, and is denoted by

f = dµ/dν .

Exercise 1.10. Suppose µ, ν are σ-finite measures on (Ω, A) and µ ≪ ν. Show that for a measurable g : Ω → R̄,

∫ g dµ = ∫ g (dµ/dν) dν ,

whenever the integral on either side is defined.

Definition 12. If (Ω1 , A1 ) and (Ω2 , A2 ) are measurable spaces, then the product
σ-field A1 ⊗ A2 is defined by

A1 ⊗ A2 = σ (A1 × A2 : A1 ∈ A1 , A2 ∈ A2 ) .

Exercise 1.11. Show that

B(R) ⊗ B(R) = B(R²) .


Exercise 1.12. If (Ω1, A1, µ1) and (Ω2, A2, µ2) are σ-finite measure spaces, then show that there exists a unique measure µ1 ⊗ µ2 on (Ω1 × Ω2, A1 ⊗ A2) satisfying

µ1 ⊗ µ2 (A1 × A2) = µ1(A1) µ2(A2) , A1 ∈ A1 , A2 ∈ A2 .

Show that µ1 ⊗ µ2 is σ-finite.


Theorem 1.8 (Tonelli). Suppose (Ω1, A1, µ1) and (Ω2, A2, µ2) are σ-finite measure spaces and f : Ω1 × Ω2 → [0, ∞] is A1 ⊗ A2-measurable. Then, for a fixed ω1 ∈ Ω1,

f(ω1, ·) is measurable w.r.t. A2 ,

the map

ω1 ↦ ∫_{Ω2} f(ω1, ω2) µ2(dω2) is measurable w.r.t. A1 ,

and likewise with the roles of ω1 and ω2 interchanged. Furthermore,

∫_{Ω1×Ω2} f d(µ1 ⊗ µ2) = ∫_{Ω1} ( ∫_{Ω2} f(ω1, ω2) µ2(dω2) ) µ1(dω1) = ∫_{Ω2} ( ∫_{Ω1} f(ω1, ω2) µ1(dω1) ) µ2(dω2) .

Convention
The usual convention for iterated integrals is the following:
∫_{Ω1} ∫_{Ω2} f(ω1, ω2) µ2(dω2) µ1(dω1) = ∫_{Ω1} ( ∫_{Ω2} f(ω1, ω2) µ2(dω2) ) µ1(dω1) ,

that is, the left hand side above means the right hand side.

Exercise 1.13. Suppose (Ω1, A1, µ1) and (Ω2, A2, µ2) are σ-finite measure spaces and f : Ω1 × Ω2 → R̄ is µ1 ⊗ µ2-integrable. Show that

∫_{Ω2} |f(ω1, ω2)| µ2(dω2) < ∞ for almost every ω1 ∈ Ω1 ,

and that the similar statement holds for integrals over Ω1.


Theorem 1.9 (Fubini). Suppose (Ω1 , A1 , µ1 ) and (Ω2 , A2 , µ2 ) are σ-finite mea-
sure spaces and f ∈ L1 (Ω1 × Ω2 , µ1 ⊗ µ2 ). Then
∫_{Ω2} ∫_{Ω1} f(ω1, ω2) µ1(dω1) µ2(dω2) = ∫_{Ω1} ∫_{Ω2} f(ω1, ω2) µ2(dω2) µ1(dω1) .

The theorems of Tonelli and Fubini are subsumed in Theorem 18.3, pg 234,
Billingsley (1995).

Definition 13. The Lebesgue measure λ is the measure on (R, B(R)) satisfying

λ((a, b]) = b − a , for all −∞ < a ≤ b < ∞ ,

the existence and uniqueness of which is guaranteed by Theorem 1.4 by taking F(x) = x. For a Borel measurable function f : (a, b) → R̄, where −∞ ≤ a < b ≤ ∞, its Lebesgue integral is defined as

∫_a^b f(x) dx = ∫_{(a,b)} f(x) λ(dx) ,

whenever the right hand side is defined.


Theorem 1.10 (Fundamental theorem of calculus). Suppose −∞ < a < b < ∞ and F : [a, b] → R is differentiable on (a, b) and continuous at a and b. Then F′, the derivative of F, is Borel measurable. If

∫_a^b |F′(x)| dx < ∞ , (1.2)

then

∫_a^b F′(x) dx = F(b) − F(a) .

If (1.2) holds with b = ∞, then the above holds as well with F(b) replaced by lim_{x→∞} F(x), which necessarily exists, and likewise if −∞ = a < b ≤ ∞.
The above is Theorem 7.21, page 149, of Rudin (1987).
Theorem 1.11 (Change of variable). If U, V ⊂ R are open sets, ψ : U → V is a C¹ bijection whose derivative ψ′ never vanishes, then

∫_U f ∘ ψ(x) |ψ′(x)| dx = ∫_V f(y) dy ,

for a measurable f : V → R̄ whenever the integral on either side makes sense.

In other words, for substituting y = ψ(x), dy is to be replaced by |ψ′(x)| dx. It is worth emphasizing that if |ψ′| is replaced by ψ′, then the formula is incorrect. The above is a special case of the change of variables formula in d dimensions, which is Theorem 1.12 below.
Exercise 1.14. Calculate

∫_{−∞}^∞ ∫_{−∞}^∞ e^{−(x²+y²)} dx dy .

Soln.: Let

I = ∫_{−∞}^∞ ∫_{−∞}^∞ e^{−(x²+y²)} dx dy
  = ∫_{−∞}^∞ e^{−x²} ( ∫_{−∞}^∞ e^{−y²} dy ) dx .    (Tonelli)

For a fixed x ∈ R \ {0}, put y = xz using Theorem 1.11 to get

∫_{−∞}^∞ e^{−y²} dy = |x| ∫_{−∞}^∞ e^{−z²x²} dz .

Thus

I = ∫_{−∞}^∞ ∫_{−∞}^∞ |x| e^{−x²(1+z²)} dz dx
  = ∫_{−∞}^∞ ∫_{−∞}^∞ |x| e^{−x²(1+z²)} dx dz .    (Tonelli)

For a fixed z ∈ R,

∫_{−∞}^∞ |x| e^{−x²(1+z²)} dx = 2 ∫_0^∞ x e^{−x²(1+z²)} dx
  = ∫_0^∞ e^{−y(1+z²)} dy    (Theorem 1.11: y = x², dy = 2x dx)
  = [ −e^{−y(1+z²)}/(1+z²) ]_{y=0}^{y=∞}    (Theorem 1.10)
  = 1/(1+z²) .

Therefore

I = ∫_{−∞}^∞ 1/(1+z²) dz
  = [ tan⁻¹ z ]_{−∞}^{∞}    (Theorem 1.10)
  = π/2 − (−π/2) = π .

This completes the solution of the exercise.
An immediate consequence of the above exercise is

∫_{−∞}^∞ e^{−x²} dx = √π ,

showing with the help of Theorem 1.11 that

∫_{−∞}^∞ e^{−y²/2} dy = √(2π) .
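The two Gaussian integrals above are easy to sanity-check numerically. The following Python sketch (an illustration, not part of the notes; the truncation at ±10 and the grid size are arbitrary choices) approximates both integrals by Riemann sums:

import numpy as np

# Riemann-sum check of the two Gaussian integrals; the tails beyond +/-10 are negligible.
x = np.linspace(-10.0, 10.0, 2_000_001)
dx = x[1] - x[0]
print(np.sum(np.exp(-x**2)) * dx, np.sqrt(np.pi))           # both ~ 1.77245
print(np.sum(np.exp(-x**2 / 2)) * dx, np.sqrt(2 * np.pi))   # both ~ 2.50663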

Definition 14. The Lebesgue measure λd is the d-fold product of the one-
dimensional Lebesgue measure λ, that is, λd is the unique measure on the space
(Rd , B(Rd )) satisfying
λd(A1 × . . . × Ad) = ∏_{i=1}^d λ(Ai) , A1, . . . , Ad ∈ B(R) .

For stating the next result, a Jacobian matrix has to be first defined. Consider an open set U ⊂ Rd and a function F : U → Rd. Denote by f1, . . . , fd the coordinate functions of F, that is,

F(x) = (f1(x), . . . , fd(x)) , x ∈ U .

If the first partial derivatives of F exist, that is, ∂fi(x)/∂xj exists for all x ∈ U and 1 ≤ i, j ≤ d, then its Jacobian matrix at x, denoted by J(x), is the d × d matrix

J(x) = ( ∂fi(x)/∂xj )_{1≤i,j≤d} , x ∈ U ,

that is, the (i, j)-th entry of J(x) is ∂fi(x)/∂xj. The statement of the theorem is the following, of which Theorem 1.11 is a special case.
Theorem 1.12. For open subsets U and V of Rd, let T : U → V be a bijection which is continuously differentiable, that is, the first partial derivatives of T exist and are continuous. Assume that its Jacobian matrix J(x) is non-singular for all x ∈ U. Then for any non-negative measurable function f : V → R,

∫_U f(T(x)) |det(J(x))| dx = ∫_V f(y) dy ,

det(A) denoting the determinant of A for any square matrix A.
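Before turning to the proof, here is a quick numerical illustration of the theorem (a sketch, not part of the notes): take d = 2, U = (0, 1) × (0, 2π), T(r, θ) = (r cos θ, r sin θ), whose Jacobian determinant is r, so that V is the open unit disc up to a λ-null set, and take f(y) = exp(−|y|²). Both sides of the formula are approximated by Riemann sums on ad hoc grids:

import numpy as np

# Left side: integral over U of f(T(r, theta)) * |det J(r, theta)| = exp(-r**2) * r.
r = np.linspace(0.0, 1.0, 1001)[1:-1]
theta = np.linspace(0.0, 2 * np.pi, 1001)[1:-1]
dr, dtheta = 1.0 / 1000, 2 * np.pi / 1000
R, TH = np.meshgrid(r, theta, indexing="ij")
lhs = np.sum(np.exp(-R**2) * R) * dr * dtheta

# Right side: integral over the open unit disc V of f(y) = exp(-|y|^2).
x = np.linspace(-1.0, 1.0, 1001)
X, Y = np.meshgrid(x, x, indexing="ij")
dx = x[1] - x[0]
rhs = np.sum(np.exp(-(X**2 + Y**2)) * (X**2 + Y**2 < 1.0)) * dx * dx

print(lhs, rhs, np.pi * (1 - np.exp(-1)))   # all three should be close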


For the sake of completeness, a proof of the above theorem is provided. The
following facts from linear algebra and multivariable analysis are needed.
Fact 1.1. If T : Rd → Rd is a linear map, then for a compact rectangle

R = [a1 , b1 ] × . . . × [ad , bd ] ⊂ Rd ,

with −∞ < ai < bi < ∞ for i = 1, . . . , d, it holds that


λ({T(x) : x ∈ R}) = |det(T)| ∏_{i=1}^d (bi − ai) = |det(T)| λ(R) ,

where λ is the Lebesgue measure on Rd .


The following is the inverse function theorem.
Fact 1.2. Let U ⊂ Rd be an open set and T : U → Rd be continuously differentiable. Denoting by J(x) the Jacobian matrix of T at x ∈ U, assume that J(x0) is non-singular for some x0 ∈ U. Then, there exists an open neighbourhood X of x0 such that T is one-one on X, the set T(X) is open, T⁻¹ is continuously differentiable on T(X) and the Jacobian matrix of T⁻¹ at y is J(T⁻¹y)⁻¹ for all y ∈ T(X).
The following is another fact from multivariable analysis which essentially
follows from the one-dimensional mean value theorem.

Fact 1.3. Suppose that U ⊂ Rd is open, R ⊂ U is a closed rectangle and T : U → Rd is continuously differentiable such that

|Jij(y) − Jij(x)| ≤ α , x, y ∈ R , 1 ≤ i, j ≤ d ,

where Jij(z) is the (i, j)-th entry of the Jacobian matrix J(z) of T at z for all z ∈ U and 1 ≤ i, j ≤ d. Then,

‖T(x) − T(y) − J(x)(x − y)‖ ≤ dα‖x − y‖ , x, y ∈ R ,

where ‖ · ‖ is the L∞ norm on Rd defined by

‖x‖ = max_{1≤i≤d} |xi| , x = (x1, . . . , xd) ∈ Rd , (1.3)

x, y, T(x), T(y) are viewed as d × 1 vectors and hence J(x)(x − y) is also a d × 1 vector.
A proof of the above fact is provided in Subsection 9.1 of the Appendix.

Proof of Theorem 1.12


The proof of Theorem 1.12 will be executed by sequentially showing each step
below. Step 4. would complete the proof.
Step 1. For any compact rectangle R = [a1 , b1 ] × . . . × [ad , bd ] ⊂ U with
−∞ < ai < bi < ∞ for i = 1, . . . , d, and a1 , . . . , ad , b1 , . . . , bd ∈ Q,
λ(T(R)) ≤ ∫_R |det(J(x))| dx . (1.4)

Step 2. For all A ∈ B(Rd ),


λ(T(A ∩ U)) ≤ ∫_{A∩U} |det(J(x))| dx . (1.5)

Step 3. For any non-negative measurable function f : V → R,


∫_U f(T(x)) |det(J(x))| dx ≥ ∫_V f(y) dy . (1.6)

Step 4. The inequality in (1.6) is an equality.


The proof of Step 1., which is the main step of the proof, is based on the
idea that locally T is like a linear transformation.
Proof of Step 1. Fix a compact rectangle R = [a1 , b1 ] × . . . × [ad , bd ] ⊂ U where
ai < bi and a1 , . . . , ad , b1 , . . . , bd ∈ Q. Let ε > 0. Since det(J(·)) is a continuous
function, it is uniformly continuous on R. Choose δ1 > 0 such that

|det(J(x)) − det(J(x′))| ≤ ε for all x, x′ ∈ R, ‖x − x′‖ ≤ δ1 , (1.7)

where ‖ · ‖ denotes the L∞ norm as in (1.3) throughout.

Recall that the function A ↦ A⁻¹, from the space of d × d non-singular matrices to itself, is continuous. Since J(x) is non-singular for all x ∈ U, the map x ↦ J(x)⁻¹ is continuous on U. Thus,

f : R × {z ∈ Rd : ‖z‖ = 1} → Rd ,

defined by

f(x, z) = J(x)⁻¹ z , (x, z) ∈ R × {z ∈ Rd : ‖z‖ = 1} ,

is a continuous function defined on a compact set; elements of Rd are viewed as d × 1 vectors by convention. Therefore,

c = max { ‖f(x, z)‖ : (x, z) ∈ R × {z ∈ Rd : ‖z‖ = 1} } < ∞ .

In other words,

‖J(x)⁻¹ z‖ ≤ c‖z‖ , x ∈ R, z ∈ Rd . (1.8)
Denote by Jij (x) the (i, j)-th entry of J(x) for all x ∈ U and 1 ≤ i, j ≤ d.
Uniform continuity of Jij (·) on R ensures the existence of δ2 > 0 such that
|Jij(x) − Jij(x′)| ≤ ε/(cd) for all x, x′ ∈ R, ‖x − x′‖ ≤ δ2 . (1.9)
Let 0 < δ ≤ min{δ1 , δ2 } be such that δ −1 (bi − ai ) is an integer for every i.
Choosing such a δ is possible because bi − ai is rational; if pi , qi are positive
integers with bi − ai = pi /qi , letting
δ = 1/(n q1 . . . qd) ,
works for large n, for example.
Consider the square

[a1 + (i1 − 1)δ, a1 + i1 δ] × . . . × [ad + (id − 1)δ, ad + id δ] ,

where i1 , . . . , id are positive integers with ij ≤ δ −1 (bj − aj ) for j = 1, . . . , d.


Denote the collection of all such squares by {Q1 , . . . , Qk }. In other words,
Q1 , . . . , Qk are compact squares of side-length δ such that

R = Q1 ∪ . . . ∪ Qk ,

and λ(Qi ∩ Qj) = 0 for 1 ≤ i < j ≤ k. Let xi be the centre of Qi (the centre of a square or a rectangle is well defined). Recalling that ‖ · ‖ is the L∞ norm, write

Qi = B_{δ/2}(xi) , i = 1, . . . , k , (1.10)

where for r ≥ 0 and z = (z1, . . . , zd) ∈ Rd,

Br(z) = {y ∈ Rd : ‖y − z‖ ≤ r} = [z1 − r, z1 + r] × . . . × [zd − r, zd + r] . (1.11)

The above is precisely the advantage of working with the L∞ norm.
For i = 1, . . . , k, define

φi(z) = J(xi)(z − xi) + T(xi) , z ∈ Rd .

Our first claim is that

T(Qi) ⊂ φi(Qi^ε) , i = 1, . . . , k , (1.12)

where

Qi^ε = B_{(1+ε)δ/2}(xi) , i = 1, . . . , k .
Proceeding towards proving (1.12), fix i ∈ {1, . . . , k}, and use Fact 1.3 along
with (1.9) to claim that for all z ∈ Qi ,
‖T(z) − T(xi) − J(xi)(z − xi)‖ ≤ (ε/c) ‖z − xi‖ .

Since the left hand side above equals ‖T(z) − φi(z)‖, it follows that

‖T(z) − φi(z)‖ ≤ (ε/c) ‖z − xi‖ , z ∈ Qi . (1.13)
Therefore, for z ∈ Qi,

‖φi⁻¹ ∘ T(z) − z‖ = ‖φi⁻¹ ∘ T(z) − φi⁻¹ ∘ φi(z)‖
                 = ‖J(xi)⁻¹ (T(z) − φi(z))‖
                 ≤ c ‖T(z) − φi(z)‖
                 ≤ ε ‖z − xi‖ ,

(1.8) and (1.13) implying the inequalities in the penultimate line and the last line, respectively. Thus, for z ∈ Qi,

‖φi⁻¹ ∘ T(z) − xi‖ ≤ ‖φi⁻¹ ∘ T(z) − z‖ + ‖z − xi‖ ≤ (1 + ε) ‖z − xi‖ .

Recall (1.10) to argue that

φi⁻¹ ∘ T(z) ∈ Qi^ε , z ∈ Qi ,

which is equivalent to (1.12).


An immediate implication of (1.12) is that for fixed i = 1, . . . , k,

λ(T(Qi)) ≤ λ({J(xi)z + T(xi) − J(xi)xi : z ∈ Qi^ε})
         = λ({J(xi)z : z ∈ Qi^ε})
         = |det(J(xi))| λ(Qi^ε) ,

the second line following from the translation-invariance of the Lebesgue measure, and Fact 1.1 together with the observation that Qi^ε is a rectangle implying the last line. This is the crux of the proof in that it shows how the modulus of the determinant of the Jacobian appears. Further, (1.11) shows Qi^ε is a square of side-length (1 + ε)δ. Therefore,

λ(Qi^ε) = (1 + ε)^d δ^d = (1 + ε)^d λ(Qi) ,

(1.10) implying the second equality. Put everything together to get

λ(T(Qi)) ≤ |det(J(xi))| (1 + ε)^d λ(Qi) .

Thus,

λ(T(R)) ≤ Σ_{i=1}^k λ(T(Qi))
        ≤ (1 + ε)^d Σ_{i=1}^k |det(J(xi))| λ(Qi)
        ≤ (1 + ε)^d Σ_{i=1}^k ( ε + min_{z∈Qi} |det(J(z))| ) λ(Qi)
        ≤ (1 + ε)^d ( ελ(R) + ∫_R |det(J(x))| dx ) ,

(1.7) and the fact that δ ≤ δ1 implying the penultimate line. Since the above holds for all ε > 0, letting ε ↓ 0 completes the proof of Step 1.
While Step 1. was mostly based on analysis and linear algebra, the proof of
Step 2. is standard in measure theory and follows from Exc 1.7.
Proof of Step 2. Define measures µ and ν on Rd by

µ(A) = λ(T (A ∩ U )) , A ∈ B(Rd ) ,

and

ν(B) = ∫_{B∩U} |det(J(x))| dx , B ∈ B(Rd) .
The claim (1.5) is equivalent to

µ(A) ≤ ν(A) , A ∈ B(Rd ) . (1.14)

In view of Exc 1.7, it suffices to show that the claim holds for any compact
rectangle with rational corners, that is,

µ(R) ≤ ν(R) , (1.15)

if R = [a1 , b1 ] × . . . × [ad , bd ] ⊂ U for some a1 , . . . , ad , b1 , . . . , bd ∈ Q with ai < bi ,


which is precisely what has been shown in Step 1.
The proof of Step 3., which is also standard, is based on approximating a
non-negative measurable function by simple functions from below.

Proof of Step 3. First let f : V → R be a non-negative simple function, that is,
f = Σ_{i=1}^k αi 1_{Ai} ,

for some α1 , . . . , αk ∈ [0, ∞] and A1 , . . . , Ak ∈ B(Rd ) with Ai ⊂ V for all i.


Then,
∫_V f(y) dy = Σ_{i=1}^k αi λ(Ai)
           = Σ_{i=1}^k αi λ(T(T⁻¹Ai))
           ≤ Σ_{i=1}^k αi ∫_{T⁻¹Ai} |det(J(x))| dx
           = ∫_U |det(J(x))| Σ_{i=1}^k αi 1_{T⁻¹Ai}(x) dx
           = ∫_U |det(J(x))| Σ_{i=1}^k αi 1_{Ai}(T(x)) dx
           = ∫_U |det(J(x))| f(T(x)) dx ,

the inequality in the third line following from Step 2. Thus,


∫_V f(y) dy ≤ ∫_U |det(J(x))| f(T(x)) dx . (1.16)

For a measurable function f : V → [0, ∞), there exist non-negative simple


functions fn such that fn ↑ f . The desired inequality (1.16) holds with f
replaced by fn therein. Letting n → ∞ with the help of MCT, the proof of Step
3. follows.
Step 4. is a consequence of the inverse function theorem.
Proof of Step 4. Fact 1.2 and the assumption that J(x) is non-singular for all
x ∈ U imply that T −1 : V → U is a continuously differentiable bijection whose
Jacobian matrix is J(T −1 y)−1 for all y ∈ V . Using Step 3. with U, V, T replaced
by V, U, T −1 implies
∫_U g(x) dx ≤ ∫_V g ∘ T⁻¹(y) |det(J(T⁻¹y)⁻¹)| dy , (1.17)

for any measurable g : U → [0, ∞).


Fix a measurable f : V → [0, ∞). Define
g(x) = f ◦ T (x)| det(J(x))| , x ∈ U .

Apply (1.17) to this g to get

∫_U f ∘ T(x) |det(J(x))| dx ≤ ∫_V g ∘ T⁻¹(y) |det(J(T⁻¹y)⁻¹)| dy
                            = ∫_V f(y) |det(J(T⁻¹y))| |det(J(T⁻¹y)⁻¹)| dy
                            = ∫_V f(y) dy .

Compare this with (1.6) obtained in Step 3. to get


∫_U f ∘ T(x) |det(J(x))| dx = ∫_V f(y) dy .

This completes the proof of Step 4. and that of Theorem 1.12 as well.

2 Random experiments and random variables


A random experiment is an experiment for which there is a set of possible
outcomes, of which any one may occur. Though it cannot be predicted which
outcome will occur, the “probabilities” of those outcomes are understood from
intuition. For example, if a fair coin is tossed, then the possible outcomes are
head and tail, each occurring with probability 1/2. No attempts will be made
to give a mathematical definition of random experiments.
Given a random experiment, a probability space is associated with it, which
naturally captures our intuition about the experiment. In other words, the
probability space is the mathematically precise starting point of the study of
probability theory. This is best understood from the following few examples.
The set of all possible outcomes is called the “sample space” and is usually
denoted by Ω.
Example 2.1. A fair die is rolled n times. The sample space of this experiment is

Ω = {(x1, . . . , xn) : xi ∈ {1, . . . , 6}, i = 1, . . . , n} .

The probability space associated with this experiment is (Ω, A, P) where A = P(Ω) is the power set of Ω and

P(A) = #A / #Ω , A ⊂ Ω .
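For small n the probability space here can be enumerated by brute force. A short Python sketch (illustrative only; n = 3 and the event “the faces sum to 10” are arbitrary choices):

from itertools import product

n = 3
omega = list(product(range(1, 7), repeat=n))   # all 6**n equally likely outcomes

def prob(event):
    # P(A) = #A / #Omega on a finite sample space with the uniform measure
    return len(event) / len(omega)

A = [w for w in omega if sum(w) == 10]
print(prob(A))   # 27/216 = 0.125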
Example 2.2. A fair coin is tossed till the first head is obtained. The sample space is

Ω = {H, TH, TTH, . . .} .

Define p : Ω → [0, 1] by

p(T . . . T (n times) H) = 2^{−n−1} , n = 0, 1, 2, . . . ,


and

P(A) = Σ_{ω∈A} p(ω) , A ⊂ Ω .

Thus (Ω, P(Ω), P ) is the probability space associated with this experiment.
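The countable sample space here can also be handled directly. A small sketch (not part of the notes) checks that the probabilities p(T . . . T H) = 2^{−n−1} sum to 1 up to a truncation error, and evaluates P(A) for the event A that at least three tosses are needed:

# the outcome with n tails followed by the first head has probability 2**-(n+1)
N = 60                                    # truncation; the neglected tail mass is 2**-(N+1)
p = [2.0 ** -(n + 1) for n in range(N + 1)]
print(sum(p))                             # ~ 1, the total mass
print(sum(p[2:]))                         # P(at least three tosses) = 1/4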
As in the above examples, if the sample space Ω is countable and p(ω) ≥ 0
is the probability of the outcome ω for all ω ∈ Ω, where
Σ_{ω∈Ω} p(ω) = 1 ,

then letting

P(A) = Σ_{ω∈A} p(ω) , A ⊂ Ω ,

(Ω, P(Ω), P) is the natural probability space. The above method, however, fails
for an uncountable sample space. For random experiments with such a sample
space, measure theory is essential, as illustrated in the next example.
Example 2.3. A fair coin is tossed infinitely often. That is, for n = 1, 2, 3, . . .,
there is a n-th toss which yields either a head or a tail. The sample space of
this experiment is

Ω = (ω1 , ω2 , ω3 , . . .) : ωn ∈ {H, T } , n = 1, 2, . . . ,

which is clearly uncountable. In order to associate a probability measure on a


suitably chosen collection of subsets of Ω, we first need to decide what are the
events of interest. In practice, we would be interested in events whose occurrence
is decided by the first finitely many tosses, or events that can be constructed from
them. Keeping this in mind, define

E_{ω1...ωn} = {(ω1′, ω2′, . . .) ∈ Ω : ωi′ = ωi , 1 ≤ i ≤ n} , n ∈ N, ω1, . . . , ωn ∈ {H, T} .

That is, Eω1 ...ωn is the event that the first toss yields ω1 , the outcome of the
second toss is ω2 , and so on till the n-th toss. Define

S = {∅} ∪ {E_{ω1...ωn} : n ∈ N, ω1, . . . , ωn ∈ {H, T}} .

The following several exercises show that there is a unique probability measure
P on (Ω, σ(S)) satisfying

P(E_{ω1...ωn}) = 2^{−n} , n = 1, 2, . . . , ω1, . . . , ωn ∈ {H, T} , (2.1)

which is precisely the claim of Exc 2.6.


Exercise 2.1. If S is a semi-field and A, B ∈ S, show that

A \ B = C1 ∪ . . . ∪ Ck ,

for some k ≥ 1 and disjoint C1 , . . . , Ck ∈ S.

Definition 15. Given a non-empty set Ω and a non-empty collection C of
subsets of Ω, C has the “Cantor intersection property” if the following holds.
Whenever C1, C2, C3, . . . ∈ C are such that C1 ⊃ C2 ⊃ C3 ⊃ . . . and Cn ≠ ∅ for all n, it holds that

⋂_{n=1}^∞ Cn ≠ ∅ .

For example, the collection of all compact sets of Rd has the Cantor inter-
section property.
Exercise 2.2. Suppose S is a semi-field having the Cantor intersection property. If A1, A2, . . . ∈ S are disjoint and ⋃_{n=1}^∞ An ∈ S, show that An = ∅ for all but finitely many n’s.
Hint.: Assume A1 , A2 , . . . ∈ S are disjoint,

A = ⋃_{n=1}^∞ An ∈ S ,

and An ≠ ∅ for infinitely many n’s. Use Exc 2.1 to write

A \ A1 = C1 ∪ . . . ∪ Ck ,

for some C1 , . . . , Ck ∈ S. Then for some i,

Ci ⊄ A2 ∪ . . . ∪ An for all finite n ,

because otherwise A is the union of finitely many An ’s which would imply that
all but finitely many of An ’s are empty. Let B1 = Ci . Apply Exc 2.1 to B1 \ A2
to get B2 ∈ S such that B2 ⊂ B1 \ A2 and

B2 ⊄ A3 ∪ . . . ∪ An for all finite n .

Proceed inductively to obtain Bn+1 ∈ S with

Bn+1 ⊂ Bn \ An+1 , (2.2)

and Bn+1 ⊄ An+2 ∪ . . . ∪ An+k for any finite k.


Thus B1 , B2 , . . . ∈ S with B1 ⊃ B2 ⊃ . . ., and (2.2) implies (by induction)

∅ ≠ Bn ⊂ An+1 ∪ An+2 ∪ . . . , n ≥ 1 .

The Cantor intersection property of S implies



∅ ≠ ⋂_{n=1}^∞ Bn ⊂ ⋂_{n=1}^∞ ⋃_{k=n}^∞ Ak ,

which is a contradiction because A1 , A2 , . . . are disjoint.

Exercise 2.3. Let Ω and S be as in Example 2.3. Show that S is a semi-field
having the Cantor intersection property.
Exercise 2.4. Let Ω and S be as in Example 2.3. Define P : S → [0, 1] by

P (E) = 0 if E = ∅ ,

and by (2.1) otherwise. Show that P is finitely additive on S, that is, if


A1 , . . . , An ∈ S are disjoint such that A1 ∪ . . . ∪ An ∈ S, then
P(A1 ∪ . . . ∪ An) = P(A1) + . . . + P(An) .

Hint.: Prove this by induction on n. This is a tautology for n = 1 and easy


to prove for n = 2. Assuming it for n, prove it for n + 1 by proceeding along the
following lines. Suppose A1 , . . . , An+1 ∈ S are disjoint (and non-empty WLOG)
and their union is in S. Thus,

Ai = E_{ω^i_1 ω^i_2 . . . ω^i_{ki}} for some ki ≥ 1 and ω^i_1, . . . , ω^i_{ki} ∈ {H, T} , i = 1, . . . , n + 1 .

Assume WLOG that k1 = k1 ∨ . . . ∨ kn+1. Let u be the opposite of ω^1_{k1}, that is, u = H if ω^1_{k1} = T and vice versa. Argue that for some i,

(ω^1_1, . . . , ω^1_{k1−1}, u, u, u, . . .) ∈ Ai .

For this i, show that ki = k1 and

(ω^i_1, . . . , ω^i_{ki}) = (ω^1_1, . . . , ω^1_{k1−1}, u) .

Thus A1 ∪ Ai = E_{ω^1_1 . . . ω^1_{k1−1}}, which allows the induction hypothesis to be used.

Exercise 2.5. Let Ω and S be as in Example 2.3. Use Exc 2.2 and 2.3 to show that if A1, A2, . . . ∈ S are disjoint such that

⋃_{n=1}^∞ An ∈ S ,

then all but finitely many of An ’s are empty. Hence argue that P defined in Exc
2.4 is countably additive on S.
Exercise 2.6. Let Ω and S be as in Example 2.3. Use the corollary of Theorems
1.1 and 1.3 to show that there exists a unique probability measure P on (Ω, σ(S))
satisfying (2.1).
Now that the natural association of a probability space with a random ex-
periment is understood, we shall define a random variable and its C.D.F.

Definition 16. A random variable X defined on a probability space (Ω, A, P )
is a measurable function X : Ω → R̄ such that

P(X⁻¹{−∞, ∞}) = 0 . (2.3)

A measurable function X for which (2.3) fails is an improper random variable.


Given a random variable X, its cumulative distribution function (C.D.F.) is a
function F : R → [0, 1] defined by

F(x) = P(X⁻¹(−∞, x]) , x ∈ R .


It should be noted that an improper random variable is not a random vari-


able. The following theorem gives necessary and sufficient conditions for a
function to be a C.D.F.
Theorem 2.1. If F is the C.D.F. of a random variable X, then
1. F is non-decreasing,
2. F is right continuous,
3. F (−∞) = 0,
4. and F (∞) = 1.
Conversely, if F : R → [0, 1] is a function satisfying 1.–4. above, then there
exists a random variable X defined on some probability space whose C.D.F. is
F.
Above and elsewhere, the convention adopted is

F(∞) = lim_{x→∞} F(x) whenever it exists ,

and

F(−∞) = lim_{x→−∞} F(x) whenever it exists .

Proof of Theorem 2.1. The proof of 1.–4. is easy and is left as an exercise when
F is a C.D.F. Conversely, for a F : R → [0, 1] satisfying 1.–4., Theorem 1.4
guarantees the existence of a unique probability measure P on (R, B(R)) satis-
fying
P ((a, b]) = F (b) − F (a) , −∞ < a < b < ∞ .
Letting Ω = R, A = B(R) and X : R → R to be the identity function, it is easy
to see that F is the C.D.F. of X which is a random variable on (Ω, A, P ). This
completes the proof.
Henceforth, (Ω, A, P ) will be the probability space underlying any random
variable talked about, unless explicitly mentioned otherwise. Theorem 2.1 guar-
antees that such a probability space exists whenever a few conditions are satis-
fied.

Definition 17. Given a possibly improper random variable X, its distribution,
usually denoted by P (X ∈ ·) or P ◦ X −1 (·), is the push forward measure of P
on (R̄, B(R̄)) under X, that is,
P (X ∈ B) = P ◦ X −1 (B) = P ({ω ∈ Ω : X(ω) ∈ B}) , B ∈ B(R̄) .
For a Borel function f : R̄ → R̄, we denote by
∫_{R̄} f(x) P(X ∈ dx) or ∫_{R̄} f(x) P ∘ X⁻¹(dx) ,

the integral of f with respect to the distribution of X whenever it is defined.


Exercise 2.7. For a random variable X, show that its distribution is the unique
measure µ on (R, B(R)) satisfying
µ ((a, b] ∩ R) = F (b) − F (a) , −∞ ≤ a ≤ b ≤ ∞ ,
where F is the C.D.F. of X.
Definition 18. For a possibly improper random variable X, its expectation is
defined as

E(X) = ∫_Ω X(ω) P(dω) ,

whenever the integral on the right hand side makes sense.
Note that E(X) is defined if either of E(X + ) and E(X − ) is finite whereas
E(X) is finite when both are finite which happens if and only if E(|X|) < ∞.
The following is a formula relating expectation with C.D.F.
Theorem 2.2. For a possibly improper random variable X whose expectation is defined,

E(X) = ∫_0^∞ P(X > x) dx − ∫_{−∞}^0 P(X ≤ x) dx . (2.4)

The following exercise is needed for the proof.


Exercise 2.8. For a possibly improper random variable X ≥ 0, show that
{(ω, x) ∈ Ω × R : 0 ≤ x < X(ω)} ∈ A ⊗ B(R) .
Hint.: First show this for a simple function X.
Proof of Theorem 2.2. We first show this when X ≥ 0. The definition of expec-
tation implies
E(X) = ∫_Ω X(ω) P(dω)
     = ∫_Ω ( ∫_0^∞ 1[0 ≤ x < X(ω)] dx ) P(dω)
     = ∫_{ {(ω,x) ∈ Ω×R : 0 ≤ x < X(ω)} } P(dω) ⊗ dx ,

the last line following from Tonelli and Exc 2.8. Use Tonelli again to write

E(X) = ∫_0^∞ ∫_Ω 1[0 ≤ x < X(ω)] P(dω) dx
     = ∫_0^∞ P(X > x) dx .

For X which is not necessarily non-negative,

E(X⁺) = ∫_0^∞ P(X⁺ > x) dx = ∫_0^∞ P(X > x) dx ,

the second equality following from the observation that for x ≥ 0, X > x ⇐⇒ X⁺ > x. Similarly,

E(X⁻) = ∫_0^∞ P(X⁻ > x) dx
      = ∫_{−∞}^0 P(X⁻ > −x) dx
      = ∫_{−∞}^0 P(X < x) dx     (X⁻ > −x ⇐⇒ X < x for x ≤ 0)
      = ∫_{−∞}^0 P(X ≤ x) dx ,

the last line following from the fact that P(X ≤ ·) and P(X < ·) differ on a set which is at most countable and hence has Lebesgue measure zero. Recalling that X = X⁺ − X⁻ and either of E(X⁺) and E(X⁻) is finite, the proof follows.
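Formula (2.4) is easy to check numerically for a concrete distribution. The following sketch (illustrative only) takes X to be Exponential(λ) with λ = 2, for which P(X > x) = e^{−λx} for x ≥ 0 and P(X ≤ x) = 0 for x < 0, so (2.4) should return the mean 1/λ:

import numpy as np

lam = 2.0
x = np.linspace(0.0, 50.0, 500_001)   # truncate the upper tail at 50
dx = x[1] - x[0]
tail = np.exp(-lam * x)               # P(X > x) for x >= 0; the second integral in (2.4) vanishes
print(np.sum(tail) * dx, 1.0 / lam)   # both ~ 0.5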
The following theorem relates the expectation with its distribution.
Theorem 2.3. For a possibly improper random variable X and a Borel function f : R̄ → R̄,

E(f(X)) = ∫ f(x) P(X ∈ dx) ,

whenever either side makes sense. In particular, if E(X) is defined then

E(X) = ∫ x P(X ∈ dx) .

Proof. Follows from Theorem 1.6.


The proof of the next result follows directly from the definition.
Theorem 2.4 (Linearity of expectation). If X and Y have finite expectations,
then so does αX + βY for α, β ∈ R and then

E (αX + βY ) = αE(X) + βE(Y ) .

Proof. Follows from the inequality |αX + βY | ≤ |α||X| + |β||Y | and the fact
that integral on a measure space is monotone and linear.
If Definition 18 were replaced by any other definition, for example, by (2.4),
then proving the above theorem would have become extremely difficult.
Definition 19. For a random variable X with a finite mean µ, its variance is
defined as
Var(X) = E[(X − µ)²] .

The standard deviation of X is √Var(X).
Theorem 2.5. The variance of X is defined and finite if and only if E(X 2 ) <
∞, in which case,
Var(X) = E(X²) − (E(X))² . (2.5)
For the proof, the following fact is needed.
Fact 2.1 (Cauchy-Schwarz inequality). If f, g are measurable functions from a
measure space (Ω, A, µ) to R, then
∫ |fg| dµ ≤ ( ∫ f² dµ )^{1/2} ( ∫ g² dµ )^{1/2} .

Proof of Theorem 2.5. A restatement of the Cauchy-Schwarz inequality when the measure is a probability measure is

E(|XY|) ≤ √(E(X²) E(Y²)) ,

for random variables X and Y. Take Y to be identically 1 and square both sides to get

(E(|X|))² ≤ E(X²) . (2.6)

If E(X²) < ∞, then (2.6) shows X has a finite expectation. Let µ = E(X). Write

(X − µ)² = X² − 2µX + µ² .

Since X² and X have a finite expectation, the linearity of expectation implies so does the left hand side and

E[(X − µ)²] = E(X²) − 2µE(X) + µ² = E(X²) − µ² ,

which proves the “if” part and (2.5).

Conversely, if Var(X) is defined and finite, that is, if E(X) = µ is finite and so is

E[(X − µ)²] ,

then writing

X² = (X − µ)² + 2µX − µ² ,

it follows that E(X²) < ∞. This shows the “only if” part and thus completes the proof.

The formula (2.5) is used almost always for calculating variance. A word of
caution: (2.6) should not be misinterpreted as
( ∫ |f| dµ )² ≤ ∫ f² dµ ,

when µ is not a probability measure. The above is clearly false, for example,
if µ is the Lebesgue measure on R and f (x) = x−1 1(x > 1) because then the
right hand side is finite whereas the left hand side is not.
Definition 20. A random variable X is discrete if there exists a countable set
C ⊂ R such that P (X ∈ C) = 1. The probability mass function of a discrete
random variable X is the function f : R → [0, 1] defined by
f (x) = P (X = x) , x ∈ R .
Theorem 2.6. If X is a discrete random variable, then for any measurable
f : R → R̄,

E(f(X)) = Σ_{x∈R} f(x) P(X = x) , (2.7)
whenever the left hand side is defined, where the sum on the right hand side is to
be interpreted as the sum over those x for which P (X = x) > 0. In particular,
if X has an expectation, then
E(X) = Σ_{x∈R} x P(X = x) .

Proof. Let C = {x ∈ R : P (X = x) > 0}. Since X is discrete, P (X ∈ C) = 1.


Let µ be the counting measure on C. The fact that for A, E ∈ A with P (E) = 1,
P (A ∩ E) = P (A), implies for B ∈ B(R),
P(X ∈ B) = P(X ∈ B ∩ C)
         = Σ_{x∈B∩C} P(X = x)
         = ∫_B P(X = x) µ(dx) .

In other words,

P(X ∈ dx)/µ(dx) = P(X = x) ,
that is, P (X = ·) is the Radon-Nikodym derivative of P (X ∈ ·) with respect to
µ. Exc 1.10 shows that for a measurable f : R → R̄,
∫_R f(x) P(X ∈ dx) = ∫_R f(x) P(X = x) µ(dx) ,

whenever either side is defined. The left hand side equals E(f(X)) by Theorem 2.3, and the right hand side is simply Σ_{x∈C} f(x) P(X = x). Since P(X = x) = 0 for x ∉ C, this completes the proof of (2.7). The second claim being a special case of (2.7), the proof follows.

Example 2.4. A coin with probability p of heads is tossed infinitely often. Proceeding as in Example 2.3, construct the probability space for this experiment. That is, if Ω and S are as therein, show that P : S → [0, 1], defined by

P(E_{ω1...ωn}) = ∏_{i=1}^n [p 1(ωi = H) + q 1(ωi = T)] , n ∈ N , ω1, . . . , ωn ∈ {H, T} ,

where q = 1 − p and P (∅) = 0, is countably additive. Use Theorem 1.1 to


complete the construction.
Let X be the number of heads obtained in the first n tosses. Show that the
PMF of X is

P(X = k) = (n choose k) p^k q^{n−k} , k = 0, 1, . . . , n .
The convention followed here and elsewhere is that left hand side is to be in-
terpreted as zero for those k for which it has not been defined, that is, for
k ∈ R \ {0, 1, . . . , n} in this case. The distribution of X is called Binomial(n, p).
Check that its mean and variance are np and npq, respectively. If n = 1, then
this has another name, which is, Bernoulli(p). In other words, a Bernoulli(p)
random variable takes the values 0 and 1 with probabilities q and p, respectively.
Let Y be the number of tosses needed to get the first head. The PMF of Y is
P(Y = n) = q^{n−1} p , n ∈ N .
The distribution of Y is called Geometric(p). Check that its mean and variance
are p−1 and p−2 q, respectively.
For a fixed k = 1, 2, 3, . . ., let Z be the number of tosses needed to get the
k-th head. The PMF of Z is
 
P(Z = n) = (n−1 choose k−1) q^{n−k} p^k , n ∈ {k, k + 1, k + 2, . . .} .

Since the right hand side is the (n−k+1)-th term in the expansion of p^k (1−q)^{−k}, the distribution of Z is called Negative Binomial(k, p). Check that its mean and variance are kp⁻¹ and kp⁻²q, respectively.
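The means and variances claimed in this example can be verified numerically from the PMFs. A small sketch (illustrative only; the parameters n = 10, p = 0.3, k = 4 and the truncation points are arbitrary):

from math import comb

def moments(pmf):
    # mean and variance of a PMF given as a dict {value: probability}
    m = sum(j * p for j, p in pmf.items())
    return m, sum(j * j * p for j, p in pmf.items()) - m * m

n, p = 10, 0.3
q = 1 - p
binom = {j: comb(n, j) * p**j * q**(n - j) for j in range(n + 1)}
print(moments(binom), (n * p, n * p * q))

geom = {j: q**(j - 1) * p for j in range(1, 2000)}                    # truncated support
print(moments(geom), (1 / p, q / p**2))

k = 4
negbin = {j: comb(j - 1, k - 1) * q**(j - k) * p**k for j in range(k, 2000)}
print(moments(negbin), (k / p, k * q / p**2))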
Exercise 2.9. Suppose Xn is a Bin(n, pn ) random variable defined on a prob-
ability space (Ωn , An , Pn ). If
lim_{n→∞} n pn = λ ∈ (0, ∞) ,

show that for k = 0, 1, 2, . . .,

lim_{n→∞} Pn(Xn = k) = e^{−λ} λ^k / k! .
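A numerical illustration of this limit (a sketch with arbitrary λ and k): the Bin(n, λ/n) probabilities approach the Poisson(λ) ones as n grows.

from math import comb, exp, factorial

lam, k = 3.0, 2
poisson = exp(-lam) * lam**k / factorial(k)
for n in (10, 100, 1000, 10_000):
    p = lam / n
    print(n, comb(n, k) * p**k * (1 - p)**(n - k), poisson)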
Exercise 2.10. Show using Theorem 2.1 that there exists a discrete random
variable X defined on some probability space with
P(X = n) = e^{−λ} λ^n / n! , n ∈ {0, 1, 2, . . .} .

The distribution of X is called Poisson(λ). Show that its mean and variance
both equal λ.
Definition 21. A random variable X is continuous if

P (X = x) = 0 for all x ∈ R .

Exercise 2.11. If X is a random variable with C.D.F. F , show that for all
x ∈ R,
P (X = x) = F (x) − F (x−) .
Hence argue that X is a continuous random variable if and only if F is a con-
tinuous function.
Definition 22. A Borel function f : R → [0, ∞) is the density of a random
variable X if

P(X ∈ B) = ∫_B f(x) dx , for all B ∈ B(R) .

Exercise 2.12. 1. Show that “f is the density of X” is equivalent to

P(X ∈ dx)/dx = f(x) , x ∈ R ,
that is, f is the Radon-Nikodym derivative of P (X ∈ ·) with respect to the
Lebesgue measure.
2. Prove that if f and g are densities of X, then f = g a.e. In other words, a
density is unique upto a set of Lebesgue measure zero.
3. Prove that a non-negative Borel function f on R is a density of X having
C.D.F. F if and only if
∫_{−∞}^x f(t) dt = F(x) , x ∈ R .

4. Show that a random variable is continuous if it has a density.


Definition 23. A function F : R → R is absolutely continuous if given ε > 0 there exists δ > 0 such that

Σ_{i=1}^n |F(yi) − F(xi)| ≤ ε ,

whenever x1 ≤ y1 ≤ x2 ≤ y2 ≤ . . . ≤ xn ≤ yn are such that

Σ_{i=1}^n (yi − xi) ≤ δ .

Theorem 2.7. A random variable has a density if and only if its C.D.F. is
absolutely continuous.

The proof uses the following exercise.
Exercise 2.13. If h is an integrable function on a measure space (Ω, A, µ), then
given ε > 0 there exists δ > 0 such that
∫_A |h| dµ ≤ ε ,

for all A ∈ A with µ(A) ≤ δ.


Proof of Theorem 2.7. We start with proving the “if” part. Let X be a random
variable with an absolutely continuous C.D.F. F . In view of Theorem 1.7, which
is the Radon-Nikodym theorem, it suffices to show

P(X ∈ ·) ≪ λ , (2.8)

λ being the Lebesgue measure. To prove this, fix any B ∈ B(R) with λ(B) = 0.
We shall prove P (X ∈ B) = 0 by showing for any ε > 0

P (X ∈ B) ≤ ε . (2.9)

Fix ε > 0. Absolute continuity of F implies there exists δ > 0 such that
Σ_{i=1}^n |F(yi) − F(xi)| ≤ ε ,

whenever x1 ≤ y1 ≤ x2 ≤ y2 ≤ . . . ≤ xn ≤ yn are such that

Σ_{i=1}^n (yi − xi) ≤ δ .

Since λ(B) = 0 and the Lebesgue measure is regular, see Exc 1.5, there exists an open set U ⊂ R with U ⊃ B and λ(U) ≤ δ. An open subset of R is the union of countably many disjoint open intervals, that is,

U = ⋃_{i≥1} (xi, yi) ,

for some x1, y1, x2, y2, . . . satisfying xi < yi and (xi, yi) ∩ (xj, yj) = ∅ for all i ≠ j. Therefore,

Σ_{i=1}^n (yi − xi) = λ( ⋃_{i=1}^n (xi, yi) ) ≤ λ(U) ≤ δ ,

showing that

ε ≥ Σ_{i=1}^n [F(yi) − F(xi)]
  = Σ_{i=1}^n [F(yi−) − F(xi)]     (absolute continuity implies continuity)
  = Σ_{i=1}^n P(xi < X < yi)
  = P( X ∈ ⋃_{i=1}^n (xi, yi) ) .

Since

P( X ∈ ⋃_{i=1}^n (xi, yi) ) ↑ P(X ∈ U) ≥ P(X ∈ B) ,

(2.9) follows. Arbitrariness of ε shows (2.8) which by an appeal to Theorem


1.7 proves the existence of the density of X which is nothing but the Radon-
Nikodym derivative of P ◦ X −1 with respect to Lebesgue.
Conversely, suppose that X has a density f . That is, f : R → [0, ∞) is Borel
and satisfies

∫_{−∞}^∞ f(x) dx = 1 .

Fix ε > 0. Use Exc 2.13 to choose δ > 0 such that


∫_B f(x) dx ≤ ε ,

for all B ∈ B(R) with λ(B) ≤ δ. For n ≥ 1, fix x1 ≤ y1 ≤ . . . ≤ xn ≤ yn with


Σ_{i=1}^n (yi − xi) ≤ δ .

Letting B = (x1 , y1 ] ∪ . . . ∪ (xn , yn ], the above simply means λ(B) ≤ δ. The


choice of δ implies
ε ≥ ∫_B f(x) dx = P(X ∈ B) = Σ_{i=1}^n [F(yi) − F(xi)] ,

F being the C.D.F. of X. Thus F is absolutely continuous. This proves the


“only if” part and thereby completes the proof.
The following result is most useful in getting the density of a random variable
when it exists.

Theorem 2.8. A random variable X with C.D.F. F has a density if and only if

∫_{−∞}^∞ f(x) dx = 1 ,

where

f(x) = (d/dx) F(x) if F is differentiable at x , and f(x) = 0 otherwise . (2.10)

In that case, f is the density of X.
The proof uses the following facts.
Fact 2.2 (Theorem 31.2, pg 404, Billingsley (1995)). A non-decreasing function
F : [a, b] → R is differentiable a.e. on (a, b). If f is as in (2.10), then f is Borel
measurable, non-negative and satisfies
∫_a^b f(x) dx ≤ F(b) − F(a) .

Fact 2.3 (Theorem 31.3, pg 406, Billingsley (1995)). If f : [a, b] → [0, ∞) is


integrable and

F(x) = ∫_a^x f(t) dt , x ∈ [a, b] ,

then F is differentiable a.e. on (a, b) and

(d/dx) F(x) = f(x) for almost all x ∈ (a, b) .
Proof of Theorem 2.8. We start with proving the “if” part. Assume
∫_{−∞}^∞ f(x) dx = 1 , (2.11)

f being as in (2.10). Fact 2.2 implies


∫_a^b f(x) dx ≤ F(b) − F(a) , −∞ < a < b < ∞ .

Keeping b fixed and letting a → −∞, MCT shows that the left hand side goes
to the corresponding integral from −∞ to b. Since F is a C.D.F., F (−∞) = 0,
showing that
∫_{−∞}^b f(x) dx ≤ F(b) . (2.12)

A similar argument shows that for a fixed,


∫_a^∞ f(x) dx ≤ 1 − F(a) .

Thus
F(a) ≤ 1 − ∫_a^∞ f(x) dx
     = ∫_{−∞}^a f(x) dx     (using (2.11))
     ≤ F(a) ,

(2.12) implying the last line by putting b = a. Thus


∫_{−∞}^a f(x) dx = F(a) for all a ∈ R .

Exc 2.12.3 shows f is the density of X. This proves the “if” part.
For the “only if” part, assume X has a density g. That is,
∫_{−∞}^x g(t) dt = P(X ≤ x) = F(x) , x ∈ R .

Thus, for −∞ < a < b < ∞,


F(x) − F(a) = ∫_a^x g(t) dt , x ∈ [a, b] .

Fact 2.3 shows that the left hand side is differentiable a.e. on (a, b) and

(d/dx) F(x) = g(x) for a.e. x ∈ (a, b) .
Since this is true for all a, b with −∞ < a < b < ∞, the above equality holds
a.e. on R. A comparison with (2.10) shows f = g a.e. Therefore
∫_{−∞}^∞ f(x) dx = ∫_{−∞}^∞ g(x) dx = 1 ,

proving the “only if” part. This completes the proof.


Theorems 2.7 and 2.8 and their proofs essentially show the following result.

Theorem 2.9. If X is a random variable with C.D.F. F , then the following


are equivalent.
1. A density of X exists.
2. The function F is absolutely continuous.

3. The distribution of X is absolutely continuous with respect to Lebesgue,


that is,
P(X ∈ ·) ≪ λ(·) .

4. Given ε > 0 there exists δ > 0 such that

P (X ∈ B) ≤ ε whenever B ∈ B(R) and λ(B) ≤ δ .

5. If f is the derivative of F wherever it exists and zero elsewhere, then


∫_{−∞}^∞ f(x) dx = 1 .

If any of these hold, then f defined above is the density of X.

Proof. Exc.
The equivalence of 2. and 5. in the above theorem is important from the
point of view of analysis as well.
Theorem 2.10. If X has density f , then for any measurable g : R → R̄,
E(g(X)) = ∫_{−∞}^∞ g(x) f(x) dx ,

whenever either side makes sense. In particular,


E(X) = ∫_{−∞}^∞ x f(x) dx ,

if either side is defined.


Proof. Similar to the proof of Theorem 2.6 once it is observed that

f(x) = P(X ∈ dx)/dx .

Example 2.5. The distribution of X is Uniform(a, b) for −∞ < a < b < ∞ if its C.D.F. is

F(x) = 0 for x < a , F(x) = (x − a)/(b − a) for a ≤ x ≤ b , F(x) = 1 for x > b .

Check that the density of X is

f(x) = 1/(b − a) , a ≤ x ≤ b ,
with the usual convention of interpreting it as zero wherever it is not defined.
Show that the mean and variance of Uniform(a, b) are (a + b)/2 and (b − a)2 /12,
respectively.

Example 2.6. A random variable X follows standard normal or standard Gaussian if its density is

f(x) = (1/√(2π)) e^{−x²/2} , x ∈ R .

It is easy to see that E(X) = 0 and

E(X²) = ∫_{−∞}^∞ x² f(x) dx
      = √(2/π) ∫_0^∞ x² e^{−x²/2} dx
      = √(2/π) ∫_0^∞ x ( x e^{−x²/2} ) dx .

Integrating by parts with the help of the observation that

(d/dx)( −e^{−x²/2} ) = x e^{−x²/2} ,

we get

E(X²) = √(2/π) ( [ x ( −e^{−x²/2} ) ]_0^∞ − ∫_0^∞ ( −e^{−x²/2} ) dx )
      = √(2/π) ∫_0^∞ e^{−x²/2} dx
      = 1 .

Thus the mean and variance of the standard normal distribution are 0 and 1,
respectively.
The distribution with density

f(x) = (1/(σ√(2π))) exp( −(x − µ)²/(2σ²) ) , x ∈ R ,

is called Normal(µ, σ²) for µ ∈ R and σ > 0. Show that f defined as above is indeed a density, and that the mean and variance of Normal(µ, σ²) are µ and σ², respectively.
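These claims about Normal(µ, σ²) can be checked numerically by integrating the density on a fine grid; the sketch below (illustrative only) uses µ = 1, σ = 2 and truncates far out in the tails:

import numpy as np

mu, sigma = 1.0, 2.0
x = np.linspace(mu - 12 * sigma, mu + 12 * sigma, 1_000_001)
dx = x[1] - x[0]
f = np.exp(-((x - mu) ** 2) / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))
print(np.sum(f) * dx)                    # total mass, ~ 1
print(np.sum(x * f) * dx)                # mean, ~ mu
print(np.sum((x - mu) ** 2 * f) * dx)    # variance, ~ sigma**2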
Definition 24. Let (Ω, A, P ) be a probability space and A, B ∈ A with P (A) >
0. The conditional probability of B given A is defined as

P(B|A) = P(B ∩ A) / P(A) .

Example 2.7. Suppose we want a non-negative random variable X which has


the “memoryless” property, that is, P (X > t) > 0 for all t ≥ 0 and

P (X > s + t|X > t) = P (X > s) , s, t ≥ 0 .

This is the same as

P(X > s + t) = P(X > s) P(X > t) , s, t ≥ 0 ,

that is,

1 − F(s + t) = (1 − F(s))(1 − F(t)) , s, t ≥ 0 .

Letting G = log(1 − F), the condition is

G(s + t) = G(s) + G(t) , s, t ≥ 0 .

It follows that

G(r) = −λr , r ∈ [0, ∞) ∩ Q ,

where λ = −G(1) = −log(1 − F(1)) ≥ 0. Right continuity of G implies

G(x) = −λx , x ≥ 0 ,

that is,

F(x) = 1 − e^{−λx} , x ≥ 0 .

Since F(∞) = 1, it is necessary that λ > 0. As X ≥ 0, F(x) = 0 for x < 0.
For λ > 0, X follows Exponential(λ) if F defined by

F(x) = 1 − e^{−λx} for x ≥ 0 , and F(x) = 0 for x < 0 ,

is its C.D.F. Show that the density of Exponential(λ) is

f(x) = λ e^{−λx} , x > 0 ,

and its mean and variance are λ⁻¹ and λ⁻², respectively.
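The memoryless property that motivated this example can be verified directly from the C.D.F. A quick sketch (not from the notes), with an arbitrary λ and a few values of s and t:

from math import exp

lam = 1.5
def tail(x):                         # P(X > x) for X ~ Exponential(lam)
    return exp(-lam * x) if x >= 0 else 1.0

for s, t in [(0.5, 1.0), (2.0, 3.0), (0.1, 4.0)]:
    print(tail(s + t) / tail(t), tail(s))   # P(X > s+t | X > t) vs P(X > s): equal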
Definition 25. For α > 0, define
Γ(α) = ∫_0^∞ x^{α−1} e^{−x} dx .

Exercise 2.14. Show that


1. Γ(α) < ∞ for all α > 0,
2. Γ(α + 1) = αΓ(α) , α > 0 ,
3. and
Γ(n) = (n − 1)! , n ∈ N .
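Parts 2 and 3 of this exercise can be illustrated numerically by approximating the defining integral; the truncation and step size in the sketch below are ad hoc:

import numpy as np

def gamma_num(alpha, upper=60.0, n=600_000):
    # crude Riemann-sum approximation of the Gamma integral (illustration only)
    x = np.linspace(0.0, upper, n + 1)[1:]   # skip x = 0, where the integrand may blow up
    return np.sum(x ** (alpha - 1) * np.exp(-x)) * (upper / n)

print(gamma_num(5.5), 4.5 * gamma_num(4.5))  # Gamma(alpha + 1) = alpha * Gamma(alpha)
print(gamma_num(6), 120)                     # Gamma(6) = 5! = 120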

Example 2.8. For α > 0, Gamma(α) is the distribution with density

f(x) = (1/Γ(α)) x^{α−1} e^{−x} , x > 0 .

Check that its mean and variance both equal α.

Definition 26. For α, β > 0, define
B(α, β) = ∫_0^1 x^{α−1} (1 − x)^{β−1} dx . (2.13)

Exercise 2.15. 1. Show that the RHS of (2.13) is finite for α, β > 0.
2. Show that

B(1/2, 1/2) = π ,

by substituting x = sin²θ (that is, θ = sin⁻¹ √x) in the RHS of (2.13).
Example 2.9. For α, β > 0, X follows Beta(α, β) if its density is

f(x) = (1/B(α, β)) x^{α−1} (1 − x)^{β−1} , 0 < x < 1 .

When α = β = 1/2, show that X has C.D.F.

F(x) = 0 for x < 0 , F(x) = (2/π) sin⁻¹ √x for 0 ≤ x ≤ 1 , F(x) = 1 for x > 1 .

This is why the Beta(1/2, 1/2) distribution is also called the “arc-sine law”.
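The arc-sine C.D.F. can be compared against a direct numerical integration of the Beta(1/2, 1/2) density, using B(1/2, 1/2) = π from Exercise 2.15. An illustrative sketch:

import numpy as np

def beta_half_cdf(x, n=200_000):
    # integrate the density 1/(pi*sqrt(t(1-t))) on (0, x), staying away from the endpoints
    t = np.linspace(0.0, x, n + 1)[1:-1]
    return np.sum(1.0 / (np.pi * np.sqrt(t * (1.0 - t)))) * (x / n)

for x in (0.1, 0.25, 0.5, 0.9):
    print(beta_half_cdf(x), (2.0 / np.pi) * np.arcsin(np.sqrt(x)))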

3 Independence
In this chapter the concept of independence is studied, which is a fundamental
concept in probability theory. Unless mentioned otherwise, (Ω, A, P ) is the
probability space underlying everything we talk about. For example, all random
variables are defined on this space and any collection of sets we talk about is a
subset of A, unless the contrary is explicitly stated.
Definition 27. A collection of σ-fields A1 , . . . , An are independent if
P(A1 ∩ . . . ∩ An) = ∏_{i=1}^n P(Ai) ,

for all A1 ∈ A1 , . . . , An ∈ An .
The following is an important observation which connects the usually given
definition of independence of events with the above.
Exercise 3.1. If A1 , . . . , An are independent σ-fields and A1 , . . . , An belong to
A1 , . . . , An , respectively, show that

P (Ai1 ∩ . . . ∩ Aik ) = P (Ai1 ) . . . P (Aik ) ,

for all 1 ≤ i1 < . . . < ik ≤ n.

The above when n = 2 should be compared with Definition 24.
Theorem 3.1. If S1 , . . . , Sn are semi-fields such that
P( ⋂_{i=1}^n Ai ) = ∏_{i=1}^n P(Ai) , for all A1 ∈ S1 , . . . , An ∈ Sn ,

then σ(S1 ), . . . , σ(Sn ) are independent.

Proof. The first step is to show


P( ⋂_{i=1}^n Ai ) = ∏_{i=1}^n P(Ai) , for all A1 ∈ σ(S1 ), A2 ∈ S2 , . . . , An ∈ Sn . (3.1)

To that end, fix A2 ∈ S2 , . . . , An ∈ Sn and define µ1 , µ2 : A → [0, ∞) by


µ1(A) = P(A) ∏_{i=2}^n P(Ai) ,
µ2(A) = P(A ∩ A2 ∩ . . . ∩ An) ,

for all A ∈ A. Thus µ1 and µ2 are finite measures on (Ω, A), which agree on S1
by the hypothesis of the theorem. As S1 is a semi-field, Corollary 1.1 implies
that µ1 and µ2 agree on σ(S1 ). In other words, (3.1) holds.
We shall now show inductively that for i = 1, . . . , n,
P( ⋂_{j=1}^n Aj ) = ∏_{j=1}^n P(Aj) , (3.2)

for all A1 ∈ σ(S1 ), . . . , Ai ∈ σ(Si ), Ai+1 ∈ Si+1 , . . . , An ∈ Sn ; (3.1) shows


this holds for i = 1. As the induction hypothesis, assume (3.2) for some i ∈
{1, . . . , n − 1}. Fix A1 , . . . , Ai , Ai+2 , . . . , An in σ(S1 ), . . . , σ(Si ), Si+2 , . . . , Sn ,
respectively. As before, define finite measures ν1 , ν2 on (Ω, A) by
ν1(A) = P(A) ∏_{1≤j≤n, j≠i+1} P(Aj) ,
ν2(A) = P( A ∩ ⋂_{1≤j≤n, j≠i+1} Aj ) ,

for all A ∈ A. The induction hypothesis implies that ν1 and ν2 agree on


Si+1 and hence they do so on σ(Si+1 ) by Corollary 1.1. As this holds for
all A1 , . . . , Ai , Ai+2 , . . . , An in σ(S1 ), . . . , σ(Si ), Si+2 , . . . , Sn , respectively, (3.2)
follows for i + 1. Mathematical induction shows (3.2) for i = n, which completes
the proof.

Definition 28. Random variables X1 , . . . , Xn , all of which by convention are
defined on (Ω, A, P ), are independent if σ(X1 ), . . . , σ(Xn ) are independent σ-
fields, where σ(X) is the smallest σ-field with respect to which X is measurable
for any X : Ω → R̄, that is,
σ(X) = {X⁻¹B : B ∈ B(R̄)} .


Exercise 3.2. Show that discrete random variables X1 , . . . , Xn are independent


if and only if
P(X1 = x1 , . . . , Xn = xn) = ∏_{i=1}^n P(Xi = xi) , for all x1 , . . . , xn ∈ R .

Theorem 3.2. Random variables X1 , . . . , Xn are independent if and only if


P(X1 ≤ x1 , . . . , Xn ≤ xn) = ∏_{i=1}^n P(Xi ≤ xi) , x1 , . . . , xn ∈ R . (3.3)

The proof uses the following exercise.


Exercise 3.3. For random variables X1 , . . . , Xn and −∞ ≤ ai ≤ bi ≤ ∞ for
i = 1, . . . , n,
P(ai < Xi ≤ bi , i = 1, . . . , n) = Σ_{(x1,...,xn) ∈ {a1,b1}×...×{an,bn}} (−1)^{#{1≤i≤n : xi=ai}} P(X1 ≤ x1 , . . . , Xn ≤ xn) .

Soln.: As usual, denote by 1A or 1(A) the indicator of A, that is, it is one or


zero depending on whether A occurs or not, respectively. Write
P (ai < Xi ≤ bi , i = 1, . . . , n) = E (1 (ai < Xi ≤ bi , i = 1, . . . , n)) . (3.4)
Observe that

1(ai < Xi ≤ bi , i = 1, . . . , n)
  = ∏_{i=1}^n 1(ai < Xi ≤ bi)
  = ∏_{i=1}^n [1(Xi ≤ bi) − 1(Xi ≤ ai)]
  = ∏_{i=1}^n Σ_{xi ∈ {ai,bi}} (−1)^{1(xi=ai)} 1(Xi ≤ xi)
  = Σ_{x1 ∈ {a1,b1}} . . . Σ_{xn ∈ {an,bn}} ∏_{i=1}^n (−1)^{1(xi=ai)} 1(Xi ≤ xi)
  = Σ_{(x1,...,xn) ∈ {a1,b1}×...×{an,bn}} (−1)^{#{1≤i≤n : xi=ai}} 1(X1 ≤ x1 , . . . , Xn ≤ xn) .

Taking expectation on both sides and using (3.4), the solution follows.

Proof of Theorem 3.2. The “only if” part is trivial because (3.3) follows from the observation that X1⁻¹(−∞, x1], . . . , Xn⁻¹(−∞, xn] belong to σ(X1), . . . , σ(Xn), respectively. For the “if” part, assume (3.3). The first observation is that (3.3) holds for x1, . . . , xn ∈ R̄ because if xi = −∞ for one or more i, then both sides are zero, and if xi → ∞, then both sides of (3.3) increase to the respective quantities obtained by putting xi = ∞ for those i’s. Thus, (3.3) can be assumed to hold for all x1, . . . , xn ∈ R̄ without loss of generality.
Let Si = {Xi⁻¹(ai, bi] : −∞ ≤ ai ≤ bi ≤ ∞} for i = 1, . . . , n. For A1 ∈ S1, . . . , An ∈ Sn, that is, Ai = Xi⁻¹(ai, bi] for some −∞ ≤ ai ≤ bi ≤ ∞,

P(A1 ∩ . . . ∩ An)
  = P(a1 < X1 ≤ b1 , . . . , an < Xn ≤ bn)
  = Σ_{(x1,...,xn) ∈ {a1,b1}×...×{an,bn}} (−1)^{#{1≤i≤n : xi=ai}} P(X1 ≤ x1 , . . . , Xn ≤ xn)
  = Σ_{(x1,...,xn) ∈ {a1,b1}×...×{an,bn}} (−1)^{#{1≤i≤n : xi=ai}} ∏_{i=1}^n P(Xi ≤ xi)
  = Σ_{(x1,...,xn) ∈ {a1,b1}×...×{an,bn}} ∏_{i=1}^n (−1)^{1(xi=ai)} P(Xi ≤ xi)
  = ∏_{i=1}^n Σ_{xi ∈ {ai,bi}} (−1)^{1(xi=ai)} P(Xi ≤ xi)
  = ∏_{i=1}^n ( P(Xi ≤ bi) − P(Xi ≤ ai) )
  = ∏_{i=1}^n P(ai < Xi ≤ bi)
  = P(A1) . . . P(An) ,

Exc 3.3 implying the third line and the fourth line following from (3.3) which
holds for all x1 , . . . , xn ∈ R̄. Theorem 3.1 shows σ(S1 ), . . . , σ(Sn ) are indepen-
dent, which is the same as independence of X1 , . . . , Xn .
Exercise 3.4. Let Ω = (0, 1], A be the collection of Borel subsets of (0, 1] and
P be the restriction of Lebesgue measure to (0, 1].
1. For all ω ∈ Ω, show that there exist unique X1 (ω), X2 (ω), . . . ∈ {0, 1, 2}
such that
ω = Σ_{n=1}^∞ 3^{−n} Xn(ω) ,

and {n : Xn (ω) equals either 1 or 2} is an infinite set. In other words,


the ternary expansion of ω is being considered and in case where multiple
expansions are possible, the non-terminating one, that is, the one which
has infinitely many 2’s is being taken.

2. For n ≥ 1 and i1 , . . . , in ∈ {0, 1, 2}, show that for ω ∈ Ω,
X1(ω) = i1 , . . . , Xn(ω) = in ⇐⇒ Σ_{j=1}^n 3^{−j} ij < ω ≤ 3^{−n} + Σ_{j=1}^n 3^{−j} ij .

Hence prove that X1 , . . . , Xn are independent and each takes values 0, 1, 2


with probability 1/3 for each.
3. Prove that inf{n ≥ 1 : Xn = 2} is a proper random variable, that is, it is
finite almost surely (“almost surely” or “a.s.” simply means “with proba-
bility 1”). In fact, observe that it is a Geometric(1/3) random variable.
4. Show that the set of ω ∈ Ω which have multiple ternary expansions is
countable.
5. Use the above two claims to argue that
C = { ω ∈ (0, 1] : ω has a unique ternary expansion and ω = Σ_{n=1}^∞ 3^{−n} xn for some x1 , x2 , . . . ∈ {0, 1} } (3.5)

is a Borel set of zero Lebesgue measure.


Hint. Observe that

C = ( ⋂_{n=1}^∞ {ω : Xn(ω) ≠ 2} ) ∩ {ω : ω has a unique ternary expansion} .

Exercise 3.5. Let (Ω, A, P ) be as in Example 2.3. Define X : Ω → R by



X(ω) = Σ_{n=1}^∞ 3^{−n} 1(ωn = H) , ω = (ω1 , ω2 , . . .) ∈ Ω .

1. Show that X is a random variable, that is, it is a measurable function.


2. Prove that X is a one-one function. Hence show that X is a continuous
random variable.
3. If C is as in (3.5), show that P (X ∈ C) = 1.
4. Prove that X cannot have a density.
Hint. Show that if X has a density, then P (X ∈ C) would be zero because
Lebesgue measure of C is zero.
5. Argue using Theorem 2.7 that the C.D.F. of X is continuous but not ab-
solutely continuous.
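A simulation sketch (not part of the notes) makes this exercise concrete: approximating X by its first N ternary digits, the simulated values only ever use the digits 0 and 1 in base 3, and the empirical C.D.F. can be inspected at a few points, even though, as argued above, no density exists.

import numpy as np

rng = np.random.default_rng(0)
N, samples = 40, 100_000                          # digits kept and sample size, both arbitrary
digits = rng.integers(0, 2, size=(samples, N))    # 1 codes a head, 0 a tail
X = digits @ (3.0 ** -np.arange(1, N + 1))        # X = sum_n 3**-n * 1(omega_n = H), truncated
for x in (0.1, 0.25, 0.5, 0.75):
    print(x, np.mean(X <= x))                     # empirical C.D.F. of X at x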
Definition 29. A possibly infinite collection {Ai : i ∈ I} is independent if for
all n ≥ 2 and distinct i1 , . . . , in ∈ I, the σ-fields Ai1 , . . . , Ain are independent.

For a collection {Ai : i ∈ I} of σ-fields, denote
⋁_{i∈I} Ai = σ( ⋃_{i∈I} Ai ) .

Theorem 3.3. If {Ai : i ∈ I} is an independent collection of σ-fields and I1 , I2 , . . . , Ik are disjoint non-empty subsets of I, then

⋁_{i∈I1} Ai , . . . , ⋁_{i∈Ik} Ai

are independent.
Proof. The first step of the proof is to show the claim when I1 , . . . , Ik are finite
sets. Suppose
Ii = {ni1 , . . . , niki } , i = 1, . . . , k .
Define

Si = {A1 ∩ . . . ∩ Aki : Aj ∈ Aij for j = 1, . . . , ki } , i = 1, . . . , k .

Obviously, Si is a semi-field because it is closed under finite intersections. Fur-


ther, if Aj ∈ Aij for j = 1, . . . , ki ,
c
(A1 ∩ . . . ∩ Aki ) = B1 ∪ . . . ∪ Bki ,

where

B1 = Ac1 , B2 = A1 ∩ Ac2 , . . . , Bki = A1 ∩ . . . ∩ Aki −1 ∩ Acki .

Since B1 , . . . , Bki are disjoint Si -sets, Si is a semi-field. Since {Aij : j =


1, . . . , ki , i = 1, . . . , k} are independent, it follows that

P (A1 ∩ . . . ∩ Ak ) = P (A1 ) . . . P (Ak ) for all A1 ∈ S1 , . . . , Ak ∈ Sk .

Theorem 3.1 implies independence of σ(S1 ), . . . , σ(Sk ), which are same as


_ _
Ai , . . . , Ai ,
i∈I1 i∈I1

respectively.
Now let I1 , . . . , Ik be disjoint non-empty subsets of I. Define
[ _
Fi = Aj , i = 1, . . . , k .
J⊂Ii , J finite j∈J

Clearly, Fi is a field for i = 1, . . . , k and the first step implies that

P (A1 ∩ . . . ∩ Ak ) = P (A1 ) . . . P (Ak ) , A1 ∈ F1 , . . . , Ak ∈ Fk .

Once again, Theorem 3.1 implies independence of σ(F1 ), . . . , σ(Fk ) and com-
pletes the proof.

An immediate corollary of the above theorem is the following.
Corollary 3.1. If {Ai : i ∈ I} is an independent collection of σ-fields and for each α ∈ Θ, ∅ ≠ Iα ⊂ I is such that

Iα ∩ Iβ = ∅ , for all α, β ∈ Θ , α ≠ β ,

then {⋁_{i∈Iα} Ai : α ∈ Θ} is independent.
Exercise 3.6. 1. If Ai is a σ-field for all i ∈ I, show that A ∈ ⋁_{i∈I} Ai if and only if

A ∈ ⋁_{i∈I0} Ai ,

for some countable I0 ⊂ I.


2. If Xi is a random-variable for all i ∈ I, show that A ∈ σ(Xi : i ∈ I) if
and only if
A ∈ σ(Xi : i ∈ I0 )
for some countable I0 ⊂ I.
3. If Ai ∈ A for all i ∈ I, show that A ∈ σ(Ai : i ∈ I) if and only if

A ∈ σ(Ai : i ∈ I0 )

for some countable I0 ⊂ I.


Definition 30. A possibly infinite collection of random variables {Xi : i ∈ I} is independent if {σ(Xi) : i ∈ I} is an independent collection of σ-fields. Two random variables Y and Z are identically distributed, that is, Y =d Z, if

P(Y ∈ B) = P(Z ∈ B) , B ∈ B(R) .

The collection {Xi : i ∈ I} is independent and identically distributed, or i.i.d., if it is independent and

Xi =d Xj , i, j ∈ I .
Exercise 3.7. If {Ai : i ∈ I} is an independent collection of σ-fields and Xi is
Ai -measurable for each i ∈ I, show that {Xi : i ∈ I} is independent.
Theorem 3.4. If X and Y are independent random variables with finite mean,
then XY has a finite mean and

E(XY ) = E(X)E(Y ) .

Proof. As the first step, we show that for simple non-negative independent ran-
dom variables X and Y ,

E(XY ) = E(X)E(Y ) .

Since X and Y are simple and non-negative,

X = Σ_{i=1}^m αi 1_{Ai} , Y = Σ_{i=1}^n βi 1_{Bi} ,

for some α1, . . . , αm, β1, . . . , βn ≥ 0, A1, . . . , Am ∈ σ(X) and B1, . . . , Bn ∈ σ(Y). Thus,

E(XY) = Σ_{i=1}^m Σ_{j=1}^n αi βj P(Ai ∩ Bj)
 = Σ_{i=1}^m Σ_{j=1}^n αi βj P(Ai) P(Bj)
 = E(X) E(Y) ,

the independence of σ(X) and σ(Y) being used in the second line.
The second step is to show that for non-negative independent random vari-
ables X and Y , E(XY ) = E(X)E(Y ). There exist σ(X)-measurable simple
random variables sn such that 0 ≤ sn ↑ X and σ(Y )-measurable simple random
variables tn such that 0 ≤ tn ↑ Y . As sn and tn are independent by Exc 3.7,
the first step implies
E(sn tn ) = E(sn )E(tn ) .
Observing that 0 ≤ sn tn ↑ XY , letting n → ∞ and using MCT, it follows that
E(XY ) = E(X)E(Y ).
Finally suppose X and Y are independent and integrable. Then |X| and |Y |
are independent. The second step shows

E (|X| |Y |) = E(|X|)E(|Y |) < ∞ .

Thus, XY is integrable. Splitting X = X + − X − and likewise for Y , the proof


follows.
Definition 31. For random variables X and Y such that X, Y, XY are integrable, the covariance of X and Y is

Cov(X, Y) = E[(X − E(X))(Y − E(Y))] .
Theorem 3.5. If X and Y are random variables with finite variance, then
Cov(X, Y ) is defined and
|Cov(X, Y)| ≤ √(Var(X) Var(Y)) .

Proof. Follows from the Cauchy-Schwarz inequality applied to X − E(X) and Y − E(Y).


Theorem 3.6. 1. If Cov(X, Y ) is defined, then

Cov(X, Y ) = E(XY ) − E(X)E(Y ) .

2. If X and Y are independent and integrable, then Cov(X, Y ) exists and
equals zero.
3. If X has a finite variance, Cov(X, X) = Var(X).
4. If Cov(X, Y ) is defined, then

Cov(αX + γ, βY + δ) = αβCov(X, Y ) , α, β, δ, γ ∈ R .

5. If X1, . . . , Xn have finite variances, then

Var(Σ_{i=1}^n Xi) = Σ_{i=1}^n Var(Xi) + 2 Σ_{1≤i<j≤n} Cov(Xi, Xj) .

Proof. Exc.
Definition 32. For random variables X and Y whose variances are finite and
positive, their correlation is
Corr(X, Y) = Cov(X, Y)/√(Var(X) Var(Y)) .

Exercise 3.8. For a random variable X, show that Var(X) = 0 if and only if
X is a degenerate random variable, that is, for some c ∈ R, X = c a.s.
Theorem 3.7. Suppose X and Y are non-degenerate random variables with
finite variances.
1. For α, β, γ, δ ∈ R with α, β ≠ 0,

Corr (αX + γ, βY + δ) = sgn(αβ)Corr(X, Y ) .

2. A correlation coefficient always lies between −1 and 1, that is,

|Corr(X, Y )| ≤ 1 .

3. There exist a, b, c ∈ R with a, b ≠ 0 such that aX + bY = c a.s. if and only if

Corr(X, Y) = ±1 .

Proof. 1. A trivial consequence of Theorem 3.6.4.


2. Follows from Theorem 3.5.
3. For the “only if” part, suppose aX + bY = c a.s. for some a, b ≠ 0. Then

Corr(X, Y) = Corr(X, c/b − (a/b) X)
 = sgn(−a/b) Corr(X, X)
 = sgn(−a/b) ,

1. being used in the second line, and the last line follows from the observation that Corr(X, X) = 1, which is a restatement of Theorem 3.6.3. Thus aX + bY = c a.s. for some a, b ≠ 0 implies Corr(X, Y) = ±1.
Conversely, suppose Corr(X, Y) = ±1. Define

X' = (X − E(X))/√Var(X) , Y' = (Y − E(Y))/√Var(Y) .

Then E(X'^2) = E(Y'^2) = 1 and E(X'Y') = Corr(X, Y). If Corr(X, Y) = 1, then

E[(X' − Y')^2] = E(X'^2) + E(Y'^2) − 2E(X'Y') = 2 − 2Corr(X, Y) = 0 ,

showing that X' = Y' a.s. In other words, Corr(X, Y) = 1 implies

X/√Var(X) − Y/√Var(Y) = E(X)/√Var(X) − E(Y)/√Var(Y) a.s.

A similar calculation shows that Corr(X, Y) = −1 implies

X/√Var(X) + Y/√Var(Y) = E(X)/√Var(X) + E(Y)/√Var(Y) a.s.

This proves the “if” part and thus completes the proof.
Now we proceed towards showing that given a countable collection of CDFs,
there exist independent random variables with those CDFs. The first step in
that direction is the following result, which is an alternate way of proving the
second part of Theorem 2.1.
Theorem 3.8. Let F be a C.D.F. and define

F ← (y) = inf{x ∈ R : F (x) ≥ y} , 0 < y < 1 . (3.6)

If U ∼ Uniform(0, 1), then F ← (U ) has C.D.F. F .


Proof. It suffices to prove that for x0 ∈ R and 0 < y0 < 1,

F ← (y0 ) ≤ x0 ⇐⇒ y0 ≤ F (x0 ) , (3.7)

because then it would follow from the fact that 0 < U < 1 a.s. that for x ∈ R,

P (F ← (U ) ≤ x) = P (U ≤ F (x)) = F (x) .

Proceeding towards proving (3.7), first assume y0 ≤ F (x0 ). In other words,

x0 ∈ {x ∈ R : F (x) ≥ y0 } .

Thus, x0 ≥ inf{x ∈ R : F (x) ≥ y0 } = F ← (y0 ), proving the “⇐” part, that is,
the “if” part.

For the reverse implication of (3.7), we shall show that y0 > F(x0) ⇒ F←(y0) > x0. Assume y0 > F(x0). Right continuity of F implies there exists x1 > x0 with F(x1) < y0. As F is non-decreasing,

F((−∞, x1]) ⊂ [0, F(x1)] ⊂ [0, y0) .

In other words, F(x) < y0 for all x ≤ x1, which is equivalent to

{x : F(x) ≥ y0} ⊂ (x1, ∞) .

Thus, inf{x : F (x) ≥ y0 } ≥ x1 > x0 . That is, F ← (y0 ) > x0 , as desired. This
proves the “⇒” implication, that is, the “only if” part of (3.7), which completes
the proof.
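Theorem 3.8 is the basis of inverse-transform sampling. As an illustration (a minimal sketch, assuming NumPy is available; the Exponential(1) example is my choice and not part of the notes), the following Python code draws from Exponential(1), whose generalized inverse is F←(y) = −log(1 − y), by applying F← to Uniform(0, 1) variables.

import numpy as np

def F_inv(y):
    # Generalized inverse of the Exponential(1) CDF F(x) = 1 - exp(-x), x >= 0:
    # F_inv(y) = inf{x : F(x) >= y} = -log(1 - y) for 0 < y < 1.
    return -np.log(1.0 - y)

rng = np.random.default_rng(0)
u = rng.uniform(size=100000)        # U ~ Uniform(0, 1)
x = F_inv(u)                        # by Theorem 3.8, x has CDF F

# Sanity check: empirical CDF at a few points against F(t) = 1 - exp(-t).
for t in (0.5, 1.0, 2.0):
    print(t, (x <= t).mean(), 1 - np.exp(-t))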
The next step in the same direction gives an alternate way of generating an
Uniform(0, 1) random variable.
Theorem 3.9. If X1 , X2 , . . . are i.i.d. from Bernoulli(1/2), that is, they take
values 0 and 1 with probability 1/2 each, then
U = Σ_{n=1}^∞ 2^{-n} Xn

follows Uniform(0, 1).


Proof. For n = 1, 2, 3, . . ., define

Un = Σ_{i=1}^n 2^{-i} Xi .

We shall first show by induction on n that

P(Un = 2^{-n} i) = 2^{-n} , i = 0, 1, . . . , 2^n − 1 .   (3.8)

Since U1 = X1 /2, that is, U1 takes values 0 and 1/2 each with probability
1/2, (3.8) trivially holds for n = 1. Assume (3.8) for some n as the induction
hypothesis. Notice
Un+1 − Un = 2−n−1 Xn+1 . (3.9)
Since Un is measurable with respect to σ(X1 , . . . , Xn ) which is independent of
σ(Xn+1 ) by Corollary 3.1, Un is independent of Un+1 −Un . Another implication
of (3.9) and the fact Un ∈ {2^{-n} i : i = 0, 1, . . . , 2^n − 1} is that

Un+1 ∈ {2^{-n} i : i = 0, 1, . . . , 2^n − 1} if Un+1 − Un = 0, and Un+1 ∈ {2^{-n} i + 2^{-n-1} : i = 0, 1, . . . , 2^n − 1} otherwise.   (3.10)

Since {2^{-n} i : i = 0, 1, . . . , 2^n − 1} ∩ {2^{-n} i + 2^{-n-1} : i = 0, 1, . . . , 2^n − 1} = ∅, for i = 0, 1, . . . , 2^n − 1, (3.10) implies

P(Un+1 = 2^{-n} i) = P(Un+1 = 2^{-n} i, Un+1 − Un = 0)
 = P(Un = 2^{-n} i, Un+1 − Un = 0)
 (independence of Un, Un+1 − Un) = P(Un = 2^{-n} i) P(Un+1 − Un = 0)
 = (1/2) P(Un = 2^{-n} i)
 = 2^{-n-1} ,

the penultimate line following from (3.9), whereas (3.8) for n implies the last line. A similar calculation with (3.10) shows

P(Un+1 = 2^{-n} i + 2^{-n-1}) = 2^{-n-1} , i = 0, 1, . . . , 2^n − 1 .

Observing that

{2^{-n} i : i = 0, 1, . . . , 2^n − 1} ∪ {2^{-n} i + 2^{-n-1} : i = 0, 1, . . . , 2^n − 1} = {2^{-n-1} i : i = 0, 1, . . . , 2^{n+1} − 1} ,

(3.8) follows for n + 1. Mathematical induction shows (3.8) for all n.


Note that Un ↑ U and hence for any x ∈ R, [Un ≤ x] ↓ [U ≤ x]. Thus, for x ∈ (0, 1),

P(U ≤ x) = lim_{n→∞} P(Un ≤ x) = lim_{n→∞} 2^{-n}([2^n x] + 1) = x ,

(3.8) implying the second equality, [z] denoting the largest integer less than or equal to z. Monotonicity of the CDF implies P(U ≤ 0) = 0 and P(U ≤ 1) = 1. Hence,

P(U ≤ x) = 0 for x ≤ 0 , = x for 0 < x < 1 , and = 1 for x ≥ 1 ,

that is, U follows Uniform(0, 1). This completes the proof.


Now we are in a position to prove the existence of a countable collection of
independent random variables.
Theorem 3.10. Given CDFs F1 , F2 , . . ., there exist independent random vari-
ables X1 , X2 , . . ., defined on some probability space (Ω, A, P ), whose CDFs are
F1 , F2 , . . ., respectively.

Proof. Let (Ω, A, P ) be the probability space associated with infinite tosses of
a fair coin as in Example 2.3. Define for n = 1, 2, . . .,

Yn (ω) = 1(ωn = H) , ω = (ω1 , ω2 , ω3 , . . .) ∈ Ω .

Clearly, Y1 , Y2 , . . . are i.i.d. from Bernoulli(1/2). Fix a bijection φ : N × N → N


and define
Zij = Yφ(i,j) , i, j ∈ N .
It is immediate that {Zij : i, j ∈ N} are i.i.d. from Bernoulli(1/2).
Define

Ui = Σ_{j=1}^∞ 2^{-j} Zij , i ∈ N .   (3.11)

For i = 1, 2, . . ., Ui is measurable with respect to Ai = σ(Zi1 , Zi2 , . . .). Corollary


3.1 shows A1 , A2 , . . . are independent σ-fields. Hence, U1 , U2 , . . . are indepen-
dent. Theorem 3.9 in view of (3.11) shows Ui follows Uniform(0, 1) for every i.
That is, U1 , U2 , . . . are i.i.d. from Uniform(0, 1).
Finally, set Xi = Fi← (Ui ), where Fi← is as in (3.6). Theorem 3.8 shows Xi
has CDF Fi . Since U1 , U2 , . . . are independent, so are X1 , X2 , . . ., and hence the
proof follows.
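The construction in the proof of Theorem 3.10 is explicit enough to simulate. The following Python sketch is a finite, truncated version of it (the truncation depth, the particular choice of CDFs, the pairing function, and the use of NumPy/SciPy are all assumptions made for illustration): fair-coin bits are split into rows via a bijection φ : N × N → N, each row is turned into an approximately Uniform(0, 1) variable as in Theorem 3.9, and Fi← is then applied as in Theorem 3.8.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
m = 40                                   # bits kept per uniform (truncation of (3.11))
num_vars, num_samples = 3, 50000

# Y_1, Y_2, ... i.i.d. Bernoulli(1/2); row i uses the bits Y_{phi(i,1)}, Y_{phi(i,2)}, ...
# with phi(i, j) = (j - 1) * num_vars + i (a bijection when all rows are kept).
Y = rng.integers(0, 2, size=(num_samples, num_vars * m))
Z = Y.reshape(num_samples, m, num_vars)          # Z[:, j-1, i-1] = Y_{phi(i, j)}

# U_i = sum_j 2^{-j} Z_{ij}, approximately Uniform(0, 1) by Theorem 3.9.
weights = 2.0 ** -np.arange(1, m + 1)
U = np.tensordot(Z, weights, axes=([1], [0]))    # shape (num_samples, num_vars)

# X_i = F_i^{<-}(U_i): here F_1, F_2, F_3 are Exponential(1), Uniform(0, 1), N(0, 1) (my choice).
X = np.column_stack([-np.log(1 - U[:, 0]), U[:, 1], norm.ppf(U[:, 2])])

# The columns should be (approximately) independent: check pairwise correlations.
print(np.corrcoef(X, rowvar=False).round(3))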
Remark 1. The construction (3.11) along with Theorems 3.8 and 3.9 gives an
alternative proof of the second part of Theorem 2.1 which completely bypasses
Theorem 1.4.

Definition 33. Let R^N = {(x1, x2, x3, . . .) : xn ∈ R, n = 1, 2, . . .} and

B(R^N) = σ({A1 × A2 × . . . × An × R × R × . . . : n ≥ 1, A1, . . . , An ∈ B(R)}) .

Theorem 3.11. If P1, P2, . . . are probability measures on (R, B(R)), then there exists a unique probability measure P on (R^N, B(R^N)) such that

P(A1 × . . . × An × R × R × . . .) = ∏_{i=1}^n Pi(Ai) , A1, . . . , An ∈ B(R) , n ≥ 1 .   (3.12)
Proof. For n = 1, 2, . . ., let

Fn (x) = Pn ((−∞, x]) , x ∈ R .

Theorem 3.10 shows the existence of independent random variables X1 , X2 ,


X3 , . . . having CDFs F1 , F2 , . . ., respectively, defined on some probability space
(Ω, A, P). Exc 2.7 shows that P(Xi ∈ A) = Pi (A) for i = 1, 2, . . . and A ∈ B(R).
Define T : Ω → RN by

T (ω) = (X1 (ω), X2 (ω), . . .) , ω ∈ Ω .

The map T is measurable, that is, T^{-1}B ∈ A for all B ∈ B(R^N), because for all n ≥ 1 and A1, . . . , An ∈ B(R),

T^{-1}(A1 × . . . × An × R × R × . . .) = ∩_{i=1}^n Xi^{-1} Ai ∈ A .

Let P = P ∘ T^{-1}. Then, for A1, . . . , An as above,

P(A1 × . . . × An × R × R × . . .) = P(T^{-1}(A1 × . . . × An × R × R × . . .))
 = P(∩_{i=1}^n Xi^{-1} Ai)
 = ∏_{i=1}^n P(Xi ∈ Ai)
 = ∏_{i=1}^n Pi(Ai) ,

the independence of X1, X2, . . . implying the penultimate line. Thus, P satisfies (3.12). Uniqueness follows from the observation that

{A1 × . . . × An × R × R × . . . : n ≥ 1, A1, . . . , An ∈ B(R)}

is a semi-field. This completes the proof.
Definition 34. The probability measure P on (R^N, B(R^N)) satisfying (3.12) is the product measure of P1, P2, . . . and is denoted by

P = ⊗_{n=1}^∞ Pn .

The above infinite product could be defined because P1, P2, . . . are all probability measures.
We conclude this chapter by pointing out that measure theory is indispens-
able for a rigorous treatment of probability theory, which is now amply clear.
The following are a few instances where usage of measure theory was necessary.
1. For studying the simple random experiment of infinite tosses of a fair coin,
as in Example 2.3.
2. For defining expectation of a general random variable, that is, one which
is neither discrete nor has a density.
3. Showing linearity of expectation is very difficult, even for random variables
having a density, without the measure theoretic definition.
4. Answering the question of when a random variable has a density is im-
possible without the Radon-Nikodym theorem.
5. Last but not the least, the study of independence in this chapter became
much easier thanks to measure theory.

4 Several random variables
Definition 35. For random variables X1 , . . . , Xd , which by convention are de-
fined on the same probability space (Ω, A, P ), the joint CDF of (X1 , . . . , Xd ) is
a function F : Rd → [0, 1] defined by

F (x1 , . . . , xd ) = P (X1 ≤ x1 , . . . , Xd ≤ xd ) , x1 , . . . , xd ∈ R .

A joint CDF will often be referred to simply by ‘CDF’. For this chapter, we
introduce the following notations:

H = {(a1 , b1 ] × . . . × (ad , bd ] : −∞ < ai < bi < ∞ for i = 1, . . . , d} ,


X
∆R F = (−1)#{i:xi =ai } F (x1 , . . . , xd ) , (4.1)
(x1 ,...,xd )∈{a1 ,b1 }×...×{ad ,bd }

for a function F : Rd → R and R = (a1 , b1 ] × . . . × (ad , bd ] ∈ H. A restatement


of Exc 3.3 in the above notations is that

P (X1 , . . . , Xd ) ∈ R = ∆R F , R ∈ H , (4.2)

if F is the CDF of (X1 , . . . , Xd ). The following theorem can be proved along


similar lines as the direct part of Theorem 2.1, with the help of (4.2).
Theorem 4.1. If F is the joint CDF of d random variables, then
1. ∆_R F ≥ 0 for all R ∈ H,
2. F is continuous from above, that is,

lim_{y1↓x1,...,yd↓xd} F(y1, . . . , yd) = F(x1, . . . , xd) for all x1, . . . , xd ∈ R ,

3. for any k = 1, . . . , d and 1 ≤ i1 < . . . < ik ≤ d,

lim_{xi1→−∞,...,xik→−∞} F(x1, . . . , xd) = 0 ,

where xj ∈ R is fixed for all j ∈ {1, . . . , d} \ {i1, . . . , ik},
4. and

lim_{x1→∞,...,xd→∞} F(x1, . . . , xd) = 1 .

Proof. Exercise.
The following measure-theoretic fact is a d-dimensional generalization of
Theorem 1.4.
Fact 4.1. If F : Rd → R is a function which is continuous from above and
satisfies ∆R F ≥ 0 for all R ∈ H, then there exists a unique Radon measure µ
on (Rd , B(Rd )) such that

µ(R) = ∆R F , for all R ∈ H .

Neither the above fact nor the following theorem, which is built on it and
gives a converse of Theorem 4.1, will be used much in the course. Nonetheless,
a proof of the above fact is given in Subsection 9.2 of the Appendix.
Theorem 4.2. If F : Rd → [0, 1] satisfies 1.-4. of Theorem 4.1, then there exist
random variables X1 , . . . , Xd defined on some probability space such that F is
the joint CDF of (X1 , . . . , Xd ).
Proof. Since F satisfies 1. and 2., the preceding fact implies there exists a Radon
measure µ on (Rd , B(Rd )) such that

µ(R) = ∆R F , R ∈ H .

Our first goal is to show µ is a probability measure. To that end, rewrite the above for R = (−m, n]^d, where m, n ∈ N, as

µ((−m, n]^d) = Σ_{(x1,...,xd)∈{−m,n}^d} (−1)^{#{i : xi=−m}} F(x1, . . . , xd) , m, n ∈ N .

In the right hand side above, if n is fixed and m → ∞, then every term except F(n, . . . , n) goes to zero by 3. of Theorem 4.1, which F satisfies by hypothesis. As µ is a measure, the left hand side increases to µ((−∞, n]^d) as m → ∞. Therefore,

µ((−∞, n]^d) = F(n, n, . . . , n) , n ≥ 1 .   (4.3)
Let n → ∞ and use 4. to conclude µ(Rd ) = 1. In other words, µ is a probability
measure.
Let Ω = Rd , A = B(Rd ) and P = µ. Define Xi : Rd → R by

Xi (x1 , . . . , xd ) = xi , (x1 , . . . , xd ) ∈ Rd ,

for i = 1, . . . , d. Then X1 , . . . , Xd are random variables on the probability space


(Ω, A, P ). Arguments leading to (4.3) can be slightly tweaked to show that

P ((−∞, x1 ] × . . . × (−∞, xd ]) = F (x1 , . . . , xd ) , x1 , . . . , xd ∈ R .

As the above is the same as saying F is the CDF of (X1 , . . . , Xd ), the proof
follows.
Definition 36. For random variables X1 , . . . , Xd , the joint distribution of
(X1 , . . . , Xd ) is the measure on (Rd , B(Rd )) given by P ◦(X1 , . . . , Xd )−1 , that is,
the measure pushed forward to Rd by (X1 , . . . , Xd ). For a measurable function
f : Rd → R, the integral of f with respect to the measure P ◦ (X1 , . . . , Xd )−1 , if
defined, is denoted by

∫_{−∞}^∞ . . . ∫_{−∞}^∞ f(x1, . . . , xd) P(X1 ∈ dx1, . . . , Xd ∈ dxd) .

For random variables Y1, . . . , Yd, (X1, . . . , Xd) =d (Y1, . . . , Yd) means

P((X1, . . . , Xd) ∈ B) = P((Y1, . . . , Yd) ∈ B) for all B ∈ B(R^d) .

The following result is similar to its one-dimensional analogue.
Theorem 4.3. 1. For random variables X1, . . . , Xd and a Borel measurable f : R^d → R,

E(f(X1, . . . , Xd)) = ∫_{−∞}^∞ . . . ∫_{−∞}^∞ f(x1, . . . , xd) P(X1 ∈ dx1, . . . , Xd ∈ dxd) ,

whenever either side is defined.
2. For random variables X1, . . . , Xd, Y1, . . . , Yd,

(X1, . . . , Xd) =d (Y1, . . . , Yd)

if and only if the CDFs of (X1, . . . , Xd) and (Y1, . . . , Yd) are the same.
Proof. 1. Follows from Theorem 1.6.
2. Follows from (4.2) and the observation that

{(∏_{i=1}^d (ai, bi]) ∩ R^d : −∞ ≤ ai ≤ bi ≤ ∞ , i = 1, . . . , d}

is a semi-field and for every set in the above class, there exist sets in H increasing to that set.
Definition 37. For discrete random variables X1 , . . . , Xd , the joint PMF of
(X1 , . . . , Xd ) is the function p : Rd → [0, 1] defined by

p(x1 , . . . , xd ) = P (X1 = x1 , . . . , Xd = xd ) , x1 , . . . , xd ∈ R .

A Borel function f : R^d → [0, ∞) is the joint density of (X1, . . . , Xd) if

P[(X1, . . . , Xd) ∈ B] = ∫_B f(x) dx , B ∈ B(R^d) .

Theorem 4.4. A Borel map f : R^d → [0, ∞) is the density of (X1, . . . , Xd) if and only if

F(x1, . . . , xd) = ∫_{−∞}^{x1} . . . ∫_{−∞}^{xd} f(z1, . . . , zd) dzd . . . dz1 ,

for all x1, . . . , xd ∈ R, where F is the CDF of (X1, . . . , Xd).


Proof. Follows from Theorem 4.3.2.
Theorem 4.5. If X1, . . . , Xd are discrete random variables, then for k = 1, . . . , d − 1,

P(X1 = x1, . . . , Xk = xk) = Σ_{xk+1∈R} . . . Σ_{xd∈R} P(X1 = x1, . . . , Xd = xd) ,

for all x1, . . . , xk ∈ R. If f is the joint density of (X1, . . . , Xd), then for k = 1, . . . , d − 1, g : R^k → R defined by

g(x1, . . . , xk) = ∫_{−∞}^∞ . . . ∫_{−∞}^∞ f(x1, . . . , xd) dxk+1 . . . dxd , x1, . . . , xk ∈ R ,

is the density of (X1, . . . , Xk).


Proof. We prove the second claim; the proof of the first one is similar. Let f be the density of (X1, . . . , Xd). Then for B ∈ B(R^k),

P[(X1, . . . , Xk) ∈ B]
 = P((X1, . . . , Xd) ∈ B × R^{d−k})
 = ∫_{−∞}^∞ . . . ∫_{−∞}^∞ 1[(x1, . . . , xk) ∈ B] f(x1, . . . , xd) dxd . . . dx1
 = ∫_{R^k} 1((x1, . . . , xk) ∈ B) (∫_{R^{d−k}} f(x1, . . . , xd) dxk+1 . . . dxd) dx1 . . . dxk
 = ∫_{R^k} 1((x1, . . . , xk) ∈ B) g(x1, . . . , xk) dx1 . . . dxk .

As this is true for all B ∈ B(R^k), g is the density of (X1, . . . , Xk), as claimed.


Theorem 4.6. For discrete independent random variables, the joint PMF is the product of the marginal PMFs, that is, if X1, . . . , Xd are discrete and independent, then

P(X1 = x1, . . . , Xd = xd) = ∏_{i=1}^d P(Xi = xi) , (x1, . . . , xd) ∈ R^d .

If X1, . . . , Xd are independent with respective densities f1, . . . , fd, then f, defined by

f(x1, . . . , xd) = ∏_{i=1}^d fi(xi) , x1, . . . , xd ∈ R ,

is the density of (X1, . . . , Xd).
Proof. The first claim follows immediately from the definition of independence. For the second claim, using independence, write for x1, . . . , xd ∈ R,

P(X1 ≤ x1, . . . , Xd ≤ xd) = ∏_{i=1}^d P(Xi ≤ xi)
 = ∏_{i=1}^d ∫_{−∞}^{xi} fi(zi) dzi
 = ∫_{−∞}^{x1} . . . ∫_{−∞}^{xd} f(z1, . . . , zd) dzd . . . dz1 .

Theorem 4.4 completes the proof.

The following is a converse of the above, and in fact, slightly stronger than
that.
Theorem 4.7. If X1, . . . , Xd are discrete random variables for which there exist c ∈ R and functions p1, . . . , pd : R → R such that

P(X1 = x1, . . . , Xd = xd) = c ∏_{i=1}^d pi(xi) , x1, . . . , xd ∈ R ,

then X1, . . . , Xd are independent. Furthermore, if

Σ_{x∈R} pi(x) = 1 , i = 1, . . . , d ,

then c = 1 and p1, . . . , pd are the respective marginal PMFs of X1, . . . , Xd.
If (X1, . . . , Xd) has a joint density f which can be written as

f(x1, . . . , xd) = c ∏_{i=1}^d fi(xi) , (x1, . . . , xd) ∈ R^d ,   (4.4)

for measurable functions f1, . . . , fd : R → R, then X1, . . . , Xd are independent. If, in addition, f1, . . . , fd integrate to 1, then c = 1 and f1, . . . , fd are the marginal densities of X1, . . . , Xd, respectively.
Proof. The second claim will be proved, as the proof of the first claim is similar. Non-negativity of f and (4.4) show

|c| ∏_{i=1}^d |fi(xi)| = f(x1, . . . , xd) , x1, . . . , xd ∈ R .   (4.5)

Thus,

|c| ∏_{i=1}^d ∫_{−∞}^∞ |fi(xi)| dxi = ∫_{−∞}^∞ . . . ∫_{−∞}^∞ |c| ∏_{i=1}^d |fi(xi)| dx1 . . . dxd = 1 ,

(4.5) and the fact that f is a density implying the second equality. Hence,

∫_{−∞}^∞ |fi(x)| dx < ∞ , i = 1, . . . , d .

This allows integrating both sides of (4.4) over (x1, . . . , xd) ∈ R^d, which yields

c α1 . . . αd = 1 ,   (4.6)

where

αi = ∫_{−∞}^∞ fi(x) dx , i = 1, . . . , d .

Theorem 4.5 shows that for i = 1, . . . , d, the density gi of Xi can be obtained by fixing xi and integrating the right hand side of (4.4) over all other variables. That is,

gi(xi) = c (∏_{j∈{1,...,d}\{i}} αj) fi(xi) , xi ∈ R ;

(4.6) implies

gi(x) = αi^{-1} fi(x) , x ∈ R .

Use this to rewrite (4.4) as

f(x1, . . . , xd) = c ∏_{i=1}^d αi gi(xi) = ∏_{i=1}^d gi(xi) ,

for all x1, . . . , xd ∈ R, the last equality following from (4.6). Therefore, for B1, . . . , Bd ∈ B(R),

P(X1 ∈ B1, . . . , Xd ∈ Bd) = ∫_{B1} . . . ∫_{Bd} f(x1, . . . , xd) dxd . . . dx1
 = ∫_{B1} . . . ∫_{Bd} ∏_{i=1}^d gi(xi) dxd . . . dx1
 = ∏_{i=1}^d ∫_{Bi} gi(xi) dxi
 = ∏_{i=1}^d P(Xi ∈ Bi) ,

the equality in the last line holding because g1, . . . , gd are the respective densities of X1, . . . , Xd. Thus, X1, . . . , Xd are independent.
If

∫_{−∞}^∞ fi(x) dx = 1 , i = 1, . . . , d ,

then αi = 1 for all i, showing c = 1 by (4.6) and that gi = fi. In this case, f1, . . . , fd are thus the respective marginal densities of X1, . . . , Xd. This completes the proof.
The following exercise is a variant of the above theorem.
Exercise 4.1. Suppose f is the density of (X1 , . . . , Xm , Y1 , . . . , Yn ). If (X1 , . . .
. . . , Xm ) and (Y1 , . . . , Yn ) are independent, then show that

f (x1 , . . . , xm , y1 , . . . , yn ) = fX (x1 , . . . , xm )fY (y1 , . . . , yn ) ,

for almost all (x1 , . . . , xm , y1 , . . . , yn ) ∈ Rm+n , where fX and fY are the den-
sities of (X1 , . . . , Xm ) and (Y1 , . . . , Yn ), respectively. Conversely, if there exist
measurable g : Rm → R and h : Rn → R and c ∈ R such that
f (x1 , . . . , xm , y1 , . . . , yn ) = cg(x1 , . . . , xm )h(y1 , . . . , yn ) ,
for almost all (x1 , . . . , xm , y1 , . . . , yn ) ∈ Rm+n , then show that (X1 , . . . , Xm )
and (Y1, . . . , Yn) are independent. Besides, if

∫_{R^m} g(x) dx = 1 = ∫_{R^n} h(x) dx ,

then show that c = 1, that g, h are non-negative a.e., and that g ∨ 0 and h ∨ 0
are the respective densities of (X1 , . . . , Xm ) and (Y1 , . . . , Yn ).
Theorem 4.8. Suppose X = (X1 , . . . , Xd ) is a random vector with P (X ∈
U ) = 1 for some open set U ⊂ Rd . Let ψ : U → V be a bijection for some open
set V ⊂ Rd . Let T : V → U be the inverse of ψ. Assume T is continuously
differentiable and its Jacobian matrix J(y) at y ∈ V, defined by

J(y) = ∂T(y)/∂y ,

is non-singular for all y ∈ V. Then the joint density of Y = (Y1, . . . , Yd) = ψ(X) is

g(y) = f(T(y)) |det(J(y))| for y ∈ V , and g(y) = 0 for y ∉ V .

Proof. Since (Y1, . . . , Yd) ∈ V a.s., for B ∈ B(R^d),

P((Y1, . . . , Yd) ∈ B) = P((Y1, . . . , Yd) ∈ B ∩ V)
 = P((X1, . . . , Xd) ∈ T(B ∩ V))
 = ∫_{T(B∩V)} f(x) dx
 = ∫_{B∩V} f(T(y)) |det(J(y))| dy
 = ∫_B g(y) dy ,

the penultimate line following from Theorem 1.12. Hence the proof follows.
Example 4.1. Let X ∼ Gamma(α) and Y ∼ Gamma(β) independently of each
other. We want to find the distribution of W = X/(X + Y ).
Theorem 4.8 is the only tool at our disposal, which is valid for one-one
functions from an open subset of R2 to R2 . Therefore, we define an auxiliary
random variable Z = X + Y . Thus, (W, Z) = ψ(X, Y ) where ψ : U → V is a
bijection defined by

ψ(x, y) = (x/(x + y), x + y) , (x, y) ∈ U ,

and U = (0, ∞)^2 and V = (0, 1) × (0, ∞) are open sets. The inverse of ψ is T : V → U defined by

T(w, z) = (wz, z − wz) , (w, z) ∈ V .

The Jacobian matrix of T is

J(w, z) = [ z  w ; −z  1 − w ] ,

showing |det J(w, z)| = z for (w, z) ∈ V. The joint density of (X, Y) is

f(x, y) = (1/(Γ(α)Γ(β))) e^{−x−y} x^{α−1} y^{β−1} , (x, y) ∈ U .

Theorem 4.8 shows that the joint density g of (W, Z) at (w, z) ∈ V is

g(w, z) = f(T(w, z)) |det J(w, z)|
 = (1/(Γ(α)Γ(β))) e^{−z} (wz)^{α−1} (z − wz)^{β−1} z
 = (1/(Γ(α)Γ(β))) w^{α−1} (1 − w)^{β−1} e^{−z} z^{α+β−1} ,

and g(w, z) = 0 for (w, z) ∉ V. In other words,

g(w, z) = c h1(w) h2(z) , (w, z) ∈ R^2 ,

where

h1(w) = (1/B(α, β)) w^{α−1} (1 − w)^{β−1} 1(0 < w < 1) , w ∈ R ,
h2(z) = (1/Γ(α + β)) e^{−z} z^{α+β−1} 1(z > 0) , z ∈ R ,

and

c = B(α, β) Γ(α + β) / (Γ(α)Γ(β)) .

Since h1 and h2 are densities of Beta(α, β) and Gamma(α + β), respectively, Theorem 4.7 shows that c = 1 and W and Z are independent with respective densities h1 and h2. In particular, this means X/(X + Y) follows Beta(α, β). Furthermore, c = 1 reconfirms that

B(α, β) = Γ(α)Γ(β)/Γ(α + β) , α, β > 0 .

In the next result and elsewhere, a vector x ∈ Rn , for n = 2, 3, . . ., is to be


thought of as an n × 1 column vector, unless mentioned otherwise.

Theorem 4.9. If X = (X1, . . . , Xd) has density f, A is a d × d non-singular matrix and

Y = AX + µ ,

for some fixed µ ∈ R^d, then Y = (Y1, . . . , Yd) has density

g(y) = (1/|det(A)|) f(A^{-1}(y − µ)) , y ∈ R^d .

Proof. Follows immediately from Theorem 4.8 by observing that Y = ψ(X) where ψ : R^d → R^d is a bijection defined by

ψ(x) = Ax + µ , x ∈ R^d ,

that the inverse of ψ is T defined by

T(y) = A^{-1}(y − µ) , y ∈ R^d ,

and that the Jacobian matrix of T is A^{-1}.

The next result is a striking application of Theorem 4.8 which is very useful
in statistics.
Theorem 4.10. If X1, . . . , Xn are i.i.d. from standard normal for n ≥ 2, and

X̄ = (1/n) Σ_{i=1}^n Xi , and S = Σ_{i=1}^n (Xi − X̄)^2 ,

then X̄ and S are independent.


Proof. Let P be an n × n orthogonal matrix whose first row is

(1/√n , . . . , 1/√n) ;

such a P exists because the above is a vector of norm 1. Let X = (X1, . . . , Xn), which is to be thought of as a column vector by convention. Define

Y = (Y1, . . . , Yn)^T = PX .

The density of X is

f(x) = (2π)^{-n/2} exp(−(1/2) Σ_{i=1}^n xi^2) = (2π)^{-n/2} exp(−(1/2) x^T x) ,

for all x = (x1, . . . , xn) ∈ R^n. Theorem 4.9 shows that the density of Y is

g(y) = (1/|det(P^T)|) f(P^T y)
 (since det(P^T) = ±1) = (2π)^{-n/2} exp(−(1/2)(P^T y)^T (P^T y))
 = (2π)^{-n/2} exp(−(1/2) y^T P P^T y)
 = (2π)^{-n/2} exp(−(1/2) y^T y) ,

for y ∈ R^n, the last line following from the fact that P is an orthogonal matrix. Thus,

g(y) = ∏_{i=1}^n (1/√(2π)) e^{−yi^2/2} , y = (y1, . . . , yn) ∈ R^n .   (4.7)

In other words, the joint density g(y1, . . . , yn) of (Y1, . . . , Yn) can be factorized as the product of the standard normal density evaluated at y1, . . . , yn. Theorem 4.7 shows Y1, . . . , Yn are i.i.d. from standard normal.
The choice of the first row of P implies

Y1 = √n X̄ .   (4.8)

Once again, that P is an orthogonal matrix implies

Σ_{i=1}^n Xi^2 = Σ_{i=1}^n Yi^2 .   (4.9)

Write

S = Σ_{i=1}^n (Xi^2 − 2 Xi X̄ + (X̄)^2)
 = Σ_{i=1}^n Xi^2 − n(X̄)^2
 = Σ_{i=1}^n Yi^2 − Y1^2
 = Σ_{i=2}^n Yi^2 ,

(4.8) and (4.9) implying the penultimate line. As S is a function of Y2, . . . , Yn and X̄ is a function of Y1, the independence of X̄ and S follows, which completes the proof.
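The orthogonal-matrix argument in the proof of Theorem 4.10 can be mirrored numerically. The sketch below (a rough check only; building P via a QR factorization and the sample sizes are my own choices) constructs an orthogonal P whose first row is (1/√n, . . . , 1/√n), verifies that Y = PX satisfies Y1 = √n X̄ and Σ_{i≥2} Yi² = S, and checks that X̄ and S are empirically uncorrelated.

import numpy as np

rng = np.random.default_rng(3)
n, reps = 5, 100000

# Build an orthogonal P whose first row is (1/sqrt(n), ..., 1/sqrt(n)):
# start from a matrix whose first column is that vector and orthonormalize by QR.
M = np.eye(n)
M[:, 0] = 1.0 / np.sqrt(n)
Q, _ = np.linalg.qr(M)
P = Q.T
if P[0, 0] < 0:          # fix the sign so the first row has +1/sqrt(n) entries
    P = -P

X = rng.standard_normal(size=(reps, n))
Y = X @ P.T                                   # Y = P X, sample by sample (row-wise)

xbar = X.mean(axis=1)
S = ((X - xbar[:, None]) ** 2).sum(axis=1)

print(np.allclose(Y[:, 0], np.sqrt(n) * xbar))        # (4.8): Y1 = sqrt(n) * Xbar
print(np.allclose((Y[:, 1:] ** 2).sum(axis=1), S))    # S = Y2^2 + ... + Yn^2
print("corr(Xbar, S) ~", np.corrcoef(xbar, S)[0, 1].round(4))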

Definition 38. If Z1, . . . , Zn are i.i.d. from standard normal, the distribution of Σ_{i=1}^n Zi^2 is called χ²_n.
Exercise 4.2. Show that S, which is as in Theorem 4.10, has the χ2n−1 distri-
bution.
Exercise 4.3. Let X1, . . . , Xn be i.i.d. from standard normal, and X = (X1, . . . , Xn)^T. Fix µ ∈ R^n and let Σ be an n × n real symmetric positive definite (p.d.) matrix, that is, Σ^T = Σ and x^T Σ x > 0 for all x ∈ R^n \ {0}. Let Σ^{1/2} be the p.d. square root of Σ, that is, Σ^{1/2} is the unique p.d. matrix whose square is Σ. Define

Y = µ + Σ^{1/2} X .   (4.10)

Show that the density of Y = (Y1, . . . , Yn) is

g(y) = (1/((2π)^{n/2} det(Σ^{1/2}))) exp(−(1/2)(y − µ)^T Σ^{-1} (y − µ)) , y ∈ R^n .
Soln.: Follows from Theorem 4.9.
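A sketch of the construction (4.10) in code may be useful; this is my illustration (with an arbitrary µ and Σ and NumPy assumed), not part of the exercise. It computes the symmetric p.d. square root of Σ through its spectral decomposition, forms Y = µ + Σ^{1/2} X from standard normal X, and checks the sample mean and covariance against µ and Σ. In practice a Cholesky factor B with BB^T = Σ is more common; Theorem 4.13 below shows it leads to the same distribution.

import numpy as np

rng = np.random.default_rng(4)
mu = np.array([1.0, -2.0, 0.5])
Sigma = np.array([[2.0, 0.5, 0.3],
                  [0.5, 1.0, 0.2],
                  [0.3, 0.2, 1.5]])

# Symmetric p.d. square root: Sigma = Q diag(lam) Q^T  =>  Sigma^{1/2} = Q diag(sqrt(lam)) Q^T.
lam, Q = np.linalg.eigh(Sigma)
Sigma_half = Q @ np.diag(np.sqrt(lam)) @ Q.T

X = rng.standard_normal(size=(100000, 3))
Y = mu + X @ Sigma_half.T                  # Y = mu + Sigma^{1/2} X, sample by sample

print(Y.mean(axis=0).round(2))             # should be close to mu
print(np.cov(Y, rowvar=False).round(2))    # should be close to Sigma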
The density obtained in the above exercise is the density of the so-called multivariate normal distribution, which is formally defined below. The next several results are devoted to understanding the properties of this distribution. Observing that det(Σ^{1/2}) = √det(Σ), the following definition makes perfect sense.
Definition 39. If X = (X1, . . . , Xn) has density

f(x) = (1/((2π)^{n/2} √det(Σ))) exp(−(1/2)(x − µ)^T Σ^{-1} (x − µ)) , x ∈ R^n ,

for some µ ∈ R^n and n × n p.d. matrix Σ, then X follows the n-dimensional multivariate normal distribution with parameters µ and Σ, which is written as

X ∼ Nn(µ, Σ) .

The interpretation of µ and Σ in the distribution Nn (µ, Σ) will be clear after


a couple of results. The following theorem is essentially the converse of Exc 4.3.
Theorem 4.11. If X ∼ Nn (µ, Σ) and (Y1 , . . . , Yn ) = Y = Σ−1/2 (X − µ), then
Y1 , . . . , Yn are i.i.d. from standard normal.
Proof. The density of X is

f(x) = (1/((2π)^{n/2} √det(Σ))) exp(−(1/2)(x − µ)^T Σ^{-1} (x − µ)) , x ∈ R^n .

Writing

Y = Σ^{-1/2} X − Σ^{-1/2} µ ,

Theorem 4.9 with A = Σ^{-1/2} and µ replaced by −Σ^{-1/2} µ therein implies that the density of Y is

g(y) = (1/det(Σ^{-1/2})) f(Σ^{1/2} y + Σ^{1/2} Σ^{-1/2} µ)
 = (1/det(Σ^{-1/2})) f(Σ^{1/2} y + µ)
 (det(Σ^{-1/2}) = (det(Σ))^{-1/2}) = (2π)^{-n/2} exp(−(1/2)(Σ^{1/2} y)^T Σ^{-1} (Σ^{1/2} y))
 = (2π)^{-n/2} exp(−(1/2) y^T y) ,

the last line following from the fact that Σ^{1/2} is symmetric and

Σ^{1/2} Σ^{-1} Σ^{1/2} = I .

As in (4.7) and the subsequent argument, Theorem 4.7 shows Y1, . . . , Yn are i.i.d. from standard normal, which completes the proof.

The next theorem shows that if X ∼ Nn (µ, Σ), then µ and Σ are the “mean
vector” and the “covariance matrix” of X, respectively.
Theorem 4.12. If X ∼ Nn(µ, Σ) where

µ = (µ1, . . . , µn) and Σ = ((σij))_{1≤i,j≤n} ,

then

E(Xi) = µi , i = 1, . . . , n ,
Cov(Xi, Xj) = σij , 1 ≤ i, j ≤ n .

In particular, Var(Xi) = σii for i = 1, . . . , n.
Proof. Let (Y1, . . . , Yn) = Y = Σ^{-1/2}(X − µ); Y1, . . . , Yn are i.i.d. from standard normal by Theorem 4.11. Rewrite the above as

X = µ + Σ^{1/2} Y ,

or

Xi = µi + Σ_{j=1}^n θij Yj , i = 1, . . . , n ,

where Σ^{1/2} = ((θij))_{1≤i,j≤n}. Since Y1, . . . , Yn are zero mean random variables, it immediately follows that E(Xi) = µi for i = 1, . . . , n. Theorem 3.6 shows that for fixed 1 ≤ i, j ≤ n,

Cov(Xi, Xj) = Cov(Σ_{k=1}^n θik Yk , Σ_{l=1}^n θjl Yl)
 = Σ_{k=1}^n Σ_{l=1}^n θik θjl Cov(Yk, Yl)
 = Σ_{k=1}^n θik θjk ,

the last line following from the fact that Y1, . . . , Yn are independent and each has variance one. Recalling that θik is the (i, k)-th entry of Σ^{1/2}, which is a symmetric matrix, write

Σ_{k=1}^n θik θjk = Σ_{k=1}^n θik θkj
 = (i, j)-th entry of Σ^{1/2} Σ^{1/2}
 = (i, j)-th entry of Σ
 = σij .

It thus follows that

Cov(Xi, Xj) = σij , 1 ≤ i, j ≤ n .

Taking i = j implies Var(Xi) = σii and completes the proof.


Exercise 4.4. If Y1, . . . , Yn are random variables, each having variance one, such that

Cov(Yi, Yj) = 0 , 1 ≤ i < j ≤ n ,

A is an m × n matrix and

X = (X1, . . . , Xm)^T = A (Y1, . . . , Yn)^T ,

show that the covariance matrix of X is AA^T.

The next theorem is consistent with the above exercise.


Theorem 4.13. If X ∼ Nn(0, I), where I is the n × n identity matrix, then for any m × n matrix B with Rank(B) = m and µ ∈ R^m,

BX + µ ∼ Nm(µ, BB^T) .


Proof. Let us first deal with the case m = n. In this case, B is a non-singular matrix as Rank(B) = m. The density of X is

f(x) = (2π)^{-n/2} exp(−(1/2) x^T x) , x ∈ R^n .

Theorem 4.9 implies that the density of Y = BX + µ is

g(y) = (1/|det(B)|) f(B^{-1}(y − µ))
 = (1/(|det(B)|(2π)^{n/2})) exp(−(1/2)(B^{-1}(y − µ))^T (B^{-1}(y − µ)))
 = (1/(|det(B)|(2π)^{n/2})) exp(−(1/2)(y − µ)^T (BB^T)^{-1} (y − µ))
 = (1/((2π)^{n/2} √det(Σ))) exp(−(1/2)(y − µ)^T Σ^{-1} (y − µ)) ,

where Σ = BB^T is a p.d. matrix because B is non-singular, and therefore √det(Σ) = |det(B)|. This shows the stated claim when m = n.
Now assume 1 ≤ m < n. Since Rank(B) = m, the rows of B form a basis of the row space of B. Let C be a matrix whose rows form a basis of the orthogonal complement of the row space of B. In other words, C is an (n − m) × n matrix with Rank(C) = n − m and

BC^T = 0 .   (4.11)

Define Y1, . . . , Yn by

(Y1, . . . , Yn)^T = AX + µ̃ ,

where A is the n × n non-singular matrix whose first m rows are those of B and whose last n − m rows are those of C, and µ̃ is the n × 1 vector whose first m entries form µ and whose remaining entries are 0. Using the result for the case m = n, which has already been shown, it follows that

(Y1, . . . , Yn) ∼ Nn(µ̃, AA^T) .   (4.12)

Using (4.11) to write

AA^T = [ BB^T  0 ; 0  CC^T ] ,   (4.13)

it is immediate that

(AA^T)^{-1} = [ (BB^T)^{-1}  0 ; 0  (CC^T)^{-1} ] .

Thus for y^(1) ∈ R^m and y^(2) ∈ R^{n−m}, which are column vectors by convention, and letting

y = [ y^(1) ; y^(2) ] ,   (4.14)

we get

(y − µ̃)^T (AA^T)^{-1} (y − µ̃) = (y^(1) − µ)^T (BB^T)^{-1} (y^(1) − µ) + (y^(2))^T (CC^T)^{-1} y^(2) .   (4.15)

Recall (4.12) to write the density of (Y1, . . . , Yn) as

g(y) = (1/((2π)^{n/2} √det(AA^T))) exp(−(1/2)(y − µ̃)^T (AA^T)^{-1} (y − µ̃))
 = (1/((2π)^{n/2} √det(AA^T))) exp(−(1/2)(y^(1) − µ)^T (BB^T)^{-1} (y^(1) − µ)) × exp(−(1/2)(y^(2))^T (CC^T)^{-1} y^(2)) ,

by (4.15), where y is partitioned as in (4.14). Use (4.13) to write

det(AA^T) = det(BB^T) det(CC^T) ,

which allows simplifying g to

g(y) = (1/((2π)^{m/2} √det(BB^T))) exp(−(1/2)(y^(1) − µ)^T (BB^T)^{-1} (y^(1) − µ))
 × (1/((2π)^{(n−m)/2} √det(CC^T))) exp(−(1/2)(y^(2))^T (CC^T)^{-1} y^(2))
 = g1(y^(1)) g2(y^(2)) ,

where g1 and g2 are the densities of Nm(µ, BB^T) and Nn−m(0, CC^T), respectively. Exc 4.1 shows that (Y1, . . . , Ym) and (Ym+1, . . . , Yn) are independent, following Nm(µ, BB^T) and Nn−m(0, CC^T), respectively. Since

(Y1, . . . , Ym)^T = BX + µ ,

a restatement of the former is that

BX + µ ∼ Nm(µ, BB^T) ,

which completes the proof.


Theorem 4.14. If X ∼ Nn(µ, Σ) and B is an m × n matrix with Rank(B) = m, then

BX ∼ Nm(Bµ, BΣB^T) .

Proof. Let (Y1, . . . , Yn) = Y = Σ^{-1/2}(X − µ). Theorem 4.11 shows Y1, . . . , Yn are i.i.d. from standard normal, that is, Y ∼ Nn(0, I).
Write

BX = B(X − µ) + Bµ = BΣ^{1/2} Σ^{-1/2}(X − µ) + Bµ = AY + Bµ ,

where A = BΣ^{1/2}. Since A is an m × n matrix with Rank(A) = m because Σ^{1/2} is non-singular, Theorem 4.13 shows

AY + Bµ ∼ Nm(Bµ, AA^T) .

Observing that

AA^T = BΣB^T ,

the proof follows.
An immediate corollary of the above theorem is the following.
Corollary 4.1. If X1, . . . , Xn are independent and Xi ∼ N(µi, σi²) for i = 1, . . . , n, then

Σ_{i=1}^n Xi ∼ N(Σ_{i=1}^n µi , Σ_{i=1}^n σi²) .

Proof. Follows from Theorem 4.14 by taking B to be the 1 × n matrix [1 . . . 1] and observing that

(X1, . . . , Xn) ∼ Nn(µ, Σ) ,

where µ = (µ1, . . . , µn), Σ = ((σij))_{1≤i,j≤n} and σij = σi² if i = j, and σij = 0 if i ≠ j.

Exercise 4.5. If (X, Y) follows bivariate normal, that is,

(X, Y) ∼ N2(µ, Σ)

for some µ ∈ R² and 2 × 2 p.d. matrix Σ, show that

Corr(X, Y) = 0 ⟺ X, Y are independent.

Show that the above equivalence fails if each of X and Y follows normal but (X, Y) is not necessarily bivariate normal.
Hint. HW4/15.
Exercise 4.6. If X ∼ Nn(0, Σ), show that

Var(Σ_{i=1}^n Xi²) = 2 Tr(Σ²) ,

where Tr(·) denotes the trace of a square matrix.
Hint. The spectral theorem for real symmetric matrices implies Σ = PDP^T for some orthogonal matrix P and diagonal matrix D. Define Y = P^T X and show that

Y^T Y = X^T X .
Exercise 4.7. Suppose (X1, . . . , Xn) ∼ Nn(µ, Σ) where µ = (µ1, . . . , µn) and Σ = ((σij))_{1≤i,j≤n}.
1. Show that Xi ∼ N(µi, σii) for i = 1, . . . , n.
2. If i1, . . . , ik ∈ {1, . . . , n} are distinct, show that

(Xi1, . . . , Xik) ∼ Nk((µij)_{1≤j≤k} , ((σij,ij'))_{1≤j,j'≤k}) .

3. For A, B ⊂ {1, . . . , n} with A ≠ ∅, B ≠ ∅ and A ∩ B = ∅, show that

(Xi : i ∈ A) and (Xi : i ∈ B) are independent

if and only if

Cov(Xi, Xj) = 0 for all i ∈ A, j ∈ B .

Exercise 4.8. Suppose X and Y are i.i.d. from standard normal. Let ρ ∈ (−1, 1) and set

Z = ρX + √(1 − ρ²) Y .

Show that

(X, Z) ∼ N2((0, 0) , [ 1 ρ ; ρ 1 ]) .
Exercise 4.9. If X1, . . . , Xn are i.i.d. from standard normal, P is an m × n matrix with 1 ≤ m < n and PP^T = Im, and

(Y1, . . . , Ym) = Y = PX ,

show that

Σ_{i=1}^m Yi² ∼ χ²_m ,
Σ_{i=1}^n Xi² − Σ_{i=1}^m Yi² ∼ χ²_{n−m} ,

and

Σ_{i=1}^m Yi² and Σ_{i=1}^n Xi² − Σ_{i=1}^m Yi² are independent.

The last topic to be studied in this chapter is order statistics. We start with
defining the same.
Definition 40. The ascending sort map is a map T : Rn → Rn which sorts
the entries of a vector in ascending order, that is, T (x1 , . . . , xn ) = (y1 , . . . , yn )
means y1 ≤ . . . ≤ yn and (y1 , . . . , yn ) is a permutation of (x1 , . . . , xn ) for all
(x1 , . . . , xn ) ∈ Rn . For random variables X1 , . . . , Xn , their order statistics
X(1) , . . . , X(n) are defined by

X(1) , . . . , X(n) = T (X1 , . . . , Xn ) .

The ascending order map is a continuous function from Rn to Rn and thus
Borel measurable. Therefore, if X1 , . . . , Xn are random variables, which by
convention are defined on the same probability space, their order statistics are
random variables as well.
Theorem 4.15. If X1, . . . , Xn are i.i.d. from some density f, then the density of their order statistics (X(1), . . . , X(n)) is

g(x1, . . . , xn) = n! f(x1) . . . f(xn) if x1 < x2 < . . . < xn , and g(x1, . . . , xn) = 0 otherwise.

The proof uses the following exercise which is a special case of Exc 1.7.
Exercise 4.10. Suppose P, Q are finite measures on (Rn , B(Rn )) such that for
some open set U ,
P (U c ) = 0 = Q(U c ) .
If P (R) = Q(R) for all R = (a1 , b1 ]×. . .×(an , bn ] ⊂ U with −∞ < ai < bi < ∞,
i = 1, . . . , n, show that P and Q agree on B(Rn ).
Proof of Theorem 4.15. Letting U = {(x1, . . . , xn) ∈ R^n : x1 < . . . < xn}, the proof would follow from the above exercise once it is shown that

P((X(1), . . . , X(n)) ∈ U^c) = 0 = ∫_{U^c} g(x1, . . . , xn) dx1 . . . dxn ,   (4.16)

and

P((X(1), . . . , X(n)) ∈ R) = ∫_R g(x1, . . . , xn) dx1 . . . dxn ,   (4.17)

for all

R = (a1, b1] × . . . × (an, bn] ⊂ U with −∞ < ai < bi < ∞ , i = 1, . . . , n .   (4.18)

Since X1, . . . , Xn are i.i.d. from a density f, (X1, . . . , Xn) has a density h given by

h(x1, . . . , xn) = f(x1) . . . f(xn) , x1, . . . , xn ∈ R .

Since {(x1, . . . , xn) : xi = xj for some 1 ≤ i < j ≤ n} is a finite union of lower-dimensional subspaces of R^n, it is a set of zero Lebesgue measure, and hence the integral of h on that set is zero. In other words,

P(Xi = Xj for some 1 ≤ i < j ≤ n) = 0 .

Consequently,

P(X(i) = X(j) for some 1 ≤ i < j ≤ n) = 0 .

This along with the obvious fact that X(1) ≤ . . . ≤ X(n) shows that

P((X(1), . . . , X(n)) ∈ U^c) = 0 .

That

∫_{U^c} g(x1, . . . , xn) dx1 . . . dxn = 0

follows tautologically from the definition of g. That is, (4.16) holds.
For (4.17), fix R as in (4.18). An immediate consequence of R ⊂ U is that a1 < b1 ≤ a2 < b2 ≤ . . . ≤ an < bn and hence

(ai, bi] ∩ (aj, bj] = ∅ , 1 ≤ i < j ≤ n .   (4.19)

Therefore,

P((X(1), . . . , X(n)) ∈ R)
 = P(ai < X(i) ≤ bi , i = 1, . . . , n)
 = P(⋃_{π permutation of {1,...,n}} [ai < Xπ(i) ≤ bi , i = 1, . . . , n])
 = Σ_{π permutation of {1,...,n}} P(ai < Xπ(i) ≤ bi , i = 1, . . . , n) ,

the penultimate line following from the fact that X(1)(ω), . . . , X(n)(ω) is a permutation of X1(ω), . . . , Xn(ω) for every ω ∈ Ω, and the last line following from the observation that for distinct permutations π and π',

[ai < Xπ(i) ≤ bi , i = 1, . . . , n] ∩ [ai < Xπ'(i) ≤ bi , i = 1, . . . , n] = ∅ ,

which is another consequence of (4.19).
For a fixed permutation π, the independence of Xπ(1), . . . , Xπ(n) implies

P((X(1), . . . , X(n)) ∈ R) = Σ_{π permutation of {1,...,n}} ∏_{i=1}^n P(ai < Xπ(i) ≤ bi)
 (Xπ(i) =d X1 , i = 1, . . . , n) = Σ_{π permutation of {1,...,n}} ∏_{i=1}^n ∫_{ai}^{bi} f(x) dx
 = n! ∏_{i=1}^n ∫_{ai}^{bi} f(x) dx
 = n! ∫_{a1}^{b1} . . . ∫_{an}^{bn} f(x1) . . . f(xn) dxn . . . dx1
 = ∫_R g(x1, . . . , xn) dx1 . . . dxn .

Thus (4.17) holds for every R as in (4.18). This completes the proof.
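As a quick numerical consequence of Theorem 4.15 (or directly of the definition), integrating g over {x1 < . . . < xn ≤ t} gives P(X(n) ≤ t) = F(t)^n and, similarly, P(X(1) > t) = (1 − F(t))^n. The following sketch (the exponential example, sample size and use of NumPy are my choices) checks these two identities by simulation.

import numpy as np

rng = np.random.default_rng(5)
n, reps = 4, 200000
X = rng.exponential(scale=1.0, size=(reps, n))   # i.i.d. Exponential(1), CDF F(t) = 1 - exp(-t)
X_sorted = np.sort(X, axis=1)                    # order statistics X_(1) <= ... <= X_(n)

F = lambda t: 1 - np.exp(-t)
for t in (0.5, 1.0, 2.0):
    print("P(X_(n) <= %.1f):" % t, (X_sorted[:, -1] <= t).mean().round(4),
          "theory:", (F(t) ** n).round(4))
    print("P(X_(1) >  %.1f):" % t, (X_sorted[:, 0] > t).mean().round(4),
          "theory:", ((1 - F(t)) ** n).round(4))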
Exercise 4.11. If (X1, . . . , Xn) is a random vector in R^n with joint density f and (X(1), . . . , X(n)) is its order statistic, show that the density of the latter is

g(x1, . . . , xn) = Σ_{π permutation of {1,...,n}} f(xπ(1), . . . , xπ(n)) if x1 < . . . < xn , and g(x1, . . . , xn) = 0 otherwise.

5 Conditional expectation
Following the usual convention, (Ω, A, P ) is the probability space underlying
everything we talk about, unless specifically mentioned otherwise. As in Defi-
nition 24, for A, B ∈ A with P (A) > 0, the conditional probability of B given
that A has occurred is
P(B|A) = P(B ∩ A)/P(A) .

Suppose now that 0 < P(A) < 1 and we want to define the conditional probability of B given that we know whether A has occurred or not. If A has occurred, then the above is the natural definition of the said conditional probability, whereas it is

P(B ∩ A^c)/P(A^c)

if A has not occurred, that is, A^c has occurred. In other words,

(P(B ∩ A)/P(A)) 1_A + (P(B ∩ A^c)/P(A^c)) 1_{A^c}   (5.1)

is a natural definition of the conditional probability of B given the knowledge of whether A has occurred or not. The said conditional probability is thus a random variable depending on 1_A and 1_{A^c} alone. To generalize this, if A1, A2, A3, . . . are mutually exclusive and exhaustive events of positive probability, that is,

Ai ∩ Aj = ∅ for all i ≠ j , ⋃_{i=1}^∞ Ai = Ω , and P(Ai) > 0 , i = 1, 2, . . . ,   (5.2)

then a similar reasoning says that the conditional probability of B given the knowledge of which one of A1, A2, . . . has occurred is

Z = Σ_{i=1}^∞ (P(B ∩ Ai)/P(Ai)) 1_{Ai} ,

which should be thought of as a generalization of (5.1).


Thus, Z is a random variable depending on 1A1 , 1A2 , . . . , alone. Further-
more, if E ∈ σ({A1 , A2 , . . .}), then (5.2) shows
[
E= Ai for some N ⊂ N .
i∈N

Therefore,
X
P (B ∩ E) = P (B ∩ Ai )
i∈N
Z  XZ
P (B ∩ Ai )
Z
Z dP = dP = P (B ∩ Ai ) = Z dP
Ai P (Ai ) Ai Ai
i∈N
Z
= Z dP .
E

The conditional probability of B, given the knowledge of which one of A1, A2, . . . has occurred, is thus a random variable Z that is σ({A1, A2, . . .})-measurable and satisfies

∫_E Z dP = P(B ∩ E) for all E ∈ σ({A1, A2, . . .}) .

Since the right hand side is the same as ∫_E 1_B dP, and interpreting the conditional probability of B as the “conditional expectation” of 1_B, a natural candidate for the latter is thus a σ({A1, A2, . . .})-measurable random variable Z satisfying

∫_E Z dP = ∫_E 1_B dP for all E ∈ σ({A1, A2, . . .}) .

Reasoning along similar lines, for an integrable random variable X and a σ-field F ⊂ A, the conditional expectation of X given F should be defined as an F-measurable random variable Z which satisfies

∫_E Z dP = ∫_E X dP for all E ∈ F .
E E

The following theorem guarantees the existence of such Z and its uniqueness
upto zero probability sets.
Theorem 5.1. For an integrable random variable X and a σ-field F ⊂ A, there exists an integrable random variable Z which is F-measurable and satisfies

∫_E Z dP = ∫_E X dP for all E ∈ F .   (5.3)

If Z' is another F-measurable and integrable random variable such that the above holds with Z replaced by Z', then Z' = Z a.s.
Proof. Write X = X⁺ − X⁻ where X⁺ = X ∨ 0 and X⁻ = (−X) ∨ 0. Since X is integrable, so are X⁺ and X⁻. Define measures µ⁺ and µ⁻ on (Ω, F) by

µ⁺(E) = ∫_E X⁺ dP , µ⁻(E) = ∫_E X⁻ dP for all E ∈ F .

Thus µ⁺ and µ⁻ are finite measures on (Ω, F) and each of them is absolutely continuous with respect to P. Theorem 1.7, which is the Radon-Nikodym theorem, implies there exist F-measurable functions Z1 and Z2 from Ω to [0, ∞) satisfying

∫_E Z1 dP = µ⁺(E) , and ∫_E Z2 dP = µ⁻(E) , for all E ∈ F .

Letting Z = Z1 − Z2, (5.3) clearly holds.
If Z' is another F-measurable and integrable random variable such that (5.3) holds with Z replaced by Z', then it would follow that

∫_E (Z − Z') dP = 0 , E ∈ F .

Since Z − Z' is F-measurable, the above implies Z − Z' = 0 a.s. This completes the proof.
Definition 41. For an integrable random variable X and a σ-field F ⊂ A,
the conditional expectation of X given F, denoted by E(X|F), is an integrable
random variable Z which is F-measurable and satisfies (5.3). Theorem 5.1 guar-
antees the existence of such Z and its uniqueness upto sets of zero probability.
While the above definition is motivated by (5.1), the same can be arrived at by another route of reasoning. Recall that for any X ∈ L²(Ω),

E(X) = arg min_{α∈R} E((X − α)²) ,

that is, E(X) is the unique α ∈ R at which the right hand side is minimized. When no additional information is available, minimization over the set of real numbers makes sense. However, if for a σ-field F ⊂ A, we know for each E ∈ F whether it has occurred or not, then the class should be expanded to all F-measurable functions, because any F-measurable function is now “known”. The following result makes this idea precise.
Theorem 5.2. If E(X²) < ∞, and F ⊂ A is a σ-field, then

E(X|F) = arg min {∫ (X − Y)² dP : Y is measurable with respect to F} a.s.

Proof. Recall that L²(Ω, A, P) is a Hilbert space of which L²(Ω, F, P) is a subspace. Further, L²(Ω, F, P) is a complete metric space. Hence, L²(Ω, F, P) is a closed subspace of L²(Ω, A, P). Since X ∈ L²(Ω, A, P), it has a projection onto L²(Ω, F, P), which we call Z. In other words,

Z = arg min {∫ (X − Y)² dP : Y ∈ L²(Ω, F, P)} .   (5.4)

Our first task is to show that the above Z actually minimizes the L² distance from X over all F-measurable functions. Indeed, for a random variable Y with E(Y²) = ∞, it holds that

∫ (X − Y)² dP = ∞

because otherwise Y = X − (X − Y) would be in L² as X is in L². Since Z ∈ L²(Ω, F, P),

∫ (X − Z)² dP < ∞ = ∫ (X − Y)² dP , if E(Y²) = ∞ .

Thus,

Z = arg min {∫ (X − Y)² dP : Y is measurable with respect to F} .

To complete the proof, all that remains to show is Z = E(X|F) a.s. Since Z is F-measurable, this would follow once (5.3) is shown to hold for this Z.
Results in functional analysis show that Z as in (5.4) is an orthogonal projection onto L²(Ω, F, P). That is, X − Z belongs to the orthogonal complement of L²(Ω, F, P). Since 1_E ∈ L²(Ω, F, P) for any E ∈ F, it thus follows that

∫ (X − Z) 1_E dP = 0 ,

which is the same as (5.3). Hence the proof follows.


The inadequacy of the statement of Theorem 5.2 as a definition of conditional
expectation is that it works only for L2 (Ω), whereas Definition 41 is valid on
L1 (Ω), which is a superset of L2 (Ω) because P is a finite measure. Therefore, the
conditional expectation of any integrable random variable will be as in Definition
41.
The following theorem is in line with our intuition that given a σ-field F,
any F-measurable random variable is a known quantity, and hence should come
outside the conditional expectation just like a constant comes out of an expec-
tation. Throughout this chapter F is a σ-field with F ⊂ A, unless mentioned
otherwise.
Theorem 5.3. If X and Y are random variables such that Y and XY are integrable and X is F-measurable, then

E(XY|F) = X E(Y|F) a.s.

Proof. Let us first assume Y ≥ 0 and let µ be a finite measure on (Ω, A) defined by

µ(E) = ∫_E Y dP , E ∈ A .

As shown in the proof of Theorem 5.1, Z = E(Y|F) is simply the Radon-Nikodym derivative of µ with respect to P on (Ω, F). Exc 1.10 and the fact that X is F-measurable show that

∫_{(Ω,F)} |X| Z dP = ∫_{(Ω,F)} |X| dµ
 (by Exc 1.8) = ∫_{(Ω,A)} |X| dµ
 = ∫_{(Ω,A)} |X| Y dP < ∞ ,

the equality in the last line being implied by the fact that Y is the Radon-Nikodym derivative of µ with respect to P on (Ω, A), and the inequality following from the hypothesis that XY is integrable. Thus XZ is P-integrable.
Hence, for all E ∈ F, XZ1_E is P-integrable, showing by a similar argument that

∫_{(Ω,F)} X 1_E Z dP = ∫_{(Ω,F)} X 1_E dµ ,   (5.5)

because X1_E is F-measurable. A similar argument shows

∫_{(Ω,A)} X 1_E Y dP = ∫_{(Ω,A)} X 1_E dµ .   (5.6)

Once again, the right hand sides of (5.5) and (5.6) are equal by Exc 1.8. This shows

∫_E XZ dP = ∫_E XY dP , E ∈ F .

Since XZ is F-measurable, it follows that XZ = E(XY|F), which completes the proof for the case Y ≥ 0. For the general case, the proof follows from similar arguments by splitting Y = Y⁺ − Y⁻.
Theorem 5.4. If X and Y are integrable random variables, then the following
hold.
1. The random variable X + Y is integrable and

E(X + Y |F) = E(X|F) + E(Y |F) a.s.

2. For α ∈ R,
E(αX|F) = αE(X|F) a.s.

3. If X ≥ 0 a.s., then
E(X|F) ≥ 0 a.s.

4. If X ≤ Y a.s., then
E(X|F) ≤ E(Y |F) a.s.

Proof. 1. Follows from the definition of conditional expectation.


2. A special case of Theorem 5.3, though it follows from the definition as
well.

3. Follows from the definition of conditional expectation.


4. Implied by 1.-3. above by the following arguments:

E(Y|F) − E(X|F) = E(Y − X|F) ≥ 0 a.s. ,

the equality following from 1. and 2., and the inequality from 3.

The following is the so-called tower property and is in line with the intuition
that conditional expectation given F of an L2 random variable is projection
onto L2 (Ω, F, P ) as in Theorem 5.2.

Theorem 5.5 (Tower property). If F ⊂ G ⊂ A and F, G are σ-fields, then for any integrable X,

E(E(X|G)|F) = E(X|F) a.s.

Proof. Let Y = E(X|G) and Z = E(X|F). Then for any A ∈ F,

∫_A Z dP = ∫_A X dP
 (Y = E(X|G) and A ∈ F ⊂ G) = ∫_A Y dP .

Since Z is F-measurable and the above holds for all A ∈ F, we get

Z = E(Y|F) a.s. ,

which is precisely the claim of the theorem.
which is precisely the claim of the theorem.


The following special case of the tower property deserves special mention.

Corollary 5.1. For an integrable X and a σ-field G ⊂ A,

E(E(X|G)) = E(X) .

Proof. Follows from Theorem 5.5 by taking F = {∅, Ω} and observing that for any integrable random variable Z,

E(Z|F) = E(Z) .

Theorem 5.6. If X is an integrable random variable and G ⊂ A is a σ-field which is independent of σ(X) ∨ F, then

E(X|F ∨ G) = E(X|F) a.s.

Proof. Let Z = E(X|F). Since Z is integrable and F ∨ G-measurable, all that needs to be shown is

∫_E Z dP = ∫_E X dP for all E ∈ F ∨ G .   (5.7)

Let S = {A ∩ B : A ∈ F, B ∈ G}; (5.7) will first be shown to hold for all E ∈ S. Fix E ∈ S, that is, E = A ∩ B for some A ∈ F and B ∈ G. Then

E(X1_E) = E((X1_A)1_B)
 (X1_A, 1_B are respectively measurable w.r.t. σ(X) ∨ F, G) = E(X1_A) E(1_B)
 (Z = E(X|F), A ∈ F) = E(Z1_A) E(1_B)
 = E(Z1_E) ,

the last line again following from the independence of F and G and the fact that Z1_A and 1_B are measurable with respect to them, respectively. Thus, (5.7) holds for all E ∈ S.
Since S is a semi-field, (5.7) can easily be shown to hold for all E in the field generated by S. Finally, standard arguments using Theorem 1.2, which is the monotone class theorem, complete the proof.
Corollary 5.2. If X is integrable and G is a σ-field independent of σ(X), then

E(X|G) = E(X) a.s.

Proof. Follows from Theorem 5.6 by taking F = {∅, Ω}.

The above corollary is diametrically opposite to

E(X|G) = X , if X is measurable w.r.t. G .

Definition 42. If X and Y are random variables and the former is integrable,
define

E(X|Y) = E(X|σ(Y)) .
Theorem 5.7. Suppose X and Y are independent and f : R² → R is a Borel function such that

E(|f(X, Y)|) < ∞ .

Then

Y ∈ {y ∈ R : ∫_R |f(x, y)| P(X ∈ dx) < ∞} a.s.   (5.8)

Further,

E(f(X, Y)|Y) = g(Y) ,

where

g(y) = ∫_R f(x, y) P(X ∈ dx) if ∫_R |f(x, y)| P(X ∈ dx) < ∞ , and g(y) = 0 otherwise.

Proof. Denote by µX and µY the respective distributions of X and Y, that is, µX(B) = P(X ∈ B) for all B ∈ B(R), and likewise for µY. Independence of X and Y implies

P((X, Y) ∈ B) = µX ⊗ µY(B) for all B ∈ B(R²) .

Thus,

∫_R ∫_R |f(x, y)| µX(dx) µY(dy) = E(|f(X, Y)|) < ∞ .

Tonelli's theorem implies

µY({y ∈ R : ∫_R |f(x, y)| µX(dx) < ∞}) = 1 ,

which is exactly the same as (5.8).
Fubini's theorem implies for any A ∈ σ(Y), that is, for A = Y^{-1}B for some B ∈ B(R),

E(f(X, Y)1_A) = ∫_R ∫_R f(x, y) 1_B(y) µX(dx) µY(dy)
 = ∫_B (∫_R f(x, y) µX(dx)) µY(dy)
 (by (5.8)) = ∫_B g(y) µY(dy)
 (because A = Y^{-1}B) = E(g(Y)1_A) .

As this holds for all A ∈ σ(Y), the proof follows.


Exercise 5.1. If X and Y are independent from standard normal and standard uniform, respectively, calculate E(Y e^{XY}).
Soln.: First note that

0 ≤ Y e^{XY} ≤ e^{XY} ≤ e^{|XY|} ≤ e^{|X|} .

Since E(e^{|X|}) < ∞ because X follows normal, Y e^{XY} is integrable. The tower property implies

E(Y e^{XY}) = E(E(Y e^{XY}|Y))
 (Theorem 5.3) = E(Y E(e^{XY}|Y))
 (the above theorem and that E(e^{Xy}) = e^{y²/2}, y ∈ R) = E(Y e^{Y²/2})
 = ∫_0^1 y e^{y²/2} dy
 (z = y²/2, dz = y dy) = ∫_0^{1/2} e^z dz
 = e^{1/2} − 1 .
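A two-line Monte Carlo check of the answer e^{1/2} − 1 ≈ 0.6487 may be reassuring (purely an informal verification, with an arbitrary sample size and NumPy assumed):

import numpy as np

rng = np.random.default_rng(6)
X = rng.standard_normal(1000000)
Y = rng.uniform(size=1000000)
print((Y * np.exp(X * Y)).mean(), np.exp(0.5) - 1)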

Exercise 5.2. For independent X and Y and 0 < p < ∞, show that

E (|X + Y |p ) < ∞ ⇐⇒ E(|X|p ) < ∞ and E(|Y |p ) < ∞ .

Soln.: The “⇐” part follows trivially from the observation

|X + Y|^p ≤ 2^p (|X|^p + |Y|^p) ,

and doesn’t really need independence.
For the “⇒” part, assume

E (|X + Y |p ) < ∞ .

Independence of X, Y and (5.8) imply

Y ∈ {y ∈ R : E (|X + y|p ) < ∞} a.s.

In particular, the above set is non-empty. Thus, there exists y ∈ R such that

E (|X + y|p ) < ∞ .

Since

|X|^p ≤ 2^p (|X + y|^p + |y|^p) ,
we get E(|X|p ) < ∞. This also shows E(|Y |p ) < ∞ and thus proves the “⇒”
part.
Exercise 5.3. A random variable X is infinitely divisible if for all fixed n =
1, 2, . . ., there exist i.i.d. random variables Xn1 , . . . , Xnn defined on some prob-
ability space such that
d
X = Xn1 + . . . + Xnn . (5.9)
If X is an infinitely divisible random variable with mean zero and variance one,
show that
E(X 4 ) ≥ 3 .
Hint. Use the above exercise and HW3/15.
Exercise 5.4. If X and Y are independent and either of them is a continuous
random variable, show that
X ≠ Y a.s.
Exercise 5.5. If

(X, Y) ∼ N2((0, 0) , [ 1 ρ ; ρ 1 ]) ,

show that E(Y|X) = ρX.

6 Modes of convergence and the laws of large numbers
In this chapter, we shall understand the “frequentists' interpretation” of probability. Recall that “the probability of Heads for a coin is p” means that in a large number of tosses of the coin, the observed proportion of Heads is close to p. To make this precise, we need notions of convergence for random variables. As usual, all random variables are defined on (Ω, A, P), unless mentioned otherwise.

Definition 43. For random variables Xn and X, if it holds that

P({ω ∈ Ω : lim_{n→∞} Xn(ω) = X(ω)}) = 1 ,

then as n → ∞, Xn converges to X almost surely, written

Xn → X a.s.

Exercise 6.1. For random variables X, X1, X2, . . ., show that

{ω ∈ Ω : lim_{n→∞} Xn(ω) = X(ω)} = [lim inf_{n→∞} Xn ≥ X ≥ lim sup_{n→∞} Xn] ∈ A .

The following exercise shows that a.s. convergence is as good as convergence


for every sample point, for practical purposes.
Exercise 6.2. If Xn → X a.s., show that there exist random variables X10 ,
X20 , . . . such that Xn0 → X and

P (Xn0 = Xn for all n) = 1 .

Theorem 6.1. If Xn → X a.s. and |Xn| ≤ Y for some Y which has finite expectation, then X has a finite expectation, and

lim_{n→∞} E(Xn) = E(X) .

Proof. Exercise.
Definition 44. A sequence of random variables (Xn) converges in probability to X, written Xn →P X, if for every ε > 0,

lim_{n→∞} P(|Xn − X| > ε) = 0 .

Exercise 6.3. If Xn →P X and Xn →P X', then show that X = X' a.s.
Theorem 6.2. If Xn → X a.s., then Xn →P X.
Proof. Assume that Xn → X a.s. Fix ε > 0. Clearly,

[Xn → X] ⊂ [1(|Xn − X| > ε) → 0] ,

and hence

1(|Xn − X| > ε) → 0 a.s.

Theorem 6.1 shows

lim_{n→∞} E(1(|Xn − X| > ε)) = 0 ,

that is,

P(|Xn − X| > ε) → 0 .

Since this holds for all ε > 0, it follows that Xn →P X, which completes the proof.

Example 6.1. Let Ω = (0, 1], A = B((0, 1]) and P be the restriction of Lebesgue measure to (0, 1]. Define for all ω ∈ Ω,

X1(ω) = 1(0 < ω ≤ 1/2) ,
X2(ω) = 1(1/2 < ω ≤ 1) ,
X3(ω) = 1(0 < ω ≤ 1/4) ,
X4(ω) = 1(1/4 < ω ≤ 1/2) ,
X5(ω) = 1(1/2 < ω ≤ 3/4) ,
X6(ω) = 1(3/4 < ω ≤ 1) ,

and so on. Then, Xn →P 0 but

P(lim_{n→∞} Xn = 0) = 0 .

The above example shows that convergence in probability is a strictly weaker


notion of convergence than almost sure convergence.
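The indicator blocks of Example 6.1 (often called the “typewriter” sequence) can be written down explicitly; the sketch below (the indexing convention, grid approximation and use of NumPy are mine) evaluates Xn on a fine grid of ω values and shows that P(|Xn| > 0) → 0, while every ω keeps being hit by an interval in each block, so Xn(ω) does not converge for any ω.

import numpy as np

def X(n, omega):
    # Block k = 0, 1, 2, ... consists of 2^k indicators of dyadic intervals of length 2^-k.
    k = int(np.floor(np.log2(n)))          # n = 2^k + j with 0 <= j < 2^k, n >= 1
    j = n - 2 ** k
    return ((j * 2.0 ** -k < omega) & (omega <= (j + 1) * 2.0 ** -k)).astype(float)

omega = np.arange(1, 10001) / 10000.0      # grid approximating Omega = (0, 1]
for n in (2, 8, 64, 1024):
    print(n, "P(|X_n| > 0) ~", X(n, omega).mean())
# For each fixed omega, X_n(omega) = 1 for exactly one n in every block,
# so limsup_n X_n(omega) = 1 and the sequence does not converge pointwise.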
Theorem 6.3 (Markov inequality). For a non-negative random variable Z and a > 0,

P(Z ≥ a) ≤ (1/a) E(Z) .
Proof. Since Z ≥ 0, it holds that

E(Z) ≥ E (Z1(Z ≥ a))


≥ E (a1(Z ≥ a))
= aP (Z ≥ a) ,

from which the proof follows.


Definition 45. For 1 ≤ p < ∞ and X, X1 , X2 , . . . ∈ Lp (Ω), Xn → X in Lp if

lim E (|Xn − X|p ) = 0 .


n→∞

Theorem 6.4. If Xn → X in Lp for some 1 ≤ p < ∞, then the following hold.

1. For 1 ≤ q ≤ p, Xn → X in Lq .
P
2. As n → ∞, Xn −→ X.

Proof. 1. Follows from the fact that for any random variable Z,

‖Z‖q ≤ ‖Z‖p ,

as long as 0 < q ≤ p, which is a restatement of HW3/14, the solution of which


follows by applying Jensen to get

φ (E(|Z|q )) ≤ E (φ(|Z|q )) ,

where φ(x) = |x|p/q is convex.


2. For ε > 0,

P (|Xn − X| > ε) = P (|Xn − X|p > εp ) ≤ ε−p E (|Xn − X|p ) ,

by the Markov inequality. Letting n → ∞ completes the proof.


Exercise 6.4. If Xn → Y in Lp for some 1 ≤ p < ∞ and Xn → Z a.s., show
that
Y = Z a.s.
Exercise 6.5. 1. In Example 6.1, show that for all p ∈ [1, ∞), Xn → 0 in
Lp .
2. Show that for p ∈ [1, ∞), convergence in Lp neither implies nor is implied
by a.s. convergence.
Theorem 6.5 (Weak law of large numbers (WLLN) for finite variance). If
X1 , X2 , . . . are i.i.d. random variables with mean µ and finite variance, then
n
1X
Xi → µ
n i=1

as n → ∞ in L2 and hence in probability.


Proof. Since
E( (1/n) ∑_{i=1}^n Xi ) = µ , n = 1, 2, . . . ,
it follows that
E[ ( (1/n) ∑_{i=1}^n Xi − µ )² ] = Var( (1/n) ∑_{i=1}^n Xi )
                                 = (1/n²) ∑_{i=1}^n Var(Xi )            (X1 , . . . , Xn are independent)
                                 = (1/n) Var(X1 )                       (X1 , . . . , Xn identically distributed)
                                 → 0 , n → ∞ .

Thus
n
1X
Xi → µ as n → ∞ ,
n i=1

in L2 and hence in probability. This completes the proof.


Exercise 6.6. If a coin with probability of Heads p is tossed infinitely many
times, and Xn denotes the proportion of Heads observed in the first n tosses,
then show that as n → ∞,
Xn → p in probability .
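Exercise 6.6 is the frequentists' interpretation made precise, and it is easy to simulate. A minimal sketch in Python, assuming p = 0.3 and the displayed sample sizes as arbitrary choices:

import numpy as np

rng = np.random.default_rng(0)
p = 0.3                                          # probability of Heads (arbitrary)
tosses = rng.random(100_000) < p                 # X_1, X_2, ... as indicators of Heads
proportions = np.cumsum(tosses) / np.arange(1, tosses.size + 1)

# The observed proportion of Heads in the first n tosses approaches p.
for n in (10, 100, 1_000, 10_000, 100_000):
    print(n, proportions[n - 1])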
Theorem 6.6 (Borel-Cantelli lemma). If A1 , A2 , . . . are events such that
∑_{n=1}^∞ P (An ) < ∞ ,
then
P ({ω ∈ Ω : ω ∈ An for infinitely many n}) = 0 .
Proof. Let
Bn = An ∪ An+1 ∪ . . . , n ≥ 1 ,
and
B∞ = ∩_{n=1}^∞ Bn .
Clearly,
B∞ = {ω ∈ Ω : ω ∈ An for infinitely many n} .
Furthermore, since Bn ↓ B∞ , it follows that
P (B∞ ) = lim_{n→∞} P (Bn ) ≤ lim_{n→∞} ∑_{k=n}^∞ P (Ak ) = 0 ,

which completes the proof.


P
Theorem 6.7. If Xn −→ X, then Xn has a subsequence Xnk such that

Xnk → X a.s. ,

as k → ∞.
Proof. Since Xn → X in probability, there exists n1 such that
P (|Xn1 − X| > 1) ≤ 1/2 .
There exists N2 such that
P (|Xn − X| > 1/2) ≤ 2⁻² for all n ≥ N2 .

Define n2 = N2 ∨ (n1 + 1). Proceeding similarly, we get positive integers n1 <
n2 < n3 < . . . such that
 
P (|Xnk − X| > 1/k) ≤ 2⁻ᵏ for all k .

Hence,
∑_{k=1}^∞ P (|Xnk − X| > 1/k) < ∞ .

Borel-Cantelli Lemma implies that


 
P (|Xnk − X| > 1/k for infinitely many k) = 0 .

Thus,
Xnk → X a.s. ,
as k → ∞. This completes the proof.
P
Exercise 6.7. If Xn −→ X and |Xn | ≤ Y for some Y with E(Y ) < ∞, then
prove that
lim E(Xn ) = E(X) .
n→∞

Exercise 6.8. Prove or disprove the following claim. If Xn and X are ran-
dom variables such that any subsequence {Xnk : k ≥ 1} of Xn has a further
subsequence {Xnkl : l ≥ 1} such that

Xnkl → X a.s. ,

then Xn → X a.s.

Exercise 6.9. Show that the following are equivalent for random variables Xn
and X.
P
1. As n → ∞, Xn −→ X.
2. Every subsequence {Xnk : k ≥ 1} of {Xn : n ≥ 1} has a further subse-
quence {Xnkl : l ≥ 1} such that as l → ∞,

Xnkl → X a.s.

3. Every subsequence {Xnk : k ≥ 1} of {Xn : n ≥ 1} has a further subse-


quence {Xnkl : l ≥ 1} such that as l → ∞,

P
Xnkl −→ X .

Theorem 6.8. If X, X1 , X2 , . . . are random variables such that

∑_{n=1}^∞ P (|Xn − X| > ε) < ∞ for all ε > 0 ,

then Xn → X a.s.
Proof. Let
Z = lim sup |Xn − X| ,
n→∞

which is a possibly improper random variable. The proof would follow if it can
be shown that Z = 0 a.s., which is the same as

P (Z > ε) = 0 for all ε > 0 . (6.1)

For ε > 0,

P (Z > ε) = P (|Xn − X| > ε for infinitely many n) .

The hypothesis in conjunction with the Borel-Cantelli lemma shows that the
right hand side is zero. Thus (6.1) follows, which completes the proof.
Theorem 6.9 (Strong law of large numbers (SLLN) for finite fourth moment).
If X1 , X2 , . . . are i.i.d. random variables with finite fourth moment, then as n → ∞,
(1/n) ∑_{i=1}^n Xi → E(X1 ) a.s. and in L⁴ .

Proof. Without loss of generality, assume that E(X1 ) = 0. It suffices to show
that
∑_{n=1}^∞ E[ ( (1/n) ∑_{i=1}^n Xi )⁴ ] < ∞ ,                              (6.2)
for the following reasons. Markov’s inequality would show for ε > 0 and n =
1, 2, . . .,
P ( | (1/n) ∑_{i=1}^n Xi | > ε ) ≤ ε⁻⁴ E[ ( (1/n) ∑_{i=1}^n Xi )⁴ ] .
From (6.2) it would follow that
∑_{n=1}^∞ P ( | (1/n) ∑_{i=1}^n Xi | > ε ) < ∞ ,
which in conjunction with Theorem 6.8 would prove
(1/n) ∑_{i=1}^n Xi → 0 a.s.
Besides, (6.2) would show that
lim_{n→∞} E[ ( (1/n) ∑_{i=1}^n Xi )⁴ ] = 0 ,
which is the same as
(1/n) ∑_{i=1}^n Xi → 0 in L⁴ , n → ∞ .
Thus, showing (6.2) suffices.


For proving (6.2), write for a fixed n,
E[ ( (1/n) ∑_{i=1}^n Xi )⁴ ] = n⁻⁴ E[ ∑_{i,j,k,l=1}^n Xi Xj Xk Xl ]       (6.3)
                             = n⁻⁴ ∑_{i,j,k,l=1}^n E(Xi Xj Xk Xl ) .
Since X1 , . . . , Xn are i.i.d. and zero mean,
E(Xi Xj Xk Xl ) ≠ 0
implies either i = j = k = l, in which case E(Xi Xj Xk Xl ) = E(X1⁴), or exactly
one of the following holds:
i = j ≠ k = l , i = k ≠ j = l or i = l ≠ j = k .
In each of the above 3 cases, E(Xi Xj Xk Xl ) = (E(X1²))² . Thus,
E[ ( (1/n) ∑_{i=1}^n Xi )⁴ ] = n⁻⁴ [ n E(X1⁴) + 3n(n − 1) (E(X1²))² ]      (6.4)
                             ≤ n⁻² [ E(X1⁴) + 3 (E(X1²))² ] .

Thus (6.2) follows, which completes the proof.


The next result, which is the most general strong law of large numbers,
assumes only finite mean and no higher moment. The following example shows
that such random variables exist.
Example 6.2. Suppose X is a random variable with density
f (x) = [ 1 / (c x² (log x)²) ] 1(x > e) , x ∈ R ,
where
c = ∫_e^∞ dx / (x² (log x)²) .
Thus X ≥ 0 and
E(X) = (1/c) ∫_e^∞ dx / (x (log x)²)
     = (1/c) ∫_1^∞ dy / y²                     (substituting y = log x, dy = dx/x)
     = 1/c < ∞ .
For any ε > 0,
E(X^{1+ε}) = (1/c) ∫_e^∞ dx / (x^{1−ε} (log x)²) = ∞
because
lim_{x→∞} x / (x^{1−ε} (log x)²) = ∞ ,
and
∫_e^∞ dx/x = ∞ .
Thus E(X) < ∞ = E(X 1+ε ) for all ε > 0. That is, X has finite mean but any
higher moment is infinite.
Theorem 6.10 (SLLN). For i.i.d. random variables X1 , X2 , . . . with finite mean
µ,
(1/n) ∑_{i=1}^n Xi → µ a.s. ,
as n → ∞.
For proving the SLLN, the following inequality will be used.
Theorem 6.11 (Kolmogorov maximal inequality). Let X1 , . . . , Xn be indepen-
dent random variables with finite variance. Then, for any α > 0,
 
P ( max_{1≤k≤n} |Sk − E(Sk )| ≥ α ) ≤ α⁻² Var(Sn ) ,
where
Sk = ∑_{i=1}^k Xi , 1 ≤ k ≤ n .

The following inequality obtained by putting n = 1 above is known as Cheby-


shev’s inequality in probability theory. This follows directly from Markov’s
inequality as well.
Corollary 6.1 (Chebyshev’s inequality). If X has mean µ and finite variance
σ 2 , then
P (|X − µ| ≥ α) ≤ α−2 σ 2 , α > 0 .

Proof of Theorem 6.11. WLOG, assume that X1 , . . . , Xn are zero mean. We
start with the observation that
[ max_{1≤k≤n} |Sk | ≥ α ] = ∪_{k=1}^n Ak ,
where
Ak = [ |Sk | ≥ α > |Sj | for all 1 ≤ j ≤ k − 1 ] , k = 1, . . . , n .
Since A1 , . . . , An are disjoint, it follows that
Var(Sn ) = E(Sn²)
         ≥ ∑_{k=1}^n E( Sn² 1_{Ak} )
         = ∑_{k=1}^n [ E( (Sn − Sk )² 1_{Ak} ) + E( Sk² 1_{Ak} ) + 2E( (Sn − Sk )Sk 1_{Ak} ) ]
         ≥ ∑_{k=1}^n [ E( Sk² 1_{Ak} ) + 2E( (Sn − Sk )Sk 1_{Ak} ) ] .
Since Sn − Sk and Sk 1_{Ak} are independent and the former has zero mean, it
follows that
E( (Sn − Sk )Sk 1_{Ak} ) = 0 ,
and hence
Var(Sn ) ≥ ∑_{k=1}^n E( Sk² 1_{Ak} )
         ≥ ∑_{k=1}^n E( α² 1_{Ak} )
         = α² P ( max_{1≤k≤n} |Sk | ≥ α ) .

This completes the proof.


Proof of Theorem 6.10. WLOG, assume that E(X1 ) = 0. For n ≥ 1, define
Sn := ∑_{i=1}^n Xi ,
Xn′ := Xn 1(|Xn | ≤ n) ,
and
Sn′ := ∑_{k=1}^n Xk′ .
Notice that
∑_{n=1}^∞ P (Xn ≠ Xn′ ) = ∑_{n=1}^∞ P (|X1 | > n)
                        ≤ ∑_{n=1}^∞ ∫_{n−1}^n P (|X1 | > s) ds            (6.5)
                        = ∫_0^∞ P (|X1 | > s) ds
                        = E(|X1 |) < ∞ ,
(6.5) following from the observation that P (|X1 | > n) ≤ P (|X1 | > s) for s ≤ n.
From the Borel-Cantelli lemma, it follows that
lim sup_{n→∞} |Sn − Sn′ | < ∞
almost surely, and hence
n⁻¹ Sn − n⁻¹ Sn′ → 0
almost surely. So it suffices to show that
n⁻¹ Sn′ → 0 a.s.                                                          (6.6)
Notice that by DCT,
E(Xn′ ) = E[ X1 1(|X1 | ≤ n) ] → E(X1 ) = 0 ,
and hence,
lim_{n→∞} n⁻¹ E(Sn′ ) = 0 .
Therefore, (6.6) will follow if we can show that
n⁻¹ [ Sn′ − E(Sn′ ) ] → 0 a.s.
For r ≥ 1, set
Zr := max_{2^{r−1} ≤ k < 2^r} |Sk′ − E(Sk′ )| .
Since
(1/k) |Sk′ − E(Sk′ )| ≤ 2^{−(r−1)} Zr for all 2^{r−1} ≤ k < 2^r ,
it suffices to show that
2^{−r} Zr → 0 a.s.
The above will follow from Theorem 6.8 if it can be shown that
∑_{r=1}^∞ P [ |Zr | > 2^r ε ] < ∞
for any ε > 0. Kolmogorov’s inequality implies that
∑_{r=1}^∞ P [ |Zr | > 2^r ε ] ≤ ∑_{r=1}^∞ P ( max_{1≤k≤2^r} |Sk′ − E(Sk′ )| > 2^r ε )
                              ≤ ε⁻² ∑_{r=1}^∞ 4^{−r} Var(S′_{2^r})
                              = ε⁻² ∑_{r=1}^∞ 4^{−r} ∑_{j=1}^{2^r} Var(Xj′ )
                              ≤ ε⁻² ∑_{j=1}^∞ Var(Xj′ ) ∑_{r=⌈log₂ j⌉}^∞ 4^{−r}
                              ≤ K ∑_{j=1}^∞ j⁻² Var(Xj′ ) ,
the last line following from the calculation that
∑_{r=⌈log₂ j⌉}^∞ 4^{−r} = (4/3) 4^{−⌈log₂ j⌉} ≤ (4/3) 4^{−log₂ j} = (4/3) j⁻² .
Thus, in order to complete the proof, all that needs to be shown is that
∑_{j=1}^∞ j⁻² Var(Xj′ ) < ∞ .
To that end, observe that
∑_{j=1}^∞ j⁻² Var(Xj′ ) ≤ ∑_{j=1}^∞ j⁻² E( (Xj′ )² )
                        = ∑_{j=1}^∞ j⁻² E( X1² 1(|X1 | ≤ j) )
                        = ∑_{k=1}^∞ E( X1² 1(k − 1 < |X1 | ≤ k) ) ∑_{j=k}^∞ j⁻²
                        ≤ ∑_{k=1}^∞ E( X1² 1(k − 1 < |X1 | ≤ k) ) (2/k)    (6.7)
                        ≤ 2 ∑_{k=1}^∞ E( |X1 | 1(k − 1 < |X1 | ≤ k) )
                        = 2 E|X1 | < ∞ ,
(6.7) following from the fact that for k ≥ 2,
∑_{j=k}^∞ j⁻² ≤ ∑_{j=k}^∞ ∫_{j−1}^j x⁻² dx = 1/(k − 1) ≤ 2/k ,
and
∑_{j=1}^∞ j⁻² ≤ 1 + ∫_1^∞ x⁻² dx = 2 ,
which together imply
∑_{j=k}^∞ j⁻² ≤ 2/k , k ∈ N .
Hence, the proof follows.
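The strong law only requires a finite mean, and this is visible numerically. The sketch below is illustrative only: Pareto variables with tail index 3/2 are an arbitrary choice of a distribution with finite mean 3 but infinite variance, so Theorem 6.9 does not apply while Theorem 6.10 does; the running averages settle near the mean, though more slowly than in the light-tailed case.

import numpy as np

rng = np.random.default_rng(1)

# Pareto variables with tail index 3/2 (support (1, infinity)): the mean is
# (3/2)/(3/2 - 1) = 3 but the variance is infinite.  (Illustrative choice,
# not the density of Example 6.2 below.)
x = rng.pareto(1.5, size=1_000_000) + 1.0
running_mean = np.cumsum(x) / np.arange(1, x.size + 1)

for n in (100, 1_000, 10_000, 100_000, 1_000_000):
    print(n, running_mean[n - 1])                # settles near 3, albeit slowly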


Exercise 6.10. Let X1 , X2 , . . . be i.i.d. taking values 1 and −1, with respective
probabilities p and 1 − p. Define
Sn = ∑_{i=1}^n Xi , n = 0, 1, 2, . . . ;
(Sn : n ≥ 0) is called a random walk. Show that
P (Sn = k for some n ≥ 0) = 1 for all k = 1, 2, . . . if p > 1/2 ,
and
P (Sn = k for some n ≥ 0) = 1 for all k = −1, −2, . . . if p < 1/2 .
Theorem 6.12 (Kolmogorov’s zero-one law). If A1 , A2 , . . . are independent
σ-fields and
T = ∩_{n=1}^∞ ∨_{k=n}^∞ Ak ,

then P (A) equals either 0 or 1 for all A ∈ T .


Proof. For a fixed n = 1, 2, . . ., T ⊂ An+1 ∨ An+2 ∨ . . . and hence T is indepen-
dent of A1 ∨ . . . ∨ An . Thus,

P (A ∩ B) = P (A)P (B) for all A ∈ T , B ∈ A1 ∨ . . . ∨ An .

The above holds for all n = 1, 2, . . ., showing



P (A ∩ B) = P (A)P (B) for all A ∈ T , B ∈ ∪_{n=1}^∞ (A1 ∨ . . . ∨ An ) .

Since ∪_{n=1}^∞ (A1 ∨ . . . ∨ An ) is a field, Theorem 3.1 shows that T is independent
of the σ-field generated by ∪_{n=1}^∞ (A1 ∨ . . . ∨ An ). Observing that
σ( ∪_{n=1}^∞ (A1 ∨ . . . ∨ An ) ) = ∨_{n=1}^∞ An ⊃ T ,

T is thus independent of itself. In other words,

P (A ∩ A) = P (A)P (A) for all A ∈ T ,

showing that P (A) has to be either 0 or 1, which completes the proof.


Exercise 6.11. If X1 , X2 , . . . are independent random variables and

Sn = X1 + . . . + Xn , n = 1, 2, . . . ,

show that
P ( lim sup_{n→∞} Sn = ∞ ) = 0 or 1 ,
and that there exists a ∈ [−∞, ∞] such that
P ( lim inf_{n→∞} (1/n) Sn = a ) = 1 .
Convince yourself that the above is not necessarily true with (1/n) Sn replaced by
Sn .

Theorem 6.13 (Second Borel-Cantelli lemma). If A1 , A2 , . . . are independent


events such that
∑_{n=1}^∞ P (An ) = ∞ ,

then
P (An occurs for infinitely many n) = 1 .

Proof. Since
E = [An occurs for infinitely many n] ∈ ∩_{n=1}^∞ ∨_{k=n}^∞ {Ak , Akᶜ , ∅, Ω} ,

Kolmogorov’s zero-one law shows P (E) is either 0 or 1. Thus it suffices to show

P (E) > 0 . (6.8)

Recall that for any n ≥ 1,
P ( ∪_{i=1}^n Ai ) = 1 − P ( ∩_{i=1}^n Aiᶜ )
                   = 1 − ∏_{i=1}^n (1 − P (Ai ))
                   ≥ 1 − ∏_{i=1}^n e^{−P (Ai )}            (1 − x ≤ e⁻ˣ for all x ∈ R)
                   = 1 − exp( − ∑_{i=1}^n P (Ai ) )
                   → 1 ,
as n → ∞ because ∑_{i=1}^∞ P (Ai ) = ∞.
Let α1 , α2 , . . . ∈ (0, 1) be such that
∏_{i=1}^∞ αi > 0 .
For example, αi = e^{−1/i²} for i = 1, 2, . . . satisfies the above. The above calcula-
tions show there exists n1 such that
P ( ∪_{i=1}^{n1} Ai ) ≥ α1 .
Since ∑_{i=n1+1}^∞ P (Ai ) = ∞, a similar calculation shows there exists n2 > n1
such that
P ( ∪_{i=n1+1}^{n2} Ai ) ≥ α2 .

Proceeding inductively, get integers 0 = n0 < n1 < n2 < . . . such that
P ( ∪_{i=n_{k−1}+1}^{n_k} Ai ) ≥ αk , k ∈ N .
Clearly,
E ⊃ ∩_{k=1}^∞ ∪_{i=n_{k−1}+1}^{n_k} Ai .

Therefore,
P (E) ≥ P ( ∩_{k=1}^∞ ∪_{i=n_{k−1}+1}^{n_k} Ai )
      = ∏_{k=1}^∞ P ( ∪_{i=n_{k−1}+1}^{n_k} Ai )
      ≥ ∏_{k=1}^∞ αk > 0 .

Thus (6.8) holds, from which the proof follows.


An immediate consequence of the second Borel-Cantelli lemma is the follow-
ing, which should be compared with Theorem 6.8.
Exercise 6.12. If X1 , X2 , X3 , . . . are independent random variables, then show
that
Xn → X a.s. ⇐⇒ ∑_{n=1}^∞ P (|Xn − X| > ε) < ∞ for all ε > 0 .
Show that if the above holds, then X is a degenerate random variable.
Exercise 6.13. Suppose X1 , X2 , . . . are random variables such that

∑_{n=1}^∞ E(Xn²) < ∞ .
If Y1 , Y2 , . . . are such that σ(Xn : n ≥ 1), σ(Y1 ), σ(Y2 ), . . . are independent and
Yn takes values 1 and −1, each with probability 1/2 for n = 1, 2, . . ., show that
∑_{i=1}^n Xi Yi → Z , as n → ∞ ,

in L2 , for some Z ∈ L2 (Ω).


Soln.: Let
Zn = ∑_{i=1}^n Xi Yi , n ≥ 1 .
Since L2 (Ω) is a complete metric space, it suffices to show that {Zn : n ≥ 1} is
a Cauchy sequence in L2 (Ω). For 1 ≤ m < n,
E[ (Zn − Zm )² ] = E[ ( ∑_{i=m+1}^n Xi Yi )² ]
                 = ∑_{i=m+1}^n E(Xi²) + 2 ∑_{m+1≤i<j≤n} E(Xi Xj Yi Yj ) .      (using Yi² = 1)
For m + 1 ≤ i < j ≤ n, independence of σ(X1 , X2 , . . .), σ(Yi ), σ(Yj ) shows
E(Xi Xj Yi Yj ) = E(Xi Xj )E(Yi )E(Yj ) = 0 .
Thus,
E[ (Zn − Zm )² ] = ∑_{i=m+1}^n E(Xi²) .
Given ε > 0, choosing N such that
∑_{i=N+1}^∞ E(Xi²) ≤ ε ,
which is possible from the given hypothesis, it holds that for N ≤ m < n,
E[ (Zn − Zm )² ] = ∑_{i=m+1}^n E(Xi²) ≤ ∑_{i=N+1}^∞ E(Xi²) ≤ ε ,
showing {Zn : n ≥ 1} is a Cauchy sequence in L2 (Ω).


Exercise 6.14. If X1 , X2 , . . . are i.i.d. from standard exponential, and X(n,1) ,
. . . , X(n,n) are the order statistics of X1 , . . . , Xn , then show that

X(n,[n/2]) → log 2 a.s., as n → ∞ .
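Exercise 6.14 can be sanity-checked by simulation; the sketch below is illustrative only, with arbitrary sample sizes. The sample median of n standard exponentials should approach log 2 ≈ 0.6931.

import numpy as np

rng = np.random.default_rng(2)

for n in (100, 10_000, 1_000_000):
    x = np.sort(rng.exponential(scale=1.0, size=n))   # order statistics X_(n,1) <= ... <= X_(n,n)
    print(n, x[n // 2 - 1], np.log(2.0))              # X_(n,[n/2]) versus log 2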

Exercise 6.15. 1. If X1 , X2 , . . . are independent and αn → ∞ such that
(1/αn) ∑_{i=1}^n Xi → X in probability as n → ∞ ,
show that X is a degenerate random variable.
2. Hence or otherwise, prove that if X1 , X2 , . . . are i.i.d. from standard nor-
mal, then there does not exist a random variable Z such that
(1/√n) ∑_{i=1}^n Xi → Z in probability, n → ∞ .

7 Characteristic function and moment generating function
By convention, (Ω, A, P ) is the underlying probability space.
Definition 46. A function Z : Ω → C which satisfies Z −1 A ∈ A for all
A ∈ B(C) is a complex-valued random variable. A C-valued random variable Z
is integrable if Z
|Z| dP < ∞ ,

and in that case its expectation is defined as
Z
E(Z) = Z dP .

Throughout this chapter, <(z) and =(z) will denote the real and imaginary
parts of z for z ∈ C and ι = √−1. That is, for z = x + ιy where x, y ∈ R,

<(z) = x , =(z) = y .

Exercise 7.1. 1. For a function Z : Ω → C, show that

σ(Z) := {Z −1 A : A ∈ B(C)} = σ (<(Z)) ∨ σ (=(Z)) .

Hence prove that Z is B(C)-measurable if and only if <(Z) and =(Z) are
B(R)-measurable.
2. For a C-valued integrable random variable Z, show that

|E(Z)| ≤ E(|Z|) .

3. Show that a C-valued random variable Z is integrable if and only if <(Z)


and =(Z) are integrable, and in that case,

E(Z) = E (<(Z)) + ιE(=(Z)) .

4. For integrable C-valued random variables Z1 , Z2 and α, β ∈ C, show that


αZ1 + βZ2 is integrable and

E(αZ1 + βZ2 ) = αE(Z1 ) + βE(Z2 ) .

5. If Z1 and Z2 are integrable C valued random variables which are indepen-


dent, that is, σ(Z1 ) and σ(Z2 ) are independent, then show that Z1 Z2 is
integrable and
E(Z1 Z2 ) = E(Z1 )E(Z2 ) .
Definition 47. For a probability measure µ on R, its characteristic function
(CHF) is a function φ : R → C defined by
φ(t) := ∫_R e^{ιtx} µ(dx), t ∈ R ,
where ι := √−1. The characteristic function of a random variable X is the
CHF of the measure P ◦ X −1 on R, that is,
φX (t) := ∫_R e^{ιtx} P (X ∈ dx), t ∈ R .

Exercise 7.2. For a random variable X with CHF φX , show that

φX (t) := E eιtX = E[cos tX] + ιE[sin tX], t ∈ R .


 

Theorem 7.1. Let φX be the characteristic function of a random variable X.
Then
1. φX (0) = 1 and |φX (t)| ≤ 1 for all t,
2. φX is uniformly continuous,
3. aX + b has the characteristic function φaX+b given by

φaX+b (t) = eibt φX (at), t ∈ R .

Proof. Proof of 1. follows immediately from Exc 7.1.2, which implies

|E(e^{ιtX})| ≤ E( |e^{ιtX}| ) = 1 .


For 2., notice that for any t, h ∈ R,
|φX (t + h) − φX (t)| = | E[ e^{ι(t+h)X} − e^{ιtX} ] |
                      ≤ E| e^{ι(t+h)X} − e^{ιtX} |          (Exc 7.1.2)
                      = E| e^{ιhX} − 1 | .
By the DCT, it follows that
lim_{h→0} E| e^{ιhX} − 1 | = 0 ,

which completes the proof of uniform continuity.


Finally, 3. follows immediately from Exc 7.1.4 which implies
 
E eιt(aX+b) = eιbt E eιatX , t ∈ R .


Theorem 7.2. If λ ≠ 0 and φX is the characteristic function of X, then the


following three statements are equivalent.
1. φX (λ) = 1.
2. φX has period λ.
3.
P ( X ∈ (2π/λ) Z ) = 1 .
Proof. 1.⇒3. Assume that φX (λ) = 1. This means that

E[cos λX] = 1 .

Therefore, cos λX = 1 almost surely, which shows 3.


3.⇒2. If 3. holds, then
e^{ιλX} = 1 a.s.

Thus, for a fixed t ∈ R,
eι(t+λ)X = eιtX a.s.
Taking expectation of both sides shows φX (t+λ) = φX (t), from which 2. follows.
2.⇒1. Trivial because φX (0) = 1.
Exercise 7.3. If the CHF φX of X satisfies

|φX (λ)| = 1 for some λ ≠ 0 ,

show that there exists a ∈ R such that


 

P ( X − a ∈ (2π/λ) Z ) = 1 .

Definition 48. For a probability measure µ on R, its moment generating func-


tion (MGF) is a function ψ : R → (0, ∞] defined by
ψ(t) = ∫_{−∞}^∞ e^{tx} µ(dx) .

The moment generating function of a random variable X is ψX defined by

ψX (t) = E etX , t ∈ R .


While the characteristic function (CHF) takes values in {z ∈ C : |z| ≤ 1},


the moment generating function at t is possibly +∞ for all t ≠ 0. However, like
the CHF, the MGF also maps 0 to 1.
Theorem 7.3. Let µ be a probability measure on R with MGF ψ. If

α = inf{t ∈ R : ψ(t) < ∞} and β = sup{t ∈ R : ψ(t) < ∞} ,

then {t ∈ R : ψ(t) < ∞} ⊃ (α, β). If α < β then ezx is µ-integrable for z with
α < <(z) < β and f : {z ∈ C : α < <(z) < β} → C defined by
Z ∞
f (z) = ezx µ(dx) , (7.1)
−∞

is a holomorphic function. If, in addition, α < 0 < β, then all moments of µ


are finite, that is, Z ∞
|x|n µ(dx) < ∞ , n = 1, 2, . . . ,
−∞

and
f (z) = ∑_{n=0}^∞ (zⁿ/n!) ∫_{−∞}^∞ xⁿ µ(dx) , z ∈ C, |z| < (−α) ∧ β .      (7.2)

Proof. Monotonicity and positivity of the exponential function on R implies
that for t1 < t2 < t3 ,
e t2 x ≤ e t1 x + e t3 x , x ∈ R , (7.3)
showing that
ψ(t2 ) ≤ ψ(t1 ) + ψ(t3 ) .
Thus, {t ∈ R : ψ(t) < ∞} is a convex subset of R which contains (α, β).
If α < <(z) < β for some z ∈ C, then the above shows ψ(<(z)) < ∞ and
thus
∫_{−∞}^∞ |e^{zx}| µ(dx) = ∫_{−∞}^∞ e^{x<(z)} µ(dx) = ψ(<(z)) < ∞ .
DCT for complex-valued function shows that f defined by (7.1) is continuous. A
standard application of Fubini and Morera’s theorem in conjunction with (7.3)
proves f is holomorphic.
For the final claim, assume α < 0 < β. For 0 < t < (−α) ∧ β and n ≥ 1,
|x|ⁿ ≤ t⁻ⁿ n! (|tx|ⁿ/n!) ≤ n! t⁻ⁿ e^{|tx|} ≤ n! t⁻ⁿ ( e^{tx} + e^{−tx} ) , x ∈ R .
Thus,
∫_R |x|ⁿ µ(dx) ≤ n! t⁻ⁿ ∫_{−∞}^∞ ( e^{tx} + e^{−tx} ) µ(dx) = n! t⁻ⁿ (ψ(t) + ψ(−t)) < ∞ .

Further, for z ∈ C with |z| ≤ t,
f (z) = ∫_R lim_{n→∞} ∑_{i=0}^n (1/i!) (zx)ⁱ µ(dx) .

Since for all n and x ∈ R,
| ∑_{i=0}^n (1/i!) (zx)ⁱ | ≤ ∑_{i=0}^n (1/i!) |zx|ⁱ ≤ e^{|zx|} ≤ e^{t|x|} ,
and
∫_R e^{t|x|} µ(dx) ≤ ψ(t) + ψ(−t) < ∞ ,
DCT shows that
∫_R lim_{n→∞} ∑_{i=0}^n (1/i!) (zx)ⁱ µ(dx) = lim_{n→∞} ∫_R ∑_{i=0}^n (1/i!) (zx)ⁱ µ(dx)
                                           = lim_{n→∞} ∑_{i=0}^n (1/i!) zⁱ ∫_R xⁱ µ(dx)
                                           = ∑_{i=0}^∞ (1/i!) zⁱ ∫_R xⁱ µ(dx) .

Since this holds for all z ∈ C with |z| ≤ t and t is arbitrary in (0, (−α) ∧ β),
(7.2) follows.

Corollary 7.1. If µ is a probability measure such that α < 0 < β, where α, β
are as in Theorem 7.3, then the MGF ψ of µ satisfies
ψ(t) = ∑_{n=0}^∞ (tⁿ/n!) ∫_{−∞}^∞ xⁿ µ(dx) , t ∈ R, |t| < (−α) ∧ β ,
and the CHF φ of µ satisfies
φ(t) = ∑_{n=0}^∞ ((ιt)ⁿ/n!) ∫_{−∞}^∞ xⁿ µ(dx) , t ∈ R, |t| < (−α) ∧ β .

Remark 2. Only DCT and no complex analysis is used in the proof of (7.2),
and therefore for Corollary 7.1.
Exercise 7.4. Show that the characteristic function φ of standard normal is
φ(t) := e^{−t²/2} , t ∈ R ,                                               (7.4)
in each of the following different ways.
1. Recall (from the solution of HW4/12) that the MGF of standard normal
is
ψ(t) = e^{t²/2} , t ∈ R ,                                                 (7.5)
whose analytic continuation to C is
ψ̃(z) = e^{z²/2} , z ∈ C .
Use Theorem 7.3 to arrive at (7.4). This line of argument essentially
justifies replacing t by ιt in (7.5).
2. Derive from HW1/18 that
∫_{−∞}^∞ e^{ιtx − x²/2} dx = √(2π) e^{−t²/2} , t ∈ R ,
which is the same as (7.4).


3. Use Corollary 7.1 and HW2/17 to directly calculate the CHF of standard
normal.
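As a numerical sanity check of (7.4), one can also estimate E[cos tX] + ιE[sin tX] for standard normal X by Monte Carlo and compare with e^{−t²/2}. The sketch below is illustrative only; the sample size and the values of t are arbitrary choices.

import numpy as np

rng = np.random.default_rng(3)
z = rng.standard_normal(1_000_000)

for t in (0.5, 1.0, 2.0):
    chf = np.mean(np.cos(t * z)) + 1j * np.mean(np.sin(t * z))   # Monte Carlo estimate of E[e^{i t Z}]
    print(t, chf, np.exp(-t ** 2 / 2))                           # imaginary part ~ 0, real part ~ e^{-t^2/2}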
Theorem 7.4 (Inversion theorem). If the probability measure µ has character-
istic function φ, and a < b are such that µ{a, b} = 0, then
µ(a, b] = lim_{T→∞} (1/2π) ∫_{−T}^T [ (e^{−ιta} − e^{−ιtb}) / (ιt) ] φ(t) dt .
Lemma 7.1. If
S(T ) := ∫_0^T (sin x / x) dx, T ≥ 0 ,
then
lim_{T→∞} S(T ) = π/2 .

Proof of Theorem 7.4. Fix a, b satisfying the hypotheses. Before giving the ac-
tual proof, let us start with a sketch of the proof; every step in the sketch will
eventually be justified. For T > 0,
T
e−ιta − e−ιtb
Z
φ(t) dt
−T ιt
− e−ιtb ∞ ιtx
Z T −ιta Z
e
= e µ(dx) dt
−T ιt −∞
Z ∞ Z T ιt(x−a)
e − eιt(x−b)
= dt µ(dx) (7.6)
−∞ −T ιt
Z ∞Z T
= t−1 (sin(t(x − a)) − sin(t(x − b))) dt µ(dx) (7.7)
−∞ −T
Z ∞

= 2 sgn(x − a)S(T |x − a|) − sgn(x − b)S(T |x − b|) µ(dx) , (7.8)
−∞

S being as in Lemma 7.1, provided (7.6)-(7.8) can be justified. The said lemma
implies that

lim sgn(x − a)S(T |x − a|) − sgn(x − b)S(T |x − b|)
T →∞
π
= (sgn(x − a) − sgn(x − b))
2

π , a < x < b ,

= 0 , x < a or x > b ,
π

2 , x = a or x = b .

Thus,
T
e−ιta − e−ιtb
Z
lim φ(t) dt
T →∞ −T ιt
Z ∞

= lim 2 sgn(x − a)S(T |x − a|) − sgn(x − b)S(T |x − b|) µ(dx)
T →∞ −∞
Z ∞
= (2π1(a < x < b) + π1(x ∈ {a, b})) µ(dx) (7.9)
−∞

= 2πµ (a, b] ,

because µ({a, b}) = 0 by assumption, provided the interchange of integral and


limit in (7.9) can be justified. The claim would thus follow once (7.6)-(7.9) are
justified.

For justifying the interchange of integrals in (7.6), notice that

eιt(x−a) − eιt(x−b) e−ιta − e−ιtb


=
ιt ιt
Z b
= e−ιtx dx
a

≤ b − a. (7.10)

Therefore,
∞ T ∞ T
eιt(x−a) − eιt(x−b)
Z Z Z Z
dt µ(dx) ≤ (b − a) 1 dt µ(dx)
−∞ −T ιt −∞ −T
(Tonelli) = 2T (b − a) < ∞ .

Thus, (7.6) follows from Fubini.


The equality in (7.7) follows immediately by observing that for fixed x,
T T
eιt(x−a) − eιt(x−b)
Z Z
dt = t−1 (sin(t(x − a)) − sin(t(x − b))) dt
−T ιt −T
ZT
−ι t−1 (cos(t(x − a)) − cos(t(x − b))) dt ,
−T

and that t 7→ t−1 (cos(t(x − a)) − cos(t(x − b))) is an odd function.


For justifying (7.8), first fix x ≠ a and write
T T
sin(t(x − a)) sin(t(x − a))
Z Z
dt = (x − a) dt
−T t −T t(x − a)
T |x−a|
x−a
Z
sin y
(y = t(x − a), dy = |x − a| dt) = dy
|x − a| −T |x−a| y
Z T |x−a|
sin y
= sgn(x − a) dy
−T |x−a| y
= 2 sgn(x − a)S (T |x − a|) ,

as y 7→ y −1 sin y is an even function. The identity


T
sin(t(x − a))
Z
dt = 2 sgn(x − a)S (T |x − a|)
−T t

holds for x = a as well because in that case both sides vanish. The above holds
with a replaced by b, which establishes (7.8).
Finally, (7.9) is justified by the observation that

K = sup |S(t)| < ∞ , (7.11)


t≥0

which follows from Lemma 7.1 and the fact that S(·) is a continuous function.
Since
|sgn(x − a)S(T |x − a|) − sgn(x − b)S(T |x − b|)| ≤ 2K ,
and µ is a finite measure, DCT justifies the interchange of limit and integral in
(7.9). This completes the proof.
The following is an immediate corollary of Theorem 7.4.
Corollary 7.2. If µ1 and µ2 are probability measures on (R, B(R)) with respec-
tive CHFs φ1 and φ2 , then

φ1 (t) = φ2 (t) for all t ∈ R ⇐⇒ µ1 = µ2 .

Theorem 7.5. Suppose µ1 and µ2 are probability measures on (R, B(R)) with
respective MGFs ψ1 and ψ2 . If there exists θ > 0 such that

ψ1 (t) = ψ2 (t) < ∞ , t ∈ [−θ, θ] ,

then µ1 = µ2 .
Proof. For i = 1, 2, define fi : {z ∈ C : |<(z)| < θ} → C by
Z
fi (z) = ezx µi (dx) ,
R

which is possible because the MGFs of µ1 and µ2 are finite on [−θ, θ]. Theorem
7.3 shows that f1 and f2 are holomorphic. Since ψi is the restriction of fi to
(−θ, θ), the assumption implies f1 and f2 agree on an uncountable set. Thus,
f1 = f2 . As {ιt : t ∈ R} is contained in the domains of f1 and f2 , it follows that

f1 (ιt) = f2 (ιt) , t ∈ R .

The above is the same as saying the CHFs of µ1 and µ2 are identical. Corollary
7.2 completes the proof.
Theorem 7.6 (Inversion theorem for densities). If the characteristic function
φ of a probability measure µ is integrable on R, that is,
∫_{−∞}^∞ |φ(t)| dt < ∞ ,
then f defined by
f (x) := (1/2π) ∫_{−∞}^∞ e^{−ιtx} φ(t) dt ,
is a density of µ.
Proof. Let F be the CDF of µ, that is,

F (x) = µ ((−∞, x]) , x ∈ R .

Step 1. The function F is continuous.

Proof of Step 1. Suffices to show that for all x ∈ R, µ{x} = 0. Fix x ∈ R. Let
a < x ≤ b be such that µ{a, b} = 0. By the preceding result, it follows that
Z T −ιta
1 e − e−ιtb
µ(a, b] = lim φ(t)dt .
T →∞ 2π −T ιt
Notice that for all T ≥ 0,
Z T −ιta Z T −ιta Z ∞
e − e−ιtb e − e−ιtb
φ(t)dt ≤ |φ(t)|dt ≤ (b − a) |φ(t)|dt ,
−T ιt −T ιt −∞

(7.10) implying the rightmost inequality. Hence


Z ∞
µ(a, b] ≤ (b − a) |φ(t)|dt .
−∞

Let (an ) and (bn ) be such that an < x ≤ bn , µ{an , bn } = 0, an , bn → x. Then


Z ∞
µ{x} ≤ µ(an , bn ] ≤ (bn − an ) |φ(t)|dt ,
−∞

and the RHS converges to zero as n → ∞. This completes the proof of Step
1.
Step 2. The function F is differentiable, and
F 0 (x) = f (x) .
Proof of Step 2. Fix x ∈ R and h 6= 0. Then, by Step 1 and the preceding
result, it follows that
Z T −ιtx
F (x + h) − F (x) 1 1 e − e−ιt(x+h)
= µ ((x, x + h]) = lim φ(t)dt .
h h T →∞ 2π −T ιth
Since
e−ιtx − e−ιt(x+h) 1
φ(t) = |φ(t)| e−ιth − 1 ≤ |φ(t)| , (7.12)
ιth |th|
the inequality following from (7.10) by putting b = h and a = 0. As |φ(t)| is
integrable on R, by DCT, it follows that
Z ∞ −ιtx
F (x + h) − F (x) 1 e − e−ιt(x+h)
= φ(t)dt . (7.13)
h 2π −∞ ιth
Since,
e−ιt(x+h) − e−ιtx d −ιtx
lim = e = −ιte−ιtx ,
h→0 h dx
it follows that the integrand in (7.13) converges to e−ιtx φ(t) as h → 0. By
(7.12), the modulus of the integrand in (7.13) is bounded above by |φ(t)|. DCT
allows the limit as h → 0 to be interchanged with the integral in the RHS of
(7.13), which completes the proof of Step 2.

Arguments similar to those in the proof of Theorem 7.1.2 show that f is a
continuous function. By Step 2, it follows that for all real a < b,
Z b
µ(a, b] = F (b) − F (a) = f (x)dx ,
a

the second equality following by the fundamental theorem of calculus. This


completes the proof.
The following is an immediate corollary of Theorem 7.6.
Corollary 7.3. Suppose µ is a probability measure on R which has a continuous
density f . If the CHF φ of µ is integrable on R with respect to the Lebesgue
measure, then
∫_{−∞}^∞ e^{−ιtx} φ(t) dt = 2πf (x) for all x ∈ R .

Exercise 7.5. Use Corollary 7.3 to give a fifth proof of the fact
∫_{−∞}^∞ e^{−x²/2} dx = √(2π) ;

HW2/16 gave the fourth proof.


Example 7.1. Let
f (x) = (1/2) e^{−|x|} , x ∈ R ,
and µ be the probability measure whose density is f . The CHF of µ can be
calculated and shown to be
φ(t) = 1/(1 + t²) , t ∈ R .
Since f is continuous, the above corollary implies
∫_{−∞}^∞ [ e^{−ιtx} / (1 + t²) ] dt = 2πf (x) = πe^{−|x|} , x ∈ R .
Replacing t by −t using the symmetry of φ and dividing throughout by π, the
above is the same as saying that the CHF ξ of the Cauchy distribution is
ξ(x) = e−|x| , x ∈ R .
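The displayed inversion identity can be checked numerically: integrating e^{−ιtx}/(1 + t²) over a long interval gives approximately π e^{−|x|}. A sketch, illustrative only; the truncation at ±200 and the grid are arbitrary choices.

import numpy as np

t = np.linspace(-200.0, 200.0, 2_000_001)          # truncated grid (arbitrary)
dt = t[1] - t[0]
phi = 1.0 / (1.0 + t ** 2)                         # CHF of the double exponential density

for x in (0.0, 1.0, 2.0):
    integral = np.sum(np.cos(t * x) * phi) * dt    # the sine part is odd and drops out
    print(x, integral, np.pi * np.exp(-abs(x)))    # agree up to truncation/discretization error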
Definition 49. The CHF of a probability measure µ on (Rd , B(Rd )) is a func-
tion φ : Rd → C defined by
 
φ(t1 , . . . , td ) = ∫_{Rd} exp( ι ∑_{j=1}^d tj xj ) µ(dx1 , . . . , dxd ) , (t1 , . . . , td ) ∈ Rd .
For an Rd-valued random variable X, its CHF φX is
φX (t) = E( e^{ι⟨t,X⟩} ) , t ∈ Rd .

Theorem 7.7 (Inversion theorem on Rd ). If φ is the CHF of probability measure
µ on Rd , then for ∆ = [a1 , b1 ] × . . . × [ad , bd ] where aj < bj for j = 1, . . . , d, and
µ(∂∆) = 0 where ∂∆ is the boundary of ∆,
 
µ(∆) = (2π)⁻ᵈ lim_{T→∞} ∫_{[−T,T]^d} φ(t1 , . . . , td ) ∏_{j=1}^d [ (e^{−ιtj aj} − e^{−ιtj bj}) / (ιtj) ] dt1 . . . dtd .

Proof. We shall proceed along the lines of the proof of Theorem 7.4. Let ∆ =
[a1 , b1 ] × . . . × [ad , bd ] satisfy the hypotheses. For T > 0,
 
d −ιtj aj −ιtj bj

Z Y e e
φ(t1 , . . . , td )   dt1 . . . dtd (7.14)
[−T,T ]d j=1
ιtj
    
d −ιtj aj −ιtj bj d
−e
Z Z
Y e X
=   exp ι tj xj  µ(dx1 , . . . , dxd )
[−T,T ] d
j=1
ιt j Rd j=1

dt1 . . . dtd
 
d
eιtj (xj −aj ) − eιtj (xj −bj ) 
Z Z Y
=  µ(dx1 , . . . , dxd ) dt1 , . . . , dtd .
[−T,T ]d Rd j=1
ιtj
(7.15)
Recall (7.10) to write
d
eιtj (xj −aj ) − eιtj (xj −bj )
Z Z Y
µ(dx1 , . . . , dxd ) dt1 , . . . , dtd
[−T,T ]d Rd j=1
ιtj
Z Z d
Y
≤ (bj − aj ) µ(dx1 , . . . , dxd ) dt1 , . . . , dtd
[−T,T ]d Rd j=1

d
Y
= (2T )d (bj − aj ) < ∞ .
j=1

Fubini implies that the quantity in (7.15) equals


 
d ιtj (xj −aj ) ιtj (xj −bj )

Z Z

Y e e  dt1 , . . . , dtd µ(dx1 , . . . , dxd )
Rd [−T,T ]d j=1
ιt j
 
d T ιtj (xj −aj ) ιtj (xj −bj )

Z Z
Y e e
=  dtj  µ(dx1 , . . . , dxd )
Rd j=1 −T ιt j
 
Z Yd
=  2 (sgn(xj − aj )S(T |xj − aj |) − sgn(xj − bj )S(T |xj − bj |))
Rd j=1

µ(dx1 , . . . , dxd ) ,

(7.8) implying the last equality.
Denote for x1 , . . . , xd ∈ R,

ψT (x1 , . . . , xd )
d
Y
= 2 (sgn(xj − aj )S(T |xj − aj |) − sgn(xj − bj )S(T |xj − bj |)) ,
j=1

and use Lemma 7.1 to argue


d
Y
lim ψT (x1 , . . . , xd ) = π d (sgn(xj − aj ) − sgn(xj − bj ))
T →∞
j=1

=: ψ∞ (x1 , . . . , xd ) .

For (x1 , . . . , xd ) in the interior of ∆, that is, if aj < xj < bj for all j, then

sgn(xj − aj ) − sgn(xj − bj ) = 2 , j = 1, . . . , d ,

and hence
ψ∞ (x1 , . . . , xd ) = (2π)d .
On the other hand, if (x1 , . . . , xd ) ∈ ∆c , then there exists j for which either
xj < aj or xj > bj and hence for that j,

sgn(xj − aj ) − sgn(xj − bj ) = 0 .

In other words,
ψ∞ (x1 , . . . , xd ) = 0 , (x1 , . . . , xd ) ∈ ∆c .
Thus,

ψ∞ (x1 , . . . , xd ) = (2π)d 1 ((x1 , . . . , xd ) ∈ ∆) , (x1 , . . . , xd ) ∈ (∂∆)c . (7.16)

Letting K as in (7.11), it is immediate that

|ψT (x1 , . . . , xd )| ≤ (4K)d ,

which along with DCT implies that as T → ∞,


Z Z
ψT (x1 , . . . , xd )µ(dx1 , . . . , dxd ) → ψ∞ (x1 , . . . , xd )µ(dx1 , . . . , xd )
Rd Rd
= (2π)d µ(∆) ,

(7.16) and that µ(∂∆) = 0 imply the second line. The left hand side of the first
line above is the same as the quantity in (7.14). That is, we have shown
 
d −ιtj aj −ιtj bj

Z Y e e  dt1 . . . dtd = (2π)d µ(∆) ,
lim φ(t1 , . . . , td ) 
T →∞ [−T,T ]d
j=1
ιtj

which is precisely the claimed formula.

Theorem 7.8 (Uniqueness theorem on Rd ). If µ1 and µ2 are probability mea-
sures on Rd with identical CHFs, then µ1 = µ2 .
Proof. Let µ1 and µ2 be probability measures on Rd with identical CHFs. To
show µ1 = µ2 , in view of Theorem 4.3.2, it suffices to prove that

F1 (x) = F2 (x) , x ∈ Rd , (7.17)

where F1 , F2 are the respective CDFs of µ1 , µ2 , that is,

Fi (x1 , . . . , xd ) = µi ((−∞, x1 ] × . . . × (−∞, xd ]) , i = 1, 2, (x1 , . . . , xd ) ∈ Rd .

An immediate consequence of Theorem 7.7 is that for a compact rectangle


∆ ⊂ Rd ,
µ1 (∆) = µ2 (∆) if µ1 (∂∆) = 0 = µ2 (∂∆) , (7.18)
∂∆ being the boundary of ∆. Define

Cij = x ∈ R : µj {(x1 , . . . , xd ) ∈ Rd : xi = x} > 0 , i = 1, . . . , d, j = 1, 2 .


 

Since Cij is countable, so is C defined by


d [
[ 2
C= Cij .
i=1 j=1

Thus C c is dense in R and for ∆ = [a1 , b1 ] × . . . × [ad , bd ],

µ1 (∂∆) = 0 = µ2 (∂∆) if a1 , b1 , . . . , ad , bd ∈ C c and ai < bi , i = 1, . . . , d .


(7.19)
Proceeding towards (7.17), fix x1 , . . . , xd ∈ C c . Since C c is dense, there exist
an ∈ C c such that ∧di=1 xi > a1 > a2 > . . . and an → −∞. It follows from (7.18)
and (7.19) that

µ1 ([an , x1 ] × . . . × [an , xd ]) = µ2 ([an , x1 ] × . . . × [an , xd ]) , n ≥ 1 .

Letting n → ∞, (7.17) follows for x = (x1 , . . . , xd ) if x1 , . . . , xd ∈ C c . Using the


facts that C c is dense once again, and that F is continuous from above, (7.17)
follows for all x ∈ Rd , which completes the proof.
The following result is an immediate consequence of Theorem 7.8.

Theorem 7.9 (Cramér-Wold device). For Rd -valued random variables X and


Y,
X =ᵈ Y ⇐⇒ ⟨λ, X⟩ =ᵈ ⟨λ, Y ⟩ for all λ ∈ Rd .
Proof. Follows from Theorem 7.8.

Remark 3. No elementary proof of the Cramér-Wold device, without using


CHFs which essentially belong to the domain of Fourier analysis, is known.

Example 7.2. Let Z1 , . . . , Zd be i.i.d. from standard normal and Z = (Z1 ,
. . . , Zd ). Fix a d × d symmetric non-negative definite (n.n.d.) matrix Σ and
µ ∈ Rd and define
X = µ + Σ1/2 Z , (7.20)
where elements of Rd are to be interpreted as column vectors by convention. Let
us calculate the CHF of X. Fix λ ∈ Rd and write

λT X = λT µ + θ T Z ,

where
θ = Σ1/2 λ .
Recall that θᵀZ follows N (0, ‖θ‖²), where ‖ · ‖ is the L²-norm, if ‖θ‖ > 0; θᵀZ
is degenerate at zero otherwise. Assuming for a moment that t = ‖θ‖ > 0,
E( e^{ιθᵀZ} ) = E[ exp( ιt θᵀZ/‖θ‖ ) ]
             = e^{−t²/2}                       (because θᵀZ/‖θ‖ ∼ N (0, 1))
             = e^{−‖θ‖²/2}
             = exp( −(1/2) θᵀθ )
             = exp( −(1/2) λᵀΣλ ) .

If ‖θ‖ = 0, that is, θ is the zero vector, then also
E( e^{ιθᵀZ} ) = exp( −(1/2) λᵀΣλ ) ,
because both sides equal 1 in this case. Thus,
E( e^{ιλᵀX} ) = e^{ιλᵀµ} E( e^{ιθᵀZ} ) = exp( ιλᵀµ − (1/2) λᵀΣλ ) .

In other words, the CHF φX of X is
φX (λ) = exp( ιλᵀµ − (1/2) λᵀΣλ ) , λ ∈ Rd .

Definition 50. An Rd-valued random variable X follows Nd (µ, Σ) for µ ∈ Rd
and a d × d symmetric n.n.d. matrix Σ, if the CHF of X is
φX (λ) = exp( ιλᵀµ − (1/2) λᵀΣλ ) , λ ∈ Rd .

The above definition is consistent with Definition 39 in the following sense.
If Σ is p.d. and X ∼ Nd (µ, Σ) according to Definition 50, then the density of X
is f as in Definition 39. Indeed, (7.20) should be compared with (4.10) to see
this immediately.
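The representation (7.20) is also how Nd(µ, Σ) is typically simulated. A minimal sketch, assuming an arbitrary µ and Σ, with the square root Σ^{1/2} taken symmetric via the spectral decomposition:

import numpy as np

rng = np.random.default_rng(4)

mu = np.array([1.0, -2.0])                         # arbitrary mean vector
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])                     # arbitrary symmetric p.d. matrix

w, V = np.linalg.eigh(Sigma)                       # spectral decomposition Sigma = V diag(w) V^T
root = V @ np.diag(np.sqrt(w)) @ V.T               # symmetric square root Sigma^{1/2}

Z = rng.standard_normal(size=(100_000, 2))         # rows are i.i.d. N_2(0, I) vectors
X = mu + Z @ root                                  # rows are i.i.d. N_2(mu, Sigma), cf. (7.20)

print(X.mean(axis=0))                              # close to mu
print(np.cov(X, rowvar=False))                     # close to Sigma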
Remark 4. The Nd (µ, Σ) is called a “singular normal distribution” if Σ is
n.n.d. but not p.d. It should be noted that a singular normal distribution in one
dimension is a degenerate distribution.
Exercise 7.6. Show that a Nd (µ, Σ) distribution has a density if and only if Σ
is p.d.
Theorem 7.10. For an Rd -valued random variable X, µ ∈ Rd and a d × d
n.n.d. matrix Σ,

X ∼ Nd (µ, Σ) ⇐⇒ hλ, Xi ∼ N λT µ, λT Σλ for all λ ∈ Rd .




Proof. For the “⇒ part”, assume X ∼ Nd (µ, Σ) and fix λ ∈ Rd . Then for t ∈ R,
E( e^{ιt⟨λ,X⟩} ) = E( e^{ι⟨tλ,X⟩} )
                = exp( ι(tλ)ᵀµ − (1/2)(tλ)ᵀΣ(tλ) )
                = e^{ιtθ − σ²t²/2} ,

where θ = λT µ and σ 2 = λT Σλ. As the above is true for all t ∈ R, Definition


50 shows that hλ, Xi ∼ N (θ, σ 2 ). This proves the “⇒ part”.
For the reverse implication, assume that

hλ, Xi ∼ N λT µ, λT Σλ for all λ ∈ Rd .




Let Y ∼ Nd (µ, Σ). The already proven “⇒ part” shows that

hλ, Y i ∼ N λT µ, λT Σλ for all λ ∈ Rd .




d
Thus hλ, Xi = hλ, Y i for all λ ∈ Rd . The Cramér-Wold device shows
d
X=Y ,

from which the “⇐ part” follows. This completes the proof.


Exercise 7.7. If X is a Rd -valued random vector such that for all λ ∈ Rd ,
λT X follows one-dimensional normal, show that X ∼ Nd (µ, Σ), where µ and Σ
are the mean vector and the variance-covariance matrix of X, respectively.
Exercise 7.8. For a random variable X with CHF φ, show that the following
are equivalent.
d
1. The distribution of X is symmetric, that is, X = −X.

2. For all t ∈ R, =(φ(t)) = 0, that is, φ is a real function.
3. The function φ is even, that is, φ(−t) = φ(t) for all t ∈ R.
Exercise 7.9. If X1 , X2 , . . . are i.i.d. from the Cauchy distribution, show that
there does not exist a random variable Z such that
(1/n) ∑_{i=1}^n Xi → Z in probability, n → ∞ .
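Exercise 7.9 is visible in simulation: by Example 7.1 the CHF of the sample mean of n i.i.d. standard Cauchy variables is (e^{−|t|/n})ⁿ = e^{−|t|}, so the averages are again standard Cauchy and never settle down. An illustrative sketch (sample sizes are arbitrary):

import numpy as np

rng = np.random.default_rng(5)
x = rng.standard_cauchy(1_000_000)
running_mean = np.cumsum(x) / np.arange(1, x.size + 1)

# Unlike in the SLLN examples, the averages do not settle down: each entry
# below is itself (in distribution) a standard Cauchy observation.
for n in (100, 1_000, 10_000, 100_000, 1_000_000):
    print(n, running_mean[n - 1])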

Exercise 7.10. Suppose X and Y are independent random variables.


1. Show that X + Y has a density if either X or Y has a density.
Hint. If f is the density of X, use Theorem 5.7 to show that for fixed
z ∈ R,
P (X + Y ≤ z | σ(Y )) = ∫_{−∞}^{z−Y} f (x) dx .
The “conditional probability” of an event given a σ-field is the same as
the conditional expectation of the indicator of that event.
2. Show that X + Y has a bounded continuous density if the CHF of either
X or Y is integrable on R.

8 Weak convergence and the central limit theorem
Definition 51. For probability measures µ, µ1 , µ2 , . . . on (R, B(R)), µn con-
verges weakly to µ, or µn ⇒ µ if
lim µn ((−∞, x]) = µ ((−∞, x]) ,
n→∞

for all x ∈ R with µ({x}) = 0. For R-valued random variables X, X1 , X2 , . . .,


we say Xn converges to X in law, in distribution or weakly, and denote it by
Xn ⇒ X, if
P ◦ Xn−1 ⇒ P ◦ X −1 .
Henceforth, B(R) or B(Rd ) will be the underlying σ-field, depending on the
context, unless specifically mentioned otherwise.
Exercise 8.1. If X, X1 , X2 , . . . are random variables with respective CDFs
F, F1 , F2 , . . ., show that Xn ⇒ X if and only if
lim Fn (x) = F (x) for every continuity point x of F .
n→∞

Example 8.1. If Xn ∼ Binomial(n, pn ) where pn ∈ (0, 1) are such that


lim npn = λ ∈ (0, ∞) ,
n→∞

then Xn ⇒ X where X ∼ Poisson(λ).
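Example 8.1 can be checked numerically by comparing the Binomial(n, λ/n) probabilities with the Poisson(λ) ones for increasing n. A sketch, illustrative only, with λ = 2 and the displayed values of k as arbitrary choices:

from math import comb, exp, factorial

lam = 2.0                                           # arbitrary value of lambda
for n in (10, 100, 10_000):
    p = lam / n
    binom = [comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(6)]
    print(n, [round(b, 4) for b in binom])

poisson = [exp(-lam) * lam ** k / factorial(k) for k in range(6)]
print("Poisson", [round(q, 4) for q in poisson])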

Exercise 8.2. If Xn ⇒ Y and Xn ⇒ Z, show that Y =ᵈ Z.
P
Theorem 8.1. If Xn −→ X, then Xn ⇒ X.
Proof. Fix x ∈ R such that P (X = x) = 0. It suffices to show for all such x,

lim P (Xn ≤ x) = P (X ≤ x) . (8.1)


n→∞

Fix ε > 0. Since P (X = x) = 0, there exists w < x < y such that

P (X ∈ [w, y]) ≤ ε .

Clearly,
[X > y] ∩ [|Xn − X| ≤ y − x] ⊂ [Xn > x] .
Take complements of both sides to get

[X ≤ y] ∪ [|Xn − X| > y − x] ⊃ [Xn ≤ x] .

Thus,

P (Xn ≤ x) ≤ P ([X ≤ y] ∪ [|Xn − X| > y − x])


≤ P (X ≤ y) + P (|Xn − X| > y − x) .

P
Let n → ∞ and use the fact Xn −→ X to argue

lim sup P (Xn ≤ x) ≤ P (X ≤ y) ≤ P (X ≤ x) + ε ,


n→∞

the right inequality following from the choice of w and y. Since ε is arbitrary,
we get
lim sup P (Xn ≤ x) ≤ P (X ≤ x) .
n→∞

Since
[Xn > x] ∩ [|Xn − X| ≤ x − w] ⊂ [X > w] ,
proceeding along similar lines would yield

lim inf P (Xn ≤ x) ≥ P (X ≤ x) ,


n→∞

from which (8.1) would follow and would complete the proof.
Exercise 8.3. If X follows standard normal and Xn = −X, show that
Xn ⇒ X but Xn does not converge to X in probability.

Exercise 8.4. If X is a degenerate random variable, show that


P
Xn ⇒ X ⇐⇒ Xn −→ X .

Theorem 8.2. For probability measures µ1 , µ2 , . . . , µ∞ on R, µn ⇒ µ∞ if and
only if
lim_{n→∞} ∫ f dµn = ∫ f dµ∞ ,                                             (8.2)

for every bounded continuous function f : R → R.


Proof. For the “if part”, assume (8.2) and fix x ∈ R with µ∞ ({x}) = 0. Fix
ε > 0 and let w < x < y be such that µ∞ ([w, y]) ≤ ε. Let f : R → R be
the function which is 0 on [x, ∞), 1 on (−∞, w] and is the line segment joining
(w, 1) and (x, 0) on [w, x]. Thus f is bounded and continuous and

1(−∞,w] ≤ f ≤ 1(−∞,x] . (8.3)

The right inequality above implies that for n = 1, 2, . . .,


Z
µn ((−∞, x]) ≥ f dµn .

Letting n → ∞,
Z
lim inf µn ((−∞, x]) ≥ lim inf f dµn
n→∞ n→∞
Z
(by (8.2)) = f dµ∞

(the left inequality of (8.3)) ≥ µ∞ ((−∞, w])


≥ µ∞ ((−∞, x]) − ε ,

the choice of w implying the last line. Arbitrariness of ε shows

lim inf µn ((−∞, x]) ≥ µ∞ ((−∞, x]) .


n→∞

A similar argument with g : R → R which is 0 on [y, ∞), 1 on (−∞, x] and is


the line segment joining (x, 1) and (y, 0) on [x, y] yields the desired upper bound
and thus proves µn ⇒ µ∞ . This proves the “if part”.
For the “only if part”, assume µn ⇒ µ∞ . Let f : R → R be bounded
continuous and fix ε > 0. Fix a, b ∈ R with a < b such that µ∞ ({a, b}) = 0 and

µ∞ ([a, b]c ) ≤ ε .

Continuity of f implies it is uniformly continuous on [a, b]. Thus there exists


δ > 0 satisfying

|f (x) − f (y)| ≤ ε for all x, y ∈ [a, b] , with |x − y| ≤ δ .

Choose x0 = a < x1 < . . . < xk = b satisfying µ({x0 , . . . , xk }) = 0 and

xi − xi−1 ≤ δ , i = 1, . . . , k ;

this is possible because a and b have been chosen to be continuity points of µ∞ .
Thus, for n = 1, 2, . . . , ∞,
Z k
X
f (x)µn (dx) − f (xi )µn ((xi−1 , xi ]) (8.4)
(a,b] i=1
k Z
X
= [f (x) − f (xi )] µn (dx)
i=1 (xi−1 ,xi ]
k Z
X
≤ |f (x) − f (xi )| µn (dx)
i=1 (xi−1 ,xi ]
k
X
≤ µn ((xi−1 , xi ]) max |f (x) − f (xi )|
x∈[xi−1 ,xi ]
i=1
k
X
≤ε µn ((xi−1 , xi ])
i=1
= εµn ((a, b]) ≤ ε ,

the inequality in the penultimate line following from the choice of δ and that
xi − xi−1 ≤ δ , i = 1, . . . , k.
Thus, for n = 1, 2, . . .,
Z Z
f (x) µn (dx) − f (x) µ∞ (dx)
(a,b] (a,b]
Z k
X
≤ f (x)µn (dx) − f (xi )µn ((xi−1 , xi ])
(a,b] i=1
Z k
X
+ f (x)µ∞ (dx) − f (xi )µ∞ ((xi−1 , xi ])
(a,b] i=1
k
X k
X
+ f (xi )µn ((xi−1 , xi ]) − f (xi )µ∞ ((xi−1 , xi ])
i=1 i=1
k
X k
X
≤ 2ε + f (xi )µn ((xi−1 , xi ]) − f (xi )µ∞ ((xi−1 , xi ]) .
i=1 i=1

Since x0 , . . . , xk are continuity points of µ∞ which is the weak limit of µn ,

lim µn ((xi−1 , xi ]) = µ∞ ((xi−1 , xi ]) ,


n→∞

and hence
k
X k
X
lim f (xi )µn ((xi−1 , xi ]) − f (xi )µ∞ ((xi−1 , xi ]) = 0 . (8.5)
n→∞
i=1 i=1

Therefore,
Z Z
lim sup f (x) µn (dx) − f (x) µ∞ (dx) ≤ 2ε . (8.6)
n→∞ (a,b] (a,b]

Let K = supx |f (x)| which is finite because f is bounded. Thus,


Z
f (x) µ∞ (dx) ≤ Kµ∞ ((a, b]c ) ≤ Kε ,
(a,b]c

and
Z
lim sup f (x) µn (dx) ≤ K lim sup µn ((a, b]c )
n→∞ (a,b]c n→∞

(µ∞ ({a, b}) = 0) = Kµ∞ ((a, b]c )


≤ Kε .

Combine these with (8.6) to get


Z Z
lim sup f dµn − f dµ∞ ≤ 2(K + 1)ε .
n→∞

Since ε is arbitrary, (8.2) follows. This proves the “only if part” and therefore
completes the proof.
Theorem 8.3 (Lévy Continuity theorem). Let µn , µ be probability measures on
R with characteristic functions φn , φ. Then, µn ⇒ µ if and only if

lim φn (t) = φ(t) for all t ∈ R .


n→∞

Proof. The “only if” part follows trivially from Theorem 8.2. For the “if” part,
assume that
lim φn (t) = φ(t) for all t ∈ R .
n→∞

Step 1. Let Fn be the c.d.f. of µn . There exist integers 1 ≤ n1 < n2 < . . . such
that
lim Fnk (r) exists for all r ∈ Q .
k→∞

Proof of Step 1. Follows immediately from Cantor’s diagonal argument because


Q is countable and for all r ∈ Q, {Fn (r) : r ∈ Q} is a bounded sequence.
Step 2. Denote
H(r) := lim Fnk (r), r ∈ Q ,
k→∞

and
G(x) := inf{H(r) : r > x, r ∈ Q} .
Then, G is a non-decreasing right continuous function.

Proof of Step 2. Non-decreasing is immediate. For right continuity, fix x ∈ R
and ε > 0. Clearly, there exists r ∈ (x, ∞) ∩ Q such that

H(r) ≤ G(x) + ε .

Clearly,
G((x + r)/2) ≤ H(r) ≤ G(x) + ε .
Thus, G is right continuous at x.

Step 3. For every continuity point x of G,

lim Fnk (x) = G(x) .


k→∞

Proof of Step 3. Fix a continuity point x of G and ε > 0. Therefore, there exist
w < x < y such that

G(x) − ε ≤ G(w) ≤ G(y) ≤ G(x) + ε .

Let r1 , r2 be rationals such that w < r1 < x < r2 < y. Then,

G(x) − ε ≤ G(w)
≤ H(r1 )
= lim Fnk (r1 )
k→∞
≤ lim inf Fnk (x)
k→∞
≤ lim sup Fnk (x)
k→∞
≤ lim Fnk (r2 )
k→∞
= H(r2 )
≤ G(y) (8.7)
≤ G(x) + ε ,

the inequality in (8.7) following from the fact that for all r ∈ (y, ∞) ∩ Q,

H(r) ≥ H(r2 ) ,

that is, H(r2 ) is a lower bound of the set of which G(y) is the infimum. Letting
ε ↓ 0 completes the proof of Step 3.

Step 4. Given ε > 0, there exists a such that

lim sup µn {x : |x| ≥ a} ≤ ε .


n→∞

Proof of Step 4. Observe that for u > 0,

1 u 1 u
Z Z Z
1 − eιtx µn (dx)dt

(1 − φn (t))dt =
u −u u −u R
1 u
Z Z
1 − eιtx dtµn (dx)

=
R u −u
  Z  
sin 0 sin ux
Interpreting =1 = 2 1− µn (dx)
0 R ux
  Z  
sin z sin ux
≤ 1,z ∈ R ≥ 2 1− µn (dx)
z [|x|≥2/u] ux
   
| sin ux|
Z
sin ux 1 1
≤ ≤ ≥ 2 1− µn (dx)
ux |ux| |ux| [|x|≥2/u] |ux|
Z
1
≥ 2 µn (dx)
[|x|≥2/u] 2
= µn [|x| ≥ 2/u] .

By DCT, it follows that for all fixed u > 0,

1 u 1 u 1 u
Z Z Z
lim (1 − φn (t))dt = (1 − φ(t))dt ≤ |1 − φ(t)| dt .
n→∞ u −u u −u u −u

Fix ε > 0. Since φ is a characteristic function, it is continuous at 0, and hence


there exists u > 0 such that

|φ(t) − 1| ≤ ε/2, |t| ≤ u .

Letting a = 2/u, putting everything together,

1 u
Z
lim sup µn {x : |x| ≥ a} ≤ lim (1 − φn (t))dt
n→∞ n→∞ u −u

1 u
Z
≤ |1 − φ(t)| dt
u −u
≤ ε.

Thus Step 4 follows.


Step 5. As x → ∞, G(x) → 1 and as x → −∞, G(x) → 0.
Proof of Step 5. Since G is non-decreasing, G(−∞) and G(∞) exist. Fix ε > 0.
Use Step 4 to get a > 0 such that

lim sup µn ((−a, a)c ) ≤ ε .


n→∞

Let x ≤ −a be a continuity point of G. Since G is non-decreasing,
G(−∞) ≤ G(x)
(By Step 3) = lim Fnk (x)
k→∞
≤ lim sup Fnk (−a)
k→∞
= lim sup µnk ((−∞, −a])
k→∞
≤ lim sup µn ((−a, a)c )
n→∞
≤ ε.
Since ε is arbitrary and G is non-negative, it follows that G(−∞) = 0. A similar
argument shows that if y ≥ −a is a continuity point of G, then G(y) ≥ 1 − ε,
and hence G(∞) = 1. This proves Step 5.
Step 6. As k → ∞, µnk =⇒ µ.
Proof of Step 6. Steps 2 and 5 in conjunction with Theorem 1.4 imply there
exists a probability measure ν on R such that
ν(−∞, x] = G(x), x ∈ R .
Step 3 implies that
µnk =⇒ ν .
By the already proven “only if” part, it follows that
Z
lim φnk (t) = eιtx ν(dx) for all t ∈ R .
k→∞

This in conjunction with the hypothesis


lim φn (t) = φ(t) , t ∈ R ,
n→∞

shows Z
φ(t) = eιtx ν(dx) for all t ∈ R .

Since φ is the CHF of µ, Corollary 7.2 implies


µ=ν.
This completes the proof.
Step 7. As n → ∞, µn =⇒ µ.
Proof of Step 7. Let µmk be any subsequence of µn . Steps 1 - 6 show that µmk
has a further subsequence µmkl such that
µmkl =⇒ µ as l → ∞ .
Since this is true for all subsequences µmk , it follows that µn =⇒ µ as n →
∞.

Step 7 clearly completes the proof of the “only if” part, and thereby proves
the theorem.
Exercise 8.5. Suppose that µ1 , µ2 , . . . are probability measures on R with CHFs
φ1 , φ2 , . . ., respectively. Assume

lim φn (t) = φ(t) , t ∈ R .


n→∞

Show that there exists a probability measure µ whose CHF is φ if and only if φ
is continuous at zero and in that case µn ⇒ µ.
Theorem 8.4 (Central limit theorem (CLT) on R for i.i.d.). Let X1 , X2 , . . .
be i.i.d. random variables with mean µ and variance σ 2 ∈ (0, ∞). Then, as
n → ∞,
( ∑_{j=1}^n Xj − nµ ) / (n^{1/2} σ) =⇒ Z ,
where Z follows standard normal.
Lemma 8.1. For all θ ∈ R,
 
| e^{ιθ} − (1 + ιθ − θ²/2) | ≤ 2 min(θ², |θ|³) .
Proof. Notice that
| e^{ιθ} − (1 + ιθ − θ²/2) | ≤ | cos θ − (1 − θ²/2) | + | sin θ − θ | .
Denote
R1 := cos θ − (1 − θ²/2) ,
R2 := sin θ − θ .
By Taylor’s theorem, there exist ξ, ξ′ such that
cos θ = 1 − θ²/2 + (θ³/6) sin ξ                                           (8.8)
      = 1 − (θ²/2) cos ξ′ .                                               (8.9)
Equations (8.8) and (8.9) respectively show that
|R1 | ≤ |θ|³/6 ≤ |θ|³ ,
|R1 | ≤ (θ²/2)(1 + | cos ξ′ |) ≤ θ² .
Therefore,
|R1 | ≤ min(θ², |θ|³) .
Applying Taylor to sin θ shows the existence of η, η′ satisfying
sin θ = θ − (θ³/6) cos η
      = θ − (θ²/2) sin η′ .
Thus,
|R2 | ≤ min(θ², |θ|³) ,
and this completes the proof.
Lemma 8.2. For y, z ∈ C with |y| ∨ |z| ≤ 1, and n ∈ N,

|y n − z n | ≤ n|y − z| .

Proof. The observation
|yⁿ − zⁿ| = | (y − z) ∑_{j=0}^{n−1} y^{n−1−j} z^j |
          = |y − z| | ∑_{j=0}^{n−1} y^{n−1−j} z^j |
          ≤ n |y − z| ,

completes the proof.

Proof of Theorem 8.4. WLOG, we assume that µ = 0 and σ = 1. Then, what
needs to be shown is that
n^{−1/2} ∑_{j=1}^n Xj =⇒ N (0, 1) .
Let φ be the characteristic function of X1 . In view of the Lévy continuity
theorem, what needs to be shown is that
lim_{n→∞} φ(t/√n)ⁿ = e^{−t²/2} for all t ∈ R .                            (8.10)
Fix t ∈ R, and notice that
| E[ e^{ιtX1/√n} ] − E[ 1 + (ιt/√n) X1 − (t²/2n) X1² ] |
    ≤ E| e^{ιtX1/√n} − ( 1 + (ιt/√n) X1 − (t²/2n) X1² ) |
    ≤ 2 E[ min( t²X1²/n , |t|³|X1 |³/n^{3/2} ) ] ,                         (by Lemma 8.1)
that is,
| φ(t/√n) − ( 1 − t²/(2n) ) | ≤ 2 E[ min( t²X1²/n , |t|³|X1 |³/n^{3/2} ) ] .
By Lemma 8.2, it follows that for n > t²,
| φ(t/√n)ⁿ − ( 1 − t²/(2n) )ⁿ | ≤ n | φ(t/√n) − ( 1 − t²/(2n) ) |
                                ≤ 2 E[ min( t²X1² , |t|³|X1 |³/n^{1/2} ) ] .
By DCT, the extreme RHS goes to 0 as n → ∞. Since (1 − t²/(2n))ⁿ → e^{−t²/2}
as n → ∞, (8.10) follows, and this completes the proof.
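A quick simulation illustrates Theorem 8.4; it is a sketch only, and the Exponential(1) distribution, n = 500 and the number of replications are arbitrary choices. The standardized sums have an empirical CDF close to that of N(0, 1).

import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(6)

def std_normal_cdf(c):
    return 0.5 * (1.0 + erf(c / sqrt(2.0)))

n, reps = 500, 20_000
x = rng.exponential(scale=1.0, size=(reps, n))      # i.i.d. Exponential(1): mean 1, variance 1
z = (x.sum(axis=1) - n) / np.sqrt(n)                # standardized sums

for c in (-1.0, 0.0, 1.0, 2.0):
    print(c, np.mean(z <= c), std_normal_cdf(c))    # empirical CDF versus N(0, 1) CDF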
We now proceed towards the multivariate CLT, that is, CLT on Rd , for
which, weak convergence on Rd is to be defined.
Definition 52. For probability measures µ, µ1 , µ2 , . . . on (Rd , B(Rd )), µn ⇒ µ
if
lim_{n→∞} ∫_{Rd} f dµn = ∫_{Rd} f dµ ,
for all bounded continuous f : Rd → R. If X, X1 , X2 , . . . are Rd-valued random
variables, then Xn ⇒ X if P ◦ Xn⁻¹ ⇒ P ◦ X⁻¹.


Theorem 8.2 shows the above definition is consistent with Definition 51 in
the case d = 1. The advantage of the above definition is that the weak limit, if
exists, can easily be shown to be unique, as claimed in the following exercise. For
a probability measure on Rd , the underlying σ-field B(Rd ) will not be mentioned
henceforth.
Exercise 8.6. If µn , µ, ν are probability measures on Rd such that µn ⇒ µ and
µn ⇒ ν, show that µ = ν.
Hint. Use Theorem 7.8.
Another convenience of Definition 52 is that the following result now becomes
automatic.
Theorem 8.5 (Continuous mapping theorem). Suppose X1 , X2 , . . . , X∞ are
Rd1 -valued random variables and Xn ⇒ X∞ . If g : Rd1 → Rd2 is a continuous
function, then
g(Xn ) ⇒ g(X∞ ) ,
as Rd2 -valued random variables.
Proof. According to the definition, it suffices to check that for any bounded
continuous f : Rd2 → R,
lim E (f ◦ g(Xn )) = E (f ◦ g(X∞ )) .
n→∞

Fix such f . Since f, g are continuous, so is f ◦ g. As f is bounded, so is


f ◦ g. Thus f ◦ g : Rd1 → R is bounded continuous and the definition of weak
convergence implies the above. This completes the proof.

The following result shows, among other things, that if weak convergence
on Rd were defined by CDFs as in Definition 51, then that would have been
equivalent to Definition 52.
Theorem 8.6 (Portmanteau theorem). For probability measures µ1 , µ2 , . . . , µ∞
on Rd with respective CDFs F1 , F2 , . . . , F∞ , the following are equivalent.

1. As n → ∞, µn ⇒ µ∞ .
2. For any closed set F ⊂ Rd ,

lim sup µn (F ) ≤ µ∞ (F ) .
n→∞

3. For any open set U ⊂ Rd ,

lim inf µn (U ) ≥ µ∞ (U ) .
n→∞

4. For A ∈ B(Rd ) with µ∞ (∂A) = 0, where ∂A is the boundary of A,

lim µn (A) = µ∞ (A) .


n→∞

5. For all x ∈ Rd at which F∞ is continuous,

lim Fn (x) = F∞ (x) .


n→∞

The proof uses the following exercise.


Exercise 8.7. If µ is a probability measure on Rd with CDF F , show that for
x = (x1 , . . . , xd ) ∈ Rd ,

F is continuous at x ⇐⇒ µ(∂Ex ) = 0 ,

where Ex = (−∞, x1 ] × . . . × (−∞, xd ]. Equivalently, show that if F is the CDF


of an Rd -valued random variable (X1 , . . . , Xd ), then for x = (x1 , . . . , xd ) ∈ Rd ,
d
!
_
F is continuous at x ⇐⇒ P (Xi − xi ) = 0 = 0 .
i=1

Proof of Portmanteau theorem. Since µ1 , µ2 , . . . , µ∞ are all probability mea-


sures, it follows trivially that
2 ⇐⇒ 3 .
Thus it suffices to show 1⇒2⇒4⇒5⇒1.

Proof of 1⇒2. Assume 1, that is, µn ⇒ µ∞ . Let k · k be any norm on Rd and
define
d(F, x) = inf{kx − yk : y ∈ F } , x ∈ Rd .
Fix ε > 0 and define f : Rd → R by

fε (x) = 1 − 1 ∧ ε−1 d(F, x) , x ∈ Rd .




Since d(F, ·) is a continuous function, so is f . Further, 0 ≤ fε ≤ 1,

fε (x) = 1 , if x ∈ F ,

and
fε (x) = 0 , if d(F, x) ≥ ε .
In other words,
1F (x) ≤ fε (x) ≤ 1 (d(F, x) < ε) .
Thus,
Z
lim sup µn (F ) = lim sup 1F (x)µn (dx)
n→∞ n→∞
Z
≤ lim fε (x)µn (dx)
n→∞
Z
(as µn ⇒ µ∞ ) = fε (x)µ∞ (dx)

≤ µ∞ {x ∈ Rd : d(F, x) < ε} .


As ε ↓ 0,
{x ∈ Rd : d(F, x) < ε} ↓ {x ∈ Rd : d(F, x) = 0} = F ,
the set theoretic equality following from the fact that F is closed. Thus,

lim µ∞ {x ∈ Rd : d(F, x) < ε} = µ∞ (F ) .



ε↓0

Therefore,
lim sup µn (F ) ≤ µ∞ (F ) .
n→∞

Hence it follows that 1⇒2.


Proof of 2⇒4. From the equivalence of 2 and 3, which is indeed a tautology,
assume that
lim sup µn (F ) ≤ µ∞ (F ) , F ⊂ Rd closed , (8.11)
n→∞

and
lim inf µn (U ) ≥ µ∞ (U ) , U ⊂ Rd open . (8.12)
n→∞

Fix A ∈ B(Rd ) such that µ∞ (∂A) = 0, that is,

µ∞ (Ā) = µ∞ (A◦ ) = µ∞ (A) , (8.13)

where Ā and A◦ are the closure and interior of A, respectively. Invoke (8.12)
with U = A◦ to get

µ∞ (A◦ ) ≤ lim inf µn (A◦ )


n→∞
(as A◦ ⊂ A) ≤ lim inf µn (A)
n→∞
≤ lim sup µn (A)
n→∞
≤ lim sup µn (Ā)
n→∞
≤ µ∞ (Ā) ,

(8.11) implying the last line. This in conjunction with (8.13) shows

lim µn (A) = µ∞ (A) .


n→∞

Thus, 2⇒4.
Proof of 4⇒5. Assume 4. Let x = (x1 , . . . , xd ) be a continuity point of F∞ .
Exc 8.7 shows that
µ∞ (∂Ex ) = 0 ,
where Ex = (−∞, x1 ] × . . . × (−∞, xd ]. The hypothesis 4 which has been
assumed shows that
lim µn (Ex ) = µ∞ (Ex ) ,
n→∞

which is exactly the same as

lim Fn (x) = F∞ (x) .


n→∞

Thus 4⇒5.
Proof of 5⇒1. Assume 5. Let

C = {x ∈ Rd : F∞ is continuous at x} .

The assumption 5 immediately implies


d
Y
lim ∆R Fn = ∆R F∞ , R = (ai , bi ] if {a1 , b1 } × . . . × {ad , bd } ⊂ C , (8.14)
n→∞
i=1

where ∆R F is as in (4.1). Recall Exc 3.3, a restatement of which is that

µn (R) = ∆R Fn , n = 1, . . . , ∞ . (8.15)

Let

Ci = z ∈ R : µ∞ {(x1 , . . . , xd ) ∈ Rd : xi = z} > 0 , i = 1, . . . , d .
 

Clearly, C1 , . . . , Cd are countable sets and hence
c
D = (C1 ∪ . . . ∪ Cd )

is dense in R. It is immediate that for (z1 , . . . , zd ) ∈ Dd ,

µ∞ {(x1 , . . . , xd ) ∈ Rd : xi = zi for some i = 1, . . . , d} = 0 .




Thus for such (z1 , . . . , zd ),


 
µ∞ ∂ (−∞, z1 ] × . . . × (−∞, zd ] = 0 .

In view of Exc 8.7, this means Dd ⊂ C. Combine this with (8.14) and (8.15) to
get
d
Y
lim µn (R) = µ∞ (R) , R = (ai , bi ] , a1 , b1 , . . . , ad , bd ∈ D . (8.16)
n→∞
i=1

Fix a bounded continuous f : Rd → R and ε > 0. Fix a, b ∈ D, which is


dense in R, such that
µ∞ (a, b]d ≥ 1 − ε .

(8.17)
As [a, b]d is a compact set, f is uniformly continuous there. Hence there exists
δ > 0 such that

|f (x) − f (y)| ≤ ε , for all x, y ∈ [a, b]d , kx − yk ≤ δ , (8.18)

where k(z1 , . . . , zd )k = |z1 | ∨ . . . ∨ |zd | is the max norm on Rd . Let a = x0 <


x1 < . . . < xk = b be such that x0 , . . . , xk ∈ D and

xi − xi−1 ≤ δ , i = 1, . . . , k .

Set  
Yd 
H= (xij −1 , xij ] : 1 ≤ i1 , . . . , id ≤ k .
 
j=1

A consequence of (8.16) is that

lim µn (R) = µ∞ (R) , R ∈ H .


n→∞

For R ∈ H, (8.18) and that k · k has been chosen to be the max-norm imply

|f (y) − f (z)| ≤ ε , for all y, z ∈ R .

Since the k d many rectangles in H are disjoint and their union is (a, b]d , we get
Z XZ
f dµn = f dµn , n = 1, 2, . . . , ∞ .
(a,b]d R∈H R

Proceeding like in (8.4)-(8.5) with the help of the above three claims, the ana-
logue of (8.6) can be shown, which is
Z Z
lim sup f dµn − f dµ∞ ≤ 2ε .
n→∞ (a,b]d (a,b]d

Finally, (8.16) also implies

lim µn (a, b]d = µ∞ (a, b]d ,


 
n→∞

which in conjunction with (8.17) shows


Z
lim sup f dµn ≤ Kε ,
n→∞ ((a,b]d )c

where K = supx∈Rd |f (x)| which is finite because f is bounded. Trivially,


Z
f dµ∞ ≤ Kε ,
((a,b]d )c

Thus Z Z
lim sup f dµn − f dµ∞ ≤ 2(K + 1)ε .
n→∞ Rd Rd

Since ε is arbitrary, it follows that


Z Z
lim f dµn = f dµ∞ .
n→∞ Rd Rd

This being true for any bounded continuous f , µn ⇒ µ∞ . Thus, 5⇒1.


The proof of Portmanteau theorem is now complete.
Definition 53. A sequence {µn : n = 1, 2, . . .} of probability measures on Rd is
tight if given ε > 0 there exists a compact set K ⊂ Rd such that

lim inf µn (K) ≥ 1 − ε .


n→∞

The following result connects tightness with weak convergence.

Theorem 8.7. If {µn : n = 1, 2, . . .} is a tight sequence of probability measures


on Rd , then there exists a subsequence {µnk : k = 1, 2, . . .} and a probability
measure µ on Rd such that

µnk ⇒ µ , k → ∞ .

Proof. We shall proceed like in the proof of Lévy continuity theorem. Let Fn
be the CDF of µn , that is,

Fn (x) = µn ((−∞, x1 ] × . . . × (−∞, xd ]) , n = 1, 2, . . . , x = (x1 , . . . , xd ) ∈ Rd .

As Qd is a countable set and for every r ∈ Qd , {Fn (r) : n = 1, 2, . . .} is a


bounded sequence of real numbers, there exist 1 ≤ n1 < n2 < . . . such that

lim Fnk (r) exists for all r ∈ Qd .


k→∞

Define
G(r) = lim Fnk (r) , r ∈ Qd ,
k→∞
d
and F : R → [0, 1] by

F (x1 , . . . , xd ) = inf {G(r1 , . . . , rd ) : r1 > x1 , . . . , rd > xd , r1 , . . . , rd ∈ Q} .

We shall show that F is a CDF, that is, it satisfies the assumptions of Theorem
4.2, from which it would follows that F induces a probability measure µ on Rd .
It will be shown that µnk ⇒ µ. This is achieved in the following few steps.
Step 1. The function F is continuous from above, that is,

lim F (y1 , . . . , yd ) = F (x1 , . . . , xd ) , for all x1 , . . . , xd ∈ R .


y1 ↓x1 ,...,yd ↓xd

Proof of Step 1. The definition of F implies that for fixed x = (x1 , . . . , xd ) ∈ Rd


and ε > 0, there exist rationals r1 > x1 , . . . , rd > xd such that

G(r1 , . . . , rd ) ≤ F (x1 , . . . , xd ) + ε .

Define
d
1^
δ= (ri − xi ) .
2 i=1

For y ∈ [x1 , x1 + δ] × . . . × [xd , xd + δ], once again, the definition of F implies


that

F (x) ≤ F (y)
≤ G(r1 , . . . , rd )
≤ F (x) + ε .

Thus Step 1 follows.


Step 2. For R = (a1 , b1 ] × . . . × (ad , bd ] where −∞ < ai < bi < ∞ for
i = 1, . . . , d,
∆R F ≥ 0 .

Proof of Step 2. Fix R as above and ε > 0. Let x = (x1 , . . . , xd ) ∈ E =
{a1 , b1 } × . . . × {ad , bd }. There exist rationals r1 > x1 , . . . , rd > xd such that

G(r1 , . . . , rd ) ≤ F (x1 , . . . , xd ) + ε2−d .

Set
d
^
δx = (ri − xi ) .
i=1

Since G is non-decreasing by definition, it thus follows that

F (x) ≤ G(s1 , . . . , sd ) ≤ G(r1 , . . . , rd ) ≤ F (x) + ε

for all s1 , . . . , sd ∈ Q with xi ≤ si ≤ xi + δx , i = 1, . . . , d .


Taking ! !
d
^ 1^
δ= δx ∧ (bi − ai ) ,
2 i=1
x∈E

it thus follows that for all x = (x1 , . . . , xd ) ∈ E,

|F (x) − G(s)| ≤ ε2−d , s ∈ Qd ∩ ([x1 , x1 + δ] × . . . × [xd , xd + δ]) . (8.19)

Choose ti ∈ [ai , ai + δ] ∩ Q and ui ∈ [bi , bi + δ] ∩ Q for i = 1, . . . , d. Since


δ < bi − ai for all i, ti ≤ ai + δ < bi ≤ ui for all i. Letting

R0 = (t1 , u1 ] × . . . × (td , ud ] ,

(8.19) implies
|∆R F − ∆R0 G| ≤ ε .
Since
∆R0 G = lim ∆R0 Fnk ≥ 0 ,
k→∞

it follows that
∆R F ≥ −ε .
As ε is arbitrary, Step 2 follows.
Step 3. As x1 → ∞, . . . , xd → ∞, F (x1 , . . . , xd ) → 1. On the other hand, as
Vd
i=1 xi → −∞, F (x1 , . . . , xd ) → 0.

Proof of Step 3. This is the only step in which tightness of {µn } is used. Since
0 ≤ F (x) ≤ 1 for all x ∈ Rd , it suffices to show that for ε > 0 there exists
a, b ∈ R such that for x = (x1 , . . . , xd ) ∈ Rd ,
d
^
F (x) ≥ 1 − ε if xi > b , (8.20)
i=1

123
and
d
^
F (x) ≤ ε if xi < a . (8.21)
i=1
Fix ε > 0. Tightness implies there exists a compact set K such that
lim inf µn (K) ≥ 1 − ε .
n→∞

Since K is compact and hence bounded, there exist a, b ∈ Q with a < b and
K ⊂ (a, b]d . Thus
G(b, . . . , b) = lim Fnk (b, . . . , b)
k→∞
= lim µnk (−∞, b]d

k→∞
≥ lim inf µn (K)
n→∞
≥ 1 − ε.
The definition of F and that G is non-decreasing imply that
d
Y
F (z) ≥ G(v) if z = (z1 , . . . , zd ) ∈ Rd , and v ∈ Qd ∩ (−∞, zi ] . (8.22)
i=1

Thus,
F (x1 , . . . , xd ) ≥ G(b, . . . , b) for all x1 ≥ b, . . . , xd ≥ b ,
showing that (8.20) holds. Fix x = (x1 , . . . , xd ) ∈ Rd with xj < a for some fixed
j. Let r ∈ Q be such that r > (x1 ∨ . . . ∨ xd ) and define y = (y1 , . . . , yd ) where
(
r, i 6= j ,
yi =
a, i = j .

Thus y ∈ Qd and yi > xi for all i, which shows


F (x) ≤ G(y)
= lim µnk ((−∞, y1 ] × . . . × (−∞, yd ])
k→∞
≤ lim sup µn ((−∞, y1 ] × . . . × (−∞, yd ])
n→∞
≤ lim sup µn (K c ) ,
n→∞

the last line following from the argument that yj = a and K ⊂ (a, b]d show
((−∞, y1 ] × . . . × (−∞, yd ]) ∩ K = ∅ ,
and hence
(−∞, y1 ] × . . . × (−∞, yd ] ⊂ K c .
Finally,
lim sup µn (K c ) = 1 − lim inf µn (K) ≤ ε ,
n→∞ n→∞

which establishes (8.21). This proves Step 3.

124
Steps 1-3 in conjunction with Theorem 4.2 show that there exists a proba-
bility measure µ on Rd satisfying

µ ((−∞, x1 ] × . . . × (−∞, xd ]) = F (x) , x = (x1 , . . . , xd ) ∈ Rd .

To complete the proof by showing µnk ⇒ µ, k → ∞, in view of the Portmanteau


theorem, it suffices to prove that

lim Fnk (x) = F (x) , (8.23)


k→∞

for every continuity point x of F . Fix such x = (x1 , . . . , xd ) and ε > 0. By


continuity, there exist wi < xi < yi for i = 1, . . . , d such that

F (w1 , . . . , wd ) ≥ F (x1 , . . . , xd ) − ε ,

and
F (y1 , . . . , yd ) ≤ F (x1 , . . . , xd ) + ε .
Let r1 , . . . , rd , s1 , . . . , sd ∈ Q be such that wi < ri < xi < si < yi for i = 1, . . . , d.
Thus,

F (x) − ε ≤ F (w1 , . . . , wd )
(definition of F ) ≤ G(r1 , . . . , rd )
= lim Fnk (r1 , . . . , rd )
k→∞
≤ lim inf Fnk (x)
k→∞
≤ lim sup Fnk (x)
k→∞
≤ lim Fnk (s1 , . . . , sd )
k→∞
= G(s1 , . . . , sd )
(by (8.22)) ≤ F (y1 , . . . , yd )
≤ F (x) + ε .

Since ε is arbitrary, (8.23) follows, which completes the proof of Theorem 8.7.

The following is a generalization of Theorem 7.9, and hence this also is called
the Cramér-Wold device.
Theorem 8.8 (Cramér-Wold device for weak convergence). For Rd -valued ran-
dom variables X1 , X2 , . . . X∞ , Xn ⇒ X∞ if and only if

hλ, Xn i ⇒ hλ, X∞ i for all λ ∈ Rd . (8.24)

The proof uses the following exercise from real analysis.

125
Exercise 8.8. 1. If F, F1 , F2 , . . . are functions from Rd to [0, 1], show that

lim Fn (x) = F (x)


n→∞

for every continuity point x of F if and only if every subsequence {Fnk } of {Fn }
has a further subsequence {Fnkl } such that

lim Fnkl (x) = F (x)


l→∞

for every continuity point x of F .


2. Hence or otherwise, prove that for probability measures µ, µ1 , µ2 , . . . on Rd ,
µn ⇒ µ if and only if every subsequence {µnk } of {µn } has a further subsequence
{µnkl } such that
µnkl ⇒ µ , l → ∞ .
Proof of Theorem 8.8. The “only if” part follows trivially from the continuous
mapping theorem because for a fixed λ ∈ Rd , x 7→ hλ, xi is a continuous map
from Rd to R.
Conversely, assume (8.24). Denote

Xn = (Xn1 , . . . , Xnd ) , n = 1, . . . , ∞ .

For fixed i ∈ {1, . . . , d}, letting λ be the vector whose i-th coordinate is 1 and
rest are 0, (8.24) implies

Xni ⇒ X∞i , i = 1, . . . , d . (8.25)

We shall first show that {P ◦Xn−1 : n = 1, 2, . . .} is tight, that is, given ε > 0,
a compact K ⊂ Rd will be obtained satisfying

lim inf P (Xn ∈ K) ≥ 1 − ε . (8.26)


n→∞

Fix ε > 0. Let 0 < α < ∞ be such that


ε
P (|X∞i | < α) ≥ 1 − , i = 1, . . . , d .
d
Use 3 of the Portmanteau theorem with d = 1, U = (−α, α) and (8.25) to get
ε
lim inf P (|Xni | < α) ≥ P (|X∞i | < α) ≥ 1 − ,
n→∞ d
a consequence of which is
ε
lim sup P (|Xni | ≥ α) ≤ , i = 1, . . . , d . (8.27)
n→∞ d

126
Let K = [−α, α]d . Thus,

lim inf P (Xn ∈ K) = 1 − lim sup P (Xn ∈ K c )


n→∞ n→∞
d
!
[
= 1 − lim sup P [|Xni | > α]
n→∞
i=1
d
X
≥ 1 − lim sup P (|Xni | > α)
n→∞
i=1
d
X
≥1− lim sup P (|Xni | > α)
n→∞
i=1
d
X
≥1− lim sup P (|Xni | ≥ α)
n→∞
i=1
d
X ε
≥1− = 1 − ε,
i=1
d

(8.27) implying the inequality in the last line. Thus, (8.26) holds. In other
words, {P ◦ Xn−1 : n = 1, 2, . . .} is tight.
By Exc 8.8.2, it suffices to show that every subsequence {Xnk } of {Xn } has
a further subsequence converging weakly to X∞ . Fix a subsequence {Xnk }.
Since {P ◦ Xn−1 : n = 1, 2, . . .} is tight, so is {P ◦ Xn−1
k
: k = 1, 2, . . .}. Theorem
8.7 implies {Xnk } has a subsequence {Xnkl : l = 1, 2, . . .} such that

Xnkl ⇒ Y , l → ∞ ,

for some Rd -valued random variable Y . The already proven “only if” part of
this theorem implies

hλ, Xnkl i ⇒ hλ, Y i , l → ∞ , λ ∈ Rd .

Comparing this with the hypothesis (8.24) yields


d
hλ, X∞ i = hλ, Y i , λ ∈ Rd .

Theorem 7.9 implies


d
X∞ = Y .
Therefore,
Xnkl ⇒ X∞ , l → ∞ .
This gives us the desired further subsequence of {Xnk } which converges weakly
to X∞ . Hence the proof follows.

The CLT in Rd now becomes a trivial consequence of the above theorem.

127
Theorem 8.9 (CLT in Rd ). Suppose X1 , X2 , . . . are i.i.d. random variables
taking values in Rd such that each coordinate of X1 has mean zero and finite
variance. Then
n
1 X
√ Xi ⇒ Z , n → ∞
n i=1

where Z ∼ Nd (0, Σ) and Σ is the covariance matrix of X1 .


Proof. In view of Theorem 8.8, it suffices to prove that for all λ ∈ Rd ,
* n
+
1 X
λ, √ Xi ⇒ hλ, Zi , n → ∞ . (8.28)
n i=1

To that end fix λ ∈ Rd , write


* n
+ n
1 X 1 X
λ, √ Xi = √ hλ, Xi i ,
n i=1 n i=1

and notice that hλ, X1 i, hλ, X2 i, . . . are i.i.d. Denoting X1 = (X11 , . . . , X1d ), the
assumption that X11 , . . . , X1d are zero mean implies

E (hλ, X1 i) = 0 .

Further, if σij is the (i, j)-th entry of Σ, that is,

σij = Cov(X1i , Xij ) ,

then writing λ = [λ1 . . . λd ]T ,


d
!
X
Var (hλ, X1 i) = Var λi X1i
i=1
d X
X d
= λi λj σij
i=1 j=1

= λT Σλ .

The CLT on R, which is Theorem 8.4, implies


n
1 X
√ hλ, Xi i ⇒ Yλ , n → ∞ ,
n i=1

where Yλ ∼ N (0, λT Σλ). Theorem 7.10 shows that


d
Yλ = hλ, Zi .

In other words, (8.28) holds, from which the proof follows.

128
The last theorem of this course is Lindeberg’s CLT, which is a generalization
of Theorem 8.4 in that the assumption of identical distribution therein is relaxed.
Theorem 8.10 (Lindeberg’s CLT). Suppose that for n = 1, 2, . . ., Xn1 , . . . , Xnn
are independent R-valued random variables satisfying the following:

E(Xni ) = 0 , i = 1, . . . , n, n = 1, 2, . . . ,
n
X
2
= σ2 < ∞ ,

lim E Xni
n→∞
i=1

and
n
X
2

lim E Xni 1(|Xni | > ε) = 0 , for every ε > 0 . (8.29)
n→∞
i=1

Then, as n → ∞,
n
X
Xni ⇒ Z ,
i=1

where Z ∼ N (0, σ 2 ).
The assumption (8.29) is called Lindeberg’s condition. The family {Xni :
1 ≤ i ≤ n, n = 1, 2, . . .} is called a triangular array, which is why, Theorem 8.10
is also known as CLT for triangular arrays. Theorem 8.4 follows from Theorem
8.10 as claimed in the following exercise.
Exercise 8.9. Suppose X1 , X2 , . . . are i.i.d. zero mean random variables with
finite variance σ 2 . Define
1
Xni = √ Xi , 1 ≤ i ≤ n, n ≥ 1 .
n
Show that {Xni : 1 ≤ i ≤ n, n = 1, 2, . . .} satisfies the assumptions of Theorem
8.10 and hence argue that Theorem 8.4 is a special case of that.
Theorem 8.10 can be proven along the lines of the proof of Theorem 8.4, that
is, with the help of the Lévy continuity theorem. For pedagogical reasons, we
shall prove it using Lindeberg’s principle which completely bypasses the Fourier
analytic method, that is, the use of characteristic functions. The following two
exercises, for example, can be easily solved using the Lévy continuity theorem,
though the solutions hinted at don’t use it.
Exercise 8.10. If Xn ∼ N (0, σn2 ) and 0 ≤ σn → σ < ∞, show that Xn ⇒ X
where X ∼ N (0, σ 2 ).
d
Hint. If Z ∼ N (0, 1), then Xn = σn Z → σZ.
Exercise 8.11. Suppose X, X1 , X2 , . . . are random variables such that for all
thrice differentiable bounded f : R → R whose first three derivatives are bounded,
it holds that
lim E (f (Xn )) = E (f (X)) .
n→∞

129
Show that Xn ⇒ X.
Hint. Let 
1,
 x ≤ 0,
f (x) = (1 − x4 )4 , 0 < x < 1,

0, x ≥ 1.

Observe that for w < y,


 
x−w
1(−∞,w] (x) ≤ f ≤ 1(−∞,y] (x) for all x ∈ R .
y−w
Proof of Theorem 8.10. Using Exc 8.11, it suffices to show that

lim E (f (Sn )) = E (f (Z)) , (8.30)


n→∞

for all thrice differentiable f : R → R such that f and its first three derivatives
are bounded, where
Xn
Sn = Xni , n ≥ 1.
i=1
Fix such f .
Let (Z1 , Z2 , . . .) be a collection of i.i.d. standard normal random variables
which is independent of the triangular array {Xni : 1 ≤ i ≤ n, n ≥ 1}. Set
q
σni = E(Xni 2 ) , 1 ≤ i ≤ n , n = 1, 2, . . . ,

and v
u n
uX
σn = t 2 ,n ≥ 1.
σni
i=1

Since
n
X
σni Zi ∼ N (0, σn2 ) , n = 1, 2, . . . , (8.31)
i=1

and σn2 → σ 2 , Exc 8.10 shows


n
!!
X
lim E f σni Zi = E (f (Z)) .
n→∞
i=1

Thus, (8.30) would follow once it is shown that


n
!!
X
lim E f (Sn ) − f σni Zi = 0. (8.32)
n→∞
i=1

Fix n ∈ {1, 2, . . .} and write


n
! n
X X
f (Sn ) − f σni Zi = (f (Yi−1 ) − f (Yi )) ,
i=1 i=1

130
where
n
X i
X
Yi = Xnj + σnj Zj , i = 0, 1, . . . , n ,
j=i+1 j=1

with the usual interpretation of the sum as zero if the lower limit exceeds the
upper limit. Thus,
n
!! n
X X
E f (Sn ) − f σni Zi ≤ |E(f (Yi−1 ) − f (Yi ))| . (8.33)
i=1 i=1

Fix i ∈ {1, . . . , n} and write

Yi = W + σni Zi ,

and
Yi−1 = W + Xni ,
where
n
X i−1
X
W = Xnj + σnj Zj .
j=i+1 j=1

It is immediate that W, Xni , Zi are independent. Taylor’s theorem implies


1 2 00
f (Yi−1 ) = f (W ) + Xni f 0 (W ) + Xni f (ξ1 ) (8.34)
2
1 2 00 1 3 000
= f (W ) + Xni f 0 (W ) + Xni f (W ) + Xni f (ξ2 ) , (8.35)
2 6
for some ξ1 and ξ2 between W and Yi−1 , where f 0 , f 00 , f 000 are the first three
derivatives of f , respectively. Let

K = sup (|f (x)| ∨ |f 0 (x)| ∨ |f 00 (x)| ∨ |f 000 (x)|) ,


x∈R

which is finite by assumption. A consequence of (8.34) is that


 
0 1 2 00 1 2 00
f (Yi−1 ) − f (W ) + Xni f (W ) + Xni f (W ) = Xni |f (ξ1 ) − f 00 (W )|
2 2
1 2
≤ Xni (|f 00 (ξ1 )| + |f 00 (W )|)
2
2
≤ KXni .

Similarly, (8.35) shows


 
0 1 2 00 1
f (Yi−1 ) − f (W ) + Xni f (W ) + Xni f (W ) ≤ K|Xni |3 ≤ K|Xni |3 .
2 6
Thus,
 
0 1 2 00 2
f (Yi−1 ) − f (W ) + Xni f (W ) + Xni f (W ) ≤ K(Xni ∧ |Xni |3 ) .
2

131
Therefore,
 
2 3
 0 1 2 00
KE Xni ∧ |Xni | ≥ E f (Yi−1 ) − f (W ) + Xni f (W ) + Xni f (W )
2
 
0 1 2 00
≥ E(f (Yi−1 )) − E f (W ) + Xni f (W ) + Xni f (W )
2
1 2
= E(f (Yi−1 )) − E(f (W )) − σni E(f 00 (W )) ,
2
the last line following from the independence of W and Xni and that the mean
2
and variance of Xni are zero and σni , respectively. A similar calculation shows

1 2
E(f 00 (W )) ≤ KE |σni Zi |3 = Cσni
3

E(f (Yi )) − E(f (W )) − σni ,
2

where C = KE(|Z1 |3 ). Combine the two inequalities obtained to get


2
∧ |Xni |3 + Cσni
3

|E (f (Yi−1 ) − f (Yi ))| ≤ KE Xni .

Summing the above inequality over i = 1, . . . , n and using (8.33), we get


n
!! n n
X X X
3 2
∧ |Xni |3 .

E f (Sn ) − f σni Zi ≤C σni +K E Xni
i=1 i=1 i=1

Thus, (8.32) would follow, which would complete the proof, once the following
are shown:
n
X
3
lim σni = 0, (8.36)
n→∞
i=1
n
X
2
∧ |Xni |3 = 0 .

and lim E Xni (8.37)
n→∞
i=1

For (8.36), write


n
X
3
σni ≤ σn2 max σni
1≤i≤n
i=1
r
= σn2 2 .
max σni
1≤i≤n

Since σn2 → σ 2 < ∞, (8.36) would follow if it can be shown that


2
lim max σni = 0.
n→∞ 1≤i≤n

Fix ε > 0 and write


2 2 2
σni = E(Xni 1(|Xni | ≤ ε)) + E(Xni 1(|Xni | > ε)) ≤ ε2 + E(Xni
2
1(|Xni | > ε)) .

132
Hence
2
max σni ≤ ε2 + max E(Xni
2
1(|Xni | > ε))
1≤i≤n 1≤i≤n
Xn
≤ ε2 + 2
E(Xni 1(|Xni | > ε)) .
i=1

Invoke (8.29) to argue


2
lim sup max σni ≤ ε2 .
n→∞ 1≤i≤n

Since ε is arbitrary,
2
lim max σni = 0,
n→∞ 1≤i≤n

which shows (8.36).


Finally, for (8.37), fix ε > 0 and write
n
X n n
2
 X  X
∧ |Xni |3 ≤ E |Xni |3 1(|Xni | ≤ ε + 2

E Xni E Xni 1(|Xni | > ε)
i=1 i=1 i=1
Xn n
X
2 2

≤ε E(Xni )+ E Xni 1(|Xni | > ε)
i=1 i=1
n
X
= εσn2 + 2

E Xni 1(|Xni | > ε) .
i=1

Let n → ∞ and use (8.29) to get


n
X
2
∧ |Xni |3 ≤ εσ 2 .

lim sup E Xni
n→∞
i=1

Since ε is arbitrary, (8.37) follows. This in conjunction with (8.36) shows (8.32),
which completes the proof.
Remark 5. The above proof is transparent in that it displays the property of
normal that has been used. Indeed, (8.31) does use the fact that the sum of
independent normal random variables also follows normal.
Exercise 8.12. If X1 , X2 , . . . are i.i.d. and P (X1 = 0) < 1, show that there
does not exist a random variable Z such that
n
X
Xi ⇒ Z , n → ∞ .
i=1

Exercise 8.13. Show that a sequence of probability measure {µn } on Rd is tight


if and only if given any subsequence of {µn }, there exists a further subsequence
which converges to a probability measure µ on Rd . This is a special case of
Prohorov’s theorem

133
Exercise 8.14. Show that the Lindeberg condition (8.29) is implied by the Lya-
punov condition
n
X
E |Xni |2+δ = 0 for some δ > 0 .

lim
n→∞
i=1

Exercise 8.15. If Xn ∼ Binomial(n, pn ) where pn are such that

lim npn (1 − pn ) = ∞ ,
n→∞

show that
X − npn
p n ⇒Z,
npn (1 − pn )
where Z ∼ N (0, 1).
Exercise 8.16. Suppose X is as in Exc 5.3, that is, it is infinitely divisible,
E(X) = 0 and Var(X) = 1. Show that

E(X 4 ) = 3 ⇐⇒ X ∼ N (0, 1) .

Hint. If Xn1 , . . . , Xnn are as in (5.9) and E(X 4 ) = 3, show that

4 3
E(Xn1 )= .
n2
Use Exc 8.14.
Exercise 8.17. A coin with probability of head p ∈ (0, 1) is tossed infinitely
many times. Let Xn be the number of the toss on which the n-th head is obtained.
Show that  
−1/2 n
n Xn − ⇒Z,
p
where Z ∼ N (0, σ 2 ) for some σ 2 . Calculate σ 2 .
Exercise 8.18. There are k boxes numbered 1, . . . , k and an infinite supply of
balls. The balls are thrown, one by one, randomly into one of the boxes. Let
Xn1 , . . . , Xnk denote the number of balls in Boxes 1, . . . , k, respectively, after
the first n balls are thrown. Show that
 n n
n−1/2 Xn1 − , . . . , Xnk − ⇒ (Z1 , . . . , Zk ) ,
k k
where (Z1 , . . . , Zk ) ∼ Nk (0, Σ) for some k × k matrix Σ. Calculate Σ.
Exercise 8.19. If Xn ∼ Binomial(n, pn ) and

lim npn = λ ∈ (0, ∞) ,


n→∞

use the Lévy continuity theorem to show that Xn ⇒ Z where Z ∼ Poisson(λ).

134
Exercise 8.20. Suppose that X1 , X2 , . . . are i.i.d. random variables with den-
sity
f (x) = e−1 x−2 1(x > e−1 ), x ∈ R .
Show that as n → ∞, √
1/ n
(X1 . . . Xn ) ⇒Z,
where Z follows the log-normal distribution, that is, log Z follows standard nor-
mal.
Exercise 8.21. If X1 , X2 . . . are i.i.d. with zero mean and finite positive vari-
ance, show that there does not exist a random variable Z such that
n
1 X P
√ Xi −→ Z , n → ∞ .
n i=1

Exercise 8.22. Suppose Xn are random variables with all moments finite such
that
lim E(Xnk ) = mk ∈ R , k ∈ {1, 2, . . .} .
n→∞
If there exists a unique probability measure µ on R such that
Z
xk µ(dx) = mk , k = 1, 2, . . . ,
R

show that Xn ⇒ X where P ◦ X −1 = µ.


Hint. First show {P ◦ Xn−1 : n = 1, 2, . . .} is tight.
Exercise 8.23. Suppose X1 , X2 , . . . are i.i.d. from the density
f (x) = x−2 , x ≥ 1 ,
then show that there exists a random variable Z such that
n−1 max Xi ⇒ Z .
1≤i≤n

Find the distribution of Z.


Exercise 8.24. Suppose X, X1 , X2 , . . . are random variables with CDFs F,
F1 , F2 , . . ., respectively. If Xn ⇒ X and F is continuous, show that
lim sup |Fn (x) − F (x)| = 0 .
n→∞ x∈R

Exercise 8.25. Suppose µ, µ1 , µ2 , . . . are probability measures on R having den-


sities f, f1 , f2 , . . . with respect to the Lebesgue measure (λ). If
fn (x) → f (x) for a.e.(λ) x ,
show that fn → f in L1 (R, λ) and hence
lim µn (B) = µ(B) , B ∈ B(R) .
n→∞

Hint. Write
|fn − f | = fn + f − 2(fn ∧ f ) .

135
9 Appendix
9.1 Proof of Fact 1.3
Proof of Fact 1.3. Since k·k is the L∞ norm, it suffices to show that the absolute
value of each entry of the d × 1 vector T (x) − T (y) − J(x)(x − y) is at most.
dαkx − yk. In other words, it suffices to show that if f : U → R is continuously
differentiable, and

|fi (y) − fi (x)| ≤ α , x, y ∈ R, i = 1, . . . , d ,

where
∂f (x)
fi (x) = , x ∈ U, i = 1, . . . , d ,
∂xi
then
d
X
f (x) − f (y) − fi (x)(xi − yi ) ≤ dαkx − yk , x, y ∈ R .
i=1

Let f be a function satisfying the hypotheses. Let x0 = x, xd = y, and for


1 ≤ i ≤ d − 1,
xi = (y1 , . . . , yi , xi+1 , . . . , xd ) .
Since R is a rectangle, x1 , . . . , xd−1 ∈ R. For a fixed i = 1, . . . , d, xi−1 and xi
have all entries identical except the i-th one, which are xi and yi respectively.
The one-dimensional mean value theorem implies there exists ξi between xi and
yi such that

f xi−1 − f xi = (xi − yi )fi (y1 , . . . , yi−1 , ξi , xi+1 , . . . , xd ) .


 

Since ξ˜i = (y1 , . . . , yi−1 , ξi , xi+1 , . . . , xd ) ∈ R because R is a rectangle, the


hypotheses on f imply

fi (ξ˜i ) − fi (x) ≤ α , i = 1, . . . , d .

136
Therefore,
d
X
f (x) − f (y) − fi (x)(xi − yi )
i=1
d
 X
= f x0 − f xd −

fi (x)(xi − yi )
i=1
d
X d
 X
f xi−1 − f xi −
 
= fi (x)(xi − yi )
i=1 i=1
d 
X 
= fi (ξ˜i ) − fi (x) (xi − yi )
i=1
d
X
≤ fi (ξ˜i ) − fi (x) |xi − yi |
i=1
≤ dα max |xi − yi |
1≤i≤d

= dαkx − yk .
This completes the proof.

9.2 Proof of Fact 4.1


Proof of Fact 4.1. Let F : Rd → R satisfy the assumptions, that is,
lim F (y1 , . . . , yd ) = F (x1 , . . . , xd ) for all (x1 , . . . , xd ) ∈ Rd , (9.1)
y1 ↓x1 ,...,yd ↓xd

and
∆R F ≥ 0 for all R ∈ H , (9.2)
where
H = {(a1 , b1 ] × . . . × (ad , bd ] : −∞ < ai < bi < ∞ for i = 1, . . . , d} ,
X
∆R F = (−1)#{i:xi =ai } F (x1 , . . . , xd ) , (9.3)
(x1 ,...,xd )∈{a1 ,b1 }×...×{ad ,bd }

for all R = (a1 , b1 ] × . . . × (ad , bd ] ∈ H.


Step 1. The function R 7→ ∆R F is a finitely additive set function on H, that
is, for disjoint R1 , . . . , Rn ∈ H such that R = R1 ∪ . . . ∪ Rn ∈ H,
n
X
∆R F = ∆R i F .
i=1

Proof of Step 1. For R = (a1 , b1 ] × (ad , bd ] ∈ H, and x = (x1 , . . . , xd ) ∈ Rd ,


define
(
(−1)#{i:xi =ai } , x ∈ {a1 , b1 } × . . . × {ad , bd } ,
sgn(x, R) =
0, otherwise .

137
That is, sgn(x, R) is zero unless x is a vertex of R.
Rewrite (9.3) as
X
∆R F = sgn(x, R)F (x) .
x=(x1 ,...,xd )∈{a1 ,b1 }×...×{ad ,bd }

Suppose R = (a1 , b1 ] × . . . × (ad , bd ] ∈ H and for some n1 , . . . , nd ∈ N,

ai = ai,0 < ai,1 < . . . < ai,ni = bi , i = 1, . . . , d .

Let
d
Y
Rk1 ,...,kd = (ai,ki −1 , ai,ki ] , 1 ≤ k1 ≤ n1 , . . . , 1 ≤ kd ≤ nd . (9.4)
i=1

We shall first show that


n1
X nd
X
... ∆Rk1 ,...,kd F = ∆R F . (9.5)
k1 =1 kd =1

The LHS above equals

X n1
X nd
X
F (x) ... sgn (x, Rk1 ,...,kd ) , (9.6)
x∈A k1 =1 kd =1

Qd
where A = i=1 {ai,0 , ai,1 , . . . , ai,ni }. Let A0 = {a1 , b1 } × . . . × {ad , bd } and
observe that for x ∈ A0 , there exists unique k1 , . . . , kd such that

sgn (x, Rk1 ,...,kd ) 6= 0 ,

and for this k1 , . . . , kd ,

sgn (x, Rk1 ,...,kd ) = sgn(x, R) .

Thus, the quantity in (9.6) equals

X X n1
X nd
X
sgn(x, R)F (x) + F (x) ... sgn (x, Rk1 ,...,kd ) .
x∈A0 x∈A\A0 k1 =1 kd =1

Since the first term above is the same as ∆R F , (9.5) would follow once it is
shown that
Xn1 nd
X
... sgn (x, Rk1 ,...,kd ) = 0 , x ∈ A \ A0 . (9.7)
k1 =1 kd =1

Fix x = (x1 , . . . , xd ) ∈ A \ A0 . Then there exists i ∈ {1, . . . , d} such that

xi = ai,ui for some 1 ≤ ui ≤ ni − 1 .

138
Thus for 1 ≤ k1 ≤ n1 , . . . , 1 ≤ kd ≤ nd , x is not a vertex of Rk1 ,...,kd by (9.4),
unless ki equals either ui or ui + 1, that is,

sgn (x, Rk1 ,...,kd ) = 0 if ki ∈


/ {ui , ui + 1} .

Further,
 
sgn x, Rk1 ,...,ki−1 ,ui ,ki+1 ,...,kd = − sgn x, Rk1 ,...,ki−1 ,ui +1,ki+1 ,...,kd .

Thus (9.7) follows which proves (9.5).


To complete the proof of Step 1, let R1 , . . . , Rn ∈ H be disjoint such that
R = R1 ∪ . . . ∪ Rn ∈ H. Let R = (a1 , b1 ] × . . . × (ad , bd ] ∈ H and

ai = ai,0 < ai,1 < . . . < ai,ni = bi , i = 1, . . . , d ,


Qd
be such that vertices of R1 , . . . , Rn belong to i=1 {ai,0 , ai,1 , . . . , ai,ni }. If
Rk1 ,...,kd is as in (9.4), then

either Rk1 ,...,kd ⊂ Ri or Rk1 ,...,kd ∩ Ri = ∅ ,

for 1 ≤ k1 ≤ n1 , . . . , 1 ≤ kd ≤ nd and i = 1, . . . , n. Use (9.5) to write


X
∆R F = ∆Rk1 ,...,kd F
1≤k1 ≤n1 ,...,1≤kd ≤nd
n
X X
= ∆Rk1 ,...,kd F
i=1 1≤k1 ≤n1 ,...,1≤kd ≤nd : Rk1 ,...,kd ⊂Ri
n
X
= ∆R i F ,
i=1

(9.5) being used again in the last line. This completes the proof of Step 1.
Step 2. If R1 , R2 ∈ H and R1 ⊂ R2 , then ∆R1 F ≤ ∆R2 F .
Proof of Step 2. Follows from (9.2) and Step 1 by observing that R2 \ R1 =
S1 ∪ . . . ∪ Sn for some disjoint S1 , . . . , Sn ∈ H.
Step 3. If R = (a1 , b1 ] × . . . × (ad , bd ] ∈ H and for ε > 0, Rε = (a1 , b1 + ε] ×
. . . × (ad , bd + ε], then
lim ∆Rε F = ∆R F .
ε↓0

Proof of Step 3. Follows from (9.1).


Step 4. If R = (a1 , b1 ] × . . . × (ad , bd ] ∈ H,

lim ∆(a01 ,b1 ]×...×(a0d ,bd ] F = ∆R F .


a01 ↓a1 ,...,a0d ↓ad

Proof of Step 4. Follows from (9.1).

139
For the next several steps, fix n = (n1 , . . . , nd ) ∈ Zd and let
Ωn = (n1 − 1, n1 ] × . . . × (nd − 1, nd ] ,
and
Sn = {∅} ∪ {R ∈ H : R ⊂ Ωn } .
Step 5. The collection Sn is a semi-field on Ωn and µn : Sn → [0, ∞) defined
by
µn (R) = ∆R F , ∅ =
6 R ∈ Sn ,
and µn (∅) = 0 is a finitely additive set function.
Proof of Step 5. That Sn is a semi-field is immediate. Finite additivity of µn
follows from Step 1.
Step 6. Let Fn = {A1 ∪ . . . ∪ Ak : A1 , . . . , Ak ∈ Sn are disjoint}. Then Fn is a
field on Ωn . Extend µn to Fn by
k
X
µn (A1 ∪ . . . ∪ Ak ) = µn (Ai ) , A1 , . . . , Ak ∈ Sn are disjoint .
i=1

Then µn is well defined on Fn , that is, different representations yield the same
definition, is finitely additive on Fn , monotone on Fn , that is, µn (A) ≤ µn (B)
for A, B ∈ Fn with A ⊂ B and finitely sub-additive on Fn , that is,
k
X
µn (A1 ∪ . . . ∪ Ak ) ≤ µn (Ai ) , A1 , . . . , Ak ∈ Fn .
i=1

Proof of Step 6. That Fn is a field follows from Step 5 which says Sn is a semi-
field. If A1 , . . . , Ak ∈ Sn are disjoint and so are B1 , . . . , Bl ∈ Sn such that
A1 ∪ . . . ∪ Ak = B1 ∪ . . . ∪ Bl ,
then Step 5 shows
k
X k X
X l l
X
µn (Ai ) = µn (Ai ∩ Bj ) = µn (Bj ) .
i=1 i=1 j=1 j=1

Thus, µn is well defined on Fn in that the definition is not dependent on the


representation. A similar argument shows µn is finitely additive on Fn . If
A, B ∈ Fn and A ⊂ B, then finite additivity shows
µn (B) = µn (A) + µn (B \ A) ≥ µn (A) ,
showing µn is monotone on Fn . Finally for A, B ∈ Fn , finite additivity shows
µn (A ∪ B) = µn (A) + µn (B \ A) ≤ µn (A) + µn (B) ,
the inequality following from monotonicity of µn . Induction shows µn is finitely
sub-additive on Fn . This completes the proof of Step 6.

140
Step 7. The set function µn is countably additive on Sn .
Proof of Step 7. Let R1 , R2 , . . . ∈ Sn be disjoint such that

R = R1 ∪ R2 ∪ . . . ∈ Sn .

For k = 1, 2, 3, . . ., finite additivity of µn on Fn shown in Step 6 implies


k k
!
X [
µn (Ri ) = µn Ri ≤ µn (R) ,
i=1 i=1

the inequality following from monotonicity of µn . Thus, countable additivity


would follow once it is shown that

X
µn (R) ≤ µn (Ri ) . (9.8)
i=1

Let R = (a1 , b1 ] × . . . × (ad , bd ] and for i = 1, 2, . . .,

Ri = (ai,1 , bi,1 ] × . . . × (ai,d , bi,d ] .

Fix δ > 0. Use Step 3 to get εi > 0 such that ∆R̃i F ≤ ∆Ri F + 2−i δ where

R̃i = (ai,1 , bi,1 + εi ] × . . . × (ai,d , bi,d + εi ] .

Fix a0i ∈ (ai , bi ) for i = 1, . . . , d. Since



[ ∞
[
[a01 , b1 ] × . . . × [a0d , bd ] ⊂ R = Ri ⊂ (ai,1 , bi,1 + εi ) × . . . × (ai,d , bi,d + εi ) ,
i=1 i=1

the Heine-Borel theorem implies


k
[
[a01 , b1 ] × . . . × [a0d , bd ] ⊂ (ai,1 , bi,1 + εi ) × . . . × (ai,d , bi,d + εi )
i=1

for some finite k. Letting R0 = (a01 , b1 ] × . . . × (a0d , bd ], it follows that


 
R0 ⊂ Ωn ∩ R̃1 ∪ . . . ∪ R̃k .

141
Monotonicity and finite sub-additivity of µn shown in Step 6 implies
k
X
µn (R0 ) ≤ µn (R̃i ∩ Ωn )
i=1
k
X
= ∆R̃i ∩Ωn F
i=1
k
X
(Step 2) ≤ ∆R̃i F
i=1
X∞
∆Ri F + 2−i δ

(choice of εi ) ≤
i=1

X
=δ+ µn (Ri ) .
i=1

Since δ is arbitrary, it follows that



X
µn (R0 ) ≤ µn (Ri ) .
i=1

Letting a01 ↓ a1 , . . . , a0d ↓ ad and using Step 4, (9.8) follows. This completes the
proof of Step 7.
Step 8. The set function µn can be extended to a measure on (Ωn , σ(Sn )).
Proof of Step 8. Follows from Step 7 and Corollary 1.1 of the Caratheodory
extension theorem.
Step 9. If X
µ(A) = µn (A ∩ Ωn ) , A ∈ B(Rd ) ,
n∈Zd

then µ is a Radon measure on (Rd , B(Rd )) satisfying


µ(R) = ∆R F , R ∈ H . (9.9)
d
Proof of Step 9. As µn is a measure on (Ωn , σ(Sn )) for each n ∈ Z by Step
8 and (Ωn : n ∈ Zd ) is a partition of Rd , µ defined above is a measure
on (Rd , B(Rd )). For R ∈ H, as R is bounded and non-empty, there exist
n1 , . . . , nk ∈ Zd such that R ∩ Ωni 6= ∅ for i = 1, . . . , k and R ⊂ Ωn1 ∪ . . . ∪ Ωnk .
Thus,
k
X
µ(R) = µni (R ∩ Ωni )
i=1
k
X
6 R ∩ Ωni ∈ Sni ) =
(∅ = ∆R∩Ωni F
i=1
(Step 1) = ∆R F ,

142
showing (9.9). To see that µ is Radon, for any compact set K ⊂ Rd , there exists
n ∈ N such that R = (−n, n]d ⊃ K. Thus

µ(K) ≤ µ(R) = ∆R F ,

by (9.9). This shows µ is a Radon measure and completes the proof of Step
9.
Step 10. The measure µ is the only measure on (Rd , B(Rd )) satisfying (9.9).
Proof of Step 10. Suppose µ0 is a measure on (Rd , B(Rd )) such that (9.9) holds
with µ replaced by µ0 . Then µ and µ0 agree on H, and hence on
( d
)
Y
d
S= R ∩ (ai , bi ] : −∞ ≤ ai ≤ bi ≤ ∞ ,
i=1

because for every set in S there exist sets in H increasing to the former. Further,
µ and µ0 are σ-finite on H and hence on S which is a semi-field that generates
B(Rd ). Corollary 1.1 shows µ and µ0 agree on B(Rd ), as claimed in Step 10.
Steps 9 and 10 complete the proof of the fact.
Remark 6. A function F satisfying (9.1) and (9.2) is not necessarily mono-
tonic. For example, F : R2 → R defined by

F (x, y) = xy ,

satisfies (9.1) and (9.2), and in fact induces the Lebesgue measure on R2 , though
F is not monotonic because

F (0, 0) = 0 < F (1, 1) = F (−1, −1) = 1 .

That is, x1 ≤ x2 and y1 ≤ y2 implies neither F (x1 , y1 ) ≤ F (x2 , y2 ) nor


F (x1 , y1 ) ≥ F (x2 , y2 ).

10 Solutions of selected problems


HW5/13
If X and Y are independent and either of them is a continuous random variable,
show that
X 6= Y a.s.
Soln.: Let Z = 1(X = Y ), that is, Z is a random variable which equals one if
X = Y and zero else. Assume that X is a continuous random variable, that is,

P (X = x) = 0 for all x ∈ R . (10.1)

143
Applying Theorem 5.7 to the function f : R2 → R defined by f (x, y) = 1(x = y)
and using the independence of X and Y yields

E (Z|Y ) = g(Y ) , (10.2)

where Z ∞
g(y) = 1(x = y)P (X ∈ dx) , y ∈ R .
−∞
A moment’s thought reveals that for all y ∈ R,

g(y) = P (X = y) = 0 ,

(10.1) implying the second equality. Thus g(Y ) = 0. Taking expectation of both
sides of (10.2) and using the tower property of conditional expectation show

E(Z) = 0 ,

which is the same as P (X = Y ) = 0 and thus proves the desired claim.

HW 5/20
Suppose that X and Y are integrable random variables defined on a probability
space (Ω, A, P ) such that

E (X|σ(Y )) = Y, a.s. ,
E (Y |σ(X)) = X, a.s.

Show that X = Y a.s.


Soln.: Before proceeding with the solution, it is advisable to show under the
additional assumptions of E(X 2 ) < ∞ and E(Y 2 ) < ∞, that E (X − Y )2 = 0
and hence X = Y a.s. This is much easier and shorter than the solution below.
Now suppose X and Y have finite mean and we do not assume anything
more than what is given. For any c ∈ R, the tower property of conditional
expectation shows
 
E (Y 1(X ≤ c)) = E E Y 1(X ≤ c) σ(X)
 
(using Theorem 5.3) = E 1(X ≤ c)E Y σ(X)
 
E Y |σ(X) = X = E (X1(X ≤ c)) .

Thus,

0 = E ((Y − X)1(X ≤ c))


= E ((Y − X)1(X ≤ c, Y ≤ c)) + E ((Y − X)1(X ≤ c < Y )) . (10.3)

Therefore,

E ((Y − X)1(X ≤ c, Y ≤ c)) = −E ((Y − X)1(X ≤ c < Y ))


≤ 0,

144
because (Y − X)1(X ≤ c < Y ) ≥ 0.
Reversing the roles of X and Y , it can be shown that

E ((X − Y )1(X ≤ c, Y ≤ c)) ≤ 0 ,

and hence it follows that

E ((Y − X)1(X ≤ c, Y ≤ c)) = 0 .

Thus (10.3) shows


E ((Y − X)1(X ≤ c < Y )) = 0 .
Since the random variable inside the expectation is non-negative, it is zero a.s.,
that is,
P (X ≤ c < Y ) = 0 , c ∈ R .
Therefore
 
[ X
P (X < Y ) = P  [X ≤ c < Y ] ≤ P (X ≤ c < Y ) = 0 .
c∈Q c∈Q

Another role reversal shows P (Y < X) = 0 which in combination with the


above proves X = Y a.s.

Exc 4.9
If X1 , . . . , Xn are i.i.d. from standard normal, P is an m × n matrix with 1 ≤
m < n and P P T = Im , and

(Y1 , . . . , Ym ) = Y = P X ,

show that
m
X
Yi2 ∼ χ2m ,
i=1
n
X m
X
Xi2 − Yi2 ∼ χ2n−m ,
i=1 i=1

and
m
X n
X m
X
Yi2 , Xi2 − Yi2 are independent.
i=1 i=1 i=1
T
Soln.: The assumption P P = Im means that the m rows of P form an
orthonormal set in Rn . This orthonormal set can be extended to an orthonormal
basis of Rn . In other words, there exists an (n − m) × n matrix Q such that
P
R = [Q ] is an orthogonal matrix, that is RRT = I.
Define
[Z1 Z2 . . . Zn ]T = RX .

145
Since R is an orthogonal matrix, Z1 , . . . , Zn are i.i.d. from standard normal.
The definition of R implies Y1 = Z1 , . . . , Ym = Zm . Thus,
m
X m
X
Yi2 = Zi2 ∼ χ2m ,
i=1 i=1

and for the same reason,


n
X n
X m
X
Zi2 = Zi2 − Yi2
i=m+1 i=1 i=1
n
X Xm
(R is an orthogonal matrix) = Xi2 − Yi2 .
i=1 i=1
Pn m
Xi2 − i=1 Yi2
P
Thus, i=1P is a function of (Zm+1 , . . . , Zn ), and hence is inde-
m 2
pendent of i=1 Yi . That Zm+1 , . . . , Zn are i.i.d. from N (0, 1) implies that
n
X m
X
Xi2 − Yi2 ∼ χ2n−m .
i=1 i=1

Exc. 5.3
A random variable X is infinitely divisible if for all fixed n = 1, 2, . . ., there
exist i.i.d. random variables Xn1 , . . . , Xnn defined on some probability space
such that
d
X = Xn1 + . . . + Xnn .
If X is an infinitely divisible random variable with mean zero and variance one,
show that
E(X 4 ) ≥ 3 .
Soln.: Assume WLOG that
E(X 4 ) < ∞ , (10.4)
as there is nothing to prove otherwise. Fix n ≥ 2 and write
d
X = Xn1 + Y ,

where Y = Xn2 + . . . + Xnn . Use Exc 5.2 with p = 4, (10.4) and the fact that
Xn1 is independent of Y to infer
4
E(Xn1 ) < ∞.
d d
Since Xn1 = . . . = Xnn , it follows that
4
E(Xni ) < ∞ , i = 1, . . . , n .

An immediate consequence of the above is that Xn1 has finite mean and
variance. Further,
0 = E(X) = nE(Xn1 ) ,

146
shows E(Xn1 ) = 0. That Xn1 , . . . , Xnn are i.i.d. shows

1 = Var(X) = nVar(Xn1 ) ,
2 1
that is, E(Xn1 )= n.Proceed like in (6.3)-(6.4) to obtain
 !4 
Xn
E(X 4 ) = E  Xni 
i=1

4 2
2
= nE(Xn1 ) + 3n(n − 1) E(Xn1 ) (10.5)
2
2
≥ 3n(n − 1) E(Xn1 )
n−1
=3 .
n
Letting n → ∞ shows E(X 4 ) ≥ 3.

Exc 6.4
If Xn → Y in Lp for some 1 ≤ p < ∞ and Xn → Z a.s., show that

Y = Z a.s.
P
Soln.: Since Xn → Y in Lp , it follows that Xn −→ Y . As Xn → Z a.s.,
P
Xn −→ Z. Thus, Y and Z are both limits in probability of Xn . Hence Y = Z
a.s.

Exc 6.12
If X1 , X2 , X3 , . . . are independent random variables, then show that

X
Xn → X a.s. ⇐⇒ P (|Xn − X| > ε) < ∞ for all ε > 0 .
n=1

Show that if the above holds, then X is a degenerate random variable.


Soln.: The “⇐ part” is a restatement of Theorem 6.8. For the “⇒ part”,
assume
Xn → X a.s.
WLOG, we can take
X = lim sup Xn .
n→∞

Thus, X is σ(Xn , Xn+1 , Xn+2 , . . .)-measurable for every n. That is, X is mea-
surable with respect to

\
T = σ(Xn , Xn+1 , Xn+2 , . . .) .
n=1

147
Kolmogorov’s zero-one law implies T is a trivial σ-field, and hence X is a de-
generate random variable. That is, there exists a ∈ R with X = a a.s.
The assumption thus becomes

Xn → a a.s.

Therefore, for any ε > 0,

P (|Xn − a| > ε for infinitely many n) = 1 .

The second Borel-Cantelli lemma, in view of the independence of X1 , X2 , . . .


implies
X∞
P (|Xn − a| > ε) < ∞ .
n=1

This proves the “⇒ part”.


It has already been shown that if Xn → X a.s., then X is degenerate.

Exc 6.14
If X1 , X2 , . . . are i.i.d. from standard exponential, and X(n,1) , . . . , X(n,n) are the
order statistics of X1 , . . . , Xn , then show that

X(n,[n/2]) → log 2 a.s., as n → ∞ .

Soln.: For fixed 0 < ε < log 2, observe that

X(n,[n/2]) ≤ log 2 − ε ⇐⇒ at least [n/2] many of X1 , . . . , Xn are ≤ log 2 − ε .

Denoting
Zi = 1(Xi ≤ log 2 − ε) , i = 1, 2, . . . ,
it thus follows that
n
!
 X
P X(n,[n/2]) ≤ log 2 − ε = P Zi ≥ [n/2]
i=1
n
!

  X
−(log 2−ε)
µ = E(Z1 ) = 1 − e =1− =P (Zi − µ) ≥ [n/2] − nµ
2 i=1
n
!
1X [n/2]
=P (Zi − µ) ≥ −µ .
n i=1 n

Since ε > 0,
1 [n/2]
µ< = lim ,
2 n→∞ n

148
[n/2]
µ< n for large n. For such n,
n
!
1X [n/2]
P (Zi − µ) ≥ −µ
n i=1 n
 !4  
n 4
1 X [n/2]
≤P (Zi − µ) ≥ −µ 
n i=1 n
 !4 
 −4 n
[n/2] X
≤ −µ n−4 E  (Zi − µ)  (Markov inequality)
n i=1
 −4
[n/2]
n−4 nE (Z1 − µ)4 + 3n(n − 1)(Var(Z1 ))2 ,
 
= −µ
n

(6.3)-(6.4) being used in the last line. As n → ∞,


 −4  −4
[n/2] 1
−µ → −µ .
n 2

Since E (Z1 − µ)4 < ∞, we get
n
!
2 1X [n/2]
lim sup n P (Zi − µ) > − µ < ∞,
n→∞ n i=1 n

which shows

X 
P X(n,[n/2]) ≤ log 2 − ε < ∞ .
n=1

A similar calculation shows



X 
P X(n,[n/2]) ≥ log 2 + ε < ∞ .
n=1

The above two inequalities in conjunction with Theorem 6.8 show that

X(n,[n/2]) → log 2 , a.s.

Exc 6.15
1. If X1 , X2 , . . . are independent and αn → ∞ such that
n
1 X P
Xi −→ X as n → ∞ ,
αn i=1

show that X is a degenerate random variable.

149
2. Hence or otherwise, prove that if X1 , X2 , . . . are i.i.d. from standard nor-
mal, then there does not exist a random variable Z such that
n
1 X P
√ Xi −→ Z , n → ∞ .
n i=1

Soln.:

1. Since
n
1 X P
Xi −→ X as n → ∞ ,
αn i=1
there exist 1 ≤ n1 < n2 < . . . such that as k → ∞,
nk
1 X
Xi → X a.s.
αnk i=1

WLOG, assume
nk
1 X
X = lim sup Xi .
k→∞ αnk i=1
Since αn → ∞, for any fixed n = 1, 2, . . .,
n−1
1 X
Xi → 0 , k → ∞ .
αnk i=1

Hence
nk
1 X
X = lim sup Xi ,
k→∞ αnk i=n

showing that X is σ(Xn , Xn+1 , . . .)-measurable. As this is true for all n,


X is measurable with respect to

\
σ(Xn , Xn+1 , . . .) .
n=1

Kolmogorov’s zero-one law shows X is degenerate.


2. For the sake of contradiction, assume there exists a random variable Z
such that
n
1 X P
√ Xi −→ Z , n → ∞ .
n i=1
It follows from 1. that Z is degenerate. Which means
n
1 X
√ Xi
n i=1

150
converges in probability and hence in distribution to a degenerate random
variable. This is a contradiction because
n
1 X
√ Xi ∼ N (0, 1) for n = 1, 2, . . . ,
n i=1

and hence it converges in law to a standard normal random variable.

Exc 7.5
Use Corollary 7.3 to give a fifth proof of the fact
Z ∞ √
2
e−x /2 dx = 2π .
−∞

Soln.: Suppose for this exercise that we didn’t know the value of
Z ∞
2
e−x /2 dx .
−∞

It is easy to see that the above integral is finite; say


Z ∞
2
c= e−x /2 dx .
−∞

Thus,
1 −x2 /2
f (x) = e ,x ∈ R,
c
is a density.
Let µ be the probability measure on R whose density is f . Then, the MGF
of µ is
Z ∞
ψ(t) = etx f (x) dx
−∞
1 ∞ tx−x2 /2
Z
= e dx
c −∞
Z ∞
1 2 2
= et /2 e−(x−t) /2 dx
c −∞
2
= et /2
.

The analytic continuation of ψ to C is


2
ψ̃(z) = ez /2
,z ∈ C.

Theorem 7.3 shows the CHF of µ is


2
φ(t) = e−t /2
,t ∈ R.

151
Since µ is a probability measure with a continuous density f and the CHF
φ of µ is integrable on R, Corollary 7.3 implies
Z ∞
eιtx φ(t) dt = 2πf (x) , x ∈ R .
−∞

Putting x = 0 yields
Z ∞
2 2π
e−t /2
dt = 2πf (0) = .
−∞ c
Since the extreme left hand side equals c, the above implies

c= ,
c

that is, c = 2π. This completes the solution.

Exc 7.6
Show that a Nd (µ, Σ) distribution has a density if and only if Σ is p.d.
Soln.: If Σ is p.d., then it is known that the density of Nd (µ, Σ) is
 
−1/2 1
f (x) = (2π)d det(Σ) exp − (x − µ)T Σ−1 (x − µ) , x ∈ Rd .
2
If X ∼ Nd (µ, Σ) and Σ is not p.d., then there exists λ ∈ Rd \ {0} with
λT Σλ = 0 .
In other words, Var(λT X) = 0. Thus,
X ∈ {x ∈ Rd : λT x = λT µ} a.s.
Since {x ∈ Rd : λT x = λT µ} is a set of zero Lebesgue measure in Rd , X cannot
have a density.

Exc 8.14
Show that the Lindeberg condition (8.29) is implied by the Lyapunov condition
n
X
E |Xni |2+δ = 0 for some δ > 0 .

lim
n→∞
i=1

Soln.: Notice that for any random variable Z and ε, δ > 0,


E |Z|2+δ ≥ E |Z|2+δ 1(|Z| > ε)
 

≥ εδ E |Z|2 1(|Z| > ε) .



(10.6)
Thus,
n
X n
X
2
1(|Xnj | > ε ≤ ε−δ E |Xni |2+δ .
 
E Xnj
j=1 i=1
Therefore, the Lyapunov condition implies the Lindeberg condition.

152
Remark 7. Convince yourself, using a variant of Example 6.2, that the Lya-
punov condition is not implied by the Lindeberg condition.

Exc 8.16
Suppose X is as in Exc 5.3, that is, it is infinitely divisible, E(X) = 0 and
Var(X) = 1. Show that

E(X 4 ) = 3 ⇐⇒ X ∼ N (0, 1) .

Soln.: The “⇒ part” is the only one which needs a proof, as we know that the
fourth moment of standard normal is 3. Assume E(X 4 ) = 3. For n ≥ 1, let
Xn1 , . . . , Xnn be i.i.d. such that
d
X = Xn1 + . . . + Xnn .

Clearly, Xni has mean zero and variance 1/n for n = 1, 2, . . . and i = 1, . . . , n.
Recall (10.5) to conclude
 
4 1 n−1
E(Xn1 )= E(X 4 ) − 3
n n
3
= 2.
n
Thus,
n
X
4 3
E(Xni )= → 0,n → ∞.
i=1
n

Exc 8.14 shows that the conditions of Lindberg CLT hold with σ 2 = 1. There-
fore,
Xn
Xni ⇒ Z ,
i=1
d
where Z ∼ N (0, 1). It is obvious that Z = X, that is, X follows standard
normal. This proves the “⇒ part”.

Exc 8.17
A coin with probability of head p ∈ (0, 1) is tossed infinitely many times. Let
Xn be the number of the toss on which the n-th head is obtained. Show that
 
n
n−1/2 Xn − ⇒Z,
p

where Z ∼ N (0, σ 2 ) for some σ 2 . Calculate σ 2 .


Soln.: Define X0 = 0 and

Zi = Xi − Xi−1 , i ∈ N .

153
Thus for n = 1, 2, . . ., Xn = Z1 + . . . + Zn . For n ≥ 1 and k1 , . . . , kn ∈ N, the
event [Z1 = k1 , . . . , Zn = kn ] simply means that in the first k1 + . . . + kn tosses,
the heads occur at the tosses number k1 , k1 + k2 , . . . , k1 + . . . + kn and rest are
tails. Thus,

P (Z1 = k1 , . . . , Zn = kn ) = pn q k1 +...+kn −n , k1 , . . . , kn ∈ N ,

where q = 1 − p. In other words,


n
Y
P (Z1 = k1 , . . . , Zn = kn ) = pq ki −1 , k1 , . . . , kn ∈ N .
i=1

Since

X p
pq k−1 = = 1,
1−q
k=1

Theorem 4.7 shows that Z1 , . . . , Zn are i.i.d. with PMF

P (Z1 = k) = pq k−1 , k ∈ N ,

that is, they follow Geometric(p). Since


1
E(Z1 ) = ,
p
q
Var(Z1 ) = 2 ,
p
the CLT shows
n  
−1/2
X 1
n Zi − ⇒Z,
i=1
p

where Z ∼ N (0, σ 2 ) and


q
σ2 = ,
p2
which is the same as  
−1/2 n
n Xn − ⇒Z.
p

Exc 8.18
There are k boxes numbered 1, . . . , k and an infinite supply of balls. The balls
are thrown, one by one, randomly into one of the boxes. Let Xn1 , . . . , Xnk
denote the number of balls in Boxes 1, . . . , k, respectively, after the first n balls
are thrown. Show that
 n n
n−1/2 Xn1 − , . . . , Xnk − ⇒ (Z1 , . . . , Zk ) ,
k k
where (Z1 , . . . , Zk ) ∼ Nk (0, Σ) for some k × k matrix Σ. Calculate Σ.

154
Soln.: Define

Yij = 1 (i-th ball goes into Box number j) , i = 1, 2, . . . , j = 1, 2, . . . , k .

Letting Yi = (Yi1 , . . . , Yik ) (interpreted as a k ×1 column vector) for i = 1, 2, . . .,


it is immediate that the Rk -valued random variables Y1 , Y2 , . . . are i.i.d. Further,
1
E(Y1i ) = , i = 1, . . . , k ,
k
1 1
Var(Y1i ) = − 2 , i = 1, . . . , k ,
k k
1
Cov(Y1i , Y1j ) = − 2 , 1 ≤ i < j ≤ k .
k
Since
n
X
Xnj = Yij , n = 1, 2, . . . ,
i=1

Theorem 8.9 shows


" Xn1 − n # n
" #! "Z #
k
1 1. 1
. ⇒ ... , n → ∞ ,
X
−1/2 −1/2
n .. =n Yi − ..
X −n i=1
k 1 Z
nk k k

where (Z1 , . . . , Zk ) ∼ Nk (0, Σ) and Σ = ((σij ))1≤i,j≤k is defined by


(
1
− 12 , i = j ,
σij = k 1 k (10.7)
− k2 , i 6= j .

Remark 8. The matrix Σ defined in (10.7) is n.n.d. but not p.d.

Exc 8.22
Suppose Xn are random variables with all moments finite such that

lim E(Xnk ) = mk ∈ R , k ∈ {1, 2, . . .} .


n→∞

If there exists a unique probability measure µ on R such that


Z
xk µ(dx) = mk , k = 1, 2, . . . ,
R

show that Xn ⇒ X where P ◦ X −1 = µ.


Soln.: Let
µn = P ◦ Xn−1 , n ≥ 1 .
The claim to be proved is µn ⇒ µ, which would follow once it is shown that
every subsequence of {µn } has a further subsequence which converges weakly
to µ. For proceeding towards that, we shall first show that {µn : n = 1, 2, . . .}
is tight.

155
The hypothesis is that
Z
lim xk µn (dx) = mk , k ≥ 1 . (10.8)
n→∞ R

The above with k = 2 implies


Z
K = sup x2 µn (dx) < ∞ .
n≥1 R

Thus, for T > 0 and n = 1, 2, . . .,


Z
1 K
µn ({x : |x| > T }) = µn {x : x2 > T 2 } ≤ 2 x2 µn (dx) ≤

.
T R T2
q
Therefore, for ε > 0, take T = K ε to get that

sup µn {x : |x| > T } ≤ ε ,


n≥1

which is equivalent to
inf µn ([−T, T ]) ≥ 1 − ε .
n≥1

Since [−T, T ] is a compact set, tightness of {µn } follows.


To show µn ⇒ µ, we shall proceed as mentioned, that is, show that every
subsequence has a further subsequence which converges weakly to µ. Fix a
subsequence {µni : i = 1, 2, . . .} of {µn }. Since {µn : n = 1, 2, . . .} is tight, so is
{µni : i = 1, 2, . . .}. Theorem 8.7 shows that {µni : i = 1, 2, . . .} has a further
subsequence {µnij : j = 1, 2, . . .} such that

µnij ⇒ ν , j → ∞ , (10.9)

for some probability measure ν on R. The claim would follow once it is shown
that ν = µ. Showing that m1 , m2 , m3 , . . . are the moments of ν would suffice for
that because the hypothesis includes that µ is the unique measure with those
as moments. The first step, however, is to show that all the moments of ν are
finite, towards which we shall now proceed.
To show that all moments of ν are finite, it suffices to prove it for even
moments, that is, for k = 1, 2, . . .,
Z ∞
x2k ν(dx) < ∞ .
−∞

An implication of (10.9) is
Z Z
lim f (x)µnij (dx) = f (x)ν(dx) ,
j→∞ R R

156
for all bounded continuous f : R → R. Fix k = 1, 2, . . . and 0 < T < ∞ and
derive from the above that
Z Z
2k 2k
(|x| ∧ T ) ν(dx) = lim (|x| ∧ T ) µnij (dx)
R j→∞ R
Z
≤ lim x2k µnij (dx)
j→∞ R
= m2k ,

(10.8) implying the last line. Since for all x ∈ R,


2k
0 ≤ (|x| ∧ T ) ↑ x2k , T → ∞ ,

MCT implies
Z Z
2k 2k
x ν(dx) = lim (|x| ∧ T ) ν(dx) ≤ m2k < ∞ .
R T →∞ R

Thus, the even moments of ν are finite, and hence so are odd moments as well.
In the next step, we shall show that for k = 1, 2, . . ., the k-th moment of ν
is mk . Fix k = 1, 2, . . . and define for all T > 0, fT : R → R by

k
x ,
 |x| ≤ T ,
k
fT (x) = (−T ) , x < −T ,

 k
T , x>T.

Since fT is bounded and continuous, (10.9) implies


Z Z
lim fT (x)µnij (dx) = fT (x)ν(dx) , T > 0 . (10.10)
j→∞ R R

Since |fT (x)| ≤ |x|k and |x|k is integrable with respect to ν, DCT shows that
Z Z
lim fT (x)ν(dx) = xk ν(dx) .
T →∞ R R

Let ε > 0 be arbitrary. Use the above to get T0 such that


Z Z
fT (x)ν(dx) − xk ν(dx) ≤ ε , T ≥ T0 . (10.11)
R R

In order to infer from (10.10) that the k-th moment of µnij converges to
that of ν, an estimate of the above form with ν replaced by µnij is also needed,
uniformly over j. To that end, observe that for any x ∈ R and T > 0,

fT (x) − xk = fT (x) − xk 1(|x| > T )


≤ |fT (x)| + |x|k 1(|x| > T )


≤ 2|x|k 1(|x| > T ) .

157
Thus, for n = 1, 2, . . .,
Z Z
k
fT (x) − x µn (dx) ≤ 2 |x|k µn (dx)
R {x:|x|>T }
Z
≤ 2T −k x2k µn (dx) ,
R

the last line following by an argument similar to (10.6). Since


Z
C = sup x2k µn (dx) < ∞ ,
n≥1 R

by (10.8), it follows that


Z
1/k
fT (x) − xk µn (dx) ≤ ε for all n = 1, 2, . . . , T ≥ 2Cε−1 .
R

Thus (10.10) and (10.11), in conjunction with the above and an approximation
1/k
of xk by fT (x) for T = T0 ∨ 2Cε−1 , imply that
Z Z
lim sup xk µnij (dx) − xk ν(dx) ≤ 2ε .
j→∞ R R

Since ε is arbitrary, it follows that


Z Z
lim xk µnij (dx) = xk ν(dx) .
j→∞ R R

Recalling (10.8), the above implies that


Z
xk ν(dx) = mk , k ≥ 1 .
R

Since µ is the unique probability measure whose k-th moment is mk for k =


1, 2, . . ., it follows that ν = µ; (10.9) thus shows that µ is the weak limit of µnij ,
which was all that was required to complete the argument.

References
Billingsley, P. (1995). Probability and Measure. Wiley, New York, 3rd edition.

Rudin, W. (1987). Real and Complex Analysis. McGraw Hill Book Company,
third edition.

158

You might also like

pFad - Phonifier reborn

Pfad - The Proxy pFad of © 2024 Garber Painting. All rights reserved.

Note: This service is not intended for secure transactions such as banking, social media, email, or purchasing. Use at your own risk. We assume no liability whatsoever for broken pages.


Alternative Proxies:

Alternative Proxy

pFad Proxy

pFad v3 Proxy

pFad v4 Proxy