
MA901: Fundamental Tools

(Probability Theory)

Based on lecture notes by

Martin Herdegen

Lectured by

Osian Shelley

The University of Warwick

Department of Statistics

This version: September 9, 2021


Contents

Foreword

1 Fundamental concepts of Probability Theory
1.1 Probability spaces
1.2 Random variables
1.3 Distribution of random variables: univariate case
1.4 Distribution of random variables: multivariate case
1.5 Conditional probabilities
1.6 Independence
1.7 Expectation
1.8 Lp-spaces
1.9 Variance and covariance
1.10 The inequalities of Markov and Jensen
1.11 Product spaces and Fubini’s theorem

2 Sequences of random variables and limit theorems
2.1 Convergence of random variables
2.2 Uniform integrability and the dominated convergence theorem
2.3 The laws of large numbers
2.4 The law of the iterated logarithm and the central limit theorem

Foreword
Fundamental Tools week is a preparatory course to refresh your knowledge of probability
theory, and introduce you to the style of questions you are likely to encounter in courses
containing mathematics at Warwick.

These notes may seem more formal than those of other courses you have taken before.
Do not let this concern you too much, as most of the course will involve calculation using the
formal results laid out here. However, you are expected to know and understand the proofs
that are presented here.

You should be aware that the material covered here is considered to be basic mathematical
content which, hopefully, you will have encountered before. Importantly, if you have not
covered a topic before, then these notes should not serve as the means to learn about it – the
material presented here is a brief summary of these topics and is not meant to be exhaustive.
Ideally, you should consult other textbooks which will provide far more detail.

Structure and Assessment


The course is held over a week, with twelve lectures on probability theory. As such, we will
not be able to cover all of the material, so it is important you have read through these notes.
You will have the chance to attend seminars and work through various questions on the main
topics of the course.

There is an examination at the end of the week which broadly assumes the knowledge you
will find here, along with the example sheets completed in seminars. As a general guide, if
you understand most of the material here and have completed all of the questions from the
assignment sheets then you will not have a problem passing the exam.

Course webpage
There is a webpage for this course, which you can find on myWBS (search MA901).

It will contain a copy of these notes, as well as assignment sheets as they are set through the
week. No solution sheets will be provided for the questions covered in class.

Finally, please note that these notes have been in place since 2020/2021, and differ substantially
from previous years. As such, they might contain errors and typos, although we believe that
the mathematical content itself is correct. If you do find errors, then please contact us so that
they may be corrected. You can always find an up-to-date copy of the notes on the course
webpage.

1 Fundamental concepts of Probability Theory
In this chapter, we study some fundamental concepts from Measure Theory and Probability
Theory that are foundational for various applications in Mathematical Finance.

1.1 Probability spaces


First we look at the most basic object in Probability Theory, a probability space. This has
three components. The first component is a sample space.

Definition 1.1. A sample space Ω is a (finite or infinite) set.

Each ω ∈ Ω describes a possible “state of the world”. Key examples are Ω = {ω1 , . . . , ωN } for
N ∈ N (finite sample space), Ω = {ω1 , ω2 , . . .} (countable sample space), and Ω = R.

The second component of a probability space is a σ-algebra.

Definition 1.2. Given a sample space Ω, a σ-algebra F on Ω is a collection of subsets of Ω


such that

(1) Ω ∈ F;

(2) A ∈ F =⇒ Ac = Ω \ A ∈ F;
(3) A1 , A2 , . . . ∈ F =⇒ ⋃n∈N An ∈ F.

The pair (Ω, F) is called a measurable space and each A ∈ F is called F-measurable or an
F-measurable event.

Example 1.3. Let F be a σ-algebra on Ω. Then

(a) the empty set ∅ is in F. Indeed, this follows from (1) and (2) via ∅ = Ω \ Ω.
(b) if A1 , A2 , . . . ∈ F, then the countable intersection ⋂n∈N An is in F. Indeed, this follows
from the de Morgan laws,¹ (2), and (3) via ⋂n∈N An = (⋃n∈N Anᶜ)ᶜ.

(c) if A1 , . . . , An ∈ F, then the finite union A1 ∪ A2 ∪ . . . ∪ An and the finite intersection


A1 ∩ A2 ∩ . . . ∩ An are in F. Indeed, this follows from (a) and (3) and (a) and (b),
respectively, by setting An+1 , An+2 , . . . := ∅.

(d) If A, B ∈ F, then A \ B and B \ A are in F. Indeed, this follows from (2) and (c) via
A \ B = A ∩ B c and B \ A = B ∩ Ac .
¹ The de Morgan laws say that if (An )n∈N is any collection of subsets of Ω, then (⋃n∈N An )ᶜ = ⋂n∈N Anᶜ
and (⋂n∈N An )ᶜ = ⋃n∈N Anᶜ.

(e) if A1 , A2 , . . . ∈ F, then lim supn→∞ An := ⋂n∈N ⋃k≥n Ak is in F. Indeed, this follows
from (3) and (b).

(f) if A1 , A2 , . . . ∈ F, then lim inf n→∞ An := ⋃n∈N ⋂k≥n Ak is in F. Indeed, this follows
from (b) and (3).

Remark 1.4. Note that if A1 , A2 , . . . ∈ F, then

lim inf n→∞ An = {ω ∈ Ω : #{n ∈ N : ω ∉ An } < ∞} =: {An eventually},
lim supn→∞ An = {ω ∈ Ω : #{n ∈ N : ω ∈ An } = ∞} =: {An infinitely often}.

Clearly, the limes inferior is the event where eventually all of the An occur. On the other
hand, the limes superior is the event where infinitely many of the An occur. In particular,
lim inf n→∞ An ⊂ lim supn→∞ An .

If Ω is finite (or countable), i.e., Ω = {ω1 , . . . , ωN } (or Ω = {ω1 , ω2 , . . .}), then the usual choice
for a σ-algebra on Ω is F := 2Ω := {A : A ⊂ Ω}, the power set of Ω.

If Ω is uncountable, e.g. Ω = R, it turns out that the power set 2Ω is “too big” to be chosen as
a σ-algebra.² For this reason, one uses instead the following procedure: one chooses a generator
A of “good subsets” of Ω that one wants to be measurable. One then denotes by σ(A) the
smallest σ-algebra on Ω that contains A. It is called the σ-algebra generated by A. It is
explicitly given by

σ(A) = ⋂ { G : G a σ-algebra on Ω with A ⊂ G }.

Note that different generators A may generate the same σ-algebra; one usually chooses a
“small” generator or a generator with “good properties”.

Example 1.5. (a) If Ω = R, one wants all open sets OR in R to be measurable. One then
sets BR := σ(OR ) and calls this the Borel σ-algebra on R. One can show that BR contains all
closed and open sets in R and that it is also generated by the generator A, where A contains
all (closed) sets of the form (−∞, a] for a ∈ R.³

(b) More generally, if Ω is a subset of Rd for d ≥ 1, one denotes the open sets in Ω by OΩ ,
and sets BΩ := σ(OΩ ) and calls this the Borel σ-algebra on Ω.4

The third component of a probability space is a probability measure.


² See [1, Theorem 1.5] for a precise formulation of this statement.
³ See Exercise Sheet 1-6.
⁴ Even more generally, if (Ω, τ ) is a topological space, one sets BΩ := σ(τ ), and calls this the Borel σ-algebra
on Ω.

Definition 1.6. Given a measurable space (Ω, F), a probability measure P on (Ω, F) is a map
F → [0, 1] such that

(1) P [∅] = 0 and P [Ω] = 1;


(2) A1 , A2 , . . . ∈ F with Ai ∩ Aj = ∅ for i ≠ j =⇒ P [⋃n∈N An ] = ∑∞n=1 P [An ].

The triple (Ω, F, P ) is called a probability space.

Remark 1.7. The properties (1) and (2) in Definition 1.6 are called the axioms of Kolmogorov
after the Russian mathematician A. Kolmogorov (1903 - 1987). Property (2) is referred to as
σ-additivity, where the σ stands for “countable”.

Example 1.8. (a) Let (Ω, F) be a measurable space and suppose that F contains all elementary
events, i.e., {ω} ∈ F for all ω ∈ Ω. Fix ω∗ ∈ Ω and define the map P : F → [0, 1] by

P [A] = 1 if ω∗ ∈ A, and P [A] = 0 if ω∗ ∉ A.

One can check that P is a probability measure on (Ω, F). It is called the Dirac measure for ω ∗
and often denoted by δω∗ . The Dirac measure for ω ∗ models the “deterministic case”, where
we know with probability 1 that the state of the world will be ω ∗ .5

(b) Let Ω = {ω1 , . . . , ωN } for some N ∈ N and F = 2Ω . Define the map P : F → [0, 1] by

P [A] = |A| / |Ω|,

where |A| denotes the number of elements in A. One can check that P is a probability
measure on (Ω, F). It is called the discrete uniform distribution on Ω. For N = 2 (and
ω1 = H and ω2 = T ), this is a good model for the flipping of a fair coin, and for N = 6 (and
ω1 = 1, . . . , ω6 = 6), this is a good model for the rolling of a fair die.

Example 1.9. A poker hand consists of 5 cards. If the cards have distinct consecutive values
and are not all of the same suit, we say that the hand is a straight. What is the probability
of being dealt a straight?

Let our sample space Ω consist of all possible poker hands and A ∈ 2Ω the event that we are
dealt a straight. Note that the number of possible outcomes for which the poker hand consists
of an ace, two, three, four and five is 4^5. In 4 of these cases, the suits must be identical,
so 4^5 − 4 of these hands are straights. Thus, there are 10(4^5 − 4) hands that are straights.
Assuming all poker hands are equally likely, we use the discrete uniform distribution on Ω to
deduce that

P [A] = |A| / |Ω| = 10(4^5 − 4) / C(52, 5) ≈ 0.0039,

where |Ω| = C(52, 5) = 2,598,960 is the number of possible 5-card hands.

⁵ Note that unless Ω = {ω∗}, it is false to say that we are sure that the state of the world will be ω∗; we
can only say that the state of the world will be ω∗ P -almost surely; cf. Definition 1.11 below.
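The count above is easy to double-check numerically. Below is a short Python sketch (not part of the notes) that recomputes the same formula with math.comb:

```python
from math import comb

# Number of 5-card hands dealt from a standard 52-card deck.
total_hands = comb(52, 5)  # 2,598,960

# Ten runs of five consecutive ranks (A-5 up to 10-A); each card in a run
# may take any of 4 suits, minus the 4 all-same-suit (straight flush) choices.
num_straights = 10 * (4**5 - 4)

p_straight = num_straights / total_hands
print(f"P[straight] = {num_straights}/{total_hands} ≈ {p_straight:.4f}")
```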

The following result collects some fundamental rules for calculating probabilities.

Proposition 1.10. Let (Ω, F, P ) be a probability space.

(a) If A ∈ F, then
P [Ac ] = P [Ω \ A] = 1 − P [A].

(b) If A ⊂ B ∈ F, then
P [B] = P [A] + P [B \ A] ≥ P [A]. (1.1)

(c) If A, B ∈ F, then

P [A ∪ B] = P [A] + P [B] − P [A ∩ B] ≤ P [A] + P [B].

(d) If A1 ⊂ A2 ⊂ · · · ∈ F, then

P [⋃n∈N An ] = limn→∞ P [An ].

(e) If A1 ⊃ A2 ⊃ · · · ∈ F, then

P [⋂n∈N An ] = limn→∞ P [An ].

(f) If A1 , A2 , . . . ∈ F, then

P [⋃n∈N An ] ≤ ∑∞n=1 P [An ].

Proof. We only prove parts (a), (d) and (f). The other parts are left as an exercise.

(a) Set B1 := A, B2 := Aᶜ, B3 := ∅, B4 := ∅, . . . Then B1 , B2 , . . . ∈ F with Bi ∩ Bj = ∅ for
i ≠ j. Moreover, ⋃n∈N Bn = Ω. Hence, σ-additivity of P together with P [∅] = 0 and
P [Ω] = 1 gives

1 = P [Ω] = P [⋃n∈N Bn ] = ∑n∈N P [Bn ] = P [A] + P [Aᶜ] + 0 = P [A] + P [Aᶜ].

Rearranging yields (a).

(d) Set B1 := A1 , B2 := A2 \ A1 , B3 := A3 \ A2 , . . . Then B1 , B2 , . . . ∈ F with Bi ∩ Bj = ∅ for
i ≠ j. Moreover, ⋃nk=1 Bk = An for all n ∈ N and ⋃k∈N Bk = ⋃k∈N Ak . Hence, σ-additivity
of P yields

P [⋃k∈N Ak ] = P [⋃k∈N Bk ] = ∑∞k=1 P [Bk ] = limn→∞ ∑nk=1 P [Bk ] = limn→∞ P [⋃nk=1 Bk ] = limn→∞ P [An ].

(f) Set B1 := A1 , B2 := A1 ∪ A2 , B3 := A1 ∪ A2 ∪ A3 , . . . Then B1 ⊂ B2 ⊂ · · · ∈ F and
⋃k∈N Bk = ⋃k∈N Ak . Moreover, P [Bn ] ≤ ∑nk=1 P [Ak ] by repeated application of part (c).
Hence, by part (d),

P [⋃k∈N Ak ] = P [⋃k∈N Bk ] = limn→∞ P [Bn ] ≤ limn→∞ ∑nk=1 P [Ak ] = ∑∞k=1 P [Ak ].
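For a finite sample space carrying the discrete uniform distribution of Example 1.8(b), the rules of Proposition 1.10 reduce to counting and can be verified directly. A minimal Python sketch (the concrete events below are arbitrary assumptions for illustration):

```python
from fractions import Fraction
from itertools import chain

# Finite sample space with the discrete uniform measure (Example 1.8(b)).
omega = set(range(1, 7))  # a fair die

def P(A):
    return Fraction(len(A & omega), len(omega))

A = {1, 2, 3}
B = {3, 4}

# (a) complement rule
assert P(omega - A) == 1 - P(A)
# (c) inclusion-exclusion and subadditivity
assert P(A | B) == P(A) + P(B) - P(A & B)
assert P(A | B) <= P(A) + P(B)
# (f) union bound for a finite sequence of events
events = [{1}, {1, 2}, {2, 3}, {5}]
union = set(chain.from_iterable(events))
assert P(union) <= sum(P(E) for E in events)
print("all checks passed")
```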

Events with probability zero or one are of particular importance because of the seemingly
obvious interpretation as “impossible” and “sure” events. However, this interpretation is not
fully correct: P [A] = 0 does in general not imply that A = ∅, and P [A] = 1 does in general
not imply that A = Ω. The following definition makes this precise.

Definition 1.11. Let (Ω, F, P ) be a probability space.

(a) N ∈ F is called a P -null set if P [N ] = 0.

(b) Let E(ω) be a property that a state of the world ω ∈ Ω can have or not have. We say
that E holds P -almost surely if there exists a P -null set N ∈ F such that E holds for all
ω ∈ Ω \ N.

1.2 Random variables


The next fundamental object we are looking at is the concept of a random variable, which is
a map between measurable spaces with a special property, called measurability.

Definition 1.12. Let (Ω, F) and (Ω′, F′) be measurable spaces and X : Ω → Ω′ a map.

(a) The preimage of a set A′ ⊂ Ω′ under X is denoted by

X⁻¹(A′) := {X ∈ A′} := {ω ∈ Ω : X(ω) ∈ A′}.

(b) The collection of sets

σ(X) := { {X ∈ A′} : A′ ∈ F′ }

is called the σ-algebra generated by X.

(c) X is called measurable with respect to F and F′ (or, shorter, F-F′-measurable) if

X⁻¹(A′) := {X ∈ A′} := {ω ∈ Ω : X(ω) ∈ A′} ∈ F, for all A′ ∈ F′. (1.2)

In this case, if Ω′ = R and F′ = BR , then X is called an F-measurable random variable.⁶
If Ω′ = Rd and F′ = BΩ′ for d ≥ 2, we call X an F-measurable random vector.

Intuitively, σ(X) contains all the “information” about X. Clearly, X is measurable with
respect to F and F′ if and only if σ(X) ⊂ F, i.e., F contains all the information about X.

Example 1.13. Let (Ω, F) be a measurable space and A ∈ F an F-measurable event. Then
the indicator function 1A , defined by

1A (ω) := 1 if ω ∈ A, and 1A (ω) := 0 if ω ∈ Aᶜ,

is an F-measurable random variable. Indeed, let A′ ∈ BR . Then with X = 1A , we have

X⁻¹(A′) = {ω ∈ Ω : X(ω) ∈ A′} = ∅ if 0 ∉ A′ and 1 ∉ A′; A if 1 ∈ A′ and 0 ∉ A′;
Aᶜ if 0 ∈ A′ and 1 ∉ A′; Ω if 0 ∈ A′ and 1 ∈ A′.

Hence, σ(X) = {∅, A, Aᶜ, Ω} and since A ∈ F, it follows that σ(X) ⊂ F and X is F-measurable.
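The preimage case distinction above can be reproduced exhaustively on a small finite example. In the Python sketch below, Ω and A are illustrative assumptions, and A′ only needs to range over subsets of {0, 1} because an indicator takes no other values:

```python
from itertools import chain, combinations

# Small sample space and an event A (assumed values for illustration).
omega = frozenset({1, 2, 3, 4})
A = frozenset({1, 2})

X = lambda w: 1 if w in A else 0  # the indicator function 1_A

def preimage(values):
    # {X in A'} = {w in omega : X(w) in A'}
    return frozenset(w for w in omega if X(w) in values)

# Ranging A' over all subsets of {0, 1} produces every possible preimage.
subsets_01 = [set(s) for s in chain.from_iterable(
    combinations({0, 1}, r) for r in range(3))]
sigma_X = {preimage(s) for s in subsets_01}

# sigma(X) = {empty set, A, complement of A, omega}
assert sigma_X == {frozenset(), A, omega - A, omega}
print(len(sigma_X))
```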

Whereas Definition 1.12 is important from a theoretical perspective, it is almost useless for
checking in practice that a given function X : Ω → Ω′ is F-F′-measurable because we usually
do not have a good description of all A′ ∈ F′. But fortunately, one can show that it suffices
to check (1.2) for all A′ in a generator A′ of F′, which in general is much smaller than F′.
This is the content of the following result; for a proof see [3, Theorem 1.81].

Theorem 1.14. Let (Ω, F) and (Ω′, F′) be measurable spaces and X : Ω → Ω′ a map. Suppose
that F′ = σ(A′).

(a) We have

σ(X) = σ({ {X ∈ A′} : A′ ∈ A′ }).

⁶ Also, if Ω′ is a Borel subset of R and F′ = BΩ′ , we say that X is an F-measurable random variable valued
in Ω′. Note that each F-measurable random variable valued in Ω′ is also an F-measurable random variable
(valued in R), so that we do not need to worry too much about the exact codomain. The same holds for random
vectors.

(b) X is measurable with respect to F and F′ if and only if

X⁻¹(A′) := {X ∈ A′} := {ω ∈ Ω : X(ω) ∈ A′} ∈ F for all A′ ∈ A′.

We note two important corollaries.

Corollary 1.15. Let (Ω, F) be a measurable space. Then a function X : Ω → R is an


F-measurable random variable if and only if

X⁻¹((−∞, x]) = {X ≤ x} := {ω ∈ Ω : X(ω) ≤ x} ∈ F, for all x ∈ R. (1.3)

Proof. Set A′ := {(−∞, x] : x ∈ R}. Note that (1.3) states that X⁻¹(A′) ∈ F for all A′ ∈ A′.
Hence, the claim follows from Theorem 1.14(b) and the fact that BR = σ(A′) by Example 1.5(a).

Corollary 1.16. Let Ω be a subset of Rd and F = BΩ . If f : Ω → R is continuous, then f is


F-measurable.

Proof. By Corollary 1.15, it suffices to show that f⁻¹((−∞, x]) is in BΩ for all x ∈ R. So fix
x ∈ R. As f is continuous, the preimage of the closed set (−∞, x] is again closed. As BΩ
contains all closed sets in Ω, it follows that f⁻¹((−∞, x]) ∈ BΩ .

We proceed to show that the composition of measurable maps is again measurable.

Theorem 1.17. Let (Ω, F), (Ω′, F′), and (Ω″, F″) be measurable spaces. Let f : Ω → Ω′
be F-F′-measurable and g : Ω′ → Ω″ be F′-F″-measurable. Then h : Ω → Ω″, defined by
h = g ◦ f , is F-F″-measurable.⁷

Proof. Let A″ ∈ F″. Set A′ := g⁻¹(A″). Then

h⁻¹(A″) = {ω ∈ Ω : h(ω) ∈ A″} = {ω ∈ Ω : g(f (ω)) ∈ A″} = {ω ∈ Ω : f (ω) ∈ g⁻¹(A″)} = {ω ∈ Ω : f (ω) ∈ A′} = f⁻¹(A′).

Since g is F′-F″-measurable, it follows that A′ = g⁻¹(A″) ∈ F′, and since f is F-F′-measurable,
it follows that f⁻¹(A′) ∈ F. Hence h is F-F″-measurable.

We now show that sums and products of random variables are again random variables. In
order to do so, we require the following lemma:
⁷ Recall that g ◦ f is defined by (g ◦ f )(ω) = g(f (ω)).

Lemma 1.18. Let (Ω, F) be a measurable space and let X1 , . . . , Xn : Ω → R be maps. Define
X = (X1 , . . . , Xn ) : Ω → Rn . Then

X is F-BRn -measurable ⇔ each Xi is F-BR -measurable.

Proof. For b ∈ Rn , we have X⁻¹((−∞, b]) = ⋂ni=1 Xi⁻¹((−∞, bi ]). If each Xi is measurable,
then X⁻¹((−∞, b]) ∈ F. Since closed sets of the form (−∞, b] generate BRn , it follows from
Theorem 1.14 that X is F-BRn -measurable.

Conversely, suppose that X is F-BRn -measurable. For i = 1, . . . , n, we let πi : Rn → R,
x ↦ xi , be the projection onto the ith coordinate. Clearly, πi is continuous and thus BRn -BR -
measurable by Corollary 1.16. Hence Xi = πi ◦ X is F-BR -measurable by Theorem 1.17.

Theorem 1.19. Let (Ω, F) be a measurable space and X1 and X2 be F-measurable random
variables. Then X1 + X2 , X1 − X2 , X1 X2 , and X1 /X2 are again F-measurable random
variables.8

Proof. The map π : R2 → R, (x, α) ↦ αx, is continuous and thus measurable by Corollary 1.16.
Moreover, (X1 , X2 ) : Ω → R2 is measurable by Lemma 1.18. Hence also the composed map
X1 X2 = π ◦ (X1 , X2 ) is measurable by Theorem 1.17. Similarly, we obtain measurability for
X1 + X2 , X1 − X2 and X1 /X2 .⁹

Finally, we show that countable suprema and infima of random variables are again measurable.¹⁰
To this end, we need to extend the real line R by the points −∞ and +∞. Thus, we set

R̄ := R ∪ {−∞, +∞} and BR̄ := σ({[−∞, x] : x ∈ R}).

One can check that BR ⊂ BR̄ , so that every real-valued random variable X can be identified
in a canonical way with an R̄-valued measurable map.¹¹ For this reason, we will not always
carefully distinguish between R-valued and R̄-valued measurable maps and call both random
variables.

Theorem 1.20. Let (Ω, F) be a measurable space and X1 , X2 , . . . be F-measurable R̄-valued
random variables. Then the following maps are also F-measurable:

(a) supn∈N Xn .

(b) inf n∈N Xn .


⁸ Here, we agree that x/0 := 0 for all x ∈ R, which is a standard convention in measure theory.
⁹ X1 /X2 requires a little more care; here we again use the convention x/0 := 0 for all x ∈ R.
¹⁰ Note that uncountable suprema and infima are in general not measurable.
¹¹ See [3, Corollary 1.87] for a precise formulation of this statement.

(c) lim supn→∞ Xn .

(d) lim inf n→∞ Xn .

Proof. We only prove (a) and (c); the proofs of (b) and (d) are similar.

(a) Let x ∈ R. Then, by the fact that each Xn is F-BR̄ -measurable, we have

{supn∈N Xn ∈ [−∞, x]} = ⋂n∈N {Xn ∈ [−∞, x]} ∈ F.

Now the claim follows from Theorem 1.14(b).

(c) For any n ∈ N set Yn := supm≥n Xm . Then each Yn is F-measurable by part (a). Hence,
lim supn→∞ Xn = inf n∈N Yn is F-measurable by part (b).

1.3 Distribution of random variables: univariate case


The definition of a random variable does not mention any probability measure P at all. If we
add a probability measure P , we get some further notions.

Definition 1.21. Let (Ω, F, P ) be a probability space and X an F-measurable random variable.

(a) The distribution (or law or image) of X under P is the probability measure on (R, BR )
defined by
PX [B] := P [X ∈ B], B ∈ BR .

(b) The distribution function (or cumulative distribution function (cdf)) of X under P is
the map FX : R → [0, 1] given by

FX (x) := P [X ≤ x], x ∈ R.

The following result lists the three defining properties of a distribution function; its proof is
left as an exercise.

Lemma 1.22. Let (Ω, F, P ) be a probability space and X an F-measurable random variable.
Then the distribution function FX : R → [0, 1] has the following properties:

(a) FX is nondecreasing.

(b) FX is right-continuous.

(c) limx→−∞ FX (x) = 0 and limx→∞ FX (x) = 1.
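As a sanity check, the three properties can be verified numerically for a concrete distribution function. The Python sketch below uses the step cdf of a Bernoulli(p) variable with an assumed p = 0.3 (not an example from the notes):

```python
# The cdf of X ~ Bernoulli(0.3): 0 below 0, 1 - p on [0, 1), and 1 from 1 on.
p = 0.3  # assumed parameter

def F(x):
    if x < 0:
        return 0.0
    if x < 1:
        return 1.0 - p
    return 1.0

xs = [x / 10 for x in range(-30, 31)]
# (a) nondecreasing
assert all(F(a) <= F(b) for a, b in zip(xs, xs[1:]))
# (b) right-continuous, in particular at the jump points 0 and 1
for j in (0, 1):
    assert abs(F(j) - F(j + 1e-12)) < 1e-9
# (c) limits 0 at -infinity and 1 at +infinity
assert F(-1e9) == 0.0 and F(1e9) == 1.0
print("Lemma 1.22 properties hold for the Bernoulli cdf")
```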

The above lemma shows that for each random variable X, there exists a function F that is
nondecreasing, right-continuous, and with limits 0 at −∞ and 1 at +∞, respectively. The next
result shows that the converse is also true: for each function F with these three properties,
there exists a random variable X with distribution function F ; for a proof see [3, Theorem
1.104].

Theorem 1.23. Let F : R → [0, 1] be a function that is nondecreasing, right-continuous


and satisfies limx→−∞ F (x) = 0 and limx→∞ F (x) = 1. Then there exists a probability space
(Ω, F, P ) and a random variable X such that FX = F .12
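A concrete instance of this existence result is inverse transform sampling: on Ω = (0, 1) with the uniform measure, X(ω) := F⁻¹(ω) has distribution function F whenever F is continuous and strictly increasing. The Python sketch below (using the Exponential(1) cdf as an assumed example, not part of the notes) checks this empirically:

```python
import math
import random

# F is the Exponential(1) cdf (an assumed example); F_inv is its inverse.
def F(x):
    return 1.0 - math.exp(-x) if x > 0 else 0.0

def F_inv(u):
    return -math.log(1.0 - u)

# Sample X = F_inv(U) for U uniform on (0, 1).
random.seed(0)
samples = [F_inv(random.random()) for _ in range(100_000)]

# The empirical distribution function of the samples should be close to F.
for x in (0.5, 1.0, 2.0):
    empirical = sum(s <= x for s in samples) / len(samples)
    assert abs(empirical - F(x)) < 0.01
print("empirical cdf matches F")
```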

The next definition looks at the concept that two random variables X and Y have the same
distribution.

Definition 1.24. Let (Ω, F, P ) be a probability space and X and Y F-measurable random
variables.13 Then X and Y are said to be identically distributed if

PX = PY .

The next result shows that the distribution function of a random variable does indeed describe
the whole distribution; for a proof see [2, Theorem 7.1].

Theorem 1.25. Let (Ω, F, P ) be a probability space and X and Y F-measurable random
variables. Then the following are equivalent:

(1) X and Y are identically distributed.

(2) FX = FY .

We proceed to introduce the important class of discrete distributions.

Definition 1.26. A real-valued random variable X (or more precisely, its distribution) is called
discrete if there exists a finite or countable set B ∈ BR such that PX [B] = P [X ∈ B] = 1. In
this case, the probability mass function (pmf) pX : R → [0, 1] of X is given by

pX (x) = PX [{x}] = P [X = x], x ∈ R.

Remark 1.27. (a) If X is discrete, it is straightforward to check that

pX (x) = 0 for x ∈ Bᶜ and ∑x∈B pX (x) = 1. (1.4)

¹² The proof of Theorem 1.23 shows that one can always take Ω = R, F := BR and X : Ω → R given by
X(ω) = ω, i.e., X is the identity map.
¹³ More generally, X and Y might be defined on different probability spaces.

More generally, for any Borel set A ∈ BR , we have the formula

PX [A] = P [X ∈ A] = ∑x∈A pX (x).

Note that due to (1.4), we always sum over a countable set and this sum is finite and bounded
above by 1.

(b) One can show that a random variable X has a discrete distribution if and only if its
distribution function is piecewise constant, i.e., for each x ∈ R, there is ε > 0 such that
FX (y) = FX (x) for all y ∈ [x, x + ε]. Moreover, pX (x) = FX (x) − FX (x−) for all x ∈ R,
where FX (x−) := limy↑x FX (y) := limy→x, y<x FX (y) denotes the left limit of FX at x.¹⁴ This
together with Theorem 1.25 also shows that the distribution of a discrete random variable is
uniquely described by its pmf.

We proceed to list some important examples of discrete distributions.

Example 1.28. A discrete random variable X is said to have a

(a) Bernoulli distribution with parameter p ∈ [0, 1] if its pmf is given by

pX (x) = p if x = 1, 1 − p if x = 0, and 0 otherwise.

(b) binomial distribution with parameters n ∈ N and p ∈ [0, 1] if its pmf is given by

pX (x) = C(n, x) p^x (1 − p)^(n−x) if x ∈ {0, . . . , n}, and 0 otherwise,

where C(n, x) = n!/(x!(n − x)!) denotes the binomial coefficient.

(c) geometric distribution with parameter p ∈ (0, 1) if its pmf is given by¹⁵

pX (x) = p(1 − p)^x if x ∈ N0 , and 0 otherwise.

¹⁴ This left limit always exists because FX is nondecreasing.
¹⁵ Warning: in parts of the literature, the geometric distribution is shifted by one to the right, i.e., it is a
distribution on N.

(d) Poisson distribution with parameter λ > 0 if its pmf is given by

pX (x) = e^(−λ) λ^x / x! if x ∈ N0 , and 0 otherwise.

Finally, we introduce the equally important class of continuous distributions.

Definition 1.29. A real-valued random variable X (or more precisely, its distribution) is called
(absolutely) continuous if there exists a nonnegative, measurable function fX : R → [0, ∞)
satisfying

FX (x) = ∫_{−∞}^{x} fX (y) dy, x ∈ R.

In this case, the function fX is called the probability density function (pdf) of X.

Remark 1.30. (a) One can show that if X is continuous with pdf fX , then for any Borel set
A ∈ BR , we have the formula

PX [A] = P [X ∈ A] = ∫_A fX (y) dy. (1.5)

This together with Theorem 1.25 and fundamental properties of the (Lebesgue) integral shows
that the distribution of a continuous random variable is uniquely characterised by its pdf (up
to Lebesgue-null sets).

(b) Note that there are many random variables which have neither a discrete nor a continuous
distribution.

We proceed to list some important examples of continuous distributions.

Example 1.31. A continuous random variable X is said to have a

(a) uniform distribution on (a, b), where −∞ < a < b < ∞, if its pdf is given by

fX (x) = 1/(b − a) if x ∈ (a, b), and 0 otherwise.

We then also write X ∼ U(a, b).

(b) exponential distribution with rate parameter λ > 0 if its pdf is given by

fX (x) = λ exp(−λx) if x > 0, and 0 otherwise.

(c) normal distribution with parameters µ ∈ R and σ² > 0 if its pdf is given by

fX (x) = (2πσ²)^(−1/2) exp(−(x − µ)²/(2σ²)), x ∈ R.

We then also write X ∼ N(µ, σ²).
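One can check numerically that such a density integrates to 1. The following Python sketch (trapezoidal rule; µ = 1 and σ² = 4 are assumed parameters, not taken from the notes) does this for the normal pdf and also verifies FX(µ) = 1/2:

```python
import math

# Assumed parameters of the normal distribution N(mu, sigma^2).
mu, sigma2 = 1.0, 4.0

def f(x):
    return math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

def trapezoid(g, a, b, n=100_000):
    # Plain trapezoidal rule on [a, b] with n subintervals.
    h = (b - a) / n
    return h * (0.5 * (g(a) + g(b)) + sum(g(a + i * h) for i in range(1, n)))

total = trapezoid(f, mu - 40, mu + 40)  # [mu - 40, mu + 40] covers 20 stdevs
half = trapezoid(f, mu - 40, mu)        # F_X(mu), equal to 1/2 by symmetry
assert abs(total - 1.0) < 1e-6
assert abs(half - 0.5) < 1e-6
print(f"integral ≈ {total:.6f}, F_X(mu) ≈ {half:.6f}")
```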

1.4 Distribution of random variables: multivariate case


In this section, we study the distribution of (multivariate) random vectors. This is very similar
to the case of (univariate) random variables, and so we will be brief.

Definition 1.32. Let X1 , . . . , XN be random variables on some probability space (Ω, F, P ).

(a) The joint distribution function of X1 , . . . , XN under P is the map F(X1 ,...,XN ) : RN →
[0, 1] given by

F(X1 ,...,XN ) (x1 , . . . , xN ) = P [X1 ≤ x1 , . . . , XN ≤ xN ], x1 , . . . , xN ∈ R.

(b) X1 , . . . , XN are called jointly continuous if there is a measurable function f(X1 ,...,XN ) :
RN → [0, ∞) such that

F(X1 ,...,XN ) (x1 , . . . , xN ) = ∫_{−∞}^{x1} · · · ∫_{−∞}^{xN} f(X1 ,...,XN ) (y1 , . . . , yN ) dyN · · · dy1 , x1 , . . . , xN ∈ R.

In this case, the function f(X1 ,...,XN ) is called the joint probability density function (joint
pdf) of X1 , . . . , XN .

Remark 1.33. (a) To simplify the notation one often sets X := (X1 , . . . , XN ) and writes
FX (x) and fX (x) for F(X1 ,...,XN ) (x1 , . . . , xN ) and f(X1 ,...,XN ) (x1 , . . . , xN ), respectively, where
x = (x1 , . . . , xN ).

(b) If X = (X1 , . . . , XN ) is a random vector, then the one-dimensional distributions of


X1 , . . . , XN are often called marginal distributions. If X1 , . . . , XN are jointly continuous, one
can show that the marginal pdfs can be obtained from the joint pdf by “integrating out” the
other random variables. For example if X1 , X2 are jointly continuous with joint pdf f(X1 ,X2 ) ,
then the marginal pdfs of X1 and X2 are given by
fX1 (x1 ) = ∫_{−∞}^{∞} f(X1 ,X2 ) (x1 , y2 ) dy2 and fX2 (x2 ) = ∫_{−∞}^{∞} f(X1 ,X2 ) (y1 , x2 ) dy1 .

That these integrals are well defined and measurable is the content of Fubini’s theorem, see
Section 1.11 below.

The key example of a multivariate distribution is the multivariate normal distribution.

Example 1.34. A random vector X = (X1 , . . . , XN ) is said to have a multivariate normal dis-
tribution with mean vector µ ∈ RN and covariance matrix Σ ∈ RN ×N , where Σ is symmetric
and positive definite, if X1 , . . . , XN are jointly continuous with joint pdf
 
fX (x) = (det(2πΣ))^(−1/2) exp(−(1/2) (x − µ)⊤ Σ⁻¹ (x − µ)).
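For N = 2, the formula can be written out by hand. The Python sketch below (µ and Σ are assumed illustrative values, not from the notes) evaluates the bivariate pdf and checks numerically that integrating out x2 recovers the N(µ1, Σ11) marginal, as described in Remark 1.33(b):

```python
import math

# Assumed parameters: mean vector and a symmetric positive definite Sigma.
mu = (0.0, 0.0)
S = ((2.0, 0.6),
     (0.6, 1.0))

det = S[0][0] * S[1][1] - S[0][1] * S[1][0]
Sinv = ((S[1][1] / det, -S[0][1] / det),
        (-S[1][0] / det, S[0][0] / det))

def f_joint(x1, x2):
    # Bivariate normal pdf; det(2*pi*Sigma) = (2*pi)^2 det(Sigma) for N = 2.
    d1, d2 = x1 - mu[0], x2 - mu[1]
    q = d1 * (Sinv[0][0] * d1 + Sinv[0][1] * d2) + d2 * (Sinv[1][0] * d1 + Sinv[1][1] * d2)
    return math.exp(-0.5 * q) / math.sqrt((2 * math.pi) ** 2 * det)

def f_marginal(x1):
    # The marginal of X1 is N(mu_1, Sigma_11).
    return math.exp(-(x1 - mu[0]) ** 2 / (2 * S[0][0])) / math.sqrt(2 * math.pi * S[0][0])

# Integrate out x2 with a trapezoidal rule and compare with the marginal.
x1, n, lo, hi = 0.7, 100_000, -30.0, 30.0
h = (hi - lo) / n
integral = h * (0.5 * (f_joint(x1, lo) + f_joint(x1, hi))
                + sum(f_joint(x1, lo + i * h) for i in range(1, n)))
assert abs(integral - f_marginal(x1)) < 1e-6
print(f"integrating out y2 gives {integral:.6f}, marginal pdf is {f_marginal(x1):.6f}")
```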

1.5 Conditional probabilities


In this section, we briefly look at the elementary notion of conditional probabilities.

To motivate this concept, suppose we are given a probability space (Ω, F, P ) and we have been
informed that A ∈ F has occurred. We want to find a new probability measure P [· | A] on
(Ω, F) that takes this information into account. Clearly, this new probability measure should
be consistent with the old probability measure P in the sense that P [· | A] is proportional to P .
Moreover, since we already know that A has occurred, one should require that P [A | A] = 1.
It is then not difficult to check that these two properties uniquely determine P [· | A]. The
answer is given by the following definition.

Definition 1.35. Let (Ω, F, P ) be a probability space and A ∈ F. Then for B ∈ F, the
conditional probability of B given A is denoted by P [B | A] and defined by

P [B | A] = P [A ∩ B] / P [A] when P [A] > 0, and P [B | A] = 0 otherwise. (1.6)

If P [A] > 0, it is straightforward to check that P [· | A] is again a probability measure on (Ω, F).

The following result lists two elementary facts about conditional probabilities.

Theorem 1.36. Let (Ω, F, P ) be a probability space, I a finite or countable index set,¹⁶ and
(Ai )i∈I an F-measurable partition of Ω, i.e., each Ai is F-measurable, ⋃i∈I Ai = Ω and
Ai ∩ Aj = ∅ for i ≠ j. Suppose that P [Ai ] > 0 for each i ∈ I. Then

(a) For every B ∈ F, we have the law of total probability

P [B] = ∑i∈I P [B | Ai ] P [Ai ].

¹⁶ We always implicitly assume that index sets are nonempty.

(b) For every B ∈ F with P [B] > 0 and each k ∈ I, we have Bayes’ formula

P [Ak | B] = P [B | Ak ] P [Ak ] / ∑i∈I P [B | Ai ] P [Ai ].

Proof. (a) The definition of conditional probabilities and the σ-additivity of P give

∑i∈I P [B | Ai ] P [Ai ] = ∑i∈I P [B ∩ Ai ] = P [B].

(b) The definition of conditional probabilities and part (a) give

P [B | Ak ] P [Ak ] / ∑i∈I P [B | Ai ] P [Ai ] = P [Ak ∩ B] / P [B] = P [Ak | B].
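Both formulas are easy to apply mechanically. A small Python sketch with assumed numbers (a factory where machine A1 makes 60% of items with a 1% defect rate and machine A2 makes 40% with a 3% defect rate, and B is the event that an item is defective):

```python
from fractions import Fraction as Frac

# P[A_i] for the partition {A_1, A_2} and P[B | A_i] (assumed numbers).
P_A = {1: Frac(6, 10), 2: Frac(4, 10)}
P_B_given_A = {1: Frac(1, 100), 2: Frac(3, 100)}

# (a) Law of total probability: P[B] = sum_i P[B | A_i] P[A_i].
P_B = sum(P_B_given_A[i] * P_A[i] for i in P_A)

# (b) Bayes' formula: P[A_k | B] = P[B | A_k] P[A_k] / P[B].
P_A_given_B = {k: P_B_given_A[k] * P_A[k] / P_B for k in P_A}

assert P_B == Frac(9, 500)
assert sum(P_A_given_B.values()) == 1  # the posterior is again a distribution
print(P_B, P_A_given_B)
```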

Exercise. The proportion of Jaguar cars manufactured in Coventry is 0.7, and the proportion
of these with some fault is 0.2. All other Jaguars are made in Birmingham and the proportion
of faulty Birmingham cars is 0.1. What is the probability that a randomly selected Jaguar
car:

(a) is both faulty and manufactured in Coventry?

(b) is faulty?

(c) is manufactured in Coventry given that it is faulty?

1.6 Independence
In this section, we study the key concept of (stochastic) independence. We start by looking
at independence of families of events and then generalise this to families of σ-algebras and
families of random variables.

To motivate this concept, suppose we are given a probability space (Ω, F, P ) and two F-
measurable events A, B with P [A], P [B] > 0. Then intuitively A and B are (stochastically)
independent if the probability assigned to A is not influenced by the information that B has
occurred and vice versa. In formulas, this means that

P [A] = P [A | B] and P [B] = P [B | A]. (1.7)

Using the definition of conditional probabilities in (1.6), it is straightforward to check that


(1.7) is equivalent to P [A ∩ B] = P [A]P [B]. The following definition generalises this idea of
stochastic independence from two to many events.

Definition 1.37. Let (Ω, F, P ) be a probability space, I an arbitrary index set and (Ai )i∈I
a family of events in F. Then the Ai are said to be (stochastically) independent if for any
finite set of distinct indices i1 , . . . , in ∈ I,

        P [Ai1 ∩ Ai2 ∩ · · · ∩ Ain ] = ∏_{k=1}^{n} P [Aik ].    (1.8)

Exercise. Suppose two dice are rolled and we note the upturned face. Let

A := {The sum is 7},


B := {The first die gives 3},
C := {The second die gives 4}.

Are the events pairwise independent? Are the events independent?
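
The exercise is small enough to settle by brute force: the following Python sketch enumerates all 36 equally likely outcomes and tests the factorisation condition (1.8) for each pair and for the triple.

```python
from fractions import Fraction
from itertools import product

outcomes = list(product(range(1, 7), repeat=2))   # the 36 equally likely rolls

def prob(event):
    # Probability of an event (a predicate on (d1, d2)) under the uniform measure.
    return Fraction(sum(1 for w in outcomes if event(w)), len(outcomes))

A = lambda w: w[0] + w[1] == 7    # the sum is 7
B = lambda w: w[0] == 3           # the first die gives 3
C = lambda w: w[1] == 4           # the second die gives 4

pA, pB, pC = prob(A), prob(B), prob(C)

# Pairwise independence: every pair factorises.
pairwise = (prob(lambda w: A(w) and B(w)) == pA * pB
            and prob(lambda w: A(w) and C(w)) == pA * pC
            and prob(lambda w: B(w) and C(w)) == pB * pC)

# (Mutual) independence would also require the triple intersection to factorise.
triple = prob(lambda w: A(w) and B(w) and C(w)) == pA * pB * pC

print(pairwise, triple)   # True False
```

The events turn out to be pairwise independent but not independent: P [A ∩ B ∩ C] = 1/36 ≠ (1/6)^3.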

We now “lift” the definition of independence from families of events to families of σ-algebras.

Definition 1.38. Let (Ω, F, P ) be a probability space, I an arbitrary index set and (Gi )i∈I
a family of sub-σ-algebras of F.17 Then the Gi are said to be independent if for any finite set
of distinct indices i1 , . . . , in ∈ I and any events Ai1 ∈ Gi1 , . . . , Ain ∈ Gin ,
        P [Ai1 ∩ Ai2 ∩ · · · ∩ Ain ] = ∏_{k=1}^{n} P [Aik ].    (1.9)

Next, we want to define a notion of independence for random variables. We do this by requiring
that their generated σ-algebras (cf. Definition 1.12) are independent.

Definition 1.39. Let (Ω, F, P ) be a probability space, I an arbitrary index set and (Xi )i∈I
a family of F-measurable random variables. Then the Xi are said to be independent if their
generated σ-algebras σ(Xi ), i ∈ I, are independent.

The following result shows that we need to check independence only for certain generators of
a σ-algebra; for a proof see [3, Theorem 2.13].

Theorem 1.40. Let (Ω, F, P ) be a probability space, I an arbitrary index set and (Ai )i∈I
a family of classes of events that are independent, i.e., for any finite set of distinct indices
i1 , . . . , in ∈ I and any events Ai1 ∈ Ai1 , . . . , Ain ∈ Ain ,
        P [Ai1 ∩ Ai2 ∩ · · · ∩ Ain ] = ∏_{k=1}^{n} P [Aik ].    (1.10)

Moreover, suppose that each Ai is closed under intersections, i.e., if A, B ∈ Ai then also
A ∩ B ∈ Ai .18

17 This means that each Gi is a σ-algebra on Ω and Gi ⊂ F.

(a) The generated σ-algebras σ(Ai ) are also independent.

(b) If K is an arbitrary index set and (Ik )k∈K a partition of I into mutually disjoint “blocks”,
then the generated σ-algebras σ( ∪_{j∈Ik} Aj ), k ∈ K, are also independent.

We note the following important corollary for the case of random variables. Its proof is left
as an exercise.

Corollary 1.41. Let (Ω, F, P ) be a probability space and X1 , . . . , XN F-measurable random


variables. Then X1 , . . . , XN are independent if and only if

        F(X1 ,...,XN ) (x1 , . . . , xN ) = ∏_{n=1}^{N} FXn (xn ),  for all x1 , . . . , xN ∈ R.

Moreover, if X1 , . . . , XN are jointly continuous and have a continuous joint pdf f(X1 ,...,XN )
and continuous marginal pdfs fX1 , . . . , fXN , then X1 , . . . , XN are independent if and only if

        f(X1 ,...,XN ) (x1 , . . . , xN ) = ∏_{n=1}^{N} fXn (xn ),  for all x1 , . . . , xN ∈ R.

As another application of Theorem 1.40, we show that if a family (Ai )i∈I of events is inde-
pendent, then so is the family of their complements (Aci )i∈I .

Example 1.42. Let (Ω, F, P ) be a probability space, I an arbitrary index set and (Ai )i∈I
a family of independent events in F. Then also (Aci )i∈I is a family of independent events.
Indeed, set
Ai := {Ai }, i ∈ I.

Then the Ai are trivially closed under intersection and independent because the Ai are. More-
over, by Example 1.13, it follows that σ(Ai ) = {∅, Ai , Aci , Ω}. Hence, by Theorem 1.40, the
generated σ-algebras σ(Ai ) are also independent. By Definition 1.38 of independence of
σ-algebras, it follows that the (Aci )i∈I are independent.

The next result computes the probability of a sequence of events A1 , A2 , . . . happening in-
finitely often. To this end, recall that {An i.o.} = lim supn→∞ An = ∩_{n∈N} ∪_{k≥n} Ak .

Lemma 1.43. (Borel-Cantelli) Let (Ω, F, P ) be a probability space and A1 , A2 , . . . F-measurable


events.
(a) If ∑_{n=1}^{∞} P [An ] < ∞, then P [{An i.o.}] = 0.
18 A set closed under intersections is often called a π-system.
(b) If the An are independent and ∑_{n=1}^{∞} P [An ] = ∞, then P [{An i.o.}] = 1.

Proof. (a) Proposition 1.10(e) and (f) together with ∑_{n=1}^{∞} P [An ] < ∞ yields

        P [{An i.o.}] = P [ ∩_{n∈N} ∪_{k≥n} Ak ] = lim_{n→∞} P [ ∪_{k≥n} Ak ] ≤ lim_{n→∞} ∑_{k=n}^{∞} P [Ak ] = 0.

(b) It suffices to show that P [{An i.o.}c ] = 0. The de Morgan laws and Proposition 1.10(f)
yield

        P [{An i.o.}c ] = P [ ( ∩_{n∈N} ∪_{k≥n} Ak )c ] = P [ ∪_{n∈N} ∩_{k≥n} Ack ] = lim_{n→∞} P [ ∩_{k≥n} Ack ] .

It thus suffices to show that P [ ∩_{k≥n} Ack ] = 0 for each n ∈ N. So fix n ∈ N. Set B1 := Acn ,
B2 := Acn ∩ Acn+1 , . . . Then B1 ⊃ B2 ⊃ · · · ∈ F and ∩_{ℓ∈N} Bℓ = ∩_{k≥n} Ack . Proposition
1.10(e), the fact that (Acn )n∈N is independent by Example 1.42, and the elementary inequality
1 − x ≤ exp(−x) for x ∈ R give

        P [ ∩_{k≥n} Ack ] = P [ ∩_{ℓ∈N} Bℓ ] = lim_{ℓ→∞} P [Bℓ ] = lim_{ℓ→∞} P [Acn ∩ · · · ∩ Acn+ℓ−1 ]
                        = lim_{ℓ→∞} ∏_{k=n}^{n+ℓ−1} (1 − P [Ak ]) ≤ lim inf_{ℓ→∞} exp( − ∑_{k=n}^{n+ℓ−1} P [Ak ] ) = 0.

Example 1.44. Suppose we roll a fair die exactly once and define An as the event that in
this roll the face showed a six, for every n ∈ N. Clearly we have

        ∑_{n=1}^{∞} P [An ] = ∑_{n=1}^{∞} 1/6 = ∞;

however, P [{An i.o.}] = P [A1 ] < 1. This shows that in part (b) of the Borel-Cantelli lemma,
the assumption of independence is indispensable.

Example 1.45. Suppose a call centre receives Xn calls on day n ∈ N, where Xn ∼ Poisson(λn )
with 0 ≤ λn ≤ Λ for some fixed Λ ∈ (0, ∞). Then,

        P [Xn ≥ n for infinitely many n] = 0.

Indeed, this follows from the Borel-Cantelli lemma by noting that

        ∑_{n=1}^{∞} P [Xn ≥ n] = ∑_{n=1}^{∞} ∑_{m=n}^{∞} P [Xn = m] = ∑_{m=1}^{∞} ∑_{n=1}^{m} P [Xn = m]
                              = ∑_{m=1}^{∞} ∑_{n=1}^{m} e^{−λ_n} λ_n^m / m! ≤ ∑_{m=1}^{∞} m Λ^m / m! = Λ e^Λ < ∞.
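
As a numerical sanity check of the series bound (with the hypothetical choice Λ = 2; any value in (0, ∞) works), the following Python sketch evaluates the series in the worst case λn = Λ. For this choice the series even sums to E[X1] = Λ exactly, well below the bound Λe^Λ.

```python
import math

LAM = 2.0   # a hypothetical choice of the uniform bound; any value in (0, inf) works

def poisson_tail(lam, n):
    # P[X >= n] for X ~ Poisson(lam), via the complementary partial sum.
    cdf = sum(math.exp(-lam) * lam**m / math.factorial(m) for m in range(n))
    return max(0.0, 1.0 - cdf)

# Worst case allowed in the example: every lambda_n equal to the bound LAM.
series = sum(poisson_tail(LAM, n) for n in range(1, 60))
bound = LAM * math.exp(LAM)

print(series, bound)
```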

1.7 Expectation
In this section, we aim to define for a random variable X on a probability space (Ω, F, P ),
the expectation of X under P (also called the integral of X with respect to P ). We proceed
in three steps.

First, we define the expectation for simple random variables.

Definition 1.46. A random variable X on a probability space (Ω, F, P ) is called simple if


there is n ∈ N, c1 , . . . , cn ∈ R \ {0} and A1 , . . . , An ∈ F such that

X = c1 1A1 + · · · + cn 1An .

In this case, the expectation of X with respect to P is given by

E P [X] := c1 P [A1 ] + · · · + cn P [An ]. (1.11)

If there is no danger of confusion, we often drop the qualifier P in E P .

Remark 1.47. (a) It is not difficult to check that each simple random variable has a repre-
sentation such that the Ai are pairwise disjoint and the ci are distinct.

(b) It is not difficult to check that (1.11) is independent of the choice of the representation of
X. More precisely, if X can also be written as X = d1 1B1 + · · · + dm 1Bm then

c1 P [A1 ] + · · · + cn P [An ] = d1 P [B1 ] + · · · + dm P [Bm ].

The following result shows that the expectation is linear on simple random variables. Its proof
is left as an exercise.

Lemma 1.48. Let X and Y be simple random variables on some probability space (Ω, F, P ).
For a, b ∈ R, aX + bY is again a simple random variable and

E [aX + bY ] = aE [X] + bE [Y ] .

We next aim to define the expectation for arbitrary nonnegative random variables.

Definition 1.49. Let X be a [0, ∞]-valued random variable on some probability space (Ω, F, P ).
Then the expectation of X with respect to P is given by

        E P [X] := sup { E P [Y ] : Y is nonnegative, simple, and satisfies Y ≤ X } .




If there is no danger of confusion, we often drop the qualifier P in E P .

Remark 1.50. (a) The expectation of a nonnegative random variable can be ∞ even if X
itself never takes the value ∞.

(b) It follows immediately from Definition 1.49 that the expectation is monotone in the sense
that if X and Y are [0, ∞]-valued random variables with X ≤ Y then E [X] ≤ E [Y ].

Definition 1.49 suggests that in order to calculate the expectation of a nonnegative random
variable X, we approximate X from below by a nondecreasing sequence of nonnegative simple
random variables (Xn )n∈N and calculate limn→∞ E [Xn ]. The following two results show that
this idea really works.

Lemma 1.51. Let X be a [0, ∞]-valued random variable on some probability space (Ω, F, P ).
Then there exists a nondecreasing sequence of nonnegative simple random variables (Xn )n∈N
with limn→∞ Xn = X.

Proof. For n ∈ N, set Xn (ω) := min(2−n b2n X(ω)c, n), where b·c denotes the floor function.19
Then for each ω ∈ Ω, it is easy to check that the sequence (Xn (ω))n∈N is nondecreasing and
satisfies limn→∞ Xn (ω) = X(ω).
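
The approximating sequence from the proof is easy to experiment with numerically; the Python sketch below evaluates Xn (ω) = min(2^{−n} ⌊2^n X(ω)⌋, n) at the single sample value X(ω) = π (any nonnegative number works).

```python
import math

def dyadic_approx(x, n):
    # The n-th simple approximation X_n = min(2^{-n} * floor(2^n * x), n) from Lemma 1.51.
    return min(math.floor(2**n * x) / 2**n, n)

x = math.pi   # a sample value X(omega); any nonnegative number works

approximations = [dyadic_approx(x, n) for n in range(1, 30)]

# The sequence increases towards x from below, with error at most 2^{-n} once n >= x.
nondecreasing = all(a <= b for a, b in zip(approximations, approximations[1:]))
print(nondecreasing, approximations[-1])
```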

The next theorem shows that expectation and monotone limits can be interchanged. It is one
of the cornerstones of modern integration theory; for a proof see [3, Lemma 4.2].

Theorem 1.52 (Monotone convergence theorem). Let X1 , X2 , . . . and X be [0, ∞]-valued


random variables on some probability space (Ω, F, P ). Suppose that X1 ≤ X2 ≤ · · · P -a.s.
and limn→∞ Xn = X P -a.s. Then

        lim_{n→∞} E [Xn ] = E [X] .    (1.12)

Remark 1.53. Note that in (1.12) both sides may be ∞.

Theorem 1.52 serves as a crucial ingredient to many proofs, where one first establishes the
result for simple random variables and then passes to a monotone limit. To illustrate this
approach, we prove the following generalisation of Lemma 1.48.

19 If X(ω) = +∞, we use the conventions that c × ∞ = ∞ for c > 0 and ⌊∞⌋ = ∞.

Lemma 1.54. Let X and Y be [0, ∞]-valued random variables on some probability space
(Ω, F, P ). For a, b ≥ 0, aX + bY is again a [0, ∞]-valued random variable and20

E [aX + bY ] = aE [X] + bE [Y ] .

Proof. First, aX + bY is a [0, ∞]-valued random variable by (a straightforward extension of)
Theorem 1.19. Next, let (Xn )n∈N and (Yn )n∈N be nondecreasing sequences of nonnegative
simple random variables satisfying limn→∞ Xn = X and limn→∞ Yn = Y . (They exist by
Lemma 1.51.) Then (aXn + bYn )n∈N is a nondecreasing sequence of nonnegative simple
random variables with limn→∞ (aXn + bYn ) = aX + bY . Hence, the monotone convergence
theorem (Theorem 1.52) and Lemma 1.48 give

        E [aX + bY ] = lim_{n→∞} E [aXn + bYn ] = lim_{n→∞} (aE [Xn ] + bE [Yn ])
                     = a lim_{n→∞} E [Xn ] + b lim_{n→∞} E [Yn ] = aE [X] + bE [Y ] .

The following result can also be proved using the monotone convergence theorem. Its proof is
left as an exercise.

Lemma 1.55. Let X be a [0, ∞]-valued random variable on some probability space (Ω, F, P ).

(a) We have X = 0 P -a.s. if and only if E [X] = 0.

(b) If E [X] < ∞, then X < ∞ P -a.s.

As another application of the monotone convergence theorem, we prove the important Lemma
of Fatou.

Lemma 1.56 (Fatou). Let X1 , X2 , . . . be [0, ∞]-valued random variables on some probability
space (Ω, F, P ). Then

        E [ lim inf_{n→∞} Xn ] ≤ lim inf_{n→∞} E [Xn ] .

Proof. For n ∈ N, set Yn := inf m≥n Xm . Then Yn ≤ Xn for each n ∈ N, and so

E [Yn ] ≤ E [Xn ] , n ∈ N. (1.13)

by Remark 1.50(b). Moreover, (Yn )n∈N is a nondecreasing sequence of nonnegative random


variables with limn→∞ Yn = lim inf n→∞ Xn . Hence, by the monotone convergence theorem
20 Here, we use the standard convention in measure theory that c × ∞ =: ∞ for c > 0 and 0 × ∞ =: 0.
(Theorem 1.52) and (1.13), we obtain
        E [ lim inf_{n→∞} Xn ] = E [ lim_{n→∞} Yn ] = lim_{n→∞} E [Yn ] = lim inf_{n→∞} E [Yn ] ≤ lim inf_{n→∞} E [Xn ] .

Finally, we define the expectation for a general random variable. To this end, recall that
R̄ = [−∞, +∞].

Definition 1.57. Let X be an R̄-valued random variable on some probability space (Ω, F, P ).
X is called integrable or said to have finite expectation (with respect to P ) if E P [|X|] < ∞.
In this case one sets
        E P [X] := E P [X + ] − E P [X − ] ,

where X + = max{0, X} denotes the positive part of X and X − = max{0, −X} denotes the
negative part of X.21 If there is no danger of confusion, we often drop the qualifier P in E P .

Remark 1.58. (a) In situations where one wants to highlight the underlying sample space
Ω, it is more handy to write the expectation in integral notation. For an integrable random
variable X on a probability space (Ω, F, P ), we set
        ∫_Ω X(ω)P (dω) := E P [X] ,

and call this the integral of X with respect to P .

(b) The construction of the integral can be easily extended to general measures µ on (Ω, F),
which are still σ-additive but do not longer satisfy µ(Ω) = 1; see [3, Chapter 4] for details.

• If X = c1 1A1 + · · · + cn 1An is simple, we set22

        ∫_Ω X(ω)µ(dω) := c1 µ(A1 ) + · · · + cn µ(An ).

• If X is [0, ∞]-valued, we set

        ∫_Ω X(ω)µ(dω) := sup { ∫_Ω Y (ω)µ(dω) : Y is nonnegative, simple, and satisfies Y ≤ X } .

• If X is general, we say that X is µ-integrable if

        ∫_Ω |X(ω)|µ(dω) < ∞.

21 Note that both X + and X − are nonnegative random variables by Theorem 1.17, Corollary 1.16 and the
fact that the functions x 7→ x+ and x 7→ x− are continuous. Moreover, X = X + − X − and |X| = X + + X − .
22 Here, we use the standard convention in measure theory that ∞ + ∞ =: ∞.
In this case, we set
        ∫_Ω X(ω)µ(dω) := ∫_Ω X + (ω)µ(dω) − ∫_Ω X − (ω)µ(dω)

and call this the integral of X with respect to µ.

(c) The most important example of a general measure is the Lebesgue-measure λ on R, which
is the unique measure on (R, BR ) such that λ((a, b]) = b − a for all −∞ < a < b < ∞. If
f : R → R is a Borel-measurable function, we say that f is Lebesgue-integrable if
        ∫_{−∞}^{∞} |f (x)| dx := ∫_R |f (x)| λ(dx) < ∞

and write

        ∫_{−∞}^{∞} f (x) dx := ∫_R f (x) λ(dx)

for the Lebesgue integral of f .23

The following result lists important properties of the expectation operator.

Lemma 1.59. Let X and Y be integrable random variables on some probability space (Ω, F, P ).

(a) For a, b ∈ R, aX + bY is again an integrable random variable and

E[aX + bY ] = aE[X] + bE[Y ].

(b) If X ≥ Y P -a.s., then


E[X] ≥ E[Y ]. (1.14)

Moreover, the inequality in (1.14) is an equality if and only if X = Y P -a.s.

Property (a) is referred to as linearity of the expectation and property (b) as monotonicity of
the expectation.

Proof. (a) By the triangle inequality, Lemma 1.54, and the fact that X and Y are integrable
it follows that

E[|aX + bY |] ≤ E[|a||X| + |b||Y |] = |a|E[|X|] + |b|E[|Y |] < ∞.

Thus, aX + bY is integrable. The rest of the claim follows from splitting X, Y , and X + Y
into their positive and negative parts and applying Lemma 1.54; for details see [3, Theorem
4.9(c)].
23 Note that for Riemann-integrable functions, the Lebesgue integral and the Riemann integral coincide; see
[3, Chapter 4.3].
(b) Set Z := X − Y . Then Z ≥ 0 P -a.s. By part (a), it suffices to show that E[Z] ≥ 0, where
the inequality is an equality if and only if Z = 0 P -a.s. Since Z ≥ 0 P -a.s., it follows that
Z − = 0 P -a.s., and so
E[Z] = E[Z + ] − E[Z − ] = E[Z + ].

by Lemma 1.55(a). Since Z + is nonnegative, E[Z + ] ≥ 0 by Definition 1.49. This gives


E[Z] ≥ 0. Moreover, E[Z] = E[Z + ] = 0 if and only if Z + = 0 P -a.s. by Lemma 1.55(a).
Finally Z + = 0 P -a.s. if and only if Z = 0 P -a.s. since Z − = 0 P -a.s.

The next result is a measure theoretic change of variable formula. Its proof (which uses again
the monotone convergence theorem) is left as an exercise.

Proposition 1.60. Let (Ω, F, P ) be a probability space, (Ω′ , F ′ ) a measurable space, and
X : Ω → Ω′ an F -F ′ -measurable map. Moreover, let g : Ω′ → R̄ be an F ′ -BR̄ -measurable
map. Let PX be the image measure of X under P .24 Then g(X) is P -integrable if and only if
g is PX -integrable. Moreover, in this case (or if g ≥ 0) we have the transformation formula:

        ∫_Ω g(X(ω)) P (dω) = ∫_{Ω′} g(ω′ ) PX (dω′ ).

The following lemma considers the special case that X is a random variable with discrete
or continuous distribution. Its proof (which uses Remarks 1.27 and 1.30 together with the
monotone convergence theorem) is left as an exercise.

Lemma 1.61. Let X be a random variable on a probability space (Ω, F, P ) and g : R → R a


measurable function.

(a) If X is discrete with pmf pX , then

        E[|g(X)|] = ∑_x |g(x)| pX (x).    (1.15)

If (1.15) is finite, then

        E[g(X)] = ∑_x g(x) pX (x).
24 This is the probability measure on (Ω′ , F ′ ) defined by PX [A′ ] := P [X ∈ A′ ].
(b) If X is continuous with pdf fX , then
        E[|g(X)|] = ∫_{−∞}^{∞} |g(x)| fX (x) dx.    (1.16)

If (1.16) is finite, then


        E[g(X)] = ∫_{−∞}^{∞} g(x) fX (x) dx.

Example 1.62. Let X ∼ Poisson(λ) and Y := 1/(1 + X). To find the expected value of Y we
can employ (1.15). Indeed, let g : R → R be such that x 7→ 1/(1 + x). Note that X takes values
in N and thus |g(X)| = g(X).

Thus,

        E[g(X)] = ∑_{k=0}^{∞} g(k) pX (k) = ∑_{k=0}^{∞} (1/(1 + k)) e^{−λ} λ^k/k!
                = (e^{−λ}/λ) ∑_{k=0}^{∞} λ^{k+1}/(k + 1)! = (1/λ)(1 − e^{−λ}).

Finally, we link the notion of independence to the concept of expectation; see [3, Theorem
5.4] for a proof (which uses once again the monotone convergence theorem).

Theorem 1.63. Let X and Y be independent integrable random variables on some probability
space (Ω, F, P ). Then XY is also integrable and

E[XY ] = E[X]E[Y ].

1.8 Lp -spaces
In this section, we introduce the key notion of Lp -spaces.

First, we consider the case p ∈ [1, ∞).

Definition 1.64. Let (Ω, F, P ) be a probability space and p ∈ [1, ∞). For an F-measurable
random variable X, set25

        ‖X‖p := E [|X|^p ]^{1/p} .    (1.17)

If ‖X‖p < ∞, we say that X has finite p-th moment (with respect to P ) and call E [X^p ]
the p-th moment of X. We denote the collection of all random variables on (Ω, F) with finite
p-th moment with respect to P by Lp (Ω, F, P ). If there is no danger of confusion, we often
write Lp (P ) or just Lp for Lp (Ω, F, P ).

25 Here, we use the natural convention that ∞^{1/p} = ∞.
Remark 1.65. The reason for the “outside power” 1/p in (1.17) is to ensure that the map
‖ · ‖p : Lp → R is positively homogeneous. Indeed, let λ ≥ 0 and X ∈ Lp . Then linearity of the
expectation gives

        ‖λX‖p = E [|λX|^p ]^{1/p} = (λ^p E [|X|^p ])^{1/p} = λ E [|X|^p ]^{1/p} = λ‖X‖p .

Next, we turn to the case p = ∞.

Definition 1.66. Let (Ω, F, P ) be a probability space. For an F-measurable random variable
X, set
        ‖X‖∞ := inf {K ≥ 0 : |X| ≤ K P -a.s.} .

If ‖X‖∞ < ∞, we say that X is P -a.s.-bounded. We denote the collection of all real-valued
random variables on (Ω, F) that are P -a.s.-bounded by L∞ (Ω, F, P ). If there is no danger of
confusion, we often write L∞ (P ) or just L∞ for L∞ (Ω, F, P ).

The following result shows that Lp -spaces are naturally ordered.

Proposition 1.67. Let 1 ≤ p1 < p2 ≤ ∞. Then Lp2 (Ω, F, P ) ⊂ Lp1 (Ω, F, P ).

Proof. Let X ∈ Lp2 .

First assume that p2 = ∞. Then there is a constant K > 0 such that |X| ≤ K P -a.s.
Monotonicity of the expectation gives E[|X|^{p1} ] ≤ K^{p1} < ∞ and so X ∈ Lp1 .

Next assume that p2 < ∞. The elementary inequality x^{p1} ≤ 1 + x^{p2} for x ≥ 0 together with
monotonicity and linearity of the expectation give

        E[|X|^{p1} ] ≤ E[1 + |X|^{p2} ] ≤ 1 + E[|X|^{p2} ] < ∞.

Thus, X ∈ Lp1 .

This containment can fail when the underlying measure is not finite, i.e., when the whole
space has infinite total mass:

Example 1.68. Consider the measure space (R, B(R), λ) where λ is the Lebesgue-measure.
Let

        f (x) := 1/x if x ≥ 1,  and f (x) := 0 otherwise.

Then f ∈ L2 (R, B(R), λ) but f ∉ L1 (R, B(R), λ), since

        ∫_R |f (x)|^2 λ(dx) = [−1/x]_1^∞ = 1  and  ∫_R |f (x)| λ(dx) = [log(x)]_1^∞ = ∞,

respectively.
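
The same divergence can be seen numerically: truncating both integrals at a growing level T, the L2-mass stabilises near 1 while the L1-mass keeps growing like log T. A crude Python sketch (midpoint rule; accuracy is not the point):

```python
def integrate(g, a, b, n=100_000):
    # Crude midpoint rule for the integral of g over [a, b].
    h = (b - a) / n
    return sum(g(a + (i + 0.5) * h) for i in range(n)) * h

f = lambda x: 1.0 / x

results = {}
for T in (10.0, 1000.0, 100000.0):
    results[T] = (integrate(lambda x: f(x)**2, 1.0, T),   # stays near 1
                  integrate(f, 1.0, T))                   # grows like log T

print(results)
```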

We proceed to state the important inequalities of Hölder and Minkowski; for a proof see [3,
Theorems 7.16 and 7.17]. Hölder’s inequality shows that the product of random variables that
lie in Lp -spaces with conjugate exponents is integrable. Here, p, q ∈ [1, ∞] are called conjugate
if 1/p + 1/q = 1, with the convention that 1/∞ := 0.

Theorem 1.69 (Hölder’s inequality). Let (Ω, F, P ) be a probability space and X, Y be random
variables with X ∈ Lp (P ) and Y ∈ Lq (P ), where p, q ∈ [1, ∞] and 1/p + 1/q = 1. Then
XY ∈ L1 (P ) and

        ‖XY ‖1 ≤ ‖X‖p ‖Y ‖q .

Minkowski’s inequality shows that the map ‖ · ‖p : Lp → R+ satisfies the triangle inequality.

Theorem 1.70 (Minkowski’s inequality). Let (Ω, F, P ) be a probability space, p ∈ [1, ∞] and
X, Y ∈ Lp (P ). Then

        ‖X + Y ‖p ≤ ‖X‖p + ‖Y ‖p .
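
Both inequalities are easy to test on a finite sample space; the following Python sketch checks them for arbitrary random data under the uniform measure, with the conjugate pair p = 3, q = 3/2:

```python
import random

random.seed(0)
n = 50   # a uniform probability measure on n sample points

X = [random.uniform(-2, 2) for _ in range(n)]
Y = [random.uniform(-2, 2) for _ in range(n)]

def norm(Z, p):
    # ||Z||_p = E[|Z|^p]^(1/p) under the uniform measure on n points.
    return (sum(abs(z)**p for z in Z) / len(Z))**(1.0 / p)

p, q = 3.0, 1.5   # conjugate exponents: 1/3 + 2/3 = 1

holder_lhs = sum(abs(x * y) for x, y in zip(X, Y)) / n    # ||XY||_1
holder_rhs = norm(X, p) * norm(Y, q)

minkowski_lhs = norm([x + y for x, y in zip(X, Y)], p)
minkowski_rhs = norm(X, p) + norm(Y, p)

print(holder_lhs <= holder_rhs, minkowski_lhs <= minkowski_rhs)   # True True
```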

One important consequence of Minkowski’s inequality is that the map ‖ · ‖p is a norm and
each Lp (Ω, F, P ) is a normed vector space.26 One can even show that the metric/topology
induced by ‖ · ‖p is complete, so that each Lp is in fact a Banach space. This is the content
of the following result; for a proof see [3, Theorem 7.18].

Theorem 1.71 (Fischer-Riesz). Let (Ω, F, P ) be a probability space, p ∈ [1, ∞], and (Xn )n∈N
a Cauchy sequence in Lp , i.e., for each ε > 0, there is N ∈ N such that for all m, n ≥ N ,

        ‖Xn − Xm ‖p < ε.

Then there is X ∈ Lp with

        lim_{n→∞} ‖Xn − X‖p = 0.

1.9 Variance and covariance


In this section, we study the variance and covariance of random variables.
26 To be precise, this also requires identifying random variables that coincide P -a.s., i.e., one has to pass
from random variables to equivalence classes of random variables that coincide P -a.s. Some authors indicate
this by writing Lp (Ω, F, P ) for the space of equivalence classes and reserving a different symbol for the space
of random variables themselves. However, we will always work with random variables.
Definition 1.72. Let (Ω, F, P ) be a probability space and X, Y ∈ L2 (P ).

(a) The variance of X is defined by

        Var[X] := E [(X − E [X])^2 ] .

(b) The covariance of X and Y is defined by

Cov[X, Y ] := E [(X − E [X])(Y − E[Y ])] .

(c) X and Y are called uncorrelated if Cov[X, Y ] = 0 and correlated if Cov[X, Y ] 6= 0.

The following result lists some elementary properties of the variance/covariance; its proof is
left as an exercise.

Proposition 1.73. Let (Ω, F, P ) be a probability space, X, Y ∈ L2 and a, b, c, d ∈ R.

(a) Cov[X, Y ] = E [XY ] − E[X]E[Y ].

(b) Cov[aX + b, cY + d] = ac Cov[X, Y ].

(c) Var[X] = 0 if and only if X is P -a.s.-constant.

We proceed to show that independent random variables are uncorrelated.

Proposition 1.74. Let (Ω, F, P ) be a probability space and X, Y ∈ L2 . If X and Y are


independent, then they are uncorrelated.

Proof. This follows immediately from Proposition 1.73(a) and Theorem 1.63.

The converse of Proposition 1.74 is false:


Example 1.75. Let (Ω = {1, 2, 3}, 2Ω , P ) be a probability space such that P [{ω}] = 1/3 for
each ω ∈ Ω. Define two random variables by

        X(1) = 1,  X(2) = 0,  X(3) = −1;
        Y (1) = 0,  Y (2) = 1,  Y (3) = 0.

Then E[X] = 0, E[XY ] = 0 and so X and Y are uncorrelated. However,

        P [X = 1, Y = 1] = 0 ≠ (1/3) · (1/3) = P [X = 1]P [Y = 1],

and therefore X and Y are not independent.
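
Example 1.75 can be verified by direct enumeration; the Python sketch below recomputes the relevant quantities with exact fractions:

```python
from fractions import Fraction

# The three equally likely outcomes of Example 1.75 and the values of X and Y.
omega = [1, 2, 3]
X = {1: 1, 2: 0, 3: -1}
Y = {1: 0, 2: 1, 3: 0}
P = Fraction(1, 3)

E_X = sum(P * X[w] for w in omega)
E_Y = sum(P * Y[w] for w in omega)
E_XY = sum(P * X[w] * Y[w] for w in omega)

cov = E_XY - E_X * E_Y                                      # = 0: uncorrelated
p_joint = sum(P for w in omega if X[w] == 1 and Y[w] == 1)  # P[X = 1, Y = 1]
p_prod = (sum(P for w in omega if X[w] == 1)
          * sum(P for w in omega if Y[w] == 1))             # P[X = 1] P[Y = 1]

print(cov, p_joint, p_prod)
```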

We proceed to calculate the variance of the sum of random variables.

Proposition 1.76. Let (Ω, F, P ) be a probability space and X1 , . . . , Xn ∈ L2 . Then

        Var[ ∑_{k=1}^{n} Xk ] = ∑_{k=1}^{n} Var[Xk ] + 2 ∑_{1≤i<j≤n} Cov[Xi , Xj ].    (1.18)

In particular, if X1 , . . . , Xn are uncorrelated, the Bienaymé formula holds:

        Var[ ∑_{k=1}^{n} Xk ] = ∑_{k=1}^{n} Var[Xk ].

Proof. Linearity of the expectation and the fact that Cov[Xi , Xj ] = Cov[Xj , Xi ] yield

        Var[ ∑_{k=1}^{n} Xk ] = E[ ( ∑_{k=1}^{n} (Xk − E[Xk ]) ) ( ∑_{k=1}^{n} (Xk − E[Xk ]) ) ]
                             = ∑_{i=1}^{n} ∑_{j=1}^{n} E [(Xi − E[Xi ])(Xj − E[Xj ])]
                             = ∑_{i=1}^{n} ∑_{j=1}^{n} Cov[Xi , Xj ] = ∑_{k=1}^{n} Var[Xk ] + ∑_{i,j=1, i≠j}^{n} Cov[Xi , Xj ]
                             = ∑_{k=1}^{n} Var[Xk ] + 2 ∑_{1≤i<j≤n} Cov[Xi , Xj ].

1.10 The inequalities of Markov and Jensen


In this section, we study some key inequalities of probability theory.

First, we establish the elementary but important Markov inequality.

Theorem 1.77 (Markov’s inequality). Let X be a random variable on some probability space
(Ω, F, P ) and f : [0, ∞) → [0, ∞) a nondecreasing function.27 Then for any ε > 0 with
f (ε) > 0, we have the Markov inequality:

        P [|X| ≥ ε] ≤ E [f (|X|)] / f (ε) .

27 Note that f is then automatically measurable. Indeed, for x ∈ R, f −1 ([0, x]) ∈ B[0,∞) because it is of
the form [0, a) or [0, a] for some a ∈ [0, ∞].

Proof. Monotonicity and linearity of the expectation together with the fact that f is
nondecreasing give

        E [f (|X|)] ≥ E [f (|X|) 1{f (|X|)≥f (ε)} ] ≥ E [f (ε) 1{f (|X|)≥f (ε)} ]
                   ≥ E [f (ε) 1{|X|≥ε} ] = f (ε) P [|X| ≥ ε].

Now the claim follows by rearrangement.

Remark 1.78. We can look at Markov’s inequality in a different light. Let X be a random
variable on some probability space (Ω, F, P ). Moreover, let ε := cE[|X|] for some c ∈ R+ .
Then, by Markov’s inequality,
        P [|X| ≥ cE[|X|]] ≤ 1/c .

Moreover, knowing that P [|X| ≥ α] = β for some α ∈ R+ , we can deduce that E[|X|] ≥ αβ.

The case f (x) = x2 is of special importance.

Corollary 1.79 (Chebyshev’s inequality). Let X be a real-valued random variable on some


probability space (Ω, F, P ) and assume that X ∈ L2 (P ). Then for ε > 0, we have the Cheby-
shev inequality:
        P [|X − E [X]| ≥ ε] ≤ Var[X] / ε^2 .

Example 1.80. Suppose it is known that the number of items produced in a factory during a
week is a random variable X on a probability space (Ω, F, P ) such that X ∈ L2 and E[X] = 50.
Let A be defined as the event

A := {This week’s production is between 40 and 60}.

If the variance of this week’s production is 25, then what can be said about P [A]?

By Chebyshev’s inequality,

        P [A] = P [|X − 50| < 10] = 1 − P [|X − E[X]| ≥ 10]
              ≥ 1 − Var[X]/10^2 = 3/4.
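
The bound can also be compared with a concrete model by simulation. Note that the example only fixes the mean and the variance, so the model chosen below (a normal distribution) is hypothetical; Chebyshev's bound holds for any choice.

```python
import random

random.seed(42)

# Example 1.80 only fixes E[X] = 50 and Var[X] = 25. As one hypothetical model
# consistent with this, take X ~ Normal(50, sd = 5) and estimate P[A] by Monte Carlo.
N = 100_000
samples = [random.gauss(50.0, 5.0) for _ in range(N)]

p_A = sum(1 for x in samples if abs(x - 50.0) < 10.0) / N
chebyshev_bound = 1.0 - 25.0 / 10.0**2   # = 3/4

print(p_A, chebyshev_bound)   # roughly 0.954 >= 0.75
```

For this particular model the true probability is about 0.954, comfortably above the distribution-free bound 3/4.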

For the next inequality, we need to recall the notion of a convex function.

Definition 1.81. Let D ⊂ R be a non-empty interval. A function f : D → R is called convex
if

        f (λx1 + (1 − λ)x2 ) ≤ λf (x1 ) + (1 − λ)f (x2 ),  x1 , x2 ∈ D, λ ∈ [0, 1].    (1.19)

It is called strictly convex if the inequality in (1.19) is strict for x1 6= x2 and λ ∈ (0, 1).

Graphically speaking, (strict) convexity means that straight line segments joining (x1 , f (x1 ))
to (x2 , f (x2 )) always lie (strictly) above the graph of f .

Figure 1: Visualisation of Jensen’s inequality.

Remark 1.82. (a) If f is convex, then f is automatically continuous in the interior of D; see
[3, Theorem 7.7(i)].

(b) If f is (strictly) convex, then −f is called (strictly) concave.

(c) If f : D → R is twice continuously differentiable, then f is convex if and only if f ′′ ≥ 0 in
the interior of D. Moreover, it is strictly convex if f ′′ > 0 in the interior of D.28

We proceed to state and prove the fundamental inequality for convex functions.

Theorem 1.83 (Jensen’s inequality). Let (Ω, F, P ) be a probability space and X an integrable
random variable with values in a non-empty interval D ⊂ R. Let f : D → R be convex and
suppose that f is nonnegative or E [|f (X)|] < ∞. Then

E [f (X)] ≥ f (E [X]) .

Moreover, the inequality is strict when f is strictly convex and X is not P -a.s. constant.
28 The converse is not true: for example, the function f : R → R, x 7→ x^4 is strictly convex, but f ′′ (0) = 0.
Proof. The claim is trivial if f is nonnegative and E [f (X)] = ∞. So it suffices to consider
the case E [|f (X)|] < ∞.

First, using the definition of convexity, one can show that for each a ∈ D, there is b ∈ R such
that
f (x) ≥ f (a) + b(x − a), (1.20)

where the inequality in (1.20) is strict for x 6= a if f is strictly convex.29

Next, choose a := E [X]. One can show that a ∈ D because D is an interval. Let b ∈ R be
such that (1.20) is satisfied. Then

f (X) ≥ f (a) + b(X − a)

and
P [f (X) > f (a) + b(X − a)] = P [X 6= a] > 0,

if f is strictly convex and X is not P -a.s. constant. Thus, by monotonicity and linearity of
the integral and the fact that a = E [X],

E [f (X)] ≥ E [f (a) + b(X − a)] = f (a) + b(E [X] − a) = f (a) + b(a − a) = f (a)
= f (E [X]),

where the inequality is strict if f is strictly convex and X is not P -a.s. constant.
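
A quick empirical illustration of Jensen's inequality (with arbitrary random data of our own choosing) for the strictly convex function f (x) = x^2, for which the gap E[f (X)] − f (E[X]) is precisely Var[X]:

```python
import random

random.seed(1)

# A non-constant random variable: (empirically) uniform on a finite set of points.
xs = [random.uniform(-1.0, 3.0) for _ in range(10_000)]

f = lambda t: t * t   # strictly convex

E_X = sum(xs) / len(xs)
E_fX = sum(f(t) for t in xs) / len(xs)

# Jensen: E[f(X)] >= f(E[X]); here the gap equals the empirical variance of X.
gap = E_fX - f(E_X)
print(E_fX, f(E_X), gap)
```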

1.11 Product spaces and Fubini’s theorem


In this section, we study products of probability spaces and formulate the important Theorem
of Fubini.

First, we consider the product of sample spaces.

Definition 1.84. Let Ω1 , . . . , ΩN be sample spaces. Set

Ω := {ω = (ω1 , . . . , ωN ) : ω1 ∈ Ω1 , . . . , ωN ∈ ΩN }.
29 If f is twice continuously differentiable, the (weak) inequality (1.20) can be easily derived as follows: Fix
a ∈ D and set b := f ′ (a). By a Taylor expansion of f around a of order 1 with Lagrange remainder term, we
obtain for fixed x ∈ D

        f (x) = f (a) + b(x − a) + (1/2) f ′′ (ξ)(x − a)^2 ,

where ξ lies in the interval with endpoints x and a. Since f ′′ ≥ 0 by convexity of f , (1.20) follows.
33
Then Ω is called the product sample space of Ω1 , . . . , ΩN and denoted by

        Ω := ×_{n=1}^{N} Ωn := Ω1 × · · · × ΩN .

If Ωi = Ω0 for all i ∈ {1, . . . , N }, we also write Ω := Ω0^N .

Example 1.85. Consider rolling a die 3 times. Then this can be modelled by the sample
space Ω := {1, . . . , 6}3 := {(ω1 , ω2 , ω3 ) : ω1 , ω2 , ω3 ∈ {1, . . . , 6}}.

Next, we consider the product of σ-algebras.

Definition 1.86. Let (Ω1 , F1 ), . . . , (ΩN , FN ) be measurable spaces. Set

F := σ ({A1 × · · · × AN : A1 ∈ F1 , . . . , AN ∈ FN }) .

Then F is called the product σ-algebra of F1 , . . . , FN and denoted by

        F := ⊗_{n=1}^{N} Fn := F1 ⊗ · · · ⊗ FN .

If (Ωn , Fn ) = (Ω0 , F0 ) for all n ∈ {1, . . . , N }, we also write F := F0^{⊗N} .

Intuitively, the product σ-algebra is the smallest σ-algebra such that all rectangular sets A1 ×
· · · × AN of Ω with A1 ∈ F1 , . . . , AN ∈ FN are measurable.

Finally, we consider the product of probability measures. The following result establishes
existence and uniqueness of the product measure; for a proof see [3, Theorem 14.14].

Theorem 1.87. Let (Ω1 , F1 , P1 ), . . . , (ΩN , FN , PN ) be probability spaces. Then there exists a
unique probability measure P on ( ×_{n=1}^{N} Ωn , ⊗_{n=1}^{N} Fn ) such that

        P [A1 × · · · × AN ] = ∏_{n=1}^{N} Pn [An ],  A1 ∈ F1 , . . . , AN ∈ FN .    (1.21)

It is called the product measure of P1 , . . . , PN and denoted by

        P := ⊗_{n=1}^{N} Pn := P1 ⊗ · · · ⊗ PN .

If (Ωi , Fi , Pi ) = (Ω0 , F0 , P0 ) for all i ∈ {1, . . . , N }, we also write P := P0^{⊗N} .

We proceed to study the expectation of random variables with respect to the product measure
of two probability measures. In this case the integral notation is more handy; cf. Remark

1.58. The following result shows that instead of integrating over the product measure (which
we do not know explicitly), we can also integrate first over one measure and then over the
other. Moreover, the order of integration does not matter; for a proof see [3, Theorem 14.16].

Theorem 1.88 (Fubini). Let (Ω1 , F1 , P1 ) and (Ω2 , F2 , P2 ) be probability spaces. Let X :
Ω1 × Ω2 → R̄ be F1 ⊗ F2 -measurable. Assume that either X ≥ 0 P1 ⊗ P2 -almost surely or
X ∈ L1 (P1 ⊗ P2 ). Then

• The map ω1 7→ ∫_{Ω2} X(ω1 , ω2 )P2 (dω2 ) is F1 -measurable, and P1 -integrable in case that
X ∈ L1 (P1 ⊗ P2 ).30

• The map ω2 7→ ∫_{Ω1} X(ω1 , ω2 )P1 (dω1 ) is F2 -measurable, and P2 -integrable in case that
X ∈ L1 (P1 ⊗ P2 ).31

• We have the identity

        ∫_{Ω1 ×Ω2} X(ω1 , ω2 ) (P1 ⊗ P2 )(d(ω1 , ω2 )) = ∫_{Ω1} ( ∫_{Ω2} X(ω1 , ω2 )P2 (dω2 ) ) P1 (dω1 )
                                                    = ∫_{Ω2} ( ∫_{Ω1} X(ω1 , ω2 )P1 (dω1 ) ) P2 (dω2 ).
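
For finite sample spaces the product measure is an explicit double sum, and Fubini's theorem reduces to swapping the order of summation. A small Python sketch with hypothetical data:

```python
from fractions import Fraction

# Two finite probability spaces: a biased coin and a fair three-valued spinner.
P1 = {'H': Fraction(2, 3), 'T': Fraction(1, 3)}
P2 = {1: Fraction(1, 3), 2: Fraction(1, 3), 3: Fraction(1, 3)}

def X(w1, w2):
    # A hypothetical integrand on the product space, just for illustration.
    return (2 if w1 == 'H' else -1) * w2

# Integral against the product measure P1 (x) P2 (a finite double sum).
product_int = sum(P1[w1] * P2[w2] * X(w1, w2) for w1 in P1 for w2 in P2)

# The two iterated integrals, in either order.
inner_w2_first = sum(P1[w1] * sum(P2[w2] * X(w1, w2) for w2 in P2) for w1 in P1)
inner_w1_first = sum(P2[w2] * sum(P1[w1] * X(w1, w2) for w1 in P1) for w2 in P2)

print(product_int, inner_w2_first, inner_w1_first)   # all equal to 2
```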

Remark 1.89. (a) Fubini’s theorem also holds more generally, if we replace P1 and P2 by
σ-finite measures µ1 and µ2 . Of special importance is the case if µ1 and µ2 are the Lebesgue
measure; see [3, Section 14.2] for details.

(b) The notion of product sample spaces, product σ-algebras, and product measures can be
extended to countable (and even uncountable) families of probability spaces; see [3, Chapter
14] for details.

30 Here, we agree that ∫_{Ω2} X(ω1 , ω2 )P2 (dω2 ) := −∞ if ∫_{Ω2} |X(ω1 , ω2 )|P2 (dω2 ) = ∞.
31 Here, we agree that ∫_{Ω1} X(ω1 , ω2 )P1 (dω1 ) := −∞ if ∫_{Ω1} |X(ω1 , ω2 )|P1 (dω1 ) = ∞.
2 Sequences of random variables and limit theorems
2.1 Convergence of random variables
In this section, we look at different types of convergence for random variables.

Definition 2.1. Let (Xn)n∈N be a sequence of random variables on a probability space (Ω, F, P). Then (Xn)n∈N is said to converge to a random variable X

• in Lp, where p ∈ [1, ∞], if each Xn ∈ Lp(P) and

  lim_{n→∞} ‖Xn − X‖p = 0.

  In this case, we write Xn →^{Lp} X.

• almost surely, if there is a P-nullset N ∈ F such that

  lim_{n→∞} Xn(ω) = X(ω) for all ω ∈ Ω \ N.

  In this case, we write Xn →^{a.s.} X.

• in probability, if for each ε > 0,

  lim_{n→∞} P[|Xn − X| > ε] = 0.

  In this case, we write Xn →^{P} X.

• in distribution, if

  lim_{n→∞} FXn(x) = FX(x) for all x ∈ R such that FX is continuous at x.

  In this case, we write Xn ⇒ X.
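A classic example separating these notions (the distribution and the numerical choices below are illustrative, not from the notes): let Xn ~ Bernoulli(1/n) be independent. Then P[|Xn − 0| > ε] = 1/n → 0 for ε ∈ (0, 1), so Xn → 0 in probability (and in L1); yet since ∑ 1/n = ∞, the second Borel–Cantelli lemma gives Xn = 1 infinitely often almost surely, so (Xn) does not converge almost surely.

```python
# Monte Carlo sketch: the tail probabilities P[|X_n| > eps] = 1/n
# for independent X_n ~ Bernoulli(1/n) decay to 0 as n grows.
import random

random.seed(0)
n_paths = 2000

def tail_prob(n, eps=0.5):
    """Monte Carlo estimate of P[|X_n| > eps] for X_n ~ Bernoulli(1/n)."""
    hits = sum(1 for _ in range(n_paths) if random.random() < 1.0 / n)
    return hits / n_paths

estimates = [tail_prob(n) for n in (10, 100, 1000)]  # should decay like 1/n
```

The simulation only exhibits the convergence in probability; the failure of almost sure convergence is a statement about infinitely many indices and is established by Borel–Cantelli, not by simulation.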

Remark 2.2. (a) If (Xn)n∈N converges in Lp, almost surely, or in probability, the limiting random variable X is P-almost surely unique. In light of Proposition 2.3 below, it suffices to show this for the case of convergence in probability. So suppose that (Xn)n∈N converges in probability to X and to X′. Let ε > 0 be given. Using that for each n ∈ N

  {|X − X′| > ε} ⊂ {|X − Xn| > ε/2} ∪ {|X′ − Xn| > ε/2},

we obtain

  P[|X − X′| > ε] ≤ lim sup_{n→∞} ( P[|X − Xn| > ε/2] + P[|X′ − Xn| > ε/2] ) = 0.

Since ε > 0 was arbitrary, we may conclude that X = X′ P-a.s.
(b) Since |X| ≤ |X − Xn| + |Xn| for each n ∈ N, Xn ∈ Lp together with Xn →^{Lp} X and Minkowski’s inequality give X ∈ Lp. Moreover, using also that |Xn| ≤ |X − Xn| + |X| for each n ∈ N, we get by Minkowski’s inequality and properties of the limit inferior and the limit superior32

  ‖X‖p ≤ lim inf_{n→∞} ‖|X − Xn| + |Xn|‖p ≤ lim inf_{n→∞} (‖X − Xn‖p + ‖Xn‖p) = lim inf_{n→∞} ‖Xn‖p
       ≤ lim sup_{n→∞} ‖Xn‖p ≤ lim sup_{n→∞} ‖|X − Xn| + |X|‖p ≤ lim sup_{n→∞} (‖X − Xn‖p + ‖X‖p) = ‖X‖p.

This implies that

  lim_{n→∞} ‖Xn‖p = ‖X‖p < ∞.

(c) Using Markov’s inequality (Theorem 1.77), it is not difficult to check that Xn →^{P} X if and only if33

  lim_{n→∞} E[|Xn − X| ∧ 1] = 0.

This alternative characterisation shows that the topology induced by convergence in probability is metrisable with metric d(X, Y) = E[|X − Y| ∧ 1].

(d) One can show with some effort (see [3, Theorem 13.23]) that Xn ⇒ X if and only if for all bounded continuous functions f : R → R,

  lim_{n→∞} E[f(Xn)] = E[f(X)].

This alternative characterisation explains why convergence in distribution is also called weak convergence.34
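It is instructive to see why Definition 2.1 only requires convergence of the CDFs at continuity points of FX. The following sketch uses the concrete (illustrative) choice Xn := 1/n deterministic and X := 0: here Xn ⇒ X should certainly hold, yet F_{Xn}(0) = 0 for every n while FX(0) = 1.

```python
# Sketch: the CDFs of the point masses at 1/n converge to the CDF of
# the point mass at 0 everywhere EXCEPT at the discontinuity x = 0.
def F_Xn(x, n):
    """CDF of the point mass at 1/n."""
    return 1.0 if x >= 1.0 / n else 0.0

def F_X(x):
    """CDF of the point mass at 0."""
    return 1.0 if x >= 0 else 0.0

# at continuity points x != 0 of F_X, F_{X_n}(x) -> F_X(x) ...
ok_above = all(F_Xn(0.5, n) == F_X(0.5) for n in range(3, 100))
ok_below = all(F_Xn(-0.5, n) == F_X(-0.5) for n in range(1, 100))
# ... but at the discontinuity x = 0 the convergence fails:
fails_at_zero = (F_Xn(0.0, 10**6) == 0.0) and (F_X(0.0) == 1.0)
```

Excluding discontinuity points of FX from the definition is exactly what makes this natural example a case of convergence in distribution.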

We proceed to study the relationship between the different types of convergence. First, it is
not difficult to check that Lp convergence does not imply almost sure convergence and vice
versa.
32 More precisely, we use that if (an)n∈N and (bn)n∈N are sequences of real numbers, where (bn)n∈N is convergent, then lim inf_{n→∞}(an + bn) = lim inf_{n→∞} an + lim_{n→∞} bn and lim sup_{n→∞}(an + bn) = lim sup_{n→∞} an + lim_{n→∞} bn. Here, we only show the claim for the limit inferior. Since lim inf_{n→∞}(an + bn) ≥ lim inf_{n→∞} an + lim inf_{n→∞} bn without any assumptions on (bn)n∈N, it suffices to show that lim inf_{n→∞}(an + bn) ≤ lim inf_{n→∞} an + lim_{n→∞} bn. Let ε > 0 be given and set b := lim_{n→∞} bn. Then there is N ∈ N such that bn ≤ b + ε for all n ≥ N. Hence, by the definition of the limit inferior,

  lim inf_{n→∞}(an + bn) = lim_{n→∞} inf_{k≥n}(ak + bk) ≤ lim_{n→∞} inf_{k≥n}(ak + b + ε) = lim inf_{n→∞} an + (b + ε).

Now the claim follows from letting ε → 0.
33 Recall that x ∧ y := min(x, y) and x ∨ y := max(x, y) for x, y ∈ R.
34 Note, however, that from a functional analysis perspective, this “weak convergence” is in fact weak∗-convergence.

Next, we show that convergence in Lp and almost sure convergence both imply convergence in probability. By contrast, it is not difficult to check that convergence in probability implies neither Lp-convergence nor almost sure convergence.

Proposition 2.3. Let (Ω, F, P ) be a probability space, (Xn )n∈N a sequence of random variables
and X a random variable.

(a) If (Xn )n∈N converges to X in Lp for p ∈ [1, ∞], then it converges to X in probability.

(b) If (Xn )n∈N converges to X almost surely, then it converges to X in probability.

Proof. (a) The case p = ∞ is easy and left as an exercise. Assume that p < ∞. Let ε > 0. Markov’s inequality (Theorem 1.77) with f(x) = x^p and the fact that Xn →^{Lp} X give

  lim sup_{n→∞} P[|Xn − X| ≥ ε] ≤ lim sup_{n→∞} E[|X − Xn|^p]/ε^p = lim sup_{n→∞} (1/ε^p) ‖Xn − X‖p^p = 0.

(b) Let ε > 0. For n ∈ N, set An := {|Xn − X| > ε}. Since Xn →^{a.s.} X, it follows that P[{An i.o.}] = 0. Hence, by Exercise 1.2(c), we have

  lim sup_{n→∞} P[|Xn − X| > ε] = lim sup_{n→∞} P[An] ≤ P[{An i.o.}] = 0.

Finally, we show that convergence in probability implies convergence in distribution. By contrast, it is not difficult to check that convergence in distribution does not imply convergence in probability.
Proposition 2.4. Let (Ω, F, P ) be a probability space, (Xn )n∈N a sequence of random variables
and X a random variable. If (Xn )n∈N converges to X in probability, then it converges to X
in distribution.

Proof. Let x ∈ R be a continuity point of FX. We have to show that

  lim_{n→∞} FXn(x) = FX(x).

Using that A = (A ∩ B) ∪ (A ∩ Bᶜ) ⊂ B ∪ (A ∩ Bᶜ) for all A, B ∈ F, we obtain for fixed ε > 0 and n ∈ N,

  {X ≤ x − ε} ⊂ {Xn ≤ x} ∪ {X ≤ x − ε, Xn > x} ⊂ {Xn ≤ x} ∪ {|X − Xn| ≥ ε},
  {Xn ≤ x} ⊂ {X ≤ x + ε} ∪ {Xn ≤ x, X > x + ε} ⊂ {X ≤ x + ε} ∪ {|X − Xn| ≥ ε}.

Taking probabilities, we get

  FX(x − ε) ≤ FXn(x) + P[|Xn − X| ≥ ε],
  FXn(x) ≤ FX(x + ε) + P[|Xn − X| ≥ ε].

Letting n → ∞ and using that Xn →^{P} X, we obtain

  FX(x − ε) ≤ lim inf_{n→∞} FXn(x) ≤ lim sup_{n→∞} FXn(x) ≤ FX(x + ε).

Now the claim follows by letting ε → 0 and using that FX is continuous at x.

2.2 Uniform integrability and the dominated convergence theorem

In this section, we try to understand under which conditions almost sure convergence implies convergence in L1, i.e., under which conditions limits and expectations can be interchanged. The key ingredient is the notion of uniform integrability of a family of random variables.

Definition 2.5. A family (Xi)i∈I of random variables on some probability space (Ω, F, P) is said to be uniformly integrable (UI) if

  lim_{K→∞} sup_{i∈I} E[ |Xi| 1_{{|Xi|≥K}} ] = 0.

It is not difficult to check that a single random variable is uniformly integrable if and only
if it is integrable. The following result lists some further simple criteria to check for uniform
integrability. Its proof is left as an exercise.

Lemma 2.6. Let (Ω, F, P ) be a probability space and (Xi )i∈I a family of integrable random
variables.

(a) If there is X ∈ L1 with |Xi | ≤ X P -a.s. for all i ∈ I, then (Xi )i∈I is uniformly
integrable.

(b) If I is a finite index set, then (Xi )i∈I is uniformly integrable.

(c) If (Xi )i∈I is uniformly integrable and X ∈ L1 , then (Xi + X)i∈I is again uniformly
integrable.
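A family that is bounded in L1 but not uniformly integrable can be checked directly against Definition 2.5 when everything is computable in closed form. The two-point law below is an illustrative choice, not from the notes: let P[Xn = n] = 1/n and P[Xn = 0] = 1 − 1/n. Then E[|Xn|] = 1 for all n, so the family is bounded in L1; but for every K, the tail expectation E[|Xn| 1_{|Xn|≥K}] equals 1 as soon as n ≥ K, so the supremum over n never vanishes.

```python
# Exact check of Definition 2.5 for the two-point family above:
# the supremum of the tail expectations stays equal to 1 for every K,
# so the family is L^1-bounded but NOT uniformly integrable.
def tail_expectation(n, K):
    """E[|X_n| 1_{|X_n| >= K}]: the atom at n contributes n * (1/n) = 1."""
    return 1.0 if n >= K else 0.0

sups = {K: max(tail_expectation(n, K) for n in range(1, 10 * K))
        for K in (10, 100, 1000)}
```

This is also a minimal example showing that boundedness in L1 alone (condition (2) of Theorem 2.7 without the ε-δ part) does not imply uniform integrability.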

The following result gives two useful equivalent characterisations of uniform integrability; for a proof, see [3, Theorems 6.19 and 6.24].

Theorem 2.7 (de la Vallée-Poussin). Let (Xi)i∈I be a family of random variables on some probability space (Ω, F, P). Then the following are equivalent.

(1) (Xi)i∈I is uniformly integrable.

(2) (Xi)i∈I is bounded in L1, i.e., sup_{i∈I} E[|Xi|] < ∞, and for each ε > 0, there exists δ > 0 such that

  P[A] ≤ δ =⇒ E[|Xi| 1A] ≤ ε for all i ∈ I.

(3) There exists a nondecreasing convex function H : [0, ∞) → [0, ∞) with lim_{x→∞} H(x)/x = ∞ such that sup_{i∈I} E[H(|Xi|)] < ∞.

We note an important corollary, which gives one of the most useful criteria in practice for checking that a family of random variables is UI.

Corollary 2.8. Let (Xi )i∈I be a family of random variables on some probability space (Ω, F, P )
and p ∈ (1, ∞]. Suppose that (Xi )i∈I is bounded in Lp , i.e., supi∈I kXi kp < ∞. Then (Xi )i∈I
is uniformly integrable.
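The corollary can also be verified directly from Definition 2.5, without invoking Theorem 2.7. The following sketch (for p ∈ (1, ∞); the case p = ∞ is immediate, since then the tail expectation vanishes for K > sup_i ‖Xi‖∞) uses only Hölder’s inequality and Markov’s inequality (Theorem 1.77):

```latex
% Let M := \sup_{i \in I} \|X_i\|_p < \infty. Hölder with exponents
% (p, p/(p-1)) and then Markov's inequality give, for every K > 0,
\mathbb{E}\bigl[|X_i|\,\mathbf{1}_{\{|X_i|\ge K\}}\bigr]
  \le \|X_i\|_p\,\mathbb{P}[|X_i|\ge K]^{1-\frac{1}{p}}
  \le M\left(\frac{M^p}{K^p}\right)^{\frac{p-1}{p}}
  = \frac{M^p}{K^{p-1}},
% which tends to 0 as K \to \infty, uniformly in i \in I.
```

Alternatively, condition (3) of Theorem 2.7 applies directly with H(x) = x^p.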

With the help of Theorem 2.7, we can now show that a sequence of integrable random variables that converges almost surely converges in L1 if and only if it is uniformly integrable.

Theorem 2.9. Let (Ω, F, P) be a probability space and (Xn)n∈N a sequence of random variables that converges almost surely to a random variable X.35 Then the following are equivalent.

(1) (Xn )n∈N is uniformly integrable.

(2) (Xn )n∈N converges to X in L1 .

Proof. We shall only prove the more important direction “(1) ⇒ (2)”; the other direction is left as an exercise.

“(1) ⇒ (2)”. First we show that X is integrable. Using that the Xn are bounded in L1 by Theorem 2.7(2), Fatou’s lemma gives

  E[|X|] = E[ lim inf_{n→∞} |Xn| ] ≤ lim inf_{n→∞} E[|Xn|] ≤ sup_{n∈N} E[|Xn|] < ∞.

Next, set Yn := Xn − X for n ∈ N. Then Yn converges to 0 almost surely, and (Yn)n∈N is UI by Lemma 2.6. Let ε > 0 be given. Then for each n ∈ N,

  E[|Yn|] = E[|Yn| 1_{{|Yn|≤ε}}] + E[|Yn| 1_{{|Yn|>ε}}] ≤ ε + E[|Yn| 1_{{|Yn|>ε}}]. (2.1)

35 With slightly more work, one can show that the result still holds if we only assume that (Xn)n∈N converges to X in probability; see [3, Theorem 6.25] for details.

Moreover, by Theorem 2.7(2), there is δ > 0 such that

  P[A] ≤ δ =⇒ E[|Yn| 1A] ≤ ε for all n ∈ N. (2.2)

Now using that Yn converges to 0 almost surely, and hence in probability by Proposition 2.3,
there is N ∈ N such that P [|Yn | > ε] ≤ δ for all n ≥ N . Combining this with (2.1) and (2.2),
we obtain
E [|Yn |] ≤ 2ε for all n ≥ N.

Since ε > 0 was arbitrary, we may conclude that Yn converges to 0 in L1 and hence Xn
converges to X in L1 .

The following result follows immediately from Theorem 2.9 and Lemma 2.6(a). It is known
as the dominated convergence theorem.

Theorem 2.10 (Dominated convergence theorem). Let (Ω, F, P) be a probability space and (Xn)n∈N a sequence of integrable random variables that converges almost surely to a random variable X. Suppose that there is an integrable random variable Y such that |Xn| ≤ Y P-a.s. for all n ∈ N. Then Xn converges to X in L1.

Remark 2.11. The direction “(1) ⇒ (2)” in Theorem 2.9 is often referred to as generalised
dominated convergence theorem.
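To see why some control such as domination or uniform integrability is needed, consider the standard counterexample (not from the notes; the law of Xn and all numerical tolerances below are illustrative): on ([0, 1], Borel, Lebesgue), let Xn := n·1_{[0,1/n]}. Then Xn → 0 almost surely, yet E[Xn] = 1 for every n, so Xn does not converge to 0 in L1; accordingly, (Xn)n∈N is not UI and admits no integrable dominating random variable Y.

```python
# Monte Carlo sketch: the pointwise limit of X_n = n * 1_{[0,1/n]} is 0,
# but the means E[X_n] stay near 1, so limit and expectation do not
# interchange for this sequence.
import random

random.seed(1)
N = 400_000
U = [random.random() for _ in range(N)]  # samples omega ~ Unif[0,1]

def mean_Xn(n):
    """Monte Carlo estimate of E[X_n] = n * P[U <= 1/n]."""
    return sum(n for u in U if u <= 1.0 / n) / N

means = {n: mean_Xn(n) for n in (10, 100, 1000)}
```

The mass of Xn escapes into an ever-taller spike on an ever-smaller set, which is exactly the behaviour that uniform integrability rules out.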

2.3 The laws of large numbers

In this section, we study averages of random variables that are independent and identically distributed (i.i.d.). If the random variables are integrable, these averages converge in probability and almost surely to their mean.

First, we study convergence in probability, which is usually referred to as the weak law of large numbers.

Theorem 2.12 (Weak law of large numbers). Let X1, X2, . . . be a sequence of i.i.d. random variables in L1 with mean µ on some probability space (Ω, F, P). Set Sn := ∑_{i=1}^n Xi for n ∈ N. Then

  Sn/n →^{P} µ.

Proof. We show the result under the additional assumption that E[(X1)²] < ∞; the general case follows from Theorem 2.13 below and the fact that almost sure convergence implies convergence in probability. Set σ² := Var[X1]. Then by the fact that the Xi are i.i.d., we obtain by linearity of the expectation and the Bienaymé formula (1.18),

  E[Sn/n] = (1/n) ∑_{i=1}^n E[Xi] = (1/n) nµ = µ, (2.3)
  Var[Sn/n] = (1/n²) ∑_{i=1}^n Var[Xi] = (1/n²) nσ² = σ²/n. (2.4)

Let ε > 0. By Chebyshev’s inequality (Corollary 1.79), (2.3) and (2.4), we obtain for n ∈ N,

  P[|Sn/n − µ| > ε] = P[|Sn/n − E[Sn/n]| > ε] ≤ (1/ε²) Var[Sn/n] = σ²/(nε²).

Letting n → ∞ establishes the claim.
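The Chebyshev bound in the proof can be watched at work in a simulation (the uniform distribution, seed and sample sizes are illustrative choices): for Xi i.i.d. uniform on [0, 1] with µ = 1/2 and σ² = 1/12, the bound says P[|Sn/n − µ| > ε] ≤ σ²/(nε²), so deviations become rarer as n grows.

```python
# Monte Carlo sketch of the weak law of large numbers for i.i.d.
# Unif[0,1] variables: estimate P[|S_n/n - 1/2| > eps] for two values
# of n and observe the decay predicted by the Chebyshev bound.
import random

random.seed(2)

def exceed_prob(n, eps=0.05, trials=2000):
    """Estimate P[|S_n/n - 1/2| > eps] over independent repetitions."""
    count = 0
    for _ in range(trials):
        s = sum(random.random() for _ in range(n))
        if abs(s / n - 0.5) > eps:
            count += 1
    return count / trials

p_small_n, p_large_n = exceed_prob(50), exceed_prob(2000)
```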

Next, we study the almost sure version of the law of large numbers. This is usually referred
to as the strong law of large numbers.

Theorem 2.13 (Strong law of large numbers). Let X1, X2, . . . be a sequence of i.i.d. random variables in L1 with mean µ on some probability space (Ω, F, P). Set Sn := ∑_{i=1}^n Xi for n ∈ N. Then

  Sn/n →^{a.s.} µ.

Proof. We show the result under the additional assumption that K := E[X1⁴] < ∞; for the general case see [3, Theorem 5.16]. We may assume without loss of generality that E[X1] = 0; otherwise consider X̃i := Xi − E[Xi] and use that Sn/n = S̃n/n + µ, where S̃n := ∑_{i=1}^n X̃i. Then independence of the Xi gives

  E[Xi Xj Xk Xl] = 0 for i, j, k, l ∈ N distinct,
  E[Xi Xj Xk²] = 0 for i, j, k ∈ N distinct,
  E[Xi Xj³] = 0 for i, j ∈ N distinct.

Moreover, Jensen’s inequality and the fact that the Xi are i.i.d. give

  E[Xi² Xj²] = E[Xi²] E[Xj²] = (E[X1²])² ≤ E[X1⁴] = K for i, j ∈ N distinct.

Thus, by some elementary combinatorics, we obtain

  E[Sn⁴] = ∑_{i=1}^n ∑_{j=1}^n ∑_{k=1}^n ∑_{l=1}^n E[Xi Xj Xk Xl] = ∑_{i=1}^n E[Xi⁴] + 6 ∑_{i=1}^n ∑_{j=1}^{i−1} E[Xi² Xj²]
        ≤ nK + 6 (n(n − 1)/2) K ≤ 3n² K.

Now using the fact that ∑_{n=1}^∞ 1/n² < ∞, we obtain by monotone convergence,

  E[ ∑_{n=1}^∞ (Sn/n)⁴ ] = ∑_{n=1}^∞ (1/n⁴) E[Sn⁴] ≤ 3K ∑_{n=1}^∞ 1/n² < ∞.

By Lemma 1.55, this implies that ∑_{n=1}^∞ (Sn/n)⁴ < ∞ P-a.s. It follows that lim_{n→∞} (Sn/n)⁴ = 0 P-a.s.36 Hence, we have lim_{n→∞} Sn/n = 0 P-a.s.
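The almost sure statement is about individual paths, which a simulation can at least illustrate (distribution, seed and tolerance below are illustrative): along a single simulated sequence of i.i.d. uniform variables, the running averages Sn/n settle down near µ = 1/2.

```python
# Path-wise sketch of the strong law of large numbers: track the
# running averages S_n/n along ONE simulated path of i.i.d. Unif[0,1]
# variables, recording them every 10,000 steps.
import random

random.seed(4)
total = 0.0
averages = []                     # S_n/n recorded every 10,000 steps
for n in range(1, 100_001):
    total += random.random()
    if n % 10_000 == 0:
        averages.append(total / n)
```

Contrast this with the weak law: there we looked at the probability of a deviation at a fixed n across many paths; here the whole trajectory of one path converges.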

2.4 The law of iterated logarithm and the central limit theorem
If X1, X2, . . . are i.i.d. random variables in L1 with mean µ, then the strong law of large numbers implies that S̃n/n converges to 0 almost surely, where S̃n := ∑_{k=1}^n (Xk − µ). Two natural follow-up questions are to understand the precise size of S̃n in terms of n and to find a nondegenerate limit in distribution under a different scaling in n.

The famous law of iterated logarithm by Hartman and Wintner answers the question on the
precise size of S̃n ; for a proof see [3, Theorem 22.11].

Theorem 2.14 (Law of iterated logarithm). Let X1, X2, . . . be a sequence of i.i.d. random variables in L2 with mean µ and variance σ² > 0 on some probability space (Ω, F, P). Set S̃n := ∑_{k=1}^n (Xk − µ) for n ∈ N. Then

  lim sup_{n→∞} S̃n / ( σ √(2n log(log n)) ) = 1 P-a.s.

The central limit theorem answers the second question. The correct scaling in n to get a nondegenerate weak limit is √n, and the corresponding limit distribution is the normal distribution. For a proof, we refer to [3, Theorem 15.37].

Theorem 2.15 (Central limit theorem). Let X1, X2, . . . be a sequence of i.i.d. random variables in L2 with mean µ and variance σ² > 0 on some probability space (Ω, F, P). Set S̃n := ∑_{k=1}^n (Xk − µ) for n ∈ N. Then

  S̃n / (σ√n) ⇒ N(0, 1).
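The CLT can be checked numerically (all numerical choices below are illustrative): for Xi i.i.d. uniform on [0, 1] with µ = 1/2 and σ² = 1/12, the standardized sum S̃n/(σ√n) should be approximately N(0, 1), so it should land in [−1.96, 1.96] roughly 95% of the time.

```python
# Monte Carlo sketch of the central limit theorem: for centred sums of
# i.i.d. Unif[0,1] variables, measure how often the standardized sum
# falls inside the standard normal 95% interval [-1.96, 1.96].
import math
import random

random.seed(3)
n, trials = 400, 5000
sigma = math.sqrt(1.0 / 12.0)

inside = 0
for _ in range(trials):
    s_tilde = sum(random.random() - 0.5 for _ in range(n))  # centred sum
    z = s_tilde / (sigma * math.sqrt(n))                    # standardized
    if abs(z) <= 1.96:
        inside += 1

coverage = inside / trials
```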

36 Recall that if (an)n∈N is a sequence of nonnegative numbers with ∑_{n=1}^∞ an < ∞, then lim_{n→∞} an = 0.

References
[1] H.-O. Georgii, Stochastics, De Gruyter Textbook, Walter de Gruyter & Co., Berlin, 2008.

[2] J. Jacod and P. Protter, Probability essentials, second ed., Universitext, Springer-Verlag,
Berlin, 2003.

[3] A. Klenke, Probability theory, Universitext, Springer-Verlag London, Ltd., London, 2008.
