FT Notes
(Probability Theory)
Martin Herdegen
Lectured by Osian Shelley
Department of Statistics
Foreword
Fundamental Tools week is a preparatory course to refresh your knowledge of probability
theory, and introduce you to the style of questions you are likely to encounter in courses
containing mathematics at Warwick.
These notes may seem more formal than those of other courses you have taken before.
Do not let this concern you too much, as most of the course will involve calculation using the
formal results laid out here. However, you are expected to know and understand the proofs
that are presented here.
You should be aware that the material covered here is considered to be basic mathematical
content which, hopefully, you will have encountered before. Importantly, if you have not
covered a topic before, then these notes should not serve as the means to learn about it – the
material presented here is a brief summary of these topics and is not meant to be exhaustive.
Ideally, you should consult other textbooks which will provide far more detail.
There is an examination at the end of the week which broadly assumes the knowledge you
will find here, along with the example sheets completed in seminars. As a general guide, if
you understand most of the material here and have completed all of the questions from the
assignment sheets then you will not have a problem passing the exam.
Course webpage
There is a webpage for this course, which you can find on myWBS (search MA901).
It will contain a copy of these notes, as well as assignment sheets as they are set through the
week. No solution sheets will be provided for the questions covered in class.
Finally, please note that these notes have been in place since 2020/2021, and differ substantially
from previous years. As such, they might contain errors and typos, although we believe that
the mathematical content itself is correct. If you do find errors, then please contact us so that
they may be corrected. You can always find an up-to-date copy of the notes on the course
webpage.
1 Fundamental concepts of Probability Theory
In this chapter, we study some fundamental concepts from Measure Theory and Probability
Theory that are foundational for various applications in Mathematical Finance.
Throughout, Ω denotes a nonempty set, the sample space. Each ω ∈ Ω describes a possible "state of the world". Key examples are Ω = {ω1 , . . . , ωN } for
N ∈ N (finite sample space), Ω = {ω1 , ω2 , . . .} (countable sample space), and Ω = R.
A collection F of subsets of Ω is called a σ-algebra (on Ω) if:
(1) Ω ∈ F;
(2) A ∈ F =⇒ Ac = Ω \ A ∈ F;
(3) A1 , A2 , . . . ∈ F =⇒ ∪n∈N An ∈ F.
The pair (Ω, F) is called a measurable space and each A ∈ F is called F-measurable or an
F-measurable event.
A σ-algebra F on Ω has the following further properties:
(a) the empty set ∅ is in F. Indeed, this follows from (1) and (2) via ∅ = Ω \ Ω.
(b) if A1 , A2 , . . . ∈ F, then the countable intersection ∩n∈N An is in F. Indeed, this follows from the de Morgan laws,1 (2), and (3) via ∩n∈N An = (∪n∈N Acn )c .
(c) if A, B ∈ F, then A ∪ B and A ∩ B are in F. Indeed, this follows from (3) and (b) by applying them to the sequence A, B, B, . . ., noting that finite unions and intersections are special cases of countable ones.
(d) If A, B ∈ F, then A \ B and B \ A are in F. Indeed, this follows from (2) and (c) via
A \ B = A ∩ B c and B \ A = B ∩ Ac .
1 The de Morgan laws say that if (An )n∈N is any collection of subsets of Ω, then (∪n∈N An )c = ∩n∈N Acn and (∩n∈N An )c = ∪n∈N Acn .
(e) if A1 , A2 , . . . ∈ F, then lim supn→∞ An := ∩n∈N ∪k≥n Ak is in F. Indeed, this follows from (3) and (b).
(f) if A1 , A2 , . . . ∈ F, then lim inf n→∞ An := ∪n∈N ∩k≥n Ak is in F. Indeed, this follows from (b) and (3).
The limit inferior is the event where eventually all of the An occur. On the other hand, the limit superior is the event where infinitely many of the An occur. In particular, lim inf n→∞ An ⊂ lim supn→∞ An .
If Ω is finite (or countable), i.e., Ω = {ω1 , . . . , ωN } (or Ω = {ω1 , ω2 , . . .}), then the usual choice
for a σ-algebra on Ω is F := 2Ω := {A : A ⊂ Ω}, the power set of Ω.
If Ω is uncountable, e.g. Ω = R, it turns out that the power set 2Ω is “too big” to be chosen as
σ-algebra.2 For this reason, one uses instead the following procedure: One chooses a generator
A of “good subsets” of Ω that one wants to be measurable. One then denotes by σ(A) the
smallest σ-algebra on Ω that contains A. It is called the σ-algebra generated by A. It is
explicitly given by
σ(A) = ∩ {G : G a σ-algebra on Ω with A ⊂ G}.
Note that different generators A may generate the same σ-algebra; one usually chooses a
“small” generator or a generator with “good properties”.
Example 1.5. (a) If Ω = R, one wants that all open sets OR in R are measurable. One then
sets BR := σ(OR ) and calls this the Borel σ-algebra on R. One can show that BR contains all
closed and open sets in R and that it is also generated by the generator A, where A contains
all (closed) sets of the form (−∞, a] for a ∈ R.3
(b) More generally, if Ω is a subset of Rd for d ≥ 1, one denotes the open sets in Ω by OΩ ,
and sets BΩ := σ(OΩ ) and calls this the Borel σ-algebra on Ω.4
Definition 1.6. Given a measurable space (Ω, F), a probability measure P on (Ω, F) is a map P : F → [0, 1] such that
(1) P [Ω] = 1;
(2) for every sequence A1 , A2 , . . . ∈ F of pairwise disjoint events, P [∪n∈N An ] = Σ_{n=1}^∞ P [An ].
Remark 1.7. The properties (1) and (2) in Definition 1.6 are called the axioms of Kolmogorov
after the Russian mathematician A. Kolmogorov (1903 - 1987). Property (2) is referred to as
σ-additivity, where the σ stands for “countable”.
Example 1.8. (a) Let (Ω, F) be a measurable space and suppose that F contains all ele-
mentary events, i.e., {ω} ∈ F for all ω ∈ Ω. Fix ω ∗ ∈ Ω and define the map P : F → [0, 1]
by
P [A] = 1 if ω ∗ ∈ A, and P [A] = 0 if ω ∗ ∉ A.
One can check that P is a probability measure on (Ω, F). It is called the Dirac measure for ω ∗
and often denoted by δω∗ . The Dirac measure for ω ∗ models the “deterministic case”, where
we know with probability 1 that the state of the world will be ω ∗ .5
(b) Let Ω = {ω1 , . . . , ωN } for some N ∈ N and F = 2Ω . Define the map P : F → [0, 1] by
P [A] = |A| / |Ω|,
where |A| denotes the number of elements in A. One can check that P is a probability
measure on (Ω, F). It is called the discrete uniform distribution on Ω. For N = 2 (and
ω1 = H and ω2 = T ), this is a good model for the flipping of a fair coin, and for N = 6 (and
ω1 = 1, . . . , ω6 = 6), this is a good model for the rolling of a fair die.
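As an illustrative check (the event sets below are my own choices, not from the notes), the discrete uniform distribution can be computed directly:

```python
from fractions import Fraction

def uniform_prob(event, omega):
    """P[A] = |A| / |Omega| under the discrete uniform distribution on omega."""
    omega = set(omega)
    return Fraction(len(set(event) & omega), len(omega))

# Fair die: Omega = {1, ..., 6}
print(uniform_prob({2, 4, 6}, range(1, 7)))  # 1/2
print(uniform_prob({6}, range(1, 7)))        # 1/6
```

Using `Fraction` keeps the probabilities exact rather than floating-point.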
Example 1.9. A poker hand consists of 5 cards. If the cards have distinct consecutive values
and are not all of the same suit, we say that the hand is a straight. What is the probability
of being dealt a straight?
Let our sample space Ω consist of all possible poker hands and A ∈ 2Ω the event that we are
dealt a straight. Note that the number of possible outcomes for which the poker hand consists of an ace, two, three, four and five is 4^5. In 4 of these cases, all five suits are identical, so 4^5 − 4 of these hands are straights. Since there are 10 possible runs of consecutive values, there are 10(4^5 − 4) hands that are straights.
Assuming all poker hands are equally likely, we use the discrete uniform distribution on Ω to
5 Note that unless Ω = {ω ∗ }, it is false to say that we are sure that the state of the world will be ω ∗ ; we can only say that the state of the world will be ω ∗ P -almost surely; cf. Definition 1.11 below.
deduce that
P [A] = |A| / |Ω| = 10(4^5 − 4) / C(52, 5) = 10200 / 2598960 ≈ 0.0039.
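The counting argument above is easy to verify with Python's `math.comb`; a minimal sketch:

```python
from math import comb

straights = 10 * (4**5 - 4)   # 10 runs of consecutive values; subtract the 4 straight flushes per run
total_hands = comb(52, 5)     # |Omega|: all 5-card hands
prob = straights / total_hands
print(straights, total_hands, round(prob, 4))  # 10200 2598960 0.0039
```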
The following result collects some fundamental rules for calculating probabilities.
Proposition 1.10. Let (Ω, F, P ) be a probability space.
(a) If A ∈ F, then
P [Ac ] = P [Ω \ A] = 1 − P [A].
(b) If A ⊂ B ∈ F, then
P [B] = P [A] + P [B \ A] ≥ P [A]. (1.1)
(c) If A, B ∈ F, then
P [A ∪ B] = P [A] + P [B] − P [A ∩ B].
(d) If A1 ⊂ A2 ⊂ · · · ∈ F, then
P [∪n∈N An ] = limn→∞ P [An ].
(e) If A1 ⊃ A2 ⊃ · · · ∈ F, then
P [∩n∈N An ] = limn→∞ P [An ].
(f) If A1 , A2 , . . . ∈ F, then
P [∪n∈N An ] ≤ Σ_{n=1}^∞ P [An ].
Proof. We only prove parts (a), (d) and (f). The other parts are left as an exercise.
(a) Since A and Ac are disjoint with A ∪ Ac = Ω, σ-additivity of P yields 1 = P [Ω] = P [A] + P [Ac ]. Rearranging yields (a).
(d) Set B1 := A1 and Bk := Ak \ Ak−1 for k ≥ 2. Then the Bk are pairwise disjoint with ∪_{k=1}^n Bk = An for each n ∈ N and ∪k∈N Bk = ∪k∈N Ak . Hence σ-additivity of P yields
P [∪k∈N Ak ] = P [∪k∈N Bk ] = Σ_{k=1}^∞ P [Bk ] = limn→∞ Σ_{k=1}^n P [Bk ] = limn→∞ P [∪_{k=1}^n Bk ] = limn→∞ P [An ].
(f) Set B1 := A1 and Bk := Ak \ (A1 ∪ · · · ∪ Ak−1 ) for k ≥ 2. Then the Bk are pairwise disjoint with ∪k∈N Bk = ∪k∈N Ak and Bk ⊂ Ak for all k ∈ N. Hence σ-additivity and monotonicity of P yield
P [∪k∈N Ak ] = P [∪k∈N Bk ] = limn→∞ Σ_{k=1}^n P [Bk ] ≤ limn→∞ Σ_{k=1}^n P [Ak ] = Σ_{k=1}^∞ P [Ak ].
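For a finite sample space carrying the discrete uniform distribution, the rules of Proposition 1.10 can be verified by brute force; a small sketch (the events A and B below are arbitrary choices):

```python
from fractions import Fraction

omega = set(range(1, 7))  # a fair die

def P(A):
    """Discrete uniform distribution on omega."""
    return Fraction(len(A & omega), len(omega))

A, B = {1, 2, 3}, {3, 4}
assert P(omega - A) == 1 - P(A)              # (a) complement rule
assert P(A | B) == P(A) + P(B) - P(A & B)    # (c) inclusion-exclusion
assert P(A | B) <= P(A) + P(B)               # (f) union bound
print("all checks passed")
```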
Events with probability zero or one are of particular importance because of the obvious interpretation as "impossible" and "sure" events. However, this interpretation is not fully correct: P [A] = 0 does not in general imply that A = ∅, and P [A] = 1 does not in general imply that A = Ω. The following definition makes this precise.
Definition 1.11. Let (Ω, F, P ) be a probability space.
(a) A set N ∈ F with P [N ] = 0 is called a P -nullset.
(b) Let E(ω) be a property that a state of the world ω ∈ Ω can have or not have. We say that E holds P -almost surely if there exists a P -nullset N ∈ F such that E holds for all ω ∈ Ω \ N .
Definition 1.12. Let (Ω, F) and (Ω′, F ′) be measurable spaces and X : Ω → Ω′ a map.
(a) For A′ ∈ F ′, the preimage of A′ under X is defined by
X −1 (A′) := {X ∈ A′} := {ω ∈ Ω : X(ω) ∈ A′}.
(b) The collection σ(X) := {X −1 (A′) : A′ ∈ F ′} is called the σ-algebra generated by X.
(c) X is called measurable with respect to F and F ′ (or shorter F-F ′-measurable) if
X −1 (A′) ∈ F for all A′ ∈ F ′. (1.2)
Intuitively, σ(X) contains all the "information" about X. Clearly, X is measurable with respect to F and F ′ if and only if σ(X) ⊂ F, i.e., F contains all the information about X.
Example 1.13. Let (Ω, F) be a measurable space and A ∈ F an F-measurable event. Then the indicator function 1A , defined by
1A (ω) := 1 if ω ∈ A, and 1A (ω) := 0 if ω ∈ Ac ,
is an F-measurable random variable with σ(1A ) = {∅, A, Ac , Ω}.
Whereas Definition 1.12 is important from a theoretical perspective, it is almost useless for
checking in practice that a given function X : Ω → Ω0 is F-F 0 -measurable because we usually
do not have a good description of all A0 ∈ F 0 . But fortunately, one can show that it suffices
to check (1.2) for all A0 in a generator A0 of F 0 , which in general is much smaller than F 0 .
This is the content of the following result; for a proof see [3, Theorem 1.81].
Theorem 1.14. Let (Ω, F) and (Ω0 , F 0 ) be measurable spaces and X : Ω → Ω0 a map. Suppose
that F 0 = σ(A0 ).
(a) We have
σ(X) = σ({X ∈ A′} : A′ ∈ A′).
6 Also if Ω′ is a Borel subset of R and F ′ = BΩ′ , we say that X is an F-measurable random variable valued in Ω′. Note that each F-measurable random variable valued in Ω′ is also an F-measurable random variable (valued in R) so that we do not need to worry too much about the exact codomain. The same holds for random vectors.
(b) X is measurable with respect to F and F ′ if and only if
X −1 (A′) ∈ F for all A′ ∈ A′. (1.3)
Corollary 1.15. Let (Ω, F) be a measurable space and X : Ω → R a map. Then X is an F-measurable random variable if and only if {X ≤ x} ∈ F for all x ∈ R.
Proof. Set A′ := {(−∞, x] : x ∈ R}. Note that (1.3) states that X −1 (A′) ∈ F for all A′ ∈ A′. Hence, the claim follows from Theorem 1.14 and the fact that BR = σ(A′) by Example 1.5(a).
Corollary 1.16. Let Ω ⊂ Rd be a Borel set and f : Ω → R a continuous function. Then f is BΩ -BR -measurable.
Proof. By Corollary 1.15, it suffices to show that f −1 ((−∞, x]) is in BΩ for all x ∈ R. So fix
x ∈ R. As f is continuous, the preimage of the closed set (−∞, x] is again closed. As BΩ
contains all closed sets in Ω, it follows that f −1 ((−∞, x]) ∈ BΩ .
Theorem 1.17. Let (Ω, F), (Ω0 , F 0 ), and (Ω00 , F 00 ) be measurable spaces. Let f : Ω → Ω0
be F-F 0 -measurable and g : Ω0 → Ω00 be F 0 -F 00 -measurable. Then h : Ω → Ω00 , defined by
h = g ◦ f is F-F 00 -measurable.7
We now show that sums and products of random variables are again random variables. In
order to do so, we require the following lemma:
7 Recall that g ◦ f is defined by (g ◦ f )(ω) = g(f (ω)).
Lemma 1.18. Let (Ω, F) be a measurable space and let X1 , . . . , Xn : Ω → R be F-measurable maps. Define X = (X1 , . . . , Xn ) : Ω → Rn . Then X is F-BRn -measurable.
Proof. For b ∈ Rn , we have X −1 ((−∞, b]) = ∩_{i=1}^n Xi−1 ((−∞, bi ]). Since each Xi is measurable, X −1 ((−∞, b]) ∈ F. Since closed sets of the form (−∞, b] generate BRn , it follows from Theorem 1.14 that X is F-BRn -measurable.
Theorem 1.19. Let (Ω, F) be a measurable space and X1 and X2 be F-measurable random
variables. Then X1 + X2 , X1 − X2 , X1 X2 , and X1 /X2 are again F-measurable random
variables.8
Finally, we show that countable suprema and infima of random variables are again measurable.10 To this end, we need to extend the real line R by the points −∞ and +∞. Thus, we
set
R̄ := R ∪ {−∞, +∞} and BR̄ := σ({[−∞, x] : x ∈ R}).
One can check that BR ⊂ BR , so that every real-valued random variable X can be in a
canonical way identified with an R-valued measurable map.11 For this reason, we will not
always carefully distinguish between R-valued and R-valued measurable maps and call both
random variables.
Theorem 1.20. Let (Ω, F) be a measurable space and (Xn )n∈N a sequence of F-BR̄ -measurable maps Xn : Ω → R̄. Then the following maps are again F-BR̄ -measurable:
(a) supn∈N Xn ;
(b) inf n∈N Xn ;
(c) lim supn→∞ Xn ;
(d) lim inf n→∞ Xn .
Proof. We only prove (a) and (c); the proofs of (b) and (d) are similar.
(a) Let x ∈ R. Then by the fact that each of the Xn is F-BR̄ -measurable, we have
{supn∈N Xn ∈ [−∞, x]} = ∩n∈N {Xn ∈ [−∞, x]} ∈ F.
The claim then follows from Theorem 1.14(b).
(c) For any n ∈ N set Yn := supm≥n Xm . Then each Yn is F-measurable by part (a). Hence,
lim supn→∞ Xn = inf n∈N Yn is F-measurable by part (b).
Definition 1.21. Let (Ω, F, P ) be a probability space and X an F-measurable random vari-
able.
(a) The distribution (or law or image) of X under P is the probability measure on (R, BR )
defined by
PX [B] := P [X ∈ B], B ∈ BR .
(b) The distribution function (or cumulative distribution function (cdf)) of X under P is
the map FX : R → [0, 1] given by
FX (x) := P [X ≤ x], x ∈ R.
The following result lists the three defining properties of a distribution function; its proof is
left as an exercise.
Lemma 1.22. Let (Ω, F, P ) be a probability space and X an F-measurable random variable.
Then the distribution function FX : R → [0, 1] has the following properties:
(a) FX is nondecreasing.
(b) FX is right-continuous.
(c) limx→−∞ FX (x) = 0 and limx→+∞ FX (x) = 1.
The above lemma shows that for each random variable X, there exists a function F that is nondecreasing, right-continuous, and with limits 0 at −∞ and 1 at +∞. The next result shows that the converse is also true: for each function F with these three properties, there exists a random variable X with distribution function F ; for a proof see [3, Theorem 1.104].
The next definition looks at the concept that two random variables X and Y have the same
distribution.
Definition 1.24. Let (Ω, F, P ) be a probability space and X and Y F-measurable random
variables.13 Then X and Y are said to be identically distributed if
PX = PY .
The next result shows that the distribution function of a random variable does indeed describe
the whole distribution; for a proof see [2, Theorem 7.1].
Theorem 1.25. Let (Ω, F, P ) be a probability space and X and Y F-measurable random
variables. Then the following are equivalent:
(1) X and Y are identically distributed, i.e., PX = PY .
(2) FX = FY .
Definition 1.26. A real-valued random variable X (or more precisely its distribution) is called discrete if there exists a finite or countable set B ∈ BR such that
PX [B] = P [X ∈ B] = 1. (1.4)
In this case, the probability mass function (pmf) pX : R → [0, 1] of X is given by
pX (x) := P [X = x], x ∈ R.
More generally, for any Borel set A ∈ BR , we have the formula
PX [A] = P [X ∈ A] = Σ_{x∈A∩B} pX (x).
Note that due to (1.4), we always sum over a countable set and this sum is finite and bounded
above by 1.
(b) One can show that a random variable X has a discrete distribution if and only if its
distribution function is piecewise constant, i.e., for each x ∈ R, there is ε > 0 such that
FX (y) = FX (x) for all y ∈ [x, x + ε]. Moreover, pX (x) = FX (x) − FX (x−) for all x ∈ R, where FX (x−) := limy↑x FX (y) := limy→x,y<x FX (y) denotes the left limit of FX at x.14 This
together with Theorem 1.25 also shows that the distribution of a discrete random variable is
uniquely described by its pmf.
Example 1.28. A discrete real-valued random variable X is said to have a
(a) Bernoulli distribution with parameter p ∈ [0, 1] if its pmf is given by pX (1) = p, pX (0) = 1 − p, and pX (x) = 0 otherwise.
(b) binomial distribution with parameters n ∈ N and p ∈ [0, 1] if its pmf is given by
pX (x) = C(n, x) p^x (1 − p)^{n−x} if x ∈ {0, . . . , n}, and pX (x) = 0 otherwise.
(c) geometric distribution with parameter p ∈ (0, 1) if its pmf is given by15
pX (x) = p(1 − p)^x if x ∈ N0 , and pX (x) = 0 otherwise.
14 This left limit always exists because FX is nondecreasing.
15 Warning: In parts of the literature, the geometric distribution is shifted by one to the right, i.e., it is a distribution on N.
(d) Poisson distribution with parameter λ > 0 if its pmf is given by
pX (x) = e^{−λ} λ^x / x! if x ∈ N0 , and pX (x) = 0 otherwise.
Definition 1.29. A real-valued random variable X (or more precisely its distribution) is called
(absolutely) continuous if there exists a nonnegative, measurable function fX : R → [0, ∞)
satisfying Z x
FX (x) = fX (y) dy, x ∈ R.
−∞
In this case, the function fX is called the probability density function (pdf) of X.
Remark 1.30. (a) One can show that if X is continuous with pdf fX , then for any Borel set
A ∈ BR , we have the formula
Z
PX [A] = P [X ∈ A] = fX (y) dy. (1.5)
A
This together with Theorem 1.25 and fundamental properties of the (Lebesgue) integral show
that the distribution of a random variable is uniquely characterised by its pdf (up to Lebesgue-
null sets).
(b) Note that there are many random variables which have neither a discrete nor a continuous
distribution.
Example 1.31. A continuous real-valued random variable X is said to have a
(a) uniform distribution on (a, b), where −∞ < a < b < ∞, if its pdf is given by
fX (x) = 1/(b − a) if x ∈ (a, b), and fX (x) = 0 otherwise.
(b) exponential distribution with rate parameter λ > 0 if its pdf is given by
fX (x) = λ exp(−λx) if x > 0, and fX (x) = 0 otherwise.
(c) normal distribution with parameters µ ∈ R and σ 2 > 0 if its pdf is given by
fX (x) = (1/√(2πσ 2 )) e^{−(x−µ)^2 /(2σ 2 )}, x ∈ R.
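Likewise, each pdf above integrates to 1 over R; a rough midpoint-rule check (grid sizes and parameter values are arbitrary choices):

```python
from math import exp, pi, sqrt, isclose

def midpoint(f, a, b, n=100_000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    h = (b - a) / n
    return sum(f(a + (i + 0.5) * h) for i in range(n)) * h

lam, mu, sigma2 = 1.5, 0.0, 1.0
expo = midpoint(lambda x: lam * exp(-lam * x), 0.0, 40.0)
norm = midpoint(lambda x: exp(-(x - mu)**2 / (2 * sigma2)) / sqrt(2 * pi * sigma2), -10.0, 10.0)

assert isclose(expo, 1.0, abs_tol=1e-6) and isclose(norm, 1.0, abs_tol=1e-6)
print("both pdfs integrate to ~1")
```

The truncation of the integration ranges is harmless here because both tails are negligible.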
Definition 1.32. Let (Ω, F, P ) be a probability space and X1 , . . . , XN F-measurable random variables.
(a) The joint distribution function of X1 , . . . , XN under P is the map F(X1 ,...,XN ) : RN → [0, 1] given by
F(X1 ,...,XN ) (x1 , . . . , xN ) := P [X1 ≤ x1 , . . . , XN ≤ xN ].
(b) X1 , . . . , XN are called jointly continuous if there is a measurable function f(X1 ,...,XN ) : RN → [0, ∞) such that
F(X1 ,...,XN ) (x1 , . . . , xN ) = ∫_{−∞}^{x1} · · · ∫_{−∞}^{xN} f(X1 ,...,XN ) (y1 , . . . , yN ) dyN · · · dy1 , for all x1 , . . . , xN ∈ R.
In this case, the function f(X1 ,...,XN ) is called the joint probability density function (joint pdf) of X1 , . . . , XN .
Remark 1.33. (a) To simplify the notation one often sets X := (X1 , . . . , XN ) and writes
FX (x) and fX (x) for F(X1 ,...,XN ) (x1 , . . . , xN ) and f(X1 ,...,XN ) (x1 , . . . , xN ), respectively, where
x = (x1 , . . . , xN ).
That these integrals are well defined and measurable is the content of Fubini’s theorem, see
Section 1.11 below.
The key example of a multivariate distribution is the multivariate normal distribution.
Example 1.34. A random vector X = (X1 , . . . , XN ) is said to have a multivariate normal dis-
tribution with mean vector µ ∈ RN and covariance matrix Σ ∈ RN ×N , where Σ is symmetric
and positive definite, if X1 , . . . , XN are jointly continuous with joint pdf
fX (x) = (1/√(det(2πΣ))) exp(−(1/2)(x − µ)⊤ Σ−1 (x − µ)), x ∈ RN .
To motivate this concept, suppose we are given a probability space (Ω, F, P ) and we have been
informed that A ∈ F has occurred. We want to find a new probability measure P [· | A] on
(Ω, F) that takes this information into account. Clearly, this new probability measure should
be consistent with the old probability measure P in the sense that P [· | A] is proportional to P .
Moreover, since we already know that A has occurred, one should require that P [A | A] = 1.
It is then not difficult to check that these two properties uniquely determine P [· | A]. The
answer is given by the following definition.
Definition 1.35. Let (Ω, F, P ) be a probability space and A ∈ F. Then for B ∈ F, the conditional probability of B given A is denoted by P [B|A] and defined by
P [B|A] := P [A ∩ B] / P [A] when P [A] > 0, and P [B|A] := 0 otherwise. (1.6)
The following result lists two elementary facts about conditional probabilities.
Theorem 1.36. Let (Ω, F, P ) be a probability space, I a finite or countable index set,16 and
(Ai )i∈I an F-measurable partition of Ω, i.e., each Ai is F-measurable, ∪i∈I Ai = Ω and Ai ∩ Aj = ∅ for i ≠ j. Suppose that P [Ai ] > 0 for each i ∈ I. Then
(a) For every B ∈ F, we have the law of total probability
P [B] = Σi∈I P [B | Ai ]P [Ai ].
(b) For every B ∈ F with P [B] > 0 and each k ∈ I, we have Bayes’ formula
P [Ak | B] = P [B | Ak ]P [Ak ] / Σi∈I P [B | Ai ]P [Ai ].
Proof. (a) The definition of conditional probabilities and the σ-additivity of P give
Σi∈I P [B | Ai ]P [Ai ] = Σi∈I P [B ∩ Ai ] = P [B].
(b) By part (a) and the definition of conditional probabilities,
P [B | Ak ]P [Ak ] / Σi∈I P [B | Ai ]P [Ai ] = P [Ak ∩ B] / P [B] = P [Ak | B].
Exercise. The proportion of Jaguar cars manufactured in Coventry is 0.7, and the proportion
of these with some fault is 0.2. All other Jaguars are made in Birmingham and the proportion
of faulty Birmingham cars is 0.1. What is the probability that a randomly selected Jaguar
car:
(b) is faulty?
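Part (b) is a direct application of the law of total probability from Theorem 1.36(a) over the partition {Coventry, Birmingham}; a sketch of the computation:

```python
p_cov = 0.7          # P[made in Coventry]
p_fault_cov = 0.2    # P[faulty | Coventry]
p_fault_birm = 0.1   # P[faulty | Birmingham]

# Law of total probability: P[faulty] = P[faulty|C]P[C] + P[faulty|B]P[B]
p_faulty = p_fault_cov * p_cov + p_fault_birm * (1 - p_cov)
print(round(p_faulty, 4))  # 0.17

# Bayes' formula then gives P[Coventry | faulty]
p_cov_given_faulty = p_fault_cov * p_cov / p_faulty
print(round(p_cov_given_faulty, 4))
```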
1.6 Independence
In this section, we study the key concept of (stochastic) independence. We start by looking
at independence of families of events and then generalise this to families of σ-algebras and
families of random variables.
To motivate this concept, suppose we are given a probability space (Ω, F, P ) and two F-
measurable events A, B with P [A], P [B] > 0. Then intuitively A and B are (stochastically)
independent if the probability assigned to A is not influenced by the information that B has
occurred and vice versa. In formulas, this means that
P [A | B] = P [A] and P [B | A] = P [B], or equivalently, P [A ∩ B] = P [A]P [B]. (1.7)
Definition 1.37. Let (Ω, F, P ) be a probability space, I an arbitrary index set and (Ai )i∈I a family of events in F. Then the Ai are said to be (stochastically) independent if for any finite set of distinct indices i1 , . . . , in ∈ I,
P [Ai1 ∩ Ai2 ∩ · · · ∩ Ain ] = ∏_{k=1}^n P [Aik ]. (1.8)
Exercise. Suppose two dice are rolled and we note the upturned face. Let
We now “lift” the definition of independence from families of events to families of σ-algebras.
Definition 1.38. Let (Ω, F, P ) be a probability space, I an arbitrary index set and (Gi )i∈I
a family of sub-σ-algebras of F.17 Then the Gi are said to be independent if for any finite set
of distinct indices i1 , . . . , in ∈ I and any events Ai1 ∈ Gi1 , . . . , Ain ∈ Gin ,
P [Ai1 ∩ Ai2 ∩ · · · ∩ Ain ] = ∏_{k=1}^n P [Aik ]. (1.9)
Next, we want to define a notion of independence for random variables. We do this by requiring
that their generated σ-algebras (cf. Definition 1.12) are independent.
Definition 1.39. Let (Ω, F, P ) be a probability space, I an arbitrary index set and (Xi )i∈I
a family of F-measurable random variables. Then the Xi are said to be independent if their
generated σ-algebras σ(Xi ), i ∈ I, are independent.
The following result shows that we need to check independence only for certain generators of
a σ-algebra; for a proof see [3, Theorem 2.13].
Theorem 1.40. Let (Ω, F, P ) be a probability space, I an arbitrary index set and (Ai )i∈I
a family of classes of events that are independent, i.e., for any finite set of distinct indices
i1 , . . . , in ∈ I and any events Ai1 ∈ Ai1 , . . . , Ain ∈ Ain ,
P [Ai1 ∩ Ai2 ∩ · · · ∩ Ain ] = ∏_{k=1}^n P [Aik ]. (1.10)
Moreover, suppose that each Ai is closed under intersections, i.e., if A, B ∈ Ai then also A ∩ B ∈ Ai .18 Then the generated σ-algebras σ(Ai ), i ∈ I, are independent.
17 This means that each Gi is a σ-algebra on Ω and Gi ⊂ F.
We note the following important corollary for the case of random variables. Its proof is left
as an exercise.
Corollary 1.41. Let (Ω, F, P ) be a probability space and X1 , . . . , XN F-measurable random variables. Then X1 , . . . , XN are independent if and only if
F(X1 ,...,XN ) (x1 , . . . , xN ) = ∏_{n=1}^N FXn (xn ), for all x1 , . . . , xN ∈ R.
Moreover, if X1 , . . . , XN are jointly continuous and have a continuous joint pdf f(X1 ,...,XN )
and continuous marginal pdfs fX1 , . . . , fXN , then X1 , . . . , XN are independent if and only if
f(X1 ,...,XN ) (x1 , . . . , xN ) = ∏_{n=1}^N fXn (xn ), for all x1 , . . . , xN ∈ R.
As another application of Theorem 1.40, we show that if a family (Ai )i∈I of events is inde-
pendent, then so is the family of their complements (Aci )i∈I .
Example 1.42. Let (Ω, F, P ) be a probability space, I an arbitrary index set and (Ai )i∈I
a family of independent events in F. Then also (Aci )i∈I is a family of independent events.
Indeed, set
Ai := {Ai }, i ∈ I.
Then the Ai are trivially closed under intersection, and they are independent because the Ai are. Moreover, by Example 1.13, it follows that σ(Ai ) = σ(1Ai ) = {∅, Ai , Aci , Ω}. Hence, by Theorem 1.40, the generated σ-algebras σ(Ai ) are also independent. Since Aci ∈ σ(Ai ) for each i ∈ I, it follows by the Definition 1.38 of independence of σ-algebras that the (Aci )i∈I are independent.
The next result computes the probability of a sequence of events A1 , A2 , . . . happening infinitely often. To this end, recall that {An i.o.} = lim supn→∞ An = ∩n∈N ∪k≥n Ak .
Lemma 1.43 (Borel–Cantelli). Let (Ω, F, P ) be a probability space and A1 , A2 , . . . ∈ F.
(a) If Σ_{n=1}^∞ P [An ] < ∞, then P [{An i.o.}] = 0.
(b) If the An are independent and Σ_{n=1}^∞ P [An ] = ∞, then P [{An i.o.}] = 1.
Proof. (a) Proposition 1.10(e) and (f) together with Σ_{n=1}^∞ P [An ] < ∞ yield
P [{An i.o.}] = P [∩n∈N ∪k≥n Ak ] = limn→∞ P [∪k≥n Ak ] ≤ limn→∞ Σ_{k=n}^∞ P [Ak ] = 0.
(b) It suffices to show that P [{An i.o.}c ] = 0. The de Morgan laws and Proposition 1.10(d) yield
P [{An i.o.}c ] = P [(∩n∈N ∪k≥n Ak )c ] = P [∪n∈N ∩k≥n Ack ] = limn→∞ P [∩k≥n Ack ].
Hence it suffices to show that P [∩k≥n Ack ] = 0 for each n ∈ N. So fix n ∈ N. Set Bℓ := Acn ∩ · · · ∩ Acn+ℓ−1 for ℓ ∈ N. Then Proposition 1.10(e), the fact that (Acn )n∈N is independent by Example 1.42, and the elementary inequality 1 − x ≤ exp(−x) for x ∈ R give
P [∩k≥n Ack ] = P [∩ℓ∈N Bℓ ] = limℓ→∞ P [Bℓ ] = limℓ→∞ P [Acn ∩ · · · ∩ Acn+ℓ−1 ]
= limℓ→∞ ∏_{k=n}^{n+ℓ−1} (1 − P [Ak ]) ≤ lim inf ℓ→∞ exp(−Σ_{k=n}^{n+ℓ−1} P [Ak ]) = 0,
where the final equality uses Σ_{k=n}^∞ P [Ak ] = ∞.
Example 1.44. Suppose we roll a fair die exactly once and define An as the event that in
this roll the face showed a six, for every n ∈ N. Clearly we have
Σ_{n=1}^∞ P [An ] = Σ_{n=1}^∞ 1/6 = ∞;
however, P [{An i.o.}] = P [A1 ] < 1. This shows that in part (b) of the Borel-Cantelli lemma,
the assumption of independence is indispensable.
Example 1.45. Suppose the number of calls received by a call centre each day is Xn times
on day n ∈ N, where Xn ∼ Poisson(λn ) such that 0 ≤ λn ≤ Λ, for some fixed Λ ∈ (0, ∞).
Then,
P [Xn ≥ n for infinitely many n] = 0.
Indeed, this follows from part (a) of the Borel–Cantelli lemma by noting that
Σ_{n=1}^∞ P [Xn ≥ n] = Σ_{n=1}^∞ Σ_{m=n}^∞ P [Xn = m] = Σ_{m=1}^∞ Σ_{n=1}^m P [Xn = m]
= Σ_{m=1}^∞ Σ_{n=1}^m e^{−λn } λn ^m / m! ≤ Σ_{m=1}^∞ m Λ^m / m! = Λ e^Λ < ∞,
where the inequality uses e^{−λn } ≤ 1 and λn ≤ Λ.
1.7 Expectation
In this section, we aim to define for a random variable X on a probability space (Ω, F, P ),
the expectation of X under P (also called the integral of X with respect to P ). We proceed
in three steps.
Definition 1.46. A random variable X on a probability space (Ω, F, P ) is called simple if there exist n ∈ N, c1 , . . . , cn ∈ R, and A1 , . . . , An ∈ F such that
X = c1 1A1 + · · · + cn 1An .
In this case, the expectation of X with respect to P is defined by
E [X] := c1 P [A1 ] + · · · + cn P [An ]. (1.11)
Remark 1.47. (a) It is not difficult to check that each simple random variable has a repre-
sentation such that the Ai are pairwise disjoint and the ci are distinct.
(b) It is not difficult to check that (1.11) is independent of the choice of the representation of X. More precisely, if X can also be written as X = d1 1B1 + · · · + dm 1Bm , then c1 P [A1 ] + · · · + cn P [An ] = d1 P [B1 ] + · · · + dm P [Bm ].
The following result shows that the expectation is linear on simple random variables. Its proof
is left as an exercise.
Lemma 1.48. Let X and Y be simple random variables on some probability space (Ω, F, P ).
For a, b ∈ R, aX + bY is again a simple random variable and
E [aX + bY ] = aE [X] + bE [Y ] .
We next aim to define the expectation for arbitrary nonnegative random variables.
Definition 1.49. Let X be a [0, ∞]-valued random variable on some probability space (Ω, F, P ). Then the expectation of X with respect to P is given by
E [X] := sup {E [Z] : Z simple with 0 ≤ Z ≤ X} ∈ [0, ∞].
Remark 1.50. (a) The expectation of a nonnegative random variable can be ∞ even if X
itself never takes the value ∞.
(b) It follows immediately from Definition 1.49 that the expectation is monotone in the sense
that if X and Y are [0, ∞]-valued random variables with X ≤ Y then E [X] ≤ E [Y ].
Definition 1.49 suggests that in order to calculate the expectation of a nonnegative random
variable X, we approximate X from below by a nondecreasing sequence of nonnegative simple
random variables (Xn )n∈N and calculate limn→∞ E [Xn ]. The following two results show that
this idea really works.
Lemma 1.51. Let X be a [0, ∞]-valued random variable on some probability space (Ω, F, P ).
Then there exists a nondecreasing sequence of nonnegative simple random variables (Xn )n∈N
with limn→∞ Xn = X.
Proof. For n ∈ N, set Xn (ω) := min(2−n b2n X(ω)c, n), where b·c denotes the floor function.19
Then for each ω ∈ Ω, it is easy to check that the sequence (Xn (ω))n∈N is nondecreasing and
satisfies limn→∞ Xn (ω) = X(ω).
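The approximating sequence Xn = min(2^{−n} ⌊2^n X⌋, n) from the proof is easy to experiment with; a sketch on a single value x (the choice of x is arbitrary):

```python
from math import floor

def approx(x, n):
    """n-th term of the nondecreasing simple approximation of x >= 0."""
    return min(floor(2**n * x) / 2**n, n)

x = 3.71
seq = [approx(x, n) for n in range(1, 12)]
assert all(a <= b for a, b in zip(seq, seq[1:]))  # nondecreasing in n
assert all(a <= x for a in seq)                   # approximates x from below
assert x - seq[-1] <= 2**-11                      # within dyadic resolution once n > x
print(seq)
```

For small n the cap at n is active (the value is truncated at n); once n exceeds x, only the dyadic rounding error 2^{−n} remains.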
The next theorem shows that expectation and monotone limits can be interchanged. It is one of the cornerstones of modern integration theory; for a proof see [3, Lemma 4.2].
Theorem 1.52 (Monotone convergence). Let (Xn )n∈N be a nondecreasing sequence of [0, ∞]-valued random variables on some probability space (Ω, F, P ). Then
E [limn→∞ Xn ] = limn→∞ E [Xn ].
Theorem 1.52 serves as a crucial ingredient to many proofs, where one first establishes the
result for simple random variables and then passes to a monotone limit. To illustrate this
19 If X(ω) = +∞, we use the conventions that c × ∞ = ∞ for c > 0 and ⌊∞⌋ = ∞.
approach, we prove the following generalisation of Lemma 1.48.
Lemma 1.54. Let X and Y be [0, ∞]-valued random variables on some probability space
(Ω, F, P ). For a, b ≥ 0, aX + bY is again a [0, ∞]-valued random variable and20
E [aX + bY ] = aE [X] + bE [Y ] .
The following result can also be proved using the monotone convergence theorem. Its proof is
left as an exercise.
Lemma 1.55. Let X be a [0, ∞]-valued random variable on some probability space (Ω, F, P ). Then E [X] = 0 if and only if X = 0 P -a.s.
As another application of the monotone convergence theorem, we prove the important Lemma
of Fatou.
Lemma 1.56 (Fatou). Let X1 , X2 , . . . be [0, ∞]-valued random variables on some probability
space (Ω, F, P ). Then
E [lim inf n→∞ Xn ] ≤ lim inf n→∞ E [Xn ].
Proof. For n ∈ N, set Yn := inf m≥n Xm . Then (Yn )n∈N is a nondecreasing sequence of [0, ∞]-valued random variables with limn→∞ Yn = lim inf n→∞ Xn , and Yn ≤ Xn gives
E [Yn ] ≤ E [Xn ] for all n ∈ N. (1.13)
By monotone convergence (Theorem 1.52) and (1.13), we obtain
E [lim inf n→∞ Xn ] = E [limn→∞ Yn ] = limn→∞ E [Yn ] = lim inf n→∞ E [Yn ] ≤ lim inf n→∞ E [Xn ].
Finally, we define the expectation for a general random variable. To this end, recall that
R̄ = [−∞, +∞].
Definition 1.57. Let X be an R̄-valued random variable on some probability space (Ω, F, P ).
X is called integrable or said to have finite expectation (with respect to P ) if E P [|X|] < ∞.
In this case one sets
E P [X] := E P [X + ] − E P [X − ],
where X + = max{0, X} denotes the positive part of X and X − = max{0, −X} denotes the
negative part of X.21 If there is no danger of confusion, we often drop the qualifier P in E P .
Remark 1.58. (a) In situations where one wants to highlight the underlying sample space Ω, it is often more convenient to write the expectation in integral notation. For an integrable random
variable X on a probability space (Ω, F, P ), we set
∫_Ω X(ω) P (dω) := E P [X] .
(b) The construction of the integral can be easily extended to general measures µ on (Ω, F), which are still σ-additive but no longer satisfy µ(Ω) = 1; see [3, Chapter 4] for details.
In this case, we set
∫_Ω X(ω) µ(dω) := ∫_Ω X + (ω) µ(dω) − ∫_Ω X − (ω) µ(dω).
(c) The most important example of a general measure is the Lebesgue-measure λ on R, which
is the unique measure on (R, BR ) such that λ((a, b]) = b − a for all −∞ < a < b < ∞. If
f : R → R is a Borel-measurable function, we say that f is Lebesgue-integrable if
∫_{−∞}^{∞} |f (x)| dx := ∫_R |f (x)| λ(dx) < ∞
and write
∫_{−∞}^{∞} f (x) dx := ∫_R f (x) λ(dx).
Lemma 1.59. Let X and Y be integrable random variables on some probability space (Ω, F, P ).
(a) For a, b ∈ R, aX + bY is integrable and E [aX + bY ] = aE [X] + bE [Y ].
(b) If X ≥ Y P -a.s., then E [X] ≥ E [Y ], where the inequality is an equality if and only if X = Y P -a.s.
Property (a) is referred to as linearity of the expectation and property (b) as monotonicity of the expectation.
Proof. (a) By the triangle inequality, Lemma 1.54, and the fact that X and Y are integrable, it follows that
E [|aX + bY |] ≤ |a|E [|X|] + |b|E [|Y |] < ∞.
Thus, aX + bY is integrable. The rest of the claim follows from splitting X, Y , and X + Y
into their positive and negative parts and applying Lemma 1.54; for details see [3, Theorem
4.9(c)].
23 Note that for Riemann-integrable functions, the Lebesgue integral and the Riemann integral coincide; see [3, Chapter 4.3].
(b) Set Z := X − Y . Then Z ≥ 0 P -a.s. By part (a), it suffices to show that E [Z] ≥ 0, where the inequality is an equality if and only if Z = 0 P -a.s. Since Z ≥ 0 P -a.s., it follows that Z − = 0 P -a.s., and so
E [Z] = E [Z + ] − E [Z − ] = E [Z + ].
The claim then follows from Lemma 1.55 applied to Z + .
The next result is a measure theoretic change of variable formula. Its proof (which uses again
the monotone convergence theorem) is left as an exercise.
Proposition 1.60. Let (Ω, F, P ) be a probability space, (Ω0 , F 0 ) a measurable space, and
X : Ω → Ω0 an F − F 0 measurable map. Moreover, let g : Ω0 → R̄ be an F 0 − BR̄ -measurable
map. Let PX be image measure of X under P .24 Then g(X) is P integrable if and only if g
is PX integrable. Moreover, in this case (or if g ≥ 0) we have the transformation formula:
∫_Ω g(X(ω)) P (dω) = ∫_{Ω′} g(ω′) PX (dω′).
The following lemma considers the special case that X is a random variable with discrete or continuous distribution. Its proof (which uses Remarks 1.27 and 1.30 together with the monotone convergence theorem) is left as an exercise.
Lemma 1.61. Let X be a random variable on some probability space (Ω, F, P ) and g : R → R a Borel-measurable function.
(a) If X is discrete, taking values in the countable set B with pmf pX , then
E [|g(X)|] = Σ_{x∈B} |g(x)| pX (x). (1.15)
24 That is, PX [A′] := P [X ∈ A′].
(b) If X is continuous with pdf fX , then
Z ∞
E[|g(X)|] = |g(x)|fX (x) dx. (1.16)
−∞
Example 1.62. Let X ∼ Poisson(λ) and Y := 1/(1 + X). To find the expected value of Y we can employ (1.15). Indeed, let g : R → R be such that x 7→ 1/(1 + x). Note that X takes values in N0 and thus |g(X)| = g(X).
Thus,
E [g(X)] = Σ_{k=0}^∞ g(k) pX (k) = Σ_{k=0}^∞ (1/(1 + k)) e^{−λ} λ^k / k!
= (e^{−λ}/λ) Σ_{k=0}^∞ λ^{k+1}/(k + 1)! = (1/λ)(1 − e^{−λ}).
Finally, we link the notion of independence to the concept of expectation; see [3, Theorem
5.4] for a proof (which uses once again the monotone convergence theorem).
Theorem 1.63. Let X and Y be independent integrable random variables on some probability
space (Ω, F, P ). Then XY is also integrable and
E[XY ] = E[X]E[Y ].
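Theorem 1.63 can be illustrated on a finite example: two independent fair dice, where E[XY] = E[X]E[Y] can be verified by direct enumeration (the modelling choices below are ours):

```python
import itertools

# All 36 outcomes of two independent fair dice, each with probability 1/36.
outcomes = list(itertools.product(range(1, 7), repeat=2))
E_X = sum(x for x, _ in outcomes) / 36
E_Y = sum(y for _, y in outcomes) / 36
E_XY = sum(x * y for x, y in outcomes) / 36

# Independence gives E[XY] = E[X] E[Y] (= 3.5 * 3.5 = 12.25 here).
assert abs(E_XY - E_X * E_Y) < 1e-12
```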
1.8 Lp -spaces
In this section, we introduce the key notion of Lp -spaces.
Definition 1.64. Let (Ω, F, P) be a probability space and p ∈ [1, ∞). For an F-measurable random variable X, set25

‖X‖p := E[|X|^p]^{1/p}. (1.17)

If ‖X‖p < ∞, we say that X has finite p-th moment (with respect to P) and call E[X^p] the p-th moment of X. We denote the collection of all random variables on (Ω, F) with finite p-th moment with respect to P by Lp(Ω, F, P). If there is no danger of confusion, we often write Lp(P) or just Lp for Lp(Ω, F, P).

25 Here, we use the natural convention that ∞^{1/p} = ∞.
Remark 1.65. The reason for the "outside power" 1/p in (1.17) is to ensure that the map ‖·‖p : Lp → R is positively homogeneous. Indeed, let λ ≥ 0 and X ∈ Lp. Then linearity of the expectation gives

‖λX‖p = E[|λX|^p]^{1/p} = (λ^p E[|X|^p])^{1/p} = λ ‖X‖p.
Definition 1.66. Let (Ω, F, P) be a probability space. For an F-measurable random variable X, set

‖X‖∞ := inf{K ≥ 0 : |X| ≤ K P-a.s.}.

If ‖X‖∞ < ∞, we say that X is P-a.s. bounded. We denote the collection of all real-valued random variables on (Ω, F) that are P-a.s. bounded by L∞(Ω, F, P). If there is no danger of confusion, we often write L∞(P) or just L∞ for L∞(Ω, F, P).
Proposition 1.67. Let (Ω, F, P) be a probability space and 1 ≤ p1 ≤ p2 ≤ ∞. If X ∈ Lp2(P), then X ∈ Lp1(P).

Proof. First assume that p2 = ∞. Then there is a constant K > 0 such that |X| ≤ K P-a.s. Monotonicity of the expectation gives E[|X|^{p1}] ≤ K^{p1} < ∞ and so X ∈ Lp1.

Next assume that p2 < ∞. The elementary inequality x^{p1} ≤ 1 + x^{p2} for x ≥ 0 together with monotonicity and linearity of the expectation give

E[|X|^{p1}] ≤ 1 + E[|X|^{p2}] < ∞.

Thus, X ∈ Lp1.
This containment can fail on measure spaces whose total mass is infinite:
Example 1.68. Consider the measure space (R, B(R), λ), where λ is the Lebesgue measure. Let

f(x) = 1/x for x ≥ 1 and f(x) = 0 otherwise.

Then f ∈ L2(R, B(R), λ) but f ∉ L1(R, B(R), λ), since

∫_R |f(x)|² λ(dx) = [−1/x]_{x=1}^{x=∞} = 1 and ∫_R |f(x)| λ(dx) = [log x]_{x=1}^{x=∞} = ∞.
We proceed to state the important inequalities of Hölder and Minkowski; for a proof see [3, Theorems 7.16 and 7.17]. Hölder's inequality shows that the product of random variables that lie in Lp-spaces with conjugate exponents is integrable. Here, p, q ∈ [1, ∞] are called conjugate if 1/p + 1/q = 1, with the convention that 1/∞ := 0.
Theorem 1.69 (Hölder's inequality). Let (Ω, F, P) be a probability space and X, Y be random variables with X ∈ Lp(P) and Y ∈ Lq(P), where p, q ∈ [1, ∞] and 1/p + 1/q = 1. Then XY ∈ L1(P) and

‖XY‖1 ≤ ‖X‖p ‖Y‖q.
Minkowski’s inequality shows that the map k · kp : Lp → R+ satisfies the triangle inequality.
Theorem 1.70 (Minkowski's inequality). Let (Ω, F, P) be a probability space, p ∈ [1, ∞] and X, Y ∈ Lp(P). Then

‖X + Y‖p ≤ ‖X‖p + ‖Y‖p.
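Both inequalities can be sanity-checked on a finite uniform probability space, where ‖Z‖r reduces to a finite average; a minimal sketch (sample size, exponents, and the helper `norm` are our own choices):

```python
import random
random.seed(0)

n = 1000
X = [random.gauss(0, 1) for _ in range(n)]
Y = [random.gauss(0, 1) for _ in range(n)]

def norm(Z, r):
    """||Z||_r on the n-point uniform probability space."""
    return (sum(abs(z) ** r for z in Z) / n) ** (1 / r)

# Hölder: ||XY||_1 <= ||X||_p ||Y||_q for conjugate p, q (1/4 + 3/4 = 1).
p, q = 4.0, 4.0 / 3.0
assert sum(abs(x * y) for x, y in zip(X, Y)) / n <= norm(X, p) * norm(Y, q) + 1e-12

# Minkowski: ||X + Y||_p <= ||X||_p + ||Y||_p.
assert norm([x + y for x, y in zip(X, Y)], p) <= norm(X, p) + norm(Y, p) + 1e-12
```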
One important consequence of Minkowski's inequality is that the map ‖·‖p is a norm and each Lp(Ω, F, P) is a normed vector space.26 One can even show that the metric/topology induced by ‖·‖p is complete, so that each Lp is in fact a Banach space. This is the content of the following result; for a proof see [3, Theorem 7.18].
Theorem 1.71 (Fischer-Riesz). Let (Ω, F, P) be a probability space, p ∈ [1, ∞], and (Xn)n∈N a Cauchy sequence in Lp, i.e., for each ε > 0, there is N ∈ N such that ‖Xn − Xm‖p < ε for all m, n ≥ N. Then there exists X ∈ Lp such that ‖Xn − X‖p → 0 as n → ∞.
Definition 1.72. Let (Ω, F, P) be a probability space and X, Y ∈ L2(P). The variance of X is defined by

Var[X] := E[(X − E[X])²],

and the covariance of X and Y is defined by

Cov[X, Y] := E[(X − E[X])(Y − E[Y])].
The following result lists some elementary properties of the variance/covariance; its proof is
left as an exercise.
Proof. This follows immediately from Proposition 1.73(a) and Theorem 1.63.
P[X = 1, Y = 1] = 0 ≠ (1/3) · (1/3) = P[X = 1] P[Y = 1],
and therefore X and Y are not independent.
Proposition 1.76 (Bienaymé formula). Let (Ω, F, P) be a probability space and X1, . . . , Xn ∈ L2(P). Then

Var[∑_{k=1}^n Xk] = ∑_{k=1}^n Var[Xk] + 2 ∑_{1≤i<j≤n} Cov[Xi, Xj]. (1.18)

In particular, if X1, . . . , Xn are pairwise uncorrelated, then

Var[∑_{k=1}^n Xk] = ∑_{k=1}^n Var[Xk].
Proof. Linearity of the expectation and the fact that Cov[X, Y] = Cov[Y, X] yield

Var[∑_{k=1}^n Xk] = E[(∑_{k=1}^n (Xk − E[Xk])) (∑_{k=1}^n (Xk − E[Xk]))]
= ∑_{i=1}^n ∑_{j=1}^n E[(Xi − E[Xi])(Xj − E[Xj])]
= ∑_{i=1}^n ∑_{j=1}^n Cov[Xi, Xj] = ∑_{k=1}^n Var[Xk] + ∑_{i,j=1, i≠j}^n Cov[Xi, Xj]
= ∑_{k=1}^n Var[Xk] + 2 ∑_{1≤i<j≤n} Cov[Xi, Xj].
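Since (1.18) is an algebraic identity in first and second moments, it also holds exactly for the empirical moments of any finite sample; a quick check (the data and helper names are ours):

```python
import random
random.seed(1)

n, m = 3, 500  # three "random variables", observed at 500 sample points
data = [[random.uniform(-1, 1) for _ in range(m)] for _ in range(n)]

mean = lambda Z: sum(Z) / m
cov = lambda A, B: mean([a * b for a, b in zip(A, B)]) - mean(A) * mean(B)
var = lambda A: cov(A, A)

S = [sum(col) for col in zip(*data)]  # pointwise sum X_1 + X_2 + X_3
rhs = sum(var(row) for row in data) + 2 * sum(
    cov(data[i], data[j]) for i in range(n) for j in range(i + 1, n))
assert abs(var(S) - rhs) < 1e-9  # Var[S] matches the Bienaymé decomposition
```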
Theorem 1.77 (Markov's inequality). Let X be a random variable on some probability space (Ω, F, P) and f : [0, ∞) → [0, ∞) a nondecreasing function.27 Then for any ε > 0 with f(ε) > 0, we have the Markov inequality

P[|X| ≥ ε] ≤ E[f(|X|)] / f(ε).
Proof. Monotonicity and linearity of the expectation together with the fact that f is nondecreasing give

E[f(|X|)] ≥ E[f(|X|) 1{f(|X|)≥f(ε)}] ≥ E[f(ε) 1{f(|X|)≥f(ε)}] ≥ E[f(ε) 1{|X|≥ε}] = f(ε) P[|X| ≥ ε].

27 Note that f is then automatically measurable. Indeed, for x ∈ R, f⁻¹([0, x]) ∈ B([0, ∞)) because it is of the form [0, a) or [0, a] for some a ∈ [0, ∞].
Remark 1.78. We can look at Markov's inequality in a different light. Let X be a random variable on some probability space (Ω, F, P). Moreover, let ε := c E[|X|] for some c ∈ R+. Then, by Markov's inequality (with f(x) = x),

P[|X| ≥ c E[|X|]] ≤ 1/c.
Moreover, knowing that P [|X| ≥ α] = β for some α ∈ R+ , we can deduce that E[|X|] ≥ αβ.
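This reading of Markov's inequality holds exactly for the empirical distribution of any sample, which gives an easy numerical check (the sample size and values of c are arbitrary choices of ours):

```python
import random
random.seed(2)

sample = [abs(random.gauss(0, 1)) for _ in range(10_000)]
mean_abs = sum(sample) / len(sample)  # empirical E[|X|]

for c in (1.5, 2.0, 4.0):
    # empirical P[|X| >= c E[|X|]] can be at most 1/c
    tail = sum(1 for x in sample if x >= c * mean_abs) / len(sample)
    assert tail <= 1 / c + 1e-12
```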
Corollary 1.79 (Chebyshev's inequality). Let X ∈ L2 be a random variable on some probability space (Ω, F, P). Then for any ε > 0,

P[|X − E[X]| ≥ ε] ≤ Var[X]/ε².

Example 1.80. Suppose it is known that the number of items produced in a factory during a week is a random variable X on a probability space (Ω, F, P) such that X ∈ L2 and E[X] = 50.
Let A be defined as the event
If the variance of this week’s production is 25, then what can be said about P [A]?
By Chebyschev’s inequality,
For the next inequality, we need to recall the notion of a convex function.

Definition 1.81. Let D ⊂ R be an interval. A function f : D → R is called convex if for all x1, x2 ∈ D and all λ ∈ [0, 1],

f(λx1 + (1 − λ)x2) ≤ λf(x1) + (1 − λ)f(x2). (1.19)

It is called strictly convex if the inequality in (1.19) is strict for x1 ≠ x2 and λ ∈ (0, 1).
Graphically speaking, (strict) convexity means that straight line segments joining (x1 , f (x1 ))
to (x2 , f (x2 )) always lie (strictly) above the graph of f .
Remark 1.82. (a) If f is convex, then f is automatically continuous in the interior of D; see
[3, Theorem 7.7(i)].
We proceed to state and prove the fundamental inequality for convex functions.
Theorem 1.83 (Jensen’s inequality). Let (Ω, F, P ) be a probability space and X an integrable
random variable with values in a non-empty interval D ⊂ R. Let f : D → R be convex and
suppose that f is nonnegative or E [|f (X)|] < ∞. Then
E [f (X)] ≥ f (E [X]) .
Moreover, the inequality is strict when f is strictly convex and X is not P -a.s. constant.
28 The converse is not true: for example, the function f : R → R, x ↦ x⁴ is strictly convex, but f″(0) = 0.
Proof. The claim is trivial if f is nonnegative and E [f (X)] = ∞. So it suffices to consider
the case E [|f (X)|] < ∞.
First, using the definition of convexity, one can show that for each a ∈ D, there is b ∈ R such that

f(x) ≥ f(a) + b(x − a) for all x ∈ D. (1.20)

Next, choose a := E[X]. One can show that a ∈ D because D is an interval. Let b ∈ R be such that (1.20) is satisfied. Then f(X) ≥ f(a) + b(X − a) P-a.s., and

P[f(X) > f(a) + b(X − a)] = P[X ≠ a] > 0

if f is strictly convex and X is not P-a.s. constant. Thus, by monotonicity and linearity of the integral and the fact that a = E[X],

E[f(X)] ≥ E[f(a) + b(X − a)] = f(a) + b(E[X] − a) = f(a) + b(a − a) = f(a) = f(E[X]),

where the inequality is strict if f is strictly convex and X is not P-a.s. constant.
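For the strictly convex function f(x) = x², Jensen's inequality reads E[X²] ≥ (E[X])², strictly for non-constant X; on an empirical distribution the gap is exactly the sample variance. A small check (the sample is our own choice):

```python
import random
random.seed(3)

sample = [random.uniform(0, 10) for _ in range(1000)]
mean = sum(sample) / len(sample)
mean_sq = sum(x * x for x in sample) / len(sample)

# Jensen with f(x) = x^2; strict inequality since the sample is not constant,
# and the difference mean_sq - mean**2 is the (population) sample variance.
assert mean_sq > mean ** 2
```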
Definition 1.84. Let Ω1, . . . , ΩN be non-empty sample spaces and set

Ω := {ω = (ω1, . . . , ωN) : ω1 ∈ Ω1, . . . , ωN ∈ ΩN}.
29 If f is twice continuously differentiable, the (weak) inequality (1.20) can be easily derived as follows: fix a ∈ D and set b := f′(a). By a Taylor expansion of f around a of order 1 with Lagrange remainder term, we obtain for fixed x ∈ D

f(x) = f(a) + b(x − a) + (1/2) f″(ξ)(x − a)²,

where ξ lies in the interval with endpoints x and a. Since f″ ≥ 0 by convexity of f, (1.20) follows.
Then Ω is called the product sample space of Ω1, . . . , ΩN and denoted by

Ω := ⨉_{n=1}^N Ωn := Ω1 × · · · × ΩN.
Example 1.85. Consider rolling a die 3 times. Then this can be modelled by the sample
space Ω := {1, . . . , 6}3 := {(ω1 , ω2 , ω3 ) : ω1 , ω2 , ω3 ∈ {1, . . . , 6}}.
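Because the 6³ = 216 outcomes of this product space are equally likely under the product of three uniform measures, probabilities can be computed by counting outcomes; a small sketch (the event is our own example):

```python
import itertools

# The product sample space for three die rolls: 6^3 = 216 outcomes.
omega = list(itertools.product(range(1, 7), repeat=3))
assert len(omega) == 216

# Under the uniform product measure, P[all three rolls equal] = 6/216 = 1/36.
p_triple = sum(1 for w in omega if w[0] == w[1] == w[2]) / len(omega)
assert abs(p_triple - 1 / 36) < 1e-12
```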
Definition 1.86. Let (Ω1, F1), . . . , (ΩN, FN) be measurable spaces. The product σ-algebra on Ω = Ω1 × · · · × ΩN is defined by

F := σ({A1 × · · · × AN : A1 ∈ F1, . . . , AN ∈ FN})

and denoted by

F := ⊗_{n=1}^N Fn := F1 ⊗ · · · ⊗ FN.

Intuitively, the product σ-algebra is the smallest σ-algebra such that all rectangular sets A1 × · · · × AN of Ω with A1 ∈ F1, . . . , AN ∈ FN are measurable.
Finally, we consider the product of probability measures. The following result establishes
existence and uniqueness of the product measure; for a proof see [3, Theorem 14.14].
Theorem 1.87. Let (Ω1, F1, P1), . . . , (ΩN, FN, PN) be probability spaces. Then there exists a unique probability measure P on (⨉_{n=1}^N Ωn, ⊗_{n=1}^N Fn) such that

P[A1 × · · · × AN] = ∏_{n=1}^N Pn[An] for all A1 ∈ F1, . . . , AN ∈ FN. (1.21)

It is called the product measure of P1, . . . , PN and denoted by

P := ⊗_{n=1}^N Pn := P1 ⊗ · · · ⊗ PN.
We proceed to study the expectation of random variables with respect to the product measure of two probability measures. In this case the integral notation is more handy; cf. Remark 1.58. The following result shows that instead of integrating over the product measure (which we do not know explicitly), we can also integrate first over one measure and then over the other. Moreover, the order of integration does not matter; for a proof see [3, Theorem 14.16].
Theorem 1.88 (Fubini). Let (Ω1, F1, P1) and (Ω2, F2, P2) be probability spaces. Let X : Ω1 × Ω2 → R̄ be F1 ⊗ F2-measurable. Assume that either X ≥ 0 P1 ⊗ P2-almost surely or X ∈ L1(P1 ⊗ P2). Then:

• The map ω1 ↦ ∫_{Ω2} X(ω1, ω2) P2(dω2) is F1-measurable, and P1-integrable in case that X ∈ L1(P1 ⊗ P2).30

• The map ω2 ↦ ∫_{Ω1} X(ω1, ω2) P1(dω1) is F2-measurable, and P2-integrable in case that X ∈ L1(P1 ⊗ P2).31

• Moreover,

∫_{Ω1×Ω2} X d(P1 ⊗ P2) = ∫_{Ω1} (∫_{Ω2} X(ω1, ω2) P2(dω2)) P1(dω1) = ∫_{Ω2} (∫_{Ω1} X(ω1, ω2) P1(dω1)) P2(dω2).
Remark 1.89. (a) Fubini's theorem also holds more generally if we replace P1 and P2 by σ-finite measures µ1 and µ2. Of special importance is the case where µ1 and µ2 are both the Lebesgue measure; see [3, Section 14.2] for details.
(b) The notion of product sample spaces, product σ-algebras, and product measures can be
extended to countable (and even uncountable) families of probability spaces; see [3, Chapter
14] for details.
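On a finite product space, Fubini's theorem reduces to swapping the order of two finite sums, which can be checked directly; a toy sketch (the two measures and the function X are our own example):

```python
import itertools

P1 = {1: 0.3, 2: 0.7}            # a probability measure on a two-point space
P2 = {1: 0.5, 2: 0.25, 3: 0.25}  # a probability measure on a three-point space
X = lambda w1, w2: w1 * w2 + 1.0

# Integral against the product measure, and both iterated integrals.
joint = sum(X(i, j) * P1[i] * P2[j] for i, j in itertools.product(P1, P2))
iter12 = sum(P1[i] * sum(X(i, j) * P2[j] for j in P2) for i in P1)
iter21 = sum(P2[j] * sum(X(i, j) * P1[i] for i in P1) for j in P2)
assert abs(joint - iter12) < 1e-12 and abs(joint - iter21) < 1e-12
```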
30 Here, we agree that ∫_{Ω2} X(ω1, ω2) P2(dω2) := −∞ if ∫_{Ω2} |X(ω1, ω2)| P2(dω2) = ∞.

31 Here, we agree that ∫_{Ω1} X(ω1, ω2) P1(dω1) := −∞ if ∫_{Ω1} |X(ω1, ω2)| P1(dω1) = ∞.
2 Sequences of random variables and limit theorems
2.1 Convergence of random variables
In this section, we look at different types of convergence for random variables.
Definition 2.1. Let (Xn)n∈N be a sequence of random variables on a probability space (Ω, F, P). Then (Xn)n∈N is said to converge to a random variable X

• in Lp for some p ∈ [1, ∞], if Xn ∈ Lp for all n ∈ N, X ∈ Lp, and lim_{n→∞} ‖Xn − X‖p = 0. In this case, we write Xn →^{Lp} X.

• almost surely, if P[{ω ∈ Ω : lim_{n→∞} Xn(ω) = X(ω)}] = 1. In this case, we write Xn →^{a.s.} X.

• in probability, if lim_{n→∞} P[|Xn − X| > ε] = 0 for each ε > 0. In this case, we write Xn →^{P} X.

• in distribution, if lim_{n→∞} FXn(x) = FX(x) for every x ∈ R at which FX is continuous. In this case, we write Xn ⇒ X.
Remark 2.2. (a) If (Xn )n∈N converges in Lp , almost surely, or in probability, the limiting
random variable X is P -almost surely unique. In light of Proposition 2.3 below, it suffices to
show this for the case of convergence in probability. So suppose that (Xn )n∈N converges in
probability to X and X′. Let ε > 0 be given. Using that for each n ∈ N,

{|X − X′| > ε} ⊂ {|X − Xn| > ε/2} ∪ {|X′ − Xn| > ε/2},

we obtain

P[|X − X′| > ε] ≤ lim sup_{n→∞} (P[|X − Xn| > ε/2] + P[|X′ − Xn| > ε/2]) = 0.

We may conclude that X = X′ P-a.s.
(b) Since |X| ≤ |X − Xn| + |Xn| for each n, Xn ∈ Lp together with Xn →^{Lp} X and Minkowski's inequality give X ∈ Lp. Moreover, using also that |Xn| ≤ |X − Xn| + |X| for each n ∈ N, we get by Minkowski's inequality and properties of the limit inferior and the limit superior32

‖X‖p ≤ lim inf_{n→∞} ‖|X − Xn| + |Xn|‖p ≤ lim inf_{n→∞} (‖X − Xn‖p + ‖Xn‖p) = lim inf_{n→∞} ‖Xn‖p
≤ lim sup_{n→∞} ‖Xn‖p ≤ lim sup_{n→∞} ‖|X − Xn| + |X|‖p ≤ lim sup_{n→∞} (‖X − Xn‖p + ‖X‖p) = ‖X‖p.

Hence lim_{n→∞} ‖Xn‖p exists and equals ‖X‖p.
(c) Using Markov's inequality (Theorem 1.77), it is not difficult to check that Xn →^{P} X if and only if33

lim_{n→∞} E[|Xn − X| ∧ 1] = 0.

This alternative characterisation shows that the topology induced by convergence in probability is metrisable with metric d(X, Y) = E[|X − Y| ∧ 1].
(d) One can show with some effort (see [3, Theorem 13.23]) that Xn ⇒ X if and only if for all bounded continuous functions f : R → R,

lim_{n→∞} E[f(Xn)] = E[f(X)].

This alternative characterisation explains why convergence in distribution is also called weak convergence.34
We proceed to study the relationship between the different types of convergence. First, it is
not difficult to check that Lp convergence does not imply almost sure convergence and vice
versa.
32 More precisely, we use that if (an)n∈N and (bn)n∈N are sequences of real numbers, where (bn)n∈N is convergent, then lim inf_{n→∞}(an + bn) = lim inf_{n→∞} an + lim_{n→∞} bn and lim sup_{n→∞}(an + bn) = lim sup_{n→∞} an + lim_{n→∞} bn. Here, we only show the claim for the limit inferior. Since lim inf_{n→∞}(an + bn) ≥ lim inf_{n→∞} an + lim inf_{n→∞} bn without any assumptions on (bn)n∈N, it suffices to show that lim inf_{n→∞}(an + bn) ≤ lim inf_{n→∞} an + lim_{n→∞} bn. Let ε > 0 be given and set b := lim_{n→∞} bn. Then there is N ∈ N such that bn ≤ b + ε for all n ≥ N. Hence, by the definition of the limit inferior,

lim inf_{n→∞}(an + bn) = lim_{n→∞} inf_{k≥n}(ak + bk) ≤ lim_{n→∞} inf_{k≥n}(ak + b + ε) = lim inf_{n→∞} an + b + ε.

Since ε > 0 was arbitrary, the claim follows.
Next, we show that convergence in Lp and almost sure convergence both imply convergence in probability. By contrast, it is not difficult to check that convergence in probability implies neither Lp-convergence nor almost sure convergence.
Proposition 2.3. Let (Ω, F, P) be a probability space, (Xn)n∈N a sequence of random variables and X a random variable.

(a) If (Xn)n∈N converges to X in Lp for some p ∈ [1, ∞], then it converges to X in probability.

(b) If (Xn)n∈N converges to X almost surely, then it converges to X in probability.
Proof. (a) The case p = ∞ is easy and left as an exercise. Assume that p < ∞. Let ε > 0. Markov's inequality (Theorem 1.77) with f(x) = x^p and the fact that Xn →^{Lp} X give

lim sup_{n→∞} P[|Xn − X| ≥ ε] ≤ lim sup_{n→∞} E[|Xn − X|^p]/ε^p = lim sup_{n→∞} (1/ε^p) ‖Xn − X‖p^p = 0.

(b) Let ε > 0. For n ∈ N, set An := {|Xn − X| > ε}. Since Xn →^{a.s.} X, it follows that P[{An i.o.}] = 0. Hence, by Exercise 1.2(c), we have

lim sup_{n→∞} P[An] ≤ P[{An i.o.}] = 0,

and so lim_{n→∞} P[|Xn − X| > ε] = 0.
Proposition 2.4. Let (Ω, F, P) be a probability space, (Xn)n∈N a sequence of random variables and X a random variable. If (Xn)n∈N converges to X in probability, then it converges to X in distribution.

Proof. Let x ∈ R be a continuity point of FX and let ε > 0. Then

{Xn ≤ x} ⊂ {X ≤ x + ε} ∪ {|Xn − X| > ε} and {X ≤ x − ε} ⊂ {Xn ≤ x} ∪ {|Xn − X| > ε}.

Taking probabilities, we get

FX(x − ε) − P[|Xn − X| > ε] ≤ FXn(x) ≤ FX(x + ε) + P[|Xn − X| > ε].

Letting n → ∞ and using that Xn →^{P} X, we obtain

FX(x − ε) ≤ lim inf_{n→∞} FXn(x) ≤ lim sup_{n→∞} FXn(x) ≤ FX(x + ε).

Since FX is continuous at x and ε > 0 was arbitrary, lim_{n→∞} FXn(x) = FX(x).
2.2 Uniform integrability

We next ask under which conditions almost sure convergence implies convergence in L1. The key ingredient is the notion of uniform integrability of a family of random variables.
Definition 2.5. A family (Xi)i∈I of random variables on some probability space (Ω, F, P) is said to be uniformly integrable (UI) if

lim_{K→∞} sup_{i∈I} E[|Xi| 1{|Xi|≥K}] = 0.
It is not difficult to check that a single random variable is uniformly integrable if and only
if it is integrable. The following result lists some further simple criteria to check for uniform
integrability. Its proof is left as an exercise.
Lemma 2.6. Let (Ω, F, P ) be a probability space and (Xi )i∈I a family of integrable random
variables.
(a) If there is X ∈ L1 with |Xi | ≤ X P -a.s. for all i ∈ I, then (Xi )i∈I is uniformly
integrable.
(c) If (Xi )i∈I is uniformly integrable and X ∈ L1 , then (Xi + X)i∈I is again uniformly
integrable.
The following result gives two useful equivalent characterisations of uniform integrability; for a proof see [3, Theorems 6.19 and 6.24].

Theorem 2.7 (de la Vallée-Poussin). Let (Xi)i∈I be a family of random variables on some probability space (Ω, F, P). Then the following are equivalent.
(1) (Xi )i∈I is uniformly integrable.
(2) (Xi )i∈I is bounded in L1 , i.e., supi∈I E [|Xi |] < ∞, and for each ε > 0, there exists
δ > 0 such that
P [A] ≤ δ =⇒ E [|Xi |1A ] ≤ ε for all i ∈ I.
(3) There exists a nondecreasing convex function H : [0, ∞) → [0, ∞) with lim_{x→∞} H(x)/x = ∞ such that sup_{i∈I} E[H(|Xi|)] < ∞.
We note an important corollary, which gives one of the most useful criteria in practice for checking that a family of random variables is UI.
Corollary 2.8. Let (Xi )i∈I be a family of random variables on some probability space (Ω, F, P )
and p ∈ (1, ∞]. Suppose that (Xi )i∈I is bounded in Lp , i.e., supi∈I kXi kp < ∞. Then (Xi )i∈I
is uniformly integrable.
With the help of Theorem 2.7, we can now show that a sequence of integrable random variables that converges almost surely converges in L1 if and only if it is uniformly integrable.
Theorem 2.9. Let (Ω, F, P) be a probability space and (Xn)n∈N a sequence of random variables that converges almost surely to a random variable X.35 Then the following are equivalent.

(1) (Xn)n∈N is uniformly integrable.

(2) X is integrable and (Xn)n∈N converges to X in L1.
Proof. We shall only prove the more important direction "(1) ⇒ (2)"; the other direction is left as an exercise.

"(1) ⇒ (2)". First we show that X is integrable. Using that the Xn are bounded in L1 by Theorem 2.7(2), Fatou's lemma gives

E[|X|] = E[lim inf_{n→∞} |Xn|] ≤ lim inf_{n→∞} E[|Xn|] ≤ sup_{n∈N} E[|Xn|] < ∞.
Next, set Yn := Xn − X for n ∈ N. Then Yn converges to 0 almost surely, and (Yn)n∈N is UI by Lemma 2.6. Let ε > 0 be given. Then for each n ∈ N,

E[|Yn|] = E[|Yn| 1{|Yn|≤ε}] + E[|Yn| 1{|Yn|>ε}] ≤ ε + E[|Yn| 1{|Yn|>ε}]. (2.1)
35 With slightly more work, one can show that the result still holds if we only assume that (Xn)n∈N converges to X in probability; see [3, Theorem 6.25] for details.
Moreover, by Theorem 2.7(2), there is δ > 0 such that

P[A] ≤ δ =⇒ E[|Yn| 1_A] ≤ ε for all n ∈ N. (2.2)
Now using that Yn converges to 0 almost surely, and hence in probability by Proposition 2.3,
there is N ∈ N such that P [|Yn | > ε] ≤ δ for all n ≥ N . Combining this with (2.1) and (2.2),
we obtain
E [|Yn |] ≤ 2ε for all n ≥ N.
Since ε > 0 was arbitrary, we may conclude that Yn converges to 0 in L1 and hence Xn
converges to X in L1 .
The following result follows immediately from Theorem 2.9 and Lemma 2.6(a). It is known
as the dominated convergence theorem.
Theorem 2.10 (Dominated convergence theorem). Let (Ω, F, P) be a probability space and (Xn)n∈N a sequence of integrable random variables that converges almost surely to a random variable X. Suppose that there is an integrable random variable Y such that |Xn| ≤ Y P-a.s. for all n ∈ N. Then Xn converges to X in L1.
Remark 2.11. The direction “(1) ⇒ (2)” in Theorem 2.9 is often referred to as generalised
dominated convergence theorem.
2.3 The laws of large numbers

First, we study convergence in probability; the corresponding result is usually referred to as the weak law of large numbers.
Theorem 2.12 (Weak law of large numbers). Let X1, X2, . . . be a sequence of i.i.d. random variables in L1 with mean µ on some probability space (Ω, F, P). Set Sn := ∑_{i=1}^n Xi for n ∈ N. Then

Sn/n →^{P} µ.
Proof. We show the result under the additional assumption that E[X1²] < ∞; the general case follows from Theorem 2.13 below and the fact that almost sure convergence implies convergence in probability. Set σ² := Var[X1]. Then by the fact that the Xi are i.i.d., we obtain
by linearity of the expectation and the Bienaymé formula (1.18),

E[Sn/n] = (1/n) ∑_{i=1}^n E[Xi] = (1/n) nµ = µ, (2.3)

Var[Sn/n] = (1/n²) ∑_{i=1}^n Var[Xi] = (1/n²) nσ² = σ²/n. (2.4)

Let ε > 0. By Chebyshev's inequality (Corollary 1.79), (2.3) and (2.4), we obtain for n ∈ N,

P[|Sn/n − µ| > ε] = P[|Sn/n − E[Sn/n]| > ε] ≤ (1/ε²) Var[Sn/n] = σ²/(nε²).
Letting n → ∞ establishes the claim.
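The law of large numbers is easy to visualise by simulation; the sketch below uses i.i.d. Uniform(0, 1) variables (µ = 1/2) and checks that the sample mean is close to µ for a large n (the sample size and tolerance are our own choices, and this illustrates, rather than proves, the theorem):

```python
import random
random.seed(4)

def sample_mean(n):
    """S_n / n for n i.i.d. Uniform(0, 1) random variables."""
    return sum(random.random() for _ in range(n)) / n

# For Uniform(0, 1), mu = 1/2; the sample mean should settle near 1/2.
assert abs(sample_mean(100_000) - 0.5) < 0.01
```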
Next, we study the almost sure version of the law of large numbers. This is usually referred
to as the strong law of large numbers.
Theorem 2.13 (Strong law of large numbers). Let X1, X2, . . . be a sequence of i.i.d. random variables in L1 with mean µ on some probability space (Ω, F, P). Set Sn := ∑_{i=1}^n Xi for n ∈ N. Then

Sn/n →^{a.s.} µ.
Proof. We show the result under the additional assumption that K := E[X1⁴] < ∞; for the general case see [3, Theorem 5.16]. We may assume without loss of generality that E[X1] = 0; otherwise consider X̃i := Xi − E[Xi] and use that Sn/n = S̃n/n + µ, where S̃n := ∑_{i=1}^n X̃i. Then, by independence and E[X1] = 0, E[XiXjXkXl] = 0 whenever one of the indices i, j, k, l is different from all the others. Moreover, Jensen's inequality and the fact that the Xi are i.i.d. give

E[Xi²Xj²] = E[Xi²] E[Xj²] = (E[X1²])² ≤ E[X1⁴] = K

for i, j ∈ N distinct. Hence,

E[Sn⁴] = ∑_{i=1}^n ∑_{j=1}^n ∑_{k=1}^n ∑_{l=1}^n E[XiXjXkXl] = ∑_{i=1}^n E[Xi⁴] + 6 ∑_{i=1}^n ∑_{j=1}^{i−1} E[Xi²Xj²]
≤ nK + 6 (n(n − 1)/2) K ≤ 3n²K.
Now using the fact that ∑_{n=1}^∞ 1/n² < ∞, we obtain by monotone convergence,

E[∑_{n=1}^∞ (Sn/n)⁴] = ∑_{n=1}^∞ (1/n⁴) E[Sn⁴] ≤ 3K ∑_{n=1}^∞ 1/n² < ∞.

By Lemma 1.55, this implies that ∑_{n=1}^∞ (Sn/n)⁴ < ∞ P-a.s. It follows that lim_{n→∞} (Sn/n)⁴ = 0 P-a.s.36 Hence, we have lim_{n→∞} Sn/n = 0 P-a.s.
2.4 The law of the iterated logarithm and the central limit theorem
If X1, X2, . . . are i.i.d. random variables in L1 with mean µ, then the strong law of large numbers implies that S̃n/n converges to 0 almost surely, where S̃n := ∑_{k=1}^n (Xk − µ). Two natural follow-up questions are to understand the precise size of S̃n in terms of n and to find a nondegenerate limit in distribution under a different scaling in n.
The famous law of the iterated logarithm of Hartman and Wintner answers the question on the precise size of S̃n; for a proof see [3, Theorem 22.11].

Theorem 2.14 (Law of the iterated logarithm). Let X1, X2, . . . be a sequence of i.i.d. random variables in L2 with mean µ and variance σ² > 0 on some probability space (Ω, F, P), and set S̃n := ∑_{k=1}^n (Xk − µ) for n ∈ N. Then

lim sup_{n→∞} S̃n / (σ √(2n log(log n))) = 1 P-a.s.
The central limit theorem answers the second question. The correct scaling in n to get a nondegenerate weak limit is √n, and the corresponding limit distribution is the normal distribution. For a proof, we refer to [3, Theorem 15.37].
Theorem 2.15 (Central limit theorem). Let X1, X2, . . . be a sequence of i.i.d. random variables in L2 with mean µ and variance σ² > 0 on some probability space (Ω, F, P). Set S̃n := ∑_{k=1}^n (Xk − µ) for n ∈ N. Then

S̃n/(σ√n) ⇒ N(0, 1).
36 Recall that if (an)n∈N is a sequence of nonnegative numbers with ∑_{n=1}^∞ an < ∞, then lim_{n→∞} an = 0.
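The central limit theorem can likewise be illustrated by simulation: normalised sums S̃n/(σ√n) of i.i.d. Uniform(0, 1) variables (µ = 1/2, σ² = 1/12) should have sample mean near 0 and sample variance near 1 (sample sizes and tolerances below are our own choices; this illustrates, but does not prove, the weak convergence):

```python
import random
import statistics

random.seed(5)

mu, sigma = 0.5, (1 / 12) ** 0.5   # mean and std of Uniform(0, 1)
n, reps = 400, 2000                # summands per draw, number of draws

# reps independent copies of S~_n / (sigma sqrt(n))
z = [sum(random.random() - mu for _ in range(n)) / (sigma * n ** 0.5)
     for _ in range(reps)]

# If z is approximately N(0, 1), its mean is near 0 and its variance near 1.
assert abs(statistics.mean(z)) < 0.1
assert abs(statistics.pvariance(z) - 1.0) < 0.15
```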
References
[1] H.-O. Georgii, Stochastics, De Gruyter Textbook, Walter de Gruyter & Co., Berlin, 2008.
[2] J. Jacod and P. Protter, Probability essentials, second ed., Universitext, Springer-Verlag,
Berlin, 2003.
[3] A. Klenke, Probability theory, Universitext, Springer-Verlag London, Ltd., London, 2008.