
Fundamentals of Probability Lecture notes 2022

KCL 6CCM341A Version: March 21, 2023

These notes are intended to supplement lectures for the course 6CCM341A Fundamentals of Prob-
ability delivered in the autumn semester September to December 2022 at King’s College London.
The style of these typed notes is intentionally concise. Further background and explanations of
steps and additional examples are given in the in-person lectures; the visualiser notes from lectures
are available on the KEATS page.
Some sections are based closely on notes by previous lecturers of this KCL course, especially Igor
Wigman and Kolyan Ray, with thanks.

Contents

1 Measure spaces
  1.1 Definitions and properties
    1.1.1 Probability spaces
    1.1.2 Measure spaces
  1.2 Independence
    1.2.1 Limits of sequences of events
  1.3 Generating σ-algebras and measures
    1.3.1 The Borel σ-algebra
    1.3.2 The Borel measure
    1.3.3 Non-measurable sets

2 Measurable functions and random variables
  2.1 Measurable functions
    2.1.1 Measurable sets vs open sets
    2.1.2 Measurable functions
    2.1.3 Properties of measurable functions
  2.2 Random variables
    2.2.1 Probabilities with random variables
  2.3 Convergence of random variables and measurable functions
    2.3.1 Convergence in distribution
    2.3.2 Convergence of measurable functions
    2.3.3 Convergence of random variables

3 Integration for measurable functions
  3.1 The (Lebesgue) integral
    3.1.1 Definition for simple functions
    3.1.2 Extension to non-negative measurable functions
    3.1.3 Positive and negative parts
    3.1.4 Convergence of integrals
    3.1.5 Application to counting measure
  3.2 Integrals in probability spaces
    3.2.1 Expectations
    3.2.2 Density functions
    3.2.3 Integrals of fX, and expectations of g(X)
    3.2.4 Inequalities for g(X)
    3.2.5 Transformations of random variables

4 Multivariate probability
  4.1 Multivariate distributions
    4.1.1 R²-valued measurable functions
    4.1.2 R²-valued random variables
    4.1.3 Transformations of multivariate distributions
    4.1.4 Conditional probability in the continuous setting
  4.2 Gaussian random variables
    4.2.1 IID normals in polar coordinates
    4.2.2 Gaussian random vectors - formalism
    4.2.3 Bivariate Gaussians
    4.2.4 Densities for general Gaussian random vectors
  4.3 Random walks
    4.3.1 Setup and examples
    4.3.2 Limit results - statements
    4.3.3 Limit results - usage
    4.3.4 Limit results - proofs

Dominic Yeo  dominic.yeo@kcl.ac.uk

1 Measure spaces
1.1 Definitions and properties
In previous courses, we have met many examples of probability distributions and random variables, and are comfortable performing various operations with these. For example, if X ∈ {1, 2, . . . , 6} represents the outcome of a die, and U ∼ Unif[0, 1], we know that

P(X is not a multiple of 3) = 1 − P(X is a multiple of 3) = 2/3,

and

P(U ≥ 1/2 or U ∈ [1/3, 2/3]) = P(U ≥ 1/3) = 2/3.

Indeed, if we consider a more complicated ‘event’, such as A = [1/2, 1] ∪ [1/8, 1/4] ∪ [1/32, 1/16] ∪ · · ·, it is valid to conclude

P(U ∈ A) = P(U ∈ [1/2, 1]) + P(U ∈ [1/8, 1/4]) + · · · = 1/2 + 1/8 + 1/32 + · · · = 2/3.
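The series above can also be checked numerically. The following sketch is not part of the notes: the helper `in_A` and the sample size are our own choices. It sums the geometric series exactly and estimates P(U ∈ A) by simulation.

```python
import random

random.seed(0)

# The series 1/2 + 1/8 + 1/32 + ... is geometric with ratio 1/4,
# so it sums to (1/2) / (1 - 1/4) = 2/3.
series_sum = sum((1 / 2) * (1 / 4) ** k for k in range(40))

def in_A(u, depth=30):
    """Membership in A = [1/2, 1] u [1/8, 1/4] u [1/32, 1/16] u ...,
    whose k-th interval is [2^-(2k+1), 2^-(2k)]."""
    return any(2.0 ** -(2 * k + 1) <= u <= 2.0 ** -(2 * k) for k in range(depth))

# Monte Carlo estimate of P(U in A) for U ~ Unif[0, 1]
n = 200_000
estimate = sum(in_A(random.random()) for _ in range(n)) / n
```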

At least initially, these intuitively clear rules cover all situations we are likely to meet in practice,
especially in applied settings. But it leaves open the question:
Question: Does P (U ∈ A) make sense for every set A ⊆ [0, 1]?
In order to address questions such as these, we need to specify a minimal set of rules (known as
axioms) that probabilities should follow. We will then try to derive as many seemingly ‘intuitive’
rules as possible from this set of axioms.

1.1.1 Probability spaces


With the question above in mind, our definition of a probability space comes in two parts. First we
must specify rules for deciding which collections of outcomes can be assigned a probability; then we
must specify some rules for assigning the probabilities.


Definition 1.1. A sample space, Ω, is a set of outcomes. Subsets of Ω are called events. Let F be a collection of events. Then F is a σ-algebra if the following hold:

F1 The whole outcome space¹ Ω ∈ F;
F2 if A ∈ F then the complement Ac ∈ F also;
F3 for any countable collection (An)n≥1 of events in F, the union ⋃n≥1 An ∈ F.

Definition 1.2. Given a σ-algebra F on Ω, a function P : F → [0, 1] is a probability measure if the following hold:

P1 P(A) ≥ 0 for all A ∈ F (note that this is redundant if we specify the codomain of P above);
P2 P(Ω) = 1;
P3 for any countable² collection (An)n≥1 of disjoint events in F, we have

    P(⋃_{n≥1} An) = Σ_{n≥1} P(An).   (1.1)

Result (1.1) is called countable additivity.

The triple (Ω, F, P) is then called a probability space.

Example. For the die, one has Ω = {1, 2, . . . , 6}, with F = P(Ω), the power set of Ω (ie all subsets of Ω), and P(A) = |A|/6.

For the uniform distribution on [0, 1], we have Ω = [0, 1]. However, given the key question posed earlier, it is not clear what F should be. We do know that all intervals [a, b] should be included in F, with P([a, b]) = b − a for all 0 ≤ a ≤ b ≤ 1.

Example. For any Ω, the power set P(Ω) is a σ-algebra on Ω.

Proposition 1.3. Given a σ-algebra F on Ω, the following hold:

• ∅ ∈ F;
• for any countable collection (An)n≥1 in F, the intersection ⋂_{n≥1} An ∈ F.

Proof. Applying F2 to F1, we have Ωc = ∅ ∈ F. To study countable intersections, consider De Morgan’s law

    Ω \ ⋂_{n≥1} An = ⋃_{n≥1} Anc,

and apply F2 to confirm each Anc ∈ F, then F3 for their union. Finally, apply F2 once more to return to ⋂ An ∈ F.
¹ Given F2, we could equivalently take the first axiom to be F1’: ∅ ∈ F, and in some steps later it will be more convenient to work with F1’.
² Here, ‘countable’ means ‘finite or countably infinite’. See Problem Set 1 Q7 for discussion of reducing the countably infinite case to the finite case.


Proposition 1.4. Let (Ω, F, P) be a probability space. Then

• P(Ac) = 1 − P(A) for all A ∈ F, and in particular P(∅) = 0.
• For A, B ∈ F with A ⊆ B, we have P(A) ≤ P(B).
• For general (rather than disjoint) unions, we have P(A ∪ B) = P(A) + P(B) − P(A ∩ B).

Proof. We use the axioms P2, P3 in various orders.

• Events A, Ac are disjoint, and their union is A ∪ Ac = Ω. Applying P3 and then P2, we obtain

    P(A) + P(Ac) = P(Ω) = 1.

• See Problem Set 1.

• See Problem Set 1.

1.1.2 Measure spaces


One can observe that the area function on sets in R2 has some shared properties with a probability
space. It is not necessarily clear that all sets in R2 have a well-defined area. However, for those that
do, the area function satisfies axioms P1 and P3. Clearly it does not satisfy P2 since Area(R2 ) = ∞.
Both the area function and a probability measure describe ‘how large a subset’ is. This motivates
introducing a more general object without axiom P2 to capture both situations.

Definition 1.5. Let E be a set, and E a σ-algebra on E. The pair (E, E) will be referred to as a measurable space. Then a function µ : E → [0, ∞] is a measure if it satisfies the following:

M1 µ(A) ≥ 0 for all A ∈ E (as before, this is also captured in the codomain of µ);
M2 µ(∅) = 0;
M3 the same countable additivity condition as P3, ie

    µ(⋃_{n≥1} An) = Σ_{n≥1} µ(An),

when (An)n≥1 are disjoint.

Then we say (E, E, µ) is a measure space.

Example. Some important measure spaces:

• Any probability space (Ω, F, P) is also a measure space;
• For any set E, take E = P(E), and define counting measure µ(A) := |A| for all A ⊆ E. Note that if |E| = ∞, then certainly µ(E) = ∞. In particular, we can have the uniform measure on an infinite countable set, whereas it is impossible to construct the uniform probability distribution there.


• If (E, E, µ) is a measure space, so is (E, E, cµ) for any constant c ≥ 0.
• If 0 < µ(E) < ∞, then P(A) := µ(A)/µ(E) defines a probability space (E, E, P).

Note that many of the properties of probability spaces in Proposition 1.4 pass directly to measures. Specifically, whenever axiom P2 is not required, the result is inherited immediately. In a general measure space (E, E, µ), one has µ(A) + µ(Ac) = µ(E), but the statement µ(Ac) = µ(E) − µ(A) can be ill-defined if µ(E) = ∞, since ∞ − ∞ is not well-defined.

Proposition 1.6 (Countable sub-additivity). Given a measure space (E, E, µ), let (An)n≥1 be a sequence in E. Then

    µ(⋃_{n≥1} An) ≤ Σ_{n≥1} µ(An).   (1.2)

Proof. See lectures for motivation and figures. Define B1 := A1, and then for each n ≥ 2 in turn, Bn := An \ (A1 ∪ · · · ∪ An−1). So the union ⋃ Bn = ⋃ An, and this union is disjoint. In addition, we have Bn ⊆ An and so µ(Bn) ≤ µ(An). We conclude, using M3 for the middle equality,

    µ(⋃_{n≥1} An) = µ(⋃_{n≥1} Bn) = Σ_{n≥1} µ(Bn) ≤ Σ_{n≥1} µ(An).
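The disjointification Bn := An \ (A1 ∪ · · · ∪ An−1) used in this proof can be illustrated with counting measure on a small finite example; the particular sets below are our own choice, not from the notes.

```python
# Toy check of the disjointification B_n = A_n \ (A_1 u ... u A_{n-1})
# under counting measure mu(A) = |A| on subsets of a finite set.
A = [{1, 2, 3}, {2, 3, 4}, {4, 5}, {1, 6}]

B = []
seen = set()
for An in A:
    B.append(An - seen)  # B_n := A_n \ (A_1 u ... u A_{n-1})
    seen |= An

union_A = set().union(*A)
union_B = set().union(*B)

# The B_n are pairwise disjoint with the same union, and B_n is a subset
# of A_n, giving |union A_n| = sum |B_n| <= sum |A_n| (sub-additivity).
```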

Proposition 1.7 (Continuity). Given a measure space (E, E, µ), let (An)n≥1 be an increasing sequence in E, ie An ⊆ An+1 for all n. Then lim_{n→∞} µ(An) exists in [0, ∞] and is equal to µ(⋃ An).

Proof. Consider B1 := A1 and Bn := An \ An−1 for n ≥ 2. (This is a special case of the construction in the previous proof.) Then ⋃_{k=1}^{n} Bk = An, and this is a disjoint union. Furthermore, ⋃_{n≥1} Bn = ⋃_{n≥1} An. So, as n → ∞,

    µ(An) = Σ_{k=1}^{n} µ(Bk) → Σ_{k≥1} µ(Bk) = µ(⋃_{k≥1} Bk) = µ(⋃_{k≥1} Ak).

1.2 Independence
We work with a probability space (Ω, F, P) as before. We have met the notion of independence of
events or random variables before, and may have some intuition about what it means. Roughly
speaking, two random objects are independent if the outcome of one ‘does not influence’ the outcome
of the other. This is formalised as follows.

Definition 1.8. Two events A, B ∈ F are independent if

P(A ∩ B) = P(A)P(B). (1.3)


More generally, a countable collection of events (An)n≥1 is independent if for all distinct indices i1, . . . , ik (for every finite k), we have

    P(Ai1 ∩ · · · ∩ Aik) = ∏_{j=1}^{k} P(Aij).   (1.4)

Example. Let Ωn be the set of functions f : {1, . . . , n} → {1, . . . , n}, and P a uniform choice from Ωn. Let Ai be the event {f ∈ Ωn : f(i) = i}. Then P(Ai) = n^{n−1}/n^n = 1/n and

    P(Ai1 ∩ · · · ∩ Aik) = n^{n−k}/n^n = 1/n^k = ∏_{j=1}^{k} P(Aij),

and so (Ai) are independent events. This is no longer true if we take Ωn to be the set of permutations on {1, . . . , n}.
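For small n this example can be verified by brute-force enumeration. The sketch below is our own check, with n = 3; it computes the relevant probabilities exactly.

```python
from itertools import product
from fractions import Fraction

n = 3
# All functions f : {1,...,n} -> {1,...,n}, encoded as tuples (f(1),...,f(n)).
fns = list(product(range(1, n + 1), repeat=n))  # n^n = 27 functions

def prob(event):
    """Probability of an event under the uniform choice of f."""
    return Fraction(sum(1 for f in fns if event(f)), len(fns))

# A_i = {f : f(i) = i} has probability n^(n-1) / n^n = 1/n ...
p_single = [prob(lambda f, i=i: f[i - 1] == i) for i in range(1, n + 1)]

# ... and intersections factorise: P(A_1 n A_2) = 1/n^2, etc.
p_12 = prob(lambda f: f[0] == 1 and f[1] == 2)
p_123 = prob(lambda f: f[0] == 1 and f[1] == 2 and f[2] == 3)
```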

Example. It is important that (1.4) holds for all collections i1 , . . . , ik . For example, given three
events A, B, C, it is not sufficient to check that P(A ∩ B ∩ C) = P(A)P(B)P(C).

Example. Let Ω = {H, T}², denoting flipping a coin twice, with each outcome having probability 1/4. Consider the following events:

A = {first coin H}, B = {second coin H}, C = {coins give same outcome},

ie A = {HH, HT}, B = {HH, TH}, C = {HH, TT}.

Then P(A) = P(B) = P(C) = 1/2, and P(A ∩ B) = P(A ∩ C) = P(B ∩ C) = 1/4, so the events A, B, C are pairwise independent. But A, B, C are not independent since

    P(A ∩ B ∩ C) = 1/4 ≠ P(A)P(B)P(C) = 1/8.
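This example is small enough to check exhaustively. The following sketch (ours, not part of the notes) enumerates the four outcomes of Ω.

```python
from itertools import product
from fractions import Fraction

omega = list(product("HT", repeat=2))  # 4 equally likely outcomes

def P(event):
    return Fraction(sum(1 for w in omega if event(w)), len(omega))

A = lambda w: w[0] == "H"        # first coin H
B = lambda w: w[1] == "H"        # second coin H
C = lambda w: w[0] == w[1]       # coins give the same outcome

# Pairwise independence holds for all three pairs ...
pairwise_ok = all(
    P(lambda w: X(w) and Y(w)) == P(X) * P(Y)
    for X, Y in [(A, B), (A, C), (B, C)]
)
# ... but the triple intersection does not factorise.
p_triple = P(lambda w: A(w) and B(w) and C(w))
```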

Proposition 1.9. Consider a probability space (Ω, F, P), and events A, B ∈ F.

• If A is independent of B, then A is also independent of Bc.
• If P(A) = 0 or P(A) = 1, then A and B are independent.

Proof. We use (1.3) and the axioms P1–P3.

• We may write A ∩ Bc = A \ (A ∩ B), and calculate, using (1.3) in the second step,

    P(A ∩ Bc) = P(A) − P(A ∩ B) = P(A) − P(A)P(B) = P(A)[1 − P(B)] = P(A)P(Bc),

as required.

• See Problem Set 2.


1.2.1 Limits of sequences of events


Given a sequence of events (An)n≥1 in a probability or general measure space, we have seen how to treat the union ⋃ An (via the axioms P3/M3), and the intersection ⋂ An (with Proposition 1.3). It is also useful to be able to study the events:

    {An holds infinitely often} = ⋂_{n≥1} ⋃_{k≥n} Ak,
    {An holds eventually} = ⋃_{n≥1} ⋂_{k≥n} Ak.

Here ‘holds infinitely often’ means ‘holds for infinitely many n’; and ‘holds eventually’ means ‘is eventually always true’, or ‘holds for all n ≥ N, for some N’, or ‘holds for all but finitely many n’. (Sometimes these events are also called lim sup An and lim inf An, respectively, but we will avoid this terminology.) Note that

    ⋂_{n≥1} An ⊆ {An eventually} ⊆ {An i.o.} ⊆ ⋃_{n≥1} An.

Proposition 1.10 (Borel–Cantelli Lemmas). Consider a probability space (Ω, F, P) and a sequence of events (An)n≥1.

BC1) If Σ_{n≥1} P(An) < ∞, then P(An i.o.) = 0.
BC2) If Σ_{n≥1} P(An) = ∞ and the events (An) are independent, then P(An i.o.) = 1.

Note. For BC2, the independence condition is necessary. See Problem Set 2 for discussion.

Proof. BC1) Since {An i.o.} ⊆ ⋃_{k≥n} Ak for every n, we have

    P(An i.o.) ≤ Σ_{k≥n} P(Ak) → 0 as n → ∞,

since Σ P(An) < ∞. So P(An i.o.) = 0.

BC2) It is equivalent to show that P(Anc eventually) = 0. To do this, we study the intersections of the complements, whose probabilities are tractable using independence:

    P(⋂_{k≥n} Akc) = ∏_{k≥n} P(Akc) = ∏_{k≥n} (1 − P(Ak))
                   ≤ ∏_{k≥n} exp(−P(Ak)) = exp(−Σ_{k≥n} P(Ak)),

and recalling that Σ_{k≥n} P(Ak) = ∞ for each n by assumption,

    P(⋂_{k≥n} Akc) = exp(−∞) = 0.


But note that (⋂_{k≥n} Akc)_{n≥1} is an increasing sequence of events. So by Proposition 1.7, we have

    P(⋃_{n≥1} ⋂_{k≥n} Akc) = lim_{n→∞} P(⋂_{k≥n} Akc) = 0.
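Two ingredients of the BC2 proof can be sanity-checked numerically: the inequality 1 − p ≤ e^{−p}, and the fact that the partial products vanish when the probabilities sum to ∞. The sketch below is ours, with the hypothetical choice P(Ak) = 1/k.

```python
import math

# Key inequality in the proof: 1 - p <= exp(-p) for p in [0, 1].
inequality_ok = all(1 - p <= math.exp(-p) for p in (i / 100 for i in range(101)))

# With P(A_k) = 1/k the sum diverges, and the partial products
# prod_{k=2}^{N} (1 - 1/k) telescope to 1/N, which tends to 0.
def partial_product(N):
    out = 1.0
    for k in range(2, N + 1):
        out *= 1 - 1 / k
    return out
```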

One of the most famous applications of the Borel–Cantelli lemmas is the following popular result.

Proposition 1.11 (Infinite monkey theorem). An immortal monkey hits keys on a typewriter repeatedly, both uniformly at random and independently. Then almost surely, the monkey will at some stage write Hamlet.³

Proof. Let K denote the number of letters (including spaces and punctuation) in Hamlet, let T denote the number of keys on the typewriter, and let X1, X2, . . . denote the keys struck by the monkey. Define the event

    An := { (X_{Kn+1}, X_{Kn+2}, . . . , X_{K(n+1)}) is the text of Hamlet }.

Then P(An) = T^{−K} for all n, and the (An) are independent. So applying BC2 gives the result!
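A toy simulation of the block argument in the proof, with a hypothetical two-key typewriter and a four-letter stand-in for Hamlet (the alphabet, target, and function name are all our own choices):

```python
import random

random.seed(0)
ALPHABET = "ab"      # stand-in for the typewriter's T keys
TARGET = "abba"      # stand-in for Hamlet, K = 4 letters

def blocks_until_hit(max_blocks=10**6):
    """Scan disjoint blocks X_{Kn+1},...,X_{K(n+1)} as in the proof.
    Each block equals TARGET with probability T^(-K) = 2^(-4),
    independently of the others, so BC2 promises a hit."""
    K = len(TARGET)
    for n_block in range(max_blocks):
        block = "".join(random.choice(ALPHABET) for _ in range(K))
        if block == TARGET:
            return n_block
    return -1

first_hit = blocks_until_hit()
```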

Example (For interested readers). Consider a sequence of random permutations (σn)n≥1 as follows: σ1 = (1); then for each n ≥ 2, create σn from σn−1 by inserting element n at a uniformly-chosen position in σn−1. We would like to prove that P(first element changes i.o.) = 1. But this follows from BC2, since

    Σ_{n≥2} P(first element changes σn−1 → σn) = Σ_{n≥2} 1/n = ∞,

and these events are independent.


By contrast, if the new element is added at place max(n − Xn, 1) where (X2, X3, . . .) are independent random variables distributed as Geometric(q), then

    P(first element changes σn−1 → σn) = (1 − q)^{n−2},

and since Σ_n (1 − q)^{n−2} < ∞, this shows that almost surely the first element changes only finitely often, by BC1.
(The first example is consistent with the idea that one cannot generate a uniform permutation on
N. The second example can be extended to show that every element changes only finitely often,
and so it is consistent to extend to a random permutation σ∞ on N. This Mallows permutation is
one of the main models for random permutations on an infinite set.)
³ Hamlet is one of the most famous (and longest) plays by William Shakespeare.


1.3 Generating σ-algebras and measures


The goal is to find a σ-algebra E on R, and a measure µ on (R, E) that is consistent with our intuition that

    µ((a, b)) = µ((a, b]) = µ([a, b]) = b − a,   a < b ∈ R.

Unfortunately, the set of intervals is not a σ-algebra, and there are obstacles⁴ to constructing a measure on (R, P(R)). Instead, we work with the smallest σ-algebra on R that includes all the intervals. We now make this notion precise.

Definition 1.12. Consider any set E, and A a collection of subsets of E. Then the σ-algebra generated by A is the intersection of all σ-algebras on E which contain A, that is

    σ(A) := ⋂ { G : G is a σ-algebra on E with A ⊆ G }.   (1.5)

Note. Since P(E) is a σ-algebra containing A, the intersection in (1.5) is non-empty. However, it
is not obvious that σ(A) as defined in (1.5) is a σ-algebra (!), but this follows as a result of the
following lemma.

Lemma 1.13. Let (Ei)i∈I be a family of σ-algebras on E (where I is an arbitrary index set). Then

    E := ⋂_{i∈I} Ei := { A ⊆ E : A ∈ Ei ∀i ∈ I }

is a σ-algebra on E.

Proof. We check the axioms for a σ-algebra in turn.

• F1’: we have ∅ ∈ Ei for all i ∈ I, so ∅ ∈ E.
• F2: A ∈ E ⇒ A ∈ Ei ∀i ∈ I ⇒ Ac ∈ Ei ∀i ∈ I ⇒ Ac ∈ E, where in the middle deduction we use that each Ei is a σ-algebra.
• F3: similarly, given (An)n≥1 all in E, then for each i ∈ I these sets are all in Ei too. Since Ei is a σ-algebra, we have ⋃ An ∈ Ei for all i ∈ I, and so ⋃ An ∈ E.

So E satisfies all the axioms to be a σ-algebra on E.

Example. When A = {A} consists of a single event, any σ-algebra containing A must also contain {∅, A, Ac, E}, and this collection is itself a σ-algebra. So σ(A) = {∅, A, Ac, E}.
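On a finite set, σ(A) can be computed by brute force, since countable unions reduce to finite ones. The helper below is our own illustration; it closes a collection under complements and unions and recovers σ({A}) = {∅, A, Ac, E}.

```python
def generate_sigma_algebra(E, collection):
    """Close a collection of subsets of a finite set E under complements
    and unions. (On a finite set, countable unions reduce to finite ones.)"""
    E = frozenset(E)
    sigma = {frozenset(), E} | {frozenset(A) for A in collection}
    changed = True
    while changed:
        changed = False
        for A in list(sigma):
            if E - A not in sigma:
                sigma.add(E - A)  # close under complements
                changed = True
        for A in list(sigma):
            for B in list(sigma):
                if A | B not in sigma:
                    sigma.add(A | B)  # close under pairwise unions
                    changed = True
    return sigma

sig = generate_sigma_algebra({1, 2, 3, 4}, [{1, 2}])
```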

Example. Consider a uniform choice from Ω = {H, T}^4, corresponding to tossing a coin four times. Let Ai := {ith coin is H}. Then

    σ({A1, A2}) = {all events determined by the first two outcomes}.   (1.6)

⁴ See later in the course for a (non-examinable) discussion of these!


1.3.1 The Borel σ-algebra


For E = R, the σ-algebra generated by the collection of intervals

    { (a, b] : −∞ < a < b < ∞ }   (1.7)

is called the Borel σ-algebra on R, denoted B(R).


In our definition, we used open-closed intervals in (1.7), but there were (many) other options, and
it is important to confirm that other choices for the generating set would give the same σ-algebra.

Lemma 1.14. Let E be a set, and A, A′ collections of subsets of E. Suppose that

    A ⊆ σ(A′) and A′ ⊆ σ(A).   (1.8)

Then σ(A) = σ(A′).

Proof. Straightforward, but omitted.

Example. Let A be the set in (1.7) and let

    A′ := { (a, b) : −∞ < a < b < ∞ }

be the set of open intervals in R. Then since

    (a, b] = ⋂_n (a, b + 1/n),  and  (a, b) = ⋃_n (a, b − 1/n],

we have A ⊆ σ(A′) and A′ ⊆ σ(A), respectively. Applying Lemma 1.14, we find that σ(A′) = σ(A) = B(R), so the open intervals also generate the Borel σ-algebra on R.

1.3.2 The Borel measure


The reason for discussing the σ-algebra generated by A is that we hope to extend a (pre-)measure which is defined naturally on A to a genuine measure defined on σ(A). There are three potential issues in doing this:

• The (pre-)measure µ might not be consistent on A;
• It might not be possible to extend µ to σ(A);
• This extension might not be unique.

In general these are all genuine concerns, but they are not relevant for the Borel σ-algebra B(R). More generally, we have the following conditions for extending a measure appropriately:

• Provided A is a ring⁵, and µ is consistent on A, then µ can be extended to a measure on σ(A).

⁵ A ring is a set system preserved under exclusion and finite unions, but the definition is not important to this presentation.


• Provided A is a π-system⁶, any extension to σ(A) is unique.

The first of these results is known as Carathéodory’s extension theorem; the key step in the second is known as Dynkin’s lemma. We will not present proofs of these results, but see Problem Set 2 Q4 for an idea of the methods.
The key example of this procedure is:

Definition 1.15. There is a unique measure µ on (R, B(R)) such that µ((a, b]) = b − a, and this
is called the Borel measure on R.

1.3.3 Non-measurable sets


It would be reasonable to conjecture either of the following claims:

• All subsets A ⊆ R are Borel.
• The Borel σ-algebra can be constructed from one of the generating sets A, eg the intervals in (1.7), ‘by induction’, ie by repeatedly adding all complements and countable unions of what is already present.

Unfortunately, neither of these claims is true. Explaining why the second claim does not hold is outside the scope of even a non-examinable section⁷. However, we will explain the construction of a non-Borel set.

We will need the result that Borel sets are preserved under translations, which feels intuitively reasonable. To formalise the notion of translation, given a set A ⊆ R and x ∈ R, we define A + x := {x + a : a ∈ A}.

Lemma 1.16. Whenever A ⊆ R is Borel, then A + x is Borel also, and µ(A + x) = µ(A).

Proof. See Problem Set 5.

Now, recall that the reals form an (Abelian) group under addition, denoted (R, +). One may
consider the rationals Q as a (normal) subgroup of (R, +). In particular, we may consider the
quotient group R/Q of cosets of Q in R, whose elements are sets (Ri , i ∈ I), where each has the
form Ri = ri + Q.
Now consider the set R := {ri , i ∈ I}, which we claim is not Borel-measurable. The details of this
and the previous paragraph are (obviously) not examinable, and were presented semi-formally in
the lecture.
⁶ A π-system is a collection of sets which includes the empty set and which is preserved under (finite) intersections.
⁷ Roughly speaking, you need to apply the operations infinitely often; then apply the operations infinitely often again; and so on. This procedure is called transfinite induction. It is also rather difficult to give an example of a set which is Borel, but can’t be constructed by regular induction from A.


2 Measurable functions and random variables


2.1 Measurable functions
2.1.1 Measurable sets vs open sets
In previous courses, we have met the notion of open sets, both in R and in general topological spaces. We focus on R for now. A set A ⊆ R is open if for all x ∈ A, some small interval (x − ε, x + ε) ⊆ A. A set C is closed if Cc is open. Not all sets are open or closed, and some sets (eg R) are both open and closed.

Remember that open sets are preserved under countable unions, but not under countable intersections. Similarly, closed sets are preserved under countable intersections, but not under countable unions. It is therefore unsurprising that the Borel σ-algebra on R includes the open and closed sets.

Lemma 2.1. If A ⊆ R is open or closed, then A ∈ B(R).

Proof. It is sufficient to study the case where A is open: if A is closed, then Ac is open, and the result follows since B(R) is a σ-algebra, and so is preserved under taking complements.

For every rational q ∈ A ∩ Q, let

    αq := inf{x ≤ q : (x, q] ⊆ A},  βq := sup{x ≥ q : [q, x) ⊆ A}.

Then for any x ∈ A and small enough ε > 0, the interval (x − ε, x + ε) ⊆ A, and so (x − ε, x + ε) ⊆ (αq, βq) for some rational q ∈ (x − ε, x + ε). It follows that A = ⋃_{q∈A∩Q} (αq, βq), and so A is a Borel set, as a countable union of open intervals.

A general definition of continuity of functions is that f is continuous if for all A open, f −1 (A) is
open. This motivates the definition of measurable functions.

2.1.2 Measurable functions


Definition 2.2. Given measurable spaces (E, E) and (G, G), a function f : E → G is measurable if
for all A ∈ G, we have f −1 (A) ∈ E.

Example. Given A ⊆ E, the indicator function 1A : E → R is defined as

    1A(x) = 1 if x ∈ A,  and 1A(x) = 0 if x ∉ A.

If A ∈ E, then 1A is a measurable function. To see this, note that for C ∈ B(R),

    1A^{−1}(C) = ∅ if 0, 1 ∉ C;  A if 1 ∈ C, 0 ∉ C;  Ac if 0 ∈ C, 1 ∉ C;  E if 0, 1 ∈ C,

and all these sets are in E.
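The four preimage cases can be computed explicitly on a small finite E; the particular sets and helper name below are our own illustration.

```python
E = {1, 2, 3, 4, 5}
A = {2, 4}

def indicator_preimage(C):
    """Preimage 1_A^{-1}(C) of a set C of reals under the indicator of A in E."""
    return {x for x in E if (1 if x in A else 0) in C}
```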


In practice, it is not necessary to check f −1 (A) for all A ∈ B(R), but just the intervals that generate
the σ-algebra.

Lemma 2.3. Given measurable spaces (E, E) and (R, B(R)), a function f : E → R is measurable
if and only if f −1 ((−∞, a]) ∈ E for all a ∈ R.

Proof. Omitted.

Note. It is also valid to check f −1 ((−∞, a)) ∈ E for every a ∈ R.

Example. Consider f : R → R, f(x) = x². Then f^{−1}((−∞, a]) = ∅ if a < 0. Otherwise, f^{−1}((−∞, a]) = [−√a, √a] ∈ B(R), and so f is measurable.

Example. Consider f : R → R given by f(x) = 1/x for x ≠ 0, and f(0) = 0. Then

    f^{−1}((−∞, a]) = [1/a, 0) if a < 0;  (−∞, 0] if a = 0;  (−∞, 0] ∪ [1/a, ∞) if a > 0.

In all cases f^{−1}((−∞, a]) ∈ B(R), so f is measurable.
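The case analysis can be spot-checked against the defining condition f(x) ≤ a on a grid of points; this is a check we added, not part of the notes (the grid values are chosen to be exactly representable).

```python
def f(x):
    return 1 / x if x != 0 else 0.0

def in_preimage(x, a):
    """Membership in f^{-1}((-inf, a]) according to the case analysis above."""
    if a < 0:
        return 1 / a <= x < 0
    if a == 0:
        return x <= 0
    return x <= 0 or x >= 1 / a

xs = [-10, -2, -0.5, -0.1, 0, 0.1, 0.5, 2, 10]
# The case analysis agrees with the direct condition f(x) <= a.
cases_agree = all(
    in_preimage(x, a) == (f(x) <= a)
    for a in (-2.0, -0.5, 0.0, 0.5, 2.0)
    for x in xs
)
```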


In fact, for functions R → R, measurability is a more general condition than continuity.

Proposition 2.4. Any continuous function f : R → R is measurable with respect to B(R).

Proof. We know that f^{−1}((−∞, a)) is open since f is continuous. By Lemma 2.1, this implies f^{−1}((−∞, a)) ∈ B(R), which is sufficient.

2.1.3 Properties of measurable functions


Proposition 2.5. Given measurable spaces (E, E) and (R, B(R)), let f, g be two measurable functions E → R. Then

1) f + g is measurable;
2) f · g is measurable.

Now, let (fn)n≥1 be a sequence of measurable functions E → R. Then

3) the functions inf_{n≥1} fn and sup_{n≥1} fn are measurable if they exist;
4) the pointwise limit lim_{n→∞} fn is measurable if it exists.

Now, let f : E → R and g : R → R be measurable. Then

5) the composition⁸ f ◦ g is measurable.

⁸ This is defined as (f ◦ g)(x) = g(f(x)).


Proof. 1) It is sufficient to show that {x ∈ E : f(x) + g(x) < a} ∈ E for all a ∈ R. Note that

    f(x) + g(x) < a  ⟺  ∃q ∈ Q : f(x) < q and g(x) < a − q,

and so we can decompose the event of interest as

    {x ∈ E : f(x) + g(x) < a} = ⋃_{q∈Q} ( {x ∈ E : f(x) < q} ∩ {x ∈ E : g(x) < a − q} ).

All events in this decomposition are in E, which is preserved under intersections and countable unions, so the event of interest is in E also, as required.

2) See the problem set.

3) It is sufficient to show that {x ∈ E : sup_n fn(x) ≤ a} ∈ E. But this event is equal to ⋂_n {fn(x) ≤ a}. Since each event in this intersection is in E, so is the intersection, as required. The argument for inf_n fn is similar.

4) Recall the definitions⁹

    lim sup_{n→∞} fn := inf_n sup_{m≥n} fm,  lim inf_{n→∞} fn := sup_n inf_{m≥n} fm.

In particular, when lim fn exists, we have lim fn = lim sup fn = lim inf fn. For each n, we know that sup_{m≥n} fm is measurable by the sup part of 3) above, and so inf_n sup_{m≥n} fm is measurable by the inf part of 3) above. This shows that lim_{n→∞} fn is measurable.

5) Let A ∈ B(R). Then g^{−1}(A) ∈ B(R) since g is measurable, and so (f ◦ g)^{−1}(A) = f^{−1}(g^{−1}(A)) ∈ E since f is measurable. So f ◦ g is measurable.

2.2 Random variables


Definition 2.6. In a probability space (Ω, F, P), a measurable function X : Ω → R is called a
random variable.

Example. For Ω = {H, T }4 corresponding to tossing a coin four times, let X denote the total
number of heads.
The idea is that the random variable X is defined within the broader probability space. We have
seen in previous courses that X has the binomial distribution, and could easily be defined on its
own probability space (eg with Ω = {0, 1, 2, 3, 4}) but this is not necessary.

Note. By default, random variables take real values. However, given a function X : Ω → E
measurable with respect to (E, E), we can refer to X as an E-valued random variable.
⁹ Recall also that the advantage of lim inf and lim sup is that they always exist (in [−∞, +∞] etc), whereas not all sequences have limits.


2.2.1 Probabilities with random variables


Given X a random variable in probability space (Ω, F, P), we can study P(X ∈ A) for all A ∈ B(R).
Here, P(X ∈ A) is an informal but useful shorthand for P(X⁻¹(A)). In fact, defining PX(A) =
P(X ∈ A) gives a probability measure PX on (R, B(R)).

Definition 2.7. Given a random variable X on probability space (Ω, F, P), the distribution function
of X is a function FX : R → [0, 1] defined by

    FX(x) = P(X ≤ x) = P( X⁻¹( (−∞, x] ) ).                                    (2.1)

Since {(−∞, x] : x ∈ R} generates B(R), the probability measure PX is completely determined by
FX. If two random variables X and Y have the same distribution function, ie FX = FY, then we
say X and Y are equal in distribution, denoted X =d Y.

Example. Consider the probability space ([0, 1], B([0, 1])) with the Borel measure. Then, taking
the function U : [0, 1] → [0, 1] given by U(x) = x, we end up with U a random variable matching
the Uniform[0, 1] definition we gave earlier. We have distribution function

    FU(x) = P(U ≤ x) = { 0,  x ≤ 0;    x,  x ∈ [0, 1];    1,  x ≥ 1 }.

We have included all the cases for completeness' sake, but this is not generally necessary if context
makes it clear what the meaningful range of the RV is.
In this setting, U² is also a random variable, with distribution function

    FU²(x) = P(U² ≤ x) = P(U ≤ √x) = √x,    x ∈ [0, 1].
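As a rough numerical sketch (ours, not part of the notes), one can approximate the Borel measure on [0, 1] by the proportion of a fine grid a set contains, and check that FU²(x) ≈ √x.

```python
from math import sqrt

# Approximate the measure of a Borel subset of [0,1] by the proportion of grid points it contains.
N = 100_000
grid = [(i + 0.5) / N for i in range(N)]

def F_U2(x):
    # F_{U^2}(x) = measure of {u in [0,1] : u^2 <= x}, approximated on the grid.
    return sum(1 for u in grid if u * u <= x) / N

checks = {x: (F_U2(x), sqrt(x)) for x in (0.1, 0.25, 0.5, 0.9)}
```

The grid error is at most 1/N, so the agreement with √x is very close.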

Proposition 2.8. For a general random variable X, the distribution function FX is non-decreasing,
and satisfies lim_{x→−∞} FX(x) = 0 and lim_{x→+∞} FX(x) = 1. Furthermore, FX is right-continuous.

Proof. The events {X ≤ x} are increasing with x, hence FX is non-decreasing. The limits are
established on the Problem Set.
For right-continuity, note that FX(x + 1/n) − FX(x) = P(X ∈ (x, x + 1/n]). Since ⋂_n (x, x + 1/n] = ∅,
it follows from continuity of P (or more directly from PX) that

    lim_{n→∞} FX(x + 1/n) − FX(x) = lim_{n→∞} P(X ∈ (x, x + 1/n]) = P(∅) = 0.

Since FX is non-decreasing, the limit FX(x + 1/n) → FX(x) as n → ∞ induces the more general limit
FX(z) → FX(x) as z ↓ x through the reals.

Definition 2.9. A countable collection of random variables (Xn)n≥1 is independent if for all distinct
indices i1, . . . , ik, and for all x1, . . . , xk ∈ R, we have

    P(Xi1 ≤ x1, . . . , Xik ≤ xk) = ∏_{j=1}^k P(Xij ≤ xj).                     (2.2)


As in the case of independent events, it is not true that pairwise independence implies independence
(see Problem Set).

Example. If (An)n≥1 are independent events, then the indicator functions (1An) are independent
random variables.
In fact, if (Xn)n≥1 are independent¹⁰, then for any Borel sets A1, . . . , Ak, we have

    P(Xi1 ∈ A1, . . . , Xik ∈ Ak) = ∏_{j=1}^k P(Xij ∈ Aj).                     (2.3)

Definition 2.10. When a collection of random variables (Xn)n≥1 is independent, and all Xn are
equal in distribution (see Definition 2.7), we say (Xn) are independent and identically distributed,
abbreviated IID. This property applies in a number of settings, including repeatedly tossing a coin,
or sampling a few members from a large population.

2.3 Convergence of random variables and measurable functions


Previous courses in analysis and topology have discussed in general terms what is meant by con-
vergence of a sequence. In more exotic spaces, it is often of central importance to be clear what it
means for two elements to be ‘close’. In a metric space this is clear, but it is not obvious how to
place a metric on every set, for example on the collection of probability measures.
There are many possible interpretations for the notion that two random variables X and Y are
‘close’, for example based on observing the outcomes, or comparing the distribution functions, or
by studying how X, Y are embedded in a common probability space.

2.3.1 Convergence in distribution


The following notion has been seen in previous courses in the context of the Central Limit Theorem,
which we will return to later in this course.

Definition 2.11. Let X, (Xn)n≥1 be random variables, with distribution functions FX, FXn. Then
Xn converges in distribution to X if FXn(x) → FX(x) for all x ∈ R such that FX is continuous at x.
This is denoted Xn →d X.

Example. Let Un be uniform on {1, 2, . . . , n}, and U uniform on [0, 1]. Then (1/n)Un →d U as
n → ∞, since

    P( (1/n)Un ≤ x ) = P(Un ≤ nx) = ⌊nx⌋/n −→ x,    as n → ∞,  for x ∈ [0, 1].
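A quick numeric sketch of this convergence (the function name is ours): the error |⌊nx⌋/n − x| is at most 1/n at every x, so it vanishes as n grows.

```python
from math import floor

def F_scaled(n, x):
    # Distribution function of U_n / n at x in [0, 1]: P(U_n <= n x) = floor(n x) / n.
    return floor(n * x) / n

x = 0.37
errors = {n: abs(F_scaled(n, x) - x) for n in (10, 100, 1000, 10_000)}
```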

Example. Let (an)n≥1 be a real-valued sequence such that an → a ∈ R as n → ∞. Define the
'random variables' Xn such that P(Xn = an) = 1, and X such that P(X = a) = 1. Then Xn →d X.
Note that this would fail without the restriction of x to points of continuity of FX in Definition
2.11.
¹⁰ In contrast to previous definitions, the weaker form (2.2) is normally given as the definition of independence, with the general form (2.3) following.


2.3.2 Convergence of measurable functions


Note that Definition 2.11 makes no reference to the underlying probability space(s) of the random
variables. Convergence in distribution depends only on the distributions PXn (via the FXn s). In
many contexts, we are interested in more nuanced comparisons of random variables defined on the
same probability space.
Recall from Definition 2.6 that a random variable is a measurable function on a probability space.
So convergence of random variables in a common probability space is a special case of convergence
of measurable functions on a general measure space.
We will consider a measure space (E, E, µ) and measurable functions f, fn : E → R. First, we recall
some familiar notions of function convergence:
• We say fn converges to f pointwise if for all x ∈ E, |fn(x) − f(x)| → 0.
• We say fn converges to f uniformly if sup_{x∈E} |fn(x) − f(x)| → 0.

It is sometimes helpful to think of uniform convergence as a comment on the 'speed' of convergence
being comparable across x ∈ E.
We now define two alternative notions of convergence.

Definition 2.12. We say fn converges to f almost everywhere if

    µ( {x ∈ E : fn(x) ↛ f(x)} ) = 0.                                           (2.4)

That is, fn(x) → f(x) for all x ∈ E \ A, where µ(A) = 0. So if fn converges to f pointwise, then it
also converges to f almost everywhere (sometimes abbreviated a.e.).

Definition 2.13. We say fn converges to f in measure if for all ε > 0,

    µ( {x ∈ E : |fn(x) − f(x)| > ε} ) −→ 0,    as n → ∞.                       (2.5)

That is, for large n, fn is uniformly close to f, except on a set of small measure¹¹. So if fn converges
to f uniformly, then it also converges to f in measure.
While these notions of convergence can be advantageous since they are slightly more general than
pointwise/uniform convergence, we note that the limits are not unique under this convergence!

Proposition 2.14. Let f, g be two measurable functions E → R such that µ({x ∈ E : f(x) ≠
g(x)}) = 0. Suppose that fn → f almost everywhere (or in measure). Then fn → g almost
everywhere (or, respectively, in measure).

Proof. For convergence almost everywhere, using abbreviated notation,

    µ(fn ↛ g) ≤ µ(fn ↛ f) + µ(f ≠ g) = 0 + 0 = 0.

The argument for convergence in measure is similar (see the Problem Set).
¹¹ It is standard to abbreviate the notation in (2.4) as µ(fn ↛ f) and in (2.5) as µ(|fn − f| > ε), which some readers might find more clear.


Example. A general example is when fn = 1An for (An)n≥1 some sequence in E. Suppose A1 ⊆
A2 ⊆ . . ., with ⋃_n An = A.

• Then 1An converges to 1A almost everywhere (indeed, it converges pointwise);
• But 1An converges in measure to 1A only if µ(A \ An) → 0, which is not necessarily true when
  µ(A) = ∞.
• However, note that this is still better than uniform convergence, which holds only if An is equal
  to A for all n ≥ n0.

Now consider A1, A2, . . . such that µ(An) → 0 as n → ∞.

• Then 1An → 0 in measure;
• But 1An does not necessarily converge a.e.

Proposition 2.15. Suppose that µ(E) < ∞. Then fn → f almost everywhere implies fn → f in
measure.

Proof. For fixed ε > 0, consider the sets

    AN := {x ∈ E : |fn(x) − f(x)| > ε, for some n ≥ N},

which are decreasing in N. Then

    ⋂_{N≥1} AN = {x ∈ E : |fn(x) − f(x)| > ε for infinitely many n},

and by assumption, this RHS set has measure zero. Therefore, by continuity from above (using
µ(E) < ∞), we have µ(AN) → 0 as N → ∞. But

    µ(x : |fN(x) − f(x)| > ε) ≤ µ(x : |fn(x) − f(x)| > ε, for some n ≥ N) → 0,

and so fn → f in measure holds too.

2.3.3 Convergence of random variables


When the measure space is actually a probability space (Ω, F, P), we use different terminology for
these notions of convergence, but the same results apply. In both the following definitions, we
consider random variables X, (Xn )n≥1 on a common probability space (Ω, F, P).

Definition 2.16. We say Xn converges to X almost surely if

    P(Xn → X as n → ∞) = 1.                                                    (2.6)

This is the analogue of convergence almost everywhere for a probability space, and is abbreviated
Xn → X a.s., or Xn →a.s. X.

Definition 2.17. We say Xn converges to X in probability or in P if for all ε > 0,

    P(|Xn − X| > ε) −→ 0,    as n → ∞.                                         (2.7)

This is the analogue of convergence in measure for a probability space, and is abbreviated Xn →P X
if the probability measure P is explicit, and sometimes Xn →p X if it is implicit.


A quick summary of the relationships between these modes of convergence is the following:

    Xn → X a.s.  ⇒  Xn →P X  ⇒  Xn →d X.                                       (2.8)

Shortly, we will prove these relations, and discuss converses. First, we give some examples of almost
sure convergence. Later, we will see some examples of sequences that converge in probability, but
not almost surely.

Example. Let X, X1, X2, . . . be IID random variables with finite mean¹² E[X]. Then the Strong
Law of Large Numbers asserts that (1/n)(X1 + . . . + Xn) →a.s. E[X]. A special case of this is when
X has the Bernoulli distribution. We say X ∼ Bern(p) if X = 1 with probability p, and X = 0 with
probability 1 − p. Then X1, X2, . . . corresponds to a sequence of biased coin tosses, and the SLLN
states that the proportion of heads converges to p with probability 1.
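A simulation sketch of the Bernoulli case (our own code, using only the standard library): the running proportion of heads settles near p.

```python
import random

random.seed(0)
p, n = 0.3, 200_000

# Proportion of heads in n tosses of a p-biased coin; the SLLN says this
# converges to p with probability 1 as n grows.
heads = sum(1 for _ in range(n) if random.random() < p)
proportion = heads / n
```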

Example. Real numbers between 0 and 1 can be written in binary, for example as 0.1101000110 · · · .
It feels reasonable that choosing the digits by tossing a coin should generate a uniform random
number in [0, 1]. To formalise this, let X1, X2, . . . be IID Bern(1/2) RVs as in the previous example.
We would like to define

    U = (1/2)X1 + (1/4)X2 + (1/8)X3 + . . . ,                                  (2.9)

but it is not immediately clear that this is well-defined (let alone has the uniform distribution!). To
make it more clear, we define

    Un = (1/2)X1 + (1/4)X2 + . . . + (1/2^n)Xn,

and show instead that Un converges almost surely. Recall from analysis courses the definition of
a Cauchy sequence¹³. By our construction, the sequence (U1, U2, . . .) is Cauchy with probability 1,
and so P(Un converges) = 1, and we may define U = lim_n Un, for which Un →a.s. U.
It is easy to check that for k ∈ N, and a = 1, 2, . . . , 2^k, we have P(Un < a/2^k) = a/2^k whenever
n ≥ k. This implies that P(Un ≤ x) → x for all x ∈ [0, 1], and so Un →d Unif[0, 1]. In other words,
U in (2.9) is well-defined, and has the Uniform distribution on [0, 1].
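The construction can be simulated directly (a sketch; the helper name is our own). Truncating at 40 binary digits leaves an error below 2^−40, negligible next to sampling noise.

```python
import random

random.seed(1)
WEIGHTS = [2.0 ** -k for k in range(1, 41)]   # 1/2, 1/4, ..., 1/2^40

def coin_toss_uniform():
    # Partial sum U_40 = X_1/2 + X_2/4 + ... + X_40/2^40 with X_k IID Bern(1/2).
    return sum(w for w in WEIGHTS if random.getrandbits(1))

samples = [coin_toss_uniform() for _ in range(50_000)]
mean = sum(samples) / len(samples)                          # should be near E[U] = 1/2
frac_below = sum(s <= 0.5 for s in samples) / len(samples)  # should be near F_U(1/2) = 1/2
```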

Proposition 2.18. If the sequence of random variables Xn converges to X almost surely, then it
converges in probability also.

Proof. Probability space (Ω, F, P) satisfies the condition P(Ω) = 1 < ∞, so the result follows
directly as a special case of Proposition 2.15! (Proof in the language of probability spaces given in
lectures.)
¹² Which will be defined in the formal language of this course shortly, but the informal definition of previous courses is perfectly fine here.
¹³ A sequence of real numbers (an) whose terms become arbitrarily close to one another is called Cauchy. More formally, for all ε > 0, there exists N such that |an − am| < ε whenever n, m ≥ N. Any real-valued Cauchy sequence converges to a limit. The advantage of this definition (including in our application here) is that one can verify that a sequence converges without knowing what the limit is!


Proposition 2.19. Suppose Xn converges to X in probability. Then Xn converges to X in
distribution also.
Conversely, if Xn converges to a constant c in distribution, then Xn converges to c in probability
also.

Proof. We must prove that FXn(x) → FX(x) whenever x is a point of continuity of FX. Fix any
ε > 0, and note that if Xn ≤ x then either X ≤ x + ε or |Xn − X| > ε holds. So

    FXn(x) ≤ P(X ≤ x + ε) + P(|Xn − X| > ε).                                   (2.10)

The second term on the RHS vanishes as n → ∞, and so lim sup FXn(x) ≤ P(X ≤ x + ε) = FX(x + ε).
Taking ε → 0, and using continuity of FX at x, we then have lim sup FXn(x) ≤ FX(x).
Analogously to (2.10), we also have

    FXn(x) ≥ P(X ≤ x − ε) − P(|Xn − X| > ε),

and an identical argument leads to lim inf FXn(x) ≥ FX(x). So lim FXn(x) = FX(x) as required.
The partial converse for the case of deterministic limit is addressed on the Problem Set.

Example. Convergence in probability requires all RVs Xn to be defined on the same probability
space as the limit X, whereas convergence in distribution does not. But even if they are defined on
the same probability space, the converse implication does not hold. Consider RVs X, Y with the
same distribution, satisfying P(X = Y ) < 1. Then the sequence (X, X, X, . . .) converges to Y in
distribution, but not in probability.

Example. Suppose Xn are independent RVs, taking values {0, 1} with P(Xn = 1) = 1/n. Then by
the second Borel–Cantelli lemma, Xn does not converge a.s. to zero, but Xn →P 0.
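A simulation sketch of this example (our own code): nearly all of the 1s occur early, so the sequence looks like it converges, yet since ∑ 1/n diverges the 1s never stop entirely.

```python
import random

random.seed(2)
N = 100_000

# X_n = 1 with probability 1/n, independently. P(X_n = 1) -> 0 gives convergence
# in probability to 0; sum of 1/n diverging gives infinitely many 1s a.s.
xs = [1 if random.random() < 1 / n else 0 for n in range(1, N + 1)]

total_ones = sum(xs)          # grows like log N, so only around a dozen here
late_ones = sum(xs[1000:])    # ones after index 1000: few, but typically not zero
```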
This effect is captured more generally by the following lemma, which can be a useful practical test
for convergence to a constant of a sequence of RVs.

Lemma 2.20. Suppose that for every ε > 0,

    ∑_{n≥1} P(|Xn| > ε) < ∞.                                                   (2.11)

Then Xn → 0 almost surely.

Proof. We apply the first Borel–Cantelli lemma to (2.11), and conclude that

    P(|Xn| > ε for infinitely many n) = 0,  or, equivalently,  P(|Xn| ≤ ε eventually) = 1.

Note that an intersection over all ε > 0 is not countable. So take a sequence ε1 > ε2 > . . . > 0 such
that εk → 0 as k → ∞. Note that Xn → 0 iff {|Xn| ≤ εk eventually} holds for all k ≥ 1, and note
that these events are decreasing in k. So

    P(Xn → 0) = P( ⋂_{k≥1} {|Xn| ≤ εk eventually} ) = lim_{k→∞} P(|Xn| ≤ εk eventually) = 1,

which completes the proof.


We have a partial converse, provided the Xn are independent. This case might seem rather
pathological, since convergence in probability and almost surely apply to the setting where all RVs
are defined on the same probability space (normally in a more interesting way than just
independently). However, this situation does come up in some applications.

Lemma 2.21. Suppose that the sequence (Xn)n≥1 is independent, and Xn → 0 almost surely.
Then, for every ε > 0,

    ∑_{n≥1} P(|Xn| > ε) < ∞.

Proof. See Problem Set 5.

3 Integration for measurable functions


In previous analysis courses, we have met the notion of Riemann integration, where functions are
approximated from above and below by piecewise constant functions, and so the ‘area under the
curve’ is approximated by the area of a union of rectangles.
This mode of integration works well for continuous or ‘almost-continuous’ functions, but it fails for
functions such as 1Q that are discontinuous everywhere. (As indeed are many functions relevant
to applications.) In this course, we introduce a new mode of integration that is a better fit for
functions f : E → R from a general measure space (E, E, µ), and behaves nicely under some forms
of convergence.

3.1 The (Lebesgue) integral

The main building block for Lebesgue's theory of integration is that ∫ 1A dµ should equal µ(A),
whenever A is a measurable set. We construct the integral for more complicated functions by
extending linearly, and approximating. It is worth noting that (unlike for the Riemann integral)
we must start by studying functions which take non-negative values.

3.1.1 Definition for simple functions


Definition 3.1. Given a measure space (E, E, µ), we say a measurable function f : E → R is
simple if it can be expressed as

    f = ∑_{i=1}^k ai 1Ai,    Ai ∈ E,  ai ≥ 0.                                  (3.1)

The set of simple functions will be denoted S(E). Given a simple function f as in (3.1), we define
the integral of f as

    ∫_E f dµ = ∑_{i=1}^k ai µ(Ai).                                             (3.2)

It is possible that ai = 0 and µ(Ai) = ∞, in which case we define 0 × ∞ = 0.


Note that f could have multiple representations as (3.1), and so it is not immediate that (3.2) is
well-defined. However, this can be verified by reducing to the case where the Ai s are disjoint, which
is discussed on the Problem Set.
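Definition (3.2) is easy to sketch in code for a measure given by point masses on a finite set (all names here are illustrative, not from the notes). Evaluating two different representations of the same simple function also illustrates the well-definedness point:

```python
# A measure on E = {1,...,5} given by point masses; mu(A) = sum of the masses in A.
masses = {1: 0.5, 2: 1.0, 3: 2.0, 4: 0.0, 5: 1.5}

def mu(A):
    return sum(masses[x] for x in A)

def integrate_simple(terms):
    # terms is a list of pairs (a_i, A_i) representing f = sum_i a_i * 1_{A_i};
    # the integral is sum_i a_i * mu(A_i), as in (3.2).
    return sum(a * mu(A) for a, A in terms)

# Two representations of the same simple function f = 3*1_{1,2} + 2*1_{2,3}:
v1 = integrate_simple([(3, {1, 2}), (2, {2, 3})])       # 3*1.5 + 2*3.0 = 10.5
v2 = integrate_simple([(3, {1}), (5, {2}), (2, {3})])   # 3*0.5 + 5*1.0 + 2*2.0 = 10.5
```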

Proposition 3.2. Let f, g be simple functions.

i) Consider real constants α, β ≥ 0. Then αf + βg ∈ S(E), and

    ∫_E (αf + βg) dµ = α ∫_E f dµ + β ∫_E g dµ.

ii) If f(x) ≤ g(x) a.e.¹⁴, then

    ∫_E f dµ ≤ ∫_E g dµ.

iii) We have ∫_E f dµ = 0 if and only if f = 0 a.e.

Proof. i) We can write f = ∑_{i=1}^k ai 1Ai and g = ∑_{j=1}^ℓ bj 1Bj, so that

    αf + βg = ∑_{i=1}^k (αai)1Ai + ∑_{j=1}^ℓ (βbj)1Bj,

which satisfies the definition (3.1), and the equality of integrals follows immediately.
ii) As noted in Definition 3.1, we may assume without loss of generality (and with considerable
convenience!) that the Ai are disjoint, and that the Bj are disjoint. It is helpful to introduce

    A0 = E \ (A1 ∪ · · · ∪ Ak),    B0 = E \ (B1 ∪ · · · ∪ Bℓ),

so that A0, A1, . . . , Ak and B0, B1, . . . , Bℓ are both partitions of E. Then we may rewrite as

    f = ∑_{i=0}^k ∑_{j=0}^ℓ ai,j 1Ai∩Bj,    where ai,j = ai for i ≥ 1, and a0,j = 0,

and similarly for g = ∑∑ bi,j 1Ai∩Bj. The relation that f ≤ g a.e. means that for every i, j, we
either have µ(Ai ∩ Bj) = 0, or ai,j ≤ bi,j. The comparison of integrals then follows directly from
the definition (3.2).
iii) Both statements are equivalent to the statement that ai = 0 for all i such that µ(Ai) > 0.

3.1.2 Extension to non-negative measurable functions


The idea is to approximate a general non-negative function f using simple functions. For reasons we
will explore later, it is important to carry out this approximation in a monotone way, ie from below.
Note that this contrasts with the definition of the Riemann integral, which involves sandwiching
between upper- and lower-integrals of piecewise-constant functions.
¹⁴ Meaning for almost all x ∈ E, ie the set of x for which this fails has measure zero.


Definition 3.3. Let f : E → R be a measurable function taking non-negative values. Then the
integral of f is defined as

    ∫_E f dµ := sup { ∫_E g dµ : g ∈ S(E), 0 ≤ g ≤ f }.                        (3.3)

Note that the g on the RHS are simple functions which are bounded above by f.
We now verify briefly that any non-negative measurable function can be well-approximated by
simple functions.

Lemma 3.4. Let f : E → R be a measurable function taking non-negative values. Then there
exists an increasing sequence (fn)n≥1 of simple functions such that the monotone limit fn ↑ f holds
a.e. as n → ∞.

Proof. Define

    fn(x) = k/2^n  when k/2^n ≤ f(x) < (k+1)/2^n, for some k = 0, 1, . . . , 2^{2n} − 1,
    fn(x) = 2^n    when f(x) ≥ 2^n.

That is, whenever f(x) < 2^n, we construct fn(x) by 'rounding down' f(x) to the nearest multiple
of 2^{−n}. Each fn takes only finitely many values so is simple. It is clear that, whenever f(x) < 2^n,
fn+1(x) is equal either to fn(x) or to fn(x) + 1/2^{n+1}, and so fn ↑ f.
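The dyadic approximation in this proof can be written out directly (a sketch in Python; the function names are ours):

```python
from math import floor

def dyadic_approx(f, n):
    # The simple function f_n of Lemma 3.4: round f(x) down to a multiple of 2^-n,
    # capped at 2^n when f(x) >= 2^n.
    def fn(x):
        v = f(x)
        return float(2**n) if v >= 2**n else floor(v * 2**n) / 2**n
    return fn

f = lambda x: x * x                  # an arbitrary non-negative test function
xs = [0.1 * i for i in range(30)]    # sample points in [0, 2.9]

# Each row is f_n evaluated on the sample points, for n = 1, ..., 11.
levels = [[dyadic_approx(f, n)(x) for x in xs] for n in range(1, 12)]
```

Successive rows increase towards f, matching the monotone limit fn ↑ f.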

It is not clear why integration and convergence (monotone or otherwise) should commute. The
following key theorem shows that, for monotone convergence, they do.

THEOREM 3.5 (Monotone convergence theorem). Let (fn)n≥1 and f be measurable functions
E → R, all taking non-negative values, and satisfying fn ↑ f a.e. Then ∫_E fn dµ ↑ ∫_E f dµ holds
both when the limit is finite and infinite.

Proof. Omitted, for now.

Example. To see that monotonicity is essential, consider fn = n1[0,1/n] , for which the integral over
R is always 1, but the function converges a.e. to 0.
The MCT is a very useful tool in studying integrals of measurable functions. For now, we will use
it to lift the results of Proposition 3.2 to the more general case.

Proposition 3.6. Let f, g be measurable functions E → R taking non-negative values.

i) Consider real constants α, β ≥ 0. Then αf + βg is a non-negative measurable function, and

    ∫_E (αf + βg) dµ = α ∫_E f dµ + β ∫_E g dµ.

ii) If f(x) ≤ g(x) a.e., then

    ∫_E f dµ ≤ ∫_E g dµ.

iii) We have ∫_E f dµ = 0 if and only if f = 0 a.e.

Proof. To show part i) we approximate f, g by fn, gn ∈ S(E), for example as given by Lemma 3.4.
Then, we have by Proposition 3.2,

    ∫_E (αfn + βgn) dµ = α ∫_E fn dµ + β ∫_E gn dµ,

and since αfn + βgn is also an increasing sequence in n which converges a.e. to αf + βg as n → ∞,
we may apply MCT and take a monotone limit of both sides to conclude

    ∫_E (αf + βg) dµ = α ∫_E f dµ + β ∫_E g dµ.

Part ii) is discussed on the problem set.
For part iii), let fn ↑ f as given by Lemma 3.4. If f = 0 a.e., then all the fn are zero a.e., and so
∫_E fn dµ = 0 for all n, which implies ∫_E f dµ = 0. All these steps are reversible (using Proposition
3.2 iii)).

3.1.3 Positive and negative parts


As we saw in the counterexample below the statement of MCT, the monotonicity is essential to the
construction of the integral for non-negative functions. This poses a challenge when constructing
the integral for general functions taking positive and negative values. The main case that is not
permitted is the cancellation of an infinite positive part with an infinite negative part (which is
sometimes possible when studying Riemann integration).

Definition 3.7. Let f : E → R be a measurable function. Define the positive and negative parts

    f⁺(x) = max(f(x), 0),    f⁻(x) = − min(f(x), 0).

Here, the functions f⁺, f⁻ are both measurable and take non-negative values. Indeed, we have

    f = f⁺ − f⁻,    and    |f| = f⁺ + f⁻.

Definition 3.8. We say a measurable function f : E → R is integrable if ∫_E f⁺ dµ < ∞ and¹⁵
∫_E f⁻ dµ < ∞. We then define the integral of f to be

    ∫_E f dµ = ∫_E f⁺ dµ − ∫_E f⁻ dµ.

Example. In general, if we have a measurable function f and a non-negative measurable function
g such that |f| ≤ g a.e., and ∫ g < ∞, then f is integrable. This observation is strengthened in
Theorem 3.11 shortly to include convergence conditions.
¹⁵ Note that this condition is equivalent to ∫_E |f| dµ < ∞. This framework is reminiscent of the definition of absolute convergence of a series discussed in previous courses.

R∞ R∞
Example. On range [0, ∞), the function f (x) = sinx x is not integrable since both 0 f + = 0 f − =
RK
∞. However, the Riemann integral (defined as the limit in K of 0 sinx x dx) does exist. The details
of this calculation are on the Problem Set.

Example. In the previous example, it was crucial that the range was infinite. In general, if a
function f is Riemann-integrable on a finite interval [a, b], then f is Lebesgue integrable on¹⁶ [a, b].
This is explored in more detail on the problem set.

Proposition 3.9. Let f, g be integrable functions E → R.

i) Consider real constants α, β ∈ R. Then αf + βg is integrable, and

    ∫_E (αf + βg) dµ = α ∫_E f dµ + β ∫_E g dµ.

ii) If f(x) ≤ g(x) a.e., then

    ∫_E f dµ ≤ ∫_E g dµ.

iii) Let (fn)n≥1 and f be measurable functions E → R (without the restriction to take non-negative
values), and satisfying fn ↑ f a.e. Then ∫_E fn dµ ↑ ∫_E f dµ, as in Theorem 3.5.

Proof. The main step for part i) is Lemma 3.10, which is stated and proved below. The remainder
of the argument is addressed on the problem set.
For part ii), study g − f, which is non-negative, and use Proposition 3.6 ii) together with the
linearity result proved in part i) to convert the result to the required form.
Part iii) is addressed on the problem set.
R R
Lemma 3.10. Let f, g be measurable non-negative functions E → R with ∫_E f dµ, ∫_E g dµ < ∞.
Then f − g is integrable, and

    ∫_E (f − g) dµ = ∫_E f dµ − ∫_E g dµ.

Proof. To prove f − g is integrable, use the triangle inequality, and Proposition 3.6 ii),

    ∫_E |f − g| dµ ≤ ∫_E (|f| + |g|) dµ = ∫_E f dµ + ∫_E g dµ < ∞.

Now, we introduce the following sets to characterise when f − g is positive and negative,

    A⁺ = {x : f(x) ≥ g(x)},    A⁻ = {x : f(x) < g(x)}.

Then (f − g)⁺ = (f − g)1A⁺. We now apply linearity for addition of non-negative functions to
(f − g)1A⁺ and g1A⁺, obtaining

    ∫_E f 1A⁺ dµ = ∫_E [ (f − g)1A⁺ + g1A⁺ ] dµ = ∫_E (f − g)1A⁺ dµ + ∫_E g1A⁺ dµ.

Rearranging the outer terms, we find

    ∫_E (f − g)⁺ dµ = ∫_E (f − g)1A⁺ dµ = ∫_E f 1A⁺ dµ − ∫_E g1A⁺ dµ.

Similarly,

    ∫_E (f − g)⁻ dµ = ∫_E g1A⁻ dµ − ∫_E f 1A⁻ dµ.

Returning to ∫ (f − g) and combining these results, we obtain (with informal, abbreviated notation)

    ∫ (f − g) = ∫ (f − g)⁺ − ∫ (f − g)⁻ = ∫ f 1A⁺ + ∫ f 1A⁻ − ∫ g1A⁺ − ∫ g1A⁻ = ∫ f − ∫ g,

as required.

3.1.4 Convergence of integrals

We have already seen one example where fn → f a.e. but ∫ fn dµ ↛ ∫ f dµ. Theorem 3.5 gives us a
result under monotone convergence, but clearly we want to have some results under more general
convergence conditions too. Let us note at this point that uniform convergence (eg on (−∞, ∞)) is
certainly not a guarantee of convergence of the associated integrals. This is essentially because the
integral of a uniformly-small error over an infinite range can end up infinite!
The following theorem gives us a partial solution.

THEOREM 3.11 (Dominated convergence theorem). Let (fn)n≥1 and f be measurable functions
E → R such that fn → f a.e. Suppose there exists an integrable function¹⁷ g : E → [0, ∞) (ie taking
non-negative values) such that for all n ≥ 1, we have |fn| ≤ g a.e. Then fn and f are integrable
and

    ∫_E fn dµ −→ ∫_E f dµ.

Example. Suppose µ(E) < ∞, and the fn, f are uniformly bounded, ie there exists C < ∞ such
that |fn|, |f| ≤ C a.e. Then one can take g ≡ C, and so ∫ g dµ = Cµ(E) < ∞, and thus under these
conditions fn → f a.e. implies ∫ fn → ∫ f by DCT.

Example. Let fn : [0, ∞) → R be defined by fn(x) = e^{−nx}. Then fn → 0 a.e. (indeed, for all
x ≠ 0), and we can bound by fn(x) ≤ e^{−x}, for which ∫_0^∞ e^{−x} dx < ∞. So we can use e^{−x} as
the dominating function, and use DCT¹⁸ to show ∫_0^∞ e^{−nx} dx → ∫_0^∞ 0 dx = 0.
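A crude numerical sketch of this example (the quadrature helper is our own; the tail beyond x = 60 is negligible for these integrands):

```python
from math import exp

def integral_0_to(g, upper=60.0, steps=50_000):
    # Midpoint rule on [0, upper]; a rough stand-in for the integral over [0, infinity).
    h = upper / steps
    return sum(g((i + 0.5) * h) for i in range(steps)) * h

# The integral of e^{-nx} over [0, infinity) equals 1/n exactly, so the values tend to 0.
values = {n: integral_0_to(lambda x, n=n: exp(-n * x)) for n in (1, 2, 5, 10)}
```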

Example. Let fn : [0, π] → R be defined by fn(x) = sin( x + (cos x)^n / n ). Then |fn| ≤ 1 and¹⁹
fn(·) → sin(·) a.e., and the limiting function is integrable on [0, π]. So ∫_0^π fn(x) dx → ∫_0^π sin(x) dx = 2.
¹⁷ The role of g in DCT is sometimes called the dominating function.
¹⁸ Note that in this case we have fn ↓ 0 a.e., so we could also have used the decreasing version of the MCT proved on the problem set.
¹⁹ Note that if |fn| ≤ g a.e. and fn → f a.e., then |f| ≤ g a.e. also.


3.1.5 Application to counting measure


We will see shortly in Section 3.2 how the machinery of integration introduced in these sections
specialises to probability spaces. As a warm-up, we will consider applying the theory to counting
measure.
We study measurable space (N, P(N)) with measure µ defined by µ(A) = |A| for all A ⊆ N. We
note the following consequences of our general definitions:
• A measurable function f : N → R is just a sequence f(1), f(2), . . ..
• If two measurable functions / sequences f, g are equal a.e., this means f(n) = g(n) for all n.
  (Since the only set of measure zero is the empty set.)

In general, this measurable function f is not simple, but can be written as f = ∑_{n≥1} f(n)1{n},
which should be understood, formally, as

    f = lim_{K→∞} ∑_{n=1}^K f(n)1{n}.

We then have ∫ f dµ = ∑ f(n)µ({n}) = ∑ f(n), which is uncontroversial if f ≥ 0. But for sequences
taking positive and negative values, the condition for f to be integrable is ∑ |f(n)| < ∞, ie the
definition of absolute convergence of a series.
To apply DCT, suppose that for each n ≥ 1 we have a sequence (a_1^(n), a_2^(n), . . .) of real numbers
such that a_k^(n) → a_k as n → ∞ for every k. Then we might conjecture that under certain
circumstances, we have

    a_1^(n) + a_2^(n) + . . . → a_1 + a_2 + . . . .                             (3.4)

• If we have a_k^(n+1) ≥ a_k^(n) ≥ 0 for all n, k, then (3.4) follows from MCT.
• If there exists a non-negative sequence (b_1, b_2, . . .) such that |a_k^(n)| ≤ b_k for all n, k, and
  ∑_k b_k < ∞, then (3.4) follows from DCT.
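A numeric sketch of the DCT bullet (the example sequences are our own): a_k^(n) = (1 − 1/n)^k 2^{−k} is dominated by b_k = 2^{−k} with ∑ b_k = 2 < ∞, and a_k^(n) → 2^{−k} for each k, so the sums converge to ∑ 2^{−k} = 2.

```python
def dominated_sum(n, K=60):
    # Sum over k of a_k^(n) = (1 - 1/n)^k * 2^-k, truncated at K terms
    # (the tail beyond k = 60 is far below floating-point resolution).
    return sum((1 - 1 / n) ** k * 2.0 ** -k for k in range(K))

sums = [dominated_sum(n) for n in (10, 100, 1000, 10_000)]
target = sum(2.0 ** -k for k in range(60))   # the limiting sum, essentially 2
```

Here the sums happen to increase in n, so the MCT bullet would also apply.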

3.2 Integrals in probability spaces


We have seen in Section 2.2 that the analogue of measurable functions in probability spaces are
random variables. Under this analogy, integrals of measurable functions are expectations of random
variables.

3.2.1 Expectations
Throughout this section, we will use our usual notation (Ω, F, P) for a probability space.

Definition 3.12. For a random variable X on (Ω, F, P), the expectation of X is defined as the
integral E[X] := ∫_Ω X dP, whenever X is integrable in the sense of Definition 3.8. (Note, it is
particularly relevant for expectations that if E[X⁺] = ∞ and E[X⁻] < ∞, we may write E[X] = ∞.)

Example. For A ∈ F, the indicator function 1A is a random variable and E [1A ] = P(A).
The following properties of expectations are inherited directly from the corresponding properties of
general integrals:


• (Linearity of expectation) For random variables X, Y, we have, for all λ, µ ∈ R,

      E[λX + µY] = λE[X] + µE[Y],

  whenever X, Y are both integrable.
• If X ≤ Y holds a.s., then E[X] ≤ E[Y].
Suppose g : R → R is measurable, and X a random variable. Then g(X) is also a random variable,
as a measurable function Ω → R, and so we do not need a separate definition of E [g(X)].

Example. Taking g(x) = x², we note that X² is a random variable. Similarly (X − E[X])² is a
random variable, whenever E[X] < ∞.
The variance of X is defined as E[(X − E[X])²]. Note that the alternative expression var(X) =
E[X²] − (E[X])² follows from linearity of expectation.


Example. An instructive example is the deterministic case when X = c a.s. for some c ∈ R.

Definition 3.13. For a measurable space (E, E), and x0 ∈ E, define the (Dirac) delta measure δx0 by

    δx0(A) = 1 if x0 ∈ A,  and  δx0(A) = 0 if x0 ∉ A,    for A ∈ E.            (3.5)

So for X = c a.s., one option is to take P = δc. Now, to compute E[f(X)] for a function f : R → R,
we note that taking f̃ = f(c)1{c} gives f = f̃ a.e., and f̃ is simple, so

    ∫_R f dδc = ∫_R f̃ dδc = f̃(c)δc({c}) = f̃(c) = f(c).

Example. Now suppose X is a random variable on (Ω, F, P) taking values in {1, 2, . . .}. Then
1{X=n} is also a RV, and we have

    X = lim_{N→∞} X1{X≤N} = lim_{N→∞} ∑_{n=1}^N n1{X=n},

where the first equality denotes an almost sure limit, and then

    E[X] = lim_{N→∞} E[ ∑_{n=1}^N n1{X=n} ] = lim_{N→∞} ∑_{n=1}^N nP(X = n) = ∑_{n≥1} nP(X = n),

where the first equality follows from applying MCT to ∑_{n=1}^N n1{X=n}, which is monotone in N.
See the problem set for a similar argument for E[g(X)].
We revisit the following result, which gives a useful bound on probabilities in terms of expectations.

Proposition 3.14 (Markov's inequality). Let X be a random variable taking non-negative values. Then for any a ∈ (0, ∞), we have P(X ≥ a) ≤ E[X]/a.

Proof. Introduce the auxiliary random variable Y = a·1_{X≥a}, so that Y ≤ X a.s. But then E[Y] = a·P(X ≥ a) ≤ E[X], and the result follows.
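Markov's inequality is straightforward to check by simulation. The following Python sketch (not part of the course; the choices X ∼ Exp(1) and a = 3 are illustrative) estimates both sides:

```python
import random

random.seed(0)
samples = [random.expovariate(1.0) for _ in range(100_000)]  # X >= 0, E[X] = 1

mean_est = sum(samples) / len(samples)
a = 3.0
tail_est = sum(1 for x in samples if x >= a) / len(samples)  # estimates P(X >= a)
markov_bound = mean_est / a
# Markov: P(X >= 3) <= E[X]/3 ≈ 0.333, while the true tail is e^{-3} ≈ 0.0498.
```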

Dominic Yeo dominic.yeo@kcl.ac.uk



3.2.2 Density functions


The idea of the density function f_X of a 'continuous random variable' is familiar from previous courses in probability. Roughly speaking, integrating the density function recovers probabilities, and also gives a recipe for calculating expectations:

    P(a ≤ X ≤ b) = ∫_a^b f_X(x) dx,    E[X] = ∫_{−∞}^∞ x·f_X(x) dx.   (3.6)

We are now in a position to formalise this notion, starting with the following idea of ‘change of
measure’.

Proposition 3.15. Let (E, E, µ) be a measure space, and f : E → R a non-negative measurable function. For all A ∈ E, define

    ν(A) = ∫_A f dµ = ∫_E f·1_A dµ.

Then ν(·) is a measure on (E, E), and for every integrable g : E → R, we have

    ∫_E g dν = ∫_E f·g dµ.   (3.7)

Proof. We first check that ν is a measure. Note that ν(A) ≥ 0 by construction, and ν(∅) = 0. Now suppose we are given a countable sequence of disjoint sets (A_n)_{n≥1} in E. Applying the monotone convergence theorem to the finite sums Σ_{n=1}^N f·1_{A_n}, we have

    ∫_E (Σ_{n≥1} f·1_{A_n}) dµ = Σ_{n≥1} ∫_E f·1_{A_n} dµ,

from which it follows that

    ν(∪_{n≥1} A_n) = Σ_{n≥1} ν(A_n).

For (3.7), we start with the case g simple, ie g = Σ_{i=1}^k a_i·1_{A_i}, with a_i ≥ 0, for which

    ∫_E g dν = Σ_{i=1}^k a_i·ν(A_i) = Σ_{i=1}^k a_i·∫_E f·1_{A_i} dµ = ∫_E f·g dµ,

where the last equality follows from linearity of (finite collections of) integrals.

Then for measurable g ≥ 0, we use simple approximations g_n ↑ g a.e. as in Lemma 3.4. Then f·g_n ↑ f·g a.e. also holds, so two applications of MCT (on ν and µ in the first and third equalities, respectively) give

    ∫_E g dν = lim ∫_E g_n dν = lim ∫_E f·g_n dµ = ∫_E f·g dµ.

In contrast to the proof of linearity in Proposition 3.9, here the lift to the case of general measurable g is immediate, after noting that (fg)⁺ = f·g⁺ and (fg)⁻ = f·g⁻.


Recall from Section 2.2.1 that any random variable X on probability space (Ω, F, P) induces a probability measure P_X on R, given by P_X(A) = P(X ∈ A). We will apply the following definition to P_X.

Definition 3.16. Given a measure space (E, E, µ), suppose that another measure ν on (E, E) satisfies, for some non-negative measurable f : E → R,

    ν(A) = ∫_E f·1_A dµ,   ∀A ∈ E;

then f is the density of ν with respect to µ.

When X is a random variable, with induced measure P_X, and there exists non-negative measurable f_X : R → R such that

    P_X(A) = P(X ∈ A) = ∫_R f_X·1_A dx,   (3.8)

with dx understood to mean Borel measure on R, then f_X is the probability density function of X.

For a pdf f_X of a random variable X:

• We must have ∫_R f_X dx = 1 (by taking A = R in (3.8)).

• If we have f̃_X = f_X a.e., then f̃_X is also "the" pdf of X.

• It is sufficient to verify (3.8) for all A = (−∞, a], or for any other generating set of B(R). (Though note that (−∞, a] is particularly useful, as it corresponds directly to the distribution function F_X, which we study shortly.)

3.2.3 Integrals of f_X, and expectations of g(X)


Recall the distribution function F_X of a random variable X, defined by

    F_X(x) = P(X ≤ x) = P(X ∈ (−∞, x]) = ∫_R f_X·1_{(−∞,x]} dx.

By the Fundamental Theorem of Calculus, if F_X is continuous, and F′_X exists for all but finitely many x, then F_X(x) = ∫_{−∞}^x F′_X(u) du, and so²⁰ F′_X(·) is the pdf of X.
In practice, distributions are often defined by FX or fX , including the familiar examples below.

Example. A random variable U has the uniform distribution on the interval [a, b] if

    F_U(x) = 0 for x < a,   F_U(x) = (x − a)/(b − a) for x ∈ [a, b],   F_U(x) = 1 for x > b.

The density function is then f_U = (1/(b − a))·1_{[a,b]}. Note that if we take f̃_U = (1/(b − a))·1_{(a,b]} or similar, then f̃_U = f_U a.e., and so both f_U, f̃_U may be considered the density of U. That is, we don't distinguish between the uniform distribution(s) Unif([a, b]) and Unif((a, b]) etc.

²⁰ Note that it does not matter how we define F′_X at the finite collection of values for which F_X is not differentiable, since this set has measure zero.


Example. A random variable X has the exponential distribution with parameter λ > 0 when P(X ≥ x) = e^{−λx} for all x ∈ [0, ∞). That is, F_X(x) = 1 − e^{−λx} and f_X(x) = λe^{−λx}·1_{[0,∞)}.

The main result of this section formalises the motivating idea (3.6).

Proposition 3.17. Let X be a random variable on probability space (Ω, F, P), and g : R → R a measurable function. If g(X) is integrable, then²¹

    E[g(X)] = ∫_R g dP_X = ∫_R g(x)·f_X(x) dx,   (3.9)

where the second equality holds whenever X has density f_X.

Proof. The second equality in (3.9) is a direct consequence of Proposition 3.15, so we only need to prove the first equality. We use the same structure as before, verifying the result for g simple, then g ≥ 0 measurable, then general measurable g.

If g is a simple function Σ_{i=1}^k a_i·1_{A_i} for A_i ∈ B(R), then g(X) = Σ_{i=1}^k a_i·1_{X∈A_i} is a simple random variable, and

    E[g(X)] = Σ_{i=1}^k a_i·P(X ∈ A_i) = Σ_{i=1}^k a_i·P_X(A_i) = ∫_R g dP_X,

using the definition of expectations/integrals of simple functions in the first and last equalities.

If g ≥ 0 is measurable, we approximate by simple g_n ↑ g. It is important here that this holds P_X-a.e., but in fact it holds for all x ∈ R. Thus g_n(X) ↑ g(X) almost surely, and so we can use MCT twice, as usual, to obtain

    E[g(X)] = lim_{n→∞} E[g_n(X)] = lim_{n→∞} ∫ g_n dP_X = ∫ g dP_X,

since both limits are monotone.

Writing g(X) = g⁺(X) − g⁻(X) in the usual way lifts the result to general measurable g.

Example. Let X ∼ Exp(λ), so that f_X(x) = λe^{−λx}·1_{[0,∞)}. Then

    E[X] = ∫_0^∞ λx·e^{−λx} dx = 1/λ,    E[X²] = ∫_0^∞ λx²·e^{−λx} dx = 2/λ².

The variance of X is then 1/λ².
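These integrals can be sanity-checked with simple numerical quadrature. A rough Python sketch (illustrative only; the midpoint rule and the truncation of the domain at x = 50 are ad-hoc choices):

```python
import math

lam = 2.0
f = lambda x: lam * math.exp(-lam * x)   # density of Exp(lam) on [0, infinity)

def integrate(g, a, b, n=200_000):
    """Plain midpoint rule for the integral of g over [a, b] (rough, not adaptive)."""
    h = (b - a) / n
    return sum(g(a + (i + 0.5) * h) for i in range(n)) * h

mean = integrate(lambda x: x * f(x), 0.0, 50.0)       # should be close to 1/lam = 0.5
m2   = integrate(lambda x: x * x * f(x), 0.0, 50.0)   # should be close to 2/lam^2 = 0.5
var  = m2 - mean ** 2                                  # should be close to 1/lam^2 = 0.25
```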

3.2.4 Inequalities for g(X)


Now that we have a good collection of results for g(X) (which is, recall, a random variable), it is useful to apply Markov's inequality (Proposition 3.14) to g(X) when possible. We require g ≥ 0, and E[g(X)] < ∞, for the inequality to be useful.

²¹ Some sources call this result the law of the unconscious statistician (or LOTUS), since it is tempting to treat (3.9) as the definition of E[g(X)] rather than as a consequence of the theory in this section of the course.


Corollary 3.18 (Chebyshev's inequality). Let X be a random variable, with²² µ = E[X] < ∞ and σ² = Var(X) < ∞. Then, for all k > 0,

    P(|X − µ| ≥ kσ) ≤ 1/k².   (3.10)

For any t ∈ [0, ∞) satisfying E[e^{tX}] < ∞, we also have the following Chernoff bound:

    P(X ≥ a) ≤ E[e^{tX}]/e^{ta}.   (3.11)

Proof. Apply Markov's inequality to the random variable (X − µ)². Then

    P(|X − µ| ≥ kσ) = P((X − µ)² ≥ k²σ²) ≤ E[(X − µ)²]/(k²σ²) = σ²/(k²σ²) = 1/k².

For the Chernoff bound, apply Markov's inequality to the random variable e^{tX}. For this argument, it is important that x ↦ e^{tx} is increasing in x, so that P(X ≥ a) = P(e^{tX} ≥ e^{ta}).

Note. Chebyshev’s inequality always gives a ‘better’ bound than Markov’s inequality, provided
2
 2
σ < ∞. Of course, there do exist random variables for which E [X] < ∞ but E X = ∞.
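To see these bounds side by side, the following Python sketch (illustrative, not part of the course; it takes X ∼ N(0, 1) with k = a = 2) estimates the tails by simulation and optimises the Chernoff bound over a grid of t, using the Gaussian MGF E[e^{tX}] = e^{t²/2}:

```python
import math, random

random.seed(1)
xs = [random.gauss(0.0, 1.0) for _ in range(200_000)]   # N(0,1), so mu = 0, sigma = 1

k = 2.0
two_sided = sum(1 for x in xs if abs(x) >= k) / len(xs)   # P(|X - mu| >= k sigma)
one_sided = sum(1 for x in xs if x >= k) / len(xs)        # P(X >= k)

cheb = 1 / k**2                                           # Chebyshev: 0.25
# Chernoff for N(0,1): E[e^{tX}]/e^{tk} = e^{t^2/2 - tk}; minimise over a grid of t
# (the optimum is at t = k, giving e^{-k^2/2}).
chernoff = min(math.exp(t * t / 2 - t * k) for t in (i / 100 for i in range(1, 501)))
```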
We also briefly state some inequalities involving expected values of random variables, which generalise results seen in previous analysis courses.

Recall the notion of convexity of a function g : R → R. In applications, it is often useful to use the following characterisation:

    g twice differentiable, and g″(x) ≥ 0 ∀x ∈ R  ⇒  g convex.   (3.12)

Proposition 3.19 (Jensen's inequality). Let X be an integrable RV, and g : R → R convex. Then g(E[X]) ≤ E[g(X)].

Corollary 3.20 (Jensen's inequality - finite case). When X is discrete, taking values x₁, . . . , x_k ∈ R with probabilities p₁, . . . , p_k, we recover g(Σ_{i=1}^k p_i·x_i) ≤ Σ_{i=1}^k p_i·g(x_i). Note that the case k = 2 of this result is usually taken as the definition of convexity.

Corollary 3.21 (AM-GM). Let x₁, . . . , x_n ∈ (0, ∞). Then (x₁ + . . . + x_n)/n ≥ (x₁ · · · x_n)^{1/n}.

Remark. The two sides of this inequality are called the arithmetic mean and geometric mean.

Proof. Let X be uniform on {x₁, . . . , x_n}. Then apply Jensen with the function (− log x).
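The proof translates directly into a few lines of code. A small Python sketch of both means (the data values are arbitrary; the geometric mean is computed, as in the proof, via the mean of the logarithms):

```python
import math

def am(xs):
    """Arithmetic mean."""
    return sum(xs) / len(xs)

def gm(xs):
    """Geometric mean, via exp of the mean of logs (as in the Jensen proof)."""
    return math.exp(sum(math.log(x) for x in xs) / len(xs))

# Jensen with the convex function g(x) = -log(x) gives -log(E[X]) <= E[-log(X)],
# i.e. the arithmetic mean dominates the geometric mean.
data = [1.0, 4.0, 9.0, 16.0]
arithmetic, geometric = am(data), gm(data)   # 7.5 and 576^(1/4) ≈ 4.90
```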

Proposition 3.22 (Cauchy–Schwarz inequality). Let X, Y be two random variables. Then (E[|XY|])² ≤ E[X²]·E[Y²].

Remark. The absolute value signs on the LHS ensure that the LHS is always defined, as the expected value of a non-negative RV. Note that this result includes the statement that if E[|XY|] = ∞, then at least one of E[X²] and E[Y²] is infinite also.
²² Note that µ and σ have different meanings in this setting compared with their previous roles relating to measures. It is generally clear from context which is intended.


3.2.5 Transformations of random variables


We have previously discussed that for a random variable X, and measurable function g : R → R, it is valid to view g(X) itself as a random variable, and in Proposition 3.17 we saw how to study its expectation using the density function f_X of X. We now explore how to define the density of g(X) directly.

Proposition 3.23. Let X be a continuous random variable with density f_X, and let g : R → R be a measurable function such that i) g is either strictly increasing or strictly decreasing; ii) g⁻¹ is differentiable everywhere. Then the density of Y = g(X) is given by

    f_Y(y) = f_X(g⁻¹(y))·|d/dy g⁻¹(y)|.   (3.13)

Proof. We focus on the case g strictly increasing, and use the distribution function F_Y:

    F_Y(y) = P(g(X) ≤ y) = P(X ≤ g⁻¹(y)) = F_X(g⁻¹(y)),

where, since g is strictly increasing and continuous, g⁻¹(y) exists as a real number. Then, differentiating, we obtain

    F′_Y(y) = F′_X(g⁻¹(y))·d/dy g⁻¹(y) = f_X(g⁻¹(y))·d/dy g⁻¹(y).

In the case where g is strictly decreasing, we have F_Y(y) = 1 − F_X(g⁻¹(y)) and, consequently,

    F′_Y(y) = −f_X(g⁻¹(y))·d/dy g⁻¹(y),

which is consistent with (3.13) since d/dy g⁻¹(y) is then negative.

Note that this result and argument are just a conversion of familiar results about 'integration by substitution' into the language of probability densities, and we could have proved Proposition 3.23 using that framework as well. Note that we now have two expressions for E[g(X)], namely

    ∫ y·f_Y(y) dy = E[g(X)] = ∫ g(x)·f_X(x) dx,

with f_Y given by (3.13); the fact that they are equal follows directly using integration by substitution.

Example. Let X ∼ Exp(λ), let c > 0, and set Y = cX. Then

    P(Y ≤ x) = P(X ≤ x/c) = 1 − e^{−λx/c},

which is the distribution function of Exp(λ/c).

Alternatively, using the result of Proposition 3.23 directly, we have

    f_Y(y) = (1/c)·f_X(y/c) = (λ/c)·e^{−λy/c},

which is the pdf of Exp(λ/c).
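A simulation sketch in Python (illustrative; the parameters λ = 1.5 and c = 3 are arbitrary) compares the empirical distribution function of Y = cX against F(y) = 1 − e^{−(λ/c)y}:

```python
import math, random

random.seed(2)
lam, c = 1.5, 3.0
ys = [c * random.expovariate(lam) for _ in range(100_000)]   # Y = cX, X ~ Exp(lam)

def ecdf(y):
    """Empirical distribution function of the sample ys at the point y."""
    return sum(1 for v in ys if v <= y) / len(ys)

# Largest discrepancy against the Exp(lam/c) distribution function, at a few points.
err = max(abs(ecdf(y) - (1 - math.exp(-lam / c * y))) for y in (0.5, 1, 2, 4, 8))
```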


4 Multivariate probability

Most interesting situations in probability involve multiple random variables defined on the same probability space (Ω, F, P). There are times when it is important that random variables are independent (see Section 2.2.1), and we also have many examples where constructing random variables in a dependent fashion produces interesting effects.

In this section we will discuss some of the formalism, and several applications, of the situation where the distribution of a collection of random variables is defined jointly.

4.1 Multivariate distributions


We will focus initially on the case of two random variables (X, Y) on some probability space (Ω, F, P). By plotting (X, Y) with respect to Cartesian axes, we can view this as a random element of R², or an R²-valued random variable.

4.1.1 R²-valued measurable functions


We need some measure-theoretic formalism to make sense of the notion of an R²-valued random variable. Fortunately, many of the definitions for R carry over directly. The Borel sets B(R²) are generated as

    B(R²) := σ({(a, b] × (c, d] : a < b, c < d}),

or by any other reasonable collection of rectangles, with Borel measure defined by

    µ((a, b] × (c, d]) = (b − a)(d − c),

and extended to B(R²) analogously to the case for R discussed in Section 1.3.2.

For this section, we refer to Borel measure on R² as µ, with dx, dy used to indicate integrals with respect to Borel measure on R, to avoid confusion. The definition of a measurable function is unchanged, as is the construction of the integral. However, it is not immediately clear under what circumstances one may study an integral over R² as a conventional 'double-integral' over each copy of R in turn. The following theorem clarifies this.

Proposition 4.1 (Fubini's Theorem). Let f be a measurable non-negative function R² → R. Then it holds that

    ∫_{R²} f dµ = ∫_R (∫_R f(x, y) dx) dy = ∫_R (∫_R f(x, y) dy) dx,   (4.1)

including in the sense that if one of these integrals is infinite, then all three are infinite.

Now, for general measurable f : R² → R, if any of the following integrals

    ∫_{R²} |f| dµ,   ∫_R ∫_R |f(x, y)| dx dy,   ∫_R ∫_R |f(x, y)| dy dx

is finite, then all three are finite, we say f is integrable, and (4.1) holds for f.
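For well-behaved integrands, the exchange of integration order can be seen numerically. A Python sketch (illustrative only) with the certainly-integrable function f(x, y) = x·y² on [0, 1]², using plain Riemann sums in both orders; both approximate (1/2)·(1/3) = 1/6:

```python
def double_int(order_xy=True, n=400):
    """Riemann (midpoint) sum of f(x,y) = x*y^2 over [0,1]^2, in either order."""
    h = 1.0 / n
    pts = [(i + 0.5) * h for i in range(n)]
    if order_xy:
        # integrate over x first, then y
        return sum(sum(x * y * y for x in pts) * h for y in pts) * h
    # integrate over y first, then x
    return sum(sum(x * y * y for y in pts) * h for x in pts) * h

I1, I2 = double_int(True), double_int(False)
```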


Product measures - non-examinable


The theory above is a special case of the general notion of a product measure. For interested readers, we summarise this briefly, as a non-examinable aside.

Given two measure spaces (E₁, E₁, µ₁) and (E₂, E₂, µ₂), it is natural to seek a measure on E = E₁ × E₂ such that the restriction to each coordinate gives µ₁ and µ₂ respectively. First, it is necessary to define a σ-algebra on E₁ × E₂, and this σ-algebra E = E₁ ⊗ E₂ is generated by the product sets,

    E := σ({A₁ × A₂ : A₁ ∈ E₁, A₂ ∈ E₂}),

noting that this is not asserting that every set in E can be decomposed as such a product (which is certainly not true). In fact, if E₁ = σ(A₁) and E₂ = σ(A₂), then we have

    E = σ({A₁ × A₂ : A₁ ∈ A₁, A₂ ∈ A₂}),

which is particularly convenient if A₁, A₂ are more tractable than E₁, E₂ (as in the case of B(R) generated by the intervals). The product measure µ = µ₁ ⊗ µ₂ is then defined by

    µ(A₁ × A₂) = µ₁(A₁)·µ₂(A₂),   A₁ ∈ E₁, A₂ ∈ E₂,

then extending to E₁ ⊗ E₂ using the machinery mentioned in Section 1.3.2. There are two key regularity properties we would like to hold:

• Uniqueness of this extension;

• Fubini's theorem (4.1).

It turns out these are not always valid. However, they are valid if both µ₁, µ₂ satisfy a 'σ-finite' condition, which holds whenever µᵢ(Eᵢ) < ∞, and also for many infinite measures²³, including Borel measure on R.

However, counting measure on an uncountable set (see the examples given below Definition 1.5) is not σ-finite, and Fubini's theorem sometimes does not hold for products involving this measure. See the Problem Set for a concrete example where exchanging the order of integration fails.

4.1.2 R²-valued random variables


The joint distribution of random variables (X, Y) is defined by P((X, Y) ∈ Ā) for Ā ∈ B(R²). As in the case of a single RV, we can reduce this abstract definition to something more concrete. It is in fact also determined by P(X ∈ A, Y ∈ A′) for A, A′ ∈ B(R) and, most relevantly for calculations, by P(X ≤ x, Y ≤ y) with x, y ∈ R. Motivated by this, we define the joint distribution function

    F_{X,Y}(x, y) = P(X ≤ x, Y ≤ y),   x, y ∈ R.

The joint density function f_{X,Y} : R² → [0, ∞) is measurable, and satisfies

    P((X, Y) ∈ Ā) = ∫_{Ā} f_{X,Y} dµ,   ∀Ā ∈ B(R²),

²³ Precisely, a measure space is σ-finite if E can be expressed as a countable union E = ∪_{n≥1} E_n where each µ(E_n) < ∞. This is true for Borel measure on R, as R = ∪_{n≥1} [−n, n], where the Borel measure of each interval [−n, n] is finite.


which is equivalent to

    F_{X,Y}(x, y) = ∫_{u=−∞}^x ∫_{v=−∞}^y f_{X,Y}(u, v) dv du.

The fundamental theorem of calculus then gives

    f_{X,Y}(x, y) = ∂²/∂x∂y F_{X,Y}(x, y).   (4.2)

The following notion allows us to reduce the case of a joint distribution (in particular with a joint density) to the distribution of each component separately.

Definition 4.2. If the random vector (X, Y) has joint density f_{X,Y}, then the marginal density of X is

    f_X(x) := ∫_{−∞}^∞ f_{X,Y}(x, y) dy.   (4.3)

In a joint context, the distribution of X by itself is called the marginal distribution of X.


The joint distribution (X, Y ) induces a measure P(X,Y ) on R2 via P(X,Y ) (Ā) = P (X, Y ) ∈ Ā for


all Ā ∈ B(R2 ). Furthermore, when (X, Y ) has a joint density, we have the analogue to Proposition
3.17 for measurable functions g : R2 → R:
Z Z Z Z
E [g(X, Y )] = g dP(X,Y ) = g fX,Y dµ = g(x, y)fX,Y (x, y) dx dy.
R2 R2 R R

Note that P(X,Y ) can’t in general be split as a product, whereas dµ can.

Example. Let g(x, y) = (x − E[X])(y − E[Y]), so then

    E[g(X, Y)] = E[(X − E[X])(Y − E[Y])] = E[XY] − E[X]·E[Y],   (4.4)

which is called the covariance of X, Y. Note that when X, Y are independent, we have Cov(X, Y) = 0, but the converse is not generally true.

Recall (2.2) for a definition of independent random variables. This translates directly to the setting of joint distribution functions. Specifically, X, Y are independent precisely when

    F_{X,Y}(x, y) = P(X ≤ x, Y ≤ y) = P(X ≤ x)·P(Y ≤ y) = F_X(x)·F_Y(y),   ∀x, y ∈ R.

Since this is a product of a function of x and a function of y, the characterisation (4.2) is particularly useful, and gives

    f_{X,Y}(x, y) = (d/dx P(X ≤ x))·(d/dy P(Y ≤ y)) = f_X(x)·f_Y(y).

Similarly, when g(x, y) can be decomposed as h(x)·h̃(y), we have E[g(X, Y)] = E[h(X)·h̃(Y)] = E[h(X)]·E[h̃(Y)].

Note. In the language of product measures, we have X, Y independent ⇐⇒ P_{(X,Y)} = P_X ⊗ P_Y.


4.1.3 Transformations of multivariate distributions


The theory of Section 3.2.5 is particularly useful in the multivariate setting, where we might want to study functions of a collection of random variables, such as X + Y, or a reparameterisation, eg into polar coordinates (X, Y) ↦ (R, Θ).

Unlike in one dimension, we must draw a distinction between a function of a multivariate RV and a transformation. We will begin with a function g(X, Y) of a multivariate RV (X, Y), and will focus only on the case of X + Y when X, Y are independent.

Sums of independent RVs


Let X, Y be independent with densities f_X, f_Y, respectively, and let Z = X + Y. We can find the density f_Z of Z in the following way.

First, note that the joint density f_{X,Y} is given by f_{X,Y}(x, y) = f_X(x)·f_Y(y) by independence. We will now study Z via its distribution function F_Z:

    F_Z(z) = P(X + Y ≤ z) = ∫_{{x+y≤z}} f_{X,Y} dµ = ∫_{R²} 1_{x+y≤z}·f_{X,Y} dµ,

which we can transform into a more practical form using Fubini:

    = ∫_{x=−∞}^∞ ∫_{y=−∞}^{z−x} f_X(x)·f_Y(y) dy dx

    = ∫_{x=−∞}^∞ ∫_{w=−∞}^z f_Y(w − x)·f_X(x) dw dx    (substituting w = x + y)

    = ∫_{w=−∞}^z ∫_{x=−∞}^∞ f_Y(w − x)·f_X(x) dx dw,

where we use Fubini again in the final equality to change the order of integration. By differentiating, we find

    f_Z(z) = d/dz F_Z(z) = ∫_{x=−∞}^∞ f_Y(z − x)·f_X(x) dx.

This quantity²⁴ is called the convolution of f_X, f_Y.

Example. Suppose X, Y ∼ Exp(λ) are IID, and Z = X + Y. Then, for z ≥ 0,

    f_Z(z) = ∫_{x=0}^z λ²·e^{−λx}·e^{−λ(z−x)} dx = λ²·e^{−λz} ∫_{x=0}^z dx = λ²z·e^{−λz}.

²⁴ There is an analogue in the discrete case X, Y ∈ {0, 1, 2, . . .}, where P(X + Y = k) = Σ_{ℓ=0}^k P(X = ℓ)·P(Y = k − ℓ).
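The convolution formula can also be evaluated numerically and compared with the density λ²z·e^{−λz} derived above. A Python sketch (illustrative only; the grid size and truncation of the integral are ad-hoc):

```python
import math

lam = 1.0
fX = lambda x: lam * math.exp(-lam * x) if x >= 0 else 0.0   # Exp(lam) density

def conv(f, g, z, n=4000, L=40.0):
    """Midpoint-rule evaluation of the convolution (f * g)(z) = int f(x) g(z-x) dx."""
    h = L / n
    return sum(f((i + 0.5) * h) * g(z - (i + 0.5) * h) for i in range(n)) * h

z = 2.0
fZ_numeric = conv(fX, fX, z)
fZ_exact = lam ** 2 * z * math.exp(-lam * z)   # the Gamma(2, lam) density at z
```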


Definition 4.3. We say X has the Gamma distribution, denoted X ∼ Γ(n, λ) for λ > 0, n ∈ {1, 2, . . .}, when X has density

    f_X(x) = (λⁿ·x^{n−1}/(n − 1)!)·e^{−λx}·1_{x≥0}.

Note that n = 1 reduces to Exp(λ), and n = 2 is the distribution of the sum of two IID Exp(λ)s, as derived in the previous example. In fact, if X₁, X₂, . . . are IID Exp(λ) RVs, then X₁ + . . . + X_n ∼ Γ(n, λ).

For a single random variable, the theory of the density of the transformed RV reduces directly to integration by substitution. The same is true for multiple RVs, and we recall the key object for studying changes of variables in higher-dimensional integrals.

Definition 4.4. Let ϕ : D ⊆ R² → C ⊆ R² be a function mapping (x, y) ↦ (u, v) for which all partial derivatives exist on D. Then the Jacobian J(ϕ) is defined as the following determinant of the matrix of partial derivatives:

    J(ϕ) = det( ∂u/∂x  ∂u/∂y ; ∂v/∂x  ∂v/∂y ) = (∂u/∂x)(∂v/∂y) − (∂u/∂y)(∂v/∂x),

noting that (despite the notation) this is a function of (x, y) on D.

Proposition 4.5. Let ϕ : (x, y) ↦ (u, v) be a one-to-one mapping from D ⊆ R² to C ⊆ R², for which ϕ⁻¹ has continuous partial derivatives everywhere on C. We define (U, V) = ϕ(X, Y). Then if (X, Y) has joint density f_{X,Y} supported on D, then (U, V) is jointly continuous with joint density

    f_{U,V}(u, v) = f_{X,Y}(ϕ⁻¹(u, v))·|J(ϕ⁻¹)|·1_C(u, v).   (4.5)

Proof. As referenced below Proposition 3.23 for the monovariate case, this follows from the usual construction of integration by substitution in two dimensions. (The Jacobian represents the infinitesimal change-of-area factor under a substitution.)

Example. Let X, Y ∼ Exp(λ) be IID, and set Z = X + Y and Q = X/(X + Y). Then f_{X,Y}(x, y) = λ²·e^{−λ(x+y)}, and we can study

    ϕ : [0, ∞)² → [0, ∞) × [0, 1],   (x, y) ↦ (z, q) = (x + y, x/(x + y)).

Then ϕ⁻¹ is defined by x = zq, y = z(1 − q), and so

    J(ϕ⁻¹) = det( q  z ; 1 − q  −z ) = −z.

So (Z, Q) has joint density f_{Z,Q}(z, q) = λ²z·e^{−λz}·1_{[0,1]}(q), which splits as a product of f_Z(z) = λ²z·e^{−λz} and f_Q(q) = 1_{[0,1]}(q), so that Z, Q are independent, with Z ∼ Γ(2, λ) and Q ∼ Unif([0, 1]).

This is the natural setup to model an arrivals process, where for example X is the time of the first bus, and Y the additional time to wait for the second bus. This result shows that the first bus arrives at a uniformly chosen time within the interval [0, X + Y]. This observation forms the basis for defining the Poisson process, which is a central tool for modelling random processes in continuous time.
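The independence of Z and Q, and the Unif([0, 1]) marginal of Q, show up clearly in simulation. A Python sketch (illustrative only; λ = 1 is arbitrary, and the sample covariance of Z and Q should be close to 0):

```python
import random

random.seed(3)
lam, n = 1.0, 100_000
pairs = [(random.expovariate(lam), random.expovariate(lam)) for _ in range(n)]
zq = [(x + y, x / (x + y)) for x, y in pairs]   # (Z, Q) = (X + Y, X / (X + Y))

qs = [q for _, q in zq]
zs = [z for z, _ in zq]
mq, mz = sum(qs) / n, sum(zs) / n              # E[Q] = 1/2, E[Z] = 2/lam = 2
cov_zq = sum((z - mz) * (q - mq) for z, q in zq) / n   # ≈ 0 by independence
```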


4.1.4 Conditional probability in the continuous setting


In previous probability courses, we have seen the notion of conditional probabilities for events, defined by

    P(B | A) = P(A ∩ B)/P(A),   and   P(A ∩ B) = P(A)·P(B | A),

provided P(A) > 0. We can define a (continuous) random variable X conditional on an event A of positive probability via

    P(X ∈ C | A) = P({X ∈ C} ∩ A)/P(A),   ∀C ∈ B(R),

or, more practically, by

    P(X ≤ x | A) = P({X ≤ x} ∩ A)/P(A),

where the LHS satisfies the properties of a distribution function. We could define the conditional distribution function F_{X|A}(x) = P(X ≤ x | A), viewing (X | A) as the 'conditional distribution of X given A'.

If X is continuous, then F_X is continuous, and so F_{X|A} is also continuous. In general, it is reasonable to consider the conditional probability measure P_{X|A}, but in practical terms it is particularly convenient if we can differentiate F_{X|A} to obtain the conditional density f_{X|A} such that

    P(X ∈ C | A) = ∫_C f_{X|A}(x) dx,   ∀C ∈ B(R).

From this, we obtain the conditional expectation E[X | A] = ∫ X dP_{X|A} = ∫ x·f_{X|A}(x) dx.

Example. Let X ∼ Exp(λ) and A = {X ≥ a} for some a ≥ 0. One can check that

    (X | A) =ᵈ a + Exp(λ),   (4.6)

which is known as the memoryless property of the exponential distribution.
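The memoryless property (4.6) says P(X ≥ a + t | X ≥ a) = P(X ≥ t). A Python simulation sketch (illustrative only; the values of λ, a, t are arbitrary):

```python
import math, random

random.seed(4)
lam, a, t = 0.5, 1.0, 2.0
xs = [random.expovariate(lam) for _ in range(200_000)]

cond = [x for x in xs if x >= a]                      # samples from (X | X >= a)
lhs = sum(1 for x in cond if x >= a + t) / len(cond)  # P(X >= a + t | X >= a)
rhs = math.exp(-lam * t)                              # P(X >= t) = e^{-lam t}
```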


A central outstanding question is how to condition on an event A with P(A) = 0. This is partic-
ularly relevant in the setting where (X, Y ) are jointly continuous, and we want to make sense of
conditioning on {X = x}, for example in a modelling setting where X is observed, and Y remains
unknown.

Definition 4.6. Suppose (X, Y ) have joint density fX,Y , and recall the marginals fX , fY from
Definition 4.2. Then for x ∈ R such that fX (x) > 0, the conditional density of Y given X = x is

fX,Y (x, y)
fY |X=x (y) = . (4.7)
fX (x)

Note immediately that if X, Y are independent, then for all x we have fY |X=x = fY . The conditional
expectation is defined as the natural extension E [Y | X = x]. Noting that this is a function of x,
one can extend it to E [Y | X] which is a random variable (ie a function of random variable X).


To justify this definition, consider conditioning instead on {x ≤ X ≤ x + ε}, where ε > 0 is small. Then

    P(Y ≤ y | x ≤ X ≤ x + ε) = (∫_{v=−∞}^y ∫_{u=x}^{x+ε} f_{X,Y}(u, v) du dv) / (∫_{u=x}^{x+ε} f_X(u) du)

    ≈ (ε·∫_{v=−∞}^y f_{X,Y}(x, v) dv) / (ε·f_X(x))

    = ∫_{v=−∞}^y f_{X,Y}(x, v)/f_X(x) dv,

where the approximation is valid so long as f_{X,Y}(·, v) and f_X(·) are continuous at x. So f_{X,Y}(x, y)/f_X(x) fits the definition of the (conditional) density.

4.2 Gaussian random variables


A (monovariate) random variable X has the normal distribution with parameters µ ∈ R and σ² < ∞, denoted X ∼ N(µ, σ²), if X has density f_X(x) = (1/√(2πσ²))·e^{−(x−µ)²/2σ²}. We distinguish the standard normal distribution Z ∼ N(0, 1), defined by density f_Z(x) = (1/√(2π))·e^{−x²/2}.

An important feature of the normal distribution(s) is closure under linear transformations. That is,

    Z ∼ N(0, 1)  ⇒  X = σZ + µ ∼ N(µ, σ²).   (4.8)

4.2.1 IID normals in polar coordinates


The goal of Section 4.2 is to explore how this theory of normal distributions under linear transformations transfers to the higher-dimensional setting. As a warmup, we consider the case of two independent standard normals X, Y ∼ N(0, 1), viewed as a random point in R².

Now, consider the reparameterisation via R = √(X² + Y²) and Θ ∈ [0, 2π) defined such that X = R cos Θ, Y = R sin Θ.

Then the joint density is f_{X,Y}(x, y) = (1/2π)·exp(−(x² + y²)/2), and we can consider the Jacobian for the transformation (r, ϑ) ↦ (x, y), given by

    det( ∂x/∂r  ∂y/∂r ; ∂x/∂ϑ  ∂y/∂ϑ ) = det( cos ϑ  sin ϑ ; −r sin ϑ  r cos ϑ ) = r.

So f_{R,Θ}(r, ϑ) = (1/2π)·exp(−r²/2)·r, which splits as a product.

We conclude that the random variables (R, Θ) are independent, with Θ ∼ Unif([0, 2π)). A consequence of this is that the joint distribution of (X, Y) is invariant under rotations. In other words, there is a notion of a two-dimensional normal distribution that does not depend on the choice of axes. This is strong evidence that this is an important higher-dimensional distribution.
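Read in reverse, the factorisation f_{R,Θ}(r, ϑ) = (1/2π)·r·e^{−r²/2} gives a recipe for sampling normals: Θ ∼ Unif([0, 2π)), and R has distribution function 1 − e^{−r²/2}, so R = √(−2 log U) for U ∼ Unif((0, 1)). This is the Box–Muller method, sketched below in Python (a standard construction, though not derived in these notes):

```python
import math, random

random.seed(5)

def box_muller():
    """Sample (X, Y) IID N(0,1) via the polar factorisation above."""
    u1, u2 = random.random(), random.random()
    r = math.sqrt(-2.0 * math.log(1.0 - u1))   # 1 - u1 avoids log(0)
    theta = 2.0 * math.pi * u2
    return r * math.cos(theta), r * math.sin(theta)

samples = [box_muller() for _ in range(100_000)]
mx = sum(x for x, _ in samples) / len(samples)       # should be near 0
vx = sum(x * x for x, _ in samples) / len(samples)   # should be near 1
```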


4.2.2 Gaussian random vectors - formalism


Definition 4.7. A (monovariate) random variable X is Gaussian if X ∼ N(µ, σ²) for some µ ∈ R, σ² < ∞.

A random vector X = (X₁, . . . , X_n) ∈ R^n is Gaussian²⁵ if u^T X is Gaussian for all vectors²⁶ u ∈ R^n. Given such X, we define the mean µ = E[X] ∈ R^n coordinate-wise, and the covariance matrix V = (Cov(Xᵢ, Xⱼ))_{i,j=1}^n. By construction, V is a symmetric matrix in R^{n×n}.

Note that the key feature of this definition is that u^T X is Gaussian, not the exact parameters. In fact, the exact parameters are clear: u^T X ∼ N(u^T µ, u^T V u). (See the problem set.) We must also check that u^T V u is non-negative, which we will do a bit later.

It also follows quickly from this definition (without needing to apply any transformation formulas) that for a matrix A ∈ R^{n×n} and b ∈ R^n, AX + b is also Gaussian. (See the problem set.)

We briefly recall the definition of the moment generating function of a (monovariate) random variable, as discussed in previous courses. Let X be an R-valued random variable; then the moment generating function is m_X(t) := E[e^{tX}] for t ∈ R. For some distributions, and for some values of t, we may have m_X(t) = ∞.
we may have mX (t) = ∞.

Example. For Z ∼ N(0, 1) and X ∼ N(µ, σ²), we have

    m_Z(t) = exp(t²/2),   m_X(t) = exp(µt + σ²t²/2).   (4.9)

Note that because the density of the normal distribution decays like e^{−x²/2}, which decays faster than e^{tx} grows for any fixed t, the MGF of a Gaussian is finite everywhere.

A key usage of MGFs is that they 'determine distributions', provided they are defined on an interval. Formally, if X, Y are two random variables for which m_X(t) and m_Y(t) are finite and equal for all t in some interval [−ε, ε], then X =ᵈ Y.

Example. Let X₁ ∼ N(µ₁, σ₁²) and X₂ ∼ N(µ₂, σ₂²) be independent normal RVs. Then X₁ + X₂ ∼ N(µ₁ + µ₂, σ₁² + σ₂²).

Note that E[X₁ + X₂] = µ₁ + µ₂ and Var(X₁ + X₂) = σ₁² + σ₂² is clear without reference to the normal distribution. To verify that X₁ + X₂ is Gaussian, we use MGFs, since in general m_{X₁+X₂}(t) = m_{X₁}(t)·m_{X₂}(t) when X₁, X₂ are independent. In this particular case, we obtain

    m_{X₁+X₂}(t) = exp((µ₁ + µ₂)t + ½(σ₁² + σ₂²)t²),

and so we can read off that X₁ + X₂ ∼ N(µ₁ + µ₂, σ₁² + σ₂²).

Definition 4.8. The MGF of a random vector X = (X₁, . . . , X_n) ∈ R^n is defined as m_X(u) = E[e^{u^T X}], for u ∈ R^n.

²⁵ Sometimes we say X is multivariate Gaussian, or (X, Y) is jointly Gaussian, to emphasise the dimension.
²⁶ One can alternatively think of u^T X as u · X or ⟨u, X⟩.


Then the corresponding result for random vectors says that if X, Y are two R^n-valued random vectors for which m_X(u), m_Y(u) are finite and equal for all u ∈ [−ε, ε]^n, then X =ᵈ Y.

So if X is a Gaussian random vector with mean µ and covariance V, we can explicitly calculate

    m_X(u) = E[e^{u^T X}] = exp(u^T µ + ½u^T V u),   (4.10)

via the t = 1 case for the MGF of the distribution N(u^T µ, u^T V u).

We conclude that the distribution of a Gaussian random vector is completely characterised by its mean and covariance matrix. The main outstanding question is whether all matrices V are achievable as the covariance matrix. However, we do know about one particular Gaussian random vector.

Definition 4.9. The standard Gaussian in n dimensions is a random vector X = (X₁, . . . , X_n) where the Xᵢ are IID N(0, 1) random variables. The density of X is given by

    f_X(x) = Π_{i=1}^n (1/√(2π))·exp(−xᵢ²/2) = (2π)^{−n/2}·exp(−‖x‖²/2),   x = (x₁, . . . , x_n) ∈ R^n.

4.2.3 Bivariate Gaussians


It is useful, both as a theoretical warmup and for applications, to consider the case n = 2. Recall that in general it is false that Cov(X, Y) = 0 implies X, Y are independent. However, in the case of Gaussian random vectors, this useful result is true!

Proposition 4.10. Let (X₁, X₂) be jointly Gaussian. Then Cov(X₁, X₂) = 0 if and only if X₁, X₂ are independent.

Proof. The converse, that X₁, X₂ independent implies Cov(X₁, X₂) = 0, is always true; see (4.4).

A sufficient condition for independence of X₁, X₂ is that the MGF m_X(u) splits as a product of a function of u₁ and a function of u₂, for all u ∈ R². Referring to the explicit calculation (4.10), and writing µ = (µ₁, µ₂) and σ₁², σ₂² for the variances of X₁, X₂, we have

    exp(u^T µ) = exp(u₁µ₁)·exp(u₂µ₂),
    exp(½u^T V u) = exp(½u₁²σ₁²)·exp(½u₂²σ₂²)·exp(u₁u₂·Cov(X₁, X₂)).

So it is clear that m_X(u) splits as a product if (in fact, if and only if) Cov(X₁, X₂) = 0.

For a two-dimensional joint Gaussian, the covariance acts as a measure of the dependence between the two random components, but varies with the overall magnitudes. In practice it is often more appropriate to use the following measure.

Definition 4.11. For two random variables X, Y, define the correlation of X, Y to be

    Corr(X, Y) = Cov(X, Y)/√(Var(X)·Var(Y)) ∈ [−1, 1].


Lemma 4.12. For all X, Y, we have Corr(X, Y) ∈ [−1, 1].

Proof. Reduce to the case E[X] = E[Y] = 0; then the result follows immediately from Cauchy–Schwarz, as in Proposition 3.22.

As we have seen, much theory is considerably easier in the situation where the random variables
are independent. A particularly nice result about two-dimensional joint Gaussians is that the
dependent case can be reduced to a situation defined in terms of auxiliary random variables which
are independent.

Proposition 4.13. Let (X, Y ) be jointly Gaussian. Then there exists a ∈ R such that Y can be
expressed as aX + Z, where Z is Gaussian, and X, Z are independent. (And so certainly (X, Z) is
jointly Gaussian).

Proof. A covariance calculation gives a. See Problem Set 9.

Example. When (X, Y ) are jointly Gaussian, (Y | X = x) is Gaussian for all x ∈ R. With
Proposition 4.13 in mind, we do not need to calculate using the joint density and (4.7) to justify
this assertion.
Instead, we note that the independence of X, Z implies (Z | X = x) =^d Z, and so (Y | X = x) =^d ax + Z.
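A numerical illustration of this decomposition (a sketch, not from the notes; the coefficients 2 and 3 and all names are my choices): build a jointly Gaussian pair Y = 2X + 3G with G independent of X, estimate a = Cov(X, Y)/Var(X) from samples, and check that Z = Y − aX is uncorrelated with X.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 100_000
G1 = rng.standard_normal(N)
G2 = rng.standard_normal(N)
X = G1
Y = 2.0 * G1 + 3.0 * G2        # jointly Gaussian with X; Cov(X, Y) = 2, Var(X) = 1

a = np.cov(X, Y)[0, 1] / np.var(X)   # sample estimate of Cov(X, Y) / Var(X)
Z = Y - a * X                        # candidate independent remainder

print(a)                     # close to 2
print(np.cov(X, Z)[0, 1])    # close to 0, as Proposition 4.13 predicts
```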

4.2.4 Densities for general Gaussian random vectors


The formalism above has not yet confirmed that any Gaussian random vectors exist (!) apart from
the standard Gaussian. In previous courses, students may have seen the multivariate Gaussian
defined directly by its density. We will now recover this density, and show that the formal definition
is consistent with what has been seen before.
We have seen in (4.8) that general R-valued Gaussian random variables can be expressed as a linear
transformation of a standard R-valued Gaussian. The main idea is that the same principle holds in
higher dimensions.
We know that if X ∈ Rn is Gaussian with mean µ and covariance matrix V , then for A ∈ Rn×n and
b ∈ Rn we have Y = AX + b also Gaussian. We now calculate Y ’s parameters. Since E is linear,
we have E [Y ] = AE [X] + b = Aµ + b. Recall that

    V = Cov(X) = E[(X − µ)(X − µ)^T],

and so

    Cov(Y) = E[(AX + b − (Aµ + b))(AX + b − (Aµ + b))^T]
           = E[(A(X − µ))(A(X − µ))^T]
           = E[A(X − µ)(X − µ)^T A^T]
           = A E[(X − µ)(X − µ)^T] A^T
           = A V A^T.                                    (4.11)


We also note that when Z ∈ Rn is a standard Gaussian, the covariance matrix Cov(Z) is the identity
matrix Idn on Rn .

In one dimension, converting as in (4.8), the scaling factor is σ = √(σ²). This is more involved in
the higher-dimensional setting. We must revisit some linear algebra in order to make sense of the
notion of the ‘square root’ of a matrix.

Definition 4.14. Any real symmetric matrix V can be diagonalised, that is, expressed as V =
U^T D U, where D is diagonal, and U is orthogonal, meaning U^T U = Id. Furthermore, the diagonal
matrix D consists of all the eigenvalues on the diagonal, where these eigenvalues are all real,
and the corresponding eigenvectors form an orthonormal basis.

Definition 4.15. A symmetric matrix A ∈ R^{n×n} is positive semi-definite if u^T A u ≥ 0 for all
u ∈ R^n.

Lemma 4.16. For a Gaussian vector X, the covariance matrix V is symmetric and positive semi-definite.

Proof. Matrix V is symmetric since Cov(·, ·) is symmetric in its two arguments. Then, to confirm
V is positive semi-definite, note that

    u^T V u = Σ_{i,j=1}^n u_i Cov(X_i, X_j) u_j = Cov( Σ_{i=1}^n u_i X_i , Σ_{j=1}^n u_j X_j ) = Var(u^T X) ≥ 0,

where we use bilinearity of covariance in the middle equality.

Lemma 4.17. Given a matrix A ∈ Rn×n which is symmetric and positive semi-definite, all the
eigenvalues of A are non-negative. Furthermore, A can be written as A = BB for some B ∈ Rn×n
symmetric and positive semi-definite.

Proof. Let u be an eigenvector of A with eigenvalue λ. Then

    0 ≤ u^T A u = λ u^T u = λ ||u||²,

so λ ≥ 0. Now, writing A = U^T D U, we can define D^{1/2} by taking the diagonal matrix with the
square roots of all the eigenvalues of A on the diagonal, so that D = D^{1/2} D^{1/2}. Since U U^T = Id,
we have

    A = U^T D U = U^T D^{1/2} (U U^T) D^{1/2} U = (U^T D^{1/2} U)(U^T D^{1/2} U),

so the result holds with B = U^T D^{1/2} U.
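This construction is easy to carry out numerically; the sketch below (my own, using numpy's convention A = U diag(w) U^T, where the columns of U are eigenvectors, transposed relative to the U^T D U convention above) builds B and checks that it is a symmetric square root of A.

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])         # symmetric, positive semi-definite

w, U = np.linalg.eigh(A)           # A = U @ diag(w) @ U.T (columns = eigenvectors)
B = U @ np.diag(np.sqrt(w)) @ U.T  # the square root from Lemma 4.17

print(np.allclose(B, B.T))         # B is symmetric
print(np.allclose(B @ B, A))       # and squares to A
```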

We are now ready to move to the density of the general Gaussian random vector.

THEOREM 4.18. For every vector µ ∈ R^n and positive semi-definite V ∈ R^{n×n}, there exists an
n-dimensional Gaussian random vector X with E[X] = µ and V = (Cov(X_i, X_j))_{i,j}. Furthermore,
if det(V) ≠ 0, then X has density

    f_X(x) = 1/((2π)^{n/2} √(det(V))) · exp( −(1/2) (x − µ)^T V^{−1} (x − µ) ).        (4.12)


Note. In one dimension, the case det(V) = 0 corresponds to σ² = 0, when the random variable is
deterministic. In higher dimensions, det(V) = 0 implies that X is supported on a subspace of lower
dimension than n, and so does not have a density.

Proof. Write V = V^{1/2} V^{1/2} using Lemma 4.17, and set X = V^{1/2} Z + µ, where Z is a standard
Gaussian on R^n. Then E[X] = V^{1/2} E[Z] + µ = µ, and so using (4.11)

    Cov(X) = V^{1/2} Cov(Z) V^{1/2} = V^{1/2} Id_n V^{1/2} = V.

Thus X is Gaussian and has the correct mean and covariance matrix, which completely characterises
the distribution.

Now, consider ϕ : R^n → R^n with ϕ : z ↦ V^{1/2} z + µ, so that ϕ^{−1}(x) = V^{−1/2}(x − µ), which makes
sense when det(V) ≠ 0, so that V (and thus V^{1/2}) is invertible. We aim to use Proposition 4.5 to
handle the density of X as a transformation of Z. Note that the Jacobian J(ϕ^{−1}) is constant and
equal to the determinant of V^{−1/2}, i.e. J(ϕ^{−1}) = 1/√(det(V)). Then we also have

    ||ϕ^{−1}(x)||² = ||V^{−1/2}(x − µ)||² = (V^{−1/2}(x − µ))^T V^{−1/2}(x − µ)
                 = (x − µ)^T V^{−1/2} V^{−1/2} (x − µ) = (x − µ)^T V^{−1} (x − µ).

Applying the general formula (4.5) we obtain

    f_X(x) = 1/√(det(V)) · f_Z(ϕ^{−1}(x)) = 1/((2π)^{n/2} √(det(V))) · exp( −(1/2) (x − µ)^T V^{−1} (x − µ) ).
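The construction in this proof can be checked by simulation (a sketch with an illustrative µ and V of my own choosing): form V^{1/2} as in Lemma 4.17, set X = V^{1/2}Z + µ, and compare the empirical mean and covariance with µ and V.

```python
import numpy as np

rng = np.random.default_rng(2)
mu = np.array([1.0, -1.0])
V = np.array([[2.0, 0.5],
              [0.5, 1.0]])

w, U = np.linalg.eigh(V)                 # eigendecomposition of V
V_half = U @ np.diag(np.sqrt(w)) @ U.T   # symmetric square root, Lemma 4.17

Z = rng.standard_normal((200_000, 2))    # rows: standard Gaussians in R^2
X = Z @ V_half + mu                      # rows: samples of V^{1/2} Z + mu

print(X.mean(axis=0))                    # close to mu
print(np.cov(X, rowvar=False))           # close to V
```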

Example. Let Z ∈ Rn be standard Gaussian, and A ∈ Rn×n an orthogonal matrix so that AAT =
Idn . Then W = AZ is Gaussian, with E [W ] = 0 and Cov(W ) = ACov(Z)AT = Idn . So in
fact the standard Gaussian is invariant under orthogonal transformations of Rn , just as in the
one-dimensional case which we explored in Section 4.2.1.
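A quick simulation of this invariance (a sketch, not from the notes; the rotation angle is arbitrary): rotate standard Gaussian samples in R² by an orthogonal matrix and check that the covariance is still the identity.

```python
import numpy as np

rng = np.random.default_rng(3)
theta = 0.7
A = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])   # orthogonal: A @ A.T = Id

Z = rng.standard_normal((200_000, 2))
W = Z @ A.T                                       # rows are A z

print(np.cov(W, rowvar=False))                    # close to the identity
```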

4.3 Random walks


The random walks we discuss in this section are examples of random processes, which have many
applications, as well as intrinsic theoretical interest. The main goal of this section is to derive
limiting results for random walks, but we will also try to use this topic to explain potential future
directions for study in probability.

4.3.1 Setup and examples


Definition 4.19. Let X and (X_n)_{n≥1} be IID random variables, and define S_0 = 0 and S_n =
X_1 + . . . + X_n for n ≥ 1. Then the random process (S_0, S_1, S_2, . . .) is called a random walk (where
the distribution of the increments is that of X).
In this version of the definition, the increments are IID. Some of the results to be presented have
more general versions which demand independence but not identical distributions. Furthermore,
there are other classes of random processes (for example, Markov processes and martingales) which

Dominic Yeo dominic.yeo@kcl.ac.uk


Fundamentals of Probability Lecture notes 2022
KCL 6CCM341A Version: March 21, 2023

do not have independent increments, but demand other regularity properties that allow interesting
analysis. We will not explore this further in this course.
We now give several examples of random walks, and discuss unique features of each.
Example. Taking

    X_n := +1 with probability 1/2,  −1 with probability 1/2,

defines simple symmetric random walk on Z (SSRW).
In this setting, direct enumeration of all possible options is often the best way to study probabilities.
For example

    P(S_12 = 0) = (number of paths (0,0) to (12,0)) / 2^12 = (1/2^12) binom(12, 6),

and

    P(S_12 = 0, and S_1, . . . , S_11 ≥ 0) = C_6 / 2^12,

where C_6 is the 6th Catalan number.
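Both counts above are small enough to verify by brute force; the following sketch (my own, not from the notes) enumerates all 2^12 sign sequences.

```python
from itertools import product
from math import comb

paths = list(product([1, -1], repeat=12))          # all 2^12 SSRW paths

# Paths with S_12 = 0: should number binom(12, 6) = 924.
hits = sum(1 for p in paths if sum(p) == 0)

# Paths with S_12 = 0 and S_1, ..., S_11 >= 0: should number C_6 = 132.
def prefix_nonneg(p):
    s = 0
    for step in p[:-1]:                            # checks S_1, ..., S_11
        s += step
        if s < 0:
            return False
    return True

good = sum(1 for p in paths if sum(p) == 0 and prefix_nonneg(p))

print(hits, comb(12, 6))       # 924 924
print(good, comb(12, 6) // 7)  # 132 132  (C_6 = binom(12, 6)/(6 + 1))
```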
We generalise this example to biased random walk with

    X_n = +1 with probability p,  −1 with probability 1 − p,        (4.13)

for some probability p ∈ [0, 1]. The case p = 1/2 corresponds to SSRW. The drift µ = E[X_n]
satisfies

    µ = E[X_n]  < 0 when p < 1/2,   = 0 when p = 1/2,   > 0 when p > 1/2.        (4.14)

We might ask whether S_n →^{a.s.} +∞ or S_n →^{a.s.} −∞ (or neither) in each of these cases.

Example. Taking X_n ∼ Exp(λ) for some λ > 0 gives an example of an arrivals process, used to
describe the timings of events which occur in sequence. In this context, we describe the increments
of the random walk as “IID holding times”. Since the exponential distribution is memoryless (as
shown in (4.6)), it makes sense to describe such an arrivals process as having ‘rate λ’. Note that S_n
has the Gamma distribution Γ(n, λ), as in Definition 4.3.

Example. Taking X_n ∼ N(µ, σ²) gives a Gaussian random walk.^27 Note that for finite n, the
increments (X_1, . . . , X_n) can be viewed as a Gaussian random vector, and so (S_1, . . . , S_n) is also a
Gaussian random vector, since each value is a linear combination of the X_i s.
To simplify the calculation, assume µ = 0 and σ² = 1, so that E[S_k] = 0 for all k. Recall that
a Gaussian random vector is entirely defined by its mean and covariance matrix. So what is the
covariance matrix of (S_1, . . . , S_n)? We can compute the covariance of S_k and S_ℓ, with k ≤ ℓ, by
decomposing as follows:

    S_ℓ = (X_1 + . . . + X_k) + (X_{k+1} + . . . + X_ℓ),
^27 We mention now that this is an example of a Gaussian field (here in one dimension). Gaussian fields on higher
dimensional discrete spaces (like Z^d) can also be defined, as well as (with considerable complexities) in the continuum,
and these objects are the subject of active research interest at the forefront of modern probability theory.


where the two bracketed terms are independent, and the first term is equal to S_k. We obtain

    Cov(S_ℓ, S_k) = Cov(S_k, S_k) + Cov(S_ℓ − S_k, S_k) = Cov(S_k, S_k) = Var(S_k) = k.
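The covariance structure Cov(S_ℓ, S_k) = min(k, ℓ) can be checked empirically (a sketch; all names are mine):

```python
import numpy as np

rng = np.random.default_rng(4)
n, samples = 4, 200_000
X = rng.standard_normal((samples, n))   # increments, mu = 0, sigma^2 = 1
S = np.cumsum(X, axis=1)                # columns are S_1, ..., S_n

emp = np.cov(S, rowvar=False)
theory = np.minimum.outer(np.arange(1, n + 1), np.arange(1, n + 1))

print(np.round(emp, 2))   # approximately min(k, l) entrywise
print(theory)
```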

4.3.2 Limit results - statements


The three main results of this section are the following limit results for the ‘running average’ quantity
S_n/n. Readers may find it helpful to revisit the definitions of almost sure convergence and convergence
in distribution from Section 2.3.3. We will prove these results (under certain conditions) in Section
4.3.4.
For all the following results, we assume we are given a random walk (S0 , S1 , . . .) as in Definition
4.19 whose increments have mean µ = E [X1 ] < ∞.

Proposition 4.20 (Weak Law of Large Numbers (WLLN)). We have S_n/n →^d µ.

Note. Recall that convergence in distribution to a constant is equivalent to convergence in proba-
bility to the same constant, so sometimes WLLN is phrased as S_n/n →^P µ.

THEOREM 4.21 (Strong Law of Large Numbers (SLLN)). We have S_n/n →^{a.s.} µ.

Note. In general, convergence almost surely implies convergence in probability and in distribution
to the same limit, so SLLN implies WLLN, hence strong and weak.

THEOREM 4.22 (Central Limit Theorem (CLT)). Assume that the variance of the increments
σ² = Var(X_1) < ∞. Then

    (S_n − nµ)/√(nσ²) →^d N(0, 1).        (4.15)

It is helpful to think about the statement (4.15) of the CLT in three stages:

- The distribution of S_n is ‘concentrated’ on nµ (and this is the WLLN);
- The fluctuations of S_n around nµ have order √n;
- The exact distribution of the fluctuations of S_n around nµ is approximately normal.

4.3.3 Limit results - usage


Strong and weak laws of large numbers
Example. Another simple example of a random walk is to take X_n ∼ Bern(p), so that X_n = 1 with
probability p, and X_n = 0 with probability 1 − p. We can think of X_n as recording the ‘success’
of an experiment conducted repeatedly. Then, intuitively, the definition of the success probability
is likely to involve a statement like “the proportion of successful experiments will be p when the
number of experiments grows large”.
The SLLN makes this precise. In this context, S_n is then the number of successes in the first n
experiments, and so S_n/n is the proportion of successful experiments. The SLLN gives S_n/n →^{a.s.} p.
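This convergence is easy to watch in simulation (a sketch; the choice p = 0.3 and all names are mine):

```python
import numpy as np

rng = np.random.default_rng(5)
p, n = 0.3, 100_000
X = (rng.random(n) < p).astype(float)          # IID Bern(p) increments
running = np.cumsum(X) / np.arange(1, n + 1)   # S_k / k for k = 1, ..., n

print(running[99], running[9_999], running[-1])  # drifting towards p = 0.3
```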

Example. Consider biased random walk as in (4.13). Then SLLN gives S_n/n →^{a.s.} µ, where µ is
positive or negative as characterised by (4.14). In particular, if µ > 0, then S_n/n → µ implies
S_n → ∞, and so we have answered the question about how the limiting behaviour of the random
walk S_n itself (rather than S_n/n) depends on the drift. We have

    S_n →^{a.s.} +∞ when µ > 0 ⟺ p > 1/2,    S_n →^{a.s.} −∞ when µ < 0 ⟺ p < 1/2.

Example. Let X_n ∼ Exp(λ), with S_n viewed as the time of the nth arrival. Then SLLN gives

    S_n/n = (time of nth arrival)/n →^{a.s.} 1/λ.        (4.16)

One might instead wish to study

    N(t) = number of arrivals up to time t = max{n : S_n ≤ t}.

This process N(t) is generally called the Poisson process with rate λ, and in fact it can be shown
that N(t) has independent increments, and satisfies N(t) − N(s) ∼ Po(λ(t − s)). For now, we will
just give a sandwiching argument to show that N(t)/t →^{a.s.} λ.
Begin by noting that N(S_n) = n, and that S_{N(t)} ≤ t < S_{N(t)+1}. Consequently, one has

    N(t)/S_{N(t)+1} < N(t)/t ≤ N(t)/S_{N(t)},

that is,

    ( N(t)/(N(t)+1) ) × ( (N(t)+1)/S_{N(t)+1} ) < N(t)/t ≤ N(t)/S_{N(t)}.        (4.17)

As t → ∞, we have N(t) →^{a.s.} ∞, and so all of the following hold:

    N(t)/(N(t)+1) →^{a.s.} 1,    N(t)/S_{N(t)} →^{a.s.} λ,    (N(t)+1)/S_{N(t)+1} →^{a.s.} λ,

where inverting the converging quantities in (4.16) is valid since λ > 0. (There is no probability in
this statement, just a result in real analysis which is relevant in this context with probability 1.)
Combining these estimates, we see that both outer^28 quantities in (4.17) converge almost surely to
λ, and so N(t)/t also converges almost surely to λ.
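The conclusion N(t)/t → λ can be illustrated numerically (a sketch, not from the notes, with λ = 2; variable names are mine):

```python
import numpy as np

rng = np.random.default_rng(6)
lam, n = 2.0, 200_000
holding = rng.exponential(1.0 / lam, size=n)   # Exp(lam) holding times, mean 1/lam
S = np.cumsum(holding)                         # arrival times S_1, ..., S_n

t = 0.9 * S[-1]                                # a large time, safely below S_n
N_t = np.searchsorted(S, t, side='right')      # N(t) = #{k : S_k <= t}

print(N_t / t)                                 # close to lam = 2
```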

Central Limit Theorem


It is not unreasonable to think of the CLT as saying ‘S_n ≈^d N(nµ, nσ²)’. In more formal terms, the
CLT only gives us valid non-trivial limiting results for P(a ≤ S_n ≤ b) if a, b have the form

    a = a_n = nµ + z_a √(nσ²),    b = b_n = nµ + z_b √(nσ²),
^28 This is what makes it a sandwiching argument.


leading to

    P(a_n ≤ S_n ≤ b_n) = P( (a_n − nµ)/√(nσ²) ≤ (S_n − nµ)/√(nσ²) ≤ (b_n − nµ)/√(nσ²) )
                       = P( z_a ≤ (S_n − nµ)/√(nσ²) ≤ z_b )
                       → P(z_a ≤ Z ≤ z_b),

where Z ∼ N(0, 1).
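As a numerical illustration (a sketch, not from the notes: Bernoulli(1/2) increments, z_a = −1, z_b = 1), the probability above can be compared with P(−1 ≤ Z ≤ 1) ≈ 0.6827:

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(7)
n, trials = 1_000, 100_000
mu, sigma2 = 0.5, 0.25                       # mean / variance of Bern(1/2)

S_n = rng.binomial(n, 0.5, size=trials)      # samples of S_n
z = (S_n - n * mu) / np.sqrt(n * sigma2)     # CLT normalisation

emp = np.mean((z >= -1) & (z <= 1))
target = erf(1 / sqrt(2))                    # P(-1 <= Z <= 1) for Z ~ N(0, 1)

print(emp, target)                           # both near 0.68
```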

Example. If X_n ∼ N(µ, σ²) then in fact (S_n − nµ)/√(nσ²) ∼ N(0, 1) is true for every n. In other
words, the distributional property of the CLT holds in this case without taking a limit!

Example. The CLT can be applied jointly to, for example, (S_n, S_2n). The key observation here is
that (S_n, S_2n − S_n) are independent. For ease of notation, let us assume µ = 0, σ² = 1. Then,
applying the CLT in each coordinate gives

    ( S_n/√n , (S_2n − S_n)/√n ) →^d (Z_1, Z_2),

where Z_1, Z_2 are independent standard normals, which is equivalent to

    ( S_n/√n , S_2n/√n ) →^d (Z_1, Z_1 + Z_2).

This idea can be extended to show that for any sequence of reals t_1 < t_2 < . . . < t_k, one has

    ( S_{⌊t_1 n⌋}/√n , S_{⌊t_2 n⌋}/√n , . . . , S_{⌊t_k n⌋}/√n ) →^d ( W(t_1), W(t_2), . . . , W(t_k) ),

where (W(t_1), . . . , W(t_k)) is a Gaussian random vector with covariances given by Cov(W(t_j), W(t_k)) =
min(t_j, t_k). The case with t_i ∈ N is studied on the problem set.
This raises the question of whether there is a random continuous process W : [0, ∞) → R, and
whether one can make sense of the notion that ( S_{⌊tn⌋}/√n , t ≥ 0 ) converges to W. Defining the limit
process formally as Brownian motion is the next step, and is explored in other probability courses,
and used heavily in modelling, including in mathematical finance. The convergence of the rescaled
random walk to Brownian motion holds^29 in considerable generality.

4.3.4 Limit results - proofs


We will now prove the WLLN, SLLN and CLT. In all cases, we will need to impose some extra
conditions to make the proof work.
^29 Interested students should search for Donsker’s invariance principle or Donsker’s theorem to find further details.


Direct proofs
Proof of WLLN, under assumption σ² < ∞. Recall that the Weak Law of Large Numbers can be
viewed as a statement about convergence in probability, as well as its original form about conver-
gence in distribution. Chebyshev’s inequality (Corollary 3.18) gives us a tool to quantify this. We
have, for any ε > 0,

    P( |S_n/n − µ| > ε ) = P( |S_n − nµ| > nε ) ≤ Var(S_n)/(nε)² = nσ²/(n²ε²) = σ²/(nε²) → 0,

as n → ∞, as required.

Proof of SLLN, under assumption that E[X⁴] < ∞. Having imposed the fourth moment condition,
we would like to denote Y = X − µ and Y_n = X_n − µ. We must check that Y also satisfies the
fourth moment condition.
As a preliminary step, note that a finite fourth moment implies E[X], E[X²], E[X³] < ∞. To see
this, we have, for any k = 1, 2, 3,

    E[X⁴] < ∞ ⇒ E[X⁴ 1_{|X|≥1}] < ∞ ⇒ E[|X|^k 1_{|X|≥1}] < ∞ ⇒ E[|X|^k] < ∞.

Then

    E[(X − µ)⁴] = E[X⁴] − 4µ E[X³] + . . . + µ⁴ < ∞.

So, we may assume without loss of generality that µ = 0 (since E[Y] as constructed above is zero).
We now analyse

    E[S_n⁴] = E[(X_1 + . . . + X_n)⁴]
            = Σ_{i=1}^n E[X_i⁴] + 4 Σ_{i≠j} E[X_i³ X_j] + 12 Σ_{i,j,k distinct} E[X_i² X_j X_k] + . . . ,

with a sum over all combinations of monomials with total power equal to 4,

            = Σ_{i=1}^n E[X_i⁴] + 4 Σ_{i≠j} E[X_i³] E[X_j] + 12 Σ_{i,j,k distinct} E[X_i²] E[X_j] E[X_k] + . . .

and now all the terms with a factor E[X_i] vanish, since this expected value is zero,

            = Σ_{i=1}^n E[X_i⁴] + 6 Σ_{i<j} E[X_i²] E[X_j²] = C_4 n + 6 C_2² binom(n, 2).

We may conclude that there exists a constant C such that E[S_n⁴] ≤ Cn² for all n. From this, we
can study Σ_n (S_n/n)⁴ via its expectation. We obtain

    E[ Σ_{n=1}^∞ (S_n/n)⁴ ] ≤ Σ_{n≥1} C/n² < ∞.


Consequently, we know that the random variable Σ_{n≥1} (S_n/n)⁴ is finite with probability 1, and we
have the string of implications

    P( Σ_{n≥1} (S_n/n)⁴ < ∞ ) = 1  ⇒  P( (S_n/n)⁴ → 0 ) = 1  ⇒  P( S_n/n → 0 ) = 1,

which completes the proof of SLLN under the fourth-moment condition.

Proofs with MGFs


We have seen previously that MGFs determine distributions, in the sense that two random variables
whose MGFs are equal must have the same distribution. The following result shows that MGFs also
characterise convergence in distribution: a sequence of random variables whose MGFs converge
appropriately to the MGF of a given limiting distribution in fact satisfies convergence in distribution
towards that limit.

THEOREM 4.23 (Lévy’s continuity thm - MGF version). Let X, (X_n)_{n≥1} be random variables
(or distributions) with MGFs m_X(·), m_{X_n}(·), respectively. Assume that m_X(t) < ∞ for all t in
some interval (−ε, ε) around 0. Then, if m_{X_n}(t) → m_X(t) for all t ∈ (−ε, ε), we have X_n →^d X.
We will not prove this theorem in this course, but we will make sense of the conditions to apply it,
using the formalism of our earlier work. We would like to expand

    E[e^{tX}] = E[ 1 + tX + (t²/2!) X² + . . . ] = 1 + t E[X] + (t²/2!) E[X²] + . . . .        (4.18)

By the dominated convergence theorem (Theorem 3.11), we know that this is valid whenever
E[e^{|tX|}] < ∞.
For the purposes of using this to prove WLLN and CLT, it is useful to note:

- When X = µ a.s., that is P(X = µ) = 1, then m_X(t) = e^{tµ};
- When X ∼ N(0, 1), we have m_X(t) = exp(t²/2), as in (4.9).

We can now prove the CLT, subject to the condition that E[e^{|tX|}] < ∞. We will carry out a proof
of WLLN as a warm-up.

Proof of WLLN subject to the E[e^{|tX|}] < ∞ condition. We know how to write the MGF of S_n in terms
of the MGF of X, as m_{S_n}(t) = (m_X(t))^n. The goal is to study m_{S_n/n}, since this corresponds to the
sequence of RVs which is converging. We have

    m_{S_n/n}(t) = E[e^{t S_n/n}] = E[e^{(t/n) S_n}] = m_{S_n}(t/n) = ( m_X(t/n) )^n.

We can now expand as

    m_{S_n/n}(t) = ( 1 + µt/n + o(1/n) )^n → e^{µt}.

So by Theorem 4.23, we have S_n/n →^d µ.


Proof of CLT subject to the E[e^{|tX|}] < ∞ condition. By replacing X with (X − µ)/σ we may assume that
µ = 0 and σ² = 1. With this assumption, the linear term in (4.18) disappears, leaving

    m_X(t) = 1 + t²/2 + o(t²).

We will use this to study the MGF of S_n/√n:

    m_{S_n/√n}(t) = E[e^{t S_n/√n}] = E[e^{(t/√n) S_n}] = m_{S_n}(t/√n) = ( m_X(t/√n) )^n.

Here the relevant expansion is

    m_{S_n/√n}(t) = ( 1 + t²/2n + o(1/n) )^n → e^{t²/2},

which matches the MGF of N(0, 1). We conclude, using Theorem 4.23, that S_n/√n →^d N(0, 1).
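The convergence of these MGFs can be seen numerically (a sketch, not from the notes; I use increments uniform on [−√3, √3], which have µ = 0, σ² = 1, and the test point t = 0.5):

```python
import numpy as np

rng = np.random.default_rng(8)
n, trials, t = 256, 40_000, 0.5
X = rng.uniform(-np.sqrt(3.0), np.sqrt(3.0), size=(trials, n))  # mu = 0, sigma^2 = 1
W = X.sum(axis=1) / np.sqrt(n)               # samples of S_n / sqrt(n)

emp_mgf = np.mean(np.exp(t * W))             # empirical MGF at t
target = np.exp(t**2 / 2)                    # MGF of N(0, 1) at t

print(emp_mgf, target)                       # both near 1.133
```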
