FundProb Notes22
These notes are intended to supplement lectures for the course 6CCM341A Fundamentals of Prob-
ability delivered in the autumn semester September to December 2022 at King’s College London.
The style of these typed notes is intentionally concise. Further background and explanations of
steps and additional examples are given in the in-person lectures; the visualiser notes from lectures
are available on the KEATS page.
Some sections are based closely on notes by previous lecturers of this KCL course, especially Igor
Wigman and Kolyan Ray, with thanks.
Contents
1 Measure spaces
  1.1 Definitions and properties
    1.1.1 Probability spaces
    1.1.2 Measure spaces
  1.2 Independence
    1.2.1 Limits of sequences of events
  1.3 Generating σ-algebras and measures
    1.3.1 The Borel σ-algebra
    1.3.2 The Borel measure
    1.3.3 Non-measurable sets
4 Multivariate probability
  4.1 Multivariate distributions
    4.1.1 R²-valued measurable functions
    4.1.2 R²-valued random variables
    4.1.3 Transformations of multivariate distributions
    4.1.4 Conditional probability in the continuous setting
  4.2 Gaussian random variables
    4.2.1 IID normals in polar coordinates
    4.2.2 Gaussian random vectors - formalism
    4.2.3 Bivariate Gaussians
    4.2.4 Densities for general Gaussian random vectors
  4.3 Random walks
    4.3.1 Setup and examples
    4.3.2 Limit results - statements
    4.3.3 Limit results - usage
    4.3.4 Limit results - proofs
1 Measure spaces
1.1 Definitions and properties
In previous courses, we have met many examples of probability distributions and random variables,
and are comfortable performing various operations with these. For example, if X ∈ {1, 2, . . . , 6}
represents the outcome of a die, and U ∼ Unif[0, 1], we know that

P(X is not a multiple of 3) = 1 − P(X is a multiple of 3) = 2/3,

and

P(U ≥ 1/2 or U ∈ [1/3, 2/3]) = P(U ≥ 1/3) = 2/3.
Indeed, if we consider a more complicated ‘event’, such as A = [1/2, 1] ∪ [1/8, 1/4] ∪ [1/32, 1/16] ∪ ···, it is valid to conclude

P(U ∈ A) = P(U ∈ [1/2, 1]) + P(U ∈ [1/8, 1/4]) + ··· = 1/2 + 1/8 + 1/32 + ··· = 2/3.
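As a quick sanity check (an illustrative sketch, not part of the formal development, assuming numpy is available), P(U ∈ A) = 2/3 can be estimated by Monte Carlo simulation:

    import numpy as np

    rng = np.random.default_rng(0)
    u = rng.uniform(0, 1, size=1_000_000)

    # A is the union of intervals [1/2, 1], [1/8, 1/4], [1/32, 1/16], ...
    # ie [4**(-k)/2, 4**(-k)] for k = 0, 1, 2, ...
    in_A = np.zeros_like(u, dtype=bool)
    for k in range(30):  # 30 terms is ample at this resolution
        lo, hi = 0.5 * 4.0**(-k), 4.0**(-k)
        in_A |= (lo <= u) & (u <= hi)

    print(in_A.mean())  # close to 2/3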
At least initially, these intuitively clear rules cover all situations we are likely to meet in practice, especially in applied settings. But this approach leaves open the question:
Question: Does P (U ∈ A) make sense for every set A ⊆ [0, 1]?
In order to address questions such as these, we need to specify a minimal set of rules (known as
axioms) that probabilities should follow. We will then try to derive as many seemingly ‘intuitive’
rules as possible from this set of axioms.
Definition 1.1. A sample space, Ω, is a set of outcomes. Subsets of Ω are called events. Let F be
a collection of events. Then F is a σ-algebra if the following hold:
F1 the whole outcome space¹ Ω ∈ F;
F2 if A ∈ F then the complement A^c ∈ F also;
F3 for any countable² collection (A_n)_{n≥1} of events in F, the union ⋃_{n≥1} A_n ∈ F.
Example. For the die, one has Ω = {1, 2, . . . , 6}, with F = P(Ω), the power set of Ω (ie all subsets of Ω), and P(A) = |A|/6.

For the uniform distribution on [0, 1], we have Ω = [0, 1]. However, given the key question posed earlier, it is not clear what F should be. Nevertheless, we know that all intervals [a, b] should be included in F, and P([a, b]) = b − a for all 0 ≤ a ≤ b ≤ 1.
Note that F1–F3 also give closure under countable intersections: given (A_n)_{n≥1} in F, apply F2 to confirm A_n^c ∈ F, then F3 for their union ⋃_{n≥1} A_n^c. Finally, apply F2 to return to ⋂_{n≥1} A_n = (⋃_{n≥1} A_n^c)^c ∈ F.
¹ Given F2, we could equivalently take the first axiom to be F1': ∅ ∈ F, and in some steps later, it will be more convenient to work with F1'.
² Here, ‘countable’ means ‘finite or countably infinite’. See Problem Set 1 Q7 for discussion of reducing the countably infinite case to the finite case.
Definition 1.5. Let E be a set, and E a σ-algebra on E. The pair (E, E) will be referred to as a
measurable space. Then a function µ : E → [0, ∞] is a measure if it satisfies the following:
M1 µ(A) ≥ 0 for all A ∈ E (as before, this is also captured in the codomain of µ);
M2 µ(∅) = 0;
M3 the same countable additivity condition as P3, ie

µ(⋃_{n≥1} A_n) = ∑_{n≥1} µ(A_n),

for any countable collection of pairwise disjoint sets (A_n)_{n≥1} in E.
Proposition 1.6 (Countable sub-additivity). Given measure space (E, E, µ), let (An )n≥1 be a
sequence in E. Then
µ(⋃_{n≥1} A_n) ≤ ∑_{n≥1} µ(A_n). (1.2)
Proof. See lectures for motivation and figures. Define B_1 := A_1 and then for each n ≥ 2 in turn, B_n := A_n \ (A_1 ∪ ··· ∪ A_{n−1}). So the disjoint union ⋃ B_n = ⋃ A_n. In addition, we have B_n ⊆ A_n and so µ(B_n) ≤ µ(A_n). We conclude

µ(⋃_{n≥1} A_n) = µ(⋃_{n≥1} B_n) = ∑_{n≥1} µ(B_n) ≤ ∑_{n≥1} µ(A_n),

using M3 for the middle equality.
Proposition 1.7 (Continuity). Given measure space (E, E, µ), let (A_n)_{n≥1} be an increasing sequence in E, ie A_n ⊆ A_{n+1} for all n. Then lim_{n→∞} µ(A_n) exists in [0, ∞] and is equal to µ(⋃ A_n).

Proof. Consider B_n := A_n \ A_{n−1}, with B_1 := A_1. (This is a special case of the construction in the previous proof.) Then ⋃_{k=1}^n B_k = A_n and this is a disjoint union. Furthermore, ⋃_{n≥1} B_n = ⋃_{n≥1} A_n. So

µ(A_n) = ∑_{k=1}^n µ(B_k) → ∑_{k≥1} µ(B_k) = µ(⋃_{k≥1} B_k) = µ(⋃_{k≥1} A_k) as n → ∞.
1.2 Independence
We work with a probability space (Ω, F, P) as before. We have met the notion of independence of
events or random variables before, and may have some intuition about what it means. Roughly
speaking, two random objects are independent if the outcome of one ‘does not influence’ the outcome
of the other. This is formalised as follows.
Definition. Events A, B ∈ F are independent if

P(A ∩ B) = P(A)P(B). (1.3)

More generally, a countable collection of events (A_n)_{n≥1} is independent if for all finite collections of distinct indices i_1, . . . , i_k, we have

P(A_{i_1} ∩ ··· ∩ A_{i_k}) = ∏_{j=1}^k P(A_{i_j}). (1.4)
Example. Let Ω_n be the set of functions f : {1, . . . , n} → {1, . . . , n}, and P a uniform choice from Ω_n. Let A_i be the event {f ∈ Ω_n : f(i) = i}. Then P(A_i) = n^{n−1}/n^n = 1/n and

P(A_{i_1} ∩ ··· ∩ A_{i_k}) = n^{n−k}/n^n = 1/n^k = ∏_{j=1}^k P(A_{i_j}),
and so (Ai ) are independent events. This is no longer true if we take Ωn to be the set of permutations
on {1, . . . , n}.
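For intuition, the contrast between the two settings can be seen by simulation (an illustrative sketch; the sample sizes are arbitrary): for uniform functions the fixed-point events factorise, while for uniform permutations P(A_1 ∩ A_2) = (n−2)!/n! = 1/(n(n−1)) ≠ 1/n².

    import numpy as np

    rng = np.random.default_rng(1)
    n, trials = 6, 100_000

    # Uniform functions {1,...,n} -> {1,...,n}: are 1 and 2 fixed points?
    f = rng.integers(0, n, size=(trials, n))
    both_fn = np.mean((f[:, 0] == 0) & (f[:, 1] == 1))

    # Uniform permutations of {1,...,n}
    p = np.array([rng.permutation(n) for _ in range(trials)])
    both_perm = np.mean((p[:, 0] == 0) & (p[:, 1] == 1))

    print(both_fn, 1 / n**2)             # these agree: independence
    print(both_perm, 1 / (n * (n - 1)))  # (n-2)!/n! = 1/30, not 1/36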
Example. It is important that (1.4) holds for all collections i1 , . . . , ik . For example, given three
events A, B, C, it is not sufficient to check that P(A ∩ B ∩ C) = P(A)P(B)P(C).
Example. Let Ω = {H, T}², denoting flipping a coin twice, with each outcome having probability 1/4. Consider the following events:

A = {first coin H}, B = {second coin H}, C = {coins give same outcome}.

Then A, B, C are pairwise independent, since each pairwise intersection has probability 1/4 = (1/2)², but they are not independent: P(A ∩ B ∩ C) = 1/4 ≠ P(A)P(B)P(C) = 1/8.

Note also that if A, B are independent, then so are A and B^c, since

P(A ∩ B^c) = P(A) − P(A ∩ B) = P(A) − P(A)P(B) = P(A)[1 − P(B)] = P(A)P(B^c),

as required.
See Problem Set 2.
Given a sequence of events (A_n)_{n≥1}, define

{A_n i.o.} := ⋂_{n≥1} ⋃_{k≥n} A_k, {A_n eventually} := ⋃_{n≥1} ⋂_{k≥n} A_k.

Here ‘holds infinitely often’ means ‘holds for infinitely many n’; and ‘holds eventually’ means ‘is eventually always true’, or ‘holds for all n ≥ N, for some N’ or ‘holds for all but finitely many n’. (Sometimes these events are also called lim sup A_n and lim inf A_n, respectively, but we will avoid this terminology.) Note that

⋂_{n≥1} A_n ⊆ {A_n eventually} ⊆ {A_n i.o.} ⊆ ⋃_{n≥1} A_n.
Proposition 1.10 (Borel–Cantelli Lemmas). Consider a probability space (Ω, F, P) and a sequence
of events (An )n≥1 .
BC1) If ∑_{n≥1} P(A_n) < ∞, then P(A_n i.o.) = 0.

BC2) If ∑_{n≥1} P(A_n) = ∞ and the events (A_n) are independent, then P(A_n i.o.) = 1.
Note. For BC2, the independence condition cannot simply be dropped. See Problem Set 2 for discussion.
Proof. BC1) Since {A_n i.o.} ⊆ ⋃_{k≥n} A_k for every n, we have

P(A_n i.o.) ≤ ∑_{k≥n} P(A_k) → 0 as n → ∞,

since ∑ P(A_n) < ∞, so the tails of the series vanish. So P(A_n i.o.) = 0.
BC2) It is equivalent to show that P(A_n^c eventually) = 0. To do this, we study the intersections of the complements, whose probabilities are tractable using independence:

P(⋂_{k≥n} A_k^c) = ∏_{k≥n} P(A_k^c) = ∏_{k≥n} (1 − P(A_k)) ≤ ∏_{k≥n} exp(−P(A_k)) = exp(−∑_{k≥n} P(A_k)),

using 1 − x ≤ e^{−x}. Recalling that ∑_{k≥n} P(A_k) = ∞ for each n by assumption,

P(⋂_{k≥n} A_k^c) = exp(−∞) = 0.

But note that (⋂_{k≥n} A_k^c)_{n≥1} is an increasing sequence of events. So by Proposition 1.7, we have

P({A_n^c eventually}) = P(⋃_{n≥1} ⋂_{k≥n} A_k^c) = lim_{n→∞} P(⋂_{k≥n} A_k^c) = 0.
One of the most famous applications of the Borel–Cantelli lemmas is the following popular result.
Proposition 1.11 (Infinite monkey theorem). An immortal monkey hits keys on a typewriter
repeatedly, both uniformly at random, and independently. Then almost surely, the monkey will at
some stage write Hamlet³.

Proof. Let K denote the number of letters (including spaces and punctuation) in Hamlet, let T denote the number of keys on the typewriter, and let X_1, X_2, . . . denote the keys struck by the monkey. Define the event

A_n := {(X_{Kn+1}, X_{Kn+2}, . . . , X_{K(n+1)}) is the text of Hamlet}.

Then P(A_n) = T^{−K} for all n and (A_n) are independent. So applying BC2 gives the result!
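On any human timescale the waiting time for Hamlet itself is astronomically long, but the same argument can be watched by simulation for a short target over a small alphabet (an illustrative sketch; the alphabet and target word are my own choices, not from the notes):

    import numpy as np

    rng = np.random.default_rng(2)
    alphabet = np.array(list("ab"))
    target = "abba"            # plays the role of Hamlet; K = 4, T = 2
    K, T = len(target), len(alphabet)

    n_blocks = 200_000
    keys = rng.integers(0, T, size=(n_blocks, K))
    blocks = ["".join(alphabet[row]) for row in keys]

    hits = sum(b == target for b in blocks)
    print(hits / n_blocks, T**-K)  # empirical vs P(A_n) = T^{-K} = 1/16
    # Since sum_n P(A_n) diverges and the blocks are independent, BC2
    # says the target appears in infinitely many blocks almost surely.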
Example (For interested readers). Consider a sequence of random permutations (σ_n)_{n≥1} as follows: σ_1 = (1); then for each n ≥ 2, create σ_n from σ_{n−1} by inserting element n at a uniformly-chosen position in σ_{n−1}. We would like to prove that P(first element changes i.o.) = 1. But this follows from BC2, since the relevant events are independent and

∑_{n≥2} P(first element changes σ_{n−1} ↦ σ_n) = ∑_{n≥2} 1/n = ∞.

Consider instead a biased version of this insertion dynamic with parameter q ∈ (0, 1), under which the probability that the first element changes at step n is at most (1 − q)^{n−2}. Since ∑_{n≥2} (1 − q)^{n−2} < ∞, this shows that almost surely the first element changes only finitely often, by BC1.
(The first example is consistent with the idea that one cannot generate a uniform permutation on
N. The second example can be extended to show that every element changes only finitely often,
and so it is consistent to extend to a random permutation σ∞ on N. This Mallows permutation is
one of the main models for random permutations on an infinite set.)
³ Hamlet is one of the most famous (and longest) plays by William Shakespeare.
Definition 1.12. Consider E any set, and A a collection of subsets of E. Then the σ-algebra generated by A is the intersection of all σ-algebras on E which contain A, that is

σ(A) := ⋂ {E′ : E′ a σ-algebra on E with A ⊆ E′}. (1.5)
Note. Since P(E) is a σ-algebra containing A, the intersection in (1.5) is non-empty. However, it
is not obvious that σ(A) as defined in (1.5) is a σ-algebra (!), but this follows as a result of the
following lemma.
Lemma 1.13. Let (E_i)_{i∈I} be a family of σ-algebras on E (where I is an arbitrary index set). Then

E := ⋂_{i∈I} E_i := {A ⊆ E : A ∈ E_i ∀i ∈ I}

is a σ-algebra on E.
Example. When A = {A} consists of a single event, then any σ-algebra containing A must also contain {∅, A, A^c, E}, and this collection is itself a σ-algebra. So σ(A) = {∅, A, A^c, E}.
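On a finite set E, the generated σ-algebra can be computed by brute-force closure under complements and (finite, here equal to countable) unions. The following sketch (illustrative only; events are encoded as frozensets and the helper name is my own) recovers {∅, A, A^c, E} from the example above.

    from itertools import combinations

    def generated_sigma_algebra(E, A):
        """Close A under complement and union until a fixed point."""
        sets = {frozenset(), frozenset(E)} | {frozenset(a) for a in A}
        while True:
            new = {frozenset(E) - s for s in sets}
            new |= {s | t for s, t in combinations(sets, 2)}
            if new <= sets:
                return sets
            sets |= new

    E = {1, 2, 3, 4}
    result = generated_sigma_algebra(E, [{1, 2}])
    for s in sorted((sorted(s) for s in result), key=lambda t: (len(t), t)):
        print(s)
    # [], [1, 2], [3, 4], [1, 2, 3, 4]  ie  {∅, A, A^c, E}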
Example. Consider a uniform choice from Ω = {H, T }4 , corresponding to tossing a coin four times.
Let Ai := {ith coin is H}. Then
σ({A_1, A_2}) = {all events determined by the first two outcomes}. (1.6)
⁴ See later in the course for a (non-examinable) discussion of these!
we have A ⊆ σ(A′) and A′ ⊆ σ(A), respectively. Applying Lemma 1.14, we find that σ(A′) = σ(A) = B(R), so the open intervals also generate the Borel σ-algebra on R.
Definition 1.15. There is a unique measure µ on (R, B(R)) such that µ((a, b]) = b − a, and this
is called the Borel measure on R.
Lemma 1.16. Whenever A ⊆ R is Borel, then A + x is Borel also, and µ(A + x) = µ(A).
Now, recall that the reals form an (Abelian) group under addition, denoted (R, +). One may
consider the rationals Q as a (normal) subgroup of (R, +). In particular, we may consider the
quotient group R/Q of cosets of Q in R, whose elements are sets (Ri , i ∈ I), where each has the
form Ri = ri + Q.
Now consider the set R := {r_i : i ∈ I} of coset representatives (one from each coset), which we claim is not Borel-measurable. The details of this and the previous paragraph are (obviously) not examinable, and were presented semi-formally in the lecture.
⁶ A π-system is a collection of sets which includes the empty set and which is preserved under (finite) intersections.
⁷ Roughly speaking, you need to apply the operations infinitely often; then apply the operations infinitely often; and so on. This procedure is called transfinite induction. It is also rather difficult to give an example of a set which is Borel, but can't be constructed by regular induction from A.
Proof. It is sufficient to study the case where A is open: if A is closed, then A^c is open, and so the result follows as B(R) is a σ-algebra, and so preserved under taking complements.

For every q ∈ A ∩ Q, let I_q be the largest open interval with q ∈ I_q ⊆ A. Since A is open, every x ∈ A lies in some such I_q, and so A = ⋃_{q∈A∩Q} I_q is a countable union of open intervals, giving A ∈ B(R).
A general definition of continuity of functions is that f is continuous if for all A open, f^{−1}(A) is open. This motivates the definition of measurable functions: given a measurable space (E, E), a function f : E → R is measurable if f^{−1}(A) ∈ E for all A ∈ B(R).

In practice, it is not necessary to check f^{−1}(A) for all A ∈ B(R), but just for the intervals that generate the σ-algebra.
Lemma 2.3. Given measurable spaces (E, E) and (R, B(R)), a function f : E → R is measurable
if and only if f −1 ((−∞, a]) ∈ E for all a ∈ R.
Proof. Omitted.
Example. Consider f : R → R given by f(x) = 1/x for x ≠ 0, and f(0) = 0. Then

f^{−1}((−∞, a]) = [1/a, 0) for a < 0; (−∞, 0] for a = 0; (−∞, 0] ∪ [1/a, ∞) for a > 0.

In each case the preimage lies in B(R), so f is measurable by Lemma 2.3.
Proof. We know that f^{−1}((−∞, a)) is open since f is continuous. By Lemma 2.1, this implies f^{−1}((−∞, a)) ∈ B(R), which is sufficient.
Proof. 1) It is sufficient to show that {x ∈ E : f(x) + g(x) < a} ∈ E for all a ∈ R. Note that

{x : f(x) + g(x) < a} = ⋃_{q∈Q} ({x : f(x) < q} ∩ {x : g(x) < a − q}).

All events in this decomposition are in E, which is preserved under intersections and countable unions, so the event of interest is in E also, as required.
2) See the problem set.
3) It is sufficient to show that {x ∈ E : sup_n f_n(x) ≤ a} ∈ E. But this event is equal to ⋂_n {f_n(x) ≤ a}. Since each event in this intersection is in E, so is the intersection, as required. The argument for inf_n f_n is similar.

4) In particular, when lim f_n exists, we have lim f_n = lim sup f_n = lim inf f_n. For each n, we know that sup_{m≥n} f_m is measurable by the sup part of 3) above. And so inf_n sup_{m≥n} f_m is measurable by the inf part of 3) above. This shows that lim_{n→∞} f_n is measurable.
5) Let A ∈ B(R). Then g^{−1}(A) is Borel since g is measurable, and so (g ∘ f)^{−1}(A) = f^{−1}(g^{−1}(A)) ∈ E since f is measurable.
Example. For Ω = {H, T }4 corresponding to tossing a coin four times, let X denote the total
number of heads.
The idea is that the random variable X is defined within the broader probability space. We have
seen in previous courses that X has the binomial distribution, and could easily be defined on its
own probability space (eg with Ω = {0, 1, 2, 3, 4}) but this is not necessary.
Note. By default, random variables take real values. However, given a function X : Ω → E
measurable with respect to (E, E), we can refer to X as an E-valued random variable.
⁹ Recall also that the advantage of lim inf and lim sup is that they always exist (in [−∞, +∞] etc), whereas not all sequences have limits.
Definition 2.7. Given a random variable X on probability space (Ω, F, P), the distribution function
of X is a function FX : R → [0, 1] defined by
F_X(x) = P(X ≤ x) = P(X^{−1}((−∞, x])). (2.1)
Since {(−∞, x] : x ∈ R} generates B(R), the probability measure P_X is completely determined by F_X. If two random variables X and Y have the same distribution function, ie F_X = F_Y, then we say X and Y are equal in distribution, denoted X =^d Y.
Example. Consider the probability space ([0, 1], B([0, 1])) with the Borel measure. Then, taking
the function U : [0, 1] → [0, 1] given by U (x) = x, we end up with U a random variable matching
the Uniform [0, 1] definition we gave earlier. We have distribution function
F_U(x) = P(U ≤ x) = 0 for x ≤ 0; x for x ∈ [0, 1]; 1 for x ≥ 1.
We have included all the cases for completeness’ sake, but this is not generally necessary if context
makes it clear what the meaningful range of the RV is.
In this setting, U² is also a random variable, with distribution function

F_{U²}(x) = P(U² ≤ x) = P(U ≤ √x) = √x, x ∈ [0, 1].
Proposition 2.8. For a general random variable X, the distribution function F_X is non-decreasing, and satisfies lim_{x→−∞} F_X(x) = 0 and lim_{x→+∞} F_X(x) = 1. Furthermore, F_X is right-continuous.
Proof. The events {X ≤ x} are increasing with x, hence F_X is non-decreasing. The limits are established on the Problem Set.

For right-continuity, note that F_X(x + 1/n) − F_X(x) = P(X ∈ (x, x + 1/n]). Since ⋂_n (x, x + 1/n] = ∅, it follows from continuity of P (or more directly from P_X) that

lim_{n→∞} [F_X(x + 1/n) − F_X(x)] = lim_{n→∞} P(X ∈ (x, x + 1/n]) = P(∅) = 0.

Since F_X is non-decreasing, the limit F_X(x + 1/n) → F_X(x) as n → ∞ induces the more general limit F_X(z) → F_X(x) as z ↓ x through the reals.
Definition 2.9. A countable collection of random variables (X_n)_{n≥1} is independent if for all finite collections of distinct indices i_1, . . . , i_k, and for all x_1, . . . , x_k ∈ R, we have

P(X_{i_1} ≤ x_1, . . . , X_{i_k} ≤ x_k) = ∏_{j=1}^k P(X_{i_j} ≤ x_j). (2.2)
As in the case of independent events, it is not true that pairwise independence implies independence
(See Problem Set).
Example. If (An )n≥1 are independent events, then the indicator functions (1An ) are independent
random variables.
In fact, if (X_n)_{n≥1} are independent¹⁰, then for any Borel sets A_1, . . . , A_k, we have

P(X_{i_1} ∈ A_1, . . . , X_{i_k} ∈ A_k) = ∏_{j=1}^k P(X_{i_j} ∈ A_j). (2.3)
Definition 2.10. When a collection of random variables (Xn )n≥1 is independent, and all Xn are
equal in distribution (see Definition 2.7) we say (Xn ) are independent and identically distributed,
abbreviated IID. This property applies to a number of settings, including repeatedly tossing a coin,
or sampling a few members from a large population.
Definition 2.11. Let X, (X_n)_{n≥1} be random variables, with distribution functions F_X, F_{X_n}. Then X_n converges in distribution to X if F_{X_n}(x) → F_X(x) for all x ∈ R such that F_X is continuous at x. This is denoted X_n →^d X.

Example. Let U_n be uniform on {1, 2, . . . , n}, and U uniform on [0, 1]. Then (1/n)U_n →^d U as n → ∞, since for x ∈ [0, 1],

F_{U_n/n}(x) = P(U_n ≤ nx) = ⌊nx⌋/n → x, as n → ∞.
Example. Let (a_n)_{n≥1} be a real-valued sequence such that a_n → a ∈ R as n → ∞. Define the ‘random variables’ X_n such that P(X_n = a_n) = 1, and X such that P(X = a) = 1. Then X_n →^d X. Note that this would fail without the restriction of x to points of continuity of F_X in Definition 2.11.
¹⁰ In contrast to previous definitions, the weaker form (2.2) is normally given as the definition of independence, with the general form (2.3) following.
Definition. Given measurable functions f_n, f : E → R on a measure space (E, E, µ), we say f_n converges to f almost everywhere if

µ({x ∈ E : f_n(x) ↛ f(x)}) = 0. (2.4)

That is, f_n(x) → f(x) for all x ∈ E \ A, where µ(A) = 0. So if f_n converges to f pointwise, then it also converges to f almost everywhere (sometimes abbreviated a.e.).

We say f_n converges to f in measure if, for every ε > 0,

µ({x ∈ E : |f_n(x) − f(x)| > ε}) → 0 as n → ∞. (2.5)

That is, for large n, f_n is uniformly close to f, except on a set of small measure¹¹. So if f_n converges to f uniformly, then it also converges to f in measure.
While using these notions of convergence can be advantageous since they are slightly more general
than pointwise/uniform, we note that the limits are not unique under this convergence!
Proposition 2.14. Let f, g be two measurable functions E → R such that µ({x ∈ E : f(x) ≠ g(x)}) = 0. Suppose that f_n → f almost everywhere (or in measure). Then f_n → g almost everywhere (or, respectively, in measure).

Proof. For almost everywhere convergence, note that {f_n ↛ g} ⊆ {f_n ↛ f} ∪ {f ≠ g}, and so

µ(f_n ↛ g) ≤ µ(f_n ↛ f) + µ(f ≠ g) = 0 + 0 = 0.

The argument for convergence in measure is similar (see the Problem Set).
¹¹ It is standard to abbreviate the notation in (2.4) as µ(f_n ↛ f) and in (2.5) as µ(|f_n − f| > ε), which some readers might find more clear.
Example. A general example is when f_n = 1_{A_n} for (A_n)_{n≥1} some sequence in E. Suppose A_1 ⊆ A_2 ⊆ . . ., with ⋃_n A_n = A. Then f_n → 1_A pointwise, and so also almost everywhere.
Proposition 2.15. Suppose that µ(E) < ∞. Then f_n → f almost everywhere implies f_n → f in measure.

Proof. Fix ε > 0 and define A_N := {x ∈ E : |f_n(x) − f(x)| > ε for some n ≥ N}. These sets are decreasing in N, with ⋂_N A_N ⊆ {x : f_n(x) ↛ f(x)}, and by assumption, this RHS set has measure zero. Since µ(E) < ∞, the decreasing version of continuity (cf Proposition 1.7) applies, and therefore µ(A_N) → 0 as N → ∞. But

µ(x : |f_N(x) − f(x)| > ε) ≤ µ(x : |f_n(x) − f(x)| > ε, for some n ≥ N) = µ(A_N) → 0,

and so f_n → f in measure holds too.
A quick summary of the relationships between these modes of convergence is the following:

X_n → X a.s. ⇒ X_n →^P X ⇒ X_n →^d X. (2.8)
Shortly, we will prove these relations, and discuss converses. First, we give some examples of almost
sure convergence. Later, we will see some examples of sequences that converge in probability, but
not almost surely.
Example. Let X, X_1, X_2, . . . be IID random variables with finite mean¹² E[X]. Then the Strong Law of Large Numbers asserts that (1/n)(X_1 + . . . + X_n) →^{a.s.} E[X]. A special case of this is when X has the Bernoulli distribution. We say X ∼ Bern(p) if X = 1 with probability p, and X = 0 with probability 1 − p. Then X_1, X_2, . . . corresponds to a sequence of biased coin tosses, and the SLLN states that the proportion of heads converges to p with probability 1.
Example. Real numbers between 0 and 1 can be written in binary, for example as 0.1101000110 · · · .
It feels reasonable that choosing the digits by tossing a coin should generate a uniform random
number in [0, 1]. To formalise this, let X1 , X2 , . . . be IID Bern(1/2) RVs as in the previous example.
We would like to define
U = (1/2)X_1 + (1/4)X_2 + (1/8)X_3 + ··· , (2.9)

but it is not immediately clear that this is well-defined (let alone has the uniform distribution!). To make it more clear, we define

U_n = (1/2)X_1 + (1/4)X_2 + ··· + (1/2^n)X_n,
and show instead that U_n converges almost surely. Recall from analysis courses the definition of a Cauchy sequence¹³. By our construction, the sequence (U_1, U_2, . . .) is Cauchy with probability 1, and so P(U_n converges) = 1, and we may define U = lim_n U_n, for which U_n →^{a.s.} U.

It is easy to check that for k ∈ N, and a = 1, 2, . . . , 2^k, we have P(U_n < a/2^k) = a/2^k whenever n ≥ k. This implies that P(U_n ≤ x) → x for all x ∈ [0, 1], and so U_n →^d Unif[0, 1]. In other words, U in (2.9) is well-defined, and has the Uniform distribution on [0, 1].
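A quick illustrative simulation (a sketch, not part of the formal argument; the number of bits and sample size are arbitrary): building U_n from coin flips and comparing its empirical distribution to Unif[0, 1].

    import numpy as np

    rng = np.random.default_rng(3)
    n_bits, samples = 30, 100_000

    # X_k are IID Bern(1/2); U_n = sum_k X_k / 2^k
    bits = rng.integers(0, 2, size=(samples, n_bits))
    weights = 0.5 ** np.arange(1, n_bits + 1)
    U = bits @ weights

    # compare P(U <= x) with x at a few points
    for x in [0.1, 0.25, 0.5, 0.9]:
        print(x, (U <= x).mean())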
Proposition 2.18. If the sequence of random variables Xn converges to X almost surely, then it
converges in probability also.
Proof. The probability space (Ω, F, P) satisfies the condition P(Ω) = 1 < ∞, so the result follows directly as a special case of Proposition 2.15! (A proof in the language of probability spaces was given in lectures.)
¹² Which will be defined in the formal language of this course shortly, but the informal definition of previous courses is perfectly fine here.
¹³ A sequence of real numbers (a_n) whose terms become arbitrarily close to one another is called Cauchy. More formally, for all ε > 0, there exists N such that |a_n − a_m| < ε whenever n, m ≥ N. Any real-valued Cauchy sequence converges to a limit. The advantage of this definition (including in our application here) is that one can verify that a sequence converges without knowing what the limit is!
Proposition 2.19. If X_n → X in probability, then X_n →^d X.

Proof. We must prove that F_{X_n}(x) → F_X(x) whenever x is a point of continuity of F_X. Fix any ε > 0, and note that if X_n ≤ x then either X ≤ x + ε or |X_n − X| > ε holds. So

F_{X_n}(x) ≤ P(X ≤ x + ε) + P(|X_n − X| > ε). (2.10)

The second term on the RHS vanishes as n → ∞, and so lim sup F_{X_n}(x) ≤ P(X ≤ x + ε) = F_X(x + ε). Taking ε → 0, and using continuity of F_X at x, we then have lim sup F_{X_n}(x) ≤ F_X(x).

Analogously to (2.10), we also have

F_{X_n}(x) ≥ P(X ≤ x − ε) − P(|X_n − X| > ε),

and an identical argument leads to lim inf F_{X_n}(x) ≥ F_X(x). So lim F_{X_n}(x) = F_X(x) as required.
The partial converse for the case of deterministic limit is addressed on the Problem Set.
Example. Convergence in probability requires all RVs Xn to be defined on the same probability
space as the limit X, whereas convergence in distribution does not. But even if they are defined on
the same probability space, the converse implication does not hold. Consider RVs X, Y with the
same distribution, satisfying P(X = Y ) < 1. Then the sequence (X, X, X, . . .) converges to Y in
distribution, but not in probability.
Example. Suppose X_n are independent RVs, taking values in {0, 1} with P(X_n = 1) = 1/n. Then by the second Borel–Cantelli lemma, X_n does not converge a.s. to zero, but X_n →^P 0.
This effect is captured more generally by the following lemma, which can be a useful practical test
for convergence to a constant of a sequence of RVs.
Lemma 2.20. Suppose that for every ε > 0,

∑_{n≥1} P(|X_n| > ε) < ∞. (2.11)

Then X_n → 0 almost surely.

Proof. We apply the first Borel–Cantelli lemma to (2.11), and conclude that

P(|X_n| > ε for infinitely many n) = 0, or, equivalently, P(|X_n| ≤ ε eventually) = 1.

Note that an intersection over all ε > 0 is not countable. So take a sequence ε_1 > ε_2 > . . . > 0 such that ε_k → 0 as k → ∞. Note that X_n → 0 iff {|X_n| ≤ ε_k eventually} holds for all k ≥ 1, and note that these events are decreasing in k. So

P(X_n → 0) = P(⋂_{k≥1} {|X_n| ≤ ε_k eventually}) = lim_{k→∞} P(|X_n| ≤ ε_k eventually) = 1,

as required.
We have a partial converse, provided the Xn s are independent. This case might seem rather patho-
logical, since convergence in probability and almost surely apply to the setting where all RVs are
defined on the same probability space (normally in a more interesting way than just independently).
However, this situation does come up in some applications.
Lemma 2.21. Suppose that the sequence (X_n)_{n≥1} is independent, and X_n → 0 almost surely. Then, for every ε > 0,

∑_{n≥1} P(|X_n| > ε) < ∞.
The set of simple functions will be denoted S(E). Given a simple function f as in (3.1), we define the integral of f as

∫_E f dµ = ∑_{i=1}^k a_i µ(A_i). (3.2)
Note that f could have multiple representations as (3.1), and so it is not immediate that (3.2) is
well-defined. However, this can be verified by reducing to the case where the Ai s are disjoint, which
is discussed on the Problem Set.
αf + βg = ∑_{i=1}^k (αa_i) 1_{A_i} + ∑_{j=1}^ℓ (βb_j) 1_{B_j},

which satisfies the definition (3.1), and the equality of integrals follows immediately.
ii) As noted in Definition 3.1, we may assume without loss of generality (and with considerable
convenience!) that the Ai s are disjoint, and that the Bj s are disjoint. It is helpful to introduce
A_0 = E \ (A_1 ∪ ··· ∪ A_k), B_0 = E \ (B_1 ∪ ··· ∪ B_ℓ),
Definition 3.3. Let f : E → R be a measurable function taking non-negative values. Then the integral of f is defined as

∫_E f dµ := sup { ∫_E g dµ : g ∈ S(E), 0 ≤ g ≤ f }. (3.3)

Note that the g on the RHS are simple functions which are bounded above by f.
We now verify briefly that any non-negative measurable function can be well-approximated by
simple functions.
Lemma 3.4. Let f : E → R be a measurable function taking non-negative values. Then there
exists an increasing sequence (fn )n≥1 of simple functions such that the monotone limit fn ↑ f holds
a.e. as n → ∞.
Proof. Define

f_n(x) = k/2^n when k/2^n ≤ f(x) < (k+1)/2^n, for some k = 0, 1, . . . , 2^{2n} − 1; and f_n(x) = 2^n when f(x) ≥ 2^n.

That is, whenever f(x) < 2^n, we construct f_n(x) by ‘rounding down’ f(x) to the nearest multiple of 2^{−n}. Each f_n takes only finitely many values so is simple. It is clear that f_{n+1}(x) is equal either to f_n(x) or to f_n(x) + 1/2^{n+1}, and so f_n ↑ f.
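This dyadic approximation is easy to visualise numerically; the following sketch (illustrative, with an arbitrary sample function) implements f_n and shows the monotone convergence.

    import numpy as np

    def f_n(f_vals, n):
        """Round f down to the grid of multiples of 2^-n, capped at 2^n."""
        capped = np.minimum(f_vals, 2.0**n)
        return np.where(f_vals >= 2.0**n, 2.0**n,
                        np.floor(capped * 2**n) / 2**n)

    x = np.linspace(0, 10, 7)
    f_vals = np.exp(x / 2)   # an arbitrary non-negative measurable function
    for n in [1, 2, 4, 8]:
        print(n, f_n(f_vals, n))  # increases towards f_vals as n grows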
It is not clear why integration and (monotone or otherwise) convergence should commute. The following key theorem shows that they do.

THEOREM 3.5 (Monotone convergence theorem). Let (f_n)_{n≥1} and f be measurable functions E → R, all taking non-negative values, and satisfying f_n ↑ f a.e. Then ∫_E f_n dµ ↑ ∫_E f dµ holds, both when the limit is finite and infinite.
Example. To see that monotonicity is essential, consider fn = n1[0,1/n] , for which the integral over
R is always 1, but the function converges a.e. to 0.
The MCT is a very useful tool in studying integrals of measurable functions. For now, we will use
it to lift the results of Proposition 3.2 to the more general case.
Proof. To show part i) we approximate f, g by f_n, g_n ∈ S(E), for example as given by Lemma 3.4. Then, we have by Proposition 3.2,

∫_E (αf_n + βg_n) dµ = α ∫_E f_n dµ + β ∫_E g_n dµ,

and since αf_n + βg_n is also an increasing sequence in n which converges a.e. to αf + βg as n → ∞, we may apply MCT and take a monotone limit of both sides to conclude

∫_E (αf + βg) dµ = α ∫_E f dµ + β ∫_E g dµ.
Definition 3.7. Let f : E → R be a measurable function. Define the positive and negative parts of f by f^+ := max(f, 0) and f^− := max(−f, 0). Here, the functions f^+, f^− are both measurable and take non-negative values. Indeed, we have f = f^+ − f^− and |f| = f^+ + f^−.

Definition 3.8. We say a measurable function f : E → R is integrable if ∫_E f^+ dµ < ∞ and¹⁵ ∫_E f^− dµ < ∞. We then define the integral of f to be

∫_E f dµ = ∫_E f^+ dµ − ∫_E f^− dµ.
Example. In the previous example, it was crucial that the range of integration was infinite. In general, if a function f is Riemann-integrable on a finite interval [a, b], then f is Lebesgue integrable on¹⁶ [a, b]. This is explored in more detail on the problem set.
Proof. The main step for part i) is Lemma 3.10, which is stated and proved below. The remainder of the argument is addressed on the problem set.
For part ii), study g − f , which is non-negative and use Proposition 3.6 ii), and the linearity result
proved in part i) to convert the result to the required form.
Part iii) is addressed on the problem set.
Lemma 3.10. Let f, g be measurable non-negative functions E → R with ∫_E f dµ, ∫_E g dµ < ∞. Then f − g is integrable, and

∫_E (f − g) dµ = ∫_E f dµ − ∫_E g dµ.

Proof. To prove f − g is integrable, use the triangle inequality and Proposition 3.6 ii):

∫_E |f − g| dµ ≤ ∫_E (|f| + |g|) dµ = ∫_E f dµ + ∫_E g dµ < ∞.
Now, we introduce the following sets to characterise where f − g is positive and negative:

A^+ = {x : f(x) ≥ g(x)}, A^− = {x : f(x) < g(x)}.

Then (f − g)^+ = (f − g) 1_{A^+}. We now apply linearity for addition of non-negative functions to (f − g) 1_{A^+} and g 1_{A^+}, obtaining

∫_E f 1_{A^+} dµ = ∫_E [(f − g) 1_{A^+} + g 1_{A^+}] dµ = ∫_E (f − g) 1_{A^+} dµ + ∫_E g 1_{A^+} dµ.
¹⁶ This terminology is not formal. Strictly speaking, we should say that f 1_{[a,b]} is integrable (on R).
Similarly,

∫_E (f − g)^− dµ = ∫_E g 1_{A^−} dµ − ∫_E f 1_{A^−} dµ.

Returning to ∫ (f − g) and combining these results, we obtain (with informal, abbreviated notation)

∫ (f − g) = ∫ (f − g)^+ − ∫ (f − g)^− = ∫ f 1_{A^+} + ∫ f 1_{A^−} − ∫ g 1_{A^+} − ∫ g 1_{A^−} = ∫ f − ∫ g,

as required.
THEOREM 3.11 (Dominated convergence theorem). Let (f_n)_{n≥1} and f be measurable functions E → R such that f_n → f a.e. Suppose there exists an integrable function¹⁷ g : E → [0, ∞) (ie taking non-negative values) such that for all n ≥ 1, we have |f_n| ≤ g a.e. Then f_n and f are integrable and

∫_E f_n dµ → ∫_E f dµ.
Example. Suppose µ(E) < ∞, and the f_n, f are uniformly bounded, ie there exists C < ∞ such that |f_n|, |f| ≤ C a.e. Then one can take g ≡ C, and so ∫ g dµ = Cµ(E) < ∞, and thus under these conditions f_n → f a.e. implies ∫ f_n → ∫ f by DCT.
Example. Let f_n : [0, ∞) → R be defined by f_n(x) = e^{−nx}. Then f_n → 0 a.e. (indeed, for all x ≠ 0), and we can bound f_n(x) ≤ e^{−x}, for which ∫_0^∞ e^{−x} dx < ∞. So we can use e^{−x} as the dominating function, and use DCT¹⁸ to show ∫_0^∞ e^{−nx} dx → ∫_0^∞ 0 dx = 0.

Example. Let f_n : [0, π] → R be defined by f_n(x) = sin x + (cos x)^n / n. Then |f_n| ≤ 1 + 1/n ≤ 2 and¹⁹ f_n(·) → sin(·) a.e.; since µ([0, π]) < ∞, the constant 2 is an integrable dominating function. So ∫_0^π f_n(x) dx → ∫_0^π sin(x) dx = 2.
¹⁷ The role of g in DCT is sometimes called the dominating function.
¹⁸ Note that in this case we have f_n ↓ 0 a.e., so we could also have used the decreasing version of the MCT proved on the problem set.
¹⁹ Note that if |f_n| ≤ g a.e. and f_n → f a.e., then |f| ≤ g a.e. also.
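The two DCT examples above are easy to confirm numerically; the sketch below uses scipy.integrate.quad (assuming scipy is available; a numpy quadrature rule would do equally well).

    import numpy as np
    from scipy.integrate import quad

    for n in [1, 5, 25, 125]:
        val, _ = quad(lambda x: np.exp(-n * x), 0, np.inf)
        print(n, val)  # equals 1/n, decreasing to 0

    for n in [1, 5, 25, 125]:
        val, _ = quad(lambda x: np.sin(x) + np.cos(x)**n / n, 0, np.pi)
        print(n, val)  # converges to the limit integral, 2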
3.2.1 Expectations
Throughout this section, we will use our usual notation (Ω, F, P) for a probability space.
Definition 3.12. Given a random variable X on (Ω, F, P), the expectation of X is E[X] := ∫_Ω X dP, defined whenever X is integrable. (Note, it is particularly relevant for expectations that if E[X^+] = ∞ and E[X^−] < ∞, we may write E[X] = ∞.)
Example. For A ∈ F, the indicator function 1A is a random variable and E [1A ] = P(A).
The following properties of expectations are inherited directly from the corresponding properties of
general integrals:
Example. An instructive example is the deterministic case when X = c a.s. for some c ∈ R.
Definition 3.13. For a measurable space (E, E), and x0 ∈ E, define the (Dirac) delta measure δx0 ,
(
1 x0 ∈ A
δx0 (A) = , A ∈ E. (3.5)
0 x0 6∈ A
So for X = c a.s., one option is to take P = δ_c. Now, to compute E[f(X)] for a function f : R → R, we note that taking f̃ = f(c) 1_{{c}} gives f = f̃ δ_c-a.e., and f̃ is simple, so

∫_R f dδ_c = ∫_R f̃ dδ_c = f(c) δ_c({c}) = f(c).
Example. Now suppose X is a random variable on (Ω, F, P) taking values {1, 2, . . .}. Then 1{X=n}
is also a RV, and we have
X = lim_{N→∞} X 1_{{X≤N}} = lim_{N→∞} ∑_{n=1}^N n 1_{{X=n}},

where the first equality denotes an almost sure limit, and then

E[X] = lim_{N→∞} E[∑_{n=1}^N n 1_{{X=n}}] = lim_{N→∞} ∑_{n=1}^N n P(X = n) = ∑_{n≥1} n P(X = n),

where the first equality follows from applying MCT to ∑_{n=1}^N n 1_{{X=n}}, which is monotone in N. See the problem set for a similar argument for E[g(X)].
We revisit the following result, which gives a useful bound on probabilities in terms of expectations.
Proposition 3.14 (Markov's inequality). Let X be a random variable taking non-negative values. Then for any a ∈ (0, ∞), we have

P(X ≥ a) ≤ E[X]/a.
Proof. Introduce the auxiliary random variable Y = a1{X≥a} , so that Y ≤ X a.s. But then
E [Y ] = aP (X ≥ a) ≤ E [X], and the result follows.
We are now in a position to formalise this notion, starting with the following idea of ‘change of
measure’.
Proposition 3.15. Let (E, E, µ) be a measure space, and let f : E → [0, ∞) be measurable. Define ν(A) := ∫_E f 1_A dµ for A ∈ E. Then ν(·) is a measure on (E, E), and for every integrable g : E → R, we have

∫_E g dν = ∫_E f g dµ. (3.7)
Proof. We first check that ν is a measure. Note that ν(A) ≥ 0 by construction, and ν(∅) = 0. Now, suppose given a countable sequence of disjoint sets (A_n)_{n≥1} in E. Applying the monotone convergence theorem to the finite sums ∑_{n=1}^N f 1_{A_n}, we have

∫_E ∑_{n≥1} f 1_{A_n} dµ = ∑_{n≥1} ∫_E f 1_{A_n} dµ,

where the equality for each finite sum follows from linearity of (finite collections of) integrals. Since 1_{⋃ A_n} = ∑_{n≥1} 1_{A_n} for disjoint sets, this gives ν(⋃_{n≥1} A_n) = ∑_{n≥1} ν(A_n).
Then for measurable g ≥ 0, we use simple approximations g_n ↑ g a.e. as in Lemma 3.4. Then f g_n ↑ f g a.e. also holds, so two applications of MCT (on ν and µ in the first and third equalities, respectively) give

∫_E g dν = lim ∫_E g_n dν = lim ∫_E f g_n dµ = ∫_E f g dµ.
In contrast to the proof of linearity in Proposition 3.9, here the lift to the case of general measurable
g is immediate, after noting that (f g)+ = f g + and (f g)− = f g − .
Recall from Section 2.2.1 that any random variable X on probability space (Ω, F, P) induces a probability measure P_X on R, given by P_X(A) = P(X ∈ A). We will apply the following definition to P_X.
Definition 3.16. Given a measure space (E, E, µ), suppose that another measure ν on (E, E) satisfies

ν(A) = ∫_E f 1_A dµ, ∀A ∈ E;

then f is the density of ν with respect to µ.

When X is a random variable, with induced measure P_X, then if there exists non-negative measurable f_X : R → R such that

P_X(A) = P(X ∈ A) = ∫_R f_X 1_A dx, (3.8)

with dx understood to mean Borel measure on R, then f_X is the probability density function of X. For a pdf f_X of a random variable X, we must have ∫_R f_X dx = 1 (by taking A = R in (3.8)).
Example. A random variable X has the exponential distribution with parameter λ > 0, when
P(X ≥ x) = e−λx for all x ∈ [0, ∞). That is, FX (x) = 1 − e−λx and fX (x) = λe−λx 1[0,∞) .
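As an aside (an illustrative sketch, with arbitrary parameters), the formula F_X(x) = 1 − e^{−λx} can be inverted to sample from Exp(λ) using a Unif[0, 1] variable, which is one standard way such densities are realised in practice.

    import numpy as np

    rng = np.random.default_rng(4)
    lam, n = 2.0, 100_000

    u = rng.uniform(0, 1, size=n)
    x = -np.log(1 - u) / lam   # F^{-1}(u) for F(x) = 1 - exp(-lam x)

    print(x.mean(), 1 / lam)                      # mean of Exp(lam) is 1/lam
    print((x >= 0.5).mean(), np.exp(-lam * 0.5))  # tail P(X >= x) = e^{-lam x}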
The main result concerns the formalism of the motivating idea (3.6).

Proposition 3.17. Let X be a random variable on probability space (Ω, F, P) with pdf f_X, and g : R → R a measurable function. If g(X) is integrable, then²¹

E[g(X)] = ∫_R g dP_X = ∫_R g(x) f_X(x) dx. (3.9)
Proof. The second equality in (3.9) is a direct consequence of Proposition 3.15, so we only need to
prove the first equality. We use the same structure as before, verifying the result for g simple, then
g ≥ 0 measurable, then general measurable g.
If g is a simple function ∑_{i=1}^k a_i 1_{A_i} for A_i ∈ B(R), then g(X) = ∑_{i=1}^k a_i 1_{{X∈A_i}} is a simple random variable, and

E[g(X)] = ∑_{i=1}^k a_i P(X ∈ A_i) = ∑_{i=1}^k a_i P_X(A_i) = ∫_R g dP_X,

using the definition of expectations/integrals of simple functions in the first and last equalities.
If g ≥ 0 is measurable, we approximate by simple g_n ↑ g. It is important here that this holds P_X-a.e., but in fact it holds for all x ∈ R. Thus g_n(X) ↑ g(X) almost surely, and so we can use MCT twice, as usual, to obtain

E[g(X)] = lim_{n→∞} E[g_n(X)] = lim_{n→∞} ∫_R g_n dP_X = ∫_R g dP_X,

and the lift to general measurable g follows by splitting g = g^+ − g^− as usual.
Corollary 3.18 (Chebyshev's inequality). Let X be a random variable, with²² µ = E[X] < ∞, and σ² = Var(X) < ∞. Then, for all k > 0,

P(|X − µ| ≥ kσ) ≤ 1/k². (3.10)

For any t ∈ [0, ∞) satisfying E[e^{tX}] < ∞, we also have the following Chernoff bound:

P(X ≥ a) ≤ E[e^{tX}] / e^{ta}. (3.11)
Note. Chebyshev's inequality always gives a ‘better’ bound than Markov's inequality, provided σ² < ∞. Of course, there do exist random variables for which E[X] < ∞ but E[X²] = ∞.
We also briefly state some inequalities involving expected values of random variables, which gener-
alise results seen in previous analysis courses.
Recall the notion of convexity of a function g : R → R. In applications, it is often useful to use the following characterisation:

g twice differentiable, and g′′(x) ≥ 0 ∀x ∈ R ⇒ g convex. (3.12)
Proposition 3.19 (Jensen’s inequality). Let X be an integrable RV, and g : R → R convex. Then
g(E [X]) ≤ E [g(X)].
Corollary 3.21 (AM-GM). Let x_1, . . . , x_n ∈ (0, ∞). Then (x_1 + . . . + x_n)/n ≥ (x_1 ··· x_n)^{1/n}.
Remark. The two sides of this inequality are called the arithmetic mean and geometric mean.
Proof. Let X be uniform on {x1 , . . . , xn }. Then apply Jensen with the function (− log x).
Proposition 3.22 (Cauchy–Schwarz inequality). For random variables X, Y,

E[|XY|] ≤ √(E[X²] E[Y²]).

Remark. The absolute value signs on the LHS ensure that the LHS is always defined, as the expected value of a non-negative RV. Note that this result includes the statement that if E[|XY|] = ∞ then at least one of E[X²] and E[Y²] is infinite also.
²² Note that µ and σ have different meanings in this setting compared with their previous roles relating to measures. It is generally clear from context which is intended.
Proposition 3.23. Let X be a continuous random variable with density fX , and let g : R → R
be a measurable function such that i) g is either strictly increasing or strictly decreasing; ii) g −1 is
differentiable everywhere.
Then the density of Y = g(X) is given by

f_Y(y) = f_X(g^{−1}(y)) |d/dy g^{−1}(y)|. (3.13)
Proof. We focus on the case g strictly increasing, and use the distribution function F_Y. We have

F_Y(y) = P(g(X) ≤ y) = P(X ≤ g^{−1}(y)) = F_X(g^{−1}(y)),

where, since g is strictly increasing and continuous, g^{−1}(y) exists as a real number. Then, differentiating, we obtain

F′_Y(y) = F′_X(g^{−1}(y)) d/dy g^{−1}(y) = f_X(g^{−1}(y)) d/dy g^{−1}(y).

In the case where g is strictly decreasing, we have F_Y(y) = 1 − F_X(g^{−1}(y)) and, consequently,

f_Y(y) = −f_X(g^{−1}(y)) d/dy g^{−1}(y),

which is consistent with (3.13) since d/dy g^{−1}(y) is then negative.
Note that this result and argument is just a conversion of familiar results about ‘integration by
substitution’ into the language of probability densities, and we could have proved Proposition 3.23
using this framework also. Note that we now have two expressions for E [g(X)], that is
∫ y f_Y(y) dy = E[g(X)] = ∫ g(x) f_X(x) dx,

with f_Y given by (3.13), and the fact that they are equal follows directly using integration by substitution.
Example. If X ∼ Exp(λ) and Y = cX for some c > 0, then applying (3.13) with g(x) = cx gives

f_Y(y) = (1/c) f_X(y/c) = (λ/c) e^{−λy/c},

ie Y ∼ Exp(λ/c).
4 Multivariate probability
Most interesting situations in probability involve multiple random variables defined on the same probability space (Ω, F, P). There are times when it is important that random variables are independent (see Section 2.2.1), and we also have many examples where constructing random variables in a dependent fashion produces interesting effects.
In this section we will discuss some of the formalism, and several applications of the situation where
the distribution of a collection of random variables is defined jointly and generally.
The Borel σ-algebra B(R²) is generated by the rectangles, and Borel measure on R² is defined on rectangles by µ((a_1, b_1) × (a_2, b_2)) = (b_1 − a_1)(b_2 − a_2), and extended to B(R²) analogously to the case for R discussed in Section 1.3.2. For this section, we refer to Borel measure on R² as µ, with dx, dy used to indicate integrals with respect to Borel measure on R, to avoid confusion. The definition of a measurable function is
unchanged, as is the construction of the integral. However, it is not immediately clear under what
circumstances one may study an integral over R2 as a conventional ‘double-integral’ over each copy
of R in turn. The following theorem clarifies this.
THEOREM 4.1 (Fubini). Let f : R² → R be measurable and non-negative. Then

∫_{R²} f dµ = ∫_R (∫_R f(x, y) dx) dy = ∫_R (∫_R f(x, y) dy) dx, (4.1)

including in the sense that if one of these integrals is infinite, then all three are infinite.

Now, for general f measurable R² → R, if any of the following integrals

∫_{R²} |f| dµ, ∫_R ∫_R |f(x, y)| dx dy, ∫_R ∫_R |f(x, y)| dy dx

is finite, then all three are finite, and we say f is integrable, and (4.1) holds for f.
Given measurable spaces (E_1, E_1) and (E_2, E_2), the product σ-algebra on E_1 × E_2 is defined as

E := E_1 ⊗ E_2 := σ({A_1 × A_2 : A_1 ∈ E_1, A_2 ∈ E_2}),

noting that this is not asserting that every set in E can be decomposed as such a product (which is certainly not true). In fact, if E_1 = σ(A_1) and E_2 = σ(A_2) then we have

E = σ({A_1 × A_2 : A_1 ∈ A_1, A_2 ∈ A_2}),

which is particularly convenient if A_1, A_2 are more tractable than E_1, E_2 (as in the case of B(R) generated by the intervals). The product measure µ = µ_1 ⊗ µ_2 is then defined by

µ(A_1 × A_2) = µ_1(A_1) µ_2(A_2), A_1 ∈ E_1, A_2 ∈ E_2,

then extending to E_1 ⊗ E_2 using the machinery mentioned in Section 1.3.2. There are two key regularity properties we would like to hold:

• uniqueness of this extension;
• Fubini's theorem (4.1).
It turns out these are not always valid. However, they are valid if both µ_1, µ_2 satisfy a ‘σ-finite’ condition, which holds whenever µ_i(E_i) < ∞, and also for many infinite measures²³, including Borel measure on R.
However, counting measure on an uncountable set (see the examples given below Definition 1.5) is
not σ-finite, and Fubini’s theorem sometimes does not hold for products involving this measure.
See the Problem Set for a concrete example when exchanging the order of integration fails.
A pair of random variables (X, Y) has joint density f_{X,Y} if P((X, Y) ∈ Ā) = ∫_Ā f_{X,Y} dµ for all Ā ∈ B(R²), which is equivalent to

F_{X,Y}(x, y) = ∫_{u=−∞}^{x} ∫_{v=−∞}^{y} f_{X,Y}(u, v) dv du.

The fundamental theorem of calculus then gives

f_{X,Y}(x, y) = ∂²/∂x∂y F_{X,Y}(x, y). (4.2)
The following notion allows us to reduce the case of a joint distribution (in particular with a joint
density) to the distribution of each component separately.
Definition 4.2. If random vector (X, Y) has joint density f_{X,Y}, then the marginal density of X is

f_X(x) := ∫_{−∞}^{∞} f_{X,Y}(x, y) dy. (4.3)
As in the univariate case, the random vector (X, Y) induces a probability measure P_{(X,Y)} on R², given by P_{(X,Y)}(Ā) = P((X, Y) ∈ Ā) for all Ā ∈ B(R²). Furthermore, when (X, Y) has a joint density, we have the analogue to Proposition 3.17 for measurable functions g : R² → R:

E[g(X, Y)] = ∫_{R²} g dP_{(X,Y)} = ∫_{R²} g f_{X,Y} dµ = ∫_R ∫_R g(x, y) f_{X,Y}(x, y) dx dy.
Example. Let X, Y be independent random variables with densities f_X, f_Y, and set Z = X + Y. Then, substituting w = x + y in the inner integral,

F_Z(z) = P(X + Y ≤ z) = ∫_{x=−∞}^{∞} ∫_{y=−∞}^{z−x} f_X(x) f_Y(y) dy dx
= ∫_{x=−∞}^{∞} ∫_{w=−∞}^{z} f_Y(w − x) f_X(x) dw dx
= ∫_{w=−∞}^{z} ∫_{x=−∞}^{∞} f_Y(w − x) f_X(x) dx dw,

where we use Fubini again in the final equality to change the order of integration. By differentiating, we find

f_Z(z) = d/dz F_Z(z) = ∫_{x=−∞}^{∞} f_Y(z − x) f_X(x) dx.
Definition 4.3. We say X has the Gamma distribution, denoted X ∼ Γ(n, λ) for λ > 0, n ∈ {1, 2, . . .}, when X has density

f_X(x) = (λ^n x^{n−1} / (n − 1)!) e^{−λx} 1_{{x≥0}}.
Note that n = 1 reduces to Exp(λ), and n = 2 is the distribution of the sum of two IID Exp(λ)s,
as derived in the previous exercise.
In fact if X1 , X2 , . . . are IID Exp(λ) RVs, then X1 + . . . + Xn ∼ Γ(n, λ).
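This identification is easy to test by simulation (an illustrative sketch, with arbitrary parameters): the sum of n IID Exp(λ) variables should match the Γ(n, λ) distribution, which has mean n/λ and variance n/λ².

    import numpy as np

    rng = np.random.default_rng(5)
    n, lam, samples = 5, 2.0, 200_000

    # S = X_1 + ... + X_n with X_i IID Exp(lam)
    S = rng.exponential(scale=1 / lam, size=(samples, n)).sum(axis=1)

    print(S.mean(), n / lam)     # Gamma(n, lam) mean: n/lam
    print(S.var(), n / lam**2)   # Gamma(n, lam) variance: n/lam^2
    # Direct comparison: rng.gamma(shape=n, scale=1/lam, size=samples)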
For a single random variable, the theory of the density of the transformed RV reduces directly to
integration by substitution. The same is true for multiple RVs, and we recall the key object for
studying changes of variables in higher-dimensional integrals.
Definition 4.4. Let ϕ : D ⊆ R² → C ⊆ R² be a function mapping (x, y) ↦ (u, v) for which all partial derivatives exist on D. Then the Jacobian J(ϕ) is defined as the following determinant of the matrix of partial derivatives:

J(ϕ) = det [ ∂u/∂x ∂u/∂y ; ∂v/∂x ∂v/∂y ] = (∂u/∂x)(∂v/∂y) − (∂u/∂y)(∂v/∂x).
Proof. As referenced below Proposition 3.23 for the monovariate case, this follows from the usual construction of integration by substitution in two dimensions. (The Jacobian represents the infinitesimal change-of-area factor under a substitution.)
Example. Let X, Y ∼ Exp(λ) be IID, and set Z = X + Y and Q = X/(X + Y). Then f_{X,Y}(x, y) = λ² e^{−λ(x+y)}, and we can study

ϕ : [0, ∞)² → [0, ∞) × [0, 1], (x, y) ↦ (z, q) = (x + y, x/(x + y)).
Given an event A ∈ F with P(A) > 0, the conditional distribution of X given A is defined by

P(X ∈ C | A) = P({X ∈ C} ∩ A) / P(A), ∀C ∈ B(R).
Example. Let X ∼ Exp(λ) and A = {X ≥ a} for some a ≥ 0. One can check that

(X | A) =^d a + Exp(λ), (4.6)

ie the exponential distribution is memoryless.
Definition 4.6. Suppose (X, Y) have joint density f_{X,Y}, and recall the marginals f_X, f_Y from Definition 4.2. Then for x ∈ R such that f_X(x) > 0, the conditional density of Y given X = x is

f_{Y|X=x}(y) = f_{X,Y}(x, y) / f_X(x). (4.7)
Note immediately that if X, Y are independent, then for all x we have fY |X=x = fY . The conditional
expectation is defined as the natural extension E [Y | X = x]. Noting that this is a function of x,
one can extend it to E [Y | X] which is a random variable (ie a function of random variable X).
An important feature of the normal distribution(s) is closure under linear transformations. That is, if X ∼ N(µ, σ²) and a, b ∈ R, then aX + b ∼ N(aµ + b, a²σ²). (4.8)

Now let X, Y ∼ N(0, 1) be IID, and write (R, Θ) for the polar coordinates of (X, Y), so that X = R cos Θ, Y = R sin Θ. Transforming the joint density f_{X,Y}(x, y) = (1/2π) exp(−(x² + y²)/2) into polar coordinates (with Jacobian factor r) gives f_{R,Θ}(r, ϑ) = (1/2π) exp(−r²/2) r, which splits as a product.
We conclude that random variables (R, Θ) are independent, with Θ ∼ Unif([0, 2π)). A consequence
of this is that the joint distribution of (X, Y ) is invariant under rotations. In other words, there is
a notion of a two-dimensional normal distribution that does not depend on the choice of axes. This
is strong evidence that this is an important higher-dimensional distribution.
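This factorisation underlies polar sampling of normals: Θ is uniform on [0, 2π), and R satisfies P(R > r) = e^{−r²/2}, so R² ∼ Exp(1/2). That gives a simple way to generate IID N(0, 1) pairs (the classical Box–Muller method); the sketch below is illustrative only.

    import numpy as np

    rng = np.random.default_rng(6)
    n = 100_000

    theta = rng.uniform(0, 2 * np.pi, size=n)          # Θ ~ Unif[0, 2π)
    r = np.sqrt(-2 * np.log(1 - rng.uniform(size=n)))  # R² ~ Exp(1/2)

    x, y = r * np.cos(theta), r * np.sin(theta)
    print(x.mean(), x.var())   # ≈ 0, 1
    print(y.mean(), y.var())   # ≈ 0, 1
    print(np.mean(x * y))      # ≈ 0: uncorrelated (in fact independent)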
Note that the key feature of this definition is that uT X is Gaussian, not the exact parameters. In
fact, the exact parameters are clear: uT X ∼ N (uT µ, uT V u). (See the problem set.) We must also
check that uT V u is non-negative, which we will do a bit later.
It also follows quickly from this definition (without needing to apply any transformation formulas) that for a matrix A ∈ R^{n×n} and b ∈ R^n, AX + b is also Gaussian. (See the problem set.)
We briefly recall the definition of the moment generating function of a (monovariate) random variable, as discussed in previous courses. Let X be an R-valued random variable; then the moment generating function is m_X(t) := E[e^{tX}] for t ∈ R. For some distributions, and for some values of t, we may have m_X(t) = ∞.
Note that because the density of the normal distribution decays like e^{−x²/2}, which decays faster than any e^{tx} grows, the MGF of a Gaussian is finite everywhere.
A key usage of MGFs is that they ‘determine distributions’, provided they are defined on an interval. Formally, if X, Y are two random variables for which m_X(t) and m_Y(t) are finite and equal for all t in some interval [−ε, ε], then X =^d Y.
Example. Let X_1 ∼ N(µ_1, σ_1²) and X_2 ∼ N(µ_2, σ_2²) be independent normal RVs. Then X_1 + X_2 ∼ N(µ_1 + µ_2, σ_1² + σ_2²).

Note that E[X_1 + X_2] = µ_1 + µ_2 and Var(X_1 + X_2) = σ_1² + σ_2² is clear without reference to the normal distribution. To verify that X_1 + X_2 is Gaussian, we use MGFs, since in general m_{X_1+X_2}(t) = m_{X_1}(t) m_{X_2}(t) when X_1, X_2 are independent. In this particular case, we obtain

m_{X_1+X_2}(t) = exp((µ_1 + µ_2)t + ½(σ_1² + σ_2²)t²),

which is the MGF of N(µ_1 + µ_2, σ_1² + σ_2²).
Definition 4.8. The MGF of a random vector X = (X_1, . . . , X_n) ∈ R^n is defined as m_X(u) = E[e^{u^T X}], for²⁶ u ∈ R^n.

²⁵ Sometimes we say X is multivariate Gaussian or (X, Y) is jointly Gaussian to emphasise the dimension.
²⁶ One can alternatively think of u^T X as u · X or ⟨u, X⟩.
Then the corresponding result for random vectors says that if X, Y are two R^n-valued random vectors for which m_X(u) and m_Y(u) are finite and equal for all u ∈ [−ε, ε]^n, then X =^d Y.

So if X is a Gaussian random vector with mean µ and covariance V, we can explicitly calculate

m_X(u) = E[e^{u^T X}] = exp(u^T µ + ½ u^T V u), (4.10)

via the t = 1 case for the MGF of the distribution N(u^T µ, u^T V u).
We conclude that the distribution of a Gaussian random vector is completely characterised by
its mean and covariance matrix. The main outstanding question is whether all matrices V are
achievable as the covariance matrix. However, we do know about one particular Gaussian random
vector.
Proposition 4.10. Let (X1 , X2 ) be jointly Gaussian. Then Cov(X1 , X2 ) = 0 if and only if X1 , X2
are independent.
Proof. The converse, that X_1, X_2 independent implies Cov(X_1, X_2) = 0, is always true, see (4.4).

A sufficient condition for independence of X_1, X_2 is that the MGF m_X(u) splits as a product of a function of u_1 and a function of u_2, for all u = (u_1, u_2) ∈ R². Referring to the explicit calculation (4.10), and writing µ = (µ_1, µ_2) and σ_1², σ_2² for the variances of X_1, X_2, we have

m_X(u) = exp(µ_1 u_1 + µ_2 u_2 + ½(σ_1² u_1² + 2 Cov(X_1, X_2) u_1 u_2 + σ_2² u_2²)).

So it is clear that m_X(u) splits as a product if (in fact, if and only if) Cov(X_1, X_2) = 0.
For a two-dimensional joint Gaussian, the covariance acts as a measure of the dependence between the two random components, but varies with the overall magnitudes. It is in practice often more appropriate to use the correlation

corr(X, Y) := Cov(X, Y) / √(Var(X) Var(Y)),

which always satisfies corr(X, Y) ∈ [−1, 1].

Proof. Reduce to the case E[X] = E[Y] = 0; then the result follows immediately from Cauchy–Schwarz, as in Proposition 3.22.
As we have seen, much theory is considerably easier in the situation where the random variables
are independent. A particularly nice result about two-dimensional joint Gaussians is the possibility
to reduce the dependent case to a situation defined in terms of auxiliary random variables which
are independent.
Proposition 4.13. Let (X, Y ) be jointly Gaussian. Then there exists a ∈ R such that Y can be
expressed as aX + Z, where Z is Gaussian, and X, Z are independent. (And so certainly (X, Z) is
jointly Gaussian).
Example. When (X, Y ) are jointly Gaussian, (Y | X = x) is Gaussian for all x ∈ R. With
Proposition 4.13 in mind, we do not need to calculate using the joint density and (4.7) to justify
this assertion.
d d
Instead, we note that the independence of X, Z implies (Z | X = x) Z , and so (Y | X = x) = ax+Z.
Consider Y = AX + b, where X is a random vector with mean µ and covariance matrix V, and A ∈ R^{n×n}, b ∈ R^n. Then E[Y] = Aµ + b, and so

Cov(Y) = E[(AX + b − (Aµ + b))(AX + b − (Aµ + b))^T]
= E[(A(X − µ))(A(X − µ))^T]
= E[A(X − µ)(X − µ)^T A^T]
= A E[(X − µ)(X − µ)^T] A^T = A V A^T. (4.11)
We also note that when Z ∈ Rn is a standard Gaussian, the covariance matrix Cov(Z) is the identity
matrix Idn on Rn .
In one dimension, converting as in (4.8), the scaling factor is σ = √(σ²). This is more involved in the higher-dimensional setting. We must revisit some linear algebra in order to make sense of the notion of the ‘square root’ of a matrix.
Definition 4.14. Any real symmetric matrix V can be diagonalised, that is, expressed as V = U^T D U, where D is diagonal, and U is orthogonal, meaning U^T U = Id. Furthermore, the diagonal matrix D consists of all the eigenvalues on the diagonal, these eigenvalues are all real, and the corresponding eigenvectors form an orthonormal basis.
Lemma 4.16. For a Gaussian vector X, the covariance matrix V is symmetric and positive semi-
definite.
Proof. Matrix V is symmetric since Cov(·, ·) is symmetric in its two arguments. Then, to confirm V is positive semi-definite, note that

u^T V u = ∑_{i,j=1}^n u_i Cov(X_i, X_j) u_j = Cov(∑_{i=1}^n u_i X_i, ∑_{j=1}^n u_j X_j) = Var(u^T X) ≥ 0,

for every u ∈ R^n.
Lemma 4.17. Given a matrix A ∈ Rn×n which is symmetric and positive semi-definite, all the
eigenvalues of A are non-negative. Furthermore, A can be written as A = BB for some B ∈ Rn×n
symmetric and positive semi-definite.
We are now ready to move to the density of the general Gaussian random vector.
THEOREM 4.18. For every vector µ ∈ R^n and positive semi-definite V ∈ R^{n×n}, there exists an n-dimensional Gaussian random vector X with E[X] = µ and V = (Cov(X_i, X_j))_{i,j}. Furthermore, if det(V) ≠ 0, then X has density

f_X(x) = 1/((2π)^{n/2} √det(V)) · exp(−½ (x − µ)^T V^{−1} (x − µ)). (4.12)
Note. In one dimension, the case det(V) = 0 corresponds to σ² = 0, when the random variable is deterministic. In higher dimensions, det(V) = 0 implies that X is supported on a subspace of lower dimension than n, and so does not have a density.
Proof. Write V = V^{1/2} V^{1/2} using Lemma 4.17, and set X = V^{1/2} Z + µ, where Z is a standard Gaussian on R^n. Then E[X] = V^{1/2} E[Z] + µ = µ, and using (4.11),

Cov(X) = V^{1/2} Cov(Z) (V^{1/2})^T = V^{1/2} Id_n V^{1/2} = V.

Thus X is Gaussian and has the correct mean and covariance, which completely characterises the distribution.
Now, consider ϕ : R^n → R^n with ϕ : z ↦ V^{1/2} z + µ, so that ϕ^{−1}(x) = V^{−1/2}(x − µ), which makes sense when det(V) ≠ 0, so that V (and thus V^{1/2}) is invertible. We aim to use Proposition 4.5 to handle the density of X as a transformation of Z. Note that the Jacobian J(ϕ^{−1}) is constant and equal to the determinant of V^{−1/2}, ie J(ϕ^{−1}) = 1/√det(V). Then we also have

||ϕ^{−1}(x)||² = ||V^{−1/2}(x − µ)||² = [V^{−1/2}(x − µ)]^T V^{−1/2}(x − µ)
= (x − µ)^T V^{−1/2} V^{−1/2} (x − µ) = (x − µ)^T V^{−1} (x − µ).
Example. Let Z ∈ R^n be standard Gaussian, and A ∈ R^{n×n} an orthogonal matrix, so that AA^T = Id_n. Then W = AZ is Gaussian, with E[W] = 0 and Cov(W) = A Cov(Z) A^T = Id_n. So in fact the standard Gaussian is invariant under orthogonal transformations of R^n, just as in the two-dimensional case which we explored in Section 4.2.1.
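The construction X = V^{1/2} Z + µ in the proof of Theorem 4.18 translates directly into a sampling procedure; the sketch below (illustrative, with an arbitrary choice of µ and V) builds V^{1/2} via the eigendecomposition of Lemma 4.17 and checks the empirical covariance.

    import numpy as np

    rng = np.random.default_rng(7)
    mu = np.array([1.0, -2.0])
    V = np.array([[2.0, 0.8],
                  [0.8, 1.0]])   # symmetric, positive definite

    # V^{1/2} = U diag(sqrt(eigenvalues)) U^T, as in Lemma 4.17
    eigvals, U = np.linalg.eigh(V)
    V_half = U @ np.diag(np.sqrt(eigvals)) @ U.T

    Z = rng.standard_normal(size=(100_000, 2))  # rows are standard Gaussians
    X = Z @ V_half.T + mu                       # X = V^{1/2} Z + mu, row-wise

    print(X.mean(axis=0))   # ≈ mu
    print(np.cov(X.T))      # ≈ V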
Other stochastic processes of interest do not have independent increments, but demand other regularity properties that allow interesting analysis. We will not explore this further in this course.
We now give several examples of random walks, and discuss unique features of each.
Example. Taking X_n := +1 with probability 1/2 and −1 with probability 1/2 defines simple symmetric random walk on Z (SSRW). In this setting, direct enumeration of all possible options is often the best way to study probabilities. For example

P(S_12 = 0) = (number of paths from (0,0) to (12,0)) / 2^{12} = (1/2^{12}) (12 choose 6),

and

P(S_12 = 0, and S_1, . . . , S_11 ≥ 0) = C_6 / 2^{12},

where C_6 is the 6th Catalan number.
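Both probabilities can be confirmed by brute-force enumeration over the 2^12 equally likely paths (an illustrative check):

    from itertools import product
    from math import comb

    steps = list(product([-1, 1], repeat=12))
    total = len(steps)  # 2^12 = 4096 equally likely paths

    def partial_sums(xs):
        s, out = 0, []
        for x in xs:
            s += x
            out.append(s)
        return out

    paths = [partial_sums(xs) for xs in steps]
    p_zero = sum(p[-1] == 0 for p in paths) / total
    p_nonneg = sum(p[-1] == 0 and min(p[:-1]) >= 0 for p in paths) / total

    print(p_zero, comb(12, 6) / 2**12)  # both 924/4096
    print(p_nonneg, 132 / 2**12)        # C_6 = 132 Catalan (Dyck) paths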
We generalise this example to biased random walk with

X_n = +1 with probability p, and −1 with probability 1 − p, (4.13)

for some probability p ∈ [0, 1]. The case p = 1/2 corresponds to SSRW. The drift µ = E[X_n] satisfies

µ = E[X_n] = 2p − 1, which is < 0 for p < 1/2; = 0 for p = 1/2; > 0 for p > 1/2. (4.14)

We might ask whether S_n →^{a.s.} +∞ or S_n →^{a.s.} −∞ (or neither) in each of these cases.
Example. Taking X_n ∼ Exp(λ) for some λ > 0 gives an example of an arrivals process, used to describe the timings of events which occur in sequence. In this context, we describe the increments of the random walk as ‘IID holding times’. Since the exponential distribution is memoryless (as shown in (4.6)), it makes sense to describe such an arrivals process as having ‘rate λ’. Note that S_n has the Gamma distribution Γ(n, λ), as in Definition 4.3.
Example. Taking X_n ∼ N(µ, σ²) gives a Gaussian random walk²⁷. Note that for finite n, the increments (X_1, . . . , X_n) can be viewed as a Gaussian random vector, and so (S_1, . . . , S_n) is also a Gaussian random vector, since each value is a linear combination of the X_i s.

To simplify the calculation, assume µ = 0 and σ² = 1, so that E[S_k] = 0 for all k. Recall that a Gaussian random vector is entirely defined by its mean and covariance matrix. So what is the covariance matrix of (S_1, . . . , S_n)? We can compute the covariance of S_k and S_ℓ, with k ≤ ℓ, by decomposing as follows:

S_ℓ = (X_1 + . . . + X_k) + (X_{k+1} + . . . + X_ℓ),

where the two bracketed terms are independent, and the first term is equal to S_k. We obtain

Cov(S_k, S_ℓ) = Var(S_k) = k = min(k, ℓ).

²⁷ We mention now that this is an example of a Gaussian field (here in one dimension). Gaussian fields on higher dimensional discrete spaces (like Z^d) can also be defined, as well as (with considerable complexities) in the continuum, and these objects are the subject of active research interest at the forefront of modern probability theory.
Throughout, recall that (X_n)_{n≥1} are IID with finite mean µ = E[X_1], and S_n = X_1 + . . . + X_n.

Proposition 4.20 (Weak Law of Large Numbers (WLLN)). We have S_n/n →^d µ.

THEOREM 4.21 (Strong Law of Large Numbers (SLLN)). We have S_n/n →^{a.s.} µ.
Note. In general, convergence almost surely implies convergence in probability and in distribution
to the same limit, so SLLN implies WLLN, hence strong and weak.
THEOREM 4.22 (Central Limit Theorem (CLT)). Assume that the variance of the increments σ² = Var(X_1) < ∞. Then

(S_n − nµ)/√(nσ²) →^d N(0, 1). (4.15)
It is helpful to think about the statement (4.15) of the CLT in three stages:

• the distribution of S_n is ‘concentrated’ on nµ (and this is the WLLN);
• the fluctuations of S_n around nµ have order √n;
• the exact distribution of the fluctuations of S_n around nµ is approximately normal.
Example. Let X_n ∼ Exp(λ), with S_n viewed as the time of the nth arrival. Then SLLN gives

S_n/n →^{a.s.} 1/λ. (4.16)

Define N(t) := max{n ≥ 0 : S_n ≤ t}, counting the number of arrivals up to time t. This process N(t) is generally called the Poisson process with rate λ, and in fact it can be shown that N(t) has independent increments, and satisfies N(t) − N(s) ∼ Po(λ(t − s)). For now, we will just give a sandwiching argument to show that N(t)/t →^{a.s.} λ.

Begin by noting that N(S_n) = n, and that S_{N(t)} ≤ t < S_{N(t)+1}. Consequently, one has

N(t)/S_{N(t)+1} < N(t)/t ≤ N(t)/S_{N(t)}, (4.17)

where inverting the converging quantities in (4.16) is valid since λ > 0. (There is no probability in this statement, just a result in real analysis which is relevant in this context with probability 1.) Combining these estimates, we see that both outer²⁸ quantities in (4.17) converge almost surely to λ, and so N(t)/t also converges almost surely to λ.
In applications, one often takes a_n = nµ + z_a√(nσ²) and b_n = nµ + z_b√(nσ²) for constants z_a < z_b, leading to

P(a_n ≤ S_n ≤ b_n) = P((a_n − nµ)/√(nσ²) ≤ (S_n − nµ)/√(nσ²) ≤ (b_n − nµ)/√(nσ²))
= P(z_a ≤ (S_n − nµ)/√(nσ²) ≤ z_b)
→ P(z_a ≤ Z ≤ z_b),

where Z ∼ N(0, 1).
Example. If X_n ∼ N(µ, σ²) then in fact (S_n − nµ)/√(nσ²) ∼ N(0, 1) is true for every n. In other words, the distributional property of the CLT holds for this case without taking a limit!
Example. The CLT can be applied jointly to, for example (Sn , S2n ). The key observation here is
that (Sn , S2n − Sn ) are independent. For ease of notation, let us assume µ = 0, σ 2 = 1. Then,
applying CLT in each coordinate gives
(S_n/√n, (S_{2n} − S_n)/√n) →^d (Z_1, Z_2),

where Z_1, Z_2 are IID N(0, 1).
This idea can be extended to show that for any sequence of reals t_1 < t_2 < . . . < t_k, one has

(S_{⌊t_1 n⌋}/√n, S_{⌊t_2 n⌋}/√n, . . . , S_{⌊t_k n⌋}/√n) →^d (W(t_1), W(t_2), . . . , W(t_k)),

where (W(t_1), . . . , W(t_k)) is a Gaussian random vector with covariances given by Cov(W(t_j), W(t_k)) = min(t_j, t_k). The case with t_i ∈ N is studied on the problem set.
This raises the question of whether there is a random continuous process W : [0, ∞) → R, and whether one can make sense of the notion that (S_{⌊tn⌋}/√n, t ≥ 0) converges to W. Defining the limit process formally as Brownian motion is the next step, and is explored in other probability courses, and used heavily in modelling, including in mathematical finance. The convergence of the rescaled random walk to Brownian motion holds²⁹ in considerable generality.
Direct proofs
Proof of WLLN, under assumption σ² < ∞. Recall that the Weak Law of Large Numbers can be viewed as a statement about convergence in probability, as well as its original form about convergence in distribution. Chebyshev's inequality (Corollary 3.18) gives us a tool to quantify this. We have, for any ε > 0,

P(|S_n/n − µ| > ε) = P(|S_n − nµ| > nε) ≤ Var(S_n)/(n²ε²) = nσ²/(n²ε²) → 0,

as n → ∞, as required.
Proof of SLLN, under assumption that E[X⁴] < ∞. Having imposed the fourth moment condition, we would like to denote Y = X − µ and Y_n = X_n − µ. We must check that Y also satisfies the fourth moment condition. As a preliminary step, note that a finite fourth moment implies E[X], E[X²], E[X³] < ∞. To see this, note for example that |X|³ ≤ 1 + X⁴, and similarly for the lower powers. Then

E[(X − µ)⁴] = E[X⁴] − 4µE[X³] + 6µ²E[X²] − 4µ³E[X] + µ⁴ < ∞.

So, we may assume without loss of generality that µ = 0 (since E[Y] as constructed above is zero).
We now analyse

E[S_n⁴] = E[(X_1 + . . . + X_n)⁴] = ∑_{i=1}^n E[X_i⁴] + 4 ∑_{i≠j} E[X_i³ X_j] + 12 ∑_{i,j,k distinct} E[X_i² X_j X_k] + . . . ,

with a sum over all combinations of monomials with total power equal to 4,

= ∑_{i=1}^n E[X_i⁴] + 4 ∑_{i≠j} E[X_i³] E[X_j] + 12 ∑_{i,j,k distinct} E[X_i²] E[X_j] E[X_k] + . . . ,

using independence to factorise. Now all the terms with an E[X_i] vanish, since this expected value is zero, leaving

= ∑_{i=1}^n E[X_i⁴] + 6 ∑_{i<j} E[X_i²] E[X_j²] = C_4 n + 6 C_2² (n choose 2),

writing C_4 := E[X⁴] and C_2 := E[X²].
We may conclude that there exists a constant C such that E[S_n⁴] ≤ Cn² for all n. From this, we can study ∑_n (S_n/n)⁴ via its expectation. We obtain

E[∑_{n=1}^∞ (S_n/n)⁴] ≤ ∑_{n≥1} C/n² < ∞.

Consequently, we know that the random variable ∑_n (S_n/n)⁴ is finite with probability 1, and we have the string of implications

P(∑_n (S_n/n)⁴ < ∞) = 1 ⇒ P((S_n/n)⁴ → 0) = 1 ⇒ P(S_n/n → 0) = 1,

as required.
THEOREM 4.23 (Lévy's continuity thm - MGF version). Let X, (X_n)_{n≥1} be random variables (or distributions) with MGFs m_X(·), m_{X_n}(·), respectively. Assume that m_X(t) < ∞ for all t in some interval (−ε, ε) around 0. Then, if m_{X_n}(t) → m_X(t) for all t ∈ (−ε, ε), we have X_n →^d X.

We will not prove this theorem in this course, but we will make sense of the conditions to apply it, using the formalism of our earlier work. We would like to expand

E[e^{tX}] = E[1 + tX + (t²/2!)X² + . . .] = 1 + tE[X] + (t²/2!)E[X²] + . . . . (4.18)

By the dominated convergence theorem (Theorem 3.11), we know that this is valid whenever E[e^{|tX|}] < ∞.
For the purposes of using this to prove WLLN and CLT, it is useful to note:

• when X = µ a.s., that is P(X = µ) = 1, then m_X(t) = e^{tµ};
• when X ∼ N(0, 1), we have m_X(t) = exp(t²/2), as in (4.9).

We can now prove the CLT, subject to the condition that E[e^{|tX|}] < ∞. We will carry out a proof of WLLN as a warm-up.
Proof of WLLN subject to E[e^{|tX|}] < ∞ condition. We know how to write the MGF of S_n in terms of the MGF of X, as m_{S_n}(t) = (m_X(t))^n. The goal is to study m_{S_n/n} since this corresponds to the sequence of RVs which is converging. We have

m_{S_n/n}(t) = E[e^{t S_n/n}] = E[e^{(t/n) S_n}] = m_{S_n}(t/n) = (m_X(t/n))^n.

By (4.18), m_X(s) = 1 + µs + o(s) as s → 0, and so

m_{S_n/n}(t) = (1 + µt/n + o(1/n))^n → e^{tµ},

which matches the MGF of the constant µ. We conclude, using Theorem 4.23, that S_n/n →^d µ.

Proof of CLT subject to E[e^{|tX|}] < ∞ condition. We may assume without loss of generality that µ = 0 and σ² = 1, by replacing X_n with (X_n − µ)/σ. Then the expansion (4.18) gives

m_X(t) = 1 + t²/2 + o(t²).
We will use this to study the MGF of S_n/√n.

m_{S_n/√n}(t) = E[e^{t S_n/√n}] = E[e^{(t/√n) S_n}] = m_{S_n}(t/√n) = (m_X(t/√n))^n = (1 + t²/2n + o(1/n))^n → exp(t²/2),

which matches the MGF of N(0, 1). We conclude, using Theorem 4.23, that S_n/√n →^d N(0, 1).
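The content of the CLT is easy to visualise numerically; this illustrative sketch (with arbitrary parameters) compares the distribution of (S_n − nµ)/√(nσ²) for exponential increments against the standard normal CDF at a few points.

    import numpy as np
    from math import erf, sqrt

    rng = np.random.default_rng(8)
    lam, n, samples = 1.0, 200, 50_000
    mu, sigma2 = 1 / lam, 1 / lam**2   # mean and variance of Exp(lam)

    S = rng.exponential(scale=1 / lam, size=(samples, n)).sum(axis=1)
    T = (S - n * mu) / np.sqrt(n * sigma2)   # rescaled as in (4.15)

    Phi = lambda z: 0.5 * (1 + erf(z / sqrt(2)))  # standard normal CDF
    for z in [-1.0, 0.0, 1.0, 2.0]:
        print(z, (T <= z).mean(), Phi(z))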