Probability and Statistics for SIC: 2018 Notes
© A. C. Davison, 2018
http://stat.epfl.ch

Contents
1 Introduction
1.1 Motivation
1.2 Preliminaries
1.3 Combinatorics
2 Probability
2.3 Independence
3 Random Variables
5.1 Basic Notions
9 Likelihood
9.5 Linear Regression
1 Introduction slide 2
Stochastic networks
Erdős–Rényi graph (1960), with p = 0.01. The edge between each pair of vertices appears with probability p, independently of the other edges. In this case, if p > (1 + ε) log n/n for some ε > 0, the graph will almost certainly be connected.
(Source: Wikipedia)
‘Giant component’
Erdős–Rényi graph (1960), with n = 150, p = 0.01. If np → c > 1 as n → ∞, then there is almost certainly a connected subgraph containing a positive fraction of the vertices. No other component contains more than O(log n) vertices.
(Source: www.cs.purdue.edu)
Stochastic networks II
(Panels: chain network, nearest-neighbour network, scale-free network.)
Modeling of webpages as networks
Fig. 3. Common structure in the webpages data. Panel (a) shows the estimated common structure for the four cat-
egories. The nodes represent 100 terms with the highest log-entropy weights. The area of the circle representing a
node is proportional to its log-entropy weight. The width of an edge is proportional to the magnitude of the associated
partial correlation. Panels (b)–(d) show subgraphs extracted from the graph in panel (a).
Randomized algorithms
Signal processing
(Figure: NMR data (y against index 0–1000), and its wavelet decomposition coefficients by resolution level against translate; Daubechies compactly supported wavelet, extremal phase, N = 2.)
Signal processing
(Figure: original wavelet coefficients and shrunken coefficients, plotted by resolution level.)
Signal processing
(Figure: NMR data and the Bayesian posterior median reconstruction of the underlying signal.)
Video data
Amount of coded information (variable bit rate, series videoVBR) per frame for a certain video sequence. There were about 25 frames per second.
Time series
6e+04
Number
0e+00
Value
60000
0
Time
Number and value of transactions (arbitrary units) every hour for mobile phones, 2010.
Practical motivation
A lot of later courses rely on probability and statistics:
Applied data analysis (West)
Automatic speech processing (Bourlard)
Biomedical signal processing (Vesin)
Stochastic models in communication (Thiran)
Machine learning (Jaggi/Urbanke)
Performance evaluation (Le Boudec)
Signal processing for communications (Prandoni)
Principles of digital communications (Telatar)
...
Organisation
Lecturer: Professor A. C. Davison
Assistants: see moodle page or information sheet
Lectures: Monday 14.15–16.00, CE6; Tuesday, 13.15–15.00, CE1
Exercises: Monday 16.15–18.00, CE6
Distinction between
– exercises (solutions available) and
– problems (solutions posted later)
Test: 16th April, 16.15–18.00, no written notes (simple calculator allowed)
Quizzes: 15-minute quizzes on 5 and 19 March, 9 and 30 April, 14 May, no written notes (simple
calculator allowed)
Course material (including Random Exercise Generator): see moodle page for the course.
Course material
Probability constitutes roughly the first 60% of the course, and a good book is
Ross, S. M. (2007) Initiation aux Probabilités. PPUR: Lausanne.
Ross, S. M. (2012) A First Course in Probability, 9th edition. Pearson: Essex.
Statistics comprises roughly the last 40% of the course. Possible books are
Davison, A. C. (2003). Statistical Models. Cambridge University Press. Sections 2.1, 2.2; 3.1, 3.2;
4.1–4.5; 7.3.1; 11.1.1, 11.2.1.
Morgenthaler, S. (2007) Introduction à la Statistique. PPUR: Lausanne.
Wild, C. & Seber, G. A. F. (2000). Chance Encounters: A First Course in Data Analysis and
Statistics. John Wiley & Sons: New York.
Helbling, J.-M. & Nuesch, P. (2009). Probabilités et Statistique (polycopie).
There are many excellent introductory books on both topics; look in the Rolex Learning Centre.
Sets
Definition 1. A set A is an unordered collection of objects, x_1, . . . , x_n, . . .:
A = {x_1, . . . , x_n, . . .} .
We write x ∈ A to say that 'x is an element of A', or 'x belongs to A'. The collection of all possible objects in a given context is called the universe Ω.
An ordered set is written A = (1, 2, . . .). Thus {1, 2} = {2, 1}, but (1, 2) ≠ (2, 1).
Examples:
Subsets
Definition 2. A set A is a subset of a set B if x ∈ A implies that x ∈ B: we write A ⊂ B.
If A ⊂ B and B ⊂ A, then every element of A is contained within B and vice versa, thus A = B:
both sets contain exactly the same elements.
Note that ∅ ⊂ A for every set A. Thus
∅ ⊂ {1, 2, 3} ⊂ N ⊂ Z ⊂ Q ⊂ R ⊂ C,  ∅ ⊂ I ⊂ C.
Venn diagrams are useful for grasping elementary relations between sets, but they can be deceptive: not all relations can be represented.
Cardinality of a set
Definition 3. A finite set A has a finite number of elements, and this number is called its cardinality, written |A|.
Boolean operations
Definition 4. Let A, B ⊂ Ω. Then we can define
the union and the intersection of A and B to be
A ∪ B = {x ∈ Ω : x ∈ A or x ∈ B} ,  A ∩ B = {x ∈ Ω : x ∈ A and x ∈ B} ;
the complement of A to be A^c = {x ∈ Ω : x ∉ A}; and the set difference and the symmetric difference of A and B to be
A \ B = A ∩ B^c = {x ∈ Ω : x ∈ A and x ∉ B},  A △ B = (A \ B) ∪ (B \ A).
Boolean operations
If {A_j}_{j=1}^∞ is an infinite collection of subsets of Ω, then
⋃_{j=1}^∞ A_j = A_1 ∪ A_2 ∪ · · · : those x ∈ Ω that belong to at least one A_j ;
⋂_{j=1}^∞ A_j = A_1 ∩ A_2 ∩ · · · : those x ∈ Ω that belong to every A_j .
The following results are easy to show (e.g., using Venn diagrams):
(A^c)^c = A,  (A ∪ B)^c = A^c ∩ B^c,  (A ∩ B)^c = A^c ∪ B^c ;
A ∩ (B ∪ C) = (A ∩ B) ∪ (A ∩ C),  A ∪ (B ∩ C) = (A ∪ B) ∩ (A ∪ C);
(⋃_{j=1}^∞ A_j)^c = ⋂_{j=1}^∞ A_j^c,  (⋂_{j=1}^∞ A_j)^c = ⋃_{j=1}^∞ A_j^c .
Partition
Definition 5. A partition of Ω is a collection of nonempty subsets A_1, . . . , A_n of Ω such that
the A_j are exhaustive, i.e., A_1 ∪ · · · ∪ A_n = Ω, and
the A_j are disjoint, i.e., A_i ∩ A_j = ∅ for i ≠ j.
A partition can also consist of an infinite collection of sets {A_j}_{j=1}^∞.
Example 7. Let Aj be the set of integers that can be divided by j, for j = 1, 2, . . .. Do the Aj
partition Ω = N?
Note to Example 6
Obviously, Aj ∩ Ai = ∅ if i 6= j. Moreover any real number x lies in A⌊x⌋ , where ⌊x⌋ is the largest
integer less than or equal to x. Therefore these sets partition R.
Note to Example 7
Note that 6 ∈ A2 ∩ A3 , so these sets do not partition N.
Cartesian product
Definition 8. The Cartesian product of two sets A, B is the set of ordered pairs
A × B = {(a, b) : a ∈ A, b ∈ B}.
Note to Example 9
{(a, 1), (a, 2), . . . , (b, 3)}.
Combinatorics: Reminders
Combinatorics is the mathematics of counting. Two basic principles:
multiplication: if I have m hats and n scarves, there are m × n different ways of wearing both a
hat and a scarf;
addition: if I have m red hats and n blue hats, then I have m + n hats in total.
In mathematical terms: if A_1, . . . , A_k are sets, then |A_1 × · · · × A_k| = |A_1| × · · · × |A_k|, and if the A_j are disjoint, then |A_1 ∪ · · · ∪ A_k| = |A_1| + · · · + |A_k|.
Permutations: Ordered selection
Definition 10. A permutation of n distinct objects is an ordered set of those objects.
Theorem 11. Given n distinct objects, the number of different permutations (without repetition) of
length r ≤ n is
n (n − 1) (n − 2) · · · (n − r + 1) = n!/(n − r)! .
Thus there are n! permutations of length n.
Theorem 12. Given n = ∑_{i=1}^r n_i objects of r different types, where n_i is the number of objects of type i that are indistinguishable from one another, the number of permutations (without repetition) of the n objects is
n!/(n_1! n_2! · · · n_r!).
Example
Example 13. A class of 20 students choose a committee of size 4 to organise a ‘voyage d’études’. In
how many different ways can they pick the committee if:
(a) there are 4 distinct roles (president, secretary, treasurer, travel agent)?
(b) there is one president, one treasurer, and two travel agents?
(c) there are two treasurers and two travel agents?
(d) their roles are indistinguishable?
Note to Example 13
(a) First choose the president, then the secretary, etc., giving 20 × 19 × 18 × 17 = 116280.
This is the number of permutations of length 4 in a group of size 20.
(b) 20 × 19 × 18 × 17/2! = 58140.
(c) 20 × 19 × 18 × 17/(2!2!) = 29070.
(d) The first could have been chosen in 20 ways, the second in 19, etc. But the final group of four
could be elected in 4! orders, so the number of ways is 20 × 19 × 18 × 17/4! = 4845.
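The four counts are easy to check numerically. The following R lines are a sketch added here (not part of the original notes), using factorial and choose:

    ## Example 13: committees of 4 chosen from a class of 20
    n_a <- 20 * 19 * 18 * 17                    # (a) four distinct roles: 116280
    n_b <- n_a / factorial(2)                   # (b) two interchangeable travel agents: 58140
    n_c <- n_a / (factorial(2) * factorial(2))  # (c) two treasurers, two travel agents: 29070
    n_d <- choose(20, 4)                        # (d) indistinguishable roles: 4845
    c(n_a, n_b, n_c, n_d)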
Combinations: unordered selection
Theorem 15. The number of ways of choosing a set of r objects from a set of n distinct objects
without repetition is
n!/{r!(n − r)!} = C(n, r).
Theorem 16. The number of ways of distributing n distinct objects into r distinct groups of size
n1 , . . . , nr , where n1 + · · · + nr = n, is
n!/(n_1! n_2! · · · n_r!).
Note to Theorem 17
The numbers of ways of choosing r objects from n is the same as the number of ways of choosing
n − r objects from n.
To choose r objects from n + 1, we first designate one of the n + 1. Then if that object is in the
sample, we must choose r − 1 from among the other n, and if not, we must choose r from the n,
which gives the result.
Suppose I have n blue hats and m red hats. Then the number of ways I can choose r hats from all
my hats equals the number of ways I can choose j red hats and r − j blue hats, summed over the
possible choices of j.
The binomial results are standard.
For the last part, with r fixed, we have
n^{−r} C(n, r) = n(n − 1) · · · (n − r + 1)/(n^r r!) → 1/r!,  n → ∞.
Partitions of integers
Theorem 18. (a) The number of distinct vectors (n1 , . . . , nr ) of positive integers, n1 , . . . , nr > 0,
satisfying n1 + · · · + nr = n, is
C(n − 1, r − 1).
(b) The number of distinct vectors (n1 , . . . , nr ) of non-negative integers, n1 , . . . , nr ≥ 0, satisfying
n1 + · · · + nr = n, is
C(n + r − 1, n).
Example 19. How many different ways are there to put 6 identical balls in 3 boxes, in such a way that
each box contains at least one ball?
Example 20. How many different ways are there to put 6 identical balls into 3 boxes?
Note to Theorem 18
(a) Line up the n balls, and note that there are n − 1 spaces between them. You must choose r − 1
out of these n − 1 spaces to place these separators, giving the stated formula.
(b) Line up the n balls and the r − 1 separators. Any distinct configurations of these n + r − 1 objects
will correspond to a different partition, so the number of these partitions is the number of ways the
balls and separators can be ordered, and this is the stated formula.
Note to Example 19
We have a total of n = 6 balls and r = 3 groups, each of which must have at least one member, so
the number is
C(6 − 1, 3 − 1) = 5!/(3! 2!) = 10.
Note to Example 20
Now there is the possibility of empty boxes, so the total number is
C(6 + 3 − 1, 6) = 8!/(6! 2!) = 28.
Reminder: Some series
Theorem 21. (a) A geometric series is of the form a, aθ, aθ^2, . . .; we have
∑_{i=0}^n aθ^i = a(1 − θ^{n+1})/(1 − θ) if θ ≠ 1, and a(n + 1) if θ = 1.
If |θ| < 1, then ∑_{i=0}^∞ θ^i = 1/(1 − θ), and
∑_{i=r}^∞ {i!/(r!(i − r)!)} θ^{i−r} = 1/(1 − θ)^{r+1},  r = 1, 2, . . . .
Small lexicon

Mathematics          English                   Français
Ω, A, B, . . .       set                       ensemble
A ∪ B                union                     union
A ∩ B                intersection              intersection
A^c                  complement of A (in Ω)    complémentaire de A (en Ω)
A \ B                difference                différence
A ∆ B                symmetric difference      différence symétrique
A × B                Cartesian product         produit cartésien
|A|                  cardinality               cardinal
{A_j}_{j=1}^n        pairwise disjoint         {A_j}_{j=1}^n disjoints deux à deux
                     partition                 partition
                     permutation               permutation
                     combination               combinaison
C(n, r)              binomial coefficient      coefficient binomial (C_n^r)
C(n; n_1, ..., n_r)  multinomial coefficient   coefficient multinomial
                     indistinguishable         indifférentiable
                     colour-blind              daltonien (daltonienne)
2 Probability slide 36
Motivation: Game of dice
We throw two fair dice, one red and one green.
(a) What is the set of possible results?
(b) Which results give a total of 6?
(c) Which results give a total of 12?
(d) Which results give an odd total?
(e) What are the probabilities of the events (b), (c), (d)?
Calculation of probabilities
We can try to calculate the probabilities of events such as (b), (c), (d) by throwing the dice numerous times and letting
P(event) ≈ (number of throws on which the event occurs)/(total number of throws).
This is an empirical rather than a mathematical answer, to be reached only after a lot of work (how many times should we roll the dice?), and it will yield different answers each time: not satisfactory!
For simple examples, we often use symmetry to calculate probabilities. This isn’t possible for more
complicated cases—we construct mathematical models, based on notions of
– random experiments
– probability spaces.
Random experiment
Definition 22. A random experiment is an ‘experiment’ whose result is (or can be defined as)
random.
Example 24. I roll 2 fair dice, one red and one green.
Example 26. The waiting time until the end of this lecture.
Andrey Nikolaevich Kolmogorov (1903–1987)
(Source: http://en.academic.ru/dic.nsf/enwiki/54484)
Definition 28. A probability space (Ω, F, P) is a mathematical object associated with a random
experiment, comprising:
a set Ω, the sample space (universe), which contains all the possible results (outcomes,
elementary events) ω of the experiment;
a collection F of subsets of Ω. These subsets are called events, and F is called the event space;
a function P : F → [0, 1] called a probability distribution, which associates a probability P(A) ∈ [0, 1] to each A ∈ F.
Sample space
The sample space Ω is the space composed of elements representing all the possible results of a
random experiment. Each element ω ∈ Ω is associated with a different result.
Ω is analogous to the universal set. It can be finite, countable or uncountable.
Ω is nonempty. (If Ω = ∅, then nothing interesting can happen.)
For simple examples with finite Ω, we often choose Ω so that each ω ∈ Ω is equiprobable:
P(ω) = 1/|Ω|, for every ω ∈ Ω.
Note to Example 29
Example 23: Here we can write Ω = {ω1 , ω2 }, where ω1 and ω2 represent Tail and Head respectively.
Example 24: Ω = {ω1 , . . . , ω36 }, representing all 36 different possibilities.
Example 25: Ω = {ωj : j = 0, 1, . . . , }, representing any non-negative number.
Example 26: Ω = {ω : ω ∈ [0, 45] minutes}, an uncountable set.
Example 27: Ω? We have to decide what we count as weather outcomes, so this is not so easy.
In general discussion we use ω as an element of Ω, but in examples it is usually easier to write H or T
or (r, g) or similar.
Event space
F is a set of subsets of Ω which represents the events of interest.
Note to Example 30
First we set up the sample space. If we write (2, 4) to mean that the red die shows 2 and the green shows 4, we have
Ω = {(r, g) : r, g = 1, . . . , 6},
on which the events A, B and C of the example are defined as subsets.
By symmetry if the two dice are fair, then |Ω| = 36, |A| = |C| = 6, |B| = 18, and |A ∩ B| = 3, so the
probabilities are
P(A) = P(C) = 6/36 = 1/6, P(B) = 18/36 = 1/2, P(A ∩ B) = 3/36 = 1/12.
Event space F, II
Definition 31. An event space F is a set of subsets of Ω such that:
(F1) F is nonempty;
(F2) if A ∈ F, then A^c ∈ F;
(F3) if {A_i}_{i=1}^∞ are all elements of F, then ⋃_{i=1}^∞ A_i ∈ F.
F is also called a sigma-algebra (en français, une tribu).
Let A, B, C, {A_i}_{i=1}^∞ be elements of F. Then the preceding axioms imply that
(a) ⋃_{i=1}^n A_i ∈ F,
(b) Ω ∈ F, ∅ ∈ F,
(c) A ∩ B ∈ F, A \ B ∈ F, A ∆ B ∈ F,
(d) ⋂_{i=1}^n A_i ∈ F.
Example 33. I roll two fair dice, one red and one green.
(a) What is my event space F1 ?
(b) I only tell my friend the total. What is his event space F2 ?
(c) My friend looks at the dice himself, but he is colour-blind. What then is his event space F3 ?
Note to Example 32
We can write Ω = {H, T}, and then have two choices of event space:
F = {∅, Ω}  or  F = {∅, {H}, {T}, Ω}.
Either of these satisfies the axioms (check this) and hence is a valid event space. Only the second, however, is interesting. In the first the only non-null event is {H, T} = Ω, which corresponds to 'the experiment was performed and a head or a tail was observed'.
Note to Example 33
(a) Since we see an outcome of the form (r, g), we can reply to any question about the outcomes; thus we take F_1 to be the set of all possible subsets of Ω = {(r, g) : r, g = 1, . . . , 6}. The ordered pair (r, g) corresponds to the event A_{r,g} = {(r, g)} ('the experiment was performed and the outcome was (r, g)'), and the 2^36 distinct elements B_j of F_1 can be constructed by taking all possible unions and intersections of the A_{r,g}. (Note that the intersection of any two or more disjoint events here gives ∅, and the union of all of them gives Ω.) This means that F_1 is the power set of {A_{r,g} : r, g = 1, . . . , 6}, and |F_1| = 2^36.
(b) If I tell him only that the ‘total is t’ for t = 2, . . . , 12, then he can reply to any question about the
total, but nothing else. So his event space F2 is based on the events T2 , . . . , T12 , where
T2 = {(1, 1)}, T3 = {(1, 2), (2, 1)}, T4 = {(1, 3), (2, 2), (3, 1)}, . . . , T12 = {(6, 6)}.
His event space therefore comprises all the possible unions and intersections of these 11 events, and therefore |F_2| = 2^11.
(c) Since he is colour-blind, he cannot tell the difference between (1, 2) and (2, 1), etc.. Thus F3 is
made up of all possible unions and intersections of the sets
{(1, 1)}, {(2, 2)}, . . . , {(6, 6)}, {(1, 2), (2, 1)}, {(1, 3), (3, 1)}, . . . , {(5, 6), (6, 5)}.
Examples
Example 34. A woman planning her future family considers the following possibilities (we suppose
that the chances of having a boy or a girl are the same each time) :
(a) have three children;
(b) keep giving birth until the first girl is born or until there are three children, stopping when one of these two situations arises;
(c) keep giving birth until there is one child of each sex or until there are three children, stopping when one of these two situations arises.
Let Bi be the event ‘i boys are born’, A the event ‘there are more girls than boys’. Calculate P(B1 )
and P(A) for (a)–(c).
Example 35 (Birthdays). n people are in a room. What is the probability that they all have a
different birthday?
Note to Example 34
We learn from this example that:
changing the protocol or stopping rule can change the observable outcomes and hence the sample
space;
the outcomes need not have the same probabilities under different stopping rules;
in some cases it is possible to compute probabilities for outcomes in one sample space by
comparing it to another sample space.
(a) We can write the sample space under this stopping rule as
Ω_1 = {BBB, BBG, BGB, GBB, BGG, GBG, GGB, GGG},
where B denotes a boy, G denotes a girl and the ordering is important. These events all have probability 1/8, by symmetry. Then B_1 = {BGG, GBG, GGB} and A = B_1 ∪ {GGG} have probabilities 3/8 and 1/2 respectively. The latter is obvious also by symmetry.
(b) Under this stopping rule the sample space is
Ω_2 = {G, BG, BBG, BBB},
but these outcomes are not equiprobable; for example B_1 = {BG} here corresponds to the event {BGG, BGB} in Ω_1 and so has probability 1/4, and A = {G} here corresponds to the event {GBB, GBG, GGB, GGG} in Ω_1, and so has probability 1/2.
(c) Under this stopping rule the sample space is
Ω_3 = {BG, GB, BBB, BBG, GGB, GGG},
noting that BG here corresponds to {BGG, BGB} in Ω_1, and likewise GB here corresponds to {GBG, GBB} in Ω_1. In this case the event B_1 = {GB, BG, GGB} in Ω_3 corresponds to {GBB, GBG, BGG, BGB, GGB} in Ω_1 and hence has probability 5/8, and in Ω_3 the event A = {GGG, GGB} has probability 1/4.
Birthdays
(Figure: probability that n people all have different birthdays, plotted against n = 0, . . . , 60.)
Note to Example 35
The sample space can be written Ω = {1, . . . , 365}^n, and each of these possibilities has probability 365^{−n}. We seek the probability of the event
A = {(i_1, . . . , i_n) : i_j ≠ i_k for j ≠ k}.
There are 365 × 364 × · · · × (365 − n + 1) = 365!/(365 − n)! ways this can happen, so the overall probability is 365!/{(365 − n)! 365^n}, which is shown in the graph.
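The curve in the figure above is easy to reproduce; the following R lines are a sketch added here, using the product form of the probability:

    ## Example 35: probability that n people all have different birthdays
    p_distinct <- function(n) prod((365 - (0:(n - 1))) / 365)
    n <- 1:60
    plot(n, sapply(n, p_distinct), type = "l",
         xlab = "n", ylab = "Probability all birthdays differ")
    sapply(c(23, 50), p_distinct)   # about 0.493 and 0.030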
Il Saggiatore, 1623
(Source: Wikipedia)
Il Saggiatore, 1623
La filosofia è scritta in questo grandissimo libro che continuamente ci sta aperto innanzi
a gli occhi (io dico l’universo), ma non si può intendere se prima non s’impara a intender la
lingua, e conoscer i caratteri, ne’ quali è scritto. Egli è scritto in lingua matematica, e i
caratteri son triangoli, cerchi, ed altre figure geometriche, senza i quali mezi è impossibile a
intenderne umanamente parola; senza questi è un aggirarsi vanamente per un oscuro
laberinto.
Philosophy is written in this grand book that stands continually open before our eyes (I mean the universe), but it cannot be understood unless one first learns to comprehend the language and to recognise the characters in which it is written. It is written in the language of mathematics, and its characters are triangles, circles, and other geometric figures, without which it is humanly impossible to understand a single word of it; without these, one wanders about in a dark labyrinth.
Each of the totals 9 and 10 of three dice can be obtained from six unordered combinations:
9 = 6 + 2 + 1 = 5 + 3 + 1 = 5 + 2 + 2 = 4 + 4 + 1 = 4 + 3 + 2 = 3 + 3 + 3,
10 = 6 + 3 + 1 = 6 + 2 + 2 = 5 + 4 + 1 = 5 + 3 + 2 = 4 + 4 + 2 = 4 + 3 + 3,
so the totals 9 and 10 are equally probable. True or false?
Probability function P
Definition 36. A probability distribution P assigns a probability to each element of the event space F, with the following properties:
(P1) if A ∈ F, then 0 ≤ P(A) ≤ 1;
(P2) P(Ω) = 1;
(P3) if {A_i}_{i=1}^∞ are pairwise disjoint (i.e., A_i ∩ A_j = ∅, i ≠ j), then
P(⋃_{i=1}^∞ A_i) = ∑_{i=1}^∞ P(A_i).
Properties of P
Theorem 37. Let A, B, {A_i}_{i=1}^∞ be events of the probability space (Ω, F, P). Then
(a) P(∅) = 0;
(b) P(A^c) = 1 − P(A);
(c) P(A ∪ B) = P(A) + P(B) − P(A ∩ B); if A ∩ B = ∅, then P(A ∪ B) = P(A) + P(B);
(d) if A ⊂ B, then P(A) ≤ P(B), and P(B \ A) = P(B) − P(A);
(e) P(⋃_{i=1}^∞ A_i) ≤ ∑_{i=1}^∞ P(A_i) (Boole's inequality);
(f) if A_1 ⊂ A_2 ⊂ · · · , then lim_{n→∞} P(A_n) = P(⋃_{i=1}^∞ A_i);
(g) if A_1 ⊃ A_2 ⊃ · · · , then lim_{n→∞} P(A_n) = P(⋂_{i=1}^∞ A_i).
Note to Theorem 37
(a) Since ∅ ∩ A = ∅ for any A ∈ F, we can apply (P3) to a finite number of sets, just by adding an infinite number of ∅s. In particular, Ω = Ω ∪ ∅ ∪ ∅ ∪ · · · , and these are pairwise disjoint, so
P(Ω) = P(Ω) + P(∅) + P(∅) + · · · ,
and since P(Ω) = 1 is finite, this forces P(∅) = 0, as required.
(f) Now A_i ⊂ A_{i+1} for every i, so (A_{i+1} \ A_i) ∩ (A_{j+1} \ A_j) = ∅ when i ≠ j (draw a picture), and A_n = ⋃_{i=1}^n (A_i \ A_{i−1}), where we have set A_0 = ∅. Note that P(A_{i+1} \ A_i) = P(A_{i+1}) − P(A_i). Thus by (P3) we have
P(⋃_{i=1}^∞ A_i) = P(A_1) + ∑_{i=2}^∞ P(A_i \ A_{i−1})
              = P(A_1) + ∑_{i=2}^∞ {P(A_i) − P(A_{i−1})}
              = lim_{n→∞} [ P(A_1) + ∑_{i=2}^n {P(A_i) − P(A_{i−1})} ]
              = lim_{n→∞} P(A_n).
Continuity of P
Reminder: A function f is continuous at x if for every sequence {x_n} such that x_n → x, we have f(x_n) → f(x).
Parts (f) and (g) of Theorem 37 can be extended to show that for all sequences of sets {A_n} for which A_n → A (in a suitable sense), we have P(A_n) → P(A); in this sense P is continuous.
Inclusion-exclusion formulae
If A_1, . . . , A_n are events of (Ω, F, P), then
P(A_1 ∪ · · · ∪ A_n) = ∑_i P(A_i) − ∑_{i<j} P(A_i ∩ A_j) + ∑_{i<j<k} P(A_i ∩ A_j ∩ A_k) − · · · + (−1)^{n+1} P(A_1 ∩ · · · ∩ A_n).
For n = 3, applying part (c) of Theorem 37 to (A_1 ∪ A_2) ∪ A_3 and then expanding P{(A_1 ∪ A_2) ∩ A_3} = P{(A_1 ∩ A_3) ∪ (A_2 ∩ A_3)} in the same way gives the formula, which is what we want, since the last term is P(A_1 ∩ A_2 ∩ A_3). The general formula follows by iteration of this argument.
Note to inclusion-exclusion formulae: II
For example, with n = 4, we have
P(A_1 ∪ A_2 ∪ A_3 ∪ A_4) = ∑_i P(A_i) − ∑_{i<j} P(A_i ∩ A_j) + ∑_{i<j<k} P(A_i ∩ A_j ∩ A_k) − P(A_1 ∩ A_2 ∩ A_3 ∩ A_4),
where there are 4, 6, 4, and 1 terms in the sums involving 1, 2, 3, and 4 events, respectively.
Example 38. What is the probability of getting at least one 6 when I roll three fair dice?
Example 39. An urn contains 1000 lottery tickets numbered from 1 to 1000. One ticket is drawn at
random. Before the draw a fairground showman offers to pay $3 to whoever will give him $2, if the
number on the ticket is divisible by 2, 3, or 5. Would you give him your $2 before the draw? (You lose
your money if the ticket is not divisible by 2, 3, or 5.)
Note to Example 38
Let Ai be the event there is a 6 on die i; we want P(A1 ∪ A2 ∪ A3 ). Now by symmetry P(Ai ) = 1/6,
P(Ai ∩ Aj ) = 1/36, and P(A1 ∩ A2 ∩ A3 ) = 1/216. Therefore the second inclusion-exclusion formula
gives
P(A_1 ∪ A_2 ∪ A_3) = 3/6 − 3/36 + 1/216 = 91/216.
Note to Example 39
Here we can write Ω = {1, . . . , 1000}, and let D_i be the event that the number drawn is divisible by i. We want P(D_2 ∪ D_3 ∪ D_5), which the inclusion-exclusion formula gives in terms of the probabilities of the D_i and their intersections.
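Rather than evaluating the inclusion-exclusion sum by hand, a direct enumeration in R (a sketch added here) gives the probability and the expected gain of accepting the offer:

    ## Example 39: win if the ticket number is divisible by 2, 3 or 5
    x <- 1:1000
    p_win <- mean(x %% 2 == 0 | x %% 3 == 0 | x %% 5 == 0)
    p_win              # 0.734
    3 * p_win - 2      # expected gain of paying $2 for the bet: about 0.20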
2.2 Conditional Probability slide 62
Conditional probability
Definition 40. Let A, B be events of the probability space (Ω, F, P), such that P(B) > 0. Then the
conditional probability of A given B is
P(A | B) = P(A ∩ B)/P(B).
If P(B) = 0, we adopt the convention P(A ∩ B) = P(A | B)P(B), so both sides are equal to zero.
Thus
P(A) = P(A ∩ B) + P(A ∩ B c ) = P(A | B)P(B) + P(A | B c )P(B c )
even if P(B) = 0 or P(B c ) = 0.
Example 41. We roll two fair dice, one red and one green. Let A and B be the events ‘the total
exceeds 8’, and ‘we get 6 on the red die’. If we know that B has occurred, how does P(A) change?
Note to Example 41
We first draw a square containing pairs {(r, g) : r, g = 1, . . . , 6} to display the totals of the two dice.
By inspection, and since all the individual outcomes have probability 1/36, we have
P(A) = (1 + 2 + 3 + 4)/36 = 5/18, P(B) = 6/36 = 1/6, and thus by definition the conditional
probability is P(A | B) = P(A ∩ B)/P(B) = (4/36)/(1/6) = 2/3.
Thus including the information that B has occurred changes the probability of A: conditioning can be
interpreted as inserting information into the calculation of probabilities, resulting in a new probability
space, as we see in the next theorem.
In particular, for a fixed event B with P(B) > 0, the function Q(A) = P(A | B) is itself a probability distribution: it satisfies (P1) and (P2), and if {A_i}_{i=1}^∞ are pairwise disjoint, then
Q(⋃_{i=1}^∞ A_i) = ∑_{i=1}^∞ Q(A_i).
Thus conditioning on different events allows us to construct lots of different probability distributions,
starting with a single probability distribution.
Note to Theorem 42
We just need to check the axioms. If A ∈ F, then
0 ≤ Q(A) = P(A ∩ B)/P(B) ≤ 1,
since 0 ≤ P(A ∩ B) ≤ P(B); also Q(Ω) = P(Ω ∩ B)/P(B) = P(B)/P(B) = 1; and finally,
Q(⋃_{i=1}^∞ A_i) = P(⋃_{i=1}^∞ A_i ∩ B)/P(B) = P{⋃_{i=1}^∞ (A_i ∩ B)}/P(B) = ∑_{i=1}^∞ P(A_i ∩ B)/P(B) = ∑_{i=1}^∞ Q(A_i),
using the properties of P(·) and the fact that if A_1, A_2, . . . are pairwise disjoint, then so too are A_1 ∩ B, A_2 ∩ B, . . ..
Essay towards solving a problem in the doctrine of chances. (1763/4) Philosophical Transactions
of the Royal Society of London.
(Source: Wikipedia)
Bayes’ theorem
Theorem 43 (Law of total probability). Let {B_i}_{i=1}^∞ be pairwise disjoint events (i.e., B_i ∩ B_j = ∅, i ≠ j) of the probability space (Ω, F, P), and let A be an event satisfying A ⊂ ⋃_{i=1}^∞ B_i. Then
P(A) = ∑_{i=1}^∞ P(A ∩ B_i) = ∑_{i=1}^∞ P(A | B_i)P(B_i).
Theorem 44 (Bayes). Suppose that the conditions above are satisfied, and that P(A) > 0. Then
P(B_j | A) = P(A | B_j)P(B_j) / ∑_{i=1}^∞ P(A | B_i)P(B_i),  j ∈ N.
These results are also true if the number of Bi is finite, and if the Bi partition Ω.
Example
Example 45. You suspect that the man in front of you at the security check at the airport is a
terrorist. Knowing that one person out of 106 is a terrorist, and that a terrorist is detected by the
security check with a probability of 0.9999, but that the alarm goes off when an ordinary person goes
through with a probability of 10−5 , what is the probability that he is a terrorist, given that the alarm
goes off when he passes through security?
Note to Example 45
Let A and T respectively denote the events 'the alarm sounds' and 'he is a terrorist'. Then we seek
P(T | A) = P(A | T)P(T) / {P(A | T)P(T) + P(A | T^c)P(T^c)} = 0.9999 × 10^{−6} / {0.9999 × 10^{−6} + 10^{−5} × (1 − 10^{−6})} ≈ 0.09,
so, despite the alarm, the probability that he is a terrorist is only about 9%.
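A minimal R sketch of this Bayes calculation, added here (the variable names are mine):

    ## Example 45: P(terrorist | alarm)
    p_T  <- 1e-6      # P(terrorist)
    p_AT <- 0.9999    # P(alarm | terrorist)
    p_AO <- 1e-5      # P(alarm | not a terrorist)
    p_AT * p_T / (p_AT * p_T + p_AO * (1 - p_T))   # about 0.09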
Multiple conditioning
Theorem 46 ('Prediction decomposition'). Let A_1, . . . , A_n be events in a probability space. Then
P(A_1 ∩ · · · ∩ A_n) = P(A_n | A_1 ∩ · · · ∩ A_{n−1}) · · · P(A_3 | A_1 ∩ A_2) P(A_2 | A_1) P(A_1).
Note to Theorem 46
Just iterate. For example, if we let B = A_1 ∩ A_2 and note that P(B) = P(A_2 | A_1)P(A_1) by the definition of conditional probability, then
P(A_1 ∩ A_2 ∩ A_3) = P(A_3 | B)P(B) = P(A_3 | A_1 ∩ A_2)P(A_2 | A_1)P(A_1),
on using the definition of conditional probability twice. For the general case, just extend this idea, by setting B = A_1 ∩ · · · ∩ A_{n−1} and iterating, as required.
Example
Example 47. n men go to a dinner. Each leaves his hat in the cloakroom. When they leave, having
thoroughly sampled the local wine, they choose their hats randomly.
(a) What is the probability that no one chooses his own hat?
(b) What is the probability that exactly r men choose their own hats?
(c) What happens when n is very big?
Note to Example 47
This is an example of the many types of matching problem, going back to Montmort (1708).
The sample space here is the set of permutations of the numbers {1, . . . , n}, of size n!.
Let A_i denote the event that the ith hat is on the ith head, and note that P(A_i) = 1/n,
P(A_i ∩ A_j) = P(A_i | A_j)P(A_j) = 1/(n − 1) × 1/n,  . . . ,  P(A_1 ∩ · · · ∩ A_r) = (n − r)!/n!,
using the prediction decomposition. Thus the probability that r specified hats are on the right heads is (n − r)!/n!. Let p_n(k) denote the probability that exactly k out of n men get the right hat.
(a) We want to compute p_n(0) = 1 − P(A_1 ∪ · · · ∪ A_n), which can be found with the inclusion-exclusion formula.
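The limiting behaviour in (c) can be explored by simulation. The R sketch below is an addition, with n = 10 chosen only for illustration; it estimates p_n(0) and compares it with e^{−1}:

    ## Example 47: hats returned at random
    set.seed(1)
    n <- 10
    matches <- replicate(1e5, sum(sample(n) == 1:n))  # number of men with their own hat
    mean(matches == 0)    # estimate of p_n(0), close to exp(-1)
    exp(-1)               # 0.3679...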
Independent events
Intuitively, saying that ‘A and B are independent’ means that the occurrence of one of the two does
not affect the occurrence of the other. That is to say that, P(A | B) = P(A), so the knowledge that
B has occurred leaves P(A) unchanged.
Note to Example 48
The sample space can be written as Ω = {BB, BG, GB, GG}, in an obvious notation, and the events
that 'the ith child is a boy' are B_1 = {BB, BG} and B_2 = {BB, GB}. Then
(a) P(B_2 | B_1) = P(B_1 ∩ B_2)/P(B_1) = P({BB})/P(B_1) = (1/4)/(1/2) = 1/2 = P(B_2). Thus B_2 and B_1 are independent.
(b) the event 'at least one child is a boy' is C = B_1 ∪ B_2 = {BB, BG, GB}, and the event 'two boys' is D = {BB}, so now we seek P(D | C) = (1/4)/(3/4) = 1/3 ≠ P(D). Thus D and C are not independent.
Note also the importance of precise language: in (a) we know that a specific child is a boy, and in (b)
we are told only that one of the two children is a boy. These different pieces of information change the
probabilities, because the conditioning event is not the same.
Independence
Definition 49. Let (Ω, F, P) be a probability space. Two events A, B ∈ F are independent (we
write A ⊥⊥ B) if and only if
P(A ∩ B) = P(A)P(B).
In accordance with our intuition, this implies that
P(A | B) = P(A ∩ B)/P(B) = P(A)P(B)/P(B) = P(A),
when P(B) > 0.
Example 50. A pack of cards is well shuffled and one card is picked at random. Are the events A 'the card is an ace' and H 'the card is a heart' independent? What can we say about the events A and K 'the card is a king'?
Note to Example 50
The sample space Ω consists of the 52 cards, which are equiprobable. P(A) = 4/52 = 1/13 and
P(H) = 13/52 = 1/4, and P(A ∩ H) = 1/52 = P(A)P(H), so A and H are independent. However P(A ∩ K) = 0 ≠ P(A)P(K), so these are not independent.
Types of independence
Definition 51. (a) The events A_1, . . . , A_n are (mutually) independent if for all sets of indices F ⊂ {1, . . . , n},
P(⋂_{i∈F} A_i) = ∏_{i∈F} P(A_i).
(b) The events A_1, . . . , A_n are pairwise independent if A_i ⊥⊥ A_j for all i ≠ j.
(c) The events A_1, . . . , A_n are conditionally independent given B if for all sets of indices F ⊂ {1, . . . , n},
P(⋂_{i∈F} A_i | B) = ∏_{i∈F} P(A_i | B).
A few remarks
Mutual independence entails pairwise independence, but the converse is only true when n = 2.
Mutual independence neither implies nor is implied by conditional independence.
Independence is a key idea that greatly simplifies probability calculations. In practice, it is essential
to verify whether events are independent, because undetected dependence can greatly modify the
probabilities.
Example 52. A family has two children. Show that the events ‘the first born is a boy’, ‘the second
child is a boy’, and ‘there is exactly one boy’ are pairwise independent but not mutually independent.
Note to Example 52
The sample space is Ω = {BB, BG, GB, GG}, so P(B1 ) = 1/2, P(B2 ) = 1/2, P(1B) = 1/2, using
an obvious notation.
Also P(B1 ∩ B2 ) = P(B1 ∩ 1B) = P(B2 ∩ 1B) = 1/4, but P(B1 ∩ B2 ∩ 1B) = 0, while the product of
all three probabilities is 1/8.
Example 53. In any given year, the probability that a male driver has an accident and claims on his
insurance is µ, independently of other years. The probability for a female driver is λ < µ. An insurer
has the same number of male drivers and female drivers, and picks one of them at random.
(a) Give the probability that he (or she) makes a claim this year.
(b) Give the probability that he (or she) makes claims in two consecutive years.
(c) If the company randomly selects a person that made a claim, give the probability that (s)he makes
a claim the following year.
(d) Show that the knowledge that a claim was made in one year increases the probability that a claim
is made in the following year.
Note to Example 53
Let Ar denote the event that the selected driver has accidents in r successive years, and M denote the
event that (s)he is male.
(a) Here the law of total probability gives
P(A_1) = P(A_1 | M)P(M) + P(A_1 | M^c)P(M^c) = (µ + λ)/2.
(b) Independence of accidents from year to year, for each driver individually, gives
P(A_2) = P(A_2 | M)P(M) + P(A_2 | M^c)P(M^c) = (µ^2 + λ^2)/2.
(c) The required probability is P(A_2)/P(A_1) = (µ^2 + λ^2)/(µ + λ).
(d) Since µ^2 + λ^2 ≥ (µ + λ)^2/2, we have P(A_2)/P(A_1) ≥ (µ + λ)/2 = P(A_1). Thus they would only be equal if λ = µ, i.e., with no difference between the sexes.
Series-Parallel Systems
An electrical system has components labelled 1, . . . , n, which fail independently of one another. Let F_i be the event 'the ith component is faulty', with P(F_i) = p_i. The event S, 'the system fails', occurs if current cannot pass from one end of the system to the other. If the components are arranged in parallel, then
P_P(S) = P(F_1 ∩ · · · ∩ F_n) = ∏_{i=1}^n p_i ,
whereas if they are arranged in series, then
P_S(S) = P(F_1 ∪ · · · ∪ F_n) = 1 − ∏_{i=1}^n (1 − p_i).
Reliability
Example 54 (Chernobyl). A nuclear power station depends on a security system whose components
are arranged according to:
The components fail independently with probability p, and the system fails if current cannot pass from
A to B.
(a) What is the probability that the system fails?
(b) The components are made in batches, which can be good or bad. For a good batch, p = 10−6 ,
whereas for a bad batch p = 10−2 . The probability that a batch is good is 0.99. What is the
probability that the system fails (i) if the components come from different batches? (ii) if all the
components come from the same batch?
Note to Example 54
The two parallel systems in the upper right and lower branches have respective probabilities p^3 and p_l = p^2 of failing, so the overall probability of failure for the top branch, which is a series system, is p_u = 1 − (1 − p)(1 − p^3). The upper and lower branches are in parallel, so the probability that they both fail is p_u × p_l = p^2 {1 − (1 − p)(1 − p^3)} = f(p), say.
Such computations can be used recursively to compute failure probabilities for very large systems.
The probability of failure of a component selected randomly from the two sorts of batches is
q = 0.99 × 10^{−6} + 0.01 × 10^{−2},
so the probability of failure in case (i) is f(q) = 1.029995 × 10^{−12}, whereas in case (ii) it is 0.99 f(10^{−6}) + 0.01 f(10^{−2}), approximately 10^{−8}, which is far larger.
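These two numbers can be checked with a few lines of R, added here as a sketch (the function name f follows the note above):

    ## Example 54: system failure probability f(p) = p^2 {1 - (1 - p)(1 - p^3)}
    f <- function(p) p^2 * (1 - (1 - p) * (1 - p^3))
    q <- 0.99 * 1e-6 + 0.01 * 1e-2       # failure probability of a randomly chosen component
    f(q)                                 # case (i), different batches: about 1.03e-12
    0.99 * f(1e-6) + 0.01 * f(1e-2)      # case (ii), same batch: about 1.0e-08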
2.4 Edifying Examples slide 78
Female smokers
Survival after 20 years for 1314 women in the town of Whickham, England (Appleton et al., 1996, The
American Statistician). The columns contain: number of dead women after 20 years/number of
surviving women at the start of the study (%).
Simpson’s paradox
Define the events ‘dead after 20 years’, D, ‘smoker’, S, and ‘in age category a at the start’, A = a.
For almost every a we have
P(D | S, A = a) > P(D | S^c, A = a),
but
P(D | S) < P(D | S^c).
Note that
P(D | S) = ∑_a P(D | S, A = a) P(A = a | S),  P(D | S^c) = ∑_a P(D | S^c, A = a) P(A = a | S^c),
so if the probabilities P(D | S, A = a) and P(D | S^c, A = a) vary a lot with a, weighting them by the age distributions P(A = a | S) and P(A = a | S^c), which differ between smokers and non-smokers, can reverse the order of the inequalities.
This is an example of Simpson’s paradox: ‘forgetting’ conditioning can change the conclusion of a
study.
She was convicted in November 1999, then released in January 2003, because it turned out that some pathology evidence suggesting her innocence had not been disclosed to her lawyer. As a result of her case, the Attorney-General ordered a review of hundreds of other cases, and two other women in the same situation were released from jail.
The rates of SIDS
Note on Sally Clark story
Estimated probabilities: How were the probabilities obtained? What is their accuracy? There are
very few SIDS deaths, and the number 1/8543 may be based on as few as 4 SIDS deaths. Using
standard methods, the estimated probability could be from 0.04 to 0.32 deaths/1000 live births, so
(for example), the figure of 1/73 million could be much larger.
Ecological fallacy: Even if we accept the argument above, the SUDI study conflates a lot of
different types of families and cases: there is no reason to suppose that the marginal probability of
1/8500 applies to any particular individual (think of Simpson’s paradox, which we just met).
Independence? If there is a genetic or environmental factor leading to SIDS, then the probability
of two deaths might be much higher than claimed. Just suppose that a genetic factor G is present
in 0.1% of families, and leads to a probability of death of 1/10 for each child, and that conditional
on G or Gc , deaths are independent. Then we might have
P(two deaths) = P(two deaths | G)P(G) + P(two deaths | G^c)P(G^c)
             ≈ (1/10)^2 × 0.001 + (1/8500)^2 × 0.999 ≈ 10^{−5} ≫ 1/(73 × 10^6).
Prosecutors’ Fallacy: The probability calculated was P( two deaths | innocent ), whereas what
is wanted is P( innocent | two deaths ). To get the latter we need to apply Bayes’ theorem. Let
E denote the evidence observed (two deaths), and C denote culpability. Then we have
P(C^c | E) = P(E | C^c)P(C^c) / {P(E | C^c)P(C^c) + P(E | C)P(C)},
and we see that in order to compute the required probability, we have to have some estimates of P(C). Suppose that P(C) = 10^{−6} and that P(E | C) = 1, as murdering two of your own children is probably quite rare. Then even using the probabilities above, Bayes' theorem would give
P(C^c | E) ≈ 0.014 ≈ 14/10^3,
which, though small, is nothing like as small as 1/(73 × 10^6). Thus even accepting the 'squaring
of probabilities’, the case for the prosecution is not nearly as strong as the original argument
suggested.
3 Random Variables slide 85
3.1 Basic Ideas slide 87
Random variables
We usually need to consider random numerical quantities.
Example 55. We roll two fair dice, one red and one green. Let X be the total of the sides facing up.
Find all possible values of X, and the corresponding probabilities.
In particular, we set Ax = {ω ∈ Ω : X(ω) = x}. Note that we must have Ax ∈ F for every x ∈ R, in
order to calculate P(X = x).
Note to Example 55
Draw a grid. X takes values in DX = {2, . . . , 12}, and so is clearly a discrete random variable. By
symmetry the 36 points in Ω are equally likely, so, for example,
P(X = 3) = P({(1, 2), (2, 1)}) = 2/36.
Thus the probabilities for {2, 3, 4 . . . , 12} are respectively
1/36, 2/36, 3/36, 4/36, 5/36, 6/36, 5/36, 4/36, 3/36, 2/36, 1/36.
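These probabilities can be checked in R by enumerating the 36 equally likely outcomes (a sketch, not part of the original notes):

    ## Example 55: distribution of the total of two fair dice
    totals <- outer(1:6, 1:6, "+")   # all 36 outcomes (r, g)
    table(totals) / 36               # PMF of X on 2, ..., 12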
Examples
Example 58. We toss a coin repeatedly and independently. Let X be the random variable
representing the number of throws until we first get heads. Calculate
P(X = 3), P(X = 15), P(X ≤ 3.5), P(X > 1.7), P(1.7 ≤ X ≤ 3.5).
Example 59. A natural set Ω when I am playing darts is the wall on which the dart board is hanging.
The dart lands on a point ω ∈ Ω ⊂ R2 . My score is X(ω) ∈ DX = {0, 1, . . . , 60}.
Note to Example 58
X takes values in {1, 2, 3, . . .} = N, and so is clearly a discrete random variable, with countable
support.
Let p = P(F ); then the event X = 3 corresponds to two failures, each with probability 1 − p, followed
by a success, with probability p, giving P(X = 3) = (1 − p)2 p by independence of the successive trials.
Likewise P(X = 15) = (1 − p)^{14} p, and
P(X ≤ 3.5) = P(X ∈ {1, 2, 3}) = p + (1 − p)p + (1 − p)^2 p = 1 − (1 − p)^3,
and similarly
P(X > 1.7) = P(X ≥ 2) = 1 − p,  P(1.7 ≤ X ≤ 3.5) = P(X ∈ {2, 3}) = (1 − p)p + (1 − p)^2 p.
Note to Example 59
Here an infinite Ω ⊂ R2 is mapped onto the finite set {0, . . . , 60}. Even though the underlying Ω is
uncountable, the support of X is countable.
Bernoulli random variables
Definition 60. A random variable that takes only the values 0 and 1 is called an indicator variable,
or a Bernoulli random variable, or a Bernoulli trial.
Example 61. Suppose that n identical coins are tossed independently, let H_i be the event 'we get heads for the ith coin', and let I_i = I(H_i) be the indicator of this event. Then what do
X = I_1 + · · · + I_n,  Y = I_1(1 − I_2)(1 − I_3),  Z = ∑_{i=1}^{n−1} I_i(1 − I_{i+1})
represent?
Note to Example 61
If n = 3, then we can write the sample space as
Ω = {T T T, T T H, T HT, HT T, T HH, HT H, HHT, HHH}. Clearly DX = {0, 1, 2, 3}, and
X is the total number of heads in the first n tosses, Y = 1 if and only if the sequence starts HTT,
and Z counts the number of times a 1 is followed by a 0 in the sequence of n tosses.
Mass functions
A random variable X associates probabilities to subsets of R. In particular, when X is discrete we set
A_x = {ω ∈ Ω : X(ω) = x}.
Definition 62. The probability mass function (PMF) of a discrete random variable X is
f_X(x) = P(X = x) = P(A_x),  x ∈ R.
Binomial random variable
Example 63 (Example 61 continued). Give the PMFs and supports of Ii , of Y and of X.
A random variable X with PMF
f_X(x) = C(n, x) p^x (1 − p)^{n−x},  x = 0, 1, . . . , n,  0 ≤ p ≤ 1,
is called binomial. We write X ∼ B(n, p), and call n the denominator and p the probability of success. With n = 1, this is a Bernoulli variable.
Note to Example 63
Ii takes values 0 and 1, with probabilities P(Ii = 1) = p, and P(Ii = 0) = 1 − p.
Y is also binary with P(Y = 1) = p(1 − p)2 , P(Y = 0) = 1 − p(1 − p)2 .
X takes values 0, 1, . . . , n, with binomial probabilities (see below).
(Figure: binomial probability mass functions, including B(20, 0.1) and B(40, 0.9); f(x) against x.)
Examples
Example 65. A multiple choice test contains 20 questions. For each question you must choose the
correct answer amongst 5 possible answers. A pass is obtained with 10 correct answers. A student
picks his answers at random.
Give the distribution for his number of correct answers.
What is the probability that he will pass the test?
Note to Example 65
Since n = 20 and p = 1/5 = 0.2, the number of correct replies is X ∼ B(20, 0.2). The probability of passing is
P(X ≥ 10) = ∑_{x=10}^{20} C(20, x) 0.2^x (1 − 0.2)^{20−x} ≈ 0.0026.
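In R the same probability can be obtained with the binomial functions (a sketch added here):

    ## Example 65: X ~ B(20, 0.2), probability of at least 10 correct answers
    sum(dbinom(10:20, size = 20, prob = 0.2))   # about 0.0026
    1 - pbinom(9, size = 20, prob = 0.2)        # same value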
Geometric distribution
Definition 66. A geometric random variable X with success probability p has PMF
f_X(x) = p(1 − p)^{x−1},  x = 1, 2, . . . ,  0 < p ≤ 1;
we write X ∼ Geom(p).
This models the waiting time until a first event, in a series of independent trials having the same success probability.
Example 67. To start a board game, m players each throw a die in turn. The first to get six begins.
Give the probabilities that the 3rd player will begin on his first throw of the die, that he will begin, and
of waiting for at least 6 throws of the die before the start of the game.
Note to Example 67
In this case D_X = N.
Here p = 1/6, so the probability that the third person starts on his first throw of the die is (5/6)^2 × 1/6 ≈ 0.116. He starts if the first six appears on throw 3, m + 3, 2m + 3, . . ., and this has probability
∑_{i=0}^∞ P(X = 3 + im) = ∑_{i=0}^∞ p(1 − p)^{3+im−1} = p(1 − p)^2 ∑_{i=0}^∞ (1 − p)^{im} = p(1 − p)^2 / {1 − (1 − p)^m},
where p = 1/6.
The probability of waiting for at least 6 throws of the die is (1 − p)^6 ≈ 0.335.
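A short R sketch of these three calculations, added here; the number of players m = 4 is an assumption made only for illustration:

    ## Example 67: first player to throw a six starts the game
    p <- 1/6
    m <- 4                               # assumed number of players
    (1 - p)^2 * p                        # player 3 starts on his first throw: ~0.116
    p * (1 - p)^2 / (1 - (1 - p)^m)      # player 3 starts at all
    (1 - p)^6                            # at least 6 throws before the game starts: ~0.335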
Note to Theorem 68
Since P(X > n) = (1 − p)^n, we seek
P(X > n + m | X > m) = (1 − p)^{m+n}/(1 − p)^m = (1 − p)^n = P(X > n).
Thus we see that there is a ‘lack of memory’: knowing that X > m does not change the probability
that we have to wait at least another n trials before seeing the event.
A negative binomial random variable X with parameters n and p has PMF
f_X(x) = C(x − 1, n − 1) p^n (1 − p)^{x−n},  x = n, n + 1, . . . ,  0 < p ≤ 1;
we write X ∼ NegBin(n, p). It models the waiting time until the nth success in a series of independent trials having the same success probability.
Example 70. Give the probability of seeing 2 heads before 5 tails in repeated tosses of a coin.
Note to Example 70
This is the probability that X ≤ 6, where X is the waiting time for n = 2 heads. It is
∑_{x=2}^{6} C(x − 1, 2 − 1) p^2 (1 − p)^{x−2} = C(1, 1) p^2 + C(2, 1) p^2 (1 − p) + C(3, 1) p^2 (1 − p)^2 + C(4, 1) p^2 (1 − p)^3 + C(5, 1) p^2 (1 − p)^4.
If we assume that the coin is fair, so p = 0.5, R gives 0.890625.
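The value referred to above can be reproduced either by summing the terms directly or with R's built-in negative binomial distribution, which counts failures before the nth success (a sketch added here):

    ## Example 70: P(2 heads before 5 tails) for a fair coin
    p <- 0.5
    x <- 2:6
    sum(choose(x - 1, 1) * p^2 * (1 - p)^(x - 2))   # 0.890625
    pnbinom(4, size = 2, prob = 0.5)                # same value: at most 4 tails before the 2nd head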
Geometric and negative binomial PMFs
(Figure: probability mass functions of Geom(0.5), Geom(0.1), NegBin(4, 0.5) and NegBin(6, 0.3); f(x) against x.)
If X ∼ NegBin(α, p) with integer α, then Y = X − α counts the number of failures before the αth success; replacing α by any real α > 0 gives the general form of the negative binomial distribution, with PMF
f_Y(y) = {Γ(y + α)/(Γ(α) y!)} p^α (1 − p)^y,  y = 0, 1, 2, . . . ,  0 ≤ p ≤ 1,  α > 0,
where
Γ(α) = ∫_0^∞ u^{α−1} e^{−u} du,  α > 0,
is the gamma function. The principal properties of Γ(α) are:
Γ(1) = 1;
Γ(α + 1) = αΓ(α), α > 0;
Γ(n) = (n − 1)!, n = 1, 2, 3, . . . ;
Γ(1/2) = √π.
Hypergeometric distribution
Definition 71. We draw a sample of m balls without replacement from an urn containing w white
balls and b black balls. Let X be the number of white balls drawn. Then
P(X = x) = C(w, x) C(b, m − x) / C(w + b, m),  x = max(0, m − b), . . . , min(w, m).
Example 72. I leave for a camping trip in Ireland with six tins of food, two of which contain fruit. It
pours with rain, and the labels come off the tins. If I pick three of the six tins at random, find the
distribution of the number of tins of fruit among the three I have chosen.
Note to Example 72
White balls correspond to fruit tins, black balls to others, so w = 2, b = 4, and I take m = 3.
Therefore the number of fruit tins X drawn has probability
P(X = x) = C(2, x) C(4, 3 − x) / C(6, 3),  x = 0, 1, 2,
and some calculation gives P(X = 0) = 1/5, P(X = 1) = 3/5, P(X = 2) = 1/5.
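R's hypergeometric functions give the same distribution (a sketch added here):

    ## Example 72: number of fruit tins among 3 chosen from 6 (2 fruit, 4 other)
    dhyper(0:2, m = 2, n = 4, k = 3)    # 0.2 0.6 0.2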
Capture-recapture
Example 73. In order to estimate the number of fish N in a lake, we first catch r fish, mark them,
and let them go. After having waited long enough for the fish population to become well-mixed, we
catch another sample of size s.
Find the distribution of the number of marked fish, M , in this sample.
Show that the value of N which maximises P(M = m) is ⌊rs/m⌋, and calculate the best
estimation of N when s = 50, r = 40, and m = 4.
The basic idea behind this example is used to estimate the sizes of populations of endangered species,
the number of drug addicts or of illegal immigrants in human populations, etc. One practical problem
often encountered is that certain individuals become harder to recapture, whereas others enjoy it; thus
the probabilities of recapture are heterogeneous, unlike in the example above.
Note to Example 73
The total number of fish is N, of which r are marked and N − r unmarked. The distribution of M is
P_N(M = m) = C(r, m) C(N − r, s − m) / C(N, s),  m = max(0, s + r − N), . . . , min(r, s),
and the ratio P_N(M = m)/P_{N−1}(M = m) exceeds one provided that (after a little algebra) rs/m > N. Hence the largest value of N up to which the probability increases is N̂ = ⌊rs/m⌋, which therefore maximises the probability, because we can write
P_N(M = m) = {P_N(M = m)/P_{N−1}(M = m)} × · · · × {P_{N_min+1}(M = m)/P_{N_min}(M = m)} × P_{N_min}(M = m),
where the latter probability is for the smallest value N_min of N for which the probability that M = m is positive.
In the example given, N̂ = ⌊50 × 40/4⌋ = 500.
The behaviour of such estimators can be very poor.
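A short R sketch, added here, plots P_N(M = m) against N (roughly the right-hand panel of the figure below) and confirms the maximising value; the function name prob_m is mine:

    ## Example 73: capture-recapture with r marked fish, sample of size s, m recaptured
    r <- 40; s <- 50; m_obs <- 4
    prob_m <- function(N) dhyper(m_obs, m = r, n = N - r, k = s)
    N <- seq(100, 2000, by = 10)
    plot(N, sapply(N, prob_m), type = "l", ylab = "P_N(M = m)")
    floor(r * s / m_obs)                # maximising value: 500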
Hypergeometric PMFs
(Figure: probability mass functions of M (left) and of ⌊rs/M⌋ (centre) in Example 73, when r = 40, s = 50 and N = 1000, excluding ⌊rs/M⌋ = +∞, which corresponds to M = 0, and P_N(M = m) as a function of N (right).)
A discrete uniform random variable X ∼ DU(a, b) takes each of the values a, a + 1, . . . , b with equal probability 1/(b − a + 1). This definition generalizes the outcome of a die throw, which corresponds to the DU(1, 6) distribution.
Siméon Denis Poisson (1781–1840)
'Life is good for only two things, discovering mathematics and teaching mathematics.'
(Source: http://www-history.mcs.st-and.ac.uk/PictDisplay/Poisson.html)
Poisson distribution
Definition 75. A Poisson random variable X has the PMF
f_X(x) = (λ^x/x!) e^{−λ},  x = 0, 1, . . . ,  λ > 0.
We write X ∼ Pois(λ).
Since λ^x/x! > 0 for any λ > 0 and x ∈ {0, 1, . . .}, and
e^{−λ} = 1/e^λ = 1/∑_{x=0}^∞ (λ^x/x!) > 0,
we see that f_X(x) > 0 and ∑_{x=0}^∞ f_X(x) = 1, so this is a probability distribution.
The Poisson distribution appears everywhere in probability and statistics, often as a model for
counts, or for a number of rare events.
It also provides approximations to probabilities, for example for random permutations (Example 47,
random hats) or the binomial distribution (later).
(Figure: Poisson probability mass functions, including Pois(4) and Pois(10); f(x) against x.)
Poisson process
A fundamental model for random events taking place in time or space, e.g., in queuing systems,
communication systems, . . .
Consider point events taking place in a time period T = [0, T ], and write N (I) for the number of
events in a subset I ⊂ T . Suppose that
– events in disjoint subsets of T are independent;
– the probability that an event takes place in an interval of (small) width δ is δλ + o(δ) for some
λ > 0;
– the probability of no events in an interval of (small) width δ is 1 − δλ + o(δ).
Here o(δ) is a term such that limδ→0 o(δ)/δ = 0.
Then
– N {(a, b)} ∼ Pois{λ(b − a)}, or, more generally, N (I) ∼ Pois(λ|I|);
– if I1 , . . . , Ik are disjoint subsets of T , then N (I1 ), . . . , N (Ik ) are independent Poisson
variables.
We can use these properties to deduce that
– sums of independent Poisson variables have Poisson distributions;
– the waiting time to the first event has an exponential distribution;
– the intervals between events have independent exponential distributions.
Dividing (0, t] into m small intervals of length t/m and letting m → ∞ gives
lim_{m→∞} P{N(t) = n} = {(λt)^n/n!} e^{−λt},  n = 0, 1, 2, . . . ,
which is the probability mass function of the Poisson distribution with mean λt.
Note on the Poisson process, II
Here are some implications of this:
– if T were divided into two disjoint intervals T1 , T2 such that λ|T1 | = µ1 and λ|T2 | = µ2 , the
same argument applied separately to T1 and T2 shows that their respective counts N1 and N2
have Poisson distributions with means µ1 and µ2 , and (A1) implies that these are independent.
Since N1 + N2 = N (t), we deduce that the sum of two independent Poisson variables is
Poisson, with mean µ1 + µ2 ;
– the waiting time X_1 to the first event exceeds t if and only if N(t) = 0, so
P(X_1 > t) = P{N(t) = 0} = e^{−λt},  t > 0,
so we see that X_1 ∼ exp(λ), and E(X_1) = 1/λ. Moreover the waiting time X_n to the nth event exceeds t if and only if N(t) < n, so
P(X_n ≤ t) = 1 − P{N(t) < n} = 1 − ∑_{r=0}^{n−1} {(λt)^r/r!} e^{−λt},  t > 0,
and differentiating with respect to t gives the density
f_{X_n}(t) = {λ^n t^{n−1}/(n − 1)!} e^{−λt},  t > 0,
which is the density of the gamma distribution with shape parameter n and scale λ; recall that
Γ(n) = (n − 1)!. By independence of events in disjoint intervals, this must have the same
distribution as a sum of n independent waiting times distributed like X1 , so a sum of
independent exponential variables is gamma;
– now suppose that we start observing such a process at a random time t0 ≫ 0. What is the
distribution of the interval into which t0 falls? We can write the total length of the interval as
W = X− + X+ , where X− is the backward recurrence time from t0 to the previous event, and
X+ is the time to the next event, and (A1) implies that these are independent. Now
X+ ∼ exp(λ), and since there is no directionality, X− ∼ exp(λ), so W has the gamma
distribution with parameters 2 and λ. In particular, the expected length of W is twice that of
X1 . This is an example of length-biased sampling : sampling the Poisson process at a random
time means that the sampling point will fall into an interval that is longer than average.
Alternatively we can argue as follows: imagine that we take intervals at random from M
separate Poisson processes, with M very large, and place these intervals end to end. The
number of intervals of length x will be approximately M fX1 (x) dx, the total length of the M
intervals will be approximately M E(X1 ), and the portion of this taken up by intervals of length
x will be M f_{X_1}(x) dx × x. Thus a point chosen uniformly at random in the total length M E(X_1) will fall into an interval of length x with probability x f_{X_1}(x) dx / E(X_1), which is the length-biased density.
Cumulative distribution function
Definition 76. The cumulative distribution function (CDF) of a random variable X is
F_X(x) = P(X ≤ x),  x ∈ R.
When X is discrete we have F_X(x) = ∑_{x_i ≤ x} f_X(x_i), which is a step function with jumps at the points of the support D_X of f_X(x).
Example 77. Give the support and the probability mass and cumulative distribution functions of a
Bernoulli random variable.
Example 78. Give the cumulative distribution function of a geometric random variable.
Note to Example 77
The support is D = {0, 1}, the PMF is f(0) = 1 − p, f(1) = p, and the CDF is
F(x) = 0 for x < 0,  F(x) = 1 − p for 0 ≤ x < 1,  F(x) = 1 for x ≥ 1.
Note to Example 78
The support is D = N, and for x ≥ 1 we have
P(X ≤ x) = ∑_{r=1}^{⌊x⌋} p(1 − p)^{r−1} = p{1 − (1 − p)^{⌊x⌋}}/{1 − (1 − p)} = 1 − (1 − p)^{⌊x⌋}.
Thus
P(X ≤ x) = 0 for x < 1, and P(X ≤ x) = 1 − (1 − p)^{⌊x⌋} for x ≥ 1.
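As a check (added here), R's pgeom counts the number of failures before the first success, so the CDF of X on {1, 2, . . .} is obtained with a shift of one:

    ## CDF of a geometric variable supported on {1, 2, ...}
    p <- 0.3; x <- 4.7
    1 - (1 - p)^floor(x)             # formula above
    pgeom(floor(x) - 1, prob = p)    # same value from R's parametrisation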
Properties of a cumulative distribution function
Theorem 79. Let (Ω, F, P) be a probability space and X : Ω 7→ R a random variable. Its cumulative
distribution function FX satisfies:
(a) limx→−∞ FX (x) = 0;
(b) limx→∞ FX (x) = 1;
(c) FX is non-decreasing, so FX (x) ≤ FX (y) for x ≤ y;
(d) F_X is continuous on the right, thus
lim_{t↓0} F_X(x + t) = F_X(x),  x ∈ R;
(e) P(X > x) = 1 − F_X(x);
(f) if x < y, then P(x < X ≤ y) = F_X(y) − F_X(x).
Note to Theorem 79
(a) If not, there must be a blob of mass at −∞, which is not allowed, as X ∈ R.
(b) Ditto, for +∞.
(c) If y ≥ x, then F (y) = F (x) + P(x < X ≤ y), so the difference is always non-negative.
(d) Now F (x + t) = P(X ≤ x) + P(x < X ≤ x + t), and the second term here tends to zero, because
any point in the interval (x, x + t] at which there is positive probability must lie to the right of x.
(e) We have P(X > x) = 1 − P(X ≤ x) = 1 − FX (x).
(f) We have P(x < X ≤ y) = P(X ≤ y) − P(X ≤ x) = FX (y) − FX (x).
Remarks
We can obtain the probability mass function of a discrete random variable from the cumulative
distribution function using
f(x) = F(x) − lim_{y↑x} F(y).
In many cases X only takes integer values, DX ⊂ Z, and so f (x) = F (x) − F (x − 1) for x ∈ Z.
From now on we will mostly ignore the implicit probability space (Ω, F, P) when dealing with a
random variable X. We will rather think in terms of X, FX (x), and fX (x). We can legitimise this
‘oversight’ mathematically.
We can specify the distribution of a random variable in an equivalent way by saying (for example):
– X follows a Poisson distribution with parameter λ; or
– X ∼ Pois(λ); or
– by giving the probability mass function of X; or
– by giving the cumulative distribution function of X.
Transformations of discrete random variables
Real-valued functions of random variables are random variables themselves, so they possess probability
mass and cumulative distribution functions.
Example 82. Let Y be the remainder of the division by four of the total of two independent dice
throws. Calculate the PMF of Y .
Note to Theorem 80
We have
f_Y(y) = P(Y = y) = ∑_{x: g(x)=y} P(X = x) = ∑_{x: g(x)=y} f_X(x).
Note to Example 81
Here Y = I(X ≥ 1) takes values 0 and 1, and
f_Y(0) = P(Y = 0) = P(X = 0) = e^{−λ},  f_Y(1) = P(Y = 1) = ∑_{x=1}^∞ P(X = x) = ∑_{x=1}^∞ (λ^x/x!) e^{−λ} = 1 − e^{−λ}.
Note to Example 82
Y has support {0, 1, 2, 3}, and mass function
f_Y(0) = 9/36,  f_Y(1) = 8/36,  f_Y(2) = 9/36,  f_Y(3) = 10/36,
obtained by grouping the totals 2, . . . , 12 according to their remainders modulo 4.
3.2 Expectation slide 112
Expectation
Definition 83. Let X be a discrete random variable with support D_X for which ∑_{x∈D_X} |x| f_X(x) < ∞. The expectation (or expected value or mean) of X is
E(X) = ∑_{x∈D_X} x P(X = x) = ∑_{x∈D_X} x f_X(x).
If E(|X|) = ∑_{x∈D_X} |x| f_X(x) is not finite, then E(X) is not well defined.
E(X) is also sometimes called the “average of X”. We will limit the use of the word “average” to
empirical quantities.
The expectation is analogous in mechanics to the notion of centre of gravity of an object whose
mass is distributed according to fX .
Example 84. Calculate the expectation of a Bernoulli random variable with probability p.
Example 86. Calculate the expectation of the random variables with PMFs
f_X(x) = 4/{x(x + 1)(x + 2)},  f_Y(x) = 1/{x(x + 1)},  x = 1, 2, . . . .
Note to Example 84
First we note that if the support of X is finite, then E(|X|) < maxx∈DX |x| < ∞.
If I is Bernoulli with probability p, then E(I) = 0 × (1 − p) + 1 × p = p.
Note to Example 85
Here D_X = {0, 1, . . . , n} is finite, so E(|X|) < ∞.
We get
E(X) = ∑_{x=0}^n x C(n, x) p^x (1 − p)^{n−x}
     = ∑_{x=1}^n np × {(n − 1)!/[(x − 1)!{(n − 1) − (x − 1)}!]} p^{x−1} (1 − p)^{(n−1)−(x−1)}
     = np ∑_{y=0}^{n−1} C(n − 1, y) p^y (1 − p)^{n−1−y} = np,
where we have set y = x − 1. This agrees with the previous example, since X can be viewed as a sum I_1 + · · · + I_n.
Note to Example 86
Note that f_Y sums to unity: since the series is absolutely convergent we can re-organise the brackets in the sums, giving
∑_{x=1}^∞ 1/{x(x + 1)} = ∑_{x=1}^∞ {1/x − 1/(x + 1)} = 1,
because the sum telescopes. However,
E(Y) = ∑_{x=1}^∞ x × 1/{x(x + 1)} = ∑_{x=1}^∞ 1/(x + 1) = +∞.
Thus it is relatively easy to construct random variables whose expectations are infinite: existence of an expected value is not guaranteed.
Note to Theorem 87
Write Y = g(X), and note that for any y in the support D_Y of Y, we have
f_Y(y) = P(Y = y) = P{g(X) = y} = ∑_{x∈D_X: g(x)=y} f_X(x).
Therefore
E(Y) = ∑_{y∈D_Y} y f_Y(y) = ∑_{y∈D_Y} y ∑_{x∈D_X: g(x)=y} f_X(x) = ∑_{y∈D_Y} ∑_{x: g(x)=y} g(x) f_X(x) = ∑_{x∈D_X} g(x) f_X(x),
as required.
Note to Example 88
Note that
E{X(X − 1) · · · (X − r + 1)} = ∑_{x=0}^∞ x(x − 1) · · · (x − r + 1) (λ^x/x!) e^{−λ} = λ^r ∑_{x−r=0}^∞ {λ^{x−r}/(x − r)!} e^{−λ} = λ^r .
Remark: Linearity of the expected value, (a) and (b), and fact (c), are very useful in calculations.
Note to Theorem 89
(a) We need to show absolute convergence:
∑_x |ax + b| f(x) ≤ ∑_x (|a||x| + |b|) f(x) = |a| ∑_x |x| f(x) + |b| ∑_x f(x) < ∞,
and then linearity of the sum gives E(aX + b) = aE(X) + b.
For the variance, expand E{(X − a)^2} = E(X^2) − 2aE(X) + a^2; setting a = E(X) and simplifying the right-hand side to E(X^2) − E(X)^2 yields the result.
Moments of a distribution
Definition 90. If X has a PMF f(x) such that ∑_x |x|^r f(x) < ∞, then
(a) the rth moment of X is E(X^r);
(b) the rth central moment of X is E[{X − E(X)}^r];
(c) the variance of X is var(X) = E[{X − E(X)}^2] (the second central moment);
(d) the standard deviation of X is defined as √var(X) (non-negative);
(e) the rth factorial moment of X is E{X(X − 1) · · · (X − r + 1)}.
Remarks:
E(X) and var(X) are the most important moments: they represent the ‘average value’ E(X) of
X, and the ‘average squared distance’ of X from its mean, E(X).
The variance is analogous to the moment of inertia in mechanics: it measures the scatter of X
around its mean, E(X), with small variance corresponding to small scatter, and conversely.
The expectation and standard deviation have the same units (kg, m, . . . ) as X.
Example 91. Calculate the expectation and variance of the score when we roll a die.
Note to Example 91
Now X takes values 1, . . . , 6 with equal probabilities 1/6. Obviously E(|X|) < ∞, and
E(X) = (1 + · · · + 6)/6 = 21/6 = 7/2. The variance is
E[{X − E(X)}²] = Σ_{x=1}^6 (x − 7/2)² × (1/6) = (2/6) × (1/4) × (1 + 9 + 25) = 35/12.
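A quick numerical check in R (the values 1:6 with equal weight 1/6 are just the die's support and probabilities):

x <- 1:6
sum(x) / 6              # E(X) = 3.5
sum((x - 3.5)^2) / 6    # var(X) = 35/12 = 2.917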
Properties of the variance
Theorem 92. Let X be a random variable whose variance exists, and let a, b be constants. Then
The first of these formulae expresses the variance in terms of either the ordinary moments, or the
factorial moments. Usually the first is more useful, but occasionally the second can be used.
The second formula shows that the variance does not change if X is shifted by a fixed quantity b,
but the dispersion is increased by the square of a multiplier a.
The third shows that the variance is appropriately named: if X has zero variance, then it does not
vary.
Note to Theorem 92
(a) Just expand, use linearity of E, and simplify.
(b) Ditto.
(c) If we write E(X) = µ and
var(X) = E[{X − E(X)}²] = E[(X − µ)²] = Σ_x f(x)(x − µ)² = 0,
then for each x ∈ DX, either x = µ or f(x) = 0. Suppose that f(a), f(b) > 0 and a ≠ b. Then if
var(X) = 0, we must have a = µ = b, which is a contradiction. Therefore f (x) > 0 for a unique value
of x, and then we must have f (x) = 1, so P(X = x) = 1 and (x − µ)2 = 0; thus
P(X = µ) = fX (µ) = 1.
Note to Example 93
By recalling Example 88, we find
Poisson du moment
(Source: Copernic)
Moment du Poisson
(Source: Copernic)
Properties of the variance II
Theorem 94. If X takes its values in {0, 1, . . .}, r ≥ 2, and E(X) < ∞, then
E(X) = Σ_{x=1}^∞ P(X ≥ x),
E{X(X − 1) · · · (X − r + 1)} = r Σ_{x=r}^∞ (x − 1) · · · (x − r + 1) P(X ≥ x).
Example 96. Each packet of a certain product has equal chances of containing one of n different
types of tokens, independently of each other packet. What is the expected number of packets you will
need to buy in order to get at least one of each type of token?
Note to Theorem 94
The first part of this is
E(X) = Σ_{x=1}^∞ x f(x) = Σ_{x=1}^∞ P(X = x) Σ_{r=1}^x 1 = Σ_{x=1}^∞ P(X ≥ x),
as follows on changing the order of summation, noting that since all the terms are positive, this is
a legal operation.
The second part is proved in the same way, first writing
r(x − 1) · · · (x − r + 1) = r! (x − 1)!/{(r − 1)!(x − r)!} = r! C(x − 1, r − 1).
Then we write
r Σ_{x=r}^∞ (x − 1) · · · (x − r + 1) P(X ≥ x) = r! Σ_{x=r}^∞ C(x − 1, r − 1) Σ_{y=x}^∞ fX(y) = Σ_{y=r}^∞ fX(y) r! Σ_{x=r}^y C(x − 1, r − 1),
Note to Example 95
In this case X ∈ {1, 2, . . .}, and Theorem 94 yields
E(X) = Σ_{x=1}^∞ (1 − p)^{x−1} = 1/{1 − (1 − p)} = 1/p ≥ 1.
For the variance, note that the second part of Theorem 94, with r = 2, gives
E{X(X − 1)} = 2 Σ_{x=2}^∞ (x − 1)(1 − p)^{x−1}
            = 2(1 − p) {−(d/dp) Σ_{x=1}^∞ (1 − p)^{x−1}}
            = 2(1 − p) (d/dp)(−1/p) = 2(1 − p)/p².
Hence the variance is
var(X) = E{X(X − 1)} + E(X) − E(X)² = 2(1 − p)/p² + 1/p − 1/p² = (1 − p)/p².
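A short simulation agrees (a sketch; rgeom counts failures before the first success, so we add 1 to get the number of trials, and p = 0.3 is an arbitrary choice):

set.seed(1)
p <- 0.3
x <- rgeom(1e5, p) + 1    # number of trials up to and including the first success
mean(x)                   # close to 1/p = 3.33
var(x)                    # close to (1 - p)/p^2 = 7.78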
Note to Example 96
This can be represented as X1 + X2 + · · · + Xn , where X1 is the number of packets to the first token,
then X2 is the number of packets to the next different token (i.e., not the first), etc. Thus the Xr are
independent geometric variables with probabilities p = n/n, (n − 1)/n, . . . , 1/n. Hence the expectation
is n(1 + 1/2 + 1/3 + · · · + 1/n) ∼ n log n, which → ∞ as n → ∞.
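The exact expectation and a small simulation agree (a sketch in R; n = 50 tokens is an arbitrary choice):

n <- 50
n * sum(1 / (1:n))        # exact expectation: about 225 packets
sim <- replicate(2000, {  # buy packets until every token has been seen
  seen <- logical(n); k <- 0
  while (!all(seen)) { seen[sample.int(n, 1)] <- TRUE; k <- k + 1 }
  k
})
mean(sim)                 # close to the exact value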
3.3 Conditional Probability Distributions slide 121
Example 99. Calculate the conditional PMFs of X ∼ Geom(p), (a) given that X > n, (b) given
that X ≤ n.
Probability and Statistics for SIC slide 122
Note to Theorem 98
We need to check the two properties of a distribution function.
Non-negativity is obvious because the fX (x | B) = P(X = x | B) are conditional probabilities.
Now Ax ∩ Ay = ∅ if x ≠ y, and ∪_{x∈R} Ax = Ω. Hence the Ax partition Ω, and thus
Σ_x fX(x | B) = Σ_x P(Ax ∩ B)/P(B) = P(B)/P(B) = 1.
Note to Example 99
(a) The event B1 = {X > n} has probability (1 − p)n , so the new mass function is
This implies that conditional on X > n, X − n has the same distribution as did X originally.
(b) The event B2 = B1c = {X ≤ n} has probability 1 − (1 − p)n , so the new mass function is
Conditional expected value
Definition 100. Suppose that Σ_x |g(x)| fX(x | B) < ∞. Then the conditional expected value of g(X) given B is
E{g(X) | B} = Σ_x g(x) fX(x | B).
Theorem 101. Let X be a random variable with expected value E(X) and let B be an event with
P(B), P(B c ) > 0. Then
E(X) = E(X | B)P(B) + E(X | B c )P(B c ).
More generally, when {Bi}_{i=1}^∞ is a partition of Ω, P(Bi) > 0 for all i, and the sum is absolutely convergent,
E(X) = Σ_{i=1}^∞ E(X | Bi) P(Bi).
Example
Example 102. Calculate the expected values for the distributions in Example 99.
Note to Example 99
(a) Since
fX (x | B1 ) = p(1 − p)x−n−1 , x = n + 1, n + 2, . . . ,
we have
E(X | B1) = Σ_{x=n+1}^∞ x p(1 − p)^{x−n−1} = Σ_{y=1}^∞ (n + y) p(1 − p)^{y−1} = n Σ_{y=1}^∞ p(1 − p)^{y−1} + Σ_{y=1}^∞ y p(1 − p)^{y−1} = n + 1/p,
since the first sum equals unity and the second is the expectation of a Geom(p) variable.
(b) We can tackle this directly using the expression
E(X | B2) = Σ_{x=1}^n x p(1 − p)^{x−1}/{1 − (1 − p)^n}.
Convergence of distributions
We often want to approximate one distribution by another. The mathematical basis for doing so is the
convergence of distributions.
Definition 103. Let {Xn }, X be random variables whose cumulative distribution functions are {Fn },
F . Then we say that the random variables {Xn } converge in distribution (or converge in law) to
X, if, for all x ∈ R where F is continuous,
Fn (x) → F (x), n → ∞.
We write Xn −D→ X.
Law of small numbers
Recall from Theorem 17 that n^{−r} C(n, r) → 1/r! for all r ∈ N, when n → ∞.
Theorem 104 (Law of small numbers). Let Xn ∼ B(n, pn), and suppose that npn → λ > 0 when n → ∞. Then Xn −D→ X, where X ∼ Pois(λ).
Theorem 104 can be used to approximate binomial probabilities for large n and small p by Poisson
probabilities.
Example 105. In Example 47 we saw that the probability of having exactly r fixed points in a random
permutation of n objects is
(1/r!) Σ_{k=0}^{n−r} (−1)^k/k! → e^{−1}/r!,   r = 0, 1, . . . ,   n → ∞,
which is the required Poisson mass function; call this limiting Poisson random variable X. This
convergence implies that P(Xn ≤ x) → P(X ≤ x) for any fixed real x, since P(Xn ≤ x) is just then a
finite sum of probabilities, each of which is converging to the limiting Poisson probability.
[Figure: mass functions of three binomial distributions and the Poisson distribution, all with expectation 5; the labelled panels include B(50, 0.1) and Pois(5).]
Numerical comparison
Example 106 (Binomial and Poisson distributions). Compare P(X ≤ 3) for X ∼ B(20, p), with
p = 0.05, 0.1, 0.2, 0.5 with the results from a Poisson approximation, P(X ′ ≤ 3), with X ′ ∼ Pois(np),
using the functions pbinom and ppois in the software R — see
http://www.r-project.org/
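One way to carry out the comparison, vectorising over the four values of p (a sketch; the approximation is closest for the smaller values of p):

p <- c(0.05, 0.1, 0.2, 0.5)
rbind(binomial = pbinom(3, 20, p), poisson = ppois(3, 20 * p))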
Due to a lack of evidence and of reliable witnesses, the prosecutor tried to convince the jury that
Collins and her friend were the only pair in Los Angeles who could have committed the crime. He
found a probability of p = 1/(12 × 106 ) that a couple picked at random should fit the description, and
they were convicted.
In a higher court it was argued that the number of couples X fitting the description must follow a
Poisson distribution with λ = np, where n is the size of the population to which the couple belong. To
be certain that the couple were guilty, P(X > 1 | X ≥ 1) must be very small. But with n = 106 ,
2 × 106 , 5 × 106 , 10 × 106 , these probabilities are 0.041, 0.081, 0.194, 0.359: it was therefore very far
from certain that they were guilty. They were finally cleared.
with Poisson parameter λ = np = 1/12, 1/6, 5/12 and 5/6 respectively. Calculation gives the required
numbers. In fact here X has a truncated Poisson distribution.
Probability and Statistics for SIC note 1 of slide 130
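The probabilities quoted above can be reproduced directly (a sketch in R, using the population sizes given in the text):

n <- c(1, 2, 5, 10) * 1e6
lambda <- n / 12e6
(1 - ppois(1, lambda)) / (1 - dpois(0, lambda))   # 0.041 0.081 0.194 0.359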
Example
Example 108. Let XN be a hypergeometric variable, then
P(XN = x) = C(m, x) C(N − m, n − x)/C(N, n),   x = max(0, m + n − N), . . . , min(m, n).
This is the distribution of the number of white balls obtained when we take a random sample of size n
without replacement from an urn containing m white balls and N − m black balls. Show that when
N, m → ∞ in such a way that m/N → p, where 0 < p < 1,
P(XN = x) → C(n, x) p^x (1 − p)^{n−x},   x = 0, . . . , n.
Which distribution?
We have encountered several distributions: Bernoulli, binomial, geometric, negative binomial,
hypergeometric, Poisson—how to choose? Here is a little algorithm to help your reasoning:
Is X based on independent 0/1 trials with the same probability p, or on draws from a finite population with replacement?
If Yes, is the total number of trials n fixed, so X ∈ {0, . . . , n}?
– If Yes: use the binomial distribution, X ∼ B(n, p) (and thus the Bernoulli distribution if
n = 1).
⊲ If n ≈ ∞ or n ≫ np, we can use the Poisson distribution, X ∼ Pois(np).
– If No, then X ∈ {n, n + 1, . . .}, and we use the geometric (if X is the number of trials until
one success) or negative binomial (if X is the number of trials until the last of several
successes) distributions.
If No, then if the draw is independent but without replacement from a finite population, then X ∼
hypergeometric distribution.
There are many more distributions, and we may choose a distribution on empirical grounds. The
following map comes from Leemis and McQueston (2008, American Statistician) . . .
Probability and Statistics for SIC slide 133
4 Continuous Random Variables slide 135
DX = {x ∈ R : X(ω) = x, ω ∈ Ω}
Definition 109 (Reminder). Let (Ω, F, P) be a probability space. The cumulative distribution function of a rv X defined on (Ω, F, P) is
FX(x) = P(X ≤ x) = P(Bx),   x ∈ R,
where Bx = {ω : X(ω) ≤ x} ⊂ Ω.
Remarks:
Evidently,
f(x) = dF(x)/dx.
Since P(x < X ≤ y) = ∫_x^y f(u) du for x < y, for all x ∈ R,
P(X = x) = lim_{y↓x} P(x < X ≤ y) = lim_{y↓x} ∫_x^y f(u) du = ∫_x^x f(u) du = 0.
If X is discrete, then its PMF f (x) is often also called its density function.
Motivation
We study continuous random variables for several reasons:
they appear in simple but powerful models—for example, the exponential distribution often
represents the waiting time in a process where events occur completely at random;
they give simple but very useful approximations for complex problems—for example, the normal
distribution appears as an approximation for the distribution of an average, under fairly general
conditions;
they are the basis for modelling complex problems either in probability or in statistics—for
example, the Pareto distribution is often a good approximation for heavy-tailed data, in finance
and for the internet.
We will discuss a few well-known distributions, but there are plenty more (see map at the end of
Chapter 3) . . ..
Basic distributions
Definition 111 (Uniform distribution). The random variable U having density
f(u) = 1/(b − a) for a ≤ u ≤ b (where a < b), and f(u) = 0 otherwise,
is called a uniform random variable on (a, b); we write U ∼ U(a, b).
In practice random variables are almost always either discrete or continuous, with exceptions such as
daily rain totals.
Example 113. Find the cumulative distribution functions of the uniform and exponential distributions,
and establish the lack of memory (or memorylessness) property of X:
Note to Example 113
Integration of the uniform density gives
0, u ≤ a,
F (u) = (u − a)/(b − a), a < u ≤ b,
1, u > b.
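The exponential part of Example 113 is not reproduced here, but the lack-of-memory property P(X > s + t | X > t) = P(X > s) is easy to check numerically (a sketch; λ, s and t are arbitrary choices):

lambda <- 2; s <- 0.5; t <- 1.3
pexp(s + t, lambda, lower.tail = FALSE) / pexp(t, lambda, lower.tail = FALSE)
pexp(s, lambda, lower.tail = FALSE)    # the two values agree: exp(-lambda * s)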
Gamma distribution
Definition 114 (Gamma distribution). The random variable X having density
f(x) = λ^α x^{α−1} e^{−λx}/Γ(α) for x > 0, and f(x) = 0 otherwise,
is called a gamma random variable with parameters α, λ > 0; we write X ∼ Gamma(α, λ).
Here α is called the shape parameter and λ is called the rate, with λ−1 the scale parameter. By
letting α = 1 we get the exponential density, and when α = 2, 3, . . . we get the Erlang density.
Slide 99 gives the properties of Γ(·).
Laplace distribution
Definition 115 (Laplace). The random variable X having density
f(x) = (λ/2) e^{−λ|x−η|},   x ∈ R, η ∈ R, λ > 0,
is called a Laplace (or sometimes double exponential) random variable.
(Source: http://www-history.mcs.st-and.ac.uk/PictDisplay/Laplace.html)
Pierre-Simon Laplace (1749–1827): Théorie Analytique des Probabilités (1814)
According to Napoleon Bonaparte: ‘Laplace did not consider any question from the right angle: he
sought subtleties everywhere, conceived only problems, and brought the spirit of “infinitesimals” into
the administration.’
Probability and Statistics for SIC slide 142
Pareto distribution
Definition 116 (Pareto). The random variable X with cumulative distribution function
F(x) = 0 for x < β, and F(x) = 1 − (β/x)^α for x ≥ β, with α, β > 0,
is called a Pareto random variable.
Example 117. Find the cumulative distribution function of the Laplace distribution, and the
probability density function of the Pareto distribution.
Note to Example 117
For the Laplace distribution, integration of the density gives
F(x) = (1/2) e^{−λ|x−η|} for x ≤ η, and F(x) = 1 − (1/2) e^{−λ|x−η|} for x > η.
For the Pareto distribution, differentiation of F gives the density f(x) = αβ^α/x^{α+1} for x ≥ β, and f(x) = 0 otherwise.
Moments
Definition 118. Let g(x) be a real-valued function, and X a continuous random variable of density
f (x). Then if E{|g(X)|} < ∞, we define the expectation of g(X) to be
E{g(X)} = ∫_{−∞}^∞ g(x) f(x) dx.
Example 119. Calculate the expectation and the variance of the following distributions: (a) U (a, b);
(b) gamma; (c) Pareto.
Note to Example 119
(a) Note that we need to compute E(U^r) for r = 1, 2, and this is (b^{r+1} − a^{r+1})/{(r + 1)(b − a)}. Hence
E(X) = (b² − a²)/{2(b − a)} = (b + a)/2, as expected. For the variance, note that
E(X²) − E(X)² = (b³ − a³)/{3(b − a)} − (b + a)²/4 = (b² + ab + a²)/3 − (b² + 2ab + a²)/4 = (b − a)²/12.
(b) In this case
E(X^r) = ∫_0^∞ x^r × λ^α x^{α−1} Γ(α)^{−1} exp(−λx) dx
       = λ^{−r} Γ(α)^{−1} ∫_0^∞ u^{r+α−1} e^{−u} du
       = λ^{−r} Γ(r + α)/Γ(α).
(c) For the Pareto distribution a similar calculation gives E(X^r) = αβ^r/(α − r), provided that α > r. If α ≤ r then the moment does not exist. In particular, E(X) < ∞ only if α > 1.
Conditional densities
We can also calculate conditional cumulative distribution and density functions: for reasonable subsets
A ⊂ R we have
FX(x | X ∈ A) = P(X ≤ x | X ∈ A) = P({X ≤ x} ∩ {X ∈ A})/P(X ∈ A) = ∫_{Ax} f(y) dy / P(X ∈ A),
Example 120. Let X ∼ exp(λ). Find the density and the cumulative distribution function of X, given
that X > 3.
Probability and Statistics for SIC slide 145
Note to Example 120
With A = (3, ∞), we have P(X ∈ A) = exp(−3λ). Hence
FX(x | X ∈ A) = 0 for x < 3, and FX(x | X ∈ A) = {exp(−3λ) − exp(−λx)}/exp(−3λ) for x ≥ 3,
and the formula here reduces to 1 − exp{−(x − 3)λ}, x > 3. This is just the exponential distribution, shifted along to x = 3. There is a close relation to the lack of memory property.
Example
Example 121. To get a visa for a foreign country, you call its consulate every morning at 10 am. On
any given day the civil servant is only there to answer telephone calls with probability 1/2, and when he
does answer, he lets the phone ring for a random amount of time T (min) whose distribution is
FT(t) = 0 for t ≤ 1, and FT(t) = 1 − t^{−1} for t > 1.
(a) If you call one morning and don’t hang up, what is the probability that you will listen to the ringing
tone for at least s minutes?
(b) You decide to call once every day, but to hang up if there has been no answer after s∗ minutes.
Find the value of s∗ which minimises your time spent listening to the ringing tone.
Note to Example 121
(a) Let S be the time for which it rings on a given day. Then
P(S > s) = P(S > s | absent)P(absent) + P(S > s | present)P(present) = 1 for s < 1, and = 1/2 + 1/(2s) for s ≥ 1.
This is a defective distribution, since lims→∞ FS (s) < 1, because there is a point mass of 1/2 at
+∞, corresponding to the event that he is absent.
The expected waiting time if you don’t put the phone down is
E(S) = E(S | absent)P(absent) + E(S | present)P(present) = (1/2) × ∞ + (1/2) ∫_1^∞ s (ds/s²) = ∞.
The number of calls N until you get through to the visa clerk is a geometric random variable with
success probability
p = P(S < s∗) = 1 − P(S > s∗) = (1/2)(1 − 1/s∗):
there are N − 1 unsuccessful calls each of length s∗ , followed by a successful call. The number of
unsuccessful calls has expectation
E(N) − 1 = 1/p − 1 = 2/(1 − 1/s∗) − 1 = {2s∗ − (s∗ − 1)}/(s∗ − 1) = (s∗ + 1)/(s∗ − 1).
The expected total time spent listening to the ringing tone, w(s∗), satisfies
w′(s∗) = (s∗ − 1)^{−2}(s∗² − s∗ − 2 − log s∗) = (s∗ − 1)^{−2}{(s∗ − 2)(s∗ + 1) − log s∗},
and setting w′ (s∗ ) = 0 gives that s∗ must solve the equation (s∗ − 2)(s∗ + 1) = log s∗ , for s∗ > 1.
This gives s∗ = 2.25 minutes as being the optimum length of call, and in this case w(s∗ ) = 7.3
minutes, while E(N ) = 1 + 2.6 = 3.6 calls.
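The equation (s∗ − 2)(s∗ + 1) = log s∗ has no closed-form solution, but is easy to solve numerically (a sketch in R):

g <- function(s) (s - 2) * (s + 1) - log(s)
uniroot(g, c(1.5, 5))$root    # about 2.25 minutes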
Note the shape of the graph on slide 147. The expected total waiting time increases very sharply for
the impatient (who put the phone down before time s∗ ), but not so fast for patient people who wait
beyond time s∗ . Draw your own conclusions!
X discrete or continuous?
                             Discrete                    Continuous
Support DX                   countable                   contains an interval (x−, x+) ⊂ R
P(a < X ≤ b)                 Σ_{x: a<x≤b} fX(x)          ∫_a^b fX(x) dx
P(X = a)                     fX(a) ≥ 0                   ∫_a^a fX(x) dx = 0
E{g(X)} (if well defined)    Σ_{x∈R} g(x) fX(x)          ∫_{−∞}^∞ g(x) fX(x) dx
Quantiles
Definition 122. Let 0 < p < 1. We define the p quantile of the cumulative distribution function
F (x) to be
xp = inf{x : F (x) ≥ p}.
For most continuous random variables, xp is unique and equals xp = F^{−1}(p), where F^{−1} is the inverse function of F; then xp is the value for which P(X ≤ xp) = p. In particular, we call the 0.5 quantile the
median of F .
The infimum is needed when there are jumps in the distribution function, or when it is flat over some
interval. Here is an example:
Example 125. Compute x0.5 and x0.9 for a Bernoulli random variable with p = 1/2.
Probability and Statistics for SIC note 2 of slide 150
Likewise
x0.5 = inf{x : F (x) ≥ 0.5} = inf{x : x ≥ 0} = 0.
Transformations
We often consider Y = g(X), where g is a known function, and we want to calculate FY and fY given
FX and fX .
Example 126. Let Y = − log(1 − U ), where U ∼ U (0, 1). Calculate FY (y) and discuss. Calculate
also the density and cumulative distribution function of W = − log U . Explain.
Example 127. Let Y = ⌈X⌉, where X ∼ exp(λ) (thus Y is the smallest integer greater than X).
Calculate FY (y) and fY (y).
Note to Example 126
For y ≥ 0, FY(y) = P{−log(1 − U) ≤ y} = P(U ≤ 1 − e^{−y}) = 1 − e^{−y}, and differentiation gives fY(y) = e^{−y}, y > 0,
which is the exponential density; note that the transformation here is monotone. Thus Y has an
exponential distribution.
For W = −log U, we have
FW(w) = P(−log U ≤ w) = P(U ≥ e^{−w}) = P(U > e^{−w}) = 1 − e^{−w},   w > 0,
where the < can become an ≤ because there is no probability at individual points in R.
Hence W also has an exponential distribution. This is obvious, because if U ∼ U (0, 1), then
1 − U ∼ U (0, 1) also.
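This is essentially the inversion method: applying the inverse CDF (here −log(1 − u)) to a uniform variable yields the target distribution. A quick simulation check (a sketch):

set.seed(1)
u <- runif(1e4)
qqplot(qexp(ppoints(1e4)), -log(1 - u))   # points fall close to a straight line
abline(0, 1)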
Note to Example 127
Y = r iff r − 1 < X ≤ r, so for r = 1, 2, . . . , we have
P(Y = r) = ∫_{r−1}^r fX(x) dx = ∫_{r−1}^r λe^{−λx} dx = e^{−λ(r−1)} − e^{−λr} = (e^{−λ})^{r−1}(1 − e^{−λ}),
so Y has a geometric distribution with success probability 1 − e^{−λ}.
General transformation
We can formalise the previous discussion in the following way:
Definition 128. Let g : R → R be a function and B ⊂ R any subset of R. Then g^{−1}(B) ⊂ R is the set for which g{g^{−1}(B)} = B.
Theorem 129. Let Y = g(X) be a random variable and By = (−∞, y]. Then
FY(y) = P(Y ≤ y) = ∫_{g^{−1}(By)} fX(x) dx (X continuous), or Σ_{x∈g^{−1}(By)} fX(x) (X discrete),
where g^{−1}(By) = {x ∈ R : g(x) ≤ y}. When g is monotone increasing or decreasing and has differentiable inverse g^{−1}, then
fY(y) = |dg^{−1}(y)/dy| fX{g^{−1}(y)},   y ∈ R.
Example 131. Find the distribution and density functions of Y = cos(X), where X ∼ exp(1).
Note to Theorem 129
We have
P(Y ∈ B) = P{g(X) ∈ B} = P{X ∈ g^{−1}(B)},
because X ∈ g^{−1}(B) if and only if g(X) ∈ g{g^{−1}(B)} = B.
To find FY(y) we take By = (−∞, y]. If g is monotone increasing, then FY(y) = P{X ≤ g^{−1}(y)} = FX{g^{−1}(y)}, and differentiation gives
fY(y) = dg^{−1}(y)/dy × fX{g^{−1}(y)},   y ∈ R.
If g is monotone decreasing, then FY(y) = P{X ≥ g^{−1}(y)} = 1 − FX{g^{−1}(y)}, and differentiation gives
fY(y) = −dg^{−1}(y)/dy × fX{g^{−1}(y)},   y ∈ R;
note that −dg^{−1}(y)/dy ≥ 0, because g^{−1}(y) is monotone decreasing.
Thus in both cases we can write
fY(y) = |dg^{−1}(y)/dy| fX{g^{−1}(y)},   y ∈ R.
Note to Example 130
Note first that since X only puts probability on R+ , Y ∈ (1, ∞).
In terms of the theorem, let By = (−∞, y], and note that g(x) = ex is monotone increasing, with
g−1 (y) = log y, so
P(Y ≤ y) = P(Y ∈ B) = P{g(X) ∈ B} = P{X ∈ g −1 (B)} = P{X ∈ (−∞, log y]} = FX (log y),
so
P(Y ≤ y) = 1 − exp{−λ log y} = 1 − y^{−λ},   y > 1.
Hence Y has the Pareto distribution with β = 1, α = λ, and
fY(y) = 0 for y ≤ 1, and fY(y) = λy^{−λ−1} for y > 1.
To get the density directly, we note that dg^{−1}(y)/dy = 1/y, and
fY(y) = |dg^{−1}(y)/dy| fX{g^{−1}(y)} = |y^{−1}| × λe^{−λ log y} = λy^{−λ−1},   y > 1,
and fY (y) = 0 for y ≤ 1, because if y < 1, then log y < 0, and fX (x) = 0 for x < 0.
Note to Example 131
Here Y = g(X) = cos(X) takes values only in the range −1 ≤ y ≤ 1, so g^{−1}(By) = ∅ if y < −1 and g^{−1}(By) = R if y ≥ 1, thus giving
FY(y) = 0 for y < −1, and FY(y) = 1 for y ≥ 1.
A sketch of the function cos x for x ≥ 0 shows that in the range 0 < x < 2π, and for −1 < y < 1,
the event cos(X) ≤ y is equivalent to the event cos−1 (y) ≤ X ≤ 2π − cos−1 (y). Since the cosine
function is periodic, the set By is an infinite union of disjoint intervals. In fact
cos(X) ≤ y  ⇔  X ∈ g^{−1}(By) = ∪_{j=0}^∞ {x : 2πj + cos^{−1}(y) ≤ x ≤ 2π(j + 1) − cos^{−1}(y)},
and therefore
P(Y ≤ y) = P{X ∈ g^{−1}(By)}
         = Σ_{j=0}^∞ P{2πj + cos^{−1}(y) ≤ X ≤ 2π(j + 1) − cos^{−1}(y)}
         = Σ_{j=0}^∞ (exp[−λ{2πj + cos^{−1}(y)}] − exp[−λ{2π(j + 1) − cos^{−1}(y)}])
         = [exp{−λ cos^{−1}(y)} − exp{λ cos^{−1}(y) − 2πλ}]/{1 − exp(−2πλ)},
where we noticed that the summation is proportional to a geometric series.
Note that if y = 1, then cos−1 (y) = 0, and so P(Y ≤ 1) = 1, and if y = −1, then cos−1 (y) = π,
and then P(Y ≤ −1) = 0, as required. Here we used values of cos−1 (y) in the range [0, π].
The density function is found by differentiation: since cos{cos−1 (y)} = y, we have
d cos^{−1}(y)/dy = −1/sin{cos^{−1}(y)},
4.3 Normal Distribution slide 153
Normal distribution
Definition 132. A random variable X having density
f(x) = {1/((2π)^{1/2} σ)} exp{−(x − µ)²/(2σ²)},   x ∈ R, µ ∈ R, σ > 0,
is a normal random variable with expectation µ and variance σ²; we write X ∼ N(µ, σ²). (The standard deviation of X is √(σ²) = σ > 0.)
When µ = 0, σ² = 1, the corresponding random variable Z is standard normal, Z ∼ N(0, 1), with density
φ(z) = (2π)^{−1/2} e^{−z²/2},   z ∈ R.
Then
FZ(x) = P(Z ≤ x) = Φ(x) = ∫_{−∞}^x φ(z) dz = (2π)^{−1/2} ∫_{−∞}^x e^{−z²/2} dz.
The normal distribution is often called the Gaussian distribution. Gauss used it for the combination
of astronomical and topographical measures.
Johann Carl Friedrich Gauss (1777–1855)
[Figure: the standard normal density φ(z), plotted for −3 ≤ z ≤ 3.]
Interpretation of N (µ, σ 2)
The density function is centred at µ, which is the most likely value and also the median;
the standard deviation σ is a measure of the spread of the values around µ:
– 68% of the probability lies in the interval µ ± σ;
– 95% of the probability lies in the interval µ ± 2σ;
– 99.7% of the probability lies in the interval µ ± 3σ.
Example 133. The average height for a class of students was 178 cm, with standard deviation 7.6 cm.
If this is representative of the population, then 68% have heights in the interval 178 ± 7.6 cm (blue
lines), 95% in the interval 178 ± 2 × 7.6 cm (green lines), and 99.7% in the interval 178 ± 3 × 7.6 cm
(cyan lines, almost invisible).
[Figure: fitted normal density for the heights (cm), with the intervals 178 ± 7.6, 178 ± 2 × 7.6 and 178 ± 3 × 7.6 marked.]
Properties
Theorem 134. The density φ(z), the cumulative distribution function Φ(z), and the quantiles zp of
Z ∼ N (0, 1) satisfy, for all z ∈ R:
(a) the density is symmetric with respect to z = 0, i.e., φ(z) = φ(−z);
(b) P(Z ≤ z) = Φ(z) = 1 − Φ(−z) = 1 − P(Z ≥ z);
(c) the standard normal quantiles zp satisfy zp = −z1−p , for all 0 < p < 1;
(d) z^r φ(z) → 0 when z → ±∞, for all r > 0. This implies that the moments E(Z^r) exist for all
r ∈ N;
(e) we have
φ′ (z) = −zφ(z), φ′′ (z) = (z 2 − 1)φ(z), φ′′′ (z) = −(z 3 − 3z)φ(z), ...
Note that if X ∼ N (µ, σ 2 ), then we can write X = µ + σZ, where Z ∼ N (0, 1).
Theorem 134
(a) Obvious by substitution:
φ(−z) = (2π)^{−1/2} e^{−(−z)²/2} = (2π)^{−1/2} e^{−z²/2} = φ(z).
(c) Again obvious by symmetry, using (b): p = Φ(z) = 1 − Φ(−z) implies that zp = −z1−p .
(d) This is just a fact from analysis, since for any r ≥ 0, we have
z^r φ(z) ∝ z^r / Σ_{i=0}^∞ z^{2i}/i! < z^r / z^{2(r+1)} → 0,   z → ∞,
Values of the function Φ(z)
z     .00    .01    .02    .03    .04    .05    .06    .07    .08    .09
0.0 .50000 .50399 .50798 .51197 .51595 .51994 .52392 .52790 .53188 .53586
0.1 .53983 .54380 .54776 .55172 .55567 .55962 .56356 .56750 .57142 .57535
0.2 .57926 .58317 .58706 .59095 .59483 .59871 .60257 .60642 .61026 .61409
0.3 .61791 .62172 .62552 .62930 .63307 .63683 .64058 .64431 .64803 .65173
0.4 .65542 .65910 .66276 .66640 .67003 .67364 .67724 .68082 .68439 .68793
0.5 .69146 .69497 .69847 .70194 .70540 .70884 .71226 .71566 .71904 .72240
0.6 .72575 .72907 .73237 .73565 .73891 .74215 .74537 .74857 .75175 .75490
0.7 .75804 .76115 .76424 .76730 .77035 .77337 .77637 .77935 .78230 .78524
0.8 .78814 .79103 .79389 .79673 .79955 .80234 .80511 .80785 .81057 .81327
0.9 .81594 .81859 .82121 .82381 .82639 .82894 .83147 .83398 .83646 .83891
1.0 .84134 .84375 .84614 .84850 .85083 .85314 .85543 .85769 .85993 .86214
1.1 .86433 .86650 .86864 .87076 .87286 .87493 .87698 .87900 .88100 .88298
1.2 .88493 .88686 .88877 .89065 .89251 .89435 .89617 .89796 .89973 .90147
1.3 .90320 .90490 .90658 .90824 .90988 .91149 .91309 .91466 .91621 .91774
1.4 .91924 .92073 .92220 .92364 .92507 .92647 .92786 .92922 .93056 .93189
1.5 .93319 .93448 .93574 .93699 .93822 .93943 .94062 .94179 .94295 .94408
1.6 .94520 .94630 .94738 .94845 .94950 .95053 .95154 .95254 .95352 .95449
1.7 .95543 .95637 .95728 .95818 .95907 .95994 .96080 .96164 .96246 .96327
1.8 .96407 .96485 .96562 .96638 .96712 .96784 .96856 .96926 .96995 .97062
1.9 .97128 .97193 .97257 .97320 .97381 .97441 .97500 .97558 .97615 .97670
2.0 .97725 .97778 .97831 .97882 .97932 .97982 .98030 .98077 .98124 .98169
Remark: A more detailed table can be found in the Formulaire. You may also use the function pnorm
in the software R: Φ(z) = pnorm(z).
P(Z ≤ 0.53), P(Z ≤ −1.86), P(−1.86 < Z < 0.53), z0.95 , z0.025 , z0.5 .
> pnorm(0.53)
[1] 0.701944
> pnorm(-1.86)
[1] 0.03144276
> pnorm(0.53)- pnorm(-1.86)
[1] 0.6705013
> qnorm(0.95)
[1] 1.644854
> qnorm(0.025)
[1] -1.959964
> qnorm(0.5)
[1] 0
Probability and Statistics for SIC note 1 of slide 160
Examples and calculations
Example 136. The duration in minutes of a maths lecture is N (47, 4), but should be 45. Give the
probability that (a) the lecture finishes early, (b) the lecture finishes at least 5 minutes late.
Example 137. Show that the expectation and variance of X ∼ N (µ, σ 2 ) are µ and σ 2 , and find the p
quantile of X.
Example 138. Calculate the cumulative distribution function and the density of Y = |Z| and
W = Z 2 , where Z ∼ N (0, 1).
as before.
For W, the same argument gives P(W ≤ w) = P(−√w ≤ Z ≤ √w) = Φ(√w) − Φ(−√w), for w > 0. Then differentiate to obtain the density.
In this case g(x) = x² and g^{−1}(Bw) = g^{−1}{(−∞, w]} = (−√w, √w) for w ≥ 0 and g^{−1}(Bw) = ∅ for w < 0. This gives the previous result, by a slightly more laborious route.
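For Example 136, the two probabilities can be read off with pnorm; note that N(47, 4) has standard deviation 2 (a sketch):

pnorm(45, mean = 47, sd = 2)                       # (a) finishes early: about 0.16
pnorm(50, mean = 47, sd = 2, lower.tail = FALSE)   # (b) at least 5 minutes late: about 0.07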
Normal approximation to the binomial distribution
The normal distribution is central to probability, partly because it can be used to approximate
probabilities of other distributions. One of the basic results is:
Theorem 139 (de Moivre–Laplace). Let Xn ∼ B(n, p), where 0 < p < 1, let
[Figure: the B(16, 0.5) and B(16, 0.1) mass functions with their normal approximations (top row) and Poisson approximations (bottom row).]
Continuity correction
A better approximation to P(Xn ≤ r) is given by replacing r by r + 1/2; the 1/2 is called the continuity correction. This gives
P(Xn ≤ r) ≈ Φ[{r + 1/2 − np}/√{np(1 − p)}].
[Figure: the B(15, 0.4) mass function and its normal approximation.]
Example 140. Let X ∼ B(15, 0.4). Calculate the exact and approximate values of P(X ≤ r) for
r = 1, 8, 10, with and without the continuity correction. Comment.
Numerical results
pbinom(c(1,8,10),15,prob=0.4)
[1] 0.005172035 0.904952592 0.990652339
pnorm(c(1,8,10),mean=15*0.4,sd=sqrt(15*0.4*0.6))
[1] 0.004203997 0.854079727 0.982492509
pnorm(c(1,8,10)+0.5,mean=15*0.4,sd=sqrt(15*0.4*0.6))
[1] 0.008853033 0.906183835 0.991146967
Probability and Statistics for SIC slide 165
Example
Example 141. The total number of students in a class is 100.
(a) Each student goes independently to a maths lecture with probability 0.6. What is the size of the
smallest classroom suited for the number of students who go to class, with a probability of 0.95?
(b) There are 14 lectures per semester, and the students decide to go to each lecture independently.
What is now the size of the smallest classroom necessary?
Note to Example 141
(a) The number of students present X is B(100, 0.6), so the mean is 100 × 0.6 = 60 and the variance
is 100 × 0.6 × 0.4 = 24. We seek x such that
0.95 = P(X ≤ x) = P{(X − 60)/√24 ≤ (x − 60)/√24} ≈ Φ{(x − 60)/√24},
and this implies that (x − 60)/√24 = Φ^{−1}(0.95) = 1.65, and thus x = 60 + √24 × 1.65 = 68.08.
Better have a room for 69.
(b) Now we want to solve the equation
0.95 = P(X ≤ x)^{14} = P{(X − 60)/√24 ≤ (x − 60)/√24}^{14} ≈ Φ{(x − 60)/√24}^{14},
and this implies that (x − 60)/√24 = Φ^{−1}(0.95^{1/14}) = 2.68, and thus x = 60 + √24 × 2.68 = 73.14.
Better have a room for 74.
Probability and Statistics for SIC note 1 of slide 166
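The same answers come directly from qnorm, up to rounding of the normal quantiles (a sketch; the 1/14 power handles the 14 independent lectures in (b)):

ceiling(qnorm(0.95, mean = 60, sd = sqrt(24)))           # (a) 69 places
ceiling(qnorm(0.95^(1/14), mean = 60, sd = sqrt(24)))    # (b) 74 places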
it is difficult to draw strong conclusions from such a graph for small n, as the variability is then
large—we have a tendency to over-interpret patterns in the plot.
Note to the following graphs
First graph: the normal graph is close to a straight line, whereas the exponential one is not.
Suggests that the normal would be a reasonable model for these data. Derive the formula for the
exponential plotting positions, using the quantile formula for the exponential distribution.
Second graph: Here we compare the real data (top centre) with simulated data. The fact that it is
hard to tell which is which (you need to remember the shape of the first graph, or to note that
tied observations are impossible with simulations) suggests that the heights can be considered to
be normal.
The lower left is gamma: there is clearer nonlinearity than with the other panels—but it is hard to
be sure with this sample size.
The lower middle is obviously not normal; the sample size is big, however.
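A normal Q-Q plot of the kind discussed here can be drawn as follows (a sketch; the simulated sample is purely illustrative):

set.seed(1)
x <- rnorm(36, mean = 178, sd = 7.6)   # illustrative 'heights' for 36 students
qqnorm(x); qqline(x)                   # a roughly straight line is consistent with normality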
Heights of students
Q-Q plots for the heights of n = 36 students in SSC, for the exponential and normal distributions.
[Figure: panels of Q-Q plots; vertical axes 'Height (cm)', horizontal axes 'Normal plotting positions'.]
n = 100: Which sample is not normal?
There are five samples of simulated normal variables, and one simulated gamma sample.
[Figure: Q-Q plots of the simulated samples; vertical axes 'Height (cm)', horizontal axes 'Normal plotting positions'.]
Which density?
Uniform variables lie in a finite interval, and give equal probability to each part of the interval;
exponential and gamma variables lie in (0, ∞), and are often used to model waiting times and
other positive quantities,
– the gamma has two parameters and is more flexible, but the exponential is simpler and has
some elegant properties;
Pareto variables lie in the interval (β, ∞), so are not appropriate for arbitrary positive quantities
(which could be smaller than β), but are often used to model financial losses over some threshold
β;
normal variables lie in R and are used to model quantities that arise (or might arise) through
averaging of many small effects (e.g., height and weight, which are influenced by many genetic
factors), or where measurements are subject to error;
Laplace variables lie in R; the Laplace distribution can be used in place of the normal in situations
where outliers might be present.
5. Several Random Variables slide 174
Lexicon
Example 142. The distribution of (height, weight) of a student picked at random from the class.
Example 143 (Hats, continuation of Example 47). Three men with hats permute them in a random
way. Let I1 be the indicator of the event in which man 1 has his hat, etc. Find the joint distribution of
(I1 , I2 , I3 ).
Note to Example 143
The possibilities each have probability 1/6, and with the notation that Ij indicates that the jth hat is
on the right head, are
1 2 3
1 2 3 (I1 , I2 , I3 ) = (1, 1, 1)
1 3 2 (I1 , I2 , I3 ) = (1, 0, 0)
2 1 3 (I1 , I2 , I3 ) = (0, 0, 1)
2 3 1 (I1 , I2 , I3 ) = (0, 0, 0)
3 1 2 (I1 , I2 , I3 ) = (0, 0, 0)
3 2 1 (I1 , I2 , I3 ) = (0, 1, 0)
from which we can compute anything we like.
Example 145 (Hats, Continuation of Example 143). Find the joint distribution of
(X, Y ) = (I1 , I2 + I3 ).
Continuous random variables
Definition 146. The random variable (X, Y ) is said to be (jointly) continuous if there exists a
function fX,Y (x, y), called the (joint) density of (X, Y ), such that
P{(X, Y) ∈ A} = ∫∫_{(u,v)∈A} fX,Y(u, v) du dv,   A ⊂ R².
By letting A = {(u, v) : u ≤ x, v ≤ y}, we see that the (joint) cumulative distribution function of
(X, Y ) can be written
FX,Y(x, y) = P(X ≤ x, Y ≤ y) = ∫_{−∞}^x ∫_{−∞}^y fX,Y(u, v) du dv,   (x, y) ∈ R²,
Example
Example 147. Calculate the joint cumulative distribution function and P(X ≤ 1, Y ≤ 2) when
fX,Y(x, y) ∝ e^{−x−y} for y > x > 0, and fX,Y(x, y) = 0 otherwise.
We can write f (x, y) = ce−x−y I(y > x)I(x > 0), where I(A) is the indicator function of the set A.
Note to Example 147
Note that if min(x, y) ≤ 0, then F (x, y) = 0, and consider the integral for y > x (sketch):
F(x, y) = ∫_{−∞}^x du ∫_{−∞}^y dv f(u, v)
        = c ∫_0^x e^{−u} du ∫_u^y e^{−v} dv
        = c ∫_0^x du e^{−u} [−e^{−v}]_u^y
        = c ∫_0^x du e^{−u}(e^{−u} − e^{−y})
        = c ∫_0^x du (e^{−2u} − e^{−u−y})
        = c [e^{−u−y} − (1/2)e^{−2u}]_0^x
        = (1/2)c(1 − e^{−2x} − 2e^{−y} + 2e^{−y−x}).
Letting x, y → ∞ shows that F(∞, ∞) = c/2, so c = 2.
Exponential families
Definition 148. Let (X1 , . . . , Xn ) be a discrete or continuous random variable with mass/density
function of the form
f(x1, . . . , xn) = exp{Σ_{i=1}^p si(x)θi − κ(θ1, . . . , θp) + c(x1, . . . , xn)},   (x1, . . . , xn) ∈ D ⊂ R^n,
Example 149. Show that the (a) Poisson and (b) gamma distributions are exponential families.
Example 150 (Random graph model). (a) Suppose that we have d ≥ 3 nodes, and links appear between nodes i and j (i ≠ j) independently with probability p. Let Xi,j be the indicator that there is a link between i and j. Show that the joint mass function of X1,2, . . . , Xd−1,d is an exponential family.
(b) If s1(x) = Σ_{i<j} xi,j and s2(x) = Σ_{i<j<k} xi,j xj,k xk,i, discuss the properties of data from an exponential family with mass function
exp{s1(x)θ + s2(x)β − κ(θ, β)}
as θ and β vary.
Note to Example 149
(a) We write
f(x; λ) = λ^x e^{−λ}/x! = exp{x log λ − λ − log x!},   x = 0, 1, 2, . . . ,
which is of the required form with n = p = 1, s(x) = x, θ = log λ ∈ Θ = R, κ(θ) = exp(θ), and c(x) = −log x!.
(b) We write
f(x; λ, α) = λ^α x^{α−1} exp(−λx)/Γ(α)
           = exp{α log x − λx + α log λ − log Γ(α) − log x},   λ, α > 0, x > 0,
which is of the required form with n = 1, p = 2, θ1 = α, θ2 = −λ, so Θ = R+ × R− ,
s1 (x) = log x, s2 (x) = x, so D = R × R+ and κ(θ) = log Γ(θ1 ) − θ1 log(−θ2 ), c(x) = − log x.
For Example 150(a), the joint mass function of the independent link indicators is Π_{i<j} p^{xi,j}(1 − p)^{1−xi,j}, which is of the given form with n = d(d − 1)/2, p = 1, s(x) = Σ_{i<j} xi,j, c(x1,2, . . . , xd−1,d) ≡ 0, θ = log{p/(1 − p)} ∈ Θ = R, and κ(θ) = d(d − 1) log(1 + e^θ)/2 (check this).
Note that p = 1/2 corresponds to θ = 0, which corresponds to links appearing independently with
probability 0.5, whereas setting θ ≪ 0 will give a very sparse graph, with very few links.
(b) Here s1 (x) counts how many links there are, and s2 (x) counts how many triangles there are.
Increasing β therefore gives more probability to graphs with lots of triangles, whereas decreasing β
makes triangles less likely. So, taking θ ≪ 0 and β ≫ 0 will tend to give a graph with a few links,
but mostly in triangles. Note that the normalising constant is very complex, as it is
κ(θ, β) = log Σ_x exp{s1(x)θ + s2(x)β},
where the sum runs over all possible graphs x.
Marginal and conditional distributions
Definition 151. The marginal probability mass/density function of X is
fX(x) = Σ_y fX,Y(x, y) in the discrete case, and fX(x) = ∫_{−∞}^∞ fX,Y(x, y) dy in the continuous case, for x ∈ R.
The conditional probability mass/density function of Y given X = x is
fY|X(y | x) = fX,Y(x, y)/fX(x),   y ∈ R,
provided that fX(x) > 0.
Examples
Example 152. Calculate the conditional PMF of Y given X, and the marginal PMFs of Example 145.
Example 153. Calculate the marginal and conditional densities for Example 147.
Example 154. Every day I receive a number of emails whose distribution is Poisson, with parameter
µ = 100. Each is a spam independently with probability p = 0.9. Find the distribution of the number
of good emails which I receive. Given that I have received 15 good ones, find the distribution of the
total number that I received.
Probability and Statistics for SIC slide 183
Note to Example 152
The joint mass function can be represented as
x y f (x, y)
0 0 2/6
0 1 2/6
1 0 1/6
1 2 1/6
so fX(0) = 4/6, fX(1) = 2/6, and fY(0) = 3/6, fY(1) = 2/6, fY(2) = 1/6, and so we obtain
x y f (y | x)
0 0 1/2
0 1 1/2
1 0 1/2
1 2 1/2
Note to Example 153
The marginal densities are fX(x) = ∫_x^∞ 2e^{−x−y} dy = 2e^{−2x}, x > 0, and fY(y) = ∫_0^y 2e^{−x−y} dx = 2e^{−y}(1 − e^{−y}), y > 0, and its integral is 2(1 − 1/2) = 1, so this is also a valid density function.
For the conditional densities we have
fY|X(y | x) = 2e^{−x−y}/(2e^{−2x}) = e^{−(y−x)},   y > x > 0,
and
fX|Y(x | y) = 2e^{−x−y}/{2e^{−y}(1 − e^{−y})} = e^{−x}/(1 − e^{−y}),   0 < x < y.
It is easy to check that both conditional densities integrate to unity. Compare to Example 120.
Note to Example 154
Let N denote the total number of emails, and G the number of good ones. Then conditional on
N = n, G ∼ B(n, 1 − p), so
fG,N(g, n) = fG|N(g | n) fN(n) = {n!/(g!(n − g)!)} (1 − p)^g p^{n−g} × µ^n e^{−µ}/n!,   n ∈ {0, 1, 2, . . .}, g ∈ {0, 1, . . . , n},
where µ > 0 and 0 < p < 1. Thus the number of good emails G has density
fG(g) = Σ_{n=g}^∞ fG,N(g, n)
      = {e^{−µ} µ^g (1 − p)^g/g!} × Σ_{n=g}^∞ µ^{n−g} p^{n−g}/(n − g)!
      = {e^{−µ} µ^g (1 − p)^g/g!} × Σ_{r=0}^∞ (µp)^r/r!,   where r = n − g,
      = {e^{−µ} µ^g (1 − p)^g/g!} × e^{µp} = {µ(1 − p)}^g e^{−µ(1−p)}/g!,   g ∈ {0, 1, . . .},
which is the Poisson mass function with parameter µ(1 − p).
Finally, given that G = g,
fN|G(n | g) = fG,N(g, n)/fG(g) = [{n!/(g!(n − g)!)} (1 − p)^g p^{n−g} µ^n e^{−µ}/n!] / [e^{−µ(1−p)} µ^g (1 − p)^g/g!] = (pµ)^{n−g} e^{−pµ}/(n − g)!,   n = g, g + 1, . . . ,
which is a Poisson distribution with mean µp, shifted to start at n = g. Thus the number of spams
S = N − G has a Poisson distribution, with mean µp.
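A short simulation agrees with this thinning calculation (a sketch; µ = 100 and p = 0.9 as in Example 154):

set.seed(1)
N <- rpois(1e5, 100)        # total emails per day
G <- rbinom(1e5, N, 0.1)    # good emails, each kept with probability 1 - p = 0.1
mean(G); var(G)             # both close to mu * (1 - p) = 10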
We analogously define the conditional and marginal densities, the cumulative distribution functions,
etc., by replacing (X, Y ) by X = XA , Y = XB , where A, B ⊂ {1, . . . , n} and A ∩ B = ∅. So for
example, if n = 4, we can consider the marginal distribution of (X1 , X2 ) and its conditional
distribution given (X3 , X4 ).
Subsequently everything can be generalised to n variables, but for ease of notation we will mostly limit
ourselves to the bivariate case.
Probability and Statistics for SIC slide 184
Multinomial distribution
Definition 156. The random variable (X1 , . . . , Xk ) has the multinomial distribution of
denominator m and probabilities (p1 , . . . , pk ) if its mass function is
f(x1, . . . , xk) = {m!/(x1! × · · · × xk!)} p1^{x1} p2^{x2} · · · pk^{xk},   x1, . . . , xk ∈ {0, . . . , m},   Σ_{j=1}^k xj = m,
This distribution appears as the distribution of the number of individuals in the categories {1, . . . , k}
when m independent individuals fall into the classes with probabilities {p1 , . . . , pk }. It generalises the
binomial distribution to k > 2 categories.
Example 157 (Vote). n students vote for three candidates for the presidency of their syndicate. Let
X1 , X2 , X3 be the number of corresponding votes, and suppose that the n students vote independently
with probabilities p1 = 0.45, p2 = 0.4, and p3 = 0.15. Find the joint distribution of X1 , X2 , X3 ,
calculate the marginal distribution of X3 , and the conditional distribution of X1 given X3 = x3 .
Note to Example 157
This is a multinomial distribution with k = 3, denominator n, and the given probabilities. The
joint density is therefore
f(x1, x2, x3) = {n!/(x1! x2! x3!)} p1^{x1} p2^{x2} p3^{x3},   x1, x2, x3 ∈ {0, . . . , n},   Σ_{j=1}^3 xj = n.
The marginal distribution of X3 is the number of votes for the third candidate. If we say that a
vote for him is a success, and a vote for one of the other two is a failure, we see that
X3 ∼ B(n, p3 ): X3 is binomial with denominator n and probability 0.15.
Alternatively we can compute the marginal density for x3 = 0, . . . , n using Definition 151 with
X = X3 and Y = (X1 , X2 ) as
P(X3 = x3) = Σ_{(x1,x2): x1+x2=n−x3} {n!/(x1! x2! x3!)} p1^{x1} p2^{x2} p3^{x3}
           = {n!/(x3!(x1 + x2)!)} p3^{x3} Σ_{(x1,x2): x1+x2=n−x3} {(x1 + x2)!/(x1! x2!)} p1^{x1} p2^{x2}
           = {n!/(x3!(x1 + x2)!)} p3^{x3} (p1 + p2)^{n−x3}
           = {n!/((n − x3)! x3!)} p3^{x3} (1 − p3)^{n−x3},
using Newton’s binomial formula (Theorem 17) and the fact that p1 + p2 = 1 − p3 . Thus again we
see that X3 ∼ B(n, p3 ).
If we now take the ratio of the joint density of (X1 , X2 , X3 ) to the marginal density of X3 , we
obtain the conditional density
fX1 ,X2 ,X3 (x1 , x2 , x3 )
fX1 ,X2 |X3 (x1 , x2 | x3 ) =
fX3 (x3 )
n! x1 x2 x3
x1 !x2 !x3 ! p1 p2 p3
= n! x1 +x2 px3
x3 !(x1 +x2 )! (p1 + p2 ) 3
(x1 + x2 )! x1 x2
= π 1 π 2 , 0 ≤ x1 ≤ x + 1 + x2 ,
x1 !x2 !
where π1 = p1 /(p1 + p2 ), π2 = 1 − π1 . This density is binomial with denominator
n − x3 = x1 + x2 and probability π1 = p1 /(1 − p3 ). Note that X2 = n − x3 − X1 , so although the
conditional mass function here has two arguments X1 , X2 , in reality it is of dimension 1.
We conclude that, conditional on knowing the vote for one candidate, X3 = x3 , the split of votes
for the other two candidates has a binomial distribution. If we regard a vote for candidate 1 as a
‘success’, then X1 ∼ B(n − x3 , π1 ), where n − x3 is the number of votes not for candidate 3, and
π1 is the conditional probability of voting for candidate 1, given that a voter has not chosen
candidate 3.
Probability and Statistics for SIC note 1 of slide 185
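The marginal result X3 ∼ B(n, p3) is easy to check by simulation (a sketch; n = 500 voters):

set.seed(1)
n <- 500; p <- c(0.45, 0.40, 0.15)
x3 <- rmultinom(1e4, size = n, prob = p)[3, ]   # votes for candidate 3 in each simulated election
mean(x3); var(x3)                               # close to n*p3 = 75 and n*p3*(1 - p3) = 63.75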
Independence
Definition 158. Random variables X, Y defined on the same probability space are independent if
fX,Y(x, y) = fX(x) fY(y) for all x, y ∈ R,   (2)
which will be our criterion of independence. This condition concerns the functions fX,Y(x, y), fX(x), fY(y): X, Y are independent iff (2) remains true for all x, y ∈ R.
Thus knowing that X = x does not affect the density of Y : this is an obvious meaning of
“independence”. By symmetry fX|Y (x | y) = fX (x) for all y such that fY (y) > 0.
Examples
Example 159. Are (X, Y ) independent in (a) Example 145? (b) Example 147? (c) when
fX,Y(x, y) ∝ e^{−3x−2y} for x, y > 0, and fX,Y(x, y) = 0 otherwise.
If X and Y are independent, then in particular the support of (X, Y ) must be of the form
SX × SY ⊂ R 2 .
Example 161. If X1, X2, X3 are iid exp(λ) random variables, give their joint density.
Note to Example 159
(a) Since
fX(0) fY(2) = (2/3) × (1/6) ≠ fX,Y(0, 2) = 0,
X and Y are dependent. This is obvious, because if I have the wrong hat (i.e., X = 0), then it is
impossible that both other persons have the correct hats (i.e., Y = 2 is impossible).
Finding a single pair (x, y) giving fX,Y (x, y) 6= fX (x)fY (y) is enough to show dependence, while to
show independence it must be true that fX,Y (x, y) = fX (x)fY (y) for every possible (x, y).
(b) In this case fX,Y(x, y) = 2e^{−x−y} for y > x > 0, and fX,Y(x, y) = 0 otherwise,
and we previously saw that fX(x) = 2e^{−2x}, x > 0, and fY(y) = 2e^{−y}(1 − e^{−y}), y > 0,
so obviously the joint density is not the product of the marginals. This is equally obvious on looking at
the conditional densities.
In this case, the dependence is clear without any computations, as the support of (X, Y ) cannot be the
product of sets IA (x)IB (y), but it would have to be if they were independent.
(c) The density factorizes and the support is a Cartesian product, so they are independent.
Mixed distributions
We sometimes encounter distributions with X discrete and Y continuous, or vice versa.
Example 162. A big insurance company observes that the distribution of the number of insurance
claims X in one year for its clients does not follow a Poisson distribution. However, a claim is a rare
event, and so it seems reasonable that the law of small numbers should apply. To model
X, we suppose that for each client, the number of claims X in one year follows a Poisson distribution
Pois(y), but that Y ∼ Gamma(α, λ): the mean number of claims for a client with Y = y is then
E(X | Y = y) = y, since certain clients are more likely to make a claim than others.
Find the joint distribution of (X, Y ), the marginal distribution of X, and the conditional distribution of
Y given X = x.
Note to Example 162
If X, conditional on Y = y has the Poisson density with parameter y > 0, then
fX|Y(x | y) = (y^x/x!) e^{−y},   x = 0, 1, . . . ,   y > 0.
Recall also the definition of the gamma function: Γ(a) = ∫_0^∞ u^{a−1} e^{−u} du, for a > 0, and that Γ(a + 1) = aΓ(a).
The joint density is
fX|Y(x | y) × fY(y) = (y^x/x!) exp(−y) × {λ^α y^{α−1}/Γ(α)} exp(−λy),   x ∈ {0, 1, . . .},   y > 0,
Insurance and learning
[Figure: four panels, described in the caption below.]
The graph shows how the knowledge of the number of accidents changes the distribution of the rate of
accidents y for an insured party. Top left: the original density fY (y). Top right: the conditional mass
function fX|Y (x | y = 0.1) for a good driver. Bottom left: the conditional mass function
fX|Y (x | y = 2) for a bad driver. Bottom right: the conditional densities fY |X (y | x) with x = 0
(blue), 1 (red), 2 (black), 3 (green), 4 (cyan) (in order of decreasing maximal density).
Joint moments
Definition 163. Let X, Y be random variables of density fX,Y (x, y). Then if E{|g(X, Y )|} < ∞, we
can define the expectation of g(X, Y ) to be
E{g(X, Y)} = Σ_{x,y} g(x, y) fX,Y(x, y) in the discrete case, and E{g(X, Y)} = ∫∫ g(x, y) fX,Y(x, y) dx dy in the continuous case.
In particular we define the joint moments and the joint central moments by
Properties of covariance
Theorem 164. Let X, Y, Z be random variables and a, b, c, d ∈ R constants. The covariance satisfies:
cov(X, X) = var(X);
cov(a, X) = 0;
cov(X, Y ) = cov(Y, X), (symmetry);
cov(a + bX + cY, Z) = b cov(X, Z) + c cov(Y, Z), (bilinearity);
cov(a + bX, c + dY ) = bd cov(X, Y );
var(a + bX + cY ) = b2 var(X) + 2bc cov(X, Y ) + c2 var(Y );
cov(X, Y )2 ≤ var(X)var(Y ), (Cauchy–Schwarz inequality).
and since this quadratic polynomial in a has at most one real root, we have
By letting g(X) = X − E(X) and h(Y ) = Y − E(Y ), we can see that if X and Y are independent,
then
cov(X, Y ) = · · · = 0.
Thus X, Y indep ⇒ cov(X, Y ) = 0. However, the converse is false.
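A standard counterexample: with X ∼ N(0, 1) and Y = X², Y is completely determined by X, yet cov(X, Y) = E(X³) = 0. Numerically (a sketch):

set.seed(1)
x <- rnorm(1e5); y <- x^2
cov(x, y)    # close to 0, although Y is a function of X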
Linear combinations of random variables
Definition 165. The average of random variables X1, . . . , Xn is X̄ = n^{−1} Σ_{j=1}^n Xj.
(c) If X1 , . . . , Xn are independent and all have mean µ and variance σ 2 , then
Correlation
Unfortunately the covariance depends on the units of measurement, so we often use the following
dimensionless measure of dependence.
Example 169. We can model the heredity of a quantitative genetic characteristic as follows. Let X be
its value for a parent, and Y1 and Y2 its values for two children.
iid
Let Z1 , Z2 , Z3 ∼ N (0, 1) and take
Properties of correlation
Theorem 170. Let X, Y be random variables having correlation ρ = corr(X, Y ). Then:
(a) −1 ≤ ρ ≤ 1;
(b) if ρ = ±1, then there exist a, b, c ∈ R such that aX + bY + c = 0 with probability one;
(c) the effect of the transformation (X, Y) ↦ (a + bX, c + dY) on the correlation is
corr(X, Y) ↦ sign(bd) corr(X, Y).
Limitations of correlation
Note that:
correlation measures linear dependence, as in the upper panels below;
we can have strong nonlinear dependence, but correlation zero, as in the bottom left panel;
correlation can be strong but specious, as in the bottom right, where two sub-populations, each
without correlation, are combined.
[Figure: four scatterplots of y against x with correlations ρ = −0.3, 0.9, 0 and 0.9, illustrating the points above.]
Correlation ≠ causation
Two variables can be very correlated without one causing changes in the other.
The left panel shows strong dependence between the number of mobile phone transmitter masts,
and the number of births in UK towns. Do masts increase fertility?
The right panel shows that this dependence disappears when we allow for population size: more
people ⇒ more births and more transmitter masts. Adding masts will not lead to more babies.
[Figure: two scatterplots with ρ = 0.92 (left) and ρ = −0.09 (right); vertical axis 'Total births in 2009'.]
Conditional expectation
Definition 171. Let g(X, Y ) be a function of a random vector (X, Y ). Its conditional expectation
given X = x is
E{g(X, Y) | X = x} = Σ_y g(x, y) fY|X(y | x) in the discrete case, and ∫_{−∞}^∞ g(x, y) fY|X(y | x) dy in the continuous case,
on the condition that fX (x) > 0 and E{|g(X, Y )| | X = x} < ∞. Note that the conditional
expectation E{g(X, Y ) | X = x} is a function of x.
Example 172. Let Z = XY , where X and Y are independent, X having a Bernoulli distribution with
probability p, and Y having the Poisson distribution with parameter λ.
Find the density of Z.
Find E(Z | X = x).
Example 173. Calculate the conditional expectation and variance of the total number of emails
received in Example 154, given the arrival of g good emails.
Note to Example 172
The event Z = 0 occurs iff we have either X = 0 and Y takes any value, or if X = 1 and Y = 0.
Since X and Y are independent, we therefore have
fZ(0) = Σ_{y=0}^∞ P(X = 0, Y = y) + P(X = 1, Y = 0)
      = Σ_{y=0}^∞ P(X = 0)P(Y = y) + P(X = 1)P(Y = 0)
      = P(X = 0) Σ_{y=0}^∞ P(Y = y) + P(X = 1)P(Y = 0)
      = (1 − p) × 1 + p × e^{−λ}.
Similarly, for z = 1, 2, . . ., fZ(z) = P(X = 1, Y = z) = pλ^z e^{−λ}/z!.
No other values for Z are possible. Clearly the above probabilities are non-negative, and
Σ_{z=0}^∞ fZ(z) = (1 − p) + pe^{−λ} + Σ_{z=1}^∞ pλ^z e^{−λ}/z! = (1 − p) + p Σ_{z=0}^∞ λ^z e^{−λ}/z! = (1 − p) + p = 1,
so the function
fZ(z) = (1 − p) + pe^{−λ} for z = 0, and fZ(z) = pλ^z e^{−λ}/z! for z = 1, 2, . . . ,
is indeed a density function.
Now
E(Z | X = x) = E(XY | X = x) = xE(Y | X = x) = xE(Y) = xλ,
since if we know that X = x, then the value x of X is a constant, and since Y and X are independent, E{Xh(Y) | X = x} = xE{h(Y)} for any function h(Y). Therefore
E(Z | X = 0) = 0, E(Z | X = 1) = λ.
Expectation and conditioning
Sometimes it is easier to calculate E{g(X, Y )} in stages.
where EX and varX represent expectation and variance according to the distribution of X.
Example 175. n = 200 persons pass a busker on a given day. Each one of them decides
independently with probability p = 0.05 to give him money. The donations are independent, and have
expectation µ = 2$ and variance σ 2 = 1$2 . Find the expectation and the variance of the amount of
money he receives.
Note to Example 175
Let Xj = 1 if the jth person decides to give him money and Xj = 0 otherwise, and let Yj be the
amount of money given by the jth person, if money is given. Then we can write his total takings as
T = g(X, Y ) = Y1 X1 + · · · + Yn Xn ,
where X1, . . . , Xn ∼ B(1, p) are independent Bernoulli variables and Y1, . . . , Yn are iid with mean µ and variance σ². We
want to compute E(T ) and var(T ).
We first condition on X1 , . . . , Xn , in which case (using an obvious shorthand notation)
E(T | X = x) = E(Y1X1 + · · · + YnXn | X = x)
             = Σ_{j=1}^n E(YjXj | X = x)
             = Σ_{j=1}^n xj E(Yj | X = x) = Σ_{j=1}^n xj E(Yj) = Σ_{j=1}^n xj µ = µ Σ_{j=1}^n xj,
and
var(T | X = x) = var(Y1X1 + · · · + YnXn | X = x)
               = Σ_{j=1}^n var(YjXj | X = x)   (by independence of the Yj)
               = Σ_{j=1}^n xj² var(Yj | X = x) = Σ_{j=1}^n xj² σ² = σ² Σ_{j=1}^n xj.
In these expressions the Xj are treated as fixed quantities xj and are regarded as constants, since
the computations are conditional on Xj = xj . Note that x2j = xj , since xj = 0, 1.
Now we ‘uncondition’, by replacing the values xj of the Xj by the corresponding random variables,
and in order to calculate the expressions in Theorem 174 we therefore need to compute
E(µ Σ_{j=1}^n Xj),   var(µ Σ_{j=1}^n Xj),   E(σ² Σ_{j=1}^n Xj).
We have that S = Σ_{j=1}^n Xj ∼ B(n, p), so S has mean np and variance np(1 − p), and this yields
E(T) = EX[E{T | X = x}] = EX(µS) = µEX(S) = npµ = 200 × 0.05 × 2 = 20,
var(T) = EX[var{T | X = x}] + varX[E{T | X = x}]
       = EX(σ²S) + varX(µS) = npσ² + µ²np(1 − p)
       = 200 × 0.05 × 1 + 2² × 200 × 0.05 × 0.95 = 48.
Probability and Statistics for SIC note 1 of slide 200
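A simulation reproduces these values (a sketch; the donation distribution is specified only through its mean and variance, so the normal used below is an arbitrary choice):

set.seed(1)
takings <- replicate(2e4, {
  gives <- rbinom(200, 1, 0.05)                # who decides to give
  sum(gives * rnorm(200, mean = 2, sd = 1))    # donation amounts (arbitrary distribution with mean 2, sd 1)
})
mean(takings); var(takings)                    # close to 20 and 48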
5.3 Generating Functions slide 201
Definition
Definition 176. We define the moment-generating function of a random variable X by
MX (t) = E(etX )
Example 177. Calculate MX(t) when: (a) X = c with probability one; (b) X is an indicator variable;
(c) X ∼ B(n, p); (d) X ∼ Pois(λ); (e) X ∼ N (µ, σ 2 ).
Note to Example 177
(a) X is discrete, so MX (t) = 1 × et×c = ect , valid for t ∈ R.
(b) Here MX (t) = (1 − p)et×0 + pet×1 = 1 − p + pet , valid for t ∈ R.
(c) Using the binomial theorem we have
MX(t) = Σ_{x=0}^n e^{tx} C(n, x) p^x(1 − p)^{n−x} = Σ_{x=0}^n C(n, x) (pe^t)^x(1 − p)^{n−x} = (1 − p + pe^t)^n,   t ∈ R.
(d) We have
MX(t) = Σ_{x=0}^∞ e^{xt} (λ^x/x!) e^{−λ} = e^{−λ} Σ_{x=0}^∞ (λe^t)^x/x! = exp(λe^t)e^{−λ} = exp{λ(e^t − 1)},   t ∈ R,
where we have used the exponential series e^a = Σ_{n=0}^∞ a^n/n! for any a ∈ R.
(e) We first consider Z ∼ N (0, 1) and compute
E(e^{tZ}) = ∫_{−∞}^∞ e^{tz} × φ(z) dz = ∫_{−∞}^∞ e^{tz} × (2π)^{−1/2} e^{−z²/2} dz.
If we take σ = 1, µ = t, the left-hand side is the MGF of Z, and the right is e^{t²/2}, valid for any t ∈ R.
(As an aside, note that if we take µ = 0, σ² = 1/(1 − 2t), then the left-hand side is the MGF of Z², and the right is (1 − 2t)^{−1/2}, valid only if t < 1/2. Thus M_{Z²}(t) = (1 − 2t)^{−1/2}, t < 1/2.
This is the moment-generating function of a chi-squared random variable with one degree of freedom.)
Now note that X = µ + σZ, so MX(t) = e^{µt}MZ(σt) = exp(µt + σ²t²/2), t ∈ R.
Important theorems I
Theorem 178. If M (t) is the MGF of a random variable X, then
MX (0) = 1;
M_{a+bX}(t) = e^{at} MX(bt);
E(X^r) = ∂^r MX(t)/∂t^r |_{t=0};
E(X) = M′X(0);
var(X) = M″X(0) − M′X(0)².
Theorem 180 (No proof). There exists an injection between the cumulative distribution functions
FX (x) and the moment-generating functions MX (t).
Theorem 180 is very useful, as it says that if we recognise a MGF, we know to which distribution it
corresponds.
and set t = 0, using Theorem 178 to get the expectation and variance, λ−1 and λ−2 respectively.
Linear combinations
Theorem 181. Let a, b1 , . . . , bn ∈ R and X1 , . . . , Xn be independent rv’s whose MGFs exist. Then
Y = a + b1 X1 + · · · + bn Xn has MGF
MY(t) = · · · = e^{ta} Π_{j=1}^n MXj(tbj).
In particular, if X1, . . . , Xn are iid with common MGF MX(t), then their sum S = X1 + · · · + Xn has MGF MS(t) = MX(t)^n.
Example 182. Let X1 ∼ Pois(λ) and X2 ∼ Pois(µ) be independent. Find the distribution of X1 + X2.
Here MX1+X2(t) = MX1(t)MX2(t) = exp{λ(e^t − 1)} exp{µ(e^t − 1)} = exp{(λ + µ)(e^t − 1)},
so by Theorem 180 and Example 177(d) we recognise that X1 + X2 is a Poisson variable with parameter λ + µ.
Probability and Statistics for SIC note 2 of slide 204
Probability and Statistics for SIC note 3 of slide 204
Important theorems II
Definition 184 (−D→, Reminder). Let {Xn}, X be random variables whose cumulative distribution functions are {Fn}, F. Then we say that the random variables {Xn} converge in distribution to X, if, for all x ∈ R where F is continuous,
Fn(x) → F(x),   n → ∞.
We then write Xn −D→ X.
Theorem 185 (Continuity, no proof). Let {Xn}, X be random variables with distribution functions {Fn}, F, whose MGFs Mn(t), M(t) exist for 0 ≤ |t| < b. Then if Mn(t) → M(t) for |t| ≤ a < b when n → ∞, Xn −D→ X, i.e., Fn(x) → F(x) at each x ∈ R where F is continuous.
Example 186 (Law of small numbers, II). Let Xn ∼ B(n, pn) and X ∼ Pois(λ). Show that if n → ∞, pn → 0 in such a way that npn → λ, then Xn −D→ X.
Probability and Statistics for SIC slide 205
and this is true for any t ∈ R. Hence the hypothesis of the theorem is clearly satisfied, and thus Xn −D→ X.
Probability and Statistics for SIC note 1 of slide 205
Mean vector and covariance matrix
Definition 187. Let X = (X1 , . . . , Xp )T be a p × 1 vector of random variables. Then
E(X)p×1 = (E(X1), . . . , E(Xp))^T,
var(X)p×p = [cov(Xr, Xs)]_{r,s=1,...,p}, the p × p matrix with (r, s) element cov(Xr, Xs) and diagonal elements var(X1), . . . , var(Xp),
are called the expectation (mean vector) and the (co)-variance matrix of X.
where T = {t ∈ Rp : MX (t) < ∞}. Let the rth and (r, s)th elements of the mean vector E(X)p×1
and of the covariance matrix var(X)p×p be the quantities E(Xr ) and cov(Xr , Xs ).
Example
Example 189. Emails arrive as a Poisson process with rate λ emails per day: the number of emails
arriving each day has the Poisson distribution with parameter λ. Each is a spam with probability p.
Show that the numbers of good emails and of spams are independent Poisson variables with
parameters (1 − p)λ and pλ.
Let N ∼ Pois(λ) be the number of emails arriving in a day, let I1, I2, . . . iid∼ Bernoulli(p) indicate which of them are spam, and set S = Σ_{j=1}^N Ij and G = N − S for the numbers of spams and good emails. The joint MGF of (S, G) is
 E{exp(t1 S + t2 G)} = E[ E{ exp( t1 Σ_{j=1}^N Ij + t2 Σ_{j=1}^N (1 − Ij) ) | N } ],
where we have used the iterated expectation formula from Theorem 174. The inner expectation is
 E[ exp{ t1 Σ_{j=1}^n Ij + t2 Σ_{j=1}^n (1 − Ij) } | N = n ] = ∏_{j=1}^n E[exp{t1 Ij + t2(1 − Ij)}],
because conditional on N = n, the I1, . . . , In are independent, and because they are Bernoulli variables each with success probability p, we have
 E[exp{t1 Ij + t2(1 − Ij)}] = (1 − p)e^{t2} + p e^{t1}.
Therefore
 E[ exp{ t1 Σ_{j=1}^N Ij + t2 Σ_{j=1}^N (1 − Ij) } | N = n ] = {(1 − p)e^{t2} + p e^{t1}}^n,
and on inserting the right-hand side of this into the original expectation, and then treating N as random with a Pois(λ) distribution, we get
 E{exp(t1 S + t2 G)} = E_N[ {(1 − p)e^{t2} + p e^{t1}}^N ]
  = Σ_{n=0}^∞ e^{−λ} (λ^n/n!) {(1 − p)e^{t2} + p e^{t1}}^n
  = exp[ −λ + λ{(1 − p)e^{t2} + p e^{t1}} ]
  = exp[ −λ(1 − p + p) + λ{(1 − p)e^{t2} + p e^{t1}} ]
  = exp{ −λ(1 − p) + λ(1 − p)e^{t2} } × exp{ −λp + λp e^{t1} }
  = E{exp(t2 G)} × E{exp(t1 S)},
which is the MGF of two independent Poisson variables G and S with means (1 − p)λ and pλ, as
required.
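A small simulation sketch of this thinning property, with illustrative values of λ and p (not from the course): the spam and good counts should have the stated Poisson means and be essentially uncorrelated.

```python
# Sketch: thinning a Poisson number of emails into spam/good counts.
import numpy as np

rng = np.random.default_rng(1)
lam, p, days = 20.0, 0.3, 100_000      # assumed illustrative parameters

N = rng.poisson(lam, size=days)        # emails per day
S = rng.binomial(N, p)                 # spams: each email is spam with probability p
G = N - S                              # good emails

print(S.mean(), p * lam)               # both close to 6
print(G.mean(), (1 - p) * lam)         # both close to 14
print(np.corrcoef(S, G)[0, 1])         # close to 0, consistent with independence
```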
129
Parenthesis: Characteristic function
Many distributions do not have a MGF, since E(etX ) < ∞ only for t = 0. In this case, the Laplace
transform of the density is not useful. Instead we can use the Fourier transform, leading us to the
following definition.
Definition 190. Let i = √(−1). The characteristic function of X is
 φX(t) = E(e^{itX}), t ∈ R.
Every random variable has a characteristic function, which possesses the same key properties as the
MGF. Characteristic functions are however more complicated to handle, as they require ideas from
complex analysis (path integrals, Cauchy’s residue theorem, etc.).
Theorem 191 (No proof). X and Y have the same cumulative distribution function if and only if they have the same characteristic function. If X is continuous and has density f and characteristic function φ, then
 f(x) = (2π)^{−1} ∫_{−∞}^{∞} e^{−itx} φ(t) dt
for all x at which f is differentiable.
Example 193. Calculate the CGF and the cumulants of (a) X ∼ N (µ, σ 2 ); (b) Y ∼ Pois(λ).
Recall that the CGF is KX(t) = log MX(t), with cumulants κr the coefficients of t^r/r! in its expansion. For X ∼ N(µ, σ²), K(t) = µt + ½σ²t², so κ1 = µ, κ2 = σ² and κr = 0 for r ≥ 3; for Y ∼ Pois(λ), K(t) = λ(e^t − 1) = Σ_{r≥1} λ t^r/r!, so κr = λ for all r.
Probability and Statistics for SIC note 1 of slide 210
130
Cumulants of sums of random variables
Theorem 194. If a, b1, . . . , bn are constants and X1, . . . , Xn are independent random variables, then
 K_{a+b1X1+···+bnXn}(t) = ta + Σ_{j=1}^{n} K_{Xj}(t bj).
In particular, the rth cumulant of X1 + · · · + Xn is the sum of the rth cumulants of the Xj. If X1, . . . , Xn iid∼ F with CGF K(t), then X1 + · · · + Xn has CGF nK(t) and rth cumulant nκr.
131
5.4 Multivariate Normal Distribution slide 213
XA ⊥⊥ XB ⇔ Ω_{A,B} = 0.
(d) If X1, . . . , Xn iid∼ N(µ, σ²), then X_{n×1} = (X1, . . . , Xn)^T ∼ Nn(µ1n, σ²In).
(e) Linear combinations of normal variables are normal: if a is q × 1 and B is q × p, then a + BX ∼ Nq(a + Bµ, BΩB^T).
Lemma 198. The random vector X ∼ Np(µ, Ω) has a density function on Rp if and only if Ω is positive definite, i.e., Ω has rank p. If so, the density function is
 f(x; µ, Ω) = (2π)^{−p/2} |Ω|^{−1/2} exp{−½ (x − µ)^T Ω^{−1} (x − µ)}, x ∈ Rp. (3)
If not, X is a linear combination of variables that have a density function on Rm, where m < p is the rank of Ω.
Probability and Statistics for SIC slide 215
132
Note to Lemma 197
(a) Let ej denote the p-vector with 1 in the jth place and zeros everywhere else. Then Xj = ej^T X ∼ N(µj, ωjj), giving the mean and variance of Xj.
Now var(Xj + Xk) = var(Xj) + var(Xk) + 2cov(Xj, Xk), and since Xj + Xk = (ej + ek)^T X ∼ N(µj + µk, ωjj + ωkk + 2ωjk), matching the variances gives cov(Xj, Xk) = ωjk.
Next, the CGF of X splits as
 K_{XA}(tA) + K_{XB}(tB) = tA^T µA + ½ tA^T Ω_{AA} tA + tB^T µB + ½ tB^T Ω_{BB} tB
if and only if the cross term tA^T Ω_{AB} tB of KX(t) equals zero for all t, which occurs if and only if Ω_{AB} = 0. Hence the elements of the variance matrix corresponding to cov(Xr, Xs) must equal zero for any r ∈ A and s ∉ A, as required. Clearly this also holds if A ∪ B ≠ {1, . . . , p}.
(d) In this case each of the Xj has mean µ and variance σ², and since they are independent, cov(Xj, Xk) = 0 for j ≠ k. If u ∈ Rn, then u^T X is a linear combination of normal variables, with mean Σ_j uj µ = µ u^T 1n and variance Σ_j uj² σ² = u^T (σ² In) u, so X ∼ Nn(µ1n, σ² In), as required.
(e) The MGF of a + BX equals
 E[exp{t^T(a + BX)}] = E[exp{t^T a + (B^T t)^T X}] = e^{t^T a} MX(B^T t)
  = exp{t^T a + (B^T t)^T µ + ½ (B^T t)^T Ω (B^T t)}
  = exp{t^T(a + Bµ) + ½ t^T (BΩB^T) t},
which is the MGF of the N(a + Bµ, BΩB^T) distribution, as required.
133
Note I to Lemma 198
Since Ω is positive semi-definite, the spectral theorem tells us that we may write Ω = AT DA,
where D = diag(d1 , . . . , dp ) contains the eigenvalues of Ω, with d1 ≥ d2 ≥ · · · ≥ dp ≥ 0, and A is
a p × p orthogonal matrix, i.e., AT A = AAT = Ip and |A| = 1. Note that
|Ω| = |AT DA| = |AT | × |D| × |A| = |D|, and that if the inverse exists, Ω−1 = AT D −1 A.
Now Y = AX ∼ Np (Aµ, AΩAT ), and AΩAT = AAT DAAT = D is diagonal, so Y1 , . . . , Yp are
independent normal variables with means bj given by the elements of Aµ and variances dj .
Suppose that dp > 0, so that Ω has rank p. Then all the Yj have non-degenerate normal densities, and since they are independent, their joint density is
 fY(y) = ∏_{j=1}^{p} (2πdj)^{−1/2} exp{−(yj − bj)²/(2dj)} = (2π)^{−p/2} |D|^{−1/2} exp{−½ (y − b)^T D^{−1} (y − b)}.
Since Y = AX and A^{−1} = A^T, we have that X = A^T Y, and this transformation has Jacobian |A^T| = 1. Since |D| = |Ω|, we can appeal to Theorem 204 and hence write the density of X as
 fX(x) = |A^T| fY(Ax) = (2π)^{−p/2} |Ω|^{−1/2} exp{−½ (Ax − Aµ)^T D^{−1} (Ax − Aµ)}, x ∈ Rp,
where (Ax − Aµ)^T D^{−1} (Ax − Aµ) = (x − µ)^T A^T D^{−1} A (x − µ) = (x − µ)^T Ω^{−1} (x − µ), giving (3).
If Ω has rank m < p, then dm > 0 but d_{m+1} = · · · = dp = 0. In this case only Y1, . . . , Ym have positive variances, and the argument above allows us to construct a joint density for Y1, . . . , Ym on Rm. Since Y_{m+1} = b_{m+1}, . . . , Yp = bp with probability one, we can write
 X = A^T Y = A^T (Y1, . . . , Ym, b_{m+1}, . . . , bp)^T,
which confirms that the density of X is positive only on an m-dimensional linear subspace of Rp generated by the variation of Y1, . . . , Ym; it might be said to have only 'm degrees of freedom'.
134
Note II to Lemma 198
Since Ω is symmetric and positive semi-definite, the spectral theorem tells us that we may write Ω = ADA^T, where D = diag(d1, . . . , dp) contains the (real) eigenvalues of Ω, with d1 ≥ d2 ≥ · · · ≥ dp ≥ 0, and A is a p × p orthogonal matrix, i.e., A^T A = AA^T = Ip and |A| = 1. The columns a1, . . . , ap of A are the eigenvectors corresponding to the respective eigenvalues; note that
 Ω = ADA^T = Σ_{j=1}^{p} dj aj aj^T,
that |Ω| = |ADA^T| = |A| × |D| × |A^T| = |D|, and that Ω^{−1} = AD^{−1}A^T if the inverse exists.
Now let Yj ∼ N(0, dj) be independent variables, let Y = (Y1, . . . , Yp)^T, and let u ∈ Rp; note that if dj = 0 then Yj = 0 with probability one. Then
 u^T X = u^T(µ + AY) = u^T µ + Σ_{j=1}^{p} Yj u^T aj
is a linear combination of normal variables, so it has a normal distribution, with mean u^T µ and variance
 var(u^T µ + Σ_{j=1}^{p} Yj u^T aj) = Σ_{j=1}^{p} (u^T aj)² var(Yj) = u^T (Σ_{j=1}^{p} dj aj aj^T) u = u^T Ω u,
so u^T X has the N(u^T µ, u^T Ω u) distribution for every u ∈ Rp, and hence X = µ + AY ∼ Np(µ, Ω), as required. As in Note I, when Ω has rank m < p the variation of X is confined to an m-dimensional linear subspace of Rp.
135
Bivariate normal densities
Normal PDF with p = 2, µ1 = µ2 = 0, ω11 = ω22 = 1, and correlation ρ = ω12 /(ω11 ω22 )1/2 = 0 (left),
−0.5 (centre) and 0.9 (right).
Examples
Example 199. If X ∼ N (1, 4) , Y ∼ N (−1, 9), corr(X, Y ) = −1/6, and they have a joint normal
distribution, give the joint distribution of (X, Y ). Hence find the distribution of W = X + Y .
Example 200. If X1, . . . , X4 iid∼ N(0, σ²), find the distribution of Y = BX when
 B = ( 1 −1 −1 −1 )
     ( 1 −1  1  1 )
     ( 1  1 −1  1 )
     ( 1  1  1 −1 ).
136
Note to Example 199
Part (a) of Lemma 197 gives that
 (X, Y)^T ∼ N2{ (E(X), E(Y))^T, ( var(X) cov(X, Y) ; cov(X, Y) var(Y) ) },
and cov(X, Y) = corr(X, Y){var(X)var(Y)}^{1/2} = −(1/6) × 2 × 3 = −1. Therefore
 (X, Y)^T ∼ N2{ (1, −1)^T, ( 4 −1 ; −1 9 ) }.
Since W = X + Y is a linear combination of normal variables, it has a normal distribution, and we can apply Part (e) of Lemma 197 with r = 1, p = 2, a = 0 and B = (1, 1) to obtain
 W ∼ N1{ (1, 1)(1, −1)^T, (1, 1)( 4 −1 ; −1 9 )(1, 1)^T } = N(0, 11).
For Example 200, Lemma 197(e) gives Y = BX ∼ N4(0, σ² BB^T) = N4(0, 4σ² I4), because it is easy to check that BB^T = 4I4. Thus the variables Y1, . . . , Y4 have N(0, 4σ²) distributions, and are independent because their covariance matrix is diagonal.
Theorem 201. Let X ∼ Np(µ, Ω) and write X^T = (XA^T, XB^T), where XA is q × 1, with µ and Ω partitioned conformally. Then
(a) XA ∼ Nq(µA, ΩA);
(b) the conditional distribution of XA given XB = xB is normal, with mean µA + ΩAB ΩB^{−1}(xB − µB) and covariance matrix ΩA − ΩAB ΩB^{−1} ΩBA.
137
Proof of Theorem 201
First note that without loss of generality we can permute the elements of X so that the components of XA appear before those of XB, writing X^T = (XA^T, XB^T). Partition the vectors t, µ, and the matrix Ω conformally with X, using obvious notation.
(a) The CGF of X is
 KX(t) = t^T µ + ½ t^T Ω t = tA^T µA + tB^T µB + ½ (tA^T ΩA tA + 2 tA^T ΩAB tB + tB^T ΩB tB).
Setting tB = 0 gives K_{XA}(tA) = tA^T µA + ½ tA^T ΩA tA, the CGF of the Nq(µA, ΩA) distribution, so XA ∼ Nq(µA, ΩA).
(b) Let W = XA − ΩAB ΩB^{−1} XB. Then W is a linear combination of normal variables, so it is normal, with mean and covariance matrix
 µA − ΩAB ΩB^{−1} µB,  ΩA − ΩAB ΩB^{−1} ΩBA,
and as cov(XB, W) = 0 (check!) and they are jointly normally distributed, W ⊥⊥ XB. Now
 XA = W + ΩAB ΩB^{−1} XB,
and as W and XB are independent, the distribution of W is unchanged by conditioning on the event XB = xB. The conditional mean of XA is therefore
 E(W) + ΩAB ΩB^{−1} xB = µA + ΩAB ΩB^{−1}(xB − µB),
as required. Likewise the conditional variance is var(W) = ΩA − ΩAB ΩB^{−1} ΩBA, because the term in XB is conditionally constant. This gives the required result.
Example
Example 202. Let (X1, X2) be the pair (height (cm), weight (kg)) for a population of people aged 20. To model this, we take
 µ = (180, 70)^T,  Ω = ( 225  90 ; 90 100 ).
(a) Find the marginal distributions of X1 and of X2 , and corr(X1 , X2 ).
(b) Do the marginal distributions determine the joint distribution?
(c) Find the conditional distribution of X2 given that X1 = x1 , and of X1 given that X2 = x2 .
138
Note to Example 202
(a) The marginal distributions are X1 ∼ N(180, 225) and X2 ∼ N(70, 100). The correlation is
 ω12/(ω11 ω22)^{1/2} = 90/√(225 × 100) = 90/150 = 0.6.
(c) Here XA = X2 and XB = X1, so Theorem 201(b) gives
 µA + ΩAB ΩB^{−1}(xB − µB) = µ2 + ω21 ω11^{−1}(x1 − µ1) = 70 + 0.4(x1 − 180),
 ΩA − ΩAB ΩB^{−1} ΩBA = 100 − 90²/225 = 64.
Thus X2 | X1 = x1 ∼ N {70 + 0.4(x1 − 180), 64}: larger height leads to larger weight, on average.
A similar computation gives X1 | X2 = x2 ∼ N{180 + 0.9(x2 − 70), 144}.
In each case the mean depends linearly on the conditioning variable, and the conditional variance is
smaller than the marginal variance, consistent with the idea that conditioning adds information and
therefore reduces uncertainty.
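A simulation sketch of part (c), using the µ and Ω of Example 202: conditioning on a height near 190 cm should give a conditional mean weight near 74 kg and conditional variance near 64.

```python
# Sketch: check the conditional mean and variance of weight given height by simulation.
import numpy as np

rng = np.random.default_rng(3)
mu = np.array([180.0, 70.0])
Omega = np.array([[225.0, 90.0], [90.0, 100.0]])

X = rng.multivariate_normal(mu, Omega, size=500_000)
sel = (X[:, 0] > 189.5) & (X[:, 0] < 190.5)        # condition on height close to 190 cm

print(X[sel, 1].mean(), 70 + 0.4 * (190 - 180))    # both close to 74
print(X[sel, 1].var(), 64)                         # both close to 64
```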
139
Francis Galton (1822–1911)
(Source: Wikipedia)
140
5.5 Transformations slide 223
where the | · | ensures that the same formula holds with monotonic decreasing g.
X bivariate
We calculate P(Y ∈ B), with Y ∈ Rd a function of X ∈ R² and
 Y = (Y1, . . . , Yd)^T = (g1(X1, X2), . . . , gd(X1, X2))^T = g(X).
It can be helpful to include indicator functions in formulae for densities of new variables (examples
later).
141
Note to Example 203
We want to compute P(Y ≤ y) = P(X1 + X2 ≤ y), and with By = (−∞, y] and g(x1 , x2 ) = x1 + x2 ,
we have that
g−1 (By ) = {(x1 , x2 ) ∈ R2 : x1 + x2 ≤ y}.
Thus we want to compute FY (y) = P(X1 + X2 ≤ y). If y < 0 this is zero, and otherwise equals
FY(y) = P(X1 + X2 ≤ y) = ∫_0^y dx1 ∫_0^{y−x1} dx2 λ² e^{−λ(x1+x2)}
      = λ ∫_0^y dx1 e^{−λx1} [−e^{−λx2}]_{x2=0}^{x2=y−x1}
      = λ ∫_0^y dx1 e^{−λx1} {1 − e^{−λ(y−x1)}}
      = 1 − e^{−λy} − λy e^{−λy},  y ≥ 0,
giving
 FY(y) = 0 for y < 0, and FY(y) = 1 − e^{−λy} − λy e^{−λy} for y ≥ 0.
Differentiation gives fY(y) = λ² y e^{−λy} for y > 0 (the gamma density with shape parameter α = 2).
iid
Example 205. Calculate the joint density of X1 + X2 and X1 − X2 when X1 , X2 ∼ N (0, 1).
iid
Example 206. Calculate the joint density of X1 + X2 and X1 /(X1 + X2 ) when X1 , X2 ∼ exp(λ).
142
Note to Example 205
We already have one way to do this, as we can write
 (Y1, Y2)^T = (X1 + X2, X1 − X2)^T = B (X1, X2)^T,  B = ( 1 1 ; 1 −1 ),
say, and use results for the multivariate normal distribution in Lemma 197(e).
Using Theorem 204 instead, we need to compute
 f_{Y1,Y2}(y1, y2) = f_{X1,X2}(x1, x2) × |J(x1, x2)|^{−1}, evaluated at x1 = h1(y1, y2), x2 = h2(y1, y2),
where the Jacobian is J(x1, x2) = |B| = |−2| = 2.
143
Note to Example 206
We write
 f(x1, x2) = λ² exp{−λ(x1 + x2)} I(x1 > 0) I(x2 > 0).
With Y1 = X1 + X2 > 0 and Y2 = X1/(X1 + X2) ∈ (0, 1), we have the inverse transformation
 x1 = h1(y1, y2) = y1 y2,  x2 = h2(y1, y2) = y1(1 − y2).
Clearly these transformations satisfy the conditions of Theorem 204. We can either compute
 J = det( 1, 1 ; x2/(x1 + x2)², −x1/(x1 + x2)² ) = −1/(x1 + x2),  so |J|^{−1} = x1 + x2 = y1 > 0,
or (maybe better),
 J^{−1} = det( ∂h1/∂y1, ∂h1/∂y2 ; ∂h2/∂y1, ∂h2/∂y2 ) = det( y2, y1 ; 1 − y2, −y1 ) = −y1,  so |J^{−1}| = y1 > 0.
Thus
 f(y1, y2) = λ² exp{−λ(x1 + x2)} I(x1 > 0) I(x2 > 0) |J^{−1}|, evaluated at x1 = y1 y2, x2 = y1(1 − y2),
           = y1 λ² exp(−λy1) I(y1 y2 > 0) I{y1(1 − y2) > 0}
           = λ² y1 exp(−λy1) I(y1 > 0) × I(0 < y2 < 1)
           = f_{Y1}(y1) × f_{Y2}(y2).
Integration over y2 shows that the marginal density of Y1 is λ² y1 exp(−λy1) I(y1 > 0), and so Y1 ∼ Gamma(2, λ) and Y2 ∼ U(0, 1), independently.
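A quick simulation sketch of Example 206, with an assumed value of λ, confirming the marginal moments and the lack of correlation between Y1 and Y2.

```python
# Sketch: Y1 = X1 + X2 and Y2 = X1/(X1 + X2) for iid exponential(lambda) inputs.
import numpy as np

rng = np.random.default_rng(4)
lam, n = 2.0, 200_000                              # lambda assumed for illustration
X1 = rng.exponential(1 / lam, n)
X2 = rng.exponential(1 / lam, n)

Y1, Y2 = X1 + X2, X1 / (X1 + X2)
print(Y1.mean(), 2 / lam)                          # Gamma(2, lambda) has mean 2/lambda
print(Y2.mean(), Y2.var(), 0.5, 1 / 12)            # uniform(0,1) moments
print(np.corrcoef(Y1, Y2)[0, 1])                   # close to 0
```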
Example 208. Show that the sum of independent exponential and gamma variables has a gamma
distribution.
Probability and Statistics for SIC slide 227
144
Note to Theorem 207
Change variables to W = X and S = X + Y, so the Jacobian is
 J = det( 1 0 ; 1 1 ) = 1,
and note that x = w and y = s − w. Thus, since X and Y are independent, an application of Theorem 204 gives
 f_{W,S}(w, s) = fX(w) fY(s − w), and hence fS(s) = ∫ fX(w) fY(s − w) dw.
The computation in the discrete case is similar, but the Jacobian is not needed.
For Example 208, take X ∼ Gamma(α, λ) and Y ∼ exp(λ) independent. The product of the indicator functions is positive only when w > 0 and s − w > 0 simultaneously, i.e., when 0 < w < s, and hence on putting constants outside the integral, we have
 fS(s) = {λ^{α+1} e^{−λs}/Γ(α)} ∫_0^s w^{α−1} dw.
On noting that the integral equals s^α/α and recalling that αΓ(α) = Γ(α + 1), we have
 fS(s) = λ^{α+1} s^α e^{−λs}/Γ(α + 1),  s > 0,
which is the Gamma(α + 1, λ) density, as required.
145
Risk estimation
Accurate estimation of risk is essential in financial markets, nuclear power plants, etc. We often need
to calculate the effect of rare events for several variables together, with little information on their joint
distribution. To be concrete, let −X1, −X2 be negative shocks in a financial market, and consider S = X1 + X2, whose quantiles s_{1−α}, defined by P(S ≤ s_{1−α}) = 1 − α, we need to estimate. The notes below show that
in the normal case (often used in practice) twice the marginal risk is an upper bound for the joint risk, but in the Pareto case it is a lower bound. So if we base risk calculations on the normal distribution but reality is Pareto, losses can be much greater than predicted.
146
Note on risk estimation, II
Now consider the convolution of independent random variables X1, X2 both having distribution function F(x) = 1 − x^{−1/2}, x > 1, and thus density function ½ x^{−3/2}, x > 1. According to Theorem 207, their sum S = X1 + X2 has density
 fS(s) = ¼ ∫_1^{s−1} x^{−3/2} (s − x)^{−3/2} dx,  s > 2.
To work this out we first set x = su and a = 1/s for convenience, giving
 fS(s) = ¼ s^{−2} ∫_a^{1−a} {u(1 − u)}^{−3/2} du,
and on setting u = sin²θ we obtain
 fS(s) = ½ s^{−2} ∫_{θ1}^{θ2} dθ/(sin²θ cos²θ)
       = 2 s^{−2} ∫_{θ1}^{θ2} dθ/sin²(2θ)
       = s^{−2} [−cos 2θ/sin 2θ]_{θ1}^{θ2}
       = ½ s^{−2} [1/tan θ − tan θ]_{θ2}^{θ1},
after a little trigonometry. The limits of the integral are determined using a = sin²θ1, giving tan θ1 = {a/(1 − a)}^{1/2}, and 1 − a = sin²θ2, giving tan θ2 = {(1 − a)/a}^{1/2}. Thus
 fS(s) = ½ s^{−2} [ {(1 − a)/a}^{1/2} − {a/(1 − a)}^{1/2} − {a/(1 − a)}^{1/2} + {(1 − a)/a}^{1/2} ]
       = s^{−2} (1 − 2a)/{a(1 − a)}^{1/2}
       = (s − 2)/{s²(s − 1)^{1/2}},  s > 2,
after a little algebra. It is then easy to check that FS(s) = 1 − 2(s − 1)^{1/2}/s, defined for s > 2.
Now the 1 − α quantile of X is x_{1−α} = 1/α², and the 1 − α quantile of S is the solution to the equation α = 2(s − 1)^{1/2}/s, which is s_{1−α} = 2α^{−2}{1 + (1 − α²)^{1/2}}. Thus we see that in this case the sum of the (1 − α)-quantiles of X1 and X2 is always less than the (1 − α)-quantile of their sum. For very small α the ratio s_{1−α}/(2x_{1−α}) = 1 + (1 − α²)^{1/2} ≈ 2 − α²/2, so the quantile of the sum is almost twice the sum of the quantiles.
The implication is that the choice of model can have a huge effect on estimation of risk. If two risk
variables have a joint normal distribution, then we can bound the quantile of their sum S by doubling
the quantile of one of them. However if they have another joint distribution, then this may badly
underestimate the quantile of S. In many applications Pareto distributions for extreme risks are much
more plausible than are normal tails, but joint normality is often used. This can lead to financial
meltdown due to serious underestimation of risk, especially for complex products where dependencies
are hidden. Google ‘The formula that killed Wall Street’, or check out
http://www.wired.com/techbiz/it/magazine/17-03/wp_quant?currentPage=all.
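A Monte Carlo sketch of this contrast, using the Pareto-type distribution F(x) = 1 − x^{−1/2} above and standard normal shocks for comparison (the sample quantile at such a small tail probability is noisy, so this only illustrates the direction of the inequality).

```python
# Sketch: quantile of the sum versus twice the marginal quantile, normal vs Pareto-type.
import numpy as np

rng = np.random.default_rng(5)
n, alpha = 1_000_000, 0.01

# Normal case: the quantile of the sum is below twice the marginal quantile.
Xn = rng.standard_normal((n, 2))
print(np.quantile(Xn.sum(axis=1), 1 - alpha), 2 * np.quantile(Xn[:, 0], 1 - alpha))

# Pareto-type case F(x) = 1 - x**-0.5, x > 1: inverse-CDF sampling gives X = U**-2.
U = rng.uniform(size=(n, 2))
Xp = U ** (-2.0)
print(np.quantile(Xp.sum(axis=1), 1 - alpha), 2 * np.quantile(Xp[:, 0], 1 - alpha))
```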
147
Multivariate case
Theorem 204 extends to random vectors with continuous density, Y = g(X) ∈ Rn , where X ∈ Rn is a
continuous variable:
then
fY1 ,...,Yn (y1 , . . . , yn ) = fX1 ,...,Xn (x1 , . . . , xn ) |J(x1 , . . . , xn )|−1 ,
evaluated at x1 = h1 (y1 , . . . , yn ), . . . , xn = hn (y1 , . . . , yn ).
148
5.6 Order Statistics slide 231
Definition
Definition 211. The order statistics of the rv's X1, . . . , Xn are the ordered values
 X(1) ≤ X(2) ≤ · · · ≤ X(n).
In particular, the minimum is X(1), the maximum is X(n), and the median is X(m+1) if n = 2m + 1 is odd, and ½{X(m) + X(m+1)} if n = 2m is even.
Theorem 212. Let X1, . . . , Xn iid∼ F, a continuous distribution with density f. Then:
 P(X(n) ≤ x) = F(x)^n;
 P(X(1) ≤ x) = 1 − {1 − F(x)}^n;
 f_{X(r)}(x) = n!/{(r − 1)!(n − r)!} × F(x)^{r−1} f(x) {1 − F(x)}^{n−r},  r = 1, . . . , n.
iid
Example 213. If X1 , X2 , X3 ∼ exp(λ), give the densities of the X(r) .
Example 214. Abélard and Héloïse make an appointment to work at the Learning Centre. Both are
late independently of each other, arriving at times distributed uniformly up to one hour after the time
agreed. Find the distribution and the expectation of the time at which the first one arrives, and give
the density of his (or her) waiting time. Find the expected time at which they can start to work.
149
Note to Example 213
We note that in this case f(x) = λe^{−λx} and F(x) = 1 − e^{−λx}, and then just apply the theorem with n = 3 and r = 1, 2, 3:
 f_{X(1)}(x) = 3λ e^{−3λx},  f_{X(2)}(x) = 6λ e^{−2λx}(1 − e^{−λx}),  f_{X(3)}(x) = 3λ e^{−λx}(1 − e^{−λx})²,  x > 0.
For Example 214, let U and V denote the earlier and later of the two arrival times, in hours after the agreed time. Then
 P(V ≤ v) = v²,  P(V ≤ v, U > u) = (v − u)²,  0 < u < v < 1,
because the event V ≤ v occurs if and only if both of them independently arrive before v, and the event V ≤ v, U > u occurs if and only if they both arrive in the interval (u, v). It follows that the joint density is
 f(u, v) = ∂²P(U ≤ u, V ≤ v)/∂u∂v = 2 I(0 < u < v < 1).
Therefore the waiting time W = V − U has density
 f(w) = ∫_0^1 2 I(0 < u < u + w < 1) du = 2 ∫_0^1 I(0 < u < 1 − w) du = 2(1 − w),  0 < w < 1.
They can start to work when the second of them arrives, at time V, which has density 2v on (0, 1) and hence expectation
 E(V) = ∫_0^1 v × 2v dv = 2/3,
i.e., 40 minutes after the agreed time.
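A simulation sketch of Example 214, checking the expectations of the first arrival, the second arrival and the waiting time.

```python
# Sketch: two independent U(0, 1) arrival times; U = first arrival, V = second.
import numpy as np

rng = np.random.default_rng(6)
arrivals = rng.uniform(size=(200_000, 2))
U = arrivals.min(axis=1)
V = arrivals.max(axis=1)

print(U.mean(), 1 / 3)        # first arrival after about 20 minutes on average
print(V.mean(), 2 / 3)        # work starts after about 40 minutes on average
print((V - U).mean(), 1 / 3)  # waiting time has density 2(1 - w), mean 1/3
```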
150
6. Approximation and Convergence slide 234
Motivation
It is often difficult to calculate the exact probability p of an event of interest, and we have to
approximate. Possible approaches:
try to bound p;
analytic approximation, often using the law of large numbers and the central limit theorem;
numerical approximation, often using Monte Carlo methods.
The final approaches use the notion of convergence of sequences of random variables, which we will
study in this chapter.
We have already seen examples of these ideas: normal approximation to the binomial distribution, law
of small numbers, . . .
Probability and Statistics for SIC slide 235
Inequalities
Theorem 215. If X is a random variable, a > 0 a constant, h a non-negative function and g a convex function, then
 P{h(X) ≥ a} ≤ E{h(X)}/a,  and  g{E(X)} ≤ E{g(X)} (Jensen's inequality).
These inequalities are more useful for theoretical calculations than for practical use.
Example 216. We are testing a classification method, in which the probability of a correct classification is p. Let Y1, . . . , Yn be the indicators of correct classifications in n test cases, and let Ȳ be their average. For ε = 0.2 and n = 100, bound P(|Ȳ − p| ≥ ε).
151
Note to Theorem 215
(a) Let Y = h(X). If y ≥ 0, then for any a > 0, y ≥ yI(y ≥ a) ≥ aI(y ≥ a). Therefore
E{h(X)} = E(Y ) ≥ E{Y I(Y ≥ a)} ≥ E{aI(Y ≥ a)} = aP(Y ≥ a) = aP{h(X) ≥ a},
Hoeffding’s inequality
Theorem 217 (Hoeffding's inequality, no proof). Let Z1, . . . , Zn be independent random variables such that E(Zi) = 0 and ai ≤ Zi ≤ bi for constants ai < bi. If ε > 0, then for all t > 0,
 P(Σ_{i=1}^n Zi ≥ ε) ≤ e^{−tε} ∏_{i=1}^n e^{t²(bi − ai)²/8}.
This inequality is much more useful than the others for finding powerful bounds in practical situations.
Example 218. Show that if X1, . . . , Xn iid∼ Bernoulli(p) and ε > 0, then
 P(|X̄ − p| > ε) ≤ 2 e^{−2nε²}.
152
Note to Example 218
For the theoretical part, take Zi = (Xi − p)/n, and note that −p/n ≤ Zi ≤ (1 − p)/n, so bi − ai = 1/n. Then
 P(|X̄ − p| > ε) = P(Σ_i Zi > ε) + P(−Σ_i Zi > ε) ≤ 2 e^{−tε} e^{t²/(8n)},
applying Theorem 217 to the Zi and to the −Zi. To minimise this with respect to t, we take t = 4nε, which leads to the result.
For the numerical part, just insert into the previous part and get 0.00067, which is much smaller than
the bound obtained using the Chebyshov inequality (Example 216).
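A numerical sketch of this comparison for n = 100, ε = 0.2; the Chebyshov bound is taken to be var(X̄)/ε² with p(1 − p) ≤ 1/4, which is the worst case suggested by Example 216.

```python
# Sketch: Hoeffding bound versus the (worst-case) Chebyshov bound.
import math

n, eps = 100, 0.2
hoeffding = 2 * math.exp(-2 * n * eps**2)
chebyshov = 1 / (4 * n * eps**2)
print(hoeffding)   # about 0.00067
print(chebyshov)   # 0.0625, much weaker
```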
Convergence
Definition 219 (Deterministic convergence). If x1 , x2 , . . . , x are real numbers, then xn → x iff for all
ε > 0, there exists Nε such that |xn − x| < ε for all n > Nε .
Probabilistic convergence is more complicated . . . We could hope that (for example) Xn → X if either
or
E(Xn ) → E(X)
when n → ∞.
Then when n → ∞,
153
Modes of convergence of random variables
Definition 221. Let X, X1, X2, . . . be random variables with cumulative distribution functions F, F1, F2, . . .. Then
(a) Xn converges to X almost surely, Xn →a.s. X, if
 P( lim_{n→∞} Xn = X ) = 1;
(b) Xn converges to X in mean square, Xn →2 X, if
 E{(Xn − X)²} → 0, n → ∞;
(c) Xn converges to X in probability, Xn →P X, if for all ε > 0,
 P(|Xn − X| > ε) → 0, n → ∞;
(d) Xn converges to X in distribution, Xn →D X, if
 Fn(x) → F(x), n → ∞, at each x ∈ R where F is continuous.
Almost sure convergence, Xn →a.s. X
To understand this better:
all the variables {Xn }, X must be defined on the same probability space, (Ω, F, P). It is not trivial
to construct this space (we need ‘Kolmogorov’s extension theorem’).
Then to each ω ∈ Ω corresponds a sequence X1(ω), X2(ω), . . . of real numbers, and almost sure convergence requires that this sequence converge to X(ω) for all ω outside an event of probability zero.
Example 222. Let U ∼ U (0, 1), where Ω = [0, 1], U (ω) = ω, Xn (ω) = U (ω)n , n = 1, 2, . . ., and
a.s.
X(ω) = 0. Show that Xn −→ X.
For each ω ∈ [0, 1), Xn(ω) = ω^n → 0 = X(ω) as n → ∞; this fails only at ω = 1, which has probability zero, so P(lim_{n→∞} Xn = X) = 1, as required.
154
Relations between modes of convergence
If Xn →a.s. X, Xn →2 X or Xn →P X, then X1, X2, . . . , X must all be defined with respect to only one probability space. This is not the case for Xn →D X, which only concerns the probabilities. This last mode is thus weaker than the others.
These modes of convergence are related to one another in the following way:
 Xn →a.s. X ⇒ Xn →P X ⇒ Xn →D X,
 Xn →2 X ⇒ Xn →P X ⇒ Xn →D X.
so Xn does not converge in probability to Z, and thus neither of the other modes of convergence can
be true either.
Probability and Statistics for SIC note 2 of slide 243
155
Continuity theorem (reminder)
Theorem 225 (Continuity). Let {Xn }, X be random variables with cumulative distribution functions
{Fn }, F , whose MGFs Mn (t), M (t) exist for 0 ≤ |t| < b. If there exists a 0 < a < b such that
D
Mn (t) → M (t) for |t| ≤ a when n → ∞, then Xn −→ X, that is to say, Fn (x) → F (x) at each
x ∈ R where F is continuous.
We could replace Mn (t) and M (t) by the cumulant-generating functions Kn (t) = log Mn (t) and
K(t) = log M (t).
We established the law of small numbers (Theorem 104 and Example 186, Poisson approximation
of the binomial distribution) by using this result.
Here is another example:
Example 226. Let X be a random variable which has a geometric distribution with a probability of
success p. Calculate the limit distribution of pX when p → 0.
The third line is known as Slutsky’s lemma. It is very useful in statistical applications.
Example 228. Let X1, . . . , Xn iid∼ (µX, σX²) and Y1, . . . , Yn iid∼ (µY, σY²), with µX ≠ 0 and σX², σY² < ∞, and define
 Rn = Ȳ/X̄,  Ȳ = n^{−1} Σ_{j=1}^n Yj,  X̄ = n^{−1} Σ_{j=1}^n Xj.
Show that Rn →P µY/µX when n → ∞.
156
Probability and Statistics for SIC slide 245
Hence Mn must be renormalised to get a non-degenerate limit distribution. Let {an } > 0 and
{bn } be sequences of constants, and consider the convergence in distribution of
Yn = (Mn − bn )/an ,
Examples
iid
Example 229. Let X1 , . . . , Xn ∼ exp(λ), and let Mn be their maximum. Find an , bn such that
D
Yn = (Mn − bn )/an −→ Y , where Y has a non-degenerate distribution.
iid
Example 230. Let X1 , . . . , Xn ∼ U (0, 1), and let Mn be their maximum. Find an , bn such that
D
Yn = (Mn − bn )/an −→ Y , where Y has a non-degenerate distribution.
157
Probability and Statistics for SIC note 1 of slide 247
Fisher–Tippett theorem
iid
Theorem 231. Suppose that X1 , . . . , Xn ∼ F , where F is a continuous cumulative distribution
function. Let Mn = max{X1 , . . . , Xn }, and suppose that the sequences of constants {an } > 0 and
D
{bn } can be chosen so that Yn = (Mn − bn )/an −→ Y , where Y has a non-degenerate limit
distribution H(y) when n → ∞. Then H must be the generalised extreme-value (GEV)
distribution,
 H(y) = exp[ −{1 + ξ(y − η)/τ}_+^{−1/ξ} ],  ξ ≠ 0,
 H(y) = exp[ −exp{−(y − η)/τ} ],  ξ = 0,
where u_+ = max(u, 0), and η, ξ ∈ R, τ > 0.
Example
The graph below shows the distributions of Mn and of Yn for n = 1, 7, 30, 365, 3650, from left to
iid
right, for X1 , . . . , Xn ∼ N (0, 1). The panel on the right also shows the limit distribution (bold),
H(y) = exp{− exp(−y)}.
[Figure: the two panels plot CDF against y; the limiting distribution H(y) is shown in bold in the right panel.]
158
6.3 Laws of Large Numbers slide 250
Theorem 232 (Weak law of large numbers). Let X1, X2, . . . be a sequence of independent identically distributed random variables with finite expectation µ, and write their average as X̄ = n^{−1}(X1 + · · · + Xn).
Then X̄ →P µ; i.e., for all ε > 0,
 P(|X̄ − µ| > ε) → 0, n → ∞.
Thus, under mild conditions, the average of a large sample converges to the expectation of the distribution from which the sample is taken.
If the Xi are independent Bernoulli trials, we return to our primitive notion of probability as a limit
of relative frequencies. The circle is complete.
159
Remarks
The weak law is easy to prove under the supplementary hypothesis that var(Xj) = σ² < ∞. We calculate E(X̄) and var(X̄), then we apply Chebyshov's inequality: for any ε > 0,
 P(|X̄ − µ| > ε) ≤ var(X̄)/ε² = σ²/(nε²) → 0, n → ∞.
The same result applies to smooth functions of averages, empirical quantiles, and other statistics.
Let X1, . . . , Xn iid∼ F, where F is a continuous cumulative distribution function, and let xp = F^{−1}(p) be the p quantile of F. By noting that
 X(⌈np⌉) ≤ xp ⇔ Σ_{j=1}^n I(Xj ≤ xp) ≥ ⌈np⌉
and applying the weak law to the sum on the right, we have X(⌈np⌉) →P xp.
This is stronger in the sense that for all ε > 0, the weak law allows the event |X − µ| > ε to occur
an infinite number of times, though with smaller and smaller probabilities. The strong law excludes
this possibility: it implies that the event |X − µ| > ε can only occur a finite number of times.
The weak and strong laws remain valid under certain types of dependence amongst the Xj .
Standardisation of an average
The law of large numbers shows us that the average X̄ approaches µ when n → ∞. If var(Xj) < ∞, then Lemma 166 tells us that
 E(X̄) = µ,  var(X̄) = σ²/n,
so, for all n, the difference between X̄ and its expectation relative to its standard deviation,
 Zn = {X̄ − E(X̄)}/var(X̄)^{1/2} = (X̄ − µ)/(σ²/n)^{1/2} = n^{1/2}(X̄ − µ)/σ,
has mean zero and variance one.
160
Central limit theorem
Theorem 234 (Central limit theorem (CLT)). Let X1, X2, . . . be independent random variables with expectation µ and variance 0 < σ² < ∞. Then
 Zn = n^{1/2}(X̄ − µ)/σ →D Z,  n → ∞,
where Z ∼ N(0, 1).
Thus
 P{n^{1/2}(X̄ − µ)/σ ≤ z} ≈ P(Z ≤ z) = Φ(z)
for large n.
iid
The following page shows this effect for X1 , . . . , Xn ∼ exp(1); the histograms show how the empirical
densities of Zn approach the density of Z.
[Figure: histograms of Zn for n = 5, 10, 20, 100 (density against z), approaching the N(0, 1) density.]
161
Note to Theorem 234
The cumulant-generating function of
 Zn = (X̄ − µ)/(σ²/n)^{1/2} = Σ_{j=1}^n (n^{−1/2}/σ) Xj − n^{1/2} µ/σ
is
 K_{Zn}(t) = Σ_{j=1}^n K_{Xj}(t n^{−1/2}/σ) − n^{1/2} tµ/σ,
where
 K_{Xj}(t) = tµ + ½ t²σ² + o(t²),  t → 0.
Thus
 K_{Zn}(t) = n[ t n^{−1/2} µ/σ + ½ (t n^{−1/2}/σ)² σ² + o{t²/(nσ²)} ] − n^{1/2} tµ/σ → t²/2,  n → ∞,
which is the CGF of Z ∼ N(0, 1). Thus the result follows by the continuity theorem, Theorem 185.
so
 (Σ_{j=1}^n Xj − nµ)/(nσ²)^{1/2} = n(X̄ − µ)/(nσ²)^{1/2} = n^{1/2}(X̄ − µ)/σ = Zn
can be approximated using a normal variable:
 P(Σ_{j=1}^n Xj ≤ x) = P{ (Σ_{j=1}^n Xj − nµ)/(nσ²)^{1/2} ≤ (x − nµ)/(nσ²)^{1/2} } ≈ Φ{(x − nµ)/(nσ²)^{1/2}}.
The accuracy of the approximation depends on the underlying variables: it is (of course) exact for
normal Xj , works better if the Xj are symmetrically distributed (e.g., uniform), and typically is
adequate if n > 25 or so.
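A simulation sketch of this behaviour for exp(1) variables (as in the histograms above), comparing an empirical probability for Zn with the normal limit.

```python
# Sketch: P(Zn <= 1) approaches Phi(1) as n grows, for exponential(1) summands.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(7)
for n in (5, 10, 20, 100):
    X = rng.exponential(1.0, size=(50_000, n))
    Zn = np.sqrt(n) * (X.mean(axis=1) - 1.0)      # mu = sigma = 1 for exp(1)
    print(n, (Zn <= 1.0).mean(), norm.cdf(1.0))   # agreement improves with n
```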
162
Example
Example 235. A book of 640 pages has a number of random errors on each page. If the number of
errors on each page follows a Poisson distribution with expectation λ = 0.1, what is the probability
that the book contains less than 50 errors?
When Σ_{j=1}^n Xj takes integer values, we can obtain a better approximation using a continuity correction:
 P(Σ_{j=1}^n Xj ≤ x) ≈ Φ{(x + ½ − nµ)/(nσ²)^{1/2}};
this can be important when the distribution of Σ_{j=1}^n Xj is quite discrete.
The true number is 0.031. With continuity correction we take Φ{(−15 + 0.5)/8} = 0.035.
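A quick numerical sketch of Example 235 with scipy, assuming (as in the example) that the total number of errors is Poisson with mean 640 × 0.1 = 64.

```python
# Sketch: exact Poisson probability versus normal approximations for Example 235.
from scipy.stats import poisson, norm

lam = 640 * 0.1                         # total errors ~ Pois(64), sd = 8
exact = poisson.cdf(49, lam)            # P(fewer than 50 errors)
plain = norm.cdf((49 - lam) / 8)        # no continuity correction
corrected = norm.cdf((49.5 - lam) / 8)  # with continuity correction
print(exact, plain, corrected)          # roughly 0.031, 0.030, 0.035
```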
Delta method
We often need the approximate distribution of a smooth function of an average.
Theorem 236. Let X1, X2, . . . be independent random variables with expectation µ and variance 0 < σ² < ∞, and let g'(µ) ≠ 0, where g' is the derivative of g. Then
 {g(X̄) − g(µ)}/{g'(µ)² σ²/n}^{1/2} →D N(0, 1),  n → ∞.
This implies that for large n, we have g(X̄) ∼· N{g(µ), g'(µ)² σ²/n}. Combined with Slutsky's lemma, we have
 g(X̄) ∼· N{g(µ), g'(X̄)² S²/n},  S² = (n − 1)^{−1} Σ_{j=1}^n (Xj − X̄)².
iid
Example 237. If X1 , . . . , Xn ∼ exp(λ), find the approximate distribution of log X.
163
Note to Theorem 236
We note first that
 Zn = (X̄ − µ)/(σ²/n)^{1/2} →D Z ∼ N(0, 1),
and therefore that we may write X̄ = µ + σZn/√n. Taylor series expansion for large n gives
 g(X̄) = g(µ + σZn/√n) = g(µ) + g'(µ) σ Zn/√n + O(n^{−1}),
and this implies that E{g(X̄)} ≈ g(µ) + O(n^{−1}) and that
 var{g(X̄)} ≈ g'(µ)² σ²/n + o(n^{−1}).
Moreover
 {g(X̄) − g(µ)}/{g'(µ)² σ²/n}^{1/2} = Zn + O(n^{−1/2}) →D Z
as n → ∞, which proves the result. Slutsky's lemma is needed only to replace g'(µ)²σ² by g'(X̄)² S² →P g'(µ)²σ².
We must be careful here, because the terms O(n−1 ) are random, and we need to know how to handle
them. But this is possible.
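A simulation sketch of Example 237: for exp(λ) data the delta method with g = log suggests log X̄ is approximately N(−log λ, 1/n), so its variance should not depend on λ (λ and n below are assumed for illustration).

```python
# Sketch: delta method for log of the mean of exponential data.
import numpy as np

rng = np.random.default_rng(8)
lam, n = 3.0, 50
logmeans = np.log(rng.exponential(1 / lam, size=(100_000, n)).mean(axis=1))

print(logmeans.mean(), -np.log(lam))   # both close to -1.10
print(logmeans.var(), 1 / n)           # both close to 0.02
```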
164
Sample quantiles
Definition 238. Let X1, . . . , Xn iid∼ F, and 0 < p < 1. Then the p sample quantile of X1, . . . , Xn is the rth order statistic X(r), where r = ⌈np⌉.
Theorem 239 (Asymptotic distribution of order statistics). Let 0 < p < 1, X1, . . . , Xn iid∼ F, and xp = F^{−1}(p). Then if f(xp) > 0,
 {X(⌈np⌉) − xp}/[p(1 − p)/{n f(xp)²}]^{1/2} →D N(0, 1),  n → ∞.
Example 240. Show that the median of a normal sample of size n is approximately distributed
according to N {µ, πσ 2 /(2n)}.
[Figure: empirical densities of the sample median (density against x) for several sample sizes, including n = 41 and n = 81, approaching the limiting normal density.]
165
Note to Example 240
We first note that F^{−1}(1/2), where F(x) = Φ{(x − µ)/σ}, gives (x − µ)/σ = 0, so x_{1/2} = µ. Then note that
 f(µ) = (2πσ²)^{−1/2} exp{−(µ − µ)²/(2σ²)} = (2πσ²)^{−1/2},
so the asymptotic variance in Theorem 239 is
 p(1 − p)/{n f(x_{1/2})²} = (1/4) × 2πσ² × n^{−1} = πσ²/(2n),
which proves the required result.
166
7 Exploratory Statistics slide 265
[Diagram: construction of knowledge, linking Theory, Nature and Experiment/Observation.]
To try and understand Nature, we invent theories that lead to models (e.g., Mendelian genetics,
quantum theory, fluid mechanics, . . . ).
We contrast the models with observations, preferably from experiments over which we have
some control, to assess if the theory is adequate, or should be rejected or improved.
The data are never measured exactly, so we usually need Statistics to assess whether differences
between the data and model are due to measurement error, or whether the data conflict with the
model, and therefore undermine or even falsify the theory.
Data can only be used to falsify a theory—future data might be incompatible with it.
167
Construction of knowledge: the (very) near future
[Diagram: as above, with Computation and Algorithms added alongside Theory, Nature and Experiment/Observation.]
What is Statistics?
Statistics is the science of extracting information from data.
Key points:
– variation (in the data) and the resulting uncertainty (about inferences) are important and are
represented using probability models and probabilistic ideas;
– context is important—it is crucial to know how the data were gathered and what they
represent.
168
Statistical cycle
The statistical method has four main stages:
– planning of a study, in order to obtain the maximum relevant information with the minimum
cost—
⊲ ideally—consideration of the problem, discussion of what data will be useful and how to get
them, choice of experimental design,
⊲ less ideally—someone comes with data and asks what to do with them;
– implementation of the experiment and reception of data—lots can go wrong;
– data analysis—
⊲ data exploration based on graphical or similar tools, followed by
⊲ statistical inference from models fitted to the data;
– presentation and interpretation of the results, followed by practical conclusions and action.
Often this cycle is iterated: data analysis suggests questions that we can’t answer, so we have to
get more data, . . .
Study types
Often we aim to compare the effects of treatments (drugs, adverts, . . . ) on units (patients,
website users, . . . ).
Two broad approaches:
– designed experiment—we control how treatments are allocated to units, usually using some
form of randomisation;
– observational study—the allocation of treatments is not under our control.
Designed experiments allow much stronger inferences than observational studies: carefully done,
they can establish that
correlation ⇒ causation.
Both types of study are all around us—clinical trials, web surveys, sample surveys, . . .
A key advantage of randomisation is the reduction (ideally avoidance) of bias.
169
Effect of randomization
[Diagram: two causal graphs on the variables T, X, U and Y, for an observational study (left) and a randomized experiment (right), as described below.]
Would offering office hours to students after the test improve final grades? Variables:
X: known mark at the test
Y : unknown final grade
T : treatment—give office hours or not
U : Unseen factors (motivation, math ability, hours spent at Sat, . . . )
Observational study (left): U can influence T , so we can’t separate their effects.
Randomized experiment (right): U can only influence T via the known X, so we can adjust for
X when estimating how T affects Y .
We might observe correlation between T and Y in both cases, but only in the second can we infer
causation.
Probability and Statistics for SIC slide 273
Data analysis
Data analysis is often said to comprise two phases:
exploratory data analysis (exploratory/descriptive statistics):
mostly simple, flexible and graphical methods that allow us to study groups of data and to detect
specific structures (tendencies, forms, atypical observations).
For example:
– in what range do most of your weights lie?
– is there an association between your weights and heights?
Exploratory analysis suggests working hypotheses and models that can be formalised and checked
in the second phase—but we risk bias if we use the data both to formulate a model and to check
it. It should be checked on new data, to reduce the chance that we are deceiving ourselves.
statistical inference (formal/inferential statistics): leads to statistical conclusions from data using
notions from probability theory. This involves model construction, testing, estimation and
forecasting methods.
170
7.2 Types of Data slide 275
Population, sample
Population: the entire set of units we might study;
Sample: subset of the population
The data collected are usually a sample of individuals from a population of interest.
Illustration:
Population: set of EPFL students.
Unit: an individual student.
Observation: the weight of the unit.
Sample: a subset of second-year students in SIC.
We want to study a characteristic (or characteristics) that each unit possesses, a statistical variable,
for example the weight (and height) of the individual.
Types of variables
A variable can be quantitative or qualitative.
A quantitative variable can be discrete (often integer) or continuous (i.e., taking any value in an
interval).
Illustration:
Quantitative discrete variables:
– number of children in a family
– number of students in this room
Quantitative continuous variables:
– weight in kilos
– height in centimetres
Often continuous variables are rounded to some extent, or can only be recorded to a certain
precision.
171
Qualitative variables
A (categorical) qualitative variable can be nominal (non-ordered) or ordinal (ordered).
Illustration:
Qualitative nominal variables:
– gender (male or female)
– blood types (A, B, AB, O).
Qualitative ordinal variables:
– the meal on offer at the Vinci (good, average, bad)
– interest in statistics (very low, low, average, high, very high)
We may convert quantitative variables into categorical variables for descriptive reasons.
Illustration: Size in centimetres: S, M, L, XL, XXL.
Probability and Statistics for SIC slide 278
Qualitative variable
Example 241. The blood types of 25 donors were collected:
AB B A O B
O B O A O
B O B B B
A O AB AB O
A B AB O A
172
Pie chart
[Pie chart of the 25 blood types (categories A, B, AB, O).]
Bar plot
[Bar plot of the blood-type counts (vertical axis 0 to 8) for categories A, B, O, AB.]
For a quantitative variable we obtain observations x1, x2, . . . , xn of the variable, which can be arranged in increasing order, giving the sample order statistics x(1) ≤ x(2) ≤ · · · ≤ x(n).
173
Example
Example 242. The weights of 92 students in an American school were measured in pounds.
The data are:
Boys
140 145 160 190 155 165 150 190 195 138 160
155 153 145 170 175 175 170 180 135 170 157
130 185 190 155 170 155 215 150 145 155 155
150 155 150 180 160 135 160 130 155 150 148
155 150 140 180 190 145 150 164 140 142 136
123 155
Girls
140 120 130 138 121 125 116 145 150 112 125
130 120 130 131 120 118 125 135 125 118 122
115 102 115 150 110 116 108 95 125 133 110
150 108
Stem-and-leaf diagram
We translate weight 95 7→ 9 | 5, weight 102 7→ 10 | 2, etc.
9 5
10 288
11 002556688
12 00012355555
13 0000013555688
14 00002555558
15 0000000000355555555557
16 000045
17 000055
18 0005
19 00005
20
21 5
174
Histogram
To construct a histogram, it is useful to have a frequency table, which can be considered to
summarize the observed values.
A histogram shows the number of observations in groups determined by a division into intervals of
the same length.
Here is an example of the construction of a frequency table:
Histogram II
Histograms of the weight of students in the American school; 9 (left) and 13 (right) classes with
absolute frequencies (top) and relative frequencies (bottom).
Histogram III
Advantage: a histogram can be used with large and small datasets.
Disadvantage: the loss of information due to the absence of the values of the observations and
the choice of the width of the boxes, which can be difficult, leading to different possibilities for
interpretation!
Remark: The stem-and-leaf diagram can be seen as a particular histogram obtained by a rotation.
Remark: There exist better versions of the histogram, such as kernel density estimates.
175
Probability and Statistics for SIC slide 288
A kernel density estimate is f̂_h(y) = (nh)^{−1} Σ_{j=1}^n K{(y − yj)/h}. This gives a nonparametric estimator of the density underlying the sample, and depends on:
 the kernel K, which is not very important; we often take the standard normal density K(u) = (2π)^{−1/2} e^{−u²/2};
 the bandwidth h > 0, which controls the amount of smoothing and matters much more.
Construction of a KDE
Left: construction of a kernel density estimate, with sample y1 , . . . , yn shown by the rug.
Right: effect of the bandwidth h, which smoothes the data more as h increases.
[Figure: left, construction of a kernel density estimate from the rug of observations; right, estimates with bandwidths h = 1 (red), h = 2 (black), h = 3 (blue); density against length (mm).]
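A minimal sketch of the estimator above with a Gaussian kernel (illustrative data; in practice a library routine such as scipy.stats.gaussian_kde can be used instead).

```python
# Sketch: Gaussian kernel density estimate and the effect of the bandwidth h.
import numpy as np

def kde(y, grid, h):
    # f_hat(x) = (n h)^{-1} sum_j K{(x - y_j)/h}, K = standard normal density
    u = (grid[:, None] - y[None, :]) / h
    return np.exp(-0.5 * u**2).sum(axis=1) / (len(y) * h * np.sqrt(2 * np.pi))

rng = np.random.default_rng(9)
y = rng.normal(110, 5, size=50)              # pretend these are lengths in mm
grid = np.linspace(95, 125, 200)
for h in (1.0, 2.0, 3.0):
    print(h, kde(y, grid, h).max())          # larger h gives a smoother, flatter estimate
```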
Remarks
It is not easy to create good graphs. Often those generated by standard software (e.g., Excel) are
poor. Some suggestions:
try to show the data itself as much as possible—no chart-junk (unnecessary colours/lines/. . .
etc.);
put units and clear explanations for the axes and the legend;
to compare related quantities, use the same axes and put the graphs close together;
choose the scales such that systematic relations appear at a ∼ 45◦ angle to the axes;
varying the aspect ratio can reveal interesting things;
draw the graph in such a way that departures from ‘standard’ appear as departures from linearity
or from a random cloud of points.
176
Probability and Statistics for SIC slide 291
Chartjunk
This graph shows 5 numbers!
(Source: http://www.datavis.ca/gallery/say-something.php)
177
Choosing the right axes
Effect of the choice of the axes on the perception of a relationship:
[Two scatterplots of model ozone (ppbv) against observed ozone (ppbv), drawn with different axis limits.]
(Source: Wikipedia)
178
Shapes of densities
[Figure: four densities (panels A, B, C, D) plotted as frequency against the variable, illustrating different distributional shapes.]
Central tendency
Indicators of central tendency (measures of position):
We previously met the mean and median
E(X), F −1 (0.5),
of a theoretical distribution, F . Now we define the corresponding sample quantities, based on data
x1 , . . . , xn .
The (arithmetic) average:
 x̄ = (x1 + · · · + xn)/n = n^{−1} Σ_{i=1}^n xi.
179
Sample median
Take med(x) = x(⌈n/2⌉) , where ⌈x⌉ is the smallest integer ≥ x.
Data x with n = 7,
Data x with n = 8,
Breakdown point
If the distribution is symmetric, then the average is close to the median.
The average is much more sensitive to extreme (atypical) values, so-called outliers, than the
median.
The median resists the impact of outliers:
 x1 = 1, x2 = 2, x3 = 3 gives x̄ = 2, med(x) = 2,
but
 x1 = 1, x2 = 2, x3 = 30 gives x̄ = 11, med(x) = 2,
so the median is unchanged by changing 3 to 30, but the average reacts badly.
We say that the median has (asymptotic) breakdown point 50%, because in a very large sample
the median would only move an arbitrarily large amount if 50% of the observations were corrupted.
The average has breakdown point 0%, because a single bad value can move it to ±∞.
180
Quartiles and sample quantiles
We previously defined the quantiles of a theoretical distribution F as xp = F −1 (p). Now we define
the sample equivalents.
The median (50%/50%) can be generalised by dividing the distribution into four or more equal
parts.
The bounds of the classes thus obtained are called quartiles (if 4 parts) or more generally sample
quantiles.
To define the p quantile (also called the 100p percentile), q̂(p), the data are put in increasing order,
 x(1) ≤ · · · ≤ x(n).
We calculate np; if this is not an integer, we round it up to the next integer ⌈np⌉, and set q̂(p) = x(⌈np⌉).
Calculation of quantiles
Illustration: Calculation of the 32% percentile with n = 10 and data
27, 29, 31, 31, 31, 34, 36, 39, 42, 45.
Here np = 10 × 0.32 = 3.2, which is not an integer, so ⌈np⌉ = ⌈3.2⌉ = 4 and q̂(0.32) = x(4) = 31.
Probability and Statistics for SIC slide 303
Variability/dispersion measures
The most common is the sample standard deviation
 s = [ (n − 1)^{−1} Σ_{i=1}^n (xi − x̄)² ]^{1/2} = [ (n − 1)^{−1} { Σ_{i=1}^n xi² − n x̄² } ]^{1/2},
but this has breakdown point 0%. The quantity s² is the sample variance.
The range
max(x1 , . . . , xn ) − min(x1 , . . . , xn ) = x(n) − x(1)
is unsatisfactory as we consider only the two most extreme xi , which are very sensitive to outliers;
its breakdown point is also 0%.
The interquartile range (IQR)
181
Probability and Statistics for SIC slide 304
Measures of correlation
We often want to measure the strength of relationship between two different quantities, based on data pairs (x1, y1), . . . , (xn, yn) from n individuals. Recall from slide 195 that the theoretical correlation between two random variables (X, Y) is defined as
 corr(X, Y) = cov(X, Y)/{var(X) var(Y)}^{1/2}.
“Five-number summary”
The list of five values
 {min(xi), q̂(0.25), med(x), q̂(0.75), max(xi)},
called the "five-number summary", gives a simple and useful numerical summary of a sample. It is the basis for the boxplot.
Boxplot (boîte à moustache) of the weights of students in an American school, with a ‘rug’ showing
the individual values.
The central box shows qb(0.25), the sample median and qb(0.75), so its width is the IQR. The limits of
the whiskers are discussed below.
Probability and Statistics for SIC slide 307
182
Boxplot: calculations
Weights of the 92 American students.
The "five-number summary" is
 {min(xi), q̂(0.25), med(x), q̂(0.75), max(xi)} = {95, 125, 145, 156, 215},
thus
IQR(x) = qb(0.75) − qb(0.25) = 156 − 125 = 31
C = 1.5 × IQR(x) = 1.5 × 31 = 46.5
qb(0.25) − C = 125 − 46.5 = 78.5
qb(0.75) + C = 156 + 46.5 = 202.5
The bounds of the whiskers are the most extreme xi lying inside the numbers
qb(0.25) − C, qb(0.75) + C.
Any xi outside the whiskers are shown individually; they might be outliers relative to the normal
distribution.
Probability and Statistics for SIC slide 308
Boxplot: example 1
The boxplot is very useful for comparing several groups of observations:
[Figure: side-by-side boxplots of weight for girls and boys.]
183
Boxplot: example 2
[Figure: boxplots comparing IN and SC students: height (cm) on the left, exam mark on the right.]
184
7.6 Choice of a Model slide 312
185
Normal Q-Q plot
The histogram or the boxplot may suggest that a normal distribution is suitable for the data, such
as: no atypical values, symmetry, unimodality.
But we need to check this more precisely: the best graphical tool for checking normality is the "normal Q-Q plot", i.e., a graph of the ordered sample values x(1) ≤ · · · ≤ x(n) against the corresponding quantiles of the standard normal distribution.
Abnormal values appear as isolated points.
The slope and the intercept at x = 0 give estimates of σ and µ respectively.
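A sketch of how such a plot can be produced; the plotting positions (i − ½)/n are one common convention, assumed here, and the data are simulated for illustration.

```python
# Sketch: normal Q-Q plot of ordered data against standard normal quantiles.
import numpy as np
from scipy.stats import norm
import matplotlib.pyplot as plt

rng = np.random.default_rng(10)
x = np.sort(rng.normal(172, 9, size=88))          # illustrative 'heights'
p = (np.arange(1, len(x) + 1) - 0.5) / len(x)
q = norm.ppf(p)

plt.plot(q, x, 'o')                               # near-linear if the data are normal
plt.xlabel('Standard normal quantiles')
plt.ylabel('Ordered sample values')
plt.show()
```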
Example: heights
Histogram and normal Q-Q plot of the heights of 88 students:
186
Example: Newcomb data
Normal Q-Q plot of 66 measures of time for light to cross a known distance, measured by Newcomb:
[Figure: normal Q-Q plots of the 66 passage times, shown in two panels with different vertical scales.]
187
8 Statistical Inference slide 318
Introduction
The study of mathematics is based on deduction:
axioms ⇒ consequences.
Inferential statistics concern induction—having observed an event A, we want to say something about
a probability space (Ω, F, P) we suppose to be underlying the data:
?
A ⇒ (Ω, F, P ).
In the past the term inverse probability was given to this process.
Statistical model
We assume that the observed data, or data to be observed, can be considered as realisations of a
random process, and we aim to say something about this process based on the data.
Since the data are finite, and the process is unknown, there will be many uncertainties in our
analysis, and we must try to quantify them as well as possible.
Several problems must be addressed:
– specification of a model (or of models) for the data;
– estimation of the unknowns of the model (parameters, . . .);
– tests of hypotheses concerning a model;
– planning of the data collection and analysis, to answer the key questions as effectively as
possible (i.e., minimise uncertainty for a given cost);
– decision when faced with uncertainties;
– prediction of future unknowns;
– behind the other problems lies the relevance of the data to the question we want to answer.
188
Definitions
Notation: we will use y and Y to represent the data y1 , . . . , yn and Y1 , . . . , Yn .
Definition 244. A statistical model is a probability distribution f (y) chosen or constructed to learn
from observed data y or from potential data Y .
If f (y) = f (y; θ) is determined by a parameter θ of finite dimension, it is a parametric model,
and otherwise it is a nonparametric model.
A perfectly known model is called simple, otherwise it is composite.
Statistical models are (almost) always composite in practice, but simple models are useful when
developing theory.
Definition 246. The sampling distribution of a statistic T = t(Y ) is its distribution when Y ∼ f (y).
Definition 247. A random sample is a set of independent and identically distributed random
variables Y1 , . . . , Yn , or their realisations y1 , . . . , yn .
Examples
Example 248. Assume that y1 , . . . , yn is a random sample from a Bernoulli distribution with unknown
parameter p ∈ (0, 1). Then the statistic
Xn
t= yj
j=1
189
Note to Example 249
If µ and σ² are finite, then elementary computations (see Lemma 166) give
 E(Ȳ) = E(n^{−1} Σ_{j=1}^n Yj) = n^{−1} × n E(Yj) = µ,  var(Ȳ) = n^{−2} Σ_{j=1}^n var(Yj) = σ²/n,
since the Yj are independent and all have variance σ 2 . These results do not rely on normality of the
Yj , but the variance computation does need independence. We see that the larger n is, the smaller is
the variance of Y . This backs up our intuition that a larger sample is more informative about the
underlying phenomenon—but the data must be sampled independently, and the variance must be finite!
If in addition the Yj are normal, then Y is a linear combination of normal variables, and so has a
normal distribution,
Y ∼ N (µ, σ 2 /n),
so we have a very precise idea of how Y will behave (or, rather, we would have, if we knew µ and σ 2 ).
Statistical models
We would like to study a set of individuals or elements called a population based on a subset of this
set called a sample:
statistical model: the unknown distribution F or density f of Y ;
parametric statistical model: the distribution of Y is known except for the values of parameters
θ, so we can write F (y) = F (y; θ), but with θ unknown;
sample (must be representative of the population): “data” y1 , . . . , yn , often supposed to be a
iid
random sample, i.e., Y1 , . . . , Yn ∼ F ;
statistic: any function T = t(Y1 , . . . , Yn ) of the random variables Y1 , . . . , Yn ;
estimator: a statistic θb used to estimate a parameter θ of f .
Example
iid
Example 250. If we assume that Y1 , . . . , Yn ∼ N (µ, σ 2 ) but with µ, σ 2 unknown, then
this is a parametric statistical model;
µ̂ = Ȳ is an estimator of µ, whose observed value is ȳ;
σ̂² = n^{−1} Σ_{j=1}^n (Yj − Ȳ)² is an estimator of σ², whose observed value is n^{−1} Σ_{j=1}^n (yj − ȳ)².
Note that:
a statistic T is a function of the random variables Y1 , . . . , Yn , so T is itself a random variable;
the sampling distribution of T depends on the distribution of the Yj ;
if we cannot deduce the exact distribution of T from that of the Yj , we must sometimes make do
with knowing E(T ) and var(T ), which give partial information on the distribution of T , and thus
may allow us to approximate the distribution of T (often using the central limit theorem).
190
Probability and Statistics for SIC slide 326
Estimation methods
There are many methods for estimating the parameters of models. The choice among them depends
on various criteria, such as:
ease of calculation;
efficiency (getting estimators that are as precise as possible);
robustness (getting estimators that don’t fail calamitously when the model is wrong, e.g., when
outliers appear).
The trade-off between these criteria depends on what assumptions we are willing to make in a given
context.
Examples of common methods are:
method of moments (simple, can be inefficient);
maximum likelihood estimation (general, optimal in many parametric models);
M-estimation (even more general, can be robust, but loses efficiency compared to maximum
likelihood).
Method of moments
The method of moments estimate of a parameter θ is the value θ̃ that matches the theoretical
and empirical moments.
For a model with p unknown parameters, we set the theoretical moments of the population equal to the empirical moments of the sample y1, . . . , yn, and solve the resulting equations, i.e.,
 E(Y^r) = ∫ y^r f(y; θ) dy = n^{−1} Σ_{j=1}^n yj^r,  r = 1, . . . , p.
We thus need as many (finite!) moments of the underlying model as there are unknown
parameters.
We may have more than one choice of moments to use, so in principle the estimate is not unique,
but in practice we usually use the first r moments, because they give the most stable estimates.
Example 252. If y1 , . . . , yn is a random sample from the N (µ, σ 2 ) distribution, estimate µ and σ 2 .
Example 251
Standard computations show that if Y ∼ U(0, θ), then E(Y) = θ/2. To find the moments estimate of θ, we therefore solve the equation θ̃/2 = ȳ, giving θ̃ = 2ȳ.
191
Probability and Statistics for SIC note 1 of slide 328
Example 252
The theoretical values of the first two moments are E(Y) = µ and E(Y²) = µ² + σ², so matching them to the sample moments gives µ̃ = ȳ and σ̃² = n^{−1} Σ_j yj² − ȳ² = n^{−1} Σ_j (yj − ȳ)².
Definition 253. If y1, . . . , yn is a random sample from the density f(y; θ), then the likelihood for θ is
 L(θ) = f(y1; θ) × · · · × f(yn; θ) = ∏_{j=1}^n f(yj; θ).
The data are treated as fixed, and the likelihood L(θ) is regarded as a function of θ.
Definition 254. The maximum likelihood estimate (MLE) θ̂ of a parameter θ is the value that gives the observed data the highest likelihood. Thus
 L(θ̂) ≥ L(θ) for each θ.
192
Calculation of the MLE θ̂
We simplify the calculations by maximising ℓ(θ) = log L(θ) rather than L(θ).
The approach is:
 calculate the log-likelihood ℓ(θ) (and plot it if possible);
 find the value θ̂ maximising ℓ(θ), which often satisfies dℓ(θ̂)/dθ = 0;
 check that θ̂ gives a maximum, often by checking that d²ℓ(θ̂)/dθ² < 0.
Example 255. Suppose that y1, . . . , yn is a random sample from an exponential density with unknown λ. Find λ̂.
Example 256. Suppose that y1, . . . , yn is a random sample from a uniform density, U(0, θ), with unknown θ. Find θ̂.
Probability and Statistics for SIC slide 330
For Example 255, ℓ(λ) = n log λ − λ Σ_j yj, so dℓ(λ)/dλ = n/λ − Σ_j yj = 0 gives λ̂ = n/Σ_j yj = 1/ȳ, and
 d²ℓ(λ)/dλ² = −n/λ² < 0,  λ > 0,
so the log likelihood is concave, and therefore λ̂ gives the unique maximum.
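A numerical sketch of Example 255: minimising the negative log-likelihood with a generic optimiser recovers the closed-form MLE 1/ȳ (data simulated with an assumed λ).

```python
# Sketch: numerical maximum likelihood for an exponential sample.
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(11)
y = rng.exponential(scale=1 / 2.5, size=200)      # true lambda = 2.5 (assumed)

nll = lambda lam: -(len(y) * np.log(lam) - lam * y.sum())
fit = minimize_scalar(nll, bounds=(1e-6, 100), method='bounded')

print(fit.x, 1 / y.mean())                        # numerical MLE vs closed form
```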
193
M-estimation
This generalises maximum likelihood estimation. We maximise a function of the form
 ρ(θ; Y) = Σ_{j=1}^n ρ(θ; Yj),
where ρ(θ; y) is (if possible) concave as a function of θ for all y. Equivalently we minimise −ρ(θ; Y).
We choose the function ρ to give estimators with suitable properties, such as small variance or
robustness to outliers.
Taking ρ(θ; y) = log f (y; θ) gives the maximum likelihood estimator.
iid
Example 257. Let Y1 , . . . , Yn ∼ f with E(Yj ) = θ, and take ρ(y; θ) = −(y − θ)2 . Find the least
squares estimator of θ.
iid
Example 258. Let Y1 , . . . , Yn ∼ f such that E(Yj ) = θ, and take ρ(y; θ) = −|y − θ|. Find the
corresponding estimator of θ.
For Example 257 we minimise −ρ(θ; y) = Σ_{j=1}^n (yj − θ)², for which
 −dρ(θ; y)/dθ = −Σ_{j=1}^n 2(yj − θ),
which vanishes when θ = ȳ, and
 −d²ρ(θ; y)/dθ² = 2n > 0,
so the minimum is unique: the least squares estimator of θ is the average Ȳ.
194
Note to Example 258
We want to maximise
 ρ(θ; y) = −Σ_{j=1}^n |yj − θ|,
and we note that if θ > y then −|y − θ| = y − θ and if θ < y then −|y − θ| = θ − y, so the respective derivatives with respect to θ are −1 and +1. This implies that
 dρ(θ; y)/dθ = P(θ) − N(θ),
where P(θ) is the number of yj for which θ < yj and N(θ) = n − P(θ) is the number of yj for which θ > yj. Hence when regarded as a function of θ,
 dρ(θ; y)/dθ = 2P(θ) − n
is a step function that has initial value n for θ = −∞, drops by 2 at each yj , and takes value −n when
θ = +∞. If n is odd, then 2P (θ) − n equals zero when θ is the median of the sample, and if n is even,
then 2P (θ) − n equals zero on the interval y(n/2) ≤ θ ≤ y(n/2+1) . In this latter case we can take the
median to be (y(n/2) + y(n/2+1) )/2 for uniqueness.
Thus this choice of function ρ yields the sample median as an estimator.
Bias
How should we compare estimators? One basic criterion is the bias of an estimator θ̂ of θ, b(θ̂) = E(θ̂) − θ; an estimator with b(θ̂) = 0 is called unbiased.
195
Note to Example 260
In Example 249 we saw that
 E(Ȳ) = µ,  var(Ȳ) = σ²/n,
so the bias of µ̂ = Ȳ as an estimator of µ is E(Ȳ) − µ = 0.
To find the expectation of σ̂² = n^{−1} Σ_{j=1}^n (Yj − Ȳ)², note that
 Σ_{j=1}^n (Yj − Ȳ)² = Σ_{j=1}^n {Yj − µ − (Ȳ − µ)}²
  = Σ_{j=1}^n (Yj − µ)² − 2 Σ_{j=1}^n (Yj − µ)(Ȳ − µ) + Σ_{j=1}^n (Ȳ − µ)²
  = Σ_{j=1}^n (Yj − µ)² − 2n(Ȳ − µ)² + n(Ȳ − µ)²
  = Σ_{j=1}^n (Yj − µ)² − n(Ȳ − µ)²,
which implies that
 E{ Σ_{j=1}^n (Yj − Ȳ)² } = Σ_{j=1}^n E{(Yj − µ)²} − n E{(Ȳ − µ)²}
  = n var(Yj) − n var(Ȳ)
  = nσ² − nσ²/n
  = (n − 1)σ².
Therefore
 E(σ̂²) = n^{−1} E{ Σ_{j=1}^n (Yj − Ȳ)² } = (n − 1)σ²/n,
and the bias of σ̂² is
 E(σ̂²) − σ² = (n − 1)σ²/n − σ² = −σ²/n.
Therefore on average σ̂² underestimates σ², by an amount that should be small for large n.
196
Bias and variance
[Figure: four panels illustrating the combinations high bias/low variability, low bias/high variability, high bias/high variability, and the ideal, low bias/low variability.]
The mean square error MSE(θ̂) = E{(θ̂ − θ)²} is the average squared distance between θ̂ and its target value θ; it satisfies MSE(θ̂) = var(θ̂) + b(θ̂)².
Definition 262. Let θ̂1 and θ̂2 be two unbiased estimators of the same parameter θ. Then θ̂1 is said to be more efficient than θ̂2 if var(θ̂1) ≤ var(θ̂2).
197
Note to Example 263
We've already seen in Lemma 166 that
 E(M) = µ,  var(M) ≈ πσ²/(2n),
so both estimators are (approximately) unbiased (in fact exactly unbiased), but
 var(M)/var(Ȳ) = π/2 > 1,
so the average Ȳ is the more efficient estimator of µ for normal data.
Delta method
In practice, we often consider functions of estimators, and so we appeal to another version of the delta
method (Theorem 236).
Theorem 264 (Delta method). Let θ̂ be an estimator based on a sample of size n, such that
 θ̂ ∼· N(θ, v/n)
for large n, and let g be a smooth function such that g'(θ) ≠ 0. Then
 g(θ̂) ∼· N{g(θ) + v g''(θ)/(2n), v g'(θ)²/n}.
This implies that the mean square error of g(θ̂) as an estimator of g(θ) is
 MSE{g(θ̂)} ≈ {v g''(θ)/(2n)}² + v g'(θ)²/n.
198
Note to Example 265
Let ψ = g(θ) = exp(−θ) = P(Y = 0), where Y ∼ Pois(θ).
The two estimators are T1 = n^{−1} Σ_i I(Yi = 0) and T2 = exp(−Ȳ).
Simple computations (e.g., noting that nT1 ∼ B(n, ψ), and applying the delta method to T2) give
 var(T2)/var(T1) ≈ θ exp(−2θ)/[exp(−θ){1 − exp(−θ)}] = θ/(e^θ − 1) < 1,
so T2 has the smaller variance.
In the multiparameter case we maximise ℓ(θ) with respect to the vector θ_{p×1}, giving an estimator θ̂_{p×1}, which often has an approximate Np(θ, V) distribution.
199
8.3 Interval Estimation slide 337
Pivots
A key element of statistical thinking is to assess uncertainty of results and conclusions.
Let t = 1 be an estimate of an unknown parameter θ based on a sample of size n:
if n = 10⁵ we are much more sure that θ ≈ t than if n = 10;
as well as t we would thus like to give an interval, which will be wider when n = 10 than when
n = 10⁵, to make the uncertainty of t explicit.
We suppose that we have
data y1 , . . . , yn , which are regarded as a realisation of a
random sample Y1 , . . . , Yn drawn from a
statistical model f (y; θ) whose unknown
parameter θ is estimated by the
estimate t = t(y1 , . . . , yn ), which is regarded as a realisation of the
estimator T = t(Y1 , . . . , Yn ).
We therefore need to link θ and Y1 , . . . , Yn .
Definition 266. Let Y = (Y1 , . . . , Yn ) be sampled from a distribution F with parameter θ. Then a
pivot is a function Q = q(Y, θ) of the data and the parameter θ, where the distribution of Q is known
and does not depend on θ. We say that Q is pivotal.
Example
Example 267. Let Y1, . . . , Yn be iid U(0, θ) with θ unknown, and set
    M = max(Y1, . . . , Yn),   Y = n⁻¹ Σj Yj.
Show that Q1 = M/θ is an exact pivot and that Q2 = (3n)^{1/2}(2Y/θ − 1) is an approximate pivot.
Note to Example 267
We first note that Q1 is a function of the data and the parameter, and that
    P(Q1 ≤ q) = P(M/θ ≤ q) = P(M ≤ θq) = (θq/θ)ⁿ = qⁿ,   0 < q < 1,
which is known and does not depend on θ. Hence Q1 is a pivot.
In Example 119(a) we saw that if Y ∼ U (0, θ), then E(Y ) = θ/2 and var(Y ) = θ 2 /12. Hence
Lemma 166(c) gives that Y has mean θ/2 and variance θ 2 /(12n), and for large n,
Y ·∼ N{θ/2, θ²/(12n)} by the central limit theorem. Therefore
    Q2 = (Y − θ/2)/√{θ²/(12n)} = (3n)^{1/2}(2Y/θ − 1) ·∼ N(0, 1).
Thus Q2 depends on both data and θ, and has an (approximately) known distribution: hence Q2 is
an (approximate) pivot. (In fact it is exact, if we could know the distribution of Y exactly.)
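The pivotal property of Q1 is easy to check by simulation (a sketch; the values θ = 3 and θ = 10 are arbitrary):
  set.seed(2)
  n <- 20
  Q1 <- function(theta) max(runif(n, 0, theta)) / theta
  quantile(replicate(1e4, Q1(3)),  c(0.1, 0.5, 0.9))   # same distribution...
  quantile(replicate(1e4, Q1(10)), c(0.1, 0.5, 0.9))   # ...whatever theta is
  0.5^(1/n)    # exact median of Q1, since P(Q1 <= q) = q^n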
Confidence intervals
Definition 268. Let Y = (Y1 , . . . , Yn ) be data from a parametric statistical model with scalar
parameter θ. A confidence interval (CI) (L, U ) for θ with lower confidence bound L and upper
confidence bound U is a random interval that contains θ with a specified probability, called the
(confidence) level of the interval.
L = l(Y ) and U = u(Y ) are statistics that can be computed from the data Y1 , . . . , Yn . They do
not depend on θ.
In a continuous setting (so < gives the same probabilities as ≤), and if we write the probabilities
that θ lies below and above the interval as
P(θ < L) = αL,   P(U < θ) = αU,
then
    P(L ≤ θ ≤ U) = 1 − P(θ < L) − P(U < θ) = 1 − αL − αU.
Often we seek an interval with equal probabilities of not containing θ at each end, with
αL = αU = α/2, giving an equi-tailed (1 − α) × 100% confidence interval.
We usually take standard values of α, such that 1 − α = 0.9, 0.95, 0.99, . . .
Construction of a CI
We use pivots to construct CIs:
– we find a pivot Q = q(Y, θ) involving θ;
– we obtain the quantiles qαU , q1−αL of Q;
– then we transform the equation P{qαU ≤ q(Y, θ) ≤ q1−αL} = 1 − αL − αU into an equivalent statement P(L ≤ θ ≤ U) = 1 − αL − αU, whose limits L and U give the confidence interval.
Example 270. A sample of n = 16 Vaudois number plates has maximum 523308 and average 320869.
Give two-sided 95% CIs for the number of cars in canton Vaud.
Probability and Statistics for SIC slide 341
For Q1 = M/θ, the pivotal relation gives
    P{αU^{1/n} ≤ M/θ ≤ (1 − αL)^{1/n}} = 1 − αL − αU,
so
    L = M/(1 − αL)^{1/n},   U = M/αU^{1/n}.
For Q2 = (3n)^{1/2}(2Y/θ − 1) ·∼ N(0, 1), the quantiles are z1−αL and zαU, so
    L = 2Y/{1 + z1−αL/(3n)^{1/2}},   U = 2Y/{1 + zαU/(3n)^{1/2}};
note that for large n these are L ≈ 2Y{1 − z1−αL/(3n)^{1/2}} and U ≈ 2Y{1 − zαU/(3n)^{1/2}}.
Note to Example 270
We set αU = αL = 0.025, with M and Y observed to be m = 523308 and y = 320869.
For Q1 with n = 16 we have αU^{1/n} = 0.025^{1/16} = 0.794 and (1 − αL)^{1/n} = 0.975^{1/16} = 0.998, so
    L = m/(1 − αL)^{1/n} = 524135,   U = m/αU^{1/n} = 659001.
Note that this CI does not contain m (and this makes sense).
·
For Q2 = (3n)^{1/2}(2Y/θ − 1) ·∼ N(0, 1), the quantiles are zαU = −z1−αL = −1.96, so we obtain
    L = 2y/{1 + 1.96/(3n)^{1/2}} = 500226,   U = 2y/{1 − 1.96/(3n)^{1/2}} = 894903.
This is much wider than the other CI, and includes impossible values, as we already know that
θ ≥ m.
Clearly we prefer the interval based on Q1 .
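Both intervals are easy to reproduce in R (a sketch following the calculations above):
  n <- 16; m <- 523308; ybar <- 320869
  aL <- aU <- 0.025
  c(L = m / (1 - aL)^(1/n), U = m / aU^(1/n))          # interval from Q1
  z <- qnorm(0.975)
  c(L = 2 * ybar / (1 + z / sqrt(3 * n)),              # interval from Q2
    U = 2 * ybar / (1 - z / sqrt(3 * n)))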
Interpretation of a CI
(L, U ) is a random interval that contains θ with probability 1 − α.
We imagine an infinite sequence of repetitions of the experiment that gave (L, U ).
In that case, the CI that we calculated is one of an infinity of possible CIs, and we can consider
that our CI was chosen at random from among them.
Although we do not know whether our particular CI contains θ, the event θ ∈ (L, U ) has
probability 1 − α, matching the confidence level of the CI.
In the figure below, the parameter θ (green line) is contained (or not) in realisations of the 95% CI
(red). The black points show the corresponding estimates.
One- and two-sided intervals
A two-sided confidence interval (L, U ) is generally used, but one-sided confidence intervals,
of the form (−∞, U ) or (L, ∞), are also sometimes required.
For one-sided CIs, we take αU = 0 or αL = 0, giving respective intervals (L, ∞) or (−∞, U ).
To get a one-sided (1 − α) × 100% interval, we can compute a two-sided interval with
αL = αU = α, and then replace the unwanted limit by ±∞ (or another value if required in the
context).
Example 271. A sample of n = 16 Vaudois number plates has maximum 523308. Use the pivot Q1
to give one-sided 95% CIs for the number of cars in canton Vaud.
For the interval of form (L, ∞), we have (524988.3, ∞), with the interpretation that we are 95% sure that the number of cars in the canton is at least 524988.3 (which we would interpret as 524988, for practical purposes).
For the interval of form (−∞, U), we have (−∞, 631061.6), but since we have observed m = 523308, we replace the lower bound, giving (523308, 631061.6). We are 95% sure that the number of cars in the canton is lower than 631062, but it must be at least 523308.
Probability and Statistics for SIC note 1 of slide 343
Standard errors
In most cases we use approximate pivots, based on estimators whose variances we must estimate.
Definition 272. Let T = t(Y1 , . . . , Yn ) be an estimator of θ, let τn2 = var(T ) be its variance, and let
V = v(Y1 , . . . , Yn ) be an estimator of τn2 . Then we call V 1/2 , or its realisation v 1/2 , a standard error
for T .
If (T − θ)/τn →D Z ∼ N(0, 1) and V is a consistent estimator of τn², then
    (T − θ)/V^{1/2} = {(T − θ)/τn} × (τn/V^{1/2}) →D Z,   n → ∞.
Hence, when basing a CI on the Central Limit Theorem, we can replace τn by V 1/2 .
Approximate normal confidence intervals
We can often construct approximate CIs using the CLT, since many statistics that are based on averages of Y = (Y1, . . . , Yn) have approximate normal distributions for large n. If T = t(Y) is an estimator of θ with standard error √V, and if Theorem 273 applies, then
T ·∼ N(θ, V),
and so (T − θ)/√V ·∼ N(0, 1). Thus
    P{zαU < (T − θ)/√V ≤ z1−αL} ≈ Φ(z1−αL) − Φ(zαU) = 1 − αL − αU,
giving the approximate limits L = T − √V z1−αL and U = T − √V zαU.
Recall that if αL, αU < 1/2, then z1−αL > 0 and zαU < 0, so L < U.
Example 269 is an example of this, with T = 2Y and V = T²/(3n), since for large n, T ·∼ N{θ, θ²/(3n)}.
Often we take αL = αU = 0.025, and then z1−αL = −zαU = 1.96, giving the ‘rule of thumb’
    (L, U) ≈ T ± 2√V for a two-sided 95% confidence interval.
If Y1, . . . , Yn are iid N(µ, σ²) and σ² is known, no approximation is needed, since then
    Z = (Y − µ)/√(σ²/n) ∼ N(0, 1)
exactly, and Z is a pivot.
Unknown variance
In applications σ 2 is usually unknown. If so, Theorem 274 implies that
(Y − µ)/√(S²/n) ∼ tn−1,   (n − 1)S²/σ² ∼ χ²n−1
are pivots that provide confidence intervals for µ and σ², respectively, i.e.,
    (L, U) = {Y − (S/√n) tn−1(1 − αL),  Y − (S/√n) tn−1(αU)},    (5)
    (L, U) = {(n − 1)S²/χ²n−1(1 − αL),  (n − 1)S²/χ²n−1(αU)},    (6)
where:
– tν (p) is the p quantile of the Student t distribution with ν degrees of freedom;
– χ2ν (p) is the p quantile of the chi-square distribution with ν degrees of freedom.
For symmetric densities such as the normal and the Student t, the quantiles satisfy tν(α) = −tν(1 − α).
(Source: Wikipedia)
Probability and Statistics for SIC slide 348
Left: χ2ν densities with ν = 1, 2, 4, 6, 10. Right: tν densities with ν = 1 (bottom centre), 2, 4, 20, ∞
(top centre).
Example
Example 275. Suppose that the resistance X of a certain type of electrical equipment has an
approximate N (µ, σ 2 ) distribution. A random sample of size n = 9 has average x = 5.34 ohm and
variance s2 = 0.122 ohm2 .
Find an equi-tailed two-sided 95% CI for µ.
Find an equi-tailed two-sided 95% CI for σ 2 .
How does the interval for µ change if we are later told that σ 2 = 0.122 ?
How does the calculation change if we want a 95% confidence interval for µ of form (L, ∞)?
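A sketch of the first two calculations in R, using the summary statistics above (reading the printed variance as 0.12²; replace s2 by 0.122 if that was meant):
  n <- 9; xbar <- 5.34; s2 <- 0.12^2; alpha <- 0.05
  xbar + c(-1, 1) * qt(1 - alpha/2, df = n - 1) * sqrt(s2 / n)    # CI (5) for mu
  (n - 1) * s2 / qchisq(c(1 - alpha/2, alpha/2), df = n - 1)      # CI (6) for sigma^2
  xbar + c(-1, 1) * qnorm(1 - alpha/2) * sqrt(s2 / n)             # if sigma^2 were known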
Comments
The construction of confidence intervals is based on pivots, often using the central limit theorem
to approximate the distribution of an estimator, and thus giving approximate intervals.
A confidence interval (L, U ) not only suggests where an unknown parameter is situated, but its
width U − L gives an idea of the precision of the estimate.
In most cases U − L ∝ √V ∝ n^{−1/2},
so multiplying the sample size by 100 increases precision only by a factor of 10.
Having to estimate the variance using V decreases precision, and thus increases the width.
To get a one-sided (1 − α) × 100% interval, we can compute a two-sided interval with
αL = αU = α, and then replace the unwanted limit by ±∞ (or another suitable limit).
In some cases, especially normal models, exact CIs are available.
Statistical tests
Example 276. I observe 115 heads when spinning a 5Fr coin 200 times, and 105 heads when tossing it 200 times.
Give a statistical model for this problem.
Is the coin fair?
(Figure: proportion of heads for the 5Fr coin (1978) against number of trials — left panel: spins, right panel: tosses.)
Note to Example 276
On the assumption that the spins are independent, and that heads occurs with probability θ, the
total number of heads R ∼ B(n = 200, θ), and if the coin is fair, θ = 1/2.
One way to see if the coin is fair is to compute a 95% CI for the unknown θ, and see if the value
θ = 1/2 lies in the interval.
An unbiased estimator for θ is θ̂ = R/n (and in fact this is the MLE and the moments estimator), and its variance is θ(1 − θ)/n, which we can estimate by V = θ̂(1 − θ̂)/n, so our discussion of confidence intervals tells us that an approximate 95% confidence interval for θ is
    θ̂ ± z1−α/2 √V = θ̂ ± 1.96 √V,
which gives
tosses: (0.456, 0.594) spins: (0.506, 0.644),
suggesting that since the 95% confidence interval for spins does not contain 1/2, the coin is not
fair for spins, but that it is fair for tosses.
Note that if we had had R = 85 for tosses, then we would get interval (0.356, 0.494), and would
also have concluded that the coin is not fair for tosses.
Similar computations at confidence level 99% (α = 0.01) give wider intervals that contain 1/2 for both tosses and spins, so if we take a wider confidence interval, we conclude that the coin is fair for spins also.
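A sketch of these computations in R:
  binom_ci <- function(r, n, level = 0.95) {
    theta_hat <- r / n
    se <- sqrt(theta_hat * (1 - theta_hat) / n)
    theta_hat + c(-1, 1) * qnorm(1 - (1 - level) / 2) * se
  }
  binom_ci(105, 200)            # tosses, 95%: contains 1/2
  binom_ci(115, 200)            # spins,  95%: excludes 1/2
  binom_ci(115, 200, 0.99)      # spins,  99%: contains 1/2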
Null and alternative hypotheses
In a general testing problem we aim to use the data to decide between two hypotheses.
The null hypothesis H0 , which represents the theory/model we want to test.
– For the coin tosses, H0 is that the coin is fair, i.e., P(heads) = θ = θ 0 = 1/2.
The alternative hypothesis H1 , which represents what happens if H0 is false.
– For the coin tosses, H1 is that the coin is not fair, i.e., P(heads) ≠ θ0.
When we decide between the hypotheses, we can make two sorts of error:
Type I error (false positive): H0 is true, but we wrongly reject it (and choose H1 );
Type II error (false negative): H1 is true, but we wrongly accept H0 .
Decision
Accept H0 Reject H0
State of Nature H0 true Correct choice (True negative) Type I Error (False positive)
H1 true Type II Error (False negative) Correct choice (True positive)
Taxonomy of hypotheses
Definition 277. A simple hypothesis entirely fixes the distribution of the data Y , whereas a
composite hypothesis does not fix the distribution of Y .
Example 278. If
    H0 : Y1, . . . , Yn iid ∼ N(0, 1),   H1 : Y1, . . . , Yn iid ∼ N(0, 3),
then both hypotheses are simple.
Example 279. If, for the coin of Example 276, H0 : θ = 1/2 and H1 : θ ≠ 1/2, then H0 (‘the coin is fair’) is simple but H1 (‘the coin is not fair’) is composite.
Example 280. If µ, σ² are unknown and F is an unknown (but non-normal) distribution, and
    H0 : Y1, . . . , Yn iid ∼ N(µ, σ²),   H1 : Y1, . . . , Yn iid ∼ F,
then both H0 (‘the data are normally distributed’) and H1 (‘the data are not normally distributed’) are
composite.
True and false positives: Example
H0 : T ∼ N (0, 1) and H1 : T ∼ N (µ, 1), with µ > 0.
Reject H0 if T > t, where t is some cut-off, so we
– reject H0 incorrectly (false positive) with probability α(t) = P0(T > t) = 1 − Φ(t) = Φ(−t);
– reject H0 correctly (true positive) with probability β(t) = P1(T > t) = 1 − Φ(t − µ).
ROC curve
Definition 281. The receiver operating characteristic (ROC) curve of a test plots β(t) against
α(t) as the cut-off t varies, i.e., it shows (P0 (T ≥ t), P1 (T > t)), when t ∈ R.
In the example above, we have α = Φ(−t), so t = −Φ−1(α) = −zα, so equivalently we graph the points {α, β(−zα)} = {α, Φ(zα + µ)} for 0 < α < 1.
Here is the ROC curve for the example above, which has µ = 2 (in red). Also shown are the ROC
curves for µ = 0, 0.4, 3, 6. Which is which?
(Figure: ROC curves, plotting the true positive probability β(t) against the false positive probability α(t).)
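Since here β = Φ{µ − Φ⁻¹(1 − α)}, the ROC curves are easy to draw (a sketch):
  alpha <- seq(0.001, 0.999, length.out = 200)
  plot(NULL, xlim = c(0, 1), ylim = c(0, 1),
       xlab = "False positive probability", ylab = "True positive probability")
  abline(0, 1, lty = 2)                        # the useless test, mu = 0
  for (mu in c(0.4, 2, 3, 6)) {
    beta <- pnorm(mu - qnorm(1 - alpha))       # P1(T > t), with t = qnorm(1 - alpha)
    lines(alpha, beta)
  }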
Example, II
In case you need help, here are the densities for three of the cases:
(Figure: null and alternative densities with the cut-off t, for three values of µ.)
Definition 282. Let P0(·) and P1(·) denote probabilities computed under null and alternative hypotheses H0 and H1 respectively. Then the size and power of a statistical test of H0 against H1 are
    α = P0(reject H0),   β = P1(reject H0).
A test can be based on whether a confidence interval contains the parameter value specified by H0, and in most cases the width of such an interval satisfies
    U − L ∝ n^{−1/2},
so increasing n gives a narrower interval and will increase the power of the test. This makes sense,
because having more data should allow us to be more certain in our conclusions.
Unfortunately, not all tests correspond to confidence intervals, so we need a more general approach.
For example, checking the fit of a model is not usually possible using a confidence interval . . .
Probability and Statistics for SIC slide 361
Example 283. In a legal dispute, it was claimed that the numbers below were faked:
261 289 291 265 281 291 285 283 280 261 263 281 291 289 280
292 291 282 280 281 291 282 280 286 291 283 282 291 293 291
300 302 285 281 289 281 282 261 282 291 291 282 280 261 283
291 281 246 249 252 253 241 281 282 280 261 265 281 283 280
242 260 281 261 281 282 280 241 249 251 281 273 281 261 281
282 260 281 282 241 245 253 260 261 281 280 261 265 281 241
260 241
Real data could be expected to have final digits uniformly distributed on {0, 1, . . . , 9}, but here we have
0 1 2 3 4 5 6 7 8 9
14 42 14 9 0 6 2 0 0 5
How strong is the evidence that the final digits are not uniform?
Pearson statistic
Definition 284. Let O1 , . . . , Ok be the number of observations of a random sample of size
n = n1 + · · · + nk falling into the categories 1, . . . , k, whose expected numbers are E1 , . . . , Ek , where
Ei > 0. Then the Pearson statistic (or chi-square statistic) is
T = Σ_{i=1}^k (Oi − Ei)²/Ei.
Definition 285. Let Z1, . . . , Zν be iid N(0, 1); then W = Z1² + · · · + Zν² follows the chi-square distribution with ν degrees of freedom, whose density function is
    fW(w) = w^{ν/2−1} e^{−w/2} / {2^{ν/2} Γ(ν/2)},   w > 0,   ν = 1, 2, . . . ,
where Γ(a) = ∫0∞ u^{a−1} e^{−u} du, a > 0, is the gamma function.
Null and alternative hypotheses for Example 283
Null hypothesis, H0: the final digits are independent and distributed according to the uniform distribution on 0, . . . , 9. This simple null hypothesis implies that O0, . . . , O9 have the multinomial distribution with probabilities p0 = · · · = p9 = 0.1, and since Ej = n/10 > 5, we have
    P0(T ≤ t) ≈ P(χ²9 ≤ t),   t > 0.
Alternative hypothesis, H1 : the final digits are independent but not uniform, so O0 , . . . , O9
follow a multinomial distribution with unequal probabilities, p0 , . . . , p9 . This hypothesis is
composite, and the parameter θ ≡ (p1 , . . . , p9 ) is of dimension 9, as p0 = 1 − p1 − · · · − p9 . Under
this model,
P1 (T > t) ≥ P(χ29 > t), t > 0.
Since values of T tend to be smaller under H0 than under H1, we should expect large values of T to be evidence against H0 in favour of H1.
We verify this on the following slides.
(Figure: Monte Carlo distributions of the Pearson statistic T, shown as histograms with the χ²9 density and as ordered values plotted against χ²9 quantiles; see the caption below.)
Monte Carlo simulations of T , n = 100, 50
Pearson’s statistics for 10,000 sets of data when testing H0 : p0 = · · · = p9 = 0.1, when: (a) (top) the
data are generated with p0 = p1 = 0.15, p2 = · · · = p9 = 0.0875, and n = 100; (b) (bottom) the data
are generated with p0 = p1 = 0.2, p2 = · · · = p9 = 0.075 and n = 50. The red line shows the χ29
density.
Example
The simulations in the previous figures show that
under H0, we indeed have T ·∼ χ²9, even with n = 50;
under H1 , the distribution of T is shifted to the right;
the size of the shift under H1 will determine the power of the test, which depends on the sample
size n and on the non-uniformity of (p0 , . . . , p9 ).
The observed counts
    0  1  2  3  4  5  6  7  8  9
    14 42 14 9  0  6  2  0  0  5
give observed value of T equal to tobs ≈ 158.
For a test of H0 at significance level α = 0.05, note that the (1 − α) quantile of the χ29
distribution is 16.92. Since tobs > 16.92, we can reject H0 at significance level 0.05.
In fact,
    P0(T ≥ tobs) = P(χ²9 ≥ 158) < 2.2 × 10⁻¹⁶,
so seeing data like this would be essentially impossible under H0 . It is almost certain that the
observed final digits did not come from a uniform distribution.
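A sketch of the calculation in R; chisq.test implements exactly this goodness-of-fit test:
  O <- c(14, 42, 14, 9, 0, 6, 2, 0, 0, 5)     # observed final-digit counts
  E <- sum(O) / 10                             # expected count 9.2 under H0
  sum((O - E)^2 / E)                           # Pearson statistic, about 158
  chisq.test(O, p = rep(0.1, 10))              # same statistic with its chi^2_9 P-value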
Evidence and P-values
A statistical hypothesis test has the following elements:
a null hypothesis H0 , to be tested against an alternative hypothesis H1 ;
data, from which we compute a test statistic T , chosen such that large values of T provide
evidence against H0 ;
the observed value of T is tobs , which we compare with the null distribution of T , i.e., the
sampling distribution of T under H0 ;
we measure the evidence against H0 using the P-value pobs = P0(T ≥ tobs), computed under the null distribution: the smaller pobs, the stronger the evidence against H0.
Examples
Example 287. Recast Example 276 in terms of P-values.
Example 288. Ten new electricity meters are measured for quality control purposes, resulting in the
data
983 1002 998 996 1002 983 994 991 1005 986
Is there a systematic divergence from the standard value of 1000?
Note to Example 287
Under H0 the number of heads satisfies R ∼ B(200, 1/2), so we take the test statistic T = Z², where Z = (R − 100)/√50, since here n = 200 and θ0 = 1/2 yield E(R) = 100 and var(R) = 50.
Since T = Z², where Z ·∼ N(0, 1), we have that T ·∼ χ²1 under H0.
This gives tobs = 0.5 for the tosses, and tobs = 4.5 for the spins, with corresponding P-values
P0(T ≥ tobs) = P(χ²1 ≥ 0.5) ≈ 0.480,   P0(T ≥ tobs) = P(χ²1 ≥ 4.5) ≈ 0.034.
With α = 0.05 we would accept H0 for the tosses but reject it for the spins.
With α = 0.01 we would accept H0 for both tosses and spins.
Note to Example 288
We assume that Y1, . . . , Yn are iid N(µ, σ²), with σ² unknown. We take
    H0 : µ = µ0 = 1000,   H1 : µ ≠ 1000.
If H0 is true, then
    Z = (Y − µ0)/√(S²/n) ∼ tn−1.
Here the alternative hypothesis H1 is two-sided, i.e., we will reject if either Y is much larger or
much smaller than µ0 , so we should take
T = |Y − µ0|/√(S²/n) = |Z|,
and for a test at significance level α = 0.05 we therefore need to choose tα such that
α = P0 (T > tα ) = 1 − P0 (−tα ≤ Z ≤ tα ) .
But Z ∼ tn−1 is a pivot under H0, so 1 − P0(−tα ≤ Z ≤ tα) = 2P0(Z ≤ −tα), and this implies that tα = −tn−1(α/2). With α = 0.05 and n = 10, we have t9(0.025) = −2.262 from the tables, or from R as qt(0.025, df=9), so tα = 2.262.
For the data above, y = 994 and
s² = (1/9) Σ_{i=1}^n (yi − y)² = 64.88.
Now tobs = |(994 − 1000)/√(64.88/10)| = |−2.35| = 2.35 > tα = 2.262, so we reject H0 at level α = 5%.
Alternatively we can compute the 95% confidence interval based on Z, which is
(988.238, 999.762). Since this does not contain µ0 , H0 is rejected at the 5% level.
If instead the alternative hypothesis is H1 : µ > 1000, then we take Z as the test statistic, since
we are likely to have positive Z under H1 . In this case we need to choose tα such that
α = P0(Z > tα) = P0{(Y − µ0)/√(S²/n) > tα}.
Since Z ∼ tn−1 , we have that tα = t9 (0.95) = 1.833, and since zobs = −2.35 < 1.833, we cannot
reject the null hypothesis at the 5% level. Indeed, having y = 994 suggests that it is not true that
µ > µ0 .
If the alternative hypothesis is H1 : µ < 1000, then we take T = −Z as the test statistic, since we
are likely to have negative Z under H1 . In this case we need to choose tα such that
α = P0(−Z > tα) = P0{(Y − µ0)/√(S²/n) < −tα} = P0{(Y − µ0)/√(S²/n) < tn−1(α)},
implying that tα = −tn−1 (α) = tn−1 (1 − α). With α = 0.05, we therefore have tα = 1.833, and
since −zobs = 2.35 > tα = 1.833, we reject the null hypothesis at the 5% level. Having
y = 994 < µ0 suggests that maybe µ < µ0 .
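The same analysis in R (a sketch; t.test gives the two-sided test with its confidence interval, and one-sided versions via the alternative argument):
  y <- c(983, 1002, 998, 996, 1002, 983, 994, 991, 1005, 986)
  t.test(y, mu = 1000)                             # two-sided: reject at the 5% level
  t.test(y, mu = 1000, alternative = "greater")    # H1: mu > 1000
  t.test(y, mu = 1000, alternative = "less")       # H1: mu < 1000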
Probability and Statistics for SIC note 2 of slide 371
α Evidence against H0
0.05 Weak
0.01 Positive
0.001 Strong
0.0001 Very strong
Choice of α
As with CIs, conventional values are often used, such as α = 0.05, 0.01, 0.001.
The most common value is α = 0.05, which corresponds to a Type I error probability of 5%, i.e.,
H0 will be rejected once in every 20 tests, even when it is true.
When many tests are performed, using large α can give many false positives, i.e., significant tests
for which in fact H0 is true.
Consider a microarray experiment, where we test 1000 genes at significance level α, to see which genes influence some disease. If only 100 genes have effects, we can write
    P(H0 true) = 0.9,   P(H1 true) = 0.1,
where α is the size of the test, β > α is its power, and S denotes the event that the test is significant at level α. Bayes’ theorem gives
    P(H0 true | S) = 0.9α/(0.9α + 0.1β),
which can be large unless α is small relative to β.
Many tests: State of nature
True state of nature: we will test 1000 genes, of which only 100 show real effects.
Ideally tests for the 100 genes for which H1 is true will reject, and those for the 900 for which H0
is true will fail to reject.
Then we will follow up on the 100 rejections, and win fame and fortune.
If we test at significance level α = 0.05, we expect to wrongly reject for 0.05 × 900 = 45 of the
genes where H0 is true; these are the false positives, which we falsely conclude show an effect.
If the power of the test is 0.5, then we expect to reject H0 in only 50% of the cases in which H1 is
true, i.e., we will miss 0.5 × 100 = 50 genes for which H1 is true, and there is a true effect; these
are the false negatives.
Many tests: Non-reproducible results
Among the tests that reject H0 , we will have 45 false positives and just 50 true positives, so
almost 50% of our future effort in following up these genes will be wasted.
To reduce this wasted effort, we must
– decrease the size α of the test, which will reduce the number of false positives; and
– increase the power β of the test, which will reduce the number of false negatives.
This mis-use of hypothesis testing has led to a lot of ‘discoveries’ that cannot be replicated in
follow-up experiments based on further data.
http://www.badscience.net
http://en.wikipedia.org/wiki/Anil_Potti
8.5 Comparison of Tests slide 379
Types of test
There are many different tests for different hypotheses. Two important classes of tests are:
parametric tests, which are based on a parametric statistical model, such as
iid
Y1 , . . . , Yn ∼ N (µ, σ 2 ), and H0 : µ = 0;
nonparametric tests, which are based on a more general statistical model, such as
    Y1, . . . , Yn iid ∼ f, and H0 : P(Y > 0) = P(Y < 0) = 1/2, i.e., the median of f is at y = 0.
The main advantage of a parametric test is the possibility of finding a (nearly-)optimal test, if the
underlying assumptions are correct, though such a test could perform badly in the presence of outliers.
A nonparametric test is often more robust, but it will suffer a loss of power compared to a parametric
test, used appropriately.
Medical analogy
We diagnose an illness based on symptoms presented by a patient:
Decision
Healthy Diseased
Patient Healthy True negative False positive
Diseased False negative True positive
In the graphic below, Symptom 1 gives perfect diagnoses, but Symptom 2 is useless. Think how the
probability of a correct diagnosis varies as the different lines move parallel to their slopes.
(Figure: values of Symptom 2 plotted against Symptom 1 for two groups of patients, labelled S and M; the groups separate along Symptom 1 but overlap completely on Symptom 2.)
ROC curve, II
We previously met the ROC curve as a summary of the properties of a test.
A good test will have an ROC curve lying as close to the upper left corner as possible.
A useless test has an ROC curve lying on (or close to) the diagonal.
This suggests that if we have a choice of tests, we should choose one whose ROC curve is as close
to the north-west as possible, i.e., we should choose the test that maximises the power for a given
size.
This leads us to the Neyman–Pearson lemma, which says how to do this (in ideal
circumstances).
(Figure: ROC curves for several tests; the better tests lie closer to the upper left corner, the useless one lies on the diagonal.)
A test can be defined by its critical region Y, a subset of the sample space:
    Y ∈ Y ⇒ Reject H0,   Y ∉ Y ⇒ Accept H0.
In Example 287, Y = {(y1, . . . , yn) : |Σ yj − 100|/50^{1/2} > 1.96}.
We aim to choose Y such that P1(Y ∈ Y) is the largest possible such that P0(Y ∈ Y) = α.
Lemma 289 (Neyman–Pearson). Let f0(y), f1(y) be the densities of Y under simple null and alternative hypotheses. Then if it exists, the set
    Yα = {y : f1(y) ≥ tα f0(y)},   with tα chosen so that P0(Y ∈ Yα) = α,
is the critical region of size α with the greatest power.
Note to Lemma 289
Suppose that a region Yα such that P0(Y ∈ Yα) = α does exist and let Y′ be any other critical region of size α or less. Then for any density f,
    ∫_{Yα} f(y) dy − ∫_{Y′} f(y) dy    (7)
equals
    ∫_{Yα\Y′} f(y) dy + ∫_{Yα∩Y′} f(y) dy − ∫_{Y′∩Yα} f(y) dy − ∫_{Y′\Yα} f(y) dy = ∫_{Yα\Y′} f(y) dy − ∫_{Y′\Yα} f(y) dy.    (8)
If f = f0, (7) and hence (8) are non-negative, because Y′ has size at most that of Yα. Suppose that f = f1. If y ∉ Yα, then tα f0(y) > f1(y), while f1(y) ≥ tα f0(y) if y ∈ Yα. Hence when f = f1, (8) is no smaller than
    tα {∫_{Yα\Y′} f0(y) dy − ∫_{Y′\Yα} f0(y) dy} ≥ 0,
so the power of Yα is at least that of Y′.
Example
Example 290. (a) Construct an optimal test for the hypothesis H0 : θ = 1/2 in Example 276, with
α = 0.05.
(b) Do you think that θ = 1/2 for spins?
Note to Example 290
The joint density of n independent Bernoulli variables can be written as
    f(y) = θ^r (1 − θ)^{n−r},   0 < θ < 1,   r = Σ yj,
so the likelihood ratio is
    f1(y)/f0(y) = θ^r(1 − θ)^{n−r} / {(1/2)^r (1 − 1/2)^{n−r}} = {2(1 − θ)}ⁿ {θ/(1 − θ)}^r,
which is increasing in r if θ > 1/2 and is decreasing in r if θ < 1/2. Hence if θ > 1/2 we must take
    Y1 = {y1, . . . , yn : Σ yj ≥ r1}
for some r1, and if θ < 1/2 we must take
    Y2 = {y1, . . . , yn : Σ yj ≤ r2}
for some r2. So if we want to test H0 against (say) H1 : θ = 0.6, we take Y1, and if we want to
test H0 against (say) H1 : θ = 0.4, we take Y2 .
Suppose that we take H1 : θ = 0.6. Then we need to choose r1 such that
    α = P0(Y ∈ Y1) = P0(R ≥ r1) = P0{(R − n/2)/√(n/4) ≥ (r1 − n/2)/√(n/4)} ≈ 1 − Φ{(r1 − n/2)/√(n/4)},
and this implies that r1 ≈ n/2 + (√n/2) z1−α. With n = 200 and α = 0.05 this is r1 ≈ 111.6.
Since we observed R = 115 > r1 , we reject H0 at the 5% significance level, and conclude that the
coin is biased upwards (but not downwards).
Since the result does not depend on the value of θ chosen, provided θ > 0.5, we would also reject
against any other H1 setting θ > 1/2.
A similar computation gives r2 = 88.37.
If we are not sure of the value of θ, then we take a region of the form Y1 ∪ Y2 . But in order for it
to have overall size α, we take α/2 for each of the regions, giving r1 = 113.86 and r2 = 86.14.
Since Y ∈ Y1 ∪ Y2 , we still reject H0 at the 5% significance level, and conclude that the coin is
biased, without being sure in which direction it is biased.
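A sketch of the cut-off calculations in R:
  n <- 200; alpha <- 0.05
  n/2 + sqrt(n) * qnorm(1 - alpha) / 2               # r1 for Y1, about 111.6
  n/2 - sqrt(n) * qnorm(1 - alpha) / 2               # r2 for Y2, about 88.4
  n/2 + c(1, -1) * sqrt(n) * qnorm(1 - alpha/2) / 2  # two-sided: about 113.9 and 86.1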
Power and distance
A canonical example is where Y1, . . . , Yn are iid N(µ, σ²), and
    H0 : µ = µ0,   H1 : µ = µ1.
If σ² is known, then the Neyman–Pearson lemma can be applied, and we find that the most powerful test is based on Y and its power is Φ(zα + δ), where Φ(zα) = α, and
    δ = n^{1/2} |µ1 − µ0| / σ
is the standardized distance between the models.
We see that
– the power increases if n increases, or if |µ1 − µ0 | increases, since in either case the difference
between the hypotheses is easier to detect,
– the power decreases if σ increases, since then the data become noisier,
– if δ = 0, then the power equals the size, because the two hypotheses are the same, and
therefore P0 (·) = P1 (·).
Many other situations are analogous to this, with power depending on generalised versions of δ.
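A sketch of the power formula Φ(zα + δ) in R, with illustrative (made-up) values of the ingredients:
  power_normal <- function(n, mu0, mu1, sigma, alpha = 0.05) {
    delta <- sqrt(n) * abs(mu1 - mu0) / sigma
    pnorm(qnorm(alpha) + delta)
  }
  power_normal(n = c(10, 50, 100), mu0 = 0, mu1 = 0.5, sigma = 1)
With δ = 0 the function returns α, the size, as noted above.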
Summary
We have considered the situation where we have to make a binary choice between
– the null hypothesis, H0 , against which we want to test
– the alternative hypothesis, H1 ,
using a test statistic T whose observed value is tobs, and computing the P-value pobs = P0(T ≥ tobs), which measures the evidence against H0.
Decision
Accept H0 Reject H0
State of Nature H0 true Good choice Type I Error
H1 true Type II Error Good choice
If we try to minimise the probability of Type II error (i.e., maximise power) for a given probability
of Type I error (fixed size), we can construct an optimal test, but this is only possible in simple
cases. Otherwise we usually have to compare tests numerically.
9 Likelihood slide 387
Illustration
When we toss a coin, small asymmetries influence the probability of obtaining heads, which is not
necessarily 1/2. If Y1 , . . . , Yn denote the results of independent Bernoulli trials, then we can write
P(Yj = 1) = θ, P(Yj = 0) = 1 − θ, 0 ≤ θ ≤ 1, j = 1, . . . , n.
Suppose we observe the n = 10 outcomes
    1 1 1 1 1 0 1 1 1 1
Basic Idea
For a value of θ which is not very credible, the density of the data will be smaller: the higher the
density, the more credible the corresponding θ. Since the y1 , . . . , y10 result from independent trials, we
have
f(y1, . . . , y10; θ) = Π_{j=1}^{10} f(yj; θ) = f(y1; θ) × · · · × f(y10; θ) = θ⁵ × (1 − θ) × θ⁴ = θ⁹(1 − θ),
which, regarded as a function of θ, is called the likelihood.
Relative likelihood
To compare values of θ, we only need to consider the ratio of the corresponding values of L(θ); in particular we define the relative likelihood RL(θ) = L(θ)/L(θ̂), where θ̂ maximises L(θ).
Example
Example 291. Find θb and RL(θ) for a sequence of independent Bernoulli trials.
The following graph represents RL(θ), for n = 10, 20, 100 and the sequence
1 1 1 1 1 0 1 1 1 1 0 1 1 1 1 0 1 0 1 1
1 1 1 1 1 1 0 1 0 1 0 0 1 1 0 1 1 1 0 1
1 1 1 0 0 1 0 1 1 1 1 1 0 0 1 1 1 1 1 1
1 0 1 0 1 1 0 1 1 1 0 0 1 1 1 0 1 1 1 1
1 0 0 0 0 1 0 1 0 0 1 0 0 1 1 1 1 1 1 0
As n increases, RL(θ) becomes more concentrated around θ̂: values of θ which are far away from θ̂ become less credible relative to θ̂.
This suggests that we could construct a CI by taking the set
    {θ : RL(θ) ≥ c},
for some constant 0 < c < 1.
Note to Example 291
The likelihood is
L(θ) = f(y; θ) = Π_{j=1}^n f(yj; θ) = Π_{j=1}^n θ^{yj}(1 − θ)^{1−yj} = θ^s(1 − θ)^{n−s},   0 ≤ θ ≤ 1,
where s = Σ yj and we have used the fact that the observations are independent. Therefore
    ℓ(θ) = s log θ + (n − s) log(1 − θ).
Setting dℓ(θ)/dθ = 0 gives just one solution, θb = s/n = y, and since the second derivative is always
negative, this is clearly the maximum. Therefore
RL(θ) = L(θ)/L(θ̂) = (θ/θ̂)^s {(1 − θ)/(1 − θ̂)}^{n−s},   0 ≤ θ ≤ 1.
Bernoulli sequence
(Figure: relative likelihood RL(θ) for n = 10, 20, 100, with horizontal lines at c = 0.1 and c = 0.3.)
Likelihood
Definition 292. Let y be a set of data, whose joint probability density f (y; θ) depends on a parameter
θ; then the likelihood and the log likelihood are
    L(θ) = f(y; θ),   ℓ(θ) = log L(θ),
considered as functions of θ.
Maximum likelihood estimation
Definition 293. The maximum likelihood estimate θ̂ satisfies
    L(θ̂) ≥ L(θ) for all θ,
which is equivalent to ℓ(θ̂) ≥ ℓ(θ), since L(θ) and ℓ(θ) have their maxima at the same value of θ. The corresponding random variable is called the maximum likelihood estimator (MLE).
Often θ̂ satisfies
    dℓ(θ̂)/dθ = 0,   d²ℓ(θ̂)/dθ² < 0.
In this course we will suppose that the first of these equations has only one solution (not always
the case in reality).
In realistic cases we use numerical algorithms to obtain θ̂ and d²ℓ(θ̂)/dθ².
Information
Definition 294. The observed information J(θ) and the expected information (or Fisher
information) I(θ) are
J(θ) = −d²ℓ(θ)/dθ²,   I(θ) = E{J(θ)} = E{−d²ℓ(θ)/dθ²}.
They measure the curvature of −ℓ(θ): the larger J(θ) and I(θ), the more concentrated ℓ(θ) and L(θ)
are.
Example 295. If y1, . . . , yn are iid Bernoulli(θ), calculate L(θ), ℓ(θ), θ̂, var(θ̂), J(θ) and I(θ).
Here ℓ(θ) = s log θ + (n − s) log(1 − θ) with s = Σ yj, so θ̂ = s/n and
    J(θ) = −d²ℓ(θ)/dθ² = s/θ² + (n − s)/(1 − θ)².
Now treating θ̂ as a random variable, θ̂ = S/n, where S ∼ B(n, θ), we see that since E(S) = nθ and var(S) = nθ(1 − θ), we have after a little algebra that
    var(θ̂) = θ(1 − θ)/n,   I(θ) = E{J(θ)} = n/{θ(1 − θ)},   0 < θ < 1.
Note that var(θ̂) = 1/I(θ).
Limit distribution of the MLE
Theorem 296. Let Y1 , . . . , Yn be a random sample from a parametric density f (y; θ), and let θb be the
MLE of θ. If f satisfies regularity conditions (see below), then
J(θ̂)^{1/2} (θ̂ − θ) →D N(0, 1),   n → ∞.
This implies that an approximate (1 − α) × 100% confidence interval for θ is
    I^θ̂_{1−α} = (L, U) = (θ̂ − J(θ̂)^{−1/2} z1−α/2,  θ̂ + J(θ̂)^{−1/2} z1−α/2).
We can show that for large n (and a regular model) no estimator has a smaller variance than θ̂, which implies that the CIs I^θ̂_{1−α} are as narrow as possible.
Example 297. Find the 95% CI for the coin data with n = 10, 20, 100.
  n     s     θ̂      J(θ̂)     I^θ̂_{0.95}       I^W_{0.95}
  10    9     0.90    111.1    (0.72, 1.08)    (0.63, 0.99)
  20    16    0.80    125.0    (0.62, 0.98)    (0.59, 0.94)
  100   69    0.69    467.5    (0.60, 0.78)    (0.60, 0.78)
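The Wald intervals I^θ̂ in the table can be reproduced as follows (a sketch):
  wald_ci <- function(s, n, level = 0.95) {
    theta_hat <- s / n
    J <- s / theta_hat^2 + (n - s) / (1 - theta_hat)^2    # observed information
    theta_hat + c(-1, 1) * qnorm(1 - (1 - level) / 2) / sqrt(J)
  }
  wald_ci(9, 10); wald_ci(16, 20); wald_ci(69, 100)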
The likelihood ratio statistic for a scalar parameter θ is W(θ) = 2{ℓ(θ̂) − ℓ(θ)}.
Theorem 299. If θ0 is the value of θ that generated the data, then under the regularity conditions giving θ̂ a normal limit distribution,
    W(θ0) →D χ²1,   n → ∞.
Hence W(θ0) ·∼ χ²1 for large n.
Example 300. Find W(θ) when Y1, . . . , Yn are iid Bernoulli(θ0).
Note to Example 300
Since
ℓ(θ) = s log θ + (n − s) log(1 − θ), 0 ≤ θ ≤ 1,
and θ̂ = s/n = y, we have
    W(θ) = 2 [ nθ̂ log(θ̂/θ) + n(1 − θ̂) log{(1 − θ̂)/(1 − θ)} ],
and if we write θ̂ = θ + n^{−1/2} a(θ)Z, where a²(θ) = θ(1 − θ) and Z →D N(0, 1), we end up after a Taylor series or two with
    W(θ) ≈ Z² →D χ²1.
With 1 − α = 0.95 we have χ²1(0.95) = 3.84. Thus the 95% CI for a scalar θ contains all θ such that ℓ(θ) ≥ ℓ(θ̂) − 1.92. In this case we have
    RL(θ) = L(θ)/L(θ̂) = exp{ℓ(θ) − ℓ(θ̂)} ≥ exp(−1.92) ≈ 0.15.
CIs based on the likelihood ratio statistic
(Figure: log likelihood against θ, with horizontal cut-offs giving likelihood ratio confidence intervals at levels 0.9, 0.95 and 0.99.)
(Source: http://www.benbest.com/science/standard.html)
The top quark was discovered in 1995.
The experiments yielded an observed count y = 17, which would follow a Poisson(θ) distribution with θ = 6.7 if this quark did not exist.
Top quark: Likelihood
(Figure: log likelihood for θ based on y = 17, plotted over 5 ≤ θ ≤ 25.)
Here we have f(y; θ) = θ^y e^{−θ}/y!, θ̂ = y, y = 17, and θ0 = 6.7, and thus
    wobs = W(θ0) = 2{log f(y; θ̂) − log f(y; θ0)} = 2{(y log θ̂ − θ̂ − log y!) − (y log θ0 − θ0 − log y!)} = 11.06.
Thus P(W ≥ wobs) ≈ P(χ²1 ≥ 11.06) = 0.00088: a rare event, if θ0 = 6.7.
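A sketch of the same computation in R:
  y <- 17; theta0 <- 6.7
  loglik <- function(theta) dpois(y, theta, log = TRUE)
  w_obs <- 2 * (loglik(y) - loglik(theta0))     # theta-hat = y for the Poisson model
  w_obs                                         # about 11.06
  pchisq(w_obs, df = 1, lower.tail = FALSE)     # about 0.00088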
Regularity
The regularity conditions are complicated. Situations where they are false are often cases where
one of the parameters is discrete;
the support of f (y; θ) depends on θ;
the true θ is on the limit of its possible values.
The conditions are satisfied in the majority of cases met in practice.
Here is an example where they are not satisfied:
Example 301. If Y1, . . . , Yn are iid U(0, θ), find the likelihood L(θ) and the MLE θ̂.
Show that the limit distribution of n(θ − θ̂)/θ when n → ∞ is exp(1).
Discuss.
Probability and Statistics for SIC slide 406
Note to Example 301
First show that owing to the independence, we have
    L(θ) = Π_{j=1}^n fY(yj; θ) = Π_{j=1}^n θ⁻¹ I(0 < yj < θ) = θ^{−n} I(max yj < θ),   θ > 0,
which is zero for θ ≤ max yj and decreasing in θ for θ > max yj, so the MLE is θ̂ = max(y1, . . . , yn).
Now
    P{n(θ − θ̂)/θ ≤ x} = P(θ̂ ≥ θ − xθ/n) = 1 − {(θ − xθ/n)/θ}ⁿ = 1 − (1 − x/n)ⁿ → 1 − exp(−x),   n → ∞,
which is the exp(1) distribution function. Thus θ̂ converges to θ at rate n⁻¹ rather than the usual n^{−1/2}, and its limit distribution is exponential rather than normal.
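A quick simulation (a sketch, with θ = 1 and n = 100 chosen arbitrarily) shows the exponential limit and the n⁻¹ rate:
  set.seed(3)
  theta <- 1; n <- 100
  q <- replicate(1e4, {
    y <- runif(n, 0, theta)
    n * (theta - max(y)) / theta
  })
  c(mean(q), var(q))                              # both close to 1, as for exp(1)
  plot(ecdf(q)); curve(pexp(x), add = TRUE, col = "red")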
Example
Comparison of the distributions of θ̂ in a regular case (panels above, with standard deviation ∝ n^{−1/2}) and in a nonregular case (Example 301, panels below, with standard deviation ∝ n⁻¹). In other nonregular cases it might happen that the distribution is not nice (like here) and/or that the convergence is slower than in regular cases.
(Figure: densities of θ̂ for n = 16, 64, 256 — regular case in the top row, nonregular case in the bottom row.)
9.3 Vector Parameter slide 408
Vector θ
Often θ is a vector of dimension p. Then the definitions and results above are valid with some slight
changes:
the MLE θ̂ often satisfies the vector equation
    dℓ(θ̂)/dθ = 0p×1;
J(θ) and I(θ) are p × p matrices;
and in regular cases,
    θ̂ ·∼ Np{θ, J(θ̂)⁻¹}.
and the elements of the observed information matrix J(µ, σ²) are obtained from the second derivatives
    ∂²ℓ/∂µ² = −n/σ²,   ∂²ℓ/∂µ∂σ² = −σ⁻⁴ Σ_{j=1}^n (yj − µ),   ∂²ℓ/∂(σ²)² = n/(2σ⁴) − σ⁻⁶ Σ_{j=1}^n (yj − µ)².
Note 2 to Example 302
To obtain the MLEs, we solve simultaneously the equations
    ∂ℓ(µ, σ²)/∂µ = σ⁻² Σ_{j=1}^n (yj − µ) = 0,   ∂ℓ(µ, σ²)/∂σ² = −n/(2σ²) + (1/(2σ⁴)) Σ_{j=1}^n (yj − µ)² = 0.
Now
    ∂ℓ(µ̂, σ̂²)/∂µ = 0  ⇒  σ̂⁻² Σ_{j=1}^n (yj − µ̂) = 0  ⇒  nµ̂ = Σ_{j=1}^n yj  ⇒  µ̂ = n⁻¹ Σ_{j=1}^n yj = y
and
    ∂ℓ(µ̂, σ̂²)/∂σ² = 0  ⇒  n/(2σ̂²) = (1/(2σ̂⁴)) Σ_{j=1}^n (yj − µ̂)²  ⇒  σ̂² = n⁻¹ Σ_{j=1}^n (yj − µ̂)² = n⁻¹ Σ_{j=1}^n (yj − y)².
The first of these has the sole solution µ̂ = y for all values of σ², and therefore ℓ(µ̂, σ²) is unimodal with maximum at σ̂² = n⁻¹ Σ(yj − y)². At the point (µ̂, σ̂²), the observed information matrix J(µ, σ²) is diagonal with elements diag{n/σ̂², n/(2σ̂⁴)}, and so is positive definite. Hence µ̂ = y and σ̂² = n⁻¹ Σ(yj − y)² are the sole solutions to the likelihood equation, and therefore are the maximum likelihood estimates.
The fact that θ̂ ·∼ Np{θ, J(θ̂)⁻¹} implies that
    (µ̂, σ̂²) ·∼ N2{(µ, σ²), diag(σ̂²/n, 2σ̂⁴/n)},
which implies that µ̂ and σ̂² are approximately independent, since their covariance is zero and they have an approximate joint normal distribution. In fact we know from Theorem 274 that they are exactly independent, with µ̂ ∼ N(µ, σ²/n), and nσ̂² ∼ σ²χ²n−1; this last fact implies that E(σ̂²) = (n − 1)σ²/n and var(σ̂²) = 2(n − 1)σ⁴/n², which converge to the values in the large-sample distribution as n → ∞.
For confidence intervals, we note that the large-sample theory implies that a (1 − α) confidence interval for µ is
    µ̂ ± z1−α/2 σ̂/√n = y ± z1−α/2 σ̂/√n,
and unless n is very small, this approximate interval is very similar to the exact interval
    y ± tn−1(1 − α/2) s/√n
that stems from Theorem 274. A similar argument applies to the interval for σ².
Nested models
Very few applications have only one parameter, so we must test hypotheses concerning vector
parameters.
For example, in the case of the normal model, we often want to test the hypothesis H0 : µ = µ0 ,
where µ0 is a specified value (such as µ0 = 0), without restricting σ 2 . In this case we want to
compare two models,
    general model: Y1, . . . , Yn iid ∼ N(µ, σ²);   simplified model: Y1, . . . , Yn iid ∼ N(µ0, σ²),
whose respective parameters have dimensions 2 (general model) and 1 (simplified model).
In a general context, put θp×1 = (ψq×1 , λ(p−q)×1 ), and suppose that we want to test whether the
simple model with ψ = ψ 0 explains the data as well as the general model. Thus, under the general
model, θ = (ψ, λ) ∈ Rq × Rp−q , whereas under the simple model, θ = (ψ, λ) ∈ {ψ 0 } × Rp−q .
We say that the simpler model is nested in the general model.
We use this terminology in all situations where one of the two models becomes the same as the
other when the parameter is restricted.
Write the MLEs under the general and simple models as
    θ̂ = (ψ̂, λ̂),   θ̂0 = (ψ0, λ̂0),
and let W(ψ0) = 2{ℓ(θ̂) − ℓ(θ̂0)} be the corresponding likelihood ratio statistic.
Then if it is true that ψ = ψ 0 (i.e., the simpler of the two models is true), we have
W(ψ0) →D χ²q,   n → ∞.
This gives a basis for tests and CIs as before, using the approximation
    W(ψ0) ·∼ χ²q,
valid for large n. In general this approximation is better than that for θ̂.
This result generalises Theorem 299, where θ ≡ ψ is scalar, q = 1, and λ is not present, p − q = 0.
Example
Example 303. Below the results of 100 tosses of two different coins, with 1 denoting a head:
1 1 1 1 1 0 1 1 1 1 0 1 1 1 1 0 1 0 1 1
1 1 1 1 1 1 0 1 0 1 0 0 1 1 0 1 1 1 0 1
1 1 1 0 0 1 0 1 1 1 1 1 0 0 1 1 1 1 1 1
1 0 1 0 1 1 0 1 1 1 0 0 1 1 1 0 1 1 1 1
1 0 0 0 0 1 0 1 0 0 1 0 0 1 1 1 1 1 1 0
1 0 1 1 0 0 1 0 1 1 0 0 0 1 1 0 0 0 1 0
1 1 0 1 0 1 0 1 1 0 0 0 0 1 0 1 1 0 0 0
1 1 1 0 0 1 1 0 0 1 1 0 1 0 1 1 0 0 0 1
1 0 1 1 1 1 1 0 1 1 1 1 0 1 0 0 1 1 0 0
0 1 1 1 1 1 1 0 1 0 0 1 0 0 1 1 1 1 0 1
Let θ1 , θ2 be the corresponding probabilities of obtaining heads. Find the likelihood and the likelihood
ratio statistic. Does θ1 = θ2 : are the probabilities equal?
The likelihood is
    L(θ1, θ2) = θ1^{69}(1 − θ1)^{31} θ2^{55}(1 − θ2)^{45},
and the contours of the relative log likelihood ℓ(θ1, θ2) − ℓ(θ̂1, θ̂2) are shown on the next slide. Clearly the parameter has dimension p = 2.
Differentiation gives that θ̂1 = 69/(69 + 31) = 0.69 and θ̂2 = 55/(55 + 45) = 0.55, corresponding to the black blob in the figure, at (θ1, θ2) = (θ̂1, θ̂2) = (0.69, 0.55).
Under the simpler model θ1 = θ2 = θ, we have
    L(θ, θ) = θ^{124}(1 − θ)^{76},
and the parameter has one dimension, q = 1, corresponding to the diagonal red line in the contour plot. Clearly θ̂ = 124/(124 + 76) = 0.62, corresponding to the red blob on the diagonal line, at (θ1, θ2) = (θ̂, θ̂) = (0.62, 0.62).
To compare the two models, we note that the corresponding likelihood ratio statistic is
    W = 2{ℓ(θ̂1, θ̂2) − ℓ(θ̂, θ̂)},
and reading off from the graph this (approximately) equals wobs = −2 × (−2) = 4 (since the relative log likelihood is −2 at the red blob). Thus, if the simpler model is true, we have
P(W ≥ wobs) ≈ P(χ²1 ≥ 4) ≈ 0.046. This event has a fairly small probability (around 1 in 20), so
it is some, but not strong, evidence that the more complex model is needed, i.e., that the coins do
not have the same success probability.
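A sketch of the likelihood ratio computation in R, using the binomial log likelihoods:
  s1 <- 69; n1 <- 100; s2 <- 55; n2 <- 100
  loglik <- function(theta, s, n) s * log(theta) + (n - s) * log(1 - theta)
  l_gen  <- loglik(s1/n1, s1, n1) + loglik(s2/n2, s2, n2)         # separate probabilities
  theta_hat <- (s1 + s2) / (n1 + n2)
  l_simp <- loglik(theta_hat, s1, n1) + loglik(theta_hat, s2, n2) # common probability
  w_obs <- 2 * (l_gen - l_simp)                 # about 4.1
  pchisq(w_obs, df = 1, lower.tail = FALSE)     # about 0.04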
Example: Likelihood
(Figure: contours of the relative log likelihood in (θ1, θ2); the black blob marks (θ̂1, θ̂2) = (0.69, 0.55) and the red blob the restricted estimate (0.62, 0.62) on the diagonal θ1 = θ2.)
Drinking and driving
Example 304. Formulate a model for the data, and use it to verify the change in the proportion of
accidents due to alcohol in 2005–2006.
Is there a difference on the other side of the Röstigraben?
Values of the maximised log likelihood ℓ̂
  Model                                            ℓ̂           Number of parameters   2(ℓ̂2 − ℓ̂1)   df
  µca ≡ µ                                          −4668.59    1
  µca = λc                                         −161.62     23                     9011.9       22
  µca = λc ψ^{I(a=2006)}                           −157.70     24                     7.7          1
  µca = λc ψ0^{I(a=2006,r=0)} ψ1^{I(a=2006,r=1)}   −155.20     25                     5.2          1
  µca = λca                                        −146.72     46                     16.9         21
The indices:
c for the canton
a for the year
r = 1 for the other side of the Röstigraben
Calculation of the likelihood ratio statistics:
(Figure: histogram of the likelihood ratio statistics with a chi-squared density, and the ordered statistics plotted against quantiles of the chi-squared distribution with 21 df.)
Estimates
Here are the estimates and standard errors for the ‘best’ model:
General approach
Having understood the problem and looked at the data:
we choose one or several models, basing our choice on
– prior knowledge, or
– stochastic reasoning, or
– purely empirical ideas;
we fit the models by maximum likelihood;
we compare nested models using their maximised log likelihoods, often via the likelihood ratio
statistic 2(ℓb2 − ℓb1 );
we choose one or several ‘best’ models, and we use the approximation θ̂ ·∼ N{θ0, J(θ̂)⁻¹} to find CIs for the parameters, which we can interpret with respect to the original problem;
we verify whether the ‘best’ models are good;
if all goes OK, we stop; otherwise, we start again at step 1, or we look for more (better?) data.
General case
In general we take the likelihood based on data y = (y1 , . . . , yn ) to be P(Y = y; θ). If the yj are
independent, then (as before) we have
L(θ) = P(Y = y; θ) = Π_{j=1}^n P(Yj = yj; θ).    (9)
In practice continuous observations yj are rounded, so the true observation lies in a small interval
(yj − δ/2, yj + δ/2), and then
P(Yj = yj) = P{Yj ∈ (yj − δ/2, yj + δ/2)} = F(yj + δ/2) − F(yj − δ/2) ≈ δ f(yj).
Thus we can take L(θ) ∝ Πj f(yj; θ) for continuous observations, but in general we use (9) for independent data.
For example, if some of the yj are only known to exceed some value c, then (9) becomes
    L(θ) = P(Y = y; θ) ∝ Π_{j=1}^n f(yj; θ)^{I(yj ≤ c)} {1 − F(c; θ)}^{I(yj > c)}.
Santa Maria della Salute
(Source: http://www.fotocommunity.de/pc/pc/display/26118338)
(Figure: annual maximum sea levels (cm) at Venice.)
Simple linear regression
Let Y be a random variable, the response variable and suppose that its distribution depends on a
variable x, supposed non-random, the explanatory variable. Sometimes we call x the
independent variable, and Y the dependent variable.
A simple model to describe linear dependence of E(Y ) on x is
Y ∼ N(β0 + β1x, σ²), where β0 (the intercept), β1 (the slope) and σ² are unknown parameters.
Estimation
Example 305. Show that if the Y1 , . . . , Yn are independent, then the log likelihood for the simple
linear model is
ℓ(β0, β1, σ²) = −(n/2) log(2π) − (n/2) log σ² − (1/(2σ²)) Σ_{j=1}^n (yj − β0 − β1 xj)²,   (β0, β1, σ²) ∈ R² × R+.
Hence show that if n ≥ 2 and not all the xj are equal, then
    β̂1 = Σ_{j=1}^n yj (xj − x) / Σ_{j=1}^n (xj − x)²,   β̂0 = y − β̂1 x,   σ̂² = n⁻¹ Σ_{j=1}^n (yj − β̂0 − β̂1 xj)²,
and that (under the regularity conditions) the approximate distribution of the estimators for large n is
    (β̂0, β̂1, σ̂²)ᵀ ·∼ N3{(β0, β1, σ²)ᵀ, V},
where V is block diagonal, with block σ̂²(XᵀX)⁻¹ for (β̂0, β̂1) and element 2σ̂⁴/n for σ̂², and X is the n × 2 matrix whose jth row is (1, xj).
Note to Example 305
For the log likelihood we simply note that if the Y1, . . . , Yn are independent, then
    f(y; β0, β1, σ²) = Π_{j=1}^n f(yj; β0, β1, σ²) = Π_{j=1}^n (2πσ²)^{−1/2} exp{−(yj − β0 − β1 xj)²/(2σ²)},
and taking logarithms gives the stated form. Setting the derivatives of ℓ with respect to β0, β1 and σ² to zero and solving the resulting equations, the estimators turn out to be as stated.
The final result comes from using the results on slide 409, noting that the part of J(β0, β1, σ²) for (β0, β1) can be written as XᵀX/σ², and that the off-diagonal terms involving σ² are zero in J(β̂0, β̂1, σ̂²).
Results
It is difficult to interpret β0 if xj = j, as then β0 corresponds to the mean height for the year
j = 0, well before Venice was founded. Thus we take xj = j − 2000, and then β0 corresponds to
the average height in the year 2000.
We find
  Parameter      Estimate   Standard Error
  β0 (cm)        131.5      2.6
  β1 (cm/year)   0.35       0.04
  σ² (cm²)       16.8²      35.5
The fitted line seems reasonable, but there is a lot of variation around it (σ 2 is large).
(Figure: annual maximum flood level (cm) at Venice, with the fitted regression line.)
We can verify the assumption of normality by comparing the residuals r1 , . . . , rn to the normal
distribution:
(Figure: normal Q–Q plot of the residuals r1, . . . , rn.)
SIC Students, 2012
Data on the heights and weights of 8 women and 80 men for this course in 2012: is the linear relation
the same for men and for women?
(Figure: weight (kg) against height (cm) for the 88 students.)
Regression models
Let Yj denote the weight (kg) of the jth person, xj their height (cm), zj an indicator, zj = 1 if
jth person is a man, zj = 0 otherwise.
Let Yj ∼ N (µj , σ 2 ) be independent random variables with five nested models:
1. µj = β0 (same line, slope 0);
2. µj = β0 + β1 xj (same line for men and women);
3. µj = β0 zj + β2 (1 − zj ) + β1 xj (two parallel lines);
4. µj = β0 + β1 xj zj + β3 xj (1 − zj ) (same intercept, different slopes);
5. µj = β0 zj + β2 (1 − zj ) + β1 xj zj + β3 xj (1 − zj ) (two different lines).
There are exact results to compare these models in the context of analysis of variance
(ANOVA), but we can use the general ideas of likelihood as an approximation:
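For instance, a sketch of such a comparison in R (students is a hypothetical data frame with columns weight, height and the 0/1 indicator man):
  m2 <- lm(weight ~ height, data = students)          # model 2: same line
  m3 <- lm(weight ~ height + man, data = students)    # model 3: parallel lines
  m5 <- lm(weight ~ height * man, data = students)    # model 5: two different lines
  anova(m2, m3, m5)                                   # exact F tests (ANOVA)
  2 * (logLik(m5) - logLik(m2))                       # likelihood ratio statistic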
Models 1 (grey), 2 (black) and 5 (red/blue)
Left: the grey line can be improved by taking a non-zero slope (black). Right: the red line is badly
determined, but the blue one seems OK (and very similar to the black one).
Results
Estimates for model 2:
  Parameter     Estimate   Standard Error
  β0 (kg)       −71.8      21.4
  β1 (kg/cm)    0.82       0.12
  σ² (kg²)      8.46²      10.79
Evidently this cannot be extrapolated to babies, who will have negative weights, but a CI for β0 ,
even at 99%, does not contain 0.
The normality assumption seems roughly OK, according to a QQ-plot of the residuals:
(Figure: normal Q–Q plot of the residuals for model 2.)
Comments
Linearity seems reasonable in both cases. Is there a slight acceleration in the slope in Venice, which we could try to model with a polynomial such that
    E(Yj) = β0 + β1 xj + · · · + βq xj^q ?
Normality seems less reasonable in Venice(?)—since the data are the annual maxima, one of our
models for maxima (end §6.2) might be more appropriate (e.g., Gumbel distribution).
We could use these models for forecasting, but then a linear trend must be extrapolated—which
seems OK in Venice for the mid term (0–10 years?), but maybe not later.
For these models with normal observations there are exact results, but often it is enough to use the
general theory for tests, CIs etc.
There are many generalisations of regression models. For example, we could suppose that
Bernoulli variables Yj are independent with
P(Yj = 1) = e^{β0+β1 xj} / (1 + e^{β0+β1 xj}),   P(Yj = 0) = 1 / (1 + e^{β0+β1 xj}),   j = 1, . . . , n,
a logistic regression model.
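Such a model is fitted with glm (a sketch; the explanatory variable x and the 0/1 responses y are assumed given):
  fit <- glm(y ~ x, family = binomial)
  summary(fit)                       # estimates and standard errors of beta0, beta1
  predict(fit, type = "response")    # fitted probabilities P(Yj = 1)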
10 Bayesian Inference slide 435
Bayesian inference
Up to now we have supposed that all the information about θ comes from the data y. But if we have
knowledge about θ in the form of a prior density
π(θ),
we can use Bayes’ theorem to compute the posterior density for θ conditional on y, i.e.,
π(θ | y) = f(y | θ) π(θ) / f(y).
In order to do this, we have to have prior information, π(θ), which may be based on
data separate from y;
an ‘objective’ notion of what it is ‘reasonable’ to believe about θ;
a ‘subjective’ notion of what ‘I’ believe about θ.
We will reconsider π(θ) after discussion of Bayesian mechanics.
Interpretation: the knowledge that the event A has occurred changes the probabilities of the events
B1 , . . . , Bk :
P(B1 ), . . . , P(Bk ) −→ P(B1 | A), . . . , P(Bk | A).
Application of Bayes’ theorem
We suppose that the parameter θ has density π(θ), and that the conditional density of Y given θ is
f (y | θ). The joint density is
f (y, θ) = f (y | θ)π(θ),
and by Bayes’ theorem the conditional density of θ given Y = y is
π(θ | y) = f(y | θ) π(θ) / f(y),
where
    f(y) = ∫ f(y | θ) π(θ) dθ.
Bayesian updating
We can use Bayes’ theorem to update the prior density for the random variable θ to a posterior
density for θ:
π(θ)  --y-->  π(θ | y),
or equivalently
prior uncertainty  --data-->  posterior uncertainty.
We use π(θ), π(θ | y), rather than f (θ), f (θ | y), to stress that these distributions depend on
information external to the data.
π(θ | y) contains our knowledge about θ, once we have seen data y, if our initial knowledge about
θ was contained in the density π(θ).
Beta(a, b) density
Definition 306. The beta(a, b) density for θ ∈ (0, 1) has the form
π(θ) = θ^{a−1}(1 − θ)^{b−1} / B(a, b),   0 < θ < 1,   a, b > 0,
where a and b are parameters, B(a, b) = Γ(a)Γ(b)/Γ(a + b) is the beta function, and
    Γ(a) = ∫0∞ u^{a−1} e^{−u} du,   a > 0,
is the gamma function. Its mean and variance are
    E(θ) = a/(a + b),   var(θ) = ab/{(a + b + 1)(a + b)²}.
Example 308. Calculate the posterior density of θ for a sequence of Bernoulli trials, if the prior
density is Beta(a, b).
Recall that for the Beta(a, b) density,
    E(θ) = a/(a + b),   E(θ²) = a(a + 1)/{(a + b)(a + b + 1)},   var(θ) = ab/{(a + b + 1)(a + b)²}.
Note to Example 308
Suppose that conditional on θ, the data y1 , . . . , yn are a random sample from the Bernoulli
distribution, for which P(Yj = 1) = θ and P(Yj = 0) = 1 − θ, where 0 < θ < 1. The likelihood is
L(θ) = f(y | θ) = Π_{j=1}^n θ^{yj}(1 − θ)^{1−yj} = θ^s(1 − θ)^{n−s},   0 < θ < 1,
where s = Σ yj.
The posterior density of θ conditional on the data and using the beta prior density is given by Bayes’
theorem, and is
π(θ | y) = {θ^{s+a−1}(1 − θ)^{n−s+b−1}/B(a, b)} / {∫0¹ θ^{s+a−1}(1 − θ)^{n−s+b−1} dθ/B(a, b)}
         ∝ θ^{s+a−1}(1 − θ)^{n−s+b−1},   0 < θ < 1.    (10)
As this has unit integral for all positive a and b, the constant normalizing (10) must be B(a + s, b + n − s). Therefore
    π(θ | y) = θ^{s+a−1}(1 − θ)^{n−s+b−1} / B(a + s, b + n − s),   0 < θ < 1.
Thus the posterior density of θ has the same form as the prior: acquiring data has the effect of
updating (a, b) to (a + s, b + n − s). As the mean of the B(a, b) density is a/(a + b), the posterior
mean is (s + a)/(n + a + b), and this is roughly s/n in large samples. Hence the prior density inserts
information equivalent to having seen a sample of a + b observations, of which a were successes. If we were very sure that θ ≈ 1/2, for example, we might take a = b very large, giving a prior density tightly concentrated around θ = 1/2, whereas taking smaller values of a and b would increase the prior uncertainty.
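A sketch of the update in R for the sequence with n = 10, s = 9; the prior parameters a = b = 1 (a uniform prior) are an arbitrary choice:
  a <- 1; b <- 1; n <- 10; s <- 9
  curve(dbeta(x, a, b), 0, 1, ylim = c(0, 5), ylab = "density")   # prior
  curve(dbeta(x, a + s, b + n - s), add = TRUE, col = "red")      # posterior
  (a + s) / (a + b + n)                                           # posterior mean
  qbeta(c(0.025, 0.975), a + s, b + n - s)    # equi-tailed 95% credible interval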
Prior densities
(Figure: beta prior densities for (a, b) = (0.5, 0.5), (1, 1), (5, 5), (5, 10), (10, 5) and (10, 10).)
Posterior densities with n = 10, s = 9
(Figure: the corresponding posterior densities, with parameters (a + s, b + n − s) = (9.5, 1.5), (10, 2), (14, 6), and so on for the other priors.)
Posterior densities with n = 100, s = 69
(Figure: the posterior densities for n = 100, s = 69, with parameters (a + s, b + n − s) = (69.5, 31.5), (70, 32), (74, 36), and so on.)
Properties of π(θ | y)
The density contains all the information about θ, but it is useful to extract summaries, such as the posterior expectation and the posterior variance,
    E(θ | y) = ∫ θ π(θ | y) dθ,   var(θ | y) = ∫ {θ − E(θ | y)}² π(θ | y) dθ,
or the maximum a posteriori (MAP) estimate, the value of θ that maximises π(θ | y).
We could use the posterior expectation or the MAP estimator as point estimates for θ based on
the data y, but a more systematic approach is based on loss functions.
Example 309. Calculate the posterior expectation and variance of θ, and its MAP estimate, for
Example 308.
Note to Example 309
We saw earlier that the beta density with parameters a, b has
E(θ) = a/(a + b),   E(θ²) = a(a + 1)/{(a + b)(a + b + 1)},   var(θ) = ab/{(a + b + 1)(a + b)²},
so since the posterior density is beta with parameters a + s, b + n − s, the posterior mean and variance are
    E(θ | y) = (a + s)/(a + b + n),   var(θ | y) = (a + s)(b + n − s)/{(a + b + n + 1)(a + b + n)²}.
The MAP estimate is obtained by maximising the density
    π(θ | y) = θ^{s+a−1}(1 − θ)^{n−s+b−1} / B(a + s, b + n − s),   0 < θ < 1,
which (for a + s > 1 and b + n − s > 1) gives θ̃ = (s + a − 1)/(n + a + b − 2).
Definition 310. If Y ∼ f (y; θ), then the loss function R(y; θ) is a non-negative function of Y and θ.
The expected posterior loss is
E{R(y; θ) | y} = ∫ R(y; θ) π(θ | y) dθ.
Example 311. If I seek to estimate θ by minimising E {R(y; θ) | y} with respect to θ̃(y), show that
with
R1(y; θ) = |θ̃ − θ|,   R2(y; θ) = (θ̃ − θ)²,
I obtain that θ̃1(y) is the median of π(θ | y) and that θ̃2 = E(θ | y) = ∫ θ π(θ | y) dθ is the posterior expectation of θ.
Loss functions are also useful when we want to base a decision on the data: we construct R(y; θ) to represent the loss incurred when we observe y and base the decision on it while the true state of nature is θ, and we then choose the decision that minimises the expected posterior loss.
Note to Example 311
For R1(y; θ) = |θ̃ − θ|, we have

    E{R1(y; θ) | y} = E{(θ̃ − θ) I(θ̃ > θ) | y} + E{(θ − θ̃) I(θ̃ < θ) | y}
                    = ∫_{−∞}^{θ̃} (θ̃ − θ) π(θ | y) dθ + ∫_{θ̃}^{∞} (θ − θ̃) π(θ | y) dθ,

and differentiation of this with respect to θ̃ gives

    ∫_{−∞}^{θ̃} π(θ | y) dθ − ∫_{θ̃}^{∞} π(θ | y) dθ.

This equals zero when the two probabilities are the same, and then we must have

    ∫_{−∞}^{θ̃} π(θ | y) dθ = ∫_{θ̃}^{∞} π(θ | y) dθ = 1/2,

so θ̃1 ≡ θ̃1(y) is the median of the posterior density π(θ | y).
For R2(y; θ) = (θ̃ − θ)², we have on setting m(y) = E(θ | y) and using a little algebra that

    E{R2(y; θ) | y} = E[{θ̃ − m(y) + m(y) − θ}² | y]
                    = E[{θ̃ − m(y)}² | y] + 2E[{θ̃ − m(y)}{m(y) − θ} | y] + E[{m(y) − θ}² | y]
                    = {θ̃ − m(y)}² + var(θ | y):

the first term is constant with respect to the posterior distribution of θ because θ̃ and m(y) do not depend on θ, but only on the variable y, which is fixed by the conditioning; the second term is

    E[{θ̃ − m(y)}{m(y) − θ} | y] = {θ̃ − m(y)} E{m(y) − θ | y} = 0;

and the third term is var(θ | y). Hence the expected posterior loss is minimised by taking θ̃2 = m(y) = E(θ | y), the posterior expectation.
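A quick Monte Carlo illustration of these two facts, with an arbitrary Beta(10, 2) distribution standing in for π(θ | y) (a sketch of mine, not from the notes): the minimiser of the estimated expected absolute loss is close to the posterior median, and the minimiser of the estimated expected squared loss is close to the posterior mean.

    import numpy as np
    from scipy.stats import beta

    rng = np.random.default_rng(1)
    post = beta(10, 2)
    theta = post.rvs(100_000, random_state=rng)        # draws from the 'posterior'

    grid = np.linspace(0.5, 1.0, 501)                  # candidate values of theta-tilde
    l1 = [np.mean(np.abs(t - theta)) for t in grid]    # estimated expected absolute loss
    l2 = [np.mean((t - theta) ** 2) for t in grid]     # estimated expected squared loss

    print(grid[np.argmin(l1)], post.median())          # both about 0.85
    print(grid[np.argmin(l2)], post.mean())            # both about 0.83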
            n = 10    n = 30    n = 100    θ̂ ± 1.96 J(θ̂)^{−1/2}
    Lower   0.619     0.633     0.595      0.599
    Upper   0.989     0.912     0.774      0.781
Probability and Statistics for SIC slide 448
Conjugate densities
Particular combinations of data and prior densities give posterior densities of the same form as the prior densities. Example:

    θ ∼ Beta(a, b)  −→  θ | y ∼ Beta(a + s, b + n − s)   (after observing s, n),

where the data s ∼ B(n, θ) correspond to s successes out of n independent trials with success probability θ.
The beta density is the conjugate prior density of the binomial distribution: if the likelihood is
proportional to θ s (1 − θ)n−s , then choosing a beta prior for θ ensures that the posterior density of
θ is also beta, with updated parameters (a + s, b + n − s).
Conjugate prior densities are very useful, as we can often avoid having to integrate:
If we recognise π(θ | y), no need to integrate!
Example 312. Let Y1, . . . , Yn | µ be iid N(µ, σ²) and µ ∼ N(µ0, τ²), where µ0, σ² and τ² are known. Obtain the posterior distribution of µ | Y1, . . . , Yn, without integration.
Note to Example 312
Recall that the normal density with mean B and variance A has exponent

    −x²/(2A) + (B/A)x − B²/(2A),

so that we can read off A and B from the coefficients of x² and x.
Turning to this particular calculation, we seek a density for µ, so any terms not involving µ can be treated as constants. Now

    π(µ | y) ∝ f(y | µ) × π(µ)
             = (2πσ²)^{−n/2} exp{−Σ_{j=1}^n (yj − µ)²/(2σ²)} × (2πτ²)^{−1/2} exp{−(µ − µ0)²/(2τ²)},

and collecting the terms in µ in the exponent gives

    −½ {(n/σ² + 1/τ²) µ² − 2 (Σ yj/σ² + µ0/τ²) µ} + constant.

Thus

    µ | y ∼ N( (Σ yj/σ² + µ0/τ²)/(n/σ² + 1/τ²),  1/(n/σ² + 1/τ²) ).
Probability and Statistics for SIC note 1 of slide 449
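A brute-force numerical check of this result is straightforward; in the Python sketch below (mine, with illustrative values of µ0, τ, σ and n) the conjugate posterior mean and variance are compared with a grid approximation to π(µ | y).

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(2)
    mu0, tau, sigma, n = 0.0, 2.0, 1.0, 20
    y = rng.normal(1.5, sigma, size=n)                 # simulated data with true mean 1.5

    prec = n / sigma ** 2 + 1 / tau ** 2               # posterior precision
    post_mean = (y.sum() / sigma ** 2 + mu0 / tau ** 2) / prec
    post_var = 1 / prec

    # Grid approximation of the posterior, proportional to f(y | mu) * pi(mu).
    grid = np.linspace(-5, 5, 4001)
    dx = grid[1] - grid[0]
    logpost = np.array([norm.logpdf(y, m, sigma).sum() for m in grid]) + norm.logpdf(grid, mu0, tau)
    dens = np.exp(logpost - logpost.max())
    dens /= dens.sum() * dx                            # normalise numerically
    grid_mean = np.sum(grid * dens) * dx
    grid_var = np.sum((grid - grid_mean) ** 2 * dens) * dx
    print(post_mean, grid_mean)                        # the two means agree closely
    print(post_var, grid_var)                          # and so do the variances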
Example 313. Calculate the posterior distribution for another Bernoulli trial, independent of the
previous ones.
Bayesian approach
Treat each unknown (parameter θ, predictand Z, . . .) as a random variable, give it a distribution, and use Bayes' theorem to calculate its posterior distribution given any data.
We must build a more elaborate model, with prior information, but we can then treat all the
unknowns (parameters, data, missing values, etc.) on the same basis, and thus we just need to
apply probability calculus, conditioning on whatever we have observed.
Philosophical questions:
– Are we justified in using prior knowledge in this way?
– Where does this knowledge come from?
We often choose prior distributions for practical reasons (e.g., conjugate distributions) rather than
philosophical reasons.
Practical question:
– How to calculate all the integrals we need?
We often use Monte Carlo methods, and construct Markov chains whose limit distributions are
π(θ | y). This is a story for another day . . .
Probability and Statistics for SIC slide 451
NMR Data
[Figure: NMR data (left) and its wavelet decomposition coefficients by resolution level against translate (right); Daub cmpct on ext. phase N=2.]
Parsimonious representations
In many modern applications we want to extract a signal from a noisy environment:
finding the combination of genes leading to an illness;
cleaning a biomedical image;
denoising a download;
detecting spam.
We often search for a parsimonious representation of the signal, with many zero elements.
Probability and Statistics for SIC slide 454
Orthogonal transformation
Original data X with noisy signal µ (n × 1): X ∼ Nn(µ, σ² In).
Suppose Y = Wᵀ X, where the n × n matrix W satisfies Wᵀ W = W Wᵀ = In, i.e., W is orthogonal.
Choose W so that θ = Wᵀ µ has many small elements and a few big ones, giving a sparse representation of µ in the basis corresponding to W;
'kill' the small coefficients of Y, which correspond to the noise, giving an estimate θ̂ of θ, and then estimate the signal by µ̂ = W θ̂.
Wavelet decomposition
A good choice of W is based on wavelets, which have the required sparseness properties.
Here are the Haar wavelet coefficients, with n = 8:
    1   1   1   0   1   0   0   0
    1   1   1   0  −1   0   0   0
    1   1  −1   0   0   1   0   0
    1   1  −1   0   0  −1   0   0
    1  −1   0   1   0   0   1   0
    1  −1   0   1   0   0  −1   0
    1  −1   0  −1   0   0   0   1
    1  −1   0  −1   0   0   0  −1

We set up W so that each column of this orthogonal matrix has unit norm, i.e., we post-multiply this matrix by

    {diag(√8, √8, 2, 2, √2, √2, √2, √2)}^{−1},

to ensure that W Wᵀ = Wᵀ W = I8.
Probability and Statistics for SIC slide 456
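The following Python sketch (my own illustration) builds this 8 × 8 matrix, rescales its columns as just described, verifies that the resulting W is orthogonal, and runs the 'kill small coefficients' recipe of the previous slide on a toy signal; the signal, noise level and threshold are arbitrary choices.

    import numpy as np

    H = np.array([
        [1,  1,  1,  0,  1,  0,  0,  0],
        [1,  1,  1,  0, -1,  0,  0,  0],
        [1,  1, -1,  0,  0,  1,  0,  0],
        [1,  1, -1,  0,  0, -1,  0,  0],
        [1, -1,  0,  1,  0,  0,  1,  0],
        [1, -1,  0,  1,  0,  0, -1,  0],
        [1, -1,  0, -1,  0,  0,  0,  1],
        [1, -1,  0, -1,  0,  0,  0, -1],
    ], dtype=float)

    W = H / np.sqrt((H ** 2).sum(axis=0))     # divide columns by their norms: sqrt(8), sqrt(8), 2, 2, sqrt(2), ...
    print(np.allclose(W.T @ W, np.eye(8)), np.allclose(W @ W.T, np.eye(8)))   # True True

    # Toy denoising: smooth signal + noise, transform, kill small coefficients, transform back.
    rng = np.random.default_rng(4)
    mu = np.sin(np.linspace(0, np.pi, 8))                # smooth 'signal'
    x = mu + rng.normal(0, 0.1, size=8)                  # noisy observation
    theta_hat = W.T @ x                                  # coefficients in the wavelet basis
    theta_hat[np.abs(theta_hat) < 0.2] = 0               # 'kill' the small coefficients (crude threshold)
    mu_hat = W @ theta_hat                               # reconstructed signal estimate
    print(np.round(mu_hat - mu, 2))                      # reconstruction error after thresholding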
For a single coefficient, suppose that y | θ ∼ N(θ, σ²), and that a priori θ = 0 with probability 1 − p and θ ∼ N(0, τ²) with probability p. Then the posterior distribution of θ is a mixture: θ = 0 with probability 1 − p_y, and, given that θ ≠ 0, θ | y ∼ N(ay, b²), where

    a = τ²/(τ² + σ²),   b² = 1/(1/σ² + 1/τ²),

and

    p_y = p (σ² + τ²)^{−1/2} φ{y/(σ² + τ²)^{1/2}} / [ (1 − p) σ^{−1} φ(y/σ) + p (σ² + τ²)^{−1/2} φ{y/(σ² + τ²)^{1/2}} ]

is the posterior probability that θ ≠ 0.
Bayesian shrinkage
To estimate θ, we use loss function |θ̃ − θ|, so θ̃ is the posterior median of θ.
Here are the cumulative distribution functions of θ: prior (left), and posterior when p = 0.5, σ = τ = 1, for y = −2.5 (centre) and y = −1 (right).
If y is close to zero, then θ̃ = 0, but if not, then 0 < |θ̃| < |y|, so θ̃ shrinks y towards zero.
Lines: probability=0.5 (red); value of y (blue); posterior median θ̃ (green).
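The posterior median can be computed in closed form by inverting this mixture distribution function; the Python sketch below (mine, using the notation a, b², p_y introduced above) reproduces the medians −0.98 and 0 quoted in the panel titles.

    import numpy as np
    from scipy.stats import norm

    def posterior_median(y, p=0.5, sigma=1.0, tau=1.0):
        s2 = sigma ** 2 + tau ** 2
        a = tau ** 2 / s2                                  # shrinkage factor
        b = np.sqrt(1 / (1 / sigma ** 2 + 1 / tau ** 2))   # posterior sd of theta given theta != 0
        slab = p * norm.pdf(y, scale=np.sqrt(s2))
        p_y = slab / ((1 - p) * norm.pdf(y, scale=sigma) + slab)   # P(theta != 0 | y)

        F0 = p_y * norm.cdf((0 - a * y) / b)               # posterior CDF just below zero
        if F0 >= 0.5:                                      # median lies below zero
            return a * y + b * norm.ppf(0.5 / p_y)
        if F0 + (1 - p_y) >= 0.5:                          # the atom at zero covers the median
            return 0.0
        return a * y + b * norm.ppf((0.5 - (1 - p_y)) / p_y)   # median lies above zero

    print(posterior_median(-2.5))   # about -0.98, as in the central panel
    print(posterior_median(-1.0))   # 0, as in the right-hand panel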
[Figure: CDFs of θ. Left: prior. Centre: posterior for y = −2.5, posterior median −0.98. Right: posterior for y = −1, posterior median 0.]
[Figure: wavelet coefficients by resolution level (two panels).]
Probability and Statistics for SIC slide 460
[Figure: y (left) and wavelet reconstruction wr(w) (right), plotted against index 0–1000.]
Spam filter
We wish to construct a spam filter, using the presence or absence of certain email characteristics
C1 , . . . , Cm .
The data Y are of the form
          S    C1   C2   · · ·   Cm
    1     0    1    1    · · ·   1
    2     1    0    1    · · ·   0
    ...
    n     0    0    0    · · ·   0
where S = 1 for a spam, and Ci = 1 if characteristic i (e.g., the word ‘Nigeria’, Russian language,
hotmail address) is present.
Simple model:
P(S = 1) = p, P(S = 0) = 1 − p,
P(Ci = 1 | S = 1) = αi , P(Ci = 0 | S = 1) = 1 − αi ,
P(Ci = 1 | S = 0) = βi , P(Ci = 0 | S = 0) = 1 − βi ,
and the C1 , . . . , Cm are independent, given the value of S.
Spam filter
For a new email with characteristics C1⁺, . . . , Cm⁺ but without S⁺, we calculate

    P(S⁺ = 1 | C1⁺, . . . , Cm⁺, Y),
and quarantine the email if this probability exceeds some threshold d ∈ (0, 1).
If we write θ = (p, α1, . . . , αm, β1, . . . , βm), and if we suppose that a priori

    p, α1, . . . , αm, β1, . . . , βm ∼ U(0, 1), independently,

then the posterior density π(θ | Y) factorises into independent beta densities for p, the αi and the βi.
Spam filter
With the new characteristics C⁺ = (C1⁺, . . . , Cm⁺), we want to calculate

    P(S⁺ = 1 | C⁺, Y) = P(S⁺ = 1, C⁺ | Y) / P(C⁺ | Y)
                      = P(S⁺ = 1, C⁺ | Y) / {P(S⁺ = 0, C⁺ | Y) + P(S⁺ = 1, C⁺ | Y)},

where

    P(S⁺ = s⁺, C⁺ | Y) = ∫ P(S⁺ = s⁺, C⁺ = c⁺ | θ, y) π(θ | y) dθ.
Spam filter
Thus P(S⁺ = s⁺, C⁺ | Y) equals

    {B(1 + s + s⁺, 2 + n − s − s⁺) / B(1 + s, 1 + n − s)} × ∏_{i=1}^m {B(1 + t⁺_{i1}, 1 + t⁺_{i2}) B(1 + t⁺_{i3}, 1 + t⁺_{i4})} / {B(1 + t_{i1}, 1 + t_{i2}) B(1 + t_{i3}, 1 + t_{i4})}.
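As an illustration, here is a Python sketch of the resulting classifier (my own code, using scipy.special.betaln for numerical stability). It takes t_{i1}, . . . , t_{i4} to be the numbers of training emails with (S = 1, Ci = 1), (S = 1, Ci = 0), (S = 0, Ci = 1) and (S = 0, Ci = 0), and the t⁺ counts to be the same quantities once the new email with label s⁺ is included; this reading of the notation, and the synthetic data at the end, are my assumptions rather than part of the notes.

    import numpy as np
    from scipy.special import betaln

    def log_joint(s_plus, c_plus, S, C):
        # log P(S+ = s_plus, C+ = c_plus | Y), up to the common normalising constant
        n, s = len(S), S.sum()
        out = betaln(1 + s + s_plus, 2 + n - s - s_plus) - betaln(1 + s, 1 + n - s)
        for i in range(C.shape[1]):
            t1, t2 = np.sum((S == 1) & (C[:, i] == 1)), np.sum((S == 1) & (C[:, i] == 0))
            t3, t4 = np.sum((S == 0) & (C[:, i] == 1)), np.sum((S == 0) & (C[:, i] == 0))
            u1, u2 = t1 + s_plus * c_plus[i], t2 + s_plus * (1 - c_plus[i])
            u3, u4 = t3 + (1 - s_plus) * c_plus[i], t4 + (1 - s_plus) * (1 - c_plus[i])
            out += (betaln(1 + u1, 1 + u2) - betaln(1 + t1, 1 + t2)
                    + betaln(1 + u3, 1 + u4) - betaln(1 + t3, 1 + t4))
        return out

    def prob_spam(c_plus, S, C):
        l1, l0 = log_joint(1, c_plus, S, C), log_joint(0, c_plus, S, C)
        return 1 / (1 + np.exp(l0 - l1))          # P(S+ = 1 | C+, Y)

    # Tiny synthetic check (not the simulation reported below), with m = 2 characteristics.
    rng = np.random.default_rng(3)
    S = rng.binomial(1, 0.8, size=100)
    C = np.column_stack([rng.binomial(1, np.where(S == 1, a, b)) for a, b in [(0.9, 0.2), (0.7, 0.3)]])
    print(prob_spam(np.array([1, 1]), S, C))      # high probability of spam
    print(prob_spam(np.array([0, 0]), S, C))      # lower probability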
Comments
The key assumption that the C1 , . . . , Cm are independent, given S, is probably false, but maybe
not too damaging — often idiot’s Bayes works quite well.
Simulations with p = 0.8, n = 100 emails whose S and C are known, and 1000 new emails for
which only C + is known.
Here an email is classified as spam if P(S⁺ = 1 | C⁺, Y) exceeds the threshold d.
From 180 good emails, 141 are misclassified with m = 2, whereas only 1 is misclassified with
m = 20.
m = 2:
    True \ Classified    Spam    Good    Total
    Spam                  761      59      820
    Good                  141      39      180

m = 20:
    True \ Classified    Spam    Good    Total
    Spam                  810      10      820
    Good                    1     179      180
Comments
Bayesian ideas provide an approach that integrates the treatment of uncertainty and modelling,
with which we can tackle very complex problems.
The main philosophical difficulty is the status of the prior information.
The main practical difficulty is the need to calculate many complex multidimensional integrals.
Probability and Statistics for SIC slide 467