
Probability and Statistics for SIC

A. C. Davison © 2018

http://stat.epfl.ch

1 Introduction 2

1.1 Motivation 3

1.2 Preliminaries 18

1.3 Combinatorics 26

2 Probability 36

2.1 Probability Spaces 38

2.2 Conditional Probability 62

2.3 Independence 70

2.4 Edifying Examples 78

3 Random Variables 85

3.1 Basic Ideas 87

3.2 Expectation 112

3.3 Conditional Probability Distributions 121

3.4 Notions of Convergence 125

4 Continuous Random Variables 135

4.1 Basic Ideas 136

4.2 Further Ideas 149

4.3 Normal Distribution 153

4.4 Q-Q Plots 167

5 Several Random Variables 174

5.1 Basic Notions 176

5.2 Dependence 190

5.3 Generating Functions 201

5.4 Multivariate Normal Distribution 213

5.5 Transformations 223

5.6 Order Statistics 231

6 Approximation and Convergence 234

6.1 Inequalities 236

6.2 Convergence 239

6.3 Laws of Large Numbers 250

6.4 Central Limit Theorem 255

6.5 Delta Method 261

7 Exploratory Statistics 265

7.1 Introduction 266

7.2 Data 275

7.3 Graphs 279

7.4 Numerical Summaries 296

7.5 Boxplot 306

7.6 Choice of a Model 312

8 Statistical Inference 318

8.1 Introduction 319

8.2 Point Estimation 324

8.3 Interval Estimation 337

8.4 Hypothesis Tests 352

8.5 Comparison of Tests 379

9 Likelihood 387

9.1 Motivation 388

9.2 Scalar Parameter 396

9.3 Vector Parameter 408

9.4 Statistical Modelling 414

9.5 Linear Regression 422

10 Bayesian Inference 435

10.1 Basic Ideas 436

10.2 Bayesian Modelling 452

1 Introduction slide 2

1.1 Motivation slide 3


Motivation
Probability and statistics provide the mathematical tools and models for the study of ‘random’ events:
 weather forecasts, finance/economics (Nobel prizes, 2003, 2011), . . .;
 network modelling;
 stochastic algorithms;
 internet traffic;
 errors in signal coding;
 image processing;
 ...
Statistical ideas provide optimal methods for prediction, noise reduction, signal processing, and the
reconstruction of a true signal or image.

Probability and Statistics for SIC slide 4

Stochastic networks

Erdős–Rényi graph (1960), with p = 0.01. The edge between each pair of vertices appears with
probability p, independently of the other edges. In this case, if p > (1 + ε) log n/n for some ε > 0, the graph will
be connected (almost surely, as n → ∞).
(Source: Wikipedia)

Probability and Statistics for SIC slide 5

‘Giant component’

Erdős–Rényi graph (1960), with n = 150, p = 0.01. If np → c > 1 as n → ∞, then there
is (almost surely) a connected subgraph containing a positive fraction of the vertices. No other
component contains more than O(log n) of the vertices.
(Source: www.cs.purdue.edu)

Probability and Statistics for SIC slide 6

Stochastic networks II
Chain network Nearest-neighbour network Scale-free network

Guo et al. (2011, Biometrika)

Probability and Statistics for SIC slide 7

Modeling of webpages as networks

[Figure: Fig. 3 of Guo et al. (2011). Common structure in the webpages data. Panel (a) shows the estimated common structure for the four categories. The nodes represent 100 terms with the highest log-entropy weights. The area of the circle representing a node is proportional to its log-entropy weight. The width of an edge is proportional to the magnitude of the associated partial correlation. Panels (b)–(d) show subgraphs extracted from the graph in panel (a).]

Guo et al. (2011, Biometrika)

Probability and Statistics for SIC slide 8

Randomized algorithms

(Source: Cambridge University Press)

Probability and Statistics for SIC slide 9

Signal processing
[Figure: NMR data (left) and wavelet decomposition coefficients (right), plotted by resolution level and translate; Daubechies compactly supported wavelet, extremal phase, N = 2.]

Data and coefficients of an orthogonal transformation

Probability and Statistics for SIC slide 10

Signal processing
[Figure: original (left) and shrunken (right) wavelet coefficients, plotted by resolution level and translate; Daubechies compactly supported wavelet, extremal phase, N = 2.]

Original and ‘thresholded’ coefficients

Probability and Statistics for SIC slide 11

Signal processing
[Figure: NMR data (left) and Bayesian posterior median reconstruction (right).]

Data and signal reconstructed using a statistical method

Probability and Statistics for SIC slide 12

Video data
[Figure: videoVBR plotted against Time (frame index).]

Amount of coded information (Variable Bit Rate) per frame for a certain video sequence. There were
about 25 frames per second.

Probability and Statistics for SIC slide 13

Time series

[Figure: two panels, Number and Value of transactions plotted against Time (2010.0–2011.0).]

Number and value of transactions (arbitrary units) every hour for mobile phones, 2010.

Probability and Statistics for SIC slide 14

Practical motivation
A lot of later courses rely on probability and statistics:
 Applied data analysis (West)
 Automatic speech processing (Bourlard)
 Biomedical signal processing (Vesin)
 Stochastic models in communication (Thiran)
 Machine learning (Jaggi/Urbanke)
 Performance evaluation (Le Boudec)
 Signal processing for communications (Prandoni)
 Principles of digital communications (Telatar)
 ...
Probability and Statistics for SIC slide 15

Organisation
 Lecturer: Professor A. C. Davison
 Assistants: see moodle page or information sheet
 Lectures: Monday 14.15–16.00, CE6; Tuesday, 13.15–15.00, CE1
 Exercises: Monday 16.15–18.00, CE6
 Distinction between
– exercises (solutions available) and
– problems (solutions posted later)
 Test: 16th April, 16.15–18.00, no written notes (simple calculator allowed)
 Quizzes: 15-minute quizzes on 5 and 19 March, 9 and 30 April, 14 May, no written notes (simple
calculator allowed)
 Course material (including Random Exercise Generator): see moodle page for the course.

Probability and Statistics for SIC slide 16

Course material
Probability constitutes roughly the first 60% of the course, and a good book is
 Ross, S. M. (2007) Initiation aux Probabilités. PPUR: Lausanne.
 Ross, S. M. (2012) A First Course in Probability, 9th edition. Pearson: Essex.

Statistics comprises roughly the last 40% of the course. Possible books are
 Davison, A. C. (2003). Statistical Models. Cambridge University Press. Sections 2.1, 2.2; 3.1, 3.2;
4.1–4.5; 7.3.1; 11.1.1, 11.2.1.
 Morgenthaler, S. (2007) Introduction à la Statistique. PPUR: Lausanne.
 Wild, C. & Seber, G. A. F. (2000). Chance Encounters: A First Course in Data Analysis and
Statistics. John Wiley & Sons: New York.
 Helbling, J.-M. & Nuesch, P. (2009). Probabilités et Statistique (polycopie).

There are many excellent introductory books on both topics; look in the Rolex Learning Centre.

Probability and Statistics for SIC slide 17

1.2 Preliminary ideas slide 18

Sets
Definition 1. A set A is an unordered collection of objects, x1, . . . , xn, . . .:

A = {x1 , . . . , xn , . . .} .

We write x ∈ A to say that ‘x is an element of A’, or ‘x belongs to A’. The collection of all possible
objects in a given context is called the universe Ω.
An ordered set is written A = (1, 2, . . .). Thus {1, 2} = {2, 1}, but (1, 2) ≠ (2, 1).

Examples:

CH = {Geneva, Vaud, . . . , Grisons} set of Swiss cantons


{0, 1} = finite set made up of the elements 0 and 1
N = {1, 2, . . .}, positive integers, countable set
Z = {. . . , −1, 0, 1, 2, . . .}, integers, countable set
R = real numbers, uncountable set
∅ = { } empty set, has no elements

Probability and Statistics for SIC slide 19

Subsets
Definition 2. A set A is a subset of a set B if x ∈ A implies that x ∈ B: we write A ⊂ B.

 If A ⊂ B and B ⊂ A, then every element of A is contained within B and vice versa, thus A = B:
both sets contain exactly the same elements.
 Note that ∅ ⊂ A for every set A. Thus,

∅ ⊂ {1, 2, 3} ⊂ N ⊂ Z ⊂ Q ⊂ R ⊂ C, ∅⊂I⊂C

 Venn diagrams are useful for grasping elementary relations between sets, but they can be
deceptive (not all relations can be represented).

Probability and Statistics for SIC slide 20

Cardinality of a set
Definition 3. A finite set A has a finite number of elements, and this number is called its cardinality:

card A, #A, |A|.

 Evidently |∅| = 0 and |{0, 1}| = 2


 Exercise: Show that if A and B are finite and A ⊂ B, then |A| ≤ |B|.

Probability and Statistics for SIC slide 21

Boolean operations
Definition 4. Let A, B ⊂ Ω. Then we can define
 the union and the intersection of A and B to be

A ∪ B = {x ∈ Ω : x ∈ A or x ∈ B} , A ∩ B = {x ∈ Ω : x ∈ A and x ∈ B} ;

 the complement of A in Ω to be A^c = {x ∈ Ω : x ∉ A}.

Evidently A ∩ B ⊂ A ∪ B, and if the sets are finite, then

|A| + |B| = |A ∩ B| + |A ∪ B|, |A| + |Ac | = |Ω|.

We can also define the difference between A and B to be

A \ B = A ∩ B^c = {x ∈ Ω : x ∈ A and x ∉ B},

(note that A \ B ≠ B \ A), and the symmetric difference

A △ B = (A \ B) ∪ (B \ A).

Probability and Statistics for SIC slide 22

Boolean operations
If {A_j}_{j=1}^∞ is an infinite collection of subsets of Ω, then

⋃_{j=1}^∞ A_j = A_1 ∪ A_2 ∪ · · · : those x ∈ Ω that belong to at least one A_j;
⋂_{j=1}^∞ A_j = A_1 ∩ A_2 ∩ · · · : those x ∈ Ω that belong to every A_j.

The following results are easy to show (e.g., using Venn diagrams):
 (A^c)^c = A, (A ∪ B)^c = A^c ∩ B^c, (A ∩ B)^c = A^c ∪ B^c;
 A ∩ (B ∪ C) = (A ∩ B) ∪ (A ∩ C), A ∪ (B ∩ C) = (A ∪ B) ∩ (A ∪ C);
 (⋃_{j=1}^∞ A_j)^c = ⋂_{j=1}^∞ A_j^c, (⋂_{j=1}^∞ A_j)^c = ⋃_{j=1}^∞ A_j^c.

Probability and Statistics for SIC slide 23

Partition
Definition 5. A partition of Ω is a collection of nonempty subsets A1 , . . . , An in Ω such that
 the Aj are exhaustive, i.e., A1 ∪ · · · ∪ An = Ω, and
 the Aj are disjoint, i.e., Ai ∩ Aj = ∅, for i ≠ j.
A partition can also be composed of an infinite collection of sets {A_j}_{j=1}^∞.

Example 6. Let Aj = [j, j + 1), for j = . . . , −1, 0, 1, . . .. Do the Aj partition Ω = R?

Example 7. Let Aj be the set of integers that can be divided by j, for j = 1, 2, . . .. Do the Aj
partition Ω = N?

Probability and Statistics for SIC slide 24

Note to Example 6
Obviously, Aj ∩ Ai = ∅ if i ≠ j. Moreover any real number x lies in A_{⌊x⌋}, where ⌊x⌋ is the largest
integer less than or equal to x. Therefore these sets partition R.

Probability and Statistics for SIC note 1 of slide 24

Note to Example 7
Note that 6 ∈ A2 ∩ A3 , so these sets do not partition N.

Probability and Statistics for SIC note 2 of slide 24

Cartesian product
Definition 8. The Cartesian product of two sets A, B is the set of ordered pairs

A × B = {(a, b) : a ∈ A, b ∈ B}.

In the same way


A1 × · · · × An = {(a1, . . . , an) : a1 ∈ A1, . . . , an ∈ An}.

If A1 = · · · = An = A, then we write A1 × · · · × An = A^n.

As the pairs are ordered, A × B ≠ B × A unless A = B.


If A1 , . . . , An are all finite, then

|A1 × · · · × An | = |A1 | × · · · × |An |.

Example 9. Let A = {a, b}, B = {1, 2, 3}. Describe A × B.

Probability and Statistics for SIC slide 25

Note to Example 9
{(a, 1), (a, 2), . . . , (b, 3)}.
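
As a quick check of the product rule |A × B| = |A| × |B|, here is a minimal Python sketch for the sets of Example 9 (itertools is used only for illustration):

```python
from itertools import product

A = {"a", "b"}
B = {1, 2, 3}

# Cartesian product as a set of ordered pairs (a, b)
AxB = set(product(A, B))

print(sorted(AxB))                   # [('a', 1), ('a', 2), ('a', 3), ('b', 1), ('b', 2), ('b', 3)]
print(len(AxB) == len(A) * len(B))   # True: |A x B| = 2 * 3 = 6
```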

Probability and Statistics for SIC note 1 of slide 25

1.3 Combinatorics slide 26

Combinatorics: Reminders
Combinatorics is the mathematics of counting. Two basic principles:
 multiplication: if I have m hats and n scarves, there are m × n different ways of wearing both a
hat and a scarf;
 addition: if I have m red hats and n blue hats, then I have m + n hats in total.
In mathematical terms: if A1 , . . . , Ak are sets, then

|A1 × · · · × Ak | = |A1 | × · · · × |Ak |, (multiplication),

and if the Aj are disjoint, then

|A1 ∪ · · · ∪ Ak | = |A1 | + · · · + |Ak |, (addition).

Probability and Statistics for SIC slide 27

Permutations: Ordered selection
Definition 10. A permutation of n distinct objects is an ordered set of those objects.

Theorem 11. Given n distinct objects, the number of different permutations (without repetition) of
length r ≤ n is

n(n − 1)(n − 2) · · · (n − r + 1) = n!/(n − r)!.

Thus there are n! permutations of length n.

Theorem 12. Given n = n_1 + · · · + n_r objects of r different types, where n_i is the number of objects of
type i that are indistinguishable from one another, the number of permutations (without repetition) of
the n objects is

n!/(n_1! n_2! · · · n_r!).
Probability and Statistics for SIC slide 28

Example
Example 13. A class of 20 students choose a committee of size 4 to organise a ‘voyage d’études’. In
how many different ways can they pick the committee if:
(a) there are 4 distinct roles (president, secretary, treasurer, travel agent)?
(b) there is one president, one treasurer, and two travel agents?
(c) there are two treasurers and two travel agents?
(d) their roles are indistinguishable?

Probability and Statistics for SIC slide 29

Note to Example 13
(a) First choose the president, then the secretary, etc., giving 20 × 19 × 18 × 17 = 116280.
This is the number of permutations of length 4 in a group of size 20.
(b) 20 × 19 × 18 × 17/2! = 58140.
(c) 20 × 19 × 18 × 17/(2!2!) = 29070.
(d) The first could have been chosen in 20 ways, the second in 19, etc. But the final group of four
could be elected in 4! orders, so the number of ways is 20 × 19 × 18 × 17/4! = 4845.
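
A short Python check of these four counts; a minimal sketch (Python ≥ 3.8 for math.perm and math.comb):

```python
from math import comb, perm

n = 20
print(perm(n, 4))             # (a) 20*19*18*17 = 116280 ordered choices
print(perm(n, 4) // 2)        # (b) two interchangeable travel agents: 58140
print(perm(n, 4) // (2 * 2))  # (c) two treasurers and two travel agents: 29070
print(comb(n, 4))             # (d) roles indistinguishable: 4845
```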

Probability and Statistics for SIC note 1 of slide 29

Multinomial and binomial coefficients


Definition 14. Let n_1, . . . , n_r be integers in 0, 1, . . . , n, having total n_1 + · · · + n_r = n. Then

\binom{n}{n_1, n_2, . . . , n_r} = n!/(n_1! n_2! · · · n_r!)

is called the multinomial coefficient.

The most common case arises when r = 2:

\binom{n}{k} = n!/{k!(n − k)!}   (written C_n^k in certain books)

is called the binomial coefficient.


Probability and Statistics for SIC slide 30

Combinations: unordered selection
Theorem 15. The number of ways of choosing a set of r objects from a set of n distinct objects
without repetition is

n!/{r!(n − r)!} = \binom{n}{r}.

Theorem 16. The number of ways of distributing n distinct objects into r distinct groups of size
n_1, . . . , n_r, where n_1 + · · · + n_r = n, is

n!/(n_1! n_2! · · · n_r!).
Probability and Statistics for SIC slide 31

Properties of binomial coefficients


Theorem 17. If n, m ∈ {1, 2, 3, . . .} and r ∈ {0, . . . , n}, then:

\binom{n}{r} = \binom{n}{n − r};

\binom{n + 1}{r} = \binom{n}{r − 1} + \binom{n}{r}, (Pascal's triangle);

Σ_{j=0}^{r} \binom{m}{j} \binom{n}{r − j} = \binom{m + n}{r}, (Vandermonde's formula);

(a + b)^n = Σ_{r=0}^{n} \binom{n}{r} a^r b^{n − r}, (Newton's binomial theorem);

(1 − x)^{−n} = Σ_{j=0}^{∞} \binom{n + j − 1}{j} x^j, |x| < 1, (negative binomial series);

lim_{n→∞} n^{−r} \binom{n}{r} = 1/r!, r ∈ N.

Probability and Statistics for SIC slide 32

Note to Theorem 17
 The numbers of ways of choosing r objects from n is the same as the number of ways of choosing
n − r objects from n.
 To choose r objects from n + 1, we first designate one of the n + 1. Then if that object is in the
sample, we must choose r − 1 from among the other n, and if not, we must choose r from the n,
which gives the result.
 Suppose I have n blue hats and m red hats. Then the number of ways I can choose r hats from all
my hats equals the number of ways I can choose j red hats and r − j blue hats, summed over the
possible choices of j.
 The binomial results are standard.
 For the last part, with r fixed, we have

n^{−r} \binom{n}{r} = n(n − 1) · · · (n − r + 1)/(n^r r!) → 1/r!, n → ∞.
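
A small numerical sanity check of Pascal's triangle, Vandermonde's formula and the limit n^{−r} \binom{n}{r} → 1/r!; a sketch only, with arbitrarily chosen m, n and r:

```python
from math import comb, factorial

m, n, r = 5, 8, 4

# Pascal's triangle
assert comb(n + 1, r) == comb(n, r - 1) + comb(n, r)

# Vandermonde's formula
assert comb(m + n, r) == sum(comb(m, j) * comb(n, r - j) for j in range(r + 1))

# n^{-r} * binom(n, r) approaches 1/r! as n grows
for N in (10, 100, 10000):
    print(N, comb(N, r) / N**r, 1 / factorial(r))
```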

Probability and Statistics for SIC note 1 of slide 32

Partitions of integers
Theorem 18. (a) The number of distinct vectors (n_1, . . . , n_r) of positive integers, n_1, . . . , n_r > 0,
satisfying n_1 + · · · + n_r = n, is

\binom{n − 1}{r − 1}.

(b) The number of distinct vectors (n_1, . . . , n_r) of non-negative integers, n_1, . . . , n_r ≥ 0, satisfying
n_1 + · · · + n_r = n, is

\binom{n + r − 1}{n}.
Example 19. How many different ways are there to put 6 identical balls in 3 boxes, in such a way that
each box contains at least one ball?

Example 20. How many different ways are there to put 6 identical balls into 3 boxes?

Probability and Statistics for SIC slide 33

Note to Theorem 18
(a) Line up the n balls, and note that there are n − 1 spaces between them. You must choose r − 1
out of these n − 1 spaces to place these separators, giving the stated formula.
(b) Line up the n balls and the r − 1 separators. Any distinct configurations of these n + r − 1 objects
will correspond to a different partition, so the number of these partitions is the number of ways the
balls and separators can be ordered, and this is the stated formula.

Probability and Statistics for SIC note 1 of slide 33

Note to Example 19
We have a total of n = 6 balls and r = 3 groups, each of which must have at least one member, so
the number is

\binom{6 − 1}{3 − 1} = 5!/(3! 2!) = 10.

Probability and Statistics for SIC note 2 of slide 33

Note to Example 20
Now there is the possibility of empty boxes, so the total number is

\binom{6 + 3 − 1}{6} = 8!/(6! 2!) = 28.

Thus there are 18 ways to get at least one empty box.
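
A brute-force enumeration confirming the counts 10 and 28; a minimal sketch listing all ordered ways of writing 6 as a sum of three non-negative integers:

```python
from itertools import product

n, r = 6, 3  # 6 identical balls, 3 boxes

compositions = [v for v in product(range(n + 1), repeat=r) if sum(v) == n]
positive = [v for v in compositions if all(x > 0 for x in v)]

print(len(positive))      # 10 = binom(n-1, r-1) = binom(5, 2)
print(len(compositions))  # 28 = binom(n+r-1, n) = binom(8, 6)
```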

Probability and Statistics for SIC note 3 of slide 33

Reminder: Some series
Theorem 21. (a) A geometric series is of the form a, aθ, aθ^2, . . .; we have

Σ_{i=0}^{n} a θ^i = a(1 − θ^{n+1})/(1 − θ) if θ ≠ 1, and a(n + 1) if θ = 1.

If |θ| < 1, then Σ_{i=0}^{∞} θ^i = 1/(1 − θ), and

Σ_{i=r}^{∞} i!/{r!(i − r)!} θ^{i−r} = 1/(1 − θ)^{r+1}, r = 1, 2, . . . .

(b) The exponential series

exp(x) = Σ_{n=0}^{∞} x^n/n!

converges absolutely for all x ∈ C.
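
A quick numerical check of these series; a sketch with arbitrarily chosen θ, r and x, truncating the infinite sums at a large finite number of terms:

```python
from math import comb, exp, factorial

theta, r, x = 0.3, 2, 1.7

# Geometric series: sum_{i=0}^{n} theta^i = (1 - theta^{n+1}) / (1 - theta)
n = 20
print(sum(theta**i for i in range(n + 1)), (1 - theta**(n + 1)) / (1 - theta))

# Negative binomial series: sum_{i>=r} C(i, r) theta^{i-r} = (1 - theta)^{-(r+1)}
print(sum(comb(i, r) * theta**(i - r) for i in range(r, 200)), (1 - theta)**(-(r + 1)))

# Exponential series: sum_{k>=0} x^k / k! = exp(x)
print(sum(x**k / factorial(k) for k in range(50)), exp(x))
```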

Probability and Statistics for SIC slide 34

Small lexicon
Mathematics English Français
Ω, A, B . . . set ensemble
A∪B union union
A∩B intersection intersection
Ac complement of A (in Ω) complémentaire de A (en Ω)
A\B difference différence
A∆B symmetric difference différence symétrique
A×B Cartesian product produit cartésien
|A| cardinality cardinal
{A_j}_{j=1}^n pairwise disjoint {A_j}_{j=1}^n disjoint deux à deux
partition partition
permutation permutation
combination combinaison
\binom{n}{r} binomial coefficient coefficient binomial (C_n^r)
\binom{n}{n_1, . . . , n_r} multinomial coefficient coefficient multinomial
indistinguishable indifférentiable
colour-blind daltonien (ienne)

Probability and Statistics for SIC slide 35

2 Probability slide 36

Small probabilistic lexicon

Mathematics English Français


one fair die (several fair dice) dé juste/équilibré (plusieurs dés justes/équilibrés)
random experiment expérience aléatoire
Ω sample space ensemble fondamental
ω outcome, elementary event épreuve, événement élémentaire
A, B, . . . event événement
F event space espace des événements
sigma-algebra tribu
P probability distribution/probability function loi de probabilité
(Ω, F, P) probability space espace de probabilité
inclusion-exclusion formula formule d’inclusion-exclusion
P(A | B) probability of A given B probabilité de A sachant B
independence indépendance
(mutually) independent events événements (mutuellement) indépendants
pairwise independent events événements indépendants deux à deux
conditionally independent events événements conditionellement indépendants

Probability and Statistics for SIC slide 37

2.1 Probability Spaces slide 38

The Card Players

Paul Cézanne, 1894–95, Musée d’Orsay, Paris

Probability and Statistics for SIC slide 39

Motivation: Game of dice
We throw two fair dice, one red and one green.
 (a) What is the set of possible results?
 (b) Which results give a total of 6?
 (c) Which results give a total of 12?
 (d) Which results give an odd total?
 (e) What are the probabilities of the events (b), (c), (d)?

Probability and Statistics for SIC slide 40

Calculation of probabilities
 We can try to calculate the probabilities of events such as (b), (c), (d) by throwing the dice
numerous times and letting

probability of an event = (# of times the event takes place)/(# of experiments carried out).

This is an empirical rather than a mathematical answer, to be reached only after a lot of work
(how many times should we roll the dice?), and it will yield different answers each time—not
satisfactory!
 For simple examples, we often use symmetry to calculate probabilities. This isn’t possible for more
complicated cases—we construct mathematical models, based on notions of
– random experiments
– probability spaces.

Probability and Statistics for SIC slide 41

Random experiment
Definition 22. A random experiment is an ‘experiment’ whose result is (or can be defined as)
random.

Example 23. I toss a coin.

Example 24. I roll 2 fair dice, one red and one green.

Example 25. The number of emails I receive today.

Example 26. The waiting time until the end of this lecture.

Example 27. The weather here tomorrow at midday.

Probability and Statistics for SIC slide 42

Andrey Nikolaevich Kolmogorov (1903–1987)

Grundbegriffe der Wahrscheinlichkeitsrechnung (1933)

(Source: http://en.academic.ru/dic.nsf/enwiki/54484)

Probability and Statistics for SIC slide 43

Probability space (Ω, F , P)


A random experiment is modelled by a probability space.

Definition 28. A probability space (Ω, F, P) is a mathematical object associated with a random
experiment, comprising:
 a set Ω, the sample space (universe), which contains all the possible results (outcomes,
elementary events) ω of the experiment;
 a collection F of subsets of Ω. These subsets are called events, and F is called the event space;
 a function P : F → [0, 1] called a probability distribution, which associates a probability
P(A) ∈ [0, 1] to each A ∈ F.

Probability and Statistics for SIC slide 44

Sample space
 The sample space Ω is the space composed of elements representing all the possible results of a
random experiment. Each element ω ∈ Ω is associated with a different result.
 Ω is analogous to the universal set. It can be finite, countable or uncountable.
 Ω is nonempty. (If Ω = ∅, then nothing interesting can happen.)

Example 29. Describe the sample spaces for Examples 23–27.

For simple examples with finite Ω, we often choose Ω so that each ω ∈ Ω is equiprobable:
P(ω) = 1/|Ω|, for every ω ∈ Ω.

Then P(A) = |A|/|Ω|, for every A ⊂ Ω.

Probability and Statistics for SIC slide 45

Note to Example 29
Example 23: Here we can write Ω = {ω1 , ω2 }, where ω1 and ω2 represent Tail and Head respectively.
Example 24: Ω = {ω1 , . . . , ω36 }, representing all 36 different possibilities.
Example 25: Ω = {ωj : j = 0, 1, . . .}, representing any non-negative integer.
Example 26: Ω = {ω : ω ∈ [0, 45] minutes}, an uncountable set.
Example 27: Ω? We have to decide what we count as weather outcomes, so this is not so easy.
In general discussion we use ω as an element of Ω, but in examples it is usually easier to write H or T
or (r, g) or similar.

Probability and Statistics for SIC note 1 of slide 45

Event space
F is a set of subsets of Ω which represents the events of interest.

Example 30 (Example 24, continued). Give the events

A the red die shows a 4,


B the total is odd,
C the green die shows a 2,
A∩B the red die shows a 4 and the total is odd.

Calculate their probabilities.

Probability and Statistics for SIC slide 46

Note to Example 30
First we set up the sample space Ω. If we write (2, 4) to mean that the red shows 2 and the green
shows 4, we have
Ω = {(r, g) : r, g = 1, . . . , 6},
giving

A = {(4, g), g = 1, . . . , 6},


B = {(1, 2), (1, 4), (1, 6), (2, 1), (2, 3), (2, 5), . . . , (6, 1), (6, 3), (6, 5)},
C = {(r, 2), r = 1, . . . , 6},
A ∩ B = {(4, 1), (4, 3), (4, 5)}.

By symmetry if the two dice are fair, then |Ω| = 36, |A| = |C| = 6, |B| = 18, and |A ∩ B| = 3, so the
probabilities are

P(A) = P(C) = 6/36 = 1/6, P(B) = 18/36 = 1/2, P(A ∩ B) = 3/36 = 1/12.
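
The probabilities above can also be verified by brute force; a minimal Python sketch enumerating the 36 equiprobable outcomes:

```python
from fractions import Fraction
from itertools import product

omega = list(product(range(1, 7), repeat=2))   # outcomes (r, g) for the red and green dice

def prob(event):
    return Fraction(sum(1 for w in omega if event(w)), len(omega))

A = lambda w: w[0] == 4                # red die shows a 4
B = lambda w: (w[0] + w[1]) % 2 == 1   # total is odd
print(prob(A), prob(B), prob(lambda w: A(w) and B(w)))   # 1/6, 1/2, 1/12
```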

Probability and Statistics for SIC note 1 of slide 46

Event space F , II
Definition 31. An event space F is a set of the subsets of Ω such that:
(F1) F is nonempty;
(F2) if A ∈ F then Ac ∈ F;
(F3) if {A_i}_{i=1}^∞ are all elements of F, then ⋃_{i=1}^∞ A_i ∈ F.
F is also called a sigma-algebra (en français, une tribu).

Let A, B, C, {A_i}_{i=1}^∞ be elements of F. Then the preceding axioms imply that
(a) ⋃_{i=1}^n A_i ∈ F,
(b) Ω ∈ F, ∅ ∈ F,
(c) A ∩ B ∈ F, A \ B ∈ F, A ∆ B ∈ F,
(d) ⋂_{i=1}^n A_i ∈ F.

Probability and Statistics for SIC slide 47

Use of these axioms


To prove (a)–(d), we argue as follows:
(a) Take An+1 = An+2 = · · · = An , and apply (F3).
(b) If F is non-empty, then it has an element A, and by (F2) Ac ∈ F, so A ∪ Ac = Ω ∈ F. Also,
Ωc = ∅ ∈ F.
(c) Note that A ∩ B = (Ac ∪ B c )c , and sets operated on by union and complement remain in F.
Likewise for the differences.
(d) We write ⋂_{i=1}^n A_i = ((⋂_{i=1}^n A_i)^c)^c = (⋃_{i=1}^n A_i^c)^c ∈ F.

Probability and Statistics for SIC slide 48

Event space F , III


 If Ω is countable, we often take F to be the set of all the subsets of Ω. This is the biggest (and
richest) event space for Ω.
 We can define different event spaces for the same sample space.

Example 32. Give the event space for Example 23.

Example 33. I roll two fair dice, one red and one green.
(a) What is my event space F1 ?
(b) I only tell my friend the total. What is his event space F2 ?
(c) My friend looks at the dice himself, but he is colour-blind. What then is his event space F3 ?

Probability and Statistics for SIC slide 49

Note to Example 32
We can write Ω = {H, T }, and then have two choices:

F1 = {{H, T }, ∅} , F2 = {{H, T }, ∅, {H}, {T }} .

Either of these satisfies the axioms (check this) and hence is a valid event space. Only the second,
however, is interesting. In the first the only non-null event is {H, T }, which corresponds to ‘the
experiment was performed and a head or a tail was observed’.

Probability and Statistics for SIC note 1 of slide 49

Note to Example 33
(a) Since we see an outcome of the form (r, g), we can reply to any question about the outcomes; thus
we take F1 to be the set of all possible subsets of Ω = {(r, g) : r, g = 1, . . . , 6}. The ordered pair (r, g)
corresponds to the event A_{r,g} = {(r, g)} (‘the experiment was performed and the outcome was (r, g)’),
and the 2^36 distinct elements B_j of F1 can be constructed by taking all possible unions and
intersections of the A_{r,g}. (Note that the intersection of any two or more disjoint events here will give
∅, and the union of all of them gives Ω.) This means that F1 is the power set of
{A_{r,g} : r, g = 1, . . . , 6}, and |F1| = 2^36.
(b) If I tell him only that the ‘total is t’ for t = 2, . . . , 12, then he can reply to any question about the
total, but nothing else. So his event space F2 is based on the events T2 , . . . , T12 , where

T2 = {(1, 1)}, T3 = {(1, 2), (2, 1)}, T4 = {(1, 3), (2, 2), (3, 1)}, . . . , T12 = {(6, 6)}.

His event space therefore comprises all the possible unions and intersections of these 11 events, and
therefore |F2| = 2^11.
(c) Since he is colour-blind, he cannot tell the difference between (1, 2) and (2, 1), etc.. Thus F3 is
made up of all possible unions and intersections of the sets

{(1, 1)}, {(2, 2)}, . . . , {(6, 6)}, {(1, 2), (2, 1)}, {(1, 3), (3, 1)}, . . . , {(5, 6), (6, 5)}.

There are 6 + 15 = 21 such sets, so |F3| = 2^21, and obviously F2 ⊂ F3 ⊂ F1.


In cases (b) and (c) the event spaces have less information than in (a): they represent a coarsening of
F1 , so that fewer questions can be answered.

Probability and Statistics for SIC note 2 of slide 49

Event space F , III


 Usually the event space is clear from the context, but it is important to write out Ω and F
explicitly, in order to avoid confusion.
 This can also be useful when so-called ‘paradoxes’ appear (generally due to an unclear or erroneous
mathematical formulation of the problem).
 It is essential to give Ω and F when doing exercises, tests and exams.

Probability and Statistics for SIC slide 50

Examples
Example 34. A woman planning her future family considers the following possibilities (we suppose
that the chances of having a boy or a girl are the same each time):
(a) have three children;
(b) keep giving birth until the first girl is born or until three children are born, stopping when one of the
two situations arises;
(c) keep giving birth until there is one of each gender or until there are three children, stopping when
one of the two situations arises.
Let Bi be the event ‘i boys are born’, A the event ‘there are more girls than boys’. Calculate P(B1 )
and P(A) for (a)–(c).

(In fact, the ratio of boys/girls at birth is ∼ 105/100.)

Example 35 (Birthdays). n people are in a room. What is the probability that they all have a
different birthday?

Probability and Statistics for SIC slide 51

Note to Example 34
We learn from this example that:
 changing the protocol or stopping rule can change the observable outcomes and hence the sample
space;
 the outcomes need not have the same probabilities under different stopping rules;
 in some cases it is possible to compute probabilities for outcomes in one sample space by
comparing it to another sample space.
(a) We can write the sample space under this stopping rule as

Ω1 = {BBB, BBG, BGB, BGG, GBB, GBG, GGB, GGG},

where B denotes a boy, G denotes a girl and the ordering is important. These events all have
probability 1/8, by symmetry. Then B1 = {BGG, GBG, GGB} and A = B1 ∪ {GGG} have
probabilities 3/8 and 1/2 respectively. The latter is obvious also by symmetry.
(b) Under this stopping rule the sample space is

Ω2 = {BBB, BBG, BG, G},

but these are not equi-probable; for example B1 = {BG} here corresponds to the event
{BGG, BGB} in Ω1 and so has probability 1/4, and A = {G} here corresponds to the event
{GBB, GBG, GGB, GGG} in Ω1 , and so has probability 1/2.
(c) Under this stopping rule the sample space is

Ω3 = {BBB, GGG, BBG, GGB, GB, BG},

noting that BG here corresponds to {BGG, BGB} in Ω1 , and likewise GB here corresponds to
{GBG, GBB} in Ω1 . In this case the event B1 = {GB, BG, GGB} in Ω3 corresponds to
{GBB, GBG, BGG, BGB, GGB} in Ω1 and hence has probability 5/8, and in Ω3 the event
A = {GGG, GGB} has probability 1/4.

Probability and Statistics for SIC note 1 of slide 51

Birthdays

[Figure: probability that all n birthdays are different, plotted against n = 0, . . . , 60.]

Probability and Statistics for SIC slide 52

Note to Example 35
The sample space can be written Ω = {1, . . . , 365}^n, and each of these possibilities has probability
365^{−n}. We seek the probability of the event

A = {(i_1, . . . , i_n) : i_1 ≠ i_2 ≠ · · · ≠ i_n}.

There are 365 × 364 × · · · × (365 − n + 1) = 365!/(365 − n)! ways this can happen, so the overall
probability is 365!/{(365 − n)! 365^n}, which is shown in the graph.
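
The curve in the figure can be reproduced in a few lines; a sketch computing 365!/{(365 − n)! 365^n} as a running product to avoid huge factorials:

```python
def p_all_different(n, days=365):
    """Probability that n people all have different birthdays."""
    p = 1.0
    for k in range(n):
        p *= (days - k) / days
    return p

for n in (10, 23, 40, 60):
    print(n, round(p_all_different(n), 3))
# already with 23 people the probability that all birthdays differ drops below 0.5
```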

Probability and Statistics for SIC note 1 of slide 52

Galileo Galilei (1564–1642)

(Source: Wikipedia, portrait by Ottavio Leoni)

Probability and Statistics for SIC slide 53

Il Saggiatore, 1623

(Source: Wikipedia)

Probability and Statistics for SIC slide 54

Il Saggiatore, 1623
La filosofia è scritta in questo grandissimo libro che continuamente ci sta aperto innanzi
a gli occhi (io dico l’universo), ma non si può intendere se prima non s’impara a intender la
lingua, e conoscer i caratteri, ne’ quali è scritto. Egli è scritto in lingua matematica, e i
caratteri son triangoli, cerchi, ed altre figure geometriche, senza i quali mezi è impossibile a
intenderne umanamente parola; senza questi è un aggirarsi vanamente per un oscuro
laberinto.
The book of the Universe cannot be understood unless one first learns to comprehend
the language and to understand the alphabet in which it is composed. It is written in the
language of mathematics, and its characters are triangles, circles, and other geometric
figures, without which it is humanly impossible to understand a single word of it; without
these, one wanders about in a dark labyrinth.

Probability and Statistics for SIC slide 55

Three dice problem


Three fair dice are rolled. Let Ti be the event ‘the total is i’, for i = 3, . . . , 18. Which is most likely,
T9 or T10 ?
T9 occurs if the dice have the following outcomes

9 = 6 + 2 + 1 = 5 + 3 + 1 = 5 + 2 + 2 = 4 + 4 + 1 = 4 + 3 + 2 = 3 + 3 + 3.

T10 occurs if the dice have the following outcomes

10 = 6 + 3 + 1 = 6 + 2 + 2 = 5 + 4 + 1 = 5 + 3 + 2 = 4 + 4 + 2 = 4 + 3 + 3.

Thus they are equiprobable.

True or false?

Probability and Statistics for SIC slide 56

Note to the three dice problem


We take Ω = {(r, s, t) : r, s, t = 1, . . . , 6}, for a total of 6^3 = 216 equiprobable outcomes.
Now T9 occurs if we have r + s + t = 9, but the outcomes listed are not equiprobable, because
{1, 2, 6} and {1, 3, 5} can each arise in 3! ways, while {2, 2, 5} can arise in just 3 ways. Adding up the
numbers of outcomes gives |T9 | = 25, |T10 | = 27, so the latter is more probable.
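
A direct enumeration of the 216 outcomes confirms |T9| = 25 and |T10| = 27; a minimal sketch:

```python
from itertools import product

outcomes = list(product(range(1, 7), repeat=3))   # 6^3 = 216 equiprobable triples
t9 = sum(1 for o in outcomes if sum(o) == 9)
t10 = sum(1 for o in outcomes if sum(o) == 10)
print(len(outcomes), t9, t10)   # 216 25 27
```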

Probability and Statistics for SIC note 1 of slide 56

Probability function P
Definition 36. A probability distribution P assigns a probability to each element of the event space
F, with the following properties:
(P 1) if A ∈ F, then 0 ≤ P(A) ≤ 1;
(P 2) P(Ω) = 1;
(P 3) if {A_i}_{i=1}^∞ are pairwise disjoint (i.e., A_i ∩ A_j = ∅, i ≠ j), then

P(⋃_{i=1}^∞ A_i) = Σ_{i=1}^∞ P(A_i).

Probability and Statistics for SIC slide 57

Properties of P
Theorem 37. Let A, B, {A_i}_{i=1}^∞ be events of the probability space (Ω, F, P). Then
(a) P(∅) = 0;
(b) P(Ac ) = 1 − P(A);
(c) P(A ∪ B) = P(A) + P(B) − P(A ∩ B). If A ∩ B = ∅, then P(A ∪ B) = P(A) + P(B);
(d) if A ⊂ B, then P(A) ≤ P(B), and P(B \ A) = P(B) − P(A);
(e) P(⋃_{i=1}^∞ A_i) ≤ Σ_{i=1}^∞ P(A_i) (Boole's inequality);
(f) if A_1 ⊂ A_2 ⊂ · · · , then lim_{n→∞} P(A_n) = P(⋃_{i=1}^∞ A_i);
(g) if A_1 ⊃ A_2 ⊃ · · · , then lim_{n→∞} P(A_n) = P(⋂_{i=1}^∞ A_i).

Probability and Statistics for SIC slide 58

Note to Theorem 37
(a) Since ∅ ∩ A = ∅ for any A ∈ F, we can apply (P 3) to a finite number of sets, just by adding an
infinite number of ∅s. In particular, Ω = Ω ∪ ∅ ∪ ∅ ∪ · · · , and these are pairwise disjoint, so

1 = P(Ω) = P(Ω) + P(∅) + P(∅) + · · · ,

so since P(∅) ≥ 0, we must have P(∅) = 0.


Further, if we have a finite collection A_1, . . . , A_n of pairwise disjoint events, then we can extend it
with A_{n+1} = A_{n+2} = · · · = ∅, which gives A_i ∩ A_j = ∅ for any i ≠ j and all i, j ∈ N, and then

P(⋃_{i=1}^n A_i) = P(⋃_{i=1}^∞ A_i) = Σ_{i=1}^∞ P(A_i) = Σ_{i=1}^n P(A_i) + Σ_{i=n+1}^∞ P(∅) = Σ_{i=1}^n P(A_i),

so (P3) also holds for any finite number of disjoint events.


(b) Follows from the finite version of (P3) ( in (a)) by setting A1 = A, A2 = Ac , and noting that
1 = P(Ω) = P(A ∪ Ac ) = P(A) + P(Ac ).
(c) Follows from (P 3) by writing A ∪ B = (A ∩ B c ) ∪ (A ∩ B) ∪ (Ac ∩ B), which are pairwise disjoint,
and noting that this gives

P(A) = P(A ∩ B) + P(A ∩ B c ), P(B) = P(A ∩ B) + P(Ac ∩ B),

and then

P(A∪B) = P(A∩B c )+P(A∩B)+P(Ac ∩B) = {P(A)−P(A∩B)}+P(A∩B)+{P(B)−P(A∩B)},

giving the required result.


(d) Follows by writing B = A ∪ (B ∩ A^c), and noting that B \ A = B ∩ A^c.

(e) Iteration: for k ∈ N, we write B_{k−1} = ⋃_{i=k}^∞ A_i = A_k ∪ ⋃_{i=k+1}^∞ A_i = A_k ∪ B_k, say, and note that
(c) gives P(B_{k−1}) = P(A_k ∪ B_k) ≤ P(A_k) + P(B_k), resulting in

P(⋃_{i=1}^∞ A_i) ≤ P(A_1) + P(B_1) ≤ Σ_{i=1}^k P(A_i) + P(B_k) ≤ Σ_{i=1}^∞ P(A_i)

as required.
(f) Now A_i ⊂ A_{i+1} for every i, so (A_{i+1} \ A_i) ∩ (A_{j+1} \ A_j) = ∅ when i ≠ j (draw a picture), and
A_n = ⋃_{i=1}^n (A_i \ A_{i−1}), where we've set A_0 = ∅. Note that P(A_{i+1} \ A_i) = P(A_{i+1}) − P(A_i). Thus by
(P 3) we have

P(⋃_{i=1}^∞ A_i) = P(A_1) + Σ_{i=2}^∞ P(A_i \ A_{i−1})
= P(A_1) + Σ_{i=2}^∞ {P(A_i) − P(A_{i−1})}
= lim_{n→∞} [P(A_1) + Σ_{i=2}^n {P(A_i) − P(A_{i−1})}]
= lim_{n→∞} P(A_n).

(g) Like (f).

Probability and Statistics for SIC note 1 of slide 58

Continuity of P
Reminder: A function f is continuous at x if for every sequence {x_n} such that

lim_{n→∞} x_n = x, we have lim_{n→∞} f(x_n) = f(x).

Parts (f) and (g) of Theorem 37 can be extended to show that for all sequences of sets for which

lim_{n→∞} A_n = A, we have lim_{n→∞} P(A_n) = P(A).

Hence P is called a continuous set function.


Probability and Statistics for SIC slide 59

Inclusion-exclusion formulae
If A1 , . . . , An are events of (Ω, F, P ), then

P(A1 ∪ A2 ) = P(A1 ) + P(A2 ) − P(A1 ∩ A2 )


P(A1 ∪ A2 ∪ A3 ) = P(A1 ) + P(A2 ) + P(A3 )
−P(A1 ∩ A2 ) − P(A1 ∩ A3 ) − P(A2 ∩ A3 )
+P(A1 ∩ A2 ∩ A3 )
⋮

P(⋃_{i=1}^n A_i) = Σ_{r=1}^n (−1)^{r+1} Σ_{1≤i_1<···<i_r≤n} P(A_{i_1} ∩ · · · ∩ A_{i_r}).

The number of terms in the general formula is

\binom{n}{1} + \binom{n}{2} + \binom{n}{3} + · · · + \binom{n}{n−1} + \binom{n}{n} = 2^n − 1.

Probability and Statistics for SIC slide 60

Note to inclusion-exclusion formulae


We saw the first equality as part (c) of Theorem 37.
For the second, write B = A2 ∪ A3 , and note that

P(A1 ∪ A2 ∪ A3 ) = P(A1 ) + P(A2 ∪ A3 ) − P{A1 ∩ (A2 ∪ A3 )}


= P(A1 ) + P(A2 ∪ A3 ) − P{(A1 ∩ A2 ) ∪ (A1 ∩ A3 )}
= P(A1 ) + P(A2 ) + P(A3 ) − P(A2 ∩ A3 )
−P(A1 ∩ A2 ) − P(A1 ∩ A3 ) + P{(A1 ∩ A2 ) ∩ (A1 ∩ A3 )}

which is what we want, since the last term is P(A1 ∩ A2 ∩ A3 ). The general formula follows by
iteration of this argument.

Probability and Statistics for SIC note 1 of slide 60

Note to inclusion-exclusion formulae: II
For example, with n = 4, we have

P(A1 ∪ A2 ∪ A3 ∪ A4 ) = P(A1 ) + P(A2 ) + P(A3 ) + P(A4 )


− {P(A1 ∩ A2 ) + P(A1 ∩ A3 ) + P(A1 ∩ A4 )
+P(A2 ∩ A3 ) + P(A2 ∩ A4 ) + P(A3 ∩ A4 )}
+ {P(A1 ∩ A2 ∩ A3 ) + P(A1 ∩ A2 ∩ A4 )
+P(A1 ∩ A3 ∩ A4 ) + P(A2 ∩ A3 ∩ A4 )}
−P(A1 ∩ A2 ∩ A3 ∩ A4 )

where there are 4, 6, 4, 1 terms in the terms having 1, 2, 3, and 4 events, respectively.

Probability and Statistics for SIC note 2 of slide 60

Example 38. What is the probability of getting at least one 6 when I roll three fair dice?

Example 39. An urn contains 1000 lottery tickets numbered from 1 to 1000. One ticket is drawn at
random. Before the draw a fairground showman offers to pay $3 to whoever will give him $2, if the
number on the ticket is divisible by 2, 3, or 5. Would you give him your $2 before the draw? (You lose
your money if the ticket is not divisible by 2, 3, or 5.)

Probability and Statistics for SIC slide 61

Note to Example 38
Let Ai be the event there is a 6 on die i; we want P(A1 ∪ A2 ∪ A3 ). Now by symmetry P(Ai ) = 1/6,
P(Ai ∩ Aj ) = 1/36, and P(A1 ∩ A2 ∩ A3 ) = 1/216. Therefore the second inclusion-exclusion formula
gives

P(A1 ∪ A2 ∪ A3) = 3/6 − 3/36 + 1/216 = 91/216.
Probability and Statistics for SIC note 1 of slide 61

Note to Example 39
Here we can write Ω = {1, . . . , 1000}, and let Di be the event that the number is divisible by i. We
want

P(D2 ∪ D3 ∪ D5) = P(D2) + P(D3) + P(D5) − P(D2 ∩ D3) − P(D2 ∩ D5) − P(D3 ∩ D5) + P(D2 ∩ D3 ∩ D5)
= P(D2) + P(D3) + P(D5) − P(D6) − P(D10) − P(D15) + P(D30)
= (500 + 333 + 200 − 166 − 100 − 66 + 33)/1000 = 367/500 = 0.734.

So with probability 0.734 you gain 3 − 2 = 1 and with probability 0.266 you lose 2: the average gain is
1 × 0.734 + (−2) × 0.266 = 0.202, so you will win on average if you play. The ‘return on investment’ is
0.202/2 ≈ 0.1, or 10%, which is excellent compared to a bank.
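
Both results can be checked by brute force; a sketch counting outcomes directly rather than applying inclusion–exclusion:

```python
from fractions import Fraction
from itertools import product

# Example 38: at least one 6 among three fair dice
dice = list(product(range(1, 7), repeat=3))
p_six = Fraction(sum(1 for d in dice if 6 in d), len(dice))
print(p_six)   # 91/216

# Example 39: ticket number in 1..1000 divisible by 2, 3 or 5
win = sum(1 for k in range(1, 1001) if k % 2 == 0 or k % 3 == 0 or k % 5 == 0)
p_win = Fraction(win, 1000)
print(p_win, float(p_win) * 1 - (1 - float(p_win)) * 2)   # 367/500, expected gain about 0.202
```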

Probability and Statistics for SIC note 2 of slide 61

2.2 Conditional Probability slide 62

Conditional probability
Definition 40. Let A, B be events of the probability space (Ω, F, P), such that P(B) > 0. Then the
conditional probability of A given B is

P(A ∩ B)
P(A | B) = .
P(B)

If P(B) = 0, we adopt the convention P(A ∩ B) = P(A | B)P(B), so both sides are equal to zero.
Thus
P(A) = P(A ∩ B) + P(A ∩ B c ) = P(A | B)P(B) + P(A | B c )P(B c )
even if P(B) = 0 or P(B c ) = 0.

Example 41. We roll two fair dice, one red and one green. Let A and B be the events ‘the total
exceeds 8’, and ‘we get 6 on the red die’. If we know that B has occurred, how does P(A) change?

Probability and Statistics for SIC slide 63

Note to Example 41
We first draw a square containing pairs {(r, g) : r, g = 1, . . . , 6} to display the totals of the two dice.
By inspection, and since all the individual outcomes have probability 1/36, we have
P(A) = (1 + 2 + 3 + 4)/36 = 5/18, P(B) = 6/36 = 1/6, and thus by definition the conditional
probability is P(A | B) = P(A ∩ B)/P(B) = (4/36)/(1/6) = 2/3.
Thus including the information that B has occurred changes the probability of A: conditioning can be
interpreted as inserting information into the calculation of probabilities, resulting in a new probability
space, as we see in the next theorem.

Probability and Statistics for SIC note 1 of slide 63

Conditional probability distributions


Theorem 42. Let (Ω, F, P) be a probability space, let B ∈ F be such that P(B) > 0, and define
Q(A) = P(A | B) for A ∈ F. Then (Ω, F, Q) is a probability space. In particular,
 if A ∈ F, then 0 ≤ Q(A) ≤ 1;
 Q(Ω) = 1;
 if {A_i}_{i=1}^∞ are pairwise disjoint, then

Q(⋃_{i=1}^∞ A_i) = Σ_{i=1}^∞ Q(A_i).

Thus conditioning on different events allows us to construct lots of different probability distributions,
starting with a single probability distribution.

Probability and Statistics for SIC slide 64

Note to Theorem 42
We just need to check the axioms. If A ∈ F, then

Q(A) = P(A | B) = P(A ∩ B)/P(B) ∈ [0, 1],

because A ∩ B ⊂ B and therefore P(A ∩ B) ≤ P(B). Likewise

Q(Ω) = P(Ω ∩ B)/P(B) = P(B)/P(B) = 1,

and finally,

Q(⋃_{i=1}^∞ A_i) = P(⋃_{i=1}^∞ A_i ∩ B)/P(B) = P{⋃_{i=1}^∞ (A_i ∩ B)}/P(B) = Σ_{i=1}^∞ P(A_i ∩ B)/P(B) = Σ_{i=1}^∞ Q(A_i),

using the properties of P(·) and the fact that if A1 , A2 , . . . are pairwise disjoint, then so too are the
A1 ∩ B, A2 ∩ B, . . ..

Probability and Statistics for SIC note 1 of slide 64

Thomas Bayes (1702–1761)

Essay towards solving a problem in the doctrine of chances. (1763/4) Philosophical Transactions
of the Royal Society of London.
(Source: Wikipedia)

Probability and Statistics for SIC slide 65

Bayes’ theorem
Theorem 43 (Law of total probability). Let {B_i}_{i=1}^∞ be pairwise disjoint events (i.e. B_i ∩ B_j = ∅,
i ≠ j) of the probability space (Ω, F, P), and let A be an event satisfying A ⊂ ⋃_{i=1}^∞ B_i. Then

P(A) = Σ_{i=1}^∞ P(A ∩ B_i) = Σ_{i=1}^∞ P(A | B_i)P(B_i).

Theorem 44 (Bayes). Suppose that the conditions above are satisfied, and that P(A) > 0. Then

P(B_j | A) = P(A | B_j)P(B_j) / Σ_{i=1}^∞ P(A | B_i)P(B_i), j ∈ N.

These results are also true if the number of Bi is finite, and if the Bi partition Ω.

Probability and Statistics for SIC slide 66

Note to Theorems 43 and 44


Since the B_i are disjoint, then so are their subsets A ∩ B_i. Thus

P(A) = P{A ∩ ⋃_{i=1}^∞ B_i} = P{⋃_{i=1}^∞ (A ∩ B_i)} = Σ_{i=1}^∞ P(A ∩ B_i) = Σ_{i=1}^∞ P(A | B_i)P(B_i).

For Bayes' theorem, we note that

P(B_j | A) = P(A ∩ B_j)/P(A) = P(A | B_j)P(B_j)/P(A) = P(A | B_j)P(B_j) / Σ_{i=1}^∞ P(A | B_i)P(B_i),

using the theorem of total probability, Theorem 43.

Probability and Statistics for SIC note 1 of slide 66

Example
Example 45. You suspect that the man in front of you at the security check at the airport is a
terrorist. Knowing that one person out of 10^6 is a terrorist, and that a terrorist is detected by the
security check with a probability of 0.9999, but that the alarm goes off when an ordinary person goes
through with a probability of 10^{−5}, what is the probability that he is a terrorist, given that the alarm
goes off when he passes through security?

Probability and Statistics for SIC slide 67

Note to Example 45
Let A and T respectively denote the events ‘the alarm sounds’ and ‘he is a terrorist’. Then we seek

P(T | A) = P(A | T)P(T) / {P(A | T)P(T) + P(A | T^c)P(T^c)} = 0.9999 × 10^{−6} / {0.9999 × 10^{−6} + 10^{−5} × (1 − 10^{−6})} ≈ 0.0909.

Thus the odds are around 10:1 that he is not a terrorist.

We would have to decrease the false alarm probability of 10^{−5} to 10^{−6} to have probability 0.5 that he
is a terrorist.
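
A two-line Bayes computation reproduces the 0.0909 figure; a sketch with the probabilities given in the example:

```python
p_t = 1e-6           # prior probability of being a terrorist
p_alarm_t = 0.9999   # detection probability
p_alarm_not = 1e-5   # false-alarm probability

posterior = p_alarm_t * p_t / (p_alarm_t * p_t + p_alarm_not * (1 - p_t))
print(round(posterior, 4))   # 0.0909
```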
Probability and Statistics for SIC note 1 of slide 67

Multiple conditioning
Theorem 46 (‘Prediction decomposition’). Let A1 , . . . , An be events in a probability space. Then

P(A1 ∩ A2) = P(A2 | A1)P(A1),
P(A1 ∩ A2 ∩ A3) = P(A3 | A1 ∩ A2)P(A2 | A1)P(A1),
⋮
P(A1 ∩ · · · ∩ An) = ∏_{i=2}^n P(A_i | A_1 ∩ · · · ∩ A_{i−1}) × P(A_1).

Probability and Statistics for SIC slide 68

Note to Theorem 46
Just iterate. For example, if we let B = A1 ∩ A2 and note that P(B) = P(A2 | A1 )P(A1 ) by the
definition of conditional probability, then

P(A1 ∩ A2 ∩ A3 ) = P(A3 ∩ B) = P(A3 | B)P(B) = P(A3 | A1 ∩ A2 )P(A2 | A1 )P(A1 ),

on using the definition of conditional probability, twice. For the general case, just extend this idea, by
setting

P(A1 ∩ · · · ∩ An) = P(An | A1, . . . , An−1)P(A1, . . . , An−1)
= P(An | A1, . . . , An−1)P(An−1 | A1, . . . , An−2)P(A1, . . . , An−2)
⋮
= ∏_{i=2}^n P(A_i | A_1 ∩ · · · ∩ A_{i−1}) × P(A_1),

as required.

Probability and Statistics for SIC note 1 of slide 68

Example
Example 47. n men go to a dinner. Each leaves his hat in the cloakroom. When they leave, having
thoroughly sampled the local wine, they choose their hats randomly.
(a) What is the probability that no one chooses his own hat?
(b) What is the probability that exactly r men choose their own hats?
(c) What happens when n is very big?

Probability and Statistics for SIC slide 69

Note to Example 47
 This is an example of many types of matching problem, going back to Montmort (1708).
 The sample space here is the permutations of the numbers {1, . . . , n}, of size n!.
 Let A_i denote the event that the ith hat is on the ith head, and note that P(A_i) = 1/n,

P(A_i ∩ A_j) = P(A_i | A_j)P(A_j) = 1/(n − 1) × 1/n, . . . , P(A_1 ∩ · · · ∩ A_r) = (n − r)!/n!,

using the prediction decomposition. Thus the probability that r specified hats are all on the
right heads is (n − r)!/n!. Let p_n(k) denote the probability that exactly k out of n men get the
right hat.
 (a) We want to compute

P(A_1^c ∩ · · · ∩ A_n^c) = 1 − P(A_1 ∪ · · · ∪ A_n),

so we use the inclusion-exclusion formula to compute p_n(0) = 1 − P(A_1 ∪ · · · ∪ A_n):

1 − P(A_1 ∪ · · · ∪ A_n) = 1 − Σ_{r=1}^n (−1)^{r+1} Σ_{1≤i_1<···<i_r≤n} P(A_{i_1} ∩ · · · ∩ A_{i_r})
= 1 − {n × n^{−1} − \binom{n}{2} × (n − 2)!/n! + · · · + (−1)^{n+1} × \binom{n}{n} × (n − n)!/n!}
= 1 − Σ_{i=1}^n (−1)^{i+1}/i! = Σ_{i=0}^n (−1)^i/i! → e^{−1}, n → ∞.
 (b) The probability that men 1, . . . , r have the right hats and no-one else does is

P(A_1 ∩ · · · ∩ A_r ∩ A_{r+1}^c ∩ · · · ∩ A_n^c) = P(A_1 ∩ · · · ∩ A_r) × P(A_{r+1}^c ∩ · · · ∩ A_n^c | A_1 ∩ · · · ∩ A_r)
= (n − r)!/n! × Σ_{i=0}^{n−r} (−1)^i/i!,

but since there are \binom{n}{r} distinct ways of choosing r from n, the total probability is

p_n(r) = n!/{r!(n − r)!} × (n − r)!/n! × Σ_{i=0}^{n−r} (−1)^i/i! = (1/r!) Σ_{i=0}^{n−r} (−1)^i/i! → e^{−1}/r!, n → ∞.

 (c) See above.
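
The exact formula for p_n(r) and a Monte Carlo simulation agree closely; a minimal sketch (the simulation size of 10^5 replications is an arbitrary choice):

```python
import random
from math import factorial

def p_exact(n, r):
    """P(exactly r of n men get their own hat)."""
    return sum((-1) ** i / factorial(i) for i in range(n - r + 1)) / factorial(r)

def p_sim(n, r, nrep=100_000):
    hits = 0
    for _ in range(nrep):
        perm = list(range(n))
        random.shuffle(perm)
        if sum(i == perm[i] for i in range(n)) == r:
            hits += 1
    return hits / nrep

for r in (0, 1, 2):
    print(r, round(p_exact(10, r), 4), round(p_sim(10, r), 4))
# both columns are close to exp(-1)/r! = 0.3679, 0.3679, 0.1839
```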


Probability and Statistics for SIC note 1 of slide 69

2.3 Independence slide 70

Independent events
Intuitively, saying that ‘A and B are independent’ means that the occurrence of one of the two does
not affect the occurrence of the other. That is to say, P(A | B) = P(A), so the knowledge that
B has occurred leaves P(A) unchanged.

Example 48. A family has two children.


(a) We know that the first child is a boy. What is the probability that the second child is a boy?
(b) We know that one of the two children is a boy. What is the probability that the other child is also
a boy?

Probability and Statistics for SIC slide 71

Note to Example 48
The sample space can be written as Ω = {BB, BG, GB, GG}, in an obvious notation, and the events
that ‘the ith child is a boy’ are B1 = {BB, BG} and B2 = {BB, GB}. Then
 (a) P(B2 | B1) = P(B1 ∩ B2)/P(B1) = P({BB})/P(B1) = 1/4 ÷ 1/2 = 1/2 = P(B2). Thus B2
and B1 are independent.
 (b) the event ‘at least one child is a boy’ is C = B1 ∪ B2 = {BB, BG, GB}, and the event ‘two
boys’ is D = {BB}, so now we seek P(D | C) = 1/4 ÷ 3/4 = 1/3 ≠ P(D). Thus D and C are
not independent.
Note also the importance of precise language: in (a) we know that a specific child is a boy, and in (b)
we are told only that one of the two children is a boy. These different pieces of information change the
probabilities, because the conditioning event is not the same.

Probability and Statistics for SIC note 1 of slide 71

Independence
Definition 49. Let (Ω, F, P) be a probability space. Two events A, B ∈ F are independent (we
write A ⊥⊥ B) iff
P(A ∩ B) = P(A)P(B).
In compliance with our intuition, this implies that

P(A | B) = P(A ∩ B)/P(B) = P(A)P(B)/P(B) = P(A),

and by symmetry P(B | A) = P(B).

Example 50. A pack of cards is well shuffled and one card is picked at random. Are the events A ‘the
card is an ace’, and H ‘the card is a heart’ independent? What can we say about the events A and K
‘the card is a king’ ?

Probability and Statistics for SIC slide 72

Note to Example 50
The sample space Ω consists of the 52 cards, which are equiprobable. P(A) = 4/52 = 1/13 and
P(H) = 13/52 = 1/4, and P(A ∩ H) = 1/52 = P(A)P(H), so A and H are independent. However
P(A ∩ K) = 0 6= P(A)P(K), so these are not independent.

Probability and Statistics for SIC note 1 of slide 72

Types of independence
Definition 51. (a) The events A1, . . . , An are (mutually) independent if for all sets of indices
F ⊂ {1, . . . , n},

P(⋂_{i∈F} A_i) = ∏_{i∈F} P(A_i).

(b) The events A1, . . . , An are pairwise independent if

P(A_i ∩ A_j) = P(A_i) P(A_j), 1 ≤ i < j ≤ n.

(c) The events A1, . . . , An are conditionally independent given B if for all sets of indices
F ⊂ {1, . . . , n},

P(⋂_{i∈F} A_i | B) = ∏_{i∈F} P(A_i | B).

Probability and Statistics for SIC slide 73

A few remarks
 Mutual independence entails pairwise independence, but the converse is only true when n = 2.
 Mutual independence neither implies nor is implied by conditional independence.
 Independence is a key idea that greatly simplifies probability calculations. In practice, it is essential
to verify whether events are independent, because undetected dependence can greatly modify the
probabilities.

Example 52. A family has two children. Show that the events ‘the first born is a boy’, ‘the second
child is a boy’, and ‘there is exactly one boy’ are pairwise independent but not mutually independent.

Probability and Statistics for SIC slide 74

Note to Example 52
The sample space is Ω = {BB, BG, GB, GG}, so P(B1 ) = 1/2, P(B2 ) = 1/2, P(1B) = 1/2, using
an obvious notation.
Also P(B1 ∩ B2 ) = P(B1 ∩ 1B) = P(B2 ∩ 1B) = 1/4, but P(B1 ∩ B2 ∩ 1B) = 0, while the product of
all three probabilities is 1/8.

Probability and Statistics for SIC note 1 of slide 74

Example 53. In any given year, the probability that a male driver has an accident and claims on his
insurance is µ, independently of other years. The probability for a female driver is λ < µ. An insurer
has the same number of male drivers and female drivers, and picks one of them at random.
(a) Give the probability that he (or she) makes a claim this year.
(b) Give the probability that he (or she) makes claims in two consecutive years.
(c) If the company randomly selects a person that made a claim, give the probability that (s)he makes
a claim the following year.
(d) Show that the knowledge that a claim was made in one year increases the probability that a claim
is made in the following year.

Probability and Statistics for SIC slide 75

Note to Example 53
Let Ar denote the event that the selected driver has accidents in r successive years, and M denote the
event that (s)he is male.
(a) Here the law of total probability gives

P(A_1) = P(A_1 | M)P(M) + P(A_1 | M^c)P(M^c) = µ × 1/2 + λ × 1/2 = (µ + λ)/2.

(b) Independence of accidents from year to year, for each driver individually, gives

P(A_2) = P(A_2 | M)P(M) + P(A_2 | M^c)P(M^c) = µ^2 × 1/2 + λ^2 × 1/2 = (µ^2 + λ^2)/2.

(c) Now we want

P(A_2 | A_1) = P(A_2 ∩ A_1)/P(A_1) = P(A_2)/P(A_1) = (λ^2 + µ^2)/(λ + µ).

(d) Note that (λ^2 + µ^2)/(λ + µ) > (λ + µ)/2, because

2(λ^2 + µ^2) − (λ + µ)^2 = λ^2 + µ^2 − 2λµ = (λ − µ)^2 > 0.

Thus they would only be equal if λ = µ, i.e. with no difference between the sexes.

Probability and Statistics for SIC note 1 of slide 75

Series-Parallel Systems
An electric system has components labelled 1, . . . , n, which fail independently of one another. Let F_i
be the event ‘the ith component is faulty’, with P(F_i) = p_i. The event S, ‘the system fails’, occurs if
current cannot pass from one end of the system to the other. If the components are arranged in
parallel, then

P_P(S) = P(F_1 ∩ · · · ∩ F_n) = ∏_{i=1}^n p_i.

If the components are arranged in series, then

P_S(S) = P(F_1 ∪ · · · ∪ F_n) = 1 − ∏_{i=1}^n (1 − p_i).

If there exist upper and lower bounds p_+ and p_− such that

1 > p_+ > p_i > p_− > 0, i = 1, . . . , n,

and n → ∞, then P_P(S) → 0, P_S(S) → 1.
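
These two formulas translate directly into code; a minimal sketch, with the failure probabilities p_i chosen arbitrarily for illustration:

```python
from math import prod

def p_fail_parallel(p):
    """A parallel system fails only if all components fail."""
    return prod(p)

def p_fail_series(p):
    """A series system fails if at least one component fails."""
    return 1 - prod(1 - pi for pi in p)

p = [0.01, 0.02, 0.005]
print(p_fail_parallel(p))   # 1e-06
print(p_fail_series(p))     # about 0.0347
```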

Probability and Statistics for SIC slide 76

Reliability
Example 54 (Chernobyl). A nuclear power station depends on a security system whose components
are arranged according to:

The components fail independently with probability p, and the system fails if current cannot pass from
A to B.
(a) What is the probability that the system fails?
(b) The components are made in batches, which can be good or bad. For a good batch, p = 10^{−6},
whereas for a bad batch p = 10^{−2}. The probability that a batch is good is 0.99. What is the
probability that the system fails (i) if the components come from different batches? (ii) if all the
components come from the same batch?

Probability and Statistics for SIC slide 77

Note to Example 54
The two parallel systems in the upper right and lower branches have respective probabilities p^3 and
p_l = p^2 of failing, so the overall probability of failure for the top branch, which is a series system, is
p_u = 1 − (1 − p)(1 − p^3). The upper and lower branches are in parallel, so the probability that they
both fail is p_u × p_l = p^2{1 − (1 − p)(1 − p^3)} = f(p), say.
Such computations can be used recursively to compute failure probabilities for very large systems.
The probability of failure of a component selected randomly from the two sorts of batches is

q = 10^{−6} × 0.99 + 10^{−2} × 0.01 = 0.00010099,

so the probability of failure in case (i) is f(q) = 1.029995 × 10^{−12}, whereas in (ii) it is

0.99 f(10^{−6}) + 0.01 f(10^{−2}) = 1.000099 × 10^{−8},

roughly 10^4 times larger than in (i).
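
The computation above can be reproduced directly; a sketch of f(p) = p^2{1 − (1 − p)(1 − p^3)} and the two batch scenarios, assuming the layout described in this note:

```python
def f(p):
    """Failure probability of the whole system when every component fails with probability p."""
    p_lower = p ** 2                      # two-component parallel block (lower branch)
    p_upper = 1 - (1 - p) * (1 - p ** 3)  # one component in series with a three-component parallel block
    return p_upper * p_lower              # the two branches are themselves in parallel

q = 1e-6 * 0.99 + 1e-2 * 0.01             # component from a randomly chosen batch
print(f(q))                               # case (i): about 1.03e-12
print(0.99 * f(1e-6) + 0.01 * f(1e-2))    # case (ii): about 1.00e-08
```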

Probability and Statistics for SIC note 1 of slide 77

2.4 Edifying Examples slide 78

Death and the Ladies

(Source: La Danse Macabre des Femmes, Project Gutenberg)

Probability and Statistics for SIC slide 79

Female smokers
Survival after 20 years for 1314 women in the town of Whickham, England (Appleton et al., 1996, The
American Statistician). The columns contain: number of women who had died after 20 years/number of
women in the group at the start of the study (% dead).

Age (years) Smokers Non-smokers


Total 139/582 (24) 230/732 (31)

18–24 2/55 (4) 1/62 (2)


25–34 3/124 (2) 5/157 (3)
35–44 14/109 (13) 7/121 (6)
45–54 27/130 (21) 12/78 (15)
55–64 51/115 (44) 40/121 (33)
65–74 29/36 (81) 101/129 (78)
75+ 13/13 (100) 64/64 (100)

According to the totals, there is a beneficial effect of smoking:

24% < 31%!


Probability and Statistics for SIC slide 80

Simpson’s paradox
Define the events ‘dead after 20 years’, D, ‘smoker’, S, and ‘in age category a at the start’, A = a.
For almost every a we have
P(D | S, A = a) > P(D | S c , A = a),
but
P(D | S) < P(D | S c ).
Note that

P(D | S) = Σ_a P(D | S, A = a) P(A = a | S),
P(D | S^c) = Σ_a P(D | S^c, A = a) P(A = a | S^c),

so if the probabilities P(D | S, A = a) and P(D | S^c, A = a) vary a lot with a, and the age distributions
P(A = a | S) and P(A = a | S^c) differ, weighting the age-specific probabilities in this way can reverse
the order of the inequalities.

This is an example of Simpson’s paradox: ‘forgetting’ conditioning can change the conclusion of a
study.
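
The reversal can be reproduced from the Whickham table; a sketch computing the crude and age-specific death rates from the counts given on the previous slide:

```python
# (deaths, number at start) for smokers and non-smokers, by age group (Whickham data)
smokers     = {"18-24": (2, 55),   "25-34": (3, 124),  "35-44": (14, 109), "45-54": (27, 130),
               "55-64": (51, 115), "65-74": (29, 36),  "75+":   (13, 13)}
non_smokers = {"18-24": (1, 62),   "25-34": (5, 157),  "35-44": (7, 121),  "45-54": (12, 78),
               "55-64": (40, 121), "65-74": (101, 129), "75+":  (64, 64)}

def crude_rate(tab):
    deaths = sum(d for d, _ in tab.values())
    total = sum(n for _, n in tab.values())
    return deaths / total

print(round(crude_rate(smokers), 2), round(crude_rate(non_smokers), 2))   # 0.24 < 0.31
for a in smokers:
    ps = smokers[a][0] / smokers[a][1]
    pn = non_smokers[a][0] / non_smokers[a][1]
    print(a, round(ps, 2), round(pn, 2), ps > pn)   # smokers do worse in almost every age group
```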

Probability and Statistics for SIC slide 81

The tragic story of Sally Clark


An English solicitor, whose first son died of Sudden Infant Death Syndrome (SIDS) a few weeks
after his birth in 1996. Following the death of her second son in the same manner, she was arrested in
1998 and accused of double murder. Her trial was controversial, as a very eminent paediatrician,
Professor Sir Roy Meadow, testified that the probability that two children should die of SIDS in a
family such as that of Sally Clark was 1 in 73 million, a number he obtained as 1/8500^2, where
1/8500 was the estimated probability of a single death due to SIDS.

She was convicted in November 1999, then released in January 2003, because it turned out some
pathological evidence suggesting her innocence had not been disclosed to her lawyer. As a result of her
case, the Attorney-General ordered a review of hundreds of other cases, and two other women in the
same situation were released from jail.

She died of alcoholism in March 2007.


Probability and Statistics for SIC slide 82

41
The rates of SIDS

Data on the rates of infant deaths (CESDI SUDI report,


http://cemach.interface-test.com/Publications/CESDI-SUDI-Report-(1).aspx)

Probability and Statistics for SIC slide 83

Sally Clark: Four tragic errors


 Estimated probabilities
 ‘Ecological fallacy’
 Independence? Really?
 ‘Prosecutors’ fallacy’

Probability and Statistics for SIC slide 84

42
Note on Sally Clark story
 Estimated probabilities: How were the probabilities obtained? What is their accuracy? There are
very few SIDS deaths, and the number 1/8543 may be based on as few as 4 SIDS deaths. Using
standard methods, the estimated probability could be from 0.04 to 0.32 deaths/1000 live births, so
(for example), the figure of 1/73 million could be much larger.
 Ecological fallacy: Even if we accept the argument above, the SUDI study conflates a lot of
different types of families and cases: there is no reason to suppose that the marginal probability of
1/8500 applies to any particular individual (think of Simpson’s paradox, which we just met).
 Independence? If there is a genetic or environmental factor leading to SIDS, then the probability
of two deaths might be much higher than claimed. Just suppose that a genetic factor G is present
in 0.1% of families, and leads to a probability of death of 1/10 for each child, and that conditional
on G or G^c, deaths are independent. Then we might have

P(two deaths) = P(two deaths | G)P(G) + P(two deaths | G^c)P(G^c)
             ≈ (1/10)^2 × 0.001 + (1/8500)^2 × 0.999 ≈ 0.00001 = 1/10^5 ≫ 1/(73 × 10^6).
 Prosecutors’ Fallacy: The probability calculated was P( two deaths | innocent ), whereas what
is wanted is P( innocent | two deaths ). To get the latter we need to apply Bayes’ theorem. Let
E denote the evidence observed (two deaths), and C denote culpability. Then we have
P(C^c | E) = P(E | C^c)P(C^c) / {P(E | C^c)P(C^c) + P(E | C)P(C)},

and we see that in order to compute the required probability, we have to have some estimates of
P(C). Suppose that P(C) = 10^{−6} and that P(E | C) = 1, as murdering two of your own children
is probably quite rare. Then even using the probabilities above, Bayes's theorem would give that

P(C^c | E) ≈ 0.014 = 14/10^3,

which, though small, is nothing like as small as 1/(73 × 10^6). Thus even accepting the ‘squaring
of probabilities’, the case for the prosecution is not nearly as strong as the original argument
suggested.
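A minimal R version of this Bayes calculation, using the same illustrative inputs as above:

pE_Cc <- 1 / 73e6   # P(E | C^c), the disputed '1 in 73 million'
pC <- 1e-6          # assumed prior probability of a double murder
pE_C <- 1           # assumed probability of the evidence given guilt
pE_Cc * (1 - pC) / (pE_Cc * (1 - pC) + pE_C * pC)   # P(C^c | E), about 0.014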

Probability and Statistics for SIC note 1 of slide 84

43
3 Random Variables slide 85

Small probabilistic lexicon

Mathematics English Français


one fair die (several fair dice) un dé juste/équilibré (plusieurs dés justes/équilibrés)
random experiment expérience aléatoire
Ω sample space ensemble fondamental
ω outcome, elementary event épreuve, événement élémentaire
A, B, . . . event événement
F event space l’espace des événements
sigma-algebra tribu
P probability distribution/probability function loi de probabilité
(Ω, F, P) probability space espace de probabilité
inclusion-exclusion formula formule d’inclusion-exclusion
P(A | B) probability of A given B probabilité de A sachant B
independence indépendance
(mutually) independent events événements (mutuellement) indépendants
pairwise independent events événements indépendants deux à deux
conditionally independent events événements conditionellement indépendants

X, Y, Z, W, . . . random variable variable aléatoire


FX (x) (cumulative) distribution function fonction de répartition
fX (x) (probability) density/mass function (PDF) fonction de densité/masse (fm)
E(X) expectation/mean of X espérance de X
var(X) variance of X la variance de X
var(X)1/2 standard deviation of X deviation standard (ou écart-type, mais . . .) de X
fX (x | B) conditional density/mass function fonction de densité/masse conditionnelle

Probability and Statistics for SIC slide 86

44
3.1 Basic Ideas slide 87

Random variables
We usually need to consider random numerical quantities.

Example 55. We roll two fair dice, one red and one green. Let X be the total of the sides facing up.
Find all possible values of X, and the corresponding probabilities.

Definition 56. Let (Ω, F, P) be a probability space. A random variable (rv) X : Ω → R is a


function from the sample space Ω taking values in the real numbers R.

Definition 57. The set of values taken by X,

DX = {x ∈ R : ∃ω ∈ Ω such that X(ω) = x}

is called the support of X. If DX is countable, then X is a discrete random variable.

The random variable X associates probabilities to subsets S included in R, given by

P(X ∈ S) = P({ω ∈ Ω : X(ω) ∈ S}).

In particular, we set Ax = {ω ∈ Ω : X(ω) = x}. Note that we must have Ax ∈ F for every x ∈ R, in
order to calculate P(X = x).

Probability and Statistics for SIC slide 88

Note to Example 55
Draw a grid. X takes values in DX = {2, . . . , 12}, and so is clearly a discrete random variable. By
symmetry the 36 points in Ω are equally likely, so, for example,
P(X = 3) = P({(1, 2), (2, 1)}) = 2/36.
Thus the probabilities for {2, 3, 4 . . . , 12} are respectively

1/36, 2/36, 3/36, 4/36, 5/36, 6/36, 5/36, 4/36, 3/36, 2/36, 1/36.

Probability and Statistics for SIC note 1 of slide 88

Examples
Example 58. We toss a coin repeatedly and independently. Let X be the random variable
representing the number of throws until we first get heads. Calculate

P(X = 3), P(X = 15), P(X ≤ 3.5), P(X > 1.7), P(1.7 ≤ X ≤ 3.5).

Example 59. A natural set Ω when I am playing darts is the wall on which the dart board is hanging.
The dart lands on a point ω ∈ Ω ⊂ R2 . My score is X(ω) ∈ DX = {0, 1, . . . , 60}.

Probability and Statistics for SIC slide 89

45
Note to Example 58
X takes values in {1, 2, 3, . . .} = N, and so is clearly a discrete random variable, with countable
support.
Let p be the probability of heads; then the event X = 3 corresponds to two failures (tails), each with probability 1 − p, followed
by a success, with probability p, giving P(X = 3) = (1 − p)^2 p by independence of the successive trials.
Likewise P(X = 15) = (1 − p)^{14} p, and

P(X ≤ 3.5) = P(X ≤ 3) + P(3 < X ≤ 3.5)
           = p + (1 − p)p + (1 − p)^2 p
           = 1 − P(X > 3)
           = 1 − (1 − p)^3,

and similarly

P(1.7 ≤ X ≤ 3.5) = P(X = 2) + P(X = 3)
                 = (1 − p)p + (1 − p)^2 p
                 = p(1 − p)(1 + 1 − p)
                 = p(1 − p)(2 − p).

Probability and Statistics for SIC note 1 of slide 89

Note to Example 59
Here an infinite Ω ⊂ R2 is mapped onto the finite set {0, . . . , 60}. Even though the underlying Ω is
uncountable, the support of X is countable.

Probability and Statistics for SIC note 2 of slide 89

Jacob Bernoulli (1654–1705)

Ars Conjectandi, Basel (1713)


(Source: http://www-history.mcs.st-and.ac.uk/PictDisplay/Bernoulli_Jacob.html)

Probability and Statistics for SIC slide 90

46
Bernoulli random variables
Definition 60. A random variable that takes only the values 0 and 1 is called an indicator variable,
or a Bernoulli random variable, or a Bernoulli trial.

Typically the values 0/1 correspond to false/true, failure/success, bad/good, . . .

Example 61. Suppose that n identical coins are tossed independently, let Hi be the event ‘we get
heads for the ith coin’, and let Ii = I(Hi ) be the indicator of this event. Then

P(Ii = 1) = P(Hi ) = p, P(Ii = 0) = P(Hic ) = 1 − p,

where p is the probability of obtaining heads.


 If n = 3 and X = I1 + I2 + I3 , describe Ω, DX and the sets Ax .
 What do
X = I1 + · · · + In, Y = I1(1 − I2)(1 − I3), Z = Σ_{j=2}^n I_{j−1}(1 − I_j)

represent?

Probability and Statistics for SIC slide 91

Note to Example 61
 If n = 3, then we can write the sample space as
Ω = {T T T, T T H, T HT, HT T, T HH, HT H, HHT, HHH}. Clearly DX = {0, 1, 2, 3}, and

A0 = {T T T }, A1 = {T T H, T HT, HT T }, A2 = {T HH, HT H, HHT }, A3 = {HHH}.

 X is the total number of heads in the first n tosses, Y = 1 if and only if the sequence starts HTT,
and Z counts the number of times a 1 is followed by a 0 in the sequence of n tosses.

Probability and Statistics for SIC note 1 of slide 91

Mass functions
A random variable X associates probabilities to subsets of R. In particular when X is discrete, we have

Ax = {ω ∈ Ω : X(ω) = x},

and we can define:

Definition 62. The probability mass function (PMF) of a discrete random variable X is

fX (x) = P(X = x) = P(Ax ), x ∈ R.

It has two key properties :


(i) fX(x) ≥ 0, and it is only positive for x ∈ DX, where DX is the image of the function X, i.e.,
the support of fX;
(ii) the total probability Σ_{i: xi ∈ DX} fX(xi) = 1.
When there is no risk of confusion, we write fX ≡ f and DX ≡ D.

Probability and Statistics for SIC slide 92

47
Binomial random variable
Example 63 (Example 61 continued). Give the PMFs and supports of Ii , of Y and of X.

Definition 64. A binomial random variable X has PMF


 
f(x) = \binom{n}{x} p^x (1 − p)^{n−x}, x = 0, 1, . . . , n, n ∈ N, 0 ≤ p ≤ 1.

We write X ∼ B(n, p), and call n the denominator and p the probability of success. With n = 1,
this is a Bernoulli variable.

Remark: we use ∼ to mean ‘has the distribution’.


The binomial model is used when we are considering the number of ‘successes’ of a trial which is
independently repeated a fixed number of times, and where each trial has the same probability of
success.
Probability and Statistics for SIC slide 93

Note to Example 63
 Ii takes values 0 and 1, with probabilities P(Ii = 1) = p, and P(Ii = 0) = 1 − p.
 Y is also binary with P(Y = 1) = p(1 − p)2 , P(Y = 0) = 1 − p(1 − p)2 .
 X takes values 0, 1, . . . , n, with binomial probabilities (see below).

Probability and Statistics for SIC note 1 of slide 93

Binomial probability mass functions


[Figure: bar plots of the probability mass functions of the B(10, 0.5), B(10, 0.3), B(20, 0.1) and B(40, 0.9) distributions.]

Probability and Statistics for SIC slide 94

Examples
Example 65. A multiple choice test contains 20 questions. For each question you must choose the
correct answer amongst 5 possible answers. A pass is obtained with 10 correct answers. A student
picks his answers at random.
 Give the distribution for his number of correct answers.
 What is the probability that he will pass the test?

Probability and Statistics for SIC slide 95

48
Note to Example 65
Since n = 20 and p = 1/5 = 0.2, the number of correct replies is X ∼ B(20, 0.2). The probability of
passing is

P(X ≥ 10) = Σ_{x=10}^{20} \binom{20}{x} 0.2^x (1 − 0.2)^{20−x} ≈ 0.0026

after a painful calculation, or, better, using R,

> 1-pbinom(q=9, size=20, prob=0.2)


[1] 0.002594827
> pbinom(q=9, size=20, prob=0.2, lower.tail=FALSE)
[1] 0.002594827
Probability and Statistics for SIC note 1 of slide 95

Geometric distribution
Definition 66. A geometric random variable X has PMF

fX(x) = p(1 − p)^{x−1}, x = 1, 2, . . . , 0 ≤ p ≤ 1.

We write X ∼ Geom(p), and we call p the success probability.

This models the waiting time until a first event, in a series of independent trials having the same
success probability.

Example 67. To start a board game, m players each throw a die in turn. The first to get six begins.
Give the probabilities that the 3rd player will begin on his first throw of the die, that he will begin, and
of waiting for at least 6 throws of the die before the start of the game.

Theorem 68 (Lack of memory). If X ∼ Geom(p), then

P(X > n + m | X > m) = P(X > n).

This is also sometimes called memorylessness.

Probability and Statistics for SIC slide 96

Note to Example 67
In this case DX = N.
Here p = 1/6, so the probability that the third person starts on his first throw of the die is
(5/6)^2 × 1/6 ≈ 0.116. He starts if the first six appears on throw 3, m + 3, 2m + 3, . . ., and this has probability

Σ_{i=0}^∞ P(X = 3 + im) = Σ_{i=0}^∞ p(1 − p)^{3+im−1} = p(1 − p)^2 Σ_{i=0}^∞ (1 − p)^{im} = p(1 − p)^2 / {1 − (1 − p)^m},

where p = 1/6.
The probability of waiting for at least 6 throws is (1 − p)^6 ≈ 0.335.
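A short R check of these formulae; the choice of m = 3 players below is ours, since the example leaves m general:

p <- 1/6
third_starts <- function(m) p * (1 - p)^2 / (1 - (1 - p)^m)  # P(third player starts the game)
third_starts(3)   # about 0.275 when m = 3
(1 - p)^6         # P(at least 6 throws before the game starts): about 0.335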

Probability and Statistics for SIC note 1 of slide 96

49
Note to Theorem 68
Since P(X > n) = (1 − p)^n, we seek

P(X > n + m | X > m) = (1 − p)^{m+n}/(1 − p)^m = (1 − p)^n = P(X > n).

Thus we see that there is a ‘lack of memory’: knowing that X > m does not change the probability
that we have to wait at least another n trials before seeing the event.

Probability and Statistics for SIC note 2 of slide 96

Negative binomial distribution


Definition 69. A negative binomial random variable X with parameters n and p has PMF
 
fX(x) = \binom{x−1}{n−1} p^n (1 − p)^{x−n}, x = n, n + 1, n + 2, . . . , 0 ≤ p ≤ 1.

We write X ∼ NegBin(n, p). When n = 1, X ∼ Geom(p).

It models the waiting time until the nth success in a series of independent trials having the same
success probability.

Example 70. Give the probability of seeing 2 heads before 5 tails in repeated tosses of a coin.

Probability and Statistics for SIC slide 97

Note to Example 70
This is the probability that X ≤ 6, where X is the waiting time for n = 2 heads. It is

\binom{2−1}{2−1} p^2 (1 − p)^{2−2} + \binom{3−1}{2−1} p^2 (1 − p)^{3−2} + \binom{4−1}{2−1} p^2 (1 − p)^{4−2}
  + \binom{5−1}{2−1} p^2 (1 − p)^{5−2} + \binom{6−1}{2−1} p^2 (1 − p)^{6−2}.
If we assume that the coin is fair, so p = 0.5, R gives

pnbinom(q=4, size=2, prob=0.5)


[1] 0.890625

where note that q = x − n in the parametrization used in R.

Probability and Statistics for SIC note 1 of slide 97

50
Geometric and negative binomial PMFs
[Figure: probability mass functions of the Geom(0.5), Geom(0.1), NegBin(4, 0.5) and NegBin(6, 0.3) distributions.]

Probability and Statistics for SIC slide 98

Negative binomial distribution: alternative version


We sometimes write the geometric and negative binomial variables in a more general form, setting
Y = X − n, and then the probability mass function is

fY(y) = {Γ(y + α)/(Γ(α) y!)} p^α (1 − p)^y, y = 0, 1, 2, . . . , 0 ≤ p ≤ 1, α > 0,

where
Γ(α) = ∫_0^∞ u^{α−1} e^{−u} du, α > 0,
is the Gamma function. The principal properties of Γ(α) are:

Γ(1) = 1;
Γ(α + 1) = αΓ(α), α > 0;
Γ(n) = (n − 1)!, n = 1, 2, 3, . . . ;
Γ(1/2) = √π.

They will be useful later.

Probability and Statistics for SIC slide 99

Hypergeometric distribution
Definition 71. We draw a sample of m balls without replacement from an urn containing w white
balls and b black balls. Let X be the number of white balls drawn. Then
P(X = x) = \binom{w}{x}\binom{b}{m−x} / \binom{w+b}{m}, x = max(0, m − b), . . . , min(w, m),

and the distribution of X is hypergeometric. We write X ∼ HyperGeom(w, b; m).

Example 72. I leave for a camping trip in Ireland with six tins of food, two of which contain fruit. It
pours with rain, and the labels come off the tins. If I pick three of the six tins at random, find the
distribution of the number of tins of fruit among the three I have chosen.

Probability and Statistics for SIC slide 100

51
Note to Example 72
White balls correspond to fruit tins, black balls to others, so w = 2, b = 4, and I take m = 3.
Therefore the number of fruit tins X drawn has probability
P(X = x) = \binom{2}{x}\binom{4}{3−x} / \binom{6}{3}, x = 0, . . . , 2,

and some calculation gives P(X = 0) = 1/5, P(X = 1) = 3/5, P(X = 2) = 1/5.
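The same probabilities can be obtained from R's hypergeometric functions, in which m counts the white balls (fruit tins), n the black balls and k the sample size:

dhyper(x = 0:2, m = 2, n = 4, k = 3)
# [1] 0.2 0.6 0.2, i.e. 1/5, 3/5, 1/5 as above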

Probability and Statistics for SIC note 1 of slide 100

Capture-recapture
Example 73. In order to estimate the number of fish N in a lake, we first catch r fish, mark them,
and let them go. After having waited long enough for the fish population to become well-mixed, we
catch another sample of size s.
 Find the distribution of the number of marked fish, M , in this sample.
 Show that the value of N which maximises P(M = m) is ⌊rs/m⌋, and calculate the best
estimation of N when s = 50, r = 40, and m = 4.

The basic idea behind this example is used to estimate the sizes of populations of endangered species,
the number of drug addicts or of illegal immigrants in human populations, etc. One practical problem
often encountered is that certain individuals become harder to recapture, whereas others enjoy it; thus
the probabilities of recapture are heterogeneous, unlike in the example above.

Probability and Statistics for SIC slide 101

Note to Example 73
The total number is N, of which r are marked and N − r unmarked. The distribution of M is

PN(M = m) = \binom{r}{m}\binom{N−r}{s−m} / \binom{N}{s}, m = max(0, s + r − N), . . . , min(r, s),

(work out the limits carefully).

For the second part, we seek to maximise this probability with respect to N. Now compare the
probabilities for N and N − 1 and take ratios, giving

PN(M = m)/PN−1(M = m) = {\binom{r}{m}\binom{N−r}{s−m}/\binom{N}{s}} / {\binom{r}{m}\binom{N−1−r}{s−m}/\binom{N−1}{s}}
                      = (N − r)(N − s)/{N(N + m − r − s)} > 1

provided that (after a little algebra) rs/m > N. Hence the probability increases in N while N < rs/m
and decreases thereafter, so it is maximised at N̂ = ⌊rs/m⌋; indeed we can write

PN(M = m) = {PN(M = m)/PN−1(M = m)} × · · · × {PNmin+1(M = m)/PNmin(M = m)} × PNmin(M = m),

where the latter probability is for the smallest value Nmin of N for which the probability that M = m is
positive.
In the example given, N̂ = ⌊50 × 40/4⌋ = 500.
The behaviour of such estimators can be very poor.
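A small R sketch of this maximisation, evaluating PN(M = m) over a grid of candidate population sizes (variable names are ours):

r <- 40; s <- 50; m <- 4
N <- 90:2000
probs <- dhyper(x = m, m = r, n = N - r, k = s)  # P_N(M = m)
floor(r * s / m)       # 500, the estimate derived above
N[which.max(probs)]    # 499 or 500: these tie exactly here because rs/m is an integer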

Probability and Statistics for SIC note 1 of slide 101

52
Hypergeometric PMFs
Probability mass functions of M (left) and of ⌊rs/M⌋ (centre) in Example 73, when r = 40, s = 50
and N = 1000, excluding ⌊rs/M⌋ = +∞, which corresponds to M = 0, and PN(M = m) as a function
of N (right):

[Figure: the three panels described above.]

Probability and Statistics for SIC slide 102

Discrete uniform distribution


Definition 74. A discrete uniform random variable X has PMF
fX(x) = 1/(b − a + 1), x = a, a + 1, . . . , b, a < b, a, b ∈ Z.

We write X ∼ DU(a, b).

This definition generalizes the outcome of a die throw, which corresponds to the DU(1, 6) distribution.

Probability and Statistics for SIC slide 103

Siméon-Denis Poisson (1781–1840)

‘Life is good for only two things, discovering mathematics and teaching mathematics.’
(Source: http://www-history.mcs.st-and.ac.uk/PictDisplay/Poisson.html)

Probability and Statistics for SIC slide 104

53
Poisson distribution
Definition 75. A Poisson random variable X has the PMF
fX(x) = (λ^x/x!) e^{−λ}, x = 0, 1, . . . , λ > 0.
We write X ∼ Pois(λ).

 Since λ^x/x! > 0 for any λ > 0 and x ∈ {0, 1, . . .}, and

e^{−λ} = 1/e^λ = 1/Σ_{x=0}^∞ (λ^x/x!) > 0,

we see that fX(x) > 0 and Σ_{x=0}^∞ fX(x) = 1, so this is a probability distribution.
 The Poisson distribution appears everywhere in probability and statistics, often as a model for
counts, or for a number of rare events.
 It also provides approximations to probabilities, for example for random permutations (Example 47,
random hats) or the binomial distribution (later).

Probability and Statistics for SIC slide 105

Poisson probability mass functions


[Figure: probability mass functions of the Pois(0.5), Pois(1), Pois(4) and Pois(10) distributions.]

Probability and Statistics for SIC slide 106

54
Poisson process
 A fundamental model for random events taking place in time or space, e.g., in queuing systems,
communication systems, . . .
 Consider point events taking place in a time period T = [0, T ], and write N (I) for the number of
events in a subset I ⊂ T . Suppose that
– events in disjoint subsets of T are independent;
– the probability that an event takes place in an interval of (small) width δ is δλ + o(δ) for some
λ > 0;
– the probability of no events in an interval of (small) width δ is 1 − δλ + o(δ).
Here o(δ) is a term such that limδ→0 o(δ)/δ = 0.
 Then
– N {(a, b)} ∼ Pois{λ(b − a)}, or, more generally, N (I) ∼ Pois(λ|I|);
– if I1 , . . . , Ik are disjoint subsets of T , then N (I1 ), . . . , N (Ik ) are independent Poisson
variables.
 We can use these properties to deduce that
– sums of independent Poisson variables have Poisson distributions;
– the waiting time to the first event has an exponential distribution;
– the intervals between events have independent exponential distributions.
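As a rough illustration of these properties, here is a minimal R simulation of a Poisson process on (0, T], built from independent exponential inter-event times (the parameter values below are arbitrary):

lambda <- 2; Tend <- 10
times <- cumsum(rexp(1000, rate = lambda))   # event times: cumulative sums of exponential gaps
times <- times[times <= Tend]                # keep the events falling in (0, Tend]
length(times)                                # one realisation of N((0, Tend]), roughly Pois(lambda * Tend)
counts <- replicate(10^4, sum(cumsum(rexp(100, lambda)) <= Tend))
c(mean(counts), var(counts), lambda * Tend)  # mean and variance both close to 20, as for a Poisson count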

Probability and Statistics for SIC slide 107

Note on the Poisson process


 Consider point events taking place in the interval T = (0, t], for t > 0, satisfying the given axioms,
and note that the probability of seeing two or more events in an interval of length δ is
1 − {1 − λδ + o(δ)} − {λδ + o(δ)} = o(δ) → 0, as δ → 0.
 Let N (t) denote the number of events in T and let δ = t/m for some large m. The probability
that N (t) = n is approximately the probability that one event occurs in precisely n of the m
intervals (0, δ], (δ, 2δ], . . . , ((m − 1)δ, mδ], and no events occur in the other m − n intervals, and
this is given by the binomial formula
 
P{N(t) = n} ≈ \binom{m}{n} {λδ + o(δ)}^n {1 − λδ + o(δ)}^{m−n}
           = \binom{m}{n} (λt/m)^n (1 − λt/m)^{m−n} + m o(t/m)
           = m^{−n} \binom{m}{n} × (λt)^n (1 − λt/m)^{m−n} + m o(t/m),   (1)

where in the second line the o(δ) terms are assembled and then δ is replaced by t/m. As m → ∞
with n fixed, m o(t/m) → 0 and (1 + a/m)^{m−n} → e^a for any real a, while Lemma 104 implies
that the first term in (1) converges to 1/n!, so

lim_{m→∞} P{N(t) = n} = {(λt)^n/n!} e^{−λt}, n = 0, 1, 2, . . . ,

which is the probability mass function of the Poisson distribution with mean λt.

Probability and Statistics for SIC note 1 of slide 107

55
Note on the Poisson process, II
 Here are some implications of this:
– if T were divided into two disjoint intervals T1 , T2 such that λ|T1 | = µ1 and λ|T2 | = µ2 , the
same argument applied separately to T1 and T2 shows that their respective counts N1 and N2
have Poisson distributions with means µ1 and µ2 , and (A1) implies that these are independent.
Since N1 + N2 = N (t), we deduce that the sum of two independent Poisson variables is
Poisson, with mean µ1 + µ2 ;
– the waiting time X1 to the first event exceeds t if and only if N(t) = 0, so

P(X1 ≤ t) = 1 − P{N(t) = 0} = 1 − e^{−λt}, t > 0,

so we see that X1 ∼ exp(λ), and E(X1) = 1/λ. Moreover the waiting time Xn to the nth
event exceeds t if and only if N(t) < n, so

P(Xn ≤ t) = 1 − P{N(t) < n} = 1 − Σ_{r=0}^{n−1} {(λt)^r/r!} e^{−λt}, t > 0,

and differentiation of this with respect to t leads (after cancelling terms) to

fXn(t) = {λ^n t^{n−1}/(n − 1)!} e^{−λt}, t > 0,

which is the density of the gamma distribution with shape parameter n and scale λ; recall that
Γ(n) = (n − 1)!. By independence of events in disjoint intervals, this must have the same
distribution as a sum of n independent waiting times distributed like X1 , so a sum of
independent exponential variables is gamma;
– now suppose that we start observing such a process at a random time t0 ≫ 0. What is the
distribution of the interval into which t0 falls? We can write the total length of the interval as
W = X− + X+ , where X− is the backward recurrence time from t0 to the previous event, and
X+ is the time to the next event, and (A1) implies that these are independent. Now
X+ ∼ exp(λ), and since there is no directionality, X− ∼ exp(λ), so W has the gamma
distribution with parameters 2 and λ. In particular, the expected length of W is twice that of
X1 . This is an example of length-biased sampling : sampling the Poisson process at a random
time means that the sampling point will fall into an interval that is longer than average.
Alternatively we can argue as follows: imagine that we take intervals at random from M
separate Poisson processes, with M very large, and place these intervals end to end. The
number of intervals of length x will be approximately M fX1 (x) dx, the total length of the M
intervals will be approximately M E(X1 ), and the portion of this taken up by intervals of length
x will be M fX1 (x) dx × x. Thus a point chosen uniformly at random in the total length
M E(X1 ) will fall into an interval of length x with probability

fW(x) dx ≈ M x fX1(x) dx / {M E(X1)} = x fX1(x) dx / E(X1) = x λ e^{−λx} dx / (1/λ) = λ^2 x e^{−λx} dx, x > 0,

which corresponds to the gamma density function with parameters 2 and λ.

Probability and Statistics for SIC note 2 of slide 107

56
Cumulative distribution function
Definition 76. The cumulative distribution function (CDF) of a random variable X is

FX (x) = P(X ≤ x), x ∈ R.

If X is discrete, we can write

FX(x) = Σ_{xi ∈ DX: xi ≤ x} P(X = xi),

which is a step function with jumps at the points of the support DX of fX (x).

When there is no risk of confusion, we write F ≡ FX .

Example 77. Give the support and the probability mass and cumulative distribution functions of a
Bernoulli random variable.

Example 78. Give the cumulative distribution function of a geometric random variable.

Probability and Statistics for SIC slide 108

Note to Example 77
The support is D = {0, 1}, and the CDF is

0,
 x < 0,
F (x) = 1 − p, 0 ≤ x < 1,


1, x ≥ 1.

Draw a picture, showing a step function with a jump of 1 − p at x = 0 and of p at x = 1.

Probability and Statistics for SIC note 1 of slide 108

Note to Example 78
The support is D = N, and for x ≥ 1 we have

P(X ≤ x) = Σ_{r=1}^{⌊x⌋} p(1 − p)^{r−1},

so we need to sum a geometric series with common ratio 1 − p, giving

P(X ≤ x) = p{1 − (1 − p)^{⌊x⌋}} / {1 − (1 − p)} = 1 − (1 − p)^{⌊x⌋}.

Thus P(X ≤ x) = 0 for x < 1, and P(X ≤ x) = 1 − (1 − p)^{⌊x⌋} for x ≥ 1.
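This can be checked against R's pgeom, which parametrises the geometric distribution by the number of failures before the first success, so that P(X ≤ x) here corresponds to pgeom(floor(x) − 1, p):

p <- 0.3; x <- 4.7
1 - (1 - p)^floor(x)           # 0.7599, from the formula above
pgeom(floor(x) - 1, prob = p)  # the same value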

Probability and Statistics for SIC note 2 of slide 108

57
Properties of a cumulative distribution function
Theorem 79. Let (Ω, F, P) be a probability space and X : Ω → R a random variable. Its cumulative
distribution function FX satisfies:
(a) limx→−∞ FX (x) = 0;
(b) limx→∞ FX (x) = 1;
(c) FX is non-decreasing, so FX (x) ≤ FX (y) for x ≤ y;
(d) FX is continuous on the right, thus

lim_{t↓0} FX(x + t) = FX(x), x ∈ R;

(e) P(X > x) = 1 − FX (x);


(f) if x < y, then P(x < X ≤ y) = FX (y) − FX (x).

Probability and Statistics for SIC slide 109

Note to Theorem 79
(a) If not, there must be a blob of mass at −∞, which is not allowed, as X ∈ R.
(b) Ditto, for +∞.
(c) If y ≥ x, then F (y) = F (x) + P(x < X ≤ y), so the difference is always non-negative.
(d) Now F (x + t) = P(X ≤ x) + P(x < X ≤ x + t), and the second term here tends to zero, because
any point in the interval (x, x + t] at which there is positive probability must lie to the right of x.
(e) We have P(X > x) = 1 − P(X ≤ x) = 1 − FX (x).
(f) We have P(x < X ≤ y) = P(X ≤ y) − P(X ≤ x) = FX (y) − FX (x).

Probability and Statistics for SIC note 1 of slide 109

Remarks
 We can obtain the probability mass function of a discrete random variable from the cumulative
distribution function using
f(x) = F(x) − lim_{y↑x} F(y).

In many cases X only takes integer values, DX ⊂ Z, and so f (x) = F (x) − F (x − 1) for x ∈ Z.
 From now on we will mostly ignore the implicit probability space (Ω, F, P) when dealing with a
random variable X. We will rather think in terms of X, FX (x), and fX (x). We can legitimise this
‘oversight’ mathematically.
 We can specify the distribution of a random variable in an equivalent way by saying (for example):
– X follows a Poisson distribution with parameter λ; or
– X ∼ Pois(λ); or
– by giving the probability mass function of X; or
– by giving the cumulative distribution function of X.

Probability and Statistics for SIC slide 110

58
Transformations of discrete random variables
Real-valued functions of random variables are random variables themselves, so they possess probability
mass and cumulative distribution functions.

Theorem 80. If X is a random variable and Y = g(X), then


fY(y) = Σ_{x: g(x)=y} fX(x).

Example 81. Calculate the PMF of Y = I(X ≥ 1) when X ∼ Pois(λ).

Example 82. Let Y be the remainder of the division by four of the total of two independent dice
throws. Calculate the PMF of Y .
Probability and Statistics for SIC slide 111

Note to Theorem 80
We have

fY(y) = P(Y = y) = Σ_{x: g(x)=y} P(X = x) = Σ_{x: g(x)=y} fX(x).

Probability and Statistics for SIC note 1 of slide 111

Note to Example 81
Here Y = I(X ≥ 1) takes values 0 and 1, and

fY(0) = P(Y = 0) = P(X = 0) = e^{−λ},
fY(1) = P(Y = 1) = Σ_{x=1}^∞ P(X = x) = Σ_{x=1}^∞ (λ^x/x!) e^{−λ} = 1 − e^{−λ}.

Probability and Statistics for SIC note 2 of slide 111

Note to Example 82
Y has support 0, 1, 2, 3, and mass function given by

fY (0) = P(Y = 0) = P(X ∈ {4, 8, 12}) = (3 + 5 + 1)/36 = 9/36,


fY (1) = P(Y = 1) = P(X ∈ {5, 9}) = (4 + 4)/36 = 8/36,
fY (2) = P(Y = 2) = P(X ∈ {2, 6, 10}) = (1 + 5 + 3)/36 = 9/36,
fY (3) = P(Y = 3) = P(X ∈ {3, 7, 11}) = (2 + 6 + 2)/36 = 10/36,

which fortunately adds to 36/36.

Probability and Statistics for SIC note 3 of slide 111

59
3.2 Expectation slide 112

Expectation
Definition 83. Let X be a discrete random variable for which Σ_{x∈DX} |x| fX(x) < ∞, where DX is
the support of fX. The expectation (or expected value or mean) of X is

E(X) = Σ_{x∈DX} x P(X = x) = Σ_{x∈DX} x fX(x).

 If E(|X|) = Σ_{x∈DX} |x| fX(x) is not finite, then E(X) is not well defined.
 E(X) is also sometimes called the “average of X”. We will limit the use of the word “average” to
empirical quantities.
 The expectation is analogous in mechanics to the notion of centre of gravity of an object whose
mass is distributed according to fX .

Example 84. Calculate the expectation of a Bernoulli random variable with probability p.

Example 85. Calculate the expectation of X ∼ B(n, p).

Example 86. Calculate the expectation of the random variables with PMFs
fX(x) = 4/{x(x + 1)(x + 2)}, fY(x) = 1/{x(x + 1)}, x = 1, 2, . . . .

Probability and Statistics for SIC slide 113

Note to Example 84
First we note that if the support of X is finite, then E(|X|) < maxx∈DX |x| < ∞.
If I is Bernoulli with probability p, then E(I) = 0 × (1 − p) + 1 × p = p.

Probability and Statistics for SIC note 1 of slide 113

Note to Example 85
Here DX = {0, 1, . . . , n} is finite, so E(|X|) < ∞.
We get
E(X) = Σ_{x=0}^n x \binom{n}{x} p^x (1 − p)^{n−x}
     = Σ_{x=1}^n [n × (n − 1)! / {(x − 1)! (n − 1 − (x − 1))!}] p × p^{x−1} (1 − p)^{(n−1)−(x−1)}
     = np Σ_{y=0}^{n−1} \binom{n−1}{y} p^y (1 − p)^{n−1−y} = np,

where we have set y = x − 1. This agrees with the previous example, since X can be viewed as a sum
I1 + · · · + In .
Probability and Statistics for SIC note 2 of slide 113

60
Note to Example 86
Note that fY sums to unity: since the series is absolutely convergent we can re-organise the brackets in
the sums, giving

Σ_{x=1}^∞ 1/{x(x + 1)} = Σ_{x=1}^∞ {1/x − 1/(x + 1)} = 1/1 + Σ_{x=1}^∞ {1/(x + 1) − 1/(x + 1)} = 1,

after cancelling terms.

A similar argument works for fX, since

Σ_{x=1}^∞ 4/{x(x + 1)(x + 2)} = 2 Σ_{x=1}^∞ [1/{x(x + 1)} − 1/{(x + 1)(x + 2)}]
                              = 2 [1/(1 × 2) + Σ_{x=1}^∞ {1/((x + 1)(x + 2)) − 1/((x + 1)(x + 2))}] = 1.

Now since the sum below is absolutely convergent, we have

E(X) = Σ_{x=1}^∞ 4/{(x + 1)(x + 2)} = 4 Σ_{x=1}^∞ {1/(x + 1) − 1/(x + 2)}
     = 4 [1/(1 + 1) + Σ_{x=1}^∞ {1/(x + 2) − 1/(x + 2)}] = 2.

However,

E(Y) = Σ_{x=1}^∞ 1/(x + 1) = +∞.

Thus it is relatively easy to construct random variables whose expectations are infinite: existence of an
expected value is not guaranteed.

Probability and Statistics for SIC note 3 of slide 113

Expected value of a function


Theorem 87. Let X be a random variable with mass function f , and let g be a real-valued function
of X. Then

E{g(X)} = Σ_{x∈DX} g(x) f(x),

when Σ_{x∈DX} |g(x)| f(x) < ∞.

Example 88. Let X ∼ Pois(λ). Calculate the expectations of

X, X(X − 1), X(X − 1) · · · (X − r + 1).

Probability and Statistics for SIC slide 114

61
Note to Theorem 87
Write Y = g(X), and note that for any y in the support DY of Y , we have
fY(y) = P(Y = y) = P{g(X) = y} = Σ_{x∈DX: g(x)=y} P(X = x) = Σ_{x∈DX: g(x)=y} fX(x).

Therefore

E(Y) = Σ_{y∈DY} y fY(y) = Σ_{y∈DY} y Σ_{x∈DX: g(x)=y} fX(x) = Σ_{y∈DY} Σ_{x: g(x)=y} g(x) fX(x) = Σ_{x∈DX} g(x) fX(x),

as required.

Probability and Statistics for SIC note 1 of slide 114

Note to Example 88
Note that

E{X(X − 1) · · · (X − r + 1)} = Σ_{x=0}^∞ x(x − 1) · · · (x − r + 1) (λ^x/x!) e^{−λ}
                             = λ^r Σ_{x−r=0}^∞ {λ^{x−r}/(x − r)!} e^{−λ} = λ^r,

which yields E(X) = λ and E{X(X − 1)} = λ^2.

Probability and Statistics for SIC note 2 of slide 114

Properties of the expected value


Theorem 89. Let X be a random variable with a finite expected value E(X), and let a, b ∈ R be
constants. Then
(a) E(·) is a linear operator, i.e., E(aX + b) = aE(X) + b ;
(b) if g(X) and h(X) have finite expected values, then

E{g(X) + h(X)} = E{g(X)} + E{h(X)};

(c) if P(X = b) = 1, then E(X) = b ;


(d) if P(a < X ≤ b) = 1, then a < E(X) ≤ b ;
(e) {E(X)}^2 ≤ E(X^2).

Remark: Linearity of the expected value, (a) and (b), and fact (c), are very useful in calculations.

Probability and Statistics for SIC slide 115

62
Note to Theorem 89
(a) We need to show absolute convergence:
Σ_x |ax + b| f(x) ≤ Σ_x (|a||x| + |b|) f(x) = |a| Σ_x |x| f(x) + |b| Σ_x f(x) < ∞,

and after that we just apply linearity of the summation.

(b) Follows using the same argument as in (a), after noting that |g(x) + h(x)| ≤ |g(x)| + |h(x)|.
(c) Here f(b) = P(X = b) = 1, so E(X) = b f(b) = b by definition.
(d) Now f(x) = 0 for x ∉ (a, b], so E(X) = Σ_x x f(x) ≤ Σ_x b f(x) = b, and similarly E(X) > a.
(e) For any real a, linearity of the expectation gives

0 ≤ E{(X − a)^2} = E(X^2 − 2aX + a^2) = E(X^2) − 2aE(X) + a^2,

and setting a = E(X) and simplifying the right-hand side to E(X^2) − E(X)^2 yields the result.

Probability and Statistics for SIC note 1 of slide 115

Moments of a distribution
Definition 90. If X has a PMF f(x) such that Σ_x |x|^r f(x) < ∞, then
(a) the rth moment of X is E(X^r);
(b) the rth central moment of X is E[{X − E(X)}^r];
(c) the variance of X is var(X) = E[{X − E(X)}^2] (the second central moment);
(d) the standard deviation of X is defined as √var(X) (non-negative);
(e) the rth factorial moment of X is E{X(X − 1) · · · (X − r + 1)}.

Remarks:
 E(X) and var(X) are the most important moments: they represent the ‘average value’ E(X) of
X, and the ‘average squared distance’ of X from its mean, E(X).
 The variance is analogous to the moment of inertia in mechanics: it measures the scatter of X
around its mean, E(X), with small variance corresponding to small scatter, and conversely.
 The expectation and standard deviation have the same units (kg, m, . . . ) as X.

Example 91. Calculate the expectation and variance of the score when we roll a die.

Probability and Statistics for SIC slide 116

Note to Example 91
Now X takes values 1, . . . , 6 with equal probabilities 1/6. Obviously E(|X|) < ∞, and
E(X) = (1 + · · · + 6)/6 = 21/6 = 7/2. The variance is
E[{X − E(X)}^2] = Σ_{x=1}^6 (x − 7/2)^2 × (1/6) = (2/6) × (1/4) × (1 + 9 + 25) = 35/12.

Probability and Statistics for SIC note 1 of slide 116

63
Properties of the variance
Theorem 92. Let X be a random variable whose variance exists, and let a, b be constants. Then

var(X) = E(X^2) − E(X)^2 = E{X(X − 1)} + E(X) − E(X)^2;
var(aX + b) = a^2 var(X);
var(X) = 0 ⇒ X is constant with probability 1.

 The first of these formulae expresses the variance in terms of either the ordinary moments, or the
factorial moments. Usually the first is more useful, but occasionally the second can be used.
 The second formula shows that the variance does not change if X is shifted by a fixed quantity b,
but the dispersion is increased by the square of a multiplier a.
 The third shows that the variance is appropriately named: if X has zero variance, then it does not
vary.

Example 93. Calculate the variance of a Poisson random variable.

Probability and Statistics for SIC slide 117

Note to Theorem 92
(a) Just expand, use linearity of E, and simplify.
(b) Ditto.
(c) If we write E(X) = µ and
var(X) = E[{X − E(X)}^2] = E[{X − µ}^2] = Σ_x f(x)(x − µ)^2 = 0,

then for each x ∈ DX, either x = µ or f(x) = 0. Suppose that f(a), f(b) > 0 and a ≠ b. Then if
var(X) = 0, we must have a = µ = b, which is a contradiction. Therefore f(x) > 0 for a unique value
of x, and then we must have f(x) = 1, so P(X = x) = 1 and (x − µ)^2 = 0; thus
P(X = µ) = fX(µ) = 1.

Probability and Statistics for SIC note 1 of slide 117

Note to Example 93
By recalling Example 88, we find

var(X) = E{X(X − 1)} + E(X) − E(X)^2 = λ^2 + λ − λ^2 = λ.

Probability and Statistics for SIC note 2 of slide 117

64
Poisson du moment

(Source: Copernic)

Probability and Statistics for SIC slide 118

Moment du Poisson

(Source: Copernic)

Probability and Statistics for SIC slide 119

65
Properties of the variance II
Theorem 94. If X takes its values in {0, 1, . . .}, r ≥ 2, and E(X) < ∞, then

E(X) = Σ_{x=1}^∞ P(X ≥ x),
E{X(X − 1) · · · (X − r + 1)} = r Σ_{x=r}^∞ (x − 1) · · · (x − r + 1) P(X ≥ x).

Example 95. Let X ∼ Geom(p). Calculate E(X) and var(X).

Example 96. Each packet of a certain product has equal chances of containing one of n different
types of tokens, independently of each other packet. What is the expected number of packets you will
need to buy in order to get at least one of each type of token?

Probability and Statistics for SIC slide 120

Note to Theorem 94
 The first part of this is

E(X) = Σ_{x=1}^∞ x f(x) = Σ_{x=1}^∞ P(X = x) Σ_{r=1}^x 1 = Σ_{x=1}^∞ P(X ≥ x),

as follows on changing the order of summation, noting that since all the terms are positive, this is
a legal operation.
 The second part is proved in the same way, first writing

r(x − 1) · · · (x − r + 1) = r! (x − 1)!/{(r − 1)!(x − r)!} = r! \binom{x−1}{r−1}.

Then we write

r Σ_{x=r}^∞ (x − 1) · · · (x − r + 1) P(X ≥ x) = Σ_{x=r}^∞ r! \binom{x−1}{r−1} Σ_{y=x}^∞ fX(y) = Σ_{y=r}^∞ fX(y) r! Σ_{x=r}^y \binom{x−1}{r−1},

and use Pascal's triangle (Theorem 17) to find that

r! Σ_{x=r}^y \binom{x−1}{r−1} = r! Σ_{x=r}^y {\binom{x}{r} − \binom{x−1}{r}} = r! \binom{y}{r} = y(y − 1) · · · (y − r + 1)

after cancellations. As required, this gives

Σ_{y=r}^∞ fX(y) y(y − 1) · · · (y − r + 1) = E{X(X − 1) · · · (X − r + 1)}.

Probability and Statistics for SIC note 1 of slide 120

66
Note to Example 95
In this case X ∈ {1, 2, . . .}, and Theorem 94 yields

E(X) = Σ_{x=1}^∞ (1 − p)^{x−1} = 1/{1 − (1 − p)} = 1/p ≥ 1.

For the variance, note that the second part of Theorem 94, with r = 2, gives

E{X(X − 1)} = 2 Σ_{x=2}^∞ (x − 1)(1 − p)^{x−1}
            = 2(1 − p) {−(d/dp) Σ_{x=1}^∞ (1 − p)^{x−1}}
            = 2(1 − p) (d/dp)(−1/p) = 2(1 − p)/p^2.

Hence the variance is

var(X) = E{X(X − 1)} + E(X) − E(X)^2 = 2(1 − p)/p^2 + 1/p − 1/p^2 = (1 − p)/p^2.

This gets smaller as p → 1, and larger as p → 0, as expected.

Probability and Statistics for SIC note 2 of slide 120

Note to Example 96
This can be represented as X1 + X2 + · · · + Xn , where X1 is the number of packets to the first token,
then X2 is the number of packets to the next different token (i.e., not the first), etc. Thus the Xr are
independent geometric variables with probabilities p = n/n, (n − 1)/n, . . . , 1/n. Hence the expectation
is n(1 + 1/2 + 1/3 + · · · + 1/n) ∼ n log n, which → ∞ as n → ∞.
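A short R check of this expectation and of the n log n approximation:

expected_packets <- function(n) n * sum(1 / (1:n))  # n(1 + 1/2 + ... + 1/n)
sapply(c(10, 50, 100), expected_packets)            # about 29.3, 225.0, 518.7
c(10, 50, 100) * log(c(10, 50, 100))                # about 23.0, 195.6, 460.5: same order of growth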

Probability and Statistics for SIC note 3 of slide 120

67
3.3 Conditional Probability Distributions slide 121

Conditional probability distributions


Definition 97. Let (Ω, F, P) be a probability space, on which we define a random variable X, and let
B ∈ F with P(B) > 0. Then the conditional probability mass function of X given B is

fX (x | B) = P(X = x | B) = P(Ax ∩ B)/P(B),

where Ax = {ω ∈ Ω : X(ω) = x}.

Theorem 98. The function fX (x | B) satisfies


fX(x | B) ≥ 0, Σ_x fX(x | B) = 1,

and is thus a well-defined mass function.

Often B is an event of form X ∈ B, for some B ⊂ R, and then


fX(x | B) = P(X = x, X ∈ B)/P(X ∈ B) = P(X ∈ B | X = x)P(X = x)/P(X ∈ B) = I(x ∈ B) fX(x)/P(X ∈ B),

so fX (x | B) = 0 (x 6∈ B) and fX (x | B) ∝ fX (x) (x ∈ B), rescaled to have unit probability.

Example 99. Calculate the conditional PMFs of X ∼ Geom(p), (a) given that X > n, (b) given
that X ≤ n.
Probability and Statistics for SIC slide 122

Note to Theorem 98
We need to check the two properties of a distribution function.
 Non-negativity is obvious because the fX (x | B) = P(X = x | B) are conditional probabilities.
 Now Ax ∩ Ay = ∅ if x ≠ y, and ∪_{x∈R} Ax = Ω. Hence the Ax partition Ω, and thus

Σ_x fX(x | B) = Σ_x P(Ax ∩ B)/P(B) = P(B)/P(B) = 1.

Probability and Statistics for SIC note 1 of slide 122

Note to Example 99
(a) The event B1 = {X > n} has probability (1 − p)^n, so the new mass function is

fX(x | B1) = P(X = x ∩ X > n)/P(X > n) = fX(x)I(x > n)/P(X > n) = p(1 − p)^{x−n−1}, x = n + 1, n + 2, . . . .

This implies that conditional on X > n, X − n has the same distribution as did X originally.
(b) The event B2 = B1^c = {X ≤ n} has probability 1 − (1 − p)^n, so the new mass function is

fX(x | B2) = P(X = x ∩ X ≤ n)/P(X ≤ n) = fX(x)I(x ≤ n)/{1 − (1 − p)^n} = p(1 − p)^{x−1}/{1 − (1 − p)^n}, x = 1, . . . , n.

Probability and Statistics for SIC note 2 of slide 122

68
Conditional expected value
Definition 100. Suppose that Σ_x |g(x)| fX(x | B) < ∞. Then the conditional expected value of
g(X) given B is

E{g(X) | B} = Σ_x g(x) fX(x | B).

Theorem 101. Let X be a random variable with expected value E(X) and let B be an event with
P(B), P(B^c) > 0. Then
E(X) = E(X | B)P(B) + E(X | B^c)P(B^c).
More generally, when {Bi}_{i=1}^∞ is a partition of Ω, P(Bi) > 0 for all i, and the sum is absolutely
convergent,

E(X) = Σ_{i=1}^∞ E(X | Bi)P(Bi).

Probability and Statistics for SIC slide 123

Note to Theorem 101


We prove the second part, of which the first is a special case. The total probability theorem,
Theorem 43, gives

f(x) = P(X = x) = Σ_{i=1}^∞ P(X = x | Bi)P(Bi) = Σ_{i=1}^∞ f(x | Bi)P(Bi),

and this gives

E(X) = Σ_x x f(x) = Σ_x x Σ_{i=1}^∞ f(x | Bi)P(Bi) = Σ_{i=1}^∞ {Σ_x x f(x | Bi)} P(Bi) = Σ_{i=1}^∞ E(X | Bi)P(Bi),

as required. The first part follows on setting B1 = B, B2 = B^c, B3 = B4 = · · · = ∅.

Probability and Statistics for SIC note 1 of slide 123

Example
Example 102. Calculate the expected values for the distributions in Example 99.

Probability and Statistics for SIC slide 124

69
Note to Example 99
(a) Since
fX(x | B1) = p(1 − p)^{x−n−1}, x = n + 1, n + 2, . . . ,
we have

E(X | B1) = Σ_{x=n+1}^∞ x p(1 − p)^{x−n−1} = Σ_{y=1}^∞ (n + y) p(1 − p)^{y−1},

where we have set y = x − n, and hence

E(X | B1) = n Σ_{y=1}^∞ p(1 − p)^{y−1} + Σ_{y=1}^∞ y p(1 − p)^{y−1} = n + 1/p,

since the first sum equals unity and the second is the expectation of a Geom(p) variable.
(b) We can tackle this directly using the expression

E(X | B2) = Σ_{x=1}^n x p(1 − p)^{x−1}/{1 − (1 − p)^n},

or indirectly by writing B = B1 and B^c = B2, which are complementary events, and using
Theorem 101:
E(X) = E(X | B1)P(B1) + E(X | B2)P(B2),
giving
1/p = (n + 1/p)(1 − p)^n + E(X | B2){1 − (1 − p)^n},
and a little algebra yields

E(X | B2) = [1/p − (n + 1/p)(1 − p)^n] / {1 − (1 − p)^n}.

Probability and Statistics for SIC note 1 of slide 124

3.4 Notions of Convergence slide 125

Convergence of distributions
We often want to approximate one distribution by another. The mathematical basis for doing so is the
convergence of distributions.

Definition 103. Let {Xn }, X be random variables whose cumulative distribution functions are {Fn },
F . Then we say that the random variables {Xn } converge in distribution (or converge in law) to
X, if, for all x ∈ R where F is continuous,

Fn (x) → F (x), n → ∞.
We write Xn →D X.

If DX ⊂ Z, then Fn (x) → F (x) if fn (x) → f (x) for all x, n → ∞.

Probability and Statistics for SIC slide 126

70
Law of small numbers
Recall from Theorem 17 that n^{−r} \binom{n}{r} → 1/r! for all r ∈ N, when n → ∞.

Theorem 104 (Law of small numbers). Let Xn ∼ B(n, pn), and suppose that npn → λ > 0 when
n → ∞. Then Xn →D X, where X ∼ Pois(λ).

Theorem 104 can be used to approximate binomial probabilities for large n and small p by Poisson
probabilities.

Example 105. In Example 47 we saw that the probability of having exactly r fixed points in a random
permutation of n objects is
(1/r!) Σ_{k=0}^{n−r} (−1)^k/k! → e^{−1}/r!, r = 0, 1, . . . , as n → ∞,

Thus the number of fixed points has a limiting Pois(1) distribution.

Probability and Statistics for SIC slide 127

Note to Theorem 104


For any fixed r we have
   
\binom{n}{r} pn^r (1 − pn)^{n−r} = n^{−r} \binom{n}{r} × (npn)^r (1 − npn/n)^{n−r} → (1/r!) λ^r e^{−λ}, n → ∞,

which is the required Poisson mass function; call this limiting Poisson random variable X. This
convergence implies that P(Xn ≤ x) → P(X ≤ x) for any fixed real x, since P(Xn ≤ x) is just then a
finite sum of probabilities, each of which is converging to the limiting Poisson probability.

Probability and Statistics for SIC note 1 of slide 127

Law of small numbers


[Figure: probability mass functions of the B(10, 0.5), B(20, 0.25), B(50, 0.1) and Pois(5) distributions.]

Mass functions of three binomial distributions and the Poisson distribution, all with expectation 5.

Probability and Statistics for SIC slide 128

71
Numerical comparison
Example 106 (Binomial and Poisson distributions). Compare P(X ≤ 3) for X ∼ B(20, p), with
p = 0.05, 0.1, 0.2, 0.5 with the results from a Poisson approximation, P(X ′ ≤ 3), with X ′ ∼ Pois(np),
using the functions pbinom and ppois in the software R — see

http://www.r-project.org/

Thus for example we have:

> pbinom(3,size=20,prob=0.05) # Binomial prob, Pr(X <= 3)


[1] 0.9840985
> ppois(3,lambda=20*0.05) # Poisson approx, Pr(X’ <= 3)
[1] 0.9810118
Probability and Statistics for SIC slide 129

People versus Collins


Example 107. In 1964 a handbag was stolen in Los Angeles by a young woman with blond hair in a
pony tail. The thief disappeared, but soon afterwards she was spotted in a yellow car with a bearded
black man with a moustache. The police then arrested a woman called Janet Collins, who matched the
description, and had a black bearded friend with a moustache, who drove a yellow car.

Due to a lack of evidence and of reliable witnesses, the prosecutor tried to convince the jury that
Collins and her friend were the only pair in Los Angeles who could have committed the crime. He
found a probability of p = 1/(12 × 10^6) that a couple picked at random should fit the description, and
they were convicted.

In a higher court it was argued that the number of couples X fitting the description must follow a
Poisson distribution with λ = np, where n is the size of the population to which the couple belong. To
be certain that the couple were guilty, P(X > 1 | X ≥ 1) must be very small. But with n = 10^6,
2 × 10^6, 5 × 10^6, 10 × 10^6, these probabilities are 0.041, 0.081, 0.194, 0.359: it was therefore very far
from certain that they were guilty. They were finally cleared.

Probability and Statistics for SIC slide 130

Note to Example 107


Here the law of small numbers applies, so

P(X > 1 | X ≥ 1) = 1 − P(X = 1 | X ≥ 1) = 1 − λe^{−λ}/(1 − e^{−λ}) = 1 − λ/(e^λ − 1),

with Poisson parameter λ = np = 1/12, 1/6, 5/12 and 5/6 respectively. Calculation gives the required
numbers. In fact here X has a truncated Poisson distribution.
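These probabilities are easily reproduced in R:

n <- c(1, 2, 5, 10) * 1e6
lambda <- n / 12e6               # lambda = n p with p = 1/(12 x 10^6)
1 - lambda / (exp(lambda) - 1)   # 0.041 0.081 0.194 0.359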
Probability and Statistics for SIC note 1 of slide 130

72
Example
Example 108. Let XN be a hypergeometric variable, then

P(XN = x) = \binom{m}{x}\binom{N−m}{n−x} / \binom{N}{n}, x = max(0, m + n − N), . . . , min(m, n).

This is the distribution of the number of white balls obtained when we take a random sample of size n
without replacement from an urn containing m white balls and N − m black balls. Show that when
N, m → ∞ in such a way that m/N → p, where 0 < p < 1,
 
P(XN = x) → \binom{n}{x} p^x (1 − p)^{n−x}, x = 0, . . . , n.

Hence the limiting distribution of XN is B(n, p).

Probability and Statistics for SIC slide 131

Note to Example 108


We apply the last part of Theorem 17, writing
\binom{m}{x}\binom{N−m}{n−x} / \binom{N}{n} = {m^x (N − m)^{n−x} / N^n} × {m^{−x}\binom{m}{x} (N − m)^{−(n−x)}\binom{N−m}{n−x} / (N^{−n}\binom{N}{n})}
                                      → p^x (1 − p)^{n−x} × n!/{x!(n − x)!}, N → ∞,
under the terms of the theorem, as required.
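A numerical illustration of this limit in R; the particular values n = 5, p = 0.3 and x = 2 below are ours:

n <- 5; p <- 0.3
sapply(c(20, 200, 2000), function(NN) dhyper(2, m = round(p * NN), n = NN - round(p * NN), k = n))
# 0.352 0.312 0.309, approaching the binomial value
dbinom(2, size = n, prob = p)   # 0.3087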

Probability and Statistics for SIC note 1 of slide 131

Which distribution?
We have encountered several distributions: Bernoulli, binomial, geometric, negative binomial,
hypergeometric, Poisson—how to choose? Here is a little algorithm to help your reasoning:

Is X based on independent trials (0/1) with a same probability p, or on draws from a finite population,
with replacement?
 If Yes, is the total number of trials n fixed, so X ∈ {0, . . . , n}?
– If Yes: use the binomial distribution, X ∼ B(n, p) (and thus the Bernoulli distribution if
n = 1).
⊲ If n ≈ ∞ or n ≫ np, we can use the Poisson distribution, X ∼ Pois(np).

– If No, then X ∈ {n, n + 1, . . .}, and we use the geometric (if X is the number of trials until
one success) or negative binomial (if X is the number of trials until the last of several
successes) distributions.

 If No, then if the draw is independent but without replacement from a finite population, then X ∼
hypergeometric distribution.
There are many more distributions, and we may choose a distribution on empirical grounds. The
following map comes from Leemis and McQueston (2008, American Statistician) . . .

Probability and Statistics for SIC slide 132

73
Probability and Statistics for SIC slide 133

Probability and Statistics for SIC slide 134

74
4 Continuous Random Variables slide 135

4.1 Basic Ideas slide 136

Continuous random variables


In many situations, we must work with continuous variables:
 the time until the end of the lecture ∈ (0, 45) min;
 the pair (height, weight) ∈ (0, ∞)2 .
Until now we supposed that the support

DX = {x ∈ R : X(ω) = x, ω ∈ Ω}

of X is countable, so X is a discrete random variable. We suppose now that DX is not countable,


which implies also that Ω itself is not countable.

Definition 109 (Reminder). Let (Ω, F, P) be a probability space. The cumulative distribution
function of a rv X defined on (Ω, F, P) is

F (x) = P(X ≤ x) = P(Bx ), x ∈ R,

where Bx = {ω : X(ω) ≤ x} ⊂ Ω.

Probability and Statistics for SIC slide 137

Probability density functions


Definition 110. A random variable X is continuous if there exists a function f (x), called the
probability density function (or density) (PDF) of X, such that
P(X ≤ x) = F(x) = ∫_{−∞}^x f(u) du, x ∈ R.

The properties of F imply that (i) f(x) ≥ 0, and (ii) ∫_{−∞}^∞ f(x) dx = 1.

Remarks:
 Evidently,

f(x) = dF(x)/dx.

 Since P(x < X ≤ y) = ∫_x^y f(u) du for x < y, for all x ∈ R,

P(X = x) = lim_{y↓x} P(x < X ≤ y) = lim_{y↓x} ∫_x^y f(u) du = ∫_x^x f(u) du = 0.

 If X is discrete, then its PMF f (x) is often also called its density function.

Probability and Statistics for SIC slide 138

75
Motivation
We study continuous random variables for several reasons:
 they appear in simple but powerful models—for example, the exponential distribution often
represents the waiting time in a process where events occur completely at random;
 they give simple but very useful approximations for complex problems—for example, the normal
distribution appears as an approximation for the distribution of an average, under fairly general
conditions;
 they are the basis for modelling complex problems either in probability or in statistics—for
example, the Pareto distribution is often a good approximation for heavy-tailed data, in finance
and for the internet.
We will discuss a few well-known distributions, but there are plenty more (see map at the end of
Chapter 3) . . ..

Probability and Statistics for SIC slide 139

Basic distributions
Definition 111 (Uniform distribution). The random variable U having density
f(u) = 1/(b − a) for a ≤ u ≤ b, and f(u) = 0 otherwise, where a < b,

is called a uniform random variable. We write U ∼ U (a, b).

Definition 112 (Exponential distribution). The random variable X having density


f(x) = λe^{−λx} for x > 0, and f(x) = 0 otherwise,

is called an exponential random variable with parameter λ > 0. We write X ∼ exp(λ).

In practice random variables are almost always either discrete or continuous, with exceptions such as
daily rain totals.

Example 113. Find the cumulative distribution functions of the uniform and exponential distributions,
and establish the lack of memory (or memorylessness) property of X:

P(X > x + t | X > t) = P(X > x), t, x > 0.

Probability and Statistics for SIC slide 140

76
Note to Example 113
Integration of the uniform density gives

F(u) = 0 for u ≤ a;  F(u) = (u − a)/(b − a) for a < u ≤ b;  F(u) = 1 for u > b.

Sketch the density and the CDF.

Integration of the exponential density gives

F(x) = 0 for x ≤ 0;  F(x) = 1 − exp(−λx) for x > 0.

Draw the density and the CDF.

For the lack of memory of the exponential distribution, note that

P(X > x + t | X > t) = P(X > x + t)/P(X > t) = exp{−λ(x + t)}/exp(−λt) = exp(−λx), x > 0.

Probability and Statistics for SIC note 1 of slide 140

Gamma distribution
Definition 114 (Gamma distribution). The random variable X having density
f(x) = {λ^α/Γ(α)} x^{α−1} e^{−λx} for x > 0, and f(x) = 0 otherwise,

is called a gamma random variable with parameters α, λ > 0; we write X ∼ Gamma(α, λ).
Here α is called the shape parameter and λ is called the rate, with λ^{−1} the scale parameter. By
letting α = 1 we get the exponential density, and when α = 2, 3, . . . we get the Erlang density.
Slide 99 gives the properties of Γ(·).

[Figure: densities of the exp(1), Gamma(shape = 5, rate = 3), Gamma(shape = 0.5, rate = 0.5) and Gamma(shape = 8, rate = 2) distributions.]

Probability and Statistics for SIC slide 141

77
Laplace distribution
Definition 115 (Laplace). The random variable X having density

f(x) = (λ/2) e^{−λ|x−η|}, x ∈ R, η ∈ R, λ > 0,
is called a Laplace random variable (or sometimes a double exponential) random variable.

(Source: http://www-history.mcs.st-and.ac.uk/PictDisplay/Laplace.html)
Pierre-Simon Laplace (1749–1827): Théorie Analytique des Probabilités (1814)
According to Napoleon Bonaparte: ‘Laplace did not consider any question from the right angle: he
sought subtleties everywhere, conceived only problems, and brought the spirit of “infinitesimals” into
the administration.’
Probability and Statistics for SIC slide 142

Pareto distribution
Definition 116 (Pareto). The random variable X with cumulative distribution function
F(x) = 0 for x < β, and F(x) = 1 − (β/x)^α for x ≥ β, where α, β > 0,

is called a Pareto random variable.

Vilfredo Pareto (1848–1923): Professor at Lausanne University, father of economic science.


(Source: http://www.gametheory.net/dictionary/People/VilfredoPareto.html)

Example 117. Find the cumulative distribution function of the Laplace distribution, and the
probability density function of the Pareto distribution.

Probability and Statistics for SIC slide 143

78
Note to Example 117
For the Laplace distribution, integration of the density gives

F(x) = (1/2) e^{−λ|x−η|} for x ≤ η, and F(x) = 1 − (1/2) e^{−λ|x−η|} for x > η.

Note that F(η) = 1/2, so η is the median of the distribution.

Sketch the density and the CDF.
For the Pareto density, just differentiate with respect to x to obtain the density function,

f(x) = 0 for x < β, and f(x) = αβ^α/x^{α+1} for x ≥ β.

Probability and Statistics for SIC note 1 of slide 143

Moments
Definition 118. Let g(x) be a real-valued function, and X a continuous random variable of density
f (x). Then if E{|g(X)|} < ∞, we define the expectation of g(X) to be
E{g(X)} = ∫_{−∞}^∞ g(x) f(x) dx.

In particular the expectation and the variance of X are

E(X) = ∫_{−∞}^∞ x f(x) dx,
var(X) = ∫_{−∞}^∞ {x − E(X)}^2 f(x) dx = E(X^2) − E(X)^2.

Example 119. Calculate the expectation and the variance of the following distributions: (a) U (a, b);
(b) gamma; (c) Pareto.

Probability and Statistics for SIC slide 144

79
Note to Example 119
(a) Note that we need to compute E(U^r) for r = 1, 2, and this is (b^{r+1} − a^{r+1})/{(r + 1)(b − a)}. Hence
E(U) = (b^2 − a^2)/{2(b − a)} = (b + a)/2, as expected. For the variance, note that

E(U^2) − E(U)^2 = (b^3 − a^3)/{3(b − a)} − (b + a)^2/4 = (b^2 + ab + a^2)/3 − (b^2 + 2ab + a^2)/4 = (b − a)^2/12.

(b) In this case

E(X^r) = ∫_0^∞ x^r × λ^α x^{α−1} Γ(α)^{−1} exp(−λx) dx
       = λ^{−r} Γ(α)^{−1} ∫_0^∞ u^{r+α−1} e^{−u} du
       = λ^{−r} Γ(r + α)/Γ(α).

Properties of the gamma function (slide 99) give

E(X) = α/λ, E(X^2) = α(α + 1)/λ^2, var(X) = E(X^2) − E(X)^2 = α/λ^2.

(c) The expectation is

E(X^r) = ∫_β^∞ αβ^α x^{r−α−1} dx = αβ^r/(α − r)

provided that α > r. If α ≤ r then the moment does not exist. In particular, E(X) < ∞ only if α > 1.

Probability and Statistics for SIC note 1 of slide 144

Conditional densities
We can also calculate conditional cumulative distribution and density functions: for reasonable subsets
A ⊂ R we have
FX(x | X ∈ A) = P(X ≤ x | X ∈ A) = P(X ≤ x ∩ X ∈ A)/P(X ∈ A) = ∫_{Ax} f(y) dy / P(X ∈ A),

where Ax = {y : y ≤ x, y ∈ A}, and

fX(x | X ∈ A) = fX(x)/P(X ∈ A) for x ∈ A, and fX(x | X ∈ A) = 0 otherwise.

With I(X ∈ A) the indicator variable of the event X ∈ A, we can write

E{g(X) | X ∈ A} = E{g(X) I(X ∈ A)} / P(X ∈ A).

Example 120. Let X ∼ exp(λ). Find the density and the cumulative distribution function of X, given
that X > 3.
Probability and Statistics for SIC slide 145

Note to Example 120
With A = (3, ∞), we have P(X ∈ A) = exp(−3λ). Hence

FX(x | X ∈ A) = { 0,                                        x < 3,
                  {exp(−3λ) − exp(−λx)}/exp(−3λ),           x ≥ 3,

and the formula here reduces to 1 − exp{−(x − 3)λ}, x > 3. This is just the exponential distribution function shifted along to x = 3, so the conditional density is the exp(λ) density shifted to start at x = 3. There is a close relation to the lack of memory property.
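A small simulation illustrating this (a sketch in R; λ = 1 and the threshold 3 are those of the example):

set.seed(1)
lambda <- 1
x <- rexp(1e5, rate = lambda)
y <- x[x > 3]                          # keep only the values beyond the threshold
# the conditional survivor function should match exp{-(x - 3)*lambda}
mean(y > 4); exp(-(4 - 3) * lambda)
# equivalently, y - 3 should again look exponential with rate lambda
qqplot(qexp(ppoints(length(y)), rate = lambda), y - 3,
       xlab = "exp(1) quantiles", ylab = "shifted conditional sample"); abline(0, 1)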

Probability and Statistics for SIC note 1 of slide 145

Example
Example 121. To get a visa for a foreign country, you call its consulate every morning at 10 am. On
any given day the civil servant is only there to answer telephone calls with probability 1/2, and when he
does answer, he lets the phone ring for a random amount of time T (min) whose distribution is
FT(t) = { 0,            t ≤ 1,
          1 − t^{−1},   t > 1.

(a) If you call one morning and don’t hang up, what is the probability that you will listen to the ringing
tone for at least s minutes?
(b) You decide to call once every day, but to hang up if there has been no answer after s∗ minutes.
Find the value of s∗ which minimises your time spent listening to the ringing tone.

Probability and Statistics for SIC slide 146

Waiting time in Example 121


[Figure: expected total time spent listening to the ringing tone, as a function of the hang-up time; horizontal axis Time t (min), vertical axis Expected waiting time (min).]

Probability and Statistics for SIC slide 147

Note to Example 121
(a) Let S be the time for which it rings on a given day. Then

P(S > s) = P(S > s | absent)P(absent) + P(S > s | present)P(present)
         = { 1,                   s < 1,
             1/2 + (1/2)s^{−1},   s ≥ 1.

This implies that

FS(s) = P(S ≤ s) = 1 − P(S > s) = { 0,                   s < 1,
                                    (1/2)(1 − s^{−1}),   s ≥ 1.

 This is a defective distribution, since lim_{s→∞} FS(s) < 1, because there is a point mass of 1/2 at
+∞, corresponding to the event that he is absent.
 The expected waiting time if you don't put the phone down is

E(S) = E(S | absent)P(absent) + E(S | present)P(present) = (1/2) × ∞ + (1/2) ∫_1^∞ s × s^{−2} ds = ∞,

so we expect to wait a long time. Even if he is there, E(S | present) = ∞.

(b) If the call is successful before I hang up at time s* > 1, then the expected ringing time is

E(S | S < s*) = ∫_0^{s*} x fS(x | S < s*) dx = ∫_0^{s*} x fS(x)/P(S < s*) dx
              = ∫_1^{s*} x × x^{−2}/(1 − 1/s*) dx = s* log s*/(s* − 1).

The number of calls N until you get through to the visa clerk is a geometric random variable with
success probability

p = P(S < s*) = 1 − P(S > s*) = (1/2)(1 − 1/s*):

there are N − 1 unsuccessful calls each of length s*, followed by a successful call. The number of
unsuccessful calls has expectation

E(N) − 1 = 1/p − 1 = 2/(1 − 1/s*) − 1 = (2s* − s* + 1)/(s* − 1) = (s* + 1)/(s* − 1).

Hence the total time spent listening to the ringing tone is

w(s*) = s*{E(N) − 1} + E(S | S < s*) = s*(s* + 1)/(s* − 1) + s* log s*/(s* − 1).

To minimise this we differentiate with respect to s*, getting

w′(s*) = (s* − 1)^{−2}(s*² − s* − 2 − log s*) = (s* − 1)^{−2}{(s* − 2)(s* + 1) − log s*},

and setting w′(s*) = 0 gives that s* must solve the equation (s* − 2)(s* + 1) = log s*, for s* > 1.
This gives s* = 2.25 minutes as being the optimum length of call, and in this case w(s*) = 7.3
minutes, while E(N) = 1 + 2.6 = 3.6 calls.
Note the shape of the graph on slide 147. The expected total waiting time increases very sharply for
the impatient (who put the phone down before time s*), but not so fast for patient people who wait
beyond time s*. Draw your own conclusions!
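The stationarity equation is easy to solve numerically, for example with uniroot (a sketch in R checking the numbers quoted above):

w <- function(s) s * (s + 1)/(s - 1) + s * log(s)/(s - 1)   # total expected waiting time
g <- function(s) (s - 2) * (s + 1) - log(s)                 # stationarity condition g(s) = 0
s.star <- uniroot(g, interval = c(1.01, 10))$root
c(s.star, w(s.star))                                        # approximately 2.25 and 7.3 minutes
1 / ((1 - 1/s.star)/2)                                      # E(N) = 1/p, approximately 3.6 calls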

Probability and Statistics for SIC note 1 of slide 147

X discrete or continuous?
                           Discrete                    Continuous
Support DX                 countable                   contains an interval (x−, x+) ⊂ R

fX                         mass function               density function
                           dimensionless               units [x]^{−1}
                           0 ≤ fX(x) ≤ 1               0 ≤ fX(x)
                           Σ_{x∈R} fX(x) = 1           ∫_{−∞}^{∞} fX(x) dx = 1

FX(a) = P(X ≤ a)           Σ_{x≤a} fX(x)               ∫_{−∞}^{a} fX(x) dx

P(X ∈ A)                   Σ_{x∈A} fX(x)               ∫_A fX(x) dx

P(a < X ≤ b)               Σ_{a<x≤b} fX(x)             ∫_a^b fX(x) dx

P(X = a)                   fX(a) ≥ 0                   ∫_a^a fX(x) dx = 0

E{g(X)} (if well defined)  Σ_{x∈R} g(x)fX(x)           ∫_{−∞}^{∞} g(x)fX(x) dx

Probability and Statistics for SIC slide 148

4.2 Further Ideas slide 149

Quantiles
Definition 122. Let 0 < p < 1. We define the p quantile of the cumulative distribution function
F (x) to be
xp = inf{x : F (x) ≥ p}.
For most continuous random variables, xp is unique and equals xp = F^{−1}(p), where F^{−1} is the inverse
function of F; then xp is the value for which P(X ≤ xp) = p. In particular, we call the 0.5 quantile the
median of F .

Example 123. Let X ∼ exp(λ). Show that xp = −λ−1 log(1 − p).

Example 124. Find the p quantile of the Pareto distribution.

The infimum is needed when there are jumps in the distribution function, or when it is flat over some
interval. Here is an example:

Example 125. Compute x0.5 and x0.9 for a Bernoulli random variable with p = 1/2.

Probability and Statistics for SIC slide 150

Note to Example 123


We have to solve F (xp ) = 1 − exp(−λxp ) = p, which gives the required result.

Probability and Statistics for SIC note 1 of slide 150

Note to Example 124


We have to solve F(xp) = 1 − (β/xp)^α = p, which gives xp = β(1 − p)^{−1/α}.
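Both quantile formulas are easy to check against the built-in functions (a sketch in R; λ = 2, α = 3, β = 1 are illustrative values):

lambda <- 2; p <- c(0.25, 0.5, 0.9)
cbind(-log(1 - p)/lambda, qexp(p, rate = lambda))   # exponential: formula vs qexp

alpha <- 3; beta <- 1
xp <- beta * (1 - p)^(-1/alpha)                     # Pareto quantile formula
1 - (beta/xp)^alpha                                 # F(xp) should return p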

Probability and Statistics for SIC note 2 of slide 150

Note to Example 125


Recall that in this case

F(x) = { 0,     x < 0,
         1/2,   0 ≤ x < 1,
         1,     x ≥ 1.
There is no value of x such that F (x) = 0.9, but F (x) ≥ 0.9 for every x ≥ 1, so

x0.9 = inf{x : F (x) ≥ 0.9} = inf{x : x ≥ 1} = 1.

Likewise
x0.5 = inf{x : F (x) ≥ 0.5} = inf{x : x ≥ 0} = 0.

Probability and Statistics for SIC note 3 of slide 150

Transformations
We often consider Y = g(X), where g is a known function, and we want to calculate FY and fY given
FX and fX .

Example 126. Let Y = − log(1 − U ), where U ∼ U (0, 1). Calculate FY (y) and discuss. Calculate
also the density and cumulative distribution function of W = − log U . Explain.

Example 127. Let Y = ⌈X⌉, where X ∼ exp(λ) (thus Y is the smallest integer greater than X).
Calculate FY (y) and fY (y).

Probability and Statistics for SIC slide 151

Note to Example 126


Note first that since 0 < U < 1, 1 − U > 0 and taking the log is OK, and we get
Y = − log(1 − U ) > 0. Hence

P(Y ≤ y) = P{− log(1 − U ) ≤ y} = P{U ≤ 1 − exp(−y)} = 1 − exp(−y), y>0

which is the exponential distribution function; note that the transformation here is monotone. Thus Y has an
exponential distribution.
For W = − log U , we have

P(W ≤ w) = P{− log(U ) ≤ w}


= P{log U ≥ −w}
= P(U ≥ e−w )
= 1 − P(U < e−w ) = 1 − e−w , w > 0,

where the < can become an ≤ because there is no probability at individual points in R.
Hence W also has an exponential distribution. This is obvious, because if U ∼ U (0, 1), then
1 − U ∼ U (0, 1) also.
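This is the basis of inversion sampling: applying F^{−1} to uniform variables produces variables with distribution F. A short sketch in R (unit rate; the Q-Q plots anticipate Section 4.4):

set.seed(2)
u <- runif(1e4)
y <- -log(1 - u)                     # should be exp(1)
w <- -log(u)                         # should also be exp(1)
qqplot(qexp(ppoints(length(y))), y, main = "Y = -log(1 - U)"); abline(0, 1)
qqplot(qexp(ppoints(length(w))), w, main = "W = -log(U)");     abline(0, 1)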

Probability and Statistics for SIC note 1 of slide 151

Note to Example 127
Y = r iff r − 1 < X ≤ r, so for r = 1, 2, . . . , we have
P(Y = r) = ∫_{r−1}^{r} fX(x) dx = ∫_{r−1}^{r} λe^{−λx} dx = e^{−λ(r−1)} − e^{−λr} = (e^{−λ})^{r−1}(1 − e^{−λ}).

This is the geometric distribution with probability p = 1 − e−λ .
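A quick simulation check (a sketch in R; λ = 0.7 is an arbitrary illustrative value; note that dgeom in R counts failures, so P(Y = r) corresponds to dgeom(r − 1, p)):

set.seed(3)
lambda <- 0.7
y <- ceiling(rexp(1e5, rate = lambda))           # Y = ceiling(X)
p <- 1 - exp(-lambda)
rbind(empirical = table(y)[1:5]/length(y),       # estimated P(Y = r), r = 1, ..., 5
      theory    = dgeom(0:4, prob = p))          # geometric mass function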

Probability and Statistics for SIC note 2 of slide 151

General transformation
We can formalise the previous discussion in the following way:

Definition 128. Let g : R → R be a function and B ⊂ R any subset of R. Then the preimage g^{−1}(B) ⊂ R is the
set g^{−1}(B) = {x ∈ R : g(x) ∈ B}, i.e., the set that g maps into B.

Theorem 129. Let Y = g(X) be a random variable and By = (−∞, y]. Then
FY(y) = P(Y ≤ y) = { ∫_{g^{−1}(By)} fX(x) dx,      X continuous,
                     Σ_{x ∈ g^{−1}(By)} fX(x),     X discrete,

where g^{−1}(By) = {x ∈ R : g(x) ≤ y}. When g is monotone increasing or decreasing and has
differentiable inverse g^{−1}, then

fY(y) = |dg^{−1}(y)/dy| fX{g^{−1}(y)},   y ∈ R.

Example 130. If X ∼ exp(λ) and Y = exp(X), find FY and fY .

Example 131. Find the distribution and density functions of Y = cos(X), where X ∼ exp(1).

Probability and Statistics for SIC slide 152

Note to Theorem 129
We have
P(Y ∈ B) = P{g(X) ∈ B} = P{X ∈ g −1 (B)},
because X ∈ g−1 (B) if and only if g(X) ∈ g{g −1 (B)} = B.
To find FY (y) we take By = (−∞, y], giving

FY (y) = P(Y ≤ y) = P{g(X) ∈ By } = P{X ∈ g −1 (By )},

which is the formula in the theorem.


When g is monotone increasing with (monotone increasing) inverse g −1 , we have
g−1 {(−∞, y]} = (−∞, g −1 (y)] , and hence

FY (y) = P{Y ∈ By } = P{X ∈ g −1 (By )} = P{X ≤ g −1 (y)} = FX {g−1 (y)}, y ∈ R.

In the case of a continuous random variable X, differentiation gives

fY(y) = dg^{−1}(y)/dy × fX{g^{−1}(y)},   y ∈ R.

When g is monotone decreasing with (monotone decreasing) inverse g−1 , we have


g−1 {(−∞, y]} = [g −1 (y), ∞) , and hence

FY (y) = P{Y ∈ By } = P{X ∈ g −1 (By )} = P{X ≥ g −1 (y)}, y ∈ R.

In the case of a continuous density, FY (y) = P{X ≥ g −1 (y)} = 1 − FX {g−1 (y)} and differentiation
gives
fY(y) = −dg^{−1}(y)/dy × fX{g^{−1}(y)},   y ∈ R;
note that −dg−1 (y)/dy ≥ 0, because g −1 (y) is monotone decreasing.
Thus in both cases we can write
fY(y) = |dg^{−1}(y)/dy| fX{g^{−1}(y)},   y ∈ R.

Probability and Statistics for SIC note 1 of slide 152

Note to Example 130
Note first that since X only puts probability on R+ , Y ∈ (1, ∞).
In terms of the theorem, let By = (−∞, y], and note that g(x) = ex is monotone increasing, with
g−1 (y) = log y, so

P(Y ≤ y) = P(Y ∈ B) = P{g(X) ∈ B} = P{X ∈ g −1 (B)} = P{X ∈ (−∞, log y]} = FX (log y),

so
P(Y ≤ y) = 1 − exp{−λ log y} = 1 − y −λ , y > 1.
Hence Y has the Pareto distribution with β = 1, α = λ, and
fY(y) = { 0,             y ≤ 1,
          λ y^{−λ−1},    y > 1.

To get the density directly, we note that dg−1 (y)/dy = 1/y, and
fY(y) = |dg^{−1}(y)/dy| fX{g^{−1}(y)} = |y^{−1}| × λ e^{−λ log y} = λ y^{−λ−1},   y > 1,

and fY (y) = 0 for y ≤ 1, because if y < 1, then log y < 0, and fX (x) = 0 for x < 0.

Probability and Statistics for SIC note 2 of slide 152

Note to Example 131
 Here Y = g(X) = cos(X) takes values only in the range −1 ≤ y ≤ 1, so if y < −1 then g^{−1}(By) = ∅, and
if y ≥ 1 then g^{−1}(By) = R, thus giving

FY(y) = { 0,   y < −1,
          1,   y ≥ 1.

 A sketch of the function cos x for x ≥ 0 shows that in the range 0 < x < 2π, and for −1 < y < 1,
the event cos(X) ≤ y is equivalent to the event cos^{−1}(y) ≤ X ≤ 2π − cos^{−1}(y). Since the cosine
function is periodic, the set g^{−1}(By) is an infinite union of disjoint intervals. In fact

cos(X) ≤ y  ⇔  X ∈ g^{−1}(By) = ∪_{j=0}^{∞} {x : 2πj + cos^{−1}(y) ≤ x ≤ 2π(j + 1) − cos^{−1}(y)},

and therefore

P(Y ≤ y) = P{X ∈ g^{−1}(By)}
         = Σ_{j=0}^{∞} P{2πj + cos^{−1}(y) ≤ X ≤ 2π(j + 1) − cos^{−1}(y)}
         = Σ_{j=0}^{∞} ( exp[−λ{2πj + cos^{−1}(y)}] − exp[−λ{2π(j + 1) − cos^{−1}(y)}] )
         = ( exp{−λ cos^{−1}(y)} − exp{λ cos^{−1}(y) − 2πλ} ) / {1 − exp(−2πλ)},

where we noticed that the summation is proportional to a geometric series.
 Note that if y = 1, then cos^{−1}(y) = 0, and so P(Y ≤ 1) = 1, and if y = −1, then cos^{−1}(y) = π,
and then P(Y ≤ −1) = 0, as required. Here we used values of cos^{−1}(y) in the range [0, π].
 The density function is found by differentiation: since cos{cos^{−1}(y)} = y, we have

d cos^{−1}(y)/dy = −1/sin{cos^{−1}(y)},

and this gives

fY(y) = λ/sin{cos^{−1}(y)} × ( exp{−λ cos^{−1}(y)} + exp{λ cos^{−1}(y) − 2πλ} ) / {1 − exp(−2πλ)},   y ∈ (−1, 1).

Probability and Statistics for SIC note 3 of slide 152

4.3 Normal Distribution slide 153

Normal distribution
Definition 132. A random variable X having density
 
1 (x − µ)2
f (x) = exp − , x ∈ R, µ ∈ R, σ > 0,
(2π)1/2 σ 2σ 2
2 2
√ with expectation µ and variance σ : we write X ∼ N (µ, σ ). (The
is a normal random variable
standard deviation of X is σ 2 = σ > 0.)
When µ = 0, σ 2 = 1, the corresponding random variable Z is standard normal, Z ∼ N (0, 1), with
density
2
φ(z) = (2π)−1/2 e−z /2 , z ∈ R.
Then Z x Z x
1 2 /2
FZ (x) = P(Z ≤ x) = Φ(x) = φ(z) dz = e−z dz.
−∞ (2π)1/2 −∞

This integral is given in the Formulaire.

Note that f (x) = σ −1 φ{(x − µ)/σ} for x ∈ R.

Probability and Statistics for SIC slide 154

Johann Carl Friedrich Gauss (1777–1855)

The normal distribution is often called the Gaussian distribution. Gauss used it for the combination
of astronomical and topographical measures.

Probability and Statistics for SIC slide 155


Standard normal density

[Figure: the N(0, 1) density φ(z), plotted for −3 ≤ z ≤ 3.]

The famous bell curve:

φ(z) = (2π)^{−1/2} e^{−z²/2},   z ∈ R.

Probability and Statistics for SIC slide 157

Interpretation of N (µ, σ 2)
 The density function is centred at µ, which is the most likely value and also the median;
 the standard deviation σ is a measure of the spread of the values around µ:
– 68% of the probability lies in the interval µ ± σ;
– 95% of the probability lies in the interval µ ± 2σ;
– 99.7% of the probability lies in the interval µ ± 3σ.

Example 133. The average height for a class of students was 178 cm, with standard deviation 7.6 cm.
If this is representative of the population, then 68% have heights in the interval 178 ± 7.6 cm (blue
lines), 95% in the interval 178 ± 2 × 7.6 cm (green lines), and 99.7% in the interval 178 ± 3 × 7.6 cm
(cyan lines, almost invisible).
[Figure: the N(178, 7.6²) density with the intervals 178 ± 7.6, 178 ± 15.2 and 178 ± 22.8 cm marked; horizontal axis Height (cm), vertical axis Density.]
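These percentages come directly from Φ; a quick check in R (a sketch, using the numbers of Example 133):

k <- 1:3
pnorm(k) - pnorm(-k)      # P(mu - k*sigma < X < mu + k*sigma): about 0.683, 0.954, 0.997
# for the heights, e.g. the one-sigma interval:
pnorm(178 + 7.6, mean = 178, sd = 7.6) - pnorm(178 - 7.6, mean = 178, sd = 7.6)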

Probability and Statistics for SIC slide 158

Properties
Theorem 134. The density φ(z), the cumulative distribution function Φ(z), and the quantiles zp of
Z ∼ N (0, 1) satisfy, for all z ∈ R:
(a) the density is symmetric with respect to z = 0, i.e., φ(z) = φ(−z);
(b) P(Z ≤ z) = Φ(z) = 1 − Φ(−z) = 1 − P(Z ≥ z);
(c) the standard normal quantiles zp satisfy zp = −z1−p , for all 0 < p < 1;
(d) z^r φ(z) → 0 when z → ±∞, for all r > 0. This implies that the moments E(Z^r) exist for all
r ∈ N;
(e) we have

φ′(z) = −zφ(z),   φ′′(z) = (z² − 1)φ(z),   φ′′′(z) = −(z³ − 3z)φ(z),   . . .

This implies that E(Z) = 0, var(Z) = 1, E(Z 3 ) = 0, etc.


(f) If X ∼ N (µ, σ 2 ), then Z = (X − µ)/σ ∼ N (0, 1).

Note that if X ∼ N (µ, σ 2 ), then we can write X = µ + σZ, where Z ∼ N (0, 1).

Probability and Statistics for SIC slide 159

Theorem 134
(a) Obvious by substitution:

φ(−z) = (2π)^{−1/2} e^{−(−z)²/2} = (2π)^{−1/2} e^{−z²/2} = φ(z).

(b) Obvious by the symmetry of φ(z), as

Φ(z) = ∫_{−∞}^{z} φ(x) dx = ∫_{−z}^{∞} φ(x) dx = 1 − Φ(−z),

which implies that

P(Z ≤ z) = Φ(z) = 1 − Φ(−z) = 1 − P(Z ≤ −z).

(c) Again obvious by symmetry, using (b): p = Φ(z_p) = 1 − Φ(−z_p) implies that z_p = −z_{1−p}.
(d) This is just a fact from analysis: since e^{z²/2} = Σ_{i=0}^{∞} (z²/2)^i/i! ≥ (z²/2)^{r+1}/(r + 1)! for z > 0, we have

z^r φ(z) ∝ z^r e^{−z²/2} ≤ (r + 1)! 2^{r+1} z^r / z^{2(r+1)} → 0,   z → ∞,

and by symmetry the same will be true when z → −∞.
(e) Differentiate φ(z) repeatedly, and then note that

E(Z) = ∫_{−∞}^{∞} z φ(z) dz = [−φ(z)]_{−∞}^{∞} = 0,    E(Z² − 1) = [φ′(z)]_{−∞}^{∞} = 0,

etc. by (d). Hence E(Z) = 0, E(Z²) = 1, etc.


(f) This is just a change of variable in the density function.

Probability and Statistics for SIC note 1 of slide 159

Values of the function Φ(z)
z 0 1 2 3 4 5 6 7 8 9
0.0 .50000 .50399 .50798 .51197 .51595 .51994 .52392 .52790 .53188 .53586
0.1 .53983 .54380 .54776 .55172 .55567 .55962 .56356 .56750 .57142 .57535
0.2 .57926 .58317 .58706 .59095 .59483 .59871 .60257 .60642 .61026 .61409
0.3 .61791 .62172 .62552 .62930 .63307 .63683 .64058 .64431 .64803 .65173
0.4 .65542 .65910 .66276 .66640 .67003 .67364 .67724 .68082 .68439 .68793
0.5 .69146 .69497 .69847 .70194 .70540 .70884 .71226 .71566 .71904 .72240
0.6 .72575 .72907 .73237 .73565 .73891 .74215 .74537 .74857 .75175 .75490
0.7 .75804 .76115 .76424 .76730 .77035 .77337 .77637 .77935 .78230 .78524
0.8 .78814 .79103 .79389 .79673 .79955 .80234 .80511 .80785 .81057 .81327
0.9 .81594 .81859 .82121 .82381 .82639 .82894 .83147 .83398 .83646 .83891
1.0 .84134 .84375 .84614 .84850 .85083 .85314 .85543 .85769 .85993 .86214
1.1 .86433 .86650 .86864 .87076 .87286 .87493 .87698 .87900 .88100 .88298
1.2 .88493 .88686 .88877 .89065 .89251 .89435 .89617 .89796 .89973 .90147
1.3 .90320 .90490 .90658 .90824 .90988 .91149 .91309 .91466 .91621 .91774
1.4 .91924 .92073 .92220 .92364 .92507 .92647 .92786 .92922 .93056 .93189
1.5 .93319 .93448 .93574 .93699 .93822 .93943 .94062 .94179 .94295 .94408
1.6 .94520 .94630 .94738 .94845 .94950 .95053 .95154 .95254 .95352 .95449
1.7 .95543 .95637 .95728 .95818 .95907 .95994 .96080 .96164 .96246 .96327
1.8 .96407 .96485 .96562 .96638 .96712 .96784 .96856 .96926 .96995 .97062
1.9 .97128 .97193 .97257 .97320 .97381 .97441 .97500 .97558 .97615 .97670
2.0 .97725 .97778 .97831 .97882 .97932 .97982 .98030 .98077 .98124 .98169

Remark: A more detailed table can be found in the Formulaire. You may also use the function pnorm
in the software R: Φ(z) = pnorm(z).

Example 135. Calculate

P(Z ≤ 0.53), P(Z ≤ −1.86), P(−1.86 < Z < 0.53), z0.95 , z0.025 , z0.5 .

Probability and Statistics for SIC slide 160

Note to Example 135


In R we use pnorm for Φ and qnorm for Φ−1 :

> pnorm(0.53)
[1] 0.701944
> pnorm(-1.86)
[1] 0.03144276
> pnorm(0.53)- pnorm(-1.86)
[1] 0.6705013
> qnorm(0.95)
[1] 1.644854
> qnorm(0.025)
[1] -1.959964
> qnorm(0.5)
[1] 0
Probability and Statistics for SIC note 1 of slide 160

Examples and calculations
Example 136. The duration in minutes of a maths lecture is N (47, 4), but should be 45. Give the
probability that (a) the lecture finishes early, (b) the lecture finishes at least 5 minutes late.

Example 137. Show that the expectation and variance of X ∼ N (µ, σ 2 ) are µ and σ 2 , and find the p
quantile of X.

Example 138. Calculate the cumulative distribution function and the density of Y = |Z| and
W = Z 2 , where Z ∼ N (0, 1).

Probability and Statistics for SIC slide 161

Note to Example 136


(a) Note that we can write X = µ + σZ, where Z ∼ N(0, 1). We have X ∼ N(47, 4), and we seek

P(X < 45) = P{(X − 47)/2 < (45 − 47)/2} = P(Z < −1) = 1 − 0.84134 ≈ 0.16.

(b) P(X > 50) = P{(X − 47)/2 > (50 − 47)/2} = P(Z > 1.5) = 1 − 0.93319 ≈ 0.067.

Probability and Statistics for SIC note 1 of slide 161

Note to Example 137


Since we can write X = µ + σZ, and E(Z) = 0 and E(Z 2 ) = var(Z) = 1 by Theorem 134(e), we just
apply the properties of mean and variance from Theorems 89 and 92.

Probability and Statistics for SIC note 2 of slide 161

Note to Example 138


 For Y , note that if y > 0, then P(Y ≤ y) = P(−y ≤ Z ≤ y) = Φ(y) − Φ(−y), and differentiate
to obtain 2φ(y), for y > 0 and zero otherwise.
Alternatively, in the terms of Theorem 129, we have g(x) = |x| and therefore
g−1 (By ) = g −1 {(−∞, y]} = (−y, y), provided that y ≥ 0, and g−1 (By ) = ∅ if y < 0. Therefore
P(Y ≤ y) = ∫_{g^{−1}(By)} φ(x) dx = ∫_{−y}^{y} φ(x) dx = Φ(y) − Φ(−y),   y > 0,

as before.
 For W, the same argument gives P(W ≤ w) = P(−√w ≤ Z ≤ √w) = Φ(√w) − Φ(−√w), for
w > 0. Then differentiate to obtain the density.
In this case g(x) = x² and g^{−1}(Bw) = g^{−1}{(−∞, w]} = (−√w, √w) for w ≥ 0 and g^{−1}(Bw) = ∅
for w < 0. This gives the previous result, by a slightly more laborious route.

Probability and Statistics for SIC note 3 of slide 161

Normal approximation to the binomial distribution
The normal distribution is central to probability, partly because it can be used to approximate
probabilities of other distributions. One of the basic results is:

Theorem 139 (de Moivre–Laplace). Let Xn ∼ B(n, p), where 0 < p < 1, let

µn = E(Xn) = np,    σn² = var(Xn) = np(1 − p),

and let Z ∼ N(0, 1). When n → ∞,

P{(Xn − µn)/σn ≤ z} → Φ(z),   z ∈ R;   i.e.,  (Xn − µn)/σn converges in distribution to Z.

This gives us an approximation of the probability that Xn ≤ r:

P(Xn ≤ r) = P{(Xn − µn)/σn ≤ (r − µn)/σn} ≈ Φ{(r − µn)/σn},

which corresponds, approximately, to Xn ∼ N{np, np(1 − p)}.
In practice the approximation is bad when min{np, n(1 − p)} < 5.

Probability and Statistics for SIC slide 162

Normal and Poisson approximations to the binomial


We have already encountered the Poisson approximation to the binomial distribution, valid for large n
and small p. The normal approximation is valid for large n and min{np, n(1 − p)} ≥ 5. Left: a case
where the normal approximation is valid. Right: a case where the Poisson approximation is valid.

[Figure: four panels showing the B(16, 0.5) and B(16, 0.1) mass functions with their normal approximations (top row) and Poisson approximations (bottom row); horizontal axis r, vertical axis density.]

Probability and Statistics for SIC slide 163

Continuity correction
A better approximation to P(Xn ≤ r) is given by replacing r by r + 1/2; the 1/2 is called the continuity
correction. This gives

P(Xn ≤ r) ≈ Φ[ (r + 1/2 − np) / √{np(1 − p)} ].

[Figure: the B(15, 0.4) mass function with its normal approximation; horizontal axis x, vertical axis density.]

Example 140. Let X ∼ B(15, 0.4). Calculate the exact and approximate values of P(X ≤ r) for
r = 1, 8, 10, with and without the continuity correction. Comment.

Probability and Statistics for SIC slide 164

Note to Example 140


The following R code shows how to do this, but first do some of it on the board using the normal table.

Probability and Statistics for SIC note 1 of slide 164

NumeRical Results

pbinom(c(1,8,10),15,prob=0.4)
[1] 0.005172035 0.904952592 0.990652339

pnorm(c(1,8,10),mean=15*0.4,sd=sqrt(15*0.4*0.6))
[1] 0.004203997 0.854079727 0.982492509

pnorm(c(1,8,10)+0.5,mean=15*0.4,sd=sqrt(15*0.4*0.6))
[1] 0.008853033 0.906183835 0.991146967
Probability and Statistics for SIC slide 165

Example
Example 141. The total number of students in a class is 100.
(a) Each student goes independently to a maths lecture with probability 0.6. What is the size of the
smallest classroom suited for the number of students who go to class, with a probability of 0.95?
(b) There are 14 lectures per semester, and the students decide to go to each lecture independently.
What is now the size of the smallest classroom necessary?

Probability and Statistics for SIC slide 166

Note to Example 141
(a) The number of students present X is B(100, 0.6), so the mean is 100 × 0.6 = 60 and the variance
is 100 × 0.6 × 0.4 = 24. We seek x such that

0.95 = P(X ≤ x) = P{(X − 60)/√24 ≤ (x − 60)/√24} ≈ Φ{(x − 60)/√24},

and this implies that (x − 60)/√24 = Φ^{−1}(0.95) = 1.65, and thus x = 60 + √24 × 1.65 = 68.08.
Better have a room for 69.
(b) Now we want to solve the equation

0.95 = P(X ≤ x)^{14} = P{(X − 60)/√24 ≤ (x − 60)/√24}^{14} ≈ Φ{(x − 60)/√24}^{14},

and this implies that (x − 60)/√24 = Φ^{−1}(0.95^{1/14}) = 2.68, and thus x = 60 + √24 × 2.68 = 73.14.
Better have a room for 74.
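The normal approximation can be compared with the exact binomial quantile (a sketch in R; qbinom gives the exact answer):

qbinom(0.95, size = 100, prob = 0.6)            # (a) exact 0.95 quantile of B(100, 0.6)
60 + sqrt(24) * qnorm(0.95)                     # (a) normal approximation

qbinom(0.95^(1/14), size = 100, prob = 0.6)     # (b) exact, needs P(X <= x)^14 >= 0.95
60 + sqrt(24) * qnorm(0.95^(1/14))              # (b) normal approximation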
Probability and Statistics for SIC note 1 of slide 166

4.4 Q-Q Plots slide 167

Quantile-quantile (Q-Q) plots


One way of comparing a sample X1 , . . . , Xn with a theoretical distribution F :
 we order the Xj , giving
X(1) ≤ X(2) ≤ · · · ≤ X(n) ,
then we plot them against F^{−1}{1/(n + 1)}, F^{−1}{2/(n + 1)}, . . . , F^{−1}{n/(n + 1)}.
 The idea: in an ideal case U1, . . . , Un ∼ U(0, 1) should cut the interval (0, 1) into n + 1
sub-intervals of width 1/(n + 1), so we should plot the graph of the U(j) against 1/(n + 1), . . . ,
n/(n + 1), and thus the X(j), which are equal in distribution to the F^{−1}(U(j)), against the F^{−1}{j/(n + 1)};
 the closer the graph is to a straight line, the more the data resemble a sample from F ;
 we often take a standard version of F (e.g., exp(1), N (0, 1)), and then the F −1 {j/(n + 1)} are
called the plotting positions of F —then the slope gives an estimation of the dispersion parameter
of the distribution, and the value at the origin gives an estimation of the position parameter;
 for the distributions exp(1) and N(0, 1) we have respectively

F^{−1}{j/(n + 1)} = − log{1 − j/(n + 1)}   and   F^{−1}{j/(n + 1)} = Φ^{−1}{j/(n + 1)}

(a short R sketch of such a plot follows this list);

 it is difficult to draw strong conclusions from such a graph for small n, as the variability is then
large—we have a tendency to over-interpret patterns in the plot.
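A minimal sketch of how such a plot can be built by hand in R (the built-in qqnorm does essentially the same thing, up to the choice of plotting positions; the simulated 'heights' are illustrative):

set.seed(4)
x <- rnorm(36, mean = 178, sd = 7.6)            # simulated sample (illustrative)
n <- length(x)
pp <- (1:n)/(n + 1)                             # plotting positions j/(n + 1)
plot(qnorm(pp), sort(x),
     xlab = "Normal plotting positions", ylab = "Ordered data")
abline(mean(x), sd(x), lty = 2)   # rough reference line: intercept ~ location, slope ~ scale
# exponential version: plot(-log(1 - pp), sort(x)) uses the exp(1) plotting positions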

Probability and Statistics for SIC slide 168

Note to the following graphs
 First graph: the normal graph is close to a straight line, whereas the exponential one is not.
Suggests that the normal would be a reasonable model for these data. Derive the formula for the
exponential plotting positions, using the quantile formula for the exponential distribution.
 Second graph: Here we compare the real data (top centre) with simulated data. The fact that it is
hard to tell which is which (you need to remember the shape of the first graph, or to note that
tied observations are impossible with simulations) suggests that the heights can be considered to
be normal.
 The lower left is gamma: there is clearer nonlinearity than with the other panels—but it is hard to
be sure with this sample size.
 The lower middle is obviously not normal; the sample size is big, however.

Probability and Statistics for SIC note 1 of slide 168

Heights of students
Q-Q plots for the heights of n = 36 students in SSC, for the exponential and normal distributions.

[Figure: two panels, 'Exponential Q-Q plot' and 'Normal Q-Q plot'; vertical axes Height (cm), horizontal axes exponential and normal plotting positions.]

Probability and Statistics for SIC slide 169

n = 36: Which sample is not normal?


There are five samples of simulated normal variables, and some real data.
[Figure: six normal Q-Q plots of height (cm) against normal plotting positions, n = 36.]

Probability and Statistics for SIC slide 170

n = 100: Which sample is not normal?
There are five samples of simulated normal variables, and one simulated gamma sample.

[Figure: six normal Q-Q plots of height (cm) against normal plotting positions, n = 100.]

Probability and Statistics for SIC slide 171

n = 500: Which sample is not normal?


There are five samples of simulated normal variables, and one simulated gamma sample.
[Figure: six normal Q-Q plots of height (cm) against normal plotting positions, n = 500.]

Probability and Statistics for SIC slide 172

Which density?
 Uniform variables lie in a finite interval, and give equal probability to each part of the interval;
 exponential and gamma variables lie in (0, ∞), and are often used to model waiting times and
other positive quantities,
– the gamma has two parameters and is more flexible, but the exponential is simpler and has
some elegant properties;
 Pareto variables lie in the interval (β, ∞), so are not appropriate for arbitrary positive quantities
(which could be smaller than β), but are often used to model financial losses over some threshold
β;
 normal variables lie in R and are used to model quantities that arise (or might arise) through
averaging of many small effects (e.g., height and weight, which are influenced by many genetic
factors), or where measurements are subject to error;
 Laplace variables lie in R; the Laplace distribution can be used in place of the normal in situations
where outliers might be present.

Probability and Statistics for SIC slide 173

5. Several Random Variables slide 174
Lexicon

Mathematics English Français


E(X) expected value/expectation of X espérance de X
E(X r ) rth moment of X rième moment de X
var(X) variance of X variance de X
MX (t) moment generating function of X, or fonction génératrice des moments
the Laplace transform of fX (x) ou transformée de Laplace de fX (x)

fX,Y (x, y) joint density/mass function densité/fonction de masse conjointe


FX,Y (x, y) joint (cumulative) distribution function fonction de répartition conjointe
fX|Y (x | y) conditional density function densité conditionnelle
fX,Y (x, y) = fX (x)fY (y) X, Y independent X, Y independantes
iid
X1 , . . . , Xn ∼ F random sample from F échantillon aléatoire

E(X r Y s ) joint moment moment conjoint


cov(X, Y ) covariance of X and Y covariance de X et Y
corr(X, Y ) correlation of X and Y correlation de X et Y
E(X | Y = y) conditional expectation of X espérance conditionnelle de X
var(X | Y = y) conditional variance of X variance conditionnelle de X

X(r) rth order statistic rième statistique d’ordre

Probability and Statistics for SIC slide 175

5.1 Basic Notions slide 176


Motivation
Often we have to consider the way in which several variables vary simultaneously. Some examples:

Example 142. The distribution of (height, weight) of a student picked at random from the class.

Example 143 (Hats, continuation of Example 47). Three men with hats permute them in a random
way. Let I1 be the indicator of the event in which man 1 has his hat, etc. Find the joint distribution of
(I1 , I2 , I3 ).

Our previous definitions generalise in a natural way to this situation.

Probability and Statistics for SIC slide 177

Note to Example 143
The possibilities each have probability 1/6, and with the notation that Ij indicates that the jth hat is
on the right head, are
1 2 3
1 2 3 (I1 , I2 , I3 ) = (1, 1, 1)
1 3 2 (I1 , I2 , I3 ) = (1, 0, 0)
2 1 3 (I1 , I2 , I3 ) = (0, 0, 1)
2 3 1 (I1 , I2 , I3 ) = (0, 0, 0)
3 1 2 (I1 , I2 , I3 ) = (0, 0, 0)
3 2 1 (I1 , I2 , I3 ) = (0, 1, 0)
from which we can compute anything we like.

Probability and Statistics for SIC note 1 of slide 177

Discrete random variables


Definition 144. Let (X, Y ) be a discrete random variable: the set

D = {(x, y) ∈ R2 : P{(X, Y ) = (x, y)} > 0}

is countable. The (joint) probability mass function of (X, Y ) is

fX,Y (x, y) = P{(X, Y ) = (x, y)}, (x, y) ∈ R2 ,

and the (joint) cumulative distribution function of (X, Y ) is

FX,Y (x, y) = P(X ≤ x, Y ≤ y), (x, y) ∈ R2 .

Example 145 (Hats, Continuation of Example 143). Find the joint distribution of
(X, Y ) = (I1 , I2 + I3 ).

Probability and Statistics for SIC slide 178

Note to Example 145


The lines of the table in the previous example all have probabilities 1/6, so, for example, we have

f (0, 0) = P(X = 0, Y = 0) = P{configuration (2, 3, 1)} + P{configuration (3, 1, 2)} = 2/6.

In a similar manner, we obtain:


x y f (x, y)
0 0 2/6
0 1 2/6
0 2 0/6
1 0 1/6
1 1 0/6
1 2 1/6

Probability and Statistics for SIC note 1 of slide 178

Continuous random variables
Definition 146. The random variable (X, Y ) is said to be (jointly) continuous if there exists a
function fX,Y (x, y), called the (joint) density of (X, Y ), such that
P{(X, Y) ∈ A} = ∫∫_{(u,v)∈A} fX,Y(u, v) du dv,   A ⊂ R².

By letting A = {(u, v) : u ≤ x, v ≤ y}, we see that the (joint) cumulative distribution function of
(X, Y) can be written

FX,Y(x, y) = P(X ≤ x, Y ≤ y) = ∫_{−∞}^{x} ∫_{−∞}^{y} fX,Y(u, v) du dv,   (x, y) ∈ R²,

and this implies that

fX,Y(x, y) = ∂²FX,Y(x, y)/(∂x ∂y).

Probability and Statistics for SIC slide 179

Example
Example 147. Calculate the joint cumulative distribution function and P(X ≤ 1, Y ≤ 2) when
fX,Y(x, y) ∝ { e^{−x−y},   y > x > 0,
               0,          otherwise.

We can write f (x, y) = ce−x−y I(y > x)I(x > 0), where I(A) is the indicator function of the set A.

Probability and Statistics for SIC slide 180

Note to Example 147
Note that if min(x, y) ≤ 0, then F(x, y) = 0, and consider the integral for y > x (sketch):

F(x, y) = ∫_{−∞}^{x} du ∫_{−∞}^{y} dv f(u, v)
        = c ∫_0^x e^{−u} { ∫_u^y e^{−v} dv } du
        = c ∫_0^x du e^{−u} [−e^{−v}]_u^y
        = c ∫_0^x du e^{−u} (e^{−u} − e^{−y})
        = c ∫_0^x (e^{−2u} − e^{−u−y}) du
        = c [e^{−u−y} − (1/2) e^{−2u}]_0^x
        = (1/2) c (1 − e^{−2x} − 2e^{−y} + 2e^{−y−x}).

On setting x = y = +∞, we get (1/2) c = 1, and this implies that c = 2.

Now for y ≤ x, consideration of areas shows that we should take the above formula with y = x, so

F(x, y) = { 1 − e^{−2x} + 2e^{−x−y} − 2e^{−y},   y > x > 0,
            1 + e^{−2y} − 2e^{−y},               x ≥ y > 0,
            0,                                   otherwise.

Thus F(1, 2) = 1 − e^{−2} + 2e^{−3} − 2e^{−2} = 1 − 3e^{−2} + 2e^{−3}.

Probability and Statistics for SIC note 1 of slide 180

Exponential families
Definition 148. Let (X1 , . . . , Xn ) be a discrete or continuous random variable with mass/density
function of the form
f(x1, . . . , xn) = exp{ Σ_{i=1}^{p} si(x)θi − κ(θ1, . . . , θp) + c(x1, . . . , xn) },   (x1, . . . , xn) ∈ D ⊂ R^n,

where (θ1 , . . . , θp ) ∈ Θ ⊂ Rp . This is called an exponential family distribution—not to be confused


with the exponential distribution.

Example 149. Show that the (a) Poisson and (b) gamma distributions are exponential families.

Example 150 (Random graph model). (a) Suppose that we have d ≥ 3 nodes, and links appear
between nodes i and j (i ≠ j) independently with probability p. Let Xi,j be the indicator that there is
a link between i and j. Show that the joint mass function of X1,2, . . . , Xd−1,d is an exponential family.
(b) If s1(x) = Σ_{i<j} xi,j and s2(x) = Σ_{i<j<k} xi,j xj,k xk,i, discuss the properties of data from an
exponential family with mass function
exponential family with mass function

f (x1,2 , . . . , xd−1,d ) = exp {s1 (x)θ + s2 (x)β − κ(θ, β) + c(x1,2 , . . . , xd−1,d )} , θ, β ∈ R,

as θ and β vary.

Probability and Statistics for SIC slide 181

Note to Example 149
 (a) We write

f(x; λ) = λ^x exp(−λ)/x! = exp(x log λ − λ − log x!),   λ > 0,  x ∈ {0, 1, . . .},

which is of the required form with n = p = 1, s(x) = x, θ = log λ ∈ Θ = R, κ(θ) = exp(θ), and
c(x) = − log x!.
 (b) We write

f(x; λ, α) = λ^α x^{α−1} exp(−λx)/Γ(α) = exp{α log x − λx + α log λ − log Γ(α) − log x},   λ, α > 0,  x > 0,

which is of the required form with n = 1, p = 2, θ1 = α, θ2 = −λ, so Θ = R+ × R−,
s1(x) = log x, s2(x) = x, so D = R × R+ and κ(θ) = log Γ(θ1) − θ1 log(−θ2), c(x) = − log x.

Probability and Statistics for SIC note 1 of slide 181

Note to Example 150
 (a) Since the Xi,j are Bernoulli variables, we can write f(xi,j) = p^{xi,j} (1 − p)^{1−xi,j}, where
xi,j ∈ {0, 1}, and 0 < p < 1. Since they are independent, their joint mass function is

f(x1,2, . . . , xd−1,d) = Π_{i<j} p^{xi,j} (1 − p)^{1−xi,j}
                       = exp{ Σ_{i<j} xi,j log p + Σ_{i<j} (1 − xi,j) log(1 − p) }
                       = exp[ Σ_{i<j} xi,j log{p/(1 − p)} + {d(d − 1)/2} log(1 − p) ],

which is of the given form with n = d(d − 1)/2, p = 1, s(x) = Σ_{i<j} xi,j, c(x1,2, . . . , xd−1,d) ≡ 0,
θ = log{p/(1 − p)} ∈ Θ = R, and κ(θ) = d(d − 1) log(1 + e^θ)/2 (check this).
Note that p = 1/2 corresponds to θ = 0, which corresponds to links appearing independently with
probability 0.5, whereas setting θ ≪ 0 will give a very sparse graph, with very few links.
 (b) Here s1(x) counts how many links there are, and s2(x) counts how many triangles there are.
Increasing β therefore gives more probability to graphs with lots of triangles, whereas decreasing β
makes triangles less likely. So, taking θ ≪ 0 and β ≫ 0 will tend to give a graph with a few links,
but mostly in triangles. Note that the normalising constant is very complex, as it is

κ(θ, β) = log Σ_x exp{s1(x)θ + s2(x)β},

where the sum is over all 2^n possible values of x = (x1,2, . . . , xd−1,d).


 Exponential families are very useful in practice, because
– many standard distributions can be written as exponential families,
– we can construct new ones to model things of interest to us,
– they have a unified probabilistic and statistical theory, with many nice properties.

Probability and Statistics for SIC note 2 of slide 181

Marginal and conditional distributions
Definition 151. The marginal probability mass/density function of X is
fX(x) = { Σ_y fX,Y(x, y),               discrete case,
          ∫_{−∞}^{∞} fX,Y(x, y) dy,     continuous case,       x ∈ R.

The conditional probability mass/density function of Y given X is

fY|X(y | x) = fX,Y(x, y)/fX(x),   y ∈ R,

provided fX (x) > 0. If (X, Y ) is discrete,

fX (x) = P(X = x), fY |X (y | x) = P(Y = y | X = x).

 The conditional density fY |X (y | x) is undefined if fX (x) = 0. (Why?)


 Analogous definitions exist for fY (y), fX|Y (x | y), and for the conditional cumulative distribution
functions FX|Y (x | y), FY |X (y | x).

Probability and Statistics for SIC slide 182

Examples
Example 152. Calculate the conditional PMF of Y given X, and the marginal PMFs of Example 145.

Example 153. Calculate the marginal and conditional densities for Example 147.

Example 154. Every day I receive a number of emails whose distribution is Poisson, with parameter
µ = 100. Each is a spam independently with probability p = 0.9. Find the distribution of the number
of good emails which I receive. Given that I have received 15 good ones, find the distribution of the
total number that I received.
Probability and Statistics for SIC slide 183

Note to Example 152
The joint mass function can be represented as

x y f (x, y)
0 0 2/6
0 1 2/6
1 0 1/6
1 2 1/6
so

fX (0) = f (0, 0) + f (0, 1) = 2/3, fX (1) = f (1, 0) + f (1, 2) = 1/3,


fY (0) = f (0, 0) + f (1, 0) = 1/2, fY (1) = f (0, 1) = 1/3, fY (2) = f (1, 2) = 1/6,

and from which we can compute the required conditional distribution.


For example, we have
fY|X(y | x = 0) = fX,Y(0, y)/fX(0) = { (2/6)/(2/3) = 1/2,   y ∈ {0, 1},
                                       0,                   otherwise,

and so we obtain
x y f (y | x)
0 0 1/2
0 1 1/2
1 0 1/2
1 2 1/2

Probability and Statistics for SIC note 1 of slide 183

Note to Example 153
 The only interesting cases are when x, y > 0. In this case the marginal density of X is

fX(x) = 2 ∫_x^∞ e^{−x−y} dy = 2e^{−2x},   x > 0,

and obviously this integrates to unity. The marginal density of Y is

fY(y) = 2 ∫_0^y e^{−x−y} dx = 2e^{−y}(1 − e^{−y}),   y > 0,

and its integral is 2(1 − 1/2) = 1, so this is also a valid density function.
 For the conditional densities we have

f(y | x) = 2e^{−x−y}/(2e^{−2x}) = e^{x−y},   y > x,

and

f(x | y) = 2e^{−x−y}/{2e^{−y}(1 − e^{−y})} = e^{−x}/(1 − e^{−y}),   0 < x < y.
It is easy to check that both conditional densities integrate to unity. Compare to Example 120.

Probability and Statistics for SIC note 2 of slide 183

Note to Example 154
Let N denote the total number of emails, and G the number of good ones. Then conditional on
N = n, G ∼ B(n, 1 − p), so

fG,N(g, n) = fG|N(g | n) fN(n) = n!/{g!(n − g)!} (1 − p)^g p^{n−g} × µ^n e^{−µ}/n!,   n ∈ {0, 1, 2, . . .},  g ∈ {0, 1, . . . , n},

where µ > 0 and 0 < p < 1. Thus the number of good emails G has density

fG(g) = Σ_{n=g}^{∞} fG,N(g, n)
      = e^{−µ} µ^g (1 − p)^g/g! × Σ_{n=g}^{∞} µ^{n−g} p^{n−g}/(n − g)!
      = e^{−µ} µ^g (1 − p)^g/g! × Σ_{r=0}^{∞} (µp)^r/r!,   where r = n − g,
      = e^{−µ} µ^g (1 − p)^g/g! × e^{µp} = {µ(1 − p)}^g e^{−µ(1−p)}/g!,   g ∈ {0, 1, . . .},

which is the Poisson mass function with parameter µ(1 − p).
Finally, given that G = g,

fN|G(n | g) = fG,N(g, n)/fG(g)
            = [ n!/{g!(n − g)!} (1 − p)^g p^{n−g} × µ^n e^{−µ}/n! ] / [ e^{−µ(1−p)} µ^g (1 − p)^g/g! ]
            = e^{−pµ} (pµ)^{n−g}/(n − g)!,   n = g, g + 1, . . . ,

which is a Poisson distribution with mean µp, shifted to start at n = g. Thus the number of spams
S = N − G has a Poisson distribution, with mean µp.
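This 'thinning' property is easy to check by simulation (a sketch in R; µ = 100 and p = 0.9 are the values of the example):

set.seed(5)
mu <- 100; p <- 0.9
N <- rpois(1e5, lambda = mu)                       # total emails per day
G <- rbinom(length(N), size = N, prob = 1 - p)     # good emails, given N
c(mean(G), var(G))                                 # both should be near mu*(1 - p) = 10
# conditional check: on days with G = 15, N - 15 should be roughly Poisson(mu*p)
mean(N[G == 15] - 15); mu * p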

Probability and Statistics for SIC note 3 of slide 183

Multivariate random variables


Definition 155. Let X1 , . . . , Xn be rvs defined on the same probability space. Their joint
cumulative distribution function is

FX1 ,...,Xn (x1 , . . . , xn ) = P(X1 ≤ x1 , . . . , Xn ≤ xn )

and their joint density/mass probability function is


fX1,...,Xn(x1, . . . , xn) = { P(X1 = x1, . . . , Xn = xn),                        discrete case,
                              ∂^n FX1,...,Xn(x1, . . . , xn)/(∂x1 · · · ∂xn),      continuous case.

We analogously define the conditional and marginal densities, the cumulative distribution functions,
etc., by replacing (X, Y ) by X = XA , Y = XB , where A, B ⊂ {1, . . . , n} and A ∩ B = ∅. So for
example, if n = 4, we can consider the marginal distribution of (X1 , X2 ) and its conditional
distribution given (X3 , X4 ).

Subsequently everything can be generalised to n variables, but for ease of notation we will mostly limit
ourselves to the bivariate case.
Probability and Statistics for SIC slide 184

Multinomial distribution
Definition 156. The random variable (X1 , . . . , Xk ) has the multinomial distribution of
denominator m and probabilities (p1 , . . . , pk ) if its mass function is
f(x1, . . . , xk) = m!/(x1! × · · · × xk!) p1^{x1} p2^{x2} · · · pk^{xk},   x1, . . . , xk ∈ {0, . . . , m},  Σ_{j=1}^{k} xj = m,

where m ∈ N and p1, . . . , pk ∈ [0, 1], with p1 + · · · + pk = 1.

This distribution appears as the distribution of the number of individuals in the categories {1, . . . , k}
when m independent individuals fall into the classes with probabilities {p1 , . . . , pk }. It generalises the
binomial distribution to k > 2 categories.

Example 157 (Vote). n students vote for three candidates for the presidency of their syndicate. Let
X1 , X2 , X3 be the number of corresponding votes, and suppose that the n students vote independently
with probabilities p1 = 0.45, p2 = 0.4, and p3 = 0.15. Find the joint distribution of X1 , X2 , X3 ,
calculate the marginal distribution of X3 , and the conditional distribution of X1 given X3 = x3 .

Probability and Statistics for SIC slide 185

Note to Example 157
 This is a multinomial distribution with k = 3, denominator n, and the given probabilities. The
joint density is therefore

f(x1, x2, x3) = n!/(x1! x2! x3!) p1^{x1} p2^{x2} p3^{x3},   x1, x2, x3 ∈ {0, . . . , n},  Σ_{j=1}^{3} xj = n.

 The marginal distribution of X3 is the number of votes for the third candidate. If we say that a
vote for him is a success, and a vote for one of the other two is a failure, we see that
X3 ∼ B(n, p3): X3 is binomial with denominator n and probability 0.15.
Alternatively we can compute the marginal density for x3 = 0, . . . , n using Definition 151 with
X = X3 and Y = (X1, X2) as

P(X3 = x3) = Σ_{(x1,x2): x1+x2=n−x3} n!/(x1! x2! x3!) p1^{x1} p2^{x2} p3^{x3}
           = n!/{x3! (x1 + x2)!} p3^{x3} Σ_{(x1,x2): x1+x2=n−x3} (x1 + x2)!/(x1! x2!) p1^{x1} p2^{x2}
           = n!/{x3! (x1 + x2)!} p3^{x3} (p1 + p2)^{n−x3}
           = n!/{(n − x3)! x3!} p3^{x3} (1 − p3)^{n−x3},

using Newton's binomial formula (Theorem 17) and the fact that p1 + p2 = 1 − p3. Thus again we
see that X3 ∼ B(n, p3).
 If we now take the ratio of the joint density of (X1, X2, X3) to the marginal density of X3, we
obtain the conditional density

fX1,X2|X3(x1, x2 | x3) = fX1,X2,X3(x1, x2, x3)/fX3(x3)
                      = [ n!/(x1! x2! x3!) p1^{x1} p2^{x2} p3^{x3} ] / [ n!/{x3! (x1 + x2)!} (p1 + p2)^{x1+x2} p3^{x3} ]
                      = (x1 + x2)!/(x1! x2!) π1^{x1} π2^{x2},   0 ≤ x1 ≤ x1 + x2,

where π1 = p1/(p1 + p2), π2 = 1 − π1. This density is binomial with denominator
n − x3 = x1 + x2 and probability π1 = p1/(1 − p3). Note that X2 = n − x3 − X1, so although the
conditional mass function here has two arguments x1, x2, in reality it is of dimension 1.
 We conclude that, conditional on knowing the vote for one candidate, X3 = x3, the split of votes
for the other two candidates has a binomial distribution. If we regard a vote for candidate 1 as a
'success', then X1 ∼ B(n − x3, π1), where n − x3 is the number of votes not for candidate 3, and
π1 is the conditional probability of voting for candidate 1, given that a voter has not chosen
candidate 3.
Probability and Statistics for SIC note 1 of slide 185

Independence
Definition 158. Random variables X, Y defined on the same probability space are independent if

P(X ∈ A, Y ∈ B) = P(X ∈ A)P(Y ∈ B), A, B ⊂ R.

By letting A = (−∞, x] and B = (−∞, y], we find that

FX,Y (x, y) = · · · = FX (x)FY (y), x, y ∈ R,

implying the equivalent condition

fX,Y (x, y) = fX (x)fY (y), x, y ∈ R, (2)

which will be our criterion of independence. This condition concerns the functions fX,Y (x, y), fX (x),
fY (y): X, Y are independent iff (2) remains true for all x, y ∈ R.

If X, Y are independent, then for all x such that fX (x) > 0,

fY|X(y | x) = fX,Y(x, y)/fX(x) = fX(x)fY(y)/fX(x) = fY(y),   y ∈ R.

Thus knowing that X = x does not affect the density of Y : this is an obvious meaning of
“independence”. By symmetry fX|Y (x | y) = fX (x) for all y such that fY (y) > 0.

Probability and Statistics for SIC slide 186

Examples
Example 159. Are (X, Y ) independent in (a) Example 145? (b) Example 147? (c) when
fX,Y(x, y) ∝ { e^{−3x−2y},   x, y > 0,
               0,            otherwise.

If X and Y are independent, then in particular the support of (X, Y ) must be of the form
SX × SY ⊂ R 2 .

Definition 160. A random sample of size n from a distribution F of density f is a set of n


independent random variables which all have a distribution F . Equivalently we say that X1 , . . . , Xn
are independent and identically distributed (iid) with distribution F , or with density f , and write
X1, . . . , Xn ∼iid F or X1, . . . , Xn ∼iid f.

By independence, the joint density of X1, . . . , Xn ∼iid f is

fX1,...,Xn(x1, . . . , xn) = Π_{j=1}^{n} fX(xj).

Example 161. If X1, X2, X3 ∼iid exp(λ), give their joint density.

Probability and Statistics for SIC slide 187

Note to Example 159
(a) Since

fX(0) fY(2) = (2/3) × (1/6) ≠ fX,Y(0, 2) = 0,

X and Y are dependent. This is obvious, because if I have the wrong hat (i.e., X = 0), then it is
impossible that both other persons have the correct hats (i.e., Y = 2 is impossible).
Finding a single pair (x, y) giving fX,Y(x, y) ≠ fX(x)fY(y) is enough to show dependence, while to
show independence it must be true that fX,Y(x, y) = fX(x)fY(y) for every possible (x, y).
(b) In this case

fX,Y(x, y) = { 2e^{−x−y},   y > x > 0,
               0,           otherwise,

and we previously saw that

fX(x) = 2 exp(−2x) I(x > 0),   fY(y) = 2 exp(−y){1 − exp(−y)} I(y > 0),

so obviously the joint density is not the product of the marginals. This is equally obvious on looking at
the conditional densities.
In this case, the dependence is clear without any computations: the support of (X, Y) is not a
Cartesian product A × B, but it would have to be if X and Y were independent.
(c) The density factorizes and the support is a Cartesian product, so they are independent.

Probability and Statistics for SIC note 1 of slide 187

Note to Example 161


The variables are independent, so

f (x1 , x2 , x3 ) = f (x1 )f (x2 )f (x3 ) = λ3 exp{−λ(x1 + x2 + x3 )}, x1 , x2 , x3 > 0, λ > 0.

Probability and Statistics for SIC note 2 of slide 187

Mixed distributions
We sometimes encounter distributions with X discrete and Y continuous, or vice versa.

Example 162. A big insurance company observes that the distribution of the number of insurance
claims X in one year for its clients does not follow a Poisson distribution. However, a claim is a rare
event, and so it seems reasonable that the distribution of small numbers should be applied. To model
X, we suppose that for each client, the number of claims X in one year follows a Poisson distribution
Pois(y), but that Y ∼ Gamma(α, λ): the mean number of claims for a client with Y = y is then
E(X | Y = y) = y, since certain clients are more likely to make a claim than others.
Find the joint distribution of (X, Y ), the marginal distribution of X, and the conditional distribution of
Y given X = x.

Probability and Statistics for SIC slide 188

Note to Example 162
 If X, conditional on Y = y, has the Poisson density with parameter y > 0, then

fX|Y(x | y) = y^x e^{−y}/x!,   x = 0, 1, . . . ,  y > 0.

Recall also the definition of the gamma function: Γ(a) = ∫_0^∞ u^{a−1} e^{−u} du, for a > 0, and that
Γ(a + 1) = aΓ(a).
 The joint density is

fX|Y(x | y) × fY(y) = y^x e^{−y}/x! × λ^α y^{α−1} e^{−λy}/Γ(α),   x ∈ {0, 1, . . .},  y > 0,

for λ, α > 0. Thus the marginal probability mass function of X is

fX(x) = ∫_{−∞}^{∞} fX,Y(x, y) dy
      = λ^α/{x! Γ(α)} ∫_0^∞ y^{x+α−1} exp{−(λ + 1)y} dy
      = λ^α/{x! Γ(α)} (λ + 1)^{−(x+α)} ∫_0^∞ u^{x+α−1} exp(−u) du,   with u = (λ + 1)y,
      = λ^α/{x! Γ(α)} (λ + 1)^{−(x+α)} Γ(x + α)
      = Γ(x + α)/{x! Γ(α)} {λ/(λ + 1)}^α {1/(λ + 1)}^x,   x = 0, 1, . . . ,

which is negative binomial with parameters p = λ/(λ + 1) and α, in the form given on slide 99.
 The conditional density of Y given that X = x is

fY|X(y | x) = fX,Y(x, y)/fX(x)
            = fX|Y(x | y) fY(y)/fX(x)
            = [ y^x e^{−y}/x! × λ^α y^{α−1} e^{−λy}/Γ(α) ] / [ λ^α (λ + 1)^{−(x+α)} Γ(x + α)/{x! Γ(α)} ]
            = (λ + 1)^{x+α} y^{x+α−1} exp{−y(λ + 1)}/Γ(x + α),   y > 0.

This is gamma with shape parameter α + x and scale parameter 1 + λ.
Hence observing that a customer has x accidents updates our estimate for his/her value of y from
the initial mean E(Y) = α/λ to the posterior mean E(Y | X = x) = (α + x)/(λ + 1). This is
plotted for x = 0, . . . , 4, α = 1, λ = 10 in the figure.
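A sketch of the corresponding plot in R (α = 1, λ = 10 as in the figure): the posterior densities are just gamma densities with updated parameters.

alpha <- 1; lambda <- 10                         # values used in the figure
y <- seq(0, 1, length.out = 200)
plot(y, dgamma(y, shape = alpha, rate = lambda), type = "l",
     xlab = "Accident rate, y", ylab = "Density")        # prior, mean alpha/lambda
for (x in 0:4)                                           # posteriors after x accidents
  lines(y, dgamma(y, shape = alpha + x, rate = lambda + 1), lty = 2)
(alpha + 0:4)/(lambda + 1)                       # posterior means E(Y | X = x)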

Probability and Statistics for SIC note 1 of slide 188

Insurance and learning
[Figure: four panels, described below; axes 'Accident rate, y' against density and 'Number of accidents, x' against probability.]

The graph shows how the knowledge of the number of accidents changes the distribution of the rate of
accidents y for an insured party. Top left: the original density fY (y). Top right: the conditional mass
function fX|Y (x | y = 0.1) for a good driver. Bottom left: the conditional mass function
fX|Y (x | y = 2) for a bad driver. Bottom right: the conditional densities fY |X (y | x) with x = 0
(blue), 1 (red), 2 (black), 3 (green), 4 (cyan) (in order of decreasing maximal density).

Probability and Statistics for SIC slide 189

5.2 Dependence slide 190

Joint moments
Definition 163. Let X, Y be random variables of density fX,Y (x, y). Then if E{|g(X, Y )|} < ∞, we
can define the expectation of g(X, Y ) to be
E{g(X, Y)} = { Σ_{x,y} g(x, y) fX,Y(x, y),         discrete case,
               ∫∫ g(x, y) fX,Y(x, y) dx dy,        continuous case.

In particular we define the joint moments and the joint central moments by

E(X r Y s ), E [{X − E(X)}r {Y − E(Y )}s ] , r, s ∈ N.

The most important of these is the covariance of X and Y ,

cov(X, Y ) = E [{X − E(X)} {Y − E(Y )}] = E(XY ) − E(X)E(Y ).

Probability and Statistics for SIC slide 191

Properties of covariance
Theorem 164. Let X, Y, Z be random variables and a, b, c, d ∈ R constants. The covariance satisfies:

cov(X, X) = var(X);
cov(a, X) = 0;
cov(X, Y ) = cov(Y, X), (symmetry);
cov(a + bX + cY, Z) = b cov(X, Z) + c cov(Y, Z), (bilinearity);
cov(a + bX, c + dY ) = bd cov(X, Y );
var(a + bX + cY ) = b2 var(X) + 2bc cov(X, Y ) + c2 var(Y );
cov(X, Y )2 ≤ var(X)var(Y ), (Cauchy–Schwarz inequality).

Probability and Statistics for SIC slide 192

Note to Theorem 164


 All of this is mechanical computation. The only part that needs any thought is the last. For any
a ∈ R, we have

var(aX + Y ) = a2 var(X) + 2acov(X, Y ) + var(Y ) = Aa2 + Ba + C ≥ 0,

and since this quadratic polynomial in a has at most one real root, we have

B 2 − 4AC = 4cov(X, Y )2 − 4var(X)var(Y ) ≤ 0,

leading to cov(X, Y )2 ≤ var(X)var(Y ).


 Equality would mean that there is precisely one real root, so var(aX + Y ) = 0 for some a, in
which case aX + Y is a constant, c, say, with probability one, and therefore provided a 6= 0 there
is an exact linear relation aX + Y = c between X and Y .
Probability and Statistics for SIC note 1 of slide 192

Independence and covariance


If X and Y are independent and g(X), h(Y ) are functions whose expectations exist, then

E{g(X)h(Y )} = · · · = E{g(X)}E{h(Y )}.

By letting g(X) = X − E(X) and h(Y ) = Y − E(Y ), we can see that if X and Y are independent,
then
cov(X, Y ) = · · · = 0.
Thus X, Y indep ⇒ cov(X, Y ) = 0. However, the converse is false.

Probability and Statistics for SIC slide 193

Linear combinations of random variables
Definition 165. The average of random variables X1, . . . , Xn is X̄ = n^{−1} Σ_{j=1}^{n} Xj.

Lemma 166. Let X1, . . . , Xn be random variables and a, b1, . . . , bn be constants. Then (a)

E(a + b1X1 + · · · + bnXn) = a + Σ_{j=1}^{n} bj E(Xj),

var(a + b1X1 + · · · + bnXn) = Σ_{j=1}^{n} bj² var(Xj) + Σ_{j≠k} bj bk cov(Xj, Xk).

(b) If X1, . . . , Xn are independent, then cov(Xj, Xk) = 0, j ≠ k, so

var(a + b1X1 + · · · + bnXn) = Σ_{j=1}^{n} bj² var(Xj).

(c) If X1, . . . , Xn are independent and all have mean µ and variance σ², then

E(X̄) = µ,   var(X̄) = σ²/n.

Example 167. Let X1 , X2 be independent rv’s with E(X1 ) = 1, var(X1 ) = 1, E(X2 ) = 2,


var(X2 ) = 4, and Y = 16 + 5X1 − 6X2 . Calculate E(Y ), var(Y ).

Probability and Statistics for SIC slide 194

Note to Lemma 166


(a) The expectation of a + b1 X1 + · · · + bn Xn follows easily from the fact that expectation is a linear
operator.
The variance of a + b1 X1 + · · · + bn Xn follows by extending the result on var(a + bX + cY ) from
Theorem 164 in an obvious way.
(b) Obvious.
(c) Use (a) and (b) with a = 0, b1 = · · · = bn = 1/n and the facts that E(Xj ) = µ, var(Xj ) = σ 2
and cov(Xj , Xk ) = 0 when j 6= k, since the variables are independent.

Probability and Statistics for SIC note 1 of slide 194

Note to Example 167


Lemma 166 gives

E(Y ) = E(16 + 5X1 − 6X2 ) = 16 + 5E(X1 ) − 6E(X2 ) = 16 + 5 × 1 − 6 × 2 = 9,


var(Y ) = var(16 + 5X1 − 6X2 ) = 52 var(X1 ) + (−6)2 var(X2 ) = 25 × 1 + 36 × 4 = 169.

Probability and Statistics for SIC note 2 of slide 194

Correlation
Unfortunately the covariance depends on the units of measurement, so we often use the following
dimensionless measure of dependence.

Definition 168. The correlation of X, Y is


corr(X, Y) = cov(X, Y)/{var(X) var(Y)}^{1/2}.

This measures the linear dependence between X and Y .

Example 169. We can model the heredity of a quantitative genetic characteristic as follows. Let X be
its value for a parent, and Y1 and Y2 its values for two children.
Let Z1, Z2, Z3 ∼iid N(0, 1) and take

X = Z1,   Y1 = ρZ1 + (1 − ρ²)^{1/2} Z2,   Y2 = ρZ1 + (1 − ρ²)^{1/2} Z3,   0 < ρ < 1.

Calculate E(X), E(Yj ), corr(X, Yj ) and corr(Y1 , Y2 ).

Probability and Statistics for SIC slide 195

Note to Example 169


 Easy to use linearity of expectation to see that E(X) = E(Yj ) = 0 and that
var(X) = var(Yj ) = 1.
 Since the Zs are independent and therefore are uncorrelated, and using the bilinearity of
covariance, we have
cov(X, Yj ) = cov{Z1 , ρZ1 + (1 − ρ2 )1/2 Zj }
= cov(Z1 , ρZ1 ) + cov{Z1 , (1 − ρ2 )1/2 Zj }
= ρcov(Z1 , Z1 ) + (1 − ρ2 )1/2 cov(Z1 , Zj )
= ρvar(Z1 ) + 0 = ρ.
 Likewise
cov(Y1 , Y2 ) = cov{ρZ1 + (1 − ρ2 )1/2 Z2 , ρZ1 + (1 − ρ2 )1/2 Z3 }
= ρ2 cov(Z1 , Z1 ) + ρ(1 − ρ2 )1/2 cov(Z1 , Z3 ) + ρ(1 − ρ2 )1/2 cov(Z2 , Z1 )
+(1 − ρ2 )cov(Z2 , Z3 )
= ρ2 cov(Z1 , Z1 )
= ρ2 var(Z1 ) = ρ2 .
 Therefore corr(X, Yj ) = ρ > corr(Y1 , Y2 ) = ρ2 , since 0 < ρ < 1
So the correlation between siblings is less than that between a parent and his/her offspring. A
similar computation shows that the correlation between cousins will be ρ4 .
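A simulation check of these correlations (a sketch in R; ρ = 0.5 is an arbitrary illustrative value):

set.seed(6)
rho <- 0.5; n <- 1e5
z1 <- rnorm(n); z2 <- rnorm(n); z3 <- rnorm(n)
x  <- z1
y1 <- rho * z1 + sqrt(1 - rho^2) * z2
y2 <- rho * z1 + sqrt(1 - rho^2) * z3
c(cor(x, y1), rho)        # parent-child correlation: close to rho
c(cor(y1, y2), rho^2)     # sibling correlation: close to rho^2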

Probability and Statistics for SIC note 1 of slide 195

Properties of correlation
Theorem 170. Let X, Y be random variables having correlation ρ = corr(X, Y ). Then:
(a) −1 ≤ ρ ≤ 1;
(b) if ρ = ±1, then there exist a, b, c ∈ R such that

aX + bY + c = 0

with probability 1 (X and Y are then linearly dependent);


(c) if X, Y are independent, then corr(X, Y ) = 0;
(d) the effect of the transformation

(X, Y ) 7→ (a + bX, c + dY )

is
corr(X, Y ) 7→ sign(bd)corr(X, Y ).

Probability and Statistics for SIC slide 196

Note to Theorem 170


(a) Just apply the Cauchy–Schwarz inequality.
(b) Equality in the Cauchy–Schwarz inequality arises iff we have var(aX + bY + c) = 0 for some
a, b, c ∈ R, and this can only mean that aX + bY + c = 0 with probability 1.
(c), (d) Just computations.

Probability and Statistics for SIC note 1 of slide 196

Limitations of correlation
Note that:
 correlation measures linear dependence, as in the upper panels below;
 we can have strong nonlinear dependence, but correlation zero, as in the bottom left panel;
 correlation can be strong but specious, as in the bottom right, where two sub-populations, each
without correlation, are combined.
[Figure: four scatterplots of y against x, with correlations −0.3 and 0.9 (top row), 0 and 0.9 (bottom row), illustrating the three points above.]
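The bottom-left situation is easy to reproduce: a sketch in R in which Y is a deterministic function of X yet the correlation is essentially zero.

set.seed(7)
x <- rnorm(1e4)
y <- x^2                  # Y is completely determined by X ...
cor(x, y)                 # ... yet the sample correlation is near 0
# cov(X, X^2) = E(X^3) - E(X)E(X^2) = 0 for a distribution symmetric about 0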

Probability and Statistics for SIC slide 197

Correlation ≠ causation
Two variables can be very correlated without one causing changes in the other.
 The left panel shows strong dependence between the number of mobile phone transmitter masts,
and the number of births in UK towns. Do masts increase fertility?
 The right panel shows that this dependence disappears when we allow for population size: more
people ⇒ more births and more transmitter masts. Adding masts will not lead to more babies.

[Figure: left panel, total births in 2009 against number of transmitter masts (rho = 0.92); right panel, birth rate in 2009 against number of transmitter masts (rho = −0.09).]

Probability and Statistics for SIC slide 198

Conditional expectation
Definition 171. Let g(X, Y ) be a function of a random vector (X, Y ). Its conditional expectation
given X = x is
E{g(X, Y) | X = x} = { Σ_y g(x, y) fY|X(y | x),               in the discrete case,
                       ∫_{−∞}^{∞} g(x, y) fY|X(y | x) dy,     in the continuous case,

on the condition that fX (x) > 0 and E{|g(X, Y )| | X = x} < ∞. Note that the conditional
expectation E{g(X, Y ) | X = x} is a function of x.

Example 172. Let Z = XY , where X and Y are independent, X having a Bernoulli distribution with
probability p, and Y having the Poisson distribution with parameter λ.
 Find the density of Z.
 Find E(Z | X = x).

Example 173. Calculate the conditional expectation and variance of the total number of emails
received in Example 154, given the arrival of g good emails.

Probability and Statistics for SIC slide 199

Note to Example 172
The event Z = 0 occurs iff we have either X = 0 and Y takes any value, or if X = 1 and Y = 0.
Since X and Y are independent, we therefore have

fZ (0) = Σ_{y=0}^{∞} P(X = 0, Y = y) + P(X = 1, Y = 0)
       = Σ_{y=0}^{∞} P(X = 0)P(Y = y) + P(X = 1)P(Y = 0)
       = P(X = 0) Σ_{y=0}^{∞} P(Y = y) + P(X = 1)P(Y = 0)
       = (1 − p) × 1 + p × e^{−λ} .

Similarly

fZ (z) = P(X = 1, Y = z) = P(X = 1)P(Y = z) = p × λz e−λ /z!, z = 1, 2, . . . .

No other values for Z are possible. Clearly the above probabilities are non-negative, and

Σ_{z=0}^{∞} fZ (z) = (1 − p) + pe^{−λ} + Σ_{z=1}^{∞} pλ^z e^{−λ}/z! = (1 − p) + p Σ_{z=0}^{∞} λ^z e^{−λ}/z! = (1 − p) + p = 1,

so the function given by fZ (0) = (1 − p) + pe^{−λ} and fZ (z) = pλ^z e^{−λ}/z! for z = 1, 2, . . .
is indeed a density function.
Now

E(Z | X = x) = E(XY | X = x) = E(xY | X = x) = xE(Y | X = x) = xE(Y ) = xλ,

since if we know that X = x, then the value x of X is a constant, and since Y and X are
independent, E{h(Y ) | X = x} = E{h(Y )} for any function h(Y ). Therefore

E(Z | X = 0) = 0, E(Z | X = 1) = λ.

Probability and Statistics for SIC note 1 of slide 199

Note to Example 173


The number of spams S = N − G has a Poisson distribution, with mean pµ. Thus the conditional
expectation of N given G = g is

E(N | G = g) = E(S + G | G = g) = E(S + g | G = g) = pµ + g,

because conditional on G = g, we treat g as a constant, and S ∼ Poiss(pµ). Likewise

var(N | G = g) = var(S + G | G = g) = var(S + g | G = g) = var(S | G = g) = pµ.

Probability and Statistics for SIC note 2 of slide 199

Expectation and conditioning
Sometimes it is easier to calculate E{g(X, Y )} in stages.

Theorem 174. If the required expectations exist, then

E{g(X, Y )} = EX [E{g(X, Y ) | X = x}] ,


var{g(X, Y )} = EX [var{g(X, Y ) | X = x}] + varX [E{g(X, Y ) | X = x}] .

where EX and varX represent expectation and variance according to the distribution of X.

Example 175. n = 200 persons pass a busker on a given day. Each one of them decides
independently with probability p = 0.05 to give him money. The donations are independent, and have
expectation µ = 2$ and variance σ 2 = 1$2 . Find the expectation and the variance of the amount of
money he receives.

Probability and Statistics for SIC slide 200

Note to Example 175
 Let Xj = 1 if the jth person decides to give him money and Xj = 0 otherwise, and let Yj be the
amount of money given by the jth person, if money is given. Then we can write his total takings as

T = g(X, Y ) = Y1 X1 + · · · + Yn Xn ,
where X1 , . . . , Xn are iid B(1, p) Bernoulli variables and Y1 , . . . , Yn are iid with mean µ and
variance σ 2 . We want to compute E(T ) and var(T ).
 We first condition on X1 , . . . , Xn , in which case (using an obvious shorthand notation)
E(T | X = x) = E(Y1 X1 + · · · + Yn Xn | X = x)
             = Σ_{j=1}^{n} E(Yj Xj | X = x)
             = Σ_{j=1}^{n} xj E(Yj | X = x) = Σ_{j=1}^{n} xj E(Yj ) = Σ_{j=1}^{n} xj µ = µ Σ_{j=1}^{n} xj ,
var(T | X = x) = var(Y1 X1 + · · · + Yn Xn | X = x)
             = Σ_{j=1}^{n} var(Yj Xj | X = x), by independence of the Yj ,
             = Σ_{j=1}^{n} xj^2 var(Yj | X = x) = Σ_{j=1}^{n} xj^2 σ^2 = σ^2 Σ_{j=1}^{n} xj .
In these expressions the Xj are treated as fixed quantities xj and are regarded as constants, since
the computations are conditional on Xj = xj . Note that x2j = xj , since xj = 0, 1.
 Now we ‘uncondition’, by replacing the values xj of the Xj by the corresponding random variables,
and in order to calculate the expressions in Theorem 174 we therefore need to compute

E( µ Σ_{j=1}^{n} Xj ),   var( µ Σ_{j=1}^{n} Xj ),   E( σ^2 Σ_{j=1}^{n} Xj ).

We have that S = Σ_{j=1}^{n} Xj ∼ B(n, p), so S has mean np and variance np(1 − p), and this yields
E(T ) = EX [E{T | X = x}] = EX (µS) = µEX (S) = npµ = 200 × 0.05 × 2 = 20,
var(T ) = EX [var{T | X = x}] + varX [E{T | X = x}]
= EX (σ 2 S) + varX (µS) = npσ 2 + µ2 np(1 − p)
= 200 × 0.05 × 1 + 2^2 × 200 × 0.05 × 0.95 = 48.
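A small Monte Carlo check of these two numbers (a sketch only; the example fixes just the mean and variance of the donations, so the normal law used for the Yj below is our own assumption):

    import numpy as np

    rng = np.random.default_rng(2)
    n, p, mu, sigma = 200, 0.05, 2.0, 1.0
    reps = 100_000

    X = rng.binomial(1, p, size=(reps, n))       # who decides to give
    Y = rng.normal(mu, sigma, size=(reps, n))    # amounts given (assumed normal here)
    T = (X * Y).sum(axis=1)                      # total takings per day

    print(T.mean(), T.var())   # close to E(T) = 20 and var(T) = 48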
Probability and Statistics for SIC note 1 of slide 200

5.3 Generating Functions slide 201

Definition
Definition 176. We define the moment-generating function of a random variable X by

MX (t) = E(etX )

for t ∈ R such that MX (t) < ∞.

 MX (t) is also called the Laplace transform of fX (x).


 The MGF is useful as a summary of all the properties of X, since we can write

MX (t) = E(e^{tX}) = E( Σ_{r=0}^{∞} t^r X^r / r! ) = Σ_{r=0}^{∞} (t^r / r!) E(X^r ),

from which we can obtain all the moments E(X r ) by differentiation.

Example 177. Calculate MX (t) when: (a) X = c with probability one; (b) X is an indicator variable;
(c) X ∼ B(n, p); (d) X ∼ Pois(λ); (e) X ∼ N (µ, σ 2 ).

Probability and Statistics for SIC slide 202

Note to Example 177
(a) X is discrete, so MX (t) = 1 × et×c = ect , valid for t ∈ R.
(b) Here MX (t) = (1 − p)et×0 + pet×1 = 1 − p + pet , valid for t ∈ R.
(c) Using the binomial theorem we have

MX (t) = Σ_{x=0}^{n} \binom{n}{x} e^{tx} p^x (1 − p)^{n−x} = Σ_{x=0}^{n} \binom{n}{x} (pe^t )^x (1 − p)^{n−x} = (1 − p + pe^t )^n , t ∈ R.

(d) We have

MX (t) = Σ_{x=0}^{∞} e^{xt} λ^x e^{−λ}/x! = e^{−λ} Σ_{x=0}^{∞} (λe^t )^x /x! = exp(λe^t ) e^{−λ} = exp{λ(e^t − 1)}, t ∈ R,

where we have used the exponential series e^a = Σ_{n=0}^{∞} a^n /n! for any a ∈ R.
(e) We first consider Z ∼ N (0, 1) and compute
E(e^{tZ}) = ∫_{−∞}^{∞} e^{tz} φ(z) dz = ∫_{−∞}^{∞} e^{tz} (2π)^{−1/2} e^{−z^2 /2} dz.

The fact that the N (µ, σ 2 ) density integrates to 1, i.e.,


∫_{−∞}^{∞} (2πσ^2 )^{−1/2} e^{−(x−µ)^2 /(2σ^2 )} dx = 1, µ ∈ R, σ > 0,

implies, on expanding the exponent and re-arranging the result, that


∫_{−∞}^{∞} (2π)^{−1/2} exp{ −x^2 /(2σ^2 ) + xµ/σ^2 } dx = σ exp{ µ^2 /(2σ^2 ) }, µ ∈ R, σ ∈ R+ .

If we take σ^2 = 1, µ = t, the left-hand side is the MGF of Z, and the right is e^{t^2 /2}, valid for any t ∈ R.
(As an aside, note that if we take µ = 0, σ 2 = 1/(1 − 2t), then the left-hand side is the MGF of Z 2 ,
and the right is (1 − 2t)−1/2 , valid only if t < 1/2. Thus

MZ 2 (t) = (1 − 2t)−1/2 , t < 1/2.

This is the moment-generating function of a chi-squared random variable with one degree of freedom.)
Now note that

E(etX ) = E[exp{t(µ + σZ)}]


= exp(tµ)E[exp{(tσ)Z}]
= exp{tµ + (tσ)2 /2}
= exp(tµ + t2 σ 2 /2), t ∈ R.

Probability and Statistics for SIC note 1 of slide 202

Important theorems I
Theorem 178. If M (t) is the MGF of a random variable X, then

MX (0) = 1;
M_{a+bX}(t) = e^{at} MX (bt);
E(X^r ) = ∂^r MX (t)/∂t^r |_{t=0} ;
E(X) = MX′ (0);
var(X) = MX′′ (0) − MX′ (0)^2 .

Example 179. Find the expectation and the variance of X ∼ exp(λ).

Theorem 180 (No proof). There exists an injection between the cumulative distribution functions
FX (x) and the moment-generating functions MX (t).

Theorem 180 is very useful, as it says that if we recognise a MGF, we know to which distribution it
corresponds.

Probability and Statistics for SIC slide 203

Note to Theorem 178


This is just a series of mechanical computations, the last three of which involve differentiation of
MX (t) with respect to t under the integral sign.

Probability and Statistics for SIC note 1 of slide 203

Note to Example 179


A simple calculation gives

MX (t) = ∫_{0}^{∞} e^{xt} λe^{−λx} dx = λ ∫_{0}^{∞} e^{−(λ−t)x} dx = λ/(λ − t), t < λ.

Then we just differentiate MX (t) twice, getting



MX′ (t) = λ/(λ − t)^2 , MX′′ (t) = 2λ/(λ − t)^3 ,

and set t = 0, using Theorem 178 to get the expectation and variance, λ−1 and λ−2 respectively.
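The same answer can be checked symbolically by differentiating the MGF, for instance with sympy (a sketch; variable names are ours):

    import sympy as sp

    t, lam = sp.symbols('t lambda', positive=True)
    M = lam / (lam - t)                  # MGF of exp(lambda), valid for t < lambda

    EX  = sp.diff(M, t, 1).subs(t, 0)    # E(X)
    EX2 = sp.diff(M, t, 2).subs(t, 0)    # E(X^2)
    print(EX, sp.simplify(EX2 - EX**2))  # 1/lambda and 1/lambda**2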

Probability and Statistics for SIC note 2 of slide 203

Linear combinations
Theorem 181. Let a, b1 , . . . , bn ∈ R and X1 , . . . , Xn be independent rv’s whose MGFs exist. Then
Y = a + b1 X1 + · · · + bn Xn has MGF
MY (t) = · · · = e^{ta} ∏_{j=1}^{n} M_{Xj}(tbj ).

In particular, if X1 , . . . , Xn is a random sample, then S = X1 + · · · + Xn has

MS (t) = MX (t)n .
Example 182. Let X1 ∼ Pois(λ) and X2 ∼ Pois(µ) be independent. Find the distribution of X1 + X2 .

Example 183. Let X1 , . . . , Xn be independent with Xj ∼ N (µj , σj2 ). Show that

Y = a + b1 X1 + · · · + bn Xn ∼ N (a + b1 µ1 + · · · + bn µn , b21 σ12 + · · · + b2n σn2 ) :

thus a linear combination of normal rv’s is normal.


Probability and Statistics for SIC slide 204

Note to Theorem 181


This simple calculation uses the fact that independence of X1 , . . . , Xn implies that the expectation of
a product is the product of the expectations.
Probability and Statistics for SIC note 1 of slide 204

Note to Example 182


Theorem 181 implies that

MX1 +X2 (t) = MX1 (t)MX2 (t) = exp{(λ + µ)(e^t − 1)}, t ∈ R,

so by Theorem 180 and Example 177(d) we recognise that X1 + X2 is a Poisson variable with
parameter λ + µ.
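A quick numerical illustration (a sketch; the values λ = 2 and µ = 3.5 are arbitrary choices):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(3)
    lam, mu, reps = 2.0, 3.5, 200_000
    S = rng.poisson(lam, reps) + rng.poisson(mu, reps)

    # empirical distribution of the sum versus the Pois(lam + mu) mass function
    for k in range(8):
        print(k, round((S == k).mean(), 4), round(stats.poisson.pmf(k, lam + mu), 4))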
Probability and Statistics for SIC note 2 of slide 204

Note to Example 183


Since the Xj are independent and their MGFs are MXj (t) = exp(tµj + t2 σj2 /2), we can first use
Theorem 181 to see that
MY (t) = e^{ta} ∏_{j=1}^{n} M_{Xj}(tbj )
       = exp(ta) ∏_{j=1}^{n} exp(tbj µj + t^2 bj^2 σj^2 /2)
       = exp[ t(a + b1 µ1 + · · · + bn µn ) + (t^2 /2)(b1^2 σ1^2 + · · · + bn^2 σn^2 ) ],

and then Theorem 180 to obtain

Y ∼ N (a + b1 µ1 + · · · + bn µn , b21 σ12 + · · · + b2n σn2 ).

Probability and Statistics for SIC note 3 of slide 204

Important theorems II
D
Definition 184 ( −→ , Reminder). Let {Xn }, X be random variables whose cumulative distribution
functions are {Fn }, F . Then we say that the random variables {Xn } converge in distribution to X,
if, for all x ∈ R where F is continuous,

Fn (x) → F (x), n → ∞.
D
We then write Xn −→ X.

Theorem 185 (Continuity, no proof). Let {Xn }, X be random variables with distribution functions
{Fn }, F , whose MGFs Mn (t), M (t) exist for 0 ≤ |t| < b. Then if Mn (t) → M (t) for |t| ≤ a < b when
D
n → ∞, then Xn −→ X, i.e., Fn (x) → F (x) at each x ∈ R where F is continuous.

Example 186 (Law of small numbers, II). Let Xn ∼ B(n, pn ) and X ∼ Pois(λ). Show that if
n → ∞, pn → 0 in such a way that npn → λ, then
D
Xn −→ X.
Probability and Statistics for SIC slide 205

Note to Example 186


The results from Example 177 give MXn (t) = (1 − pn + pn et )n for Xn ∼ B(n, pn ) and
MX (t) = exp{λ(et − 1)} for X ∼ Pois(λ), both valid for t ∈ R.
We use the fact that if a ∈ R, then (1 + a/n)n → ea as n → ∞.
If n → ∞ and npn → λ, we can write
MXn (t) = (1 − pn + pn e^t )^n = { 1 + npn (e^t − 1)/n }^n → exp{λ(e^t − 1)} = MX (t), t ∈ R,

and this is true for any t ∈ R. Hence the hypothesis of the theorem is clearly satisfied, and thus
D
Xn −→ X.
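Numerically, the approximation is already very good for moderate n; a sketch (with n = 1000 and λ = 3 chosen arbitrarily):

    import numpy as np
    from scipy import stats

    n, lam = 1000, 3.0
    p = lam / n
    ks = np.arange(10)
    print(np.round(stats.binom.pmf(ks, n, p), 4))    # B(n, p) probabilities
    print(np.round(stats.poisson.pmf(ks, lam), 4))   # Pois(lam) probabilities, almost identical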
Probability and Statistics for SIC note 1 of slide 205

Mean vector and covariance matrix
Definition 187. Let X = (X1 , . . . , Xp )T be a p × 1 vector of random variables. Then

E(X)p×1 = ( E(X1 ), . . . , E(Xp ) )^T ,

var(X)p×p = [ var(X1 )       cov(X1 , X2 )  · · ·  cov(X1 , Xp )
              cov(X1 , X2 )  var(X2 )       · · ·  cov(X2 , Xp )
              . . .
              cov(X1 , Xp )  cov(X2 , Xp )  · · ·  var(Xp )      ] ,

are called the expectation (mean vector) and the (co)-variance matrix of X.

The matrix var(X) is positive semi-definite, since

var( Σ_{j=1}^{p} aj Xj ) = a^T var(X) a ≥ 0

for all vectors a = (a1 , . . . , ap )T ∈ Rp .

Probability and Statistics for SIC slide 206

Moment-generating function: multivariate case


Definition 188. The moment-generating function (MGF) of a random vector
Xp×1 = (X1 , . . . , Xp )T is
MX (t) = E(e^{t^T X}) = E(e^{Σ_{r=1}^{p} tr Xr }), t ∈ T ,

where T = {t ∈ Rp : MX (t) < ∞}. Let the rth and (r, s)th elements of the mean vector E(X)p×1
and of the covariance matrix var(X)p×p be the quantities E(Xr ) and cov(Xr , Xs ).

The MGF has the following properties:


 0 ∈ T , thus MX (0) = 1;
 we have

E(X)p×1 = MX′ (0) = ∂MX (t)/∂t |_{t=0} , var(X)p×p = ∂^2 MX (t)/∂t∂t^T |_{t=0} − MX′ (0) MX′ (0)^T ;

 if A, B ⊂ {1, . . . , p} and A ∩ B = ∅, and we write XA for the subvector of X containing


{Xj : j ∈ A}, etc., then XA and XB are independent iff
T T
MX (t) = E(etA XA +tB XB ) = MXA (tA )MXB (tB ), t∈T;

 there is an injective mapping between MGFs and probability distributions.

Probability and Statistics for SIC slide 207

Example
Example 189. Emails arrive as a Poisson process with rate λ emails per day: the number of emails
arriving each day has the Poisson distribution with parameter λ. Each is a spam with probability p.
Show that the numbers of good emails and of spams are independent Poisson variables with
parameters (1 − p)λ and pλ.

Probability and Statistics for SIC slide 208

Note to Example 189


Let N = S + G be the total number of spam and good emails, and note that N ∼ Poiss(λ), while
S = Σ_{j=1}^{N} Ij and G = Σ_{j=1}^{N} (1 − Ij ), with Ij being the indicator that the jth message is a spam.
The joint MGF of S and G is therefore

E[exp(t1 S + t2 G)] = E( exp[ Σ_{j=1}^{N} {t1 Ij + t2 (1 − Ij )} ] )
                    = EN ( E[ exp{ Σ_{j=1}^{N} (t1 Ij + t2 (1 − Ij )) } | N = n ] ),

where we have used the iterated expectation formula from Theorem 174. The inner expectation is

E[ exp{ Σ_{j=1}^{N} (t1 Ij + t2 (1 − Ij )) } | N = n ] = ∏_{j=1}^{n} E[exp{t1 Ij + t2 (1 − Ij )}]

because conditional on N = n, the I1 , . . . , In are independent, and because they are Bernoulli
variables each with success probability p, we have

E[exp{t1 Ij + t2 (1 − Ij )}] = e^{t1} P(Ij = 1) + e^{t2} P(Ij = 0) = pe^{t1} + (1 − p)e^{t2} .

Therefore

E[ exp{ Σ_{j=1}^{N} (t1 Ij + t2 (1 − Ij )) } | N = n ] = { (1 − p)e^{t2} + pe^{t1} }^n ,

and on inserting the right-hand side of this into the original expectation, and then treating N = n as
random with a Poiss(λ) distribution, we get

E{exp(t1 S + t2 G)} = EN [ { (1 − p)e^{t2} + pe^{t1} }^N ]
                    = Σ_{n=0}^{∞} e^{−λ} (λ^n /n!) { (1 − p)e^{t2} + pe^{t1} }^n
                    = exp[ −λ + λ{ (1 − p)e^{t2} + pe^{t1} } ]
                    = exp[ −λ(1 − p + p) + λ{ (1 − p)e^{t2} + pe^{t1} } ]
                    = exp{ −λ(1 − p) + λ(1 − p)e^{t2} } × exp{ −λp + λpe^{t1} }
                    = E{exp(t2 G)} × E{exp(t1 S)},

which is the MGF of two independent Poisson variables G and S with means (1 − p)λ and pλ, as
required.
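The conclusion is easy to illustrate by simulation (a sketch, with λ = 10 and p = 0.3 chosen arbitrarily): thinning the Poisson count with independent Bernoulli labels gives two Poisson counts that are essentially uncorrelated.

    import numpy as np

    rng = np.random.default_rng(4)
    lam, p, reps = 10.0, 0.3, 200_000

    N = rng.poisson(lam, reps)       # total emails per day
    S = rng.binomial(N, p)           # spam among them
    G = N - S                        # good emails

    print(S.mean(), S.var())         # both close to p*lam = 3
    print(G.mean(), G.var())         # both close to (1 - p)*lam = 7
    print(np.corrcoef(S, G)[0, 1])   # close to 0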

Probability and Statistics for SIC note 1 of slide 208

Parenthesis: Characteristic function
Many distributions do not have a MGF, since E(etX ) < ∞ only for t = 0. In this case, the Laplace
transform of the density is not useful. Instead we can use the Fourier transform, leading us to the
following definition.

Definition 190. Let i = √−1. The characteristic function of X is

ϕX (t) = E(eitX ), t ∈ R.

Every random variable has a characteristic function, which possesses the same key properties as the
MGF. Characteristic functions are however more complicated to handle, as they require ideas from
complex analysis (path integrals, Cauchy’s residue theorem, etc.).

Theorem 191 (No proof). X and Y have the same cumulative distribution function if and only if
they have the same characteristic function. If X is continuous and has density f and characteristic
function ϕ, then
f (x) = (2π)^{−1} ∫_{−∞}^{∞} e^{−itx} ϕ(t) dt
for all x at which f is differentiable.

Probability and Statistics for SIC slide 209

Parenthesis: Cumulant-generating function


Definition 192. The cumulant-generating function (CGF) of X is KX (t) = log MX (t). The
cumulants κr of X are defined by
KX (t) = Σ_{r=1}^{∞} κr t^r /r! , κr = d^r KX (t)/dt^r |_{t=0} .

It is easy to verify that E(X) = κ1 and var(X) = κ2 .


The CGF is equivalent to the MGF, and so shares its properties, but it is often easier to work with the
CGF.

Example 193. Calculate the CGF and the cumulants of (a) X ∼ N (µ, σ 2 ); (b) Y ∼ Pois(λ).

Probability and Statistics for SIC slide 210

Note to Example 193


We get directly from Example 177(d) that if X ∼ N (µ, σ 2 ), then (a)

KX (t) = log MX (t) = tµ + t2 σ 2 /2, t ∈ R,

so we see that κ1 = µ and κ2 = σ 2 , with all other cumulants zero.


(b) Likewise from Example 177(c),

KY (t) = log MY (t) = λ(et − 1), t ∈ R,

so κr = λ for all r.
Probability and Statistics for SIC note 1 of slide 210

Cumulants of sums of random variables
Theorem 194. If a, b1 , . . . , bn are constants and X1 , . . . , Xn are independent random variables, then
K_{a+b1 X1 +···+bn Xn}(t) = ta + Σ_{j=1}^{n} K_{Xj}(tbj ).

If X1 , . . . , Xn are independent variables having cumulants κj,r , then the CGF of S = X1 + · · · + Xn is


KS (t) = Σ_{j=1}^{n} K_{Xj}(t) = Σ_{j=1}^{n} Σ_{r=1}^{∞} κj,r t^r /r! = Σ_{r=1}^{∞} ( Σ_{j=1}^{n} κj,r ) t^r /r! :

the rth cumulant of X1 + · · · + Xn is the sum of the rth cumulants of the Xj . If X1 , . . . , Xn are iid with
distribution F and CGF K(t), then S has CGF nK(t) and has rth cumulant nκr .

Probability and Statistics for SIC slide 211

Note to Theorem 194


For the first result, just use take logarithms in Theorem 181. For the second, just use the definition of
the CGF in terms of the infinite series.
This result is very useful when looking at linear combinations of independent variables.

Probability and Statistics for SIC note 1 of slide 211

Multivariate cumulant-generating function


Definition 195. The cumulant-generating function (CGF) of a random variable
Xp×1 = (X1 , . . . , Xp )T is
TX
KX (t) = log MX (t) = log E(et ), t∈T,

where T = {t ∈ Rp : MX (t) < ∞}.

The CGF has the following properties:


 0 ∈ T , thus KX (0) = 0;
 we have
′ ∂KX (t) ∂ 2 KX (t)
E(X)p×1 = KX (0) = , var(X)p×p = ;
∂t t=0 ∂t∂tT t=0
 if A, B ⊂ {1, . . . , p} and A ∩ B = ∅, and we write XA for the subvector of X containing
{Xj : j ∈ A}, etc., then XA and XB are independent iff
T T
KX (t) = log E(etA XA +tB XB ) = KXA (tA ) + KXB (tB ), t∈T;

 there is an injective mapping between CGFs and probability distributions.

Probability and Statistics for SIC slide 212

5.4 Multivariate Normal Distribution slide 213

Multivariate normal distribution


Definition 196. The random vector X = (X1 , . . . , Xp )T has a multivariate normal distribution if
there exist a p × 1 vector µ = (µ1 , . . . , µp )T ∈ Rp and a p × p symmetric matrix Ω with elements ωjk
such that
uT X ∼ N (uT µ, uT Ωu), u ∈ Rp ;
then we write X ∼ Np (µ, Ω).

 Since var(uT X) = uT Ωu ≥ 0 for any u ∈ Rp , Ω must be positive semi-definite.


 This definition allows degenerate distributions, for which there exists a u such that var(uT X) = 0.
This gives mathematically clean results but can be avoided in applications by reformulating the
problem to avoid degeneracy, effectively working in a space of dimension m < p.

Probability and Statistics for SIC slide 214

Multivariate normal distribution, II


Lemma 197. (a) We have

E(Xj ) = µj , var(Xj ) = ωjj , cov(Xj , Xk ) = ωjk , j 6= k,

so µ and Ω are called the mean vector and covariance matrix of X.


(b) The moment-generating function of X is MX (u) = exp(uT µ + 12 uT Ωu), for u ∈ Rp .
(c) If A, B ⊂ {1, . . . , p}, and A ∩ B = ∅ then

XA ⊥⊥ XB ⇔ ΩA,B = 0.

iid
(d) If X1 , . . . , Xn ∼ N (µ, σ 2 ), then Xn×1 = (X1 , . . . , Xn )T ∼ Nn (µ1n , σ 2 In ).
(e) Linear combinations of normal variables are normal:

ar×1 + Br×p X ∼ Nr (a + Bµ, BΩB T).

Lemma 198. The random vector X ∼ Np (µ, Ω) has a density function on Rp if and only if Ω is
positive definite, i.e., Ω has rank p. If so, the density function is
f (x; µ, Ω) = (2π)^{−p/2} |Ω|^{−1/2} exp{ − ½ (x − µ)^T Ω^{−1} (x − µ) }, x ∈ Rp . (3)

If not, X is a linear combination of variables that have a density function on Rm , where m < p is the
rank of Ω.
Probability and Statistics for SIC slide 215

Note to Lemma 197
(a) Let ej denote the p-vector with 1 in the jth place and zeros everywhere else. Then
Xj = eTj X ∼ N (µj , ωjj ), giving the mean and variance of Xj .
Now var(Xj + Xk ) = var(Xj ) + var(Xk ) + 2cov(Xj , Xk ), and

Xj + Xk = (ej + ek )T X ∼ N (µj + µk , ωjj + ωkk + 2ωjk ),

which implies that cov(Xj , Xk ) = ωjk = ωkj .


T
(b) Since uT X ∼ N (uT µ, uT Ωu), its MGF is MuT X (t) = E(etu X ) = exp(tuT µ + 12 t2 uT Ωu). The
T
MGF of X is MX (u) = E(eu X ) = MuT X (1) = exp(uT µ + 12 uT Ωu), for any u ∈ Rp , as stated.
(c) Without loss of generality, let XA = (X1 , . . . , Xq )T , for 1 ≤ q < p, and partition tT = (tTA , tTB ),
µT = (µTA , µTB ), etc. Also without loss of generality suppose that A ∪ B = {1, . . . , n}, since otherwise
we can just set tj = 0 for j 6∈ A ∪ B. Then, using matrix algebra, the joint CGF of X can be written as

KX (t) = tT µ + 21 tT Ωt = tTA µA + tTB µB + 21 tTA ΩAA tA + 12 tTB ΩBB tB + tTA ΩAB tB .

This equals the sum of the CGFs of XA and XB , i.e.,

KXA (t) + KXB (t) = tTA µA + 21 tTA ΩAA tA + tTB µB + + 12 tTB ΩBB tB

if and only if the final term of KX (t) equals zero for all t, which occurs if and only if ΩAB = 0. Hence
the elements of the variance matrix corresponding to cov(Xr , Xs ) must equal zero for any r ∈ A and
s 6∈ A, as required. Clearly this also holds if A ∪ B =
6 {1, . . . , p}.
(d) In this case each of the Xj has mean µ and variance σ 2 , and since they are independent,
cov(Xj , Xk ) = 0 for j 6= k. If u ∈ Rn , then uT X is a linear combination of normal variables, with
mean and variance X X
uj µ = uT µ1n , u2j σ 2 = uT σn2 u,

so X ∼ Nn (µ1n , σ 2 In ), as required.
(e) The MGF of a + BX equals
T
E [exp{tT (a + BX)}] = E [exp{tT a + (B T t)T X)}] = et a MX (B T t)
= exp{tT a + (B T t)T µ + 12 (B T t)T Ω(B T t)}

= exp tT (a + Bµ) + 21 tT (BΩB T )t ,

which is the MGF of the Nr (a + Bµ, BΩB T) distribution.

Probability and Statistics for SIC note 1 of slide 215

Note I to Lemma 198
 Since Ω is positive semi-definite, the spectral theorem tells us that we may write Ω = AT DA,
where D = diag(d1 , . . . , dp ) contains the eigenvalues of Ω, with d1 ≥ d2 ≥ · · · ≥ dp ≥ 0, and A is
a p × p orthogonal matrix, i.e., AT A = AAT = Ip and |A| = 1. Note that
|Ω| = |AT DA| = |AT | × |D| × |A| = |D|, and that if the inverse exists, Ω−1 = AT D −1 A.
 Now Y = AX ∼ Np (Aµ, AΩAT ), and AΩAT = AAT DAAT = D is diagonal, so Y1 , . . . , Yp are
independent normal variables with means bj given by the elements of Aµ and variances dj .
 Suppose that dp > 0, so that Ω has rank p. Then all the Yj have non-degenerate normal densities,
and since they are independent, their joint density is
p
Y  
(yj − bj )2 
fY (y) = (2πdj )−1/2 exp − = (2π)−p/2 |D|−1/2 exp − 12 (y − b)T D −1 (y − b) .
2dj
j=1

Since Y = AX and A−1 = AT , we have that X = AT Y , and this transformation has Jacobian
|AT | = 1. Since |D| = |Ω|, we can appeal to Theorem 204 and hence write the density of X as

fX (x) = |AT |fY (Ax) = (2π)−p/2 |Ω|−1/2 exp − 21 (Ax − Aµ)T D −1 (Ax − Aµ) , x ∈ Rp ,

where (Ax − Aµ)T D −1 (Ax − Aµ) = (x − µ)T AT D −1 A(x − µ) = (x − µ)T Ω−1 (x − µ), giving (3).
 If Ω has rank m < p, then dm > 0 but dm+1 = · · · = dp = 0. In this case only Y1 , . . . , Ym have
positive variances, and the argument above allows us to construct a joint density for Y1 , . . . , Ym on
Rm . Since Ym = bm , . . . , Yp = bp with probability one, we can write

X = AT Y = AT (Y1 , . . . , Ym , bm+1 , . . . , bp )T ,

which confirms that the density of X is positive only on an m-dimensional linear subspace of Rp
generated by the variation of Y1 , . . . , Ym ; it might be said to have only ‘m degrees of freedom’.

Probability and Statistics for SIC note 2 of slide 215

Note II to Lemma 198
 Since Ω is symmetric and positive semi-definite, the spectral theorem tells us that we may write
Ω = ADAT , where D = diag(d1 , . . . , dp ) contains the (real) eigenvalues of Ω, with
d1 ≥ d2 ≥ · · · ≥ dp ≥ 0, and A is a p × p orthogonal matrix, i.e., AT A = AAT = Ip and |A| = 1.
The columns A1 , . . . , Ap of A are the eigenvectors corresponding to the respective eigenvalues;
note that
Xp
Ω = ADAT = dj aj aTj ,
j=1

that |Ω| = |ADAT | = |A| × |D| × |AT | = |D|, and that Ω−1 = AD −1 AT if the inverse exists.
 Now let Yj ∼ N (0, dj ) be independent variables, let Y = (Y1 , . . . , Yp )T , and let u ∈ Rp ; note that
if dj = 0 then Yj = 0 with probability one. Then
p
X
T T T
u X = u (µ + AY ) = u µ + Yj uT aj
j=1

is a linear combination of normal variables, so it has a normal distribution, with mean uT µ and
variance
   
X p Xn Xn
var uT µ + Yj uT aj  = (uT aj )2 var(Yj ) = uT  dj aj aTj  u = uT Ωu,
j=1 j=1 j=1

which implies that X = µ + AY ∼ Np (µ, Ω), according to Definition 196.


P
 Now X = µ + pj=1 Yj aj can be constructed by scaling the eigenvectors aj of Ω by
normally-distributed factors Yj , so X − µ lies in the linear space S = span(a1 , . . . , am ) generated
by the eigenvectors aj for which dj > 0. If dp > 0, then m = p and S = Rp , but otherwise S is a
proper subspace of Rp generated by a1 , . . . , am , and d1 ≥ · · · ≥ dm > 0 but
dm+1 = · · · = dp = 0. In this case X has a density on µ + S, but places no probability elsewhere.
 For example, suppose that p = 2, a1 = (1, 0)T and a2 = (0, 1)T . If m = 2, then d1 , d2 > 0, and X
can lie anywhere in R2 , whereas if m = 1, then d1 > 0 but d2 = 0, and X can only take values in
the x-axis, within which its density is N (µ1 , d1 ). If m = 0, then X takes the constant value µ
with probability one.
 To compute the density of X, suppose that m ≥ 1 and note that the non-degenerate part of
Y = AT (X − µ), Y+ = (Y1 , . . . , Ym )T , say, has joint density
m
Y  −1

fY+ (y+ ) = (2πdj )−1/2 exp −yj2 /(2dj ) = (2π)−p/2 |D+ |−1/2 exp − 12 y+
T
D+ y+ ,
j=1

where D+ = diag(d1 , . . . , dm ) and y+ = (y1 , . . . , ym )T .


 If Ω has rank m < p, then dm > 0 but dm+1 = · · · = dp = 0. In this case only Y1 , . . . , Ym have
positive variances, and the argument above allows us to construct a joint density for Y1 , . . . , Ym on
Rm . Since Ym = bm , . . . , Yp = bp with probability one, we can write

X = AT Y = AT (Y1 , . . . , Ym , bm+1 , . . . , bp )T ,

which confirms that the density of X is positive only on an m-dimensional linear subspace of Rp
generated by the variation of Y1 , . . . , Ym ; it might be said to have only ‘m degrees of freedom’.

Probability and Statistics for SIC note 3 of slide 215

Bivariate normal densities
Normal PDF with p = 2, µ1 = µ2 = 0, ω11 = ω22 = 1, and correlation ρ = ω12 /(ω11 ω22 )1/2 = 0 (left),
−0.5 (centre) and 0.9 (right).

Probability and Statistics for SIC slide 216

Examples
Example 199. If X ∼ N (1, 4) , Y ∼ N (−1, 9), corr(X, Y ) = −1/6, and they have a joint normal
distribution, give the joint distribution of (X, Y ). Hence find the distribution of W = X + Y .
Example 200. If X1 , . . . , X4 are iid N (0, σ 2 ), find the distribution of Y = BX when B is the 4 × 4 matrix
with rows (1, −1, −1, −1), (1, −1, 1, 1), (1, 1, −1, 1), (1, 1, 1, −1).

Probability and Statistics for SIC slide 217

Note to Example 199
Part (a) of Lemma 197 gives that

(X, Y )^T ∼ N2 ( (E(X), E(Y ))^T , [ var(X) cov(X, Y ) ; cov(X, Y ) var(Y ) ] ),

and we know all the elements of the matrices except

cov(X, Y ) = corr(X, Y ) × √var(X) × √var(Y ) = −1/6 × 2 × 3 = −1.

Therefore

(X, Y )^T ∼ N2 ( (1, −1)^T , [ 4 −1 ; −1 9 ] ).

Since W is a linear combination of normal variables, it has a normal distribution, and we can apply
Part (e) of Lemma 197 with r = 1, p = 2, a = 0 and B = (1, 1) to obtain

W = (1, 1)(X, Y )^T ∼ N1 ( 0 + (1, 1)(1, −1)^T , (1, 1) [ 4 −1 ; −1 9 ] (1, 1)^T ) = N (0, 11).
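The result W ∼ N (0, 11) is easy to confirm by simulation (a sketch):

    import numpy as np

    rng = np.random.default_rng(5)
    mean = [1.0, -1.0]
    cov = [[4.0, -1.0], [-1.0, 9.0]]
    X, Y = rng.multivariate_normal(mean, cov, size=500_000).T

    W = X + Y
    print(W.mean(), W.var())   # close to 0 and 11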

Probability and Statistics for SIC note 1 of slide 217

Note to Example 200


For this we use parts (d) and (e) of Lemma 197. For (d) we take µ = 04×1 and Ω = σ 2 I4 , and for (e)
we take a = 04×1 and the stated matrix B. Thus
D  D 
Y = a + BX ∼ N4 (a + Bµ, BΩB T) = N4 0, σ 2 BB T = N4 0, 4σ 2 I4 ,

because it is easy to check that BB T = 4I4 . Thus the variables Y1 , . . . , Y4 have N (0, 4σ 2 )
distributions, and are independent because their covariance matrix is diagonal.

Probability and Statistics for SIC note 2 of slide 217

Marginal and conditional distributions


Theorem 201. Let X ∼ Np (µp×1 , Ωp×p ), where |Ω| > 0, and let A, B ⊂ {1, . . . , p} with
|A| = q < p, |B| = r < p and A ∩ B = ∅.
Let µA , ΩA and ΩAB be respectively the q × 1 subvector of µ, q × q and q × r submatrices of Ω
conformable with A, A × A and A × B. Then:
(a) the marginal distribution of XA is normal,

XA ∼ Nq (µA , ΩA );

(b) the conditional distribution of XA given XB = xB is normal,



XA | XB = xB ∼ Nq ( µA + ΩAB Ω_B^{−1} (xB − µB ), ΩA − ΩAB Ω_B^{−1} ΩBA ).

This has two important implications:


 (a) implies that any subvector of X also has a multivariate normal distribution;
 (b) implies that two components of XA are conditionally independent given XB if and only if the
corresponding off-diagonal element of ΩA − ΩAB Ω−1
B ΩBA equals zero.

Probability and Statistics for SIC slide 218

Proof of Theorem 201
First note that without loss of generality we can permute the elements of X so that the components of
XA appear before those of XB , then writing X T = (XA T
, XBT ). Partition the vectors t, µ, and the
matrix Ω conformally with X, using obvious notation.
(a) The CGF of X is

KX (t) = t^T µ + ½ t^T Ωt = tA^T µA + tB^T µB + ½ ( tA^T ΩA tA + 2 tA^T ΩAB tB + tB^T ΩB tB ).

We obtain the marginal CGF of XA by setting tB = 0, giving

KX (tA , 0) = tA^T µA + ½ tA^T ΩA tA ,

which is the CGF of the Nq (µA , ΩA ) distribution.


(b) Consider W = XA − ΩAB Ω−1 B XB . This is a linear combination of normals and so is normal, and
its mean and variance matrix are

µA − ΩAB Ω−1
B µB , ΩA − ΩAB Ω−1
B ΩBA ,

and as cov(XB , W ) = 0 (check!) and they are jointly normally distributed, W ⊥⊥ XB . Now

XA = W + ΩAB Ω−1
B XB ,

and as W and XB are independent, the distribution of W is unchanged by conditioning on the event
XB = xB . The conditional mean of XA is therefore

E(XA | XB = xB ) = E(W +ΩAB Ω−1 −1 −1


B XB | XB = xB ) = E(W )+ΩAB ΩB xB = µA +ΩAB ΩB (xB −µB )

as required. Likewise

var(XA | XB = xB ) = var(W + ΩAB Ω−1 −1


B XB | XB = xB ) = var(W ) = ΩA − ΩAB ΩB ΩBA ,

because the term in XB is conditionally constant. This gives the required result.

Probability and Statistics for SIC note 1 of slide 218

Example
Example 202. Let (X1 , X2 ) be the pair (height (cm), weight (kg)) for a population of people aged
20. To model this, we take µ = (180, 70)^T and Ω = [ 225 90 ; 90 100 ].
(a) Find the marginal distributions of X1 and of X2 , and corr(X1 , X2 ).
(b) Do the marginal distributions determine the joint distribution?
(c) Find the conditional distribution of X2 given that X1 = x1 , and of X1 given that X2 = x2 .

Probability and Statistics for SIC slide 219

Note to Example 202
(a) The marginal distributions are X1 ∼ N (180, 225) and X2 ∼ N (70, 100). The correlation is
ω12 /√(ω11 ω22 ) = 90/√(225 × 100) = 90/150 = 0.6.

(b) Clearly not, because they don’t determine the correlation.


(c) For this we have

XA | XB = xB ∼ Nq µA + ΩAB Ω−1 −1
B (xB − µB ), ΩA − ΩAB ΩB ΩBA .

where XA = X2 , XB = X1 , so

µA + ΩAB Ω_B^{−1} (xB − µB ) = µ2 + ω21 ω11^{−1} (x1 − µ1 ) = 70 + 0.4(x1 − 180),
ΩA − ΩAB Ω_B^{−1} ΩBA = 100 − 90^2 /225 = 64.

Thus X2 | X1 = x1 ∼ N {70 + 0.4(x1 − 180), 64}: larger height leads to larger weight, on average.
A similar computation gives

X1 | X2 = x2 ∼ N {180 + 0.9(x2 − 70), 144}.

In each case the mean depends linearly on the conditioning variable, and the conditional variance is
smaller than the marginal variance, consistent with the idea that conditioning adds information and
therefore reduces uncertainty.
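These conditional moments can be checked empirically by conditioning on a narrow band of heights (a sketch; the band x1 ≈ 190 is an arbitrary choice):

    import numpy as np

    rng = np.random.default_rng(6)
    mu = [180.0, 70.0]
    Omega = [[225.0, 90.0], [90.0, 100.0]]
    X1, X2 = rng.multivariate_normal(mu, Omega, size=1_000_000).T

    sel = np.abs(X1 - 190) < 0.5     # condition (approximately) on X1 = 190
    print(X2[sel].mean())            # close to 70 + 0.4*(190 - 180) = 74
    print(X2[sel].var())             # close to 64, smaller than the marginal 100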

Probability and Statistics for SIC note 1 of slide 219

Bivariate normal distribution


The bivariate normal density for (X1 , X2 ) = (height, weight), as well as the straight lines
E(X2 | X1 = x1 ) = 70 + 0.4(x1 − 180) (blue) and E(X1 | X2 = x2 ) = 180 + 0.9(x2 − 70) (green).

Probability and Statistics for SIC slide 220

Francis Galton (1822–1911)

(Source: Wikipedia)

Probability and Statistics for SIC slide 221

Regression to the mean


 Galton obtained the heights of parents and of their children, and fitted a line.
 The slope of the line < 1: the children of tall parents are smaller than them, on average, and the
children of small parents are larger than them, on average.
 This effect is called regression to the mean, and appears in many contexts. For example,
someone with an above-average mark on a midterm test will tend to do worse in the final, on
average.

Probability and Statistics for SIC slide 222

5.5 Transformations slide 223

Reminder: Transformation of random variables


We often want to calculate the distributions of random variables based on other random variables.
 Let Y = g(X), where the function g is known. We want to obtain FY and fY from FX and fX .
 Let g : R 7→ R, B ⊂ R, and g −1 (B) ⊂ R be the set for which g{g−1 (B)} = B. Then

P(Y ∈ B) = P{g(X) ∈ B} = P{X ∈ g −1 (B)},

since X ∈ g−1 (B) iff g(X) = Y ∈ g{g −1 (B)} = B.


 To find FY (y), we take By = (−∞, y], giving

FY (y) = P(Y ≤ y) = P{g(X) ∈ By } = P{X ∈ g −1 (By )}.

 If the function g is monotonic increasing with (monotonic increasing) inverse g −1 , then


fY (y) = dFY (y)/dy = dFX {g^{−1}(y)}/dy = fX {g^{−1}(y)} × | dg^{−1}(y)/dy | ,

where the | · | ensures that the same formula holds with monotonic decreasing g.

Probability and Statistics for SIC slide 224

X bivariate
We calculate P(Y ∈ B), with Y ∈ Rd a function of X ∈ R2 and

Y = (Y1 , . . . , Yd )^T = ( g1 (X1 , X2 ), . . . , gd (X1 , X2 ) )^T = g(X).

Let g : R2 7→ Rd be a known function, B ⊂ Rd , and g −1 (B) ⊂ R2 be the set for which


g{g−1 (B)} = B. Then
P(Y ∈ B) = P{g(X) ∈ B} = P{X ∈ g −1 (B)}.
iid
Example 203. If X1 , X2 ∼ exp(λ), calculate the distribution of X1 + X2 .

It can be helpful to include indicator functions in formulae for densities of new variables (examples
later).

Probability and Statistics for SIC slide 225

Note to Example 203
We want to compute P(Y ≤ y) = P(X1 + X2 ≤ y), and with By = (−∞, y] and g(x1 , x2 ) = x1 + x2 ,
we have that
g−1 (By ) = {(x1 , x2 ) ∈ R2 : x1 + x2 ≤ y}.
Thus we want to compute FY (y) = P(X1 + X2 ≤ y). If y < 0 this is zero, and otherwise equals
FY (y) = P(X1 + X2 ≤ y) = ∫_0^y dx1 ∫_0^{y−x1} dx2 λ^2 e^{−λ(x1 +x2 )}
       = λ ∫_0^y dx1 e^{−λx1} [ e^{−λx2} ]_{x2 =y−x1}^{x2 =0}
       = λ ∫_0^y dx1 e^{−λx1} (1 − e^{−λ(y−x1 )} )
       = 1 − e^{−λy} − λye^{−λy} , y ≥ 0,

giving FY (y) = 0 for y < 0 and FY (y) = 1 − e^{−λy} − λye^{−λy} for y ≥ 0.

Differentiation gives fY (y) = λ2 ye−λy for y > 0, (the gamma density with shape parameter α = 2).

Probability and Statistics for SIC note 1 of slide 225

Transformations of joint continuous densities


Theorem 204. Let X = (X1 , X2 ) ∈ R2 be a continuous random variable, and let Y = (Y1 , Y2 ) with
Y1 = g1 (X1 , X2 ) and Y2 = g2 (X1 , X2 ), where:
(a) the system of equations y1 = g1 (x1 , x2 ), y2 = g2 (x1 , x2 ) can be solved for all (y1 , y2 ), giving the
solutions x1 = h1 (y1 , y2 ), x2 = h2 (y1 , y2 ); and
(b) g1 and g2 are continuously differentiable and have Jacobian

J(x1 , x2 ) = det[ ∂g1 /∂x1 , ∂g1 /∂x2 ; ∂g2 /∂x1 , ∂g2 /∂x2 ] = (∂g1 /∂x1 )(∂g2 /∂x2 ) − (∂g1 /∂x2 )(∂g2 /∂x1 ),

which is positive if fX1 ,X2 (x1 , x2 ) > 0.


Then
fY1 ,Y2 (y1 , y2 ) = fX1 ,X2 (x1 , x2 ) × |J(x1 , x2 )|−1 x1 =h1 (y1 ,y2 ),x2 =h2 (y1 ,y2 ) .

iid
Example 205. Calculate the joint density of X1 + X2 and X1 − X2 when X1 , X2 ∼ N (0, 1).
iid
Example 206. Calculate the joint density of X1 + X2 and X1 /(X1 + X2 ) when X1 , X2 ∼ exp(λ).

Probability and Statistics for SIC slide 226

Note to Example 205
 We already have one way to do this, as we can write
        
Y1 X1 + X2 1 1 X1 X1
= = =B ,
Y2 X1 − X2 1 −1 X2 X2

say, and use results for the multivariate normal distribution in Lemma 197(e).
 Using Theorem 204 instead, we need to compute

fY1 ,Y2 (y1 , y2 ) = fX1 ,X2 (x1 , x2 ) × |J(x1 , x2 )|−1 x1 =h1 (y1 ,y2 ),x2 =h2 (y1 ,y2 ) .

First, note that the Jacobian of the transformation (x1 , x2 ) 7→ (y1 , y2 ) is

J(x1 , x2 ) = |B| = | − 2| = 2.

Now we need to express the density


1 2 1 2
fX1 ,X2 (x1 , x2 ) = fX1 (x1 )fX2 (x2 ) = √ e−x1 /2 × √ e−x2 /2 , x1 , x2 ∈ R,
2π 2π
in terms of (y1 , y2 ). As x1 = (y1 + y2 )/2 and x2 = (y1 − y2 )/2, the exponent may be written in
terms of the new variables y1 , y2 as

− ½ (x1^2 + x2^2 ) = − ½ [ {(y1 + y2 )/2}^2 + {(y1 − y2 )/2}^2 ] = − (2y1^2 + 2y2^2 )/(2 × 4) = − (y1^2 + y2^2 )/(2 × 2),

so

fY1 ,Y2 (y1 , y2 ) = (1/2) × (1/2π) × exp{ − (y1^2 + y2^2 )/(2 × 2) }, y1 , y2 ∈ R,
and we see that Y1 and Y2 are mutually independent N (0, 2) variables.

Probability and Statistics for SIC note 1 of slide 226

Note to Example 206
We write
f (x1 , x2 ) = λ2 exp{−λ(x1 + x2 )}I(x1 > 0)I(x2 > 0).
With Y1 = X1 + X2 > 0 and Y2 = X1 /(X1 + X2 ) ∈ (0, 1), we have

y1 = g1 (x1 , x2 ) = x1 + x2 > 0, y2 = g2 (x1 , x2 ) = x1 /(x1 + x2 ) ∈ (0, 1),

and the corresponding inverse transformation is

x1 = h(y1 , y2 ) = y1 y2 , x2 = h(y1 , y2 ) = y1 (1 − y2 ), x1 , x2 > 0.

Clearly these transformations satisfy the conditions of Theorem 204. We can either compute

J = det[ 1 , 1 ; x2 /(x1 + x2 )^2 , −x1 /(x1 + x2 )^2 ] = −(x1 + x2 )/(x1 + x2 )^2 = −1/y1 , so |J| = 1/y1 > 0,

or (maybe better),

J^{−1} = det[ ∂h1 /∂y1 , ∂h1 /∂y2 ; ∂h2 /∂y1 , ∂h2 /∂y2 ] = −y1 , so |J^{−1}| = y1 > 0.

Thus

f (y1 , y2 ) = λ2 exp{−λ(x1 + x2 )}I(x1 > 0)I(x2 > 0)|J −1 | x1 =y1 y2 ,x2 =y1 (1−y2 )
= y1 λ2 exp(−λy1 )I(y1 y2 > 0)I{y1 (1 − y2 ) > 0},
= y1 λ2 exp(−λy1 )I(y1 > 0) × I(0 < y2 < 1)
= fY1 (y1 ) × fY2 (y2 ).

Integration over y2 shows that the marginal density of Y1 is y1 λ2 exp(−λy1 )I(y1 > 0), and so
Y1 ∼ Gamma(1, λ) and Y2 ∼ U (0, 1), independently.

Probability and Statistics for SIC note 2 of slide 226

Sums of independent variables


Theorem 207. If X, Y are independent random variables, then the PDF of their sum S = X + Y is
the convolution fX ∗ fY of the PDFs fX , fY :
fS (s) = fX ∗ fY (s) = ∫_{−∞}^{∞} fX (x)fY (s − x) dx when X, Y are continuous, and Σ_x fX (x)fY (s − x) when X, Y are discrete.

Example 208. Show that the sum of independent exponential and gamma variables has a gamma
distribution.
Probability and Statistics for SIC slide 227

Note to Theorem 207
Change variables to W = X and S = X + Y , so the Jacobian is

1 0
J = = 1,
1 1

and note that x = w and y = s − w. Thus, since X and Y are independent, an application of
Theorem 204 gives

fW,S (w, s) = fX,Y (w, s − w) × |J −1 | = fX (w)fY (s − w) × 1.

Therefore the marginal density of S in the continuous case is


Z ∞
fS (s) = fX (w)fY (s − w) dw.
−∞

The computation in the discrete case is similar, but the Jacobian is not needed.

Probability and Statistics for SIC note 1 of slide 227

Note to Example 208


Use indicator functions to write the densities as
fX (x) = {λ^α x^{α−1} /Γ(α)} e^{−λx} I(x > 0), fY (y) = λe^{−λy} I(y > 0), λ, α > 0,

and use the convolution formula to give that S = X + Y has density


fS (s) = ∫_{−∞}^{∞} fX (w)fY (s − w) dw = ∫_{−∞}^{∞} {λ^α w^{α−1} /Γ(α)} e^{−λw} I(w > 0) × λe^{−λ(s−w)} I(s − w > 0) dw.

The product of the indicator functions is positive only when w > 0 and s − w > 0 simultaneously, i.e.,
when 0 < w < s, and hence on putting constants outside the integral, we have

fS (s) = {λ^{α+1} e^{−λs} /Γ(α)} ∫_0^s w^{α−1} dw.

On noting that the integral equals sα /α and recalling that αΓ(α) = Γ(α + 1), we have

λα+1 sα −λs
fS (s) = e , s > 0,
Γ(α + 1)

so S ∼ gamma(α + 1, λ). In particular, a sum of two exponential variables has a gamma(2,λ)


distribution.
Probability and Statistics for SIC note 2 of slide 227

Risk estimation
Accurate estimation of risk is essential in financial markets, nuclear power plants, etc. We often need
to calculate the effect of rare events for several variables together, with little information on their joint
distribution. To be concrete, let −X1 , −X2 be negative shocks in a financial market, and consider
S = X1 + X2 , whose quantiles s1−α we need to estimate, such that

P(S ≤ s1−α ) = 1 − α, P(S > s1−α ) = α,

for small α. We will consider two cases:


 X1 , X2 ∼ N (µ, σ 2 ), with correlation ρ;
 X1 , X2 independent, each with the Pareto(1/2) distribution.
Then it turns out that

s1−α,Normal ≤ 2z1−α,Normal , 2z1−α,Pareto < s1−α,Pareto :

in the normal case (often used in practice) twice the marginal risk is an upper bound for the joint risk,
but in the Pareto case it is a lower bound. So if we base risk calculations on the normal distribution
but reality is Pareto, losses can be much greater than predicted.

Probability and Statistics for SIC slide 228

Note on risk estimation, I


For the normal case, we first we note that if X1 ∼ N (µ, σ 2 ), then the 1 − α quantile of X1 satisfies
 
X1 − µ x1−α − µ
1 − α = P(X1 ≤ x1−α ) = P ≤ = Φ(z1−α ),
σ σ
so x1−α = µ + σz1−α .
Now since S = X1 + X2 is a linear combination of normal variables, it is normal, with mean 2µ and
variance

var(S) = var(X1 ) + var(X2 ) + 2{var(X1 )var(X2 )}1/2 corr(X1 , X2 ) = 2σ 2 + 2σ 2 ρ = 2σ 2 (1 + ρ),

where we have written cov(X1 , X2 ) = {var(X1 )var(X2 )}1/2 corr(X1 , X2 ). Hence


S ∼ N {2µ, 2σ 2 (1 + ρ)}, so this means that
1 − α = P(S ≤ s1−α ) = P[ (S − 2µ)/√{2σ^2 (1 + ρ)} ≤ (s1−α − 2µ)/√{2σ^2 (1 + ρ)} ] = Φ[ (s1−α − 2µ)/√{2σ^2 (1 + ρ)} ];
this yields s1−α = 2µ + σ√{2(1 + ρ)} z1−α ≤ 2x1−α , because |ρ| ≤ 1. Thus for normal variables, the
quantile of the sum is bounded above by the sum of the quantiles.

Probability and Statistics for SIC note 1 of slide 228

Note on risk estimation, II
Now consider the convolution of independent random variables X1 , X2 both having distribution
function F (x) = 1 − x−1/2 , x > 1, and thus density function 12 x−3/2 , x > 1. According to
Theorem 207, their sum S = X1 + X2 has density
fS (s) = ¼ ∫_1^{s−1} x^{−3/2} (s − x)^{−3/2} dx, s > 2.

To work this out we first set x = su and a = 1/s for convenience, and on setting u = sin^2 θ obtain

fS (s) = ¼ s^{−2} ∫_a^{1−a} {u(1 − u)}^{−3/2} du
       = ½ s^{−2} ∫_{θ1}^{θ2} dθ/(sin^2 θ cos^2 θ)
       = 2 s^{−2} ∫_{θ1}^{θ2} dθ/sin^2 (2θ)
       = s^{−2} [ − cos 2θ/sin 2θ ]_{θ1}^{θ2}
       = ½ s^{−2} [ 1/tan θ − tan θ ]_{θ2}^{θ1} ,

after a little trigonometry. The limits for the integral are determined using a = sin2 θ1 , giving
tan θ1 = {a/(1 − a)}1/2 , and 1 − a = sin2 θ2 , giving tan θ2 = {(1 − a)/a}1/2 . Thus
fS (s) = ½ s^{−2} [ {(1 − a)/a}^{1/2} − {a/(1 − a)}^{1/2} − {a/(1 − a)}^{1/2} + {(1 − a)/a}^{1/2} ]
       = s^{−2} (1 − 2a)/√{a(1 − a)}
       = (s − 2)/{s^2 (s − 1)^{1/2}}, s > 2,

after a little algebra. It is then easy to check that FS (s) = 1 − 2(s − 1)1/2 /s, defined for s > 2.
Now the 1 − α quantile of X is x1−α = 1/α^2 , and the 1 − α quantile of S is the solution to the
equation α = 2(s − 1)^{1/2} /s, which is s1−α = 2α^{−2} (1 + √(1 − α^2 )). Thus we see that in this case

2x1−α < s1−α , 0<α<1:

the sum of the α-quantiles of X1 and X2 is always less than the α-quantile of their sum. For very
small α the ratio s1−α /(2x1−α ) = 1 + √(1 − α^2 ) ≈ 2 − α^2 /2, so the quantile of the sum is almost
twice the sum of the quantiles.
The implication is that the choice of model can have a huge effect on estimation of risk. If two risk
variables have a joint normal distribution, then we can bound the quantile of their sum S by doubling
the quantile of one of them. However if they have another joint distribution, then this may badly
underestimate the quantile of S. In many applications Pareto distributions for extreme risks are much
more plausible than are normal tails, but joint normality is often used. This can lead to financial
meltdown due to serious underestimation of risk, especially for complex products where dependencies
are hidden. Google ‘The formula that killed Wall Street’, or check out
http://www.wired.com/techbiz/it/magazine/17-03/wp_quant?currentPage=all.

Probability and Statistics for SIC note 2 of slide 228

Multivariate case
Theorem 204 extends to random vectors with continuous density, Y = g(X) ∈ Rn , where X ∈ Rn is a
continuous variable:

(X1 , . . . , Xn ) 7→ (Y1 = g1 (X1 , . . . , Xn ), . . . , Yn = gn (X1 , . . . , Xn )).

If the inverse transformation h exists, and has Jacobian


∂g1 ∂g1
···
∂x1 ∂xn
.. ,
J(x1 , . . . , xn ) = ... ..
. .
∂g
n ··· ∂gn
∂x1 ∂xn

then
fY1 ,...,Yn (y1 , . . . , yn ) = fX1 ,...,Xn (x1 , . . . , xn ) |J(x1 , . . . , xn )|−1 ,
evaluated at x1 = h1 (y1 , . . . , yn ), . . . , xn = hn (y1 , . . . , yn ).

Probability and Statistics for SIC slide 229

Convolution and sums of random variables


Theorem 209. If X1 , . . . , Xn are independent random variables, then the PDF of S = X1 + · · · + Xn
is the convolution
fS (s) = fX1 ∗ · · · ∗ fXn (s).

In fact it is easier to use the MGFs for convolutions, if possible.


iid
Example 210. Show that if X1 , . . . , Xn ∼ exp(λ), then Y = X1 + · · · + Xn ∼ gamma(n, λ).

Probability and Statistics for SIC slide 230

Note to Example 210


The MGF of X ∼ exp(λ) is MX (t) = λ/(λ − t), for t < λ. Now if Y has the gamma(n, λ)
distribution,
E(e^{tY}) = ∫_{0}^{∞} e^{ty} {λ^n y^{n−1} /Γ(n)} e^{−λy} dy
          = {λ^n /Γ(n)} ∫_{0}^{∞} y^{n−1} e^{−(λ−t)y} dy
          = {λ^n /(λ − t)^n } ∫_{0}^{∞} {(λ − t)^n /Γ(n)} y^{n−1} e^{−(λ−t)y} dy
          = {λ/(λ − t)}^n × 1,
provided that λ − t > 0, or equivalently that t < λ. The last step just notes that the integral
corresponds to the density of the gamma(n, λ − t) distribution, and so equals unity.
Now MY (t) = MX (t)n = λn /(λ − t)n , so Y has the stated gamma distribution, since there is a
bijection between MGFs and distributions.
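A simulation makes the same point (a sketch, with n = 5 and λ = 2 chosen arbitrarily):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(7)
    n, lam = 5, 2.0
    Y = rng.exponential(1 / lam, size=(200_000, n)).sum(axis=1)

    print(Y.mean(), Y.var())   # close to n/lam = 2.5 and n/lam**2 = 1.25
    # Kolmogorov-Smirnov comparison with the gamma(n, lam) distribution
    print(stats.kstest(Y, stats.gamma(a=n, scale=1/lam).cdf))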

Probability and Statistics for SIC note 1 of slide 230

5.6 Order Statistics slide 231

Definition
Definition 211. The order statistics of the rv’s X1 , . . . , Xn are the ordered values

X(1) ≤ X(2) ≤ · · · ≤ X(n−1) ≤ X(n) .

If the X1 , . . . , Xn are continuous, then no two of the Xj can be equal, i.e.,

X(1) < X(2) < · · · < X(n−1) < X(n) .

In particular, the minimum is X(1) , the maximum is X(n) , and the median is
X(m+1) (n = 2m + 1, odd), ½ (X(m) + X(m+1) ) (n = 2m, even).

The median is the central value of X1 , . . . , Xn .

Probability and Statistics for SIC slide 232

iid
Theorem 212. Let X1 , . . . , Xn ∼ F , from a continuous distribution with density f , then:

P(X(n) ≤ x) = F (x)n ;
P(X(1) ≤ x) = 1 − {1 − F (x)}n ;
fX(r) (x) = {n!/((r − 1)!(n − r)!)} F (x)^{r−1} f (x){1 − F (x)}^{n−r} , r = 1, . . . , n.
iid
Example 213. If X1 , X2 , X3 ∼ exp(λ), give the densities of the X(r) .

Example 214. Abélard and Héloïse make an appointment to work at the Learning Centre. Both are
late independently of each other, arriving at times distributed uniformly up to one hour after the time
agreed. Find the distribution and the expectation of the time at which the first one arrives, and give
the density of his (or her) waiting time. Find the expected time at which they can start to work.

Probability and Statistics for SIC slide 233

Note to Theorem 212


 First, X(n) ≤ x if and only if all the Xi ≤ x, and this has probability F (x)n .
 Likewise X(1) > x if and only if all the Xi > x, and this has probability {1 − F (x)}n . Thus the
required CDF is P(X(1) ≤ x) = 1 − {1 − F (x)}n .
 Finally, for the event X(r) ∈ [x, x + dx), we need to have split the sample into three groups of
respective sizes r − 1, 1 and n − r (hence the combinatorial coefficient) and probabilities F (x),
f (x)dx, and 1 − F (x). This gives the required formula.

Probability and Statistics for SIC note 1 of slide 233

Note to Example 213
We note that in this case f (x) = λe−λx and F (x) = 1 − exp(−λx), and then just apply the theorem
with n = 3 and r = 1, 2, 3:

fX(1) (x) = 3λe−λx × (e−λx )2 , x>0


fX(2) (x) = 6(1 − e−λx ) × λe−λx × e−λx , x>0
−λx 2 −λx
fX(3) (x) = 3(1 − e ) × λe , x > 0.

Probability and Statistics for SIC note 2 of slide 233

Note to Example 214


Let 0 < U < V < 1 denote the ordered arrival times.
U is the minimum of n = 2 independent U (0, 1) variables, each with F (u) = u (0 < u < 1), so
according to the second line of Theorem 212 U has distribution function FU (u) = 1 − (1 − u)2 and
corresponding density

fU (u) = dFU (u)/du = d{1 − (1 − u)^2 }/du = 2(1 − u), 0 < u < 1;
consequently E(U ) = ∫_0^1 u × 2(1 − u) du = 1 − 2/3 = 1/3. To compute the joint density we note that
the uniformity of the arrival times implies that

P(V ≤ v, U ≤ u) = P(V ≤ v) − P(V ≤ v, U > u) = v 2 − (v − u)2 , 0 < u < v < 1,

because the event V < v occurs if and only if both of them independently arrive before v, and the
event V ≤ v, U > u occurs if and only if they both arrive in the interval (u, v). It follows that the joint
density is
f (u, v) = ∂^2 P(V ≤ v, U ≤ u)/∂u∂v = 2I(0 < u < v < 1).
Therefore w = v − u has density
f (w) = ∫_{u=0}^{1} 2I(0 < u < v < 1) du = 2 ∫_{u=0}^{1} I(0 < u < u + w < 1) du
      = 2 ∫_{u=0}^{1} I(0 < u < 1 − w) du
      = 2(1 − w), 0 < w < 1.

They can start to work when the second of them arrives, at time V , which has density fV (v) = 2v and hence expectation
E(V ) = ∫_0^1 v × 2v dv = 2/3, i.e., 40 minutes after the agreed time.
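A short simulation of the two arrival times confirms these values (a sketch):

    import numpy as np

    rng = np.random.default_rng(8)
    arrivals = rng.uniform(0, 1, size=(500_000, 2))
    U = arrivals.min(axis=1)   # first arrival
    V = arrivals.max(axis=1)   # second arrival: work starts here

    print(U.mean())            # close to 1/3
    print(V.mean())            # close to 2/3, i.e. 40 minutes
    print((V - U).mean())      # mean waiting time of the first arriver, close to 1/3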

Probability and Statistics for SIC note 3 of slide 233

6. Approximation and Convergence slide 234

Motivation
It is often difficult to calculate the exact probability p of an event of interest, and we have to
approximate. Possible approaches:
 try to bound p;
 analytic approximation, often using the law of large numbers and the central limit theorem;
 numerical approximation, often using Monte Carlo methods.
The final approaches use the notion of convergence of sequences of random variables, which we will
study in this chapter.
We have already seen examples of these ideas: normal approximation to the binomial distribution, law
of small numbers, . . .
Probability and Statistics for SIC slide 235

6.1 Inequalities slide 236

Inequalities
Theorem 215. If X is a random variable, a > 0 a constant, h a non-negative function and g a convex
function, then

P{h(X) ≥ a} ≤ E{h(X)}/a, (basic inequality)


P(|X| ≥ a) ≤ E(|X|)/a, (Markov’s inequality)
2 2
P(|X| ≥ a) ≤ E(X )/a , (Chebyshov’s inequality)
E{g(X)} ≥ g{E(X)}. (Jensen’s inequality)

On replacing X by X − E(X), Chebyshov’s inequality gives

P{|X − E(X)| ≥ a} ≤ var(X)/a2 .

These inequalities are more useful for theoretical calculations than for practical use.

Example 216. We are testing a classification method, in which the probability of a correct
classification is p. Let Y1 , . . . , Yn be the indicators of correct classifications in n test cases, and let Y
be their average. For ε = 0.2 and n = 100, bound

P(|Y − p| > ε).

Probability and Statistics for SIC slide 237

Note to Theorem 215
(a) Let Y = h(X). If y ≥ 0, then for any a > 0, y ≥ yI(y ≥ a) ≥ aI(y ≥ a). Therefore

E{h(X)} = E(Y ) ≥ E{Y I(Y ≥ a)} ≥ E{aI(Y ≥ a)} = aP(Y ≥ a) = aP{h(X) ≥ a},

and division by a > 0 gives the result.


(b) Note that h(x) = |x| is a non-negative function on R, and apply (a).
(c) Note that h(x) = x2 is a non-negative function on R, and that P(X 2 ≥ a2 ) = P(|X| ≥ a).
(d) A convex function has the property that, for all y, there exists a value b(y) such that
g(x) ≥ g(y) + b(y)(x − y) for all x. If g(x) is differentiable, then we can take b(y) = g ′ (y). (Draw a
graph if need be.) To prove this result, we take y = E(X), and then have

g(X) ≥ g{E(X)} + b{E(X)}{X − E(X)},

and taking expectations of this gives E{g(X)} ≥ g{E(X)}.

Probability and Statistics for SIC note 1 of slide 237

Note to Example 216


We note that Σ_{j=1}^{n} Yj ∼ B(n, p), so has mean np and variance np(1 − p), write X = Y − p, and note
that E(X) = 0 and E(X 2 ) = var(X) = var(Y ) = n−2 × np(1 − p). Now Chebyshov’s inequality gives

P(|Y − p| > ε) = P(|X| > ε) ≤ P(|X| ≥ ε) ≤ E(X 2 )/ε2 ,

and since p(1 − p) ≤ 1/4 in the range 0 ≤ p ≤ 1,

E(X 2 )/ε2 = var(Y )/ε2 = p(1 − p)/(nε2 ) ≤ 1/4/(100 × 0.22 ) = 1/16.

Probability and Statistics for SIC note 2 of slide 237

Hoeffding’s inequality
Theorem 217. (Hoeffding’s inequality, no proof) Let Z1 , . . . , Zn be independent random variables
such that E(Zi ) = 0 and ai ≤ Zi ≤ bi for constants ai < bi . If ε > 0, then for all t > 0,
P( Σ_{i=1}^{n} Zi ≥ ε ) ≤ e^{−tε} ∏_{i=1}^{n} e^{t^2 (bi −ai )^2 /8} .

This inequality is much more useful than the others for finding powerful bounds in practical situations.
iid
Example 218. Show that if X1 , . . . , Xn ∼ Bernoulli(p) and ε > 0, then
P(|X − p| > ε) ≤ 2e^{−2nε^2} .

For ε = 0.2 and n = 100, bound


P(|X − p| > ε).

Probability and Statistics for SIC slide 238

Note to Example 218
For the theoretical part, take Zi = (Xi − p)/n, and note that −p/n ≤ Zi ≤ (1 − p)/n, so
bi − ai = 1/n. Then
P(|X − p| > ε) = P( Σ_i Zi > ε) + P( −Σ_i Zi > ε),

and each of these probabilities can be bounded by


e^{−tε} ∏_{i=1}^{n} e^{t^2 (1/n)^2 /8} = exp{t^2 /(8n) − tε}.

To minimise this with respect to t, we take t = 4nε, which leads to the result.
For the numerical part, just insert into the previous part and get 0.00067, which is much smaller than
the bound obtained using the Chebyshov inequality (Example 216).
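For reference, the two bounds can be evaluated directly (a sketch):

    import math

    n, eps = 100, 0.2
    chebyshev = 0.25 / (n * eps**2)            # p(1 - p) <= 1/4, as in Example 216
    hoeffding = 2 * math.exp(-2 * n * eps**2)  # Example 218

    print(chebyshev)   # 0.0625
    print(hoeffding)   # about 0.00067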

Probability and Statistics for SIC note 1 of slide 238

6.2 Convergence slide 239

Convergence
Definition 219 (Deterministic convergence). If x1 , x2 , . . . , x are real numbers, then xn → x iff for all
ε > 0, there exists Nε such that |xn − x| < ε for all n > Nε .

Probabilistic convergence is more complicated . . . We could hope that (for example) Xn → X if either

P(Xn ≤ x) → P(X ≤ x), x ∈ R,

or
E(Xn ) → E(X)
when n → ∞.

Example 220. For n = 1, 2, . . . let Xn be the random variable such that

P(Xn = 0) = 1 − 1/n, P(Xn = n2 ) = 1/n.

Then when n → ∞,

P(|Xn | > 0) = P(Xn = n2 ) = 1/n → 0,


E(Xn ) = 0 × (1 − 1/n) + n2 × 1/n = n → ∞.

Does Xn → 0 or Xn → ∞? What does ‘converge’ mean for random variables?

Probability and Statistics for SIC slide 240

Modes of convergence of random variables
Definition 221. Let X, X1 , X2 , . . . be random variables with cumulative distribution function
F, F1 , F2 , . . .. Then
a.s.
(a) Xn converges to X almost surely, Xn −→ X, if
 
P lim Xn = X = 1;
n→∞

2
(b) Xn converges to X in mean square, Xn −→ X, if

lim E{(Xn − X)2 } = 0, where E(Xn2 ), E(X 2 ) < ∞;


n→∞

P
(c) Xn converges to X in probability, Xn −→ X, if for all ε > 0,

lim P(|Xn − X| > ε) = 0;


n→∞

D
(d) Xn converges to X in distribution, Xn −→ X, if

lim Fn (x) = F (x) at each point x where F (x) is continuous.


n→∞

Probability and Statistics for SIC slide 241

a.s.
Xn −→ X
To understand this better:
 all the variables {Xn }, X must be defined on the same probability space, (Ω, F, P). It is not trivial
to construct this space (we need ‘Kolmogorov’s extension theorem’).
 Then to each ω ∈ Ω corresponds a sequence

X1 (ω), X2 (ω), . . . , Xn (ω), . . .

which will converge, or not, as a sequence of real numbers.


a.s.
 If Xn −→ X, then n o
P ω : lim Xn (ω) = X(ω) =1:
n→∞

the set of values of ω for which Xn (ω) 6→ X(ω) has probability 0.

Example 222. Let U ∼ U (0, 1), where Ω = [0, 1], U (ω) = ω, Xn (ω) = U (ω)n , n = 1, 2, . . ., and
a.s.
X(ω) = 0. Show that Xn −→ X.

Probability and Statistics for SIC slide 242

Note to Example 222


Here we note that for any 0 ≤ ω < 1, Xn (ω) = U (ω)n = ω n → 0 as n → ∞, so Xn (ω) → X(ω) for
every ω ∈ [0, 1). The only ω for which Xn (ω) 6→ X(ω) is ω = 1, and this has zero probability of
occurring, so n o
P ω : lim Xn (ω) = X(ω) = P(U < 1) = 1,
n→∞

as required.

Probability and Statistics for SIC note 1 of slide 242

Relations between modes of convergence
a.s. 2 P
 If Xn −→ X, Xn −→ X or Xn −→ X, then X1 , X2 , . . . , X must all be defined with respect to
D
only one probability space. This is not the case for Xn −→ X, which only concerns the
probabilities. This last is thus weaker than the others.
 These modes of convergence are related to one another in the following way:
Xn −a.s.→ X or Xn −2→ X   ⇒   Xn −P→ X   ⇒   Xn −D→ X.

All other implications are in general false.


P D
 The most important modes of convergence in this course are −→ and −→ , since we often wish
D
to approximate probabilities, and −→ gives us a way to do so.
iid
Example 223. Let X1 , . . . , Xn ∼ (µ, σ 2 ) with 0 < σ 2 < ∞. Show that
2
X = (X1 + · · · + Xn )/n −→ µ.
D
Example 224. Let Xn = (−1)n Z, where Z ∼ N (0, 1). Show that Xn −→ Z, but that this is the
only mode of convergence that applies here.

Probability and Statistics for SIC slide 243

Note to Example 223


Note that E(X) = µ, so by definition of the variance as var(X) = E[{X − E(X)}2 ], we have

E{(X − µ)2 } = var(X) = σ 2 /n → 0, n → ∞,


2
which implies that X −→ µ, as required.

Probability and Statistics for SIC note 1 of slide 243

Note to Example 224


For even n there is nothing to prove, since then Xn = (−1)n Z = Z, and then
P(Xn ≤ x) = P(Z ≤ x).
For odd n, Xn = (−1)n Z = −Z, so

P(Xn ≤ x) = P(−Z ≤ x) = P(Z ≥ −x) = 1 − Φ(−x) = Φ(x) = P(Z ≤ x).


D
Hence Xn −→ Z, though this is trivial because Xn and Z have the same distribution for every n.
Now for n odd,

P(|Xn − Z| > ε) = P(| − Z − Z| > ε) = P(|Z| > ε/2) = 2Φ(−ε/2) 6→ 0, n → ∞,

so Xn does not converge in probability to Z, and thus neither of the other modes of convergence can
be true either.
Probability and Statistics for SIC note 2 of slide 243

Continuity theorem (reminder)
Theorem 225 (Continuity). Let {Xn }, X be random variables with cumulative distribution functions
{Fn }, F , whose MGFs Mn (t), M (t) exist for 0 ≤ |t| < b. If there exists a 0 < a < b such that
D
Mn (t) → M (t) for |t| ≤ a when n → ∞, then Xn −→ X, that is to say, Fn (x) → F (x) at each
x ∈ R where F is continuous.

 We could replace Mn (t) and M (t) by the cumulant-generating functions Kn (t) = log Mn (t) and
K(t) = log M (t).
 We established the law of small numbers (Theorem 104 and Example 186, Poisson approximation
of the binomial distribution) by using this result.
 Here is another example:

Example 226. Let X be a random variable which has a geometric distribution with a probability of
success p. Calculate the limit distribution of pX when p → 0.

Probability and Statistics for SIC slide 244

Note to Example 226


Recall that if |a| < 1, then Σ_{r=0}^{∞} a^r = 1/(1 − a).
The MGF of pX is

X
E(etpX ) = etpx p(1 − p)x−1
x=1

X
= petp {etp (1 − p)}x
x=0
petp 1 1 1
= = = → , p → 0,
1 − (1 − p)etp p−1 e−tp − (1 − p)/p 1 + (e−tp − 1)/p 1−t
which is the MGF of Y ∼ exp(1). We need t < 1.

Probability and Statistics for SIC note 1 of slide 244

Combinations of convergent sequences


Theorem 227 (Combination of convergent sequences). Let x0 , y0 be constants, X, Y, {Xn }, {Yn }
random variables, and h a function continuous at x0 . Then
Xn →D x0  ⇒  Xn →P x0,
Xn →P x0  ⇒  h(Xn) →P h(x0),
Xn →D X and Yn →P y0  ⇒  Xn + Yn →D X + y0,   Xn Yn →D X y0.

The third line is known as Slutsky’s lemma. It is very useful in statistical applications.
Example 228. Let X1, . . . , Xn be iid with mean µX and variance σ²_X, and Y1, . . . , Yn be iid with
mean µY and variance σ²_Y, where µX ≠ 0 and σ²_X, σ²_Y < ∞, and define

Rn = Ȳ/X̄,   Ȳ = n^{−1} Σ_{j=1}^n Yj,   X̄ = n^{−1} Σ_{j=1}^n Xj.

Show that Rn →P µY/µX when n → ∞.

Probability and Statistics for SIC slide 245

Note to Example 228


Note that since σ²_X < ∞, by Example 223, X̄ →2 µX, and likewise Ȳ →2 µY. Hence X̄ →P µX, by
the contents of slide 243, and since the function h(x) = 1/x is continuous at µX ≠ 0, it must be true
using line 2 of the theorem that 1/X̄ →P 1/µX, a constant. Therefore we have by line 3 that

Rn = Ȳ × 1/X̄ →D µY × 1/µX,

and as this is a constant, line 1 implies that Rn →P µY/µX, as required.

Probability and Statistics for SIC note 1 of slide 245

Convergence in distribution: Limits for maxima


 In applications, we often have to take into account the largest or the smallest of the random
variables under consideration.
 A system of n components can break down when any component of the system becomes faulty.
What is the distribution of the failure time?
 Let X1, . . . , Xn be iid with distribution F, and Mn = max{X1, . . . , Xn}. Then

P(Mn ≤ x) = P(X1 ≤ x, . . . , Xn ≤ x) = F(x)^n → 0 if F(x) < 1,  and F(x)^n → 1 if F(x) = 1.

 Hence Mn must be renormalised to get a non-degenerate limit distribution. Let {an } > 0 and
{bn } be sequences of constants, and consider the convergence in distribution of

Yn = (Mn − bn )/an ,

where an , bn are chosen so that a non-degenerate limit distribution for Yn exists.


Probability and Statistics for SIC slide 246

Examples
Example 229. Let X1, . . . , Xn be iid exp(λ), and let Mn be their maximum. Find an, bn such that
Yn = (Mn − bn)/an →D Y, where Y has a non-degenerate distribution.

Example 230. Let X1, . . . , Xn be iid U(0, 1), and let Mn be their maximum. Find an, bn such that
Yn = (Mn − bn)/an →D Y, where Y has a non-degenerate distribution.

Probability and Statistics for SIC slide 247

Note to Example 229


We have
P(Yn ≤ y) = F(bn + an y)^n = {1 − exp(−bn λ − an λ y)}^n,
and on setting an = 1/λ, bn = (log n)/λ, we have

P(Yn ≤ y) = {1 − exp(−y)/n}^n → exp{−exp(−y)},

which is the Gumbel distribution function.
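A small simulation (an addition to the notes, assuming numpy is available) illustrates this limit: maxima of exponential samples, renormalised with a_n = 1/λ and b_n = (log n)/λ, have an empirical distribution close to the Gumbel form exp{−exp(−y)}.

import numpy as np

rng = np.random.default_rng(1)
lam, n, reps = 2.0, 1000, 5000
M = rng.exponential(scale=1/lam, size=(reps, n)).max(axis=1)
Y = (M - np.log(n)/lam) * lam          # (M_n - b_n)/a_n with a_n = 1/lam, b_n = log(n)/lam

for y in (-1.0, 0.0, 1.0, 2.0):
    print(y, round(np.mean(Y <= y), 3), round(float(np.exp(-np.exp(-y))), 3))
# The empirical proportions should be close to the Gumbel cdf exp(-exp(-y)).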

Probability and Statistics for SIC note 1 of slide 247

Note to Example 230


We have
P(Yn ≤ y) = F(bn + an y)^n = (bn + an y)^n,
and on setting an = 1/n, bn = 1, we have (since Mn < 1) that

P(Yn ≤ y) = P{n(Mn − 1) ≤ y} = (1 + y/n)^n → exp(y),   y < 0,

which is the distribution function of −Z, where Z ∼ exp(1).

Probability and Statistics for SIC note 2 of slide 247

Fisher–Tippett theorem
Theorem 231. Suppose that X1, . . . , Xn are iid from F, where F is a continuous cumulative distribution
function. Let Mn = max{X1, . . . , Xn}, and suppose that the sequences of constants {an} > 0 and
{bn} can be chosen so that Yn = (Mn − bn)/an →D Y, where Y has a non-degenerate limit
distribution H(y) when n → ∞. Then H must be the generalised extreme-value (GEV)
distribution,

H(y) = exp[ −{1 + ξ(y − η)/τ}_+^{−1/ξ} ],   ξ ≠ 0,
H(y) = exp[ −exp{−(y − η)/τ} ],             ξ = 0,

where u_+ = max(u, 0), and η, ξ ∈ R, τ > 0.

Probability and Statistics for SIC slide 248

Example
The graph below shows the distributions of Mn and of Yn for n = 1, 7, 30, 365, 3650, from left to
iid
right, for X1 , . . . , Xn ∼ N (0, 1). The panel on the right also shows the limit distribution (bold),
H(y) = exp{− exp(−y)}.
[Figure: two panels of distribution functions (CDF against y); left, the CDFs of Mn; right, the CDFs of Yn with the Gumbel limit in bold.]

Probability and Statistics for SIC slide 249

6.3 Laws of Large Numbers slide 250

Law of large numbers


The first part of our limit results concerns the behaviour of averages of independent random variables.

Theorem 232. (Weak law of large numbers) Let X1 , X2 , . . . be a sequence of independent identically
distributed random variables with finite expectation µ, and write their average as

X̄ = n^{−1}(X1 + · · · + Xn).

Then X̄ →P µ; i.e., for all ε > 0,

P(|X̄ − µ| > ε) → 0,   n → ∞.

 Thus, under mild conditions, the averages of large samples converge towards the
expectation of the distribution from which the sample is taken.
 If the Xi are independent Bernoulli trials, we return to our primitive notion of probability as a limit
of relative frequencies. The circle is complete.

Probability and Statistics for SIC slide 251

Weak law of large numbers


 The graphs below show the behaviour of X when Xi has two finite moments (on the left), only
E(|Xi |) < ∞ (centre), E(Xi ) doesn’t exist (and so var(X) does not exist either) (on the right).
 When E(Xi ) does not exist, the possibility of huge values of Xi implies that X cannot converge.

Probability and Statistics for SIC slide 252

Remarks
 The weak law is easy to prove under the supplementary hypothesis that var(Xj) = σ² < ∞. We
calculate E(X̄) and var(X̄), then we apply Chebyshev's inequality. For any ε > 0,

P(|X̄ − µ| > ε) ≤ var(X̄)/ε² = σ²/(nε²) → 0,   n → ∞.

 The same result applies to smooth functions of averages, empirical quantiles, and other statistics.
 Let X1, . . . , Xn be iid from F, where F is a continuous cumulative distribution function, and let
xp = F^{−1}(p) be the p quantile of F. By noting that

X_(⌈np⌉) ≤ xp  ⇔  Σ_{j=1}^n I(Xj ≤ xp) ≥ ⌈np⌉

and applying the weak law to the sum on the right, we have X_(⌈np⌉) →P xp.

Probability and Statistics for SIC slide 253

Strong law of large numbers


In fact, a stronger result is true:
Theorem 233. (Strong law of large numbers) Under the conditions of the last theorem, X̄ →a.s. µ:

P( lim_{n→∞} X̄ = µ ) = 1.

 This is stronger in the sense that for all ε > 0, the weak law allows the event |X − µ| > ε to occur
an infinite number of times, though with smaller and smaller probabilities. The strong law excludes
this possibility: it implies that the event |X − µ| > ε can only occur a finite number of times.
 The weak and strong laws remain valid under certain types of dependence amongst the Xj .

Probability and Statistics for SIC slide 254

6.4 Central Limit Theorem slide 255

Standardisation of an average
The law of large numbers shows us that the average X̄ approaches µ when n → ∞. If var(Xj) < ∞,
then Lemma 166 tells us that
E(X̄) = µ,   var(X̄) = σ²/n,
so, for all n, the difference between X̄ and its expectation relative to its standard deviation,

Zn = {X̄ − E(X̄)}/var(X̄)^{1/2} = (X̄ − µ)/(σ²/n)^{1/2} = n^{1/2}(X̄ − µ)/σ,

has expected value zero and unit variance.

What is the limiting behaviour of Zn ?

Probability and Statistics for SIC slide 256

Central limit theorem
Theorem 234 (Central limit theorem (CLT)). Let X1 , X2 , . . . be independent random variables with
expectation µ and variance 0 < σ 2 < ∞. Then

Zn = n^{1/2}(X̄ − µ)/σ →D Z,   n → ∞,

where Z ∼ N(0, 1).

Thus

P{ n^{1/2}(X̄ − µ)/σ ≤ z } ≈ P(Z ≤ z) = Φ(z)

for large n.
iid
The following page shows this effect for X1 , . . . , Xn ∼ exp(1); the histograms show how the empirical
densities of Zn approach the density of Z.

Probability and Statistics for SIC slide 257

[Figure: histograms of Zn for n = 5, 10, 20 and 100 (density against z), approaching the N(0, 1) density.]

Probability and Statistics for SIC slide 258

Note to Theorem 234
The cumulant-generating function of

Zn = (X̄ − µ)/(σ²/n)^{1/2} = Σ_{j=1}^n (n^{−1/2}/σ)Xj − n^{1/2} µ/σ

is

K_{Zn}(t) = Σ_{j=1}^n K_{Xj}(t n^{−1/2}/σ) − n^{1/2} t µ/σ,

where
K_{Xj}(t) = tµ + t²σ²/2 + o(t²),   t → 0.
Thus

K_{Zn}(t) = n[ t n^{−1/2} µ/σ + (t n^{−1/2}/σ)² σ²/2 + o{t²/(nσ²)} ] − n^{1/2} t µ/σ → t²/2,   n → ∞,

which is the CGF of Z ∼ N(0, 1). Thus the result follows by the continuity theorem, Theorem 185.

Probability and Statistics for SIC note 1 of slide 258

Use of the CLT


The CLT is used to approximate probabilities involving the sums of independent random variables.
Under the previous conditions, we have

E( Σ_{j=1}^n Xj ) = nµ,   var( Σ_{j=1}^n Xj ) = nσ²,

so

( Σ_{j=1}^n Xj − nµ ) / (nσ²)^{1/2} = n(X̄ − µ)/(nσ²)^{1/2} = n^{1/2}(X̄ − µ)/σ = Zn

can be approximated using a normal variable:

P( Σ_{j=1}^n Xj ≤ x ) = P{ ( Σ_{j=1}^n Xj − nµ )/(nσ²)^{1/2} ≤ (x − nµ)/(nσ²)^{1/2} } ≈ Φ{ (x − nµ)/(nσ²)^{1/2} }.

The accuracy of the approximation depends on the underlying variables: it is (of course) exact for
normal Xj , works better if the Xj are symmetrically distributed (e.g., uniform), and typically is
adequate if n > 25 or so.

Probability and Statistics for SIC slide 259

Example
Example 235. A book of 640 pages has a number of random errors on each page. If the number of
errors on each page follows a Poisson distribution with expectation λ = 0.1, what is the probability
that the book contains less than 50 errors?
When Σ_{j=1}^n Xj takes integer values, we can obtain a better approximation using a continuity correction:

P( Σ_{j=1}^n Xj ≤ x ) ≈ Φ{ (x + 1/2 − nµ)/(nσ²)^{1/2} };

this can be important when the distribution of Σ_{j=1}^n Xj is quite discrete.

Probability and Statistics for SIC slide 260

Note to Example 235


We take µ = σ² = 0.1 and n = 640. The expected number of errors is nµ = 640λ = 64, and the
variance is nσ² = 64, as the variable is Poisson. Thus we seek

P( Σ_{j=1}^n Xj ≤ 49 ) = P{ ( Σ_{j=1}^n Xj − 64 )/√64 ≤ (49 − 64)/√64 } ≈ Φ(−15/8) = 0.03.

The true value is 0.031. With continuity correction we take Φ{(−15 + 0.5)/8} = 0.035.
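The sketch below (an addition to the notes, assuming scipy is installed) checks these numbers by comparing the exact Poisson probability with the two normal approximations.

from scipy.stats import norm, poisson

exact = poisson.cdf(49, mu=64)              # exact P(total errors <= 49), about 0.031
approx = norm.cdf((49 - 64) / 8)            # CLT approximation, about 0.030
approx_cc = norm.cdf((49 + 0.5 - 64) / 8)   # with continuity correction, about 0.035
print(round(exact, 3), round(approx, 3), round(approx_cc, 3))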

Probability and Statistics for SIC note 1 of slide 260

6.5 Delta Method slide 261

Delta method
We often need the approximate distribution of a smooth function of an average.

Theorem 236. Let X1 , X2 , . . . be independent random variables with expectation µ and variance
0 < σ² < ∞, and let g′(µ) ≠ 0, where g′ is the derivative of g. Then

{g(X̄) − g(µ)} / {g′(µ)² σ²/n}^{1/2} →D N(0, 1),   n → ∞.

This implies that for large n, we have g(X̄) approximately N{g(µ), g′(µ)² σ²/n}. Combined with Slutsky's
lemma, we have

g(X̄) approximately N{g(µ), g′(X̄)² S²/n},   S² = (n − 1)^{−1} Σ_{j=1}^n (Xj − X̄)².

Example 237. If X1, . . . , Xn are iid exp(λ), find the approximate distribution of log X̄.

Probability and Statistics for SIC slide 262

Note to Theorem 236
We note first that
Zn = (X̄ − µ)/(σ²/n)^{1/2} →D Z ∼ N(0, 1),

and therefore that we may write X̄ = µ + σZn/√n. Taylor series expansion for large n gives

g(X̄) = g(µ + σZn/√n) = g(µ) + g′(µ)σZn/√n + O(n^{−1}),

and this implies that E{g(X̄)} = g(µ) + O(n^{−1}) and that

var{g(X̄)} = g′(µ)²σ²/n + o(n^{−1}).

We can also write

{g(X̄) − g(µ)} / {g′(µ)²σ²/n}^{1/2} = {g(µ) + g′(µ)σZn/√n + O(n^{−1}) − g(µ)} / {g′(µ)²σ²/n}^{1/2} = Zn + O(n^{−1/2}) →D Z,

as n → ∞, which proves the result. Slutsky's lemma is needed only to replace g′(µ)²σ² by
g′(X̄)²S² →P g′(µ)²σ².
We must be careful here, because the terms O(n^{−1}) are random, and we need to know how to handle
them. But this is possible.

Probability and Statistics for SIC note 1 of slide 262

Note to Example 237


In this example, the Xi have mean µ = 1/λ and variance σ² = 1/λ², and we take g(x) = log x, so
g′(x) = 1/x. Hence the approximate mean and variance of log X̄ are

g(µ) = log(1/λ) = −log λ,   g′(µ)²σ²/n = λ² × (1/λ²)/n = n^{−1}.

This is called a variance-stabilising transformation, as var(log X̄) does not depend on λ.
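As an illustrative addition (not in the original notes, assuming numpy), the following simulation checks that var(log X̄) stays close to 1/n whatever the value of λ.

import numpy as np

rng = np.random.default_rng(2)
n, reps = 50, 20000
for lam in (0.5, 2.0, 10.0):
    Xbar = rng.exponential(scale=1/lam, size=(reps, n)).mean(axis=1)
    print(lam, round(float(np.var(np.log(Xbar))), 4), round(1/n, 4))
# var(log Xbar) is close to 1/n for every lambda, as the delta method predicts.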

Probability and Statistics for SIC note 2 of slide 262

Sample quantiles
Definition 238. Let X1, . . . , Xn be iid from F, and 0 < p < 1. Then the p sample quantile of X1, . . . , Xn is
the rth order statistic X_(r), where r = ⌈np⌉.

Theorem 239. (Asymptotic distribution of order statistics) Let 0 < p < 1, let X1, . . . , Xn be iid from F, and
xp = F^{−1}(p). Then if f(xp) > 0,

{X_(⌈np⌉) − xp} / [p(1 − p)/{nf(xp)²}]^{1/2} →D N(0, 1),   n → ∞.

This implies that

X_(⌈np⌉) is approximately N( xp, p(1 − p)/{nf(xp)²} ).

 To prove this, note that X_(r) ≤ x iff T = Σ_j I(Xj ≤ x) ≥ r, and apply the CLT to T.

Example 240. Show that the median of a normal sample of size n is approximately distributed
according to N {µ, πσ 2 /(2n)}.

Probability and Statistics for SIC slide 263

Distribution of the median


This graph compares the exact (black) and approximate (red) densities of the median X(⌈n/2⌉) for
X1, . . . , Xn iid from exp(1):

[Figure: four panels (n = 11, 21, 41, 81) of density against x; the exact and approximate densities of the median are increasingly close as n grows.]

Probability and Statistics for SIC slide 264

Note to Example 240
We first note that solving F(x) = 1/2 with F(x) = Φ{(x − µ)/σ} gives (x − µ)/σ = 0, and this means that
x_{1/2} = µ. Then note that

f(µ) = (2πσ²)^{−1/2} exp{−(µ − µ)²/(2σ²)} = (2πσ²)^{−1/2},

so the asymptotic variance of the median is

p(1 − p)/{nf(x_{1/2})²} = (1/4) × 2πσ²/n = πσ²/(2n),

which proves the required result.
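A brief simulation (an addition to the notes) makes this concrete: for normal samples the variance of the sample median is close to πσ²/(2n), roughly π/2 times the variance of the average.

import numpy as np

rng = np.random.default_rng(3)
mu, sigma, n, reps = 5.0, 2.0, 101, 20000
medians = np.median(rng.normal(mu, sigma, size=(reps, n)), axis=1)
print(float(np.var(medians)), np.pi * sigma**2 / (2 * n))   # these should be close
print(sigma**2 / n)                                         # variance of the average, for comparison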

Probability and Statistics for SIC note 1 of slide 264

7 Exploratory Statistics slide 265

7.1 Introduction slide 266

Construction of knowledge: the recent past

[Diagram: Nature gives rise to Theory and to Experiment/Observation; Theory leads to a Model, Experiment/Observation to Data, and the two are confronted through Analysis/Statistics.]

 To try and understand Nature, we invent theories that lead to models (e.g., Mendelian genetics,
quantum theory, fluid mechanics, . . . ).
 We compare the models with observations, preferably from experiments over which we have
some control, to assess whether the theory is adequate, or should be rejected or improved.
 The data are never measured exactly, so we usually need Statistics to assess whether differences
between the data and model are due to measurement error, or whether the data conflict with the
model, and therefore undermine or even falsify the theory.
 Data can only be used to falsify a theory—future data might be incompatible with it.

Probability and Statistics for SIC slide 267

The Big Data wave . . .

Probability and Statistics for SIC slide 268

Construction of knowledge: the (very) near future

[Diagram: as before, Nature, Theory, Model and Experiment/Observation, Data linked by Analysis/Statistics, now joined by Computation and Algorithms acting on the Data.]

 Sometimes we just want a good prediction:


– How long will it take to drive to Geneva?
– Will this client default on his mortgage?
– Should I give this person life insurance?
– Is this banking transaction fraudulent?
 Then algorithmic learning may be useful (random forests, neural nets, . . . ).

Probability and Statistics for SIC slide 269

What is Statistics?
 Statistics is the science of extracting information from data.

 Key points:
– variation (in the data) and the resulting uncertainty (about inferences) are important and are
represented using probability models and probabilistic ideas;
– context is important—it is crucial to know how the data were gathered and what they
represent.

Probability and Statistics for SIC slide 270

Statistical cycle
 The statistical method has four main stages:
– planning of a study, in order to obtain the maximum relevant information with the minimum
cost—
⊲ ideally—consideration of the problem, discussion of what data will be useful and how to get
them, choice of experimental design,
⊲ less ideally—someone comes with data and asks what to do with them;
– implementation of the experiment and reception of data—lots can go wrong;
– data analysis—
⊲ data exploration based on graphical or similar tools, followed by
⊲ statistical inference from models fitted to the data;
– presentation and interpretation of the results, followed by practical conclusions and action.
 Often this cycle is iterated: data analysis suggests questions that we can’t answer, so we have to
get more data, . . .

Probability and Statistics for SIC slide 271

Study types
 Often we aim to compare the effects of treatments (drugs, adverts, . . . ) on units (patients,
website users, . . . ).
 Two broad approaches:
– designed experiment—we control how treatments are allocated to units, usually using some
form of randomisation;
– observational study—the allocation of treatments is not under our control.
 Designed experiments allow much stronger inferences than observational studies: carefully done,
they can establish that
correlation ⇒ causation.
 Both types of study are all around us—clinical trials, web surveys, sample surveys, . . .
 A key advantage of randomisation is the reduction (ideally avoidance) of bias.

Probability and Statistics for SIC slide 272

Effect of randomization

[Diagram: two causal graphs on T, X, U, Y, with arrows U → X → T and with X and U both influencing Y; in the observational study (left) there is an additional arrow from U directly to T, which is absent in the randomised experiment (right).]
 Would offering office hours to students after the test improve final grades? Variables:
X: known mark at the test
Y : unknown final grade
T : treatment—give office hours or not
U : Unseen factors (motivation, math ability, hours spent at Sat, . . . )
 Observational study (left): U can influence T , so we can’t separate their effects.
 Randomized experiment (right): U can only influence T via the known X, so we can adjust for
X when estimating how T affects Y .
 We might observe correlation between T and Y in both cases, but only in the second can we infer
causation.
Probability and Statistics for SIC slide 273

Data analysis
Data analysis is often said to comprise two phases:
 exploratory data analysis (exploratory/descriptive statistics):
mostly simple, flexible and graphical methods that allow us to study groups of data and to detect
specific structures (tendencies, forms, atypical observations).
For example:
– in what range do most of your weights lie?
– is there an association between your weights and heights?
Exploratory analysis suggests working hypotheses and models that can be formalised and checked
in the second phase—but we risk bias if we use the data both to formulate a model and to check
it. It should be checked on new data, to reduce the chance that we are deceiving ourselves.
 statistical inference (formal/inferential statistics): leads to statistical conclusions from data using
notions from probability theory. This involves model construction, testing, estimation and
forecasting methods.

Probability and Statistics for SIC slide 274

7.2 Types of Data slide 275

Population, sample
Population: the entire set of units we might study;
Sample: subset of the population
The data collected are usually a sample of individuals from a population of interest.
Illustration:
 Population: set of EPFL students.
 Unit: an individual student.
 Observation: the weight of the unit.
 Sample: a subset of second-year students in SIC.

We want to study a characteristic (or characteristics) that each unit possesses, a statistical variable,
for example the weight (and height) of the individual.

Probability and Statistics for SIC slide 276

Types of variables
A variable can be quantitative or qualitative.
A quantitative variable can be discrete (often integer) or continuous (i.e., taking any value in an
interval).
Illustration:
 Quantitative discrete variables:
– number of children in a family
– number of students in this room
 Quantitative continuous variables:
– weight in kilos
– height in centimetres
Often continuous variables are rounded to some extent, or can only be recorded to a certain
precision.

Probability and Statistics for SIC slide 277

Qualitative variables
A (categorical) qualitative variable can be nominal (non-ordered) or ordinal (ordered).
Illustration:
 Qualitative nominal variables:
– gender (male or female)
– blood types (A, B, AB, O).
 Qualitative ordinal variables:
– the meal on offer at the Vinci (good, average, bad)
– interest in statistics (very low, low, average, high, very high)

We may convert quantitative variables into categorical variables for descriptive reasons.
Illustration: Size in centimetres: S, M, L, XL, XXL.
Probability and Statistics for SIC slide 278

7.3 Graphical Study of Variables slide 279

Qualitative variable
Example 241. The blood types of 25 donors were collected:

AB B A O B
O B O A O
B O B B B
A O AB AB O
A B AB O A

The frequency table is

Class Absolute frequency Relative frequency


A 5 5/25 = 0.2
B 8 8/25 = 0.32
O 8 8/25 = 0.32
AB 4 4/25 = 0.16
Total 25 25/25=1

Probability and Statistics for SIC slide 280

Pie chart
Pie chart (diagramme en camembert / en secteurs):

[Figure: pie chart of the blood-type relative frequencies A, B, O, AB.]
Probability and Statistics for SIC slide 281

Bar plot
Bar plot (diagramme en barres):

[Figure: bar plot of the absolute blood-type frequencies; bars A, B, O, AB of heights 5, 8, 8, 4.]

Probability and Statistics for SIC slide 282

Study of a quantitative variable


Consider one continuous variable measured several times. We thus have n observations

x1 , x2 , . . . , xn

of the variable, which can be arranged in increasing order, giving the sample order statistics

x(1) ≤ x(2) ≤ · · · ≤ x(n)

where x(1) is the minimum and x(n) is the maximum.


Some authors replace x(1) by x[1] or by x1/n or by x1:n or by x(1)|n .

Probability and Statistics for SIC slide 283

Example
Example 242. The weights of 92 students in an American school were measured in pounds.
The data are:

Boys
140 145 160 190 155 165 150 190 195 138 160
155 153 145 170 175 175 170 180 135 170 157
130 185 190 155 170 155 215 150 145 155 155
150 155 150 180 160 135 160 130 155 150 148
155 150 140 180 190 145 150 164 140 142 136
123 155

Girls
140 120 130 138 121 125 116 145 150 112 125
130 120 130 131 120 118 125 135 125 118 122
115 102 115 150 110 116 108 95 125 133 110
150 108

Probability and Statistics for SIC slide 284

Stem-and-leaf diagram
We translate weight 95 7→ 9 | 5, weight 102 7→ 10 | 2, etc.

9 5
10 288
11 002556688
12 00012355555
13 0000013555688
14 00002555558
15 0000000000355555555557
16 000045
17 000055
18 0005
19 00005
20
21 5

Probability and Statistics for SIC slide 285

Histogram
 To construct a histogram, it is useful to have a frequency table, which can be considered to
summarize the observed values.
 A histogram shows the number of observations in groups determined by a division into intervals of
the same length.
Here is an example of the construction of a frequency table:

Class Centre Absolute frequency Relative frequency


87.5–102.5⁻     95    2    0.022
102.5–117.5⁻   110    9    0.098
117.5–132.5⁻   125   19    0.206
132.5–147.5⁻   140   17    0.185
147.5–162.5⁻   155   27    0.293
162.5–177.5⁻   170    8    0.087
177.5–192.5⁻   185    8    0.087
192.5–207.5⁻   200    1    0.011
207.5–222.5⁻   215    1    0.011
Total 92 1

Probability and Statistics for SIC slide 286

Histogram II

Histograms of the weight of students in the American school; 9 (left) and 13 (right) classes with
absolute frequencies (top) and relative frequencies (bottom).

Probability and Statistics for SIC slide 287

Histogram III
 Advantage: a histogram can be used with large and small datasets.
 Disadvantage: the loss of information due to the absence of the values of the observations and
the choice of the width of the bins (classes), which can be difficult, leading to different possibilities for
interpretation!
 Remark: The stem-and-leaf diagram can be seen as a particular histogram obtained by a rotation.
 Remark: There exist better versions of the histogram, such as kernel density estimates.

Probability and Statistics for SIC slide 288

Kernel density estimate


Definition 243. Let K be a kernel, i.e., a probability density symmetric about 0 with variance 1, and
let y1, . . . , yn be a sample of data drawn from some distribution with probability density f. Then the
kernel density estimator (KDE) of f for bandwidth h > 0 is

f̂_h(y) = (nh)^{−1} Σ_{j=1}^n K{(y − yj)/h},   y ∈ R.

This gives a nonparametric estimator of the density underlying the sample, and depends on:
 the kernel K — not very important. We often take

K(u) = φ(u) = (2π)−1/2 exp(−u2 /2), u ∈ R;

 the bandwidth h — very important!
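As an illustration (an addition to the notes, assuming numpy), here is a direct implementation of f̂_h with the Gaussian kernel φ; the function name kde is of course arbitrary.

import numpy as np

def kde(y, data, h):
    # Gaussian-kernel density estimate f_hat_h evaluated at the points y.
    y = np.asarray(y, dtype=float)[:, None]          # evaluation points, as a column
    u = (y - np.asarray(data, dtype=float)) / h      # (y - y_j)/h for every data point
    K = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)     # standard normal kernel phi(u)
    return K.sum(axis=1) / (len(data) * h)           # (nh)^{-1} sum_j K{(y - y_j)/h}

# Usage sketch: grid = np.linspace(min(data) - 3*h, max(data) + 3*h, 200); fh = kde(grid, data, h=2.0)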

Probability and Statistics for SIC slide 289

Construction of a KDE
Left: construction of a kernel density estimate, with sample y1 , . . . , yn shown by the rug.
Right: effect of the bandwidth h, which smoothes the data more as h increases.
[Figure: two panels of density against length (mm); left, the individual kernels summing to the estimate; right, estimates with h = 1 (red), h = 2 (black) and h = 3 (blue).]

Probability and Statistics for SIC slide 290

Remarks
It is not easy to create good graphs. Often those generated by standard software (e.g., Excel) are
poor. Some suggestions:
 try to show the data itself as much as possible—no chart-junk (unnecessary colours/lines/. . .
etc.);
 put units and clear explanations for the axes and the legend;
 to compare related quantities, use the same axes and put the graphs close together;
 choose the scales such that systematic relations appear at a ∼ 45◦ angle to the axes;
 varying the aspect ratio can reveal interesting things;
 draw the graph in such a way that departures from ‘standard’ appear as departures from linearity
or from a random cloud of points.

Probability and Statistics for SIC slide 291

Chartjunk
This graph shows 5 numbers!

(Source: http://www.datavis.ca/gallery/say-something.php)

Probability and Statistics for SIC slide 292

Chartjunk and scale

(Source: http://www.datavis.ca/gallery/say-something.php)

Probability and Statistics for SIC slide 293

Choosing the right axes
Effect of the choice of the axes on the perception of a relationship:

[Figure: two scatterplots of model ozone (ppbv) against observed ozone (ppbv), showing the same data with different axis ranges.]

Probability and Statistics for SIC slide 294

The Russian campaign of 1812

(Source: Wikipedia)

Probability and Statistics for SIC slide 295

7.4 Numerical Summaries slide 296

Principal characteristics of the data


For quantitative variables, we are usually interested in:
 the central tendency or location, which gives information on the “middle” (the position, the
centre) of the data. We often use the sample average or the sample median to summarise this;
 the dispersion, which gives information on the variability of the distribution around its centre. We
often use the range, the standard deviation or the interquartile range to summarise this;
 symmetry or asymmetry with respect to the centre;
 the number of modes (“bumps”) the data exhibit.

Probability and Statistics for SIC slide 297

Shapes of densities
[Figure: four panels A–D, each showing two densities (frequency against the variable).]

A: Similar densities but different centres


B: Same centre, different dispersions
C: Different dispersions and centres
D: Asymmetry of the red density

Probability and Statistics for SIC slide 298

Central tendency
Indicators of central tendency (measures of position):
 We previously met the mean and median

E(X), F −1 (0.5),

of a theoretical distribution, F . Now we define the corresponding sample quantities, based on data
x1 , . . . , xn .
 The (arithmetic) average:

x̄ = (x1 + · · · + xn)/n = n^{−1} Σ_{i=1}^n xi.

The average of the weights of the American students is 145.15 pounds.


 The (sample) median: the observations are ranked in increasing order. Then the median med(x)
is the value which divides the set of observations into two sections of the same size. 50% of the
data are smaller than it and 50% are greater.

Probability and Statistics for SIC slide 299

Sample median
 Take med(x) = x(⌈n/2⌉) , where ⌈x⌉ is the smallest integer ≥ x.
 Data x with n = 7,

1, 4, 7, 9, 10, 12, 14 ⇒ med(x) = x(⌈7/2⌉) = x(4) = 9.

 Data x with n = 8,

1, 4, 7, 9, 10, 12, 14, 25 ⇒ med(x) = x(⌈8/2⌉) = x(4) = 9.

 We sometimes use a symmetric definition

med(x) = x_((n+1)/2) if n is odd,   med(x) = {x_(n/2) + x_(n/2+1)}/2 if n is even.

 In the above case with n = 8 we thus have

med(x) = (x(4) + x(4+1) )/2 = (9 + 10)/2 = 9.5.

Probability and Statistics for SIC slide 300

Breakdown point
 If the distribution is symmetric, then the average is close to the median.
 The average is much more sensitive to extreme (atypical) values, so-called outliers, than the
median.
 The median resists the impact of outliers:

x1 = 1, x2 = 2, x3 = 3 gives x̄ = 2 and med(x) = 2,

but

x1 = 1, x2 = 2, x3 = 30 gives x̄ = 11 and med(x) = 2,

so the median is unchanged by changing 3 ↦ 30, but the average reacts badly.
 We say that the median has (asymptotic) breakdown point 50%, because in a very large sample
the median would only move an arbitrarily large amount if 50% of the observations were corrupted.
The average has breakdown point 0%, because a single bad value can move it to ±∞.

Probability and Statistics for SIC slide 301

Quartiles and sample quantiles
 We previously defined the quantiles of a theoretical distribution F as xp = F −1 (p). Now we define
the sample equivalents.
 The median (50%/50%) can be generalised by dividing the distribution into four or more equal
parts.
 The bounds of the classes thus obtained are called quartiles (if 4 parts) or more generally sample
quantiles.
 To define the p sample quantile (also called the 100p percentile), q̂(p), for 0 < p < 1, the data are
put in increasing order
x_(1) ≤ · · · ≤ x_(n).
We calculate np. If this is not an integer, we take the next largest integer:

q̂(p) = x_(⌈np⌉).

 Special case: quartiles

q̂(0.25) (lower quartile),   q̂(0.50) (median),   q̂(0.75) (upper quartile).

Probability and Statistics for SIC slide 302

Calculation of quantiles
Illustration: Calculation of the 32nd percentile (p = 0.32) with n = 10 and data

27, 29, 31, 31, 31, 34, 36, 39, 42, 45.

Here np = 10 × 0.32 = 3.2 ⇒ ⌈np⌉ = ⌈3.2⌉ = 4 ⇒ q̂(0.32) = x_(4) = 31.
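A direct translation into code (an addition to the notes) of the rule q̂(p) = x_(⌈np⌉):

import math

def sample_quantile(x, p):
    # p sample quantile (0 < p < 1): the ceil(n*p)-th order statistic
    xs = sorted(x)
    r = max(math.ceil(len(xs) * p), 1)      # rank, at least 1
    return xs[r - 1]                        # order statistics are 1-indexed

data = [27, 29, 31, 31, 31, 34, 36, 39, 42, 45]
print(sample_quantile(data, 0.32))          # ceil(3.2) = 4, so x_(4) = 31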
Probability and Statistics for SIC slide 303

Variability/dispersion measures
 The most common is the sample standard deviation

s = [ (n − 1)^{−1} Σ_{i=1}^n (xi − x̄)² ]^{1/2} = [ (n − 1)^{−1} { Σ_{i=1}^n xi² − n x̄² } ]^{1/2},

but this has breakdown point 0%. The quantity s² is the sample variance.
 The range
max(x1 , . . . , xn ) − min(x1 , . . . , xn ) = x(n) − x(1)
is unsatisfactory as we consider only the two most extreme xi , which are very sensitive to outliers;
its breakdown point is also 0%.
 The interquartile range (IQR)

IQR(x) = q̂(0.75) − q̂(0.25)

is more resistant to outliers, with breakdown point 25%.

Probability and Statistics for SIC slide 304

Measures of correlation
 We often want to measure the strength of relationship between two different quantities, based on
data pairs (x1 , y1 ), . . . , (xn , yn ) from n individuals. Recall from slide 195 that the theoretical
correlation between two random variables (X, Y ) is defined as

corr(X, Y) = cov(X, Y) / {var(X) var(Y)}^{1/2}.

 The sample correlation is defined as

r_xy = [ n^{−1} Σ_{j=1}^n (xj − x̄)(yj − ȳ) ] / [ n^{−1} Σ_{j=1}^n (xj − x̄)² × n^{−1} Σ_{j=1}^n (yj − ȳ)² ]^{1/2}.

Analogous to the theoretical correlation coefficient, rxy satisfies the properties


(a) −1 ≤ rxy ≤ 1;
(b) if rxy = ±1, then the data pairs (xj , yj ) lie on a straight line, with positive slope if rxy = 1,
with negative slope if rxy = −1;
(c) rxy = 0 implies a lack of linear dependence between the variables, not a lack of
dependence;
(d) the linear transformation (xj , yj ) 7→ (a + bxj , c + dyj ) gives rxy 7→ sign(bd)rxy .

Probability and Statistics for SIC slide 305

7.5 Boxplot slide 306

“Five-number summary”
The list of five values

min = x_(1),   q̂(0.25),   median,   q̂(0.75),   max = x_(n),

called the “five-number summary”, gives a simple and useful numerical summary of a sample. It is
the basis for the boxplot.


Boxplot (boîte à moustache) of the weights of students in an American school, with a ‘rug’ showing
the individual values.
The central box shows qb(0.25), the sample median and qb(0.75), so its width is the IQR. The limits of
the whiskers are discussed below.
Probability and Statistics for SIC slide 307

Boxplot: calculations
 Weights of the 92 American students.
 The “five-number summary” is

95, 125, 145, 156, 215,

thus
IQR(x) = q̂(0.75) − q̂(0.25) = 156 − 125 = 31
C = 1.5 × IQR(x) = 1.5 × 31 = 46.5
q̂(0.25) − C = 125 − 46.5 = 78.5
q̂(0.75) + C = 156 + 46.5 = 202.5
 The bounds of the whiskers are the most extreme xi lying inside the limits

q̂(0.25) − C,   q̂(0.75) + C.

 Any xi outside the whiskers are shown individually; they might be outliers relative to the normal
distribution.
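The following sketch (an addition to the notes, using the q̂(p) = x_(⌈np⌉) convention from above) computes the five-number summary, the IQR and the whisker limits for any sample; applied to the 92 weights it should reproduce the values quoted here, up to the quantile convention used by a given software package.

import math

def boxplot_numbers(x):
    # Five-number summary, IQR and whisker limits, with q_hat(p) = x_(ceil(n*p)).
    xs = sorted(x)
    n = len(xs)
    q = lambda p: xs[math.ceil(n * p) - 1]
    q1, med, q3 = q(0.25), q(0.50), q(0.75)
    iqr = q3 - q1
    c = 1.5 * iqr
    low_whisker = min(v for v in xs if v >= q1 - c)    # most extreme values lying
    high_whisker = max(v for v in xs if v <= q3 + c)   # inside [q1 - C, q3 + C]
    return {"min": xs[0], "q1": q1, "median": med, "q3": q3, "max": xs[-1],
            "IQR": iqr, "whiskers": (low_whisker, high_whisker)}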
Probability and Statistics for SIC slide 308

Boxplot: example 1
The boxplot is very useful for comparing several groups of observations:

Boxplot of the weights of students of the American school, according to gender.

Probability and Statistics for SIC slide 309

Boxplot: example 2


Boxplot of groups of symmetric and asymmetric observations

Probability and Statistics for SIC slide 310

Boxplot: SIC 2012


[Figure: boxplots of Height (cm), left panel, and Exam mark, right panel, for IN and SC students.]

The graph on the right raises a question . . .

Probability and Statistics for SIC slide 311

7.6 Choice of a Model slide 312

Initial analysis of data


We now have a strategy to explore data from a quantitative variable:
 always do graphical representations first;
 study the global structure of the data and identify potential atypical values (“outliers”)—find why
they appear, or use so-called robust methods, that aren’t (so) affected by bad observations;
 calculate numerical summaries to describe the central tendency (position/centre/location) and
the dispersion (scale/variability).
Here is one more step:
 sometimes the global structure is so regular that we would like to describe it by a smooth curve.
This curve is a mathematical description of the distribution of data—a density function that we
would like to fit to the data.
Probability and Statistics for SIC slide 313

Reminder: normal density


A special and very important class of densities is the normal (Gaussian) density, N(µ, σ²):

f(x) = (2πσ²)^{−1/2} exp{ −(x − µ)²/(2σ²) },   −∞ < x, µ < ∞,  σ > 0,

where µ is the mean (expectation) and σ is the standard deviation.


Here f (x) is the height of the curve at the point x.
[Figure: the standard normal density φ(x) plotted for −4 ≤ x ≤ 4.]

Probability and Statistics for SIC slide 314

Normal Q-Q plot
 The histogram or the boxplot may suggest that a normal distribution is suitable for the data, such
as: no atypical values, symmetry, unimodality.
 But we need to know this more precisely: the best tool to graphically check normality is the
“normal Q-Q plot”, i.e., a graph of the ordered sample values

x(1) ≤ x(2) ≤ · · · ≤ x(n) ,

against the normal plotting positions

Φ−1 {1/(n + 1)}, Φ−1 {2/(n + 1)}, . . . , Φ−1 {n/(n + 1)}.

 A graph close to a straight line suggests that the observations can be well fitted by a normal model.
 Abnormal values appear as isolated points.
 The slope and the intercept at x = 0 give estimates of σ and µ respectively.
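A minimal sketch (an addition to the notes, assuming numpy and scipy) of how the plotting positions and ordered values are computed:

import numpy as np
from scipy.stats import norm

def normal_qq(x):
    # Return (theoretical quantiles, ordered data) for a normal Q-Q plot.
    xs = np.sort(np.asarray(x, dtype=float))
    n = len(xs)
    positions = norm.ppf(np.arange(1, n + 1) / (n + 1))   # Phi^{-1}{i/(n+1)}
    return positions, xs

# A roughly straight-line plot of xs against positions suggests the data are close to normal;
# the slope estimates sigma and the intercept estimates mu.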

Probability and Statistics for SIC slide 315

Example: heights
Histogram and normal Q-Q plot of the heights of 88 students:

[Figure: left, histogram of sic$Height on the density scale; right, normal Q-Q plot of Height (cm) against the theoretical quantiles, close to a straight line.]

Probability and Statistics for SIC slide 316

Example: Newcomb data
Normal Q-Q plot of 66 measures of time for light to cross a known distance, measured by Newcomb:

[Figure: two normal Q-Q plots of the passage times against standard normal quantiles; left, all the data, showing two clear outliers; right, without the two outliers, close to a straight line.]

Probability and Statistics for SIC slide 317

8 Statistical Inference slide 318

8.1 Introduction slide 319

Introduction
The study of mathematics is based on deduction:

axioms ⇒ consequences.

In the case of probability, we have

(Ω, F, P) ⇒ P(A), P(A | B), P(X ≤ x), E(X r ), . . .

Inferential statistics concern induction—having observed an event A, we want to say something about
a probability space (Ω, F, P) we suppose to be underlying the data:
A ⇒(?) (Ω, F, P).

In the past the term inverse probability was given to this process.

Probability and Statistics for SIC slide 320

Statistical model
 We assume that the observed data, or data to be observed, can be considered as realisations of a
random process, and we aim to say something about this process based on the data.
 Since the data are finite, and the process is unknown, there will be many uncertainties in our
analysis, and we must try to quantify them as well as possible.
 Several problems must be addressed:
– specification of a model (or of models) for the data;
– estimation of the unknowns of the model (parameters, . . .);
– tests of hypotheses concerning a model;
– planning of the data collection and analysis, to answer the key questions as effectively as
possible (i.e., minimise uncertainty for a given cost);
– decision when faced with uncertainties;
– prediction of future unknowns;
– behind the other problems lies the relevance of the data to the question we want to answer.

Probability and Statistics for SIC slide 321

Definitions
Notation: we will use y and Y to represent the data y1 , . . . , yn and Y1 , . . . , Yn .

Definition 244. A statistical model is a probability distribution f (y) chosen or constructed to learn
from observed data y or from potential data Y .
 If f (y) = f (y; θ) is determined by a parameter θ of finite dimension, it is a parametric model,
and otherwise it is a nonparametric model.
 A perfectly known model is called simple, otherwise it is composite.

Statistical models are (almost) always composite in practice, but simple models are useful when
developing theory.

Definition 245. A statistic T = t(Y ) is a known function of the data Y .

Definition 246. The sampling distribution of a statistic T = t(Y ) is its distribution when Y ∼ f (y).

Definition 247. A random sample is a set of independent and identically distributed random
variables Y1 , . . . , Yn , or their realisations y1 , . . . , yn .

Probability and Statistics for SIC slide 322

Examples
Example 248. Assume that y1 , . . . , yn is a random sample from a Bernoulli distribution with unknown
parameter p ∈ (0, 1). Then the statistic
t = Σ_{j=1}^n yj

is considered to be a realisation of the random variable

T = Σ_{j=1}^n Yj,

whose sampling distribution is B(n, p).

Example 249. Assume that y1, . . . , yn is a random sample from the N(µ, σ²) distribution, with µ, σ²
unknown. Then ȳ = n^{−1}(y1 + · · · + yn) and s² = (n − 1)^{−1} Σ_{j=1}^n (yj − ȳ)² are statistics, realisations
of the random variables

Ȳ = n^{−1}(Y1 + · · · + Yn),   S² = (n − 1)^{−1} Σ_{j=1}^n (Yj − Ȳ)².

Find the sampling distribution of Ȳ.

Probability and Statistics for SIC slide 323

Note to Example 249
If µ and σ² are finite, then elementary computations (see Lemma 166) give

E(Ȳ) = E( n^{−1} Σ_{j=1}^n Yj ) = n^{−1} × n E(Yj) = µ,   var(Ȳ) = n^{−2} Σ_{j=1}^n var(Yj) = σ²/n,

since the Yj are independent and all have variance σ 2 . These results do not rely on normality of the
Yj , but the variance computation does need independence. We see that the larger n is, the smaller is
the variance of Y . This backs up our intuition that a larger sample is more informative about the
underlying phenomenon—but the data must be sampled independently, and the variance must be finite!
If in addition the Yj are normal, then Y is a linear combination of normal variables, and so has a
normal distribution,
Y ∼ N (µ, σ 2 /n),
so we have a very precise idea of how Y will behave (or, rather, we would have, if we knew µ and σ 2 ).

Probability and Statistics for SIC note 1 of slide 323

8.2 Point Estimation slide 324

Statistical models
We would like to study a set of individuals or elements called a population based on a subset of this
set called a sample:
 statistical model: the unknown distribution F or density f of Y ;
 parametric statistical model: the distribution of Y is known except for the values of parameters
θ, so we can write F (y) = F (y; θ), but with θ unknown;
 sample (must be representative of the population): “data” y1, . . . , yn, often supposed to be a
random sample, i.e., Y1, . . . , Yn iid from F;
 statistic: any function T = t(Y1, . . . , Yn) of the random variables Y1, . . . , Yn;
 estimator: a statistic θ̂ used to estimate a parameter θ of f.

Probability and Statistics for SIC slide 325

Example
iid
Example 250. If we assume that Y1 , . . . , Yn ∼ N (µ, σ 2 ) but with µ, σ 2 unknown, then
 this is a parametric statistical model;
 µ̂ = Ȳ is an estimator of µ, whose observed value is ȳ;
 σ̂² = n^{−1} Σ_{j=1}^n (Yj − Ȳ)² is an estimator of σ², whose observed value is n^{−1} Σ_{j=1}^n (yj − ȳ)².

Note that:
 a statistic T is a function of the random variables Y1 , . . . , Yn , so T is itself a random variable;
 the sampling distribution of T depends on the distribution of the Yj ;
 if we cannot deduce the exact distribution of T from that of the Yj , we must sometimes make do
with knowing E(T ) and var(T ), which give partial information on the distribution of T , and thus
may allow us to approximate the distribution of T (often using the central limit theorem).

Probability and Statistics for SIC slide 326

Estimation methods
There are many methods for estimating the parameters of models. The choice among them depends
on various criteria, such as:
 ease of calculation;
 efficiency (getting estimators that are as precise as possible);
 robustness (getting estimators that don’t fail calamitously when the model is wrong, e.g., when
outliers appear).
The trade-off between these criteria depends on what assumptions we are willing to make in a given
context.
Examples of common methods are:
 method of moments (simple, can be inefficient);
 maximum likelihood estimation (general, optimal in many parametric models);
 M-estimation (even more general, can be robust, but loses efficiency compared to maximum
likelihood).

Probability and Statistics for SIC slide 327

Method of moments
 The method of moments estimate of a parameter θ is the value θ̃ that matches the theoretical
and empirical moments.
 For a model with p unknown parameters, we set the theoretical moments of the population equal
to the empirical moments of the sample y1 , . . . , yn , and solve the resulting equations, i.e.,
E(Y^r) = ∫ y^r f(y; θ) dy = n^{−1} Σ_{j=1}^n yj^r,   r = 1, . . . , p.

 We thus need as many (finite!) moments of the underlying model as there are unknown
parameters.
 We may have more than one choice of moments to use, so in principle the estimate is not unique,
but in practice we usually use the first r moments, because they give the most stable estimates.

Example 251. If y1 , . . . , yn is a random sample from the U (0, θ) distribution, estimate θ.

Example 252. If y1 , . . . , yn is a random sample from the N (µ, σ 2 ) distribution, estimate µ and σ 2 .

Probability and Statistics for SIC slide 328

Example 251
 Standard computations show that if Y ∼ U (0, θ), then E(Y ) = θ/2. To find the moments
estimate of θ, we therefore solve the equation

E(Y ) = y, i.e., θ/2 = y,

to get the estimate θ̃ = 2y.


 Simulations show that with n ≥ 12 the distribution of the random variable θ̃ is very close to
normality, as we would expect, because the central limit theorem gives a good approximation to
the distribution of θ̃ for small n, owing to the symmetry of the uniform distribution.

Probability and Statistics for SIC note 1 of slide 328

Example 252
The theoretical values of the first two moments are

E(Y) = µ,   E(Y²) = var(Y) + E(Y)² = σ² + µ²,

and the corresponding sample versions are

ȳ = µ̃,   n^{−1} Σ_{j=1}^n yj² = σ̃² + µ̃².

Solving these gives

µ̃ = ȳ,   σ̃² = n^{−1}( Σ_{j=1}^n yj² − n ȳ² ) = n^{−1} Σ_{j=1}^n (yj − ȳ)²,

as can be seen by expanding out the right-hand expression:

Σ_{j=1}^n (yj − ȳ)² = Σ_{j=1}^n yj² − 2ȳ Σ_{j=1}^n yj + n ȳ² = Σ_{j=1}^n yj² − 2n ȳ² + n ȳ² = Σ_{j=1}^n yj² − n ȳ².

Probability and Statistics for SIC note 2 of slide 328

Maximum likelihood estimation


This is a much more general and powerful method of estimation, but in practice it usually requires
numerical methods of optimisation.

Definition 253. If y1 , . . . , yn is a random sample from the density f (y; θ), then the likelihood for θ is

L(θ) = f (y1 , . . . , yn ; θ) = f (y1 ; θ) × f (y2 ; θ) × · · · × f (yn ; θ).

The data are treated as fixed, and the likelihood L(θ) is regarded as a function of θ.

Definition 254. The maximum likelihood estimate (MLE) θ̂ of a parameter θ is the value that
gives the observed data the highest likelihood. Thus
L(θ̂) ≥ L(θ) for each θ.

Probability and Statistics for SIC slide 329

Calculation of the MLE θ̂
We simplify the calculations by maximising ℓ(θ) = log L(θ) rather than L(θ).
The approach is:
 calculate the log-likelihood ℓ(θ) (and plot it if possible);
 find the value θ̂ maximising ℓ(θ), which often satisfies dℓ(θ̂)/dθ = 0;
 check that θ̂ gives a maximum, often by checking that d²ℓ(θ̂)/dθ² < 0.

Example 255. Suppose that y1, . . . , yn is a random sample from an exponential density with unknown
λ. Find λ̂.

Example 256. Suppose that y1, . . . , yn is a random sample from a uniform density, U(0, θ), with
unknown θ. Find θ̂.
Probability and Statistics for SIC slide 330

Note to Example 255


The likelihood is

L(λ) = λ e^{−λy1} × · · · × λ e^{−λyn} = λ^n e^{−λ(y1 + ··· + yn)},   λ > 0,

so the log likelihood is

ℓ(λ) = log L(λ) = n log λ − nλȳ.

Thus the maximum likelihood estimate λ̂ is the solution to

dℓ(λ)/dλ = n/λ − nȳ = 0,

and so λ̂ = 1/ȳ.
To check that λ̂ gives a maximum, we note that the second derivative of ℓ(λ) is

d²ℓ(λ)/dλ² = −n/λ² < 0,   λ > 0,

so the log likelihood is concave, and therefore λ̂ gives the unique maximum.
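As a sanity check (an addition to the notes, assuming numpy and scipy), the closed form λ̂ = 1/ȳ can be compared with a direct numerical maximisation of ℓ(λ):

import numpy as np
from scipy.optimize import minimize_scalar

def exp_mle(y):
    # Maximise the exponential log-likelihood n*log(lam) - lam*sum(y) numerically.
    y = np.asarray(y, dtype=float)
    nll = lambda lam: -(len(y) * np.log(lam) - lam * y.sum())   # negative log-likelihood
    return minimize_scalar(nll, bounds=(1e-8, 1e3), method="bounded").x

rng = np.random.default_rng(4)
y = rng.exponential(scale=1/3.0, size=200)      # true lambda = 3
print(exp_mle(y), 1/np.mean(y))                 # numerical optimum vs the closed form 1/ybar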

Probability and Statistics for SIC note 1 of slide 330

Note to Example 256


The density is f(y; θ) = θ^{−1} I(0 < y < θ), so since the observations are independent, the likelihood is

L(θ) = Π_{j=1}^n θ^{−1} I(0 < yj < θ) = θ^{−n} I(0 < y1, . . . , yn < θ) = θ^{−n} I(θ > m),   θ > 0,

where m = max(y1, . . . , yn); note that Π_j I(0 < yj < θ) = I(m < θ). Viewed as a function of θ this
is maximised at θ̂ = m, which is therefore the MLE.
In this case the maximum is NOT found by differentiation of the likelihood, which is not differentiable
at θ̂.
Probability and Statistics for SIC note 2 of slide 330

M-estimation
 This generalises maximum likelihood estimation. We maximise a function of the form
ρ(θ; Y) = Σ_{j=1}^n ρ(θ; Yj),

where ρ(θ; y) is (if possible) concave as a function of θ for all y. Equivalently we minimise
−ρ(θ; Y ).
 We choose the function ρ to give estimators with suitable properties, such as small variance or
robustness to outliers.
 Taking ρ(θ; y) = log f (y; θ) gives the maximum likelihood estimator.
Example 257. Let Y1, . . . , Yn be iid from f with E(Yj) = θ, and take ρ(y; θ) = −(y − θ)². Find the least
squares estimator of θ.

Example 258. Let Y1, . . . , Yn be iid from f such that E(Yj) = θ, and take ρ(y; θ) = −|y − θ|. Find the
corresponding estimator of θ.

Probability and Statistics for SIC slide 331

Note to Example 257


We want to maximise

ρ(θ; y) = −Σ_{j=1}^n (yj − θ)²,

and this is equivalent to minimising the sum of squares

−ρ(θ; y) = Σ_{j=1}^n (yj − θ)²

with respect to θ. Differentiation gives

−dρ(θ; y)/dθ = −Σ_{j=1}^n 2(yj − θ),

and setting this equal to zero gives θ̂ = ȳ. The second derivative is

−d²ρ(θ; y)/dθ² = 2n > 0,

so the minimum is unique.

Probability and Statistics for SIC note 1 of slide 331

Note to Example 258
We want to maximise

ρ(θ; y) = −Σ_{j=1}^n |yj − θ|,

and we note that if θ > y then −|y − θ| = y − θ and if θ < y then −|y − θ| = θ − y, so the respective
derivatives with respect to θ are −1 and +1. This implies that

dρ(θ; y)/dθ = P(θ) − N(θ),

where P(θ) is the number of yj for which θ < yj and N(θ) = n − P(θ) is the number of yj for which
θ > yj. Hence when regarded as a function of θ,

dρ(θ; y)/dθ = 2P(θ) − n

is a step function that has initial value n for θ = −∞, drops by 2 at each yj, and takes value −n when
θ = +∞. If n is odd, then 2P(θ) − n equals zero when θ is the median of the sample, and if n is even,
then 2P(θ) − n equals zero on the interval y_(n/2) ≤ θ ≤ y_(n/2+1). In this latter case we can take the
median to be {y_(n/2) + y_(n/2+1)}/2 for uniqueness.
Thus this choice of function ρ yields the sample median as an estimator.

Probability and Statistics for SIC note 2 of slide 331

Bias
How should we compare estimators?

Definition 259. The bias of the estimator θ̂ of θ is

b(θ) = E(θ̂) − θ.

 Interpretation of the bias:

– if b(θ) < 0 for all θ, then on average θ̂ underestimates θ;
– if b(θ) > 0 for all θ, then on average θ̂ overestimates θ;
– if b(θ) = 0 for all θ, then θ̂ is said to be unbiased.
 If b(θ) ≈ 0, then θ̂ is ‘in the right place’ on average.

Example 260. Let Y1, . . . , Yn be iid N(µ, σ²). Find the bias and variance of µ̂ = Ȳ and the bias of
σ̂² = n^{−1} Σ_j (Yj − Ȳ)².

Probability and Statistics for SIC slide 332

Note to Example 260
 In Example 249 we saw that
E(Ȳ) = µ,   var(Ȳ) = σ²/n,
so the bias of µ̂ = Ȳ as an estimator of µ is E(Ȳ) − µ = 0.
 To find the expectation of σ̂² = n^{−1} Σ_j (Yj − Ȳ)², note that

Σ_{j=1}^n (Yj − Ȳ)² = Σ_{j=1}^n {Yj − µ − (Ȳ − µ)}²
                   = Σ_{j=1}^n (Yj − µ)² − 2 Σ_{j=1}^n (Yj − µ)(Ȳ − µ) + Σ_{j=1}^n (Ȳ − µ)²
                   = Σ_{j=1}^n (Yj − µ)² − 2n(Ȳ − µ)² + n(Ȳ − µ)²
                   = Σ_{j=1}^n (Yj − µ)² − n(Ȳ − µ)²,

which implies that

E{ Σ_{j=1}^n (Yj − Ȳ)² } = Σ_{j=1}^n E{(Yj − µ)²} − n E{(Ȳ − µ)²}
                         = n var(Yj) − n var(Ȳ)
                         = nσ² − nσ²/n
                         = (n − 1)σ².

Therefore

E(σ̂²) = n^{−1} E{ Σ_{j=1}^n (Yj − Ȳ)² } = (n − 1)σ²/n,

and the bias of σ̂² is

E(σ̂²) − σ² = (n − 1)σ²/n − σ² = −σ²/n.

Therefore on average σ̂² underestimates σ², by an amount that should be small for large n.
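A short simulation (an addition to the notes) illustrates the bias: with n = 10 and σ² = 4 the average of σ̂² over many samples is close to (n − 1)σ²/n = 3.6 rather than 4.

import numpy as np

rng = np.random.default_rng(5)
mu, sigma2, n, reps = 0.0, 4.0, 10, 50000
Y = rng.normal(mu, np.sqrt(sigma2), size=(reps, n))
sigma2_hat = np.mean((Y - Y.mean(axis=1, keepdims=True))**2, axis=1)   # divisor n
print(float(sigma2_hat.mean()), (n - 1)/n * sigma2)    # both close to 3.6, below sigma^2 = 4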

Probability and Statistics for SIC note 1 of slide 332

Bias and variance
[Figure: four dartboards labelled "High bias, low variability", "Low bias, high variability", "High bias, high variability" and "The ideal: low bias, low variability".]

 θ = bullseye, supposed to be the real value
 θ̂ = red dart thrown at the bullseye, value estimated using the data

Probability and Statistics for SIC slide 333

Mean square error


Definition 261. The mean square error (MSE) of the estimator θ̂ of θ is
MSE(θ̂) = E{(θ̂ − θ)²} = · · · = var(θ̂) + b(θ)².

This is the average squared distance between θ̂ and its target value θ.

Definition 262. Let θ̂1 and θ̂2 be two unbiased estimators of the same parameter θ. Then

MSE(θ̂1) = var(θ̂1) + b1(θ)² = var(θ̂1),
MSE(θ̂2) = var(θ̂2) + b2(θ)² = var(θ̂2),

and we say that θ̂1 is more efficient than θ̂2 if

var(θ̂1) ≤ var(θ̂2).

If so, then we prefer θ̂1 to θ̂2.

Example 263. Let Y1, . . . , Yn be iid N(µ, σ²), with large n. Find the bias and variance of the median M
and the average Ȳ. Which is preferable? What if outliers might appear?

Probability and Statistics for SIC slide 334

Note to Example 263
 We've already seen in Lemma 166 that
E(Ȳ) = µ,   var(Ȳ) = σ²/n,
so the bias of Ȳ as an estimator of µ is E(Ȳ) − µ = 0.

 Results from Example 240 give that for large n,

E(M) = µ,   var(M) ≈ πσ²/(2n),

so both estimators are (approximately) unbiased (in fact exactly unbiased), but

var(M)/var(Ȳ) = π/2 > 1,

so M is less efficient than Ȳ, because the latter has a smaller variance.


However if there are outliers, we have seen that the median M is little changed, whereas the
average Y can be badly affected. Our choice between these estimators will depend on how much
we fear that our data will be contaminated by bad values.

Probability and Statistics for SIC note 1 of slide 334

Delta method
In practice, we often consider functions of estimators, and so we appeal to another version of the delta
method (Theorem 236).

Theorem 264 (Delta method). Let θ̂ be an estimator based on a sample of size n, such that

θ̂ is approximately N(θ, v/n)

for large n, and let g be a smooth function such that g′(θ) ≠ 0. Then

g(θ̂) is approximately N{ g(θ) + v g″(θ)/(2n), v g′(θ)²/n }.

This implies that the mean square error of g(θ̂) as an estimator of g(θ) is

MSE{g(θ̂)} ≈ { v g″(θ)/(2n) }² + v g′(θ)²/n.

Thus for large n we can disregard the bias contribution.

Example 265. Let Y1, . . . , Yn be iid Poiss(θ). Find two estimators of P(Y = 0), and compare their
biases and variances.
Probability and Statistics for SIC slide 335

Note to Example 265
 Let ψ = g(θ) = exp(−θ) = P(Y = 0).
 The two estimators are T1 = n^{−1} Σ_i I(Yi = 0) and T2 = exp(−Ȳ).
 Simple computations (e.g., noting that nT1 ∼ B(n, ψ)) give

E(T1) = ψ,   var(T1) = ψ(1 − ψ)/n.

Thus T1 is unbiased and has MSE ψ(1 − ψ)/n.

 For T2 we note that θ̂ = Ȳ has mean and variance θ and θ/n, and hence, by the delta method,

E(T2) ≈ exp(−θ) + θ exp(−θ)/(2n),   var(T2) ≈ θ exp(−2θ)/n.

Therefore T2 has positive bias θ exp(−θ)/(2n) but

var(T2)/var(T1) = θ exp(−2θ) / [exp(−θ){1 − exp(−θ)}] = θ/(e^θ − 1) < 1

for all θ > 0.

Therefore T2 is preferable to T1 in terms of variance (especially if θ is large).
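A simulation (an addition to the notes, assuming numpy) comparing the two estimators' mean square errors:

import numpy as np

rng = np.random.default_rng(6)
theta, n, reps = 2.0, 30, 50000
Y = rng.poisson(theta, size=(reps, n))
psi = np.exp(-theta)                          # target P(Y = 0)
T1 = (Y == 0).mean(axis=1)                    # proportion of zeros
T2 = np.exp(-Y.mean(axis=1))                  # plug-in estimator exp(-Ybar)
print(float(np.mean((T1 - psi)**2)), float(np.mean((T2 - psi)**2)))
# T2 should show the smaller mean square error, as the calculation above predicts.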

Probability and Statistics for SIC note 1 of slide 335

Efficiency and robustness


 Under certain conditions, notably that y1, . . . , yn are really from the assumed model f(y; θ), and if
f is ‘nice’, the maximum likelihood estimator θ̂ has good properties: for large n, E(θ̂) ≈ θ, and
var(θ̂) is minimal, so no estimator is better than θ̂.
 In reality we are never certain of the model, and often we sacrifice some efficiency (small variance
under an ideal model) for robustness (good estimation even if there are outliers, or if the assumed
model is incorrect).
 If θ is a p × 1 vector, the same ideas apply. For example, for M-estimation we maximise

Σ_{j=1}^n ρ(θ; yj)

with respect to the vector θ_{p×1}, giving an estimator θ̂_{p×1}, which often has an approximate
Np(θ, V) distribution.

Probability and Statistics for SIC slide 336

8.3 Interval Estimation slide 337
Pivots
A key element of statistical thinking is to assess uncertainty of results and conclusions.
Let t = 1 be an estimate of an unknown parameter θ based on a sample of size n:
 if n = 10^5 we are much more sure that θ ≈ t than if n = 10;
 as well as t we would thus like to give an interval which will be wider when n = 10 than when
n = 10^5, to make the uncertainty of t explicit.
We suppose that we have
 data y1 , . . . , yn , which are regarded as a realisation of a
 random sample Y1 , . . . , Yn drawn from a
 statistical model f (y; θ) whose unknown
 parameter θ is estimated by the
 estimate t = t(y1 , . . . , yn ), which is regarded as a realisation of the
 estimator T = t(Y1 , . . . , Yn ).
We therefore need to link θ and Y1 , . . . , Yn .

Definition 266. Let Y = (Y1 , . . . , Yn ) be sampled from a distribution F with parameter θ. Then a
pivot is a function Q = q(Y, θ) of the data and the parameter θ, where the distribution of Q is known
and does not depend on θ. We say that Q is pivotal.

Probability and Statistics for SIC slide 338

Example
Example 267. Let Y1, . . . , Yn be iid U(0, θ) with θ unknown,

M = max(Y1, . . . , Yn),   Ȳ = n^{−1} Σ_j Yj.

 Show that Q1 = M/θ is a pivot.


 Use the central limit theorem to find an approximate pivot Q2 for large n, based on Y .

Probability and Statistics for SIC slide 339

Note to Example 267
 We first note that Q1 is a function of the data and the parameter, and that

P(M ≤ x) = F_Y(x)^n = (x/θ)^n,   0 < x < θ,

so
P(Q1 ≤ q) = P(M/θ ≤ q) = P(M ≤ θq) = (θq/θ)^n = q^n,   0 < q < 1,
which is known and does not depend on θ. Hence Q1 is a pivot.
 In Example 119(a) we saw that if Y ∼ U(0, θ), then E(Y) = θ/2 and var(Y) = θ²/12. Hence
Lemma 166(c) gives that Ȳ has mean θ/2 and variance θ²/(12n), and for large n,
Ȳ is approximately N{θ/2, θ²/(12n)} using the central limit theorem. Therefore

Q2 = (Ȳ − θ/2)/{θ²/(12n)}^{1/2} = (3n)^{1/2}(2Ȳ/θ − 1), approximately N(0, 1).

Thus Q2 depends on both data and θ, and has an (approximately) known distribution: hence Q2 is
an (approximate) pivot. (In fact it is exact, if we could know the distribution of Ȳ exactly.)

Probability and Statistics for SIC note 1 of slide 339

Confidence intervals
Definition 268. Let Y = (Y1 , . . . , Yn ) be data from a parametric statistical model with scalar
parameter θ. A confidence interval (CI) (L, U ) for θ with lower confidence bound L and upper
confidence bound U is a random interval that contains θ with a specified probability, called the
(confidence) level of the interval.

 L = l(Y ) and U = u(Y ) are statistics that can be computed from the data Y1 , . . . , Yn . They do
not depend on θ.
 In a continuous setting (so < gives the same probabilities as ≤), and if we write the probabilities
that θ lies below and above the interval as

P (θ < L) = αL , P (U < θ) = αU ,

then (L, U ) has confidence level

P (L ≤ θ ≤ U ) = 1 − P (θ < L) − P (U < θ) = 1 − αL − αU .

 Often we seek an interval with equal probabilities of not containing θ at each end, with
αL = αU = α/2, giving an equi-tailed (1 − α) × 100% confidence interval.
 We usually take standard values of α, such that 1 − α = 0.9, 0.95, 0.99, . . .

Probability and Statistics for SIC slide 340

Construction of a CI
 We use pivots to construct CIs:
– we find a pivot Q = q(Y, θ) involving θ;
– we obtain the quantiles qαU , q1−αL of Q;
– then we transform the equation

P{qαU ≤ q(Y, θ) ≤ q1−αL } = (1 − αL ) − αU

into the form


P(L ≤ θ ≤ U ) = 1 − αL − αU ,
where the bounds L, U depend on Y , q1−αL and qαU , but not on θ.
 In many cases, the bounds are of a standard form (see below).

Example 269. In Example 267, find CIs based on Q1 and on Q2 .

Example 270. A sample of n = 16 Vaudois number plates has maximum 523308 and average 320869.
Give two-sided 95% CIs for the number of cars in canton Vaud.
Probability and Statistics for SIC slide 341

Note to Example 269


 The p quantile of Q1 = M/θ is given by p = P(Q1 ≤ q_p) = q_pⁿ, so q_p = p^{1/n}. Thus

  P{αU^{1/n} ≤ M/θ ≤ (1 − αL)^{1/n}} = 1 − αL − αU,

and a little algebra gives that

  P{M/(1 − αL)^{1/n} ≤ θ ≤ M/αU^{1/n}} = 1 − αL − αU,

so

  L = M/(1 − αL)^{1/n},   U = M/αU^{1/n}.

 For Q2 = (3n)^{1/2}(2Ȳ/θ − 1) ·∼ N(0, 1), the quantiles are z_{1−αL} and z_{αU}, so

  P{z_{αU} ≤ (3n)^{1/2}(2Ȳ/θ − 1) ≤ z_{1−αL}} = 1 − αL − αU,

and hence we obtain

  L = 2Ȳ/{1 + z_{1−αL}/(3n)^{1/2}},   U = 2Ȳ/{1 + z_{αU}/(3n)^{1/2}};

note that for large n these are L ≈ 2Ȳ{1 − z_{1−αL}/(3n)^{1/2}} and U ≈ 2Ȳ{1 − z_{αU}/(3n)^{1/2}}.

Probability and Statistics for SIC note 1 of slide 341

Note to Example 270
 We set αU = αL = 0.025, with M and Ȳ observed to be m = 523308 and ȳ = 320869.
 For Q1 with n = 16 we have αU^{1/n} = 0.025^{1/16} = 0.794 and (1 − αL)^{1/n} = 0.975^{1/16} = 0.998, so

  L = m/(1 − αL)^{1/n} = 524135,   U = m/αU^{1/n} = 659001.

Note that this CI does not contain m (and this makes sense).
 For Q2 = (3n)^{1/2}(2Ȳ/θ − 1) ·∼ N(0, 1), the quantiles are z_{αU} = −z_{1−αL} = −1.96, so we obtain

  L = 2ȳ/{1 + 1.96/(3n)^{1/2}} = 500226,   U = 2ȳ/{1 − 1.96/(3n)^{1/2}} = 894903.

This is much wider than the other CI, and includes impossible values, as we already know that
θ ≥ m.
 Clearly we prefer the interval based on Q1 .
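The two intervals are easy to reproduce in R; this sketch simply repeats the arithmetic of the note with the observed values.

  # Example 270: 95% CIs for the number of cars, from n = 16 plates
  n <- 16; m <- 523308; ybar <- 320869; alpha <- 0.05
  c(m / (1 - alpha/2)^(1/n), m / (alpha/2)^(1/n))   # interval based on Q1
  z <- qnorm(c(0.975, 0.025))
  2 * ybar / (1 + z / sqrt(3 * n))                  # interval based on Q2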

Probability and Statistics for SIC note 2 of slide 341

Interpretation of a CI
 (L, U ) is a random interval that contains θ with probability 1 − α.
 We imagine an infinite sequence of repetitions of the experiment that gave (L, U ).
 In that case, the CI that we calculated is one of an infinity of possible CIs, and we can consider
that our CI was chosen at random from among them.
 Although we do not know whether our particular CI contains θ, the event θ ∈ (L, U ) has
probability 1 − α, matching the confidence level of the CI.
 In the figure below, the parameter θ (green line) is contained (or not) in realisations of the 95% CI
(red). The black points show the corresponding estimates.
[Figure: the 95% CI from each of 100 repetitions (vertical axis, Repetition) plotted against the parameter value (horizontal axis, Parameter).]

Probability and Statistics for SIC slide 342

One- and two-sided intervals
 A two-sided confidence interval (L, U ) is generally used, but one-sided confidence intervals,
of the form (−∞, U ) or (L, ∞), are also sometimes required.
 For one-sided CIs, we take αU = 0 or αL = 0, giving respective intervals (L, ∞) or (−∞, U ).
 To get a one-sided (1 − α) × 100% interval, we can compute a two-sided interval with
αL = αU = α, and then replace the unwanted limit by ±∞ (or another value if required in the
context).

Example 271. A sample of n = 16 Vaudois number plates has maximum 523308. Use the pivot Q1
to give one-sided 95% CIs for the number of cars in canton Vaud.

Probability and Statistics for SIC slide 343

Note to Example 271


 We set αU = αL = 0.05, with M observed to be m = 523308.
 For Q1 with n = 16 we have αU^{1/n} = 0.05^{1/16} = 0.829 and (1 − αL)^{1/n} = 0.95^{1/16} = 0.997, so

  L = m/(1 − αL)^{1/n} = 524988.3,   U = m/αU^{1/n} = 631061.6.

 For the interval of form (L, ∞), we have (524988.3, ∞), with the interpretation that we are 95% sure that the number of cars in the canton is at least 524988.3 (which we would interpret as 524988, for practical purposes).
 For the interval of form (−∞, U), we have (−∞, 631061.6), but since we have observed m = 523308, we replace the lower bound, giving (523308, 631061.6). We are 95% sure that the number of cars in the canton is lower than 631062, but it must be at least 523308.
Probability and Statistics for SIC note 1 of slide 343

Standard errors
In most cases we use approximate pivots, based on estimators whose variances we must estimate.

Definition 272. Let T = t(Y1, . . . , Yn) be an estimator of θ, let τn² = var(T) be its variance, and let V = v(Y1, . . . , Yn) be an estimator of τn². Then we call V^{1/2}, or its realisation v^{1/2}, a standard error for T.

Theorem 273. Let T be an estimator of θ based on a sample of size n, with

  (T − θ)/τn →D Z,   V/τn² →P 1,   n → ∞,

where Z ∼ N(0, 1). Then by Theorem 227 we have

  (T − θ)/V^{1/2} = {(T − θ)/τn} × (τn/V^{1/2}) →D Z,   n → ∞.

Hence, when basing a CI on the Central Limit Theorem, we can replace τn by V^{1/2}.

Probability and Statistics for SIC slide 344

Approximate normal confidence intervals
 We can often construct approximate CIs using the CLT, since many statistics that are based on averages of Y = (Y1, . . . , Yn) have approximate normal distributions for large n. If T = t(Y) is an estimator of θ with standard error √V, and if Theorem 273 applies, then

  T ·∼ N(θ, V),

and so (T − θ)/√V ·∼ N(0, 1). Thus

  P{zαU < (T − θ)/√V ≤ z_{1−αL}} ≈ Φ(z_{1−αL}) − Φ(zαU) = 1 − αL − αU,

implying that an approximate (1 − αL − αU) × 100% CI for θ is

  (L, U) = (T − √V z_{1−αL}, T − √V zαU).

Recall that if αL, αU < 1/2, then z_{1−αL} > 0 and zαU < 0, so L < U.
 Example 269 is an example of this, with T = 2Ȳ and V = T²/(3n), since for large n,

  L ≈ T − T z_{1−αL}/(3n)^{1/2},   U ≈ T − T zαU/(3n)^{1/2}.

 Often we take αL = αU = 0.025, and then z_{1−αL} = −zαU = 1.96, giving the ‘rule of thumb’ (L, U) ≈ T ± 2√V for a two-sided 95% confidence interval.
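This construction is easy to code once an estimate and its standard error are available. The helper below is a sketch, with an invented function name, applied to Example 269, where T = 2Ȳ and V = T²/(3n); it reproduces the large-n approximation above, not the exact interval based on Q2.

  # Hypothetical helper: approximate (1 - alpha) CI from an estimate and its standard error
  norm_ci <- function(est, se, alpha = 0.05) {
    est + se * qnorm(c(alpha/2, 1 - alpha/2))   # (L, U) = T -/+ z_{1-alpha/2} sqrt(V)
  }
  n <- 16; t_hat <- 2 * 320869                  # number-plate data again
  norm_ci(t_hat, sqrt(t_hat^2 / (3 * n)))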

Probability and Statistics for SIC slide 345

Normal random sample


An important case where exact CIs are available is the normal random sample.
Theorem 274. If Y1, . . . , Yn ∼iid N(µ, σ²), then

  Ȳ ∼ N(µ, σ²/n)   and   (n − 1)S² = ∑_{j=1}^n (Yj − Ȳ)² ∼ σ²χ²_{n−1},   independently,

where χ²_ν represents the chi-square distribution with ν degrees of freedom.

The first result here implies that if σ² is known, then

  Z = (Ȳ − µ)/√(σ²/n) ∼ N(0, 1)

is a pivot that provides an exact (1 − αL − αU) confidence interval for µ, of the form

  (L, U) = (Ȳ − (σ/√n) z_{1−αL}, Ȳ − (σ/√n) zαU),    (4)

where zp denotes the p quantile of the standard normal distribution.

Probability and Statistics for SIC slide 346

Unknown variance
 In applications σ² is usually unknown. If so, Theorem 274 implies that

  (Ȳ − µ)/√(S²/n) ∼ t_{n−1},   (n − 1)S²/σ² ∼ χ²_{n−1}

are pivots that provide confidence intervals for µ and σ², respectively, i.e.,

  (L, U) = (Ȳ − (S/√n) t_{n−1}(1 − αL), Ȳ − (S/√n) t_{n−1}(αU)),    (5)

  (L, U) = ((n − 1)S²/χ²_{n−1}(1 − αL), (n − 1)S²/χ²_{n−1}(αU)),    (6)

where:
– t_ν(p) is the p quantile of the Student t distribution with ν degrees of freedom;
– χ²_ν(p) is the p quantile of the chi-square distribution with ν degrees of freedom.
 For symmetric densities such as the normal and the Student t, the quantiles satisfy

  z_p = −z_{1−p},   t_ν(p) = −t_ν(1 − p),

so equi-tailed (1 − α) × 100% CIs have the forms

  Ȳ ± n^{−1/2} σ z_{1−α/2},   Ȳ ± n^{−1/2} S t_{n−1}(1 − α/2).

Probability and Statistics for SIC slide 347

Two giants of 20th century statistics


Left: William Sealy Gosset (‘Student’) (1876–1937)
Right: Ronald Aylmer Fisher (1890–1962)

(Source: Wikipedia)

Probability and Statistics for SIC slide 348

Chi-square and Student probability densities

[Figure] Left: χ²_ν densities with ν = 1, 2, 4, 6, 10. Right: t_ν densities with ν = 1 (bottom centre), 2, 4, 20, ∞ (top centre).

Probability and Statistics for SIC slide 349

Example
Example 275. Suppose that the resistance X of a certain type of electrical equipment has an approximate N(µ, σ²) distribution. A random sample of size n = 9 has average x̄ = 5.34 ohm and variance s² = 0.12² ohm².
 Find an equi-tailed two-sided 95% CI for µ.
 Find an equi-tailed two-sided 95% CI for σ².
 How does the interval for µ change if we are later told that σ² = 0.12²?
 How does the calculation change if we want a 95% confidence interval for µ of form (L, ∞)?

Probability and Statistics for SIC slide 350

Note to Example 275


 We want 1 − α = 0.95, so α = 0.05 and we take αU = αL = 0.025. The formula (5) gives (5.25, 5.43) ohms.
 Formula (6) gives (0.0066, 0.0529) ohm² as the interval for σ², giving (√0.0066, √0.0529) = (0.081, 0.230) ohms as the interval for σ (which must be positive).
 In this case σ² is known, so we should use (4). We replace t₈(0.975) = 2.306 with z_{0.975} = 1.96, giving (5.26, 5.42) ohm. This interval is a factor 2.306/1.96 = 1.18 shorter, because there is no uncertainty about the value of σ.
 Now we want U = ∞, so we take αU = 0 and αL = 0.05, and replace the first interval above by

  (Ȳ + (S/√n) t_{n−1}(αL), ∞) ohms,

which gives (5.27, ∞) ohms.
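The same numbers come out of a few lines of R; the sketch below just repeats the calculations from the summary statistics n = 9, x̄ = 5.34 and s = 0.12.

  # Example 275: confidence intervals from summary statistics
  n <- 9; ybar <- 5.34; s <- 0.12
  ybar + s / sqrt(n) * qt(c(0.025, 0.975), df = n - 1)    # (5): 95% CI for mu
  (n - 1) * s^2 / qchisq(c(0.975, 0.025), df = n - 1)     # (6): 95% CI for sigma^2
  ybar + s / sqrt(n) * qnorm(c(0.025, 0.975))             # (4): sigma = 0.12 known
  c(ybar + s / sqrt(n) * qt(0.05, df = n - 1), Inf)       # one-sided (L, Inf) interval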

Probability and Statistics for SIC note 1 of slide 350

Comments
 The construction of confidence intervals is based on pivots, often using the central limit theorem
to approximate the distribution of an estimator, and thus giving approximate intervals.
 A confidence interval (L, U ) not only suggests where an unknown parameter is situated, but its
width U − L gives an idea of the precision of the estimate.
 In most cases

  U − L ∝ √V ∝ n^{−1/2},

so multiplying the sample size by 100 increases precision only by a factor of 10.
 Having to estimate the variance using V decreases precision, and thus increases the width.
 To get a one-sided (1 − α) × 100% interval, we can compute a two-sided interval with
αL = αU = α, and then replace the unwanted limit by ±∞ (or another suitable limit).
 In some cases, especially normal models, exact CIs are available.

Probability and Statistics for SIC slide 351

8.4 Hypothesis Tests slide 352

Statistical tests
Example 276. I observe 115 heads when spinning a 5Fr coin 200 times, and 105 heads when tossing it 200 times.
 Give a statistical model for this problem.
 Is the coin fair?
[Figure: proportion of heads against the number of spins (left panel, ‘5Fr, 1978, spins’) and against the number of tosses (right panel, ‘5Fr, 1978, tosses’).]

Probability and Statistics for SIC slide 353

Note to Example 276
 On the assumption that the spins are independent, and that heads occurs with probability θ, the
total number of heads R ∼ B(n = 200, θ), and if the coin is fair, θ = 1/2.
 One way to see if the coin is fair is to compute a 95% CI for the unknown θ, and see if the value
θ = 1/2 lies in the interval.
 An unbiased estimator for θ is θ̂ = R/n (and in fact this is the MLE, and the moments estimator), and its variance is θ(1 − θ)/n, which we can estimate by V = θ̂(1 − θ̂)/n, so our discussion of confidence intervals tells us that an approximate 95% confidence interval for θ is

  θ̂ ± z_{1−α/2}√V = θ̂ ± 1.96√V,

which gives
tosses: (0.456, 0.594) spins: (0.506, 0.644),
suggesting that since the 95% confidence interval for spins does not contain 1/2, the coin is not
fair for spins, but that it is fair for tosses.
 Note that if we had had R = 85 for tosses, then we would get interval (0.356, 0.494), and would
also have concluded that the coin is not fair for tosses.
Similar computations for the CI with 1 − α = 0.99 give

tosses: (0.434, 0.616) spins: (0.485, 0.665),

so if we take a wider confidence interval, we conclude that the coin is fair for spins also.

Probability and Statistics for SIC note 1 of slide 353

Confidence intervals and tests


 We can use confidence intervals (CIs) to assess the plausibility of a value θ 0 of θ:
– If θ 0 lies inside a (1 − α) × 100% CI, then we cannot reject the hypothesis that θ = θ 0 , at
significance level α.
– If θ 0 lies outside a (1 − α) × 100% CI, then we reject the hypothesis that θ = θ 0 , at
significance level α.
 The discussion of the scientific method at the start of §7 (slide 267) tells us that data cannot
prove correctness of a theory (hypothesis), because we can always imagine that future data or a
new experiment might undermine it, but data can falsify theory. Hence we can reject or not
reject (provisionally accept) a hypothesis, but we cannot prove it.
 The decision to reject or not depends on the chosen significance level α: we will reject less often if
α is small, since then the CI will be wider.
 If α is small and we do reject, this gives stronger evidence against θ 0 .
 Use of a two-sided CI (L, U ) implies that seeing either θ 0 < L or θ 0 > U would be evidence
against the theory. This is true for Example 276, but in general we should consider whether to use
(−∞, U ) or (L, ∞) instead.

Probability and Statistics for SIC slide 354

Null and alternative hypotheses
In a general testing problem we aim to use the data to decide between two hypotheses.
 The null hypothesis H0 , which represents the theory/model we want to test.
– For the coin tosses, H0 is that the coin is fair, i.e., P(heads) = θ = θ 0 = 1/2.
 The alternative hypothesis H1 , which represents what happens if H0 is false.
– For the coin tosses, H1 is that the coin is not fair, i.e., P(heads) ≠ θ0.
 When we decide between the hypotheses, we can make two sorts of error:
Type I error (false positive): H0 is true, but we wrongly reject it (and choose H1 );
Type II error (false negative): H1 is true, but we wrongly accept H0 .

                                          Decision
                              Accept H0                        Reject H0
State of Nature  H0 true   Correct choice (True negative)   Type I Error (False positive)
                 H1 true   Type II Error (False negative)   Correct choice (True positive)

Probability and Statistics for SIC slide 355

Taxonomy of hypotheses
Definition 277. A simple hypothesis entirely fixes the distribution of the data Y , whereas a
composite hypothesis does not fix the distribution of Y .

Example 278. If
  H0 : Y1, . . . , Yn ∼iid N(0, 1),   H1 : Y1, . . . , Yn ∼iid N(0, 3),
then both hypotheses are simple.

Example 279. If θ0 is fixed (e.g., θ0 = 1/2) and

H0 : R ∼ B(n, θ0 ), H1 : R ∼ B(n, θ), θ ∈ (0, θ0 ) ∪ (θ0 , 1),

then H0 (‘the coin is fair’) is simple but H1 (‘the coin is not fair’) is composite.

Example 280. If µ, σ² are unknown and F is an unknown (but non-normal) distribution, and

  H0 : Y1, . . . , Yn ∼iid N(µ, σ²),   H1 : Y1, . . . , Yn ∼iid F,

then both H0 (‘the data are normally distributed’) and H1 (‘the data are not normally distributed’) are
composite.

Probability and Statistics for SIC slide 356

True and false positives: Example
 H0 : T ∼ N (0, 1) and H1 : T ∼ N (µ, 1), with µ > 0.
 Reject H0 if T > t, where t is some cut-off, so we
– reject H0 incorrectly (false positive) with probability

α(t) = P0 (T > t) = 1 − Φ(t) = Φ(−t)

– reject H0 correctly (true positive) with probability

β(t) = P1 (T > t) = P(T − µ > t − µ) = 1 − Φ(t − µ) = Φ(µ − t).

[Figure: densities of T under H0 and H1 with the cut-off t; the tail area of the H0 density beyond t is the false positive probability α(t), and that of the H1 density is the true positive probability β(t).]

Probability and Statistics for SIC slide 357

ROC curve
Definition 281. The receiver operating characteristic (ROC) curve of a test plots β(t) against
α(t) as the cut-off t varies, i.e., it shows (P0 (T ≥ t), P1 (T > t)), when t ∈ R.

 In the example above, we have α = Φ(−t), so t = −Φ⁻¹(α) = −zα, so equivalently we graph

  β(t) = Φ(µ + zα) ≡ β(α) against α ∈ [0, 1].

 Here is the ROC curve for the example above, which has µ = 2 (in red). Also shown are the ROC
curves for µ = 0, 0.4, 3, 6. Which is which?
[Figure: ROC curves, plotting the true positive probability β(t) against the false positive probability α(t) for µ = 0, 0.4, 2 (red), 3, 6.]
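The curves are easily redrawn in R; the sketch below uses the values of µ quoted in the text and is purely illustrative.

  # ROC curves beta(alpha) = Phi(mu + z_alpha) for several separations mu
  alpha <- seq(0.001, 0.999, length = 500)
  plot(alpha, alpha, type = "l", lty = 2,
       xlab = "False positive probability", ylab = "True positive probability")  # mu = 0
  for (mu in c(0.4, 2, 3, 6)) lines(alpha, pnorm(mu + qnorm(alpha)))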

Probability and Statistics for SIC slide 358

Example, II
 In case you need help, here are the densities for three of the cases:

[Figure: three pairs of H0 and H1 densities with the cut-off t, for increasing separation between the hypotheses.]

Probability and Statistics for SIC slide 359

Size and power


 As µ increases, it becomes easier to detect when H0 is false, because the densities under H0 and
H1 become more separated, and the ROC curve moves ‘further north-west’.
 When H0 and H1 are the same, i.e., µ = 0, then the curve lies on the diagonal. Then the
hypotheses cannot be distinguished.
 In applications, µ is usually unknown, so we fix α (often at some conventional value, e.g., 0.05, 0.01) and then accept the resulting β(α).
 We also call (particularly in statistics books and papers)
– the false positive probability the size α of the test, and
– the true positive probability the power β of the test.

Definition 282. Let P0 (·) and P1 (·) denote probabilities computed under null and alternative
hypotheses H0 and H1 respectively. Then the size and power of a statistical test of H0 against H1 are

size α = P0 (reject H0 ), power β = P1 (reject H0 ).

Probability and Statistics for SIC slide 360

Power and confidence intervals


 If the test is based on a (1 − α) × 100% CI, the size is the probability that the true value of the
parameter lies outside the CI, so it is α.
 Taking a smaller value of α gives a wider interval, so it must decrease the power.
 Usually the width of the interval (L, U) satisfies

  U − L ∝ n^{−1/2},

so increasing n gives a narrower interval and will increase the power of the test. This makes sense,
because having more data should allow us to be more certain in our conclusions.
 Unfortunately, not all tests correspond to confidence intervals, so we need a more general approach.
 For example, checking the fit of a model is not usually possible using a confidence interval . . .

Probability and Statistics for SIC slide 361

Testing goodness of fit


We may want to assess whether a statistical model fits data appropriately.

Example 283. In a legal dispute, it was claimed that the numbers below were faked:

261 289 291 265 281 291 285 283 280 261 263 281 291 289 280
292 291 282 280 281 291 282 280 286 291 283 282 291 293 291
300 302 285 281 289 281 282 261 282 291 291 282 280 261 283
291 281 246 249 252 253 241 281 282 280 261 265 281 283 280
242 260 281 261 281 282 280 241 249 251 281 273 281 261 281
282 260 281 282 241 245 253 260 261 281 280 261 265 281 241
260 241

Real data could be expected to have final digits uniformly distributed on {0, 1, . . . , 9}, but here we have

0 1 2 3 4 5 6 7 8 9
14 42 14 9 0 6 2 0 0 5

How strong is the evidence that the final digits are not uniform?

Probability and Statistics for SIC slide 362

Karl Pearson (1857–1936)

(Source: University College London)

Probability and Statistics for SIC slide 363

Pearson statistic
Definition 284. Let O1, . . . , Ok be the number of observations of a random sample of size n = n1 + · · · + nk falling into the categories 1, . . . , k, whose expected numbers are E1, . . . , Ek, where Ei > 0. Then the Pearson statistic (or chi-square statistic) is

  T = ∑_{i=1}^k (Oi − Ei)²/Ei.

Definition 285. Let Z1, . . . , Zν ∼iid N(0, 1); then W = Z1² + · · · + Zν² follows the chi-square distribution with ν degrees of freedom, whose density function is

  f_W(w) = w^{ν/2−1} e^{−w/2}/{2^{ν/2} Γ(ν/2)},   w > 0,   ν = 1, 2, . . . ,

where Γ(a) = ∫₀^∞ u^{a−1} e^{−u} du, a > 0, is the gamma function.

 If the joint distribution of O1, . . . , Ok is multinomial with denominator n and probabilities p1 = E1/n, . . . , pk = Ek/n, then T ·∼ χ²_{k−1}, the approximation being good if k⁻¹ ∑ Ei ≥ 5.
 We can use T to check the agreement between the data O1, . . . , Ok and the theoretical probabilities p1, . . . , pk.

Probability and Statistics for SIC slide 364

Pearson statistic: Rationale


 If Oi ≈ Ei for all i, then T will be small, otherwise it will tend to be bigger.
 If the joint distribution of O1, . . . , Ok is multinomial with denominator n and probabilities pi = Ei/n, then each Oi ∼ B(n, pi), and thus

  E(Oi) = npi = Ei,   var(Oi) = npi(1 − pi) = Ei(1 − Ei/n) ≈ Ei,

thus Zi = (Oi − Ei)/√Ei ·∼ N(0, 1) for large n, and we would imagine that

  T = ∑_{i=1}^k (Oi − Ei)²/Ei = ∑_{i=1}^k Zi² ·∼ χ²_k,

but the constraint ∑_i Oi = n means that only k − 1 of the Zi can vary independently, thus reducing the degrees of freedom to k − 1.

Probability and Statistics for SIC slide 365

Null and alternative hypotheses for Example 283
 Null hypothesis, H0: the final digits are independent and distributed according to the uniform distribution on 0, . . . , 9. This simple null hypothesis implies that O0, . . . , O9 have the multinomial distribution with probabilities p0 = · · · = p9 = 0.1, and since 10⁻¹ ∑ Ej > 5, we have

  P0(T ≤ t) ≈ P(χ²₉ ≤ t),   t > 0.

 Alternative hypothesis, H1: the final digits are independent but not uniform, so O0, . . . , O9 follow a multinomial distribution with unequal probabilities, p0, . . . , p9. This hypothesis is composite, and the parameter θ ≡ (p1, . . . , p9) is of dimension 9, as p0 = 1 − p1 − · · · − p9. Under this model,

  P1(T > t) ≥ P(χ²₉ > t),   t > 0.

 Since values of T tend to be smaller under H0 than under H1, we should take large values of T to be evidence against H0 in favour of H1.
 We verify this on the following slides.

Probability and Statistics for SIC slide 366

Monte Carlo simulations of T , n = 50


Pearson’s statistics for 10,000 sets of data when testing H0 : p0 = · · · = p9 = 0.1, when: (a) (top) the
data are generated under H0 ; (b) (bottom) the data are generated with a multinomial distribution
having p0 = p1 = 0.15, p2 = · · · = p9 = 0.0875. The values of T tend to be bigger under (b). The red
line shows the χ29 density.
[Figure: for each case, a histogram of the 10,000 values of T with the χ²₉ density (red) and a Q-Q plot of the ordered values of T against χ²₉ quantiles.]

Probability and Statistics for SIC slide 367

Monte Carlo simulations of T , n = 100, 50
Pearson’s statistics for 10,000 sets of data when testing H0 : p0 = · · · = p9 = 0.1, when: (a) (top) the
data are generated with p0 = p1 = 0.15, p2 = · · · = p9 = 0.0875, and n = 100; (b) (bottom) the data
are generated with p0 = p1 = 0.2, p2 = · · · = p9 = 0.075 and n = 50. The red line shows the χ29
density.

[Figure: as before, histograms of T with the χ²₉ density (red) and Q-Q plots against χ²₉ quantiles, for cases (a) and (b).]

Probability and Statistics for SIC slide 368

Example
The simulations in the previous figures show that
·
 under H0 , we indeed have T ∼ χ29 , even with n = 50;
 under H1 , the distribution of T is shifted to the right;
 the size of the shift under H1 will determine the power of the test, which depends on the sample
size n and on the non-uniformity of (p0 , . . . , p9 ).

Example 286 (Example 283, continued). Our data

0 1 2 3 4 5 6 7 8 9
14 42 14 9 0 6 2 0 0 5
give an observed value of T equal to tobs ≈ 158.
 For a test of H0 at significance level α = 0.05, note that the (1 − α) quantile of the χ²₉ distribution is 16.92. Since tobs > 16.92, we can reject H0 at significance level 0.05.
 In fact,

  P0(T ≥ tobs) ≈ P(χ²₉ ≥ 158) < 2.2 × 10⁻¹⁶,
so seeing data like this would be essentially impossible under H0 . It is almost certain that the
observed final digits did not come from a uniform distribution.
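The same computation can be done directly in R; the sketch below assumes the digit counts tabulated above.

  # Example 286: Pearson test that the final digits are uniform on {0,...,9}
  counts <- c(14, 42, 14, 9, 0, 6, 2, 0, 0, 5)
  E <- sum(counts) / 10                        # expected count 9.2 per digit
  T_obs <- sum((counts - E)^2 / E)             # about 158
  pchisq(T_obs, df = 9, lower.tail = FALSE)    # P-value under chi^2_9
  # equivalently: chisq.test(counts, p = rep(0.1, 10))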

Probability and Statistics for SIC slide 369

Evidence and P-values
A statistical hypothesis test has the following elements:
 a null hypothesis H0 , to be tested against an alternative hypothesis H1 ;
 data, from which we compute a test statistic T , chosen such that large values of T provide
evidence against H0 ;
 the observed value of T is tobs , which we compare with the null distribution of T , i.e., the
sampling distribution of T under H0 ;
 we measure the evidence against H0 using the P-value

pobs = P0 (T ≥ tobs ),

where small values of pobs suggest that either


– H0 is true but something unlikely has occurred, or
– H0 is false.
 If pobs < α, then we say that the test is significant at level α or significant at the α × 100%
level.
 If we must make a decision, then we reject H0 if pobs < α, where α is the significance level of the
test, and we (provisionally) accept H0 if pobs ≥ α.

Probability and Statistics for SIC slide 370

Examples
Example 287. Recast Example 276 in terms of P-values.

Example 288. Ten new electricity meters are measured for quality control purposes, resulting in the
data
983 1002 998 996 1002 983 994 991 1005 986
Is there a systematic divergence from the standard value of 1000?

Probability and Statistics for SIC slide 371

Note to Example 287


 Under H0 we have R ∼ B(n, θ0), and therefore R ·∼ N{nθ0, nθ0(1 − θ0)} by the central limit theorem. Since values of R far from nθ0 in either direction would be evidence against H0, this suggests taking

  T = {R − E(R)}²/var(R) = (R − nθ0)²/{nθ0(1 − θ0)} = (R − 100)²/50,

since here n = 200 and θ0 = 1/2 yield E(R) = 100 and var(R) = 50.
 Since T = Z², where Z ·∼ N(0, 1), we have that T ·∼ χ²₁ under H0.
 This gives tobs = 0.5 for the tosses, and tobs = 4.5 for the spins, with corresponding P-values

  P0(T ≥ tobs) ≈ P(χ²₁ ≥ 0.5) = 0.480,   P0(T ≥ tobs) ≈ P(χ²₁ ≥ 4.5) = 0.034.

 With α = 0.05 we would accept H0 for the tosses but reject it for the spins.
 With α = 0.01 we would accept H0 for both tosses and spins.
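The two P-values can be checked in one line of R each; this sketch just repeats the arithmetic of the note.

  # Example 287: chi-square tests of theta = 1/2 for 200 tosses and 200 spins
  n <- 200; theta0 <- 0.5
  r <- c(tosses = 105, spins = 115)
  t_obs <- (r - n * theta0)^2 / (n * theta0 * (1 - theta0))   # 0.5 and 4.5
  pchisq(t_obs, df = 1, lower.tail = FALSE)                   # about 0.480 and 0.034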

Probability and Statistics for SIC note 1 of slide 371

Note to Example 288
 We assume that Y1, . . . , Yn ∼iid N(µ, σ²), with σ² unknown. We take

  H0 : µ = µ0 = 1000,   H1 : µ ≠ 1000.

 We know from Theorem 274 that under H0,

  Z = (Ȳ − µ0)/√(S²/n) ∼ t_{n−1}.

Here the alternative hypothesis H1 is two-sided, i.e., we will reject if either Ȳ is much larger or much smaller than µ0, so we should take

  T = |Ȳ − µ0|/√(S²/n) = |Z|,

and for a test at significance level α = 0.05 we therefore need to choose tα such that

  α = P0(T > tα) = 1 − P0(−tα ≤ Z ≤ tα).

But Z ∼ t_{n−1} is a pivot under H0, so 1 − P0(−tα ≤ Z ≤ tα) = 2P0(Z ≤ −tα), and this implies that tα = −t_{n−1}(α/2). With α/2 = 0.025 and n = 10, we have t₉(0.025) = −2.262 from the tables, or from R, as qt(0.025, df=9).
 For the data above, ȳ = 994 and

  s² = (1/9) ∑_{i=1}^{10} (yi − ȳ)² = 64.88.

 Now tobs = |(994 − 1000)/√(64.88/10)| = |−2.35| = 2.35 > tα = 2.262, so we reject H0 at level α = 5%.
 Alternatively we can compute the 95% confidence interval based on Z, which is
(988.238, 999.762). Since this does not contain µ0 , H0 is rejected at the 5% level.
 If instead the alternative hypothesis is H1 : µ > 1000, then we take Z as the test statistic, since we are likely to have positive Z under H1. In this case we need to choose tα such that

  α = P0(Z > tα) = P0{(Ȳ − µ0)/√(S²/n) > tα}.

Since Z ∼ t_{n−1}, we have that tα = t₉(0.95) = 1.833, and since zobs = −2.35 < 1.833, we cannot reject the null hypothesis at the 5% level. Indeed, having ȳ = 994 suggests that it is not true that µ > µ0.
 If the alternative hypothesis is H1 : µ < 1000, then we take T = −Z as the test statistic, since we are likely to have negative Z under H1. In this case we need to choose tα such that

  α = P0(−Z > tα) = P0{(Ȳ − µ0)/√(S²/n) < −tα} = P0{(Ȳ − µ0)/√(S²/n) < t_{n−1}(α)},

implying that tα = −t_{n−1}(α) = t_{n−1}(1 − α). With α = 0.05, we therefore have tα = 1.833, and since −zobs = 2.35 > tα = 1.833, we reject the null hypothesis at the 5% level. Having ȳ = 994 < µ0 suggests that maybe µ < µ0.
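In R the same tests are provided by t.test; the calls below are a sketch using the data of Example 288, and they also return the corresponding confidence intervals.

  # Example 288: Student t tests of mu = 1000 for the meter data
  y <- c(983, 1002, 998, 996, 1002, 983, 994, 991, 1005, 986)
  t.test(y, mu = 1000)                            # two-sided test and 95% CI
  t.test(y, mu = 1000, alternative = "greater")   # H1: mu > 1000
  t.test(y, mu = 1000, alternative = "less")      # H1: mu < 1000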

Probability and Statistics for SIC note 2 of slide 371

Decision procedures and measures of evidence


We can use a test of H0 in two related ways:
 as a decision procedure, where we
– choose a level α at which we want to test H0 , and then
– reject H0 (i.e., choose H1 ) if the P-value is less than α, or
– do not reject H0 if the P-value is greater than α.
 as a measure of evidence against H0 , with
– small values of pobs suggesting stronger evidence against H0 , but
– H1 need not be explicit, though the type of departure from H0 that we seek is implicit in the
choice of T .
 Knowing the exact value of pobs is more useful than knowing that H0 has been rejected, so the
measure of evidence is more informative.
 The strength of the evidence contained in a P-value can be summarised as follows:

α Evidence against H0
0.05 Weak
0.01 Positive
0.001 Strong
0.0001 Very strong

Probability and Statistics for SIC slide 372

Choice of α
 As with CIs, conventional values are often used, such as α = 0.05, 0.01, 0.001.
 The most common value is α = 0.05, which corresponds to a Type I error probability of 5%, i.e.,
H0 will be rejected once in every 20 tests, even when it is true.
 When many tests are performed, using large α can give many false positives, i.e., significant tests
for which in fact H0 is true.
 Consider a microarray experiment, where we test 1000 genes at significance level α, to see which
genes influence some disease. If only 100 genes have effects, we can write

P(H0 ) = 900/1000, P(H1 ) = 100/1000, P(S | H0 ) = α, P(S | H1 ) = β,

where α is the size of the test, β > α is its power, and S denotes the event that the test is
significant at level α. Bayes’ theorem gives

  P(H0 | S) = P(H0)P(S | H0) / {P(H0)P(S | H0) + P(H1)P(S | H1)} = 0.9α/(0.9α + 0.1β).

Hence with α = 0.05, β = 0.8, say, P(H0 | S) ≈ 0.36, so over one-third of significant tests will not be interesting. If instead we set α = 0.005, we have P(H0 | S) ≈ 0.05, which is more reasonable.
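The effect of α and β on this proportion is easy to explore numerically; the function name below is invented for illustration and simply evaluates the formula above.

  # Proportion of 'significant' tests for which H0 is in fact true
  p_H0_given_S <- function(alpha, beta, p0 = 0.9) {
    p0 * alpha / (p0 * alpha + (1 - p0) * beta)
  }
  p_H0_given_S(0.05, 0.8)     # about 0.36
  p_H0_given_S(0.005, 0.8)    # about 0.05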

Probability and Statistics for SIC slide 373

Many tests: State of nature

H0 true, no effect
H1 true, effect present

 True state of nature: we will test 1000 genes, of which only 100 show real effects.
 Ideally tests for the 100 genes for which H1 is true will reject, and those for the 900 for which H0
is true will fail to reject.
 Then we will follow up on the 100 rejections, and win fame and fortune.

Probability and Statistics for SIC slide 374

Many tests: False positives

H0 true, no effect
H1 true, effect present
False positive

 If we test at significance level α = 0.05, we expect to wrongly reject for 0.05 × 900 = 45 of the
genes where H0 is true; these are the false positives, which we falsely conclude show an effect.

Probability and Statistics for SIC slide 375

Many tests: False negatives

H0 true, no effect
H1 true, effect present
False positive
False negative

 If the power of the test is 0.5, then we expect to reject H0 in only 50% of the cases in which H1 is
true, i.e., we will miss 0.5 × 100 = 50 genes for which H1 is true, and there is a true effect; these
are the false negatives.

Probability and Statistics for SIC slide 376

Many tests: Non-reproducible results

H1 true, effect present


False positive

 Among the tests that reject H0 , we will have 45 false positives and just 50 true positives, so
almost 50% of our future effort in following up these genes will be wasted.
 To reduce this wasted effort, we must
– decrease the size α of the test, which will reduce the number of false positives; and
– increase the power β of the test, which will reduce the number of false negatives.
 This mis-use of hypothesis testing has led to a lot of ‘discoveries’ that cannot be replicated in
follow-up experiments based on further data.

Probability and Statistics for SIC slide 377

Bad science and the abuse of hypothesis testing


Many scientific and medical ‘discoveries’ are not reproducible, and may indeed be nonsense. See

http://www.badscience.net

Here are some reasons why:


 poor experimental design, particularly biased allocation of treatments, and correlation among
supposedly independent units;
 inappropriate statistical analysis, particularly over-using tests, and not making enough
allowance for multiple testing;
 failure to control the rate of false positives;
 publication bias, whereby only ‘positive findings’ (which may be false positives) are published;
and, unfortunately
 fraud—see, for example,

http://en.wikipedia.org/wiki/Anil_Potti

Probability and Statistics for SIC slide 378

8.5 Comparison of Tests slide 379

Types of test
There are many different tests for different hypotheses. Two important classes of tests are:
 parametric tests, which are based on a parametric statistical model, such as
iid
Y1 , . . . , Yn ∼ N (µ, σ 2 ), and H0 : µ = 0;
 nonparametric tests, which are based on a more general statistical model, such as Y1, . . . , Yn ∼iid f, and H0 : P(Y > 0) = P(Y < 0) = 1/2, i.e., the median of f is at y = 0.
The main advantage of a parametric test is the possibility of finding a (nearly-)optimal test, if the
underlying assumptions are correct, though such a test could perform badly in the presence of outliers.
A nonparametric test is often more robust, but it will suffer a loss of power compared to a parametric
test, used appropriately.

Probability and Statistics for SIC slide 380

Medical analogy
We diagnose an illness based on symptoms presented by a patient:

                              Decision
                     Healthy            Diseased
Patient   Healthy    True negative      False positive
          Diseased   False negative     True positive

In the graphic below, Symptom 1 gives perfect diagnoses, but Symptom 2 is useless. Think how the
probability of a correct diagnosis varies as the different lines move parallel to their slopes.
[Figure: scatter plot of Symptom 2 (vertical axis) against Symptom 1 (horizontal axis) for the two groups of patients, labelled S and M; the groups separate cleanly along Symptom 1 but overlap completely along Symptom 2, so a cut-off on Symptom 1 gives near-perfect diagnoses while one on Symptom 2 is useless.]

Probability and Statistics for SIC slide 381

ROC curve, II
 We previously met the ROC curve as a summary of the properties of a test.
 A good test will have an ROC curve lying as close to the upper left corner as possible.
 A useless test has an ROC curve lying on (or close to) the diagonal.
 This suggests that if we have a choice of tests, we should choose one whose ROC curve is as close
to the north-west as possible, i.e., we should choose the test that maximises the power for a given
size.
 This leads us to the Neyman–Pearson lemma, which says how to do this (in ideal
circumstances).

[Figure: ROC curves, true positive probability β(t) against false positive probability α(t); better tests bend towards the upper left corner, a useless test lies on the diagonal.]

Probability and Statistics for SIC slide 382

Most powerful tests


 We aim to choose our test statistic T to maximise the power of the test for a given size.
 A decision procedure corresponds to partitioning the sample space Ω containing the data Y into a rejection region, 𝒴, and its complement, 𝒴ᶜ, with

  Y ∈ 𝒴 ⇒ Reject H0,   Y ∈ 𝒴ᶜ ⇒ Accept H0.

 In Example 287, 𝒴 = {(y1, . . . , yn) : |∑ yj − 100|/√50 > 1.96}.
 We aim to choose 𝒴 such that P1(Y ∈ 𝒴) is the largest possible such that P0(Y ∈ 𝒴) = α.

Lemma 289 (Neyman–Pearson). Let f0(y), f1(y) be the densities of Y under simple null and alternative hypotheses. Then if it exists, the set

  𝒴α = {y ∈ Ω : f1(y)/f0(y) > tα}

such that P0(Y ∈ 𝒴α) = α maximises P1(Y ∈ 𝒴α), amongst all the 𝒴′ such that P0(Y ∈ 𝒴′) ≤ α.

Thus to maximise the power for a given size, we must base the decision on 𝒴α.

Probability and Statistics for SIC slide 383

Note to Lemma 289
Suppose that a region 𝒴α such that P0(Y ∈ 𝒴α) = α does exist, and let 𝒴′ be any other critical region of size α or less. Then for any density f,

  ∫_{𝒴α} f(y) dy − ∫_{𝒴′} f(y) dy,    (7)

equals

  ∫_{𝒴α ∩ 𝒴′} f(y) dy + ∫_{𝒴α ∩ 𝒴′ᶜ} f(y) dy − ∫_{𝒴′ ∩ 𝒴α} f(y) dy − ∫_{𝒴′ ∩ 𝒴αᶜ} f(y) dy,

where 𝒴αᶜ denotes the complement of 𝒴α in the sample space, and this is

  ∫_{𝒴α ∩ 𝒴′ᶜ} f(y) dy − ∫_{𝒴′ ∩ 𝒴αᶜ} f(y) dy.    (8)

If f = f0, (7) and hence (8) are non-negative, because 𝒴′ has size at most that of 𝒴α. Suppose that f = f1. If y ∈ 𝒴αᶜ, then tα f0(y) ≥ f1(y), while f1(y) > tα f0(y) if y ∈ 𝒴α. Hence when f = f1, (8) is no smaller than

  tα {∫_{𝒴α ∩ 𝒴′ᶜ} f0(y) dy − ∫_{𝒴′ ∩ 𝒴αᶜ} f0(y) dy} ≥ 0.

Thus the power of 𝒴α is at least that of 𝒴′, and the result is established.

Probability and Statistics for SIC note 1 of slide 383

Example
Example 290. (a) Construct an optimal test for the hypothesis H0 : θ = 1/2 in Example 276, with
α = 0.05.
(b) Do you think that θ = 1/2 for spins?

Probability and Statistics for SIC slide 384

Note to Example 290
 The joint density of n independent Bernoulli variables can be written as

  f(y) = θ^r (1 − θ)^{n−r},   0 < θ < 1,   r = ∑ yj,

and H0 imposes θ = 1/2. Thus for any fixed θ we have

  f1(y)/f0(y) = θ^r (1 − θ)^{n−r}/{(1/2)^r (1 − 1/2)^{n−r}} = {2(1 − θ)}^n {θ/(1 − θ)}^r,

which is increasing in r if θ > 1/2 and is decreasing in r if θ < 1/2. Hence if θ > 1/2 we must take

  𝒴1 = {y1, . . . , yn : ∑ yj ≥ r1}

for some r1, and if θ < 1/2 we must take

  𝒴2 = {y1, . . . , yn : ∑ yj ≤ r2}

for some r2. So if we want to test H0 against (say) H1 : θ = 0.6, we take 𝒴1, and if we want to test H0 against (say) H1 : θ = 0.4, we take 𝒴2.
 Suppose that we take H1 : θ = 0.6. Then we need to choose r1 such that

  α = P0(Y ∈ 𝒴1) = P0(R ≥ r1) = P0{(R − n/2)/√(n/4) ≥ (r1 − n/2)/√(n/4)} ≈ 1 − Φ{(r1 − n/2)/√(n/4)},

and this implies that r1 ≈ n/2 + √n z_{1−α}/2. With n = 200 and α = 0.05 this is r1 ≈ 111.6. Since we observed R = 115 > r1, we reject H0 at the 5% significance level, and conclude that the coin is biased upwards (but not downwards).
 Since the result does not depend on the value of θ chosen, provided θ > 0.5, we would also reject against any other H1 setting θ > 1/2.
 A similar computation gives r2 = 88.37.
 If we are not sure of the value of θ, then we take a region of the form 𝒴1 ∪ 𝒴2. But in order for it to have overall size α, we take α/2 for each of the regions, giving r1 = 113.86 and r2 = 86.14. Since Y ∈ 𝒴1 ∪ 𝒴2, we still reject H0 at the 5% significance level, and conclude that the coin is biased, without being sure in which direction it is biased.
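The cut-offs follow from the normal approximation used in the note; this sketch just evaluates them in R.

  # Example 290: approximate cut-offs for the most powerful tests, n = 200, alpha = 0.05
  n <- 200; alpha <- 0.05
  n/2 + sqrt(n) * qnorm(1 - alpha) / 2                  # r1, one-sided theta > 1/2: 111.6
  n/2 - sqrt(n) * qnorm(1 - alpha) / 2                  # r2, one-sided theta < 1/2: 88.4
  n/2 + c(-1, 1) * sqrt(n) * qnorm(1 - alpha/2) / 2     # two-sided: 86.1 and 113.9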

Probability and Statistics for SIC note 1 of slide 384

Power and distance
 A canonical example is where Y1, . . . , Yn ∼iid N(µ, σ²), and

  H0 : µ = µ0,   H1 : µ = µ1.

 If σ² is known, then the Neyman–Pearson lemma can be applied, and we find that the most powerful test is based on Ȳ and its power is Φ(zα + δ), where Φ(zα) = α, and

  δ = n^{1/2} |µ1 − µ0|/σ

is the standardized distance between the models.
 We see that
– the power increases if n increases, or if |µ1 − µ0 | increases, since in either case the difference
between the hypotheses is easier to detect,
– the power decreases if σ increases, since then the data become noisier,
– if δ = 0, then the power equals the size, because the two hypotheses are the same, and
therefore P0 (·) = P1 (·).
 Many other situations are analogous to this, with power depending on generalised versions of δ.

Probability and Statistics for SIC slide 385

Summary
 We have considered the situation where we have to make a binary choice between
– the null hypothesis, H0 , against which we want to test
– the alternative hypothesis, H1 ,
using a test statistic T whose observed value is tobs , computing the P-value,

pobs = P0 (T ≥ tobs ),

which is computed assuming that H0 is true.


 We can consider pobs as a measure of the evidence in the data against H0 .
 For a test with significance level α, we reject H0 and choose H1 if pobs < α.
 We must accept that we can make mistakes:

                                  Decision
                          Accept H0        Reject H0
State of Nature  H0 true  Good choice      Type I Error
                 H1 true  Type II Error    Good choice

 If we try to minimise the probability of Type II error (i.e., maximise power) for a given probability
of Type I error (fixed size), we can construct an optimal test, but this is only possible in simple
cases. Otherwise we usually have to compare tests numerically.

Probability and Statistics for SIC slide 386

9 Likelihood slide 387

9.1 Motivation slide 388


Motivation
Likelihood is one of the basic ideas of statistical inference and modelling. It gives a general and
powerful framework for dealing with all kinds of applications, in particular for
 finding estimators with the smallest variances in large samples; and
 constructing powerful tests.

Probability and Statistics for SIC slide 389

Illustration
 When we toss a coin, small asymmetries influence the probability of obtaining heads, which is not
necessarily 1/2. If Y1 , . . . , Yn denote the results of independent Bernoulli trials, then we can write

P(Yj = 1) = θ, P(Yj = 0) = 1 − θ, 0 ≤ θ ≤ 1, j = 1, . . . , n.

 Below is such a sequence for a 5Fr coin with n = 10:

1 1 1 1 1 0 1 1 1 1

Which values of θ seem to you the most and least credible:

θ = 0, θ = 0.3, θ = 0.9, θ = 0.99?

 How can we find the most plausible θ value(s)?

Probability and Statistics for SIC slide 390

Basic Idea
For a value of θ which is not very credible, the density of the data will be smaller: the higher the
density, the more credible the corresponding θ. Since the y1 , . . . , y10 result from independent trials, we
have
  f(y1, . . . , y10; θ) = ∏_{j=1}^{10} f(yj; θ) = f(y1; θ) × · · · × f(y10; θ) = θ⁵ × (1 − θ) × θ⁴ = θ⁹(1 − θ),

which we will consider as a function of θ for 0 ≤ θ ≤ 1, called the likelihood L(θ).


[Figure: the likelihood L(θ) = θ⁹(1 − θ) plotted against θ for n = 10; it peaks near θ = 0.9.]
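The curve is easily reproduced in R; the sketch below plots L(θ) = θ⁹(1 − θ) on a grid and locates its maximum.

  # Likelihood for the n = 10 coin sequence (nine 1's, one 0)
  theta <- seq(0, 1, length = 401)
  L <- theta^9 * (1 - theta)
  plot(theta, L, type = "l", xlab = "theta", ylab = "Likelihood")
  theta[which.max(L)]     # maximum close to 0.9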

Probability and Statistics for SIC slide 391

Relative likelihood
 To compare values of θ, we only need to consider the ratio of the corresponding values of L(θ):

  L(θ1)/L(θ2) = f(y1, . . . , y10; θ1)/f(y1, . . . , y10; θ2) = θ1⁹(1 − θ1)/{θ2⁹(1 − θ2)} = c

implies that θ1 is c times more plausible than θ2.
 The most plausible value is θ̂, which satisfies

  L(θ̂) ≥ L(θ),   0 ≤ θ ≤ 1;

θ̂ is called the maximum likelihood estimate.
 To find θ̂, we can equivalently maximise the log likelihood ℓ(θ) = log L(θ).
 The relative likelihood RL(θ) = L(θ)/L(θ̂) gives the plausibility of θ with respect to θ̂.

Probability and Statistics for SIC slide 392

Example
Example 291. Find θ̂ and RL(θ) for a sequence of independent Bernoulli trials.

The following graph represents RL(θ), for n = 10, 20, 100 and the sequence

1 1 1 1 1 0 1 1 1 1 0 1 1 1 1 0 1 0 1 1
1 1 1 1 1 1 0 1 0 1 0 0 1 1 0 1 1 1 0 1
1 1 1 0 0 1 0 1 1 1 1 1 0 0 1 1 1 1 1 1
1 0 1 0 1 1 0 1 1 1 0 0 1 1 1 0 1 1 1 1
1 0 0 0 0 1 0 1 0 0 1 0 0 1 1 1 1 1 1 0
 As n increases, RL(θ) becomes more concentrated around θ̂: values of θ which are far away from θ̂ become less credible with respect to θ̂.
 This suggests that we could construct a CI by taking the set

{θ : RL(θ) ≥ c} ,

for some c. Later we will see how to choose c.


Probability and Statistics for SIC slide 393

Note to Example 291
The likelihood is

  L(θ) = f(y; θ) = ∏_{j=1}^n f(yj; θ) = ∏_{j=1}^n θ^{yj}(1 − θ)^{1−yj} = θ^s (1 − θ)^{n−s},   0 ≤ θ ≤ 1,

where s = ∑ yj and we have used the fact that the observations are independent. Therefore

  ℓ(θ) = s log θ + (n − s) log(1 − θ),   0 ≤ θ ≤ 1.

Differentiation of this yields

  dℓ(θ)/dθ = s/θ − (n − s)/(1 − θ),   d²ℓ(θ)/dθ² = −s/θ² − (n − s)/(1 − θ)².

Setting dℓ(θ)/dθ = 0 gives just one solution, θ̂ = s/n = ȳ, and since the second derivative is always negative, this is clearly the maximum. Therefore

  RL(θ) = L(θ)/L(θ̂) = (θ/θ̂)^s {(1 − θ)/(1 − θ̂)}^{n−s},   0 ≤ θ ≤ 1.

Probability and Statistics for SIC note 1 of slide 393

Bernoulli sequence

[Figure: relative likelihood RL(θ) against θ for n = 10 (black), n = 20 (blue), n = 100 (red).]

Probability and Statistics for SIC slide 394

Bernoulli sequence

[Figure: the same relative likelihoods, with horizontal lines at c = 0.3 and c = 0.1 indicating the sets {θ : RL(θ) ≥ c}.]

Probability and Statistics for SIC slide 395

9.2 Scalar Parameter slide 396

Likelihood
Definition 292. Let y be a set of data, whose joint probability density f (y; θ) depends on a parameter
θ, then the likelihood and the log likelihood are

L(θ) = f (y; θ), ℓ(θ) = log L(θ),

considered a function of θ.

If y = (y1, . . . , yn) is a realisation of the independent random variables Y1, . . . , Yn, then

  L(θ) = f(y; θ) = ∏_{j=1}^n f(yj; θ),   ℓ(θ) = ∑_{j=1}^n log f(yj; θ),

where f (yj ; θ) represents the density of one of the yj .

Probability and Statistics for SIC slide 397

Maximum likelihood estimation
Definition 293. The maximum likelihood estimate θ̂ satisfies

  L(θ̂) ≥ L(θ) for all θ,

which is equivalent to ℓ(θ̂) ≥ ℓ(θ), since L(θ) and ℓ(θ) have their maxima at the same value of θ. The corresponding random variable is called the maximum likelihood estimator (MLE).

 Often θ̂ satisfies

  dℓ(θ̂)/dθ = 0,   d²ℓ(θ̂)/dθ² < 0.

In this course we will suppose that the first of these equations has only one solution (not always the case in reality).
 In realistic cases we use numerical algorithms to obtain θ̂ and d²ℓ(θ̂)/dθ², as sketched below.
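For instance, in R a one-dimensional log likelihood can be maximised with optimize; the sketch below does this for the Bernoulli log likelihood of Example 291 (s = 9 successes in n = 10 trials) and approximates the curvature by a finite difference, whose step size is an arbitrary choice.

  # Numerical maximisation of a log likelihood (Bernoulli sketch, s = 9, n = 10)
  s <- 9; n <- 10
  loglik <- function(theta) s * log(theta) + (n - s) * log(1 - theta)
  fit <- optimize(loglik, interval = c(0.001, 0.999), maximum = TRUE)
  theta_hat <- fit$maximum                      # close to s/n = 0.9
  h <- 1e-4                                     # finite-difference estimate of -d2l/dtheta2
  J_hat <- -(loglik(theta_hat + h) - 2 * loglik(theta_hat) + loglik(theta_hat - h)) / h^2
  c(theta_hat, J_hat)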

Probability and Statistics for SIC slide 398

Information
Definition 294. The observed information J(θ) and the expected information (or Fisher information) I(θ) are

  J(θ) = −d²ℓ(θ)/dθ²,   I(θ) = E{J(θ)} = E{−d²ℓ(θ)/dθ²}.

They measure the curvature of −ℓ(θ): the larger J(θ) and I(θ), the more concentrated ℓ(θ) and L(θ)
are.
Example 295. If y1, . . . , yn ∼iid Bernoulli(θ), calculate L(θ), ℓ(θ), θ̂, var(θ̂), J(θ) and I(θ).

Probability and Statistics for SIC slide 399

Note to Example 295


We saw in Example 291 that

  L(θ) = θ^s (1 − θ)^{n−s},   ℓ(θ) = s log θ + (n − s) log(1 − θ),   0 ≤ θ ≤ 1,

that the MLE is θ̂ = s/n = ȳ, and clearly

  J(θ) = −d²ℓ(θ)/dθ² = s/θ² + (n − s)/(1 − θ)².

Now treating θ̂ as a random variable, θ̂ = S/n, where S ∼ B(n, θ), we see that since E(S) = nθ and var(S) = nθ(1 − θ), we have after a little algebra that

  var(θ̂) = θ(1 − θ)/n,   I(θ) = E{J(θ)} = n/{θ(1 − θ)},   0 < θ < 1.

Note that var(θ̂) = 1/I(θ).

Probability and Statistics for SIC note 1 of slide 399

Limit distribution of the MLE
Theorem 296. Let Y1, . . . , Yn be a random sample from a parametric density f(y; θ), and let θ̂ be the MLE of θ. If f satisfies regularity conditions (see below), then

  J(θ̂)^{1/2}(θ̂ − θ) →D N(0, 1),   n → ∞.

Thus for large n,

  θ̂ ·∼ N{θ, J(θ̂)⁻¹},

and a two-sided equi-tailed CI for θ with approximate level (1 − α) is

  I^θ̂_{1−α} = (L, U) = (θ̂ − J(θ̂)^{−1/2} z_{1−α/2}, θ̂ + J(θ̂)^{−1/2} z_{1−α/2}).

We can show that for large n (and a regular model) no estimator has a smaller variance than θ̂, which implies that the CIs I^θ̂_{1−α} are as narrow as possible.

Example 297. Find the 95% CI for the coin data with n = 10, 20, 100.

    n    Heads   θ̂      J(θ̂)    I^θ̂_0.95        I^W_0.95
   10      9     0.9    111.1   (0.72, 1.08)   (0.63, 0.99)
   20     16     0.8    125.0   (0.62, 0.98)   (0.59, 0.94)
  100     69     0.69   467.5   (0.60, 0.78)   (0.60, 0.78)
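The Wald columns of this table can be reproduced with a small R function; the function name is invented for illustration, and the likelihood-ratio intervals I^W need a numerical search instead.

  # Wald 95% intervals I^theta_hat_0.95 for the coin data (Example 297)
  wald_ci <- function(s, n, alpha = 0.05) {
    theta_hat <- s / n
    J <- n / (theta_hat * (1 - theta_hat))     # observed information at theta_hat
    theta_hat + c(-1, 1) * qnorm(1 - alpha/2) / sqrt(J)
  }
  wald_ci(9, 10); wald_ci(16, 20); wald_ci(69, 100)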

Probability and Statistics for SIC slide 400

Likelihood ratio statistic


Sometimes a CI based on the normal limit distribution of θ̂ is unreasonable. It is then better to use ℓ(θ) itself.

Definition 298. Let ℓ(θ) be the log likelihood for a scalar parameter θ, whose MLE is θ̂. Then the likelihood ratio statistic is

  W(θ) = 2{ℓ(θ̂) − ℓ(θ)}.

Theorem 299. If θ0 is the value of θ that generated the data, then under the regularity conditions giving θ̂ a normal limit distribution,

  W(θ0) →D χ²₁,   n → ∞.

Hence W(θ0) ·∼ χ²₁ for large n.

Example 300. Find W(θ) when Y1, . . . , Yn ∼iid Bernoulli(θ0).

Probability and Statistics for SIC slide 401

Note to Example 300
Since

  ℓ(θ) = s log θ + (n − s) log(1 − θ),   0 ≤ θ ≤ 1,

and θ̂ = s/n = ȳ, we have

  W(θ) = 2[nθ̂ log(θ̂/θ) + n(1 − θ̂) log{(1 − θ̂)/(1 − θ)}],

and if we write θ̂ = θ + n^{−1/2} a(θ)Z, where a²(θ) = θ(1 − θ) and Z →D N(0, 1), we end up after a Taylor series or two with

  W(θ) ≈ Z² →D χ²₁.

Probability and Statistics for SIC note 1 of slide 401

Implications of Theorem 299


 Suppose we want to test the hypothesis H0 : θ = θ0, where θ0 is fixed. If H0 is true, the theorem implies that W(θ0) ·∼ χ²₁. The larger W(θ0) is, the more we doubt H0. Thus we can take W(θ0) as a test statistic, whose observed value is wobs, and with

  pobs = P{W(θ0) ≥ wobs} ≈ P(χ²₁ ≥ wobs)

as significance level. The smaller pobs is, the more we doubt H0.
 Let χ²_ν(1 − α) be the (1 − α) quantile of the χ²_ν distribution. Theorem 299 implies that a CI for θ0 at the (1 − α) level is the set

  I^W_{1−α} = {θ : W(θ) ≤ χ²₁(1 − α)} = {θ : 2{ℓ(θ̂) − ℓ(θ)} ≤ χ²₁(1 − α)} = {θ : ℓ(θ) ≥ ℓ(θ̂) − ½χ²₁(1 − α)}.

 With 1 − α = 0.95 we have χ²₁(0.95) = 3.84. Thus the 95% CI for a scalar θ contains all θ such that ℓ(θ) ≥ ℓ(θ̂) − 1.92. In this case we have

  RL(θ) = L(θ)/L(θ̂) = exp{ℓ(θ) − ℓ(θ̂)} ≥ exp(−1.92) ≈ 0.15;

compare with slide 395.
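In practice the set {θ : ℓ(θ) ≥ ℓ(θ̂) − ½χ²₁(1 − α)} is usually found numerically. The sketch below does this by a crude grid search for the Bernoulli log likelihood; the grid and its resolution are arbitrary choices.

  # Likelihood-ratio (1 - alpha) CI for a Bernoulli probability, by grid search
  lr_ci <- function(s, n, conf = 0.95) {
    theta <- seq(1e-4, 1 - 1e-4, length = 10000)
    ll <- s * log(theta) + (n - s) * log(1 - theta)
    range(theta[ll >= max(ll) - qchisq(conf, df = 1) / 2])
  }
  lr_ci(9, 10); lr_ci(16, 20); lr_ci(69, 100)   # compare with I^W_0.95 in Example 297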

Probability and Statistics for SIC slide 402

CIs based on the likelihood ratio statistic

[Figure: log likelihoods, relative to their maxima, for n = 10 (black), n = 20 (blue), n = 100 (red), with horizontal cut-offs giving the CIs at levels 0.90, 0.95 and 0.99.]

 When n increases, the CI becomes narrower and more symmetric about θ̂.
 When 1 − α increases (i.e., α decreases), the CI becomes wider.

Probability and Statistics for SIC slide 403

‘Standard Model’ of particle physics

(Source: http://www.benbest.com/science/standard.html)
 The top quark was discovered in 1995.
 The experiments yielded an observed count y = 17, which would follow the Poisson(θ) distribution with θ = 6.7 if this quark did not exist.

Probability and Statistics for SIC slide 404

Top quark: Likelihood

[Figure: the relative Poisson log likelihood for the top quark data, plotted against θ.]

Here we have f(y; θ) = θ^y e^{−θ}/y!, θ̂ = y, y = 17, and θ0 = 6.7, and thus

  wobs = W(θ0) = 2{log f(y; θ̂) − log f(y; θ0)}
       = 2{(y log θ̂ − θ̂ − log y!) − (y log θ0 − θ0 − log y!)}
       = 11.06.

Thus P(W ≥ wobs) ≈ P(χ²₁ ≥ 11.06) = 0.00088: a rare event, if θ0 = 6.7.
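The same numbers come directly from the Poisson log likelihood in R; the sketch below repeats the calculation.

  # Top quark: likelihood ratio test of theta0 = 6.7 when y = 17 is observed
  y <- 17; theta0 <- 6.7
  w_obs <- 2 * (dpois(y, y, log = TRUE) - dpois(y, theta0, log = TRUE))   # theta_hat = y
  w_obs                                          # about 11.06
  pchisq(w_obs, df = 1, lower.tail = FALSE)      # about 0.00088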

Probability and Statistics for SIC slide 405

Regularity
The regularity conditions are complicated. Situations where they are false are often cases where
 one of the parameters is discrete;
 the support of f (y; θ) depends on θ;
 the true θ is on the limit of its possible values.
The conditions are satisfied in the majority of cases met in practice.
Here is an example where they are not satisfied:
Example 301. If Y1, . . . , Yn ∼iid U(0, θ), find the likelihood L(θ) and the MLE θ̂. Show that the limit distribution of n(θ − θ̂)/θ when n → ∞ is exp(1). Discuss.
Probability and Statistics for SIC slide 406

Note to Example 301
First show that owing to the independence, we have

  L(θ) = ∏_{j=1}^n f_Y(yj; θ) = ∏_{j=1}^n θ⁻¹ I(0 < yj < θ) = θ⁻ⁿ I(max yj < θ),   θ > 0,

and therefore θ̂ = M = max Yj, whose distribution we already know is

  P(M ≤ x) = (x/θ)ⁿ,   0 < x < θ.

Now

  P{n(θ − θ̂)/θ ≤ x} = P(θ̂ ≥ θ − xθ/n) = 1 − {(θ − xθ/n)/θ}ⁿ → 1 − exp(−x),

as required. Note that:


 the scaling needed to get a limiting distribution is much faster here than in the usual case;
 the limit is not normal.
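A simulation makes the exponential limit visible; the sketch below uses arbitrary choices of θ, n and the number of replications.

  # Non-regular case: n*(theta - M)/theta compared with the exp(1) distribution
  set.seed(1)
  n <- 50; theta <- 2
  z <- replicate(10000, n * (theta - max(runif(n, 0, theta))) / theta)
  qqplot(qexp(ppoints(10000)), z, xlab = "exp(1) quantiles", ylab = "n(theta - M)/theta")
  abline(0, 1)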
Probability and Statistics for SIC note 1 of slide 406

Example
Comparison of the distributions of θ̂ in a regular case (panels above, with standard deviation ∝ n^{−1/2}) and in a nonregular case (Example 301, panels below, with standard deviation ∝ n⁻¹). In other nonregular cases it might happen that the distribution is not nice (like here) and/or that the convergence is slower than in regular cases.

[Figure: histograms of the MLE for n = 16, 64, 256 (left to right), in the regular case (top row) and the non-regular case (bottom row).]

Probability and Statistics for SIC slide 407

9.3 Vector Parameter slide 408
Vector θ
Often θ is a vector of dimension p. Then the definitions and results above are valid with some slight
changes:
 the MLE θ̂ often satisfies the vector equation

  dℓ(θ̂)/dθ = 0_{p×1};

 J(θ) and I(θ) are p × p matrices;
 and in regular cases,

  θ̂ ·∼ Np{θ, J(θ̂)⁻¹}.

Example 302. If y1, . . . , yn is a random sample from the N(µ, σ²) distribution, calculate µ̂ and σ̂² and their asymptotic distribution.

Probability and Statistics for SIC slide 409

Note 1 to Example 302


 The density function of a normal random variable with mean µ and variance σ² is (2πσ²)^{−1/2} exp{−(y − µ)²/(2σ²)}, so here θ = (µ, σ²)ᵀ ∈ R × R₊, and the likelihood for a random sample y1, . . . , yn equals

  L(θ) = f(y; θ) = ∏_{j=1}^n f(yj; θ) = ∏_{j=1}^n (2πσ²)^{−1/2} exp{−(yj − µ)²/(2σ²)}.

Therefore the log likelihood is

  ℓ(µ, σ²) = −(n/2) log(2π) − (n/2) log σ² − (1/(2σ²)) ∑_{j=1}^n (yj − µ)²,   µ ∈ R, σ² > 0.

Its first derivatives are

  ∂ℓ/∂µ = σ⁻² ∑_{j=1}^n (yj − µ),   ∂ℓ/∂σ² = −n/(2σ²) + (1/(2σ⁴)) ∑_{j=1}^n (yj − µ)²,

and the second derivatives, whose negatives give the elements of the observed information matrix J(µ, σ²), are

  ∂²ℓ/∂µ² = −n/σ²,   ∂²ℓ/∂µ∂σ² = −(n/σ⁴)(ȳ − µ),   ∂²ℓ/∂(σ²)² = n/(2σ⁴) − (1/σ⁶) ∑_{j=1}^n (yj − µ)².

Probability and Statistics for SIC note 1 of slide 409

Note 2 to Example 302
 To obtain the MLEs, we solve simultaneously the equations

  ∂ℓ(µ, σ²)/∂µ = σ⁻² ∑_{j=1}^n (yj − µ) = 0,   ∂ℓ(µ, σ²)/∂σ² = −n/(2σ²) + (1/(2σ⁴)) ∑_{j=1}^n (yj − µ)² = 0.

Now

  ∂ℓ(µ̂, σ̂²)/∂µ = 0 ⇒ (1/σ̂²) ∑_{j=1}^n (yj − µ̂) = 0 ⇒ nµ̂ = ∑_{j=1}^n yj ⇒ µ̂ = n⁻¹ ∑_{j=1}^n yj = ȳ

and

  ∂ℓ(µ̂, σ̂²)/∂σ² = 0 ⇒ n/(2σ̂²) = (1/(2σ̂⁴)) ∑_{j=1}^n (yj − µ̂)² ⇒ σ̂² = n⁻¹ ∑_{j=1}^n (yj − µ̂)² = n⁻¹ ∑_{j=1}^n (yj − ȳ)².

The first of these has the sole solution µ̂ = ȳ for all values of σ², and therefore ℓ(µ̂, σ²) is unimodal with maximum at σ̂² = n⁻¹ ∑ (yj − ȳ)². At the point (µ̂, σ̂²), the observed information matrix J(µ, σ²) is diagonal with elements diag{n/σ̂², n/(2σ̂⁴)}, and so is positive definite. Hence µ̂ = ȳ and σ̂² = n⁻¹ ∑ (yj − ȳ)² are the sole solutions to the likelihood equation, and therefore are the maximum likelihood estimates.
 The fact that θ̂ ·∼ Np{θ, J(θ̂)⁻¹} implies that

  (µ̂, σ̂²)ᵀ ·∼ N₂{(µ, σ²)ᵀ, diag(σ̂²/n, 2σ̂⁴/n)},

which implies that µ̂ and σ̂² are approximately independent, since their covariance is zero and they have an approximate joint normal distribution. In fact we know from Theorem 274 that they are exactly independent, with µ̂ ∼ N(µ, σ²/n), and nσ̂² ∼ σ²χ²_{n−1}; this last fact implies that E(σ̂²) = (n − 1)σ²/n and var(σ̂²) = 2(n − 1)σ⁴/n², which converge to the values in the large-sample distribution as n → ∞.
 For confidence intervals, we note that the large-sample theory implies that a (1 − α) confidence interval for µ is

  µ̂ ± z_{1−α/2} σ̂/√n = ȳ ± z_{1−α/2} σ̂/√n,

and unless n is very small, this approximate interval is very similar to the exact interval

  ȳ ± t_{n−1}(1 − α/2) s/√n

that stems from Theorem 274. A similar argument applies to the interval for σ².

Probability and Statistics for SIC note 2 of slide 409

Nested models
 Very few applications have only one parameter, so we must test hypotheses concerning vector
parameters.
 For example, in the case of the normal model, we often want to test the hypothesis H0 : µ = µ0 ,
where µ0 is a specified value (such as µ0 = 0), without restricting σ 2 . In this case we want to
compare two models, where

θ = (µ, σ 2 ) ∈ R × R+ , θ = (µ, σ 2 ) ∈ {µ0 } × R+ ,

whose respective parameters have dimensions 2 (general model) and 1 (simplified model).
 In a general context, put θp×1 = (ψq×1 , λ(p−q)×1 ), and suppose that we want to test whether the
simple model with ψ = ψ 0 explains the data as well as the general model. Thus, under the general
model, θ = (ψ, λ) ∈ Rq × Rp−q , whereas under the simple model, θ = (ψ, λ) ∈ {ψ 0 } × Rp−q .
 We say that the simpler model is nested in the general model.
 We use this terminology in all situations where one of the two models becomes the same as the
other when the parameter is restricted.

Probability and Statistics for SIC slide 410

Likelihood ratio statistic


 Take two nested models with corresponding MLEs
$$ \hat\theta = (\hat\psi, \hat\lambda), \qquad \hat\theta_0 = (\psi^0, \hat\lambda_0), $$
where inevitably $\ell(\hat\theta) \geq \ell(\hat\theta_0)$, and write the likelihood ratio statistic
$$ W(\psi^0) = 2\{\ell(\hat\theta) - \ell(\hat\theta_0)\} = 2\{\ell(\hat\psi,\hat\lambda) - \ell(\psi^0,\hat\lambda_0)\}. $$
 Then if it is true that $\psi = \psi^0$ (i.e., the simpler of the two models is true), we have
$$ W(\psi^0) \stackrel{D}{\longrightarrow} \chi^2_q, \qquad n\to\infty. $$
This gives a basis for tests and CIs as before, using the approximation
$$ W(\psi^0) \stackrel{\cdot}{\sim} \chi^2_q, $$
valid for large $n$. In general this approximation is better than that for $\hat\theta$.
 This result generalises Theorem 299, where θ ≡ ψ is scalar, q = 1, and λ is not present, p − q = 0.

Probability and Statistics for SIC slide 411

Example
Example 303. Below are the results of 100 tosses of each of two different coins, with 1 denoting a head:

1 1 1 1 1 0 1 1 1 1 0 1 1 1 1 0 1 0 1 1
1 1 1 1 1 1 0 1 0 1 0 0 1 1 0 1 1 1 0 1
1 1 1 0 0 1 0 1 1 1 1 1 0 0 1 1 1 1 1 1
1 0 1 0 1 1 0 1 1 1 0 0 1 1 1 0 1 1 1 1
1 0 0 0 0 1 0 1 0 0 1 0 0 1 1 1 1 1 1 0

1 0 1 1 0 0 1 0 1 1 0 0 0 1 1 0 0 0 1 0
1 1 0 1 0 1 0 1 1 0 0 0 0 1 0 1 1 0 0 0
1 1 1 0 0 1 1 0 0 1 1 0 1 0 1 1 0 0 0 1
1 0 1 1 1 1 1 0 1 1 1 1 0 1 0 0 1 1 0 0
0 1 1 1 1 1 1 0 1 0 0 1 0 0 1 1 1 1 0 1

Let θ1 , θ2 be the corresponding probabilities of obtaining heads. Find the likelihood and the likelihood
ratio statistic. Does θ1 = θ2 : are the probabilities equal?

Probability and Statistics for SIC slide 412

Note to Example 303


 There are 69 heads for the first coin and 55 for the second. If the trials are independent, then
$$ L(\theta_1,\theta_2) = \theta_1^{69}(1-\theta_1)^{31} \times \theta_2^{55}(1-\theta_2)^{45}, \qquad 0\leq\theta_1,\theta_2\leq 1, $$
and the contours of the relative log likelihood $\ell(\theta_1,\theta_2)-\ell(\hat\theta_1,\hat\theta_2)$ are shown on the next slide. Clearly the parameter has dimension $p=2$.
 Differentiation gives that $\hat\theta_1 = 69/(69+31) = 0.69$ and $\hat\theta_2 = 55/(55+45) = 0.55$, corresponding to the black blob in the figure, at $(\theta_1,\theta_2) = (\hat\theta_1,\hat\theta_2) = (0.69, 0.55)$.
 Under the simpler model $\theta_1=\theta_2=\theta$, we have
$$ L(\theta,\theta) = \theta^{69}(1-\theta)^{31}\times\theta^{55}(1-\theta)^{45} = \theta^{124}(1-\theta)^{76}, \qquad 0\leq\theta\leq 1, $$
and the parameter has one dimension, $q=1$, corresponding to the diagonal red line in the contour plot. Clearly $\hat\theta = 124/(124+76) = 0.62$, corresponding to the red blob on the diagonal line, at $(\theta_1,\theta_2) = (\hat\theta,\hat\theta) = (0.62, 0.62)$.
 To compare the two models, we note that the corresponding likelihood ratio statistic is
$$ w_{\mathrm{obs}} = 2\{\ell(\hat\theta_1,\hat\theta_2) - \ell(\hat\theta,\hat\theta)\} = -2\{\ell(\hat\theta,\hat\theta) - \ell(\hat\theta_1,\hat\theta_2)\}, $$
and reading off from the graph this (approximately) equals $w_{\mathrm{obs}} = -2(-2) = 4$ (since the relative log likelihood is $-2$ at the red blob). Thus, if the simpler model is true, we have $P(W\geq w_{\mathrm{obs}}) \doteq P(\chi^2_1\geq 4) \doteq 0.046$. This event has a fairly small probability (around 1 in 20), so it is some, but not strong, evidence that the more complex model is needed, i.e., that the coins do not have the same success probability.

Probability and Statistics for SIC note 1 of slide 412
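As a check on the graphical calculation, here is a minimal R sketch that computes the MLEs and the exact likelihood ratio statistic directly from the counts given above:

    ## Example 303: 69 heads / 31 tails for coin 1, 55 heads / 45 tails for coin 2
    heads <- c(69, 55); tails <- c(31, 45)

    loglik <- function(th) sum(heads * log(th) + tails * log(1 - th))

    th.hat <- heads / (heads + tails)            # separate MLEs: 0.69 and 0.55
    th.com <- sum(heads) / sum(heads + tails)    # common MLE under theta1 = theta2: 0.62

    w.obs <- 2 * (loglik(th.hat) - loglik(rep(th.com, 2)))
    c(w.obs, pchisq(w.obs, df = 1, lower.tail = FALSE))
    ## gives w.obs of about 4.2 and a p-value of about 0.04, in line with the
    ## graphical values of 4 and 0.046 quoted above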

Example: Likelihood
Contours of log likelihood

[Figure: contour plot of the relative log likelihood ℓ(θ1, θ2) − ℓ(θ̂1, θ̂2), with theta1 on the horizontal axis and theta2 on the vertical axis, both running from 0 to 1.]

Probability and Statistics for SIC slide 413

9.4 Statistical Modelling slide 414

Drinking and driving

Probability and Statistics for SIC slide 415

Drinking and driving
Example 304. Formulate a model for the data, and use it to verify the change in the proportion of
accidents due to alcohol in 2005–2006.
Is there a difference on the other side of the Röstigraben?

Probability and Statistics for SIC slide 416

Note to Example 304


 Each canton has many drivers, and each driver has a small probability of an accident, so a Poisson
approximation to the binomial distribution of the number of drivers having an accident seems
reasonable. Let yac denote the number of accidents in year a and canton c, and suppose these are
independent Poisson variables with means µca , for a = 2005, 2006, c = 1, . . . , C = 23.
 Here are some possible models and their interpretations:
– µca ≡ µ, the mean number of accidents is constant in each canton and each year. This is
obviously silly because the mean numbers of accidents in a canton will be related to its
population, and these are very different, so we expect this model to be strongly contradicted by
the data. This model has 1 parameter, µ;
– µca = λc , the mean number of accidents is different for each canton, but does not depend on
the year. This model has 23 parameters, λ1 , . . . , λ23 ;
– µca = λc ψ^{I(a=2006)}, the mean number of accidents is different for each canton, and increases
by a factor ψ from 2005 to 2006, uniformly across all the cantons, where we expect ψ ≈ 1,
since huge changes are unlikely. This model has 24 parameters, λ1 , . . . , λ23 , ψ;
– µca = λc ψ0^{I(a=2006, r=0)} ψ1^{I(a=2006, r=1)}, the mean number of accidents is different for each
canton, and increases by a factor ψ0 from 2005 to 2006 in the french-speaking cantons (r = 0)
and by a factor ψ1 from 2005 to 2006 in the other cantons (r = 1, across the Röstigraben).
This model has 25 parameters, λ1 , . . . , λ23 , ψ0 , ψ1 ;
– µca = λca , the mean number of accidents is different for each canton and each year, and there
is no structure in the variation. This model has 46 parameters.

Probability and Statistics for SIC note 1 of slide 416

Values of the maximised log likelihood ℓ̂

  Model                                              ℓ̂          Number of parameters   2(ℓ̂2 − ℓ̂1)   df
  µca ≡ µ                                            −4668.59    1
  µca = λc                                           −161.62     23                     9011.9        22
  µca = λc ψ^{I(a=2006)}                             −157.70     24                     7.7           1
  µca = λc ψ0^{I(a=2006,r=0)} ψ1^{I(a=2006,r=1)}     −155.20     25                     5.2           1
  µca = λca                                          −146.72     46                     16.9          21

The indices:
 c for the canton
 a for the year
 r = 1 for the other side of the Röstigraben
Calculation of the likelihood ratio statistics:

9011.9 = 2{−161.62 − (−4668.59)}, df = 23 − 1 = 22,


7.7 = 2{−157.70 − (−161.62)}, df = 24 − 23 = 1,
5.2 = 2{−155.20 − (−157.70)}, df = 25 − 24 = 1,
16.9 = 2{−146.72 − (−155.20)}, df = 46 − 25 = 21,

and (for example) we obtain P(χ21 ≥ 7.7) = 0.0055.

Probability and Statistics for SIC slide 417
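The tail probabilities quoted here are χ² probabilities; a minimal R check of the whole column of likelihood ratio statistics:

    ## Likelihood ratio statistics and degrees of freedom from the table above
    w  <- c(9011.9, 7.7, 5.2, 16.9)
    df <- c(22, 1, 1, 21)
    signif(pchisq(w, df, lower.tail = FALSE), 2)
    ## the second entry is P(chi^2_1 >= 7.7) = 0.0055; the last, P(chi^2_21 >= 16.9),
    ## is large, so the 46-parameter model is not needed once the 25-parameter one is fitted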

Distribution of the likelihood ratio statistics


Here are simulations for comparing the models with 25 parameters and 46 parameters. The χ221
distribution gives a very good approximation to the empirical distribution of the likelihood ratio
statistic, W .
Simulated likelihood ratio statistics
0.10

50
0.08

40
Ordered LR statistics
0.06

30
Density
0.04

20
0.02

10
0.00

0 10 20 30 40 50 0 10 20 30 40 50
Likelihood ratio statistic Quantiles of chi−squared distribution, 21 df

Probability and Statistics for SIC slide 418

Estimates
Here are the estimates and standard errors for the ‘best’ model:

Estimate Std. Error z value Pr(>|z|)


cantonFR 4.61884 0.07402 62.404 < 2e-16 ***
cantonJU 3.85312 0.10598 36.358 < 2e-16 ***
cantonVD 6.63906 0.03378 196.510 < 2e-16 ***
cantonNE 4.77156 0.06907 69.087 < 2e-16 ***
cantonAG 5.48651 0.04636 118.352 < 2e-16 ***
cantonAI 1.82690 0.27765 6.580 4.71e-11 ***
cantonAR 3.04614 0.15131 20.131 < 2e-16 ***
...
cantonZG 4.24556 0.08377 50.679 < 2e-16 ***
romand1:year -0.02753 0.04435 -0.621 0.534715
romand0:year 0.08787 0.02489 3.530 0.000415 ***
---
Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1

Probability and Statistics for SIC slide 419

General approach
Having understood the problem and looked at the data:
 we choose one or several models, basing our choice on
– prior knowledge, or
– stochastic reasoning, or
– purely empirical ideas;
 we fit the models by maximum likelihood;
 we compare nested models using their maximised log likelihoods, often via the likelihood ratio
statistic 2(ℓb2 − ℓb1 );
 we choose one or several ‘best’ models, and we use the approximation $\hat\theta \stackrel{\cdot}{\sim} N\{\theta^0, J(\hat\theta)^{-1}\}$ to find
CIs for the parameters, which we can interpret with respect to the original problem;
 we verify whether the ‘best’ models are good;
 if all goes OK, we stop; otherwise, we start again at step 1, or we look for more (better?) data.

Probability and Statistics for SIC slide 420

General case
 In general we take the likelihood based on data y = (y1 , . . . , yn ) to be P(Y = y; θ). If the yj are
independent, then (as before) we have
$$ L(\theta) = P(Y=y;\theta) = \prod_{j=1}^n P(Y_j=y_j;\theta). \qquad (9) $$

 In practice continuous observations yj are rounded, so the true observation lies in a small interval
(yj − δ/2, yj + δ/2), and then

P(Yj = yj ) = P{Yj ∈ (yj − δ/2, yj + δ/2)} = F (yj + δ/2) − F (yj − δ/2) ≈ δf (yj ).

This justifies using


$$ L(\theta) = P(Y=y;\theta) = \prod_{j=1}^n P(Y_j=y_j) \propto \prod_{j=1}^n f(y_j;\theta) $$

for continuous observations, but in general we use (9) for independent data.
 For example, if some of the yj are only known to exceed some value c, then (9) is
$$ L(\theta) = P(Y=y;\theta) \propto \prod_{j=1}^n f(y_j;\theta)^{I(y_j\leq c)}\,\{1-F(c;\theta)\}^{I(y_j>c)}. $$

Probability and Statistics for SIC slide 421
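To make the censored case concrete, here is a minimal R sketch, assuming (only for this illustration) an exponential model with rate θ and observations right-censored at c; it maximises the corresponding log likelihood numerically:

    ## Censored-data likelihood sketch: exponential model, right-censoring at c
    set.seed(1)
    cc  <- 2
    y   <- pmin(rexp(100, rate = 0.7), cc)   # values above cc are only known to exceed cc
    obs <- y < cc                            # I(y_j <= c): exactly observed

    nll <- function(theta)                   # negative log likelihood from the formula above
      -sum(obs * dexp(y, theta, log = TRUE) +
           (1 - obs) * pexp(cc, theta, lower.tail = FALSE, log.p = TRUE))

    optimize(nll, c(0.01, 10))$minimum       # MLE of theta, close to the true rate 0.7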

9.5 Linear Regression slide 422

‘Acqua Alta’ in Venice

(Source: Andrea Pattaro/AFP Photo/AFP/Getty Images)

Probability and Statistics for SIC slide 423

Santa Maria della Salute

(Source: http://www.fotocommunity.de/pc/pc/display/26118338)

Probability and Statistics for SIC slide 424

High tide in Venice, 1881–2011


[Figure: annual maximum sea level, 'Annual maximum flood (cm)', plotted against year for Venice; vertical axis from 60 to 180 cm, horizontal axis from 1900 to 2000.]

Probability and Statistics for SIC slide 425

Simple linear regression
 Let Y be a random variable, the response variable and suppose that its distribution depends on a
variable x, supposed non-random, the explanatory variable. Sometimes we call x the
independent variable, and Y the dependent variable.
 A simple model to describe linear dependence of E(Y ) on x is

Y ∼ N (β0 + β1 x, σ 2 ),

where β0 , β1 , σ 2 are the unknown parameters to be estimated using the data.


 In Venice, Y is the highest tide for a year, and x is the year. We have
– yj , highest tide (cm) in year j, observed for j = 1, . . . , n;
– xj , the year j, observed for j = 1, . . . , n;
– β0 , average tide height (cm) when x = 0, to be estimated;
– β1 , annual change in the average height per year (cm/year), to be estimated;
– σ 2 , variance of Y (cm2 ), to be estimated.
 According to our general recipe, we find the likelihood for the unknown parameters, then we find
the estimators, etc.
Probability and Statistics for SIC slide 426

Estimation
Example 305. Show that if the Y1 , . . . , Yn are independent, then the log likelihood for the simple
linear model is
$$ \ell(\beta_0,\beta_1,\sigma^2) = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log\sigma^2 - \frac{1}{2\sigma^2}\sum_{j=1}^n (y_j-\beta_0-\beta_1 x_j)^2, \qquad (\beta_0,\beta_1,\sigma^2)\in\mathbb{R}^2\times\mathbb{R}_+. $$
Hence show that if $n\geq 2$ and not all the $x_j$ are equal, then
$$ \hat\beta_1 = \frac{\sum_{j=1}^n y_j(x_j-\bar x)}{\sum_{j=1}^n (x_j-\bar x)^2}, \qquad \hat\beta_0 = \bar y - \hat\beta_1\bar x, \qquad \hat\sigma^2 = n^{-1}\sum_{j=1}^n (y_j-\hat\beta_0-\hat\beta_1 x_j)^2, $$
and that (under the regularity conditions) the approximate distribution of the estimators for large $n$ is
$$ \begin{pmatrix} \hat\beta_0 \\ \hat\beta_1 \\ \hat\sigma^2 \end{pmatrix} \stackrel{\cdot}{\sim} N_3\left\{ \begin{pmatrix} \beta_0 \\ \beta_1 \\ \sigma^2 \end{pmatrix}, \begin{pmatrix} \hat\sigma^2 (X^TX)^{-1} & 0 \\ 0 & 2\hat\sigma^4/n \end{pmatrix} \right\}, \qquad X_{n\times 2} = \begin{pmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix}. $$

Probability and Statistics for SIC slide 427

Note to Example 305
 For the log likelihood we simply note that if the $Y_1,\ldots,Y_n$ are independent, then
$$ f(y;\beta_0,\beta_1,\sigma^2) = \prod_{j=1}^n f(y_j;\beta_0,\beta_1,\sigma^2) = \prod_{j=1}^n \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left\{-\tfrac12 (y_j-\beta_0-\beta_1 x_j)^2/\sigma^2\right\}, $$
and take logs.
 For the estimators, we need the derivatives of the log likelihood, which are
$$ \frac{\partial\ell}{\partial\beta_0} = \sum_{j=1}^n (y_j-\beta_0-\beta_1 x_j)/\sigma^2, \qquad \frac{\partial\ell}{\partial\beta_1} = \sum_{j=1}^n x_j(y_j-\beta_0-\beta_1 x_j)/\sigma^2, $$
$$ \frac{\partial^2\ell}{\partial\beta_0^2} = -\frac{n}{\sigma^2}, \qquad \frac{\partial^2\ell}{\partial\beta_1^2} = -\sum_{j=1}^n x_j^2/\sigma^2, \qquad \frac{\partial^2\ell}{\partial\beta_0\,\partial\beta_1} = -\sum_{j=1}^n x_j/\sigma^2, $$
$$ \frac{\partial\ell}{\partial\sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}S(\beta_0,\beta_1), \qquad \frac{\partial^2\ell}{\partial(\sigma^2)^2} = \frac{n}{2\sigma^4} - \frac{1}{\sigma^6}S(\beta_0,\beta_1), $$
$$ \frac{\partial^2\ell}{\partial\beta_0\,\partial\sigma^2} = -\sum_{j=1}^n (y_j-\beta_0-\beta_1 x_j)/\sigma^4, \qquad \frac{\partial^2\ell}{\partial\beta_1\,\partial\sigma^2} = -\sum_{j=1}^n x_j(y_j-\beta_0-\beta_1 x_j)/\sigma^4, $$
where
$$ S(\beta_0,\beta_1) = \sum_{j=1}^n (y_j-\beta_0-\beta_1 x_j)^2 $$
is the sum of squares. The solutions to the equations $\partial\ell/\partial\beta_0 = \partial\ell/\partial\beta_1 = 0$ are the least squares estimates, since they minimise $S(\beta_0,\beta_1)$. They turn out to be
$$ \hat\beta_0 = \bar y - \hat\beta_1\bar x, \qquad \hat\beta_1 = \frac{\sum_{j=1}^n y_j(x_j-\bar x)}{\sum_{j=1}^n (x_j-\bar x)^2}. $$
Clearly the estimator of the slope, $\hat\beta_1$, is well-defined only if not all the $x_j$ are equal, and only if $n\geq 2$ is it then the case that $\sum_{j=1}^n (x_j-\bar x)^2 > 0$. The MLE of $\sigma^2$ is also readily found, and turns out to be as stated.
 The final result comes from using the results on slide 409, noting that the part of $J(\beta_0,\beta_1,\sigma^2)$ for $(\beta_0,\beta_1)$ can be written as $X^TX/\sigma^2$, and that the off-diagonal terms involving $\sigma^2$ are zero in $J(\hat\beta_0,\hat\beta_1,\hat\sigma^2)$.

Probability and Statistics for SIC note 1 of slide 427
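A minimal R sketch of these estimators on simulated data (the Venice series itself is not listed in the notes), checking them against the built-in least-squares fit and computing the standardised residuals used on slide 429:

    ## Simple linear regression MLEs on simulated (x, y) data
    set.seed(2)
    n <- 100
    x <- 1:n
    y <- 100 + 0.35 * x + rnorm(n, sd = 17)

    b1 <- sum(y * (x - mean(x))) / sum((x - mean(x))^2)   # slope
    b0 <- mean(y) - b1 * mean(x)                          # intercept
    s2 <- mean((y - b0 - b1 * x)^2)                       # MLE of sigma^2 (divisor n)

    c(b0, b1) - coef(lm(y ~ x))                           # agrees with least squares

    r <- (y - b0 - b1 * x) / sqrt(s2)                     # standardised residuals
    qqnorm(r); qqline(r)                                  # normality check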

Results
 It is difficult to interpret β0 if xj = j, as then β0 corresponds to the mean height for the year
j = 0, well before Venice was founded. Thus we take xj = j − 2000, and then β0 corresponds to
the average height in the year 2000.
 We find
  Parameter      Estimate   Standard Error
  β0 (cm)        131.5      2.6
  β1 (cm/year)   0.35       0.04
  σ² (cm²)       16.8²      35.5

 The fitted line seems reasonable, but there is a lot of variation around it (σ 2 is large).
[Figure: the Venice annual maximum flood series (cm, vertical axis 60 to 180) against year, 1900–2000, with the fitted regression line superimposed.]

Probability and Statistics for SIC slide 428

High tide in Venice


 If the model $Y\sim N(\beta_0+\beta_1 x, \sigma^2)$ is good, then
$$ (Y_j-\beta_0-\beta_1 x_j)/\sigma \sim N(0,1), $$
and thus the residual should satisfy
$$ r_j = (Y_j-\hat\beta_0-\hat\beta_1 x_j)/\hat\sigma \stackrel{\cdot}{\sim} N(0,1), \qquad j=1,\ldots,n. $$

 We can verify the assumption of normality by comparing the residuals r1 , . . . , rn to the normal
distribution:
[Figure: normal Q-Q plot of the residuals, sample quantiles against theoretical N(0, 1) quantiles.]

Probability and Statistics for SIC slide 429

SIC Students, 2012
Data on the heights and weights of 8 women and 80 men for this course in 2012: is the linear relation
the same for men and for women?

[Figure: scatter plot of Weight (kg), from 50 to 110, against Height (cm), from 160 to 190, for the 88 students.]

Probability and Statistics for SIC slide 430

Regression models
 Let Yj denote the weight (kg) of the jth person, xj their height (cm), zj an indicator, zj = 1 if
jth person is a man, zj = 0 otherwise.
 Let Yj ∼ N (µj , σ 2 ) be independent random variables with five nested models:
1. µj = β0 (same line, slope 0);
2. µj = β0 + β1 xj (same line for men and women);
3. µj = β0 zj + β2 (1 − zj ) + β1 xj (two parallel lines);
4. µj = β0 + β1 xj zj + β3 xj (1 − zj ) (same intercept, different slopes);
5. µj = β0 zj + β2 (1 − zj ) + β1 xj zj + β3 xj (1 − zj ) (two different lines).
 There are exact results to compare these models in the context of analysis of variance
(ANOVA), but we can use the general ideas of likelihood as an approximation:

  Model   ℓ̂         Number of parameters   2(ℓ̂gen − ℓ̂simple)   df
  1       −330.93    1
  2       −311.78    2                      38.3                  1
  3       −311.78    3                      0                     1
  4       −311.77    3                      0.2                   1
  5       −310.96    4                      1.64                  2

Probability and Statistics for SIC slide 431
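A sketch of how such a table can be produced in R, assuming a data frame `students` with columns `weight`, `height` and a factor `sex` (names invented here; the 2012 class data are not reproduced in the notes):

    ## Nested regression models for the (hypothetical) data frame 'students'
    m1 <- lm(weight ~ 1, data = students)            # model 1: constant mean
    m2 <- lm(weight ~ height, data = students)       # model 2: one common line
    m3 <- lm(weight ~ sex + height, data = students) # model 3: parallel lines
    m4 <- lm(weight ~ height:sex, data = students)   # model 4: common intercept, two slopes
    m5 <- lm(weight ~ sex * height, data = students) # model 5: two different lines

    sapply(list(m1, m2, m3, m4, m5), logLik)         # maximised log likelihoods
    2 * (logLik(m2) - logLik(m1))                    # likelihood ratio statistic, 1 df
    pchisq(2 * (logLik(m5) - logLik(m2)), df = 2, lower.tail = FALSE)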

Models 1 (grey), 2 (black) and 5 (red/blue)

[Figure: two scatter plots of Weight (kg), 50 to 110, against Height (cm), 160 to 190; left with the fitted lines of models 1 (grey) and 2 (black), right with the two lines of model 5 (red for women, blue for men).]

Left: the grey line can be improved by taking a non-zero slope (black). Right: the red line is badly
determined, but the blue one seems OK (and very similar to the black one).

Probability and Statistics for SIC slide 432

Results
 Estimates for model 2:
  Parameter      Estimate   Standard Error
  β0 (kg)        −71.8      21.4
  β1 (kg/cm)     0.82       0.12
  σ² (kg²)       8.46²      10.79

 Evidently this cannot be extrapolated to babies, who will have negative weights, but a CI for β0 ,
even at 99%, does not contain 0.
 The normality assumption seems roughly OK, according to a QQ-plot of the residuals:
[Figure: normal Q-Q plot of the residuals against theoretical N(0, 1) quantiles.]

Probability and Statistics for SIC slide 433

Comments
 Linearity seems reasonable in both cases. Is there a slight acceleration in the slope in Venice,
which we could try to model with a polynomial such that

E(Yj) = β0 + β1 xj + · · · + βq xj^q ?

 Normality seems less reasonable in Venice(?)—since the data are the annual maxima, one of our
models for maxima (end §6.2) might be more appropriate (e.g., Gumbel distribution).
 We could use these models for forecasting, but then a linear trend must be extrapolated—which
seems OK in Venice for the mid term (0–10 years?), but maybe not later.
 For these models with normal observations there are exact results, but often it is enough to use the
general theory for tests, CIs etc.
 There are many generalisations of regression models. For example, we could suppose that
Bernoulli variables Yj are independent with

$$ P(Y_j=1) = \frac{e^{\beta_0+\beta_1 x_j}}{1+e^{\beta_0+\beta_1 x_j}}, \qquad P(Y_j=0) = \frac{1}{1+e^{\beta_0+\beta_1 x_j}}, \qquad j=1,\ldots,n, $$
a logistic regression model.

Probability and Statistics for SIC slide 434
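A minimal sketch of fitting such a logistic regression in R by maximum likelihood, on simulated data (all values invented for illustration):

    ## Logistic regression sketch on simulated data
    set.seed(3)
    n <- 200
    x <- runif(n, 150, 200)
    y <- rbinom(n, 1, plogis(-20 + 0.12 * x))   # P(Y_j = 1) as in the formula above

    fit <- glm(y ~ x, family = binomial)        # maximum likelihood fit
    summary(fit)$coefficients                   # estimates, standard errors, Wald z values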

10 Bayesian Inference slide 435

10.1 Basic Ideas slide 436

Bayesian inference
Up to now we have supposed that all the information about θ comes from the data y. But if we have
knowledge about θ in the form of a prior density

π(θ),

we can use Bayes’ theorem to compute the posterior density for θ conditional on y, i.e.,

$$ \pi(\theta\mid y) = \frac{f(y\mid\theta)\,\pi(\theta)}{f(y)}. $$

The crucial difference with previous discussion is that here

the observed data y are fixed, and θ is regarded as a random variable.

In order to do this, we have to have prior information, π(θ), which may be based on
 data separate from y;
 an ‘objective’ notion of what it is ‘reasonable’ to believe about θ;
 a ‘subjective’ notion of what ‘I’ believe about θ.
We will reconsider π(θ) after discussion of Bayesian mechanics.

Probability and Statistics for SIC slide 437

Reminder: Bayes’ Theorem


Let B1 , . . . , Bk be a partition of the sample space Ω, and let A be any event in the sample space. Then
$$ P(B_i\mid A) = \frac{P(A\cap B_i)}{P(A)} = \frac{P(A\mid B_i)\,P(B_i)}{P(A)} = \frac{P(A\mid B_i)\,P(B_i)}{\sum_{j=1}^k P(A\mid B_j)\,P(B_j)}. $$

Interpretation: the knowledge that the event A has occurred changes the probabilities of the events
B1 , . . . , Bk :
P(B1 ), . . . , P(Bk ) −→ P(B1 | A), . . . , P(Bk | A).

Probability and Statistics for SIC slide 438

Application of Bayes’ theorem
We suppose that the parameter θ has density π(θ), and that the conditional density of Y given θ is
f (y | θ). The joint density is
f (y, θ) = f (y | θ)π(θ),
and by Bayes’ theorem the conditional density of θ given Y = y is

$$ \pi(\theta\mid y) = \frac{f(y\mid\theta)\,\pi(\theta)}{f(y)}, $$
where
$$ f(y) = \int f(y\mid\theta)\,\pi(\theta)\,d\theta $$
is the marginal density of the data $Y$.

Probability and Statistics for SIC slide 439

Bayesian updating
 We can use Bayes’ theorem to update the prior density for the random variable θ to a posterior
density for θ:
$$ \pi(\theta) \stackrel{y}{\longrightarrow} \pi(\theta\mid y), $$
or equivalently
$$ \text{prior uncertainty} \stackrel{\text{data}}{\longrightarrow} \text{posterior uncertainty}. $$

 We use π(θ), π(θ | y), rather than f (θ), f (θ | y), to stress that these distributions depend on
information external to the data.
 π(θ | y) contains our knowledge about θ, once we have seen data y, if our initial knowledge about
θ was contained in the density π(θ).

Probability and Statistics for SIC slide 440

Beta(a, b) density
Definition 306. The beta(a, b) density for θ ∈ (0, 1) has the form

$$ \pi(\theta) = \frac{\theta^{a-1}(1-\theta)^{b-1}}{B(a,b)}, \qquad 0<\theta<1, \quad a, b>0, $$
where $a$ and $b$ are parameters, $B(a,b) = \Gamma(a)\Gamma(b)/\Gamma(a+b)$ is the beta function, and
$$ \Gamma(a) = \int_0^\infty u^{a-1}e^{-u}\,du, \qquad a>0, $$
is the gamma function.

Note that $a=b=1$ gives the $U(0,1)$ density.

Example 307. Show that if $\theta\sim\mathrm{Beta}(a,b)$, then
$$ E(\theta) = \frac{a}{a+b}, \qquad \mathrm{var}(\theta) = \frac{ab}{(a+b+1)(a+b)^2}. $$

Example 308. Calculate the posterior density of θ for a sequence of Bernoulli trials, if the prior
density is Beta(a, b).

Probability and Statistics for SIC slide 441

Note to Example 307


Since $\pi$ is a density function, we have
$$ \int_0^1 \pi(\theta)\,d\theta = 1, $$
and therefore
$$ \int_0^1 \theta^{a-1}(1-\theta)^{b-1}\,d\theta = B(a,b) = \frac{\Gamma(a)\Gamma(b)}{\Gamma(a+b)}, \qquad a, b>0. $$
This implies that
$$ E(\theta^r) = \int_0^1 \theta^r\pi(\theta)\,d\theta = \frac{1}{B(a,b)}\int_0^1 \theta^{r+a-1}(1-\theta)^{b-1}\,d\theta = \frac{B(a+r,b)}{B(a,b)} = \frac{\Gamma(a+r)\Gamma(a+b)}{\Gamma(a)\Gamma(a+b+r)}, $$
and since $\Gamma(a+1)/\Gamma(a) = a$ for $a>0$, we have
$$ E(\theta) = \frac{a}{a+b}, \qquad E(\theta^2) = \frac{a(a+1)}{(a+b)(a+b+1)}, \qquad \mathrm{var}(\theta) = \frac{ab}{(a+b+1)(a+b)^2}. $$

Probability and Statistics for SIC note 1 of slide 441

Note to Example 308
Suppose that conditional on $\theta$, the data $y_1,\ldots,y_n$ are a random sample from the Bernoulli distribution, for which $P(Y_j=1)=\theta$ and $P(Y_j=0)=1-\theta$, where $0<\theta<1$. The likelihood is
$$ L(\theta) = f(y\mid\theta) = \prod_{j=1}^n \theta^{y_j}(1-\theta)^{1-y_j} = \theta^s(1-\theta)^{n-s}, \qquad 0<\theta<1, $$
where $s = \sum y_j$.
The posterior density of $\theta$ conditional on the data and using the beta prior density is given by Bayes' theorem, and is
$$ \pi(\theta\mid y) = \frac{\theta^{s+a-1}(1-\theta)^{n-s+b-1}/B(a,b)}{\int_0^1 \theta^{s+a-1}(1-\theta)^{n-s+b-1}\,d\theta / B(a,b)} \propto \theta^{s+a-1}(1-\theta)^{n-s+b-1}, \qquad 0<\theta<1. \qquad (10) $$
As this has unit integral for all positive $a$ and $b$, the constant normalizing (10) must be $B(a+s,\, b+n-s)$. Therefore
$$ \pi(\theta\mid y) = \frac{1}{B(a+s,\, b+n-s)}\,\theta^{s+a-1}(1-\theta)^{n-s+b-1}, \qquad 0<\theta<1. $$
Thus the posterior density of $\theta$ has the same form as the prior: acquiring data has the effect of updating $(a,b)$ to $(a+s,\, b+n-s)$. As the mean of the $\mathrm{Beta}(a,b)$ density is $a/(a+b)$, the posterior mean is $(s+a)/(n+a+b)$, and this is roughly $s/n$ in large samples. Hence the prior density inserts information equivalent to having seen a sample of $a+b$ observations, of which $a$ were successes. If we were very sure that $\theta \doteq 1/2$, for example, we might take $a=b$ very large, giving a prior density tightly concentrated around $\theta=1/2$, whereas taking smaller values of $a$ and $b$ would increase the prior uncertainty.

Probability and Statistics for SIC note 2 of slide 441
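A minimal R sketch of this beta–binomial updating, taking a uniform prior (a = b = 1) and the n = 10, s = 9 data used on the following slides:

    ## Beta prior -> beta posterior for Bernoulli data
    a <- 1; b <- 1            # uniform prior
    n <- 10; s <- 9           # data

    curve(dbeta(x, a, b), 0, 1, ylab = "density")              # prior
    curve(dbeta(x, a + s, b + n - s), add = TRUE, lty = 2)     # posterior Beta(a+s, b+n-s)

    (a + s) / (a + b + n)                                      # posterior mean
    (a + s - 1) / (n + a + b - 2)                              # MAP estimate (Example 309)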

Prior densities
[Figure: six beta prior densities of θ on (0, 1), for (a, b) = (0.5, 0.5), (1, 1), (5, 5), (5, 10), (10, 5) and (10, 10); vertical axes 'Density of theta' from 0 to 12.]

Probability and Statistics for SIC slide 442

Posterior densities with n = 10, s = 9
[Figure: the corresponding six posterior densities of θ with n = 10, s = 9, i.e. Beta(a + s, b + n − s) with (a + s, b + n − s) = (9.5, 1.5), (10, 2), (14, 6), (14, 11), (19, 6) and (19, 11).]

Probability and Statistics for SIC slide 443

Posterior densities with n = 30, s = 24


[Figure: posterior densities of θ with n = 30, s = 24, i.e. Beta(a + s, b + n − s) with (a + s, b + n − s) = (24.5, 6.5), (25, 7), (29, 11), (29, 16), (34, 11) and (34, 16).]

Probability and Statistics for SIC slide 444

Posterior densities with n = 100, s = 69
[Figure: posterior densities of θ with n = 100, s = 69, i.e. Beta(a + s, b + n − s) with (a + s, b + n − s) = (69.5, 31.5), (70, 32), (74, 36), (74, 41), (79, 36) and (79, 41).]

Probability and Statistics for SIC slide 445

Properties of π(θ | y)
 The density contains all the information about θ, but it is useful to extract summaries, such as the
posterior expectation or the posterior variance,

E(θ | y), var(θ | y),


or the maximum a posteriori (MAP) estimator, i.e., θ̃ such that

π(θ̃ | y) ≥ π(θ | y), ∀θ.

 We could use the posterior expectation or the MAP estimator as point estimates for θ based on
the data y, but a more systematic approach is based on loss functions.

Example 309. Calculate the posterior expectation and variance of θ, and its MAP estimate, for
Example 308.

Probability and Statistics for SIC slide 446

Note to Example 309
We saw earlier that the beta density with parameters $a, b$ has
$$ E(\theta) = \frac{a}{a+b}, \qquad E(\theta^2) = \frac{a(a+1)}{(a+b)(a+b+1)}, \qquad \mathrm{var}(\theta) = \frac{ab}{(a+b+1)(a+b)^2}, $$
so since the posterior density is beta with parameters $a+s$, $b+n-s$, the posterior mean and variance are
$$ E(\theta\mid y) = \frac{a+s}{a+b+n}, \qquad \mathrm{var}(\theta\mid y) = \frac{(a+s)(b+n-s)}{(a+b+n+1)(a+b+n)^2}. $$
The MAP estimate is obtained by maximising the density
$$ \pi(\theta\mid y) = \frac{1}{B(a+s,\, b+n-s)}\,\theta^{s+a-1}(1-\theta)^{n-s+b-1}, \qquad 0<\theta<1, $$
as a function of $\theta$. Taking logs and differentiating gives that $\pi(\theta\mid y)$ is maximised by
$$ \tilde\theta = \frac{a+s-1}{n+a+b-2}. $$
Probability and Statistics for SIC note 1 of slide 446

Point estimation and loss functions


To construct an estimator based on the data y, we consider that the choice of estimate corresponds to
a decision, and seek to minimise the expected loss from a bad decision.

Definition 310. If Y ∼ f (y; θ), then the loss function R(y; θ) is a non-negative function of Y and θ.
The expected posterior loss is
$$ E\{R(y;\theta)\mid y\} = \int R(y;\theta)\,\pi(\theta\mid y)\,d\theta. $$

Example 311. If I seek to estimate $\theta$ by minimising $E\{R(y;\theta)\mid y\}$ with respect to $\tilde\theta(y)$, show that with
$$ R_1(y;\theta) = |\tilde\theta-\theta|, \qquad R_2(y;\theta) = (\tilde\theta-\theta)^2, $$
I obtain that $\tilde\theta_1(y)$ is the median of $\pi(\theta\mid y)$ and that $\tilde\theta_2 = E(\theta\mid y) = \int \theta\,\pi(\theta\mid y)\,d\theta$ is the posterior expectation of $\theta$.

Loss functions are also useful when we want to base a decision on the data: we construct R(y; θ) to
represent the loss when we observe y and base the decision on it, but the state of reality is θ, and then
choose the decision that minimises the expected loss.

Probability and Statistics for SIC slide 447

Note to Example 311
 For $R_1(y;\theta) = |\tilde\theta-\theta|$, we have
$$ E\{R_1(y;\theta)\mid y\} = E\{(\tilde\theta-\theta)I(\tilde\theta>\theta)\mid y\} + E\{(\theta-\tilde\theta)I(\tilde\theta<\theta)\mid y\} = \int_{-\infty}^{\tilde\theta} (\tilde\theta-\theta)\,\pi(\theta\mid y)\,d\theta + \int_{\tilde\theta}^{\infty} (\theta-\tilde\theta)\,\pi(\theta\mid y)\,d\theta, $$
and differentiation of this with respect to $\tilde\theta$ gives
$$ \int_{-\infty}^{\tilde\theta} \pi(\theta\mid y)\,d\theta - \int_{\tilde\theta}^{\infty} \pi(\theta\mid y)\,d\theta. $$
This equals zero when the two probabilities are the same, and then we must have
$$ \int_{-\infty}^{\tilde\theta} \pi(\theta\mid y)\,d\theta = \int_{\tilde\theta}^{\infty} \pi(\theta\mid y)\,d\theta = \tfrac12, $$
so $\tilde\theta_1 \equiv \tilde\theta_1(y)$ is the median of the posterior density $\pi(\theta\mid y)$.
 For $R_2(y;\theta) = (\tilde\theta-\theta)^2$, we have on setting $m(y) = E(\theta\mid y)$ and using a little algebra that
$$ E\{R_2(y;\theta)\mid y\} = E[\{\tilde\theta - m(y) + m(y) - \theta\}^2\mid y] = E[\{\tilde\theta-m(y)\}^2\mid y] + 2E[\{\tilde\theta-m(y)\}\{m(y)-\theta\}\mid y] + E[\{m(y)-\theta\}^2\mid y] = \{\tilde\theta-m(y)\}^2 + \mathrm{var}(\theta\mid y): $$
the first term is constant with respect to the posterior distribution of $\theta$ because $\tilde\theta$ and $m(y)$ do not depend on $\theta$, but only on the variable $y$, which is fixed by the conditioning; the second term is
$$ E[\{\tilde\theta-m(y)\}\{m(y)-\theta\}\mid y] = \{\tilde\theta-m(y)\}\,E\{m(y)-\theta\mid y\} = 0; $$
and the third term is just the conditional variance of $\theta$, given $y$.
Therefore we minimise $E\{R_2(y;\theta)\mid y\}$ by choosing $\tilde\theta \equiv \tilde\theta_2(y) = m(y)$ for all $y$, since this ensures that $\{\tilde\theta-m(y)\}^2$ is identically zero, and $\mathrm{var}(\theta\mid y)$ does not depend on $\tilde\theta$.

Probability and Statistics for SIC note 1 of slide 447

Interval estimation and credibility intervals


 The Bayesian analogue of the (1 − α) × 100% CI for θ is the (1 − α) credibility interval for θ
obtained using the α/2 and (1 − α/2) quantiles of π(θ | y).
 In the Bayesian paradigm, the limits of the interval are functions of y, and therefore are fixed,
while θ is random. This is the opposite of the situation with usual confidence intervals (where the
limits are random, and θ is fixed).
 By taking α = 0.05, a = b = 0.5, we obtain

          n = 10   n = 30   n = 100   θ̂ ± 1.96 J(θ̂)^{−1/2}
  Lower   0.619    0.633    0.595     0.599
  Upper   0.989    0.912    0.774     0.781

Here θ̂ is the MLE of θ, and J(θ̂) is the observed information.
 a, b have some influence when n is small, but little influence for large samples, since the data then
contain a lot of information about θ.

Probability and Statistics for SIC slide 448
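A minimal R sketch reproducing the n = 100 column of this table (s = 69 successes, as on the earlier slide): the credibility interval uses the quantiles of the beta posterior, and the Wald interval uses J(θ̂) = n/{θ̂(1 − θ̂)} for Bernoulli trials:

    ## 95% credibility interval versus Wald interval, for n = 100, s = 69, a = b = 0.5
    a <- 0.5; b <- 0.5; n <- 100; s <- 69; alpha <- 0.05

    qbeta(c(alpha / 2, 1 - alpha / 2), a + s, b + n - s)   # about (0.595, 0.774)

    theta.hat <- s / n
    se <- sqrt(theta.hat * (1 - theta.hat) / n)            # J(theta.hat)^(-1/2)
    theta.hat + c(-1, 1) * qnorm(1 - alpha / 2) * se       # about (0.599, 0.781)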

Conjugate densities
 Particular combinations of data and prior densities give posterior densities of the same form as the
prior densities. Example :
$$ \theta\sim\mathrm{Beta}(a,b) \;\stackrel{s,\,n}{\longrightarrow}\; \theta\mid y \sim \mathrm{Beta}(a+s,\, b+n-s), $$

where the data s ∼ B(n, θ) correspond to s successes out of n independent trials with success
probability θ.
 The beta density is the conjugate prior density of the binomial distribution: if the likelihood is
proportional to θ s (1 − θ)n−s , then choosing a beta prior for θ ensures that the posterior density of
θ is also beta, with updated parameters (a + s, b + n − s).
 Conjugate prior densities are very useful, as we can often avoid having to integrate:
If we recognise π(θ | y), no need to integrate!
Example 312. Let $Y_1,\ldots,Y_n\mid\mu \stackrel{\mathrm{iid}}{\sim} N(\mu,\sigma^2)$ and $\mu\sim N(\mu_0,\tau^2)$, where $\mu_0$, $\sigma^2$ and $\tau^2$ are known. Obtain the posterior distribution of $\mu\mid Y_1,\ldots,Y_n$, without integration.

Probability and Statistics for SIC slide 449

Note to Example 312


Note first that the normal density with mean $B$ and variance $A$,
$$ (2\pi A)^{-1/2}\exp\{-(x-B)^2/(2A)\}, $$
has exponent
$$ x^2\left(-\frac{1}{2A}\right) + x\left(\frac{B}{A}\right) - \tfrac12 B^2/A, $$
so that we can read off $A$ and $B$ from the coefficients of $x^2$ and $x$.

Turning to this particular calculation, we seek a density for $\mu$, so any terms not involving $\mu$ can be treated as constants. Now
$$ \pi(\mu\mid y) \propto f(y\mid\mu)\times\pi(\mu) = \frac{1}{(2\pi\sigma^2)^{n/2}}\exp\left\{-\frac{1}{2\sigma^2}\sum_{j=1}^n (y_j-\mu)^2\right\} \times \frac{1}{\sqrt{2\pi\tau^2}}\exp\left\{-\frac{1}{2\tau^2}(\mu-\mu_0)^2\right\}, $$
whose exponent factorises as
$$ \mu^2\left\{-\tfrac12\left(\frac{n}{\sigma^2}+\frac{1}{\tau^2}\right)\right\} + \mu\left(\frac{\sum y_j}{\sigma^2}+\frac{\mu_0}{\tau^2}\right) + \text{const}, $$
and on comparing with the expression above, we have
$$ A = \left(n/\sigma^2+1/\tau^2\right)^{-1}, \qquad B = \left(\textstyle\sum y_j/\sigma^2 + \mu_0/\tau^2\right)\big/\left(n/\sigma^2+1/\tau^2\right). $$
Thus
$$ \mu\mid y \sim N\left(\frac{\sum y_j/\sigma^2 + \mu_0/\tau^2}{n/\sigma^2+1/\tau^2},\ \frac{1}{n/\sigma^2+1/\tau^2}\right). $$
Note that as $n\to\infty$ or $\tau^2\to\infty$, this becomes $N(\bar y, \sigma^2/n)$.

Probability and Statistics for SIC note 1 of slide 449
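A minimal R sketch of this conjugate update, with all numerical values invented for illustration:

    ## Normal-normal conjugate posterior for mu
    set.seed(4)
    sigma <- 2; tau <- 3; mu0 <- 0
    n <- 25
    y <- rnorm(n, mean = 1.5, sd = sigma)

    prec <- n / sigma^2 + 1 / tau^2                       # posterior precision
    c(mean = (sum(y) / sigma^2 + mu0 / tau^2) / prec,     # posterior mean
      sd   = sqrt(1 / prec))                              # posterior standard deviation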

Prediction of a future random variable Z


 Will the next result be tails (Z = 0) or heads (Z = 1)?
 Use Bayes’ theorem to calculate the posterior density of Z given Y = y:
$$ P(Z=z\mid Y=y) = \frac{P(Z=z, Y=y)}{P(Y=y)} = \frac{\int f(z,y\mid\theta)\,\pi(\theta)\,d\theta}{\int f(y\mid\theta)\,\pi(\theta)\,d\theta}. $$

Example 313. Calculate the posterior distribution for another Bernoulli trial, independent of the
previous ones.

Reminder: B(a, b) = Γ(a)Γ(b)/Γ(a + b), and Γ(a + 1) = aΓ(a), a, b > 0.

Probability and Statistics for SIC slide 450

Note to Example 313


Let $\theta$ be the unknown probability of a head and let $Z=1$ indicate the event that the next toss yields a head. Conditional on $\theta$, $P(Z=1\mid y,\theta)=\theta$, independent of the data $y$ so far. If the prior density for $\theta$ is beta with parameters $a$ and $b$, then
$$ P(Z=1\mid y) = \int_0^1 P(Z=1\mid\theta, y)\,\pi(\theta\mid y)\,d\theta = \int_0^1 \theta\,\frac{\theta^{a+s-1}(1-\theta)^{b+n-s-1}}{B(a+s,\, b+n-s)}\,d\theta = \frac{B(a+s+1,\, b+n-s)}{B(a+s,\, b+n-s)} = \frac{a+s}{a+b+n}, $$
on using results for beta functions. As $n, s\to\infty$, this tends to the sample proportion of heads $s/n$, so the prior information is drowned by the sample.

Probability and Statistics for SIC note 1 of slide 450

Bayesian approach
 Treat each unknown (parameter θ, predictand Z, . . .) as a random variable, give it a distribution,
and use Bayes’ theorem to calculate its posterior distribution given any data.
 We must build a more elaborate model, with prior information, but we can then treat all the
unknowns (parameters, data, missing values, etc.) on the same basis, and thus we just need to
apply probability calculus, conditioning on whatever we have observed.
 Philosophical questions:
– Are we justified in using prior knowledge in this way?
– Where does this knowledge come from?
 We often choose prior distributions for practical reasons (e.g., conjugate distributions) rather than
philosophical reasons.
 Practical question:
– How to calculate all the integrals we need?
 We often use Monte Carlo methods, and construct Markov chains whose limit distributions are
π(θ | y). This is a story for another day . . .

Probability and Statistics for SIC slide 451

10.2 Bayesian Modelling slide 452

NMR Data
[Figure: left, the NMR data y (n = 1024 values plotted against index 0–1000); right, its wavelet decomposition coefficients by resolution level, plotted against translate (Daubechies compactly supported wavelet, extremal phase, N = 2).]

Left: original data, with n = 1024


Right: orthogonal transform with n = 1024 coefficients at different resolutions

Probability and Statistics for SIC slide 453

Parsimonious representations
In many modern applications we want to extract a signal from a noisy environment:
 finding the combination of genes leading to an illness;
 cleaning a biomedical image;
 denoising a download;
 detection of spams.
We often search for a parsimonious representation of the signal, with many null elements.
Probability and Statistics for SIC slide 454

Orthogonal transformation
 Original data X with noisy signal µn×1 : X ∼ Nn (µ, σ 2 In ).
 Suppose Y_{n×1} = W^T_{n×n} X_{n×1}, where W^T W = W W^T = I_n is orthogonal.
 Choose W such that θ = W T µ should have many small elements, and a few big ones, giving a
sparse representation of µ in the basis corresponding to W ;
 ‘kill’ small coefficients of Y , which correspond to the noise, giving

θ̃n×1 = kill(Y ) = kill(W T X);

 then estimate the signal µ by


µ̃ = W θ̃ = W × kill(W T X).
 We want the kill(·) operator to have good Bayesian properties, corresponding to a suitable risk function.
Probability and Statistics for SIC slide 455

Wavelet decomposition
 A good choice of W is based on wavelets, which have the required sparseness properties.
 Here are the Haar wavelet coefficients, with n = 8:

    1  1  1  0  1  0  0  0
    1  1  1  0 −1  0  0  0
    1  1 −1  0  0  1  0  0
    1  1 −1  0  0 −1  0  0
    1 −1  0  1  0  0  1  0
    1 −1  0  1  0  0 −1  0
    1 −1  0 −1  0  0  0  1
    1 −1  0 −1  0  0  0 −1

 We set up W so that each column of this orthogonal matrix has unit norm, i.e., we post-multiply this matrix by
    {diag(√8, √8, 2, 2, √2, √2, √2, √2)}^{−1},
to ensure that W W^T = W^T W = I_8.
Probability and Statistics for SIC slide 456

Prior and posterior distributions


 Suppose that $Y\mid\theta \sim N(\theta,\sigma^2)$, and we have the prior mixture
$$ \theta = \begin{cases} 0, & \text{with probability } 1-p, \\ N(0,\tau^2), & \text{with probability } p, \end{cases} $$
thus the prior ‘density’ for $\theta$ is
$$ \pi(\theta) = (1-p)\,\delta(\theta) + p\,\tau^{-1}\phi(\theta/\tau), \qquad \theta\in\mathbb{R}, $$
where $\delta(\theta)$ is the delta function, which places unit mass at $\theta=0$.
 If $p$, $\sigma$ and $\tau$ are known, the posterior ‘density’ has the form
$$ \pi(\theta\mid y) = (1-p_y)\,\delta(\theta) + p_y\, b^{-1}\phi\!\left(\frac{\theta-ay}{b}\right), \qquad \theta\in\mathbb{R}, $$
where
$$ a = \tau^2/(\tau^2+\sigma^2), \qquad b^2 = 1/(1/\sigma^2 + 1/\tau^2), $$
and
$$ p_y = \frac{p\,(\sigma^2+\tau^2)^{-1/2}\phi\{y/(\sigma^2+\tau^2)^{1/2}\}}{(1-p)\,\sigma^{-1}\phi(y/\sigma) + p\,(\sigma^2+\tau^2)^{-1/2}\phi\{y/(\sigma^2+\tau^2)^{1/2}\}} $$
is the posterior probability that $\theta\neq 0$.

Probability and Statistics for SIC slide 457
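A minimal R sketch of these formulas, computing p_y and the posterior median used on the next slide; with p = 0.5, σ = τ = 1 it returns about −0.98 for y = −2.5 and exactly 0 for y = −1:

    ## Posterior median under the point-mass-plus-normal prior
    post.median <- function(y, p = 0.5, sigma = 1, tau = 1) {
      a  <- tau^2 / (tau^2 + sigma^2)
      b  <- sqrt(1 / (1 / sigma^2 + 1 / tau^2))
      py <- p * dnorm(y, sd = sqrt(sigma^2 + tau^2)) /
            ((1 - p) * dnorm(y, sd = sigma) + p * dnorm(y, sd = sqrt(sigma^2 + tau^2)))
      cdf <- function(th) py * pnorm((th - a * y) / b) + (1 - py) * (th >= 0)
      if (py * pnorm(-a * y / b) <= 0.5 && cdf(0) >= 0.5) return(0)  # median on the atom at 0
      uniroot(function(th) cdf(th) - 0.5, c(-10 - abs(y), 10 + abs(y)))$root
    }
    post.median(-2.5)   # about -0.98
    post.median(-1)     # 0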

Bayesian shrinkage
 To estimate θ, we use loss function |θ̃ − θ|, so θ̃ is the posterior median of θ.
 Here are the cumulative distribution functions of θ, prior (left) and posterior when p = 0.5,
σ = τ = 1, and y = −2.5 (centre), and y = −1 (right).
 If y is close to zero, then θ̃ = 0, but if not, then 0 < |θ̃| < |y|, so θ̃ shrinks y towards zero.
 Lines: probability=0.5 (red); value of y (blue); posterior median θ̃ (green).
[Figure: cumulative distribution functions of θ on (−4, 4) for the prior (left) and the two posteriors (centre: y = −2.5, posterior median −0.98; right: y = −1, posterior median 0).]

Probability and Statistics for SIC slide 458

Adaptive parameter estimation


To estimate the unknown parameters p, σ, τ we use maximum likelihood:
 the marginal density of y is

f (y) = (1 − p)σ −1 φ(y/σ) + p(σ 2 + τ 2 )−1/2 φ{y/(σ 2 + τ 2 )1/2 }, y ∈ R,


so if $y_1,\ldots,y_n \stackrel{\mathrm{iid}}{\sim} f$, we can estimate $p, \sigma, \tau$ by maximising the log likelihood
$$ \ell(p,\sigma,\tau) = \sum_{j=1}^n \log f(y_j;\, p, \sigma, \tau). $$
 Here we find $\hat p = 0.08$, $\hat\sigma = 0.54$, $\hat\tau = 0.028$.
 Now we can calculate θ̃j for each of the yj , and obtain the denoised signal.

Probability and Statistics for SIC slide 459
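A minimal R sketch of this marginal maximum likelihood step, on simulated coefficients (the true parameter values are invented for illustration):

    ## Marginal density of y and its log likelihood, maximised numerically
    fmarg <- function(y, p, sigma, tau)
      (1 - p) * dnorm(y, sd = sigma) + p * dnorm(y, sd = sqrt(sigma^2 + tau^2))

    nll <- function(par, y) {      # p, sigma, tau on unconstrained scales
      p <- plogis(par[1]); sigma <- exp(par[2]); tau <- exp(par[3])
      -sum(log(fmarg(y, p, sigma, tau)))
    }

    set.seed(5)                    # simulated coefficients: mostly noise, a few signals
    y <- c(rnorm(900, sd = 0.5), rnorm(100, sd = sqrt(0.5^2 + 3^2)))
    fit <- optim(c(0, 0, 0), nll, y = y)
    c(p = plogis(fit$par[1]), sigma = exp(fit$par[2]), tau = exp(fit$par[3]))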

NMR data, after transformation


[Figure: the original wavelet coefficients (left) and the shrunken coefficients after taking posterior medians (right), plotted by resolution level against translate (Daubechies compactly supported wavelet, extremal phase, N = 2).]

Probability and Statistics for SIC slide 460

NMR data, after cleaning


[Figure: left, the original NMR data; right, the denoised signal reconstructed from the Bayesian posterior medians; both plotted against index 0–1000, vertical scale −20 to 60.]

Probability and Statistics for SIC slide 461

Spam filter
 We wish to construct a spam filter, using the presence or absence of certain email characteristics
C1 , . . . , Cm .
 The data Y are of the form
          S     C1    C2    ···   Cm
    1     0     1     1     ···   1
    2     1     0     1     ···   0
    ...   ...   ...   ...         ...
    n     0     0     0     ···   0

where S = 1 for a spam, and Ci = 1 if characteristic i (e.g., the word ‘Nigeria’, Russian language,
hotmail address) is present.
 Simple model:
P(S = 1) = p, P(S = 0) = 1 − p,
P(Ci = 1 | S = 1) = αi , P(Ci = 0 | S = 1) = 1 − αi ,
P(Ci = 1 | S = 0) = βi , P(Ci = 0 | S = 0) = 1 − βi ,
and the C1 , . . . , Cm are independent, given the value of S.

Probability and Statistics for SIC slide 462

Spam filter
 For a new email with $C_1^+,\ldots,C_m^+$ but without $S^+$, we calculate
$$ P(S^+=1\mid C_1^+,\ldots,C_m^+, Y), $$
and quarantine the email if this probability exceeds some threshold $d\in(0,1)$.
 If we write $\theta = (p, \alpha_1,\ldots,\alpha_m, \beta_1,\ldots,\beta_m)$, and if we suppose that a priori
$$ p, \alpha_1,\ldots,\alpha_m, \beta_1,\ldots,\beta_m \stackrel{\mathrm{iid}}{\sim} U(0,1), $$
then $\pi(\theta\mid y) = f(y\mid\theta)\times\pi(\theta)/f(y)$, so
$$ \pi(\theta\mid y) \propto \prod_{j=1}^n \left\{ p\prod_{i=1}^m \alpha_i^{c_{ji}}(1-\alpha_i)^{1-c_{ji}} \right\}^{s_j} \left\{ (1-p)\prod_{i=1}^m \beta_i^{c_{ji}}(1-\beta_i)^{1-c_{ji}} \right\}^{1-s_j} \times 1 $$
$$ = p^s(1-p)^{n-s} \times \prod_{i=1}^m \alpha_i^{\sum_j s_j c_{ji}} (1-\alpha_i)^{\sum_j s_j(1-c_{ji})} \beta_i^{\sum_j (1-s_j)c_{ji}} (1-\beta_i)^{\sum_j (1-s_j)(1-c_{ji})} $$
$$ \propto \frac{p^s(1-p)^{n-s}}{B(1+s,\, 1+n-s)} \times \prod_{i=1}^m \frac{\alpha_i^{t_{i1}}(1-\alpha_i)^{t_{i2}}\,\beta_i^{t_{i3}}(1-\beta_i)^{t_{i4}}}{B(1+t_{i1}, 1+t_{i2})\,B(1+t_{i3}, 1+t_{i4})}, $$
with $s = \sum_j s_j$, $t_{i1} = \sum_j s_j c_{ji}$, $t_{i2} = \sum_j s_j(1-c_{ji})$, $t_{i3} = \sum_j (1-s_j)c_{ji}$, $t_{i4} = \sum_j (1-s_j)(1-c_{ji})$.

Probability and Statistics for SIC slide 463

Spam filter
 With the new characteristics $C^+ = (C_1^+,\ldots,C_m^+)$, we want to calculate
$$ P(S^+=1\mid C^+, Y) = \frac{P(S^+=1, C^+\mid Y)}{P(C^+\mid Y)} = \frac{P(S^+=1, C^+\mid Y)}{P(S^+=0, C^+\mid Y) + P(S^+=1, C^+\mid Y)}, $$
where
$$ P(S^+=s^+, C^+\mid Y) = \int P(S^+=s^+, C^+=c^+\mid\theta, y)\,\pi(\theta\mid y)\,d\theta, $$
and $P(S^+=s^+, C^+=c^+\mid\theta, y) = P(S^+=s^+, C^+=c^+\mid\theta)$, with
$$ P(S^+=s^+, C^+=c^+\mid\theta) = \left\{ p\prod_{i=1}^m \alpha_i^{c_i^+}(1-\alpha_i)^{1-c_i^+} \right\}^{s^+} \left\{ (1-p)\prod_{i=1}^m \beta_i^{c_i^+}(1-\beta_i)^{1-c_i^+} \right\}^{1-s^+}. $$
Thus $P(S^+=s^+, C^+=c^+\mid\theta)\,\pi(\theta\mid y)$ equals
$$ \frac{p^{s+s^+}(1-p)^{n-s+1-s^+}}{B(1+s,\, 1+n-s)} \prod_{i=1}^m \frac{\alpha_i^{t_{i1}^+}(1-\alpha_i)^{t_{i2}^+}\,\beta_i^{t_{i3}^+}(1-\beta_i)^{t_{i4}^+}}{B(1+t_{i1}, 1+t_{i2})\,B(1+t_{i3}, 1+t_{i4})}, $$
where $s = \sum_j s_j$, $t_{i1}^+ = \sum_j s_j c_{ji} + s^+c_i^+$, $t_{i2}^+ = \sum_j s_j(1-c_{ji}) + s^+(1-c_i^+)$, $t_{i3}^+ = \sum_j (1-s_j)c_{ji} + (1-s^+)c_i^+$, $t_{i4}^+ = \sum_j (1-s_j)(1-c_{ji}) + (1-s^+)(1-c_i^+)$.

Probability and Statistics for SIC slide 464

Spam filter
 Thus $P(S^+=s^+, C^+\mid Y)$ equals
$$ \frac{B(1+s+s^+,\, 2+n-s-s^+)}{B(1+s,\, 1+n-s)} \prod_{i=1}^m \frac{B(1+t_{i1}^+,\, 1+t_{i2}^+)\,B(1+t_{i3}^+,\, 1+t_{i4}^+)}{B(1+t_{i1},\, 1+t_{i2})\,B(1+t_{i3},\, 1+t_{i4})}, $$
from which we obtain
$$ P(S^+=1\mid C^+, Y) = \frac{P(S^+=1, C^+\mid Y)}{P(S^+=0, C^+\mid Y) + P(S^+=1, C^+\mid Y)}, $$
or, equivalently, the
$$ \text{log odds} = \log\left\{P(S^+=1, C^+\mid Y)\big/P(S^+=0, C^+\mid Y)\right\}. $$
 Thus we need to save the $2+4\times m$ quantities
$$ s=\sum_j s_j, \quad n-s, \qquad t_{i1}=\sum_j s_j c_{ji}, \quad t_{i2}=\sum_j s_j(1-c_{ji}), \quad t_{i3}=\sum_j (1-s_j)c_{ji}, \quad t_{i4}=\sum_j (1-s_j)(1-c_{ji}), \qquad i=1,\ldots,m, $$
and update them when we get new values for $s_j, c_1,\ldots,c_m$.

Probability and Statistics for SIC slide 465
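A minimal R sketch of this classifier, computing the log odds from the stored counts via log beta functions; the training data are simulated here, since the notes do not list them:

    ## 'Idiot's Bayes' spam filter: posterior predictive log odds from stored counts
    set.seed(6)
    m <- 20; n <- 100
    alpha <- runif(m, 0.5, 0.9); beta <- runif(m, 0.1, 0.5)
    S <- rbinom(n, 1, 0.8)
    C <- t(sapply(S, function(s) rbinom(m, 1, if (s == 1) alpha else beta)))

    s  <- sum(S)                                  # stored quantities: s, n - s and the t_ik
    t1 <- colSums(C * S);       t2 <- colSums((1 - C) * S)
    t3 <- colSums(C * (1 - S)); t4 <- colSums((1 - C) * (1 - S))

    log.joint <- function(cp, sp) {               # log P(S+ = sp, C+ = cp | Y)
      lbeta(1 + s + sp, 2 + n - s - sp) - lbeta(1 + s, 1 + n - s) +
        sum(lbeta(1 + t1 + sp * cp,       1 + t2 + sp * (1 - cp)) +
            lbeta(1 + t3 + (1 - sp) * cp, 1 + t4 + (1 - sp) * (1 - cp)) -
            lbeta(1 + t1, 1 + t2) - lbeta(1 + t3, 1 + t4))
    }

    cplus <- rbinom(m, 1, alpha)                  # a new email that is in fact spam
    log.joint(cplus, 1) - log.joint(cplus, 0) > 0 # classify as spam if the odds exceed 1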

Comments
 The key assumption that the C1 , . . . , Cm are independent, given S, is probably false, but maybe
not too damaging — often idiot’s Bayes works quite well.
 Simulations with p = 0.8, n = 100 emails whose S and C are known, and 1000 new emails for
which only C + is known.
 Here an email is classified as spam if

P(S + = 1 | C + , Y ) > P(S + = 0 | C + , Y ).

 From 180 good emails, 141 are misclassified with m = 2, whereas only 1 is misclassified with
m = 20.

                         m = 2                        m = 20
  True \ Classified   Spam   Good   Total       Spam   Good   Total
  Spam                 761     59     820        810     10     820
  Good                 141     39     180          1    179     180

Probability and Statistics for SIC slide 466

Comments
 Bayesian ideas provide an approach that integrates the treatment of uncertainty and modelling,
with which we can tackle very complex problems.
 The main philosophical difficulty is the status of the prior information.
 The main practical difficulty is the need to calculate many complex multidimensional integrals.

Probability and Statistics for SIC slide 467

Thank you and au revoir!


Probability and Statistics for SIC slide 468

