Lecture Notes - Probability Theory: Manuel Cabral Morais
Department of Mathematics
Instituto Superior Técnico
0. Warm up 1
0.1 Historical note . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
0.2 (Symmetric) random walk . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1 Probability spaces 12
1.1 Random experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
1.2 Events and classes of sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
1.3 Probabilities and probability functions . . . . . . . . . . . . . . . . . . . . 31
1.4 Distribution functions; discrete, absolutely continuous and mixed
probabilities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
1.5 Conditional probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
2 Random variables 56
2.1 Fundamentals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
2.2 Combining random variables . . . . . . . . . . . . . . . . . . . . . . . . . . 63
2.3 Distributions and distribution functions . . . . . . . . . . . . . . . . . . . . 66
2.4 Key r.v. and random vectors and distributions . . . . . . . . . . . . . . . . 70
2.4.1 Discrete r.v. and random vectors . . . . . . . . . . . . . . . . . . . 70
2.4.2 Absolutely continuous r.v. and random vectors . . . . . . . . . . . . 75
2.5 Transformation theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
2.5.1 Transformations of r.v., general case . . . . . . . . . . . . . . . . . 82
2.5.2 Transformations of discrete r.v. . . . . . . . . . . . . . . . . . . . . 84
2.5.3 Transformations of absolutely continuous r.v. . . . . . . . . . . . . 86
2.5.4 Transformations of random vectors, general case . . . . . . . . . . . 92
2.5.5 Transformations of discrete random vectors . . . . . . . . . . . . . . 92
2.5.6 Transformations of absolutely continuous random vectors . . . . . . 98
2.5.7 Random variables with prescribed distributions . . . . . . . . . . . 105
3 Independence 111
3.1 Fundamentals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
3.2 Independent r.v. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
3.3 Functions of independent r.v. . . . . . . . . . . . . . . . . . . . . . . . . . 121
3.4 Order statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
3.5 Constructing independent r.v. . . . . . . . . . . . . . . . . . . . . . . . . . 130
3.6 Bernoulli process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
3.7 Poisson process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
3.8 Generalizations of the Poisson process . . . . . . . . . . . . . . . . . . . . . 143
4 Expectation 147
4.1 Definition and fundamental properties . . . . . . . . . . . . . . . . . . . . 148
4.1.1 Simple r.v. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
4.1.2 Non negative r.v. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
4.1.3 Integrable r.v. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
4.1.4 Complex r.v. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
4.2 Integrals with respect to distribution functions . . . . . . . . . . . . . . . . 160
4.2.1 On integration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160
4.2.2 Generalities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
4.2.3 Discrete distribution functions . . . . . . . . . . . . . . . . . . . . . 165
4.2.4 Absolutely continuous distribution functions . . . . . . . . . . . . . 165
4.2.5 Mixed distribution functions . . . . . . . . . . . . . . . . . . . . . . 166
4.3 Computation of expectations . . . . . . . . . . . . . . . . . . . . . . . . . . 167
4.3.1 Non negative r.v. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
4.3.2 Integrable r.v. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
4.3.3 Mixed r.v. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
4.3.4 Functions of r.v. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171
4.3.5 Functions of random vectors . . . . . . . . . . . . . . . . . . . . . . 172
4.3.6 Functions of independent r.v. . . . . . . . . . . . . . . . . . . . . . 173
4.3.7 Sum of independent r.v. . . . . . . . . . . . . . . . . . . . . . . . . 174
4.4 Lp spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176
4.5 Key inequalities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177
4.5.1 Young’s inequality . . . . . . . . . . . . . . . . . . . . . . . . . . . 178
4.5.2 Hölder’s moment inequality . . . . . . . . . . . . . . . . . . . . . . 179
4.5.3 Cauchy-Schwarz’s moment inequality . . . . . . . . . . . . . . . . . 181
4.5.4 Lyapunov’s moment inequality . . . . . . . . . . . . . . . . . . . . . 182
4.5.5 Minkowski’s moment inequality . . . . . . . . . . . . . . . . . . . . 183
4.5.6 Jensen’s moment inequality . . . . . . . . . . . . . . . . . . . . . . 184
4.5.7 Chebyshev’s inequality . . . . . . . . . . . . . . . . . . . . . . . . . 186
4.6 Moments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 190
4.6.1 Moments of r.v. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 190
4.6.2 Variance and standard deviation . . . . . . . . . . . . . . . . . . . . 192
4.6.3 Skewness and kurtosis . . . . . . . . . . . . . . . . . . . . . . . . . 194
4.6.4 Covariance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196
4.6.5 Correlation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 198
4.6.6 Moments of random vectors . . . . . . . . . . . . . . . . . . . . . . 202
4.6.7 Multivariate normal distributions . . . . . . . . . . . . . . . . . . . 203
4.6.8 Multinomial distributions . . . . . . . . . . . . . . . . . . . . . . . 216
Warm up
For more extensive and exciting accounts on the history of Statistics and Probability,
we recommend:
Remark 0.2 — Applications of random walk
The path followed by an atom in a gas moving under the influence of collisions with other
atoms can be described by a random walk (RW). Random walks have also been applied in
other areas, such as:
• economics (RWs are used to model share prices and other factors);
• visual arts, such as Antony Gormley’s Quantum Cloud sculpture in London, which
was designed by a computer using a random walk algorithm.3
•
The next proposition provides answers to the following questions:
• How can we model and analyze the symmetric random walk?
• What random variables can arise from this random experiment and how can we
describe them?
2 Genetic drift is one of several evolutionary processes which lead to changes in allele frequencies over time.
3 For more applications check http://en.wikipedia.org/wiki/Random_walk.
Proposition 0.3 — Symmetric random walk (Karr, 1993, pp. 1–4)
1. The model
Let:
2. Random variables
Two random variables immediately arise:
[Figure: two sample paths of the symmetric random walk over times t = 1, . . . , 8, taking values between −4 and 4.]

4 Steps are functions defined on the sample space. Thus, steps are random variables.
Invoking the random and symmetric character of this walk, and assuming that
the steps are independent and identically distributed, all 2^n possible values of
(Y_1, . . . , Y_n) are equally likely, and, for every (y_1, . . . , y_n) ∈ {−1, 1}^n,

P(Y_1 = y_1, . . . , Y_n = y_n) = \prod_{i=1}^{n} P(Y_i = y_i)   (2)
 = \left(\frac{1}{2}\right)^n.   (3)
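Equations (2)–(3) can be convinced of by brute-force enumeration for a small n; the Python sketch below is illustrative and not part of the original notes.

```python
from itertools import product

# All 2^n step sequences (y_1, ..., y_n) in {-1, 1}^n are equally likely,
# each with probability (1/2)^n -- equations (2)-(3).
n = 4
paths = list(product([-1, 1], repeat=n))
p_path = (1 / 2) ** n  # probability of any single step sequence

assert len(paths) == 2 ** n                    # all sequences are listed
assert abs(len(paths) * p_path - 1.0) < 1e-12  # probabilities sum to 1
```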
4. First calculations
Let us assume from now on that X0 = 0. Then:
• |Xn | ≤ n, ∀n ∈ IN ;
• Xn is even at even times (n mod 2 = 0) (e.g. X2 cannot be equal to 1);
• Xn is odd at odd times (n mod 2 = 1) (e.g. X1 cannot be equal to 0).
that is, a = \frac{n+k}{2}, and a has to be an integer in {0, . . . , n}.
As a consequence,

P(X_n = k) = \binom{n}{\frac{n+k}{2}} \times \left(\frac{1}{2}\right)^n,   (5)
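Formula (5) is easy to evaluate numerically; the helper below is an illustrative sketch (the function name is ours), using the parity restrictions noted above.

```python
from math import comb

def p_xn(n, k):
    """P(X_n = k) for the SRW started at X_0 = 0 -- equation (5).
    Nonzero only when |k| <= n and n + k is even."""
    if abs(k) > n or (n + k) % 2 != 0:
        return 0.0
    return comb(n, (n + k) // 2) * (1 / 2) ** n

# X_2 takes the values -2, 0, 2 with probabilities 1/4, 1/2, 1/4.
assert p_xn(2, 0) == 0.5
assert p_xn(2, 1) == 0.0   # X_n cannot be odd at an even time
assert abs(sum(p_xn(6, k) for k in range(-6, 7)) - 1.0) < 1e-12
```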
More generally,
for any real set B ⊂ IR, by countable additivity of the probability function P .
(Rewrite (6) taking into account that n and k have to be both even or both odd.) •
Remark 0.4 — Further properties of the symmetric random walk (Karr, 1993,
pp. 4–5)
Exploiting the special structure of the SRW leads us to conclude that:
• the SRW cannot move from one level to another without passing through all values
in between (“continuity”);
• all 2^n length-n paths are equally likely, so two events containing the same number
of paths have the same probability, (no. of paths)/2^n, which allows the probability of one
event to be determined by showing that the paths belonging to this event are in
one-to-one correspondence with those of an event of known probability — in many
cases this correspondence is established geometrically, namely via reasoning known
as the reflection principle.5 •
E(X_n) = E\left(\sum_{i=1}^{n} Y_i\right) = \sum_{i=1}^{n} E(Y_i) = 0.   (8)
Proposition 0.9 — Time of first return to the origin and symmetric random
walk (Karr, 1993, pp. 7-9)
The time at which the SRW first returns to the origin,6

T^0 = \min\{n ∈ IN : X_n = 0\},   (10)

is an important functional of the SRW (it maps the SRW into a scalar). It can represent
the time to ruin.

6 Exam 2010/01/19.
Interestingly enough, T^0 must be a positive and even r.v. (recall that
X_0 = 0). And, for n ∈ IN:

P(T^0 > 2n) = P(X_1 ≠ 0, . . . , X_{2n} ≠ 0) = \binom{2n}{n} \times \left(\frac{1}{2}\right)^{2n};   (11)

P(T^0 = 2n) = \frac{1}{2n−1} \binom{2n}{n} \times \left(\frac{1}{2}\right)^{2n}.   (12)

Moreover, using Stirling’s approximation to n!, n! ≈ \sqrt{2\pi}\, n^{n+1/2} e^{−n}, we get
Exercise 0.10 — Time of first return to the origin and symmetric random walk

\lim_{n→+∞} P(T^0 > 2n) = \lim_{n→+∞} \frac{1}{\sqrt{\pi n}}

to derive (13).
(d) Verify that \sum_{n=1}^{+∞} 2n \, P(T^0 = 2n) = 1 + \sum_{n=1}^{+∞} P(T^0 > 2n), even though we have
E(Z) = 2 \times \left[1 + \sum_{n=1}^{+∞} P(Z > 2n)\right], for any positive and even random variable Z
with finite expected value E(Z) = \sum_{n=1}^{+∞} 2n \times P(Z = 2n). •
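The asymptotics of the exercise can be checked numerically; the sketch below is illustrative (the function name is ours) and evaluates (11) exactly, comparing it with 1/\sqrt{\pi n}.

```python
from math import comb, pi, sqrt

def p_t0_gt(n):
    """P(T^0 > 2n) = C(2n, n) * (1/2)^(2n) -- equation (11)."""
    return comb(2 * n, n) / 4 ** n  # exact integer arithmetic, then one division

# Stirling's approximation gives P(T^0 > 2n) ~ 1 / sqrt(pi * n).
for n in (10, 100, 1000):
    assert abs(p_t0_gt(n) * sqrt(pi * n) - 1) < 1 / n  # relative error ~ 1/(8n)
```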
Proposition 0.11 — First passage times and symmetric random walk (Karr,
1993, pp. 9–11)
Similarly, the first passage time
that is, the “empirical averages”, \frac{X_n}{n} = \frac{1}{n}\sum_{i=1}^{n} Y_i, converge to the “theoretical average”
E(Y_1). •
P(a < X_n ≤ b) = \sum_{a < k ≤ b} P(X_n = k)
 = P\left(\frac{a − 0}{n^{1/2}} < \frac{X_n − 0}{n^{1/2}} ≤ \frac{b − 0}{n^{1/2}}\right)
 ≈ Φ(b/\sqrt{n}) − Φ(a/\sqrt{n}).   (21)
Admit that each step of PD has length equal to one meter and that he has already taken
exactly 100 (a hundred) steps.
Find an approximate value for the probability that PD is within five meters of the
lamppost. •
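With a = −5, b = 5 and n = 100, approximation (21) needs nothing more than the error function; the sketch below is illustrative and also compares the approximation with the exact sum given by (5).

```python
from math import comb, erf, sqrt

def Phi(x):
    """Standard normal distribution function, via the error function."""
    return 0.5 * (1 + erf(x / sqrt(2)))

n, a, b = 100, -5, 5
# Normal approximation (21): P(a < X_n <= b) ~ Phi(b/sqrt(n)) - Phi(a/sqrt(n)).
approx = Phi(b / sqrt(n)) - Phi(a / sqrt(n))

# Exact value: X_100 is even, so only k in {-4, -2, 0, 2, 4} contribute.
exact = sum(comb(n, (n + k) // 2) for k in range(-4, 5, 2)) / 2 ** n

assert abs(approx - 0.3829) < 1e-3  # PD is near the lamppost with prob. ~ 0.38
assert abs(exact - approx) < 0.01
```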
Please note that we can get the limiting distribution function by using Stirling’s
approximation and the following result:

P(W_{2n} = 2k) = \binom{2k}{k} \times \binom{2n−2k}{n−k} \times \left(\frac{1}{2}\right)^{2n}.   (23)

•
Exercise 0.16 — Arc sine law
Prove result (22) (Karr, 1993, p. 13). •
References
• Grimmett, G.R. and Stirzaker, D.R. (2001). Probability and Random Processes
(3rd. edition). Oxford. (QA274.12-.76.GRI.40695 refers to the library code of the
1st. and 2nd. editions from 1982 and 1992, respectively.)
Chapter 1
Probability spaces
[...] have been taught that the universe evolves according to deterministic
laws that specify exactly its future, and a probabilistic description is necessary
only because of our ignorance. This deep-rooted skepticism in the validity
of probabilistic results can be overcome only by proper interpretation of the
meaning of probability. Papoulis (1965, p. 3)
Much of our life is based on the belief that the future is largely unpredictable
(Grimmett and Stirzaker, 2001, p. 1), nature is liable to change and chance governs
life.
We express this belief in chance behaviour by the use of words such as random, probable
(probably), probability, likelihood (likeliness), etc.
There are essentially four ways of defining probability (Papoulis, 1965, p. 7) and this
is quite a controversial subject, proving that not all of probability and statistics is
cut-and-dried (Righter, 200–):
• relative frequency (Von Mises);2
• it can be used only for a limited class of problems since the equally likely condition
is often violated in practice;
2 Kolmogorov said: “[...] mathematical theory of probability to real ’random phenomena’ must depend on some form of the frequency concept of probability, [...] which has been established by von Mises [...].” (http://en.wikipedia.org/wiki/Richard_von_Mises)
3 Inductive reasoning or inductive logic is a type of reasoning which involves moving from a set of specific facts to a general conclusion (http://en.wikipedia.org/wiki/Inductive_reasoning).
4 Bayesianism uses probability theory as the framework for induction. Given new evidence, Bayes’ theorem is used to evaluate how much the strength of a belief in a hypothesis should change with the data we collected.
5 http://en.wikipedia.org/wiki/Kolmogorov_axioms
Relative frequency definition of probability
The relative frequency approach was developed by Von Mises at the beginning of the 20th
century; at that time the prevailing definition of probability was the classical one and his
work was a healthy alternative (Papoulis, 1965, p. 9).
The relative frequency definition of probability used to be popular among engineers
and physicists. A random experiment is repeated over and over again, N times; if the
event A occurs NA times out of N , then the probability of A is defined as the limit of the
relative frequency of the occurrence of A:
P(A) = \lim_{N→+∞} \frac{N_A}{N}.   (1.2)
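The limit in (1.2) can only be illustrated, never attained, on a computer; the sketch below is ours and estimates P(A) for A = “coin shows heads” in fair-coin tossing.

```python
import random

# Relative frequency N_A / N of the event A = "coin shows heads"
# after N repetitions of the random experiment -- definition (1.2).
random.seed(0)                    # fixed seed, for reproducibility
N = 100_000
N_A = sum(random.random() < 0.5 for _ in range(N))
assert abs(N_A / N - 0.5) < 0.01  # empirical frequency is near P(A) = 1/2
```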
• outcomes ω, elements of the sample space, also referred to as sample points or
realizations;
1.1 Random experiments
Definition 1.1 — Random experiment
A random experiment consists of both a procedure and observations,6 and its outcome
cannot be determined in advance. •
Random experiment
E1 Give a lecture.
Observe the number of students seated in the 4th row, which has 7 seats.
E4 Give n lectures.
Observe the number of students seated in the 4th row in each of those n lectures.
6 Yates and Goodman (1999, p. 7).
The finest-grain property simply means that all possible distinguishable outcomes are
identified separately. Moreover, Ω is (usually) known before the random experiment takes
place. The choice of Ω balances fidelity to reality with mathematical convenience (Karr,
1993, p. 12).
• Finite set
The simplest random experiment has two outcomes.
A random experiment with n possible outcomes may be modeled with a sample
space consisting of n integers.
• Countable set
The sample space for an experiment with countably many possible outcomes is
ordinarily the set IN = {1, 2, . . .} of positive integers or the set {. . . , −1, 0, +1, . . .}
of all integers.
Whether a finite or a countable sample space better describes a given phenomenon
is a matter of judgement and compromise. (Comment!)
• Function spaces
In some random experiments the outcome is a trajectory followed by a system over
an interval of time. In this case the outcomes are functions. •
Example 1.6 — Sample spaces
The sample spaces defined below refer to the random experiments defined in Example
1.2:
E3 IR_0^+ (an interval in IR)
Note that C([0, 1]) represents the vector space of continuous, real-valued functions on
[0, 1]. •
1.2 Events and classes of sets
Definition 1.7 — Event (Karr, 1993, p. 18)
Given a random experiment with sample space Ω, an event can be provisionally defined
as a subset of Ω whose probability is defined. •
Remark 1.8 — An event A occurs if the outcome ω of the random experiment belongs
to A, i.e. ω ∈ A. •
E.A. Event
• Complementation
The complement of an event A ⊂ Ω is
• Intersection
The intersection of events A and B (A, B ⊂ Ω) is
The events A and B are disjoint (mutually exclusive) if A ∩ B = ∅, i.e. they have
no outcomes in common, therefore they never happen at the same time.
• Union
The union of events A and B (A, B ⊂ Ω) is
A ∪ B = {ω ∈ Ω : ω ∈ A or ω ∈ B}. (1.5)
• Set difference
Given two events A and B (A, B ⊂ Ω), the set difference between B and A consists
of those outcomes in B but not in A:
B\A = B ∩ Ac . (1.6)
• Symmetric difference
Let A and B be two events (A, B ⊂ Ω). Then the outcomes that are in one but not
in both sets constitute the symmetric difference:

A∆B = (A \ B) ∪ (B \ A).   (1.7)
Set operation Property
Complementation (Ac )c = A
∅c = Ω
Ωc = ∅
Associativity
(A ∩ B) ∩ C = A ∩ (B ∩ C)
(A ∪ B) ∪ C = A ∪ (B ∪ C)
De Morgan’s laws
(A ∩ B)c = Ac ∪ B c
(A ∪ B)c = Ac ∩ B c
Distributivity
(A ∩ B) ∪ C = (A ∪ C) ∩ (B ∪ C)
(A ∪ B) ∩ C = (A ∩ C) ∪ (B ∩ C)
ω ∈ A ⇒ ω ∈ B. (1.8)
So if A occurs then B also occurs. However, the occurrence of B does not imply
the occurrence of A.
• Equality
Two events A and B are equal, written A = B, iff A ⊂ B and B ⊂ A. This means
ω ∈ A ⇔ ω ∈ B. (1.9)
•
Proposition 1.14 — Properties of set containment (Resnick, 1999, p. 4)
These properties are straightforward but we stated them for the sake of completeness and
their utility in the comparison of the probabilities of events:
• A⊂A
• A ⊂ B, B ⊂ C ⇒ A ⊂ C
• A ⊂ C, B ⊂ C ⇒ (A ∪ B) ⊂ C
• A ⊃ C, B ⊃ C ⇒ (A ∩ B) ⊃ C
• A ⊂ B ⇔ B c ⊂ Ac . •
ω Member of Ω Outcome
Functions on the sample space (such as random variables defined in the next chapter)
are even more important than events themselves.
An indicator function is the simplest way to associate a set with a (binary) function.
The indicator function of an event, which resulted from a set operation on events A
and B, can often be written in terms of the indicator functions of these two events.
1Ac = 1 − 1A (1.11)
1A∩B = min{1A , 1B }
= 1A × 1B (1.12)
1A∪B = max{1A , 1B }; (1.13)
1B\A = 1B∩Ac
= 1B × (1 − 1A ) (1.14)
1A∆B = |1A − 1B |. (1.15)
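Identities (1.11)–(1.15) can be verified exhaustively on a small sample space; the Python sketch below is illustrative (the helper `ind` is ours) and checks them for every outcome ω.

```python
# Indicator-function identities (1.11)-(1.15), checked on a finite sample space.
Omega = set(range(10))
A = {1, 2, 3, 4}
B = {3, 4, 5, 6}

def ind(S):
    """Indicator function of S, as a dict omega -> {0, 1}."""
    return {w: 1 if w in S else 0 for w in Omega}

iA, iB = ind(A), ind(B)
for w in Omega:
    assert ind(Omega - A)[w] == 1 - iA[w]        # (1.11) complement
    assert ind(A & B)[w] == iA[w] * iB[w]        # (1.12) intersection
    assert ind(A | B)[w] == max(iA[w], iB[w])    # (1.13) union
    assert ind(B - A)[w] == iB[w] * (1 - iA[w])  # (1.14) set difference
    assert ind(A ^ B)[w] == abs(iA[w] - iB[w])   # (1.15) symmetric difference
```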
The definition of indicator function quickly yields the following result when we are
able to compare events A and B.
Proposition 1.19 — Another property of indicator functions (Resnick, 1999, p.
5)
Let A and B be two events of Ω. Then
A ⊆ B ⇔ 1A ≤ 1B . (1.16)
Note here that we use the convention that for two functions f , g with domain Ω and range
IR, we have f ≤ g iff f (ω) ≤ g(ω) for all ω ∈ Ω. •
Definition 1.22 — Lim sup, lim inf and limit set (Karr, 1993, p. 20)
Let (An )n∈IN be a sequence of events of Ω. Then we define the two following limit sets:
\limsup A_n = \bigcap_{k=1}^{+∞} \bigcup_{n=k}^{+∞} A_n   (1.19)
 = {ω ∈ Ω : ω ∈ A_n for infinitely many values of n}
 = {A_n, i.o.}

\liminf A_n = \bigcup_{k=1}^{+∞} \bigcap_{n=k}^{+∞} A_n   (1.20)
 = {ω ∈ Ω : ω ∈ A_n for all but finitely many values of n}
 = {A_n, ult.},

where i.o. and ult. stand for infinitely often and ultimately, respectively.
Let A be an event of Ω. Then the sequence (An )n∈IN is said to converge to A, written
An → A or limn→+∞ An = A, if
Then
\limsup A_n = \bigcap_{k=1}^{+∞} \bigcup_{n=k}^{+∞} A_n = Ω   (1.23)
 ≠
\liminf A_n = \bigcup_{k=1}^{+∞} \bigcap_{n=k}^{+∞} A_n = ∅,   (1.24)
Proposition 1.25 — Properties of lim sup and lim inf (Resnick, 1999, pp. 7–8)
Let (An )n∈IN be a sequence of events of Ω. Then
Definition 1.26 — Monotone sequences of events (Resnick, 1999, p. 8)
Let (An )n∈IN be a sequence of events of Ω. It is said to be monotone non-decreasing,
written An ↑, if
A1 ⊆ A2 ⊆ A3 ⊆ . . . ,   (1.27)

and monotone non-increasing, written An ↓, if

A1 ⊇ A2 ⊇ A3 ⊇ . . . .   (1.28)
•
Let A_n = {X_n = 0}. Since A_1 ⊆ A_2 ⊆ . . ., i.e. (A_n)_{n∈IN} is a non-decreasing monotone
sequence of events, written A_n ↑, we get A_n → A = \bigcup_{n=1}^{+∞} A_n. Moreover, the extinction
probability is given by

P({X_n = 0 for some n}) = P\left(\bigcup_{n=1}^{+∞} {X_n = 0}\right) = P\left(\lim_{n→+∞} {X_n = 0}\right)
 = P\left(\bigcup_{n=1}^{+∞} A_n\right)
 = P\left(\lim_{n→+∞} A_n\right).   (1.31)
Later on, we shall conclude that we can conveniently interchange the limit sign and
the probability function and add: P (Xn = 0 for some n) = P (limn→+∞ {Xn = 0}) =
limn→+∞ P ({Xn = 0}). •
Thus, the convergence of sets is the same as pointwise convergence of their indicator
functions. •
Exercise 1.31 — Limits of indicator functions (Exercise 1.8, Karr, 1993, p. 40)
Prove Proposition 1.30. •
Example 1.34 — Closure under set operations (Resnick, 1999, p. 12)
• Suppose Ω = IR and C = {finite real intervals} = {(a, b] : −∞ < a < b < +∞}.
Then C is not closed under finite unions since (1, 2] ∪ (36, 37] is not a finite interval.
However, C is closed under intersection since (a, b]∩(c, d] = (max{a, c}, min{b, d}] =
(a ∨ c, b ∧ d].
1. Ω ∈ A
2. A ∈ A ⇒ Ac ∈ A
3. A, B ∈ A ⇒ A ∪ B ∈ A. •
1. Ω ∈ F
2. A ∈ F ⇒ Ac ∈ F
3. A_1, A_2, . . . ∈ F ⇒ \bigcup_{i=1}^{+∞} A_i ∈ F. •
• Trivial σ−algebra
F = {∅, Ω}
• Power set
F = IP (Ω) = class of all subsets of Ω
• Trivial example
Let Ω = {1, 2, 3} and U = {{1}}. Then σ(U) = {∅, {1}, {2, 3}, Ω} is a σ−algebra
on Ω.
Since we tend to deal with real random variables we have to define a σ−algebra on
Ω = IR and the power set on IR, IP (IR) is not an option. The most important σ−algebra
on IR is the one defined as follows.
that is, σ(U) = B(IR). Its elements are called Borel sets.8 •
• Every “reasonable” subset of IR — such as intervals, closed sets, open sets, finite sets,
and countable sets — belongs to B(IR). For instance, {x} = \bigcap_{n=1}^{+∞} (x − 1/n, x].
• Moreover, the Borel σ−algebra on IR, B(IR), can also be generated by the class of
intervals {(−∞, a] : −∞ < a < +∞} or {(b, +∞) : −∞ < b < +∞}.
8 Borel sets are named after Émile Borel. Along with René-Louis Baire and Henri Lebesgue, he was among the pioneers of measure theory and its application to probability theory (http://en.wikipedia.org/wiki/Émile_Borel).
1.3 Probabilities and probability functions
Motivation 1.47 — Probability function (Karr, 1993, p. 23)
A probability is a set function, defined for events; it should be countably additive (i.e.
σ−additive), that is, the probability of a countable union of disjoint events is the sum of
their individual probabilities. •
2. Axiom 2 — P (Ω) = 1.
9 Righter (200–) called the first and second axioms duh rules.
Then the function defined as

P\left(\bigcup_{i∈I} A_i\right) = \sum_{i∈I} p_i, ∀I ⊆ {1, . . . , n},   (1.36)
P (∅) = 0. (1.37)
2. Finite additivity
If A_1, . . . , A_n are (pairwise) disjoint events then

P\left(\bigcup_{i=1}^{n} A_i\right) = \sum_{i=1}^{n} P(A_i).   (1.38)
Therefore if A ⊆ B then
4. Addition rule
For all A and B (disjoint or not),
• Linda is 31 years old, single, outspoken, and very bright. She majored in philosophy.
As a student she was deeply concerned with issues of discrimination and social justice
and also participated in anti-nuclear demonstrations.
Kahneman and Tversky found that about 85% of the subjects ranked “Linda is a bank
teller and is active in the feminist movement” as more probable than “Linda is a bank
teller”. •
P (B \ A) = P (B) − P (A ∩ B) (1.43)
P (A∆B) = P (A ∪ B) − P (A ∩ B) = P (A) + P (B) − 2 × P (A ∩ B). (1.44)
• property 3. by rewriting B as (B \ A) ∪ (A ∩ B) = (B \ A) ∪ A;
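The addition rule and identities (1.43)–(1.44) are easy to sanity-check on a finite model; the sketch below is ours and uses the classical probability P(E) = |E|/6 of a fair die.

```python
from fractions import Fraction

# Addition rule and identities (1.43)-(1.44), checked for a fair die,
# where every event E has classical probability P(E) = |E|/6.
Omega = {1, 2, 3, 4, 5, 6}

def P(E):
    return Fraction(len(E), len(Omega))

A, B = {1, 2, 3}, {2, 3, 4, 5}
assert P(A | B) == P(A) + P(B) - P(A & B)      # addition rule
assert P(B - A) == P(B) - P(A & B)             # (1.43)
assert P(A ^ B) == P(A) + P(B) - 2 * P(A & B)  # (1.44)
```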
We proceed with some advanced properties of probability functions.
• The terms on the right side of (1.47) alternate in sign and give inequalities called
Bonferroni inequalities 11 when we neglect the remainders. Two examples:
10 Note that \bigcup_{n=1}^{+∞} A_n = \bigcup_{n=1}^{+∞} B_n, where B_n = A_n \setminus \left(\bigcup_{i=1}^{n−1} A_i\right) are disjoint events.
11 They are named after the Italian mathematician Carlo Emilio Bonferroni.
" n
$ n
# %
P Ai ≤ P (Ai ) (1.48)
"i=1
n
$ i=1
n
# % %
P Ai ≥ P (Ai ) − P (Ai ∩ Aj ) (1.49)
i=1 i=1 1≤i<j≤n
• Let the event Ai represent the rejection of the (simple) null hypothesis H0,i
(i = 1, . . . , n). Then if we test the (multiple or simultaneous) null hypothesis
H0 : ∩ni=1 H0,i , the probability of rejecting H0 is equal to the probability of rejecting
at least one of the (simple) null hypotheses. Moreover, this probability does not
exceed the sum of the probabilities of individually rejecting each of the (simple)
null hypotheses:
"n $ n
# %
P Ai ≤ P (Ai ).
i=1 i=1
Consequently, if the desired significance level for the test involving H0 is set
to be equal to α0 , then the Bonferroni correction leads to the conclusion that
each individual null hypothesis should be tested at a significance level of α0 /n
(http://en.wikipedia.org/wiki/Bonferroni correction). •
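The Bonferroni correction can be illustrated with simulated independent p-values; the sketch below is illustrative (the choice of α0 = 0.05 and n = 10 tests is ours) and checks that the family-wise error rate stays at or below α0.

```python
import random

# Bonferroni correction: n tests, each at level alpha_0 / n, keep the
# probability of at least one false rejection at most alpha_0 (union bound).
alpha_0, n = 0.05, 10
level = alpha_0 / n                      # corrected per-test level

# Union bound (1.48): n * (alpha_0 / n) = alpha_0.
assert n * level <= alpha_0 + 1e-12

# Simulation: independent Uniform(0,1) p-values under the null hypotheses.
random.seed(1)
trials = 20_000
fwer = sum(
    any(random.random() < level for _ in range(n)) for _ in range(trials)
) / trials
assert fwer < alpha_0 + 0.01             # true rate is 1-(1-level)^n ~ 0.0489
```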
• A1 ⊂ A2 ⊂ A3 ⊂ . . . ⊂ An ⊂ . . .;
• B_1 = A_1, B_2 = A_2 \setminus A_1, B_3 = A_3 \setminus (A_1 ∪ A_2), . . . , B_n = A_n \setminus \left(\bigcup_{i=1}^{n−1} A_i\right) are disjoint
events;

• since A_1, A_2, . . . is a non-decreasing sequence of events, A_n ↑ A = \bigcup_{n=1}^{+∞} A_n =
\bigcup_{n=1}^{+∞} B_n, with B_n = A_n \setminus A_{n−1} and \bigcup_{i=1}^{n} B_i = A_n; if we add σ−additivity
to this, we conclude that

P(A) = P\left(\bigcup_{n=1}^{+∞} A_n\right) = P\left(\bigcup_{n=1}^{+∞} B_n\right) = \sum_{n=1}^{+∞} P(B_n)
 = \lim_{n→+∞} ↑ \sum_{i=1}^{n} P(B_i) = \lim_{n→+∞} ↑ P\left(\bigcup_{i=1}^{n} B_i\right) = \lim_{n→+∞} ↑ P(A_n).
4. Whenever An ↓ ∅ in F, P (An ) ↓ 0. •
Remark 1.69 — Inf, sup, lim inf and lim sup
Let a1 , a2 , . . . be a sequence of real numbers. Then
• Infimum
The infimum of the set {a1 , a2 , . . .} — written inf an — corresponds to the greatest
element (not necessarily in {a1 , a2 , . . .}) that is less than or equal to all elements of
{a1 , a2 , . . .}.12
• Supremum
The supremum of the set {a1 , a2 , . . .} — written sup an — corresponds to the
smallest element (not necessarily in {a1 , a2 , . . .}) that is greater than or equal to
every element of {a1 , a2 , . . .}.13
• Limit inferior and limit superior of a sequence of real numbers
lim inf an = supk≥1 inf n≥k an
lim sup an = inf k≥1 supn≥k an .
Proposition 1.71 — A special case of Fatou’s lemma (Resnick, 1999, p. 32)
Suppose A1 , A2 , . . . is a sequence of events in F. Then
P (lim inf An ) ≤ lim inf P (An ) ≤ lim sup P (An ) ≤ P (lim sup An ). (1.50)
Theorem 1.73 — Continuity (Karr, 1993, p. 27)
If An → A then P (An ) → P (A). •
Theorem 1.76 — (1st.) Borel-Cantelli Lemma (Resnick, 1999, p. 102; Karr, 1993,
p. 27)
Let A1 , A2 , . . . be any events in F. Then
\sum_{n=1}^{+∞} P(A_n) < +∞ ⇒ P(\limsup A_n) = 0.   (1.51)
1.4 Distribution functions; discrete, absolutely
continuous and mixed probabilities
Motivation 1.78 — Distribution function (Karr, 1993, pp. 28-29)
A probability function P on the Borel σ−algebra B(IR) is determined by its values
P ((−∞, x]), for all intervals (−∞, x].
Probability functions on the real line play an important role as distribution functions
of random variables. •
Definition 1.79 — Distribution function (Karr, 1993, p. 29)
Let P be a probability function defined on (IR, B(IR)). The distribution function
associated to P is represented by FP and defined by
FP (x) = P ((−∞, x]), x ∈ IR. (1.52)
•
Theorem 1.80 — Some properties of distribution functions (Karr, 1993, pp. 29-
30)
Let FP be the distribution function associated to P . Then
1. FP is non-decreasing. Hence, the left-hand limit
2. FP is right-continuous, i.e.
for each x.
Definition 1.81 — Distribution function (Resnick, 1999, p. 33)
A function FP : IR → [0, 1] satisfying properties 1., 2. and 3. from Theorem 1.80 is called
a distribution function. •
(−∞, x] FP (x)
(x, +∞) 1 − FP (x)
(−∞, x) FP (x− )
[x, +∞) 1 − FP (x− )
(a, b] FP (b) − FP (a)
[a, b) FP (b− ) − FP (a− )
[a, b] FP (b) − FP (a− )
(a, b) FP (b− ) − FP (a)
{x} FP (x) − FP (x− )
F_P(x) = P((−∞, x]) = \begin{cases} 0, & x < s \\ 1, & x ≥ s. \end{cases}   (1.61)
The property that FP (x) only takes values 0 or 1 characterizes point masses. •
We are going to revisit the discrete and absolutely continuous probabilities and
introduce mixed distributions.
2. There is a real sequence x_1, x_2, . . . and numbers p_1, p_2, . . . with p_n > 0, for all n, and
\sum_{n=1}^{+∞} p_n = 1 such that

P(A) = \sum_{n=1}^{+∞} p_n \times \varepsilon_{x_n}(A),   (1.64)
3. There is a real sequence x_1, x_2, . . . and numbers p_1, p_2, . . . with p_n > 0, for all n, and
\sum_{n=1}^{+∞} p_n = 1 such that

F_P(x) = \sum_{n=1}^{+∞} p_n \times 1_{[x_n, +∞)}(x),   (1.65)
• Geometric distribution with parameter p (p ∈ [0, 1])
C = IN
p_x = (1 − p)^{x−1} p, x ∈ C.
This probability function is associated with the total number of
(i.i.d.) Bernoulli trials needed to get one success — the first success
(http://en.wikipedia.org/wiki/Geometric_distribution). The graph of

F_P(x) = \begin{cases} 0, & x < 1 \\ \sum_{i=1}^{[x]} (1 − p)^{i−1} p = 1 − (1 − p)^{[x]}, & x ≥ 1, \end{cases}   (1.66)

where [x] represents the integer part of the real number x, follows:
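Distribution function (1.66) can be evaluated directly; the sketch below is illustrative (the function name and the choice p = 0.3 are ours).

```python
from math import floor

def geom_cdf(x, p):
    """F_P(x) = 1 - (1-p)^[x] for x >= 1, and 0 otherwise -- equation (1.66)."""
    if x < 1:
        return 0.0
    return 1 - (1 - p) ** floor(x)

p = 0.3
assert geom_cdf(0.5, p) == 0.0                # no mass below 1
assert abs(geom_cdf(1, p) - p) < 1e-12        # P(X = 1) = p
assert geom_cdf(2.7, p) == geom_cdf(2, p)     # F_P is flat between integers
assert abs(geom_cdf(50, p) - 1) < 1e-7        # F_P increases towards 1
```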
• Hypergeometric distribution with parameters N, M, n (N, M, n ∈ IN and M, n ≤ N)
C = {x ∈ IN_0 : max{0, n − N + M} ≤ x ≤ min{n, M}}
p_x = \binom{M}{x}\binom{N−M}{n−x} / \binom{N}{n}, x ∈ C.
It is a discrete probability that describes the number of successes in a
sequence of n draws from a finite population of size N without replacement
(http://en.wikipedia.org/wiki/Hypergeometric_distribution).
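The hypergeometric probability function sums to one over its support C, and its mean is nM/N; the sketch below is illustrative (the parameter values are ours) and checks both facts.

```python
from math import comb

def hyper_pmf(x, N, M, n):
    """p_x = C(M, x) C(N-M, n-x) / C(N, n), for x in the support C."""
    return comb(M, x) * comb(N - M, n - x) / comb(N, n)

N, M, n = 20, 7, 5   # population of 20, 7 "successes", 5 draws w/o replacement
support = range(max(0, n - N + M), min(n, M) + 1)
assert abs(sum(hyper_pmf(x, N, M, n) for x in support) - 1.0) < 1e-12

# The mean of the hypergeometric distribution is n * M / N.
mean = sum(x * hyper_pmf(x, N, M, n) for x in support)
assert abs(mean - n * M / N) < 1e-12
```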
also be used for the number of events in other specified intervals such as distance,
area or volume (http://en.wikipedia.org/wiki/Poisson distribution).
The figure comprises the distribution function for three different values of λ. •
Remark 1.95 — Absolutely continuous probabilities
If P is an absolutely continuous probability then FP (x) is an absolutely continuous real
function. •
• Continuous function
A real function f is continuous in x if for any sequence {x1 , x2 , . . .} such that
limn→∞ xn = x, it holds that limn→∞ f (xn ) = f (x).
One can say, briefly, that a function is continuous iff it preserves limits.
For the Cauchy definition (epsilon-delta) of continuous function see
http://en.wikipedia.org/wiki/Continuous function
2. f(x) = \begin{cases} 0, & x = 0 \\ x \sin(1/x), & x ≠ 0, \end{cases}
on a finite interval containing the origin.
(http://en.wikipedia.org/wiki/Absolute_continuity) •
by the two parameters, a and b, which are its minimum and maximum values
(http://en.wikipedia.org/wiki/Uniform distribution (continuous)).
• Exponential distribution with parameter λ (λ ∈ IR^+)

f_P(x) = \begin{cases} λe^{−λx}, & x ≥ 0 \\ 0, & \text{otherwise.} \end{cases}

F_P(x) = \begin{cases} 0, & x < 0 \\ 1 − e^{−λx}, & x ≥ 0. \end{cases}

15 I.e. a process in which events occur continuously and independently at a constant average rate.
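For an absolutely continuous probability the distribution function is the integral of the density; the sketch below is ours (λ = 2 is an arbitrary choice) and recovers F_P from f_P by crude numerical integration.

```python
from math import exp

# Exponential(lambda): density f_P and distribution function F_P.
lam = 2.0

def f_P(x):
    return lam * exp(-lam * x) if x >= 0 else 0.0

def F_P(x):
    return 1 - exp(-lam * x) if x >= 0 else 0.0

# F_P(x) is recovered by integrating f_P over [0, x] (midpoint rule).
x, m = 1.5, 10_000
h = x / m
integral = sum(f_P((i + 0.5) * h) for i in range(m)) * h
assert abs(integral - F_P(x)) < 1e-6
assert F_P(-1) == 0.0   # no mass on the negative half-line
```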
F_P(x) = \int_{−∞}^{x} f_P(s) \, ds, x ∈ IR

The normal distribution or Gaussian distribution describes data that cluster
around a mean or average. The graph of the associated probability density function
is bell-shaped, with a peak at the mean, and is known as the Gaussian function or
bell curve (http://en.wikipedia.org/wiki/Normal_distribution). •
Example 1.102 — Mixed distributions
Let M (λ)/M (µ)/1 represent a queueing system with Poisson arrivals (rate λ > 0) and
exponential service times (rate µ > 0) and one server.
In the equilibrium state, the probability function associated to the waiting time of an
arriving customer is
1.5 Conditional probability
Motivation 1.103 — Conditional probability (Karr, 1993, p. 35)
We shall revise probabilities to account for the knowledge that an event has occurred,
using a concept known as conditional probability. •
This is the correct answer to another question:
• For a family with two children, what is the probability that both are boys given
that the younger is a boy?
In this case
•
16 The prosecution made this error in the famous Dreyfus affair (http://en.wikipedia.org/wiki/Alfred_Dreyfus) in 1894.
Example 1.111 — Multiplication rule (Montgomery and Runger, 2003, p. 43)
The probability that an automobile battery, subject to high engine compartment
temperature, suffers low charging current is 0.7. The probability that a battery is subject
to high engine compartment temperature is 0.05.
What is the probability that a battery suffers low charging current and is subject to
high engine compartment temperature?
Let T = {battery subject to high engine compartment temperature} and
C = {battery suffers low charging current}, so that P(T) = 0.05 and P(C|T) = 0.7.
By the multiplication rule,

P(C ∩ T) = P(C|T) \times P(T) = 0.7 \times 0.05 = 0.035.   (1.79)
Corollary 1.115 — Law of total probability (Montgomery and Runger, 2003, p. 44)
For any events A and B,
Example 1.116 — Law of total probability (Grimmett and Stirzaker, 2001, p. 11)
Only two factories manufacture zoggles. 20% of the zoggles from factory I and 5% from
factory II are defective. Factory I produces twice as many zoggles as factory II each week.
What is the probability that a zoggle, randomly chosen from a week’s production, is
not defective?
Let A = {zoggle made in factory I} and D = {zoggle is defective}, so that P(A) = 2/3,
P(D|A) = 0.20 and P(D|A^c) = 0.05. Then

P(D^c) = 1 − P(D)
 = 1 − [P(D|A) \times P(A) + P(D|A^c) \times P(A^c)]
 = 1 − \left(0.20 \times \frac{2}{3} + 0.05 \times \frac{1}{3}\right)
 = \frac{51}{60}.
Proposition 1.118 — Bayes’ theorem (Karr, 1993, p. 36)
Let {A1 , A2 , . . .} be a countable partition of Ω. Then, for each event B with P (B) > 0
and each n,
P(A_n | B) = \frac{P(B|A_n) P(A_n)}{P(B)} = \frac{P(B|A_n) P(A_n)}{\sum_{i=1}^{+∞} P(B|A_i) P(A_i)}   (1.82)
• Probability

P(A|D) = \frac{P(D|A) \times P(A)}{P(D)} = \frac{0.20 \times \frac{2}{3}}{1 − \frac{51}{60}} = \frac{8}{9}.   (1.83)
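The zoggle computations — the law of total probability and then Bayes’ theorem (1.82) — can be reproduced exactly with rational arithmetic; the sketch below is illustrative.

```python
from fractions import Fraction

# Zoggle example: factory I (event A) produces twice as many as factory II.
P_A = Fraction(2, 3)
P_Ac = Fraction(1, 3)
P_D_given_A = Fraction(20, 100)    # defective rate, factory I
P_D_given_Ac = Fraction(5, 100)    # defective rate, factory II

# Law of total probability: P(D) = P(D|A) P(A) + P(D|A^c) P(A^c).
P_D = P_D_given_A * P_A + P_D_given_Ac * P_Ac
assert 1 - P_D == Fraction(51, 60)             # probability of a good zoggle

# Bayes' theorem (1.82): P(A | D) = P(D|A) P(A) / P(D).
P_A_given_D = P_D_given_A * P_A / P_D
assert P_A_given_D == Fraction(8, 9)           # result (1.83)
```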
References
• Grimmett, G.R. and Stirzaker, D.R. (2001). Probability and Random Processes (3rd.
edition). Oxford. (QA274.12-.76.GRI.30385 and QA274.12-.76.GRI.40695 refer to
the library code of the 1st. and 2nd. editions from 1982 and 1992, respectively.)
• Montgomery, D.C. and Runger, G.C. (2003). Applied statistics and probability for
engineers. John Wiley & Sons, New York. (QA273-280/3.MON.64193)
• Righter, R. (200–). Lecture notes for the course Probability and Risk Analysis
for Engineers. Department of Industrial Engineering and Operations Research,
University of California at Berkeley.
• Yates, R.D. and Goodman, D.J. (1999). Probability and Stochastic Processes: A
friendly Introduction for Electrical and Computer Engineers. John Wiley & Sons,
Inc. (QA273-280/4.YAT.49920)
Chapter 2
Random variables
2.1 Fundamentals
Motivation 2.1 — Inverse image of sets (Karr, 1993, p. 43)
Before we introduce the concept of random variable (r.v.), we have to talk rather
extensively about inverse images of sets and the inverse image mapping. •
Proposition 2.4 — Properties of inverse image mapping (Karr, 1993, p. 43;
Resnick, 1999, p. 72)
Let X : Ω → Ω′ be a mapping. Then:
1. X −1 (∅) = ∅
2. X −1 (Ω′ ) = Ω
3. B ⊆ B ′ ⇒ X −1 (B) ⊆ X −1 (B ′ )
4. X −1 (∪i∈I Bi ) = ∪i∈I X −1 (Bi )
5. X −1 (∩i∈I Bi ) = ∩i∈I X −1 (Bi )
6. B ∩ B ′ = ∅ ⇒ X −1 (B) ∩ X −1 (B ′ ) = ∅
7. X −1 (B c ) = [X −1 (B)]c . •
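On finite sets these properties can be verified mechanically. A small sketch (the mapping X below is a hypothetical example, not from the text):

```python
def inv(X, B):
    """Inverse image X^{-1}(B) = {w : X(w) in B} of a mapping given as a dict."""
    return {w for w, x in X.items() if x in B}

Omega = {1, 2, 3, 4, 5, 6}
X = {w: w % 2 for w in Omega}      # X maps Omega into Omega' = {0, 1}
B1, B2 = {0}, {1}

assert inv(X, B1 | B2) == inv(X, B1) | inv(X, B2)    # property 4 (unions)
assert inv(X, B1 & B2) == inv(X, B1) & inv(X, B2)    # property 5 (intersections)
assert inv(X, {0, 1} - B1) == Omega - inv(X, B1)     # property 7 (complements)
```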
Proposition 2.6 — σ − algebras and inverse image mapping (Resnick, 1999, pp.
72–73)
Let X : Ω → Ω′ be a mapping. If F ′ is a σ − algebra on Ω′ then
X −1 (F ′ ) = {X −1 (B) : B ∈ F ′ } (2.2)
is a σ − algebra on Ω.
Proposition 2.8 — Inverse images of σ − algebras generated by classes of
subsets (Resnick, 1999, p. 73)
Let C ′ be a class of subsets of Ω′ . Then
X −1 (σ(C ′ )) = σ(X −1 (C ′ )), (2.3)
i.e., the inverse image of the σ − algebra generated by C ′ is the same as the σ − algebra
on Ω generated by the inverse images. •

A mapping X : Ω → Ω′ is said to be measurable with respect to σ − algebras F on Ω and F ′ on Ω′ if
X −1 (F ′ ) ⊆ F. (2.4)
Remark 2.14 — Random variables (Karr, 1993, p. 44)
A r.v. is a function on the sample space: its inverse image mapping transforms Borel
sets into events.
The technical requirement that the sets {X ∈ B} = X −1 (B) be events of Ω is needed in
order that the probability
P ({X ∈ B}) = P (X −1 (B))
be defined. •
Proposition 2.16 — Checking if X is a r.v. (Resnick, 1999, p. 77; Karr, 1993, p. 47)
The real function X : Ω → IR is a r.v. iff X −1 ((−∞, x]) = {X ≤ x} ∈ F, ∀x ∈ IR.
• Random experiment
Throw a traditional fair die and observe the number of points.
• Sample space
Ω = {1, 2, 3, 4, 5, 6}
• σ−algebra on Ω
Let us consider a non-trivial one:
F = {∅, {1, 3, 5}, {2, 4, 6}, Ω}
• Random variable
X : Ω → IR such that: X(1) = X(3) = X(5) = 0 and X(2) = X(4) = X(6) = 1
• Inverse image mapping
Let B ∈ B(IR). Then
X −1 (B) = ∅ if 0 ∉ B and 1 ∉ B; {1, 3, 5} if 0 ∈ B and 1 ∉ B; {2, 4, 6} if 0 ∉ B and 1 ∈ B; Ω if 0 ∈ B and 1 ∈ B.
In every case X −1 (B) ∈ F, ∀B ∈ B(IR). (2.8)
Proposition 2.22 — σ−algebra generated by a r.v. (Karr, 1993, p. 46)
The family of events that are inverse images of Borel sets under a r.v. is a σ − algebra on
Ω. In fact, given a r.v. X, the family
σ(X) = {X −1 (B) : B ∈ B(IR)}
is a σ − algebra on Ω, called the σ − algebra generated by X.
• Moreover, σ(X) is a σ − algebra for every function X : Ω → IR; and X is a r.v. iff
σ(X) ⊆ F, i.e., iff X is a measurable map (Karr, 1993, p. 46). •
since
X −1 (B) = ∅ if 0 ∉ B and 1 ∉ B; Ac if 0 ∈ B and 1 ∉ B; A if 0 ∉ B and 1 ∈ B; Ω if 0 ∈ B and 1 ∈ B, (2.13)
for any B ∈ B(IR). •
Example 2.26 — σ−algebra generated by a simple r.v. (Karr, 1993, pp. 45-46)
A simple r.v. takes only finitely many values and has the form
X = Σ_{i=1}^{n} ai × 1Ai , (2.14)
where a1 , . . . , an are real numbers and {A1 , . . . , An } is a partition of Ω. In this case
σ(X) = σ({A1 , . . . , An })
     = { ∪i∈I Ai : I ⊆ {1, . . . , n}}, (2.16)
2.2 Combining random variables
To work with r.v., we need assurance that algebraic, limiting and transformation
operations applied to them yield other r.v.
In the next proposition we state that the set of r.v. is closed under:
• addition and scalar multiplication;
• maximum and minimum;
• multiplication;
• division.
Corollary 2.30 — Closure under algebraic operations (Karr, 1993, pp. 48–49)
Let X : Ω → IR be a r.v. Then
X + = max{X, 0} (2.18)
X − = − min{X, 0}, (2.19)
the positive and negative parts of X (respectively), are non-negative r.v., and so is
|X| = X + + X − . (2.20)
•
Theorem 2.32 — Closure under limiting operations (Karr, 1993, p. 49)
Let X1 , X2 , . . . be r.v. Then sup Xn , inf Xn , lim sup Xn and lim inf Xn are r.v.
Consequently, if X = limn→+∞ Xn exists, it is also a r.v.; recall that when the limit
exists, X = lim sup Xn = lim inf Xn (Karr, 1993,
p. 49). •
Definition 2.36 — Borel measurable function (Karr, 1993, p. 66)
A function g : IRn → IRm (for fixed n, m ∈ IN ) is Borel measurable if
g −1 (B) ∈ B(IRn ) for every B ∈ B(IRm ).
• A function g : IRn → IRm is Borel measurable iff each of its components is Borel
measurable as a function from IRn to IR.
• Moreover, the class of Borel measurable functions has the same closure properties under
algebraic and limiting operations as the family of r.v. on a probability space
(Ω, F, P ). •
2.3 Distributions and distribution functions
The main importance of probability functions on IR is that they are distributions of r.v.
3. SX (x) = 1 − FX (x) = PX ((x, +∞)) = P (X −1 ((x, +∞))) = P ({X > x}), x ∈ IR, is
the survival (or survivor) function of X. •
Definition 2.46 — Identically distributed r.v. (Karr, 1993, p. 52)
Let X and Y be two r.v. Then X and Y are said to be identically distributed — written
X =d Y — if FX (x) = FY (x) for all x ∈ IR.
Definition 2.47 — Equal r.v. almost surely (Karr, 1993, p. 52; Resnick, 1999, p.
167)
Let X and Y be two r.v. Then X is equal to Y almost surely — written X =a.s. Y — if
P ({X = Y }) = 1.
Remark 2.48 — Identically distributed r.v. vs. equal r.v. almost surely (Karr,
1993, p. 52)
Equality in distribution of X and Y has no bearing on their equality as functions on Ω,
i.e.
X =d Y ⇏ X =a.s. Y, (2.32)
even though
X =a.s. Y ⇒ X =d Y. (2.33)
Example 2.49 — Identically distributed r.v. vs. equal r.v. almost surely
• X ∼ Bernoulli(0.5)
P ({X = 0}) = P ({X = 1}) = 0.5
• Y = 1 − X ∼ Bernoulli(0.5) since
P ({Y = 0}) = P ({1 − X = 0}) = P ({X = 1}) = 0.5
P ({Y = 1}) = P ({1 − X = 1}) = P ({X = 0}) = 0.5
• X =d Y but X ≠a.s. Y . •
Exercise 2.50 — Identically distributed r.v. vs. equal r.v. almost surely
Prove that X =a.s. Y ⇒ X =d Y . •
FXi (x) = lim_{xj →+∞, j≠i} F(X1 ,...,Xd ) (x1 , . . . , xi−1 , x, xi+1 , . . . , xd ). (2.35)
Definition 2.56 — Discrete random vector (Karr, 1993, pp. 53–54)
The random vector X = (X1 , . . . , Xd ) is said to be discrete if X1 , . . . , Xd are discrete r.v.
i.e. if there is a countable set C ⊂ IRd such that P ({X ∈ C}) = 1. •
Definition 2.57 — Absolutely continuous random vector (Karr, 1993, pp. 53–54)
The random vector X = (X1 , . . . , Xd ) is absolutely continuous if there is a non-negative
function fX : IRd → IR0+ such that
FX (x) = ∫_{−∞}^{x1} · · · ∫_{−∞}^{xd} fX (s1 , . . . , sd ) dsd . . . ds1 , (2.36)
for every x = (x1 , . . . , xd ) ∈ IRd . fX is called the joint density function of (X1 , . . . , Xd ). •
2.4 Key r.v. and random vectors and distributions
2.4.1 Discrete r.v. and random vectors
Integer-valued r.v. like the Bernoulli, binomial, geometric, negative binomial,
hypergeometric and Poisson, and integer-valued random vectors like the multinomial are
discrete r.v. and random vectors of great interest.
• Discrete uniform distribution

Notation X ∼ Uniform({x1 , x2 , . . . , xn })
Parameter {x1 , x2 , . . . , xn } (xi ∈ IR, i = 1, . . . , n)
Range {x1 , x2 , . . . , xn }
P.f. P ({X = x}) = 1/n, x = x1 , x2 , . . . , xn

This simple r.v. has the form X = Σ_{i=1}^{n} xi × 1{X=xi } .
• Bernoulli distribution
Notation X ∼ Bernoulli(p)
Parameter p = P (success) (p ∈ [0, 1])
Range {0, 1}
P.f. P ({X = x}) = px (1 − p)1−x , x = 0, 1
• Binomial distribution
Notation X ∼ Binomial(n, p)
Parameters n = number of Bernoulli trials (n ∈ IN )
           p = P (success) (p ∈ [0, 1])
Range {0, 1, . . . , n}
P.f. P ({X = x}) = C(n, x) px (1 − p)n−x , x = 0, 1, . . . , n
The binomial r.v. results from the sum of n i.i.d. Bernoulli distributed r.v.
• Geometric distribution
Notation X ∼ Geometric(p)
Parameter p = P (success) (p ∈ [0, 1])
Range IN = {1, 2, 3, . . .}
P.f. P ({X = x}) = (1 − p)x−1 p, x = 1, 2, 3, . . .
P ({X > k + x}|{X > k}) = P ({X > x}), ∀k, x ∈ IN. (2.38)
Notation X ∼ NegativeBinomial(r, p)
Parameters r = pre-specified number of successes (r ∈ IN )
           p = P (success) (p ∈ [0, 1])
Range {r, r + 1, . . .}
P.f. P ({X = x}) = C(x − 1, r − 1) (1 − p)x−r pr , x = r, r + 1, . . .
The negative binomial r.v. results from the sum of r i.i.d. geometrically distributed
r.v.
• Hypergeometric distribution
Notation X ∼ Hypergeometric(N, M, n)
Parameters N = population size (N ∈ IN )
M = sub-population size (M ∈ IN, M ≤ N )
n = sample size (n ∈ IN, n ≤ N )
Range {max{0, n − N + M }, . . . , min{n, M }}
P.f. P ({X = x}) = C(M, x) C(N − M, n − x)/C(N, n), x = max{0, n − N + M }, . . . , min{n, M }
• Poisson distribution
Notation X ∼ Poisson(λ)
Parameter λ (λ ∈ IR+ )
Range IN0 = {0, 1, 2, 3, . . .}
P.f. P ({X = x}) = e−λ λx /x!, x = 0, 1, 2, 3, . . .
• Multinomial distribution
In probability theory, the multinomial distribution is a generalization of the binomial
distribution when we are dealing not only with two types of events — a success with
probability p and a failure with probability 1 − p — but with d types of events with
probabilities p1 , . . . , pd such that p1 , . . . , pd ≥ 0 and Σ_{i=1}^{d} pi = 1.4
2 http://en.wikipedia.org/wiki/Poisson distribution
3 http://en.wikipedia.org/wiki/Ladislaus Bortkiewicz
4 http://en.wikipedia.org/wiki/Multinomial distribution
Exercise 2.60 — Binomial r.v. (Grimmett and Stirzaker, 2001, p. 25)
DNA fingerprinting — In a certain style of detective fiction, the sleuth is required to
declare “the criminal has the unusual characteristics . . . ; find this person and you have your
man”. Assume that any given individual has these unusual characteristics with probability
10−7 (independently of all other individuals), and the city in question has 107 inhabitants.
Given that the police inspector finds such a person, what is the probability that there
is at least one other? •
Exercise 2.65 — Negative hypergeometric r.v. (Grimmett and Stirzaker, 2001, p.
19)
Capture-recapture — A population of N animals has had a number M of its members
captured, marked, and released. Let X be the number of animals it is necessary to
recapture (without re-release) in order to obtain r marked animals.
Show that
P ({X = x}) = (M/N ) × C(M − 1, r − 1) C(N − M, x − r)/C(N − 1, x − 1). (2.40)
•
• Y ∼ Poisson(λ)
then Xi ∼ Poisson(λpi ), i = 1, . . . , d. •
Exercise 2.67 — Relating the p.f. of the negative binomial and binomial r.v.
Let X ∼ NegativeBinomial(r, p) and Y ∼ Binomial(x − 1, p). Prove that, for x = r, r +
1, r + 2, . . . and r = 1, 2, 3, . . ., we get
P (X = x) = p × P (Y = r − 1)
          = p × [FBinomial(x−1,p) (r − 1) − FBinomial(x−1,p) (r − 2)]. (2.41)
Exercise 2.68 — Relating the d.f. of the negative binomial and binomial r.v.
Let X ∼ NegativeBinomial(r, p), Y ∼ Binomial(x, p) and Z = x − Y ∼ Binomial(x, 1 − p).
Prove that, for x = r, r + 1, r + 2, . . . and r = 1, 2, 3, . . ., we have
FNegativeBinomial(r,p) (x) = P (X ≤ x)
                           = P (Y ≥ r)
                           = 1 − FBinomial(x,p) (r − 1)
                           = P (Z ≤ x − r)
                           = FBinomial(x,1−p) (x − r). (2.42)
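Relations (2.41) and (2.42) are easy to verify numerically; a sketch with the hypothetical parameters r = 3, p = 0.4:

```python
from math import comb

def negbin_pf(x, r, p):
    """P(X = x) for X ~ NegativeBinomial(r, p), x = r, r+1, ..."""
    return comb(x - 1, r - 1) * (1 - p) ** (x - r) * p ** r

def binom_cdf(k, n, p):
    """F_Binomial(n, p)(k)."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(0, k + 1))

r, p = 3, 0.4
# (2.41): P(X = x) = p * P(Y = r - 1) with Y ~ Binomial(x - 1, p)
for x in range(r, 20):
    lhs = negbin_pf(x, r, p)
    rhs = p * comb(x - 1, r - 1) * p ** (r - 1) * (1 - p) ** (x - r)
    assert abs(lhs - rhs) < 1e-12
# (2.42): F_NegBin(r, p)(x) = 1 - F_Binomial(x, p)(r - 1)
x = 10
cdf_negbin = sum(negbin_pf(k, r, p) for k in range(r, x + 1))
assert abs(cdf_negbin - (1 - binom_cdf(r - 1, x, p))) < 1e-12
```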
2.4.2 Absolutely continuous r.v. and random vectors
• Uniform distribution on the interval [a, b]
Notation X ∼ Uniform(a, b)
Parameters a = minimum value (a ∈ IR)
b = maximum value (b ∈ IR, a < b)
Range [a, b]
P.d.f. fX (x) = 1/(b − a), a ≤ x ≤ b
• Beta distribution
In probability theory and statistics, the beta distribution is a family of continuous
probability distributions defined on the interval [0, 1] parameterized by two positive
shape parameters, typically denoted by α and β. In Bayesian statistics, it can be
seen as the posterior distribution of the parameter p of a binomial distribution,
if the prior distribution of p was uniform. It is also used in information theory,
particularly for the information theoretic performance analysis for a communication
system.
Notation X ∼ Beta(α, β)
Parameters α (α ∈ IR+ )
β (β ∈ IR+ )
Range [0, 1]
P.d.f. fX (x) = [1/B(α, β)] xα−1 (1 − x)β−1 , 0 ≤ x ≤ 1
where
B(α, β) = ∫_0^1 xα−1 (1 − x)β−1 dx (2.43)
        = Γ(α)Γ(β)/Γ(α + β), (2.44)
where
Γ(α) = ∫_0^{+∞} y α−1 e−y dy. (2.45)
A beta r.v. can also be defined on an arbitrary interval [a, b]; its p.d.f. is then
fY (y) = [1/B(α, β)] (y − a)α−1 (b − y)β−1 /(b − a)α+β−1 , a ≤ y ≤ b. (2.46)
The p.d.f. of this distribution can take various forms on account of the shape
parameters α and β, as illustrated by the following graph and table:
(a) Prove that the d.f. of the r.v. X ∼ Beta(α, β) can be written in terms of the
d.f. of Binomial r.v. when α and β are integer-valued:
(b) Prove that the p.d.f. of the r.v. X ∼ Beta(α, β) can be rewritten in terms of the
p.f. of the r.v. Y ∼ Binomial(α + β − 2, x), when α and β are integer-valued:
fBeta(α,β) (x) = (α + β − 1) × P (Y = α − 1)
              = (α + β − 1) × [FBinomial(α+β−2,x) (α − 1) − FBinomial(α+β−2,x) (α − 2)]. (2.48)
• Normal distribution
The normal distribution or Gaussian distribution is a continuous probability
distribution that describes data that cluster around a mean or average. The graph of
the associated probability density function is bell-shaped, with a peak at the mean,
and is known as the Gaussian function or bell curve. The Gaussian distribution
is one of many things named after Carl Friedrich Gauss, who used it to analyze
astronomical data, and determined the formula for its probability density function.
However, Gauss was not the first to study this distribution or the formula for its
density function; that had been done earlier by Abraham de Moivre.
Notation X ∼ Normal(µ, σ 2 )
Parameters µ (µ ∈ IR)
σ 2 (σ 2 ∈ IR+ )
Range IR
P.d.f. fX (x) = [1/(√(2π) σ)] exp[−(x − µ)2 /(2σ 2 )], −∞ < x < +∞
The normal distribution can be used to describe, at least approximately, any variable
that tends to cluster around the mean. For example, the heights of adult males in
the United States are roughly normally distributed, with a mean of about 1.8 m.
Most men have a height close to the mean, though a small number of outliers have
a height significantly above or below the mean. A histogram of male heights will
appear similar to a bell curve, with the correspondence becoming closer if more data
are used. (http://en.wikipedia.org/wiki/Normal distribution).
and
FX (x) = P (X ≤ x)
       = P (Z = (X − µ)/σ ≤ (x − µ)/σ)
       = Φ((x − µ)/σ). (2.50)
• Exponential distribution
The exponential distributions are a class of continuous probability distributions.
They tend to be used to describe the times between events in a Poisson process,
i.e. a process in which events occur continuously and independently at a constant
average rate (http://en.wikipedia.org/wiki/Exponential distribution).
Notation X ∼ Exponential(λ)
Parameter λ = inverse of the scale parameter (λ ∈ IR+ )
Range IR0+ = [0, +∞)
P.d.f. fX (x) = λ e−λx , x ≥ 0
Equivalently,
P ({X > t + s}|{X > t}) = P ({X > s}), ∀s, t ≥ 0.
This property is referred to as lack of memory: no matter how old your equipment
is, its remaining life has the same distribution as a new one’s.
The exponential (resp. geometric) distribution is the only absolutely continuous
(resp. discrete) distribution satisfying this property.
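The lack-of-memory property can be illustrated by simulation (a sketch; the rate and thresholds below are arbitrary choices):

```python
import random

# Check P(X > t + s | X > t) = P(X > s) empirically for X ~ Exponential(lam).
random.seed(1)
lam, t, s = 2.0, 0.5, 0.3
sample = [random.expovariate(lam) for _ in range(200_000)]

survivors = [x for x in sample if x > t]
cond = sum(x > t + s for x in survivors) / len(survivors)   # P(X > t+s | X > t)
uncond = sum(x > s for x in sample) / len(sample)           # P(X > s)
assert abs(cond - uncond) < 0.02   # both close to e^{-lam*s} = e^{-0.6}
```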
• Gamma distribution
The gamma distribution is frequently a probability model for waiting
times; for instance, in life testing, the waiting time until death is a
random variable that is frequently modeled with a gamma distribution
(http://en.wikipedia.org/wiki/Gamma distribution).
Notation X ∼ Gamma(α, β)
Parameters α = shape parameter (α ∈ IR+ )
β = inverse of the scale parameter (β ∈ IR+ )
Range IR0+ = [0, +∞)
P.d.f. fX (x) = [β α /Γ(α)] xα−1 e−βx , x ≥ 0
Special cases
It is possible to relate the d.f. of X ∼ Erlang(n, β) with the d.f. of a Poisson r.v.:
FErlang(n,β) (x) = Σ_{i=n}^{+∞} e−βx (βx)i /i!
                = 1 − FPoisson(βx) (n − 1), x > 0, n ∈ IN. (2.53)
The graphical representation of the joint density of a random vector with a bivariate
standard normal distribution follows — it depends on the parameter ρ.
[Graph and contour plot of the joint p.d.f. of a bivariate standard normal.]
Case: Graph and contour plot of the joint p.d.f. of a bivariate STANDARD normal (cont.)
ρ < 0 Ellipses centered in (0, 0) and asymmetric in relation to the axes,
suggesting that X2 decreases when X1 increases.
[Graphs and contour plots omitted.]
2.5 Transformation theory
2.5.1 Transformations of r.v., general case
Motivation 2.70 — Transformations of r.v., general case (Karr, 1993, p. 60)
Let:
• X be a r.v. with d.f. FX ;
• Y = g(X), where g : IR → IR is a Borel measurable function;
• g −1 ((−∞, y]) = {x ∈ IR : g(x) ≤ y} be the inverse image of the Borel set (−∞, y]
under g.
Then
FY (y) = P ({Y ≤ y}) = P ({X ∈ g −1 ((−∞, y])}), y ∈ IR.
Thus, we are able to write
P ({Y ∈ B}) = P ({g(X) ∈ B}) = P ({X ∈ g −1 (B)}). (2.56)
•
(b) aX + b
(c) eX . •
6 The electrical resistance of an object is a measure of its opposition to the
passage of a steady electric current. The SI unit of electrical resistance is the ohm
(http://en.wikipedia.org/wiki/Electrical resistance).
7 Electrical conductance is a measure of how easily electricity flows along a certain path through an
electrical element. The SI derived unit of conductance is the siemens (also called the mho, because it is
the reciprocal of electrical resistance, measured in ohms). Oliver Heaviside coined the term in September
1885 (http://en.wikipedia.org/wiki/Electrical conductance).
Exercise 2.77 — D.f. of a transformation of a r.v., absolutely continuous case
Let X ∼ Uniform(0, 2π) and Y = sin X. Prove that
FY (y) = 0 if y < −1; 1/2 + (arcsin y)/π if −1 ≤ y ≤ 1; 1 if y > 1. (2.58)
•
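A simulation sketch comparing the empirical d.f. of Y = sin X with the closed form (2.58):

```python
import random
from math import sin, asin, pi

random.seed(2)
sample = [sin(random.uniform(0, 2 * pi)) for _ in range(100_000)]

def F_Y(y):
    """D.f. (2.58) of Y = sin X, X ~ Uniform(0, 2*pi)."""
    if y < -1:
        return 0.0
    if y > 1:
        return 1.0
    return 0.5 + asin(y) / pi

for y in (-0.9, -0.5, 0.0, 0.5, 0.9):
    emp = sum(v <= y for v in sample) / len(sample)
    assert abs(emp - F_Y(y)) < 0.01
```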
Proposition 2.82 — P.f. of a transformation of a discrete r.v. (Murteira, 1979,
p. 122)
Let:
• X be a discrete r.v. with p.f. P ({X = x});
• Y = g(X), where g : IR → IR is a Borel measurable function.
Then
P ({Y = y}) = Σ_{x: g(x)=y} P ({X = x}),
for y ∈ RY . •
2.5.3 Transformations of absolutely continuous r.v.
Proposition 2.84 — D.f. of a strictly monotonic transformation of an
absolutely continuous r.v. (Karr, 1993, pp. 60 and 68)
Let:
• X be an absolutely continuous r.v. with d.f. FX and p.d.f. fX ;
• Y = g(X), where g is a continuous, strictly increasing, Borel measurable function.
Then
FY (y) = FX (g −1 (y)),
for y ∈ RY . Similarly, if
• g is a continuous, strictly decreasing, Borel measurable function
then
FY (y) = 1 − FX (g −1 (y)),
for y ∈ RY . •
Remark 2.87 — Transformations of absolutely continuous and discrete r.v.
(Karr, 1993, p. 61)
In general, Y = g(X) need not be absolutely continuous even when X is, as shown in the
next exercise, while if X is a discrete r.v. then so is Y = g(X) regardless of the Borel
measurable function g. •
Exercise 2.88 shows that we need some conditions on g to ensure that Y = g(X) is
also an absolutely continuous r.v. This will be the case when g is a continuous monotonic
function.
Then Y = g(X) is an absolutely continuous r.v. with p.d.f. given by
fY (y) = fX [g −1 (y)] × |dg −1 (y)/dy|, (2.65)
for y ∈ RY . •
∀y ∈ RY . •
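Formula (2.65) in action for the hypothetical choice X ∼ Uniform(0, 1) and the strictly increasing map g(x) = e^x, for which f_Y(y) = f_X(ln y) · |d(ln y)/dy| = 1/y on [1, e], hence F_Y(y) = ln y:

```python
import random
from math import exp, log, e

# Simulate Y = e^X with X ~ Uniform(0, 1) and compare the empirical d.f.
# with F_Y(y) = ln y implied by the change-of-variables density 1/y.
random.seed(3)
sample = [exp(random.random()) for _ in range(100_000)]

for y in (1.2, 1.8, 2.5):
    emp = sum(v <= y for v in sample) / len(sample)
    assert abs(emp - log(y)) < 0.01
```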
Remark 2.94 — P.d.f. of a non monotonic transformation of an absolutely
continuous r.v. (Rohatgi, 1976, p. 71)
In practice Theorem 2.89 is quite useful, but whenever its conditions are violated we
should return to P ({Y ≤ y}) = P ({X ∈ g −1 ((−∞, y])}) to obtain FY (y) and then
differentiate this d.f. to derive the p.d.f. of the transformation Y . This is the case in the
next two exercises. •
Theorem 2.98 — P.d.f. of a finite sum of monotonic restrictions of a function
g of an absolutely continuous r.v. (Rohatgi, 1976, pp. 73–74)
Let:
• X be an absolutely continuous r.v. with p.d.f. fX ;
• Y = g(X) be a transformation of X under g, where g : IR → IR is a Borel measurable
function that transforms RX onto some set RY = g(RX ).
Moreover, suppose that:
• g(x) is differentiable for all x ∈ RX ;
dg(x)
• dx
is continuous and nonzero at all points of RX but a finite number of x.
Then, for every real number y ∈ RY ,
(a) there is a positive integer n = n(y) and real numbers (inverses) g1−1 (y), . . . , gn−1 (y)
such that
g(x)|x=g_k^{−1}(y) = y and dg(x)/dx|x=g_k^{−1}(y) ≠ 0, k = 1, . . . , n(y), (2.72)
or
(b) there is no x such that g(x) = y and dg(x)/dx ≠ 0, in which case we write n =
n(y) = 0.
In addition, Y = g(X) is an absolutely continuous r.v. with p.d.f. given by
fY (y) = Σ_{k=1}^{n(y)} fX [gk−1 (y)] × |dgk−1 (y)/dy|, if n = n(y) > 0;
fY (y) = 0, if n = n(y) = 0, (2.73)
for y ∈ RY . •
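A classic illustration (not from the text) of a transformation with two inverses: g(x) = x² applied to X ∼ Normal(0, 1). The inverses ±√y give the chi-square density with one degree of freedom, which we validate against the known d.f. P(Y ≤ y) = erf(√(y/2)) by numerical integration:

```python
from math import sqrt, exp, pi, erf

def f_Y(y):
    """Density of Y = X^2, X ~ N(0,1), from formula (2.73) with two inverses."""
    phi = lambda x: exp(-x * x / 2) / sqrt(2 * pi)   # standard normal p.d.f.
    return (phi(sqrt(y)) + phi(-sqrt(y))) / (2 * sqrt(y))

# Integrate f_Y over [a, b] (midpoint rule) and compare with the exact d.f.
a, b, n = 0.1, 2.0, 20_000
h = (b - a) / n
integral = sum(f_Y(a + (i + 0.5) * h) * h for i in range(n))
assert abs(integral - (erf(sqrt(b / 2)) - erf(sqrt(a / 2)))) < 1e-6
```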
Motivation 2.101 — P.d.f. of a countable sum of monotonic restrictions of a
function g of an absolutely continuous r.v.
The formula P ({Y ≤ y}) = P ({X ∈ g −1 ((−∞, y])}) and the countable additivity of
probability functions allow us to compute the p.d.f. of Y = g(X) in some instances even
if g has a countable number of inverses. •
9 We remind the reader that term-by-term differentiation is permissible if the differentiated series is
uniformly convergent.
2.5.4 Transformations of random vectors, general case
What follows is the analogue of Proposition 2.71 in a multidimensional setting.
Then
FY (y) = P ({Y1 ≤ y1 , . . . , Ym ≤ ym })
       = P ({X ∈ g −1 ((−∞, y1 ] × · · · × (−∞, ym ])}). (2.79)
• Y = (Y1 , . . . , Yd ) = g(X) = (g1 (X1 , . . . , Xd ), . . . , gd (X1 , . . . , Xd )) be a
transformation of X under g, where g : IRd → IRd is a one-to-one Borel measurable
function that maps RX onto some set RY ⊂ IRd ;
• g −1 be the inverse mapping such that g −1 (y) = (g1−1 (y), . . . , gd−1 (y)).
for y = (y1 , . . . , yd ) ∈ RY . •
for y = (y1 , . . . , yd ) ∈ RY . •
Exercise 2.109 — Joint p.f. of a transformation of a discrete random vector
Let X = (X1 , X2 ) be a discrete random vector with joint p.f. P ({X1 = x1 , X2 = x2 }) given in
the following table:

            X2
X1     −2      0      2
−1     1/6     1/6    1/12
 0     1/12    1/12   0
 1     1/6     1/6    1/12
Theorem 2.110 — P.f. of the sum, difference, product and division of two
discrete r.v.
Let:
• (X, Y ) be a discrete bidimensional random vector with joint p.f. P (X = x, Y = y);
• Z =X +Y
• U =X −Y
• V = X Y
• W = X/Y .
Then
P (Z = z) = P (X + Y = z)
          = Σx P (X = x, X + Y = z)
          = Σx P (X = x, Y = z − x)
          = Σy P (X + Y = z, Y = y)
          = Σy P (X = z − y, Y = y) (2.82)

P (U = u) = P (X − Y = u)
          = Σx P (X = x, X − Y = u)
          = Σx P (X = x, Y = x − u)
          = Σy P (X − Y = u, Y = y)
          = Σy P (X = u + y, Y = y) (2.83)

P (V = v) = P (X Y = v)
          = Σx P (X = x, X Y = v)
          = Σx P (X = x, Y = v/x)
          = Σy P (X Y = v, Y = y)
          = Σy P (X = v/y, Y = y) (2.84)

P (W = w) = P (X/Y = w)
          = Σx P (X = x, X/Y = w)
          = Σx P (X = x, Y = x/w)
          = Σy P (X/Y = w, Y = y)
          = Σy P (X = wy, Y = y). (2.85)
•
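Formula (2.82) is a finite sum in the discrete case; a sketch with the hypothetical example of two independent fair dice:

```python
from collections import defaultdict

# Joint p.f. of two independent fair dice (hypothetical example).
joint = {(x, y): 1 / 36 for x in range(1, 7) for y in range(1, 7)}

# P(Z = z) = sum over x of P(X = x, Y = z - x), i.e. accumulate over the table.
pf_Z = defaultdict(float)
for (x, y), p in joint.items():
    pf_Z[x + y] += p

assert abs(pf_Z[7] - 6 / 36) < 1e-12       # seven is the most likely total
assert abs(sum(pf_Z.values()) - 1) < 1e-12
```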
Exercise 2.111 — P.f. of the difference of two discrete r.v.
Let (X, Y ) be a discrete random vector with joint p.f. P (X = x, Y = y) given in the
following table:

          Y
X      1      2      3
1      1/12   1/12   2/12
2      2/12   0      0
3      1/12   1/12   4/12
(a) Prove that X and Y are identically distributed but are not independent.
(b) Obtain the p.f. of U = X − Y
(c) Prove that U = X − Y is not a symmetric r.v., that is, U and −U are not identically
distributed. •
Corollary 2.112 — P.f. of the sum, difference, product and division of two
independent discrete r.v.
Let:
• X and Y be two independent discrete r.v. with joint p.f. P (X = x, Y = y) =
P (X = x) × P (Y = y), ∀x, y
• Z =X +Y
• U =X −Y
• V = X Y
• W = X/Y .
Then
P (Z = z) = P (X + Y = z)
          = Σx P (X = x) × P (Y = z − x)
          = Σy P (X = z − y) × P (Y = y) (2.86)

P (U = u) = P (X − Y = u)
          = Σx P (X = x) × P (Y = x − u)
          = Σy P (X = u + y) × P (Y = y) (2.87)

P (V = v) = P (X Y = v)
          = Σx P (X = x) × P (Y = v/x)
          = Σy P (X = v/y) × P (Y = y) (2.88)

P (W = w) = P (X/Y = w)
          = Σx P (X = x) × P (Y = x/w)
          = Σy P (X = wy) × P (Y = y). (2.89)
Exercise 2.113 — P.f. of the sum of two independent r.v. with three well
known discrete distributions
Let X and Y be two independent discrete r.v. Prove that
(a) X ∼ Binomial(nX , p) ⊥⊥ Y ∼ Binomial(nY , p) ⇒ (X + Y ) ∼ Binomial(nX + nY , p)
(b) X ∼ NegativeBinomial(nX , p) ⊥⊥ Y ∼ NegativeBinomial(nY , p) ⇒ (X + Y ) ∼
NegativeBinomial(nX + nY , p)
(c) X ∼ Poisson(λX ) ⊥⊥ Y ∼ Poisson(λY ) ⇒ (X + Y ) ∼ Poisson(λX + λY ),
i.e. the families of Poisson, Binomial and Negative Binomial distributions are closed under
summation of independent members.11 •
2.5.6 Transformations of absolutely continuous random vectors
Motivation 2.116 — P.d.f. of a transformation of an absolutely continuous
random vector (Karr, 1993, p. 62)
Recall that a random vector X = (X1 , . . . , Xd ) is absolutely continuous if there is a
function fX on IRd satisfying
FX (x) = ∫_{−∞}^{x1} · · · ∫_{−∞}^{xd} fX (s1 , . . . , sd ) dsd . . . ds1 , x ∈ IRd .
Computing the density of Y = g(X) requires that g be invertible, except for the special
case that X1 , . . . , Xd are independent (and then only for particular choices of g). •
for y = (y1 , . . . , yd ) ∈ RY .
Then the random vector Y = (Y1 , . . . , Yd ) is absolutely continuous and its joint p.d.f. is
given by
fY (y) = fX [g −1 (y)] × |J(y)|, (2.94)
for y = (y1 , . . . , yd ) ∈ RY . •
Theorem 2.122 — P.d.f. of a transformation, with a finite number of inverses,
of an absolutely continuous random vector (Rohatgi, 1976, pp. 136–137)
Assume the conditions of Theorem 2.117 and suppose that:
• for each y ∈ RY ⊂ IRd , the transformation g has a finite number k = k(y) of
inverses;
• RX ⊂ IRd can be partitioned into k disjoint sets, A1 , . . . , Ak , such that the
transformation g from Ai (i = 1, . . . , k) into IRd , say g i , is one-to-one with inverse
transformation g i −1 (y) = (g1i −1 (y), . . . , gdi −1 (y)), i = 1, . . . , k;
• the first partial derivatives of g i −1 exist, are continuous, and each Jacobian
Ji (y) = det [ ∂g1i −1 (y)/∂y1 · · · ∂g1i −1 (y)/∂yd ; . . . ; ∂gdi −1 (y)/∂y1 · · · ∂gdi −1 (y)/∂yd ] ≠ 0, (2.95)
for y = (y1 , . . . , yd ) ∈ RY . •
Theorem 2.123 — P.d.f. of the sum, difference, product and division of two
absolutely continuous r.v. (Rohatgi, 1976, p. 141)
Let:
• (X, Y ) be an absolutely continuous bidimensional random vector with joint p.d.f.
fX,Y (x, y);
• Z = X + Y , U = X − Y , V = X Y and W = X/Y .
Then

fU (u) = fX−Y (u)
       = ∫_{−∞}^{+∞} fX,Y (x, x − u) dx
       = ∫_{−∞}^{+∞} fX,Y (u + y, y) dy (2.98)
Remark 2.124 — P.d.f. of the sum and product of two absolutely continuous
r.v.
It is interesting to note that:
fZ (z) = dFZ (z)/dz
       = d P (Z = X + Y ≤ z)/dz
       = d/dz [ ∫∫_{{(x,y): x+y≤z}} fX,Y (x, y) dy dx ]
       = d/dz [ ∫_{−∞}^{+∞} ∫_{−∞}^{z−x} fX,Y (x, y) dy dx ]
       = ∫_{−∞}^{+∞} d/dz [ ∫_{−∞}^{z−x} fX,Y (x, y) dy ] dx
       = ∫_{−∞}^{+∞} fX,Y (x, z − x) dx; (2.101)

fV (v) = dFV (v)/dv
       = d P (V = X Y ≤ v)/dv
       = d/dv [ ∫∫_{{(x,y): xy≤v}} fX,Y (x, y) dy dx ]
       = ∫_{−∞}^{+∞} d/dv [ ∫_{−∞}^{v/x} fX,Y (x, y) dy ] dx, for x > 0,
         and ∫_{−∞}^{+∞} d/dv [ ∫_{v/x}^{+∞} fX,Y (x, y) dy ] dx, for x < 0,
       = ∫_{−∞}^{+∞} (1/|x|) fX,Y (x, v/x) dx. (2.102)
•
Corollary 2.125 — P.d.f. of the sum, difference, product and division of two
independent absolutely continuous r.v. (Rohatgi, 1976, p. 141)
Let:
• X and Y be two independent absolutely continuous r.v. with joint p.d.f.
fX,Y (x, y) = fX (x) × fY (y), ∀x, y;
• Z = X + Y , U = X − Y , V = X Y and W = X/Y .
Then
fW (w) = ∫_{−∞}^{+∞} fX (wy) × fY (y) × |y| dy. (2.106)
(b) X + Y ;
(c) X − Y . •
Remark 2.128 — D.f. and p.d.f. of the sum, difference, product and division
of two absolutely continuous r.v.
In several cases it is simpler to obtain the d.f. of those four algebraic functions of X and
Y than to derive the corresponding p.d.f. It suffices to apply Proposition 2.104 and then
differentiate the d.f. to get the p.d.f., as seen in the next exercises. •
Exercise 2.129 — D.f. and p.d.f. of the difference of two absolutely continuous
r.v.
Choosing adequate underkeel clearance (UKC) is one of the most crucial and most difficult
problems in the navigation of large ships, especially very large crude oil carriers.
Let X be the water depth in a passing shallow waterway, say a harbour or a channel,
and Y be the maximum ship draft. Then the probability of safely passing a shallow
waterway can be expressed as P (UKC = X − Y > 0).
Assume that X and Y are independent r.v. such that X ∼ Gamma(n, β) and Y ∼
Gamma(m, β), where n, m ∈ IN and m < n. Derive an expression for P (UKC = X − Y >
0), taking into account that FGamma(k,β) (x) = Σ_{i=k}^{+∞} e−βx (βx)i /i!, k ∈ IN . •
Exercise 2.130 — D.f. and p.d.f. of the sum of two absolutely continuous r.v.
Let X and Y be the durations of two independent system components set in what is called
a stand by connection.12 In this case the system duration is given by X + Y .
Prove that the p.d.f. of X + Y equals
fX+Y (z) = αβ (e−βz − e−αz )/(α − β), z > 0,
if X ∼ Exponential(α) and Y ∼ Exponential(β), where α, β > 0 and α ≠ β. •
12 At time 0, only the component with duration X is on. The component with duration Y replaces the
other one as soon as it fails.
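A simulation check of the p.d.f. stated in Exercise 2.130, done through the corresponding d.f. (a sketch with the hypothetical rates α = 2, β = 1):

```python
import random
from math import exp

random.seed(4)
alpha, beta = 2.0, 1.0
sample = [random.expovariate(alpha) + random.expovariate(beta)
          for _ in range(100_000)]

def F_sum(z):
    """Antiderivative of alpha*beta*(e^{-beta z} - e^{-alpha z})/(alpha - beta)."""
    return 1 - (alpha * exp(-beta * z) - beta * exp(-alpha * z)) / (alpha - beta)

for z in (0.5, 1.0, 2.0):
    emp = sum(v <= z for v in sample) / len(sample)
    assert abs(emp - F_sum(z)) < 0.01
```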
2.5.7 Random variables with prescribed distributions
The construction of a r.v. with a prescribed d.f. depends on the following definition.
(a) X ∼ Exponential(λ);
(b) X ∼ Bernoulli(θ).
Remark 2.137 — Existence of a quantile function (Karr, 1993, p. 63)
Even though F need be neither continuous nor strictly increasing, F −1 always exists.
As the figure of the quantile function (associated with the d.f.) of X ∼ Bernoulli(θ)
shows, F −1 jumps where F is flat, and is flat where F jumps.
Although not necessarily a pointwise inverse of F , F −1 serves that role for many
purposes and has a few interesting properties. •
Remark 2.143 — Quantile transformation (Karr, 1993, p. 64)
R.v. with d.f. F can be simulated by applying F −1 to the (uniformly distributed) values
produced by the random number generator.
Feasibility of this technique depends on either having F −1 available in closed form or
being able to approximate it numerically. •
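A minimal sketch of this technique for the exponential distribution, whose quantile function F^{-1}(u) = −ln(1 − u)/λ is available in closed form:

```python
import random
from math import log

# Inverse-transform (quantile) sampling: apply F^{-1} to Uniform(0, 1) values.
random.seed(5)
lam = 3.0
sample = [-log(1 - random.random()) / lam for _ in range(100_000)]

mean = sum(sample) / len(sample)
assert abs(mean - 1 / lam) < 0.01   # E(X) = 1/lam for X ~ Exponential(lam)
```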
13 Let us remind the reader that the sum of n independent exponential r.v. with parameter λ
has an Erlang(n, λ) distribution.
Exercise 2.147 — The quantile transformation and the generation of the Beta
distribution
Let Y and Z be two independent r.v. with distributions Gamma(α, λ) and Gamma(β, λ),
respectively (α, β, λ > 0).
(a) Prove that X = Y /(Y + Z) ∼ Beta(α, β).
(b) Use this result to describe a random number generation method for the Beta(α, β),
where α, β ∈ IN .
(c) Use any software you are familiar with to generate and plot the histogram of 1000
observations from the Beta(4, 5) distribution. •
or, equivalently,
x = 0, if u ≥ p; x = 1, if u < p. (2.115)
(Is there any advantage of (2.115) over (2.114)?) •
Exercise 2.151 — The converse of the quantile transformation
Prove Proposition 2.150 (Karr, 1993, p. 64). •
for all x1 , . . . , xn . •
for each n ∈ IN and x1 , . . . , xn . Then there is a probability space say (Ω, F, P ) and a
sequence of {Xk }k∈IN of r.v. defined on it such that Fn is the d.f. of (X1 , . . . , Xn ), for each
n ∈ IN . •
References
• Gentle, J.E. (1998). Random Number Generation and Monte Carlo Methods.
Springer-Verlag, New York, Inc. (QA298.GEN.50103)
• Grimmett, G.R. and Stirzaker, D.R. (2001). One Thousand Exercises in Probability.
Oxford University Press.
• Righter, R. (200–). Lecture notes for the course Probability and Risk Analysis
for Engineers. Department of Industrial Engineering and Operations Research,
University of California at Berkeley.
Chapter 3
Independence
3.1 Fundamentals
Motivation 3.1 — Independence (Resnick, 1999, p. 91; Karr, 1993, p. 71)
The intuitive appeal of independence stems from the easily envisioned property that the
occurrence of an event has no effect on the probability that an independent event will
occur. Despite the intuitive appeal, it is important to recognize that independence is a
technical concept/definition which must be checked with respect to a specific model.
Independence — or the absence of probabilistic interaction — sets probability apart
as a distinct mathematical theory. •
(b) A and B are independent iff P (B|A) = P (B|Ac ), where P (A) ∈ (0, 1). •
Exercise 3.4 — (In)dependence and disjoint events
Let A and B be two disjoint events with probabilities P (A), P (B) > 0. Show that these
two events are not independent. •
Remark 3.7 — Independence for a finite number of events (Resnick, 1999, p. 92)
Note that (3.2) represents Σ_{k=2}^{n} C(n, k) = 2^n − n − 1 equations and can be rephrased as
follows:
Corollary 3.8 — Independence for a finite number of events (Karr, 1993, p. 81)
Events A1 , . . . , An are independent iff Ac1 , . . . , Acn are independent. •
Exercise 3.9 — Independence for a finite number of events (Exercise 3.1, Karr,
1993, p. 95)
Let A1 , . . . , An be independent events.
(a) Prove that P (∪_{i=1}^{n} Ai ) = 1 − Π_{i=1}^{n} [1 − P (Ai )].
(b) Consider a parallel system with n components and assume that P (Ai ) is the
reliability of the component i (i = 1, . . . , n). What is the system reliability? •
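For part (b), a sketch with hypothetical component reliabilities:

```python
# Reliability of a parallel system with independent components:
# P(union of A_i) = 1 - prod(1 - P(A_i)); the system fails only if ALL fail.
p = [0.9, 0.8, 0.95]       # hypothetical component reliabilities

all_fail = 1.0
for pi in p:
    all_fail *= 1 - pi     # probability that every component fails
reliability = 1 - all_fail

assert abs(reliability - (1 - 0.1 * 0.2 * 0.05)) < 1e-12   # = 0.999
```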
for all Ai ∈ Gi , i = 1, . . . , n. •
Definition 3.16 — π−system (Resnick, 1999, p. 32; Karr, 1993, p. 21)
Let P be a family of subsets of the sample space Ω. P is said to be a π − system if it is closed
under finite intersection: A, B ∈ P ⇒ A ∩ B ∈ P. •
1. Ci is a π − system
2. Ci , i = 1, . . . , n, are independent
then the σ − algebras generated by these n classes of events, σ(C1 ), . . . , σ(Cn ), are
independent. •
1 Synonyms (Resnick, 1999, p. 36): λ − system, σ − additive class, Dynkin class.
2 If A, B ∈ D and A ⊆ B then B\A ∈ D.
3 If A1 ⊆ A2 ⊆ . . . and Ai ∈ D then \bigcup_{i=1}^{+\infty} A_i ∈ D.
Definition 3.22 — Arbitrary number of independent classes (Resnick, 1999, p.
93; Karr, 1993, p. 94)
Let T be an arbitrary index set. The classes {Ct , t ∈ T } are independent if, for each finite
I such that I ⊂ T , {Ct , t ∈ I} are independent.
An infinite collection of σ − algebras is independent if every finite subcollection is
independent. •
3.2 Independent r.v.
The notion of independence for r.v. can be stated in terms of Borel sets. Moreover, basic
independence criteria can be developed based solely on intervals such as (−∞, x].
Independence for r.v. can also be defined in terms of the independence of σ − algebras.
Theorem 3.30 — Independence criterion for a finite number of r.v. (Karr, 1993,
p. 72)
R.v. X1 , . . . , Xn are independent iff
F_{X_1,...,X_n}(x_1, . . . , x_n) = \prod_{i=1}^{n} F_{X_i}(x_i),   (3.7)
for all x_1, . . . , x_n ∈ IR. •
Specialized criteria for discrete and absolutely continuous r.v. follow from Theorem
3.30.
Theorem 3.35 — Independence criterion for discrete r.v. (Karr, 1993, p. 73;
Resnick, 1999, p. 94)
The discrete r.v. X1 , . . . , Xn , with countable ranges R1 , . . . , Rn , are independent iff
P({X_1 = x_1, . . . , X_n = x_n}) = \prod_{i=1}^{n} P({X_i = x_i}),   (3.9)
for all x_i ∈ R_i, i = 1, . . . , n. •
Exercise 3.37 — Independence criterion for discrete r.v.
The number of laptops (X) and PCs (Y ) sold daily in a store have a joint p.f. partially
described in the following table:
Y
X 0 1 2
0 0.1 0.1 0.3
1 0.2 0.1 0.1
2 0 0.1 a
Complete the table and prove that X and Y are not independent r.v. •
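One way to carry out this check numerically — a sketch of the computation, with the table values taken from the exercise:

```python
# Joint p.f. from the table, rows X = 0,1,2, columns Y = 0,1,2;
# the missing entry 'a' is completed so the probabilities sum to 1.
joint = {(0, 0): 0.1, (0, 1): 0.1, (0, 2): 0.3,
         (1, 0): 0.2, (1, 1): 0.1, (1, 2): 0.1,
         (2, 0): 0.0, (2, 1): 0.1}
a = 1.0 - sum(joint.values())
joint[(2, 2)] = a

# Marginal p.f.s of X and Y.
px = {x: sum(p for (i, _), p in joint.items() if i == x) for x in range(3)}
py = {y: sum(p for (_, j), p in joint.items() if j == y) for y in range(3)}

# Theorem 3.35: independence requires P(X=x, Y=y) = P(X=x) P(Y=y) everywhere.
independent = all(abs(joint[(x, y)] - px[x] * py[y]) < 1e-12
                  for x in range(3) for y in range(3))
# independent is False: e.g. P(X=0, Y=0) = 0.1 while P(X=0) P(Y=0) = 0.5 * 0.3
```

A single cell violating the product criterion is enough to rule out independence.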
Example 3.42 — Independent r.v. (Karr, 1993, pp. 75–76)
Independent r.v. are inherent to certain probability structures.
• Binary expansions4
Let P be the uniform distribution on Ω = [0, 1]. Each point ω ∈ Ω has a binary
expansion ω = \sum_{n=1}^{+\infty} d_n(ω) × 2^{-n}, d_n(ω) ∈ {0, 1}; note that
\sum_{n=1}^{+\infty} 2^{-n} × 1 = 1.   (3.14)
4 Or dyadic expansions of uniform random numbers (Resnick, 1999, pp. 98–99).
5 The proof of this result can also be found in Resnick (1999, pp. 99–100).
In other cases, whether r.v. are independent depends on the value of a parameter.
Exercise 3.43 — Bivariate normal distributed r.v. (Karr, 1993, p. 96, Exercise
3.8)
Let (X, Y ) have the bivariate normal p.d.f.
f_{X,Y}(x, y) = \frac{1}{2π\sqrt{1 − ρ^2}} \exp\left\{ −\frac{x^2 − 2ρxy + y^2}{2(1 − ρ^2)} \right\},  (x, y) ∈ IR^2,   (3.19)
Exercise 3.44 — I.i.d. r.v. with absolutely continuous distributions (Karr, 1993,
p. 96, Exercise 3.9)
Let (X, Y ) be an absolutely continuous random vector where X and Y are i.i.d. r.v. with
absolutely continuous d.f. F . Prove that:
(a) P ({X = Y }) = 0;
3.3 Functions of independent r.v.
Motivation 3.45 — Disjoint blocks theorem (Karr, 1993, p. 76)
R.v. that are functions of disjoint subsets of a family of independent r.v. are also
independent. •
• X1 , . . . , Xn be independent r.v.;
Then
Y_1 = g_1(X^{(1)}), . . . , Y_k = g_k(X^{(k)})   (3.20)
Corollary 3.50 — Disjoint blocks theorem (Karr, 1993, p. 77)
Let:
• X1 , . . . , Xn be independent r.v.;
We have already addressed the p.d.f. (or p.f.) of a sum, difference, product or
quotient of two independent absolutely continuous (or discrete) r.v. However, the sum
of independent absolutely continuous r.v. merits special consideration — its p.d.f. has a
specific designation: the convolution of p.d.f.
Then the p.d.f. of X + Y is termed the convolution of the p.d.f. f and g, represented by
f ∗ g and given by
(f ∗ g)(t) = \int_{−\infty}^{+\infty} f(t − s) × g(s) ds.   (3.21)
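As a quick numerical illustration of (3.21) — a sketch, not part of the notes — the convolution of two Exponential(1) densities recovers the Gamma(2, 1) density t e^{−t}:

```python
import math

def convolve_pdfs(f, g, t, n=4000):
    """Numerical convolution (f*g)(t) = int f(t-s) g(s) ds, by the
    trapezoidal rule on [0, t] (both densities vanish on the negatives)."""
    if t <= 0:
        return 0.0
    h = t / n
    ys = [f(t - i * h) * g(i * h) for i in range(n + 1)]
    return h * (sum(ys) - 0.5 * (ys[0] + ys[-1]))

exp_pdf = lambda x: math.exp(-x) if x >= 0 else 0.0

# Sum of two independent Exponential(1) r.v. is Gamma(2, 1), pdf t exp(-t):
val = convolve_pdfs(exp_pdf, exp_pdf, 2.0)
# val ≈ 2 * exp(-2) ≈ 0.2707
```

Here the integrand f(t − s) g(s) = e^{−t} is constant in s, so even a coarse rule is exact up to rounding.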
Exercise 3.54 — Sum of independent binomial distributions
Let X ∼ Binomial(nX , p) and Y ∼ Binomial(nY , p) be independent.
Prove that X + Y ∼ Binomial(nX + nY , p) by using Vandermonde’s identity
(http://en.wikipedia.org/wiki/Vandermonde’s identity).6 •
R.v. | Convolution
X_i ∼indep Binomial(n_i, p), i = 1, . . . , k | \sum_{i=1}^{k} X_i ∼ Binomial(\sum_{i=1}^{k} n_i, p)
X_i ∼indep NegativeBinomial(r_i, p), i = 1, . . . , n | \sum_{i=1}^{n} X_i ∼ NegativeBinomial(\sum_{i=1}^{n} r_i, p)
X_i ∼indep Poisson(λ_i), i = 1, . . . , n | \sum_{i=1}^{n} X_i ∼ Poisson(\sum_{i=1}^{n} λ_i)
X_i ∼indep Normal(µ_i, σ_i^2), i = 1, . . . , n | \sum_{i=1}^{n} X_i ∼ Normal(\sum_{i=1}^{n} µ_i, \sum_{i=1}^{n} σ_i^2);  \sum_{i=1}^{n} c_i X_i ∼ Normal(\sum_{i=1}^{n} c_i µ_i, \sum_{i=1}^{n} c_i^2 σ_i^2)
X_i ∼indep Gamma(α_i, λ), i = 1, . . . , n | \sum_{i=1}^{n} X_i ∼ Gamma(\sum_{i=1}^{n} α_i, λ)
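The first row of the table (the content of Exercise 3.54) can be verified exactly by convolving the two binomial p.f.s (the parameters 4, 6 and p = 0.3 are illustrative choices):

```python
from math import comb

def binom_pmf(n, p, k):
    return comb(n, k) * p**k * (1 - p)**(n - k)

def pmf_convolution(pmf1, n1, pmf2, n2):
    """P.f. of X + Y for independent integer-valued X, Y with
    supports {0, ..., n1} and {0, ..., n2}."""
    return {k: sum(pmf1(i) * pmf2(k - i)
                   for i in range(max(0, k - n2), min(n1, k) + 1))
            for k in range(n1 + n2 + 1)}

p = 0.3
conv = pmf_convolution(lambda i: binom_pmf(4, p, i), 4,
                       lambda i: binom_pmf(6, p, i), 6)
# Vandermonde's identity: the convolution is the Binomial(10, 0.3) p.f.
ok = all(abs(conv[k] - binom_pmf(10, p, k)) < 1e-12 for k in range(11))
```

The inner sum is the discrete analogue of (3.21), with the integral replaced by a sum over the support.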
for |ρ| = |corr(X, Y )| < 1 (see footnote 7).
Prove that X + Y is normally distributed with parameters E(X + Y ) = µX + µY and
V (X + Y ) = V (X) + 2cov(X, Y ) + V (Y ) = σ_X^2 + 2ρσ_X σ_Y + σ_Y^2. •
R.v. | Minimum
X_i ∼indep Geometric(p_i), i = 1, . . . , n | min_{i=1,...,n} X_i ∼ Geometric(1 − \prod_{i=1}^{n} (1 − p_i))
X_i ∼indep Exponential(λ_i), a_i > 0, i = 1, . . . , n | min_{i=1,...,n} a_i X_i ∼ Exponential(\sum_{i=1}^{n} λ_i / a_i)
X_i ∼indep Pareto(b, α_i), i = 1, . . . , n, a > 0 | min_{i=1,...,n} a X_i ∼ Pareto(ab, \sum_{i=1}^{n} α_i)
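The exponential row (in the special case a_i = 1: the minimum of independent exponentials is Exponential(\sum λ_i)) can be cross-checked by simulation. A sketch with illustrative rates; `random.expovariate` is parameterized by the rate:

```python
import math
import random

random.seed(1)
lam = [0.5, 1.0, 1.5]   # illustrative rates
t = 0.4

# Exact: P(min_i X_i > t) = prod_i exp(-lam_i t) = exp(-(sum_i lam_i) t),
# i.e. the minimum is Exponential(sum_i lam_i).
exact = math.exp(-sum(lam) * t)

# Monte Carlo estimate of the same survival probability.
n = 200_000
hits = sum(1 for _ in range(n)
           if min(random.expovariate(l) for l in lam) > t)
mc = hits / n
# mc should be close to exact = exp(-1.2) ≈ 0.301
```

The key step is that the survival function of the minimum is the product of the component survival functions, which for exponentials collapses into a single exponential.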
7 The fact that two random variables X and Y both have a normal distribution does not imply that
the pair (X, Y ) has a joint normal distribution. A simple example is one in which Y = X if |X| > 1
and Y = −X if |X| < 1. This is also true for more than two random variables. (For more details see
http://en.wikipedia.org/wiki/Multivariate normal distribution).
8 The Pareto distribution seemed to show rather well the way that a larger portion of
the wealth of any society is owned by a smaller percentage of the people in that society
(http://en.wikipedia.org/wiki/Pareto distribution).
R.v. | Minimum
X_i ∼i.i.d. Weibull(α, β), i = 1, . . . , n | min_{i=1,...,n} X_i ∼ Weibull(α / n^{1/β}, β)
X_i ∼indep Weibull(α_i, β), i = 1, . . . , n | min_{i=1,...,n} X_i ∼ Weibull\left( \left( \sum_{i=1}^{n} α_i^{−β} \right)^{−1/β}, β \right)
3.4 Order statistics
Algebraic operations on independent r.v., such as the minimum, the maximum and order
statistics, are now further discussed because they play a major role in applied areas such
as reliability.
Definition 3.62 — System reliability function (Barlow and Proschan, 1975, p. 82)
The system reliability function for the interval [0, t] is the probability that the system
functions successfully throughout the interval [0, t].
If T represents the system lifetime then the system reliability function is the survival
function of T ,
ST (t) = P ({T > t}) = 1 − FT (t). (3.26)
If the system has n components with independent lifetimes X1 , . . . , Xn , with survival
functions SX1 (t), . . . , SXn (t), then the system reliability function is a function of those n
reliability functions, i.e.,
ST (t) = h [SX1 (t), . . . , SXn (t)] .   (3.27)
If they are not independent then ST (t) depends on more than the component marginal
distributions at time t. •
Example 3.65 — Reliability function of a series system
A series system functions if all its components function, so the system lifetime is
T = min_{i=1,...,n} X_i. In the i.i.d. case,
ST (t) = 1 − FT (t)
       = 1 − P\left( \bigcup_{i=1}^{n} {X_i ≤ t} \right)
       = P\left( \bigcap_{i=1}^{n} {X_i > t} \right)
       = [1 − FX (t)]^n = [SX (t)]^n,   (3.31)
Proposition 3.69 — Joint p.d.f. of the order statistics and more (Murteira, 1980,
pp. 57, 55, 54)
Let X1 , . . . , Xn be absolutely continuous r.v. such that X_i ∼i.i.d. X, i = 1, . . . , n. Then:
f_{X_{(1)},...,X_{(n)}}(x_1, . . . , x_n) = n! × \prod_{i=1}^{n} f_X(x_i),   (3.32)
for x_1 ≤ . . . ≤ x_n;
F_{X_{(i)}}(x) = \sum_{j=i}^{n} \binom{n}{j} × [F_X(x)]^j × [1 − F_X(x)]^{n−j}
             = 1 − F_{Binomial(n, F_X(x))}(i − 1),   (3.33)
for i = 1, . . . , n;
f_{X_{(i)}}(x) = \frac{n!}{(i − 1)! (n − i)!} × [F_X(x)]^{i−1} × [1 − F_X(x)]^{n−i} × f_X(x),   (3.34)
for i = 1, . . . , n;
f_{X_{(i)},X_{(j)}}(x_i, x_j) = \frac{n!}{(i − 1)! (j − i − 1)! (n − j)!}
                            × [F_X(x_i)]^{i−1} × [F_X(x_j) − F_X(x_i)]^{j−i−1} × [1 − F_X(x_j)]^{n−j}
                            × f_X(x_i) × f_X(x_j),   (3.35)
for x_i ≤ x_j and 1 ≤ i < j ≤ n;
T = X(n−k+1) . (3.36)
If X_i ∼i.i.d. X, i = 1, . . . , n, then the system reliability function can also be derived by using
the auxiliary r.v.
In fact,
ST (t) = P (Zt ≥ k)
= 1 − P (Zt ≤ k − 1)
= 1 − FBinomial(n,SX (t)) (k − 1)
= P (n − Zt ≤ n − k)
= FBinomial(n,FX (t)) (n − k). (3.38)
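The chain (3.38) says that a k-out-of-n system with i.i.d. components survives iff at least k components do, a binomial tail. A minimal sketch, with an illustrative common survival probability s = S_X(t):

```python
from math import comb

def k_out_of_n_reliability(n, k, s):
    """P(at least k of n i.i.d. components survive to time t), where
    s = S_X(t); equals 1 - F_Binomial(n, s)(k - 1), as in (3.38)."""
    return sum(comb(n, j) * s**j * (1 - s)**(n - j) for j in range(k, n + 1))

# Sanity checks: k = 1 is a parallel system, k = n is a series system.
s = 0.9
par = k_out_of_n_reliability(3, 1, s)   # 1 - (1 - s)**3
ser = k_out_of_n_reliability(3, 3, s)   # s**3
```

The two extreme cases reproduce the parallel and series formulas derived earlier in the chapter.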
3.5 Constructing independent r.v.
The following theorem is similar to Proposition 2.133 and guarantees that we can also
construct independent r.v. with prescribed d.f.
3.6 Bernoulli process
Motivation 3.77 — Bernoulli (counting) process (Karr, 1993, p. 88)
Counting successes in repeated, independent trials, each of which has one of two possible
outcomes (success^9 and failure). •
Definition 3.79 — Important r.v. in a Bernoulli process (Karr, 1993, pp. 88–89)
In isolation a Bernoulli process is neither deep nor interesting. However, we can identify
three associated and very important r.v.:
• Sn = \sum_{i=1}^{n} X_i, the number of successes in the first n trials (n ∈ IN );
• Tk = min{n : Sn = k}, the time (trial number) at which the kth success occurs
(k ∈ IN ), that is, the number of trials needed to get k successes;
• Uk = Tk − Tk−1, the time (number of trials) between the (k − 1)th and kth successes
(k ∈ IN, T0 = 0, U1 = T1 ). •
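The three r.v. above can be read off directly from a simulated trajectory. A sketch (p = 0.3 and 20 trials are illustrative choices):

```python
import random

random.seed(42)
p, n_trials = 0.3, 20
x = [1 if random.random() < p else 0 for _ in range(n_trials)]

# S_n: number of successes in the first n trials.
s = [sum(x[:n]) for n in range(1, n_trials + 1)]
# T_k: trial number at which the k-th success occurs.
t = [i + 1 for i, xi in enumerate(x) if xi == 1]
# U_k: trials between the (k-1)-th and k-th successes (U_1 = T_1).
u = [t[0]] + [t[k] - t[k - 1] for k in range(1, len(t))]
```

The structural identities S_{T_k} = k and \sum_{j=1}^{k} U_j = T_k hold on every trajectory, not just in distribution.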
Proposition 3.83 — Important distributions in a Bernoulli process (Karr, 1993,
pp. 89–90)
In a Bernoulli process with parameter p (p ∈ [0, 1]) we have:
• Sn ∼ Binomial(n, p), n ∈ IN ;
• Tk ∼ NegativeBinomial(k, p), k ∈ IN ;
• Uk ∼i.i.d. Geometric(p) =^d NegativeBinomial(1, p), k ∈ IN . •
(b) Consider a Bernoulli process with parameter p = 1/2 and obtain the probability of
having 57 successes between times 10 and 100. •
Exercise 3.85 — Relating the Bernoulli counting process and random walk
(Karr, 1993, p. 97, Exercise 3.21)
Let Sn be a Bernoulli (counting) process with p = 1/2.
Prove that the process Zn = 2Sn − n is a symmetric random walk. •
• stationary increments — that is, for fixed j, the distribution of Sk+j − Sk is the
same for all k ∈ IN . •
• Operations Research
Queueing in any area, failures in manufacturing systems, finance, risk modelling,
network models
• Computer Systems
Communication networks, intelligent control systems, data compression, detection
of signals, job flow in computer systems, physics – statistical mechanics. •
•
(b) Assume time is divided into consecutive fixed-length time-slots. Consider N sensors
and assume that the ith sensor triggers an alarm in any given fixed-length time-slot
with probability αi (i = 1, . . . , N ), independently of the remaining sensors.
Obtain the probability that at least one alarm sounds in a given time-slot (Zukerman,
2000–2012, p. 60) and the p.f. of the number of times at least one alarm sounds in
the first n time-slots. •
Are the two resulting processes independent?13
(b) Assume once again that time is divided into consecutive fixed-length time-slots.
Moreover, assume an alarm is triggered in a fixed-length time-slot with probability α
and subsequently checked whether it is a false alarm (with probability p) or not.
Determine the p.f. of the number of time-slots between consecutive false alarms. •
13 NO! If we try to merge the two splitting processes and assume they are independent we get a
parameter αp + α(1 − p) − αp × α(1 − p), which is different from α.
3.7 Poisson process
In what follows we use the notation of Ross (1989, Chapter 5) which is slightly different
from the one of Karr (1993, Chapter 3).
Definition 3.96 — Counting process (in continuous time) (Ross, 1989, p. 209)
A stochastic process {N (t), t ≥ 0} is said to be a counting process if N (t) represents the
total number of events (e.g. arrivals) that have occurred up to time t. From this definition
we can conclude that a counting process {N (t), t ≥ 0} must satisfy:
• N (t) ∈ IN0 , ∀ t ≥ 0;
• N (t) − N (s) corresponds to the number of events that have occurred in the interval
(s, t], ∀ 0 ≤ s < t. •
14 For more examples, check http://en.wikipedia.org/wiki/Poisson process.
Definition 3.98 — Counting process (in continuous time) with stationary
increments (Ross, 1989, p. 210)
The counting process {N (t), t ≥ 0} is said to have stationary increments if distribution
of the number of events that occur in any interval of time depends only on the length of
the interval,15 that is,
d
• N (t2 + s) − N (t1 + s) = N (t2 ) − N (t1 ), ∀ s > 0, 0 ≤ t1 < t2 . •
• N (t) ∼ Poisson(λt). •
• N (0) = 0;
Proposition 3.103 — Joint p.f. of N (t1 ), . . . , N (tn ) in a Poisson process (Karr,
1993, p. 91)
For 0 < t1 < . . . < tn and 0 ≤ k1 ≤ . . . ≤ kn ,
P({N(t_1) = k_1, . . . , N(t_n) = k_n}) = \prod_{j=1}^{n} \frac{e^{−λ(t_j − t_{j−1})} [λ(t_j − t_{j−1})]^{k_j − k_{j−1}}}{(k_j − k_{j−1})!},   (3.39)
where t0 = 0 and k0 = 0. •
Definition 3.106 — Important r.v. in a Poisson process (Karr, 1993, pp. 88–89)
Let {N (t), t ≥ 0} be a Poisson process with rate λ. Then:
• Sn = inf{t : N (t) = n} represents the time of the occurrence of the nth event (e.g.
arrival), n ∈ IN ;
• Xn = Sn − Sn−1 corresponds to the time between the (n − 1)th and nth events
(e.g. interarrival time), n ∈ IN .
Proposition 3.107 — Important distributions in a Poisson process (Karr, 1993,
pp. 92–93)
So far we know that N (t) ∼ Poisson(λt), t > 0. We can also add that:
• Sn ∼ Erlang(n, λ), n ∈ IN ;
• Xn ∼i.i.d. Exponential(λ), n ∈ IN . •
N (t) ≥ n ⇔ Sn ≤ t (3.40)
FSn (t) = F_{Erlang(n,λ)}(t)
        = P({N(t) ≥ n})
        = \sum_{j=n}^{+\infty} \frac{e^{−λt} (λt)^j}{j!}
        = 1 − F_{Poisson(λt)}(n − 1), n ∈ IN.   (3.41)
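Relation (3.41) lets one evaluate the Erlang d.f. with a finite Poisson sum instead of an integral. A quick numerical cross-check with illustrative parameters (n = 3, λ = 2, t = 1.5):

```python
import math

def erlang_cdf_via_poisson(n, lam, t):
    """F_Erlang(n, lam)(t) = 1 - F_Poisson(lam * t)(n - 1) = P(N(t) >= n)."""
    return 1.0 - sum(math.exp(-lam * t) * (lam * t)**j / math.factorial(j)
                     for j in range(n))

def erlang_cdf_numeric(n, lam, t, steps=20000):
    """Direct trapezoidal integration of the Erlang(n, lam) p.d.f. on [0, t]."""
    pdf = lambda x: lam**n * x**(n - 1) * math.exp(-lam * x) / math.factorial(n - 1)
    h = t / steps
    ys = [pdf(i * h) for i in range(steps + 1)]
    return h * (sum(ys) - 0.5 * (ys[0] + ys[-1]))

a = erlang_cdf_via_poisson(3, 2.0, 1.5)
b = erlang_cdf_numeric(3, 2.0, 1.5)
# a and b agree: both equal P(S_3 <= 1.5) for a rate-2 Poisson process
```

The two routes — counting events up to time t versus integrating the density of S_n — are the two sides of the equivalence (3.40).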
Motivation 3.112 — Conditional distribution of the first arrival time (Ross,
1989, p. 222)
Suppose we are told that exactly one event of a Poisson process has taken place by time
t (i.e. N (t) = 1), and we are asked to determine the distribution of the time at which the
event occurred (S1 ). •
Proposition 3.113 — Conditional distribution of the first arrival time (Ross,
1989, p. 223)
Let {N (t), t ≥ 0} be a Poisson process with rate λ > 0. Then
S1 |{N (t) = 1} ∼ Uniform(0, t). (3.42)
•
Exercise 3.114 — Conditional distribution of the first arrival time
Prove Proposition 3.113 (Ross, 1989, p. 223). •
Then the merged process {N1 (t) + N2 (t), t ≥ 0} is a Poisson process with rate λ1 + λ2 .
(a) Starting at an arbitrary time, compute the probability that at least two men arrive
before three women arrive (Ross, 1989, p. 242, Exercise 20).
(b) What is the probability that the number of arrivals (men and women) exceeds ten
in the first 20 minutes? •
Moreover, we can add that N1 (t)|{N (t) = n} ∼ Binomial(n, p) and N2 (t)|{N (t) = n} ∼
Binomial(n, 1 − p). •
Exercise 3.122 — Splitting a Poisson process
Prove Proposition 3.121 (Ross, 1989, pp. 218–219).
Why are the two resulting processes independent? •
Exercise 3.124 — Splitting a Poisson process (Ross, 1989, p. 243, Exercise 23)
Cars pass a point on the highway at a Poisson rate of one per minute. If five percent of
the cars on the road are Dodges, then:
(a) What is the probability that at least one Dodge passes during an hour?
(b) If 50 cars have passed by an hour, what is the probability that five of them were
Dodges?
(c) Given that ten Dodges have passed by in an hour, obtain the expected value of the
number of cars to have passed by in that time. •
3.8 Generalizations of the Poisson process
In this section we consider three generalizations of the Poisson process. The first of these
is the non homogeneous Poisson process, which is obtained by allowing the arrival rate at
time t to be a function of t.
• N (0) = 0;
Moreover,
N(t + s) − N(s) ∼ Poisson\left( \int_{s}^{t+s} λ(z) dz \right)   (3.44)
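By (3.44), the expected number of arrivals in an interval is the integral of the intensity over that interval. A sketch with a hypothetical intensity function — the λ(z) below is an assumption for illustration, not the one in the exercise:

```python
import math

# Hypothetical intensity over an 8-hour opening period (an assumption).
lam = lambda z: 2.0 + math.sin(z)

def expected_arrivals(a, b, steps=10000):
    """E[N(b) - N(a)] = int_a^b lam(z) dz, by the trapezoidal rule."""
    h = (b - a) / steps
    ys = [lam(a + i * h) for i in range(steps + 1)]
    return h * (sum(ys) - 0.5 * (ys[0] + ys[-1]))

mean = expected_arrivals(0.0, 8.0)
# Closed form for this lam: int_0^8 (2 + sin z) dz = 16 + 1 - cos(8)
```

P(no arrivals in (a, b]) is then exp(−∫_a^b λ(z) dz), by the Poisson p.f. at zero.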
(a) Obtain the expression of the expected value of the number of arrivals until t
(0 ≤ t ≤ 8). Derive the probability of no arrivals in the interval [3,5].
(b) Determine the expected value of the number of arrivals in the last 5 opening hours
(interval [3, 8]) given that 15 customers have arrived in the last 3 opening hours
(interval [5, 8]). •
Exercise 3.127 — The output process of an infinite server Poisson queue and
the non homogeneous Poisson process
Prove that the output process of the M/G/∞ queue — i.e., the number of customers who
(by time t) have already left the infinite server queue with Poisson arrivals and general
service d.f. G — is a non homogeneous Poisson process with intensity function λG(t). •
where
Definition 3.132 — Conditional Poisson process (Ross, 1983, pp. 49–50)
Let:
References
• Barlow, R.E. and Proschan, F. (1965/1996). Mathematical Theory of Reliability.
SIAM (Classics in Applied Mathematics).
(TA169.BAR.64915)
• Barlow, R.E. and Proschan, F. (1975). Reliability and Life Testing. Holt, Rinehart
and Winston, Inc.
• Grimmett, G.R. and Stirzaker, D.R. (2001). Probability and Random Processes
(3rd. edition). Oxford University Press. (QA274.12-.76.GRI.30385 and QA274.12-
.76.GRI.40695 refer to the library code of the 1st. and 2nd. editions from 1982 and
1992, respectively.)
• Pinkerton, S.D. and Holtgrave, D.R. (1998). The Bernoulli-process model in HIV
transmission: applications and implications. In Handbook of economic evaluation
of HIV prevention programs, Holtgrave, D.R. (Ed.), pp. 13–32. Plenum Press, New
York.
• Ross, S.M. (1983). Stochastic Processes. John Wiley & Sons. (QA274.12-
.76.ROS.36921 and QA274.12-.76.ROS.37578)
Chapter 4
Expectation
One of the most fundamental concepts of probability theory and mathematical statistics
is the expectation of a r.v. (Resnick, 1999, p. 117).
The expected values of
• a discrete r.v. X with values in a countable set C and p.f. P ({X = x}), and
• an absolutely continuous r.v. Y with p.d.f. fY ,
are
E(X) = \sum_{x ∈ C} x × P({X = x})   (4.1)
and
E(Y) = \int_{−\infty}^{+\infty} y × f_Y(y) dy,   (4.2)
respectively.
When X ≥ 0 it is permissible that E(X) = +∞, but finiteness is mandatory when X
can take both positive and negative (or null) values. •
1. Constant preserved
If X ≡ c then E(X) = c.
2. Monotonicity^1
If X ≤ Y then E(X) ≤ E(Y ).
3. Linearity
For a, b ∈ IR, E(aX + bY ) = aE(X) + bE(Y ).
4. Continuity^2
If Xn → X then E(Xn ) → E(X).
where:
Consider, for example, X ∼ Binomial(2, p). In this case:
• A1 = F F , A2 = F S, A3 = SF , A4 = SS;
• a1 = 0, a2 = 1, a3 = 1, a4 = 2;
• X = \sum_{i=1}^{4} a_i × 1_{A_i}.
Remark 4.4 — Expectation of a simple r.v. (Resnick, 1999, p. 119; Karr, 1993, p.
102)
• Note that Definition 4.3 coincides with our knowledge of discrete probability from
more elementary courses: the expectation is computed by taking a possible value,
multiplying by the probability of the possible value and then summing over all
possible values.
• E(X) is well-defined in the sense that all representations of X yield the same value
for E(X): different representations of X, X = \sum_{i=1}^{n} a_i × 1_{A_i} and X = \sum_{j=1}^{m} a′_j × 1_{A′_j},
lead to the same expected value E(X) = \sum_{i=1}^{n} a_i × P(A_i) = \sum_{j=1}^{m} a′_j × P(A′_j).
Proposition 4.5 — Properties of the set of simple r.v. (Resnick, 1999, p. 118)
Let E be the set of all simple r.v. defined on (Ω, F, P ). We have the following properties
of E.
(b) If X = \sum_{i=1}^{n} a_i × 1_{A_i} ∈ E and Y = \sum_{j=1}^{m} b_j × 1_{B_j} ∈ E then
X + Y = \sum_{i=1}^{n} \sum_{j=1}^{m} (a_i + b_j) × 1_{A_i ∩ B_j} ∈ E.   (4.5)
2. If X, Y ∈ E then XY ∈ E since
XY = \sum_{i=1}^{n} \sum_{j=1}^{m} (a_i × b_j) × 1_{A_i ∩ B_j}.   (4.6)
Remark 4.10 — Monotonicity of expectation for simple r.v. (Karr, 1993, p. 103)
The monotonicity of expectation for simple r.v. is a desired property which follows from
• linearity and
In fact, if X ≤ Y ⇔ Y − X ≥ 0 then E(Y ) − E(X) = E(Y − X) ≥ 0.
This argument is valid provided that E(Y ) − E(X) is not of the form +∞ − ∞. •
Xn = n × 1_{(0, 1/n)}.   (4.8)
4.1.2 Non negative r.v.
Before we proceed with the definition of the expectation of non negative r.v.,4 we need
to recall the measurability theorem. This theorem states that any non negative r.v. can
be approximated by simple r.v., and it is the reason why it is often the case that an
integration result about non negative r.v. — such as the expectation and its properties
— is proven first for simple r.v.
Theorem 4.13 — Measurability theorem (Resnick, 1999, p. 91; Karr, 1993, p. 50)
Suppose X(ω) ≥ 0, for all ω. Then X : Ω → IR is a Borel measurable function (i.e.
a r.v.) iff there is an increasing sequence of simple and non negative r.v. X1 , X2 , . . .
(0 ≤ X1 ≤ X2 ≤ . . .) such that
Xn ↑ X, (4.9)
Motivation 4.15 — Expectation of a non negative r.v. (Karr, 1993, pp. 103–104)
We now extend the definition of expectation to all non negative r.v. However, we have
already seen that continuity of expectation fails even for simple r.v. and therefore we
cannot define the expected value of a non negative r.v. simply as E(X) = limn→+∞ E(Xn ).
Unsurprisingly, if we apply the measurability theorem then the definition of
expectation of a non negative r.v. virtually forces monotone continuity for increasing
sequences of non negative r.v.:
• if X1 , X2 , . . . are simple and non negative r.v. and X is a non negative r.v. such that
Xn ↑ X (pointwise) then E(Xn ) ↑ E(X).
4 Karr (1993) and Resnick (1999) call these r.v. positive when they are actually non negative.
It is convenient and useful to assume that these non negative r.v. can take values in the
extended set of non negative real numbers, [0, +∞].
Further on, we shall have to establish another restricted form of continuity: dominated
continuity for integrable r.v.5 •
• does not depend on the approximating sequence {Xn , n ∈ IN }, as stated in the next
proposition. •
We now list some basic properties of the expectation operator applied to non negative
r.v.: linearity, monotonicity and monotone continuity/convergence. This last
property describes how expectation and limits interact, and under which circumstances
we are allowed to interchange expectation and limits.
5 We shall soon define integrable r.v.
Proposition 4.20 — Expectation of a linear combination of non negative r.v.
(Karr, 1993, p. 104; Resnick, 1999, p. 123)
Let X and Y be two non negative r.v. and a, b ∈ IR+ . Then
Remark 4.23 — Monotonicity of expectation for non negative r.v. (Karr, 1993,
p. 105)
Monotonicity of expectation follows, once again, from positivity and linearity. •
Theorem 4.25 — Fatou’s lemma (Karr, 1993, p. 105; Resnick, 1999, p. 132)
Let {Xn , n ∈ IN } be a sequence of non negative r.v. Then
Exercise 4.28 — Fatou’s lemma and continuity of p.f.
Verify that Theorem 4.25 could be used in a part of the proof of the continuity of p.f. if
we considered Xn = 1An (Karr, 1993, p. 106). •
We now state another property of expectation of non negative r.v.: the monotone
continuity/convergence of expectation.
Xn ↑ X (4.15)
then
Theorem 4.33 — Expectation of a linear convergent series of non negative r.v.
(Karr, 1993, p. 106; Resnick, 1999, p. 131)
Let {Yk , k ∈ IN } be a collection of non negative r.v. such that \sum_{k=1}^{+\infty} Y_k(ω) < +∞, for
every ω. Then
E\left( \sum_{k=1}^{+\infty} Y_k \right) = \sum_{k=1}^{+\infty} E(Y_k).   (4.17)
•
4.1.3 Integrable r.v.
It is time to extend the definition of expectation to r.v. X that can take both positive
and negative (or null) values. But first recall that:
• X = X + − X −;
• |X| = X + + X − .
The definition of expectation of such a r.v. preserves linearity and is based on the fact
that X can be written as a linear combination of two non negative r.v.: X = X + − X − .
Definition 4.38 — Integrable r.v.; the set of integrable r.v. (Karr, 1993, p. 107;
Resnick, 1999, p. 126)
Let X be a r.v., not necessarily non negative. Then X is said to be integrable if
E(|X|) < +∞.
The set of integrable r.v. is denoted by L1 or L1 (P ) if the probability measure needs
to be emphasized. •
so both E(X + ) and E(X − ) are finite, E(X + ) − E(X − ) is not of the form ∞ − ∞,
thus, the definition of expectation of X is coherent.
2. Moreover, since |X × 1A | ≤ |X|, E(X; A) = E(X × 1A ) is finite (i.e. exists!) as long
as E(|X|) < +∞, that is, as long as E(X) exists.
Corollary 4.43 — Modulus inequality (Karr, 1993, p. 108; Resnick, 1999, p. 128)
If X ∈ L1 then
|E(X)| ≤ E(|X|). (4.21)
•
7 That is, aX + bY ∈ L1 . In fact, L1 is a vector space.
The continuity of expectation for integrable r.v. can be finally stated.
Xn → X. (4.23)
|Xn | ≤ Y, (4.24)
4.2 Integrals with respect to distribution functions
Integrals (of Borel measurable functions) with respect to d.f. are known as Lebesgue–
Stieltjes integrals. Moreover, they are really expectations with respect to probabilities on
IR and are reduced to sums and Riemann (more generally, Lebesgue) integrals.
4.2.1 On integration
Remark 4.52 — Riemann integral (http://en.wikipedia.org/wiki/Riemann integral)
In the branch of mathematics known as real analysis, the Riemann integral, created
by Bernhard Riemann (1826–1866), was the first rigorous definition of the integral of a
function on an interval.
• Overview
Let g be a non-negative real-valued function of the interval [a, b], and let
S = {(x, y) : 0 < y < g(x)} be the region of the plane under the graph of the
function g and above the interval [a, b].
The basic idea of the Riemann integral is to use very simple approximations for the
area of S, denoted by \int_a^b g(x) dx, namely by taking better and better approximations
— we can say that “in the limit” we get exactly the area of S under the curve.
• Riemann sums
Choose a real-valued function g which is defined on the interval [a, b]. The Riemann
sum of g with respect to the tagged partition a = x0 < x1 < x2 < . . . < xn = b
together with t0 , . . . , tn−1 (where xi ≤ ti ≤ xi+1 ) is
\sum_{i=0}^{n−1} g(t_i)(x_{i+1} − x_i),   (4.26)
where each term represents the area of a rectangle with height g(ti ) and width
xi+1 − xi . Thus, the Riemann sum is the signed area under all the rectangles.
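The Riemann sums above can be sketched directly (left-tagged, uniform partition; the integrand x² is an illustrative choice):

```python
def riemann_sum(g, a, b, n):
    """Left-tagged Riemann sum sum_i g(t_i)(x_{i+1} - x_i) on a uniform
    partition of [a, b] into n subintervals, with tags t_i = x_i."""
    h = (b - a) / n
    return sum(g(a + i * h) * h for i in range(n))

# Approximating int_0^1 x^2 dx = 1/3: the sums converge as the partition
# gets finer.
approx = riemann_sum(lambda x: x * x, 0.0, 1.0, 100_000)
```

For a continuous integrand the choice of tags t_i within each subinterval does not affect the limit, only the speed of convergence.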
• Riemann integral
Loosely speaking, the Riemann integral is the limit of the Riemann sums of a
function as the partitions get finer.
If the limit exists then the function is said to be integrable (or more specifically
Riemann-integrable).
• Limitations of the Riemann integral
With the advent of Fourier series, many analytical problems involving integrals came
up whose satisfactory solution required interchanging limit processes and integral
signs.
Failure of monotone convergence — The indicator function 1Q on the rationals is not
Riemann integrable. No matter how the set [0, 1] is partitioned into subintervals,
each partition will contain at least one rational and at least one irrational number,
since rationals and irrationals are both dense in the reals. Thus, the upper Darboux
sums8 will all be one, and the lower Darboux sums9 will all be zero.
Unsuitability for unbounded intervals — The Riemann integral can only integrate
functions on a bounded interval. It can however be extended to unbounded intervals
by taking limits, so long as this does not yield an answer such as +∞ − ∞.
What about integrating on structures other than Euclidean space? — The Riemann
integral is inextricably linked to the order structure of the line. How do we free
ourselves of this limitation? •
Lebesgue integration has the beautiful property that every bounded function defined
over a bounded interval with a Riemann integral also has a Lebesgue integral, and
for those functions the two integrals agree. But there are many functions with a
Lebesgue integral that have no Riemann integral.
As part of the development of Lebesgue integration, Lebesgue invented the concept
of Lebesgue measure, which extends the idea of length from intervals to a very large
class of sets, called measurable sets.
• Integration
We start with a measure space (Ω, F, µ) where Ω is a set, F is a σ − algebra of
subsets of Ω and µ is a (non-negative) measure on F of subsets of Ω.
In the mathematical theory of probability, we confine our study to a probability
measure µ, which satisfies µ(Ω) = 1.
In Lebesgue’s theory, integrals are defined for a class of functions called measurable
functions.
We build up an integral \int_Ω g dµ for measurable real-valued functions g defined on Ω
in stages:
Remark 4.54 — Lebesgue/Riemann–Stieltjes integration
(http://en.wikipedia.org/wiki/Lebesgue-Stieltjes integration)
The Lebesgue–Stieltjes integral is the ordinary Lebesgue integral with respect to a measure
known as the Lebesgue–Stieltjes measure, which may be associated to any function of
bounded variation on the real line.
• Definition
The Lebesgue–Stieltjes integral \int_a^b g(x) dF(x) is defined when g : [a, b] → IR is Borel-
measurable and bounded and F : [a, b] → IR is of bounded variation in [a, b] and
right-continuous, or when g is non-negative and F is monotone and right-continuous.
4.2.2 Generalities
First of all, we should recall that given a d.f. F on IR, there is a unique p.f. PF on IR such
that PF ((a, b]) = F (b) − F (a).
Moreover, all functions g appearing below are assumed to be Borel measurable.
where the expectation is that of g(X) as a Borel measurable function of the r.v. X defined
on the probability space (IR, B(IR), PF ). •
Definition 4.56 — Integrability of a function with respect to a d.f. (Karr, 1993, p.
110)
Let F be a d.f. on IR and g a signed function. Then g is said to be integrable with respect
to F if \int_{IR} |g(x)| dF(x) < +∞, and in this case, the integral of g with respect to F equals
\int_{IR} g(x) dF(x) = \int_{IR} g^{+}(x) dF(x) − \int_{IR} g^{−}(x) dF(x).   (4.28)
•
Definition 4.57 — Integral of a function over a set with respect to a d.f. (Karr,
1993, p. 110)
Let F be a d.f. on IR, g either non negative or integrable, and B ∈ B(IR). The integral
of g over B with respect to F is equal to
\int_B g(x) dF(x) = \int_{IR} g(x) × 1_B(x) dF(x).   (4.29)
•
The properties of the integral of a function with respect to a d.f. are those of
expectation:
1. Constant preserved
2. Monotonicity
3. Linearity
4. Relation to PF
5. Fatou’s lemma
4.2.3 Discrete distribution functions
Keep in mind that integrals with respect to discrete d.f. are sums.
Theorem 4.58 — Integral with respect to a discrete d.f. (Karr, 1993, p. 111)
Consider a d.f. F that can be written as
F(x) = \sum_i p_i × 1_{[x_i, +\infty)}(x).   (4.30)
Corollary 4.60 — Integrable function with respect to a discrete d.f. (Karr, 1993,
p. 111)
The function g is said to be integrable with respect to the discrete d.f. F iff
\sum_i |g(x_i)| × p_i < +∞,   (4.32)
Exercise 4.62 — Integral with respect to an absolutely continuous d.f.
Prove Theorem 4.61 (Karr, 1993, p. 112). •
Thus,
Corollary 4.64 — Integral with respect to a mixed d.f. (Karr, 1993, p. 112)
The integral of g with respect to the mixed d.f. F is a corresponding combination of
integrals with respect to Fd and Fa :
\int_{IR} g(x) dF(x) = α × \int_{IR} g(x) dF_d(x) + (1 − α) × \int_{IR} g(x) dF_a(x)
                    = α × \sum_i g(x_i) × p_i + (1 − α) × \int_{−\infty}^{+\infty} g(x) × f_a(x) dx.   (4.40)
In order that the integral with respect to a mixed d.f. exists, g must be piecewise
continuous and either non negative or integrable with respect to both Fd and Fa . •
4.3 Computation of expectations
So far we have defined the expectation for simple r.v.
The expectations of other types of r.v. — such as non negative, integrable and mixed
r.v. — naturally involve integrals with respect to distribution functions.
Theorem 4.65 — Expected value of a non negative r.v. (Karr, 1993, p. 113)
If X ≥ 0 then
E(X) = \int_0^{+\infty} x dF_X(x) = \int_0^{+\infty} [1 − F_X(x)] dx.   (4.41)
•
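Both expressions in (4.41) can be checked numerically for, say, an Exponential(λ) r.v.; the rate λ = 2 and the truncation point are illustrative choices:

```python
import math

lam = 2.0
pdf = lambda x: lam * math.exp(-lam * x)    # f_X(x), x >= 0
surv = lambda x: math.exp(-lam * x)         # 1 - F_X(x) = S_X(x)

def trap(f, a, b, n=20000):
    """Trapezoidal rule for int_a^b f(x) dx."""
    h = (b - a) / n
    ys = [f(a + i * h) for i in range(n + 1)]
    return h * (sum(ys) - 0.5 * (ys[0] + ys[-1]))

upper = 30.0   # truncation: the exponential tail beyond this is negligible
e1 = trap(lambda x: x * pdf(x), 0.0, upper)   # int x dF_X(x)
e2 = trap(surv, 0.0, upper)                   # int [1 - F_X(x)] dx
# both approximate E(X) = 1 / lam = 0.5
```

The second integral is often the more convenient route when only the d.f. (not the density) has a simple closed form.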
Exercise 4.70 — A nonnegative r.v. with infinite expectation
Let X ∼ Pareto(b = 1, α = 1), i.e.
f_X(x) = \frac{α b^α}{x^{α+1}} = \frac{1}{x^2} for x ≥ b = 1, and f_X(x) = 0 otherwise.   (4.44)
Prove that E(X) exists and E(X) = +∞ (Resnick, 1999, p. 126, Example 5.2.1). •
respectively. •
(b) f_X(x) = \frac{1}{π(1 + x^2)}, x ∈ IR.
(Resnick, 1999, p. 126, Example 5.2.1.)10 •
(b) Write FX (x) as a linear combination of the d.f. of two r.v.: a discrete and an
absolutely continuous r.v.
(c) Obtain the expected value of X, by using the fact that X is non negative, thus
E(X) = \int_0^{+\infty} [1 − F_X(x)] dx.
Compare this value with the one you would obtain using Corollary 4.75. •
10 There is a typo in the definition of the first p.d.f. in Resnick (1999): x > 1 should read as |x| > 1.
The second p.d.f. corresponds to the one of a r.v. with (standard) Cauchy distribution.
11 Adapted from Walrand (2004, pp. 53–55, Example 4.10.9).
Exercise 4.77 — Expectation of a mixed r.v. in a queueing setting
Consider a M/M/1 system.12 Let:
(d) Verify that E(Wq) = \frac{ρ}{µ(1 − ρ)}. •
12 The arrivals to the system are governed by a Poisson process with rate λ, i.e. the time between arrivals
has an exponential distribution with parameter λ; needless to say, M stands for memoryless. The service
times are not only i.i.d. with exponential distribution with parameter µ, but also independent from the
arrival process. There is only one server, and the service policy is FCFS (first come first served). ρ = λ/µ
represents the traffic intensity and we assume that ρ ∈ (0, 1).
13
Equilibrium roughly means that a lot of time has elapsed since the system started operating, so the
initial conditions no longer influence the state of the system.
14
Wq is the time elapsed from the moment the customer arrives until his/her service starts, for a
system in equilibrium.
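A Monte Carlo sketch of part (d) — a constructed illustration, with λ = 0.5 and µ = 1 chosen arbitrarily — estimates E(W_q) via Lindley's recursion W_{n+1} = max(0, W_n + S_n − T_{n+1}), where T denotes an interarrival time and S a service time:

```python
import random

# Lindley recursion for the M/M/1 waiting time in queue; λ, µ and the run length
# are assumptions made up for this sketch, with ρ = λ/µ = 0.5 < 1 (stable queue).
random.seed(12345)
lam, mu = 0.5, 1.0
rho = lam / mu
theory = rho / (mu * (1 - rho))   # E(Wq) = ρ/(µ(1−ρ)) = 1.0 here

w, total, n = 0.0, 0.0, 300_000
for _ in range(n):
    # next customer's wait = previous wait + service − interarrival, floored at 0
    w = max(0.0, w + random.expovariate(mu) - random.expovariate(lam))
    total += w
estimate = total / n
print(estimate, theory)
```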
4.3.4 Functions of r.v.
We can derive expressions for the expectation of a Borel measurable function g of the r.v. X,
E[g(X)], without first deriving the d.f. of g(X); the results follow from Section 4.2.
In the next two sections we shall discuss the expectation of specific functions of r.v.:
g(X) = X^k, k ∈ IN.
4.3.5 Functions of random vectors
When dealing with functions of random vectors, the only useful formulas are those
referring to the expectation of functions of discrete and absolutely continuous random
vectors.
These formulas will be used to obtain, for instance, what we shall call measures of
(linear) association between r.v.
Then

E[g(X_1, \ldots, X_d)] = \sum_{x_1 \in C_1} \cdots \sum_{x_d \in C_d} g(x_1, \ldots, x_d) \times P(\{X_1 = x_1, \ldots, X_d = x_d\}).    (4.56)
•
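Formula (4.56) can be checked exactly for a small example (assumed here, not from the notes): two independent fair dice and g(x₁, x₂) = x₁·x₂:

```python
from fractions import Fraction
from itertools import product

# E[g(X1, X2)] as a double sum over the joint p.f. — the fair-dice example is
# an assumption made for this sketch.
pmf = {(x1, x2): Fraction(1, 36) for x1, x2 in product(range(1, 7), repeat=2)}
g = lambda x1, x2: x1 * x2

expectation = sum(g(x1, x2) * p for (x1, x2), p in pmf.items())
print(expectation)  # 49/4, i.e. (7/2)^2, as expected by independence
```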
Then

E[g(X_1, \ldots, X_d)] = \int_{-\infty}^{+\infty} \cdots \int_{-\infty}^{+\infty} g(x_1, \ldots, x_d) \times f_{X_1,\ldots,X_d}(x_1, \ldots, x_d) \, dx_1 \ldots dx_d.    (4.57)
•
4.3.6 Functions of independent r.v.
When all the components of the random vector (X1 , . . . , Xd ) are independent, the formula
of E[g(X1 , . . . , Xd )] can be simplified.
The next results refer to two independent random variables (d = 2). The generalization
for d > 2 is straightforward.
Then

E[g(X, Y)] = \int_{IR} \left[ \int_{IR} g(x, y) \, dF_X(x) \right] dF_Y(y) = \int_{IR} \left[ \int_{IR} g(x, y) \, dF_Y(y) \right] dF_X(x).    (4.58)
•
Moreover, the expectation of the product of functions of independent r.v. is the product
of their expectations. Also note that the product of two integrable r.v. need not be
integrable.
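The product rule for independent r.v. can be verified exactly on a toy example (a fair die, chosen for illustration), with g(x) = x² and h(y) = 1/y:

```python
from fractions import Fraction

# Exact check that E[g(X)·h(Y)] = E[g(X)]·E[h(Y)] for independent X and Y;
# the fair-die marginals are an assumed example.
px = {x: Fraction(1, 6) for x in range(1, 7)}
py = dict(px)

e_g = sum(x * x * p for x, p in px.items())                       # E[X^2]
e_h = sum(Fraction(1, y) * p for y, p in py.items())              # E[1/Y]
e_gh = sum(x * x * Fraction(1, y) * px[x] * py[y] for x in px for y in py)

print(e_gh == e_g * e_h)  # True
```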
4.3.7 Sum of independent r.v.
We are certainly not going to state that E(X + Y ) = E(X) + E(Y ) when X and Y are
simple or non negative or integrable independent r.v.15
Instead, we are going to write the d.f. of the sum of two independent r.v. in terms
of integrals with respect to the d.f. and define a convolution of d.f.16
Theorem 4.88 — D.f. of a sum of two independent r.v. (Karr, 1993, p. 118)
Let X and Y be two independent r.v. Then
F_{X+Y}(t) = \int_{IR} F_X(t − y) \, dF_Y(y) = \int_{IR} F_Y(t − x) \, dF_X(x).    (4.60)
•
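A discrete analogue of (4.60) can be checked exactly — the fair-dice example below is an assumption made for illustration: F_{X+Y}(t) = Σ_y F_X(t − y) p_Y(y) agrees with brute-force enumeration of the joint distribution:

```python
from fractions import Fraction

# Discrete version of the convolution formula (4.60) for two independent fair dice.
p = {k: Fraction(1, 6) for k in range(1, 7)}
F = lambda t, pmf=p: sum(q for k, q in pmf.items() if k <= t)  # d.f. of one die

def F_sum(t):
    # F_{X+Y}(t) = Σ_y F_X(t − y) p_Y(y)
    return sum(F(t - y) * q for y, q in p.items())

def F_sum_brute(t):
    # direct enumeration of the joint distribution, for comparison
    return sum(p[x] * p[y] for x in p for y in p if x + y <= t)

print(all(F_sum(t) == F_sum_brute(t) for t in range(2, 13)))  # True
```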
Let us revisit an exercise from Chapter 2 to illustrate the use of Corollary 4.92.
15
This result follows from the linearity of expectation.
16
Recall that in Chapter 2 we derived expressions for the p.f. and the p.d.f. of the sum of two
independent r.v.
Exercise 4.93 — D.f. of the sum of two independent absolutely continuous r.v.
Let X and Y be the durations of two independent system components set in what is called
a stand by connection.17 In this case the system duration is given by X + Y .
17
At time 0, only the component with duration X is on. The component with duration Y replaces the
other one as soon as it fails.
4.4 Lp spaces
Motivation 4.95 — Lp spaces (Karr, 1993, p. 119)
While describing a r.v. in a partial way, we tend to deal with E(X^p), p ∈ [1, +∞), or a
function of several such expected values. Needless to say, we have to guarantee that
E(|X|^p) is finite. •
where b > 0 is the minimum possible value of X and α > 0 is called the Pareto index.
For which values of p ∈ [1, +∞) do we have X ∈ L^p? •
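A Monte Carlo sketch of the answer (the choices α = 5 and b = 1 are assumptions for illustration): X ∈ L^p iff p < α, and for p < α, E(X^p) = α b^p/(α − p); sampling uses the inverse transform X = U^{−1/α}:

```python
import random

# Inverse-transform sampling for a Pareto(b = 1, α = 5) r.v.; the closed-form
# moment E(X^p) = α/(α − p) holds for p < α (here p = 2, so E(X^2) = 5/3).
random.seed(7)
alpha, n = 5.0, 200_000
samples = [random.random() ** (-1.0 / alpha) for _ in range(n)]

p = 2
estimate = sum(x ** p for x in samples) / n
theory = alpha / (alpha - p)
print(estimate, theory)
```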
4.5 Key inequalities
What immediately follows is a table with an overview of a few extremely useful inequalities
involving expectations.
Some of these inequalities are essential to prove certain types of convergence of
sequences of r.v. in Lp and uniform integrability (Resnick, 1999, p. 189)18 and provide
answers to a few questions we compiled after the table.
Finally, we state and treat each inequality separately.
(Chebyshev-Bienaymé)    X ∈ L^2, a > 0    P(\{|X − E(X)| ≥ a\}) ≤ V(X)/a^2
                        X ∈ L^2, a > 0    P(\{|X − E(X)| ≥ a \sqrt{V(X)}\}) ≤ 1/a^2
(Cantelli)              X ∈ L^2, a > 0    P(\{|X − E(X)| ≥ a\}) ≤ 2V(X)/(a^2 + V(X))
(one-sided Chebyshev)   X ∈ L^2, a > 0    P(\{X − E(X) ≥ a \sqrt{V(X)}\}) ≤ 1/(1 + a^2)
18
For a definition of uniform integrability see http://en.wikipedia.org/wiki/Uniform_integrability.
Motivation 4.100 — A few (moment) inequalities
• Young — How can we relate the areas under (resp. above) an increasing function h
in the interval [0, a] (resp. in the interval [0, h^{−1}(b)]) with the area of the rectangle
with vertices (0, 0), (0, b), (a, 0) and (a, b), where b ∈ (0, max_{x∈[0,a]} h(x)]?
• Chebyshev — When can we provide non trivial upper bounds for the tail
probability P(\{X ≥ a\})? •
• a, b ∈ IR+ .
Then
a × b ≤ H(a) + K(b). (4.66)
•
Exercise 4.102 — Young’s inequality (Karr, 1993, p. 119)
Prove Lemma 4.101, by using a graphical argument (Karr, 1993, p. 119). •
then

a × b ≤ \frac{a^p}{p} + \frac{b^q}{q}.    (4.67)

For the proof of this result see http://en.wikipedia.org/wiki/Young%27s_inequality, which
states (4.67) as Young's inequality. See also Karr (1993, p. 120) for a reference to (4.67)
as a consequence of Young's inequality as stated in (4.66). •
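Young's inequality (4.67) is easy to probe numerically; the grid and the conjugate pair p = 3, q = 3/2 below are arbitrary choices for this sketch:

```python
import math

# Grid check of a·b ≤ a^p/p + b^q/q for Hölder conjugates 1/p + 1/q = 1,
# with equality iff a^p = b^q (i.e. b = a^{p−1}).
p = 3.0
q = p / (p - 1)

ok = all(a * b <= a ** p / p + b ** q / q + 1e-12
         for a in [0.1 * i for i in range(1, 40)]
         for b in [0.1 * j for j in range(1, 40)])
print(ok)  # True

# equality case
a = 1.7
b = a ** (p - 1)
print(math.isclose(a * b, a ** p / p + b ** q / q))  # True
```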
Then

X × Y ∈ L^1    (4.68)

E(|XY|) ≤ E^{1/p}(|X|^p) × E^{1/q}(|Y|^q).    (4.69)
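Hölder's moment inequality can also be checked numerically in its sum form (stated below as (4.71)); the random vectors and the conjugate pair p = 3, q = 3/2 are arbitrary choices for this sketch:

```python
import random

# Numeric check of Σ|x_k·y_k| ≤ (Σ|x_k|^p)^{1/p} · (Σ|y_k|^q)^{1/q}
# for Hölder conjugates p = 3, q = 3/2 (1/p + 1/q = 1).
random.seed(1)
p, q = 3.0, 1.5
xs = [random.uniform(-2, 2) for _ in range(50)]
ys = [random.uniform(-2, 2) for _ in range(50)]

lhs = sum(abs(x * y) for x, y in zip(xs, ys))
rhs = sum(abs(x) ** p for x in xs) ** (1 / p) * sum(abs(y) ** q for y in ys) ** (1 / q)
print(lhs <= rhs)  # True
```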
Remarks 4.105 — Hölder’s (moment) inequality
• The numbers p and q above are said to be Hölder conjugates of each other.
• In case we are dealing with S, a measurable subset of IR with the Lebesgue measure,
and f and g are measurable real-valued functions on S then Hölder’s inequality reads
as follows:
\int_S |f(x) × g(x)| \, dx ≤ \left( \int_S |f(x)|^p \, dx \right)^{1/p} × \left( \int_S |g(x)|^q \, dx \right)^{1/q}.    (4.70)
• When we are dealing with n−dimensional Euclidean space and the counting
measure, we have
\sum_{k=1}^n |x_k × y_k| ≤ \left( \sum_{k=1}^n |x_k|^p \right)^{1/p} × \left( \sum_{k=1}^n |y_k|^q \right)^{1/q},    (4.71)
4.5.3 Cauchy-Schwarz’s moment inequality
A special case of Hölder’s moment inequality — p = q = 2 — is nothing but the Cauchy-
Schwarz’s moment inequality.
In mathematics, the Cauchy-Schwarz inequality19 is a useful inequality encountered
in many different settings, such as linear algebra applied to vectors, in analysis applied to
infinite series and integration of products, and in probability theory, applied to variances
and covariances.
The inequality for sums was published by Augustin Cauchy in 1821, while the
corresponding inequality for integrals was first stated by Viktor Yakovlevich Bunyakovsky
in 1859 and rediscovered by Hermann Amandus Schwarz in 1888 (often misspelled
“Schwartz”).
X × Y ∈ L^1    (4.72)

E(|X × Y|) ≤ \sqrt{E(|X|^2) × E(|Y|^2)}.    (4.73)
•
Remarks 4.108 — Cauchy-Schwarz’s moment inequality
(http://en.wikipedia.org/wiki/Cauchy-Schwarz_inequality)
• In the Euclidean space IRn with the standard inner product, the Cauchy-Schwarz’s
inequality is
\left( \sum_{i=1}^n x_i × y_i \right)^2 ≤ \left( \sum_{i=1}^n x_i^2 \right) × \left( \sum_{i=1}^n y_i^2 \right).    (4.74)
• The triangle inequality for the inner product is often shown as a consequence of the
Cauchy-Schwarz inequality, as follows: given vectors x and y, we have
\|x + y\|^2 = \langle x + y, x + y \rangle ≤ (\|x\| + \|y\|)^2.    (4.75)
•
19
Also known as the Bunyakovsky inequality, the Schwarz inequality, or the Cauchy-Bunyakovsky-Schwarz inequality (http://en.wikipedia.org/wiki/Cauchy-Schwarz_inequality).
Exercise 4.109 — Confronting the squared covariance and the product of the
variance of two r.v.
Prove that
L^s ⊆ L^r    (4.77)

E^{1/r}(|X|^r) ≤ E^{1/s}(|X|^s).    (4.78)
•
Remarks 4.111 — Lyapunov’s moment inequality
This result is not correctly stated in Karr (1993, p. 121) and can also be deduced
from the Cauchy-Schwarz inequality, as well as from Jensen's inequality, stated
below.
• The equality in (4.78) holds iff X is a degenerate r.v., i.e. X \overset{d}{=} c, where c is a real
constant (Rohatgi, 1976, p. 103). •
21
Rohatgi (1976, p. 103) provides an alternative proof.
22
The real line is a normed vector space with the absolute value as the norm, and so the triangle
inequality states that |x + y| ≤ |x| + |y|, for any real numbers x and y. The triangle inequality is useful
in mathematical analysis for determining the best upper estimate on the size of the sum of two numbers,
in terms of the sizes of the individual numbers (http://en.wikipedia.org/wiki/Triangle_inequality).
4.5.6 Jensen’s moment inequality
Jensen’s inequality, named after the Danish mathematician and engineer Johan Jensen
(1859–1925), relates the value of a convex function of an integral to the integral of the
convex function. It was proved by Jensen in 1906.
Given its generality, the inequality appears in many forms depending on the context.
In its simplest form the inequality states that
• the convex transformation of a mean is less than or equal to the mean after convex
transformation.
• For a real convex function g, numbers x_1, x_2, . . . , x_n in its domain, and positive
weights a_i, i = 1, . . . , n, Jensen's inequality can be stated as:

g\left( \frac{\sum_{i=1}^n a_i × x_i}{\sum_{i=1}^n a_i} \right) ≤ \frac{\sum_{i=1}^n a_i × g(x_i)}{\sum_{i=1}^n a_i}.    (4.86)
• For instance, considering g(x) = log(x), which is a concave function, we can establish
the arithmetic mean-geometric mean inequality:23 for any list of n non negative real
numbers x_1, x_2, . . . , x_n,

\bar{x} = \frac{x_1 + x_2 + \ldots + x_n}{n} ≥ \sqrt[n]{x_1 × x_2 × \ldots × x_n} = mg.    (4.88)
Moreover, equality in (4.88) holds iff x1 = x2 = . . . = xn . •
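A numeric spot-check of the AM-GM inequality (4.88), which follows from Jensen's inequality with the concave g(x) = log(x); the random inputs are arbitrary choices for this sketch:

```python
import math
import random

# The geometric mean never exceeds the arithmetic mean; 100 random lists of
# 8 positive numbers are checked.
random.seed(2)
violations = 0
for _ in range(100):
    xs = [random.uniform(0.1, 10) for _ in range(8)]
    am = sum(xs) / len(xs)
    gm = math.exp(sum(math.log(x) for x in xs) / len(xs))
    if gm > am + 1e-12:
        violations += 1
print(violations)  # 0
```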
Exercise 4.120 — Jensen’s inequality and the distance between the mean and
the median
Prove that for any r.v. having an expected value and a median, the mean and the median
can never differ from each other by more than one standard deviation:
|E(X) − med(X)| ≤ \sqrt{V(X)},    (4.89)
by using Jensen’s inequality twice — applied to the absolute value function and to the
square root function24 (http://en.wikipedia.org/wiki/Chebyshev%27s_inequality). •
23
See http://en.wikipedia.org/wiki/AM-GM_inequality.
24
In this last case we should apply the concave version of Jensen’s inequality.
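The bound of Exercise 4.120 can be illustrated on a concrete case (X ∼ Exponential(1), an example chosen here, not from the notes): E(X) = 1, med(X) = ln 2 and SD(X) = 1, so the gap is about 0.307 ≤ 1:

```python
import math

# Mean, median and standard deviation of Exponential(1): the mean-median gap
# 1 − ln 2 ≈ 0.307 is indeed at most one standard deviation.
mean, median, sd = 1.0, math.log(2), 1.0
gap = abs(mean - median)
print(gap, gap <= sd)  # 0.306..., True
```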
4.5.7 Chebyshev’s inequality
Curiously, Chebyshev’s inequality is named after the Russian mathematician
Pafnuty Lvovich Chebyshev (1821–1894), although it was first formulated by
his friend and French colleague Irénée-Jules Bienaymé (1796–1878), according to
http://en.wikipedia.org/wiki/Chebyshev%27s_inequality.
In probability theory, Chebyshev's inequality,25 in its most usual version — what
Karr (1993, p. 122) calls the Bienaymé-Chebyshev inequality —, can ultimately be stated
as follows:
• no more than 1/k^2 × 100% of the values of the r.v. X are more than k standard
deviations away from the expected value of X.
• a > 0.
Then
P(\{X ≥ a\}) ≤ \frac{E[g(X)]}{g(a)}.    (4.90)
•
• Markov’s inequalities
X ∈ L^1, a > 0 ⇒ P(\{|X| ≥ a\}) ≤ E[|X|]/a
X ∈ L^p, a > 0 ⇒ P(\{|X| ≥ a\}) ≤ E[|X|^p]/a^p
25
Also known as Tchebysheff's inequality, Chebyshev's theorem, or the Bienaymé-Chebyshev
inequality (http://en.wikipedia.org/wiki/Chebyshev%27s_inequality).
26
Karr (1993) does not mention this inequality. For more details see
http://en.wikipedia.org/wiki/Chernoff%27s_inequality.
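Markov's inequality from the table above can be verified exactly for a small discrete r.v. (the uniform p.f. on {0, . . . , 4} is an assumed example):

```python
from fractions import Fraction

# Exact check of P({|X| ≥ a}) ≤ E(|X|)/a over a range of thresholds a.
pmf = {k: Fraction(1, 5) for k in range(5)}
e_abs = sum(abs(k) * p for k, p in pmf.items())   # E|X| = 2

violations = 0
for a in range(1, 10):
    tail = sum(p for k, p in pmf.items() if abs(k) >= a)
    if tail > Fraction(e_abs, a):
        violations += 1
print(violations)  # 0
```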
• Chebyshev-Bienaymé's inequalities
X ∈ L^2, a > 0 ⇒ P(\{|X − E(X)| ≥ a\}) ≤ V(X)/a^2
X ∈ L^2, a > 0 ⇒ P(\{|X − E(X)| ≥ a \sqrt{V(X)}\}) ≤ 1/a^2
• Cantelli's inequality
X ∈ L^2, a > 0 ⇒ P(\{|X − E(X)| ≥ a\}) ≤ 2V(X)/(a^2 + V(X))
(a) Prove that from the Chebyshev(-Bienaymé)’s inequality we can then infer that the
chance that a given article is between 600 and 1400 characters would be at least
75%.
(b) The inequality is coarse: a more accurate guess would be possible if the distribution
of the length of the articles is known.
Demonstrate that, for example, a normal distribution would yield a 75% chance of
an article being between 770 and 1230 characters long.
(http://en.wikipedia.org/wiki/Chebyshev%27s_inequality.) •
Exercise 4.126 — Chebyshev(-Bienaymé)’s inequality
Let X ∼ Uniform(0, 1).
(a) Obtain P(\{|X − 1/2| < 2\sqrt{1/12}\}).
(b) Obtain a lower bound for P(\{|X − 1/2| < 2\sqrt{1/12}\}), by noting that E(X) = 1/2
and V(X) = 1/12. Compare this bound with the value you obtained in (a).
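One way to check Exercise 4.126 numerically (a sketch of the intended computation): since 2√(1/12) ≈ 0.577 > 1/2, the interval covers all of (0, 1), so the exact probability is 1, while Chebyshev only guarantees 1 − 1/a² = 0.75:

```python
import math

# X ~ Uniform(0, 1), E(X) = 1/2, V(X) = 1/12, a = 2.
a = 2.0
radius = a * math.sqrt(1 / 12)                            # ≈ 0.577 > 1/2
exact = min(0.5 + radius, 1.0) - max(0.5 - radius, 0.0)   # P({|X − 1/2| < radius})
bound = 1 - 1 / a ** 2                                    # Chebyshev lower bound
print(exact, bound)  # 1.0 0.75
```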
(a) Prove that these bounds cannot be improved upon for the r.v. X with p.f.

P(\{X = x\}) = \begin{cases} \frac{1}{2k^2}, & x = −1 \\ 1 − \frac{1}{k^2}, & x = 0 \\ \frac{1}{2k^2}, & x = 1 \\ 0, & \text{otherwise,} \end{cases}    (4.91)

where k > 1, that is, P(\{|X − E(X)| ≥ k \sqrt{V(X)}\}) = \frac{1}{k^2}. (For more details see
http://en.wikipedia.org/wiki/Chebyshev%27s_inequality.)27
(b) Prove that equality holds exactly for any r.v. Y that is a linear transformation of
X.28 •
where \bar{X}_n = \frac{1}{n} \sum_{i=1}^n X_i. That is, \bar{X}_n \overset{P}{\longrightarrow} µ as n → +∞. •
27
This is the answer to Exercise 4.36 from Karr (1993, p. 133).
28
Strict inequality holds for any r.v. that is not a linear transformation of X
(http://en.wikipedia.org/wiki/Chebyshev%27s_inequality).
Exercise 4.129 — Chebyshev(-Bienaymé)’s inequality and the weak law of
large numbers
Use Chebyshev(-Bienaymé)’s inequality to prove the weak law of large numbers stated in
Remark 4.128. •
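A simulation sketch of the weak law (the Uniform(0, 1) choice and the sample size are assumptions for the illustration): the sample mean concentrates around µ = 1/2:

```python
import random

# Sample mean of n i.i.d. Uniform(0, 1) r.v.; by the WLLN it should be near 0.5.
random.seed(42)
n = 1_000_000
xbar = sum(random.random() for _ in range(n)) / n
print(xbar)  # close to 0.5
```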
(http://en.wikipedia.org/wiki/Chebyshev%27s_inequality). •
4.6 Moments
Motivation 4.132 — Moments of r.v.
The nature of a r.v. can be partially described in terms of a number of features — such
as the expected value, the variance, the skewness, the kurtosis, etc. — that can be written
in terms of expectations of powers of X, the moments of a r.v. •
Then:
Remarks 4.134 — kth. moment and kth. central moment of a r.v. (Karr, 1993,
p. 123; http://en.wikipedia.org/wiki/Moment_(mathematics))
• The kth. central moment exists under the assumption that X ∈ Lk because Lk ⊆ L1 ,
for any k ∈ IN (a consequence of Lyapunov’s inequality).
• If the kth. (central) moment exists29 so does the (k − 1)th. (central) moment, and
all lower-order moments. This is another consequence of Lyapunov’s inequality.
• If X ∈ L1 the first moment is the expectation of X; the first central moment is thus
null. In higher orders, the central moments are more interesting than the moments
about zero. •
29
Or the kth. moment about any point exists. Note that the kth. central moment is nothing but the
kth. moment about E(X).
Proposition 4.135 — Computing the kth. moment of a non negative r.v.
If X is a non negative r.v. and X ∈ Lk , for k ∈ IN , we can write the kth. moment of X
in terms of the following Riemann integral:
E(X^k) = k \int_0^{+\infty} x^{k−1} × [1 − F_X(x)] \, dx.    (4.96)
•
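Formula (4.96) can be checked numerically for k = 2 and X ∼ Exponential(1) (an example chosen for this sketch), where E(X²) = 2:

```python
import math

# Midpoint-rule evaluation of 2·∫₀^∞ x·[1 − F_X(x)] dx with 1 − F_X(x) = e^{−x};
# the truncation point and step count are arbitrary choices.
steps, upper = 200_000, 40.0
h = upper / steps
integral = sum((i + 0.5) * h * math.exp(-(i + 0.5) * h) for i in range(steps)) * h
second_moment = 2 * integral
print(second_moment)  # close to E(X^2) = 2
```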
Exercise 4.138 — The median of a r.v. and the minimization of the expected
absolute deviation (Karr, 1993, p. 130, Exercise 4.2)
The median of the r.v. X, med(X), is such that P(\{X ≤ med(X)\}) ≥ 1/2 and
P(\{X ≥ med(X)\}) ≥ 1/2.
Prove that if X ∈ L1 then
Exercise 4.139 — Minimizing the mean squared error (Karr, 1993, p. 131,
Exercise 4.12)
Let {A1 , . . . , An } be a finite partition of Ω. Suppose that we know which of A1 , . . . , An
has occurred, and wish to predict whether some other event B has occurred. Since we
know the values of the indicator functions 1_{A_1}, . . . , 1_{A_n}, it makes sense to use a predictor
that is a function of them, namely linear predictors of the form Y = \sum_{i=1}^n a_i × 1_{A_i}, whose
accuracy is assessed via the mean squared error:
Exercise 4.140 — Expectation of a r.v. with respect to a conditional
probability function (Karr, 1993, p. 132, Exercise 4.20)
Let A be an event such that P (A) > 0.
Show that if X is positive or integrable then E(X|A), the expectation of X with
respect to the conditional probability function PA (B) = P (B|A), is given by
E(X|A) \overset{def}{=} \frac{E(X; A)}{P(A)},    (4.99)
where E(X; A) = E(X × 1A ) represents the expectation of X over the event A. •
Exercise 4.143 — The meaning of a null variance (Karr, 1993, p. 131, Exercise
4.19)
Prove that if V(X) = 0 then X \overset{a.s.}{=} E(X). •
Proposition 4.145 — Variance of the sum (or difference) of two independent
r.v. (Karr, 1993, p. 124)
If X, Y ∈ L2 and are two independent r.v. then
V (X + Y ) = V (X − Y ) = V (X) + V (Y ). (4.104)
•
Exercise 4.146 — Expected values and variances of some important r.v. (Karr,
1993, pp. 125 and 130, Exercise 4.1)
Verify the entries of the following table.
Distribution               Parameters              Mean                  Variance
Bernoulli(p)               p ∈ [0, 1]              p                     p(1 − p)
Binomial(n, p)             n ∈ IN; p ∈ [0, 1]      np                    np(1 − p)
Hypergeometric(N, M, n)    N ∈ IN;                 nM/N                  n(M/N)(1 − M/N)(N − n)/(N − 1)
                           M ∈ IN, M ≤ N;
                           n ∈ IN, n ≤ N
Geometric(p)               p ∈ [0, 1]              1/p                   (1 − p)/p^2
Poisson(λ)                 λ ∈ IR^+                λ                     λ
NegativeBinomial(r, p)     r ∈ IN; p ∈ [0, 1]      r/p                   r(1 − p)/p^2
Uniform(a, b)              a, b ∈ IR, a < b        (a + b)/2             (b − a)^2/12
Exponential(λ)             λ ∈ IR^+                1/λ                   1/λ^2
Gamma(α, β)                α, β ∈ IR^+             α/β                   α/β^2
Beta(α, β)                 α, β ∈ IR^+             α/(α + β)             αβ/[(α + β)^2 (α + β + 1)]
Weibull(α, β)              α, β ∈ IR^+             α Γ(1 + 1/β)          α^2 [Γ(1 + 2/β) − Γ^2(1 + 1/β)]
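Entries of the table can be verified by direct summation over the p.f.; the Binomial row is checked exactly below (n = 6 and p = 1/3 are arbitrary choices for this sketch):

```python
from fractions import Fraction
from math import comb

# Exact mean and variance of Binomial(6, 1/3) by summation over the p.f.,
# compared with the closed forms np and np(1 − p).
n, p = 6, Fraction(1, 3)
pmf = {k: comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(n + 1)}

mean = sum(k * q for k, q in pmf.items())
var = sum((k - mean) ** 2 * q for k, q in pmf.items())
print(mean, var)  # 2 and 4/3
```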
Definition 4.147 — Normalized (central) moments of a r.v.
(http://en.wikipedia.org/wiki/Moment_(mathematics))
Let X be a r.v. such that X ∈ L^k, for some k ∈ IN. Then:
• the normalized kth. moment of X is the kth. moment divided by [SD(X)]^k,

\frac{E(X^k)}{[SD(X)]^k};    (4.105)

• the normalized kth. central moment of X is given by

\frac{E\{[X − E(X)]^k\}}{[SD(X)]^k}.    (4.106)
These normalized central moments are dimensionless quantities, which represent the
distribution independently of any linear change of scale. •
• A r.v. that is skewed to the right (the tail of the p.(d.)f. is heavier on the right),
will have a positive skewness. •
Exercise 4.151 — Skewness of a r.v.
Prove that the skewness of:
• If the p.(d.)f. of the r.v. X has a peak at the expected value and long tails, the 4th.
moment will be high and the kurtosis positive. Bounded distributions tend to have
low kurtosis.
• KC(X) must be greater than or equal to [SC(X)]2 − 2; equality only holds for
Bernoulli distributions (prove!).
• For unbounded skew distributions not too far from normal, KC(X) tends to be
somewhere between [SC(X)]2 and 2 × [SC(X)]2 . •
30
Some authors do not subtract three.
Exercise 4.155 — Kurtosis of a r.v.
Prove that the kurtosis of:
4.6.4 Covariance
Motivation 4.156 — Covariance (and correlation) between two r.v.
It is crucial to obtain measures of how much two variables change together, namely
absolute and relative measures of (linear) association between pairs of r.v. •
(this last formula is more useful for computational purposes, prove it!) •
1. X ⊥⊥ Y ⇒ cov(X, Y) = 0
2. cov(X, Y) = 0 ⇏ X ⊥⊥ Y
3. cov(X, Y) ≠ 0 ⇒ X and Y are not independent
4. cov(X, Y ) = cov(Y, X) (symmetric operator!)
5. cov(X, X) = V(X) ≥ 0 and V(X) = 0 ⇒ X \overset{a.s.}{=} E(X) (positive semi-definite
operator!)
6. cov(aX, bY ) = a b cov(X, Y )
7. cov(X + a, Y + b) = cov(X, Y )
When we deal with uncorrelated r.v. — i.e., if cov(X_i, X_j) = 0, ∀i ≠ j — or with pairwise
independent r.v. — that is, X_i ⊥⊥ X_j, ∀i ≠ j —, we have:
V\left( \sum_{i=1}^n c_i X_i \right) = \sum_{i=1}^n c_i^2 V(X_i).    (4.114)
And if, besides being uncorrelated or (pairwise) independent r.v., we have ci = 1, for
i = 1, . . . , n, we get:
V\left( \sum_{i=1}^n X_i \right) = \sum_{i=1}^n V(X_i),    (4.115)
i.e. the variance of the sum of uncorrelated or (pairwise) independent r.v. is the sum of
the individual variances. •
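Proposition 4.145 and (4.115) can be checked exactly on a toy example (two independent fair dice, an assumption for the illustration):

```python
from fractions import Fraction

# V(X + Y) = V(X) + V(Y) for independent X and Y, verified by enumerating
# the distribution of the sum of two fair dice.
p = {k: Fraction(1, 6) for k in range(1, 7)}

def var(pmf):
    m = sum(k * q for k, q in pmf.items())
    return sum((k - m) ** 2 * q for k, q in pmf.items())

psum = {}
for x in p:
    for y in p:
        psum[x + y] = psum.get(x + y, Fraction(0)) + p[x] * p[y]

print(var(psum) == 2 * var(p))  # True: 35/6 = 35/12 + 35/12
```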
4.6.5 Correlation
Motivation 4.162 — Correlation between two r.v.
(http://en.wikipedia.org/wiki/Correlation_and_dependence)
The most familiar measure of dependence between two r.v. is (Pearson’s) correlation.
It is obtained by dividing the covariance between two variables by the product of their
standard deviations.
Correlations are useful because they can indicate a predictive relationship that can
be exploited in practice. For example, an electrical utility may produce less power on a
mild day based on the correlation between electricity demand and weather. Moreover,
correlations can also suggest possible causal, or mechanistic relationships. •
corr(X, Y ) = 0 (4.117)
31
Also known as Pearson's correlation coefficient (http://en.wikipedia.org/wiki/Correlation_and_dependence).
32
X, Y ∈ L^2 are said to be correlated r.v. if corr(X, Y) ≠ 0.
Exercise 4.168 — Sufficient conditions to deal with uncorrelated sample mean
and variance (Karr, 1993, p. 132, Exercise 4.28)
Let X_i \overset{i.i.d.}{\sim} X, i = 1, . . . , n, such that E(X) = E(X^3) = 0.
Prove that the sample mean \bar{X} = \frac{1}{n} \sum_{i=1}^n X_i and the sample variance
S^2 = \frac{1}{n−1} \sum_{i=1}^n (X_i − \bar{X})^2 are uncorrelated r.v. •
2. corr(X, Y) = 0 ⇏ X ⊥⊥ Y
3. corr(X, Y) ≠ 0 ⇒ X and Y are not independent
4. corr(X, Y) = corr(Y, X)
5. corr(X, X) = 1
6. corr(aX, bY) = corr(X, Y), for ab > 0
Exercise 4.171 — Negative linear association between three r.v. (Karr, 1993, p.
131, Exercise 4.17)
Prove that there are no r.v. X, Y and Z such that corr(X, Y ) = corr(Y, Z) =
corr(Z, X) = −1. •
33
A consequence of Cauchy-Schwarz’s inequality.
Remark 4.173 — Interpretation of the size of a correlation
(http://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient)
Several authors have offered guidelines for the interpretation of a correlation coefficient.
Others have observed, however, that all such criteria are in some ways arbitrary and
should not be observed too strictly.
The interpretation of a correlation coefficient depends on the context and purposes.
A correlation of 0.9 may be very low if one is verifying a physical law using high-quality
instruments, but may be regarded as very high in the social sciences where there may be
a greater contribution from complicating factors. •
Thus, if the absolute value of corr(X, Y) is very close to one, we are tempted to add
that the association between X and Y is “likely” to be linear.
The second one (top right) is not distributed normally; while an obvious relationship
between the two variables can be observed, it is not linear, and the Pearson correlation
coefficient is not relevant.
In the third case (bottom left), the linear relationship is perfect, except for one outlier
which exerts enough influence to lower the correlation coefficient from 1 to 0.816.
Finally, the fourth example (bottom right) shows another example when one outlier
is enough to produce a high correlation coefficient, even though the relationship between
the two variables is not linear. •
Several sets of (x, y) points, with the correlation coefficient of x and y for each set.
Note that the correlation reflects the noisiness and direction of a linear relationship (top
row), but not the slope of that relationship (middle), nor many aspects of nonlinear
relationships (bottom). The figure in the center has a slope of 0 but in that case the
correlation coefficient is undefined because the variance of y is zero. •
34
A correlation between age and height in children is fairly causally transparent, but a correlation
between mood and health in people is less so. Does improved mood lead to improved health; or does
good health lead to good mood; or both? Or does some other factor underlie both? In other words,
a correlation can be taken as evidence for a possible causal relationship, but cannot indicate what the
causal relationship, if any, might be.
4.6.6 Moments of random vectors
Moments of random vectors are defined componentwise, pairwise, etc. Instead of an expected
value (resp. variance) we shall deal with a mean vector (resp. covariance matrix).
Definition 4.176 — Mean vector and covariance matrix of a random vector
(Karr, 1993, p. 126)
Let X = (X1 , . . . , Xd ) be a d−dimensional random vector. Then provided that:
• Xi ∈ L1 , i = 1, . . . , d, the mean vector of X is the d−vector of the individual means,
µ = (E(X1 ), . . . , E(Xd ));
4.6.7 Multivariate normal distributions
Motivation 4.179 — Multivariate normal distribution (Tong, 1990, p. 1)
There are many reasons for the predominance of the multivariate normal distribution:
• even if the original data cannot be fitted satisfactorily with a multivariate normal
distribution, by the central limit theorem the distribution of the sample mean vector
is asymptotically normal;
• zero correlation implies independence between two components of a random vector
with multivariate normal distribution;
Definition 4.181 — Multivariate normal distribution (Karr, 1993, p. 126)
Let:
• µ = (µ1 , . . . , µd ) ∈ IRd ;
Then the random vector X = (X1 , . . . , Xd ) has a multivariate normal distribution with
mean vector µ and covariance matrix Σ if
X = Σ^{1/2} Y + µ,    (4.118)
where:
• Y = (Y_1, . . . , Y_d) with Y_i \overset{i.i.d.}{\sim} Normal(0, 1), i = 1, . . . , d;
• Σ^{1/2} is the unique matrix satisfying Σ^{1/2} × Σ^{1/2} = Σ.
Gentle (1998, p. 106) refers to other procedures to generate pseudo-random numbers with
multivariate normal distribution. •
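A simulation sketch of (4.118); note it uses the Cholesky factor L (with L L^⊤ = Σ) instead of the symmetric square root Σ^{1/2} — either factorization yields Normal_2(µ, Σ). All parameter values below are made up for the illustration:

```python
import math
import random

# Generate bivariate normal samples as X = L·Z + µ, where L is the (hand-rolled)
# 2×2 Cholesky factor of the assumed covariance matrix.
random.seed(0)
mu = (1.0, -2.0)
sigma = [[2.0, 0.6],
         [0.6, 1.0]]

l11 = math.sqrt(sigma[0][0])
l21 = sigma[1][0] / l11
l22 = math.sqrt(sigma[1][1] - l21 ** 2)

n = 200_000
xs, ys = [], []
for _ in range(n):
    z1, z2 = random.gauss(0, 1), random.gauss(0, 1)
    xs.append(mu[0] + l11 * z1)
    ys.append(mu[1] + l21 * z1 + l22 * z2)

mx, my = sum(xs) / n, sum(ys) / n
cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
print(mx, my, cov)  # close to 1.0, -2.0 and 0.6
```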
35
A d × d matrix A is called invertible or non-singular or non-degenerate if there exists a
d × d matrix B such that AB = BA = I_d, where I_d denotes the d × d identity matrix
(http://en.wikipedia.org/wiki/Invertible_matrix).
Proposition 4.183 — Characterization of the multivariate normal distribution
(Karr, 1993, pp. 126–127)
Let X ∼ Normald (µ, Σ ) where µ = (µ1 , . . . , µd ) and Σ = [σi j ]i,j=1,...,d . Then:
E(Xi ) = µi , i = 1, . . . , d; (4.119)
cov(Xi , Xj ) = σij , i, j = 1, . . . , d; (4.120)
f_X(x) = (2π)^{−d/2} \, |Σ|^{−1/2} × \exp\left\{ −\frac{1}{2} (x − µ) \, Σ^{−1} (x − µ)^⊤ \right\},    (4.121)
for x = (x1 , . . . , xd ) ∈ IRd . •
(b) Use Mathematica to plot this joint p.d.f. for µ1 = µ2 = 0 and σ12 = σ22 = 1, and at
least five different values of the correlation coefficient ρ. •
In what follows we describe a few distributional properties of bivariate normal
distributions and, more generally, multivariate normal distributions.
The following figure37 shows the two marginal distributions of a bivariate normal
distribution:
where:
• µ1 = (µ1 , . . . , µk );
• µ2 = (µk+1 , . . . , µd );
37
Taken from http://www.aiaccess.net/English/Glossaries/GlosMod/e_gm_multinormal_distri.htm.
• Σ_{11} = [σ_{ij}]_{i,j=1,...,k}; Σ_{12} = [σ_{ij}]_{i=1,...,k; j=k+1,...,d};
• Σ_{21} = Σ_{12}^⊤;
• Σ_{22} = [σ_{ij}]_{i,j=k+1,...,d}.
The following figure (where d = p)38 represents the covariance matrix of X 1 , Σ 11 , which
is just the upper left corner square submatrix of order k of the original covariance matrix:
Exercise 4.189 — Distribution/moments of a linear transformation of a
bivariate normal random vector
and b = −C µ, then Y is a bivariate normal variable with zero means, unit variances
and correlation coefficient ρ (Tong, 1990, p. 10).
Verify that Y ∗ is a bivariate normal variable with zero means, variances 1 − ρ and
1 + ρ and null correlation coefficient (Tong, 1990, p. 10). Comment.
(c) Conclude that if X ∼ Normal_2(µ, Σ) such that |ρ| < 1 then

\begin{bmatrix} \frac{1}{\sqrt{1−ρ}} & 0 \\ 0 & \frac{1}{\sqrt{1+ρ}} \end{bmatrix} \begin{bmatrix} \frac{1}{\sqrt{2}} & −\frac{1}{\sqrt{2}} \\ \frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}} \end{bmatrix} \begin{bmatrix} σ_1^{−1} & 0 \\ 0 & σ_2^{−1} \end{bmatrix} \begin{bmatrix} X_1 − µ_1 \\ X_2 − µ_2 \end{bmatrix} ∼ Normal_2(0, I_2),    (4.132)
i.e. we can obtain a bivariate normal distribution with any mean vector
and (non-singular, semi-definite positive) covariance matrix through a
transformation of two independent standard normal r.v. (Tong, 1990, p. 11;
http://xbeta.org/wiki/show/Bivariate+normal+distribution). •
Theorem 4.190 — Distribution/moments of a linear transformation of a
multivariate normal distribution (Tong, 1990, p. 32, Theorem 3.3.3)
Let:
• X ∼ Normald (µ, Σ );
Then

Y = C X + b ∼ Normal_k(C µ + b, C Σ C^⊤).    (4.134)
The family of multivariate normal distributions is closed not only under linear
transformations, as stated in the previous theorem, but also under linear combinations.
Then
where the mean vector and the covariance matrix of Y are given by
µ_Y = C_1 µ_1 + C_2 µ_2    (4.136)

Σ_Y = C_1 Σ_{11} C_1^⊤ + C_2 Σ_{22} C_2^⊤ + C_1 Σ_{12} C_2^⊤ + C_2 Σ_{21} C_1^⊤,    (4.137)
respectively. •
The result that follows has already been proved in Chapter 3 and is a particular case
of Theorem 4.193.
In general, r.v. may be uncorrelated but highly dependent. But if a random vector
has a multivariate normal distribution then any two or more of its components that are
uncorrelated are independent.
• N be a positive integer;
Definition 4.196 — Mixed (central) moments
Let:
• r1 , . . . , rd ∈ IN .
Then:
The Isserlis’ theorem is a formula that allows one to compute mixed moments of the
multivariate normal distribution with null mean vector40 in terms of the entries of its
covariance matrix.
• In his original paper from 1918, Isserlis considered only the fourth-order moments,
in which case the formula takes the form
E(X_1 X_2 X_3 X_4) = E(X_1 X_2) E(X_3 X_4) + E(X_1 X_3) E(X_2 X_4) + E(X_1 X_4) E(X_2 X_3),
which can be written in terms of the covariances: σ_{12} × σ_{34} + σ_{13} × σ_{24} + σ_{14} × σ_{23}.
Isserlis also showed that if (X_1, . . . , X_{2n}) is a zero mean multivariate normal random vector,
then E(X_1 X_2 \cdots X_{2n}) = \sum\prod E(X_i X_j),
where the \sum\prod notation means summing over all distinct ways of partitioning
X_1, . . . , X_{2n} into pairs.
• Other applications include the analysis of portfolio returns, quantum field theory,
the generation of colored noise, etc. •
Then:
• if k is odd (i.e. k = 2n − 1, n ∈ IN),

E\left[ \prod_{i=1}^d (X_i − µ_i)^{r_i} \right] = 0;    (4.144)

• if k is even (i.e. k = 2n, n ∈ IN),

E\left[ \prod_{i=1}^d (X_i − µ_i)^{r_i} \right] = \sum\prod σ_{ij},    (4.145)

where the \sum\prod is taken over all allocations of the set {1, 2, . . . , 2n} into n (unordered)
pairs, that is, if you have a k = 2n = 6th order central moment, you will be summing
the products of n = 3 covariances. •
Exercise 4.199 — Isserlis’ theorem
Let X = (X_1, . . . , X_4) ∼ Normal_4(0, Σ = [σ_{ij}]). Prove that
E(X_1 X_2 X_3 X_4) = σ_{12} σ_{34} + σ_{13} σ_{24} + σ_{14} σ_{23}
(http://en.wikipedia.org/wiki/Isserlis%27_theorem). •
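A Monte Carlo sketch of Isserlis' theorem for a constructed example (an assumption, not from the notes): taking X₁ = X₂ = Z₁ and X₃ = X₄ = Z₂ with Z₁, Z₂ i.i.d. standard normal, the theorem predicts E(X₁X₂X₃X₄) = σ₁₂σ₃₄ + σ₁₃σ₂₄ + σ₁₄σ₂₃ = 1·1 + 0 + 0 = 1:

```python
import random

# Monte Carlo estimate of E(Z1^2 · Z2^2) = 1 for independent standard normals,
# matching the Isserlis prediction for this degenerate choice of X.
random.seed(3)
n = 200_000
acc = 0.0
for _ in range(n):
    z1, z2 = random.gauss(0, 1), random.gauss(0, 1)
    acc += (z1 * z1) * (z2 * z2)
estimate = acc / n
print(estimate)  # close to 1
```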
The following figure41 shows the conditional distribution of Y |{X = x0 } of a random
vector (X, Y ) with a bivariate normal distribution:
The inverse Mills ratio is the ratio of the probability density function over the
cumulative distribution function of a distribution and corresponds to a specific conditional
expectation, as stated below.
Exercise 4.204 — Conditional distributions and the inverse Mills’ ratio in the
bivariate normal setting
Assume that X_1 represents the log-dose of insulin that has been administered and X_2
the decrease in blood sugar after a fixed amount of time. Also assume that (X_1, X_2) has
a bivariate normal distribution with mean vector and covariance matrix
µ = \begin{bmatrix} 0.56 \\ 53 \end{bmatrix} and Σ = \begin{bmatrix} 0.027 & 2.417 \\ 2.417 & 407.833 \end{bmatrix}.    (4.150)
41
Once again taken from http://www.aiaccess.net/English/Glossaries/GlosMod/e_gm_multinormal_distri.htm.
(a) Obtain the probability that the decrease in blood sugar exceeds 70, given that the
log-dose of insulin administered is equal to 0.5.
(b) Determine the log-dose of insulin that has to be administered so that the expected
value of the decrease in blood sugar equals 70.
(c) Obtain the expected value of the decrease in blood sugar, given that the log-dose of
insulin administered exceeds 0.5. •
4.6.8 Multinomial distributions
The genesis and the definition of multinomial distributions follow.
Remark 4.209 — Special case of the multinomial distribution (Johnson and Kotz,
1969, p. 281)
Needless to say, we deal with the binomial distribution when d = 2, i.e.,

Multinomial_{2−1}(n, p = (p, 1 − p)) \overset{d}{=} Binomial(n, p).    (4.153)
Curiously, J. Bernoulli, who worked with the binomial distribution, also used the
multinomial distribution. •
42
The index d − 1 follows from the fact that the r.v. N_d (or any other component of N) is redundant:
N_d = n − \sum_{i=1}^{d−1} N_i.
Remark 4.210 — Applications of multinomial distribution
(http://controls.engin.umich.edu/wiki/index.php/Multinomial_distributions)
Multinomial systems are a useful analysis tool when a “success-failure” description is
insufficient to understand the system. For instance, in chemical engineering applications,
multinomial distributions are relevant to situations where there are more than two possible
outcomes (temperature = high, med, low).
A continuous form of the multinomial distribution is the Dirichlet distribution
(http://en.wikipedia.org/wiki/Dirichlet_distribution).43 •
43
The Dirichlet distribution is in turn the multivariate generalization of the beta distribution.
Worried about risk of runaway reactions, the Miles Reactor Company is implementing
a new program to assess the safety of their reaction processes. The program consists
of running each reaction process 100 times over the next year and recording the reactor
conditions during the process every time. In order for the process to be considered safe,
the process outcomes must be within the following limits:

Outcome   Temperature   Pressure   Probability
1         high          high       n1 = 0
2         high          low        n2 ≤ 20
3         low           high       n3 ≤ 2
4         low           low        n4 = 100 − n1 − n2 − n3
44 For any positive integer d and any nonnegative integer n, we have
(x_1 + . . . + x_d)^n = Σ_{(n_1,...,n_d)} [n! / (n_1! · · · n_d!)] × Π_{i=1}^{d} x_i^{n_i}, (4.154)
where the summation is taken over all d-vectors of nonnegative integer indices n_1, . . . , n_d such that the sum of all n_i is n. As with the binomial theorem, quantities of the form 0^0 which appear are taken to equal 1. See http://en.wikipedia.org/wiki/Multinomial_theorem for more details.
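A quick numerical check of (4.154), for the illustrative values d = 3 and n = 4 (chosen arbitrarily):

```python
from math import factorial, isclose

def multinomial_coef(ns):
    """n!/(n1! ... nd!) with n = n1 + ... + nd."""
    c = factorial(sum(ns))
    for k in ns:
        c //= factorial(k)
    return c

# Compare (x1 + x2 + x3)^n with the right-hand side of (4.154).
x, n = (1.3, -0.7, 2.0), 4
lhs = sum(x) ** n
rhs = sum(
    multinomial_coef((n1, n2, n - n1 - n2))
    * x[0] ** n1 * x[1] ** n2 * x[2] ** (n - n1 - n2)
    for n1 in range(n + 1)
    for n2 in range(n - n1 + 1)
)
print(lhs, rhs)
```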
Definition 4.214 — Mixed factorial moments
Let:
• X = (X_1, . . . , X_d) be a random vector;
• r_1, . . . , r_d ∈ IN_0.
Marginal moments and marginal central moments, and covariances and correlations between the components of a random vector, can be written in terms of mixed (central/factorial) moments. This is particularly useful when we are dealing with the multinomial distribution, e.g. to obtain cov(X_i, X_j).
From the general formula (4.157), we find the expected value of N_i, and the covariances and correlations between N_i and N_j:
E(N_i) = n p_i, (4.158)
for i = 1, . . . , d.
The covariance matrix is as follows. Each diagonal entry is the variance
V(N_i) = n p_i (1 − p_i), (4.159)
for i = 1, . . . , d, and each off-diagonal entry is the covariance
cov(N_i, N_j) = −n p_i p_j, (4.160)
for i, j = 1, . . . , d, i ≠ j. All covariances are negative because, for fixed n, an increase in one component of a multinomial vector requires a decrease in another component. The covariance matrix is a d × d positive-semidefinite matrix of rank d − 1.
The off-diagonal entries of the corresponding correlation matrix are given by
corr(N_i, N_j) = − √[ p_i p_j / ((1 − p_i)(1 − p_j)) ], (4.161)
for i, j = 1, . . . , d, i ≠ j.45 Note that the number of trials (n) drops out of the expression of corr(N_i, N_j). •
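The formulas (4.158)–(4.161) can be checked empirically by simulation; n, p and the sample size below are arbitrary choices:

```python
import numpy as np

# Simulation check of E(Ni) = n pi, cov(Ni, Nj) = -n pi pj and (4.161).
rng = np.random.default_rng(0)
n, p = 30, np.array([0.5, 0.3, 0.2])
N = rng.multinomial(n, p, size=200_000)

means = N.mean(axis=0)                      # should be close to n * p
cov = np.cov(N, rowvar=False)               # diagonal ~ n p_i (1 - p_i)
corr = cov[0, 1] / np.sqrt(cov[0, 0] * cov[1, 1])
print(means, cov[0, 1], corr)
```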
Proposition 4.220 — Marginal distributions in a multinomial setting (Johnson and Kotz, 1969, p. 281)
The marginal distribution of any N_i, i = 1, . . . , d, is binomial with parameters n and p_i, i.e.
N_i ∼ Binomial(n, p_i), (4.162)
for i = 1, . . . , d. •
More generally, the joint p.f. of any subset (N_{a_1}, . . . , N_{a_s}) of the components is
P(N_{a_1} = n_{a_1}, . . . , N_{a_s} = n_{a_s}) = [n! / (Π_{i=1}^{s} n_{a_i}! × (n − Σ_{i=1}^{s} n_{a_i})!)] × Π_{i=1}^{s} p_{a_i}^{n_{a_i}} × (1 − Σ_{j=1}^{s} p_{a_j})^{n − Σ_{i=1}^{s} n_{a_i}}, (4.163)
for n_{a_i} ∈ IN_0, i = 1, . . . , s, such that Σ_{i=1}^{s} n_{a_i} ≤ n. •
E(N_i | N_j) = (n − N_j) × p_i / (1 − p_j). (4.164)
• The random vector (N_{a_1}, . . . , N_{a_s}), conditional on an event referring to any subset of the remaining N_j's, say {N_{b_1} = n_{b_1}, . . . , N_{b_r} = n_{b_r}}, also has a multinomial distribution. Its p.f. can be found in Johnson and Kotz (1969, p. 284). •
Remark 4.223 — Conditional distributions and the simulation of a multinomial distribution (Gentle, 1998, p. 106)
The following conditional distributions, taken from Gentle (1998, p. 106), suggest a procedure to generate pseudo-random vectors with a multinomial distribution:
• N_1 ∼ Binomial(n, p_1);
• N_2 | {N_1 = n_1} ∼ Binomial(n − n_1, p_2 / (1 − p_1));
• N_3 | {N_1 = n_1, N_2 = n_2} ∼ Binomial(n − n_1 − n_2, p_3 / (1 − p_1 − p_2));
• ...
• N_{d−1} | {N_1 = n_1, . . . , N_{d−2} = n_{d−2}} ∼ Binomial(n − Σ_{i=1}^{d−2} n_i, p_{d−1} / (1 − Σ_{i=1}^{d−2} p_i));
• N_d | {N_1 = n_1, . . . , N_{d−1} = n_{d−1}} =^d n − Σ_{i=1}^{d−1} n_i.
4. . . .
5. generate a binomial pseudo-random value with parameters n − Σ_{i=1}^{d−2} n_{(d+1−i)} and p_{(2)} / (1 − Σ_{i=1}^{d−2} p_{(d+1−i)}), say n_{(2)},
and finally
6. assign n_{(1)} = n − Σ_{i=1}^{d−1} n_{(d+1−i)}. •
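A sketch of this sequential-conditional procedure in Python (the function name and the use of numpy's generator are our choices, not Gentle's):

```python
import numpy as np

def rmultinomial(n, p, rng):
    """One Multinomial(n, p) draw via the conditional binomials above:
    N1 ~ Binomial(n, p1), N2 | N1 = n1 ~ Binomial(n - n1, p2/(1 - p1)), ...,
    with the last component taken as the deterministic remainder."""
    counts = [0] * len(p)
    remaining, mass = n, 1.0
    for i in range(len(p) - 1):
        # min() guards against floating-point drift pushing the ratio above 1
        counts[i] = int(rng.binomial(remaining, min(1.0, p[i] / mass)))
        remaining -= counts[i]
        mass -= p[i]
    counts[-1] = remaining  # N_d = n - sum of the other components
    return counts

rng = np.random.default_rng(123)
sample = rmultinomial(100, [0.2, 0.3, 0.5], rng)
print(sample)  # three nonnegative counts summing to 100
```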
References
• Gentle, J.E. (1998). Random Number Generation and Monte Carlo Methods.
Springer-Verlag. (QA298.GEN.50103)
• Johnson, N.L. and Kotz, S. (1969). Discrete Distributions. John Wiley & Sons. (QA273-280/1.JOH.36178)
Chapter 5
• there is convergence of Xn(ω) in the classical sense to a fixed value X(ω), for each and every outcome ω;
• the probability that the distance between Xn and a particular r.v. X exceeds any prescribed positive value decreases and converges to zero;
• the sequence of expected values of the (absolute or quadratic) distance between Xn and X converges to zero;
• convergence in quadratic mean or in L2 (→^{q.m.});
• convergence in L1 or in mean (→^{L1});
• convergence in distribution (→^d).
It is important for the reader to be familiar with all these modes of convergence, with the ways they relate to one another and with the applications of such results, and to understand their considerable significance in probability, statistics and stochastic processes.
Definition 5.2 — Almost sure convergence (Karr, 1993, p. 135; Rohatgi, 1976, p. 249)
The sequence of r.v. {X1, X2, . . .} is said to converge almost surely to a r.v. X if
P({ω : lim_{n→+∞} Xn(ω) = X(ω)}) = 1. (5.1)
In this case we write Xn →^{a.s.} X (or Xn → X with probability 1). •
Motivation 5.5 — Convergence in probability (Karr, 1993, p. 135; http://en.wikipedia.org/wiki/Convergence_of_random_variables)
Convergence in probability essentially means that the probability that |Xn − X| exceeds any prescribed, strictly positive value converges to zero.
The basic idea behind this type of convergence is that the probability of an "unusual" outcome becomes smaller and smaller as the sequence progresses. •
Example 5.8 — Convergence in probability
Let {X1, X2, . . .} be a sequence of i.i.d. r.v. with common distribution Uniform(0, θ), where θ > 0.
(a) Check if X(n) = max_{i=1,...,n} Xi →^P θ.
• R.v.
Xi ∼ X, i ∈ IN, i.i.d.
X ∼ Uniform(0, θ)
• D.f. of X
FX(x) = 0, x < 0; x/θ, 0 ≤ x ≤ θ; 1, x > θ
• New r.v.
X(n) = max_{i=1,...,n} Xi
• D.f. of X(n)
FX(n)(x) = [FX(x)]^n = 0, x < 0; (x/θ)^n, 0 ≤ x ≤ θ; 1, x > θ
• Conclusion
X(n) →^P θ.
Interestingly enough, X(n) is the ML estimator of θ and also a consistent estimator of θ (X(n) →^P θ). However, E[X(n)] = nθ/(n + 1) ≠ θ, i.e. X(n) is a biased estimator of θ.
(b) Prove that X(1:n) = min_{i=1,...,n} Xi →^P 0.
• New r.v.
X(1:n) = min_{i=1,...,n} Xi
• D.f. of X(1:n)
FX(1:n)(x) = 1 − [1 − FX(x)]^n = 0, x < 0; 1 − (1 − x/θ)^n, 0 ≤ x ≤ θ; 1, x > θ
• Conclusion
X(1:n) →^P 0. •
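Part (a) can also be illustrated by simulation; here P(|X(n) − θ| > ε) = P(X(n) < θ − ε) = (1 − ε/θ)^n, which the empirical exceedance frequencies should track (θ = 2, ε = 0.1 and the replication count are arbitrary choices):

```python
import numpy as np

# Empirical check that X_(n) = max_i X_i tends in probability to theta.
rng = np.random.default_rng(0)
theta, eps, reps = 2.0, 0.1, 5_000
exceed = []
for n in (10, 100, 1000):
    maxima = rng.uniform(0.0, theta, size=(reps, n)).max(axis=1)
    exceed.append(np.mean(maxima < theta - eps))
    # empirical frequency versus the exact probability (1 - eps/theta)^n
    print(n, exceed[-1], (1 - eps / theta) ** n)
```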
Example 5.10 — Chebyshev(-Bienaymé)'s inequality and convergence in probability
Let {X1, X2, . . .} be a sequence of r.v. such that Xn ∼ Gamma(n, n), n ∈ IN. Prove that Xn →^P 1, by making use of Chebyshev(-Bienaymé)'s inequality.
• R.v.
Xn ∼ Gamma(n, n), n ∈ IN
E(Xn) = n/n = 1
V(Xn) = n/n² = 1/n
• Checking the convergence in probability Xn →^P 1
The application of the definition of this type of convergence and of Chebyshev(-Bienaymé)'s inequality leads, for any ε > 0, to
lim_{n→+∞} P(|Xn − 1| > ε) = lim_{n→+∞} P( |Xn − E(Xn)| ≥ [ε/√V(Xn)] × √V(Xn) )
≤ lim_{n→+∞} 1 / [ε/√V(Xn)]²
= (1/ε²) × lim_{n→+∞} (1/n)
= 0.
• Conclusion
Xn →^P 1. •
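The Chebyshev bound 1/(nε²) used above can be compared with simulated exceedance frequencies (ε = 0.2 and the sample size are our choices):

```python
import numpy as np

# Empirical exceedance frequency versus the Chebyshev bound 1/(n eps^2)
# for Xn ~ Gamma(n, n), i.e. shape n and scale 1/n.
rng = np.random.default_rng(0)
eps, reps = 0.2, 100_000
for n in (10, 100, 1000):
    x = rng.gamma(shape=n, scale=1.0 / n, size=reps)
    empirical = np.mean(np.abs(x - 1.0) > eps)
    bound = 1.0 / (n * eps ** 2)
    print(n, empirical, bound)
```

The bound is quite loose (it even exceeds 1 for small n), but both the bound and the empirical frequency shrink to zero as n grows.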
Exercise 5.12 — Convergence in probability
Let {X1, X2, . . .} be a sequence of r.v. such that Xn ∼ Bernoulli(1/n), n ∈ IN.
(a) Show that Xn →^P 0, by obtaining P({|Xn| > ε}), for 0 < ε < 1 and ε ≥ 1 (Rohatgi, 1976, pp. 243–244, Example 5).
(b) Verify that E(Xn^k) → E(X^k), where k ∈ IN and X =^d 0. •
Exercise 5.18 — Convergence in quadratic mean implies convergence of 2nd. moments (Karr, 1993, p. 158, Exercise 5.6(a))
Show that Xn →^{q.m.} X ⇒ E(Xn²) → E(X²) (Rohatgi, 1976, p. 248, proof of Theorem 8). •
5.1.2 Convergence in distribution
Motivation 5.22 — Convergence in distribution (http://en.wikipedia.org/wiki/Convergence_of_random_variables)
Convergence in distribution is very frequently used in practice; most often it arises from the application of the central limit theorem. •
• With this mode of convergence, we increasingly expect to see the next r.v. in a sequence of r.v. becoming better and better modeled by a given d.f., as seen in exercises 5.25 and 5.26.
• It must be noted that it is quite possible for a given sequence of d.f. to converge to a function that is not a d.f., as shown in Example 5.27 and Exercise 5.28.
where 0 < θ < +∞, and X(n) = max_{i=1,...,n} Xi.
Show that X(n) →^d θ (Rohatgi, 1976, p. 241, Example 2). •
• X ∼ Bernoulli(p).
Prove that Xn →^d X iff pn → p. •
Example 5.27 — A sequence of d.f. converging to a non d.f. (Murteira, 1979, pp. 330–331)
Let {X1, X2, . . .} be a sequence of r.v. with d.f.
FXn(x) = 0, x < −n; (x + n)/(2n), −n ≤ x < n; 1, x ≥ n. (5.9)
Please note that lim_{n→+∞} FXn(x) = 1/2, x ∈ IR, as suggested by the graph below with some terms of the sequence of d.f., for n = 1, 10³, 10⁶ (from top to bottom):
[Graph: the d.f. FXn for n = 1, 10³, 10⁶, flattening towards the constant function 1/2.]
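Evaluating FXn at a few fixed points confirms the pointwise limit numerically:

```python
# F_{Xn}(x) = (x + n)/(2n) on [-n, n): for every fixed x it tends to 1/2,
# so the pointwise limit is the constant function 1/2, which is not a d.f.
def F(n, x):
    if x < -n:
        return 0.0
    if x >= n:
        return 1.0
    return (x + n) / (2.0 * n)

for n in (1, 10**3, 10**6):
    print(n, [round(F(n, x), 6) for x in (-10.0, 0.0, 10.0)])
```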
Exercise 5.29 — The requirement that only the continuity points of FX should be considered is essential
Let Xn ∼ Uniform(1/2 − 1/n, 1/2 + 1/n) and X be a r.v. degenerate at 1/2.
(a) Prove that Xn →^d X (Karr, 1993, p. 142).
(b) Verify that FXn(1/2) = 1/2 for each n, and these values do not converge to FX(1/2) = 1. Is there any contradiction with the convergence in distribution previously proved? (Karr, 1993, p. 142.) •
Exercise 5.30 — The requirement that only the continuity points of FX should be considered is essential (bis)
Let Xn ∼ Uniform(0, 1/n) and X be a r.v. degenerate at 0.
Prove that Xn →^d X, even though FXn(0) = 0, for all n, and FX(0) = 1, that is, the convergence of d.f. fails at the point x = 0 where FX is discontinuous (http://en.wikipedia.org/wiki/Convergence_of_random_variables). •
and
lim_{n→+∞} FXn(x) = 0, x < 0; 1/2, x = 0; 1, x > 0.
(a) Prove that Xn →^d X, where X is a r.v. degenerate at 2.
(b) Verify that none of the p.f. P({Xn = x}) assigns any probability to the point x = 2, for all n, and that P({Xn = x}) → 0 for all x (Rohatgi, 1976, p. 242, Example 4). •
• R.v.
Xn ∼ Geometric(λ/n), n ∈ IN, and Yn = Xn/n
• P.f. of Xn and Yn
P(Xn = x) = (1 − λ/n)^{x−1} × λ/n, x = 1, 2, . . .
P(Yn = y) = P(Xn = ny) = (1 − λ/n)^{ny−1} × λ/n, y = 1/n, 2/n, . . .
• D.f. of Yn
FYn(y) = 1 − (1 − λ/n)^{[ny]}, y ≥ 1/n,
where [ny] represents the integer part of the real number ny, and lim_{n→+∞} FYn(y) = 1 − e^{−λy}, y > 0.
• Conclusion
Yn →^d Exponential(λ). •
2
This result is very important in the generation of pseudo-random numbers from the Uniform(0, 1)
distribution by using computers since these machines “deal” with discrete mathematics.
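A numerical sketch of this limit via the survival function of Yn (the values of λ and y are arbitrary choices):

```python
from math import exp, floor

# Survival function of Yn = Xn/n with Xn ~ Geometric(lam/n):
# P(Yn > y) = (1 - lam/n)^[ny], which tends to exp(-lam*y).
lam, y = 0.7, 1.3
for n in (10, 1000, 100_000):
    surv = (1.0 - lam / n) ** floor(n * y)
    print(n, surv, exp(-lam * y))
```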
The following table condenses the definitions of convergence of sequences of r.v.
Exercise 5.36 — Modes of convergence and the vector space structure of the family of r.v. (Karr, 1993, p. 158, Exercise 5.2)
Prove that, for ∗ = a.s., P, q.m., L1,
Xn →^∗ X ⇔ Xn − X →^∗ 0, (5.12)
i.e. the four function-based forms of convergence are compatible with the vector space structure of the family of r.v. •
5.1.3 Alternative criteria
The definition of almost sure convergence and its verification are far from trivial. More tractable criteria have to be stated, i.e.
Xn →^{a.s.} X ⇔ Yn = sup_{k≥n} |Xk − X| →^P 0. (5.14)
•
Exercise 5.40 — Relating almost sure convergence and convergence in probability (bis)
Let {X1, X2, . . .} be a sequence of r.v. with P({Xn = 1/n}) = P({Xn = −1/n}) = 1/2.
Prove that Xn →^{a.s.} X, where the r.v. X is degenerate at 0, by using (5.16) (Rohatgi, 1976, p. 252). •
The next results relate almost sure convergence and complete convergence, which is stronger than almost sure convergence and sometimes more convenient to establish (Karr, 1993, p. 137).
Exercise 5.46 — Relating almost sure convergence and complete convergence
Show Proposition 5.44, by using the (1st.) Borel-Cantelli lemma (Karr, 1993, p. 138). •
Since, in the proof of Theorem 5.49, the continuous functions used to approximate indicator functions can be taken to be arbitrarily smooth, we can add a sufficient condition that guarantees convergence in distribution.
Corollary 5.52 — Sufficient condition for convergence in distribution (Karr, 1993, p. 139)
Let:
Then
E[f(Xn)] → E[f(X)], ∀ f ∈ C(k) ⇒ Xn →^d X. (5.22)
•
The next table summarizes the alternative criteria and sufficient conditions for almost sure convergence and convergence in distribution of sequences of r.v.
We should also add that Grimmett and Stirzaker (2001, p. 310) state that if Xn →^P X and P({|Xn| ≤ k}) = 1, for all n and some k, then Xn →^{Lr} X, for all r ≥ 1,3 namely Xn →^{q.m.} X (which in turn implies Xn →^{L1} X).
3 Let X, X1, X2, . . . belong to Lr (r ≥ 1). Then the sequence of r.v. {X1, X2, . . .} is said to converge to X in Lr — denoted by Xn →^{Lr} X — if lim_{n→+∞} E(|Xn − X|^r) = 0 (Grimmett and Stirzaker, 2001, p. 308).
5.2 Relationships among the modes of convergence
Given the plethora of modes of convergence, it is natural to inquire how they are related, either in general or in the presence of additional assumptions (Karr, 1993, pp. 140 and 142).
Exercise 5.60 — Convergence in probability implies convergence in distribution
Show Proposition 5.59 (Karr, 1993, p. 141). •
Figure 5.1 shows that convergence in distribution is the weakest form of convergence, since it is implied by all other types of convergence studied so far:
Xn →^{q.m.} X ⇒ Xn →^{L1} X ⇒ Xn →^P X ⇒ Xn →^d X;
Xn →^{a.s.} X ⇒ Xn →^P X.
Grimmett and Stirzaker (2001, p. 314) note that convergence in distribution is the weakest form of convergence for two reasons: it only involves d.f. and makes no reference to an underlying probability space.4 However, convergence in distribution has a useful representation in terms of almost sure convergence, as stated in the next theorem.
4 Let us remind the reader that there is an equivalent formulation of convergence in distribution which involves d.f. alone: the sequence of d.f. {F1, F2, . . .} converges to the d.f. F if lim_{n→+∞} Fn(x) = F(x) at each point x where F is continuous (Grimmett and Stirzaker, 2001, p. 190).
5.2.2 Counterexamples
Counterexamples to all implications among the modes of convergence (and more!) are condensed in Figure 5.2 and presented by means of several exercises; none of the reverse implications holds:
Xn →^{q.m.} X ⇍ Xn →^{L1} X ⇍ Xn →^P X ⇍ Xn →^d X;
Xn →^{a.s.} X ⇍ Xn →^P X.
Before proceeding with exercises, recall exercises 5.4 and 5.12, which pertain to the sequence of r.v. {X1, X2, . . .}, where Xn ∼ Bernoulli(1/n), n ∈ IN. In the first exercise we proved that Xn does not converge almost surely to 0, whereas in the second one we concluded that Xn →^P 0. Thus, combining these results we can state that Xn →^P 0 ⇏ Xn →^{a.s.} 0.
Exercise 5.65 — Convergence in quadratic mean does not imply almost sure convergence
Let Xn ∼ Bernoulli(1/n).
Prove that Xn →^{q.m.} 0, but Xn does not converge almost surely to 0 (Rohatgi, 1976, p. 252, Example 9). •
Show that Xn →^{a.s.} 0 and Xn →^{L1} 0, however Xn does not converge in quadratic mean to 0 (Karr, 1993, p. 141, Counterexample b)). •
n = 2^k + m, m = 0, 1, . . . , 2^k − 1, k = 0, 1, 2, . . . (5.30)
Exercise 5.68 — Convergence in distribution does not imply convergence in probability
Let {X2, X3, . . .} be a sequence of r.v. such that
FXn(x) = 0, x < 0; 1/2 − 1/n, 0 ≤ x < 1; 1, x ≥ 1, (5.33)
i.e. Xn ∼ Bernoulli(1/2 + 1/n), n = 2, 3, . . .
Prove that Xn →^d X, where X ∼ Bernoulli(1/2) and independent of any Xn, but Xn does not converge in probability to X (Karr, 1993, p. 142, Counterexample d)). •
5.2.3 Implications of restricted validity
Proposition 5.70 — Convergence in distribution to a constant implies convergence in probability (Karr, 1993, p. 140; Rohatgi, 1976, p. 246)
Let {X1, X2, . . .} be a sequence of r.v. and c ∈ IR. Then
Xn →^d c ⇒ Xn →^P c. (5.34)
•
derive the following result: X(1:n) →^d θ.
Recall that the expected value of a r.v. X over an event A is given by E(X; A) = E(X × 1A). •
Proposition 5.75 — Alternative criterion for uniform integrability (Karr, 1993, p. 143)
A sequence of r.v. {X1, X2, . . .} is uniformly integrable iff
• {X1, X2, . . .} is bounded in L1: sup_n E(|Xn|) < +∞;
• {X1, X2, . . .} is uniformly absolutely continuous: for each ε > 0 there is δ > 0 such that sup_n E(|Xn|; A) < ε whenever P(A) < δ. •
5.3 Convergence under transformations
Since the original sequence(s) of r.v. is (are) bound to be transformed, it is natural to inquire whether the modes of convergence are preserved under continuous mappings and algebraic operations on the r.v.
Let:
• g : IR → IR be a continuous function.
Then
Xn →^∗ X ⇒ g(Xn) →^∗ g(X), ∗ = a.s., P, d. (5.39)
•
Remark 5.84 — Preservation of {a.s., P, q.m., L1}-convergence under addition
Under the conditions of Theorem 5.83,
• Xn ± Yn →^∗ X ± Y, ∗ = a.s., P, q.m., L1. •
Then
Xn + Yn →^d X + c. (5.41)
•
Exercise 5.88 — Slutsky's theorem or preservation of d-convergence under (restricted) addition
Prove Theorem 5.86 (Karr, 1993, p. 146; Rohatgi, 1976, pp. 253–4). •
As for the product, almost sure convergence and convergence in probability are preserved.5
5 Proposition 5.18 of Karr (1993, p. 144) may come in handy to prove the result. This proposition reads as follows: the sequence of r.v. {X1, X2, . . .} converges in probability to X iff each subsequence {X1′, X2′, . . .} contains a further subsequence {X1′′, X2′′, . . .} such that Xn′′ →^{a.s.} X.
Exercise 5.93 — (Non)preservation of q.m.-convergence under product
Prove Theorem 5.91 (Karr, 1993, p. 147; Rohatgi, 1976, p. 254). •
Convergence in distribution is preserved under product, provided that one limit factor is constant (Karr, 1993, p. 146).
Then
Xn × Yn →^d X × c. (5.46)
•
Preservation under...
Exercise 5.96 — Slutsky's theorem or preservation of d-convergence under (restricted) product
Prove Theorem 5.94 (Karr, 1993, pp. 147–8). •
• R.v.
Xi ∼ X, i ∈ IN, i.i.d.
X : E(X) = µ, V(X) = σ² = µ2, E[(X − µ)⁴] = µ4, which are finite moments since X ∈ L4.
• Auxiliary results
E(X̄n) = µ
V(X̄n) = σ²/n = µ2/n
E(Sn²) = σ² = µ2
V(Sn²) = [n/(n − 1)]² × [ (µ4 − µ2²)/n − 2(µ4 − 2µ2²)/n² + (µ4 − 3µ2²)/n³ ] (Murteira, 1980, p. 46).
• Asymptotic sample distribution of (X̄n − µ)/(Sn/√n)
To show that (X̄n − µ)/(Sn/√n) →^d Normal(0, 1) it suffices to note that
(X̄n − µ)/(Sn/√n) = [ (X̄n − µ)/(σ/√n) ] / √(Sn²/σ²), (5.50)
prove that (X̄n − µ)/(σ/√n) →^d Normal(0, 1) and Sn²/σ² →^P 1, and then apply Slutsky's theorem as stated in (5.48).
• Convergence in probability of the denominator
By using the definition of convergence in probability and Chebyshev(-Bienaymé)'s inequality, we get, for any ε > 0:
lim_{n→+∞} P(|Sn² − σ²| > ε) = lim_{n→+∞} P( |Sn² − E(Sn²)| ≥ [ε/√V(Sn²)] × √V(Sn²) )
≤ lim_{n→+∞} 1 / [ε/√V(Sn²)]²
= (1/ε²) × lim_{n→+∞} V(Sn²)
= 0, (5.51)
i.e. Sn² →^P σ².
Finally, note that convergence in probability is preserved under continuous mappings such as g(x) = √(x/σ²), hence
Sn² →^P σ² ⇒ √(Sn²/σ²) →^P √(σ²/σ²) = 1. (5.52)
• Conclusion
(X̄n − µ)/(Sn/√n) →^d Normal(0, 1).
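This conclusion can be illustrated by simulation; Exponential(1) data (so µ = 1) is an arbitrary non-normal choice in L4 satisfying the assumptions above:

```python
import numpy as np

# Simulated sampling distribution of (X̄n - µ)/(Sn/√n) for i.i.d. data.
rng = np.random.default_rng(0)
n, reps, mu = 200, 20_000, 1.0
x = rng.exponential(scale=1.0, size=(reps, n))
t = (x.mean(axis=1) - mu) / (x.std(axis=1, ddof=1) / np.sqrt(n))
# t should behave approximately like Normal(0, 1)
print(round(t.mean(), 3), round(t.std(), 3))
```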
Exercise 5.99 — Slutsky's theorem or preservation of d-convergence under (restricted) division (bis)
Let {X1, X2, . . .} be a sequence of i.i.d. r.v. with common distribution Bernoulli(p) and X̄n = (1/n) Σ_{i=1}^{n} Xi the maximum likelihood estimator of p. Prove that
(X̄n − p) / √[X̄n(1 − X̄n)/n] →^d Normal(0, 1). (5.57)
5.4 Convergence of random vectors
Before defining modes of convergence of a sequence of random d-vectors we need to recall the definition of the norm of a vector. Let ||x||_2 = (Σ_{i=1}^{d} x_i²)^{1/2} and ||x||_1 = Σ_{i=1}^{d} |x_i| denote the L2 norm (or Euclidean norm) and the L1 norm of x, respectively. •
Remark 5.104 — Convergence in distribution of a sequence of random vectors (Karr, 1993, p. 149)
Due to the intractability of multidimensional d.f., convergence in distribution — unlike the four previous modes of convergence — has to be defined by taking advantage of the alternative criterion for convergence in distribution stated in Theorem 5.49. •
As with sequences of r.v., convergence almost surely, in probability and in distribution are preserved under continuous mappings of sequences of random vectors.
Then
X_n →^∗ X ⇒ g(X_n) →^∗ g(X), ∗ = a.s., P, d. (5.62)
•
5.5 Limit theorems for Bernoulli summands
Let {X1, X2, . . .} be a Bernoulli process with parameter p ∈ (0, 1). In this section we study the asymptotic behavior of the Bernoulli counting process {S1, S2, . . .}, where Sn = Σ_{i=1}^{n} Xi ∼ Binomial(n, p).
This illustration suggests the following statement: Sn/n = X̄n converges, in some sense, to p = 1/2. In fact, if we use Chebyshev(-Bienaymé)'s inequality we can prove that
lim_{n→+∞} P({|Sn/n − p| > ε}) = 0, (5.63)
that is, Sn/n →^P p = 1/2. (Show this result!) In addition, we can also prove that the proportion of heads after n flips will almost surely converge to one half as n approaches infinity, i.e., Sn/n →^{a.s.} p = 1/2. Similar convergences can be devised for the mean of n i.i.d. r.v.
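The behavior of Sn/n can be illustrated with a simulated fair coin (seed and sample size are arbitrary choices):

```python
import numpy as np

# Running proportion of heads S_n/n for a simulated fair coin.
rng = np.random.default_rng(0)
flips = rng.integers(0, 2, size=100_000)               # Bernoulli(1/2) process
props = np.cumsum(flips) / np.arange(1, flips.size + 1)
for n in (10, 1000, 100_000):
    print(n, props[n - 1])  # drifts towards 1/2 as n grows
```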
The Indian mathematician Brahmagupta (598–668) and later the Italian
mathematician Gerolamo Cardano (1501–1576) stated without proof that the accuracies
of empirical statistics tend to improve with the number of trials. This was then formalized
as a law of large numbers (LLN).
The LLN was first proved by Jacob Bernoulli. It took him over 20 years to develop a sufficiently rigorous mathematical proof, which was published in his Ars Conjectandi (The Art of Conjecturing) in 1713. He named this his Golden Theorem, but it became generally known as "Bernoulli's Theorem". In 1835, S.D. Poisson further described it under the name La loi des grands nombres (The law of large numbers). Thereafter, it was known under both names, but "law of large numbers" is the one most frequently used.
Other mathematicians also contributed to refinement of the law, including Chebyshev,
Markov, Borel, Cantelli and Kolmogorov. These further studies have given rise to two
prominent forms of the LLN:
These forms do not describe different laws but instead refer to different ways of describing
the mode of convergence of the cumulative sample means to the expected value:
Then
Sn/n →^{q.m.} p, (5.64)
therefore
Sn/n →^P p. (5.65)
•
Exercise 5.112 — Weak law of large numbers for Bernoulli summands
Show Theorem 5.111, by calculating the limit of E[(Sn/n − p)²] (thus proving the convergence in quadratic mean) and combining Proposition 5.55 (which states that convergence in quadratic mean implies convergence in L1) and Proposition 5.57 (it says that convergence in L1 implies convergence in probability) (Karr, 1993, p. 151). •
Remark 5.115 — Weak and strong laws of large numbers for Bernoulli summands (http://en.wikipedia.org/wiki/Law_of_large_numbers; Karr, 1993, p. 152)
• Theorem 5.113 can be invoked to support the frequency interpretation of probability.
• The WLLN for Bernoulli summands states that for a specified large n, Sn/n is likely to be near p. Thus, it leaves open the possibility that the event {|Sn/n − p| > ε}, for any ε > 0, happens an infinite number of times, although at infrequent intervals. The SLLN for Bernoulli summands shows that this almost surely will not occur. In particular, it implies that with probability 1, for any ε > 0, the inequality |Sn/n − p| < ε holds for all large enough n.
7 Let f(x) and g(x) be two functions defined on some subset of the real numbers. One writes f(x) = O(g(x)) as x → ∞ iff there exist a positive real number M and a real number x0 such that |f(x)| ≤ M |g(x)| for all x > x0 (http://en.wikipedia.org/wiki/Big_O_notation).
8 A simple proof (using the 2nd. Borel-Cantelli lemma) can be found in Rohatgi (1976, p. 265).
• Finally, the proofs of theorems 5.111 and 5.113 only involve the moments of Xi .
Unsurprisingly, these two theorems can be reproduced for other sequences of i.i.d.
r.v., namely those in L2 (in the case of the WLLN) and in L4 (for the SLLN), as we
shall see in sections 5.6 and 5.7. •
• φ(x) = (1/√(2π)) e^{−x²/2} be the standard normal p.d.f.
Then
lim_{n→+∞} P({Sn = kn}) / [φ(xn) / √(np(1 − p))] = 1. (5.67)
•
that is, the p.f. of Sn ∼ Binomial(n, p) evaluated at kn can be properly approximated by the p.d.f. of a normal distribution, with mean E(Sn) = np and variance V(Sn) = np(1 − p), evaluated at xn = (kn − np)/√(np(1 − p)). •
(b) What is the probability that exactly 20 heads result when you flip a fair coin 40 times?
(www.maths.bris.ac.uk/∼mb13434/Stirling DeMoivre Laplace.pdf) •
• According to Murteira (1979, p. 348), the well known continuity correction was proposed by Feller in 1968 to improve the normal approximation to the binomial distribution,10 and can be written as:
P(a ≤ Sn ≤ b) ≃ Φ( (b + 1/2 − np) / √(np(1 − p)) ) − Φ( (a − 1/2 − np) / √(np(1 − p)) ). (5.71)
10 However, http://en.wikipedia.org/wiki/Continuity_correction suggests that the continuity correction dates back to Feller, W. (1945). On the normal approximation to the binomial distribution. The Annals of Mathematical Statistics 16, pp. 319–329.
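A numerical sketch of (5.71) for an arbitrary example (n = 40, p = 1/2, a = 15, b = 25), comparing the exact binomial probability with the normal approximation with and without the correction:

```python
from math import comb, erf, sqrt

def Phi(z):
    """Standard normal c.d.f. via the error function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

n, p, a, b = 40, 0.5, 15, 25
mu, sd = n * p, sqrt(n * p * (1 - p))
exact = sum(comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(a, b + 1))
plain = Phi((b - mu) / sd) - Phi((a - mu) / sd)
corrected = Phi((b + 0.5 - mu) / sd) - Phi((a - 0.5 - mu) / sd)
print(round(exact, 4), round(plain, 4), round(corrected, 4))
```

The corrected approximation should land much closer to the exact probability than the uncorrected one.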
• The proof of the central limit theorem for summands (other than Bernoulli ones) involves a Taylor series expansion11 and requires dealing with the notion of the characteristic function of a r.v.12 Such a proof can be found in Murteira (1979, pp. 354–355); Karr (1993, pp. 190–196) devotes a whole section to this theorem. •
(b) The ideal size of a course is 150 students. On average 30% of those accepted will
enroll, therefore the organisers accept 450 students.
What is the probability that more than 150 students enroll?
(www.maths.bris.ac.uk/∼mb13434/Stirling DeMoivre Laplace.pdf) •
• In the two central limit theorems for Bernoulli summands, although n → +∞, the parameter p remained fixed. These theorems provide useful approximations to binomial probabilities, as long as the values of p are close to neither zero nor one, and inaccurate ones otherwise.
11 The Taylor series of a real or complex function f(x) that is infinitely differentiable in a neighborhood of a real (or complex) number a is the power series written in the more compact sigma notation as Σ_{n=0}^{+∞} [f^{(n)}(a)/n!] (x − a)^n, where f^{(n)}(a) denotes the nth derivative of f evaluated at the point a. In the case that a = 0, the series is also called a Maclaurin series (http://en.wikipedia.org/wiki/Taylor_series).
12 For a scalar random variable X the characteristic function is defined as the expected value of e^{itX}, E(e^{itX}), where i is the imaginary unit and t ∈ IR is the argument of the characteristic function (http://en.wikipedia.org/wiki/Characteristic_function_(probability_theory)).
Theorem 5.124 — Poisson limit theorem (Karr, 1993, p. 155)
Let:
• Xn ∼ Binomial(n, pn), n ∈ IN;
• n × pn → λ, where λ ∈ IR+.
Then
Xn →^d Poisson(λ). (5.72)
•
• R.v.
Xn ∼ Binomial(n, pn)
• Parameters
n
pn = λ/n (0 < λ < n)
• P.f.
P(Xn = x) = C(n, x) (λ/n)^x (1 − λ/n)^{n−x}, x = 0, 1, . . . , n
• Limit p.f.
For any fixed x ∈ IN0, we get
lim_{n→+∞} P(Xn = x) = (λ^x / x!) × lim_{n→+∞} [n(n − 1) . . . (n − x + 1) / n^x] × lim_{n→+∞} (1 − λ/n)^n × lim_{n→+∞} (1 − λ/n)^{−x}
= (λ^x / x!) × 1 × e^{−λ} × 1
= e^{−λ} λ^x / x!.
• Conclusion
If the limit p.f. of Xn coincides with the p.f. of X ∼ Poisson(λ) then the same holds for the limit d.f. of Xn and the d.f. of X. Hence
Xn →^d Poisson(λ).
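The speed of this convergence can be inspected numerically (λ = 3 and the range of x are arbitrary choices):

```python
from math import comb, exp, factorial

# Binomial(n, lam/n) p.f. versus its Poisson(lam) limit, over x = 0,...,20.
lam = 3.0
diffs = []
for n in (10, 100, 10_000):
    p = lam / n
    diffs.append(max(
        abs(comb(n, x) * p ** x * (1 - p) ** (n - x) - exp(-lam) * lam ** x / factorial(x))
        for x in range(21)
    ))
    print(n, diffs[-1])  # maximal pointwise discrepancy shrinks as n grows
```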
(c) Suppose that in an interval of length 1000, 500 points are placed randomly. Use the Poisson limit theorem to prove that we can approximate the p.f. of the number of points that will be placed in a sub-interval of length 10 by
e^{−5} 5^k / k!. (5.73)
5.6 Weak law of large numbers
Motivation 5.126 — Weak law of large numbers (Rohatgi, 1976, p. 257)
Let:
In this section we are going to answer the next question in the affirmative:
• Are there constants an and bn (bn > 0) such that (Sn − an)/bn →^P 0?
In other words, what follows are extensions of the WLLN for Bernoulli summands (Theorem 5.111) to other sequences of:
• i.i.d. r.v. in L2;
• i.i.d. r.v. in L1. •
Definition 5.127 — Obeying the weak law of large numbers (Rohatgi, 1976, p. 257)
Let:
Then {X1, X2, . . .} is said to obey the weak law of large numbers (WLLN) with respect to the sequence of constants {b1, b2, . . .} (bn > 0, bn ↑ +∞) if there is a sequence of real constants {a1, a2, . . .} such that
(Sn − an)/bn →^P 0. (5.74)
an and bn are called centering and norming constants, respectively. •
Remark 5.128 — Obeying the weak law of large numbers (Rohatgi, 1976, p. 257)
The definition in Murteira (1979, p. 319) is a particular case of Definition 5.127 with an = Σ_{i=1}^{n} E(Xi) and bn = n.
• Let {X1, X2, . . .} be a sequence of r.v., X̄n = (1/n) Σ_{i=1}^{n} Xi, and {Z1, Z2, . . .} be another sequence of r.v. such that Zn = (Sn − an)/bn = X̄n − E(X̄n), n = 1, 2, . . .
Then {X1, X2, . . .} is said to obey the WLLN if Zn →^P 0.
Theorem 5.129 — Weak law of large numbers, i.i.d. r.v. in L2 (Karr, 1993, p. 152)
Let {X1, X2, . . .} be a sequence of i.i.d. r.v. in L2 with common expected value µ and variance σ². Then
X̄n →^{q.m.} µ. (5.75)
(a) Prove that X̄n = (1/n) Σ_{i=1}^{n} Xi →^P 1 + q.
(b) Show that X(1:n) = min_{i=1,...,n} Xi →^P q.14 •
A closer look at the proof of Theorem 5.129 leads to the conclusion that the r.v. need only be pairwise uncorrelated and identically distributed in L2, since in this case we still have V(X̄n) = σ²/n (Karr, 1993, p. 152).
13
As suggested by Rohatgi (1976, p. 258, Corollary 3).
14
Use the d.f. of X(1:n) .
Theorem 5.132 — Weak law of large numbers, pairwise uncorrelated and identically distributed r.v. in L2 (Karr, 1993, p. 152; Rohatgi, 1976, p. 258)
Let {X1, X2, . . .} be a sequence of pairwise uncorrelated and identically distributed r.v. in L2 with common expected value µ and variance σ². Thus, X̄n →^{q.m.} µ and15 hence {X1, X2, . . .} obeys the WLLN:
X̄n →^P µ. (5.78)
•
We can also drop the assumption that we are dealing with identically distributed r.v.
as suggested by the following theorem.
bn = √( Σ_{i=1}^{n} σi² ). (5.80)
If bn → +∞ then
(Sn − an)/bn = (Sn − E(Sn))/√V(Sn) →^P 0. (5.81)
Theorem 5.133 can be further generalized: the sequence of r.v. need only have the mean of its first n terms, X̄n, with a specific variance behavior, as stated below. •
then
X̄n − E(X̄n) →^P 0, (5.83)
that is, {X1, X2, . . .} obeys the WLLN with respect to bn = n (an = Σ_{i=1}^{n} E(Xi)). •
Remark 5.138 — (Special cases of) Markov's theorem (Murteira, 1979, pp. 320–321; Rohatgi, 1976, p. 258)
• The WLLN holds for a sequence of pairwise uncorrelated r.v., with common expected value µ and uniformly bounded variances V(Xn) < k, n = 1, 2, . . . ; k ∈ IR+.16
• The WLLN also holds for a sequence of pairwise uncorrelated and identically distributed r.v. in L2, with common expected value µ and variance σ² (Theorem 5.132).
16 This corollary of Markov's theorem is due to Chebyshev. Please note that when we are dealing with pairwise uncorrelated r.v., the condition (5.82) in Markov's theorem still reads: lim_{n→+∞} V(X̄n) = lim_{n→+∞} (1/n²) Σ_{i=1}^{n} V(Xi) = 0.
• Needless to say, the WLLN holds for any sequence of i.i.d. r.v. in L2 (Theorem 5.129) and therefore X̄n is a consistent estimator of µ. Moreover, according to http://en.wikipedia.org/wiki/Law_of_large_numbers, the assumption of finite variances (V(Xi) = σ² < +∞) is not necessary; large or infinite variance will make the convergence slower, but the WLLN holds anyway, as stated in Theorem 5.143. This assumption is often used because it makes the proofs easier and shorter. •
The next theorem provides a necessary and sufficient condition for a sequence of r.v. {X1, X2, . . .} to obey the WLLN.
Theorem 5.139 — An alternative criterion for the WLLN (Rohatgi, 1976, p. 258, Theorem 2)
Let:
• {X1, X2, . . .} be a sequence of r.v. (in L2);
• Yn = X̄n − E(X̄n), n = 1, 2, . . .
Then {X1, X2, . . .} satisfies the WLLN with respect to bn = n (an = Σ_{i=1}^{n} E(Xi)), i.e.
X̄n − E(X̄n) →^P 0, (5.84)
iff
lim_{n→+∞} E( Yn² / (1 + Yn²) ) = 0. (5.85)
•
Remark 5.140 — An alternative criterion for the WLLN (Rohatgi, 1976, p. 259)
Since condition (5.85) does not apply to the individual r.v. Xi, Theorem 5.139 is of limited use. •
Finally, the assumption that the r.v. belong to L2 is dropped and we state a theorem due to the Soviet mathematician Aleksandr Yakovlevich Khinchin (1894–1959).
Theorem 5.143 — Weak law of large numbers, i.i.d. r.v. in L1 (Rohatgi, 1976, p. 261)
Let {X1, X2, . . .} be a sequence of i.i.d. r.v. in L1 with common finite mean µ.17 Then {X1, X2, . . .} satisfies the WLLN with respect to bn = n (an = nµ), i.e.
X̄n →^P µ. (5.87)
•
Exercise 5.146 — Weak law of large numbers, i.i.d. r.v. in L1 (bis bis)
Let {X1, X2, . . .} be a sequence of i.i.d. r.v. with common p.d.f.
fX(x) = (1 + δ)/x^{2+δ}, x ≥ 1; 0, otherwise, (5.88)
where δ > 0.20 Show that X̄n →^P E(X) = (1 + δ)/δ (Rohatgi, 1976, p. 262, Example 5). •
17 Please note that nothing is said about the variance; it need not be finite.
18 This means that the kth. sample moment, (1/n) Σ_{i=1}^{n} Xi^k, is a consistent estimator of E(X^k) if the i.i.d. r.v. belong to Lk.
19 I.e., the sample variance, (1/n) Σ_{i=1}^{n} Xi² − (X̄n)², is a consistent estimator of V(X) if we are dealing with i.i.d. r.v. in L2.
20 This is the p.d.f. of a Pareto(1, 1 + δ) r.v.
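Exercise 5.146 can be explored by simulation for δ = 2, where E(X) = (1 + δ)/δ = 1.5; the Pareto samples are drawn by inverse transform, an implementation choice of ours:

```python
import numpy as np

# WLLN check for Pareto(1, 1 + delta) with delta = 2: since
# F(x) = 1 - x^{-(1+delta)}, x >= 1, we can sample X = (1 - U)^(-1/(1+delta)).
rng = np.random.default_rng(0)
delta, n = 2.0, 200_000
u = rng.uniform(size=n)
x = (1.0 - u) ** (-1.0 / (1.0 + delta))
print(round(x.mean(), 3))  # should be near (1 + delta)/delta = 1.5
```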
5.7 Strong law of large numbers
This section is devoted to a few extensions of the SLLN for Bernoulli summands (or Borel's SLLN), Theorem 5.113. They refer to sequences of:
• i.i.d. r.v. in L4;
• i.i.d. r.v. in L1.
Theorem 5.147 — Strong law of large numbers, i.i.d. r.v. in L4 (Karr, 1993, p. 152; Rohatgi, 1976, pp. 264–265, Theorem 1)
Let {X1, X2, . . .} be a sequence of i.i.d. r.v. in L4, with common expected value µ. Then
X̄n →^{a.s.} µ. (5.89)
•
The proviso of a common finite fourth moment can be dropped if there is a degenerate
r.v. that dominates the i.i.d. r.v. X1 , X2 , . . .
Corollary 5.149 — Strong law of large numbers, dominated i.i.d. r.v. (Rohatgi, 1976, p. 265)
Let {X1, X2, . . .} be a sequence of i.i.d. r.v., with common expected value µ and such that P({|Xi| ≤ c}) = 1, for some c ∈ IR+. Then X̄n →^{a.s.} µ.
The next lemma is essential to prove yet another extension of Borel's SLLN (Theorem 5.113).
Lemma 5.150 — Kolmogorov’s inequality (Rohatgi, 1976, p. 268)
Let {X1 , X2 , . . .} be a sequence of independent r.v., with common null mean and variances
σi2 , i = 1, 2, . . . Then, for any ε > 0,

$P\left(\max_{k=1,\dots,n} |S_k| > \varepsilon\right) \le \frac{\sum_{i=1}^{n} \sigma_i^2}{\varepsilon^2}.$ (5.92)

•
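Kolmogorov's inequality (5.92) is easy to probe by simulation (an illustration we add, not part of the source): estimate P(max_{k≤n} |Sk| > ε) by Monte Carlo and compare it with the bound. A Python sketch with ±1 steps, so that each σi² = 1 and the bound is n/ε²; the values of n, ε and the seed are arbitrary:

```python
import random

rng = random.Random(7)
n, eps, trials = 20, 6.0, 20_000
exceed = 0
for _ in range(trials):
    s, max_abs = 0, 0
    for _ in range(n):
        s += rng.choice((-1, 1))      # zero-mean step with variance 1
        max_abs = max(max_abs, abs(s))
    if max_abs > eps:
        exceed += 1
estimate = exceed / trials
bound = n / eps**2                    # sum of variances over eps^2 = 20/36
print(estimate, bound)
```

The Monte Carlo estimate stays below the Kolmogorov bound, as the lemma guarantees; the bound is typically far from tight.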
The condition of dealing with i.i.d. r.v. in L4 can be further relaxed as long as the r.v.
are still independent and the variances of X1 , X2 , . . . have a specific behavior, as stated
below.
Then

$S_n - E(S_n) \stackrel{a.s.}{\longrightarrow} 0.$ (5.94)

•
Exercise 5.156 — Strong law of large numbers, i.i.d. r.v. in L1 , or
Kolmogorov’s SLLN
Show Theorem 5.155 (Karr, 1993, pp. 188–189; Rohatgi, 1976, pp. 274–275). •
• z is real iff z̄ = z;
• the modulus of z is $|z| = \sqrt{x^2 + y^2}$;
Definition 5.157 — Characteristic function (Karr, 1993, p. 163; Resnick, 1999, p.
295)
The characteristic function of the real-valued r.v. X, with c.d.f. FX (x), is the complex-valued function of a real variable t,

ϕX : IR → C, (5.96)

$\varphi_X(t) = E(e^{itX}) = \int_{-\infty}^{+\infty} e^{itx}\, dF_X(x).$ (5.97)
•
From the table of characteristic functions: Uniform(a, b) has $\varphi_X(t) = \frac{e^{itb} - e^{ita}}{it(b - a)}$.
Exercise 5.160 — Characteristic function
Obtain the characteristic function of at least two discrete (resp. three absolutely
continuous) distributions. •
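One way to corroborate a closed-form characteristic function (a numerical aside, not required by the exercise) is to compare it with the empirical characteristic function, i.e., the sample average of e^{itX}. A Python sketch for X ~ Uniform(a, b), whose characteristic function is (e^{itb} − e^{ita})/(it(b − a)); the values a = 0, b = 2, t = 1.3 and the seed are arbitrary:

```python
import cmath, random

rng = random.Random(2024)
a, b, t, n = 0.0, 2.0, 1.3, 100_000
# Empirical characteristic function: average of exp(itX) over simulated draws.
draws = [a + (b - a) * rng.random() for _ in range(n)]
ecf = sum(cmath.exp(1j * t * x) for x in draws) / n
# Closed form for Uniform(a, b).
exact = (cmath.exp(1j * t * b) - cmath.exp(1j * t * a)) / (1j * t * (b - a))
print(abs(ecf - exact))
```

The discrepancy shrinks at the usual Monte Carlo rate, roughly 1/√n in each of the real and imaginary parts.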
2. ϕX (t) satisfies:
where a, b ∈ IR.
6. The imaginary part of ϕX (t), $\mathrm{Im}[\varphi_X(t)] = \int_{-\infty}^{+\infty} \sin(tx)\, dF_X(x)$, is an odd function, that is, Im[ϕX (t)] = −Im[ϕX (−t)].
$\varphi_{a_n^{-1} S_n - n b_n}(t) = e^{-itnb_n} \times \left[\varphi_X(a_n^{-1} t)\right]^n.$ (5.103)
10. If a r.v. X has a moment-generating function MX (t) = E(etX ), then the domain of
the characteristic function can be extended to the complex plane, and ϕX (−it) =
MX (t). •
(b) 3.;
(d) 9. •
Theorem 5.163 — Inversion theorem (Karr, 1993, p. 166)
Let a < b be two continuity points of the c.d.f. of the r.v. X. Then

$P(a < X \le b) = \lim_{T \to +\infty} \frac{1}{2\pi} \int_{-T}^{T} \frac{e^{-ita} - e^{-itb}}{it}\, \varphi_X(t)\, dt.$ •
We can recover not only the p.d.f. of an absolutely continuous r.v., but also the
individual probabilities P (X = x) using the characteristic function of X.
Theorem 5.166 — Fourier inversion theorem (Karr, 1993, p. 167; Resnick, 1999, p.
303)
If $\int_{-\infty}^{+\infty} |\varphi_X(t)|\, dt < \infty$, then X is an absolutely continuous r.v. with p.d.f. given by

$f_X(x) = \frac{1}{2\pi} \int_{-\infty}^{+\infty} e^{-itx} \times \varphi_X(t)\, dt.$ (5.106)
•
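Formula (5.106) can be checked numerically (our illustration): for X ~ Normal(0, 1), ϕX(t) = e^{−t²/2} is absolutely integrable, and the inversion integral at x = 0 should return fX(0) = 1/√(2π) ≈ 0.3989. A rough trapezoidal sketch in Python, with the truncation at |t| = 10 and the grid size chosen by hand:

```python
import math

def inversion_density(phi, x, T=10.0, steps=4000):
    # Trapezoidal approximation of (1/2pi) * integral of e^{-itx} phi(t) dt
    # over [-T, T]; the tail beyond T is negligible for phi(t) = e^{-t^2/2}.
    h = 2 * T / steps
    total = 0.0
    for k in range(steps + 1):
        t = -T + k * h
        w = 0.5 if k in (0, steps) else 1.0
        # phi is real and even here, so the imaginary part cancels and
        # the integrand reduces to cos(tx) * phi(t).
        total += w * math.cos(t * x) * phi(t)
    return total * h / (2 * math.pi)

fx0 = inversion_density(lambda t: math.exp(-t * t / 2), 0.0)
print(fx0)   # close to 1/sqrt(2*pi)
```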
Proposition 5.168 — Inversion theorem (Karr, 1993, p. 168)
Let X be a real discrete r.v. and ϕX (t) its characteristic function. Then

$P(X = x) = \lim_{T \to +\infty} \frac{1}{2T} \int_{-T}^{T} e^{-itx} \times \varphi_X(t)\, dt,$ (5.107)

for x ∈ IR. •
Interestingly enough, the p.f. of an integer-valued r.v. X can also be written in terms of ϕX (t), as mentioned below.
(c) Use Mathematica to obtain the p.f. of a Poisson(1) r.v., by using Corollary 5.170. •
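Corollary 5.170 is not reproduced above, but it is presumably the standard inversion formula for integer-valued r.v., P(X = k) = (1/2π) ∫_{−π}^{π} e^{−itk} ϕX(t) dt. In place of Mathematica, a Python sketch for the Poisson(1) p.f., using ϕX(t) = exp[λ(e^{it} − 1)] and comparing with e^{−λ} λ^k / k!:

```python
import cmath, math

def poisson_pf_via_cf(k, lam=1.0, steps=2000):
    # (1/2pi) * integral over [-pi, pi] of e^{-itk} * phi_X(t) dt,
    # approximated by a Riemann sum (the integrand is smooth and periodic,
    # so this converges very fast).
    h = 2 * math.pi / steps
    total = 0.0
    for j in range(steps):
        t = -math.pi + j * h
        phi = cmath.exp(lam * (cmath.exp(1j * t) - 1))   # c.f. of Poisson(lam)
        total += (cmath.exp(-1j * t * k) * phi).real
    return total * h / (2 * math.pi)

for k in range(4):
    print(k, poisson_pf_via_cf(k), math.exp(-1.0) / math.factorial(k))
```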
Characteristic functions can also be used to find moments of a r.v. X provided that
they exist. Furthermore, by verifying a simple condition, characteristic functions establish
that the moments of X exist.
$E(X^k) = i^{-k} \times \varphi_X^{(k)}(0),$ for k ∈ N. •
Exercise 5.173 — Calculation of moments known to exist
(a) Prove Theorem 5.172 (Karr, 1993, p. 169).
(b) Use Theorem 5.172 to derive the first and second moments of X ∼ Normal(0, 1)
(Karr, 1993, p. 171). •
The Taylor expansion of characteristic functions is crucial to prove some limit theorems
(Karr, 1993, p. 171).
as t → 0.22 •
The next theorem states that the characteristic function of a r.v. uniquely determines
its distribution (Resnick, 1999, p. 302).
Theorem 5.179 — Uniqueness theorem (Karr, 1993, p. 167; Resnick, 1999, p. 302)
If ϕX (t) = ϕY (t), for all t, then $X \stackrel{d}{=} Y$. •
The following result — the Lévy continuity theorem — establishes that the pointwise
limit of a sequence of characteristic functions is a characteristic function, provided that
it is continuous at zero (Karr, 1993, p. 172).
The Lévy continuity theorem is frequently used to prove the law of large numbers and
the Central Limit Theorem.
Theorem 5.183 — Lévy continuity theorem (Karr, 1993, p. 172; Resnick, 1999, pp.
304–305)
Let {X1 , X2 , . . .} be a sequence of r.v. and ϕX1 (t), ϕX2 (t), . . . the corresponding
characteristic functions. If
(i) ϕ(t) = limn→+∞ ϕXn (t) for every t ∈ IR, and (ii) ϕ(t) is continuous at t = 0, then ϕ(t) is the characteristic function of some r.v. X and $X_n \stackrel{d}{\longrightarrow} X$. •
Exercise 5.184 — Lévy continuity theorem
Prove Theorem 5.183 by using the following result, stated and proved by Resnick (1999,
p. 311): there is K ∈ IR such that, for each X,

$P(|X| \ge a^{-1}) \le \frac{K}{a} \int_0^a \{1 - \mathrm{Re}[\varphi_X(t)]\}\, dt,$ (5.116)

for all a > 0 (Karr, 1993, pp. 172–173; Resnick, 1999, pp. 311–312). •
(a) the weak law of large numbers stated in Theorem 5.143 (Karr, 1993, pp. 173–174);
(b) the Poisson limit theorem stated in Theorem 5.124 (Karr, 1993, p. 174);
(c) $\sum_{i=1}^{+\infty} 2^{-i} X_i \stackrel{d}{=} \text{Uniform}(-1, 1)$, when the Xi are i.i.d. r.v. with common p.f. P(Xi = −1) = P(Xi = 1) = 1/2 (Karr, 1993, p. 173). •
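Part (c) can be corroborated by simulation (a numerical illustration, not the requested proof): truncating the series at 40 terms and drawing many realizations, the sample should look Uniform(−1, 1), with mean 0 and P(X ≤ 0.5) = 0.75. A Python sketch with an arbitrary seed:

```python
import random

rng = random.Random(99)

def truncated_series(m=40):
    # X_i = +/-1 with probability 1/2 each; weights 2^{-i}, i = 1..m.
    return sum(rng.choice((-1, 1)) * 2.0 ** (-i) for i in range(1, m + 1))

samples = [truncated_series() for _ in range(50_000)]
mean = sum(samples) / len(samples)
frac_below_half = sum(1 for s in samples if s <= 0.5) / len(samples)
# For Uniform(-1, 1): mean 0 and P(X <= 0.5) = 0.75.
print(mean, frac_below_half)
```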
• extends the DeMoivre-Laplace global limit theorem (Theorem 5.120), in which the
Sn have binomial distributions (Karr, 1993, p. 174).
Theorem 5.186 — Lindeberg-Lévy Central Limit Theorem (or CLT for i.i.d.
r.v.) (Resnick, 1999, p. 313; Karr, 1993, p. 174; Murteira, 1979, p. 354)
Let:
• {X1 , X2 , . . .} be a sequence of i.i.d. r.v. such that E(Xi ) = µ and V (Xi ) = σ 2 ∈ IR+ ,
for i = 1, 2, . . .;
• $S_n = \sum_{i=1}^{n} X_i$ be the sum of the first n terms of that sequence of i.i.d. r.v.;

• $Z_n = \frac{S_n - E(S_n)}{\sqrt{V(S_n)}} = \frac{S_n - n\mu}{\sqrt{n\sigma^2}}.$ (5.117)

Then

$Z_n \stackrel{d}{\longrightarrow} \text{Normal}(0, 1).$ (5.118)
Remark 5.187 — Lindeberg-Lévy Central Limit Theorem (or CLT for i.i.d.
r.v.)
• This variant of the CLT allows us to add that, when we deal with a sufficiently large number n of i.i.d. r.v. X1 , . . . , Xn , with common mean µ and common positive and finite variance σ 2 , the c.d.f. of the sum of these r.v. can be approximated as follows:

$P(S_n \le s) = P\left(\frac{S_n - n\mu}{\sqrt{n\sigma^2}} \le \frac{s - n\mu}{\sqrt{n\sigma^2}}\right) \stackrel{CLT}{\simeq} \Phi\left(\frac{s - n\mu}{\sqrt{n\sigma^2}}\right).$ (5.119)
• Because of the continuity theorem, characteristic functions23 are used in the most
frequently seen proof of this version of the CLT. •
Exercise 5.188 — Lindeberg-Lévy Central Limit Theorem (or CLT for i.i.d.
r.v.)
Prove Theorem 5.186 (Karr, 1993, p. 174; Murteira, 1979, pp. 354–355; Resnick, 1999,
pp. 313–314). •
23 And their Taylor expansions, omitting terms of order higher than the second.
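Approximation (5.119) is easy to visualize numerically (our addition; any distribution with positive finite variance would do). For Xi ~ Uniform(0, 1), µ = 1/2 and σ² = 1/12, so with n = 48 the approximating normal for Sn has mean 24 and standard deviation 2; the sketch below compares the simulated P(Sn ≤ 26) with Φ(1) ≈ 0.8413:

```python
import math, random

rng = random.Random(321)
n, s, trials = 48, 26.0, 50_000
hits = sum(1 for _ in range(trials)
           if sum(rng.random() for _ in range(n)) <= s)
mc_prob = hits / trials
# CLT approximation: Phi((s - n*mu)/sqrt(n*sigma^2)) with mu = 1/2, sigma^2 = 1/12.
z = (s - n * 0.5) / math.sqrt(n / 12.0)
clt_prob = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
print(mc_prob, clt_prob)
```

Because the uniform distribution is symmetric, the normal approximation is already very accurate at this modest n.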
In the classical form of the CLT, the r.v. must be identically distributed. However,
the CLT can be generalized to the case where the summands are independent r.v. but
not identically distributed (Resnick, 1999, p. 314), given that they comply with certain
conditions.
Interestingly, the next variant of the CLT is due to Lyapunov and was proved before
the Lindeberg-Lévy CLT (Murteira, 1979, p. 359). The Lyapunov CLT requires that the
r.v. |Xi | have finite moments of some order (2 + δ), δ > 0, and that the rate of growth of
these moments is limited by the Lyapunov condition given below.
Then

$Z_n \stackrel{d}{\longrightarrow} \text{Normal}(0, 1)$ (5.121)

if {X1 , X2 , . . .} satisfies the Lyapunov condition, i.e., if

$\exists\, \delta > 0: \; E(|X_n|^{2+\delta}) < +\infty, \; n = 1, 2, \dots, \;\text{and}\; \lim_{n \to +\infty} \frac{1}{\left(\sum_{i=1}^{n} \sigma_i^2\right)^{\frac{2+\delta}{2}}} \sum_{i=1}^{n} E\left(|X_i - \mu_i|^{2+\delta}\right) = 0.$ (5.122)

•
24 These variances are all finite because the sequence of r.v. satisfies the Lyapunov condition. Moreover, Murteira (1979, p. 359) mentions that σ1 ≠ 0; we strongly believe this condition should read as follows: at least one of the variances should be non-null.
Theorem 5.191 — Lindeberg-Feller Central Limit Theorem (Karr, 1993, p. 194;
Murteira, 1979, p. 360; Resnick, 1999, p. 315)
Let:
Then

$Z_n \stackrel{d}{\longrightarrow} \text{Normal}(0, 1)$ (5.123)

and

$\lim_{n \to +\infty} \max_{k=1,\dots,n} \frac{\sigma_k^2}{V(S_n)} = 0,$ (5.124)
• The Lindeberg condition is not by itself a necessary condition for the validity of the
CLT (Karr, 1993, p. 196).26
• Lindeberg (resp. Feller) proved the sufficiency (resp. necessity) part of the Lindeberg-Feller CLT (Murteira, 1979, p. 360).
25 Once again these variances are all finite (Murteira, 1979, p. 360) and at least one of them should be non-null. Curiously, Resnick (1999, p. 314) does not mention these conditions on the variances.
26 For instance, if $X_i \sim \text{Normal}(0, 2^i)$, then $V(S_n) = \sum_{i=1}^{n} 2^i = 2^{n+1} - 2 \simeq 2^{n+1}$ and $\lim_{n \to +\infty} \max_{k=1,\dots,n} \frac{\sigma_k^2}{V(S_n)} = \frac{1}{2} \neq 0$. In this case, (5.124) fails and so does the Lindeberg condition, even though $S_n / \sqrt{V(S_n)} \sim \text{Normal}(0, 1)$, for all n (Karr, 1993, p. 196). However, once we stipulate that $\lim_{n \to +\infty} \max_{k=1,\dots,n} \frac{\sigma_k^2}{V(S_n)} = 0$, the Lindeberg condition is necessary: if X1 , X2 , . . . are independent r.v. with that property and $Z_n \stackrel{d}{\longrightarrow} \text{Normal}(0, 1)$, then {X1 , X2 , . . .} satisfies the Lindeberg condition (Karr, 1993, Theorem 7.18, p. 196).
• The Lindeberg condition essentially means that, for each k, most of the mass of
Xk is centered in an interval about the mean µk and this interval is small when
compared to V (Sn ) (Resnick, 1999, p. 315).
• If the sequence of r.v. {X1 , X2 , . . .} satisfies the Lyapunov condition then it also
satisfies the Lindeberg condition (Karr, 1993, p. 193). •
• $S_n = \sum_{i=1}^{n} X_i$ be the partial sum of the first n terms of that sequence of i.i.d. r.v.

Then

$\limsup_{n \to +\infty} \frac{S_n}{\sqrt{2n \ln[\ln(n)]}} = \lim_{n \to +\infty} \sup_{m \ge n} \frac{S_m}{\sqrt{2m \ln[\ln(m)]}} \stackrel{a.s.}{=} 1.$ (5.126)

•
Exercise 5.195 — Law of the iterated logarithm, standard normal and i.i.d.
summands
Prove Theorem 5.194 (Karr, 1993, pp. 198–200). •
Theorem 5.196 — Law of the iterated logarithm, i.i.d. summands (Karr, 1993,
p. 200)
Let:
• {X1 , X2 , . . .} be a sequence of i.i.d. r.v. such that E(Xi ) = µ and V (Xi ) = σ 2 ∈ IR+ ,
i = 1, 2, . . .;
• $S_n = \sum_{i=1}^{n} X_i$ be the partial sum of the first n terms of that sequence of i.i.d. r.v.

Then

$\limsup_{n \to +\infty} \frac{S_n - n\mu}{\sqrt{2n\sigma^2 \ln[\ln(n)]}} \stackrel{a.s.}{=} 1.$ (5.127)

•
5.11 Applications of the limit theorems
Monte Carlo integration, the characterisation of maximum likelihood estimators (MLE)
and empirical distribution functions benefit from the strong law of large numbers, central
limit theorem and the law of the iterated logarithm (Karr, 1993, pp. 200–207).
Theorem 5.198 — Monte Carlo integration and the strong law of large
numbers (Karr, 1993, p. 201)
Let:
• h be a continuous (or even just Borel measurable) function on [0, 1] and such that $\int_0^1 |h(x)|\, dx < \infty$;

• {U1 , U2 , . . .} be a sequence of i.i.d. Uniform(0, 1) r.v.

Then

$\frac{1}{n} \sum_{i=1}^{n} h(U_i) \stackrel{a.s.}{\longrightarrow} \int_0^1 h(x)\, dx.$ •
Exercise 5.199 — Monte Carlo integration and the strong law of large numbers
Prove Theorem 5.198 (Karr, 1993, p. 201). •
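Theorem 5.198 is what licenses plain Monte Carlo integration: averaging h over i.i.d. Uniform(0, 1) draws converges a.s. to ∫_0^1 h(x) dx. A minimal Python sketch with the arbitrary choice h(x) = x², whose integral is 1/3:

```python
import random

def mc_integral(h, n, rng):
    # Strong-law estimate of the integral of h over [0, 1]:
    # average of h over n i.i.d. Uniform(0, 1) draws.
    return sum(h(rng.random()) for _ in range(n)) / n

rng = random.Random(42)
estimate = mc_integral(lambda x: x * x, 200_000, rng)
print(estimate)   # close to 1/3
```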
Under widely satisfied conditions, maximum likelihood estimators are not only consistent, but also asymptotically normal (Karr, 1993, pp. 201–202).
Theorem 5.200 — Maximum likelihood estimation and the weak law of large
numbers (Karr, 1993, pp. 201–202)
Let:
• {X1 , X2 , . . .} be a sequence of i.i.d. r.v. with the same p.d.f. (or p.f.) f (., θ) as the
r.v. X;
• θ̂n = θ̂n (X1 , . . . , Xn ) be the MLE of θ based on the random sample of size n,
(X1 , . . . , Xn ).
Suppose:

• for each θ,

$\lim_{\delta \to 0} \int_{-\infty}^{+\infty} \left[\sup_{|h| \le \delta} \left(\sqrt{f(x, \theta)} - \sqrt{f(x, \theta + h)}\right)\right]^2 dx = 0;$ (5.132)

• for each θ,

$\lim_{c \to +\infty} \int_{-\infty}^{+\infty} \left[\sup_{|u| \ge c} \sqrt{f(x, \theta)} \times \sqrt{f(x, \theta + u)}\right]^2 dx < 1.$ (5.133)
Theorem 5.202 — Maximum likelihood estimation and the CLT (Karr, 1993, p.
204)
Under the conditions of Theorem 5.200 and the finiteness of the Fisher information,

$I(\theta) = E\left[\left(\frac{\partial \ln f(X, \theta)}{\partial \theta}\right)^2\right],$ (5.135)

the MLE is asymptotically normal: $\sqrt{n\, I(\theta)}\, (\hat{\theta}_n - \theta) \stackrel{d}{\longrightarrow} \text{Normal}(0, 1)$. •
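Definition (5.135) can be made concrete with a model where I(θ) is known in closed form (our illustration, not Karr's): for X ~ Normal(θ, 1), ∂ ln f(X, θ)/∂θ = X − θ, so I(θ) = E[(X − θ)²] = 1. A Monte Carlo sketch in Python, with an arbitrary θ and seed:

```python
import random

rng = random.Random(5)
theta, n = 2.0, 100_000
# Score of Normal(theta, 1): d/dtheta ln f(x, theta) = x - theta.
scores_sq = [(rng.gauss(theta, 1.0) - theta) ** 2 for _ in range(n)]
fisher_info = sum(scores_sq) / n    # estimates E[score^2] = I(theta) = 1
print(fisher_info)
```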
Proposition 5.204 — Empirical distribution functions and the strong law of
large numbers
Let:
• {X1 , X2 , . . .} be a sequence of i.i.d. r.v. with the same entirely unknown c.d.f. F as
the r.v. X;
• $F_n(x, X) = \frac{1}{n} \sum_{i=1}^{n} I_{(-\infty, x]}(X_i)$, x ∈ IR, be the empirical distribution function for the random sample X = (X1 , . . . , Xn );27
Not only

$P[F_n(x, X) = s] = \frac{n!}{(ns)!\,(n - ns)!} \times [F(x)]^{ns} \times [1 - F(x)]^{n - ns},$ (5.137)

for s = 0, 1/n, 2/n, . . . , (n − 1)/n, 1,

$E[F_n(x, X)] = F(x),$ (5.138)

$V[F_n(x, X)] = \frac{F(x)\,[1 - F(x)]}{n},$ (5.139)

but, more importantly,

$F_n(x, X) \stackrel{a.s.}{\longrightarrow} F(x),$ (5.140)

that is, Fn (x, X) is a strongly consistent estimator of F (x). •
This convergence is also uniform, a result known as the Glivenko-Cantelli theorem, i.e.,

$\sup_{x \in IR} |F_n(x, X) - F(x)| \stackrel{a.s.}{\longrightarrow} 0.$ (5.142)

•
Suffice it to say that we could have applied the CLT and concluded that

$\frac{F_n(x, X) - F(x)}{\sqrt{\frac{F(x)\,[1 - F(x)]}{n}}} \stackrel{d}{\longrightarrow} \text{Normal}(0, 1).$ (5.143)
27 Fn (x, X) corresponds to the proportion of Xi ’s smaller than or equal to x in the random sample (X1 , . . . , Xn ).
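The empirical distribution function and the uniform convergence in (5.142) are easy to examine numerically (a sketch we add). For F = Uniform(0, 1), the supremum sup_x |Fn(x, X) − F(x)| is attained at sample points, so it reduces to a maximum over the sorted sample:

```python
import random

rng = random.Random(11)
n = 10_000
xs = sorted(rng.random() for _ in range(n))   # i.i.d. Uniform(0, 1)
# For the true c.d.f. F(x) = x on [0, 1], the Kolmogorov-Smirnov statistic
# sup_x |F_n(x) - x| is attained at sample points: compare each x_i with
# the empirical c.d.f. just before and just after it.
d_n = max(max(abs((i + 1) / n - x), abs(i / n - x))
          for i, x in enumerate(xs))
print(d_n)   # shrinks toward 0 as n grows, at roughly the 1/sqrt(n) rate
```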
Expectedly, Fn (x, X) is used in the statistic of the Kolmogorov-Smirnov goodness-of-fit test, $\sup_{x \in IR} |F_n(x, X) - F_0(x)|$, where F0 represents the conjectured (and known) distribution. Interestingly enough, for any absolutely continuous c.d.f. F , it is possible to:

• provide an asymptotic distribution for $\sqrt{n}\, \sup_{x \in IR} |F_n(x, X) - F(x)|$ — this result constitutes the Kolmogorov-Smirnov theorem;
References
• Grimmett, G.R. and Stirzaker, D.R. (2001). Probability and Random Processes
(3rd. edition). Oxford University Press. (QA274.12-.76.GRI.30385 and QA274.12-
.76.GRI.40695 refer to the library code of the 1st. and 2nd. editions from 1982 and
1992, respectively.)