An Introduction To Discrete Probability: 5.1 Sample Space, Outcomes, Events, Probability
Roughly speaking, probability theory deals with experiments whose outcomes are
not predictable with certainty. We often call such experiments random experiments.
They are subject to chance. Using a mathematical theory of probability, we may be
able to calculate the likelihood of some event.
In the introduction to his classical book [1] (first published in 1888), Joseph
Bertrand (1822–1900) writes (translated from French to English):
“How dare we talk about the laws of chance (in French: le hasard)? Isn’t chance
the antithesis of any law? In rejecting this definition, I will not propose any
alternative. On a vaguely defined subject, one can reason with authority. ...”
Of course, Bertrand’s words are supposed to provoke the reader. But it does seem
paradoxical that anyone could claim to have a precise theory about chance! It is not
my intention to engage in a philosophical discussion about the nature of chance.
Instead, I will try to explain how it is possible to build some mathematical tools that
can be used to reason rigorously about phenomena that are subject to chance. These
tools belong to probability theory. These days, many fields in computer science
such as machine learning, cryptography, computational linguistics, computer vision,
robotics, and of course algorithms, rely a lot on probability theory. These fields are
also a great source of new problems that stimulate the discovery of new methods
and new theories in probability theory.
Although this is an oversimplification that ignores many important contributors,
one might say that the development of probability theory has gone through four eras
whose key figures are: Pierre de Fermat and Blaise Pascal, Pierre–Simon Laplace,
and Andrey Kolmogorov. Of course, Gauss should be added to the list; he made
major contributions to nearly every area of mathematics and physics during his lifetime. To be fair, Jacob Bernoulli, Abraham de Moivre, Pafnuty Chebyshev, Aleksandr Lyapunov, Andrei Markov, Emile Borel, and Paul Lévy should also be added to the list.
Fig. 5.1 Pierre de Fermat (1601–1665) (left), Blaise Pascal (1623–1662) (middle left), Pierre–Simon Laplace (1749–1827) (middle right), Andrey Nikolaevich Kolmogorov (1903–1987) (right).
Before Kolmogorov, probability theory was a subject that still lacked precise definitions. In 1933, Kolmogorov provided a precise axiomatic approach to probability theory, which made it into a rigorous branch of mathematics with even more applications than before!
The first basic assumption of probability theory is that even if the outcome of an
experiment is not known in advance, the set of all possible outcomes of an experiment is known. This set is called the sample space or probability space. Let us begin
with a few examples.
Example 5.1. If the experiment consists of flipping a coin twice, then the sample space consists of the four strings
Ω = {HH, HT, TH, TT},
where H stands for heads and T for tails.
Example 5.2. If the experiment consists of rolling a pair of dice, then the sample space Ω consists of the 36 pairs in the set
Ω = D × D
with
D = {1, 2, 3, 4, 5, 6},
where the integer i ∈ D corresponds to the number (indicated by dots) on the face of the die facing up, as shown in Figure 5.2. Here we assume that one die is rolled first and then the other die is rolled second.
Example 5.3. In the game of bridge, the deck has 52 cards and each player receives a hand of 13 cards. Let Ω be the sample space of all possible hands. This time it is not possible to enumerate the sample space explicitly. Indeed, there are
\binom{52}{13} = 52!/(13! · 39!) = (52 · 51 · 50 ⋯ 40)/(13 · 12 ⋯ 2 · 1) = 635,013,559,600
different hands.
The event A (that heads turns up exactly once in five coin flips) consists of five outcomes. In Example 5.2, the event that we get "doubles" when we roll two dice, namely that each die shows the same value, is
B = {(1, 1), (2, 2), (3, 3), (4, 4), (5, 5), (6, 6)}.
A probability measure must assign total probability 1 to the sample space:
∑_{ω∈Ω} Pr(ω) = 1.
If we assume in Example 5.2 that our dice are "fair," namely that each of the six possibilities for a particular die has probability 1/6, then each of the 36 rolls ω ∈ Ω has probability
Pr(ω) = 1/36.
We can also consider "loaded dice" in which there is a different distribution of probabilities. For example, let
Pr₁(1) = Pr₁(6) = 1/4
Pr₁(2) = Pr₁(3) = Pr₁(4) = Pr₁(5) = 1/8.
These probabilities add up to 1, so Pr₁ is a probability distribution on D. We can assign probabilities to the elements of Ω = D × D by the rule
Pr₁₁(i, j) = Pr₁(i) Pr₁(j).
Definition 5.1. A finite discrete probability space (or finite discrete sample space) is a finite set Ω of outcomes or elementary events ω ∈ Ω, together with a function Pr : Ω → ℝ, called a probability measure (or probability distribution), satisfying the following properties:
0 ≤ Pr(ω) ≤ 1 for all ω ∈ Ω, and ∑_{ω∈Ω} Pr(ω) = 1.
The probability Pr(A) of an event A ⊆ Ω is defined by Pr(A) = ∑_{ω∈A} Pr(ω). It follows immediately that
Pr(∅) = 0 and Pr(Ω) = 1.
The event Ω is called the certain event. In general there are other events A such that
Pr(A) = 1.
Remark: Even though the term probability distribution is commonly used, this is not a good practice, because there is also a notion of (cumulative) distribution function of a random variable (see Section 5.2, Definition 5.4), and this is a very different object (the domain of the distribution function of a random variable is ℝ, not Ω).
For another example, consider the event A that in flipping a coin five times, heads turns up exactly once. Since A contains five of the 32 equally likely outcomes, the probability of this event is
Pr(A) = 5/32.
If we use the probability measure Pr on the sample space Ω of pairs of dice, the probability of the event of having doubles
B = {(1, 1), (2, 2), (3, 3), (4, 4), (5, 5), (6, 6)}
is
Pr(B) = 6 · (1/36) = 1/6.
However, using the probability measure Pr₁₁, we obtain
Pr₁₁(B) = 1/16 + 1/64 + 1/64 + 1/64 + 1/64 + 1/16 = 3/16 > 1/6.
Loading the dice makes the event "having doubles" more probable.
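These computations are easy to verify by brute-force enumeration. The following sketch (our construction, not from the text) uses exact rational arithmetic to compute the probability of doubles under the fair measure Pr and the loaded product measure Pr₁₁:

```python
from fractions import Fraction
from itertools import product

D = range(1, 7)

# Fair die: each face has probability 1/6.
fair = {i: Fraction(1, 6) for i in D}
# Loaded die Pr1: faces 1 and 6 have probability 1/4, the others 1/8.
loaded = {i: Fraction(1, 4) if i in (1, 6) else Fraction(1, 8) for i in D}

def prob_doubles(die):
    """Probability of doubles when two independent dice each follow `die`."""
    return sum(die[i] * die[j] for i, j in product(D, D) if i == j)

print(prob_doubles(fair))    # 1/6
print(prob_doubles(loaded))  # 3/16
```

The exact fractions avoid any floating-point doubt about the comparison 3/16 > 1/6.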
It should be noted that a definition slightly more general than Definition 5.1 is needed if we want to allow Ω to be infinite. In this case, the following definition is used.
Definition 5.2. A discrete probability space (or discrete sample space) is a triple (Ω, F, Pr) consisting of:
1. A nonempty countably infinite set Ω of outcomes or elementary events.
2. The set F of all subsets of Ω, called the set of events.
3. A function Pr : F → ℝ, called a probability measure (or probability distribution), satisfying the following properties:
a. (positivity) 0 ≤ Pr(A) ≤ 1 for all A ∈ F.
b. (normalization) Pr(Ω) = 1.
c. (additivity and continuity) For any sequence of pairwise disjoint events E₁, E₂, ..., E_i, ... in F (which means that E_i ∩ E_j = ∅ for all i ≠ j), we have
Pr(⋃_{i=1}^∞ E_i) = ∑_{i=1}^∞ Pr(E_i).
The main thing to observe is that Pr is now defined directly on events, since events may be infinite. The third axiom of a probability measure implies that
Pr(∅) = 0.
The notion of a discrete probability space is sufficient to deal with most problems that a computer scientist or an engineer will ever encounter. However, there are certain problems for which it is necessary to assume that the family F of events is a proper subset of the power set of Ω. In this case, F is called the family of measurable events, and F has certain closure properties that make it a σ-algebra (also called a σ-field). Some problems even require Ω to be uncountably infinite. In this case, we drop the word discrete from discrete probability space.
Remark: A σ-algebra is a nonempty family F of subsets of Ω satisfying the following properties:
1. ∅ ∈ F.
2. For every subset A ⊆ Ω, if A ∈ F then its complement Ā = Ω − A is in F.
3. For every countable family (A_i)_{i≥1} of subsets A_i ∈ F, we have ⋃_{i≥1} A_i ∈ F.
Note that every σ-algebra is a Boolean algebra (see Section 6.11, Definition 6.14), but the closure property (3) is very strong and adds spice to the story.
In this chapter we deal mostly with finite discrete probability spaces, and occasionally with discrete probability spaces with a countably infinite sample space. In this latter case, we always assume that F = 2^Ω, and for notational simplicity we omit F (that is, we write (Ω, Pr) instead of (Ω, F, Pr)).
Because events are subsets of the sample space Ω, they can be combined using the set operations: union, intersection, and complementation. If the sample space Ω is finite, the definition of the probability Pr(A) of an event A ⊆ Ω given in Definition 5.1 shows that if A, B are two disjoint events (this means that A ∩ B = ∅), then
Pr(A ∪ B) = Pr(A) + Pr(B).
More generally, if A₁, ..., Aₙ are any pairwise disjoint events, then
Pr(A₁ ∪ ⋯ ∪ Aₙ) = Pr(A₁) + ⋯ + Pr(Aₙ).
It is natural to ask whether the probabilities Pr(A ∪ B), Pr(A ∩ B), and Pr(Ā) can be expressed in terms of Pr(A) and Pr(B), for any two events A, B ⊆ Ω. In the first and the third case, we have the following simple answer.
Proposition 5.1. Given any (finite) discrete probability space (Ω, Pr), for any two events A, B ⊆ Ω, we have
Pr(A ∪ B) = Pr(A) + Pr(B) − Pr(A ∩ B)
Pr(Ā) = 1 − Pr(A),
and if A ⊆ B, then Pr(A) ≤ Pr(B).
Proof. Observe that we can write A ∪ B as the following union of pairwise disjoint subsets:
A ∪ B = (A ∩ B) ∪ (A − B) ∪ (B − A).
Then using the observation made just before Proposition 5.1, since we have the disjoint unions A = (A ∩ B) ∪ (A − B) and B = (A ∩ B) ∪ (B − A), using the disjointness of the various subsets, we have
Pr(A ∪ B) = Pr(A ∩ B) + Pr(A − B) + Pr(B − A)
Pr(A) = Pr(A ∩ B) + Pr(A − B)
Pr(B) = Pr(A ∩ B) + Pr(B − A),
and from these we obtain
Pr(A ∪ B) = Pr(A) + Pr(B) − Pr(A ∩ B).
The equation Pr(Ā) = 1 − Pr(A) follows from the fact that A ∩ Ā = ∅ and A ∪ Ā = Ω, so
1 = Pr(Ω) = Pr(A) + Pr(Ā).
If A ⊆ B, then A ∩ B = A, so B = (A ∩ B) ∪ (B − A) = A ∪ (B − A), and since A and B − A are disjoint, we get
Pr(B) = Pr(A) + Pr(B − A).
Since probabilities are nonnegative, the above implies that Pr(A) ≤ Pr(B). ⊓⊔
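Proposition 5.1 is easy to sanity-check by enumeration on a small finite space. The sketch below (our construction; the two events are arbitrary illustrative choices) verifies both identities over the 36 rolls of two dice:

```python
from fractions import Fraction
from itertools import product

# Uniform measure on the 36 rolls of two dice.
omega = set(product(range(1, 7), repeat=2))

def pr(event):
    return Fraction(len(event), len(omega))

A = {w for w in omega if w[0] == w[1]}        # doubles
B = {w for w in omega if w[0] + w[1] >= 10}   # sum at least 10

# Inclusion-exclusion: Pr(A ∪ B) = Pr(A) + Pr(B) − Pr(A ∩ B).
assert pr(A | B) == pr(A) + pr(B) - pr(A & B)
# Complement: Pr(Ā) = 1 − Pr(A).
assert pr(omega - A) == 1 - pr(A)
print("Proposition 5.1 checks out on this example")
```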
Proposition 5.2. Given any probability space (Ω, F, Pr) (discrete or not), for any sequence of events (A_i)_{i≥1}, if A_i ⊆ A_{i+1} for all i ≥ 1, then
Pr(⋃_{i=1}^∞ A_i) = lim_{n→∞} Pr(Aₙ).
Proof. The trick is to express ⋃_{i=1}^∞ A_i as a union of pairwise disjoint events. Indeed, we have
⋃_{i=1}^∞ A_i = A₁ ∪ (A₂ − A₁) ∪ (A₃ − A₂) ∪ ⋯ ∪ (A_{i+1} − A_i) ∪ ⋯ ,
so by additivity, and since Pr(A_{i+1} − A_i) = Pr(A_{i+1}) − Pr(A_i) when A_i ⊆ A_{i+1}, the partial sums telescope:
Pr(⋃_{i=1}^∞ A_i) = Pr(A₁) + ∑_{i=1}^∞ (Pr(A_{i+1}) − Pr(A_i)) = lim_{n→∞} Pr(Aₙ),
as claimed. ⊓⊔
We leave it as an exercise to prove that if A_{i+1} ⊆ A_i for all i ≥ 1, then
Pr(⋂_{i=1}^∞ A_i) = lim_{n→∞} Pr(Aₙ).
For example, in the sample space of five coin flips with the uniform probability measure, let A be the event in which the first flip is H and B the event in which the second flip is H. Since A and B each contain 16 outcomes, we have
Pr(A) = Pr(B) = 16/32 = 1/2.
The intersection of A and B is the event in which the first two flips are H, and since A ∩ B contains 8 outcomes, we have
Pr(A ∩ B) = 8/32 = 1/4.
Since
Pr(A ∩ B) = 1/4
and
Pr(A) Pr(B) = (1/2) · (1/2) = 1/4,
we see that A and B are independent events. On the other hand, if we consider the
events
A = {TTTTT, HHTTT}
and
B = {TTTTT, HTTTT},
we have
Pr(A) = Pr(B) = 2/32 = 1/16,
and since
A ∩ B = {TTTTT},
we have
Pr(A ∩ B) = 1/32.
It follows that
Pr(A) Pr(B) = (1/16) · (1/16) = 1/256,
but
Pr(A ∩ B) = 1/32,
so A and B are not independent.
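Both independence checks can be replayed mechanically. A small sketch (ours) over the 32 equally likely outcomes of five coin flips:

```python
from fractions import Fraction
from itertools import product

# The 32 equally likely outcomes of five coin flips.
omega = {"".join(w) for w in product("HT", repeat=5)}

def pr(event):
    return Fraction(len(event), len(omega))

def independent(a, b):
    """Test Pr(A ∩ B) = Pr(A) Pr(B) exactly."""
    return pr(a & b) == pr(a) * pr(b)

A = {w for w in omega if w[0] == "H"}  # first flip is H
B = {w for w in omega if w[1] == "H"}  # second flip is H
print(independent(A, B))               # True

A2 = {"TTTTT", "HHTTT"}
B2 = {"TTTTT", "HTTTT"}
print(independent(A2, B2))             # False
```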
Example 5.4. We close this section with a classical problem in probability known as
the birthday problem. Consider n < 365 individuals and assume for simplicity that
nobody was born on February 29. In this problem, the sample space is the set of all 365ⁿ possible choices of birthdays for n individuals, and let us assume that they
are all equally likely. This is equivalent to assuming that each of the 365 days of
the year is an equally likely birthday for each individual, and that the assignments
of birthdays to distinct people are independent. Note that this does not take twins
into account! What is the probability that two (or more) individuals have the same
birthday?
To solve this problem, it is easier to compute the probability that no two individuals have the same birthday. We can choose n distinct birthdays in \binom{365}{n} ways, and these can be assigned to n people in n! ways, so there are
\binom{365}{n} n! = 365 · 364 ⋯ (365 − n + 1)
configurations where no two people have the same birthday. There are 365ⁿ possible choices of birthdays, so the probability that no two people have the same birthday is
q = (365 · 364 ⋯ (365 − n + 1))/365ⁿ = (1 − 1/365)(1 − 2/365) ⋯ (1 − (n − 1)/365),
and thus, the probability that two people have the same birthday is
p = 1 − q = 1 − (1 − 1/365)(1 − 2/365) ⋯ (1 − (n − 1)/365).
Using the approximation 1 − x ≈ e^{−x}, we obtain
q ≈ e^{−∑_{i=1}^{n−1} i/365} = e^{−n(n−1)/(2·365)}.
If we want the probability q that no two people have the same birthday to be at most 1/2, it suffices to require
e^{−n(n−1)/(2·365)} ≤ 1/2,
that is, −n(n − 1)/(2 · 365) ≤ ln(1/2), which can be written as
n(n − 1) ≥ 2 · 365 ln 2.
The roots of the equation
n² − n − 2 · 365 ln 2 = 0
are
m = (1 ± √(1 + 8 · 365 ln 2))/2,
and we find that the positive root is approximately m = 23. In fact, we find that if n = 23, then p = 50.7%. If n = 30, we calculate that p ≈ 71%.
What if we want at least three people to share the same birthday? Then n = 88
does it, but this is harder to prove! See Ross [12], Section 3.4.
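The exact product formula for p and the exponential approximation derived above are easy to tabulate; a quick sketch (ours):

```python
import math

def p_exact(n, days=365):
    """Exact probability that at least two of n people share a birthday."""
    q = 1.0
    for i in range(1, n):
        q *= 1 - i / days
    return 1 - q

def p_approx(n, days=365):
    """Approximation p ≈ 1 − exp(−n(n−1)/(2·days)) derived above."""
    return 1 - math.exp(-n * (n - 1) / (2 * days))

print(round(p_exact(23), 3))   # 0.507
print(round(p_exact(30), 2))   # 0.71
print(round(p_approx(23), 3))  # close to 0.5
```

The comparison for n = 23 shows how tight the e^{−x} approximation already is.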
5.2 Random Variables and their Distributions
Next, we define what is perhaps the most important concept in probability: that
of a random variable.
In many situations, given some probability space (Ω, Pr), we are more interested in the behavior of functions X : Ω → ℝ defined on the sample space Ω than in the probability space itself. Such functions are traditionally called random variables, a somewhat unfortunate terminology since these are functions. Now, given any real number a, the inverse image of a,
X⁻¹(a) = {ω ∈ Ω | X(ω) = a},
is an event whose probability is denoted Pr(X = a).
This function of a is of great interest, and in many cases it is the function that we
wish to study. Let us give a few examples.
Example 5.5. Consider the sample space of 5 coin flips, with the uniform probability measure (every outcome has the same probability 1/32). Then the number of times X(ω) that H appears in the sequence ω is a random variable. We determine that
Pr(X = 0) = 1/32, Pr(X = 1) = 5/32, Pr(X = 2) = 10/32,
Pr(X = 3) = 10/32, Pr(X = 4) = 5/32, Pr(X = 5) = 1/32.
The function Y defined such that Y(ω) = 1 iff H appears in ω, and Y(ω) = 0 otherwise, is a random variable. We have
Pr(Y = 0) = 1/32
Pr(Y = 1) = 31/32.
Example 5.6. Let Ω = D × D be the sample space of dice rolls, with the uniform
probability measure Pr (every outcome has the same probability 1/36). The sum
S(w) of the numbers on the two dice is a random variable. For example,
S(2, 5) = 7.
The value of S is any integer between 2 and 12, and if we compute Pr(S = s) for
s = 2, . . . , 12, we find the following table.
s         2     3     4     5     6     7     8     9     10    11    12
Pr(S = s) 1/36  2/36  3/36  4/36  5/36  6/36  5/36  4/36  3/36  2/36  1/36
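The table is reproduced by direct enumeration of the 36 rolls; a sketch (ours):

```python
from fractions import Fraction
from itertools import product

# Mass function of S = sum of two fair dice, built by enumeration.
f = {}
for i, j in product(range(1, 7), repeat=2):
    f[i + j] = f.get(i + j, Fraction(0)) + Fraction(1, 36)

for s in range(2, 13):
    print(s, f[s])
```

Note the symmetry f(s) = f(14 − s) visible in the table.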
We can represent the choice of pivots and the steps of the algorithm by an ordered binary tree, as shown in Figure 5.3. Except for the root node, every node corresponds to the choice of a pivot, say x. The list S1 is shown as a label on the left of node x, and the list S2 is shown as a label on the right of node x. A leaf node is a node such that |S1| ≤ 1 and |S2| ≤ 1. If |S1| ≥ 2, then x has a left child, and if |S2| ≥ 2, then x has a right child. Let us call such a tree a computation tree. Observe that except for minor cosmetic differences, it is a binary search tree. The sorted list can be retrieved by a suitable traversal of the computation tree.
If you run this algorithm on a few more examples, you will realize that the choice
of pivots greatly influences how many comparisons are needed. If the pivot is chosen
at each step so that the size of the lists S1 and S2 is roughly the same, then the number
of comparisons is small compared to n, in fact O(n ln n). On the other hand, with a poor choice of pivot, the number of comparisons can be as bad as n(n − 1)/2.
In order to have a good “average performance,” one can randomize this algorithm
by assuming that each pivot is chosen at random. What this means is that whenever
it is necessary to pick a pivot from some list Y , some procedure is called and this
procedure returns some element chosen at random from Y. How exactly this is done is an interesting topic in itself, but we will not go into it. Let us just say that the pivot can be produced by a random number generator, or by spinning a wheel containing the numbers in Y on it, or by rolling a die with as many faces as there are numbers in Y. What we do assume is that the distribution by which a number is chosen from a list Y is uniform, and that successive choices of pivots are independent. How do we
model this as a probability space?
Here is a way to do it. Use the computation trees defined above! Simply add to every edge the probability that one of the elements of the corresponding list, say Y, was chosen uniformly, namely 1/|Y|. So given an input list S of length n, the sample space Ω is the set of all computation trees T with root label S. We assign a probability to the trees T in Ω as follows: If n = 0, 1, then there is a single tree and its probability is 1. If n ≥ 2, for every leaf of T, multiply the probabilities along the path from the root to that leaf, and then add up the probabilities assigned to these leaves. This is Pr(T). We leave it as an exercise to prove that the sum of the probabilities of all the trees in Ω is equal to 1.
A random variable of great interest on (Ω, Pr) is the number X of comparisons
performed by the algorithm. To analyze the average running time of this algorithm,
it is necessary to determine when the first (or the last) element of a sequence
Y = (y_i, ..., y_j)
is chosen as a pivot. To carry out the analysis further requires the notion of expectation, which has not yet been defined. See Example 5.23 for a complete analysis.
Definition 5.4. Given a (finite) discrete probability space (Ω, Pr), a random variable is any function X : Ω → ℝ. For any real number a ∈ ℝ, we define Pr(X = a) as the probability
Pr(X = a) = Pr(X⁻¹(a)) = Pr({ω ∈ Ω | X(ω) = a}).
The term probability mass function is abbreviated as p.m.f., and cumulative distribution function is abbreviated as c.d.f. It is unfortunate and confusing that both the probability mass function and the cumulative distribution function are often abbreviated as distribution function.
The probability mass function f for the sum S of the numbers on two dice from
Example 5.6 is shown in Figure 5.4, and the corresponding cumulative distribution
function F is shown in Figure 5.5.
Fig. 5.4 The probability mass function for the sum of the numbers on two dice.
If Ω is finite, then f only takes finitely many nonzero values; it is very discontinuous! The c.d.f. F of S shown in Figure 5.5 has jumps (steps). Observe that the size of the jump at every value a is equal to f(a) = Pr(S = a).
The cumulative distribution function F has the following properties:
1. We have
lim_{x→−∞} F(x) = 0,   lim_{x→+∞} F(x) = 1.
Fig. 5.5 The cumulative distribution function for the sum of the numbers on two dice.
F(a) = ∑_{x≤a} f(x), for all a ∈ ℝ.
If the sample space Ω is countably infinite, then f and F are still defined as above, but in
F(a) = ∑_{x≤a} f(x),
the expression on the right-hand side is the limit of an infinite sum (of positive terms).
As a consequence, some of the definitions given in the discrete case in terms of the probabilities Pr(X = x), for example Definition 5.7, become trivial. These definitions need to be modified; replacing Pr(X = x) by Pr(X ≤ x) usually works.
In the general case where the cdf F of a random variable X has discontinuities, we say that X is a discrete random variable if X takes on at most countably many values; that is, the image of X is finite or countably infinite. In this case, the mass function of X is well defined, and it can be viewed as a discrete version of a density function.
In the discrete setting where the sample space Ω is finite, it is usually more
convenient to use the probability mass function f , and to abuse language and call it
the distribution of X.
Example 5.8. Suppose we flip a coin n times, but this time, the coin is not necessarily fair, so the probability of landing heads is p and the probability of landing tails is 1 − p. The sample space Ω is the set of strings of length n over the alphabet {H, T}. Assume that the coin flips are independent, so that the probability of an outcome ω ∈ Ω is obtained by replacing H by p and T by 1 − p in ω. Then let X be the random variable defined such that X(ω) is the number of heads in ω. For any i with 0 ≤ i ≤ n, since there are \binom{n}{i} sequences with i occurrences of H, and since the probability of a sequence ω with i occurrences of H is pⁱ(1 − p)^{n−i}, we see that the distribution of X (mass function) is given by
f(i) = \binom{n}{i} pⁱ (1 − p)^{n−i},   i = 0, ..., n.
Example 5.9. As in Example 5.8, assume that we flip a biased coin, where the probability of landing heads is p and the probability of landing tails is 1 − p. However, this time, we flip our coin any finite number of times (not a fixed number), and we are interested in the event that heads first turns up. The sample space Ω is the infinite set of strings over the alphabet {H, T} of the form
T^{n−1}H,   n ≥ 1
(n − 1 tails followed by a single head). Assume that the coin flips are independent, so that the probability of an outcome ω ∈ Ω is obtained by replacing H by p and T by 1 − p in ω. Then let X be the random variable defined such that X(ω) = n iff |ω| = n. In other words, X is the number of trials until we obtain a success. Then it is clear that
f(n) = (1 − p)ⁿ⁻¹ p,   n ≥ 1.
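This mass function (the geometric distribution) can be checked numerically. In the sketch below (ours), the bias p = 1/3 is an arbitrary illustrative choice, and we use the fact that the partial sums telescope to 1 − (1 − p)^N:

```python
from fractions import Fraction

p = Fraction(1, 3)  # illustrative bias; any 0 < p < 1 works

def f(n):
    """Probability that the first head occurs on trial n: (1 − p)^(n−1) p."""
    return (1 - p) ** (n - 1) * p

# Partial sums: f(1) + ... + f(N) = 1 − (1 − p)^N → 1 as N → ∞.
N = 50
partial = sum(f(n) for n in range(1, N + 1))
print(partial == 1 - (1 - p) ** N)  # True
print(float(partial))               # very close to 1
```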
Example 5.10. Let us go back to Example 5.8, but assume that n is large and that the probability p of success is small, which means that we can write np = λ with λ of "moderate" size. Let us show that we can approximate the distribution f of X in an interesting way. Indeed, for every nonnegative integer i, we can write
f(i) = \binom{n}{i} pⁱ (1 − p)^{n−i}
     = (n!/(i!(n − i)!)) (λ/n)ⁱ (1 − λ/n)^{n−i}
     = (n(n − 1) ⋯ (n − i + 1)/nⁱ) (λⁱ/i!) (1 − λ/n)ⁿ (1 − λ/n)^{−i}.
Now, for n large and λ of moderate size, we have
(1 − λ/n)ⁿ ≈ e^{−λ},   (1 − λ/n)^{−i} ≈ 1,   n(n − 1) ⋯ (n − i + 1)/nⁱ ≈ 1,
so we obtain
f(i) ≈ (λⁱ/i!) e^{−λ},   i ∈ ℕ.
The above is called a Poisson distribution with parameter λ. It is named after the French mathematician Simeon Denis Poisson.
It turns out that quite a few random variables occurring in real life obey the
Poisson probability law (by this, we mean that their distribution is the Poisson dis-
tribution). Here are a few examples:
1. The number of misprints on a page (or a group of pages) in a book.
2. The number of people in a community whose age is over a hundred.
3. The number of wrong telephone numbers that are dialed in a day.
4. The number of customers entering a post office each day.
5. The number of vacancies occurring in a year in the federal judicial system.
As we will see later on, the Poisson distribution has some nice mathematical
properties, and the so-called Poisson paradigm which consists in approximating the
distribution of some process by a Poisson distribution is quite useful.
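The quality of the approximation f(i) ≈ (λⁱ/i!)e^{−λ} is easy to see numerically. In this sketch (ours), n = 1000 and λ = 3 are arbitrary illustrative values:

```python
import math

lam, n = 3.0, 1000
p = lam / n  # so that np = λ

def binom_pmf(i):
    """Exact binomial mass function from Example 5.8."""
    return math.comb(n, i) * p**i * (1 - p) ** (n - i)

def poisson_pmf(i):
    """Poisson approximation with parameter λ."""
    return lam**i / math.factorial(i) * math.exp(-lam)

# The two columns agree to about three decimal places.
for i in range(6):
    print(i, round(binom_pmf(i), 5), round(poisson_pmf(i), 5))
```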
5.3 Conditional Probability and Independence
In general, the occurrence of some event B changes the probability that another event A occurs. It is then natural to consider the probability, denoted Pr(A | B), that A occurs given that B occurs. As in logic, if B does not occur, not much can be said, so we assume that Pr(B) ≠ 0.
Definition 5.5. Given a discrete probability space (Ω, Pr), for any two events A and B, if Pr(B) ≠ 0, then we define the conditional probability Pr(A | B) that A occurs given that B occurs as
Pr(A | B) = Pr(A ∩ B)/Pr(B).
Example 5.11. Suppose we roll two fair dice. What is the conditional probability that the sum of the numbers on the dice exceeds 6, given that the first shows 3? To solve this problem, let
B = {(3, j) | 1 ≤ j ≤ 6}
be the event that the first die shows 3, and
A = {(i, j) | i + j ≥ 7, 1 ≤ i, j ≤ 6}
the event that the sum exceeds 6. Since A ∩ B = {(3, 4), (3, 5), (3, 6)}, we get
Pr(A | B) = Pr(A ∩ B)/Pr(B) = (3/36)/(6/36) = 1/2.
The next example is perhaps a little more surprising.
Example 5.12. A family has two children. What is the probability that both are boys, given that at least one is a boy?
There are four possible combinations of sexes, so the sample space is
Ω = {GG, GB, BG, BB},
and we assume a uniform probability measure (each outcome has probability 1/4).
Introduce the events
B = {GB, BG, BB}
of having at least one boy, and
A = {BB}
of having two boys. We get
A ∩ B = {BB},
and so
Pr(A | B) = Pr(A ∩ B)/Pr(B) = (1/4)/(3/4) = 1/3.
Contrary to the popular belief that Pr(A | B) = 1/2, it is actually equal to 1/3. Now,
consider the question: what is the probability that both are boys given that the first
child is a boy? The answer to this question is indeed 1/2.
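This is exactly the kind of computation a short enumeration settles; a sketch (ours) of both conditioning questions:

```python
from fractions import Fraction

# Sample space: sexes of (first child, second child), uniform measure.
omega = {"BB", "BG", "GB", "GG"}

def pr(event):
    return Fraction(len(event), len(omega))

def cond(a, b):
    """Pr(A | B) = Pr(A ∩ B) / Pr(B)."""
    return pr(a & b) / pr(b)

both_boys = {"BB"}
at_least_one_boy = {w for w in omega if "B" in w}
first_is_boy = {w for w in omega if w[0] == "B"}

print(cond(both_boys, at_least_one_boy))  # 1/3
print(cond(both_boys, first_is_boy))      # 1/2
```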
The next example is known as the “Monty Hall Problem,” a standard example of
every introduction to probability theory.
Example 5.13. On the television game Let’s Make a Deal, a contestant is presented
with a choice of three (closed) doors. Behind exactly one door is a terrific prize. The
other doors conceal cheap items. First, the contestant is asked to choose a door. Then
the host of the show (Monty Hall) shows the contestant one of the worthless prizes
behind one of the other doors. At this point, there are two closed doors, and the
contestant is given the opportunity to switch from his original choice to the other
closed door. The question is, is it better for the contestant to stick to his original
choice or to switch doors?
We can analyze this problem using conditional probabilities. Without loss of generality, assume that the contestant chooses door 1. If the prize is actually behind door
1, then the host will show door 2 or door 3 with equal probability 1/2. However, if
the prize is behind door 2, then the host will open door 3 with probability 1, and if
the prize is behind door 3, then the host will open door 2 with probability 1. Write
Pi for "the prize is behind door i," with i = 1, 2, 3, and Dj for "the host opens door j," for j = 2, 3. Here it is not necessary to consider the choice D1, since a sensible
host will never open door 1. We can represent the sequences of choices occurring in the game by a tree known as a probability tree, or tree of possibilities, shown in Figure 5.8.
Every leaf corresponds to a path associated with an outcome, so the sample space is
Ω = {P1; D2, P1; D3, P2; D3, P3; D2}.
The probability of an outcome is obtained by multiplying the probabilities along the corresponding path, so we have
Pr(P1; D2) = Pr(P1; D3) = 1/6,   Pr(P2; D3) = Pr(P3; D2) = 1/3.
Suppose that the host reveals door 2. What should the contestant do?
The events of interest are:
1. The prize is behind door 1; that is, A = {P1; D2, P1; D3}.
2. The prize is behind door 3; that is, B = {P3; D2}.
3. The host reveals door 2; that is, C = {P1; D2, P3; D2}.
Whether or not the contestant should switch doors depends on the values of the
conditional probabilities
1. Pr(A | C): the prize is behind door 1, given that the host reveals door 2.
2. Pr(B | C): the prize is behind door 3, given that the host reveals door 2.
We have A ∩ C = {P1; D2}, so Pr(A ∩ C) = 1/6, and
Pr(C) = Pr({P1; D2, P3; D2}) = 1/6 + 1/3 = 1/2,
so
Pr(A | C) = Pr(A ∩ C)/Pr(C) = (1/6)/(1/2) = 1/3.
We also have B ∩ C = {P3; D2}, so Pr(B ∩ C) = 1/3, and
Pr(B | C) = Pr(B ∩ C)/Pr(C) = (1/3)/(1/2) = 2/3.
Since 2/3 > 1/3, the contestant has a greater chance (twice as big) to win the bigger
prize by switching doors. The same probabilities are derived if the host had revealed
door 3.
A careful analysis showed that the contestant has a greater chance (twice as large)
of winning big if she/he decides to switch doors. Most people say “on intuition” that
it is preferable to stick to the original choice, because once one door is revealed,
the probability that the valuable prize is behind either of the two remaining doors is 1/2. This is incorrect because the door the host opens depends on which door the contestant originally chose.
Let us conclude by stressing that probability trees (trees of possibilities) are very useful in analyzing problems in which sequences of choices involving various probabilities are made.
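A simulation also confirms the 1/3 versus 2/3 split. The sketch below (ours; the fixed seed is an arbitrary choice for reproducibility) plays the game many times under each strategy:

```python
import random

def play(switch, trials=100_000, seed=1):
    """Estimate the probability of winning the prize under a fixed strategy."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        prize = rng.randrange(3)
        choice = 0  # by symmetry, assume the contestant always picks door 0
        # The host opens a door that is neither the chosen one nor the prize.
        opened = rng.choice([d for d in (1, 2) if d != prize])
        if switch:
            choice = next(d for d in (1, 2) if d != opened)
        wins += choice == prize
    return wins / trials

print(play(switch=False))  # about 1/3
print(play(switch=True))   # about 2/3
```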
Proposition 5.3. (Bayes' Rules) For any two events A, B with Pr(A) > 0 and Pr(B) > 0, we have the following formulae:
1. (Bayes' rule of retrodiction)
Pr(B | A) = Pr(A | B)Pr(B)/Pr(A).
2. (Bayes' rule of exclusive and exhaustive clauses) If we also have Pr(A) < 1 and Pr(B) < 1, then
Pr(A) = Pr(A | B)Pr(B) + Pr(A | B̄)Pr(B̄).
4. (Bayes' law)
Pr(B | A) = Pr(A | B)Pr(B)/(Pr(A | B)Pr(B) + Pr(A | B̄)Pr(B̄)).
Proof. By the definition of conditional probability, Pr(A | B)Pr(B) = Pr(A ∩ B) = Pr(B | A)Pr(A), which shows the first formula. For the second formula, observe that we have the disjoint union
A = (A ∩ B) ∪ (A ∩ B̄),
so
Pr(A) = Pr(A ∩ B) + Pr(A ∩ B̄) = Pr(A | B)Pr(B) + Pr(A | B̄)Pr(B̄).
We leave the more general rule as an exercise, and the third rule follows by unfolding definitions. The fourth rule is obtained by combining (1) and (2). ⊓⊔
Example 5.14. Doctors apply a medical test for a certain rare disease that has the property that if the patient is affected by the disease, then the test is positive in 99% of the cases. However, it happens in 2% of the cases that a healthy patient tests positive. Statistical data shows that one person out of 1000 has the disease. What is the probability for a patient with a positive test to be affected by the disease?
Let S be the event that the patient has the disease, and + and − the events that the test is positive or negative. We know that
Pr(S) = 0.001
Pr(+ | S) = 0.99
Pr(+ | S̄) = 0.02,
and by Bayes' rule of retrodiction,
Pr(S | +) = Pr(+ | S)Pr(S)/Pr(+).
We also have
Pr(+) = Pr(+ | S)Pr(S) + Pr(+ | S̄)Pr(S̄),
so we obtain
Pr(S | +) = (0.99 × 0.001)/(0.99 × 0.001 + 0.02 × 0.999) ≈ 1/20 = 5%.
Since this probability is small, one is led to question the reliability of the test! The solution is to apply a better test, but only to the patients who tested positive. Only a small portion of the population will be given that second test because
Pr(S) = 0.00001
Pr(+ | S) = 0.99
Pr(+ | S̄) = 0.01.
You will find that the probability Pr(S | +) is approximately 0.00099, so the chance of being sick is rather small, and it is more likely that the test was incorrect.
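Bayes' law makes these computations one-liners; a sketch (ours) replaying both scenarios:

```python
def posterior(prior, sensitivity, false_positive):
    """Pr(S | +) via Bayes' law with the total-probability denominator."""
    p_pos = sensitivity * prior + false_positive * (1 - prior)
    return sensitivity * prior / p_pos

# First test: Pr(S) = 0.001, Pr(+ | S) = 0.99, Pr(+ | not S) = 0.02.
print(round(posterior(0.001, 0.99, 0.02), 4))    # 0.0472, about 5%
# Second scenario: Pr(S) = 0.00001, Pr(+ | S) = 0.99, Pr(+ | not S) = 0.01.
print(round(posterior(0.00001, 0.99, 0.01), 5))  # 0.00099
```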
Recall that in Definition 5.3, we defined two events as being independent if
Pr(A ∩ B) = Pr(A)Pr(B).
Remark: For a fixed event B with Pr(B) > 0, the function A ↦ Pr(A | B) satisfies the axioms of a probability measure stated in Definition 5.2. This is shown in Ross [11] (Section 3.5), among other references.
The examples where we flip a coin n times or roll two dice n times are examples of independent repeated trials. They suggest the following definition.
Definition 5.6. Given two discrete probability spaces (Ω₁, Pr₁) and (Ω₂, Pr₂), we define their product space as the probability space (Ω₁ × Ω₂, Pr), where Pr is given by
Pr(ω₁, ω₂) = Pr₁(ω₁)Pr₂(ω₂),   ω₁ ∈ Ω₁, ω₂ ∈ Ω₂.
There is an obvious generalization for n discrete probability spaces. In particular, for any discrete probability space (Ω, Pr) and any integer n ≥ 1, we define the product space (Ωⁿ, Pr), with
Pr(ω₁, ..., ωₙ) = Pr(ω₁) ⋯ Pr(ωₙ),   ωᵢ ∈ Ω, i = 1, ..., n.
The fact that the probability measure on the product space is defined as a product of the probability measures of its components captures the independence of the trials.
Remark: The product of two probability spaces (Ω₁, F₁, Pr₁) and (Ω₂, F₂, Pr₂) can also be defined, but F₁ × F₂ is not a σ-algebra in general, so some serious work needs to be done.
The notion of independence also applies to random variables. Given two random variables X and Y on the same (discrete) probability space, it is useful to consider their joint distribution (really, joint mass function) f_{X,Y} given by
f_{X,Y}(a, b) = Pr(X = a and Y = b),   a, b ∈ ℝ.
Definition 5.7. Two random variables X and Y defined on the same discrete probability space are independent if
Pr(X = a and Y = b) = Pr(X = a)Pr(Y = b),   for all a, b ∈ ℝ.
Remark: If X and Y are two continuous random variables, we say that X and Y are independent if
Pr(X ≤ a and Y ≤ b) = Pr(X ≤ a)Pr(Y ≤ b),   for all a, b ∈ ℝ.
It is easy to verify that if X and Y are discrete random variables, then the above
condition is equivalent to the condition of Definition 5.7.
Example 5.15. If we consider the probability space of Example 5.2 (rolling two
dice), then we can define two random variables S1 and S2 , where S1 is the value
on the first dice and S2 is the value on the second dice. Then the total of the two
values is the random variable S = S1 + S2 of Example 5.6. Since

  Pr(S1 = a and S2 = b) = 1/36 = (1/6) · (1/6) = Pr(S1 = a) Pr(S2 = b),

the random variables S1 and S2 are independent.
Example 5.16. Suppose we flip a biased coin (with probability p of success) once.
Let X be the number of heads observed and let Y be the number of tails observed.
The variables X and Y are not independent. For example,

  Pr(X = 1 and Y = 1) = 0,

yet

  Pr(X = 1) Pr(Y = 1) = p(1 − p).
Now, if we flip the coin N times, where N has the Poisson distribution with parameter λ, it is remarkable that X and Y are independent; see Grimmett and Stirzaker [6] (Section 3.2).
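Example 5.16 can be checked mechanically. The sketch below tabulates the joint mass function of (X, Y) for a single flip and compares Pr(X = 1 and Y = 1) with Pr(X = 1) Pr(Y = 1); the value p = 1/3 is an arbitrary choice for illustration:

```python
from fractions import Fraction

# One flip of a biased coin with Pr(heads) = p (p = 1/3 is arbitrary).
# X = number of heads, Y = number of tails; the only possible outcomes
# are (X, Y) = (1, 0) with probability p and (0, 1) with probability 1 - p.
p = Fraction(1, 3)
joint = {(1, 0): p, (0, 1): 1 - p}

pr_X1 = sum(v for (x, _), v in joint.items() if x == 1)  # Pr(X = 1) = p
pr_Y1 = sum(v for (_, y), v in joint.items() if y == 1)  # Pr(Y = 1) = 1 - p

print(joint.get((1, 1), Fraction(0)))  # 0
print(pr_X1 * pr_Y1)                   # 2/9 = p(1 - p), so X and Y are dependent
```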
Given the joint mass function f_{X,Y} of two random variables X and Y, the mass functions f_X of X and f_Y of Y are called marginal mass functions, and they are obtained from f_{X,Y} by the formulae

  f_X(x) = Σ_y f_{X,Y}(x, y),  f_Y(y) = Σ_x f_{X,Y}(x, y).
Remark: To deal with the continuous case, it is useful to consider the joint distribution F_{X,Y} of X and Y given by

  F_{X,Y}(a, b) = Pr(X ≤ a and Y ≤ b)

for any two reals a, b ∈ R. We say that X and Y are jointly continuous with joint density function f_{X,Y} if

  F_{X,Y}(x, y) = ∫_{−∞}^{x} ∫_{−∞}^{y} f_{X,Y}(u, v) du dv,  for all x, y ∈ R,

for some nonnegative integrable function f_{X,Y}. The marginal density functions f_X of X and f_Y of Y are defined by

  f_X(x) = ∫_{−∞}^{∞} f_{X,Y}(x, y) dy,  f_Y(y) = ∫_{−∞}^{∞} f_{X,Y}(x, y) dx,

and the marginal distribution functions are given by

  F_X(x) = Pr(X ≤ x) = F_{X,Y}(x, ∞),  F_Y(y) = Pr(Y ≤ y) = F_{X,Y}(∞, y).
For example, if X and Y are two random variables with joint density function f_{X,Y} given by

  f_{X,Y}(x, y) = (1/y) e^{−x/y − y},  0 < x, y < ∞,

then the marginal density function f_Y of Y is given by

  f_Y(y) = ∫_{−∞}^{∞} f_{X,Y}(x, y) dx = (e^{−y}/y) ∫_{0}^{∞} e^{−x/y} dx = e^{−y},  y > 0.
It can be shown that X and Y are independent iff

  f_{X,Y}(x, y) = f_X(x) f_Y(y)  for all x, y ∈ R.
5.4 Expectation of a Random Variable

We now turn to one of the most important concepts about random variables, their mean (or expectation).
Definition 5.8. Given a finite discrete probability space (Ω, Pr), for any random variable X, the mean value or expected value or expectation¹ of X is the number E(X) defined as

  E(X) = Σ_{x ∈ X(Ω)} x · Pr(X = x) = Σ_{x ∈ X(Ω)} x f(x),

where X(Ω) denotes the image of the function X and where f is the probability mass function of X. Because Ω is finite, we can also write

  E(X) = Σ_{ω ∈ Ω} X(ω) Pr(ω).
1 It is amusing that in French, the word for expectation is espérance mathématique. There is hope
for mathematics!
In this setting, the median of X is defined as the set of elements x ∈ X(Ω) such that

  Pr(X ≤ x) ≥ 1/2  and  Pr(X ≥ x) ≥ 1/2.
If Ω is countably infinite, then the expectation E(X) (if it exists) is given by the same sum

  E(X) = Σ_{ω ∈ Ω} X(ω) Pr(ω),

provided that the above sum converges absolutely (that is, the partial sums of absolute values converge). If we have a probability space (Ω, F, Pr) with Ω uncountable and if X is absolutely continuous so that it has a density function f, then the expectation of X is given by the integral

  E(X) = ∫_{−∞}^{+∞} x f(x) dx.
It is even possible to define the expectation of a random variable that is not necessarily absolutely continuous using its cumulative distribution function F as

  E(X) = ∫_{−∞}^{+∞} x dF(x),

where the above integral is the Lebesgue–Stieltjes integral, but this is way beyond the scope of this book.
Observe that if X is a constant random variable (that is, X(ω) = c for all ω ∈ Ω for some constant c), then

  E(X) = c Pr(Ω) = c,

since Pr(Ω) = 1. The mean of a constant random variable is the constant itself (as it should be!).
Example 5.17. Consider the sum S of the values on the dice from Example 5.6. The
expectation of S is
  E(S) = 2 · (1/36) + 3 · (2/36) + · · · + 6 · (5/36) + 7 · (6/36) + 8 · (5/36) + · · · + 12 · (1/36) = 7.
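The value E(S) = 7 is easy to confirm by summing over all 36 equally likely outcomes directly; a small sketch with exact rational arithmetic:

```python
from fractions import Fraction

# E(S) for the sum S = S1 + S2 of two fair dice, computed outcome by
# outcome: each pair (a, b) has probability 1/36.
E_S = sum(Fraction(a + b, 36) for a in range(1, 7) for b in range(1, 7))
print(E_S)  # 7
```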
Example 5.18. Suppose we flip a biased coin once (with probability p of landing heads). If X is the random variable given by X(H) = 1 and X(T) = 0, the expectation of X is

  E(X) = 1 · Pr(X = 1) + 0 · Pr(X = 0) = p.
Example 5.19. Consider the binomial distribution of Example 5.8, where the ran-
dom variable X counts the number of heads (success) in a sequence of n trials. Let
us compute E(X). Since the mass function is given by

  f(i) = C(n, i) p^i (1 − p)^{n−i},  i = 0, . . . , n,

where C(n, i) denotes the binomial coefficient, we have

  E(X) = Σ_{i=0}^{n} i f(i) = Σ_{i=0}^{n} i C(n, i) p^i (1 − p)^{n−i}.
We use a trick from analysis to compute this sum. Recall from the binomial theorem that

  (1 + x)^n = Σ_{i=0}^{n} C(n, i) x^i.

If we differentiate both sides with respect to x and then multiply both sides by x, we get

  nx(1 + x)^{n−1} = Σ_{i=0}^{n} i C(n, i) x^i.  (∗)

Setting x = p/q with q = 1 − p, the left-hand side becomes n(p/q)((p + q)/q)^{n−1} = np/q^n, so (∗) becomes

  np/q^n = Σ_{i=0}^{n} i C(n, i) (p/q)^i,

and multiplying both sides by q^n and using the fact that q = 1 − p, we get

  Σ_{i=0}^{n} i C(n, i) p^i (1 − p)^{n−i} = np,

and so

  E(X) = np.
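The identity Σ i C(n, i) p^i (1 − p)^{n−i} = np can also be confirmed numerically, without the differentiation trick; a sketch using exact arithmetic (the values n = 10, p = 3/10 are arbitrary):

```python
from fractions import Fraction
from math import comb

# E(X) for a binomial random variable, computed directly from the mass
# function f(i) = C(n, i) p^i (1 - p)^(n - i); the result should be np.
def binomial_mean(n, p):
    return sum(i * comb(n, i) * p**i * (1 - p)**(n - i)
               for i in range(n + 1))

n, p = 10, Fraction(3, 10)
print(binomial_mean(n, p))  # 3, i.e. np
```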
It should be observed that the expectation of a random variable may be infinite. For example, if X is a random variable whose probability mass function f is given by

  f(k) = 1/(k(k + 1)),  k = 1, 2, . . . ,

then Σ_{k ∈ ℕ−{0}} f(k) = 1, since

  Σ_{k=1}^{∞} 1/(k(k + 1)) = Σ_{k=1}^{∞} (1/k − 1/(k + 1)) = lim_{k→∞} (1 − 1/(k + 1)) = 1,

but

  E(X) = Σ_{k ∈ ℕ−{0}} k f(k) = Σ_{k ∈ ℕ−{0}} 1/(k + 1) = ∞.
Example 5.19 illustrates the fact that computing the expectation of a random variable X can be quite difficult due to the complicated nature of the mass function f. Therefore it is desirable to know about properties of the expectation that make its computation simpler. A crucial property of expectation that often allows simplifications in computing the expectation of a random variable is its linearity.

Proposition 5.6. Given two random variables X and Y on a discrete probability space (Ω, Pr), for any real λ, we have

  E(X + Y) = E(X) + E(Y)  and  E(λX) = λE(X).
Proof. We have

  E(X + Y) = Σ_z z · Pr(X + Y = z)
           = Σ_x Σ_y (x + y) · Pr(X = x and Y = y)
           = Σ_x Σ_y x · Pr(X = x and Y = y) + Σ_x Σ_y y · Pr(X = x and Y = y).

By substitution of Σ_y Pr(X = x and Y = y) = Pr(X = x) and Σ_x Pr(X = x and Y = y) = Pr(Y = y), we obtain

  E(X + Y) = Σ_x x · Pr(X = x) + Σ_y y · Pr(Y = y) = E(X) + E(Y),

proving that E(X + Y) = E(X) + E(Y). When Ω is countably infinite, we can permute the indices x and y due to absolute convergence.
For the second equation, if λ ≠ 0, we have

  E(λX) = Σ_x x · Pr(λX = x)
        = λ Σ_x (x/λ) · Pr(X = x/λ)
        = λ Σ_y y · Pr(X = y)   (set y = x/λ)
        = λ E(X),

and the case λ = 0 is trivial. ⊓⊔
By induction, we also get E(X1 + · · · + Xn) = E(X1) + · · · + E(Xn) for any n random variables. It is also important to realize that this equation holds even if the Xi are not independent.
Here is an example showing how the linearity of expectation can simplify calculations. Let us go back to Example 5.19. Define n random variables X1, . . . , Xn such that Xi(ω) = 1 iff the ith flip yields heads, and Xi(ω) = 0 otherwise. Clearly, the number X of heads in the sequence is

  X = X1 + · · · + Xn.

Since Pr(Xi = 1) = p and Pr(Xi = 0) = 1 − p, we have E(Xi) = p, so by linearity of expectation we get

  E(X) = E(X1) + · · · + E(Xn) = np.
The above example suggests the definition of indicator function, which turns out
to be quite handy.
Definition 5.9. Given a discrete probability space with sample space Ω, for any event A, the indicator function (or indicator variable) of A is the random variable I_A defined such that

  I_A(ω) = 1 if ω ∈ A, and I_A(ω) = 0 if ω ∉ A.
Here is the main property of the indicator function.
Proposition 5.7. The expectation E(I_A) of the indicator function I_A is equal to the probability Pr(A) of the event A.

Proof. We have

  E(I_A) = Σ_{ω ∈ Ω} I_A(ω) Pr(ω) = Σ_{ω ∈ A} Pr(ω) = Pr(A),

as claimed. ⊓⊔
This fact together with the linearity of expectation is often used to compute the expectation
of a random variable, by expressing it as a sum of indicator variables. We will see
how this method is used to compute the expectation of the number of comparisons in
quicksort. But first, we use this method to find the expected number of fixed points
of a random permutation.
Example 5.20. For any integer n ≥ 1, let Ω be the set of all n! permutations of {1, . . . , n}, and give Ω the uniform probability measure; that is, for every permutation π, let

  Pr(π) = 1/n!.

We say that these are random permutations. A fixed point of a permutation π is any integer k such that π(k) = k. Let X be the random variable such that X(π) is the number of fixed points of the permutation π. Let us find the expectation of X. To do this, for every k, let X_k be the random variable defined so that X_k(π) = 1 iff π(k) = k, and 0 otherwise. Clearly,

  X = X1 + · · · + Xn,

and since

  E(X) = E(X1) + · · · + E(Xn),

we just have to compute E(X_k). But X_k is an indicator variable, so

  E(X_k) = Pr(X_k = 1).

Now there are (n − 1)! permutations that leave k fixed, so Pr(X_k = 1) = (n − 1)!/n! = 1/n. Therefore,

  E(X) = E(X1) + · · · + E(Xn) = n · (1/n) = 1.

On average, a random permutation has one fixed point.
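For small n, the conclusion E(X) = 1 can be verified exhaustively over all n! permutations; a brute-force sketch:

```python
from fractions import Fraction
from itertools import permutations

# Average number of fixed points over all n! permutations of {0, ..., n-1},
# under the uniform measure Pr(pi) = 1/n!.
def expected_fixed_points(n):
    perms = list(permutations(range(n)))
    fixed = sum(sum(1 for k in range(n) if pi[k] == k) for pi in perms)
    return Fraction(fixed, len(perms))

for n in range(1, 7):
    assert expected_fixed_points(n) == 1
print("E(X) = 1 for n = 1, ..., 6")
```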
Given a random variable X : Ω → R and a function g : R → R, the composition g ∘ X is also a random variable, defined by

  (g ∘ X)(ω) = g(X(ω)),  ω ∈ Ω.

Given two random variables X and Y, if φ and ψ are two functions, we leave it as an exercise to prove that if X and Y are independent, then so are φ(X) and ψ(Y). Although computing the mass function of g ∘ X in terms of the mass function f of X can be very difficult, there is a nice way to compute its expectation. Here is a second tool that makes it easier to compute an expectation.
Proposition 5.8. If X is a random variable on a discrete probability space with probability mass function f, then for any function g : R → R, the expectation E(g(X)) (if it exists) is given by

  E(g(X)) = Σ_x g(x) f(x).

Proof. We have

  E(g(X)) = Σ_y y · Pr(g ∘ X = y)
          = Σ_y Σ_{x, g(x)=y} g(x) · Pr(X = x)
          = Σ_x g(x) · Pr(X = x)
          = Σ_x g(x) f(x),

as claimed. ⊓⊔
The cases g(X) = X^k, g(X) = z^X, and g(X) = e^{tX} (for some given reals z and t) are of particular interest. Given two random variables X and Y on a discrete probability space Ω, for any function g : R × R → R, then g(X, Y) is a random variable, and it is easy to show that E(g(X, Y)) (if it exists) is given by

  E(g(X, Y)) = Σ_{x,y} g(x, y) f_{X,Y}(x, y),

where f_{X,Y} is the joint mass function of X and Y.
Example 5.21. Consider the random variable X of Example 5.19 counting the num-
ber of heads in a sequence of coin flips of length n, but this time, let us try to compute
E(X^k), for k ≥ 2. By Proposition 5.8, we have

  E(X^k) = Σ_{i=0}^{n} i^k f(i)
         = Σ_{i=0}^{n} i^k C(n, i) p^i (1 − p)^{n−i}
         = Σ_{i=1}^{n} i^k C(n, i) p^i (1 − p)^{n−i}.
Recall that

  i C(n, i) = n C(n − 1, i − 1).
Using this, we get

  E(X^k) = Σ_{i=1}^{n} i^k C(n, i) p^i (1 − p)^{n−i}
         = np Σ_{i=1}^{n} i^{k−1} C(n − 1, i − 1) p^{i−1} (1 − p)^{n−i}   (let j = i − 1)
         = np Σ_{j=0}^{n−1} (j + 1)^{k−1} C(n − 1, j) p^j (1 − p)^{n−1−j}
         = np E((Y + 1)^{k−1}),

using Proposition 5.8 to establish the last equation, where Y is a random variable with binomial distribution on sequences of length n − 1 and with the same probability p of success. Thus, we obtain an inductive method to compute E(X^k). For k = 2, we get

  E(X^2) = np E(Y + 1) = np((n − 1)p + 1).
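The closed form E(X^2) = np((n − 1)p + 1) can be checked directly against the defining sum; a sketch with exact arithmetic (n = 8 and p = 1/4 are arbitrary choices):

```python
from fractions import Fraction
from math import comb

# Second moment of a binomial random variable from its mass function,
# compared with the closed form np((n - 1)p + 1) obtained above.
def binomial_second_moment(n, p):
    return sum(i * i * comb(n, i) * p**i * (1 - p)**(n - i)
               for i in range(n + 1))

n, p = 8, Fraction(1, 4)
assert binomial_second_moment(n, p) == n * p * ((n - 1) * p + 1)
print(binomial_second_moment(n, p))  # 11/2
```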
Here is a third tool to compute expectation. If X only takes nonnegative integer
values, then the following result may be useful for computing E(X).
Proposition 5.9. If X is a random variable that takes on only nonnegative integers, then its expectation E(X) (if it exists) is given by

  E(X) = Σ_{i=1}^{∞} Pr(X ≥ i).

Proof. We have

  Σ_{i=1}^{∞} Pr(X ≥ i) = Σ_{i=1}^{∞} Σ_{j=i}^{∞} Pr(X = j) = Σ_{j=1}^{∞} Σ_{i=1}^{j} Pr(X = j) = Σ_{j=1}^{∞} j Pr(X = j) = E(X),

where the interchange of summations is justified because all terms are nonnegative, as claimed. ⊓⊔
Proposition 5.9 has the following intuitive geometric interpretation: E(X) is the area above the graph of the cumulative distribution function F(i) = Pr(X ≤ i) of X and below the horizontal line F = 1. Here is an application of Proposition 5.9.
Example 5.22. In Example 5.9, we considered finite sequences of flips of a biased coin, where the random variable of interest is the first occurrence of heads (success), with the geometric distribution

  f(n) = (1 − p)^{n−1} p,  n ≥ 1.

Then we have

  E(X) = Σ_{i=1}^{∞} Pr(X ≥ i)
       = Σ_{i=1}^{∞} (1 − p)^{i−1}
       = 1/(1 − (1 − p)) = 1/p.

Therefore,

  E(X) = 1/p,

which means that on average, it takes 1/p flips until heads turns up.
Let us now compute E(X^2). By Proposition 5.8, we have

  E(X^2) = Σ_{i=1}^{∞} i^2 (1 − p)^{i−1} p
         = Σ_{i=1}^{∞} (i − 1 + 1)^2 (1 − p)^{i−1} p
         = Σ_{i=1}^{∞} (i − 1)^2 (1 − p)^{i−1} p + Σ_{i=1}^{∞} 2(i − 1)(1 − p)^{i−1} p + Σ_{i=1}^{∞} (1 − p)^{i−1} p
         = Σ_{j=0}^{∞} j^2 (1 − p)^j p + 2 Σ_{j=1}^{∞} j (1 − p)^j p + 1   (let j = i − 1)
         = (1 − p) E(X^2) + 2(1 − p) E(X) + 1.

Since E(X) = 1/p, it follows that

  p E(X^2) = 2(1 − p)/p + 1 = (2 − p)/p,

so

  E(X^2) = (2 − p)/p^2.
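Both formulas are easy to sanity-check numerically by truncating the series; the values p = 1/3 and N = 2000 terms below are arbitrary choices:

```python
# Partial sums of i*f(i) and i^2*f(i) for the geometric distribution
# f(i) = (1 - p)^(i - 1) p; they should approach 1/p and (2 - p)/p^2.
p = 1 / 3
N = 2000
E1 = sum(i * (1 - p) ** (i - 1) * p for i in range(1, N + 1))
E2 = sum(i * i * (1 - p) ** (i - 1) * p for i in range(1, N + 1))
print(round(E1, 9))  # 3.0  = 1/p
print(round(E2, 9))  # 15.0 = (2 - p)/p^2
```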
By the way, the trick of writing i = (i − 1) + 1 can be used to compute E(X). Try to recompute E(X) this way. The expectation E(X) can also be computed using the derivative technique of Example 5.19, since (d/dp)(1 − p)^i = −i(1 − p)^{i−1}.
Example 5.23. Let us compute the expectation of the number X of comparisons
needed when running the randomized version of quicksort presented in Example
5.7. Recall that the input is a sequence S = (x1 , . . . , xn ) of distinct elements, and that
(y1 , . . . , yn ) has the same elements sorted in increasing order. In order to compute
E(X), we decompose X as a sum of indicator variables Xi, j , with Xi, j = 1 iff yi and
yj are ever compared, and Xi,j = 0 otherwise. Then it is clear that

  X = Σ_{j=2}^{n} Σ_{i=1}^{j−1} Xi,j,

and

  E(X) = Σ_{j=2}^{n} Σ_{i=1}^{j−1} E(Xi,j).
The crucial observation is that yi and y j are ever compared iff either yi or y j is chosen
as the pivot when {yi , yi+1 , . . . , y j } is a subset of the set of elements of the (left or
right) sublist considered for the choice of a pivot.
This is because if the next pivot y is larger than y j , then all the elements in
(yi , yi+1 , . . . , y j ) are placed in the list to the left of y, and if y is smaller than yi ,
then all the elements in (yi , yi+1 , . . . , y j ) are placed in the list to the right of y. Conse-
quently, if yi and y j are ever compared, some pivot y must belong to (yi , yi+1 , . . . , y j ),
and every yk 6= y in the list will be compared with y. But, if the pivot y is distinct
from yi and y j , then yi is placed in the left sublist and y j in the right sublist, so yi
and y j will never be compared.
It remains to compute the probability that the next pivot chosen in the sublist
Yi, j = (yi , yi+1 , . . . , y j ) is yi (or that the next pivot chosen is y j , but the two proba-
bilities are equal). Since the pivot is one of the values in (yi , yi+1 , . . . , y j ) and since
each of these is equally likely to be chosen (by hypothesis), we have

  Pr(yi is chosen as the next pivot in Yi,j) = 1/(j − i + 1).

Consequently, since yi and yj are ever compared iff either yi is chosen as a pivot or yj is chosen as a pivot, and since these two events are mutually exclusive, we have

  E(Xi,j) = Pr(yi and yj are ever compared) = 2/(j − i + 1).
It follows that

  E(X) = Σ_{j=2}^{n} Σ_{i=1}^{j−1} 2/(j − i + 1)
       = 2 Σ_{j=2}^{n} Σ_{k=2}^{j} 1/k     (set k = j − i + 1)
       = 2 Σ_{k=2}^{n} Σ_{j=k}^{n} 1/k
       = 2 Σ_{k=2}^{n} (n − k + 1)/k
       = 2(n + 1) Σ_{k=1}^{n} 1/k − 4n.

Since the harmonic number H_n = Σ_{k=1}^{n} 1/k satisfies H_n = ln n + Θ(1), we conclude that

  E(X) = 2n ln n + Θ(n).
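The exact value 2(n + 1)H_n − 4n and the asymptotic estimate 2n ln n can be compared for a few values of n; a small sketch:

```python
from fractions import Fraction
from math import log

# Exact expected number of comparisons 2(n + 1)H_n - 4n for randomized
# quicksort, next to the asymptotic estimate 2n ln n.
def expected_comparisons(n):
    H = sum(Fraction(1, k) for k in range(1, n + 1))  # harmonic number H_n
    return 2 * (n + 1) * H - 4 * n

assert expected_comparisons(2) == 1          # two elements: one comparison
assert expected_comparisons(3) == Fraction(8, 3)
for n in (10, 100, 1000):
    print(n, float(expected_comparisons(n)), 2 * n * log(n))
```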
Example 5.24. If X is a random variable with a Poisson distribution with parameter λ, let us show that

  E(X) = λ.

The mass function is given by

  f(i) = e^{−λ} λ^i / i!,  i ∈ ℕ,

so we have

  E(X) = Σ_{i=0}^{∞} i e^{−λ} λ^i / i!
       = λ e^{−λ} Σ_{i=1}^{∞} λ^{i−1}/(i − 1)!
       = λ e^{−λ} Σ_{j=0}^{∞} λ^j / j!   (let j = i − 1)
       = λ e^{−λ} e^{λ} = λ,

as claimed. This is consistent with the fact that the expectation of a random variable with a binomial distribution is np, under the Poisson approximation where λ = np. We leave it as an exercise to prove that

  E(X^2) = λ(λ + 1).
Proposition 5.10. If X and Y are independent random variables on a discrete probability space (and the expectations exist), then

  E(XY) = E(X)E(Y).

Proof. We have

  E(XY) = Σ_z z · Pr(XY = z)
        = Σ_x Σ_y xy · Pr(X = x and Y = y)
        = Σ_x Σ_y xy · Pr(X = x) Pr(Y = y)
        = (Σ_x x · Pr(X = x)) (Σ_y y · Pr(Y = y))
        = E(X)E(Y),

as claimed. Note that the independence of X and Y was used in going from line 2 to line 3. ⊓⊔
In Example 5.15 (rolling two dice), we defined the random variables S1 and S2 ,
where S1 is the value on the first dice and S2 is the value on the second dice. We
also showed that S1 and S2 are independent. If we consider the random variable
P = S1 S2, then we have

  E(P) = E(S1)E(S2) = (7/2) · (7/2) = 49/4,

since E(S1) = E(S2) = 7/2, as we easily determine since all probabilities are equal to 1/6. On the other hand, S and P are not independent (check it).
5.5 Variance, Standard Deviation, Chebyshev's Inequality

The mean (expectation) E(X) of a random variable X gives some useful information about it, but it does not say how X is spread out. Another quantity, the variance Var(X), measures the spread of the distribution by finding the "average" of the squared difference (X − E(X))^2, namely the expectation E((X − E(X))^2).
Definition 5.11. Given a discrete probability space (Ω, Pr), for any random variable X, the variance Var(X) of X (if it exists) is defined as

  Var(X) = E((X − E(X))^2).
The following result shows that the variance Var(X) can be computed using E(X 2 )
and E(X).
Proposition 5.11. Given a discrete probability space (Ω, Pr), for any random variable X, the variance Var(X) of X is given by

  Var(X) = E(X^2) − (E(X))^2.

Proof. Using linearity of expectation and the fact that the expectation of a constant is the constant itself, we have

  Var(X) = E((X − E(X))^2)
         = E(X^2 − 2X E(X) + (E(X))^2)
         = E(X^2) − 2E(X)E(X) + (E(X))^2
         = E(X^2) − (E(X))^2,

as claimed. ⊓⊔
For example, if we roll a fair dice, we know that the number S1 on the dice has
expectation E(S1 ) = 7/2. We also have
  E(S1^2) = (1/6)(1^2 + 2^2 + 3^2 + 4^2 + 5^2 + 6^2) = 91/6,

so the variance of S1 is

  Var(S1) = E(S1^2) − (E(S1))^2 = 91/6 − (7/2)^2 = 35/12.
The quantity E(X^2) is called the second moment of X. More generally, we have the following definition.

Definition 5.12. Given a discrete probability space (Ω, Pr), for any random variable X and any integer k ≥ 1, the k-th moment of X is E(X^k) (if it exists). The standard deviation σ of X is the square root of the variance,

  σ = √Var(X).
Example 5.25. In Example 5.21, the case of a binomial distribution, we found that

  E(X^2) = np((n − 1)p + 1).

We also found earlier (Example 5.19) that E(X) = np. Therefore, we have

  Var(X) = E(X^2) − (E(X))^2 = np((n − 1)p + 1) − (np)^2,

and so

  Var(X) = np(1 − p).
Example 5.26. In Example 5.22, the case of a geometric distribution, we found that

  E(X) = 1/p,
  E(X^2) = (2 − p)/p^2.

It follows that

  Var(X) = E(X^2) − (E(X))^2 = (2 − p)/p^2 − 1/p^2 = (1 − p)/p^2.

Example 5.27. In the case of a Poisson distribution with parameter λ, we found earlier that

  E(X) = λ,
  E(X^2) = λ(λ + 1).

It follows that

  Var(X) = E(X^2) − (E(X))^2 = λ(λ + 1) − λ^2 = λ.

Therefore, a random variable with a Poisson distribution has the same value for its expectation and its variance,

  E(X) = Var(X) = λ.
Proposition 5.12. Given a discrete probability space (Ω, Pr), for any two independent random variables X and Y, we have

  Var(X + Y) = Var(X) + Var(Y).

Proof. Recall from Proposition 5.10 that if X and Y are independent, then E(XY) = E(X)E(Y). Then, we have

  Var(X + Y) = E((X + Y)^2) − (E(X + Y))^2
             = E(X^2) + 2E(X)E(Y) + E(Y^2) − (E(X))^2 − 2E(X)E(Y) − (E(Y))^2
             = E(X^2) − (E(X))^2 + E(Y^2) − (E(Y))^2
             = Var(X) + Var(Y),

as claimed. ⊓⊔
The following proposition is also useful.
Proposition 5.13. Given a discrete probability space (Ω, Pr), for any random variable X, the following properties hold:

1. If X ≥ 0, then E(X) ≥ 0.
2. If X is a random variable with constant value λ, then E(X) = λ.
3. For any two random variables X and Y defined on the probability space (Ω, Pr), if X ≤ Y, which means that X(ω) ≤ Y(ω) for all ω ∈ Ω, then E(X) ≤ E(Y) (monotonicity of expectation).
4. For any scalar λ ∈ R, we have

  Var(λX) = λ^2 Var(X).

Proof. Properties (1) and (2) are obvious. For (3), X ≤ Y iff Y − X ≥ 0, so by (1) we have E(Y − X) ≥ 0, and by linearity of expectation, E(Y) ≥ E(X). For (4), we have

  Var(λX) = E((λX − E(λX))^2) = E(λ^2 (X − E(X))^2) = λ^2 Var(X),

as claimed. ⊓⊔
Property (4) shows that unlike expectation, the variance is not linear (although for independent random variables, Var(X + Y) = Var(X) + Var(Y); this also holds in the more general case of uncorrelated random variables, see Proposition 5.14 below).
Example 5.28. As an application of Proposition 5.12, if we consider the experiment of rolling two dice, since we showed that the random variables S1 and S2 are independent, we can compute the variance of their sum S = S1 + S2, and we get

  Var(S) = Var(S1) + Var(S2) = 35/12 + 35/12 = 35/6.
Recall from Example 5.17 that E(S) = 7.
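These variances are easy to recompute exactly from the distributions themselves; a sketch with rational arithmetic:

```python
from fractions import Fraction

# Var(S1) for one fair die and Var(S) for the sum of two independent
# dice, computed directly from the distributions.
def variance(dist):  # dist maps each value to its probability
    m = sum(v * p for v, p in dist.items())
    return sum((v - m) ** 2 * p for v, p in dist.items())

die = {v: Fraction(1, 6) for v in range(1, 7)}
S = {}
for a in range(1, 7):
    for b in range(1, 7):
        S[a + b] = S.get(a + b, Fraction(0)) + Fraction(1, 36)

print(variance(die))  # 35/12
print(variance(S))    # 35/6
```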
Here is an application of geometrically distributed random variables.
Example 5.29. Suppose there are m different types of coupons (or perhaps, the kinds
of cards that kids like to collect), and that each time one obtains a coupon, it is
equally likely to be any of these types. Let X denote the number of coupons one
needs to collect in order to have at least one of each type. What is the expected
value E(X) of X? This problem is usually called a coupon collecting problem.
The trick is to introduce the random variables Xi , where Xi is the number of
additional coupons needed, after i distinct types have been collected, until another
new type is obtained, for i = 0, 1, . . . , m − 1. Clearly,

  X = Σ_{i=0}^{m−1} Xi,

and each Xi has a geometric distribution, where each trial has probability of success pi = (m − i)/m. We know (see Example 5.22) that

  E(Xi) = 1/pi = m/(m − i).

Consequently,

  E(X) = Σ_{i=0}^{m−1} E(Xi) = Σ_{i=0}^{m−1} m/(m − i) = m Σ_{i=1}^{m} 1/i.

Since H_m = Σ_{i=1}^{m} 1/i = ln m + Θ(1), we obtain

  E(X) = m ln m + Θ(m).
Let us also compute the variance of X. Since the Xi are independent and Var(Xi) = (1 − pi)/pi^2 = (i/m)(m/(m − i))^2 = mi/(m − i)^2, it follows that

  Var(X) = Σ_{i=0}^{m−1} Var(Xi) = m Σ_{i=0}^{m−1} i/(m − i)^2.

To estimate this sum, observe that

  Σ_{i=0}^{m−1} i/(m − i)^2 = Σ_{i=0}^{m−1} m/(m − i)^2 − Σ_{i=0}^{m−1} (m − i)/(m − i)^2
                            = m Σ_{j=1}^{m} 1/j^2 − Σ_{j=1}^{m} 1/j.

Since Σ_{j=1}^{∞} 1/j^2 = π^2/6, we get

  Var(X) = m^2 π^2/6 + Θ(m ln m).
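For small m, the exact mean m H_m and the exact variance m^2 Σ 1/j^2 − m H_m can be computed directly; a sketch (m = 3 gives 11/2 and 27/4):

```python
from fractions import Fraction

# Coupon collector with m types: exact mean m * H_m and exact variance
# m^2 * sum(1/j^2) - m * H_m, from the geometric decomposition above.
def coupon_mean(m):
    return m * sum(Fraction(1, j) for j in range(1, m + 1))

def coupon_variance(m):
    return (m * m * sum(Fraction(1, j * j) for j in range(1, m + 1))
            - coupon_mean(m))

print(coupon_mean(3))      # 11/2
print(coupon_variance(3))  # 27/4
```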
Let us go back to the example about fixed points of random permutations (Example 5.20). We found that the expectation of the number of fixed points is μ = 1. The reader should compute the standard deviation. The difficulty is that the random variables X_k are not independent (for every permutation π, we have X_k(π) = 1 iff π(k) = k, and 0 otherwise). You will find that σ = 1. If you get stuck, look at Graham, Knuth and Patashnik [5], Chapter 8.
If X and Y are not independent, we still have

  E((X + Y)^2) = E(X^2) + 2E(XY) + E(Y^2),

and we get

  Var(X + Y) = Var(X) + Var(Y) + 2(E(XY) − E(X)E(Y)).

The term E(XY) − E(X)E(Y) has a more convenient form. Indeed, we have

  E((X − E(X))(Y − E(Y))) = E(XY − X E(Y) − E(X)Y + E(X)E(Y)) = E(XY) − E(X)E(Y).

Definition 5.13. Given two random variables X and Y, their covariance Cov(X, Y) is defined by

  Cov(X, Y) = E((X − E(X))(Y − E(Y))) = E(XY) − E(X)E(Y).
In standard examples of this kind, Cov(X, Y) = 0; however, the reader will check easily that X and Y are not independent, so zero covariance does not imply independence.
A better measure of independence is given by the correlation coefficient ρ(X, Y) of X and Y, given by

  ρ(X, Y) = Cov(X, Y) / (√Var(X) √Var(Y)),

provided that Var(X) ≠ 0 and Var(Y) ≠ 0. It turns out that |ρ(X, Y)| ≤ 1, which is shown using the Cauchy–Schwarz inequality.
Proposition 5.15 (Cauchy–Schwarz inequality). For any two random variables X and Y on a discrete probability space, we have

  |E(XY)| ≤ √(E(X^2)) √(E(Y^2)).

Equality is achieved if and only if there exist some α, β ∈ R (not both zero) such that E((αX + βY)^2) = 0.

Proof. For any real λ, consider the function

  T(λ) = E((X + λY)^2).

We get

  T(λ) = E(X^2 + 2λXY + λ^2 Y^2)
       = E(X^2) + 2λE(XY) + λ^2 E(Y^2).

Since T(λ) ≥ 0 for all λ ∈ R, the quadratic equation

  λ^2 E(Y^2) + 2λE(XY) + E(X^2) = 0

should have at most one real root, which is equivalent to the well-known condition

  4(E(XY))^2 − 4E(X^2)E(Y^2) ≤ 0,

which is equivalent to

  |E(XY)| ≤ √(E(X^2)) √(E(Y^2)),

as claimed.
If (E(XY))^2 = E(X^2)E(Y^2), then either E(Y^2) = 0, and then with α = 0, β = 1, we have E((αX + βY)^2) = 0, or E(Y^2) > 0, in which case the quadratic equation T(λ) = 0 has a (double) real root λ0, so with α = 1, β = λ0, we have E((αX + βY)^2) = T(λ0) = 0. ⊓⊔

It can be shown that for any random variable Z, if E(Z^2) = 0, then Pr(Z = 0) = 1; see Grimmett and Stirzaker [6] (Chapter 3, Problem 3.11.2). In fact, this is a consequence of Proposition 5.2 and Chebyshev's Inequality (see below), as shown in Ross [11] (Section 8.2, Proposition 2.3). It follows that if equality is achieved in the Cauchy–Schwarz inequality, then there are some reals α, β (not both zero) such that Pr(αX + βY = 0) = 1; in other words, X and Y are dependent with probability 1. If we apply the Cauchy–Schwarz inequality to the random variables X − E(X) and Y − E(Y), we obtain the following result.
Proposition 5.16. For any two random variables X and Y on a discrete probability space, we have

  |ρ(X, Y)| ≤ 1,

with equality iff there are some real numbers α, β, γ (with α, β not both zero) such that Pr(αX + βY = γ) = 1.
As emphasized by Graham, Knuth and Patashnik [5], the variance plays a key role in an inequality due to Chebyshev (published in 1867) that tells us that a random variable will rarely be far from its mean E(X) if its variance Var(X) is small.

Proposition 5.17 (Chebyshev's Inequality). If X is any random variable, then for every α > 0, we have

  Pr(|X − E(X)| ≥ α) ≤ Var(X)/α^2.

Proof. We have

  Var(X) = E((X − E(X))^2)
         = Σ_{ω ∈ Ω} (X(ω) − E(X))^2 Pr(ω)
         ≥ Σ_{ω ∈ Ω, (X(ω)−E(X))^2 ≥ α^2} α^2 Pr(ω)
         = α^2 Pr((X − E(X))^2 ≥ α^2)
         = α^2 Pr(|X − E(X)| ≥ α),

which yields

  Pr(|X − E(X)| ≥ α) ≤ Var(X)/α^2,

as claimed. ⊓⊔
It is also convenient to restate Chebyshev's Inequality in terms of the standard deviation σ = √Var(X) of X, to write E(X) = μ, and to replace α^2 by c^2 Var(X), and we get: For every c > 0,

  Pr(|X − μ| ≥ cσ) ≤ 1/c^2;

equivalently,

  Pr(|X − μ| < cσ) ≥ 1 − 1/c^2.

This last inequality says that a random variable will lie within cσ of its mean with probability at least 1 − 1/c^2. If c = 10, the random variable will lie between μ − 10σ and μ + 10σ at least 99% of the time.
We can apply the Chebyshev Inequality to the experiment of Example 5.28 where we roll two fair dice. We found that μ = 7 and σ^2 = 35/6 (for one roll). If we assume that we perform n independent trials, then the total value of the n rolls has expectation 7n and the variance is 35n/6. It follows that the sum will be between

  7n − 10√(35n/6)  and  7n + 10√(35n/6)

at least 99% of the time. If n = 10^6 (a million rolls), then the total value will be between 6.976 million and 7.024 million more than 99% of the time.
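The window quoted above is easy to recompute; a sketch with c = 10 as in the text:

```python
from math import sqrt

# n rolls of two fair dice: mean 7n, variance 35n/6; Chebyshev with
# c = 10 gives a window that holds with probability at least 99%.
n = 10 ** 6
mu = 7 * n
sigma = sqrt(35 * n / 6)
low, high = mu - 10 * sigma, mu + 10 * sigma
print(round(low), round(high))  # roughly 6.976 million and 7.024 million
```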
Another interesting consequence of Chebyshev's Inequality is this. Suppose we have a random variable X on some discrete probability space (Ω, Pr). For any n, we can form the product space (Ω^n, Pr) as explained in Definition 5.6, and define the random variables X1, . . . , Xn on this space by

  X_k(ω1, . . . , ωn) = X(ω_k),  1 ≤ k ≤ n.

It is easy to see that the X_k are independent. Consider the random variable

  S = X1 + · · · + Xn.
5.6 Limit Theorems; A Glimpse

The behavior of the average sum of n independent samples described at the end of Section 5.5 is an example of a weak law of large numbers. A precise formulation of such a result is shown below. A version of this result was first shown by Jacob Bernoulli and was published by his nephew Nicholas in 1713. Bernoulli did not have Chebyshev's Inequality at his disposal (since Chebyshev's Inequality was proved in 1867), and he had to resort to a very ingenious proof.
Theorem 5.1 (Weak Law of Large Numbers). Let X1, . . . , Xn, . . . be a sequence of independent, identically distributed random variables, each with mean μ = E(X1) and variance σ^2 = Var(X1). Then, for every ε > 0,

  lim_{n→∞} Pr( |(X1 + · · · + Xn)/n − μ| ≥ ε ) = 0.

Proof. As earlier,

  E((X1 + · · · + Xn)/n) = μ,

and because the Xi are independent,

  Var((X1 + · · · + Xn)/n) = σ^2/n.

Then, Chebyshev's Inequality yields

  Pr( |(X1 + · · · + Xn)/n − μ| ≥ ε ) ≤ σ^2/(nε^2),

and the right-hand side tends to 0 as n goes to infinity. ⊓⊔
Another remarkable limit theorem has to do with the limit of the distribution of the random variable

  (X1 + · · · + Xn − nμ) / (σ√n),

where the Xi are i.i.d. random variables with mean μ and variance σ^2. Observe that the mean of X1 + · · · + Xn is nμ and its variance is nσ^2, since the Xi are assumed to be i.i.d.
We have not discussed a famous distribution, the normal or Gaussian distribution, only because it is a continuous distribution. The standard normal distribution is the cumulative distribution function Φ whose density function is given by

  f(x) = (1/√(2π)) e^{−x^2/2};

that is,

  Φ(x) = (1/√(2π)) ∫_{−∞}^{x} e^{−y^2/2} dy.
The function f(x) decays to zero very quickly and its graph has a bell shape. More generally, we say that a random variable X is normally distributed with parameters μ and σ^2 (and that X has a normal distribution) if its density function is the function

  f(x) = (1/(σ√(2π))) e^{−(x−μ)^2/(2σ^2)}.
Figure 5.11 shows some examples of normal distributions.
Using a little bit of calculus, it is not hard to show that if a random variable X is normally distributed with parameters μ and σ^2, then its mean and variance are given by

  E(X) = μ,  Var(X) = σ^2.
Fig. 5.12 Abraham de Moivre (1667–1754) (left), Pierre–Simon Laplace (1749–1827) (middle),
Johann Carl Friedrich Gauss (1777–1855) (right).
Observe that now, we have two approximations for the distribution of a random
variable X = X1 + · · · + Xn with a binomial distribution. When n is large and p is
small, we have the Poisson approximation. When np(1 p) is large, the normal
approximation can be shown to be quite good.
Theorem 5.2 is a special case of the following important theorem known as the central limit theorem.

Theorem 5.3 (Central Limit Theorem). Let X1, . . . , Xn, . . . be a sequence of i.i.d. random variables, each with mean μ and variance σ^2. Then the distribution of the random variable

  (X1 + · · · + Xn − nμ) / (σ√n)

tends to the standard normal distribution as n goes to infinity. This means that for every real a,

  lim_{n→∞} Pr( (X1 + · · · + Xn − nμ)/(σ√n) ≤ a ) = (1/√(2π)) ∫_{−∞}^{a} e^{−x^2/2} dx.
We lack the machinery to prove this theorem. This machinery involves character-
istic functions and various limit theorems. We refer the interested reader to Ross [11]
(Chapter 8), Grimmett and Stirzaker [6] (Chapter 5), Venkatesh [14], and Shiryaev
[13] (Chapter III).
The central limit theorem was originally stated and proved by Laplace, but Laplace's proof was not entirely rigorous. Laplace expended a great deal of effort in estimating sums of the form

  Σ_{k ≤ np + x√(np(1−p))} C(n, k) p^k (1 − p)^{n−k}.
elegance and beauty of the typesetting. Lyapunov gave the first rigorous proof of the
central limit theorem around 1901.
Fig. 5.13 Pierre–Simon Laplace (1749–1827) (left), Aleksandr Mikhailovich Lyapunov (1857-
1918) (right).
The following example from Ross [11] illustrates how the central limit theorem can be used. Suppose an astronomer wants to measure an unknown distance d (in light years) by making n independent measurements X1, . . . , Xn, each with mean d and standard deviation at most 2 light years, and by estimating d as the average of the measurements. By the central limit theorem, the random variable

  Zn = (X1 + · · · + Xn − nd) / (2√n)

has approximately a standard normal distribution.
If the astronomer wants to be 95% certain that his estimated value is accurate to within 0.5 light year, he should make n* measurements, where n* is given by

  2Φ(√(n*)/4) − 1 = 0.95,

that is,

  Φ(√(n*)/4) = 0.975.

Using tables for the values of the function Φ, we find that

  √(n*)/4 = 1.96,

which yields

  n* ≈ 61.47.

Since n should be an integer, the astronomer should make 62 observations.
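The threshold n* can be recomputed with Python's `statistics.NormalDist` instead of tables; a sketch:

```python
from statistics import NormalDist

# Solve Phi(sqrt(n)/4) = 0.975 for n: sqrt(n)/4 is the 97.5% quantile
# of the standard normal, so n = (4*z)^2 with z = Phi^{-1}(0.975).
z = NormalDist().inv_cdf(0.975)  # about 1.96
n_star = (4 * z) ** 2            # about 61.5, so 62 observations
print(z, n_star)
```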
The above analysis relies on the assumption that the distribution of Zn is well approximated by the normal distribution. If we are concerned about this point, we can use Chebyshev's inequality. If we write

  Sn = (X1 + · · · + Xn)/n,

we have

  E(Sn) = d  and  Var(Sn) = 4/n,

so by Chebyshev's inequality, we have

  Pr( |Sn − d| > 1/2 ) ≤ 4/(n(1/2)^2) = 16/n.

Hence, if we make n = 16/0.05 = 320 observations, we are 95% certain that the estimate will be accurate to within 0.5 light year.
The method of making repeated measurements in order to “average” errors is
applicable to many different situations (geodesy, astronomy, etc.).
There are generalizations of the central limit theorem to independent but not
necessarily identically distributed random variables. Again, the reader is referred to
Ross [11] (Chapter 8), Grimmett and Stirzaker [6] (Chapter 5), and Shiryaev [13]
(Chapter III).
There is also the famous strong law of large numbers due to Andrey Kolmogorov
proved in 1933 (with an earlier version proved in 1909 by Émile Borel). In order to
state the strong law of large numbers, it is convenient to define various notions of
convergence for random variables.
Fig. 5.14 Félix Edouard Justin Émile Borel (1871–1956) (left), Andrey Nikolaevich Kolmogorov
(1903–1987) (right).
3. We say that Xn converges to X in probability, denoted Xn →P X, if for every ε > 0,

  lim_{n→∞} Pr(|Xn − X| > ε) = 0.

4. We say that Xn converges to X in distribution, denoted Xn →D X, if

  lim_{n→∞} Pr(Xn ≤ x) = Pr(X ≤ x)

for every x ∈ R at which the distribution function F(x) = Pr(X ≤ x) is continuous.

Theorem 5.4 (Strong Law of Large Numbers). Let X1, . . . , Xn, . . . be a sequence of i.i.d. random variables with finite mean μ = E(X1) and finite variance. Then the average (X1 + · · · + Xn)/n converges almost surely to μ.
The proof is beyond the scope of this book. Interested readers should consult
Grimmett and Stirzaker [6] (Chapter 7), Venkatesh [14], and Shiryaev [13] (Chapter
III). Fairly accessible proofs under the additional assumption that E(X14 ) exists can
be found in Brémaud [2], and Ross [11].
Actually, for almost sure convergence, the assumption that E(X1^2) exists is redundant provided that E(|X1|) exists, in which case μ = E(X1), but the proof takes some work; see Brémaud [2] (Chapter 1, Section 8.4) and Grimmett and Stirzaker [6] (Chapter 7). There are generalizations of the strong law of large numbers where the independence assumption on the Xn is relaxed, but again, this is beyond the scope of this book.
5.7 Generating Functions; A Glimpse

If a random variable X on a discrete probability space (Ω, Pr) takes nonnegative integer values, then we can define its probability generating function (pgf) G_X(z) as the power series

  G_X(z) = Σ_{k ≥ 0} Pr(X = k) z^k.

Therefore,

  G_X(z) = E(z^X).

Note that

  G_X(1) = Σ_{ω ∈ Ω} Pr(ω) = 1,

so the radius of convergence of the power series G_X(z) is at least 1. The nicest property about pgf's is that they usually simplify the computation of the mean and variance. For example, we have
  E(X) = Σ_{k ≥ 0} k Pr(X = k)
       = Σ_{k ≥ 0} Pr(X = k) · k z^{k−1} |_{z=1}
       = G′_X(1).

Similarly,

  E(X^2) = Σ_{k ≥ 0} k^2 Pr(X = k)
         = Σ_{k ≥ 0} Pr(X = k) · (k(k − 1)z^{k−2} + k z^{k−1}) |_{z=1}
         = G″_X(1) + G′_X(1).

Therefore, we have

  E(X) = G′_X(1),  Var(X) = G″_X(1) + G′_X(1) − (G′_X(1))^2.
Remark: The above results assume that G′_X(1) and G″_X(1) are well defined, which is the case if the radius of convergence of the power series G_X(z) is greater than 1. If the radius of convergence of G_X(z) is equal to 1 and if lim_{z↑1} G′_X(z) exists, then

  E(X) = lim_{z↑1} G′_X(z),

and similarly for E(X^2). These facts follow from Abel's theorem, a result due to N. Abel.

As an example, consider the uniform distribution on {0, 1, . . . , n − 1}, whose pgf is

  Un(z) = (1/n)(1 + z + · · · + z^{n−1}) = (1 − z^n)/(n(1 − z)).
Setting z = 1 + x, we get

  Un(1 + x) = ((1 + x)^n − 1)/(nx)
            = (1/n) C(n, 1) + (1/n) C(n, 2) x + (1/n) C(n, 3) x^2 + · · · + (1/n) C(n, n) x^{n−1}.

It follows that

  Un(1) = 1;  U′n(1) = (n − 1)/2;  U″n(1) = (n − 1)(n − 2)/3.
Then, we find that the mean is given by

  μ = U′n(1) = (n − 1)/2

and the variance by

  σ^2 = U″n(1) + U′n(1) − (U′n(1))^2 = (n^2 − 1)/12.
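The mean (n − 1)/2 and variance (n^2 − 1)/12 obtained from Un can be cross-checked against the uniform distribution on {0, 1, . . . , n − 1} directly; a sketch (n = 12 is arbitrary):

```python
from fractions import Fraction

# Mean and variance of the uniform distribution on {0, ..., n-1},
# which should match (n - 1)/2 and (n^2 - 1)/12 from the pgf U_n.
n = 12
mu = Fraction(sum(range(n)), n)
var = Fraction(sum(v * v for v in range(n)), n) - mu * mu
assert mu == Fraction(n - 1, 2)
assert var == Fraction(n * n - 1, 12)
print(mu, var)  # 11/2 143/12
```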
Another nice fact about pgf's is that the pgf of the sum X + Y of two independent random variables X and Y is the product of their pgf's. This is because if X and Y are independent, then

  Pr(X + Y = n) = Σ_{k=0}^{n} Pr(X = k and Y = n − k)
                = Σ_{k=0}^{n} Pr(X = k) Pr(Y = n − k),

which is the coefficient of z^n in G_X(z)G_Y(z), so

  G_{X+Y}(z) = G_X(z) G_Y(z).
If we flip a biased coin where the probability of heads is p, then the pgf for the number of heads after one flip is

  H(z) = 1 − p + pz.

If we make n independent flips, the pgf for the number of heads is

  H(z)^n = (1 − p + pz)^n.

This allows us to rederive the formulae for the mean and the variance. We get

  μ = (H^n)′(1) = nH′(1) = np,

and

  σ^2 = n(H″(1) + H′(1) − (H′(1))^2) = n(0 + p − p^2) = np(1 − p).
If we flip a biased coin repeatedly until heads first turns up, we saw that the random variable X that gives the number of trials n until the first occurrence of heads has the geometric distribution f(n) = (1 − p)^{n−1} p. It follows that the pgf of X is

  G_X(z) = pz + (1 − p)pz^2 + · · · + (1 − p)^{n−1} p z^n + · · · = pz/(1 − (1 − p)z).
Since we are assuming that these trials are independent, the random variable that tells us the number of trials needed to obtain m heads has pgf

  G_X(z) = ( pz/(1 − (1 − p)z) )^m
         = p^m z^m Σ_{k ≥ 0} C(m + k − 1, k) ((1 − p)z)^k
         = Σ_{j ≥ m} C(j − 1, j − m) p^m (1 − p)^{j−m} z^j.
As an exercise, the reader should check that the pgf of a Poisson distribution with parameter λ is

  G_X(z) = e^{λ(z−1)}.

More examples of the use of pgf's can be found in Graham, Knuth and Patashnik [5].
Another interesting generating function is the moment generating function M_X(t). It is defined as follows: for any t ∈ R,

  M_X(t) = E(e^{tX}) = Σ_{n ≥ 0} E(X^n) t^n/n!.

This shows that M_X(t) is the exponential generating function of the sequence of moments (E(X^n)); see Graham, Knuth and Patashnik [5]. If X is a continuous random variable, then the function M_X(−t) is the Laplace transform of the density function f.
Furthermore, if X and Y are independent, then E(XY ) = E(X)E(Y ), so we have
n ✓ ◆ n ✓ ◆
n n
E (X +Y )n = Â E(X kY n k ) = Â E(X)k E(Y )n k ,
k=0 k k=0 k
and since
5.7 Generating Functions; A Glimpse 329
$$\begin{aligned}
M_{X+Y}(t) &= \sum_{n} \frac{E\big((X+Y)^n\big)}{n!}\, t^n \\
&= \sum_{n} \frac{1}{n!}\left(\sum_{k=0}^{n}\binom{n}{k}E(X^k)E(Y^{n-k})\right) t^n \\
&= \sum_{n}\left(\sum_{k=0}^{n}\frac{E(X^k)}{k!}\cdot\frac{E(Y^{n-k})}{(n-k)!}\right) t^n.
\end{aligned}$$
But this last term is the coefficient of $t^n$ in $M_X(t)M_Y(t)$. Therefore, as in the case of
pgf's, if X and Y are independent, then
$$M_{X+Y}(t) = M_X(t)M_Y(t).$$
Another way to prove the above equation is to use the fact that if X and Y are
independent random variables, then so are $e^{tX}$ and $e^{tY}$ for any fixed real t. Then,
$$M_{X+Y}(t) = E\big(e^{t(X+Y)}\big) = E\big(e^{tX}e^{tY}\big) = E\big(e^{tX}\big)E\big(e^{tY}\big) = M_X(t)M_Y(t).$$
Remark: If the random variable X takes nonnegative integer values, then it is easy
to see that
$$M_X(t) = G_X(e^t),$$
where $G_X$ is the generating function of X, so $M_X$ is defined over some open
interval $(-r, r)$ with $r > 0$ and $M_X(t) > 0$ on this interval. Then the function
$K_X(t) = \ln M_X(t)$ is well defined, and it has a Taylor expansion
$$K_X(t) = \frac{\kappa_1}{1!}\,t + \frac{\kappa_2}{2!}\,t^2 + \frac{\kappa_3}{3!}\,t^3 + \cdots + \frac{\kappa_n}{n!}\,t^n + \cdots. \qquad (*)$$
The numbers $\kappa_n$ are called the cumulants of X. Since
$$M_X(t) = \sum_{n=0}^{\infty} \frac{\mu_n}{n!}\, t^n,$$
where $\mu_n = E(X^n)$ is the nth moment of X, by taking exponentials on both sides of $(*)$
and comparing coefficients, we can express the cumulants in terms of the moments;
the first few are
$$\begin{aligned}
\kappa_1 &= \mu_1 \\
\kappa_2 &= \mu_2 - \mu_1^2 \\
\kappa_3 &= \mu_3 - 3\mu_1\mu_2 + 2\mu_1^3 \\
\kappa_4 &= \mu_4 - 4\mu_1\mu_3 + 12\mu_1^2\mu_2 - 3\mu_2^2 - 6\mu_1^4 \\
&\;\;\vdots
\end{aligned}$$
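To illustrate these identities (a sketch with an arbitrary example distribution, not from the text), one can approximate $K_X'(0)$ and $K_X''(0)$ by finite differences and check that they match the mean and the variance:

```python
from math import log, exp

# Arbitrary example pmf (an assumption chosen only for illustration).
pmf = {0: 0.2, 1: 0.5, 3: 0.3}

def M(t):  # moment generating function E[e^{tX}]
    return sum(p * exp(t * v) for v, p in pmf.items())

def K(t):  # cumulant generating function ln M_X(t)
    return log(M(t))

h = 1e-4
kappa1 = (K(h) - K(-h)) / (2 * h)            # approximates K'(0)
kappa2 = (K(h) - 2 * K(0) + K(-h)) / h**2    # approximates K''(0)

mean = sum(v * p for v, p in pmf.items())
var = sum(v * v * p for v, p in pmf.items()) - mean**2
assert abs(kappa1 - mean) < 1e-4  # kappa_1 is the mean
assert abs(kappa2 - var) < 1e-4   # kappa_2 is the variance
```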
Notice that $\kappa_1$ is the mean and $\kappa_2$ is the variance of X. Thus, it appears that the
cumulants are the natural generalization of the mean and variance. Furthermore,
because logs are taken, all cumulants of the sum of two independent random variables
are additive, just as the mean and variance are. This property makes cumulants more
important than moments.
The third generating function associated with a random variable X, and the most
important, is the characteristic function $\varphi_X(t)$, defined by
$$\varphi_X(t) = E\big(e^{itX}\big),$$
a complex function of the real variable t. The "innocent" insertion of i in the exponent
has the effect that $|e^{itX}| = 1$, so $\varphi_X(t)$ is defined for all $t \in \mathbb{R}$.
If X is a continuous random variable with density function f, then
$$\varphi_X(t) = \int_{-\infty}^{\infty} e^{itx} f(x)\, dx.$$
The proof is essentially the same as the one we gave for the moment generating
function, modulo powers of i.
5. If X is a random variable, then for any two reals a, b,
$$\varphi_{aX+b}(t) = e^{ibt}\varphi_X(at).$$
Given two random variables X and Y, their joint characteristic function $\varphi_{X,Y}(x, y)$
is defined by
$$\varphi_{X,Y}(x, y) = E\big(e^{ixX + iyY}\big).$$
5.8 Chernoff Bounds 331
Theorem 5.5. Two random variables X and Y have the same characteristic function
iff they have the same distribution.
Given a random variable X, it is often useful to know how quickly a probability such
as $\Pr(X \geq a)$ goes to zero as a becomes large (in absolute
value). Such probabilities are called tail distributions. It turns out that the moment
generating function $M_X$ (if it exists) yields some useful bounds, obtained by applying
to it a very simple inequality known as Markov's inequality, due to the mathematician
Andrei Markov, a major contributor to probability theory (the inventor of Markov
chains).
Proposition (Markov's Inequality). Let X be a nonnegative random variable. Then for
every a > 0,
$$\Pr(X \geq a) \leq \frac{E(X)}{a}.$$
Proof. Let $I_a$ be the random variable defined so that
$$I_a = \begin{cases} 1 & \text{if } X \geq a \\ 0 & \text{otherwise.} \end{cases}$$
Since $X \geq 0$, we have
$$I_a \leq \frac{X}{a}. \qquad (*)$$
Also, since $I_a$ takes only the values 0 and 1, $E(I_a) = \Pr(X \geq a)$. By taking expectations
in $(*)$, we get
$$E(I_a) \leq \frac{E(X)}{a},$$
which is the desired inequality since $E(I_a) = \Pr(X \geq a)$. □
Proposition 5.19. (Chernoff Bounds) Let X be a random variable and assume that
the moment generating function $M_X(t) = E(e^{tX})$ is defined. Then, for every a > 0, we
have
$$\Pr(X \geq a) \leq \min_{t>0} e^{-ta} M_X(t),$$
$$\Pr(X \leq a) \leq \min_{t<0} e^{-ta} M_X(t).$$
In order to make good use of the Chernoff bounds, one needs to find for which
values of t the function $e^{-ta} M_X(t)$ is minimized. Let us give a few examples.
Example 5.31. If X has a standard normal distribution, then it is not hard to show
that
$$M(t) = e^{t^2/2}.$$
Consequently, for any a > 0 and all t > 0, we get
$$\Pr(X \geq a) \leq e^{-ta} e^{t^2/2}.$$
The value of t that minimizes $e^{t^2/2 - ta}$ is the value that minimizes $t^2/2 - ta$, namely
t = a. Thus, for a > 0, we have
$$\Pr(X \geq a) \leq e^{-a^2/2}.$$
The function on the right-hand side decays to zero very quickly.
Example 5.32. Let us now consider a random variable X with a Poisson distribution
with parameter $\lambda$. It is not hard to show that
$$M(t) = e^{\lambda(e^t - 1)}.$$
Applying the Chernoff bound, for any nonnegative integer k and all t > 0, we get
$$\Pr(X \geq k) \leq e^{\lambda(e^t - 1)}\, e^{-kt}.$$
Using calculus, we can show that the function on the right-hand side has a minimum
when $\lambda(e^t - 1) - kt$ is minimized, and this is when $e^t = k/\lambda$. If $k > \lambda$ and if we let
$e^t = k/\lambda$ in the Chernoff bound, we obtain
$$\Pr(X \geq k) \leq e^{\lambda(k/\lambda - 1)} \left(\frac{\lambda}{k}\right)^k,$$
which is equivalent to
$$\Pr(X \geq k) \leq \frac{e^{-\lambda}(e\lambda)^k}{k^k}.$$
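As a numerical check of this bound (a sketch, not from the text; the value of $\lambda$ is an arbitrary choice), we can compare it against the exact Poisson tail computed by summing the pmf:

```python
from math import exp

lam = 2.0  # arbitrary example parameter

def poisson_tail(k, lam, extra=60):
    # Pr(X >= k) for X ~ Poisson(lam), accumulating pmf terms iteratively
    # (iterative update avoids overflowing factorials).
    term, total = exp(-lam), 0.0
    for j in range(k + extra):
        if j >= k:
            total += term
        term *= lam / (j + 1)
    return total

for k in range(3, 15):  # the bound is stated for k > lam
    bound = exp(-lam) * (exp(1) * lam)**k / k**k
    assert poisson_tail(k, lam) <= bound
```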
Our third example is taken from Mitzenmacher and Upfal [10] (Chapter 4). Let
$X_1, \ldots, X_n$ be independent Poisson trials with $\Pr(X_i = 1) = p_i$, let $X = \sum_{i=1}^n X_i$,
and let $\mu = E(X) = \sum_{i=1}^n p_i$. Using the inequality $1 + x \leq e^x$, we have
$M_{X_i}(t) = 1 + p_i(e^t - 1) \leq e^{p_i(e^t - 1)}$, so
$$M_X(t) = \prod_{i=1}^n M_{X_i}(t) \leq \prod_{i=1}^n e^{p_i(e^t - 1)} = e^{\mu(e^t - 1)}.$$
Therefore,
$$M_X(t) \leq e^{\mu(e^t - 1)} \quad \text{for all } t.$$
The next step is to apply the Chernoff bounds. Using a little bit of calculus, we
obtain the following result proved in Mitzenmacher and Upfal [10] (Chapter 4): for all
$0 < \delta < 1$,
$$\Pr\big(|X - \mu| \geq \delta\mu\big) \leq 2e^{-\mu\delta^2/3}.$$
As an application, if the $X_i$ are independent flips of a fair coin ($p_i = 1/2$), then
$\mu = n/2$, and by picking $\delta = \sqrt{\frac{6\ln n}{n}}$, it is easy to show that
$$\Pr\left(\left|X - \frac{n}{2}\right| \geq \frac{1}{2}\sqrt{6n\ln n}\right) \leq 2e^{-\mu\delta^2/3} = \frac{2}{n}.$$
This shows that the concentration of the number of heads around the mean n/2
is very tight. Most of the time, the deviations from the mean are of the order
$O(\sqrt{n\ln n})$. Another simple calculation using the Chernoff bounds shows that
$$\Pr\left(\left|X - \frac{n}{2}\right| \geq \frac{n}{4}\right) \leq 2e^{-n/24}.$$
This is a much better bound than the bound provided by the Chebyshev inequality:
$$\Pr\left(\left|X - \frac{n}{2}\right| \geq \frac{n}{4}\right) \leq \frac{4}{n}.$$
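The gap between the two bounds can be checked against the exact binomial tail (a sketch, not from the text; the values of n are arbitrary):

```python
from math import comb, exp

def two_sided_tail(n):
    # Exact Pr(|X - n/2| >= n/4) for X ~ Binomial(n, 1/2), by direct summation.
    hits = sum(comb(n, k) for k in range(n + 1) if abs(k - n / 2) >= n / 4)
    return hits / 2**n

for n in [8, 16, 32, 64, 512]:
    chernoff = 2 * exp(-n / 24)
    chebyshev = 4 / n
    assert two_sided_tail(n) <= chernoff
    assert two_sided_tail(n) <= chebyshev

# For large n the Chernoff bound is far smaller than the Chebyshev bound.
assert 2 * exp(-512 / 24) < 4 / 512
```

For small n the Chernoff bound can actually be weaker than Chebyshev's; its advantage is the exponential decay in n.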
Ross [11] and Mitzenmacher and Upfal [10] consider the situation where a gambler
is equally likely to win or lose one unit on every play. Assuming that these
random variables $X_i$ are independent and that
$$\Pr(X_i = 1) = \Pr(X_i = -1) = \frac{1}{2},$$
let $S_n = \sum_{i=1}^n X_i$ be the gambler's winnings after n plays. It is easy to see that
the moment generating function of $X_i$ is
$$M_{X_i}(t) = \frac{e^t + e^{-t}}{2}.$$
Using a little bit of calculus, one finds that
$$M_{X_i}(t) \leq e^{t^2/2}.$$
Since the $X_i$ are independent, $M_{S_n}(t) = (M_{X_i}(t))^n \leq e^{nt^2/2}$, and the Chernoff
bound gives $\Pr(S_n \geq a) \leq e^{-ta}\, e^{nt^2/2}$ for all $t > 0$.
The minimum is achieved for t = a/n, and assuming that a > 0, we get
$$\Pr(S_n \geq a) \leq e^{-a^2/(2n)}, \quad a > 0.$$
5.9 Summary
Problems
References
1. Joseph Bertrand. Calcul des Probabilités. New York, NY: Chelsea Publishing Company,
third edition, 1907.
2. Pierre Brémaud. Markov Chains, Gibbs Fields, Monte Carlo Simulations, and Queues. TAM
No. 31. New York, NY: Springer, third edition, 2001.
3. William Feller. An Introduction to Probability Theory and its Applications, Vol. 1. New
York, NY: Wiley, third edition, 1968.
4. William Feller. An Introduction to Probability Theory and its Applications, Vol. 2. New
York, NY: Wiley, second edition, 1971.
5. Ronald L. Graham, Donald E. Knuth, and Oren Patashnik. Concrete Mathematics: A Foun-
dation For Computer Science. Reading, MA: Addison Wesley, second edition, 1994.
6. Geoffrey Grimmett and David Stirzaker. Probability and Random Processes. Oxford, UK:
Oxford University Press, third edition, 2001.
7. Pierre–Simon Laplace. Théorie Analytique des Probabilités, Volume I. Paris, France: Editions
Jacques Gabay, third edition, 1820.
8. Pierre–Simon Laplace. Théorie Analytique des Probabilités, Volume II. Paris, France: Edi-
tions Jacques Gabay, third edition, 1820.
9. Rajeev Motwani and Prabhakar Raghavan. Randomized Algorithms. Cambridge, UK: Cam-
bridge University Press, first edition, 1995.
10. Michael Mitzenmacher and Eli Upfal. Probability and Computing. Randomized Algorithms
and Probabilistic Analysis. Cambridge, UK: Cambridge University Press, first edition, 2005.
11. Sheldon Ross. A First Course in Probability. Upper Saddle River, NJ: Pearson Prentice Hall,
eighth edition, 2010.
12. Sheldon Ross. Probability Models for Computer Science. San Diego, CA: Har-
court/Academic Press, first edition, 2002.
13. Albert Nikolaevich Shiryaev. Probability. GTM No. 95. New York, NY: Springer, second
edition, 1995.
14. Santosh S. Venkatesh. The Theory of Probability: Explorations and Applications. Cambridge,
UK: Cambridge University Press, first edition, 2012.
U.C. Berkeley — CS276: Cryptography Handout 0.2
Luca Trevisan January, 2009
The following notes cover, mostly without proofs, some basic notions and results of
discrete probability. They were written for an undergraduate class, so you may find
them a bit slow.
1 Basic Definitions
In cryptography we typically want to prove that an adversary that tries to break a
certain protocol has only minuscule (technically, we say “negligible”) probability of
succeeding. In order to prove such results, we need some formalism to talk about the
probability that certain events happen, and also some techniques to make computa-
tions about such probabilities.
In order to model a probabilistic system we are interested in, we have to define
a sample space and a probability distribution. The sample space is the set of all
possible elementary events, i.e. things that can happen. A probability distribution is
a function that assigns a non-negative number to each elementary event, this number
being the probability that the event happens. We want probabilities to sum to 1.
Formally, a sample space is a finite set $\Omega$, and a probability distribution over $\Omega$ is a
function $P : \Omega \rightarrow [0,1]$ such that $\sum_{a \in \Omega} P(a) = 1$.
For example, if we want to model a sequence of three coin flips, our sample space
will be $\{Head, Tail\}^3$ (or, equivalently, $\{0,1\}^3$), and the probability distribution will
assign 1/8 to each element of the sample space (since each outcome is equally likely).
If we model an algorithm that first chooses at random a number in the range $1, \ldots, 10^{200}$
and then does some computation, our sample space will be the set $\{1, 2, \ldots, 10^{200}\}$,
and each element of the sample space will have probability $1/10^{200}$.
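The first example can be modelled directly (a sketch, not part of the notes; exact rationals are used so the probabilities sum to exactly 1):

```python
from itertools import product
from fractions import Fraction

# Sample space for three coin flips; each elementary event has probability 1/8.
omega = list(product(["Head", "Tail"], repeat=3))
P = {a: Fraction(1, 8) for a in omega}

assert len(omega) == 8
assert sum(P.values()) == 1  # probabilities sum to 1
```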
We will always restrict ourselves to finite sample spaces, so we will not remark it
each time. Discrete probability is the restriction of probability theory to finite sample
spaces. Things are much more complicated when the sample space can be infinite.
An event is a subset $A \subseteq \Omega$ of the sample space. The probability of an event is defined
in the intuitive way:
$$P[A] = \sum_{a \in A} P(a).$$
Another important family of distributions over $\{0,1\}^n$ is obtained by fixing a parameter
p and letting each bit be 1 independently with probability p, that is,
$$P(a) = p^k (1-p)^{n-k},$$
where k is the number of 1s in a.
When p = 1/2 then we have the uniform distribution over $\{0,1\}^n$. If p = 0 then all a
have probability zero, except $00\cdots0$, which has probability one. (Similarly if p = 1.)
The other cases are more interesting.
These distributions are called Bernoulli distributions or binomial distributions.
If we have a binomial distribution with parameter p, and we ask what is the probability
of the event $A_k$ that we get a string with k ones, then such a probability is
$$P[A_k] = \binom{n}{k}\, p^k (1-p)^{n-k}.$$
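This formula can be verified by brute-force enumeration of the sample space (a sketch, not from the notes; the values of n and p are arbitrary):

```python
from itertools import product
from math import comb

n, p = 6, 0.3  # arbitrary example parameters

def P(a):
    # P(a) = p^k (1-p)^(n-k), where k is the number of 1s in a
    k = sum(a)
    return p**k * (1 - p)**(n - k)

for k in range(n + 1):
    # Sum P(a) over all strings with exactly k ones and compare with C(n,k) p^k (1-p)^(n-k).
    total = sum(P(a) for a in product([0, 1], repeat=n) if sum(a) == k)
    assert abs(total - comb(n, k) * p**k * (1 - p)**(n - k)) < 1e-12
```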
2 Random Variables

Often we are not interested in an outcome itself, but in some numerical value determined
by the outcome. For example, when we play dice, we are interested in the probabilistic
system where two dice are rolled, and the sample space is $\{1, 2, \ldots, 6\}^2$, with the uniform
distribution over the 36 elements of the sample space, and we are interested in the sum of the
outcomes of the two dice. Similarly, when we study a randomized algorithm that
makes some internal random choices, we are interested in the running time of the
algorithm, or in its output. The notion of a random variable gives a tool to formalize
questions of this kind.
A random variable X is a function $X : \Omega \rightarrow V$ where $\Omega$ is a sample space and V is
some arbitrary set (V is called the range of the random variable). One should think
of a random variable as an algorithm that on input an elementary event returns some
output. Typically, V will either be a subset of the set of real numbers or of the set
of binary strings of a certain length.
Let $\Omega$ be a sample space, P a probability distribution on $\Omega$, and X a random
variable on $\Omega$. If v is in the range of X, then the expression X = v denotes an event,
namely the event $\{a \in \Omega : X(a) = v\}$, and thus the expression P[X = v] is well
defined, and it is something interesting to try to compute.
Let's look at the example of dice. In that case, $\Omega = \{1, \ldots, 6\}^2$, and for every $(a, b) \in \Omega$
we have P(a, b) = 1/36. Let us define X as the random variable that associates a + b
to an elementary event (a, b). Then the range of X is $\{2, 3, \ldots, 12\}$. For every element
of the range we can compute the probability that X takes such a value. By counting
the number of elementary events in each event we get
$$P[X=2] = \tfrac{1}{36},\; P[X=3] = \tfrac{2}{36},\; P[X=4] = \tfrac{3}{36},\; P[X=5] = \tfrac{4}{36},\; P[X=6] = \tfrac{5}{36},\; P[X=7] = \tfrac{6}{36},$$
$$P[X=8] = \tfrac{5}{36},\; P[X=9] = \tfrac{4}{36},\; P[X=10] = \tfrac{3}{36},\; P[X=11] = \tfrac{2}{36},\; P[X=12] = \tfrac{1}{36}.$$
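The counting can be done by enumerating the 36 elementary events (a sketch, not part of the notes):

```python
from itertools import product
from fractions import Fraction

# Enumerate the 36 equally likely outcomes and tally the value of X = a + b.
counts = {}
for a, b in product(range(1, 7), repeat=2):
    counts[a + b] = counts.get(a + b, 0) + 1
P_X = {v: Fraction(c, 36) for v, c in counts.items()}

assert P_X[2] == Fraction(1, 36)
assert P_X[7] == Fraction(1, 6)
assert sum(P_X.values()) == 1
```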
It is possible to define more than one random variable over the same sample space,
and consider expressions more complicated than equalities.
When the range of a random variable X is a subset of the real numbers (e.g. if X
is the running time of an algorithm — in which case the range is even a subset of
the integers) then we can define the expectation of X. The expectation of a random
variable is a number defined as follows.
$$E[X] = \sum_{v \in V} v \cdot P[X = v],$$
where V is the range of X. We can assume without loss of generality that V is finite,
so that the expression above is well defined (if it were an infinite series, it could
diverge or even be undefined).
Expectations can be understood in terms of betting. Say that I am playing some
game where I have a probability 2/3 of winning, a probability 1/6 of losing and a
probability 1/6 of a draw. If I win, I win $ 1; if I lose I lose $ 2; if there is a draw I
do not win or lose anything. We can model this situation by having a sample space
$\{L, D, W\}$ with probabilities defined as above, and a random variable X that specifies
my wins/losses. Specifically $X(L) = -2$, $X(D) = 0$ and $X(W) = 1$. The expectation
of X is
$$E[X] = \frac{1}{6} \cdot (-2) + \frac{1}{6} \cdot 0 + \frac{2}{3} \cdot 1 = \frac{1}{3},$$
so if I play this game I "expect" to win $ 1/3. The game is more than fair on my
side.
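This computation can be written down directly (a sketch, not part of the notes; exact rationals avoid rounding):

```python
from fractions import Fraction

# The win/lose/draw game: sample space {L, D, W} with the stated probabilities.
P = {"L": Fraction(1, 6), "D": Fraction(1, 6), "W": Fraction(2, 3)}
X = {"L": -2, "D": 0, "W": 1}

E = sum(P[a] * X[a] for a in P)
assert E == Fraction(1, 3)  # expect to win $1/3 per play
```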
When we analyze a randomized algorithm, the running time of the algorithm typically
depends on its internal random choices. A complete analysis of the algorithm would
be a specification of the running time of the algorithm for each possible sequence of
internal choices. This is clearly impractical. If we can at least analyse the expected
running time of the algorithm, then this will be just a single value, and it will give
useful information about the typical behavior of the algorithm (see Section 4 below).
Here is a very useful property of expectation: linearity. If $X_1, \ldots, X_n$ are random
variables over the same sample space, then
$$E[X_1 + \cdots + X_n] = E[X_1] + \cdots + E[X_n],$$
regardless of whether the $X_i$ are independent.
Example 3 Consider the following question: if we flip a coin n times, what is the
expected number of heads? If we try to answer this question without using the linearity
of expectation we have to do a lot of work. Define $\Omega = \{0,1\}^n$ and let P be the uniform
distribution; let X be the random variable such that X(a) = the number of 1s in $a \in \Omega$.
Then we have, as a special case of the Bernoulli distribution, that
$$P[X = k] = \binom{n}{k} 2^{-n}.$$
In order to compute the average of X, we have to compute the sum
$$\sum_{k=0}^{n} k \binom{n}{k} 2^{-n} \qquad (1)$$
which requires quite a bit of ingenuity. We now show how to solve Expression (1) just
to see how much work can be saved by using the linearity of expectation. An inspection
of Expression (1) shows that it looks a bit like the expressions that one gets out of the
Binomial Theorem, except for the presence of k. In fact it looks pretty much like the
derivative of an expression coming from the Binomial Theorem (this is a standard
trick). Consider $(1/2 + x)^n$ (we have in mind to substitute x = 1/2 at some later
point); then we have
$$\left(\frac{1}{2} + x\right)^n = \sum_{k=0}^{n} \binom{n}{k} 2^{-(n-k)} x^k$$
and then
$$\frac{d\big((1/2 + x)^n\big)}{dx} = \sum_{k=0}^{n} \binom{n}{k} 2^{-(n-k)} k x^{k-1},$$
but also
$$\frac{d\big((1/2 + x)^n\big)}{dx} = n\left(\frac{1}{2} + x\right)^{n-1},$$
and putting together
$$\sum_{k=0}^{n} \binom{n}{k} 2^{-(n-k)} k x^{k-1} = n\left(\frac{1}{2} + x\right)^{n-1}.$$
Substituting x = 1/2, the left-hand side becomes twice the sum in Expression (1) and
the right-hand side becomes n, so the sum in (1) equals n/2.
So much for the definition of average. Here is a better route: we can view X as the
sum of n random variables $X_1, \ldots, X_n$, where $X_i$ is 1 if the i-th coin flip is 1 and $X_i$
is 0 otherwise. Clearly, for every i, $E[X_i] = \frac{1}{2} \cdot 0 + \frac{1}{2} \cdot 1 = \frac{1}{2}$, and so
$$E[X] = E[X_1 + \cdots + X_n] = E[X_1] + \cdots + E[X_n] = \frac{n}{2}.$$
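A quick check that the painful direct sum in Expression (1) agrees with the linearity answer n/2 (a sketch, not part of the notes; the values of n are arbitrary):

```python
from math import comb

# Direct computation of sum_k k * C(n,k) * 2^{-n}, which linearity says equals n/2.
for n in [1, 5, 10, 25]:
    direct = sum(k * comb(n, k) for k in range(n + 1)) / 2**n
    assert abs(direct - n / 2) < 1e-9
```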
3 Independence
In order to answer this question (I will give it away that the answer is 1/3), one
needs some tools to reason about the probability that a certain event holds given (or
conditioned on the fact) that a certain other event holds.
Fix a sample space $\Omega$ and a probability distribution P. Suppose we are given that a
certain event $A \subseteq \Omega$ holds. Then the probability of an elementary event a given the
fact that A holds (written P(a|A)) is defined as follows: if $a \notin A$, then it is impossible
that a holds, and so P(a|A) = 0; otherwise, if $a \in A$, then P(a|A) has a value that is
proportional to P(a). One realizes that the factor of proportionality has to be 1/P[A],
so that probabilities sum to 1 again. Our definition of conditional probability of an
elementary event is then
$$P(a|A) = \begin{cases} 0 & \text{if } a \notin A \\[4pt] \dfrac{P(a)}{P[A]} & \text{otherwise.} \end{cases}$$
The above formula already lets us solve the question asked at the beginning of this
section. Notice that probabilities conditioned on an event A such that P[A] = 0 are
undefined.
Then we extend the definition to arbitrary events, and we say that for an event B
$$P[B|A] = \sum_{b \in B} P(b|A).$$
One should check that the following (more standard) definition is equivalent:
$$P[B|A] = \frac{P[A \cap B]}{P[A]}.$$
We say that two events A and B are independent if $P[A \cap B] = P[A] \cdot P[B]$.
If A and B are independent, and P[A] > 0, then we have P[B|A] = P[B]. Similarly, if
A and B are independent, and P[B] > 0, then we have P[A|B] = P[A]. This motivates
the use of the term "independence." If A and B are independent, then whether A
holds or not is not influenced by the knowledge that B holds or not.
When we have several events, we can define a generalized notion of independence:
events $A_1, \ldots, A_n$ are mutually independent if for every subset $I \subseteq \{1, \ldots, n\}$,
$$P\left[\bigcap_{i \in I} A_i\right] = \prod_{i \in I} P[A_i].$$
All this stuff was just to prepare for the definition of independence for random vari-
ables, which is a very important and useful notion.
Definition 6 If X and Y are random variables over the same sample space, then we
say that X and Y are independent if for any two values v, w, the events (X = v) and
(Y = w) are independent.
Definition 8 Let X1 , . . . , Xn be random variables over the same sample space, then
we say that they are mutually independent if for any sequence of values v1 , . . . , vn ,
the events (X1 = v1 ), . . . , (Xn = vn ) are mutually independent.
Definition 10 Let $X_1, \ldots, X_n$ be random variables over the same sample space; then
we say that they are pairwise independent if for every $i, j \in \{1, \ldots, n\}$, $i \neq j$, we have
that $X_i$ and $X_j$ are independent.
Example 11 Consider the following probabilistic system: we toss 2 coins, and we let
the random variables X, Y, Z be, respectively, the outcome of the first coin, the outcome
of the second coin, and the XOR of the outcomes of the two coins (as usual, we inter-
pret outcomes of coins as 0/1 values). Then X, Y, Z are not mutually independent;
for example
$$P[Z = 0 \mid X = 0, Y = 0] = 1,$$
while
$$P[Z = 0] = 1/2.$$
However, X, Y, Z are pairwise independent. For example, X and Z are independent:
for any two values v, w we need
$$P[X = v, Z = w] = P[X = v] \cdot P[Z = w] = \frac{1}{4},$$
and this is true, since, in order to have Z = w and X = v, we must have $Y = w \oplus v$,
and the event that X = v and $Y = w \oplus v$ happens with probability 1/4.
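Both claims can be verified by enumerating the four elementary events (a sketch, not part of the notes):

```python
from itertools import product
from fractions import Fraction

# Two fair coins; Z is the XOR of the two outcomes.
omega = list(product([0, 1], repeat=2))
P = {a: Fraction(1, 4) for a in omega}
X = lambda a: a[0]
Y = lambda a: a[1]
Z = lambda a: a[0] ^ a[1]

def prob(pred):
    return sum(P[a] for a in omega if pred(a))

# Pairwise independence: X and Z are independent (likewise Y and Z).
for v, w in product([0, 1], repeat=2):
    assert prob(lambda a: X(a) == v and Z(a) == w) == \
           prob(lambda a: X(a) == v) * prob(lambda a: Z(a) == w)

# But not mutually independent: conditioned on X = 0 and Y = 0, Z is forced to 0.
assert prob(lambda a: X(a) == 0 and Y(a) == 0 and Z(a) == 0) != \
       prob(lambda a: X(a) == 0) * prob(lambda a: Y(a) == 0) * prob(lambda a: Z(a) == 0)
```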
4 Deviation from the Expectation
4.1 Markov's Inequality

If X is a non-negative random variable, then Markov's inequality states that for every
k > 0,
$$P[X \geq k] \leq \frac{E[X]}{k}.$$
Sometimes the bound given by Markov's inequality is extremely bad, but the bound
is as strong as possible if the only information that we have is the expectation of X.
For example, suppose that X counts the number of heads in a sequence of n coin
flips. Formally, $\Omega$ is $\{0,1\}^n$ with the uniform distribution, and X is the number of
ones in the string. Then E[X] = n/2. Suppose we want to get an upper bound on
$P[X \geq n]$ using Markov. Then we get
$$P[X \geq n] \leq \frac{E[X]}{n} = \frac{1}{2}.$$
This is ridiculous! The right value is $2^{-n}$, and the upper bound given by Markov's
inequality is totally off, and it does not even depend on n.
However, consider now the experiment where we flip n coins that are glued together,
so that the only possible outcomes are n heads (with probability 1/2) and n tails
(with probability 1/2). Define X again as the number of heads. We still have that
E[X] = n/2, and we can apply Markov's inequality as before to get
$$P[X \geq n] \leq \frac{E[X]}{n} = \frac{1}{2}.$$
But now the above inequality is tight, because $P[X \geq n]$ is precisely 1/2.
The moral is that Markov's inequality is very useful because it applies to every non-
negative random variable having a certain expectation, so we can use it without
having to study our random variable too much. On the other hand, the inequality
will be accurate when applied to a random variable that typically deviates a lot from
its expectation (say, the number of heads that we get when we toss n glued coins)
and the inequality will be very bad when we apply it to a random variable that
is concentrated around its expectation (say, the number of heads that we get in n
independent coin tosses). In the latter case, if we want accurate estimations we have
to use more powerful methods. One such method is described below.
4.2 Variance
For a random variable X, the random variable
$$X' = |X - E[X]|$$
gives all the information that we need in order to decide whether X is likely to deviate
a lot from its expectation or not. All we need to do is to prove that X' is typically
small. However this idea does not lead us very far (analysing X' does not seem to be
any easier than analysing X).
Here is a better tool. Consider
$$(X - E[X])^2.$$
This is again a random variable that tells us how much X deviates from its expec-
tation. In particular, if the expectation of such an auxiliary random variable is small,
then we expect X to be typically close to its expectation. The variance of X is defined
as
$$\mathrm{Var}(X) = E\big[(X - E[X])^2\big].$$
The variance is a useful notion for two reasons: it is often easy to compute and it
gives rise to sometimes strong estimations on the probability that a random variable
deviates from its expectation.
Theorem 15 (Chebyshev's Inequality)
$$P\big[|X - E[X]| \geq k\big] \leq \frac{\mathrm{Var}(X)}{k^2}.$$
The proof is the chain
$$P\big[|X - E[X]| \geq k\big] = P\big[(X - E[X])^2 \geq k^2\big] \leq \frac{E\big[(X - E[X])^2\big]}{k^2} = \frac{\mathrm{Var}(X)}{k^2}.$$
The nice idea is in the first step. The second step is just an application of Markov's
inequality, and the last step uses the definition of variance.
The value $\sigma(X) = \sqrt{\mathrm{Var}(X)}$ is called the standard deviation of X. One expects the
value of a random variable X to be around the interval $E[X] \pm \sigma(X)$. We can restate
Chebyshev's Inequality in terms of standard deviation: for every $c > 0$,
$$P\big[|X - E[X]| \geq c\,\sigma(X)\big] \leq \frac{1}{c^2}.$$
Let Y be a random variable that is equal to 0 with probability 1/2 and to 1 with
probability 1/2. Then E[Y] = 1/2, $Y = Y^2$, and
$$\mathrm{Var}(Y) = E[Y^2] - (E[Y])^2 = \frac{1}{2} - \frac{1}{4} = \frac{1}{4}.$$
Let X be the random variable that counts the number of heads in a sequence of n
independent coin flips. We have seen that E[X] = n/2. Computing the variance
according to the definition would be painful. We are fortunate that the following
result holds:
1. For every random variable X and all reals a, b,
$$\mathrm{Var}(aX + b) = a^2\,\mathrm{Var}(X).$$
2. Let $X_1, \ldots, X_n$ be pairwise independent random variables on the same sample
space. Then
$$\mathrm{Var}(X_1 + \cdots + X_n) = \mathrm{Var}(X_1) + \cdots + \mathrm{Var}(X_n).$$
In our example, X is the sum of n pairwise independent 0/1 variables, each of variance
1/4, so $\mathrm{Var}(X) = n/4$, and Chebyshev's inequality gives
$$P[X \geq n] \leq P\big[|X - E[X]| \geq n/2\big] \leq \frac{n/4}{(n/2)^2} = \frac{1}{n}.$$
This is still much bigger than the correct value $2^{-n}$, but at least it is a value that
decreases with n. It is also possible to show that Chebyshev's inequality is as strong
as possible given its assumption.
Let $n = 2^k - 1$ for some integer k and let $X_1, \ldots, X_n$ be the collection of pairwise
independent random variables as defined in Example 12. Let $X = X_1 + \cdots + X_n$.
Suppose we want to compute P[X = 0]. Since each $X_i$ has variance 1/4, we have that
X has variance n/4, and so
$$P[X = 0] \leq P\big[|X - E[X]| \geq n/2\big] \leq \frac{1}{n},$$
which is almost the right value: the right value is $2^{-k} = 1/(n+1)$.
A Appendix
For non-negative integers $n \geq k$, the binomial coefficient is
$$\binom{n}{k} = \frac{n!}{k!\,(n-k)!},$$
and
$$\sum_{k=0}^{n} \binom{n}{k} = 2^n. \qquad (2)$$
Theorem 18 (Binomial Theorem) For every two reals a, b and non-negative in-
teger n,
$$(a + b)^n = \sum_{k=0}^{n} \binom{n}{k} a^k b^{n-k}.$$
We can see that Equation (2) follows from the Binomial Theorem by simply substi-
tuting a = 1 and b = 1.
Sometimes we have to deal with summations of the form $1 + 1/2 + 1/3 + \cdots + 1/n$.
It's good to know that $\sum_{k=1}^{n} 1/k \approx \ln n$. More precisely:

Theorem 19
$$\lim_{n \to \infty} \frac{\sum_{k=1}^{n} 1/k}{\ln n} = 1.$$

In particular, $\sum_{k=1}^{n} 1/k \leq 1 + \ln n$ for every n, and $\sum_{k=1}^{n} 1/k \geq \ln n$ for sufficiently
large n.
The following inequality is exceedingly useful in computing upper bounds of proba-
bilities of events:
$$1 + x \leq e^x. \qquad (3)$$
Recall the Taylor expansion
$$e^x = 1 + x + \frac{1}{2}x^2 + \cdots + \frac{1}{k!}x^k + \cdots.$$
Observe that Equation (3) is true for every real x, not necessarily positive (but it
becomes trivial for $x < -1$).
Here is a typical application of Equation (3). We have a randomized algorithm that
has a probability $\varepsilon$ over its internal coin tosses of succeeding in doing something (and
when it succeeds, we notice that it does, say because the algorithm is trying to invert
a one-way function, and when it succeeds then we can check it efficiently); how many
times do we have to run the algorithm before we have probability at least 3/4 that
the algorithm succeeds?

The probability that it never succeeds in k runs is
$$(1 - \varepsilon)^k \leq e^{-\varepsilon k}.$$
If we choose $k = 2/\varepsilon$, the probability of k consecutive failures is less than $e^{-2} < 1/4$,
and so the probability of succeeding (at least once) is at least 3/4.
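This calculation can be checked directly (a sketch, not part of the notes; the value of $\varepsilon$ is an arbitrary example):

```python
from math import exp, ceil

# Number of runs needed for success probability >= 3/4, per Equation (3).
eps = 0.01  # arbitrary example success probability
k = ceil(2 / eps)
failure = (1 - eps) ** k

assert failure <= exp(-eps * k)  # (1 - eps)^k <= e^{-eps k}
assert failure < 0.25            # so overall success probability is at least 3/4
```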
Markov's inequality also gives a generic way to bound the running time of a randomized
algorithm: if T is the running time and t is its expectation, then for every k
$$P[T \geq kt] \leq \frac{1}{k},$$
and if we choose $k = 10^6$ we have that
$$P[T \geq 10^6\, t] \leq 10^{-6}.$$
However there is a much faster way of guaranteeing termination with high probability.
We let the program run for 2t time. By Markov's inequality, there is probability at
least 1/2 that the algorithm will stop before that time. If so, we are happy. If not, we terminate the computation, and
start it over (in the second iteration, we let the algorithm use independent random
bits). If the second computation does not terminate within 2t time, we reset it once
more, and so on. Let $T'$ be the random variable that gives the time taken by this
new version of the algorithm (with the stop and reset actions). Now we have that
the probability that we use more than 2kt time is equal to the probability that for k
consecutive (independent) times the algorithm takes more than 2t time. Each of these
events happens with probability at most 1/2, and so
$$P[T' \geq 2kt] \leq 2^{-k},$$
and if we take k = 20, the probability is less than $10^{-6}$, and the time is only 40t rather
than 1,000,000t.
Suppose that t = t(n) is the average running time of our algorithm on inputs of
length n, and that we want to find another algorithm that always finishes in time
$t'(n)$ and that reports a failure only with negligible probability, say with probability at
most $n^{-\log n}$. How large do we have to choose $t'$, and what should the new algorithm
look like?

If we just put a timeout $t'$ on the original algorithm, then we can use Markov's in-
equality to say that $t'(n) = n^{\log n}\, t(n)$ will suffice, but now $t'$ is not polynomial in
n (even if t was). Using the second method, we can put a timeout 2t and repeat
the algorithm $(\log n)^2$ times. Then the failure probability will be as requested and
$t'(n) = 2(\log n)^2\, t(n)$.
If we know how the algorithm works, then we can make a more direct analysis.
Example 21 Suppose that our goal is, given n, to find a number $2 \leq a \leq n - 1$ such
that $\gcd(a, n) = 1$. To simplify notation, let $l = \|n\| \approx \log n$ be the number of digits
of n in binary notation (in a concrete application, l would be a few hundred). Our
algorithm will be as follows:

• Repeat k times: pick a random $a \in \{2, \ldots, n-1\}$; if $\gcd(a, n) = 1$, output a and halt.
• Output "failure".
We would like to find a value of k such that the probability that the algorithm reports
a failure is negligible in the size of the input (i.e. in l).
At each iteration, the probability that the algorithm finds an element that is coprime
with n is
$$\frac{\varphi(n)}{n} \geq \frac{1}{6\log\log n} \geq \frac{1}{6\log l}.$$
So the probability that there is a failure in one iteration is at most
$$1 - \frac{1}{6\log l},$$
and the probability of k consecutive independent failures is at most
$$\left(1 - \frac{1}{6\log l}\right)^k \leq e^{-k/(6\log l)}.$$
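A minimal sketch of such a coprime-finding procedure (assumptions: the modulus n, the attempt count k, and the fixed seed are arbitrary choices for illustration):

```python
import random
from math import gcd

def find_coprime(n, k):
    # k independent attempts; report failure (None) if no candidate is coprime with n.
    for _ in range(k):
        a = random.randint(2, n - 1)
        if gcd(a, n) == 1:
            return a
    return None

random.seed(0)  # fixed seed so the sketch is reproducible
n = 2 * 3 * 5 * 7 * 11  # 2310: most residues share a factor with n
a = find_coprime(n, k=100)
assert a is not None and gcd(a, n) == 1
```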