Probability
Chapter 1
Probability Space
1.1 Introduction
Random experiments are experiments whose outcome cannot be predicted with certainty in advance. However, when one repeats the same experiment a large number of times, one can observe some "regularity" in the average outcome. A typical example is the toss of a coin: one cannot predict the result of a single toss, but if we toss the coin many times we get an average of about 50% "heads" if the coin is fair. Probability theory aims to provide a mathematical framework that describes such phenomena. This theory contains three main ingredients:
a) The state space: this is the set of all possible outcomes of the experiment, and it is
usually denoted by Ω.
Examples
b) The event: An "event" is a property which can be observed either to hold or not to hold after the experiment is done. In mathematical terms, an event is a subset of Ω. If A and B are two events, then A ∪ B is the event "A or B", A ∩ B is the event "A and B", and Ac = Ω \ A is the event "not A".
We denote by A the family of all events. Often (but not always: we will see why later) we have A = 2^Ω, the set of all subsets of Ω. The family A should be "stable" under the logical operations described above: if A, B ∈ A then we must have Ac ∈ A, A ∩ B ∈ A, A ∪ B ∈ A, and also Ω ∈ A and ∅ ∈ A.
c) The probability: With each event A one associates a number denoted by P(A) and called the "probability of A". This number measures the likelihood that the event A will be realized, assessed a priori, before performing the experiment. It lies between 0 and 1, and the more likely the event is, the closer to 1 this number is.
To get an idea of the properties of these numbers, one can imagine that they are the limits
of the ”frequency” with which the events are realized: let us repeat the same experiment n
times; the n outcomes might of course be different (think of n successive tosses of the same
die, for instance). Denote by fn (A) the frequency with which the event A is realized (i.e. the
number of times the event occurs, divided by n). Intuitively we have
P(A) = lim_{n→∞} f_n(A)
(we will give a precise meaning to this "limit" later). From the obvious properties of frequencies, we immediately deduce that:
1. 0 ≤ P(A) ≤ 1;
2. P(Ω) = 1;
3. P(A ∪ B) = P(A) + P(B) whenever A and B are disjoint events (since the corresponding frequencies add).
A mathematical model for our experiment is thus a triple (Ω, A, P), consisting of the space
Ω, the family A of all events, and the family of all P(A) for A ∈ A; hence we can consider that
P is a map from A into [0, 1], which satisfies at least the properties (2) and (3) above (plus in
fact an additional property, more difficult to understand, and which is given later).
A fourth notion, also important although less basic, is the following one:
d) Random variable: A random variable is a quantity which depends on the outcome of the experiment. In mathematical terms, this is a map from Ω into a space E, where often E = R or E = R^d.
1.2 Probability space
Let A be a family of subsets of Ω. Consider the following conditions:
1. ∅ ∈ A and Ω ∈ A;
2. A is closed under complementation: if A ∈ A then A^c ∈ A;
3. A is closed under finite unions and finite intersections: that is, if A_1, . . . , A_n are all in A, then ∪_{i=1}^n A_i and ∩_{i=1}^n A_i are in A as well (for this it is enough that A be stable under the union and the intersection of any two sets);
4. A is closed under countable unions and countable intersections: that is, if A_1, A_2, . . . are all in A, then ∪_{i=1}^∞ A_i and ∩_{i=1}^∞ A_i are in A as well.
Definition 1.2.1. A is an algebra if it satisfies (1), (2) and (3) above. It is a σ-algebra (or a σ-field) if it satisfies (1), (2), and (4) above.
Definition 1.2.3. If C ⊂ 2Ω , the σ-algebra generated by C, and written σ(C), is the smallest
σ-algebra containing C. (It always exists because 2Ω is a σ-algebra, and the intersection of a
family of σ-algebras is again a σ-algebra)
Definition 1.2.2. The Borel σ-algebra on R^d, written B(R^d), is the σ-algebra generated by the sets of the form (−∞, x_1] × · · · × (−∞, x_d], where x_1, . . . , x_d ∈ Q.
We can show that B(Rd ) is also the σ-algebra generated by all open subsets (or by all the
closed subsets) of Rd .
Definition 1.2.4. A probability measure on (Ω, A) is a map P : A → [0, 1] such that:
1. P(Ω) = 1;
2. For every countable sequence (A_n)_{n≥1} of elements of A, pairwise disjoint (that is, A_n ∩ A_m = ∅ whenever n ≠ m), one has
P(∪_{n=1}^∞ A_n) = ∑_{n=1}^∞ P(A_n).
Axiom (2) above is called countable additivity; the number P(A) is called the probability of
the event A.
In Definition 1.2.4 one might imagine a more elementary condition than (2), namely additivity: P(A ∪ B) = P(A) + P(B) for every pair A, B of disjoint events.
Theorem 1.2.5. Let (Ω, A, P) be a probability space. The following properties hold:
(i) P(∅) = 0;
(ii) P is additive: if A, B ∈ A and A ∩ B = ∅, then
P(A ∪ B) = P(A) + P(B);   (1.1)
(iii) P(A^c) = 1 − P(A) for every A ∈ A;
(iv) if A, B ∈ A and A ⊂ B, then P(A) ≤ P(B).
Proof. If in Axiom (2) we take An = ∅ for all n, we see that the number a = P(∅) is equal to an
infinite sum of itself; since 0 ≤ a ≤ 1, this is possible only if a = 0, and we have (i).
For (ii) it suffices to apply Axiom (2) with A_1 = A, A_2 = B and A_3 = A_4 = · · · = ∅, plus the fact that P(∅) = 0, to obtain (1.1).
Applying (1.1) with A ∈ A and B = A^c we get (iii).
To show (iv), suppose A ⊂ B; then applying (1.1) to A and B \ A we have P(B) = P(A) + P(B \ A) ≥ P(A).
Theorem 1.3.1. Let A be a σ-algebra. Suppose that P : A → [0, 1] satisfies P(Ω) = 1 and is
additive. Then the following statements are equivalent:
Proof. The notation A_n ↓ A means that A_{n+1} ⊂ A_n for each n and ∩_{n=1}^∞ A_n = A. The notation A_n ↑ A means that A_n ⊂ A_{n+1} for each n and ∪_{n=1}^∞ A_n = A.
(i) ⇒ (v): Let A_n ∈ A with A_n ↑ A. We construct a new sequence as follows: B_1 = A_1 and B_n = A_n \ A_{n−1} for n ≥ 2. Then ∪_{n=1}^∞ B_n = A, A_n = ∪_{i=1}^n B_i, and the events (B_n)_{n≥1} are pairwise disjoint. Therefore
P(A) = P(∪_{k≥1} B_k) = ∑_{k=1}^∞ P(B_k).
Hence
P(A_n) = ∑_{k=1}^n P(B_k) ↑ ∑_{k=1}^∞ P(B_k) = P(A).
Conversely, if (A_k)_{k≥1} are pairwise disjoint events and B_n = ∪_{k=1}^n A_k, then B_n ↑ ∪_{k=1}^∞ A_k and
P(∪_{k=1}^∞ A_k) = lim_{n→∞} P(B_n) = lim_{n→∞} ∑_{k=1}^n P(A_k) = ∑_{k=1}^∞ P(A_k).
We can say that An ∈ A converges to A (we write An → A) if limn→∞ IAn (w) = IA (w) for all
w ∈ Ω. Note that if the sequence An increases (resp. decreases) to A, then it also tends to A in
the above sense.
Theorem 1.3.2. Let P be a probability measure and let An be a sequence of events in A which
converges to A. Then A ∈ A and limn→∞ P(An ) = P(A).
Proof. Since A = ∪_n ∩_{m≥n} A_m, we have A ∈ A. Now let B_n = ∩_{m≥n} A_m and C_n = ∪_{m≥n} A_m. Then B_n increases to A and C_n decreases to A, thus lim_{n→∞} P(B_n) = lim_{n→∞} P(C_n) = P(A) by Theorem 1.3.1. However B_n ⊂ A_n ⊂ C_n, therefore P(B_n) ≤ P(A_n) ≤ P(C_n), so lim_{n→∞} P(A_n) = P(A) as well.
Lemma 1.3.3. Let S be a set. Let I be a π-system on S, that is, a family of subsets of S stable
under finite intersection:
I1 , I2 ∈ I ⇒ I1 ∩ I2 ∈ I.
Let Σ = σ(I). Suppose that µ1 and µ2 are probability measures on (S, Σ) such that µ1 = µ2 on I.
Then µ1 = µ2 on Σ.
Proof. Let
D = {F ∈ Σ : µ1(F) = µ2(F)}.
Then D is a d-system on S, that is:
a) S ∈ D,
b) if A, B ∈ D and A ⊆ B then B \ A ∈ D,
c) if A_n ∈ D and A_n ↑ A, then A ∈ D.
(Each of these properties follows directly from the corresponding property of the probability measures µ1 and µ2.)
Since D is a d-system and D ⊇ I, then D ⊇ σ(I) = Σ, and the result follows.
This lemma implies that if two probability measures agree on a π-system, then they agree on the σ-algebra generated by that π-system.
Now, by definition of λ,
∑_n µ0(F_n) = ∑_n µ0(E ∩ F_n) + ∑_n µ0(E^c ∩ F_n) ≥ λ(E ∩ G) + λ(E^c ∩ G),
Theorem 1.4.1. Let (p_w)_{w∈Ω} be a family of real numbers indexed by the finite or countable set Ω. Then there exists a unique probability P such that P({w}) = p_w if and only if p_w ≥ 0 and ∑_{w∈Ω} p_w = 1. In this case, for any A ⊂ Ω,
P(A) = ∑_{w∈A} p_w.
Suppose first that Ω is finite. Any family of nonnegative terms summing up to 1 gives an
example of a probability on Ω. But among all these examples the following is particularly
important:
Definition 1.4.2. A probability P on the finite set Ω is called uniform if pw = P({w}) does not
depend on w.
In this case p_w = 1/|Ω| for every w, and for any event A,
P(A) = |A| / |Ω|,
so computing the probability of any event A amounts to counting the number of points in A. On a given finite set Ω there is one and only one uniform probability.
Example: An urn contains 20 balls, 10 white and 10 red. One draws a set of 5 balls from the urn. Denote by X the number of white balls in the sample. We want to find the probability that X = x, where x is an arbitrary fixed integer.
We label the white balls from 1 to 10 and the red balls from 11 to 20. Since the balls are drawn at once, it is natural to consider that an outcome is a subset with 5 elements of the set {1, . . . , 20} of all 20 balls. That is, Ω is the family of all subsets with 5 points, and the total
number of possible outcomes is |Ω| = C_{20}^5. Next, it is also natural to consider that all possible outcomes are equally likely, that is, P is the uniform probability on Ω. The quantity X is a "random variable" because when the outcome w is known, one also knows the number X(w). The possible values of X range from 0 to 5, and the set X^{−1}({x}) = {X = x} contains C_{10}^x C_{10}^{5−x} points for all 0 ≤ x ≤ 5. Hence
P(X = x) = C_{10}^x C_{10}^{5−x} / C_{20}^5 if 0 ≤ x ≤ 5, and P(X = x) = 0 otherwise.
We thus obtain, when x varies, the distribution or the law, of X. This distribution is called the
hypergeometric distribution.
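As a quick numerical illustration (a sketch added here, not part of the original text; the function name hypergeom_pmf is ours), the following Python snippet tabulates this hypergeometric law for the urn above:

```python
from math import comb

def hypergeom_pmf(x, n_white=10, n_red=10, draw=5):
    """P(X = x) when `draw` balls are taken at once from an urn with
    n_white white and n_red red balls (uniform probability on 5-element subsets)."""
    if not 0 <= x <= draw:
        return 0.0
    return comb(n_white, x) * comb(n_red, draw - x) / comb(n_white + n_red, draw)

# The probabilities sum to 1; the most likely values here are x = 2 and x = 3.
print([round(hypergeom_pmf(x), 4) for x in range(6)])
print(sum(hypergeom_pmf(x) for x in range(6)))  # ~1.0
```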
1.5 Conditional Probability
Definition 1.5.1. Let A, B be events with P(B) > 0. The conditional probability of A given B is
P(A|B) = P(A ∩ B) / P(B).
Theorem 1.5.2. Suppose P(B) > 0. The operation A 7→ P(A|B) from A → [0, 1] defines a new
probability measure on A, called the conditional probability measure given B.
Proof. We define Q(A) = P(A|B), with B fixed. We must show Q satisfies (1) and (2) of Definition 1.2.4. But
Q(Ω) = P(Ω|B) = P(Ω ∩ B)/P(B) = P(B)/P(B) = 1.
Therefore, Q satisfies (1). As for (2), note that if (A_n)_{n≥1} is a sequence of elements of A which are pairwise disjoint, then
Q(∪_{n=1}^∞ A_n) = P(∪_{n=1}^∞ A_n | B) = P((∪_{n=1}^∞ A_n) ∩ B)/P(B) = P(∪_{n=1}^∞ (A_n ∩ B))/P(B),
and the sequence (A_n ∩ B)_{n≥1} is pairwise disjoint as well; thus
Q(∪_{n=1}^∞ A_n) = ∑_{n=1}^∞ P(A_n ∩ B)/P(B) = ∑_{n=1}^∞ P(A_n|B) = ∑_{n=1}^∞ Q(A_n).
Proof. We use induction. For n = 2, the theorem is simply 1.5.1. Suppose the theorem holds
for n − 1 events. Let B = A1 ∩ . . . ∩ An−1 . Then by 1.5.1 P(B ∩ An ) = P(An |B)P(B); next we
replace P(B) by its value given in the inductive hypothesis:
Definition. A finite or countable family (E_n) of events is called a partition of Ω if:
1. E_n ∈ A for each n;
2. E_n ∩ E_m = ∅ whenever n ≠ m;
3. Ω = ∪_n E_n.
Theorem 1.5.5 (Partition Equation). Let (E_n)_{n≥1} be a finite or countable partition of Ω. Then for any A ∈ A,
P(A) = ∑_n P(A|E_n) P(E_n).
Theorem 1.5.6 (Bayes' Theorem). Let (E_n) be a finite or countable partition of Ω, and suppose P(A) > 0. Then
P(E_n|A) = P(A|E_n) P(E_n) / ∑_m P(A|E_m) P(E_m).
Example 1.5.7. Because a new medical procedure has been shown to be effective in the early
detection of an illness, a medical screening of the population is proposed. The probability
that the test correctly identifies someone with the illness as positive is 0.99, and the probability
that the test correctly identifies someone without the illness as negative is 0.95. The incidence
of the illness in the general population is 0.0001. You take the test, and the result is positive.
What is the probability that you have the illness? Let D denote the event that you have the
illness, and let S denote the event that the test signals positive. The probability requested can
be denoted as P(D|S). The probability that the test correctly signals someone without the illness as negative is 0.95. Consequently, the probability of a positive test for someone without the illness is
P(S|D^c) = 0.05.
By Bayes' Theorem,
P(D|S) = P(S|D)P(D) / (P(S|D)P(D) + P(S|D^c)P(D^c)) ≈ 0.002.
Surprisingly, even though the test is effective, in the sense that P(S|D) is high and P(S|Dc )
is low, because the incidence of the illness in the general population is low, the chances are
quite small that you actually have the disease even if the test is positive.
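For readers who want to check the arithmetic, here is a small Python computation of P(D|S) under the stated numbers (a verification sketch, not part of the original text):

```python
# Prior and test characteristics from the example
p_D = 0.0001         # incidence of the illness
p_S_given_D = 0.99   # P(S | D): test positive given ill
p_S_given_Dc = 0.05  # P(S | D^c): test positive given not ill

# Bayes' Theorem with the partition {D, D^c}
p_S = p_S_given_D * p_D + p_S_given_Dc * (1 - p_D)
p_D_given_S = p_S_given_D * p_D / p_S
print(round(p_D_given_S, 5))  # about 0.00198, i.e. roughly 0.002
```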
Example 1.5.8. Suppose that Bob can decide to go to work by one of three modes of trans-
portation, car, bus, or commuter train. Because of high traffic, if he decides to go by car, there
is a 0.5 chance he will be late. If he goes by bus, which has special reserved lanes but is sometimes overcrowded, the probability of being late is only 0.2. The commuter train is almost
never late, with a probability of only 0.01, but is more expensive than the bus.
(a) Suppose that Bob is late one day, and his boss wishes to estimate the probability that he drove to work that day by car. Since he does not know which mode of transportation Bob usually uses, he gives a prior probability of 1/3 to each of the three possibilities. What is the boss' estimate of the probability that Bob drove to work?
(b) Suppose that a coworker of Bob's knows that he almost always takes the commuter train to work, never takes the bus, but sometimes, 0.1 of the time, takes the car. What is the coworker's probability that Bob drove to work that day, given that he was late?
We have the following information given in the problem:
P(bus) = P(car) = P(train) = 1/3;
P(late|car) = 0.5;
P(late|train) = 0.01;
P(late|bus) = 0.2.
(a) By Bayes' Theorem, the boss' estimate is
P(car|late) = P(late|car)P(car) / [P(late|car)P(car) + P(late|bus)P(bus) + P(late|train)P(train)] = 0.5/0.71 ≈ 0.7042.
(b) Repeat the identical calculation as above, but instead of the prior probabilities being 1/3, we use P(bus) = 0, P(car) = 0.1, and P(train) = 0.9. Plugging these into the same equation, we get P(car|late) = 0.8475.
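A short Python check of both posterior computations (a verification sketch; the helper name posterior_car is ours):

```python
def posterior_car(priors, likelihoods):
    """P(car | late) by Bayes' Theorem; priors and likelihoods are dicts over modes."""
    total = sum(priors[m] * likelihoods[m] for m in priors)
    return priors["car"] * likelihoods["car"] / total

late = {"car": 0.5, "bus": 0.2, "train": 0.01}
print(round(posterior_car({"car": 1/3, "bus": 1/3, "train": 1/3}, late), 4))  # 0.7042 (part a)
print(round(posterior_car({"car": 0.1, "bus": 0.0, "train": 0.9}, late), 4))  # 0.8475 (part b)
```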
1.6 Independence
Definition 1.6.1. 1. Two events A and B are independent if
P(A ∩ B) = P(A)P(B).
2. A (possibly infinite) collection of events (A_i)_{i∈I} is a pairwise independent collection if for any distinct elements i_1, i_2 ∈ I,
P(A_{i_1} ∩ A_{i_2}) = P(A_{i_1}) P(A_{i_2}).
3. A (possibly infinite) collection of events (A_i)_{i∈I} is an independent collection if for every finite subset J of I, one has
P(∩_{i∈J} A_i) = ∏_{i∈J} P(A_i).
If events (Ai )i∈I are independent, they are pairwise independent, but the converse is false.
Proposition 1.6.2. a) If A and B are independent, so also are A and B c ; Ac and B; and Ac and
Bc.
b) If A and B are independent and P(B) > 0, then P(A|B) = P(A).
Proof. a) A and B^c: Since A and B are independent, P(A ∩ B) = P(A)P(B) = P(A)(1 − P(B^c)) = P(A) − P(A)P(B^c). On the other hand, P(A ∩ B) = P(A) − P(A ∩ B^c). Comparing these two expressions for P(A ∩ B), we obtain
P(A ∩ B^c) = P(A)P(B^c).
The independence of A^c and B, and of A^c and B^c, follows in the same way.
b) So
P(A|B) = P(A ∩ B)/P(B) = P(A)P(B)/P(B) = P(A),
and, when P(B^c) > 0,
P(A|B^c) = P(A ∩ B^c)/P(B^c) = P(A)P(B^c)/P(B^c) = P(A).
Examples:
1. Toss a coin 3 times. If Ai is an event depending only on the ith toss, then it is standard
to model (Ai )1≤i≤3 as being independent.
2. One chooses a card at random from a deck of 52 cards. A = ”the card is a heart”, and
B = ”the card is Queen”. A natural model for this experiment consists in prescribing
the probability 1/52 for picking any one of the cards. By additivity, P(A) = 13/52 and P(B) = 4/52 and P(A ∩ B) = 1/52 = P(A)P(B), hence A and B are independent.
3. Let Ω = {1, 2, 3, 4} and A = 2^Ω. Let P({i}) = 1/4 for i = 1, 2, 3, 4. Let A = {1, 2}, B = {1, 3}, and C = {2, 3}. Then A, B, C are pairwise independent but not independent: indeed P(A ∩ B) = P({1}) = 1/4 = P(A)P(B) (and similarly for the other two pairs), while P(A ∩ B ∩ C) = P(∅) = 0 ≠ 1/8 = P(A)P(B)P(C).
Exercises
Axiom of Probability
1.1. Give a possible sample space for each of the following experiments:
2. A student is asked for the month of the year and the day of the week on which her birth-
day falls.
1.4. Let (Gα )α∈I be an arbitrary family of σ-algebras defined on an abstract space Ω. Show that
H = ∩α∈I Gα is also a σ-algebra.
1.5. Suppose that Ω is an infinite set (countable or not), and let A be the family of all subsets
which are either finite or have a finite complement. Show that A is an algebra, but not a σ-
algebra.
1.6. Give a counterexample that shows that, in general, the union A ∪ B of two σ-algebras
need not be a σ-algebra.
1.7. Let Ω = {a, b, c} be a sample space. Let P({a}) = 1/2, P({b}) = 1/3, and P({c}) = 1/6. Find
the probabilities for all eight subsets of Ω.
1.10. Let (B_n) be a sequence of events such that P(B_n) = 1 for all n ≥ 1. Show that
P(∩_n B_n) = 1.
2. P(∪_{i=1}^n A_i) ≤ ∑_{i=1}^n P(A_i) − ∑_{i<j} P(A_i ∩ A_j) + ∑_{i<j<k} P(A_i ∩ A_j ∩ A_k).
1.14. Let (Ω, A, P) be a probability space. Show that for events B_i ⊂ A_i the following inequality holds:
P(∪_i A_i) − P(∪_i B_i) ≤ ∑_i (P(A_i) − P(B_i)).
1.15. If B_1, . . . , B_n are events such that ∑_{k=1}^n P(B_k) > n − 1, then
P(∩_{k=1}^n B_k) > 0.
1. How many different sequences of process and control samples are possible each day?
Assume that the five process samples are considered identical and that the two control
samples are considered identical.
2. How many different sequences of process and control samples are possible if we con-
sider the five process samples to be different and the two control samples to be identical.
3. For the same situation as part (b), how many sequences are possible if the first test of
each day must be a control sample?
2. If the seven components are all identical, how many different designs are possible?
3. If the seven components consist of three of one type of component and four of another
type, how many different designs are possible? (more difficult)
1. How many three-digit phone prefixes that are used to represent a particular geographic
area (such as an area code) can be created from the digits 0 through 9?
2. As in part (a), how many three-digit phone prefixes are possible that do not start with 0
or 1, but contain 0 or 1 as the middle digit?
3. How many three-digit phone prefixes are possible in which no digit appears more than
once in each prefix?
2. If the first bit of a byte is a parity check, that is, the first bit is determined from the other seven bits, how many different bytes are possible?
1.20. A bowl contains 16 chips, of which 6 are red, 7 are white, and 3 are blue. If four chips are
taken at random and without replacement, find the probability that: (a) each of the 4 chips is
red; (b) none of the 4 chips is red; (c) there is at least 1 chip of each color.
1.21. Three distinct integers are chosen at random from the first 20 positive integers. Com-
pute the probability that: (a) their sum is even; (b) their product is even.
1.22. There are 5 red chips and 3 blue chips in a bowl. The red chips are numbered 1, 2, 3, 4, 5,
respectively, and the blue chips are numbered 1, 2, 3, respectively. If 2 chips are to be drawn at
random and without replacement, find the probability that these chips have either the same
number or the same color.
1.23. In a lot of 50 light bulbs, there are 2 bad bulbs. An inspector examines 5 bulbs, which are
selected at random and without replacement. (a) Find the probability of at least 1 defective
bulb among the 5. (b) How many bulbs should be examined so that the probability of finding
at least 1 bad bulb exceeds 0.2 ?
1.24. Three winning tickets are drawn from an urn of 100 tickets. What is the probability of winning for a person who buys:
1. 4 tickets?
1.25. A drawer contains eight different pairs of socks. If six socks are taken at random and
without replacement, compute the probability that there is at least one matching pair among
these six socks.
1. What is the probability that at least two students have the same birthday?
2. What is the minimum value of n which secures probability 1/2 that at least two have a
common birthday?
1.27. Four mice are chosen (without replacement) from a litter containing two white mice.
The probability that both white mice are chosen is twice the probability that neither is chosen.
How many mice are there in the litter?
1.28. Suppose there are N different types of coupons available when buying cereal; each box
contains one coupon and the collector is seeking to collect one of each in order to win a prize.
After buying n boxes, what is the probability pn that the collector has at least one of each type?
(Consider sampling with replacement from a population of N distinct elements. The sample
size is n > N . Use inclusion-exclusion formula)
1.29. An absent-minded person has to put n personal letters in n addressed envelopes, and
he does it at random. What is the probability pm,n that exactly m letters will be put correctly in
their envelopes?
1.30. N men run out of a men’s club after a fire and each takes a coat and a hat. Prove that:
a) the probability that no one will take his own coat and hat is
∑_{k=1}^N (−1)^k (N − k)! / (N! k!);
b) the probability that each man takes a wrong coat and a wrong hat is
[ ∑_{k=2}^N (−1)^k / k! ]^2.
1.31. You throw 6n dice at random. Find the probability that each number appears exactly n
times.
1.32. * Mary tosses n + 1 fair coins and John tosses n fair coins. What is the probability that
Mary gets more heads than John?
Conditional Probability
1.33. Bowl I contains 6 red chips and 4 blue chips. Five of these 10 chips are selected at ran-
dom and without replacement and put in bowl II, which was originally empty. One chip is
then drawn at random from bowl II. Given that this chip is blue, find the conditional proba-
bility that 2 red chips and 3 blue chips are transferred from bowl I to bowl II.
1.34. You enter a chess tournament where your probability of winning a game is 0.3 against
half the players (call them type 1), 0.4 against a quarter of the players (call them type 2), and
0.5 against the remaining quarter of the players (call them type 3). You play a game against a
randomly chosen opponent. What is the probability of winning?
1.35. We roll a fair four-sided die. If the result is 1 or 2, we roll once more but otherwise, we
stop. What is the probability that the sum total of our rolls is at least 4?
1.36. There are three coins in a box. One is a two-headed coin, another is a fair coin, and
the third is a biased coin that comes up heads 75 percent of the time. When one of the three
coins is selected at random and flipped, it shows heads. What is the probability that it was the
two-headed coin?
1.37. Alice is taking a probability class and at the end of each week she can be either up-to-
date or she may have fallen behind. If she is up-to-date in a given week, the probability that
she will be up-to-date (or behind) in the next week is 0.8 (or 0.2, respectively). If she is behind
in a given week, the probability that she will be up-to-date (or behind) in the next week is 0.6
(or 0.4, respectively). Alice is (by default) up-to-date when she starts the class. What is the
probability that she is up-to-date after three weeks?
1.38. At the station there are three payphones which accept 20p pieces. One never works, an-
other always works, while the third works with probability 1/2. On my way to the metropolis
for the day, I wish to identify the reliable phone, so that I can use it on my return. The station
is empty and I have just three 20p pieces. I try one phone and it does not work. I try another
twice in succession and it works both times. What is the probability that this second phone is
the reliable one?
1.39. An insurance company insures an equal number of male and female drivers. In any given
year the probability that a male driver has an accident involving a claim is α, independently
of other years. The analogous probability for females is β. Assume the insurance company
selects a driver at random.
a) What is the probability the selected driver will make a claim this year?
b) What is the probability the selected driver makes a claim in two consecutive years?
c) Let A1 , A2 be the events that a randomly chosen driver makes a claim in each of the first
and second years, respectively. Show that P (A2 |A1 ) ≥ P (A1 ).
1.40. Three newspapers A, B and C are published in a certain city, and a survey shows that for
the adult population 20% read A, 16% B, and 14% C, 8% read both A and B, 5% both A and C,
4% both B and C, and 2% read all three. If an adult is chosen at random, find the probability that
1.41. Customers are used to evaluate preliminary product designs. In the past, 95% of highly
successful products received good reviews, 60% of moderately successful products received
good reviews, and 10% of poor products received good reviews. In addition, 40% of products
have been highly successful, 35% have been moderately successful, and 25% have been poor
products.
2. If a new design attains a good review, what is the probability that it will be a highly suc-
cessful product?
3. If a product does not attain a good review, what is the probability that it will be a highly
successful product?
1.42. An inspector working for a manufacturing company has a 99% chance of correctly iden-
tifying defective items and a 0.5% chance of incorrectly classifying a good item as defective.
The company has evidence that its line produces 0.9% of nonconforming items.
1. What is the probability that an item selected for inspection is classified as defective?
1.43. A new analytical method to detect pollutants in water is being tested. This new method
of chemical analysis is important because, if adopted, it could be used to detect three different
contaminants—organic pollutants, volatile solvents, and chlorinated compounds—instead
of having to use a single test for each pollutant. The makers of the test claim that it can detect
high levels of organic pollutants with 99.7% accuracy, volatile solvents with 99.95% accuracy,
and chlorinated compounds with 89.7% accuracy. If a pollutant is not present, the test does
not signal. Samples are prepared for the calibration of the test and 60% of them are contami-
nated with organic pollutants, 27% with volatile solvents, and 13% with traces of chlorinated
compounds.
A test sample is selected randomly.
2. If the test signals, what is the probability that chlorinated compounds are present?
1.44. Software to detect fraud in consumer phone cards tracks the number of metropolitan
areas where calls originate each day. It is found that 1% of the legitimate users originate calls
from two or more metropolitan areas in a single day. However, 30% of fraudulent users origi-
nate calls from two or more metropolitan areas in a single day. The proportion of fraudulent
users is 0.01%. If the same user originates calls from two or more metropolitan areas in a
single day, what is the probability that the user is fraudulent?
1.45. The probability of getting through by telephone to buy concert tickets is 0.92. For the
same event, the probability of accessing the vendor’s Web site is 0.95. Assume that these two
ways to buy tickets are independent. What is the probability that someone who tries to buy
tickets through the Internet and by phone will obtain tickets?
1.46. The British government has stepped up its information campaign regarding foot and
mouth disease by mailing brochures to farmers around the country. It is estimated that 99%
of Scottish farmers who receive the brochure possess enough information to deal with an out-
break of the disease, but only 90% of those without the brochure can deal with an outbreak.
After the first three months of mailing, 95% of the farmers in Scotland received the informa-
tive brochure. Compute the probability that a randomly selected farmer will have enough
information to deal effectively with an outbreak of the disease.
1.47. In an automated filling operation, the probability of an incorrect fill when the process is
operated at a low speed is 0.001. When the process is operated at a high speed, the probability
of an incorrect fill is 0.01. Assume that 30% of the containers are filled when the process is
operated at a high speed and the remainder are filled when the process is operated at a low
speed.
2. If an incorrectly filled container is found, what is the probability that it was filled during
the high-speed operation?
1.48. An encryption-decryption system consists of three elements: encode, transmit, and de-
code. A faulty encode occurs in 0.5% of the messages processed, transmission errors occur in
1% of the messages, and a decode error occurs in 0.1% of the messages. Assume the errors are
independent.
2. What is the probability of a message that has either an encode or a decode error?
1.49. It is known that two defective copies of a commercial software program were erro-
neously sent to a shipping lot that has now a total of 75 copies of the program. A sample
of copies will be selected from the lot without replacement.
1. If three copies of the software are inspected, determine the probability that exactly one
of the defective copies will be found.
2. If three copies of the software are inspected, determine the probability that both defec-
tive copies will be found.
3. If 73 copies are inspected, determine the probability that both copies will be found.
Hint: Work with the copies that remain in the lot.
1.50. A robotic insertion tool contains 10 primary components. The probability that any com-
ponent fails during the warranty period is 0.01. Assume that the components fail indepen-
dently and that the tool fails if any component fails. What is the probability that the tool fails
during the warranty period?
1.51. A machine tool is idle 15% of the time. You request immediate use of the tool on five
different occasions during the year. Assume that your requests represent independent events.
1. What is the probability that the tool is idle at the time of all of your requests?
2. What is the probability that the machine is idle at the time of exactly four of your re-
quests?
3. What is the probability that the tool is idle at the time of at least three of your requests?
1.52. A lot of 50 spacing washers contains 30 washers that are thicker than the target dimen-
sion. Suppose that three washers are selected at random, without replacement, from the lot.
1. What is the probability that all three washers are thicker than the target?
2. What is the probability that the third washer selected is thicker than the target if the first
two washers selected are thinner than the target?
3. What is the probability that the third washer selected is thicker than the target?
1.53. Continuation of previous exercise. Washers are selected from the lot at random, without
replacement.
1. What is the minimum number of washers that need to be selected so that the probability
that all the washers are thinner than the target is less than 0.10?
2. What is the minimum number of washers that need to be selected so that the probability
that one or more washers are thicker than the target is at least 0.90?
1.54. The alignment between the magnetic tape and head in a magnetic tape storage system
affects the performance of the system. Suppose that 10% of the read operations are degraded
by skewed alignments, 5% by off-center alignments, 1% by both skewness and off-center, and
the remaining read operations are properly aligned. The probability of a read error is 0.01
from a skewed alignment, 0.02 from an off-center alignment, 0.06 from both conditions, and
0.001 from a proper alignment. What is the probability of a read error?
1.55. Suppose that a lot of washers is large enough that it can be assumed that the sampling
is done with replacement. Assume that 60% of the washers exceed the target thickness.
1. What is the minimum number of washers that need to be selected so that the probability
that all the washers are thinner than the target is less than 0.10?
2. What is the minimum number of washers that need to be selected so that the probability
that one or more washers are thicker than the target is at least 0.90?
1.56. In a chemical plant, 24 holding tanks are used for final product storage. Four tanks are
selected at random and without replacement. Suppose that six of the tanks contain material
in which the viscosity exceeds the customer requirements.
1. What is the probability that exactly one tank in the sample contains high viscosity ma-
terial?
2. What is the probability that at least one tank in the sample contains high viscosity ma-
terial?
3. In addition to the six tanks with high viscosity levels, four different tanks contain ma-
terial with high impurities. What is the probability that exactly one tank in the sample
contains high viscosity material and exactly one tank in the sample contains material
with high impurities?
1.57. Plastic parts produced by an injection-molding operation are checked for conformance
to specifications. Each tool contains 12 cavities in which parts are produced, and these parts
fall into a conveyor when the press opens. An inspector chooses 3 parts from among the 12 at
random. Two cavities are affected by a temperature malfunction that results in parts that do
not conform to specifications.
1. What is the probability that the inspector finds exactly one nonconforming part?
2. What is the probability that the inspector finds at least one nonconforming part?
1.58. A bin of 50 parts contains five that are defective. A sample of two is selected at random,
without replacement.
1. Determine the probability that both parts in the sample are defective by computing a
conditional probability.
2. Determine the answer to part (a) by using the subset approach that was described in
this section.
1.59. * The Polya urn model is as follows. We start with an urn which contains one white ball
and one black ball. At each second we choose a ball at random from the urn and replace it
together with one more ball of the same color. Calculate the probability that when n balls are
in the urn, i of them are white.
1.60. You have n urns, the rth of which contains r − 1 red balls and n − r blue balls, r =
1, . . . , n. You pick an urn at random and remove two balls from it without replacement. Find
the probability that the two balls are of different colors. Find the same probability when you
put back a removed ball.
1.61. A coin shows heads with probability p on each toss. Let πn be the probability that the
number of heads after n tosses is even. Show that πn+1 = (1 − p)πn + p(1 − πn ) and find πn .
1.62. There are n similarly biased dice such that the probability of obtaining a 6 with each one
of them is the same and equal to p (0 < p < 1). If all the dice are rolled once, show that pn , the
probability that an odd number of 6’s is obtained, satisfies the difference equation
pn + (2p − 1)pn−1 = p,
1.63. Dubrovsky sits down to a night of gambling with his fellow officers. Each time he stakes
u roubles there is a probability r that he will win and receive back 2u roubles (including his
stake). At the beginning of the night he has 8000 roubles. If ever he has 256000 roubles he will
marry the beautiful Natasha and retire to his estate in the country. Otherwise, he will commit
suicide. He decides to follow one of two courses of action:
(i) to stake 1000 roubles each time until the issue is decided;
Advise him (a) if r = 1/4 and (b) if r = 3/4. What are the chances of a happy ending in each
case if he follows your advice?
Independence
1.64. Let the events A1 , A2 , . . . , An be independent and P (Ai ) = p (i = 1, 2, . . . , n). What is
the probability that:
1.65. Each of four persons fires one shot at a target. Let Ck denote the event that the tar-
get is hit by person k, k = 1, 2, 3, 4. If C1 , C2 , C3 , C4 are independent and if P(C1 ) = P(C2 ) =
0.7, P(C3 ) = 0.9, and P(C4 ) = 0.4, compute the probability that (a) all of them hit the target;
(b) exactly one hits the target; (c) no one hits the target; (d) at least one hits the target.
1.66. The probability of winning on a single toss of the dice is p. A starts, and if he fails, he
passes the dice to B, who then attempts to win on her toss. They continue tossing the dice
back and forth until one of them wins. What are their respective probabilities of winning?
1.67. Two darts players throw alternately at a board and the first to score a bull wins. On each
of their throws player A has probability pA and player B pB of success; the results of different
throws are independent. If A starts, calculate the probability that he/she wins.
1.68. * A fair coin is tossed until either the sequence HHH occurs in which case I win or the
sequence T HH occurs, when you win. What is the probability that you win?
1.69. Let A1 , . . . , An be independent events, with P(Ai ) < 1. Prove that there exists an event B
with P(B) > 0 such that B ∩ Ai = ∅ for 1 ≤ i ≤ n.
1.70. n balls are placed at random into n cells. Find the probability pn that exactly two cells
remain empty.
1.71. An urn contains b black balls and r red balls. One of the balls is drawn at random, but
when it is put back in the urn c additional balls of the same color are put in with it. Now
suppose that we draw another ball. Show that the probability that the first ball drawn was
black given that the second ball drawn was red is b/(b + r + c).
1.72. Suppose every packet of the detergent TIDE contains a coupon bearing one of the letters
of the word TIDE. A customer who has all the letters of the word gets a free packet. All the let-
ters have the same probability of appearing in a packet. Find the probability that a housewife
who buys 8 packets will get:
2.1 Random variables on a countable space
2.1.1 Definitions
Throughout this section we suppose that Ω is countable and A = 2Ω . A random variable
X in this case is defined as a map from Ω into R. A random variable stands for an observation
of the outcome of a random event. Before the random event we may know the range of X but
we do not know its exact value until the random event happens. The distribution of a random variable X is defined by
P^X(A) = P(X^{−1}(A)) = P({w ∈ Ω : X(w) ∈ A}), A ⊂ R.
Since the set Ω is countable, the range of X is also countable. Suppose that X(Ω) = {x1 , x2 , . . .}.
Then the distribution of X is completely determined by the numbers p^X_i = P(X = x_i), i ≥ 1. Indeed, for any A ⊂ R,
P^X(A) = ∑_{x_i ∈ A} P[X = x_i] = ∑_{x_i ∈ A} p^X_i.
The expectation of X is defined by
E[X] = ∑_i x_i p^X_i,
provided this sum makes sense: this is the case when at least one of the following conditions is satisfied:
1. Ω is finite;
2. Ω is countable and the series ∑_i x_i p^X_i converges absolutely;
3. X ≥ 0 always (in this case the above sum, and hence E[X] as well, may take the value +∞).
Remark 1. Since Ω is countable, we denote by p_w the probability that the elementary event w ∈ Ω happens. Then the expectation of X is given by
E[X] = ∑_{w∈Ω} X(w) p_w.
Let L1 denote the space of all random variables with finite expectation defined on (Ω, A, P).
The following facts are straightforward from the definition of expectation.
5. Let ϕ : R → R. Then
E[ϕ(X)] = ∑_i ϕ(x_i) p^X_i = ∑_{w∈Ω} ϕ(X(w)) p_w.
In particular, if E(X^2) < ∞ then X has finite expectation, since
E[|X|] = ∑_i |x_i| p^X_i ≤ (1/2) ∑_i (|x_i|^2 + 1) p^X_i = (1/2)(E(X^2) + 1) < ∞.
The variance of X, denoted DX, is defined by
DX = E[(X − E[X])^2].
Expanding the square and using the linearity of expectation,
DX = E[X^2] − (E[X])^2.
Hence
DX = ∑_i x_i^2 p^X_i − ( ∑_i x_i p^X_i )^2.
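As an illustration (not from the original text), here is a small Python routine computing E[X] and DX directly from a discrete distribution (x_i, p_i); the function name mean_and_variance is ours:

```python
def mean_and_variance(values, probs):
    """E[X] = sum x_i p_i and DX = sum x_i^2 p_i - (sum x_i p_i)^2
    for a discrete random variable with P(X = x_i) = p_i."""
    ex = sum(x * p for x, p in zip(values, probs))
    ex2 = sum(x * x * p for x, p in zip(values, probs))
    return ex, ex2 - ex ** 2

# Example: a fair six-sided die
values, probs = [1, 2, 3, 4, 5, 6], [1/6] * 6
print(mean_and_variance(values, probs))  # (3.5, about 2.9167)
```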
2.1.2 Examples
Poisson distribution
X has the Poisson distribution with parameter λ > 0 if it takes values in {0, 1, 2, . . .} and
P[X = k] = e^{−λ} λ^k / k!, k = 0, 1, 2, . . .
One can check that E[X] = λ and DX = λ.
Bernoulli distribution
X is Bernoulli with parameter p ∈ [0, 1], denoted X ∼ Ber(p), if it takes only two values 0
and 1 and
P[X = 1] = 1 − P[X = 0] = p.
X corresponds to an experiment with only two outcomes, usually called "success" (X = 1) and "failure" (X = 0). The expectation and variance of X are E[X] = p and DX = p(1 − p).
Binomial distribution
X has the Binomial distribution with parameters p ∈ [0, 1] and n ∈ N, denoted X ∼ B(n, p), if X takes on the values {0, 1, . . . , n} and
P[X = k] = C_n^k p^k (1 − p)^{n−k}, k = 0, 1, . . . , n.
One has
E[X] = ∑_{k=0}^n k P[X = k] = ∑_{k=0}^n k C_n^k p^k (1 − p)^{n−k} = np ∑_{k=1}^n C_{n−1}^{k−1} p^{k−1} (1 − p)^{n−k} = np,
and
E[X^2] = ∑_{k=0}^n k^2 P[X = k] = ∑_{k=0}^n k^2 C_n^k p^k (1 − p)^{n−k}
= n(n − 1)p^2 ∑_{k=2}^n C_{n−2}^{k−2} p^{k−2} (1 − p)^{n−k} + np ∑_{k=1}^n C_{n−1}^{k−1} p^{k−1} (1 − p)^{n−k}
= n(n − 1)p^2 + np,
so that DX = E[X^2] − (E[X])^2 = np(1 − p).
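A hedged numerical check (our own addition, not part of the text) that E[X] = np and DX = np(1 − p) agree with direct summation of the Binomial probabilities:

```python
from math import comb

n, p = 12, 0.3
pmf = [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]

mean = sum(k * pmf[k] for k in range(n + 1))
var = sum(k * k * pmf[k] for k in range(n + 1)) - mean**2
print(round(mean, 6), round(n * p, 6))            # both 3.6
print(round(var, 6), round(n * p * (1 - p), 6))   # both 2.52
```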
Geometric distribution
One repeatedly performs independent Bernoulli trials, each with success probability p, until the first success occurs. Let X denote the number of failures before the first success. X has a Geometric distribution with parameter q = 1 − p ∈ [0, 1], denoted X ∼ Geo(q):
P[X = k] = q^k p, k = 0, 1, . . .
One can check that E[X] = q/p and DX = q/p^2.
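Similarly, a short sketch (our own illustration) checking numerically that E[X] = q/p for this failure-counting geometric law, by truncating the series ∑ k q^k p:

```python
p = 0.25
q = 1 - p

# Truncate the series far enough that the tail is negligible
mean = sum(k * q**k * p for k in range(10_000))
print(round(mean, 6), q / p)  # both 3.0
```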
2.2 Random variables on a general probability space
2.2.1 Definition
Let (Ω, A) be a measurable space and B(R) the Borel σ-algebra on R. A map X : Ω → R is called a random variable if X^{−1}(B) ∈ A for every B ∈ B(R). The following two statements are equivalent:
1. X is a random variable;
2. X^{−1}((−∞, a]) = {w : X(w) ≤ a} ∈ A for every a ∈ R.
Proof. Claim (1) ⇒ (2) is self-evident; we will prove (2) ⇒ (1). Let
C = {B ∈ B(R) : X^{−1}(B) ∈ A}.
We have that C is a σ-algebra and it contains all sets of the form (−∞, a] for every a ∈ R. Thus C contains B(R). On the other hand, C ⊂ B(R), so C = B(R). This concludes our proof.
Example 2.2.3. Let (Ω, A) be a measurable space. For each subset B of Ω one can verify that I_B is a random variable iff B ∈ A. More generally, if x_i ∈ R and A_i ∈ A for all i belonging to some countable index set I, then X(w) = ∑_{i∈I} x_i I_{A_i}(w) is also a random variable. We call such a random variable X a discrete random variable. When I is finite, X is called a simple random variable.
Definition 2.2.4. A function ϕ : R^d → R is called Borel measurable if ϕ^{−1}(B) ∈ B(R^d) for all B ∈ B(R).
Remark 3. It follows from the above definition that every continuous function is Borel. Con-
sequently, all the functions (x, y) 7→ x+y, (x, y) 7→ xy, (x, y) 7→ x/y, (x, y) 7→ x∨y, (x, y) 7→ x∧y
are Borel, where x ∨ y = max(x, y), x ∧ y = min(x, y).
Theorem 2.2.5. Let X1 , . . . , Xd be random variables defined on a measurable space (Ω, A) and
ϕ : Rd → R a Borel function. Then Y = ϕ(X1 , . . . , Xd ) is also a random variable.
Proof. Let X(w) = (X_1(w), . . . , X_d(w)); then X is a map from (Ω, A) into R^d. For every a_1, . . . , a_d ∈ R we have
X^{−1}( ∏_{i=1}^d (−∞, a_i] ) = ∩_{i=1}^d {w : X_i(w) ≤ a_i} ∈ A.
This implies X^{−1}(B) ∈ A for every B ∈ B(R^d). Hence, for every C ∈ B(R), B := ϕ^{−1}(C) ∈ B(R^d). Thus,
Y^{−1}(C) = X^{−1}(ϕ^{−1}(C)) ∈ A,
i.e. Y is a random variable.
Corollary 2.2.6. If X and Y are random variables, so also are X ± Y, XY, X ∧ Y, X ∨ Y, |X|, X^+ := X ∨ 0, X^− := (−X) ∨ 0 and X/Y (if Y ≠ 0).
Theorem 2.2.7. If X1 , X2 , . . . are random variables then so are supn Xn , inf n Xn , lim supn Xn ,
lim inf n Xn
It follows from Theorem 2.2.7 that if the sequence of random variables (Xn )n≥1 point-wise
converges to X, i.e. Xn (w) → X(w) for all w ∈ Ω, then X is a random variable.
Theorem 2.2.8. Let X be a random variable defined on (Ω, A).
1. There exists a sequence of discrete random variables (X_n)_{n≥1} which converges uniformly to X.
2. If X is non-negative then there exists a sequence of simple random variables (Y_n)_{n≥1} such that Y_n ↑ X.
Proof. 1. For each n ≥ 1, denote X_n(w) = k/n if k/n ≤ X(w) < (k + 1)/n for some k ∈ Z. Then X_n is a discrete random variable and |X_n(w) − X(w)| ≤ 1/n for every w ∈ Ω. Hence the sequence (X_n) converges uniformly in w to X.
2. Suppose that X ≥ 0. For each n ≥ 1, denote Y_n(w) = k/2^n if k/2^n ≤ X(w) < (k + 1)/2^n for some k ∈ {0, 1, . . . , 2^{2n} − 1}, and Y_n(w) = 2^n if X(w) ≥ 2^n. One can easily verify that the sequence of simple random variables (Y_n) satisfies Y_n(w) ↑ X(w) for all w ∈ Ω.
Definition 2.2.9. Let X be a random variable defined on a measurable space (Ω, A). The σ-algebra generated by X is σ(X) := {X^{−1}(B) : B ∈ B(R)}; it is the smallest σ-algebra on Ω with respect to which X is measurable.
If Y is a discrete σ(X)-measurable random variable, writing Y = ∑_i y_i I_{X^{−1}(B_i)} with pairwise disjoint B_i ∈ B(R) and setting ϕ = ∑_i y_i I_{B_i}, we have Y = ϕ(X).
In the general case, by Theorem 2.2.8, there exists a sequence of discrete σ(X)-measurable random variables Y_n which converges uniformly to Y. Thus, there exist Borel functions ϕ_n such that Y_n = ϕ_n(X). Denote
B = {x ∈ R : lim_n ϕ_n(x) exists}.
Clearly, B ∈ B(R) and B ⊃ X(Ω). Let ϕ(x) = lim_n ϕ_n(x) · I_B(x). We have Y = lim_n Y_n = lim_n ϕ_n(X) = ϕ(X).
2.3 Distribution functions
2.3.1 Definition
Definition 2.3.1. Let X be a real-valued random variable. The distribution function of X is the function F_X : R → [0, 1] defined by F_X(x) = P[X < x], x ∈ R.
One can check that F_X is non-decreasing and left continuous, with lim_{x→−∞} F_X(x) = 0 and lim_{x→+∞} F_X(x) = 1. Conversely, for any function F : R → [0, 1] satisfying these three conditions there exists a (unique) probability measure µ on (R, B(R)) such that F(x) = µ((−∞, x)) for all x ∈ R (see [13], Section 2.5.2).
If X and Y have the same distribution function, we say that X and Y are equal in distribution and write X =^d Y.
2.3.2 Examples
Uniform distribution U[a, b]
The distribution with density
f(x) = 1/(b − a) if a ≤ x ≤ b, and f(x) = 0 otherwise,
is called the Uniform distribution on [a, b] and denoted by U[a, b]. The distribution function corresponding to f is
F(x) = 0 if x < a; F(x) = (x − a)/(b − a) if a ≤ x ≤ b; F(x) = 1 if x > b.
Normal distribution N(a, σ^2)
The distribution with density
f(x) = (1/√(2πσ^2)) e^{−(x−a)^2/(2σ^2)}, x ∈ R,
is called the Normal distribution with mean a and variance σ^2 and denoted by N(a, σ^2). When a = 0 and σ^2 = 1, N(0, 1) is called the Standard normal distribution.
Gamma distribution G(α, λ)
The distribution with density
f_X(x) = x^{α−1} e^{−x/λ} / (Γ(α) λ^α) · I_{(0,∞)}(x)
is called the Gamma distribution with parameters α, λ (α, λ > 0); Γ denotes the gamma function Γ(α) = ∫_0^∞ x^{α−1} e^{−x} dx. In particular, an Exp(λ) distribution is a G(1, λ) distribution. The gamma distribution is frequently used as a probability model for waiting times; for instance, in life testing, the waiting time until "death" is a random variable which is frequently modeled with a gamma distribution.
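As an illustration (not part of the original text; the function name gamma_pdf is ours), this density in the parametrization above can be evaluated with Python's standard library, and for α = 1 it reduces to the Exp(λ) density (1/λ)e^{−x/λ}:

```python
from math import gamma, exp

def gamma_pdf(x, alpha, lam):
    """Density of G(alpha, lam): f(x) = x^(alpha-1) e^(-x/lam) / (Gamma(alpha) lam^alpha) for x > 0."""
    if x <= 0:
        return 0.0
    return x**(alpha - 1) * exp(-x / lam) / (gamma(alpha) * lam**alpha)

# For alpha = 1 this is the Exp(lam) density (1/lam) * exp(-x/lam)
x, lam = 2.0, 1.5
print(gamma_pdf(x, 1.0, lam), exp(-x / lam) / lam)  # equal values
```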
2.4 Expectation
Denote by L_s = L_s(Ω, A, P) the set of simple random variables; recall that a simple random variable X can be written in the form
X = ∑_{i=1}^n a_i I_{A_i},   (2.2)
where a_i ∈ R and the sets A_i ∈ A form a measurable partition of Ω, and its expectation is
E[X] = ∑_{i=1}^n a_i P(A_i)
(Definition 2.4.1). It should be noted that a simple random variable has of course many different representations of the form (2.2); however, E[X] does not depend on the particular representation chosen for X.
Let X and Y be in L_s. We can write
X = ∑_{i=1}^n a_i I_{A_i}  and  Y = ∑_{i=1}^n b_i I_{A_i}
for some sets A_i which form a measurable partition of Ω. Then for any α, β ∈ R, αX + βY is also in L_s and
αX + βY = ∑_{i=1}^n (α a_i + β b_i) I_{A_i},
so that E[αX + βY] = α E[X] + β E[Y]: the expectation is linear and positive on L_s.
For a non-negative random variable X, we define
E[X] = sup{ E[Y] : Y ∈ L_s, 0 ≤ Y ≤ X }.   (2.3)
This supremum always exists in [0, ∞]. It follows from the positivity of the expectation operator that the definition above for E[X] coincides with Definition 2.4.1 on L_s. Note that E[X] ≥ 0, but it may happen that E[X] = +∞ even when X is never equal to +∞.
Definition 2.4.2. 1. A random variable X is called integrable if E[|X|] < ∞. In this case, its expectation is defined to be
E[X] = E[X^+] − E[X^−].   (2.4)
2. If E[X^+] and E[X^−] are not both equal to +∞, then the expectation of X is still defined and given by (2.4), where we use the conventions +∞ + a = +∞ and −∞ + a = −∞ for any a ∈ R.
Lemma 2.4.3. Let X be a non-negative random variable and (Xn )n≥1 a sequence of simple
random variables increasing to X. Then E[Xn ] ↑ E[X] (even if E[X] = ∞).
Proof. The sequence (EX_n)_{n≥1} is increasing and bounded above by EX by Definition (2.3), so (EX_n)_{n≥1} converges to some a with a ≤ EX. To prove a = EX, it suffices to show that for every simple random variable Y satisfying 0 ≤ Y ≤ X, we have EY ≤ a.
Indeed, suppose Y takes m different values y_1, . . . , y_m. Let A_k = {w : Y(w) = y_k}. For each ε ∈ (0, 1], consider the sequence Y_{n,ε} = (1 − ε) Y I_{[(1−ε)Y ≤ X_n]}. Then Y_{n,ε} is a simple random variable and Y_{n,ε} ≤ X_n, so
E Y_{n,ε} ≤ E X_n ≤ a for every n.   (2.5)
On the other hand, Y ≤ lim_n X_n, so for every w ∈ Ω there exists n = n(w) such that (1 − ε)Y(w) ≤ X_n(w), i.e. A_k ∩ {w : (1 − ε)Y(w) ≤ X_n(w)} → A_k as n → ∞. We have
E Y_{n,ε} = (1 − ε) ∑_{k=1}^m y_k P(A_k ∩ [(1 − ε)Y ≤ X_n]) → (1 − ε) ∑_{k=1}^m y_k P(A_k) = (1 − ε) EY, as n → ∞.
Combined with (2.5), we have (1 − ε) EY ≤ a for every ε ∈ (0, 1], i.e. EY ≤ a.
Proof. Statement 2 follows directly from (2.3). To prove Statement 1, we first remark that if X and Y are two non-negative random variables and α, β ≥ 0, then by Theorem 2.2.8 there exist two increasing non-negative sequences (X_n) and (Y_n) in L_s converging to X and Y respectively. Hence αX_n + βY_n are also simple non-negative random variables, and they converge to αX + βY. Applying the linearity and positivity of the expectation operator on L_s together with Lemma 2.4.3, we obtain E(αX + βY) = αEX + βEY.
Now we complete the proof of Theorem 2.4.4 for X, Y ∈ L^1. Since |αX + βY| ≤ |α||X| + |β||Y|, we have αX + βY ∈ L^1. If α > 0, then (αX)^+ = αX^+ and (αX)^− = αX^−, so E(αX) = αE(X^+) − αE(X^−) = αE(X); the case α < 0 is treated similarly. Thus
E(αX) = αE(X) for every α ∈ R.   (2.6)
On the other hand, let Z = X + Y. We have Z^+ − Z^− = X + Y = X^+ + Y^+ − (X^− + Y^−), so Z^+ + X^− + Y^− = Z^− + X^+ + Y^+. Thus E(Z^+) + E(X^−) + E(Y^−) = E(Z^−) + E(X^+) + E(Y^+), and then
EZ = E(Z^+) − E(Z^−) = E(X^+) + E(Y^+) − E(X^−) − E(Y^−) = EX + EY.
Combined with (2.6), we obtain E(αX + βY) = αEX + βEY.
An event A happens almost surely if P(A) = 1. Thus we say X equals Y almost surely if
P[X = Y ] = 1 and denote X = Y a.s.
Proof. 2) Let A = {w : X(w) = ∞}. For every n, we have X(w) ≥ X(w)I_A(w) ≥ nI_A(w), so E(X) ≥ nP(A) for every n. Thus P(A) ≤ E(X)/n → 0 as n → ∞. From this, we have P(A) = 0.
3) Let A_n = {w : |X(w)| ≥ 1/n}. The sequence (A_n)_{n≥1} is increasing and P(X ≠ 0) = lim_{n→∞} P(A_n). Moreover,
(1/n) I_{A_n}(w) ≤ |X(w)| I_{A_n}(w) ≤ |X(w)|,
so P(A_n) ≤ n E|X| = 0 for every n. Thus P(X ≠ 0) = 0, i.e. X = 0 a.s.
Theorem 2.4.6. Let X and Y be integrable random variables. If X = Y a.s. then E[X] = E[Y ].
Proof. Firstly, we consider the case where X and Y are non-negative. Let A = {w : X(w) ≠ Y(w)}. We have P(A) = 0.
Suppose (Y_n) is a sequence of simple random variables increasing to Y. Then (Y_n I_A) is also a sequence of simple random variables increasing to Y I_A. Suppose that for each n ≥ 1 the random variable Y_n is bounded by N_n; then
Proof. For each n, let (Y_{n,k})_{k≥1} be a sequence of simple random variables increasing to X_n, and let Z_k = max_{n≤k} Y_{n,k}. Then (Z_k)_{k≥1} is an increasing sequence of simple non-negative random variables, and thus there exists Z = lim_{k→∞} Z_k. Also
Y_{n,k} ≤ Z_k ≤ X for all n ≤ k.
Since the left and right sides have the same limit, X = Z a.s., and we deduce the result.
Theorem 2.4.8 (Fatou's lemma). Suppose the random variables X_n satisfy X_n ≥ Y a.s. for all n and some Y ∈ L^1. Then
E[lim inf_{n→∞} X_n] ≤ lim inf_{n→∞} E[X_n].
Proof. Firstly we prove the theorem in the case Y = 0. Let Y_n = inf_{k≥n} X_k. Then (Y_n) is a non-decreasing sequence of non-negative random variables and Y_n ↑ lim inf_{n→∞} X_n. Since X_n ≥ Y_n, we have EX_n ≥ EY_n. Applying the monotone convergence theorem to the sequence (Y_n), we obtain
E[lim inf_{n→∞} X_n] = lim_{n→∞} E[Y_n] ≤ lim inf_{n→∞} E[X_n].
The general case follows from applying the above result to the sequence of non-negative random variables X̂_n := X_n − Y.
Proof. Since |X| ≤ Y, we have X ∈ L^1. Let Z_n = |X_n − X|. Since Z_n ≥ 0 and −Z_n ≥ −2Y, applying Fatou's lemma to Z_n and to −Z_n, we obtain
0 = E(lim inf_{n→∞} Z_n) ≤ lim inf_{n→∞} EZ_n ≤ lim sup_{n→∞} EZ_n = −lim inf_{n→∞} E(−Z_n) ≤ −E(lim inf_{n→∞} (−Z_n)) = 0.
Hence E|X_n − X| → 0, and in particular E[X_n] → E[X].
Proof. If E(X^2)E(Y^2) = 0 then XY = 0 a.s., thus |E(XY)|^2 = 0 = E(X^2)E(Y^2).
If E(X^2)E(Y^2) ≠ 0, applying the inequality 2|ab| ≤ a^2 + b^2 with a = X/√(E(X^2)) and b = Y/√(E(Y^2)), and then taking expectations of both sides, we obtain
2 E[ |XY| / √(E(X^2)E(Y^2)) ] ≤ E(X^2)/E(X^2) + E(Y^2)/E(Y^2) = 2,
so |E(XY)| ≤ E|XY| ≤ √(E(X^2)E(Y^2)).
If X ∈ L^2, we denote
DX = E[(X − EX)^2].
DX is called the variance of X. Using the linearity of the expectation operator, one can verify that DX = E(X^2) − (EX)^2.
Theorem 2.4.11. 1. (Markov's inequality) Suppose X ∈ L^1, then for any a > 0, it holds
P(|X| ≥ a) ≤ E(|X|)/a.
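A quick Monte Carlo sanity check of Markov's inequality (an illustrative sketch added here, not from the text), using exponential samples with E|X| = 1:

```python
import random

random.seed(0)
n = 100_000
xs = [random.expovariate(1.0) for _ in range(n)]  # Exp(1) samples, so E|X| = 1

a = 3.0
empirical = sum(x >= a for x in xs) / n
print(empirical, "<=", 1.0 / a)  # P(X >= 3) is about e^{-3} ~ 0.05, the Markov bound is 1/3
```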
Theorem 2.4.12. Let X be a random variable with density f and let h : R → R be a Borel function which is either non-negative or satisfies ∫_R |h(x)| f(x) dx < ∞. Then
E[h(X)] = ∫_R h(x) f(x) dx.
Proof. Firstly, we consider the case h ≥ 0. Then there exists a sequence of simple non-negative Borel functions (h_n) increasing to h. Suppose h_n = ∑_{i=1}^{k_n} a_i^n I_{A_i^n} with a_i^n ∈ R^+ and A_i^n ∈ B(R); then E(h_n(X)) = ∑_i a_i^n P(X ∈ A_i^n) = ∫_R h_n(x) f(x) dx, and letting n → ∞ (monotone convergence) we obtain E(h(X)) = ∫_R h(x) f(x) dx when h is non-negative.
In the general case, applying the above result to h^+ and h^− completes the proof.
Example 2.4.13. Let X ∼ Exp(1). Applying Theorem 2.4.12 with h(x) = x and h(x) = x^2 respectively, we have
EX = ∫_0^∞ x e^{−x} dx = 1,
and
E(X^2) = ∫_0^∞ x^2 e^{−x} dx = 2.
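A numerical confirmation of these two integrals (a Monte Carlo sketch added here, using only the standard library):

```python
import random

random.seed(1)
n = 200_000
xs = [random.expovariate(1.0) for _ in range(n)]  # Exp(1) samples

print(sum(xs) / n)                 # close to E[X] = 1
print(sum(x * x for x in xs) / n)  # close to E[X^2] = 2, so DX = 2 - 1 = 1
```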
2.5 Random elements
2.5.1 Definitions
Definition 2.5.1. Let (E, E) be a measurable space. A function X : Ω → E is called A/E-measurable, or a random element, if X^{−1}(B) ∈ A for all B ∈ E. The function
P^X(B) = P(X^{−1}(B)), B ∈ E,
is called the probability distribution of X on (E, E).
When (E, E) = (Rd , B(Rd )), we call X a random vector.
Let X = (X1 , . . . , Xd ) be a d-dimensional random vector defined on (Ω, A, P). The distri-
bution function of X is defined by
F (x) = P[X < x] = P[X1 < x1 , . . . , Xd < xd ], x ∈ Rd .
We can easily verify that F satisfies the following properties:
1. 0 ≤ F (x) ≤ 1 for all x ∈ Rd .
4. F is left continuous.
The random vector X has a density f : R^d → R^+ if f is a non-negative Borel measurable function satisfying
F(x) = ∫_{u<x} f(u) du, for any x ∈ R^d.
This equation implies that
P[X ∈ B] = ∫_B f(x) dx, for all B ∈ B(R^d).
In particular, we have
P[X_1 ∈ B_1] = P[X ∈ B_1 × R^{d−1}] = ∫_{B_1} ( ∫_{R^{d−1}} f(x_1, . . . , x_d) dx_2 . . . dx_d ) dx_1 for all B_1 ∈ B(R).
This implies that if X = (X_1, . . . , X_d) has a density f, then X_1 also has a density, given by
f_{X_1}(x_1) = ∫_{R^{d−1}} f(x_1, x_2, . . . , x_d) dx_2 . . . dx_d, for all x_1 ∈ R.   (2.8)
A similar argument to the proof of Theorem 2.4.12 yields:
Theorem 2.5.2. Let X = (X_1, . . . , X_d) be a random vector with density function f, and let ϕ : R^d → R be a Borel measurable function. Then
E[ϕ(X)] = ∫_{R^d} ϕ(x) f(x) dx,
provided that ϕ is non-negative or ∫_{R^d} |ϕ(x)| f(x) dx < ∞.
2.5.2 Example
Multivariate normal distribution
Multinomial distribution
P[X_1 = k_1, . . . , X_d = k_d] = ( n! / (k_1! k_2! · · · k_{d+1}!) ) p_1^{k_1} p_2^{k_2} · · · p_{d+1}^{k_{d+1}},
2.6 Independent random variables
Definition 2.6.1. 1. The sub-σ-algebras (G_i)_{i∈I} of A are independent if for every finite subset J of I and every choice of events A_i ∈ G_i, i ∈ J, one has P(∩_{i∈J} A_i) = ∏_{i∈J} P(A_i).
2. The (E_i, E_i)-valued random variables (X_i)_{i∈I} are independent if the σ-algebras (X_i^{−1}(E_i))_{i∈I} are independent.
Recall that a class D of subsets of Ω is a λ-system (or d-system) if:
• Ω ∈ D;
• A, B ∈ D with A ⊆ B implies B \ A ∈ D;
• A_1, A_2, . . . ∈ D with A_n ↑ A implies A ∈ D.
Lemma 2.6.2 (Monotone classes). Let C, D be classes of subsets of Ω where C is a π-system and
D is a λ-system such that C ⊂ D. Then σ(C) ⊂ D.
Lemma 2.6.3. Let G and F be sub-σ-algebras of A. Let G1 and F1 be π-systems such that σ(G1 ) =
G and σ(F_1) = F. Then G is independent of F if F_1 and G_1 are independent, i.e., if
P(F ∩ G) = P(F)P(G) for all F ∈ F_1, G ∈ G_1.
Proof. Suppose that F_1 and G_1 are independent. We fix any F ∈ F_1 and define
D_F = {G ∈ G : P(F ∩ G) = P(F)P(G)}.
Then D_F is a λ-system containing the π-system G_1, so D_F = G; that is, F is independent of every G ∈ G. Fixing now any G ∈ G, the class {F ∈ F : P(F ∩ G) = P(F)P(G)} is a λ-system containing the π-system F_1, so it equals F, which yields the desired property.
Theorem 2.6.4. Let X and Y be two random variables. The following statements are equivalent:
(i) X is independent of Y;
(ii) P(X < x, Y < y) = P(X < x) P(Y < y) for all x, y ∈ R;
(iii) f(X) and g(Y) are independent for any Borel functions f, g : R → R;
(iv) E[f(X)g(Y)] = E[f(X)]E[g(Y)] for any Borel functions f, g : R → R which are either positive or bounded.
Proof. (i) ⇒ (ii): Suppose X is independent of Y; then the two events {w : X(w) < x} and {w : Y(w) < y} are independent for every x, y ∈ R. This gives (ii).
(ii) ⇒ (i): The sets {w : X(w) < x}, x ∈ R, form a π-system generating σ(X), and the sets {w : Y(w) < y}, y ∈ R, form a π-system generating σ(Y); so applying Lemma 2.6.3 we obtain that X is independent of Y.
(i) ⇒ (iii): For every A, B ∈ B(R), we have f^{−1}(A), g^{−1}(B) ∈ B(R), then
It remains to show that
E(XY) = E(X)E(Y) for all independent random variables X and Y which are either integrable or non-negative.
Firstly, we suppose that X and Y are non-negative. By Theorem 2.2.8 there exist simple random variables X_n = ∑_{i=1}^{k_n} a_i I_{A_i} increasing to X and Y_n = ∑_{j=1}^{l_n} b_j I_{B_j} increasing to Y, with A_i ∈ σ(X) and B_j ∈ σ(Y). Applying the monotone convergence theorem, we have
E(XY) = lim_{n→∞} E(X_n Y_n) = lim_{n→∞} ∑_{i=1}^{k_n} ∑_{j=1}^{l_n} a_i b_j P(A_i B_j) = lim_{n→∞} ∑_{i=1}^{k_n} ∑_{j=1}^{l_n} a_i b_j P(A_i) P(B_j)
= lim_{n→∞} ( ∑_{i=1}^{k_n} a_i P(A_i) ) ( ∑_{j=1}^{l_n} b_j P(B_j) ) = lim_{n→∞} E(X_n) E(Y_n) = E(X) E(Y).
2.7 Covariance
Definition 2.7.1. The covariance of random variables X, Y ∈ L^2 is defined by
cov(X, Y) = E[(X − EX)(Y − EY)] = E(XY) − E(X)E(Y).
Example 2.7.2. Let X and Y be independent random variables whose distributions are N(0, 1). Denote Z = XY and T = X − Y. We have
cov(Z, T) = E(XY(X − Y)) − E(XY)E(X − Y) = E(X^2 Y) − E(XY^2) = 0,
and
cov(Z, T^2) = E(XY(X − Y)^2) − E(XY)E((X − Y)^2) = −2,
since E(XY) = EX·EY = 0, E(X^3 Y) = E(X^3)EY = 0, E(XY^3) = EX·E(Y^3) = 0 and E(X^2 Y^2) = E(X^2)E(Y^2) = 1. Thus Z and T are uncorrelated random variables but not independent (since cov(Z, T^2) ≠ 0).
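A small simulation of Example 2.7.2 (our own illustration, not from the text): the sample covariance of Z and T is near 0 while that of Z and T^2 is near −2:

```python
import random

random.seed(2)
n = 200_000
xs = [random.gauss(0, 1) for _ in range(n)]
ys = [random.gauss(0, 1) for _ in range(n)]

z = [x * y for x, y in zip(xs, ys)]      # Z = XY
t = [x - y for x, y in zip(xs, ys)]      # T = X - Y

def cov(u, v):
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / len(u)

print(cov(z, t))                          # close to 0 (uncorrelated)
print(cov(z, [ti * ti for ti in t]))      # close to -2 (hence not independent)
```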
Proposition 2.7.3. Let (X_n)_{n≥1} be a sequence of pairwise uncorrelated random variables. Then
D(X_1 + · · · + X_n) = ∑_{i=1}^n D(X_i).
Proof. We have
$$D(X_1+\dots+X_n) = E\Big[\big((X_1-EX_1)+\dots+(X_n-EX_n)\big)^2\Big]$$
$$= \sum_{i=1}^n E[(X_i-EX_i)^2] + 2\sum_{1\le i<j\le n}E[(X_i-EX_i)(X_j-EX_j)]$$
$$= \sum_{i=1}^n E[(X_i-EX_i)^2] = \sum_{i=1}^n D(X_i),$$
since $E[(X_i-EX_i)(X_j-EX_j)] = E(X_iX_j)-E(X_i)E(X_j) = 0$ for any $1\le i<j\le n$.
2.8.1 Definition
Definition 2.8.1. Let (Ω, A, P) be a probability space and X an integrable random variable.
Let G be a sub-σ-algebra of A. Then there exists a random variable Y such that
1. Y is G-measurable,
2. E[|Y|] < ∞,
3. $E[YI_A] = E[XI_A]$ for every $A\in\mathcal{G}$.
Moreover, if Z is another random variable with these properties then P[Z = Y ] = 1. Y is called
a version of the conditional expectation E[X|G] of X given G, and we write Y = E[X|G], a.s.
2.8.2 Examples
Example 2.8.2. Let X be an integrable random variable and G = σ(A1 , . . . , Am ) where (Ai )1≤i≤m
is a measurable partition of Ω. Suppose that P(Ai ) > 0 for all i = 1, . . . , m. Then
$$E(X|\mathcal{G}) = \sum_{i=1}^m \Big(\frac{1}{P(A_i)}\int_{A_i}X\,dP\Big)I_{A_i}.$$
Example 2.8.3. Let X and Z be random variables whose joint density is $f_{X,Z}(x,z)$. We know that $f_Z(z)=\int_{\mathbb{R}}f_{X,Z}(x,z)\,dx$ is the density of Z. Define the elementary conditional density $f_{X|Z}$ of X given Z by
$$f_{X|Z}(x|z) := \begin{cases}\dfrac{f_{X,Z}(x,z)}{f_Z(z)} & \text{if } f_Z(z)\neq 0,\\[4pt] 0 & \text{otherwise.}\end{cases}$$
Let h be a Borel function on $\mathbb{R}$ such that $E[|h(X)|]<\infty$. Set
$$g(z) = \int_{\mathbb{R}} h(x)f_{X|Z}(x|z)\,dx.$$
5. E(ξ|F) = ξ a.s.
6. E(E(ξ|G)) = E(ξ).
5. Statement 5 is evident.
6. Applying Definition 2.8.1 with $A=\Omega$, we have
$$\int_\Omega E(\xi|\mathcal{G})\,dP = \int_\Omega \xi\,dP \;\Rightarrow\; E(E(\xi|\mathcal{G})) = E\xi.$$
7. If $A\in\mathcal{G}_1$, we have
$$\int_A E[E(\xi|\mathcal{G}_2)|\mathcal{G}_1]\,dP = \int_A E(\xi|\mathcal{G}_2)\,dP = \int_A \xi\,dP.$$
From this and Definition 2.8.1, the first equation is proven. The second one follows from Statement 5 and the remark that $E(\xi|\mathcal{G}_1)$ is $\mathcal{G}_2$-measurable.
8. If $A\in\mathcal{G}$, then $\xi$ and $I_A$ are independent. Hence
$$\int_A \xi\,dP = E(\xi I_A) = E\xi\cdot P(A) = \int_A (E\xi)\,dP \;\Rightarrow\; E(\xi|\mathcal{G}) = E(\xi) \text{ a.s.}$$
which proves the desired relation for indicators, and hence for simple random variables. Next, if $\{\eta_n, n\ge1\}$ are simple random variables such that $\eta_n\uparrow\eta$ almost surely as $n\to\infty$, it follows that $\eta_n\xi\uparrow\eta\xi$ and $\eta_nE(\xi|\mathcal{G})\uparrow\eta E(\xi|\mathcal{G})$ almost surely as $n\to\infty$, from which the conclusion follows by monotone convergence. The general case follows from the decomposition $\xi=\xi^+-\xi^-$ and $\eta=\eta^+-\eta^-$.
Theorem 2.8.5 (Monotone convergence theorem). a) Suppose that $\xi_n\uparrow\xi$ a.s. and there exists a positive integer m such that $E(\xi_m^-)<\infty$. Then $E(\xi_n|\mathcal{G})\uparrow E(\xi|\mathcal{G})$ a.s.
b) Suppose that $\xi_n\downarrow\xi$ a.s. and there exists a positive integer m such that $E(\xi_m^+)<\infty$. Then $E(\xi_n|\mathcal{G})\downarrow E(\xi|\mathcal{G})$ a.s.
In both cases $\lim_n E(\xi_n|\mathcal{G}) = E(\xi|\mathcal{G})$ a.s.; claim (b) is proved similarly to claim (a).
$$E\big(\liminf_n \xi_n\,\big|\,\mathcal{G}\big) \le \liminf_n E(\xi_n|\mathcal{G}) \le \limsup_n E(\xi_n|\mathcal{G}) \le E\big(\limsup_n \xi_n\,\big|\,\mathcal{G}\big), \quad \text{a.s.}$$
Theorem 2.8.7 (Lebesgue's dominated convergence theorem). Suppose that $E(\eta)<\infty$, $|\xi_n|\le\eta$ a.s., and $\xi_n\to\xi$ a.s. Then
$$\lim_n E(\xi_n|\mathcal{G}) = E(\xi|\mathcal{G}) \quad \text{a.s.}$$
The proofs of Fatou's lemma and Lebesgue's dominated convergence theorem are analogous to the proofs of the corresponding results without conditioning.
Theorem 2.8.8 (Jensen’s inequality). Let ϕ : R → R be a convex function such that ϕ(ξ) is
integrable. Then
E(ϕ(ξ)|G) ≥ ϕ(E(ξ|G)), a.s.
Proof. A result in real analysis states that if $\varphi:\mathbb{R}\to\mathbb{R}$ is convex, then $\varphi(x)=\sup_n(a_nx+b_n)$ for a countable collection of real numbers $(a_n,b_n)$. Then $\varphi(\xi)\ge a_n\xi+b_n$ for every n. But then $E(a_n\xi+b_n|\mathcal{G})\le E(\varphi(\xi)|\mathcal{G})$, hence $a_nE(\xi|\mathcal{G})+b_n\le E(\varphi(\xi)|\mathcal{G})$ for every n. Taking the supremum over n, we get the result.
In particular, if $\varphi(x)=x^2$ then $E(\xi^2|\mathcal{G}) \ge \big(E(\xi|\mathcal{G})\big)^2$.
b) Let ϕ : R → R be a Borel function such that both ξ and ξϕ(η) are integrable. Then, the
equation
E(ξϕ(η)|η = y) = ϕ(y)E(ξ|η = y)
holds Pη -a.s.
c) If ξ and η are independent, then
E(ξ|η = y) = E(ξ).
2.9 Exercises
2.2. An urn contains 7 white balls numbered 1, 2, ..., 7 and 3 black balls numbered 8, 9, 10. Five
balls are randomly selected, (a) with replacement, (b) without replacement. For each of cases
(a) and (b) give the distribution:
2.3. A machine normally makes items of which 4% are defective. Every hour the producer
draws a sample of size 10 for inspection. If the sample contains no defective items he does
not stop the machine. What is the probability that the machine will not be stopped when it has started producing items of which 10% are defective?
2.4. Let X represent the difference between the number of heads and the number of tails
obtained when a fair coin is tossed n times. What are the possible values of X? Calculate
expectation and variance of X.
2.5. An urn contains N1 white balls and N2 black balls; n balls are drawn at random, (a) with
replacement, (b) without replacement. What is the expected number of white balls in the
sample?
2.6. A student takes a multiple choice test consisting of two problems. The first one has 3
possible answers and the second one has 5. The student chooses, at random, one answer as
the right one from each of the two problems. Find:
b) the V ar(X).
2.7. In a lottery that sells 3,000 tickets the first lot wins $1,000, the second $500, and five other
lots that come next win $100 each. What is the expected gain of a man who pays 1 dollar to
buy a ticket?
2.8. A pays 1 dollar for each participation in the following game: three dice are thrown; if
one ace appears he gets 1 dollar, if two aces appear he gets 2 dollars and if three aces appear
he gets 8 dollars; otherwise he gets nothing. Is the game fair, i.e., is the expected gain of the
player zero? If not, how much should the player receive when three aces appear to make the
game fair?
2.9. Suppose a die is rolled twice. What are the possible values that the following random
variables can take on?
4. The value of the first roll minus the value of the second roll.
2.10. Suppose X has a binomial distribution with parameters n and p ∈ (0, 1). What is the
most likely outcome of X?
2.11. An airline knows that 5 percent of the people making reservations on a certain flight will
not show up. Consequently, their policy is to sell 52 tickets for a flight that can hold only 50
passengers. What is the probability that there will be a seat available for every passenger who
shows up?
2.12. Suppose that an experiment can result in one of r possible outcomes, the ith outcome having probability $p_i$, $i=1,\dots,r$, with $\sum_{i=1}^r p_i=1$. If n of these experiments are performed, and if the outcome of any one of the n does not affect the outcome of the other $n-1$ experiments, show that the probability that the first outcome appears $x_1$ times, the second $x_2$ times, ..., and the rth $x_r$ times is
$$\frac{n!}{x_1!x_2!\cdots x_r!}\,p_1^{x_1}p_2^{x_2}\cdots p_r^{x_r}$$
when $x_1+x_2+\dots+x_r=n$. This is known as the multinomial distribution.
2.13. A television store owner figures that 50 percent of the customers entering his store will
purchase an ordinary television set, 20 percent will purchase a color television set, and 30
percent will just be browsing. If five customers enter his store on a certain day, what is the
probability that two customers purchase color sets, one customer purchases an ordinary set,
and two customers purchase nothing?
2.15. If a fair coin is successively flipped, find the probability that a head first appears on the
fifth trial.
2.16. A coin having probability p of coming up heads is successively flipped until the rth head appears. Argue that X, the number of flips required, satisfies
$$P[X=n] = C_{n-1}^{r-1}\,p^r(1-p)^{n-r}, \quad n\ge r.$$
This is known as the negative binomial distribution. Find the expectation and variance of X.
2.17. A fair coin is independently flipped n times, k times by A and n − k times by B. Show that
the probability that A and B flip the same number of heads is equal to the probability that
there are a total of k heads.
2.18. Suppose that we want to generate a random variable X that is equally likely to be either
0 or 1, and that all we have at our disposal is a biased coin that, when flipped, lands on heads
with some (unknown) probability p. Consider the following procedure:
1. Flip the coin, and let O1, either heads or tails, be the result.
(a) Show that the random variable X generated by this procedure is equally likely to be
either 0 or 1.
(b) Could we use a simpler procedure that continues to flip the coin until the last two flips
are different, and then sets X = 0 if the final flip is a head, and sets X = 1 if it is a tail?
2.19. Consider n independent flips of a coin having probability p of landing heads. Say a
changeover occurs whenever an outcome differs from the one preceding it. For instance, if
the results of the flips are HHT HT HHT , then there are a total of five changeovers. If p = 1/2,
what is the probability there are k changeovers?
2.20. Let X be a Poisson random variable with parameter λ. What is the most likely outcome
of X?
2.21. * Poisson approximation to the binomial. Let P be a binomial probability with probability of success p and number of trials n. Let $\lambda = np$. Show that
$$P(k \text{ successes}) = \frac{\lambda^k}{k!}\Big(1-\frac{\lambda}{n}\Big)^{-k}\cdot\frac{n}{n}\cdot\frac{n-1}{n}\cdots\frac{n-k+1}{n}\cdot\Big(1-\frac{\lambda}{n}\Big)^{n}.$$
Let $n\to\infty$ and let p change so that $\lambda$ remains constant. Conclude that for small p and large n,
$$P(k \text{ successes}) \approx \frac{\lambda^k}{k!}e^{-\lambda}, \quad \text{where } \lambda = pn.$$
b) Show for r = 2, 3, 4, . . .,
E{X(X − 1) . . . (X − r + 1)} = λr .
2.25. Suppose X takes all its values in $\mathbb{N}=\{0,1,2,\dots\}$. Show that
$$E[X] = \sum_{n=0}^{\infty} P[X>n].$$
2.26. Liam’s bowl of spaghetti contains n strands. He selects two ends at random and joins
them together. He does this until there are no ends left. What is the expected number of
spaghetti hoops in the bowl?
2.27. Sarah collects figures from cornflakes packets. Each packet contains one figure, and n
distinct figures make a complete set. Find the expected number of packets Sarah needs to
collect a complete set.
2.28. Each packet of the breakfast cereal Soggies contains exactly one token, and tokens are
available in each of the three colours blue, white and red. You may assume that each token
obtained is equally likely to be of the three available colours, and that the (random) colours
of different tokens are independent. Find the probability that, having searched the contents
of k packets of Soggies, you have not yet obtained tokens of every colour.
Let N be the number of packets required until you have obtained tokens of every colour.
Show that E[N ] = 11
2
.
2.29. Each box of cereal contains one of 2n different coupons. The coupons are organized
into n pairs, so that coupons 1 and 2 are a pair, coupons 3 and 4 are a pair, and so on.
Once you obtain one coupon from every pair, you can obtain a prize. Assuming that the
coupon in each box is chosen independently and uniformly at random from the 2n possibili-
ties, what is the expected number of boxes you must buy before you can claim the prize?
b) What is the probability that the number of kilos of bread that will be sold in a day is, (i)
more than 300 kilos? (ii) between 150 and 450 kilos?
c) Denote by A and B the events in (i) and (ii), respectively. Are A and B independent
events?
2.31. Suppose that the duration in minutes of long-distance telephone conversations follows
an exponential density function:
$$f(x) = \frac{1}{5}e^{-x/5} \quad \text{for } x>0.$$
Find the probability that the duration of a conversation:
d) will be less than 6 minutes given that it was greater than 3 minutes.
2.32. A number is randomly chosen from the interval (0;1). What is the probability that:
2.33. The height of men is normally distributed with mean µ=167 cm and standard deviation
σ=3 cm.
a) What is the percentage of the population of men that have height, (i) greater than 167
cm, (ii) greater than 170 cm, (iii) between 161 cm and 173 cm?
ii) two will have height smaller than the mean (and two bigger than the mean)?
2.34. Find the constant k and the mean and variance of the population defined by the prob-
ability density function
f (x) = k(1 + x)−3 for 0 ≤ x < ∞
and zero otherwise.
2.35. A mode of a distribution of one random variable X is a value of x that maximizes the pdf
or pmf. For X of the continuous type, f (x) must be continuous. If there is only one such x, it
is called the mode of the distribution. Find the mode of each of the following distributions
2.37. Let 0 < p < 1. A (100p)th percentile (quantile of order p) of the distribution of a random
variable X is a value ζp such that
Find the pdf f(x), the 25th percentile and the 60th percentile for each of the following cdfs.
3. $F(x) = \frac12 + \frac{1}{\pi}\tan^{-1}(x)$, $-\infty<x<\infty$.
2.38. If X is a random variable with the probability density function f , find the probability
density function of Y = X 2 if
(a) $f(x) = 2xe^{-x^2}$, for $0\le x<\infty$.
2.40. Let X have the uniform distribution U(0, 1). Find the density of each of the following random variables.
1. $Y = -\frac{1}{\lambda}\ln(1-X)$.
2. $Z = \ln\frac{X}{1-X}$. This is known as the logistic distribution.
3. $T = \sqrt{2\ln\frac{1}{1-X}}$. This is known as the Rayleigh distribution.
2.42. Let X be a random variable with distribution function F that is continuous. Show that
Y = F (X) is uniform.
2.43. Let F be a distribution function that is continuous and is such that the inverse function
F −1 exists. Let U be uniform on [0, 1]. Show that X = F −1 (U ) has distribution function F .
2.44. 1. Let X be a non-negative random variable satisfying $E[X^\alpha]<\infty$ for some $\alpha>0$. Show that
$$E[X^\alpha] = \alpha\int_0^\infty x^{\alpha-1}(1-F(x))\,dx.$$
2.46. Let X be a nonnegative random variable with mean $\mu$ and variance $\sigma^2$, both finite. Show that for any $b>0$,
$$P[X\ge\mu+b\sigma] \le \frac{1}{1+b^2}.$$
Hint: consider the function $g(x) = \dfrac{[(x-\mu)b+\sigma]^2}{\sigma^2(1+b^2)^2}$.
2.47. Let X be a random variable with mean µ and variance σ 2 , both finite. Show that for any
d > 1,
$$P[\mu-d\sigma < X < \mu+d\sigma] \ge 1 - \frac{1}{d^2}.$$
2.48. Divide a line segment into two parts by selecting a point at random. Find the probability
that the larger segment is at least three times the shorter. Assume a uniform distribution.
1. Let (An ) be a sequence of events such that limn P(An ) = 0. Show that limn→∞ E[XIAn ] = 0.
2. Show that for any $\varepsilon>0$, there exists a $\delta>0$ such that for any event A satisfying $P(A)<\delta$, $E[XI_A]<\varepsilon$.
2.51. Given the probability space (Ω, A, P), suppose X is a non-negative random variable and
E[X] = 1. Define Q : A → R by Q(A) = E[XIA ].
3. Suppose P(X > 0) = 1. Let EQ denote expectation with respect to Q. Show that EQ [Y ] =
EP [Y X].
Random elements
2.52. An urn contains 3 red balls, 4 blue balls and 2 yellow balls. Two balls are drawn at random from the urn; let X and Y denote the number of red and yellow balls among the two drawn, respectively.
1. P[X ∧ Y ≤ i].
2. P[X = Y ].
3. P[X > Y ].
4. P[X divides Y ].
2.55. Let X and Y be independent geometric random variables with parameters λ and µ.
2.56. Let X and Y be independent random variables with uniform distribution on the set
{−1, 1}. Let Z = XY . Show that X, Y, Z are pairwise independent but that they are not mutu-
ally independent.
2.57. * Let n be a prime number greater than 2; and X, Y be independent and uniformly dis-
tributed on {0, 1, . . . , n − 1}. For each r, 0 ≤ r ≤ n − 1, define Zr = X + rY ( mod n). Show that
the random variable Zr , r = 0, . . . , n − 1, are pairwise independent.
2.58. Let $(X_n)_{n\ge0}$ be a sequence of independent random variables with $P[X_n=1]=P[X_n=-1]=\frac12$ for all n. Let $Z_n=X_0X_1\cdots X_n$. Show that $Z_1, Z_2, \dots$ are independent.
2.59. Let (a1 , . . . , an ) be a random permutation of (1, . . . , n), equally likely to be any of the n!
possible permutations. Find the expectation of
$$L = \sum_{i=1}^n |a_i - i|.$$
2.60. A blood test is being performed on n individuals. Each person can be tested separately, but this is expensive. Pooling can decrease the cost. The blood samples of k people can be
pooled and analyzed together. If the test is negative, this one test suffices for the group of k
individuals. If the test is positive, then each of the k persons must be tested separately and
thus k + 1 total tests are required for the k people. Suppose that we create n/k disjoint groups
of k people (where k divides n) and use the pooling method. Assume that each person has a
positive result on the test independently with probability p.
(a) What is the probability that the test for a pooled sample of k people will be positive?
(d) Give an inequality that shows for what values of p pooling is better than just testing every
individual.
2.61. You need a new staff assistant, and you have n people to interview. You want to hire the
best candidate for the position. When you interview a candidate, you can give them a score,
with the highest score being the best and no ties being possible. You interview the candidates
one by one. Because of your company’s hiring practices, after you interview the kth candidate,
you either offer the candidate the job before the next interview or you forever lose the chance
to hire that candidate. We suppose the candidates are interviewed in a random order, chosen
uniformly at random from all n! possible orderings.
We consider the following strategy. First, interview m candidates but reject them all; these candidates give you an idea of how strong the field is. After the mth candidate, hire the first candidate you interview who is better than all of the previous candidates you have interviewed.
1. Let E be the event that we hire the best assistant, and let $E_i$ be the event that the ith candidate is the best and we hire him. Determine $P(E_i)$, and show that
$$P(E) = \frac{m}{n}\sum_{j=m+1}^{n}\frac{1}{j-1}.$$
2. Show that
$$\frac{m}{n}(\ln n - \ln m) \le P(E) \le \frac{m}{n}\big(\ln(n-1)-\ln(m-1)\big).$$
3. Show that m(ln n − ln m)/n is maximized when m = n/e, and explain why this means
P(E) ≥ 1/e for this choice of m.
2.62. Let X and Y have the joint pdf
$$f(x,y) = \begin{cases} 6(1-x-y) & \text{if } x+y<1,\ x>0,\ y>0,\\ 0 & \text{otherwise.}\end{cases}$$
2.64. Let X be a normal random variable with $\mu=0$ and $\sigma^2<\infty$, and let $\Theta$ be uniform on $[0,\pi]$. Assume that X and $\Theta$ are independent. Find the distribution of $Z = X + a\cos\Theta$.
2.65. Let X and Y be independent random variable with the same distribution N (0, σ 2 ).
2.66. (Simulation of Normal Random Variables) Let U and V be two independent uniform
random variable on [0, 1]. Let θ = 2πU and S = − ln(V ).
$$Y_1 = \min\{X_i,\ 1\le i\le n\},\quad Y_2 = \text{second smallest of } X_1,\dots,X_n,\ \dots,\quad Y_n = \max\{X_i,\ 1\le i\le n\}.$$
Then Y1 , . . . , Yn are also random variables, and Y1 ≤ Y2 ≤ . . . ≤ Yn . They are called the order
statistics of (X1 , . . . , Xn ) and are usually denoted Yk = X(k) . Assume that Xi are i.i.d. with
common density f .
2.69. Let X and Y be independent and suppose P[X + Y = α] = 1 for some constant α. Show
that both X and Y are constant random variables.
2.70. Let $(X_n)_{n\ge1}$ be i.i.d. with common continuous distribution function F(x). Denote $R_n=\sum_{j=1}^n I_{\{X_j\ge X_n\}}$, and $A_n=\{R_n=1\}$.
1. Show that the random variables $(R_n)_{n\ge1}$ are independent and
$$P[R_n=k] = \frac{1}{n}, \quad k=1,\dots,n.$$
Definition 3.1.1. Let $(X_n)_{n\ge1}$ be a sequence of random variables defined on $(\Omega,\mathcal{A},\mathbb{P})$. We say that $X_n$
• converges almost surely to X, denoted by $X_n\xrightarrow{a.s.}X$ or $\lim_n X_n=X$ a.s., if
$$\mathbb{P}\big[\omega: \lim_{n\to\infty}X_n(\omega)=X(\omega)\big]=1;$$
• converges in probability to X, denoted by $X_n\xrightarrow{P}X$, if for any $\varepsilon>0$,
$$\lim_{n\to\infty}\mathbb{P}\big[|X_n-X|>\varepsilon\big]=0;$$
• converges in $L^p$ ($p>0$) to X, denoted by $X_n\xrightarrow{L^p}X$, if $E(|X_n|^p)<\infty$ for any n, $E(|X|^p)<\infty$ and
$$\lim_{n\to\infty}E(|X_n-X|^p)=0.$$
Note that the value of a random variable is a number, so the most natural way to consider the convergence of random variables is via the convergence of sequences of numbers; this leads to almost sure convergence. But sometimes this mode of convergence can fail. Convergence in probability expresses that the larger n is, the smaller the probability that $X_n$ is far away from X becomes, while convergence in $L^p$ requires that the average distance between $X_n$ and X tends to 0.
We have the following example: let $(X_n)_{n\ge1}$ be random variables with $P[X_n=n]=1/n^2$ and $P[X_n=0]=1-1/n^2$. For any $\varepsilon>0$, $P[|X_n|>\varepsilon]\le 1/n^2\to 0$, which implies that $X_n\xrightarrow{P}0$.
• In order to prove the convergence in $L^p$ for $p\in(0,2)$, we must check that $\lim_{n\to\infty}E(|X_n|^p)=0$. This follows from the computation $E(|X_n|^p)=n^p\cdot n^{-2}=n^{p-2}\to 0$ for $p<2$.
• Usually, in order to prove or disprove almost sure convergence, we use the Borel–Cantelli lemma, which can be stated as follows.
Lemma 3.1.3 (Borel–Cantelli). Let $(A_n)$ be a sequence of events in a probability space $(\Omega,\mathcal{F},\mathbb{P})$. Denote $\limsup A_n = \bigcap_{n=1}^\infty\big(\bigcup_{m\ge n}A_m\big)$.
1. If $\sum_{n=1}^\infty P(A_n)<\infty$, then $P(\limsup A_n)=0$.
2. If $\sum_{n=1}^\infty P(A_n)=\infty$ and the $A_n$'s are independent, then $P(\limsup A_n)=1$.
Proof. 1. From the definition of $\limsup A_n$, it is clear that for every i,
$$P(\limsup A_n) \le P\Big(\bigcup_{m\ge i}A_m\Big) \le \sum_{m\ge i}P(A_m),$$
and the right-hand side tends to 0 as $i\to\infty$ since the series converges.
2. We have
$$1-P(\limsup A_n) = P\Big(\Big(\bigcap_{n=1}^\infty\bigcup_{m\ge n}A_m\Big)^c\Big) = P\Big(\bigcup_{n=1}^\infty\Big(\bigcup_{m\ge n}A_m\Big)^c\Big) = P\Big(\bigcup_{n=1}^\infty\bigcap_{m\ge n}A_m^c\Big).$$
In order to prove that $P(\limsup A_n)=1$, i.e. $1-P(\limsup A_n)=0$, we will show that $P\big(\bigcap_{m\ge n}A_m^c\big)=0$ for every n. Indeed, since the $A_n$'s are independent,
$$P\Big(\bigcap_{m\ge n}A_m^c\Big) = \prod_{m\ge n}P(A_m^c) = \prod_{m\ge n}[1-P(A_m)] \le \prod_{m\ge n}e^{-P(A_m)} = e^{-\sum_{m\ge n}P(A_m)} = e^{-\infty} = 0.$$
The meaning of the event $\limsup A_n$ is that $A_n$ occurs for infinitely many n. Therefore $P(\limsup A_n)=0$ means that almost surely only finitely many of the events $A_n$ occur.
Now, let us see the application of the Borel–Cantelli lemma in our example. Denote the event $A_n=\{X_n\neq 0\}=\{X_n=n\}$. Then
$$\sum_{n=1}^\infty P(A_n) = \sum_{n=1}^\infty\frac{1}{n^2} < \infty.$$
It follows that almost surely $A_n$ occurs for only finitely many n, i.e. the number of n such that $X_n$ differs from zero is finite. Hence, almost surely the limit of $X_n$ exists and it must be zero, so $X_n\xrightarrow{a.s.}0$.
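The following small simulation illustrates this conclusion. It assumes, as in the example, that $X_n = n$ with probability $1/n^2$ and $X_n = 0$ otherwise, independently across n; the horizon and number of paths are arbitrary choices. Along each simulated path only a handful of (small) indices n give a nonzero $X_n$.

```python
import numpy as np

rng = np.random.default_rng(1)
N, paths = 100_000, 10
n = np.arange(1, N + 1)
for _ in range(paths):
    nonzero = rng.random(N) < 1.0 / n**2        # indicator of the event {X_n = n}
    hits = np.flatnonzero(nonzero) + 1          # indices n where X_n != 0
    print("nonzero X_n occur at n =", hits)     # typically only a few small values of n
```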
Proposition 3.1.4. Define $d_P(X,Y) := E\Big[\dfrac{|X-Y|}{1+|X-Y|}\Big]$. Then $X_n\xrightarrow{P}X$ if and only if $\lim_{n\to\infty}d_P(X_n,X)=0$.
Proof. ($\Rightarrow$) Suppose that $X_n\xrightarrow{P}X$. For any $\varepsilon>0$ and $\omega\in\Omega$, since the function $x\mapsto\frac{x}{x+1}$ is increasing on $[0,\infty)$, we have
$$\frac{|X_n-X|}{1+|X_n-X|} \le \varepsilon + I_{\{|X_n-X|>\varepsilon\}}.$$
Hence
$$\limsup_{n\to\infty} d_P(X_n,X) \le \varepsilon + \limsup_{n\to\infty}P(|X_n-X|>\varepsilon) = \varepsilon \quad \text{for all } \varepsilon>0.$$
The following proposition shows that among the three modes of convergence, convergence in probability is the weakest.
Proposition 3.1.5. 1. If $X_n\xrightarrow{L^p}X$ for some $p>0$, then $X_n\xrightarrow{P}X$.
2. If $X_n\xrightarrow{a.s.}X$, then $X_n\xrightarrow{P}X$.
Proof. 1. By Markov's inequality, for any $\varepsilon>0$,
$$P(|X_n-X|>\varepsilon) = P(|X_n-X|^p>\varepsilon^p) \le \frac{E(|X_n-X|^p)}{\varepsilon^p}.$$
Since $E(|X_n-X|^p)\to0$, by the sandwich theorem $P(|X_n-X|>\varepsilon)$ also converges to 0. Therefore $X_n\xrightarrow{P}X$.
2. Suppose that $X_n\xrightarrow{a.s.}X$. It is clear that
$$\frac{|X_n-X|}{1+|X_n-X|} \le 1,$$
so by the dominated convergence theorem $d_P(X_n,X)\to0$. From Proposition 3.1.4, we have $X_n\xrightarrow{P}X$.
In the above example, we can see that convergence in probability is not sufficient for con-
vergence almost surely. However, we have the following result.
Proposition 3.1.6. 1. Suppose $X_n\xrightarrow{P}X$. Then there exists a subsequence $(n_k)_{k\ge1}$ such that $X_{n_k}\xrightarrow{a.s.}X$.
2. Conversely, if every subsequence $(n_k)_{k\ge1}$ has a further subsequence $(m_k)_{k\ge1}$ such that $X_{m_k}\xrightarrow{a.s.}X$, then $X_n\xrightarrow{P}X$.
Proof. 1. Suppose $X_n\xrightarrow{P}X$. Then from Proposition 3.1.4,
$$\lim_{n\to\infty}E\Big[\frac{|X_n-X|}{1+|X_n-X|}\Big]=0,$$
so we can choose a subsequence $(n_k)_{k\ge1}$ such that
$$\sum_{k=1}^\infty E\Big[\frac{|X_{n_k}-X|}{1+|X_{n_k}-X|}\Big]<\infty.$$
Therefore,
$$\sum_{k=1}^\infty \frac{|X_{n_k}-X|}{1+|X_{n_k}-X|}<\infty \quad \text{a.s.}$$
Then, almost surely,
$$\lim_{k\to\infty}\frac{|X_{n_k}-X|}{1+|X_{n_k}-X|}=0,$$
which implies that $\lim_{k\to\infty}|X_{n_k}-X|=0$, that is, $X_{n_k}\xrightarrow{a.s.}X$.
2. We argue by contradiction. Suppose that $X_n$ does not converge to X in probability. Then, by Proposition 3.1.4, there exist $\varepsilon>0$ and a subsequence $(n_k)_{k\ge1}$ such that
$$E\Big[\frac{|X_{n_k}-X|}{1+|X_{n_k}-X|}\Big]\ge\varepsilon \quad \text{for all } k.$$
It follows that no subsequence $(m_k)$ of $(n_k)$ can satisfy $X_{m_k}\xrightarrow{a.s.}X$, since dominated convergence would then force the expectations along $(m_k)$ to tend to 0. This contradicts the hypothesis, so we must have $X_n\xrightarrow{P}X$.
3.2 Laws of large numbers
Throughout this section, for a sequence of random variables $(X_n)_{n\ge1}$ we write $S_n = X_1+\dots+X_n$.
Corollary 3.2.2. Let $(X_n)_{n\ge1}$ be a sequence of pairwise uncorrelated random variables satisfying
$$\lim_{n\to\infty}\frac{D(X_1)+\dots+D(X_n)}{n^2}=0.$$
Then
$$\frac{S_n-ES_n}{n}\xrightarrow{P}0, \quad \text{as } n\to\infty.$$
Proof. Observe that D(Sn ) = D(X1 ) + . . . + D(Xn ) and apply Theorem 3.2.1.
Lemma 3.2.3. Let $(X_n)_{n\ge1}$ be a sequence of i.i.d. random variables with finite variance. Then
$$\frac{S_n}{n}\xrightarrow{P}EX_1, \quad \text{as } n\to\infty.$$
Note that when Xn has the Bernoulli law, then Sn is the number of successful trials and
Bernoulli showed that Sn /n converges in probability to the probability of success of a trial.
However, his proof is much more complicated than the one given here.
Theorem 3.2.4. Let $(X_n)_{n\ge1}$ be a sequence of pairwise uncorrelated random variables satisfying $\sup_n D(X_n)\le\sigma^2<\infty$. Then
$$\lim_{n\to\infty}\frac{S_n-ES_n}{n}=0 \quad \text{a.s. and in } L^2.$$
Proof. At first, assume that $E(X_n)=0$. Denote $Y_n=S_n/n$. Then $E(Y_n)=0$ and, from Proposition 2.7.3,
$$E(Y_n^2)=\frac{1}{n^2}\sum_{i=1}^n DX_i\le\frac{\sigma^2}{n}.$$
Hence $Y_n\xrightarrow{L^2}0$. We also have
$$\sum_{n=1}^\infty E\big(Y_{n^2}^2\big)\le\sum_{n=1}^\infty\frac{\sigma^2}{n^2}<\infty.$$
we have
$$E\Big[\Big(Y_n-\frac{p(n)^2}{n}Y_{p(n)^2}\Big)^2\Big]\le\frac{n-p(n)^2}{n^2}\sigma^2\le\frac{2p(n)+1}{n^2}\sigma^2\le\frac{2\sqrt n+1}{n^2}\sigma^2\le\frac{3}{n^{3/2}}\sigma^2,$$
with the observations $n\le(p(n)+1)^2$ and $p(n)\le\sqrt n$. By the same argument, since
$$\sum_{n=1}^\infty E\Big[\Big(Y_n-\frac{p(n)^2}{n}Y_{p(n)^2}\Big)^2\Big]\le\sum_{n=1}^\infty\frac{3}{n^{3/2}}\sigma^2<\infty,$$
we get
$$Y_n-\frac{p(n)^2}{n}Y_{p(n)^2}\xrightarrow{a.s.}0.$$
From (3.2) and the observation $p(n)^2/n\to1$, we deduce that $Y_n\xrightarrow{a.s.}0$.
In general, if $E(X_n)\neq0$, we set $Z_n=X_n-E(X_n)$. Then $\{Z_n\}$ is a sequence of pairwise uncorrelated random variables with mean zero satisfying the condition of the theorem. Therefore
$$\frac{S_n-ES_n}{n}=\frac{Z_1+\dots+Z_n}{n}\xrightarrow{a.s.}0.$$
In the following, we state without proof two general versions of strong law of large num-
bers.
Theorem 3.2.5. Let $(X_n)_{n\ge1}$ be a sequence of independent random variables and $(b_n)_{n\ge1}$ a sequence of positive numbers satisfying $b_n\uparrow\infty$. If
$$\sum_{n=1}^\infty\frac{DX_n}{b_n^2}<\infty \quad\text{then}\quad \frac{S_n-E(S_n)}{b_n}\xrightarrow{a.s.}0.$$
Theorem 3.2.6. Let $(X_n)_{n\ge1}$ be a sequence of i.i.d. random variables. Then
$$\lim_{n\to\infty}\frac{S_n}{n}=E(X_1) \text{ a.s.} \quad\text{if and only if}\quad E(|X_1|)<\infty.$$
Example 3.2.7. Consider $(X_n)_{n\ge1}$ i.i.d. with distribution $B(1,p)$. From Theorem 3.2.6,
$$\frac{S_n}{n}\xrightarrow{a.s.}E(X_1)=p.$$
Hence, to approximate the probability of success of each trial, we can use $S_n/n$ for n large enough.
An application of Strong law of large numbers that is quite simple but very useful is the
Monte Carlo method.
In practice, we use a computer to generate the sequence $(U_j)_{j\ge1}$ and obtain an approximation of I for any function f satisfying condition (3.3). Under condition (3.4), the error of the approximation depends only on the sample size n and not on the smoothness of f. The Monte Carlo method is therefore often more useful than deterministic methods for approximating multiple integrals. The main quantity to control is the mean squared error: if we can reduce it, the calculation becomes more accurate and the computing time can also be reduced (see [?]). This is the usual direction in which one tries to improve the Monte Carlo method.
The error of the Monte Carlo method will be analysed in more detail using the central limit theorem, which is presented in the next section.
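A minimal Monte Carlo sketch, under the assumption that the target integral is $I=\int_0^1 f(u)\,du$ for a bounded f: the sample mean of $f(U_j)$ converges to I by the strong law of large numbers, and a CLT-based standard error gives a rough error estimate. The integrand used here is an arbitrary example, not one from the text.

```python
import numpy as np

rng = np.random.default_rng(2)
f = lambda u: np.exp(-u**2)                      # example integrand (assumed)
for n in [10**3, 10**5, 10**7]:
    U = rng.random(n)                            # U_1, ..., U_n i.i.d. uniform on (0, 1)
    estimate = f(U).mean()                       # (1/n) * sum f(U_j) -> I almost surely
    std_error = f(U).std(ddof=1) / np.sqrt(n)    # CLT-based error estimate
    print(n, estimate, "+/-", std_error)
```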
3.3 Central limit theorems
Recall that the characteristic function of a random variable X is defined by $\varphi_X(t)=E(e^{itX})$, $t\in\mathbb{R}$.
Theorem 3.3.2. For every random variable X, the characteristic function $\varphi_X$ has the following properties:
1. $\varphi_X(0)=1$;
2. $\varphi_X(-t)=\overline{\varphi_X(t)}$;
3. $|\varphi_X(t)|\le1$;
4. $\varphi_X$ is uniformly continuous on $\mathbb{R}$.
Proof. It is easy to see that $\varphi_X(0)=1$. Applying the inequality $(EX)^2\le E(X^2)$,
$$|\varphi_X(t)| = \sqrt{(E\cos tX)^2+(E\sin tX)^2}\le\sqrt{E(\cos^2 tX)+E(\sin^2 tX)}=1,$$
so $\varphi_X$ is bounded. The continuity of $\varphi_X$ can be deduced from the Lebesgue dominated convergence theorem.
The following theorem shows the connection between the characteristic function of a ran-
dom variable and its moments.
Theorem 3.3.3. If $E[|X|^m]<\infty$ for some positive integer m, then $\varphi_X$ has continuous derivatives up to order m, and
$$\varphi_X^{(k)}(t)=\int_{\mathbb{R}}(ix)^ke^{itx}\,dF_X(x), \quad k=1,\dots,m, \tag{3.5}$$
$$\varphi_X^{(k)}(0)=i^kE(X^k). \tag{3.6}$$
Proof. Since $E(|X|^m)<\infty$, we have $E(|X|^k)<\infty$ for all $k=1,\dots,m$. Then
$$\sup_t\int\big|(ix)^ke^{itx}\big|\,dF_X(x)\le\int|x|^k\,dF_X(x)<\infty.$$
By the Lebesgue theorem, we can differentiate under the integral sign and obtain (3.5). Letting $t=0$ in (3.5) gives (3.6).
Consider the Taylor expansion² of the function $\exp(x)$ at $x=0$:
$$E(e^{itX})=E\Big[\sum_{k=0}^{n-1}\frac{(itX)^k}{k!}+\frac{(itX)^n}{n!}e^{i\theta X}\Big]=\sum_{k=0}^{n-1}\frac{(it)^k}{k!}E(X^k)+\frac{(it)^n}{n!}\big(E(X^n)+\alpha_n(t)\big),$$
where $|\theta|\le|t|$ and $\alpha_n(t)=E\big[X^n(e^{i\theta X}-1)\big]$. Therefore $|\alpha_n(t)|\le2E(|X|^n)$, i.e. it is bounded, and by the dominated convergence theorem $\alpha_n(t)\to0$ as $t\to0$.
The converse statement can be proved as well; see [13, pages 190–193].
² Taylor expansion: suppose that $\varphi$ has continuous derivatives up to order m; then
$$\varphi(x)=\sum_{k=0}^{m-1}\frac{\varphi^{(k)}(0)}{k!}x^k+\frac{\varphi^{(m)}(\theta)}{m!}x^m,$$
for some $\theta$ between 0 and x.
If $X\sim N(a,\sigma^2)$ then
$$\varphi_X(t)=\frac{1}{\sqrt{2\pi\sigma^2}}\int_{\mathbb{R}}e^{itx}e^{-\frac{(x-a)^2}{2\sigma^2}}\,dx = e^{ita-\sigma^2t^2/2}.$$
Theorem 3.3.6. Two random vectors have the same distribution if their characteristic functions coincide. Moreover, if $\int|\varphi_X(t)|\,dt<\infty$ then X has a bounded continuous density given by
$$f_X(y)=\frac{1}{2\pi}\int_{\mathbb{R}}e^{-ity}\varphi_X(t)\,dt.$$
Example 3.3.7. Let X and Y have Poisson distributions with parameters $\mu$ and $\lambda$ respectively, and assume that X and Y are independent. Let us consider the distribution of $X+Y$. We can compute its characteristic function as
$$\varphi_{X+Y}(t)=E(e^{it(X+Y)})=E(e^{itX})E(e^{itY})=e^{(\lambda+\mu)(e^{it}-1)}.$$
This characteristic function agrees with that of $\mathrm{Poi}(\mu+\lambda)$, so $X+Y$ has the Poisson distribution with parameter $\mu+\lambda$.
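This can also be checked by simulation. The sketch below assumes arbitrary parameter values and compares the empirical distribution of $X+Y$ with the $\mathrm{Poi}(\mu+\lambda)$ pmf.

```python
import numpy as np
from scipy.stats import poisson

rng = np.random.default_rng(3)
mu, lam, n = 2.0, 3.5, 1_000_000       # assumed parameters and sample size
X = rng.poisson(mu, n)
Y = rng.poisson(lam, n)
S = X + Y

# Empirical frequencies of X + Y versus the Poisson(mu + lambda) pmf.
for k in range(6):
    print(k, np.mean(S == k), poisson.pmf(k, mu + lam))
```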
We can also use the characteristic function to check whether random variables are independent: the random variables $X_1,\dots,X_n$ are independent if and only if
$$\varphi_{(X_1,\dots,X_n)}(t_1,\dots,t_n)=\varphi_{X_1}(t_1)\cdots\varphi_{X_n}(t_n) \quad\text{for all }(t_1,\dots,t_n)\in\mathbb{R}^n.$$
Example 3.3.9. Let X and Y be independent random variables with the standard normal distribution $N(0,1)$. According to Example 3.3.5, we have
$$\varphi_{(X+Y,X-Y)}(t,s)=Ee^{it(X+Y)+is(X-Y)}=Ee^{i(t+s)X}Ee^{i(t-s)Y}=e^{-t^2-s^2}.$$
Putting $s=0$ and $t=0$ respectively, we get $\varphi_{X+Y}(t)=e^{-t^2}$ and $\varphi_{X-Y}(s)=e^{-s^2}$. Hence both $X+Y$ and $X-Y$ have the normal distribution $N(0,2)$. Furthermore, they are independent since
$$\varphi_{(X+Y,X-Y)}(t,s)=\varphi_{X+Y}(t)\,\varphi_{X-Y}(s).$$
Note that in the above definition we do not require the random variables $(X_n)_{n\ge1}$ to be defined on the same probability space; only their expectations, i.e. their distributions, matter. For this reason weak convergence is also called convergence in distribution (see Exercise 3.27).
If we suppose that the $X_n$'s and X are defined on the same probability space, we have the following propositions.
Proposition 3.3.11. Let $(X_n)_{n\ge1}$ and X be random variables defined on the same probability space $(\Omega,\mathcal{F},\mathbb{P})$. If $X_n\xrightarrow{P}X$ then $X_n\xrightarrow{w}X$.
Proof. We prove by contradiction. Assume that $X_n\xrightarrow{P}X$ but $X_n$ does not converge weakly to X. Then there exist a bounded continuous function f, a constant $\varepsilon>0$ and a subsequence $(n_k)_{k\ge1}$ such that
$$|E(f(X_{n_k}))-E(f(X))|\ge\varepsilon \quad\text{for all } k. \tag{3.8}$$
From Proposition 3.1.6, there exists a subsequence $(m_k)_{k\ge1}$ of $(n_k)_{k\ge1}$ such that $X_{m_k}\xrightarrow{a.s.}X$. Since f is continuous, $f(X_{m_k})\xrightarrow{a.s.}f(X)$. By the dominated convergence theorem, $E(f(X_{m_k}))\to E(f(X))$, which contradicts (3.8). The result follows.
Proposition 3.3.12. Let $(X_n)_{n\ge1}$ and X be random variables defined on the same probability space $(\Omega,\mathcal{F},\mathbb{P})$. If $X_n\xrightarrow{w}X$ and $X=\text{const}$ a.s., then $X_n\xrightarrow{P}X$.
Proof. Let $X\equiv a$ a.s. Consider the bounded continuous function $f(x)=\frac{|x-a|}{|x-a|+1}$. Since $X_n\xrightarrow{w}a$, $E(f(X_n))\to f(a)=0$. From Proposition 3.1.4, $X_n\xrightarrow{P}a$.
The following theorem gives a very useful criterion for verifying the weak convergence of random variables by means of the characteristic function. Its proof is provided in [13, pages 196–199].
Theorem 3.3.13. Let $(F_n)_{n\ge1}$ be a sequence of distribution functions whose characteristic functions are $(\varphi_n)_{n\ge1}$ respectively,
$$\varphi_n(t)=\int_{\mathbb{R}}e^{itx}\,dF_n(x).$$
1. If $F_n\xrightarrow{w}F$ for some distribution function F, then $(\varphi_n)$ converges pointwise to the characteristic function $\varphi$ of F.
2. Conversely, if $(\varphi_n)$ converges pointwise to a function $\varphi$ which is continuous at 0, then $\varphi$ is the characteristic function of some distribution function F and $F_n\xrightarrow{w}F$.
Example 3.3.14. Let $X_n$ be normal $N(a_n,\sigma_n^2)$. Suppose that $a_n\to0$ and $\sigma_n^2\to1$ as $n\to\infty$. Then the sequence $(X_n)$ converges weakly to $N(0,1)$, since
$$\varphi_{X_n}(t)=e^{ita_n-\sigma_n^2t^2/2}\to e^{-t^2/2}.$$
Example 3.3.15 (Weak law of large numbers). Let $(X_k)_{k\ge1}$ be an i.i.d. sequence of random variables whose mean $a=E(X_1)$ is finite. Then
$$\frac{1}{n}(X_1+\dots+X_n)\xrightarrow{P}a.$$
Indeed, denote $S_n=X_1+\dots+X_n$ and let $\varphi$ be the characteristic function of $X_k$. Then,
Theorem 3.3.17. Let $(X_n)_{n\ge1}$ be a sequence of i.i.d. random variables with $E(X_n)=\mu$ and $DX_n=\sigma^2\in(0,\infty)$. Denote $S_n=X_1+\dots+X_n$. Then $Y_n=\dfrac{S_n-n\mu}{\sigma\sqrt n}\xrightarrow{w}N(0,1)$.
Proof. Denote by $\varphi$ the characteristic function of the random variable $X_n-\mu$. Since the $X_n$'s have the same law, $\varphi$ does not depend on n. Moreover, since the $X_n$'s are independent,
$$\varphi_{Y_n}(t)=E\exp\Big(it\sum_{j=1}^n\frac{X_j-\mu}{\sigma\sqrt n}\Big)=\prod_{j=1}^nE\exp\Big(it\frac{X_j-\mu}{\sigma\sqrt n}\Big)=\Big(\varphi\Big(\frac{t}{\sigma\sqrt n}\Big)\Big)^n.$$
It is clear that $E(X_j-\mu)=0$ and $E((X_j-\mu)^2)=\sigma^2$. Then from Theorem 3.3.3, $\varphi$ has a continuous second derivative and
$$\varphi(t)=1-\frac{\sigma^2t^2}{2}+t^2\alpha(t),$$
where $\alpha(t)\to0$ as $t\to0$. Using the expansion $\ln(1+x)=x+o(x)$ as $x\to0$,
$$\ln\varphi_{Y_n}(t)=n\ln\Big(1-\frac{t^2}{2n}+\frac{t^2}{n\sigma^2}\alpha\Big(\frac{t}{\sigma\sqrt n}\Big)\Big)\to-\frac{t^2}{2}.$$
Therefore $\varphi_{Y_n}(t)\to e^{-t^2/2}$ as $n\to\infty$. Applying Theorem 3.3.13, we have the desired result.
In the following, we give an example of the central limit theorem. More precisely, we will approximate binomial probabilities by normal probabilities.
Example 3.3.18. We know that a binomial random variable $S_n\sim B(n,p)$ can be written as the sum of n i.i.d. random variables with distribution $B(1,p)$. Then for n large enough, from the central limit theorem, we can approximate the distribution of $(S_n-np)/\sqrt{np(1-p)}$ by the standard normal distribution $N(0,1)$.
Usually, the probability that $a\le S_n\le b$ is given by
$$P(a\le S_n\le b)=\sum_{i=a}^{b}C_n^i\,p^i(1-p)^{n-i}.$$
However, when n is large, calculating $C_n^i$ directly for some i is impossible since it exceeds the capacity of a calculator or computer (consider 1000! or 5000!). In practice one therefore uses the approximation
$$P(a\le S_n\le b)\approx P\Big(\frac{a-np}{\sqrt{np(1-p)}}\le N(0,1)\le\frac{b-np}{\sqrt{np(1-p)}}\Big).$$
Note that the last probability can be written as an integral of the density function of the normal variable, which can be computed or approximated easily.
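A short numerical illustration of this approximation, with arbitrary values of n, p, a, b; a continuity correction of 0.5 (not discussed in the text) is used to improve the agreement.

```python
import numpy as np
from scipy.stats import binom, norm

n, p = 1000, 0.3          # assumed parameters
a, b = 280, 320           # assumed interval
mean, sd = n * p, np.sqrt(n * p * (1 - p))

exact = binom.cdf(b, n, p) - binom.cdf(a - 1, n, p)                       # sum of binomial pmf
approx = norm.cdf((b + 0.5 - mean) / sd) - norm.cdf((a - 0.5 - mean) / sd)  # normal approximation
print("exact:", exact, "normal approximation:", approx)
```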
To quantify the rate at which the distribution $F_{Y_n}$ converges to the normal distribution, we use the Berry–Esseen inequality: suppose $E(|X_1|^3)<\infty$; then
$$\sup_{-\infty<x<\infty}\Big|F_{Y_n}(x)-\int_{-\infty}^x\frac{e^{-t^2/2}}{\sqrt{2\pi}}\,dt\Big|\le K_{BE}\,\frac{E(|X_1-EX_1|^3)}{\sigma^3\sqrt n}, \tag{3.9}$$
where $K_{BE}$ is a constant in $\big(\frac{\sqrt{10}+3}{6\sqrt{2\pi}},\,0.4748\big)$ (see [12]).
The condition that the $X_n$'s are i.i.d. is rather restrictive, and many authors have worked to weaken it. In the following, we state Lindeberg's central limit theorem. Its proof can be found in [13, pages 221–225].
Theorem 3.3.19. Let $(X_n)_{n\ge1}$ be a sequence of independent random variables with finite variances. Denote $S_n=X_1+\dots+X_n$ and $B_n^2=DX_1+\dots+DX_n$. Suppose that
$$L_n(\varepsilon):=\frac{1}{B_n^2}\sum_{k=1}^nE\big[(X_k-EX_k)^2I_{\{|X_k-EX_k|>\varepsilon B_n\}}\big]\to0, \quad\forall\varepsilon>0. \tag{3.10}$$
Then $S_n^*=\dfrac{S_n-ES_n}{B_n}\xrightarrow{w}N(0,1)$.
3.4 Exercises
$$\lim_{n\to\infty}E(|X_n-X|\wedge1)=0.$$
3.4. Consider the probability space $([0,1],\mathcal{B}([0,1]),P)$. Let $X=0$ and let $X_1,X_2,\dots$ be the random variables
$$X_n(\omega)=\begin{cases}0 & \text{if } 1/n\le\omega\le1,\\ e^n & \text{if } 0\le\omega<1/n.\end{cases}$$
Show that $X_n\xrightarrow{P}X$, but $E|X_n-X|^p$ does not converge to 0 for any $p>0$.
3.5. Consider the probability space $([0,1],\mathcal{B}([0,1]),P)$. Let $X=0$. For each $n=2^m+k$ where $0\le k<2^m$, define
$$X_n(\omega)=\begin{cases}1 & \text{if } \frac{k}{2^m}\le\omega\le\frac{k+1}{2^m},\\ 0 & \text{otherwise.}\end{cases}$$
Show that $X_n\xrightarrow{P}X$, but $(X_n)$ does not converge to X a.s.
3.6. Let $(X_n)_{n\ge1}$ be a sequence of independent exponential random variables with parameter $\lambda=1$. Show that
$$P\Big[\limsup_{n\to\infty}\frac{X_n}{\ln n}=1\Big]=1.$$
3.7. Let X1 , X2 , . . . be a sequence of identically distributed random variables with E|X1 | < ∞
and let Yn = n−1 max1≤i≤n |Xi |. Show that limn E(Yn ) = 0 and limn Yn = 0 a.s.
3.8. [5] Let $(X_n)_{n\ge1}$ be random variables with $X_n\xrightarrow{P}X$. Suppose $|X_n(\omega)|\le C$ for a constant $C>0$ and all $\omega$. Show that $\lim_{n\to\infty}E|X_n-X|=0$.
3.10. [10] Let $X_1,\dots,X_n$ be independent and identically distributed random variables with $\operatorname{Var}(X_1)<\infty$. Show that
$$\frac{2}{n(n+1)}\sum_{j=1}^n jX_j\xrightarrow{P}EX_1.$$
3.11. [2] If for every i, $\operatorname{Var}(X_i)\le c<\infty$ and $\operatorname{Cov}(X_i,X_j)<0$ ($i\ne j$; $i,j=1,2,\dots$), then the WLLN holds.
3.12. [2](Theorem of Bernstein) Let {Xn } be a sequence of random variables so that V ar(Xi ) ≤
c < ∞ (i = 1, 2, . . .) and Cov(Xi , Xj ) → 0 when |i − j| → ∞ then the WLLN holds.
3.13. [5] Let $(Y_j)_{j\ge1}$ be a sequence of independent Bernoulli random variables, all defined on the same probability space, with law $B(1,p)$. Let $X_n=\sum_{j=1}^nY_j$. Show that $X_j$ has the $B(j,p)$ distribution and that $X_j/j$ converges a.s. to p.
3.14. [5] Let $\{X_j\}_{j\ge1}$ be i.i.d. with $X_j\in L^1$. Let $Y_j=e^{X_j}$. Show that $\big(\prod_{j=1}^nY_j\big)^{1/n}$ converges a.s. to a constant $\alpha$.
3.15. [5] Let $(X_j)_{j\ge1}$ be i.i.d. with $X_j\in L^1$ and $EX_j=\mu$. Let $(Y_j)_{j\ge1}$ be also i.i.d. with $Y_j\in L^1$ and $EY_j=\nu\ne0$. Show that
$$\lim_{n\to\infty}\frac{\sum_{j=1}^nX_j}{\sum_{j=1}^nY_j}=\frac{\mu}{\nu} \quad\text{a.s.}$$
3.16. [5] Let $(X_j)_{j\ge1}$ be i.i.d. with $X_j\in L^1$ and suppose that $\frac{1}{\sqrt n}\sum_{j=1}^n(X_j-\nu)$ converges in distribution to a random variable Z. Show that
$$\lim_{n\to\infty}\frac{1}{n}\sum_{j=1}^nX_j=\nu \quad\text{a.s.}$$
3.18. [5] Let $(X_j)_{j\ge1}$ be i.i.d. $N(1,3)$ random variables. Show that
$$\lim_{n\to\infty}\frac{X_1+X_2+\dots+X_n}{X_1^2+X_2^2+\dots+X_n^2}=\frac{1}{4} \quad\text{a.s.}$$
3.19. [5] Let $(X_j)_{j\ge1}$ be i.i.d. with mean $\mu$ and variance $\sigma^2$. Show that
$$\lim_{n\to\infty}\frac{1}{n}\sum_{i=1}^n(X_i-\mu)^2=\sigma^2 \quad\text{a.s.}$$
3. X ∼ U (a, b);
5. X ∼ Exp(λ);
3.21. Show that if $X_1,\dots,X_n$ are independent and uniformly distributed on $(-1,1)$, then for $n\ge2$, $X_1+\dots+X_n$ has density
$$f(x)=\frac{1}{\pi}\int_0^\infty\Big(\frac{\sin t}{t}\Big)^n\cos(tx)\,dt.$$
3.22. Suppose that X has density
$$f(x)=\frac{1-\cos x}{\pi x^2}.$$
Show that
$$\varphi_X(t)=(1-|t|)^+.$$
3.24. Let X1 , X2 , . . . be independent taking values 0 and 1 with probability 1/2 each.
3.26. Consider the probability space $([0,1],\mathcal{B}([0,1]),P)$. Let $X_1,X_2,\dots$ be the random variables
$$X_{2n}(\omega)=\begin{cases}1 & \text{if } 0\le\omega\le1/2,\\ 0 & \text{if } 1/2<\omega\le1,\end{cases}\qquad X_{2n+1}(\omega)=\begin{cases}0 & \text{if } 0\le\omega\le1/2,\\ 1 & \text{if } 1/2<\omega\le1.\end{cases}$$
Does the sequence $(X_n)$ converge in distribution? Does it converge in probability?
3.27. Let $(X_n)_{n\ge1}$ and X be random variables whose distribution functions are $(F_n)_{n\ge1}$ and F, respectively.
1. If $X_n\xrightarrow{w}X$ then $\lim_{n\to\infty}F_n(x)=F(x)$ for all $x\in D$, where $D=\{x\in\mathbb{R}: F(x+)=F(x)\}$ is a dense subset of $\mathbb{R}$.
2. If $\lim_{n\to\infty}F_n(x)=F(x)$ for all x in some dense subset of $\mathbb{R}$, then $X_n\xrightarrow{w}X$.
3.28. If $X_n\xrightarrow{w}X$ and $Y_n\xrightarrow{P}c$, then
a) $X_n+Y_n\xrightarrow{w}X+c$;
b) $X_nY_n\xrightarrow{w}cX$;
c) $X_n/Y_n\xrightarrow{w}X/c$, provided $Y_n\ne0$ a.s. for all n and $c\ne0$.
3.29. [10] Show that if $X_n\xrightarrow{d}X$ and $X=c$ a.s. for a real number c, then $X_n\xrightarrow{P}X$.
3.30. [10] A family of random variables $(X_i)_{i\in I}$ is called uniformly integrable if
$$\lim_{t\to\infty}\sup_{i\in I}E\big[|X_i|I_{\{|X_i|>t\}}\big]=0.$$
Let $X_1,X_2,\dots$ be random variables. Show that $\{|X_n|\}$ is uniformly integrable if one of the following conditions holds:
b) $P(|X_n|\ge c)\le P(|X|\ge c)$ for all n and $c>0$, where X is an integrable random variable.
3.34. [10] Suppose that $X_n$ is a random variable having the binomial distribution with size n and probability $\theta\in(0,1)$, $n=1,2,\dots$ Define $Y_n=\log(X_n/n)$ when $X_n\ge1$ and $Y_n=1$ when $X_n=0$. Show that $\lim_nY_n=\log\theta$ a.s. and $\sqrt n(Y_n-\log\theta)\xrightarrow{d}N\big(0,\frac{1-\theta}{\theta}\big)$.
3.35. [2] Show that for the sequence $\{X_n\}$ of independent random variables with
a) $P[X_n=\pm1]=\frac{1-2^{-n}}{2}$, $P[X_n=\pm2^n]=\frac{1}{2^{n+1}}$, $n=1,2,\dots$,
b) $P[X_n=\pm n^2]=\frac12$,

$$\frac{2}{\sigma}\big(\sqrt{S_n}-\sqrt n\big)\xrightarrow{d}N(0,1).$$
3.37. [5] Show that
$$\lim_{n\to\infty}e^{-n}\sum_{k=0}^n\frac{n^k}{k!}=\frac{1}{2}.$$
3.38. [5] Let $(X_j)_{j\ge1}$ be i.i.d. with $EX_j=0$ and $\sigma_{X_j}^2=\sigma^2<\infty$. Let $S_n=\sum_{j=1}^nX_j$. Show that
$$\lim_{n\to\infty}E\frac{|S_n|}{\sqrt n}=\sqrt{\frac{2}{\pi}}\,\sigma.$$
3.39. [5] Let $(X_j)_{j\ge1}$ be i.i.d. with the uniform distribution on $(-1,1)$. Let
$$Y_n=\frac{\sum_{j=1}^nX_j}{\sum_{j=1}^nX_j^2+\sum_{j=1}^nX_j^3}.$$
Show that $\sqrt n\,Y_n$ converges in distribution.
3.40. [5] Let $(X_j)_{j\ge1}$ be independent with $X_j$ uniformly distributed on $(-j,j)$.
a) Show that
$$\frac{S_n}{n^{3/2}}\xrightarrow{d}N\Big(0,\frac{1}{9}\Big).$$
b) Show that
$$\frac{S_n}{\sqrt{\sum_{j=1}^n\sigma_j^2}}\xrightarrow{d}N(0,1).$$
3.5 Introduction
Statistics is a process of using the scientific method to answer questions and make deci-
sions. That process involves designing studies, collecting good data, describing the data with
numbers and graphs, analyzing the data, and then making conclusions. We now review each
of these steps and show where statistics plays the all-important role.
³ This part is borrowed from D. Rumsey, Statistics Essentials for Dummies (2010), Wiley Publishing, Inc.
Survey
An observational study is one in which data are collected on individuals in a way that
doesn’t affect them. The most common observational study is the survey. Surveys are ques-
tionnaires that are presented to individuals who have been selected from a population of in-
terest. Surveys may be administered in a variety of ways, e.g. personal interview, telephone
interview, and self-administered questionnaire.
If conducted properly, surveys can be very useful tools for getting information. However,
if not conducted properly, surveys can result in bogus information. Some problems include
improper wording of questions, which can be misleading, people who were selected to par-
ticipate but do not respond, or an entire group in the population who had no chance of even
being selected. These potential problems mean a survey has to be well thought-out before it’s
given.
A downside of surveys is that they can only report relationships between variables that are
found; they cannot claim cause and effect. For example, if in a survey researchers notice that
the people who drink more than one Diet Coke per day tend to sleep fewer hours each night
than those who drink at most one per day, they cannot conclude that Diet Coke is causing the
lack of sleep. Other variables might explain the relationship, such as number of hours worked
per week.
Experiments
An experiment imposes one or more treatments on the participants in such a way that
clear comparisons can be made. Once the treatments are applied, the response is recorded.
For example, to study the effect of drug dosage on blood pressure, one group might take 10 mg
of the drug, and another group might take 20 mg. Typically, a control group is also involved,
where subjects each receive a fake treatment (a sugar pill, for example).
Experiments take place in a controlled setting, and are designed to minimize biases that
might occur. Some potential problems include: researchers knowing who got what treatment;
a certain condition or characteristic wasn’t accounted for that can affect the results (such as
weight of the subject when studying drug dosage); or lack of a control group. But when de-
signed correctly, if a difference in the responses is found when the groups are compared, the
researchers can conclude a cause and effect relationship.
It is perhaps most important to note that no matter what the study, it has to be designed
so that the original questions can be answered in a credible way.
First, a few words about selecting individuals to participate in a study. In statistics, we have
a saying: “Garbage in equals garbage out.” If you select your subjects in a way that is biased
— that is, favouring certain individuals or groups of individuals — then your results will also
be biased.
Suppose Bob wants to know the opinions of people in your city regarding a proposed
casino. Bob goes to the mall with his clipboard and asks people who walk by to give their
opinions. What’s wrong with that? Well, Bob is only going to get the opinions of a) people
who shop at that mall; b) on that particular day; c) at that particular time; d) and who take the
time to respond. That’s too restrictive - those folks don’t represent a cross-section of the city.
Similarly, Bob could put up a Web site survey and ask people to use it to vote. However, only
those who know about the site, have Internet access, and want to respond will give him data.
Typically, only those with strong opinions will go to such trouble. So, again, these individuals
don’t represent all the folks in the city.
In order to minimize bias, you need to select your sample of individuals randomly - that
is, using some type of “draw names out of a hat” process. Scientists use a variety of methods
to select individuals at random, but getting a random sample is well worth the extra time and
effort to get results that are legitimate.
Say you’re conducting a phone survey on job satisfaction of Americans. If you call them at
home during the day between 9 a.m. and 5 p.m., you’ll miss out on all those who work during
the day; it could be that day workers are more satisfied than night workers, for example. Some
surveys are too long - what if someone stops answering questions halfway through? Or what if
they give you misinformation and tell you they make $100,000 a year instead of $45,000? What
if they give you an answer that isn’t on your list of possible answers? A host of problems can
occur when collecting survey data. Experiments are sometimes even more challenging when
it comes to collecting data. Suppose you want to test blood pressure; what if the instrument
you are using breaks during the experiment? What if someone quits the experiment halfway through? What if something happens during the experiment to distract the subjects or the
researchers? Or they can’t find a vein when they have to do a blood test exactly one hour after
a dose of a drug is given? These are just some of the problems in data collection that can arise
with experiments.
Descriptive statistics
Data are also summarized (most often in conjunction with charts and/or graphs) by using
what statisticians call descriptive statistics. Descriptive statistics are numbers that describe a
data set in terms of its important features.
If the data are categorical (where individuals are placed into groups, such as gender or po-
litical affiliation) they are typically summarized using the number of individuals in each group
(called the frequency) or the percentage of individuals in each group (the relative frequency).
Numerical data represent measurements or counts, where the actual numbers have mean-
ing (such as height and weight). With numerical data, more features can be summarized be-
sides the number or percentage in each group. Some of these features include measures of
center (in other words, where is the “middle” of the data?); measures of spread (how diverse
or how concentrated are the data around the center?); and, if appropriate, numbers that mea-
sure the relationship between two variables (such as height and weight).
Some descriptive statistics are better than others, and some are more appropriate than
others in certain situations. For example, if you use codes of 1 and 2 for males and females,
respectively, when you go to analyze that data, you wouldn’t want to find the average of those
numbers — an “average gender” makes no sense. Similarly, using percentages to describe the
amount of time until a battery wears out is not appropriate.
Data are summarized in a visual way using charts and/or graphs. Some of the basic graphs
used include pie charts and bar charts, which break down variables such as gender and which
applications are used on teens’ cell phones. A bar graph, for example, may display opinions on
an issue using 5 bars labeled in order from “Strongly Disagree” up through “Strongly Agree.”
But not all data fit under this umbrella. Some data are numerical, such as height, weight,
time, or amount. Data representing counts or measurements need a different type of graph
that either keeps track of the numbers themselves or groups them into numerical groupings.
One major type of graph that is used to graph numerical data is a histogram.
For example, in a study of a medical treatment, researchers may use three categories to assess the outcome: did the patient get better, get worse, or stay the same? Categorical data are typically summarized by reporting
either the number of individuals falling into each category, or the percentage of individuals
falling into each category. For example, pollsters may report the percentage of Republicans,
Democrats, Independents, and others who took part in a survey. To calculate the percentage
of individuals in a certain category, find the number of individuals in that category, divide by
the total number of people in the study, and then multiply by 100%. For example, if a survey
of 2,000 teenagers included 1,200 females and 800 males, the resulting percentages would be (1,200 / 2,000) × 100% = 60% female and (800 / 2,000) × 100% = 40% male.
You can further break down categorical data by creating crosstabs. Crosstabs (also called
two-way tables) are tables with rows and columns. They summarize the information from
two categorical variables at once, such as gender and political party, so you can see (or easily
calculate) the percentage of individuals in each combination of categories. For example, if
you had data about the gender and political party of your respondents, you would be able to
look at the percentage of Republican females, Democratic males, and so on. In this example,
the total number of possible combinations in your table would be the total number of gender
categories times the total number of party affiliation categories. The U.S. government calcu-
lates and summarizes loads of categorical data using crosstabs. (see Chapter 11 for more on
two-way tables.) If you’re given the number of individuals in each category, you can always
calculate your own percents. But if you’re only given percentages without the total number in
the group, you can never retrieve the original number of individuals in each group. For exam-
ple, you might hear that 80% of people surveyed prefer Cheesy cheese crackers over Crummy
cheese crackers. But how many were surveyed? It could be only 10 people, for all you know,
because 8 out of 10 is 80%, just as 800 out of 1,000 is 80%. These two fractions (8 out of 10 and
800 out of 1,000) have different meanings for statisticians, because the first is based on very
little data, and the second is based on a lot of data. (See Chapter 7 for more information on
data accuracy and margin of error.)
When it comes to measures of center, the average doesn’t always tell the whole story and may
be a bit misleading. Take NBA salaries. Every year, a few top-notch players (like Shaq) make
much more money than anybody else. These are called outliers (numbers in the data set that
are extremely high or low compared to the rest). Because of the way the average is calcu-
lated, high outliers drive the average upward (as Shaq’s salary did in the preceding example).
Similarly, outliers that are extremely low tend to drive the average downward. What can you
report, other than the average, to show what the salary of a “typical” NBA player would be?
Another statistic used to measure the center of a data set is the median. The median of a data
set is the place that divides the data in half, once the data are ordered from smallest to largest.
It is denoted by M or x̃. To find the median of a data set:
1. Order the numbers from smallest to largest.
2. If the data set contains an odd number of numbers, the one exactly in the middle is the
median.
3. If the data set contains an even number of numbers, take the two numbers that appear
exactly in the middle and average them to find the median.
For example, take the data set 4, 2, 3, 1. First, order the numbers to get 1, 2, 3, 4. Then note this
data has an even number of numbers, so go to Step 3. Take the two numbers in the middle 2
and 3 and find their average: 2.5.
Note that if the data set is odd, the median will be one of the numbers in the data set itself.
However, if the data set is even, it may be one of the numbers (the data set 1, 2, 2, 3 has median
2); or it may not be, as the data set 4, 2, 3, 1 (whose median is 2.5) shows.
Chapter 4
Theorem 4.1.2. X is an $\mathbb{R}^d$-valued Gaussian random variable if and only if its characteristic function has the form
$$\varphi_X(u)=\exp\Big(i\langle u,\mu\rangle-\frac12\langle u,Qu\rangle\Big),$$
where $\mu\in\mathbb{R}^d$ and Q is a $d\times d$ symmetric nonnegative semi-definite matrix. Q is then the covariance matrix of X and $\mu$ is the mean of X, that is, $\mu=E(X)$ and $Q_{ij}=\operatorname{Cov}(X_i,X_j)$.
Example 4.1.3. Let $X_1,\dots,X_n$ be $\mathbb{R}$-valued independent random variables with laws $N(\mu_j,\sigma_j^2)$; then $X=(X_1,\dots,X_n)$ is Gaussian. Moreover, for any constant matrix $A\in\mathbb{R}^{m\times n}$, $Y=AX^\top$ is an m-dimensional Gaussian random variable.
Proposition 4.1.4. Let X be an Rn -valued Gaussian random variable. The components Xj are
independent if and only if the covariance matrix Q of X is diagonal.
4.2 Gamma, chi-square, Student and F distributions
[Figure: densities of the Gamma distribution G(α, λ) with λ = 1 and α = 7/8, 1, 2, 3.]
The G(α, λ) distribution has density
$$f_X(x)=\frac{x^{\alpha-1}e^{-x/\lambda}}{\Gamma(\alpha)\lambda^\alpha}\,I_{\{x>0\}}.$$
Corollary 4.2.2. Let $(X_i)_{1\le i\le n}$ be a sequence of independent random variables. Suppose that $X_i$ is $G(\alpha_i,\lambda)$ distributed. Then $S=X_1+\dots+X_n$ is $G(\alpha_1+\dots+\alpha_n,\lambda)$ distributed.
[Figure: densities of the chi-square distribution with n = 1, 2, 4, 6 degrees of freedom.]
In addition,
$$f_n(t)\xrightarrow{n\to\infty}\frac{1}{\sqrt{2\pi}}e^{-t^2/2}.$$
4.2.4 F distribution
Definition 4.2.6. Let U and V be independent chi-square random variables with m and n
degrees of freedom, respectively. The distribution of
$$W=\frac{U/m}{V/n}$$
is called the F distribution with m and n degrees of freedom and is denoted by $F_{m,n}$.
[Figure: densities of the Student t distribution with n = 1, 2, 8 degrees of freedom together with the standard normal density.]
[Figure: densities of the F distribution for (n, m) = (4, 4), (10, 4), (10, 10), (4, 10).]
4.3 Sample mean and sample variance
Let $X_1,\dots,X_n$ be i.i.d. $N(\mu,\sigma^2)$ random variables, and denote the sample mean by $\bar X_n=(X_1+\dots+X_n)/n$ and the sample variance by $s_n^2=\frac{1}{n-1}\sum_{i=1}^n(X_i-\bar X_n)^2$.
Proposition 4.3.1. The random variable $\bar X_n$ and the vector of random variables $(X_1-\bar X_n, X_2-\bar X_n,\dots,X_n-\bar X_n)$ are independent.
Proof. We write
$$s\bar X_n+\sum_{i=1}^n t_i(X_i-\bar X_n)=\sum_{i=1}^n a_iX_i,$$
where $a_i=\frac{s}{n}+(t_i-\bar t)$. Note that
$$\sum_{i=1}^n a_i=s \quad\text{and}\quad \sum_{i=1}^n a_i^2=\frac{s^2}{n}+\sum_{i=1}^n(t_i-\bar t)^2.$$
Since the $X_i$ are independent $N(\mu,\sigma^2)$,
$$E\exp\Big(i\sum_{i=1}^n a_iX_i\Big)=\prod_{i=1}^n e^{ia_i\mu-a_i^2\sigma^2/2}=\exp\Big(is\mu-\frac{s^2\sigma^2}{2n}\Big)\exp\Big(-\frac{\sigma^2}{2}\sum_{i=1}^n(t_i-\bar t)^2\Big).$$
The first factor is the c.f. of $\bar X_n$ while the second factor is the c.f. of $(X_1-\bar X_n, X_2-\bar X_n,\dots,X_n-\bar X_n)$ (this is obtained by letting $s=0$ in the formula). This implies the desired result.
Theorem 4.3.3. The distribution of $(n-1)s_n^2/\sigma^2$ is the chi-square distribution with $n-1$ degrees of freedom.
Proof. Write
$$W:=\frac{1}{\sigma^2}\sum_{i=1}^n(X_i-\mu)^2=\frac{1}{\sigma^2}\sum_{i=1}^n(X_i-\bar X_n)^2+\Big(\frac{\bar X_n-\mu}{\sigma/\sqrt n}\Big)^2=:U+V.$$
Since U and V are independent (Proposition 4.3.1), $\varphi_W(t)=\varphi_U(t)\varphi_V(t)$. Since W and V follow chi-square distributions with n and 1 degrees of freedom respectively,
$$\varphi_U(t)=\frac{\varphi_W(t)}{\varphi_V(t)}=\frac{(1-2it)^{-n/2}}{(1-2it)^{-1/2}}=(1-2it)^{-(n-1)/2}.$$
The last expression is the c.f. of a random variable with the $\chi^2_{n-1}$ distribution, and $U=(n-1)s_n^2/\sigma^2$.
Corollary 4.3.4.
$$\frac{\bar X_n-\mu}{s_n/\sqrt n}\sim t_{n-1}.$$
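A quick simulation check of this corollary; the values of µ, σ, n and the number of replications below are arbitrary choices for illustration.

```python
import numpy as np
from scipy.stats import t

rng = np.random.default_rng(4)
mu, sigma, n, reps = 5.0, 2.0, 8, 200_000        # assumed parameters
X = rng.normal(mu, sigma, size=(reps, n))
xbar = X.mean(axis=1)
s = X.std(axis=1, ddof=1)                        # sample standard deviation s_n
T = (xbar - mu) / (s / np.sqrt(n))               # studentized mean

# Simulated tail probabilities versus the t distribution with n - 1 degrees of freedom.
for q in [1.0, 1.5, 2.0]:
    print(q, np.mean(T > q), t.sf(q, df=n - 1))
```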
4.4 Exercises
4.1. Show that
2. if X ∼ tn then X 2 ∼ F1,n .
3. the Cauchy distribution and the t distribution with 1 degree of freedom are the same.
4. If X and Y are independent exponential random variables with $\lambda=1$, then X/Y follows an F distribution.
4.2. Show how to use the chi-square distribution to calculate P(a < s2n /σ 2 < b).
4.5. Let X1 , X2 and X3 be three independent chi-square variables with r1 , r2 and r3 degrees of
freedom, respectively.
2. Deduce that
$$\frac{X_1/r_1}{X_2/r_2} \quad\text{and}\quad \frac{X_3/r_3}{(X_1+X_2)/(r_1+r_2)}$$
are independent F-variables.
Chapter 5
Parameter estimation
Example 5.1.2. An urn contains m balls, labeled 1, 2, ..., m, which are identical except for the number. The experiment is to choose a ball at random and record its number. Let X denote the number. Then the distribution of X is given by
$$P[X=k]=\frac{1}{m}, \quad k=1,\dots,m.$$
In case m is unknown, to obtain information on m we take a sample of n balls, which we will
denote as X = (X1 , . . . , Xn ) where Xi is the number on the ith ball.
The sample can be drawn in several ways.
1. Sampling with replacement: We randomly select a ball, record its number and put it
back to the urn. All the balls are then remixed, and the next ball is chosen. We can see
that X1 , . . . , Xn are mutually independent random variables and each has the same dis-
tribution as X. Hence (X1 , . . . , Xn ) is a random sample.
2. Sampling without replacement: Here n balls are selected at random. After a ball is se-
lected, we do not return it to the urn. The X1 , . . . , Xn are not independent, but each Xi
has the same distribution as X.
If m is much greater than n, the sampling schemes are practically the same.
1. Sample mean
$$\bar X_n=\frac{X_1+\dots+X_n}{n}.$$
2. Population variance
$$S_n^2(X)=\frac{1}{n}\sum_{i=1}^n(X_i-\bar X_n)^2,$$
and sample variance
$$s_n^2(X)=\frac{1}{n-1}\sum_{i=1}^n(X_i-\bar X_n)^2.$$
(These and the quantities listed below are illustrated in the short code sketch following this list.)
6. The sample median is a measure of central tendency that divides the data into two equal
parts, half below the median and half above. If the number of observations is even, the
median is halfway between the two central values. If the number of observations is odd,
the median is the central value.
7. When an ordered set of data is divided into four equal parts, the division points are called
quartiles. The first or lower quartile, q1 , is a value that has approximately 25% of the ob-
servations below it and approximately 75% of the observations above. The second quar-
tile, q2 , has approximately 50% of the observations below its value. The second quartile
is exactly equal to the median. The third or upper quartile, q3 , has approximately 75% of
the observations below its value.
8. The interquartile range is defined as IQR = q3 − q1 . The IQR is also used as a measure
of variability.
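The following sketch computes the sample characteristics above with numpy; the data are the first ten weights from the table of Example 5.2.1 below, used here only as an illustration.

```python
import numpy as np

x = np.array([59.0, 59.5, 52.7, 47.9, 55.7, 48.3, 52.1, 53.1, 55.2, 45.3])

xbar = x.mean()                         # sample mean
S2 = x.var(ddof=0)                      # population variance (divisor n)
s2 = x.var(ddof=1)                      # sample variance (divisor n - 1)
median = np.median(x)                   # sample median
q1, q2, q3 = np.percentile(x, [25, 50, 75])   # quartiles
iqr = q3 - q1                           # interquartile range

print(xbar, S2, s2, median, (q1, q2, q3), iqr)
```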
1. Divide each number xi into two parts: a stem, consisting of one or more of the leading
digits and a leaf, consisting of the remaining digit.
Example 5.2.1. The weights of 80 students are given in the following table.
59.0 59.5 52.7 47.9 55.7 48.3 52.1 53.1 55.2 45.3
46.5 54.8 48.4 53.1 56.9 47.4 50.2 52.1 49.6 46.4
52.9 41.1 51.0 50.0 56.8 45.9 59.5 52.8 46.7 55.7
48.6 51.6 53.2 54.1 45.8 50.4 54.1 52.0 56.2 62.7
62.0 46.8 54.6 54.7 50.2 45.9 49.1 42.6 49.8 52.1
56.5 53.5 46.5 51.9 46.5 53.5 45.5 50.2 55.1 49.6
47.6 44.8 55.0 56.2 49.4 57.0 52.4 48.4 55.0 47.1
52.4 56.8 53.2 50.5 56.6 49.5 53.1 51.2 55.5 53.7
in most cases and that the number of bins should increase with n. Choosing the number of
bins approximately equal to the square root of the number of observations often works well
in practice.
The histogram is a visual display of the frequency distribution. The stages for constructing
a histogram follow.
2. Mark and label the vertical scale with the frequencies or the relative frequencies.
3. Above each bin, draw a rectangle whose height is equal to the frequency (or relative fre-
quency) corresponding to that bin.
(Figure: histogram of the students' weights — number of students against weight, range roughly 40 to 60.)
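The bin counts behind this figure can be reproduced directly. The following Python sketch (our own illustration; the variable names are not from the text) applies the square-root rule for the number of bins to the 80 weights of Example 5.2.1.

import numpy as np

# Weights of the 80 students from Example 5.2.1.
weights = np.array([
    59.0, 59.5, 52.7, 47.9, 55.7, 48.3, 52.1, 53.1, 55.2, 45.3,
    46.5, 54.8, 48.4, 53.1, 56.9, 47.4, 50.2, 52.1, 49.6, 46.4,
    52.9, 41.1, 51.0, 50.0, 56.8, 45.9, 59.5, 52.8, 46.7, 55.7,
    48.6, 51.6, 53.2, 54.1, 45.8, 50.4, 54.1, 52.0, 56.2, 62.7,
    62.0, 46.8, 54.6, 54.7, 50.2, 45.9, 49.1, 42.6, 49.8, 52.1,
    56.5, 53.5, 46.5, 51.9, 46.5, 53.5, 45.5, 50.2, 55.1, 49.6,
    47.6, 44.8, 55.0, 56.2, 49.4, 57.0, 52.4, 48.4, 55.0, 47.1,
    52.4, 56.8, 53.2, 50.5, 56.6, 49.5, 53.1, 51.2, 55.5, 53.7,
])

n_bins = int(round(np.sqrt(len(weights))))      # square-root rule: about 9 bins here
freq, edges = np.histogram(weights, bins=n_bins)
for lo, hi, f in zip(edges[:-1], edges[1:], freq):
    print(f"[{lo:5.1f}, {hi:5.1f}): {f}")        # bar heights of the histogram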
A point that is more than 1.5 interquartile ranges from the box edge is called an outlier. A point more than 3 interquartile ranges from the box
edge is called an extreme outlier.
Example 5.2.3. Consider the sample in Example 5.2.1. The quartiles of the sample are q1 =
48.40, q2 = 52.10, q3 = 54.85. Below is the box plot of the students' weights.
(Figure: box plot of the students' weights.)
158.7 167.6 164.0 153.1 179.3 153.0 170.6 152.4 161.5 146.7
147.2 158.2 157.7 161.8 168.4 151.2 158.7 161.0 147.9 155.5
about the form of the underlying distribution. However, histograms are usually not really re-
liable indicators of the distribution form unless the sample size is very large. Probability plot-
ting is a graphical method for determining whether sample data conform to a hypothesized
distribution based on a subjective visual examination of the data. The general procedure is
very simple and can be performed quickly. It is also more reliable than the histogram for small
to moderate size samples.
To construct a probability plot, the observations in the sample are first ranked from small-
est to largest. That is, the sample x1 , x2 , . . . , xn is arranged as x(1) ≤ x(2) ≤ . . . ≤ x(n) . The
ordered observations x(j) are then plotted against their observed cumulative frequency (j −
0.5)/n. If the hypothesized distribution adequately describes the data, the plotted points will
fall approximately along a straight line (often drawn through the 25th and 75th percentile
points of the plot); if the plotted points deviate significantly from a straight line, the hypothe-
sized model is not appropriate. Usually, the determination of whether or not the data plot as
a straight line is subjective.
In particular, a normal probability plot can be constructed by plotting the standardized
normal scores $z_j = \Phi^{-1}\big(\frac{j-0.5}{n}\big)$ against $x_{(j)}$.
2.86, 3.33, 3.43, 3.77, 4.16, 3.52, 3.56, 3.63, 2.43, 2.78.
Since all the points are very close to the straight line, one may conclude that a normal distri-
bution adequately describes the data.
Remark 4. This is a very subjective method. Please use it at your own risk! Later we will intro-
duce the Shapiro–Wilk test for the normal distribution hypothesis.
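For illustration, the plotting positions described above can be computed as follows. This is a minimal Python sketch; the function name is ours, and the ten observations quoted above are reused only as a convenient test sample.

import numpy as np
from scipy.stats import norm

def normal_prob_plot_points(sample):
    # Ordered observations x_(j) paired with the standardized normal scores
    # z_j = Phi^{-1}((j - 0.5)/n); plotting z_j against x_(j) gives the
    # normal probability plot described in the text.
    x = np.sort(np.asarray(sample, dtype=float))
    n = len(x)
    j = np.arange(1, n + 1)
    z = norm.ppf((j - 0.5) / n)
    return x, z

sample = [2.86, 3.33, 3.43, 3.77, 4.16, 3.52, 3.56, 3.63, 2.43, 2.78]
for xj, zj in zip(*normal_prob_plot_points(sample)):
    print(f"{xj:5.2f}  {zj:7.3f}")   # a roughly linear pattern suggests normality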
5.3.1 Statistics
Example 5.3.1. We continue Example 5.1.2. Recall that we do not know the number of balls
m and have to use the sample (X1 , . . . , Xn ) to obtain information about m.
Since $E(X) = \frac{m+1}{2}$, using the law of large numbers, we have
$$\frac{X_1 + \dots + X_n}{n} \xrightarrow{a.s.} \frac{m+1}{2}.$$
Therefore, we get the first estimator for m given by
$$\hat m_n := 2\,\frac{X_1 + \dots + X_n}{n} - 1 \xrightarrow{a.s.} m.$$
Another estimator for m is defined by
$$\tilde m_n := \max\{X_1, \dots, X_n\}.$$
Since
$$P[\tilde m_n \neq m] = P[X_1 < m, \dots, X_n < m] = \prod_{i=1}^n P[X_i < m] = \Big(\frac{m-1}{m}\Big)^n \to 0$$
as n → ∞, we have $\tilde m_n \xrightarrow{a.s.} m$.
The estimators m̂n and m̃n are called statistics: they depend only on the observations
X1 , . . . , Xn , not on m.
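A small simulation makes the behaviour of the two statistics concrete. The sketch below (Python; the chosen values m = 100, n = 25 and the function name are ours, not from the text) draws repeated samples with replacement and averages the two estimators.

import numpy as np

rng = np.random.default_rng(0)

def compare_estimators(m=100, n=25, n_rep=10_000):
    # Sampling with replacement from the urn {1, ..., m}.
    X = rng.integers(1, m + 1, size=(n_rep, n))
    m_hat = 2 * X.mean(axis=1) - 1        # first estimator: 2 * sample mean - 1
    m_tilde = X.max(axis=1)               # second estimator: sample maximum
    return m_hat.mean(), m_tilde.mean()

print(compare_estimators())   # both averages are close to m = 100;
                              # the maximum is slightly biased downward for finite n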
Definition 5.3.2. Let X = (X1 , . . . , Xn ) be a sample observed from X and (T, BT ) a measur-
able space. Then any function ϕ(X) = ϕ(X1 , . . . , Xn ), where ϕ : (Rn , B(Rn )) → (T, BT ) is a
measurable function, of the sample is called a statistic.
In the following we only consider the case that (T, BT ) is a subset of (Rd , B(Rd )).
Definition 5.3.3. Let X = (X1 , . . . , Xn ) be a sample observed from a distribution with density
f (x; θ), θ ∈ Θ. Let Y = ϕ(X) be a statistic with density fY (y; θ). Then Y is called a sufficient
statistic for θ if
$$\frac{f(x;\theta)}{f_Y(\varphi(x);\theta)} = H(x),$$
where x = (x1 , . . . , xn ), f (x; θ) is the density of X at x, and H(x) does not depend on θ ∈ Θ.
Example 5.3.4. Let (X1 , . . . , Xn ) be a sample observed from a Poisson distribution with pa-
rameter λ > 0. Then Yn = ϕ(X) = X1 + . . . + Xn has Poisson distribution with parameter nλ.
Hence
$$\frac{f(X;\lambda)}{f_{Y_n}(\varphi(X);\lambda)} = \frac{\prod_{i=1}^n f(X_i;\lambda)}{f_{Y_n}(Y_n; n\lambda)} = \frac{e^{-n\lambda}\lambda^{\sum_{i=1}^n X_i}\big/\prod_{i=1}^n X_i!}{e^{-n\lambda}(n\lambda)^{Y_n}\big/Y_n!} = \frac{Y_n!}{n^{Y_n}\prod_{i=1}^n X_i!}.$$
In order to directly verify the sufficiency of statistic ϕ(X) we need to know the density of
ϕ(X) which is not always the case in practice. We next introduce the following criterion of
Neyman to overcome this difficulty.
Example 5.3.6. Let (X1 , . . . , Xn ) be a sample from normal distribution N (θ, 1) with θ ∈ Θ = R.
Denote $\bar x = \frac1n\sum_{i=1}^n x_i$. The joint density of X1 , . . . , Xn at (x1 , . . . , xn ) is given by
$$\frac{1}{(2\pi)^{n/2}}\exp\Big[-\sum_{i=1}^n\frac{(x_i-\theta)^2}{2}\Big] = \exp\Big[-\frac{n(\bar x-\theta)^2}{2}\Big]\cdot\frac{\exp\big[-\sum_{i=1}^n (x_i-\bar x)^2/2\big]}{(2\pi)^{n/2}}.$$
We see that the first factor on the right hand side depends upon x1 , . . . , xn only through x̄ and
the second factor does not depend upon θ; hence the factorization theorem implies that the
mean X̄ of the sample is a sufficient statistic for θ, the mean of the normal distribution.
(a) Eθ [ϕ(X1 , . . . , Xn )] = θ;
(b) Dθ ϕ(X1 , . . . , Xn ) ≤ Dθ ϕ̄(X1 , . . . , Xn ) for any unbiased estimator ϕ̄(X1 , . . . , Xn ) of θ.
4. a consistent estimator of θ if
$$\varphi(X_1, \dots, X_n) \xrightarrow{P_\theta} \theta \quad\text{as } n \to \infty.$$
Here we denote by Eθ , Dθ , Pθ the expectation, variance and probability under the condition
that the distribution of Xi is F (x; θ).
Example 5.3.8. Let (X1 , . . . , Xn ) be a sample from normal distribution N (a, σ 2 ). Using the
linearity of expectation and laws of large numbers, we have
Example 5.3.9. In Example 5.3.1, both m̂n and m̃n are consistent estimators of m. Moreover,
m̂n is unbiased and m̃n is asymptotically unbiased.
Example 5.3.10. Let (X1 , . . . , Xn ) be a sample from normal distribution N(a; σ 2 ) where σ 2 is
known. We know that X̄n is an unbiased, consistent estimator of a. But how close is X̄n to a?
Since $\bar X_n \sim N(a; \sigma^2/n)$, the quantity $(\bar X_n - a)/(\sigma/\sqrt n)$ has a standard normal N(0; 1) distribution.
Therefore,
$$0.954 = P\Big[-2 < \frac{\bar X_n - a}{\sigma/\sqrt n} < 2\Big] = P\Big[\bar X_n - 2\frac{\sigma}{\sqrt n} < a < \bar X_n + 2\frac{\sigma}{\sqrt n}\Big]. \quad (5.2)$$
Expression (5.2) says that before the sample is drawn, the probability that a belongs to the
random interval $\bar X_n - 2\sigma/\sqrt n < a < \bar X_n + 2\sigma/\sqrt n$ is 0.954. After the sample is drawn, the realized
interval $\bar x_n - 2\sigma/\sqrt n < a < \bar x_n + 2\sigma/\sqrt n$ has either trapped a or it has not. But because of the
high probability of success before the sample is drawn, we call the interval $\bar X_n - 2\sigma/\sqrt n < a < \bar X_n + 2\sigma/\sqrt n$ a 95.4% confidence interval for a. We can say, with some confidence, that x̄ is within
$2\sigma/\sqrt n$ of a. The number 0.954 = 95.4% is called a confidence coefficient. Instead of using 2,
we could use, say, 1.645, 1.96 or 2.576 to obtain 90%, 95% or 99% confidence intervals for a.
Note that the lengths of these confidence intervals increase as the confidence increases; i.e.,
the increase in confidence implies a loss in precision. On the other hand, for any confidence
coefficient, an increase in sample size leads to shorter confidence intervals.
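For concreteness, the interval above can be computed as follows. This is a minimal Python sketch under the stated assumptions (normal population with known σ); the function name and the simulated data with a = 5, σ = 2 are ours, not from the text.

import numpy as np
from scipy.stats import norm

def mean_ci_known_sigma(x, sigma, conf=0.954):
    # x_bar +/- z * sigma / sqrt(n); conf = 0.954 corresponds to z = 2.
    x = np.asarray(x, dtype=float)
    z = norm.ppf((1 + conf) / 2)
    half = z * sigma / np.sqrt(len(x))
    return x.mean() - half, x.mean() + half

rng = np.random.default_rng(1)
sample = rng.normal(loc=5.0, scale=2.0, size=30)
print(mean_ci_known_sigma(sample, sigma=2.0))             # 95.4% interval
print(mean_ci_known_sigma(sample, sigma=2.0, conf=0.99))  # wider, as noted above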
In the following, thanks to the central limit theorem, we will present a general method to
find confidence intervals for parameters of a large class of distributions. To avoid confusion,
let θ0 denote the true, unknown value of the parameter θ. Suppose ϕ is an estimator of θ0 such
that
$$\sqrt n(\varphi - \theta_0) \xrightarrow{w} N(0, \sigma_\varphi^2). \quad (5.3)$$
The parameter σϕ² is the asymptotic variance of $\sqrt n\,\varphi$ and, in practice, it is usually unknown.
For the present, though, we suppose that σϕ² is known.
Let $Z = \sqrt n(\varphi - \theta_0)/\sigma_\varphi$ be the standardized random variable. Then Z is asymptotically
N(0, 1). Hence, P[−1.96 < Z < 1.96] = 0.95. This implies
$$0.95 = P\Big[\varphi - 1.96\frac{\sigma_\varphi}{\sqrt n} < \theta_0 < \varphi + 1.96\frac{\sigma_\varphi}{\sqrt n}\Big]. \quad (5.4)$$
Because the interval $\varphi - 1.96\sigma_\varphi/\sqrt n < \theta_0 < \varphi + 1.96\sigma_\varphi/\sqrt n$ is a function of the random variable
ϕ, we will call it a random interval. The probability that the random interval contains θ0 is
approximately 0.95.
In practice, we often do not know σϕ . Suppose that there exists a consistent estimator
of σϕ , say Sϕ . It then follows from Slutsky's theorem that
$$\frac{\sqrt n(\varphi - \theta_0)}{S_\varphi} \xrightarrow{w} N(0, 1).$$
Hence the interval $\big(\varphi - 1.96 S_\varphi/\sqrt n,\ \varphi + 1.96 S_\varphi/\sqrt n\big)$ would be a random interval with approx-
imate probability 0.95 of covering θ0 .
In general, we have the following definition.
Definition 5.3.11. Let (X1 , . . . , Xn ) be a sample from a distribution F (x, θ), θ ∈ Θ, and let
α ∈ [0, 1]. A random interval (ϕ1 , ϕ2 ), where ϕ1 and ϕ2 are estimators of θ, is called a (1 − α)-
confidence interval for θ if
$$P(\varphi_1 < \theta < \varphi_2) = 1 - \alpha.$$
1. Because α < α′ implies that xα/2 > xα′/2 , selection of higher values for confidence coef-
ficients leads to larger error terms and hence, longer confidence intervals, assuming all
else remains the same.
2. Choosing a larger sample size decreases the error part and hence, leads to shorter con-
fidence intervals, assuming all else stays the same.
3. Usually the parameter σ is some type of scale parameter of the underlying distribution.
In these situations, assuming all else remains the same, an increase in scale (noise level),
generally results in larger error terms and, hence, longer confidence intervals.
Let X1 , . . . , Xn be a random sample from the Bernoulli distribution with probability of suc-
cess p. Let p̂ = X̄ be the sample proportion of successes. It follows from the Central Limit
Theorem that p̂ has an approximate N(p; p(1 − p)/n) distribution. Since p̂ and p̂(1 − p̂) are
consistent estimators of p and p(1 − p), respectively, an approximate (1 − α) confidence
interval for p is given by
$$\Big(\hat p - z_{\alpha/2}\sqrt{\frac{\hat p(1-\hat p)}{n}},\ \hat p + z_{\alpha/2}\sqrt{\frac{\hat p(1-\hat p)}{n}}\Big).$$
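A short Python sketch of this interval (the function name is ours), applied to the helmet data of Exercise 5.12, where 18 of 50 helmets showed damage:

import numpy as np
from scipy.stats import norm

def proportion_ci(successes, n, alpha=0.05):
    # p_hat +/- z_{alpha/2} * sqrt(p_hat (1 - p_hat) / n)
    p_hat = successes / n
    half = norm.ppf(1 - alpha / 2) * np.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half, p_hat + half

print(proportion_ci(18, 50, alpha=0.05))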
In general, the confidence intervals developed so far in this section are approximate. They
are based on the Central Limit Theorem and also, often require a consistent estimate of the
variance. In our next example, we develop an exact confidence interval for the mean when
sampling from a normal distribution.
Note that the only difference between this confidence interval and the large sample confi-
dence interval (5.5) is that tα/2,n−1 replaces zα/2 . This one is exact while (5.5) is approximate.
Of course, we have to assume we are sampling a normal population to get the exactness. In
practice, we often do not know if the population is normal. Which confidence interval should
we use? Generally, for the same α, the intervals based on tα/2,n−1 are larger than those based
on zα/2 . Hence, the interval (5.6) is generally more conservative than the interval (5.5). So in
practice, statisticians generally prefer the interval (5.6).
Sometimes confidence intervals on the population variance or standard deviation are needed.
When the population is modelled by a normal distribution, the tests and intervals described
in this section are applicable. The following result provides the basis of constructing these
confidence intervals.
Theorem 5.3.14. If s² is the sample variance from a random sample of n observations from a
normal distribution with unknown variance σ², then a 100(1 − α)% CI on σ² is
$$\frac{(n-1)s^2}{c^2_{\alpha/2,n-1}} \le \sigma^2 \le \frac{(n-1)s^2}{c^2_{1-\alpha/2,n-1}},$$
where $c^2_{a,n-1}$ satisfies $P(\chi^2_{n-1} > c^2_{a,n-1}) = a$ and the random variable $\chi^2_{n-1}$ has a chi-square dis-
tribution with n − 1 degrees of freedom.
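The interval of Theorem 5.3.14 is easy to evaluate numerically. The Python sketch below (our own function name) uses the hole-diameter data of Exercise 5.9 (n = 15, s = 0.008 mm, 99% CI) as a check.

import numpy as np
from scipy.stats import chi2

def variance_ci(s2, n, alpha=0.05):
    # ((n-1) s^2 / c^2_{alpha/2, n-1},  (n-1) s^2 / c^2_{1-alpha/2, n-1}),
    # where c^2_{a, n-1} is the upper-a point of the chi-square distribution.
    upper = chi2.ppf(1 - alpha / 2, df=n - 1)   # c^2_{alpha/2, n-1}
    lower = chi2.ppf(alpha / 2, df=n - 1)       # c^2_{1-alpha/2, n-1}
    return (n - 1) * s2 / upper, (n - 1) * s2 / lower

print(variance_ci(0.008 ** 2, n=15, alpha=0.01))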
A practical problem of interest is the comparison of two distributions; that is, comparing
the distributions of two random variables, say X and Y . In this section, we will compare the
means of X and Y . Denote the means of X and Y by aX and aY , respectively. In particular, we
shall obtain confidence intervals for the difference ∆ = aX − aY . Assume that the variances
of X and Y are finite and denote them by σX² = Var(X) and σY² = Var(Y ). Let X1 , . . . , Xn be
a random sample from the distribution of X and let Y1 , . . . , Ym be a random sample from the
distribution of Y . Assume that the samples were gathered independently of one another. Let
X̄ and Ȳ be the sample means of X and Y , respectively, and let ∆̂ = X̄ − Ȳ . Next we obtain a
large sample confidence interval for ∆ based on the asymptotic distribution of ∆̂.
Proposition 5.3.15. Let N = n + m denote the total sample size. We suppose that
$$\frac{n}{N} \to \lambda_X \quad\text{and}\quad \frac{m}{N} \to \lambda_Y, \quad\text{where } \lambda_X + \lambda_Y = 1.$$
Then a (1 − α) confidence interval for ∆ is
1. (if σX² and σY² are known)
$$\Big((\bar X - \bar Y) - z_{\alpha/2}\sqrt{\frac{\sigma_X^2}{n} + \frac{\sigma_Y^2}{m}},\ (\bar X - \bar Y) + z_{\alpha/2}\sqrt{\frac{\sigma_X^2}{n} + \frac{\sigma_Y^2}{m}}\Big); \quad (5.7)$$
2. (if σX² and σY² are unknown)
$$\Big((\bar X - \bar Y) - z_{\alpha/2}\sqrt{\frac{s^2(X)}{n} + \frac{s^2(Y)}{m}},\ (\bar X - \bar Y) + z_{\alpha/2}\sqrt{\frac{s^2(X)}{n} + \frac{s^2(Y)}{m}}\Big), \quad (5.8)$$
where s²(X) and s²(Y ) are the sample variances of (Xn ) and (Ym ), respectively.
Proof. It follows from the Central Limit Theorem that $\sqrt n(\bar X - a_X) \xrightarrow{w} N(0; \sigma_X^2)$. Thus,
$$\sqrt N(\bar X - a_X) \xrightarrow{w} N\Big(0; \frac{\sigma_X^2}{\lambda_X}\Big).$$
Likewise,
$$\sqrt N(\bar Y - a_Y) \xrightarrow{w} N\Big(0; \frac{\sigma_Y^2}{\lambda_Y}\Big).$$
Since the samples are independent of one another, we have
$$\sqrt N\big((\bar X - \bar Y) - (a_X - a_Y)\big) \xrightarrow{w} N\Big(0; \frac{\sigma_X^2}{\lambda_X} + \frac{\sigma_Y^2}{\lambda_Y}\Big).$$
This implies (5.7). Since s²(X) and s²(Y ) are consistent estimators of σX² and σY², applying
Slutsky's theorem, we obtain (5.8).
Let X and Y be two independent random variables with Bernoulli distributions B(1, p1 )
and B(1, p2 ), respectively. Let X1 , . . . , Xn be a random sample from the distribution of X and
let Y1 , . . . , Ym be a random sample from the distribution of Y .
Definition 5.4.1. For each sample point x, let θ̂(x) be a parameter value at which L(x; θ) at-
tains its maximum as a function of θ, with x held fixed. A maximum likelihood estimator
(MLE) of the parameter θ based on a sample X is θ̂(X).
Example 5.4.2. Let (X1 , . . . , Xn ) be a random sample from the distribution N (θ, 1), where θ is
unknown. We have
$$L(x;\theta) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi}}e^{-(x_i-\theta)^2/2}.$$
A simple calculus shows that the MLE of θ is $\hat\theta = \frac1n\sum_{i=1}^n x_i$. One can easily verify that θ̂ is an
unbiased and consistent estimator of θ.
Example 5.4.3. Let (X1 , . . . , Xn ) be a random sample from the Bernoulli distribution with an
unknown parameter p. The likelihood function is
$$L(x;p) = \prod_{i=1}^n p^{x_i}(1-p)^{1-x_i}.$$
A simple calculus shows that the MLE of p is $\hat p = \frac1n\sum_{i=1}^n x_i$. One can easily verify that p̂ is an
unbiased and consistent estimator of p.
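When no closed form is available, the likelihood can be maximized numerically. The Python sketch below (the simulated Bernoulli data with p = 0.3 are ours, not from the text) checks that numerical maximization reproduces the closed-form MLE p̂ = x̄ of Example 5.4.3.

import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(2)
x = rng.binomial(1, 0.3, size=200)    # simulated Bernoulli sample, true p = 0.3

def neg_log_likelihood(p):
    # minus the log of L(x; p) = prod p^{x_i} (1 - p)^{1 - x_i}
    return -(x.sum() * np.log(p) + (len(x) - x.sum()) * np.log(1 - p))

res = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 1 - 1e-6), method="bounded")
print(res.x, x.mean())   # the numerical maximizer agrees with p_hat = x_bar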
Let X1 , . . . , Xn denote a random sample from the distribution with pdf f (x; θ), θ ∈ Θ. Let
θ0 denote the true value of θ. The following theorem gives a theoretical reason for maximizing
the likelihood function. It says that the maximum of L(θ) asymptotically separates the true
model at θ0 from models at θ ≠ θ0 . Then
$$\lim_{n\to\infty} P_{\theta_0}\big[L(X;\theta_0) > L(X;\theta)\big] = 1 \quad\text{for all } \theta \neq \theta_0.$$
Since the function φ(x) = − ln x is strictly convex, it follows from the Law of Large Numbers
and Jensen's inequality that
$$\frac1n\sum_{i=1}^n \ln\frac{f(X_i;\theta)}{f(X_i;\theta_0)} \xrightarrow{P} E_{\theta_0}\Big[\ln\frac{f(X_1;\theta)}{f(X_1;\theta_0)}\Big] < \ln E_{\theta_0}\Big[\frac{f(X_1;\theta)}{f(X_1;\theta_0)}\Big] = 0.$$
Note that condition f (.; θ) ≠ f (.; θ0 ) for all θ ≠ θ0 is needed to obtain the last strict inequality
while the common support is needed to obtain the last equality.
Theorem 5.4.4 says that asymptotically the likelihood function is maximized at the true value
θ0 . So in considering estimates of θ0 , it seems natural to consider the value of θ which maxi-
mizes the likelihood.
We close this section by showing that maximum likelihood estimators, under regularity
conditions, are consistent estimators.
Theorem 5.4.5. Suppose that the pdfs f (x; θ) satisfy the regularity conditions (R0) and (R1).
Then the likelihood equation
$$\frac{\partial}{\partial\theta}L(\theta) = 0 \quad\Longleftrightarrow\quad \frac{\partial}{\partial\theta}\ln L(\theta) = 0$$
has a solution θ̂n such that $\hat\theta_n \xrightarrow{P} \theta_0$.
Let
$$\mu_j = E[X^j] = g_j(\theta_1, \dots, \theta_k), \quad j = 1, \dots, k,$$
and
$$m_j = \frac1n\sum_{i=1}^n X_i^j.$$
The moments estimator (θ̂1 , . . . , θ̂k ) is obtained by solving the system of equations
$$m_j = g_j(\theta_1, \dots, \theta_k), \quad j = 1, \dots, k.$$
Example 5.4.6 (Binomial distribution). Let (X1 , . . . , Xn ) be a random sample from the Bino-
mial distribution B(k, p). Here we assume that p and k are unknown parameters. Equating
the first two sample moments to those of the population yields
$$\begin{cases} \bar X_n = kp\\ \frac1n\sum_{i=1}^n X_i^2 = kp(1-p) + k^2p^2 \end{cases} \Longleftrightarrow \begin{cases} \hat k = \dfrac{\bar X_n^2}{\bar X_n - \frac1n\sum_{i=1}^n (X_i-\bar X_n)^2}\\[6pt] \hat p = \dfrac{\bar X_n}{\hat k}. \end{cases}$$
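The two moment equations can be solved mechanically; a minimal Python sketch follows (the function name is ours, and the check uses simulated data with k = 10, p = 0.4, which are not from the text).

import numpy as np

def binomial_moment_estimates(x):
    # k_hat = x_bar^2 / (x_bar - S2) and p_hat = x_bar / k_hat,
    # where S2 = (1/n) sum (x_i - x_bar)^2, as in Example 5.4.6.
    x = np.asarray(x, dtype=float)
    x_bar, S2 = x.mean(), x.var()      # var() uses the 1/n convention
    k_hat = x_bar ** 2 / (x_bar - S2)
    return k_hat, x_bar / k_hat

rng = np.random.default_rng(3)
sample = rng.binomial(10, 0.4, size=5000)
print(binomial_moment_estimates(sample))   # close to (10, 0.4)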
Theorem 5.5.1 (Rao-Cramer Lower Bound). Let X1 , . . . , Xn be iid with common pdf
f (x; θ) for θ ∈ Θ. Assume that the regularity conditions (R0)-(R2) hold. Moreover,
suppose that Y = u(X1 , . . . , Xn ) is a statistic with mean E[Y ] = k(θ). Then
$$DY \ge \frac{[k'(\theta)]^2}{nI(\theta)}.$$
Proof. Since
$$k(\theta) = \int_{\mathbb R^n} u(x_1, \dots, x_n) f(x_1;\theta)\cdots f(x_n;\theta)\, dx_1\cdots dx_n,$$
we have
$$k'(\theta) = \int_{\mathbb R^n} u(x_1, \dots, x_n)\Big[\sum_{i=1}^n \frac{\partial \ln f(x_i;\theta)}{\partial\theta}\Big] f(x_1;\theta)\cdots f(x_n;\theta)\, dx_1\cdots dx_n.$$
Denote $Z = \sum_{i=1}^n \frac{\partial \ln f(X_i;\theta)}{\partial\theta}$. It is easy to verify that E[Z] = 0 and DZ = nI(θ). Moreover,
k′(θ) = E[Y Z]. Hence, we have
$$k'(\theta) = E[YZ] = E[Y]E[Z] + \rho\sqrt{nI(\theta)\,DY},$$
where ρ is the correlation coefficient between Y and Z. Since E[Z] = 0 and ρ² ≤ 1, we get
$$\frac{|k'(\theta)|^2}{nI(\theta)\,DY} \le 1,$$
which proves the theorem.
Definition 5.5.2. Let Y be an unbiased estimator of a parameter θ in the case of point estima-
tion. The statistic Y is called an efficient estimator of θ if and only if the variance of Y attains
the Rao-Cramer lower bound.
Example 5.5.4 (Poisson distribution). Let X1 , X2 , . . . , Xn denote a random sample from a Pois-
son distribution that has the mean θ > 0. Show that X̄ is an efficient estimator of θ.
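A sketch of the computation behind Example 5.5.4 (the intermediate steps are ours, using the Fisher information of the Poisson family): since ln f (x; θ) = x ln θ − θ − ln x!,
$$\frac{\partial}{\partial\theta}\ln f(x;\theta) = \frac{x}{\theta} - 1, \qquad I(\theta) = E_\theta\Big[\Big(\frac{X}{\theta}-1\Big)^2\Big] = \frac{\operatorname{Var}_\theta(X)}{\theta^2} = \frac{1}{\theta},$$
and, with k(θ) = θ so that k′(θ) = 1, the Rao-Cramer bound equals 1/(nI(θ)) = θ/n, which is exactly Dθ X̄. Hence X̄ attains the bound and is an efficient estimator of θ.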
In the above examples, we were able to obtain the MLEs in closed form along with their
distributions and, hence, moments. This is often not the case. Maximum likelihood esti-
mators, however, have an asymptotic normal distribution. In fact, MLEs are asymptotically
efficient.
Theorem 5.5.5. Assume X1 , . . . , Xn are iid with pdf f (x; θ0 ) for θ0 ∈ Θ such that the regularity
conditions (R0)-(R5) are satisfied. Suppose further that 0 < I(θ0 ) < ∞, and
(R6) the pdf f (x; θ) is three times differentiable as a function of θ; moreover, for all θ ∈ Θ, there
exists a constant c and a function M (x) such that
$$\Big|\frac{\partial^3 \ln f(x;\theta)}{\partial\theta^3}\Big| \le M(x),$$
with $E_{\theta_0}[M(X)] < \infty$, for all θ0 − c < θ < θ0 + c and all x in the support of X.
Then
$$\sqrt n(\hat\theta_n - \theta_0) \xrightarrow{w} N\big(0, 1/I(\theta_0)\big),$$
where θ̂n is a consistent solution of the likelihood equation as in Theorem 5.4.5.
Proof. Expanding the function l′(θ) into a Taylor series of order two about θ0 and evaluating it
at θ̂n , we get
$$l'(\hat\theta_n) = l'(\theta_0) + (\hat\theta_n - \theta_0)l''(\theta_0) + \frac12(\hat\theta_n - \theta_0)^2 l'''(\theta_n^*),$$
where θn* is between θ0 and θ̂n . But l′(θ̂n ) = 0. Hence,
$$\sqrt n(\hat\theta_n - \theta_0) = \frac{n^{-1/2}\, l'(\theta_0)}{-n^{-1} l''(\theta_0) - (2n)^{-1}(\hat\theta_n - \theta_0) l'''(\theta_n^*)}.$$
Note that |θ̂n − θ0 | < c0 implies that |θn* − θ0 | < c0 ; thanks to Condition (R6), we have
$$\Big|\frac1n l'''(\theta_n^*)\Big| \le \frac1n\sum_{i=1}^n \Big|\frac{\partial^3 \ln f(X_i;\theta)}{\partial\theta^3}\Big| \le \frac1n\sum_{i=1}^n M(X_i).$$
Since $E_{\theta_0}|M(X)| < \infty$, applying the Law of Large Numbers, we have $\frac1n\sum_{i=1}^n M(X_i) \xrightarrow{P} E_{\theta_0}[M(X)]$.
Moreover, since $\hat\theta_n \xrightarrow{P} \theta_0$, for any ε > 0, there exists N > 0 so that $P[|\hat\theta_n - \theta_0| < c_0] \ge 1 - \frac\varepsilon2$
and
$$P\Big[\Big|\frac1n\sum_{i=1}^n M(X_i) - E_{\theta_0}[M(X)]\Big| < 1\Big] \ge 1 - \frac\varepsilon2.$$
5.6 Exercises
5.2. A confidence interval estimate is desired for the gain in a circuit on a semiconductor
device. Assume that gain is normally distributed with standard deviation σ = 20.
5.3. Following are two confidence interval estimates of the mean µ of the cycles to failure of
an automotive door latch mechanism (the test was conducted at an elevated stress level to
accelerate the failure).
2. The confidence level for one of these CIs is 95% and the confidence level for the other is
99%. Both CIs are calculated from the same sample data. Which is the 95% CI? Explain
why.
5.4. n = 100 random samples of water from a fresh water lake were taken and the calcium
concentration (milligrams per liter) measured. A 95% CI on the mean calcium concentration
is 0.49 ≤ µ ≤ 0.82.
1. Would a 99% CI calculated from the same sample data have been longer or shorter?
2. Consider the following statement: There is a 95% chance that µ is between 0.49 and 0.82.
Is this statement correct? Explain your answer.
3. Consider the following statement: If n = 100 random samples of water from the lake
were taken and the 95% CI on µ computed, and this process was repeated 1000 times,
950 of the CIs will contain the true value of µ. Is this statement correct? Explain your
answer.
5.5. A research engineer for a tire manufacturer is investigating tire life for a new rubber com-
pound and has built 16 tires and tested them to end-of-life in a road test. The sample mean
and standard deviation are 60,139.7 and 3645.94 kilometers. Find a 95% confidence interval
on mean tire life.
5.6. An Izod impact test was performed on 20 specimens of PVC pipe. The sample mean is
X̄ = 1.25 and the sample standard deviation is S = 0.25. Find a 99% lower confidence bound
on Izod impact strength.
5.7. The compressive strength of concrete is being tested by a civil engineer. He tests 12 spec-
imens and obtains the following data.
2216 2237 2225 2301 2318 2255
2249 2204 2281 2263 2275 2295
1. Is there evidence to support the assumption that compressive strength is normally dis-
tributed? Does this data set support your point of view? Include a graphical display in
your answer.
5.8. A machine produces metal rods. A random sample of 15 rods is selected, and the diame-
ter is measured. The resulting data (in millimetres) are as follows.
8.24 8.25 8.2 8.23 8.24
8.21 8.26 8.26 8.2 8.25
8.23 8.23 8.19 8.28 8.24
5.9. A rivet is to be inserted into a hole. A random sample of n = 15 parts is selected, and
the hole diameter is measured. The sample standard deviation of the hole diameter measure-
ments is s = 0.008 millimeters. Construct a 99% CI for σ 2 .
5.10. The sugar content of the syrup in canned peaches is normally distributed with standard
deviation σ. A random sample of n = 10 cans yields a sample standard deviation of s = 4.8
milligrams. Find a 95% CI for σ.
5.11. Of 1000 randomly selected cases of lung cancer, 823 resulted in death within 10 years.
2. How large a sample would be required to be at least 95% confident that the error in
estimating the 10-year death rate from lung cancer is less than 0.03?
5.12. A random sample of 50 suspension helmets used by motorcycle riders and automobile
race-car drivers was subjected to an impact test, and on 18 of these helmets some damage
was observed.
1. Find a 95% CI on the true proportion of helmets of this type that would show damage
from this test.
2. Using the point estimate of p obtained from the preliminary sample of 50 helmets, how
many helmets must be tested to be 95% confident that the error in estimating the true
value of p is less than 0.02?
3. How large must the sample be if we wish to be at least 95% confident that the error in
estimating p is less than 0.02, regardless of the true value of p?
where α1 + α2 = α. If α1 = α2 = α/2, we have the usual 100(1 − α)% CI for µ. In the above, when
α1 ≠ α2 , the CI is not symmetric about µ. The length of the interval is $L = \sigma(z_{\alpha_1} + z_{\alpha_2})/\sqrt n$.
Prove that the length of the interval L is minimized when α1 = α2 = α/2.
5.14. Let the observed value of the mean X̄ of a random sample of size 20 from a distribution
that is N (µ, 80) be 81.2. Find a 95 percent confidence interval for µ.
5.15. Let X̄ be the mean of a random sample of size n from a distribution that is N (µ, 9). Find
n such that P[X̄ − 1 < µ < X̄ + 1] = 0.90, approximately.
5.16. Let a random sample of size 17 from the normal distribution N (µ, σ 2 ) yield x̄ = 4.7 and
s2 = 5.76. Determine a 90 percent confidence interval for µ.
5.17. Let X̄ denote the mean of a random sample of size n from a distribution that has mean
µ and variance σ 2 = 10. Find n so that the probability is approximately 0.954 that the random
interval (X̄ − 21 , X̄ + 12 ) includes µ.
1. If σ 2 is known, find a minimum value for n to guarantee that a 0.95 CI for µ will have
length no more than σ/4.
2. If σ 2 is unknown, find a minimum value for n to guarantee, with probability 0.90, that a
0.95 CI for µ will have length no more than σ/4.
5.21. Let (X1 , . . . , Xn ) be iid uniform U(0; θ). Let Y be the largest order statistics. Show that
the distribution of Y /θ does not depend on θ, and find the shortest (1 − α) CI for θ.
$$f(x;\theta) = \frac{e^{\theta-x}}{(1+e^{\theta-x})^2}, \quad x \in \mathbb R,\ \theta \in \mathbb R.$$
5.25. Let X1 , . . . , Xn represent a random sample from each of the distributions having the
following pdfs or pmfs:
5.28. Suppose X1 , . . . , Xn are iid with pdf f (x; θ) = e−x/θ I{0<x<∞} . Find the mle of P[X ≥ 3].
3. The length (in millimeters) of cuckoos' eggs found in hedge sparrow nests can be mod-
elled with this distribution. For the data
22.0, 23.9, 20.9, 23.8, 25.0, 24.0, 21.7, 23.8, 22.8, 23.1, 23.1, 23.5, 23.0, 23.0,
Yi = βxi + εi , i = 1, . . . , n,
3. Find the mle β̄n of β, and show that it is an unbiased estimator of β. Compare the vari-
ances of β̄n and β̂n .
2. Among all such unbiased estimator, find the one with minimum variance, and calculate
the variance.
2. If (X1 , . . . , Xn ) is a random sample from this distribution, show that the mle of θ is an
efficient estimator of θ.
3. What is the asymptotic distribution of $\sqrt n(\hat\theta - \theta)$?
2. If (X1 , . . . , Xn ) is a random sample from this distribution, show that the mle of θ is an
efficient estimator of θ.
3. What is the asymptotic distribution of $\sqrt n(\hat\theta - \theta)$?
5.35. Let (X1 , . . . , Xn ) be a random sample from a N(0; θ) distribution. We want to estimate
the standard deviation $\sqrt\theta$. Find the constant c so that $Y = c\sum_{i=1}^n |X_i|$ is an unbiased estimator
of $\sqrt\theta$ and determine its efficiency.
5.37 (Beta(θ, 1) distribution). Let X1 , X2 , . . . , Xn denote a random sample of size n > 2 from
a distribution with pdf
$$f(x;\theta) = \begin{cases} \theta x^{\theta-1} & \text{for } x \in (0,1),\\ 0 & \text{otherwise}, \end{cases}$$
where the parameter space is Ω = (0, ∞).
1. Show that $\hat\theta = -n\big/\sum_{i=1}^n \ln X_i$ is the MLE of θ.
4. Is θ̂ an efficient estimator of θ?
5.38. Let X1 , . . . , Xn be iid N(θ, 1). Show that the best unbiased estimator of θ² is $\bar X_n^2 - \frac1n$.
Calculate its variance and show that it is greater than the Cramer-Rao lower bound.
Chapter 6
Hypothesis Testing
6.1 Introduction
Point estimation and confidence intervals are useful statistical inference procedures. An-
other type of inference that is frequently used concerns tests of hypotheses. As in the last sec-
tion, suppose our interest centers on a random variable X which has density function f (x; θ)
where θ ∈ Θ. Suppose we think, due to theory or a preliminary experiment, that θ ∈ Θ0 or
θ ∈ Θ1 where Θ0 and Θ1 are subsets of Θ and Θ0 ∪ Θ1 = Θ. We label the hypothesis as
H0 : θ ∈ Θ0 versus H1 : θ ∈ Θ1 . (6.1)
We call H0 the null hypothesis and H1 the alternative hypothesis. A hypothesis of the form
θ = θ0 is called a simple hypothesis while a hypothesis of the form θ > θ0 or θ < θ0 is called a
composite hypothesis. A test of the form
H0 : θ = θ0 versus H1 : θ ≠ θ0
is called a two-sided test, while a test of the form
H0 : θ ≤ θ0 versus H1 : θ > θ0 ,
or
H0 : θ ≥ θ0 versus H1 : θ < θ0
is called a one-sided test.
Often the null hypothesis represents no change or no difference from the past while the
alternative represents change or difference. The alternative is often referred to as the research
worker’s hypothesis. The decision rule to take H0 or H1 is based on a sample X1 , . . . , Xn from
the distribution of X and hence, the decision could be right or wrong. There are only two
types of statistical errors we may commit: rejecting H0 when H0 is true (called the Type I
error) and accepting H0 when H0 is wrong (called the Type II error).
Let D denote the sample space. A test of H0 versus H1 is based on a subset C of D. This set
C is called the critical region and its corresponding decision rule is:
Our goal is to select a critical region which minimizes the probability of making error. In gen-
eral, it is not possible to simultaneously reduce Type I and Type II errors because of a see-saw
effect: if one takes C = ∅ then H0 would be never rejected so the probability of Type I error
would be 0, but the Type II error would occur with probability 1. A Type I error is usually considered
to be worse than a Type II error. Therefore, we will choose a critical region which, on one hand,
bounds the probability of a Type I error at a certain level, and, on the other hand, minimizes the
probability of a Type II error.
α is also called the significance level of the test associated with critical region C.
Over all critical regions of size α, we will look for the one with the lowest probability
of Type II error. It also means that for θ ∈ Θ1 , we want to maximize
We call the probability on the right side of this equation the power of the test at θ. So our task
is to find among all the critical region of size α the one with highest power.
We define the power function of a critical region by
In particular, if we have µ0 = 0, µ1 = 1, n = 100 then at the significance level 5%, we would reject
H0 in favor of H1 if X̄n > 0.1965 and the power of the test is 1 − Φ(−8.135) = 0.9999.
Example 6.1.3 (Large Sample Test for the Mean). Let X1 , . . . , Xn be a random sample from the
distribution of X with mean µ and finite variance σ 2 . We want to test the hypotheses
H0 : µ = µ0 versus H1 : µ > µ0
In general, the distribution of the sample mean cannot be obtained in closed form, so we will
use the Central Limit Theorem to find the critical region. Indeed, since
$$\frac{\bar X_n - \mu}{S/\sqrt n} \xrightarrow{w} N(0, 1),$$
an approximate level-α test rejects H0 in favor of H1 if $\bar X_n \ge \mu_0 + x_\alpha S/\sqrt n$.
The power of the test is also approximated by using the Central Limit Theorem:
$$\gamma(\mu) = P\big[\bar X_n \ge \mu_0 + x_\alpha\sigma/\sqrt n\big] \approx \Phi\Big(-x_\alpha - \frac{\sqrt n(\mu_0-\mu)}{\sigma}\Big).$$
So if we have some reasonable idea of what σ equals, we can compute the approximate power
function.
Finally, note that if X has a normal distribution then $\frac{\bar X_n - \mu}{S/\sqrt n}$ has a t distribution with n − 1
degrees of freedom. Thus we can establish a rejection rule having exact level α:
Reject H0 in favor of H1 if $T = \dfrac{\bar X_n - \mu_0}{S/\sqrt n} \ge t_{\alpha,n-1}$,
where tα,n−1 is the upper α critical point of a t distribution with n − 1 degrees of freedom.
One way to report the results of a hypothesis test is to state that the null hypothesis was
or was not rejected at a specified α-value or level of significance. For example, we can say
that H0 : µ = 0 was rejected at the 0.05 level of significance. This statement of conclusions is
often inadequate because it gives the decision maker no idea about whether the computed
value of the test statistic was just barely in the rejection region or whether it was very far
into this region. Furthermore, stating the results this way imposes the predefined level of
significance on other users of the information. This approach may be unsatisfactory because
some decision makers might be uncomfortable with the risks implied by α = 0.05.
To avoid these difficulties the p-value approach has been adopted widely in practice. The
p-value is the probability that the test statistic will take on a value that is at least as extreme
as the observed value of the statistic when the null hypothesis H0 is true. Thus, a p-value
conveys much information about the weight of evidence against H0 , and so a decision maker
can draw a conclusion at any specified level of significance. We now give a formal definition
of a p-value.
Definition 6.1.4. The p-value is the smallest level of significance that would lead to rejection
of the null hypothesis H0 with the given data.
This means that if α > p-value, we would reject H0 , while if α < p-value, we would not reject
H0 .
The likelihood ratio test statistic is
$$\lambda(x) = \frac{\sup_{\theta\in\Theta_0} L(x;\theta)}{\sup_{\theta\in\Theta} L(x;\theta)}.$$
A likelihood ratio test is any test that has a rejection region of the form C = {x : λ(x) ≤ c} for
some c ∈ [0, 1].
The motivation of likelihood ratio test comes from the fact that if θ0 is the true value of θ
then, asymptotically, L(θ0 ) is the maximum value of L(θ). Therefore, if H0 is true, λ should be
close to 1; while if H1 is true, λ should be smaller.
Example 6.2.2 (Likelihood Ratio Test for the Exponential Distribution). Suppose X1 , . . . , Xn
are iid with pdf f (x; θ) = θ−1 e−x/θ I{x>0} and θ > 0. Let’s consider the hypotheses
H0 : θ = θ0 versus H1 : θ ≠ θ0 ,
where θ0 > 0 is a specified value. The likelihood ratio test statistic simplifies to
$$\lambda(X) = \Big(\frac{\bar X_n}{\theta_0}\Big)^n e^n e^{-n\bar X_n/\theta_0}.$$
The decision rule is to reject H0 if λ ≤ c. Using differential calculus, it is easy to show that λ ≤ c
iff X̄ ≤ c1 θ0 or X̄ ≥ c2 θ0 for some positive constants c1 , c2 . Note that under the null hypothesis
H0 , the statistic $\frac{2}{\theta_0}\sum_{i=1}^n X_i$ has a χ² distribution with 2n degrees of freedom. Therefore, the
level-α rule rejects H0 if
$$\frac{2}{\theta_0}\sum_{i=1}^n X_i \le \chi^2_{1-\alpha/2}(2n) \quad\text{or}\quad \frac{2}{\theta_0}\sum_{i=1}^n X_i \ge \chi^2_{\alpha/2}(2n),$$
where χ²1−α/2(2n) is the lower α/2 quantile of a χ² distribution with 2n degrees of freedom and
χ²α/2(2n) is the upper α/2 quantile of a χ² distribution with 2n degrees of freedom.
If ϕ(X) is a sufficient statistic for θ with pdf or pmf g(t; θ), then we might consider con-
structing a likelihood ratio test based on ϕ and its likelihood function L∗(t; θ) = g(t; θ) rather
than on the sample X and its likelihood function L(x; θ).
Theorem 6.2.3. If ϕ(X) is a sufficient statistic for θ and λ∗ (t) and λ(x) are the likelihood ratio
test statistics based on ϕ and X, respectively, then λ∗ (ϕ(x)) = λ(x) for every x in the sample
space.
Proof. From the Factorization Theorem, the pdf or pmf of X can be written as f (x; θ) =
g(ϕ(x); θ)h(x), where g(t; θ) is the pdf or pmf of ϕ(X) and h(x) does not depend on θ. Thus
$$\lambda(x) = \frac{\sup_{\theta\in\Theta_0} g(\varphi(x);\theta)h(x)}{\sup_{\theta\in\Theta} g(\varphi(x);\theta)h(x)} = \frac{\sup_{\theta\in\Theta_0} L^*(\varphi(x);\theta)}{\sup_{\theta\in\Theta} L^*(\varphi(x);\theta)} = \lambda^*(\varphi(x)).$$
Definition 6.3.1. A subset C of the sample space is called a best critical region of size α for
testing the simple hypothesis
H0 : θ = θ0 versus H1 : θ = θ1 ,
if Pθ0 [X ∈ C] = α and for every subset A of the sample space
Pθ0 [X ∈ A] = α implies Pθ1 [X ∈ C] ≥ Pθ1 [X ∈ A].
The following theorem of Neyman and Pearson provides a systematic method of deter-
mining a best critical region.
Theorem 6.3.2. Let (X1 , . . . , Xn ) be a sample from a distribution that has density f (x; θ).
Then the likelihood of X1 , X2 , . . . , Xn is
$$L(x;\theta) = \prod_{i=1}^n f(x_i;\theta), \quad\text{for } x = (x_1, \dots, x_n).$$
Let θ0 and θ1 be distinct fixed values of θ so that Θ = {θ0 , θ1 }, and let k be a positive
number. Let C be a subset of the sample space such that
(a) $\dfrac{L(x;\theta_0)}{L(x;\theta_1)} \le k$ for each x ∈ C;
(b) $\dfrac{L(x;\theta_0)}{L(x;\theta_1)} \ge k$ for each x ∈ D\C;
(c) α = Pθ0 [X ∈ C].
Then C is a best critical region of size α for testing the simple hypothesis
H0 : θ = θ0 versus H1 : θ = θ1 .
Proof. We prove the theorem when the random variables are of the continuous type. If A is
another critical region of size α, we will show that
$$\int_C L(x;\theta_1)\,dx \ge \int_A L(x;\theta_1)\,dx.$$
Write C as the disjoint union of C ∩ A and C ∩ A^c and A as the disjoint union of A ∩ C and
A ∩ C^c . Then
$$\int_C L(x;\theta_1)\,dx - \int_A L(x;\theta_1)\,dx = \int_{C\cap A^c} L(x;\theta_1)\,dx - \int_{A\cap C^c} L(x;\theta_1)\,dx \ge \frac1k\int_{C\cap A^c} L(x;\theta_0)\,dx - \frac1k\int_{A\cap C^c} L(x;\theta_0)\,dx,$$
where the last inequality follows from conditions (a) and (b). Moreover, we have
$$\int_{C\cap A^c} L(x;\theta_0)\,dx - \int_{A\cap C^c} L(x;\theta_0)\,dx = \int_C L(x;\theta_0)\,dx - \int_A L(x;\theta_0)\,dx = \alpha - \alpha = 0.$$
Example 6.3.3. Let X = (X1 , . . . , Xn ) denote a random sample from the distribution N (θ, 1).
It is desired to test the simple hypothesis
H0 : θ = 0 versus H1 : θ = 1.
We have
$$\frac{L(0;x)}{L(1;x)} = \frac{\frac{1}{(2\pi)^{n/2}}\exp\big(-\frac12\sum_{i=1}^n x_i^2\big)}{\frac{1}{(2\pi)^{n/2}}\exp\big(-\frac12\sum_{i=1}^n (x_i-1)^2\big)} = \exp\Big(-\sum_{i=1}^n x_i + \frac n2\Big).$$
The condition $L(0;x)/L(1;x) \le k$ is equivalent to $\bar x_n \ge c$, so the set
$$C = \{(x_1, \dots, x_n) : \bar x_n \ge c\}$$
is a best critical region, where c is a constant that can be determined so that the size of the
critical region is α. Since X̄n ∼ N (0, 1/n),
For example, if n = 25 and α = 0.05, then c = 0.329. The power of this best test of size 0.05 of
H0 against H1 at θ = 1 is
$$\int_{0.329}^\infty \frac{1}{\sqrt{2\pi/25}}\exp\Big(-\frac{(x-1)^2}{2/25}\Big)\,dx = 1 - \Phi(-3.355) = 0.999.$$
Definition 6.3.4. The critical region C is a uniformly most powerful (UMP) critical region of
size α for testing the simple hypothesis H0 against an alternative composite hypothesis H1 if
the set C is a best critical region of size α for testing H0 against each simple hypothesis in H1 .
A test defined by this critical region C is called a uniformly most powerful (UMP) test, with
significance level α, for testing the simple hypothesis H0 against the alternative composite
hypothesis H1 .
It is well-known that uniformly most powerful tests do not always exist. However, when
they do exist, the Neyman-Pearson theorem provides a technique for finding them.
Example 6.3.5. Let (X1 , X2 , . . . , Xn ) be a random sample from the distribution N (0, θ), where
the variance θ is an unknown positive number. We will show that there exists a uniformly
most powerful test with significance level α for testing
H0 : θ = θ0 versus H1 : θ > θ0 .
Let θ′ be a number greater than θ0 and let k denote a positive number. Let C be the set of points
where
$$\frac{L(\theta_0;x)}{L(\theta';x)} \le k \quad\Longleftrightarrow\quad \sum_{i=1}^n x_i^2 \ge \frac{2\theta_0\theta'}{\theta'-\theta_0}\Big(\frac n2\ln\frac{\theta'}{\theta_0} - \ln k\Big) =: c.$$
The set $C = \{(x_1, \dots, x_n) : \sum_{i=1}^n x_i^2 \ge c\}$ is then a best critical region for our testing problem,
where c is determined so that the size of the critical region is α. This can be done using the
observation that $\frac{1}{\theta_0}\sum_{i=1}^n X_i^2$ has a χ²-distribution with n degrees of freedom. Note that for
each number θ′ > θ0 , the foregoing argument holds. It means that C is a uniformly most
powerful critical region of size α.
In conclusion, if $\sum_{i=1}^n X_i^2 \ge c$, H0 is rejected at the significance level α and H1 is accepted;
otherwise, H0 is accepted.
Example 6.3.6. Let (X1 , . . . , Xn ) be a sample from the normal distribution N (a, 1), where a is
unknown. We will show that there is no uniformly most powerful test of the simple hypothesis
H0 : a = a0 versus H1 : a ≠ a0 .
Theorem 6.3.8. Assume that L(x; θ) has a monotone decreasing likelihood ratio in the statistic
y = u(x). The following test is uniformly most powerful of level α for the hypotheses (6.2):
Reject H0 if u(X) ≥ c, where c is determined by α = Pθ0 [u(X) ≥ c].  (6.3)
In case L(x; θ) has a monotone increasing likelihood ratio in the statistic y = u(x) we can
construct a uniformly most powerful test in a similar way.
Proof. We first consider the simple null hypothesis H0′ : θ = θ0 . Let θ1 > θ0 be arbitrary but
fixed, and let C denote the most powerful critical region for θ0 versus θ1 . By the Neyman-Pearson
Theorem, C is defined by
$$\frac{L(X;\theta_0)}{L(X;\theta_1)} \le k \quad\text{if and only if}\quad X \in C,$$
where k is determined by α = Pθ0 [X ∈ C]. However, since θ1 > θ0 ,
$$\frac{L(X;\theta_0)}{L(X;\theta_1)} = g(u(X)) \le k \quad\Longleftrightarrow\quad u(X) \ge g^{-1}(k),$$
where $g(u(x)) = \frac{L(x;\theta_0)}{L(x;\theta_1)}$ is decreasing in u(x). Since $\alpha = P_{\theta_0}[u(X) \ge g^{-1}(k)]$, we have c = g⁻¹(k). Hence, the Neyman-
Pearson test is equivalent to the test defined by (6.3). Moreover, the test is uniformly most
powerful for θ0 versus θ1 > θ0 because the test does not depend on the particular θ1 > θ0 and
g⁻¹(k) is uniquely determined under θ0 .
Let γ(θ) denote the power function of the test (6.3). For any θ′ < θ″, the test (6.3) is the
most powerful test for testing θ′ versus θ″ with level γ(θ′), so γ(θ″) ≥ γ(θ′). Hence γ(θ)
is a nondecreasing function. This implies $\max_{\theta \le \theta_0} \gamma(\theta) = \gamma(\theta_0) = \alpha$.
Example 6.3.9. Let X1 , . . . , Xn be a random sample from a Bernoulli distribution with param-
eter p = θ, where 0 < θ < 1. Let θ0 < θ1 . Consider the ratio of likelihood,
$$\frac{L(x_1, \dots, x_n;\theta_0)}{L(x_1, \dots, x_n;\theta_1)} = \Big(\frac{\theta_0(1-\theta_1)}{\theta_1(1-\theta_0)}\Big)^{\sum x_i}\Big(\frac{1-\theta_0}{1-\theta_1}\Big)^n.$$
By Theorem 6.3.8, the uniformly most powerful level α decision rule is given by
$$\text{Reject } H_0 \text{ if } Y = \sum_{i=1}^n X_i \ge c,$$
Null hypothesis: H0 : µ = µ0
Test statistic: $Z_0 = \dfrac{\bar X - \mu_0}{\sigma/\sqrt n}$
Example 6.4.1. The following data give the score of 10 students in a certain exam.
75 64 75 65 72 80 71 68 78 62.
Assume that the score is normally distributed with mean µ and known variance σ 2 = 36, test
the following hypotheses at the 0.05 level of significance and find the P -value of each test.
Solution: (a) We may solve the problem by following the six-step procedure as follows.
6. Since |Z0 | < zα/2 , we do not reject H0 : µ = 70 in favour of H1 : µ ≠ 70 at the 0.05 level of
significance. More precisely, the sample of 10 measurements does not provide sufficient
evidence that the mean score differs from 70.
The P -value of this test is 2(1 − Φ(|Z0 |)) = 2(1 − Φ(0.5270)) = 0.598.
(b)
There is a close relationship between the test of a hypothesis about any parameter, say θ,
and the confidence interval for θ. If [l, u] is a 100(1 − α)% confidence interval for the parameter
θ, the test of size α of the hypothesis
H0 : θ = θ0 , H1 : θ ≠ θ0
will lead to rejection of H0 if and only if θ0 is not in the 100(1 − α)% confidence interval [l, u].
Although hypothesis tests and CIs are equivalent procedures insofar as decision making
or inference about µ is concerned, each provides somewhat different insights. For instance,
the confidence interval provides a range of likely values for µ at a stated confidence level,
whereas hypothesis testing is an easy framework for displaying the risk levels such as the P -
value associated with a specific decision.
In testing hypotheses, the analyst directly selects the type I error probability. However, the
probability of type II error β depends on the choice of sample size. In this section, we will
show how to calculate the probability of type II error β. We will also show how to select the
sample size to obtain a specified value of β.
In the following we will derive the formula for β of the two-sided test. The ones for one-
sided tests can be derived in a similar way and we leave it as an exercise for the reader.
Finding the probability of type II error β: Consider the two-sided hypothesis
H0 : µ = µ0 , H1 : µ ≠ µ0 .
Suppose the null hypothesis is false and that the true value of the mean is µ = µ0 + δ for some
δ. The test statistic Z0 is
$$Z_0 = \frac{\bar X - \mu_0}{\sigma/\sqrt n} = \frac{\bar X - (\mu_0+\delta)}{\sigma/\sqrt n} + \frac{\delta\sqrt n}{\sigma} \sim N\Big(\frac{\delta\sqrt n}{\sigma},\ 1\Big).$$
Therefore, the probability of type II error is β = Pµ0+δ (|Z0 | ≤ zα/2 ), i.e.,
$$\beta = \Phi\Big(z_{\alpha/2} - \frac{\delta\sqrt n}{\sigma}\Big) - \Phi\Big(-z_{\alpha/2} - \frac{\delta\sqrt n}{\sigma}\Big). \quad (6.4)$$
Sample size formula. There is no closed form solution for n from equation (6.4). However, we can
estimate n as follows.
Case 1: If δ > 0, then $\Phi(-z_{\alpha/2} - \delta\sqrt n/\sigma) \approx 0$, so
$$\beta \approx \Phi\Big(z_{\alpha/2} - \frac{\delta\sqrt n}{\sigma}\Big) \quad\Longleftrightarrow\quad n \approx \frac{(z_{\alpha/2}+z_\beta)^2\sigma^2}{\delta^2}.$$
Case 2: If δ < 0, then $\Phi(z_{\alpha/2} - \delta\sqrt n/\sigma) \approx 1$, so
$$\beta \approx 1 - \Phi\Big(-z_{\alpha/2} - \frac{\delta\sqrt n}{\sigma}\Big) \quad\Longleftrightarrow\quad n \approx \frac{(z_{\alpha/2}+z_\beta)^2\sigma^2}{\delta^2}.$$
Therefore, the sample size formula is
$$n \approx \frac{(z_{\alpha/2}+z_\beta)^2\sigma^2}{\delta^2}.$$
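Numerically the formula is immediate; the Python sketch below (the function name is ours, and the values δ = 1, σ = 2, power 0.90 are illustrative, not from the text) shows a typical use.

from scipy.stats import norm

def sample_size_two_sided(delta, sigma, alpha=0.05, beta=0.10):
    # n ~ (z_{alpha/2} + z_beta)^2 sigma^2 / delta^2
    z_a = norm.ppf(1 - alpha / 2)
    z_b = norm.ppf(1 - beta)
    return (z_a + z_b) ** 2 * sigma ** 2 / delta ** 2

print(sample_size_two_sided(delta=1.0, sigma=2.0))   # about 42.0, so take n = 43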
Large-sample test
We have developed the test procedure for the null hypothesis H0 : µ = µ0 assuming that
the population is normally distributed and that σ 2 is known. In many if not most practical sit-
uations σ 2 will be unknown. Furthermore, we may not be certain that the population is well
modeled by a normal distribution. In these situations if n is large (say n > 40) the sample vari-
ance s2 can be substituted for σ 2 in the test procedures with little effect. Thus, while we have
given a test for the mean of a normal distribution with known σ 2 , it can be easily converted
into a large-sample test procedure for unknown σ 2 that is valid regardless of the form of the
distribution of the population. This large-sample test relies on the central limit theorem just
as the large-sample confidence interval on µ that was presented in the previous chapter did.
Exact treatment of the case where the population is normal, σ² is unknown, and n is small
involves use of the t distribution and will be deferred to the next section.
We assume again that a random sample X1 , X2 , . . . , Xn has been taken from a normal
N (µ, σ²) population. Recall that X̄ and s²(X) are the sample mean and sample variance of the
random sample X1 , X2 , . . . , Xn , respectively. It is known that
$$t_{n-1} = \frac{\bar X - \mu}{s(X)/\sqrt n}$$
has a t distribution with n − 1 degrees of freedom. This fact leads to the following test on the
mean µ.
Null hypothesis: H0 : µ = µ0
Test statistic: $T_0 = \dfrac{\bar X - \mu_0}{s(X)/\sqrt n}$
Because the t-table in the Appendix contains a few critical values for each t distribution, com-
putation of the exact P -value directly from the table is usually impossible. However, it is easy
to find upper and lower bounds on the P -value from this table.
Example 6.4.2. The following data give the IQ score of 10 students.
112 116 115 120 118 125 118 113 117 121.
Suppose that the IQ score is normally distributed N(µ, σ 2 ), test the following hypotheses at
the 0.05 level of significance and estimate the P -value of each test.
(a) H0 : µ = 115 against H1 : µ ≠ 115.
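For part (a), the computation can be checked with a few lines of Python. This is a sketch under the stated normality assumption; scipy's one-sample t-test is used in place of hand computation, and the numerical conclusion in the comment is ours, computed from the data above.

import numpy as np
from scipy import stats

iq = np.array([112, 116, 115, 120, 118, 125, 118, 113, 117, 121])
t0, p_value = stats.ttest_1samp(iq, popmean=115)
print(t0, p_value)
# T0 is about 2.05 and the two-sided P-value about 0.07, so H0 : mu = 115
# is not rejected at the 0.05 level (compare with t_{0.025,9} = 2.262).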
When the true value of the mean is µ = µ0 + δ, the distribution of T0 is called the non-
central t distribution with n − 1 degrees of freedom and non-centrality parameter $\delta\sqrt n/\sigma$.
Therefore, the type II error of the two-sided alternative would be
$$\beta = P(|T_0'| \le t_{\alpha/2,n-1}),$$
where T0′ denotes a non-central t random variable with these parameters.
We assume that a random sample X1 , X2 , . . . , Xn has been taken from a normal N (µ, σ 2 )
population. Since (n − 1)s2 (X)/σ 2 follows the chi-square distribution with n − 1 degrees of
freedom, we obtain the following test for value of σ 2 .
Null hypothesis: H0 : σ² = σ0²
Test statistic: $\chi_0^2 = \dfrac{(n-1)s^2(X)}{\sigma_0^2}$
Example 6.4.3. An automatic filling machine is used to fill bottles with liquid detergent. A
random sample of 20 bottles results in a sample variance of fill volume of s2 = 0.0153 (fluid
ounces)2 . If the variance of fill volume exceeds 0.01 (fluid ounces)2 , an unacceptable propor-
tion of bottles will be underfilled or overfilled. Is there evidence in the sample data to suggest
that the manufacturer has a problem with underfilled or overfilled bottles? Use α = 0.05, and
assume that fill volume has a normal distribution.
Solution
6. Since χ20 < cα,19 , we conclude that there is no strong evidence that the variance of fill
volume exceeds 0.01 (fluid ounces)2 .
Since $P(\chi^2_{19} > 27.2) = 0.10$ and $P(\chi^2_{19} > 30.4) = 0.05$, we conclude that the P -value of the test
is in the interval (0.05, 0.10). Note that the actual P -value is 0.0649.
Let (X1 , . . . , Xn ) be a random sample observed from a random variable X with B(1, p)
distribution. Then p̂ = X̄ is a point estimator of p. By the Central Limit Theorem, when n is
large, p̂ is approximately normal with mean p and variance p(1 − p)/n. We thus obtain the
following test for the value of p.
6. Since Z0 < −zα , we reject H0 and conclude that the process fraction defective p is less
than 0.05. The P -value for this value of the test statistic is Φ(−1.947) = 0.0256, which is
less than α = 0.05. We conclude that the process is capable.
Suppose that p is the true value of the population proportion. The approximate β-error is
defined as follows
These equations can be solved to find the approximate sample size n that gives a test of level
α that has a specified β risk. The sample size is defined as follows.
$$Z = \frac{\bar X_1 - \bar X_2 - (\mu_1-\mu_2)}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}} \sim N(0, 1).$$
Null hypothesis: H0 : µ1 − µ2 = ∆0
Test statistic: $Z_0 = \dfrac{\bar X_1 - \bar X_2 - \Delta_0}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}}$
When the population variances are unknown, the sample variances s21 and s22 can be substi-
tuted into the test statistic Z0 to produce a large-sample test for the difference in means. This
procedure will also work well when the populations are not necessarily normally distributed.
However, both n1 and n2 should exceed 40 for this large-sample test to be valid.
Example 6.5.2. A product developer is interested in reducing the drying time of a primer
paint. Two formulations of the paint are tested; formulation 1 is the standard chemistry, and
formulation 2 has a new drying ingredient that should reduce the drying time. From expe-
rience, it is known that the standard deviation of drying time is 8 minutes, and this inherent
variability should be unaffected by the addition of the new ingredient. Ten specimens are
painted with formulation 1, and another 10 specimens are painted with formulation 2; the 20
specimens are painted in random order. The two sample average drying times are X 1 = 121
minutes and X 2 = 112 minutes, respectively. What conclusions can the product developer
draw about the effectiveness of the new ingredient, using α = 0.05?
Solution:
1. The quantity of interest is the difference in mean drying time, µ1 − µ2 , and ∆0 = 0.
6. Since Z0 = 2.52 > z0.05 = Φ⁻¹(0.95) = 1.645, we reject H0 at the α = 0.05 level and conclude
that adding the new ingredient to the paint significantly reduces the drying time.
Alternatively, we can find the P -value for this test as
P -value = 1 − Φ(2.52) = 0.0059.
Therefore H0 : µ1 = µ2 would be rejected at any significance level α ≥ 0.0059.
Suppose we have two independent normal populations with unknown means µ1 and µ2 ,
and unknown but equal variances σ 2 . Assume that assumptions (6.5) hold.
The pooled estimator of σ², denoted by Sp², is defined by
$$S_p^2 = \frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1+n_2-2}.$$
The inference for µ1 − µ2 is based on the following result.
Theorem 6.5.3. Under all the assumptions mentioned above, the quantity
$$T = \frac{\bar X_1 - \bar X_2 - (\mu_1-\mu_2)}{S_p\sqrt{\dfrac{1}{n_1}+\dfrac{1}{n_2}}}$$
has a t distribution with n1 + n2 − 2 degrees of freedom.
Null hypothesis: H0 : µ1 − µ2 = ∆0
Test statistic: $T_0 = \dfrac{\bar X_1 - \bar X_2 - \Delta_0}{S_p\sqrt{\dfrac{1}{n_1}+\dfrac{1}{n_2}}}$
Example 6.5.4. The IQ’s of 9 children in a district of a large city have empirical mean 107 and
standard deviation 10. The IQ’s of 12 children in another district have empirical mean 112
and standard deviation 9. Test the equality of means at the 0.05 level of significance.
Example 6.5.5. Two catalysts are being analyzed to determine how they affect the mean yield
of a chemical process. Specifically, catalyst 1 is currently in use, but catalyst 2 is acceptable.
Since catalyst 2 is cheaper, it should be adopted, providing it does not change the process
yield. A test is run in the pilot plant and results in the data shown in the following table.
Observation Num. Catalyst 1 Catalyst 2
1 91.50 89.19
2 94.18 90.95
3 92.18 90.46
4 95.39 93.21
5 91.79 97.19
6 89.07 97.04
7 94.72 91.07
8 89.21 92.75
Is there any difference between the mean yields? Use α = 0.05, and assume equal variances.
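The computation for this example can be sketched in Python as follows (scipy's pooled two-sample t-test; the numerical conclusion stated in the comments is ours, computed from the table above under the equal-variance assumption).

import numpy as np
from scipy import stats

catalyst1 = np.array([91.50, 94.18, 92.18, 95.39, 91.79, 89.07, 94.72, 89.21])
catalyst2 = np.array([89.19, 90.95, 90.46, 93.21, 97.19, 97.04, 91.07, 92.75])

# equal_var=True gives the pooled-variance statistic T0 described above.
t0, p_value = stats.ttest_ind(catalyst1, catalyst2, equal_var=True)
print(t0, p_value)
# T0 is about -0.35 with a P-value around 0.73, so H0 : mu1 = mu2 is not rejected
# at alpha = 0.05: the data give no evidence that the catalysts differ in mean yield.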
In some situations, we cannot reasonably assume that the unknown variances σ1² and σ2²
are equal. There is not an exact t-statistic available for testing H0 : µ1 − µ2 = ∆0 in this case.
However, if H0 is true, the statistic
$$T_0^* = \frac{\bar X_1 - \bar X_2 - \Delta_0}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}$$
is distributed approximately as t with degrees of freedom given by
$$\nu = \frac{\Big(\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}\Big)^2}{\dfrac{(s_1^2/n_1)^2}{n_1-1} + \dfrac{(s_2^2/n_2)^2}{n_2-1}}.$$
Therefore, if σ1² ≠ σ2², the hypotheses on differences in the means of two normal distributions
are tested as in the equal variances case, except that T0* is used as the test statistic and n1 + n2 − 2
is replaced by ν in determining the degrees of freedom for the test.
Null hypothesis: H0 : µD = ∆0
Test statistic: $T_0 = \dfrac{\bar D - \Delta_0}{S_D/\sqrt n}$
Example 6.5.6. An article in the Journal of Strain Analysis (1983, Vol. 18, No. 2) compares
several methods for predicting the shear strength for steel plate girders. Data for two of these
methods, the Karlsruhe and Lehigh procedures, when applied to nine specific girders, are
shown in the following table.
Karlsruhe Method 1.186 1.151 1.322 1.339 1.200 1.402 1.365 1.537 1.559
Lehigh Method 1.061 0.992 1.063 1.062 1.065 1.178 1.037 1.086 1.052
Difference Dj 0.119 0.159 0.259 0.277 0.138 0.224 0.328 0.451 0.507
Test whether there is any difference (on the average) between the two methods?
Solution:
$\bar D = 0.2736$, $S_D^2 = 0.018349$, $T_0 = 6.05939 > t_{0.025,8} = 2.306$.
We conclude that there is a difference between the two methods at the 0.05 level of significance.
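The same numbers can be reproduced with scipy's paired t-test; this is a sketch, and the comment restates the conclusion computed above.

import numpy as np
from scipy import stats

karlsruhe = np.array([1.186, 1.151, 1.322, 1.339, 1.200, 1.402, 1.365, 1.537, 1.559])
lehigh    = np.array([1.061, 0.992, 1.063, 1.062, 1.065, 1.178, 1.037, 1.086, 1.052])

t0, p_value = stats.ttest_rel(karlsruhe, lehigh)   # paired test on D_j = K_j - L_j
print(t0, p_value)
# T0 is about 6.06 > t_{0.025,8} = 2.306, with a very small P-value, agreeing with
# the conclusion above that the two methods differ on average.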
Theorem 6.5.7. Let X11 , X12 , . . . , X1n1 be a random sample from a normal population with
mean µ1 and variance σ12 and let X21 , X22 , . . . , X2n2 be a random sample from a second normal
population with mean µ2 and variance σ22 . Assume that both normal populations are indepen-
dent. Let s21 and s22 be the sample variances. Then the ratio
s21 /σ12
F =
s22 /σ22
This result is based on the fact that (n1 −1)s21 /σ12 is a chi-square random variable with n1 −1
degrees of freedom, that (n2 − 1)s21 /σ22 is a chi-square random variable with n2 − 1 degrees
of freedom, and that the two normal populations are independent. Clearly under the null
hypothesis H0 : σ12 = σ22 , the ratio F0 = s21 /s22 has an Fn1 −1,n2 −1 distribution. Let fα,n1 −1,n2 −1 be a
constant satisfying
P[F0 > fα,n1 −1,n2 −1 ] = α.
It follows from the property of the F distribution that
$$f_{1-\alpha,n_1-1,n_2-1} = \frac{1}{f_{\alpha,n_2-1,n_1-1}}.$$
Example 6.5.8. Oxide layers on semiconductor wafers are etched in a mixture of gases to
achieve the proper thickness. The variability in the thickness of these oxide layers is a critical
characteristic of the wafer, and low variability is desirable for subsequent processing steps.
Two different mixtures of gases are being studied to determine whether one is superior in re-
ducing the variability of the oxide thickness. Twenty wafers are etched in each gas. The sam-
ple standard deviations of oxide thickness are s1 = 1.96 angstroms and s2 = 2.13 angstroms,
respectively. Is there any evidence to indicate that either gas is preferable? Use α = 0.05.
Solution: At the α = 0.05 level of significance we need to test H0 : σ1² = σ2² against H1 : σ1² ≠ σ2².
The test statistic is F0 = s1²/s2² = (1.96/2.13)² ≈ 0.85, which lies between the lower and upper
critical values f1−0.025,19,19 and f0.025,19,19 , so H0 cannot be rejected at the 0.05 level of significance.
Therefore, there is no strong evidence to indicate that either gas results
in a smaller variance of oxide thickness.
$$Z = \frac{\hat P_1 - \hat P_2}{\sqrt{p(1-p)\Big(\dfrac{1}{n_1} + \dfrac{1}{n_2}\Big)}}$$
Null hypothesis: H0 : p1 = p2
Test statistic: $Z_0 = \dfrac{\hat P_1 - \hat P_2}{\sqrt{\hat p(1-\hat p)\Big(\dfrac{1}{n_1} + \dfrac{1}{n_2}\Big)}}$ with $\hat p = \dfrac{X_1 + X_2}{n_1 + n_2}$.
Example 6.5.9. Two different types of polishing solution are being evaluated for possible use
in a tumble-polish operation for manufacturing interocular lenses used in the human eye fol-
lowing cataract surgery. Three hundred lenses were tumble-polished using the first polish-
ing solution, and of this number 253 had no polishing-induced defects. Another 300 lenses
were tumble-polished using the second polishing solution, and 196 lenses were satisfactory
upon completion. Is there any reason to believe that the two polishing solutions differ? Use
α = 0.01.
We shall assume that a random sample of size n is to be taken from the given population.
That is, n independent observations are to be taken, and there is probability pi that each ob-
servation will be of type i(i = 1, ..., k). On the basis of these n observations, the hypothesis is
to be tested.
For each i, we denote by Ni the number of observations in the random sample that are of type
i.
The statistic
$$Q = \sum_{i=1}^k \frac{(N_i - np_i^0)^2}{np_i^0}$$
has the property that if H0 is true and the sample size n → ∞, then Q converges in
distribution to the χ² distribution with k − 1 degrees of freedom.
Suppose that we observe an i.i.d. sample X1 , . . . , Xn of random variables that take a finite
number of values B1 , . . . , Bk with unknown probabilities p1 , . . . , pk . Consider the hypothesis
$$H_0 : p_i = p_i^0 \text{ for all } i = 1, \dots, k \quad\text{vs}\quad H_1 : \text{otherwise},$$
where Ni is the number of Xj equal to Bi . On the other hand, if H1 holds, then for some index i∗,
$p_{i^*} \neq p_{i^*}^0$. We write
$$\frac{N_{i^*} - np_{i^*}^0}{\sqrt{np_{i^*}^0}} = \sqrt{\frac{p_{i^*}}{p_{i^*}^0}}\cdot\frac{N_{i^*} - np_{i^*}}{\sqrt{np_{i^*}}} + \sqrt n\,\frac{p_{i^*} - p_{i^*}^0}{\sqrt{p_{i^*}^0}}.$$
The first term converges to $N(0, (1-p_{i^*})p_{i^*}/p_{i^*}^0)$ by the central limit theorem while the second
term diverges to plus or minus infinity. It means that if H1 holds then Q → ∞. Therefore, we
will reject H0 if Q ≥ cα,k−1 where cα,k−1 is chosen such that the error of type 1 is equal to the
level of significance α:
$$\alpha = P_0(Q > c_{\alpha,k-1}) \approx P(\chi^2_{k-1} > c_{\alpha,k-1}).$$
This test is called the chi-squared goodness-of-fit test.
Example 6.6.2. A study of blood types among a sample of 6004 people gives the following
result.
Blood type          A     B    AB     O
Number of people  2162   738   228  2876
A previous study claimed that the proportions of people whose blood is of type A, B, AB and
O are 33.33%, 12.5%, 4.17% and 50%, respectively.
We can use the data in Table 6.2 to test the null hypothesis H0 that the probabilities (p1 , p2 , p3 , p4 )
of the four blood types equal (1/3, 1/8, 1/24, 1/2). The χ² test statistic is then
$$Q = \frac{(2162 - 6004\cdot\frac13)^2}{6004\cdot\frac13} + \frac{(738 - 6004\cdot\frac18)^2}{6004\cdot\frac18} + \frac{(228 - 6004\cdot\frac1{24})^2}{6004\cdot\frac1{24}} + \frac{(2876 - 6004\cdot\frac12)^2}{6004\cdot\frac12} = 20.37.$$
To test H0 at the level α0 , we would compare Q to the 1 − α0 quantile of the χ2 distribution
with three degrees of freedom. Alternatively, we can compute the P -value, which would be
the smallest α0 at which we could reject H0 . In general, the P -value is 1 − F (Q) where F is the
cumulative distribution function of the χ2 distribution with k − 1 degrees of freedom. In this
example k = 4 and Q = 20.37 then the p-value is 1.42 × 10−4 .
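The statistic and P-value of this example can be reproduced with scipy (a sketch; scipy's chisquare uses k − 1 = 3 degrees of freedom by default, as required here).

import numpy as np
from scipy import stats

observed = np.array([2162, 738, 228, 2876])            # blood types A, B, AB, O
p0 = np.array([1/3, 1/8, 1/24, 1/2])                   # hypothesized proportions
expected = observed.sum() * p0

q, p_value = stats.chisquare(observed, f_exp=expected)
print(q, p_value)   # Q is about 20.37, P-value about 1.4e-4, so H0 is rejected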
Let X1 , . . . , Xn be an i.i.d. sample from unknown distribution P and consider the following
hypotheses
H0 : P = P0 vs H1 : P ≠ P0
for some particular, possibly continuous distribution P0 . To do this, we will split a set of all
possible outcomes of Xi , say X, into a finite number of intervals I1 , . . . , Ik . The null hypothesis
H0 implies that for all intervals
P(X ∈ Ij ) = P0 (X ∈ Ij ) = p0j .
Consider the auxiliary hypothesis H0′ : P(X ∈ Ij ) = p0j , j = 1, . . . , k. It is clear that H0 implies
H0′ . However, the fact that H0′ holds does not guarantee that H0 holds: there are many
distributions different from P0 that have the same probabilities on the intervals I1 , . . . , Ik as
P0 . On one hand, if we group the data into more and more intervals, our discrete approximation
of P will get closer and closer to P , so in some sense H0′ will get 'closer' to H0 . However, we
cannot split into too many intervals either, because the χ²k−1 -distribution approximation for
the statistic Q in Pearson's theorem is asymptotic. The rule of thumb is to group the data in
such a way that the expected count in each interval np0i is at least 5.
Example 6.6.3. Suppose that we wish to test the null hypothesis that the logarithms of the
lifetime of ball bearings are an i.i.d. sample from the normal distribution with mean ln(50) =
3.912 and variance 0.25. The observed logarithms are
In order to have the expected count in each interval be at least 5, we can use at most k = 4 intervals. We shall make these intervals each have probability 0.25 under the null hypothesis. That is, we shall divide the real line at the 0.25, 0.5, and 0.75 quantiles of the hypothesized normal distribution. These quantiles are 3.575, 3.912, and 4.249.
The numbers of observations in the four intervals are then 3, 4, 8 and 8. We then calculate
Q = 3.609.
Our table of the χ² distribution with three degrees of freedom indicates that 3.609 lies between the 0.6 and 0.7 quantiles, so we would not reject the null hypothesis at levels less than 0.3 and would reject it at levels greater than 0.4. (Actually, the P-value is 0.307.)
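This computation can be sketched in Python as follows (assuming SciPy); since the raw logarithms are not reproduced above, the sketch starts from the interval counts 3, 4, 8, 8.

```python
# Chi-squared goodness-of-fit test of Example 6.6.3 with equal-probability intervals.
import numpy as np
from scipy import stats

mu, sigma = 3.912, 0.5                        # hypothesized N(3.912, 0.25)
cuts = stats.norm.ppf([0.25, 0.5, 0.75], loc=mu, scale=sigma)
print(cuts)                                   # about 3.575, 3.912, 4.249

counts = np.array([3, 4, 8, 8])               # observed counts in the 4 intervals
n = counts.sum()
expected = n * 0.25                           # each interval has probability 0.25
Q = ((counts - expected) ** 2 / expected).sum()
p_value = stats.chi2.sf(Q, df=3)
print(Q, p_value)                             # about 3.609 and 0.307
```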
We can extend the goodness-of-fit test to deal with the case in which the null hypothesis
is that the distribution of our data belongs to a particular parametric family. The alternative
hypothesis is that the data have a distribution that is not a member of that parametric family.
There are two changes to the test procedure in going from the case of a simple null hypothesis
to the case of a composite null hypothesis. First, in the test statistic Q, the probabilities p0i are
replaced by estimated probabilities based on the parametric family. Second, the degrees of
freedom are reduced by the number of parameters.
Let us start with a discrete case when a random variable takes a finite number of values
B1 , . . . , Bk and
pi = P(X = Bi ), i = 1, . . . , k.
We would like to test a hypothesis that this distribution comes from a family of distributions
{Pθ : θ ∈ Θ}. In other words, if we denote
pj (θ) = Pθ (X = Bj ),
we want to test
$$H_0 : p_j = p_j(\theta) \ \text{for all } j, \ \text{for some } \theta \in \Theta \quad \text{vs} \quad H_1 : \text{otherwise}.$$
The situation now is more complicated, since we want to test whether pj = pj(θ) for all j ≤ k for at least some θ ∈ Θ, which means that we may have many candidates for θ. One way to approach this problem is as follows.
Step 1: Assuming that hypothesis H0 holds, we can find an estimator θ∗ of this unknown θ.
Step 2: Try to test if, indeed, the distribution P is equal to Pθ∗ by using the statistic
$$Q^* = \sum_{i=1}^{k} \frac{(N_i - np_i(\theta^*))^2}{np_i(\theta^*)}.$$
This approach looks natural; the only questions are which estimate θ∗ to use and how the fact that θ∗ also depends on the data affects the distribution of Q∗. It turns out that if we let θ∗ be the maximum likelihood estimate, i.e. the θ that maximizes the likelihood function
$$p_1(\theta)^{N_1} \cdots p_k(\theta)^{N_k},$$
then, under H0, the statistic Q∗ converges in distribution to the χ²_{k−s−1} distribution, where s is the dimension of the parameter set Θ. We will therefore reject H0 if Q∗ ≥ c, where the threshold c is determined from the condition
$$\mathbb{P}(\chi^2_{k-s-1} > c) = \alpha.$$
Example 6.6.4. Suppose that a gene has two possible alleles A1 and A2 and the combinations of these alleles define three genotypes A1A1, A1A2 and A2A2. We want to test the theory that
$$\text{Probability to pass } A_1 \text{ to a child} = \theta, \qquad \text{Probability to pass } A_2 \text{ to a child} = 1 - \theta,$$
so that, assuming the two alleles of a child are passed independently, the genotype probabilities are
$$p_1(\theta) = \mathbb{P}(A_1A_1) = \theta^2, \quad p_2(\theta) = \mathbb{P}(A_1A_2) = 2\theta(1 - \theta), \quad p_3(\theta) = \mathbb{P}(A_2A_2) = (1 - \theta)^2.$$
Suppose that given a random sample X1 , . . . , Xn from the population the counts of each geno-
type are N1 , N2 and N3 . To test the theory we want to test the hypothesis
H0 : pi = pi (θ), i = 1, 2, 3 vs H1 : otherwise.
First of all, the dimension of the parameter set is s = 1 since the distributions are determined
by one parameter θ. To find the MLE θ∗ we have to maximize the likelihood function
$$p_1(\theta)^{N_1} p_2(\theta)^{N_2} p_3(\theta)^{N_3} = \theta^{2N_1} \big(2\theta(1 - \theta)\big)^{N_2} (1 - \theta)^{2N_3}.$$
Setting the derivative of its logarithm equal to 0, we get
$$\theta^* = \frac{2N_1 + N_2}{2n}.$$
Therefore, under the null hypothesis H0 the statistic
$$Q^* = \sum_{i=1}^{3} \frac{(N_i - np_i(\theta^*))^2}{np_i(\theta^*)}$$
converges in distribution to the χ²_{k−s−1} = χ²_{3−1−1} = χ²_1 distribution, and we reject H0 at level α if Q∗ exceeds the corresponding critical value of the χ²_1 distribution.
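As an illustration, here is a minimal sketch of this computation in Python (assuming SciPy); the genotype counts N1, N2, N3 below are hypothetical.

```python
# Chi-squared test of the genotype model of Example 6.6.4.
import numpy as np
from scipy import stats

N = np.array([380, 470, 150])                 # hypothetical genotype counts N1, N2, N3
n = N.sum()
theta = (2 * N[0] + N[1]) / (2 * n)           # MLE of theta under the model

p = np.array([theta**2, 2*theta*(1-theta), (1-theta)**2])
Q = ((N - n * p) ** 2 / (n * p)).sum()
p_value = stats.chi2.sf(Q, df=3 - 1 - 1)      # k - s - 1 = 1 degree of freedom
print(theta, Q, p_value)
```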
In the case when the distributions Pθ are continuous or, more generally, take an infinite number of values that must be grouped in order to use the chi-squared test (for example, the normal or Poisson distribution), it can be a difficult numerical problem to maximize the "grouped" likelihood function
$$\mathbb{P}_\theta(I_1)^{N_1} \cdots \mathbb{P}_\theta(I_k)^{N_k}.$$
It is tempting to use the usual non-grouped MLE θ̂ of θ instead of the above θ∗ because it is often easier to compute; in fact, for many distributions we know explicit formulas for these MLEs. However, if we use θ̂ in the statistic
$$\hat{Q} = \sum_{i=1}^{k} \frac{(N_i - np_i(\hat\theta))^2}{np_i(\hat\theta)},$$
then it no longer converges to the χ²_{k−s−1} distribution. It has been shown that typically this Q̂ converges to a distribution "in between" χ²_{k−s−1} and χ²_{k−1}.¹ Thus, a conservative decision rule is to reject H0 if Q̂ > c, where c is chosen such that P(χ²_{k−1} > c) = α.

¹ Chernoff, Herman; Lehmann, E. L. (1954). The use of maximum likelihood estimates in χ² tests for goodness of fit. Ann. Math. Statistics 25, pp. 579-586.
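As an illustration of this conservative rule, the following Python sketch (assuming SciPy) tests a Poisson null hypothesis using the ungrouped MLE; the data and the grouping into k = 4 classes are hypothetical.

```python
# Conservative chi-squared goodness-of-fit test for a Poisson family,
# using the ungrouped MLE of the mean.
import numpy as np
from scipy import stats

x = np.array([0, 1, 1, 2, 0, 3, 1, 2, 4, 1, 0, 2, 1, 3, 2, 1, 0, 2, 1, 2])
n = len(x)
lam = x.mean()                                # ungrouped MLE of the Poisson mean

# Group the values into k = 4 classes: {0}, {1}, {2}, {3, 4, ...}.
N = np.array([(x == 0).sum(), (x == 1).sum(), (x == 2).sum(), (x >= 3).sum()])
p = np.array([stats.poisson.pmf(0, lam),
              stats.poisson.pmf(1, lam),
              stats.poisson.pmf(2, lam),
              stats.poisson.sf(2, lam)])      # P(X >= 3)

Q_hat = ((N - n * p) ** 2 / (n * p)).sum()
c = stats.chi2.ppf(0.95, df=len(N) - 1)       # conservative threshold based on chi2_{k-1}
print(Q_hat, c, Q_hat > c)                    # reject H0 at level 0.05 if Q_hat > c
```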
Example 6.6.5 (Testing Whether a Distribution Is Normal). Consider now a problem in which
a random sample X1 , ..., Xn is taken from some continuous distribution for which the p.d.f.
is unknown, and it is desired to test the null hypothesis H0 that this distribution is a normal
distribution against the alternative hypothesis H1 that the distribution is not normal. To perform a χ² test of goodness-of-fit in this problem, divide the real line into k subintervals and count the number Ni of observations in the random sample that fall into the ith subinterval (i = 1, . . . , k).
If H0 is true, and if µ and σ² denote the unknown mean and variance of the normal distribution, then the parameter vector is the two-dimensional vector θ = (µ, σ²). The probability πi(θ), or πi(µ, σ²), that an observation will fall within the ith subinterval is the probability assigned to that subinterval by the normal distribution with mean µ and variance σ². In other words, if the ith subinterval is the interval from ai to bi, then
$$\pi_i(\mu, \sigma^2) = \Phi\Big(\frac{b_i - \mu}{\sigma}\Big) - \Phi\Big(\frac{a_i - \mu}{\sigma}\Big).$$
It is important to note that in order to calculate the value of the statistic Q∗, the M.L.E.'s µ∗ and σ²∗ must be found by using the numbers N1, . . . , Nk of observations in the different subintervals. The M.L.E.'s should not be found by using the observed values of X1, . . . , Xn themselves. In other words, µ∗ and σ²∗ will be the values of µ and σ² that maximize the likelihood function
$$\pi_1(\mu, \sigma^2)^{N_1} \cdots \pi_k(\mu, \sigma^2)^{N_k}.$$
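As an illustration of the grouped-likelihood approach, here is a minimal sketch in Python (assuming SciPy); the cut points and interval counts are hypothetical, and σ is parameterized through its logarithm only for numerical convenience.

```python
# Goodness-of-fit test for normality with parameters estimated from the
# grouped counts, as in Example 6.6.5.
import numpy as np
from scipy import stats, optimize

cuts = np.array([3.0, 3.5, 4.0, 4.5])          # interior cut points of k = 5 intervals
N = np.array([4, 6, 7, 5, 3])                  # hypothetical counts per interval
n = N.sum()

def neg_log_grouped_likelihood(params):
    mu, log_sigma = params
    sigma = np.exp(log_sigma)                  # keep sigma positive
    cdf = stats.norm.cdf(cuts, loc=mu, scale=sigma)
    probs = np.diff(np.concatenate(([0.0], cdf, [1.0])))   # pi_i(mu, sigma^2)
    return -np.sum(N * np.log(probs))

res = optimize.minimize(neg_log_grouped_likelihood, x0=[3.5, np.log(0.5)])
mu_star, sigma_star = res.x[0], np.exp(res.x[1])

cdf = stats.norm.cdf(cuts, loc=mu_star, scale=sigma_star)
probs = np.diff(np.concatenate(([0.0], cdf, [1.0])))
Q = ((N - n * probs) ** 2 / (n * probs)).sum()
p_value = stats.chi2.sf(Q, df=len(N) - 2 - 1)  # k - s - 1 with s = 2 parameters
print(mu_star, sigma_star**2, Q, p_value)
```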
6.6.2 Tests of independence

Suppose now that each observation in an i.i.d. sample is classified according to two features, so that the set of possible outcomes is
$$\mathcal{X} = \{(i, j) : i = 1, \ldots, a,\ j = 1, \ldots, b\},$$
where the first coordinate represents the first feature that belongs to one of a categories and
the second coordinate represents the second feature that belongs to one of b categories. An
i.i.d. sample X1, . . . , Xn can be represented by the contingency table below, where Nij is the number of observations in cell (i, j).
             Feature 2
Feature 1    1      2      · · ·   b
    1        N11    N12    · · ·   N1b
    2        N21    N22    · · ·   N2b
    ⋮        ⋮      ⋮              ⋮
    a        Na1    Na2    · · ·   Nab
We would like to test the independence of the two features. Denote
$$\theta_{ij} = \mathbb{P}[X = (i, j)], \qquad p_i = \mathbb{P}[X^1 = i], \qquad q_j = \mathbb{P}[X^2 = j],$$
where X = (X¹, X²). Then we want to test
$$H_0 : \theta_{ij} = p_i q_j \ \text{for all } i, j \quad \text{vs} \quad H_1 : \text{otherwise}.$$
We can see that this null hypothesis H0 is a special case of the composite hypotheses of the previous subsection, so it can be tested using the chi-squared goodness-of-fit test. The total number of groups is k = a × b. Since the pi's and the qj's each add up to one, one parameter in each sequence, for example pa and qb, can be computed in terms of the other probabilities, and we can take (p1, . . . , pa−1) and (q1, . . . , qb−1) as the free parameters of the model. This means that the dimension of the parameter set is
$$s = (a - 1) + (b - 1).$$
Therefore, if we find the maximum likelihood estimates for the parameters of this model, then the chi-squared statistic satisfies
$$Q = \sum_{ij} \frac{(N_{ij} - n p_i^* q_j^*)^2}{n p_i^* q_j^*} \xrightarrow{w} \chi^2_{k-s-1} = \chi^2_{(a-1)(b-1)}.$$
To formulate the test it remains to find the maximum likelihood estimates of the parameters.
We need to maximize the likelihood function
$$\prod_{ij} (p_i q_j)^{N_{ij}} = \prod_{i} p_i^{N_{i+}} \prod_{j} q_j^{N_{+j}},$$
where Ni+ = Σj Nij and N+j = Σi Nij. Since the pi's and the qj's are not related to each other, maximizing the likelihood function above is equivalent to maximizing ∏i pi^{Ni+} and ∏j qj^{N+j} separately. We have
$$\ln \prod_{i} p_i^{N_{i+}} = \sum_{i=1}^{a-1} N_{i+} \ln p_i + N_{a+} \ln(1 - p_1 - \cdots - p_{a-1}),$$
and setting the derivatives with respect to p1, . . . , pa−1 equal to zero gives
$$p_i^* = \frac{N_{i+}}{n}, \quad i = 1, \ldots, a.$$
Similarly, the MLE for qj is
$$q_j^* = \frac{N_{+j}}{n}, \quad j = 1, \ldots, b.$$
n
Therefore, chi-square statistic Q in this case can be written as
2
Ni+ N+j
X Nij − n
Q= Ni+ N+j
.
ij n
We reject H0 if Q > cα,(a−1)(b−1) where the threshold cα,(a−1)(b−1) is determined from the condi-
tion
P[χ2(a−1)(b−1) > cα,(a−1)(b−1) ] = α.
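A minimal sketch of this test in Python (assuming SciPy), for a hypothetical 2 × 3 contingency table of counts, is the following.

```python
# Chi-squared test of independence from an a x b contingency table.
import numpy as np
from scipy import stats

N = np.array([[30, 20, 10],                   # hypothetical contingency table N_{ij}
              [20, 25, 15]])
n = N.sum()
row = N.sum(axis=1, keepdims=True)            # N_{i+}
col = N.sum(axis=0, keepdims=True)            # N_{+j}
expected = row @ col / n                      # n p_i* q_j* = N_{i+} N_{+j} / n

Q = ((N - expected) ** 2 / expected).sum()
df = (N.shape[0] - 1) * (N.shape[1] - 1)
p_value = stats.chi2.sf(Q, df=df)
print(Q, p_value)

# stats.chi2_contingency(N, correction=False) returns the same statistic.
```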
6.6.3 Test of homogeneity

Suppose now that the observations are divided into R groups and each observation falls into one of C categories, and we would like to test whether the distribution over the categories is the same in every group. If we denote
$$p_{ij} = \mathbb{P}(\text{Category } j \mid \text{Group } i),$$
so that for each group i we have
$$\sum_{j=1}^{C} p_{ij} = 1,$$
then the hypothesis of homogeneity states that
$$H_0 : p_{1j} = p_{2j} = \cdots = p_{Rj} \ \text{for all } j = 1, \ldots, C \quad \text{vs} \quad H_1 : \text{otherwise}.$$
If the observations X1, . . . , Xn are sampled independently from the entire population, then homogeneity over groups is the same as independence of groups and categories. Indeed, if we have homogeneity
$$\mathbb{P}(\text{Category } j \mid \text{Group } i) = \mathbb{P}(\text{Category } j),$$
then we have
$$\mathbb{P}(\text{Category } j, \text{Group } i) = \mathbb{P}(\text{Category } j)\,\mathbb{P}(\text{Group } i).$$
This means that to test homogeneity we can use the test of independence above. Denote
$$Q = \sum_{i=1}^{R} \sum_{j=1}^{C} \frac{\big(N_{ij} - \frac{N_{i+} N_{+j}}{n}\big)^2}{\frac{N_{i+} N_{+j}}{n}} \xrightarrow{w} \chi^2_{(C-1)(R-1)}.$$
We reject H0 at the significance level α if Q > cα,(C−1)(R−1), where the threshold cα,(C−1)(R−1) is determined from the condition
$$\mathbb{P}[\chi^2_{(C-1)(R-1)} > c_{\alpha,(C-1)(R-1)}] = \alpha.$$
Example 6.6.6. In this example, 100 people were asked whether the service provided by the fire department in the city was satisfactory. Shortly after the survey, a large fire occurred in the city. Suppose that the same 100 people were then asked whether they thought that the service provided by the fire department was satisfactory. The results are in the following table:
Satisfactory Unsatisfactory
Before fire 80 20
After fire 72 28
Suppose that we would like to test whether the opinions changed after the fire by using a chi-squared test. However, the i.i.d. sample consists of the pairs of opinions of the 100 people, (Xi1, Xi2), i = 1, . . . , 100, where the first coordinate/feature is a person's opinion before the fire and belongs to one of the two categories
{"Satisfactory", "Unsatisfactory"},
and the second coordinate/feature is the same person's opinion after the fire and also belongs to one of the two categories
{"Satisfactory", "Unsatisfactory"}.
So the correct contingency table corresponding to the above data and satisfying the assumptions of the chi-squared test would be the following, with rows indexing the opinion before the fire and columns the opinion after the fire:
Satisfactory Unsatisfactory
Satisfactory 70 10
Unsatisfactory 2 18
In order to use the first contingency table, we would have to poll 100 people after the fire
independently of the 100 people polled before the fire.
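As a sketch, the independence statistic of the previous subsection can be computed for this paired 2 × 2 table as follows (assuming SciPy).

```python
# Chi-squared statistic for the paired 2 x 2 table of Example 6.6.6
# (rows: opinion before the fire, columns: opinion after the fire).
import numpy as np
from scipy import stats

N = np.array([[70, 10],
              [2, 18]])
n = N.sum()
expected = N.sum(axis=1, keepdims=True) @ N.sum(axis=0, keepdims=True) / n
Q = ((N - expected) ** 2 / expected).sum()
p_value = stats.chi2.sf(Q, df=1)              # (2 - 1)(2 - 1) = 1 degree of freedom
print(Q, p_value)
```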
6.7 Exercises
6.7. Let X1 , . . . , Xn be a random sample from a Poisson distribution with mean λ > 0.
1. Show that the likelihood ratio test of H0 : λ = λ0 versus H1 : λ ≠ λ0 is based upon the
statistic Y = X1 + . . . + Xn . Obtain the null distribution of Y .
2. For λ0 = 2 and n = 5, find the significance level of the test that rejects H0 if Y ≤ 4 or
Y ≥ 17.
6.8. Let X1 , . . . , Xn be a random sample from a Bernoulli B(1, θ) distribution, where 0 < θ < 1.
1. Show that the likelihood ratio test of H0 : θ = θ0 versus H1 : θ ≠ θ0 is based upon the
statistic Y = X1 + . . . + Xn . Obtain the null distribution of Y .
2. For n = 100 and θ0 = 1/2, find c1 so that the test that rejects H0 when Y ≤ c1 or Y ≥ c2 = 100 − c1 has approximate significance level α = 0.05.
6.9. Let X1 , . . . , Xn be a random sample from a Γ(α = 3, β = θ) distribution, where 0 < θ < ∞.
1. Show that the likelihood ratio test of H0 : θ = θ0 versus H1 : θ ≠ θ0 is based upon the
statistic Y = X1 + . . . + Xn . Obtain the null distribution of 2Y /θ0 .
2. For θ0 = 3 and n = 5, find c1 and c2 so that the test that rejects H0 when Y ≤ c1 or Y ≥ c2
has significance level 0.05.
6.11. Let X1 , . . . , Xn be a random sample of size 10 from a normal distribution N (0, σ 2 ). Find
a best critical region of size α = 0.05 for testing H0 : σ 2 = 1 against H1 : σ 2 = 2. Is this a best
critical region of size 0.05 for testing H0 : σ² = 1 against H1 : σ² = 4? Against H1 : σ² = σ1² > 1?
6.12. If X1, . . . , Xn is a random sample from a distribution having pdf of the form f(x; θ) = θ x^{θ−1}, 0 < x < 1, zero elsewhere, show that a best critical region for testing H0 : θ = 1 against H1 : θ = 2 is C = {(x1, . . . , xn) : c ≤ x1 x2 · · · xn}.
6.13. Let X1, . . . , Xn denote a random sample from a normal distribution N(θ, 100). Show that C = {(x1, . . . , xn) : x̄ ≥ c} is a best critical region for testing H0 : θ = 75 against H1 : θ = 78.
Find n and c so that
PH0 [(X1 , . . . , Xn ) ∈ C] = PH0 [X̄ ≥ c] = 0.05
and
PH1 [(X1 , . . . , Xn ) ∈ C] = PH1 [X̄ ≥ c] = 0.90,
approximately.
6.14. Let X1, . . . , Xn be iid with pmf f(x; p) = p^x (1 − p)^{1−x}, x = 0, 1, zero elsewhere. Show that C = {(x1, . . . , xn) : Σ xi ≤ c} is a best critical region for testing H0 : p = 1/2 against H1 : p = 1/3. Use the Central Limit Theorem to find n and c so that approximately PH0[Σ Xi ≤ c] = 0.10 and PH1[Σ Xi ≤ c] = 0.80.
6.15. Let X1, . . . , X10 denote a random sample of size 10 from a Poisson distribution with mean λ. Show that the critical region C defined by Σ_{i=1}^{10} xi ≥ 3 is a best critical region for testing H0 : λ = 0.1 against H1 : λ = 0.5. Determine, for this test, the significance level α and the power at λ = 0.5.
6.16. Let X have the pmf f(x; θ) = θ^x (1 − θ)^{1−x}, x = 0, 1, zero elsewhere. We test the simple hypothesis H0 : θ = 1/4 against the alternative composite hypothesis H1 : θ < 1/4 by taking a random sample of size 10 and rejecting H0 : θ = 1/4 if and only if the observed values x1, . . . , x10 of the sample observations are such that Σ_{i=1}^{10} xi ≤ 1. Find the power function γ(θ), 0 < θ ≤ 1/4, of this test.
6.17. (a) The sample mean and standard deviation from a random sample of 10 observations from a normal population were computed as x̄ = 23 and s = 9. Calculate the value of the test statistic of the test required to determine whether there is enough evidence to infer at the 5% significance level that the population mean is greater than 20.
(b) Repeat part (a) with n = 30.
(c) Repeat part (b) with n = 40.
6.18. (a) A statistics practitioner is in the process of testing to determine whether there is
enough evidence to infer that the population mean is different from 180. She calculated the
mean and standard deviation of a sample of 200 observations as x̄ = 175 and s = 22. Calculate the value of the test statistic of the test required to determine whether there is enough
evidence at the 5% significance level.
(b) Repeat part (a) with s = 45.
(c) Repeat part (a) with s = 60.
6.19. A courier service advertises that its average delivery time is less than 6 hours for local
deliveries. A random sample of times for 12 deliveries to an address across town was recorded.
These data are shown here. Is this sufficient evidence to support the courier’s advertisement,
at the 5% level of significance?
3.03, 6.33, 7.98, 4.82, 6.50, 5.22, 3.56, 6.76, 7.96, 4.54, 5.09, 6.46.
6.20. Aircrew escape systems are powered by a solid propellant. The burning rate of this pro-
pellant is an important product characteristic. Specifications require that the mean burning
rate must be 50 centimeters per second. We know that the standard deviation of burning rate
is σ = 2 centimeters per second. The experimenter decides to specify a type I error proba-
bility or significance level of α = 0.05 and selects a random sample of n = 25 and obtains a
sample average burning rate of x̄ = 51.3 centimeters per second. What conclusions should
be drawn?
6.21. The mean water temperature downstream from a power plant cooling tower discharge
pipe should be no more than 100◦ F . Past experience has indicated that the standard deviation
of temperature is 2◦ F . The water temperature is measured on nine randomly chosen days,
and the average temperature is found to be 98◦ F.
(a) Should the water temperature be judged acceptable with α = 0.05?
(b) What is the P -value for this test?
(c) What is the probability of accepting the null hypothesis at α = 0.05 if the water has a true
mean temperature of 104◦ F ?
6.23. Cloud seeding has been studied for many decades as a weather modification procedure.
The rainfall in acre-feet from 20 clouds that were selected at random and seeded with silver
nitrate follows:
18.0, 30.7, 19.8, 27.1, 22.3, 18.8, 31.8, 23.4, 21.2, 27.9,
31.9, 27.1, 25.0, 24.7, 26.9, 21.8, 29.2, 34.8, 26.7, 31.6.
(a) Can you support a claim that mean rainfall from seeded clouds exceeds 25 acre-feet? Use
α = 0.01.
(b) Compute the power of the test if the true mean rainfall is 27 acre-feet.
(c) What sample size would be required to detect a true mean rainfall of 27.5 acre-feet if we
wanted the power of the test to be at least 0.9?
6.24. The life in hours of a battery is known to be approximately normally distributed, with
standard deviation σ = 1.25 hours. A random sample of 10 batteries has a mean life of x = 40.5
hours.
(a) Is there evidence to support the claim that battery life exceeds 40 hours? Use α = 0.05.
(b) What is the P -value for the test in part (a)?
(c) What is the power for the test in part (a) if the true mean life is 42 hours?
(d) What sample size would be required to ensure that the probability of making type II error
does not exceed 0.10 if the true mean life is 44 hours?
(e) Explain how you could answer the question in part (a) by calculating an appropriate con-
fidence bound on life.
6.25. Medical researchers have developed a new artificial heart constructed primarily of ti-
tanium and plastic. The heart will last and operate almost indefinitely once it is implanted
in the patient’s body, but the battery pack needs to be recharged about every four hours. A
random sample of 50 battery packs is selected and subjected to a life test. The average life of
these batteries is 4.05 hours. Assume that battery life is normally distributed with standard
deviation σ = 0.2 hour.
(a) Is there evidence to support the claim that mean battery life exceeds 4 hours? Use α = 0.05.
(b) Compute the power of the test if the true mean battery life is 4.5 hours.
(c) What sample size would be required to detect a true mean battery life of 4.5 hours if we
wanted the power of the test to be at least 0.9?
(d) Explain how the question in part (a) could be answered by constructing a one-sided con-
fidence bound on the mean life.
6.26. After many years of teaching, a statistics professor computed the variance of the marks
on her final exam and found it to be σ 2 = 250. She recently made changes to the way in
which the final exam is marked and wondered whether this would result in a reduction in the
variance. A random sample of this year’s final exam marks are listed here. Can the professor
infer at the 10% significance level that the variance has decreased?
57 92 99 73 62 64 75 70 88 60.
6.27. With gasoline prices increasing, drivers are more concerned with their cars’ gasoline
consumption. For the past 5 years, a driver has tracked the gas mileage of his car and found
that the variance from fill-up to fill-up was σ 2 = 23 mpg2 . Now that his car is 5 years old, he
would like to know whether the variability of gas mileage has changed. He recorded the gas
mileage from his last eight fill-ups; these are listed here. Conduct a test at a 10% significance
level to infer whether the variability has changed.
28 25 29 25 32 36 27 24.
Tests on proportion
6.28. (a) Calculate the P -value of the test of the following hypotheses given that p̂ = 0.63 and
n = 100:
H0 : p = 0.6 vs H1 : p > 0.6.
(b) Repeat part (a) with n = 200.
(c) Repeat part (a) with n = 400.
(d) Describe the effect on P -value of increasing sample size.
6.29. Has the recent drop in airplane passengers resulted in better on-time performance?
Before the recent economic downturn, one airline bragged that 92% of its flights were on time.
A random sample of 165 flights completed this year reveals that 153 were on time. Can we
conclude at the 5% significance level that the airline’s on-time performance has improved?
6.30. In a random sample of 85 automobile engine crankshaft bearings, 10 have a surface
finish roughness that exceeds the specifications. Does this data present strong evidence that
the proportion of crankshaft bearings exhibiting excess surface roughness exceeds 0.10? State
and test the appropriate hypotheses using α = 0.05.
6.31. A study reported in Fortune claimed that nearly one-half of all engineers continue academic studies beyond the B.S. degree, ultimately receiving either an M.S. or a Ph.D. degree. Data from an
article in Engineering Horizons (Spring 1990) indicated that 117 of 484 new engineering grad-
uates were planning graduate study.
(a) Are the data from Engineering Horizons consistent with the claim reported by Fortune?
Use α = 0.05 in reaching your conclusions.
(b) Find the P -value for this test.
(c) Discuss how you could have answered the question in part (a) by constructing a two-sided
confidence interval on p.
6.32. A researcher claims that at least 10% of all football helmets have manufacturing flaws
that could potentially cause injury to the wearer. A sample of 200 helmets revealed that 16
helmets contained such defects.
(a) Does this finding support the researcher’s claim? Use α = 0.01.
(b) Find the P -value for this test.
6.33. In random samples of size 12 from each of two normal populations, we found the following statistics: x̄1 = 74, s1 = 18 and x̄2 = 71, s2 = 16.
(a) Test with α = 0.05 to determine whether we can infer that the population means differ.
(b) Repeat part (a) increasing the standard deviation to s1 = 210 and s2 = 198.
(c) Describe what happens when the sample standard deviations get larger.
(d) Repeat part (a) with sample size 150.
(e) Discuss the effects of increasing the sample size.
6.34. Random sampling from two normal populations produced the following results
(a) Can we infer at the 5% significance level that µ1 is greater than µ2?
(b) Repeat part (a) decreasing the standard deviation to s1 = 31, s2 = 16.
(c) Describe what happens when the sample standard deviations get smaller.
(d) Repeat part (a) with samples of size 20.
(e) Discuss the effects of decreasing the sample size.
(f ) Repeat part (a) changing the mean of sample 1 to x1 = 409.
6.35. Two machines are used for filling plastic bottles with a net volume of 16.0 ounces. The
fill volume can be assumed normal, with standard deviation σ1 = 0.020 and σ2 = 0.025 ounces.
A member of the quality engineering staff suspects that both machines fill to the same mean
net volume, whether or not this volume is 16.0 ounces. A random sample of 10 bottles is taken
from the output of each machine.
Machine 1 Machine 2
16.03 16.01 16.02 16.03
16.04 15.96 15.97 16.04
16.05 15.98 15.96 16.02
16.05 16.02 16.01 16.01
16.02 15.99 15.99 16.00
(e) Assuming equal sample sizes, what sample size should be used to assure that the probabil-
ity of making type II error is 0.05 if the true difference in means is 0.04? Assume that α = 0.05.
6.36. Every month a clothing store conducts an inventory and calculates losses from theft.
The store would like to reduce these losses and is considering two methods. The first is to
hire a security guard, and the second is to install cameras. To help decide which method to
choose, the manager hired a security guard for 6 months. During the next 6-month period, the
store installed cameras. The monthly losses were recorded and are listed here. The manager
decided that because the cameras were cheaper than the guard, he would install the cameras
unless there was enough evidence to infer that the guard was better. What should the manager
do?
Security guard 355 284 401 398 477 254
Cameras 486 303 270 386 411 435
Paired t-test
6.37. Many people use scanners to read documents and store them in a Word (or some other
software) file. To help determine which brand of scanner to buy, a student conducts an exper-
iment in which eight documents are scanned by each of the two scanners he is interested in.
He records the number of errors made by each. These data are listed here. Can he infer that
brand A (the more expensive scanner) is better than brand B?
Document 1 2 3 4 5 6 7 8
BrandA 17 29 18 14 21 25 22 29
BrandB 21 38 15 19 22 30 31 37
6.38. In an effort to determine whether a new type of fertilizer is more effective than the type
currently in use, researchers took 12 two-acre plots of land scattered throughout the county.
Each plot was divided into two equal-sized subplots, one of which was treated with the cur-
rent fertilizer and the other with the new fertilizer. Wheat was planted, and the crop yields
were measured.
Plot 1 2 3 4 5 6 7 8 9 10 11 12
Current fertilizer 56 45 68 72 61 69 57 55 60 72 75 66
New fertilizer 60 49 66 73 59 67 61 60 58 75 72 68
(a) Can we conclude at the 5% significance level that the new fertilizer is more effective than
the current one?
(b) Estimate with 95% confidence the difference in mean crop yields between the two fertiliz-
ers.
(c) What is the required condition(s) for the validity of the results obtained in parts (a) and
(b)?
6.39. Random samples from two normal populations produced the following statistics
(a) Can we infer at the 10% significance level that the two population variances differ?
(b) Repeat part (a) changing the sample sizes to n1 = 15 and n2 = 15.
(c) Describe what happens to the test statistics and the conclusion when the sample sizes
decrease.
6.40. A statistics professor hypothesized that not only the means but also the variances would differ if the business statistics course was taught in two different ways but had the
same final exam. He organized an experiment wherein one section of the course was taught
using detailed PowerPoint slides whereas the other required students to read the book and
answer questions in class discussions. A sample of the marks was recorded and listed next.
Can we infer that the variances of the marks differ between the two sections?
Class 1 64 85 80 64 48 62 75 77 50 81 90
Class 2 73 78 66 69 79 81 74 59 83 79 84
6.41. An operations manager who supervises an assembly line has been experiencing prob-
lems with the sequencing of jobs. The problem is that bottlenecks are occurring because
of the inconsistency of sequential operations. He decides to conduct an experiment wherein
two different methods are used to complete the same task. He measures the times (in sec-
onds). The data are listed here. Can he infer that the second method is more consistent than
the first method?
Method 1 8.8 9.6 8.4 9.0 8.3 9.2 9.0 8.7 8.5 9.4
Method 2 9.2 9.4 8.9 9.6 9.7 8.4 8.8 8.9 9.0 9.7
6.42. Random samples from two binomial populations yielded the following statistics:
(a) Calculate the P -value of a test to determine whether we can infer that the population pro-
portions differ.
(b) Repeat part (a) increasing the sample sizes to 400.
(c) Describe what happens to the p-value when the sample sizes increase.
6.43. Random samples from two binomial populations yielded the following statistics:
(a) Calculate the P -value of a test to determine whether there is evidence to infer that the population proportions differ.
(b) Repeat part (a) with p̂1 = 0.95 and p̂2 = 0.90.
(c) Describe the effect on the P -value of increasing the sample proportions.
(d) Repeat part (a) with p̂1 = 0.10 and p̂2 = 0.05.
(e) Describe the effect on the P -value of decreasing the sample proportions.
6.44. Many stores sell extended warranties for products they sell. These are very lucrative for
store owners. To learn more about who buys these warranties, a random sample was drawn
of a store’s customers who recently purchased a product for which an extended warranty was
available. Among other variables, each respondent reported whether he or she paid the
regular price or a sale price and whether he or she purchased an extended warranty.
Regular Price Sale Price
Sample size 229 178
Number who bought extended warranty 47 25
Can we conclude at the 10% significance level that those who paid the regular price are more
likely to buy an extended warranty?
6.45. Surveys have been widely used by politicians around the world as a way of monitoring
the opinions of the electorate. Six months ago, a survey was undertaken to determine the de-
gree of support for a national party leader. Of a sample of 1100, 56% indicated that they would
vote for this politician. This month, another survey of 800 voters revealed that 46% now sup-
port the leader.
(a) At the 5% significance level, can we infer that the national leader’s popularity has de-
creased?
(b) At the 5% significance level, can we infer that the national leader’s popularity has de-
creased by more than 5%?
6.46. A random sample of 500 adult residents of Maricopa County found that 385 were in
favour of increasing the highway speed limit to 75 mph, while another sample of 400 adult
residents of Pima County found that 267 were in favour of the increased speed limit. Do these
data indicate that there is a difference in the support for increasing the speed limit between
the residents of the two counties? Use α = 0.05. What is the P -value for this test?
6.47. Two different types of injection-molding machines are used to form plastic parts. A part
is considered defective if it has excessive shrinkage or is discolored. Two random samples,
each of size 300, are selected, and 15 defective parts are found in the sample from machine 1
while 8 defective parts are found in the sample from machine 2. Is it reasonable to conclude
that both machines produce the same fraction of defective parts, using α = 0.05? Find the
P -value for this test.
6.48. A new casino game involves rolling 3 dice. The winnings are directly proportional to the
total number of sixes rolled. Suppose a gambler plays the game 100 times, with the following
observed counts:
Number of Sixes 0 1 2 3
Number of Rolls 48 35 15 3
The casino becomes suspicious of the gambler and wishes to determine whether the dice are
fair. What do they conclude?
6.49. Suppose that the distribution of the heights of men who reside in a certain large city is
the normal distribution for which the mean is 68 inches and the standard deviation is 1 inch.
Suppose also that when the heights of 500 men who reside in a certain neighbourhood of the
city were measured, the distribution in the following table was obtained. Test the hypothesis
that, with regard to height, these 500 men form a random sample from all the men who reside
in the city.
6.50. The 50 values in the following table are intended to be a random sample from the stan-
dard normal distribution.
-1.28 -1.22 -0.32 -0.80 -1.38 -1.26 2.33 -0.34 -1.14 0.64
0.41 -0.01 -0.49 0.36 1.05 0.04 0.35 2.82 0.64 0.56
-0.45 -1.66 0.49 -1.96 3.44 0.67 -1.24 0.76 -0.46 -0.11
-0.35 1.39 -0.14 -0.64 -1.67 -1.13 -0.04 0.61 -0.63 0.13
0.72 0.38 -0.85 -1.32 0.85 -0.41 -0.11 -2.04 -1.61 -1.81
a) Carry out a χ2 test of goodness-of-fit by dividing the real line into five intervals, each of
which has probability 0.2 under the standard normal distribution.
b) Carry out a χ2 test of goodness-of-fit by dividing the real line into ten intervals, each of
which has probability 0.1 under the standard normal distribution.
Chapter 7
Regression
7.1 Simple linear regression

7.1.1 Simple linear regression model

Consider the simple linear regression model
$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i, \quad i = 1, \ldots, n,$$
where the noise variables εi are i.i.d. N(0, σ²), so that the likelihood function of the sample is
$$L(\beta_0, \beta_1, \sigma^2) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\Big(-\frac{(Y_i - \beta_0 - \beta_1 X_i)^2}{2\sigma^2}\Big).$$
Let us find the m.l.e. β̂0, β̂1, σ̂² that maximize this likelihood function L. First of all, it is clear that (β̂0, β̂1) also minimizes
$$L^*(\beta_0, \beta_1) := \sum_{i=1}^{n} (Y_i - \beta_0 - \beta_1 X_i)^2.$$
$$\hat\beta_0 = \bar{Y} - \hat\beta_1 \bar{X}, \qquad \hat\beta_1 = \frac{\overline{XY} - \bar{X}\,\bar{Y}}{\overline{X^2} - \bar{X}^2}, \qquad \hat\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (Y_i - \hat\beta_0 - \hat\beta_1 X_i)^2.$$
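A minimal sketch of these estimators in Python, on hypothetical data (X, Y) and assuming NumPy, is the following.

```python
# Maximum likelihood estimates for the simple linear regression model.
import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])  # hypothetical predictor values
Y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.9])  # hypothetical responses
n = len(X)

beta1 = (np.mean(X * Y) - X.mean() * Y.mean()) / (np.mean(X**2) - X.mean()**2)
beta0 = Y.mean() - beta1 * X.mean()
residuals = Y - beta0 - beta1 * X
sigma2 = np.mean(residuals**2)                # note: divides by n, not n - 2
print(beta0, beta1, sigma2)
```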
Another useful quantity is the coefficient of determination
$$R^2 = 1 - \frac{\sum_{i=1}^{n} (Y_i - \hat\beta_0 - \hat\beta_1 X_i)^2}{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}.$$
The numerator in the last ratio is the sum of squares of the residuals and the denominator measures the total variability of Y, so R² is usually interpreted as the proportion of variability in the data explained by the linear model. The higher R² is, the better our model explains the data. Next, we would like to do statistical inference about the linear model.
1. Construct confidence intervals for parameters of the model β0 , β1 and σ 2 .
2. Construct prediction intervals for Y given any point X.
3. Test hypotheses about parameters of the model.
The distributions of β̂0, β̂1 and σ̂² are described by the following result.
Proposition 7.1.1.
1. The vector (β̂0, β̂1) has a normal distribution with mean (β0, β1) and covariance matrix
$$\Sigma = \frac{\sigma^2}{n\sigma_x^2} \begin{pmatrix} \overline{X^2} & -\bar{X} \\ -\bar{X} & 1 \end{pmatrix}, \quad \text{where } \sigma_x^2 = \overline{X^2} - \bar{X}^2.$$
2. nσ̂²/σ² has the χ²_{n−2} distribution and is independent of (β̂0, β̂1).

7.1.2 Confidence interval for σ²

Since nσ̂²/σ² has the χ²_{n−2} distribution, choose c1−α/2,n−2 and cα/2,n−2 such that
$$\mathbb{P}[\chi^2_{n-2} > c_{1-\alpha/2,n-2}] = 1 - \frac{\alpha}{2}, \qquad \mathbb{P}[\chi^2_{n-2} > c_{\alpha/2,n-2}] = \frac{\alpha}{2};$$
then
$$\mathbb{P}\Big[\frac{n\hat\sigma^2}{c_{\alpha/2,n-2}} \le \sigma^2 \le \frac{n\hat\sigma^2}{c_{1-\alpha/2,n-2}}\Big] = 1 - \alpha.
$$
Therefore the (1 − α) confidence interval for σ² is
$$\frac{n\hat\sigma^2}{c_{\alpha/2,n-2}} \le \sigma^2 \le \frac{n\hat\sigma^2}{c_{1-\alpha/2,n-2}}.$$
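A self-contained sketch of this confidence interval in Python (assuming SciPy and the same hypothetical data as above) is the following.

```python
# (1 - alpha) confidence interval for sigma^2 in simple linear regression.
import numpy as np
from scipy import stats

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])  # hypothetical data
Y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.9])
n = len(X)
b1 = (np.mean(X * Y) - X.mean() * Y.mean()) / (np.mean(X**2) - X.mean()**2)
b0 = Y.mean() - b1 * X.mean()
sigma2 = np.mean((Y - b0 - b1 * X) ** 2)      # hat sigma^2 (divides by n)

alpha = 0.05
c_upper = stats.chi2.ppf(1 - alpha / 2, df=n - 2)   # c_{alpha/2, n-2} in the text
c_lower = stats.chi2.ppf(alpha / 2, df=n - 2)       # c_{1-alpha/2, n-2} in the text
print(n * sigma2 / c_upper, n * sigma2 / c_lower)   # lower and upper endpoints
```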
Similarly, by Proposition 7.1.1, the standardized version of β̂1 (and of β̂0), centered at the true parameter and divided by its estimated standard error, has a tn−2 distribution with n − 2 degrees of freedom. Choosing tα/2,n−2 such that P[|tn−2| ≤ tα/2,n−2] = 1 − α then yields a (1 − α) confidence interval for β1 (respectively β0).
7.1.5 Prediction intervals

Given a new point X, the model states that
$$Y = \beta_0 + \beta_1 X + \varepsilon,$$
and it is natural to take Ŷ = β̂0 + β̂1 X as the prediction of Y. Let us find the distribution of the difference Ŷ − Y.
One can show that
$$\frac{\hat{Y} - Y}{\sqrt{\frac{\hat\sigma^2}{n-2}\Big(n + 1 + \frac{(\bar{X} - X)^2}{\sigma_x^2}\Big)}}$$
has a tn−2 distribution with n − 2 degrees of freedom.
Choosing tα/2,n−2 such that P[|tn−2| < tα/2,n−2] = 1 − α, we obtain the (1 − α) prediction interval for Y:
$$\hat{Y} - t_{\alpha/2,n-2}\sqrt{\frac{\hat\sigma^2}{n-2}\Big(n + 1 + \frac{(\bar{X} - X)^2}{\sigma_x^2}\Big)} \;\le\; Y \;\le\; \hat{Y} + t_{\alpha/2,n-2}\sqrt{\frac{\hat\sigma^2}{n-2}\Big(n + 1 + \frac{(\bar{X} - X)^2}{\sigma_x^2}\Big)}.$$
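A sketch of the prediction interval at a hypothetical new point x_new, under the same assumptions as the previous sketches, is the following.

```python
# (1 - alpha) prediction interval for Y at a new point x_new.
import numpy as np
from scipy import stats

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])  # hypothetical data
Y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.9])
n = len(X)
sx2 = np.mean(X**2) - X.mean()**2             # sigma_x^2
b1 = (np.mean(X * Y) - X.mean() * Y.mean()) / sx2
b0 = Y.mean() - b1 * X.mean()
sigma2 = np.mean((Y - b0 - b1 * X) ** 2)

alpha, x_new = 0.05, 4.5
y_hat = b0 + b1 * x_new
t = stats.t.ppf(1 - alpha / 2, df=n - 2)      # t_{alpha/2, n-2}
half = t * np.sqrt(sigma2 / (n - 2) * (n + 1 + (X.mean() - x_new) ** 2 / sx2))
print(y_hat - half, y_hat + half)
```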
Appendix

Table of the normal distribution Φ(z) = \int_{-\infty}^{z} \frac{e^{-x^2/2}}{\sqrt{2\pi}}\, dx
Table of the Student t distribution. For example, for one degree of freedom, P[T1 < 1.376] = 0.8 and P[|T1| < 1.376] = 0.6.
Table of the χ² distribution: each entry is the value c such that P[χ²_n > c] equals the upper-tail probability heading its column, for n degrees of freedom.
DF: n 0.995 0.975 0.2 0.1 0.05 0.025 0.02 0.01 0.005 0.002 0.001
1 0.00004 0.001 1.642 2.706 3.841 5.024 5.412 6.635 7.879 9.55 10.828
2 0.01 0.0506 3.219 4.605 5.991 7.378 7.824 9.21 10.597 12.429 13.816
3 0.0717 0.216 4.642 6.251 7.815 9.348 9.837 11.345 12.838 14.796 16.266
4 0.207 0.484 5.989 7.779 9.488 11.143 11.668 13.277 14.86 16.924 18.467
5 0.412 0.831 7.289 9.236 11.07 12.833 13.388 15.086 16.75 18.907 20.515
6 0.676 1.237 8.558 10.645 12.592 14.449 15.033 16.812 18.548 20.791 22.458
7 0.989 1.69 9.803 12.017 14.067 16.013 16.622 18.475 20.278 22.601 24.322
8 1.344 2.18 11.03 13.362 15.507 17.535 18.168 20.09 21.955 24.352 26.124
9 1.735 2.7 12.242 14.684 16.919 19.023 19.679 21.666 23.589 26.056 27.877
10 2.156 3.247 13.442 15.987 18.307 20.483 21.161 23.209 25.188 27.722 29.588
11 2.603 3.816 14.631 17.275 19.675 21.92 22.618 24.725 26.757 29.354 31.264
12 3.074 4.404 15.812 18.549 21.026 23.337 24.054 26.217 28.3 30.957 32.909
13 3.565 5.009 16.985 19.812 22.362 24.736 25.472 27.688 29.819 32.535 34.528
14 4.075 5.629 18.151 21.064 23.685 26.119 26.873 29.141 31.319 34.091 36.123
15 4.601 6.262 19.311 22.307 24.996 27.488 28.259 30.578 32.801 35.628 37.697
16 5.142 6.908 20.465 23.542 26.296 28.845 29.633 32 34.267 37.146 39.252
17 5.697 7.564 21.615 24.769 27.587 30.191 30.995 33.409 35.718 38.648 40.79
18 6.265 8.231 22.76 25.989 28.869 31.526 32.346 34.805 37.156 40.136 42.312
19 6.844 8.907 23.9 27.204 30.144 32.852 33.687 36.191 38.582 41.61 43.82
20 7.434 9.591 25.038 28.412 31.41 34.17 35.02 37.566 39.997 43.072 45.315
21 8.034 10.283 26.171 29.615 32.671 35.479 36.343 38.932 41.401 44.522 46.797
22 8.643 10.982 27.301 30.813 33.924 36.781 37.659 40.289 42.796 45.962 48.268
23 9.26 11.689 28.429 32.007 35.172 38.076 38.968 41.638 44.181 47.391 49.728
24 9.886 12.401 29.553 33.196 36.415 39.364 40.27 42.98 45.559 48.812 51.179
25 10.52 13.12 30.675 34.382 37.652 40.646 41.566 44.314 46.928 50.223 52.62
26 11.16 13.844 31.795 35.563 38.885 41.923 42.856 45.642 48.29 51.627 54.052
27 11.808 14.573 32.912 36.741 40.113 43.195 44.14 46.963 49.645 53.023 55.476
28 12.461 15.308 34.027 37.916 41.337 44.461 45.419 48.278 50.993 54.411 56.892
29 13.121 16.047 35.139 39.087 42.557 45.722 46.693 49.588 52.336 55.792 58.301
30 13.787 16.791 36.25 40.256 43.773 46.979 47.962 50.892 53.672 57.167 59.703